
hwechtla-tl: Creating text corpora for tensorflow

(blog entry 25.10.2018) I read Tensorflow's text classification tutorial (https://www.tensorflow.org/tutorials/keras/basic_text_classification). It's pretty straightforward, and I encourage you to take a look at it, too. But it uses a ready-made dataset, and while that is quite okay, creating your own dataset is much easier than you might think.

[...]

Data like this can be zero-padded and read into numpy:

{{{
atehwa@odoran:~/proj/keras-test$ (echo %; cat /usr/share/games/fortunes/literature.u8 ) |
> sed 's/^%$/datumseparator/' |
> egrep -o '[[:alnum:]]+' |
> awk '!trans[$0] {trans[$0]=++idx} {print trans[$0]}' |
> tr \\012 \\011 |
> sed 's/\t1\t/\n/g' |
> cut -f1-248 |
> awk '{ pad=""; for (i = NF; i < 248; i++) pad = pad "\t0"; print $0 pad; }' > fortunes.csv
atehwa@odoran:~/proj/keras-test$ ./myenv/bin/ipython
Python 3.6.6 (default, Sep 12 2018, 18:26:19)
Type 'copyright', 'credits' or 'license' for more information
IPython 7.0.1 -- An enhanced Interactive Python. Type '?' for help.

In [102]: fortune_data = np.genfromtxt('fortunes.csv', delimiter='\t', dtype=int)

In [103]: fortune_data[0]
Out[103]:
array([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14,  4, 15, 16, 17, 18, 19,
       13, 20, 18, 21, 22, 23, 24, 25,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
        0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
        0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
        0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
        0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
        0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
        0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
        0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
        0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
        0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
        0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
        0,  0,  0,  0,  0,  0,  0,  0])
}}}
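(As an aside: if the word-index sequences were already in Python, the zero-padding could be done with Keras's own helper instead of awk. A minimal sketch, with a made-up sequences variable standing in for the pipeline output:)

{{{
from tensorflow import keras

# Made-up stand-in for the word-index lists the pipeline produces.
sequences = [[1, 2, 3], [4, 5]]

# Pad (or truncate) each sequence at the end to 248 columns, like the
# cut/awk steps above; zero is the padding value in both.
padded = keras.preprocessing.sequence.pad_sequences(
    sequences, maxlen=248, padding='post', truncating='post', value=0)

print(padded.shape)  # (2, 248)
}}}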

These can then be fed directly to your Tensorflow models. (I don't have a sensible classification task for this data, so I'll just make up one that tells whether the fortune contains word 2, i.e. "A".)

{{{
In [107]: fortune_labels = np.array([1.0 * (2 in fortune_data[i]) for i in range(len(fortune_data))], dtype=float)

In [108]: fortune_labels
Out[108]:
array([1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       1., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 1., 0., 0., 1., 0., 0., 1., 0., 0., 0., 1., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0.,
       0., 0., 0., 0., 0., 0., 1., 1., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 1., 1.,
       1., 0.])

In [109]: fortune_model = keras.Sequential([
     ...:     keras.layers.Embedding(2760, 16), # 2760 = vocabulary size
     ...:     keras.layers.GlobalAveragePooling1D(),
     ...:     keras.layers.Dense(16, activation=tf.nn.relu),
     ...:     keras.layers.Dense(1, activation=tf.nn.sigmoid)
     ...: ])

In [110]: fortune_model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

In [111]: fortune_model.fit(fortune_data, fortune_labels, epochs=10, validation_split=0.2)
}}}

The predictions are pure crap :)

{{{
In [114]: fortune_model.predict(fortune_data[:10])
Out[114]:
array([[0.0942275 ],
       [0.0907232 ],
       [0.08790585],
       [0.09742033],
       [0.08164175],
       [0.08711455],
       [0.08673615],
       [0.09668393],
       [0.09015133],
       [0.1843614 ]], dtype=float32)
}}}
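Note how the probabilities all hover around 0.1, which is close to the share of positive labels in the data; the model seems to have learned little beyond the base rate. If you want actual yes/no answers out of it, the sigmoid outputs have to be thresholded; a minimal sketch, continuing with the variables from the session above:

{{{
# Sketch: turn sigmoid probabilities into 0/1 class decisions.
probs = fortune_model.predict(fortune_data[:10])
classes = (probs[:, 0] > 0.5).astype(int)  # all zeros for these ten fortunes
}}}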

Note that this is a bag-of-words approach: the GlobalAveragePooling1D layer just takes the average of all the embedding vectors and throws away whatever order or position information the input had. One way to deal with this is to use something more order-aware than pooling on top of the embeddings, but another is to make the input a bag of bigrams (two-word combinations) instead. That way, the input still carries information about which word preceded which.
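In Python terms, the bigram trick just pairs each word with its predecessor and then indexes the pairs exactly like single words; a little sketch with made-up data:

{{{
words = ['A', 'banker', 'is', 'a', 'fellow']

# Pair each word with the one that precedes it.
bigrams = list(zip(words, words[1:]))
# [('A', 'banker'), ('banker', 'is'), ('is', 'a'), ('a', 'fellow')]

# First occurrence of a bigram gets the next free index, repeats reuse it
# (the same logic as the awk trans[] array in the pipelines).
trans = {}
encoded = [trans.setdefault(bg, len(trans) + 1) for bg in bigrams]
# [1, 2, 3, 4]
}}}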

To be recoverable from the index encoding, the datum separator now has to be a two-word sequence, so that the separator-to-separator bigram gets an index of its own. This requires some special attention, but as a bonus, we get bigrams marking a word at the beginning of a fortune and a word at the end.

{{{
atehwa@odoran:~/proj/keras-test$ (echo %; cat /usr/share/games/fortunes/literature.u8 ) |
> sed 's/^%$/datumseparator datumseparator/' |
> egrep -o '[[:alnum:]]+' |
> awk '{ print prev, $0; prev = $0; }' |
> awk '!trans[$0] {trans[$0]=++idx} {print trans[$0], $0}' |
> head  # just to cut output
1 datumseparator
2 datumseparator datumseparator
3 datumseparator A
4 A banker
5 banker is
6 is a
7 a fellow
8 fellow who
9 who lends
10 lends you
atehwa@odoran:~/proj/keras-test$ (echo %; cat /usr/share/games/fortunes/literature.u8 ) |
> sed 's/^%$/datumseparator datumseparator/' |
> egrep -o '[[:alnum:]]+' |
> awk '{ print prev, $0; prev = $0; }' |
> awk '!trans[$0] {trans[$0]=++idx} {print trans[$0]}' |
> tr \\012 \\011 |
> sed 's/\t2\t/\n/g' |
> cut -f1-249 |
> awk '{ pad=""; for (i = NF; i < 249; i++) pad = pad "\t0"; print $0 pad; }' |
> cut -c1-30 | tail  # just to cut output
6492 6728 6729 6730 6731 6732
6492 3108 6733 6734 6735 6736
6758 6759 6760 6761 6762 3763
6775 6776 1697 6777 6778 6779
4196 6794 6795 6796 6797 6798
1864 6813 6814 1603 6815 6816
3 6842 6843 6 6844 6843 6 6844
6223 6892 6893 6894 6895 6896
6968 6969 6970 6971 6972 6973
1864 7087 7312 7313 7314 822 7
}}}

Not surprisingly, there are a great many distinct bigrams: the indices above already run past 7000, against a vocabulary of 2760 single words, so this kind of preprocessing calls for even bigger corpora.
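To get a feel for the blow-up, you can count distinct unigrams against distinct bigrams; a quick sketch along the lines of the pipelines above:

{{{
import re

with open('/usr/share/games/fortunes/literature.u8') as f:
    words = re.findall(r'[0-9A-Za-z]+', f.read())

unigrams = set(words)
bigrams = set(zip(words, words[1:]))

# The bigram vocabulary is typically several times the unigram one.
print(len(unigrams), len(bigrams))
}}}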

* [entry: 2018-10]
* [atehwa]
* [category: diary entry]
* [python]
* [text tools]
* [category: tools]


(last modified 25.10.2018 02:53)