(web diary 25.10.2018) I read TensorFlow's text classification tutorial (https://www.tensorflow.org/tutorials/keras/basic_text_classification). It's pretty straightforward, and I encourage you to take a look at it, too. But it uses a ready-made dataset, and while that is quite okay, it's easier to create your own than you might think.
The TensorFlow examples represent text as word lists, where each word is encoded as an integer. Such integer indices can be generated very easily with Unix tools:
atehwa@odoran:~$ w3m -dump https://en.wikipedia.org/wiki/Main_Page |
> egrep -o '[[:alnum:]]+' |
> awk '!trans[$0] {trans[$0]=++idx} {print trans[$0], $0}' |
> tail   # Just to cut the output
Received a secured cookie
734 Developers
735 Cookie
736 statement
737 Mobile
103 view
522 Wikimedia
523 Foundation
738 Powered
76 by
543 MediaWiki
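If you prefer to stay in Python, the same index assignment is only a few lines. The snippet below is my own rough equivalent of the awk one-liner, not something the tutorial provides; the regular expression roughly mimics egrep -o '[[:alnum:]]+'.

import re

index = {}   # word -> integer; grows as new words appear

def encode(text):
    # Same idea as the awk trick '!trans[$0] {trans[$0]=++idx}':
    # every previously unseen word gets the next free integer.
    words = re.findall(r'[^\W_]+', text)
    return [index.setdefault(w, len(index) + 1) for w in words]

print(encode("A banker is a fellow who lends you his umbrella"))
# -> [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]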
If you want to create many data items from a single file, you just need some handling of the record separator to get the index lists into CSV format. For instance, fortune files separate individual fortunes with "%" characters; let's preprocess to ensure these % characters get converted to the integer 1, and postprocess to break the index lists on this separator:
atehwa@odoran:~$ (echo %; cat /usr/share/games/fortunes/literature.u8 ) |
> sed 's/^%$/datumseparator/' |
> egrep -o '[[:alnum:]]+' |
> awk '!trans[$0] {trans[$0]=++idx} {print trans[$0]}' |
> tr \\012 \\011 |
> sed 's/\t1\t/\n/g' |
> cut -c1-20 | tail   # just to cut the output
283 2561 322 101 256
283 56 1052 2516 214
2571 2053 4 517 690
2574 227 87 828 355
33 936 4 1080 447 12
227 1311 226 4 434 6
2 2591 4 5 2591 4 5
95 51 226 2611 2612
2635 2636 2637 248 9
227 96 5 2741 35 232
Data like this can be zero-padded and read into numpy:
atehwa@odoran:~/proj/keras-test$ (echo %; cat /usr/share/games/fortunes/literature.u8 ) |
> sed 's/^%$/datumseparator/' |
> egrep -o '[[:alnum:]]+' |
> awk '!trans[$0] {trans[$0]=++idx} {print trans[$0]}' |
> tr \\012 \\011 |
> sed 's/\t1\t/\n/g' |
> cut -f1-248 |
> awk '{ pad=""; for (i = NF; i < 248; ++i) pad = pad "\t0"; print $0 pad; }' > fortunes.csv
atehwa@odoran:~/proj/keras-test$ ./myenv/bin/ipython
Python 3.6.6 (default, Sep 12 2018, 18:26:19)
Type 'copyright', 'credits' or 'license' for more information
IPython 7.0.1 -- An enhanced Interactive Python. Type '?' for help.

In [102]: fortune_data = np.genfromtxt('fortunes.csv', delimiter='\t', dtype=int)

In [103]: fortune_data[0]
Out[103]:
array([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14,  4, 15, 16,
       17, 18, 19, 13, 20, 18, 21, 22, 23, 24, 25,  0,  0,  0,  0,  0,  0,
        0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
        0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
        0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
        0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
        0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
        0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
        0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
        0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
        0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
        0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
        0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
        0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
        0,  0,  0,  0,  0,  0,  0,  0,  0,  0])
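By the way, the padding does not have to be done in awk: tf.keras has a pad_sequences helper for this, and if I remember right, the tutorial itself pads its data with it. A minimal sketch, assuming the unpadded index lists are already available as a Python list of lists:

from tensorflow import keras

sequences = [[1, 2, 3], [4, 5, 6, 7, 8]]   # hypothetical unpadded index lists
padded = keras.preprocessing.sequence.pad_sequences(
    sequences, maxlen=248, padding='post', value=0)
print(padded.shape)   # (2, 248)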
These can then be readily fed to your TensorFlow models. (I don't have a sensible classification task for this data, so I'll just make up one that tells whether a fortune contains word 2, i.e. "A".)
In [107]: fortune_labels = np.array([1.0 * (2 in fortune_data[i]) for i in range(len(fortune_data))], dtype=float)

In [108]: fortune_labels
Out[108]:
array([1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 1., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0.,
       0., 1., 0., 0., 1., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0.,
       0., 0., 0., 1., 1., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 1., 1.,
       1., 0.])

In [109]: fortune_model = keras.Sequential([
     ...:     keras.layers.Embedding(2760, 16), # 2760 = vocabulary size
     ...:     keras.layers.GlobalAveragePooling1D(),
     ...:     keras.layers.Dense(16, activation=tf.nn.relu),
     ...:     keras.layers.Dense(1, activation=tf.nn.sigmoid)
     ...: ])

In [110]: fortune_model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

In [111]: fortune_model.fit(fortune_data, fortune_labels, epochs=10, validation_split=0.2)
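(As a side note, the label computation above loops over rows in Python; the same "does this fortune contain word 2" check could also be written in a more numpy-ish way, something like

fortune_labels = (fortune_data == 2).any(axis=1).astype(float)

which produces the same array.)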
The predictions are pure crap :)
In [114]: fortune_model.predict(fortune_data[:10])
Out[114]:
array([[0.0942275 ],
       [0.0907232 ],
       [0.08790585],
       [0.09742033],
       [0.08164175],
       [0.08711455],
       [0.08673615],
       [0.09668393],
       [0.09015133],
       [0.1843614 ]], dtype=float32)
Note that this is a bag-of-words approach: the GlobalAveragePooling1D layer just takes the average of all the embedding vectors and throws away whatever order/position information the input had. One way to deal with this is to process the embedding vectors with something more order-aware, but another approach is to make the input a bag of bigrams (two-word combinations) instead. This way, the input carries information about which word preceded which.
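To see concretely that the averaging discards order, here is a tiny check with a made-up embedding table (the numbers are random; the only point is that any permutation of the same word indices pools to the same vector):

import numpy as np

embeddings = np.random.rand(2760, 16)   # stand-in for the learned embedding table
sentence = [4, 2, 13, 7]                # some word indices
shuffled = [13, 7, 4, 2]                # same words, different order

print(np.allclose(embeddings[sentence].mean(axis=0),
                  embeddings[shuffled].mean(axis=0)))   # True: pooling can't tell them apart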
To recover our datum separator from the index encoding, it now has to be two words. This requires some special attention, but as a bonus, we get bigrams for a word at the beginning of a fortune and for a word at the end.
atehwa@odoran:~/proj/keras-test$ (echo %; cat /usr/share/games/fortunes/literature.u8 ) |
> sed 's/^%$/datumseparator datumseparator/' |
> egrep -o '[[:alnum:]]+' |
> awk '{ print prev, $0; prev = $0; }' |
> awk '!trans[$0] {trans[$0]=++idx} {print trans[$0], $0}' |
> head   # just to cut output
1 datumseparator
2 datumseparator datumseparator
3 datumseparator A
4 A banker
5 banker is
6 is a
7 a fellow
8 fellow who
9 who lends
10 lends you
atehwa@odoran:~/proj/keras-test$ (echo %; cat /usr/share/games/fortunes/literature.u8 ) |
> sed 's/^%$/datumseparator datumseparator/' |
> egrep -o '[[:alnum:]]+' |
> awk '{ print prev, $0; prev = $0; }' |
> awk '!trans[$0] {trans[$0]=++idx} {print trans[$0]}' |
> tr \\012 \\011 |
> sed 's/\t2\t/\n/g' |
> cut -f1-249 |
> awk '{ pad=""; for (i = NF; i < 249; ++i) pad = pad "\t0"; print $0 pad; }' |
> cut -c1-30 | tail   # just to cut output
6492 6728 6729 6730 6731 6732
6492 3108 6733 6734 6735 6736
6758 6759 6760 6761 6762 3763
6775 6776 1697 6777 6778 6779
4196 6794 6795 6796 6797 6798
1864 6813 6814 1603 6815 6816
3 6842 6843 6 6844 6843 6 6844
6223 6892 6893 6894 6895 6896
6968 6969 6970 6971 6972 6973
1864 7087 7312 7313 7314 822 7
Not surprisingly, there are many more distinct bigrams than distinct words, so this preprocessing requires even bigger corpora.
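To get a feel for how much bigger the vocabulary gets, one can simply count distinct words versus distinct adjacent word pairs in the same fortune file (this snippet is just an illustration, not part of the pipeline above):

import re

with open('/usr/share/games/fortunes/literature.u8') as f:
    words = re.findall(r'[^\W_]+', f.read())

unigrams = set(words)
bigrams = set(zip(words, words[1:]))
print(len(unigrams), len(bigrams))   # the bigram vocabulary is considerably larger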