I read TensorFlow's text classification tutorial (https://www.tensorflow.org/tutorials/keras/basic_text_classification). It's pretty straightforward, and I encourage you to take a look at it, too. But it uses a ready-made dataset, and while that's quite okay, it's easier to create your own than you might think.
The TensorFlow examples encode text as lists of words, where each word is represented by an integer index. Such integer indices are easy to generate with standard Unix tools:
atehwa@odoran:~$ w3m -dump https://en.wikipedia.org/wiki/Main_Page |
> egrep -o '[[:alnum:]]+' |
> awk '!trans[$0] {trans[$0]=++idx} {print trans[$0], $0}' |
> tail # Just to cut the output
Received a secured cookie
734 Developers
735 Cookie
736 statement
737 Mobile
103 view
522 Wikimedia
523 Foundation
738 Powered
76 by
543 MediaWiki
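For comparison, here's roughly the same first-seen indexing in plain Python (just a sketch; the function name is mine, and the [0-9A-Za-z]+ regex only approximates what [[:alnum:]] matches in your locale):

import re

def index_tokens(text):
    # Assign each distinct token an integer, in order of first appearance,
    # mirroring the awk idiom !trans[$0] {trans[$0]=++idx} above.
    trans = {}        # token -> 1-based integer index
    encoded = []
    for token in re.findall(r'[0-9A-Za-z]+', text):
        if token not in trans:
            trans[token] = len(trans) + 1
        encoded.append(trans[token])
    return encoded, trans

# index_tokens("the cat sat on the mat")[0] == [1, 2, 3, 4, 1, 5]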
If you want to turn one file into many data records, you just need some separator handling to get the index lists into CSV (here, tab-separated) form. For instance, fortune files separate individual fortunes with lines containing just "%"; let's preprocess so that this separator becomes the integer 1 (the extra "%" echoed in front makes the separator the first token the awk script sees, so it is guaranteed to get index 1), and postprocess to break the index stream into one list per fortune on that separator:
atehwa@odoran:~$ (echo %; cat /usr/share/games/fortunes/literature.u8 ) |
> sed 's/^%$/datumseparator/' |
> egrep -o '[[:alnum:]]+' |
> awk '!trans[$0] {trans[$0]=++idx} {print trans[$0]}' |
> tr \\012 \\011 |
> sed 's/\t1\t/\n/g' |
> cut -c1-20 | tail # just to cut the output
283 2561 322 101 256
283 56 1052 2516 214
2571 2053 4 517 690
2574 227 87 828 355
33 936 4 1080 447 12
227 1311 226 4 434 6
2 2591 4 5 2591 4 5
95 51 226 2611 2612
2635 2636 2637 248 9
227 96 5 2741 35 232
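The record-splitting half of that pipeline looks like this in Python (again a sketch; it assumes index lists where the separator got index 1, as above, and unlike the sed trick it also drops the stray leading 1 from the very first record):

def split_records(encoded, separator_index=1):
    # Split a flat list of token indices into one list per fortune,
    # cutting at every separator index -- the job the sed step does above.
    records, current = [], []
    for idx in encoded:
        if idx == separator_index:
            if current:
                records.append(current)
            current = []
        else:
            current.append(idx)
    if current:
        records.append(current)
    return records

# split_records([1, 5, 6, 1, 7]) == [[5, 6], [7]]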
Data like this can be zero-padded and read into numpy (the session below assumes numpy, tensorflow and keras have already been imported as np, tf and keras):
atehwa@odoran:~/proj/keras-test$ (echo %; cat /usr/share/games/fortunes/literature.u8 ) |
> sed 's/^%$/datumseparator/' |
> egrep -o '[[:alnum:]]+' |
> awk '!trans[$0] {trans[$0]=++idx} {print trans[$0]}' |
> tr \\012 \\011 |
> sed 's/\t1\t/\n/g' |
> cut -f1-248 |
> awk '{ pad=""; for (i = NF; i < 248; ++i) pad = pad "\t0"; print $0 pad; }' > fortunes.csv
atehwa@odoran:~/proj/keras-test$ ./myenv/bin/ipython
Python 3.6.6 (default, Sep 12 2018, 18:26:19)
Type 'copyright', 'credits' or 'license' for more information
IPython 7.0.1 -- An enhanced Interactive Python. Type '?' for help.
In [102]: fortune_data = np.genfromtxt('fortunes.csv', delimiter='\t', dtype=int)
In [103]: fortune_data[0]
Out[103]:
array([ 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 4, 15, 16,
17, 18, 19, 13, 20, 18, 21, 22, 23, 24, 25, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0])
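If you'd rather do the padding on the Python side, Keras has a helper that does much the same as the cut/awk padding step; a sketch with a tiny made-up records list (in reality you'd use the variable-length index lists from the preprocessing):

from tensorflow import keras

records = [[1, 2, 3], [4, 5], [6]]        # variable-length index lists

# padding='post' puts the zeros at the end, like the awk padding above;
# maxlen=248 matches the cut -f1-248 truncation
padded = keras.preprocessing.sequence.pad_sequences(
    records, maxlen=248, padding='post', truncating='post')

print(padded.shape)    # (3, 248)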
These can then be readily fed to your TensorFlow models. (I don't have a sensible classification task for this data, so I'll just make up a label that tells whether the fortune contains word 2, i.e. "A".)
In [107]: fortune_labels = np.array([1.0 * (2 in fortune_data[i]) for i in range(len(fortune_data))], dtype=float)
In [108]: fortune_labels
Out[108]:
array([1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 0., 0., 0., 0., 0., 0.,
0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0.,
0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0.,
0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
0., 1., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0.,
0., 1., 0., 0., 1., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0.,
0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
1., 0., 0., 0., 0., 0., 0., 0., 1., 1., 0., 0., 0., 1., 0., 0., 0.,
0., 0., 0., 1., 1., 1., 0.])
In [109]: fortune_model = keras.Sequential([
...: keras.layers.Embedding(2760, 16), # 2760 = vocabulary size
...: keras.layers.GlobalAveragePooling1D(),
...: keras.layers.Dense(16, activation=tf.nn.relu),
...: keras.layers.Dense(1, activation=tf.nn.sigmoid)
...: ])
In [110]: fortune_model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
In [111]: fortune_model.fit(fortune_data, fortune_labels, epochs=10, validation_split=0.2)
The predictions are pure crap :)
In [114]: fortune_model.predict(fortune_data[:10])
Out[114]:
array([[0.0942275 ],
[0.0907232 ],
[0.08790585],
[0.09742033],
[0.08164175],
[0.08711455],
[0.08673615],
[0.09668393],
[0.09015133],
[0.1843614 ]], dtype=float32)
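To put a number on the crappiness, you can threshold the sigmoid outputs and compare against the labels; a sketch (since every prediction shown sits well below 0.5, this mostly just measures the share of 0-labels):

preds = (fortune_model.predict(fortune_data) > 0.5).astype(int).ravel()
print((preds == fortune_labels).mean())    # in effect, the accuracy of always predicting "no A"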
Note that this is a bag-of-words approach: the GlobalAveragePooling1D layer just takes the average of all the embedding vectors and throws away whatever order or position information the input had. One way around that is to put something order-aware on top of the embeddings instead of an average, but another is to make it a bag of bigrams (two-word combinations) instead. That way the input at least carries information about which word preceded which.
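In Python terms the bigram trick is just pairing each token with its predecessor before the indexing step; a sketch (the helper name is made up, and the empty first predecessor plays the role the uninitialised prev variable plays in the awk one-liner below):

def bigrams(tokens):
    # Pair each token with the one before it, like awk '{ print prev, $0; prev = $0 }'.
    prev = ''
    out = []
    for token in tokens:
        out.append((prev, token))
        prev = token
    return out

# bigrams(['A', 'banker', 'is']) == [('', 'A'), ('A', 'banker'), ('banker', 'is')]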
Back in the shell pipeline, the datum separator needs some special attention: for it to stay recoverable from the index encoding, it now has to be a bigram itself, i.e. two separator words in a row (it gets index 2, which the sed step below splits on). As a bonus, we also get dedicated bigrams for the first and last word of each fortune.
atehwa@odoran:~/proj/keras-test$ (echo %; cat /usr/share/games/fortunes/literature.u8 ) |
> sed 's/^%$/datumseparator datumseparator/' |
> egrep -o '[[:alnum:]]+' |
> awk '{ print prev, $0; prev = $0; }' |
> awk '!trans[$0] {trans[$0]=++idx} {print trans[$0], $0}' |
> head # just to cut output
1 datumseparator
2 datumseparator datumseparator
3 datumseparator A
4 A banker
5 banker is
6 is a
7 a fellow
8 fellow who
9 who lends
10 lends you
atehwa@odoran:~/proj/keras-test$ (echo %; cat /usr/share/games/fortunes/literature.u8 ) |
> sed 's/^%$/datumseparator datumseparator/' |
> egrep -o '[[:alnum:]]+' |
> awk '{ print prev, $0; prev = $0; }' |
> awk '!trans[$0] {trans[$0]=++idx} {print trans[$0]}' |
> tr \\012 \\011 |
> sed 's/\t2\t/\n/g' |
> cut -f1-249 |
> awk '{ pad=""; for (i = NF; i < 249; ++i) pad = pad "\t0"; print $0 pad; }' |
> cut -c1-30 | tail # just to cut output
6492 6728 6729 6730 6731 6732
6492 3108 6733 6734 6735 6736
6758 6759 6760 6761 6762 3763
6775 6776 1697 6777 6778 6779
4196 6794 6795 6796 6797 6798
1864 6813 6814 1603 6815 6816
3 6842 6843 6 6844 6843 6 6844
6223 6892 6893 6894 6895 6896
6968 6969 6970 6971 6972 6973
1864 7087 7312 7313 7314 822 7
Not surprisingly, there are many more distinct bigrams than distinct words (over 7000 here, against the 2760-word vocabulary above), so this kind of preprocessing calls for even bigger corpora.
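If you want to see the blow-up for yourself, a quick count of distinct words versus distinct adjacent-word pairs in the same file will do (a sketch; the ratio is roughly what the index numbers above already suggest):

import re

text = open('/usr/share/games/fortunes/literature.u8', encoding='utf-8').read()
tokens = re.findall(r'[0-9A-Za-z]+', text)

print(len(set(tokens)))                    # distinct words
print(len(set(zip(tokens, tokens[1:]))))   # distinct bigrams: a few times as many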