hwechtla-tl: Creating text corpora for tensorflow


(blog 25.10.2018) I read Tensorflow's text classification tutorial (https://www.tensorflow.org/tutorials/keras/basic_text_classification). It's pretty straightforward, and I encourage you to take a look at it, too. But it uses a ready-made dataset, and while that is quite okay, creating your own is easier than you might think.

The Tensorflow examples encode text as word lists, where each word is represented by an integer. Such integer indices can be generated very easily with Unix tools:

atehwa@odoran:~$ w3m -dump https://en.wikipedia.org/wiki/Main_Page |
> egrep -o '[[:alnum:]]+' |
> awk '!trans[$0] {trans[$0]=++idx} {print trans[$0], $0}' |
> tail   # Just to cut the output
Received a secured cookie
734 Developers
735 Cookie
736 statement
737 Mobile
103 view
522 Wikimedia
523 Foundation
738 Powered
76 by
543 MediaWiki
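
For comparison, the same word-to-index mapping can be sketched in Python (just an illustration; the shell pipeline above is what the rest of this entry actually uses, and index_words is a made-up name):

import re

def index_words(text, trans=None):
    # Rough equivalent of the awk one-liner above: give each previously
    # unseen word the next free integer index, starting from 1.
    trans = {} if trans is None else trans
    for word in re.findall(r'[^\W_]+', text):   # roughly egrep -o '[[:alnum:]]+'
        if word not in trans:
            trans[word] = len(trans) + 1
        yield trans[word], word

# list(index_words("the cat saw the dog"))
# -> [(1, 'the'), (2, 'cat'), (3, 'saw'), (1, 'the'), (4, 'dog')]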

If you want to create many data rows from a single file, you just need some separator handling to get the index lists into CSV format (tab-separated, in this case). For instance, fortune files separate individual fortunes with lines containing just "%"; let's preprocess so that this separator becomes the first word seen and thus gets index 1, and postprocess to break the index stream into one list per fortune at that separator:

atehwa@odoran:~$ (echo %; cat /usr/share/games/fortunes/literature.u8 ) |
> sed 's/^%$/datumseparator/' |
> egrep -o '[[:alnum:]]+' |
> awk '!trans[$0] {trans[$0]=++idx} {print trans[$0]}' |
> tr \\012 \\011 |
> sed 's/\t1\t/\n/g' |
> cut -c1-20 | tail   # just to cut the output
283     2561    322     101     256
283     56      1052    2516    214
2571    2053    4       517     690
2574    227     87      828     355
33      936     4       1080    447     12
227     1311    226     4       434     6
2       2591    4       5       2591    4       5
95      51      226     2611    2612
2635    2636    2637    248     9
227     96      5       2741    35      232
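
In Python terms, the same split-on-separator idea looks roughly like this (a sketch with a made-up helper name; index 1 is again reserved for the separator, so the indices should largely agree with the pipeline's output):

import re

def encode_fortunes(path):
    # One list of word indices per fortune; "%" lines separate fortunes.
    trans = {'datumseparator': 1}              # the separator keeps index 1
    fortunes = []
    with open(path, encoding='utf-8') as f:
        for fortune in f.read().split('\n%\n'):
            words = re.findall(r'[^\W_]+', fortune)
            if words:
                fortunes.append([trans.setdefault(w, len(trans) + 1)
                                 for w in words])
    return fortunes

# encode_fortunes('/usr/share/games/fortunes/literature.u8')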

Data like this can be zero-padded and read into numpy:

atehwa@odoran:~/proj/keras-test$ (echo %; cat /usr/share/games/fortunes/literature.u8 ) |
> sed 's/^%$/datumseparator/' |
> egrep -o '[[:alnum:]]+' |
> awk '!trans[$0] {trans[$0]=++idx} {print trans[$0]}' |
> tr \\012 \\011 |
> sed 's/\t1\t/\n/g' |
> cut -f1-248 |
> awk '{ pad=""; for (i = NF; i < 248; ++i) pad = pad "\t0"; print $0 pad; }' > fortunes.csv
atehwa@odoran:~/proj/keras-test$ ./myenv/bin/ipython
Python 3.6.6 (default, Sep 12 2018, 18:26:19) 
Type 'copyright', 'credits' or 'license' for more information
IPython 7.0.1 -- An enhanced Interactive Python. Type '?' for help.
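
(The session assumes numpy, Tensorflow and Keras were imported earlier, roughly like this:)

import numpy as np
import tensorflow as tf
from tensorflow import keras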

In [102]: fortune_data = np.genfromtxt('fortunes.csv', delimiter='\t', dtype=int)                                                               

In [103]: fortune_data[0]                                                                                                                       
Out[103]: 
array([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14,  4, 15, 16,
       17, 18, 19, 13, 20, 18, 21, 22, 23, 24, 25,  0,  0,  0,  0,  0,  0,
        0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
        0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
        0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
        0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
        0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
        0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
        0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
        0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
        0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
        0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
        0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
        0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
        0,  0,  0,  0,  0,  0,  0,  0,  0,  0])

These can then be readily fed to your Tensorflow models. (I don't have a sensible classification task for this data, so I'll just make a label that tells whether the fortune contains word 2, i.e. "A".)

In [107]: fortune_labels = np.array([1.0 * (2 in fortune_data[i]) for i in range(len(fortune_data))], dtype=float)                              

In [108]: fortune_labels                                                                                                                        
Out[108]: 
array([1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 1., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0.,
       0., 1., 0., 0., 1., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       1., 0., 0., 0., 0., 0., 0., 0., 1., 1., 0., 0., 0., 1., 0., 0., 0.,
       0., 0., 0., 1., 1., 1., 0.])

In [109]: fortune_model = keras.Sequential([ 
     ...:     keras.layers.Embedding(2760, 16),  # 2760 = vocabulary size 
     ...:     keras.layers.GlobalAveragePooling1D(), 
     ...:     keras.layers.Dense(16, activation=tf.nn.relu), 
     ...:     keras.layers.Dense(1, activation=tf.nn.sigmoid) 
     ...:     ])                                                                                                                                

In [110]: fortune_model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])                                             

In [111]: fortune_model.fit(fortune_data, fortune_labels, epochs=10, validation_split=0.2)      

The predictions are pure crap :)

In [114]: fortune_model.predict(fortune_data[:10])                                                                                              
Out[114]: 
array([[0.0942275 ],
       [0.0907232 ],
       [0.08790585],
       [0.09742033],
       [0.08164175],
       [0.08711455],
       [0.08673615],
       [0.09668393],
       [0.09015133],
       [0.1843614 ]], dtype=float32)

Note that this is a bag-of-words approach: the GlobalAveragePooling1D layer just takes the average of all the embedding vectors and throws away any information about word order or position that the input had. One way to deal with this is to replace the pooling with a layer that is sensitive to order (a recurrent or convolutional layer, say); another is to make it a bag of bigrams (two-word combinations) instead. This way, the input itself carries information about which word preceded which.
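
To make the order-insensitivity concrete, here is a small sketch with made-up toy indices and a random embedding table (nothing to do with the fortune data): averaging the embedding vectors of two sequences that contain the same words in a different order gives exactly the same pooled vector, while their bigram sets differ.

import numpy as np

rng = np.random.default_rng(0)
embedding = rng.normal(size=(10, 4))     # toy embedding table: 10 words, 4 dims

a = [3, 1, 4, 1, 5]                      # a toy sentence as word indices
b = [1, 5, 3, 1, 4]                      # same words, different order

print(np.allclose(embedding[a].mean(axis=0),
                  embedding[b].mean(axis=0)))   # True: pooling can't tell them apart

bigrams_a = set(zip(a, a[1:]))           # {(3,1), (1,4), (4,1), (1,5)}
bigrams_b = set(zip(b, b[1:]))           # {(1,5), (5,3), (3,1), (1,4)}
print(bigrams_a == bigrams_b)            # False: bigrams keep adjacency information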

For our datum separator to survive the bigram encoding, it now has to be two words. This requires some special attention, but as a bonus, we also get bigrams that mark a word at the beginning of a fortune and a word at the end.

atehwa@odoran:~/proj/keras-test$ (echo %; cat /usr/share/games/fortunes/literature.u8 ) |
> sed 's/^%$/datumseparator datumseparator/' |
> egrep -o '[[:alnum:]]+' |
> awk '{ print prev, $0; prev = $0; }' |
> awk '!trans[$0] {trans[$0]=++idx} {print trans[$0], $0}' |
> head     # just to cut output
1  datumseparator
2 datumseparator datumseparator
3 datumseparator A
4 A banker
5 banker is
6 is a
7 a fellow
8 fellow who
9 who lends
10 lends you
atehwa@odoran:~/proj/keras-test$ (echo %; cat /usr/share/games/fortunes/literature.u8 ) |
> sed 's/^%$/datumseparator datumseparator/' |
> egrep -o '[[:alnum:]]+' |
> awk '{ print prev, $0; prev = $0; }' |
> awk '!trans[$0] {trans[$0]=++idx} {print trans[$0]}' |
> tr \\012 \\011 |
> sed 's/\t2\t/\n/g' |
> cut -f1-249 |
> awk '{ pad=""; for (i = NF; i < 249; ++i) pad = pad "\t0"; print $0 pad; }' |
> cut -c1-30 | tail    # just to cut output
6492    6728    6729    6730    6731    6732    
6492    3108    6733    6734    6735    6736    
6758    6759    6760    6761    6762    3763    
6775    6776    1697    6777    6778    6779    
4196    6794    6795    6796    6797    6798    
1864    6813    6814    1603    6815    6816    
3       6842    6843    6       6844    6843    6       6844
6223    6892    6893    6894    6895    6896    
6968    6969    6970    6971    6972    6973    
1864    7087    7312    7313    7314    822     7

Not surprisingly, there are far more distinct bigrams than distinct words: the word indices above topped out around 2,760, while the bigram indices already run past 7,300. So this preprocessing requires even bigger corpora to work well.
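
A quick way to check the two vocabulary sizes is to count distinct words against distinct word pairs over the raw file (a sketch; it ignores the datum separators, so the counts will not match the pipeline exactly):

import re

text = open('/usr/share/games/fortunes/literature.u8', encoding='utf-8').read()
words = re.findall(r'[^\W_]+', text)
bigrams = set(zip(words, words[1:]))
print(len(set(words)), 'distinct words;', len(bigrams), 'distinct bigrams')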

