hwechtla-tl: Turning fixed width fields into csv

Here are a couple of short utilities to help transform table data that is laid out in fixed width fields into CSV (comma-separated values) format. Say, the input looks like this:

Grendel   50  aggregate
Hip Jokk  150 formulate and think
My            standard deviation
Amaranthe 1   do as you wish
...
and the output should look like this:
Grendel,50,aggregate
Hip Jokk,150,formulate and think
My,,standard deviation
Amaranthe,1,do as you wish
...

Note that you can't just use "any amount of whitespace" as the definition of a separator, because the fields may themselves contain whitespace, and there might be missing fields (denoted, in practice, by runs of spaces).
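
To see the problem concretely, here is what "any amount of whitespace" splitting does to the third input line (a quick illustration in Python):

>>> line = "My            standard deviation"
>>> line.split()
['My', 'standard', 'deviation']

The empty middle field vanishes and the multi-word field is split in two, so the columns can no longer be told apart.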

If you already know the widths of the fields, you can use gawk (other awks don't support FIELDWIDTHS, as far as I know):

--- begin fixed_to_csv.awk ---
#!/usr/bin/gawk -f

BEGIN {
        # The field widths come as the first command-line argument,
        # e.g. "10,4,20"; FIELDWIDTHS wants them space-separated.
        FIELDWIDTHS = gensub(",", " ", "g", ARGV[1]);
        # Remove the argument so gawk won't try to open it as a file.
        delete ARGV[1];
        OFS = ",";
}

{
        # Strip the trailing padding from every field; assigning to
        # a field makes gawk rebuild $0 with OFS between the fields.
        for (i = 1; i <= NF; i++) sub(/[[:space:]]*$/, "", $i);
        print;
}
--- end fixed_to_csv.awk ---

You use this as, for instance, "./fixed_to_csv.awk 10,4,20 table.txt". Here "10,4,20" is the list of field widths; the values given correspond to my example input above. Note that the script does no CSV quoting, so fields that themselves contain a comma will break the output.
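
If you don't have gawk at hand, the same idea is easy to sketch in Python. Here fixed_to_csv.py is a hypothetical counterpart of my own naming, not one of the original tools; since it writes through the csv module, it also quotes fields that contain commas:

--- begin fixed_to_csv.py ---
#!/usr/bin/env python

import sys, csv

def main(args):
        # Field widths are given as the first argument, e.g. "10,4,20".
        widths = [int(w) for w in args[1].split(",")]
        # Turn the widths into slices: "10,4,20" -> [0:10], [10:14], [14:34].
        slices, pos = [], 0
        for w in widths:
                slices.append(slice(pos, pos + w))
                pos += w
        f = open(args[2]) if len(args) > 2 else sys.stdin
        out = csv.writer(sys.stdout)
        for line in f:
                # Slicing is forgiving: a short line just yields short
                # or empty fields, so missing values come out empty.
                out.writerow([line[s].rstrip() for s in slices])

if __name__ == '__main__': main(sys.argv)
--- end fixed_to_csv.py ---

You would invoke it the same way: "./fixed_to_csv.py 10,4,20 table.txt".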

If you don't know the widths of the fields in the data file, you can, of course, count the positions by hand. However, this is tedious and error-prone for large files, especially if a given field is only used a couple of times (and is empty most of the time). To find out which columns the fields occupy, you can use this program, which prints the start and end columns of each run of consecutive non-whitespace characters in a file:

--- begin findcols.py ---
#!/usr/bin/env python

import sys, re

field_re = re.compile(r'\S+')

def fields_of_line(line):
        # Return a (start, end) pair for each run of non-whitespace
        # characters; note that end is exclusive.
        return [(match.start(), match.end())
                        for match in field_re.finditer(line)]

def main(args):
        if len(args) > 1: f = open(args[1], "r")
        else: f = sys.stdin
        for line in f:
                for fld in fields_of_line(line): print("%d-%d" % fld)

if __name__ == '__main__': main(sys.argv)
--- end findcols.py ---

You can use it in many ways: for instance, you can run it on just the first line of a file (you get the field widths by subtracting consecutive starting columns from each other: here 10 - 0 = 10 and 14 - 10 = 4, matching the widths used above):

$ head -1 test.txt | ./findcols.py 
0-7
10-12
14-23
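
Turning that output into an argument for fixed_to_csv.awk is mechanical enough to script. Here is a sketch (widths_from_starts.py is a hypothetical helper, not one of the original tools); the last field gets a deliberately generous width, since gawk just gives it whatever is left of the line:

--- begin widths_from_starts.py ---
#!/usr/bin/env python

import sys

# Read the start columns from findcols.py output ("0-7" -> 0).
starts = [int(line.split("-")[0]) for line in sys.stdin]
# Each width is the distance to the next start column; the last
# field gets a generous width so it reaches the end of the line.
widths = [b - a for a, b in zip(starts, starts[1:])] + [999]
print(",".join(str(w) for w in widths))
--- end widths_from_starts.py ---

With the example above, "head -1 test.txt | ./findcols.py | ./widths_from_starts.py" prints "10,4,999".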

Or you can use findcols.py statistically, to find the most common starting columns in a file:

$ ./findcols.py test.txt | cut -d- -f1 | sort | uniq -c | sort -rn
      4 14
      4 0
      3 10
      2 24
      1 4
      1 28
      1 23
      1 20
      1 17
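
The same tally can also be done without the shell pipeline; here is a rough Python equivalent (count_starts.py is again a hypothetical name):

--- begin count_starts.py ---
#!/usr/bin/env python

import sys
from collections import Counter

# Tally the start columns from findcols.py output and print them
# most common first, mimicking sort | uniq -c | sort -rn.
counts = Counter(int(line.split("-")[0]) for line in sys.stdin)
for col, n in counts.most_common():
        print("%7d %d" % (n, col))
--- end count_starts.py ---

For example: "./findcols.py test.txt | ./count_starts.py".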

Note how the most common start columns (0, 14, 10) are the real field starts; the rare ones come from the later words of fields that contain spaces. Or you can also look at the ending columns. The field values so extracted can also be fed to cut -c; just remember to add 1 to the starting column, because cut columns are 1-based while findcols.py's are 0-based (but not to the ending column: findcols.py prints an exclusive end, cut ranges are inclusive, and the two adjustments cancel out):

$ cut -c15-23 test.txt 
aggregate
formulate
standard 
do as you


