(gawk.info.gz) Splitting By Content
4.7 Defining Fields By Content
==============================
(This minor node discusses an advanced feature of `awk'. If you are a
novice `awk' user, you might want to skip it on the first reading.)
Normally, when using `FS', `gawk' defines the fields as the parts of
the record that occur in between each field separator. In other words,
`FS' defines what a field _is not_, instead of what a field _is_.
However, there are times when you really want to define the fields by
what they are, and not by what they are not.
The most notorious such case is so-called "comma separated value"
(CSV) data. Many spreadsheet programs, for example, can export their
data into text files, where each record is terminated with a newline,
and fields are separated by commas. If only commas separated the data,
there wouldn't be an issue. The problem comes when one of the fields
contains an _embedded_ comma. While there is no formal standard
specification for CSV data(1), in such cases, most programs embed the
field in double quotes. So we might have data like this:
Robbins,Arnold,"1234 A Pretty Street, NE",MyTown,MyState,12345-6789,USA
The `FPAT' variable offers a solution for cases like this. The
value of `FPAT' should be a string that provides a regular expression.
This regular expression describes the contents of each field.
In the case of CSV data as presented above, each field is either
"anything that is not a comma," or "a double quote, anything that is
not a double quote, and a closing double quote." If written as a
regular expression constant (see Regexp), we would have
`/([^,]+)|("[^"]+")/'. Writing this as a string requires us to escape
the double quotes, leading to:
     FPAT = "([^,]+)|(\"[^\"]+\")"
Putting this to use, here is a simple program to parse the data:
     BEGIN {
         FPAT = "([^,]+)|(\"[^\"]+\")"
     }

     {
         print "NF = ", NF
         for (i = 1; i <= NF; i++) {
             printf("$%d = <%s>\n", i, $i)
         }
     }
When run, we get the following:
     $ gawk -f simple-csv.awk addresses.csv
     NF = 7
     $1 = <Robbins>
     $2 = <Arnold>
     $3 = <"1234 A Pretty Street, NE">
     $4 = <MyTown>
     $5 = <MyState>
     $6 = <12345-6789>
     $7 = <USA>
Note the embedded comma in the value of `$3'.
A straightforward improvement when processing CSV data of this sort
would be to remove the quotes when they occur, with something like this:
     if (substr($i, 1, 1) == "\"") {
         len = length($i)
         $i = substr($i, 2, len - 2)    # Get text within the two quotes
     }
As with `FS', the `IGNORECASE' variable (see User-modified) affects
field splitting with `FPAT'.
Similar to `FIELDWIDTHS', the value of `PROCINFO["FS"]' will be
`"FPAT"' if content-based field splitting is being used.
NOTE: Some programs export CSV data that contains embedded
newlines between the double quotes. `gawk' provides no way to
deal with this. Since there is no formal specification for CSV
data, there isn't much more to be done; the `FPAT' mechanism
provides an elegant solution for the majority of cases, and the
`gawk' maintainer is satisfied with that.
As written, the regexp used for `FPAT' requires that each field have
at least one character. A straightforward modification (changing the
first `+' to `*') allows fields to be empty:

     FPAT = "([^,]*)|(\"[^\"]+\")"
Finally, the `patsplit()' function makes the same functionality
available for splitting regular strings (see String Functions).
---------- Footnotes ----------
(1) At least, we don't know of one.