(gawk.info.gz) Regexp Field Splitting
4.5.2 Using Regular Expressions to Separate Fields
--------------------------------------------------
The previous node discussed the use of single characters or simple
strings as the value of `FS'. More generally, the value of `FS' may be
a string containing any regular expression. In this case, each match
in the record for the regular expression separates fields. For
example, the assignment:
FS = ", \t"
makes every area of an input line that consists of a comma followed by a
space and a TAB into a field separator. (`\t' is an "escape sequence"
that stands for a TAB; see Escape Sequences for the complete list
of similar escape sequences.)
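As a minimal sketch of the assignment above, the following shell
function (the name `split_on_comma_space_tab' is hypothetical, chosen
only for this illustration) prints each field of its input on a line of
its own. The sample record is made up; it shows that a comma followed
only by a space is not a separator, because the whole three-character
sequence must match:

```shell
# Hypothetical helper: split each record on comma + space + TAB
# and print the resulting fields one per line.
split_on_comma_space_tab() {
    awk 'BEGIN { FS = ", \t" }
         { for (i = 1; i <= NF; i++) print $i }'
}

# "b, c" stays together: the comma after b is not followed by a TAB.
printf 'a, \tb, c, \td\n' | split_on_comma_space_tab
# prints:
#   a
#   b, c
#   d
```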
For a less trivial example of a regular expression, try using single
spaces to separate fields the way single commas are used. `FS' can be
set to `"[ ]"' (left bracket, space, right bracket). This regular
expression matches a single space and nothing else (see Regexp).
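A short sketch of the effect, assuming a made-up input line with two
adjacent spaces (the function name `count_fields_single_space' is
hypothetical): with `FS = "[ ]"' every single space is a separator, so
two spaces in a row enclose an empty field, whereas the default
splitting treats the run as one separator.

```shell
# Hypothetical helper: count fields when each single space separates.
count_fields_single_space() {
    awk 'BEGIN { FS = "[ ]" } { print NF }'
}

# Fields are "a", "" (between the two spaces), "b", and "c".
echo 'a  b c' | count_fields_single_space    # prints 4

# Default splitting collapses the run of spaces.
echo 'a  b c' | awk '{ print NF }'           # prints 3
```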
There is an important difference between the two cases of `FS = " "'
(a single space) and `FS = "[ \t\n]+"' (a regular expression matching
one or more spaces, TABs, or newlines). For both values of `FS',
fields are separated by "runs" (multiple adjacent occurrences) of
spaces, TABs, and/or newlines. However, when the value of `FS' is
`" "', `awk' first strips leading and trailing whitespace from the
record and then decides where the fields are. For example, the
following pipeline prints `b':
$ echo ' a b c d ' | awk '{ print $2 }'
-| b
However, this pipeline prints `a' (note the extra spaces around each
letter):
$ echo ' a  b  c  d ' | awk 'BEGIN { FS = "[ \t\n]+" }
> { print $2 }'
-| a
In this case, the first field is "null" or empty.
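The same difference shows up in the field count. The sketch below uses
two hypothetical helper functions, `nf_default' and `nf_regexp', and a
made-up record with one leading and one trailing space. With the regular
expression, the separators at both ends of the record delimit empty
fields, which are counted:

```shell
# Hypothetical helpers: report NF under each splitting rule.
nf_default() { awk '{ print NF }'; }
nf_regexp()  { awk 'BEGIN { FS = "[ \t\n]+" } { print NF }'; }

# Default FS: leading and trailing whitespace is stripped first.
echo ' a b ' | nf_default    # prints 2

# Regexp FS: empty fields before "a" and after "b" are counted.
echo ' a b ' | nf_regexp     # prints 4
```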
The stripping of leading and trailing whitespace also comes into
play whenever `$0' is recomputed. For instance, study this pipeline:
$ echo '   a b c d' | awk '{ print; $2 = $2; print }'
-|    a b c d
-| a b c d
The first `print' statement prints the record as it was read, with
leading whitespace intact. The assignment to `$2' rebuilds `$0' by
concatenating `$1' through `$NF' together, separated by the value of
`OFS'. Because the leading whitespace was ignored when finding `$1',
it is not part of the new `$0'. Finally, the last `print' statement
prints the new `$0'.
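The role of `OFS' in the rebuilt record is easier to see when it is set
to something other than a space. This is a minimal sketch; the function
name `rebuild_with_dashes' and the sample input are assumptions made
for illustration:

```shell
# Hypothetical helper: force $0 to be rebuilt, joining fields with "-".
rebuild_with_dashes() {
    awk 'BEGIN { OFS = "-" } { $1 = $1; print }'
}

# Leading whitespace and the run of spaces before "c" both disappear,
# because $0 is reassembled from $1 through $NF using OFS.
echo '   a b   c' | rebuild_with_dashes    # prints a-b-c
```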
There is an additional subtlety to be aware of when using regular
expressions for field splitting. It is not well-specified in the POSIX
standard, or anywhere else, what `^' means when splitting fields. Does
the `^' match only at the beginning of the entire record? Or is each
field separator a new string? It turns out that different `awk'
versions answer this question differently, and you should not rely on
any specific behavior in your programs. (d.c.)
As a point of information, Brian Kernighan's `awk' allows `^' to
match only at the beginning of the record. `gawk' also works this way.
For example:
$ echo 'xxAA xxBxx C' |
> gawk -F '(^x+)|( +)' '{ for (i = 1; i <= NF; i++)
> printf "-->%s<--\n", $i }'
-| --><--
-| -->AA<--
-| -->xxBxx<--
-| -->C<--
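One possible way to avoid depending on how a given `awk' treats `^' in
`FS' is to remove the leading text with `sub' first and then split on a
separator that contains no anchor. The following is only a sketch under
that assumption, not a technique prescribed here; the function name
`strip_then_split' is hypothetical. Note that the result differs from
the example above: because the leading `xx' is deleted rather than
treated as a separator, there is no empty first field.

```shell
# Hypothetical helper: strip a leading run of x, then split on spaces.
strip_then_split() {
    awk 'BEGIN { FS = " +" }
         {
             sub(/^x+/, "")    # modifies $0, so the fields are re-split
             for (i = 1; i <= NF; i++)
                 printf "-->%s<--\n", $i
         }'
}

echo 'xxAA xxBxx C' | strip_then_split
# prints:
#   -->AA<--
#   -->xxBxx<--
#   -->C<--
```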