(gawk.info.gz) Regexp Field Splitting

 
 4.5.2 Using Regular Expressions to Separate Fields
 --------------------------------------------------
 
 The previous node discussed the use of single characters or simple
 strings as the value of `FS'.  More generally, the value of `FS' may be
 a string containing any regular expression.  In this case, each match
 in the record for the regular expression separates fields.  For
 example, the assignment:
 
      FS = ", \t"
 
 makes every area of an input line that consists of a comma followed by a
 space and a TAB into a field separator.  (`\t' is an "escape sequence"
 that stands for a TAB; see Escape Sequences for the complete list
 of similar escape sequences.)
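
    As a small illustration, the following sketch feeds `awk' a made-up
 record in which only the second comma is followed by a space and a
 TAB.  The sample data is an assumption chosen for this example; the
 `printf' utility is used so that `\t' in the input becomes a real TAB.

```shell
# Only the exact sequence comma, space, TAB separates fields;
# the bare comma between "a" and "b" is ordinary data.
printf 'a,b, \tc\n' |
awk 'BEGIN { FS = ", \t" } { print NF; print $1 }'
```

 This should print `2' and then `a,b', because the record holds just
 one occurrence of the full separator.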
 
    For a less trivial example of a regular expression, try using single
 spaces to separate fields the way single commas are used.  `FS' can be
 set to `"[ ]"' (left bracket, space, right bracket).  This regular
 expression matches a single space and nothing else (see Regexp).
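
    A minimal sketch of this behavior, using an invented record with
 two adjacent spaces between `a' and `b'.  The brackets in the `printf'
 format are only there to make the empty field visible.

```shell
# With FS = "[ ]" every single space is a separator, so two adjacent
# spaces enclose an empty field, just as two adjacent commas would.
echo 'a  b c' |
awk 'BEGIN { FS = "[ ]" } { print NF; printf "[%s]\n", $2 }'
```

 This should print `4' and then `[]': the fields are `a', the empty
 string, `b', and `c'.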
 
    There is an important difference between the two cases of `FS = " "'
 (a single space) and `FS = "[ \t\n]+"' (a regular expression matching
 one or more spaces, TABs, or newlines).  For both values of `FS',
 fields are separated by "runs" (multiple adjacent occurrences) of
 spaces, TABs, and/or newlines.  However, when the value of `FS' is
 `" "', `awk' first strips leading and trailing whitespace from the
 record and then decides where the fields are.  For example, the
 following pipeline prints `b':
 
      $ echo ' a b c d ' | awk '{ print $2 }'
      -| b
 
 However, this pipeline prints `a' (note the extra spaces around each
 letter):
 
      $ echo ' a  b  c  d ' | awk 'BEGIN { FS = "[ \t\n]+" }
      >                                  { print $2 }'
      -| a
 
 In this case, the first field is "null" or empty.
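
    One way to see the difference is to compare `NF' for the same
 record under both settings.  This is only a sketch that reuses the
 sample record from the pipeline above.

```shell
# Default FS (" "): leading and trailing whitespace is stripped first.
echo ' a  b  c  d ' | awk '{ print NF }'

# Regexp FS: the leading and trailing runs of spaces are separators
# too, so they produce an empty first field and an empty last field.
echo ' a  b  c  d ' | awk 'BEGIN { FS = "[ \t\n]+" } { print NF }'
```

 The first command should print `4' and the second `6'.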
 
    The stripping of leading and trailing whitespace also comes into
 play whenever `$0' is recomputed.  For instance, study this pipeline:
 
      $ echo '   a b c d' | awk '{ print; $2 = $2; print }'
      -|    a b c d
      -| a b c d
 
 The first `print' statement prints the record as it was read, with
 leading whitespace intact.  The assignment to `$2' rebuilds `$0' by
 concatenating `$1' through `$NF' together, separated by the value of
 `OFS'.  Because the leading whitespace was ignored when finding `$1',
 it is not part of the new `$0'.  Finally, the last `print' statement
 prints the new `$0'.
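
    The role of `OFS' in the rebuilt record becomes easier to see when
 it is set to something other than a space.  The following sketch uses
 a hyphen, an arbitrary choice made for this example:

```shell
# Assigning to a field forces $0 to be rebuilt from $1 ... $NF,
# joined by OFS; the leading whitespace is not part of any field.
echo '   a b c d' | awk 'BEGIN { OFS = "-" } { $2 = $2; print }'
```

 This should print `a-b-c-d'.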
 
    There is an additional subtlety to be aware of when using regular
 expressions for field splitting.  It is not well-specified in the POSIX
 standard, or anywhere else, what `^' means when splitting fields.  Does
 the `^' match only at the beginning of the entire record? Or is each
 field separator a new string?  It turns out that different `awk'
 versions answer this question differently, and you should not rely on
 any specific behavior in your programs.  (d.c.)
 
    As a point of information, Brian Kernighan's `awk' allows `^' to
 match only at the beginning of the record. `gawk' also works this way.
 For example:
 
      $ echo 'xxAA  xxBxx  C' |
      > gawk -F '(^x+)|( +)' '{ for (i = 1; i <= NF; i++)
      >                                   printf "-->%s<--\n", $i }'
      -| --><--
      -| -->AA<--
      -| -->xxBxx<--
      -| -->C<--
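
    Programs that must run under several `awk' versions can avoid `^'
 in `FS' altogether.  One possible approach, shown here only as a
 sketch, is to remove the unwanted leading text with `sub()' and split
 on a separator that has no anchor.  Note that the result is not
 identical to the example above: because the leading `x' characters are
 deleted rather than treated as a separator, there is no empty first
 field.

```shell
# sub() with no third argument edits $0, which causes the fields to
# be split again using the unanchored separator " +".
echo 'xxAA  xxBxx  C' |
awk -F ' +' '{ sub(/^x+/, "")
               for (i = 1; i <= NF; i++)
                   printf "-->%s<--\n", $i }'
```

 This should print three lines, for `AA', `xxBxx', and `C'.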