 4.1 How Input Is Split into Records
 ===================================
 
 The `awk' utility divides the input for your `awk' program into records
 and fields.  `awk' keeps track of the number of records that have been
 read so far from the current input file.  This value is stored in a
 built-in variable called `FNR'.  It is reset to zero when a new file is
 started.  Another built-in variable, `NR', records the total number of
 input records read so far from all data files.  It starts at zero, but
 is never automatically reset to zero.
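
    The difference between the two counters is easiest to see with more
 than one input file.  The following is a minimal sketch; the file names
 `first.txt' and `second.txt' are made up for the illustration and are
 created in a temporary directory:

```shell
# Two small throwaway files (hypothetical names) so the counters have something to count.
dir=$(mktemp -d)
printf 'a\nb\n' > "$dir/first.txt"
printf 'c\nd\ne\n' > "$dir/second.txt"

# FNR restarts for each file; NR keeps counting across both.
out=$(cd "$dir" && awk '{ print FILENAME, "FNR=" FNR, "NR=" NR }' first.txt second.txt)
printf '%s\n' "$out"
rm -rf "$dir"
```

 For the last record of `second.txt', `FNR' is 3 while `NR' is 5.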
 
    Records are separated by a character called the "record separator".
 By default, the record separator is the newline character.  This is why
 records are, by default, single lines.  A different character can be
 used for the record separator by assigning the character to the
 built-in variable `RS'.
 
    Like any other variable, the value of `RS' can be changed in the
 `awk' program with the assignment operator, `=' (see Assignment
 Ops).  The new record-separator character should be enclosed in
 quotation marks, which indicate a string constant.  Often the right
 time to do this is at the beginning of execution, before any input is
 processed, so that the very first record is read with the proper
 separator.  To do this, use the special `BEGIN' pattern (see
 BEGIN/END).  For example:
 
      awk 'BEGIN { RS = "/" }
           { print $0 }' BBS-list
 
 changes the value of `RS' to `"/"', before reading any input.  This is
 a string whose first character is a slash; as a result, records are
 separated by slashes.  Then the input file is read, and the second rule
 in the `awk' program (the action with no pattern) prints each record.
 Because each `print' statement adds a newline at the end of its output,
 this `awk' program copies the input with each slash changed to a
 newline.  Here are the results of running the program on `BBS-list':
 
      $ awk 'BEGIN { RS = "/" }
      >      { print $0 }' BBS-list
      -| aardvark     555-5553     1200
      -| 300          B
      -| alpo-net     555-3412     2400
      -| 1200
      -| 300     A
      -| barfly       555-7685     1200
      -| 300          A
      -| bites        555-1675     2400
      -| 1200
      -| 300     A
      -| camelot      555-0542     300               C
      -| core         555-2912     1200
      -| 300          C
      -| fooey        555-1234     2400
      -| 1200
      -| 300     B
      -| foot         555-6699     1200
      -| 300          B
      -| macfoo       555-6480     1200
      -| 300          A
      -| sdace        555-3430     2400
      -| 1200
      -| 300     A
      -| sabafoo      555-2127     1200
      -| 300          C
      -|
 
 Note that the entry for the `camelot' BBS is not split.  In the
 original data file (see Sample Data Files), the line looks like
 this:
 
      camelot      555-0542     300               C
 
 It has one baud rate only, so there are no slashes in the record,
 unlike the others which have two or more baud rates.  In fact, this
 record is treated as part of the record for the `core' BBS; the newline
 separating them in the output is the original newline in the data file,
 not the one added by `awk' when it printed the record!
 
    Another way to change the record separator is on the command line,
 using the variable-assignment feature (see Other Arguments):
 
      awk '{ print $0 }' RS="/" BBS-list
 
 This sets `RS' to `/' before processing `BBS-list'.
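
    A command-line assignment takes effect only when `awk' reaches that
 argument, which is after any `BEGIN' rule has run.  The sketch below
 illustrates the timing; the file `data.txt' and its contents are
 invented for the example:

```shell
dir=$(mktemp -d)
printf 'x/y/z' > "$dir/data.txt"   # hypothetical sample file, no trailing newline

# BEGIN runs before the RS="/" argument is processed, so it still sees the default.
out=$(awk 'BEGIN { print (RS == "\n" ? "BEGIN: default RS" : "BEGIN: RS changed") }
           { print NR ": " $0 }' RS="/" "$dir/data.txt")
printf '%s\n' "$out"
rm -rf "$dir"
```

 The `BEGIN' rule reports the default separator, yet the file is still
 split at each slash into three records.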
 
    Using an unusual character such as `/' for the record separator
 produces correct behavior in the vast majority of cases.
 
    There is one unusual case that occurs when `gawk' is being fully
 POSIX-compliant (see Options).  Then, the following (extreme)
 pipeline prints a surprising `1':
 
      $ echo | gawk --posix 'BEGIN { RS = "a" } ; { print NF }'
      -| 1
 
    There is one field, consisting of a newline.  The value of the
 built-in variable `NF' is the number of fields in the current record.
 (In the normal case, `gawk' treats the newline as whitespace, printing
 `0' as the result. Most other versions of `awk' also act this way.)
 
    Reaching the end of an input file terminates the current input
 record, even if the last character in the file is not the character in
 `RS'.  (d.c.)
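
    As a small illustration, the input below (made up for the example)
 ends with neither a slash nor a newline, and the final record is
 delivered all the same:

```shell
# printf writes no trailing slash or newline, so the last record is ended by end of input alone.
out=$(printf 'first/second/third' | awk 'BEGIN { RS = "/" } { print NR ": " $0 }')
printf '%s\n' "$out"
```

 Three records are printed, the last one being `third'.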
 
    The empty string `""' (a string without any characters) has a
 special meaning as the value of `RS'. It means that records are
 separated by one or more blank lines and nothing else.  See Multiple
 Line, for more details.
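
    A brief sketch of this mode, using invented address data in which
 two entries are separated by a run of blank lines.  With the default
 field separator, a newline inside a record also separates fields here:

```shell
# Two paragraphs separated by several blank lines (made-up sample text).
out=$(printf 'Jane Doe\n123 Main St\n\n\n\nJohn Roe\n456 Oak Ave\n' |
      awk 'BEGIN { RS = "" } { print NR ": " NF " fields, first is " $1 }')
printf '%s\n' "$out"
```

 Each two-line entry arrives as a single record with five fields.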
 
    If you change the value of `RS' in the middle of an `awk' run, the
 new value is used to delimit subsequent records, but the record
 currently being processed, as well as records already processed, are not
 affected.
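
    The following sketch changes `RS' while processing the first
 record.  The input is invented for the example; the first record is
 still a whole line, and only the records after it are split at slashes:

```shell
# Record 1 is read with the default newline separator; the rule then switches RS to "/".
out=$(printf 'a b\nc/d/e' | awk 'NR == 1 { RS = "/" } { print NR ": " $0 }')
printf '%s\n' "$out"
```

 The first record is `a b'; the remaining input becomes the three
 records `c', `d', and `e'.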
 
    After the end of the record has been determined, `gawk' sets the
 variable `RT' to the text in the input that matched `RS'.
 
    When using `gawk', the value of `RS' is not limited to a
 one-character string.  It can be any regular expression (see
 Regexp). (c.e.)  In general, each record ends at the next string that
 matches the regular expression; the next record starts at the end of
 the matching string.  This general rule is actually at work in the
 usual case, where `RS' contains just a newline: a record ends at the
 beginning of the next matching string (the next newline in the input),
 and the following record starts just after the end of this string (at
 the first character of the following line).  The newline, because it
 matches `RS', is not part of either record.
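
    As a minimal sketch of a multicharacter separator, the pipeline
 below splits made-up input at each literal `<br>'.  It assumes that
 `gawk' is installed under the name `gawk':

```shell
# RS is four characters long, so gawk treats it as a regexp that matches "<br>" literally.
out=$(printf 'alpha<br>beta<br>gamma' | gawk 'BEGIN { RS = "<br>" } { print NR ": " $0 }')
printf '%s\n' "$out"
```

 The separator text itself does not appear in any of the three records.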
 
    When `RS' is a single character, `RT' contains the same single
 character. However, when `RS' is a regular expression, `RT' contains
 the actual input text that matched the regular expression.
 
    If the input file ended without any text that matches `RS', `gawk'
 sets `RT' to the null string.
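
    A short sketch of the end-of-file case, again assuming `gawk' is
 available as `gawk'.  The input is invented and deliberately ends
 without a final slash:

```shell
# The first record is ended by "/", the second only by end of input, so its RT is empty.
out=$(printf 'a/b' | gawk 'BEGIN { RS = "/" } { printf "%s: RT=[%s]\n", $0, RT }')
printf '%s\n' "$out"
```

 The brackets make the empty `RT' of the last record visible.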
 
    The following example illustrates both of these features.  It sets
 `RS' equal to a regular expression that matches either a newline or a
 series of one or more uppercase letters with optional leading and/or
 trailing whitespace:
 
      $ echo record 1 AAAA record 2 BBBB record 3 |
      > gawk 'BEGIN { RS = "\n|( *[[:upper:]]+ *)" }
      >             { print "Record =", $0, "and RT =", RT }'
      -| Record = record 1 and RT =  AAAA
      -| Record = record 2 and RT =  BBBB
      -| Record = record 3 and RT =
      -|
 
 The output ends with an extra blank line.  This is because the final
 value of `RT' is a newline, and the `print' statement supplies its own
 terminating newline.  See Simple Sed, for a more useful example of
 `RS' as a regexp and `RT'.
 
    If you set `RS' to a regular expression that allows optional
 trailing text, such as `RS = "abc(XYZ)?"', it is possible, due to
 implementation constraints, that `gawk' may match the leading part of
 the regular expression, but not the trailing part, particularly if the
 input text that could match the trailing part is fairly long.  `gawk'
 attempts to avoid this problem, but currently, there's no guarantee
 that this will never happen.
 
      NOTE: Remember that in `awk', the `^' and `$' anchor
      metacharacters match the beginning and end of a _string_, and not
      the beginning and end of a _line_.  As a result, something like
      `RS = "^[[:upper:]]"' can only match at the beginning of a file.
      This is because `gawk' views the input file as one long string
      that happens to contain newline characters in it.  It is thus best
      to avoid anchor characters in the value of `RS'.
 
    The use of `RS' as a regular expression and the `RT' variable are
 `gawk' extensions; they are not available in compatibility mode (see
 Options).  In compatibility mode, only the first character of the
 value of `RS' is used to determine the end of the record.
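
    The effect of compatibility mode can be sketched as follows,
 assuming `gawk' is installed as `gawk' and using its `--traditional'
 option.  The input is invented; only the leading `<' of the `RS' value
 acts as the separator:

```shell
# In compatibility mode only "<" separates records, so "br>" stays in the data.
out=$(printf 'a<b<br>c' | gawk --traditional 'BEGIN { RS = "<br>" } { print NR ": " $0 }')
printf '%s\n' "$out"
```

 The third record is `br>c', showing that the rest of the `RS' value was
 ignored.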
 
 Advanced Notes: `RS = "\0"' Is Not Portable
 -------------------------------------------
 
 There are times when you might want to treat an entire data file as a
 single record.  The only way to make this happen is to give `RS' a
 value that you know doesn't occur in the input file.  This is hard to
 do in a general way, such that a program always works for arbitrary
 input files.
 
    You might think that for text files, the NUL character, which
 consists of a character with all bits equal to zero, is a good value to
 use for `RS' in this case:
 
      BEGIN { RS = "\0" }  # whole file becomes one record?
 
    `gawk' in fact accepts this, and uses the NUL character for the
 record separator.  However, this usage is _not_ portable to other `awk'
 implementations.
 
    All other `awk' implementations(1) store strings internally as
 C-style strings.  C strings use the NUL character as the string
 terminator.  In effect, this means that `RS = "\0"' is the same as `RS
 = ""'.  (d.c.)
 
    The best way to treat a whole file as a single record is to simply
 read the file in, one record at a time, concatenating each record onto
 the end of the previous ones.
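
    One way this concatenation might look is sketched below.  The
 variable names `contents' and `lines' are arbitrary, the input is made
 up, and the sketch assumes that every line of the input ends with a
 newline, because it adds one back after each record:

```shell
# Accumulate every record, then work on the whole text in the END rule.
out=$(printf 'line one\nline two\nline three\n' | awk '
    { contents = contents $0 "\n" }     # re-append the newline that RS consumed
    END {
        n = split(contents, lines, "\n")  # the trailing newline yields one empty last piece
        print length(contents) " characters, " (n - 1) " lines"
    }')
printf '%s\n' "$out"
```

 In the `END' rule, `contents' holds the complete input as a single
 string, here 29 characters spread over 3 lines.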
 
    ---------- Footnotes ----------
 
    (1) At least that we know about.
 