(gawk.info.gz) Gory Details

Info Catalog (gawk.info.gz) String Functions
 
 9.1.3.1 More About `\' and `&' with `sub()', `gsub()', and `gensub()'
 .....................................................................
 
 When using `sub()', `gsub()', or `gensub()', and trying to get literal
 backslashes and ampersands into the replacement text, you need to
 remember that there are several levels of "escape processing" going on.
 
    First, there is the "lexical" level, which is when `awk' reads your
 program and builds an internal copy of it that can be executed.  Then
 there is the runtime level, which is when `awk' actually scans the
 replacement string to determine what to generate.
 
    At both levels, `awk' looks for a defined set of characters that can
 come after a backslash.  At the lexical level, it looks for the escape
 sequences listed in Escape Sequences.  Thus, for every `\' that
 `awk' processes at the runtime level, you must type two backslashes at
 the lexical level.  When a character that is not valid for an escape
 sequence follows the `\', Brian Kernighan's `awk' and `gawk' both
 simply remove the initial `\' and put the next character into the
 string. Thus, for example, `"a\qb"' is treated as `"aqb"'.
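
    The lexical level can be observed from the shell.  The following is
 only a sketch: it assumes that `gawk' is installed under that name, and
 the shell variable names are made up for the example.  The warning that
 `gawk' prints on standard error for `\q' is discarded.

```shell
# Unknown escape: the backslash before q is dropped when the program is read.
unknown=$(gawk 'BEGIN { print "a\qb" }' 2>/dev/null)
printf '%s\n' "$unknown"     # aqb

# Two typed backslashes leave one backslash in the stored string, so the
# string holds just the two characters \ and &.
stored=$(gawk 'BEGIN { s = "\\&"; print length(s), s }')
printf '%s\n' "$stored"      # 2 \&
```

 `printf' is used instead of `echo' because some shells let `echo'
 interpret backslashes itself, which would add yet another level.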
 
    At the runtime level, the various functions handle sequences of `\'
 and `&' differently.  The situation is (sadly) somewhat complex.
 Historically, the `sub()' and `gsub()' functions treated the two
 character sequence `\&' specially; this sequence was replaced in the
 generated text with a single `&'.  Any other `\' within the REPLACEMENT
 string that did not precede an `&' was passed through unchanged.  This
 is illustrated in Table 9.1.
 
       You type         `sub()' sees          `sub()' generates
       -------         ---------          --------------
           `\&'              `&'            the matched text
          `\\&'             `\&'            a literal `&'
         `\\\&'             `\&'            a literal `&'
        `\\\\&'            `\\&'            a literal `\&'
       `\\\\\&'            `\\&'            a literal `\&'
      `\\\\\\&'           `\\\&'            a literal `\\&'
          `\\q'             `\q'            a literal `\q'
 
 Table 9.1: Historical Escape Sequence Processing for `sub()' and
 `gsub()'
 
 This table shows both the lexical-level processing, where an odd number
 of backslashes becomes an even number at the runtime level, as well as
 the runtime processing done by `sub()'.  (For the sake of simplicity,
 the rest of the following tables only show the case of even numbers of
 backslashes entered at the lexical level.)
 
    The problem with the historical approach is that there is no way to
 get a literal `\' followed by the matched text.
 
    The 1992 POSIX standard attempted to fix this problem. That standard
 says that `sub()' and `gsub()' look for either a `\' or an `&' after
 the `\'. If either one follows a `\', that character is output
 literally.  The interpretation of `\' and `&' then becomes as shown in
 Table 9.2.
 
       You type         `sub()' sees          `sub()' generates
       -------         ---------          --------------
            `&'              `&'            the matched text
          `\\&'             `\&'            a literal `&'
        `\\\\&'            `\\&'            a literal `\', then the matched text
      `\\\\\\&'           `\\\&'            a literal `\&'
 
 Table 9.2: 1992 POSIX Rules for `sub()' and `gsub()' Escape Sequence
 Processing
 
 This appears to solve the problem.  Unfortunately, the phrasing of the
 standard is unusual. It says, in effect, that `\' turns off the special
 meaning of any following character, but for anything other than `\' and
 `&', such special meaning is undefined.  This wording leads to two
 problems:
 
    * Backslashes must now be doubled in the REPLACEMENT string, breaking
      historical `awk' programs.
 
    * To make sure that an `awk' program is portable, _every_ character
      in the REPLACEMENT string must be preceded with a backslash.(1)
 
    Because of the problems just listed, in 1996, the `gawk' maintainer
 submitted proposed text for a revised standard that reverts to rules
 that correspond more closely to the original existing practice. The
 proposed rules have special cases that make it possible to produce a
 `\' preceding the matched text. This is shown in Table 9.3.
 
       You type         `sub()' sees         `sub()' generates
       -------         ---------         --------------
      `\\\\\\&'           `\\\&'            a literal `\&'
        `\\\\&'            `\\&'            a literal `\', followed by the matched text
          `\\&'             `\&'            a literal `&'
          `\\q'             `\q'            a literal `\q'
         `\\\\'             `\\'            `\\'
 
 Table 9.3: Proposed rules for `sub()' and backslash
 
    In a nutshell, at the runtime level, there are now three special
 sequences of characters (`\\\&', `\\&', and `\&') whereas historically
 there was only one.  However, as in the historical case, any `\' that
 is not part of one of these three sequences is not special and appears
 in the output literally.
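
    The three sequences, and one non-special backslash, can be tried
 from the shell.  This sketch assumes a `gawk' that applies the proposed
 rules in its default mode (the text further on says that versions from
 4.0.1 do so when `--posix' is not given).  The variable names `r1'
 through `r4' are only for illustration.

```shell
# Each command replaces the "b" in "abc".  The comment on the right shows
# what sub() sees at the runtime level and the resulting line.
r1=$(echo abc | gawk '{ sub(/b/, "\\&"); print }')       # \&   gives a&c
r2=$(echo abc | gawk '{ sub(/b/, "\\\\&"); print }')     # \\&  gives a\bc
r3=$(echo abc | gawk '{ sub(/b/, "\\\\\\&"); print }')   # \\\& gives a\&c
r4=$(echo abc | gawk '{ sub(/b/, "\\q"); print }')       # \q   gives a\qc
printf '%s\n' "$r1" "$r2" "$r3" "$r4"
```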
 
    `gawk' 3.0 and 3.1 follow these proposed POSIX rules for `sub()' and
 `gsub()'.  The POSIX standard took much longer to be revised than was
 expected in 1996.  The 2001 standard does not follow the above rules.
 Instead, the rules there are somewhat simpler.  The results are similar
 except for one case.
 
    The POSIX rules state that `\&' in the replacement string produces a
 literal `&', `\\' produces a literal `\', and `\' followed by anything
 else is not special; the `\' is placed straight into the output.  These
 rules are presented in Table 9.4.
 
       You type         `sub()' sees         `sub()' generates
       -------         ---------         --------------
      `\\\\\\&'           `\\\&'            a literal `\&'
        `\\\\&'            `\\&'            a literal `\', followed by the matched text
          `\\&'             `\&'            a literal `&'
          `\\q'             `\q'            a literal `\q'
         `\\\\'             `\\'            `\'
 
 Table 9.4: POSIX rules for `sub()' and `gsub()'
 
    The only case where the difference is noticeable is the last one:
 `\\\\' is seen as `\\' and produces `\' instead of `\\'.
 
    Starting with version 3.1.4, `gawk' followed the POSIX rules when
 `--posix' is specified (see Options). Otherwise, it continued to
 follow the 1996 proposed rules, since that had been its behavior for
 many years.
 
    When version 4.0.0 was released, the `gawk' maintainer made the
 POSIX rules the default, breaking well over a decade's worth of
 backwards compatibility.(2) Needless to say, this was a bad idea, and
 as of version 4.0.1, `gawk' resumed its historical behavior; it
 follows the POSIX rules only when `--posix' is given.
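
    The one case where the two rule sets differ can be checked directly.
 This is a sketch that assumes `gawk' 4.0.1 or later and that the
 `POSIXLY_CORRECT' environment variable is not set; the variable names
 are hypothetical.

```shell
# The replacement is typed as four backslashes, so sub() sees two.
default_out=$(echo abc | gawk '{ sub(/b/, "\\\\"); print }')
posix_out=$(echo abc | gawk --posix '{ sub(/b/, "\\\\"); print }')

printf '%s\n' "$default_out"   # a\\c  (1996 proposed rules: both kept)
printf '%s\n' "$posix_out"     # a\c   (POSIX rules: \\ becomes \)
```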
 
    The rules for `gensub()' are considerably simpler. At the runtime
 level, whenever `gawk' sees a `\', if the following character is a
 digit, then the text that matched the corresponding parenthesized
 subexpression is placed in the generated output.  Otherwise, no matter
 what character follows the `\', it appears in the generated text and
 the `\' does not, as shown in Table 9.5.
 
        You type          `gensub()' sees         `gensub()' generates
        -------          ------------         -----------------
            `&'                    `&'            the matched text
          `\\&'                   `\&'            a literal `&'
         `\\\\'                   `\\'            a literal `\'
        `\\\\&'                  `\\&'            a literal `\', then the matched text
      `\\\\\\&'                 `\\\&'            a literal `\&'
          `\\q'                   `\q'            a literal `q'
 
 Table 9.5: Escape Sequence Processing for `gensub()'
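
    Several rows of the table, plus the digit case, can be sketched as
 follows.  The example assumes `gawk' in its default mode, since
 `gensub()' is a `gawk' extension; the variable names are made up for
 the example.

```shell
# Each call works on the string "abc".  The comment on the right shows
# what gensub() sees at the runtime level and the result.
g1=$(gawk 'BEGIN { print gensub(/b/, "[&]", 1, "abc") }')          # [&]   gives a[b]c
g2=$(gawk 'BEGIN { print gensub(/b/, "\\&", 1, "abc") }')          # \&    gives a&c
g3=$(gawk 'BEGIN { print gensub(/b/, "\\\\&", 1, "abc") }')        # \\&   gives a\bc
g4=$(gawk 'BEGIN { print gensub(/b/, "\\q", 1, "abc") }')          # \q    gives aqc
g5=$(gawk 'BEGIN { print gensub(/(a)(b)/, "\\2\\1", 1, "abc") }')  # \2\1  gives bac
printf '%s\n' "$g1" "$g2" "$g3" "$g4" "$g5"
```

 Unlike `sub()' and `gsub()', `gensub()' returns the modified string and
 leaves its target unchanged, which is why the result is printed
 directly.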
 
    Because of the complexity of the lexical and runtime level processing
 and the special cases for `sub()' and `gsub()', we recommend the use of
 `gawk' and `gensub()' when you have to do substitutions.
 
 Advanced Notes: Matching the Null String
 ----------------------------------------
 
 In `awk', the `*' operator can match the null string.  This is
 particularly important for the `sub()', `gsub()', and `gensub()'
 functions.  For example:
 
      $ echo abc | awk '{ gsub(/m*/, "X"); print }'
      -| XaXbXcX
 
 Although this makes a certain amount of sense, it can be surprising.
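
    The same null match affects the other two functions.  As a small
 sketch, assuming `gawk' is available (the variable names are
 hypothetical):

```shell
# sub() replaces only the first match, which is the null string at the
# start of the record.
first=$(echo abc | gawk '{ sub(/m*/, "X"); print }')
printf '%s\n' "$first"    # Xabc

# gensub() with "g" replaces every null match, just as gsub() does.
all=$(echo abc | gawk '{ print gensub(/m*/, "X", "g") }')
printf '%s\n' "$all"      # XaXbXcX
```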
 
    ---------- Footnotes ----------
 
    (1) This consequence was certainly unintended.
 
    (2) This was rather naive of him, despite there being a note in this
 section indicating that the next major version would move to the POSIX
 rules.
 