(gawk.info.gz) Gory Details
Info Catalog
(gawk.info.gz) String Functions
8.1.3.1 More About `\' and `&' with `sub', `gsub', and `gensub'
...............................................................
When using `sub', `gsub', or `gensub', and trying to get literal
backslashes and ampersands into the replacement text, you need to
remember that there are several levels of "escape processing" going on.
First, there is the "lexical" level, which is when `awk' reads your
program and builds an internal copy of it that can be executed. Then
there is the runtime level, which is when `awk' actually scans the
replacement string to determine what to generate.
At both levels, `awk' looks for a defined set of characters that can
come after a backslash. At the lexical level, it looks for the escape
sequences listed in Escape Sequences. Thus, for every `\' that
`awk' processes at the runtime level, type two backslashes at the
lexical level. When a character that is not valid for an escape
sequence follows the `\', Unix `awk' and `gawk' both simply remove the
initial `\' and put the next character into the string. Thus, for
example, `"a\qb"' is treated as `"aqb"'.
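The effect of the lexical level can be observed on its own, before any
substitution function is involved. The following is a minimal sketch; the
shell variable names are illustrative only. It shows that four backslashes
typed in the program text are stored as two, which is what the runtime
level later scans.

```shell
# Lexical level: each pair of typed backslashes is stored as one backslash.
typed_four=$(awk 'BEGIN { print length("\\\\") }')
printf '%s\n' "$typed_four"

# Printing the string shows what the runtime level will later scan.
stored=$(awk 'BEGIN { print "\\\\&" }')
printf '%s\n' "$stored"
```

`printf' is used instead of `echo' because some shells' `echo' performs
its own backslash processing, which would add yet another level.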
At the runtime level, the various functions handle sequences of `\'
and `&' differently. The situation is (sadly) somewhat complex.
Historically, the `sub' and `gsub' functions treated the two-character
sequence `\&' specially; this sequence was replaced in the generated
text with a single `&'. Any other `\' within the REPLACEMENT string
that did not precede an `&' was passed through unchanged. This is
illustrated in Table 8.1.
You type      `sub' sees    `sub' generates
--------      ----------    ---------------
`\&'          `&'           the matched text
`\\&'         `\&'          a literal `&'
`\\\&'        `\&'          a literal `&'
`\\\\&'       `\\&'         a literal `\&'
`\\\\\&'      `\\&'         a literal `\&'
`\\\\\\&'     `\\\&'        a literal `\\&'
`\\q'         `\q'          a literal `\q'
Table 8.1: Historical Escape Sequence Processing for sub and gsub
This table shows both the lexical-level processing, where an odd number
of backslashes becomes an even number at the runtime level, as well as
the runtime processing done by `sub'. (For the sake of simplicity, the
remaining tables show only the case of even numbers of backslashes
entered at the lexical level.)
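The two simplest cases, a bare `&' and the typed sequence `\\&', behave
the same way under every rule set described in this section. As a small
sketch (the input text and variable names are made up for illustration):

```shell
# A bare & in the replacement stands for the matched text.
matched=$(echo "hello" | awk '{ sub(/hello/, "[&]"); print }')

# Typing \\& makes sub see \& and generate a literal ampersand.
literal_amp=$(echo "hello" | awk '{ sub(/hello/, "[\\&]"); print }')

printf '%s\n' "$matched" "$literal_amp"
```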
The problem with the historical approach is that there is no way to
get a literal `\' followed by the matched text.
The 1992 POSIX standard attempted to fix this problem. That standard
says that `sub' and `gsub' look for either a `\' or an `&' after the
`\'. If either one follows a `\', that character is output literally.
The interpretation of `\' and `&' then becomes as shown in
Table 8.2.
You type      `sub' sees    `sub' generates
--------      ----------    ---------------
`&'           `&'           the matched text
`\\&'         `\&'          a literal `&'
`\\\\&'       `\\&'         a literal `\', then the matched text
`\\\\\\&'     `\\\&'        a literal `\&'
Table 8.2: 1992 POSIX Rules for sub and gsub Escape Sequence Processing
This appears to solve the problem. Unfortunately, the phrasing of the
standard is unusual. It says, in effect, that `\' turns off the special
meaning of any following character, but for anything other than `\' and
`&', such special meaning is undefined. This wording leads to two
problems:
* Backslashes must now be doubled in the REPLACEMENT string, breaking
historical `awk' programs.
* To make sure that an `awk' program is portable, _every_ character
in the REPLACEMENT string must be preceded with a backslash.(1)
Because of the problems just listed, in 1996, the `gawk' maintainer
submitted proposed text for a revised standard that reverts to rules
that correspond more closely to the original existing practice. The
proposed rules have special cases that make it possible to produce a
`\' preceding the matched text. This is shown in
Table 8.3.
You type      `sub' sees    `sub' generates
--------      ----------    ---------------
`\\\\\\&'     `\\\&'        a literal `\&'
`\\\\&'       `\\&'         a literal `\', followed by the matched text
`\\&'         `\&'          a literal `&'
`\\q'         `\q'          a literal `\q'
`\\\\'        `\\'          `\\'
Table 8.3: Proposed rules for sub and backslash
In a nutshell, at the runtime level, there are now three special
sequences of characters (`\\\&', `\\&' and `\&') whereas historically
there was only one. However, as in the historical case, any `\' that
is not part of one of these three sequences is not special and appears
in the output literally.
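The case that the historical rules could not express, a literal `\'
followed by the matched text, can be tried directly. This is only a
sketch, assuming an `awk' that implements either the proposed rules or
the later POSIX rules; the sample input and variable names are
illustrative.

```shell
# Typed \\\\& is seen as \\& and yields a backslash, then the matched text.
backslash_then_match=$(echo "hello" | awk '{ sub(/hello/, "\\\\&"); print }')

# Typed \\\\\\& is seen as \\\& and yields a literal backslash and ampersand.
backslash_then_amp=$(echo "hello" | awk '{ sub(/hello/, "\\\\\\&"); print }')

printf '%s\n' "$backslash_then_match" "$backslash_then_amp"
```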
`gawk' 3.0 and 3.1 follow these proposed POSIX rules for `sub' and
`gsub'. The POSIX standard took much longer to be revised than was
expected in 1996. The 2001 standard does not follow the above rules.
Instead, the rules there are somewhat simpler. The results are similar
except for one case.
The 2001 POSIX rules state that `\&' in the replacement string
produces a literal `&', `\\' produces a literal `\', and `\' followed
by anything else is not special; the `\' is placed straight into the
output. These rules are presented in Table 8.4.
You type      `sub' sees    `sub' generates
--------      ----------    ---------------
`\\\\\\&'     `\\\&'        a literal `\&'
`\\\\&'       `\\&'         a literal `\', followed by the matched text
`\\&'         `\&'          a literal `&'
`\\q'         `\q'          a literal `\q'
`\\\\'        `\\'          `\'
Table 8.4: POSIX 2001 rules for sub
The only case where the difference is noticeable is the last one:
`\\\\' is seen as `\\' and produces `\' instead of `\\'.
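A quick way to find out which rule set a particular `awk' applies is to
try that one differing case. The following sketch uses made-up input; no
expected output is shown because the result depends on the
implementation, its version, and its options.

```shell
# Typed \\\\ is seen as \\ at the runtime level.
# Proposed (1996) rules generate two backslashes; POSIX 2001 rules generate one.
result=$(echo "a-b" | awk '{ sub(/-/, "\\\\"); print }')
printf '%s\n' "$result"
```

An output of `a\\b' indicates the proposed rules, and `a\b' indicates the
POSIX 2001 rules.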
Starting with version 3.1.4, `gawk' follows the POSIX rules when
`--posix' is specified (see Options). Otherwise, it continues to
follow the 1996 proposed rules, since, as of this writing, that has
been its behavior for over seven years.
NOTE: At the next major release, `gawk' will switch to using the
POSIX 2001 rules by default.
The rules for `gensub' are considerably simpler. At the runtime
level, whenever `gawk' sees a `\', if the following character is a
digit, then the text that matched the corresponding parenthesized
subexpression is placed in the generated output. Otherwise, no matter
what character follows the `\', it appears in the generated text and
the `\' does not, as shown in Table 8.5.
You type      `gensub' sees    `gensub' generates
--------      -------------    ------------------
`&'           `&'              the matched text
`\\&'         `\&'             a literal `&'
`\\\\'        `\\'             a literal `\'
`\\\\&'       `\\&'            a literal `\', then the matched text
`\\\\\\&'     `\\\&'           a literal `\&'
`\\q'         `\q'             a literal `q'
Table 8.5: Escape Sequence Processing for gensub
Because of the complexity of the lexical and runtime level processing
and the special cases for `sub' and `gsub', we recommend the use of
`gawk' and `gensub' when you have to do substitutions.
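The `gensub' rules can be exercised with a few one-liners. This sketch
assumes that `gawk' is installed under that name, since `gensub' is a
`gawk' extension; the sample inputs and variable names are illustrative
only.

```shell
if command -v gawk >/dev/null 2>&1; then
    # \\1 and \\2 refer to the first and second parenthesized subexpressions.
    swapped=$(echo "aaabb" | gawk '{ print gensub(/(a+)(b+)/, "<\\2\\1>", 1) }')

    # Typed \\& is seen as \& and generates a literal ampersand.
    gensub_amp=$(echo "hello" | gawk '{ print gensub(/hello/, "[\\&]", 1) }')

    # Typed \\\\& is seen as \\& and generates a backslash, then the match.
    gensub_bs=$(echo "hello" | gawk '{ print gensub(/hello/, "\\\\&", 1) }')

    printf '%s\n' "$swapped" "$gensub_amp" "$gensub_bs"
fi
```

Unlike `sub' and `gsub', `gensub' returns the modified string and leaves
the original record unchanged, which is why each program prints the
function's return value.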
Advanced Notes: Matching the Null String
----------------------------------------
In `awk', the `*' operator can match the null string. This is
particularly important for the `sub', `gsub', and `gensub' functions.
For example:
$ echo abc | awk '{ gsub(/m*/, "X"); print }'
-| XaXbXcX
Although this makes a certain amount of sense, it can be surprising.
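When matches of the null string are not wanted, one possible approach is
to use the `+' operator, which requires at least one occurrence. The
following sketch contrasts the two operators on the same input; the
variable names are illustrative.

```shell
# m* matches the null string at every position, so X is inserted everywhere.
star=$(echo abc | awk '{ gsub(/m*/, "X"); print }')

# m+ needs at least one m, so nothing matches and the record is unchanged.
plus=$(echo abc | awk '{ gsub(/m+/, "X"); print }')

printf '%s\n' "$star" "$plus"
```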
---------- Footnotes ----------
(1) This consequence was certainly unintended.