(gawk.info.gz) Translate Program

Info Catalog (gawk.info.gz) Alarm Program (gawk.info.gz) Miscellaneous Programs (gawk.info.gz) Labels Program
 
 13.3.3 Transliterating Characters
 ---------------------------------
 
 The system `tr' utility transliterates characters.  For example, it is
 often used to map uppercase letters into lowercase for further
 processing:
 
      GENERATE DATA | tr 'A-Z' 'a-z' | PROCESS DATA ...
 
    `tr' requires two lists of characters.(1)  When processing the
 input, the first character in the first list is replaced with the first
 character in the second list, the second character in the first list is
 replaced with the second character in the second list, and so on.  If
 there are more characters in the "from" list than in the "to" list, the
 last character of the "to" list is used for the remaining characters in
 the "from" list.
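
    As a small illustration of the positional pairing described above,
 the following shell sketch maps `a' to `x', `b' to `y', and `n' to `z'.
 The input word is made up for the example, and only the case where both
 lists have the same length is shown.

```shell
# Each character of the first list is replaced by the character at the
# same position in the second list; all other characters pass through.
printf 'banana\n' | tr 'abn' 'xyz'
# prints: yxzxzx
```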
 
    Some time ago, a user proposed that a transliteration function should
 be added to `gawk'.  The following program was written to prove that
 character transliteration could be done with a user-level function.
 This program is not as complete as the system `tr' utility but it does
 most of the job.
 
    The `translate' program demonstrates one of the few weaknesses of
 standard `awk': dealing with individual characters is very painful,
 requiring repeated use of the `substr', `index', and `gsub' built-in
 functions (see String Functions).(2) The program defines two
 functions.  The first, `stranslate', takes three arguments:
 
 `from'
      A list of characters from which to translate.
 
 `to'
      A list of characters to which to translate.
 
 `target'
      The string on which to do the translation.
 
    Associative arrays make the translation part fairly easy. `t_ar'
 holds the "to" characters, indexed by the "from" characters.  Then a
 simple loop goes through `target', one character at a time.  Each
 character that appears as an index in `t_ar' is replaced with the
 corresponding "to" character; any other character is copied unchanged.
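
    To make the lookup table concrete, here is a minimal sketch that
 builds a table in the same spirit as `t_ar' and prints it.  The shell
 function name `show_table' and the use of `-v' to pass the two lists are
 assumptions made for this illustration; they are not part of the
 program in this section.

```shell
# show_table: hypothetical helper that prints the "from" -> "to" table.
show_table() {
    awk -v from="$1" -v to="$2" 'BEGIN {
        lf = length(from)
        lt = length(to)
        # Pair up characters position by position.
        for (i = 1; i <= lt && i <= lf; i++)
            t_ar[substr(from, i, 1)] = substr(to, i, 1)
        # Pad with the last "to" character when "to" is shorter.
        for (; i <= lf; i++)
            t_ar[substr(from, i, 1)] = substr(to, lt, 1)
        # Walk "from" in order, because "for (c in t_ar)" order is unspecified.
        for (i = 1; i <= lf; i++) {
            c = substr(from, i, 1)
            print c " -> " t_ar[c]
        }
    }'
}

show_table abc xy
```

    With the lists `abc' and `xy', the third line of output shows `c'
 mapped to `y', the last character of the shorter "to" list.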
 
    The `translate' function simply calls `stranslate' using `$0' as the
 target.  The main program sets two global variables, `FROM' and `TO',
 from the command line, and then changes `ARGV' so that `awk' reads from
 the standard input.
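
    The argument handling can be tried in isolation.  The sketch below
 is an assumption-laden demonstration, not part of the program: the
 shell function `show_args' is a made-up name, and its rule merely echoes
 the two lists next to each input line to show that the operands were
 consumed and that input still comes from the standard input.

```shell
# show_args: hypothetical demo of taking two operands from ARGV.
show_args() {
    awk 'BEGIN {
        FROM = ARGV[1]      # first operand: the "from" list
        TO = ARGV[2]        # second operand: the "to" list
        ARGC = 2            # hide ARGV[2] from the input-file scan
        ARGV[1] = "-"       # "-" names the standard input
    }
    { print FROM, TO, $0 }' "$@"
}

printf 'one\ntwo\n' | show_args abc xyz
```

    Without the changes to `ARGC' and `ARGV', `awk' would try to open
 files named `abc' and `xyz'.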
 
    Finally, the processing rule simply calls `translate' for each
 record:
 
      # translate.awk --- do tr-like stuff
      # Bugs: does not handle ranges such as: tr A-Z a-z; each list has
      # to be spelled out. However, if `to' is shorter than `from',
      # the last character in `to' is used for the rest of `from'.
 
      function stranslate(from, to, target,     lf, lt, ltarget, t_ar, i, c,
                                                                     result)
      {
          lf = length(from)
          lt = length(to)
          ltarget = length(target)
          for (i = 1; i <= lt; i++)
              t_ar[substr(from, i, 1)] = substr(to, i, 1)
          if (lt < lf)
              for (; i <= lf; i++)
                  t_ar[substr(from, i, 1)] = substr(to, lt, 1)
          for (i = 1; i <= ltarget; i++) {
              c = substr(target, i, 1)
              if (c in t_ar)
                  c = t_ar[c]
              result = result c
          }
          return result
      }
 
      function translate(from, to)
      {
          return $0 = stranslate(from, to, $0)
      }
 
      # main program
      BEGIN {
          if (ARGC < 3) {
              print "usage: translate from to" > "/dev/stderr"
              exit 1    # report the usage error to the caller
          }
          FROM = ARGV[1]
          TO = ARGV[2]
          ARGC = 2
          ARGV[1] = "-"
      }
 
      {
          translate(FROM, TO)
          print
      }
 
    While it is possible to do character transliteration in a user-level
 function, it is not necessarily efficient, and we (the `gawk' authors)
 started to consider adding a built-in function.  However, shortly after
 writing this program, we learned that the System V Release 4 `awk' had
 added the `toupper' and `tolower' functions (see String Functions).
 These functions handle the vast majority of the cases where character
 transliteration is necessary, and so we chose to simply add those
 functions to `gawk' as well and then leave well enough alone.
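
    For the common case-mapping job, the two built-in functions are
 enough on their own.  A short sketch, with an input line invented for
 the example:

```shell
# tolower and toupper return a case-converted copy of their argument.
printf 'Hello, World\n' | awk '{ print tolower($0); print toupper($0) }'
# prints: hello, world
#         HELLO, WORLD
```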
 
    An obvious improvement to this program would be to set up the `t_ar'
 array only once, in a `BEGIN' rule. However, this assumes that the
 "from" and "to" lists will never change throughout the lifetime of the
 program.
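
    One way this improvement might look is sketched below.  It is an
 illustration under stated assumptions rather than a drop-in
 replacement: the shell function `translate_once' and the global array
 `T_AR' are hypothetical names, and the lists are passed with `-v'
 (which interprets backslash escapes in the values) instead of through
 `ARGV'.

```shell
# translate_once: hypothetical variant that builds the table a single time.
translate_once() {
    awk -v from="$1" -v to="$2" '
    BEGIN {
        lf = length(from)
        lt = length(to)
        # Build the global table T_AR once, before any input is read.
        for (i = 1; i <= lt && i <= lf; i++)
            T_AR[substr(from, i, 1)] = substr(to, i, 1)
        # Pad with the last "to" character when "to" is shorter.
        for (; i <= lf; i++)
            T_AR[substr(from, i, 1)] = substr(to, lt, 1)
    }
    {
        # Per record, only the lookup loop remains.
        result = ""
        n = length($0)
        for (i = 1; i <= n; i++) {
            c = substr($0, i, 1)
            if (c in T_AR)
                c = T_AR[c]
            result = result c
        }
        print result
    }'
}

printf 'hello world\n' | translate_once lo 01
# prints: he001 w1r0d
```

    Because the table is fixed in the `BEGIN' rule, the lists cannot
 change while the program runs, which is exactly the assumption noted
 above.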
 
    ---------- Footnotes ----------
 
    (1) On some older System V systems, `tr' may require that the lists
 be written as range expressions enclosed in square brackets (`[a-z]')
 and quoted, to prevent the shell from attempting a file name expansion.
 This is not a feature.
 
    (2) This program was written before `gawk' acquired the ability to
 split each character in a string into separate array elements.
 