(gawk.info.gz) Uniq Program

 
 13.2.6 Printing Nonduplicated Lines of Text
 -------------------------------------------
 
 The `uniq' utility reads sorted lines of data on its standard input,
 and by default removes duplicate lines.  In other words, it prints only
 unique lines--hence the name.  `uniq' has a number of options. The
 usage is as follows:
 
      uniq [-udc [-N]] [+N] [ INPUT FILE [ OUTPUT FILE ]]
 
    The options for `uniq' are:
 
 `-d'
      Print only repeated lines.
 
 `-u'
      Print only nonrepeated lines.
 
 `-c'
      Count lines. This option overrides `-d' and `-u'.  Both repeated
      and nonrepeated lines are counted.
 
 `-N'
      Skip N fields before comparing lines.  The definition of fields is
      similar to `awk''s default: nonwhitespace characters separated by
      runs of spaces and/or TABs.
 
 `+N'
      Skip N characters before comparing lines.  Any fields specified
      with `-N' are skipped first.
 
 `INPUT FILE'
      Data is read from the input file named on the command line,
      instead of from the standard input.
 
 `OUTPUT FILE'
      The generated output is sent to the named output file, instead of
      to the standard output.
 
    Normally `uniq' behaves as if both the `-d' and `-u' options are
 provided: it prints one copy of each line, whether or not the line is
 repeated.
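
    As a quick illustration of how `-d' and `-u' select lines, the
 following shell sketch runs the system `uniq' over a small sorted
 sample.  It assumes a POSIX-conforming `uniq' is installed; the sample
 data and the shell variable names are invented for this example.

```shell
# Sample sorted input: "apple" is repeated, "banana" and "cherry" are not.
sample=$(printf 'apple\napple\nbanana\ncherry\n')

default_out=$(printf '%s\n' "$sample" | uniq)      # one copy of every line
repeated_out=$(printf '%s\n' "$sample" | uniq -d)  # only lines that repeat
single_out=$(printf '%s\n' "$sample" | uniq -u)    # only lines that do not

printf 'default:\n%s\n-d:\n%s\n-u:\n%s\n' \
    "$default_out" "$repeated_out" "$single_out"
```

    The default output is the union of what `-d' and `-u' print, which
 is why the `awk' version can treat "no options" as "both options."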
 
    `uniq' uses the `getopt()' library function (see Getopt Function)
 and the `join()' library function (see Join Function).
 
    The program begins with a `usage()' function and then a brief
 outline of the options and their meanings in comments.  The `BEGIN'
 rule deals with the command-line arguments and options. It uses a trick
 to get `getopt()' to handle options of the form `-25', treating such an
 option as the option letter `2' with an argument of `5'. If indeed two
 or more digits are supplied (`Optarg' looks like a number), `Optarg' is
 concatenated with the option digit and then the result is added to zero
 to make it into a number.  If there is only one digit in the option,
 then `Optarg' is not needed. In this case, `Optind' must be decremented
 so that `getopt()' processes it next time.  This code is admittedly a
 bit tricky.
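
    The digit-handling step can be tried in isolation.  The sketch
 below is not part of `uniq.awk'; the shell function `combine' is a
 hypothetical helper that passes in the option digit and the would-be
 `Optarg' by hand, so it does not need `getopt()' at all.  It also
 leaves out the `Optind' adjustment, which only matters inside a real
 `getopt()' loop.

```shell
# combine OPTION_DIGIT OPTARG: mimic how the BEGIN rule turns a digit
# option and its getopt()-style argument into a field count.
combine() {
    awk -v c="$1" -v Optarg="$2" 'BEGIN {
        if (Optarg ~ /^[0-9]+$/)
            fcount = (c Optarg) + 0   # "-25": digits "2" and "5" give 25
        else
            fcount = c + 0            # "-5 file": Optarg is not a digit string
        print fcount
    }'
}

two_digits=$(combine 2 5)        # as for -25
one_digit=$(combine 5 data.txt)  # as for -5 followed by a file name
echo "$two_digits $one_digit"
```

    Concatenating first and adding zero afterwards is what keeps `2'
 and `5' from being summed to 7.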
 
    If no options are supplied, then the default is taken, to print both
 repeated and nonrepeated lines.  The output file, if provided, is
 assigned to `outputfile'.  Early on, `outputfile' is initialized to the
 standard output, `/dev/stdout':
 
      # uniq.awk --- do uniq in awk
      #
      # Requires getopt() and join() library functions
 
      function usage(    e)
      {
          e = "Usage: uniq [-udc [-n]] [+n] [ in [ out ]]"
          print e > "/dev/stderr"
          exit 1
      }
 
      # -c    count lines. overrides -d and -u
      # -d    only repeated lines
      # -u    only nonrepeated lines
      # -n    skip n fields
      # +n    skip n characters, skip fields first
 
      BEGIN   \
      {
          count = 1
          outputfile = "/dev/stdout"
          opts = "udc0:1:2:3:4:5:6:7:8:9:"
          while ((c = getopt(ARGC, ARGV, opts)) != -1) {
              if (c == "u")
                  non_repeated_only++
              else if (c == "d")
                  repeated_only++
              else if (c == "c")
                  do_count++
              else if (index("0123456789", c) != 0) {
                  # getopt requires args to options
                  # this messes us up for things like -5
                  if (Optarg ~ /^[[:digit:]]+$/)
                      fcount = (c Optarg) + 0
                  else {
                      fcount = c + 0
                      Optind--
                  }
              } else
                  usage()
          }
 
          if (ARGV[Optind] ~ /^\+[[:digit:]]+$/) {
              charcount = substr(ARGV[Optind], 2) + 0
              Optind++
          }
 
          for (i = 1; i < Optind; i++)
              ARGV[i] = ""
 
          if (repeated_only == 0 && non_repeated_only == 0)
              repeated_only = non_repeated_only = 1
 
          if (ARGC - Optind == 2) {
              outputfile = ARGV[ARGC - 1]
              ARGV[ARGC - 1] = ""
          }
      }
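
    The `BEGIN' rule sets every `ARGV' element it has consumed to the
 null string.  `awk' skips null elements of `ARGV' when it looks for
 input files, so the options and the output file name are never opened
 as data.  The following sketch shows that effect on its own; the file
 name `/no/such/file' and the temporary file are assumptions made for
 the demonstration.

```shell
tmp=$(mktemp)
printf 'one\ntwo\n' > "$tmp"

# ARGV[1] names a file that does not exist.  Blanking it in BEGIN makes
# awk skip it and read only the real file in ARGV[2].
seen=$(awk 'BEGIN { ARGV[1] = "" } { print }' /no/such/file "$tmp")

rm -f "$tmp"
printf '%s\n' "$seen"
```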
 
    The following function, `are_equal()', compares the current line,
 `$0', to the previous line, `last'.  It handles skipping fields and
 characters.  If no field count and no character count are specified,
 `are_equal()' simply returns one or zero depending upon the result of a
 simple string comparison of `last' and `$0'.  Otherwise, things get more
 complicated.  If fields have to be skipped, each line is broken into an
 array using `split()' (see String Functions); the desired fields
 are then joined back into a line using `join()'.  The joined lines are
 stored in `clast' and `cline'.  If no fields are skipped, `clast' and
 `cline' are set to `last' and `$0', respectively.  Finally, if
 characters are skipped, `substr()' is used to strip off the leading
 `charcount' characters in `clast' and `cline'.  The two strings are
 then compared and `are_equal()' returns the result:
 
      function are_equal(    n, m, clast, cline, alast, aline)
      {
          if (fcount == 0 && charcount == 0)
              return (last == $0)
 
          if (fcount > 0) {
              n = split(last, alast)
              m = split($0, aline)
              clast = join(alast, fcount+1, n)
              cline = join(aline, fcount+1, m)
          } else {
              clast = last
              cline = $0
          }
          if (charcount) {
              clast = substr(clast, charcount + 1)
              cline = substr(cline, charcount + 1)
          }
 
          return (clast == cline)
      }
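
    To see the field-skipping comparison at work without the rest of
 the program, consider the following sketch.  The shell function
 `same_after_skip' and the `awk' function `rejoin()' are hypothetical
 names; `rejoin()' is a simplified stand-in for the `join()' library
 function, assumed here to glue array elements together with single
 spaces.

```shell
# same_after_skip FCOUNT LINE1 LINE2: print 1 if the two lines match once
# the first FCOUNT fields are ignored, and 0 otherwise.
same_after_skip() {
    awk -v fcount="$1" -v a="$2" -v b="$3" '
    # rejoin(): stand-in for join(); concatenates arr[start..stop],
    # separating the elements with single spaces.
    function rejoin(arr, start, stop,    i, s) {
        for (i = start; i <= stop; i++)
            s = s (i > start ? " " : "") arr[i]
        return s
    }
    BEGIN {
        n = split(a, fa)
        m = split(b, fb)
        print (rejoin(fa, fcount + 1, n) == rejoin(fb, fcount + 1, m))
    }'
}

skip1=$(same_after_skip 1 "10 apple pie" "20 apple pie")  # first field ignored
skip0=$(same_after_skip 0 "10 apple pie" "20 apple pie")  # whole line compared
echo "$skip1 $skip0"
```

    With one field skipped, both lines reduce to `apple pie' and compare
 equal; with nothing skipped, the differing first field makes them
 unequal.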
 
    The following two rules are the body of the program.  The first one
 is executed only for the very first line of data.  It sets `last' equal
 to `$0', so that subsequent lines of text have something to be compared
 to.
 
    The second rule does the work. The variable `equal' is one or zero,
 depending upon the results of `are_equal()''s comparison. If `uniq' is
 counting lines (`-c') and the lines are equal, then it increments the
 `count' variable.  Otherwise, it prints the previous line together with
 its count, saves the current line in `last', and resets `count', since
 the two lines are not equal.
 
    If `uniq' is not counting, and if the lines are equal, `count' is
 incremented.  Nothing is printed, since the point is to remove
 duplicates.  Otherwise, if `uniq' is printing repeated lines and more
 than one copy was seen, or if `uniq' is printing nonrepeated lines and
 only one copy was seen, then the previous line is printed.  In either
 case, `last' is set to the current line and `count' is reset.
 
    Finally, similar logic is used in the `END' rule to print the final
 line of input data:
 
      NR == 1 {
          last = $0
          next
      }
 
      {
          equal = are_equal()
 
          if (do_count) {    # overrides -d and -u
              if (equal)
                  count++
              else {
                  printf("%4d %s\n", count, last) > outputfile
                  last = $0
                  count = 1    # reset
              }
              next
          }
 
          if (equal)
              count++
          else {
              if ((repeated_only && count > 1) ||
                  (non_repeated_only && count == 1))
                      print last > outputfile
              last = $0
              count = 1
          }
      }
 
      END {
          if (NR > 0) {    # nothing to flush if no input was read
              if (do_count)
                  printf("%4d %s\n", count, last) > outputfile
              else if ((repeated_only && count > 1) ||
                      (non_repeated_only && count == 1))
                  print last > outputfile
          }
          close(outputfile)
      }
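
    The overall pattern (remember the previous line, count the run, and
 flush the last run at the end) can be condensed into a few lines.  The
 sketch below covers only the `-c' path and does no option handling or
 field skipping; the shell function name `count_runs' is an assumption
 made for the example.

```shell
# count_runs: condensed version of the -c path, without option handling.
count_runs() {
    awk '
    NR == 1 { last = $0; count = 1; next }   # prime the comparison
    $0 == last { count++; next }             # same run: just count it
    {                                        # run ended: report and reset
        printf("%4d %s\n", count, last)
        last = $0
        count = 1
    }
    END { if (NR > 0) printf("%4d %s\n", count, last) }   # flush final run
    '
}

counted=$(printf 'a\na\nb\n' | count_runs)
printf '%s\n' "$counted"
```

    Without the `END' rule, the final run (`b' in this sample) would
 never be reported, because a run is only printed when the line after
 it arrives.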
 