(gawk.info.gz) Word Sorting

 13.3.5 Generating Word-Usage Counts
 -----------------------------------
 
 The following `awk' program prints the number of occurrences of each
 word in its input.  It illustrates the associative nature of `awk'
 arrays by using strings as subscripts.  It also demonstrates the `for
 INDEX in ARRAY' mechanism.  Finally, it shows how `awk' is used in
 conjunction with other utility programs to do a useful task of some
 complexity with a minimum of effort.  Some explanations follow the
 program listing:
 
      # Print list of word frequencies
      {
          for (i = 1; i <= NF; i++)
              freq[$i]++
      }
 
      END {
          for (word in freq)
              printf "%s\t%d\n", word, freq[word]
      }
 
    This program has two rules.  The first rule, because it has an empty
 pattern, is executed for every input line.  It uses `awk''s
 field-accessing mechanism (see Fields) to pick out the individual
 words from the line, and the built-in variable `NF' (see Built-in
 Variables) to know how many fields are available.  For each input
 word, it increments an element of the array `freq' to reflect that the
 word has been seen an additional time.
 
    The second rule, because it has the pattern `END', is not executed
 until the input has been exhausted.  It prints out the contents of the
 `freq' table that has been built up inside the first action.  This
 program has several problems that would prevent it from being useful by
 itself on real text files:
 
    * Words are detected using the `awk' convention that fields are
      separated just by whitespace.  Other characters in the input
      (except newlines) don't have any special meaning to `awk'.  This
      means that punctuation characters count as part of words.
 
    * The `awk' language considers upper- and lowercase characters to be
      distinct.  Therefore, "bartender" and "Bartender" are not treated
      as the same word.  This is undesirable, since in normal text, words
      are capitalized if they begin sentences, and a frequency analyzer
      should not be sensitive to capitalization.
 
    * The output does not come out in any useful order.  You're more
      likely to be interested in which words occur most frequently or in
      having an alphabetized table of how frequently each word occurs.
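
    The first two problems are easy to reproduce.  The sketch below
 wraps the first program in a shell function and feeds it a single
 sentence.  The function name `count_words_naive' and the sample
 sentence are assumptions made for this illustration only:

```shell
# Hypothetical wrapper around the first program shown above.
count_words_naive() {
    awk '
    {
        for (i = 1; i <= NF; i++)
            freq[$i]++
    }

    END {
        for (word in freq)
            printf "%s\t%d\n", word, freq[word]
    }' "$@"
}

# Made-up sample input: the same word with different case and punctuation.
printf 'Bartender, call the bartender.\n' | count_words_naive
```

    With this input, `Bartender,' and `bartender.' are counted as two
 different words, each seen once.  The order of the four output lines
 is unspecified, because it comes from the `for (word in freq)' loop.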
 
    The way to solve these problems is to use some of `awk''s more
 advanced features.  First, we use `tolower' to remove case
 distinctions.  Next, we use `gsub' to remove punctuation characters.
 Finally, we use the system `sort' utility to process the output of the
 `awk' script.  Here is the new version of the program:
 
      # wordfreq.awk --- print list of word frequencies
 
      {
          $0 = tolower($0)    # remove case distinctions
          # remove punctuation
          gsub(/[^[:alnum:]_[:blank:]]/, "", $0)
          for (i = 1; i <= NF; i++)
              freq[$i]++
      }
 
      END {
          for (word in freq)
              printf "%s\t%d\n", word, freq[word]
      }
 
    Assuming we have saved this program in a file named `wordfreq.awk',
 and that the data is in `file1', the following pipeline:
 
      awk -f wordfreq.awk file1 | sort -k 2nr
 
 produces a table of the words appearing in `file1' in order of
 decreasing frequency.  The `awk' program suitably massages the data and
 produces a word frequency table, which is not ordered.
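
    To try the pipeline without creating any files, the program can
 also be embedded in a shell function, as in the following sketch.  The
 function name `wordfreq' and the two sample sentences are assumptions
 made for this illustration; the `awk' code is the program shown above:

```shell
# Hypothetical wrapper around wordfreq.awk; reads files or standard input.
wordfreq() {
    awk '
    {
        $0 = tolower($0)    # remove case distinctions
        # remove punctuation
        gsub(/[^[:alnum:]_[:blank:]]/, "", $0)
        for (i = 1; i <= NF; i++)
            freq[$i]++
    }

    END {
        for (word in freq)
            printf "%s\t%d\n", word, freq[word]
    }' "$@"
}

# Made-up sample text, sorted by decreasing count.
printf 'Bartender, call the bartender.\nThe bartender left.\n' |
    wordfreq | sort -k 2nr
```

    The first line of output should be `bartender' with a count of 3,
 followed by `the' with a count of 2.  Capitalization and punctuation
 no longer split the counts.  This assumes an `awk' that understands
 POSIX character classes such as `[:alnum:]'.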
 
    The `awk' script's output is then sorted by the `sort' utility and
 printed on the terminal.  The options given to `sort' specify that the
 sort key starts at the second field of each input line (skipping one
 field), that the sort keys should be treated as numeric quantities
 (otherwise `15' would come before `5'), and that the sorting should be
 done in descending (reverse) order.
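
    The effect of the numeric modifier can be checked with a small
 experiment.  In the sketch below, the counts are made up and the
 function names are hypothetical; only the `sort' options come from the
 text:

```shell
# Made-up word counts, tab-separated like the output of the awk script.
sample_counts() {
    printf 'the\t15\nawk\t5\nsort\t9\n'
}

# Descending order with the keys compared as strings.
by_string_desc() {
    sort -k 2r
}

# Descending order with the keys compared as numbers.
by_number_desc() {
    sort -k 2nr
}

sample_counts | by_string_desc
echo "--"
sample_counts | by_number_desc
```

    Compared as strings, `15' is listed last, below `9' and `5'.
 Compared as numbers, it is listed first.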
 
    The `sort' could even be done from within the program, by changing
 the `END' action to:
 
      END {
          sort = "sort -k 2nr"
          for (word in freq)
              printf "%s\t%d\n", word, freq[word] | sort
          close(sort)
      }
 
    This way of sorting must be used on systems that do not have true
 pipes at the command-line (or batch-file) level.  See the general
 operating system documentation for more information on how to use the
 `sort' program.
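
    Put together, the self-sorting variant might look like the
 following sketch.  The function name `wordfreq_sorted' and the sample
 input are assumptions made for this illustration; the `awk' code
 combines the main rule and the modified `END' action shown earlier:

```shell
# Hypothetical wrapper: counts words and sorts the table from within awk.
wordfreq_sorted() {
    awk '
    {
        $0 = tolower($0)    # remove case distinctions
        # remove punctuation
        gsub(/[^[:alnum:]_[:blank:]]/, "", $0)
        for (i = 1; i <= NF; i++)
            freq[$i]++
    }

    END {
        sort = "sort -k 2nr"
        for (word in freq)
            printf "%s\t%d\n", word, freq[word] | sort
        close(sort)    # wait for sort to finish writing its output
    }' "$@"
}

# Made-up sample input; no external pipeline is needed.
printf 'to be or not to be\n' | wordfreq_sorted
```

    Here `to' and `be' each appear with a count of 2, ahead of `or'
 and `not' with a count of 1.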
 