(gawk.info.gz) Uniq Program
13.2.6 Printing Nonduplicated Lines of Text
-------------------------------------------
The `uniq' utility reads sorted lines of data on its standard input,
and by default removes duplicate lines. In other words, it only prints
unique lines--hence the name. `uniq' has a number of options. The
usage is as follows:
     uniq [-udc [-N]] [+N] [ INPUT FILE [ OUTPUT FILE ]]
The options for `uniq' are:
`-d'
     Print only repeated lines.
`-u'
     Print only nonrepeated lines.
`-c'
     Count lines. This option overrides `-d' and `-u'. Both repeated
     and nonrepeated lines are counted.
`-N'
     Skip N fields before comparing lines. The definition of fields is
     similar to `awk''s default: nonwhitespace characters separated by
     runs of spaces and/or TABs.
`+N'
     Skip N characters before comparing lines. Any fields specified
     with `-N' are skipped first.
`INPUT FILE'
     Data is read from the input file named on the command line,
     instead of from the standard input.
`OUTPUT FILE'
     The generated output is sent to the named output file, instead of
     to the standard output.
Normally `uniq' behaves as if both the `-d' and `-u' options are
provided.
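
As a rough illustration of how `-d', `-u', and the default relate,
here is a simplified sketch that is separate from the program developed
in this section. The shell function `select_lines' and its `mode'
argument are hypothetical names invented for this example. The sketch
compares whole lines only and ignores `-c', `-N', and `+N'. The
`NR > 0' guard and the `""' concatenation (which forces string
comparison) are additions of the sketch:

```shell
# Hypothetical helper: $1 is "d", "u", or empty (the default, meaning both).
select_lines() {
    awk -v mode="${1-}" '
    function emit() {
        # -d wants groups seen more than once, -u wants groups seen once.
        if ((repeated && count > 1) || (nonrepeated && count == 1))
            print last
    }
    BEGIN {
        repeated = (mode == "d")
        nonrepeated = (mode == "u")
        if (! repeated && ! nonrepeated)
            repeated = nonrepeated = 1      # default: behave as if both were given
    }
    NR == 1    { last = $0 ""; count = 1; next }   # "" makes last a plain string
    $0 == last { count++; next }
               { emit(); last = $0 ""; count = 1 }
    END        { if (NR > 0) emit() }'
}

printf 'a\na\nb\n' | select_lines d    # prints: a
printf 'a\na\nb\n' | select_lines u    # prints: b
printf 'a\na\nb\n' | select_lines      # prints: a, then b
```

With both flags set, every group of adjacent equal lines is printed
exactly once, which is the familiar default behavior.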
`uniq' uses the `getopt()' library function (see Getopt Function)
and the `join()' library function (see Join Function).
The program begins with a `usage()' function and then a brief
outline of the options and their meanings in comments. The `BEGIN'
rule deals with the command-line arguments and options. It uses a trick
to get `getopt()' to handle options of the form `-25', treating such an
option as the option letter `2' with an argument of `5'. If indeed two
or more digits are supplied (`Optarg' looks like a number), `Optarg' is
concatenated with the option digit and then the result is added to zero
to make it into a number. If there is only one digit in the option,
then `Optarg' is not needed. In this case, `Optind' must be decremented
so that `getopt()' processes it next time. This code is admittedly a
bit tricky.
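
The digit-handling trick can be tried in isolation. The following is
only a sketch: the shell function `field_count' is a hypothetical name,
and its two arguments stand in for the option digit returned by
`getopt()' and for the value left in `Optarg'. It does not call
`getopt()' and so does not show the `Optind' adjustment. It uses
`[0-9]' instead of `[[:digit:]]' so that it also runs on awk
implementations without POSIX character classes:

```shell
# Hypothetical helper: $1 plays the role of the option digit,
# $2 the role of Optarg.
field_count() {
    awk -v c="$1" -v Optarg="$2" 'BEGIN {
        if (Optarg ~ /^[0-9]+$/)
            fcount = (c Optarg) + 0    # concatenate the digits, then force a number
        else
            fcount = c + 0             # Optarg was really the next argument
        print fcount
    }'
}

field_count 2 5           # as for -25: prints 25
field_count 2 data.txt    # as for -2 data.txt: prints 2
```

The parentheses around `c Optarg' matter: concatenation must happen
before the addition that converts the result to a number.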
If no options are supplied, then the default is taken, to print both
repeated and nonrepeated lines. The output file, if provided, is
assigned to `outputfile'. Early on, `outputfile' is initialized to the
standard output, `/dev/stdout':

     # uniq.awk --- do uniq in awk
     #
     # Requires getopt() and join() library functions

     function usage(    e)
     {
         e = "Usage: uniq [-udc [-n]] [+n] [ in [ out ]]"
         print e > "/dev/stderr"
         exit 1
     }

     # -c    count lines. overrides -d and -u
     # -d    only repeated lines
     # -u    only nonrepeated lines
     # -n    skip n fields
     # +n    skip n characters, skip fields first

     BEGIN   \
     {
         count = 1
         outputfile = "/dev/stdout"
         opts = "udc0:1:2:3:4:5:6:7:8:9:"
         while ((c = getopt(ARGC, ARGV, opts)) != -1) {
             if (c == "u")
                 non_repeated_only++
             else if (c == "d")
                 repeated_only++
             else if (c == "c")
                 do_count++
             else if (index("0123456789", c) != 0) {
                 # getopt requires args to options
                 # this messes us up for things like -5
                 if (Optarg ~ /^[[:digit:]]+$/)
                     fcount = (c Optarg) + 0
                 else {
                     fcount = c + 0
                     Optind--
                 }
             } else
                 usage()
         }

         if (ARGV[Optind] ~ /^\+[[:digit:]]+$/) {
             charcount = substr(ARGV[Optind], 2) + 0
             Optind++
         }

         for (i = 1; i < Optind; i++)
             ARGV[i] = ""

         if (repeated_only == 0 && non_repeated_only == 0)
             repeated_only = non_repeated_only = 1

         if (ARGC - Optind == 2) {
             outputfile = ARGV[ARGC - 1]
             ARGV[ARGC - 1] = ""
         }
     }

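
The handling of a `+N' argument can also be tried on its own. This
sketch assumes the argument arrives as `ARGV[1]' (the real program
looks at `ARGV[Optind]' after `getopt()' has finished); the shell
function `char_count' is a hypothetical name used only here, and
`[0-9]' is used in place of `[[:digit:]]' for portability:

```shell
# Hypothetical helper: print the character count parsed from a +N argument.
char_count() {
    awk 'BEGIN {
        charcount = 0
        if (ARGV[1] ~ /^\+[0-9]+$/) {
            charcount = substr(ARGV[1], 2) + 0   # drop the "+", force a number
            ARGV[1] = ""                         # so awk will not open "+N" as a file
        }
        print charcount
    }' "$@"
}

char_count +3          # prints 3
char_count data.txt    # prints 0
```

Setting an element of `ARGV' to the null string makes `awk' skip it
when it later scans the command line for input files, which is why
the program clears every argument that it has already consumed.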
The following function, `are_equal()', compares the current line,
`$0', to the previous line, `last'. It handles skipping fields and
characters. If no field count and no character count are specified,
`are_equal()' simply returns one or zero depending upon the result of a
simple string comparison of `last' and `$0'. Otherwise, things get more
complicated. If fields have to be skipped, each line is broken into an
array using `split()' (see String Functions); the desired fields
are then joined back into a line using `join()'. The joined lines are
stored in `clast' and `cline'. If no fields are skipped, `clast' and
`cline' are set to `last' and `$0', respectively. Finally, if
characters are skipped, `substr()' is used to strip off the leading
`charcount' characters in `clast' and `cline'. The two strings are
then compared and `are_equal()' returns the result:

     function are_equal(    n, m, clast, cline, alast, aline)
     {
         if (fcount == 0 && charcount == 0)
             return (last == $0)

         if (fcount > 0) {
             n = split(last, alast)
             m = split($0, aline)
             clast = join(alast, fcount+1, n)
             cline = join(aline, fcount+1, m)
         } else {
             clast = last
             cline = $0
         }
         if (charcount) {
             clast = substr(clast, charcount + 1)
             cline = substr(cline, charcount + 1)
         }

         return (clast == cline)
     }

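
To see what text actually takes part in the comparison, the skipping
steps can be applied to a single line. The following is a sketch
rather than part of the program: the shell function `cmp_key' and the
awk function `key()' are hypothetical names, and a small loop stands
in for the `join()' library function, rejoining the fields with single
spaces:

```shell
# Hypothetical helper: print the part of each input line that would be
# compared after skipping $1 fields and then $2 characters.
cmp_key() {
    awk -v fcount="$1" -v charcount="$2" '
    function key(line,    n, i, a, out) {
        if (fcount > 0) {
            n = split(line, a)
            out = ""
            # Stand-in for join(a, fcount+1, n).
            for (i = fcount + 1; i <= n; i++)
                out = (i == fcount + 1) ? a[i] : out " " a[i]
        } else
            out = line
        if (charcount > 0)
            out = substr(out, charcount + 1)
        return out
    }
    { print key($0) }'
}

printf '10 xxapple pie\n' | cmp_key 1 2    # prints: apple pie
printf '10 xxapple pie\n' | cmp_key 0 3    # prints: xxapple pie
```

Because the fields are split and rejoined, runs of whitespace between
the remaining fields collapse to one space before any characters are
skipped.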
The following two rules are the body of the program. The first one
is executed only for the very first line of data. It sets `last' equal
to `$0', so that subsequent lines of text have something to be compared
to.
The second rule does the work. The variable `equal' is one or zero,
depending upon the results of `are_equal()''s comparison. If `uniq' is
counting lines (`-c'), and the lines are equal, then it increments
the `count' variable. Otherwise, it prints the previous line, preceded
by its count, and resets `count', since the two lines are not equal.
If `uniq' is not counting, and if the lines are equal, `count' is
incremented. Nothing is printed, since the point is to remove
duplicates. Otherwise, if `uniq' is printing repeated lines and more
than one copy was seen, or if `uniq' is printing nonrepeated lines and
only one copy was seen, then the previous line is printed, and `count' is reset.
Finally, similar logic is used in the `END' rule to print the final
line of input data:

     NR == 1 {
         last = $0
         next
     }

     {
         equal = are_equal()

         if (do_count) {    # overrides -d and -u
             if (equal)
                 count++
             else {
                 printf("%4d %s\n", count, last) > outputfile
                 last = $0
                 count = 1    # reset
             }
             next
         }

         if (equal)
             count++
         else {
             if ((repeated_only && count > 1) ||
                 (non_repeated_only && count == 1))
                     print last > outputfile
             last = $0
             count = 1
         }
     }

     END {
         if (do_count)
             printf("%4d %s\n", count, last) > outputfile
         else if ((repeated_only && count > 1) ||
                 (non_repeated_only && count == 1))
             print last > outputfile
         close(outputfile)
     }
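
The counting path can be traced with a stripped-down sketch. It is not
the program above: it handles only the `-c' case, compares whole lines,
and writes to the standard output. The shell function `count_lines' is
a hypothetical name, and the `NR > 0' guard and the `""' concatenation
(which forces string comparison) are additions of the sketch:

```shell
# Hypothetical helper: a reduced version of the -c logic only.
count_lines() {
    awk '
    NR == 1    { last = $0 ""; count = 1; next }   # first line: nothing to compare yet
    $0 == last { count++; next }                   # same as previous: just count it
               { printf("%4d %s\n", count, last)   # group ended: report it
                 last = $0 ""; count = 1 }
    END        { if (NR > 0) printf("%4d %s\n", count, last) }'
}

printf 'a\na\nb\n' | count_lines
```

For the three input lines `a', `a', and `b', this prints `   2 a'
followed by `   1 b'. The last group is reported by the `END' rule,
since no later line arrives to trigger the print in the main rule.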