Academic Computing Services * Simon Fraser University

HOW TO

Count, Sort and Compare Unix Files

© March 21, 1994 B-18



Unix has several utilities to help count, sort and compare files. To illustrate these utilities, assume we have two files, phonesA and phonesB. Each contains a list of telephone locals and people's names:
phonesA                  phonesB
4678 George Smith        4221 Susan Wilson
2870 Bill Anderson       3895 Jan Smythe
3717 Joan Brown          3664 John Lee
4221 Susan Wilson

Counting characters, words and lines

The command wc will count the number of lines, words, and characters in a file. To count the file phonesA, type

wc phonesA

The result is

         4        12        75 phonesA

The first number is the number of lines in the file, the second is the number of words and the third is the number of characters.

Use wc to count multiple files by giving several filenames as arguments to the command. For example, to count both phonesA and phonesB, type

wc phonesA phonesB

The result is

         4        12        75 phonesA
         3         9        51 phonesB
         7        21       126 total

wc counts lines, words and characters in each file, and totals each of these three items for the two files.

Options for wc limit its output so that only the number of lines, the number of words or the number of characters is displayed, instead of all three:
-l counts only the number of lines
-w counts only the number of words
-c counts only the number of characters

You may combine two of the options for wc. For example, to count the number of words and the number of characters in phonesB, you would type

wc -wc phonesB
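
When a count is needed inside a script rather than on the screen, one common approach is to let wc read the file from standard input, so that only the number is printed and no filename follows it. The sketch below recreates the two sample files in a scratch directory; the use of mktemp -d and the variable names are assumptions made for illustration.

```shell
# Work in a scratch directory so no existing files are overwritten.
cd "$(mktemp -d)" || exit 1
printf '%s\n' '4678 George Smith' '2870 Bill Anderson' \
              '3717 Joan Brown' '4221 Susan Wilson' > phonesA
printf '%s\n' '4221 Susan Wilson' '3895 Jan Smythe' '3664 John Lee' > phonesB

# With input redirected (<), wc has no filename to print, only the count.
lines_in_A=$(wc -l < phonesA)
words_in_B=$(wc -w < phonesB)

echo "phonesA has $lines_in_A lines"
echo "phonesB has $words_in_B words"
```

Some versions of wc pad the number with leading blanks, so the spacing of the echoed text may vary from system to system.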

Sorting and merging files

The sort command will sort and merge files and display the output on the screen. Using redirection (>) we can save the sorted and merged output in a file phonesC (see how-to B-7, Examine Unix Files and Redirect Output, for more on redirection):

sort phonesA phonesB > phonesC

If we then type

more phonesC

we will see that file phonesC contains:

2870 Bill Anderson
3664 John Lee
3717 Joan Brown
3895 Jan Smythe
4221 Susan Wilson
4221 Susan Wilson
4678 George Smith

Notice that the entry for Wilson is repeated, because it was originally in both phonesA and phonesB. We can avoid having lines repeated by using the -u option with the sort command, for example

sort -u phonesA phonesB > phonesC
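
To see how many lines -u removes, the output of sort can be piped into wc instead of being saved in a file. This is a minimal sketch that recreates the sample files in a scratch directory (mktemp -d and the variable names are assumptions, not part of the original example).

```shell
# Work in a scratch directory so no existing files are overwritten.
cd "$(mktemp -d)" || exit 1
printf '%s\n' '4678 George Smith' '2870 Bill Anderson' \
              '3717 Joan Brown' '4221 Susan Wilson' > phonesA
printf '%s\n' '4221 Susan Wilson' '3895 Jan Smythe' '3664 John Lee' > phonesB

# Count the merged lines with and without duplicate removal.
total=$(sort phonesA phonesB | wc -l)
unique=$(sort -u phonesA phonesB | wc -l)

echo "merged lines: $total"
echo "after -u:     $unique"
```

The two counts differ by one here, because only the entry for Wilson appears in both files.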

By default, sort compares whole lines character by character, starting with the first character in each line. Because each line begins with the telephone local, phonesC is sorted by telephone number. In order to obtain a file sorted alphabetically by the person's last name, we must identify fields for the sort command to sort on. By default, sort considers a field to be any string separated by blanks. Thus our telephone files consist of 3 fields: telephone local, first name, and last name.

We want to have the merged file sorted alphabetically by the third field. To do this we give the sort command a flag consisting of a plus sign and a number telling it how many fields to skip before it begins sorting. We'll also include the -u option to avoid repeated lines. Thus

sort -u +2 phonesA phonesB > phonesD

produces a merged file phonesD containing all the phone listings sorted alphabetically by the last name, with no repetitions:

2870 Bill Anderson
3717 Joan Brown
3664 John Lee
4678 George Smith
3895 Jan Smythe
4221 Susan Wilson
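
The +2 notation is the historical way to name a sort field, and some current versions of sort may reject it. The POSIX form is -k followed by the number of the field to sort on, so "skip 2 fields" becomes -k 3. The sketch below assumes a sort that supports -k; it recreates the sample files in a scratch directory and sets LC_ALL=C so the ordering does not depend on local language settings (both choices are assumptions made for illustration).

```shell
# Work in a scratch directory so no existing files are overwritten.
cd "$(mktemp -d)" || exit 1
printf '%s\n' '4678 George Smith' '2870 Bill Anderson' \
              '3717 Joan Brown' '4221 Susan Wilson' > phonesA
printf '%s\n' '4221 Susan Wilson' '3895 Jan Smythe' '3664 John Lee' > phonesB

# -k 3 sorts on everything from the third field (the last name) to the end of the line.
# With -u, lines whose sort keys are equal are treated as duplicates.
LC_ALL=C sort -u -k 3 phonesA phonesB > phonesD
cat phonesD
```

With these sample files the result should match the phonesD listing above.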

sort has several useful options:
-f ignore the difference between upper and lowercase when sorting. (Otherwise, uppercase words will appear first, followed by lowercase.)
-n sort numbers by their arithmetic value. (Otherwise, numbers are sorted by their first digit, e.g. 1, 11, 110, 2, 23)
-u remove duplicated lines
-r sort in reverse order
-tx use x as the field separator instead of blanks. If x is the tab symbol, press the TAB key.
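
The effect of these options is easiest to see on a small test file. The sketch below is illustrative only: the files numbers and locals, and the colon-separated layout of locals, are hypothetical and not part of the phone list example. It uses the POSIX -k form to name the field, which assumes a sort that supports it.

```shell
# Work in a scratch directory so no existing files are overwritten.
cd "$(mktemp -d)" || exit 1
printf '%s\n' 23 110 2 11 1 > numbers

# Default: compared character by character, so 110 sorts before 2.
plain=$(LC_ALL=C sort numbers | tr '\n' ' ')
# -n: compared by arithmetic value.
numeric=$(sort -n numbers | tr '\n' ' ')
# -r combined with -n: largest value first.
reversed=$(sort -rn numbers | tr '\n' ' ')

echo "default: $plain"      # default: 1 11 110 2 23
echo "-n:      $numeric"    # -n:      1 2 11 23 110
echo "-rn:     $reversed"   # -rn:     110 23 11 2 1

# -t: use a colon as the field separator and sort on the second field.
printf '%s\n' '4678:Smith' '2870:Anderson' '3717:Brown' > locals
LC_ALL=C sort -t: -k 2 locals
```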

Comparing the contents of files

Four Unix utilities, cmp, comm and diff/sdiff, can be used to compare the contents of two files.

Using cmp

The utility cmp compares two files and finds the first disagreement between them. When comparing our phone lists

cmp phonesA phonesB

the result would be

phonesA phonesB differ:  char 2, line 1

cmp reports that the first difference occurs at the second character in the first line.
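
In a shell script it is often enough to know whether two files differ, without seeing where. One way to do this is the -s option, which makes cmp print nothing and report the result only through its exit status (0 if the files are identical, 1 if they differ). The sketch below recreates the sample files in a scratch directory; mktemp -d and the variable names are assumptions.

```shell
# Work in a scratch directory so no existing files are overwritten.
cd "$(mktemp -d)" || exit 1
printf '%s\n' '4678 George Smith' '2870 Bill Anderson' \
              '3717 Joan Brown' '4221 Susan Wilson' > phonesA
printf '%s\n' '4221 Susan Wilson' '3895 Jan Smythe' '3664 John Lee' > phonesB

# -s suppresses all output; only the exit status is used.
if cmp -s phonesA phonesB; then
    echo "phonesA and phonesB are identical"
else
    echo "phonesA and phonesB differ"
fi

# Save the exit status for later use: 1 for differing files, 0 for identical ones.
cmp -s phonesA phonesB; differ_status=$?
cmp -s phonesA phonesA; same_status=$?
echo "$differ_status $same_status"    # 1 0
```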

Using comm

The utility comm produces more information, but only works on files that have been sorted. If we first sort the two phone lists:

sort phonesA > sortedphonesA
sort phonesB > sortedphonesB

then we can compare them using comm.

comm sortedphonesA sortedphonesB

The output from comm is split into 3 columns. The first column lists all lines that are in sortedphonesA but not in sortedphonesB. The second column lists all lines that are in sortedphonesB but not in sortedphonesA, and the third column lists all lines that are found in both sortedphonesA and sortedphonesB.

Use numeric options with comm to affect the output:
-1 does not display the first column
-2 does not display the second column
-3 does not display the third column
-12 does not display the first or second column (that is, only lines found in both files are displayed)
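
The numeric options can be combined in other ways as well. As a sketch, the commands below pick out each of the three columns in turn for the two phone lists. The sample files are recreated in a scratch directory, and LC_ALL=C is exported so that sort and comm agree on the ordering (both are assumptions made for illustration, as are the variable names).

```shell
# Work in a scratch directory so no existing files are overwritten.
cd "$(mktemp -d)" || exit 1
export LC_ALL=C
printf '%s\n' '4678 George Smith' '2870 Bill Anderson' \
              '3717 Joan Brown' '4221 Susan Wilson' > phonesA
printf '%s\n' '4221 Susan Wilson' '3895 Jan Smythe' '3664 John Lee' > phonesB

sort phonesA > sortedphonesA
sort phonesB > sortedphonesB

# Hide two columns at a time to keep just the third.
in_both=$(comm -12 sortedphonesA sortedphonesB)     # lines found in both files
only_in_A=$(comm -23 sortedphonesA sortedphonesB)   # lines only in sortedphonesA
only_in_B=$(comm -13 sortedphonesA sortedphonesB)   # lines only in sortedphonesB

echo "In both:"; echo "$in_both"
echo "Only in A:"; echo "$only_in_A"
echo "Only in B:"; echo "$only_in_B"
```

Here the only line found in both files is the entry for Susan Wilson.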

Using diff and sdiff

diff compares two files and reports what changes need to be made to the first file in order to make it identical to the second. Using diff to compare the two phone lists:

diff phonesA phonesB

we would see

1,3d0
< 4678 George Smith
< 2870 Bill Anderson
< 3717 Joan Brown
4a2,3
> 3895 Jan Smythe
> 3664 John Lee

This output from diff is interpreted as follows:

1,3d0 means that lines 1-3 should be deleted from phonesA to make it like phonesB. This is followed by the text of the 3 lines in phonesA that should be deleted, each marked with a <.

4a2,3 means that lines 2 and 3 from phonesB should be added to phonesA after line 4. This is followed by the text of the 2 lines of phonesB that should be added, each marked with a >.

If lines in the first file must be replaced by different lines from the second file to make the two identical, diff reports the difference with c (for change), rather than a (for addition) or d (for deletion).

If the two files are identical, diff produces no output (not even a confirmation that they are identical). When comparing binary files (e.g., executable programs), diff does not report the differences but merely confirms that the files differ.
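
The c notation and the silent result for identical files can both be seen with a small experiment. In this sketch the files old and new are hypothetical and differ only in the telephone local on their first line. Because diff prints nothing for identical files, a script would normally test its exit status instead: 0 means no differences and 1 means differences were found.

```shell
# Work in a scratch directory so no existing files are overwritten.
cd "$(mktemp -d)" || exit 1
printf '%s\n' '3717 Joan Brown' '4221 Susan Wilson' > old
printf '%s\n' '3718 Joan Brown' '4221 Susan Wilson' > new

# Line 1 of old must be changed to match line 1 of new.
changes=$(diff old new)
changed_status=$?
echo "$changes"
# 1c1
# < 3717 Joan Brown
# ---
# > 3718 Joan Brown

# Comparing a file with itself produces no output and exit status 0.
diff old old
identical_status=$?
echo "$changed_status $identical_status"    # 1 0
```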

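sdiff, a utility based on diff, will assist you in merging, line by line, the contents of two differing files into a third (output) file, which you name with the -o option. sdiff pauses as it encounters each difference and waits for you to choose how to resolve it, for example by keeping the line from the first file or the line from the second. The sdiff man page lists the available responses.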

bc, a Unix desktop calculator

The utility bc can be used as a desktop calculator. To illustrate the use of bc, here is a brief session: each expression typed is followed on the next line by bc's response, with notes in square brackets:

bc
6+4*2-2^3 [^ is the exponent operator]
6 [*, / and ^ are evaluated before + and -]
2^3-5/(4-1) [() expressions are evaluated first]
7 [use scale (below) for decimal points]
scale=3; 2^3-5/(4-1) [combine statements using ;]
6.334
sqrt(3.4) [other math functions are available]
1.843

To quit bc, type Control-d.
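
bc can also be used without an interactive session, by sending it an expression on standard input. This sketch repeats the calculations shown above that way; the variable names are assumptions made for illustration, and it presumes bc is installed.

```shell
# bc reads each expression from the pipe, prints the result, and exits at end of input.
simple=$(echo '6+4*2-2^3' | bc)
truncated=$(echo '2^3-5/(4-1)' | bc)
precise=$(echo 'scale=3; 2^3-5/(4-1)' | bc)
root=$(echo 'scale=3; sqrt(3.4)' | bc)

echo "$simple $truncated $precise $root"    # 6 7 6.334 1.843
```

Because bc stops by itself when its input ends, Control-d is not needed in this form.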

For more information

For more information about the utilities described in this how-to, see the on-line manual (man pages) for them. For example, to view the man page for sdiff, type

man sdiff

Man pages are described in how-to B-10, Use Unix Online Documentation.


* * * * * * * * * * * * * * *

This page written and maintained by Academic Computing Services, Simon Fraser University.
Please e-mail questions or comments to help@sfu.ca.