| Count, Sort and Compare Unix Files | | |
|---|---|---|
| © March 21, 1994 | B-18 | |
Unix has several utilities to help count, sort and compare files. To illustrate these utilities, assume we have two files, phonesA and phonesB. Each contains a list of telephone locals and people's names:
| phonesA | phonesB |
|---|---|
| 4678 George Smith | 4221 Susan Wilson |
| 2870 Bill Anderson | 3895 Jan Smythe |
| 3717 Joan Brown | 3664 John Lee |
| 4221 Susan Wilson | |
The command wc will count the number of lines, words, and characters in a file. To count the file phonesA, type
wc phonesA
The result is
4 12 75 phonesA
The first number is the number of lines in the file, the second is the number of words and the third is the number of characters.
Use wc to count multiple files by giving several filenames as arguments to the command. For example, to count both phonesA and phonesB, type
wc phonesA phonesB
The result is
4 12 75 phonesA
3 9 51 phonesB
7 21 126 total
wc counts lines, words and characters in each file, and totals each of these three items for the two files.
Options for wc limit its output so that only the number of lines, the number of words or the number of characters is displayed, instead of all three:
| -l | counts only the number of lines |
| -w | counts only the number of words |
| -c | counts only the number of characters (on current systems, bytes) |
You may combine two of the options for wc. For example, to count the number of words and the number of characters in phonesB, you would type
wc -wc phonesB
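As a minimal sketch of these options, the script below recreates the two sample files in a scratch directory and counts them. The scratch directory, the variable names, and the use of input redirection (<) are assumptions of this example and not part of the how-to. Reading the file from standard input makes wc print only the number, without the filename.

```shell
#!/bin/sh
# Recreate the sample files in a scratch directory (illustrative setup).
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf '%s\n' '4678 George Smith' '2870 Bill Anderson' \
    '3717 Joan Brown' '4221 Susan Wilson' > phonesA
printf '%s\n' '4221 Susan Wilson' '3895 Jan Smythe' '3664 John Lee' > phonesB

# Count lines and words separately; with input redirection,
# wc prints the count alone because it never sees a filename.
lines_a=$(wc -l < phonesA)
words_a=$(wc -w < phonesA)

# Count the lines of both files together by concatenating them first.
lines_total=$(cat phonesA phonesB | wc -l)

echo "phonesA: $lines_a lines, $words_a words"
echo "both files: $lines_total lines"
```

Capturing the count in a variable this way is convenient when a later command needs the number itself rather than the formatted report.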
The sort command will sort and merge files and display the output on the screen. Using redirection (>) we can save the sorted and merged output in a file phonesC (see how-to B-7, Examine Unix Files and Redirect Output, for more on redirection):
sort phonesA phonesB > phonesC
If we then type
more phonesC
we will see that file phonesC contains:
2870 Bill Anderson
3664 John Lee
3717 Joan Brown
3895 Jan Smythe
4221 Susan Wilson
4221 Susan Wilson
4678 George Smith
Notice that the entry for Wilson is repeated, because it was originally in both phonesA and phonesB. We can avoid having lines repeated by using the -u option with the sort command, for example
sort -u phonesA phonesB > phonesC
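The effect of -u can be checked by counting the lines of the merged file with and without it. The following is a sketch under the assumption that the sample files hold exactly the entries listed at the top of this how-to; the output filenames merged and merged_unique are hypothetical.

```shell
#!/bin/sh
# Recreate the sample files in a scratch directory (illustrative setup).
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf '%s\n' '4678 George Smith' '2870 Bill Anderson' \
    '3717 Joan Brown' '4221 Susan Wilson' > phonesA
printf '%s\n' '4221 Susan Wilson' '3895 Jan Smythe' '3664 John Lee' > phonesB

# A plain merge keeps the Wilson entry twice; -u keeps it once.
sort phonesA phonesB > merged
sort -u phonesA phonesB > merged_unique

merged_count=$(wc -l < merged)
unique_count=$(wc -l < merged_unique)
echo "merged: $merged_count lines, without repeats: $unique_count lines"
```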
By default, sort compares whole lines, starting from the first character in each line. Thus phonesC is sorted by the telephone number. In order to obtain a file sorted alphabetically by the person's last name, we must identify fields for the sort command to sort on. By default, sort considers a field to be any string separated by blanks. Thus our telephone files consist of 3 fields: telephone local, first name, and last name.
We want to have the merged file sorted alphabetically by the third field. To do this we give the sort command a flag consisting of a plus sign and a number telling it how many fields to skip before it begins sorting. (Current POSIX versions of sort have dropped this form in favor of -k, which names the field where the key starts, so +2 is written -k 3 there.) We'll also include the -u option to avoid repeated lines. Thus
sort -u +2 phonesA phonesB > phonesD
produces a merged file phonesD containing all the phone listings sorted alphabetically by the last name, with no repetitions:
2870 Bill Anderson
3717 Joan Brown
3664 John Lee
4678 George Smith
3895 Jan Smythe
4221 Susan Wilson
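The sketch below repeats the last-name sort using the -k 3 form of the key (start at the third field), an assumption made so that it runs on current POSIX systems, where the +2 form may not be accepted. Setting LC_ALL=C is a further assumption, made so that the comparison is plain byte order regardless of the local language settings.

```shell
#!/bin/sh
# Recreate the sample files in a scratch directory (illustrative setup).
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf '%s\n' '4678 George Smith' '2870 Bill Anderson' \
    '3717 Joan Brown' '4221 Susan Wilson' > phonesA
printf '%s\n' '4221 Susan Wilson' '3895 Jan Smythe' '3664 John Lee' > phonesB

# -k 3 starts the sort key at the third field (the last name);
# -u then keeps one line for each distinct key.
LC_ALL=C sort -u -k 3 phonesA phonesB > phonesD
cat phonesD
```

Note that when a key is given, -u compares only the key, so two different people with the same last name would be reduced to a single line.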
sort has several useful options:
| -f | ignore the difference between upper and lowercase when sorting. (Otherwise, uppercase words will appear first, followed by lowercase.) |
| -n | sort numbers by their arithmetic value. (Otherwise, numbers are compared character by character, like words, e.g. 1, 11, 110, 2, 23) |
| -u | remove duplicated lines |
| -r | sort in reverse order |
| -tx | use x as the field separator instead of blanks. If x is the tab symbol, press the TAB key. |
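A short sketch of three of these options follows, using the sample numbers from the -n description. The colon-separated records, the -k 2 key (the POSIX way to start sorting at the second field), and the use of paste to join the output onto one line are all assumptions of this example.

```shell
#!/bin/sh
# Sample numbers taken from the -n description in the table.
numbers=$(printf '%s\n' 23 110 2 11 1)

# Default: compared character by character, like words.
by_text=$(printf '%s\n' "$numbers" | LC_ALL=C sort | paste -s -d ' ' -)

# -n: compared by arithmetic value.
by_value=$(printf '%s\n' "$numbers" | sort -n | paste -s -d ' ' -)

# -n with -r: arithmetic value, largest first.
by_value_desc=$(printf '%s\n' "$numbers" | sort -n -r | paste -s -d ' ' -)

# -t: treat the colon as the field separator and sort on field 2.
by_field=$(printf '%s\n' 'b:3' 'a:1' 'c:2' | sort -t : -k 2 | paste -s -d ' ' -)

echo "default:    $by_text"
echo "-n:         $by_value"
echo "-n -r:      $by_value_desc"
echo "-t : -k 2:  $by_field"
```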
Four Unix utilities, cmp, comm and diff/sdiff, can be used to compare the contents of two files.
The utility cmp compares two files and finds the first disagreement between them. When comparing our phone lists
cmp phonesA phonesB
the result would be
phonesA phonesB differ: char 2, line 1
cmp reports that the first difference occurs at the second character in the first line.
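In a script, it is often enough to know whether two files differ at all. The sketch below relies on two things not covered above: the -s option, which makes cmp silent, and its exit status, which is zero when the files are identical. The file copyA and the variable names are hypothetical.

```shell
#!/bin/sh
# Recreate the sample files in a scratch directory (illustrative setup).
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf '%s\n' '4678 George Smith' '2870 Bill Anderson' \
    '3717 Joan Brown' '4221 Susan Wilson' > phonesA
printf '%s\n' '4221 Susan Wilson' '3895 Jan Smythe' '3664 John Lee' > phonesB

# cmp -s prints nothing; the if statement tests its exit status.
if cmp -s phonesA phonesB; then
    result="identical"
else
    result="different"
fi
echo "phonesA and phonesB are $result"

# An exact copy compares as identical.
cp phonesA copyA
if cmp -s phonesA copyA; then
    copy_result="identical"
else
    copy_result="different"
fi
echo "phonesA and copyA are $copy_result"
```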
The utility comm produces more information, but only works on files which have been sorted. If we first sort the two phone lists:
sort phonesA > sortedphonesA
sort phonesB > sortedphonesB
then we can compare them using comm.
comm sortedphonesA sortedphonesB
The output from comm is split into 3 columns. The first column lists all lines that are in sortedphonesA but not in sortedphonesB. The second column lists all lines that are in sortedphonesB but not in sortedphonesA, and the third column lists all lines that are found in both sortedphonesA and sortedphonesB.
Use numeric options with comm to affect the output:
| -1 | does not display the first column |
| -2 | does not display the second column |
| -3 | does not display the third column |
| -12 | does not display the first or second column (that is, only lines found in both files are displayed) |
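The column options can be combined in the same way to isolate any one of the three columns. The following sketch applies them to the sample phone lists; the variable names are hypothetical, and LC_ALL=C is set as an assumption so that sort and comm agree on the ordering.

```shell
#!/bin/sh
# Use one collating order for both sort and comm.
LC_ALL=C
export LC_ALL

# Recreate the sample files in a scratch directory (illustrative setup).
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf '%s\n' '4678 George Smith' '2870 Bill Anderson' \
    '3717 Joan Brown' '4221 Susan Wilson' > phonesA
printf '%s\n' '4221 Susan Wilson' '3895 Jan Smythe' '3664 John Lee' > phonesB

sort phonesA > sortedphonesA
sort phonesB > sortedphonesB

# Suppress two columns at a time to keep only the third.
in_both=$(comm -12 sortedphonesA sortedphonesB)
only_in_a=$(comm -23 sortedphonesA sortedphonesB)
only_in_b=$(comm -13 sortedphonesA sortedphonesB)

echo "In both lists:";      echo "$in_both"
echo "Only in phonesA:";    echo "$only_in_a"
echo "Only in phonesB:";    echo "$only_in_b"
```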
diff compares two files and reports what changes need to be made to the first file in order to make it identical to the second. Using diff to compare the two phone lists:
diff phonesA phonesB
we would see
1,3d0
< 4678 George Smith
< 2870 Bill Anderson
< 3717 Joan Brown
4a2,3
> 3895 Jan Smythe
> 3664 John Lee
This output from diff is interpreted as follows:
1,3d0 means that lines 1-3 should be deleted from phonesA to make it like phonesB. This is followed by the text of the 3 lines in phonesA that should be deleted, each marked with a <.
4a2,3 means that lines 2 and 3 from phonesB should be added to phonesA after line 4. This is followed by the text of the 2 lines of phonesB that should be added, each marked with a >.
If lines in the first file must be replaced by different lines from the second file, diff reports the difference with c (for change), rather than a (for addition) or d (for deletion).
If the two files are identical, diff produces no output (not even a confirmation that they are identical). When comparing binary files (e.g., executable programs), diff does not report the differences but merely confirms that the files differ.
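Because diff is silent for identical files, scripts usually test its exit status instead of its output. The sketch below assumes the common convention that diff exits with 0 when the files are the same and 1 when they differ; the files changes, copyA, and no_changes are hypothetical names.

```shell
#!/bin/sh
# Recreate the sample files in a scratch directory (illustrative setup).
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf '%s\n' '4678 George Smith' '2870 Bill Anderson' \
    '3717 Joan Brown' '4221 Susan Wilson' > phonesA
printf '%s\n' '4221 Susan Wilson' '3895 Jan Smythe' '3664 John Lee' > phonesB

# Save the report and record the exit status of diff.
if diff phonesA phonesB > changes; then
    status=0
else
    status=$?
fi
deleted=$(grep -c '^<' changes)   # lines to delete from phonesA
added=$(grep -c '^>' changes)     # lines to add from phonesB
echo "exit status $status: $deleted lines to delete, $added lines to add"

# Comparing a file with an exact copy gives status 0 and no output.
cp phonesA copyA
if diff phonesA copyA > no_changes; then
    same_status=0
else
    same_status=$?
fi
echo "exit status for identical files: $same_status"
```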
sdiff, a utility based on diff, will assist you in merging, line-by-line, the contents of two differing files into a third (output) file, which is named with the -o option. sdiff pauses as it encounters each difference and waits for you to choose how to resolve it.
The utility bc can be used as a desktop calculator. To illustrate the use of bc, here is a brief sequence of commands and (in bold) responses, with notes in square brackets:
bc
6+4*2-2^3 [^ is the exponent operator]
**6** [*, / and ^ are evaluated before + and -]
2^3-5/(4-1) [() expressions are evaluated first]
**7** [use scale (below) for decimal points]
scale=3; 2^3-5/(4-1) [combine statements using ;]
**6.334**
sqrt(3.4) [other math functions are available]
**1.843**
To quit bc, type Control-d.
For more information about the utilities described in this how-to, see the on-line manual (man pages) for them. For example, to view the man page for sdiff, type
man sdiff
Man pages are described in how-to B-10, Use Unix Online Documentation.