Details for Programming Assignment # 3

Your objective is to implement and examine the efficiency of alternative algorithms for finding duplicates in a collection of strings.

Here is a high-level description of what you are to do:

Read in a count, indicating the number of tests  to be run
For (I=1; I<=number_of_tests; I++)
    Read a text file called "TestDataI.txt" from disk (same directory as code)
        Create a collection of the words found (vector or array are equally good)
            You will need to extract the words, treating punctuation as white space
                white space is anything that is not alphanumeric
        Let N be number of words extracted
        DO NOT look for duplicates at this stage
    You now have a collection of strings with possible duplicates
    Hash Table Test
        Enter these strings in a hash table having N/2 buckets
            Each entry in a bucket is a handle to a string and its repeat count
        Iterate across all entries in the table
            Print duplicated words and their repeat counts
                Do not print out words that occur just once
    Tree Test
        Choose one of the following "balanced" tree representations
            AVL tree, splay tree, red-black tree
        Enter these strings in the tree.
            Each node is a handle to a string and its repeat count
        Walk the tree using an inorder traversal
            Print duplicated words and their repeat counts
    Produce a printed summary of the cost of the alternative methods
       This needs to specify the techniques compared (Hash vs ??)
       This needs to specify how many words (N) were found
        The measures can be
            Actual timing analyseis or
            Counts of calls and iterations
        The choice of what you measure is in your hands
    Printed output should be redirected to a file named "TestOutI.txt"
end For

Turn in a disk with a directory whose name is the same as yours
    Use the convention <Last Name> <Initial of First Name>
        So my folder would be named "HughesC"
    The contents of the folder will be
        The source file(s) for your program
        Executable files, as appropriate to programming language
        Test Files named TestData1.txt, TestData2.txt, ...
            Include at least 3 files
                One of these must have no duplicate words
                One of these must have some words with at least 3 repeats
                One of these must contain at least 100 distinct words
                    Hint: Just use some paper you wrote for a class
                        Be sure to save it as a text file
        Text files showing your output for each test data file
            These are named TestOut1.txt, TestOut2.txt, ...
        A report entitled Analysis.doc that describes your observations
            This must address the observed and theoretical performance of techniques

Due: By 4:00 PM on Monday, November 1, 1999
Lateness: 5% per day
Cutoff: No assignment will be accepted after Friday, November 5 at 4:00 PM

I/O code
Echoing a file in Java

import java.io.*;
import java.util.*;
...
String fileName = "test1.txt";
StreamTokenizer tokens;
try {
      tokens = new StreamTokenizer (new FileInputStream(fileName));
      while (tokens.nextToken()!=StreamTokenizer.TT_EOF)
            System.out.println(tokens.sval);
} catch (IOException e) {
      System.err.println("Could not process file " + fileName);
      System.exit(1);
}
Echoing a file in C++
#include <fstream>
#include <string>
#include <iostream>
using namespace std;
...
ifstream infile;
string s;
string fileName = "test1.txt";
infile.open(fileName);
if (infile.fail()) {
     cout << "file error" << endl;
     exit(1);
}
while (!infile.eof()) {
     infile >> s;
     cout << s << endl;
}