CPSC 220 Fall 2003
Program 3: Frequency Counts

Overview

When you submit a query to a search engine, it looks like the engine goes out and searches all the pages on the web for your query. If it really did this, however, you would have to wait a long time for your search results. Instead, the search engine has already indexed every page, that is, created and stored a data structure that represents its tokens in a way that can be searched efficiently. During this process many search engines ignore words such as the and and that appear so frequently that searching for them returns too many results to be meaningful. These are called filter words or stop words.

In this part of the project you will do a similar thing for a file -- put its tokens into a data structure that is efficient for searching. Use this file of stop words, which is taken from the Minnesota Public Radio web site. Also ignore all HTML tags (anything of the form <tag name>), as they do not contribute to the content of the file. Since we aren't actually doing searches yet, for now you will simply give an alphabetical list of the tokens in the file with the frequency of occurrence of each token.

Input and Output

Your program should take two command line arguments:

the name of the file to be parsed
the name of a file that contains the tokens that should be ignored

If these values are not given on the command line, your program should prompt for them.

Your program should print, in alphabetical order, the tokens found in the parsed file (except those to be ignored) along with the frequency with which each token occurred. The tokens should appear in all lower case, and the comparison to see if two tokens are equal should not be case sensitive. For example, suppose the file to be parsed contained the following text:

<html>
<body bgcolor="#FFFF">
This is a test.  
Hope this test works!
</body>
<html>

Also suppose that you are ignoring html tags and that your ignore file contains the following items:

is
it
a
are

Then your program would output the following:

! 1
. 1
hope 1
test 2
this 2
works 1

You will also need to print some information about the tree in which the tokens are stored; see below.

Program Construction

You will need to use several classes for this program:

Your Scanner class, to read both the ignore file and the input file.
A binary search tree (BST) class. This class should extend the BinaryTree class that you already defined, and should implement the BSTADT interface, which should extend the BinaryTreeADT interface (got that?). The BSTADT interface should include the following methods:
- Comparable addElement(Object item) -- Inserts item into the tree. If an element that is equal (that is, .equals) to item already exists in the tree, return that element; otherwise return the item inserted. Note that this is different from the addElement method in the book.
- Comparable removeElement(Object item) -- Removes item from the tree if it exists, otherwise returns null.
Note that anything stored in a BST must be Comparable; you should enforce this. Also, since you are extending the BinaryTree class, you already have iterators defined for your BST. Hurray!
A TokenFrequency class. Since you want to know how often each token occurs, you have to store both the token and a count at each node of the BST. A TokenFrequency object holds exactly these two pieces of information, so you will store TokenFrequency objects in the BST that you construct from your input file. Note that TokenFrequency will have to implement the Comparable interface. Think carefully about how you want it to do this.

You can use my Scanner and/or BinaryTree classes if yours aren't up to the task.

Your main program will be fairly straightforward. First it will scan the ignore file and store the tokens from this file into a binary search tree (the "ignore tree"). Then it will scan the input file. If a token is in the ignore tree or is an HTML tag, discard it. Otherwise, create a TokenFrequency object for the token and insert it into the token tree. Be sure you increment the count of any object already found in the tree. Note that the addElement method returns the item in the tree if it is already there; you may find this helpful. When you have scanned the entire file, just use the appropriate iterator to print the tokens and their frequencies.

Efficiency Issues

A BST is only an efficient search structure if it is reasonably well balanced. To get a feel for this, after you list the tokens in the file, print the number of distinct, non-ignored tokens found and the height of the BST. This will require that you add a height() method to your BST (or BinaryTree) class.

Documentation

Each class should include a header including the class name, your name, and a description of the class. Each method should include a header including the name of the method, descriptions of its parameters and return value, and clear documentation of any side effects it performs (that is, any input, output, or changes to non-local memory). Surround your method headers with lines of dashes (------) to make them more visible. Variables should have good, descriptive names, and any piece of code whose function is not obvious by inspection should be documented.

What to Turn In

Turn in hard copy of each of your classes. Also tar your directory and e-mail it to bloss@roanoke.edu with cpsc220 prog3 in the subject.