CPSC 220 Fall 2006
Program 1 (really): Indexing Information for a Search Engine

Overview

When you submit a query to a search engine, it looks like the engine goes out and searches all the pages on the web for your query. If it really did this, however, you would have to wait a long time for your search results. Instead, the search engine has already indexed every page, that is, created and stored a data structure that represents its tokens in a way that can be searched efficiently. During this process many search engines ignore words such as the and and that appear so frequently that searching for them returns too many results to be meaningful. These are called filter words or stop words.

In this first part of the project you will do a similar thing for a file -- put its tokens into a data structure that is efficient for searching. Use this file of stop words, which is taken from the Minnesota Public Radio web site. Also ignore all HTML tags (anything of the form <tag name>), as they do not contribute to the content of the file. Since we aren't actually doing searches yet, for now you will simply give an alphabetical list of the tokens in the file with the frequency of occurrence of each token.

Input and Output

Your program should take two command line arguments: If these values are not given on the command line, your program should prompt for them.

Your program should print, in alphabetical order, the tokens found in the file to be indexed (except those to be ignored) along with the frequency with which each token occurred. The tokens should appear in all lower case, and the comparison to see if two tokens are equal should not be case sensitive. For example, suppose the file to be indexed contained the following text:

<html>
<body bgcolor="#FFFF">
This is a test.  
Hope this test works!
</body>
<html>
Also suppose that you are ignoring html tags and that your ignore file contains the following items:
is
it
a
are
Then your program would output the following:
! 1
. 1
hope 1
test 2
this 2
works 1
You will also need to print some information about the tree in which the tokens are stored; see below.

Program Construction

You will need to use several classes for this program: Your main program will be fairly straightforward. First it will scan the ignore file and store the tokens from this file into a binary search tree (the "ignore tree"). Then it will scan the input file. If a token is in the ignore tree or is an HTML tag, discard it. Otherwise, create a TokenFrequency object for the token and insert it into the token tree. Be sure you increment the count of any object already found in the tree. Note that the insert method returns the item in the tree if it is already there; you will find this helpful. When you have scanned the entire file, just use the appropriate iterator to print the tokens and their frequencies.

Efficiency Issues

A BST is only an efficient search structure if it is reasonably well balanced. To get a feel for this, after you list the tokens in the file, print the number of distinct, non-ignored tokens found and the height of the BST. This will require that you add a height() method to your BST (or BinaryTree) class.

Documentation

Each class should include a header including the class name, your name, and a description of the class. Each method should include a header including the name of the method, descriptions of its parameters and return value, and clear documentation of any side effects it performs (that is, any input, output, or changes to non-local memory). Surround your method headers with lines of dashes (------) to make them more visible. Variables should have good, descriptive names, and any piece of code whose function is not obvious by inspection should be documented.

What to Turn In

Turn in hard copy of each of your classes. Also tar your directory and e-mail it to bloss@roanoke.edu with cpsc220 reallyprog1 in the subject.