CPSC 220 Fall 2006
Program 1 (really): Indexing Information for a Search Engine

Overview

When you submit a query to a search engine, it looks like the engine goes out and searches all the pages on the web for your query. If it really did this, however, you would have to wait a long time for your search results. Instead, the search engine has already indexed every page, that is, created and stored a data structure that represents its tokens in a way that can be searched efficiently. During this process many search engines ignore words such as the and and that appear so frequently that searching for them returns too many results to be meaningful. These are called filter words or stop words.

In this first part of the project you will do a similar thing for a file -- put its tokens into a data structure that is efficient for searching. Use this file of stop words, which is taken from the Minnesota Public Radio web site. Also ignore all HTML tags (anything of the form <tag name>), as they do not contribute to the content of the file. Since we aren't actually doing searches yet, for now you will simply give an alphabetical list of the tokens in the file with the frequency of occurrence of each token.

Input and Output

Your program should take two command line arguments:

the name of the file to be indexed
the name of the file that contains the tokens that should be ignored

If these values are not given on the command line, your program should prompt for them.

Your program should print, in alphabetical order, the tokens found in the file to be indexed (except those to be ignored) along with the frequency with which each token occurred. The tokens should appear in all lower case, and the comparison to see if two tokens are equal should not be case sensitive. For example, suppose the file to be indexed contained the following text:

<html>
<body bgcolor="#FFFF">
This is a test.  
Hope this test works!
</body>
<html>

Also suppose that you are ignoring html tags and that your ignore file contains the following items:

is
it
a
are

Then your program would output the following:

! 1
. 1
hope 1
test 2
this 2
works 1

You will also need to print some information about the tree in which the tokens are stored; see below.

Program Construction

You will need to use several classes for this program:

A class to scan a text file and return one token at a time. This class will have the interface below and will be provided for you in the cs220utils package.
- FileScanner(String filename) -- creates a FileScanner object that reads from file filename.
- String next() -- returns the next token in the file, null if there is no next token.
- boolean hasNext() -- returns true if there is another token available (i.e., the file is open and it's not EOF), false otherwise.
For our purposes a token is one of two things:
- A string of contiguous alphabetic and numeric characters terminated by any non-aphabetic/non-numeric character (including whitespace) or EOF.
- A single non-alphabetic, non-numeric character.
So the text below contains the tokens shown:
```
   public class Test5
   {
     int x;
   }
   //Very silly


Tokens:  public
         class
         Test5
         {
         int
         x
         ;
         }
         /
         /
         Very
         silly
```
Your binary search tree class. However, you should modify the insert method so that it both takes and returns an object of type T, so the interface is T insert(T item). If the item to be inserted is already in the tree (that is, it is .equals to an element in the tree), then insert should return the element already in the tree; otherwise it should return the item inserted.
A TokenFrequency class. Since you want to know how often each token occurs, you have to store both the token and a count at each node of the BST. A TokenFrequency object holds exactly these two pieces of information, so you will store TokenFrequency objects in the BST that you construct from your input file. (Now aren't you glad you made that BST generic?) Note that TokenFrequency will have to implement the Comparable interface. Think carefully about how you want it to do this.

Your main program will be fairly straightforward. First it will scan the ignore file and store the tokens from this file into a binary search tree (the "ignore tree"). Then it will scan the input file. If a token is in the ignore tree or is an HTML tag, discard it. Otherwise, create a TokenFrequency object for the token and insert it into the token tree. Be sure you increment the count of any object already found in the tree. Note that the insert method returns the item in the tree if it is already there; you will find this helpful. When you have scanned the entire file, just use the appropriate iterator to print the tokens and their frequencies.

Efficiency Issues

A BST is only an efficient search structure if it is reasonably well balanced. To get a feel for this, after you list the tokens in the file, print the number of distinct, non-ignored tokens found and the height of the BST. This will require that you add a height() method to your BST (or BinaryTree) class.

Documentation

Each class should include a header including the class name, your name, and a description of the class. Each method should include a header including the name of the method, descriptions of its parameters and return value, and clear documentation of any side effects it performs (that is, any input, output, or changes to non-local memory). Surround your method headers with lines of dashes (------) to make them more visible. Variables should have good, descriptive names, and any piece of code whose function is not obvious by inspection should be documented.

What to Turn In

Turn in hard copy of each of your classes. Also tar your directory and e-mail it to bloss@roanoke.edu with cpsc220 reallyprog1 in the subject.