CPSC 220 Fall 2002
Program 2, Part 2: Frequency Counts

Overview

This is the first part of a project to write a simple search engine for a portion of the web. A search engine compares the words or phrases being searched for with the contents of the web pages it scans, somehow using the results of this comparison to determine how relevant each page is. The task for this portion of the assignment is to organize the contents of a page so that this comparison can be made efficiently.

Input and Output

Your program should take two command line arguments: Your program should print, in alphabetical order, the tokens found in the parsed file (except those to be ignored) along with the frequency with which each token occurred. For example, suppose the file to be parsed contained the following text:
<html>
<body bgcolor="#FFFF">
This is a test.  Hope this test works!
</body>
<html>
Suppose further that you have decided to ignore all html tags and your ignore file contains the following items:
is
it
a
are
Then your program would output the following:
! 1
. 1
hope 1
test 2
this 2
works 1
Note that all letters have been converted to lower case, and that the token comparison is not case sensitive.

Program Construction

You will need to use several classes in this program: Your main program will be fairly straightforward. First it will scan the ignore file and store the tokens from this file into a binary search tree (the "ignore tree"). Then it will scan the input file. If a token is in the ignore tree or falls into a category of tokens that you have chosen to ignore (eg, HTML tags), discard it. (Be sure to document what categories of tokens you are ignoring. The ignore file can speak for itself.) Otherwise, create a TokenFrequency object for the token and insert it into the token tree. Be sure you increment the count of any object already found in the tree. When you have scanned the entire file, just use the iterator to print the tokens and their frequencies.

What to Turn In

Turn in hard copy of each of your classes. Also tar your directory and e-mail it to bloss@roanoke.edu (NOT bloss@cs.roanoke.edu). IMPORTANT: The subject should be cpsc220 prog2 pt2.