CPSC 220 Fall 2002
Program 2, Part 2: Frequency Counts
Overview
This is the first part of a project to write a simple search engine for
a portion of the web. A search engine compares the words or phrases being searched for
with the contents of the web pages it scans, somehow using the results of
this comparison to determine how relevant each page is. The task for this
portion of the assignment is to organize the contents of a page so that this
comparison can be made efficiently.
Input and Output
Your program should take two command line arguments:
- the name of the file to be parsed
- the name of a file that contains a list of tokens that should be ignored
Your program should print, in alphabetical order, the tokens found in
the parsed file (except those to be ignored) along with the
frequency with which each token occurred. For example, suppose the file to
be parsed contained the following text:
<html>
<body bgcolor="#FFFF">
This is a test. Hope this test works!
</body>
<html>
Suppose further that you have decided to ignore all html tags and your
ignore file contains the following items:
is
it
a
are
Then your program would output the following:
! 1
. 1
hope 1
test 2
this 2
works 1
Note that all letters have been converted to lower case, and that the token comparison is
not case sensitive.
Program Construction
You will need to use several classes in this program:
- Your Scanner class from part 1, to read both the ignore file and the input file
- A BST class. This class should extend the BinaryTree class that you already
defined and should provide the
following additional methods:
- Object insert(Object item) -- Inserts item into the tree. If item already exists
in the tree, return that object (and do not modify the tree). Otherwise return the
item inserted.
- Object find(Object item) -- Looks for item in the tree. Returns the object if it is
found, else returns null.
Note that anything stored in a BST must be Comparable; you should enforce this.
Also, since you are extending the BinaryTree class, you already have iterators
defined for your BST. Hurray!
- A TokenFrequency class. An object of this type should hold a token and the number
of times it has occurred. You will store TokenFrequency objects in the BST that you
construct from your input file. Note that TokenFrequency will have to implement the
Comparable interface. Think carefully about how you want it to do this.
Your main program will be fairly straightforward.
First it will scan the ignore file and store the tokens from this file into
a binary search tree (the "ignore tree"). Then it will scan the input file. If
a token is in the
ignore tree or falls into a category of tokens that you have chosen to ignore
(eg, HTML tags), discard it. (Be sure to document what categories of tokens you
are ignoring. The ignore file can speak for itself.) Otherwise, create a
TokenFrequency object for the token and insert it into the token tree. Be sure you
increment the count of any object already found in the tree.
When you have scanned the entire file, just use the
iterator to print the tokens and their frequencies.
What to Turn In
Turn in hard copy of each of your classes.
Also tar your directory and e-mail it to bloss@roanoke.edu
(NOT bloss@cs.roanoke.edu). IMPORTANT: The subject should be
cpsc220 prog2 pt2.