CPSC 220 Fall 2006
Program 1 (really): Indexing Information for a Search Engine
Overview
When you submit a query to a search engine, it looks like the engine goes out
and searches all the pages on the web for your query. If it really
did this, however, you would have to wait a long time for your search results.
Instead, the search engine has already indexed every page, that is,
created and stored a data structure that represents its tokens in a way
that can be searched efficiently. During this process many
search engines ignore words such as the and and
that appear so frequently that searching for them returns too many
results to be meaningful. These are called filter words or
stop words.
In this first part of the project you will do a similar thing for a file --
put its tokens into a data structure that is efficient for searching.
Use this file of stop words, which is taken
from the Minnesota
Public Radio web site.
Also ignore all
HTML tags (anything of the form <tag name>), as they do not
contribute to the content of the file.
Since we aren't actually doing searches yet,
for now you will simply give an alphabetical
list of the tokens in the file with the frequency of occurrence of each
token.
Input and Output
Your program should take two command line arguments:
- the name of the file to be indexed
- the name of the file that contains the tokens that should be ignored
If these values are not given on the command line, your program should
prompt for them.
Your program should print, in alphabetical order, the tokens found in
the file to be indexed (except those to be ignored) along with the
frequency with which each token occurred. The tokens should appear in
all lower case, and the comparison to see if two tokens are equal
should not be case sensitive. For example, suppose the file to
be indexed contained the following text:
<html>
<body bgcolor="#FFFF">
This is a test.
Hope this test works!
</body>
<html>
Also suppose that you are ignoring html tags and that your
ignore file contains the following items:
is
it
a
are
Then your program would output the following:
! 1
. 1
hope 1
test 2
this 2
works 1
You will also need to print some information about the tree in which
the tokens are stored; see below.
Program Construction
You will need to use several classes for this program:
Your main program will be fairly straightforward.
First it will scan the ignore file and store the tokens from this file into
a binary search tree (the "ignore tree"). Then it will scan the input file. If a token is in the
ignore tree or is an HTML tag, discard it.
Otherwise, create a
TokenFrequency object for the token and insert it into the token tree. Be sure you
increment the count of any object already found in the tree. Note that
the insert method returns the item in the tree if it is already
there; you will find this helpful.
When you have scanned the entire file, just use the
appropriate iterator to print the tokens and their frequencies.
Efficiency Issues
A BST is only an efficient search structure if it is reasonably well
balanced. To get a feel for this, after you list the tokens in the file,
print the number of distinct, non-ignored tokens found and the height
of the BST. This will require that you add a height() method
to your BST (or BinaryTree) class.
Documentation
Each class should include a header including the class name, your name,
and a description of the class. Each method should include a header
including the name of the method, descriptions of its parameters and
return value, and clear documentation of any side effects it performs
(that is, any input, output, or changes to non-local memory). Surround
your method headers with lines of dashes (------) to make them more
visible. Variables
should have good, descriptive names, and any piece of code whose function
is not obvious by inspection should be documented.
What to Turn In
Turn in hard copy of each of your classes.
Also tar your directory and e-mail it to bloss@roanoke.edu
with cpsc220 reallyprog1 in the subject.