CPSC 220 Fall 2002
Program 3: Processing Queries

In this part of the project you will implement a fledgling search engine that orders web pages based on how relevant they are to a search query. For now, we will say that the best match is the web page with the total highest word frequency counts for the words in the search query. (Searches should be case insensitive.) However, if any word in the search query has no matches on the page at all, then the entire page has a rating of 0. For example, suppose the query is "binary search tree", and three pages have frequencies for each word as shown below:
Page   "binary"   "search"  "tree"
----------------------------------
 1       8           0        10
 2       3           8         4
 3       6           3         3
Page 2 has the highest rating with 15, page 3 has the second highest rating with 12, and page 1 has the lowest rating with 0.

Program Structure

Your top-level class should be called ProcessQueries and should take two command-line arguments:
     java ProcessQueries urlListFile ignoreFile
The urlListFile should contain a list of URLs, one per line. The ignoreFile should contain words to ignore as you count word frequencies in the html files (just as in the last program).

In order to process queries from a user, you'll need to create a new class that joins together a URL string with a binary search tree (holding TokenFrequency objects) representing that web page's content. Call this class URLContent. Your program should create a list of URLContent objects, one for each URL that appears in the urlListFile. (You might find a Vector a handy structure for holding this list.)

Once you have processed all the URLs in the input file (you should gracefully handle invalid URLs), your program should enter a loop as shown below, which prompts the user to enter a search query (or -1 to quit), and then lists all URL's that match the query (relevance rating > 0) in order from the best match first to the worst match last. Include each result URL's rating. URLs of web pages with ratings of 0 should not appear in the result list.

     Enter a query or -1 to quit.

     Search for: programming

     Relevance  URL
     --------- ----
        10      http://cs.roanoke.edu/~bloss/cpsc150syl.html
         7      http://cs.roanoke.edu/~bloss/cs310syl01.html
         7      http://cs.roanoke.edu/~bloss/cs101syl.html

     Search for: computer science

     Relevance  URL
     --------- ----
        26      http://cs.roanoke.edu/~bloss/cs101syl.html
        12      http://cs.roanoke.edu/~bloss/cpsc150syl.html
To find the results of the query in order, you will process the binary search trees in the list of URLContent objects, create priority queue elements for the URLContent objects whose trees give relevance ratings > 0, and add them to a priority queue for the search. Then use the priority queue to print out the matching urls in order. Remember that in a priority queue low values equate with high priority.

Notes

You will have to modify your Scanner class to scan a web page, not just a file. This is easy; just add a constructor that takes a URL object (so you can distinguish it from a filename), call its openStream method to get an InputStream, and then hook that up to a Reader. Voila! Alternatively, you can add a constructor that takes a Reader and do the URL work in the calling method.

You should be able to use your BST, HeapPriorityQueue and related classes without modification. The programs that you wrote to test them will not be used directly, but you may need to incorporate some of the code in those programs.

You are collecting enough stuff here that it's time to organize it into packages. We'll talk about this in class.

What to Turn In

Turn in hardcopy for each of your new classes, and any previous classes that you modified. Also tar your directory and e-mail it to bloss@roanoke.edu. The subject should be cpsc220 prog3.