CPSC 220 Fall 2006
Program 2: Processing Queries

In this part of the project you will implement a fledgling search engine that orders web pages based on how relevant they are to a search query. For now, we will say that the best match is the web page with the total highest word frequency counts for the words in the search query. (Searches should be case insensitive.) However, if any word in the search query has no matches on the page at all, then the entire page has a rating of 0. For example, suppose the query is "binary search tree", and three pages have frequencies for each word as shown below:

Page   "binary"   "search"  "tree"
----------------------------------
 1       8           0        10
 2       3           8         4
 3       6           3         3

Page 2 has the highest rating with 15, page 3 has the second highest rating with 12, and page 1 has the lowest rating with 0.

Program Structure

Your top-level class should be called ProcessQueries and should take two command-line arguments:

     java ProcessQueries urlListFile ignoreFile

The urlListFile should contain a list of URLs, one per line. The ignoreFile should contain words to ignore as you count word frequencies in the html files (just as in the last program).

In order to process queries from a user, you'll need to create a new class that joins together a URL string with a binary search tree (holding TokenFrequency objects) representing that web page's content. Call this class URLContent. Your program should create a list of URLContent objects, one for each URL that appears in the urlListFile. (You might find a Vector a handy structure for holding this list.)

Once you have processed all the URLs in the input file (you should gracefully handle invalid URLs), your program should enter a loop as shown below, which prompts the user to enter a search query (or -1 to quit), and then lists all URL's that match the query (relevance rating > 0) in order from the best match first to the worst match last. Include each result URL's rating. URLs of web pages with ratings of 0 should not appear in the result list.

     Enter a query or -1 to quit.

     Search for: programming

     Relevance  URL
     --------- ----
        10      http://cs.roanoke.edu/~bloss/cpsc150syl.html
         7      http://cs.roanoke.edu/~bloss/cs310syl01.html
         7      http://cs.roanoke.edu/~bloss/cs101syl.html

     Search for: computer science

     Relevance  URL
     --------- ----
        26      http://cs.roanoke.edu/~bloss/cs101syl.html
        12      http://cs.roanoke.edu/~bloss/cpsc150syl.html

To find the results of the query in order, you will process the binary search trees in the list of URLContent objects, create priority queue elements for the URLContent objects whose trees give relevance ratings > 0, and add them to a priority queue for the search. Then use the priority queue to print out the matching urls in order. Remember that in a priority queue low values equate with high priority.

Notes

The FileScanner class has been modified so it can scan a web page, not just a file. There is an additional constructor that takes a URL object (so you can distinguish it from a filename) and reads from the text of that object.

You should be able to use your BST, HeapPriorityQueue and related classes without modification. The programs that you wrote to test them will not be used directly, but you may need to incorporate some of the code in those programs.

What to Turn In

Turn in hardcopy for each of your new classes, and any previous classes that you modified. Also tar your directory and e-mail it to bloss@roanoke.edu. The subject should be cs220 prog2.

CPSC 220 Fall 2006Program 2: Processing Queries

Program Structure

Notes

What to Turn In

CPSC 220 Fall 2006
Program 2: Processing Queries