CPSC 220 Fall 2006
Program 3: Chasing Links

In this part of the project you will put your pieces together into a command-line search engine that takes an initial URL, then accepts queries from the user and responds to each query with a list of the relevant pages that are reachable from that initial URL. First some background on viewing the World Wide Web as a graph.

The Web as a Graph

The Web can be thought of as a graph in which web pages are the vertices and links embedded within those web pages are the edges. Using this model, you can do a breadth-first search of the Web by choosing an initial page, then finding all of the links within that page and visiting each of those pages before visiting any of the pages that they link to. Note that this does not require an explicit representation of the graph -- the edges are already stored within the web pages. However, since the same page could be linked to from multiple places, you'll have to be sure you don't visit the same page twice. You can't mark the pages themselves, but you can store the URLs that have been visited in a data structure that will let you insert and search efficiently (e.g., a BST). Note that you cannot easily detect when different URLs lead to the same web page (e.g., http://cnn.com and http://www.cnn.com), so you will still have some repetition.

Implementation

You will modify your code from Program 2 so that instead of taking a list of URLs to scan, it takes a single URL and does a breadth first search from that web page as described above to find additional pages to scan.

You will need to make the following changes to Program 2 code:

Write a class called URLQueue with the following public methods:
- URLQueue(URL url) -- Takes a URL and creates a queue of URLs resulting from a breadth-first search of the graph formed by the links in the given page and in the pages linked to. Note that initially the queue holds only the given url; it is filled incrementally when each URL in it is dequeued (see description of dequeue below).
- URL dequeue() -- Returns the next URL in the queue and adds each link in that page to the queue if it isn't already there. Again: the links in a given page are added to the queue when the URL for that page is dequeued, not before.
- boolean isEmpty() -- Returns true if the queue is empty.
Note that no public enqueue method is needed, as additional URLs are added to the queue as a by-product of dequeuing.
Now modify your ProcessQueries class so that it takes a string representing an initial URL (along with the ignore file) from the command line, creates a URLQueue, and processes URLs taken off the queue instead of those read from an input file. Most of the rest of your ProcessQueries class will be unchanged; each of the URLs should be scanned to produce a list of URLContent objects, as before, and then you will use this list to find the relevant pages for each query.
One more thing: as a third but optional command line argument, take the maximum number of URLs to search. If this argument is not supplied, set the maximum to 100. Note that if no limit is set here, your program could go on for a very long time!

Caution: Real web pages are often poorly formed with missing quotes, missing end tags, and so on. Be sure that you handle these situations gracefully, extracting as many links as possible but most of all not allowing your program to crash.

What to Turn In

Turn in hardcopy for each of your new classes, and any previous classes that you modified.

CPSC 220 Fall 2006Program 3: Chasing Links

The Web as a Graph

Implementation

What to Turn In

CPSC 220 Fall 2006
Program 3: Chasing Links