CPSC 220 Fall 2006
Program 3: Chasing Links

In this part of the project you will put your pieces together into a command-line search engine that takes an initial URL, accepts queries from the user, and responds to each query with a list of the relevant pages that are reachable from that initial URL. First, some background on viewing the World Wide Web as a graph.

The Web as a Graph

The Web can be thought of as a graph in which web pages are the vertices and the links embedded within those pages are the edges. Using this model, you can do a breadth-first search of the Web by choosing an initial page, finding all of the links within that page, and visiting each of the linked pages before visiting any of the pages that they in turn link to. Note that this does not require an explicit representation of the graph; the edges are already stored within the web pages themselves. However, since several pages may link to the same page, you will have to be sure you don't visit the same page twice. You can't mark the pages themselves, but you can store the URLs that have been visited in a data structure that supports efficient insertion and search (e.g., a BST). Note that you cannot easily detect when different URLs lead to the same web page (e.g., http://cnn.com and http://www.cnn.com), so some repetition will remain.
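For concreteness, here is one way this traversal might look in Java. This is a sketch, not required code: scanPage and extractLinks are placeholders standing in for whatever scanning and link-extraction routines you wrote for Program 2, and TreeSet plays the role of the BST (it is implemented as a balanced binary search tree, so insertion and lookup are both O(log n)).

    import java.util.LinkedList;
    import java.util.List;
    import java.util.Queue;
    import java.util.TreeSet;

    public class Crawler {
        // URLs already seen, kept in a TreeSet (a balanced BST) so
        // that insertion and lookup are both O(log n).
        private TreeSet<String> visited = new TreeSet<String>();

        // Breadth-first traversal of the Web starting from startUrl.
        public void bfs(String startUrl) {
            Queue<String> toVisit = new LinkedList<String>();
            visited.add(startUrl);
            toVisit.add(startUrl);

            while (!toVisit.isEmpty()) {
                String url = toVisit.remove();
                scanPage(url);  // index this page

                // Enqueue each link only the first time we see it.
                for (String link : extractLinks(url)) {
                    if (!visited.contains(link)) {
                        visited.add(link);
                        toVisit.add(link);
                    }
                }
            }
        }

        // Placeholders standing in for your Program 2 routines.
        private void scanPage(String url) { }
        private List<String> extractLinks(String url) {
            return new LinkedList<String>();
        }
    }

Because the queue holds pages in the order they were discovered, every page one link away from the start is scanned before any page two links away, and so on.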

Implementation

You will modify your code from Program 2 so that instead of taking a list of URLs to scan, it takes a single URL and does a breadth-first search from that web page, as described above, to find additional pages to scan.
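As a sketch of how the revised driver might fit together (the class and method names here are illustrative, not required), the main method takes the single starting URL from the command line, runs the crawl, and then enters the query loop described at the beginning of this handout. Reusing the Crawler sketch above:

    import java.util.Scanner;

    public class SearchEngine {
        public static void main(String[] args) {
            // A single starting URL on the command line, not a list.
            Crawler crawler = new Crawler();
            crawler.bfs(args[0]);

            // Then answer queries until end of input.
            Scanner in = new Scanner(System.in);
            while (in.hasNextLine()) {
                String query = in.nextLine();
                // Hypothetical: look the query up in the index built
                // during the crawl and print the matching URLs.
                System.out.println("results for: " + query);
            }
        }
    }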

You will need to make the following changes to your Program 2 code:

Caution: Real web pages are often poorly formed, with missing quotes, missing end tags, and so on. Be sure to handle these situations gracefully, extracting as many links as possible and, above all, never letting your program crash.
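One defensive approach, assuming you extract links with a regular expression (a sketch only; your Program 2 extraction code may work quite differently): accept href values written with double quotes, single quotes, or no quotes at all, and catch runtime exceptions so that a badly formed page costs you some links rather than the whole run.

    import java.util.LinkedList;
    import java.util.List;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class LinkExtractor {
        // Matches href=... whether the URL is double-quoted,
        // single-quoted, or (as on many malformed pages) unquoted.
        private static final Pattern HREF = Pattern.compile(
                "href\\s*=\\s*(\"([^\"]*)\"|'([^']*)'|([^\\s>]+))",
                Pattern.CASE_INSENSITIVE);

        // Returns every link it can find in the page text.
        public static List<String> extractLinks(String pageText) {
            List<String> links = new LinkedList<String>();
            try {
                Matcher m = HREF.matcher(pageText);
                while (m.find()) {
                    // Whichever alternative matched, exactly one of
                    // groups 2 through 4 is non-null.
                    String url = m.group(2) != null ? m.group(2)
                               : m.group(3) != null ? m.group(3)
                               : m.group(4);
                    if (url != null && url.length() > 0) {
                        links.add(url);
                    }
                }
            } catch (RuntimeException e) {
                // Give up on this page but keep whatever links were
                // already collected; never let the crawl die here.
            }
            return links;
        }
    }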

What to Turn In

Turn in hardcopy of each of your new classes and of any previous classes that you modified.