CPSC 220 Fall 2006
Program 3: Chasing Links
In this part of the project, you will put your pieces together into
a command-line search engine that takes an initial URL, accepts
queries from the user, and responds to each query with a list of the
relevant pages reachable from that initial URL. First, some
background on viewing the World Wide Web as a graph.
The Web as a Graph
The Web can be thought of as a graph in which web pages are the
vertices and the links embedded within those pages are the edges.
Using this model, you can do a breadth-first search of the Web:
choose an initial page, find all of the links within that page, and
visit each of the linked pages before visiting any of the pages that
they in turn link to. Note that this does not require an explicit
representation of the graph -- the edges are already stored within
the web pages.
However, since the same page could be linked to
from multiple places,
you'll have to be sure you don't visit the same page twice.
You
can't mark the pages themselves, but you can store the URLs that have been
visited in a data structure that will let you insert and search
efficiently (e.g.,
a BST).
Note that you cannot
easily detect when different
URLs lead to the same
web page (e.g., http://cnn.com and http://www.cnn.com), so you will still
have some repetition.
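To make the traversal concrete, here is a minimal sketch of the idea.
It is not the class you are asked to build (that is described under
Implementation below); it assumes a java.util.TreeSet of URL strings
standing in for your own BST, and a hypothetical getLinks helper that
extracts the links from a page.

    import java.net.URL;
    import java.util.LinkedList;
    import java.util.List;
    import java.util.Queue;
    import java.util.TreeSet;

    public class WebBFS {
        // Visit pages breadth-first starting from start, never visiting
        // the same URL string twice.
        public static void traverse(URL start) {
            Queue<URL> toVisit = new LinkedList<URL>();
            TreeSet<String> visited = new TreeSet<String>(); // stand-in for your own BST
            toVisit.add(start);
            visited.add(start.toString());
            while (!toVisit.isEmpty()) {
                URL page = toVisit.remove();
                // ... scan this page here, as in Program 2 ...
                for (URL link : getLinks(page)) {
                    if (!visited.contains(link.toString())) { // skip pages already seen
                        visited.add(link.toString());
                        toVisit.add(link);
                    }
                }
            }
        }

        // Hypothetical helper: in the real program this would parse the
        // page's HTML and return the URLs it links to.
        private static List<URL> getLinks(URL page) {
            return new LinkedList<URL>();
        }
    }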
Implementation
You will modify your code from Program 2 so that instead of taking
a list of URLs to scan, it takes a single URL and does a breadth-first
search from that web page, as described above, to find additional
pages to scan.
You will need to make the following changes to Program 2 code:
- Write a class called URLQueue with the following public methods
(a rough sketch of this class appears below, after this list):
- URLQueue(URL url) -- Takes a URL and creates a queue of
URLs resulting from a breadth-first search of the graph formed by
the links in the given page and in the pages it links to. Note that
initially the queue holds only the given URL; it is filled incrementally
as each URL in it is dequeued (see the description of dequeue below).
- URL dequeue() -- Removes and returns the next URL in the queue, and
adds each link in that page to the queue if it isn't already there.
Again: the links in a given page are
added to the queue when the URL for that page is dequeued, not before.
- boolean isEmpty() -- Returns true if the queue is empty.
Note that no public enqueue method is needed, as additional URLs are
added to the queue as a by-product of dequeuing.
- Now modify your ProcessQueries class so that it takes a string representing
an initial URL (along with the ignore file) from the command line, creates
a URLQueue, and processes URLs taken off the queue instead of those read
from an input file. Most of the rest of your ProcessQueries class
will be unchanged; each
of the URLs should be scanned to produce a list of
URLContent objects, as before, and then you will use this list to find the
relevant pages for each query.
One more thing: as a third, optional
command-line argument, take the maximum number of URLs to
search. If this argument is not supplied, set the maximum to 100.
Note that without some such limit, your program could go on for a
very long time!
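To show how these pieces might fit together, here is one rough sketch
of URLQueue. It is only a starting point and makes some assumptions:
it uses a java.util.LinkedList as the underlying queue, a
java.util.TreeSet of URL strings in place of your own BST of visited
URLs, and a hypothetical extractLinks helper that would reuse your
Program 2 scanning code to pull the links out of a page.

    import java.net.URL;
    import java.util.LinkedList;
    import java.util.List;
    import java.util.TreeSet;

    public class URLQueue {
        private LinkedList<URL> queue = new LinkedList<URL>(); // URLs waiting to be scanned
        private TreeSet<String> seen = new TreeSet<String>();  // stand-in for a BST of visited URLs

        // Initially the queue holds only the given URL; links are added
        // lazily as URLs are dequeued.
        public URLQueue(URL url) {
            queue.add(url);
            seen.add(url.toString());
        }

        // Removes and returns the next URL, enqueuing every link on that
        // page that has not been seen before.
        public URL dequeue() {
            URL page = queue.removeFirst();
            for (URL link : extractLinks(page)) {
                if (!seen.contains(link.toString())) {
                    seen.add(link.toString());
                    queue.add(link);
                }
            }
            return page;
        }

        public boolean isEmpty() {
            return queue.isEmpty();
        }

        // Hypothetical helper: would reuse your Program 2 scanning code
        // to pull the links out of the page's HTML.
        private List<URL> extractLinks(URL page) {
            return new LinkedList<URL>();
        }
    }

One way to respect the maximum number of URLs from the command line is
for ProcessQueries to simply stop dequeuing once it has scanned that
many pages.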
Caution: Real
web pages are often poorly formed, with missing
quotes, missing end tags, and so on. Be sure that you handle these
situations gracefully, extracting as many links as possible but, above
all, not allowing your program to crash.
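For instance, one defensive pattern (just a sketch, with hypothetical
helper names) is to wrap each place a link or page might fail to parse
or load in a try/catch and simply skip it:

    import java.io.BufferedReader;
    import java.io.IOException;
    import java.io.InputStreamReader;
    import java.net.MalformedURLException;
    import java.net.URL;

    public class SafeFetch {
        // Resolve a link's href text against the page it appears on;
        // returns null instead of crashing on a malformed link.
        public static URL safeResolve(URL base, String href) {
            try {
                return new URL(base, href);
            } catch (MalformedURLException e) {
                return null;   // bad link: ignore it and keep going
            }
        }

        // Read a page's contents; returns null instead of crashing if
        // the page is unreachable or unreadable.
        public static String safeRead(URL page) {
            try {
                BufferedReader in = new BufferedReader(
                        new InputStreamReader(page.openStream()));
                StringBuilder text = new StringBuilder();
                String line;
                while ((line = in.readLine()) != null) {
                    text.append(line).append('\n');
                }
                in.close();
                return text.toString();
            } catch (IOException e) {
                return null;   // unreachable page: skip it
            }
        }
    }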
What to Turn In
Turn in hardcopy of each of your new classes and of any previous
classes that you modified.