CS220 Lab
Finding Links in HTML Files
In this lab you will write a simple program that extracts the links
from an html file. This will be useful in the next stage of the
project, where you will find web pages to process by chasing links.
Finding Links
It's easy to find the links in a web page; they are usually
in href attributes:
<a href="http://cs.roanoke.edu/CPSC220A">My favorite class!</a>
Just skip tokens until you find "href", then
skip the = and " to get to the url.
The url itself is made up of several tokens, but that's ok -- just keep
reading tokens until you reach the next double quote.
Note, however, that this strategy assumes that the url is in quotes, which
it should be, but sometimes isn't.
Reliably parsing a url that is not in quotes is a little tricky given
the capabilities of our scanner, but ignoring the possibility
means not only missing the unquoted url, but also consuming tokens
up to the next ", and so possibly missing the next url as well.
To address this problem, stop reading a url when
you hit " or > (the end of the tag). You'll also
need to stop if you
get a null token; this indicates that you hit EOF at an unexpected time,
which may mean that the HTML file is poorly formed, but it shouldn't crash
your program.
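The strategy above can be sketched roughly as follows. This is only an illustration, not the required solution: the Tokens class is a made-up stand-in for our scanner (it hands back words and single punctuation characters, and null at EOF), and HrefSketch and findLinks are hypothetical names. Because this stand-in throws away whitespace, an unquoted url followed by another attribute will run together with it, which is one reason unquoted urls are tricky.

```java
import java.util.ArrayList;
import java.util.List;

class HrefSketch {
    // Hypothetical stand-in for the course scanner: returns runs of
    // letters/digits or single punctuation characters, and null at EOF.
    static class Tokens {
        private final String text;
        private int pos = 0;

        Tokens(String text) { this.text = text; }

        String next() {
            while (pos < text.length() && Character.isWhitespace(text.charAt(pos))) pos++;
            if (pos >= text.length()) return null;          // EOF
            int start = pos;
            if (Character.isLetterOrDigit(text.charAt(pos))) {
                while (pos < text.length() && Character.isLetterOrDigit(text.charAt(pos))) pos++;
            } else {
                pos++;                                      // one punctuation character
            }
            return text.substring(start, pos);
        }
    }

    // Collects the text following each href= up to the next ", >, or EOF.
    static List<String> findLinks(String html) {
        List<String> links = new ArrayList<>();
        Tokens in = new Tokens(html);
        String tok = in.next();
        while (tok != null) {
            if (!tok.equalsIgnoreCase("href")) {            // skip until "href"
                tok = in.next();
                continue;
            }
            tok = in.next();
            if (tok == null || !tok.equals("=")) continue;  // not an attribute; re-examine tok
            tok = in.next();
            if (tok != null && tok.equals("\"")) tok = in.next();  // skip opening quote if present
            StringBuilder url = new StringBuilder();
            while (tok != null && !tok.equals("\"") && !tok.equals(">")) {
                url.append(tok);                            // bundle the url tokens together
                tok = in.next();
            }
            if (url.length() > 0) links.add(url.toString());
        }
        return links;
    }

    public static void main(String[] args) {
        String html = "<a href=\"http://cs.roanoke.edu/CPSC220A\">My favorite class!</a>"
                    + " <a href=prog3.html>unquoted</a>";
        List<String> links = findLinks(html);
        for (String link : links) System.out.println(link);
        System.out.println(links.size() + " urls found");
    }
}
```

Note that the unquoted url ends at > and the truncated-file case simply ends the loop, so neither one swallows the url that follows or stops the program.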
Exercise 1: Write a program called NetTest.java that prompts the
user for the url of a web page, then prints all of the urls that the given
page links to. It should also print the number of urls found.
You will need to:
- Create a URL object from the input string
- Create a Scanner object to read from the URL
- Get tokens from the scanner, parsing them as described above.
When you find a url, bundle it into a string, count it, and print it.
You don't need to
save it after you print it.
- Print the total number of urls.
Remember to import java.net.* and java.io.*.
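The first two steps might look something like the sketch below. It assumes java.util.Scanner as a stand-in, which may not behave like the scanner we use in class (for one thing, it splits only on whitespace and does not return null at EOF), and OpenPageSketch and countTokens are made-up names. It just counts tokens to show that the stream is readable; the parsing is left to you.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import java.util.Scanner;

class OpenPageSketch {
    // Counts the whitespace-separated tokens available on a stream.
    static int countTokens(InputStream in) {
        int count = 0;
        try (Scanner scan = new Scanner(in)) {
            while (scan.hasNext()) {
                scan.next();
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) throws IOException {
        Scanner keyboard = new Scanner(System.in);
        System.out.print("Enter the url of a web page: ");
        // The URL constructor throws MalformedURLException (an IOException)
        // if the input string is not a url Java understands.
        URL page = new URL(keyboard.nextLine().trim());
        // openStream() connects to the page and returns its contents as a stream.
        try (InputStream in = page.openStream()) {
            System.out.println(countTokens(in) + " tokens read");
        }
    }
}
```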
Absolute and Relative URLs
A url in an anchor tag can be absolute or relative. An absolute
url starts with http (or maybe another protocol, which we'll discuss
later), e.g., http://cs.roanoke.edu/CPSC220A. A relative url just gives
the path to a file relative to the current document, e.g.,
prog3.html or lab9/post9.html.
If you process a variety of pages using your program from above,
you'll probably find examples of both.
To open a stream to a URL (which you need to do
to parse a web page), the url must be absolute. Fortunately, it is easy
to construct an absolute url from a relative one, provided that you
have the url that it is relative to (got that?). The URL class provides
a constructor that takes a URL and a String and returns a new URL that
is the string relative to the URL. Note that it strips off the filename
of the given URL before appending the string.
If the string represents an absolute
URL, the URL passed in has no effect. For example, the code
below will produce the output shown:
Code:
URL url1 = new URL("http://cs.roanoke.edu/CPSC120A/index.html");
URL url2 = new URL(url1, "prog1.html");
URL url3 = new URL(url1, "lab9/post9.html");
URL url4 = new URL(url1, "http://cs.roanoke.edu/~bloss/PEP.html");
System.out.println(url1);
System.out.println(url2);
System.out.println(url3);
System.out.println(url4);
Output:
http://cs.roanoke.edu/CPSC120A/index.html
http://cs.roanoke.edu/CPSC120A/prog1.html
http://cs.roanoke.edu/CPSC120A/lab9/post9.html
http://cs.roanoke.edu/~bloss/PEP.html
Exercise 2: Modify your NetTest program so that all of the URLs
printed (and counted) are absolute. If you find a url with a
protocol that Java's URL class does not recognize (a common one is
"javascript"), the URL constructor will throw a MalformedURLException
when you call it. Deal with this situation gracefully, that is, skip
over the url and go to the next one, but don't terminate the program.
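One way to handle this is sketched below, assuming you already have the page's URL and the link text in hand. ResolveSketch and resolve are hypothetical names, and returning null for a rejected url is just one possible convention; the list of links is made up for the demonstration.

```java
import java.net.MalformedURLException;
import java.net.URL;

class ResolveSketch {
    // Returns the absolute URL for link relative to base,
    // or null if the URL class rejects it (e.g. an unknown protocol).
    static URL resolve(URL base, String link) {
        try {
            return new URL(base, link);
        } catch (MalformedURLException e) {
            return null;    // skip this one; the caller moves on to the next url
        }
    }

    public static void main(String[] args) throws MalformedURLException {
        URL base = new URL("http://cs.roanoke.edu/CPSC120A/index.html");
        // Made-up links standing in for what the parser might find.
        String[] found = {
            "prog1.html",
            "javascript:void(0)",
            "http://cs.roanoke.edu/~bloss/PEP.html"
        };
        int count = 0;
        for (String link : found) {
            URL absolute = resolve(base, link);
            if (absolute == null) continue;   // unknown protocol: don't print or count it
            System.out.println(absolute);
            count++;
        }
        System.out.println(count + " urls found");
    }
}
```

Catching the exception around the single constructor call, rather than around the whole loop, is what lets the program keep going after a bad url.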
Other Protocols
A url in an href
can have other protocols besides http -- for example, mailto is
common. But only http will give a parseable page, so you should restrict the
urls you consider to those with the http protocol.
Again, the URL class makes this easy. It has a getProtocol() method
that returns a String representing the protocol.
Exercise 3: Modify your program to
print (and count) only urls with the http protocol.
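The check itself can be very small, as in the sketch below. ProtocolSketch and isHttp are made-up names, and the mailto address is invented for the example. Note that getProtocol() returns "https" for secure pages, so a strict comparison against "http" leaves those out too.

```java
import java.net.MalformedURLException;
import java.net.URL;

class ProtocolSketch {
    // True only when the url's protocol is http.
    static boolean isHttp(URL url) {
        return "http".equalsIgnoreCase(url.getProtocol());
    }

    public static void main(String[] args) throws MalformedURLException {
        URL page = new URL("http://cs.roanoke.edu/CPSC220A");
        URL mail = new URL("mailto:someone@example.com");   // made-up address
        System.out.println(page.getProtocol() + " -> " + isHttp(page));
        System.out.println(mail.getProtocol() + " -> " + isHttp(mail));
    }
}
```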
What To Turn In
Turn in a hardcopy of your NetTest class and e-mail it to me
with cs220 NetTest in the subject line.