CS220 Lab
Finding Links in HTML Files
In this lab you will write a simple program that extracts the links
from an HTML file. This will be useful in the next stage of the
project, where you will find web pages to process by chasing links.
Finding Links
It's easy to find the links in a web page; they are in the href
attribute of anchor tags, for example:
<a href="http://cs.roanoke.edu/CPSC220A">My favorite class!</a>
An anchor tag may have other attributes besides href, and when it does,
href may not come first. To keep things simple, we'll assume that
href is always the first attribute. (An easy extension would be to
drop this assumption.) This means that you can find the URL
just by looking for a "<" followed by an "a", then skipping three more
tokens (href, =, and the opening "). The URL starts right after that.
Of course, the URL itself may contain a number of tokens, but that's OK:
it simply runs up to the next double quote.
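The pattern above can also be sketched with a regular expression instead of token-by-token matching. This is only a minimal illustration, not the required solution: the class name HrefSketch, the method findLinks, and the sample HTML string are made up for the example, and it relies on the same assumption that href is the first attribute.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;
import java.util.regex.Pattern;

class HrefSketch {
    // Matches <a href="..."> where href is the first attribute (the lab's assumption).
    // Group 1 captures everything between the two double quotes.
    private static final Pattern ANCHOR =
        Pattern.compile("<a\\s+href\\s*=\\s*\"([^\"]*)\"", Pattern.CASE_INSENSITIVE);

    // Collects the href value of every matching anchor tag in the scanner's input.
    static List<String> findLinks(Scanner in) {
        List<String> links = new ArrayList<>();
        // A horizon of 0 means "search the rest of the input without a bound".
        while (in.findWithinHorizon(ANCHOR, 0) != null) {
            links.add(in.match().group(1));
        }
        return links;
    }

    public static void main(String[] args) {
        // Hypothetical sample input standing in for a real page.
        String html = "<p><a href=\"http://cs.roanoke.edu/CPSC220A\">My favorite class!</a> "
                    + "and <a href=\"prog3.html\">Program 3</a></p>";
        for (String link : findLinks(new Scanner(html))) {
            System.out.println(link);
        }
    }
}
```

Run on the sample string, this prints http://cs.roanoke.edu/CPSC220A and then prog3.html. A Scanner built from a String is used here only so that the sketch runs without a network connection.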
Exercise 1: Write a program called NetTest.java that prompts the
user for the URL of a web page, then prints all of the URLs that the given
page links to. You will need to:
- Create a URL object from the input string
- Create a Scanner object to read from the URL
- Get tokens from the scanner, looking for the pattern described above.
When you find a URL, bundle it into a string and print it. You don't need to
save it after you print it.
Remember to import java.net.* and java.io.*, as well as java.util.Scanner.
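As a rough sketch of the first two steps, a Scanner can be attached to a URL through the stream returned by openStream(). The class name UrlScannerSketch and the helper openScanner are invented for this example, and a temporary local file stands in for a real web page so that the code runs offline; your program would build the URL from the user's input instead.

```java
import java.io.IOException;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Scanner;

class UrlScannerSketch {
    // Opens a stream to the given URL and wraps it in a Scanner.
    static Scanner openScanner(URL url) throws IOException {
        return new Scanner(url.openStream());
    }

    public static void main(String[] args) throws IOException {
        // Assumption for the sketch: a temporary file plays the role of the web page.
        Path page = Files.createTempFile("page", ".html");
        Files.write(page, "<a href=\"prog3.html\">Program 3</a>".getBytes(StandardCharsets.UTF_8));
        URL url = page.toUri().toURL();

        try (Scanner in = openScanner(url)) {
            // By default a Scanner splits its input on whitespace.
            while (in.hasNext()) {
                System.out.println(in.next());
            }
        }
        Files.delete(page);
    }
}
```

Printing the raw tokens like this is a quick way to see what the scanner actually hands you before you start matching the anchor pattern.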
Absolute and Relative URLs
A URL in an anchor tag can be absolute or relative. An absolute
URL starts with http (or maybe another protocol, which we'll discuss
later), e.g., http://cs.roanoke.edu/CPSC220A. A relative URL just gives
the path to a file relative to the current document, e.g.,
prog3.html or lab9/post9.html.
If you process a variety of pages using your program from above,
you'll probably find examples of both.
Unfortunately, to open a stream to a URL (which you need to do
to parse a web page), it must be absolute. Fortunately, it is easy
to construct an absolute URL from a relative one, provided that you
have the URL that it is relative to (got that?). The URL class provides
a constructor that takes a URL and a String and creates a new URL by
resolving the string relative to that URL. If the string already
represents an absolute URL, the URL passed in has no effect. For example,
the code below will produce the output shown:
Code:
URL url1 = new URL("http://cs.roanoke.edu/CPSC120A/index.html");
URL url2 = new URL(url1, "prog1.html");
URL url3 = new URL(url1, "lab9/post9.html");
URL url4 = new URL(url1, "http://cs.roanoke.edu/~bloss/PEP.html");
System.out.println(url1);
System.out.println(url2);
System.out.println(url3);
System.out.println(url4);
Output:
http://cs.roanoke.edu/CPSC120A/index.html
http://cs.roanoke.edu/CPSC120A/prog1.html
http://cs.roanoke.edu/CPSC120A/lab9/post9.html
http://cs.roanoke.edu/~bloss/PEP.html
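One detail worth seeing for yourself is how the end of the base URL affects the result. The sketch below is illustrative only (the class name RelativeUrlSketch and the helper resolve are made up); it resolves the same relative path against three variations of the base.

```java
import java.net.MalformedURLException;
import java.net.URL;

class RelativeUrlSketch {
    // Resolves href against base using the two-argument URL constructor.
    static String resolve(String base, String href) throws MalformedURLException {
        return new URL(new URL(base), href).toString();
    }

    public static void main(String[] args) throws MalformedURLException {
        // Base names a file: the relative path replaces "index.html".
        System.out.println(resolve("http://cs.roanoke.edu/CPSC120A/index.html", "prog1.html"));
        // Base ends in a slash: the relative path is appended to the directory.
        System.out.println(resolve("http://cs.roanoke.edu/CPSC120A/", "prog1.html"));
        // No trailing slash: "CPSC120A" is treated as a file name and is replaced.
        System.out.println(resolve("http://cs.roanoke.edu/CPSC120A", "prog1.html"));
    }
}
```

The first two calls print http://cs.roanoke.edu/CPSC120A/prog1.html, while the third prints http://cs.roanoke.edu/prog1.html, because everything after the last slash of the base path is dropped before the relative path is attached.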
Exercise 2: Modify your program so that all of the URLs
printed are absolute.
Other Protocols
A URL in an href can have other protocols besides http -- for example,
mailto is
common. But only http will give a parseable page, so you should restrict the
URLs you consider to those with the http protocol.
Again, the URL class makes this easy. It has a getProtocol() method
that returns a String representing the protocol.
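A protocol check might look something like the sketch below. The class name ProtocolFilterSketch, the helper isHttpLink, and the sample hrefs are hypothetical, and treating a string that cannot be parsed as "not http" is a choice made for this example rather than something the lab requires.

```java
import java.net.MalformedURLException;
import java.net.URL;

class ProtocolFilterSketch {
    // Returns true if href, resolved against base, uses the http protocol.
    // A string that cannot be parsed as a URL is treated as "not http".
    static boolean isHttpLink(URL base, String href) {
        try {
            URL resolved = new URL(base, href);
            return resolved.getProtocol().equals("http");
        } catch (MalformedURLException e) {
            return false;
        }
    }

    public static void main(String[] args) throws MalformedURLException {
        URL base = new URL("http://cs.roanoke.edu/CPSC120A/index.html");
        // Hypothetical hrefs: one relative, one mailto, one absolute http.
        String[] hrefs = {
            "prog1.html",
            "mailto:someone@example.com",
            "http://cs.roanoke.edu/~bloss/PEP.html"
        };
        for (String href : hrefs) {
            System.out.println(href + " -> " + isHttpLink(base, href));
        }
    }
}
```

A relative href inherits the protocol of the page it appears on, so prog1.html counts as an http link here, while the mailto address does not.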
Exercise 3: Modify your program to
print only URLs with the http protocol.
What To Turn In
Turn in a hard copy of your NetTest.java file and e-mail it to me
with cpsc220 NetTest in the subject line.