CS220 Lab
Finding Links in HTML Files

In this lab you will write a simple program that extracts the links from an HTML file. This will be useful in the next stage of the project, where you will find web pages to process by chasing links.

Finding Links

It's easy to write a rudimentary program to find the links to other web pages; they are usually in href attributes:

<a href="http://cs.roanoke.edu/Fall2006/CPSC220A">My favorite class!</a>

Just skip tokens until you find "href", then skip the = and the " to get to the url. The url itself spans a number of tokens, but that's ok -- just keep reading until the next double quote. Note, however, that this strategy assumes the url is in quotes, which it should be but sometimes isn't. Reliably parsing an unquoted url is a little tricky, but ignoring the possibility can mean not only missing the unquoted url but also consuming tokens up to the next ", thereby possibly missing the next url as well. To address this problem you'll need to do two things: 1) check whether the character after the = is a quote, and skip it if it is (remember that the string containing just a double quote is written "\""), and 2) stop reading a url when you hit " or > (the end of the tag). Of course, you'll also need to stop if you get a null token; this indicates that you hit EOF at an unexpected time, which may mean that the HTML file is poorly formed, but it shouldn't crash your program.
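The two rules above can be sketched on a plain String, to show the logic in isolation. This is only an illustration, not the lab solution: it walks characters instead of scanner tokens, the names HrefSketch, findHrefs, and skipSpaces are made up for this example, and stopping an unquoted url at whitespace is an extra assumption beyond the two rules.

```java
import java.util.ArrayList;
import java.util.List;

class HrefSketch {
    // Hypothetical helper: collects href values from html, quoted or not.
    static List<String> findHrefs(String html) {
        List<String> urls = new ArrayList<>();
        int pos = 0;
        while ((pos = html.indexOf("href", pos)) != -1) {
            pos = skipSpaces(html, pos + 4);
            if (pos >= html.length() || html.charAt(pos) != '=') {
                continue; // "href" not followed by '=', so not an attribute
            }
            pos = skipSpaces(html, pos + 1);
            // Rule 1: skip the opening quote only if it is there.
            boolean quoted = pos < html.length() && html.charAt(pos) == '"';
            if (quoted) {
                pos++;
            }
            int start = pos;
            // Rule 2: stop at " or >. Running off the end of the input
            // (the analogue of a null token) simply ends the loop.
            while (pos < html.length()) {
                char c = html.charAt(pos);
                // Assumption: whitespace also ends an unquoted url.
                if (c == '"' || c == '>' || (!quoted && Character.isWhitespace(c))) {
                    break;
                }
                pos++;
            }
            if (pos > start) {
                urls.add(html.substring(start, pos));
            }
        }
        return urls;
    }

    static int skipSpaces(String s, int pos) {
        while (pos < s.length() && Character.isWhitespace(s.charAt(pos))) {
            pos++;
        }
        return pos;
    }

    public static void main(String[] args) {
        String html = "<a href=\"lab9/post9.html\">Lab</a> <a href=prog3.html>Prog</a>";
        for (String url : findHrefs(html)) {
            System.out.println(url);
        }
    }
}
```

Because the unquoted url stops at >, the quoted url that follows it is still found rather than swallowed.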

Exercise 1: Write a program called NetTest.java that prompts the user for the url of a web page, then prints all of the urls that the given page links to. It should also print the number of urls found. You will need to:

Create a URL object from the input string
Create a FileScanner object to read from the URL
Get tokens from the scanner, parsing them as described above. When you find a url, bundle it into a string, count it, and print it. You don't need to save it after you print it.
Print the total number of urls.

Remember to import java.net.* and java.io.*.
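If you want to see the URL-to-stream step on its own, here is a minimal sketch using only standard library classes. It does not use the FileScanner class from the exercise; the names PageReaderSketch and readPage are invented for this example, and it simply reads the whole page into one String.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URL;

class PageReaderSketch {
    // Hypothetical helper: returns the text of the page at the given address.
    @SuppressWarnings("deprecation") // URL constructors are deprecated in recent JDKs
    static String readPage(String address) throws IOException {
        URL url = new URL(address); // throws MalformedURLException on a bad address
        StringBuilder page = new StringBuilder();
        // openStream() connects to the URL and returns an InputStream for its contents.
        try (BufferedReader in = new BufferedReader(new InputStreamReader(url.openStream()))) {
            String line;
            while ((line = in.readLine()) != null) {
                page.append(line).append('\n');
            }
        }
        return page.toString();
    }

    public static void main(String[] args) throws IOException {
        // Pass the address on the command line, e.g. a file: or http: url.
        System.out.print(readPage(args[0]));
    }
}
```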

Absolute and Relative URLs

A url in an anchor tag can be absolute or relative. An absolute url starts with http (or maybe another protocol, which we'll discuss later), e.g., http://cs.roanoke.edu/CPSC220A. A relative url just gives the path to a file relative to the current document, e.g., prog3.html or lab9/post9.html. If you process a variety of pages using your program from above, you'll probably find examples of both. To open a stream to a URL (which you need to do to parse a web page), the url must be absolute. Fortunately, it is easy to construct an absolute url from a relative one, provided that you have the url that it is relative to (got that?). The URL class provides a constructor that takes a URL and a String and returns a new URL that is the string relative to the URL. Note that it strips off the filename of the given URL before appending the string. If the string represents an absolute URL, the URL passed in has no effect. For example, the code below will produce the output shown:

Code:
    URL url1 = new URL("http://cs.roanoke.edu/CPSC120A/index.html");
    URL url2 = new URL(url1, "prog1.html");
    URL url3 = new URL(url1, "lab9/post9.html");
    URL url4 = new URL(url1, "http://cs.roanoke.edu/~bloss/PEP.html");
    System.out.println(url1);
    System.out.println(url2);
    System.out.println(url3);
    System.out.println(url4);

Output:
    http://cs.roanoke.edu/CPSC120A/index.html
    http://cs.roanoke.edu/CPSC120A/prog1.html
    http://cs.roanoke.edu/CPSC120A/lab9/post9.html
    http://cs.roanoke.edu/~bloss/PEP.html

Exercise 2: Modify your NetTest program so that all of the URLs printed (and counted) are absolute. If you find a url with a protocol that the URL class does not recognize (a common one is "javascript"), the URL constructor will throw a MalformedURLException. Deal with this situation gracefully: skip over the url and go on to the next one, but don't terminate the program.
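One way to skip bad urls without stopping is to catch the exception inside the loop, as sketched below. This is an illustration under assumptions, not the required design: the names ResolveSketch and resolveAll are made up, and it takes the hrefs as a ready-made list rather than reading them from a page.

```java
import java.net.MalformedURLException;
import java.net.URL;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class ResolveSketch {
    // Hypothetical helper: resolves each href against base, dropping the
    // ones the URL constructor rejects.
    @SuppressWarnings("deprecation") // URL constructors are deprecated in recent JDKs
    static List<URL> resolveAll(URL base, List<String> hrefs) {
        List<URL> result = new ArrayList<>();
        for (String href : hrefs) {
            try {
                result.add(new URL(base, href));
            } catch (MalformedURLException e) {
                // Unknown protocol or otherwise malformed: skip this url and
                // carry on with the next one.
            }
        }
        return result;
    }

    @SuppressWarnings("deprecation")
    public static void main(String[] args) throws MalformedURLException {
        URL base = new URL("http://cs.roanoke.edu/CPSC120A/index.html");
        List<String> hrefs = Arrays.asList("prog1.html", "javascript:void(0)");
        for (URL url : resolveAll(base, hrefs)) {
            System.out.println(url);
        }
    }
}
```

The try block wraps a single constructor call, so one bad url costs only that url.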

Other Protocols

A url in an href can have other protocols besides http -- for example, mailto is common. But only http will give a parseable page, so you should restrict the urls you consider to those with the http protocol. Again, the URL class makes this easy. It has a getProtocol() method that returns a String representing the protocol.
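The protocol check itself is a one-line comparison. The sketch below assumes a hypothetical helper named isHttp in a class named ProtocolSketch; neither name comes from the lab.

```java
import java.net.MalformedURLException;
import java.net.URL;

class ProtocolSketch {
    // Hypothetical helper: true only when the url's protocol is exactly "http".
    static boolean isHttp(URL url) {
        // Putting the literal first avoids calling a method on the result.
        return "http".equals(url.getProtocol());
    }

    @SuppressWarnings("deprecation") // URL constructors are deprecated in recent JDKs
    public static void main(String[] args) throws MalformedURLException {
        URL page = new URL("http://cs.roanoke.edu/CPSC120A/index.html");
        System.out.println(page.getProtocol() + " " + isHttp(page));
    }
}
```

Note that a comparison against "http" alone also excludes https urls, which report "https" as their protocol.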

Exercise 3: Modify your program to print (and count) only urls with the http protocol.

What To Turn In

Turn in a hard copy of your NetTest class and e-mail it to me with cs220 NetTest in the subject line.
