CS220 Lab
Finding Links in HTML Files

In this lab you will write a simple program that extracts the links from an HTML file. This will be useful in the next stage of the project, where you will find web pages to process by following links.

Finding Links

It's easy to find the links in a web page; they are in the href portion of anchor tags.

<a href="http://cs.roanoke.edu/CPSC220A">My favorite class!</a>

An anchor tag may have other attributes besides href, and if there are other attributes the href may not come first. But in the interest of keeping things simple we'll assume that href is the first attribute. (An easy extension would be to drop this assumption.) This means that you can find the URL just by looking for a "<" followed by an "a", skipping three more tokens (href, =, and ") and there you are. Of course, the URL itself contains a number of tokens, but that's ok -- it just goes to the next double quote.
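As a rough illustration of the pattern, here is a minimal sketch that pulls href values out of a string of HTML. It does not walk the input token by token as described above; instead it hands a regular expression to Scanner's findWithinHorizon method, which is a different way of matching the same "<a href=" shape. The class name HrefSketch, the method findLinks, and the sample HTML are made up for this example, and it keeps the assumption that href is the first attribute.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;
import java.util.regex.Pattern;

public class HrefSketch {
    // Matches <a href="..."> with href as the first attribute; group 1 is the URL text.
    private static final Pattern ANCHOR =
        Pattern.compile("<a\\s+href\\s*=\\s*\"([^\"]*)\"", Pattern.CASE_INSENSITIVE);

    // Returns the href values found in the given HTML text, in document order.
    public static List<String> findLinks(String html) {
        List<String> links = new ArrayList<>();
        Scanner scan = new Scanner(html);
        // A horizon of 0 tells the scanner to search to the end of the input.
        while (scan.findWithinHorizon(ANCHOR, 0) != null) {
            // match() gives access to the groups of the most recent match.
            links.add(scan.match().group(1));
        }
        scan.close();
        return links;
    }

    public static void main(String[] args) {
        // Sample input only; a real program would read the page from a URL.
        String html = "<p><a href=\"http://cs.roanoke.edu/CPSC220A\">My favorite class!</a> "
                    + "<a href=\"prog3.html\">Program 3</a></p>";
        for (String link : findLinks(html)) {
            System.out.println(link);
        }
    }
}
```

Anchors whose first attribute is something other than href are silently skipped by this pattern, which matches the simplifying assumption above.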

Exercise 1: Write a program called NetTest.java that prompts the user for the URL of a web page, then prints all of the URLs that the given page links to. You will need to:

Create a URL object from the input string
Create a Scanner object to read from the URL
Get tokens from the scanner, looking for the pattern described above. When you find a URL, bundle it into a string and print it. You don't need to save it after you print it.

Remember to import java.net.* and java.io.*, plus java.util.Scanner for the scanner.
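The first two steps can be sketched as follows. So that it runs without a network connection, this example writes a small temporary file and reaches it through a file: URL; with a real web page you would build the URL from the user's input string instead. The class name UrlScannerSketch and the method countTokens are made up for this example, and it only counts tokens rather than looking for links.

```java
import java.io.IOException;
import java.net.URL;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Scanner;

public class UrlScannerSketch {
    // Opens a stream to the URL and counts its whitespace-separated tokens.
    public static int countTokens(URL url) throws IOException {
        int count = 0;
        // try-with-resources closes the scanner (and the underlying stream) when done.
        try (Scanner scan = new Scanner(url.openStream())) {
            while (scan.hasNext()) {
                scan.next();
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for a web page: a temporary local file reached through a file: URL.
        Path page = Files.createTempFile("page", ".html");
        Files.writeString(page, "<a href=\"prog3.html\">Program 3</a>");
        URL url = page.toUri().toURL();
        System.out.println(countTokens(url) + " tokens");
        Files.delete(page);
    }
}
```

Note that a Scanner splits on whitespace by default, so "<a" arrives as a single token and the href, the URL, and whatever follows the closing quote may all arrive together in the next one. You will need to account for that when you look for the pattern.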

Absolute and Relative URLs

A URL in an anchor tag can be absolute or relative. An absolute URL starts with a protocol such as http (we'll discuss other protocols later), e.g., http://cs.roanoke.edu/CPSC220A. A relative URL just gives the path to a file relative to the current document, e.g., prog3.html or lab9/post9.html. If you process a variety of pages using your program from above, you'll probably find examples of both. Unfortunately, to open a stream to a URL (which you need to do to parse a web page), the URL must be absolute. Fortunately, it is easy to construct an absolute URL from a relative one, provided that you have the URL that it is relative to (got that?). The URL class provides a constructor that takes a URL and a String and creates a new URL by resolving the string relative to the given URL. If the string already represents an absolute URL, the URL passed in has no effect. For example, the code below will produce the output shown:

Code:
    URL url1 = new URL("http://cs.roanoke.edu/CPSC120A/index.html");
    URL url2 = new URL(url1, "prog1.html");
    URL url3 = new URL(url1, "lab9/post9.html");
    URL url4 = new URL(url1, "http://cs.roanoke.edu/~bloss/PEP.html");
    System.out.println(url1);
    System.out.println(url2);
    System.out.println(url3);
    System.out.println(url4);

Output:
    http://cs.roanoke.edu/CPSC120A/index.html
    http://cs.roanoke.edu/CPSC120A/prog1.html
    http://cs.roanoke.edu/CPSC120A/lab9/post9.html
    http://cs.roanoke.edu/~bloss/PEP.html

Exercise 2: Modify your program so that all of the URLs printed are absolute.

Other Protocols

A URL in an href can have other protocols besides http -- for example, mailto is common. But only http will give a parseable page, so you should restrict the URLs you consider to those with the http protocol. Again, the URL class makes this easy. It has a getProtocol() method that returns a String representing the protocol.
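A minimal sketch of such a check is shown below. It resolves a link against the page it came from using the two-argument URL constructor, then compares the result of getProtocol() with "http". The class name ProtocolSketch, the method isHttpLink, and the mailto address are made up for this example. Treating a string that cannot be parsed as "not http" is a choice made here, not something the lab requires.

```java
import java.net.MalformedURLException;
import java.net.URL;

public class ProtocolSketch {
    // True only when the link, resolved against base, uses the http protocol.
    public static boolean isHttpLink(URL base, String link) {
        try {
            URL resolved = new URL(base, link);
            return resolved.getProtocol().equalsIgnoreCase("http");
        } catch (MalformedURLException e) {
            // Assumption for this sketch: a link that cannot be parsed is skipped.
            return false;
        }
    }

    public static void main(String[] args) throws MalformedURLException {
        URL base = new URL("http://cs.roanoke.edu/CPSC120A/index.html");
        // A relative link inherits the protocol of the page it appears on.
        System.out.println(isHttpLink(base, "prog1.html"));
        // Hypothetical address, used only to show a non-http protocol.
        System.out.println(isHttpLink(base, "mailto:someone@example.com"));
    }
}
```

Recent JDKs mark the URL constructors as deprecated, so you may see a compiler note; they are used here because they are the ones this lab describes.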

Exercise 3: Modify your program to print only URLs with the http protocol.

What To Turn In

Turn in a hard copy of your NetTest.java class and e-mail it to me with cpsc220 NetTest in the subject line.
