<a href="http://cs.roanoke.edu/Fall2006/CPSC220A">My favorite class!</a>
Just skip tokens until you find "href", then skip the = and the " to get to the url. The url itself is made up of several tokens, but that's ok; just keep reading until the next double quote. Note, however, that this strategy assumes the url is in quotes, which it should be but sometimes isn't. Reliably parsing an unquoted url is a little tricky, but ignoring the possibility can mean not only missing the unquoted url but also consuming tokens all the way up to the next ", and so possibly missing the next url as well. To address this problem you'll need to do two things: 1) check whether the character after the = is a quote, and skip it if it is (remember that the string containing just a double quote is written "\""), and 2) stop reading a url when you hit either " or > (the end of the tag). Of course, you'll also need to stop if you get a null token; this indicates that you hit EOF at an unexpected time, which may mean that the HTML file is poorly formed, but it shouldn't crash your program.
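The two rules above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the tokenizer the lab provides: it uses java.util.StringTokenizer over a String, and the class name HrefSketch and the helpers next, nextNonBlank, and extractHrefs are made up for this example. The next helper returns null when the input runs out, standing in for the null token at EOF.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

public class HrefSketch {
    // Delimiters are also returned as tokens, so '=', '"', '<', '>' and
    // whitespace characters each arrive as a one-character token.
    private static final String DELIMS = " \t\r\n<>=\"";

    // Next token, or null when the input is exhausted (stands in for EOF).
    private static String next(StringTokenizer tok) {
        return tok.hasMoreTokens() ? tok.nextToken() : null;
    }

    // Next token that is not whitespace, or null at end of input.
    private static String nextNonBlank(StringTokenizer tok) {
        String t = next(tok);
        while (t != null && t.trim().isEmpty()) {
            t = next(tok);
        }
        return t;
    }

    // Hypothetical helper: collects the text of every href value in html.
    public static List<String> extractHrefs(String html) {
        List<String> urls = new ArrayList<>();
        StringTokenizer tok = new StringTokenizer(html, DELIMS, true);
        String t = next(tok);
        while (t != null) {
            if (!t.equalsIgnoreCase("href")) {
                t = next(tok);
                continue;
            }
            t = nextNonBlank(tok);
            if (t == null || !t.equals("=")) {
                continue; // "href" was not an attribute here; re-examine t
            }
            t = nextNonBlank(tok);
            if (t != null && t.equals("\"")) {
                t = next(tok); // rule 1: skip the opening quote if there is one
            }
            StringBuilder url = new StringBuilder();
            // rule 2: stop at a closing quote, the end of the tag, or EOF (null)
            while (t != null && !t.equals("\"") && !t.equals(">")) {
                url.append(t);
                t = next(tok);
            }
            if (url.length() > 0) {
                urls.add(url.toString());
            }
        }
        return urls;
    }

    public static void main(String[] args) {
        String html = "<a href=\"http://cs.roanoke.edu/Fall2006/CPSC220A\">My favorite class!</a>"
                + "<a href=page2.html>unquoted</a>";
        System.out.println(extractHrefs(html));
    }
}
```

Because the loop stops at > as well as at ", the unquoted second link is picked up without swallowing the tokens that follow it.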
Exercise 1: Write a program called NetTest.java that prompts the user for the url of a web page, then prints all of the urls that the given page links to. It should also print the number of urls found. You will need to:
Code:
    URL url1 = new URL("http://cs.roanoke.edu/CPSC120A/index.html");
    URL url2 = new URL(url1, "prog1.html");
    URL url3 = new URL(url1, "/lab9/post9.html");
    URL url4 = new URL(url1, "http://cs.roanoke.edu/~bloss/PEP.html");
    System.out.println(url1);
    System.out.println(url2);
    System.out.println(url3);
    System.out.println(url4);
Output:
    http://cs.roanoke.edu/CPSC120A/index.html
    http://cs.roanoke.edu/CPSC120A/prog1.html
    http://cs.roanoke.edu/lab9/post9.html
    http://cs.roanoke.edu/~bloss/PEP.html
Exercise 2: Modify your NetTest program so that all of the URLs printed (and counted) are absolute. If you find a url whose protocol Java's URL class does not recognize (a common one is "javascript"), the URL constructor will throw a MalformedURLException. Deal with this situation gracefully: skip over the url and go on to the next one, but don't terminate the program.
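One way the skip-and-continue behavior might look is sketched below. The class name AbsoluteUrlSketch and the method toAbsolute are hypothetical, and the sketch assumes the link strings have already been extracted from the page. Note that recent JDKs mark the URL constructors as deprecated; they still work, and they are used here because the lab is built around them.

```java
import java.net.MalformedURLException;
import java.net.URL;
import java.util.ArrayList;
import java.util.List;

public class AbsoluteUrlSketch {
    // Hypothetical helper: resolves each link against the page it came from
    // and returns the absolute forms, skipping any link the URL class rejects.
    public static List<String> toAbsolute(String pageUrl, List<String> links) {
        URL base;
        try {
            base = new URL(pageUrl);
        } catch (MalformedURLException e) {
            // Assumption: a bad page url is the caller's mistake, so report it.
            throw new IllegalArgumentException("bad page url: " + pageUrl, e);
        }
        List<String> absolute = new ArrayList<>();
        for (String link : links) {
            try {
                absolute.add(new URL(base, link).toString());
            } catch (MalformedURLException e) {
                // Unknown protocol such as "javascript": skip this link and
                // carry on with the next one instead of ending the program.
            }
        }
        return absolute;
    }

    public static void main(String[] args) {
        List<String> links = new ArrayList<>();
        links.add("prog1.html");
        links.add("javascript:void(0)");
        links.add("http://cs.roanoke.edu/~bloss/PEP.html");
        List<String> result = toAbsolute("http://cs.roanoke.edu/CPSC120A/index.html", links);
        for (String u : result) {
            System.out.println(u);
        }
        System.out.println(result.size() + " urls found");
    }
}
```

The try block wraps only the constructor call for a single link, so one bad url costs exactly one entry and the loop keeps going.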
Exercise 3: Modify your program to print (and count) only urls with the http protocol.
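The protocol of a URL object is available through its getProtocol() method, so the check could be sketched as below. The class name HttpOnlySketch and the method isHttp are made up for illustration, and the sketch assumes the url passed in is already absolute.

```java
import java.net.MalformedURLException;
import java.net.URL;

public class HttpOnlySketch {
    // Hypothetical helper: true only when the url parses and its protocol is http.
    public static boolean isHttp(String absoluteUrl) {
        try {
            URL url = new URL(absoluteUrl);
            return url.getProtocol().equalsIgnoreCase("http");
        } catch (MalformedURLException e) {
            // Unparseable or unknown protocol: certainly not an http url.
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(isHttp("http://cs.roanoke.edu/CPSC120A/index.html"));
        System.out.println(isHttp("https://cs.roanoke.edu/"));
        System.out.println(isHttp("javascript:void(0)"));
    }
}
```

Note that "https" is a different protocol string from "http", so this test excludes secure links as well.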