In this part of the project you will do a similar thing for a file -- put its tokens into a data structure that is efficient for searching. Some tokens should be ignored: use this file of stop words, which is taken from the Minnesota Public Radio web site, as the list of tokens to ignore. Also ignore all HTML tags (anything of the form <tag name>), as they do not contribute to the content of the file. Since we aren't actually doing searches yet, for now you will simply give an alphabetical list of the tokens in the file with the frequency of occurrence of each token.
Your program should print, in alphabetical order, the tokens found in the parsed file (except those to be ignored) along with the frequency with which each token occurred. The tokens should appear in all lower case, and the comparison to see if two tokens are equal should not be case sensitive. For example, suppose the file to be parsed contained the following text:
    <html> <body bgcolor="#FFFF"> This is a test. Hope this test works! </body> <html>
Also suppose that you are ignoring HTML tags and that your ignore file contains the following items:
    is it a are
Then your program would output the following:
    ! 1
    . 1
    hope 1
    test 2
    this 2
    works 1
You will also need to print some information about the tree in which the tokens are stored; see below.
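The example implies that punctuation marks count as tokens of their own and that HTML tags are dropped before counting. The assignment does not spell out the exact tokenizing rules, so the following is only a sketch under assumptions: a tag is anything from `<` to the next `>`, a word is a run of letters and digits, and every other non-whitespace character is a one-character token. The class name `TokenizerSketch` and the method `tokenize` are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; not part of any code provided with the assignment.
class TokenizerSketch {
    // Alternatives are tried left to right: an HTML tag, a run of
    // letters/digits, or any single non-whitespace character.
    private static final Pattern TOKEN =
        Pattern.compile("<[^>]*>|[A-Za-z0-9]+|\\S");

    // Returns the lower-cased tokens of text, with HTML tags discarded.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(text);
        while (m.find()) {
            String t = m.group();
            // Assumed rule: a match longer than one character that starts
            // with '<' and ends with '>' is a tag. A lone '<' is kept.
            if (t.length() > 1 && t.startsWith("<") && t.endsWith(">")) {
                continue;
            }
            tokens.add(t.toLowerCase());
        }
        return tokens;
    }

    public static void main(String[] args) {
        String sample = "<html> <body bgcolor=\"#FFFF\"> This is a test. "
                      + "Hope this test works! </body> <html>";
        System.out.println(tokenize(sample));
    }
}
```

Under these assumed rules a word such as "aren't" would split into three tokens; adjust the pattern if your course defines tokens differently.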
Note that anything stored in a BST must be Comparable; you should enforce this. Also, since you are extending the BinaryTree class, you already have iterators defined for your BST. Hurray!
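One way to enforce the Comparable requirement is a bound on the tree's type parameter, so that the compiler rejects element types that cannot be compared. The sketch below shows only the bound and one comparison; the class name `ComparableBoundSketch` is hypothetical, and in your own code the same bound would presumably go on the class that extends BinaryTree, whose exact declaration is not shown here.

```java
// Hypothetical node class illustrating the type bound only.
// The bound "T extends Comparable<T>" means that a declaration such as
// ComparableBoundSketch<Object> does not compile.
class ComparableBoundSketch<T extends Comparable<T>> {
    private final T element;

    ComparableBoundSketch(T element) {
        this.element = element;
    }

    // True when item is smaller than this node's element, i.e. when it
    // would belong in the left subtree of this node.
    boolean goesLeft(T item) {
        return item.compareTo(element) < 0;
    }

    public static void main(String[] args) {
        ComparableBoundSketch<String> node = new ComparableBoundSketch<>("hope");
        System.out.println(node.goesLeft("a"));
        System.out.println(node.goesLeft("test"));
    }
}
```

Because the bound is checked at compile time, no runtime casts to Comparable are needed inside the tree.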
Your main program will be fairly straightforward. First it will scan the ignore file and store the tokens from this file into a binary search tree (the "ignore tree"). Then it will scan the input file. If a token is in the ignore tree or is an HTML tag, discard it. Otherwise, create a TokenFrequency object for the token and insert it into the token tree. Be sure you increment the count of any object already found in the tree. Note that the addElement method returns the item in the tree if it is already there; you may find this helpful. When you have scanned the entire file, just use the appropriate iterator to print the tokens and their frequencies.
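The steps above can be sketched as follows. This is not the required implementation: `SketchBst` is a minimal stand-in for your BST (the real one extends BinaryTree and uses its iterators), the fields of `TokenFrequency` are guesses, and `addElement` is assumed to return null when the item was newly inserted, which the assignment does not state. The sketch also assumes the input has already been split into tokens with HTML tags removed.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical element type: a lower-cased token plus its count.
class TokenFrequency implements Comparable<TokenFrequency> {
    private final String token;
    private int count = 1;

    TokenFrequency(String token) { this.token = token.toLowerCase(); }

    void increment() { count++; }

    @Override
    public int compareTo(TokenFrequency other) { return token.compareTo(other.token); }

    @Override
    public String toString() { return token + " " + count; }
}

// Minimal stand-in for the assignment's BST.
class SketchBst<T extends Comparable<T>> {
    private T element;
    private SketchBst<T> left, right;

    // Inserts item if absent and returns null (assumed behavior);
    // otherwise returns the equal item already stored in the tree.
    T addElement(T item) {
        if (element == null) { element = item; return null; }
        int cmp = item.compareTo(element);
        if (cmp == 0) return element;
        if (cmp < 0) {
            if (left == null) left = new SketchBst<>();
            return left.addElement(item);
        }
        if (right == null) right = new SketchBst<>();
        return right.addElement(item);
    }

    boolean contains(T item) {
        if (element == null) return false;
        int cmp = item.compareTo(element);
        if (cmp == 0) return true;
        SketchBst<T> next = cmp < 0 ? left : right;
        return next != null && next.contains(item);
    }

    // In-order traversal appends the elements in ascending order.
    void inOrder(List<T> out) {
        if (element == null) return;
        if (left != null) left.inOrder(out);
        out.add(element);
        if (right != null) right.inOrder(out);
    }
}

class TokenCountSketch {
    // Returns one "token count" line per distinct, non-ignored token.
    static List<String> count(List<String> ignoreWords, List<String> tokens) {
        SketchBst<String> ignoreTree = new SketchBst<>();
        for (String w : ignoreWords) ignoreTree.addElement(w.toLowerCase());

        SketchBst<TokenFrequency> tokenTree = new SketchBst<>();
        for (String t : tokens) {
            String lower = t.toLowerCase();
            if (ignoreTree.contains(lower)) continue;          // stop word
            TokenFrequency existing = tokenTree.addElement(new TokenFrequency(lower));
            if (existing != null) existing.increment();        // seen before
        }

        List<TokenFrequency> sorted = new ArrayList<>();
        tokenTree.inOrder(sorted);
        List<String> lines = new ArrayList<>();
        for (TokenFrequency tf : sorted) lines.add(tf.toString());
        return lines;
    }

    public static void main(String[] args) {
        List<String> ignore = List.of("is", "it", "a", "are");
        List<String> tokens =
            List.of("This", "is", "a", "test", ".", "Hope", "this", "test", "works", "!");
        count(ignore, tokens).forEach(System.out::println);
    }
}
```

Returning the stored item from `addElement` lets the caller update the count with a single tree descent instead of a separate lookup followed by an insert.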