CPSC170A Assignment 10
Spell Checker

Monday, April 22nd by 5:00 pm

Research into spell checking dates back to the early 1950's, and was performed on images of cursive writing. Modern day spell checkers have evolved quite a bit, but the idea is fairly straightforward: See if a word is in a collection of known, correctly spelled words. You can very easily write this up using a linear structure. However, your look up times for determining if a word is correctly spelled is on order of the size of the correct word list. We can use a special type of tree, however, to get look up times on the order of the size of the word!

This assignment is different from all of the other assignments: you will be working in pairs for this project. Everyone must work with 1 other member of the class. There will be no groups of more than 2 members allowed for this assignment. Choose your partners carefully, as you and your partner will receive the same grade.

Details

Your program is going to use a special type of tree, known as an alphabetic Trie (pronounced "Tree" or "Try"). An alphabetic trie is a 26-ary Tree. This means that each trie node has up to 26 children. This aligns very well with the English alphabet, as we can label each branch of the tree using a unique letter.

Correct words can be inserted by following the defined branches based on the current character in the word being inserted. When you reach the end of the word, you store the correctly spelled word as the associated data of the node you ended on. Once you've inserted all of the known correct words into the trie, you can determine if a word is correctly spelled by navigating down the branches until you reach the end of the word. You can then check the queried word against the data in the node. A word is correctly spelled if and only if the data of the node you end up at is the same as the word you were searching for.

Consider inserting "cat" into a trie, as shown above. To insert, starting from the root, you follow the branch for 'c'. That gives you a new node, and from that node you follow the 'a' branch. This again gives you a new node, and you follow the 't' branch. Now that you have reached the end of the word, you can insert "cat" into the final node. Searching is essentially the same process, except instead of inserting data at the end, you compare to the current data.

Your program is going to read in a file of words (specified as a command line argument), similar to this words.txt file. Notice that this file is slightly different than the word file provided for hangman. There are more special characters than just the apostrophe character. You should ignore any word that has a non-alphabetic character. You should also convert each word in the word list to lower case as you read in the word file.

Read and insert all of the words in the known correct "dictionary" into a trie. Your program will then allow the user to type in words to check the spelling of. If the word is spelled correctly (i.e. it was found in the trie), inform the user. If the word is misspelled, However, you should inform the user that they were incorrect, and provide a potential alternative. You can use the trie to come up with some suggestion of the word they may have intended, by simply returning a word that was found somewhere on the path to the word they were typing ("yesterday" might be a suggestion for the word "yesterdayz", for example). If no word exists on that path, you should inform the user that you have no suggestions for them.

Submission: Submit your code as a zip file with your name as the zip file name on the course Inquire site.

Program Outline and Test Cases

(This is individual work. Each individual must do this by Tuesday.)

Write down an outline of your program in .py files. You should define what classes you need, what attributes those classes need, and what methods each class will have. For each attribute and global variable, define the data type stored within, and the purpose of the attribute (i.e. why is it an attribute, and not a functional parameter.). For each method, provide precise and complete Pre and Post conditions that describe how to use the method, and why you would use the method.

You are required to also generate 3 test cases for the actual building of the trie. Define your test cases in .py files. However, for the expected results of your test cases, you should draw them out on paper. You should put some thought into these test cases. Think about what weird types of inputs you might need to be able to handle, and specify how your program will handle them.

All of what you define here can (and probably will) change by the time the final program is written. This is a very useful exercise to get yourself started thinking about the problem. You should be able to, at the very least, break the problem into a rough set of steps to generate a final solution.

Submission: Submit your files as a zip file with named First_NameLast_Name_Tests.zip, replacing First_NameLast_Name with your name. Bring your drawings to lab on Tuesday.
Due Tuesday, April 16th by 2:50 pm.

Extra

Personal Dictionaries Sometimes, a dictionary is incomplete. For example, "Shende" and "Bouchard" are both not in the word list provided above. However, it would be nice, at least for Dr. Shende and Dr. Bouchard, for their names to be included in their own personal dictionary.

As words are spell checked, you should ask the user of the provided word was indeed correctly spelled. If it was, then insert it into the trie. After the user quits the program, you should detect if they saved any new words to the dictionary. If the user did add new words, you should ask the user for a file name and write all of the words in the trie to the file specified.

Better Suggestions: The suggestions being provided for the base requirements are fairly weak, and will probably not find suggestions for some common misspellings. However, this can be remedied by expanding the search in the tree. Instead of focusing on the path from the root to the last node, you can start looking down other branches.

To determine if a word is a good suggestion, you can use various metrics. One possible metric is the hamming distance, which simply computes the number of characters in each string are mismatched. This is a fairly weak metric for spell checking, but likely will produce slightly better results than the above. Another metric is the Levenshtein distance, which better handles strings of different lengths. Each of these metrics measure how different two strings are. Implement one of the above algorithms, and provide a handful of better suggestions by ranking them using the implemented metric.

For more information about Hamming and Levenshtein distances, visit Cut the Knot.

Spell Check a File: It is not very useful to have a stand alone program for spell checking. It would be better if you could specify a file to be spell checked, and perform this operation on that file. You could even produce a new file, which is a spell checked version of the original file.

This program should take, as a command line arguments, the dictionary file name AND a plain text file to spell check. Your program should build a trie as normal using the dictionary. It should then check each word sequentially in the file. For every misspelled word, you should provide suggestions and ask the user which suggestion should be used. For ease of use, one of the suggestions should be the "incorrectly" spelled word.

After you have finished checking the specified file, you should write a new file with the string "_checked" appended to the end of the file name. This file will be essentially a copy of the original file, but with all of the spellings "corrected" by the users choices.