CPSC 120 -- Assignment #4: Cryptology
Due Friday, December 10, 2010 by 4 p.m.

Security is a major concern in modern computing systems. Confidential information stored in computer databases and transmitted over computer networks needs to be protected from unauthorized access. The systems themselves must be restricted to authorized users. One of many means of ensuring such security is by encrypting data that is stored and transmitted by computers, and using encrypted passwords to limit access to systems. Encryption is the process of converting data (generally called a message) from its original form, called plaintext, into "scrambled" form, called ciphertext. The process of turning ciphertext back to plaintext is called decryption. A goal in systems security is to find encryption algorithms (called ciphers) that are relatively easy to compute but difficult to "reverse" (to break) without some knowledge of the decryption method (especially the "key" needed to decrypt). Most modern algorithms rely heavily on number theory and use the fact that it is difficult to factor very large numbers (a few hundred digits long) in a reasonable length of time. Early algorithms, however, relied on much simpler ideas. One of the simplest algorithms is called the monoalphabetic substitution cipher. In this cipher each letter in the plaintext is replaced by another letter in the alphabet. Basically there is the regular alphabet (called the plain alphabet) and the corresponding scrambled alphabet (called the cipher alphabet) that defines the substitutions. This type of cipher was used in exercise 3 of Lab 11 and is used in lots of puzzles such as CryptoQuote in the daily newspaper. Unfortunately it is very easy to "break" - that is figure out the plaintext message if you have the ciphertext. The key to breaking it is frequency analysis; that is, figure out the frequency of each letter in the ciphertext and try substituting the most frequent letters in the English language for those. Ever since this was figured out by Arab cryptanalysts (people who figure out how to "break" ciphers) in the ninth century, cryptographers have tried to come up with new ciphers that would defy frequency analysis. One such early method was the homophonic substitution cipher. In this cipher, each letter has possibly many different substitution symbols. The idea is for each letter to have a number of substitutes that is proportional to its frequency in the English language.

In this assignment you will write a program to implement a homophonic substitution cipher. Recall that in lab 11 the "key" to the cipher was an array of characters to represent the cipher alphabet. For the homophonic substitution cipher we need a list of substitution characters for each letter. One way to do this is to represent the key as an array of strings. Each string will consist of the characters that can substitute for the corresponding letter. The following illustrates part of such an array.


      Plain Alphabet    Key Array    Cipher Alphabet (Substitution symbols)
                         ------
                        |      |
           a            | -----|---> "#4Avz"
                        |      |                  
                        |------|
                        |      |
           b            | -----|---> "?"
                        |      |                  
                        |------|
                        |      |
           c            | -----|---> "q7>"
                        |      |                  
                        |------|
   
                           ...
To encipher a message, each letter in the message is replaced with a randomly chosen character in the corresponding string of substitution symbols. For example, using the above example key the plaintext message "cabba" might by enciphered as "7A??#" or as "qv??4". There are several other possibilities.

To decipher a message, each letter in the ciphertext must be replaced with the corresponding plaintext letter. To do this there needs to be a way to find the correct letter. For example, in the illustration above the symbol > in a ciphertext message must be replaced with the letter c. So, there needs to be a way to find the symbol > to know which letter it corresponds to. One way to do this is to search through every string in the key array every time. A slightly more efficient method is to build a "reverse" key as the key is read in. For that we will use the idea of parallel arrays - two arrays with related entries. One array will be an array of the symbols in the cipher alphabet (so an array of characters), the other array will be an array of the corresponding letters. This can be visualized as follows for the key above.


      Cipher Alphabet     Corresponding Plain
         Symbols            Alphabet Letters

          ------                ------
         |  #   |              |  a   |
         |------|              |------|
         |  4   |              |  a   |                  
         |------|              |------|
         |  A   |              |  a   |
         |------|              |------|
         |  v   |              |  a   |                
         |------|              |------|
         |  z   |              |  a   |
         |------|              |------|
         |  ?   |              |  b   |       
         |------|              |------|
         |  q   |              |  c   |                
         |------|              |------|
         |  7   |              |  c   |
         |------|              |------|
         |  >   |              |  c   |       
         |------|              |------|
            ...                  ...

Details

Write a class to represent a homophonic substitution cipher and, to test your class, a write a program that creates a cipher object, reads in the key, then allows the user to encipher and decipher messages (use a menu of options similar to that in the program for testing methods in the CDCollection class from lab 11). The class needs to have the following:

Instance Data:

Methods:

Reading from a Text File

Section 4.6 of the textbook (pages 137 - 141) discusses the idea of iterators and reading data from a textfile. Another example is in the file FileExample.java which uses a method in Course.java to read in data. Note that both the main method and a method in a class that reads from a file must throw an IOException. You can try this program out using the files grades1.txt, grades2.txt, and grades3.txt.

The readKey method

The method will read the key information from a file and set up the arrays. The format of the file will be as follows: There will be 26 lines in the file. Each line corresponds to a letter of the alphabet with the first line corresponding to a, the second to b, and so on. The first entry on each line will be an integer representing the number of cipher symbol substitutes for the letter. Then there will be a list of integers representing the Unicode value of each symbol. For example, the file that corresponds to the example key above would be as follows (see the Unicode chart on page 788):

        5  35  52  65  118   122
        1  63
        3  113  55  62
        ...

NOTE: Your documentation for the method must describe the file format!

For each plain alphabet letter, the method must read in the Unicode value of each corresponding symbol and use that to compose the string for the key array. It must also keep track of the number of cipher alphabet symbols and create entries in both the array of cipher symbols and the array of corresponding plain alphabet characters.

The encipher method

The encipher method will read in a message from a file and, for each letter, substitute a random corresponding cipher alphabet letter. Each line from the file should be converted to lowercase after being read in so you only need to worry about lowercase letters. Characters that are not letters (such as blank spaces and punctuation symbols) should be omitted from the ciphertext (just throw them away). That means the ciphertext will have no blank spaces or punctuation symbols. To encipher a letter a random index needs to be generated to choose a random substitution symbol.

Some Test Files

The following are files you can use for testing your program.

Requirements:

Warning: A program that does not compile will receive at most 20% on the assignment even if the error is a small one preventing compilation.

Submission

Tar your assignment 4 directory and copy the tar file containing your code to the directory:

   ~ingram/CPSC120/yourname

Academic Integrity Reminder!!! Programming assignments are to be your own work. You may get help on the specifics of the assignment from no one except the instructor. You may not show your program to anyone or look at anyone else's program or share ideas with anyone about how to write the program.