CPSC120A - Scientific Assignment 5

Finding genetic markers for disease, determining the phylogenics of organisms, and even determining paternity require aligning DNA sequences in order to make comparisons of similarity. It is not possible for individuals to analyzes gene sequences manually because they can be quite long. The human genome, for example, consists of more than 3.2 billion base pairs. However, it is also not possible for computers alone to analyze the data. Long repeated sequences and copy errors make aligning very difficult for computers. Instead DNA is analyzed by humans that use programs to find long sequences that match in order to align two DNA sequences for comparison.

Details

You will write a program that finds the longest shared subsequence between two DNA sequences. Each DNA sequence will be represented as a string that only contains the letters A (Adenine), T (Thymine), C (Cytosine), and G (Guanine). These letters correspond to the nucleotides in one strand of the DNA molecule's Double Helix. Your program should output a string, which is the longest sequence of contiguous characters that appears in both sequences.

For example, consider the following strings:

ACACACTCTGT
GTCACACGCTGT

The longest shared subsequence between these two strings is CACAC. This begins at index 1 in string 1 and index 2 in string 2.

The program should read, using input, two DNA sequences. It should print the longest matching subsequence, and the indicies of where it begins in both of the input sequences. You can assume there is only one longest sub-sequence.

"Hacker" Prompt

Each week, additional exercises related to the assignment will be provided at the end. These exercises are typically more challenging than the regular assignment. Bonus points will be provided to students who complete any of the "Hacker" level assignments.

Purines and Pyrimidines: The techniques that sequence DNA are not perfect. Sometimes they are not able to determine whether a nucleotide is an 'A', 'T', 'C', or 'G'. In this case the sequence will have an 'R' if it is a purine ('A' or 'G'), a 'Y' if it is a pyrimidine, or an 'N' if it is any nucleotide ('A', 'T', 'C', or 'G'). Modify your program so that it can read DNA sequence strings that contain 'R', 'Y', and 'N', in addition to 'A', 'T', 'C', and 'G'. The program should find the longest sub-sequence that may be a match.
FASTA: Bioinformatics software often uses the FASTA file format to store DNA sequences. Modify your program so that it finds the longest sub-sequence in the first two sequences of a FASTA file. In the FASTA format, a sequence is a line of text of capitalized characters that represent the nucleotides of the sequence. Before the line that contains a sequence is a line that begins with the ">" character and contains a description of the following sequence. The following is an example of a FASTA file:
```
>chupacabra DLF gene
CGATCGATCGATCGATCGATCGATCGATCGATCGATCGACTAATCGATCTAGCTCGATCGATCGATCTATGACGATCGGATCGTTACGGCGATCGACTAACGTCGTGATCGTCTACTGCGTCATAGCTGA
>yeti BRT1F gene
ATCGATCGATTGCTAGCTATATCGAGCGACGCTACGCTACGACTACGACTCGACTACGCTACGCGGATCTAACGTATCGTACGGATCTGTGACGGAGACACTGATCATGCGATACTATCGGCTATGCTGA
```
A file can be used as input to a Python program by using redirection in the terminal. For example, to use the file test_data.txt as input to the Python program compute.py, execute the following command in the terminal.
```
python3 compute.py < test_data.txt
```
Each call to the input() function in the compute.py program will return one line of text from the test_data.txt file. Find an actual FASTA file on-line to test your program on.

Submission

Your code should conform to the course’s code conventions and will be graded according the the course’s programming assignment rubric. Submit your program as a .py file on the course Inquire page before class on Friday October 24^th.