CPSC120 Assignment 6

DNA Sequence Alignment

Finding genetic markers for disease, determining the phylogeny of organisms, and even determining paternity all require aligning DNA sequences so that they can be compared for similarity. People cannot analyze gene sequences manually because the sequences can be very long. The human genome, for example, consists of more than 3.2 billion base pairs. However, computers alone cannot analyze the data either: long repeated sequences and copy errors make alignment very difficult for computers. Instead, DNA is analyzed by people who use programs to find long matching sequences, which are then used to align two DNA sequences for comparison.

Details

Write a program that finds the longest sub-sequence that is identical in two DNA sequences. The program should prompt the user to enter two strings that represent DNA sequences. A DNA sequence can be represented as a sequence of the letters A, T, C, and G, which correspond to the nucleotides in one strand of the DNA molecule’s double helix. A string that represents a DNA sequence contains only the characters A, T, C, and G (no commas, spaces, or other characters).

The program should print the longest matching sub-sequence and the index where it begins in each of the two input sequences. Assume that there is only one longest sub-sequence.
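
One possible way to approach the search is sketched below. It is only an illustration, not the required solution. It assumes that "sub-sequence" means a contiguous run of characters (the assignment asks for the index where the match begins) and that indices start at 0. The function name longest_common_run and its return format are made up for this example.

```python
def longest_common_run(a, b):
    """Return (run, start_in_a, start_in_b) for the longest contiguous match.

    Assumption: indices are 0-based, and the first longest run found wins.
    """
    best_len, best_i, best_j = 0, 0, 0
    # prev[j] is the length of the matching run that ends at a[i-2] and b[j-1]
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                # extend the run that ended one position earlier in both strings
                curr[j] = prev[j - 1] + 1
                if curr[j] > best_len:
                    best_len = curr[j]
                    best_i = i - best_len
                    best_j = j - best_len
        prev = curr
    return a[best_i:best_i + best_len], best_i, best_j
```

For the inputs ATCGGA and TTCGGC, this sketch reports the run TCGG starting at index 1 in both strings. The nested loops compare every pair of positions, so the running time grows with the product of the two lengths.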

Submission

Test: Submit your test files as a zip file on the course Inquire site by 9AM on Monday October 22nd.

Code: Submit your code as a zip file on the course Inquire site by 5PM on Friday October 26th.

Extra

Purines and Pyrimidines: The techniques that sequence DNA are not perfect; sometimes they cannot determine whether a nucleotide is an A, T, C, or G. In this case the sequence will have an R if the nucleotide is a purine (A or G), a Y if it is a pyrimidine (C or T), or an N if it could be any nucleotide (A, T, C, or G). Modify your program so that it can read DNA sequence strings that contain R, Y, and N in addition to A, T, C, and G. The program should find the longest sub-sequence that may be a match.
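
The idea of a "possible match" can be sketched as follows. This is one interpretation, not a definitive one: it assumes two symbols may match whenever there is at least one nucleotide that both of them could stand for. The names CODES, may_match, and longest_possible_run are hypothetical, and indices are assumed to start at 0.

```python
# Each symbol maps to the set of nucleotides it could stand for.
CODES = {
    "A": {"A"}, "T": {"T"}, "C": {"C"}, "G": {"G"},
    "R": {"A", "G"},            # purine
    "Y": {"C", "T"},            # pyrimidine
    "N": {"A", "T", "C", "G"},  # any nucleotide
}


def may_match(x, y):
    """True if symbols x and y could stand for the same nucleotide."""
    return bool(CODES[x] & CODES[y])


def longest_possible_run(a, b):
    """Return (start_in_a, start_in_b, length) of the longest possible match."""
    best = (0, 0, 0)
    # prev[j] is the length of the possible match ending just before a[i-1], b[j-1]
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if may_match(a[i - 1], b[j - 1]):
                curr[j] = prev[j - 1] + 1
                if curr[j] > best[2]:
                    best = (i - curr[j], j - curr[j], curr[j])
        prev = curr
    return best
```

Under this interpretation R may match A, G, R, or N, but not Y, because no nucleotide is both a purine and a pyrimidine.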

FASTA: Bioinformatics software often uses the FASTA file format to store DNA sequences. Modify your program so that it finds the longest sub-sequence in the first two sequences of a FASTA file. In the FASTA format, a sequence is a line of capital letters that represent the nucleotides of the sequence. The line before each sequence begins with the “>” character and contains a description of the sequence that follows. The following is an example of a FASTA file:

>chupacabra DLF gene
CGATCGATCGATCGATCGATCGATCGATCGATCGATCGACTAATCGATCTAGCTCGATCGATCGATCTATGACGATCGGATCGTTACGGCGATCGACTAACGTCGTGATCGTCTACTGCGTCATAGCTGA
>yeti BRT1F gene
ATCGATCGATTGCTAGCTATATCGAGCGACGCTACGCTACGACTACGACTCGACTACGCTACGCGGATCTAACGTATCGTACGGATCTGTGACGGAGACACTGATCATGCGATACTATCGGCTATGCTGA

A file can be used as input to a Python program by using redirection in the terminal. For example, to use the file test_data.txt as input to the Python program compute.py, execute the following command in the terminal:

python3 compute.py < test_data.txt

Each call to the input() function in compute.py will return one line of text from test_data.txt. Find an actual FASTA file online to test your program on.