CPSC150A
Scientific Computing

Assignment 4

DNA Sequence Alignment

Finding genetic markers for disease, determining the phylogenics of organisms, and even determining paternity require aligning DNA sequences in order to make comparisons of similarity. It is not possible for individuals to analyzes gene sequences manually because they can be quite long. The human genome, for example, consists of more than 3.2 billion base pairs. However, it is also not possible for computers alone to analyze the data. Long repeated sequences and copy errors make aligning very difficult for computers. Instead DNA is analyzed by humans that use programs to find long sequences that match in order to align two DNA sequences for comparison.


Details

You will write a program that finds the longest shared subsequence between two DNA sequences. Each DNA sequence will be represented as a string that only contains the letters A (Adenine), T (Thymine), C (Cytosine), and G (Guanine). These letters correspond to the nucleotides in one strand of the DNA molecule’s Double Helix. Your program should output a string, which is the longest sequence of contiguous characters that appears in both sequences.

For example, consider the following strings:

ACACACTCTGT
GTCACACGCTGT

The longest shared subsequence between these two strings is CACAC. This begins at index 1 in string 1 and index 2 in string 2.

The program should read, using input, two DNA sequences. It should print the longest matching subsequence, and the indicies of where it begins in both of the input sequences. You can assume there is only one longest sub-sequence.


Grading

The assignment will be graded on the following requirements according to the course’s programming assignment rubric.

A functional program will:

Submission

Submit your program as a .py file on the course Inquire page before class on Thursday March 2nd.