Finding genetic markers for disease, determining the phylogenics of
organisms, and even determining paternity require aligning DNA
sequences in order to make comparisons of similarity. It is not
possible for individuals to analyzes gene sequences manually because
they can be quite long. The human genome, for example, consists of
more than 3.2 billion base pairs. However, it is also not possible
for computers alone to analyze the data. Long repeated sequences and
copy errors make aligning very difficult for computers. Instead DNA
is analyzed by humans that use programs to find long sequences that
match in order to align two DNA sequences for comparison.
Details
You will write a program that finds the longest shared subsequence
between two DNA sequences. Each DNA sequence will be represented as
a string that only contains the letters A (Adenine), T (Thymine), C
(Cytosine), and G (Guanine). These letters correspond to the
nucleotides in one strand of the DNA molecule's Double Helix. Your
program should output a string, which is the longest sequence of
contiguous characters that appears in both sequences.
For example, consider the following strings:
ACACACTCTGT
GTCACACGCTGT
The longest shared subsequence between these two strings
is CACAC
. This begins at index 1 in string 1 and index
2 in string 2.
The program should read, using input
, two DNA sequences.
It should print the longest matching subsequence, and the indicies
of where it begins in both of the input sequences. You can assume
there is only one longest sub-sequence.
"Hacker" Prompt
Each week, additional exercises related to the assignment will be
provided at the end. These exercises are typically more challenging
than the regular assignment. Bonus points will be provided to
students who complete any of the "Hacker" level assignments.
-
Purines and Pyrimidines: The techniques that sequence DNA
are not perfect. Sometimes they are not able to determine whether
a nucleotide is an 'A', 'T', 'C', or 'G'. In this case the
sequence will have an 'R' if it is a purine ('A' or 'G'), a 'Y' if
it is a pyrimidine, or an 'N' if it is any nucleotide ('A', 'T',
'C', or 'G'). Modify your program so that it can read DNA sequence
strings that contain 'R', 'Y', and 'N', in addition to 'A', 'T',
'C', and 'G'. The program should find the longest sub-sequence
that may be a match.
-
FASTA:
Bioinformatics software often uses the FASTA file format to
store DNA sequences. Modify your program so that it finds the
longest sub-sequence in the first two sequences of a FASTA
file. In the FASTA format, a sequence is a line of text of
capitalized characters that represent the nucleotides of the
sequence. Before the line that contains a sequence is a line
that begins with the ">" character and contains a description
of the following sequence. The following is an example of a
FASTA file:
>chupacabra DLF gene
CGATCGATCGATCGATCGATCGATCGATCGATCGATCGACTAATCGATCTAGCTCGATCGATCGATCTATGACGATCGGATCGTTACGGCGATCGACTAACGTCGTGATCGTCTACTGCGTCATAGCTGA
>yeti BRT1F gene
ATCGATCGATTGCTAGCTATATCGAGCGACGCTACGCTACGACTACGACTCGACTACGCTACGCGGATCTAACGTATCGTACGGATCTGTGACGGAGACACTGATCATGCGATACTATCGGCTATGCTGA
A file can be used as input to a Python program by using redirection
in the terminal. For example, to use the
file test_data.txt
as input to the Python
program compute.py
, execute the following command in
the terminal.
python3 compute.py < test_data.txt
Each call to the input()
function in
the compute.py
program will return one line of text from
the test_data.txt
file. Find an actual FASTA file on-line
to test your program on.