CPSC 120 B




Lecture
MWF 10:50am - 11:50am
Lab
MWF 12:00pm - 1:00pm

Scotty Smith

Office
Trexler 365-B
Office Hours
Monday / Thursday
3:00pm - 5:00pm
Email
chssmithATroanoke.edu

< Back
Printable Copy

DNA Sequence Finding

Finding genetic markers for disease, determining the phylogenics of organisms, and even determining paternity require aligning DNA sequences in order to make comparisons of similarity. It is not possible for individuals to analyzes gene sequences manually because they can be quite long. The human genome, for example, consists of more than 3.2 billion base pairs. However, it is also not possible for computers alone to analyze the data. Long repeated sequences and copy errors make aligning very difficult for computers. Instead DNA is analyzed by humans that use programs to find sequences that match in order to align two DNA sequences for comparison.

For this assignment, you will write a program that will take two inputs: A long string representing some DNA sequence, and a string that is a portion of some other DNA sequence. Your program will find out if, where, and how often the second string occurs in the first.


Setup

Create a directory assignment6 under assignments in your cs120 directory. All code for the assignment should be stored in this directory.

cd ~/cs120/assignments
mkdir assignment6
cd assignment6

Details

DO NOT USE ANY BUILT IN STRING FUNCTIONS FOR THIS ASSIGNMENT!

Your program is required to take two string inputs: A genomic sequence, and a genomic sub-sequence. Both strings will only contain the letters A (Adenine), T (Thymine), C (Cytosine), and G (Guanine). These letters correspond to the nucleotides in one strand of the DNA molecule's Double Helix.

The second input to your program will necessarily be smaller than the first. Your program should print out several things:

  1. The starting index (in the main sequence) that the sub-sequence appeared [FOR ALL MATCHES]
  2. The number of times the sub-sequence appeared in the main sequence

For example, consider the following inputs:

  ACACACTCTGT
  CAC

Then your code should output:

  Match at location 1.
  Match at location 3.
  2 total matches.

However, if the input was:

  ACACACTCTGT
  TAG

Your output should be:

  0 total matches.

While there are no required functions for this assignment, you should use functions where appropriate.


Submission

You are required to submit a tar file to http://cseval.roanoke.edu/. On cseval, there is a link for Assignment 6. You can create a tar file by issuing the following commands:

cd ~/cs120/assignments
tar czvf assignment6.tgz assignment6/
Both partners must submit through cseval!

"Hacker" Prompt

  1. Purines and Pyrimidines: The techniques that sequence DNA are not perfect, sometimes they are not able to determine whether a nucleotide is an A, T, C, or G. In this case the sequence will have an R if it is a purine (A or G), a Y if it is a pyrimidine (T or C), or an N if it is any nucleotide (A, T, C, or G). Modify your program so that it can read DNA sequence strings that contain R, Y, and N, in addition to A, T, C, and G. The program should find the locations that may be a match as well.

  2. Longest Substring Matching: Often times, there are multiple DNA sub-sequences that can match in a given DNA sequence. When trying to compare the Genomes of two entities, what is typically done is to find the longest possible substring to match between two strings. This could then be used to align the sequences, to compare differences between them.

    Write a function called find_longest_substring that takes two DNA sequences (as Strings, which could be of the same length) And finds the longest substring that BOTH DNA sequences contain. If multiple longest substrings exist, return the first one found.:

        ACACTCTGTGACTGGGTACACG
        CAGTACTCACAGCTACTACCCG
    
        The longest matching substring is ACTC
  3. FASTA: Bioinformatics software often uses the FASTA file format to store DNA sequences. Modify your program so that it finds the longest sub-sequence in the first two sequences of a FASTA file. In the FASTA format, a sequence is a line of text of capitalized characters that represent the nucleotides of the sequence. Before the line that contains a sequence is a line that begins with the '>' character and contains a description of the following sequence. The following is an example of a FASTA file:

    >chupacabra DLF gene
    CGATCGATCGATCGATCGATCGATCGATCGATCGATCGACTAATCGATCTAGCTCGATCGATCGATCTATGACGATCGGATCGTTACGGCGATCGACTAACGTCGTGATCGTCTACTGCGTCATAGCTGA
    >yeti BRT1F gene
    ATCGATCGATTGCTAGCTATATCGAGCGACGCTACGCTACGACTACGACTCGACTACGCTACGCGGATCTAACGTATCGTACGGATCTGTGACGGAGACACTGATCATGCGATACTATCGGCTATGCTGA

    A file can be used as input to a Python program by using redirection in the terminal. For example, to use the file test_data.txt as input to the Python program DNA_match.py, execute the following command in the terminal.

    python3 DNA_match.py < test_data.txt
    

    Each call to the input() function in the compute.py program will return one line of text from the test_data.txt file. Find an actual FASTA file on-line to test your program on.