Assignment

This is a pair assignment. You should create a directory called assignment in cs120/assignments for this assignment. All code written for this assignment should be stored in that directory.

$ cd ~/cs120/assignments
$ mkdir assignment 
$ cd assignment 


DNA Sequence Analysis

Finding genetic markers for disease, determining the phylogeny of organisms, and even determining paternity all require aligning DNA sequences so that their similarity can be compared. Individuals cannot analyze gene sequences by hand because the sequences can be very long; the human genome, for example, consists of more than 3.2 billion base pairs. Computers alone cannot analyze the data either, because long repeated sequences and copy errors make alignment very difficult for them. Instead, DNA is analyzed by people who use programs to find long matching sequences, which they then use to align two DNA sequences for comparison.


Details

You will write a program that finds the longest shared subsequence between two DNA sequences. Each DNA sequence will be represented as a string that contains only the letters A (adenine), T (thymine), C (cytosine), and G (guanine). These letters correspond to the nucleotides in one strand of the DNA molecule's double helix. In this assignment, a shared subsequence means a run of contiguous characters, so your program should output the longest string of contiguous characters that appears in both sequences.

For example, consider the following strings:

ACACACTCTGT
GTCACACGCTGT

The longest shared subsequence between these two strings is CACAC. Counting from index 0, it begins at index 1 in the first string and at index 2 in the second string.

The program should read two DNA sequences using the input function. It should print the longest matching subsequence and the indices where it begins in each of the input sequences. You can assume there is only one longest subsequence.
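
One way to approach the search, sketched here under some assumptions, is to try every pair of starting positions and count how far the characters keep matching. The function name longest_shared_run and its return format are illustrative and not required by the assignment; this is a brute-force sketch, not the only or the fastest approach.

```python
def longest_shared_run(first, second):
    """Return (run, index_in_first, index_in_second) for the longest
    run of contiguous characters that appears in both strings."""
    best_length = 0
    best_first = 0
    best_second = 0
    for i in range(len(first)):
        for j in range(len(second)):
            # Count how many characters match starting at i and j.
            length = 0
            while (i + length < len(first)
                   and j + length < len(second)
                   and first[i + length] == second[j + length]):
                length += 1
            # Keep only a strictly longer run, so the earliest one wins.
            if length > best_length:
                best_length = length
                best_first = i
                best_second = j
    run = first[best_first:best_first + best_length]
    return run, best_first, best_second
```

With the example strings above, longest_shared_run("ACACACTCTGT", "GTCACACGCTGT") returns ("CACAC", 1, 2). A complete program would still need to read the two sequences with input and print labeled results.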


"Hacker" Prompt

Each week, additional exercises related to the assignment will be provided at the end. These exercises are typically more challenging than the regular assignment. Bonus points will be provided to students who complete any of the "Hacker" level assignments.

  1. Purines and Pyrimidines: The techniques that sequence DNA are not perfect. Sometimes they cannot determine whether a nucleotide is an 'A', 'T', 'C', or 'G'. In this case the sequence will have an 'R' if it is a purine ('A' or 'G'), a 'Y' if it is a pyrimidine ('C' or 'T'), or an 'N' if it could be any nucleotide ('A', 'T', 'C', or 'G'). Modify your program so that it can read DNA sequence strings that contain 'R', 'Y', and 'N', in addition to 'A', 'T', 'C', and 'G'. The program should find the longest subsequence that may be a match.

  2. FASTA: Bioinformatics software often uses the FASTA file format to store DNA sequences. Modify your program so that it finds the longest shared subsequence of the first two sequences in a FASTA file. In the FASTA format, a sequence is a line of capitalized characters that represent the nucleotides of the sequence. The line before each sequence begins with the ">" character and contains a description of the sequence that follows. The following is an example of a FASTA file:

    >chupacabra DLF gene
    CGATCGATCGATCGATCGATCGATCGATCGATCGATCGACTAATCGATCTAGCTCGATCGATCGATCTATGACGATCGGATCGTTACGGCGATCGACTAACGTCGTGATCGTCTACTGCGTCATAGCTGA
    >yeti BRT1F gene
    ATCGATCGATTGCTAGCTATATCGAGCGACGCTACGCTACGACTACGACTCGACTACGCTACGCGGATCTAACGTATCGTACGGATCTGTGACGGAGACACTGATCATGCGATACTATCGGCTATGCTGA

    A file can be used as input to a Python program by using redirection in the terminal. For example, to use the file test_data.txt as input to the Python program compute.py, execute the following command in the terminal.

    python3 compute.py < test_data.txt

    Each call to the input() function in the compute.py program will return one line of text from the test_data.txt file. Find an actual FASTA file online and use it to test your program.
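
The two exercises above could be supported by small helpers like the ones sketched below. The names POSSIBLE_BASES, may_match, and read_fasta_sequences are illustrative and not part of the assignment. The sketch assumes, as in the example file, that each sequence sits on a single line; FASTA files found online often wrap a sequence across several lines, which this sketch does not handle.

```python
# Nucleotides that each code may stand for. R, Y, and N are the
# ambiguity codes described in the first exercise.
POSSIBLE_BASES = {
    "A": "A", "T": "T", "C": "C", "G": "G",
    "R": "AG", "Y": "CT", "N": "ATCG",
}


def may_match(first_code, second_code):
    """Return True if the two codes could be the same nucleotide."""
    first_bases = set(POSSIBLE_BASES[first_code])
    second_bases = set(POSSIBLE_BASES[second_code])
    return len(first_bases & second_bases) > 0


def read_fasta_sequences(lines, count):
    """Return the first `count` sequences found in FASTA-formatted
    lines, skipping blank lines and description lines (">")."""
    sequences = []
    for line in lines:
        line = line.strip()
        if line and not line.startswith(">"):
            sequences.append(line)
        if len(sequences) == count:
            break
    return sequences
```

For instance, may_match("R", "A") is True because a purine may be an 'A', while may_match("R", "Y") is False because no nucleotide is both a purine and a pyrimidine. The lines passed to read_fasta_sequences could come from repeated calls to input() or from any other source of text lines.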


Grading

The assignment will be graded on the following requirements according to the course’s programming assignment rubric.

Functionality (75%): A functional program will:

  • read two strings from the command line,
  • use loops to iterate over these strings,
  • store the longest subsequence,
  • store the indices of the longest subsequence, and
  • print the longest subsequence and indices to the terminal.

Style (25%): A program with good style will:

  • include a header comment signifying the authors of the file,
  • avoid magic numbers (literal primitive numbers),
  • use meaningful names for variables and functions,
  • have statements that are small (80 characters or less including leading space) and do one thing,
  • have functions that are small (40 lines or less including comments) and do one thing,
  • have a comment above functions that includes the purpose, the pre-conditions, and the post-conditions of the function, and
  • have spaces after commas in argument lists and spaces on both sides of binary operators (=, +, -, *, etc.).
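
As a small illustration of these style points, and not as part of a solution, a file might begin as sketched below. The constant MINIMUM_MATCH_LENGTH, its value, and the function is_reportable are hypothetical and are used only to show the conventions.

```python
# Authors: (both partners' names go here)
# Purpose: Small illustration of the style rules; not part of a solution.

# Hypothetical threshold, named so the literal never appears bare in code.
MINIMUM_MATCH_LENGTH = 4


# Purpose: Decide whether a shared run is long enough to report.
# Pre-conditions: match is a string.
# Post-conditions: returns True if match has at least
#                  MINIMUM_MATCH_LENGTH characters, otherwise False.
def is_reportable(match):
    return len(match) >= MINIMUM_MATCH_LENGTH
```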

Creative: Educational Game

Many people think of games as a waste of time or, worse, a corrupting influence on children. But games can also be a positive influence. A game that makes learning fun can boost a student's performance in school, which can have many positive effects later in life. In this assignment you will create an educational game.

Details

Create a Python program that uses the graphics.py module to create an educational video game. The game can be anything you want, but must:

  • use the graphics module to display the game.
  • use the graphics module for user input.
  • not use the command line for input or output.
  • display instructions to the user on how to play.
  • have the ability for the player to win or lose.
  • display a message when the player wins or loses.

Extra

Remarkable games will receive extra credit. I will be the arbiter of whether a game is remarkable.

Grading

The assignment will be graded on the following requirements according to the course’s programming assignment rubric.

Effort (40%): Here is an example game with average effort:

Functionality (35%): A functional program will:

  • use the graphics module to display the game.
  • use the graphics module for user input.
  • not use the command line for input or output.
  • display instructions on how to play.
  • have the ability to win or lose.
  • display a message when the player wins or loses.

Style (25%): A program with good style will:

  • include a header comment signifying the authors of the file.
  • avoid magic numbers (literal primitive numbers).
  • use meaningful names for variables and functions.
  • have statements that are small (80 characters or less including leading space) and do one thing.
  • have functions that are small (40 lines or less including comments) and do one thing.
  • have a comment above functions that includes the purpose, the pre-conditions, and the post-conditions of the function.
  • have spaces after commas in argument lists and spaces on both sides of binary operators (=, +, -, *, etc.).

Your program should include the traditional header, use appropriate variable names, and clearly label all values printed to the terminal. Submissions are to be made on inquire.roanoke.edu through the Assignment link. Both partners must submit through inquire!