< Back

Assignment 5

This is a pair assignment. You should create a directory called assignment 5 in cs120/assignments for this assignment. All code written for this assignment should be stored in that directory.

$ cd ~/cs120/assignments
$ mkdir assignment5 
$ cd assignment5 


DNA Sequence Analysis

Finding genetic markers for disease, determining the phylogenics of organisms, and even determining paternity require aligning DNA sequences in order to make comparisons of similarity. It is not possible for individuals to analyzes gene sequences manually because they can be quite long. The human genome, for example, consists of more than 3.2 billion base pairs. However, it is also not possible for computers alone to analyze the data. Long repeated sequences and copy errors make aligning very difficult for computers. Instead DNA is analyzed by humans that use programs to find long sequences that match in order to align two DNA sequences for comparison.


Details

You will write a program that finds the longest shared subsequence between two DNA sequences. Each DNA sequence will be represented as a string that only contains the letters A (Adenine), T (Thymine), C (Cytosine), and G (Guanine). These letters correspond to the nucleotides in one strand of the DNA molecule's Double Helix. Your program should output a string, which is the longest sequence of contiguous characters that appears in both sequences.

For example, consider the following strings:

ACACACTCTGT
GTCACACGCTGT

The longest shared subsequence between these two strings is CACAC. This begins at index 1 in string 1 and index 2 in string 2.

The program should read, using input, two DNA sequences. It should print the longest matching subsequence, and the indicies of where it begins in both of the input sequences. You can assume there is only one longest sub-sequence.


"Hacker" Prompt

Each week, additional exercises related to the assignment will be provided at the end. These exercises are typically more challenging than the regular assignment. Bonus points will be provided to students who complete any of the "Hacker" level assignments.

  1. Purines and Pyrimidines: The techniques that sequence DNA are not perfect. Sometimes they are not able to determine whether a nucleotide is an 'A', 'T', 'C', or 'G'. In this case the sequence will have an 'R' if it is a purine ('A' or 'G'), a 'Y' if it is a pyrimidine, or an 'N' if it is any nucleotide ('A', 'T', 'C', or 'G'). Modify your program so that it can read DNA sequence strings that contain 'R', 'Y', and 'N', in addition to 'A', 'T', 'C', and 'G'. The program should find the longest sub-sequence that may be a match.

  2. FASTA: Bioinformatics software often uses the FASTA file format to store DNA sequences. Modify your program so that it finds the longest sub-sequence in the first two sequences of a FASTA file. In the FASTA format, a sequence is a line of text of capitalized characters that represent the nucleotides of the sequence. Before the line that contains a sequence is a line that begins with the ">" character and contains a description of the following sequence. The following is an example of a FASTA file:

    >chupacabra DLF gene
    CGATCGATCGATCGATCGATCGATCGATCGATCGATCGACTAATCGATCTAGCTCGATCGATCGATCTATGACGATCGGATCGTTACGGCGATCGACTAACGTCGTGATCGTCTACTGCGTCATAGCTGA
    >yeti BRT1F gene
    ATCGATCGATTGCTAGCTATATCGAGCGACGCTACGCTACGACTACGACTCGACTACGCTACGCGGATCTAACGTATCGTACGGATCTGTGACGGAGACACTGATCATGCGATACTATCGGCTATGCTGA

    A file can be used as input to a Python program by using redirection in the terminal. For example, to use the file test_data.txt as input to the Python program compute.py, execute the following command in the terminal.

    python3 compute.py < test_data.txt

    Each call to the input() function in the compute.py program will return one line of text from the test_data.txt file. Find an actual FASTA file on-line to test your program on.


Grading

The assignment will be graded on the following requirements according to the course’s programming assignment rubric.

Functionality (75%): A functional program will:

  • Reads two strings from the terminal,
  • Use loops to iterate over these strings,
  • Stores the longest subsequence,
  • Stores the indices of the longest subsequence,
  • Prints the longest subsequence and indices to the terminal.

Style (25%): A program with good style will:

  • include a header comment signifying the authors of the file,
  • avoid magic numbers (literal primitive numbers),
  • use meaningful names for variables and functions,
  • have statements that are small (80 characters or less including leading space) and do one thing, and
  • have functions that are small (40 lines or less including comments) and do one thing
  • have a comment above functions that includes the purpose, the pre-conditions, and the post-conditions of the function.
  • have spaces after commas in argument lists and spaces on both sides of binary operators (=, +, -, *, etc.).

Mastermind

Bulls and Cows is a word guessing game that turned into a popular code breaking game Mastermind.

Details

Create a Python program that plays a one-player, command-line version of the game Mastermind. The rules of the game are as follows:

  1. One player, the codemaker, creates a secret code that is any combination of four letters from A to D.
  2. The other player, the codebreaker, tries to guess the secret code by creating their own combination of four letters from A to D.
  3. The codemaker tells the codebreaker two things about their guess:
    1. The number of bulls, letters that are in the same location in the secret code.
    2. The number of cows, letters that are not bulls but are located elsewhere in the secret code.
  4. The codebreaker wins by getting 4 bulls in 10 or fewer guesses.
BAAA
I have created a secret code.  Try to guess it.
You have 10 guesses left.
Enter a guess (4 letters, A to D): AAD
Invalid input, Please try again.
Enter a guess (4 letters, A to D): AXBC
Invalid input, Please try again.
Enter a guess (4 letters, A to D): ABABA
Invalid input, Please try again.
Enter a guess (4 letters, A to D): DABC
You got 1 bull(s) and 1 cows(s).
You have 9 guesses left.
Enter a guess (4 letters, A to D): CABC
You got 1 bull(s) 1 cows(s).
You have 8 guesses left.
Enter a guess (4 letters, A to D): AABA
You got 2 bull(s) 2 cows(s).
You have 7 guesses left.
Enter a guess (4 letters, A to D): ABAA
You got 2 bull(s) 2 cows(s).
You have 6 guesses left.
Enter a guess (4 letters, A to D): BAAA
You got 4 bull(s) 0 cows(s).
You win!

Extra

Use the graphics library to make the game graphical. Use colored circles in a grid like the board game to represent the codes.

Grading

The assignment will be graded on the following requirements according to the course’s programming assignment rubric.

Functionality (75%): A functional program will:

  • Generate a random secret code.
  • Repeatedly prompt the user for a guess, until it is valid.
  • Print the number of bulls in the user's guess.
  • Print the number of cows in the user's guess.
  • Stop the game when the user wins or loses.

Style (25%): A program with good style will:

  • include a header comment signifying the authors of the file,
  • avoid magic numbers (literal primitive numbers),
  • use meaningful names for variables and functions,
  • have statements that are small (80 characters or less including leading space) and do one thing, and
  • have a comment above functions that includes the purpose, the pre-conditions, and the post-conditions of the function.
  • have spaces after commas in argument lists and spaces on both sides of binary operators (=, +, -, *, etc.).

Your program should include the traditional header, use appropriate variable names, and nicely label all values printed to the terminal. Submission are to be done through inquire.roanoke.edu through the Assignment 5 link. Both partners must submit through inquire!