CPSC120B
Fundamentals of Computer Science I

Assignment 5

Strings

Finding genetic markers for disease, determining the phylogenics of organisms, and even determining paternity require aligning DNA sequences in order to make comparisons of similarity. It is not possible for individuals to analyzes gene sequences manually because they can be quite long. The human genome, for example, consists of more than 3.2 billion base pairs. However, it is also not possible for computers alone to analyze the data. Long repeated sequences and copy errors make aligning very difficult for computers. Instead DNA is analyzed by humans that use programs to find long sequences that match in order to align two DNA sequences for comparison.


Details

You will write a program that finds the longest shared subsequence between two DNA sequences. Each DNA sequence will be represented as a string that only contains the letters A (Adenine), T (Thymine), C (Cytosine), and G (Guanine). These letters correspond to the nucleotides in one strand of the DNA molecule's Double Helix. Your program should output a string, which is the longest sequence of contiguous characters that appears in both sequences.

For example, consider the following strings:

ACACACTCTGT
GTCACACGCTGT

The longest shared subsequence between these two strings is CACAC. This begins at index 1 in string 1 and index 2 in string 2.

The program should read, using input, two DNA sequences. It should print the longest matching subsequence, and the indicies of where it begins in both of the input sequences. You can assume there is only one longest sub-sequence.


"Hacker" Prompt

Each week, additional exercises related to the assignment will be provided at the end. These exercises are typically more challenging than the regular assignment. Bonus points will be provided to students who complete any of the "Hacker" level assignments.

  1. Purines and Pyrimidines: The techniques that sequence DNA are not perfect. Sometimes they are not able to determine whether a nucleotide is an 'A', 'T', 'C', or 'G'. In this case the sequence will have an 'R' if it is a purine ('A' or 'G'), a 'Y' if it is a pyrimidine, or an 'N' if it is any nucleotide ('A', 'T', 'C', or 'G'). Modify your program so that it can read DNA sequence strings that contain 'R', 'Y', and 'N', in addition to 'A', 'T', 'C', and 'G'. The program should find the longest sub-sequence that may be a match.

  2. FASTA: Bioinformatics software often uses the FASTA file format to store DNA sequences. Modify your program so that it finds the longest sub-sequence in the first two sequences of a FASTA file. In the FASTA format, a sequence is a line of text of capitalized characters that represent the nucleotides of the sequence. Before the line that contains a sequence is a line that begins with the ">" character and contains a description of the following sequence. The following is an example of a FASTA file:

    >chupacabra DLF gene
    CGATCGATCGATCGATCGATCGATCGATCGATCGATCGACTAATCGATCTAGCTCGATCGATCGATCTATGACGATCGGATCGTTACGGCGATCGACTAACGTCGTGATCGTCTACTGCGTCATAGCTGA
    >yeti BRT1F gene
    ATCGATCGATTGCTAGCTATATCGAGCGACGCTACGCTACGACTACGACTCGACTACGCTACGCGGATCTAACGTATCGTACGGATCTGTGACGGAGACACTGATCATGCGATACTATCGGCTATGCTGA

    A file can be used as input to a Python program by using redirection in the terminal. For example, to use the file test_data.txt as input to the Python program compute.py, execute the following command in the terminal.

    python3 compute.py < test_data.txt

    Each call to the input() function in the compute.py program will return one line of text from the test_data.txt file. Find an actual FASTA file on-line to test your program on.


Grading

The assignment will be graded on the following requirements according to the course’s programming assignment rubric.

Functionality (75%): A functional program will:

  • Reads two strings from the terminal,
  • Use loops to iterate over these strings,
  • Stores the longest subsequence,
  • Stores the indices of the longest subsequence,
  • Prints the longest subsequence and indices to the terminal.

Style (25%): A program with good style will:

  • include a header comment signifying the authors of the file,
  • avoid magic numbers (literal primitive numbers),
  • use meaningful names for variables and functions,
  • have statements that are small (80 characters or less including leading space) and do one thing, and
  • have functions that are small (40 lines or less including comments) and do one thing
  • have a comment above functions that includes the purpose, the pre-conditions, and the post-conditions of the function.
  • have spaces after commas in argument lists and spaces on both sides of binary operators (=, +, -, *, etc.).

Bulls and Cows is a word guessing game that turned into a popular code breaking game Mastermind.

Details

Create a Python program that plays a one-player, command-line version of the game Mastermind. The rules of the game are as follows:

  1. One player, the codemaker, creates a secret code that is any combination of four letters from A to D.
  2. The other player, the codebreaker, tries to guess the secret code by creating their own combination of four letters from A to D.
  3. The codemaker tells the codebreaker two things about their guess:
    1. The number of bulls, letters that are in the same location in the secret code.
    2. The number of cows, letters that are not bulls but are located elsewhere in the secret code.
  4. The codebreaker wins by getting 4 bulls in 10 or fewer guesses.
BAAA
I have created a secret code.  Try to guess it.
You have 10 guesses left.
Enter a guess (4 letters, A to D): AAD
Invalid input, Please try again.
Enter a guess (4 letters, A to D): AXBC
Invalid input, Please try again.
Enter a guess (4 letters, A to D): ABABA
Invalid input, Please try again.
Enter a guess (4 letters, A to D): DABC
You got 1 bull(s) and 1 cows(s).
You have 9 guesses left.
Enter a guess (4 letters, A to D): CABC
You got 1 bull(s) 1 cows(s).
You have 8 guesses left.
Enter a guess (4 letters, A to D): AABA
You got 2 bull(s) 2 cows(s).
You have 7 guesses left.
Enter a guess (4 letters, A to D): ABAA
You got 2 bull(s) 2 cows(s).
You have 6 guesses left.
Enter a guess (4 letters, A to D): BAAA
You got 4 bull(s) 0 cows(s).
You win!

Extra

Use the graphics library to make the game graphical. Use colored circles in a grid like the board game to represent the codes.

Grading

The assignment will be graded on the following requirements according to the course’s programming assignment rubric.

Functionality (75%): A functional program will:

  • Generate a random secret code.
  • Repeatedly prompt the user for a guess, until it is valid.
  • Print the number of bulls in the user's guess.
  • Print the number of cows in the user's guess.
  • Stop the game when the user wins or loses.

Style (25%): A program with good style will:

  • include a header comment signifying the authors of the file,
  • avoid magic numbers (literal primitive numbers),
  • use meaningful names for variables and functions,
  • have statements that are small (80 characters or less including leading space) and do one thing, and
  • have a comment above functions that includes the purpose, the pre-conditions, and the post-conditions of the function.
  • have spaces after commas in argument lists and spaces on both sides of binary operators (=, +, -, *, etc.).

Simon is a memory game that was popular in the 1980s that is similar to the children's game Simon Says. The toy Simon plays series of tones accompanied by flashing colored lights and the player has to remember the pattern and recreate it by pressing buttons.

Details

Create a circuit that plays Simon. The circuit should have 4 LEDs, each a different color, 4 push buttons, each near a different color LED, and a piezo buzzer. When one of the LED's is illuminated it should play a tone, each LED should have a different tone. The program should generate a string that consists of a series of letters that correspond to the different LED lights. For example, "RGGB" would represent the sequence red, green, green, blue. The circuit should display the random sequence as LED flashes and tones. Only one LED should be lit at a time, the flashes should always be the same duration, and each flash should be separated by a period with both LEDs unlit. After the sequence of flashes and tones, the circuit should record a sequence of button presses. When the button is pressed, the corresponding LED should be lit up and the corresponding tone should be played. If while entering a sequence the user makes an error, presses a button different from the random sequence, the circuit should inform the user of their error by flashing all of the lights and playing a tone.

Challenge

The actual simon game gets progressively harder by increasing the length of the random sequence. Modify the circuit's program to include multiple patterns with increasing length.

Grading

The assignment will be graded on the following requirements according to the course’s programming assignment rubric.

Functionality (75%): A functional program will:

  • have a circuit with 4 LEDs, 4 buttons, and a Piezo buzzer.
  • generate a random string of letters that correspond to LEDs.
  • play the sequence of LEDs flashes and buzzer tones stored in a string.
  • Light the LED and play the buzzer tone that corresponds to a button.
  • Inform the user of a sequence input error by flashing the LEDs and playing a buzzer tone.

Style (25%): A program with good style will:

  • include a header comment signifying the authors of the file,
  • avoid magic numbers (literal primitive numbers),
  • use meaningful names for variables and functions,
  • have statements that are small (80 characters or less including leading space) and do one thing, and
  • have spaces after commas in argument lists and spaces on both sides of binary operators (=, +, -, *, etc.).

Submission

Submit your program as a .py file on the course Inquire page before class on Wednesday November 2nd.