GENOME 373: Genomic Informatics

Homework 3

Due Wednesday, April 22, at the beginning of class. Homework turned in more than five minutes after the start of class will be marked as late and penalized 10% per day thereafter.

  1. (15 points) For the following alignment, use the BLOSUM62 matrix (ftp://ftp.ncbi.nih.gov/blast/matrices/BLOSUM62) to compute the score for every set of 3 contiguous columns that do not contain an indel. How many 3-mers receive a score greater than 6?
    YECNERSKA-SCPSHLQ--KRRQIG
    QECNQCGKAFAQHSSLKCHYRTHIG

  2. (15 points) Fill in the following matrix such that the score in row i, column j is the score for aligning the length-three segment starting at position i in the first sequence (HKRT) with the length-three segment starting at position j in the second sequence (QYHERTHTG). Use the BLOSUM62 matrix. I have filled in the first entry for you already. Note that some of the matrix cells along the edges should be left blank.
           Q   Y   H   E   R   T   H   T   G
       H  -2
       K
       R
       T

  3. (10 points) Following is a list of p-values, created by searching a small database of proteins with the Smith-Waterman algorithm. The list includes one p-value for each target protein in the database. Using a Bonferroni correction, how many p-values are deemed significant if we use a threshold of 0.01?
    0.00012, 0.0021, 0.0050, 0.0089, 0.0044, 0.0032, 0.0021, 0.033, 0.13, 0.15, 0.18, 0.29, 0.30, 0.33, 0.33, 0.35, 0.45, 0.47, 0.49, 0.55, 0.56, 0.57, 0.62, 0.63, 0.63, 0.67, 0.78, 0.83, 0.84, 0.95, 0.99

  4. (10 points) Convert the list of p-values from the previous question into a list of E-values.
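
The arithmetic behind the two p-value questions can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, the input list is made up (it is not the list from the homework), a p-value is called significant when it is strictly less than the corrected cutoff, and an E-value is taken to be the p-value multiplied by the number of targets in the database.

```python
def bonferroni_significant(p_values, alpha):
    """Return the p-values that pass a Bonferroni-corrected threshold.

    Assumption: the number of tests equals the number of p-values, and
    "significant" means strictly less than alpha / number_of_tests.
    """
    cutoff = alpha / len(p_values)
    return [p for p in p_values if p < cutoff]


def to_e_values(p_values):
    """Convert p-values to E-values.

    Assumption: E-value = p-value * number of targets in the database.
    """
    n = len(p_values)
    return [p * n for p in p_values]


# Made-up example data, NOT the list from the homework.
example = [0.0004, 0.003, 0.02, 0.2, 0.9]

print(bonferroni_significant(example, alpha=0.01))
print(to_e_values(example))
```

With five made-up p-values and a threshold of 0.01, the corrected cutoff is 0.01 / 5 = 0.002, so only 0.0004 passes.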

Optional programming practice problems

  1. Write a program to search a given file for occurrences of a specified sequence. Print the line number and position of each occurrence in the file. For example, say that the file myfile.txt contains this text:
    The quick brown fox
    The quick and the dead
    Pretty darn quick
    Molasses in January
    
    Then you could run your program as follows:
    > search-for-string.py quick myfile.txt
    Line 1, position 5
    Line 2, position 5
    Line 3, position 13
    > search-for-string.py x myfile.txt
    Line 1, position 19
    
  2. Write a program that prints the BLOSUM62 score for a given pair of amino acids. You can include the matrix itself in your program code, rather than reading it from a file.
    > python print-blosum62.py K L
    -2

  3. Write a program that reads a two-line DNA alignment from a file and scores it using +10 for identical bases, 0 for transitions, -5 for transversions and -10 for gaps (linear gap penalty). For example, say that myfile.txt contains the following two lines:
    AACGTGA
    AAG-TAC
    
    Then you could score this alignment as follows:
    > python score-dna-alignment.py myfile.txt
    10 + 10 + -5 + -10 + 10 + 0 + -5 = 10
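
One possible way to approach the scoring rule in the last problem is sketched below. It is a minimal sketch, not a reference solution: the function names are hypothetical, the two sequences are passed in as strings rather than read from a file, and a column is treated as a gap if either sequence has a "-" in it. The scores themselves (+10, 0, -5, -10) come from the problem statement.

```python
# Purines pair with purines, and pyrimidines with pyrimidines, as transitions.
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def score_column(a, b):
    """Score one alignment column (hypothetical helper)."""
    if a == "-" or b == "-":
        return -10  # linear gap penalty
    if a == b:
        return 10   # identical bases
    if {a, b} <= PURINES or {a, b} <= PYRIMIDINES:
        return 0    # transition: A<->G or C<->T
    return -5       # transversion: purine <-> pyrimidine


def score_alignment(seq1, seq2):
    """Return the list of per-column scores for two aligned sequences."""
    if len(seq1) != len(seq2):
        raise ValueError("aligned sequences must have the same length")
    return [score_column(a, b) for a, b in zip(seq1.upper(), seq2.upper())]


scores = score_alignment("AACGTGA", "AAG-TAC")
print(" + ".join(str(s) for s in scores), "=", sum(scores))
# prints: 10 + 10 + -5 + -10 + 10 + 0 + -5 = 10
```

Returning the per-column scores rather than only the total makes it easy to print the sum in the same form as the example output above.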