GENOME 373: Genomic Informatics

Homework 4

Due Wednesday, April 29, at the beginning of class. Homework turned in more than five minutes after the start of class will be marked as late and penalized 10% per day thereafter.

  1. (15 points) Here are five occurrences of a length-4 DNA motif:
    ACTG
    ATTG
    ATTC
    ACCG
    ACTG
    

    Use these background frequencies to answer the subsequent questions: A=0.309, C=0.191, G=0.191, T=0.309.

    • Write down the corresponding counts matrix.
    • Write down the matrix after adding pseudocounts with beta=1.
    • Write down the frequency matrix.
    • Write down the log-odds matrix, using log base 2.
  2. (10 points) Here is a log-odds matrix for a length-5 DNA motif:
    A  -2.45 -10.81 -10.81   1.55 -10.81
    C  -1.82  -1.82  -1.82  -1.82   0.99
    G   2.09 -10.81   2.27 -10.81  -0.23
    T  -2.45   1.63 -10.81  -2.45   0.35
    

    Show how to linearly rescale the matrix values to be integers between 0 and 1000.

  3. (15 points) Here is an integer matrix for a length-3 DNA motif:
    A 5 10  5
    C 0  1  1
    G 1  1  5
    T 3  2  1
    

    Fill in the dynamic programming matrix to compute the scores of all possible sequences of length 3. What is the p-value associated with a score of 16?

  4. (5 points) A profile HMM contains three types of states: match, insert and delete states. Which of these three state types differs from the other two, and in what way?
  5. (2 points) What is the name of the specific dynamic programming algorithm that finds the best (i.e., most probable) alignment between an HMM and a sequence?
  6. (3 points) What does the "Markov" in "hidden Markov model" refer to?
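
The matrix manipulations in problems 1 through 3 can be sketched in Python. This is a minimal sketch rather than a reference solution. The function names are hypothetical. The pseudocount rule (beta times the background frequency is added to each count) is an assumption about the intended convention, and the p-value assumes the four letters are equally likely, since problem 3 does not give a background. The score distribution is stored sparsely as a dictionary instead of a full score-by-position table.

```python
import math
from collections import Counter

ALPHABET = "ACGT"


def log_odds_matrix(sites, background, beta=1.0):
    """Counts -> pseudocounts -> frequencies -> log2 odds, one dict per column.

    Assumption: the pseudocount for each letter is beta times that letter's
    background frequency, so every column total grows by exactly beta.
    """
    n_sites = len(sites)
    matrix = []
    for column in zip(*sites):  # one tuple of letters per motif position
        counts = Counter(column)  # letters never seen count as 0
        col = {}
        for letter in ALPHABET:
            freq = (counts[letter] + beta * background[letter]) / (n_sites + beta)
            col[letter] = math.log2(freq / background[letter])
        matrix.append(col)
    return matrix


def rescale_to_integers(matrix, top=1000):
    """Linearly map values so the minimum becomes 0 and the maximum becomes top.

    Assumes the matrix holds at least two distinct values.
    """
    values = [v for col in matrix for v in col.values()]
    low, high = min(values), max(values)
    scale = top / (high - low)
    return [{k: round((v - low) * scale) for k, v in col.items()} for col in matrix]


def score_counts(int_matrix):
    """Dynamic programming over columns: result[s] = number of sequences scoring s."""
    counts = {0: 1}  # the empty prefix has score 0
    for col in int_matrix:
        extended = Counter()
        for score, n in counts.items():
            for letter in ALPHABET:
                extended[score + col[letter]] += n
        counts = extended
    return counts


def p_value(int_matrix, threshold):
    """Fraction of sequences scoring >= threshold, assuming equiprobable letters."""
    counts = score_counts(int_matrix)
    total = sum(counts.values())
    return sum(n for s, n in counts.items() if s >= threshold) / total


# Usage with the data from problem 1.
background = {"A": 0.309, "C": 0.191, "G": 0.191, "T": 0.309}
sites = ["ACTG", "ATTG", "ATTC", "ACCG", "ACTG"]
log_odds = log_odds_matrix(sites, background, beta=1.0)
integer_version = rescale_to_integers(log_odds)
```

Each matrix is a list with one dictionary per motif position, so `log_odds[0]["A"]` is the entry for A in the first column. An integer matrix in the same layout can be passed to `p_value` together with a score threshold.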

Optional programming practice problems

  1. Write a program that takes as command line arguments two DNA or protein sequences and prints a matrix in which the entry in the ith row and jth column indicates whether the length-3 substring starting at position i in the first sequence is equal to the length-3 substring starting at position j in the second sequence. Use an "X" to indicate identical substrings and an "O" to indicate different substrings.
    > python make-similarity-matrix.py ACGTAG ACGGGTA
      ACGGGTA
    A XOOOO
    C OOOOO
    G OOOOX
    T OOOOO
    A
    G
    
  2. Write a program that reads a whitespace-delimited matrix of floating-point numbers from a file and prints the sum of the entries in each column.
    > cat matrix.txt
    0.1 1.2 3.4
    0.9 9.7 2.2
    4.3 3.7 4.4
    > python column-sums.py matrix.txt
    5.3 14.6 10.0
    
  3. Write a program that reads a PSSM from one file and a DNA sequence from a second file and then scans the PSSM across the sequence, printing the resulting scores. Assume that the PSSM is written with rows corresponding to "A", "C", "G", "T".
    > cat matrix.txt
    A 1  4  0
    C 2  1  2
    G 0  7  0
    T 4  0  8
    > cat dna.txt
    GTTACGA
    > python scan-pssm.py matrix.txt dna.txt
    8 4 10 2 9
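
One possible shape for the PSSM scanner in the last practice problem is sketched below. It is a starting point, not a model answer. The names `parse_pssm` and `scan_pssm` are hypothetical, and the sketch assumes the file layout shown in the example: one row per letter, with the letter first and the scores after it.

```python
import sys


def parse_pssm(lines):
    """Turn lines such as 'A 1 4 0' into {'A': [1.0, 4.0, 0.0], ...}."""
    pssm = {}
    for line in lines:
        fields = line.split()
        if fields:  # skip blank lines
            pssm[fields[0]] = [float(value) for value in fields[1:]]
    return pssm


def scan_pssm(pssm, sequence):
    """Score every window of the motif's width along the sequence."""
    width = len(pssm["A"])
    scores = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        # Add the matrix entry for each letter at its column within the window.
        scores.append(sum(pssm[letter][col] for col, letter in enumerate(window)))
    return scores


# Run as: python scan-pssm.py matrix.txt dna.txt
if __name__ == "__main__" and len(sys.argv) == 3:
    with open(sys.argv[1]) as matrix_file:
        pssm = parse_pssm(matrix_file)
    with open(sys.argv[2]) as dna_file:
        sequence = dna_file.read().strip()
    print(" ".join(format(score, "g") for score in scan_pssm(pssm, sequence)))
```

Keeping the parsing and the scanning in separate functions makes it possible to check the scanner on a small in-memory matrix before any files are involved.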