GENOME 373: Genomic Informatics Homework 4
Due Wednesday, April 29, at the beginning of class. Homework turned in more than five minutes after the start of class will be marked as late and penalized 10% per day thereafter.
- (15 points) Here are five occurrences of a length-4 DNA motif:
ACTG ATTG ATTC ACCG ACTG
Use these background frequencies to answer the subsequent questions: A=0.309, C=0.191, G=0.191, T=0.309.
  - Write down the corresponding counts matrix.
  - Write down the matrix after adding pseudocounts with beta=1.
  - Write down the frequency matrix.
  - Write down the log-odds matrix, using log base 2.
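The four steps in this problem can be checked with a short script. The following is a minimal sketch, not a reference solution. It assumes that the pseudocount added for each base is beta times that base's background frequency; if your notes instead add beta uniformly to every entry, change `add_pseudocounts` accordingly. All function names are illustrative.

```python
import math

BASES = "ACGT"

def count_matrix(sites):
    """Count how often each base occurs at each motif position."""
    width = len(sites[0])
    return {b: [sum(site[i] == b for site in sites) for i in range(width)]
            for b in BASES}

def add_pseudocounts(counts, background, beta):
    """Assumed convention: add beta * background[b] to every count for base b."""
    return {b: [c + beta * background[b] for c in counts[b]] for b in BASES}

def frequency_matrix(pseudo):
    """Divide each entry by its column total so every column sums to 1."""
    width = len(pseudo["A"])
    totals = [sum(pseudo[b][i] for b in BASES) for i in range(width)]
    return {b: [pseudo[b][i] / totals[i] for i in range(width)] for b in BASES}

def log_odds_matrix(freqs, background):
    """Log base 2 of (motif frequency / background frequency)."""
    return {b: [math.log2(f / background[b]) for f in freqs[b]] for b in BASES}

sites = ["ACTG", "ATTG", "ATTC", "ACCG", "ACTG"]
background = {"A": 0.309, "C": 0.191, "G": 0.191, "T": 0.309}

counts = count_matrix(sites)
pseudo = add_pseudocounts(counts, background, beta=1)
freqs = frequency_matrix(pseudo)
log_odds = log_odds_matrix(freqs, background)
for b in BASES:
    print(b, " ".join(f"{x:6.2f}" for x in log_odds[b]))
```

Because the pseudocounts are proportional to the background, no frequency is ever zero, so the logarithm is always defined.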
- (10 points) Here is a log-odds matrix for a length-5 DNA motif:
A  -2.45 -10.81 -10.81   1.55 -10.81
C  -1.82  -1.82  -1.82  -1.82   0.99
G   2.09 -10.81   2.27 -10.81  -0.23
T  -2.45   1.63 -10.81  -2.45   0.35
Show how to linearly rescale the matrix values to be integers between 0 and 1000.
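One way to check a rescaling is to shift every value by the matrix minimum, divide by the range, multiply by 1000, and round. The sketch below assumes that this min-max mapping with rounding to the nearest integer is the intended approach; the function name is illustrative.

```python
def rescale_to_integers(matrix, new_max=1000):
    """Linearly map all values onto integers in [0, new_max]."""
    values = [v for row in matrix.values() for v in row]
    lo, hi = min(values), max(values)
    # The smallest entry maps to 0 and the largest to new_max.
    return {b: [round((v - lo) / (hi - lo) * new_max) for v in row]
            for b, row in matrix.items()}

log_odds = {
    "A": [-2.45, -10.81, -10.81, 1.55, -10.81],
    "C": [-1.82, -1.82, -1.82, -1.82, 0.99],
    "G": [2.09, -10.81, 2.27, -10.81, -0.23],
    "T": [-2.45, 1.63, -10.81, -2.45, 0.35],
}

for base, row in rescale_to_integers(log_odds).items():
    print(base, " ".join(f"{x:4d}" for x in row))
```

Rounding makes the integer matrix an approximation of the original, so scores computed from it differ slightly from the rescaled real-valued scores.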
- (15 points) Here is an integer matrix for a length-3 DNA motif:
A  5 10  5
C  0  1  1
G  1  1  5
T  3  2  1
Fill in the dynamic programming matrix to compute the scores of all possible sequences of length 3. What is the p-value associated with a score of 16?
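A hand-filled dynamic programming matrix can be checked with a few lines of code. The sketch below builds the matrix one motif column at a time, where entry `counts[s]` is the number of sequences seen so far with total score `s`. It assumes non-negative integer scores and that all sequences are equally likely, so the p-value is the fraction of sequences scoring at or above the threshold. If your notes weight sequences by the background frequencies instead, the update step needs to add probabilities rather than counts. Function names are illustrative.

```python
def score_counts(matrix):
    """Return a list where counts[s] is the number of sequences scoring s."""
    width = len(next(iter(matrix.values())))
    max_score = sum(max(matrix[b][i] for b in matrix) for i in range(width))
    counts = [0] * (max_score + 1)
    counts[0] = 1  # the empty sequence has score 0
    for i in range(width):
        new_counts = [0] * (max_score + 1)
        for score, n in enumerate(counts):
            if n:
                # Extend every sequence with this score by each possible base.
                for b in matrix:
                    new_counts[score + matrix[b][i]] += n
        counts = new_counts
    return counts

def p_value(matrix, threshold):
    """Fraction of sequences with score >= threshold (uniform assumption)."""
    counts = score_counts(matrix)
    return sum(counts[threshold:]) / sum(counts)

motif = {"A": [5, 10, 5], "C": [0, 1, 1], "G": [1, 1, 5], "T": [3, 2, 1]}
print(score_counts(motif))
print(p_value(motif, 16))
```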
- (5 points) A profile HMM contains three types of states: match, insert and delete states. Which of these three state types differs from the other two, and in what way?
- (2 points) What is the name of the specific dynamic programming algorithm that finds the best (i.e., most probable) alignment between an HMM and a sequence?
- (3 points) What does the "Markov" in "hidden Markov model" refer to?
Optional programming practice problems
- Write a program that takes as command line arguments two DNA or protein sequences and prints a matrix in which the entry in the ith row and jth column indicates whether the length-3 substring starting at position i in the first sequence is equal to the length-3 substring starting at position j in the second sequence. Use an "X" to indicate identical substrings and an "O" to indicate different substrings.
> python make-similarity-matrix.py ACGTAG ACGGGTA
  ACGGGTA
A XOOOO
C OOOOO
G OOOOX
T OOOOO
A
G
- Write a program that reads a white-space delimited matrix of floating point numbers from a file and prints the sum of the entries in each column.
> cat matrix.txt
0.1 1.2 3.4
0.9 9.7 2.2
4.3 3.7 4.4
> python column-sums.py matrix.txt
5.3 14.6 10.0
- Write a program that reads a PSSM from one file and a DNA sequence from a second file and then scans the PSSM across the sequence, printing the resulting scores. Assume that the PSSM is written with rows corresponding to "A", "C", "G", "T".
> cat matrix.txt
A 1 4 0
C 2 1 2
G 0 7 0
T 4 0 8
> cat dna.txt
GTTACGA
> python scan-pssm.py matrix.txt dna.txt
8 4 10 2 9
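The core logic of the three practice problems above can be sketched as three small functions. This is one possible approach, not the expected solution. To keep the sketch self-contained, the functions take strings and lists of lines rather than file names; a full program would read `sys.argv` and open the files. The exact output layout of the similarity matrix (a two-space indent before the header, one space after each row letter) is an assumption, and all function names are illustrative.

```python
def similarity_matrix(seq1, seq2, k=3):
    """Return output rows: 'X' where length-k substrings match, else 'O'."""
    rows = ["  " + seq2]
    n_cols = len(seq2) - k + 1
    for i, letter in enumerate(seq1):
        if i + k <= len(seq1):
            marks = "".join(
                "X" if seq1[i:i + k] == seq2[j:j + k] else "O"
                for j in range(n_cols))
            rows.append(letter + " " + marks)
        else:
            rows.append(letter)  # no length-k substring starts here
    return rows

def column_sums(lines):
    """Sum each column of a whitespace-delimited matrix of numbers."""
    sums = []
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        if not sums:
            sums = [0.0] * len(fields)
        for j, field in enumerate(fields):
            sums[j] += float(field)
    return sums

def read_pssm(lines):
    """Parse rows such as 'A 1 4 0' into {'A': [1.0, 4.0, 0.0], ...}."""
    pssm = {}
    for line in lines:
        fields = line.split()
        if fields:
            pssm[fields[0]] = [float(x) for x in fields[1:]]
    return pssm

def scan_pssm(pssm, dna):
    """Score every window of the sequence against the PSSM."""
    width = len(next(iter(pssm.values())))
    return [sum(pssm[dna[i + j]][j] for j in range(width))
            for i in range(len(dna) - width + 1)]

print("\n".join(similarity_matrix("ACGTAG", "ACGGGTA")))
print(column_sums(["0.1 1.2 3.4", "0.9 9.7 2.2", "4.3 3.7 4.4"]))
pssm = read_pssm(["A 1 4 0", "C 2 1 2", "G 0 7 0", "T 4 0 8"])
print(scan_pssm(pssm, "GTTACGA"))
```

Floating point addition may print sums such as 5.3 with small rounding error, so format the values (for example with one decimal place) before printing.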