GENOME 373: Genomic Informatics

Homework 1

Due Wednesday, April 8, at the beginning of class. Homework turned in more than five minutes after the start of class will be marked as late and penalized 10% per day thereafter.

  1. (5 points) Define log-odds score and explain how it relates to substitution matrices.
  2. (5 points) Why is the BLOSUM62 score for cysteine-cysteine 9, whereas the score for lysine-lysine is only 2?
  3. (10 points) Here is a pair of aligned protein sequences:

    GDIFYPGYCPDVKPVNKQFDLSAFAGAWHEIAKLP
    GDNFHLGKCPSPLPVQENFDVKKYLGRWYEIEKIP
    

    If this alignment were included in the data set used to generate statistics for the BLOSUM matrices, which of the following matrices would it help generate: BLOSUM90, BLOSUM80, BLOSUM62, BLOSUM52, or BLOSUM45? Why?

  4. (5 points) You can find a copy of the BLOSUM45 matrix at ftp://ftp.ncbi.nih.gov/blast/matrices/BLOSUM45. Which amino acid has the largest number of negative scores associated with it? Why?

  5. (10 points)

    RVVNLVP----WVLATDYKNY
    QFFPLMPPAPYWILATDYENY
    

    Score the above alignment twice, using

    • BLOSUM45 with a linear gap penalty of -4, and
    • BLOSUM80 with affine gap penalties: a gap open penalty of -9 and a gap extension penalty of -1.

    You can find the BLOSUM80 matrix at ftp://ftp.ncbi.nih.gov/blast/matrices/BLOSUM80. Be sure to show your work.

  6. (15 points) Draw and fill in the dynamic programming matrix to align these two sequences: CATTC and CGATC. Use this substitution matrix:

         A   C   G   T
    A    2  -7  -3  -7
    C   -7   2  -7  -3
    G   -3  -7   2  -7
    T   -7  -3  -7   2

    and use a fixed gap penalty of -5. What is the score of the optimal global alignment?
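
The matrix-filling procedure asked for in problem 6 can be sketched in Python as follows. This is a minimal illustration, not a worked answer: it assumes that "fixed gap penalty" means the same penalty is charged for every gapped position, and the function name global_alignment_score, the SCORES dictionary layout, and the short example sequences are hypothetical choices made here for illustration.

```python
# Substitution matrix from problem 6, stored as nested dictionaries
# (this layout is an assumption made for the sketch).
SCORES = {
    "A": {"A": 2, "C": -7, "G": -3, "T": -7},
    "C": {"A": -7, "C": 2, "G": -7, "T": -3},
    "G": {"A": -3, "C": -7, "G": 2, "T": -7},
    "T": {"A": -7, "C": -3, "G": -7, "T": 2},
}


def global_alignment_score(seq1, seq2, scores=SCORES, gap=-5):
    """Fill a global-alignment DP matrix; return (optimal score, matrix)."""
    rows, cols = len(seq1) + 1, len(seq2) + 1
    dp = [[0] * cols for _ in range(rows)]

    # First column and first row: a prefix aligned entirely against gaps.
    for i in range(1, rows):
        dp[i][0] = i * gap
    for j in range(1, cols):
        dp[0][j] = j * gap

    # Each remaining cell is the best of three moves.
    for i in range(1, rows):
        for j in range(1, cols):
            diagonal = dp[i - 1][j - 1] + scores[seq1[i - 1]][seq2[j - 1]]
            up = dp[i - 1][j] + gap    # seq1[i-1] aligned to a gap
            left = dp[i][j - 1] + gap  # seq2[j-1] aligned to a gap
            dp[i][j] = max(diagonal, up, left)

    # The bottom-right cell holds the optimal global alignment score.
    return dp[-1][-1], dp


score, table = global_alignment_score("ACG", "AG")
print(score)  # -1, from aligning ACG with A-G: 2 - 5 + 2
```

The sketch returns only the score and the filled matrix. Recovering the alignment itself would also require recording which of the three moves produced each cell and tracing back from the bottom-right corner.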


Optional programming practice problems

  1. Write a program that takes as input the first three command line arguments (after the program name) and prints them in uppercase letters on a single line with spaces between.

    > python get-three-args.py con stan tinople
    CON STAN TINOPLE
    
  2. Write a program similar to the previous one, but print the three arguments without spaces between.

    > python get-three-args.py con stan tinople
    CONSTANTINOPLE
    
  3. Write a program that takes as input two command line arguments: the first argument is a DNA or protein sequence, and the second is an integer n. Print the nth character in the given sequence, counting the first character as position 1.

    > python get-nth-character.py curmudgeon 5
    u
    
  4. Write a program that takes as input two command line arguments, counts how many times the second one appears inside the first one, and then tells the user how many there are, like this:

    > python count-substrings-in-string.py acgtacgtttgacgtacc acg
    The substring acg appears in the sequence acgtacgtttgacgtacc 3 times.
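
The practice problems above all follow the same pattern: read arguments from sys.argv, process them, and print a result. One possible shape for the last problem is sketched below. The helper names count_occurrences and format_report are hypothetical, and the sketch assumes that overlapping matches should each be counted, which the problem statement does not specify.

```python
import sys


def count_occurrences(sequence, substring):
    """Count occurrences of substring in sequence, including overlaps."""
    if not substring:
        return 0
    count = 0
    start = sequence.find(substring)
    while start != -1:
        count += 1
        # Resume one position later so overlapping matches are found too.
        start = sequence.find(substring, start + 1)
    return count


def format_report(sequence, substring):
    """Build the sentence shown in the example output of problem 4."""
    return "The substring {} appears in the sequence {} {} times.".format(
        substring, sequence, count_occurrences(sequence, substring)
    )


# sys.argv[0] is the program name; the two arguments follow it.
if __name__ == "__main__" and len(sys.argv) == 3:
    print(format_report(sys.argv[1], sys.argv[2]))
```

Python's built-in str.count would be shorter, but it counts only non-overlapping matches: "aaaa".count("aa") is 2, whereas the loop above reports 3. For the example input both approaches give 3.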