GENOME 373: Genomic Informatics Homework 8
Due Friday, June 5, at the beginning of class. Homework turned in more than five minutes after the start of class will be marked as late and penalized 10% per day thereafter.
- (5 points) Compute the Pearson correlation between these two gene expression profiles: geneA = [0.25, 4.73, 2.29], geneB = [0.53, 6.79, 4.60]. Show your work.
- (5 points) Here is a correlation matrix for three genes:
        geneA   geneB   geneC
geneA   1.000   0.412   0.043
geneB   0.412   1.000   0.474
geneC   0.043   0.474   1.000
Write down the tree that is created by hierarchical clustering.
- (5 points) In an HMM used for analyzing array CGH data, what do the states represent?
- (2 points) What is the primary difference in the output created by Sanger sequencing compared to Solexa or SOLiD sequencing?
- (5 points) What are paired reads, and why are they useful?
- (4 points) What is the probability that a base with Phred score 10 was incorrectly called?
- (4 points) If a base has an accuracy of 99.999%, what is its associated Phred quality?
- (5 points) Say that the following two sequence tags contain SNPs at the locations indicated by capital letters: atCtgcagGgcatacccTcagaaaTgca, atCTGCagggcataccctcagaaatgca. Which one will be more difficult for MAQ to map? Why?
- (5 points) Say that the following two sequence tags each contain a SNP at the location indicated by a capital letter: atCtgcagggcataccctcagaaatgca, atctgcagggcataccctcagaAatgca. Which one will be more difficult for Bowtie to map? Why?
- (10 points) Write a Python program to map tags to a reference genome. Report, along with each mapped tag, a count of the number of times that the tag occurs. Your output file should contain four columns: the tag, the chromosome index, the position, and the count. Here is a sample command line and output:
> python map-tags3.py genome.txt tags.txt output.txt
Run the program on the same genome, using this set of tags, and report the locations and number of occurrences of the most frequently appearing tag. Note that, to find the most frequently occurring tag, you don't have to write Python code to sort your output by the number of occurrences; you can do the sorting using spreadsheet software.
- (10 points) Write a Python program that takes as input a tab-delimited file of gene expression values and produces the corresponding correlation matrix. You can use as a template the program read-matrix.py that simply reads a file into memory. Test your program by running it on this expression matrix and verifying that you get this correlation matrix. Then run your program on this matrix and turn in a copy of your results, along with the program code.
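For reference, the Pearson correlation that the first and last questions rely on can be sketched as below. This is only a minimal illustration under stated assumptions, not the read-matrix.py template and not a complete solution: the function names pearson and correlation_matrix and the toy profiles are hypothetical, and the sketch works on in-memory lists rather than reading a tab-delimited file.

```python
import math


def pearson(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("profiles must have the same length (at least 2)")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Deviations of each value from its profile mean.
    dx = [v - mean_x for v in x]
    dy = [v - mean_y for v in y]
    cov = sum(a * b for a, b in zip(dx, dy))
    var_x = sum(a * a for a in dx)
    var_y = sum(b * b for b in dy)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant profile")
    # The 1/n factors cancel, so sums of squares can be used directly.
    return cov / math.sqrt(var_x * var_y)


def correlation_matrix(profiles):
    """Return (names, matrix) for a dict mapping gene name -> expression list."""
    names = sorted(profiles)
    matrix = [[pearson(profiles[a], profiles[b]) for b in names] for a in names]
    return names, matrix


# Toy profiles (assumed for illustration; not the homework data).
toy = {"g1": [1.0, 2.0, 3.0], "g2": [2.0, 4.0, 6.0], "g3": [3.0, 2.0, 1.0]}
names, matrix = correlation_matrix(toy)
print("\t" + "\t".join(names))
for name, row in zip(names, matrix):
    print(name + "\t" + "\t".join("%.3f" % r for r in row))
```

The toy profiles are chosen so the result is easy to check by eye: g2 is a scaled copy of g1, and g3 is g1 reversed. The same pearson function could be applied to each pair of rows once the expression file has been read into memory.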