“Omics” Module Exam

Date: Friday 7th, 2014
Time: 10.15h
Location: Seminar room 0.34, Genetics building, Zülpicher Str. 47a
Duration: 120 min
Maximum points: 60 points, 2/3 of the final grade (seminar: max 30 pts)
Pass mark: 50% of the max total points = 45 pts
Additional remarks: This is a written exam. No computers allowed, no calculators required.

Topics covered in the exam

(Achim Tresch)

Elementary Statistics
  Data description (measures of location and scale, quantiles)
  Graphical representation of continuous (univariate and bivariate) data (boxplots, scatterplots, MA plots)
  Measures of dependence (Pearson vs. Spearman correlation: which is preferable under which condition?)
  k-means clustering (2-step procedure: center calculation, assignment of points to centers)
  Testing: decision boundary, false positives, false negatives, p-value, significance of a test, null hypothesis, acceptance/rejection region
  Tests for location: t-test vs. Wilcoxon rank sum test (which is preferable under which condition?)
  (Fisher test and chi-squared test will be covered in Kay Hofmann's part)
  Multiple testing (why, Bonferroni procedure)
  Classification: test error, training error, overfitting, bias-variance tradeoff, linear classifiers
  Cross validation (how does it work, purpose)

  Sequencing: systematic and random errors in NGS: mappability of reads, sequence (GC) bias, positional bias (chromatin structure), …
  Sequencing techniques: Sanger, Illumina (bridge amplification + sequencing by synthesis), third-generation sequencing (Pacific Biosciences)
  Advantages of paired-end reads
  Normalization procedures for gene expression data: housekeeping genes, spike-ins, VSN, quantile normalization, lowess (basic idea behind quantile normalization)
  Statistics for RNA-Seq data: RPKM values
  Mapping: hash tables (how do they work? key, index (hash value), collisions)
  NOT covered in the exam: suffix arrays, Burrows-Wheeler transform

  ChIP-Seq: How does it work?
  Peak detection: Poisson distribution, “peak score” = (log) likelihood ratio (NOT: negative binomial distribution)
  NOT: motif search
  Hidden Markov Models: What do they learn (what are the parameters)? What do they output (Viterbi path)?
  Epigenetics: bisulfite sequencing; mapping strategies for mapping bisulfite-converted reads to a reference sequence; lollipop plot; methylation rate calculation; biological role of DNA methylation (correlation with chromatin structure and histone modifications; role in differentiation, aging and degeneration)

(Kay Hofmann)

Generally no calculations. Questions on the problems and failures of a method. Finding errors in statements.

Proteomics
  Tandem MS: technology
  Protein identification: what can go wrong?
  Quantification of MS data: SILAC, iTRAQ, label-free methods
  Post-translational modifications: which ones, and how can they be identified?
  Protein-protein interactions: large-scale analysis methods for p-p interactions (Y2H)
  Gene set enrichment: Gene Ontology? DAG? Statistical tests for gene set enrichment: Fisher test
  Which databases are there for which purposes (which database contains which information)?
  Protein function prediction, guilt by association

(Andreas Beyer)

Interaction networks, integrative data analysis
  Epistasis: definition + meaning
  eMAP: methods for finding interactions (experiment + computation)
  ANOVA + linear models (NO formulas, however: ability to transform an experimental design into a linear model)
  Allele incompatibility: definition
  Classification: random forests, advantages / problems relative to e.g.
    linear models
  QTLs
  NOT: Boolean algebra

(Thomas Wiehe)

Genomics
  Coverage
  Genome assembly: problems
  Alignments
  Gene prediction (using HMMs)
  (no concrete calculations)

Sequencing and assembling of genomes
  Shotgun sequencing
  Genome assembly
  Overlap alignment (theory: how to calculate an alignment; alignment matrix), tiling paths, coverage
  Theory: connection between sequencing effort (number of fragments) and coverage

Genome annotation
  Sequence signals (donor, acceptor, etc.)
  Sequence logos; information content
  Accuracy measurements of gene predictions
  What other elements, besides protein-coding genes, belong to the 'annotation' of a genome?

Metagenomics
  Concept and scope
  What kind of data are produced and analyzed
  How to assign genomes to 'species'
  Phylogeny mapping (placing a sequence into a phylogeny)
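
Study aid for the genomics item "connection between sequencing effort (number of fragments) and coverage": a minimal Python sketch, not part of the exam material itself. It assumes the usual idealization that N fragments of equal length L are placed uniformly at random on a genome of length G, so that the average coverage is c = N * L / G and, under a Poisson approximation, the expected fraction of bases covered at least once is 1 - exp(-c). The function names and the example numbers are hypothetical and chosen only for illustration.

```python
import math


def average_coverage(num_fragments: int, fragment_length: int, genome_length: int) -> float:
    """Average coverage c = N * L / G (assumes all fragments have the same length)."""
    return num_fragments * fragment_length / genome_length


def expected_fraction_covered(coverage: float) -> float:
    """Expected fraction of bases hit by at least one fragment.

    Assumption: fragment start positions are uniform and independent, so the
    number of fragments covering a base is approximately Poisson(c).
    """
    return 1.0 - math.exp(-coverage)


def fragments_needed(target_coverage: float, fragment_length: int, genome_length: int) -> int:
    """Smallest N with N * L / G >= target coverage."""
    return math.ceil(target_coverage * genome_length / fragment_length)


if __name__ == "__main__":
    # Hypothetical numbers: 5 Mb genome, 500 bp fragments.
    c = average_coverage(100_000, 500, 5_000_000)
    print("average coverage:", c)
    print("expected fraction covered:", expected_fraction_covered(c))
    print("fragments for 10x:", fragments_needed(10, 500, 5_000_000))
```

The point of the sketch is the shape of the relationship: coverage grows linearly with the number of fragments, while the uncovered fraction shrinks exponentially, so closing the last gaps takes disproportionately more sequencing effort. Real data deviate from this model because of the biases listed under sequencing errors (mappability, GC bias, positional bias).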