Pre-EXAM
-------------------------------------------------------------------------------------------------------
1. You need to find out which genes are differentially expressed in cancer in mouse. You want to study differential expression at the gene level (as opposed to the transcript level) and you want to focus on known genes (as opposed to finding novel ones). You have a limited budget for sequencing, so you have to select one of the following options:
a) obtain 120 million reads (75 bp, single end) from one control sample and one cancer sample
b) obtain 30 million reads (75 bp, single end) from four control samples and four cancer samples
c) obtain 90 million reads (150 bp, single end) from one control sample and one cancer sample
Which option would you choose and why?

Answer: b), because you need biological replicates to estimate the within-group variance. Without replicates you cannot tell a real difference between conditions from ordinary sample-to-sample variation.

2- If your focus were discovering novel transcripts rather than detecting differential expression, what kind of reads would you order (single end / paired end, shorter / longer) and why?

Answer: Paired end and longer. Such reads span more exon-exon junctions, so it is easier to differentiate between transcripts.

3- RNAseq data analysis involves several steps, and quality control can be performed at different stages of the analysis. List what kind of quality aspects can be measured. As a bonus, list names of software for each of these steps.

Answer:
- Raw reads (fastq): base quality, base composition. FastQC.
- Aligned reads: mapping quality, saturation, whether reads map to genes or intergenic regions, uniformity of transcript coverage, etc. RseQC.
- Count table: MDS plot to see if groups separate and if there are confounding factors. edgeR.
-------------------------------------------------------------------------------------------------------
Practice exam NGS 1 (2p)
A common problem when performing de-novo assembly of next generation sequencing reads is that the same sequence pattern can occur more than once in the target sequence.
This causes the assembly to break into a set of “contigs”. In the illustrated case below, the same pattern, symbolized by a yellow arrow, occurs three times in the sequenced genome:

Which of the following strategies could reasonably be expected to reduce this problem? (2p)

                                                      YES   NO
I)   Using paired-end reads
II)  Increasing the read length
III) Quality trimming of the read data
IV)  “Barcoding” the DNA with index sequences

Check the “yes” or “no” box: +0.5p per correct answer, -0.5p per incorrect answer; leave both blank for 0p. The total sum will not be counted as negative.

Practice exam NGS 2 (4p)
A researcher is mapping her NGS reads to a reference sequence to find bases that differ between the reference and the sequenced sample (called single nucleotide polymorphisms, “SNPs”). One such SNP is reported at the position shown in the figure. The researcher is viewing the mapping output using the Tablet software, with settings enabled to show the read direction as green / blue and bases differing from the reference sequence as red.
a) A false SNP is reported at the indicated position. What is the likely cause(s)? (2p)
b) Suggest one thing you could do to avoid this problem. (2p)

Answer 1 (2p)

                                                      YES   NO
V)    Using paired-end reads
VI)   Increasing the read length
VII)  Quality trimming of the read data
VIII) “Barcoding” the DNA with index sequences

Answer NGS 2 (4p)
a) Coverage is very low at the indicated position. All of the reads indicating a difference do so in their last few bases, where quality is often poor.
b) Any of the following would work:
- Increase coverage to “dilute” away the errors
- Do end quality trimming on the read data
- (Not covered in the course, but OK) Filter the SNP positions based on coverage/quality etc.
------------------------------------------------------------------------------------------------------
Introduction to Protein Analysis: Training Questions

Questions

A. Which of the following patterns matches the sequence DWILKDG?
1. D-M-x-[ILV]-x{2}-G
2. [DN]-W-x-[ILV]-[RKH]-x-G
3. [DN]-W-x{2}-[ILV]-G
4. D-W-I-[ILMV]-x-K-[GA]

B. Analyze the following protein sequence. What is its function? Do you have an idea of its structure? Give the evidence that you find (patterns, profiles, ...).

>TRFE_CHICK
YFAVAVARKDSNVNWNNLKGKKSCHTAVGRTAGWVIPMGLIHNRTGTCNFDEYFSEGCAPGSPPNSRLCQLCQGSGGIPP
EKCVASSHEKYFGYTGALRCLVEKGDVAFIQHSTVEENTGGKNKADWAKNLQMDDFELLCTDGRRANVMDYRECNLAEVP
THAVVVRPEKANKIRDLLERQEKRFG

ANSWERS

A. 2. Pattern 2 is the only one that fits all seven residues: D, W, x = I, [ILV] = L, [RKH] = K, x = D, G.

B. Here is a concise analysis:

Expasy ScanProsite and/or NPSA proscan

- One matching PROFILE found (Expasy ScanProsite): PS51408 TRANSFERRIN_LIKE_4, Transferrin-like domain profile.
  Description: The transferrin family is a group of glycosylated proteins found in both vertebrates and invertebrates. Included in this group are molecules known to bind iron, including serotransferrin, ovotransferrin, lactotransferrin, and melanotransferrin. There are 4 disulfide bonds in proteins of this family, and a total of 3 cysteine residues are strictly conserved in the 3 signature patterns of the transferrin function.

- 3 patterns/signatures found:

  TRANSFERRIN_LIKE_1, positions 1-10 in the query sequence
  Pattern in query: YFAVAVARKD
  Pattern = Y-x(0,1)-[VAS]-V-[IVAC]-[IVA]-[IVA]-[RKH]-[RKS]-[GDENSA]
  Y is an iron ligand.

  TRANSFERRIN_LIKE_2, positions 94-109 in the query sequence
  Pattern in query: YTGALRCLVEKGDVAF
  Pattern = [YI]-x-G-A-[FLI]-[KRHNQS]-C-L-x(3,4)-G-[DENQ]-V-[GAT]-[FYW]
  Y is an iron ligand; C is involved in a disulfide bond.

  TRANSFERRIN_LIKE_3, positions 135-165 in the query sequence
  Pattern in query: DFELLCTDGRRANVMDYRECNLAEVPTHAVV
  Pattern = [DENQK]-[YF]-x-[LY]-L-C-x-[DN]-x(5,8)-[LIV]-x(4,5)-C-x(2)-A-x(4)-[HQR]-x-[LIVMFYW]-[LIVM]
  H is an iron ligand; the 2 C's are linked by a disulfide bond.

The protein seems to be a transferrin.

Looking for homologous proteins in SwissProt: BLAST against SwissProt (on the NPS@ server).
The best hit is:
> NPSA gnl|unipsp|P02789 Ovotransferrin (length=705 residues).
Length = 705
Score = 396 bits (1017), Expect = e-135, Method: Compositional matrix adjust.
Identities = 186/186 (100%), Positives = 186/186 (100%)

=> SwissProt entry: P02789 - TRFE_CHICK: Ovotransferrin - Gallus gallus precursor

According to the BLAST alignment, the query sequence aligns with positions 450 to 635 of the subject sequence, with 100% identity. Our query sequence seems to correspond to a part of the transferrin-like 2 domain of the Gallus gallus ovotransferrin (this domain is located between positions 364 and 689).

There is a PDB entry corresponding to this protein (Ovotransferrin, C-Terminal Lobe): 1IQ7.
Analyzing the structure (Rasmol or SwissPDBViewer): it is a globular, monomeric protein. It is built from alpha helices, beta strands and turns, with the strands forming beta sheets. An SO4 ligand is bound.

NGS

1- What is "low-complexity" sequence? (1p)
Answer: Regions with low-complexity sequence have an unusual composition, dominated by one or a few residues, that can create problems in sequence similarity searching. Low-complexity sequence can often be recognized by visual inspection. For example, the protein sequence PPCDPPPPPKDKKKKDDGPP has low complexity, and so does the nucleotide sequence AAATAAAAAAAATAAAAAAT.

2- Explain sequence contaminants and some usual cases (1p)
Answer: Wrong nucleotide material in the initial sample, e.g.
• Mitochondrial and chloroplast DNA in genomic samples
• rRNA in transcriptomes
• Pathogens in infected samples

3- Assembly has three main steps, name them (1p)
Answer:
i) Assemble all pieces of unique regions. These are called contigs.
ii) Connect them up. The end product is called a scaffold.
iii) Fill in the gaps in between with repetitive sequences.
------------------------------------------------------------------------------------------------------
UNIX

Question 1.
a. In how many ways can the cat command be used in unix? (1 pt)
b. Which command can you use to move up and down in a file? (1 pt)

Answers:
a. Three: print, write, and concatenate/append.
b. The command more

Question 2. Write commands that do the following:
a. List all files that contain "seq" in the filename (0.5 pt)
b. Create a new directory “unix” (0.5 pt)
c. Copy the content of the files with the word "seq" in the filename into a file called "tot.txt" (1 pt)
d. Write a one-line command that shows how many files you have in your current working directory. (2 pt)

Answers:
a. ls *seq*
b. mkdir unix
c. cat *seq* > unix/tot.txt
d. ls | wc -w   (this counts words, so it is only correct when no filename contains spaces; ls | wc -l counts one entry per line)

1) QTL studies in livestock have been very productive (see animalQTLdb: http://www.animalgenome.org/QTLdb/ ). Why is it hard to identify the genes underlying the QTL? (2p)
Answer: The confidence intervals for QTLs are large and contain hundreds of potential candidate genes.

How can we use gene expression studies to increase the resolution of QTL studies? (4p)
Answer: If we study gene expression in animals with alternative QTL genotypes, we can discover how the QTL affects gene expression. Genes that differ in expression between genotypes can either be local (near the QTL) or unlinked to the QTL. Local differentially expressed genes are good candidate genes for the QTL. Unlinked genes can point to the molecular pathway that is perturbed by the QTL.

Using the sequence below, answer the following questions:
1- Which species does this sequence belong to? (1P)
2- Translate this nucleotide sequence to protein (1P)
3- Find out which splice variant this belongs to (1P)
4- Find the cow homolog to this gene and answer if this splice variant is known in bovine.
(4P)

>xng
ATGGGGACTTCCCATCCGGCGTTCCTGGTCTTAGGCTGTCTTCTCACAGGGCTGAGCCTA
ATCCTCTGCCAGCTTTCATTACCCTCTATCCTTCCAAATGAAAATGAAAAGGTTGTGCAG
CTGAATTCATCCTTTTCTCTGAGATGCTTTGGGGAGAGTGAAGTGAGCTGGCAGTACCCC
ATGTCTGAAGAAGAGAGCTCCGATGTGGAAATCAGAAATGAAGAAAACAACAGCGGCCTT
TTTGTGACGGTCTTGGAAGTGAGCAGTGCCTCGGCGGCCCACACAGGGTTGTACACTTGC
TATTACAACCACACTCAGACAGAAGAGAATGAGCTTGAAGGCAGGCACATTTACATCTAT
GTGCCAGACCCAGATGTAGCCTTTGTACCTCTAGGAATGACGGATTATTTAGTCATCGTG
GAGGATGATGATTCTGCCATTATACCTTGTCGCACAACTGATCCCGAGACTCCT

ANSWER:
1- Human. Since you do not know the species at the start, a standard BLAST search against the NCBI NR database can give you a hint, but you will see that human is not the only species that gives a good hit, so you must also look at the length of the hit. Then continue with ENSEMBL and select human.

2- Use, for instance, EMBOSS transeq:
>xng_1
MGTSHPAFLVLGCLLTGLSLILCQLSLPSILPNENEKVVQLNS
SFSLRCFGESEVSWQYPMSEEESSDVEIRNEENNSGLFVTV
LEVSSASAAHTGLYTCYYNHTQTEENELEGRHIYIYVPDPDV
AFVPLGMTDYLVIVEDDDSAIIPCRTTDPETP
Check that the translation is correct by BLASTing it against a protein database or by comparing it with the protein information in ENSEMBL.

3- Use ENSEMBL and simply BLAST the sequence, selecting the cow genome. This is much more difficult with NCBI BLAST, as most hits are primate related.

4- No, it is not known in the cow. In ENSEMBL you can choose to show all known cow mRNAs/cDNAs.

Last question: Give examples of why the following steps are important and what the outcome is when performing a genome-wide association analysis with PLINK (5p).
1) Making a binary PED file
2) Perform basic statistics
3) Perform basic association analysis
4) Check for stratification
5) Test the region for genome-wide significance

Answer:
1) A binary PED file is a more compact representation of the data that saves space and speeds up analysis.
2) Get basic knowledge about the data: see how well SNPs and individuals worked, report allele frequencies, missing rates, etc.
3) To uncover the genetic basis of a given disease or phenotypic trait. A file with p-values is reported.
4) Important analysis to visualize substructure. It includes a significance test for whether two individuals belong to the same population. If stratification is detected, PLINK can constrain the cluster solution by phenotype and perform association analyses conditional on the cluster solution.
5) Adjust for multiple hypothesis testing. A file with adjusted p-values is reported.
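
To make step 5 of the PLINK answer concrete, here is a minimal sketch of two common multiple-testing adjustments, Bonferroni and Benjamini-Hochberg, written in plain Python. It is an illustration only and not PLINK's own implementation; the function names and the five p-values are made up for the example.

```python
def bonferroni(pvalues):
    """Multiply each p-value by the number of tests, capped at 1."""
    m = len(pvalues)
    return [min(1.0, p * m) for p in pvalues]


def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in input order."""
    m = len(pvalues)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, so that adjusted values
    # never increase as the raw p-values decrease.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical unadjusted p-values for five SNPs (invented for illustration).
raw = [0.0001, 0.004, 0.03, 0.2, 0.9]
print("Bonferroni:", bonferroni(raw))
print("Benjamini-Hochberg:", benjamini_hochberg(raw))
```

Bonferroni controls the chance of even one false positive and is therefore the stricter of the two; Benjamini-Hochberg controls the expected fraction of false positives among the reported SNPs, which is why its adjusted values are never larger than the Bonferroni ones.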