Pre-EXAM

-------------------------------------------------------------------------------------------------------
1. You need to find out which genes are differentially expressed in cancer in
mouse. You want to study differential expression at the gene level (as opposed to
the transcript level) and you want to focus on known genes (as opposed to finding
novel ones). You have a limited budget for sequencing, so you have to select one
of the following options:
a) obtain 120 million reads (75 bp, single end) from one control sample and one
cancer sample
b) obtain 30 million reads (75 bp, single end) from four control samples and four
cancer samples
c) obtain 90 million reads (150 bp, single end) from one control sample and one
cancer sample
Which option would you choose and why?
2. If your focus were discovering novel transcripts rather than detecting
differential expression, what kind of reads would you order (single end / paired
end, shorter / longer) and why?
3. RNA-seq data analysis involves several steps, and quality control can be
performed at different stages of the analysis. List the quality aspects that can
be measured at each stage. As a bonus, name software for each of these steps.
-------------------------------------------------------------------------------------------------------
Practice exam NGS 1 (2p)
A common problem when performing de novo assembly of next-generation
sequencing reads is that the same sequence pattern can occur more than once in
the target sequence. This causes the assembly to break into a set of “contigs”.
In the illustrated case below, the same pattern symbolized by a yellow arrow
occurs three times in the sequenced genome:
Which of the following strategies could reasonably be expected to reduce this
problem? (2p)
                                                    YES   NO
I)   Using paired-end reads                         [ ]   [ ]
II)  Increasing the read length                     [ ]   [ ]
III) Quality trimming of the read data              [ ]   [ ]
IV)  “Barcoding” the DNA with index sequences       [ ]   [ ]
Check the “yes” or “no” box for each strategy: +0.5p per correct answer, -0.5p
per incorrect answer; leave both blank for 0p. The total will not be counted as
negative.
Practice exam NGS 2 (4p)
A researcher is mapping her NGS reads to a reference sequence to find bases that
differ between the reference and the sequenced sample (called single nucleotide
polymorphisms, “SNPs”). One such SNP is reported at the position shown in the
figure.
The researcher is viewing the mapping output using the Tablet software with
settings enabled to show the read direction as green / blue and bases differing
from the reference sequence as red.
a) A false SNP is reported at the indicated position. What are the likely
cause(s)? (2p)
b) Suggest one thing you could do to avoid this problem. (2p)
Introduction to Protein Analysis: Training Questions
Questions
A. Which of the following patterns match the sequence DWILKDG?
1. D-M-x-[ILV]-x{2}-G
2. [DN]-W-x-[ILV]-[RKH]-x-G
3. [DN]-W-x{2}-[ILV]-G
4. D-W-I-[ILMV]-x-K-[GA]
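As an illustration of how patterns written in this notation can be checked by machine, here is a minimal Python sketch. It assumes that `x` means any residue, that a `{n}` suffix is a repeat count, that square brackets list the allowed residues, and that the pattern must span the whole sequence. The helper names `pattern_to_regex` and `matches` are hypothetical, and the example pattern is made up; it is not one of the options above.

```python
import re


def pattern_to_regex(pattern: str) -> str:
    """Convert a dash-separated pattern such as 'A-x{2}-[ST]' to a Python regex."""
    parts = []
    for element in pattern.split("-"):
        # Assumption: 'x' stands for any residue; a trailing {n} is kept as a
        # regex repeat count, and [..] classes already are valid regex.
        if element.startswith("x"):
            element = "." + element[1:]
        parts.append(element)
    return "".join(parts)


def matches(pattern: str, sequence: str) -> bool:
    """Return True if the whole sequence fits the pattern (assumed full match)."""
    return re.fullmatch(pattern_to_regex(pattern), sequence) is not None


if __name__ == "__main__":
    # Made-up pattern and sequence, for illustration only.
    print(pattern_to_regex("A-x{2}-[ST]"))
    print(matches("A-x{2}-[ST]", "AGGS"))
```

Using `re.search` instead of `re.fullmatch` would allow the pattern to occur anywhere inside a longer sequence, which is how motif scans are usually run.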
B. Analyze the following protein sequence. What is its function? Do you have
an idea of its structure? Give the evidence you find (patterns,
profiles, …).
>TRFE_CHICK
YFAVAVARKDSNVNWNNLKGKKSCHTAVGRTAGWVIPMGLIHNRTGTCNFDEYFSEGCAPGSPPNSRLCQLCQGSGGIPP
EKCVASSHEKYFGYTGALRCLVEKGDVAFIQHSTVEENTGGKNKADWAKNLQMDDFELLCTDGRRANVMDYRECNLAEVP
THAVVVRPEKANKIRDLLERQEKRFG
NGS
1- What is a "low-complexity" sequence? (1p)
2- Explain what sequence contaminants are and give some common examples (1p)
3- Assembly has three main steps. Name them. (1p)
-------------------------------------------------------------------------------------------------------
UNIX
Question 1.
a. In how many ways can the cat command be used in Unix? (1 pt)
b. Which command can you use to scroll up and down in a file? (1 pt)
Question 2. Write commands that do the following:
a. List all files that contain "seq" in the filename (0.5 pt)
b. Create a new directory “unix” (0.5 pt)
c. Copy the content of the files with the word "seq" in the filename into a file
called "tot.txt" (1 pt)
d. Write a one-line command that shows how many files you have in your
current working directory. (2 pt)
1) QTL studies in livestock have been very productive (see animalQTLdb:
http://www.animalgenome.org/QTLdb/ ). Why is it hard to identify the
genes underlying the QTL? (2p)
2) How can we use gene expression studies to increase the resolution of QTL
studies? (4p)
Using the sequence below, answer the following questions:
1- Which species does this sequence belong to? (1P)
2- Translate this nucleotide sequence to protein (1P)
3- Find out which splice variant this belongs to (1P)
4- Find the cow homolog of this gene and state whether this splice variant is
known in cattle. (4P)
>xng
ATGGGGACTTCCCATCCGGCGTTCCTGGTCTTAGGCTGTCTTCTCACAGGGCTGAGCCTA
ATCCTCTGCCAGCTTTCATTACCCTCTATCCTTCCAAATGAAAATGAAAAGGTTGTGCAG
CTGAATTCATCCTTTTCTCTGAGATGCTTTGGGGAGAGTGAAGTGAGCTGGCAGTACCCC
ATGTCTGAAGAAGAGAGCTCCGATGTGGAAATCAGAAATGAAGAAAACAACAGCGGCCTT
TTTGTGACGGTCTTGGAAGTGAGCAGTGCCTCGGCGGCCCACACAGGGTTGTACACTTGC
TATTACAACCACACTCAGACAGAAGAGAATGAGCTTGAAGGCAGGCACATTTACATCTAT
GTGCCAGACCCAGATGTAGCCTTTGTACCTCTAGGAATGACGGATTATTTAGTCATCGTG
GAGGATGATGATTCTGCCATTATACCTTGTCGCACAACTGATCCCGAGACTCCT
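For the translation step, the following is a minimal Python sketch of translating a nucleotide sequence with the standard genetic code, reading in frame from the first base. The function name `translate` and the choice to write unknown codons as `X` and stop codons as `*` are assumptions made for illustration; in practice an established tool or library would normally be used.

```python
from itertools import product

# Standard genetic code, codons ordered with bases T, C, A, G (third base fastest).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna: str) -> str:
    """Translate a DNA string in frame 1; trailing partial codons are ignored."""
    dna = dna.upper().replace("\n", "").replace(" ", "")
    protein = []
    for i in range(0, len(dna) - len(dna) % 3, 3):
        # Assumption: codons with ambiguous bases (e.g. N) are written as 'X'.
        protein.append(CODON_TABLE.get(dna[i:i + 3], "X"))
    return "".join(protein)


if __name__ == "__main__":
    # Short made-up sequence, for illustration only.
    print(translate("ATGGCCTAA"))
```

The sketch does not search for the correct reading frame or strand; it simply assumes the sequence starts at the first base of a codon.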
Give examples of why the following steps are important and what the outcome is
when performing a genome-wide association analysis with PLINK (5p).
1) Making a binary PED file
2) Perform basic statistics
3) Perform basic association analysis
4) Check for stratification
5) Test the region for genome-wide significance