Pre-EXAM-Answers

Pre-EXAM
------------------------------------------------------------------------------------------------------
1. You need to find out which genes are differentially expressed in cancer in
mouse. You want to study differential expression at the gene level (as opposed to
transcript level) and you want to focus on known genes (as opposed to finding
novel ones). You have a limited budget for sequencing, so you have to select one
of the following options:
a) obtain 120 million reads (75 bp, single end) from one control sample and one
cancer sample
b) obtain 30 million reads (75 bp, single end) from four control samples and four
cancer samples
c) obtain 90 million reads (150 bp, single end) from one control sample and one
cancer sample
Which option would you choose and why?
Answer:
b) because you need biological replicates to estimate the within-group
variance.
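The trade-off can be made concrete by comparing the three options on total sequencing volume and on the residual degrees of freedom left for estimating within-group variance. This is a minimal sketch: the function name and the simple two-group formula (number of samples minus two estimated group means) are illustrative assumptions, not part of the exam.

```python
# Compare the three sequencing designs from the question.
# Each entry: (reads per sample in millions, read length in bp, samples per group)
designs = {
    "a": (120, 75, 1),
    "b": (30, 75, 4),
    "c": (90, 150, 1),
}

def summarize(reads_m, read_len, n_per_group):
    """Return total reads (millions), total bases (Gb) and residual degrees of freedom."""
    n_samples = 2 * n_per_group                 # control + cancer
    total_reads_m = reads_m * n_samples
    total_gb = total_reads_m * read_len / 1000  # millions of reads * bp / 1000 = Gb
    residual_df = n_samples - 2                 # two group means are estimated
    return total_reads_m, total_gb, residual_df

for name, design in designs.items():
    reads, gb, df = summarize(*design)
    print(f"option {name}: {reads} M reads, {gb:.0f} Gb, residual df = {df}")
```

Options a) and c) leave zero residual degrees of freedom, so the within-group variance cannot be estimated at all, whatever the sequencing depth.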
2- If your focus was discovering novel transcripts rather than detecting
differential expression, what kind of reads would you order (single end/ paired
end, shorter/longer) and why?
Answer:
Paired-end and longer reads: they cover more splice junctions, which makes it
easier to differentiate between transcripts.
3- RNAseq data analysis involves several steps and quality control can be
performed at different stages of analysis. List what kind of quality aspects can
be measured. As a bonus list names of software for each of these steps.
Answer:
-raw reads (fastq): base quality, base composition. FastQC.
-aligned reads: mapping quality, saturation, whether reads map to genes or
intergenic regions, uniformity of transcript coverage, etc. RSeQC.
-count table: MDS plot to see if groups separate and if there are confounding
factors. edgeR.
------------------------------------------------------------------------------------------------------
Practice exam NGS 1 (2p)
A common problem when performing de-novo assembly of next generation
sequencing reads is that the same sequence pattern can occur more than once in
the target sequence. This causes the assembly to break into a set of “contigs”.
In the illustrated case below, the same pattern symbolized by a yellow arrow
occurs three times in the sequenced genome:
Which of the following strategies could reasonably be expected to reduce this
problem? (2p)
                                                YES   NO
I)   Using paired-end reads                     [ ]   [ ]
II)  Increasing the read length                 [ ]   [ ]
III) Quality trimming of the read data          [ ]   [ ]
IV)  “Barcoding” the DNA with index sequences   [ ]   [ ]
Check “yes” or “no” boxes for +0.5p per correct answer. -0.5p for incorrect
answers, leave both blank for 0p. Total sum will not be counted as negative.
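The repeat problem itself can be illustrated with exact string matching: a read that lies entirely inside a repeated pattern can be placed at every copy, while a read that reaches into the unique flanking sequence can be placed only once. This is a toy sketch; the genome string, the repeat and the function name are made up for illustration.

```python
def placements(genome: str, read: str) -> list[int]:
    """Return every 0-based start position where the read matches the genome exactly."""
    hits = []
    start = genome.find(read)
    while start != -1:
        hits.append(start)
        start = genome.find(read, start + 1)
    return hits

# Toy genome (made up): three copies of the repeat separated by unique flanks.
repeat = "GATTACA"
genome = "TTC" + repeat + "CCG" + repeat + "AGT" + repeat + "TGG"

short_read = repeat              # lies entirely inside the repeat
long_read = "C" + repeat + "C"   # extends one base into each unique flank

print(placements(genome, short_read))  # several candidate positions
print(placements(genome, long_read))   # a single position
```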
Practice exam NGS 2 (4p)
A researcher is mapping her NGS reads to a reference sequence to find bases that
differ between the reference and the sequenced sample (called single nucleotide
polymorphisms, “SNPs”). One such SNP is reported at the position shown in the
figure.
The researcher is viewing the mapping output using the Tablet software with
settings enabled to show the read direction as green / blue and bases differing
from the reference sequence as red.
a) A false SNP is reported at the indicated position. What is the likely
cause(s)? (2p)
b) Suggest one thing you could do to avoid this problem. (2p)
Answer NGS 1 (2p)
                                                 YES   NO
V)    Using paired-end reads                     [ ]   [ ]
VI)   Increasing the read length                 [ ]   [ ]
VII)  Quality trimming of the read data          [ ]   [ ]
VIII) “Barcoding” the DNA with index sequences   [ ]   [ ]
Answer NGS 2 (4p)
a) Coverage is very low at the indicated position. All of the reads indicating a
difference do so in their last few bases, where quality is often poor.
b) Any of the following would work:
- Increase coverage to “dilute” away the errors
- Do end quality trimming on the read data
- (Not covered in course but OK) Filter the SNP positions based on
  coverage/quality etc.
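End quality trimming can be sketched as follows. The example assumes Phred+33 quality encoding and a threshold of Q20, and it removes only a trailing run of low-quality bases; dedicated trimming tools typically use more refined rules (for example sliding windows), so this is an illustration rather than a replacement for them.

```python
def trim_3prime(seq: str, qual: str, min_q: int = 20, offset: int = 33) -> tuple[str, str]:
    """Remove trailing bases whose Phred score is below min_q (Phred+33 assumed)."""
    end = len(seq)
    # Walk backwards from the 3' end while the base quality is too low.
    while end > 0 and ord(qual[end - 1]) - offset < min_q:
        end -= 1
    return seq[:end], qual[:end]

# Made-up read: 'I' encodes Q40, '#' encodes Q2 and '!' encodes Q0.
read_seq = "ACGTACGTAC"
read_qual = "IIIIIII#!#"
print(trim_3prime(read_seq, read_qual))
```

After trimming, the unreliable last bases no longer contribute mismatches at the reported SNP position.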
------------------------------------------------------------------------------------------------------
Introduction to Protein Analysis: Training Questions
Questions
A. Which of the following regular expressions match the sequence
DWILKDG?
1. D-M-x-[ILV]-x{2}-G
2. [DN]-W-x-[ILV]-[RKH]-x-G
3. [DN]-W-x{2}-[ILV]-G
4. D-W-I-[ILMV]-x-K-[GA]
B. Analyze the following protein sequence. What is its function? Do you have
an idea of its structure? Give the evidence you find (patterns,
profiles, …)
>TRFE_CHICK
YFAVAVARKDSNVNWNNLKGKKSCHTAVGRTAGWVIPMGLIHNRTGTCNFDEYFSEGCAPGS
PPNSRLCQLCQGSGGIPP
EKCVASSHEKYFGYTGALRCLVEKGDVAFIQHSTVEENTGGKNKADWAKNLQMDDFELLCTDG
RRANVMDYRECNLAEVP
THAVVVRPEKANKIRDLLERQEKRFG
ANSWERS
A. 2
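The answer to A can be checked mechanically by translating each pattern into a regular expression and testing it against the sequence. This is a minimal sketch: the converter below is a hypothetical helper that handles only the elements used in this question (single residues, x, bracketed residue sets and repeat counts written as x{2} or x(2)), not the full PROSITE syntax.

```python
import re

# One pattern element: x, a residue, or a [set], optionally followed by a repeat count.
ELEMENT = re.compile(r"(x|[A-Z]|\[[A-Z]+\])(?:[({](\d+(?:,\d+)?)[)}])?")

def pattern_to_regex(pattern: str) -> str:
    """Translate a simple dash-separated pattern into a Python regex."""
    parts = []
    for element in pattern.split("-"):
        m = ELEMENT.fullmatch(element)
        if m is None:
            raise ValueError(f"unsupported element: {element}")
        residue, repeat = m.groups()
        token = "." if residue == "x" else residue   # x means any residue
        parts.append(token + (f"{{{repeat}}}" if repeat else ""))
    return "".join(parts)

patterns = [
    "D-M-x-[ILV]-x{2}-G",
    "[DN]-W-x-[ILV]-[RKH]-x-G",
    "[DN]-W-x{2}-[ILV]-G",
    "D-W-I-[ILMV]-x-K-[GA]",
]

sequence = "DWILKDG"
matches = [bool(re.search(pattern_to_regex(p), sequence)) for p in patterns]
for number, (p, hit) in enumerate(zip(patterns, matches), start=1):
    print(number, p, "matches" if hit else "does not match")
```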
B. Here is a concise analysis:
Expasy ScanProsite and/or NPSA proscan
- One matching PROFILE found (expasy ScanProsite):
PS51408 TRANSFERRIN_LIKE_4 Transferrin-like domain profile
Description:
The transferrin family is a group of glycosylated proteins found in both
vertebrates and invertebrates. Included in this group are molecules known to
bind iron, including serotransferrin, ovotransferrin, lactotransferrin, and
melanotransferrin.
Proteins of this family contain 4 disulfide bonds, and a total of 3 cysteine
residues are strictly conserved across the 3 signature patterns of the
transferrin family.
- 3 patterns/signatures found:
TRANSFERRIN_LIKE_1
positions 1-10 in query sequence
pattern in query: YFAVAVARKD.
Pattern= Y-x(0,1)-[VAS]-V-[IVAC]-[IVA]-[IVA]-[RKH]-[RKS]-[GDENSA]
Y is an iron ligand
TRANSFERRIN_LIKE_2
positions 94-109 in query sequence
pattern in query: YTGALRCLVEKGDVAF
Pattern= [YI]-x-G-A-[FLI]-[KRHNQS]-C-L-x(3,4)-G-[DENQ]-V-[GAT]-[FYW]
Y is an iron ligand; C is involved in a disulfide bond
TRANSFERRIN_LIKE_3
positions 135-165 in query sequence
pattern in query: DFELLCTDGRRANVMDYRECNLAEVPTHAVV
Pattern= [DENQK]-[YF]-x-[LY]-L-C-x-[DN]-x(5,8)-[LIV]-x(4,5)-C-x(2)-A-x(4)-[HQR]-x-[LIVMFYW]-[LIVM].
H is an iron ligand; The 2 C's are linked by a disulfide bond
The protein seems to be a Transferrin
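The three reported signature positions can be reproduced with plain regular expressions. In this sketch the regexes are hand translations of the three patterns listed above (x becomes ".", x(m,n) becomes ".{m,n}"), which is an assumption about how the scan works rather than the ScanProsite implementation; the variable names are illustrative.

```python
import re

# Query sequence TRFE_CHICK fragment, joined from the five lines given in the question.
query = (
    "YFAVAVARKDSNVNWNNLKGKKSCHTAVGRTAGWVIPMGLIHNRTGTCNFDEYFSEGCAPGS"
    "PPNSRLCQLCQGSGGIPP"
    "EKCVASSHEKYFGYTGALRCLVEKGDVAFIQHSTVEENTGGKNKADWAKNLQMDDFELLCTDG"
    "RRANVMDYRECNLAEVP"
    "THAVVVRPEKANKIRDLLERQEKRFG"
)

# Hand-translated versions of the three patterns quoted in the answer.
signatures = {
    "TRANSFERRIN_LIKE_1": r"Y.{0,1}[VAS]V[IVAC][IVA][IVA][RKH][RKS][GDENSA]",
    "TRANSFERRIN_LIKE_2": r"[YI].GA[FLI][KRHNQS]CL.{3,4}G[DENQ]V[GAT][FYW]",
    "TRANSFERRIN_LIKE_3": (
        r"[DENQK][YF].[LY]LC.[DN].{5,8}[LIV].{4,5}C.{2}A.{4}[HQR].[LIVMFYW][LIVM]"
    ),
}

spans = {}
for name, regex in signatures.items():
    hit = re.search(regex, query)
    if hit:
        # Convert the 0-based, end-exclusive span to 1-based inclusive positions.
        spans[name] = (hit.start() + 1, hit.end())
        print(name, spans[name], hit.group())
```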
Looking for homologous proteins in SwissProt:
BLAST against Swissprot (on NPS@ server)
The best hit is:
> NPSA gnl|unipsp|P02789 Ovotransferrin (length=705 residues).
Length = 705
Score = 396 bits (1017), Expect = e-135,
Method:
Compositional matrix adjust. Identities = 186/186 (100%), Positives
= 186/186 (100%)
=> Swissprot entry: P02789- TRFE_CHICK: Ovotransferrin – Gallus gallus
precursor
According to the Blast alignment, the query sequence is aligned with positions
450 to 635 of the subject sequence, with 100% identity.
Our query sequence seems to correspond to part of the transferrin-like 2 domain
of the Gallus gallus ovotransferrin (this domain is located between positions
364 and 689).
There is a PDB entry corresponding to this protein (Ovotransferrin, C-Terminal
Lobe): 1IQ7.
Analyzing the structure (Rasmol or SwissPDBViewer):
It is a globular, monomeric protein. It is built from alpha helices, beta strands
(forming beta sheets) and turns, and has a bound SO4 ligand.
NGS
1- What is "low-complexity" sequence? (1p)
Answer: Regions with low-complexity sequence have an unusual composition
that can create problems in sequence similarity searching. Low-complexity
sequence can often be recognized by visual inspection. For example, the protein
sequence PPCDPPPPPKDKKKKDDGPP has low complexity and so does the
nucleotide sequence AAATAAAAAAAATAAAAAAT.
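Besides visual inspection, compositional bias can be quantified. One simple option, used here as an assumption and not taken from the course material, is the Shannon entropy of the residue composition: the lower the value relative to the alphabet's maximum (2 bits for DNA, about 4.32 bits for proteins), the more biased the sequence. Dedicated filters work on sliding windows and use more elaborate statistics.

```python
import math
from collections import Counter

def shannon_entropy(seq: str) -> float:
    """Shannon entropy (bits per residue) of the residue composition of seq."""
    counts = Counter(seq)
    n = len(seq)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

protein_example = "PPCDPPPPPKDKKKKDDGPP"
dna_example = "AAATAAAAAAAATAAAAAAT"
balanced_dna = "ACGT" * 5   # made-up contrast with uniform composition

for label, seq in [("protein", protein_example),
                   ("DNA", dna_example),
                   ("balanced DNA", balanced_dna)]:
    print(f"{label}: {shannon_entropy(seq):.2f} bits")
```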
2- Explain what sequence contaminants are and give some common cases (1p)
Answer: Wrong nucleotide material in the initial sample, e.g.
• Mitochondrial and chloroplast DNA in genomic samples
• rRNA in transcriptomes
• Pathogens in infected samples
3- Assembly has three main steps. Name them. (1p)
Answer:
i) Assemble all pieces of unique regions. These are called contigs.
ii) Connect them up. The end product is called a scaffold.
iii) Fill in the gaps in between with repetitive sequences.
------------------------------------------------------------------------------------------------------
UNIX
Question 1.
a. In how many ways can the cat command be used in Unix? (1 pt)
b. Which command can you use to move up and down in a file? (1 pt)
Answers:
a. Print
Write
Concatenate/Append
b. The command more
Question 2. Write commands that do the following:
a. List all files that contain "seq" in the filename (0.5 pt)
b. Create a new directory “unix” (0.5 pt)
c. Copy the contents of the files with the word "seq" in the filename into a file
called "tot.txt" (1 pt)
d. Write a one-line command that shows how many files you have in your
current working directory. (2 pt)
Answers:
a. ls *seq*
b. mkdir unix
c. cat *seq* > unix/tot.txt
d. ls | wc -l
1) QTL studies in livestock have been very productive (see animalQTLdb:
http://www.animalgenome.org/QTLdb/ ). Why is it hard to identify the
genes underlying the QTL? (2p)
Answer: the confidence intervals for QTLs are large and contain
hundreds of potential candidate genes
How can we use gene expression studies to increase the resolution of QTL
studies? (4p)
Answer: If we study gene expression in animals with alternative QTL
genotypes we can discover how the QTL affects gene expression. Genes
that differ in expression between genotypes can either be local (near the
QTL) or unlinked to the QTL. Local differentially expressed genes are
good candidate genes for the QTL. Unlinked genes can point to the
molecular pathway that is perturbed by the QTL.
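The local versus unlinked distinction can be sketched as a simple interval check. All gene names, coordinates and the QTL confidence interval below are hypothetical and serve only to illustrate the classification.

```python
from dataclasses import dataclass

@dataclass
class Gene:
    name: str
    chrom: str
    start: int
    end: int

def classify(gene: Gene, qtl_chrom: str, qtl_start: int, qtl_end: int) -> str:
    """Label a differentially expressed gene as 'local' if it overlaps the QTL interval."""
    overlaps = (
        gene.chrom == qtl_chrom
        and gene.start <= qtl_end
        and gene.end >= qtl_start
    )
    return "local" if overlaps else "unlinked"

# Hypothetical QTL confidence interval and differentially expressed genes.
qtl = ("chr6", 30_000_000, 45_000_000)
de_genes = [
    Gene("geneA", "chr6", 38_100_000, 38_150_000),   # inside the interval
    Gene("geneB", "chr6", 90_000_000, 90_020_000),   # same chromosome, far away
    Gene("geneC", "chr14", 5_000_000, 5_040_000),    # different chromosome
]

for gene in de_genes:
    print(gene.name, classify(gene, *qtl))
```

Genes labelled local would be the candidate genes for the QTL, and the unlinked ones would be examined for a shared pathway.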
Using the sequence below, answer the following questions:
1- Which species does this sequence belong to? (1P)
2- Translate this nucleotide sequence to protein (1P)
3- Find out which splice variant this belongs to (1P)
4- Find the cow homolog of this gene and state whether this splice variant is
known in bovine. (4P)
>xng
ATGGGGACTTCCCATCCGGCGTTCCTGGTCTTAGGCTGTCTTCTCACAGGGCTGAGCCTA
ATCCTCTGCCAGCTTTCATTACCCTCTATCCTTCCAAATGAAAATGAAAAGGTTGTGCAG
CTGAATTCATCCTTTTCTCTGAGATGCTTTGGGGAGAGTGAAGTGAGCTGGCAGTACCCC
ATGTCTGAAGAAGAGAGCTCCGATGTGGAAATCAGAAATGAAGAAAACAACAGCGGCCTT
TTTGTGACGGTCTTGGAAGTGAGCAGTGCCTCGGCGGCCCACACAGGGTTGTACACTTGC
TATTACAACCACACTCAGACAGAAGAGAATGAGCTTGAAGGCAGGCACATTTACATCTAT
GTGCCAGACCCAGATGTAGCCTTTGTACCTCTAGGAATGACGGATTATTTAGTCATCGTG
GAGGATGATGATTCTGCCATTATACCTTGTCGCACAACTGATCCCGAGACTCCT
ANSWER:
1- Human. Since you do not know this from the start, a
standard BLAST against the NCBI nr database can give you a
hint, but you will see that human is not the only species
with a good hit, so you must also look at the length of the
hit. Then continue with ENSEMBL and select human.
2- Use for instance EMBOSS transeq
>xng_1
MGTSHPAFLVLGCLLTGLSLILCQLSLPSILPNENEKVVQLNS
SFSLRCFGESEVSWQYPMSEEESSDVEIRNEENNSGLFVTV
LEVSSASAAHTGLYTCYYNHTQTEENELEGRHIYIYVPDPDV
AFVPLGMTDYLVIVEDDDSAIIPCRTTDPETP
Check that the translation is correct by blasting it against a
protein database or by comparing it with the protein
information in ENSEMBL.
3- Use ENSEMBL and BLAST the sequence against the
cow genome. This is much harder with NCBI BLAST, as most
hits are primate-related.
4- No, it is not known in the cow. In ENSEMBL you can choose to
show all known cow mRNAs/cDNAs.
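The translation in answer 2 can also be reproduced without an external tool. This is a minimal sketch that assumes the standard genetic code and reading frame 1; it is an illustration, not the EMBOSS transeq implementation, and it translates only the first 60 bases of the sequence.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (TTT, TTC, TTA, TTG, TCT, ...).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

def translate(dna: str) -> str:
    """Translate DNA in frame 1; unknown codons become 'X', a trailing partial codon is dropped."""
    dna = dna.upper()
    usable = len(dna) - len(dna) % 3
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, usable, 3))

# First line of the xng sequence from the question.
first_line = "ATGGGGACTTCCCATCCGGCGTTCCTGGTCTTAGGCTGTCTTCTCACAGGGCTGAGCCTA"
print(translate(first_line))
```

The output should agree with the first 20 residues of the xng_1 protein shown in answer 2.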
Last question:
Give examples of why the following steps are important and what the outcome is
when performing a genome-wide association analysis with PLINK (5p).
1) Making a binary PED file
2) Perform basic statistics
3) Perform basic association analysis
4) Check for stratification
5) Test the region for genome-wide significance
Answer:
1) A binary PED file is a more compact representation of the data that saves
space and speeds up analysis.
2) Get basic knowledge about the data, see how well SNPs and individuals
worked, report allele frequencies and missing rates etc.
3) To uncover the genetic basis of a given disease or phenotypic trait. A file with
p-values is reported.
4) Important analysis to visualize substructure. Significance test for whether two
individuals belong to the same population. PLINK can constrain the cluster
solution by phenotype if stratification is detected and perform association
analyses conditional on the cluster solution.
5) Adjust for multiple hypothesis testing. A file with adjusted p-values is
reported.
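The idea behind step 5 can be illustrated with the simplest correction, Bonferroni, in which each p-value is multiplied by the number of tests and capped at 1. This is a sketch with made-up p-values; it is not PLINK's code, and PLINK's adjusted output contains several other corrections as well.

```python
def bonferroni(p_values: list[float]) -> list[float]:
    """Bonferroni-adjust p-values: multiply by the number of tests, cap at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]

# Made-up raw p-values for four SNPs.
raw = [0.00001, 0.004, 0.03, 0.5]
adjusted = bonferroni(raw)

alpha = 0.05
for p, p_adj in zip(raw, adjusted):
    verdict = "significant" if p_adj < alpha else "not significant"
    print(f"raw p = {p:g}, adjusted p = {p_adj:g}: {verdict}")
```

With a real genome-wide scan the number of tests is the number of SNPs analysed, so only very small raw p-values survive the correction.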