Gene Expression Analysis, DNA Chips and Genetic Networks

Motif Finding Workshop
Project
Chaim Linhart
January 2008
MF workshop 08 © Ron Shamir
Outline
1. Some background again…
2. The project
1. Background
Slides with Ron Shamir and Adi Akavia
Gene: from DNA to protein
DNA → (transcription) → Pre-mRNA → (splicing) → Mature mRNA → (translation) → Protein
DNA
• DNA: a “string” over the alphabet of 4 bases
(nucleotides): { A, C, G, T }
• Resides in chromosomes
• Complementary strands: A-T ; C-G
Forward/sense strand (5’→3’):
AACTTGCG
Reverse-complement/anti-sense strand (aligned below it, 3’→5’):
TTGAACGC
(read 5’→3’: CGCAAGTT)
• Directional: from 5’ to 3’:
5’ end (upstream) AACTTGCGATACTCCTA (downstream) 3’ end
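The complementarity rule above can be sketched in a few lines of Java. This is only an illustration: the class and method names (`DnaStrand`, `reverseComplement`) are hypothetical, and mapping every non-ACGT character to N is an assumption.

```java
final class DnaStrand {
    /** Complement of one base (A-T, C-G); anything else is treated as N. */
    static char complement(char base) {
        switch (Character.toUpperCase(base)) {
            case 'A': return 'T';
            case 'T': return 'A';
            case 'C': return 'G';
            case 'G': return 'C';
            default:  return 'N'; // assumption: unknown/masked bases stay unknown
        }
    }

    /** Reverse complement of a sequence, returned in 5' to 3' direction. */
    static String reverseComplement(String seq) {
        StringBuilder sb = new StringBuilder(seq.length());
        for (int i = seq.length() - 1; i >= 0; i--) {
            sb.append(complement(seq.charAt(i)));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(reverseComplement("AACTTGCG"));
    }
}
```

A two-strand motif search can then scan each sequence together with its reverse complement.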
Gene structure (eukaryotes)
[Figure: eukaryotic gene structure.
DNA (coding strand): promoter, transcription start site (TSS), exons and introns.
Transcription (RNA polymerase) → Pre-mRNA: exons and introns.
Splicing (spliceosome) → Mature mRNA: 5’ UTR, start codon, coding region, stop codon, 3’ UTR.
Translation (ribosome) → Protein.]
Translation
• Codon - a triplet of bases, codes a specific amino
acid (except the stop codons); many-to-1 relation
• Stop codons - signal termination of the protein
synthesis process
http://ntri.tamuk.edu/cell/ribosomes.html
Genome sequences
• Many genomes have been sequenced,
including those of viruses, microbes, plants
and animals.
• Human:
– 23 pairs of chromosomes
– 3+ Gbps (bps = base pairs), only ~3% are genes
– ~25,000 genes
• Yeast:
– 16 chromosomes
– 20 Mbps
– 6,500 genes
Regulation of Expression
• Each cell contains an identical copy of the
whole genome - but utilizes only a subset
of the genes to perform diverse, unique
tasks
• Most genes are highly regulated –
their expression is limited to specific
tissues, developmental stages,
physiological conditions
• Main regulatory mechanism –
transcriptional regulation
Transcriptional regulation
• Transcription is regulated primarily by transcription
factors (TFs) – proteins that bind to DNA
subsequences, called binding sites (BSs)
• TFBSs are located mainly (not always!) in the gene’s
promoter – the DNA sequence upstream of the gene’s
transcription start site (TSS)
• BSs of a particular TF share a common pattern, or
motif
• Some TFs operate together – TF modules
[Figure: DNA strand (5’ to 3’) with two TFs bound to their BSs near the TSS of a gene]
TFBS motif models
• Consensus (“degenerate”) string: [AC]ACT[CG]T
Example binding sites:
AACTGT
CACTGT
CACTCT
CACTGT
AACTGT
[Figure: binding sites marked in the promoters of gene 1 through gene 10]
• Statistical models…
• Motif logo representation
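As a rough illustration of the consensus model, a degenerate string written in bracket notation such as `[AC]ACT[CG]T` (the notation is an assumption here) happens to be a valid regular expression, so `java.util.regex` can locate its occurrences. The class and method names are hypothetical, and the sketch scans one strand only.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class ConsensusMatcher {
    /** Returns the 0-based start positions of all (possibly overlapping) matches. */
    static List<Integer> findSites(String consensus, String sequence) {
        String upper = sequence.toUpperCase();
        Matcher m = Pattern.compile(consensus).matcher(upper);
        List<Integer> starts = new ArrayList<>();
        int from = 0;
        // Restart the search one base after each hit so overlapping sites are found.
        while (from <= upper.length() && m.find(from)) {
            starts.add(m.start());
            from = m.start() + 1;
        }
        return starts;
    }

    public static void main(String[] args) {
        System.out.println(findSites("[AC]ACT[CG]T", "ttAACTGTggCACTCTaa"));
    }
}
```

A statistical model (e.g. a position weight matrix) would replace the yes/no match with a score per position.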
Human G2+M cell-cycle genes:
The CHR – NF-Y module
CDCA3 (trigger of mitotic entry 1)
CTCAGCCAATAGGGTCAGGGCAGGGGGCGTGGCGGGAAGTTTGAAACT -18
CDCA8 (cell division cycle associated 8)
TTGTGATTGGATGTTGTGGGA…[25bp]…TGACTGTGGAGTTTGAATTGG +23
CDC2 (cell division control protein 2 homolog)
CTCTGATTGGCTGCTTTGAAAGTCTACGGGCTACCCGATTGGTGAATCCGGG
GCCCTTTAGCGCGGTGAGTTTGAAACTGCT 0
CDC42EP4 (cdc42 effector protein 4)
GCTTTCAGTTTGAACCGAGGA…[25bp]…CGACGGCCATTGGCTGCTGC -110
CCNB1 (G2/mitotic-specific cyclin B1)
AGCCGCCAATGGGAAGGGAG…[30bp]…AGCAGTGCGGGGTTTAAATCT +45
CCNB2 (G2/mitotic-specific cyclin B2)
TTCAGCCAATGAGAGT…[15bp]…GTGTTGGCCAATGAGAAC…[15bp]…GGGC
CGCCCAATGGGGCGCAAGCGACGCGGTATTTGAATCCTGGA +10
TFs: NF-Y , CHR
BSs are short, non-specific, hiding in both strands and at various
locations along the promoters
The computational challenge
• Given a set of co-regulated genes
(e.g., from gene expression chips)
• Find a motif that is over-represented
(occurs unusually often) in their
promoters
• This may be the TF binding site motif
• Find TF modules – over-represented
motifs that tend to co-occur
The computational challenge (II)
• Motifs can also be found w/o a given
target-set – “genome-wide”
• Find a motif that is localized - occurs
more often near the TSS of genes
• Find a motif with a strand bias –
occurs more often on the genes’
coding strand
• Find TF modules with biases in their
order / orientation / distance
Motif finding algorithms
• >100 motif finding algs
• Main differences between them:
– Type of analysis & input:
• Target-set vs. genome-wide
• Single vs. multi-species (conservation)
• Single motifs vs. modules
– Motif model
– Score for evaluating motif
– Motif search technique:
• Combinatorial (enumeration) vs.
Statistical optimization
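To make the combinatorial (enumeration) approach concrete, here is a minimal Java sketch that counts every k-mer occurring in a set of sequences. The names are hypothetical, skipping windows that contain N is an assumption, and overlapping occurrences are counted as-is.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class KmerCounter {
    /** Counts all length-k substrings over A/C/G/T; windows containing N are skipped. */
    static Map<String, Integer> countKmers(List<String> sequences, int k) {
        Map<String, Integer> counts = new HashMap<>();
        for (String raw : sequences) {
            String seq = raw.toUpperCase();
            for (int i = 0; i + k <= seq.length(); i++) {
                String kmer = seq.substring(i, i + k);
                if (kmer.indexOf('N') >= 0) continue; // unknown or masked base
                counts.merge(kmer, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(countKmers(Arrays.asList("ACGTNACG", "acg"), 3));
    }
}
```

For short motifs the full table of 4^k strings fits in memory, which is what makes exhaustive enumeration feasible; a real tool would then score each k-mer against a background model.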
Example - Amadeus
Over-represented motifs in the promoters of genes expressed in
the G2 and G2/M phases of the human cell cycle:
CHR
NF-Y
2. The project
General goals
• Develop software from A-Z:
– Design
– Implementation
– (Optimization)
– Execution & analysis of real data
• A taste of bioinformatics
• Have fun
• Get credit…
The computational task
• Given a set of DNA sequences
• Find “interesting” pairs of motifs:
– Order bias
– Other scores…
• Main challenges:
– Performance (time, memory)
– Output redundancy
Input
File with DNA sequences in “fasta” format:
>sequence-name1 <space> [header1]
ACCCGNNNNTCGGAAATGANN
CGGAGTAAAATATGCGAGCGT
>sequence-name2 <space> [header2]
cggattnnnaccgcannnnnnnnaccgtga
>sequence-name3 <space> [header3]
agtttagactgctagctcgatcgcta
gcggatnggctannnnnatctag
Input (II)
• Ignore the header lines
• Sequence may span multiple lines or
one long line
• Sequence contains the characters
A,C,G,T,N in upper or lower case
• “N” means unknown or masked base
• Sample input files will be supplied
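The input rules above could be handled along the following lines. This is a sketch under stated assumptions: the class and method names are hypothetical, sequences are normalized to upper case, and characters other than A, C, G, T, N are not validated.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class FastaParser {
    /** Joins the sequence lines of each record; header lines (starting with '>') are ignored. */
    static List<String> parseLines(List<String> lines) {
        List<String> sequences = new ArrayList<>();
        StringBuilder current = null;
        for (String raw : lines) {
            String line = raw.trim();
            if (line.isEmpty()) continue;
            if (line.startsWith(">")) {
                // A header closes the previous record and opens a new one.
                if (current != null) sequences.add(current.toString());
                current = new StringBuilder();
            } else if (current != null) {
                current.append(line.toUpperCase()); // sequence may span several lines
            }
        }
        if (current != null) sequences.add(current.toString());
        return sequences;
    }

    public static void main(String[] args) {
        List<String> demo = Arrays.asList(">sequence-name1 header1", "ACCCGNNNN", "tcgg", ">sequence-name2", "cggattnnn");
        System.out.println(parseLines(demo));
    }
}
```

The lines of a real file can be obtained with `java.nio.file.Files.readAllLines(path)`; for very large inputs a streaming reader would use less memory.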
Input (III)
• Search parameters:
– Length of motifs (between 5-10)
– Min. + Max. distance between the motifs:
ACGGATTGATNNNTGGATGCCAT
distance=9
– Single vs. two strands search
– Min. number of occurrences (hits) of pair
(don’t count overlaps, e.g. AAAAAA):
GCGGATTCAGTGATGCCANGNATGCCTCAGGATTGNAATGCCA
[three hits are marked on this sequence in the original figure]
– Max. p-value
– Additional parameters…
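One possible reading of these parameters, sketched in Java: a hit is an occurrence of the first string followed by the second with a gap of minGap..maxGap bases between them, and the scan resumes after each hit so that hits never overlap. Defining distance as the number of bases between the two strings is an assumption, as are all names; the input is assumed to be upper case and only one strand is scanned.

```java
final class PairHits {
    /** Counts non-overlapping hits of 'first' followed by 'second' with minGap..maxGap bases in between. */
    static int countOrdered(String seq, String first, String second, int minGap, int maxGap) {
        if (first.isEmpty() || second.isEmpty()) throw new IllegalArgumentException("empty motif");
        int hits = 0;
        int i = 0;
        while ((i = seq.indexOf(first, i)) >= 0) {
            int firstEnd = i + first.length();
            int found = -1;
            for (int gap = minGap; gap <= maxGap; gap++) {
                if (seq.startsWith(second, firstEnd + gap)) { // false when out of range
                    found = firstEnd + gap;
                    break;
                }
            }
            if (found >= 0) {
                hits++;
                i = found + second.length(); // continue after the whole hit
            } else {
                i++;
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        String seq = "ACGTTNNGGATTCCGGATTACGTT";
        System.out.println("A->B: " + countOrdered(seq, "ACGTT", "GGATT", 0, 5));
        System.out.println("B->A: " + countOrdered(seq, "GGATT", "ACGTT", 0, 5));
    }
}
```

Calling the method once per order gives the two counts needed for an order-bias score; a two-strand search would repeat the scan on the reverse complement.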
Output
A. A list of the string pairs with the
best order-bias score (smallest p-values):

Motif A   Motif B   A→B   B→A   p-value
ACGTT     GGATT     97    17    4.3E-15
ACGTT     GATTC     87    16    2.7E-13
TTAAC     CAGCC     31    114   1.2E-12

B. A non-redundant list of motif pairs
(motif = consensus string):
logos, # of hits, additional scores
Part A: String pairs with
order bias
• nA = # of A→B ; nB = # of B→A
• WLOG, nA > nB
• n = nA + nB
• H0 = random order: nA ~ B(n, 0.5)
• p-value = prob for at least nA occurrences
of A→B = tail of B(n, 0.5)
• Normal approximation (central limit thm.)
• Fix for multiple testing: ×2

\[ \text{Binomial tail}(n, p, k) = \sum_{j=k}^{n} \binom{n}{j} \, p^{j} (1-p)^{n-j} \]
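The order-bias score can be sketched as follows. The binomial tail is summed in log space to avoid overflow for large n, and the ×2 factor is applied to the larger of the two counts. Names are hypothetical, 0 < p < 1 is assumed, and capping the result at 1 is an added convention. The workshop's supplied statistics package may compute this differently.

```java
final class OrderBiasScore {
    /** P(X >= k) for X ~ B(n, p), assuming 0 < p < 1. */
    static double binomialTail(int n, double p, int k) {
        if (k <= 0) return 1.0;
        if (k > n) return 0.0;
        double logP = Math.log(p);
        double logQ = Math.log(1.0 - p);
        double logChoose = 0.0; // log C(n, 0)
        double tail = 0.0;
        for (int j = 0; j <= n; j++) {
            // log C(n, j) = log C(n, j-1) + log(n-j+1) - log(j)
            if (j > 0) logChoose += Math.log(n - j + 1) - Math.log(j);
            if (j >= k) tail += Math.exp(logChoose + j * logP + (n - j) * logQ);
        }
        return Math.min(1.0, tail);
    }

    /** Order-bias p-value: tail of B(n, 0.5) at the larger count, times 2. */
    static double orderBiasPValue(int aThenB, int bThenA) {
        int n = aThenB + bThenA;
        int larger = Math.max(aThenB, bThenA);
        return Math.min(1.0, 2.0 * binomialTail(n, 0.5, larger));
    }

    public static void main(String[] args) {
        System.out.println(orderBiasPValue(97, 17));
    }
}
```

The exact sum costs O(n) per pair; the normal approximation mentioned above trades a little accuracy for constant time when millions of pairs are scored.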
Part B: Non-redundant list
of motif pairs
• Collect similar strings into a motif with a better
score (motif = consensus):
String pair (p-value)
ACGTT , GGATT (4.3E-15)
ACGAT , GGATT (2.4E-11)
AGGAT , GGTTT (1.7E-5)
AGGTT , GGTTT (5.9E-5)
Motif pair (p-value)
[logo of motif A] , [logo of motif B] (8.1E-31)
• Don’t report similar motif pairs:
– Motifs that consist of similar strings
– Motif pairs that are small shifts of one another
– Palindromes
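A minimal sketch of two building blocks for this step: a Hamming distance to decide whether two strings are similar, and a consensus that lists the bases observed at each position. The similarity threshold, the bracket notation and all names are assumptions; the slides do not prescribe a clustering method.

```java
import java.util.Arrays;
import java.util.List;
import java.util.TreeSet;

final class MotifMerge {
    /** Number of mismatching positions between two equal-length strings. */
    static int hamming(String a, String b) {
        if (a.length() != b.length()) throw new IllegalArgumentException("lengths differ");
        int d = 0;
        for (int i = 0; i < a.length(); i++) {
            if (a.charAt(i) != b.charAt(i)) d++;
        }
        return d;
    }

    /** Consensus of equal-length strings; positions with several bases are written as [..]. */
    static String consensus(List<String> strings) {
        int len = strings.get(0).length();
        StringBuilder sb = new StringBuilder();
        for (int pos = 0; pos < len; pos++) {
            TreeSet<Character> bases = new TreeSet<>();
            for (String s : strings) bases.add(s.charAt(pos));
            if (bases.size() == 1) {
                sb.append(bases.first());
            } else {
                sb.append('[');
                for (char c : bases) sb.append(c);
                sb.append(']');
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(consensus(Arrays.asList("ACGTT", "ACGAT", "AGGAT", "AGGTT")));
    }
}
```

Strings within a small Hamming distance of a high-scoring seed could be merged greedily, and the merged motif re-scored on the pooled hits.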
Part B (cont.): Additional score
Option I: Co-occurrence rate
N = total # of sequences
sA = # of sequences that contain motif A
sB = # of sequences that contain motif B
sAB = # of sequences that contain motifs A and B
H0 = motifs occur independently and randomly
p-value = prob for at least sAB joint occurrences, given the
number of hits of each single motif
= tail of hypergeometric distribution

\[ \text{HG tail}(N, s_A, s_B, s_{AB}) = \sum_{i=s_{AB}}^{\min(s_A, s_B)} \frac{\binom{s_B}{i} \binom{N - s_B}{s_A - i}}{\binom{N}{s_A}} \]
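The co-occurrence score can be sketched directly from the hypergeometric tail. Binomial coefficients are handled as logarithms so that large N does not overflow. All names are hypothetical, and the supplied statistics package may offer an equivalent routine.

```java
final class CoOccurrenceScore {
    /** log C(n, k); negative infinity when the coefficient is zero. */
    static double logChoose(int n, int k) {
        if (k < 0 || k > n) return Double.NEGATIVE_INFINITY;
        double s = 0.0;
        for (int t = 1; t <= k; t++) {
            s += Math.log(n - k + t) - Math.log(t);
        }
        return s;
    }

    /** P(at least sAB sequences contain both motifs), given N, sA and sB. */
    static double hypergeometricTail(int N, int sA, int sB, int sAB) {
        double logDenominator = logChoose(N, sA);
        double tail = 0.0;
        for (int i = Math.max(sAB, 0); i <= Math.min(sA, sB); i++) {
            // exp(-infinity) is 0, so impossible terms contribute nothing
            tail += Math.exp(logChoose(sB, i) + logChoose(N - sB, sA - i) - logDenominator);
        }
        return Math.min(1.0, tail);
    }

    public static void main(String[] args) {
        System.out.println(hypergeometricTail(10, 3, 4, 2));
    }
}
```

A small p-value means the two motifs share more sequences than independent placement would explain.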
Part B (cont.): Additional score
Option II: Distance bias
Is the distance between the two motifs uniform (H0),
or are there specific distances that are very common?
Option III: Gap variability
Are the sequences between the motifs conserved (H0),
or are they highly variable?
Other options??
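For Option II, one possible statistic (an assumption, not prescribed by the slides) is a chi-square comparison of the observed gap histogram with a uniform distribution over the allowed distances. The names are hypothetical, and turning the statistic into a p-value would need a chi-square distribution function, which is not included here.

```java
final class DistanceBias {
    /**
     * Chi-square statistic of observed gap counts against a uniform expectation.
     * gapCounts[d] = number of hits whose gap equals minDistance + d.
     */
    static double chiSquareVsUniform(int[] gapCounts) {
        long total = 0;
        for (int c : gapCounts) total += c;
        if (total == 0) return 0.0;
        double expected = (double) total / gapCounts.length;
        double stat = 0.0;
        for (int c : gapCounts) {
            double diff = c - expected;
            stat += diff * diff / expected;
        }
        return stat;
    }

    public static void main(String[] args) {
        System.out.println(chiSquareVsUniform(new int[] {20, 0, 0, 0}));
    }
}
```

Larger values indicate that a few specific distances dominate; with few hits per bin the chi-square approximation is unreliable, so an exact or permutation test may be preferable.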
Implementation
• Java (Eclipse) ; Linux
• GUI: Simple graphical user interface for
supplying the input parameters and
reporting the results
• Packages for motif logo and statistical
scores will be supplied
• Time performance will be measured only for
part A
• Reasonable documentation
• Separate packages for data-structures,
scores, GUI, I/O, etc.
Design document
• Due in 3 weeks (Feb 24)
• 3-5 pages (Word), Hebrew/English
• Briefly describe main goal, input and
output of program
• Describe main data structures,
algorithms, and scores for parts A+B
• Meet with me before submission
Fin