TimeSearcher: Interactive Querying for
Identification of Patterns in Genetic Data
Harry Hochheiser
Eric Baehrecke
Stephen Mount
Ben Shneiderman
Harry Hochheiser is supported by a fellowship from America Online.
Time Series Data
• Real-Valued function over time
• Goal: find patterns
– “Starts Low, Ends High”
– Outliers
– Periodic Patterns
– Laggards and Leaders
• Hypothesis generation
Microarray Data
Chu, et al. The transcriptional program of sporulation in budding yeast,
Science 1998 Oct 23; 282(5389): 699-705.
Timeboxes
• Rectangular query regions
• Value must be in range for all time points in region
• Combine multiple timeboxes for conjunctive query
Sharp Rise
Panic Reversal
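The timebox rules above can be sketched in a few lines of Python. This is a minimal illustration, not TimeSearcher's implementation: the `Timebox` class, the `query` function, and the toy data are all hypothetical names and values chosen for the example. Time is assumed to be an integer index into each series.

```python
from dataclasses import dataclass
from typing import Dict, List, Sequence


@dataclass(frozen=True)
class Timebox:
    """Rectangular query region: time indices [t_start, t_end], values [v_min, v_max]."""
    t_start: int
    t_end: int
    v_min: float
    v_max: float

    def matches(self, series: Sequence[float]) -> bool:
        # Every time point inside the box must have a value within the range.
        window = series[self.t_start:self.t_end + 1]
        return all(self.v_min <= v <= self.v_max for v in window)


def query(data: Dict[str, Sequence[float]], boxes: List[Timebox]) -> List[str]:
    # Conjunctive query: an item is kept only if it satisfies every timebox.
    return [name for name, series in data.items()
            if all(box.matches(series) for box in boxes)]


# Hypothetical toy data, not taken from the talk.
data = {
    "rises": [1.0, 1.2, 1.1, 4.8, 5.0],
    "flat": [1.0, 1.1, 1.0, 1.2, 1.1],
    "falls": [5.0, 4.9, 3.0, 1.2, 1.0],
}

# "Starts low, ends high": low over the first two points, high over the last two.
starts_low = Timebox(0, 1, 0.0, 2.0)
ends_high = Timebox(3, 4, 4.0, 6.0)
result = query(data, [starts_low, ends_high])
print(result)
```

With a single box (`starts_low`) both "rises" and "flat" pass; adding `ends_high` narrows the result to "rises", which is the conjunctive behavior described in the bullets.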
TimeSearcher/Microarray demo
TimeSearcher
• Interactive exploration of time-series data
• Dynamic queries (<100ms)
• Linear display of individual items
• Create queries on graph area
• Move, scale timeboxes to modify query
• Drag-and-Drop for query-by-example
Other Applications
• “Time”: linear ordered sequence
• Use TimeSearcher for general sequences
– E.g., DNA
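One way a set of aligned DNA sequences might be turned into series suitable for this kind of querying is to treat alignment position as the "time" axis and, for each short word, count how many sequences contain that word at each position. The sketch below assumes that preprocessing; it is not a description of the method used in the talk, and `positional_word_counts` and the toy sequences are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List


def positional_word_counts(seqs: List[str], k: int) -> Dict[str, List[int]]:
    """For each k-letter word, count occurrences at every alignment position.

    Assumes all sequences are aligned and have the same length.
    """
    if not seqs:
        return {}
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("sequences must be aligned to equal length")
    n_positions = length - k + 1
    # One counter per word, indexed by the word's start position.
    counts: Dict[str, List[int]] = defaultdict(lambda: [0] * n_positions)
    for s in seqs:
        for pos in range(n_positions):
            counts[s[pos:pos + k]][pos] += 1
    return dict(counts)


# Hypothetical toy alignment (not data from the talk).
seqs = ["ACTAAC", "GCTAAT", "ACTGAC"]
series = positional_word_counts(seqs, 3)
print(series["CTA"])
```

Each word then has one value per position, which is the same shape as a time series with one value per time point.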
TimeSearcher for analysis of weak signals in nucleotide sequences:
Application to the case of the Arabidopsis thaliana branch site consensus splicing signal.
Steve Mount
Cell Biology and Molecular Genetics
Harry Hochheiser and Ben Shneiderman
Human Computer Interaction Lab
Steven Salzberg
The Institute for Genomic Research
Splicing signals are recognized during early steps in the biochemical process of splicing.
[Diagram labels: Exon 1, U1, SF1, Branch Site, U2AF65, (Y)n, U2AF35, AG, Exon 2]
Consensus sequences:
• Yeast (Saccharomyces cerevisiae), invariant: TACTAAC
• Humans (Homo sapiens), consensus: TNYTRAYY
• Fruit flies (Drosophila melanogaster), invariant: WCTAATY
• Weeds (Arabidopsis thaliana), invariant: CTRAY
[Diagram: two-step pre-mRNA splicing mechanism with branched intermediate. Diagram courtesy of Dr. Martinez Hewlett.]
Y = C or T; W = A or T; R = A or G; N = A, C, G or T
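The ambiguity codes in the legend can be expanded mechanically into a pattern for scanning sequences. The following is a minimal sketch that covers only the four codes defined on the slide plus the plain bases; `consensus_to_regex`, `find_sites`, and the example sequence are hypothetical and are not part of TimeSearcher.

```python
import re
from typing import List

# Ambiguity codes as defined on the slide, plus the four plain bases.
CODES = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "Y": "[CT]", "W": "[AT]", "R": "[AG]", "N": "[ACGT]",
}


def consensus_to_regex(consensus: str) -> str:
    """Translate a consensus string such as 'CTRAY' into a regular expression."""
    try:
        return "".join(CODES[ch] for ch in consensus)
    except KeyError as err:
        raise ValueError(f"unsupported code: {err.args[0]}") from None


def find_sites(consensus: str, seq: str) -> List[int]:
    """Return 0-based start positions of all matches, including overlapping ones."""
    # A lookahead consumes no characters, so overlapping sites are all reported.
    pattern = re.compile(f"(?=({consensus_to_regex(consensus)}))")
    return [m.start() for m in pattern.finditer(seq)]


# Hypothetical example sequence (not data from the talk).
sites = find_sites("CTRAY", "GGCTAACTTCTGATAA")
print(sites)
```

Here CTAAC at position 2 and CTGAT at position 9 both satisfy CTRAY, since R accepts A or G and Y accepts C or T.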
Here we sought to verify and extend the experimentally determined branch site
consensus CTRAY reported by Simpson et al. (2002).
Our long-term goal is the characterization of an even weaker signal, the ‘exonic
splicing enhancer.’
Conclusions:
TimeSearcher can be used to identify weak signals in aligned nucleotide sequences.
Analysis of 8,550 exons from Arabidopsis supports the branch site consensus WYTRAY.
[Plot: number of over-represented words vs. distance to 3’ splice site; labels: one sigma, two sigma, Branch site, Pyrimidines]
ACTAA, ACTGA, ATAAC, ATTGA, CTAAA, CTAAC, CTAAT, CTCAT, CTGAC, TAACG, TAACT, TCTAA, TGACT, TGATT, TTAAC
WYTRAY
Y = C or T; W = A or T; R = A or G; N = A, C, G or T
Future Work: Extensions to query model
• Leaders and Laggards
– Identification of regulatory genes
• Multiple time-varying values
• Variable Time timeboxes
• Collaborations with biologists inform design
– What sort of queries are of interest?
Conclusions
• TimeSearcher: interactive tool for graphical exploration of time series data
• Ongoing use for analyzing microarray data and sequence data
We’re interested in working with motivated users & real data sets
www.cs.umd.edu/hcil/timesearcher