SAB2010_ENCODE.v6

ENCODE Data Coordination at UCSC
Kate Rosenbloom
ENCODE DCC Technical Project Manager
UCSC Genome Bioinformatics Group
September 2010 Genome Browser SAB Review
ENCODE in a nutshell: Y3Q3
• 2 species (human, mouse)
• 3 years of production phase (2 years left)
• 17 grants, 27 labs
• 20 experiment types
• 27 browser tracks
• 140 cell & tissue types
• 1559 datasets
Topics:
1. DCC role in ENCODE
2. Progress since last SAB review
3. Challenges
4. Browser impact
DCC role in ENCODE
• Define data formats and submission process
• Load, display, curate, review & release data
• Collect metadata and documentation
• Provide public website, viz, tools
• User outreach & support
• Support consortium communications (wiki)
• Support analysis group
Data submission website:
http://encodesubmit.ucsc.edu
2438 submissions as of September 2010
Lifecycle of a data submission
1. Lab uploads sub.tar.gz to submission website
   Status: uploaded
2. Pipeline validates and loads into database
   Status: loaded (or validate failed / load failed if data fails validation or loading)
3. Wrangler configures browser track
   Status: displayed (on test browser)
4. Lab approves track on test browser
   Status: approved, reviewing
5. Q/A reviews and releases
   Status: released (to public browser)
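The lifecycle above can be read as a small state machine. The following is a minimal sketch, not the DCC pipeline's actual code: the status names come from the slide, but the `Status` enum, the `TRANSITIONS` table and the `advance` helper are hypothetical names introduced here. Two points are assumptions: that a failed submission returns to "uploaded" when the lab resubmits, and that "approved" precedes "reviewing".

```python
from enum import Enum


class Status(Enum):
    UPLOADED = "uploaded"
    VALIDATE_FAILED = "validate failed"
    LOAD_FAILED = "load failed"
    LOADED = "loaded"
    DISPLAYED = "displayed"
    APPROVED = "approved"
    REVIEWING = "reviewing"
    RELEASED = "released"


# Allowed transitions, read off the lifecycle slide.
# Assumption: a failed submission goes back to UPLOADED on resubmission.
# Assumption: APPROVED comes before REVIEWING.
TRANSITIONS = {
    Status.UPLOADED: {Status.LOADED, Status.VALIDATE_FAILED, Status.LOAD_FAILED},
    Status.VALIDATE_FAILED: {Status.UPLOADED},
    Status.LOAD_FAILED: {Status.UPLOADED},
    Status.LOADED: {Status.DISPLAYED},
    Status.DISPLAYED: {Status.APPROVED},
    Status.APPROVED: {Status.REVIEWING},
    Status.REVIEWING: {Status.RELEASED},
    Status.RELEASED: set(),
}


def advance(current, new):
    """Return the new status, or raise if the step skips part of the lifecycle."""
    if new not in TRANSITIONS[current]:
        raise ValueError(f"illegal transition: {current.value} -> {new.value}")
    return new
```

With this table, a submission cannot jump from "uploaded" straight to "released": `advance(Status.UPLOADED, Status.RELEASED)` raises `ValueError`.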
ENCODE track quality review
• Data format checks
• Description and metadata complete & correct
• Configurability
• Display at different zoom levels and visibilities
• Performance
• Does the data make biological sense?
• Usability
Released Data:
27 tracks in hg18, representing 860
experiments as of September 2010
Progress since last SAB review (Feb ‘09)
Features planned for Year 2:
• High-resolution wiggle (bigWig) DONE
• RNAseq display enhancements DONE
• NCBI accessioning of seq data IN PROGRESS
• Track search tool IN REVIEW
Plus:
• Integrated regulatory track
• Hg19/GRCh37 migration
• BAM support (spec, validation, display, c-tracks)
• Mid-course review, 4 data freezes, 2 analysis workshops, DCC site review
• Mouse ENCODE
ENCODE data at NCBI GEO:
Caltech RNA-seq
Mouse ENCODE experiment matrix
more cell types
more factors
4 grants funded by the ARRA, 3 are now submitting data
Initial tracks of mouse data (test browser)
Finding data in the browser: Simple free-text search
Simple search looks at:
• Track names and labeling
• Track description
• Metadata terms (specifically ENCODE controlled vocabulary)
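A minimal sketch of what a free-text search over those three sources might look like. This is an illustration only, not the browser's implementation: the `Track` record and the `simple_search` function are hypothetical, and case-insensitive substring matching is an assumption.

```python
from dataclasses import dataclass, field


@dataclass
class Track:
    # Hypothetical record holding the three text sources the slide lists.
    name: str
    label: str
    description: str
    metadata: dict = field(default_factory=dict)  # controlled-vocabulary terms


def simple_search(tracks, query):
    """Return tracks whose name, label, description or metadata contain the query."""
    q = query.lower()
    hits = []
    for t in tracks:
        haystack = [t.name, t.label, t.description, *t.metadata.values()]
        # Assumption: a case-insensitive substring match counts as a hit.
        if any(q in text.lower() for text in haystack):
            hits.append(t)
    return hits
```

A call such as `simple_search(tracks, "h3k4me3")` would return every track that mentions the term in any of those fields, including a track where it appears only as a metadata value.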
Finding data in the browser: By metadata terms
Advanced search allows selection
by defined metadata terms.
(Currently only for ENCODE tracks)
This search finds
histone modification
H3K4me3 as seen
in H1-hESC cells.
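The metadata-term search can be sketched as an exact-match filter over key/value pairs. The field names (`dataType`, `antibody`, `cell`), the track names and the `metadata_search` helper are hypothetical, chosen only to mirror the example on the slide; the real controlled vocabulary may be organized differently.

```python
def metadata_search(tracks, **terms):
    """Return tracks whose metadata matches every requested term exactly."""
    return [
        t for t in tracks
        if all(t.get(key) == value for key, value in terms.items())
    ]


# Made-up sample records, for illustration only.
tracks = [
    {"track": "trackA", "dataType": "histone modification",
     "antibody": "H3K4me3", "cell": "H1-hESC"},
    {"track": "trackB", "dataType": "histone modification",
     "antibody": "H3K4me3", "cell": "K562"},
    {"track": "trackC", "dataType": "histone modification",
     "antibody": "H3K27me3", "cell": "H1-hESC"},
]

hits = metadata_search(tracks, antibody="H3K4me3", cell="H1-hESC")
print([t["track"] for t in hits])  # prints ['trackA']
```

Unlike free-text matching, every term must match, so adding terms can only narrow the result set.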
Results from track search
The result of the search on the previous slide is a single
track of histone modification H3K4me3 as seen in H1-hESC
cells. Clicking ‘View in Browser’ will display this data.
Challenges
• Number of labs, difficulty of some
• Metadata expansion, special handling beyond
normal browser data
• Multiple customers: NHGRI, analysis group,
labs, user community
• Production vs. research
• Mission expansion: GEO/SRA, standards,
ARRA, year 5
• Reporting overhead
• Engineering staff -> hire ‘wranglers’
• Funding delays
DCC site visit recommendations
Blue items are DCC-specific
1. Data accessibility
Track search, Feature supertracks, Tutorial
2. Data usability
3. Data quality
Post standards on website, Flag non-conforming data
4. Long-term repository
Deposit data to GEO
5. Metadata user review
6. Use cases
Session gallery on website
7. Reproducibility in publications
8. Web site
Data snapshot on website, Improve labeling
9. Analysis data sets
Integrated regulatory track, Imports from AWG
10. Metrics for success
Impact on browser
• Expanded data – mostly useful, some
not so much
• Pushes development of viz, tools,
formats for large datasets
• Competes for staff and mgmt resources
People at the DCC
PI: Jim Kent
• Technical project manager: Kate Rosenbloom
• Engineering / Wrangling: Tim Dreszer, Venkat Malladi,
Brian Raney, Cricket Sloan, Melissa Cline
• Outreach, usability: Melissa, OpenHelix (contractor)
• Submissions website: Galt Barber
• GEO tools: Krishna Roskin
• Quality assurance: Katrina Learned, Vanessa Swing
• Browser management: Donna Karolchik, Bob Kuhn, Ann Zweig
Reporting:
Monthly
Quarterly
Annual
Additional slides
Plans
• ENCODE tutorial
• Portal upgrade
• Complete GEO submissions
• Analysis tracks
• ARRA grants (proteogenomics, epitope-tag)
• Release Mouse data
Browser features developed for ENCODE
• High resolution wiggle (bigWig)
• HTS formats (BAM and bigBed)
• BIG custom tracks
• View-based tracks
• Data selection matrix
• Metadata links
• Coming soon: Track Search
GEO Submission Pipeline
ENCODE Portal http://encodeproject.org
ENCODE Outreach 2009-2010
• Publication: NAR 2010 Database issue (2011 update in press)
• Presentations: CSHL Statistical Analysis course June 2010,
Stanford Computational Systems Bioinformatics, Aug 2010
• Posters: CSHL Biology of Genomes May 2009, CSB 2010
OpenHelix ENCODE tutorial
Integrated regulatory tracks
UCSC-developed integrative ENCODE track – shows enrichment of histone modifications
suggestive of enhancer and promoter activity, DNase clusters indicating open chromatin,
regions of transcription factor binding, and transcription levels, derived from ENCODE data
collected in multiple cell lines.
Key items from site visit recommendations
• Make a track search tool to make it easy to find all
data on one cell line or one transcription factor
• Organize data by biochemical entities rather than
by lab.
• Put effort into high level documentation on website
– “Sessions gallery” to show use cases
– Page that gives an overview of what data is available in
ENCODE including cells, antibodies, and assays.
– Put up data summaries, figures, and presentations
generated by the AWG onto site
Some user comments from DCC survey
• Linking annotation across all cell types and linking all
annotations across one cell type would be quite nice. As it is
now, it takes a fair bit of manual manipulation to do this.
• Great job, awesome resource. Thanks to all!
• Need more cell types and conditions. Do data from non-ENCODE consortium groups get incorporated?
• Amazingly there isn't a useful Encode summary, let alone a
detailed description of the project and results. There's a nice
do-loop with links between UCSC & NHGRI that don't lead
anywhere. Is there a publication or link that I'm missing
somewhere that informs, educates & is a users' guide? Great
project, just difficult to sift through in it's current form.
• Encode only covers 1% of the human genome. Not sufficient
coverage