DNA Subway Green Line
Onramp to HPC in Biology Education
Dave Micklos and Uwe Hilgert
iPlant Collaborative
DNA Learning Center, Cold Spring Harbor Laboratory; Bio5 Institute, University of Arizona

…ride an educational Discovery Environment

Green Line: RNA Sequence (RNA-Seq) Analysis
• First fully graphical user interface (GUI) for RNA-Seq analysis: no command line or data conversions
• Accesses XSEDE system through the iPlant Agave API
• Co-localizes up to 100 GB of data in iPlant Data Store
• Look for differential gene expression in different tissues, life stages, or treatments
• Generate lists of expressed genes and fold-changes
• Annotate sequenced genomes; add results to Red Line projects

RNA code represents “active” DNA in genome
150 feet

Homo sapiens bitter taste receptor (TAS2R38)
DNA code > RNA code
CCTTTCTGCACTGGGTGGCAACCAGGTCTTTAGATTAGCCAACTAGAGAAGAGAAGTAGAATAGCC
AATTAGAGAAGTGACATCATGTTGACTCTAACTCGCATCCGCACTGTGTCCTATGAAGTCAGGAGT
ACATTTCTGTTCATTTCAGTCCTGGAGTTTGCAGTGGGGTTTCTGACCAATGCCTTCGTTTTCTTG
GTGAATTTTTGGGATGTAGTGAAGAGGCAGGCACTGAGCAACAGTGATTGTGTGCTGCTGTGTCTC
AGCATCAGCCGGCTTTTCCTGCATGGACTGCTGTTCCTGAGTGCTATCCAGCTTACCCACTTCCAG
AAGTTGAGTGAACCACTGAACCACAGCTACCAAGCCATCATCATGCTATGGATGATTGCAAACCAA
GCCAACCTCTGGCTTGCTGCCTGCCTCAGCCTGCTTTACTGCTCCAAGCTCATCCGTTTCTCTCAC
ACCTTCCTGATCTGCTTGGCAAGCTGGGTCTCCAGGAAGATCTCCCAGATGCTCCTGGGTATTATT
CTTTGCTCCTGCATCTGCACTGTCCTCTGTGTTTGGTGCTTTTTTAGCAGACCTCACTTCACAGTC
ACAACTGTGCTATTCATGAATAACAATACAAGGCTCAACTGGCAGATTAAAGATCTCAATTTATTT
TATTCCTTTCTCTTCTGCTATCTGTGGTCTGTGCCTCCTTTCCTATTGTTTCTGGTTTCTTCTGGG
ATGCTGACTGTCTCCCTGGGAAGGCACATGAGGACAATGAAGGTCTATACCAGAAACTCTCGTGAC
CCCAGCCTGGAGGCCCACATTAAAGCCCTCAAGTCTCTTGTCTCCTTTTTCTGCTTCTTTGTGATA
TCATCCTGTGCTGCCTTCATCTCTGTGCCCCTACTGATTCTGTGGCGCGACAAAATAGGGGTGATG
GTTTGTGTTGGGATAATGGCAGCTTGTCCCTCTGGGCATGCAGCCATCCTGATCTCAGGCAATGCC
AAGTTGAGGAGAGCTGTGATGACCATTCTGCTCTGGGCTCAGAGCAGCCTGAAGGTAAGAGCCGAC
CACAAGGCAGATTCCCGGACACTGTGCTGAGAATGGACATGAAATGAGCTCTTCATTAATACGCCT
GTGAGTCTTCATAAATATGCC

Differential Gene Expression
RNA Sequence (RNA-Seq) gives “snapshot” of genes active in different cells at different times

RNA Sequence (RNA-Seq) Analysis
• Design RNA-Seq experiment, i.e., differential expression
• Isolate total RNA; convert to DNA library
• Sequence experiment and control libraries
• Analyze sequence data on DNA Subway Green Line
• Follow-up experimental validation
Image source: http://www.bgisequence.com

1) Manage Data: Quality Assessment with FastQC; ~100 million 75/150 nucleotide reads in <1 hr
2) FastX Toolkit: Quality Control with FastX Toolkit; ~100 million 75/150 nucleotide reads in <1 hr (some took up to 19 hours…)
3) TopHat: Aligns ~100 million 75/150 nucleotide (paired end) reads to a reference genome of 100M–5B in 6–19 hr
   TopHat Alignment in JBrowse
4) CuffLinks: Assembles transcripts and calculates abundance on BAM files, 1–12 GB in 6–19 hr
5) CuffDiff: Merges assemblies from Cufflinks and performs differential expression analysis on 4–9 samples in 6–19 hr

Green Line
Queue time vs Run time
• Asking for a long run time leads to longer queue times
• Asking for a short run time may lead to the job being terminated
• Users don't like to wait too long
• Users want the results right away
• Finding the right balance is not easy

Green Line
Dealing with the unexpected
• Systems taken offline
• Maintenance
• Network outages, data transfer issues
• Science API glitches
• Authentication

Green Line
“Monitoring XSEDE”

DNA Subway “Power Desktop”
• Intuitive interface to support seamless genome “round trip” for eukaryote of choice
• Access high performance computing to analyze whole genome data (RNA-Seq, initially)
• Scaffold data to sequenced genomes available in iPlant Data Store
• Directly upload RNA-Seq reads as biological evidence for genome annotation using Red Line

NSF CCLI Project Retreat
June 8–20, 2014, CSHL
• 11 faculty from PUIs
• Program included lectures/practical sessions
  • Wet lab: RNA library prep
  • Green Line analysis & bioinformatics
  • Pedagogy/teaching resources
  • Virtual training materials

NSF CCLI Project Retreat
Faculty Participants
• Agnes Ayme-Southgate, College of Charleston, SC
• Judy Brusslan, California State University, Long Beach, CA
• Raymond Enke, James Madison University, VA
• Shaye Lewis, Prairie View A&M University, TX
• Irina Makarevitch, Hamline University, MN
• Judith Ogilvie, Saint Louis University, MO
• Jeremy Seto, New York City College of Technology, CUNY, NY
• Carrie Thurber, Abraham Baldwin Agricultural College, GA
• George Ude, Bowie State University, MD
• Deirdre Vaden, Prairie View A&M University, TX
• Scott Woody, University of Wisconsin, WI

• Flight muscle development during life-stage transitions in Apis mellifera (honeybee)
• Leaf development and senescence in Arabidopsis thaliana
• Retina development in Gallus gallus
• Testes development from juvenile to puberty in caprine (goat)
• Response to cold stress in maize
• Retinal changes of mice with retinitis pigmentosa
• Differentiation of rat pheochromocytoma line cells (PC12) to a neuronal-like phenotype
• Seed abscission in Sorghum bicolor
• Floral inflorescence genes in banana/plantains
• Peripheral blood mononuclear cells from hypertensive rats treated with captopril
• Gibberellic acid exposure in Brassica rapa (Fast Plants) gibberellic acid (gad) mutants

NSF CCLI Project Retreat
Flight muscle development during life-stage transitions in Apis mellifera (honeybee)
Agnes Ayme-Southgate, College of Charleston, SC
All honeybees begin as worker bees, flying short distances. Some honeybees transition into foragers, flying long distances. This transition necessitates major changes in flight muscles.
Goal is to identify the gene expression changes in flight muscles during this transition.
Courses
• Biol 322: Developmental Biology, 30–38 students
• Genetics, 100 students
• Undergraduate research in lab, 2–3 students

NSF CCLI Project Retreat
Differential gene expression in Capra hircus (goat) testes during juvenile development
Shaye Lewis, Prairie View A&M University, TX
Fertility phenotypes show low heritability, and semen analysis parameters cannot determine fertility status. Molecular biomarkers can increase the efficiency of artificial insemination and embryo transfer in goats.
Goal is to identify genes important for normal testes development and function.
Courses
• 4533: Animal Breeding & Genetics, 20 students
• Undergraduate research in lab, 4 students

NSF CCLI Project Retreat
Understanding transcriptional response to cold stress in maize
Irina Makarevitch, Hamline University, MN
Maize is grown worldwide and is a staple for >1 billion people. Maize is thermophilic and sensitive to low temperatures, and understanding how plants respond to cold can improve yields.
Goal is to identify genes that are differentially expressed when maize is grown under cold stress.
Courses
• Biol 201: Principles of Genetics, 80 students
• Biol 301: Genomics & Bioinformatics, 20 students
• Undergraduate research in lab, 4 students

NSF CCLI Project Retreat
RNA-Seq Datasets Generated and Analyzed Using the Green Line of DNA Subway
• 8 eukaryotic organisms
• 21 controls paired with 26 experimental conditions
• 402 Gbases sequenced
• 837 jobs submitted to TACC
• 87% of jobs completed
• 695 hours total CPU time
• 16 threads/processors running concurrently

Intended Implementation 2014-15
Course levels: 100 level, 200 level, 300 level, 400 level, 500 level
• Intro Genetics, 270
• Genetics, 220
• Molecular & Cell Biology, 50
• Molecular Biology, 100
• Molecular Applications in Crop Improvement, 15
• Biology
• Cell & Molecular Biology, 75
• Genomics, 40
• Genomics & Bioinformatics, 70
• Animal Breeding & Genetics, 20
• Developmental Biology, 35
• Independent Research, 5
• Undergrad Research
• Cell Structure & Function, 30
• Synthetic Biology, 30
• Anatomy/Physiology, 50
• Advanced Genetic Techniques, 15
Other enrollment figures on slide: 20, 15
Totals: 100s, 320, 550, 140

DNA Subway is…
Producers: Uwe Hilgert, David Micklos, Jason Williams
Designers: Eun-Sook Jeong, Susan Lauter
Programmers: Cornel Ghiban, Mohammed Khalfan, Sheldon McKay
Contributors: Matt Vaughn, Rion Dooley, Anthony Biondo, Jim Burnette, Scott Cain, Ed Lee, Zhenyuan Lu
Advisors: Matt Conte, Carson Holt, Bruce Nash, Oscar Pineda-Catalan

HPC in Undergraduate Biology Education
Banbury Center, CSHL, September 3-5, 2014
Contact Dave Micklos (micklos@cshl.edu)
A Great Gatsby era estate on Long Island’s “Gold Coast”
Funded by NSF and the Alfred P. Sloan Foundation
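
Supplementary sketch: fold-changes
The Green Line produces lists of expressed genes and fold-changes. As a rough illustration of what a fold-change means, the sketch below computes a log2 fold-change between a control and an experimental expression value and ranks genes by its magnitude. It is not how CuffDiff itself computes or tests differential expression. The gene names, the expression values, the function name log2_fold_change, and the pseudocount of 1.0 are all assumptions made for this example.

```python
import math


def log2_fold_change(control, treatment, pseudocount=1.0):
    """Return log2((treatment + pseudocount) / (control + pseudocount)).

    The pseudocount is an illustrative assumption that avoids division by
    zero for unexpressed genes; it is not taken from CuffDiff.
    """
    return math.log2((treatment + pseudocount) / (control + pseudocount))


# Hypothetical expression values as (control, treatment); gene names are made up.
expression = {
    "geneA": (10.0, 40.0),
    "geneB": (25.0, 25.0),
    "geneC": (64.0, 7.0),
}

# Positive values mean higher expression in the treatment, negative values lower.
fold_changes = {
    gene: log2_fold_change(control, treatment)
    for gene, (control, treatment) in expression.items()
}

# Rank genes from the largest to the smallest absolute change.
ranked = sorted(fold_changes, key=lambda gene: abs(fold_changes[gene]), reverse=True)

for gene in ranked:
    print(f"{gene}\t{fold_changes[gene]:+.2f}")
```

A log2 fold-change of +1 corresponds to a doubling and -1 to a halving, which is why the log scale is a common way to present such lists.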