Lecture - Computational Bioscience Program

Bioinformatics Tools in Context
How Informatics Improves Research Every Day
Adelaide Fletcher, MLIS
Tzu L. Phang Ph.D.
July 27, 2012
True or False?
You have to be a collaborator on
someone’s clinical trial to make
discoveries with their genetic data...
• Stanford School of Medicine's Atul Butte identified a new drug target
for diabetes by downloading data from 130 gene-expression studies
in mice, rats, and humans that were done by other researchers and
running a meta-analysis to look for a common link
• Wet-lab experiments are more for validating hypotheses than making
discoveries
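A meta-analysis like the one described above pools evidence for the same gene across independent studies. The sketch below is a minimal illustration, assuming each study reports a p-value for the gene. It uses Fisher's method as a stand-in, which is not necessarily the method Butte's group used, and the p-values are made up.

```python
import math


def fisher_combined_p(p_values):
    """Combine independent p-values with Fisher's method.

    Under the null hypothesis, X = -2 * sum(ln p) follows a chi-square
    distribution with 2k degrees of freedom (k = number of studies).
    """
    if not p_values:
        raise ValueError("need at least one p-value")
    if any(not (0.0 < p <= 1.0) for p in p_values):
        raise ValueError("p-values must be in (0, 1]")
    k = len(p_values)
    half_x = -sum(math.log(p) for p in p_values)  # this is X / 2
    # Closed-form chi-square survival function for even degrees of freedom:
    # P(chi2_{2k} >= X) = exp(-X/2) * sum_{i<k} (X/2)^i / i!
    term, total = 1.0, 1.0
    for i in range(1, k):
        term *= half_x / i
        total += term
    return math.exp(-half_x) * total


# Hypothetical p-values for one gene in three independent studies.
study_p_values = [0.04, 0.10, 0.03]
print(f"combined p = {fisher_combined_p(study_p_values):.4f}")
```

Three individually modest results can combine into much stronger evidence, which is why re-using other researchers' public data can be productive.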
Meet Our Hero...
• Name: Hunter
• Research Interests: The role of mammary epithelial
cells in breast cancer
• Goal: Develop a genetic drug target for breast
cancer
• Post-grad experience: < 1 year
• Funding: $0
Where should he start?
• A. Ask for $$!
• B. Do a lit search
• C. Try to find free genetic data
Finding out what’s known
• Google Scholar
- http://scholar.google.com
• Web Of Science
- http://isiknowledge.com/WOS
– (http://hslezproxy.ucdenver.edu/login?url=http://isiknowledge.com/WOS)
Google Scholar
• http://scholar.google.com - search “mammary epithelial cells”
What’s this? Free data?
Follow the path of serendipity
“Data”
Now that we’ve found “Data”
what are we going to “Tzu”?
http://cctsi.ucdenver.edu/RIIC
GEO (Gene Expression Omnibus)
http://www.ncbi.nlm.nih.gov/geo/
As of July 19, 2012
Using GEO as an example
• Naming schemes:
GPL
GSM
GSE
GDS
GPL (Geo PLatform)
• Describes the list of elements on the array
– cDNAs, oligonucleotide probesets, ORFs,
antibodies
• Each platform is assigned a unique and
stable GEO accession number (GPLxxx)
• Example:
– GPL570: Affymetrix GeneChip Human
Genome U133 Plus 2.0 Array
GSM (Geo SaMple)
• Describes the conditions under which an
individual Sample was handled, the
manipulation it underwent, and the
abundance measurement of each element
derived from it!
• A Sample entity must reference only one
Platform and may be included in multiple
Series
• Example: GSM300166 (remember HW 2??!)
PostcentralGyrus_female_91yrs_indiv10
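The relationships described above (a Sample references exactly one Platform and may appear in several Series) can be sketched as a small data model. This is an illustration only: the class names are hypothetical, the second Series accession is made up, and linking GSM300166 to GPL570 and GSE11882 is an assumption made for the example.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Platform:
    accession: str  # e.g. "GPL570"
    title: str


@dataclass
class Sample:
    accession: str      # e.g. "GSM300166"
    title: str
    platform: Platform  # exactly one Platform per Sample


@dataclass
class Series:
    accession: str  # e.g. "GSE11882"
    samples: list = field(default_factory=list)

    def add(self, sample):
        self.samples.append(sample)


def series_containing(sample, all_series):
    """Return the accessions of every Series that includes the Sample."""
    return [s.accession for s in all_series if sample in s.samples]


platform = Platform("GPL570", "Affymetrix GeneChip Human Genome U133 Plus 2.0 Array")
sample = Sample("GSM300166", "PostcentralGyrus_female_91yrs_indiv10", platform)
series_a = Series("GSE11882")
series_b = Series("GSE_HYPOTHETICAL")  # made-up second Series
series_a.add(sample)
series_b.add(sample)
print(series_containing(sample, [series_a, series_b]))
```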
GSE (Geo SEries)
• Defines a set of related Samples
considered to be part of a group
• Provides a focal point and description of the
experiment as a whole
• Example:
Let’s look at an example
• Go to the GEO site
• Under “GEO accession”, type:
– GSE11882
• Find these terms:
– GPL
– GSM
– GSE
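Instead of typing the accession into the search box, the record can also be opened directly by URL. The sketch below only builds the link and does not contact the server. The query endpoint is the one GEO's accession display pages use; treat it as an assumption if the site has changed. The function name is hypothetical.

```python
from urllib.parse import urlencode

# Accession display endpoint on the GEO site (assumed stable).
GEO_ACCESSION_ENDPOINT = "https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi"


def geo_accession_url(accession):
    """Build a direct link to a GEO record such as GSE11882."""
    accession = accession.strip().upper()
    return f"{GEO_ACCESSION_ENDPOINT}?{urlencode({'acc': accession})}"


print(geo_accession_url("gse11882"))
```

The same function works for GPL, GSM, and GDS accessions, since all four record types share the accession display page.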
GDS (Geo DataSet)
• Curated sets of GEO Sample data
• Represents a collection of biologically and
statistically comparable GEO Samples
– Same platform
– Shared common set of probe elements
– Samples’ intensities calculated in an
equivalent manner (background correction,
normalization, etc)
• Example: GDS200 (see next page)
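The four accession prefixes can be told apart mechanically. A minimal sketch, assuming accessions are always a three-letter prefix followed by digits; the function and table names are hypothetical.

```python
import re

# Prefix -> entity type, as named on the slides.
GEO_ENTITY_TYPES = {
    "GPL": "Platform",
    "GSM": "Sample",
    "GSE": "Series",
    "GDS": "DataSet",
}

_ACCESSION_RE = re.compile(r"^(GPL|GSM|GSE|GDS)(\d+)$")


def classify_accession(accession):
    """Return the GEO entity type for an accession, or None if unrecognized."""
    match = _ACCESSION_RE.match(accession.strip().upper())
    if match is None:
        return None
    prefix, _number = match.groups()
    return GEO_ENTITY_TYPES[prefix]


for acc in ["GPL570", "GSM300166", "GSE11882", "GDS2789"]:
    print(acc, "->", classify_accession(acc))
```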
What can you do in GEO?
Clustering Analysis
Class Comparison Analysis
Gene Expression Profile
Let’s import the dataset
• GDS2789
What’s wrong with the approach?
• Only shows one gene at a time
• Hard to select a gene set for downstream
analysis such as clustering
• Hard to output a gene list.
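The limitations above go away once every gene is scored at once and the top scorers are written out as a list. The sketch below is a simplified class comparison, assuming a two-group design and a Welch t-statistic. The gene names, group labels, and expression values are all made up, and this is not how GEO or BRB-ArrayTools computes its results.

```python
import math
import statistics


def welch_t(group_a, group_b):
    """Welch's t-statistic for two groups of expression values."""
    mean_a, mean_b = statistics.mean(group_a), statistics.mean(group_b)
    var_a, var_b = statistics.variance(group_a), statistics.variance(group_b)
    se = math.sqrt(var_a / len(group_a) + var_b / len(group_b))
    return (mean_a - mean_b) / se if se > 0 else 0.0


def rank_genes(expression, labels, top_n):
    """Rank all genes by |t| between 'tumor' and 'normal' samples."""
    ranked = []
    for gene, values in expression.items():
        tumor = [v for v, lab in zip(values, labels) if lab == "tumor"]
        normal = [v for v, lab in zip(values, labels) if lab == "normal"]
        ranked.append((abs(welch_t(tumor, normal)), gene))
    ranked.sort(reverse=True)
    return [gene for _score, gene in ranked[:top_n]]


# Hypothetical log2 expression values: three tumor and three normal samples.
labels = ["tumor", "tumor", "tumor", "normal", "normal", "normal"]
expression = {
    "GENE_A": [8.1, 8.3, 8.2, 5.0, 5.2, 5.1],
    "GENE_B": [6.0, 6.1, 5.9, 6.0, 6.2, 5.8],
    "GENE_C": [7.0, 7.5, 6.5, 6.0, 6.6, 5.9],
}
gene_list = rank_genes(expression, labels, top_n=2)
print("\n".join(gene_list))  # one gene per line, ready to paste elsewhere
```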
BRB-ArrayTools
http://linus.nci.nih.gov/BRB-ArrayTools.html
• Free, open-source software
• Microsoft Excel plug-in
• Only works on Windows platform
• Subject to all of Excel's limitations
BRB-ArrayTools
• Biometric Research Branch (BRB)
– Statistical/biomathematical component
– Division of Cancer Treatment and Diagnosis (NCI)
• Richard Simon & BRB-ArrayTools Development Team
• BRB-ArrayTools
– Visualization and statistical analysis of DNA microarray gene
expression data
– Developed by statisticians
– Excel add-in
– Analytic/visualization tools: R statistical system, C and Fortran
programs, Java applications.
– Visual Basic for Applications integrates components
Objectives
• “provide scientists with software … without
requiring them to learn a programming
language”
• “encapsulate into software the experience
of professional statisticians”
• “facilitate education of scientists in
statistical methods for the analysis of DNA
microarray data”
Installing BRB-ArrayTools
• Windows 98/2000/NT/XP/Vista/7
• Loads package as add-in to Microsoft
Excel
– Excel 2000 or later
– Creates ArrayTools menu on Excel menu bar
• Intensive computations performed in R or
compiled programs
Installation
• Go to “http://linus.nci.nih.gov/BRB-ArrayTools.html”
• Click on “All required components in ONE file”
Installation
• Click on “Download Standard Version 3.7.1 (All in one file)”
• When prompted, enter User name and Password
(these will be sent to you after your FREE registration)
Demonstration
Installation
• Follow the step-by-step procedures
• In the interest of time, the software has already been
installed on your machine
Demonstration
Excel 2007: Security Setting
Now, a video demo ….
http://david.abcc.ncifcrf.gov/home.jsp
A quick recap...
List of 220 or so genes with potential indications for
treatment or further understanding of Breast
Cancer pathways
List of 6 or so genes with a shared biological
pathway (transcription factor activity)
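Going from a long gene list to a handful of genes sharing an annotation is an enrichment question: are there more annotated genes in the list than chance would predict? The sketch below uses a plain hypergeometric tail probability as a simplified stand-in. Enrichment tools typically use a variant of this test, so their reported values will differ. The list size (220) and hit count (6) echo the slide; the background size and the number of annotated genes are made-up assumptions.

```python
from math import comb


def hypergeom_enrichment_p(population, annotated, selected, hits):
    """P(at least `hits` annotated genes) when `selected` genes are drawn
    at random from `population` genes, of which `annotated` carry the term."""
    max_hits = min(annotated, selected)
    total = comb(population, selected)
    tail = sum(
        comb(annotated, k) * comb(population - annotated, selected - k)
        for k in range(hits, max_hits + 1)
    )
    return tail / total


# 220 genes in the list, 6 sharing the annotation (from the slide);
# 20000 background genes and 100 annotated genes are hypothetical.
p = hypergeom_enrichment_p(population=20000, annotated=100, selected=220, hits=6)
print(f"enrichment p = {p:.2e}")
```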
Do these genes have a cancer (CA)
connection?
• In NCBI GENE search:
“(TBX6 OR ZNF423 OR NR4A3 OR
SCAND2 OR CEBPE OR SIX2) AND
Cancer”
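The same search string can be assembled from a gene list and sent to NCBI programmatically through the E-utilities ESearch service. The sketch below only builds the query and the request URL and makes no network call. The helper names and the `retmax` default are assumptions for illustration.

```python
from urllib.parse import urlencode

# NCBI E-utilities ESearch endpoint.
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def gene_query(symbols, topic):
    """Build a boolean query such as '(A OR B) AND Cancer'."""
    return f"({' OR '.join(symbols)}) AND {topic}"


def esearch_url(term, db="gene", retmax=20):
    """Build an ESearch request URL for the given database and term."""
    return f"{ESEARCH}?{urlencode({'db': db, 'term': term, 'retmax': retmax})}"


symbols = ["TBX6", "ZNF423", "NR4A3", "SCAND2", "CEBPE", "SIX2"]
term = gene_query(symbols, "Cancer")
print(term)
print(esearch_url(term))
```

Building the query in code keeps the parentheses balanced when the gene list grows well beyond six symbols.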
NCBI Gene - a one-stop shop
All Roads Lead to GENE
• Summary
– Official Symbol,
Aliases
• Context, Regions,
Transcripts
• Related Article and
GeneRIFs
• Phenotypes
• General Info
– Homology,
Pathways, Ontology
• Reference Sequences
• Internal Links
– MapViewer
– OMIM
– BLAST
• External Links
– Ensembl
– UCSC
Browsing Genes and Genomes
• NCBI
• Ensembl
• UCSC Genome Browser
– Which one to use?
• http://cctsi.ucdenver.edu/RIIC/Pages/TranslationalInformaticsVideos.aspx#GenomeBrowsers
– A full day of Ensembl training:
http://hsl2.ucdenver.edu/ensembl/
BLASTing
• To what gene does this nucleotide sequence most likely
belong?
• gggtgaacag ccgcacggga gtaggtacgc acctgacctc gctggcactg
ccgggcaagg cagagggtgt ggcgtcgctc accagccagt gcagctacag
cagcaccatc gtccatgtgg gagacaagaa gccgcagccg gagttagaga
tggtggaaga tgctgcgagt gggccagaat
• http://blast.ncbi.nlm.nih.gov/Blast.cgi
• http://www.ensembl.org/Danio_rerio/blastview
• http://genome.ucsc.edu/cgi-bin/hgBlat?command=start
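Sequences copied from a record often arrive in space-separated blocks of ten, as above. A minimal sketch for cleaning such text into FASTA before pasting it into a search form; the function name, header text, and line width are arbitrary choices, and the excerpt is the first line of the sequence on this slide.

```python
def to_fasta(raw_sequence, header="query", width=60):
    """Strip whitespace, validate nucleotides, and wrap as FASTA text."""
    sequence = "".join(raw_sequence.split()).upper()
    invalid = set(sequence) - set("ACGTN")
    if invalid:
        raise ValueError(f"unexpected characters: {sorted(invalid)}")
    lines = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return f">{header}\n" + "\n".join(lines) + "\n"


raw = "gggtgaacag ccgcacggga gtaggtacgc acctgacctc gctggcactg"
print(to_fasta(raw, header="slide_sequence_excerpt"))
```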
BLASTing
• What about this one?
acatttgctt ctgacacaac tgtgttcact agcaacctca aacagacacc atggtgcacc
tgactcctga ggagaagtct gcggttactg ccctgtgggg caaggtgaac gtggatgaag
ttggtggtga ggccctgggc aggctgctgg tggtctaccc ttggacccag aggttctttg agtcctttgg
ggatctgtcc actcctgatg cagttatggg caaccctaag gtgaaggctc atggcaagaa
agtgctcggt gcctttagtg atggcctggc tcacctggac aacctcaagg gcacctttgc
cacactgagt gagctgcact gtgacaagct gcacgtggat cctgagaact tcaggctcct
gggcaacgtg ctggtctgtg tgctggccca tcactttggc aaagaattca ccccaccagt
gcaggctgcc tatcagaaag tggtggctgg tgtggctaat gccctggccc acaagtatca
ctaagctcgc tttcttgctg tccaatttct attaaaggtt cctttgttcc ctaagtccaa ctactaaact
gggggatatt atgaagggcc ttgagcatct ggattctgcc taataaaaaa catttatttt
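Before or alongside a BLAST search, translating a sequence from its first ATG gives a quick look at the protein it might encode. A minimal sketch using the standard genetic code; it assumes the first ATG is the true start codon, which is not guaranteed for an arbitrary mRNA, and the function name is hypothetical. The input is an excerpt of the sequence above.

```python
from itertools import product

# Standard genetic code, codons ordered T, C, A, G at each position.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate_from_first_atg(dna):
    """Translate from the first ATG until a stop codon or the sequence end."""
    seq = "".join(dna.split()).upper()
    start = seq.find("ATG")
    if start == -1:
        return ""
    protein = []
    for i in range(start, len(seq) - 2, 3):
        amino_acid = CODON_TABLE.get(seq[i:i + 3], "X")  # X = unknown codon
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


excerpt = "aacagacacc atggtgcacc tgactcctga ggagaagtct gcggttactg"
print(translate_from_first_atg(excerpt))
```

The resulting peptide can be used in a protein BLAST search as a cross-check on the nucleotide hit.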
Genetics in Literature
• What does this sequence:
ATTAAAGATGATTTTTACAGTCAATGAGCCACGTCAGGGAGCGATGGCACCCGCAGGCGGTATCAACTGAT
GCAAGTGTTCAAGCGAATCTCAACTCGTTTTTTCCGGTGACTCATTCCCGGCCCTGCTTGGCAGCGCTGCA
CCCTTTAACTTAAACCTCGGCCGGCCGCCCGCCGGGGGCACAGAGTGTGCGCCGGGCCGCGCGGCAATT
GGTCCCCGCGCCGACCTCCGCCCGCGAGCGCCGCCGCTTCCCTTCCCCGCCCCGCGTCCCTCCCCCTCG
GCCCCGCGCGTCGCCTGTCCTCCGAGCCAGTCGCTGACAGCCGCGGCGCCGCGAGCTTCTCCTCTCCTC
ACGACCGAGGCAGGTAAACGCCCGGGGTGGGAGGAACGCGGGCGGGGGCAGGGGAGCCGCGGGGGCC
GAGTGAGGACCCCGGGCCTCGGGTCCCAGGCGCAAGGGTGCCCGGCCGGGCGGGGTCGGGACCCCAG
TGAGGAGGGGCCGGGGGCTGCCCCGCGGGCGCGTGACGCGTCTCGGGCCTGCCCGGCTGCGCTGGTCT
CCGCTCGGGTGAGGCGGCTTGGCTTCGCTTTTCAGGTTAGGAAAGCTCCCTTTACTGCGCGTTGGGGGGC
TGGGGGAGCTGGCGGAGCCCCGTTAGGGAGGTCGGTGGCGCCGGGGTGTCTCAGCGCCCCCTGCACCC
CGCGCGGGTCCGGCCCAGCGGGCGATCGCTGGCGCCCAGGGAACTCCGGGAGGGCCGCCAGCGGGCT
CCGCAGGGCGCGGGGCGGGGAGGGGCGCCTGGGGGCCGCGGGGCTCGCGCTCCCCGCCCGTTGGCCG
CCCCTCGGAGGCCGAGATCGGGGCCCAGAACGCCCCTTGGCAAGGCCTGGCGCTTCCGCGATGCCCAGA
GGGTGCTTGGGGGGATGGAGAGAGGGGCGCCCGCCGGGGGAGTTCCGGGAGCCTCGGTGCCTCCCGCC
GCAGCTGCAGCGTTCCTCCCGGGAGGCGGCCCAGCCCTTCATCCTCGCCGCCTGAGCTTCTCCGAGGGG
GGCTGCAGCCTTGCGGCCGTTGCCACCGCCTGGAGAAGCGGCCCACGCGGACTGACGGGCGGGGGCGG
GGCCTCGGGCCTCGGCGGGGGCGGGGTCCGGGGAGGCCCCACCCTCTGTTCTCCAGGGGCGGGGAGA
GAGGAGCTGCAGGTCTGCGGCCTGGC
• Have to do with this book?
http://www.amazon.com/The-Family-That-Couldnt-Sleep/dp/1400062454
Oh yeah, him
Phylogenetics
• Scientific procedure to reconstruct the evolutionary
history of organisms or sequences
• Evolutionary theory: groups of similar organisms are
descended from a common ancestor.
• Cladistics:
– Developed by Willi Hennig, German entomologist
(1950)
– Phylogenetic systematics: a mathematical approach
– Method of taxonomic classification of organisms based
on their evolution
• So, why do we study phylogenetics?
What can phylogenetics tell you?
• Discovering the function of a gene
– Is your gene of interest orthologous to another
well-characterized gene from another species?
• Retracing the origin of a gene
– Most genes travel together through
evolutionary time.
– Determine if genes undergo genomic
modification such as mutation, deletion,
duplication, speciation, loss and gain of
function, inactivation, etc.
DNA: a good measurement
• Advantages over morphological taxonomic
characters:
– Character states are unambiguous
– A large number of characters can be used to
perform the analysis.
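Because each aligned position is an unambiguous character (A, C, G, or T), the difference between two sequences can be counted directly. The sketch below computes the proportion of differing sites (p-distance) and applies the Jukes-Cantor correction, d = -3/4 ln(1 - 4p/3). It assumes the sequences are already aligned to equal length; the example sequences are made up, and this is not a description of what ClustalW does internally.

```python
import math


def p_distance(seq_a, seq_b):
    """Proportion of differing sites between two aligned sequences.

    Positions where either sequence has a gap ('-') are skipped.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    compared = differences = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a == "-" or b == "-":
            continue
        compared += 1
        differences += a != b
    if compared == 0:
        raise ValueError("no comparable positions")
    return differences / compared


def jukes_cantor(p):
    """Jukes-Cantor corrected distance for an observed p-distance."""
    if p >= 0.75:
        return math.inf  # the correction is undefined at saturation
    return -0.75 * math.log(1 - 4 * p / 3)


# Made-up aligned sequences from three hypothetical species.
aligned = {
    "species_1": "ACGTACGTAC",
    "species_2": "ACGTACGTTC",
    "species_3": "ACCTAC-TTG",
}
names = list(aligned)
for i, first in enumerate(names):
    for second in names[i + 1:]:
        p = p_distance(aligned[first], aligned[second])
        print(first, second, round(p, 3), round(jukes_cantor(p), 3))
```

A matrix of such pairwise distances is the usual input to distance-based tree-building methods.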
Using ClustalW:
www.ebi.ac.uk/clustalw
Now, a video demo …
Find collaborators
• Colorado Profiles:
http://profiles.ucdenver.edu/Search.aspx
– Search: “mammary epithelial cells”
• Colorado Translational Informatics Community on
Facebook:
http://www.facebook.com/pages/Colorado-TranslationalInformatics-Community/136023206424789
Get Informatics Help
• http://cctsi.ucdenver.edu/RIIC
– 5 x 5 Videos
– Find informatics experts
– Monthly podcast
– SeDLAC (Secondary Database Library and
Analysis Center)
– Consultation and Data Analysis
Get $$
• NLM Professional Development Repository:
http://cnx.org/content/m37008/latest/
• CCTSI Funding:
http://cctsi.ucdenver.edu/Funding/Pages/default.aspx
• UC Denver Office of Grants and Contracts:
http://www.ucdenver.edu/academics/research/AboutUs/GrantsContractsOffice/Pages/default.aspx
Find a Journal to Publish Findings
• http://www.biosemantics.org/jane/ - Example search:
“cDNA microarrays and a clustering algorithm were used to identify patterns of gene expression in
human mammary epithelial cells growing in culture and in primary human breast tumors. Clusters of
coexpressed genes identified through manipulations of mammary epithelial cells in vitro also showed
consistent patterns of variation in expression among breast tumor samples. By using
immunohistochemistry with antibodies against proteins encoded by a particular gene in a cluster, the
identity of the cell type within the tumor specimen that contributed the observed gene expression pattern
could be determined. Clusters of genes with coherent expression patterns in cultured cells and in the
breast tumors samples could be related to specific features of biological variation among the samples.
Two such clusters were found to have patterns that correlated with variation in cell proliferation rates and
with activation of the IFN-regulated signal transduction pathway, respectively. Clusters of genes
expressed by stromal cells and lymphocytes in the breast tumors also were identified in this analysis.
These results support the feasibility and usefulness of this systematic approach to studying variation in
gene expression patterns in human cancers as a means to dissect and classify solid tumors.”
Thank You!
• Tzu Phang, Ph.D.
– Tzu.Phang@UCDenver.EDU
• Addie Fletcher, MLIS
– Adelaide.Fletcher@UCDenver.EDU