Exploring and Exploiting the Biological Maze
Zoé Lacroix
Arizona State University

Data collection queries
Scientific protocol
– Must be able to reproduce the process
Involve multiple resources
– Data sources
– Applications

Expressing scientific protocols
Scientific protocols mix design and implementation
Design
– What the protocol does (tasks)
– Scientific objects involved
Implementation
– How the protocol is executed
– Data sources and applications

Expressing scientific protocols
Scientific protocols are driven by their implementation
– Scientists use the resources they know
  • data (quality)
  • access to data
  • format, limits, etc.
– Scientists may not exploit better resources because they do not know them
Queries should be driven by the design; the implementation should meet the design needs

Example* - Pipeline for Analysis of Protein Variation Due to Alternative Splicing and SNPs
The alternative splicing pipeline will provide a complete characterization of variations in proteins due to splice variation or SNPs evident in repositories of contiguous genome sequence data and expressed sequence tags (ESTs). The pipeline applies secondary structure, tertiary structure, domain motif detection and sequence comparison tools to proteins encoded by genes with alternatively spliced forms or SNPs.
*Courtesy of Dr. Marta Janer, Institute for Systems Biology

Step 2 - Pipeline for Analysis of Protein Variation Due to Alternative Splicing and SNPs
From GenBank, dbEST and the Riken Clone Collection, collect all EST and full-length cDNA sequences from the target organisms of interest (in this case, human and mouse) that match the query proteins (mouse DNA binding proteins) using tblastn. Map the query protein to the target DNA sequences, keeping track of which query amino acids correspond to which nucleotides.
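The bookkeeping asked for in Step 2 (which query amino acids correspond to which nucleotides) can be sketched as follows. This is a minimal illustration, not the pipeline's actual code: it assumes a single ungapped alignment segment with 1-based coordinates, and the function name `map_query_to_target` and the example coordinates are hypothetical. A real tblastn hit with gaps or frameshifts would need to be split into ungapped segments first.

```python
from typing import Dict, Tuple


def map_query_to_target(query_start: int, target_start: int,
                        aligned_length: int, strand: int = 1
                        ) -> Dict[int, Tuple[int, int, int]]:
    """Map each aligned query amino acid to the three target nucleotides encoding it.

    Assumptions (not from the source): positions are 1-based, the segment is
    ungapped, and `target_start` is the first nucleotide of the first aligned
    codon in the direction of translation. `strand` is +1 for the forward
    strand and -1 for the reverse strand.
    """
    if strand not in (1, -1):
        raise ValueError("strand must be +1 or -1")
    mapping: Dict[int, Tuple[int, int, int]] = {}
    for i in range(aligned_length):
        # Each amino acid consumes one codon, i.e. three nucleotides.
        first = target_start + strand * 3 * i
        mapping[query_start + i] = (first, first + strand, first + 2 * strand)
    return mapping


if __name__ == "__main__":
    # Hypothetical hit: query residues 10-12 aligned to target nucleotides 101-109.
    for residue, codon in map_query_to_target(10, 101, 3).items():
        print(residue, codon)
```

Keeping the mapping as a dictionary keyed by query residue makes it straightforward to look up, for any SNP position on the target, which amino acid of the query protein it affects.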
Step 2 by category
Data sources
– GenBank, dbEST, Riken Clone Collection
Tools
– tblastn
Tasks
– Collect all EST and full-length cDNA sequences that match the query proteins
– Map the query protein to the target DNA sequences
Scientific objects
– EST and full-length cDNA sequences, query proteins, amino acids, nucleotides

Pipeline Selecting Target Proteins*
[Figure: pipeline diagram with the resources SwissProt, SMART, BIND, DIP, sigpep, blast x D.mel, and CEY2H]
Step 1 = Retrieve all proteins from SMART and Swiss-Prot by text search on the keyword “apoptosis”
Step 2 = Retrieve all proteins from Swiss-Prot with a signal peptide feature and the keyword “apoptosis”
Step 3 = Retrieve their binding partners from DIP, BIND and the C. elegans dataset
Step 4 = Run each sequence through a signal peptide prediction program such as SigPep to check for the presence of signal peptides
Step 5 = Run a BLAST homology search of the retrieved sequences against proteins predicted from the Drosophila melanogaster genome, which might yield additional candidates
Output = Final set of signal peptide proteins involved in apoptosis
*Courtesy of Dr. Terry Gaasterland, The Rockefeller University

Design and implementation
Step | Task | Implementation
Input | Relevant keyword for which the proteins are required |
Step 1 | All proteins with the keyword and a signal peptide feature must be retrieved | SMART, Swiss-Prot
Step 2 | Binding partners of all of these proteins are retrieved | DIP, BIND
Step 3 | The integrated set is run through a signal peptide prediction program | SigPep
Step 4 | Homology search of the retrieved sequences against proteins predicted from the specific genome yields additional candidates | BLAST

Expressing scientific pipelines with BioNavigation
Queries are expressed at a conceptual level (design)
[Figure: scientific classes at the conceptual level: Protein Seq., DNA Seq., Citation, Disease, Gene]

Conceptual graph
Labeled edges
– Scientifically meaningful edges
[Figure: conceptual graph over the classes Nucleotide Sequence, RNA, mRNA, DNA, Gene and Protein, with edges labeled isA, isTranscribedFrom, transcribesTo, translatesTo and isTranslatedFrom]

Conceptual graph
[Figure: the same conceptual graph with additional IsRelatedTo edges]

Mapping to physical resources
Protein Seq., DNA Seq.,
Citation, Disease, Gene (scientific classes, conceptual level)
PubMed, GenBank, HUGO, NCBI Protein, OMIM (data sources, physical level)

Exploring biological metadata
“Return all citations that are related to some disease or condition”
Diabetes: 11    Aging: 71    Cancer: 391
Paths to PUBMED: P1 (NUCLEOTIDE), P2 (OMIM), P3 (PROTEIN)
• Link: Entrez provides an index, reachable through Links in the display option of each entry
• Parse: each entry is parsed to retrieve its related entries
• All: Entrez provides an index, reachable through Links in the display option, that allows a set of entries to be handled at a time

Selecting biological resources
3 resources that look the same
– Are they the same?
3 paths that will retrieve PubMed entries related to citations
– Do they have the same semantics?

Results for the disease conditions diabetes, aging and cancer
Path | Diabetes: Link, Parse, All | Aging: Link, Parse, All | Cancer: Link, Parse, All
P1   | 43,890  43,747  44,037     | 48,393  48,398  48,393  | 56,315  56,315  56,532
P2   | 42,969  43,090  43,581     | 51,712  51,855  51,474  | 54,487  54,607  52,488
P3   | 59,959  51,906  49,719     | 60,129  61,260  60,938  | 62,686  63,367  60,033

Overlap results for the disease condition diabetes
Path | Link: P1, P2, P3        | Parse: P1, P2, P3       | All: P1, P2, P3
P1   | 100%    25.28%  29.98%  | 100%    29.18%  33.60%  | 100%    24.64%  27.42%
P2   | 25.82%  100%    97.68%  | 23.93%  100%    97.81%  | 24.75%  100%    90.68%
P3   | 21.95%  70.00%  100%    | 22.87%  81.20%  100%    | 24.29%  79.49%  100%

Evaluating resources
Similar applications
– Different outputs
Similar data sources
– Different outputs
Number of resources
– Different outputs
Order of resources
– Different outputs

Exploiting semantics of resources
Number of entries
Characterization of entries (number of attributes)
Time

Exploiting the semantics of links
BioNavigation (joint work with Louiqa Raschid and Maria-Esther Vidal)
Conceptual graph
– No labeled links
Queries
– Regular expressions of concepts
ESearch
– Path cardinality: number of instances of paths in the result. For a path of length 1 between two sources S1 and S2, it is the number of pairs (e1, e2) where entry e1 of S1 is linked to entry e2 of S2.
– Target object cardinality: number of distinct objects retrieved from the final data source.
– Evaluation cost: cost of the evaluation plan, which involves both the local processing cost and remote network access delays.

Work in progress
Conceptual graph
– Labeled links
Queries
– Complex dataflows
Physical graph
– Access to a BioMetaDatabase
– Data sources
– Applications

Representing the conceptual graph in Protégé

Visualization
Limitations in Protégé
Using the GraphViz plugin
– Shows only IsA hierarchy
TgiViz plugin

Conclusion
Scientists need support to select resources to express their protocols
Semantics of resources may be exploited to enhance the data collection process
Need for a repository of biological metadata (BioMetaDatabase)
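Illustration: concept queries and ESearch metrics
The idea of querying with regular expressions of concepts, and two of the ESearch metrics defined earlier (path cardinality and target object cardinality), can be illustrated with a small sketch. Everything concrete here is an assumption made for the example: the assignment of sources to classes in `SOURCE_CLASS`, the toy entry links in `LINKS`, and the function names. For paths longer than 1, path cardinality is taken to be the number of chains of linked entries, which extends the length-1 definition above. Evaluation cost is not modeled. This is not BioNavigation's implementation.

```python
import re
from typing import Dict, List, Set, Tuple

# Hypothetical assignment of data sources to scientific classes.
SOURCE_CLASS: Dict[str, str] = {
    "OMIM": "Disease", "NUCLEOTIDE": "DNASeq",
    "PROTEIN": "ProteinSeq", "PUBMED": "Citation",
}

# Toy entry-level links: (source, target) -> set of (entry, entry) pairs.
LINKS: Dict[Tuple[str, str], Set[Tuple[str, str]]] = {
    ("OMIM", "PUBMED"): {("d1", "c1"), ("d1", "c2")},
    ("OMIM", "NUCLEOTIDE"): {("d1", "n1"), ("d1", "n2")},
    ("NUCLEOTIDE", "PUBMED"): {("n1", "c1"), ("n2", "c1"), ("n2", "c3")},
    ("OMIM", "PROTEIN"): {("d1", "p1")},
    ("PROTEIN", "PUBMED"): {("p1", "c2"), ("p1", "c4"), ("p1", "c5")},
}


def source_paths(pattern: str, max_len: int = 3) -> List[List[str]]:
    """Enumerate acyclic source paths whose class sequence matches `pattern`."""
    regex = re.compile(pattern)
    found: List[List[str]] = []

    def extend(path: List[str]) -> None:
        classes = " ".join(SOURCE_CLASS[s] for s in path)
        if regex.fullmatch(classes):
            found.append(path)
        if len(path) < max_len:
            for src, dst in sorted(LINKS):
                if src == path[-1] and dst not in path:
                    extend(path + [dst])

    for start in sorted(SOURCE_CLASS):
        extend([start])
    return found


def evaluate(path: List[str]) -> Tuple[int, int]:
    """Return (path cardinality, target object cardinality) for a source path."""
    if len(path) < 2:
        raise ValueError("a path needs at least two sources")
    first_link = LINKS[(path[0], path[1])]
    instances = [(a,) for a in sorted({a for a, _ in first_link})]
    for src, dst in zip(path, path[1:]):
        # Extend every partial chain with each entry linked to its last element.
        instances = [inst + (b,) for inst in instances
                     for a, b in sorted(LINKS[(src, dst)]) if a == inst[-1]]
    return len(instances), len({inst[-1] for inst in instances})


if __name__ == "__main__":
    # Concept query: from Disease to Citation through any intermediate classes.
    for p in source_paths(r"Disease( \w+)* Citation"):
        print(" -> ".join(p), evaluate(p))
```

With this toy data, the single concept query expands into three physical paths whose cardinalities differ, which is the kind of information a scientist could use to choose among resources that look equivalent at the conceptual level.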