Scientific publications and archives: media, content and access
Lesk, Ch 3 (Lesk, 2008)

Scientific literature
• Scientific publications began as interpersonal, oral communication: lectures, seminars and discussions.
• Formal written articles and books make up the scientific literature.
• Today: journals, presentations at meetings, books, book chapters, Web material, films, radio and television programs, podcasts.
• Formal academic publications must pass the test of ‘peer review’ – quality control.
• Before the Internet, scientific literature appeared on paper (journals). Today, journals appear electronically as well as on paper (some readers rarely visit a library to read journals).
• Delocalized literature delivery and computational methods of information retrieval.

Economic factors governing access to scholarly publications
• Traditional economic model of scientific journals: a scientific organization or publisher produces and distributes, at regular intervals, a paper-bound ‘issue’ of articles.
  – Cost: editorial office; preparation of manuscripts; printing/distribution.
  – Support (income): sales (subscriptions), page charges to authors, donations, subsidies, advertisements, etc.
• Recent changes:
  – More papers are published, driving up costs.
  – Larger volume of publication puts libraries under financial pressure.
  – Electronic facilities reduce costs.
  – Electronic distribution extends the potential format of journal articles.
  – User community supports open access.

Open access / traditional and digital libraries
• Open access: redefinition of the author/publisher/reader relationship.
  – Retains the peer-review process.
  – Accepted articles are placed on the Web, with free access.
  – Authors retain copyright (instead of the publisher).
  – Costs of publication are transferred from readers to authors.
• Traditional libraries – you know what they are.
• Digital libraries.
  – Electronic form, on-line.
  – Raise economic questions.
  – Large-scale digital libraries by scanning?

The information explosion / Databases
• Efficient delivery can be a mixed blessing.
• Impossible for anyone to read all the literature in a given field.
• The Web adds a further dimension: literature is no longer linear, with new media, new ways of searching, bibliography management, and organizing and sharing the harvest.
• Databases: contents, ontology, logical structure, format of the data, routes for retrieval of data, links to other resources.
• Literature as a database: e.g. Medline (Medical Literature Analysis and Retrieval System Online), a bibliographic database, now part of PubMed.

Databases
• Database organization / design – e.g. design of a relational database of amino acids.
• Annotation: a typical entry in a molecular biology database might contain information beyond the primary data (e.g. beyond gene sequences).
  – Reference information (citations of publications).
  – Interpretative information.
  – Links to other information.
• Database quality control (errors?)
  – “Get it right the first time”: database curation and annotation – a new profession.
  – Identify errors – external curators / users.
  – Tracking database changes.

Databases
• Database access: an issue to consider.
• Links (utility of a database): internal links and external links.
• Database interoperability: questions that require appeal to multiple databases at once?
  – Merge several databases?
  – Methods for intercommunication between databases?
• Data mining.
  – Knowledge discovery: description/explanation.
  – Successful forecasting / predictive modeling.
  – Statistical techniques.
  – Artificial neural networks.
  – Support vector machines.

Programming languages and tools
• Traditional programming languages: FORTRAN, C, C++
• Scripting languages: Perl, Python, Ruby…
• Program libraries: standard libraries (numerical analysis and text processing) and libraries specialized for molecular biology (e.g. bioperl.org).
• Java – Java Virtual Machine – computing over the Web?
• Markup languages: implement data structures, e.g. XML.

Natural language processing
• Natural language: verbal-oral and/or textual forms of human-human communication.
• Natural language processing has been a goal of computing.
• Difficulty: ambiguity of words and phrases.
• Identifying keywords and combinations of keywords: e.g. names of genes and names of diseases.
• Knowledge extraction: protein-protein interactions (automatic text-mining software).
• Text mining:
  – Identification of references to genes and proteins.
  – Identification of interactions.
  – Interaction networks and diseases.
  – Hypothesis generation (unsuspected relationships between genes and diseases).

Archives and information retrieval
Lesk, Ch 4 (Lesk, 2008)

Database indexing and specification of search terms
• An index: a set of pointers to information in a database.
• Information retrieval programs accept multiple query terms and keywords.
• Possible to ask for logical combinations of indexing terms.
• Many database search engines allow complex logical expressions.
• Follow-up questions: modify query, cumulative searches, links between entries in different databases.
• Analysis and processing of retrieved data: using results retrieved in one search as input for another one (some information retrieval systems provide such facilities).

Nucleic acid sequence databases
• Archiving of bioinformatics data was originally carried out by individual research groups.
• As requirements grew, projects became very large-scale.
• Primary data collections related to biological macromolecules:
  – Nucleic acid sequences, including whole-genome projects.
  – Amino acid sequences of proteins.
  – Protein and nucleic acid structures.
  – Small-molecule crystal structures.
  – Protein functions.
  – Expression patterns of genes.
  – Networks: of metabolic pathways, of gene and protein interactions, and of control cascades.
  – Publications.
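
Looking back at the slide on database indexing: an index as "a set of pointers", logical combinations of terms, and follow-up searches can be illustrated with a small inverted index. This is only a minimal sketch in Python. The entries, their identifiers (E1 to E4) and the function names are hypothetical and do not come from any real database or retrieval system.

```python
from collections import defaultdict

# Hypothetical entries: identifier -> free-text description.
# These are made-up illustrations, not real database records.
ENTRIES = {
    "E1": "human insulin gene coding sequence",
    "E2": "mouse insulin receptor mRNA",
    "E3": "human haemoglobin beta chain gene",
    "E4": "yeast alcohol dehydrogenase gene",
}


def build_index(entries):
    """Map each lower-cased word to the set of entry identifiers containing it."""
    index = defaultdict(set)
    for entry_id, text in entries.items():
        for word in text.lower().split():
            index[word].add(entry_id)
    return index


def lookup(index, term):
    """Return a copy of the identifier set for one term (empty if the term is absent)."""
    return set(index.get(term.lower(), set()))


INDEX = build_index(ENTRIES)

# Logical combinations of indexing terms map onto set operations.
human_and_gene = lookup(INDEX, "human") & lookup(INDEX, "gene")        # AND
insulin_or_yeast = lookup(INDEX, "insulin") | lookup(INDEX, "yeast")   # OR
gene_not_human = lookup(INDEX, "gene") - lookup(INDEX, "human")        # AND NOT

# Follow-up search: narrow an earlier result set with one more term.
refined = human_and_gene & lookup(INDEX, "insulin")

print(sorted(human_and_gene))    # ['E1', 'E3']
print(sorted(insulin_or_yeast))  # ['E1', 'E2', 'E4']
print(sorted(gene_not_human))    # ['E4']
print(sorted(refined))           # ['E1']
```

Because each result is an ordinary set of identifiers, the output of one search can be fed straight into the next, which is the "cumulative search" idea from the slide.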

Nucleic acid sequence databases
• Triple partnership of the National Center for Biotechnology Information (USA); EMBL-Bank (European Bioinformatics Institute, UK); and the DNA Data Bank of Japan (National Institute of Genetics, Japan).
• Curate, archive and distribute DNA and RNA sequences.
• Entries have a life history:
  – Unannotated -> Preliminary -> Unreviewed -> Standard
• A sample entry includes properties of specific regions (e.g. coding sequences; regions that perform or affect function, interact with other molecules, affect replication, etc.).

Genome databases and genome browsers
• Genome browsers (full-genome sequences): databases bringing together all molecular information available about a particular species.
• E.g. ensembl.org: intended to be the universal information source for the human and other genomes.

Protein sequence databases
• In 2002, three protein sequence databases formed the UniProt consortium: the Protein Information Resource (PIR), USA; SWISS-PROT, Switzerland; and TrEMBL, Europe.
• The partners share the database but continue to offer separate information-retrieval tools for access.
• Databases associated with SWISS-PROT:
  – ENZYME DB and PROSITE
• PIR and associated databases:
  – PIRSF: protein family classification system.
  – iProClass: protein knowledge, access to over 90 biological databases.
  – iProLINK: gateway to protein literature.

Databases of protein families
• Evolutionary relationships / homology detection.
• Two full-length protein sequences (>=100 residues) that have >=25% identical residues in an optimal alignment are likely to be related.
• Need sequence alignment algorithms.
• Refer to a group of related proteins as a family.

Databases of structures
• Structure databases archive, annotate and distribute sets of atomic coordinates.
• Worldwide Protein Data Bank (wwPDB.org).
  – Joint effort of the Research Collaboratory for Structural Bioinformatics (RCSB) and the Protein Data Bank Japan.
  – Contains the structures of proteins.
  – It overlaps several other databases.
• Several websites offer hierarchical classification of all proteins of known structure – SCOP, CATH, DALI, CE

Other databases
• Classification and assignment of protein function.
  – The Enzyme Commission.
  – The Gene Ontology Consortium protein function classification.
• Specialized, or ‘boutique’ databases.
• Expression (mRNA levels) and proteomics databases (interpretation in terms of protein patterns).
• Databases of metabolic pathways (flow of molecules and energy through pathways of chemical reactions).
• Bibliographic databases.
• Only a few of the many databases…
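
As a worked illustration of the rule of thumb from the protein-families slide (full-length sequences of >=100 residues with >=25% identical residues in an optimal alignment are likely to be related), the check can be sketched in Python. This sketch makes several assumptions. The two sequences are supplied already aligned, as equal-length strings with "-" marking gaps, so the alignment algorithm itself is out of scope. Percent identity is computed over columns where both sequences have a residue, which is one of several conventions in use. The function names and the example sequences are hypothetical.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent of identical residues over columns where neither sequence has a gap.

    Assumption: the inputs are already aligned (equal length, gaps as `gap`).
    Other conventions divide by the alignment length or the shorter sequence.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    pairs = [(a, b) for a, b in zip(aligned_a, aligned_b) if a != gap and b != gap]
    if not pairs:
        return 0.0
    identical = sum(1 for a, b in pairs if a == b)
    return 100.0 * identical / len(pairs)


def likely_related(aligned_a, aligned_b, min_length=100, min_identity=25.0, gap="-"):
    """Apply the slide's rule of thumb: both sequences >= min_length residues
    and >= min_identity percent identical residues in the given alignment."""
    len_a = sum(1 for c in aligned_a if c != gap)
    len_b = sum(1 for c in aligned_b if c != gap)
    if len_a < min_length or len_b < min_length:
        return False
    return percent_identity(aligned_a, aligned_b, gap) >= min_identity


# Short toy alignment: 6 gap-free columns, 5 of them identical.
short_identity = percent_identity("ACDE-FG", "ACDQWFG")

# Made-up 100-residue sequences, used only to exercise the length threshold.
seq = "ACDEFGHIKL" * 10
same = likely_related(seq, seq)               # 100% identity, length 100
different = likely_related(seq, "W" * 100)    # 0% identity
too_short = likely_related("ACDE-FG", "ACDQWFG")

print(round(short_identity, 1))  # 83.3
print(same, different, too_short)  # True False False
```

The threshold is a heuristic for "likely to be related", not a proof of homology, so a result of `True` here only flags a pair as worth a closer look.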