Scientific publications and
archives: media, content
and access
Lesk, Ch 3
(Lesk, 2008)
Scientific literature
• Scientific publications began as interpersonal communications –
lectures, seminars and discussions – oral communication.
• Formal written articles or books – scientific literature.
• Today: journals, presentations at meetings, books, book chapters,
Web material, films, radio, television programs, podcasts.
• Formal academic publications must pass the test of ‘peer review’ –
quality control.
• Before the Internet, scientific literature appeared on paper
(journals). Today, journals appear electronically as well as on paper
(some readers rarely visit a library to read journals).
• Delocalized literature delivery and computational methods of
information retrieval.
Economic factors governing access to
scholarly publications
• Traditional economic model of scientific journals: a scientific
organization or publisher produces and distributes, at regular
intervals, a paper-bound ‘issue’ of articles.
– Cost: editorial office; preparation of manuscripts;
printing/distribution.
– Support (income): sales (subscription), page charges to authors,
donation, subsidy, advertisements etc.
• Recent changes:
– More papers are published – driving up costs.
– Larger volume of publication puts libraries under financial pressure.
– Electronic facilities reduce costs.
– Electronic distribution extends the potential format of journal articles.
– User community supports open access.
Open access / traditional and digital
libraries
• Redefinition of the author/publisher/reader relationship.
– Retains peer-review process.
– Accepted articles are placed on the Web, with free access.
– Authors retain copyright (instead of publisher).
– Costs of publication are transferred from readers to authors.
• Traditional libraries – you know what they are.
• Digital libraries.
– Electronic form, on-line.
– Raise economic questions.
– Large-scale digital libraries by scanning?
The information explosion / Databases
• Efficient delivery can be a mixed blessing.
• Impossible for anyone to read all the literature in a given field.
• The Web adds a further dimension – literature is no longer linear:
new media, new ways of searching, bibliography management,
organizing and sharing the harvest.
• Databases: contents, ontology, logical structure, format of the
data, routes for retrieval of data, links to other resources.
• Literature as a database: e.g. Medline (Medical Literature
Analysis and Retrieval System Online) – now part of PubMed,
bibliographic database.
Databases
• Database organization / design – e.g. design of a relational database
of amino acids.
• Annotation: a typical entry in a molecular biology database might
contain information beyond the primary data (e.g. gene sequences).
– Reference information (citations of publications).
– Interpretative information.
– Links to other information.
• Database quality control (errors?)
– “Get it right the first time”: database curation and annotation – a new
profession.
– Identify errors – external curators /users.
– Tracking database changes.
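The relational design mentioned above (a database of amino acids) can be illustrated with a minimal sketch using Python's built-in sqlite3 module. The table and column names (amino_acid, codon, one_letter, etc.) are assumptions made for illustration, not a schema taken from Lesk; only three amino acids are loaded.

```python
import sqlite3

def build_amino_acid_db():
    """Create a tiny in-memory relational database (illustrative schema)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE amino_acid (
            one_letter   TEXT PRIMARY KEY,
            three_letter TEXT NOT NULL,
            name         TEXT NOT NULL
        );
        CREATE TABLE codon (
            codon      TEXT PRIMARY KEY,
            one_letter TEXT NOT NULL REFERENCES amino_acid(one_letter)
        );
    """)
    # A small subset of the standard genetic code, enough to show the design.
    conn.executemany(
        "INSERT INTO amino_acid VALUES (?, ?, ?)",
        [("M", "Met", "methionine"),
         ("W", "Trp", "tryptophan"),
         ("K", "Lys", "lysine")],
    )
    conn.executemany(
        "INSERT INTO codon VALUES (?, ?)",
        [("AUG", "M"), ("UGG", "W"), ("AAA", "K"), ("AAG", "K")],
    )
    conn.commit()
    return conn

def codons_for(conn, amino_acid_name):
    """Join the two tables to list the codons of one amino acid."""
    rows = conn.execute(
        """SELECT c.codon
           FROM codon AS c
           JOIN amino_acid AS a ON a.one_letter = c.one_letter
           WHERE a.name = ?
           ORDER BY c.codon""",
        (amino_acid_name,),
    ).fetchall()
    return [row[0] for row in rows]

if __name__ == "__main__":
    db = build_amino_acid_db()
    print(codons_for(db, "lysine"))
```

Keeping codons in their own table, linked by the one-letter code, avoids repeating the amino acid's name on every codon row; that separation is the point of a relational design.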
Databases
• Database access: an issue to consider.
• Links (utility of a database): internal links and external links.
• Database interoperability: questions that require appeal to multiple
databases at once?
– Merge several databases?
– Methods for intercommunication between databases?
• Data mining.
– Knowledge discovery: description/explanation.
– Successful forecasting / predictive modeling.
– Statistical techniques.
– Artificial neural networks.
– Support vector machines.
Programming languages and tools
• Traditional programming languages: FORTRAN, C, C++
• Scripting languages: Perl, Python, Ruby…
• Program libraries specialized for molecular biology: standard
libraries (numerical analysis and text processing), libraries for
molecular biology (e.g. bioperl.org).
• Java – Java Virtual Machine – computing over the Web?
• Markup languages: implement data structures, e.g. XML.
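As a rough illustration of how a markup language carries a data structure, the sketch below parses a small XML record with Python's standard xml.etree.ElementTree module. The record layout (entry, name, organism, sequence) and the identifier EX0001 are invented for this example and do not follow any real database's schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical record: element and attribute names are illustrative only.
RECORD = """<entry id="EX0001">
  <name>example protein</name>
  <organism>Homo sapiens</organism>
  <sequence length="8">MKTAYIAK</sequence>
</entry>"""

def parse_entry(xml_text):
    """Turn one XML entry into a plain Python dictionary."""
    root = ET.fromstring(xml_text)
    seq = root.find("sequence")
    return {
        "id": root.get("id"),
        "name": root.findtext("name"),
        "organism": root.findtext("organism"),
        "sequence": seq.text,
        "declared_length": int(seq.get("length")),
    }

if __name__ == "__main__":
    entry = parse_entry(RECORD)
    # A simple consistency check between the attribute and the data.
    print(entry["id"], len(entry["sequence"]) == entry["declared_length"])
```

Because the tags name each field explicitly, a program can pull out the sequence or the organism without relying on column positions.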
Natural language processing
• Natural language: verbal-oral and/or textual forms of human-human communication.
• Natural language processing has been a goal of computing.
• Difficulty: ambiguity of words and phrases.
• Identifying keywords and combinations of keywords: e.g. names of
genes and names of diseases.
• Knowledge extraction: protein-protein interactions (automatic text-mining software).
• Text mining:
– Identification of references to genes and proteins.
– Identification of interactions.
– Interaction networks and diseases.
– Hypothesis generation (unsuspected relationships between genes and
diseases).
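The keyword-identification step above can be sketched very simply: look for known gene names and disease names in each sentence and count how often they occur together. The two small dictionaries and the sample sentences are assumptions chosen for illustration; real text-mining software has to cope with synonyms, abbreviations and ambiguity, which this sketch ignores.

```python
import re
from collections import Counter
from itertools import product

# Toy dictionaries (illustrative, far from complete).
GENES = {"BRCA1", "TP53", "CFTR"}
DISEASES = {"breast cancer", "cystic fibrosis"}

def cooccurrences(text):
    """Count gene/disease pairs mentioned within the same sentence."""
    counts = Counter()
    # Naive sentence split: break after '.', '!' or '?' followed by spaces.
    for sentence in re.split(r"(?<=[.!?])\s+", text):
        lowered = sentence.lower()
        genes = {g for g in GENES
                 if re.search(r"\b" + re.escape(g) + r"\b", sentence)}
        diseases = {d for d in DISEASES if d in lowered}
        for pair in product(sorted(genes), sorted(diseases)):
            counts[pair] += 1
    return counts

SAMPLE = ("Mutations in BRCA1 are associated with breast cancer. "
          "CFTR mutations cause cystic fibrosis. "
          "BRCA1 and TP53 were both examined in breast cancer samples.")

if __name__ == "__main__":
    for (gene, disease), n in sorted(cooccurrences(SAMPLE).items()):
        print(gene, disease, n)
```

Pairs that co-occur often across a large body of abstracts are candidates for a link in an interaction network; co-occurrence alone does not establish a relationship.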
Archives and information
retrieval
Lesk, Ch 4
(Lesk, 2008)
Database indexing and specification of
search terms
• An index: set of pointers to information in a database.
• Information retrieval programs accept multiple query terms
and keywords.
• Possible to ask for logical combinations of indexing terms.
• Many database search engines allow complex logical
expressions.
• Follow-up questions: modify query, cumulative searches, links
between entries in different databases.
• Analysis and processing of retrieved data: using results
retrieved in one search as input for another one (some
information retrieval systems provide such facilities).
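The idea of an index as a set of pointers, queried with logical combinations of terms, can be sketched as a small inverted index. The entry identifiers (E1, E2, E3), their descriptions and the function names are hypothetical; real search engines add ranking, stemming and field-specific indexes.

```python
from collections import defaultdict

def build_index(entries):
    """Map each term to the set of entry identifiers that contain it."""
    index = defaultdict(set)
    for entry_id, text in entries.items():
        for term in text.lower().split():
            index[term].add(entry_id)
    return index

def search_and(index, *terms):
    """Entries containing every term (logical AND)."""
    sets = [index.get(t.lower(), set()) for t in terms]
    return set.intersection(*sets) if sets else set()

def search_or(index, *terms):
    """Entries containing at least one term (logical OR)."""
    return set().union(*(index.get(t.lower(), set()) for t in terms))

def search_not(index, include, exclude):
    """Entries containing `include` but not `exclude` (AND NOT)."""
    return index.get(include.lower(), set()) - index.get(exclude.lower(), set())

# Hypothetical database entries, invented for the example.
ENTRIES = {
    "E1": "human hemoglobin alpha chain",
    "E2": "horse hemoglobin beta chain",
    "E3": "human myoglobin",
}

if __name__ == "__main__":
    idx = build_index(ENTRIES)
    first = search_and(idx, "human", "hemoglobin")
    print(sorted(first))
    # Cumulative search: narrow an earlier result with a further term.
    print(sorted(first & search_or(idx, "alpha", "beta")))
```

Because each query returns a set of identifiers, the result of one search can be fed directly into the next, which is the follow-up and cumulative searching described above.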
Nucleic acid sequence databases
• Archiving of bioinformatics data was originally carried out by
individual research groups.
• As requirements grew, projects became very large-scale.
• Primary data collections related to biological macromolecules:
– Nucleic acid sequences, including whole-genome projects.
– Amino acid sequences of proteins.
– Protein and nucleic acid structures.
– Small-molecule crystal structures.
– Protein functions.
– Expression patterns of genes.
– Networks: of metabolic pathways, of gene and protein interactions, and of
control cascades.
– Publications.
Nucleic acid sequence databases
• Triple partnership of the National Center for Biotechnology
Information (USA), EMBL-Bank (European Bioinformatics
Institute, UK) and the DNA Data Bank of Japan (National Institute of
Genetics, Japan).
• Curate, archive and distribute DNA and RNA sequences.
• Entries have life history:
– Unannotated -> Preliminary -> Unreviewed -> Standard
• Sample entry includes: properties of specific regions (e.g.
coding sequences, regions that perform or affect function, interact
with other molecules, affect replication, etc.)
Genome databases and genome browsers
• Genome browsers (full-genome sequences): databases
bringing together all molecular information available about a
particular species.
• E.g. ensembl.org: intended to be the universal information
source for the human and other genomes.
Protein sequence databases
• In 2002, three protein sequence databases, the Protein
Information Resource (PIR, USA), SWISS-PROT (Switzerland) and
TrEMBL (Europe), formed the UniProt consortium.
• Share the database but continue to offer separate
information-retrieval tools for access.
• Databases associated with SWISS-PROT:
– ENZYME DB and PROSITE
• PIR and associated databases:
– PIRSF: protein family classification system.
– iProClass: protein knowledge, access to over 90 biological databases.
– iProLINK: gateway to protein literature.
Databases of protein families
• Evolutionary relationships / homology detection.
• Two full-length protein sequences (>=100 residues) that have
>=25% identical residues in an optimal alignment are likely to
be related.
• Need sequence alignment algorithms.
• Refer to a group of related proteins as a family.
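The rule of thumb above can be turned into a short check. The sketch assumes the two sequences have already been aligned by some alignment program (gaps written as "-") and computes identity over the columns where both sequences have a residue; other definitions of percent identity use a different denominator. The function names and default thresholds simply mirror the numbers on this slide.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent identical residues over columns with no gap in either sequence."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    pairs = [(a, b) for a, b in zip(aligned_a, aligned_b)
             if a != gap and b != gap]
    if not pairs:
        return 0.0
    matches = sum(1 for a, b in pairs if a == b)
    return 100.0 * matches / len(pairs)

def likely_related(aligned_a, aligned_b,
                   min_length=100, min_identity=25.0, gap="-"):
    """Apply the >=100 residues and >=25% identity rule of thumb."""
    len_a = sum(1 for c in aligned_a if c != gap)
    len_b = sum(1 for c in aligned_b if c != gap)
    if len_a < min_length or len_b < min_length:
        return False  # too short for the rule to apply
    return percent_identity(aligned_a, aligned_b, gap) >= min_identity

if __name__ == "__main__":
    # Short toy alignment: identity can be computed, but the rule does not apply.
    print(percent_identity("ACDE-G", "ACDFHG"))
    print(likely_related("ACDE-G", "ACDFHG"))
```

The check only reports whether the threshold is met; finding the optimal alignment in the first place still requires a sequence alignment algorithm.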
Databases of structures
• Structure databases archive, annotate and distribute sets of
atomic coordinates.
• World-wide Protein Data Bank (wwPDB.org).
– Joint effort of the Research Collaboratory for Structural Bioinformatics
(RCSB) and the Protein Data Bank Japan.
– Contains the structures of proteins.
– It overlaps several other databases.
• Several websites offer hierarchical classifications of all proteins
of known structure
– SCOP, CATH, DALI, CE
Other databases
• Classification and assignment of protein function.
– The Enzyme Commission.
– The Gene Ontology Consortium protein function classification.
• Specialized, or ‘boutique’ databases.
• Expression (mRNA levels) and proteomics databases
(interpretation in terms of protein patterns).
• Databases of metabolic pathways (flow of molecules and
energy through pathways of chemical reactions).
• Bibliographic databases.
• Only a few of the many databases…