Practical lecture 1

GBIO0009-1 Bioinformatics
Introduction to DB
Instructors
• Practical sessions
Kyrylo Bessonov (Kirill)
• Office: B37 1/16
• kbessonov@ulg.ac.be
• Office hours: by appointment
Overview
1. Introduction to public databases
2. Databases demo HW
3. The submission system
What are we looking for?
Data & databases
Biologists Collect Lots of Data
• Hundreds of thousands of species to explore
• Millions of written articles in scientific journals
• Detailed genetic information:
• gene names
• phenotype of mutants
• location of genes/mutations on chromosomes
• linkage (distances between genes)
• High Throughput lab technologies
• PCR
• Rapid inexpensive DNA sequencing (Illumina HiSeq)
• Microarrays (Affymetrix)
• Genome-wide SNP chips / SNP arrays (Illumina)
• Data must be stored such that
• Minimum data quality is checked
• Data are well annotated according to standards
• Data are made available to the wide public to foster research
What is a database?
• Organized collection of data
• Information is stored in "records", "fields" and "tables"
• Fields are categories
• Each field must contain data of the same type (e.g. the columns below)
• Records contain data that is related to one object (e.g. a protein or a SNP)
• Each record is one row (e.g. the rows below)
SNP ID     | SNPSeqID           | Gene                         | +primer                   | -primer
D1Mit160_1 | 10.MMHAP67FLD1.seq | lymphocyte antigen 84        | AAGGTAAAAGGCAATCAGCACAGCC | TCAACCTGGAGTCAGAGGCT
M-05554_1  | 12.MMHAP31FLD3.seq | procollagen, type III, alpha | TGCGCAGAAGCTGAAGTCTA      | TTTTGAGGTGTTAATGGTTCT
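The record/field/table vocabulary can be tried out with a small sketch. It loads the two SNP records shown above into an in-memory SQLite table using Python's standard sqlite3 module. The table name and column names (snp, snp_id, forward_primer, ...) are made up for this illustration, and the primer fragments that were split across lines on the slide are assumed to belong together.

```python
import sqlite3

# In-memory database; nothing is written to disk.
conn = sqlite3.connect(":memory:")

# One table. Each column is a field and holds one type of data.
# Table and column names are illustrative, not from any real schema.
conn.execute(
    """
    CREATE TABLE snp (
        snp_id         TEXT PRIMARY KEY,
        snp_seq_id     TEXT,
        gene           TEXT,
        forward_primer TEXT,
        reverse_primer TEXT
    )
    """
)

# Each tuple is one record: all the data related to one SNP.
records = [
    ("D1Mit160_1", "10.MMHAP67FLD1.seq", "lymphocyte antigen 84",
     "AAGGTAAAAGGCAATCAGCACAGCC", "TCAACCTGGAGTCAGAGGCT"),
    ("M-05554_1", "12.MMHAP31FLD3.seq", "procollagen, type III, alpha",
     "TGCGCAGAAGCTGAAGTCTA", "TTTTGAGGTGTTAATGGTTCT"),
]
conn.executemany("INSERT INTO snp VALUES (?, ?, ?, ?, ?)", records)

# Retrieve selected fields of one record by its identifier.
row = conn.execute(
    "SELECT gene, forward_primer FROM snp WHERE snp_id = ?", ("M-05554_1",)
).fetchone()
print(row)
```

Public databases are far larger and more elaborate, but the same idea applies: a query names the fields it wants and a condition that selects the matching records.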
Genome sequencing generates lots of data
Biological Databases
The number of databases is constantly growing!
- OBRC (Online Bioinformatics Resources Collection)
listed 2826 databases as of 2013
Main databases by category
Literature
• PubMed: scientific & medical abstracts/citations
Health
• OMIM: Online Mendelian Inheritance in Man
Nucleotide Sequences
• Nucleotide: DNA and RNA sequences
Genomes
• Genome: genome sequencing projects by organism
• dbSNP: short genetic variations
Proteins
• Protein: protein sequences
• UniProt: protein sequences and related information
Chemicals
• PubChem Compound: chemical structures with related
information and links
Pathways
• BioSystems: molecular pathways with links to genes, proteins
• KEGG Pathway: information on main biological pathways
Growth of UniProtKB database
[Figure: number of UniProtKB entries over time]
• UniProtKB contains mainly protein sequences (entries). The
database is growing exponentially
• Data management issues? (e.g. storage, search, indexing)
Source: http://www.ebi.ac.uk/uniprot/TrEMBLstats
Primary and Secondary Databases
Primary databases
REAL EXPERIMENTAL DATA (raw)
Biomolecular sequences or structures and associated
annotation information (organism, function, mutation linked to
disease, functional/structural patterns, bibliography, etc.)
Secondary databases
DERIVED INFORMATION (analyzed and annotated)
Results of analysing the data held in the primary sources
(patterns, blocks, profiles, etc. which represent the most conserved
features of multiple alignments)
Primary Databases
Sequence Information
– DNA: EMBL, GenBank, DDBJ
– Protein: SwissProt, TrEMBL, PIR, OWL
Genome Information
– GDB, MGD, ACeDB
Structure Information
– PDB, NDB, CCDB/CSD
Secondary Databases
Sequence-related Information
– ProSite, Enzyme, REBase
Genome-related Information
– OMIM, TransFac
Structure-related Information
– DSSP, HSSP, FSSP, PDBFinder
Pathway Information
– KEGG, Pathways
GenBank database
• Contains all DNA and protein sequences described
in the scientific literature or collected in publicly
funded research
• One can search by protein name to get DNA/mRNA
sequences
• The search results can be filtered by species and
other parameters
GenBank main fields
NCBI Databases contain more than just
DNA & protein sequences
NCBI main portal: http://www.ncbi.nlm.nih.gov/
FASTA format to store sequences
• The FASTA format is now accepted by virtually all
databases and software that handle DNA and
protein sequences
• Specifications:
• One header line
• starts with > and ends with [return]
Saccharomyces cerevisiae strain YC81 actin (ACT1) gene
GenBank: JQ288018.1
>gi|380876362|gb|JQ288018.1| Saccharomyces cerevisiae strain YC81 actin (ACT1) gene, partial cds
TGGCATCATACCTTCTACAACGAATTGAGAGTTGCCCCAGAAGAACACCCTGTTCTTTTGACTGA
AGCTCCAATGAACCCTAAATCAAACAGAGAAAAGATGACTCAAATTATGTTTGAAACTTTCAACG
TTCCAGCCTTCTACGTTTCCATCCAAGCCGTTTTGTCCTTGTACTCTTCCGGTAGAACTACTGGT
ATTGTTTTGGATTCCGGTGATGGTGTTACTCACGTCGTTCCAATTTACGCTGGTTTCTCTCTACC
TCACGCCATTTTGAGAATCGATTTGGCCGGTAGAGATTTGACTGACTACTTGATGAAGATCTTGA
GTGAACGTGGTTACTCTTTCTCCACCACTGCTGAAAGAGAAATTGTCCGTGACATCAAGGAAAAA
CTATGTTACGTCGCCTTGGACTTCGAGCAAGAAATGCAAACCGCTGCTCAATCTTCTTCAATTGA
AAAATCCTACGAACTTCCAGATGGTCAAGTCATCACTATTGGTAAC
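To make the format concrete, here is a minimal sketch of a FASTA reader in plain Python. It assumes the usual convention that every line after the header, up to the next line starting with >, is sequence data. The function name parse_fasta is made up for this illustration, and only the first two sequence lines of the record above are used to keep the example short.

```python
def parse_fasta(text):
    """Return a list of (header, sequence) pairs from FASTA-formatted text."""
    records = []
    header = None
    chunks = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header = line[1:]
            chunks = []
        elif header is not None:
            # Sequence lines are concatenated without the line breaks.
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


example = """>gi|380876362|gb|JQ288018.1| Saccharomyces cerevisiae strain YC81 actin (ACT1) gene, partial cds
TGGCATCATACCTTCTACAACGAATTGAGAGTTGCCCCAGAAGAACACCCTGTTCTTTTGACTGA
AGCTCCAATGAACCCTAAATCAAACAGAGAAAAGATGACTCAAATTATGTTTGAAACTTTCAACG
"""

records = parse_fasta(example)
header, sequence = records[0]
# In this header style the accession is the fourth |-separated item.
accession = header.split("|")[3]
print(accession, len(sequence))
```

For real work an established library is usually preferable to a hand-written parser; the sketch only shows how little structure the format has.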
OMIM database
Online Mendelian Inheritance in Man (OMIM)
• "information on all known mendelian disorders linked to
over 12,000 genes"
• Started in the 1960s by Dr. Victor A. McKusick as "a catalog of
mendelian traits and disorders"
• Linked disease data
• Links disease phenotypes and causative genes
• Used by physicians and geneticists
OMIM – basic search
• Online Tutorial: http://www.openhelix.com/OMIM
• Each entry in the search results is marked with a *, +, # or % symbol
• # entries are the most informative, as the molecular basis of the
phenotype-genotype association is known
• We will search for: Ankylosing spondylitis (AS)
• AS is characterized by chronic inflammation of the spine
OMIM-search results
• Look for the entries that link to genes. Apply filters if needed
• The results can be filtered by whether a known SNP is associated
with the entry
• Some of the entries are more interesting than others. Try to look
for the ones with the # sign
OMIM-entries
OMIM Gene ID -entries
OMIM-Finding disease linked genes
• Read the report for the top gene-linked phenotype
• Mapping – Linkage heterogeneity section
• Go back to the original results
• Previously seen entry *607562 – IL23R
PubMed database
• PubMed is one of the best known databases in the whole scientific
community
• Most of the biology-related literature from all the related fields is
indexed by this database
• It has a very powerful mechanism for constructing search queries
• Many search fields
• Logical operators (AND, OR)
• Provides electronic links to most journals
• Example: searching by author for articles published in 2012-2013
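A query that combines search fields with logical operators can also be written out as text. The sketch below builds such a query string and the matching request URL for NCBI's E-utilities ESearch service. The helper names (field, all_of, any_of) and the author "Smith J" are placeholders invented for this illustration; [Author] and [dp] (publication date) are PubMed field tags, and 2012:2013[dp] is a date range. No request is sent.

```python
from urllib.parse import urlencode

ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def field(term, tag):
    """Attach a PubMed field tag to a term, e.g. 'Smith J[Author]'."""
    return f"{term}[{tag}]"


def all_of(*parts):
    """Combine sub-queries so that every one of them must match."""
    return "(" + " AND ".join(parts) + ")"


def any_of(*parts):
    """Combine sub-queries so that at least one of them must match."""
    return "(" + " OR ".join(parts) + ")"


# Articles by a (placeholder) author published in 2012 or 2013.
query = all_of(
    field("Smith J", "Author"),
    field("2012:2013", "dp"),
)

# urlencode escapes spaces and brackets so the query is safe in a URL.
url = ESEARCH_URL + "?" + urlencode({"db": "pubmed", "term": query})
print(query)
print(url)
```

The same query string can be pasted directly into the PubMed search box, which is often the quickest way to check that it selects what was intended.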
Demo homework
Exploring OMIM and
PubMed databases
Demo HW assignment (1)
Question 1: Inherited Disease Genes
In this question, you will choose a human disease and find
the GenBank accession numbers and sequences of some
genes which are thought to affect it.
1) Go to the OMIM database: http://www.ncbi.nlm.nih.gov/omim
2) Perform a search for a human disease you are interested in.
Some possibilities include: Leukaemia, Breast cancer, Crohn's disease,
IBD. You may choose any other disease.
3) Print the first page of the search results and circle two results in
the printout which you will use to find nucleotide sequences
(i.e. genes) related to the disease. (Not every item in the
search results is related/linked to a sequence.)
Demo HW assignment (2)
4) For each of the two circled entries, follow the links to the
GenBank database. (Note that some of the sequences you will
see in the first list may not be human.)
5) Display the chosen nucleotide sequences of the disease-related
genes in FASTA format as plain text and copy and paste them below
(only the first 5 lines; do not copy the whole FASTA file).
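The assignment is meant to be done through the web pages, but the last step can also be sketched in code. The example below builds an E-utilities EFetch URL that asks for a nucleotide record as FASTA plain text and trims FASTA text to its first five lines. The function names are made up for this illustration, JQ288018.1 is simply the accession from the FASTA slide, and the sequence lines in the sample are dummy data. No request is sent.

```python
from urllib.parse import urlencode

EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def fasta_url(accession):
    """Build an EFetch URL for one nucleotide record in FASTA plain text."""
    params = {
        "db": "nuccore",       # the Nucleotide database
        "id": accession,
        "rettype": "fasta",
        "retmode": "text",
    }
    return EFETCH_URL + "?" + urlencode(params)


def first_lines(text, n=5):
    """Keep only the first n lines of a block of text."""
    return "\n".join(text.splitlines()[:n])


url = fasta_url("JQ288018.1")

# Dummy FASTA text standing in for a downloaded record.
sample = "\n".join([">dummy header"] + ["ACGTACGTACGT"] * 8)
excerpt = first_lines(sample)
print(url)
print(excerpt)
```

Opening the printed URL in a browser, or fetching it with urllib.request.urlopen, should return the FASTA text, which can then be passed to first_lines.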
Demo HW assignment (3)
Question 2: Medical Articles
In this question, you will search for articles on your chosen
disease and restrict your search in various ways.
1) Go to the PubMed database: http://www.ncbi.nlm.nih.gov/pubmed
2) Perform a search for the same human disease as you used for
OMIM. Write down how many articles were found. Provide below
the search keyword(s) used to obtain the results.
3) Perform the same search, but only for articles which were published
in 2013. How many did you find? Provide below the
exact search query used to obtain the results (e.g.
([Author] …) AND ([Journal] …) ) and/or a graphical explanation of
how the publication date filter was applied.
4) Print the abstracts of the first 5 search results.
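The article counts for steps 2 and 3 can be cross-checked programmatically. The sketch below builds an E-utilities ESearch URL that asks for results in JSON, optionally restricted to one publication year with the [dp] field tag, and reads the count out of a response. The function names are made up for this illustration, and the sample response is a hand-written stand-in with an invented count, not real PubMed output. No request is sent.

```python
import json
from urllib.parse import urlencode

ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def count_url(term, year=None):
    """Build an ESearch URL; restrict to one publication year if given."""
    if year is not None:
        term = f"({term}) AND {year}[dp]"
    params = {
        "db": "pubmed",
        "term": term,
        "retmode": "json",
        "retmax": 0,  # only the total is needed, not the list of IDs
    }
    return ESEARCH_URL + "?" + urlencode(params)


def parse_count(response_text):
    """Extract the total number of matching articles from a JSON reply."""
    return int(json.loads(response_text)["esearchresult"]["count"])


all_years = count_url("ankylosing spondylitis")
only_2013 = count_url("ankylosing spondylitis", year=2013)

# Hand-written stand-in for a server reply; the count is invented.
sample_response = '{"esearchresult": {"count": "42", "idlist": []}}'
print(only_2013)
print(parse_count(sample_response))
```

The search term inside the URL is the "exact query" that step 3 asks for, so the same string can be reported in the homework answer.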
Assignment Submission
Step by Step Guide
Assignment submission
• All assignments should be zipped into
one file (*.zip) and submitted online
• Create a submission account
Account creation
• Any member of the group can submit assignment
• Account details will be emailed to you automatically
• All GBIO0009-1 students should create an account
Submit your assignment
• After creating an account, log in to the submission page
• The time remaining until the deadline is displayed. It is a good idea to
check it from time to time in order to stay on top of things
• The file extension should be zip
• You can submit an assignment as many times as you wish
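Any archiving tool can produce the required zip file. As one possibility, the sketch below uses Python's standard zipfile module to pack several files into a single archive. The function name zip_assignment and the file names (answers.txt, sequences.fasta, homework1.zip) are placeholders invented for this illustration, and the files are created in a temporary directory so the example leaves nothing behind.

```python
import tempfile
import zipfile
from pathlib import Path


def zip_assignment(files, archive_path):
    """Pack the given files into one zip archive and return its path."""
    archive_path = Path(archive_path)
    with zipfile.ZipFile(archive_path, "w", compression=zipfile.ZIP_DEFLATED) as archive:
        for file_path in files:
            file_path = Path(file_path)
            # Store only the file name, not the full directory path.
            archive.write(file_path, arcname=file_path.name)
    return archive_path


with tempfile.TemporaryDirectory() as workdir:
    workdir = Path(workdir)

    # Placeholder homework files.
    answers = workdir / "answers.txt"
    answers.write_text("Question 1: ...\n")
    fasta = workdir / "sequences.fasta"
    fasta.write_text(">placeholder\nACGT\n")

    archive = zip_assignment([answers, fasta], workdir / "homework1.zip")

    # List the archive contents to confirm everything was included.
    with zipfile.ZipFile(archive) as packed:
        names = packed.namelist()

print(names)
```

Listing the archive contents before uploading is a cheap way to catch a missing file.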
Next class
• Bring a PC for R installation!
• Form groups of 2-3 persons to work on HW