Bioinfo_Course_Rotterdam

advertisement
Applied Bioinformatics
Finding your way in biological information
Table of Contents:
Applied Bioinformatics ............................................................................................................. 1
Terms and abbreviations........................................................................................................ 2
Introduction ........................................................................................................................... 3
Finding back precise information .......................................................................................... 3
Searching and finding unknown information ........................................................................ 6
Managing information ........................................................................................................... 7
Complex analysis for the Lancet paper ............................................................................... 12
Suggested Solutions (for all questions): .............................................................................. 13
Additional URLs of Interest ................................................................................................ 16
Terms and abbreviations
Affymetrix
Commercial supplier of DNA hybridisation microarrays
Google
Web-Based Keyword Search Engine for indexed World Wide Web pages
Scholar Google
Web-Based Keyword Search Engine for indexed scientific papers
PubMed
Web-Based Keyword Search Engine at NCBI for indexed molecular
biology scientific papers with links to molecular biology information
GenBank
DNA sequence repository in the United States of America
EMBL
European Molecular Biology Library: DNA sequence repository in Europe
DDBJ
DNA DataBase of Japan: DNA sequence repository in Japan
Swiss-Prot
Protein sequence repository in Europe
PIR
Protein Information Resource: Protein sequence repository in the USA
UniProt
Collaborative protein sequence repository in the USA and Europe
PDB
Protein DataBase: Protein structure repository in the USA
Ensembl
Genome Sequence of Animals, repository and interface, in Europe
UCSC
Genome Sequence of Animals, repository and interface, in the USA
GEO
DNA microarray expression information repository in the USA
ArrayExpress
DNA microarray expression information repository in Europe
OMIM
Disease and genome information repository in the USA
Entrez Gene
Web-Based Keyword Search Engine for biological molecular information,
based in the USA
SRS
Web-Based Keyword Search Engine for biological molecular information,
based in Europe
NCBI
National Center for Biological Information, USA
EBI
European Bioinformatics Institute, UK
BLAST
Basic Local Alignment Search Tool, a software tool to detect DNA or
Protein sequences similar to a DNA or Protein query sequence
CLUSTALW
Cluster Alignment W, a software tool to align DNA or Protein sequences
ExPASy
Expert Protein Analysis System for analysis of protein sequence properties
Abcam
Antibody commercial provider
Invitrogen
Molecular Reagent commercial provider
Treefam
Sequence Alignment database
TextPad
Text Editor, with special interesting features
iRider
Web Browser, with special interesting features
Deltagen
Transgenic mice commercial provider
Introduction
Suppose the following situation:
You have received a copy of the recent Lancet paper (attached) with the request by your
department head to obtain additional information.
The paper lists database identifiers, gene names and symbols, tissue cell lines, disease types,
Affymetrix database identifiers, etc. You need to find and to keep track of all this information
quickly, cheaply, and understandable by your supervisor.
You would usually start laboratory work and confirm experimental findings described, but
now you need to review the paper, to discuss conclusions, and to formulate follow-up studies.
Initial attempts lead you to Google and various redundant sources on the World Wide Web,
and you find that a lot of information is updated very quickly. So your search must be easy to
repeat in the future.
This workshop will help you:

find known information – the database record of the sequence with a particular accession
number

search unknown information – mouse sequences corresponding to the listed human
sequences

manage all this information – quickly, cheaply, understandably, and reproducibly
Finding back precise information
A bit of history will explain some terms and avoid further confusion.
In the 1980s, several databases started to collect sequence information: GenBank in the USA,
EMBL in Europe and DDBJ in Japan for DNA; SwissProt in Switzerland and PIR in the USA
for proteins; PDB in the USA for protein structures. Every database lists each entry (sequence
or structure) with a unique and typical ‘database identifier’. To avoid confusion and facilitate
data exchange, DNA and protein databases agreed to give an ‘accession number’ to each
entry, a number that is unique but common in all databases. The consequence is: the
‘accession numbers’ almost never change, and they are almost always the same if you search
GenBank or EMBL or DDBJ, or if you search UniProt or PIR. In contrast, ‘database
identifiers’ change all the time.
In the 1990s, the Internet became extremely popular by the World Wide Web invented at the
CERN. Suddenly databases were accessible (by SRS) and searchable (by BLAST) on Web
pages. It became apparent that different biologists were naming similar genes with completely
different names, and were giving similar names to completely different genes in other species.
Commissions started to propose official gene names and symbols. The 1990s were also
characterised by the development of high throughput techniques such as genome sequencing,
and genomic databases such as Ensembl and UCSC appeared.
In the 2000s, gene expression arrays enter mainstream biology, and require databases that are
hosting and presenting results from those experiments, such as GEO in the USA and
ArrayExpress at the EBI in Europe. Geneticists run high throughput analyses and start to
produce information on sequence variation, association with disease. OMIM stores this type
of information in a biologist-friendly way.
Literature mining, high throughput investigation of protein interactions, measurement of
metabolic fluxes, promoter modulation, etc. are starting to generate information that can be
used in modelling simple biological systems. Integration and representation of this new type
of integrated information is a major endeavour for the next decade in bioinformatics.
More information can be found at the following resources:

Google and Scholar Google

SRS: molecular information

Entrez Gene: molecular information

Ensembl or UCSC browser: genomic information

NCBI: information resource

EBI: information resource

OMIM: disease information
Google
System to search for keywords on the World Wide Web, and so much more…
by Google Inc
http://www.google.com
example:
Search for the paper of Wang et al. (Foekens is corresponding author) in the Lancet.
Download the Affymetrix relevant chip annotation information from the Affymetrix web site.
tip: you will probably have to register in order to obtain the information.
Other search examples, which is the most convenient? Search for human kinases
Search for Escherichia coli proteins longer than 100 amino acids, good luck with Google
Scholar Google
System to search for keywords in scientific literature
by Google Inc
http://scholar.google.com
example:
Search for the Lancet paper of Wang et al. again.
What are Dr. Foekens’ publications, what has he been working on, what is his curriculum?
tip: use Google, Scholar Google. Which is the most convenient?
Other search examples, search for human kinases
Search for Escherichia coli proteins longer than 100 amino acids; good luck
SRS
Sequence Retrieval System
System to search for keywords in sequence-related databases integrated to SRS
by Thure Etzold
http://srs.ebi.ac.uk
Dutch SRS server: http://srs.bioinformatics.nl
example:
Search for the Lancet paper of Wang et al. again.
Search for a gene of interest in the Lancet list.
Search for Escherichia coli proteins longer than 100 amino acids
Search for sequences published by Jean-Marc Neefs
Entrez Gene
Automated curated system to search for genes and related information of (almost) completely
sequenced genomes
by NCBI
http://ncbi.nlm.nih.gov/Entrez
example:
Search for a gene of interest in the Lancet list.
Search for Glutamate Transporters
Search for cirrhosis related genes.
Ensembl
Automated curated system for genomic information of (almost) completely sequenced
genomes
by EBI and Sanger Centre
http://www.ensembl.org
example:
Search for a gene of interest in the Lancet list.
Search for fish Glutamate Transporters
Search for a few Affymetrix identifiers from the Lancet paper if available.
UCSC Genome Browser
Automated curated system for genomic information of (almost) completely sequenced
genomes
by University of California at Santa Cruz
http://genome.ucsc.edu/index.html
example:
Search for a gene of interest in the Lancet list.
Search for Glutamate Transporters
Search for a few Affymetrix identifiers from the Lancet paper if available.
NCBI
National Center for Biological Information
http://ncbi.nlm.nih.gov/
Integrated system to search for any sequence-related information
example:
Search for Glutamate Transporters and follow links from there
EBI
European Bioinformatics Institute
http://www.ebi.ac.uk
Integrated system to search for any sequence-related information
example:
Search for a gene of interest in the Lancet list.
Search for Glutamate Transporters and link out from there
OMIM
Online Mendelian Inheritance in Man
Hand-curated system to search for disease-related keywords and genes
by Victor McKusick
http://www.ncbi.nlm.nih.gov/sites/entrez?db=OMIM
example:
Search for a gene of interest in the Lancet list.
Search for cirrhosis related genes
Search for Glutamate Transporters in human and other species
Searching and finding unknown information
You have a few sequences listed. Is there any other related information?
More information can be found at the following resources:
BLAST: sequence similarity
CLUSTALW: sequence alignment
ExPASy: Sequence Analysis
BLAST
Basic Local Alignment Search Tool
Searches for sequence similar to a query protein or DNA sequence in any specified protein or
DNA sequence database
by David Lipman
http://www.ncbi.nlm.nih.gov/BLAST/
Dutch BatchBlast: http://services.nbic.nl/bb/cgi-bin/bb_search.cgi
CoPub mapper: http://services.nbic.nl/cgi-bin/copub/CoPub.pl
example:
From the gene of interest in the Lancet sequence, find similar sequences.
CLUSTALW
Cluster Alignment W
‘Align’, or display DNA or Protein sequences, such that individual sequences are placed
horizontally, and residues in common positions between sequences are placed vertically.
Amino acids ‘missing’ in shorter sequences are replaced by ‘gap’ symbols to ensure that all
sequences in the alignment are equally long.
by Julie Thomson
http://www.ch.embnet.org/software/ClustalW.html
example:
From the sequences obtained with BLAST, create a sequence alignment for mammalian
sequences only.
Align human Glutamate Transporters
ExPASy
Expert Protein Analysis System to analyse protein sequences
by Gasteiger and Bairoch
http://www.expasy.org
example:
Find human Glutamate Transporter 1
Managing information
Use the following resources:

Microsoft Excel: manage tabular data

TextPad: manage test data

iRIDER: manage Web Resources
Excel tips

Ordering information. Make sure your table has headers for ALL columns, otherwise
your data will mix up

Adding information. You can press CTRL+D to add the same information as the cell
above; or CTRL+R to add the same information as the cell left. It also works with
selections.

Summarizing information. Pivot tables are a fabulous tool to summarize large tables

Linking Out. The HYPERLINK function allows to calculate as many hyperlinks as you
like from information in one or several tables. Left-clicking the cell opens the link
immediately. The example below takes you step by step to creating URL in Excel.
We must have gene symbols in Excel column A, rows 1 to 76
Consider the formula:
http://www.ensembl.org/Homo_sapiens/geneview?db=core;gene=GSK3B
There are 2 parts in this URL:
http://www.ensembl.org/Homo_sapiens/geneview?db=core;gene=
which is the core or the root or the general term of the URL, and
GSK3B
which is gene specific (gene symbol)
We could write the URL for every gene in every cell in column B. This takes time, cannot
be changed easily, and is not 100% accurate
We can save time, become 100% accurate, and flexible. Write “GSK3B” in cell A1, and
write the following in cell B1:
="http://www.ensembl.org/Homo_sapiens/geneview?db=core;gene="&A1
When we copy it to the next rows, the gene name changes in column B automatically.
This is still no clickable URL. To make it clickable in Excel, write the following in cell
B1:
=HYPERLINK("http://www.ensembl.org/Homo_sapiens/geneview?db=core;g
ene="&A1)
Now the URL is clickable, but quite ugly. To improve, write the following in cell B1:
=HYPERLINK("http://www.ensembl.org/Homo_sapiens/geneview?db=core;g
ene="&A1,"Ensembl "&A1)
The URL stays clickable, but now the text "Ensembl GSK3B" is shown.
When we copy it to the next rows, the gene name changes in column B automatically.
We can still do much better.
write the following in cell Z1:
http://www.ensembl.org/Homo_sapiens/geneview?db=core;gene=
write the following in cell B1:
=HYPERLINK($Z$1&A1,"Ensembl "&A1)
You will see no change, but the formula is now much smaller, and will only take one
value: Z1 for the link, and all gene names
Please note the '$' signs, which fix the reference of the formula to cell Z1 in all cases
The advantage is that you need to make only one change when the URL changes
for example:
write the following in cell Z2:
http://www.ensembl.org/Mus_musculus/geneview?db=core;gene=
write the following in cell B1:
=HYPERLINK($Z$1&A1,"Ensembl Human "&$A1)
gives you the human link
write the following in cell C1:
=HYPERLINK($Z$2&A1,"Ensembl Mouse "&$A1)
gives you the mouse link
you can copy B1 and C1 to all rows, and have 152 URLs defined immediately
Exercise: write hyperlinks in column D for rat (Rattus_norvegicus) links

Merging information. Combining INDEX and MATCH functions in
INDEX(column_you_need, MATCH(cell_reference, column_you_search , 0)) will find
information from the ‘column_you_need’ for the ‘cell_reference” and add it to your
current data. We will practise a few examples.
Merging information in Excel
Excel is bad at merging data in a standard way. Merging rows will only keep the top row;
merging columns will only keep the left column; merging files just does not work.
If you type
=MATCH(first_cell, column_A, 0)
in a cell, Excel will find the first row in 'column_A' with the same value as 'first_cell' and
return the number of that row.
If you type
=INDEX(column_B, any_number)
in a cell, Excel will return the value of row 'any_number' in column 'column_B'.
This is powerful if you make Excel calculate this 'any_number' by the MATCH formula
described above.
This results in:
=INDEX(column_B, MATCH(first_cell, column_A, 0))
Let us work with one example:
Copy the following simple table (tab-delimited text) in Excel
Nr Letter What? Nr
Color
1
a
1
red
2
b
3
green
3
c
2
yellow
4
d
5
blue
5
e
4
black
6
f
question is: what color are a, b, c, d, e?
Next to "a", type =INDEX(select the column with Color, MATCH(select cell in first
column with Nr, select second column with Nr,0))
In the cell, the result should be "red"
Now be careful. As long as references are not fixed by adding '$' to column or rows, these
will change when you copy the cells!
So fix references (add '$' where necessary) and copy cells where you want them
This also works between Excel workbooks or between worksheets. It looks impressive
but the principle stays the same.
When you are satisfied with the merge results, always copy the new information as
values.
Otherwise you will lose the information when the file changes or is deleted.
Now you are ready to make some Excel magic!
TextPad tips

TextPad has a number of unique interesting time saving features

Ordering information. Sorting text line by line allows deleting duplicate lines.

Adding information. Copy and paste works as usual but also as text block

Regular Expressions. Allow replacement of spaces by tabs, or very complex operations,
line by line. Use POSIX settings

Linking Out. Hyperlinks in the text can be opened by pressing CTRL+G

Powerful Macros. Everything typed is recorded and can be played back several times.
The disadvantage is that you cannot yet edit the saved macros.
iRIDER tips

iRIDER has a number of unique interesting time saving features

It is a tabbed browser that allows searching hundreds of links pasted in the application
simultaneously. You can view the first results while the rest of the pages are downloaded.
A selection in a Web page text followed by a right click will open all selected links.
The complex analysis on the next page will require a combination of the tools mentioned
above.
For analysis of the Lancet paper, you can quickly find relevant information TODAY, without
programming anything. Writing Excel formulas or TextPad macros is not REAL
programming, but will get almost any mundane job done VERY quickly. Of course, if you
need to perform more complicated data analysis in a more systematic way, and interfaced on
the Web for other users, you will have to resort to programming.
Anyway, programming will use all of the techniques that have been outlined in this
workshop: data retrieval, regular expression, data parsing, collection, dynamic URL
generation, relational database queries, chaining, etc. Programming will add speed,
presentation layers, allow wider application and encourage a systematic approach.
If you really, really want to start programming, use PERL for coding, R for analysis, MySQL
for database and all World Wide Web resources for help. They are all for free.
I have brought a few books of interest for bioinformatics starters. If you need additional
details and more references, please provide your e-mail contact and I will forward these
to you.
Complex analysis for the Lancet paper
In the Lancet Paper, you have a list of accession numbers, gene symbols, Affymetrix
identifiers. What are these about?
Get the summary information of table 3 in a table. Tip: use Acrobat Reader and
Textpad to obtain a tab-delimited file and put the results in Excel. You cannot obtain
these data in Excel immediately. Why not?
Locate the dataset discussed in the paper in the GEO database. Discuss the expression
patterns for the 76 reported genes of interest. tip: use Google.
Obtain the protein sequences (if available) for the 76 reported genes of interest. tip: use
the gene identifiers or the Affymetrix numbers as a start.
Is there any way to manage this: without programming a database? I think you can:
Affymetrix data contains both probeset identifiers and gene symbols. You can search for
gene symbols in SRS or NCBI or ExPASy, and get to protein sequences from there. You
only need to put the sequences back to the existing table of 76 probesets in the correct
order in Excel. We will work through this example step by step.
Search for diseases are related to the 76 listed proteins?
Search for pathway information from the Affymetrix annotation file, compare it with
the information reported in table 4 of the paper.
Search for alignments and phylogenetic trees of mammalian proteins for the gene(s)
involved with your disease of highest interest. Edit the alignments to remove partial
sequences and sequence that are not from mammals. Tip: locate the Treefam database
and remember the URL pointing to your gene of interest.
Search for the gene location on the human genome. tip: use the Affymetrix data file, or
Ensembl. Which genes are located near it?
Search for transgene mice for the listed 76 genes.
Search for available antibodies to measure protein abundance in your tissue bank.
Search for available cDNA clones, other reagents (siRNA).
Search for the most recent literature about this gene.
Suggested Solutions (for all questions):
Search for the paper of Wang et al. (Foekens is corresponding author) in the Lancet.
Use Scholar Google, or PubMed. Type Foekens and Lancet, browse results.
Download the Affymetrix relevant chip annotation information from the Affymetrix web site.
tip: you will probably have to register in order to obtain the information.
Use Google. Type Affymetrix. Browse for Affymetrix.net web site. Register to access
the web site. Search there for the correct Affymetrix probeset used in the paper
(human 133a). Look at the available data files for that chip. Download the annotation
file and expand the zipped file. Open in Textpad or Excel.
Search for human kinases
Use SRS at EBI. Choose protein databases. Type kinase in gene description and
Homo sapiens in species. Browse results.
Search for Escherichia coli proteins longer than 100 amino acids.
Use SRS at EBI. Choose protein databases. Type Escherichia coli in species. Type
>100 in sequence length. Browse results.
What are Dr. Foekens’ publications, what has he been working on, what is his curriculum?
tip: use Google, Scholar Google, or PubMed. Which is the most convenient?
Use Scholar Google, or PubMed. Type Foekens. Browse results.
Search for sequences published by Jean-Marc Neefs
Use SRS.
Type Neefs in ‘All Fields’.
Browse results.
Search for a gene of interest in the Lancet list.
Use SRS or Entrez Gene.
Type the Accession number of your choice.
Browse results.
Search for Glutamate Transporters
Use SRS or Entrez Gene.
Type “glutamate transporter” in search fields.
Browse results.
Search for cirrhosis related genes
Use OMIM.
Type the Accession number of your choice in search fields.
Browse results and remember the URL when searching with gene symbols: you will
need it later.
Search for a few Affymetrix identifiers from the Lancet paper if available.
Use Ensembl.
Type Affymetrix identifiers in the human database.
Browse results.
Search for Glutamate Transporters and link out from there
Use NCBI or EBI web sites.
Type ‘Glutamate Transporter’ in gene descriptions.
Browse results, copy interesting URLs.
COMPLEX ANALYSIS:
Get the summary information of table 3 in a table.
Copy the text from table 3 with Adobe AcrobatReader.
Paste the text in TextPad
Convert to tab-separated file by replacing “_at’space’” to “_at’tab’”, “’space’/DEF”
to “tab”.
Verify results and copy to Excel.
Insert a row for column headers.
Browse results.
Locate the dataset discussed in the paper in the GEO database. Discuss the expression
patterns for the 76 reported genes of interest. tip: use Google.
Use Google to find the site, note the URL carefully once you have found the gene
expression image for the gene of interest in the experiment of interest.
Browse results.
Obtain the protein sequences (if available) for the 76 reported genes of interest. tip: use
the gene identifiers or the Affymetrix numbers as a start.
There are several way to solve this.
The list contains Affymetrix probeset Ids, database identifiers, and a protein
description. We can search every probeset ID at Ensembl and get the gene symbol
from there, but we use the Affymetrix dataset instead, which contains both.
Use both the Excel file with the Lancet data and the Affymetrix data file. Place Gene
Symbols in a new column of the Lancet data with the INDEX and MATCH formulas
in Excel.
Once we have the gene symbol, we can look for human corresponding proteins at
ExPASy, or lookup all human gene symbols at once at NCBI. Protein sequences can
also be extracted all at once from this site.
Save sequences in FASTA format, transform to tab-delimited text in Textpad
List all in a new Excel file, and merge the information with the Lancet list using the
INDEX and MATCH functions in Excel.
You will have to include RefSeq identifiers in Excel as well to put the correct
sequences with the gene symbols in the table, again using the INDEX and MATCH
functions.
Search for diseases are related to the 76 listed proteins?
Use gene symbols, lookup comma-delimited gene list in OMIM.
Search for pathway information from the Affymetrix annotation file, compare it with
the information reported in table 4 of the paper.
Use Affymetrix data file information, with INDEX and MATCH functions.
Search for alignments and phylogenetic trees of mammalian proteins for the gene(s)
involved with your disease of highest interest. Edit the alignments to remove partial
sequences and sequence that are not from mammals. Tip: try and locate the Treefam
database and remember the URL pointing to your gene of interest.
Use Treefam URL in combination with gene symbols in Excel.
Search for the gene location on the human genome. tip: use the Affymetrix data file, or
Ensembl. Which genes are located near it?
Use Ensembl URL in combination with gene symbols in Excel.
Search for transgene mice for the listed 76 genes.
Use Deltagen Excel file in combination with gene symbols and INDEX+MATCH
formulas in Excel.
Search for available antibodies to measure protein abundance in your tissue bank.
Use Abcam URL in combination with gene symbols in Excel.
Search for available cDNA clones, other reagents (siRNA).
Use Invitrogen URL in combination with gene symbols in Excel.
Search for the most recent literature about this gene.
Use PubMed URL in combination with gene symbols in Excel.
Additional URLs of Interest
expression of some GEO genes in some GEO datasets:
http://www.ncbi.nlm.nih.gov/sites/entrez?db=geo&term="GDS2319"[ACCN]+alk
Abcam: http://www.abcam.com
Invitrogen: http://www.invitrogen.com
Treefam: http://www.treefam.org
Deltagen: http://www.deltagen.com
Other URLs can be found rather easily (by searching for them in Google) and are therefore
not provided. Finding them and saving the information, in Excel or Textpad, is part of the
workshop training skills.
Download