Bioinfo_Course_Rotterdam

Applied Bioinformatics
Finding your way in biological information

Table of Contents:
Applied Bioinformatics
Terms and abbreviations
Introduction
Finding back precise information
Searching and finding unknown information
Managing information
Complex analysis for the Lancet paper
Suggested Solutions (for all questions)
Additional URLs of Interest

Terms and abbreviations

Affymetrix: commercial supplier of DNA hybridisation microarrays
Google: web-based keyword search engine for indexed World Wide Web pages
Google Scholar: web-based keyword search engine for indexed scientific papers
PubMed: web-based keyword search engine at NCBI for indexed biomedical scientific papers, with links to molecular biology information
GenBank: DNA sequence repository in the United States of America
EMBL: European Molecular Biology Laboratory; its nucleotide sequence database is the DNA sequence repository in Europe
DDBJ: DNA Data Bank of Japan; DNA sequence repository in Japan
Swiss-Prot: protein sequence repository in Europe
PIR: Protein Information Resource; protein sequence repository in the USA
UniProt: collaborative protein sequence repository in the USA and Europe
PDB: Protein Data Bank; protein structure repository in the USA
Ensembl: genome sequences of animals, repository and interface, in Europe
UCSC: genome sequences of animals, repository and interface, in the USA
GEO: DNA microarray expression information repository in the USA
ArrayExpress: DNA microarray expression information repository in Europe
OMIM: disease and genome information repository in the USA
Entrez Gene: web-based keyword search engine for biological molecular information, based in the USA
SRS: web-based keyword search engine for biological molecular information, based in Europe
NCBI: National Center for Biotechnology Information, USA
EBI: European Bioinformatics Institute, UK
BLAST: Basic Local Alignment Search Tool, a software tool to detect DNA or protein sequences similar to a DNA or protein query sequence
CLUSTALW: Clustal W, a software tool to align multiple DNA or protein sequences
ExPASy: Expert Protein Analysis System, for analysis of protein sequence properties
Abcam: commercial antibody provider
Invitrogen: commercial molecular reagent provider
Treefam: database of sequence alignments and phylogenetic trees for animal gene families
TextPad: text editor with useful special features
iRider: web browser with useful special features
Deltagen: commercial provider of transgenic mice

Introduction

Suppose the following situation: your department head has given you a copy of the recent Lancet paper (attached) and asked you to obtain additional information. The paper lists database identifiers, gene names and symbols, tissue cell lines, disease types, Affymetrix database identifiers, etc. You need to find and keep track of all this information quickly, cheaply, and in a form your supervisor can understand. You would usually start laboratory work to confirm the experimental findings described, but now you need to review the paper, discuss its conclusions, and formulate follow-up studies. Initial attempts lead you to Google and various redundant sources on the World Wide Web, and you find that a lot of information is updated very quickly. So your search must be easy to repeat in the future.

This workshop will help you:
- find known information: the database record of the sequence with a particular accession number
- search unknown information: mouse sequences corresponding to the listed human sequences
- manage all this information: quickly, cheaply, understandably, and reproducibly

Finding back precise information

A bit of history will explain some terms and avoid further confusion.

In the 1980s, several databases started to collect sequence information: GenBank in the USA, EMBL in Europe and DDBJ in Japan for DNA; Swiss-Prot in Switzerland and PIR in the USA for proteins. PDB in the USA, founded earlier (in 1971), collects protein structures. Every database lists each entry (sequence or structure) with a unique and typical ‘database identifier’. To avoid confusion and facilitate data exchange, DNA and protein databases agreed to give each entry an ‘accession number’, a number that is unique to the entry but shared by all databases. The consequence is that ‘accession numbers’ almost never change, and they are almost always the same whether you search GenBank, EMBL or DDBJ, or whether you search UniProt or PIR.
In contrast, ‘database identifiers’ change all the time.

In the 1990s, the Internet became extremely popular thanks to the World Wide Web, invented at CERN. Suddenly databases were accessible (through SRS) and searchable (through BLAST) on Web pages. It became apparent that different biologists were giving completely different names to similar genes, and similar names to completely different genes in other species. Commissions started to propose official gene names and symbols. The 1990s were also characterised by the development of high-throughput techniques such as genome sequencing, and genomic databases such as Ensembl and UCSC appeared.

In the 2000s, gene expression arrays have entered mainstream biology, and they require databases that host and present results from those experiments, such as GEO in the USA and ArrayExpress at the EBI in Europe. Geneticists run high-throughput analyses and have started to produce information on sequence variation and its association with disease. OMIM stores this type of information in a biologist-friendly way.

Literature mining, high-throughput investigation of protein interactions, measurement of metabolic fluxes, promoter modulation, etc. are starting to generate information that can be used to model simple biological systems. Integrating and representing this new type of information is a major endeavour for the next decade in bioinformatics.

More information can be found at the following resources:
- Google and Google Scholar
- SRS: molecular information
- Entrez Gene: molecular information
- Ensembl or UCSC browser: genomic information
- NCBI: information resource
- EBI: information resource
- OMIM: disease information

Google
System to search for keywords on the World Wide Web, and so much more… by Google Inc.
http://www.google.com
example:
- Search for the paper of Wang et al. (Foekens is corresponding author) in the Lancet.
- Download the relevant Affymetrix chip annotation information from the Affymetrix web site.
  tip: you will probably have to register in order to obtain the information.
- Other search examples. Which tool is the most convenient?
  Search for human kinases.
  Search for Escherichia coli proteins longer than 100 amino acids (good luck with Google).

Google Scholar
System to search for keywords in scientific literature, by Google Inc.
http://scholar.google.com
example:
- Search for the Lancet paper of Wang et al. again.
- What are Dr. Foekens’ publications, what has he been working on, what is his curriculum vitae?
  tip: use Google and Google Scholar. Which is the most convenient?
- Other search examples.
  Search for human kinases.
  Search for Escherichia coli proteins longer than 100 amino acids (good luck).

SRS
Sequence Retrieval System
System to search for keywords in the sequence-related databases integrated into SRS, by Thure Etzold
http://srs.ebi.ac.uk
Dutch SRS server: http://srs.bioinformatics.nl
example:
- Search for the Lancet paper of Wang et al. again.
- Search for a gene of interest in the Lancet list.
- Search for Escherichia coli proteins longer than 100 amino acids.
- Search for sequences published by Jean-Marc Neefs.

Entrez Gene
Automated, curated system to search for genes and related information of (almost) completely sequenced genomes, by NCBI
http://ncbi.nlm.nih.gov/Entrez
example:
- Search for a gene of interest in the Lancet list.
- Search for Glutamate Transporters.
- Search for cirrhosis-related genes.

Ensembl
Automated, curated system for genomic information of (almost) completely sequenced genomes, by EBI and the Sanger Centre
http://www.ensembl.org
example:
- Search for a gene of interest in the Lancet list.
- Search for fish Glutamate Transporters.
- Search for a few Affymetrix identifiers from the Lancet paper, if available.

UCSC Genome Browser
Automated, curated system for genomic information of (almost) completely sequenced genomes, by the University of California at Santa Cruz
http://genome.ucsc.edu/index.html
example:
- Search for a gene of interest in the Lancet list.
- Search for Glutamate Transporters.
- Search for a few Affymetrix identifiers from the Lancet paper, if available.

NCBI
National Center for Biotechnology Information
http://ncbi.nlm.nih.gov/
Integrated system to search for any sequence-related information
example:
- Search for Glutamate Transporters and follow links from there.

EBI
European Bioinformatics Institute
http://www.ebi.ac.uk
Integrated system to search for any sequence-related information
example:
- Search for a gene of interest in the Lancet list.
- Search for Glutamate Transporters and link out from there.

OMIM
Online Mendelian Inheritance in Man
Hand-curated system to search for disease-related keywords and genes, by Victor McKusick
http://www.ncbi.nlm.nih.gov/sites/entrez?db=OMIM
example:
- Search for a gene of interest in the Lancet list.
- Search for cirrhosis-related genes.
- Search for Glutamate Transporters in human and other species.

Searching and finding unknown information

You have a few sequences listed. Is there any other related information?

More information can be found at the following resources:
- BLAST: sequence similarity
- CLUSTALW: sequence alignment
- ExPASy: sequence analysis

BLAST
Basic Local Alignment Search Tool
Searches any specified protein or DNA sequence database for sequences similar to a query protein or DNA sequence, by David Lipman and colleagues
http://www.ncbi.nlm.nih.gov/BLAST/
Dutch BatchBlast: http://services.nbic.nl/bb/cgi-bin/bb_search.cgi
CoPub mapper: http://services.nbic.nl/cgi-bin/copub/CoPub.pl
example:
- Starting from the sequence of your gene of interest in the Lancet list, find similar sequences.

CLUSTALW
Clustal W, a multiple sequence alignment tool
‘Aligns’, or displays, DNA or protein sequences such that individual sequences are placed horizontally, and residues in common positions between sequences are placed vertically. Residues ‘missing’ in shorter sequences are replaced by ‘gap’ symbols to ensure that all sequences in the alignment are equally long.
by Julie Thompson
http://www.ch.embnet.org/software/ClustalW.html
example:
- From the sequences obtained with BLAST, create a sequence alignment for mammalian sequences only.
- Align human Glutamate Transporters.

ExPASy
Expert Protein Analysis System
System to analyse protein sequences, by Gasteiger and Bairoch
http://www.expasy.org
example:
- Find human Glutamate Transporter 1.

Managing information

Use the following resources:
- Microsoft Excel: manage tabular data
- TextPad: manage text data
- iRider: manage Web resources

Excel tips
- Ordering information. Make sure your table has headers for ALL columns, otherwise your data will get mixed up when you sort.
- Adding information. Press CTRL+D to fill a cell with the content of the cell above, or CTRL+R to fill it with the content of the cell to its left. This also works with selections.
- Summarizing information. Pivot tables are a fabulous tool to summarize large tables.
- Linking Out. The HYPERLINK function lets you calculate as many hyperlinks as you like from information in one or several tables. Left-clicking the cell opens the link immediately.

The example below takes you step by step through creating URLs in Excel.

We must have gene symbols in Excel column A, rows 1 to 76.

Consider the URL:
    http://www.ensembl.org/Homo_sapiens/geneview?db=core;gene=GSK3B
There are 2 parts in this URL:
    http://www.ensembl.org/Homo_sapiens/geneview?db=core;gene=
which is the core, or root, or general part of the URL, and
    GSK3B
which is gene-specific (the gene symbol).

We could type the URL for every gene into every cell of column B. This takes time, cannot be changed easily, and invites typing errors. A formula saves time, avoids typing errors, and stays flexible.

Write “GSK3B” in cell A1, and write the following in cell B1:
    ="http://www.ensembl.org/Homo_sapiens/geneview?db=core;gene="&A1
When we copy it to the next rows, the gene name changes in column B automatically.
This is not yet a clickable URL.
To make it clickable in Excel, write the following in cell B1:
    =HYPERLINK("http://www.ensembl.org/Homo_sapiens/geneview?db=core;gene="&A1)
Now the URL is clickable, but quite ugly.

To improve it, write the following in cell B1:
    =HYPERLINK("http://www.ensembl.org/Homo_sapiens/geneview?db=core;gene="&A1,"Ensembl "&A1)
The URL stays clickable, but now the text "Ensembl GSK3B" is shown.
When we copy it to the next rows, the gene name changes in column B automatically.

We can still do much better.
Write the following in cell Z1:
    http://www.ensembl.org/Homo_sapiens/geneview?db=core;gene=
Write the following in cell B1:
    =HYPERLINK($Z$1&A1,"Ensembl "&A1)
You will see no change, but the formula is now much shorter and takes the URL root from a single cell, Z1, for all gene names.
Please note the '$' signs, which fix the reference of the formula to cell Z1 in all cases.
The advantage is that you need to make only one change when the URL changes.

For example:
Write the following in cell Z2:
    http://www.ensembl.org/Mus_musculus/geneview?db=core;gene=
Write the following in cell B1:
    =HYPERLINK($Z$1&A1,"Ensembl Human "&$A1)
This gives you the human link.
Write the following in cell C1:
    =HYPERLINK($Z$2&A1,"Ensembl Mouse "&$A1)
This gives you the mouse link.
You can copy B1 and C1 to all rows, and have 152 URLs defined immediately.

Exercise: write hyperlinks in column D for rat (Rattus_norvegicus) links.

- Merging information. Combining the INDEX and MATCH functions as
    INDEX(column_you_need, MATCH(cell_reference, column_you_search, 0))
  will find the information in ‘column_you_need’ that corresponds to ‘cell_reference’ and add it to your current data. We will practise a few examples.

Merging information in Excel

Excel is bad at merging data in a standard way. Merging rows will only keep the top row; merging columns will only keep the left column; merging files just does not work.
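The Linking Out walkthrough above (a URL root kept in one place, with the gene symbol appended per row) can also be sketched outside Excel. The Python snippet below is only an illustration under stated assumptions: the function name build_links and the dictionary ENSEMBL_ROOTS are made up for this sketch, the two URL roots are copied from the handout, and the geneview address scheme may no longer be served by Ensembl.

```python
from urllib.parse import quote

# URL roots copied from the walkthrough; they play the role of cells Z1 and Z2.
# Assumption: these historical "geneview" addresses are used purely as examples.
ENSEMBL_ROOTS = {
    "Human": "http://www.ensembl.org/Homo_sapiens/geneview?db=core;gene=",
    "Mouse": "http://www.ensembl.org/Mus_musculus/geneview?db=core;gene=",
}


def build_links(gene_symbols, roots=ENSEMBL_ROOTS):
    """Return one (label, url) pair per gene symbol and per species root.

    Hypothetical helper that mirrors =HYPERLINK($Z$1&A1,"Ensembl Human "&$A1).
    """
    links = []
    for symbol in gene_symbols:
        for species, root in roots.items():
            label = f"Ensembl {species} {symbol}"
            # quote() escapes characters that are not safe inside a URL.
            links.append((label, root + quote(symbol)))
    return links


if __name__ == "__main__":
    for label, url in build_links(["GSK3B"]):
        print(label, url, sep="\t")
```

As in the Excel version, changing a root in one place changes every generated link, and a list of 76 gene symbols yields 152 links for the two species.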
If you type
    =MATCH(first_cell, column_A, 0)
in a cell, Excel will find the first row in 'column_A' with the same value as 'first_cell' and return the number of that row.

If you type
    =INDEX(column_B, any_number)
in a cell, Excel will return the value of row 'any_number' in column 'column_B'.

This is powerful if you make Excel calculate this 'any_number' with the MATCH formula described above. This results in:
    =INDEX(column_B, MATCH(first_cell, column_A, 0))

Let us work with one example. Copy the following simple table (tab-delimited text) into Excel:

Nr	Letter	What?	Nr	Color
1	a		1	red
2	b		3	green
3	c		2	yellow
4	d		5	blue
5	e		4	black
6	f

The question is: what color are a, b, c, d, e?

Next to "a", in the "What?" column, type
    =INDEX(select the column with Color, MATCH(select cell in first column with Nr, select second column with Nr, 0))
In the cell, the result should be "red".

Now be careful. As long as references are not fixed by adding '$' to columns or rows, they will change when you copy the cells! So fix the references (add '$' where necessary) and copy the cells to where you want them.

This also works between Excel workbooks or between worksheets. It looks impressive, but the principle stays the same.

When you are satisfied with the merge results, always copy the new information as values. Otherwise you will lose the information when the source file changes or is deleted.

Now you are ready to make some Excel magic!

TextPad tips
- TextPad has a number of unique, interesting, time-saving features.
- Ordering information. Sorting text line by line allows deleting duplicate lines.
- Adding information. Copy and paste works as usual, but also on text blocks.
- Regular Expressions. These allow replacement of spaces by tabs, or very complex operations, line by line. Use POSIX settings.
- Linking Out. Hyperlinks in the text can be opened by pressing CTRL+G.
- Powerful Macros. Everything typed is recorded and can be played back several times. The disadvantage is that you cannot yet edit the saved macros.
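Returning to the INDEX and MATCH merge from the Excel section above: the same lookup can be cross-checked outside Excel. The Python sketch below uses the Letter and Color table from the handout; the function name index_match is made up for this illustration, and it returns None where Excel would display #N/A. It is a minimal sketch of the principle, not a replacement for the Excel exercise.

```python
# Left table: Nr -> Letter (the rows whose colour we want to know).
letters = [(1, "a"), (2, "b"), (3, "c"), (4, "d"), (5, "e"), (6, "f")]
# Right table: Nr -> Color, listed in a different order, as in the handout.
colors = [(1, "red"), (3, "green"), (2, "yellow"), (5, "blue"), (4, "black")]


def index_match(key, search_column, result_column):
    """Mimic =INDEX(result_column, MATCH(key, search_column, 0)).

    Hypothetical helper: returns the value of result_column at the first
    position where search_column equals key, or None if there is no match.
    """
    for position, candidate in enumerate(search_column):
        if candidate == key:
            return result_column[position]
    return None


# Split the right table into the column to search and the column to return.
color_nr = [nr for nr, _ in colors]
color_name = [name for _, name in colors]

# One lookup per row of the left table, like copying the formula down.
merged = {letter: index_match(nr, color_nr, color_name) for nr, letter in letters}
print(merged)
```

Row 6 ("f") has no partner in the Color table, so its lookup finds nothing; this is the situation in which Excel shows #N/A.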
iRider tips
- iRider has a number of unique, interesting, time-saving features.
- It is a tabbed browser that can open hundreds of links pasted into the application simultaneously. You can view the first results while the rest of the pages are downloaded. Selecting text in a Web page and right-clicking will open all selected links.

The complex analysis on the next page will require a combination of the tools mentioned above.

For the analysis of the Lancet paper, you can quickly find relevant information TODAY, without programming anything. Writing Excel formulas or TextPad macros is not REAL programming, but it will get almost any mundane job done VERY quickly.

Of course, if you need to perform more complicated data analysis in a more systematic way, with a Web interface for other users, you will have to resort to programming. Programming will use all of the techniques outlined in this workshop: data retrieval, regular expressions, data parsing, collection, dynamic URL generation, relational database queries, chaining, etc. Programming will add speed and presentation layers, allow wider application, and encourage a systematic approach.

If you really, really want to start programming, use Perl for coding, R for analysis, MySQL for the database, and all World Wide Web resources for help. They are all free.

I have brought a few books of interest for bioinformatics starters. If you need additional details and more references, please provide your e-mail contact and I will forward these to you.

Complex analysis for the Lancet paper

In the Lancet paper, you have a list of accession numbers, gene symbols and Affymetrix identifiers. What are these about?

- Get the summary information of table 3 in a table.
  Tip: use Acrobat Reader and TextPad to obtain a tab-delimited file and put the results in Excel. You cannot obtain these data in Excel immediately. Why not?
- Locate the dataset discussed in the paper in the GEO database.
  Discuss the expression patterns for the 76 reported genes of interest.
  tip: use Google.
- Obtain the protein sequences (if available) for the 76 reported genes of interest.
  tip: use the gene identifiers or the Affymetrix numbers as a start.
  Is there any way to manage this without programming a database? I think you can: the Affymetrix data contain both probeset identifiers and gene symbols. You can search for gene symbols in SRS, NCBI or ExPASy, and get to the protein sequences from there. You only need to put the sequences back into the existing table of 76 probesets in the correct order in Excel. We will work through this example step by step.
- Which diseases are related to the 76 listed proteins?
- Search for pathway information in the Affymetrix annotation file, and compare it with the information reported in table 4 of the paper.
- Search for alignments and phylogenetic trees of mammalian proteins for the gene(s) involved with your disease of highest interest. Edit the alignments to remove partial sequences and sequences that are not from mammals.
  Tip: locate the Treefam database and remember the URL pointing to your gene of interest.
- Search for the gene location on the human genome.
  tip: use the Affymetrix data file, or Ensembl. Which genes are located near it?
- Search for transgenic mice for the listed 76 genes.
- Search for available antibodies to measure protein abundance in your tissue bank.
- Search for available cDNA clones and other reagents (siRNA).
- Search for the most recent literature about this gene.

Suggested Solutions (for all questions):

Search for the paper of Wang et al. (Foekens is corresponding author) in the Lancet.
  Use Google Scholar or PubMed. Type Foekens and Lancet, and browse the results.

Download the relevant Affymetrix chip annotation information from the Affymetrix web site. tip: you will probably have to register in order to obtain the information.
  Use Google. Type Affymetrix. Browse for the Affymetrix.net web site. Register to access the web site.
  Search there for the correct Affymetrix chip used in the paper (human 133a). Look at the available data files for that chip. Download the annotation file and expand the zipped file. Open it in TextPad or Excel.

Search for human kinases
  Use SRS at EBI. Choose protein databases. Type kinase in gene description and Homo sapiens in species. Browse results.

Search for Escherichia coli proteins longer than 100 amino acids.
  Use SRS at EBI. Choose protein databases. Type Escherichia coli in species. Type >100 in sequence length. Browse results.

What are Dr. Foekens’ publications, what has he been working on, what is his curriculum vitae? tip: use Google, Google Scholar, or PubMed. Which is the most convenient?
  Use Google Scholar or PubMed. Type Foekens. Browse results.

Search for sequences published by Jean-Marc Neefs
  Use SRS. Type Neefs in ‘All Fields’. Browse results.

Search for a gene of interest in the Lancet list.
  Use SRS or Entrez Gene. Type the accession number of your choice. Browse results.

Search for Glutamate Transporters
  Use SRS or Entrez Gene. Type “glutamate transporter” in the search fields. Browse results.

Search for cirrhosis-related genes
  Use OMIM. Type “cirrhosis” in the search fields. Browse results and remember the URL used when searching with gene symbols: you will need it later.

Search for a few Affymetrix identifiers from the Lancet paper, if available.
  Use Ensembl. Type the Affymetrix identifiers in the human database. Browse results.

Search for Glutamate Transporters and link out from there
  Use the NCBI or EBI web sites. Type ‘Glutamate Transporter’ in gene descriptions. Browse results and copy interesting URLs.

COMPLEX ANALYSIS:

Get the summary information of table 3 in a table.
  Copy the text of table 3 with Adobe Acrobat Reader. Paste the text in TextPad. Convert it to a tab-separated file by replacing “_at” followed by a space with “_at” followed by a tab, and a space followed by “/DEF” with a tab. Verify the results and copy them to Excel. Insert a row for column headers. Browse results.
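The two search-and-replace steps of the table 3 solution above can be sketched in Python as well. This is a hedged illustration, not the workshop's method: the function name to_tab_delimited is made up, and the sample line is an invented placeholder, not real content of table 3, whose exact layout is not reproduced in this handout.

```python
def to_tab_delimited(text):
    """Apply the two replacements from the table 3 solution, line by line.

    Hypothetical helper: "_at" + space becomes "_at" + tab, and
    space + "/DEF" becomes a single tab.
    """
    converted = []
    for line in text.splitlines():
        line = line.replace("_at ", "_at\t")   # probeset ID | rest of line
        line = line.replace(" /DEF", "\t")     # description | remainder
        converted.append(line)
    return "\n".join(converted)


# Invented placeholder line; real table 3 text must be pasted in instead.
sample = "0000_at made-up description /DEF made-up remainder"
for field in to_tab_delimited(sample).split("\t"):
    print(repr(field))
```

The result can be pasted into Excel, where each tab starts a new column. As in TextPad, it is worth verifying a few lines by eye before trusting the whole conversion.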
Locate the dataset discussed in the paper in the GEO database. Discuss the expression patterns for the 76 reported genes of interest. tip: use Google.
  Use Google to find the site. Once you have found the gene expression image for the gene of interest in the experiment of interest, note the URL carefully. Browse results.

Obtain the protein sequences (if available) for the 76 reported genes of interest. tip: use the gene identifiers or the Affymetrix numbers as a start.
  There are several ways to solve this. The list contains Affymetrix probeset IDs, database identifiers, and a protein description. We could search for every probeset ID at Ensembl and get the gene symbol from there, but we use the Affymetrix dataset instead, which contains both.
  Use both the Excel file with the Lancet data and the Affymetrix data file. Place the gene symbols in a new column of the Lancet data with the INDEX and MATCH formulas in Excel.
  Once we have the gene symbols, we can look for the corresponding human proteins at ExPASy, or look up all human gene symbols at once at NCBI. Protein sequences can also be extracted all at once from this site.
  Save the sequences in FASTA format and transform them to tab-delimited text in TextPad. List them all in a new Excel file, and merge the information with the Lancet list using the INDEX and MATCH functions in Excel. You will have to include RefSeq identifiers in Excel as well to put the correct sequences with the gene symbols in the table, again using the INDEX and MATCH functions.

Which diseases are related to the 76 listed proteins?
  Use gene symbols; look up a comma-delimited gene list in OMIM.

Search for pathway information in the Affymetrix annotation file, and compare it with the information reported in table 4 of the paper.
  Use the Affymetrix data file information, with the INDEX and MATCH functions.

Search for alignments and phylogenetic trees of mammalian proteins for the gene(s) involved with your disease of highest interest.
Edit the alignments to remove partial sequences and sequences that are not from mammals. Tip: try to locate the Treefam database and remember the URL pointing to your gene of interest.
  Use the Treefam URL in combination with gene symbols in Excel.

Search for the gene location on the human genome. tip: use the Affymetrix data file, or Ensembl. Which genes are located near it?
  Use the Ensembl URL in combination with gene symbols in Excel.

Search for transgenic mice for the listed 76 genes.
  Use the Deltagen Excel file in combination with gene symbols and the INDEX and MATCH formulas in Excel.

Search for available antibodies to measure protein abundance in your tissue bank.
  Use the Abcam URL in combination with gene symbols in Excel.

Search for available cDNA clones and other reagents (siRNA).
  Use the Invitrogen URL in combination with gene symbols in Excel.

Search for the most recent literature about this gene.
  Use the PubMed URL in combination with gene symbols in Excel.

Additional URLs of Interest

Expression of some GEO genes in some GEO datasets: http://www.ncbi.nlm.nih.gov/sites/entrez?db=geo&term="GDS2319"[ACCN]+alk
Abcam: http://www.abcam.com
Invitrogen: http://www.invitrogen.com
Treefam: http://www.treefam.org
Deltagen: http://www.deltagen.com

Other URLs can be found rather easily (by searching for them in Google) and are therefore not provided. Finding them and saving the information, in Excel or TextPad, is part of the skills trained in this workshop.
