MEDG520 Block 5

Bioinformatics Concepts:
Sensitivity vs. Specificity
Curated vs. Comprehensive databases
Supervised vs. Unsupervised Machine Learning Methods
   o How do the terms apply to microarray analysis?
Accessing Gene Information
   o Genome Browser
   o Gene Catalogs
Classification of Gene Orthologs
Assessing the Data
Are DNA sequences random?
What is the difference between a local and global alignment?
Over-Training
Cross-Validation versus Independent Test Sets
Assigned Papers
Profile Models
Assigned Papers
Web Resources
Sample Questions

Sensitivity vs. Specificity – Know the meaning of each term and how they are applied to discuss the performance of computational methods. Presented with specific examples, suggest which could be limiting the utility.

Sensitivity is the ability to correctly identify those who have the disease.
Specificity is the ability to correctly identify those who do not have the disease.
Ideally, a test should have 100% sensitivity and 100% specificity. In other words, the test always correctly identifies the disease state of the people tested.

                      Result of Screening
Disease State         Positive            Negative
Disease               True Positive       False Negative
No Disease            False Positive      True Negative

People who have the disease and test positive are the "true positives." People who do not have the disease and test negative are the "true negatives."

Sensitivity = (true positives / (true positives + false negatives)) x 100
The denominator is the total actually diseased. When the "false negatives" is a small number relative to the "true positives", sensitivity approaches 100%.

Specificity = (true negatives / (true negatives + false positives)) x 100
The denominator is the total actually not diseased. When the "false positives" is a small number relative to the "true negatives", specificity approaches 100%.

Curated vs. Comprehensive databases

Curated
All data reviewed by actual human biologists or trained curators.

Comprehensive
Involves application of algorithms to the genome. Not necessarily curated.

Supervised vs. Unsupervised Machine Learning Methods – Know the difference between the two. Presented with a specific example, classify the training method as supervised or unsupervised. Indicate how specific supervision could enhance performance.

Supervised
The process of building data mining models using a known dependent variable, also referred to as the target (the thing to be predicted). Classification techniques are supervised.
Example: Shipp et al (2002) used a supervised learning method to identify genes which will be predictive for lymphoma. A human being chose genes from the microarray data (8 to 16 genes were chosen) and then tested those genes' ability to accurately predict lymphoma (outcome?). This is the supervision.

Unsupervised
The process of building data mining models without the guidance (supervision) of a known, correct result. Clustering and association rules are unsupervised mining functions. Take data from the instrument and pass it straight to the algorithm.
Example: Any microarray study where a competitive hybridization is conducted on your array and you simply normalize the ratios and perform hierarchical clustering (for example) on the data to identify clusters of coexpressed genes.

How do the terms apply to microarray analysis?

Supervised analysis to determine genes that fit a predetermined pattern
Usually used to find genes with expression levels that are significantly different between groups of samples, or to find genes that accurately predict a characteristic of the sample.
Example: find a gene or genes that accurately distinguish one type of cancer from another, or a metastatic tumour from a non-metastatic one.
Two popular supervised techniques are nearest-neighbour analysis and support vector machines. Decision trees and neural networks are other examples.
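As a rough illustration of what the "supervision" amounts to in nearest-neighbour analysis, here is a minimal sketch. The expression values, the two-gene profiles and the function names are all made up for this example; it is not the method of any assigned paper. The key point is that every training sample carries a known label, and a new sample is given the label of its closest training sample.

```python
import math

# Hypothetical training set: (expression of gene 1, expression of gene 2)
# for each tumour sample, paired with its KNOWN class label.
# The known labels are what make this a supervised method.
TRAINING = [
    ((8.1, 1.2), "metastatic"),
    ((7.6, 0.9), "metastatic"),
    ((2.0, 6.5), "non-metastatic"),
    ((1.4, 7.1), "non-metastatic"),
]


def euclidean(a, b):
    """Straight-line distance between two expression profiles."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def predict(profile, training=TRAINING):
    """Return the label of the training sample closest to the profile."""
    _, label = min(training, key=lambda item: euclidean(profile, item[0]))
    return label


print(predict((7.0, 1.5)))  # metastatic
print(predict((1.8, 6.0)))  # non-metastatic
```

An unsupervised method would receive the same profiles without the labels and could only report that the samples fall into two groups.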
Unsupervised analysis to characterize the components of a data set without a priori input or knowledge of a training signal
Try to find internal structure or relationships in data without trying to predict some ‘correct answer’.
Three classes:
1. Feature determination
   o Look for genes with interesting patterns
   o Eg. Principal-components analysis
2. Cluster determination
   o Determine groups of genes with similar expression patterns
   o Eg. Nearest-neighbour clustering, self-organizing maps, k-means clustering, 2D hierarchical clustering
3. Network determination
   o Determine graphs representing gene-gene or gene-phenotype interactions.
   o Eg. Boolean networks, Bayesian networks, relevance networks

Accessing Gene Information - Know the difference between genome browsers and gene catalogs. Be able to suggest which would have greater utility for users with specific problems to solve.

Genome Browsers
Display sequences and annotations, alignments, etc. for all sequences in genomes sequenced to date.
Useful for:
   o Looking for new genes or potential genes of interest.
   o Understanding the gene ‘neighborhood’ for your gene of interest. Eg. proximal regulatory elements (promoters, enhancers, etc.), neighboring genes, intron-exon structure, SNPs, repeats, etc.
   o Can be used to visualize practically any feature of a gene or region of genome, including homologies to other sequences.
   o Allows assessment of quality and status of whole genome assembly and degree to which a genome is ‘finished’.
   o Can perform automated assembly and annotation (eg. gene prediction) of a genome as individual sequences are created and submitted to various public databases all over the world.
Examples
   UCSC Genome Browser (http://genome.ucsc.edu/cgi-bin/hgGateway)
   Ensembl Genome Browser (http://www.ensembl.org)
   Vista Genome Browser (http://pipeline.lbl.gov/)

Gene Catalogs
Databases containing information from various sources for each known gene. Can be curated (eg. GenBank) or comprehensive (eg. GeneCards).
Contains info like:
   o Official name
   o Synonyms
   o Gene IDs for other gene-based resources (eg. LocusLink).
   o Cytogenetic locus of the gene, genomic region, gene coordinates.
   o Name, functions, expression patterns of protein product.
   o Similarities to other sequences, proteins.
   o Involvement in diseases, medical applications.
   o Papers published on gene.
   o Sequences that gene information is based upon.
   o Etc.
   o Links to sources of all information above.
A way of organizing and presenting the sometimes overwhelming amount of information that is being produced for genes.
Examples
   NCBI – GenBank (http://www.ncbi.nlm.nih.gov/Genbank/index.html)
   GeneCards (http://bioinformatics.weizmann.ac.il/cards/)
   LocusLink (http://www.ncbi.nlm.nih.gov/LocusLink/)

Classification of Gene Orthologs

Homolog
Any member of a set of genes, DNA sequences or protein sequences whose nucleotide sequences show a high degree of one-to-one correspondence.

Ortholog
Homologous sequences in different species that arose from a common ancestral gene during speciation; may or may not be responsible for a similar function.

Paralog
Homologous sequences within a single species that arose by gene duplication.

Why do orthology studies tend to use protein sequence, not nucleotide?
More likely to be conserved.
Some species favour different codon structures or average base compositions. Example: species living in very hot climates might favour a base composition less prone to denaturation (high GC percent).

Be able to explain two examples of why one might need to know which genes are orthologous.
Stuart et al. (2003) recently demonstrated that by considering orthologous genes from several species as metagenes together with expression data you can focus in on gene coexpression networks more likely to represent biologically significant interactions (instead of just the statistically significant networks found with expression data alone).
Most studies of disease in animal models make use of orthologous genes.
   o In some cases, a gene will be identified in the disease model by linkage or association and then the orthologous gene in humans needs to be found to see if the study can be moved to humans.
   o In other cases, a gene is first linked to the disease in humans but then an animal model is needed for further study (eg. to test different therapies). Again, there must be a known ortholog in the animal model for it to be useful.
Many phylogenetic studies that determine how species are related and how they diverged depend on the degree of sequence similarity between orthologous genes and regulatory elements to draw conclusions.

Indicate limitations of ortholog-classification methods that are based only on BLAST comparisons.
Should consider more than just base or amino acid differences. Synonymous changes are less significant than non-synonymous, and conservative changes less significant than non-conservative.
Does not account for functions of “orthologs”. In many cases, an analysis will be based on the assumption that orthologs (determined by sequence homology) have the same function. But this is not necessarily the case. For example, you might look for regulatory motifs in the upstream region of orthologous genes on the assumption that genes with shared function are likely to share regulatory control. However, if one of the genes has assumed a new function over evolutionary time, then your analysis will be unsuccessful or misleading.

Assessing the Data – Given specific examples, explain how the available data for training an algorithm could limit performance.

One problem is overtraining resulting from small or biased data sets.
In some cases, there may be no consistent informative characteristics that can be used reliably as a predictor for the kind of test you are performing.
For a microarray analysis, it may be that there are no transcripts consistently differentially expressed between controls and cancer patients. Biological (eg. natural variations in transcript levels) and experimental variability (eg. differences in dye concentration) can be too great to detect the subtle differences required for the condition with statistical confidence.
For a trivial example, a method developed to predict patient survival based on microarray results for PSA-positive prostate tumors might not be relevant for analysis of expression data from PSA-negative tumors. PSA-positive tumors and PSA-negative tumors could represent very different expression environments. They could represent patients with very different lifestyles (eg. smokers vs non-smokers). Therefore, a model trained on only the one data set will not necessarily work for the other. The data must be representative of all possible situations so that predictive characteristics can be found that will work for any patient or patient group. Realistically, this is very difficult to achieve.

DNA as a random string of letters - Are DNA sequences random? - If not, how might the non-random character impact the performance of a specific software bioinformatics method?

DNA sequences are not random.
Many models and statistics assume a random and uniform distribution of the four bases. However, this is far from the true situation. Many regions are biased by certain motifs, repetitive elements, etc. The genome as a whole is generally GC-poor but contains some GC-rich regions (eg. CpG islands).
This could impact many software bioinformatics methods. Example: an algorithm that finds transcription factor binding sites using a binding matrix. If the matrix is for a GC-rich binding motif, the algorithm will return more false positives for GC-rich regions like CpG islands because the motif is more likely to occur by chance in these regions (and not represent a true binding site).

What is the difference between a local and global alignment?

Global alignment - An alignment that assumes that the two proteins are basically similar over the entire length of one another.
The alignment attempts to match them to each other from end to end, even though parts of the alignment are not very convincing.

Local alignment - An alignment that searches for segments of the two sequences that match well. There is no attempt to force entire sequences into an alignment, just those parts that appear to have good similarity, according to some criterion. Using the same sequences as above, one could get:
[example alignment not available]

Over-Training – What does it mean? Given an example, suggest how over-training could occur.

Using an unsupervised approach can lead to overtraining (what about supervised approaches?).
Usually overtraining results from using too small a training data set, and failure to cross-validate or test your model against an independent data set.
You can even overtrain to your independent data set.
   o Example: When the genome was first sequenced, everyone wrote gene-finding algorithms that were validated against the same independent data set of 150 genes that were not used to train their models. But, because everyone used the same independent data set (and competed to identify as many of them as possible), we ended up with a bunch of algorithms that were only good at finding genes with characteristics similar to those in the independent set.
Overtraining generally results when you apply a huge number of variables (eg. 15,000 genes in a microarray) to a small number of samples (eg. 50 lymphoma tissue samples). You are bound to find some genes which correlate with almost anything for this sample (eg. birthdays).

Cross-Validation versus Independent Test Sets - Know the difference between cross-validation and independent test sets for measuring the performance of bioinformatics methods. Be able to suggest how cross-validation could be problematic for specific examples.

Cross-validation
Used to estimate generalization error. A method for evaluating a statistical model or algorithm that has free parameters.
Divide the training data into several parts, and in turn use one part to test the procedure fitted to the remaining parts. It can be used for model selection or for parameter estimation when there are many parameters. It approximates predictive estimation. Jackknifing is a similar, but slightly different, technique.
Example: Take-one-away method. Remove one individual from the data set and use the remaining data to predict what the one taken away was. If we are using the expression profiles of 100 patients (50 with cancer and 50 controls), we would take one out, use the remaining 99 to assess which genes are predictive of cancer, and then see if we can correctly predict the state of the patient who was removed. Then repeat, taking a different one out of the data set until all 100 have been done.

Jackknifing
Similar to but not exactly the same as cross-validation. Both involve a leave-one-out method, but jackknifing is used to estimate the bias of the statistic instead of the generalization error.

Bootstrapping
A technique for simulating new data sets, to assess the robustness of a model or to produce a set of likely models. The new data sets are created by re-sampling with replacement from the original training set, so each datum may occur more than once. A way of getting a confidence level when you don’t have enough data for validation of your results.

Independent test sets
Any time that you train a model with a training set, you should validate your model with an independent test set. This will minimize the chance of overtraining.
Independent test sets are necessary to confirm that your model or algorithm has general biological validity beyond your training data.

Why do we need cross-validation and/or independent test sets?
The problem is numbers. Example: We have a small number of people with lymphoma but a large number of characteristics (10,000s of genes with expression measurements on the chip).
Therefore, you can expect to find by chance good predictive genes for one group of individuals that don’t work for another group.
Cross-validation is good but you should also validate against an independent data set.

Bayesian Statistics
Assigns a probability to a certain event based on existing knowledge.
Eg. PolyBayes looks at a multiple sequence alignment of chromatograms. Takes into account sequence reads and past knowledge of SNPs and sequencing errors (eg. A -> T is a common sequencing error in a polyA region). Uses the prior knowledge and calculates a probability of a SNP being real using Bayesian statistics.

Specific Methods from Class

Profile Models – Know how a matrix-type profile is used to generate a score for the match between a given sequence and a characteristic motif.

Given a SNP, how can we score the binding strength of a known transcription factor for each allele?
Take each sequence and pass it to a motif-scanning algorithm with the known TF’s matrix and a background model and compare the binding strength score returned. Essentially this just compares the TF matrix to each sequence, and whichever is a “better fit” gets a better score.

Describe a matrix profile for predicting transcription factor binding sites:

Are there tools that facilitate such studies?
There are two kinds of tools out there:
1. Motif-discovery tools
2. Motif-scanning tools – use known TFs
3. A third approach is to search for modules or clusters of motifs using either of the methods above.

What are the limitations of such predictions? Are predicted binding sites likely to be real?
A portion of them are, but there tend to be lots of false positives and/or missed sites.
The short length and degenerate nature of transcription-factor-binding sites accounts for most of these false-positives. Eg. The unambiguous sequence TATAA is expected once every 1,024 bp (4^5) by chance. For a mammalian genome of about 3 billion bp, we would thus predict roughly 3 million potential binding sites on one strand alone.
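To make the matrix scoring described in this section concrete, here is a minimal sketch of a motif scan applied to two alleles of a SNP. It uses the 6-position count matrix from Sample Question 1 and the same simple additive score (sum of the matrix entries for the observed bases). The two allele sequences and all function names are hypothetical, chosen only for illustration, and no background model is applied, so this is simpler than a real motif-scanning tool.

```python
# Count matrix from Sample Question 1: rows are bases,
# columns are motif positions 1 to 6.
MATRIX = {
    "A": [0, 0, 0, 0, 0, 9],
    "C": [7, 9, 3, 8, 0, 0],
    "G": [2, 0, 5, 1, 8, 0],
    "T": [0, 0, 1, 0, 1, 0],
}
MOTIF_LEN = 6
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def reverse_complement(seq):
    """Return the opposite strand, read 5' to 3'."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def score(site):
    """Additive score of one 6-base site against the matrix."""
    if len(site) != MOTIF_LEN:
        raise ValueError("site must be exactly 6 bases long")
    return sum(MATRIX[base][pos] for pos, base in enumerate(site))


def best_hit(seq):
    """Best-scoring window on either strand as (score, start, strand)."""
    best = None
    for start in range(len(seq) - MOTIF_LEN + 1):
        window = seq[start:start + MOTIF_LEN]
        for strand, site in (("+", window), ("-", reverse_complement(window))):
            s = score(site)
            if best is None or s > best[0]:
                best = (s, start, strand)
    return best


# Hypothetical alleles differing at a single base (index 5, C vs T).
ALLELE_1 = "TTCCGCGATT"
ALLELE_2 = "TTCCGTGATT"

print(best_hit(ALLELE_1))  # (46, 2, '+')
print(best_hit(ALLELE_2))  # (38, 2, '+')
```

In this toy case the first allele scores higher, so it would be predicted to be the better fit to the matrix. A log-odds score against a background base composition would be the more usual choice in practice, since raw counts take no account of how common each base is in the surrounding sequence.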
Most predicted binding sites in mammalian genome sequence are biologically non-functional.
Binding strength of a TF to a TFBS depends on more than just the structure of the TF and the sequence of the TFBS.
   o Chromatin imposes complex rules on TF access to a binding site.
   o TF binding may depend on the presence of other TFs, pH, temperature, chemical concentrations, cellular localization, etc.
A large number of binding sites are missed (false-negatives).
   o Only a fraction of TFs and TFBSs are known.

How can we reduce the number of false-positive predictions?
Phylogenetic footprinting – assume that important regulatory elements (like TFBSs) will be conserved across related species and look for binding sites only in highly conserved sequences.
Phylogenetic shadowing – multiple sequence comparisons are made between orthologous genes across short evolutionary distances, taking relationships into account. Ie. attempt to differentiate between sequences shared because of recent ancestry versus functional importance.
Bird (1995) suggests that the trouble is not turning genes on at the right time but keeping them from turning on at the wrong time. Therefore, if a binding sequence arises or is present where it shouldn’t be, evolution will favor non-conservative mutations at these positions. Conservative mutations will be more common in true binding sequences.
Genes shown to be coexpressed by gene expression studies are more likely to be coregulated. Therefore, regions around these genes are more likely to contain the same binding sequences.
Characterize more TFs and promoter regions and try to develop rules or commonalities. Eg. ~50% of promoters are found in association with a CpG island and 93% are associated with CpG islands of a certain length and CpG dinucleotide frequency.
Consider the sequence context. For the TATAA example, higher statistical scores can be assigned if it is found within 30 bp of a predicted transcription start site.
Nature of transcription factor binding can be considered. Eg. If it’s a dimer, you would be looking for two similar adjacent binding sites.
Look for clustered or composite binding sites.
Consider protein-protein interactions. Similar idea to looking in upstream regions of coexpressed genes. Proteins that interact are more likely to belong to the same pathway and to be coregulated by the same TFs and TFBSs.
Combine as many of the above as possible. Eg. rVista makes use of sequence conservation and motif clustering (modules) to accurately identify only the TFBSs most likely to be functional.
Develop statistical measures and thresholds which reliably separate false-positives from real motifs while minimizing false-negatives.

Fig: Using sequence conservation to help find TFBSs

Assigned Papers

Students should be prepared to cite specific examples (from the readings) that demonstrate important aspects of these problems.
Of the papers we read, the following are most likely to be relevant to the questions:

Shipp MA, et al. Diffuse large B-cell lymphoma outcome prediction by gene-expression profiling and supervised machine learning. Nat Med. 2002 Jan;8(1):68-74. PMID: 11786909
http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=PubMed&list_uids=11786909&dopt=Abstract
Developed a supervised learning prediction method to identify cured versus fatal or refractory lymphomas.
Using microarray data for 77 samples and 6817 genes, they found 30 predictor genes that could distinguish the 58 diffuse large B-cell lymphomas (DLBCL) from the 19 follicular lymphomas (FL) with a 91% performance rate as measured by a take-one-out cross-validation test.
   o Ie. 1 of 77 samples is withheld and the remaining 76 used to train a gene-expression based model (determine predictor genes) and predict the class of the withheld sample.
For the 58 DLBCL patients, a similar supervised learning approach was used to determine predictors for outcome.
Found 13 predictors that divided patients into two groups: those predicted to be cured and those predicted to have fatal/refractory disease. Patients predicted to be cured did have significantly improved survival.
The main concept here is the use of supervised learning. This is a pretty classic example. As stated in the definition of supervised learning, any kind of classification study is supervised.
They did attempt to validate their results against an independent data set (other published microarray data for DLBCL patients). But the two experiments did not have all the same genes on the microarray, and therefore their method requires additional validation to see if it will be applicable for other samples.

Cowles CR, Hirschhorn JN, Altshuler D, Lander ES. Detection of regulatory variation in mouse genes. Nat Genet. 2002 Nov;32(3):432-7. Epub 2002 Oct 15. PMID: 12410233
http://www.ncbi.nlm.nih.gov:80/entrez/query.fcgi?cmd=Retrieve&db=PubMed&list_uids=12410233&dopt=Abstract
Look at possible effects of variations (eg. SNPs) within cis-acting regulatory elements on gene expression.
   o A difficult challenge because it is often difficult even to recognize the regulatory regions of most genes, which can be thousands of bases away from the transcriptional unit.
   o Also difficult to predict which nucleotide changes in regulatory elements might have an effect on expression.
   o Impossible to tell if differences in expression between individuals are due to cis-acting regulatory regions, trans-acting factors, or environmental factors.
Compared expression of alleles from two mouse strains (A and B) in an F1 hybrid mouse (A X B) to control for trans effects and environmental influences (ie. assume same trans factors and environmental influences in littermates).
Distinguished between transcripts of A and B using another SNP marker (not the regulatory variant).
Compared the levels of the two alleles in genomic DNA and mRNA using PCR and RT-PCR and a DNA sequence detector.
Considered anything greater than a 60:40 ratio to be a significant difference in expression between the two alleles.
This method identifies transcripts whose expression is affected by variation in regulatory regions but cannot pinpoint for sure which variation is responsible.
Suggest there are probably many genes whose expression is affected by variations in regulatory elements.

Sachidanandam R, et al. A map of human genome sequence variation containing 1.42 million single nucleotide polymorphisms. Nature. 2001 Feb 15;409(6822):928-33. PMID: 11237013
Describe a map of 1.42 million SNPs distributed throughout the human genome (~1 SNP / 1.9 kb).
In the human population, most variant sites are rare, but the small number of common variants explain the bulk of heterozygosity.
It should therefore be possible to define common haplotypes using a dense set of polymorphic markers, and to evaluate each haplotype for association with disease.
Identified SNPs using sequences from the human genome sequencing project and algorithms like PolyBayes (see Bayesian statistics above).

Bosma PJ, Chowdhury JR, Bakker C, Gantla S, de Boer A, Oostra BA, Lindhout D, Tytgat GN, Jansen PL, Oude Elferink RP, et al. The genetic basis of the reduced expression of bilirubin UDP-glucuronosyltransferase 1 in Gilbert's syndrome. N Engl J Med. 1995 Nov 2;333(18):1171-5. PMID: 7565971
http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=PubMed&list_uids=7565971&dopt=Abstract
Proof of principle paper that changes in regulatory elements can change phenotype, leading to disease.
Sequenced the coding and promoter regions of the gene for bilirubin UDP-glucuronosyltransferase 1 (enzyme responsible for bilirubin glucuronidation) in 10 unrelated patients with Gilbert’s syndrome, 16 members of a kindred with a history of Crigler-Najjar syndrome, and 55 normal subjects.
People with Gilbert’s have mild, chronic unconjugated hyperbilirubinemia (jaundice).
Found that Gilbert’s patients carried a mutation for an extra TA in the TATAA element of the 5’ promoter region of the gene.
The presence of the longer TATAA element resulted in reduced expression and was proposed as a necessary factor for Gilbert’s syndrome.

Tatusov RL, et al. The COG database: an updated version includes eukaryotes. BMC Bioinformatics. 2003 Sep 11;4(1):41. PMID: 12969510
http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=PubMed&list_uids=12969510&dopt=Abstract
Describe the COG database and the addition of a eukaryotic version of COGs (KOGs).
COG = Clusters of Orthologous Groups of proteins.
A way of assigning function to proteins for newly sequenced genomes based on knowledge of orthologous genes in other species.
A COG is a 3-way reciprocal best match using a protein sequence comparison (blastp?).
What is the problem with this method of determining orthologs?
   o Will lose proteins that have a best reciprocal match with one of the other members of the COG but not the third.
   o Assignment to a COG could be based on a shared domain despite being a drastically different protein.

Web Resources:

Of those discussed in class, the following web resources are most likely to be referred to in the examination:

Gene Info
   UCSC Genome Browser: genome.ucsc.edu
   GeneCards: http://bioinfo.weizmann.ac.il/cards/
SNP Databases
   HGVbase: http://hgvbase.cgb.ki.se/
Genetics
   OMIM: http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=OMIM
   GeneTests: http://www.geneclinics.org/

Sample Questions

1. Given the specific matrix model for the binding sites of a TF (see below), assign a score to the sample sequence. Will predictions (high-scoring sequences) of potential binding sites be limited primarily by poor sensitivity or specificity? Why might predicted binding sites generated with this model occur more frequently near promoters (transcription start sites) in genes?
MATRIX
Position   1   2   3   4   5   6
A          0   0   0   0   0   9
C          7   9   3   8   0   0
G          2   0   5   1   8   0
T          0   0   1   0   1   0

Sample Sequence: 5’TGCCCG3’

Answer:

Given the specific matrix model for the binding sites of a TF (see above), assign a score to the sample sequence.
Highest scoring possible sequence would be: 5’CCGCGA3’
   Score = 7 + 9 + 5 + 8 + 8 + 9 = 46
Sample Sequence: 5’TGCCCG3’
   Score = 0 + 0 + 3 + 8 + 0 + 0 = 11
Reverse Complement: 5’CGGGCA3’
   Score = 7 + 0 + 5 + 1 + 0 + 9 = 22

Will predictions (high-scoring sequences) of potential binding sites be limited primarily by poor sensitivity or specificity?
Sensitivity here refers to the ability of the model to minimize false-negatives relative to true-positives. There will be poor sensitivity if a lot of actual binding sites are missed.
Specificity here refers to the ability of the model to minimize false-positives relative to true-negatives. There will be poor specificity if a lot of predicted sites are not actual binding sites.
It is likely that there will be both false-positives and false-negatives and thus less than perfect sensitivity and specificity. But specificity will likely be worse. This is because there will be a lot more false-positives than false-negatives.
For a motif of 6 bases, the probability of any one sequence is 1/4^6 (1 in 4,096). For a genome of 3 billion bases, this means you can expect any given 6-base sequence roughly 730,000 times. A binding motif probably represents several sequences that would be considered high scoring. Therefore, we can expect several million high-scoring motifs throughout the genome by chance. Only a tiny fraction of these will be true transcription factor binding sites. This is terrible specificity.
The nature of the matrix will have an effect. Matrices that are more specific (ie. specify fewer potential high-scoring sequences) will have less sensitivity (might miss a binding motif that is just a little too different from the matrix) but improved specificity (fewer false-positives). Conversely, if the matrix is too general (specifies numerous potential high-scoring sequences) the sensitivity will improve at the expense of specificity.

2. Dr. Zany reports that his advanced neural network algorithm is able to predict the birthdays of all cancer patients based on microarray expression profiles of tumor samples. Fifty tumor samples were analyzed and a full probabilistic Bayesian model was created. The model is based on only 15 genes from the 15,000 represented on the array. The performance (100% success) was assessed using leave-one-out cross-validation. Dr. Zany claims that the model is meaningful because the predictions are independent of tumor type. Why might one be concerned about this system? Explain your rationale.

Answer:

The problem with this method is that the model is highly overtrained.
Dr. Zany has applied a huge number of variables (15,000 genes) to a small number of samples (50). Out of 15,000 genes, there are bound to be some that correlate well with any other variable (eg. birthday) by chance.
The fact that the predictions are independent of tumour type seems meaningless to me.
The cross-validation step would give him an estimate of the general error within his sample but does not give much confidence that the predictions will be accurate outside his sample.
The method must be validated against an independent data set to see if the 15 genes are truly predictive of birthday.

3. John Smith has recently been informed that his child has a rare genetic disorder (Shucks Syndrome) that results in the inability to yawn. Using Google, he has identified the following resources that might contain information that will help him understand the syndrome. He is a molecular biologist and wishes to understand the mechanisms of the disease and whether his next child is likely to suffer from the same problems. He will be visiting a genetic counselor next week, but he wants to enter the conversation prepared.

Explain the intended uses of each of the following web resources: OMIM, hgvBASE, UCSC Genome Browser and GeneTests.

OMIM – Online Mendelian Inheritance in Man.
(http://www.ncbi.nlm.nih.gov/omim/)
This database is a catalog of human genes and genetic disorders.
For a given disorder, will provide: DESCRIPTION, CLINICAL FEATURES, INHERITANCE, CYTOGENETICS, MAPPING, MOLECULAR GENETICS, HETEROGENEITY, PATHOGENESIS, DIAGNOSIS, CLINICAL MANAGEMENT, POPULATION GENETICS, EVOLUTION, GENETIC VARIABILITY, ANIMAL MODEL, HISTORY, REFERENCES, CLINICAL SYNOPSIS
Very useful for learning about a genetic disease.

hgvBASE (http://hgvbase.cgb.ki.se/)
The objective of HGVbase (the Human Genome Variation Database) is to provide an accurate, high utility and ultimately fully comprehensive catalog of normal human gene and genome variation, useful as a research tool to help define the genetic component of human phenotypic variation.
All records are highly curated and annotated, ensuring maximal utility and data accuracy.
Can search by sequence, genome position, gene name/ID, variation ID, or keyword (eg. cystic fibrosis).
Useful if you want to know about all known variations for your gene of interest.

UCSC Genome Browser (http://genome.ucsc.edu/cgi-bin/hgGateway)
The UCSC Genome Browser provides a rapid and reliable display of any requested portion of genomes at any scale, together with dozens of aligned annotation tracks (known genes, predicted genes, ESTs, mRNAs, CpG islands, assembly gaps and coverage, chromosomal bands, mouse homologies, and more).
The user can look at a whole chromosome to get a feel for gene density, open a specific cytogenetic band to see a positionally mapped disease gene candidate, or zoom in to a particular gene to view its spliced ESTs and possible alternative splicing.
The Genome Browser itself does not draw conclusions; rather, it collates all relevant information in one location, leaving the exploration and interpretation to the user.
The Genome Browser supports text and sequence based searches that provide quick, precise access to any region of specific interest.
Secondary links from individual entries within annotation tracks lead to sequence details and supplementary off-site databases. Clicking on an individual item within a track opens a details page containing a summary of properties and links to off-site repositories such as PubMed, GenBank, LocusLink, and OMIM. This would not be very useful for someone trying to understand a disease for a practical purpose like visiting the doctor.

GeneTests (http://www.genetests.org/)
Provides current, authoritative information on genetic testing and its use in diagnosis, management, and genetic counseling. GeneTests promotes the appropriate use of genetic services in patient care and personal decision making. It is a medical genetics information resource for physicians and other healthcare providers. The site comprises:
o GeneReviews: Expert-authored, peer-reviewed, disease-specific reviews describing the application of genetic testing to the diagnosis, management, and genetic counseling of patients and families with hereditary disorders
o Laboratory Directory: A database of US and international laboratories performing genetic testing
o Clinic Directory: A database of US and international clinics providing genetic counseling
o Educational Materials: Basic information on the use of genetic services, teaching materials for genetics professionals, and an illustrated glossary of terms used in the GeneReviews
Can search by disease, gene, locus, product, feature, OMIM, author, titles, or text. Searching for a disease returns information on: Diagnosis, Clinical Description, Differential Diagnosis, Management, Genetic Counseling, and Molecular Genetics. An excellent source of information for someone who wants to prepare for a meeting with a genetic counselor, since the counselor might check the very same site.
GeneCards (http://bioinformatics.weizmann.ac.il/cards/)
Integrates a subset of the information stored in major data sources dealing with human genes and their products (with a major focus on medical aspects). A search for your disease of interest will return any known genes associated with it. If you know which gene you are interested in, GeneCards will provide information, or links to information, on: Aliases and Additional Descriptions, Chromosomal Location, Proteins, Protein Domains/Families/Ontologies, Sequences, Expression in Human Tissues, Similar Genes in Other Organisms, Related Human Genes, SNPs/Variants, Disorders & Mutations, Medical News, Research Articles, and links to many other databases (e.g. LocusLink, Ensembl, GeneLynx, etc). Useful to both researchers and laymen who know which gene they are interested in.

Which are most relevant to Mr. Smith? Why?
All of the above sources could potentially have some useful information for Mr. Smith. However, OMIM and GeneTests would be the most useful to him. These are the most disease-specific and contain the most general information relevant to someone affected by the disease.
o OMIM would give him information about the genetics of the disease and how it is inherited. This will help him to determine the chances of his next child inheriting the disease.
o GeneTests will give him an idea of what kind of things the counselor will tell him and give him a chance to look up any terminology he may not be familiar with.
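
The overtraining concern raised in the Dr. Zany answer above can be illustrated with a small simulation. The sketch below is not Dr. Zany's method. It makes several simplifying assumptions: the expression data are pure random noise, the birthday is replaced by a random two-class label, and a simple nearest-centroid classifier stands in for his model. All function names are hypothetical. It compares leave-one-out cross-validation when the 15 genes are picked once using all 50 samples against the same procedure with the gene selection repeated inside every fold.

```python
import numpy as np

def top_genes(X, y, k):
    """Return column indices of the k genes with the largest
    standardized difference in class means (illustrative score)."""
    a, b = X[y == 0], X[y == 1]
    score = np.abs(a.mean(axis=0) - b.mean(axis=0)) / (X.std(axis=0) + 1e-12)
    return np.argsort(score)[-k:]

def nearest_centroid_predict(X_train, y_train, x):
    """Predict the class whose training centroid is closest to x."""
    c0 = X_train[y_train == 0].mean(axis=0)
    c1 = X_train[y_train == 1].mean(axis=0)
    return int(np.sum((x - c1) ** 2) < np.sum((x - c0) ** 2))

def loo_accuracy(X, y, k, select_inside_fold):
    """Leave-one-out accuracy. If select_inside_fold is False, genes are
    chosen once from ALL samples, so the held-out sample leaks into the
    selection step."""
    n = len(y)
    genes_all = top_genes(X, y, k)
    hits = 0
    for i in range(n):
        train = np.arange(n) != i
        if select_inside_fold:
            genes = top_genes(X[train], y[train], k)
        else:
            genes = genes_all
        pred = nearest_centroid_predict(X[train][:, genes], y[train], X[i, genes])
        hits += int(pred == y[i])
    return hits / n

# Assumed setup: 50 samples, 15,000 genes of pure noise, random binary label.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 15000))
y = rng.permutation(np.repeat([0, 1], 25))

biased = loo_accuracy(X, y, k=15, select_inside_fold=False)
honest = loo_accuracy(X, y, k=15, select_inside_fold=True)
print(f"genes selected on all samples: LOO accuracy = {biased:.2f}")
print(f"genes selected inside each fold: LOO accuracy = {honest:.2f}")
```

Because the data contain no signal, any accuracy well above chance in the first case comes from selecting the genes with knowledge of the held-out sample. Selecting genes inside each fold should typically bring the estimate back toward chance, and an independent test set remains the safest check.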