This material is from the IPA (www.ingenuity.com) online help manual.

Topics included are:
1. Introduction to IPA
2. Input file templates
3. Data Filters
4. Networks Relationships
5. Ratio Calculations for Pathways
6. Interpreting Functional Analysis Results
7. Filtered Datasets/ Enriched Datasets
8. Reviewing Mapped Genes
9. FAQs about Functional Analysis Statistical Calculations

1. Introduction to IPA

IPA is built upon a huge foundation of scientific evidence, manually curated from hundreds of thousands of journal articles, textbooks, and other data sources. IPA acts as the gateway to this vast amount of biological and chemical information and presents these data in a meaningful, visual way. IPA allows the researcher to explore molecular, chemical, gene, protein and miRNA interactions, create custom molecular pathways, and view and modify metabolic, signaling, and toxicological canonical pathways, each with underlying experimental literature evidence. To aid interpretation, in addition to the networks and pathways that can be created, IPA can layer additional information onto them, such as drugs, disease genes, expression data, cellular functions and processes, or a researcher's own genes or chemicals of interest.

Entry Points into IPA

There are three general ways of getting started with IPA:
• Starting with terms related to a therapeutic area, disease or function for additional research, study or experimental designs
• Starting with a small number of genes or chemicals (1-300), including drugs, such as a list of drugs for repurposing, understanding the context of a drug target, or candidate genes from SNP genotyping analysis
• Starting with a large dataset, typically thousands of data points, such as microarray or other high-throughput screening data

Regardless of where you start, IPA results will provide a better understanding of the biological context of your data or area of interest.
Analysis Type: The Question it Addresses

Core Analysis: Allows you to interpret large and small datasets in the context of biological processes, pathways and molecular networks.

Core Comparison Analysis: Allows you to analyze changes in biological states across experimental conditions. First run a Core Analysis on each of your datasets that represent multiple treatments. Then use Core Comparison Analysis to understand which biological processes and/or diseases are relevant to each condition.

IPA-Metabolomics(TM) Analysis: Provides you with a way of analyzing metabolite data to learn more about cell physiology and metabolism.

IPA-Metabolomics Comparison Analysis: Allows you to analyze changes in biological states across experimental conditions. First run a Metabolomics Analysis on each of your datasets that represent multiple treatments. Then use Metabolomics Comparison Analysis to understand which biological processes and/or diseases are relevant to each condition.

IPA-Tox(TM) Analysis: Allows you to assess the toxicity and safety of compounds of interest early in the development process. Tox Analysis rapidly displays the relevant toxicity phenotypes and clinical pathology endpoints associated with a dataset.

IPA-Tox Comparison Analysis: Allows you to analyze changes in relevant toxicity phenotypes and clinical pathology endpoints across observations. First run a Tox Analysis on your multiple observations, then use Tox Comparison Analysis to understand which tox functions and/or pathways are relevant to each timepoint or dose.

IPA-Biomarker(TM) Analysis: Allows you to identify and prioritize the most relevant and promising molecular biomarker candidates from datasets from nearly any step of the drug discovery process or disease research. Use the Biomarker filters to prioritize molecular biomarker candidates based on contextual information such as mechanistic connection to diseases or detection in bodily fluids. Then use the Biomarker Comparison Analysis to identify biomarker candidates that discriminate between or are common to a disease state and/or drug response.

IPA-Biomarker Comparison Analysis: Identifies biomarker candidates that discriminate between or are common to a disease state and/or drug response. First run the Biomarker Filter to prioritize molecular biomarker candidates based on contextual information such as mechanistic connection to diseases or detection in bodily fluids, then use the Biomarker Comparison Analysis to identify biomarker candidates that discriminate between multiple samples.

If you have a large dataset to analyze, involving several hundred to tens of thousands of molecules, you will want to run an analysis appropriate to your interests: Core Analysis, IPA-Tox Analysis, IPA-Metabolomics Analysis or IPA-Biomarker Analysis (see Types of Analyses). These analyses begin with uploading your data, followed by setting analysis criteria, running the analysis, and interpreting the results. The results show how your dataset overlaps with molecules that are associated with various diseases and cellular functions and that are part of canonical pathways. These results often give a good indication of which cellular processes your dataset is related to, and will often lead to further investigation of these relationships by building custom networks/pathways. In addition, you can view molecular networks that show how the significant molecules in your dataset are known to interact with one another and with other closely interacting molecules.

The other general entry point applies if you are interested in learning more about a small set of genes, chemicals/drugs, or diseases.
You can use IPA as a visual literature search tool to help you understand and stay current on biological interactions and relationships relevant to your projects. These analyses generally start with a small list of genes, proteins or chemicals (1-300), or one or more search terms for cellular functions or diseases.

Types of Analyses and Key Features

IPA Feature: The Question it Addresses

Networks: What regulatory relationships exist between the genes/proteins in my dataset?
Functions: Which biological and disease processes are most relevant to my genes of interest?
Canonical Pathways: Which well-characterized cell signaling and metabolic pathways are most relevant in my data?
Search: How do I query Ingenuity's knowledge base about specific genes, processes, diseases, drugs, families, or subcellular locations?
My Pathways and My Lists: How can I build a library of biological models that I can also use in analyzing expression data?
Path Designer: How can I modify my pathways for publications and presentations?
Compare: Which molecules are unique or in common between more than one IPA entity? Perform set analyses using the Compare feature.
Sharing and Collaborating: How can I communicate and share my results with my colleagues and collaborators?
Exporting and Reporting: How can I publish my results from IPA?
IPA Integration Module: Can I seamlessly transition between my internal application or database into IPA?

2. Input file templates

Formatting Your Data using IPA Templates

One method of inputting your data into IPA is to use one of the IPA templates. This article covers using IPA templates for data upload. If you are using the IPA Flexible Format, see the help article for that format instead.

The available templates are:
• Single Observation Format (File Format A)
• Multiple Observation Format (File Format B)
• Basic (Legacy) Format (used prior to IPA 3.0)

Each row in the dataset file represents a single gene or protein identifier. An optional header row may be included.
A 65,000 row and 18MB file size limit exists on dataset files uploaded into the application. All columns shown in the dataset file templates are required, although some of the columns may be left empty. Data in the Identifier column is required so that the application can accurately identify and map the molecules from your dataset file to the corresponding entries in Ingenuity's knowledge base. Data in the other columns (Expression Value, Absent and Override) is optional and is used to indicate which molecules in your dataset file are of most interest for the analysis and thus should be selected as Network Eligible or Functions/Pathways Eligible molecules.

Ingenuity File Format A

This template should be used for datasets consisting of a single observation. Each identifier may be assigned up to three different expression value types.

Download Format A Template (Note: if the downloaded file does not display properly, right-click on the link instead and choose the "Save Target As" option to save the file to your computer)

Example: This Ingenuity File Format A example dataset uses GenBank IDs as the Gene ID and Normalized Ratios as the expression value type. Only one expression value type is entered for each identifier, so the second and third Expression Value columns are empty. The Absent and Override columns are also empty.

Ingenuity File Format B

This template should be used for datasets consisting of multiple (up to ten) observations. Each identifier may be assigned up to three different expression value types for each observation.

Download Format B Template

Example: This Ingenuity File Format B example dataset uses GenBank IDs as the Gene ID and Ratios as the expression value type.
Expression values from two observations are shown. Only one expression value type is entered for each identifier, so the second and third Expression Value columns are empty for each observation. The Absent and Override columns are also empty for each observation.

Ingenuity Basic Format (Legacy)

This is the basic file format supported in versions of the application prior to the IPA Summer '04 release. If you used this format in previous versions of the application, you may continue to do so. See below for an illustration of a correctly formatted example dataset and a link to download the Basic Format Template.

Download Basic Format Template

Example: This Ingenuity Basic Format example dataset file uses Affymetrix (Affy) identifiers and p-values as the expression value type. The Absent and Override columns are empty.

3. Data Filters

Species Filter

By default, all available species are selected for analysis, meaning that IPA does not automatically limit the analysis by species. The Stringent filter is a highly constrained filter that returns only those molecules and relationships that are very relevant to the selected species; because the filter is highly constrained, it may not return many results. Specifically, the Stringent filter matches molecules that have an ortholog in the selected species. In addition, relationships must involve two molecules that match the selected species, and the location of the relationship must be in the selected species. In contrast, the Relaxed filter is less constrained.
Therefore, the molecules and relationships that it matches may be less targeted, but there may be more results. Specifically, the Relaxed filter matches molecules that have an ortholog in the selected species. It filters on the orthologs, but not on the relationships between the orthologs.

Tissues and Cell Lines Filter

The Tissues & Cell Lines tab contains a set of tissues and cell lines from which to choose, plus two filtering choices: Stringent and Relaxed. By default, all tissues and cell lines are selected for analysis, meaning that IPA does not automatically limit the analysis by tissue or cell line. The Stringent filter is a highly constrained filter that returns only those molecules and relationships that are expressed in the selected tissues or cell lines. Since the filter is highly constrained, it may not return many results. In contrast, the Relaxed filter is less constrained. Therefore, the molecules and relationships that it matches may be less targeted, but there may be more results. Specifically, the Relaxed filter matches only on molecules: it returns molecules that are expressed in the selected tissues or cell lines.

4. Networks Relationships

Lines that connect two molecules represent relationships. Any two molecules that bind, act upon one another, or are involved with each other in any other manner are considered to have a relationship between them. Each relationship between molecules is created using scientific information contained in Ingenuity's knowledge base. In Network Explorer, My Pathways, or Neighborhood Explorer, relationships are shown as lines or arrows between molecules. Arrows indicate the directionality of the relationship, such that an arrow from molecule A to B indicates that molecule A acts upon B. The lines used to depict relationships are described in the Legend.
Relationship Labels: You can hover over a relationship using your mouse to highlight it and display a simple label that designates the general type of relationship that exists between the two molecules. Relationship labels may also be turned on across the entire network by adjusting the Relationship label settings. See Network Explorer Preferences for more details. The following key lists the relationship labels:

A: Activation
B: Binding
C: Causes/Leads to
CC: Chemical-Chemical interaction
CP: Chemical-Protein interaction
E: Expression (includes metabolism/synthesis for chemicals)
EC: Enzyme Catalysis
I: Inhibition
L: ProteoLysis (includes degradation for Chemicals)
LO: Localization
M: Biochemical Modification
MB: Group/complex Membership
P: Phosphorylation/Dephosphorylation
PD: Protein-DNA binding
PP: Protein-Protein binding
PR: Protein-RNA binding
RB: Regulation of Binding
RE: Reaction
T: Transcription
TR: Translocation

You can also view the number of citations supporting the relationship in the network diagram by selecting this as an option under your Network Explorer Preferences.

Relationship Types: The various arrow shapes represent different types of interactions.
[Figure: key of arrow shapes and the relationship types they represent]

Data Source

Protein-Protein Interaction and MicroRNA Database Imports

The following databases have been imported into Ingenuity's knowledge base and can be used in the Analysis Parameters and in the Build tools found in My Pathways and Path Designer. Please note that these databases can be selected in addition to (but not instead of) Ingenuity's knowledge base when creating analyses and using the Build tools.

ARGONAUTE 2
The Argonaute 2 Database is a comprehensive database on mammalian microRNAs and their known or predicted regulatory targets.
It provides information on the origin of miRNAs, the tissue specificity of their expression, their known or proposed functions and their potential target genes, as well as data on miRNA families based on their coexpression and on proteins known to be involved in miRNA processing. Currently, IPA only includes the target genes for microRNA records from Argonaute 2.
http://www.ma.uni-heidelberg.de/apps/zmf/argonaute/

BIND
The Biomolecular Interaction Network Database (BIND) is a collection of records documenting molecular interactions. A BIND record represents an interaction between two or more objects that is believed to occur in a living organism. A biological object can be a protein, DNA, RNA, ligand, molecular complex, or gene. BIND records are created for interactions which have been shown experimentally and published in at least one peer-reviewed journal. A record also references any papers with experimental evidence that support or dispute the associated interaction. Interactions are the basic units of BIND and can be linked together to form molecular complexes or pathways.
http://bond.unleashedinformatics.com/Action?pg=23299#BIND

BioGRID
The Biological General Repository for Interaction Datasets (BioGRID) database was developed to house and distribute collections of protein and genetic interactions from major model organism species. BioGRID currently contains over 198,000 interactions from six different species, as derived from both high-throughput studies and conventional focused studies. IPA currently only includes interactions from human, mouse and rat.
http://www.thebiogrid.org

Cognia
Cognia is a database of molecular interactions manually curated from the scientific literature on the topic of the ubiquitin system. It is a comprehensive resource on key regulatory proteins of the ubiquitin system, their attributes and interactions.
http://www.ncbi.nlm.nih.gov/pubmed/12645912?ordinalpos=2&itool=EntrezSystem2.PEntrez.Pubmed.Pubmed_ResultsPanel.Pubmed_DefaultReportPanel.Pubmed_RVDocSum

DIP
The DIP database (Database of Interacting Proteins) catalogs experimentally determined interactions between proteins. It combines information from a variety of sources to create a single, consistent set of protein-protein interactions. The data stored within the DIP database were curated both manually by expert curators and automatically using computational approaches that utilize knowledge about protein-protein interaction networks extracted from the most reliable, core subset of the DIP data.
http://dip.doe-mbi.ucla.edu/

IntAct
IntAct provides a freely available, open source database system and analysis tools for protein interaction data. All interactions are derived from literature curation or direct user submissions and are freely available.
http://www.ebi.ac.uk/intact/site/index.jsf

Interactome
The 'Interactome studies' option in the data source filter gathers individual published protein interaction studies that use high-throughput methods (e.g. Yeast-2-Hybrid) for detecting a large number of protein-protein interactions. Because the methods used for large-scale detection of protein-protein interactions are not as stringent as the methods used to validate individual protein-protein interactions, these findings are kept separate so that the user can filter them out if necessary. Under 'Interactome studies', IPA currently includes the study below:
'Towards a proteome-scale map of the human protein-protein interaction network' Rual et al. Nature 2005 Oct 20;437(7062):1173-8 (PMID: 16189514)
http://www.ncbi.nlm.nih.gov/pubmed/16189514?dopt=Abstract

MINT
The Molecular INTeraction database (MINT) focuses on experimentally verified protein-protein interactions mined from the scientific literature by expert curators. IPA only includes molecular interactions from human, mouse and rat.
However, the full MINT dataset can be freely downloaded.
http://mint.bio.uniroma2.it/mint/Welcome.do

MIPS
The Munich Information Center for Protein Sequences (MIPS) mammalian protein-protein interaction database is a collection of manually curated high-quality PPI data collected from the scientific literature by expert curators. Only data from individually performed experiments are included since they usually provide the most reliable evidence for physical interactions.
http://mips.gsf.de/proj/ppi/

5. Ratio Calculations for Pathways

The ratio is calculated as follows: the number of molecules in a given pathway that meet the cutoff criteria, divided by the total number of molecules that make up that pathway.

Example: If you specify a cutoff of 2.0 for fold-change, then any molecule that is either upregulated more than two-fold or downregulated more than two-fold will meet the cutoff. If 10 molecules meet the cutoff for that pathway and 100 total molecules form that pathway, the resulting ratio is 10/100 = 0.1.

Ratio Calculations for Metabolic Pathways

After we construct a pathway, we map all EC numbers to EntrezGene names. In order for an EC number to be mapped in IPA, there has to be an entry for the corresponding gene in EntrezGene that is from human, mouse, or rat. If a molecule cannot be mapped to any human, mouse, or rat gene, it is not considered in the ratio calculation. For any given pathway, you can check which molecules map and which ones do not by going to the canonical pathways library from the Project Manager. Two different colors are used for the molecules: gray and white. Gray molecules can be mapped to human, mouse or rat. White molecules cannot be mapped and are therefore excluded from the ratio calculations.

What is the difference between the significance (p-value) and ratio in the Canonical Pathways bar chart? Which one is the best to use?
The ratio gives you a good idea of the percentage of genes in a pathway that were also found in your uploaded list. The ratio is therefore good for looking at which pathway has been affected the most, based on the percentage of its genes that were uploaded into IPA. The significance (p-value) is better suited to asking, "Is there an association between a specific pathway and my uploaded dataset, or is the overlap due to chance?" The null hypothesis is that there is no association. If a p-value is very small, you can be confident that the pathway is associated with the uploaded dataset. This may be an indication that certain pathways are more likely to explain the phenotype that is observed. Neither the significance nor the ratio tells you how these genes are associated with the Canonical Pathway (i.e. you must look at the function of the affected genes in the Canonical Pathway to determine whether the pathway is up- or down-regulated).

In the end it is a matter of percentage vs. probability. The ratio gives you the amount of association; the significance gives the confidence of association. For example, if a pathway has a high ratio (percentage) and a very low p-value, the pathway is probably associated with the data and a large portion of the pathway may be involved or affected. These pathways may be the most likely candidates for an explanation of the observed phenotype.

6. Interpreting Functional Analysis Results

Significance in Functional Analysis for a Dataset

The significance value associated with Functional Analysis for a dataset is a measure of the likelihood that the association between a set of Functional Analysis molecules in your experiment and a given process or pathway is due to random chance. The smaller the p-value, the less likely it is that the association is random and the more significant the association. In general, p-values less than 0.05 indicate a statistically significant, nonrandom association. The p-value is calculated using the right-tailed Fisher Exact Test.
In this method, the p-value for a given function is calculated by considering
1) the number of functional analysis molecules that participate in that function and
2) the total number of molecules that are known to be associated with that function in Ingenuity's knowledge base.

The more functional analysis molecules that are involved, the more likely the association is not due to random chance, and thus the more significant the p-value. Similarly, the larger the total number of molecules known to be associated with the process, the greater the likelihood that an association is due to random chance, and the p-value accordingly becomes less significant. In short, the p-value identifies statistically significant over-representation of functional analysis molecules in a given process. Over-represented functional or pathway processes are processes which have more focus molecules than expected by chance ("right-tailed").

For example:

Experiment #1
5 functional analysis molecules related to hematopoiesis
50 molecules related to hematopoiesis total in Ingenuity's knowledge base

Experiment #2
5 functional analysis molecules related to hematopoiesis
10 molecules related to hematopoiesis total in Ingenuity's knowledge base

In this case the p-value for the hematopoiesis process in Experiment #2 would be more significant (i.e. a smaller value) than the p-value for hematopoiesis in Experiment #1, because of the greater over-representation of the functional analysis molecules in the relevant set of molecules related to hematopoiesis.

How to Use the p-values IPA Calculates

We suggest you use the p-values and scores calculated in IPA as starting points for further investigation and as rough guides for helping you identify significant processes or pathways being affected in your experiment. In some cases, you will need to explore the supporting evidence to understand the full biological implications of significant results.
In other cases, there may be results with large p-values (>0.05) that you will find compelling upon further investigation. This can occur in the case of Canonical Pathways, where the involvement of just one molecule with a canonical pathway may be biologically interesting, even if it is not statistically significant.

Interpreting Functional Analysis for Comparisons

Functional Analysis can be used to quickly gain an overview of the effect of different drug treatments or time courses on a variety of functions. But how do you interpret such results?

The significance calculated for each function returned in Functional Analysis is a measurement of the likelihood that the function is associated with the dataset by random chance. On the y-axis of the diagram, the significance is expressed as the negative logarithm of the p-value calculated for each function (that is, the negative of its exponent), so taller bars are more significant than shorter bars.

To determine whether and to what extent a given function is affected from one observation to another within a comparison, you can start by comparing the extent to which the significances change from one observation to another. For example, if the significance of a function changes from one treatment to the next, then it is likely that the treatment had an impact on the function under investigation. Note, however, that functions whose significance values exhibit little or no change from one observation to another may still be changing and should also be investigated further. Specifically, it is possible that the molecules, and therefore the underlying biology, in one observation are different from those in another observation even though the significance values remain unchanged.

When interpreting your results, it is important to keep in mind that the significance values you are seeing refer to the High Level Functions rather than to individual functions. If a High Level Function contains two or more specific functions, a range of significances is displayed.
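The ratio from section 5 and the right-tailed Fisher Exact Test from this section can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not IPA's implementation: the manual gives neither the number of functional analysis molecules in the dataset nor the size of the reference set, so the values 100 and 20000 below are invented for the example, and the function names are hypothetical. The right-tailed test is computed here as the upper tail of the hypergeometric distribution.

```python
from math import comb, log10


def pathway_ratio(n_meeting_cutoff, n_in_pathway):
    """Molecules in the pathway that meet the cutoff / all molecules in the pathway."""
    return n_meeting_cutoff / n_in_pathway


def right_tailed_fisher(k, n_dataset, n_function, n_reference):
    """P(overlap >= k) when n_dataset molecules are drawn at random from a
    reference set of n_reference molecules, n_function of which are annotated
    to the function (upper tail of the hypergeometric distribution)."""
    total = comb(n_reference, n_dataset)
    upper = min(n_dataset, n_function)
    tail = sum(
        comb(n_function, i) * comb(n_reference - n_function, n_dataset - i)
        for i in range(k, upper + 1)
    )
    return tail / total


# Section 5 example: 10 of 100 pathway molecules meet the cutoff.
ratio = pathway_ratio(10, 100)

# Hematopoiesis example. ASSUMED values (not in the manual): 100 functional
# analysis molecules in the dataset, 20000 molecules in the reference set.
N_DATASET, N_REFERENCE = 100, 20000
p_exp1 = right_tailed_fisher(5, N_DATASET, 50, N_REFERENCE)  # 5 of 50
p_exp2 = right_tailed_fisher(5, N_DATASET, 10, N_REFERENCE)  # 5 of 10

# Bar height on the y-axis: negative log10 of the p-value.
bar_exp1 = -log10(p_exp1)
bar_exp2 = -log10(p_exp2)

print(f"ratio = {ratio}")
print(f"Experiment #1: p = {p_exp1:.3g}, bar height = {bar_exp1:.2f}")
print(f"Experiment #2: p = {p_exp2:.3g}, bar height = {bar_exp2:.2f}")
```

With these assumed sizes, Experiment #2 yields the smaller p-value and therefore the taller bar, in line with the explanation above. Different assumed dataset or reference sizes change the absolute p-values, which is one reason to treat them as rough guides.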
7. Filtered Datasets/ Enriched Datasets

Enriched Datasets are enriched by Ingenuity's Knowledge Base to provide specific details such as subcellular location, functional gene family and association with drugs, for enhanced exploration and understanding. Enriched Datasets enable more data interactivity and flexibility when using IPA, particularly in Compare, Grow Out, and Overlay. To get even more value out of your datasets when using them with these features, apply filters and cutoffs to your dataset to limit your experimental observations to the most relevant information.

Using Filtered Datasets in Overlay, Grow Out and Compare

Overlay Datasets

Filtered Datasets can be used to overlay dataset values onto Networks, My Pathways, Canonical Pathways, and in Path Designer. Using Filtered Datasets allows you to visualize your experimental data with your selected filter and cutoff values on a pathway or network without having to run an analysis. Use this functionality to quickly see how your experiment affects a well-characterized Signaling or Metabolic pathway from the Ingenuity Pathway Library, or how genes in your experiment support a hypothesis you built using IPA's Search and Explore functionality. When you overlay a Filtered Dataset onto a pathway, pathway nodes will be colored according to the experimental values in your dataset that meet your filter criteria.

Grow from Datasets

Filtered Datasets can be used in Grow to add molecules to Networks and Pathways. Using Filtered Datasets allows you to restrict the molecules and relationships used to expand networks and pathways to those molecules in your dataset that meet the criteria you set, without having to run an analysis.
Use this functionality to determine which molecules from your experiment may regulate a well-characterized Signaling or Metabolic pathway from the Ingenuity Pathway Library, or how molecules in your experiment are associated with a hypothesis you built around a particular biological function or disease. When you grow out from a Filtered Dataset, the molecules added to the pathway are those from your dataset that meet your filter criteria. If no molecules are added, your filter criteria may be too stringent; you can go back to the Dataset Filter by double-clicking on your Filtered Dataset in the Project Manager to adjust the settings.

Compare Datasets

Filtered Datasets can be used in Compare to compare molecules from different experimental observations. Comparing Filtered Datasets allows you to apply filter criteria to identify the common and unique molecules across your experiments and quickly discern trends in your data. When comparing Filtered Datasets, only those molecules from your dataset that meet your filter criteria will be used in the comparison.

Creating an Analysis from a Filtered Multi-observation Dataset

In addition to using filtered datasets in Overlay, Grow, and Compare, you can use filtered datasets for analysis. When filtering a dataset for use in analysis, you can apply contextual filters, such as species, disease, or molecule type, and/or an expression value cutoff. Those molecules that meet the filter criteria are available for use in the analysis.

When you apply an expression value cutoff to a multi-observation dataset and then use that dataset for analysis, IPA identifies all molecules across the dataset observations that meet the filter criteria. The union of those molecules then becomes the input to the analysis. For example, if you filter a dataset with a cutoff value of 1.0 and a molecule has an expression value of 0.99 for one observation and an expression value of 1.1 for another, that molecule will be included as an input for the analysis.
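The union rule in the example above can be sketched as follows. This is a minimal illustration, not IPA's code: the molecule names and the function name are hypothetical, and treating "meets the cutoff" as "value greater than or equal to the cutoff" is an assumption (for a p-value type, the comparison would presumably run the other way).

```python
def union_meeting_cutoff(dataset, cutoff):
    """Return the molecules whose value meets the cutoff in AT LEAST ONE
    observation (the union across observations).

    dataset maps a molecule name to its expression values, one per observation.
    ASSUMPTION: a value meets the cutoff when value >= cutoff.
    """
    return {
        molecule
        for molecule, values in dataset.items()
        if any(value >= cutoff for value in values)
    }


# Hypothetical two-observation dataset.
dataset = {
    "MOL_A": [0.99, 1.1],  # passes in the second observation only
    "MOL_B": [0.5, 0.7],   # passes in neither observation
    "MOL_C": [2.3, 1.8],   # passes in both observations
}

analysis_input = union_meeting_cutoff(dataset, cutoff=1.0)
print(sorted(analysis_input))  # prints ['MOL_A', 'MOL_C']
```

MOL_A corresponds to the molecule in the example: it is kept as analysis input because one of its two observations meets the cutoff.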
You can then apply an expression value cutoff and/or additional contextual filters as part of the Create Analysis process. These filters will be applied to your dataset across its multiple observations and may adjust the number of molecules that are Network or Functions/Pathway/List eligible. Extending the above example, if you apply a cutoff value of 1.0 as part of Create Analysis, the molecule with an expression value of 0.99 will not be included as a Focus Gene for that particular observation, but will be included for the observation where the expression value is 1.1.

Once you are satisfied with the analysis conditions, click on Run Analysis to name and run the analysis.

Creating Filtered Datasets

To create a Filtered Dataset, click on the Filter Dataset link from the Quick Start menu and select or upload a dataset. On the Filter Dataset page, select the filters most relevant to you and your experiment.

Species: Filter for genes that exist in a particular species. By selecting any item in this filter, you are specifying that you are interested in genes that exist in the selected species.

Tissues and Cell Lines: Filter for genes expressed in a particular tissue or cell line. By selecting any item in this filter, you are specifying that you are interested in genes expressed in the selected tissue(s) and/or cell line(s).

Biofluid: Filter for proteins detectable in a particular bodily fluid. By selecting any item in this filter, you are specifying that you are only interested in proteins that are detectable in the selected bodily fluid(s).

Diseases: Filter for genes associated with a particular disease. By selecting any item in this filter, you are specifying that you are interested in genes associated with the selected disease(s).

Molecules: Filter for specific molecule types in your dataset, such as kinases or transcription regulators. By selecting any item in this filter, you are specifying that you are only interested in the selected type(s) of molecule.
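When items are selected in more than one of these filters, the selections combine as an OR within each filter and an AND across filters. A minimal sketch of that logic follows; the annotation dictionaries, molecule names and function name are hypothetical, and treating a filter with no selection as "not applied" is an assumption.

```python
def passes_filters(annotations, selections):
    """Return True if a molecule satisfies every filter that has a selection.

    annotations: filter name -> set of values annotated for this molecule
    selections:  filter name -> set of items the user selected
    Within one filter, matching ANY selected item is enough (OR);
    every filter with a selection must be matched (AND).
    """
    for filter_name, selected in selections.items():
        if not selected:
            continue  # ASSUMPTION: an empty selection does not restrict
        if not (annotations.get(filter_name, set()) & selected):
            return False
    return True


# Selecting blood, saliva and human: [blood OR saliva] AND [human].
selections = {"biofluid": {"blood", "saliva"}, "species": {"human"}}

# Hypothetical annotated molecules.
molecules = {
    "MOL_1": {"biofluid": {"blood"}, "species": {"human", "mouse"}},
    "MOL_2": {"biofluid": {"urine"}, "species": {"human"}},
    "MOL_3": {"biofluid": {"saliva"}, "species": {"mouse"}},
}

kept = [name for name, ann in molecules.items() if passes_filters(ann, selections)]
print(kept)  # prints ['MOL_1']
```

Only MOL_1 is kept: MOL_2 fails the biofluid filter and MOL_3 fails the species filter.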
NOTE: You may select multiple items within each filter and across the different filters. The filter runs an "OR" operation within each filter and an "AND" operation across the filters. For example, selecting blood, saliva and human will use those genes that are detectable in [blood or saliva] and exist in [human]. A summary of your filter selection is provided on the right side of the page, so you can easily see which filters you have selected for application to your dataset.

In addition to selecting filters to apply to your dataset, you can also define the Expression Value Cutoff. For each expression value type in your dataset, enter a cutoff value to indicate at what expression level or significance value the molecules become important to you. If all the molecules are important to you, enter a cutoff value that includes all identifiers (for example, p-value = 1.0 or Fold Change = 1).

HINT: When using multiple expression value types and cutoff values, IPA uses the "AND" function for calculating the number of molecules eligible for the Dataset Filter. Molecules must meet both cutoff values to be considered for the analysis.

When you have finished selecting filters and applying cutoffs, click the Recalculate button to view the number of molecules eligible for the Dataset Filter. These are the molecules that satisfy the filter criteria you set; you can see them by clicking the Filter Eligible tab that appears on this page.

• If no molecules are eligible, review the parameters set for the Dataset Filter and adjust the settings by de-selecting filters or decreasing the stringency of the expression cutoff values.

When you are satisfied with your filter and cutoff settings, click the Save button to save your settings to this dataset. This closes the Filter Dataset page and takes you back to the Project Manager.

8. Reviewing Mapped Genes

The following lists of genes are generated after mapping is complete.
Mapped IDs are identifiers that were successfully mapped to a molecule in Ingenuity's knowledge base. Duplicate identifiers are mapped to a single molecule. Mapping of external identifiers to molecules in Ingenuity's knowledge base is performed using information in Homologene.

Unmapped IDs are identifiers that were not mapped to a molecule in Ingenuity's knowledge base. Unmapped molecules may fall into one of the following categories:
1) The gene/protein ID does not correspond to a known gene product. For example, most ESTs are not found in the knowledge base (exception: ESTs that have a corresponding Entrez Gene identifier are in the knowledge base).
2) There are insufficient findings in the literature regarding this molecule.
3) Findings for this molecule have not been entered in Ingenuity's knowledge base.
4) The gene/protein ID is one of a small percentage of GenBank IDs that correspond to several loci and several genes, and thus to several Entrez Gene IDs. Such identifiers are left unmapped in the application because their identity is ambiguous.

If you have an identifier that is not being mapped in IPA and you think it should be, please contact Customer Support at 650-381-5111 or Support@ingenuity.com. We can research why individual identifiers are not mapped. Please tell us the identifier in question, the identifier type you are using, the molecule you think it should map to, and the molecule that it is mapping to in IPA (if you think it is a mismapping).

All IDs are all of the identifiers contained in the dataset file.

Network Eligible Molecules are the molecules that are eligible for network generation. In order for a molecule to become a Network Focus molecule, it must meet two criteria:
1) It must meet all of the criteria you specified in your analysis parameters (i.e. it must meet the cutoff value, focus on, and directionality).
2) There must be at least one other molecule in Ingenuity's knowledge base that interacts with it.
Functions/Pathways Molecules are the molecules eligible for functional analysis. In order for a molecule to become a Functions/Pathway Molecule, it must meet two criteria:
1) It must meet all of the criteria you specified in your analysis parameters (i.e. it must meet the cutoff value, focus on, and directionality).
2) There must be at least one functional annotation (function, pathway or list) associated with this molecule in Ingenuity's knowledge base.

NOTES:
1) A molecule can be both a Network Eligible molecule and a Functions/Pathway Eligible molecule if it meets all three criteria specified above.
2) You can change the number of Network Eligible Molecules or Functions/Pathways Eligible Molecules by changing the analysis parameters.

9. FAQs about Functional Analysis Statistical Calculations

To help you understand how Ingenuity Pathways Analysis calculates the statistical values displayed in Functions and Pathways, here are answers to some frequently asked questions.

How are the significances/p-values for Functions and Pathways in IPA calculated?

The significance value associated with Functional Analysis for a dataset is a measure of the likelihood that the association between a set of Functional Analysis genes in your experiment and a given process or pathway is due to random chance. The smaller the p-value, the less likely it is that the association is random and the more significant the association. In general, p-values less than 0.05 indicate a statistically significant, nonrandom association. The p-value associated with a biological process or pathway annotation is a measure of its statistical significance with respect to the Functions/Pathways/Lists Eligible molecules for the dataset and a Reference Set of molecules (which defines the molecules that could possibly have been Functions/Pathways/Lists Eligible). The p-value is calculated with the right-tailed Fisher's Exact Test.
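As a sketch of what a right-tailed Fisher's Exact Test computes, the p-value can be written under the standard hypergeometric formulation. This is an assumption for illustration; this manual does not spell out IPA's internal computation. The symbols are introduced here only: N is the number of molecules in the Reference Set, K the number of those associated with the annotation, n the number of Functions/Pathways/Lists Eligible molecules, and k the number of eligible molecules that participate in the annotation.

```latex
p \;=\; P(X \ge k) \;=\; \sum_{i=k}^{\min(n,\,K)} \frac{\binom{K}{i}\,\binom{N-K}{\,n-i\,}}{\binom{N}{n}}
```

Each term is the probability of finding exactly i annotated molecules when n molecules are picked at random, without replacement, from the Reference Set. Summing from k upward gives the "right tail", that is, the probability of an overlap at least as large as the one observed.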
In this method, the p-value for a given function is calculated by considering:
1) The number of Functions/Pathways/Lists Eligible molecules that participate in that annotation
2) The total number of knowledge base molecules known to be associated with that function
3) The total number of Functions/Pathways/Lists Eligible molecules
4) The total number of genes in the Reference Set

In the right-tailed Fisher's Exact Test, only over-represented functions or pathways (those that have more Functions/Pathways/Lists Eligible molecules than expected by chance) are significant. Under-represented functions or pathways ("left-tailed" p-values), which have significantly fewer molecules than expected by chance, are not shown.

Why do we use the Fisher's Exact Test instead of some other types of p-value calculations?

The type of p-value calculation depends on the statistical null model (i.e. the "random" model) that is used for assessing significance. In the case of functional analysis (where we have a set of N molecules and ask whether this set is significantly enriched in molecules with a particular annotation), the random model corresponds to picking the N molecules at random. The assumption of this null model leads to Fisher's Exact Test. Other null models are also plausible (and sometimes preferred), such as permuting the identities of annotations or molecules, which maintains the annotation tree structure but is computationally very expensive. Fisher's Exact Test is computationally less expensive and widely used.

What factors influence the size of the p-value in Functional Analysis for a dataset?

While the number of Functions/Pathways/Lists Eligible molecules associated with a given function/pathway is an important measure when calculating the p-value for Functional Analyses, the p-value does not depend only on this number.
The p-value for a given function is calculated by considering:
1) The number of Functions/Pathways/Lists Eligible molecules that participate in that annotation
2) The total number of knowledge base molecules known to be associated with that function
3) The total number of Functions/Pathways/Lists Eligible molecules
4) The total number of molecules in the Reference Set

Why are the significance/p-value calculations in the application not based on a binomial distribution?

The proper null hypothesis for statistical testing needs to reflect the constraint that a particular molecule can appear in a given set only once. This constraint would be violated if the binomial distribution were used. The difference between the hypergeometric and binomial distributions is that the hypergeometric calculates probabilities without replacement, whereas the binomial assumes replacement. Since each molecule can only be used once in each p-value calculation, no replacement should be considered, so the binomial distribution cannot be used.

Are Bonferroni corrections used for significance/p-value calculations in Ingenuity Pathways Analysis?

No, we do not apply a Bonferroni correction, which is one of several ways to correct for testing the same data against multiple hypotheses. The Bonferroni correction is widely viewed as overcorrecting and as leading to a high false negative rate. This would cause too many functions to fall above the p-value threshold of 0.05 and thus not be shown. We provide an option for multiple hypothesis correction based on the Benjamini-Hochberg approach. Applying any multiple-testing correction does not change the order of the annotations sorted by their significance, but it might equalize p-values for some functions. You should have the highest confidence in annotations with the smallest p-values, and can discount annotations with relatively higher p-values.

What is the Benjamini-Hochberg method of multiple testing correction?
This calculation returns adjusted p-values and enables you to control the noise in certain Functional Analysis and Canonical Pathway results. This corrected p-value can be interpreted as an upper bound for the expected fraction of false positives. For example, if the threshold is 0.01, you can expect that the fraction of false positives among the significant functions is less than 1%. The formula for the Benjamini-Hochberg method of multiple testing correction is:

$$ p_i^{\mathrm{BH}} = \min_{k \ge i} \left( \frac{m}{k}\, p_k \right) $$

where $p_1 \le p_2 \le \dots \le p_m$ is the ordered sequence of non-corrected p-values.

When should I use the Benjamini-Hochberg multiple testing correction?

1. If you ask for the significance of a particular function or pathway in relation to your dataset, then the uncorrected p-value (the Fisher's Exact Test p-value) is appropriate. It measures how likely the observed result would be if the association were just random. If this is very unlikely (i.e. the p-value is below the threshold), then the function is said to be significant.
2. If you have a set of functions or pathways and ask for all significant functions within this set, the Benjamini-Hochberg corrected p-value is the more appropriate measure. In this case the threshold p-value tells you the maximum number of false positives (i.e. functions falsely identified as being significant) you can expect among the significant functions. For example, if your p-value threshold is 0.01 and there are 100 significant functions with a p-value below this threshold, you can expect that, on average, at most one of them was falsely identified as being significant. The threshold p-value corresponds to the false discovery rate, which is 1% in this example.

Why did we choose the Benjamini-Hochberg multiple testing correction over other methods?

We chose to offer B-H multiple-testing corrected p-values because they are widely used, straightforward to implement, and computationally inexpensive.
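To make the "straightforward to implement" point concrete, here is a minimal Python sketch of the standard Benjamini-Hochberg adjustment. It is an illustration under stated assumptions, not IPA's code: the function name and the sample p-values are hypothetical. Each p-value is scaled by the number of tests divided by its rank, and a running minimum is taken from the largest p-value downward so that adjusted values never reverse the original ordering.

```python
# Hypothetical sketch of the standard Benjamini-Hochberg adjustment;
# the function name and example values are illustrative, not from IPA.

def benjamini_hochberg(p_values):
    """Return B-H adjusted p-values in the same order as the input."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value (rank m) down to the smallest (rank 1).
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


raw = [0.01, 0.04, 0.03, 0.005]
adj = benjamini_hochberg(raw)
print(adj)  # approximately [0.02, 0.04, 0.04, 0.02]
```

In this example the two smallest raw p-values receive the same adjusted value, as do the two largest, which illustrates how a correction can equalize p-values without changing their order.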
A downside is that these corrected p-values slightly over-correct (though far less than Bonferroni), because they involve the estimate of an upper bound of the false discovery rate (FDR) that corresponds to the case where the null hypothesis is true for all tests. Furthermore, the null distributions for the different tests are assumed to be independent.

For Canonical Pathways, does the p-value depend on the size of the pathway?

Yes. For example, let us compare a 50% overlap of a small canonical pathway (with 10 Functions/Pathways/Lists Eligible molecules and 20 molecules in the pathway) to a 50% overlap of a large pathway (with 50 Functions/Pathways/Lists Eligible molecules and 100 molecules in the pathway). The p-value is more significant when the relative proportion of Functions/Pathways/Lists Eligible molecules in the pathway is greater. For example, if there were 200 such molecules in the dataset, the 50/100 pathway would have a greater proportion of them than the 10/20 pathway (50/200 vs. 10/200), and this would lead to a lower p-value for the 50/100 pathway. On the other hand, the p-value is less significant when the proportion of Reference Set molecules in the pathway is greater. The 50/100 gene pathway has a greater proportion than the 10/20 gene pathway, so this would contribute to a larger p-value for the 50/100 gene pathway. Since these two considerations rarely cancel each other, the p-value will generally change when the size of the canonical pathway changes.

How does a customized microarray that focuses on a particular pathway or disease affect the p-values in Canonical Pathways?

The p-values are calculated using the Reference Set that is defined in the Analysis Components settings within the Create Analysis page. When the identifiers assayed are known, include them in the input file and use a cutoff to indicate the molecules of interest.
You should try to use your dataset as the reference set, unless it is so small that IPA is unable to calculate p-values for Functions/Pathways/Lists. The resulting p-values indicate how significant the molecule overlap is with each annotation, considering both the molecules assayed and the input molecules that met the cutoff. If you have a molecule list or do not know which molecules were assayed, the default reference set should be one of the Ingenuity knowledge base reference sets (genes only, endogenous chemicals only, or both genes and endogenous chemicals), which include all functionally-characterized molecules. When Ingenuity's knowledge base reference sets are used, the p-value answers the question "What biological annotations are significantly associated with my input molecules relative to all functionally-characterized mammalian molecules?"

What is the difference between the significance (p-value) and ratio in the Canonical Pathways bar chart? Which one is the best to use?

The ratio is calculated by taking the number of genes from your dataset that participate in a Canonical Pathway and dividing it by the total number of genes in that Canonical Pathway. The ratio indicates the percentage of genes in a pathway that were also found in your uploaded gene list (or Functions/Pathways/Lists Eligible genes if a cutoff was specified). The ratio is therefore useful for determining which pathways overlap the most with the genes in your dataset.

The p-value measures how likely the observed association between a specific pathway and the dataset would be if it were only due to random chance, by also considering the total number of Functions/Pathways/Lists Eligible genes in your dataset and the Reference Set of genes (those which potentially could be significant in your dataset). If a p-value is very small, you can be confident that the corresponding pathway is significantly associated with the uploaded dataset.
This can be read as an indication that this pathway is more likely to explain the observed phenotype than others. Neither the significance nor the ratio tells you how these genes are associated with the Canonical Pathway (i.e. you must look at the function of the affected genes in the Canonical Pathway to determine whether the pathway is up- or down-regulated).

The ratio indicates the strength of the association, whereas the p-value measures its statistical significance. If all genes in the genome were Network Eligible, then even though every annotation would have a ratio of one (100% overlap), none of the annotations would be significant. If a pathway has a high ratio (percentage overlap with Functions/Pathways/Lists Eligible genes) and a very small p-value, the pathway is probably associated with the data and a large portion of the pathway may be involved or affected. These pathways may be the most likely candidates for an explanation of the observed phenotype.

What is the "threshold" line on the Functional Analysis and Canonical Pathways analysis bar charts?

The threshold line that appears in the bar chart represents a p-value of 0.05. For reference, Canonical Pathways are included in the bar chart even if they are not statistically significant (i.e. their bars fall below the line). The threshold line can be set to a different level or removed by clicking the Customize Chart button.

How should I use the p-values (and Network Scores) appropriately?

We suggest you use the p-values and scores calculated in IPA as starting points for further investigation and as rough guides for identifying significant processes or pathways affected in your experiment. In some cases, you will need to explore the supporting evidence to understand the full biological implications of significant results. In other cases, there may be results with insignificant p-values (>0.05) that you will nevertheless find compelling upon further investigation.
This can occur with Canonical Pathways, where the participation of just one Network Eligible gene in a canonical pathway may be biologically interesting, even if it is not statistically significant.
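The distinction between ratio and p-value discussed in this FAQ can be illustrated with a small Python sketch. It assumes the standard hypergeometric form of the right-tailed Fisher's Exact Test; the function names and all of the counts (a Reference Set of 1000 molecules, a pathway of 20 molecules) are hypothetical and chosen only to show the effect, not taken from IPA.

```python
# Hypothetical sketch: function names and counts are illustrative, not from IPA.
from math import comb


def right_tailed_fisher(k, K, n, N):
    """Probability of an overlap of at least k when n molecules are drawn
    without replacement from a reference set of N, of which K are annotated."""
    upper = min(n, K)
    total = comb(N, n)
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return tail / total


def ratio(k, K):
    """Fraction of the pathway's molecules that are found in the dataset."""
    return k / K


N, K = 1000, 20  # assumed reference set size and pathway size

# Case A: 100 eligible molecules, 10 of them in the pathway.
ratio_a, p_a = ratio(10, K), right_tailed_fisher(10, K, 100, N)

# Case B: every reference molecule is eligible, so the whole pathway overlaps.
ratio_b, p_b = ratio(20, K), right_tailed_fisher(20, K, N, N)  # ratio 1.0, p-value 1.0

print(ratio_a, p_a)
print(ratio_b, p_b)
```

Case B has the higher ratio but no statistical significance, because a complete overlap is guaranteed when every molecule in the Reference Set is eligible. Case A has a lower ratio but a small p-value, because an overlap of 10 is far above what random selection would be expected to produce.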