This material is from IPA (www.ingenuity.com) online
help manual
Topics included are:
1. Introduction to IPA
2. Input file templates
3. Data Filters
4. Networks Relationships
5. Ratio Calculations for Pathways
6. Interpreting Functional Analysis Results
7. Filtered Datasets/ Enriched Datasets
8. Reviewing Mapped Genes
9. FAQs about Functional Analysis Statistical
Calculations
1. Introduction to IPA
IPA is built upon a huge foundation of scientific evidence, manually curated from
hundreds of thousands of journal articles, textbooks, and other data sources. IPA acts as
the gateway to this vast amount of biological and chemical information and presents
these data in a meaningful, visual and knowledgeable way. IPA allows the researcher to
explore molecular chemical, gene, protein and miRNA interactions, create custom
molecular pathways, view and modify metabolic, signaling, and toxicological canonical
pathways, each with underlying experimental literature evidence. To further
interpretation, in addition to the networks and pathways that can be created, IPA can
provide multiple layering of additional information, such as drugs, disease genes,
expression data, cellular functions and processes, or a researcher’s own genes or
chemicals of interest.
Entry Points into IPA
There are three general ways of getting started with IPA:
• Starting with terms related to a therapeutic area, disease or function for additional
research, study or experimental designs
• Starting with a small number of genes or chemicals (1-300), including drugs, such as
a list of drugs for repurposing, understanding the context of a drug target, or candidate
genes from SNP genotyping analysis
• Starting with a large dataset, typically thousands of data points, such as microarray, or
other high-throughput screening
Regardless of where you start, IPA results will provide a better understanding of the
biological context of your data or area of interest.
Analysis Types and the Questions They Address
Core Analysis: Core Analysis allows you to interpret large and small datasets in the
context of biological processes, pathways and molecular networks.
Core Comparison Analysis: Core Comparison allows you to analyze changes in biological
states across experimental conditions. First run a Core Analysis on each of your datasets
that represent multiple treatments. Then use Core Comparison Analysis to understand
which biological processes and/or diseases are relevant to each condition.
IPA-Metabolomics(TM) Analysis: The Metabolomics solution provides you with a way of
analyzing metabolite data to learn more about cell physiology and metabolism.
IPA-Metabolomics Comparison Analysis: Metabolomics Comparison allows you to analyze
changes in biological states across experimental conditions. First run a Metabolomics
Analysis on each of your datasets that represent multiple treatments. Then use
Metabolomics Comparison Analysis to understand which biological processes and/or
diseases are relevant to each condition.
IPA-Tox(TM) Analysis: The Tox solution allows you to assess the toxicity and safety of
compounds of interest early in the development process. Tox Analysis rapidly displays the
relevant toxicity phenotypes and clinical pathology endpoints associated with a dataset.
IPA-Tox Comparison Analysis: Tox Comparison allows you to analyze changes in relevant
toxicity phenotypes and clinical pathology endpoints across observations. First run a Tox
Analysis on your multiple observations, then use Tox Comparison Analysis to understand
which tox functions and/or pathways are relevant to each timepoint or dose.
IPA-Biomarker(TM) Analysis: The Biomarker Analysis solution allows you to identify and
prioritize the most relevant and promising molecular biomarker candidates from datasets
from nearly any step of the drug discovery process or disease research. Use the
Biomarker filters to prioritize molecular biomarker candidates based on contextual
information such as mechanistic connection to diseases or detection in bodily fluids.
Then, use the Biomarker Comparison Analysis to identify biomarker candidates that
discriminate between or are common to a disease state and/or drug response.
IPA-Biomarker Comparison Analysis: The Biomarker Comparison Analysis identifies
biomarker candidates that discriminate between or are common to a disease state and/or
drug response. First run the Biomarker Filter to prioritize molecular biomarker
candidates based on contextual information such as mechanistic connection to diseases
or detection in bodily fluids, then use the Biomarker Comparison Analysis to identify
biomarker candidates that discriminate between multiple samples.
If you have a large dataset to analyze, involving several hundred to tens of thousands of
molecules, you will want to run an analysis appropriate to your interests, such as a
Core Analysis, IPA-Tox Analysis, IPA-Metabolomics Analysis or IPA-Biomarker Analysis (see
Types of Analyses). These analyses begin with upload of your data, setting analysis
criteria, running the analysis, and interpreting the results. The results provide
information on how your dataset overlaps with molecules that are associated with various
diseases and cellular functions or that are part of canonical pathways. These results often give a
good indication of what cellular processes your data set is related to and will often lead to
further investigation of these relationships by building custom networks/pathways. In
addition, you can view molecular networks that show how the significant molecules in
your dataset are known to interact with one another and other closely interacting
molecules.
The other general entry point applies if you are interested in learning more about a small
set of genes, chemicals/drugs, or diseases. You can use IPA as a visual literature search
tool to help you understand and stay current in biological interactions and relationships
relevant to your projects. These analyses generally start with a small list of genes,
proteins or chemicals (1-300), or one or more search terms for cellular functions or diseases.
Types of Analyses and Key Features
Each IPA feature is listed with the question it addresses.
Networks: What regulatory relationships exist between the genes/proteins in my dataset?
Functions: Which biological and disease processes are most relevant to my genes of
interest?
Canonical Pathways: Which well-characterized cell signaling and metabolic pathways are
most relevant in my data?
Search: How do I query Ingenuity's knowledge base about specific genes, processes,
diseases, drugs, families, or subcellular locations?
My Pathways and My Lists: How can I build a library of biological models that I can also
use in analyzing expression data?
Path Designer: How can I modify my pathways for publications and presentations?
Compare: Which molecules are unique or in common between more than one IPA entity?
Perform set analyses using the Compare feature.
Sharing and Collaborating: How can I communicate and share my results with my colleagues
and collaborators?
Exporting and Reporting: How can I publish my results from IPA?
IPA Integration Module: Can I seamlessly transition between my internal application or
database into IPA?
2. Input file templates
Formatting Your Data using IPA Templates
One method of inputting your data into IPA is to use one of the IPA templates. This
article covers using IPA templates for data upload; the IPA Flexible Format is covered in
a separate article.
Single Observation Format (File Format A)


Multiple Observation Format (File Format B)
Basic (Legacy) Format (used prior to IPA 3.0)
Each row in the dataset file represents a single gene or protein identifier. An optional
header row may be included. A 65,000 row and 18MB file size limit exists on dataset
files uploaded into the application.
All columns, as shown in the dataset file templates, are required, although some of the
columns may be left empty. Data in the Identifier column is required so that the
application can accurately identify and map the molecules from your dataset file to the
corresponding entries in Ingenuity's knowledge base. Data in the other columns
(Expression Value, Absent and Override) are optional and are used to
indicate which molecules in your dataset file are of most interest for the analysis and thus
should be selected as Network Eligible or Functions/ Pathways Eligible molecules.
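The row and file size limits above can be checked before upload. The following is a minimal sketch, not part of IPA: the function names are hypothetical, "18MB" is read as 18 x 1024 x 1024 bytes, and every non-empty line (including an optional header) is counted as a row, which is the conservative reading.

```python
import os

MAX_ROWS = 65_000               # row limit stated in the text
MAX_BYTES = 18 * 1024 * 1024    # "18MB", read here as 18 MiB (an assumption)


def limit_problems(row_count, byte_size):
    """Return a list of limit violations; an empty list means the file fits."""
    problems = []
    if row_count > MAX_ROWS:
        problems.append(f"{row_count} rows exceeds the {MAX_ROWS}-row limit")
    if byte_size > MAX_BYTES:
        problems.append(f"{byte_size} bytes exceeds the {MAX_BYTES}-byte limit")
    return problems


def check_dataset_file(path):
    """Count non-empty lines (header included, to be conservative) and check limits."""
    with open(path, "rb") as handle:
        row_count = sum(1 for line in handle if line.strip())
    return limit_problems(row_count, os.path.getsize(path))
```

Calling check_dataset_file on a dataset file returns an empty list when the file is within both limits, and one message per exceeded limit otherwise.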
Ingenuity File Format A
This template should be used for datasets consisting of a single observation. Each
identifier may be assigned up to three different expression value types.
Download Format A Template (Note: if the downloaded file does not display properly,
right-click on the link instead and choose the "Save Target As" option to save the file to
your computer)
Example:
This Ingenuity File Format A example dataset uses GenBank IDs as the Gene ID and
Normalized Ratios as the expression value type. Only one expression value type is
entered for each identifier, so the second and third Expression Value columns are empty.
The Absent and Override columns are also empty.
NOTE: There is a 65,000 row and 18MB file size limit on dataset files uploaded into the
application.
Ingenuity File Format B
This template should be used for datasets consisting of multiple (up to ten) observations.
Each identifier may be assigned up to three different expression value types for each
observation.
Download Format B Template (Note: if the downloaded file does not display properly,
right-click on the link instead and choose the "Save Target As" option to save the file to
your computer)
Example:
This Ingenuity File Format B example dataset uses GenBank IDs as the Gene ID and
Ratios as the expression value type. Expression values from two observations are shown.
Only one expression value type is entered for each identifier, so the second and third
Expression Value columns are empty for each observation. The Absent and Override
columns are also empty for each observation.
NOTE: There is a 65,000 row and 18MB file size limit on dataset files uploaded into the
application.
Ingenuity Basic Format (Legacy)
This is the basic file format supported in versions of the application prior to the IPA
Summer '04 release. If you used this format in previous versions of the application, you
may continue to do so. See below for an illustration of a correctly formatted example
dataset and a link to download the Basic Format Template.
Download Basic Format Template (Note: if the downloaded file does not display
properly, right-click on the link instead and choose the "Save Target As" option to save
the file to your computer)
Example:
This Ingenuity Basic Format example dataset file uses Affymetrix (Affy) identifiers and
p-values as the expression value type. The Absent and Override columns are empty.
NOTE: There is a 65,000 row and 18MB file size limit on dataset files uploaded into the
application.
3. Data Filters
Species Filter
As a default, all species available are selected for analysis, indicating that IPA does not
automatically limit the analysis.
The Stringent filter is a highly constrained filter that will return only those molecules and
relationships that are very relevant to the selected species; however, because the filter is
highly constrained, it may not return many results. To be specific, the Stringent filter will
match molecules that contain an ortholog matching the selected species. In addition,
relationships must involve two molecules that match the selected species, and the
location of the relationship must be in the selected species.
In contrast, the Relaxed filter is less constrained. Therefore, the molecules and
relationships that it matches may be less targeted, but there may be more results. To be
specific, the Relaxed filter will match molecules that contain an ortholog that includes the
selected species. It filters on the orthologs, but not the relationships between the
orthologs.
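The difference between the two species filters can be sketched in code. This is only an illustration of one reading of the description above; the Molecule and Relationship classes and the filter_relationships function are hypothetical and do not reflect how IPA stores or filters its data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Molecule:
    name: str
    ortholog_species: frozenset  # species in which an ortholog is known


@dataclass(frozen=True)
class Relationship:
    source: Molecule
    target: Molecule
    observed_in: str  # species in which the relationship was reported


def matches(molecule, selected):
    """A molecule matches when one of its orthologs is in a selected species."""
    return bool(molecule.ortholog_species & selected)


def filter_relationships(relationships, selected, stringent):
    """Keep relationships under the Stringent or Relaxed reading sketched here."""
    kept = []
    for rel in relationships:
        if not (matches(rel.source, selected) and matches(rel.target, selected)):
            continue  # both modes filter on the molecules' orthologs
        if stringent and rel.observed_in not in selected:
            continue  # Stringent also requires the relationship's own species to match
        kept.append(rel)
    return kept
```

With the same selection, the Relaxed mode returns every relationship the Stringent mode returns, plus those reported in a species that was not selected.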
Tissues and Cell Lines Filter
The Tissues & Cell Lines tab contains a set of tissues and cell lines from which to choose
plus two filtering choices: Stringent and Relaxed. As a default, all tissues and cell lines
are selected for analysis, indicating that IPA does not automatically limit the analysis.
The Stringent filter is a highly constrained filter that will return only those molecules and
relationships that are expressed in the selected tissues or cell lines. Since the filter is
highly constrained, it may not return many results.
In contrast, the Relaxed filter is less constrained. Therefore, the molecules and
relationships that it matches may be less targeted, but there may be more results. To be
specific, the Relaxed filter will match only molecules that are expressed in the selected
tissues or cell lines.
4. Networks Relationships
Lines that connect two molecules represent relationships. Thus any two molecules that
bind, act upon one another, or that are involved with each other in any other manner
would be considered to possess a relationship between them. Each relationship between
molecules is created using scientific information contained in Ingenuity's knowledge
base.
In Network Explorer, My Pathways, or Neighborhood Explorer relationships are shown
as lines or arrows between molecules. Arrows indicate the directionality of the
relationship, such that an arrow from molecule A to B would indicate that molecule A
acts upon B. The lines used to depict relationships are described in the Legend.
Relationship Labels: You can hover over a relationship using your mouse to highlight
it and display a simple label to designate the general type of relationship that exists
between the two molecules. Relationship labels may also be turned on across the entire
network by adjusting the Relationship label settings. See Network Explorer Preferences
for more details.
The following is a key that lists the identification of relationship labels:
A Activation
B Binding
C Causes/Leads to
CC Chemical-Chemical interaction
CP Chemical-Protein interaction
E Expression (includes metabolism/ synthesis for chemicals)
EC Enzyme Catalysis
I Inhibition
L ProteoLysis (includes degradation for Chemicals)
LO Localization
M Biochemical Modification
MB Group/complex Membership
P Phosphorylation/Dephosphorylation
PD Protein-DNA binding
PP Protein-Protein binding
PR Protein-RNA binding
RB Regulation of Binding
RE Reaction
T Transcription
TR Translocation
You can also view the number of citations supporting the relationship in the network
diagram by selecting this as an option under your Network Explorer Preferences.
Relationship Types: The various arrow shapes represent different types of interactions,
as shown in the key below:
Data Source
Protein-Protein Interaction and MicroRNA Database Imports
The following databases have been imported into Ingenuity's knowledge base and can be
used in the Analysis Parameters and in the Build tools found in My Pathways and Path
Designer. Please note that these databases can be selected in addition to (but not instead
of) Ingenuity's knowledge base when creating analyses and using the Build tools.
ARGONAUTE 2
The Argonaute 2 Database is a comprehensive database on mammalian microRNAs and
their known or predicted regulatory targets. It provides information on origin of miRNAs,
tissue specificity of their expressions and their known or proposed functions, their
potential target genes as well as data on miRNA families based on their coexpression and
proteins known to be involved in miRNA processing. Currently, IPA only includes the
target genes for microRNAs records from Argonaute 2.
http://www.ma.uni-heidelberg.de/apps/zmf/argonaute/
BIND
The Biomolecular Interaction Network Database (BIND) is a collection of records
documenting molecular interactions. A BIND record represents an interaction between
two or more objects that is believed to occur in a living organism. A biological object can
be a protein, DNA, RNA, ligand, molecular complex, or gene. BIND records are created
for interactions which have been shown experimentally and published in at least one
peer-reviewed journal. A record also references any papers with experimental evidence
that support or dispute the associated interaction. Interactions are the basic units of BIND
and can be linked together to form molecular complexes or pathways.
http://bond.unleashedinformatics.com/Action?pg=23299#BIND
BioGRID
The Biological General Repository for Interaction Datasets (BioGRID) database was
developed to house and distribute collections of protein and genetic interactions from
major model organism species. BioGRID currently contains over 198,000 interactions
from six different species, as derived from both high-throughput studies and conventional
focused studies. IPA currently only includes interactions from human, mouse and rat.
http://www.thebiogrid.org
Cognia
Cognia is a database of molecular interactions manually curated from the scientific
literature on the topic of the ubiquitin system. It's a comprehensive resource on key
regulatory proteins of the ubiquitin system, their attributes and interactions.
http://www.ncbi.nlm.nih.gov/pubmed/12645912?ordinalpos=2&itool=EntrezSystem2.PEntrez.Pubmed.Pubmed_ResultsPanel.Pubmed_DefaultReportPanel.Pubmed_RVDocSum
DIP
The DIP database (Database of Interacting Proteins) catalogs experimentally determined
interactions between proteins. It combines information from a variety of sources to create
a single, consistent set of protein-protein interactions. The data stored within the DIP
database were curated both manually by expert curators and automatically using
computational approaches that utilize the knowledge about the protein-protein
interaction networks extracted from the most reliable, core subset of the DIP data.
http://dip.doe-mbi.ucla.edu/
IntAct
IntAct provides a freely available, open source database system and analysis tools for
protein interaction data. All interactions are derived from literature curation or direct user
submissions and are freely available.
http://www.ebi.ac.uk/intact/site/index.jsf
Interactome
The 'Interactome studies' option in the data source filter gathers individual published
protein interaction studies that utilize high-throughput methods (e.g. Yeast-2-Hybrid) for
detecting a large number of protein-protein interactions. As the methods used for
large-scale detection of protein-protein interactions are not as stringent as other methods
used to validate each protein-protein interaction, these findings are separated out to allow
the user to filter them out if necessary. Under ‘Interactome studies’, IPA currently includes
the study below:
‘Towards a proteome-scale map of the human protein-protein interaction network’
Rual et al. Nature 2005 Oct 20;437(7062):1173-8 (PMID: 16189514)
http://www.ncbi.nlm.nih.gov/pubmed/16189514?dopt=Abstract
MINT
The Molecular INTeraction database (MINT) focuses on experimentally verified
protein-protein interactions mined from the scientific literature by expert curators. IPA only
includes molecular interactions from human, mouse and rat. However, the full MINT
dataset can be freely downloaded.
http://mint.bio.uniroma2.it/mint/Welcome.do
MIPS
The Munich Information Center for Protein Sequences (MIPS) mammalian protein-protein
interaction database is a collection of manually curated high-quality PPI data
collected from the scientific literature by expert curators. Only data from individually
performed experiments are included since they usually provide the most reliable evidence
for physical interactions.
http://mips.gsf.de/proj/ppi/
5. Ratio Calculations for Pathways
The ratio is calculated as follows:
The number of molecules in a given pathway that meet cutoff criteria, divided by total
number of molecules that make up that pathway.
Example: If you specify a cutoff of 2.0 for fold-change, then any molecule that is either
upregulated more than two-fold or downregulated more than two-fold will meet the cutoff.
If 10 molecules meet the cutoff for that pathway and 100 total molecules form that
pathway, the ensuing ratio would be 0.1.
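The ratio calculation can be written out as a short sketch. The function name is hypothetical, and the sketch assumes fold changes are stored as signed values in which down-regulation is negative (for example, -3.0 for three-fold down), so that the magnitude is compared against the cutoff.

```python
def pathway_ratio(fold_changes, pathway_molecules, cutoff=2.0):
    """Ratio = pathway molecules meeting the cutoff / all molecules in the pathway."""
    if not pathway_molecules:
        raise ValueError("pathway has no molecules")
    meeting = [
        m for m in pathway_molecules
        # "more than two-fold" is read as a strict comparison on the magnitude
        if m in fold_changes and abs(fold_changes[m]) > cutoff
    ]
    return len(meeting) / len(pathway_molecules)
```

For a pathway of 100 molecules in which 10 have a fold-change magnitude above 2.0, the function returns 0.1, matching the example above.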
Ratio Calculations for Metabolic Pathways
After we construct a pathway, we map all EC numbers to EntrezGene names. In order for
an EC number to be mapped in IPA, there has to be an entry for this gene available in
EntrezGene that is from human, mouse, or rat. If a molecule cannot be mapped to any
human, mouse, or rat gene, it won't be considered in the ratio calculation.
For any given pathway, you can check which molecules map and which ones do not, by
going to the canonical pathways library from the project manager. Two different colors
are used for the molecules: gray and white. Gray represents molecules that can be
mapped to human, mouse or rat. White molecules cannot be mapped and are therefore
excluded from the ratio calculations.
What is the difference between the significance (p-value) and ratio in the
Canonical Pathways bar chart? Which one is the best to use?
The ratio gives you a good idea of the percentage of genes in a pathway that were also
found in your uploaded list. The ratio is therefore good for looking at which pathway has
been affected the most based on the percentage of genes uploaded into IPA.
The significance (p-value) is better at asking, "Is there an association between a specific
pathway and my uploaded dataset, or is the overlap due to chance?" The null hypothesis is
that there is no association. If a p-value is very small you can be confident that the
pathway is associated with the uploaded dataset. This may be an indication that certain
pathways are more likely to explain the phenotype that is observed.
Neither the significance, nor the ratio tells you how these genes are associated with the
Canonical Pathway (i.e. you must look at the function of the affected genes in the
Canonical Pathway to determine whether the pathway is up- or down-regulated).
In the end it is a matter of percentage vs. probability. The ratio gives you the amount of
association; the significance gives confidence of association. For example, if a pathway
has a high ratio (percentage) and a very low p-value, the pathway is probably associated
with the data and a large portion of the pathway may be involved or affected. These
pathways may be the most likely candidates for an explanation of the observed
phenotype.
6. Interpreting Functional Analysis Results
Significance in Functional Analysis for a Dataset
The significance value associated with Functional Analysis for a dataset is a measure of
the likelihood that the association between a set of Functional Analysis molecules in your
experiment and a given process or pathway is due to random chance. The smaller the
p-value, the less likely that the association is random and the more significant the
association. In general, p-values less than 0.05 indicate a statistically significant,
non-random association. The p-value is calculated using the right-tailed Fisher Exact Test.
In this method, the p-value for a given function is calculated by considering
1) the number of functional analysis molecules that participate in that function and
2) the total number of molecules that are known to be associated with that function in
Ingenuity's knowledge base
The more functional analysis molecules that are involved, the more likely the association
is not due to random chance, and thus the more significant the p-value. Similarly, the
larger the total number of molecules known to be associated with the process, the greater
the likelihood that an association is due to random chance, and the p-value accordingly
becomes less significant. In short, the p-value identifies statistically significant
over-representation of functional analysis molecules in a given process. Over-represented
functional or pathway processes are processes which have more focus molecules than
expected by chance ("right-tailed").
For example:
Experiment #1
5 functional analysis molecules related to hematopoiesis
50 molecules related to hematopoiesis total in Ingenuity's knowledge base
Experiment #2
5 functional analysis molecules related to hematopoiesis
10 molecules related to hematopoiesis total in Ingenuity's knowledge base
In this case the p-value for the hematopoiesis process in Experiment #2 would be more
significant (i.e. smaller value) than the p-value for hematopoiesis in Experiment #1
because of the greater over-representation of the functional analysis molecules in the
relevant set of molecules related to hematopoiesis.
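The two experiments above can be compared with a small sketch of a right-tailed Fisher Exact Test, computed as the upper tail of the hypergeometric distribution. The text does not give the size of the reference set or the number of functional analysis molecules, so N_REFERENCE and N_ANALYSIS below are hypothetical values chosen only for illustration, and the function is not IPA's own implementation.

```python
from math import comb


def right_tailed_fisher(k, K, n, N):
    """P(X >= k): n molecules drawn from N, of which K are annotated to the function."""
    upper = min(K, n)
    total = comb(N, n)
    # Sum the hypergeometric probabilities for k, k+1, ..., upper overlaps.
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return tail / total


N_REFERENCE = 20_000   # hypothetical size of the reference set
N_ANALYSIS = 100       # hypothetical number of functional analysis molecules

p_experiment_1 = right_tailed_fisher(5, 50, N_ANALYSIS, N_REFERENCE)
p_experiment_2 = right_tailed_fisher(5, 10, N_ANALYSIS, N_REFERENCE)
```

Under these assumed sizes, p_experiment_2 is smaller than p_experiment_1, in line with the explanation above: the same 5 molecules cover half of a 10-molecule function but only a tenth of a 50-molecule function.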
How to Use the p-values IPA Calculates
We suggest that you use the p-values and scores calculated in IPA as starting points for
further investigation and as rough guides for helping you identify significant processes or
pathways being affected in your experiment. In some cases, you will need to explore the
supporting evidence to understand the full biological implications for significant results.
In other cases, there may be results with large p-values (>0.05) that you will find
compelling upon further investigation. This can occur in the case of Canonical Pathways,
where the involvement of just one molecule with a canonical pathway may be
biologically interesting, even if it is not statistically significant.
Interpreting Functional Analysis for Comparisons
Functional Analysis can be used to quickly gain an overview of the effect of different
drug treatments or time courses on a variety of functions. But how do you interpret such
results?



The significance calculated for each function returned in Functional Analysis is a
measurement of the likelihood that the function is associated with the dataset by
random chance. On the y-axis of the diagram, the significance is expressed as the
negative exponent of the p-value calculated for each function. That is, taller bars
are more significant than shorter bars.
To determine whether and to what extent a given function is affected from one
observation to another within a comparison you can start by comparing the extent
to which the significances change from one observation to another. For example,
if the significance of a function changes from one treatment to the next, then it is
likely that the treatment had an impact on the function under investigation. Note,
however, that functions whose significance values exhibit little or no change from
one observation to another may be changing and should also be investigated
further. Specifically, it is possible that the molecules, and therefore the underlying
biology, in one observation could be different from another observation even
though the significance values remained unchanged.
When interpreting your results, it is important to keep in mind that the
significance values you are seeing refer to the High Level Functions rather than to
individual functions. If a High Level Function contains two or more specific
functions, a range of significances is displayed.
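The bar heights described above can be reproduced with a one-line conversion. This sketch reads "negative exponent of the p-value" as the negative base-10 logarithm, which is an interpretation rather than something the text states explicitly; the function name is hypothetical.

```python
import math


def bar_height(p_value):
    """Bar height on the chart, read here as -log10(p-value)."""
    if not 0 < p_value <= 1:
        raise ValueError("p-value must be in (0, 1]")
    return -math.log10(p_value)
```

Under this reading, a p-value of 0.001 gives a bar of height 3, and the 0.05 significance threshold corresponds to a height of about 1.3, so smaller p-values always produce taller bars.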
7. Filtered Datasets/ Enriched Datasets
Enriched Datasets are enriched by Ingenuity's Knowledge Base to provide specific details
such as subcellular location, functional gene family and association with drugs, for
enhanced exploration and understanding. Enriched Datasets enable more data
interactivity and flexibility when using IPA, particularly in Compare, Grow Out, and
Overlay. To get even more value out of your datasets when applying them to these
functionalities, apply filters and cutoffs to your dataset to limit your experimental
observations to the most relevant information.
Using Filtered Datasets in Overlay, Grow Out and Compare
Overlay Datasets
Filtered Datasets can be used to Overlay dataset values onto Networks, My Pathways,
Canonical Pathways, and in Path Designer. Using Filtered Datasets allows you to
visualize your experimental data with your selected filter and cutoff values on a pathway
or network without having to run an analysis. Use this functionality to quickly see how
your experiment affects a well characterized Signaling or Metabolic pathway from the
Ingenuity Pathway Library, or how genes in your experiment support a hypothesis you
built using IPA’s Search and Explore functionality. When you overlay a Filtered Dataset
onto a pathway, pathway nodes will be colored according to the experimental values in
your dataset that meet your filter criteria.
Grow from Datasets
Filtered Datasets can be used in Grow to add molecules to Networks and Pathways.
Using Filtered Datasets lets you restrict the molecules used to expand networks and
pathways to those in your dataset that meet the criteria you set, without having to run
an analysis. Use this functionality to determine
molecules from your experiment that may regulate a well characterized Signaling or
Metabolic pathway from the Ingenuity Pathway Library or how molecules in your
experiment are associated with a hypothesis you built around a particular biological
function or disease. When you grow out from a Filtered Dataset, molecules added to the
pathway are those from your dataset that meet your filter criteria. If no molecules are
added, your filter criteria may be too stringent, but you can go back to the Dataset Filter
by double clicking on your Filtered Dataset in the Project Manager to adjust the settings.
Compare Datasets
Filtered Datasets can be used in Compare to compare molecules from different
experimental observations. Comparing Filtered Datasets allows you to apply filter criteria
to identify the common and unique molecules across your experiments and quickly
discern trends in your data. When comparing Filtered Datasets, only those molecules
from your dataset that meet your filter criteria will be used to compare.
Creating an Analysis from a Filtered Multi-observation Dataset
In addition to using filtered datasets in Overlay, Grow, and Compare, one can use filtered
datasets for analysis. When filtering a dataset for use in analysis, you can apply
contextual filters, such as species, disease, or molecule type, and/or an expression value
cutoff. Those molecules that meet the filter criteria are available for use in the analysis.
When you apply an expression value cutoff to a multi-observation dataset and then use
that dataset for analysis, IPA identifies all molecules across the dataset observations that
meet the filter criteria. The union of those molecules then becomes the input to the
analysis.
For example, if you filter a dataset with a cutoff value of 1.0 and a molecule has an
expression value of 0.99 for one observation and an expression value of 1.1 for another,
that molecule will be included as an input for the analysis. You can then apply an
expression value cutoff and/or additional contextual filters as part of the Create Analysis
process. These filters will be applied to your dataset and across multiple observations
and may adjust the number of molecules that are Network or Functions/Pathway/List
eligible. Extending the above example, if you apply a cutoff value of 1.0 as part of
Create Analysis, the molecule with an expression value of 0.99 will not be included as a
Focus Gene for that particular observation, but will be included for the observation where
the expression value is 1.1. Once you are satisfied with the analysis conditions, click on
Run Analysis to name and run the analysis.
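The multi-observation behavior described above can be sketched as two set operations. The function names and the dataset layout (a mapping from molecule to one expression value per observation) are hypothetical, and "meets the cutoff" is assumed to mean greater than or equal to the cutoff value.

```python
def analysis_input(dataset, cutoff):
    """Union of molecules that meet the cutoff in at least one observation."""
    return {
        molecule
        for molecule, values in dataset.items()
        if any(value >= cutoff for value in values)
    }


def focus_molecules(dataset, cutoff, observation):
    """Molecules that meet the cutoff in one specific observation (by index)."""
    return {
        molecule
        for molecule, values in dataset.items()
        if values[observation] >= cutoff
    }


# GENE_A mirrors the example in the text: 0.99 in one observation, 1.1 in another.
dataset = {"GENE_A": [0.99, 1.1], "GENE_B": [0.5, 0.7]}
included = analysis_input(dataset, 1.0)
```

Here GENE_A is part of the analysis input because one of its observations meets the cutoff, but focus_molecules returns it only for the second observation.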
Creating Filtered Datasets
To create a Filtered Dataset click on the Filter Dataset link from the Quick Start menu
and select or upload a dataset.
On the Filter Dataset page, select the filters most relevant to you and your experiment.
Species: Filter for genes that exist in a particular species. By selecting any item in this
filter, you are specifying that you are interested in genes that exist in a species.
Tissues and Cell Lines: Filter for genes expressed in a particular tissue or cell line. By
selecting any item in this filter, you are specifying that you are interested in genes
expressed in the selected tissue(s) and/or cell line(s).
Biofluid: Filter for proteins detectable in a particular bodily fluid. By selecting any item
in this filter, you are specifying that you are only interested in proteins that are detectable
in the selected bodily fluid(s).
Diseases: Filter for genes associated with a particular disease. By selecting any item in
this filter, you are specifying that you are interested in genes associated with the selected
disease(s).
Molecules: Filter for specific molecule types in your dataset, such as kinases or
transcription regulators. By selecting any item in this filter, you are specifying that you
are only interested in a particular type of molecule.
NOTE: You may select multiple items within each filter and across the different filters.
The filter runs an "OR" operation within each filter and an "AND" operation across the
filters. For example, selecting blood, saliva and human will utilize those genes that are
detectable in [blood or saliva] and exist in [human].
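The OR-within, AND-across logic can be sketched as follows. The molecule names and annotation layout are hypothetical and only illustrate the combination rule.

```python
# Hypothetical annotations for four molecules.
molecules = {
    "MOL_1": {"biofluid": {"blood"}, "species": {"human", "mouse"}},
    "MOL_2": {"biofluid": {"saliva", "urine"}, "species": {"human"}},
    "MOL_3": {"biofluid": {"blood"}, "species": {"rat"}},
    "MOL_4": {"biofluid": {"urine"}, "species": {"human"}},
}

# Selections corresponding to [blood OR saliva] AND [human].
selected = {"biofluid": {"blood", "saliva"}, "species": {"human"}}


def is_eligible(annotations, selections):
    # all(...) is the AND across filters; a non-empty set intersection
    # is the OR within a single filter.
    return all(
        annotations.get(filter_name, set()) & chosen
        for filter_name, chosen in selections.items()
    )


eligible = sorted(name for name, ann in molecules.items() if is_eligible(ann, selected))
print(eligible)
```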
A summary of your filter selection is provided on the right side of the page, so you can
easily determine which filters you have selected for application to your dataset.
In addition to selecting filters to apply to your dataset, you can also define the Expression
Value Cutoff. For each expression value type in your dataset, enter a cutoff value to
indicate at what expression level or significance value the molecules become important to
you. If all the molecules are important to you, input a cutoff value that includes all
identifiers (e.g. p-value = 1.0 or Fold Change = 1).
HINT: When using multiple expression value types and cutoff values, IPA uses the
"AND" function for calculating the number of molecules eligible for the Dataset Filter.
Molecules must meet both cutoff values to be considered for the analysis.
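As a rough sketch of the "AND" rule for two expression value types: the rows, cutoff values and comparison directions below (fold change compared by magnitude, p-value at or below its cutoff) are assumptions for illustration.

```python
# Hypothetical dataset rows with two expression value types.
rows = [
    {"id": "GENE_A", "fold_change": 2.4, "p_value": 0.01},
    {"id": "GENE_B", "fold_change": 3.1, "p_value": 0.20},
    {"id": "GENE_C", "fold_change": 1.2, "p_value": 0.001},
    {"id": "GENE_D", "fold_change": -2.8, "p_value": 0.03},
]
FOLD_CHANGE_CUTOFF = 2.0
P_VALUE_CUTOFF = 0.05


def meets_cutoffs(row):
    # Assumptions: fold change is compared by magnitude, and the p-value
    # must be at or below its cutoff. Both conditions must hold (AND).
    return (
        abs(row["fold_change"]) >= FOLD_CHANGE_CUTOFF
        and row["p_value"] <= P_VALUE_CUTOFF
    )


filter_eligible = [row["id"] for row in rows if meets_cutoffs(row)]
print(filter_eligible)
```

GENE_B fails on p-value alone and GENE_C fails on fold change alone, so neither is eligible.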
When you have finished selecting filters and applying cutoffs, click the Recalculate
button to view the number of molecules eligible for the Dataset Filter. These are the
molecules that satisfy the filter criteria that you set and can be seen by clicking the Filter
Eligible tab that appears on this page.
• If no molecules are eligible, review the parameters set for the Dataset Filter and adjust
the settings by de-selecting filters or decreasing the stringency of the expression cutoff
values.
When you are satisfied with your filter and cutoff settings, click the Save button to save
your settings to this dataset. This closes the Filter Dataset page and takes you back to the
Project Manager.
8. Reviewing Mapped Genes
The following lists of genes are generated after mapping is complete.
Mapped IDs are identifiers that were successfully mapped to a molecule in Ingenuity's
knowledge base. Duplicate identifiers are mapped to a single molecule. Mapping of
external identifiers to molecules in Ingenuity's knowledge base is performed using
information in Homologene.
Unmapped IDs are identifiers that were not mapped to a molecule in Ingenuity's
knowledge base. Unmapped molecules may fall into one of the following categories:
1) The gene/protein ID does not correspond to a known gene product. For example, most
ESTs are not found in the knowledge base (exception: ESTs that have a corresponding
Entrez Gene identifier are in the knowledge base).
2) There are insufficient findings in the literature regarding this molecule.
3) Findings for this molecule have not been entered in Ingenuity's knowledge base.
4) The gene/protein ID is one of a small percentage of GenBank IDs that corresponds to
several loci and several genes, and thus to several Entrez Gene IDs. Such identifiers are
left unmapped in the application due to the ambiguity of their identity.
If you have an identifier that is not being mapped in IPA and you think it should be,
please contact Customer Support at 650-381-5111 or Support@ingenuity.com. We can
research why individual identifiers are not mapped. Please tell us the identifier in
question, the identifier type you are using, the molecule you think it should map to, and
the molecule that it is mapping to in IPA (if you think it is a mismapping).
All IDs are all of the identifiers contained in the dataset file.
Network Eligible Molecules are the molecules that are eligible for network generation.
In order for a molecule to become a Network Focus molecule, it must meet two criteria.
1) It must meet all of the criteria you specified in your analysis parameters (i.e. it must
meet the cutoff value, focus on, and directionality settings).
2) There must be at least one other molecule in Ingenuity's knowledge base that interacts
with it.
Functions/Pathways Molecules are the molecules eligible for functional analysis. In
order for a molecule to become a Functions/ Pathway Molecule, it must meet two criteria.
1) It must meet all of the criteria you specified in your analysis parameters (i.e. it must
meet the cutoff value, focus on, and directionality settings).
2) There must be at least one functional annotation (function, pathway or list) associated
with this molecule in Ingenuity's knowledge base.
NOTES:
1) A molecule can be both a Network Eligible molecule and a Functions/ Pathway
Eligible molecule if it meets all 3 criteria specified above.
2) You can change the number of Network Molecules or Functions/ Pathways Eligible
Molecules by changing the analysis parameters.
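The eligibility rules above can be sketched as simple set logic. The dictionaries below are hypothetical stand-ins for knowledge base lookups and for the outcome of the analysis parameters.

```python
# Hypothetical stand-ins for knowledge base lookups.
interaction_partners = {
    "GENE_A": {"GENE_X"},
    "GENE_B": set(),
    "GENE_C": {"GENE_Y", "GENE_Z"},
    "GENE_D": {"GENE_X"},
}
functional_annotations = {
    "GENE_A": {"apoptosis"},
    "GENE_B": {"cell cycle"},
    "GENE_C": set(),
    "GENE_D": {"apoptosis"},
}
# Whether each molecule met the cutoff, focus-on and directionality settings.
meets_analysis_parameters = {
    "GENE_A": True,
    "GENE_B": True,
    "GENE_C": True,
    "GENE_D": False,
}

# Criterion 1 plus at least one interacting molecule.
network_eligible = {
    m for m, ok in meets_analysis_parameters.items()
    if ok and interaction_partners.get(m)
}
# Criterion 1 plus at least one functional annotation.
functions_pathways_eligible = {
    m for m, ok in meets_analysis_parameters.items()
    if ok and functional_annotations.get(m)
}

print(sorted(network_eligible))
print(sorted(functions_pathways_eligible))
print(sorted(network_eligible & functions_pathways_eligible))
```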
9. FAQs about Functional Analysis Statistical
Calculations
To help you understand how Ingenuity Pathways Analysis calculates the
statistical values displayed in Functions and Pathways, here are answers to some
frequently asked questions.
How are the significances/p-values for Functions and Pathways in IPA calculated?
The significance value associated with Functional Analysis for a dataset is a measure of
the likelihood that the association between a set of Functional Analysis genes in your
experiment and a given process or pathway is due to random chance. The smaller the
p-value, the less likely that the association is random and the more significant the
association. In general, p-values less than 0.05 indicate a statistically significant,
non-random association.
The p-value associated with a biological process or pathway annotation is a measure of
its statistical significance with respect to the Functions/Pathways/Lists Eligible molecules
for the dataset and a Reference Set of molecules (which define the molecules that could
possibly have been Functions/Pathways/Lists Eligible). The p-value is calculated with the
right-tailed Fisher's Exact Test.
In this method, the p-value for a given function is calculated by considering:
1) The number of Functions/Pathways/Lists Eligible molecules that participate in that
annotation
2) The total number of knowledge base molecules known to be associated with that
function
3) The total number of Functions/Pathways/Lists Eligible molecules
4) The total number of genes in the Reference Set
In the right-tailed Fisher's Exact Test, only over-represented functions or pathways
(those that have more Functions/Pathways/Lists Eligible molecules than expected by
chance) are significant. Under-represented functions or pathways ('left-tailed' p-values),
which have significantly fewer molecules than expected by chance, are not shown.
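A minimal sketch of a right-tailed Fisher's Exact Test built from the four quantities listed above, written as a hypergeometric tail sum. The function name, parameter names and example counts are illustrative, and the sketch assumes the annotated molecules are counted within the Reference Set; it is not IPA's implementation.

```python
from math import comb


def right_tailed_fisher(overlap, annotated, eligible, reference):
    """P(X >= overlap) when `eligible` molecules are drawn without replacement
    from `reference` molecules, `annotated` of which carry the annotation."""
    upper = min(annotated, eligible)
    favorable = sum(
        comb(annotated, i) * comb(reference - annotated, eligible - i)
        for i in range(overlap, upper + 1)
    )
    return favorable / comb(reference, eligible)


# Illustrative counts: 8 of 100 eligible molecules fall in an annotation
# that covers 40 of 5000 reference molecules.
p = right_tailed_fisher(overlap=8, annotated=40, eligible=100, reference=5000)
print(f"p = {p:.3g}")
```

Because only the upper tail is summed, an annotation with fewer molecules than expected never receives a small p-value under this test.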
Why do we use the Fisher’s exact test instead of some other types of p-value
calculations?
The type of p-value calculation depends on the statistical null model (i.e. the “random”
model) that is used for assessing significance. In the case of functional analysis (where
we have a set of N molecules and ask if this set is significantly enriched in molecules
with a particular annotation) the random model corresponds to picking the N molecules
at random. The assumption of this null model leads to Fisher’s exact test. Other null
models are also plausible (and sometimes preferred), such as permuting the identities of
annotations or molecules, which maintains the annotation tree structure but is
computationally very expensive. Fisher’s exact test is computationally less expensive and
widely used.
What factors influence the size of the p-value in Functional Analysis for a dataset?
While the number of Functions/Pathways/Lists Eligible molecules associated with a
given function/pathway is an important measure when calculating the p-value for
Functional Analyses, the p-value does not only depend on this number. The p-value for
a given function is calculated by considering:
1) The number of Functions/Pathways/Lists Eligible molecules that participate in that
annotation
2) The total number of knowledge base molecules known to be associated with that
function
3) The total number of Functions/Pathways/Lists Eligible molecules
4) The total number of molecules in the Reference Set
Why are the significance/p-value calculations in the application not based on a binomial
distribution?
The proper null hypothesis for statistical testing needs to reflect the constraint that a
particular molecule can appear in a given set only once. This constraint would be violated
if the binomial distribution were used.
The difference between the hypergeometric and binomial distributions is that the
hypergeometric calculates probabilities without replacement and the binomial assumes
replacement. Since each molecule can only be used once in each p-value calculation, no
replacement should be considered, so the binomial distribution cannot be used.
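The difference can be seen in a small numerical sketch. With only 5 annotated molecules in a reference set of 20, an overlap of 6 is impossible without replacement, yet the binomial model still assigns it a positive probability. The counts are illustrative assumptions.

```python
from math import comb

REFERENCE = 20   # molecules in the reference set
ANNOTATED = 5    # molecules carrying the annotation
DRAWN = 10       # eligible molecules


def hypergeometric_tail(k):
    # Sampling without replacement: each molecule can be picked only once.
    upper = min(ANNOTATED, DRAWN)
    favorable = sum(
        comb(ANNOTATED, i) * comb(REFERENCE - ANNOTATED, DRAWN - i)
        for i in range(k, upper + 1)
    )
    return favorable / comb(REFERENCE, DRAWN)


def binomial_tail(k):
    # Sampling with replacement: the same molecule could be picked repeatedly.
    p = ANNOTATED / REFERENCE
    return sum(
        comb(DRAWN, i) * p**i * (1 - p) ** (DRAWN - i)
        for i in range(k, DRAWN + 1)
    )


print("P(X >= 6), hypergeometric:", hypergeometric_tail(6))
print("P(X >= 6), binomial:      ", binomial_tail(6))
```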
Are Bonferroni corrections used for significance/p-value calculations in Ingenuity
Pathways Analysis?
No, we do not apply a Bonferroni correction, which is one of several ways to correct for
testing the same data against multiple hypotheses. The Bonferroni correction is widely
viewed as over-correcting and leading to a high false negative rate. This would
cause too many functions to fall above the p-value threshold of 0.05 and thus not be
shown.
We provide an option for multiple hypothesis correction based on the Benjamini-Hochberg
approach. Applying any multiple-testing correction does not change the order
of the annotations sorted by their significance, but it might equalize p-values for some
functions. You should have the highest confidence in annotations with the smallest
p-values, and can discount annotations with relatively higher p-values.
What is the Benjamini-Hochberg method of multiple testing correction?
This calculation returns adjusted p-values and enables you to control the noise in certain
Functional Analysis and Canonical Pathway results. This corrected p-value can be
interpreted as an upper bound for the expected fraction of false positives. For example, if
the threshold is 0.01, you can expect that the fraction of false positives among the
significant functions is less than 1%.
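The formula for the Benjamini-Hochberg method of multiple testing correction, in its
standard form, is:
p_i^{BH} = min_{k >= i} ( m * p_k / k ), capped at 1,
where p_1 <= p_2 <= etc <= p_m is the ordered sequence of non-corrected p-values and m
is the number of tests.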
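A minimal sketch of the textbook Benjamini-Hochberg step-up adjustment, assuming IPA's calculation follows this standard form. The function name and input p-values are illustrative.

```python
def benjamini_hochberg(p_values):
    """Return B-H adjusted p-values in the same order as the input."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, so each adjusted value is the
    # minimum of m * p_k / k over all ranks k at or above its own rank.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


raw = [0.001, 0.008, 0.039, 0.041, 0.60]
adj = benjamini_hochberg(raw)
for p, q in zip(raw, adj):
    print(f"raw = {p:<6} adjusted = {q:.5f}")
```

In this example the third and fourth adjusted values coincide, which is the "equalizing" effect: the ranking by significance is preserved, but neighbouring p-values can end up tied.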
When should I use the Benjamini-Hochberg multiple testing correction?
1. If you ask for the significance of a particular function or pathway in relation to your
dataset, then the uncorrected p-value (the Fisher's Exact Test p-value) is appropriate. It
measures how likely the observed result would be if the association was just random. If
this is very unlikely (i.e. the p-value is below the threshold) then the function is said to be
significant.
2. If you have a set of functions or pathways and ask for all significant functions within
this set, the Benjamini-Hochberg multiple testing correction p-value is the more
appropriate measure. In this case the threshold p-value gives you information about how
many false positives (i.e. functions falsely identified as being significant) you can
maximally expect among the significant functions. For example, if your p-value threshold
is 0.01 and there are 100 significant functions with a p-value below this threshold, you
can expect that, on average, at most one of them was falsely identified as being
significant. The threshold p-value corresponds to the false discovery rate, which is 1% in
this example.
Why did we choose the Benjamini-Hochberg multiple testing correction over other
methods?
We chose to offer B-H multiple-testing corrected p-values because they are widely used,
straightforward to implement, and computationally inexpensive. A downside is that these
corrected p-values slightly over-correct (though far less than Bonferroni)
because they involve the estimate of an upper bound of the false discovery rate (FDR)
that corresponds to the case that the null hypothesis is true for all tests. Furthermore, the
null distributions for the different tests are assumed to be independent.
For Canonical Pathways, does the p-value depend on the size of the pathway?
Yes. For example, let us compare a 50% overlap of a small canonical pathway (with 10
Functions/Pathways/Lists Eligible molecules and 20 molecules in the pathway) to a 50%
overlap of a large pathway (with 50 Functions/Pathways/Lists Eligible molecules and 100
molecules in the pathway).
The p-value is more significant when the relative proportion of Functions/Pathways/Lists
Eligible molecules in the pathway is greater. For example, if there were 200 such
molecules in the dataset, the 50/100 pathway would have a greater proportion of them
than the 10/20 pathway (50/200 vs. 10/200) and this would lead to a lower p-value for the
50/100 pathway.
On the other hand, the p-value is less significant when the proportion of Reference Set
molecules in the pathway is greater. The 50/100 gene pathway has a greater proportion
than the 10/20 gene pathway, therefore this would contribute to a larger p-value for the
50/100 gene pathway. Since these two considerations rarely cancel each other, generally
the p-value will change when the size of the canonical pathway changes.
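The two pathways in this example can be compared numerically with a right-tailed hypergeometric sum. The Reference Set size of 20,000 is an assumption added for illustration (the example above does not state one), and the function name is hypothetical.

```python
from math import comb

REFERENCE_SIZE = 20000  # assumed; not given in the example above
ELIGIBLE_TOTAL = 200    # Functions/Pathways/Lists Eligible molecules in the dataset


def right_tailed_p(overlap, pathway_size,
                   eligible=ELIGIBLE_TOTAL, reference=REFERENCE_SIZE):
    # P(X >= overlap) for `eligible` draws without replacement from
    # `reference` molecules, `pathway_size` of which are in the pathway.
    upper = min(pathway_size, eligible)
    favorable = sum(
        comb(pathway_size, i) * comb(reference - pathway_size, eligible - i)
        for i in range(overlap, upper + 1)
    )
    return favorable / comb(reference, eligible)


p_small = right_tailed_p(overlap=10, pathway_size=20)
p_large = right_tailed_p(overlap=50, pathway_size=100)
print(f"10/20 pathway:  p = {p_small:.3g}")
print(f"50/100 pathway: p = {p_large:.3g}")
```

Under these assumed numbers both pathways have the same 50% overlap, yet the larger pathway receives the smaller p-value, so the two effects do not cancel.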
How does a customized microarray that focuses on a particular pathway or disease affect
the p-values in Canonical Pathways?
The p-values are calculated using the Reference Set that is defined in the Analysis
Components settings within the Create Analysis page. When the identifiers assayed are
known, include them in the input file and use a cutoff to indicate the molecules of
interest. You should try to use your dataset as the reference set, unless it is so small that
IPA is unable to calculate p-values for Functions/Pathways/Lists. The resulting p-values
indicate how significant the molecule overlap is with each annotation, considering
both the molecules assayed and the input molecules that met the cutoff.
If you have a molecule list or do not know which molecules were assayed, the default
reference set in such cases should be one of the Ingenuity knowledge base reference sets
(genes only, endogenous chemicals only, or both genes and endogenous chemicals),
which includes all functionally-characterized molecules. When Ingenuity's knowledge
base reference sets are used, the p-values answer the question "What biological
annotations are significantly associated with my input molecules relative to all
functionally-characterized mammalian molecules?"
What is the difference between the significance (p-value) and ratio in the Canonical
Pathways bar chart? Which one is the best to use?
The ratio is calculated by taking the number of genes from your dataset that participate in
a Canonical Pathway, and dividing it by the total number of genes in that Canonical
Pathway. The ratio indicates the percentage of genes in a pathway that were also found in
your uploaded gene list (or Functions/Pathways/Lists Eligible genes if a cutoff was
specified). The ratio is therefore useful for determining which pathways overlap the most
with the genes in your dataset.
The p-value measures how likely the observed association between a specific pathway
and the dataset would be if it was only due to random chance, by also considering the
total number of Functions/Pathways/Lists Eligible genes in your dataset and the
Reference Set of genes (those which potentially could be significant in your dataset). If a
p-value is very small you can be confident that the corresponding pathway is significantly
associated with the uploaded dataset. This can be read as an indication that this pathway
is more likely to explain the observed phenotype than others.
Neither the significance nor the ratio tells you how these genes are associated with the
Canonical Pathway (i.e. you must look at the function of the affected genes in the
Canonical Pathway to determine whether the pathway is up- or down-regulated).
The ratio indicates the strength of the association, whereas the p-value measures its
statistical significance. If all genes in the genome were Network Eligible then even
though every annotation would have a ratio of one (100% overlap), none of the
annotations would be significant. If a pathway has a high ratio (percentage overlap with
Functions/Pathways/Lists Eligible genes) and a very small p-value, the pathway is
probably associated with the data and a large portion of the pathway may be involved or
affected. These pathways may be the most likely candidates for an explanation of the
observed phenotype.
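A short sketch contrasting the ratio with the p-value. The counts are illustrative assumptions, and the p-value is computed as a right-tailed hypergeometric sum rather than with IPA's own code.

```python
from math import comb


def pathway_ratio(overlap, pathway_size):
    # Fraction of the pathway's genes that are also eligible in the dataset.
    return overlap / pathway_size


def right_tailed_p(overlap, pathway_size, eligible, reference):
    upper = min(pathway_size, eligible)
    favorable = sum(
        comb(pathway_size, i) * comb(reference - pathway_size, eligible - i)
        for i in range(overlap, upper + 1)
    )
    return favorable / comb(reference, eligible)


REFERENCE = 1000
PATHWAY = 50

# Case 1: every reference gene is eligible, so the whole pathway overlaps.
ratio_all = pathway_ratio(PATHWAY, PATHWAY)
p_all = right_tailed_p(PATHWAY, PATHWAY, eligible=REFERENCE, reference=REFERENCE)

# Case 2: 100 eligible genes, 25 of which fall in the pathway.
ratio_some = pathway_ratio(25, PATHWAY)
p_some = right_tailed_p(25, PATHWAY, eligible=100, reference=REFERENCE)

print(f"all eligible: ratio = {ratio_all}, p = {p_all}")
print(f"100 eligible: ratio = {ratio_some}, p = {p_some:.3g}")
```

The first case has the highest possible ratio but a p-value of 1, while the second has a lower ratio and a very small p-value, which is why the two measures are best read together.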
What is the "threshold" line on the Functional Analysis and Canonical Pathways analysis
bar charts?
The threshold line that appears in the bar chart represents a p-value of 0.05. Canonical
Pathways that are not statistically significant (those below the line) are still included in
the bar chart for reference. The threshold line can be customized to a different level or
removed by clicking on the Customize Chart button.
How should I use the p-values (and Network Scores) appropriately?
We suggest you use the p-values and scores calculated in IPA as starting points for
further investigation and as rough guides for helping you identify significant processes or
pathways being affected in your experiment. In some cases, you will need to explore the
supporting evidence to understand the full biological implications for significant results.
In other cases, there may be results with insignificant p-values (>0.05) that you will find
compelling nevertheless upon further investigation. This can happen with Canonical Pathways,
where the participation of just one Network Eligible gene in a canonical pathway may be
biologically interesting, even if it is not statistically significant.