How to access and interpret your data

Step 1 Downloading an FTP Client
Registered Windows (and Mac) users can access their data from our server by
downloading and installing the following free FTP client:
FileZilla (for all platforms) Click here for website
Mac users can also use the ‘Go’ menu
Step 2 Downloading your data
Once you have installed your FTP client you can fill in server details which will be
emailed to you. (See links below)
With FileZilla, type 138.251.93.65 into the ‘Host’ field, enter the username and password
provided by us in an email, and click ‘Quickconnect’. For more detail, see ‘Using FileZilla’ below.
With the ‘Go’ menu in OSX, click ‘connect to server’ and type:
‘ftp://138.251.93.65’. You will be prompted for a username and password (which
are supplied by us in an email).
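If you prefer a script to a graphical client, the same download can be sketched with Python's standard ftplib module. This is only an illustration: the function name fetch_file, the placeholder credentials and the file name results.zip are assumptions, not details of our server.

```python
from ftplib import FTP
from pathlib import Path

HOST = "138.251.93.65"  # server address given in this guide


def fetch_file(ftp, remote_name, dest_dir="."):
    """Download one file from the current FTP directory into dest_dir."""
    target = Path(dest_dir) / Path(remote_name).name
    with open(target, "wb") as handle:
        # RETR streams the remote file; each chunk is written to disk.
        ftp.retrbinary(f"RETR {remote_name}", handle.write)
    return target


# Example session (replace the placeholders with the details emailed to you):
# with FTP(HOST) as ftp:
#     ftp.login("your_username", "your_password")
#     print(ftp.nlst())                # list the files and folders
#     fetch_file(ftp, "results.zip")   # hypothetical file name
```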
Step 3 Viewing results
MALDI (4800) Data ‐ sent in PowerPoint format
IMPORTANT All files should be saved as sent in the same directory on your
computer. Double‐click the PowerPoint presentation and view as a "Slide show".
This enables hyperlinks to be active allowing access to Mascot search results.
Electrospray (Q‐STAR) ‐ sent in Zip Format
IMPORTANT This should be unzipped (using WinZip or StuffIt Expander), resulting in a folder for
each sample run. Each folder contains a file called Index.htm which should be double‐clicked.
This will open up Mascot search data in your chosen web browser for you to view.
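As an alternative to a desktop unzip tool, the archive can be unpacked with Python's standard zipfile module. The sketch below assumes the layout described above (one folder per sample run, each holding Index.htm); the function name extract_results is ours.

```python
import zipfile
from pathlib import Path


def extract_results(zip_path, dest_dir):
    """Unzip a results archive and return the Index.htm file of every sample run."""
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(dest_dir)
    # One Index.htm is expected inside each sample folder.
    return sorted(Path(dest_dir).rglob("Index.htm"))
```

Each returned path can then be opened in a browser, for example with webbrowser.open(path.resolve().as_uri()).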
Step 4 Interpreting your data
See our instructions below to assist you in the interpretation of your results.
Using FileZilla
After opening the software, type 138.251.93.65 into the ‘Host’ field.
Then type in the username and password provided by us. When connected, the folders on the server
will appear in the right hand (remote site) box.
You will be able to navigate to your folder from here. To download your files, right-click on the file
and select ‘Download’. The file will then be copied to the location on your computer displayed in the
left hand box.
Interpreting Mascot data
(Hyperlink to Matrix Science website)
MS/MS Results Interpretation
The following pages can be found on the Matrix Science website (follow the hyperlink above),
but are reproduced here also.
Ions score significance thresholds
In Mascot, the ions score for an MS/MS match is based on the calculated probability, P, that the
observed match between the experimental data and the database sequence is a random event.
The reported score is -10·log10(P). So, during a search, if 1500 peptides fell within the mass
tolerance window about the precursor mass, and the significance threshold was chosen to be
0.05 (a 1 in 20 chance of being a false positive), this would translate into a score threshold of
45.
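The arithmetic behind this example can be checked with a short sketch. It assumes the logarithm is base 10 and that the significance level is shared equally across the candidate peptides, which is the reading that reproduces the threshold of 45; the function names are ours, not Mascot's.

```python
import math


def ions_score(p):
    """Convert a probability P of a random match into a score, -10*log10(P)."""
    return -10.0 * math.log10(p)


def identity_threshold(num_candidates, significance=0.05):
    """Score needed so that a random match among num_candidates peptides
    has at most `significance` probability of occurring (assumed reading)."""
    return ions_score(significance / num_candidates)


print(round(identity_threshold(1500, 0.05)))  # 45
```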
If the quality of an MS/MS spectrum is poor, particularly if the signal to noise ratio is low, a
match to the "correct" sequence might not exceed this absolute threshold. Even so, the best
match could have a relatively high score, which is well separated from the distribution of 1500
random scores. In other words, the score is an outlier. This would indicate that the match is not
a random event and, if tested using a method such as a target‐decoy search, such matches can
be shown to be reliable. For this reason, Mascot also attempts to characterise the distribution of
random scores, and provide a second, lower threshold to highlight the presence of any outlier.
The lower, relative threshold is reported as the homology threshold while the higher threshold
is reported as the identity threshold.
The identity threshold is still useful because it is not always possible to estimate a homology
threshold. If the instrument accuracy is very high or the database is very small, there may only
be a small handful of candidate sequences, so that it is not possible to say whether a match is an
outlier.
For a search of at least 1000 spectra, where an automatic decoy search was used, you can
choose to process the Mascot scores through Percolator. This uses machine learning to re‐rank
the matches, so as to obtain an optimum false discovery rate. The revised probabilities are
converted to scores for reporting purposes, together with a single score threshold to indicate
significance.
Protein scores
The protein score in the result report from an MS/MS search is derived from the ions scores. For
a search that contains a small number of queries, the protein score is the sum of the highest
ions score for each distinct sequence, that is, excluding the scores of duplicate matches, which
are shown in parentheses. A small correction is applied to reduce the
contribution of low‐scoring random matches. This correction is a function of the total number of
molecular mass matches for each query, and is usually very small, except in no‐enzyme
searches.
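A minimal sketch of the standard protein score follows. It deliberately omits the small correction for random matches, whose exact form is not given here, and the peptide sequences and scores are made up for illustration.

```python
def standard_protein_score(matches):
    """Sum the highest ions score for each distinct peptide sequence.

    matches: iterable of (sequence, ions_score) pairs for one protein.
    Duplicate matches to the same sequence contribute only their best score.
    The small random-match correction described in the text is not applied.
    """
    best = {}
    for sequence, score in matches:
        best[sequence] = max(score, best.get(sequence, float("-inf")))
    return sum(best.values())


# Illustrative data: the second PEPTIDEK match is a duplicate and is ignored.
matches = [("PEPTIDEK", 52.0), ("PEPTIDEK", 31.0), ("SAMPLER", 40.0)]
print(standard_protein_score(matches))  # 92.0
```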
This protein score works well for small searches, and provides a logical order to the report. If
multiple queries match to a single protein, but the individual ions scores are below threshold,
the combined ions scores can still place the protein high in the report. However, the standard
protein score is less satisfactory for searches with very large numbers of queries, such as
MudPIT data sets. For each MS/MS query, Mascot retains up to 10 peptide matches. When the
number of queries is comparable with the number of entries in the database, this means that
there can be random, low‐scoring matches for every entry. Although the average number of
random matches per entry might be low, the actual number will follow a distribution, and some
entries will have large numbers of low scoring matches, leading to large protein scores.
While it is obvious from a detailed study of the report that these are meaningless matches, it
would be better to eliminate them entirely. So, if the ratio between the number of queries and
the number of entries in the database exceeds a pre‐determined threshold, the basis for
calculating the protein score is changed. Only those ions scores that exceed one or both
significance thresholds contribute to the score, so that low scoring, random matches have no
effect. This gives a much cleaner report for a large scale search. This threshold is 0.001 by
default, and can be changed on a global basis in the configuration file, mascot.dat, or changed
for a single report by using the format controls at the top of the report. Note that, when
calculating this threshold, if a taxonomy filter is being used, the number of entries in the
database is the number remaining after the taxonomy filter.
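The switch between the two scoring schemes can be sketched as a simple ratio test. The function name is hypothetical; the default of 0.001 is the value quoted above, and the entry count is assumed to be the number remaining after any taxonomy filter.

```python
def use_large_search_scoring(num_queries, num_entries, ratio_threshold=0.001):
    """Return True when the queries-to-entries ratio exceeds the threshold.

    num_entries should be the number of database entries left after any
    taxonomy filter has been applied.
    """
    return num_queries / num_entries > ratio_threshold


print(use_large_search_scoring(100, 500_000))     # False: ratio is 0.0002
print(use_large_search_scoring(20_000, 500_000))  # True: ratio is 0.04
```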
Protein Inference
When MS/MS spectra are searched against a sequence database, we are matching peptides, not
proteins. In most cases, the matched peptides will not be unique to a single protein. Yet, we
usually want to know which proteins were present in the sample. So, we are faced with the
challenge of protein inference: given a set of peptide matches, which proteins do we believe
were present in the sample?
The usual approach is based on the "Principle of Parsimony". We report the minimum set of
proteins that account for the observed peptide matches. If we had four peptide matches, two of
which occurred in protein A and two in protein B but all four were found in protein C, we would
report that protein C had been identified. Proteins A and B might be listed as "sub‐set" proteins.
It is perfectly possible that our sample actually contained a mixture of proteins A and B, but
there is no evidence for this.
The Peptide Summary and Select Summary use a very simple algorithm. First, we take the
protein with the highest protein score, and call this hit number 1. We then take all other
proteins that share the same set of peptide matches or a sub‐set and include these in the same
hit. In the report, they are listed as same‐set and sub‐set proteins. With these proteins removed
from the list, we now take the remaining protein with the highest score and repeat the process
until all the significant peptide matches are accounted for.
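The grouping just described can be sketched in a few lines. This is our own simplified illustration, not Mascot's code: each protein is reduced to a score and a set of peptide matches, and the loop simply runs until no proteins remain.

```python
def group_hits(proteins):
    """Group proteins into hits with same-set and sub-set members.

    proteins: {name: (protein_score, set_of_peptide_matches)}
    Returns a list of hits in report order.
    """
    remaining = dict(proteins)
    hits = []
    while remaining:
        # Take the remaining protein with the highest protein score.
        lead = max(remaining, key=lambda name: remaining[name][0])
        lead_peptides = remaining[lead][1]
        same = sorted(n for n in remaining
                      if n != lead and remaining[n][1] == lead_peptides)
        sub = sorted(n for n in remaining
                     if remaining[n][1] < lead_peptides)  # proper sub-set
        hits.append({"lead": lead, "same_set": same, "sub_set": sub})
        for name in [lead, *same, *sub]:
            del remaining[name]
    return hits


# The A, B, C example from the text, with made-up scores.
proteins = {
    "A": (100, {"p1", "p2"}),
    "B": (90, {"p3", "p4"}),
    "C": (200, {"p1", "p2", "p3", "p4"}),
}
print(group_hits(proteins))
# [{'lead': 'C', 'same_set': [], 'sub_set': ['A', 'B']}]
```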
This sounds simple enough, and works well for small datasets, but larger search results create
difficulties:
• What if two proteins have many strong matches in common but one has an additional
weak match? Should we treat one as the outright winner, and relegate the other to the
status of sub‐set?
• What if we have intersections? That is, the protein is not a sub‐set of any other one
protein, but all the matches can be found in a set of proteins, each of which has
additional matches.
• In many cases, the exact sequence of the protein that was analysed is not in the database.
All the peptide sequences are present, but spread across several homologous proteins,
which might be splice variants or represent different combinations of SNPs.
The Protein Family Summary tries to address these difficulties by clustering proteins into
families. The algorithm works as follows:
1. Create a list of proteins, ordered by protein score
2. Take the highest scoring protein
3. Find all the family members for this protein:
o select all matches with a score at or above the homology threshold
o for each match, select all the other proteins that contain this match (using the
score as a test to include matches that are identical matches though not
identical sequences, e.g. I to L substitution or other differences that have no
impact on the score)
o for each new protein, select all new matches with a score at or above the
homology threshold
o loop until all related proteins and matches have been found
Note that this grouping into families is based on significant matches. Non‐significant matches
are ignored.
4. Report this family as a single hit. All these proteins can be removed from the list
5. For each protein in the family, make a list of the distinct peptide sequences. That is,
ignore differences in score, modifications, charge, etc. Where there are duplicate
matches, use the highest score
6. Divide and group the proteins into same‐set proteins and sub‐set proteins; sub‐sets
include intersections
o Where there are same‐set proteins, collapse into a single family member
o Move any proteins that are sub‐sets or intersections to the sub‐sets list
7. Perform hierarchical clustering on the family members, using the score excess over threshold
of the non‐shared matches as the distance metric
8. Loop from step 2 until no more proteins remain that contain matches with homology score or
better
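Step 3 above, finding all the family members, amounts to following shared significant matches until nothing new turns up. The sketch below is a simplified illustration under our own assumptions: each peptide match has a single score, and the score-based test for equivalent sequences (such as I to L substitution) is left out.

```python
def find_family(seed, protein_peptides, peptide_scores, homology_threshold):
    """Collect every protein linked to `seed` through significant matches.

    protein_peptides: {protein: set of peptide matches}
    peptide_scores:   {peptide: ions score}
    """
    significant = {pep for pep, score in peptide_scores.items()
                   if score >= homology_threshold}
    family = {seed}
    frontier = [seed]
    while frontier:  # loop until no new proteins are found
        current = frontier.pop()
        shared = protein_peptides[current] & significant
        for other, peptides in protein_peptides.items():
            if other not in family and peptides & shared:
                family.add(other)
                frontier.append(other)
    return family


protein_peptides = {
    "P1": {"a", "b", "x"},
    "P2": {"b", "c"},
    "P3": {"c", "d"},
    "P4": {"e"},
    "P5": {"x"},  # linked to P1 only by a non-significant match
}
peptide_scores = {"a": 50, "b": 45, "c": 60, "d": 30, "e": 70, "x": 10}
print(sorted(find_family("P1", protein_peptides, peptide_scores, 40)))
# ['P1', 'P2', 'P3']
```

In this made-up example P1 and P3 end up in the same family even though they share no match directly; they are linked through P2.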
The goal is to present the possible protein assignments clearly, so that someone with knowledge
of the biology can make an informed decision as to which proteins are present. In most cases,
there will be some ambiguity about precisely which proteins are present. As mentioned earlier,
the exact sequence of an analyte may not be in the database, and peptide matches may be
distributed across multiple, homologous database entries. If it is essential to characterise the
complete protein sequence, or to choose between splice variants, or to confirm a SNP, it is likely
that additional, targeted experiments will be required.
Hierarchical Clustering
To cluster proteins into families, we use the score of the non‐shared matches as the distance
between two proteins. More precisely, we use the score excess over the significance threshold,
since a score below significance threshold could be random, and should not be taken as
evidence for two different proteins being present. This means that matches below threshold
play no part in the clustering process. Each distinct peptide sequence is represented once by the
match with the highest score. Matches to the same sequence with different charge states or
with different modifications are considered duplicates.
If two proteins have the same set of peptide matches, the distance between them is zero. If
they have just a single shared match, the distance between them is the sum of the score
excesses of all the non‐shared matches in one protein, since discarding these would make the
protein a sub‐set of the other, based on the single shared match.
There are some subtleties to this procedure. Consider the case of two proteins which have
different peptide matches to the same query with the same score. Only one of these matches
can be correct, but we don't know which. One obvious example is where the two sequences
differ only in exchange of I and L. In terms of the mass spectrum, these sequences are identical.
Unless the mass accuracy is high, the same is true for exchange of Q and K or F and oxidised M.
Clearly, a sequence containing F at a particular position is very different, in biological terms,
from one containing M at the same position. But, if the scores are the same, there is simply no
evidence from the mass spectrometry data for two proteins. In terms of a distance matrix, we
must treat it as if there were no match to either peptide.
Now, consider the case where we have two proteins with different peptide matches to the same
query and the scores are not the same. Assume the threshold is 40 and one has a score of 50
and the other has a score of 60. Again, only one of these matches can
be correct; it is not the same as if they were independent matches to different queries.
Extending the logic that matches to the same query with the same score correspond to a
distance of zero, matches to the same query with different scores correspond to a distance that
is the score difference. In this example, the distance would be 10. If the two matches came from
different queries, and could be treated independently, the distance would be
(60 - 40) + (50 - 40) = 30.
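The two cases in this example can be written out as a small sketch. The function names are ours, and the code covers only a single pair of non-shared matches, not the full distance calculation.

```python
def same_query_distance(score_a, score_b):
    """Two different peptide matches to the same query: only one can be
    correct, so the distance is just the difference between the scores."""
    return abs(score_a - score_b)


def independent_distance(score_a, score_b, threshold):
    """Matches to different queries: each contributes its score excess over
    the significance threshold (matches below threshold contribute nothing)."""
    return max(score_a - threshold, 0) + max(score_b - threshold, 0)


print(same_query_distance(60, 50))       # 10
print(same_query_distance(55, 55))       # 0
print(independent_distance(60, 50, 40))  # 30
```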
To create the dendrogram, we first compute a distance matrix, which is the distance between
each pair of proteins. The two proteins separated by the smallest distance are joined to create a
node, and the length of the branches from the node is the score distance between the
proteins. The two joined proteins are removed from the list and replaced by the node, and the
distances between the new node and all other remaining proteins (or nodes) are computed. The
process is repeated until only one node remains.
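The joining procedure can be sketched as follows. The text does not say how the distance from a new node to the remaining proteins is computed, so this sketch assumes the average of the distances to the node's two children; that linkage rule, the function name and the distances in the example are all assumptions.

```python
from itertools import combinations


def build_dendrogram(names, pair_distances):
    """Join the closest pair repeatedly until one node remains.

    names: list of protein names
    pair_distances: {(a, b): distance} for every pair of names
    Returns a nested tuple (left, right, distance) describing the tree.
    """
    dist = {frozenset(pair): d for pair, d in pair_distances.items()}
    nodes = list(names)
    while len(nodes) > 1:
        a, b = min(combinations(nodes, 2), key=lambda p: dist[frozenset(p)])
        merged = (a, b, dist[frozenset((a, b))])
        nodes = [n for n in nodes if n not in (a, b)]
        for n in nodes:
            # Assumed linkage: average of the distances to the two children.
            dist[frozenset((merged, n))] = (
                dist[frozenset((a, n))] + dist[frozenset((b, n))]
            ) / 2
        nodes.append(merged)
    return nodes[0]


tree = build_dendrogram(
    ["A", "B", "C"],
    {("A", "B"): 10, ("A", "C"): 30, ("B", "C"): 40},
)
print(tree)  # ('C', ('A', 'B', 10), 35.0)
```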
When the dendrogram (or tree) is drawn, the order is chosen to avoid any branches crossing.
There is no other significance to the order of the branches, and there are many possible ways to
order the branches so as to avoid crossings. In the tabular part of the report, proteins are sorted
in order of decreasing score, and this will often be different from the dendrogram order.
Note that, if you select a pair of family members from a large family, it is perfectly possible that
they will have no shared matches. Each family member will have shared matches with at least
one other family member, or they would not have been grouped into the same family, but this
doesn't mean that there are going to be shared matches between every pair.
Large Search Results and the Peptide and Select Summary Reports
The Protein Family Summary is designed for large search results. If, for some reason, you wish to
view results using the Peptide Summary or Select Summary reports, this section contains some
tips.
The format controls near the top of the report can help streamline the results from a large
search by eliminating most of the "junk". These options can also be selected by adding URL
switches to the report URL.
MudPIT Protein Scoring: By default, large searches will switch to using more aggressive protein
scoring. This removes many of the junk protein hits, which have high protein scores but no high
scoring peptide matches. Do not be tempted to switch back to standard scoring.
Require Bold Red: The Peptide Summary and Select Summary reports do not detect
intersections. In these reports, red and bold typefaces are used to highlight the most logical
assignment of peptides to proteins. The first time a peptide match to a query
appears in the report, it is shown in bold face. Whenever the top ranking peptide match
appears, it is shown in red. Thus, a bold red match is the highest scoring match to a particular
query listed under the highest scoring protein containing that match. This means that protein
hits with many peptide matches that are both bold and red are the most likely assignments.
Conversely, a protein that does not contain any bold red matches is an intersection of proteins
listed higher in the report.
Requiring a protein hit to include at least one bold red peptide match is a good way to filter
homologous proteins from a report. You can turn this on using a checkbox in the format
controls. The down‐side is that you may sometimes throw out the wrong protein! For example,
imagine you are searching with a taxonomy of mammals but are mainly interested in yeti
proteins. If the same strong peptide matches are found in a yeti protein and also in the human
homologue, and one or more junk peptide matches prevent the two proteins collapsing into a
single hit, but give the human protein a slightly higher score, that is the one that will feature in
the report.
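The bold and red rules, and the filter built on them, can be sketched like this. It is an illustration under our own simplified data model: the report is an ordered list of proteins, each with (query, peptide, rank) matches, and the peptide names are made up.

```python
def mark_bold_red(report):
    """Flag each match as bold and/or red.

    report: list of (protein, [(query, peptide, rank), ...]) in report order.
    Returns {protein: [(query, peptide, bold, red), ...]}.
    """
    seen = set()
    marked = {}
    for protein, matches in report:
        rows = []
        for query, peptide, rank in matches:
            bold = (query, peptide) not in seen  # first appearance in the report
            red = rank == 1                      # top-ranking match for the query
            seen.add((query, peptide))
            rows.append((query, peptide, bold, red))
        marked[protein] = rows
    return marked


def has_bold_red(rows):
    """True if at least one match is both bold and red."""
    return any(bold and red for _query, _peptide, bold, red in rows)


# The human protein scores slightly higher because of one junk match,
# so it is listed first and claims the bold red matches.
report = [
    ("human", [(1, "PEPA", 1), (2, "PEPB", 1), (3, "JUNK", 1)]),
    ("yeti", [(1, "PEPA", 1), (2, "PEPB", 1)]),
]
marked = mark_bold_red(report)
print([name for name, rows in marked.items() if has_bold_red(rows)])
# ['human']
```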
Ignore Ions Score Below: You can minimise the previous problem by judicious use of the Ions
score cut‐off field. By setting this to (say) 20, you cut out all of the very low scoring, random
peptide matches. This means that homologous proteins are more likely to collapse into a single
hit, avoiding the need to choose between them.
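The effect of the cut-off on the yeti example can be shown with a small sketch. The scores and peptide names are invented, and the function name is ours.

```python
def apply_cutoff(protein_matches, cutoff):
    """Drop peptide matches whose ions score is below the cut-off.

    protein_matches: {protein: {peptide: ions score}}
    Returns {protein: set of peptides that survive the cut-off}.
    """
    return {
        protein: {pep for pep, score in matches.items() if score >= cutoff}
        for protein, matches in protein_matches.items()
    }


before = {
    "human": {"PEPA": 60, "PEPB": 55, "JUNK": 8},
    "yeti": {"PEPA": 60, "PEPB": 55},
}
after = apply_cutoff(before, 20)
# With the junk match removed, the two proteins have the same set of
# matches and can collapse into a single hit.
print(after["human"] == after["yeti"])  # True
```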
Suppress the pop‐ups: The JavaScript pop‐up windows, which show the top 10 peptide matches
for each query, are very useful, but they make the HTML report much larger and slower to load
in a web browser. If you have a report that never seems to load, or is very slow to scroll, try
using the radio buttons to suppress pop‐ups.