EMI_2119_sm_Figs_S2-S4

Supplemental text and Figures S2, S3 and S4
Statistical analysis of presence/absence of genes
Hybridization data were normalized by local fitting of an M-A plot, using
a Lowess algorithm implemented in R 2.2.1 (http://cran.r-project.org/bin/windows/base/), as
previously described (Molenaar et al., 2005). The M-value was defined as log2(Ch1/Ch2)
and the A-value was calculated as [log2(Ch1) + log2(Ch2)]/2,
where channel 1 (Ch1) is the Cy5 signal (tested DNA of the query strain) and channel 2
(Ch2) is the Cy3 signal (reference DNA of strain WCFS1). As a control, DNA from the reference
strain WCFS1 was labeled with both dyes and hybridized on one array. This self-self
hybridization showed a narrow range of M-values between -0.2 and +0.2 (Figure S2), and
only one of the probes displayed a signal outside the M = +/- 1 limits, indicating that the
system is unlikely to produce many outliers that would lead to misinterpretation.
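The M- and A-value definitions above can be illustrated with a minimal Python sketch. The function name and the intensities are hypothetical, chosen only for illustration, and the Lowess fit itself is not reproduced here; in the normalization step the fitted trend of M as a function of A would be subtracted from each M-value.

```python
import math

def ma_values(ch1, ch2):
    """Return (M, A) for one probe from raw channel intensities.

    ch1: Cy5 signal (query strain); ch2: Cy3 signal (reference strain WCFS1).
    """
    if ch1 <= 0 or ch2 <= 0:
        raise ValueError("intensities must be positive to take a logarithm")
    m = math.log2(ch1 / ch2)                     # M = log2(Ch1/Ch2)
    a = (math.log2(ch1) + math.log2(ch2)) / 2    # A = [log2(Ch1) + log2(Ch2)]/2
    return m, a

# Hypothetical intensities, for illustration only (not data from the study).
equal = ma_values(1024.0, 1024.0)   # same signal in both channels
weak = ma_values(128.0, 1024.0)     # query signal 8-fold lower than reference
```

With these invented values, equal signals give M = 0, while an 8-fold lower query signal gives M = -3, which is in the region interpreted in this text as absent signal.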
We now have 2 complete L. plantarum genomes (WCFS1, JDM1) and 3
sequenced L. plantarum genomes still in contigs, and are doing a full genome comparison
of these 5 strains (R.J. Siezen et al., in preparation). Sequence homology between these
strains is about 97-99% at nucleotide level, which should allow efficient hybridization of
present genes to the CGH microarray with WCFS1 probes.
Examples of M-A plots are shown in Figure S3 for strains (a) WCFS1, (b)
ATCC14917, and (c) DKO22. In general, a cloud of black dots around M-value 0
represents positive hybridization to probes, while red dots represent M values below -1.5.
The cutoff value of M = -1.5 for presence is somewhat arbitrary, but is based on the fact
that the bulk of probe M-values (85.5%) lies between +/- 1, and a minimum is seen at an
M-value of -1.5 (Figure S4). Although we could have taken a narrower range of +/- 0.25, as
suggested by the self-self hybridization of Figure S2, it should be realized that the
Lowess normalization, applied to remove total signal dependent dye bias, is influenced
by the presence of a large number of probes with low signal strength (not present in the
self-self hybridization). This in turn increases the spread of the values around M = 0. The
probes with no or very low signal clearly show peaks around M = -3, -10 and -15. The
fact that different peaks appear in Figure S4 is correlated with different batches of
microarrays and different batches of hybridization. The intermediate range of M = [-1,-2],
reflecting reduced hybridization that presumably results from lower sequence identity, is
also somewhat arbitrary, but based on the fact that the probes peaking around M = -3
(interpreted as absent signals) seem to have a rather narrow distribution that has a very
low density above M = -2.
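As a rough illustration of the limits discussed above, the following Python sketch computes the share of probes within a given M range and assigns a single probe M-value to one of three zones. The function names, zone labels and example values are assumptions made for this sketch; the M-values are invented and are not data from the study.

```python
def fraction_within(m_values, low, high):
    """Fraction of M-values with low <= M <= high."""
    if not m_values:
        raise ValueError("no M-values given")
    inside = sum(1 for m in m_values if low <= m <= high)
    return inside / len(m_values)

def probe_zone(m):
    """Assign one probe M-value to a zone using the limits named in the text."""
    if m > -1.0:
        return "hybridized"
    if m >= -2.0:
        return "intermediate"       # the M = [-1,-2] range
    return "low or no signal"

# Invented M-values, for illustration only.
example = [0.1, -0.3, 0.6, -1.4, -3.1, -2.9, 0.0, -10.2]
share = fraction_within(example, -1.0, 1.0)
zones = [probe_zone(m) for m in example]
```

Applied to the real data set, a calculation like `fraction_within(all_m_values, -1.0, 1.0)` would correspond to the 85.5% figure quoted above.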
M-values for the three probes per gene were averaged and used to predict gene
presence or absence. Presence of a target gene was scored if the average M-value was
above -1.5. For the large majority of genes, all probes per gene gave consistent M-values
of either >-1.0 or <-2.0. However, for about 50 genes the 3 (or more)
probes per gene gave inconsistent M-values, some in the range scored as present (M>-1.5)
and others in the range scored as absent (M<-1.5), and this frequently led to an average
M-value between -1 and -2. To test if this effect was due to sequence variation in the query genes, we
diagnostically sequenced two of these genes (lp_0575 and lp_0576) in 6 of the query
strains (NCTH19-1, Q2, KOG18, KOG24, Lp95, DKO22) and one gene (lp_1823) in 23
of the query strains (WCFS1, ATCC14917, MLC43, CHEO3, NCTH19-1, NCTH19-2,
NCTH27, B2535, NOS140, Q2, B2757, H14, CECT4645, B2776, KOG18, KOG24,
LMG9208, Lp95, BLL, CECT221, N58, X17, LAC7, LD3, DKO22). In all cases the
genes were found to be present. However, many single nucleotide polymorphisms (SNPs)
were detected in these genes. For gene lp_1823 (399 nt), the nucleotide sequences
of the 23 query strains were found to differ from that of strain WCFS1 by up to 25%, and
the number of mismatches to the three 60-mer WCFS1 probes on the array was
maximally 14, 12 and 15 (20-25%). Several query genes even had indels relative to strain
WCFS1. For genes lp_0575 (975 nt) and lp_0576 (807 nt), the sequences in the 6 query
strains were identical, but all differed by about 15% from the corresponding genes in
strain WCFS1. Comparison of these genes in 4 fully sequenced L. plantarum genomes
(WCFS1, ATCC14917, NC8, JDM1) confirmed these variations, but also showed that
surrounding genes (e.g. lp_0573, lp_0574, lp_1820, lp_1824) had 96-100% identity in
these 4 strains.
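The 20-25% range quoted above for the lp_1823 probes follows directly from the mismatch counts and the 60-nt probe length, as this short Python check shows. The function name is hypothetical; the counts and probe length are those given in the text.

```python
PROBE_LENGTH = 60  # the probes are 60-mers, as stated in the text

def mismatch_percent(mismatches, probe_length=PROBE_LENGTH):
    """Percentage of probe positions that mismatch the query sequence."""
    if not 0 <= mismatches <= probe_length:
        raise ValueError("mismatch count must lie between 0 and the probe length")
    return 100.0 * mismatches / probe_length

# Maximal mismatch counts reported for the three lp_1823 probes.
percents = [round(mismatch_percent(n), 1) for n in (14, 12, 15)]
```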
Therefore, we conclude that genes lp_0575, lp_0576, and lp_1823 are present in the
query strains, but due to many mismatches the query DNA hybridized poorly to the
probes and resulted in intermediate signals. This presumably also applies to the other
genes that show intermediate hybridization.
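The gene-level scoring rule used here (average the per-probe M-values, score presence above -1.5) can be sketched in Python as follows. The extra flag for averages between -2 and -1 is an assumption added for this sketch, to mark genes that may be present but divergent, as concluded for lp_0575, lp_0576 and lp_1823. Function names and probe values are invented.

```python
from statistics import mean

PRESENCE_CUTOFF = -1.5              # cutoff on the average M-value, from the text
INTERMEDIATE_RANGE = (-2.0, -1.0)   # range described as intermediate in the text

def score_gene(probe_m_values):
    """Score one gene from the M-values of its probes.

    Returns (average M, scored as present?, flagged as intermediate?).
    """
    if not probe_m_values:
        raise ValueError("a gene needs at least one probe M-value")
    avg = mean(probe_m_values)
    present = avg > PRESENCE_CUTOFF
    low, high = INTERMEDIATE_RANGE
    intermediate = low <= avg <= high
    return avg, present, intermediate

# Invented probe values for three hypothetical genes.
consistent_present = score_gene([0.2, -0.1, 0.3])
consistent_absent = score_gene([-3.0, -2.8, -3.2])
inconsistent = score_gene([-0.4, -1.9, -2.5])
```

In this example the third gene averages about -1.6, so the cutoff alone would score it as absent; the intermediate flag marks it as a candidate for the kind of diagnostic sequencing described above.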
For many paralogs, no unique probes could be designed to distinguish between them.
Therefore one set of probes was used for highly similar paralogs, and all paralogous
genes were assumed to be either present or absent in query strains, depending on the
hybridization M-values.
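The handling of paralogs described above amounts to copying one presence call to every gene that shares a probe set. A minimal Python sketch, assuming a simple mapping from probe set to gene names (all names are placeholders, not identifiers from the array design):

```python
def call_paralogs(shared_probe_sets, probe_set_present):
    """Give every paralog the presence call of its shared probe set.

    shared_probe_sets: probe-set name -> list of gene names it represents.
    probe_set_present: probe-set name -> True (present) or False (absent).
    """
    calls = {}
    for probe_set, genes in shared_probe_sets.items():
        for gene in genes:
            calls[gene] = probe_set_present[probe_set]
    return calls

# Placeholder names, for illustration only.
groups = {"set_A": ["gene_a1", "gene_a2"], "set_B": ["gene_b1"]}
present = {"set_A": True, "set_B": False}
calls = call_paralogs(groups, present)
```

A consequence of this design is that the array cannot report a query strain carrying only some members of a paralog group.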
Figure S2: Self-self hybridization of WCFS1. [Plot of the density of the distribution against M.]
Figure S3a: M-A plot for WCFS1-WCFS1 (self) hybridization. [M-value (2log(cy3/cy5)) plotted against A-value (10log(cy3*cy5)/2).]
Figure S3b: M-A plot for WCFS1-ATCC14917 hybridization. [M-value (2log(cy3/cy5)) plotted against A-value (10log(cy3*cy5)/2).]
Figure S3c: M-A plot for WCFS1-DKO22 hybridization. [M-value (2log(cy3/cy5)) plotted against A-value (10log(cy3*cy5)/2).]
Figure S4: Density distribution of probes on all microarrays. M values of +/- 1 and +/- 2
are indicated with dashed lines.