RADical-Worms

Natural adaptation to geogenic and anthropogenic soil contamination in the earthworm Lumbricus rubellus: evolutionary history and functional genomics
The soil underneath our feet
Metal burdens of UK soils - Cu
McGrath and Loveland, Soil Geochemical Atlas: 1 per 25 sq km (yellow is highest)
BGS G-Base stream sediment, Northern Britain: 1 per 2 sq km (red is highest)
Metal burdens of UK soils - As
Wolfson Geochemical Atlas (Webb et al.): stream sediment
BGS G-Base: stream sediment
Anthropogenic burdens
Local adaptation to metal burdens
Part of the abandoned Great Consols site in Devon. Concentrations of As reach 20,000 ppm around this building (the “Arsenic labyrinth”) but Cu concentrations are near baseline.
[Figure: panels a, b, c; LC50 against concentration of index chemical]
Local adaptation versus migration
what if ... (2)
(1)
The British Isles: a natural laboratory
Ages of metal islands
Available adaptation period (years): 10,000 / 1-2,000 / 1-200
www.faunaeur.org
Lumbricus rubellus as an environmental sentinel
• Wide distribution
• Epigeic / Ecosystem engineer
• Sexually Reproducing
• Life history model
• Ecotoxicology
[Figure: life-cycle diagram, stages 1-5, with parameters Sj, nt, Sr,c, tj, tr]
Three stage - Continuous reproduction - hermaphrodite
$$1 = \sum_{t = t_j + 1}^{t_j + t_a - 1} e^{-rt} \cdot n_t \cdot s_j \cdot s_a^{\left[(t - t_j)/t_a\right]^{c}}$$
Kammenga et al. (2003)
OIKOS 100, 89-95
• Dispersal rate 10 m per year
• As and Cu tolerance
• Environmental Genomic Model !!
Langdon et al., (2003)
ES&T 22, 2344-2348
• Good “species”
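The life-history equation on this slide can be solved numerically for the intrinsic rate of increase r. The sketch below is only an illustration: it reads the survival term as s_a raised to the power [(t - t_j)/t_a]^c, and every parameter value is a hypothetical placeholder rather than a measured L. rubellus value. The function names are made up for this example.

```python
import math


def lotka_sum(r, n_t, s_j, s_a, t_j, t_a, c):
    """Right-hand side of the three-stage equation for a trial value of r."""
    total = 0.0
    # Sum over t = t_j + 1 ... t_j + t_a - 1 (range excludes its upper bound).
    for t in range(t_j + 1, t_j + t_a):
        survival = s_j * s_a ** (((t - t_j) / t_a) ** c)
        total += math.exp(-r * t) * n_t * survival
    return total


def solve_r(n_t, s_j, s_a, t_j, t_a, c, lo=-1.0, hi=1.0, tol=1e-10):
    """Bisection on r; the sum decreases monotonically as r increases."""
    def f(r):
        return lotka_sum(r, n_t, s_j, s_a, t_j, t_a, c) - 1.0

    if f(lo) < 0 or f(hi) > 0:
        raise ValueError("root not bracketed; widen [lo, hi]")
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if f(mid) > 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


# Hypothetical parameters (time in weeks), chosen only to exercise the code.
params = dict(n_t=2.0, s_j=0.5, s_a=0.5, t_j=10, t_a=40, c=1.0)
r = solve_r(**params)
print(f"r = {r:.4f} per week")
```

Bisection is used here because it needs no derivative and no external library; any bracketing root finder would do the same job.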
Lumbricus rubellus:
as a genomic model
• Genome size ~430 Mb
• Genome N50 1,425 bp
• Combined contig length 526 Mb
• 89% of 20K ESTs mapped
• But... a species complex!
• 2 major lineages separating at genome and mitochondrial level
….and mostly thanks to Mark and Ben
[Figure: ordination plot, Coord 1 against Coord 2; groups labelled CC, CS, E, W]
Key Questions
• Pierfrancesco Sechi - Phylogeography
• Question 1: Can we use SNPs to map how soil chemistry has conditioned soil re-colonisation since the last ice age? – Pierfrancesco Sechi
• Question 2: How does diversity impact adaptation to local pressures? – Pierfrancesco Sechi
• Craig Anderson – Mapping metal tolerance loci
• Question 3: Can we use SNPs to determine the functional loci responsible for metal tolerance? – Craig Anderson
• Question 4: Can we use SNP data to determine how metal burdens affect the genetic diversity of soil animals on local scales? – Craig Anderson
Pan-European Collection
Specimens received
from:
• France
• Hungary
• Serbia
• Estonia
• Sweden
• Finland
Projected response:
• 2 more locations from
France
• Italy
• Spain
• Portugal
• Austria
Numbers : Pan European Analysis
40% GC / 420 Mb genome / 10-fold coverage / 20 M reads
Restriction enzyme   RADseq sites   RADseq tags   Max plex.
SbfI                 7310           14620         137
NotI                 2230           4460          448
BbvCI                21410          42820         93
EagI                 34351          68702         29
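The "Max plex." column appears to follow from reads per lane divided by (tags x coverage). That relationship is inferred from the numbers, not stated on the slide, so the sketch below should be read as a consistency check under that assumption. The function name `max_plex` is made up for this example.

```python
def max_plex(reads_per_lane, loci, coverage):
    """Individuals that fit in one lane at the requested coverage per locus."""
    return round(reads_per_lane / (loci * coverage))


READS_PER_LANE = 20_000_000
COVERAGE = 10

# enzyme: (RADseq sites, RADseq tags, max plex as listed on the slide)
table = {
    "SbfI": (7310, 14620, 137),
    "NotI": (2230, 4460, 448),
    "BbvCI": (21410, 42820, 93),
    "EagI": (34351, 68702, 29),
}

for enzyme, (sites, tags, listed) in table.items():
    by_tags = max_plex(READS_PER_LANE, tags, COVERAGE)
    by_sites = max_plex(READS_PER_LANE, sites, COVERAGE)
    print(f"{enzyme:6s} listed={listed:4d} by_tags={by_tags:4d} by_sites={by_sites:4d}")
```

Under this assumption the listed values for SbfI, NotI and EagI match the per-tag calculation, while the BbvCI value of 93 matches the per-site calculation (the per-tag figure would be 47).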
Plan for Pan European analysis
Source        Worms         No. by haplotype and location   Estimated coverage
9 per lane    10 per site   90                              10 fold
9 per lane    10 per site   90                              10 fold
Total         20 sites      180 worms                       10 fold for each
Can we use SNPs to determine the functional loci responsible for metal tolerance?
1. Derive a SNP map from phenotyped individuals, including Tt x Tt F3 progeny.
2. Perform high-density SNP mapping of multiple independent populations resident on contaminated soils, compared with proximal non-exposed populations.
Field Sites
Carrock Fell: 10,000 mg/kg As and 700 mg/kg Cu
Ecton: 3,200 mg/kg Cu, 4,500 mg/kg Pb and 30,000 mg/kg Zn
Cwmystwyth: 1,200 mg/kg Pb and 30,000 mg/kg Zn
Shipham: 4,000 mg/kg Pb, 14,000 mg/kg Zn and 800 mg/kg Cd
Caradon: 800 mg/kg Cu
Devon Great Consols: 9,000 mg/kg As and 1,700 mg/kg Cu
Genetic mapping of tolerance traits
1 Pb/Zn site – lineage B | 2 x As/Cu sites – lineage A
[Figure: pedigree diagrams (Fam1) showing TT, Tt and tt genotypes across generations]
RAD tag F1 Vs Phenotyped F3
[Figure: pedigree diagram, generations F1, F2 and F3; genotypes Tt and tt]
Selection of F3s for RAD Analysis
[Figure: cumulative frequency against mass. Panels A, B, C: T is recessive; T is dominant; T is codominant]
• Following screening, if the distribution is spread as in graph A, B or C, we would simply analyse similar numbers of homozygous +ve and -ve individuals.
• It is more likely that the data will be spread as in graph D than as in the others, so we should distinguish between a set number of acknowledged phenotypes (graph D).
[Figure: panel D, T is a multilocus trait; cumulative frequency against mass (% increase growth)]
Numbers : Pedigree analysis
40% GC / 420 Mb genome / 10-fold coverage / 20 M reads
Restriction enzyme   RADseq sites   RADseq tags   Max plex.
SbfI                 7310           14620         137
NotI                 2230           4460          448
BbvCI                21410          42820         93
EagI                 34351          68702         29
Plan for pedigree analysis : Paired-end BbvI (for scaffolding)
Source                         Worms                 No. (pooled by phenotype)   Estimated coverage
Mixed F1 from CW, CF and DC    F1s from 3 sites      6 x 3 = 18                  52 fold
Cwmystwyth (B)                 F3 from 3 families    3 x 20 = 60                 16 fold
Carrock Fell (A1)              F3 from 3 families    3 x 20 = 60                 16 fold
Devon Great Consols (A2)       F3 from 3 families    3 x 20 = 60                 16 fold
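The estimated coverage figures in the pedigree plan can be reproduced with a simple division. The sketch assumes that each row gets one lane of 20 M reads and that reads are spread over the 21,410 BbvCI sites. Both assumptions are inferred from the numbers on the slides rather than stated there, and the function name is made up for this example.

```python
def estimated_coverage(reads_per_lane, n_worms, loci):
    """Mean reads per locus per worm when n_worms share one lane."""
    return reads_per_lane / (n_worms * loci)


READS_PER_LANE = 20_000_000
BBVCI_SITES = 21_410  # assumption: coverage is counted per site, not per tag

f1_cov = estimated_coverage(READS_PER_LANE, 6 * 3, BBVCI_SITES)
f3_cov = estimated_coverage(READS_PER_LANE, 3 * 20, BBVCI_SITES)
print(f"Mixed F1 (18 worms): {f1_cov:.1f} fold")
print(f"F3 families (60 worms): {f3_cov:.1f} fold")
```

If coverage were instead counted per tag (two tags per site), both figures would halve.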
Resident analysis
[Figure: sampling layout. T1, T2, T3 within the boundary of metal contamination; C1, C2, C3 in surrounding geochemically equivalent non-contaminated soils]
Should controls be contiguous to the site, remote, or both?
Numbers : Resident analysis
40% GC / 420 Mb genome / 10-fold coverage / 20 M reads
Restriction enzyme   RADseq sites   RADseq tags   Max plex.
SbfI                 7310           14620         137
NotI                 2230           4460          448
BbvCI                21410          42820         93
EagI                 34351          68702         29
Plan for resident analysis - Paired-end BbvI (for scaffolding)
Source                     Worms                                  No. (pooled by phenotype)   Estimated coverage
Cwmystwyth (B)             3 x 10 within site / 3 x 10 outside    6 x 10 = 60                 16 fold
Carrock Fell (A1)          3 x 10 within site / 3 x 10 outside    6 x 10 = 60                 16 fold
Devon Great Consols (A2)   3 x 10 within site / 3 x 10 outside    6 x 10 = 60                 16 fold
Interpretation L. rubellus
Candidate targets for Cu tolerance
“It may be doubted whether there are many other animals ….. which have played so important a part in the history of the world.” Darwin, 1883
To Professor A. John Morgan, who taught me the truth of this saying.
Bulk Segregant Mapping
1. Produced a RAD tag library from the parental genomic DNA digested with SbfI: 1.4 million markers with parental barcodes.
2. Removed markers that lacked an SbfI site: 608,000 BP and 760,000 RS tags.
3. Refined sequences to those with a maximum of a single mismatch: 41,622 tags.
4. Selected RAD tags that had at least 8 instances in one strain and none in the other: 1,097 BP and 1,890 RS tags.
5. Then pooled DNA by phenotype, developed RAD tags and sequenced.
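The strain-specific selection step (at least 8 instances in one strain and none in the other) can be sketched as a filter over per-tag read counts. The tag names and counts below are invented for illustration, and the function is a hypothetical helper, not part of any published pipeline.

```python
def strain_specific_tags(counts, min_count=8):
    """Split tags into those seen only in BP and those seen only in RS.

    counts maps a tag identifier to (reads in BP parent, reads in RS parent).
    A tag qualifies if it has at least min_count reads in one strain and
    zero reads in the other.
    """
    bp_only = [tag for tag, (bp, rs) in counts.items() if bp >= min_count and rs == 0]
    rs_only = [tag for tag, (bp, rs) in counts.items() if rs >= min_count and bp == 0]
    return bp_only, rs_only


# Invented example counts: (BP reads, RS reads)
example_counts = {
    "tag_001": (12, 0),   # BP-specific
    "tag_002": (0, 9),    # RS-specific
    "tag_003": (15, 14),  # shared, rejected
    "tag_004": (7, 0),    # too few reads, rejected
    "tag_005": (8, 1),    # present in both, rejected
}

bp_tags, rs_tags = strain_specific_tags(example_counts)
print("BP-specific:", bp_tags)
print("RS-specific:", rs_tags)
```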
Main Questions
• How many lanes do we have?
Need to decide:
• What resolution do we want, i.e., how many tags per individual?
• What restriction enzyme can we use?
• How many individuals per population?
Importantly:
• How many times will each tag be sequenced? This restricts the resolution.
• What are the odds of sequencing all individuals, rather than only a subsection of the population?
Eric Johnson’s RAD sequencing method
[Figure: crossing scheme with numbers of individuals per cross. BP x RS (dominant), 1 x 1 cross; 1 x 1 full-sib cross; produce 96 juveniles, pooled as 31 displaying the]
• Genome size (bp) = (0.978 x 10^9) x DNA content (pg) (Dolezel et al., 2003: http://www3.interscience.wiley.com/journal/102526737/abstract?CRETRY=1&SRETRY=0)
• Lumbricus rubellus DNA content estimated to be 0.43 pg (+/- 0.02) (Gregory and Hebert, 2002). Thus the size in bp is 0.978 x 10^9 x 0.43 = 420,540,000 bp (~420 Mb).
• Searched genome for EcoRI (5'-GAATTC-3') and SbfI (5'-CCTGCAGG-3') restriction site matches.
• These restriction enzymes were used throughout Eric Johnson’s papers.
• There are 60519 and 3318 sites in the earthworm (EW) genome, respectively, each expected to generate 2 tags.
• Sequencing would be on an Illumina GAIIx, which should give about 12 million high-quality 100 bp reads per lane.
• Dog genome estimated to be max of 2.47 Gb, Karlsson et al.
(2007) used StyI to get 64,000 markers, cut down to 27,000
high-performing markers for 2 traits with genome-wide
association mapping within dog breeds (using ~10 affected and
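The genome-size conversion and the restriction-site search described above can be sketched as follows. The toy sequence is invented for illustration; a real search would run over the assembled contigs. Because both recognition sequences are palindromic, scanning one strand is enough.

```python
def genome_size_bp(dna_content_pg):
    """Convert DNA content in picograms to base pairs (1 pg = 0.978 x 10^9 bp)."""
    return 0.978e9 * dna_content_pg


def count_sites(sequence, motif):
    """Count occurrences of a recognition sequence, allowing overlaps."""
    sequence, motif = sequence.upper(), motif.upper()
    count, start = 0, sequence.find(motif)
    while start != -1:
        count += 1
        start = sequence.find(motif, start + 1)
    return count


ECORI = "GAATTC"
SBFI = "CCTGCAGG"

size = genome_size_bp(0.43)
print(f"Estimated genome size: {size:,.0f} bp")

# Invented toy sequence containing two EcoRI sites and one SbfI site.
toy = "AAGAATTCGGCCTGCAGGTTGAATTCA"
for name, motif in (("EcoRI", ECORI), ("SbfI", SBFI)):
    sites = count_sites(toy, motif)
    print(f"{name}: {sites} sites, {2 * sites} expected tags")
```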
Design 1 - Single Family Intraspecific Analysis
• An all-in-one analysis: would take all individuals (~40, assuming we require 60,000 markers with each marker sequenced 5 times) of one F1 cross per site for sequencing.
• Screening:
– Screen all of a single F3 family from a single F2 cross for each F1 to infer relative tolerance.
[Figure: pedigree diagram following alleles A and a through generations F1, F2 and F3 (x40)]
• Advantages
– Better idea of linkage because of the improved rates of
recombination within a single range of markers
– All markers in the F3s should be observable in F1s.
– Provides the highest coverage of a genome(?) for future
use.
• Disadvantages
Extra Considerations
• Johnson’s fine mapping using EcoRI (producing a possible 150k tags) yielded 2,311 BP-specific and 4,530 RS-specific markers.
• For each specific marker, RAD tags were obtained from only 16-20 individuals per phenotype (the majority of F2s showed between 50-150k individual tags in total), so each F2 had only a subset sequenced. That is a 0.25-0.33 chance of inclusion. Can we improve this?
• If not, we would have to ensure that numbers from designs 2 and 3 are still representative when reduced by 1/3. We cannot simply increase the number of individuals we use; we would have to increase tag numbers.
• If we make use of a barcoding design, we need to be sure we can sort the individuals in silico.
• Linkage Mapping
– Defined as a method for localizing genes that is based on the coinheritance of genetic markers and phenotypes in families over several generations.
– Relies on observed recombinations in genotyped families
– Positional resolution depends on the number of observable
recombinations in the pedigree, not marker density.
– Positional resolution and statistical power in a given design is
limited (>10cM).
• Association mapping
– Defined as a gene discovery strategy that compares cases with
controls to assess the contribution of genetic variants to
phenotypes in specific populations.
– Relies on linkage disequilibrium (LD) which has built up over
the history of the population.
– Requires no family structure in the sample
– True LD spans very short regions, meaning that association
mapping has high power in the immediate vicinity (<1cM),
though false positives can occur more frequently.
important factor for most efficiently distinguishing
relevant traits.
Hohenlohe et al., 2010, used 45,000 markers in a 567,240,000 bp genome and discovered most of the relevant loci already described, plus extras.
That is a marker every 12,605 bases, but in the earthworm it is a marker every 3,474 bp.
Additionally, proper selection of individuals can provide an
adequate level of variation within a population.
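The marker-density comparison is a straightforward division of genome size by marker count. In the sketch below the earthworm marker count is taken as 2 tags for each of the 60,519 EcoRI sites reported earlier; that pairing is an assumption, chosen because it reproduces the 3,474 bp figure.

```python
def marker_spacing(genome_size_bp, n_markers):
    """Average distance in bp between markers spread evenly over the genome."""
    return genome_size_bp / n_markers


stickleback_spacing = marker_spacing(567_240_000, 45_000)

# Assumption: earthworm markers = 2 tags per EcoRI site, 60,519 sites.
earthworm_markers = 2 * 60_519
earthworm_spacing = marker_spacing(420_540_000, earthworm_markers)

print(f"Hohenlohe et al. design: one marker every {stickleback_spacing:,.0f} bp")
print(f"Earthworm with EcoRI: one marker every {earthworm_spacing:,.0f} bp")
```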
• Sourcing all of the individuals for RAD tagging from a single “family” would represent only a single individual from the contaminated site.
– This is not necessarily unrepresentative, as genotyping the individual may show that it belongs to the most frequently observed genotype.
• Further, it was suggested that for genome-wide linkage disequilibrium mapping it is most effective to study unrelated affected and control individuals. By contrast, family-based linkage designs will yield much larger linked regions owing to limited recombination within a pedigree. With unrelated individuals, associated regions will reflect the haplotype block size and should be small enough for efficient fine-mapping (Karlsson et al., 2007).
• The most relevant markers should be well conserved among all “families”.
• Question:
– How many families should we take from?