RADical-Worms
Natural adaptation to geogenic and anthropogenic soil contamination in the earthworm Lumbricus rubellus: evolutionary history and functional genomics

The soil underneath our feet
[Figure: Google Maps image]

Metal burdens of UK soils - Cu
• McGrath and Loveland Soil Geochemical Atlas: 1 sample per 25 sq km (yellow is highest)
• BGS G-Base, Northern Britain, stream sediment: 1 sample per 2 sq km (red is highest)

Metal burdens of UK soils - As
• Wolfson Geochemical Atlas (Webb et al.): stream sediment
• BGS G-Base: stream sediment

Anthropogenic burdens

Local adaptation to metal burdens
[Figure, panel a: Part of the abandoned Great Consols site in Devon. Concentrations of As reach 20,000 ppm around this building (the “Arsenic labyrinth”) but Cu concentrations are near baseline.]
[Figure, panel b: LC50 curves a, b and c plotted against concentration of index chemical]

Local adaptation versus migration
[Figure: what if ... scenarios (1) and (2)]

The British Isles: a natural laboratory
Ages of metal islands, available adaptation period (years):
• 10,000
• 1-2,000
• 1-200

Lumbricus rubellus as an environmental sentinel
[Figure: distribution map, www.faunaeur.org]
• Wide distribution
• Epigeic / ecosystem engineer
• Sexually reproducing
• Life history model
• Ecotoxicology
[Figure: life-cycle diagram with stages numbered 1-5 and labels Sj, nt, Sr,c, tj, tr; three stage, continuous reproduction, hermaphrodite]
$$1 = \sum_{t = t_j + 1}^{t_j + t_a - 1} e^{-rt} \cdot n_t \cdot s_j \cdot s_a^{[(t - t_j)/t_a]^c}$$
Kammenga et al. (2003) OIKOS 100, 89-95
• Dispersal rate 10 m per year
• As and Cu tolerance
• Environmental Genomic Model !!
Langdon et al. (2003) ES&T 22, 2344-2348
• Good “species”

Lumbricus rubellus as a genomic model
• Genome size ~430 Mb
• Genome N50 1425 bp
• Combined contig length 526 Mb
• 89% of 20K ESTs mapped
• But… a species complex !!
• 2 major lineages separating at genome and mitochondrial level
….and mostly thanks to Mark and Ben
[Figure: ordination plot, Coord 1 against Coord 2, with groups labelled CC, CS, E and W]

Key Questions
• Pierfrancesco Sechi - Phylogeography
• Question 1: Use SNPs to map how soil chemistry has conditioned soil re-colonisation since the last ice age.
– Pierfrancesco Sechi
• Question 2: How does diversity impact adaptation to local pressures? – Pierfrancesco Sechi
• Craig Anderson – Mapping metal tolerance loci
• Question 2: Can we use SNPs to determine the functional loci responsible for metal tolerance? – Craig Anderson
• Question 3: Use SNP data to determine, on local scales, how metal burdens affect the genetic diversity of soil animals. – Craig Anderson

Pan-European Collection
Specimens received from:
• France
• Hungary
• Serbia
• Estonia
• Sweden
• Finland
Projected response:
• 2 more locations from France
• Italy
• Spain
• Portugal
• Austria

Numbers: Pan European Analysis
40% GC / 420 Mb genome / 10 fold coverage / 20 M reads
Restriction Sites | RADseq Sites | RADseq Tags | Max Plex.
SbfI | 7310 | 14620 | 137
NotI | 2230 | 4460 | 448
BbvCI | 21410 | 42820 | 93
EagI | 34351 | 68702 | 29

Plan for Pan European analysis
Source | Worms | No. (by haplotype and location) | Estimated Coverage
9 per lane | 10 per site | 90 | 10 fold
9 per lane | 10 per site | 90 | 10 fold
Total | 20 sites | 180 worms | 10 fold for each

Can we use SNPs to determine the functional loci responsible for metal tolerance?
1. Derive a SNP map from phenotyped individuals, including Tt x Tt F3 progeny.
2. Perform high density SNP mapping of multiple independent populations resident in contaminated soils, compared with proximal non-exposed populations.
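The "Max Plex." column in the Numbers table is not derived on the slide. A minimal sketch, assuming it is the read budget divided by (loci x coverage) and rounded; the helper name `max_plex` is hypothetical. Under that assumption SbfI, NotI and EagI match the table when the tag counts are used, while the tabulated BbvCI value (93) only matches when the site count (21410) is used instead of the tag count.

```python
# Assumption (not stated on the slide): max plex = reads / (loci * coverage).
READS_PER_LANE = 20_000_000   # "20 M reads"
COVERAGE = 10                 # "10 fold coverage"

# enzyme: (RADseq sites, RADseq tags, Max Plex. as tabulated on the slide)
ENZYMES = {
    "SbfI": (7310, 14620, 137),
    "NotI": (2230, 4460, 448),
    "BbvCI": (21410, 42820, 93),
    "EagI": (34351, 68702, 29),
}


def max_plex(loci, reads=READS_PER_LANE, coverage=COVERAGE):
    """Individuals that fit in one lane if every locus is read `coverage` times."""
    return round(reads / (loci * coverage))


for name, (sites, tags, tabulated) in ENZYMES.items():
    print(f"{name:6s} by tags: {max_plex(tags):4d}  "
          f"by sites: {max_plex(sites):4d}  slide: {tabulated}")
```

If the assumed formula is right, the BbvCI row is worth re-checking before lanes are allocated, since the two readings differ by a factor of two.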
Field Sites
• Carrock Fell: 10,000 mg/kg As and 700 mg/kg Cu
• Ecton: 3,200 mg/kg Cu, 4,500 mg/kg Pb and 30,000 mg/kg Zn
• Cwmystwyth: 1,200 mg/kg Pb and 30,000 mg/kg Zn
• Shipham: 4,000 mg/kg Pb, 14,000 mg/kg Zn and 800 mg/kg Cd
• Caradon: 800 mg/kg Cu
• Devon Great Consols: 9,000 mg/kg As and 1,700 mg/kg Cu

Genetic mapping of tolerance traits
1 Pb/Zn site – lineage B | 2 x As/Cu sites – lineage A
[Figure: pedigree diagrams (Fam1) showing TT, Tt and tt genotypes across the F1, F2 and F3 generations; RAD tag F1 vs phenotyped F3]

Selection of F3s for RAD Analysis
[Figure: cumulative frequency against mass. Panel A: T is recessive. Panel B: T is dominant. Panel C: T is codominant. Panel D: T is a multilocus trait (% increase growth)]
• Following screening, if the distribution is spread as in graph A, B or C, we’d just analyse a similar number of homozygous +ve and -ve individuals.
• It’s more likely that we’ll get a spread of data like graph D than the others, so we should distinguish between a set number of acknowledged phenotypes (graph D).

Numbers: Pedigree analysis
40% GC / 420 Mb genome / 10 fold coverage / 20 M reads
Restriction Sites | RADseq Sites | RADseq Tags | Max Plex.
SbfI | 7310 | 14620 | 137
NotI | 2230 | 4460 | 448
BbvCI | 21410 | 42820 | 93
EagI | 34351 | 68702 | 29

Plan for pedigree analysis: Paired-end BbvI (for scaffolding)
Source | Worms | No. (pooled by phenotype) | Estimated Coverage
Mixed F1 from CW, CF and DC | F1s from 3 sites | 6 x 3 = 18 | 52 fold
Cwmystwyth (B) | F3 from 3 families | 3 x 20 = 60 | 16 fold
Carrock Fell (A1) | F3 from 3 families | 3 x 20 = 60 | 16 fold
Devon Great Consols (A2) | F3 from 3 families | 3 x 20 = 60 | 16 fold

Resident analysis
[Figure: sampling scheme with T1, T2 and T3 inside the boundary of metal contamination and C1, C2 and C3 in surrounding geochemically equivalent non-contaminated soils]
Should controls be contiguous to the site? Or remote? Or both?

Numbers: Resident analysis
40% GC / 420 Mb genome / 10 fold coverage / 20 M reads
Restriction Sites | RADseq Sites | RADseq Tags | Max Plex.
SbfI | 7310 | 14620 | 137
NotI | 2230 | 4460 | 448
BbvCI | 21410 | 42820 | 93
EagI | 34351 | 68702 | 29

Plan for resident analysis - Paired-end BbvI (for scaffolding)
Source | Worms | No. (pooled by phenotype) | Estimated Coverage
Cwmystwyth (B) | 3 x 10 within site / 3 x 10 outside | 6 x 10 = 60 | 16 fold
Carrock Fell (A1) | 3 x 10 within site / 3 x 10 outside | 6 x 10 = 60 | 16 fold
Devon Great Consols (A2) | 3 x 10 within site / 3 x 10 outside | 6 x 10 = 60 | 16 fold

Interpretation

L. rubellus: Candidate targets for Cu tolerance

Darwin, 1883
“It may be doubted whether there are many other animals ….. which have played so important a part in the history of the world.”
To Professor A. John Morgan, who taught me the truth of this saying.

Bulk Segregant Mapping
• Produced a RAD tag library from the parental genomic DNA digested with SbfI → 1.4 million markers with parental barcodes
• Removed markers that lacked an SbfI site → 608,000 BP and 760,000 RS tags
• Refined sequences to those with a maximum of a single mismatch → 41,622 tags
• RAD tags were selected if they had at least 8 instances in one strain and none in the other → 1,097 BP and 1,890 RS tags
• Then pooled DNA by phenotype, developed RAD tags and sequenced

Main Questions
• How many lanes do we have?
Need to decide:
• What resolution do we want, i.e., how many tags/individual?
• What restriction enzyme can we use?
• How many individuals/population?
Importantly:
• How many times will each tag be sequenced?
For restrictions on the resolution:
• What are the odds of sequencing all individuals and not only achieving a subsection of the population?
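The strain-specific selection rule on the Bulk Segregant Mapping slide (at least 8 instances in one strain and none in the other) can be sketched as a filter over per-tag read counts. This is only an illustration under stated assumptions: the function name `strain_specific_tags`, the list-of-sequences input and the toy reads are all hypothetical, and exact sequence matching stands in for the single-mismatch refinement step.

```python
from collections import Counter

MIN_COUNT = 8  # threshold from the slide: "at least 8 instances in one strain"


def strain_specific_tags(bp_reads, rs_reads, min_count=MIN_COUNT):
    """Return (bp_only, rs_only): tags seen at least `min_count` times in one
    parental strain and never in the other."""
    bp_counts = Counter(bp_reads)
    rs_counts = Counter(rs_reads)
    bp_only = sorted(tag for tag, n in bp_counts.items()
                     if n >= min_count and rs_counts[tag] == 0)
    rs_only = sorted(tag for tag, n in rs_counts.items()
                     if n >= min_count and bp_counts[tag] == 0)
    return bp_only, rs_only


# Toy reads (made-up sequences, for illustration only).
bp = ["AAAC"] * 9 + ["GGTT"] * 8 + ["CCCC"] * 3
rs = ["GGTT"] * 1 + ["TTGA"] * 12 + ["CCCC"] * 10

bp_only, rs_only = strain_specific_tags(bp, rs)
print("BP-specific:", bp_only)
print("RS-specific:", rs_only)
```

In the toy data, GGTT reaches the threshold in BP but is rejected because a single RS read carries it, which is the "none in the other" half of the rule.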
Eric Johnson’s RAD sequencing method
[Figure: crossing scheme with labels BP, RS (dominant), numbers of individuals/cross, 1 x 1, 1 x 1 full sib cross, produce 96 juveniles, pooled as 31 displaying the ...]

• Genome size (bp) = (0.978 × 10^9) × DNA content (pg) (Dolezel et al., 2003: http://www3.interscience.wiley.com/journal/102526737/abstract?CRETRY=1&SRETRY=0)
• Lumbricus rubellus genome size estimated to be 0.43 pg (±0.02) (Gregory and Hebert, 2002). Thus the size in bp is 0.978 × 10^9 × 0.43 = 420,540,000 bp (420 Mb).
• Searched genome for EcoRI (5'-GAATTC-3') and SbfI (5'-CCTGCAGG-3') restriction site matches.
• These restriction enzymes were used throughout Eric Johnson’s papers.
• There are 60,519 and 3,318 sites in the EW genome, respectively, each expected to generate 2 tags.
• Would be on a GAIIx Illumina sequencer. Should be able to get 12 million high quality 100 bp reads / lane.
• Dog genome estimated to be a maximum of 2.47 Gb. Karlsson et al. (2007) used StyI to get 64,000 markers, cut down to 27,000 high-performing markers for 2 traits with genome-wide association mapping within dog breeds (using ~10 affected and ...

Design 1 - Single Family Intraspecific Analysis
• An all-in-one analysis: would take all individuals (~40, assuming we require 60,000 markers with each marker sequenced 5 times) of one F1 cross / site for sequencing.
• Screening:
– Screen all of a single F3 family from a single F2 cross for each F1 to infer relative tolerance.
[Figure: pedigree of A and a alleles across the F1, F2 and F3 generations; F3 x40]
• Advantages
– Better idea of linkage because of the improved rates of recombination within a single range of markers
– All markers in the F3s should be observable in F1s.
– Provides the highest coverage of a genome(?) for future use.
• Disadvantages

Extra Considerations
• During Johnson’s fine mapping using EcoRI (producing a possible 150k tags), they yielded 2,311 BP-specific and 4,530 RS-specific markers.
• For each specific marker, they only achieved RAD tags from 16-20 individuals/phenotype (the majority of F2s showed between 50k and 150k individual tags in total), so each F2 only had a subset sequenced. That’s a 0.25-0.33 chance of inclusion. Can we improve this?
• If not, we would have to ensure that numbers from designs 2 and 3 are representative when reduced by 1/3, but we can’t simply extend the number of individuals we use; we would have to increase tag numbers.
• If we make use of a barcoding design, we need to be sure we can sort the individuals in silico.

• Linkage mapping
– Defined as a method for localizing genes that is based on the coinheritance of genetic markers and phenotypes in families over several generations.
– Relies on observed recombinations in genotyped families.
– Positional resolution depends on the number of observable recombinations in the pedigree, not marker density.
– Positional resolution and statistical power in a given design are limited (>10 cM).
• Association mapping
– Defined as a gene discovery strategy that compares cases with controls to assess the contribution of genetic variants to phenotypes in specific populations.
– Relies on linkage disequilibrium (LD) which has built up over the history of the population.
– Requires no family structure in the sample.
– True LD spans very short regions, meaning that association mapping has high power in the immediate vicinity (<1 cM), though false positives can occur more frequently.
important factor for most efficiently distinguishing relevant traits. Hohenlohe et al. (2010) used 45,000 markers in a 567,240,000 bp genome and discovered most of the relevant loci already described, plus extras. That’s a marker every 12,605 bases, but in the earthworm it’s a marker every 3,474 bp. Additionally, proper selection of individuals can provide an adequate level of variation within a population.
• Sourcing all of the individuals for RAD tagging from a single “family” would represent only a single individual from the contaminated site.
– Not necessarily unrepresentative, as genotyping the individual may show that it belongs to the most frequently observed genotype.
• Further, it was suggested that for genome-wide linkage disequilibrium mapping, it’s most effective to study unrelated affected and control individuals. By contrast, family-based linkage designs will yield much larger linked regions owing to limited recombination within a pedigree. With unrelated individuals, associated regions will reflect the haplotype block size, and should be small enough for efficient fine-mapping (Karlsson et al., 2007).
• The most relevant markers should be well conserved among all “families”.
• Question:
– How many families should we take from?
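The marker-spacing comparison above can be reproduced with a short calculation. This is a sketch under one assumption that the slides do not state outright: that the earthworm figure comes from the EcoRI count quoted earlier (60,519 sites, 2 tags each) over the 420.54 Mb genome estimate. The function names are hypothetical.

```python
PG_TO_BP = 0.978e9  # Dolezel et al. (2003): base pairs per pg of DNA


def genome_size_bp(dna_content_pg):
    """Convert DNA content in picograms to genome size in base pairs."""
    return PG_TO_BP * dna_content_pg


def marker_spacing_bp(genome_bp, n_markers):
    """Average distance between markers, assuming they are evenly spread."""
    return genome_bp / n_markers


# Hohenlohe et al. (2010): 45,000 markers in a 567,240,000 bp genome.
hohenlohe_spacing = marker_spacing_bp(567_240_000, 45_000)

# Earthworm: 0.43 pg DNA content; EcoRI tags assumed as the marker set.
ew_genome = genome_size_bp(0.43)
ew_tags = 60_519 * 2
ew_spacing = marker_spacing_bp(ew_genome, ew_tags)

print(f"Hohenlohe et al.: one marker every {hohenlohe_spacing:.0f} bp")
print(f"Earthworm genome: {ew_genome:,.0f} bp")
print(f"Earthworm (EcoRI): one marker every {ew_spacing:.0f} bp")
```

Even spacing is an idealisation; restriction sites cluster with local base composition, so the real gaps between markers will vary around these averages.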