Follow-up Genotyping Conference Call
“Transdisciplinary Research in Cancer of the Lung (TRICL)”
August 16, 2012
Meeting minutes

Participants: Chris Amos (MDACC), Mala Pande (MDACC), Maria Tere Landi (NCI), Shenying Fang (MDACC), Jen Doherty (Dartmouth), Andrea Finn (Affymetrix), Yiping Zhan (Affymetrix), Teresa Webster (Affymetrix), Alex Forrest-Hay (Affymetrix)

The purpose of this call was to discuss the ongoing development of an Affymetrix Axiom array genotyping platform to be applied to samples from the TRICL replication and fine mapping analyses. The TRICL grant proposal calls for a replication and fine mapping stage of about 15,000 individuals across different ethnic groups, with an emphasis on studying cohort samples. The primary goals are to validate findings from the GWAS that was conducted, to perform fine mapping of regions identified by GWAS, to characterize findings in different populations, and to provide a platform for epidemiological characterization. Shenying Fang and Mala Pande are working on configuring the request for variants to be genotyped.

Dr. Amos began the call by reviewing the status of genotyping contracts. Affymetrix indicated that the initial contract from Dr. Hung has been received and that they expect to receive a second contract next week. Dr. Doherty has not yet initiated her contract, which must be set up as a fee-for-service arrangement. It is preferred that her work be done at the same time as the rest of the genotyping to ensure that genotype calling is comparable among studies.

Dr. Amos asked how information about SNPs, genes, variations described by DNA sequences, and regions of interest should be provided to Affymetrix. In particular, he asked whether we could keep track of which data source provided each request so that the requests can be separated at the end of the experiment.
Affymetrix responded that this would be feasible, but they prefer to receive spreadsheets that concatenate all the different sources of information.

Affymetrix indicated that there are several categories of markers they can query with their platform. First, already validated markers can be included immediately (with a very high probability that their genotyping will work on the array). Second, where markers are not already validated, the P.I. can elect to attempt genotyping with probesets that query the marker on both the forward and reverse strands, so these trial markers require twice as much space on the array. Finally, the P.I. can select tagging SNPs that have been validated (so that in many cases only a single variant needs to be queried to tag a number of other markers).

The genes that have been requested currently contain about 160,000 Axiom-validated markers (and there are about 30,000 rsIDs in the request). This count includes all Axiom-validated markers in the genes of interest, which may be more than we require, since tagging SNPs should suffice for many genes. Affymetrix suggested that we submit a list of genes and regions that we want densely genotyped and, on that list, also indicate the genes that can be genotyped by tagging. (At Affymetrix, a greedy method is used to pick Axiom-validated markers that tag all markers that have LD data in a given population and can be tagged at a given r² cutoff.) They thought this would reduce the number of genotypes enough to fit within the approximately 100K markers available on the platform, assuming each marker takes the “normal” amount of space (versus twice as much space for markers tiled “de novo”).

When SNP information is transmitted to Affymetrix, they prefer that the rs number, the chromosomal position in HG19 coordinates, and the forward strand alleles be provided. Flanking sequences are very valuable, especially for indels.
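The preferred submission fields can be illustrated with a small record type. This is a minimal Python sketch under stated assumptions: the `MarkerRequest` layout, its field names, the `source` column (one possible way to track which group requested a marker inside a single concatenated spreadsheet) and the `parse_flank` and `is_indel` helpers are all hypothetical and are not an Affymetrix specification. The bracketed flanking notation follows the example quoted in the additional notes.

```python
import re
from dataclasses import dataclass

# Hypothetical record layout; field names are illustrative only.
@dataclass
class MarkerRequest:
    rsid: str          # rs number, if one exists
    chrom: str         # chromosome name
    pos_hg19: int      # chromosomal position in HG19 coordinates
    alleles: tuple     # forward strand alleles
    source: str        # assumed column: which list or group made the request
    flank: str = ""    # optional flanking sequence, "LEFT[A/B]RIGHT"

# LEFT[A/B]RIGHT, where either allele may be empty (as in "[/CCT]").
FLANK_RE = re.compile(r"([ACGTN]+)\[([ACGT]*)/([ACGT]*)\]([ACGTN]+)")

def parse_flank(flank):
    """Split 'LEFT[A/B]RIGHT' into (left, allele_a, allele_b, right)."""
    m = FLANK_RE.fullmatch(flank.strip().upper())
    if m is None:
        raise ValueError(f"unrecognised flanking sequence: {flank!r}")
    left, allele_a, allele_b, right = m.groups()
    if allele_a == allele_b:
        raise ValueError("the two alleles are identical")
    return left, allele_a, allele_b, right

def is_indel(allele_a, allele_b):
    """Treat alleles of different length as an insertion/deletion."""
    return len(allele_a) != len(allele_b)
```

Keeping the requesting source as its own column is one way to concatenate all requests into a single spreadsheet and still separate them again after genotyping.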
A list of individual markers will be provided, indicating for each marker how far we are willing to go to obtain its genotype information. There are at least the following 3 options.

#1: Only include a marker in the array if it is already Axiom-validated (for markers that are not very important individually).
#2: Include a marker in the array if it is Axiom-validated, and greedily pick tags for all those markers that are not Axiom-validated but have LD data (this can be done in the same process in which tags are picked for genes).
#3: Include a marker in the array if it is Axiom-validated, and for each marker that is not Axiom-validated, tile it “de novo” on the array and pick a “best tag” for it if possible (we can even consider picking redundant tags for these highly important markers where possible).

Affymetrix indicated that the TCGA-derived mRNA information would be hard to transform into a format suitable for array design, because with this spreadsheet they would have to derive the forward strand allele from mRNA-related information. They requested that rs numbers and forward strand alleles be provided if possible.

Dr. Landi asked about the timeline. Dr. Finn indicated that, once a final SNP list is provided to Affymetrix, it will take 4-6 weeks to configure and deliver the array. The time between communicating a design request to Affymetrix and their delivering the associated design results depends on the complexity of the request, but it can be as short as two to three days. Dr. Finn indicated that the lab in Toronto is working towards having the equipment in place and working.

Dr. Amos indicated that, for now, communication from Affymetrix to TRICL that includes specific information about SNPs should go only to Drs. Amos and Pande, because some SNP requests are viewed as proprietary by the requestors.
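The greedy tag picking mentioned under option #2 can be sketched as a greedy set cover. This is only an illustration of the general technique, not Affymetrix's actual procedure: the function name `greedy_tags`, the input format (a dictionary of pairwise r² values), the default cutoff of 0.8 and the alphabetical tie-breaking are all assumptions.

```python
def greedy_tags(targets, candidates, r2, cutoff=0.8):
    """Greedy set cover over LD data (illustrative sketch).

    targets    : markers we want covered
    candidates : validated markers that may be placed on the array
    r2         : dict mapping frozenset({a, b}) -> pairwise r^2
    cutoff     : assumed r^2 threshold; the minutes do not state a value
    Returns (chosen tags, targets that could not be tagged).
    """
    def covers(tag):
        # A marker tags itself; otherwise it needs r^2 >= cutoff.
        return {t for t in targets
                if t == tag or r2.get(frozenset((tag, t)), 0.0) >= cutoff}

    coverage = {c: covers(c) for c in candidates}
    uncovered = set(targets)
    tags = []
    while uncovered and coverage:
        # Pick the candidate that tags the most still-uncovered targets;
        # sorted() makes ties resolve alphabetically.
        best = max(sorted(coverage), key=lambda c: len(coverage[c] & uncovered))
        gained = coverage[best] & uncovered
        if not gained:
            break  # the remaining targets cannot be tagged at this cutoff
        tags.append(best)
        uncovered -= gained
    return tags, uncovered
```

Targets left in the second return value are the ones that would need option #3 (tiling “de novo”) if they are important enough.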
Additional notes:

YZ: Flanking sequences in a format like “AGACCATTCTTGCCCCAGCCCTTTCACCTGGCCCA[/CCT]CCTCTCCCTCCTCAGGGCCTGAGCACATCACAACT” are highly valuable for indel markers, if provided, since indel position alignment can be confusing in dbSNP in some cases. If such flanking sequences are not provided, we can extract the information based on our understanding, and together we can work out a method to ensure that Affy uses the correct sequences to design the probesets later.

YZ: It seems that the TCGAlung3007 tab does contain the chromosomal positions in the first data column (in hg19?). Understanding the allele specification may take some work though, especially figuring out on which strand the alleles are located.

AF from a later request about details of amplified DNA: It seems that the TCGAlung3007 tab does contain the chromosomal positions in the first data column (in hg19?). Understanding the allele specification may take some work though, especially figuring out on which strand the alleles are located.

CA: We do not have any lists from fine mapping (Yufei and Rayjean) or from Alvara Monteiro.

CA: For consistency, chromosomal positions should be retrieved from HG19.

CA: Fine Level Prioritization Scheme:
1. SNPs/genes identified from U19 GWAS – very dense coverage.
2. SNPs/genes identified from pathway-based or other novel analytical approaches from U19 (by the way, I will also have some genes from g x e analyses to suggest) – tagging SNPs.
3. SNPs/genes suggested from other GWAS of lung cancer – very dense coverage.
4. SNPs/genes suggested from TCGA affecting risk for disease.
5. SNPs/genes suggested from survival analyses of lung cancer – tagging SNPs.
6. SNPs suggested by other U19 groups.
7. Vitamin B/Folate (integration with another consortium).
8. Strong Candidate Genes for lung cancer (e.g. CYP2A6, CYP2B6).
9. Inflammation pathway and other suggested pathways (these are put below candidate genes because I think they have less evidence than some candidate genes).
10. Weak Candidate Genes for lung cancer (e.g. GSTM1).
11. SNPs related to smoking behavior.
12. SNPs suggested from COPD studies.
13. SNPs/genes suggested from TCGA or other bioinformatic studies that identify genes modified in lung cancer.
14. SNPs related to other cancers.
15. SNPs for European Admixture.
16. SNPs related to other diseases like asthma.
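One way the prioritization scheme could be applied against a fixed array capacity is sketched below. This is a hypothetical illustration, not a procedure agreed on the call: the function names, the request tuple layout and the rule of skipping a marker that no longer fits are assumptions. The only inputs taken from the minutes are the priority ordering and the statement that markers tiled “de novo” need twice the space of validated markers.

```python
def slots_needed(de_novo):
    # De novo markers are queried on both strands, so they take twice the space.
    return 2 if de_novo else 1

def fill_array(requests, capacity):
    """Fill the array in priority order (illustrative sketch).

    requests : iterable of (tier, marker_id, de_novo) tuples, where tier is
               the position in the prioritization scheme (1 = highest)
    capacity : number of normal-sized marker slots on the array
    Returns (chosen marker ids, slots used).
    """
    chosen, used = [], 0
    for tier, marker, de_novo in sorted(requests):
        cost = slots_needed(de_novo)
        if used + cost <= capacity:
            chosen.append(marker)
            used += cost
        # Assumption: a marker that does not fit is skipped, and lower
        # tiers are still tried in case a single-slot marker fits.
    return chosen, used
```

For example, with a capacity of 4 slots, a tier 1 de novo marker and a tier 2 validated marker use 3 slots, so a further de novo marker is skipped while a validated one still fits.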