SUPPLEMENTARY METHODS

Integration site analysis

Most vector integration sites from HT1080 cell clones were identified by inverse PCR as previously described,1 with minor modifications. In short, genomic DNA was digested individually with either PstI, HindIII, BglII, SphI, BamHI, XbaI, or a combination of BglII and BamHI, all of which cut within the vector provirus. The restriction enzymes were inactivated, and approximately 100 ng of digested DNA was ligated with 400 units of T4 DNA ligase in a 20 µl reaction volume. Provirus-genomic junction fragments were then amplified by nested PCR using vector-specific primers. First primer pair, 5'-CTAGAAACTGCTGAGGGCGG and 5'-CTGATCCTTGGGAGGGT; nested primer pair, 5'-TCCTAACCTTGATCTGA and 5'-CAGATTGATTGACTGCC. The resulting PCR products were separated by gel electrophoresis, excised, and column purified with the QIAquick gel extraction kit (Qiagen). Junction fragments were then sequenced either directly using the PCR fragments as template and the nested PCR oligos as primers, or after subcloning into TOPO TA cloning vectors (Invitrogen Corp., Carlsbad, CA).

Some vector integration sites from HT1080 cell clones were identified by a directed DNA library screening approach as previously described,2 with modifications. First, genomic DNA was digested with XbaI (which cuts once in each vector LTR), and half of the digested product was subjected to Southern analysis essentially as described above, using probes for the vector LTR that are located either proximally (5' LTR) or distally (3' LTR) to the XbaI site, in order to identify band sizes for individual provirus junction fragments. The remaining digested DNA products were then separated by gel electrophoresis under conditions identical to those used for the Southern analysis, and band-specific regions were excised and column purified.
Extracted DNA was then subcloned into pUC19 cloning vectors, and the resulting libraries were screened by conventional colony lifts and hybridization using the same proximal LTR probe. Plasmid inserts from positive colonies were subsequently sequenced.

Some vector integration sites in HT1080 cell clones, and all vector integration sites in 32D cell clones, were identified by linear amplification-mediated polymerase chain reaction (LAM-PCR) essentially as previously described.3 In short, single-stranded copies of the viral 5' LTR junction fragments were first generated by 100 cycles of linear amplification using biotinylated primers specific for the proximal region of the vector LTRs (vector MGPN2, 5'-biotin-TTCTCTAGAAACTGCTGAGG; vector INS4(+), 5'-biotin-ATTCTAAATCTCTCTTTCAGCC). The products were then isolated using streptavidin-coated magnetic M280 Dynabeads (Dynal Biotech, Oslo, Norway), converted to double-stranded DNA using random hexamers and Klenow, digested with either Tsp509I, HaeIII, or RsaI (which cut in the genomic sequences), and capped with anchor primers compatible with the restricted ends. The vector-genomic junction fragments were eluted from the Dynabead matrix and amplified by two additional rounds of nested PCR using primers specific to the vector LTR and anchor primer sequences. The resulting LAM-PCR products were separated by gel electrophoresis, excised, and column purified. Junction fragments were then sequenced either directly using the PCR fragments as template and the nested PCR oligos as primers, or after subcloning into TOPO TA cloning vectors.

Sequences were BLAST searched against either the human genome (March 2006 assembly) or the mouse genome (February 2006 assembly) using the UCSC Genome Browser as previously described.4 Insertion sites were considered authentic if they contained adjoining retroviral sequences and gave a unique best match with better than 90% identity.
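The authenticity criteria for insertion sites can be expressed as a small filter. The following is only an illustrative sketch, not the procedure actually used: the `Hit` record, the `authentic_site` function, and the use of an alignment score to rank matches are all assumptions introduced here for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Hit:
    """One genome alignment for a junction sequence (hypothetical record)."""
    locus: str       # e.g. "chr1:1000"
    score: float     # alignment score; ranking by score is an assumption
    identity: float  # percent identity of the alignment


def authentic_site(hits: List[Hit],
                   has_vector_junction: bool,
                   min_identity: float = 90.0) -> Optional[Hit]:
    """Return the best hit if the read meets the stated criteria, else None.

    Criteria taken from the text: the read contains adjoining retroviral
    sequence, and it gives a unique best match with better than 90% identity.
    """
    if not has_vector_junction or not hits:
        return None
    ranked = sorted(hits, key=lambda h: h.score, reverse=True)
    best = ranked[0]
    # A tie for the top score means the best match is not unique.
    if len(ranked) > 1 and ranked[1].score == best.score:
        return None
    # "Better than 90%" is read as strictly greater than the threshold.
    if best.identity <= min_identity:
        return None
    return best


if __name__ == "__main__":
    example = [Hit("chr1:1000", 95.0, 98.0), Hit("chr5:2000", 60.0, 92.0)]
    print(authentic_site(example, has_vector_junction=True))
```

Reads with tied top-scoring matches are rejected rather than assigned arbitrarily, which follows from the requirement for a unique best match.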
Analysis of integration sites relative to flanking transcription units was also performed using the UCSC Genome Browser and included all known genes (UCSC known genes based on UniProt, RefSeq, and GenBank mRNA). Simulated random integration datasets were generated essentially as described.4 In short, random sites in the human or mouse genomes were chosen using a random number generator. Sequences of approximately the same length as the experimental data (50 bp) were then identified adjacent to these sites and BLAST searched using the criteria used for the experimental datasets described above.

Expression microarray analysis

The transduced HT1080 cell clones were screened for dysregulated cellular genes using CodeLink UniSet Human 20K I Bioarrays and gene expression system (Amersham / GE Healthcare Bio-Sciences Corp., Piscataway, NJ) following the manufacturer's directions. These arrays include approximately 20,000 human genes. Total RNA from HT1080 cell clones and two independent aliquots of untransduced HT1080 cells was prepared by column purification (RNeasy Mini kit, Qiagen), and used as template to prepare biotin-labeled cRNA target by linear amplification. Labeled target was then fragmented and hybridized to individual bioarrays (one array per clone or control). The hybridized arrays were then washed, stained with Cy5-streptavidin, and scanned using a GenePix 4000A analyzer (Axon Instruments / Molecular Devices, Sunnyvale, CA). Expression levels were first analyzed using the manufacturer's software (CodeLink EXP v4.1) in order to assess the overall signal quality and to establish minimum thresholds for signal reliability. Pair-wise comparisons between each of the individual arrays versus all of the remaining arrays (two untransduced controls and 86 transduced clones) were then performed using GeneSifter software (VizX Lab LLC, Seattle, WA).
For this purpose we normalized signals to array means, and excluded individual spots if they were background-contaminated, irregularly shaped, near background, or saturated; otherwise, no additional transformations or corrections were made. A gene was considered to be dysregulated if the intensity of that gene's signal within any one cell clone was either 5-fold higher or 5-fold lower than the mean signal intensity for the remaining cell clones and untransduced controls, that gene's signal was considered reliable by the manufacturer's criteria, and that gene was not found to be dysregulated in more than one clone.

Statistical analysis

Most comparisons between discrete datasets were performed using the Kolmogorov-Smirnov (KS) test.5 This is a non-parametric and distribution-free method that does not require the datasets to be normally distributed. In cases where comparisons were performed between the means of small matched datasets with apparent normal distributions, we used the paired, two-tailed Student's t-test. In cases where comparisons were made between two discrete proportions (frequencies), we used the Z-test for two proportions. Kaplan-Meier survival curves were analyzed using the logrank test and chi-squared distribution.

In order to estimate the frequency of vector-mediated tumor formation (Fig. 5a), we first estimated the number of independent transformation events based on the fraction of tumor-free animals at 130 days using the Poisson distribution: vector MGPN2, 1 of 10 animals surviving, indicating 23 independent transformation events; vector INS4(+), 4 of 10 animals surviving, indicating 9 independent transformation events. We then divided the estimated number of transforming events by the estimated number of cells that were transduced during the original transduction culture (a total of 37,700 for vector MGPN2 and 89,680 for vector INS4(+), Supplementary Table 3, experiments D and E).
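The Poisson estimate and the per-cell frequencies described above can be reproduced with a short calculation. This sketch assumes the estimate uses the Poisson zero class, so that the fraction of tumor-free animals equals exp(-m), where m is the mean number of events per animal, and that the total is m multiplied by the number of animals and rounded to the nearest integer. That reading matches the reported values of 23 and 9, but it is an interpretation, and the function and variable names are hypothetical.

```python
import math


def estimate_transformation_events(tumor_free: int, total_animals: int) -> float:
    """Estimate total independent events in a cohort from the tumor-free fraction.

    Assumes P(no event in an animal) = exp(-m), so m = -ln(fraction tumor-free),
    and the cohort total is m * total_animals.
    """
    if not 0 < tumor_free <= total_animals:
        raise ValueError("tumor_free must be between 1 and total_animals")
    fraction_tumor_free = tumor_free / total_animals
    mean_events_per_animal = -math.log(fraction_tumor_free)
    return mean_events_per_animal * total_animals


# Values taken from the text (animals surviving at 130 days, cells transduced).
COHORTS = {
    "MGPN2": {"tumor_free": 1, "animals": 10, "cells_transduced": 37_700},
    "INS4(+)": {"tumor_free": 4, "animals": 10, "cells_transduced": 89_680},
}


def summarize(cohorts: dict) -> dict:
    """Return {vector: (rounded event count, events per transduced cell)}."""
    results = {}
    for vector, c in cohorts.items():
        events = round(estimate_transformation_events(c["tumor_free"], c["animals"]))
        results[vector] = (events, events / c["cells_transduced"])
    return results


if __name__ == "__main__":
    for vector, (events, freq) in summarize(COHORTS).items():
        print(f"{vector}: {events} events, {freq:.2e} per transduced cell")
```

Under these assumptions the unrounded estimates are about 23.0 events for MGPN2 and about 9.2 for INS4(+).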
These ratios were then compared by the one-sided Z-test for two proportions.

In order to estimate the number of simulated random integration events found within ±40 Mb of dysregulated genes (Table 1), we first mapped 100 simulated random integration sites relative to the dysregulated genes. This analysis revealed 29 cases (out of 32 dysregulated genes) where unique simulated random integration sites were located within a 40 Mb window of unique dysregulated genes, for an overall risk of 1 in 100 for each of these 29 genes (and 0 for the remaining 3 genes). We then calculated the relative risk for each of the cell clones by multiplying the risk for the dysregulated genes present in that clone (either 0 or 0.01) by the number of authentic vector proviruses present within that clone. Finally, we calculated the cumulative risk for all such occurrences by summing over all clones for each vector panel. Since none of the simulated random integration sites were found to be located within the body of dysregulated genes, we calculated this risk to be 0.

4. Aker M, Tubb J, Miller DG, Stamatoyannopoulos G, Emery DW. Integration bias of gammaretrovirus vectors following transduction and growth of primary mouse hematopoietic progenitor cells with and without selection. Mol Ther. 2006;4:226-235.
5. Horn SD. Goodness-of-fit tests for discrete data: a review and an application to a health impairment scale. Biometrics. 1977;33:237-247.

Li, Stamatoyannopoulos & Emery (8)