You are cordially invited to a talk of the Edmond J. Safra Bioinformatics Program Distinguished Speaker Series.

The speaker is Dr. Yaniv Erlich, Whitehead Institute for Biomedical Research, MIT

Title: "Surname leakage from whole genome sequencing dataset"

Time: Tuesday, 10 January 2012, at 12:15 (refreshments from 12:00)

Place: Holtzblat Hall 007, Physics Building, Exact Sciences Faculty

Host: Prof. Ron Shamir, School of Computer Science

Abstract: Posting anonymous sequencing results without identifiers has become a common practice in large-scale sequencing projects. Here, we report a novel risk of surname leakage from whole-genome sequencing datasets. Unlike previously described risks, this approach does not require physical access to the target's DNA or previous leakage of DNA data from the target. Surname leakage relies on bioinformatic profiling of short tandem repeats (STRs) on the Y chromosome and querying massive Web 2.0 genealogical databases. We demonstrate the applicability of the technique by recovering the surname ‘Venter’ from Craig Venter’s genome sequence. We also show that short-read datasets are amenable to this technique. According to a conservative estimate, surname recovery would jeopardize 10% of whole-genome sequencing datasets of US individuals. Moreover, the combination of a leaked surname with age and state narrows the identity of a sequencing dataset to ≤10 US individuals in most cases. As a remedy, we developed STRANGERS, a tool to mitigate the risk of surname leakage from sequencing datasets. STRANGERS uses a statistical framework that ensures maximal data sharing of Y-chromosome variations while eliminating the risk of surname leakage.