NIHR/BRC Vacation Studentship Summer 2014 Student: Noeline Nadarajah Supervisors: Hywel Williams, Vijeya Ganesan Title of Project: Elucidating the genetic basis of Moyamoya Disease Introduction Moyamoya Disease (MMD) is a cerebrovascular disease which can be characterised by the spontaneous occlusion of the circle of Willis. This is caused by the vascular occlusion of the carotid artery which can result in arterial ischemic stroke. Previous research has found that MMD can be inherited in an autosomal dominant or X-linked manner. However, this project focused primarily on the autosomal dominant mode of inheritance. In particular, exome sequences were analysed across 5 affected families (Family 4-8). For some analyses family 4 was analysed separately because it is a large multi-affected pedigree that may be segregating a unique mutation compared to family 5-8 which are smaller sporadic families. This project utilised Next Generation Sequencing (NGS) which is a technique for highthroughput genome sequencing through the process of parallel sequencing reactions. NGS describes a number of related techniques and for this project exome sequencing was chosen because it is a cost-effective method to search for rare disease variants. Exome sequencing describes a technique that sequences just the exons of genes which are protein-coding regions of the genome therefore focusing the analysis process. Since MMD can express such a severe phenotype, it has been suggested that this is most likely due to a mutation in a protein-coding region. Methods 1. Using specific software programs, exome sequence output data was converted into a VCF file which is a Variant Call Format file. This identifies all the variants which differ from the reference human genome and therefore could be potentially damaging. To analyse these VCF files across 5 families, Ingenuity Variant Analysis software was used by applying various filters to decrease the number of possible variants to detect a disease-causing candidate. Possible variants were assessed across families. 2. There is an assumption that a causal variant in a disease must be a rare variant therefore to test this, the software tool PLINK-SEQ was used. Potential candidate genes were entered into PLINK-SEQ. This allows an in-depth search of the GOSgene internal database of 385 past patients to distinguish between polymorphisms which are present in >1% of the population and rare variants which are present in <1% of the population. 3. In the case that there are no common variants across families, protein pathways in which variants may hold a role are examined using protein analysis tools such as DAVID, DAPPLE and Ingenuity Pathway Analysis (IPA). DAVID Functional Annotation Tool identifies and quantifies the genes involved in a pathway from the input gene list. DAPPLE stands for Disease Association Protein-Protein Link Evaluator and builds direct and indirect interaction networks from proteins encoded by input seed genes. IPA recognises the function of the protein encoded and possible protein interactions. This program can also determine the predominant pathway that the majority of genes are in. Results 1. Fig. 1 shows the initial filters applied to Ingenuity Variant Analysis and the genes TRPM6, CAMKK2 AND NF1 were found to be consistent across the sporadic family 5 – 8. By using PLINK-SEQ I identified that these variants were not rare variants and therefore unlikely to be disease-causing. 2. After altering the filters on Ingenuity Variant Analysis (Fig.2) to shift the focus to potentially damaging variants, it was found that there were no common variants across families. Therefore I next used the protein analysis software tools and in family 4 the CELA1 pathway was highlighted in DAVID for occupying a role in the conversion of proepithelin to epithelin and wound repair control. This same gene was also highlighted in IPA as being present in the pathway, displayed in fig.3. 3. Finally, I proceeded to apply additional filters on Ingenuity Variant Analysis (Fig. 4) to allow the identification of the most damaging list of genes. CELA1 was the only remaining gene in family 4 and therefore I checked the variant in PLINKSEQ. There were 2 mutations which affected CELA1 on chromosome 12 in PLINKSEQ. These were at positions 51740413 and 51740415 with the mutation at position 51740415 having a significantly lower number of variant alleles than the mutation at 51740413. Therefore this was a potentially interesting candidate and so to visually check the integrity of the variant in the raw data I used the software IGV Genome Browser. Due to the high number of variant calls in this region, we can deduce that this region is most likely a result of misalignment of the sequence reads during the initial alignment stage of the data analysis (fig. 5). Whilst assessing this area on the human genome browser UCSC, a 20bp insertion was identified at this location which is common in the population therefore the variant I discovered cannot be a rare disease variant. Discussion The aim of this project was to identify if there were any damaging variants present in families 4-8 and to analyse how these variants can affect sufferers of MMD. The families were tested under the autosomal dominant mode of inheritance and it was found that there were not any variants that were shared across families 4-8. This suggests that there is no single gene that is causing MMD. Through the use of protein analysis software the CELA1 gene was highlighted in family 4, however, further in-depth analysis of the raw data identified that this was a sequencing artefact that was due to misalignment of the sample to the reference genome. This suggests there is no predominant protein pathway that is being affected. MMD is a disease which is not well understood at the phenotype level therefore it is difficult to focus analysis on well-defined patient populations. Due to the rarity of the disease it is also difficult to determine the mode of inheritance and analyse the data to its full extent. In the future, with a better understanding of the phenotype and more patients to study it should become easier to identify patient sub-groups that are more amenable to genetic study which will aid in the identification of causative genetic factors. Acknowledgements This studentship opportunity has enabled me to develop my analytical skills by using different programs to come to a valid conclusion. I have learnt that organisational and presentation skills are vital to visualise findings in a clear and concise manner. By attending lab meetings I am now more aware of current research occurring at Institute of Child Health and I now appreciate the huge amount of time and the large cohort of people that are involved in the progress of a research project. Most importantly, I would like to express my thanks to GOSgene for giving me the chance to work in their team during my summer and it has been thoroughly enjoyable. Variants In Gene Variant Allele Frequency 1% Pathogenic/ likely pathogenic phenotype LOF associated with frameshift, missense Shared between families Low variant no. on internal database TRPM6 CAMKK2 NF1 Figure 1: This represents the primary filters applied to the exome sequence data on Ingenuity Variant Analysis software. The output of this analysis produced common variants, TRPM6, CAMKK2, NF1 which were SNPs or sequencing variants. Variants In Gene Variant Allele Frequency 1% LOF associated with frameshift, missense + damaging polyphen Input variants into protein analysis software Pathways shared between families CELA1 Figure 2: This represents the additional filters applied to the exome sequence data on Ingenuity Variant Analysis to increase the focus of the search for a rare disease variant. It was found that none of the pathways were shared between families. The CELA1 pathway was detected in family 4 by more than one software program. Acts on Peptidase Binding Enzyme Transcription regulator Kinase Figure 3: This is the Ingenuity Pathway Analysis output for the variants of family 4. There are protein interactions occurring between the damaging variants of family 4 and the TCF4 transcription regulator is regulating CELA1. Variants In Gene Variant Allele Frequency 1% LOF associated with splice, frameshift, nonsense Low variant no. on internal database Check on IGV for confirmation Figure 4: This is the most extreme filter applied to the exome sequence data on Ingenuity Variant Analysis. I focused particularly on family 4 and investigated the CELA1 gene on IGV Genome Browser. Homozygous unaffected Heterozygous parent Homozygous affected Heterozygous parent Figure 5: IGV Genome Browser shows there are heterozygous parents for this CELA1 variant where both of the parents have one normal chromosome and one chromosome with a deletion. The affected individual has inherited both of the deletion variants from each parent. However this is most likely a misalignment due to there being a number of other variants present in this region. UCSC genome browser has identified this region as a common variant therefore unlikely to be damaging.