Application

The UCSC Immunobrowser: Interactive Analysis of T-cell Receptor Sequencing Experiments

Hyunsung John Kim1,*, Co-author2 and Co-Author2
1 Department of Biomolecular Engineering, UC Santa Cruz
2 Department of Biomolecular Engineering, Address 1156 High Street, Santa Cruz, CA

Received on XXXXX; revised on XXXXX; accepted on XXXXX
Associate Editor: XXXXXXX

ABSTRACT
The human body is able to defend itself from a wide array of invading pathogens through innate and adaptive immune responses. The adaptive immune system possesses the extraordinary ability to generate specific defenses against never-before-seen pathogens. T-cells (and B-cells) are able to generate unique receptors through a process of combinatorial joining of DNA and random nucleotide addition, known as VDJ recombination. Thus, every immature T-cell contains a unique receptor DNA sequence not present in any other cell in the human body. T-cells whose receptors are able to target an invading pathogen specifically are then given signals to divide. With this phenomenon in mind, we can track the rise and fall of T-cell activity by identifying unique receptors via high-throughput sequencing. Changes in the fraction of T-cells in a blood sample displaying a particular receptor sequence indicate changes in the activity of the immune system specific to a pathogen. Although these experiments are in their infancy, one day we hope to be able to track the status of chronic autoimmune disorders and identify exact illnesses based on receptor sequence alone. One application of this methodology is the tracking of blood cancers and the sensitive detection of cancer recurrence. To enable the discovery of disease-specific receptor sequences, I have developed the UCSC TCR-B Browser. The Browser is a web-based tool that allows researchers to quickly compare the fractions of unique receptor sequences between samples.
It also allows researchers to identify specific receptors that may be disease associated and to track their relative rise and fall over time. The tool is particularly useful for tracking blood cancers, but is also applicable to tracking the disease state of a person affected by an autoimmune disorder.

*To whom correspondence should be addressed.
© Oxford University Press 2005

1 MOTIVATION
The adaptive immune system is unique in its ability to generate specific attacks against a diverse array of pathogens. T-cells are instrumental to the development of specific immune responses because they generate surface-bound receptors that can bind specifically to pathogenic peptides displayed on the surface of infected cells. Each immature T-cell generates a novel T-cell receptor (TCR) with a random binding pocket that can interact specifically with a pathogen. When an immature T-cell interacts with an infected cell and detects the presence of a non-self peptide, it becomes activated. An activated T-cell begins dividing in a process known as clonal expansion, and its progeny take on a wide array of functions, including killing infected cells and activating other immune cells.

Clonal expansion is a marker of an immune response specific to a pathogen. All members of a clonal expansion share the same TCR, developed from the single immature T-cell that gave rise to the entire population. The TCR structure is unique to the T-cell members of a clonal expansion. The unique structure of each TCR is generated through a somatic process known as VDJ recombination, which introduces variability through combinatorial joining of homologous gene segments and random nucleotide insertion and deletion. VDJ recombination produces millions of possible sequences, and the resulting sequence serves as a genetic fingerprint for each clonal expansion.
Advances in sequencing have enabled the tracking of clonal populations of T-cells by the genetic fingerprint introduced during the development of immature T-cells in the thymus. The process of sequencing the TCR has been described as “immunosequencing” and “rep-seq” by various groups, but is in essence a combination of high-throughput sequencing and targeted amplification (Benichou et al. 2011; Robins 2013). Numerous groups have described similar assays that sequence the DNA or RNA of the TCR using 454, Illumina and Ion Torrent sequencers (Robins et al. 2009; Warren et al. 2009; Wang et al. 2010; Robins et al. 2011; Warren et al. 2011). Additionally, numerous commercial entities such as adaptiveTCR, iRepertoire and GigaGen provide TCR sequencing services. Together these technologies have powered studies on the general properties of the TCR repertoire, the tracking of minimal residual disease in leukemia and the identification of public TCR sequences (Boyd et al. 2009; Freeman et al. 2009; Robins et al. 2009; Klarenbeek et al. 2010; Robins et al. 2010; Wang et al. 2010; Robins et al. 2011; Venturi et al. 2011; Klarenbeek et al. 2012; Wu et al. 2012; van Heijst et al. 2013).

As the technology matures, numerous groups have developed software to process the large amount of data. However, the majority of these tools are dedicated to processing raw sequencing data, calling VDJ gene segments in recombined receptors and mitigating sequencing error (Bolotin et al. 2012; Bolotin et al. 2013; Thomas et al. 2013). Even software packages that offer visualization tools, such as miTCR, are not well suited for cross-sample comparison. While commercial services such as adaptiveTCR and iRepertoire offer web-based analysis and visualization tools, these are not open to the public, and data generated using open protocols are not compatible with their systems. The need for an open academic resource for the comparison of T-cell receptor data is apparent.
Here we present the UCSC Immunobrowser, a web-based tool for analysis of TCR sequencing data. The Immunobrowser automatically generates interactive repertoire-level visualizations for any number of samples. Interactivity allows researchers to concentrate on relevant changes in clonality and to identify individual clones responsible for large changes in the clonal representation of the TCR repertoire. Recombination details are provided for each clone, and all clones can be searched against an extensive database of TCR sequences mined from the literature. All views generated during use of the Immunobrowser are persistent. A RESTful architecture makes sharing results between researchers as simple as copying and pasting the current URL.

2 VIEWS
The Immunobrowser presents users with a traditional web-based interface to analyze and compare the repertoires of an unlimited number of samples. Upon first loading the site, users are presented with five distinct views: a sample browse view, a comparison view, a literature search, a help view and a local search bar.

The Browse Samples View
The sample browse view displays all samples present within the Immunobrowser along with relevant clinical details, including the age of the patient, disease status, cell type and blood draw date. From this view, users can also follow a detail link, which loads the repertoire of the associated sample in the comparison view.

The Comparison View
The comparison view is the most feature-rich of all the views and displays all the repertoire-level visualizations generated by the Immunobrowser. All samples are color-coded in the plots generated in the comparison view. Highlighting any sample highlights the corresponding sample throughout all the plots in the view. The comparison view automatically generates a table of summary statistics, a spectratype plot, a functionality plot, a clone domination plot, a V-J usage scatterplot and (for comparisons with more than one sample) a shared clones plot.
The clonotypes belonging to all samples can be filtered by frequency, by TCR nucleotide length, or by a specific V or J gene segment. This allows users to compare multiple subsets of the same sample, or to narrow down the repertoire to receptors of interest.

The summary statistics table shows the number of unique amino acid sequences, recombinations and sequencing reads observed for each sample. Shannon’s entropy is also reported as a measure of clonal expansion in the repertoire. Low entropy values indicate the dominance of a few clones over the entire repertoire, so entropy provides a summary statistic for adaptive immune health. The table also contains a link to all the clonotypes within a sample that pass the user-defined filters.

The spectratype plot recreates an immunological assay in which TCR sequence representation was assayed by length-based electrophoresis rather than sequencing. It can be interpreted as a histogram of CDR3 length, where the X-axis represents sequence length and the Y-axis represents frequency. In general, the spectratype shows spikes at CDR3 lengths that are multiples of three. This effect is attributable to the selection of coding TCRs during T-cell development. CDR3s whose lengths are a multiple of three are distributed normally in healthy adults who are not undergoing an active immune response. Large deviations from a normal distribution are often viewed as evidence of a clonal expansion.

The functionality plot is a simple stacked bar plot that shows the proportion of the sample that consists of coding TCRs, sequences with stop codons and sequences with frameshift mutations. This plot allows researchers to assess enrichment for functional TCR sequences in a dataset, a metric that is especially useful for RNA-based sequencing assays.

A domination plot shows the total cumulative fraction of the repertoire that is represented by the top 100 clones with the largest frequency.
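Two of the repertoire-level quantities described above, Shannon’s entropy and the domination curve, can be sketched in a few lines of Python. This is a minimal illustration and not the Immunobrowser’s own code: the function names, the use of base-2 logarithms, the choice of read counts per amino acid sequence as input, and the example data are all assumptions made for the sketch.

```python
import math
from typing import Dict, List


def shannon_entropy(counts: Dict[str, int]) -> float:
    """Shannon entropy of clone frequencies derived from read counts.

    Assumption: base-2 logarithm (bits); the source does not state the base.
    """
    total = sum(counts.values())
    if total == 0:
        return 0.0
    entropy = 0.0
    for n in counts.values():
        if n > 0:
            p = n / total
            entropy -= p * math.log2(p)
    return entropy


def domination_curve(counts: Dict[str, int], top_n: int = 100) -> List[float]:
    """Cumulative repertoire fraction covered by the top_n most frequent clones."""
    total = sum(counts.values())
    if total == 0:
        return []
    ranked = sorted(counts.values(), reverse=True)[:top_n]
    curve: List[float] = []
    running = 0
    for n in ranked:
        running += n
        curve.append(running / total)
    return curve


# Hypothetical read counts keyed by CDR3 amino acid sequence (illustrative only).
even = {"CASSA": 25, "CASSB": 25, "CASSC": 25, "CASSD": 25}
skewed = {"CASSA": 97, "CASSB": 1, "CASSC": 1, "CASSD": 1}

print(shannon_entropy(even), shannon_entropy(skewed))
print(domination_curve(skewed))
```

In this toy example the skewed repertoire has the lower entropy, and its domination curve starts near one because a single clone accounts for almost all reads.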
A quickly rising line represents large clonal expansions that dominate the repertoire, while a slowly rising line indicates a repertoire with a more even distribution of clonotypes.

Scatterplots have multiple elements that simultaneously display the observed frequencies of V-J recombined pairs, V gene segments and J gene segments (Figure 1). V-J recombined pairs are displayed as a scatter plot in which each V-J pair is represented by a circle. Larger circles represent higher observed frequencies within a sample. When a circle is highlighted, a tool tip displays the usage in each sample as a decimal. V and J gene segment usage histograms flank the main scatterplot and share the same axis labels. By default, these histograms show the overall usage of V and J gene segments in the dataset. However, when an axis label is highlighted, the histograms display usage for that specific V or J gene segment.

The shared clones plot displays frequencies of amino acid sequences shared between all samples in a comparison. Frequencies for shared amino acid sequences are displayed both as a line plot and in a table. Every shared amino acid sequence links to the amino acid detail view. Amino acid sequences highlighted in the plot are also highlighted in the table, and vice versa.

Literature Search
Users can search our database of CDR3 sequences mined from the literature by inputting their protein sequence in FASTA format. Alternatively, users can access this page from the amino acid detail view. Any resulting hits to the literature-mined database are displayed with query-sequence alignments and links to the original publication.

Help View
The help view contains the biological background necessary to understand and use the browser. The section covers the basic adaptive immune response, the VDJ recombination mechanism and clonal expansion. In addition, the help section also covers the technical

Local Search
The local search is a unified search that queries samples, recombinations and amino acids. Users can search for exact matches to samples, recombinations or amino acid sequences.

Other views: Clonotype, Recombination and Amino Acid Detail views
In addition to the interactive sample comparison views, detail views exist for individual clonotypes, recombinations and amino acids. These views can be accessed via the “view all clonotypes” link in the summary table, or from the shared amino acid plot in the comparison view. These views display the total number of reads represented in a clonotype, the V, D and J gene segments used in a recombination, and the final coding amino acid sequence (if applicable).

3 IMPLEMENTATION
The Immunobrowser is built using standard web development platforms. Django is used for the backend, with MySQL for data storage. The front end is built using jQuery and Twitter Bootstrap. Interactive visualizations in the comparison view are built using d3.js.

Django follows a Model-View-Controller (MVC) software design pattern. In the Immunobrowser, models were designed to reflect real life. Five key models were designed to represent data generated by TCR sequencing experiments. A patient model stores gender, age and disease status. Each patient can be associated with one or more samples. Each sample represents a single sequencing experiment and can store blood draw date, cell type and other descriptions in a note field. Each sample model contains many clonotypes. A clonotype stores the observed number of sequencing reads and the frequency of that clonotype within a sample. Each clonotype model contains a foreign reference to exactly one genetic recombination. A recombination model stores the called V, D and J gene segments along with recombination parameters such as the number of insertions or deletions between gene segments. All recombinations also contain a foreign reference to a single amino acid object. The amino acid model stores only an amino acid sequence. The models were designed to reflect reality: multiple recombinations can generate the same amino acid sequence, and recombinations can be shared across samples. In addition to the models that represent sample data, there are also models for the literature database, which contain information on the amino acid sequence and details of the article from which it was extracted.

All plots for the comparison view were generated using d3.js. The plots are designed to be reusable and can accommodate any data in the same format. AJAX calls retrieve JSON objects from the Django backend. Thus, external services can also use the same API to retrieve JSON data from the Immunobrowser database.

Currently the Immunobrowser supports data generated using the miTCR software along with the adaptiveTCR commercial services.

REFERENCES
Benichou J, Ben-Hamo R, Louzoun Y, Efroni S. 2011. Rep-Seq: Uncovering the Immunological Repertoire through Next Generation Sequencing. Immunology 135(3): 183-191.
Bolotin DA, Mamedov IZ, Britanova OV. 2012. Next generation sequencing for TCR repertoire profiling: Platform-specific features and correction algorithms. European Journal of ….
Bolotin DA, Shugay M, Mamedov IZ, Putintseva EV, Turchaninova MA, Zvyagin IV, Britanova OV, Chudakov DM. 2013. MiTCR: software for T-cell receptor sequencing data analysis. Nature methods 10(9): 813-814.
Boyd SD, Marshall EL, Merker JD, Maniar JM, Zhang LN, Sahaf B, Jones CD, Simen BB, Hanczaruk B, Nguyen KD et al. 2009. Measurement and clinical monitoring of human lymphocyte clonality by massively parallel VDJ pyrosequencing. Science translational medicine 1(12): 12ra23.
Freeman JD, Warren RL, Webb JR, Nelson BH, Holt RA. 2009. Profiling the T-cell receptor beta-chain repertoire by massively parallel sequencing. Genome research 19(10): 1817-1824.
Klarenbeek P, Tak P, van Schaik B. 2010.
Human T-cell memory consists mainly of unexpanded clones. Immunology letters.
Klarenbeek PL, de Hair MJH, Doorenspleet ME, van Schaik BDC, Esveldt REE, van de Sande MGH, Cantaert T, Gerlag DM, Baeten D, van Kampen AHC et al. 2012. Inflamed target tissue provides a specific niche for highly expanded T-cell clones in early human autoimmune disease. Annals of the rheumatic diseases.
Robins H. 2013. Immunosequencing: applications of immune repertoire deep sequencing. Current opinion in immunology.
Robins H, Desmarais C, Matthis J, Livingston R, Andriesen J, Reijonen H, Carlson C, Nepom G, Yee C, Cerosaletti K. 2011. Ultra-sensitive detection of rare T cell clones. Journal of immunological methods.
Robins HS, Campregher PV, Srivastava SK, Wacher A, Turtle CJ, Kahsai O, Riddell SR, Warren EH, Carlson CS. 2009. Comprehensive assessment of T-cell receptor beta-chain diversity in alphabeta T cells. Blood 114(19): 4099-4107.
Robins HS, Srivastava SK, Campregher PV, Turtle CJ, Andriesen J, Riddell SR, Carlson CS, Warren EH. 2010. Overlap and effective size of the human CD8+ T cell receptor repertoire. Science translational medicine 2(47): 47ra64.
Thomas N, Heather J, Ndifon W, Shawe-Taylor J, Chain B. 2013. Decombinator: a tool for fast, efficient gene assignment in T-cell receptor sequences using a finite state machine. Bioinformatics (Oxford, England) 29(5): 542-550.
van Heijst JWJ, Ceberio I, Lipuma LB, Samilo DW, Wasilewski GD, Gonzales AMR, Nieves JL, van den Brink MRM, Perales MA, Pamer EG. 2013. Quantitative assessment of T cell repertoire recovery after hematopoietic stem cell transplantation. Nature Medicine 19(3): 372-377.
Venturi V, Ng P, Ende Z, McIntosh T. 2011. A Mechanism for TCR Sharing between T Cell Subsets and Individuals Revealed by Pyrosequencing. The Journal of ….
Wang C, Sanders CM, Yang Q, Schroeder HW, Wang E, Babrzadeh F, Gharizadeh B, Myers RM, Hudson JR, Davis RW et al. 2010. High throughput sequencing reveals a complex pattern of dynamic interrelationships among human T cell subsets. Proceedings of the National Academy of Sciences of the United States of America 107(4): 1518-1523.
Warren RL, Freeman JD, Zeng T, Choe G, Munro S, Moore R, Webb JR, Holt RA. 2011. Exhaustive T-cell repertoire sequencing of human peripheral blood samples reveals signatures of antigen selection and a directly measured repertoire size of at least 1 million clonotypes. Genome research 21(5): 790-797.
Warren RL, Nelson BH, Holt RA. 2009. Profiling model T-cell metagenomes with short reads. Bioinformatics (Oxford, England) 25(4): 458-464.
Wu D, Sherwood A, Fromm JR, Winter SS, Dunsmore KP, Loh ML, Greisman HA, Sabath DE, Wood BL, Robins H. 2012. High-throughput sequencing detects minimal residual disease in acute T lymphoblastic leukemia. Science translational medicine 4(134): 134ra163.

Figure 1. Scatterplots show the V-J gene segment usage in three samples, where red, green and blue circles represent individual samples. The area of each circle is proportional to the V-J junctional usage frequency. Histograms sit to the right of and below the scatterplot and show the marginal frequencies of individual V and J gene segments. a. The default scatterplot view. b. When an axis label is highlighted, the V usage histogram below the main scatterplot changes to reflect the V usage only in the TRBJ2-1 (bold) gene segment. c. Individual samples can be highlighted by mousing over an individual sample. d. Individual V-J gene segments can be interrogated by placing the mouse cursor over a circle. The browser automatically displays the numerical frequencies observed for each sample.
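The quantities plotted in Figure 1 can be sketched as follows. This is a minimal illustration under stated assumptions, not the Immunobrowser’s implementation: the function names `vj_usage` and `v_usage_within_j` are hypothetical, usage is assumed to be weighted by sequencing reads, and the per-J histogram is left as unnormalized joint frequencies because the text does not say whether the browser renormalizes it. Apart from TRBJ2-1, which appears in the caption, the gene segment names and counts are illustrative.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

# Assumed input shape: (V gene segment, J gene segment, read count) per clonotype.
Clonotype = Tuple[str, str, int]
PairFreq = Dict[Tuple[str, str], float]


def vj_usage(clonotypes: Iterable[Clonotype]):
    """Return V-J pair frequencies plus marginal V and J frequencies.

    Pair frequencies correspond to circle areas in the scatterplot; the
    marginals correspond to the flanking histograms in the default view.
    """
    pair_counts: Dict[Tuple[str, str], int] = defaultdict(int)
    total = 0
    for v_gene, j_gene, reads in clonotypes:
        pair_counts[(v_gene, j_gene)] += reads
        total += reads
    if total == 0:
        return {}, {}, {}
    pair_freq = {pair: n / total for pair, n in pair_counts.items()}
    v_freq: Dict[str, float] = defaultdict(float)
    j_freq: Dict[str, float] = defaultdict(float)
    for (v_gene, j_gene), freq in pair_freq.items():
        v_freq[v_gene] += freq
        j_freq[j_gene] += freq
    return pair_freq, dict(v_freq), dict(j_freq)


def v_usage_within_j(pair_freq: PairFreq, j_gene: str) -> Dict[str, float]:
    """V usage restricted to one J segment (panel b), without renormalizing."""
    return {v: f for (v, j), f in pair_freq.items() if j == j_gene}


# Hypothetical sample (illustrative gene names and read counts).
example = [
    ("TRBV5-1", "TRBJ2-1", 60),
    ("TRBV7-9", "TRBJ2-1", 20),
    ("TRBV5-1", "TRBJ1-1", 20),
]
pair_freq, v_freq, j_freq = vj_usage(example)
print(pair_freq)
print(v_usage_within_j(pair_freq, "TRBJ2-1"))
```

Running `vj_usage` once per sample and plotting the results in different colors would give the overlay described in the caption.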