Technical Writeup

Application
The UCSC Immunobrowser: Interactive Analysis of
T-cell Receptor Sequencing Experiments
Hyunsung John Kim1,*, Co-author2 and Co-Author2
1 Department of Biomolecular Engineering, UC Santa Cruz
2 Department of Biomolecular Engineering, 1156 High Street, Santa Cruz, CA
Received on XXXXX; revised on XXXXX; accepted on XXXXX
Associate Editor: XXXXXXX
ABSTRACT
The human body is able to defend itself from a wide
array of invading pathogens through innate and adaptive immune responses. The adaptive immune system
possesses the extraordinary ability to generate specific
defenses against never-before-seen pathogens. T-cells (and B-cells) are able to generate unique receptors through a process of combinatorial joining of DNA
and random nucleotide addition, known as VDJ recombination. Thus, every immature T-cell contains a
unique receptor DNA sequence not present in any
other cell in the human body. T-cells whose receptors
are able to target an invading pathogen specifically are
then given signals to divide.
With this phenomenon in mind, we have the ability to
track the rise and fall of T-cell activity by identifying
unique receptors via high-throughput sequencing.
Changes in the fraction of T-cells in a blood sample displaying a particular receptor sequence indicate changes in the activity of the immune system specific to a pathogen. Although these experiments are in their infancy,
one day we hope to be able to track the status of
chronic autoimmune disorders and identify exact illnesses based on receptor sequence alone. One application of this methodology is the tracking of blood cancers and sensitive detection of cancer recurrence.
To enable the discovery of disease specific receptor
sequences, I have developed the UCSC TCR-B
Browser. The Browser is a web-based tool that allows
researchers to quickly compare the fractions of unique
receptor sequences between samples. It also allows
researchers to identify specific receptors that may be
disease associated and allows them to track their relative rise and fall over time. The tool is particularly useful for tracking of blood cancers, but is also applicable
to tracking the disease state in a person affected by
autoimmune disorders.
*To whom correspondence should be addressed.
© Oxford University Press 2005
1 MOTIVATION
The adaptive immune system is unique in its ability to generate specific attacks to a diverse array of pathogens. T-cells
are instrumental to the development of specific immune
responses due to their ability to generate surface-bound receptors that can bind specifically to pathogenic peptides
displayed on the surface of infected cells. Each immature T-cell generates a novel T-cell receptor (TCR) with a random binding pocket that can interact specifically with a pathogen.
When an immature T-cell interacts with an infected cell and
detects the presence of a non-self peptide, it becomes activated. An activated T-cell begins dividing in a process known
as clonal expansion, and its progeny take on a wide array of
functions including killing of infected cells and activating
other immune cells.
Clonal expansion is a marker of an immune response specific to a pathogen. All members of a clonal expansion share
the same TCR developed from the single immature T-cell
that gave rise to the entire population. The TCR structure is
unique to the T-cell members of a clonal expansion. The
unique structure of each TCR is generated through a somatic
process known as VDJ recombination, which introduces variability
through combinatorial joining of homologous gene segments
and random nucleotide insertion and deletion. VDJ
recombination produces millions of possible sequences, and the
resulting sequence serves as a genetic fingerprint for each clonal expansion.
Advances in sequencing have enabled the tracking of clonal
populations of T-cells by a genetic fingerprint introduced
during the development of immature T-cells in the thymus.
The process of sequencing the TCR has been described as
“immunosequencing” and “rep-seq” by various groups, but
is in essence a combination of high throughput sequencing
and targeted amplification (Benichou et al. 2011; Robins
2013). Numerous groups have described similar assays that
sequence the DNA or RNA of the TCR by 454, Illumina,
and Ion Torrent sequencers (Robins et al. 2009; Warren et
al. 2009; Wang et al. 2010; Robins et al. 2011; Warren et al.
2011). Additionally, numerous commercial entities such as
adaptiveTCR, iRepertoire and gigagen provide TCR sequencing services. Together these technologies have powered studies on the general properties of the TCR repertoire,
the tracking of minimal residual disease in leukemia and the
identification of public TCR sequences (Boyd et al. 2009;
Freeman et al. 2009; Robins et al. 2009; Klarenbeek et al.
2010; Robins et al. 2010; Wang et al. 2010; Robins et al.
2011; Venturi et al. 2011; Klarenbeek et al. 2012; Wu et al.
2012; van Heijst et al. 2013).
As the technology matures, numerous groups have developed software to process the large amount of data. However,
the majority of these tools are dedicated to the processing
of raw sequencing data, calling VDJ gene segments in recombined receptors and mitigating sequencing error (Bolotin
et al. 2012; Bolotin et al. 2013; Thomas et al. 2013). Even
software packages that offer visualization tools, such as
miTCR, are not well suited for cross-sample comparison.
While commercial services such as adaptiveTCR and
iRepertoire offer web-based analysis and visualization tools,
these are not open to the public, and data generated using
open protocols are not compatible with their systems.
The need for an open academic resource for the comparison
of T-cell receptor data is apparent. Here we present the
UCSC Immunobrowser, a web-based tool for analysis of
TCR sequencing data. The Immunobrowser automatically
generates interactive repertoire-level visualizations for any
number of samples. Interactivity allows researchers to concentrate on relevant changes in clonality and to identify
individual clones responsible for large changes in clonal
representation of the TCR repertoire. Recombination details
are provided for each clone and all clones can be searched
against an extensive database of TCR sequences mined from
the literature.
All views generated during use of the Immunobrowser are
persistent. RESTful architecture makes sharing results between researchers as simple as copying and pasting the current
URL.
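As a rough illustration of how a persistent, shareable view can work, the sketch below encodes the state of a comparison in a URL query string and recovers it again. The base URL and the parameter names (`sample`, `min_freq`, `v_gene`) are assumptions made for this example and are not taken from the Immunobrowser itself.

```python
from urllib.parse import urlencode, urlsplit, parse_qs


def build_comparison_url(base_url, sample_ids, min_frequency=None, v_gene=None):
    """Encode a comparison state (hypothetical parameters) in a URL."""
    params = [("sample", sid) for sid in sample_ids]
    if min_frequency is not None:
        params.append(("min_freq", min_frequency))
    if v_gene is not None:
        params.append(("v_gene", v_gene))
    return base_url + "?" + urlencode(params)


def parse_comparison_url(url):
    """Recover the comparison state from a URL built above."""
    query = parse_qs(urlsplit(url).query)
    return {
        "sample_ids": [int(s) for s in query.get("sample", [])],
        "min_frequency": float(query["min_freq"][0]) if "min_freq" in query else None,
        "v_gene": query["v_gene"][0] if "v_gene" in query else None,
    }


# example.org is a placeholder host, not the real service address.
url = build_comparison_url(
    "https://example.org/compare", [3, 7], min_frequency=0.001, v_gene="TRBV7-9"
)
state = parse_comparison_url(url)
```

Because the whole state lives in the URL, a colleague who opens the same link would see the same samples and filters without any server-side session.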
2 VIEWS
The Immunobrowser presents users with a traditional web-based interface to analyze and compare the repertoires of an
unlimited number of samples. Upon first loading the site,
users are presented with five distinct views: a sample
browse view, a comparison view, a literature search, a help view and a
local search bar.
The Browse Samples View
The sample browse view displays all samples present within the
Immunobrowser and relevant clinical details including the
age of the patient, disease status, cell type, and blood draw
date. From this view, users can also follow a detail link
which loads the repertoire of the associated sample in the
comparison view.
The Comparison View
The comparison view is the most feature-rich of all the
views and displays all the repertoire-level visualizations generated by the
Immunobrowser. All samples are color-coded in
the plots generated in the comparison view. Highlighting
any sample highlights the corresponding sample throughout
all the plots in the view. The comparison view automatically
generates a table of summary statistics, a spectratype plot, functionality statistics, a clone domination plot, a V-J usage scatterplot
and (for comparisons with more than one sample) a shared
clones plot.
The clonotypes belonging to all samples can be filtered
by frequency, by TCR nucleotide length, or by a
specific V or J gene segment. This allows users to compare
multiple subsets of the same sample, or to narrow down the
repertoire to receptors of interest.
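A minimal sketch of this kind of filtering is shown below. The `Clonotype` record and its field names are hypothetical and chosen for illustration; the sequences are toy data.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional


@dataclass(frozen=True)
class Clonotype:
    """Hypothetical record; field names are illustrative only."""
    nucleotide: str   # recombined TCR nucleotide sequence
    v_gene: str
    j_gene: str
    frequency: float  # fraction of the sample's reads


def filter_clonotypes(
    clonotypes: Iterable[Clonotype],
    min_frequency: float = 0.0,
    min_length: Optional[int] = None,
    max_length: Optional[int] = None,
    v_gene: Optional[str] = None,
    j_gene: Optional[str] = None,
) -> List[Clonotype]:
    """Keep clonotypes that pass every filter that was supplied."""
    kept = []
    for clone in clonotypes:
        length = len(clone.nucleotide)
        if clone.frequency < min_frequency:
            continue
        if min_length is not None and length < min_length:
            continue
        if max_length is not None and length > max_length:
            continue
        if v_gene is not None and clone.v_gene != v_gene:
            continue
        if j_gene is not None and clone.j_gene != j_gene:
            continue
        kept.append(clone)
    return kept


# Toy data, not real measurements.
sample = [
    Clonotype("TGTGCCAGCAGT", "TRBV7-9", "TRBJ2-1", 0.5),
    Clonotype("TGTGCCAGC", "TRBV7-9", "TRBJ1-1", 0.25),
    Clonotype("TGTGCCAGCAGTTTA", "TRBV5-1", "TRBJ2-1", 0.25),
]
expanded = filter_clonotypes(sample, min_frequency=0.3)
trbj2_1 = filter_clonotypes(sample, j_gene="TRBJ2-1", max_length=12)
```

Applying different filter settings to the same sample yields the subsets that can then be compared side by side.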
The summary statistics table shows the number of unique
amino acid sequences, recombinations and sequencing reads observed
for each sample. Shannon’s entropy is also reported as a
measure of clonal expansion in the repertoire. Low entropy
values indicate the dominance of a few clones over the entire repertoire and provide a summary statistic for adaptive
immune health. The table also contains a link to all the clonotypes
within a sample that pass user-defined filters.
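For reference, Shannon's entropy of a repertoire with clone frequencies p_i is H = -sum(p_i * log(p_i)). The sketch below computes it from per-clone read counts; the base-2 logarithm is an assumption, since the text does not state which base the browser uses.

```python
import math


def shannon_entropy(read_counts, base=2.0):
    """Shannon entropy of a repertoire given per-clone read counts."""
    total = sum(read_counts)
    if total <= 0:
        raise ValueError("read_counts must contain at least one read")
    entropy = 0.0
    for count in read_counts:
        if count > 0:  # clones with zero reads contribute nothing
            p = count / total
            entropy -= p * math.log(p, base)
    return entropy


even_repertoire = shannon_entropy([25, 25, 25, 25])    # four equally sized clones
dominated_repertoire = shannon_entropy([97, 1, 1, 1])  # one clone dominates
```

The evenly distributed repertoire gives the higher value, while the repertoire dominated by a single clone gives a value close to zero.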
The spectratype plot recreates an immunological assay
in which TCR sequence representation is assayed by length-based electrophoresis rather than sequencing. It can be interpreted as a histogram of CDR3 length, where the X-axis
represents sequence length and the Y-axis represents frequency. In general, the spectratype shows spikes at CDR3
lengths that are multiples of three. This effect is attributable
to the selection of coding TCRs during T-cell development.
CDR3s whose lengths are a multiple of three are
distributed normally in healthy adults who are not undergoing an active immune response. Large deviations from a
normal distribution are often viewed as evidence of a clonal
expansion.
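The histogram behind a spectratype can be sketched as a frequency-weighted count of CDR3 lengths. The input format, a list of (CDR3 nucleotide sequence, frequency) pairs, is an assumption made for this example, and the sequences are toy data.

```python
from collections import defaultdict


def spectratype(clonotypes):
    """Sum clonotype frequencies per CDR3 nucleotide length.

    clonotypes: iterable of (cdr3_nucleotide_sequence, frequency) pairs.
    Returns a dict mapping length -> total frequency, sorted by length.
    """
    histogram = defaultdict(float)
    for sequence, frequency in clonotypes:
        histogram[len(sequence)] += frequency
    return dict(sorted(histogram.items()))


# Toy data, not real measurements.
lengths = spectratype([
    ("TGTGCCAGC", 0.5),
    ("TGTGCCAGCAGT", 0.25),
    ("TGTGCCAGCTTT", 0.25),
])
```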
The functionality plot is a simple stacked bar plot that
shows the proportion of the sample made up of coding
TCRs, sequences with stop codons and sequences with frameshift mutations. This plot allows researchers to assay enrichment for
functional TCR sequences in a dataset, a metric which is
especially useful for RNA-based sequencing assays.
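One simple way to assign sequences to these three categories is sketched below. It assumes that each sequence starts in the correct reading frame at its first base, and it labels any sequence whose length is not a multiple of three as a frameshift before checking for stop codons; the browser's actual rules may differ.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # standard genetic code


def classify(sequence):
    """Label a nucleotide sequence as coding, stop_codon or frameshift."""
    sequence = sequence.upper()
    if len(sequence) % 3 != 0:
        return "frameshift"
    codons = [sequence[i:i + 3] for i in range(0, len(sequence), 3)]
    if any(codon in STOP_CODONS for codon in codons):
        return "stop_codon"
    return "coding"


def functionality_proportions(clonotypes):
    """Frequency-weighted share of each category.

    clonotypes: iterable of (nucleotide_sequence, frequency) pairs.
    """
    totals = {"coding": 0.0, "stop_codon": 0.0, "frameshift": 0.0}
    for sequence, frequency in clonotypes:
        totals[classify(sequence)] += frequency
    total = sum(totals.values())
    if total == 0:
        return totals
    return {label: value / total for label, value in totals.items()}


# Toy data, not real measurements.
proportions = functionality_proportions([
    ("TGTGCCAGC", 0.5),   # in frame, no stop codon
    ("TGTTAGAGC", 0.25),  # in frame, contains TAG
    ("TGTGCCAG", 0.25),   # length 8, out of frame
])
```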
A domination plot shows the cumulative fraction of the
repertoire that is represented by the 100 clones with the
highest frequencies. A quickly rising line represents large
clonal expansions that dominate the repertoire, while a
slowly rising line indicates a repertoire with a more even
distribution of clonotypes.
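The curve in a domination plot can be sketched as a running sum over the clone frequencies sorted from largest to smallest. The function name and the `top_n` parameter are chosen for this example.

```python
from itertools import accumulate


def domination_curve(frequencies, top_n=100):
    """Cumulative repertoire fraction covered by the top_n largest clones."""
    top = sorted(frequencies, reverse=True)[:top_n]
    return list(accumulate(top))


# Toy frequencies: the two largest clones cover 75% of the repertoire.
curve = domination_curve([0.125, 0.5, 0.25, 0.125], top_n=2)
```

Plotting the returned list against clone rank gives the line described above: a steep early rise means a few clones account for most of the reads.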
Scatterplots have multiple elements that simultaneously
display the observed frequencies of V-J recombined pairs, V gene segments and J gene segments (Figure 1). V-J recombined pairs are displayed as a scatterplot where each V-J
pair is represented by a circle. Larger circles represent higher observed frequencies within a sample. When a
circle is highlighted, a pop-up tooltip displays the usage in each
sample as a decimal. V and J gene segment usage histograms flank the main scatterplot and share
the same axis labels. By default, these histograms show the
overall usage of V and J gene segments in the dataset. However, when an axis label is highlighted, the histogram can instead display usage restricted to that specific V or J gene segment.
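The quantities behind the scatterplot and its flanking histograms can be sketched as a joint V-J frequency table plus its two marginals. The input format, (V gene, J gene, frequency) triples, is an assumption for this example, and the values are toy data.

```python
from collections import Counter


def vj_usage(clonotypes):
    """Return joint V-J usage and the marginal V and J usage.

    clonotypes: iterable of (v_gene, j_gene, frequency) triples.
    """
    joint, v_usage, j_usage = Counter(), Counter(), Counter()
    for v_gene, j_gene, frequency in clonotypes:
        joint[(v_gene, j_gene)] += frequency  # circle size in the scatterplot
        v_usage[v_gene] += frequency          # marginal V histogram
        j_usage[j_gene] += frequency          # marginal J histogram
    return joint, v_usage, j_usage


def v_usage_given_j(joint, j_gene):
    """V usage restricted to one J gene (an axis label being highlighted)."""
    return {v: f for (v, j), f in joint.items() if j == j_gene}


# Toy data, not real measurements.
joint, v_usage, j_usage = vj_usage([
    ("TRBV7-9", "TRBJ2-1", 0.5),
    ("TRBV5-1", "TRBJ2-1", 0.25),
    ("TRBV7-9", "TRBJ1-1", 0.25),
])
restricted = v_usage_given_j(joint, "TRBJ2-1")
```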
The shared clones plot displays frequencies of amino acid
sequences shared between all samples in a comparison. Frequencies for shared amino acid sequences are displayed as
both a line plot and a table. Each shared amino acid sequence links to the amino acid detail view. Amino acid sequences highlighted in the plot are also highlighted in the table, and vice
versa.
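Finding the sequences shared between all samples amounts to a set intersection over each sample's amino acid sequences. In the sketch below, the nested-dictionary input and the sample names are assumptions for illustration, and the sequences are toy strings.

```python
def shared_amino_acids(samples):
    """Frequencies of amino acid sequences present in every sample.

    samples: dict mapping sample name -> {amino_acid_sequence: frequency}.
    Returns {amino_acid_sequence: {sample name: frequency}}.
    """
    if not samples:
        return {}
    shared = set.intersection(*(set(freqs) for freqs in samples.values()))
    return {
        aa: {name: freqs[aa] for name, freqs in samples.items()}
        for aa in sorted(shared)
    }


# Toy data, not real measurements.
shared = shared_amino_acids({
    "day0": {"CASSA": 0.5, "CASSB": 0.5},
    "day30": {"CASSA": 0.75, "CASSC": 0.25},
})
```

Each entry in the result corresponds to one line in the plot and one row in the table.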
Literature Search
Users can search our database of CDR3 sequences mined
from the literature by inputting their protein sequence in
FASTA format. Alternatively, users can access this page
from the amino acid detail view. Any resulting hits to the
literature-mined database are displayed with query-sequence alignments and links to the original publication.
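As an illustration of the input handling, the sketch below parses a FASTA-formatted protein query and looks it up in a small in-memory table. Plain substring matching stands in for the alignment step, which is not described in detail here, and the table entries are placeholders, not records from the actual database.

```python
def parse_fasta(text):
    """Parse FASTA text into a list of (header, sequence) tuples."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:].strip(), []
        elif header is not None:
            chunks.append(line.upper())
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def search_literature(query_fasta, literature):
    """Return (query header, CDR3, citation) for each substring match.

    literature: dict mapping CDR3 amino acid sequence -> citation string.
    Substring matching is a stand-in for a real alignment.
    """
    hits = []
    for header, sequence in parse_fasta(query_fasta):
        for cdr3, citation in literature.items():
            if sequence and (cdr3 in sequence or sequence in cdr3):
                hits.append((header, cdr3, citation))
    return hits


# Placeholder entry, not real database content.
toy_literature = {"CASSLGQGNTEAFF": "placeholder citation A"}
hits = search_literature(">q1\nCASSLGQG\nNTEAFF\n>q2\nCASSXXXX\n", toy_literature)
```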
Help View
The help view contains the biological background necessary
to understand and utilize the browser. The section covers the
basic adaptive immune response, the VDJ recombination
mechanism and clonal expansion. In addition, the help section also covers the technical
Local Search
The local search is a unified search that queries samples,
recombinations and amino acids. Users can search for exact
matches to samples, recombinations or amino acid sequences.
Other views: Clonotype, Recombination and Amino Acid
Detail views
In addition to interactive sample comparison views, detailed
views exist for individual clonotypes, recombinations and
amino acids. These views can be accessed via the “view all
clonotypes” summary table, or from the shared amino acid
plot in the comparison view. These views display the total
number of reads represented in a clonotype, the V, D and J
gene segments used in a recombination as well as the final
coding amino acid sequence (if applicable).
3 IMPLEMENTATION
The Immunobrowser is built using standard web development platforms. Django is used for the backend, with
MySQL used for data storage. The front end is built using
jQuery and Twitter Bootstrap. Interactive visualizations in
the comparison view are built using d3.js.
Django follows a Model-View-Controller (MVC) software
design pattern. In the Immunobrowser, models were designed to reflect real life. Five key models were designed to
represent data generated by TCR sequencing experiments. A
patient model stores gender, age and disease status. Each
patient can be associated with one or more samples. Each sample represents a single sequencing experiment
and can store blood draw date, cell type and other descriptions
in a note field. Each sample model contains many
clonotypes. A clonotype stores the observed number of sequencing reads and the frequency of that clonotype within a
sample. Each clonotype model contains a foreign
reference to exactly one genetic recombination. A recombination model stores the called V, D and J gene segments along
with recombination parameters such as the number of insertions
or deletions between gene segments. All recombinations
also contain a foreign reference to a single amino acid object. The amino acid model stores only an amino acid sequence. The models were designed to reflect reality; thus
multiple recombinations can generate the same amino acid
and recombinations can be shared across samples.
In addition to the models which represent sample data, there
are also models for the literature database, which contain
information on the amino acid sequence and details from the
article from which it was extracted.
All plots for the comparison views were generated using
d3.js. The plots are designed to be reusable and can accommodate any data in the same format. AJAX calls retrieve JSON
objects from the Django backend. Thus external services
can also utilize the same API to retrieve JSON data from the
Immunobrowser database.
Currently the Immunobrowser supports data generated using
the miTCR software along with the adaptiveTCR commercial services.
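The relationships among the five sample-data models described in the Implementation section can be sketched with plain Python dataclasses, where an object reference stands in for a Django foreign key. All class and field names here (for example `vd_insertions` and `dj_insertions`) are assumptions chosen for illustration, not the browser's actual schema, and the JSON layout is likewise hypothetical.

```python
import json
from dataclasses import dataclass


@dataclass
class Patient:
    gender: str
    age: int
    disease_status: str


@dataclass
class Sample:
    patient: Patient   # a patient can have one or more samples
    draw_date: str
    cell_type: str
    note: str = ""


@dataclass
class AminoAcid:
    sequence: str      # stores only the amino acid sequence


@dataclass
class Recombination:
    v_gene: str
    d_gene: str
    j_gene: str
    vd_insertions: int  # hypothetical recombination parameters
    dj_insertions: int
    amino_acid: AminoAcid  # several recombinations may share one AminoAcid


@dataclass
class Clonotype:
    sample: Sample
    recombination: Recombination  # may be shared across samples
    read_count: int
    frequency: float


def clonotype_to_json(clone):
    """Serialize one clonotype to a JSON string (hypothetical layout)."""
    rec = clone.recombination
    return json.dumps({
        "v_gene": rec.v_gene,
        "d_gene": rec.d_gene,
        "j_gene": rec.j_gene,
        "amino_acid": rec.amino_acid.sequence,
        "read_count": clone.read_count,
        "frequency": clone.frequency,
    }, sort_keys=True)


# Toy records: one recombination observed in two samples from one patient.
patient = Patient("F", 42, "healthy")
before = Sample(patient, "2013-01-01", "PBMC")
after = Sample(patient, "2013-02-01", "PBMC", note="follow-up draw")
rec = Recombination("TRBV7-9", "TRBD1", "TRBJ2-1", 2, 3, AminoAcid("CASSLGQGNTEAFF"))
first = Clonotype(before, rec, read_count=120, frequency=0.25)
second = Clonotype(after, rec, read_count=480, frequency=0.5)
payload = clonotype_to_json(first)
```

Keeping the recombination and amino acid as separate objects is what lets the same receptor be tracked across samples: both clonotypes above point at a single recombination record.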
REFERENCES
Benichou J, Ben-Hamo R, Louzoun Y, Efroni S. 2011. Rep-Seq:
Uncovering the Immunological Repertoire through Next
Generation Sequencing. Immunology 135(3): 183-191.
Bolotin DA, Mamedov IZ, Britanova OV. 2012. Next generation
sequencing for TCR repertoire profiling: Platform-specific
features and correction algorithms. European
Journal of ….
Bolotin DA, Shugay M, Mamedov IZ, Putintseva EV,
Turchaninova MA, Zvyagin IV, Britanova OV,
Chudakov DM. 2013. MiTCR: software for T-cell
receptor sequencing data analysis. Nature methods 10(9):
813-814.
Boyd SD, Marshall EL, Merker JD, Maniar JM, Zhang LN, Sahaf
B, Jones CD, Simen BB, Hanczaruk B, Nguyen KD et al.
2009. Measurement and clinical monitoring of human
lymphocyte clonality by massively parallel VDJ
pyrosequencing. Science translational medicine 1(12):
12ra23.
Freeman JD, Warren RL, Webb JR, Nelson BH, Holt RA. 2009.
Profiling the T-cell receptor beta-chain repertoire by
massively parallel sequencing. Genome research 19(10):
1817-1824.
Klarenbeek P, Tak P, van Schaik B. 2010. Human T-cell memory
consists mainly of unexpanded clones. Immunology
letters.
Klarenbeek PL, de Hair MJH, Doorenspleet ME, van Schaik BDC,
Esveldt REE, van de Sande MGH, Cantaert T, Gerlag
DM, Baeten D, van Kampen AHC et al. 2012. Inflamed
target tissue provides a specific niche for highly
expanded T-cell clones in early human autoimmune
disease. Annals of the rheumatic diseases.
Robins H. 2013. Immunosequencing: applications of immune
repertoire deep sequencing. Current opinion in
immunology.
Robins H, Desmarais C, Matthis J, Livingston R, Andriesen J,
Reijonen H, Carlson C, Nepom G, Yee C, Cerosaletti K.
2011. Ultra-sensitive detection of rare T cell clones.
Journal of immunological methods.
Robins HS, Campregher PV, Srivastava SK, Wacher A, Turtle CJ,
Kahsai O, Riddell SR, Warren EH, Carlson CS. 2009.
Comprehensive assessment of T-cell receptor beta-chain
diversity in alphabeta T cells. Blood 114(19): 4099-4107.
Robins HS, Srivastava SK, Campregher PV, Turtle CJ, Andriesen
J, Riddell SR, Carlson CS, Warren EH. 2010. Overlap
and effective size of the human CD8+ T cell receptor
repertoire. Science translational medicine 2(47): 47ra64.
Thomas N, Heather J, Ndifon W, Shawe-Taylor J, Chain B. 2013.
Decombinator: a tool for fast, efficient gene assignment
in T-cell receptor sequences using a finite state machine.
Bioinformatics (Oxford, England) 29(5): 542-550.
van Heijst JWJ, Ceberio I, Lipuma LB, Samilo DW, Wasilewski
GD, Gonzales AMR, Nieves JL, van den Brink MRM,
Perales MA, Pamer EG. 2013. Quantitative assessment
of T cell repertoire recovery after hematopoietic stem
cell transplantation. Nature Medicine 19(3): 372-377.
Venturi V, Ng P, Ende Z, McIntosh T. 2011. A Mechanism for
TCR Sharing between T Cell Subsets and Individuals
Revealed by Pyrosequencing. The Journal of ….
Wang C, Sanders CM, Yang Q, Schroeder HW, Wang E,
Babrzadeh F, Gharizadeh B, Myers RM, Hudson JR,
Davis RW et al. 2010. High throughput sequencing
reveals a complex pattern of dynamic interrelationships
among human T cell subsets. Proceedings of the
National Academy of Sciences of the United States of
America 107(4): 1518-1523.
Warren RL, Freeman JD, Zeng T, Choe G, Munro S, Moore R,
Webb JR, Holt RA. 2011. Exhaustive T-cell repertoire
sequencing of human peripheral blood samples reveals
signatures of antigen selection and a directly measured
repertoire size of at least 1 million clonotypes. Genome
research 21(5): 790-797.
Warren RL, Nelson BH, Holt RA. 2009. Profiling model T-cell
metagenomes with short reads. Bioinformatics (Oxford,
England) 25(4): 458-464.
Wu D, Sherwood A, Fromm JR, Winter SS, Dunsmore KP, Loh
ML, Greisman HA, Sabath DE, Wood BL, Robins H.
2012. High-throughput sequencing detects minimal
residual disease in acute T lymphoblastic leukemia.
Science translational medicine 4(134): 134ra163.
Figure 1. Scatterplots show the V-J gene segment usage in three samples, where red, green and blue circles
represent individual samples. The area of each circle is proportional to the V-J junctional usage frequency. Histograms sit to the right of and below the scatterplot and
show marginal frequencies of individual V and J gene
segments. a. The default scatterplot view. b. When an
axis label is highlighted, the V usage histogram
below the main scatterplot changes to reflect the V
usage only in the TRBJ2-1 (bold) gene segment. c.
Individual samples can be highlighted by mousing
over an individual sample. d. Individual V-J gene
segment pairs can be interrogated by placing the mouse cursor
over a circle. The browser automatically displays the numerical frequencies observed for each sample.