Communication between Layers in Biological Transcriptional Networks Alex Tsankov

advertisement
Communication between Layers in Biological
Transcriptional Networks
by
Alex Tsankov
Submitted to the Department of Electrical Engineering and Computer
Science
in partial fulfillment of the requirements for the degree of
Master of Science in Computer Science and Engineering
at the
MASSACHUSETTS INSTITUTE OF TECHNOLOGY
May 2005
c Alex Tsankov. All rights reserved.
The author hereby grants to MIT permission to reproduce and distribute
publicly paper and electronic copies of this thesis document in whole or in
part.
Author . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Department of Electrical Engineering and Computer Science
May 19, 2005
Certified by . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Moe Win
Associate Professor, Massachusetts Institute of Technology
Thesis Supervisor
Certified by . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Pamela Silver
Professor, Harvard Medical School
Thesis Supervisor
Accepted by . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Arthur C. Smith
Chairman, Department Committee on Graduate Students
2
Communication between Layers in Biological Transcriptional
Networks
by
Alex Tsankov
Submitted to the Department of Electrical Engineering and Computer Science
on May 19, 2005, in partial fulfillment of the
requirements for the degree of
Master of Science in Computer Science and Engineering
Abstract
Chromatin-immunoprecipitation experiments in combination with microarrays (known as
ChIP-chip) have recently allowed biologists to map where proteins bind in the yeast genome.
The combinatorial binding of different proteins at or near a gene controls the transcription
(copying) of a gene and the production of the functional RNA or protein that the gene encodes. Therefore, ChIP-chip data provides powerful insight on how genes and gene products
(i.e., proteins, RNA) interact and regulate one another in the underlying network of the cell.
Much of the current work in modeling yeast transcriptional networks focuses on the regulatory effect of a class of proteins known as transcription factors (TF). However, other sets
of factors also influence transcription, including histone modifications and states (HS), histone modifiers (HM) and remodelers, nuclear processing (NP), and nuclear transport (NT)
proteins. In order to gain a holistic understanding of the non-linear process of transcription,
our work examines the communication between all five forementioned classes (or layers) of
regulators. We use vastly available rich-media ChIP-chip data for various proteins within
the five classes to model a multi-layered transcriptional network of the yeast species Saccharomyces cerevisiae. Following the introduction in Chapter 1, Chapter 2 describes the
non-trivial process of incorporating the different sources of data into a coherent set and normalizing the heterogeneous data to improve biological accuracy. Using the normalized data,
Chapter 3 finds biologically meaningful pairwise statistics between proteins, including filtered correlation coefficient, and mutual information p-values. It then combines the p-values
of the two complementary approaches in order to increase the reliability of our predictions.
Chapter 4 uncovers group-wise relationships between proteins using a novel semi-supervised
clustering algorithm that preserves information about elements of a cluster in order to better
capture group-wise dependencies. Throughout the theoretical analysis, we confirm various
known biological processes and uncover several novel hypotheses. Based on the developed
methodology, Chapter 5 builds a multi-layered transcriptional network and quantifies the
communication between levels in biological transcriptional networks.
3
Thesis Supervisor: Moe Win
Title: Associate Professor, Massachusetts Institute of Technology
Thesis Supervisor: Pamela Silver
Title: Professor, Harvard Medical School
4
Acknowledgments
The fruition of this project would not have been possible without the cumulative contributions of many people. I first want to express my gratitude for my advisor, Moe Win. You
are the most thoughtful professor I have ever me, and a boss I can truly call a friend. Thank
you for your trust and loyal support during both happy times and tough moments at MIT.
Thank you for being open-minded about my research direction and for your encouragements
as I first started learning biology. I also want to acknowledge Ae Suwansantisuk and Hyundong Shin for their valuable contributions in proofreading the final document. In addition,
I want to thank Professor Jaakola, Sourav Dey, and Obrad Scepanovic for their insights
during Machine Learning class.
This project required the bridging of two disciplines and would not have even taken place
without all the help form Pam Silver’s lab at Harvard Medical School (HMS). Thank you
Pam for bringing me into the family, granting me computing resources, and for taking a
chance on someone who knew nearly nothing about biology nine months ago. Mike, thank
you for your patience and experimental validations, without which the project would be
incomplete. And finally, I want to especially thank Jason Casolari, who’s creative insight
inspired this undertaking. Thank you for always being honest with me, even during times
when everything was foreign to me. Thank you for being my mentor in this new field,
for explaining what YPD meant every Sunday afternooon, and for guiding me with your
ingenious insight and biological intuition–I will never forget it.
In the end, I want to express my eternal gratitude towards my family. Thank you Dona,
for reaching out to me as your younger brother and for understanding me and accepting me
with all my flaws. Fetzi, thank you for becoming my best friend at a time when I had no
other friends. And Sparti, thank you for your unconditional love and care from the day I
was born. You are all a large part of who I am.
5
6
Contents
1 Introduction
13
1.1
Regulation of Transcription . . . . . . . . . . . . . . . . . . . . . . . . . . .
13
1.2
ChIP-chip . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
16
2 Data Preparation
19
2.1
Data Integration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
19
2.2
Data Normalization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
22
2.2.1
Finding Missing P -values . . . . . . . . . . . . . . . . . . . . . . . . .
23
2.2.2
Evaluating Data Normalization: Correlation Analysis . . . . . . . . .
27
3 Pairwise Statistics
3.1
3.2
3.3
3.4
31
Filtered Correlation Coefficient . . . . . . . . . . . . . . . . . . . . . . . . .
31
3.1.1
Filtered Correlation Coefficient P -values . . . . . . . . . . . . . . . .
34
Mutual Information . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
36
3.2.1
Mutual Information P -values . . . . . . . . . . . . . . . . . . . . . .
39
3.2.2
Evaluating Data Classification: Mutual Information Analysis . . . . .
41
Combining P -values . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
43
3.3.1
Biological Predictions using Combined P -values . . . . . . . . . . . .
45
Minimized Mutual Information P -value . . . . . . . . . . . . . . . . . . . . .
49
4 Group-wise Relationships: PCA and Clustering
4.1
Principal Components Analysis . . . . . . . . . . . . . . . . . . . . . . . . .
7
51
51
4.2
Clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
54
4.2.1
K-means Clustering . . . . . . . . . . . . . . . . . . . . . . . . . . .
54
4.2.2
Hierarchical Clustering . . . . . . . . . . . . . . . . . . . . . . . . . .
55
4.2.3
Semi-Supervised Clustering . . . . . . . . . . . . . . . . . . . . . . .
57
4.2.4
Biological Validation of Semi-Supervised Clustering . . . . . . . . . .
59
4.2.5
Active Processes and SIR2 . . . . . . . . . . . . . . . . . . . . . . . .
62
5 Network of the Nucleus
5.1
5.2
67
Network Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
67
5.1.1
Assigning Links: Second Pass . . . . . . . . . . . . . . . . . . . . . .
71
5.1.2
Network Comparison . . . . . . . . . . . . . . . . . . . . . . . . . . .
74
Analysis of Network Topology . . . . . . . . . . . . . . . . . . . . . . . . . .
75
6 Conclusion
81
A Semi-Supervised Clustering based on KL-divergence
83
8
List of Figures
1-1 Layers in the yeast transcriptional network.
. . . . . . . . . . . . . . . . . .
15
2-1 A histogram of the binding profile of protein Polymerase III. . . . . . . . . .
24
2-2 Correlation coefficient distance matrix using the BR, N &P R, P V , and SD
data representations of a test set of proteins. . . . . . . . . . . . . . . . . . .
28
3-1 Venn diagram for the subsets of genes bound by proteins POL3 and TBP. . .
39
3-2 Filtered correlation coefficient, mutual information, and combined p-values for
a test set of proteins. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
47
4-1 Casolari et al. model for nuclear transport and validation using Principal
Component Analysis. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
53
4-2 Potential pitfall in using hierarchical clustering. . . . . . . . . . . . . . . . .
57
4-3 Confirmation of known biological processes and new insight using semi-supervised
clustering. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
60
4-4 Active biological processes and new insight using semi-supervised clustering.
65
5-1 Multi-layered transcriptional network of highly significant interactions. . . .
73
A-1 Semi-supervised clustering of all nuclear transport and nuclear processing factors binding profiles using the KL distance metric. . . . . . . . . . . . . . . .
9
86
10
List of Tables
2.1
Total correlation coefficient distance at likely and unlikely interactions using
different data representations. . . . . . . . . . . . . . . . . . . . . . . . . . .
3.1
Mutual information evaluation at probable and unlikely interactions using
different data representations. . . . . . . . . . . . . . . . . . . . . . . . . . .
5.1
74
Total links and percentage of links realized within and between the five layers
of the entire and the positive edge transcriptional network. . . . . . . . . . .
5.3
42
Comparison between various protein-protein interaction data sets and our
network. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
5.2
29
76
Network topology statistics for the entire and the positive edge transcriptional
network. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
11
78
12
Chapter 1
Introduction
Intricate communication between genes regulate a complicated network within and between
millions of cells that make our body function. The recent emergence of novel, data-rich experimental techniques that can simultaneously monitor the behavior of all genes within a cell
has enabled computational biologists to study gene networks quantitatively. Understanding
the process of transcription, or how genes are turned “on” and copied into more functional
forms, is central to modeling gene regulatory networks.
1.1
Regulation of Transcription
Ordered sequences of deoxyribonucleic acid (DNA) base pairs, protected within the nucleus
of each cell, encode the unique hereditary information of each individual. Genes are segments
of DNA located on different chromosomes (large bundles of continuous DNA) that encode a
unique product with a specific cellular function. During transcription, a gene is copied (or
expressed) to a more mobile strand of information, known as RNA. For most genes, their
RNA products are mere messengers (or mRNA) of the DNA code. The mRNA transports
the DNA’s information outside of the nucleus where it is translated into proteins, another
even more functional string composed of amino acid molecules. The final product of some
genes, however, is the RNA itself. Such strands execute specific tasks within the body in the
13
same manner as proteins. The repertoire of RNAs and proteins expressed from “on” genes
in each cell allows for the myriad of functions necessary for cell survival and, on a larger
scale, for the many different cell types found in complex organisms.
The transcription of genes is regulated by various RNAs and proteins, or products of other
genes. Hence, genes regulate each other’s activity through their respective products. Several
classes of proteins or factors, which we refer to as layers or levels, collaborate to regulate the
process of transcription. A set of proteins, known as transcription factors (TF), can enhance
or suppress how actively a gene g works to produce its corresponding protein by binding
to the DNA preceding g, or the promoter of g. Nuclear transport factors (NTs) represent
another layer of proteins that affects transcription by controling the flux of molecules (such
as TFs, mRNAs, etc) going in and out of the nucleus. Nuclear processing factors (NPs) also
control transcription by executing functions within the nucleus, such as the actual copying of
DNA to RNA or post-transcriptional processing of mRNAs. Chromatin, a macromolecular
complex consisting of DNA and proteins that is found in eukaryotes (nucleus-containing
organisms), also has profound affects on transcription. Chromatin is organized into packed,
inaccessible and unpacked, accessible regions of DNA by structural proteins such as histones;
therefore, histones have the potential to alter the expression state of genes. A fourth class
of proteins that regulate transcription consists of histone modifiers (HMs), which change
the accessibility of a gene’s DNA by adding or removing acetyl, methyl, or other molecular
groups on specific histone amino acids sites. We also include nucleosome remodelers in this
category, or protein complexes that displace and reposition macromolecules of eight histones
called nucleosomes. And finally, the actual histone states (HSs) in terms of the acetylation
and methylation levels on specific histone amino acids, form yet another layer of control in
transcription. For example, some proteins bind to specific patterns of histone acetylation
levels in order to regulate the nearby DNA.
The protein classes described above each contribute to the expression of a gene in unique
and situation-specific manners. Indeed, large research fields are devoted to the study of
each of the described gene/protein classes and how particular stimuli are able to alter the
14
transcriptional output of a gene. In order to better understand the non-linear process of
transcription, it is necessary to examine the genome-wide interplay of all of these transcriptional regulators both within particular fields, such as how modification of a histone at one
amino acid affects modification at other amino acids, and between fields of study, such as
the relationship between histone modification and recruitment of transcription factors to
particular genes. The budding yeast Saccharomyces cerevisiae offers a unique opportunity
to achieve this essential next step in understanding transcriptional control as it has a fully
sequenced and annotated genome as well as an extraordinary body of literature regarding
functions for many of these genes and their resulting RNAs or proteins. Figure 1 summarizes the different layers of organization known to regulate the transcriptional network in
Saccharomyces cerevisiae which we will model. In order to bettter understand the non-linear
process of transcription, we need information about where different classes of proteins bind
to in the genome.
Figure 1-1: The five layers in the yeast transcriptional network that we aim to model. Twosided arrows represent undirected functional connections between components within a level
as well as interplay of components between layers of organization.
15
1.2
ChIP-chip
A biochemical technique known as chromatin immunoprecipitation (ChIP) allows for the
chemical tethering of particular proteins to chromatin. The recent advent of ChIP in combination with microarrays (known as ChIP-chip or genomic location analysis) allows biologists
to adapt this technique to a genome-wide scale, identifying the DNA binding sites for particular proteins across the entire genome [10]. Each ChIP-chip experiment begins by harvesting
a yeast strain in a carefully prepared solution simulating a specific environmental condition.
Most ChIP-chip experiments use the standard YPD (Yeast extract Peptone Dextrose) solution containing the sugar dextrose, in order to simulate rich media conditions where the
yeast population grows exponentially with time. Next, addition of formaldehyde crosslinks
or “fixes” any proteins bound to DNA inside the normal growing cell, or in vivo. The inner
contents of the cell are then extracted and the strands of DNA along with the bound proteins
are sheared into fragments by sonication. Using an anti-body that binds specifically to the
protein i of interest, one can isolate all DNA fragments bound to the designated protein i by
precipitation (called immunoprecipitation). The DNA-(protein i) crosslink is then reversed
in order to release the DNA and digest the protein. The free DNA, previously bound by
protein i, is then amplified by Polymerase Chain Reaction (PCR) and labeled with a fluorescent dye (Cy5). In parallel, a sample of sheared DNA is taken as a control and is not
enriched for binding with protein i using immunoprecipitation. This sample is also amplified
using PCR and labeled using a different fluorescent dye (Cy3). Ultimately, the ratio of Cy5
and Cy3 fluorescence intensities at each gene will allow one to infer protein i’s predilection
for binding to g.
Microarray technology when coupled with ChIP enables biologists to simultaneously measure the binding preference of protein i at all genes in the genome relative to other genes in
just one experiment. One first needs to prepare a microarray matrix of wells, where each well
g contains a probe specific to gene g (for example the complementary DNA of gene g) . Next,
each microarray well is filled with a sample from both the pool of sheared DNA enriched
(Cy5) and unenriched (Cy3) for protein i. Inside each well g, the differently dyed fragments
16
originally sheared from gene g will competitively bind to the complementary DNA specific
to g during the process of hybridization. The array of wells can then be scanned using a
laser that detects the intensities of the fluorescent red (Cy5) and green (Cy3) dye [39]. The
ratio of intensities (Cy5/Cy3), also known as a binding ratio, is proportional to the ratio of
the number of gene g fragments bound by protein i to the number of g fragments randomly
resulting from the same process without immunoprecipitation. Each ChIP-chip experiment
is often repeated three or more times in order to reduce the inherent experimental noise
and the resulting binding ratios from each experiment are usually combined using weighted
averaging.
ChIP-chip provides powerful, in vivo measurements of how genes and gene products (i.e.
proteins, RNA) interact and regulate one another in the complex underlying network of a
cell. Much of the current work in modeling yeast networks focuses on the regulatory effect of
TFs [2]. However, recent publications [4,6,7,13–20] show that other sets of proteins (including
HMs, HSs, NPs, and NTs) also control gene expression. Hence, transcription has several
levels of organization. The next four chapters use ChIP-chip data to infer the underlying
communication between different layers in the transcriptional network of the yeast species
Saccharomyces cerevisiae.
Following the introduction in Chapter 1, four chapters build up the theory behind our
transcriptional network model. Chapter 2 describes the preliminary data preparation on
which we will base all future analysis. It first describes the integration of the different
ChIP-chip data sets into one coherent set and then explores the crucial issue of binding
data normalization and representation. Using a normalized data set, Chapter 3 finds biologically meaningful pairwise statistics between binding profiles of proteins, including filtered
correlation coefficient, mutual information, and combined p-values. The next chapter uncovers group-wise binding relationships between proteins using Principal Component Analysis
(PCA) and clustering. Based on the measures developed in Chapter 3, we introduce a novel
semi-supervised clustering algorithm that preserves information about elements of a cluster in order to better capture group-wise dependencies between proteins. Throughout the
17
theoretical analysis, we confirm various known biological processes and predict several novel
hypotheses. And finally, Chapter 5 combines the methodology developed in the previous
chapters in order to build a multi-layered transcriptional network of the nucleus. To the
best of our knowledge, our finalized network is the first attempt in the literature to quantify
the communication between layers in biological transcriptional networks.
18
Chapter 2
Data Preparation
This chapter explores the extremely important issue of data preparation. Our data comes
from several different sources and in various forms. The first section explains the nontrivial
procedure for integrating all the different sources of data into one coherent set. The second
part of the chapter attempts to reconcile the discrepancies between ChIP-chip experiments
done in different labs by normalizing the data.
2.1
Data Integration
Integrating the publicly available data sets proved to be a tedious but non-trivial part of
the project. We downloaded published genome-wide binding data for several sets of proteins
that may be involved in regulation of genes, including T F s [1, 17, 21], N T s [13, 14], N P s
[8, 14, 16, 23, 24, 28], HM s [4, 6, 7, 15, 20, 21, 25], and HSs [17, 18, 26, 27]. All of the ChIPchip experiments were performed on yeast strains grown in rich media conditions, where the
sugar dextrose and other nutrients are readily present so that the yeast can quickly multiply.
Since some genes only become active during specific environmental conditions, we would
ideally want ChIP-chip binding data for yeast grown in various media. However, from our
experience and as evidenced in [7], binding relationships in biological processes that are not
turned “on” during rich media conditions can often still be detected using our ChIP-chip
19
data. For example, our analysis in Section 4.2.4 captures the collaborative effect of GAL3
and GAL80, two transcription factors that regulate genes involved in breakdown of the sugar
galactose, using ChIP-chip data for yeast grown in dextrose-rich, YPD medium. In addition,
the authors in [7] found that the RSC nucleosome remodeling complex associates with many
genes involved in nitrogen regulation and non-fermentative carbohydrate metabolism using
YPD ChIP-chip data. In order to explain the relationship fully, more experiments under
different conditions were necessary, but the initial prediction came from looking at rich
media ChIP-chip data. Thus, we felt that usage of the vastly more complete YPD datasets
would still allow for the formation of hypotheses reflecting other growth conditions.
The data sets further differed in the microarray probes used to detect binding relationships, where some experiments used probes that target the open reading frames (ORF), or
regions of genes that code for RNA, while other experiments used probes for intergenic regions, or the promoter or control regions of genes. We used a gene-centric approach, where
each relevant probe was assigned to its corresponding gene(s). For ORF arrays, we simply
assigned the ChIP-chip information at each ORF to the corresponding gene.
For intergenic arrays, we assigned each DNA probe (or fragment) to the gene that it
most likely regulates. Biologists commonly refer to the DNA region preceding the site where
transcription starts as the upstream intergenic region of a gene, and the region following
the site where transcription ends as the downstream intergenic region of a gene. In yeast,
each intergenic region can control zero genes (e.g., telomeres at the end of chromosomes,
intergenic regions at the downstream end of two genes), a single gene, or two genes (e.g.,
intergenic regions at the upstream end of two genes). Moreover, there may exist several
probes for a single intergenic region. For example, small genes that encode tRNAs (a class
of RNAs that have a role in protein production) are often contained within the intergenic
region upstream of a longer gene; therefore, tRNA gene probes also measure the binding
of factors that regulate the longer gene. To further complicate matters, for some genes it
is still not known which intergenic region controls them. Hence, we needed to develop a
many-to-many mapping from intergenic fragments to the genes they might control. The
20
mapping algorithm uses the union of intergenic probe-gene assignment pairs as defined by
several authors [1, 6, 8, 9, 15, 19, 21, 23, 24, 27], where we included assignment pairs from at
least one author from each different lab when it was available. Moreover, when two or more
intergenic fragments mapped to the same gene, the probe that contains the most amount of
information was chosen. Since ChIP-chip experiments contain more information at the tails
of the binding distribution, we chose the most bound fragment for multiple probes that were
consistently bound and the least bound fragment for multiple probes that were consistently
not bound.
For each experiment, we obtained binding ratio (BR) data or the combination of BR
and p-value (PV) data. We first mapped the binding ratios and p-values to all annotated
genes for which data was available using the assignment algorithm described above. We then
integrated the data into a single BR matrix and a single P V matrix, where rows represent
factors, columns represent unique genes and entries represent the data. For example, the
(i, g)th entry of the matrix P V , P Vi,g , represents the p-value of factor i binding to gene g.
Missing data was annotated with NaNs.
As previously explained, a binding ratio for protein i and gene g represents a weighted
average of ratios of the number of immunoprecipitated DNA fragments from g enriched
with protein i to the number of control (unenriched) DNA fragments from g that occur at
random. The p-value for the binding ratio of protein i at gene g measures the probability of
erroneously deciding that protein i binds to g when the null hypothesis (i does not bind to
g) is in fact true. Hence, small p-values correspond to large binding ratios. The error model
used for calculating the p-values varies as chosen by the authors of each paper. Since we
are integrating binding data from various papers that use different ChIP-chip experimental
protocols and different error models, we needed to derive a normalized representation of the
binding data in order to compare binding profiles across papers.
21
2.2
Data Normalization
To normalize the binding data, we first sorted the protein-gene binding interactions for each
protein (i.e., row-wise in our matrices) from most bound to least bound, as defined by the
authors of each paper. We then mapped the sorted binding data to uniformly spaced discrete
points in the interval [0, 1], where 1 and 0 correspond to the most and least bound genes
for each protein, respectively. Hence we transformed the information in the BR and P V
matrices to a percentile rank P R matrix of the same dimensions. For example, P Ri,g = 0.95
means that gene g is in the top 5% of genes most bound by protein i. Moreover, since
each protein binds to a varying number of genes, each author usually defines a strength
of binding threshold, above which all protein-gene interactions are classified as bound. We
further normalized the P R matrix by subtracting the strength of binding threshold from
each entry, making all interactions classified as bound positive and all interactions classified
as unbound negative. We called this matrix the normalized N data matrix. By thresholding
N at 0, we also easily derived a bound/unbound B data representation matrix, where all
bound interactions were mapped to 1 and all unbound interactions to 0.
Each data representation has inherent advantages and disadvantages that may prove
more biologically useful or less informative depending on the nature of the analysis. The
bound/unbound B representation of the data seems most natural for representing the presence or absence of a biological interaction. However, ChIP-chip data contains significant
experimental noise that varies from lab to lab and protein to protein; therefore, the classified bound/unbound data contains a large number of false positives and false negatives. The
data set from [1] estimates the false positive rate to be around 4-6% and the false negative
rate around 24%, for a binding threshold at p-value = 0.001. The percentile rank P R matrix removes discrepancies between the different averaging techniques used to derive binding
ratios BR and the different error models used to calculate p-values P V . The P R matrix
measures the pairwise binding strengths of gene-protein interactions consistently across data
from different papers; however, it contains no information about which percentile determines significant binding for a given factor and the number of gene targets vary greatly for
22
different proteins. The normalized N matrix improves the P R matrix by subtracting the
percentile threshold that each author used to classify interactions as bound and unbound.
This achieves a soft, unclassified representation than the B matrix, where the more negative
or positive an entry in the N matrix is, the more confident one can be that the gene-protein
interaction is unbound or bound, respectively. While the P R and the N matrices measure
binding strength relative to other interactions, they fail to capture the absolute strength of
binding. To remedy this last shortcoming, we derive the SD matrix in the next section by
finding the missing p-values in our data sets.
2.2.1
Finding Missing P -values
We consider the P V matrix as the most reliable source of information about making decisions
on the presence or absence of a binding interaction. Further, nearly all of data sets with pvalues base their calculation on adaptations of the single array error model [11]. Specifically,
the data sets from [13, 14, 16] use a two-sided model, which can easily be converted to the
equivalent right-sided single array model by the following conversion:
1
BRi,g > 1 : P Vi,g|1−sided = P Vi,g|2−sided
2
1
BRi,g ≤ 1 : P Vi,g|1−sided = 1 − P Vi,g|2−sided .
2
(2.1)
(2.2)
Reconciling these discrepancies provides reliable, consistent p-values for two thirds of the
proteins in the data. We need to find the missing p-values for the other third based on
the available BR information in order to obtain a complete P V matrix. Figure 2-1 shows
the binding disribution of protein Polymerase III (POL3) across the natural logarithm of
its binding ratios. The figure reveals that the distribution of binding ratios consists of two
heterogeneous classes: i) binding ratios measured at genes that are not bound by POL3
and ii) binding ratios at genes that are bound by POL3, congregated at the right tail of
the distribution. We need to model the distribution of the unbound genes in order to find
23
appropriate p-values.
Figure 2-1: A histogram of the binding profile of protein Polymerase III: the number of genes
bound by Pol III versus the natural logarithm of the binding ratios.
Mathematically, let Xi,g denote random variables describing the natural logarithm of the
binding ratios of protein i (POL3 in our example) at genes g and lets assume that the binding
of i at each gene g, or Xi,g , is independent and identically distributed (i.i.d.). Since there is
inherent noise in ChIP-chip experiments, it makes sense to use random variables to model
our data. Following the convention of using capital letters for random variables and lowercase
letters for their realizations, we use xi,g to denote the measured outcomes of random variable
Xi,g at each gene g. Since random variables Xi,g are identically distributed at all genes g, let
Xi denote the underlying binding tendency of protein i that we want to describe. Therefore,
Figure 2-1 shows a representative, empirical example of the estimated distribution of Xi , or
P̂Xi (xi ). In order to extract p-values, we want to estimate the distribution of the natural
logarithm of binding ratios at all genes g under the null hypothesis H0 that protein i does not
bind to g, or the noise distribution P̂Xi |H0 (xi |H0 ). ChIP-chip experiments introduce various,
independent sources of multiplicative noise, which corresponds to several independent sources
of additive noise in the logarithm domain. Therefore, the Central Limit Theorem makes it
24
reasonable to assume that the distribution of the logarithm of binding ratios for the class
of unbound genes is Gaussian [39]. Since logarithm of binding ratios in our data result
from weighted averages of logarithm of binding ratios from single independent ChIP-chip
experiments and since linear combinations of independent Gaussian random variables is also
Gaussian, we can reasonably model PXi |H0 (xi |H0 ) using a Gaussian distribution.
The Gaussian noise distribution PXi |H0 (xi |H0 ) is uniquely defined by its mean and variance. We use the peak or mode of the overall distribution P̂Xi (xi ) (e.g., the peak at
log(BR) = 0 in Figure 2-1) to estimate the mean of the noise distribution, µ̂Xi |H0 . This
follows from the fact that the alternative hypothesis that protein i binds to gene g, H1 , is
is significantly less likely than the null hypothesis (on average, about 4% of gene targets are
classified as bound, or Pr(H1 ) ≈ 0.04 and Pr(H0 ) ≈ 0.96). Since the overall distribution
decomposes as follows,
PXi (xi ) = Pr(H0 )PXi |H0 (xi |H0 ) + Pr(H1 )PXi |H1 (xi |H1 )
(2.3)
≈ 0.96PXi |H0 (xi |H0 ) + 0.04PXi |H1 (xi |H1 ) ,
the peak of the unbound distribution should roughly correspond to the observed mode of the
overall distribution P̂Xi (xi ). Moreover, since the noise distribution is Gaussian, the mostly
unaffected peak of PXi |H0 (xi |H0 ) corresponds to the mean µ̂Xi |H0 . Hence, we can write
µ̂Xi |H0 = modeg∈Gi (xi,g ) ,
(2.4)
where Gi is the set of all genes with measured binding information for protein i. In order
to estimate the variance of the noise distribution, σˆ2 Xi |H0 , we consider genes with a binding
ratio smaller than µ̂Xi |H0 to be extremely unlikely targets of protein i. Hence, we use Ui to
denote the set of genes to the left of the mode of P̂Xi (xi ), or the genes the make up the left
side of the unbound distribution (the distribution to the left of the peak at log(BR) = 0
in Figure 2-1). Due to the symmetry of Gaussian distributions, we estimate the variance
25
of PXi |H0 (xi |H0 ) only using the left side of the noise distribution. The maximum likelihood
estimator for variance of Gaussian random variables states that
1 X
σˆ2 Xi |H0 =
(xi,g − µ̂Xi |H0 )2 ,
|Ui | g∈U
(2.5)
i
where |Ui | denotes the number of elements in set Ui , or the cardinality of Ui . The p-value
P Vi,g corresponds to the probability that just as extreme or more extreme of an observation
could occur if we assume the null hypothesis H0 that the factor i does not bind to gene g.
To calculate the missing p-value for observation xi,g , or P Vi,g , we integrate our estimated
noise distribution over the interval [xi,g , ∞):
Z
∞
Z
∞
P̂Xi |H0 (xi |H0 )dxi =
P Vi,g =
xi,g
xi,g
1
q
exp[
2π σˆ2 Xi
−(xi − µ̂Xi )2
]dxi .
2σˆ2 X
(2.6)
i
After obtaining a complete P V matrix, we wanted to normalize the P V matrix so that we
can assume a nearly Gaussian binding distribution for entries across rows for in the following
analysis. Using the inverse of the Gaussian Q-function, we obtained the normalized SD
matrix, where each entry SDi,g represents the number of standard deviations of confidence
in rejecting the null hypothesis that protein i binds to gene g. Mathematically,
∞
−w2
1
√ exp[
]dw
2
2π
z
= Q−1 (P Vi,g ) .
Z
Q(z) =
(2.7)
SDi,g
(2.8)
Since each row in the SD matrix approximates a Gaussian distribution with zero mean and
unit variance, the SD matrix, just like the N matrix, fixes the problem in normalizing the
data set so that binding levels can be compared across different experiments. Moreover,
since the entries in the SD matrix represent standard deviation of confidence in rejecting
the null hypothesis, the SD matrix preserves the absolute strength of binding of gene-protein
interactions in a continuous manner, which the N matrix fails to capture. In the following
26
section we explore the advantages of the SD data matrix in relation to correlation analysis.
2.2.2
Evaluating Data Normalization: Correlation Analysis
We use the measure of correlation coefficient distance to gauge the performance of the various data representations in relation to correlation analysis. We define correlation coefficient
distance as one minus the correlation coefficient; therefore, a distance of 0 represents full
positive linear dependence and a distance of 1 denotes no linear relationship [?]. Figure 2-2
illustrates the concepts from the previous paragraph in Section 2.2.1. It shows four correlation coefficient distance matrices using the BR, N &P R, P V , and SD data representations of
a test set of proteins. Note that since the N and P R rows are simply shifted versions of one
another and since correlation analysis is invariant to scales and shifts, the two matrices produce an identical correlation coefficient distance matrix, which we denote as resulting from
N &P R. Two known biological processes are shown: RSC2, RSC3, RSC8, and STH1 are
all components of the RSC nucleosome remodeling complex [7] and MLP1, MLP2, NIC96,
and KAP95 all play role in nuclear transport [13]. We would expect the members of each
biological processes to share similar binding profiles. This is shown by small correlation distance values in the two squares along the diagonal. The BR matrix clearly performs worse
than the P V matrix in confirming the known relationships, corroborating the fact that enrichment ratios contain less reliable information than p-values. The N and P R matrices are
ranked version of the P V matrix, so the three, naturally, share very similar results. Since
correlation coefficient analysis assumes Gaussian binding profiles, the SD matrix improves
on the P V matrix even though (2.8) shows that the two are mere transformations of each
other. Moreover, the SD matrix outperforms the N matrix not only because its row vector
entries approximate a normal distribution, but also because it preserves information about
the absolute strength of binding.
The performance of the different data representation in correlation analysis also extends
to the whole data. We generated a test set of 1447 probable interactions, most of which were
confirmed or suggested by published literature. Therefore, the known relationships discussed
27
(a) BR correlation coefficient distance.
(b) N &P R correlation coefficient distance.
(c) P V correlation coefficient distance.
(d) SD correlation coefficient distance.
Figure 2-2: Correlation coefficient distance matrix using the (a) BR, (b) N &P R, (c) P V ,
and (d) SD data representations of a test set of proteins. We define correlation coefficient
distance as one minus the correlation coefficient; therefore, a distance of 0 represents full
positive linear dependence and a distance of 1 denotes no linear relationship [?]. Protein
names are listed on the x and y axes with an “o” or i at the end of the names signifying that
the data came from an ORF or intergenic array, respectively. Two known biological processes
are shown, namely RSC2-RSC3-RSC8-STH1 [7] and MLP1-MLP2-NIC96-KAP95 [13]. The
SD matrix performs the best in confirming the known relationships followed closely by the
P V and N &P R matrices, while the BR matrix clearly performs the worst.
28
Matrix Total Distance at Probable Interactions
BR
978.5
log(BR)
974.3
N &P R
945.8
PV
966.3
SD
879.7
Total Distance at Unlikely Interactions
5496.9
5415.7
5341.4
5763.9
5374.5
Table 2.1: Total correlation coefficient distance for a test set of likely and unlikely interactions. A smaller total distance at known interactions should correspond to a smaller false
negative rate for a given data representation, while a large total distance at unlikely interactions should lead to a smaller false positive rate. The SD matrix performs best in normalizing
the data since it has the lowest total distance at known interactions and a comparable total
distance to all the other matrices at unlikely interactions. The P V , N , and P R matrices do
slightly worse but still better than both the log(BR), and BR data representations.
in Figure 2-2 represent a subset of this test set. We also created a test set of 5926 extremely
unlikely interactions. Table 2.1 shows the computed total correlation coefficient distance for
a test set of likely and unlikely interactions. A smaller total distance at known interactions
should correspond to a smaller false negative rate for a given data representation, while a
large total distance at unlikely interactions should lead to a smaller false positive rate. The
SD matrix again performs best in normalizing the data since it has the lowest total distance
at known interactions and a comparable total distance to all the other matrices at unlikely
interactions. The P V , N , and P R matrices do slightly worse but still better than both the
log(BR), and BR data representations. Note that the log(BR) data representation has a
near Gaussian distribution for entries across rows. The fact that it slightly outperforms the
BR matrix shows that the Gaussian data distribution accounts for some but not all of the
improvement in normalizing the data. These results substantiate our decision to use the SD
matrix for the correlation analysis that follows. The following chapter utilizes the different
data representations in building a statistical method for detecting pairwise relationships
between proteins.
29
30
Chapter 3
Pairwise Statistics
With an integrated and normalized data set, this chapter tries to find pairwise binding relationships between two proteins. We introduce two biologically meaningful pairwise measures
of binding dependencies: filtered correlation coefficient and mutual information. We also find
consistent methods for evaluating the p-values of the two analyses. In the end, we combine
the p-values from the two complementary pairwise approaches in order to reduce the number
of false positive and false negative biological predictions and increase the reliability of our
overall analysis.
3.1
Filtered Correlation Coefficient
This section introduces the pairwise measure of filtered correlation coefficient, an extension of
the standard correlation analysis commonly encountered in biology. This technique extracts
the binding relationships between two factors with great accuracy, by isolating the analysis
on the pertinent dimensions in ChIP-chip data. To estimate the filtered correlation coefficient
between two proteins i and j, we use the SD matrix (as justified in Section 2.2.2). As in
Section 2.2.1, we consider binding of factors i and j at genes g, denoted as Xi,g and Xj,g ,
as i.i.d. random variables with measured outcomes xi,g and xj,g , respectively. Moreover, we
again denote the underlying binding tendency of the two factors i and j as random variables
31
Xi and Xj . Since the rows of the SD data representation closely resemble samples from a
Gaussian distribution, we assume that Xi and Xj are jointly Gaussian. For factors i and j, let
Gi and Gj represent the set of all genes for which we have binding information, respectively.
Moreover, we use Fi and Fj to denote the filtered sets of genes bound by proteins i and j,
respectively, and Fi,j = Fi ∪ Fj to represent the overall filtered set of genes, or the set of
genes classified as bound by at least one of the two factors. Using the SD data matrix, the
following equations find the means, filtered variances and filtered covariance of the binding
tendencies of two proteins using maximum-likelihood (ML) estimators for jointly Gaussian
random variables as shown below:
µ̂Xi =
µ̂Xj =
σˆ2 Xi =
σˆ2 Xj =
σ̂Xi ,Xj =
1 X
xi,g
|Gi | g∈G
i
1 X
xj,g
|Gj | g∈G
j
1 X
(xi,g − µ̂Xi )2
|Fi,j | g∈F
i,j
X
1
(xj,g − µ̂Xj )2
|Fi,j | g∈F
i,j
X
1
(xi,g − µ̂Xi )(xj,g − µ̂Xj ) .
|Fi,j | g∈F
(3.1)
(3.2)
(3.3)
(3.4)
(3.5)
i,j
Note that the estimates of the means, µ̂Xi and µ̂Xj , consider all genes for which we have
binding information, while the estimates of the variances, σˆ2 Xi and σˆ2 Xj , and covariance,
σ̂Xi ,Xj , consider only the genes classified as bound by at least one of the two factors. We
call these estimates the filtered variances and filtered covariance of binding profiles Xi and
Xj . We again use the ML estimator for jointly Gaussian random variables to estimate the
filtered correlation coefficient between binding profiles Xi and Xj , or ρ̂Xi ,Xj , as follows:
32
ρ̂Xi ,Xj = q
σ̂Xi ,Xj
.
(3.6)
σˆ2 Xi σˆ2 Xj
The difference between the filtered correlation coefficient and the standard correlation
coefficient is that the estimates of the filtered variances and covariance of binding profiles Xi
and Xj consider just a filtered subset of genes. As we would expect, the filtered correlation
coefficient still retains some of the important properties of the correlation coefficient. For
example, 0 ≤ |ρ̂Xi ,Xj | ≤ 1, with ρ̂Xi ,Xj = 0 if and only if two data vectors have no linear
dependence (uncorrelated) and ρ̂Xi ,Xj = 1 if and only if one data vector is a shifted and
scaled version of the other. This directly follows from Schwartz’s inequality, which states
that
1 X
|
(xi,g − µ̂Xi )(xj,g − µ̂Xj )| ≤
|Fi,j | g∈F
i,j
s
(
1 X
1 X
(xi,g − µ̂Xi )2 )(
(xj,g − µ̂Xj )2 ) ,
|Fi,j | g∈F
|Fi,j | g∈F
i,j
i,j
(3.7)
or equivalently that |σ̂Xi ,Xj | ≤
q
σˆ2 Xi σˆ2 Xj , where the equality holds true if and only if at
every g, xi,g = αxj,g + β for some choice of constants α and β.
In addition, the filtered correlation coefficient is also shift and scale invariant. This is
a particularly useful property for our analysis, since it normalizes for the fact that some
binding profiles may vary more than others. To prove that this property still holds, let
Xi0 = aXi + b and Xj0 = cXj + d represent linear transformations of random variables Xi
and Xj with transformed observations x0i,g = axi,g + b and x0j,g = cxj,g + d, respectively. The
estimates for the means, filtered variances, and filtered covariance change as follows:
33
µ̂Xi0 =
1 X
a X
1 X 0
xi,g =
(axi,g + b) =
(xi,g ) + b = aµ̂Xi + b
0
|Gi |
|Gi | g∈G
|Gi | g∈G
0
(3.8)
1
|G0j |
(3.9)
g∈Gi
µ̂Xj0 =
σˆ2 Xi0 =
σˆ2 Xj0 =
σ̂Xi0 ,Xj0 =
X
i
x0j,g =
g∈G0j
1
0
|Fi,j
|
1
0
|Fi,j
|
X
i
1
c
(cxj,g + d) =
(xj,g ) + d = cµ̂Xj + d
|Gj | g∈G
|Gj | g∈G
X
j
j
(x0i,g − µ̂Xi0 )2 =
1
(axi,g + b − aµ̂Xi − b)2 = a2 σˆ2 Xi (3.10)
|Fi,j | g∈F
(x0j,g − µ̂Xj0 )2 =
1
(cxj,g + d − cµ̂Xj − d)2 = c2 σˆ2 Xj (3.11)
|Fi,j | g∈F
0
g∈Fi,j
X
X
X
i,j
0
g∈Fi,j
X
i,j
1 X 0
(xi,g − µ̂Xi0 )(x0j,g − µ̂Xj0 ) = acσ̂Xi ,Xj .
0
|Fi,j |
0
(3.12)
g∈Fi,j
Substituting the new estimates for filtered variances and covariance into 3.6 we see that the
filtered correlation coefficient of the scaled and shifted versions of the data is the same as
that of the original binding profiles Xi and Xj :
σ̂Xi0 ,Xj0
ρ̂Xi0 ,Xj0 = q
σˆ2 Xi0 σˆ2 Xj0
=q
acσ̂Xi ,Xj
a2 σˆ2
Xi
c2 σˆ2
= ρ̂Xi ,Xj .
(3.13)
Xj
Filtered correlation coefficient is a simple but powerful measure of binding relationships
between two factors. It isolates the analysis on the pertinent dimensions (or genes) of the
ChIP-chip data and predicts biological interaction between factors with great accuracy, as
shown in the following sections and chapters. Moreover, it can compare vastly different
data sets in a consistent manner. We explore the issue of significance in filtered correlation
analysis next.
3.1.1
Filtered Correlation Coefficient P -values
Ultimately, we want to use our filtered correlation coefficient to make decisions on whether
two factors are linearly related. Hence, under our null hypothesis H0 the binding profiles of
two factors are not linearly related (i.e., ρXi ,Xj = 0) and under our alternative hypothesis
34
H1 they are linearly related (i.e., ρXi ,Xj 6= 0). For sample size n and filtered correlation
coefficient ρ̂Xi ,Xj , we want to evaluate the probability that we reject the null hypothesis
when it is actually true, or the p-value. To evaluate the significance of our filtered correlation
coefficient, we use the test statistic
√
ρ̂Xi ,Xj n − 2
T =
.
1 − ρ̂2Xi ,Xj
(3.14)
Assuming that binding profiles Xi and Xj are jointly Gaussian, [35] shows that T is a
Student-T random variable of n−2 degrees of freedom. Further, T results from a generalized
likelihood ratio test and hence defines the optimal decision boundary [35]. Let tρ̂,n−2 denote
one positive outcome of the random variable T for a given ρ̂Xi ,Xj = ρ̂ and n − 2 . Also, let
xi = [xi,g1 . . . xi,g|G| ] and xj = [xj,g1 . . . xj,g|G| ] represent the row vectors of binding data for
proteins i and j across all genes g in the set G, following the convention of using boldface
to represent vectors. Since, tρ̂,n−2 only depends on n − 2 and the estimated value ρ̂ found
using xi and xj , tρ̂,n−2 is completely determined by xi and xj . We represent the filtered
correlation coefficient p-value for a given tρ̂,n−2 , or for a given xi and xj , as pvF CC (xi , xj ).
To find this quantity, we need to find the probability that test statistic t can have a value
more extreme than tρ̂,n−2 . This corresponds to integrating the distribution of T , PT (t), over
the two disjoint intervals [−∞, −tρ̂,n−2 ] ∪ [tρ̂,n−2 , ∞] where T exceeds the outcome tρ̂,n−2 .
Due to the symmetry of the distribution of Student-T random variable T , we can write
Z
∞
pvF CC (xi , xj ) = P r(T ≥ tρ̂,n−2 ) = 2
PT (t)dt .
(3.15)
tρ̂,n−2
Note that we implicitly assume that both highly positive and negative values of ρ̂ are significant here. If we wanted to simply consider positive/negative correlation coefficients as
significant, we would need to remove the factor of two and only integrate over the interval
corresponding to the right/left tail of T ’s distribution.
The filtered correlation coefficient measures how consistently two binding profiles fluctuate above and below their respective means. It represents a normalized measure of the linear
35
relationship between two data vectors. However, two random entities can have no linear
relationship (i.e., ρXi ,Xj = 0) but still dependent on each other in a non-linear fashion. We
explore a more general notion of probabilistic dependence between random variables in the
next section.
3.2
Mutual Information
Mutual information is a general measure of the probabilistic dependence between two random
variables. As in the filtered correlation analysis, we again consider binding profiles Xi and
Xj of proteins i and j as random variables, but now their outcomes only take on NXi and
NXj discrete values. The entropy of binding profile Xi measures the amount of uncertainty in
predicting the observations of Xi and forms the basis for calculating the mutual information.
Given a probability mass distribution of binding profile Xi , PXi (Xi = xi,r ) for r = 1, . . . , NXi ,
the entropy of X is define as
N Xi
X
PXi (Xi = xi,r ) log PXi (Xi = xi,r )
H(Xi ) = −
r=1
N Xi NXj
XX
=−
PXi ,Xj (Xi = xi,r , Xj = xj,s ) log PXi (Xi = xi,r ) .
(3.16)
r=1 s=1
Suppose now that we observe the binding profile Xj , which might be related to Xi . The
conditional entropy of Xi given Xj measures the amount of uncertainty that Xi contains
with prior knowledge of Xj and can be calculated as follows:
36
N
Xj
X
H(Xi |Xj ) = −
PXj (Xj = xj,s )H(Xi |Xj = xj,s )
s=1
N Xj
NXi
X
X
=−
PXj (Xj = xj,s )
PXi |Xj (Xi = xi,r |Xj = xj,s ) log PXi |Xj (Xi = xi,r |Xj = xj,s )
s=1
NXi N Xj
r=1
XX
=−
PXi ,Xj (Xi = xi,r , Xj = xj,s ) log PXi |Xj (Xi = xi,r |Xj = xj,s ) .
(3.17)
r=1 s=1
The amount that the uncertainty in Xi decreases with the observation of Xj ; hence, the
entropy reduction H(Xi ) − H(Xi |Xj ) corresponds to the amount of information that Xj
contains about Xi . Combining (3.16) and (3.17), the mutual information between random
binding profiles Xi and Xj takes the form:
I(Xi ; Xj ) = H(Xi ) − H(Xi |Xj ) = H(Xj ) − H(Xj |Xi )
N
=
NXi Xj
X
X
PXi ,Xj (Xi = xi,r , Xj = xj,s )(log PXi |Xj (Xi = xi,r |Xj = xj,s ) − log PXi (Xi = xi,r ))
r=1 s=1
NXi N Xj
=
XX
PXi ,Xj (Xi = xi,r , Xj = xj,s ) log
r=1 s=1
PXi ,Xj (Xi = xi,r , Xj = xj,s )
.
PXi (Xi = xi,r )PXj (Xj = xj,s )
Note that the second equality in the first line shows that mutual information is symmetric,
or I(Xi ; Xj ) = I(Xj ; Xi ). This can easily be verified by observing that trading the places of
all the Xi and Xj in (3.18) results in no change. Another property of the mutual information
is that it is non-negative, or I(Xi ; Xj ) ≥ 0.
In order to estimate the mutual information between the binding profiles Xi and Xj of
proteins i and j, we need to estimate the marginal and joint probability mass functions of
Xi and Xj based on the data. Using the P V and P R data representations (justification
discussed in Section 3.2.2), we classify each data entry as a 1 or a 0, signifying the presence
or absence of an interaction, respectively. Hence, we model our binding profiles as Bernoulli
37
(3.18)
random variables with i.i.d. random samples Xi,g and Xj,g and outcomes xi,g and xj,g , where
xi,g , xj,g ∈ {0, 1} at all genes g. Let Gi,j denote the set of all genes with binding observations
for both proteins i and j, and let Gi,j have w gene members. We can use maximum likelihood
estimators for the parameters of Bernoulli random variables to estimate the marginal and
joint mass distributions using
P̂ (Xi = 1) =
1
w
X
g∈{g:xi,g =1}
1=
v
w
(3.19)
w−v
P̂ (Xi = 0) = 1 − P̂ (Xi = 1) =
w
X
1
u
P̂ (Xj = 1) =
1=
w
w
(3.20)
(3.21)
g∈{g:xj,g =1}
w−u
P̂ (Xj = 0) = 1 − P̂ (Xj = 1) =
w
X
1
h
P̂ (Xi = 1, Xj = 1) =
1=
w
w
(3.22)
(3.23)
g∈{g:xi,g =1,xj,g =1}
P̂ (Xi = 1, Xj = 0) =
P̂ (Xi = 0, Xj = 1) =
P̂ (Xi = 0, Xj = 0) =
1
w
1
w
1
w
X
1=
v−h
w
(3.24)
1=
u−h
w
(3.25)
1=
w−v−u+h
,
w
(3.26)
g∈{g:xi,g =1,xj,g =0}
X
g∈{g:xi,g =0,xj,g =1}
X
g∈{g:xi,g =0,xj,g =0}
where h denotes the number of genes bound by both proteins, v the number of genes bound
by protein with binding profile Xi and u the number of genes bound by protein with profile
Xj . Setting NXi = NXj = 2 and substituting our estimated distributions in (3.18) gives us
ˆ i ; Xj ).
the estimated mutual information between binding profiles Xi and Xj , I(X
Mutual information provides a more general framework for measuring the dependence
between two random variables. Moreover, the following sections and chapters demonstrate
that it is a very natural and biologically meaningful measure of protein dependence in ChIPchip data. We ultimately want to use our estimate of I(Xi ; Xj ) in order to make decisions
38
on whether two proteins with binding profiles Xi and Xj participate in the same biological
process. In the next section, we introduce a method for finding p-values for the mutual
information analysis.
3.2.1
Mutual Information P -values
P -values allow us to make unbiased decisions about biological dependence based on mutual
information. In this scenario, under the null hypothesis H0 the two factors have no binding
dependence (i.e., I(Xi ; Xj ) = 0). The p-value measures the probability that an estimated
mutual information of a given significance or greater can occur at random. Note that our
estimate of the mutual information in (3.19)-(3.26) depends only on four parameters, namely
ˆ i ; Xj ) maps to the Venn diagram shown in Figure 3-1, where
h, u, v, and w. Hence, each I(X
Xi is the binding profile of the TATA-Box Protein (TBP) and Xj is the binding profile of
POL3.
Figure 3-1: Venn diagram for the subsets of genes bound by proteins POL3 and TBP: w
denotes the number of genes with observed binding information about both POL3 and TBP,
v is the number of genes bound by TBP, u is the number of genes bound by POL3, and
h is the number of genes bound by both. The classification of bound was chosen based on
estimated p-values (see Section 2.2.1) below the threshold 0.001.
39
Given a superset of w genes and two subsets of u and v genes, the probability of the
two subsets having an overlap of h elements at random has a hypergeometric distribution
PH|U,V,W (h|u, v, w). Hence, we can calculate the probability of estimating a mutual informaˆ i ; Xj ) = îh,u,v,w at random using
tion I(X
w−v v
u−h h
w
u
ˆ i ; Xj ) = îh,u,v,w |H0 ) = PH|U,V,W (h|u, v, w) =
Pr(I(X
.
(3.27)
The denominator in (3.27) represents the number of ways to choose u − h non-overlaping
elements from the complement of the set containing v objects, times the number of ways to
choose the h overlapping elements from the set with v objects. The product hence represents
the number of ways to choose a subset of u elements from a superset of w objects, such
that exactly h of them overlap with a pre-designated subset of v elements. The numerator
computes all the possible ways of choosing a subset of u elements from a set of size w,
normalizing (3.27) in order to obtain a probability.
ˆ i ; Xj ) = îh,u,v,w to hypergeometric
The mapping from mutual information estimate I(X
probabilities is not one to one. For example, the parameters (h = 1, u = 2, v = 2, w = 20)
and (h = 300, u = 600, v = 600, w = 6000) have the same îh,u,v,w but hypergeometric
probabilities of 0.1895 and 2.3028 × 10−165 , respectively. This exaggerated example shows
that the mutual information estimate does not take into account the sample size w in judging
the significance of the dependence between two random variables, unlike the corresponding
hypergeometric probability PH|U,V,W (h|u, v, w). Since a larger value of w should increase
ˆ i ; Xj ), random variable H when conditioned on U = u,
the confidence in our estimate I(X
V = v, and W = w is a more appropriate test statistic for evaluating mutual information
significance. As before, outcomes h, u, v, and w are completely determined by discrete
vectors of binding data xi and xj for proteins i and j. To find the p-value, or the probability
of randomly estimating a mutual information of a positive binding relationship equally or
more significant than îh,u,v,w , we sum over the right tail of the hyper-geometric test statistic
H|U = u, V = v, W = w for the outcome h as follows:
40
min (u,v)
pvM I (xi , xj ) = Pr(H ≥ h|u, v, w) =
X
r=h
w−v v
u−r
r
w
u
.
(3.28)
As before, we express our mutual information p-value pvM I (xi , xj ) as a function of the
binding data xi and xj . The hyper-geometric p-values above only evaluate the significance of
synergistic binding between two factors, while mutual information can capture both positive
and negative binding relationships. To evaluate the p-value of a negative binding relationship,
or the probability of having an overlap of h or smaller at random, we would need to sum
from 0 to h in (3.28). However, due to the high sensitivity at h = 0 and due to the large
amount of false positives and false negatives in our binding data, this evaluation proves
unreliable. Opposing binding relationships can evince interesting biological phenomena, as
well, but it becomes clear later why such relationships complicate future analyses and their
biological interpretation. Hence, we avoid using mutual information in finding negative
binding relationships and will exclusively use the filtered correlation coefficients for that
purpose. Now that the we have developed a framework for evaluating the significance of
our mutual information analysis, we can substantiate our decision for using the P V and P R
ˆ i ; Xj ).
data representations in classifying the data and finding I(X
3.2.2
Evaluating Data Classification: Mutual Information Analysis
This section gauges how accurately the different data representations classify their binding
information into 0s and 1s, in order to estimate the mutual information and find p-values as
in Sections 3.2 and 3.2.1. We generated a test set of 1447 probable protein-protein binding
dependencies, most of which were confirmed or suggested by published literature. We also
created a test set of 20707 extremely unlikely relationships. In order to compare the different
matrices in an unbiased manner, we designated reasonable thresholds for classifying the data
sets into a similar number of bound (1s) and unbound (0s) gene-protein interactions. Table
3.1 lists the data representations, their corresponding threshold test and the number of
41
Data
Matrix
BR
PR
N
PV
N + PR
PV + PR
Classification
Total Entries
Hit Rate (captured
Test for Binding
Classified Bound
/probable links)
BR ≥ 1.9
81450
738/1447
P R ≥ .965
77377
1402/1447
N ≥ −.002
79639
1409/1447
P V ≤ .015
80948
1445/1447
N ≥ −.002 or P R ≥ .99
80215
1414/1447
P V ≤ .015 or P R ≥ .99
81461
1446/1447
Total Noise at
Unlikely Links
5283.9
3784.5
1460.5
1021.1
1394.7
1021.0
Table 3.1: Hit rate (capture/probable links) at probable interactions and total noise at unlikely interactions based on mutual information estimates using different data representations
(see text).
entries classified as bound (all around 80,000) in the first three columns. The classified data
for each data representation was then used to find mutual information p-values, pvM I , as
in Section 3.2.1. The next column of Table 3.1 enumerates the hit rate, or the number of
relationships captured at a significance level of 10−10 divided by the total number of probable
links in our test set. The last column lists the total noise at unlikely links, computed by
accumulating the total significance at interactions in second test set. The significance of an
interaction between proteins i and j was calculated by replacing pvC with pvM I in (4.4). A
higher hit rate should correspond to fewer false negatives for a given data representation,
while less total noise at unlikely relationships should lead to fewer false positives.
The P V (equivalent to thresholding on SD) matrix performs best in individually classifying the data, since it has the highest hit rate at known interactions and the lowest total
noise at unlikely interactions. The N and P R matrices perform slighly worse in confirming
probable relationships. However, the N and P R matrices introduce 43% and 271% more
noise than the P V matrix, respectively, which will undoubtedly lead to more false positive
claims. Since some proteins have very few gene targets at the chosen thresholds, the classified, binary data would provide little information about their behavior. In order to use
mutual information to compare these proteins with the rest, the last two rows in Table 3.1
classify the data using a combination of the top two data representations (P V and N ) and
the P R data representation, guaranteeing that at least the top 1% of most immunoprecipitated gene probes is considered bound for each factor. The number of entries classified
42
as bound increase negligibly, since most publications already considered the strongest 1%
of gene-protein interactions for each protein as bound. Creating a buffer of bound targets
slightly improves the hit rate of both the individual N and P V matrix at no expense of
added noise. These results substantiate our decision to use the combination of the P V and
P R matrices for classifying the data in order to estimate the mutual information p-values.
Now that we have a method for evaluating the significance of both our filtered correlation
coefficient and mutual information analyses, we can combine the evidence from the two
approaches in order to improve the reliability of our biological predictions.
3.3
Combining P -values
The p-value calculations for both the filtered correlation coefficient and mutual information
estimation test similar null hypotheses. In Section 3.1.1, we test the null hypothesis that
two binding profiles Xi and Xj are not linearly dependent (ρx,y = 0) while in the last section
we considered the null hypothesis that the two profiles are not probabilistically dependent
(I(Xi ; Xj ) = 0). Ultimately, we aim to find out whether two proteins belong in the same
biological process. Hence we want to combine the p-values from the two analyses in order to
incorporate both sources of evidence in testing the overall null hypothesis that two proteins
with binding profiles Xi and Xj do not share a biological function.
We combine p-values using Fisher’s method. It assumes that the p-values result from
independent studies that use continuous random variables as test statistics to challenge similar null hypotheses. Since we use the same data source to estimate the filtered correlation
coefficient and mutual information, our studies are not completely independent. Moreover,
the hyper-geometric test statistic is not continuous. Despite these approximations, the authors in [37] and Section 3.3.1 demonstrate how combining p-values using Fisher’s method
improves biological prediction.
In order to derive Fisher’s method for combining p-values for synergistic binding relationships, we only consider positive correlations as significant for the following. Hence, in this section we calculate our p-values for the filtered correlation coefficients using a one-sided test, or
43
using (3.15) without the factor of 2. Let p1 = Pr(T ≥ tρ̂,n−2 ) and p2 = Pr(H ≥ h|u, v, w) represent two p-value observations that result from test statistics T and H|U = u, V = v, W = w
and observations (ρ̂, n − 2) and h, respectively. Using the assumption of independent studies, the joint probability of observing events T ≥ tρ̂,n−2 and H ≥ h|u, v, w under the null
hypothesis equals
Pr(T ≥ tρ̂,n−2 , H ≥ h|u, v, w) = Pr(T ≥ tρ̂,n−2 ) Pr(H ≥ h|u, v, w) = p1 p2 .
(3.29)
From the equation above, it seems natural to consider the observation of p-value pairs
(p1 = 0.01, p2 = 0.01) and (p1 = 0.1, p2 = 0.001) with equivalent products p1 p2 as evenly
significant sources of evidence for rejecting the null hypothesis H0 . Intuitively, this refers to
having two equally strong sources of evidence for dependence between two proteins or having
a slightly stronger and a slightly weaker source of evidence. Hence, we want to make decisions
about the strength of our combined evidence using the test statistic K = P1 P2 , where P1 and
P2 are the p-values resulting from the filtered correlation coefficient and mutual information
estimations, respectively. Since P1 and P2 depend on assumed independent random variables
T and H, they are themselves independent random variables. The distributions of p-values P1
and P2 resulting from a continuous test statistics is exactly uniform on the interval [0, 1] [37].
Although, our test statistic for mutual information is discrete, assuming a continuous uniform
distribution for P2 is just an approximation that only slightly affects the accuracy of our
evaluation [37].
To find the overall p-value for a given p-value product p1 p2 , we need to evaluate the
probability of randomly arriving at a p-value product more significant than the observation
k2 = p1 p2 . This corresponds to the region in probability space where p1 p2 ≤ k2 . Hence, we
need to find the volume of the region p1 p2 ≤ k2 over the joint distribution of P1 and P2 .
Since P1 and P2 were shown to be independent and uniformly distributed random variables,
their joint distribution PP1 ,P2 (p1 , p2 ) is a 2-dimensional unit cube defined on {(p1 , p2 ) : p1 ∈
[0, 1], p2 ∈ [0, 1]}. The combined p-value from the two analyses for observed p-value product
44
k2 , pvC (k2 ), follows from the integration below:
ZZ
pvC (k2 ) =
=
PP1 ,P2 (p1 , p2 )dp1 dp2
p1 p2 ≤k2 , 0≤p1 ,p2 ≤1
Z 1
Z k2
k2
1dp1 +
dp1 = p1 |k02 +k2 ln p1 |1k2 =
p
1
k2
0
k2 − k2 ln k2 .
(3.30)
Since p1 = pvF CC (xi , xj ) and p2 = pvM I (xi , xj ) for two proteins i and j with binding data
vectors xi and xj , the combined p-value for the binding relationship between proteins i and
j only depends on xi and xj and can be equivalently denoted as pvC (xi , xj ). The equation
in (3.30) only depends on the number of dimensions, 2, and the product of our observed
p-values, k2 . The extension for combining p-values from n independent experiments, also
depends only on n and the observation kn = p1 p2 · · · pn . Now, the integration takes place
over the n dimensional region p1 p2 · · · pn ≤ kn of the joint distribution PP1 ,...,Pn (p1 , . . . , pn ).
Again from independence, the joint distribution is an n-dimensional unit hypercube defined
on {(p1 , . . . , pn ) : p1 ∈ [0, 1], . . . , pn ∈ [0, 1]} and the combined p-value is
pvC (kn ) = kn
n−1
X
(− ln kn )r
r=0
r!
.
(3.31)
The following section mitigates the problem of assuming independent analyses and illustrates how combining p-values can improve biological prediction.
3.3.1
Biological Predictions using Combined P -values
To show the benefit of combining p-values we present an example of a biological prediction
that is representative of the rest of the pairwise protein analysis. Figure 3-2 shows filtered
correlation coefficient, mutual information, and combined p-values for a test set of proteins.
The filtered correlation coefficient and mutual information p-values were created using onesided models that only considered positive relationships as significant. The scale on the right
is in units of log10 (pv) ranging from -2 (or p-value = .01) to -20 (or p-value = 10−20 ). Protein
45
names are listed on the x and y axis with and “o” or “i” at the end of the names signifying
that the data came from an ORF or intergenic array, respectively. Since combining p-values
integrates evidence from the two complementary pairwise analyses, it should lead to fewer
errors in deciding whether two factors have a significant binding dependence.
The first 3 factors, RSC3, RSC8, and RSC, are components of the RSC nucleosome remodeling complex. Since these proteins carry out tasks as part of the same complex, we
would expect a high degree of similarity in their binding profiles, as shown in [7]. Indeed,
each subfigure shows a p-value ≤ 10−20 for the interactions within the entire complex (dark
blue box on the top left). For individual analyses, we consider a p-value of 10−10 as significant for a pairwise interaction. However, since we are combining two studies that are not
fully independent, we consider a p-value of 10−20 as significant for the combined p-values. If
our two analyses were completely dependent on one another, the p-values testing the same
null hypothesis should be identical and a significance of 10−10 would translate to a significance of 10−20 when the p-values are combined. However, since our pairwise studies are not
fully dependent on one another, a p-value threshold of 10−20 for combined p-values is more
stringent than a p-value threshold of 10−10 for a single analysis.
The next two proteins LRP1 and RRP6, are involved in mRNA degradation and surveillence, respectively. In [16], the authors show that LRP1 depends on RRP6 for recruitment
to a large fraction of its gene targets. Therefore, we would expect similarity in their binding
profiles and both analyses capture this interaction with a low p-value. Curiously, protein
PRP20 also shows high similarity with the LRP1-RRP6 biological process. PRP20, also
known as RanGEF in humans, binds preferentially to inactive genes [13]. This leads us to
a novel biological hypothesis. It is known that the exosome (a complex of proteins which
also contains LRP1 and RRP6) controls protein synthesis by degrading the mRNAs of genes
that are improperly formed, processed, or transcribed at a given time. For example, we
would not want the GAL proteins, involved in galactose breakdown, to be synthesized when
the cell is grown in a dextrose-rich YPD medium. Moreover, [13] shows that PRP20 binds
to genes that should be turned off during a given condition, such as GAL genes in YPD.
46
(a) Filtered correlation coefficient p-values for
a test set of proteins.
(b) Mutual information p-values for test a set
of proteins.
(c) Combined p-values for test a set of proteins.
Figure 3-2: Filtered correlation coefficient (a), mutual information (b), and combined (c)
p-values for a test set of proteins. The filtered correlation coefficient and mutual information
p-values were created using one-sided models that only considered positive relationships as
significant. Protein names are listed on the x and y axis with an “o” or “i” at the end of
the names signifying that the data came from an ORF or intergenic array, respectively. The
scale on the right is in units of log10 (pv) ranging from -2 (or p-value = .01) to -20 (or p-value
= 10−20 ).
47
The fact that the exosome binds to a similar set of gene targets as PRP20 suggests that
inactive genes are actually not completely turned off. It may be possible that the process of
transcription is leaky, in which case inactive genes that should not be transcribed in a given
condition may be erroneously transcribed. Indeed, there is a growing body of evidence in
the transcription field to suggest that RNA polymerase is promiscuous in its induction of
transcription at non-ORF sites within the genome, the functional consequences of which are
still unclear. This suggests that the exosome may have a role in quality control at inactive
genes, binding to them in order to readily degrade their unwanted mRNA.
The last 2 proteins, Polymerase III (POL3) and the TATA-Box Protein (TBP), also associate significantly for all three p-value calculations. In a recent paper [28], the authors show
that the TBP preferentially associates with gene targets of POL3 in various environmental
conditions. Although, the POL3 and TBP experiments were done in different laboratories,
our normalized analyses confirm the results in [28].
After identifying connections within 3 biological processes, we consider the interplay between the three complexes. The authors in [7] show that the RSC nucleosome remodeling
complex has a preference for POL3 gene targets. Hence, we should expect significant interaction in the top right corner of Figures 3-2(a), 3-2(b), and 3-2(c). The filtered correlation
coefficient analysis finds all 6 interactions significant (p-value ≤ 10−10 ) but the mutual information and combined p-value calculation finds 5/6 relationships as significant (p-value
≤ 10−10 and p-value ≤ 10−20 , respectively), making an error on RSC3-POL3. The filtered
correlation coefficient analysis additionally predicts interactions LRP1-POL3 and LRP1TBP but no relationships between its recruitment factor RRP6 and POL3 or TBP. Due to
this discrepancy and due to the lack of literature supporting this interaction, we consider
these two predictions as likely false positives. The mutual information and the combined
p-value do not make this probable mistake. In summary, the filtered correlation makes 2 potentially false positive claims while the mutual information and combined p-values make one
false negative error. Moreover, if we reduce the combined p-value threshold for significance
to a less conservative and more accurate 10−18 (since our complementary analyses are not
48
completely dependent), we obtain 0 errors.
This example accurately represents the overall trend that correlation coefficient and mutual information tests will generate a higher number of false positives and false negatives,
respectively. This is because mutual information makes a hard decision on classifying the
data as 0s and 1s prior to the analysis and cannot capture relationships when a significant
number of errors are made in the classification, resulting in false negatives. In contrast,
the filtered correlations cost function considers a soft and continuous version of the data in
terms of standard deviations of confidence and does capture these relationships. However,
this also leads to considering unbound genes as partially bound and creates a higher number
of false positives. Combining the two complementary analyses leads to fewer errors or to a
more optimal operational point on an ROC curve of false negative rate versus false positive
rate. We use this concept in building a more reliable network of our nucleus in Chapter 5.
The next section attempts to reduce the false positive rate of predictions based on mutual
information.
3.4
Minimized Mutual Information P -value
The previous section revealed that mutual information p-values are highly sensitive to the
chosen cutoff for classifying the binding data into 1s and 0s. To alleviate this problem,
this section introduces minimized mutual information p-values. This method allows for
the binding cutoffs to take on several values from an allowable set and finds the cutoffs
that minimize the mutual information p-value between two proteins. Section 3.2.2 shows
that transforming data into 0s and 1s seems most appropriate when using the P V and P R
matrices. Let A = {0.001, 0.005, 0.01, 0.02} denote the allowable set of p-value cutoffs for
thresholding the P V matrix. Moreover, let ci , cj ∈ A denote the p-value cutoffs of proteins
i and j for transforming continuous row vectors xi and xj from the P V matrix to binary
vectors, respectively. For each combination of cutoffs ci , cj ∈ A, lets also consider the
strongest 1% of gene-protein interactions for proteins i and j as bound. Then the minimized
mutual information p-value, pvM M I , between binding vectors xi and xj takes the form
49
pvM M I (xi , xj ) =
min
ci ∈A,cj ∈A
pvM I (xi , xj ) ,
(3.32)
where pvM I is the mutual information p-value previously calculated in (3.28). The minimized
mutual information p-value mitigates the sensitivity in the mutual information analysis due
to hard classification of the data. However, since this analysis allows for the same protein to
have different bound cutoffs, it is only suited for making decisions on pairwise interactions
between two proteins and does not allow for unbiased group-wise comparisons. Hence, we will
use the minimized mutual information p-values for making decisions about adding links in a
network between two proteins (Chapter 5) but refrain from using it for inferring group-wise
relationships between proteins (Chapter 4).
Filtered correlation coefficient, mutual information, and combined p-values allow us to
judge the strength of pairwise interactions between two proteins. The following chapter
introduces clustering and PCA, which can evince group-wise relationships between factors
that regulate transcription.
50
Chapter 4
Group-wise Relationships: PCA and
Clustering
Biological processes depend on the collaborative interaction of several proteins. In the previous chapter we developed pair-wise statistics in order to describe the relationship between
two proteins. This chapter introduces Principal Component Analysis (PCA) and clustering,
two methods that can evince group-wise dependencies between proteins. The two techniques
verify known biological mechanisms and uncover novel cellular processes.
4.1
Principal Components Analysis
PCA is a common technique for data reduction and visualization. It determines the directions
of the greatest variance within a data matrix by calculating the orthonormal eigenvectors
associated with the largest eigenvalues of the scatter matrix of the data [36]. The principal
components, usually referred to as scores, are row vectors of the same dimensionality as row
vectors of the data matrix. The data can be modeled by taking linear combinations of the
scores; the associated weights with each score are called loads. Since most of the variance
within a data set can usually be captured with a small number of principal components,
PCA exploits the common information within the data in order to reduce its dimensionality.
51
One can choose the number of principal components one wishes to calculate; more principal
components can characterize the data better at the cost of having to keep track of more
information. For visualization purposes, two or three principle components are often used.
For a given choice of data representation, let xi = [xi,g1 . . . xi,g|G| ] denote a row vector of
binding data for protein i across all genes g in the set G. If performing PCA on the binding
profiles of each protein i, we can write
xi = µ̂i 1 +
N
X
wni sn .
(4.1)
n=1
The PCA equation (4.1) uses N principal components, with row vector scores s1 , . . . , sN and
i
} specific to each xi , where µ̂ is a scalar that is the mean across the
scalar loads {w1i , . . . , wN
elements of xi , and 1 is row a vector of all ones.
Figure 4-1 shows how PCA can provide useful biological insight. The large isolation
of protein Prp20 and Nup100 from the rest of the nuclear pore related proteins in Figure
4-1(b) supports a model introduced in [13]. Casolari et al. shows that Prp20, known as
RanGEF in humans, binds preferentially to transcriptionally inactive genes. RanGEF is also
known to catalyze the unloading of transcription factors (TF) transported into the nucleus by
nuclear import proteins, such as Kap. The left part of Figure 4-1(a) illustrates how RanGEF
converts RanGDP to RanGTP which in turn unloads the TF transported into the nucleus by
Kap. Hence, the authors suggest that Prp20 may contribute to transcriptional activation by
facilitating the release of transcription factors at genes requiring fast activation. Accordingly,
induction of said genes leads to experimentally confirmed loss of RanGEF association and
a gain in binding at the nuclear pore on the periphery of the nucleus, as illustrated on the
right part of Figure 4-1(a). Thus, Prp20 and the rest of the nuclear pore proteins should
bind to very few genes in common at any one time as shown by the great separation between
the two in Figure 4-1(b). Further, the results suggest that the other aberrant nuclear factor,
Nup100, also has a role that is orthogonal to the workings of the majority of the nuclear
pore proteins. The following section discusses another method for finding biological processes
using ChIP-chip data.
52
(a) Casolari et al. Model for Nuclear Transport (picture from [13])
Principal Component Scatter Plot with Colored Clusters
100
1
2
3
Nup100
Second Principal Component
80
60
40
20
0
−20
−50
Prp20
0
50
100
150
200
250
First Principal Component
(b) Principal Component Analysis Validates the Model
Figure 4-1: The subfigures show (a) Casolari et al. model for nuclear transport and (b) its
validation using Principal Component Analysis (PCA) [13]. Figure (b) plots the loads on
the first two principal components. The colors denote clusters constructed using hierarchical
clustering based on correlation coefficient distance, as defined in Section 2.2.2, and a cutoff
for three clusters.
53
4.2
Clustering
Clustering of genomic data can evince group-wise relationships between proteins or genes. We
first introduce K-means and hierarchical clustering and then adapt the algorithms in order
to use the pair-wise distance metrics developed in the previous chapter. We then develop a
novel semi-supervised clustering algorithm that preserves information about elements within
each cluster in order to better capture group-wise dependencies between proteins.
4.2.1
K-means Clustering
K-means clustering is one of the most commonly encountered algorithms in biology, due to
its ease of implementation. K-means clustering uses the expectation maximization (EM)
algorithm to partition (cluster) N elements into K disjoint subsets Sk , k = 1, . . . , K. Given
an appropriate choice of data representation, let xi denote a row vector of binding data for
protein i, as in Section 4.1. In order to find groups of related proteins, we represent binding
profiles as elements that we want to cluster. The K-means algorithm finds K disjoint subsets
Sk , k = 1, . . . , K that minimize the overall cost function
J=
K X
X
d2 (xi , ck ) ,
(4.2)
k=1 i∈Sk
where d(xi , ck ) is the distance between binding profile xi and centroid ck of subset Sk . The
algorithm usually uses Euclidean distance for d and defines ck as the mean of all the elements
in Sk :
ck =
1 X
xi .
|Sk | i∈S
(4.3)
k
The user initiates the clustering by designating the K underlying groups of objects. Then
the K centroids are initialized, usually by assigning each element to one of the K clusters at
random and using (4.3). The algorithm then iterates between (i) assigning all the elements to
subset Sk with the “closest” (in terms of distance d) centroid ck (E-step) and (ii) reestimating
54
the cluster centroids based on the new assignments using (4.3) (M-step). The procedure stops
once no further change in assignments occurs.
We adapted the K-means algorithm in order to incorporate the pairwise statistical analysis from Chapter 3. The following formula converts the combined p-values from Section 3.3
to a non-negative distance, where a small p-value corresponds to a small distance between
the binding profiles xi and xj of two proteins i and j:
d(xi , xj ) = − log10 (1 − pvC (xi , xj )) .
(4.4)
K-means clustering (as any EM based algorithm) does not guarantee global minimization
of the cost function in (4.2), and its performance heavily depends on the initialization of the
K centroids. Due to this problem, the K-means algorithm often fails to identify known
protein clusters (e.g., the RSC complex). Other EM based partitioning algorithms, such as
fuzzy membership clustering, would also suffer in performance due to random initialization;
therefore, the next section introduces a different family of algorithms based on hierarchical
clustering.
4.2.2
Hierarchical Clustering
Hierarchical clustering has wide applicability in biology, as well. In addition, hierarchical
partitioning does not require a priori knowledge of the number of clusters and can be easily
visualized using a tree structure (called dendrogram), making it often preferable to K-means
clustering. Hierarchical clustering is subdivided into agglomerative methods, which proceed
by series of fusions of the N objects into groups, and divisive methods, which separate N
elements successively into finer subsets.
We adapted the agglomerative hierarchical clustering algorithm in order to incorporate
the pairwise statistical analysis from Chapter 3. As in the previous section, we use binding
profiles as elements and define all the pairwise distances between elements using (4.4). At
the start, the algorithm treats each element as a cluster, and proceeds for N − 1 iterations.
At each iteration, the algorithm links the two most similar clusters (represented as linking
55
two child nodes to a parent node on a dendrogram) until all N elements are unified into
one partition. Hierarchical clustering algorithms can define distance (or similarity) between
clusters in various ways. The following equations describe several common distances for
linking two clusters Ck and Cl :
Nearest Neighbor :
d(Ck , Cl ) =
Farthest Neighbor :
d(Ck , Cl ) =
Average :
d(Ck , Cl ) =
d(xi , xj )
(4.5)
max d(xi , xj )
(4.6)
min
i∈Ck ,j∈Cl
i∈Ck ,j∈Cl
XX
1
d(xi , xj ) .
|Ck ||Cl | i∈C j∈C
k
(4.7)
l
Hierarchical clustering based on combined p-values from Section 3.3 confirms known biological processes with great accuracy, especially when using the average distance for linking
clusters. However, hierarchical clustering only considers pairwise relationships between binding profiles. Figure 4-2 uses an example to illustrate the potential pitfalls with clustering
solely on pairwise distances. The Venn diagrams on the left and on the right describe two
hypothetical relationships between three factors (proteins). Each circle or oval represents
the subset of genes classified as bound by the six different factors. In the left scenario, each
possible pair of factors (i.e., A↔B, A↔C, and B↔C) have a number of genes that they bind
to in common. But when considering the group-wise relationship (i.e., A↔B↔C), we find
no common intersection in bound genes between all three factors. In contrast, the scenario
on the right shows that factors X, Y, and Z share a large common intersection and thus
interact in a pairwise as well as group-wise manner. Since both sets of three factors have
equivalent pairwise distances, hierarchical clustering cannot distinguish any difference between the two scenarios. However, the much stronger group-wise interaction of the scenario
on the right makes it more likely for factors X-Y-Z to share a common biological process
than for factors A-B-C. In the next section, we develop a novel semi-supervised clustering
algorithm that preserves information about elements within each cluster in order to better
capture group-wise dependencies between proteins.
56
Figure 4-2: The Venn diagrams on the left and on the right describe two hypothetical relationships between three factors. Since both groups of three factors have equivalent pairwise
overlaps in bound genes, hierarchical clustering cannot distinguish any difference between the
two scenarios (see text). However, the much stronger group-wise interaction of the scenario
on the right makes it more likely for factors X-Y-Z to share a common biological process
than for factors A-B-C.
4.2.3
Semi-Supervised Clustering
Semi-supervised clustering derives its name from the fact that it retains information about
elements within each cluster as it partitions the objects. In order to preserve information
about the elements of cluster Ck the algorithm maintains two vectors, fk and xk . Vector
fk = [fk,g1 . . . fk,g|G| ] records the fraction of elements that bind to each gene g in the set
G. Vector xk = [xk,g1 . . . xk,g|G| ] represents the averaged binding profile at gene g ∈ G for
all members within the partition, based on the SD data representation. When merging two
clusters Ck and Cl into Co , the resulting two vectors for cluster Co are weighted combinations
of the vectors for Ck and Cl :
1
(|Ck |xk + |Cl |xl )
|Ck | + |Cl |
1
fo =
(|Ck |fk + |Cl |fl ) .
|Ck | + |Cl |
xo =
(4.8)
(4.9)
Note that merged vector fo still maintains the fraction of genes bound by the elements of
57
the new cluster and xo still represent the average binding profile of all the joined objects.
To define similarity between clusters, we again use the distance based on combined p-values
in (4.4). Although there are many choices for how to define similarity, we wanted to incorporate the robust measures we derived in the previous chapter. Appendix A develops the
algorithm with a more theoretically intuitive but less biologically meaningful distance based
on Kullback-Leibler(KL) divergence.
To find combined p-values for the relationship between two clusters, the algorithm evaluates the filtered correlation coefficient and mutual information but incorporates the groupwise dependence by modifying how to select the pertinent sets. For filtered correlation, let
Fk and Fl represent the filtered subsets of genes for clusters Ck and Cl , respectively, and
let Fk,l = Fk ∪ Fl again denote the filtered subset of genes over which we will estimate
the variances and covariance of two partitions using (3.1) - (3.5). The filtered subsets Fk
and Fl no longer consist of the set of genes bound by one factor but now represent the
sets of genes bound by a fraction of objects within a cluster. Mathematically, letting fthresh
represent the fraction of proteins that need to bind to a gene in the filtered subsets, we
define Fk = {g : fk,g ≥ fthresh } and Fl = {g : fl,g ≥ fthresh }, respectively. Having defined Fk,l = Fk ∪ Fl , the algorithm uses the averaged binding vectors xk and xl in (3.1) (3.5) to compute the means, filtered variances, and filtered covariance of clusters Ck and
Cl . Next, using (3.6), (3.14), and (3.15), the algorithm derives the filtered correlation coefficient p-value between the two partitions. For finding mutual information p-values, we
can now assign u = |Fk |, v = |Fl |, h = |Fk ∩ Fl |, and w equal to the number of genes with
binding information for both clusters. Using these quantities, we can use (3.28) to find the
mutual information p-value between two clusters. And finally, we again combine p-values
using (3.30).
Similar to hierarchical clustering, the algorithm treats each element as a cluster at the
start and proceeds for N − 1 iterations. At each iteration, the algorithm links the two most
similar partitions, based on the combined p-value distance in (4.4), until all N elements
are unified into one partition. However, the new method for finding the combined p-values
58
between clusters incorporates the group-wise dependence of elements, rectifying the potential
problem with using hierarchical clustering. The next section validates the algorithm using
several known biological processes.
4.2.4
Biological Validation of Semi-Supervised Clustering
In order to evaluate the performance of the new algorithm, we clustered the SD data representation using a bound cutoff of P V ≤ .015 ∨ P R ≥ .99, where ∨ denotes logical or,
consistent with our results in Sections 2.2.2 and 3.2.2. Moreover, we set fthresh = 0.5 in order
to consider genes bound by half or more of the factors in a partition as representative of the
biological process of the partition. We first clustered the whole data set and then extracted
the produced clusters into separate plots to facilitate visualization. Figure 4-3 illustrates
how the algorithm performs on a selected group of biological processes using tree structures
or dendograms. The horizontal branches on the dendrogram represent the merging of two
similar objects at an iteration in the algorithm, while the length of the vertical branches
corresponds to the similarity distance (as in (4.4)) of the fused nodes. Unlike K-means,
semi-supervised clustering does not decide the final number of clusters but allows the user
to choose a significant threshold distance for a cluster. In Figure 4-3, we chose a distance of
less than 303 as significant, which corresponds to a combined p-value of 10−20 in our implementation. For visualization purposes, however, we color coded clusters of distance ≤ 253
or of combined p-value ≤ 10−70 .
Figure 4-3(a) shows the clustering of four known biological processes using different colors.
It again illustrates the significant association of components of the RSC nucleosome remodeling complex (RSC, RSC1, RSC2, RSC3, RSC8, RSC9, and STH1) and of components
of the Polymerase III transcriptional machinery (POL3, TBP, BRF, TRF4, and RPC34).
It also demonstrates the significant clustering between transcription factors STE12, DIG1,
and TEC1. In [5], the authors show that STE12 activates two different sets of genes under
different conditions. In the presence of a nutrient limiting agent butanol, binding of STE12,
DIG1, and TEC1 activates genes necessary for growth of filaments, or structures that expand
59
(a) Validation of the RSC, POL3, STE12, and GAL biological processes.
(b) Clustering of the MCM/ORC, SIR, and LRP1 biological processes.
Figure 4-3: Confirmation of known biological processes and new insight using semi-supervised
clustering (see text). The two dendrograms above show a subset of the results from clustering
the whole data set using the SD data representation and a bound cutoff of P V ≤ .015∨P R ≥
.99. We set fthresh = 0.5 in order to consider genes bound by half or more of the factors
in a partition as representative of the biological process of the partition. The horizontal
branches signify the merging of two similar objects, while the length of the vertical branches
corresponds to the similarity distance (as in (4.4)) of the fused nodes. For visualization
purposes, different colors represent clusters of distance ≤ 253 or of combined p-value ≤ 10−70 .
60
the size of the cell and facilitate storage of nutrients. During pheromone exposure, binding of
STE12 and DIG1 (no TEC1) induces genes involved in mating. Although our data monitors
the binding of the three factors under no treatment in YPD, the clustering still evinces the
strong binding dependence between STE12 and DIG1, and their auxiliary relationship with
TEC1. Moreover, Figure 4-3(a) also confirms the significant interdependence between the
POL3 machinery and RSC complex (p-value ≤ 10−40 ). It further suggests a previously unknown relationship between the STE12 cluster and the POL3 and RSC biological processes.
And finally, the right most partition shows significant similarity between transcription factors GAL3 and GAL80. As discussed in Section 2.1, the two transcription factors regulate
the expression of genes that breakdown the sugar galactose. Similar to the STE12 analysis,
using YPD data again uncovers the general relationship between the two galactose factors.
However, we do not believe that the GAL transcription factors share significant functions
with the previous three processes (p-value ≥ 10−5 ).
Figure 4-3(b) displays a dendogram of several other known and hypothesized biological
mechanisms. The rightmost colored tree corroborates our hypothesis that LRP1, RRP6,
and PRP20 may synergistically bind at inactive genes. As discussed in Sections 3.3.1 and
4.1, LRP1 and RRP6 may bind to inactive genes in order to readily degrade their unwanted
mRNA, while PRP20 may bind at inactive genes to facilitate fast activation [13]. Moreover,
the association LRP1 and RRP6 with mRNA export factor YRA1, supports the results
in [16] that demonstrate coupling of mRNA production quality assurance to mRNA export.
Further, the similar relationship between NUP100 and PRP20 suggests that NUP100 may
also have a role that is orthogonal to the workings of the majority of the nuclear pore factors,
as predicted in the PCA analysis. However, this cluster does not seem to share significant
commonality with the rest of the colored trees on the figure.
The other three colored clusters in Figure 4-3(b) also provide validation of our analysis
and new biological insight. During normal growing conditions, a yeast cell periodically cycles
through several phases of growth. If a mother cell has sufficient nutrients in its environment
it may decide to divide into two daughter cells, in a phase of the cell cycle called mitosis. To
61
make sure that the daughter cells inherit the genetic information of the mother cell, prior to
mitosis and during the S phase of the cell cycle, the mother cell makes two identical copies
of all of its genetic material in a process known as DNA replication. In [8], the authors show
that the Origin Recognition Complex (ORC) and the MiniChromosome Maintenance (MCM)
proteins bind together at origins of replication, or sites along the chromosomes where DNA
replication is initiated. The two left most colored trees confirm the significant association of
components of the ORC (ORC1, ORC1-6) and MCM (MCM3, MCM4, and MCM7) complex
at both intergenic and open reading frame (ORF) regions of the genome, denoted by an “i”
and “o” at the end of each protein name, respectively. Although DNA replication only takes
place during the S phase of the cell cycle, and despite the fact that our unsynchronized
ChIP-chip data samples cells during all phases of growth, the semi-supervised clustering still
uncovers the relationship described in [8]. The next colored partition confirms the significant
association between components of the SIR complex at both intergenic and ORF regions [21].
Moreover, the SIR complex associates significantly (p-value ≤ 10−35 ) with the MCM/ORC
cluster, suggesting a mutual dependence between the two processes. The binding profile of
SIR2o also has a close relationship with RAP1o, a TF involved in transcriptionally active
processes, which we will explore in more detail in the next section.
4.2.5
Active Processes and SIR2
Transcriptionally active processes, or simply active processes in our context, refer to biological mechanisms that occur at highly expressed genes. For example, the model introduced
in Section 4.1 [13] of how highly expressed genes bind to the nuclear pore describes an active process. To gauge how actively a gene is expressed in YPD, we downloaded data for
the number of mRNAs produced per hour, or the transcriptional frequency of a gene, as
measured in [12]. The results from [12] show that only about 21% of the genes produce
more than 6 mRNAs/hr. Therefore, we define genes with transcriptional frequency in the
top 20% as active. To determine whether a factor participates in active processes, we again
calculate filtered correlation coefficient, mutual information, and combined p-values between
62
the bound set of a factor and the set of active genes. Mathematically, we let A represent the
set of active genes and let Fi represent the set of genes classified as bound for a potential
active factor i. Then we define Fi,j = Fi ∪ A in the filtered correlation coefficient equations
(3.1) - (3.5), and use (3.6), (3.14), and (3.15) in succession to derive p-values. For finding
mutual information p-values, we can now assign u = |A|, v = |Fi |, h = |A ∩ Fi |, and w equal
to the number of genes with observed information for both data sets. Using these quantities,
we can use (3.28) to find the mutual information p-value. And finally, we again combine
p-values using (3.30) and consider a factor as active if it has a combined p-value ≤ 10−20 or
a mutual information p-value ≤ 10−10 (justified in Section 5.1).
In order to explore group-wise relationships between active factors, we categorized all
proteins as active or non-active using the stringent criterion above. Figure 4-4 displays the
results of clustering only the active factors using the SD data representation and a bound
cutoff of P V ≤ .015 ∨ P R ≥ .99 once again. Note that all the factors included have a
significant similarity distance of ≤ 283, or cluster combined p-value ≤ 10−40 . The results
provide interesting biological validation and exciting new insight. Figure 4-4 includes all the
nuclear proteins considered to have a preference for binding to active genes in [13] (CSE1,
NUP116, NIC96, MLP1, MLP2, XPO1, NUP2, KAP95, and NUP60) and in [14] (THO2,
NPL3, and HMT1), justifying the above criterion for finding active factors. Moreover, the
tight clustering of the active nuclear pore factors once again validates the model in [13],
which suggests that these factors share a biologically active mechanism. The large similarity
between nuclear proteins THO2, NPL3, and HMT1 also seems to confirm the results in
[14]. Yu et al. introduced a model showing that methyltransferase HMT1 plays a role in
disassociating transcriptional elongation protein THO2 and mRNA export factor NPL3.
Moreover, nonfunctional (mutated) HMT1 leads to failed export of mRNA produced at a
selected active gene. Since all three factors bind preferentially to highly expressed genes, they
seem to share an active biological process, whereby HMT1 dependent dissociation of THO2
and NPL3 may prove necessary for export of mRNAs produced by active genes. Further,
their close relationship with the nuclear pore proteins implies that this mechanism might
63
also occur at the periphery of the nucleus.
The red tree on the left part of Figure 4-4(b) illustrates another active biological process.
Histone acetyltransferases (HATs) ESA1 and GCN5 have a predilection for binding to active
genes [4], while data H3i, H2Bi, and iNuclsm measuring nucleosome depletion also overlaps
significantly with promoters of active genes [27]. Moreover, transcription factors FHL1 and
RAP1 regulate protein biosynthesis and other active processes [38]. The nearby congregation
of RAP1, nucleosome depletion, and HATs ESA1 and GCN5 seems to support a conjecture
mentioned in [27]. Berstein et al. showed that RAP1 is necessary but not sufficient for
the mechanism that displaces nucleosomes at highly expressed genes. Hence, the authors
conjectured that other factors, including ESA1 and GCN5, might be required along with
RAP1 in order to reposition nucleosomes at induced genes. Moreover the large similarity
between this process and the nuclear pore factors as seen in Figure 4-4(a) again implies that
this active process takes place at the periphery of the nucleus.
Surprisingly, Figure 4-4(b) also uncovers a great similarity between the binding of SIR2
at ORF regions, denoted as SIR2o, and other active proteins. The SIR complex does not
seem to bind to DNA directly but interacts with other factors, such as RAP1 or deacetylation
at histone 4, in order to access the nearby genes. This explains the association of SIR2 with
active factor RAP1 in Figure 4-3(b). However, SIR2 is a histone deacetylase that plays an
important role in silencing transcription at telomeres (or ends of chromosomes) and matingtype gene loci HML and HMR [21]. Hence, the designation of SIR2o as an active factor
and its significant binding similarity with all other active factors is unexpected. In Figure
4-4(a), SIR2o associates with several active transcription factors, including GAT3, FHL1,
RAP1, and YAP5. It also has a significant binding similarity with semi-active (combined
p-value ≤ 10−10 instead of ≤ 10−20 ) transcription factors PDR1, RGM1, and SMP1. These
transcription factors (TFs) regulate various different biological pathways, including biosynthesis, environmental response, and metabolism. However, all the enumerated TFs bind
preferentially to nucleosome depleted promoters in a functionally cooperative manner [27].
This may suggest a role for SIR2 in nucleosome displacement. Moreover, SIR2 deacetylates
64
(a) Active cluster, part I
(b) Active cluster, part II
Figure 4-4: Active biological processes and new insight using semi-supervised clustering.
Figures 4-4(a) and 4-4(a) display the results of clustering all active factors (see text), using
the SD data representation and a bound cutoff of P V ≤ .015 ∨ P R ≥ .99 once again. We
again set fthresh = 0.5 in order to consider genes bound by half or more of the factors in
a partition as representative of the biological process of the partition. All members of the
cluster associate significantly (combined p-value ≤ 10−40 ), but for visualization purposes
different colors represent clusters of distance ≤ 253 or of combined p-value ≤ 10−70 .
65
lysine 16 on histone 4, which is generally associated with active processes. Finding the full
extent of SIR2o’s role in active processes, and why it only appears to take place at ORF and
not intergenic regions may uncover exciting and new biological insight.
Figure 4-4 also includes several other factors with proposed roles in active processes. First,
histone deacetylase HOS2 was shown to play a role in deacetylation at active genes in [20] and
clusters tightly with the nuclear pore factors in Figure 4-4(a). Second, [6] shows that histone
methyltransferase SET1, also in Figure 4-4(a), adds three methyl groups at lysine 4 of histone
3 during transcription (i.e., at active genes), which may serve as memory that a gene was
recently turned “on”. Third, Santos-Rosa et al. also shows an intimate connection between
SET1 and nucleosome remodeling complex protein ISW1, shown in Figure 4-4(b) [25]. The
authors show that ISW1 recruitment at active genes is dependent upon methylation by
SET1. Further, the paper illustrates that proper transcription at selected genes depends
upon the collaborative function of both ISW1 and SET1. And finally, [17] shows that high
acetylation level at histone 3 lysine 18 at both intergenic and ORF regions correlates with
gene activity. Figure 4-4(b) includes the three factors, designated as H3K18o, H3K18no,
and H3K18ni, where “o”, “no”, and “ni” at the end refer to ORF, normalized ORF, and
normalized intergenic regions [17]. Our clustering analysis captures the collective relationship
between all the mentioned active factors for the first time in the literature. In the next
chapter, we use the theory developed thus far to build a multi-layered transcriptional network
of the nucleus.
66
Chapter 5
Network of the Nucleus
This chapter builds on the methodology developed in the previous chapters in order to build
a network of the nucleus. Our holistic approach is the first attempt in the literature to
quantify the communication between layers in eukaryotic transcriptional networks. Analysis
of the finalized network will hopefully unveil a general view of the non-linear process of
transcription. The next section describes the model we used to determine the interplay
between various proteins.
5.1
Network Model
In order to build a transcriptional network of the nucleus, we represent each factor as a
node (or oval) and use the pairwise statistics developed in Chapter 3 to find significant
binding relationships between factors, represented as edges (or links) between nodes. Using
the convention of italicizing sets but not members of sets, we again define the following
five categories of proteins: Transcription Factors (T F ), Nuclear Transport proteins (N T ),
Nuclear Processing proteins (N P ), Histone Modifiers and Nucleosome Remodelers (HM ),
and Histone State (HS) in terms of acetylation levels, methylation levels, and nucleosome
distribution. We use an undirected network model, since our ChIP-chip data cannot evince
directionality in binding dependencies, or whether factor i causes factor j to bind to gene g.
67
Therefore, an edge between two factors i and j contains zero or two arrows in our model,
lacking information about directionality.
Our network model consists of two types of edges—a positive and a negative edge representing a significant synergistic and an opposing binding relationship, respectively. Since our
mutual information p-values only evaluate the significance of positive binding relationships,
decisions on extending negative edges will solely depend on correlation analysis. Hence, we
use the two sided integration in (3.15) to evaluate the significance of finding both extremely
positive or negative binding relationships. To avoid ambiguity when using correlation analysis, we set the p-value between proteins i and j to one if ρ̂Xi ,Xj < 0 when considering positive
edges and if ρ̂Xi ,Xj ≥ 0 when considering negative edges.
For simplicity, we first consider protein-protein interactions, excluding all factors within
the HS level. Mathematically, let νi and νj denote the nodes (or vertices) for proteins i and
−
j, respectively. Then +
i,j and i,j represent a positive and a negative binding relationship
−
between i and j, respectively. Moreover, +
i,j and i,j can only equal to 1 or 0, corresponding
to a presence or absence of a significant binding relationship between i and j. We add a
positive edge (+
i,j = 1) between proteins i and j from all layers except the set HS if their
respective binding profiles xi and xj have a combined p-value pvC ≤ 10−20 as in (3.30) or a
minimized mutual information p-value pvM M I ≤ 10−10 as in (3.32):
i∈
/ HS, j ∈
/ HS :
−10
+
) ∨ (pvC (xi , xj ) ≤ 10−20 ) .
i,j = 1 if (pvM M I (xi , xj ) ≤ 10
(5.1)
The decision to use the or condition (denoted by the symbol ∨) above is substantiated by our
observation in Section 3.3.1 that mutual information analysis leads to very few false positive
claims at a significance level of 10−10 . In addition, the test incorporates the information
from the filtered correlation analysis in the evaluation of the combined p-value. Since filtered
correlation analysis leads to more false positives, it needs to combine its evidence with that
of the mutual information analysis and meet a stricter significance threshold to create an
edge.
68
Negative edges in the network prove much more complicated to assign and interpret.
Section 3.2.1 discusses the inherent difficulty with using mutual information p-values for
finding opposing binding relationships. Moreover, it is possible to consider the overlap
between extremely bound and unbound sets of genes but the biological interpretation of
significant overlap in that scenario is not clear. Hence, it seems reasonable to solely base
our decision on assigning negative edges in the network using filtered correlation coefficient
p-values. To assign a negative link (−
i,j = 1) between the binding profiles xi and xj of the
proteins i and j not in the set HS, we use a filtered correlation coefficient p-value pvF CC
threshold of 10−10 :
i∈
/ HS, j ∈
/ HS :
−10
−
).
i,j = 1 if (pvF CC (xi , xj ) ≤ 10
(5.2)
Data from the HS class of factors requires particular care. For proteins, it seems natural
to classify binding data at a gene as 0s and 1s, representing an absence or presence of
association with a gene’s DNA. However, ChIP-chip data of histone acetylation levels, for
example, can indicate gradations of acetylation (i.e., hypo, medium, hyperacetylation, etc.).
Hence, it seems more natural to consider the continuum of data for members of the HS layer.
Although mutual information analysis extends to continuous distributions, finding p-values
for this scenario that are consistent with the previously developed pairwise statistics proves
more difficult. Hence, we decided to use standard correlation coefficient p-values (using
ChIP-chip data at all genes) to quantify the relationships between two factors within the
HS layer. The p-values for standard correlation coefficient estimates, pvSCC , can also be
assigned using (3.6), (3.14), and (3.15) in succession. Mathematically, let i and j represent
two factors from the HS layer with binding profiles xi and xj . We extend positive and
negative links in our network model if the p-value pvSCC ≤ 10−20 :
i ∈ HS, j ∈ HS :
−20
+
) ∧ (ρ̂Xi ,Xj ≥ 0)
i,j = 1 if (pvSCC (xi , xj ) ≤ 10
(5.3)
i ∈ HS, j ∈ HS :
−20
−
) ∧ (ρ̂Xi ,Xj < 0) .
i,j = 1 if (pvSCC (xi , xj ) ≤ 10
(5.4)
69
In the above equations, ∧ stands for a logical and, requiring that positive and negative
interactions have a positive and negative estimate of the filtered correlation coefficient, respectively. Now let us consider relationships between one protein not in the set HS and one
factor from the set HS. Ultimately, we want to make statements such as genes bound by
transcription factor i have high acetylation. In order to use filtered correlation coefficient
analysis in this context, we need to let the non-HS factor determine the pertinent set of
dimensions or genes. Letting Fi denote the set of genes bound by the non-HS protein i, we
define the filtered set Fi,j = Fi in (3.1) - (3.5) and use (3.6), (3.14), and (3.15) in succession
to derive p-values. Mutual information p-value analysis again proves difficult to extend in
this context. For non-HS protein i and HS factor j, we assign positive and negative links
in our network using the following thresholds on our filtered correlation coefficient p-value
pvF CC :
i∈
/ HS, j ∈ HS :
−4
+
i,j = 1 if (pvF CC (xi , xj ) ≤ 10 ) ∧ ρ̂Xi ,Xj ≥ 0
(5.5)
i∈
/ HS, j ∈ HS :
−4
−
i,j = 1 if (pvF CC (xi , xj ) ≤ 10 ) ∧ ρ̂Xi ,Xj < 0 .
(5.6)
Using a lower threshold for assigning links in the scenario above is necessary due to the lower
dimensionality of the data. Since, our filtered set now only contains the set of genes defined
as bound by a single factor, it often proves impossible to achieve such high significance
thresholds as in the previous analyses. Moreover, we used an empirical, non-parametric
method for calculating p-values for the above scenario in order to check if they matched our
analytical p-value derivation in Section 3.1.1. The algorithm first finds the filtered correlation
for the genes bound by the non-HS factor, g ∈ Fi , as described above. Then it picks 10000
random sets of genes of the same size, |Fi |, finds the filtered correlation coefficient for each
random set and counts the n filtered correlation coefficients that exceed that of factor i.
Then, the non-parametric p-value equals
n
.
10000
Both approaches achieve very similar p-
values, proving the reliability of our method for finding p-values introduced in Section 3.1.1.
The next section considers how we can exploit clustering in order to uncover other significant
70
interactions omitted by our initial thresholds.
5.1.1
Assigning Links: Second Pass
The thresholding approach for finding significant binding relationships introduced in the
previous section has an inherent sensitivity to the chosen cutoffs. To alleviate this problem
and to minimize the chance of false negatives, this section describes a second pass in our
algorithm for assigning links. As Section 4.2.4 illustrated, clustering ChIP-chip data at a
stringent significance cutoff (combined p-value ≤ 10−70 ) evinces various known biological
complexes. Moreover, we noticed in Section 3.3.1 that we can infer the binding relationship
between POL3 and RSC3 based on the interactions between POL3 and other components
of the RSC complex (i.e. RSC8, RSC). Generally, we can deduce that protein i’s binding
profile xi has a significant relationship with protein j1 ’s binding profile xj1 contained in
cluster Cj , xj1 ∈ Cj , if the combined binding relationship between xi and all the binding
vector components of Cj is significant. Using (3.31), we can find the total p-value for the
binding relationship between xi and the cluster Cj , or pvT (xi , Cj ), by combining the pvalues for the individual binding dependencies between xi and each component of Cj , where
Cj = {xj1 , . . . , xj|Cj | }:
|Cj |−1
pvT (xi , Cj ) = pvC (pv(xi , xj1 ) · · · pv(xi , xj|Cj | )) = k|Cj |
X (− ln k|Cj | )r
.
r!
r=0
(5.7)
Equation 5.7 only depends on the number of p-values combined, |Cj |, and the product of
Q|Cj |
pv(xi , xjm ), as in (3.31). However, (5.7) incorporates p-values
the p-values k|Cj | = m=1
based on binding relationships between independent data sets xi , xj1 , . . . , xj|Cj | . In this
context, our use of (3.31) satisfies the independent studies assumption and we do not have
to double the p-values as in Section 3.3.1. Moreover, (5.7) does not specify which type
of p-values to combine. For a possible positive or negative link, our network algorithm
incorprates the combined pvC (xi , xj ) or filtered correlation coefficient pvF CC (xi , xj ) p-values,
respectively. For extending a potential positive or negative link, we consider a total p-value
71
pvT (xi , Cj ) ≤ 10−20 or ≤ 10−10 as significant, respectively. With the additional evidence that
protein i’s binding profile associates significantly with the biologically related components
of Cj , the threshold for significant binding relationship between i and j1 should decrease.
Repeating this process for every possible combination of i and j1 , where j1 ∈ Cj , the second
pass of our network algorithm assigns positive and negative edges as follows:
−20
] ∧ [(pvM M I (xi , xj1 ) ≤ 10−7.5 ) ∨ (pvC (xi , xj1 ) ≤ 10−15 )]
+
i,j1 = 1 if [pvT (xi , Cj ) ≤ 10
−10
−
] ∧ [(pvF CC (xi , xj1 ] ≤ 10−7.5 )] .
i,j1 = 1 if [pvT (xi , Cj ) ≤ 10
(5.8)
The first pass of the network algorithm found 4692 positive and 2629 negative binding
relationships, while the second pass uncovered 806 and 574 more significant binding relationships, respectively. The larger number of positive links reflects the fact that significant
positive binding relationships are easier to find using ChIP-chip data than significant negative binding relationships. Actual visualization of all the significant links proves unwieldy.
Therefore, Figure 5-1 displays only highly significant positive binding relationships that have
a minimized mutual information p-value ≤ 10−20 or combined p-value ≤ 10−40 . Different colored ovals represent factors from the various levels and lines between ovals denote significant
synergistic interactions between factors. Inspecting the graph, one can notice neighborhoods
of nodes that share common biological processes. For example, the MCM-ORC cluster occupies the top left corner of Figure 5-1, with the SIR complex directly to its right. Continuing
to the right from the SIR complex, one can observe a congregation of nuclear transport
proteins, factors from the active cluster, the POL3 machinery, and the RSC complex at the
middle right end of the figure. Having built a network of the nucleus, the next section verifies
the ability of our network model to predict protein-protein interactions by comparing it with
several previous protein-protein studies.
72
Figure 5-1: Network of highly significant positive binding relationships that have a minimized mutual information p-value ≤ 10−20 or combined p-value ≤ 10−40 . Colored ovals
represent factors from the various levels and lines between ovals denote significant synergistic interactions between factors. Different colors represent the five layers: yellow = T F ,
green = HS, red = HM , blue = N P , and pink = N T . Graph rendering was performed
with Pajek (http://vlado.fmf.uni-lj.si/pub/networks/pajek/doc/pajekman.htm).
73
Data Set
Interactions
Yu et al.
395
Gavin et al.
462
Ho et al.
137
Ito et al.
43
Uetz et al.
17
Overlap
P-Value
134
7.42×10−46
129
3.11×10−34
45
9.32×10−16
8
0.0280
4
0.0521
Table 5.1: Comparison between various protein-protein interaction data sets and our network. The first column lists the source of the data. The second column records the number
of predictions from each data set between proteins considered in this work. The third column
lists the overlap in predicted significant interactions between the five studies and our network.
The last column finds p-values for the significance in the overlap using the hypergeometric
test statistic discussed in Section 3.2.1 (see text).
5.1.2
Network Comparison
Numerous previous large scale and small scale methods have predicted protein-protein interactions with varied success. For comparison, we downloaded data from five protein-protein
interaction studies [30–34]. Ito et al. and Uetz et al. used an experimental technique called
Yeast Two-Hybrid (Y2H) to infer associations between proteins, while Gavin et al. and Ho
et al. used a more accurate mass spectrometry method. Yu et al. incorporated the significant predictions from the above data sets along with interactions found using several
small scale studies. Small scale studies generally have less noise in their predictions than
the high-throughput methods in [31–34]; therefore, the Yu et al. data set is probably most
reliable.
Table 5.1 demonstrates the results of the comparison. The first column lists the source of
the data. The second column records the number of predictions from each data set between
proteins considered in this work. The third column lists the overlap in predicted significant
interactions between the five studies and our network. The last column finds p-values for
the significance in the overlap using the hypergeometric test statistic discussed in Section
3.2.1. The p-values were calculated by setting u equal to entries in the second column, h to
entries in the third column, v = 3771, or the number of significant positive protein-protein
binding relationships in our network (i.e., excluding edges with factors in the HS layer), and
w = 43956, or all possible protein-protein relationships that could be predicted.
74
The table above shows that the three most reliable data sets overlap significantly with
the predictions made in our network. The Y2H data does not have enough predictions in
common with our data set in order to accurately determine the significance in the overlap.
Moreover, the Y2H technique is considered to have the most amount of noise. Generally,
we do not expect a full overlap between the methods since they target different biological
mechanisms. Our method finds biological complexes that associate at or near DNA but does
not capture other cellular protein-protein interactions. For example, most of the above data
sets show that protein Hrp1 and Nab2 form a complex but our analysis demonstrates that
the two proteins have very different binding profiles. Moreover, our method can capture
binding dependencies between proteins that consistently bind to the same gene targets and
hence participate in similar biological processes, but that do not necessarily bind at the same
time or associate with one another. And finally, it is generally agreed upon that the data
sets above have a large amount of noise. In summary, this comparison validates the ability of
our method to find protein-protein interactions predicted in previous studies. Having fully
specified the network structure enables us to infer the underlying communication between
layers in the eukaryotic transcriptional network in the next section.
5.2
Analysis of Network Topology
This section introduces several statistical methods for quantifying the interplay between the
layers in our yeast transcriptional network. Table 5.2 shows the number of links within and
between the five categories of factors (namely T F , HS, HM , N P , and N T ) for the entire
and the positive edge network. Each entry within the table represents the total number of
links and the percentage of possible links realized (in parenthesis) between factors from the
layers listed in the corresponding entry in the first row and column. For example, the first
entry in Table 5.2(a) shows that there are 3659 links between pairs of transcription factors,
which represents 17.2% of the total possible links that could be realized between all pairs
of TFs. Percentage of links realized represents a normalized measure of the connectedness
between two layers, which accounts for the number of factors in each layer. Given two layers
75
Layers
TF
HS
HM
NP
NT
TF
HS
HM
3659(17.2%) 1018(9.28%) 737(7.58%)
1018(9.28%) 616(44.7%) 776(31.2%)
737(7.58%) 776(31.2%) 198(18.3%)
409(7.32%) 236(16.5%) 208(16.4%)
171(5.16%) 276(32.5%) 100(13.3%)
NP
409(7.32%)
236(16.5%)
208(16.4%)
118(33.6%)
97(22.5%)
NT
171(5.16%)
276(32.5%)
100(13.3%)
97(22.5%)
95(79.2%)
(a) Total links and percentage of links captured for the entire network.
Layers
TF
HS
TF
2099(9.84%) 602(5.49%)
HS
602(5.49%) 513(37.2%)
HM
519(5.33%)
323(13%)
NP
328(5.87%) 151(10.6%)
NT
108(3.26%) 138(16.3%)
HM
519(5.33%)
323(13%)
161(14.9%)
178(14%)
85(11.3%)
NP
328(5.87%)
151(10.6%)
178(14%)
113(32.2%)
93(21.5%)
NT
108(3.26%)
138(16.3%)
85(11.3%)
93(21.5%)
87(72.5%)
(b) Total links and percentage of links captured for the positive edge network.
Table 5.2: Total links and percentage of links realized within and between the five layers
of the (a) entire and the (b) positive edge transcriptional network. Each entry within each
table represents the total number of links and the percentage of possible links realized (in
parenthesis) between factors from the layers listed in the corresponding entry in the first
row and column. For example, the first entry in Table (a) shows that there are 3659 links
between pairs of transcription factors, which represents 17.2% of the total possible links that
could be realized between all pairs of TFs.
with n and m factors, respectively, and k mutual interactions the inter-level percentage of
k
links realized is simply 100 nm
%. Moreover, if a layer with n elements has l interactions
2l
within its layer, the intra-level percentage of links realized is 100 n(n−1)
%.
Table 5.2(a) measures the level of connectedness or communication within and between
layers. By examining the percentage of links realized, we see that most of the communication
occurs within layers. The diagonal in Table 5.2(a) shows that the five layers have 17.2%,
44.7%, 18.3%, 33.6%, and 79.2% of intra-connectedness. Moreover, inspecting the entries in
row 1 of Table 5.2(a) from left to right, we see that the percentages gradually decrease, with
TFs being most highly associated with HSs, followed by HMs, NPs, and NTs. These results
confirm the current thought in biology that close collaboration between TFs, HSs, and HMs
76
induces specific classes of genes while NPs and NTs carry out more general mechanisms
related to transcription. However, there are still a number of interactions between the
levels, illustrating that significant amount of communication between layers of the yeast
transcriptional network.
Several other network statistics can also quantify the behavior of the different classes of
proteins. For, example we can count the number of links stemming from each node, or the
number of degrees, and find a distribution of degrees for nodes within a layer. Computing
the average of that distribution measures the average level of connectivity for members from
each layer. Table 5.3 shows the average number of degrees for proteins within the five layers
in both the entire and the positive edge network. Members of the HS level seem most
promiscuous in their association with other factors, but, overall, all layers have a similar
number of average degrees.
Clustering coefficient is a common method for measuring the cliquishness of each node
in the network, or how much the node’s nearest neighbors (adjacent nodes) interact with
one another. Finding the average clustering coefficient for all proteins within a particular
class measures the tendency for cliquishness in each layer. Let node i of layer L have ki
nearest neighbors. Moreover, let its ki adjacent nodes have ci connections between them
out of a possible
ki (ki −1)
2
links. Then the clustering coefficient at node i, or the fraction of
links realized between i’s nearest neighbors becomes
2ci
.
ki (ki −1)
Finally, the average clustering
coefficient for all factors i in layer L, CCL , takes the form:
CCL =
1 X
2ci
|L| i∈L ki (ki − 1)
(5.9)
Table 5.3 shows the average clustering coefficient for proteins within the five layers in both
the entire and the positive edge network. The preference for cliquish interaction between
nearest neighbors of nodes seems strongest amongst members of the N T class of proteins.
Given that almost all nuclear transport factors considered seem to associate with members of
the nuclear pore, this measure seems consistent with biological mechanism of the N T class.
In summary, the network topology statistics discussed in this section confirm that the
77
Layer Avg. Number Avg. Clustering
Name
of Degrees
Coefficient
TF
47.2
0.458
HS
68.9
0.427
HM
47.3
0.495
NP
43.9
0.539
NT
52.1
0.631
All
50.5
0.473
(a) Network topology statistics for the entire
network.
Layer Avg. Number Avg. Clustering
Name
of Degrees
Coefficient
TF
27.8
0.526
HS
42.3
0.482
HM
30.4
0.539
NP
36.1
0.575
NT
37.4
0.676
All
31.4
0.532
(b) Network topology statistics for the positive
edge network.
Table 5.3: Network topology statistics for (a) the entire and (b) the positive edge transcriptional network.
78
T F , HM , and HS levels seem most tightly coupled. However, over a thousand significant interactions also occur between factors outside of these subsets, demonstrating that eukaryotic
transcription depends on the intricate interplay between all five layers presented.
79
80
Chapter 6
Conclusion
Genome-wide binding data from ChIP-chip experiments provides a wealth of information
about protein-gene interactions and about the underlying workings of eukaryotic cells. In
order to gain a holistic understanding of the non-linear process of transcription, our work
examines the communication between various classes of regulators in the yeast specie Saccharomyces cerevisiae. We used ChIP-chip data to quantify the interplay between five categories
of factors that affect transcription: histone states, histone modifiers and nucleosome remodelers, transcription factors, nuclear processing proteins, and nuclear transport factors.
Following the introduction in Chapter 1, Chapter 2 described the non-trivial process of
incorporating the different sources of ChIP-chip data into a coherent set. Due to the heterogeneity of the obtained data, this chapter further discussed the need for data normalization
and showed the benefits of a normalized data set. Chapter 3 used the processed data to
find pairwise statistics that test binding dependencies between two proteins. Combining the
complementary pairwise measures of filtered correlation coefficient and mutual information
p-values reduced the false positive and false negative rates in our biological predictions and
increased the reliability of our analysis. Next, Chapter 4 uncovered group-wise relationships
between factors using Principal Component Analysis (PCA) and clustering. PCA allowed us
to visualize subsets of the data in 2-D. Based on the developed pairwise measures, Chapter
4 also introduced a novel semi-supervised clustering algorithm that preserves information
81
about elements of a cluster in order to better capture group-wise dependencies between proteins. And finally, Chapter 5 combined the methodology developed in the previous chapters
in order to build a multi-layered transcriptional network of the nucleus.
Throughout the theoretical analysis, we validated various known biological processes that
occur in dextrose rich conditions. Moreover, our analysis and previous literature [7] showed
that usage ChIP-chip experiments in rich media YPD conditions can still provide insight
about biological processes in other growth environments. Further, the fact that our data
does not synchronize cells at a particular phase of growth, does not preclude formulation of
hypothesis about biological processes that occur only in a specific phase of the cell cycle,
as shown with the MCM-ORC complex in Section 4.2.4. Our theoretical analysis also uncovered several novel biological hypothesis. In particular, Chapter 3 suggests that proteins
LRP1 and RRP6 might participate in a mechanism that ensures quality control at falsely
transcribed genes, by binding to inactive genes in order to readily degrade unwanted production of their mRNAs. Moreover, Chapter 4 hypothesized that SIR2, long thought to silence
the transcription of DNA, might have a role in transcriptional activation at the ORF region
of genes. And finally, Chapter 5 quantified the communication between layers in biological
transcriptional networks for the first time in the literature. The network topology statistics
discussed in Chapter 5 confirm that the T F , HM , and HS levels seem most tightly coupled. However, over a thousand significant interactions also occur between factors outside of
these three categories, demonstrating that eukaryotic transcription depends on the intricate
interplay between all five layers presented.
82
Appendix A
Semi-Supervised Clustering based on
KL-divergence
Semi-supervised clustering is more theoretically intuitive when objects are represented using
distibutions. We chose to define the probability distribution of object i as the normalized
binding profile of binding ratios across all genes, or normalized rows in the BR matrix.
Specifically, if xi = [xi,g1 . . . xi,g|G| ] denotes a row vector from the BR matrix for protein i
across all genes g in the set G, the normalized binding profile (or binding distribution) for
protein i is
xi,g
.
g∈G xi,g
P (g|i) = P
(A.1)
We implemented a semi-supervised clustering algorithm that can perform hierarchical
partitioning of distributions as defined in (A.1). We treat the binding distribution P (g|i) for
each protein i as an object (or node) and merge objects into clusters (aggregations of nodes)
based on a distance metric using the Kullback-Leibler divergence.
The KL-divergence represents how different two distributions are. For the normalized
binding profiles, it is defined as
83
D(P (g|Ck )kP (g|Ck ∪ Cl )) =
X
g∈G
P (g|Ck ) log
P (g|Ck )
,
P (g|Ck ∪ Cl )
(A.2)
where Ck and Cl represent two clusters and Ck ∪ Cl denotes the merged partition of Ck and
Cl . In information theoretic terms, the KL-divergence represents the information lost by
joining Ck and Cl into a combined distribution for Ck ∪ Cl . Note that the KL-divergence is
not a distance metric; it does not satisfy the triangle inequality nor is it commutative. For
the semi-supervised clustering algorithm, we define the following distance measure based on
the KL-divergence:
d(Ck , Cl ) = |Ck |D(P (g|Ck )kP (g|Ck ∪ Cl )) + |Cl |D(P (g|Cl )kP (g|Ck ∪ Cl )) .
(A.3)
The above equation (A.3) can intuitively be interpreted as the total information lost in
combining the distributions of Ck and Cl into P (g|Ck ∪ Cl ), where the weights |Ck | and
|Cl | account for the number of nodes in each cluster. When a partition Ck contains just a
single protein, |Ck | = 1. Note that this distance, which we refer to as the KL-distance, is
commutative.
The hierarchical clustering proceeds by initially treating all objects as clusters and then
successively merging partitions that are “closest”, in terms of the KL distance in (A.3).
When objects are merged into clusters, the new clusters preserve information about their
component nodes in their distribution. The combined distribution is a weighted average of
the constituent distributions:
P (g|Ck ∪ Cl ) =
1
(|Ck |P (g|Ck ) + |Cl |P (g|Cl )) .
|Ck | + |Cl |
(A.4)
The performance of the clustering algorithm is illustrated in Figure A-1. Figure A-1
partitions the nuclear processing and nuclear transport factors and shows similar results as
the PCA analysis in Section 4.1. The related NTs cluster together once again. In addition, HMT1 (a protein that adds methyl groups to other proteins), THO2 (a protein that
84
starts transcription), and YRA1 (a protein that exports mRNA from the nucleus) also cluster
tightly with the NTs in [13], suggesting a common regulatory mechanism. Moreover, proteins
RRP6, an mRNA surveillance factor, and mRNA export factor NPL3 cluster together, supporting the results in [16] that demonstrate coupling of mRNA quality assurance to mRNA
export. The algorithm also captures the relationship between LRP1 and PRP20 discussed
in section 3.3.1. However, when the algorithm is run on the entire data, it fails to find
commonality between several known biological processes. In the end, the pairwise statistics
discussed in Chapter 3 prove much more biologically meaningful than KL-divergence.
85
1.2
Distance
1
0.8
0.6
0.4
0.2
Genes
0
YAL002W
YAL001C
YAL004W
YAL003W
YAL005C
YAL008W
YAL007C
YAL009W
YAL011W
YAL010C
YAL013W
YAL012W
YAL016W
YAL015C
YAL014C
YAL017W
YAL018C
YAL019W
YAL020C
YAL023C
YAL022C
YAL021C
YAL025C
YAL024C
YAL028W
YAL027W
YAL026C
YAL030W
YAL029C
YAL033W
YAL032C
YAL031C
YAL035C−A
YAL034C
YAL035W
YAL036C
YAL038W
YAL037W
YAL039C
YAL041W
YAL040C
YAL043C−A
YAL042W
YAL043C
YAL046C
YAL044C
YAL048C
YAL047C
YAL053W
YAL051W
YAL049C
YAL055W
YAL054C
YAL058C−A
YAL056W
YAL058W
YAL060W
YAL059W
YAL062W
YAL061W
YAL063C
YAL064W
YAL065C
YAL068C
YAL067C
YAR003W
YAR002W
YAR007C
YAR008W
YAR009C
YAR015W
YAR014C
YAR010C
YAR019C
YAR018C
YAR027W
YAR023C
YAR020C
YAR029W
YAR028W
YAR033W
YAR031W
YAR044W
YAR042W
YAR035W
YAR053W
YAR050W
YAR062W
YAR061W
YAR060C
YAR066W
YAR064W
YAR068W
YAR070C
YAR069C
YAR073W
YAR071W
YAR075W
YBL001C
YBL002W
YBL004W
YBL003C
YBL005W−A
YBL005W
YBL005W−B
YBL007C
YBL006C
YBL009W
YBL008W
YBL011W
YBL010C
YBL012C
YBL013W
YBL014C
YBL016W
YBL015W
YBL019W
YBL018C
YBL017C
YBL020W
YBL021C
YBL024W
YBL023C
YBL022C
YBL026W
YBL025W
YBL027W
YBL029W
YBL028C
YBL031W
YBL030C
YBL032W
YBL033C
YBL036C
YBL035C
YBL034C
YBL038W
YBL037W
YBL041W
YBL040C
YBL039C
YBL043W
YBL042C
YBL044W
YBL045C
YBL046W
YBL049W
YBL047C
YBL050W
YBL051C
YBL054W
YBL053W
YBL052C
YBL056W
YBL055C
YBL059W
YBL058W
YBL057C
YBL060W
YBL061C
YBL063W
YBL062W
YBL067C
YBL066C
YBL064C
YBL069W
YBL068W
YBL072C
YBL071C
YBL070C
YBL075C
YBL074C
YBL079W
YBL078C
YBL076C
YBL081W
YBL080C
YBL083C
YBL082C
YBL085W
YBL084C
YBL086C
YBL088C
YBL087C
YBL090W
YBL089W
YBL091C
YBL092W
YBL093C
YBL095W
YBL094C
YBL096C
YBL098W
YBL097W
YBL099W
YBL100C
YBL101W−B
YBL101W−A
YBL101C
YBL102W
YBL103C
YBL106C
YBL105C
YBL104C
YBL109W
YBL107C
YBL113C
YBL112C
YBL111C
YBR002C
YBR001C
YBR003W
YBR004C
YBR006W
YBR005W
YBR007C
YBR009C
YBR008C
YBR012W−A
YBR010W
YBR011C
YBR012W−B
YBR013C
YBR016W
YBR015C
YBR014C
YBR018C
YBR017C
YBR020W
YBR019C
YBR022W
YBR021W
YBR023C
YBR024W
YBR025C
YBR028C
YBR027C
YBR026C
YBR030W
YBR029C
YBR033W
YBR031W
YBR036C
YBR035C
YBR034C
YBR038W
YBR037C
YBR041W
YBR040W
YBR039W
YBR043C
YBR042C
YBR046C
YBR045C
YBR044C
YBR048W
YBR047W
YBR050C
YBR049C
YBR051W
YBR053C
YBR052C
YBR054W
YBR055C
YBR056W
YBR058C
YBR057C
YBR060C
YBR059C
YBR063C
YBR062C
YBR061C
YBR064W
YBR065C
YBR067C
YBR066C
YBR070C
YBR069C
YBR068C
YBR072W
YBR071W
YBR075W
YBR074W
YBR073W
YBR076W
YBR077C
YBR078W
YBR080C
YBR079C
YBR082C
YBR081C
YBR084C−A
YBR083W
YBR085W
YBR084W
YBR086C
YBR087W
YBR088C
YBR089W
YBR091C
YBR090C
YBR093C
YBR092C
YBR094W
YBR096W
YBR095C
YBR098W
YBR097W
YBR100W
YBR099C
YBR103W
YBR102C
YBR101C
YBR104W
YBR105C
YBR106W
YBR108W
YBR107C
YBR110W
YBR109C
YBR114W
YBR112C
YBR111C
YBR117C
YBR115C
YBR119W
YBR118W
YBR122C
YBR121C
YBR120C
YBR124W
YBR123C
YBR127C
YBR126C
YBR125C
YBR129C
YBR128C
YBR131W
YBR130C
YBR134W
YBR133C
YBR132C
YBR136W
YBR135W
YBR139W
YBR137W
YBR140C
YBR142W
YBR141C
YBR146W
YBR145W
YBR143C
YBR148W
YBR147W
YBR149W
YBR150C
YBR153W
YBR152W
YBR151W
YBR155W
YBR154C
YBR158W
YBR157C
YBR156C
YBR160W
YBR159W
YBR162W−A
YBR161W
YBR162C
YBR163W
YBR164C
YBR165W
YBR166C
YBR168W
YBR167C
YBR169C
YBR171W
YBR170C
YBR175W
YBR173C
YBR172C
YBR176W
YBR177C
YBR178W
YBR180W
YBR179C
YBR182C
YBR181C
YBR184W
YBR183W
YBR187W
YBR186W
YBR185C
YBR189W
YBR188C
YBR192W
YBR191W
YBR193C
YBR194W
YBR195C
YBR198C
YBR197C
YBR196C
YBR200W
YBR199W
YBR202W
YBR201W
YBR203W
YBR205W
YBR204C
YBR207W
YBR206W
YBR210W
YBR209W
YBR208C
YBR212W
YBR211C
YBR215W
YBR214W
YBR213W
YBR217W
YBR216C
YBR220C
YBR218C
YBR224W
YBR222C
YBR221C
YBR225W
YBR227C
YBR228W
YBR230C
YBR229C
YBR233W
YBR231C
YBR235W
YBR234C
YBR237W
YBR236C
YBR238C
YBR240C
YBR239C
YBR242W
YBR241C
YBR243C
YBR244W
YBR245C
YBR246W
YBR248C
YBR247C
YBR250W
YBR249C
YBR252W
YBR251W
YBR253W
YBR255W
YBR254C
YBR257W
YBR256C
YBR259W
YBR258C
YBR260C
YBR263W
YBR261C
YBR267W
YBR265W
YBR264C
YBR268W
YBR269C
YBR271W
YBR270C
YBR274W
YBR273C
YBR272C
YBR276C
YBR275C
YBR279W
YBR278W
YBR280C
YBR282W
YBR281C
YBR285W
YBR284W
YBR283C
YBR287W
YBR286W
YBR289W
YBR288C
YBR290W
YBR293W
YBR291C
YBR295W
YBR294W
YBR297W
YBR296C
YBR298C
YBR299W
YBR300C
YBR301W
YCL004W
YCL002C
YCL005W
YCL006C
YCL008C
YCL007C
YCL011C
YCL010C
YCL009C
YCL014W
YCL016C
YCL019W
YCL018W
YCL017C
YCL020W
YCL022C
YCL024W
YCL023C
YCL025C
YCL028W
YCL027W
YCL030C
YCL029C
YCL032W
YCL031C
YCL033C
YCL034W
YCL035C
YCL036W
YCL038C
YCL037C
YCL039W
YCL041C
YCL042W
YCL043C
YCL047C
YCL045C
YCL044C
YCL048W
YCL049C
YCL051W
YCL050C
YCL052C
YCL055W
YCL054W
YCL057W
YCL056C
YCL058C
YCL061C
YCL059C
YCL063W
YCL064C
YCL066W
YCL065W
YCL067C
YCL069W
YCL068C
YCR001W
YCL074W
YCL073C
YCR003W
YCR002C
YCR006C
YCR005C
YCR004C
YCR008W
YCR007C
YCR010C
YCR009C
YCR012W
YCR011C
YCR013C
YCR015C
YCR014C
YCR016W
YCR018C
YCR017C
YCR019W
YCR020C
YCR020C−A
YCR023C
YCR021C
YCR024C−A
YCR024C
YCR026C
YCR025C
YCR030C
YCR028C
YCR027C
YCR033W
YCR032W
YCR034W
YCR036W
YCR035C
YCR038C
YCR037C
YCR041W
YCR040W
YCR039C
YCR043C
YCR042C
YCR045C
YCR044C
YCR048W
YCR047C
YCR046C
YCR051W
YCR050C
YCR053W
YCR052W
YCR054C
YCR059C
YCR057C
YCR063W
YCR061W
YCR060W
YCR066W
YCR065W
YCR068W
YCR067C
YCR069W
YCR072C
YCR071C
YCR075C
YCR073C
YCR079W
YCR077C
YCR076C
YCR082W
YCR081W
YCR083W
YCR084C
YCR089W
YCR088W
YCR086W
YCR091W
YCR090C
YCR094W
YCR093W
YCR092C
YCR096C
YCR095C
YCR101C
YCR100C
YCR098C
YCR103C
YCR102C
YCR105W
YCR104W
YCR107W
YCR106W
YDL001W
YDL003W
YDL002C
YDL004W
YDL006W
YDL005C
YDL008W
YDL007W
YDL010W
YDL009C
YDL011C
YDL013W
YDL012C
YDL014W
YDL015C
YDL017W
YDL016C
YDL018C
YDL020C
YDL019C
YDL022W
YDL021W
YDL023C
YDL025C
YDL024C
YDL029W
YDL028C
YDL027C
YDL031W
YDL030W
YDL035C
YDL033C
YDL038C
YDL037C
YDL036C
YDL040C
YDL039C
YDL044C
YDL043C
YDL042C
YDL046W
YDL045C
YDL047W
YDL049C
YDL048C
YDL051W
YDL050C
YDL053C
YDL052C
YDL056W
YDL055C
YDL054C
YDL058W
YDL057W
YDL060W
YDL059C
YDL061C
YDL062W
YDL063C
YDL064W
YDL065C
YDL066W
YDL068W
YDL067C
YDL070W
YDL069C
YDL073W
YDL072C
YDL071C
YDL075W
YDL074C
YDL078C
YDL077C
YDL076C
YDL080C
YDL079C
YDL082W
YDL081C
YDL085W
YDL084W
YDL083C
YDL086W
YDL087C
YDL089W
YDL088C
YDL090C
YDL092W
YDL091C
YDL093W
YDL095W
YDL094C
YDL097C
YDL096C
YDL099W
YDL098C
YDL102W
YDL101C
YDL100C
YDL104C
YDL103C
YDL105W
YDL107W
YDL106C
YDL108W
YDL109C
YDL112W
YDL111C
YDL110C
YDL114W
YDL113C
YDL116W
YDL115C
YDL118W
YDL117W
YDL119C
YDL120W
YDL121C
YDL124W
YDL123W
YDL122W
YDL126C
YDL125C
YDL129W
YDL128W
YDL127W
YDL131W
YDL130W
YDL133W
YDL132W
YDL136W
YDL135C
YDL134C
YDL138W
YDL137W
YDL141W
YDL140C
YDL139C
YDL143W
YDL142C
YDL146W
YDL145C
YDL144C
YDL147W
YDL148C
YDL150W
YDL149W
YDL152W
YDL151C
YDL153C
YDL155W
YDL154W
YDL156W
YDL158C
YDL157C
YDL159W
YDL160C
YDL163W
YDL161W
YDL165W
YDL164C
YDL166C
YDL168W
YDL167C
YDL170W
YDL169C
YDL171C
YDL173W
YDL172C
YDL176W
YDL175C
YDL174C
YDL178W
YDL177C
YDL180W
YDL179W
YDL182W
YDL181W
YDL183C
YDL185W
YDL184C
YDL186W
YDL188C
YDL187C
YDL189W
YDL190C
YDL193W
YDL192W
YDL191W
YDL195W
YDL194W
YDL198C
YDL197C
YDL201W
YDL200C
YDL199C
YDL202W
YDL203C
YDL204W
YDL206W
YDL205C
YDL208W
YDL207W
YDL210W
YDL209C
YDL211C
YDL212W
YDL213C
YDL215C
YDL214C
YDL218W
YDL217C
YDL216C
YDL219W
YDL220C
YDL224C
YDL223C
YDL222C
YDL225W
YDL226C
YDL229W
YDL228C
YDL227C
YDL230W
YDL231C
YDL233W
YDL232W
YDL236W
YDL235C
YDL234C
YDL237W
YDL238C
YDL241W
YDL240W
YDL239C
YDL244W
YDL243C
YDL247W
YDL246C
YDL245C
YDR002W
YDR001C
YDR004W
YDR003W
YDR007W
YDR006C
YDR005C
YDR011W
YDR009W
YDR014W
YDR013W
YDR012W
YDR016C
YDR015C
YDR018C
YDR017C
YDR021W
YDR020C
YDR019C
YDR023W
YDR022C
YDR025W
YDR027C
YDR026C
YDR029W
YDR028C
YDR031W
YDR030C
YDR032C
YDR033W
YDR034C
YDR035W
YDR036C
YDR037W
YDR039C
YDR038C
YDR041W
YDR040C
YDR044W
YDR043C
YDR042C
YDR046C
YDR045C
YDR047W
YDR049W
YDR048C
YDR051C
YDR050C
YDR053W
YDR052C
YDR055W
YDR054C
YDR056C
YDR057W
YDR058C
YDR061W
YDR060W
YDR059C
YDR063W
YDR062W
YDR065W
YDR064W
YDR066C
YDR068W
YDR067C
YDR070C
YDR069C
YDR073W
YDR072C
YDR071C
YDR075W
YDR074W
YDR077W
YDR076W
YDR078C
YDR080W
YDR079W
YDR083W
YDR082W
YDR081C
YDR085C
YDR084C
YDR087C
YDR086C
YDR089W
YDR088C
YDR090C
YDR092W
YDR091C
YDR096W
YDR094W
YDR093W
YDR098C
YDR097C
YDR100W
YDR099W
YDR101C
YDR103W
YDR104C
YDR106W
YDR105C
YDR108W
YDR107C
YDR109C
YDR110W
YDR111C
YDR112W
YDR114C
YDR113C
YDR115W
YDR116C
YDR118W
YDR117C
YDR119W
YDR121W
YDR120C
YDR122W
YDR123C
YDR124W
YDR126W
YDR125C
YDR128W
YDR127W
YDR131C
YDR130C
YDR129C
YDR134C
YDR132C
YDR137W
YDR135C
YDR138W
YDR140W
YDR139C
YDR142C
YDR141C
YDR145W
YDR144C
YDR143C
YDR147W
YDR146C
YDR150W
YDR149C
YDR148C
YDR152W
YDR151C
YDR156W
YDR153C
YDR160W
YDR158W
YDR157W
YDR161W
YDR162C
YDR163W
YDR165W
YDR164C
YDR167W
YDR166C
YDR168W
YDR170C
YDR169C
YDR172W
YDR173C
YDR174W
YDR175C
YDR178W
YDR177W
YDR176W
YDR179W−A
YDR179C
YDR180W
YDR182W
YDR181C
YDR183W
YDR184C
YDR189W
YDR188W
YDR185C
YDR191W
YDR190C
YDR193W
YDR192C
YDR195W
YDR194C
YDR196C
YDR197W
YDR198C
YDR199W
YDR201W
YDR200C
YDR204W
YDR203W
YDR206W
YDR205W
YDR207C
YDR210W
YDR208W
YDR212W
YDR211W
YDR216W
YDR214W
YDR213W
YDR219C
YDR217C
YDR223W
YDR222W
YDR221W
YDR225W
YDR224C
YDR227W
YDR226W
YDR230W
YDR229W
YDR228C
YDR232W
YDR231C
YDR235W
YDR233C
YDR236C
YDR237W
YDR238C
YDR241W
YDR240C
YDR239C
YDR242W
YDR243C
YDR245W
YDR244W
YDR247W
YDR246W
YDR248C
YDR251W
YDR249C
YDR252W
YDR254W
YDR253C
YDR256C
YDR255C
YDR259C
YDR258C
YDR257C
YDR261C
YDR260C
YDR262W
YDR263C
YDR265W
YDR264C
YDR266C
YDR268W
YDR267C
YDR270W
YDR272W
YDR271C
YDR275W
YDR273W
YDR279W
YDR277C
YDR276C
YDR280W
YDR281C
YDR283C
YDR282C
YDR285W
YDR284C
YDR286C
YDR288W
YDR287W
YDR291W
YDR290W
YDR289C
YDR293C
YDR292C
YDR296W
YDR295C
YDR294C
YDR297W
YDR298C
YDR299W
YDR300C
YDR302W
YDR301W
YDR303C
YDR305C
YDR304C
YDR307W
YDR306C
YDR308C
YDR310C
YDR309C
YDR312W
YDR311W
YDR313C
YDR315C
YDR314C
YDR317W
YDR316W
YDR318W
YDR320C
YDR319C
YDR322W
YDR321W
YDR325W
YDR324C
YDR323C
YDR327W
YDR326C
YDR329C
YDR328C
YDR332W
YDR331W
YDR330W
YDR334W
YDR333C
YDR337W
YDR336W
YDR335W
YDR339C
YDR338C
YDR340W
YDR342C
YDR341C
YDR345C
YDR343C
YDR347W
YDR346C
YDR350C
YDR349C
YDR348C
YDR352W
YDR351W
YDR354W
YDR353W
YDR355C
YDR356W
YDR357C
YDR358W
YDR360W
YDR359C
YDR362C
YDR361C
YDR363W
YDR364C
YDR368W
YDR367W
YDR365C
YDR370C
YDR369C
YDR371W
YDR373W
YDR372C
YDR375C
YDR374C
YDR377W
YDR376W
YDR378C
YDR380W
YDR379W
YDR382W
YDR381W
YDR385W
YDR384C
YDR383C
YDR386W
YDR387C
YDR389W
YDR388W
YDR390C
YDR392W
YDR391C
YDR395W
YDR394W
YDR393W
YDR396W
YDR397C
YDR399W
YDR398W
YDR401W
YDR400W
YDR402C
YDR403W
YDR404C
YDR406W
YDR405W
YDR407C
YDR409W
YDR408C
YDR412W
YDR411C
YDR410C
YDR414C
YDR413C
YDR416W
YDR415C
YDR420W
YDR419W
YDR418W
YDR421W
YDR422C
YDR425W
YDR424C
YDR423C
YDR427W
YDR428C
YDR430C
YDR429C
YDR433W
YDR432W
YDR431W
YDR434W
YDR435C
YDR438W
YDR437W
YDR436W
YDR440W
YDR439W
YDR442W
YDR441C
YDR443C
YDR446W
YDR444W
YDR448W
YDR447C
YDR450W
YDR449C
YDR451C
YDR452W
YDR453C
YDR457W
YDR456W
YDR454C
YDR459C
YDR458C
YDR462W
YDR461W
YDR460W
YDR464W
YDR463W
YDR466W
YDR465C
YDR469W
YDR468C
YDR467C
YDR471W
YDR470C
YDR472W
YDR474C
YDR473C
YDR476C
YDR475C
YDR478W
YDR477W
YDR479C
YDR480W
YDR481C
YDR483W
YDR482C
YDR484W
YDR486C
YDR485C
YDR488C
YDR487C
YDR489W
YDR492W
YDR490C
YDR494W
YDR493W
YDR497C
YDR496C
YDR495C
YDR499W
YDR500C
YDR501W
YDR502C
YDR506C
YDR505C
YDR503C
YDR508C
YDR507C
YDR511W
YDR510W
YDR509W
YDR513W
YDR512C
YDR515W
YDR514C
YDR518W
YDR517W
YDR516C
YDR519W
YDR520C
YDR521W
YDR523C
YDR522C
YDR527W
YDR524C
YDR528W
YDR530C
YDR529C
YDR531W
YDR532C
YDR534C
YDR533C
YDR539W
YDR538W
YDR536W
YDR541C
YDR540C
YDR542W
YDR545W
YDR544C
YEL002C
YEL001C
YEL004W
YEL003W
YEL005C
YEL007W
YEL006W
YEL011W
YEL009C
YEL013W
YEL012W
YEL014C
YEL015W
YEL016C
YEL018W
YEL017W
YEL019C
YEL021W
YEL020C
YEL022W
YEL024W
YEL023C
YEL026W
YEL025C
YEL027W
YEL029C
YEL032W
YEL031W
YEL030W
YEL034W
YEL035C
YEL038W
YEL037C
YEL036C
YEL040W
YEL039C
YEL043W
YEL042W
YEL041W
YEL044W
YEL046C
YEL048C
YEL047C
YEL049W
YEL051W
YEL050C
YEL052W
YEL053C
YEL056W
YEL055C
YEL054C
YEL058W
YEL057C
YEL059C−A
YEL061C
YEL060C
YEL062W
YEL063C
YEL065W
YEL064C
YEL066W
YEL069C
YEL067C
YEL071W
YEL070W
YEL072W
YEL075C
YEL073C
YEL076C−A
YEL076C
YEL076W−C
YEL077C
YER002W
YER001W
YER003C
YER005W
YER004W
YER007C−A
YER006W
YER007W
YER009W
YER008C
YER012W
YER011W
YER010C
YER014W
YER013W
YER016W
YER015W
YER019C−A
YER018C
YER017C
YER020W
YER019W
YER023W
YER022W
YER021W
YER025W
YER024W
YER028C
YER027C
YER026C
YER030W
YER029C
YER032W
YER031C
YER035W
YER034W
YER033C
YER037W
YER036C
YER040W
YER039C
YER038C
YER042W
YER041W
YER044C−A
YER044C
YER043C
YER046W
YER045C
YER048C
YER047C
YER049W
YER051W
YER050C
YER053C
YER052C
YER056C
YER055C
YER054C
YER056C−A
YER057C
YER060W
YER059W
YER058W
YER060w−A
YER061C
YER063W
YER062C
YER066W
YER065C
YER064C
YER068W
YER067W
YER070W
YER069W
YER071C
YER073W
YER072W
YER074W
YER076C
YER075C
YER078C
YER077C
YER080W
YER079W
YER081W
YER083C
YER082C
YER084W
YER085C
YER087W
YER086W
YER088C
YER090W
YER089C
YER092W
YER091C
YER093C−A
YER093C
YER094C
YER096W
YER095W
YER100W
YER098W
YER101C
YER103W
YER102W
YER104W
YER106W
YER105C
YER109C
YER107C
YER111C
YER110C
YER112W
YER114C
YER113C
YER116C
YER115C
YER117W
YER119C
YER118C
YER119C−A
YER120W
YER121W
YER123W
YER122C
YER125W
YER124C
YER127W
YER126C
YER129W
YER128W
YER130C
YER131W
YER132C
YER133W
YER136W
YER134C
YER138C
YER137C
YER141W
YER140W
YER139C
YER143W
YER142C
YER145C
YER144C
YER146W
YER148W
YER147C
YER150W
YER149C
YER154W
YER153C
YER151C
YER156C
YER155C
YER157W
YER159C
YER158C
YER161C
YER160C
YER163C
YER162C
YER166W
YER165W
YER164W
YER167W
YER168C
YER171W
YER170W
YER169W
YER173W
YER172C
YER176W
YER175C
YER174C
YER178W
YER177W
YER179W
YER180C
YER182W
YER184C
YER183C
YER185W
YER186C
YER189W
YER188W
YER187W
YER190W
YFL001W
YFL003C
YFL002C
YFL006W
YFL005W
YFL004W
YFL008W
YFL007W
YFL009W
YFL011W
YFL010C
YFL012W
YFL013C
YFL013W−A
YFL014W
YFL016C
YFL018C
YFL017C
YFL021W
YFL020C
YFL023W
YFL022C
YFL024C
YFL026W
YFL025C
YFL029C
YFL028C
YFL027C
YFL031W
YFL030W
YFL036W
YFL034W
YFL033C
YFL037W
YFL038C
YFL040W
YFL039C
YFL041W
YFL044C
YFL042C
YFL046W
YFL045C
YFL047W
YFL049W
YFL048C
YFL051C
YFL050C
YFL053W
YFL052W
YFL054C
YFL055W
YFL056C
YFL059W
YFL058W
YFL062W
YFL061W
YFL060C
YFL063W
YFL064C
YFL067W
YFL066C
YFL065C
YFR002W
YFR001W
YFR004W
YFR003C
YFR005C
YFR007W
YFR006W
YFR009W
YFR008W
YFR010W
YFR012W
YFR011C
YFR013W
YFR014C
YFR017C
YFR016C
YFR015C
YFR019W
YFR018C
YFR023W
YFR022W
YFR021W
YFR024C−A
YFR025C
YFR027W
YFR026C
YFR030W
YFR029W
YFR028C
YFR031C−A
YFR031C
YFR034C
YFR033C
YFR032C
YFR036W
YFR035C
YFR038W
YFR037C
YFR040W
YFR039C
YFR041C
YFR042W
YFR043C
YFR045W
YFR044C
YFR046C
YFR048W
YFR047C
YFR049W
YFR051C
YFR050C
YFR052W
YFR053C
YFR057W
YFR055W
YGL002W
YGL001C
YGL003C
YGL005C
YGL004C
YGL006W
YGL009C
YGL008C
YGL010W
YGL011C
YGL012W
YGL014W
YGL013C
YGL016W
YGL015C
YGL017W
YGL018C
YGL019W
YGL021W
YGL020C
YGL022W
YGL023C
YGL027C
YGL026C
YGL025C
YGL029W
YGL028C
YGL030W
YGL032C
YGL031C
YGL033W
YGL035C
YGL036W
YGL037C
YGL039W
YGL038C
YGL040C
YGL043W
YGL044C
YGL047W
YGL045W
YGL048C
YGL050W
YGL049C
YGL053W
YGL052W
YGL051W
YGL055W
YGL054C
YGL057C
YGL056C
YGL060W
YGL059W
YGL058W
YGL063W
YGL061C
YGL066W
YGL065C
YGL064C
YGL068W
YGL067W
YGL071W
YGL070C
YGL069C
YGL073W
YGL072C
YGL075C
YGL074C
YGL078C
YGL077C
YGL076C
YGL080W
YGL079W
YGL083W
YGL082W
YGL081W
YGL085W
YGL084C
YGL086W
YGL087C
YGL090W
YGL089C
YGL091C
YGL093W
YGL092W
YGL096W
YGL095C
YGL094C
YGL098W
YGL097W
YGL101W
YGL100W
YGL099W
YGL103W
YGL102C
YGL105W
YGL104C
YGL106W
YGL108C
YGL107C
YGL111W
YGL110C
YGL114W
YGL113W
YGL112C
YGL116W
YGL115W
YGL119W
YGL117W
YGL120C
YGL122C
YGL121C
YGL123W
YGL124C
YGL126W
YGL125W
YGL127C
YGL129C
YGL128C
YGL130W
YGL132W
YGL131C
YGL134W
YGL133W
YGL135W
YGL137W
YGL136C
YGL139W
YGL138C
YGL141W
YGL140C
YGL144C
YGL143C
YGL142C
YGL145W
YGL146C
YGL149W
YGL148W
YGL147C
YGL151W
YGL150C
YGL153W
YGL152C
YGL154C
YGL156W
YGL155W
YGL158W
YGL157W
YGL160W
YGL159W
YGL161C
YGL162W
YGL163C
YGL166W
YGL165C
YGL164C
YGL168W
YGL167C
YGL169W
YGL170C
YGL172W
YGL171W
YGL173C
YGL174W
YGL175C
YGL178W
YGL176C
YGL179C
YGL181W
YGL180W
YGL185C
YGL184C
YGL183C
YGL187C
YGL186C
YGL190C
YGL189C
YGL192W
YGL191W
YGL193C
YGL195W
YGL194C
YGL198W
YGL197W
YGL196W
YGL200C
YGL199C
YGL202W
YGL201C
YGL203C
YGL205W
YGL206C
YGL208W
YGL207W
YGL211W
YGL210W
YGL209W
YGL212W
YGL213C
YGL216W
YGL215W
YGL214W
YGL219C
YGL217C
YGL220W
YGL222C
YGL221C
YGL224C
YGL223C
YGL226W
YGL225W
YGL228W
YGL227W
YGL229C
YGL231C
YGL230C
YGL234W
YGL233W
YGL232W
YGL237C
YGL236C
YGL238W
YGL241W
YGL239C
YGL243W
YGL242C
YGL245W
YGL244W
YGL248W
YGL247W
YGL246C
YGL250W
YGL249W
YGL253W
YGL252C
YGL251C
YGL255W
YGL254W
YGL256W
YGL258W
YGL257C
YGL259W
YGL261C
YGL263W
YGL262W
YGR003W
YGR002C
YGR001C
YGR004W
YGR005C
YGR007W
YGR006W
YGR008C
YGR010W
YGR009C
YGR012W
YGR011W
YGR014W
YGR013W
YGR015C
YGR017W
YGR016W
YGR019W
YGR021W
YGR020C
YGR023W
YGR024C
YGR026W
YGR025W
YGR027C
YGR029W
YGR028W
YGR031W
YGR030C
YGR032W
YGR034W
YGR033C
YGR036C
YGR035C
YGR040W
YGR038W
YGR037C
YGR042W
YGR041W
YGR045C
YGR044C
YGR043C
YGR046W
YGR047C
YGR049W
YGR048W
YGR052W
YGR054W
YGR053C
YGR056W
YGR055W
YGR059W
YGR058W
YGR057C
YGR060W
YGR061C
YGR064W
YGR063C
YGR062C
YGR066C
YGR065C
YGR068C
YGR067C
YGR070W
YGR069W
YGR071C
YGR074W
YGR072W
YGR077C
YGR076C
YGR075C
YGR079W
YGR078C
YGR080W
YGR082W
YGR081C
YGR084C
YGR083C
YGR086C
YGR085C
YGR089W
YGR088W
YGR087C
YGR091W
YGR090W
YGR094W
YGR093W
YGR092W
YGR096W
YGR095C
YGR097W
YGR099W
YGR098C
YGR101W
YGR100W
YGR103W
YGR102C
YGR105W
YGR104C
YGR106C
YGR108W
YGR109C
YGR112W
YGR111W
YGR110W
YGR113W
YGR114C
YGR116W
YGR115C
YGR118W
YGR117C
YGR119C
YGR121C
YGR120C
YGR122W
YGR124W
YGR123C
YGR126W
YGR125W
YGR127W
YGR129W
YGR128C
YGR131W
YGR130C
YGR133W
YGR132C
YGR136W
YGR135W
YGR134W
YGR137W
YGR138C
YGR142W
YGR141W
YGR140W
YGR144W
YGR143W
YGR145W
YGR147C
YGR146C
YGR149W
YGR148C
YGR152C
YGR150C
YGR153W
YGR155W
YGR154C
YGR157W
YGR156W
YGR160W
YGR159C
YGR158C
YGR162W
YGR161C
YGR165W
YGR164W
YGR163W
YGR167W
YGR166W
YGR169C
YGR168C
YGR170W
YGR172C
YGR171C
YGR173W
YGR174C
YGR176W
YGR175C
YGR177C
YGR179C
YGR178C
YGR181W
YGR180C
YGR182C
YGR184C
YGR183C
YGR186W
YGR185C
YGR189C
YGR188C
YGR187C
YGR191W
YGR190C
YGR194C
YGR193C
YGR192C
YGR195W
YGR196C
YGR199W
YGR198W
YGR197C
YGR201C
YGR200C
YGR203W
YGR202C
YGR206W
YGR205W
YGR204W
YGR208W
YGR207C
YGR211W
YGR210C
YGR209C
YGR212W
YGR213C
YGR215W
YGR214W
YGR218W
YGR217W
YGR216C
YGR219W
YGR220C
YGR222W
YGR221C
YGR223C
YGR225W
YGR224W
YGR228W
YGR227W
YGR226C
YGR230W
YGR229C
YGR232W
YGR231C
YGR234W
YGR233C
YGR235C
YGR237C
YGR236C
YGR240C
YGR239C
YGR238C
YGR243W
YGR241C
YGR246C
YGR245C
YGR244C
YGR248W
YGR247W
YGR249W
YGR250C
YGR252W
YGR251W
YGR253C
YGR254W
YGR255C
YGR256W
YGR258C
YGR257C
YGR260W
YGR259C
YGR263C
YGR262C
YGR261C
YGR265W
YGR264C
YGR266W
YGR267C
YGR270W
YGR269W
YGR268C
YGR271W
YGR272C
YGR275W
YGR274C
YGR273C
YGR277C
YGR276C
YGR278W
YGR280C
YGR279C
YGR281W
YGR282C
YGR284C
YGR283C
YGR287C
YGR286C
YGR285C
YGR288W
YGR289C
YGR292W
YGR290W
YGR293C
YGR294W
YGR295C
YGR296W
YHL002W
YHL001W
YHL005C
YHL003C
YHL007C
YHL006C
YHL010C
YHL009C
YHL008C
YHL012W
YHL011C
YHL015W
YHL014C
YHL013C
YHL017W
YHL016C
YHL018W
YHL019C
YHL022C
YHL021C
YHL020C
YHL024W
YHL023C
YHL025W
YHL027W
YHL026C
YHL028W
YHL029C
YHL030W
YHL032C
YHL031C
YHL034C
YHL033C
YHL036W
YHL035C
YHL039W
YHL038C
YHL040C
YHL042W
YHL041W
YHL044W
YHL043W
YHL046C
YHL048W
YHL047C
YHR001W
YHL050C
YHL049C
YHR001W−A
YHR002W
YHR004C
YHR003C
YHR006W
YHR005C
YHR007C
YHR009C
YHR008C
YHR012W
YHR011W
YHR010W
YHR014W
YHR013C
YHR015W
YHR017W
YHR016C
YHR019C
YHR018C
YHR020W
YHR021C
YHR023W
YHR022C
YHR024C
YHR026W
YHR025W
YHR029C
YHR028C
YHR027C
YHR031C
YHR030C
YHR033W
YHR032W
YHR034C
YHR036W
YHR035W
YHR038W
YHR037W
YHR040W
YHR039C
YHR041C
YHR042W
YHR043C
YHR045W
YHR044C
YHR046C
YHR048W
YHR047C
YHR051W
YHR050W
YHR049W
YHR052W
YHR053C
YHR055C
YHR054C
YHR058C
YHR057C
YHR056C
YHR060W
YHR059W
YHR063C
YHR062C
YHR061C
YHR065C
YHR064C
YHR067W
YHR066W
YHR068W
YHR070W
YHR069C
YHR072W
YHR071W
YHR074W
YHR073W
YHR075C
YHR076W
YHR077C
YHR078W
YHR081W
YHR080C
YHR083W
YHR082C
YHR085W
YHR084W
YHR088W
YHR087W
YHR086W
YHR090C
YHR089C
YHR093W
YHR092C
YHR091C
YHR095W
YHR094C
YHR098C
YHR097C
YHR096C
YHR099W
YHR100C
YHR102W
YHR101C
YHR105W
YHR104W
YHR103W
YHR106W
YHR107C
YHR110W
YHR109W
YHR108W
YHR111W
YHR112C
YHR114W
YHR113W
YHR115C
YHR117W
YHR116W
YHR119W
YHR118C
YHR122W
YHR121W
YHR120W
YHR124W
YHR123W
YHR128W
YHR127W
YHR126C
YHR131C
YHR129C
YHR134W
YHR133C
YHR132C
YHR136C
YHR135C
YHR137W
YHR138C
YHR140W
YHR139C
YHR141C
YHR143W
YHR142W
YHR143W−A
YHR146W
YHR144C
YHR148W
YHR147C
YHR150W
YHR149C
YHR151C
YHR152W
YHR153C
YHR155W
YHR154W
YHR157W
YHR156C
YHR158C
YHR159W
YHR160C
YHR163W
YHR162W
YHR161C
YHR165C
YHR164C
YHR167W
YHR166C
YHR170W
YHR169W
YHR168W
YHR172W
YHR171W
YHR176W
YHR175W
YHR174W
YHR178W
YHR177W
YHR182W
YHR181W
YHR179W
YHR184W
YHR183W
YHR186C
YHR185C
YHR187W
YHR189W
YHR188C
YHR190W
YHR191C
YHR192W
YHR194W
YHR193C
YHR196W
YHR195W
YHR197W
YHR199C
YHR198C
YHR200W
YHR201C
YHR202W
YHR203C
YHR206W
YHR205W
YHR204W
YHR208W
YHR207C
YHR209W
YHR211W
YHR210C
YHR214C−B
YHR212C
YHR214W−A
YHR214W
YHR215W
YHR218W
YHR216W
YHR219W
YIL001W
YIL003W
YIL002C
YIL004C
YIL006W
YIL005W
YIL009W
YIL008W
YIL007C
YIL011W
YIL010W
YIL015W
YIL014W
YIL013C
YIL018W
YIL016W
YIL019W
YIL020C
YIL022W
YIL021W
YIL023C
YIL025C
YIL024C
YIL029C
YIL027C
YIL026C
YIL031W
YIL030C
YIL034C
YIL033C
YIL036W
YIL035C
YIL037C
YIL039W
YIL038C
YIL041W
YIL040W
YIL042C
YIL044C
YIL043C
YIL046W
YIL045W
YIL047C
YIL049W
YIL048W
YIL050W
YIL051C
YIL053W
YIL052C
YIL055C
YIL056W
YIL057C
YIL060W
YIL062C
YIL061C
YIL064W
YIL063C
YIL067C
YIL066C
YIL065C
YIL069C
YIL068C
YIL072W
YIL070C
YIL075C
YIL074C
YIL073C
YIL076W
YIL077C
YIL078W
YIL080W
YIL079C
YIL082W−A
YIL083C
YIL086C
YIL085C
YIL084C
YIL088C
YIL087C
YIL090W
YIL089W
YIL092W
YIL091C
YIL093C
YIL095W
YIL094C
YIL097W
YIL096C
YIL098C
YIL099W
YIL101C
YIL103W
YIL102C
YIL104C
YIL106W
YIL105C
YIL108W
YIL107C
YIL111W
YIL110W
YIL109C
YIL113W
YIL112W
YIL116W
YIL115C
YIL114C
YIL118W
YIL117C
YIL121W
YIL120W
YIL119C
YIL123W
YIL122W
YIL125W
YIL124W
YIL126W
YIL128W
YIL127C
YIL130W
YIL129C
YIL133C
YIL132C
YIL131C
YIL134W
YIL135C
YIL136W
YIL137C
YIL140W
YIL139C
YIL138C
YIL142W
YIL141W
YIL144W
YIL143C
YIL145C
YIL147C
YIL146C
YIL148W
YIL150C
YIL149C
YIL152W
YIL151C
YIL153W
YIL154C
YIL156W
YIL155C
YIL157C
YIL159W
YIL158W
YIL162W
YIL161W
YIL160C
YIL166C
YIL164C
YIL168W
YIL170W
YIL169C
YIL171W
YIL172C
YIL174W
YIL173W
YIL175W
YIL177C
YIL176C
YIR002C
YIR001C
YIR005W
YIR004W
YIR003W
YIR007W
YIR006C
YIR010W
YIR009W
YIR008C
YIR012W
YIR011C
YIR014W
YIR013C
YIR016W
YIR015W
YIR017C
YIR018W
YIR019C
YIR023W
YIR022W
YIR021W
YIR025W
YIR024C
YIR028W
YIR027C
YIR026C
YIR029W
YIR030C
YIR032C
YIR031C
YIR033W
YIR035C
YIR034C
YIR037W
YIR036C
YIR041W
YIR039C
YIR038C
YJL001W
YIR042C
YJL003W
YJL002C
YJL004C
YJL005W
YJL006C
YJL008C
YJL007C
YJL012C
YJL011C
YJL010C
YJL015C
YJL013C
YJL019W
YJL017W
YJL016W
YJL021C
YJL020C
YJL024C
YJL023C
YJL026W
YJL025W
YJL027C
YJL030W
YJL029C
YJL033W
YJL032W
YJL031C
YJL034W
YJL035C
YJL037W
YJL036W
YJL038C
YJL041W
YJL039C
YJL043W
YJL042W
YJL046W
YJL045W
YJL044C
YJL048C
YJL047C
YJL051W
YJL050W
YJL049W
YJL053W
YJL052W
YJL055W
YJL054W
YJL056C
YJL058C
YJL057C
YJL060W
YJL059W
YJL062W
YJL061W
YJL063C
YJL066C
YJL065C
YJL070C
YJL069C
YJL068C
YJL071W
YJL072C
YJL073W
YJL075C
YJL074C
YJL076W
YJL077C
YJL079C
YJL078C
YJL082W
YJL081C
YJL080C
YJL083W
YJL084C
YJL085W
YJL087C
YJL086C
YJL089W
YJL088W
YJL092W
YJL091C
YJL090C
YJL094C
YJL093C
YJL097W
YJL095W
YJL100W
YJL099W
YJL098W
YJL102W
YJL101C
YJL105W
YJL104W
YJL103C
YJL106W
YJL107C
YJL110C
YJL109C
YJL108C
YJL112W
YJL111W
YJL114W
YJL113W
YJL115W
YJL117W
YJL116C
YJL118W
YJL119C
YJL120W
YJL122W
YJL121C
YJL124C
YJL123C
YJL126W
YJL125C
YJL129C
YJL128C
YJL127C
YJL131C
YJL130C
YJL134W
YJL133W
YJL132W
YJL137C
YJL136C
YJL140W
YJL139C
YJL138C
YJL142C
YJL141C
YJL144W
YJL143W
YJL146W
YJL145W
YJL147C
YJL149W
YJL148W
YJL154C
YJL153C
YJL151C
YJL156C
YJL155C
YJL161W
YJL160C
YJL158C
YJL163C
YJL162C
YJL165C
YJL164C
YJL167W
YJL166W
YJL168C
YJL171C
YJL170C
YJL172W
YJL174W
YJL173C
YJL175W
YJL176C
YJL177W
YJL179W
YJL178C
YJL181W
YJL180C
YJL183W
YJL182C
YJL184W
YJL186W
YJL185C
YJL188C
YJL187C
YJL189W
YJL191W
YJL190C
YJL193W
YJL192C
YJL194W
YJL196C
YJL195C
YJL198W
YJL197W
YJL201W
YJL200C
YJL203W
YJL202C
YJL204C
YJL207C
YJL206C
YJL210W
YJL208C
YJL211C
YJL213W
YJL212C
YJL214W
YJL217W
YJL216C
YJL219W
YJL218W
YJL220W
YJL221C
YJL222W
YJL225C
YJL223C
YJR002W
YJR001W
YJR005W
YJR004C
YJR003C
YJR007W
YJR006W
YJR008W
YJR009C
YJR010C−A
YJR010W
YJR011C
YJR014W
YJR013W
YJR015W
YJR017C
YJR016C
YJR018W
YJR019C
YJR020W
YJR022W
YJR021C
YJR025C
YJR024C
YJR027W
YJR026W
YJR029W
YJR028W
YJR031C
YJR032W
YJR033C
YJR037W
YJR035W
YJR034W
YJR040W
YJR039W
YJR042W
YJR041C
YJR043C
YJR045C
YJR044C
YJR046W
YJR047C
YJR048W
YJR050W
YJR049C
YJR052W
YJR051W
YJR055W
YJR054W
YJR053W
YJR057W
YJR056C
YJR060W
YJR059W
YJR058C
YJR061W
YJR062C
YJR064W
YJR063W
YJR066W
YJR065C
YJR067C
YJR068W
YJR069C
YJR071W
YJR070C
YJR072C
YJR074W
YJR073C
YJR075W
YJR077C
YJR076C
YJR079W
YJR078W
YJR082C
YJR080C
YJR086W
YJR084W
YJR083C
YJR087W
YJR088C
YJR092W
YJR091C
YJR090C
YJR094C
YJR093C
YJR094W−A
YJR096W
YJR095W
YJR097W
YJR098C
YJR099W
YJR100C
YJR101W
YJR103W
YJR102C
YJR105W
YJR104C
YJR108W
YJR107W
YJR106W
YJR110W
YJR109C
YJR112W
YJR111C
YJR115W
YJR114W
YJR113C
YJR117W
YJR116W
YJR120W
YJR119C
YJR118C
YJR122W
YJR121W
YJR123W
YJR125C
YJR124C
YJR127C
YJR126C
YJR128W
YJR129C
YJR132W
YJR131W
YJR130C
YJR133W
YJR134C
YJR137C
YJR136C
YJR135C
YJR138W
YJR139C
YJR142W
YJR141W
YJR140C
YJR144W
YJR143C
YJR147W
YJR145C
YJR149W
YJR148W
YJR150C
YJR152W
YJR151C
YJR155W
YJR154W
YJR153W
YJR157W
YJR156C
YJR159W
YJR158W
YJR160C
YKL001C
YJR161C
YKL002W
YKL003C
YKL006C−A
YKL004W
YKL005C
YKL007W
YKL006W
YKL009W
YKL008C
YKL010C
YKL012W
YKL011C
YKL015W
YKL014C
YKL013C
YKL017C
YKL016C
YKL019W
YKL018W
YKL022C
YKL021C
YKL020C
YKL023W
YKL024C
YKL027W
YKL026C
YKL025C
YKL028W
YKL029C
YKL031W
YKL033W
YKL032C
YKL035W
YKL034W
YKL037W
YKL036C
YKL039W
YKL038W
YKL040C
YKL042W
YKL041W
YKL045W
YKL043W
YKL046C
YKL047W
YKL048C
YKL050C
YKL049C
YKL051W
YKL053W
YKL052C
YKL055C
YKL054C
YKL058W
YKL057C
YKL056C
YKL060C
YKL059C
YKL062W
YKL061W
YKL063C
YKL064W
YKL065C
YKL068W
YKL067W
YKL071W
YKL070W
YKL069W
YKL073W
YKL072W
YKL076C
YKL075C
YKL074C
YKL078W
YKL077W
YKL081W
YKL080W
YKL079W
YKL083W
YKL082C
YKL085W
YKL084W
YKL086W
YKL088W
YKL087C
YKL090W
YKL089W
YKL093W
YKL092C
YKL091C
YKL095W
YKL094W
YKL098W
YKL096W
YKL099C
YKL101W
YKL100C
YKL104C
YKL103C
YKL107W
YKL106W
YKL105C
YKL109W
YKL108W
YKL112W
YKL111C
YKL110C
YKL114C
YKL113C
YKL118W
YKL117W
YKL116C
YKL120W
YKL119C
YKL121W
YKL122C
YKL126W
YKL125W
YKL124W
YKL127W
YKL128C
YKL132C
YKL130C
YKL129C
YKL134C
YKL133C
YKL136W
YKL135C
YKL137W
YKL139W
YKL138C
YKL141W
YKL140W
YKL143W
YKL142W
YKL144C
YKL146W
YKL145W
YKL149C
YKL148C
YKL147C
YKL150W
YKL151C
YKL154W
YKL155C
YKL157W
YKL156W
YKL159C
YKL160W
YKL161C
YKL163W
YKL162C
YKL164C
YKL166C
YKL165C
YKL169C
YKL168C
YKL167C
YKL171W
YKL170W
YKL173W
YKL172W
YKL175W
YKL174C
YKL176C
YKL179C
YKL178C
YKL182W
YKL181W
YKL180W
YKL184W
YKL183W
YKL185W
YKL187C
YKL186C
YKL189W
YKL188C
YKL191W
YKL190W
YKL194C
YKL193C
YKL192C
YKL195W
YKL196C
YKL201C
YKL199C
YKL197C
YKL204W
YKL203C
YKL205W
YKL207W
YKL206C
YKL208W
YKL209C
YKL210W
YKL211C
YKL212W
YKL214C
YKL213C
YKL216W
YKL215C
YKL217W
YKL219W
YKL218C
YKL221W
YKL220C
YKR001C
YKL224C
YKL222C
YKR003W
YKR002W
YKR005C
YKR004C
YKR008W
YKR007W
YKR006C
YKR010C
YKR009C
YKR013W
YKR012C
YKR011C
YKR015C
YKR014C
YKR016W
YKR017C
YKR020W
YKR019C
YKR018C
YKR021W
YKR022C
YKR025W
YKR024C
YKR026C
YKR028W
YKR027W
YKR030W
YKR029C
YKR031C
YKR034W
YKR035C
YKR037C
YKR036C
YKR041W
YKR039W
YKR038C
YKR042W
YKR043C
YKR044W
YKR046C
YKR045C
YKR047W
YKR048C
YKR051W
YKR050W
YKR049C
YKR053C
YKR052C
YKR055W
YKR054C
YKR058W
YKR057W
YKR056W
YKR060W
YKR059W
YKR062W
YKR061W
YKR063C
YKR064W
YKR065C
YKR067W
YKR066C
YKR068C
YKR070W
YKR069W
YKR072C
YKR071C
YKR074W
YKR076W
YKR075C
YKR078W
YKR079C
YKR080W
YKR082W
YKR081C
YKR084C
YKR083C
YKR086W
YKR085C
YKR087C
YKR089C
YKR088C
YKR091W
YKR090W
YKR093W
YKR092C
YKR094C
YKR096W
YKR095W
YKR097W
YKR099W
YKR098C
YKR101W
YKR100C
YKR104W
YKR103W
YKR102W
YKR106W
YKR105C
YLL002W
YLL001W
YLL004W
YLL003W
YLL005C
YLL006W
YLL007C
YLL008W
YLL010C
YLL009C
YLL012W
YLL011W
YLL014W
YLL013C
YLL015W
YLL019C
YLL018C
YLL021W
YLL020C
YLL024C
YLL023C
YLL022C
YLL026W
YLL025W
YLL029W
YLL028W
YLL027W
YLL032C
YLL031C
YLL033W
YLL034C
YLL035W
YLL038C
YLL036C
YLL040C
YLL039C
YLL043W
YLL042C
YLL041C
YLL046C
YLL045C
YLL049W
YLL048C
YLL050C
YLL052C
YLL051C
YLL054C
YLL053C
YLL055W
YLL057C
YLL056C
YLL058W
YLL060C
YLL061W
YLL063C
YLL062C
YLL066C
YLL064C
YLR002C
YLR001C
YLL067C
YLR004C
YLR003C
YLR005W
YLR006C
YLR007W
YLR009W
YLR008C
YLR011W
YLR010C
YLR013W
YLR012C
YLR014C
YLR015W
YLR016C
YLR017W
YLR019W
YLR018C
YLR021W
YLR020C
YLR023C
YLR022C
YLR025W
YLR024C
YLR026C
YLR028C
YLR027C
YLR031W
YLR030W
YLR029C
YLR033W
YLR032W
YLR036C
YLR035C
YLR034C
YLR038C
YLR037C
YLR040C
YLR039C
YLR041W
YLR043C
YLR042C
YLR045C
YLR044C
YLR048W
YLR047C
YLR046C
YLR050C
YLR049C
YLR052W
YLR051C
YLR055C
YLR054C
YLR053C
YLR057W
YLR056W
YLR060W
YLR059C
YLR058C
YLR063W
YLR061W
YLR064W
YLR066W
YLR065C
YLR068W
YLR067C
YLR070C
YLR069C
YLR072W
YLR071C
YLR073C
YLR075W
YLR074C
YLR077W
YLR079W
YLR078C
YLR081W
YLR080W
YLR084C
YLR083C
YLR082C
YLR086W
YLR085C
YLR088W
YLR087C
YLR091W
YLR090W
YLR089C
YLR092W
YLR093C
YLR096W
YLR095C
YLR094C
YLR099C
YLR097C
YLR100W
YLR103C
YLR102C
YLR104W
YLR105C
YLR107W
YLR106C
YLR109W
YLR108C
YLR110C
YLR113W
YLR114C
YLR116W
YLR115W
YLR117C
YLR119W
YLR118C
YLR124W
YLR121C
YLR120C
YLR125W
YLR126C
YLR128W
YLR127C
YLR129W
YLR131C
YLR130C
YLR133W
YLR132C
YLR135W
YLR134W
YLR136C
YLR138W
YLR137W
YLR143W
YLR142W
YLR141W
YLR145W
YLR144C
YLR147C
YLR146C
YLR150W
YLR149C
YLR151C
YLR153C
YLR152C
YLR156W
YLR155C
YLR154C
YLR158C
YLR157C
YLR159W
YLR160C
YLR162W
YLR161W
YLR163C
YLR164W
YLR165C
YLR167W
YLR166C
YLR168C
YLR169W
YLR170C
YLR171W
YLR173W
YLR172C
YLR175W
YLR174W
YLR177W
YLR176C
YLR180W
YLR179C
YLR178C
YLR182W
YLR181C
YLR186W
YLR185W
YLR183C
YLR188W
YLR187W
YLR191W
YLR190W
YLR189C
YLR193C
YLR192C
YLR195C
YLR194C
YLR197W
YLR196W
YLR198C
YLR200W
YLR199C
YLR204W
YLR203C
YLR205C
YLR207W
YLR206W
YLR208W
YLR210W
YLR209C
YLR212C
YLR211C
YLR215C
YLR213C
YLR217W
YLR216C
YLR218C
YLR220W
YLR219W
YLR223C
YLR222C
YLR221C
YLR224W
YLR225C
YLR226W
YLR228C
YLR227C
YLR231C
YLR229C
YLR232W
YLR233C
YLR234W
YLR237W
YLR236C
YLR238W
YLR239C
YLR241W
YLR240W
YLR242C
YLR243W
YLR244C
YLR246W
YLR245C
YLR247C
YLR249W
YLR248W
YLR251W
YLR250W
YLR253W
YLR252W
YLR254C
YLR257W
YLR256W
YLR258W
YLR260W
YLR259C
YLR263W
YLR262C
YLR264W
YLR265C
YLR268W
YLR267W
YLR266C
YLR271W
YLR270W
YLR274W
YLR273C
YLR272C
YLR275W
YLR276C
YLR279W
YLR278C
YLR277C
YLR282C
YLR281C
YLR283W
YLR284C
YLR285W
YLR287C
YLR286C
YLR287C−A
YLR288C
YLR289W
YLR291C
YLR290C
YLR293C
YLR292C
YLR297W
YLR296W
YLR295C
YLR299W
YLR298C
YLR301W
YLR300W
YLR303W
YLR305C
YLR304C
YLR307W
YLR306W
YLR308W
YLR310C
YLR309C
YLR313C
YLR312C
YLR315W
YLR314C
YLR316C
YLR318W
YLR317W
YLR320W
YLR319C
YLR322W
YLR321C
YLR323C
YLR324W
YLR325C
YLR326W
YLR328W
YLR327C
YLR330W
YLR329W
YLR332W
YLR336C
YLR333C
YLR340W
YLR338W
YLR342W
YLR341W
YLR345W
YLR344W
YLR343W
YLR347C
YLR346C
YLR350W
YLR349W
YLR348C
YLR352W
YLR351C
YLR353W
YLR355C
YLR354C
YLR357W
YLR356W
YLR360W
YLR359W
YLR362W
YLR361C
YLR363C
YLR366W
YLR364W
YLR369W
YLR368W
YLR367W
YLR371W
YLR370C
YLR372W
YLR373C
YLR375W
YLR374C
YLR376C
YLR378C
YLR377C
YLR381W
YLR380W
YLR379W
YLR383W
YLR382C
YLR386W
YLR385C
YLR384C
YLR388W
YLR387C
YLR390W
YLR389C
YLR394W
YLR393W
YLR392C
YLR396C
YLR395C
YLR399C
YLR398C
YLR397C
YLR400W
YLR401C
YLR405W
YLR404W
YLR403W
YLR407W
YLR406C
YLR409C
YLR408C
YLR412W
YLR411W
YLR410W
YLR413W
YLR414C
YLR417W
YLR415C
YLR418C
YLR420W
YLR419W
YLR422W
YLR421C
YLR423C
YLR425W
YLR424W
YLR427W
YLR426W
YLR430W
YLR429W
YLR431C
YLR432W
YLR433C
YLR435W
YLR434C
YLR436C
YLR438W
YLR437C
YLR439W
YLR441C
YLR440C
YLR443W
YLR442C
YLR446W
YLR445W
YLR449W
YLR448W
YLR447C
YLR451W
YLR450W
YLR454W
YLR453C
YLR452C
YLR456W
YLR455W
YLR458W
YLR457C
YLR459W
YLR461W
YLR460C
YLR462W
YLR463C
YLR464W
YLR466W
YLR465C
YML001W
YLR467W
YML003W
YML002W
YML004C
YML005W
YML006C
YML007W
YML008C
YML010W
YML009C
YML011C
YML013C−A
YML012W
YML014W
YML013W
YML015C
YML017W
YML016C
YML020W
YML019W
YML018C
YML022W
YML021C
YML024W
YML023C
YML027W
YML026C
YML025C
YML029W
YML028W
YML031W
YML030W
YML032C
YML034W
YML035C
YML035C−A
YML036W
YML037C
YML039W
YML038C
YML040W
YML041C
YML042W
YML045W
YML043C
YML046W
YML047C
YML048W−A
YML048W
YML049C
YML051W
YML050W
YML052W
YML054C
YML053C
YML055W
YML056C
YML058C−A
YML057W
YML058W
YML060W
YML059C
YML062C
YML061C
YML063W
YML065W
YML064C
YML067C
YML066C
YML070W
YML069W
YML068W
YML072C
YML071C
YML074C
YML073C
YML077W
YML076C
YML075C
YML079W
YML078W
YML082W
YML081W
YML080W
YML085C
YML083C
YML087C
YML086C
YML088W
YML092C
YML091C
YML094W
YML093W
YML095C−A
YML096W
YML095C
YML098W
YML097C
YML100W−A
YML100W
YML099C
YML102W
YML101C
YML104C
YML103C
YML106W
YML105C
YML107C
YML109W
YML108W
YML112W
YML111W
YML110C
YML113W
YML114C
YML117W
YML116W
YML115C
YML117W−A
YML118W
YML119W
YML120C
YML121W
YML124C
YML123C
YML126C
YML125C
YML127W
YML129C
YML128C
YML131W
YML130C
YML132W
YMR001C
YML133C
YMR003W
YMR002W
YMR005W
YMR004W
YMR007W
YMR006C
YMR008C
YMR010W
YMR009W
YMR012W
YMR011W
YMR013C
YMR014W
YMR015C
YMR018W
YMR017W
YMR016C
YMR020W
YMR019W
YMR022W
YMR021C
YMR025W
YMR024W
YMR023C
YMR027W
YMR026C
YMR028W
YMR030W
YMR029C
YMR031W−A
YMR031C
YMR033W
YMR032W
YMR034C
YMR035W
YMR036C
YMR038C
YMR037C
YMR040W
YMR039C
YMR041C
YMR043W
YMR042W
YMR044W
YMR046C
YMR045C
YMR048W
YMR047C
YMR050C
YMR049C
YMR052W
YMR051C
YMR053C
YMR054W
YMR055C
YMR059W
YMR058W
YMR056C
YMR061W
YMR060C
YMR064W
YMR063W
YMR062C
YMR066W
YMR065W
YMR068W
YMR067C
YMR070W
YMR069W
YMR071C
YMR072W
YMR073C
YMR075C−A
YMR075W
YMR074C
YMR077C
YMR076C
YMR079W
YMR078C
YMR080C
YMR083W
YMR081C
YMR085W
YMR084W
YMR087W
YMR086W
YMR088C
YMR090W
YMR089C
YMR093W
YMR092C
YMR091C
YMR094W
YMR095C
YMR096W
YMR098C
YMR097C
YMR100W
YMR099C
YMR102C
YMR101C
YMR106C
YMR105C
YMR104C
YMR108W
YMR107W
YMR109W
YMR111C
YMR110C
YMR113W
YMR112C
YMR115W
YMR114C
YMR116C
YMR118C
YMR117C
YMR119W−A
YMR119W
YMR123W
YMR121C
YMR120C
YMR125W
YMR124W
YMR129W
YMR128W
YMR126C
YMR130W
YMR131C
YMR134W
YMR133W
YMR132C
YMR135W−A
YMR135C
YMR136W
YMR137C
YMR140W
YMR139W
YMR138W
YMR142C
YMR141C
YMR144W
YMR143W
YMR145C
YMR147W
YMR146C
YMR149W
YMR148W
YMR152W
YMR151W
YMR150C
YMR153C−A
YMR153W
YMR155W
YMR154C
YMR156C
YMR158W
YMR157C
YMR161W
YMR160W
YMR159C
YMR163C
YMR162C
YMR165C
YMR164C
YMR167W
YMR166C
YMR168C
YMR170C
YMR169C
YMR173W
YMR172W
YMR171C
YMR173W−A
YMR174C
YMR177W
YMR176W
YMR175W
YMR179W
YMR178W
YMR181C
YMR180C
YMR184W
YMR183C
YMR182C
YMR186W
YMR185W
YMR189W
YMR188C
YMR187C
YMR191W
YMR190C
YMR194W
YMR193W
YMR192W
YMR196W
YMR195W
YMR198W
YMR197C
YMR200W
YMR199W
YMR201C
YMR203W
YMR202W
YMR206W
YMR205C
YMR204C
YMR208W
YMR207C
YMR211W
YMR210W
YMR209C
YMR213W
YMR212C
YMR215W
YMR214W
YMR217W
YMR216C
YMR218C
YMR220W
YMR219W
YMR223W
YMR222C
YMR221C
YMR225C
YMR224C
YMR228W
YMR227C
YMR226C
YMR230W
YMR229C
YMR232W
YMR231W
YMR234W
YMR233W
YMR235C
YMR237W
YMR236W
YMR238W
YMR240C
YMR239C
YMR241W
YMR243C
YMR244C−A
YMR244W
YMR246W
YMR245W
YMR247C
YMR251W
YMR250W
YMR251W−A
YMR253C
YMR252C
YMR255W
YMR256C
YMR259C
YMR258C
YMR257C
YMR261C
YMR260C
YMR263W
YMR262W
YMR264W
YMR267W
YMR265C
YMR269W
YMR268C
YMR272C
YMR271C
YMR270C
YMR274C
YMR273C
YMR277W
YMR276W
YMR275C
YMR278W
YMR279C
YMR281W
YMR280C
YMR285C
YMR283C
YMR282C
YMR288W
YMR287C
YMR290W−A
YMR289W
YMR290C
YMR292W
YMR293C
YMR294W−A
YMR294W
YMR295C
YMR297W
YMR296C
YMR298W
YMR299C
YMR302C
YMR301C
YMR300C
YMR304W
YMR303C
YMR306W
YMR305C
YMR308C
YMR310C
YMR309C
YMR312W
YMR311C
YMR313C
YMR315W
YMR314W
YMR316C−B
YMR316W
YMR317W
YMR319C
YMR318C
YMR322C
YMR321C
YMR323W
YNL001W
YNL002C
YNL004W
YNL003C
YNL006W
YNL005C
YNL008C
YNL010W
YNL009W
YNL014W
YNL012W
YNL016W
YNL015W
YNL017C
YNL019C
YNL018C
YNL021W
YNL020C
YNL022C
YNL025C
YNL024C
YNL027W
YNL026W
YNL028W
YNL030W
YNL029C
YNL032W
YNL031C
YNL034W
YNL033W
YNL035C
YNL036W
YNL037C
YNL040W
YNL039W
YNL038W
YNL042W
YNL041C
YNL045W
YNL044W
YNL046W
YNL048W
YNL047C
YNL050C
YNL049C
YNL053W
YNL052W
YNL051W
YNL054W
YNL055C
YNL056W
YNL059C
YNL058C
YNL061W
YNL062C
YNL063W
YNL064C
YNL067W
YNL066W
YNL065W
YNL069C
YNL068C
YNL072W
YNL071W
YNL070W
YNL073W
YNL074C
YNL077W
YNL076W
YNL075W
YNL078W
YNL079C
YNL081C
YNL080C
YNL083W
YNL085W
YNL084C
YNL087W
YNL086W
YNL088W
YNL090W
YNL089C
YNL092W
YNL091W
YNL094W
YNL093W
YNL095C
YNL097C
YNL096C
YNL099C
YNL098C
YNL102W
YNL101W
YNL100W
YNL103W
YNL104C
YNL107W
YNL105W
YNL108C
YNL111C
YNL110C
YNL113W
YNL112W
YNL117W
YNL115C
YNL114C
YNL119W
YNL120C
YNL123W
YNL122C
YNL121C
YNL124W
YNL125C
YNL128W
YNL127W
YNL126W
YNL129W
YNL130C
YNL132W
YNL131W
YNL135C
YNL134C
YNL133C
YNL136W
YNL137C
YNL138W
YNL141W
YNL139C
YNL142W
YNL144C
YNL147W
YNL146W
YNL145W
YNL149C
YNL148C
YNL152W
YNL151C
YNL155W
YNL154C
YNL153C
YNL157W
YNL156C
YNL158W
YNL160W
YNL159C
YNL162W
YNL161W
YNL165W
YNL163C
YNL166C
YNL168C
YNL167C
YNL171C
YNL169C
YNL172W
YNL175C
YNL173C
YNL177C
YNL176C
YNL178W
YNL181W
YNL180C
YNL183C
YNL182C
YNL187W
YNL186W
YNL185C
YNL189W
YNL188W
YNL191W
YNL190W
YNL193W
YNL195C
YNL194C
YNL197C
YNL196C
YNL201C
YNL200C
YNL199C
YNL202W
YNL203C
YNL206C
YNL205C
YNL204C
YNL208W
YNL207W
YNL210W
YNL209W
YNL214W
YNL213C
YNL211C
YNL216W
YNL215W
YNL218W
YNL217W
YNL219C
YNL220W
YNL221C
YNL223W
YNL222W
YNL227C
YNL225C
YNL224C
YNL228W
YNL229C
YNL232W
YNL231C
YNL230C
YNL234W
YNL233W
YNL237W
YNL236W
YNL235C
YNL239W
YNL238W
YNL241C
YNL240C
YNL243W
YNL242W
YNL244C
YNL246W
YNL245C
YNL247W
YNL249C
YNL248C
YNL250W
YNL251C
YNL253W
YNL252C
YNL254C
YNL256W
YNL255C
YNL258C
YNL257C
YNL261W
YNL260C
YNL259C
YNL262W
YNL263C
YNL266W
YNL265C
YNL264C
YNL268W
YNL267W
YNL272C
YNL271C
YNL270C
YNL273W
YNL274C
YNL275W
YNL276C
YNL281W
YNL279W
YNL278W
YNL282W
YNL283C
YNL287W
YNL286W
YNL284C
YNL289W
YNL288W
YNL290W
YNL292W
YNL291C
YNL293W
YNL294C
YNL296W
YNL295W
YNL299W
YNL298W
YNL297C
YNL300W
YNL301C
YNL304W
YNL302C
YNL305C
YNL306W
YNL307C
YNL311C
YNL310C
YNL308C
YNL312W
YNL313C
YNL314W
YNL315C
YNL317W
YNL316C
YNL318C
YNL320W
YNL319W
YNL321W
YNL323W
YNL322C
YNL326C
YNL325C
YNL327W
YNL328C
YNL331C
YNL330C
YNL329C
YNL333W
YNL332W
YNL336W
YNL335W
YNL334C
YNL338W
YNL339C
YNR003C
YNR002C
YNR001C
YNR004W
YNR005C
YNR006W
YNR007C
YNR010W
YNR009W
YNR008W
YNR012W
YNR011C
YNR015W
YNR014W
YNR013C
YNR017W
YNR016C
YNR019W
YNR018W
YNR020C
YNR021W
YNR022C
YNR024W
YNR023W
YNR027W
YNR026C
YNR025C
YNR028W
YNR029C
YNR030W
YNR032W
YNR031C
YNR034W
YNR033W
YNR037C
YNR036C
YNR035C
YNR038W
YNR039C
YNR040W
YNR041C
YNR044W
YNR043W
YNR042W
YNR046W
YNR045W
YNR048W
YNR047W
YNR049C
YNR051C
YNR050C
YNR054C
YNR053C
YNR052C
YNR057C
YNR055C
YNR059W
YNR058W
YNR060W
YNR062C
YNR061C
YNR063W
YNR064C
YNR068C
YNR067C
YNR066C
YNR070W
YNR069C
YNR072W
YNR071C
YNR073C
YNR075W
YNR074C
YNR076W
YOL001W
YOL004W
YOL003C
YOL002C
YOL006C
YOL005C
YOL008W
YOL007C
YOL009C
YOL011W
YOL010W
YOL013C
YOL012C
YOL015W
YOL014W
YOL016C
YOL017W
YOL018C
YOL020W
YOL019W
YOL021C
YOL023W
YOL022C
YOL025W
YOL024W
YOL026C
YOL028C
YOL027C
YOL030W
YOL029C
YOL033W
YOL032W
YOL031C
YOL036W
YOL034W
YOL039W
YOL038W
YOL037C
YOL041C
YOL040C
YOL042W
YOL044W
YOL043C
YOL045W
YOL046C
YOL048C
YOL047C
YOL049W
YOL051W
YOL050C
YOL053W
YOL052C
YOL054W
YOL056W
YOL055C
YOL058W
YOL057W
YOL059W
YOL061W
YOL060C
YOL063C
YOL062C
YOL065C
YOL064C
YOL068C
YOL067C
YOL066C
YOL069W
YOL070C
YOL072W
YOL071W
YOL073C
YOL076W
YOL075C
YOL079W
YOL078W
YOL077C
YOL081W
YOL080C
YOL083W
YOL082W
YOL084W
YOL087C
YOL086C
YOL089C
YOL088C
YOL092W
YOL091W
YOL090W
YOL093W
YOL094C
YOL097C
YOL096C
YOL095C
YOL099C
YOL098C
YOL100W
YOL101C
YOL103W
YOL102C
YOL104C
YOL107W
YOL105C
YOL110W
YOL109W
YOL108C
YOL112W
YOL111C
YOL113W
YOL114C
YOL117W
YOL116W
YOL115W
YOL120C
YOL119C
YOL123W
YOL122C
YOL121C
YOL125W
YOL124C
YOL127W
YOL126C
YOL128C
YOL130W
YOL129W
YOL132W
YOL131W
YOL133W
YOL135C
YOL134C
YOL137W
YOL136C
YOL140W
YOL139C
YOL138C
YOL142W
YOL141W
YOL144W
YOL143C
YOL145C
YOL146W
YOL147C
YOL149W
YOL148C
YOL152W
YOL151W
YOL150C
YOL154W
YOL155C
YOL156W
YOL158C
YOL157C
YOL161C
YOL159C
YOL164W
YOL163W
YOL165C
YOR001W
YOL166C
YOR003W
YOR002W
YOR004W
YOR006C
YOR005C
YOR008C
YOR007C
YOR009W
YOR011W
YOR010C
YOR013W
YOR012W
YOR015W
YOR014W
YOR016C
YOR018W
YOR017W
YOR019W
YOR020C
YOR023C
YOR022C
YOR021C
YOR026W
YOR025W
YOR027W
YOR030W
YOR028C
YOR031W
YOR032C
YOR035C
YOR034C
YOR033C
YOR037W
YOR036W
YOR039W
YOR038C
YOR040W
YOR042W
YOR041C
YOR044W
YOR043W
YOR045W
YOR047C
YOR046C
YOR049C
YOR048C
YOR052C
YOR051C
YOR053W
YOR055W
YOR054C
YOR057W
YOR056C
YOR060C
YOR059C
YOR058C
YOR061W
YOR062C
YOR063W
YOR065W
YOR064C
YOR066W
YOR067C
YOR069W
YOR068C
YOR073W
YOR071C
YOR070C
YOR075W
YOR074C
YOR078W
YOR077W
YOR076C
YOR080W
YOR079C
YOR084W
YOR083W
YOR081C
YOR085W
YOR086C
YOR088W
YOR087W
YOR091W
YOR090C
YOR089C
YOR092W
YOR093C
YOR094W
YOR096W
YOR095C
YOR098C
YOR097C
YOR099W
YOR101W
YOR100C
YOR102W
YOR103C
YOR105W
YOR104W
YOR108W
YOR107W
YOR106W
YOR110W
YOR109W
YOR113W
YOR112W
YOR111W
YOR114W
YOR115C
YOR118W
YOR117W
YOR116C
YOR120W
YOR119C
YOR122C
YOR121C
YOR125C
YOR124C
YOR123C
YOR127W
YOR126C
YOR130C
YOR129C
YOR128C
YOR132W
YOR131C
YOR134W
YOR133W
YOR135C
YOR136W
YOR137C
YOR139C
YOR138C
YOR140W
YOR142W
YOR141C
YOR144C
YOR143C
YOR147W
YOR146W
YOR145C
YOR149C
YOR148C
YOR150W
YOR151C
YOR154W
YOR153W
YOR152C
YOR156C
YOR155C
YOR158W
YOR157C
YOR159C
YOR160W
YOR161C
YOR163W
YOR162C
YOR164C
YOR165W
YOR166C
YOR168W
YOR167C
YOR170W
YOR172W
YOR171C
YOR174W
YOR173W
YOR176W
YOR175C
YOR177C
YOR179C
YOR178C
YOR181W
YOR180C
YOR182C
YOR184W
YOR185C
YOR187W
YOR186W
YOR190W
YOR189W
YOR188W
YOR191W
YOR192C
YOR193W
YOR195W
YOR194C
YOR197W
YOR196C
YOR200W
YOR199W
YOR198C
YOR202W
YOR201C
YOR204W
YOR203W
YOR206W
YOR205C
YOR207C
YOR208W
YOR209C
YOR210W
YOR212W
YOR211C
YOR215C
YOR213C
YOR217W
YOR216C
YOR218C
YOR220W
YOR219C
YOR222W
YOR221C
YOR223W
YOR225W
YOR224C
YOR227W
YOR226C
YOR230W
YOR229W
YOR228C
YOR232W
YOR231W
YOR233W
YOR234C
YOR238W
YOR237W
YOR236W
YOR240W
YOR239W
YOR241W
YOR243C
YOR242C
YOR244W
YOR245C
YOR248W
YOR247W
YOR246C
YOR250C
YOR249C
YOR252W
YOR251C
YOR253W
YOR255W
YOR254C
YOR257W
YOR256C
YOR258W
YOR260W
YOR259C
YOR262W
YOR261C
YOR266W
YOR265W
YOR264W
YOR269W
YOR267C
YOR271C
YOR270C
YOR272W
YOR274W
YOR273C
YOR276W
YOR275C
YOR278W
YOR280C
YOR279C
YOR282W
YOR281C
YOR285W
YOR284W
YOR283W
YOR286W
YOR287C
YOR289W
YOR288C
YOR291W
YOR290C
YOR292C
YOR294W
YOR293W
YOR296W
YOR295W
YOR297C
YOR299W
YOR298W
YOR302W
YOR301W
YOR300W
YOR304C−A
YOR303W
YOR305W
YOR304W
YOR308C
YOR307C
YOR306C
YOR311C
YOR310C
YOR315W
YOR313C
YOR312C
YOR317W
YOR316C
YOR319W
YOR321W
YOR320C
YOR323C
YOR322C
YOR325W
YOR324C
YOR326W
YOR328W
YOR327C
YOR330C
YOR329C
YOR332W
YOR334W
YOR333C
YOR336W
YOR335C
YOR338W
YOR337W
YOR341W
YOR340C
YOR339C
YOR344C
YOR342C
YOR346W
YOR345C
YOR347C
YOR349W
YOR348C
YOR352W
YOR351C
YOR350C
YOR354C
YOR353C
YOR356W
YOR355W
YOR359W
YOR358W
YOR357C
YOR361C
YOR360C
YOR364W
YOR363C
YOR362C
YOR366W
YOR365C
YOR368W
YOR367W
YOR369C
YOR371C
YOR370C
YOR373W
YOR372C
YOR374W
YOR377W
YOR375C
YOR380W
YOR378W
YOR382W
YOR381W
YOR383C
YOR385W
YOR384W
YOR386W
YOR388C
YOR387C
YOR390W
YOR389W
YOR392W
YOR391C
YOR394W
YOR393W
YPL001W
YPL003W
YPL002C
YPL006W
YPL005W
YPL004C
YPL008W
YPL007C
YPL010W
YPL009C
YPL011C
YPL012W
YPL013C
YPL014W
YPL015C
YPL016W
YPL018W
YPL017C
YPL020C
YPL019C
YPL022W
YPL021W
YPL023C
YPL024W
YPL026C
YPL029W
YPL028W
YPL027W
YPL030W
YPL031C
YPL033C
YPL032C
YPL034W
YPL036W
YPL035C
YPL038W
YPL037C
YPL039W
YPL041C
YPL040C
YPL043W
YPL042C
YPL045W
YPL044C
YPL048W
YPL047W
YPL046C
YPL050C
YPL049C
YPL052W
YPL051W
YPL053C
YPL054W
YPL055C
YPL058C
YPL057C
YPL056C
YPL060W
YPL059W
YPL063W
YPL061W
YPL066W
YPL065W
YPL064C
YPL068C
YPL067C
YPL070W
YPL069C
YPL071C
YPL072W
YPL073C
YPL076W
YPL075W
YPL074W
YPL078C
YPL077C
YPL079W
YPL080C
YPL081W
YPL083C
YPL082C
YPL085W
YPL084W
YPL088W
YPL087W
YPL086C
YPL090C
YPL089C
YPL093W
YPL092W
YPL091W
YPL095C
YPL094C
YPL097W
YPL096W
YPL100W
YPL099C
YPL098C
YPL101W
YPL102C
YPL104W
YPL103C
YPL105C
YPL108W
YPL106C
YPL111W
YPL110C
YPL109C
YPL113C
YPL112C
YPL116W
YPL115C
YPL118W
YPL117C
YPL119C
YPL120W
YPL121C
YPL124W
YPL123C
YPL122C
YPL126W
YPL125W
YPL129W
YPL128C
YPL127C
YPL131W
YPL130W
YPL132W
YPL133C
YPL135W
YPL134C
YPL137C
YPL139C
YPL138C
YPL142C
YPL141C
YPL140C
YPL143W
YPL145C
YPL147W
YPL146C
YPL150W
YPL149W
YPL148C
YPL152W
YPL151C
YPL155C
YPL154C
YPL153C
YPL157W
YPL156C
YPL160W
YPL159C
YPL158C
YPL162C
YPL161C
YPL164C
YPL163C
YPL166W
YPL165C
YPL167C
YPL168W
YPL169C
YPL170W
YPL172C
YPL171C
YPL173W
YPL174C
YPL175W
YPL177C
YPL176C
YPL179W
YPL178W
YPL181W
YPL180W
YPL184C
YPL183C
YPL182C
YPL185W
YPL186C
YPL189W
YPL188W
YPL187W
YPL191C
YPL190C
YPL194W
YPL193W
YPL192C
YPL196W
YPL195W
YPL198W
YPL197C
YPL200W
YPL199C
YPL201C
YPL203W
YPL202C
YPL204W
YPL206C
YPL205C
YPL208W
YPL207W
YPL211W
YPL210C
YPL209C
YPL213W
YPL212C
YPL215W
YPL214C
YPL216W
YPL218W
YPL217C
YPL220W
YPL219W
YPL222W
YPL221W
YPL223C
YPL225W
YPL224C
YPL226W
YPL228W
YPL227C
YPL230W
YPL229W
YPL232W
YPL231W
YPL233W
YPL235W
YPL234C
YPL237W
YPL236C
YPL239W
YPL238C
YPL240C
YPL242C
YPL241C
YPL243W
YPL244C
YPL245W
YPL247C
YPL246C
YPL249C
YPL248C
YPL253C
YPL252C
YPL250C
YPL255W
YPL254W
YPL257W
YPL256C
YPL258C
YPL260W
YPL259C
YPL262W
YPL261C
YPL265W
YPL264C
YPL263C
YPL267W
YPL266W
YPL270W
YPL269W
YPL268W
YPL271W
YPL272C
YPL274W
YPL273W
YPL277C
YPL279C
YPL278C
YPL280W
YPL281C
YPR001W
YPL283C
YPL282C
YPR002W
YPR003C
YPR006C
YPR005C
YPR004C
YPR008W
YPR007C
YPR009W
YPR011C
YPR010C
YPR012W
YPR013C
YPR016C
YPR015C
YPR019W
YPR018W
YPR017C
YPR020W
YPR021C
YPR024W
YPR023C
YPR022C
YPR026W
YPR025C
YPR028W
YPR027C
YPR029C
YPR031W
YPR030W
YPR032W
YPR033C
YPR036W
YPR035W
YPR034W
YPR040W
YPR039W
YPR041W
YPR043W
YPR042C
YPR045C
YPR044C
YPR048W
YPR047W
YPR046W
YPR050C
YPR049C
YPR051W
YPR053C
YPR056W
YPR055W
YPR054W
YPR058W
YPR057W
YPR061C
YPR060C
YPR059C
YPR062W
YPR063C
YPR066W
YPR065W
YPR067W
YPR069C
YPR068C
YPR071W
YPR070W
YPR072W
YPR074C
YPR073C
YPR076W
YPR075C
YPR080W
YPR079W
YPR078C
YPR082C
YPR081C
YPR084W
YPR083W
YPR087W
YPR086W
YPR085C
YPR089W
YPR088C
YPR090W
YPR092W
YPR091C
YPR094W
YPR093C
YPR097W
YPR095C
YPR098C
YPR101W
YPR100W
YPR103W
YPR102C
YPR106W
YPR105C
YPR104C
YPR108W
YPR107C
YPR109W
YPR111W
YPR110C
YPR113W
YPR112C
YPR116W
YPR115W
YPR114W
YPR118W
YPR117W
YPR119W
YPR120C
YPR124W
YPR122W
YPR121W
YPR127W
YPR125W
YPR129W
YPR128C
YPR130C
YPR132W
YPR131C
YPR135W
YPR134W
YPR133C
YPR137W
YPR136C
YPR139C
YPR138C
YPR140W
YPR143W
YPR141C
YPR145W
YPR144C
YPR149W
YPR148C
YPR147C
YPR150W
YPR151C
YPR154W
YPR153W
YPR152C
YPR156C
YPR155C
YPR158W
YPR157W
YPR160W
YPR159W
YPR161C
YPR163C
YPR162C
YPR165W
YPR164W
YPR166C
YPR168W
YPR167C
YPR171W
YPR169W
YPR172W
YPR174C
YPR173C
YPR175W
YPR176C
YPR178W
YPR177C
YPR179C
YPR180W
YPR181C
YPR184W
YPR183W
YPR182W
YPR185W
YPR186C
YPR187W
YPR188C
YPR189W
YPR191W
YPR190C
YPR192W
YPR193C
YPR196W
YPR194C
YPR197C
YPR198W
YPR199C
YPR202W
YPR201W
YPR200C
YPR204W
YPR203W
Hmt1 Mlp1 Xpo1 Mlp2 Nup60 Cse1 Nup116 Yra1 Nup2 Nic96 Kap95 Nsp1 Nup84Nup145 Tho2 Rsc9 Npl3 Rrp6 Nup100 Nab2 Lrp1 Prp20 Hrp1
Proteins
Hmt1 Mlp1 Xpo1 Mlp2 Nup60 Cse1 Nup116 Yra1 Nup2 Nic96 Kap95 Nsp1 Nup84 Nup145 Tho2 Rsc9 Npl3 Rrp6 Nup100 Nab2 Lrp1 Prp20 Hrp1
Proteins
Figure A-1: Semi-supervised clustering of all nuclear transport and nuclear processing factors binding profiles [13–16] using the KL distance metric. The upper plot illustrates the
dendrogram. The vertical length of each branch is proportional to the distances between
clusters. The lower plot illustrates the vectors of binding ratios of each factor on a scale from
0 to 2. The binding vectors are arranged in the same order as their corresponding protein
in the dendrogram.
86
Bibliography
[1] C. T. Harbinson et al. Transcrptional regulatory code of a eukaryotic genome. Nature,
431:99–103, September 2004.
[2] Z. Bar-Joseph et al. Computational discovery of gene modules and regulatory networks.
Nature Biotechnology, 21(11):1337–1342, November 2003.
[3] T. I. Lee et al. Transcriptional Regulatory Networks in Saccharomyces cerevisiae. Science, 21(11):1337–1342, October 2002.
[4] F. Robert et al. Global Position and Recruitment of HATs and HDACs in the Yeast
Genome. Molecular Cell, 2004.
[5] J. Zeitlinger et al. Program-Specific Distribution of a Transcription Factor Dependent
on Partner Transcription Factor and MAPK Signaling. Cell, May 2, 2003.
[6] H. H. Ng et al. Targeted recruitment of Set1 Histone Methylase by Elongating Pol II
Provides a Localized Mark and Memory of Recent Transcriptional Activity. Molecular
Cell, March, 2003.
[7] H. H. Ng et al. Genome-wide location and regulated recruitement of the RSC
nucleosome-remodeling complex. Genes and Development, 2002.
[8] J. J. Wyrick et al. Genome-Wide Distribution of ORC and MCM Proteins in S. cerevisiae: High-Resolution Mapping of Replication Origins. Cell, December 14, 2001.
[9] I. Simon et al. Serial Regulation of Transcriptional Regulators in the Yeast Cell Cycle.
Cell, September 21, 2001.
[10] B. Ren et al. Genome-Wide Location and Function of DNA Binding Proteins. Science,
December 22, 2000.
[11] C. J. Roberts et al. Signaling and circuitry of multiple MAPK pathways revealed by a
matrix of global gene expression profiles. Science, 2000.
[12] F. C. Holstege et al. Dissecting the Regulatory Circuitry of a Eukaryotic Genome. Cell,
November 25, 1998.
87
[13] J. M. Casolari et al. Genome-wide Localization of the Nuclear Transport Machinery
Couples Transcriptional Status and Nuclear Organization. Cell, 117:427–439, May 2004.
[14] M. C. Yu et al. Arginine methyltransferase affects interactions and recruitment of
mRNA processing and export factors. Genes and Development, 2004.
[15] M. Damelin et al. The Genome-wide Localization of Rsc9, a Component of the RSC
Chromatin-Remodeling Complex, Changes in Response to Stress. Molecular Cell, 9:563–
573, March 2002.
[16] H. Hieronymus et al. Genome-wide mRNA surveillance is coupled to mRNA transport.
Genes and Development, 2004.
[17] S. K. Kurdistani et al. Mapping Global Histone Acetylation Patterns to Gene Expression. Cell, 117:721–733, June 11, 2004.
[18] D. Robyr et al. Microarray Deacetylation Maps Determine Genome-Wide Functions for
Yeast Histone Deacetylases. Cell, 109:437–446, May 17, 2002.
[19] S. K. Kurdistani et al. Genome-wide binding map of the histone deacetylase Rpd3 in
yeast. Nature Genetics, June 24, 2002.
[20] A. Wang et al. Requirement of Hos2 Histone Deacetylase for Gene Activity in Yeast.
Science, 2002.
[21] J. Lieb et al. Promoter-specific binding of Rap1 revealed by genome-wide maps of
protein-DNA association. Nature Genetics, August, 2001.
[22] P. L. Nagy et al. Genomewide demarcation of RNA polymerase II transcription units
revealed by physical fractionation of chromatin. PNAS, May 27, 2003.
[23] J. V. Geisberg et al. Cellular Stress Alters the Transcriptional Properties of PromoterBound Mot1-TBP Complexes. Molecular Cell, May 21, 2004.
[24] Z. Moqtaderi et al. Genome-Wide Occupancy Profile of the RNA Polymerase III Machinery in Saccharomyces cerevisiae Reveals Loci with Incomplete Transcription Complexes. Molecular and Cellular Biology, December, 2003.
[25] H. Santos-Rosa et al. Methylation of Histone H3 K4 Mediates Association of the Isw1p
ATPase with Chromatin. Molecular Cell, November, 2003.
[26] B. E. Bernstein et al. Methylation of histone H3 Lys 4 in coding regions of active genes.
PNAS, June 25, 2002.
[27] B. E. Bernstein et al. Global nucleosome occupancy in yeast. Genome Biology, August
20, 2004.
88
[28] J. Kim et al. Global Role of TATA Box-Binding Protein Recruitment to Promoters in
Mediating Gene Expression Profiles. Molecular and Cellular Biology, September, 2004.
[29] N. M. Luscombe et al. Genomic analysis o regulatory network dynamics reveals large
topological changes. Letters to Nature, September 16, 2004.
[30] H. Yu et al. Genomic analysis of essentiality within protein networks. TRENDS in
Genetics, June, 2004.
[31] A. C. Gavin et al. Functional organization of the yeast proteome by systematic analysis
of protein complexes. Nature, 2002.
[32] Y. Ho et al. Systematic identification of protein complexes in Saccharomyces cerevisiae
by mass spectrometry. Nature, 2002.
[33] T. Ito et al. Toward a protein-protein interaction map of the budding yeast: A comprehensive system to examine two-hybrid interactions in all possible combinations between
the yeast proteins. PNAS, 2000.
[34] P. Uetz et al. A comprehensive analysis of protein-protein interactions in Saccharomyces
cerevisiae. Nature, 2000.
[35] R. J. Larsen, M. L. Marx. An Introduction to Mathematical Statistics and Its Applications. Prentice Hall, 2001.
[36] R. O. Duda, P. E. Hart, and D. G. Stork. Pattern Classification. John Wiley and Sons,
2001.
[37] T. L. Bailey, M. Gribskov et al. Combining evidence using p-values: application to
sequence homology searches. Bioinformatics, 1998.
[38] H. W. Mewes. Munich Information Center for Protein Sequences: Comprehensive Yeast
Genome Database. http://mips.gsf.de/genre/proj/yeast/index.jsp, 2004.
[39] Steen Knudsen. Guide to Analysis of Dna Microarray Data. Wiley, 2004
89
Download