Budding Yeast Cell Cycle Analysis and Morphological
Characterization by Automated Image Analysis
by
Elizabeth Perley
B.Sc., Massachusetts Institute of Technology, 2010
Submitted to the Department of Electrical Engineering and Computer Science
in Partial Fulfillment of the Requirements for the Degree of
Master of Engineering in Electrical Engineering and Computer Science
at the
Massachusetts Institute of Technology
May 2011

© 2011 Massachusetts Institute of Technology
The author hereby grants to M.I.T. permission to reproduce and to
distribute publicly paper and electronic copies of this thesis document
in whole and in part in any medium now known or hereafter created.
Author ..........................................................
Department of Electrical Engineering and Computer Science
May 20, 2011
Certified by ......................................................
Mark Bathe, Assistant Professor
Thesis Supervisor
Accepted by ......................................................
Dr. Christopher J. Terman
Chairman, Masters of Engineering Thesis Committee
Budding Yeast Cell Cycle Analysis and Morphological
Characterization by Automated Image Analysis
by
Elizabeth Perley
Submitted to the Department of Electrical Engineering and Computer Science
in partial fulfillment of the requirements for the degree of
Master of Engineering
Abstract
Budding yeast (Saccharomyces cerevisiae) is a standard model system for analyzing cellular responses in relation to the cell cycle. The analysis of the yeast cell cycle is typically done visually or by using flow cytometry. The first of these methods is slow, while the second offers only limited information about the cell's state. This thesis develops methods for automatically analyzing yeast cell morphology and cell cycle stage using high content screening with a high-capacity automated imaging system. The images obtained using this method can also provide information about fluorescently labelled proteins, unlike flow cytometry, which can only measure overall fluorescent intensity. The information about yeast cell cycle stage and protein amount and localization can then be connected in order to develop a model of yeast cellular response to DNA damage.
Thesis supervisor: Mark Bathe
Supervisor title: Assistant Professor
Table of Contents

Abstract
Table of Contents
List of Figures
List of Tables
1 Introduction
2 Related Work and Background
  2.1 Background
    2.1.1 Yeast cell cycle stage and response to DNA damage
    2.1.2 Yeast Cell Cycle Analysis and Morphological Characterization by Multispectral Imaging Flow Cytometry
  2.2 Current cell detection and classification software
    2.2.1 CellProfiler
3 Overview
4 Image Processing
  4.1 Imaging
    4.1.1 Bright field images vs. Concanavalin A images
  4.2 Cell detection and segmentation
    4.2.1 Edge detection with watershedding of bright field images
    4.2.2 Thresholding using Concanavalin A
    4.2.3 Voronoi-based segmentation using CellProfiler
    4.2.4 Yeast-specific cell detection and segmentation
  4.3 Nucleus detection and segmentation
  4.4 Discussion
5 Cell Cycle Stage Classification
  5.1 Creation of a training set
  5.2 Feature selection
  5.3 Feature calculation
    5.3.1 Basic cell features
    5.3.2 Bud detection
  5.4 Feature validation
    5.4.1 Using the training set
    5.4.2 Using cells arrested in stages of cell cycle
  5.5 Classification using neural nets
    5.5.1 Creation of the neural net
    5.5.2 Net performance on training set
    5.5.3 Net performance on additional data
6 Conclusions
7 Appendices
  7.A Edge detection and watershedding code
  7.B Concanavalin A thresholding code
  7.C Nucleus detection code
  7.D Feature calculation code
  7.E Bud detection code
References
List of Figures

1 Bright field image of budding yeast cells
2 Overview of the approach to examine budding yeast cellular response to DNA damage
3 Comparison of bright field and ConA images
4 Edge detection and watershedding to segment cells
5 Distance transform of cells
6 Edge detection and watershedding on ConA images
7 CellProfiler cell segmentation
8 Polar plot of yeast cells for segmentation
9 Yeast-specific cell detection algorithm to detect cells
10 Budding yeast cell cycle stages. Adapted from Calvert, et al. [2]
11 Cell undergoing bud detection algorithm
12 Cell area feature distribution
13 Average nuclear intensity distributions
14 Overall nuclear intensity distributions
15 Bud size distributions
16 Bud/mother cell ratio distributions
17 Cell area - arrested cells vs. training set comparison
18 Average nuclear intensity - arrested cells vs. training set comparison
19 Bud size - arrested cells vs. training set comparison
20 Neural net used for classification
21 Cell area feature distribution
22 Average nuclear intensity distributions
23 Overall nuclear intensity distributions
24 Bud size distributions
25 Bud/mother cell ratio distributions
List of Tables

1 Feature list for classification
2 Differentiation between G1 and G2 using a feature subset
3 Differentiation between G1, S and G2/M using a feature subset
4 Estimated cell diameter
5 Neural net performance on training set
6 Neural net performance on all data
7 Estimated cell diameter
8 Neural net performance on arrested cells
1 Introduction
In recent years there has been a significant amount of development in the area of high content (HC) imaging. This technique combines high throughput with high resolution imaging to provide statistics over a large number of samples, together with images that provide accurate measurements. A number of platforms and several pieces of software are now available to acquire and analyze images from HC screens. However, most of these were developed specifically for mammalian cells, which are large and grow in single layers; they therefore make assumptions about the types of images that can be used as input and lack the ability to deal with other types of cells.
One type of cell that these platforms are often not capable of dealing with is the budding yeast (Saccharomyces cerevisiae) cell. Budding yeast is one of the standard model systems in biology. Its cell cycle has been well characterized in terms of cellular morphology as well as the proteins involved, and a large body of biological knowledge already exists. This makes it an excellent choice for developing models. However, these cells do not grow in a single layer, and they are quite small compared to mammalian cells, so in order to use HC imaging, a new analysis tool is needed.
The first part of this thesis describes methods to analyze HC images of budding yeast
cells. First, cell outlines must be determined from the images of cells. Each well on a plate
contains hundreds of cells and requires multiple images to be taken, each of which contains
tens of cells. These cells must be correctly identified, and accurately segmented and outlined
to get the correct cell shape, as well as to allow for the correct calculation of protein levels and
localization. This problem poses several challenges: the images are taken automatically, and as a result are not always completely focused; they often have uneven illumination; and they contain other particles or out-of-focus cells, which contribute to noise and misidentification of cells. Budding yeast cells also have very thick cell walls, making the decision to find the "correct" cell border a difficult one. This can be seen in Figure 1: it is unclear whether the border is at the outer edge of the cell wall, the inner edge, or somewhere in between. Several different approaches to cell segmentation were investigated, and the best was selected.

Figure 1: Bright field image of budding yeast cells
The second part of this thesis describes a way to use these image analysis methods to
develop a model of how budding yeast cells respond to DNA damage. When a cell's DNA
becomes damaged, the cell must respond in some way in order to repair its DNA and continue
its progress in the cell cycle. Many genes and pathways have been found that are responsive
to DNA damage in budding yeast, using techniques such as bulk transcriptional profiling
and genomic phenotyping assays. However, little is known about the mechanism by which
these changes occur, and how protein levels and localization within the cell are affected by
damage.
Previously, flow cytometry was used to study this response. However, this method only
allows overall protein and DNA levels to be measured. It has been shown that the response
of budding yeast cells is dependent on their current stage of the cell cycle, which cannot be
seen using flow cytometry. Although it is possible to arrest cells at specific stages of the cell
cycle with drugs, these may introduce artifacts. Since HC imaging allows cell morphology
to be examined and the cell cycle stage determined, this approach offers a significant advantage: it can provide a more detailed picture of the response of budding yeast.
This thesis describes a machine learning approach to classifying yeast cells into their cell
cycle stages. Once the cell cycle stages can be accurately identified from cell shape and
nuclear shape, DNA content and position, then asynchronous populations of yeast cells can
be characterized. Algorithms to find the bud of a cell and calculate morphological features
of the cells were developed. Using these features, neural nets were tested as ways to classify
these cells in a supervised learning approach.
2 Related Work and Background

2.1 Background

2.1.1 Yeast cell cycle stage and response to DNA damage
Previous work has been done to determine the cellular response of budding yeast to DNA
damaging agents by Jelinsky, et al. [10]. In their paper, budding yeast cells were exposed to
various carcinogenic alkylating agents and oxidizing agents, as well as ionizing radiation, and
it was found that this exposure modulates transcript levels for over one third of the genes
of the yeast cells. What was particularly relevant about these results was the finding that
for one of the carcinogenic agents, MMS, the response is dramatically affected by the cell's
position in the cell cycle at the time that it is exposed to the agent.
Cultures of log-phase budding yeast cells were arrested in G1 by α-factor, in S phase by
hydroxyurea, and in G2 by nocodazole. These cells were then allowed to grow for three days
before each group of cells was split in half and MMS added to one of the two halves in each
phase.
In order to investigate the cellular response to the addition of MMS, GeneChip hybridizations were used. A set of arrays containing probes for 6,218 yeast ORFs was used, and the data were analyzed using the GeneChip analysis suite. Then, any gene which showed a change of 3.0-fold or more in at least one of the experimental conditions was examined further. The result
was that there were 693 such genes which were responsive to treatment with MMS. These
were shown to be certainly responsive, with none of the error bars for treated and untreated
cells coming close to overlapping.
Initially, when asynchronous log-phase cultures were treated with MMS, many genes were
scored only as weakly responsive. However, treating yeast cells which had been arrested in
the stages of the cell cycle caused many more genes to be shown as clearly responsive to
the treatment. Of the genes that were responsive, 199 were responsive only in G1, 84 were responsive only in S phase, 94 were responsive only in G2, and 229 were responsive only in
stationary phase. Prior to these experiments, fewer than 20% of the genes examined had
been shown to have cell-cycle dependent expression.
These results indicate that, in order to fully understand the budding yeast response to
damage, each cell's stage in the cell cycle must be taken into account.
2.1.2 Yeast Cell Cycle Analysis and Morphological Characterization by Multispectral Imaging Flow Cytometry
Yeast cell cycle analysis has previously been done using multispectral imaging flow cytometry
(MIFC) in a computational approach [2]. In a paper by Calvert, et al., an improvement on the
traditional yeast cell cycle analysis using flow cytometry is developed. MIFC offers multiple
parameters for morphological analysis, allowing for the calculation of a novel feature, bud
length, which is used to observe the change in morphological phenotypes between wild-type
yeast cells, and those which overexpress NAP1, which causes an elongated bud phenotype.
An imaging flow cytometer was used, which allows the simultaneous detection of bright
field, dark field, and four fluorescent channels. The images from MIFC were then used to
visually assign cells to one of the three stages of the cell cycle: G1, S, and G2/M. Cells in G1 were distinguished as those with a single nucleus and no bud, cells in S as those with a single nucleus and a visible bud, and cells in G2/M as those with elongated or divided nuclei and a large bud.
The yeast cell cycle stage was determined using a combination of DNA intensity and nuclear morphology. Cells were split into those that contained 1N DNA content and round nuclei, 2N DNA content and round nuclei, and 2N DNA content and elongated or divided nuclei, and these three groups were labelled as G1, S, and G2, respectively. 100 cells from each stage of the cell cycle were then visually identified and labelled in order to validate this method of determining cell cycle stage, which was 99% accurate. Further testing of this method of cell cycle analysis was performed; results from visual analysis of morphology and from this method of separation were similar, and far more accurate than analysis with standard flow cytometry using a cell cycle modelling program.
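As an illustration of this kind of separation (a hypothetical MATLAB sketch, not the code from [2]; dna is assumed to be per-cell DNA intensity normalized so that 1N cells sit near 1, and elongated an assumed boolean flag for elongated or divided nuclei):

    % Hypothetical gating in the style of Calvert, et al. [2].
    isG1 = dna <  1.5 & ~elongated;   % 1N DNA content, round nucleus
    isS  = dna >= 1.5 & ~elongated;   % 2N DNA content, round nucleus
    isG2 = dna >= 1.5 &  elongated;   % 2N DNA content, elongated or divided nucleus

The 1.5 cutoff between the 1N and 2N populations is an assumption for this sketch; in practice it would be read off the DNA intensity histogram.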
The bright field images were analyzed automatically to calculate the bud length feature.
To do this, an object mask for the cell was calculated by separating the cell from its background using pixel intensity in the bright field channel. This mask was then eroded by 3
pixels. Then, the bud length was calculated by subtracting the maximum thickness of the
cell calculated from MIFC, which is assumed to be the diameter of the mother cell, from
the total length of the cell as determined from the object mask. Relative bud length was
calculated as the ratio of the minimum thickness of the cell, calculated from MIFC, to the width of the bud. The cell aspect ratio was calculated as the ratio between the total cell
length and the width. Then, any cell which had a relative bud length larger than 1.5, and an
aspect ratio of less than 0.5 was considered to be a cell with an elongated bud. The NAP1
strains and wild-type strains were then analyzed for differences in bud length as well as
cellular shape. It was shown that MIFC could accurately distinguish between the elongated
bud phenotype and a normal bud phenotype.
This approach shows that it is possible to determine cell cycle stage with MIFC using
simple computational methods. In this case, a simple scatter plot was used to separate the
cells between G1 , S and G2 , using DNA content and nuclear morphology. It also shows that
morphological features can be generated using cell shape and size.
2.2 Current cell detection and classification software

2.2.1 CellProfiler
CellProfiler aims to perform automatic cell image analysis for a variety of phenotypes of cells from different organisms [4]. It is useful for measuring a number of cell features, including cell
count, cell size, cell cycle distribution, organelle number and size, cell shape, cell texture,
and the levels and localization of proteins and phosphoproteins. The motivation behind the
creation of CellProfiler is to provide quantitative image analysis, as opposed to human-scored
image analysis, which is qualitative and usually classifies samples only as hits or misses. It
also allows the processing of data at a much quicker rate, and the creators consider cell image
analysis to be one of the greatest remaining challenges in screening.
The software system consists of many already-developed methods for many cell types
and assays. It uses the concept of a pipeline of these individual modules, where each module
processes the images in some manner, and the modules are placed in a sequential order. The
standard order for this processing first contains image processing, then object identification,
and finally measurement.
According to the developers of CellProfiler, the most challenging step in image analysis
is object identification. It is also one of the most important steps, since the accuracy of
object detection determines the accuracy of the resulting cell measurements. The standard
approach to object identification in CellProfiler is to first identify primary objects, which
are often nuclei identified from DNA-stained images. Once these primary objects have been
identified, they are used to help identify and segment secondary objects, such as cells. The
identification first of primary objects helps to distinguish between different secondary objects,
since they are often clumped. In the software, first the clumped objects are recognized and
separated, then the dividing lines between them are found. Some of these algorithms for
object identification are discussed later in subsection 4.2. One of these, an improved Propagate algorithm, was developed specifically for CellProfiler and enabled cell segmentation for some phenotypes for which it had never before been possible.
The main testing of CellProfiler was performed on Drosophila KC167 cells because they
are particularly challenging to identify using automated image analysis. It was also tested
on many different types of human cells, and has been shown to work well on mouse and rat
cells as well.
One of the main goals of CellProfiler was to be able to identify a variety of phenotypes and
to be flexible, modular, and to make setting up an analysis feasible for non-programmers.
This means that it does not have many modules for specific cell types, but rather more
general ones that are applicable to many different types, which does make it good for many
applications. However, for specific applications, or for cell types that it wasn't designed for,
CellProfiler is not always a good fit.
3 Overview
The approach to examine budding yeast cellular response to DNA damage is outlined in
Figure 2. First, images of cells are acquired using a Cellomics high-content automated
imaging system. They then undergo image processing, which involves cell detection and
segmentation, as well as nucleus detection. As described in the CellProfiler approach, this
step is the most important because it determines the accuracy of all other measurements.
Figure 2: Overview of the approach to examine budding yeast cellular response to DNA damage: segment cells; segment nuclei; measure features of cells (size, intensity, bud size, etc.); classify cells according to stage of cell cycle; calculate statistical metrics to determine the quality of the features and classification.
Once the cells have been detected, then features of these cells are measured, such as cell
size, shape, and nuclear size, shape, and intensity. These features are then used as input to
a supervised learning system in order to classify cells as being in one of the three cell-cycle
stages: G1, S, or G2/M. Finally, the features and the classification system are validated and
tested.
Once cell cycle can be accurately determined, it is possible to measure other features
of the cell from fluorescent channels and combine these measurements with the cell cycle
information to learn more about the yeast cell response to DNA damage.
4 Image Processing

4.1 Imaging
The yeast library which was used was developed at the University of California San Francisco,
and is now commercially available. It consists of yeast strains in which individual proteins
are expressed with C-terminal GFP tags from endogenous promoters. The cells were fixed
and stained with DAPI for visualizing the DNA, and with Concanavalin A for visualizing
the cell walls. These cells were then imaged on 96-well plates using a Cellomics system
from Thermo Fisher Scientific. Images were acquired in three fluorescent channels (for DAPI,
GFP, and Concanavalin A) and in the bright field channel to allow ready visualization of
cell contours.
4.1.1 Bright field images vs. Concanavalin A images
The standard way of viewing cell morphology is through the use of bright field images.
These images show the basic cell outline of the budding yeast cells. However, these bright
field images might not provide the best data for a high content screening approach for the
following reasons:
" It is difficult even for a human to determine what the actual boundary of the cell is
due to the thickness of the cell wall in the bright field images
" Cell buds are difficult to see due to the thickness of the cell wall, especially when they
are small.
* Bright field images show not only cells, but also other particles in the well, and out of
focus cells, which must be removed for further analysis of the cell images
Despite these problems, bright field is a type of image worth exploring for looking at cells
because it requires no additional stains or steps in cell preparation. It also doesn't require the
use of a GFP channel on the Cellomics imaging platform, which only has a limited number
of channels.
Another choice for viewing cell morphology is to use cells which have been stained with
fluorescent-conjugated Concanavalin A (ConA). ConA is a lectin which binds to proteins on the budding yeast cell walls. ConA provides a much clearer image of yeast cell
morphology for the following reasons:
" Cell outlines are well defined with no ambiguity
" Even small buds can be easily seen
" There is a minimal amount of noise, with no extra particles or out-of-focus cells, due
to the fact that this is a fluorescent image
However, this requires the use of an additional channel in the microscope, which is only
capable of taking pictures in 4 channels. Although in this paper not all channels are needed,
it is desirable to have additional channels available so that multiple molecules in the cell may
be observed.
The other drawback to using ConA images is that the separations between cells are not quite as clear. While it is quite easy to tell which parts of the images are cells and which are not, it can be difficult to distinguish between different cells in a cluster. This is because there is no actual outline of the cell; the entire cell is stained with ConA. However, use of a more dilute sample can alleviate this problem.
In Figure 3, the comparison between bright field and ConA images is clear: ConA images provide a much clearer view of the cells. Although some groups have had success with segmenting cells using bright field images alone [13], ConA images were chosen here after investigating both types of images. Bud size is an easy way to roughly determine cell cycle stage, so the ability to calculate precise cell outlines was a major deciding factor in the choice of images. It is also important that the amount of protein in the cell can be determined: if the correct cell outlines are not found, the amount of protein can be overestimated or underestimated.

Figure 3: Comparison of bright field (left) and ConA (right) images of budding yeast cells
4.2 Cell detection and segmentation
The cells in any image must first be separated from their background before further analysis
can be done. While for many applications, only a cell count or a rough estimate of the cell
shape and size is necessary, an accurate border is required in order to detect the bud and
compute levels of fluorescence correctly.
4.2.1 Edge detection with watershedding of bright field images
There are three major challenges in detecting and segmenting cells. The first is that while the bright field images are the preferred choice for cell detection, cells cannot simply be thresholded out of these images, due to the low contrast between cells and the background. Another such challenge is the presence of non-cellular objects within the
bright field images. Each object that is detected must be a cell, so any object that might be
mistaken for a cell must be identified and removed. The last challenge is the fact that the
cells are often grouped very closely together, and the cell detection algorithm must be able
to separate these clusters.
Edge detection

One way to detect the cells in the low-contrast bright field images is to perform edge detection. Edge detection algorithms calculate the gradient across the image and then use a cutoff threshold to determine which values are part of an edge and which are not. This procedure uses the Canny edge detection algorithm [3], which is designed to mark each edge only once, to detect edges as close as possible to the actual edges in the image, and to be unaffected by noise in the original image.
The cell detection algorithm is outlined as follows, and its code can be found in appendix 7.A:
1. Apply a Gaussian filter to reduce the amount of noise in the original image
2. Use the Canny algorithm to detect edges in the image.
3. Dilate the detected edges with linear elements. This fills in small gaps in the edges,
which occur due to the fact that the image is very low-contrast, making it difficult to
detect a full outline of each cell.
4. Fill in the interiors of the cells by filling in every closed outline in the image.
5. Erode with a circular element to smooth the cells.
6. Remove objects that are too small to be cells.
Although the Canny edge detection algorithm has a noise removal step in it, the first step of the cell detection applies a Gaussian blur for further noise removal. This causes the edge detection to detect fewer spurious small edges, since the cells are large enough that this step does not affect their edges significantly. The MATLAB Canny edge detection function takes an argument that specifies the sensitivity threshold. In the code, a sensitivity value of 0.3 was used, chosen to minimize the number of spurious small edges while still detecting the main cell outlines.
Once edges have been detected, there are often gaps in the full cell outlines. These must
be removed so that a cell mask can be created by filling in the outlines. Dilating the edges
with a small horizontal and vertical linear element fills in the majority of these small gaps.
After the cells have been filled in, the cell mask which has been created is slightly larger
than the cell. This is partially due to the fact that the edges were dilated, but also due to
the fact that the cell wall is quite thick in the bright field images. For any particular cell,
two edges are detected: the outside edge of the cell wall, and the inside edge of the cell wall.
The cell border is actually somewhere in between these two lines, so the cell must be eroded
a small amount. This erosion step can also help to smooth the jagged edges that can be
created due to the edge detection. A circular element is used for the erosion to maintain as
much of the original cell shape as possible. The erosion also removes any objects which are
too thin in one dimension to possibly be a cell.
Although the last erosion step is necessary, it removes quite a bit of detail from the cell
outlines. While it smooths the jagged edges, it also can remove small buds from the cell. It
also erodes the cell shape such that it is not always clear where the bud of the cell is exactly.
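Putting these six steps together, a minimal MATLAB sketch follows (the full code is in appendix 7.A; apart from the 0.3 sensitivity, the filter and structuring-element sizes here are illustrative assumptions):

    % Sketch of bright field cell detection (cf. appendix 7.A).
    I  = im2double(imread('brightfield.tif'));
    I  = imfilter(I, fspecial('gaussian', [5 5], 1), 'replicate');  % 1. denoise
    bw = edge(I, 'canny', 0.3);                                     % 2. Canny edges
    bw = imdilate(bw, [strel('line', 3, 0) strel('line', 3, 90)]);  % 3. bridge gaps
    bw = imfill(bw, 'holes');                                       % 4. fill interiors
    bw = imerode(bw, strel('disk', 4));                             % 5. smooth and shrink
    bw = bwareaopen(bw, 100);                                       % 6. drop small objects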
Figure 4: Using edge detection and watershedding to segment cells
Watershedding
After cell detection has been performed, the cells must still be segmented, since clusters of
cells still remain. One standard algorithm used for cell segmentation is watershedding [6]. The basic concept behind watershedding is that the image is treated as a height
field, and then the pixels are separated based on which minimum a drop of water would flow
to if it were placed on that pixel in the height field.
In order to perform watershedding, first a distance transform is calculated on the cell
mask obtained from the cell detection process. This involves calculating the distance from
any pixel which is part of a cell to the closest pixel which is not part of a cell. So, pixels
which are at the center of a cell have the highest values. An example of this can be seen in Figure 5. The values are then negated, and watershedding is performed. This means that any pixel which is considered to be part of a cell is assigned to the local minimum to which a drop of water would flow if it were placed on that pixel in the negated distance transform. In Figure 5 these local minima would be the brightest pixels in the distance transform.

Figure 5: Comparison of cell mask (left) with its distance transform (right)
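The watershed step itself is compact in MATLAB; a sketch, assuming mask is the binary cell mask produced by the detection step:

    % Watershed segmentation of a binary cell mask (a sketch; cf. appendix 7.A).
    D = bwdist(~mask);          % distance from each cell pixel to the background
    D = -D;                     % negate: cell centers become local minima
    D(~mask) = -Inf;            % keep the background out of the catchment basins
    L = watershed(D);           % label each pixel by the basin it drains into
    segmented = mask & (L > 0); % ridge lines (L == 0) split touching cells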
However, watershedding is not an ideal method for cell segmentation. It can be practical
because it is quick and can give a reasonable estimate of the number of cells in an image
as well as the approximate area of the cells. However, a single noisy pixel can affect the
grouping of an entire cluster of pixels by creating a gap between the group and a different
minimum. The cell borders which are finally detected are also only somewhat related to
the original image, meaning that when the cells in a cluster are segmented, the line chosen
is based on the distance transform, rather than any edges that were present in the original
image. Then, by the time the watershedding is performed, the cell outlines have already
been manipulated so much that the distance transform is not a good measure of where the
cells lie.
Even on images which are not noisy and segmentation should be easy, such as the ConA
images, edge detection with watershedding does not perform well.
This can be seen in
Figure 6, where this same algorithm was run on ConA images of cells. Although the initial
segmentation of the cells was trivial, the subsequent steps caused much of the clarity of the
cell outlines and shapes to be lost, resulting in a final cell mask which has lost a great deal
of information.
Figure 6: Comparison of original ConA image (left) with its cell outlines after edge detection
(middle) with the final output after smoothing and watershedding (right)
4.2.2 Thresholding using Concanavalin A
Images of yeast cells taken using a Concanavalin A stain in a fluorescent channel provide a much better source image to work with than bright field images. The cell outlines are
clear, and since the images are high-contrast, thresholding the cells from their background
is possible. This makes cell detection a trivial task.
The cell detection algorithm is outlined below, and its code can be found in appendix 7.B:
1. Use a Gaussian filter to remove noise in the image
2. Adjust the contrast of the image so that the brightest pixels are as bright as possible
and the darkest are as dark as possible. This helps to prepare the image so that it can
be thresholded most easily.
3. Perform morphological closing on the image to remove uneven brightness in the cells [6]. This unevenness occurs because the edges of the cells appear brighter in the ConA images, since ConA is a cell wall stain. Morphological closing helps to remove gaps in the middle of the cell.
4. Threshold out the cells using an automatically calculated threshold value
5. Fill in the interiors of the cells by filling in every closed outline in the image.
6. Fill in the gaps in any outlines of the cells using linear line elements, and then fill in
any new holes created.
7. Remove objects which are too small or too large to be cells
The first step of the cell detection applies a Gaussian blur of size 3 x 3 for further noise
removal. This smooths the cells to allow smooth cell borders to be detected once the image
is thresholded. Then the values of the image are scaled such that they take on the full
range of possible pixel values. Spreading the values out in this way allows the image to be
thresholded more easily. The morphological closing is performed to fill in gaps in the middle
of the cells and to even out some of the cell brightness. Though the interiors of the cells will
be later filled in, this is done as a preventative measure to also ensure that there are not
gaps at the edge of the cells which cannot get filled in.
This image is then turned into a black and white cell mask using an automatically calculated threshold value. The standard MATLAB graythresh method is used, which implements Otsu's method: the threshold is chosen to minimize the intraclass variance of the black and white pixels [16]. This function calculates a starting point for a cutoff threshold, which is then scaled by a factor of 0.8 to adjust it for the image set being used. After the cells are filled in, there are still some remaining portions of cell outlines which have not been filled in. The image is dilated with small linear elements at many different angles to fill in the largest number of gaps possible. Any new cell interiors which have been created are then filled in using MATLAB's imfill method, which fills in any background pixels which cannot be reached from the edge of the image.
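A minimal MATLAB sketch of this pipeline follows (the full code is in appendix 7.B; apart from the 3 x 3 blur and the 0.8 scaling, the element sizes and area cutoff are illustrative assumptions):

    % Sketch of ConA-based cell detection (cf. appendix 7.B).
    I  = im2double(imread('cona.tif'));
    I  = imfilter(I, fspecial('gaussian', [3 3], 0.5), 'replicate'); % 1. denoise
    I  = imadjust(I);                             % 2. stretch contrast to full range
    I  = imclose(I, strel('disk', 3));            % 3. even out interior brightness
    bw = im2bw(I, 0.8 * graythresh(I));           % 4. Otsu threshold, scaled by 0.8
    bw = imfill(bw, 'holes');                     % 5. fill cell interiors
    for deg = 0:30:150                            % 6. bridge remaining outline gaps
        bw = imdilate(bw, strel('line', 3, deg));
    end
    bw = imfill(bw, 'holes');
    bw = bwareaopen(bw, 100);                     % 7. drop objects too small to be
    % cells; objects too large would be removed with an area filter as well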
However, cell segmentation still presents a problem. Watershedding loses information
about cell shape as shown in the previous section. This problem can be dealt with by using
a sample of cells which is dilute enough such that there are few clusters of cells, which are
then not included in further data analysis (they get removed in the last step of the cell
detection algorithm). It also might be possible to segment these cells using an algorithm
such as that described in the next sections. This problem ended up being beyond the scope
of the thesis, but would be good for further investigation.
4.2.3 Voronoi-based segmentation using CellProfiler
The CellProfiler software discussed in subsubsection 2.2.1 uses a novel method of segmenting cells using Voronoi regions [12]. This approach was designed to overcome the limitations of
watershedding that cause it to be a fragile algorithm for segmenting cells. It does so by
comparing neighborhoods of pixels rather than individual pixels to avoid the issue of a single
noisy pixel affecting the segmentation of a group of pixels. It also tries to segment cells based
on the borders found in the original image but includes a regularization factor to provide
reasonable behavior when there is not a strong edge between two cells.
Unlike the other algorithms which relied only on the image of the cell morphology to
detect and segment cells, this approach relies on an image of the nuclei as well. It considers
these nuclei to be seed regions, and proceeds to find a cell for each nucleus detected.
The CellProfiler platform provides an implementation of this algorithm, and the image
of detected nuclei from subsection 4.3 was used to provide the seed regions for the algorithm.
Both bright field and ConA images were used as potential source images for which to segment
the cells.
The results of two representative runs of the algorithm on these images are shown in
Figure 7. In the runs with bright field images, the noisiness of the original image caused
the detected cell borders to also be extremely noisy and jagged, often extending far beyond
the actual cell border. The ConA images provided smoother borders for the most part, with
some extreme missegmentations, such as putting one cell inside another. The ConA images also often had borders which did not align with cell borders, or which extended beyond them.

Figure 7: Output of the CellProfiler segmentation algorithm run with the bright field (left) and ConA (right) images as input, superimposed on the original bright field image. Nuclei used as seed regions are outlined in green and detected cell borders are shown in red.
Results
While this algorithm initially appeared promising, results were not as good as those obtained
from thresholding. One possible explanation for the poor quality of cell detection and segmentation is that the cells being segmented are yeast cells, while CellProfiler was designed mostly for mammalian cells, where most cells share borders with other cells. Yeast cell morphology is significantly different from that of other cells. Yeast cells are smaller, so small errors in borders affect overall accuracy more, and they are not usually touching on all sides like mammalian cells, though there are occasional clusters. However, it would
be possible in future work with budding yeast cells to adapt this algorithm for their specific
cell morphology. It did perform well in determining which objects were actually composed
of more than one cell and approximating where the borders might be between them.
4.2.4 Yeast-specific cell detection and segmentation
It is appropriate to look at a cell detection algorithm tailored specifically to yeast cells, given that the more general algorithm in the previous section did not perform well due to differences in cell morphology. Although there are not many yeast-specific segmentation approaches, one group developed an approach that performed well [13]. This approach uses bright field images and includes a scheme for cell detection as well as segmentation, both of which rely on the computation of a gradient image. This gradient image helps to eliminate problems of uneven illumination. The segmentation part of the algorithm relies on the detection of candidate cell centers in a cluster of cells and then the use of a polar plot of the cell to find the best cell contours.
This approach shouldn't suffer from the same problems as had been described in subsubsection 4.2.1. It chooses the best cell borders using dynamic programming, which optimizes
the choice of a cell border by looking at the whole cell. It avoids the problem of having
jagged cell borders by the nature of the dynamic programming approach, and it detects the
cells initially by using a method that models noise in the image to eliminate it as a source
for errors.
The cell detection process can be outlined as follows:
1. Compute the gradient image from the original bright field image using Prewitt's method [7].
2. Find the threshold at which to determine which gradient values are part of the cells
and which are noise. This is calculated by fitting the gradient values to a distribution
and removing those below a specific value (described in detail below).
3. Fill in remaining holes in the cells.
4. Perform a morphological opening with a small circular element to remove small structures due to noise.
The output of this algorithm should be a set of cells ready to be segmented. The gradient
image is calculated by filtering the image with two masks, one for each direction, to enhance
differences in both directions, and then letting the final gradient image be the magnitude of
these two filtered images. Then, any pixel with a value above a threshold, defined as β in the original paper, is assigned to be a part of the foreground.

In order to calculate β, the gradient values below the median are fitted to a Rayleigh distribution function with a parameter σ, which is varied until the best fit is found. Then β = 7.5σ, which corresponds to designating pixels with gradient values more than 6 standard deviations larger than the mean of the estimated distribution of background pixels as part of the foreground. This scheme relies on the assumption that the noise in the background regions of the image is approximately normally distributed.
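A sketch of this threshold estimate (an assumption-laden reconstruction of the scheme in [13], not its code; raylfit is from the MATLAB Statistics Toolbox):

    % Prewitt gradient magnitude and Rayleigh-based threshold (sketch of [13]).
    h     = fspecial('prewitt');            % horizontal-edge emphasis mask
    gx    = imfilter(I, h', 'replicate');   % gradient in x
    gy    = imfilter(I, h,  'replicate');   % gradient in y
    G     = hypot(gx, gy);                  % gradient magnitude image
    v     = G(G < median(G(:)));            % values below the median ~ background
    sigma = raylfit(v(:));                  % best-fit Rayleigh parameter
    beta  = 7.5 * sigma;                    % threshold from the paper
    fg    = G > beta;                       % foreground (cell) pixels
    fg    = imfill(fg, 'holes');
    fg    = imopen(fg, strel('disk', 2));   % remove small noise structures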
Once the cells have been detected from their backgrounds, the cells must then be segmented. The segmentation algorithm is outlined below:
1. Find candidate cell centers from the segmented image.
2. Create a polar plot of the cell from each candidate cell center.
3. Use dynamic programming with global constraints to choose an optimal path from
left to right on the polar plot. The original plot is repeated three times and the final
chosen path is taken from the center repetition of the plot to ensure that the chosen
path is closed.
There are multiple ways to choose candidate cell centers depending on how clustered the
cells are. If the clusters contain no more than 3 or 4 cells each, then a simple distance transform on the cell mask can be used, with the local maxima considered to be candidate cell centers; this is a good choice for this dataset.

Figure 8: An example polar plot of two yeast cells calculated from the gradient image. The dynamic programming algorithm is run on this plot to find an optimal path.
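For the sparse clusters in this dataset, the candidate centers reduce to a few lines of MATLAB (a sketch; cellMask is an assumed name for the binary cell mask):

    % Candidate cell centers as local maxima of the distance transform.
    D = bwdist(~cellMask);                  % distance to the background
    centers = imregionalmax(D) & cellMask;  % local maxima inside cells
    [cy, cx] = find(centers);               % candidate center coordinates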
The polar plot is then created from each candidate cell center by sampling rays outward at 30 equally spaced radial points. A cell contour can then be extracted from this plot.
The dynamic programming scheme uses constraints to ensure convexity for the majority
of the cell, so it penalizes transitions in the polar plot which correspond to right turns.
It does ensure that the extracted contour is closed by going around the cell three times.
This algorithm should perform well even when the calculated candidate cell center is quite
off-center.
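A sketch of the polar-plot construction for one candidate center (cx, cy) follows; "30 equally spaced radial points" is read here as 30 equally spaced angles, and the search radius maxR is an assumption, so this is only an approximation of the scheme in [13]:

    % Build a polar plot of the gradient image G around one candidate center.
    nAngles = 30; maxR = 40;                 % assumed sampling parameters
    theta = 2*pi*(0:nAngles-1)/nAngles;
    P = zeros(maxR, nAngles);
    for j = 1:nAngles
        for r = 1:maxR
            x = min(max(round(cx + r*cos(theta(j))), 1), size(G, 2));
            y = min(max(round(cy + r*sin(theta(j))), 1), size(G, 1));
            P(r, j) = G(y, x);               % gradient magnitude along the ray
        end
    end
    % Dynamic programming then finds an optimal left-to-right path through
    % [P P P], keeping the middle repetition so that the contour closes.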
Results
This approach relied on the ability to segment the cells by calculating the gradient image
and then choosing a cutoff value from this image to determine which parts of the image were
edges. However, it did not detect cell edges such that the cells could be filled in in the way
that the algorithm describes. The cells were not even close to having closed outlines. To
test if it was simply a problem of fitting the data correctly, the best value for β was chosen manually for several images. However, the output was still not ideal. There was still the problem of the two detected edges for each cell wall, as was discussed in earlier approaches to cell detection: it is difficult to know which cell edge to choose. Then, with any choice of β, it is not possible to guarantee that only one set of edges is included in the thresholded
images, or that both sets of edges are included. Some partial edges are included, and in
some cases, an entire outer edge is thresholded out. Finally, any partial outer edges which
were detected are then removed with the morphological opening. This leads to a great deal
of inconsistency in the final detected cells: some include the outer edges, and some don't.
In cells with small buds, the buds are often lost, as can be seen in Figure 9.
Figure 9: Yeast-specific cell detection algorithm used to detect cells. A threshold value β was chosen manually for this set of images.
This poor performance is most likely related to some of the assumptions made in the original paper about the distribution of noise in the image, and how it can be fit to a distribution. Though the paper calls these assumptions "crude", it claims that they still allow the algorithm to perform well. This could not be reproduced here, perhaps due to some difference in the distribution of pixel values in the original data.
The second part of the scheme, which segments the cells, was therefore not implemented as a result of the failure of the first part. Though it could have been attempted using cells detected with other methods, it was not a worthwhile endeavor given that the first part failed completely. The paper did admit that the only incorrectly detected cell contours were buds, which
are important in this case. Though there was a proposed solution to this problem, it was
not tested.
4.3 Nucleus detection and segmentation
The nuclei of the cells must also be detected and segmented, in addition to the cells themselves. The cells, which were stained with ConA, were also stained with DAPI, a fluorescent stain which binds to A-T rich regions of DNA. This problem is relatively straightforward in comparison to that of segmenting cells, since the nuclei are never touching each other or clustered.

The algorithm to detect nuclei is outlined below, and the code can be found in appendix 7.C:
1. Use morphological opening to calculate the background of the image
2. Subtract this calculated background from the original image
3. Threshold out the nuclei to get a nuclear mask
4. Remove any objects which are too big to be nuclei
5. Dilate the nuclear mask since this thresholding tends to underestimate nuclear boundaries
First the background of the image is calculated using morphological opening with a disk
of radius 6. This removes any objects in the image which are smaller or thinner than the
disk, which effectively removes all of the nuclei in the image. This then leaves only the
background noise of the image, and any uneven illumination. This background calculation
is important because, in examining the nuclei, the intensity of the stain matters: it can indicate the amount of DNA in the nucleus. Removal of the background ensures consistent intensity calculations. This step also helps to deal with one of the problems that can arise with the DAPI staining: occasionally the entire cell, or a much larger part of it than the nucleus, is stained with DAPI and fluoresces. These regions, which are larger than nuclei, will for the most part be included in the background.
Then, they will be removed in the next step when the background is subtracted from the
original image.
Once the background has been removed, the nuclei are thresholded out to create a nuclear
mask. Then, any objects which are larger than 100 pixels are removed from the mask. This
represents a nucleus with a radius of 6 pixels, which is the same as the disk size that was
used for the morphological opening. Any nucleus should be much smaller than this. This
step ensures that any nuclei which had problems with the DAPI staining are not included
in the mask.
In this nuclear detection code the DAPI channel is modified to remove any objects which
are not nuclei, and to remove any background noise. In the last step, the nuclear mask is
then dilated to make sure that the entire nucleus is included in the final modified image. Since it is the overall intensity of the nucleus that matters, it is more important to ensure that the entire nucleus is included in the mask than to get the nuclear outline perfect, so the mask is dilated several times with a disk of radius 1.
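A minimal MATLAB sketch of these steps follows (the full code is in appendix 7.C; the dilation count stands in for the "several times" of the text):

    % Sketch of nucleus detection from the DAPI channel (cf. appendix 7.C).
    dapi = im2double(imread('dapi.tif'));
    bg   = imopen(dapi, strel('disk', 6));   % 1. background via opening, disk radius 6
    I    = dapi - bg;                        % 2. subtract the background
    nm   = im2bw(I, graythresh(I));          % 3. threshold out the nuclei
    nm   = nm & ~bwareaopen(nm, 100);        % 4. drop objects of 100 px or more
    for k = 1:3                              % 5. dilate; thresholding underestimates
        nm = imdilate(nm, strel('disk', 1)); %    the nuclear boundary
    end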
In later steps in the workflow, the DAPI images will be combined with the segmented
cells, and any nuclei which are not contained within cells will not be included in calculations.
While it would be possible to start with a cell mask and then to look for a nucleus within
the mask by performing operations locally, it is simpler and quite effective to do the
operation for the entire image and apply the mask later. This order of operations also allows
for the background intensity correction to occur for the entire image.
4.4 Discussion
An ideal cell detection and segmentation algorithm would be able to identify only the parts
of an image which are cells using smooth contours, and then correctly segment them. It
would not leave out the contours of any buds, since those are necessary for cell cycle stage
classification. It would also be precise in the detection of cell outlines, since this is necessary
for calculating GFP intensity in the investigation into the response to DNA damage. An
ideal algorithm would be able to segment even large clusters of cells so that any type of data
could be used, with either dense cells or with sparse cells. Most importantly, the algorithm
would be consistent in the way it detects and segments cells.
The only approach that satisfied the majority of these requirements was that of using
the ConA images with thresholding. This approach was selected because of its consistency
and simplicity. The use of the ConA images completely eliminates the problem of cell wall
thickness creating multiple edges, and thresholding is a reliable method of detection. Its
drawback is that the images of cells must not contain large clusters, since these cannot be
handled and will be ignored. This means that sparse images with dilute cells must be used.
It might be possible to combine the ConA thresholding approach with one of those which
performs segmentation by modifying both. However, it is possible that more complicated
approaches such as that from subsubsection 4.2.4 might not perform better in this context
given the emphasis placed on bud detection. Once the rest of the methods are completely
developed and tested, it will be possible to work on the cell detection and segmentation part
of the workflow further, to allow a wider variety of data to be used as input.
5 Cell Cycle Stage Classification
It can be seen from the paper by Jelinsky, et al. that the budding yeast cellular response to
DNA damaging agents is dramatically affected by the cell's position in the cell cycle at the
time of exposure [10]. Therefore it is important that the cell cycle stage of individual cells
can be determined from the images of cells.
In their approach, populations of cells were arrested in various cell cycle stages and then analyzed using oligonucleotide probes. With images of the cells in this workflow, more data can be obtained than simply the expression level of a gene, since the images can be analyzed to determine the amount and localization of protein using the GFP-tagged proteins. However, in order to combine this information with cell cycle stage, the cells must be classified. While it is possible to arrest cells in various stages of the cell cycle, this process can affect cell morphology and state, compromising the detection and classification approach (see subsubsection 5.4.2). Rather than taking this approach, supervised learning and classification of cells from a wild-type, unsynchronized population is more appropriate.
First, a descriptive set of cell features must be created which will allow the cells to be
classified. The cells should be classifiable as being in one of three stages of the cell cycle: G1, S, or G2/M. Then a manually-curated training set of cells must be created for
training as well as for validation steps. Finally, the remaining cells can be classified.
Figure 10: Budding yeast cell cycle stages (G1, S, G2/M, anaphase, telophase). Adapted from Calvert, et al. [2]
5.1 Creation of a training set
A training set of data must be created in order to train neural nets. It is also useful to have
a set of labelled cells to validate the features and to make sure that the calculated values are
consistent with what would be expected for cells in the different stages of the cell cycle. In
order to do this, a MATLAB interface was created which allows a user to manually label
cells. To ensure that a representative set of cells was included in the training set, the input
to this interface is a directory of images of yeast cells and cell masks generated from the
program described in subsubsection 4.2.2. The user can then choose the number of cells
from each image to label. Then the user is presented with that number of randomly selected
cells from each image. The code displays a superposition of the original ConA image of the
cell and the DAPI image of the DNA in the cell. In general, a combination of bud size and
nuclear size, position, and intensity is enough for a human to accurately and easily label a
cell. Cells which had just undergone G2 /M and were then in G1 but still connected and not
segmented were classified as a single cell in G1, since they are known to be in the same stage
of the cell cycle and have just finished dividing. If the image with which the user is presented
contains more than one cell, the user can designate it as an invalid cell segmentation. This
will allow later identification of clusters of cells which were incorrectly segmented and make
sure that they are not included in data about single cells.
In the images, there are generally more cells in the G1 and S stages than in the G2/M phase, since the cell spends only a small amount of time in G2/M before it becomes a mother and daughter cell in G1 again. This led to an unequal number of cells in the different stages of the cell cycle. Generating enough labels creates a data set large enough that the small percentage of G2/M cells still provides enough data points for the training of a neural net.

In the final training set which was created and used, 506 cells were classified in total. Of these, 62% were G1, 28% were S, 6% were G2/M, and 4% were classified as invalid cells.
5.2 Feature selection
In order to classify cells automatically, a set of features must be chosen and calculated. 14 features were chosen to provide the necessary information. These include features that describe the size and geometry of the cells. In order to make sure that the chosen features do, in fact, give useful information about the cells, other cell classification schemes were investigated and their feature lists examined [1].
Source Image(s) | Feature Name                      | Feature Definition
ConA            | Area                              | Cellular area
ConA            | Convex Area                       | Area contained in the convex hull of the cell
ConA            | Solidity/Convexity                | Area / Convex Area; = 1 for a convex cell
ConA            | Perimeter                         | Cellular perimeter
ConA            | Form Factor                       | 4π·Area / Perimeter²; = 1 for a perfectly circular cell
ConA            | Major Axis Length                 | Major axis length of the ellipse that is the best fit to the cell
ConA            | Minor Axis Length                 | Minor axis of the best-fit ellipse
ConA            | Eccentricity                      | Ratio of the distance between the foci of the best-fit ellipse and its major axis length
ConA            | Bud Size                          | Size of the bud (see 5.3.2)
ConA            | Ellipticity                       | Residual of the best-fit ellipse divided by the number of pixels in the cell boundary that was fit
DNA             | Nuclear Area                      | Area of the detected nucleus
DNA             | Number of Nuclei                  | Number of nuclei in the given cell mask
DNA             | Average Intensity                 | Intensity of the detected DNA averaged over all pixels in the nucleus
DNA             | Overall Intensity (Amount of DNA) | Sum of the intensity values at each pixel in the nucleus

Table 1: List of features calculated for each cell for classification and the images from which they are derived
These feature lists gave a good baseline for the maximum number of features a classification
process might want to use. They were then pared down to features which capture cell
morphology; information about cell texture or protein intensity is not related to the problem
of cell cycle stage classification and is therefore unnecessary.
Once these generalized feature lists had been consulted and modified, some budding
yeast-specific features were added. The main such feature was bud size, which is key to
determining cell cycle stage: when a cell is in G1, it has no bud; in S, it has a bud of
increasing size; and in G2/M, the bud is nearly the same size as the original mother cell.
Another yeast-specific feature which was added was the number of nuclei in the cell. This
was added because, as was indicated in the previous section, sometimes the detected cells
are actually doublets which have just finished G2/M and have not fully separated, and this
feature helps identify them. Table 2 and Table 3 show a probable scheme for differentiating
between the cell cycle stages using the chosen features.
The final feature list can be seen in Table 1. This shows the images which are used to
generate those particular features as well as the exact description of the features.
Feature            G1                       G2
DNA Intensity      n                        2n
Nuclear position   Centered                 Towards bud
Cell Area          n                        > 1.3n
Bud Size           < 70% mother cell size   > 70% mother cell size

Table 2: Differentiation between G1 and G2 using a subset of the selected features.
5.3 Feature calculation
The output of the cell and nuclear detection programs described in section 4 was used to
calculate the selected cell features in MATLAB. These were then saved in csv files to be
used for later data analysis as well as for input for classification.

Feature            G1                       S                        G2
DNA Intensity      n                        n - 2n                   2n
Nuclear position   Centered                 Towards bud              Towards bud or split between 2 cells
Cell Area          n                        1.3n                     > 1.5n
Bud Size           < 10% mother cell size   < 70% mother cell size   > 70% mother cell size

Table 3: Differentiation between G1, S and G2/M using a subset of the selected features.
5.3.1 Basic cell features
The basic cell features were calculated using MATLAB's regionprops function. This
function takes in a black and white image mask and, optionally, an image of values for each
pixel. Each cell in the black and white image is considered to be a connected component,
for which MATLAB calculates its own set of values. Specifically, the values for cell area,
convex area, perimeter, major axis length, minor axis length, and cell eccentricity are all
calculated directly from this function. The features solidity and form factor are then
calculated directly from these values.
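As a brief illustration (the full listing appears in subsection 7.D), a single cell's basic features can be measured and the two derived features computed as follows, assuming cellMask is a hypothetical binary mask containing one cell:

p = regionprops(cellMask, 'Area', 'ConvexArea', 'Perimeter', ...
    'MajorAxisLength', 'MinorAxisLength', 'Eccentricity');
solidity   = p.Area / p.ConvexArea;          % equals 1 for a convex cell
formFactor = 4*pi*p.Area / p.Perimeter^2;    % equals 1 for a perfect circle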
Once the basic cell features have been calculated for a particular cell, that cell's mask
is applied to the mask of cell nuclei to obtain an image with only the nuclei of the current
cell. Then, the function regionprops is used once again on this image along with the DAPI
image to calculate the overall nuclear intensity, average nuclear intensity, number of nuclei,
and nuclear area.
5.3.2 Bud detection
Along with DNA intensity, bud size is the most descriptive feature about a budding cell when
determining cell cycle stage. This makes it one of the most important features to calculate
as input for classification.
Bud detection is challenging because budding yeast cells have irregular shapes. They are
neither perfect circles nor perfect ellipses, and occasionally the cell segmentation might
introduce errors which add to the irregularity. The detection algorithm must be able to
determine whether or not a cell has a bud, or whether any irregularities in the cell contour
are simply a part of the cell's shape. Then, if a cell does, in fact, have a bud, the algorithm
must be able to determine which of these irregularities is actually part of the bud. Finally,
it must be able to measure the size of the bud. A number of approaches to bud detection
were tested, and the best and most accurate was selected.
Bud detection using polar plots
If a yeast cell with no bud is viewed as an ellipse, then the bud of a yeast cell is simply
an irregularity in this ellipse. If the approximate center of the cell is found, then a radial
plot of the cell can be created by sampling the radius at equally spaced angles around
the cell. While the plot of an unbudded cell should have few irregularities, the plot of a
budded cell will have a clear point at which there is a bud.
This bud detection algorithm calculates this radial plot and finds the largest peak, which
should be the bud. Then, it finds the two nearby local minima, which should be the beginning
and end of the bud. It then uses these values to segment the bud from the mother cell. While
this method performs reasonably well, it can miss small buds on cells which are very
elliptical. In these cells, a point along the major axis is the maximum point in the radial
plot and is chosen as the bud, while the actual bud is ignored because it is a smaller local
maximum.
Bud detection using polar plots - fitting to an ellipse
If instead of simply creating a plot of the cell's radii out from the cell center and looking
for the maximum radius, the cell is fitted to an ellipse, then the accuracy of bud detection
can be increased significantly. In order to locate the bud, the data for an ellipse equivalent
to the cell with no bud can be generated and compared to the actual cell data. Using this
comparison, the angular location of the bud can be found.
Figure 11: Cell undergoing bud detection. The candidate cell center is marked in the top
left image. The bottom left image shows the calculated radii out from the cell center, with
the top right showing the ellipse that was fitted to the cell, and the bottom right showing
the difference between the calculated radii and the fitted ellipse. This clearly shows a bud
at the correct position.
The bud detection algorithm is outlined here, and the code can be found in subsection 7.E.

1. Compute the distance transform on the cell and then find its maximum value to locate
the candidate cell center.

2. From this cell center, find the radius of the cell at 5 degree increments around the
entire cell in order to create a polar plot of angle vs. radius.

3. Smooth this radius plot to remove small irregularities in the cell as well as small
discontinuities due to the fact that the radius is only sampled every 5 degrees.

4. Find the local minima in this radial plot. These should be the minor axes of the ellipse.
Designate their mean to be the minor axis length, b.

5. Find the local maxima in the radial plot. Some of these may be the major axes of the
ellipse, and one may be a bud.

6. If there is only one local maximum, then it is most likely a bud, and the cell is circular.
The major axis length, a, is then equal to the minor axis length previously found.

7. Otherwise, calculate the mean of the local maxima, excluding the largest, which is
most likely the bud. If the difference between the largest maximum and the mean of the
other maxima is significant, then there is a bud. Designate the major axis length, a, to be
the mean of the other maxima.

8. Calculate the radii for the fitted ellipse using the major and minor axis lengths found
(a and b respectively) using the formula for an ellipse,

   r(θ) = ab / √((b cos θ)² + (a sin θ)²).

Subtract these calculated values from the cell's radii after aligning the major axis with
one of the local maxima found previously.

9. Find the maximum value in the differences. This will be the center of the cell's bud.
Call the angle associated with this maximum value φ.

10. Find the two local minima on either side of the center of the bud. The angles associated
with these minima will be the two angles at which the bud starts and ends.

11. Rotate the cell by -φ. Then find the locations of the two local minima along the
outline of the cell. Calculate the average of their positions, which can then be used to
draw a straight vertical line to segment the bud from the mother cell.
The candidate cell center is found using the distance transform, an approach that has
been used for similar applications previously [13, 15]. The maximum value of the distance
transform will be in the center along one axis of the ellipse, and while it might be slightly
biased towards the bud along the other axis, it will still be close enough to the actual
center of the ellipse. Once the radii have been calculated out from this center point, they
are smoothed using MATLAB's smooth function, which implements a moving average lowpass
filter with a span of 5. This filter removes small irregularities in the radial plot so that the
curves that remain are due to the contours of the ellipse and the bud only. Once these data
have been smoothed, finding local minima and maxima is straightforward.
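A simplified sketch of the center-finding and radial sampling steps is given below. It traces the cell boundary with bwboundaries rather than rotating the image as the full code in subsection 7.E does, so it illustrates the idea rather than the exact implementation; cellMask is a hypothetical binary image containing a single cell.

d = bwdist(~cellMask);                     % distance to the nearest background pixel
[~, idx] = max(d(:));
[ci, cj] = ind2sub(size(d), idx);          % candidate cell center
B = bwboundaries(cellMask);
B = B{1};                                  % boundary points as [row, col]
ang = atan2(B(:,1)-ci, B(:,2)-cj)*180/pi;  % angle of each boundary point
r = hypot(B(:,1)-ci, B(:,2)-cj);           % radius of each boundary point
thetas = 0:5:355;                          % sample the radius every 5 degrees
radii = zeros(size(thetas));
for k = 1:numel(thetas)
    % boundary point whose angle is closest to the sample angle
    [~, m] = min(abs(mod(ang - thetas(k) + 180, 360) - 180));
    radii(k) = r(m);
end
radiiSmooth = smooth(radii(:), 5);         % moving average lowpass filter, span 5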
5.4 Feature validation
The calculated features must be validated with respect to known values about general yeast
cell morphology, such as cell size. This indicates that the features have been correctly
determined from the yeast cell images, and that the cells have been properly segmented.
The calculated features can also be validated by comparing the feature values between cells
in different stages of the cell cycle. This feature comparison was performed using the
manually curated training set as well as cells which had been arrested in stages of the cell
cycle.
The cells were imaged at 40× magnification, with the entire image being 1024 × 1024
pixels, covering 165.12 µm × 165.12 µm. This puts the calculated scale at 0.16125 µm/pixel.
The standard size of a budding yeast cell is 5-10 µm in diameter. All of these figures were
used in the following feature validation.
5.4.1 Using the training set
Using the manually curated training set, the labelled cell features can be validated. The
features chosen for validation were those which could be validated against known values.
For example, nuclear size, intensity, and number of nuclei can easily be validated, as well
as cell size, bud size, and bud/cell ratio. However, there is no basis with which to validate
values such as cell eccentricity.
It is also possible to determine cell cycle stage with some accuracy based merely on the
absence or presence of a bud and on nuclear intensity, which makes these features of
particular interest [2].
All of the distributions were calculated both for cells which were part of the training
set and for cells which were classified using the neural net described later in
subsubsection 5.5.1. It is useful to compare these distributions both for feature validation
and for validation of the neural net: if the neural net functions properly, then the
distributions should be similar.
Unfortunately, due to the nature of the cell cycle stages, there are not as many data points
for the G2/M stage as there are for the other stages of the cell cycle. There is no way to
remedy this problem, since cells do not spend much time in this stage of the cell cycle.
However, there were enough that the distributions should be representative of cells in this
stage. All of the graphs generated for the distributions have been normalized, meaning that
the histograms were divided by the total number of cells, making the area under each of
the graphs equal to 1.
Cell area
Since the known budding yeast cell diameter is 5-10 µm, the expected yeast cell area would
be 19.6-78.5 µm², or 755-3020 pixels. The values for cell area between G1, S, and G2 are
expected to increase, since the cell is growing over this time period. This occurs as expected,
with the mean increasing throughout the cell cycle stages, as can be seen in Figure 12. The
values for cell area for the invalid cells are expected to be even larger than any of these,
since these cells are mostly missegmented and contain multiple cells.
Figure 12: Cell area feature distributions (training set: G1 n=314, mean=914, SE=22.64;
S n=140, mean=1049, SE=32.59; G2/M n=31, mean=1159, SE=67.70; invalid n=21,
mean=1406, SE=91.51).

The distribution for the G1 cells appears to be bimodal. This occurs, and is expected,
because the cells labelled as G1 are both single cells which have not yet gone through the
cell cycle and doublets of yeast cells which have just finished the G2/M stage but have not
yet completely undergone cytokinesis. These cells are still considered to be in G1. These
two types of cells also shift the mean of the G1 cells higher than might be expected, making
the difference between G1 and S cells not quite as large as expected.
These cell areas can be converted back into diameter values in µm to view in yet another
way how these cells are growing, and to check that the cell area is correctly calculated; see
Table 4.

                  G1     S      G2     Invalid
Training set      5.50   5.89   6.19   6.82
Classified cells  5.67   6.02   6.23   7.46

Table 4: Estimated cell diameter in µm as calculated from mean cell area.
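As a worked check of these numbers, the conversion from a mean cell area in pixels to an equivalent circular diameter in µm is straightforward:

scale = 0.16125;                      % um per pixel, as derived above
meanAreaPx = 914;                     % training set G1 mean cell area in pixels
dUm = 2*sqrt(meanAreaPx/pi)*scale;    % = 5.50 um, matching the G1 entry in Table 4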
Nuclear intensity
Nuclear intensity is also a good feature for validation. When cells are in G1, there is one
copy of each chromosome, since the cells are all haploid. Then, when the cells are in the
S phase, the chromosomes are being replicated, so at the beginning of the S phase there is
one copy of each chromosome, and by the time S is complete there are two copies of each
chromosome, so there is twice as much DNA in the nucleus. Finally, when the cells are in
G2/M, there are still two copies of each chromosome, but the two nuclei are beginning to
separate from each other. Therefore the nuclear intensity doubles at some point during the
S phase from what it had been in G1. Then, in the G2/M phase, the average nuclear
intensity is still greater than it would be in G1, but not as large as it was in the S phase,
due to the fact that the nucleus is getting larger and the chromosomes are splitting up.
Figure 13: Average nuclear intensity distributions (training set: G1 n=314, mean=1351,
SE=37.70; S n=140, mean=1754, SE=67.20; G2/M n=31, mean=1571, SE=122.49; invalid
n=21, mean=1230, SE=150.77).
This pattern can be observed in Figure 13 and Figure 14. The graphs in Figure 13 show
the average nuclear intensity, and it can be seen that it increases significantly in the S
stage. However, the average is not double the G1 average, because some of the cells still
have only one copy of each chromosome, as they have not yet finished replication. This can
be further observed in Figure 14, which shows the total nuclear intensity: the intensity over
all pixels in the nucleus is summed to create an overall nuclear intensity, and the overall
nuclear intensity for cells in the S phase is clearly bimodal.
Figure 14: Overall nuclear intensity distributions (values are ×10^5; training set: G1
n=314, mean=1.32, SE=0.08; S n=140, mean=1.75, SE=0.13; G2/M n=31, mean=2.60,
SE=0.39; invalid n=21, mean=1.24, SE=0.26).

Bud size

Figure 15: Bud size distributions (training set: G1 n=314, mean=228, SE=14.61; S n=140,
mean=286, SE=16.04; G2/M n=31, mean=470, SE=33.40; invalid n=21, mean=419,
SE=40.56).

The bud size of cells is also clearly expected to vary throughout the cell cycle, which can
be observed through the graphs of bud size. The cells in G1 characteristically have no bud.
In the graphs of bud size of G1 cells in this case, however, the bud size distribution is
expected to be bimodal. This is due to the fact that doublets which have finished G2/M
but have not yet separated are still considered to be one cell. Therefore, some of the cells
are expected to have a bud size of 0, and the others are expected to have a bud size which
is approximately equal to the size of another whole cell. Then, the cells in S are expected
to have a bud, and the cells in G2/M are also expected to have buds, with bud sizes much
larger than those in the S phase.
Figure 16: Bud/mother cell ratio distributions (training set: G1 n=314, mean=0.33,
SE=0.02; S n=140, mean=0.40, SE=0.02; G2/M n=31, mean=0.67, SE=0.03; invalid n=21,
mean=0.50, SE=0.06).

The graph distributions appear as expected in Figure 15 and Figure 16. In the G1 graphs,
approximately half of the cells have no bud, and half have a bud that is about the size of a
cell. For some of these, the second cell is somewhat smaller than what a normal cell would
be expected to be. This could be due to two different factors: one is the fact that the
cells which have just finished the cell cycle then grow in the G1 phase, which means that
they are smaller than an average cell to start with and then continue to grow. The other
factor contributing to the smaller-than-expected "bud" size in the G1 phase is a possible
imperfection in the bud detection algorithm which would cause a small bud to be detected
on a cell which has no bud. Overall, though, the bud size distributions appear as would be
expected.
Although the feature graphed in Figure 16 is not independent of the other features which
have been examined, it is still a useful distribution to check. The bud/cell ratio is the size
of the detected bud divided by the size of the mother cell (the area of the entire cell minus
the area of the bud). With these distributions it can be seen even more clearly that the
sizes of the "buds" in G1 are on the order of the size of a cell. The buds are then some
percentage of the cell size in the S phase, and a much larger percentage of the cell size in
the G2/M phase.
5.4.2 Using cells arrested in stages of the cell cycle
In addition to using a manually curated training set, a data set of cells which were arrested
in particular stages of the cell cycle was available to validate the selected features. Cells
which were arrested in the G1 and S/G2/M stages were imaged as described in
subsection 4.1. These two cell sets were created, as well as a control set of images which
contains asynchronous cells.
Initially it seemed that it would be possible to also use these cells to test the neural
net, since they would provide a large source of pre-labelled cells. However, these cells are
morphologically different from the yeast cells in asynchronous populations. This includes
features such as cell size and shape. However, it is possible to utilize these images by
relating the distributions of these features between the two populations and showing that
the distributions are actually equivalent. The morphological differences between these
types of cells also do not extend to all of the features, making it still possible to compare
features such as nuclear intensity.
Cell area
Since the cells which are arrested in stages of the cell cycle are known to be morphologically
different, cell area is one of the features that is expected to differ between the training and
arrested cell sets. Interestingly, the distributions of the G1 cells do not differ significantly,
as can be seen in Figure 17. The means are very similar, and the distributions look visually
the same, with the exception of a slightly larger and more differentiated second mode which
comes from the doublets of cells that have not separated after going through the G2/M
stages.
However, the S distributions look entirely different. The cells are all in the same ranges
for cell size, but there are far more small cells in the training set, perhaps with smaller
buds. This might occur because cells arrested in the S stage of the cell cycle have had more
time to allow their buds to grow. Overall, the difference in the distributions is reasonable
and can be explained.
The last distribution contains cells from an asynchronous population, which were created
at the same time as the arrested cell images. This distribution is equivalent to the
distribution containing all of the cells from the training set, with the mean differing only
slightly.

Figure 17: Cell area - arrested cells vs. training set comparison (training set: G1 n=314,
mean=914, SE=22.64; S n=140, mean=1049, SE=32.59; G2/M n=31, mean=1159,
SE=67.70; all n=506, mean=987, SE=18.31; arrested: G1 n=701, mean=954, SE=14.59;
S n=700, mean=1178, SE=15.95; control n=1456, mean=908, SE=9.57).
Nuclear Intensity
Overall, the distributions associated with average nuclear intensity in arrested cells are
consistent with what is expected, as shown in Figure 18. The mean of the S cells is larger
than the mean average nuclear intensity of the G1 cells. However, it does not increase as
much as the nuclear intensity in the manually classified cells. The two G1 distributions
differ between the two sets of data, which is the most likely source of this discrepancy. The
G1 distribution in manually classified cells drops off more quickly than the distribution in
the arrested cells, raising the overall mean average nuclear intensity in arrested cells in
comparison. This is unsurprising given the comparison between the control cells and the
training set overall
distributions. The same pattern can be seen, although it is unclear exactly why this might
occur. It is most likely a factor of the different image sets.

Figure 18: Average nuclear intensity - arrested cells vs. training set comparison (training
set: G1 n=314, mean=1351, SE=37.70; S n=140, mean=1754, SE=67.20; G2/M n=31,
mean=1571, SE=122.49; all n=506, mean=1471, SE=32.40; arrested: G1 n=701,
mean=1479, SE=29.18; S n=700, mean=1596, SE=26.33; control n=1456, mean=2160,
SE=24.23).
Bud size
Although it was previously determined that the cell morphologies between the two sets of
images are different, comparing the bud sizes is worthwhile. In this case, they differ as would
be expected. The G1 cells in the arrested cells have smaller buds overall, if any, and then
also some which are approximately the size of a second cell. In the arrested S distributions,
the buds are larger than in the training set, which supports the hypothesis proposed earlier
to explain why the cells are larger in the S stage of the cell cycle in the arrested cells. These
cells have been arrested for some period of time and have therefore had enough time for their
buds to grow larger, shifting the distribution of those cells with buds over.
Figure 19: Bud size - arrested cells vs. training set comparison (training set: G1 n=314,
mean=249, SE=14.45; S n=140, mean=284, SE=15.46; G2/M n=31, mean=457, SE=34.01;
all n=506, mean=280, SE=10.70; arrested: G1 n=701, mean=318, SE=8.09; S n=700,
mean=420, SE=10.34; control n=1456, mean=257, SE=5.95).
5.5 Classification using neural nets
The final feature set was then used as input for classification using a neural net. The choice
of a neural net as a classification method was based on other papers which had similar cell
classification goals, and from which the non-specific cell features were also drawn [1].
5.5.1 Creation of the neural net

The neural net used was created using MATLAB's Neural Network Toolbox. It was trained
using 70% of the training set as the actual training data, 15% for validation, and 15% for
testing. The network created was a two-layer feedforward network with a hidden layer of
50 nodes. In the network, the first layer has a connection from the network input, which in
this case is the set of calculated features. Each subsequent layer has a connection from the
previous layer, and the final layer produces the output labels, which in this case are G1, S,
G2/M, or invalid. A representation of the neural network used for classification can be seen
in Figure 20, with W being the weights of each of the inputs and b being the bias vectors
of the neural network.
Figure 20: Neural net used for classification (14 inputs, hidden and output layers, 4 output
classes).
The feedforward network was trained using the Levenberg-Marquardt backpropagation
algorithm as implemented in MATLAB [8]. The algorithm continues training until the
maximum number of iterations is reached, or until a performance goal is reached.
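A minimal sketch of this setup is given below, using hypothetical variable names: features is an N x 14 matrix of the calculated features and labels is an N x 1 vector coding G1, S, G2/M, and invalid as 1 through 4. Older MATLAB releases used newff in place of feedforwardnet.

inputs = features';                     % the toolbox expects one column per sample
targets = full(ind2vec(labels'));       % one-hot encode the four classes

net = feedforwardnet(50, 'trainlm');    % one hidden layer, Levenberg-Marquardt training
net.divideParam.trainRatio = 0.70;      % 70% of the data for training
net.divideParam.valRatio = 0.15;        % 15% for validation
net.divideParam.testRatio = 0.15;       % 15% for testing

net = train(net, inputs, targets);      % stops at the epoch limit or performance goal
predicted = vec2ind(net(inputs));       % predicted class for each cell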
5.5.2 Net performance on training set

Once the neural net was created, it was tested to see how it performed on the initial training
set. The results of classification of the training set using this neural net can be seen in
Table 5. It performs quite well on the G1 and S labelled cells. However, the training G2/M
cells are not labelled as well as would be expected. This might occur because the number
of cells initially labelled as G2/M is so small that even the expected number of errors on
these cells causes the percentage of misclassified cells to be quite high.

              Classified as G1   Classified as S   Classified as G2/M   Classified as Invalid
G1            95.5% (300)        4.5% (14)         0% (0)               0% (0)
S             21.4% (30)         78.6% (110)       0% (0)               0% (0)
G2/M          48.4% (15)         16.1% (5)         32.3% (10)           3.2% (1)
Invalid       42.9% (9)          0% (0)            0% (0)               57.1% (12)

Table 5: Neural net performance on the training set. Rows are the manual labels; columns
give the net's classifications as a percentage of each row and a cell count.

5.5.3 Net performance on additional data

Once the neural net was tested on the training set, it was then run on the rest of the cell
data. Once this had been done, there were two ways to look at the resulting data. One is
to compare the percentages of G1, S, G2, and invalid cells between the training set and the
remaining data. Since the training set cells were chosen randomly from the full dataset, the
distributions of these cells should be approximately the same. The results of this comparison
can be seen in Table 6. The distribution of labels in the training set and the distribution of
classes assigned to the rest of the cells are quite similar. However, the percentage of cells
which are classified as G2/M is lower in the set of all classified data than it is in the training
set. This problem may be unavoidable given the data set, since it is difficult to get enough
cells in that stage of the cell cycle for the training set.
                      G1     S      G2    Invalid
Training Set          62%    28%    6%    4%
All Classified Data   65%    31%    3%    1%

Table 6: Neural net performance on all data.
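The entries of Tables 5 and 8 are straightforward tallies of manual labels against net classifications. A minimal sketch, assuming hypothetical vectors trueLabels and predLabels with values 1 through 4:

C = zeros(4);                                  % rows: true class, columns: predicted class
for k = 1:numel(trueLabels)
    C(trueLabels(k), predLabels(k)) = C(trueLabels(k), predLabels(k)) + 1;
end
rowPct = 100 * C ./ repmat(sum(C, 2), 1, 4);   % percentages within each true class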
The other way in which the net can be tested on the rest of the data is to look at the
distributions of calculated feature values in the training set vs. the rest of the data. These
distributions are all very similar, as can be seen in Figure 21, Figure 22, Figure 23,
Figure 24, and Figure 25.
                  G1     S      G2     Invalid
Training set      5.50   5.89   6.19   6.82
Classified cells  5.67   6.02   6.23   7.46

Table 7: Estimated cell diameter in µm as calculated from mean cell area.
Figure 21: Cell area feature distributions (training set: G1 n=314, mean=914, SE=22.64;
S n=140, mean=1049, SE=32.59; G2/M n=31, mean=1159, SE=67.70; invalid n=21,
mean=1406, SE=91.51; classified: G1 n=574, mean=972, SE=17.18; S n=214, mean=1095,
SE=22.21; G2/M n=60, mean=1173, SE=43.87; invalid n=43, mean=1679, SE=52.77).
The calculated cell areas can also be converted into diameter values in µm, as seen in
Table 7. This shows in yet another way how the cells are growing in each stage of the cell
cycle, and also how the cell area is correctly calculated and the cells are correctly classified
using the neural nets.
Figure 22: Average nuclear intensity distributions (training set: G1 n=314, mean=1351,
SE=37.70; S n=140, mean=1754, SE=67.20; G2/M n=31, mean=1571, SE=122.49; invalid
n=21, mean=1230, SE=150.77; classified: G1 n=574, mean=1588, SE=25.50; S n=214,
mean=1892, SE=41.47; G2/M n=60, mean=1695, SE=80.65; invalid n=43, mean=1479,
SE=42.09).
Figure 23: Overall nuclear intensity distributions (training set: G1 n=314, mean=132015,
SE=8248.86; S n=140, mean=175252, SE=13129.00; G2/M n=31, mean=260455,
SE=39287.28; invalid n=21, mean=123602, SE=26327.90; classified: G1 n=574,
mean=167404, SE=6665.45; S n=214, mean=184904, SE=9343.23; G2/M n=60,
mean=248957, SE=32545.40; invalid n=43, mean=306238, SE=44783.41).
Figure 24: Bud size distributions (training set: G1 n=314, mean=249, SE=14.45; S n=140,
mean=284, SE=15.46; G2/M n=31, mean=457, SE=34.01; invalid n=21, mean=465,
SE=49.30; classified: G1 n=574, mean=280, SE=11.08; S n=214, mean=293, SE=10.99;
G2/M n=60, mean=461, SE=24.91; invalid n=43, mean=576, SE=36.72).
Figure 25: Bud/mother cell ratio distributions (training set: G1 n=314, mean=0.35,
SE=0.02; S n=140, mean=0.39, SE=0.02; G2/M n=31, mean=0.63, SE=0.03; invalid n=21,
mean=0.50, SE=0.05; classified: G1 n=574, mean=0.39, SE=0.01; S n=214, mean=0.39,
SE=0.02; G2/M n=60, mean=0.64, SE=0.03; invalid n=43, mean=0.55, SE=0.04).
Although the cell morphologies were already determined to differ between the asynchronous
populations of budding yeast cells and the cells which had been arrested in stages of the
cell cycle, it was logical to test the neural net on these cells anyway to see how well it
performed. The results of this run can be seen in Table 8.
              Classified as G1   Classified as S   Classified as G2   Classified as Invalid
Arrested G1   57.5% (403)        23.1% (162)       11.3% (79)         8.1% (57)
Arrested S    57% (399)          17.3% (121)       22.4% (157)        3.3% (23)

Table 8: Neural net performance on arrested cells. Rows are the arrest conditions; columns
give the net's classifications as a percentage of each row and a cell count.
6 Conclusions
In this thesis a method for detecting, characterizing, and classifying budding yeast cells has
been developed. The yeast cell detection and segmentation method works reliably for cells
which are not densely clustered, and it can flag those cells which cannot be correctly
detected and segmented. The detected cells are then well characterized using a set of
features which are calculated for each cell. A method was developed to detect buds in order
to include yeast-specific features and tailor the process to yeast cell classification. Finally,
these features are used as input to neural nets in order to determine cell cycle stage.
While the cell detection and segmentation works well, further work should include the
investigation of better cell segmentation. The input data is currently limited to cells which
are not too dense, and therefore not highly clustered. An ideal system would be able to
process any type of image, meaning that all cells should be able to be detected. Specifically,
some of the more sophisticated algorithms described did not work readily for yeast cells, but
further work could be done to tailor them to work for this application.
The neural net classification also did not perform as well as it might have, due to the
limited data about cells in the G2/M cell cycle stage. The classification of these cells could
be improved by obtaining more data, or by finding a method which does not need as many
data points for this class in order to classify it.
The initial motivation behind the development of these methods was to then be able to
examine the budding yeast response to DNA damage. This response is different depending on
a given yeast cell's current stage in the cell cycle, so it is important to be able to determine
that piece of information in order to develop a complete model of the budding yeast cell
response. However, these methods could be used for other applications as well, since cell
cycle stage is an important piece of information to know about a cell, and there are many
other assays which can be performed using a high-content screen with automated image
analysis.
Further work would include investigating strains of yeast which have had proteins tagged
with GFP and determining how the protein amount and localization within the cell vary
depending on the cell's current stage in the cell cycle. Then yeast response to DNA damage could
be studied by comparing the protein levels from cells which have been exposed to a DNA
damaging agent and from those which have not. Further work could also be done to detect
mRNA transcript numbers and localization in cells to investigate how these vary with the
cell cycle stage.
Overall the ability to determine the current cell cycle stage of a budding yeast cell is a
useful method which can be applied to a variety of problems and allows yeast cells to be
studied in even greater detail using a high content approach.
7 Appendices

7.A Edge detection and watershedding code
% First, read in the image.
% This part of the code is not included; the image is read into the variable BF.

%% Step 1: Apply a Gaussian filter and adjust image contrast
h = fspecial('gaussian', 3);
I2 = imfilter(BF, h);
I3 = imadjust(I2);

%% Step 2: Detect edges of cells
BWs = edge(I3, 'canny', 0.3);

%% Step 3: Dilate the detected edges to fill in gaps
se90 = strel('line', 3, 90);
se0 = strel('line', 3, 0);
BWsdil = imdilate(BWs, [se90 se0]);
BWsdil2 = imdilate(BWsdil, [se90 se0]);

%% Step 4: Fill interior gaps of cells
BWdfill = imfill(BWsdil2, 'holes');

%% Step 5: Smooth the cells
seD = strel('disk', 2);
BW2 = imerode(BWdfill, seD);
BW3 = imerode(BW2, seD);
BW4 = imerode(BW3, seD);

%% Step 6: Remove objects which are obviously too small
BWfinal = bwareaopen(BW4, 300);

%% Step 7: Watershedding
% Compute the distance transform of the complement of the binary image
D = bwdist(~BWfinal);
% Complement the distance transform, and force pixels that don't belong
% to the objects to be at -Inf
D = -D;
D(~BWfinal) = -Inf;
% Suppress oversegmentation
D2 = imhmin(D, 0.3);
% Compute the watershed transform
L = watershed(D2);
L = L - 1;
L3 = im2bw(L);
% Remove objects which are too small, then subtract away objects which
% are too large to be single cells
L3 = bwareaopen(L3, 300);
L4 = bwareaopen(L3, 1200);
L5 = L3 - L4;
L6 = im2bw(L5);
L6 = imerode(L6, seD);
L6 = imerode(L6, seD);
7.B Concanavalin A thresholding code
%% First, read in the image. This part of the code is not included,
%% but the image is read into the variable ConA.

%% Step 1: Apply a Gaussian filter to the image
h = fspecial('gaussian', 3);
I2 = imfilter(ConA, h);

%% Step 2: Adjust the contrast of the image to the maximum possible contrast
I3 = imadjust(ConA, ...
    [double(min(min(ConA)))/65535, double(max(max(ConA)))/65535], ...
    [0, 1], 1);

%% Step 3: Perform morphological closing on the image, removing any holes
%% that might be in the cell due to uneven staining of the cell
I3 = imclose(I3, strel('disk', 5));

%% Step 4: Threshold out the cells
t = graythresh(I3);
I4 = im2bw(I3, t*0.8);

%% Step 5: Fill in holes in the cells
I4a = imfill(I4, 'holes');

%% Step 6: Remove cells on the border of the image
I5 = imclearborder(I4a);

%% Step 7: Fill in any gaps where cells may not have been completely
%% outlined, and fill in the holes created
I5a = imclose(I5, strel('disk', 10));
I5a = imclose(I5a, strel('line', 4, 0));
I5a = imclose(I5a, strel('line', 4, 90));
I5a = imclose(I5a, strel('line', 4, 45));
I5a = imclose(I5a, strel('line', 4, -45));
I5a = imfill(I5a, 'holes');

%% Step 8: Remove objects which are clearly too small,
%% then subtract away objects which are too large to be a cell
I6 = bwareaopen(I5a, 300);
I7 = bwareaopen(I6, 2000);
I8 = I6 - I7;
ConAF = I8;
7.C Nucleus detection code
%% First, read in the image. This part of the code is not included,
%% but the image is read into the variable Nuc.

%% Step 1: Find the background of the image using morphological opening
background = imopen(Nuc, strel('disk', 6));

%% Step 2: Subtract the background from the original image
Nuc = Nuc - background;

%% Step 3: Threshold out the nuclei to get a nuclear mask
Nuc = Nuc*8;
NucBW = im2bw(Nuc, 0.07);

%% Step 4: Remove any objects which are too small to be nuclei
NucBW2 = bwareaopen(NucBW, 100);

%% Step 5: Dilate the nuclear mask to make the nuclei slightly bigger
seD = strel('disk', 1);
NucBW2 = imdilate(NucBW2, seD);
NucBW2 = imdilate(NucBW2, seD);
NucBW2 = imdilate(NucBW2, seD);
NucBW2 = imdilate(NucBW2, seD);
NucBW2 = imdilate(NucBW2, seD);
NucBW3 = imcomplement(NucBW2);

%% Step 6: Combine the nuclear mask and the original image
Nuc3 = immultiply(Nuc, NucBW3);
7.D Feature calculation code
features = zeros(numlabels, numfeatures, 'double');
for k = 1:numlabels
    Cell = (Cells==k);
    b = regionprops(Cell, 'Area', 'ConvexArea', 'Eccentricity', ...
        'MajorAxisLength', 'MinorAxisLength', ...
        'Perimeter', 'Orientation');
    features(k,1) = b.Area;
    features(k,2) = b.ConvexArea;
    features(k,3) = b.Area/b.ConvexArea;          % Convexity
    features(k,4) = b.Perimeter;
    features(k,5) = 4*pi*b.Area/b.Perimeter^2;    % Form factor
    features(k,6) = b.MajorAxisLength;
    features(k,7) = b.MinorAxisLength;
    features(k,8) = b.Eccentricity;

    % Bud detection code here. Can be found in the next appendix.
    features(k,9) = bs;                           % bud size
    features(k,14) = ms;                          % mother cell size

    % Calculate nuclear features
    n = Cell.*NucM;
    c = regionprops(bwconncomp(n), Nuc, 'Area', 'MeanIntensity', 'PixelValues');
    if (numel(c)==1)
        features(k,10) = 1;                       % number of nuclei
        features(k,11) = c.MeanIntensity;
        features(k,12) = sum(c.PixelValues);
        features(k,13) = c.Area;
    elseif (numel(c)==0)
        features(k,10) = 0;
    else
        features(k,10) = numel(c);
        features(k,11) = mean([c.MeanIntensity]);
        for m = 1:numel(c)
            features(k,12) = features(k,12) + sum(c(m).PixelValues);
        end
        features(k,13) = sum([c.Area]);
    end
end
7.E Bud detection code
% Rotate the cell so that its best-fit orientation is axis-aligned,
% then crop it to its bounding box
CellB = imrotate(Cell, -1*b.Orientation);
bb = regionprops(CellB, 'BoundingBox');
bb = bb.BoundingBox;
CellB = imcrop(CellB, bb);

% Find the cell center according to the distance transform
s = size(CellB);
d = bwdist(not(CellB));
[m1, max_iv] = max(d);
[m2, cc_j] = max(max(d));
cc_i = max_iv(cc_j);

% Put the cell center at the center of the isolated cell image
if cc_i < s(1)/2
    CellB = cat(1, zeros(uint16(s(1)/2)-cc_i, s(2)), CellB);
else
    CellB = cat(1, CellB, zeros(cc_i-uint16(s(1)/2), s(2)));
end
s = size(CellB);
if cc_j < s(2)/2
    CellB = cat(2, zeros(s(1), uint16(s(2)/2)-cc_j), CellB);
else
    CellB = cat(2, CellB, zeros(s(1), cc_j-uint16(s(2)/2)));
end

% Get the radii outward from the cell center
inc = 0.0278;
max_t = uint16(1/inc)-1;
radii = zeros(1, max_t*2+1);
angles = zeros(1, max_t*2+1);
for t = 0:max_t
    CellR = imrotate(CellB, double(t)*inc*180);
    s = size(CellR);
    mid1 = uint16(s(1)/2);
    mid2 = uint16(s(2)/2);
    l = sum(CellR(mid1, 1:mid2));
    r = sum(CellR(mid1, mid2:s(2)));
    radii(t+1) = l;
    radii(t+max_t+2) = r;
    angles(t+1) = double(t)*inc*180;
    angles(t+max_t+2) = angles(t+1)+180;
end

% Smooth the radii
radiis = smooth(radii);
radiis2 = cat(1, radiis, radiis);

% Get the local minima and maxima of the radii
[vals, rs] = findpeaks(-1*radiis, 'sortstr', 'descend', 'npeaks', 2, ...
    'minpeakdistance', 8);
[vals2, rs2] = findpeaks(radiis2, 'sortstr', 'descend', 'minpeakdistance', 8);

% Remove repeat values from the discovered maxima
repeatint = size(radiis, 1);
finalvals = [];
finalrs = [];
for g = 1:size(vals2)
    if sum(sum(rs2==(rs2(g)-repeatint))) == 0
        if (rs2(g) > repeatint)
            rs2(g) = rs2(g)-repeatint;
        end
        finalrs = cat(1, finalrs, rs2(g));
        finalvals = cat(1, finalvals, vals2(g));
    end
end

% Calculate the mean of the minima - the minor axis of a fitted ellipse, b
b = mean(vals)*-1;
bud = 0;
% If there is only one maximum then it must be a bud and
% the cell can be approximated by a circle
if size(finalvals,1)==1
    a = b;
    pks = sprintf('%.0f bud', 1);
    bud = 1;
% If the largest maximum is significantly larger than the mean of the rest
% of the maxima then it is a bud, and the major axis is the mean of the
% rest of the maxima
elseif finalvals(1)-mean(finalvals(2:end)) > 3
    a = mean(finalvals(2:end));
    pks = sprintf('%.0f bud', size(finalvals,1)-1);
    bud = 1;
% Otherwise there is no bud and the major axis is the mean of all maxima
else
    a = mean(finalvals);
    pks = sprintf('%.0f', size(finalvals,1));
end

% Determine the starting angle of the ellipse - the maximum point along
% the major axis
if size(vals,1)==0
    addang = 0;
elseif size(finalvals,1)>1
    addang = angles(finalrs(2));
else
    addang = angles(finalrs(1));
end

% Calculate the values of the ellipse along the angles sampled
ellipsevals = zeros(size(radii));
for g = 1:size(ellipsevals,2)
    ang = (addang+angles(g))/180*pi;
    ellipsevals(g) = double(a*b)/((b*cos(ang))^2+(a*sin(ang))^2)^0.5;
end

% Calculate the difference between the fitted ellipse and the original cell
Differences = smooth(radii-ellipsevals);

CellR = CellB*1;
bs = 0;                   % defaults (reconstructed) for an unbudded cell
ms = sum(sum(CellB));
if (bud==1)
    % The location of the bud is the location of the maximum difference
    % between the ellipse and the cell radii
    [budcenterd, budcenteri] = max(Differences);
    budangle = angles(budcenteri);
    pks = sprintf('%s %f', pks, budangle);
    % The start and end of the bud must be the two minima surrounding
    % the bud center
    firstbudstart = 0;
    secondbudstart = 0;
    % If we have at least 2 minima use those to find the bud start and
    % end points
    sortedrs = sort(rs);
    if size(vals,1)>1
        for g = 1:size(vals,1)
            if sortedrs(g) < budcenteri
                firstbudstart = sortedrs(g);
            elseif secondbudstart==0 && sortedrs(g)>budcenteri
                secondbudstart = sortedrs(g);
            end
        end
        if secondbudstart == 0
            secondbudstart = rs(1);
        end
        if firstbudstart == 0
            firstbudstart = rs(1);
        end
        nobud = 0;
    % Otherwise there is no bud
    else
        firstbudstart = mod((budcenteri-3), size(radiis,1))+1;
        secondbudstart = mod((budcenteri+3), size(radiis,1))+1;
        nobud = 1;
    end
    % If there is a bud, find exactly where it starts and measure it
    if nobud==0
        angle1 = angles(firstbudstart);
        angle2 = angles(secondbudstart);
        r1 = radii(firstbudstart);
        r2 = radii(secondbudstart);
        if angle2 > angle1
            tmp = angle1;
            angle1 = angle2;
            angle2 = tmp;
        end
        if angle1-angle2 < 180
            startAngle = angle2;
            endAngle = angle1;
            budAWidth = angle1-angle2;
            startR = r2;
            endR = r1;
        else
            startAngle = angle1;
            endAngle = angle2;
            budAWidth = angle2-angle1+360;
            startR = r1;
            endR = r2;
        end
        budangle = mod(startAngle+budAWidth/2, 360);
        rotateAngle = 360-budangle;
        % Rotate the cell so the bud is aligned, then split the image with
        % a vertical line at the average of the two bud start points
        CellR = imrotate(CellR, budangle);
        mid1 = uint16(s(1)/2);
        mid2 = uint16(s(2)/2);
        CellR(mid1, mid2) = 0;
        CellRm = 1*CellR;
        sra = -1*mod(budangle-startAngle, 360);
        era = mod(budangle+endAngle, 360);
        ps = [uint16(mid1+startR*sin(sra/180*pi)), ...
              uint16(mid2-startR*cos(sra/180*pi))];
        pe = [uint16(mid1+endR*sin(era/180*pi)), ...
              uint16(mid2-endR*cos(era/180*pi))];
        yCutoff = uint16(min(pe(2), ps(2)));
        c1 = sum(sum(CellR(:, 1:yCutoff)));
        c2 = sum(sum(CellR(:, yCutoff:size(CellR,2))));
        % The larger piece is the mother cell and the smaller is the bud
        if c1>c2
            ms = c1;
            bs = c2;
        else
            ms = c2;
            bs = c1;
        end
    % Otherwise the whole cell is the mother cell
    else
        ms = sum(sum(CellR(:,:)));
        bs = 0;
    end
end
References

[1] Chris Bakal, John Aach, George Church, and Norbert Perrimon, Quantitative morphological signatures define local signaling networks regulating cell morphology, Science 316 (2007), no. 5832, 1753-6.

[2] MEK Calvert, JA Lannigan, and LF Pemberton, Optimization of yeast cell cycle analysis and morphological characterization by multispectral imaging flow cytometry, Cytometry 73 (2008), no. 9, 825-833.

[3] John Canny, A computational approach to edge detection, IEEE Transactions on Pattern Analysis and Machine Intelligence 8 (1986), no. 6, 679-698.

[4] A Carpenter, T Jones, M Lamprecht, C Clarke, I Kang, O Friman, D Guertin, J Chang, R Lindquist, and J Moffat, CellProfiler: image analysis software for identifying and quantifying cell phenotypes, Genome Biology 7 (2006), no. 10, R100.

[5] AE Carpenter, Image-based chemical screening, Nature Chemical Biology 3 (2007), no. 8, 461-465.

[6] Edward R. Dougherty and Roberto A. Lotufo, Hands-on morphological image processing, SPIE Press, 2003.

[7] R Gonzales and R Woods, Digital image processing, Prentice-Hall, Upper Saddle River, New Jersey, 2002.

[8] M.T. Hagan and M. Menhaj, Training feedforward networks with the Marquardt algorithm, IEEE Transactions on Neural Networks 5 (1994), no. 6, 989-993.

[9] Michael Held, Michael H A Schmitz, Bernd Fischer, Thomas Walter, Beate Neumann, Michael H Olma, Matthias Peter, Jan Ellenberg, and Daniel W Gerlich, CellCognition: time-resolved phenotype annotation in high-throughput live cell imaging, Nature Methods 7 (2010), no. 9, 747-754.

[10] SA Jelinsky, P Estep, GM Church, and LD Samson, Regulatory networks revealed by transcriptional profiling of damaged Saccharomyces cerevisiae cells: Rpn4 links base excision repair with proteasomes, Molecular and Cellular Biology 20 (2000), no. 21, 8157.

[11] AP Joglekar, ED Salmon, and KS Bloom, Counting kinetochore protein numbers in budding yeast using genetically encoded fluorescent proteins, Methods in Cell Biology (2008), 127-151.

[12] T Jones, A Carpenter, and P Golland, Voronoi-based segmentation of cells on image manifolds, Computer Vision for Biomedical Image Applications (2005), 535-543.

[13] M Kvarnstrom, K Logg, A Diez, K Bodvard, and M Kall, Image analysis algorithms for cell contour recognition in budding yeast, Molecular Cell 21 (2006), 3-14.

[14] John R.S. Newman, Sina Ghaemmaghami, Jan Ihmels, David K. Breslow, Matthew Noble, Joseph L. DeRisi, and Jonathan S. Weissman, Single-cell proteomic analysis of S. cerevisiae reveals the architecture of biological noise, Nature (2006), no. 15, 841-846.

[15] A Niemisto, J Selinummi, R Saleem, I Shmulevich, J Aitchison, and O Yli-Harja, Extraction of the number of peroxisomes in yeast cells by automated image analysis, Proceedings of the 28th IEEE EMBS Annual International Conference (2006).

[16] N. Otsu, A threshold selection method from gray-level histograms, IEEE Transactions on Systems, Man, and Cybernetics 9 (1979), no. 1, 62-66.