>> Vani Mandava: So welcome everyone. It's with great pleasure that I
introduce Ben Chidester, who is an intern on the Microsoft Research Outreach
Team this summer. He's a Ph.D. student at the University of Illinois
Urbana-Champaign, advised by Professor Minh Do.
His research interests lie in image processing and statistical learning, and
his work focuses on applications in the bio-imaging domain. He's done some
great work with us this summer, working on his problem on Azure, applying deep
neural networks and trying different implementations on Azure.
And his talk today is titled Cell Classification of FT-IR Spectroscopic Data
for Histopathology using Neural Networks. Welcome Ben.
>> Ben Chidester: Thank you. Thanks for the introduction, Vani. And thanks
for joining. So Vani introduced the title already. There are a lot of things
in it that I want to go through, introducing one at a time what all the pieces
are for.
As a brief outline of what this talk will include: first, the application of
FT-IR, or Fourier transform infrared spectroscopy, for histopathology, some of
the current challenges with that work, and the motivation for using deep
neural networks with this FT-IR data. I'll also talk about implementing the
DNNs on Azure and what Azure brings to the table, and then finally,
culminating all these different steps, actually classifying cells from FT-IR
data to aid in the process of histopathology.
So to begin, I'll start by describing a bit about what Fourier transform
infrared spectroscopy is. It's a type of chemical imaging that uses infrared
light. Infrared light is emitted from a source and passes through a beam
splitter, and there's a translating mirror, so you can change the phase of the
light returning toward the source. The beams interfere with one another and
pass through the sample, where they're detected, and the detector records
what's called an interferogram, which, after a Fourier transform, gives you a
measure of absorption. As the infrared light passes through, it resonates with
some of the vibrational frequencies of functional groups of molecules in the
sample and is absorbed; otherwise it passes through. So we get a spectrum of
how much is absorbed and how much passes through at different frequencies of
infrared light, and for each pixel, each spot in the sample, we have a
spectrum.

If we put all the spectra together, we get a 3-D volume. If I look down one
pixel through the 3-D volume, I see that pixel's spectrum, and at each
frequency I have an image, which is an absorption map; for every frequency I
have a different image. So that's the way FT-IR works: in the end you have
not one single image but this 3-D volume. The resolution of FT-IR is about
five microns, so in each of these spectra we're seeing the response of
possibly several cells: possibly a pure region of one cell, but possibly
several cells overlapping. This spectrum actually tells us a lot about the
molecular and chemical composition of what's inside the sample. So that's
where the power comes in.
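To make the data layout concrete, here's a minimal NumPy sketch of how such a
hyperspectral cube can be handled. The array names and values are hypothetical
stand-ins (the 200 by 200 pixel and 501 band figures come from the dataset
described later in the talk); this is not the project's actual processing code.

```python
import numpy as np

# A detector reading: the recorded interferogram (random stand-in data).
interferogram = np.random.rand(1024)

# A Fourier transform of the interferogram recovers a spectrum; a real FT-IR
# pipeline would also apodize, zero-fill, and ratio against a background scan.
single_beam_spectrum = np.abs(np.fft.rfft(interferogram))

# Stacking the spectrum at every pixel gives the 3-D volume:
# 200 x 200 pixels, 501 frequency bands.
cube = np.random.rand(200, 200, 501)

pixel_spectrum = cube[57, 120, :]   # looking down one pixel: shape (501,)
absorption_map = cube[:, :, 250]    # one frequency band: a 200 x 200 image
```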
So, normal histopathology: if you want to go to a doctor to look at a tumor,
to see if it's cancerous or not, the doctor would do a biopsy and extract a
core from the site, and the core is sliced very thinly into different slices.
Each one of these slices is then stained, using something like H&E staining,
and a pathologist will observe the structure under a microscope and, from the
structure, decide whether it's cancerous, or the grading of the cancer. So
they're looking for things in the structure of the data.

With FT-IR we're proposing to replace this staining stage. Instead of doing
the staining, we actually image each slice, and we have this 3-D volume of
spectra. Looking at the spectrum at each pixel, I can tell a lot about the
molecular composition, and I can create a map of the different cell types
found there. Some of the key players are things like epithelium, collagen, or
fibroblasts. From this map, a pathologist could then make a diagnosis about
cancer or the grading of cancer. And what we'd like to do is automate this
task of labeling using machine learning.
So that's the current process, but it hasn't really seen much clinical use yet
because of a couple of drawbacks. One of the main ones is that the imaging is
very slow: imaging one core can take about a day, and if you're trying to see
several patients a day, that's prohibitive. We'd want something more like 30
minutes per patient's core, not a day. The other issue is that knowing what
the cell label should be requires a lot of insight from a pathologist, or
someone who is trained to look at a spectrum and say what the cell type is.
So the features currently used for this classification, these metrics, are
designed by pathologists. Another problem is that these features are specific
to each tissue type: the spectrum for a certain cell type may change from
breast to prostate tissue. So if we want to extend to a new tissue type, we
have to create new features each time, tune them, engineer them, and see
which ones work best.

So what my project looks at is this pipeline for classifying the data: I have
my spectrum as input, some sort of feature extraction and classification
stage, and at the output a cell label. What we're proposing is to use deep
learning for this classification process. In deep learning, I no longer have
the feature extraction stage; I just have a neural network. I can feed the
individual data points themselves, the absorptions at each frequency, into
the network as input nodes, train the network, and get out the class label.

Neural networks have a lot of advantages. A key one for us is that they learn
the features automatically, so I no longer need a pathologist to tell me what
things are important or what kind of metrics I should be looking for.
I can actually try to learn these on our own without any prior knowledge. And
it's also been shown to provide very high classification accuracy in a lot of
problems like speech and image classification. But we also face some
challenges with deep learning: if anyone's worked with it, you know it
requires a significant amount of computational resources and parameter tuning,
and another challenge is that the more data, the better. That's a good thing
but can also be a bad thing: if you have enough data it can work well, but if
you're limited on data it can be a problem. Helping with the computational
resource constraint and
parameter tuning is where Azure comes into play. Using Azure for data science
problems, we can create lots of different virtual machines in the cloud, which
I can access through the Azure portal or through SSH, and I can use Windows or
Linux machines. I have storage: in my online storage accounts I can store
virtual hard disks containing my data, or blobs, Azure's type of file storage.
And I have access to APIs in Python and other languages for working with these
virtual machines and my data.
In terms of parallelizing or distributing this task, there are two approaches
we could take. The simplest would be what's called embarrassingly parallel,
which is the terminology used in this area: on each virtual machine I just
have one neural network, and they train independently of each other, with
maybe a head node that organizes and distributes the workloads. When one
virtual machine is done, it reports back and the head node gives it a new job.

A more sophisticated method would be something very I/O intensive: maybe I'm
training one network and distributing the work across many virtual machines,
possibly asynchronously, with all of them doing different tasks and reporting
back to the head node. That's the more sophisticated approach, and
embarrassingly parallel is the simpler one. The sophisticated approach might
be ideal if you can handle the I/O, but I think there's a lot to be gained
just by doing things embarrassingly parallel, since much of the work and time
spent in deep learning is tuning hyperparameters, which we can do without the
need for the more intensive parallelization. Also, if you look at the
libraries out there, a limitation is that not many of them allow for that
kind of intense parallelization.
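As a rough illustration of the embarrassingly parallel approach, here is a
minimal head-node sketch that pushes independent training jobs to worker VMs
over SSH. The VM names, the train.py script, and the config files are all
hypothetical, not the actual scripts used in this project.

```python
import itertools
import subprocess
from concurrent.futures import ThreadPoolExecutor

# Hypothetical worker VMs and hyperparameter configurations to try.
vms = ["cig-pylearn-1", "cig-pylearn-2", "cig-pylearn-3"]
configs = ["net_2x100.yaml", "net_3x250.yaml", "net_4x500.yaml", "net_2x500.yaml"]

def train(job):
    vm, config = job
    # Each VM trains one whole network independently; the head node only
    # launches the job over SSH and reads back the exit status.
    result = subprocess.run(["ssh", vm, f"python train.py {config}"])
    return vm, config, result.returncode

# Configs are assigned round-robin across the VMs, with at most one job
# in flight per VM thread.
with ThreadPoolExecutor(max_workers=len(vms)) as pool:
    for vm, config, rc in pool.map(train, zip(itertools.cycle(vms), configs)):
        print(f"{config} on {vm}: {'ok' if rc == 0 else 'failed'}")
```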
There's a lot on this slide, but if you work with deep learning, maybe you're
familiar with some of these libraries. There are a couple I'd like to point
out, and the one in particular I'm using is Pylearn 2, which is built on top
of Theano. It offers a lot of deep learning networks already implemented,
it's quite actively maintained, and as new features in deep learning come out
they're pretty good about adding them to the code base.
Other good ones are Caffe and Torch7; Caffe especially if you're interested in
convolutional neural networks, which is something I may turn to in the future.
I'd definitely recommend looking at those as well. But mostly I picked
Pylearn 2 because it's easy to use, it's in Python, which I'm familiar with,
and it's pretty easy to get started with.
If you're interested in knowing more about these libraries, we can talk
afterward.

So I'll describe my setup in Azure, how I'm using Azure to do this deep
learning training. Starting with the portal, I can create a virtual machine
in the cloud, SSH into it, make it a Linux machine, for instance, and install
all of my favorite Python libraries, or whatever libraries you may want to
include. And I can create a storage account, which I hook up to the VM and
where I store my FT-IR dataset. So that's what I currently have operating.
Once I have this VM created, I just copy it to make more instances of the
same VM and connect each one of them to the same dataset. I access all of
them through SSH, and using Python scripts I can delegate different tasks,
different networks, to each of the machines.

So I want to show a little bit of my Azure setup. Let's see. The portal is
really easy to access; it's just manage.windowsazure.com. But I don't seem to
have the server. Let's see. Oh, lost connectivity. Let's try again. Okay,
here we go. So I can load up the portal, and it shows all of the items I
have. I can look at my virtual machines, and here I have five virtual
machines running. I work in a group at Illinois called CIG, so they're named
CIG Pylearn plus the instance number. I can look at one of them on the
dashboard, and currently it's not doing too much. This is an eight-core
machine, and it has 14 gigabytes of memory. With Azure for Research you can
get up to 32 cores, which is what I'm using currently, ranging across
different sizes. I also have my storage accounts. Whoops. Here we go. So I
have a storage account called CIG General Store, and on here I have my
containers, the virtual hard drives. I can see all the hard disks for the
operating systems of the virtual machines, and this top one, BRC 961, which
is the FT-IR dataset that I'm using.
From here it's really easy to create new virtual machines or new storage
accounts; it's very easy to manage everything. The other great thing is that
I don't have to worry about managing the hardware of a cluster or server
myself. Our group at Illinois has an old server that we got from another
group, and it's basically not usable; it's so old that it's obsolete compared
to what's currently out there. But with Azure I have access to the most
recent hardware, and I don't have to worry about maintaining it or backing
things up. So it's pretty nice in that sense.
Okay. So now I'll talk about actually doing cell class labeling using FT-IR.
The dataset that we have for our experiments is about 97 samples; each one is
a sample, like a slice taken from a core from a biopsy.

Each sample is about 200 by 200 pixels, and each pixel is a spectrum, for
about 2 million spectra in total, of which about 500,000 are labeled. And
this is just one image, the response at one frequency, but we have an image
for each frequency band, so 501 bands. There are seven different cell types
that we're looking at. The pixels that are labeled were labeled by a
specialist, and you'll notice there are some holes in the data. Those are not
missing cells or anything; sometimes the cells may be constituted of
different cell types, or may be hard to identify. The labeled ones are those
considered certain, so we use them for training. I split the data into about
70 percent training, 10 percent validation, and 20 percent testing.

And this is the setup I'm using: I extract the spectra from the 3-D volume
and store them all in a big 2-D matrix, so each row is a spectrum, and across
the row is its response at the different frequencies. Then I feed these
spectra into the neural network.
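As a minimal sketch of that layout, assuming hypothetical cube and labels
arrays in place of the real dataset (this isn't the project's actual
preprocessing code), you can flatten the labeled pixels into a
spectra-by-frequencies matrix and split it 70/10/20:

```python
import numpy as np

# Hypothetical stand-ins for the real data: a 200 x 200 x 501 absorption
# volume and a per-pixel label image where -1 marks unlabeled pixels.
cube = np.random.rand(200, 200, 501)
labels = np.random.randint(-1, 7, size=(200, 200))

mask = labels >= 0
X = cube[mask]     # 2-D matrix: one row per labeled spectrum, 501 columns
y = labels[mask]   # the seven cell-type classes

# 70 percent training, 10 percent validation, 20 percent testing.
rng = np.random.default_rng(0)
idx = rng.permutation(len(X))
n_train, n_val = int(0.7 * len(X)), int(0.1 * len(X))
train_idx, val_idx, test_idx = np.split(idx, [n_train, n_train + n_val])
```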
This picture is actually a little bit misleading, because there's not one node
at the output; there are actually seven, one for each of the classes, each
node representing the probability of that class. For the experiments I used
two to four layers, but this is preliminary work, so there's no reason not to
go for more; I'll be trying more layers and different configurations. I also
used five to 500 nodes per layer, but there's no reason we can't go higher.
I'm using the entire spectrum for the input layer, so 501 nodes. If you're
familiar with deep learning, there are a lot of different activation functions
you can use; it's another part of the tuning. I'm using rectified linear
units, and softmax regression for the output. And at the output we get just
one label per spectrum, which all together constitute an entire cell map.
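To make the architecture concrete, here's a small NumPy sketch of the forward
pass for one such configuration: 501 input nodes, two ReLU hidden layers, and
a softmax output over the seven cell-type classes. The layer sizes and random
weights are illustrative assumptions; the networks in the talk are built and
trained with Pylearn 2.

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def softmax(x):
    e = np.exp(x - x.max(axis=1, keepdims=True))  # stabilized exponentials
    return e / e.sum(axis=1, keepdims=True)

# One configuration in the range described: 501 -> 250 -> 100 -> 7.
sizes = [501, 250, 100, 7]
rng = np.random.default_rng(0)
weights = [rng.normal(0, 0.01, (m, n)) for m, n in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(n) for n in sizes[1:]]

def forward(X):
    h = X
    for W, b in zip(weights[:-1], biases[:-1]):
        h = relu(h @ W + b)                        # hidden ReLU layers
    return softmax(h @ weights[-1] + biases[-1])   # per-class probabilities

probs = forward(np.random.rand(4, 501))  # four spectra -> (4, 7) probabilities
pred = probs.argmax(axis=1)              # predicted cell type per spectrum
```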
Here are some results from the networks that did the best, using rectified
linear units for activation and different configurations of layers and nodes.
This is the number of nodes per layer, for the first layer and the second
layer, and the accuracy is close to 90 percent. But I think there's a lot of
room for improvement; part of the deep learning work is continuing to search
over different sets of hyperparameters, different configurations, and
different learning rates to get better results. That's something I'll be
working toward in the future.
Let's see. Okay. So in conclusion: FT-IR spectroscopy has a lot of potential
to provide quantitative morphological and molecular information for
histopathology, and there's still a great need in histopathology for more
accurate diagnosis, so we're hoping FT-IR can aid greatly in that.
And using deep neural networks allows for accurate cell classification, which
we're hoping to continue to improve, without the need for doing feature
engineering or having domain knowledge. And with Azure I was able to
parallelize this parameter search, which greatly expedites the process of
searching for hyperparameters.
So to acknowledge some of the people who helped: I'd like to thank Vani, my
mentor, and Harold, for the opportunity to intern at Microsoft. Some of the
staff I talked with, Trishul, was really helpful, with great insight, and
other interns like David, thanks for the helpful conversations. And my
advisor back at Illinois, and the group we work with at Illinois, Rohit
Bhargava and other members of his group. So that concludes my talk. Thanks
for your attention, and if you have any questions, feel free to ask.
>>: Can you talk about future directions?
>> Ben Chidester: Yeah, I don't think I have a slide for it in here, actually.
So let's see. So I think in the future -- that's not it. Let's just go back
to the spectroscopy picture. One thing we're hoping to do is take more
advantage of the spatial information that we have, so not just labeling each
spectrum individually but trying to make use of the entire 3-D volume. That's
one thing. And also, if I go back to the histopathology slide: the ultimate
goal is really to automate this whole process. So maybe we get to the cell
map, and then from the cell map, using morphological features, we do machine
learning on top of that to create a diagnosis or grading of cancer, or to
label pixels as cancerous. Not necessarily to replace the pathologists, but
there's a great need for second opinions: there can often be a lot of
inconsistency between pathologists in diagnosis, and even a single pathologist
doing subsequent diagnoses of the same sample can give different diagnoses.
So we're hoping that an automated histopathology process could actually aid a
doctor by giving a second opinion. But that's the future goal, the direction
of it.
Any other questions?
>>: Just out of curiosity, how long does it take to go through the process of
labeling, once the neural network has been trained, for the single spectra
from the core?
>> Ben Chidester: Once it's trained, it's quite fast; it's mostly the training
that takes a long time. Training the networks can take a few hours, or
several hours, usually, per configuration. That's what I've found so far.
But labeling is pretty quick, yeah.
>>: Do you use the I/O-intensive training?
>> Ben Chidester: No, I'm not using that. I'm not using that method; I'm just
using the simple embarrassingly parallel approach.
>>: Different virtual machines that try to train different types of --
>> Ben Chidester: Exactly. Each machine has its own configuration of a neural
network, and it trains on its own. Projects like Adam do that style of
parallelization, or DistBelief by Andrew Ng, I think, does something like
this. But I'm not doing that, yeah.
>>: Another question about the images. So when you do FT-IR imaging it's not
a single slide, right, it's 3-D? When you take, when you do a biopsy, is it a
3-D mask or not, or is it like --
>> Ben Chidester: Yeah --
>>: I didn't get that.
>> Ben Chidester: Let's see, where was it here?
>>: [indiscernible] does that mean that it -- if I have a volume in a CT scan,
I would say every slide is -- so when you make a 3-D volume from a CT scan, it
just describes the organ in 3-D, but here it's not the case, right? This is
for a cell in 2-D.
>> Ben Chidester: It actually would be like 4-D, technically.
>>: It should be 4-D.
>> Ben Chidester: Because for each slice in the core you have a 3-D volume.
And you'd have, so the fourth dimension is --
>>: In that sense, how are you classifying it?
>> Ben Chidester: I'm just using, I'm only considering one slice of the core,
not the whole thing. Yeah. So I take just that one slice, image it, which
gives you the 3-D volume, and then do the labeling from there.
>>: You said 97 samples. Is it 97 slices?
>> Ben Chidester: Slices from the cores, exactly. Thanks for your attention.
[applause]