791H Senior Project Progress Report

Image Filtering and Enhancement of Scanning Transmission Electron Microscope Images

Submitted for Review to: Dr. Tom Miller
Submitted by: Nathan P. Brouwer

University of New Hampshire
College of Engineering and Physical Sciences
Department of Electrical and Computer Engineering
55 Edgewood St.
Durham, New Hampshire 03824

Created: December 12, 2010
REVISED: December 17, 2010
REV: FINAL

Table of Contents

1 Abstract
2 Project Definition and Objectives
2.1 Project Definition
2.2 Project Objective
3 Design Process and Implementation Plan
3.1 Three Phase Iterative Approach
4 Progress
5 Significance/Implications
6 Time Table
7 Appendices
7.1 Timeline for Project
7.2 Budget
7.3 Budget Explanation
7.4 References

1 Abstract

ZSGenetics uses a scanning transmission electron microscope (STEM) to image deoxyribonucleic acid (DNA) directly for research purposes. Because of the high magnification and the way the images are formed, images from this process are currently unclear and difficult to analyze directly. The proposed solution is to construct and implement a variety of image processing algorithms that enhance the quality of the DNA images and make it easier to extract information that the scientists at ZSGenetics can use. The project will result in a graphical user interface (GUI) that researchers can use to process these images, making analysis much quicker and more accurate. This is novel work in an emerging engineering field with great potential for publication at its conclusion.

2 Project Definition and Objectives

2.1 Project Definition

The problem the scientists face is that the images are very difficult to analyze: a large amount of cluttering information, or noise, interferes with the ability to detect the DNA strands. It is a scatter problem that becomes largely statistical, and it is the major challenge to overcome. ZSGenetics has an innovative patented technique that binds larger atoms to certain nucleotide pairs in DNA. Because contrast and resolution still make it nearly impossible to detect the DNA atoms directly, the problem has been recast as finding the locations of the marker atoms. If the location of a marker atom can be determined, then the corresponding base pair is also known.
2.2 Project Objective

The project goal is to provide a solution to the DNA imaging problem by using image processing algorithms and filters to extract information and improve the images to the point where they are useful to the scientists at ZSGenetics. Through a variety of algorithms, it is possible to overcome the scatter noise problem, detect marker atoms, and calculate the distance between markers to determine the number of unmarked base pairs between them. Before the primary objective can be addressed, I must be able to recognize the DNA strand reliably with pattern recognition software that I develop. Once that has been accomplished, the main goal of the project is to detect the marker atoms with a coherent, automated algorithm. Only once the strands are detected and the marker atoms identified can the distance between points be calculated from the pixel coordinates and the orientation of the image.

3 Design Process and Implementation Plan

3.1 Three Phase Iterative Approach

This cyclical three phase approach consists of data definition and collection, algorithms and testing, and evaluation and feedback of algorithms.

Phase I, the preliminary phase, will consist of data definition and collection. This will include travelling to Cambridge, Massachusetts to receive additional data sets of still images and video sequences of DNA from ZSGenetics. I will personally receive certified training at Harvard University on how to safely operate the electron microscope. There will be meetings to learn from the scientists exactly what they are looking for and how they would like the images enhanced in order to better comprehend the data set. Phase I will end with the compilation of pertinent data sets and a clear idea of which algorithms might produce the desired results. Once the exact nature of the images that need to be improved is understood, the best combination of image processing algorithms will be determined and applied.
Phase II will be primarily the application of the algorithms identified in Phase I to the data set, followed by modification of those algorithms based on feedback from the experts at ZSGenetics. Evidence suggests that the scattering noise will inevitably lead to statistical solutions. The design goal for Phase II is to analyze statistical trends in the DNA images and, using pattern recognition algorithms, correctly find and crop the DNA sequence according to those trends. Once the strand is found, I will implement a decision process to determine where each marker is. More research will be done to find a reliable set of features to use as criteria for the marker decision.

Phase III will add further feedback: images processed with the current combination of algorithms will be sent back to ZSGenetics for assessment. They will use their expertise in the subject matter to evaluate how successful the attempts were and to suggest what still needs to be done to the images for even better clarity. The iterative part of this project is using the feedback from ZSGenetics to return to the drawing board and further improve our process. William Glover from ZSGenetics has been working in open communication with me and Professor Messner's laboratory, which is a necessity for the project. Figure 1 shows graphically how this phased approach will flow.

[Figure 1: Project Flow. Flowchart boxes, in order: Research and Data Collection; Algorithms and Testing; Evaluation and Feedback of Algorithms by ZSGenetics; Create Graphical User Interface for use after project completion; Final Report and publication.]

The end result will be a set of tuned image processing routines and a graphical user interface (GUI) that scientists and engineers can use for DNA research. We expect that our end results will be publishable, and we expect to submit our findings to an appropriate journal for publication.

4 Progress

Phase I is a data collection and interpretation phase.
ZSGenetics has acquired and forwarded to us several batches of images to begin working with. Examples of the two types of raw images collected by ZSGenetics are shown below:

[Image: Raw DNA strand (Bright Field Imaging)]

[Image: Raw DNA strand (Dark Field Imaging)]

Bright field and dark field are two distinct data acquisition techniques that result in very different images. Bright field imaging effectively blasts the sample with electrons while a sensor monitors the scattering of electrons through the sample. Dark field imaging raster scans the sample with a concentrated beam of electrons, which builds the image pixel by pixel. It is anticipated that dark field will yield better images because of its contrast, but this remains to be proven by processing each type of acquired image. By working on both image sets, I will be able to determine which type yields the best results. With both types of imaging it is possible to ascertain the location and orientation of the DNA strand, given sufficient magnification and focus. The effectiveness of the algorithms at identifying such features will be a deciding factor in which type of image to pursue further.

In addition to data collection, I travelled to Harvard to attend an electron microscope training session. I completed the first part of the training, which dealt with facility safety. I will schedule the final part of the training over break.

I am currently working on Phase II, the algorithm development phase. The first approach was to develop an algorithm that finds the DNA strand by looking at local area statistics. Quad tree decomposition is the technique I chose to implement. The idea behind the technique is to test the parent image against certain criteria. If it does not meet the criteria, the parent is split into four equally sized child blocks.
Each of those blocks is then tested against the same conditions, and any block that fails is broken down again, repeatedly, until the criteria are met. Once a block fulfills the conditions, it is not broken down further, because it is assumed that all children of that block would also meet those conditions.

[Figure: Quad Tree Decomposition]

This 64x64 pixel image contains a 5x6 object. Instead of testing each pixel, which would take 4096 computations, it takes only 84 computations to fully detect the object's location. This method is much faster and can be set up to use any criterion to test each cell. The example uses a simple threshold (if a cell is not equal to zero, split the cell).

A problem I ran into was that simple thresholds do not suffice to detect the strand in the image because of the random noise that appears. I devised a test function, passed in as a function handle, that can test each cell against arbitrary conditions. The problem is that I do not yet know the best conditions to test. For the time being, the quad tree decomposition function has been put aside so that I can look into the statistics of the noise. If I can characterize the noise in a coherent statistical manner, that could provide the conditions for the quad tree decomposition.

The next step entailed inspecting the statistics of the noise with and without DNA present. I determined that a distinct statistical distribution exists throughout the image. An example of a normalized histogram of the intensity distribution is shown below:

[Figure: Histogram of Scatter Noise. Plot title: Histogram of Characterized Noise; horizontal axis: Intensity Bins; vertical axis: Normalized Frequency of Occurrence.]

Knowing that there are statistical trends, I have been developing a user friendly interface to calculate standard deviation, variance, and higher order moments of the two dimensional data. Inspecting data sets with and without DNA should give good insight into the differences in these trends.
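The quad tree decomposition and the block statistics described above can be sketched in a few lines. What follows is a minimal Python/NumPy sketch under stated assumptions, not the project's actual implementation. The names block_stats, quadtree, and variance_test are hypothetical, the image is assumed to be square with a power-of-two side, the object position is arbitrary, and the variance criterion is only one possible stand-in for the test conditions that are still undetermined. The test is passed in as a callable, mirroring the function handle approach.

```python
import numpy as np


def block_stats(block):
    """Return (mean, variance, skewness, excess kurtosis) of a 2-D block."""
    x = np.asarray(block, dtype=float).ravel()
    mean = x.mean()
    var = x.var()
    if var == 0.0:
        # A constant block has no shape information beyond its mean.
        return mean, 0.0, 0.0, 0.0
    z = (x - mean) / np.sqrt(var)
    return mean, var, np.mean(z ** 3), np.mean(z ** 4) - 3.0


def quadtree(image, is_uniform, min_size=1):
    """Split a square power-of-two image into blocks that pass is_uniform.

    Returns (leaves, n_tests): leaves is a list of (row, col, size) blocks,
    n_tests is the number of blocks on which is_uniform was evaluated.
    """
    image = np.asarray(image)
    n = image.shape[0]
    if image.shape != (n, n) or n < 1 or (n & (n - 1)) != 0:
        raise ValueError("image must be square with a power-of-two side")
    leaves, n_tests = [], 0
    stack = [(0, 0, n)]
    while stack:
        r, c, size = stack.pop()
        n_tests += 1
        if is_uniform(image[r:r + size, c:c + size]) or size <= min_size:
            leaves.append((r, c, size))  # block is accepted as-is
            continue
        half = size // 2
        for dr in (0, half):             # split into four equal children
            for dc in (0, half):
                stack.append((r + dr, c + dc, half))
    return leaves, n_tests


def variance_test(max_var):
    """Hypothetical statistical criterion: accept blocks with low variance."""
    return lambda block: block_stats(block)[1] <= max_var


# Example from the text: a 64x64 image containing a 5x6 object.
image = np.zeros((64, 64), dtype=np.uint8)
image[10:15, 20:26] = 1                  # object position is an assumption
# Simple threshold rule from the text: split any block that is not all zero.
leaves, n_tests = quadtree(image, lambda b: not b.any())
object_pixels = [(r, c) for r, c, s in leaves if image[r, c] != 0]
print(len(leaves), n_tests, len(object_pixels))
```

Because the test is just a callable, the threshold rule could be swapped for a statistical one, for example quadtree(image, variance_test(max_var)) with max_var chosen from measured noise statistics. The number of tests depends on where the object falls relative to the block boundaries, so it will not necessarily match the count quoted for the figure.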
These statistics should give me a method for creating the conditions for quad tree decomposition. This would accomplish our first objective: algorithmically detecting DNA sequences in images.

Another technique we employed was linear mapping, which maps the values of an image onto a stretched scale. As the histogram shows, there is little to no information beyond bin 80. Therefore I can map 0-80 to 0-255, where 255 is the maximum value of an unsigned 8-bit number (an intensity v becomes roughly 255v/80). This preserves the information as long as there are no values above 80 in the original image, and it gives better resolution for visualization because the gray scale has been expanded.

5 Significance/Implications

Being on the verge of detecting a DNA sequence in software from a .TIFF image puts me in good shape to complete the project next semester. Overcoming the noise problem by characterizing it statistically is publishable material that could be useful to others. This work on imaging DNA in order to identify the specific DNA sequence in a sample with a scanning transmission electron microscope has never been done before. If the project is successful, it will provide scientists and medical researchers with a method to extract information directly from images of DNA.

6 Time Table

See Figure 2 in the appendices for a Gantt chart that describes the timeline. I am right on track, in the middle of the algorithm phase, and I continue to receive feedback that steers the work in a productive direction. October and November were spent mostly collecting images and brainstorming possible approaches. December consisted mostly of writing algorithms. January and February will be spent writing more code, with plenty of feedback to determine the best course of action. After that, I will focus on GUI implementation and documentation for publication.
Nearly a month is set aside at the end of the semester for last minute alterations and, most importantly, the final report and publication of this research. In April, this research will be presented at an Undergraduate Research Conference.

7 Appendices

7.1 Timeline for Project

Figure 2: Gantt chart timeline

7.2 Budget

Category        Item              Details         Quantity  Cost
Supplies        Paper             Color Printing  2 Reams   $35.98
Supplies        Flash Drives      8 GB            2         $39.98
Travel          Durham-Danvers    96 mi RT        2 Trips   $48.00
Travel          Durham-Cambridge  138 mi RT       3 Trips   $103.50
Other Expenses  Photo Copies                      250       $25
Total                                                       $252.46

Note: SURF grant has awarded $150 for budget

7.3 Budget Explanation

A. Paper: This will cover the actual paper used for printing and calculations, as well as the cost of color printing. Any cost above the budgeted amount will be covered by the ECE department.

B. Flash Drives: An easy and universal way to transfer and store images is needed. Flash drives make it much simpler to transport images that are too large to send over email.

C. Travel: Travel will be necessary for training and data collection in both Danvers, MA and Cambridge, MA.

D. Photocopies: It will be necessary to reproduce many of the images created. Any cost above the budgeted amount will be covered by the ECE department.

7.4 References

Bell, David C., Katelyn M. Murtagh, Cheryl A. Dionne, and William R. Glover. Direct observation of single-atom DNA labels with annular dark-field electron microscopy. Submitted to Nature (2010).

Gonzalez, Rafael C., and Richard E. Woods. Digital Image Processing. Upper Saddle River, N.J.: Prentice Hall, 2008.

Nakanishi, Nobuto, Tasutoshi Kotaka, and Takashi Yamazaki. An expanded approach to noise reduction from high-resolution STEM images based on the maximum entropy method. Ultramicroscopy 106 (2006) 233-239.

Robinson, Richard. DNA Structure and Function, History. Genetics (2003).

Schalkoff, Robert. Pattern Recognition. New York, NY: John Wiley & Sons, Inc., 1992.