Classification of Liver Cancer by Artificial Neural Network
Lam Yee Hong Brian U990209278

Proteomic Classification of Liver Cancer using Artificial Neural Network
By Lam, Yee Hong Brian
Supervisor: Mr. Willy Tse
May, 2005
Submitted as part of the requirements for the award of the Degree in Computing and Information Systems of the University of London.

Acknowledgments

I gratefully acknowledge the inspiration and guidance of my supervisor, Mr. Willy Tse (HKU SPACE, The University of Hong Kong), throughout the course of this work. I am greatly indebted to Dr. John Luk (Department of Surgery, The University of Hong Kong) for providing tremendous support, including the proteomic data and computing facilities for this study. I also thank my father, my brother, Ms. Marcella Ma and Mr. Stanley Lam for their useful advice and suggestions, and all the staff at HKU SPACE and my colleagues in the Department of Surgery for their help and co-operation. Last but not least, I would like to express my gratitude to my family for their patience and support throughout these years of study.

Abstract

Hepatocellular carcinoma (HCC) is one of the most deadly cancers worldwide. Recent advances in proteomic approaches have facilitated several proteome-wide studies that identify markers and offer insights into the mechanisms of HCC development. Given the relatively large amount of data in the proteome, modern high-order data mining techniques may provide a systematic way to search for meaningful and biologically significant patterns and trends hidden in a proteomic dataset.
In this study, a proteomic dataset of 132 HCC-related tumour and non-tumour samples, each consisting of 1433 variables, was used to construct and evaluate classification models based on the artificial neural network (ANN) and classification and regression trees (CART) algorithms. Both algorithms successfully segregated samples into the corresponding phenotypes with high sensitivities and specificities (ANN: 89.4%, 89.4%; CART: 80.3%, 80.3%), highlighting the usefulness and potential of data mining techniques in genomic and proteomic expression profiling studies.

Table of Contents

Acknowledgments
Abstract
Table of Contents
List of Figures and Tables
1 Introduction
   Pathogenesis of Cancer
   Proteomics
   Data Mining
   Rationale and objectives of study
2 Terms of Reference
   Problem domain and problem statement
   Project objectives
   Expected deliverables
   The approach of project
3 Liver Cancer Proteomics – a brief review
   Proteomic approaches & current findings
   Data mining – an emerging discipline in HCC proteomics
4 Artificial Neural Network and Uses in Proteomics
   The artificial neuron model
   ANN Architectures
   Learning Rules
   Backpropagation algorithm
   Applications of ANN in HCC proteomics
5 Implementation and training of ANN
   Proteomic data structure & characteristics
   Data pre-processing
   Implementation of ANN
   Training & Optimization
6 Results
   Training performance
   Validation performance
7 Comparison with Classification and Regression Trees Algorithm
   Implementation of CART
   The classification tree
   Training performance
   Validation performance
   Comparison of ANN and CART algorithm
8 Discussion
   Applicability of ANN based classification algorithm in HCC proteomics
   ANN & CART: Which is better?
   Future Directions
9 Conclusion
References
Appendix
   Outputs of different classification models constructed in this study
   Companion CD

List of Figures and Tables

Figures
Figure 1.1 The Central Dogma of biology.
Figure 1.2 The cell cycle.
Figure 1.3 Proteomics based on 2D-E approach.
Figure 3.1 Major Proteomic technologies.
Figure 3.2 Chart for marker discovery and development using proteomic technologies.
Figure 4.1 The artificial neuron.
Figure 4.2 Transfer functions.
Figure 4.3 Topology of feed-forward network.
Figure 5.1 A sample image of 2D-E gel.
Figure 5.2 Raw spots intensities versus natural log-transformation.
Figure 5.3 A dendrogram of average linkage clustering of protein and samples.
Figure 5.4 A screenshot of neural network manager.
Figure 5.5 Network topology of ANN_1.
Figure 5.6 Network topology of ANN_2.
Figure 5.7 Network topology of ANN_3.
Figure 5.8 Network topology of ANN_4.
Figure 5.9 Network topology of ANN_5.
Figure 5.10 Network topology of ANN_6.
Figure 5.11 Network topology of ANN_7.
Figure 6.1 Training curves of ANNs.
Figure 6.2 Linear regression of ANNs.
Figure 6.3 Receiver operating characteristic curves of ANNs.
Figure 7.1 The optimal classification tree constructed by CART.
Figure 7.2 Receiver operating characteristic curves for ANN and CART.

Tables
Table 5.1 An overview of the proteomic data file in tab-delimited format.
Table 5.2 Detailed settings of ANNs used in this study.
Table 6.1 Learning sensitivities and specificities of ANN built in this study.
Table 6.2 The sensitivities and specificities of ANNs built in this study.
Table 7.1 Classification trees constructed by BPS.
Table 7.2 Training sensitivities and specificities of CART.
Table 7.3 Estimated sensitivities and specificities of CART.
Table 7.4 The sensitivities and specificities of CART built in this study.
Table 7.5 The area under curve of ANN_6, ANN_6 + hard-limit transformation and CART.
Table 8.1 Major advantages and disadvantages of ANN and CART.
Table a. Validation output of ANN and CART models constructed in this study.

1 Introduction

Liver cancer is one of the most life-threatening solid tumours worldwide, with more than one million cases diagnosed each year. In Hong Kong, liver and related malignancies account for 12.5% of cancer cases, and liver cancer is the second leading cause of cancer death, with a mortality rate of 21 per 100,000 population. Major risk factors for HCC include chronic hepatitis virus infections, in particular hepatitis B and hepatitis C; cirrhosis caused by either hepatitis or alcoholism [1]; and chronic exposure to various cytotoxic substances such as arsenic [2] and polyvinyl chloride (PVC) [3]. The geographical distribution of this cancer is uneven, with the highest incidence rates in eastern and Southeastern Asia, sub-Saharan Africa and Melanesia [4, 5]. The remarkable variations in incidence between regions coincide with the prevalence of chronic hepatitis infections, as well as with dietary habits and environmental and genetic factors [6].
Liver cancer is usually diagnosed at a late stage of the disease, when there are few effective treatment options, and the prognosis for patients with HCC is very poor. Currently, hepatic resection remains the standard treatment for HCC; nevertheless, the procedure is often insufficient because of the low resectability rate. In addition, recurrence occurs in most cases (>60%) after resection, and life expectancy is short, about 6 months from the time of diagnosis [7, 8].

To pursue better disease management of liver cancer, there is an urgent need to develop models for early cancer detection and to understand the underlying mechanisms of liver cancer development. Proteomic approaches (studies of protein expression profiles) have recently gained popularity for providing a global view of protein expression in a variety of studies, and they offer an excellent opportunity to uncover the biologically significant patterns (or fingerprints) specific to liver cancer. Recently, several proteomic studies using different approaches have been conducted on cultured cells, tissues and blood from liver cancer patients [9-13]. Several proteins have been identified, and most of them have been implicated in the development of cancer, as well as in drug responses and toxicities. However, to the best of our knowledge, the biological significance of the proteins found in these studies remains to be elucidated. In addition, neither a liver cancer specific pattern nor a reliable classification model has been established.

Data mining is a broad set of statistical and analytical techniques for discovering hidden data attributes, trends and patterns in large databases.
Since the mid-nineties, data mining has been gaining ground and has been used increasingly in marketing, environmental, banking, commercial and several medical applications. In this study, data on the protein expression profiles of 66 patients were obtained from the University of Hong Kong Medical Centre with permission. Preliminary examination of the data revealed more than 1400 proteins, of which more than 90 showed changes with a statistical significance of p < 0.01. We employed ANN to construct a classification model to distinguish between normal and tumour tissues. In addition, the discriminative performance of the ANN was compared with that of classification models built by the CART algorithm, using the proteomic patterns generated from 2D-E.

Pathogenesis of Cancer

The Central Dogma of biology

The Central Dogma is a series of complex processes involving the generation of RNA from the genetic code in genomic DNA by means of "transcription", and the subsequent production of a protein using the RNA as a template through the process of "translation". Proteins and some specialized RNAs (such as tRNA) are the functional units (machines) that carry out biological functions (Figure 1.1).

[Figure 1.1: flow diagram. DNA ⇒ RNA by transcription (making a functional copy of genetic material; in some cases, RNA itself is functional) ⇒ protein by processing of the mRNA transcript and translation (using the mRNA as template to build proteins, the functional unit of the gene) ⇒ function, after post-translational modifications (final steps to make proteins functional).]

Figure 1.1 The Central Dogma of biology. The Central Dogma of biology describes the processes by which a gene is transcribed from DNA to RNA and then translated into protein to carry out functions.
In the study of biology, the Central Dogma acts as the basis for most biological processes. Most studies of biological systems therefore focus on variations in the entities involved in the dogma, in particular the DNA sequences, the expression profiles of RNA (by means of gene chips) and the expression profiles of proteins (proteomic profiling).

The Cell Cycle

The cell cycle is the recurring sequence of events that includes the duplication of a cell's contents and its subsequent division. The cell cycle is divided into two major phases, namely interphase and mitosis. During interphase, the appropriate cellular components are copied. Interphase is made up of three distinct sub-phases: G1, S, and G2. The G1 and G2 phases serve as checkpoints for the cell to make sure that it is ready to proceed in the cell cycle, while the S phase involves DNA synthesis and the replication of chromosomes. Mitosis is the part of the cell cycle in which the cell prepares for and completes cell division. A summary of the cell cycle is depicted in Figure 1.2.

The cell cycle is a highly regulated process in which cell division and growth are tightly controlled (i.e. the cell proceeds with or suspends the cycle depending on need). All of the machinery involved in the cell cycle, including proteins and chemical factors, is "turned on or off" at specific times with high precision during the different phases of the process. In cancer, however, some of this machinery malfunctions, resulting in a loss of control over cell division that leads to either cell death or tumour growth.

Figure 1.2 The cell cycle. The cell cycle is divided into interphase, which is further divided into G1, S, and G2 sub-phases, and mitosis.
Interphase involves growth and the preparation of genetic materials and machinery for cell division; mitosis involves the final steps of cell cleavage from one cell into two daughter cells. (Courtesy of Randy Poon, Department of Biochemistry, The Hong Kong University of Science & Technology)

Cancer is a Result of Uncontrolled Cell Growth

Cancer is a class of diseases characterized by abnormal and uncontrolled (malignant) growth of cells through the dysregulation of the cell cycle. The resulting mass, also known as a tumour or neoplasm, can invade and destroy surrounding normal tissues. Cancer cells from the tumour can also spread through the bloodstream or lymph system to start new cancers in other parts of the body; this process is known as metastasis.

As mentioned earlier, cancer is often caused by malfunctions of the machinery involved in the cell cycle, and in most cancer cases multiple dysfunctions of the cell cycle process together cause tumour growth. Different cell types have different tendencies to become cancerous: the more actively a cell cycles, the more likely it is to mutate and become cancerous. In the liver, hepatocytes, which account for more than 90% of the organ mass, are highly active cells involved in metabolism and detoxification. Cancer related to the dysregulated growth of hepatocytes is called hepatocellular carcinoma, or simply HCC. HCC is the most frequent type of liver cancer, and it is one of the most common solid malignancies, together with nasopharyngeal cancer (NPC) and colorectal cancer (CRC).

Proteomics

A disease arises when a gene or protein is over- or under-expressed, or when a mutation in a gene results in a malformed protein, altering the protein's function from normal.
In cancer, some genes or proteins are often expressed in an abnormal fashion, especially genes that are involved in the regulation of the cell cycle. Because proteins are the terminal and functional units of genes, it is of great importance to study the expression profiles and functions of the proteins (the machinery) in order to decipher the pathological process of liver cancer.

Proteomics is defined as the systematic large-scale analysis of protein expression under normal and perturbed states; the perturbed state in this study is liver cancer. Proteomics generally involves the separation, identification and characterization of all of the proteins in an experimental sample in a global and comprehensive manner. A broad range of technologies is used in proteomics, but the central paradigm has been the use of 2-D gel electrophoresis (2D-E) followed by mass spectrometry (MS). 2D-E is used to separate the proteins first by intrinsic charge and then by molecular size, while MS is used to identify them from their peptide mass fingerprints. Proteomic profiling using the 2D-E approach is summarized in Figure 1.3.
[Figure 1.3: workflow diagram. Protein extraction from excised liver tissues; 1st dimension isoelectric focusing by pI; 2nd dimension gel electrophoresis; image analysis and spot data warehousing; data mining; MALDI-TOF/MS - MS/MS protein identification.]

Figure 1.3 Proteomics based on 2D-E approach. 1) Proteins were extracted from tumour/non-tumour tissues and the protein concentration was quantified; 2) proteins were loaded onto 11 cm gradient gel strips for 1st dimension separation according to their intrinsic charges, with a resolution of pH 4-7; 3) the gel strips were subsequently subjected to 2nd dimension electrophoresis according to molecular weight; 4) the silver-stained 2D-E gels were digitized using a calibrated densitometer and the spots on the gels were matched using specialized software, after which the spot data were sent to a database for data warehousing; 5) the spot data were extracted from the database and subjected to data mining analyses; 6) meaningful spots were subjected to MALDI-TOF MS or MS/MS for identification based on trypsin-digested peptide masses.

Large-scale approaches, including genomics and proteomics, can generate much more useful information than the traditional hypothesis-driven approach in the study of biology. The two approaches are not mutually exclusive, however; indeed, broad hypotheses can be formed by selecting the appropriate data from an -omics experiment through the proper use of data mining techniques, which are discussed in the following section.

Data Mining

Data mining is a set of analytical techniques for discovering hidden data attributes, trends and patterns in large datasets, and it provides valuable insights into the function or environment of a given scenario. By uncovering the patterns and trends in a dataset, data mining also makes the prediction of future events possible.
Data mining methods are all based on induction-based learning [14], the process of forming a general concept definition by observing specific examples of the concept to be learned. Data mining in general can be divided into four major steps:

1. Assemble a collection of data (from a data warehouse) to analyze;
2. Transfer the data to data mining software for analysis;
3. Interpret the results to examine whether what has been discovered is useful;
4. Apply what has been discovered to new situations.

Data mining strategies

Data mining can be classified into two major strategies, namely unsupervised and supervised. Unsupervised clustering builds models from data without predefined classes, while supervised learning builds models that use input attributes to predict output attribute values. Supervised learning strategies can be further divided into classification, estimation and prediction, according to whether the output attributes are categorical or numeric. The hierarchy of data mining strategies is shown in Figure 1.4.

[Figure 1.4: tree diagram. Data Mining Strategies branch into Unsupervised Clustering and Supervised Learning; Supervised Learning branches into Classification, Estimation and Prediction.]

Figure 1.4 A hierarchy of data mining strategies

Unsupervised data mining techniques

Hierarchical clustering analysis

Clustering analysis was first used by Tryon [15] in 1939. It encompasses a number of different algorithms for grouping objects of a similar kind into categories by developing taxonomies of objects according to their degree of association. In hierarchical clustering analysis, the similarity or dissimilarity between objects is measured by the Euclidean distance:

\[ \mathrm{distance}(x, y) = \Big\{ \sum_i (x_i - y_i)^2 \Big\}^{1/2} \]

Clustering analysis is used to discover structures in data without supervision; in other words, it discovers structures in data without explaining why they exist.
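The Euclidean distance above can be illustrated with a short Python sketch. It is not the software used in this study: the function names and the sample intensities are hypothetical, chosen only to show how the distance is computed and how the closest pair of profiles, which would be the first merge in an agglomerative clustering, could be found.

```python
import math


def euclidean_distance(x, y):
    """Euclidean distance between two equal-length numeric profiles."""
    if len(x) != len(y):
        raise ValueError("profiles must have the same number of variables")
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))


def distance_matrix(profiles):
    """Pairwise distances between all profiles (symmetric, zero diagonal)."""
    n = len(profiles)
    return [[euclidean_distance(profiles[i], profiles[j]) for j in range(n)]
            for i in range(n)]


def closest_pair(profiles):
    """Indices (i, j), i < j, of the two most similar profiles.

    In agglomerative hierarchical clustering this pair is merged first.
    """
    d = distance_matrix(profiles)
    n = len(profiles)
    pairs = [(d[i][j], i, j) for i in range(n) for j in range(i + 1, n)]
    _, i, j = min(pairs)
    return i, j


# Hypothetical spot intensities for three samples (made-up values).
samples = [[1.0, 2.0, 3.0], [1.0, 2.0, 4.0], [5.0, 6.0, 7.0]]
```

With these made-up values, `closest_pair(samples)` selects the first two samples, which differ in only one variable.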
Supervised data mining techniques

Statistical regression

Statistical regression is a supervised learning technique that generalizes a set of numeric data by creating a mathematical equation relating one or more input attributes to a single numeric output attribute. Popular statistical regression techniques include linear regression, in which the value of an output attribute is determined by a linear sum of weighted input attribute values:

\[ x = a \cdot y + b \cdot z + k \]

where x is the output; y and z are inputs; a and b are weights; and k is a constant.

Logistic regression belongs to a broad class of models that also includes ordinary regression and ANOVA, as well as ANCOVA and log-linear regression. Logistic regression allows one to predict a discrete outcome, such as group membership, from a set of variables that may be continuous, discrete, dichotomous, or a mix of any of these. Generally, the dependent or response variable is dichotomous, such as presence/absence or success/failure.

Artificial neural network

An artificial neural network (ANN) is a computational data mining tool based on the basic neural structure and learning model of the brain [16]. An ANN learns from experience and stores information as patterns, just as human brains do. Unlike traditional computer programmes, ANN offers a new way of solving complex classification problems: it develops a pattern from the learning data set by adjusting the weights applied to the neuronal inputs (in our case, the protein expressions) in the perceptron layers, and it assigns class outcomes based on the pattern built. Details of ANN will be discussed in chapter 4.
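Returning briefly to the statistical regression techniques described above, the following Python sketch contrasts the weighted linear sum of linear regression with the dichotomous outcome of logistic regression. The logistic function used here is the standard textbook form and is not stated in this report; the weights, the 0.5 threshold and all function names are assumptions made for illustration only.

```python
import math


def linear_output(y, z, a, b, k):
    """Linear regression form from the text: a weighted sum plus a constant."""
    return a * y + b * z + k


def logistic_probability(y, z, a, b, k):
    """Standard logistic model (assumed form): squash the sum into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-linear_output(y, z, a, b, k)))


def classify(y, z, a, b, k, threshold=0.5):
    """Dichotomous outcome: 1 (e.g. presence) or 0 (e.g. absence)."""
    return 1 if logistic_probability(y, z, a, b, k) >= threshold else 0
```

For example, with the hypothetical weights a = 0.5, b = 1.0 and k = -1.0, the inputs y = 2 and z = 3 give a linear sum of 3, which the logistic model maps to a probability above 0.5 and hence to class 1.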
Classification & regression trees

Classification and regression tree (CART) analysis, on the other hand, is a statistically based, tree-structured data mining technique proposed by the statistician Breiman in 1984 [17, 18]. CART is a recursive-partitioning algorithm that builds a decision tree by recursively identifying a set of if-then logical (univariate split) conditions through an exhaustive search over all variables in the dataset. It permits accurate prediction or classification of cases using a rather simple and easy-to-understand graphical representation.

Rationale and objectives of study

There is an urgent need to build a reliable and reproducible classification model for early detection in order to provide better treatment for patients suffering from liver cancer. Such models, based on proteomic profiles, may therefore provide an excellent opportunity for better medical diagnosis and prognosis, and thus improve the quality of life of patients suffering from HCC.

In this study, ANN was implemented to construct a classification model based on the liver cancer proteomic dataset in order to distinguish between normal and tumour tissues. In addition, the discriminative performance of the ANN was compared with that of classification models built by the CART algorithm.

2 Terms of Reference

Problem domain and problem statement

⇒ To classify tissues as tumour or normal based on their 2D-E proteomic profiles using an artificial neural network;
⇒ To design and implement a neural network based classification model that achieves reliable, sensitive and specific prediction of different disease phenotypes;
⇒ To carry out a comparative study of classification models built by different supervised learning algorithms, such as ANN and CART.
Project objectives

To build a classification model using artificial neural network technology in order to differentiate cancerous tissues from normal tissues based on their protein expression patterns. In addition, the discriminative performance of the ANN will be compared with that of classification models built by the CART algorithm.

Expected deliverables

⇒ Literature review on current research on liver cancer and biomarker discovery, and on the technologies and approaches that have been employed;
⇒ Literature review on major data mining approaches;
⇒ The neural network based classification model, which includes
   o Conceptual design and implementation;
   o Optimization;
   o Evaluation of sensitivities and specificities;
⇒ Comparison with another known data mining model
   o Classification & Regression Trees (CART);

The approach of project

⇒ Literature reviews will be based on articles and reviews published in internationally recognized journals;
⇒ The protein expression dataset will be obtained from the workplace with the supervisor's agreement;
⇒ The artificial neural network (a feed-forward backpropagation network) will be built using MATLAB 7.0 Release 14 (it will consist of ~100 inputs and 2-3 perceptron layers);
⇒ Validation will be carried out by statistical means such as receiver operating characteristic (ROC) curves;
⇒ A CART based classification model will be implemented for comparison with ANN on discriminative performance.

3 Liver Cancer Proteomics – a brief review

The poor survival of patients with HCC is related to a lack of reliable tools for early diagnosis. There is therefore an urgent need to discover and refine markers (or fingerprints) that are highly sensitive and specific, in order to promote better HCC detection, earlier intervention and successful treatment, and thus improve long-term outcomes.
Proteomics, or the study of protein expression, has recently gained much popularity because of its unique ability to delineate global changes in protein expression patterns that reflect the cellular events taking place in the transformation from the normal to the disease state. In HCC in particular, proteomics has been aimed at identifying changes in protein expression, structure, modifications and sub-cellular localization [19]. Some of these changes may indicate the formation of cancer, and they have therefore generated great interest in the discovery of novel markers for cancer detection. In the last few years, remarkable advances have been made in several proteomic technologies [20-22], which allow the proteome to be investigated in a more reliable and reproducible manner.

Proteomic approaches & current findings

Proteomics combines high-resolution protein separation, identification, data warehousing and data analysis techniques. To date, most studies have focused on the first two components, while the latter two are now starting to gain attention, as the large amount of data being generated has increased the need for better management and more efficient use of data.

Nowadays, 2D gel electrophoresis followed by mass spectrometry (2D-E/MS), liquid chromatography-tandem mass spectrometry (LC-MS/MS), surface-enhanced laser desorption ionization (SELDI, also known as protein chip) and protein microarray are the four major technologies used in the study of proteomics. Chignard and Beretta [19] have written an excellent review of the proteomic technology platforms (Figure 3.1) and strategies of marker discovery (Figure 3.2) currently being adopted by research laboratories. Detailed descriptions of these platforms are beyond the scope of this report and are not included here.
Figure 3.1 Major Proteomics Technologies. The figure shows the 4 major proteomic technology platforms being used by most research laboratories: (A) two-dimensional gel electrophoresis (2D-E); (B) multidimensional protein identification technology; (C) surface-enhanced laser desorption ionization (SELDI); and (D) protein microarray. (Adopted from [19])

Figure 3.2 Chart for marker discovery and development using proteomics technologies. (Top) The different proteomic strategies used to date for hepatocellular carcinoma marker discovery. (Bottom) Subsequent steps toward test development and clinical validation. Abbreviations: SELDI, surface-enhanced laser desorption ionization; 2D-PAGE, 2-dimensional polyacrylamide gel electrophoresis; 2D-LC, 2-dimensional liquid chromatography; LCM, laser capture microdissection. (Adopted from [19])

Several groups have carried out proteomic studies of HCC in the last few years [9, 12, 13, 22-28]. Several proteins of different classes have been found to correlate with the pathogenesis of HCC, and some of them have now been added to the list of new candidate markers for the detection of HCC, pending clinical validation.

Data warehousing is a major issue in easing access to information in proteomics, because of the enormous amount of data generated from experiments. Cho et al. [11] in Korea built an HCC proteome database based on the PEDRo standard [29, 30] and made it available through the Internet (available: http://yprcpdb.proteomix.org/).
Data mining: an emerging discipline in HCC proteomics

With the large pool of data available from the HCC proteome, data mining is gaining popularity and is being used by several teams [27, 31, 32]. Common data mining techniques, including unsupervised clustering of proteins and classification analyses (such as CART, logistic regression and ANN), have recently been used to identify HCC-specific protein markers, or patterns of protein expression that can act as a fingerprint for the detection of HCC. Data mining techniques are being further optimized and advanced in order to establish more reliable, sensitive and specific detection methods for HCC, with the aim of improving the medical management of HCC and the quality of life of patients.

4 Artificial Neural Network and Uses in Proteomics

The artificial neural network (ANN) branches from artificial intelligence research in computer science. It tries to mimic the fault-tolerance and learning properties of the biological brain. ANNs can be regarded as multivariate nonlinear analytical tools, and are known to be good at recognizing patterns in noisy and complex data and at estimating their nonlinear relations. The DARPA Neural Network Study [33] defines an ANN as follows:

A neural network is a system composed of many simple processing elements operating in parallel whose function is determined by network structure, connection strengths, and the processing performed at computing elements or nodes.

The idea of the ANN was introduced in 1943 by McCulloch and Pitts [34], based on the knowledge of neurology at that time. Their models made several assumptions about how neurons worked. Their networks were based on simple neurons, which were considered to be binary devices with fixed thresholds.
The results of their model were simple logic functions such as "a or b" and "a and b". In the 1950s and 1960s, different ANN models emerged, such as the Perceptron developed by Rosenblatt [35] and ADALINE (ADAptive LInear Element) developed by Widrow and Hoff in 1960 [36]. A major dispute arose in 1969, when the idea was criticized with "...our intuitive judgment that the extension (to multilayer systems) is sterile" [37], and research on ANN was largely suspended. Nevertheless, ANN regained its momentum in the 1980s, and it has been widely used since the DARPA neural network study [33]. Nowadays, ANN is applied in a variety of areas, including aerospace, banking, finance, securities, speech and medicine [38].

The artificial neuron model

An ANN is composed of a network of artificial neurons. An artificial neuron is designed to mimic the function of a neuron in the biological brain. The artificial neuron model consists of inputs, weights, a bias and a transfer function. A typical artificial neuron (based on [38]) is depicted below:

Figure 4.1 The artificial neuron. (Adapted from [38])

The output of a neuron depends on the neuron's inputs and on its transfer function. In the neuron model above, the input p is transmitted through a connection that multiplies it by the weight w, and the bias b is added to the product wp. The transfer function f, which is usually a step function or a sigmoid function, takes the net input n = wp + b and produces the output a = f(n). There are various kinds of transfer function, including the hard-limit, linear and log-sigmoid transfer functions.

Figure 4.2 Transfer functions: Hard-limit (left); linear (middle); log-sigmoid (right).
(Adapted from [38])

ANN Architectures

ANNs can be classified into two major topologies: feed-forward neural networks and recurrent neural networks. In a feed-forward network, data flows from input to output in one direction only. In a recurrent network, by contrast, the output of a neuron layer may be fed back to a previous layer as input. Figure 4.3 shows the topology of a typical feed-forward network. Feed-forward networks cannot perform temporal (time) computation, and a recurrent network is required for temporal behaviour.

Figure 4.3 Topology of a feed-forward network.

Learning Rules

A learning (or training) rule is the procedure for modifying the weights and biases, i.e. w and b, of an ANN. Learning rules can be divided into two major categories, namely supervised and unsupervised. Both are widely used, depending on the nature of the problem.

In unsupervised learning, the weights and biases are modified in response to network inputs only, and there is no target output available. Most algorithms in this category perform clustering operations, which categorize the input patterns into a number of classes.

In supervised learning, the weights and biases of the ANN are adjusted by means of a training set, in which the inputs and the corresponding correct outputs are provided. As the inputs are applied to the network, the network outputs are compared with the targets, and adjustments are made in order to move the outputs closer to the targets. Supervised learning is also known as inductive learning, and it is usually used in classification tasks.

Backpropagation algorithm

Backpropagation is one of the supervised learning rules. It was created by generalizing the Widrow-Hoff learning rule [36] to multiple-layer networks and nonlinear differentiable transfer functions.
Backpropagation is a gradient descent algorithm in which the network weights are moved along the negative of the gradient of the performance function, which is calculated by comparing the network outputs with the corresponding targets. The term backpropagation refers to the manner in which the gradient is computed and fed back through the network in order to make adaptive changes to the weights and biases.

Applications of ANN in HCC proteomics

Several types of ANN are used in medicine. The major uses include classification analyses in medical diagnosis, as well as self-organizing maps in gene/protein clustering analyses in sample profiling studies [31, 32]. In this study, a multiple-layer perceptron (MLP) based feed-forward neural network model was implemented and trained using the backpropagation learning algorithm in order to distinguish HCC from normal tissues based on their proteomic profiles.

5 Implementation and training of ANN

Proteomic data structure & characteristics

Proteomic data format

According to the data warehouse specification, the data were stored in a database table using MS-Access as the DBMS. In order to retrieve the data from the database, the proteomic data were exported into a tab-delimited text file with the attributes shown in Table 5.1.

Table 5.1 An overview of the proteomic data file in tab-delimited format. The first column lists the protein spots; the remaining columns hold the spot intensities of the individual samples (in ppm).

Spot ID    Sample 1    Sample 2    …
000001     5.2         3971.9      …
000002     1512        12341.3     …
…          …           …           …

For each individual sample, the proteomic profile consisted of a list of spot intensities, and this list acted as a proteomic fingerprint for classification analysis. In this study, data of 132 samples with 1433 spots each were used in the implementation and evaluation of the ANN.
(Attached on the bundled CD, filename: proteome.xls)

Protein spot intensities & non-linearity

The intensity of each protein spot was measured according to the darkness of the silver stain on the gel. (Silver reacts with lysine residues, a type of amino acid residue that exists in almost every protein, and becomes dark brown in colour, as shown in Figure 5.1.) The darkness was quantified by a densitometer and the values were expressed in ppm (parts per million).

Figure 5.1 A sample image of a 2D-E gel. Proteins were separated by intrinsic charge and molecular weight and formed spots on the gel. In order to visualize the proteins, silver was used to stain the lysine residues to give a dark brown colour.

The intensity of a protein spot did not follow a linear relationship with the quantity of protein, and the distribution of spot intensities did not follow a normal distribution (Figure 5.2b). Highly skewed spot intensities might give rise to inaccurate statistical results: the mean and variance are distorted, for example, and even the non-parametric Wilcoxon test assumes symmetry in the outcome variable [39]. In order to prevent these inaccuracies, a log-transformation was applied prior to analysis, as recommended in [40].

Figure 5.2 Raw spot intensities versus natural log-transformation. The highly skewed distribution of raw spot intensities (left) and the effect of log-transformation (right). (Adapted from [40])

Before the data were transferred to the data warehouse, the spot intensities were normalized such that the total sum of intensities of the protein spots was the same among different samples, based on the fact that the total amount of protein profiled for each sample was the same (50µg protein per sample).
Normalization of the spots also removed the effects of inter-experimental fluctuations and differential staining.

Data pre-processing

Before the proteomic data were fed into the ANN, the following pre-processing procedures were carried out in order to reduce noise and artifacts and to transform the data into an appropriate range for input:

1. Log-transformation;
2. False positive signals removal;
3. Statistical filtering using Student's t-test;
4. Normalization of intensities into the range [0, 1];
5. Data conversion into MATLAB data format.

Log-transformation and Data Reduction

The proteomic data set was natural log-transformed in order to make the distribution of intensities more symmetric and linear, and the background was subsequently subtracted. In order to remove false positive signals, a filter was set such that only spots present in more than 70% of the samples in either the tumour or the non-tumour group were retained. Out of 1433 spots, 1203 were filtered out and 230 were used in subsequent analyses.

Further reduction was done by calculating the probability value (p) using Student's t-test. Only spots showing a statistically significant change, with a p-value smaller than 0.05, were retained. After the statistical filter, 92 spots remained. The transformation and data reduction procedures were carried out in Microsoft Excel (Microsoft Corp., Redmond, WA) and are included on the attached CD (Filename: proteome data.xls).

Hierarchical clustering

Unsupervised hierarchical clustering was performed to give a general overview of the proteomic patterns of the samples after data reduction. Using this technique, samples with similar proteomic profiles were grouped into clusters, and samples that did not group into the corresponding cluster were treated as outliers. Protein spots were likewise grouped according to the similarity of their expression among samples.
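The data-reduction filters described under "Log-transformation and Data Reduction" can be sketched as follows. This is only an illustration in Python, not the Excel workbook used in the study. It assumes that an absent spot is recorded as a zero or missing intensity, and the function names (log_transform, presence, passes_presence_filter, t_statistic) are hypothetical. The sketch stops at the t statistic; converting it to a p-value requires the t distribution from a statistics package.

```python
import math

def log_transform(intensity):
    """Natural log of a spot intensity; absent spots (None or <= 0) stay missing."""
    if intensity is None or intensity <= 0:
        return None
    return math.log(intensity)

def presence(values):
    """Fraction of samples in which the spot was detected (assumption: None = absent)."""
    return sum(v is not None for v in values) / len(values)

def passes_presence_filter(tumour, non_tumour, threshold=0.70):
    """Keep a spot that is present in more than `threshold` of either group."""
    return presence(tumour) > threshold or presence(non_tumour) > threshold

def t_statistic(a, b):
    """Student's two-sample t statistic with pooled variance (equal variances assumed)."""
    a = [v for v in a if v is not None]
    b = [v for v in b if v is not None]
    na, nb = len(a), len(b)
    mean_a, mean_b = sum(a) / na, sum(b) / nb
    ss_a = sum((v - mean_a) ** 2 for v in a)
    ss_b = sum((v - mean_b) ** 2 for v in b)
    pooled_var = (ss_a + ss_b) / (na + nb - 2)
    return (mean_a - mean_b) / math.sqrt(pooled_var * (1 / na + 1 / nb))

# Hypothetical intensities (ppm) of one spot in four tumour and four non-tumour samples.
tumour = [log_transform(x) for x in (100.0, 200.0, 0.0, 400.0)]
non_tumour = [log_transform(x) for x in (0.0, 0.0, 10.0, 0.0)]
keep_spot = passes_presence_filter(tumour, non_tumour)
```

In this toy example the spot is present in 75% of the tumour samples, so it passes the 70% presence filter even though it is mostly absent from the non-tumour group.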
Clustering of the proteomic data was performed using Cluster, developed by Eisen [41] (available: http://rana.lbl.gov). A 2-dimensional average linkage algorithm was used to generate clusters of samples and protein spots from the processed data, following the standard procedures stated in the developer's manual. Treeview, developed by the same group, was used to visualize the clusters, and the result is shown in Figure 5.3. Samples were divided into two major clusters, non-tumour (left cluster) and tumour (right cluster), and protein spots were also divided into two major clusters, up-regulated (upper cluster) and down-regulated (lower cluster). Outliers were samples that were not grouped in the corresponding cluster (highlighted in red), which indicates that their proteomic profiles were different from the others.

Figure 5.3 A dendrogram of average linkage clustering of protein spots (rows) and samples (columns). Protein expression levels are depicted on a scale from green (down-regulated) to red (up-regulated). Samples that were not grouped into the corresponding cluster are highlighted in red at the top. A high resolution image is available on the CD attached at the back of the report. (Filename: cluster.pdf)

Normalization

The spot intensities were normalized to the scale [0, 1] and converted to MATLAB format before being fed into the ANN. The files were saved on the attached CD (ann_data_set.mat).

Implementation of ANN

MATLAB 7.0 Release 14 (The Mathworks Inc., Natick, MA) was used for the implementation. The ANN was constructed using the "Neural Network Toolbox" bundled with the software. A screenshot of the neural network manager (nntool) is shown below:

Figure 5.4 A screenshot of neural network manager.
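The rescaling to [0, 1] mentioned under "Normalization" can be illustrated with a short sketch. The report does not state how the scaling was done, so min-max scaling of each spot across samples is an assumption here, and the function name min_max_scale is hypothetical.

```python
def min_max_scale(values):
    """Rescale a list of log-intensities linearly so that min -> 0 and max -> 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant spot carries no information; map it to 0 to avoid dividing by zero.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

# Hypothetical log-intensities of one spot across three samples.
scaled = min_max_scale([2.0, 4.0, 6.0])
```

Scaling each input to a common range keeps spots with large absolute intensities from dominating the weighted sums in the first network layer.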
Training & Optimization

Training and Simulation Datasets

The proteome data set was randomly divided into a training set and a simulation set, each consisting of 33 tumour and 33 non-tumour samples (66 samples per set, 132 samples in total). The training dataset was used for training the network, while the simulation dataset was used for evaluation of the network.

ANN Design & Optimizations

A total of 7 feed-forward backpropagation ANNs were designed, namely ANN_1 to ANN_7. ANN_1, ANN_2 and ANN_3 were 2-layered networks with an increasing number of neurons (10, 20 and 30 respectively) in the 1st layer, and were trained by the Levenberg-Marquardt (LM) algorithm, a fast variant of backpropagation based on numerical optimization. This method was chosen because it was the fastest training algorithm when compared with the others [38]. ANN_4 was similar to ANN_2, but used the classical gradient descent algorithm instead of LM. ANN_5, ANN_6 and ANN_7 were 3-layered networks with different numbers of neurons in the 1st and 2nd layers. The detailed network topologies and settings are shown in Figures 5.5 to 5.11 and Table 5.2 respectively.

The performance of the ANNs was evaluated using post-training linear regression and receiver operating characteristic (ROC) curve analyses.

Figure 5.5 Network topology of ANN_1.
Figure 5.6 Network Topology of ANN_2.
Figure 5.7 Network Topology of ANN_3.
Figure 5.8 Network Topology of ANN_4.
Figure 5.9 Network Topology of ANN_5.
Figure 5.10 Network Topology of ANN_6.
Figure 5.11 Network Topology of ANN_7.
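To make the topologies concrete, the following is a minimal sketch of the forward pass through a network with the ANN_6 layout given in Table 5.2 (92 inputs, 20 tansig neurons, 5 tansig neurons, 1 purelin output). It is written in Python, not with the MATLAB Neural Network Toolbox used in the study. The weights are random placeholders and not the trained values, and the helper names (layer, init_layer, forward) are hypothetical.

```python
import math
import random

def tansig(n):
    """Tan-sigmoid transfer function (hyperbolic tangent)."""
    return math.tanh(n)

def purelin(n):
    """Linear transfer function."""
    return n

def layer(inputs, weights, biases, transfer):
    """One layer of neurons: a_j = f(sum_i w_ji * p_i + b_j)."""
    return [transfer(sum(w * p for w, p in zip(row, inputs)) + b)
            for row, b in zip(weights, biases)]

def init_layer(n_in, n_out, rng):
    """Random placeholder weights and biases (illustration only, not trained)."""
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    biases = [rng.uniform(-0.5, 0.5) for _ in range(n_out)]
    return weights, biases

def forward(profile, layers):
    """Feed a profile through every layer in turn and return the single output."""
    a = profile
    for weights, biases, transfer in layers:
        a = layer(a, weights, biases, transfer)
    return a[0]

rng = random.Random(0)
sizes = [92, 20, 5, 1]                 # inputs, 1st layer, 2nd layer, output layer
transfers = [tansig, tansig, purelin]
ann_6 = [init_layer(n_in, n_out, rng) + (f,)
         for n_in, n_out, f in zip(sizes, sizes[1:], transfers)]

profile = [rng.random() for _ in range(92)]   # hypothetical normalized spot intensities
output = forward(profile, ann_6)
```

Training by backpropagation would adjust the entries of each weights and biases list so that the output moves towards the target value for each training sample.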
Table 5.2 Detailed settings for the ANNs used in this study. Tansig: tan-sigmoid transfer function; Purelin: linear transfer function; GDM: gradient descent backpropagation with momentum; LM: Levenberg-Marquardt algorithm.

                               ANN_1         ANN_2         ANN_3         ANN_4         ANN_5         ANN_6         ANN_7
No. of neuronal layers         2 (1 hidden)  2 (1 hidden)  2 (1 hidden)  2 (1 hidden)  3 (2 hidden)  3 (2 hidden)  3 (2 hidden)
No. of inputs                  92            92            92            92            92            92            92
No. of neurons in 1st layer    10 (Tansig)   20 (Tansig)   30 (Tansig)   20 (Tansig)   10 (Tansig)   20 (Tansig)   20 (Tansig)
No. of neurons in 2nd layer    1 (Purelin)   1 (Purelin)   1 (Purelin)   1 (Purelin)   5 (Tansig)    5 (Tansig)    5 (Tansig)
No. of neurons in 3rd layer    0             0             0             0             1 (Purelin)   1 (Purelin)   1 (Purelin)
Training function              LM            LM            LM            GDM           LM            LM            LM
Epoch                          100           100           100           10000         100           100           100
Performance goal               0             0             0             1e-30         0             0             0

6 Results

Training performance

In order to search for the best artificial neural network (ANN) classification model, 7 ANNs, namely ANN_1, ANN_2, ANN_3, ANN_4, ANN_5, ANN_6 and ANN_7, were constructed with increasing complexity. The training curves of the networks are shown in Figure 6.1.

Speed

All ANNs designed in this study converged and reached a performance of at least 1e-25, with the exception of ANN_4, for which the mean square error (MSE) was 0.0007 when the epoch limit was reached. Training time increased with network complexity: ANN_1 took the shortest time (around 10 s¹) and ANN_7 the longest (around 5 min¹).

Classification Performance

The learning outcomes of the ANN models are listed in Table 6.1. Linear regression analysis was performed between network outputs and targets of the training dataset.
The R-square values revealed that all ANNs built were able to distinguish non-tumour and tumour samples in the training dataset with high accuracy. In order to classify the outputs of the network into two discrete classes for the calculation of sensitivities and specificities, the ANN output p was passed to a hard-limit transfer function f, where p < 0 was defined as non-tumour (a = f(p) = -1) and p ≥ 0 was defined as tumour (a = f(p) = 1). All ANNs built achieved perfect discrimination between the two classes in the training set.

¹ Based on a UNIX time-sharing server with 24 x 1 GHz RISC processors and 32 GB of memory.

Figure 6.1 Training curves of a) ANN_1, b) ANN_2, c) ANN_3, d) ANN_4, e) ANN_5, f) ANN_6 and g) ANN_7.

Table 6.1 Learning sensitivities and specificities of ANNs built in this study. (No. of samples n = 33 for tumour and n = 33 for non-tumour)

           Sensitivity               Specificity
           Tumour     Non-tumour     Tumour     Non-tumour     R-square
ANN_1      100%       100%           100%       100%           1.000
ANN_2      100%       100%           100%       100%           1.000
ANN_3      100%       100%           100%       100%           1.000
ANN_4      100%       100%           100%       100%           0.995
ANN_5      100%       100%           100%       100%           1.000
ANN_6      100%       100%           100%       100%           1.000
ANN_7      100%       100%           100%       100%           1.000

Validation performance

Sensitivities and specificities

The sensitivities and specificities of the ANN models were examined using the simulation dataset, which was not included in the training dataset. As shown in Table 6.2, ANN_1, ANN_2, ANN_3, ANN_6 and ANN_7 discriminated between tumour and non-tumour with high sensitivities and specificities of greater than 70%.
In particular, the discrimination performance of ANN_6 was the highest, with sensitivities of 90.91% for tumour and 87.88% for non-tumour, and specificities of 88.24% for tumour and 90.63% for non-tumour. A detailed list of ANN outputs is given in Table a in the appendix.

Table 6.2 The sensitivities and specificities of ANNs built in this study. (n = 33 for tumour and n = 33 for non-tumour)

           Sensitivity               Specificity
           Tumour     Non-tumour     Tumour     Non-tumour
ANN_1      72.73%     84.85%         82.76%     75.68%
ANN_2      75.76%     84.85%         83.33%     77.78%
ANN_3      72.73%     81.82%         80.00%     75.00%
ANN_4      60.61%     87.88%         83.33%     69.05%
ANN_5      63.64%     87.88%         84.00%     70.73%
ANN_6      90.91%     87.88%         88.24%     90.63%
ANN_7      78.79%     90.91%         89.66%     81.08%

Statistical evaluation

The ANNs were further evaluated using the simulation dataset, and the simulation outputs were analyzed using two statistical methods: linear regression and receiver operating characteristic (ROC) curves.

Figure 6.2 Linear Regression of a) ANN_1, b) ANN_2, c) ANN_3, d) ANN_4, e) ANN_5, f) ANN_6 and g) ANN_7.

Figure 6.2 shows the linear regression between network output and target for the simulation set. All ANNs built showed a best linear fit with an R-square (shown as R in the figure) of over 0.5. ANN_6 had an R-square of 0.808, indicating that the network output correlated with the target with satisfactory accuracy.

In addition, ROC analysis was performed as another statistical means of measuring the discriminative performance of the ANNs constructed. Figure 6.3a shows the ROC curves of the 2-layered networks ANN_1 to ANN_4. An increase in the number of neurons in the hidden layer did not correlate with an improvement in performance as measured by the area under the ROC curve (AUC, where performance is best when the AUC approaches 1).
Instead, ANN_2, which has 20 neurons in the hidden layer, performed best among the 2-layered networks. For ANN_4, because training was incomplete (MSE = 0.0007) compared with the others, the classification performance was the worst (AUC = 0.832). For the 3-layered networks ANN_5 to ANN_7, the AUC was 0.846, 0.936 and 0.875 respectively (Figure 6.3b). Among the three networks, ANN_6 performed best, which was consistent with the other means of evaluation.

[Figure 6.3a: ROC plot of sensitivity against 1 - specificity, with reference line]
  ANN_1    AUC = 0.877    p = 1.4e-07
  ANN_2    AUC = 0.893    p = 4.2e-08
  ANN_3    AUC = 0.839    p = 2.2e-06
  ANN_4    AUC = 0.832    p = 3.6e-06

[Figure 6.3b: ROC plot of sensitivity against 1 - specificity, with reference line]
  ANN_5    AUC = 0.846    p = 1.4e-06
  ANN_6    AUC = 0.936    p = 1.2e-09
  ANN_7    AUC = 0.875    p = 1.6e-07

Figure 6.3 Receiver operating characteristic curves for a) ANN_1, ANN_2, ANN_3 and ANN_4, b) ANN_5, ANN_6 and ANN_7. The area under curve and the significance are listed in the legend.

7 Comparison with Classification and Regression Trees Algorithm

Apart from the artificial neural network based classification algorithm, another statistical algorithm, namely classification and regression trees (CART), was also implemented in this study, and the discriminative performance of the two algorithms was compared and evaluated.

Implementation of CART

Data pre-processing

Data pre-processing for CART was different from that for ANN. In the case of ANN, spots were filtered based on their presence in the samples, and only spots showing statistically significant changes in level were used as network inputs. For the CART algorithm, such filters may weaken the performance because of its recursive partitioning nature.
Therefore, the only pre-treatment was log-transformation and no filter was used in the implementation of CART, i.e. all 1433 spots were used for the construction of the CART based classification model.

CART implementation

CART was implemented using Biomarker Pattern Software (BPS) Version 4.0 (Ciphergen Biosystems Inc., Fremont, CA). The GINI impurity criterion was used for splitting, and estimation was performed using 10-fold cross-validation. The same training and simulation sets were used as in the ANN implementation.

The classification tree

Based on CART, a number of trees were constructed; they are listed in Table 7.1. The maximal tree was tree number 1, which had 5 terminal nodes and the lowest cross-validated cost (best performance). Tree number 2 achieved the same cost with one fewer terminal node; being the simpler tree, it was chosen as the optimal tree and used for further analyses.

Table 7.1 Classification trees constructed by BPS. The optimal tree with the lowest misclassification cost was taken for analysis.

Tree Number    Terminal Nodes    Cross-Validated Cost
1              5                 0.273 ± 0.084
2**            4                 0.273 ± 0.084
3              3                 0.364 ± 0.095
4              2                 0.333 ± 0.091
5              1                 1.000 ± 0.000
** Optimal

The details of the optimal decision tree are shown in Figure 7.1. Classification was done recursively from each parent node to its corresponding child nodes. In this study, the classification tree consisted of 3 classifier nodes, namely SSP0026, SSP2201 and SSP3102. For every sample, the classification scheme determined the route using the natural log-transformed intensities of the classifier spots in a hierarchical manner and assigned a final class outcome (tumour or non-tumour) in one of the four terminal nodes.
The classification rules are listed below:

For samples going to terminal node 1 (non-tumour): SSP0026 ≤ 9.072 and SSP2201 ≤ 3.899
For samples going to terminal node 2 (tumour): SSP0026 ≤ 9.072 and SSP2201 > 3.899
For samples going to terminal node 3 (tumour): SSP0026 > 9.072 and SSP3102 ≤ 6.928
For samples going to terminal node 4 (non-tumour): SSP0026 > 9.072 and SSP3102 > 6.928

[Figure 7.1: tree diagram. The root node tests SSP0026 > 9.072; the "No" branch leads to node 2, which tests SSP2201 > 3.899 and ends in terminal nodes 1 and 2; the "Yes" branch leads to node 3, which tests SSP3102 > 6.928 and ends in terminal nodes 3 and 4. Each node is annotated with its non-tumour (NT) and tumour (T) sample counts.]

Figure 7.1 The optimal classification tree constructed by CART. The decision tree was constructed using the GINI splitting criterion (minimal misclassification cost) and 10-fold cross-validation. Each sample is evaluated against the classifier criterion at each node and follows the tree path to a terminal node, where a terminal class, tumour (T) or non-tumour (NT), is assigned to the sample.

Training performance

Speed

The induction of the classification model using CART was very efficient: 5 candidate trees were constructed by the algorithm in about 10 s².

Classification of training set and cross-validation estimation

The sensitivities and specificities of CART for the training set are listed in Table 7.2. The tree correctly identified 96.97% of the samples in the training set. Table 7.3 shows the estimated performance of CART based on the 10-fold cross-validation test. The predicted sensitivity and specificity of the model are 86.37% and 86.65% respectively. The output of CART is listed in Table a in the appendix.
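The four classification rules of the optimal tree translate directly into nested comparisons. The sketch below is an illustration in Python and not the BPS implementation. The function name classify and its return format are hypothetical, while the thresholds are those listed for terminal nodes 1 to 4.

```python
def classify(ssp0026, ssp2201, ssp3102):
    """Return (terminal node, class) from the natural log-transformed
    intensities of the three classifier spots."""
    if ssp0026 <= 9.072:
        # Left branch: the decision is made on SSP2201.
        if ssp2201 <= 3.899:
            return 1, "non-tumour"
        return 2, "tumour"
    # Right branch: the decision is made on SSP3102.
    if ssp3102 <= 6.928:
        return 3, "tumour"
    return 4, "non-tumour"

# Hypothetical sample: high SSP0026, low SSP3102.
node, label = classify(ssp0026=9.5, ssp2201=4.2, ssp3102=6.1)
```

Note that only two of the three spots are ever consulted for a given sample: SSP2201 is ignored whenever SSP0026 exceeds 9.072, and SSP3102 is ignored otherwise.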
² Based on a Windows PC with 1 x 2 GHz processor and 1 GB of memory, 10-fold cross-validation.

Table 7.2 Training sensitivities and specificities of CART. (n = 33 for tumour and n = 33 for non-tumour)

Training Performance (training set)
               Tumour     Non-tumour
Sensitivity    96.97%     96.97%
Specificity    96.97%     96.97%

Table 7.3 Estimated sensitivities and specificities of CART. (n = 33 for tumour and n = 33 for non-tumour)

Estimated Performance (10-CV)
               Tumour     Non-tumour
Sensitivity    90.91%     81.82%
Specificity    83.33%     90.00%

Validation performance

Like the ANN models, the CART model was subjected to a validation test using the same simulation dataset as the ANN, and its classification performance was evaluated. Out of the 33 tumour samples in the dataset, CART correctly identified 27, and 6 were misclassified as non-tumour; 26 non-tumour samples were classified accurately, with 7 misclassified as tumour. Table 7.4 lists the sensitivities and specificities of CART. The model achieved an overall sensitivity of 80.31% and specificity of 80.33%, which was lower than predicted.

Table 7.4 The sensitivities and specificities of CART built in this study. (n = 33 for tumour and n = 33 for non-tumour)

Validation Performance (validation set)
               Tumour     Non-tumour
Sensitivity    81.82%     78.79%
Specificity    79.41%     81.25%

Comparison of ANN and CART algorithm

Training Performance

The CART algorithm was more efficient than ANN: it took about 10 s to construct the candidate trees listed in Table 7.1, whereas it took minutes to construct an ANN with 2 hidden layers and one output layer.
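The entries of Table 7.4 can be reproduced from the four counts reported in the validation paragraph. The sketch below is a worked check in Python. It rests on an inference from the numbers and not on a definition stated in the text: the per-class "specificity" figures match the fraction of samples assigned to a class that truly belong to it (more commonly called the predictive value). The variable names are hypothetical.

```python
# Counts on the simulation set, as reported for CART.
tumour_correct = 27        # tumour samples classified as tumour
tumour_missed = 6          # tumour samples classified as non-tumour
non_tumour_correct = 26    # non-tumour samples classified as non-tumour
non_tumour_missed = 7      # non-tumour samples classified as tumour

def percent(numerator, denominator):
    """Ratio expressed as a percentage rounded to two decimals."""
    return round(100.0 * numerator / denominator, 2)

# Sensitivity: correct calls divided by the true members of the class.
sens_tumour = percent(tumour_correct, tumour_correct + tumour_missed)
sens_non_tumour = percent(non_tumour_correct, non_tumour_correct + non_tumour_missed)

# "Specificity" as it appears in Table 7.4 (assumed reading): correct calls
# divided by all samples assigned to that class.
spec_tumour = percent(tumour_correct, tumour_correct + non_tumour_missed)
spec_non_tumour = percent(non_tumour_correct, non_tumour_correct + tumour_missed)
```

Under this reading the four values come out as 81.82%, 78.79%, 79.41% and 81.25%, in agreement with Table 7.4.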
Classification Performance

ROC curve analysis was performed in order to evaluate the classification performance of ANN and CART. Figure 7.2 shows the ROC curves of ANN_6, ANN_6 after the hard-limit transfer function, and CART, and Table 7.5 lists the corresponding AUC values of the three models. ANN_6 achieved the highest AUC of 0.936, while CART achieved only 0.803. Since CART classifies samples in a discrete manner, a hard-limit function with a cutoff value of 0 was applied to the outputs of ANN_6 in order to simulate the discrete output of CART. After the hard-limit transformation, Hardlim(ANN_6) achieved an AUC of 0.894, which was still higher than that of CART.

[Figure 7.2: ROC plot of sensitivity against 1 - specificity for CART, ANN_6 and Hardlim(ANN_6), with reference line. Diagonal segments are produced by ties.]

Figure 7.2 Receiver operating characteristic curves for ANN_6, ANN_6 + hard-limit transformation and CART.

Table 7.5 The area under curve of ANN_6, ANN_6 + hard-limit transformation and CART.

Model              AUC
CART               0.803
ANN_6              0.936
Hardlim(ANN_6)     0.894

8 Discussion

The artificial neural network, along with logistic regression and classification and regression trees (CART), has been reported as one of the most efficient data mining techniques in clinical practice. In this study, an attempt was made to construct a classification model from 66 sets of tumour and non-tumour 2D-E proteomic profiles (a total of 132 samples) using an artificial neural network. The resulting model successfully classified liver samples as tumour or non-tumour with a sensitivity of 89.4% and a specificity of 89.4%. In addition, the model was statistically supported by an ROC curve with an AUC of 0.936.
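The AUC values reported for the two discrete classifiers, Hardlim(ANN_6) and CART, can be checked by hand. A classifier with a single cutoff contributes only one point to the ROC curve, so the area under the segments from (0, 0) through that point to (1, 1) reduces to the mean of the true positive rate and the true negative rate. The sketch below works this out in Python. The counts 30/33 and 29/33 for ANN_6 are derived here from the 90.91% and 87.88% sensitivities in Table 6.2 and are not stated in the text, and the function names are hypothetical.

```python
def trapezoid_auc(points):
    """Area under a piecewise-linear ROC curve given as (1 - specificity, sensitivity) points."""
    pts = sorted(points)
    area = 0.0
    for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area

def single_cutoff_auc(true_positive_rate, true_negative_rate):
    """ROC area of a classifier that yields a single operating point."""
    false_positive_rate = 1.0 - true_negative_rate
    return trapezoid_auc([(0.0, 0.0),
                          (false_positive_rate, true_positive_rate),
                          (1.0, 1.0)])

# Hardlim(ANN_6): 30 of 33 tumour and 29 of 33 non-tumour samples correct (derived counts).
auc_hardlim_ann_6 = single_cutoff_auc(30 / 33, 29 / 33)

# CART: 27 of 33 tumour and 26 of 33 non-tumour samples correct.
auc_cart = single_cutoff_auc(27 / 33, 26 / 33)
```

Rounded to three decimals these give 0.894 and 0.803, matching Table 7.5. The gap between 0.936 and 0.894 for ANN_6 is therefore the information lost by collapsing its continuous output to a single cutoff.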
In addition to the ANN model, a classification model based on CART was also constructed for comparison and evaluation. The discriminative performance of CART was lower than that of ANN by 9.1 percentage points in both sensitivity and specificity. However, in terms of training time and ease of use, CART may outperform ANN.

Applicability of ANN based classification algorithm in HCC proteomics

As mentioned earlier in the report, ANN has been used extensively in clinical practice, particularly in the areas of medical diagnosis and profiling studies. In this study, the implementation of ANN in the classification of tumour and non-tumour is one of the first attempts at applying data mining technologies to proteomic profiles. The resulting model, which correctly segregates most of the samples into their corresponding groups, may indicate the possibility of applying a similar technique in other cancer related proteomic studies, such as cancer prognosis.

Nevertheless, ANN is not without problems. Firstly, in terms of disease marker discovery, ANN utilizes a set of variables and develops a classification model in a "black box" manner, without a clear logical account of how the variables are used. If one is interested in the potential markers and not in the classification model, ANN may not be a good choice in contrast to CART and logistic regression. Secondly, the training of ANN is comparatively expensive in terms of time and space, especially when dealing with a large number of inputs and complex network topologies.

ANN & CART: Which is better?

Both ANN and CART are commonly used to perform classification tasks, especially for noisy and non-linear datasets. In this study, both classification models performed well in distinguishing tumour from non-tumour. Table 8.1 lists the major advantages and disadvantages of ANN and CART in terms of classification analysis.
The two algorithms take different approaches to inductive learning and therefore classify in different ways. An ANN uses all of its inputs to construct the classification model, which it learns by adjusting its weights and biases. CART, on the other hand, exhaustively searches for a hierarchical set of classifiers and constructs a decision tree for classification. Both methods offer unique advantages and at the same time suffer from intrinsic weaknesses. Therefore, neither method can replace the other; the two are not mutually exclusive but complementary.

Table 8.1 Major advantages and disadvantages of ANN and CART

Algorithm
ANN: Learns and develops the classification model through associative learning in a biological neuron-like manner.
CART: Searches for a hierarchical set of classifiers and constructs a classification tree in a recursive bipartitioning manner.

Advantages
ANN: Good performance; robust to noise and non-linearity; classification based on the whole set of inputs; needs fewer samples than CART to build a good classification model.
CART: Good performance; robust to noise and non-linearity; easy to implement; training is relatively fast; the decision tree is easy to understand; further investigation of the classifiers is possible.

Disadvantages
ANN: "Black box" like operation, not recommended for biomarker discovery; computationally more complex than CART; decision making is not understandable.
CART: Classification based on only a few major classifiers; the hierarchical nature of the classifiers may result in an overwhelming effect of the variables in parent nodes; a large number of samples is often needed to build a tree of satisfactory performance; the composition of the classification tree may change completely with changes in the training data.

Future Directions

The 2D-E proteomic profiling of hepatocellular carcinoma has generated a large pool of global protein
expression data, which is rich in information on the biology of cancer. ANN and CART algorithms were used to build preliminary classification models for tumour/non-tumour differentiation based on the hidden patterns in the proteomic dataset. With an increasing number of samples, both models may perform better than reported here. In addition, it is worthwhile to apply similar algorithms in more sophisticated studies, such as cancer recurrence and tumour staging, in order to help combat HCC.

9 Conclusion

In conclusion, both the ANN and CART models showed good predictive ability in differentiating between tumour and non-tumour tissues based on their 2D-E proteomic profiles, with accuracies of over 80%. The two classification models are broadly similar in discriminative performance and show only a small difference under statistical evaluation. More importantly, the two models are not mutually exclusive and are in fact complementary to each other. Given these encouraging results, it is worthwhile to investigate the use of ANN, CART and other data mining algorithms in more sophisticated studies, such as cancer recurrence, in order to explore their potential and application in the management of HCC.

Appendix

Outputs of different classification models constructed in this study

Table a.
Validation output of ANN and CART models constructed in this study (-1 = non-tumour, 1 = tumour)

Target  ANN_1  ANN_2  ANN_3  ANN_4  ANN_5  ANN_6  ANN_7  Hardlim(ANN_1)  Hardlim(ANN_2)  Hardlim(ANN_3)  Hardlim(ANN_4)  Hardlim(ANN_5)  Hardlim(ANN_6)  Hardlim(ANN_7)  CART
-1  -1.58  -0.97   0.30  -1.78  -0.99  -0.78  -1.18  -1  -1   1  -1  -1  -1  -1  -1
-1  -1.56  -2.52  -1.28  -1.27  -1.00  -1.02  -1.39  -1  -1  -1  -1  -1  -1  -1  -1
-1  -1.44  -0.38  -1.50  -1.08  -0.69  -1.03  -1.09  -1  -1  -1  -1  -1  -1  -1  -1
-1  -1.42  -0.96  -2.82  -1.39  -0.65  -1.04  -0.52  -1  -1  -1  -1  -1  -1  -1  -1
-1  -1.37  -1.37  -1.05  -1.15  -1.38  -1.01  -1.05  -1  -1  -1  -1  -1  -1  -1  -1
-1  -1.31  -1.54  -1.23  -1.04  -1.24  -0.93  -1.45  -1  -1  -1  -1  -1  -1  -1  -1
-1  -1.23  -1.41  -1.29  -1.22  -1.17  -1.02  -1.16  -1  -1  -1  -1  -1  -1  -1  -1
-1  -1.16  -0.57  -1.06  -0.79  -1.02  -0.90  -1.05  -1  -1  -1  -1  -1  -1  -1   1
-1  -1.14  -0.93  -1.04  -1.47  -1.19  -1.02  -1.62  -1  -1  -1  -1  -1  -1  -1  -1
-1  -1.12  -1.28  -0.35  -1.30  -0.86  -1.01  -0.52  -1  -1  -1  -1  -1  -1  -1  -1
-1  -1.07  -0.35  -1.02  -1.24  -1.35  -0.99  -1.18  -1  -1  -1  -1  -1  -1  -1  -1
-1  -1.05  -0.74  -1.02  -1.00  -1.12  -0.99  -1.60  -1  -1  -1  -1  -1  -1  -1   1
-1  -1.04  -1.98  -1.21  -1.28  -1.09  -0.99  -0.58  -1  -1  -1  -1  -1  -1  -1  -1
-1  -1.03  -0.86  -1.26  -0.94  -1.29  -1.01  -1.16  -1  -1  -1  -1  -1  -1  -1  -1
-1  -0.98  -1.75  -0.89  -0.35  -1.00  -1.00  -2.14  -1  -1  -1  -1  -1  -1  -1   1
-1  -0.96  -0.32  -0.86  -0.86  -1.00   0.93   1.13  -1  -1  -1  -1  -1   1   1  -1
-1  -0.96  -0.01  -0.66  -0.60   0.05  -0.98  -1.40  -1  -1  -1  -1   1  -1  -1   1
-1  -0.95  -1.79  -1.96  -1.16  -1.46  -1.03  -0.71  -1  -1  -1  -1  -1  -1  -1  -1
-1  -0.93  -1.42  -1.25  -1.29  -0.86  -0.73   0.73  -1  -1  -1  -1  -1  -1   1  -1
-1  -0.91  -1.47  -0.42  -0.95  -1.23  -1.04  -0.99  -1  -1  -1  -1  -1  -1  -1   1
-1  -0.86  -1.13  -1.18  -1.55  -1.77  -0.79  -0.16  -1  -1  -1  -1  -1  -1  -1  -1
-1  -0.85  -1.88  -0.13  -2.63  -1.01  -1.00  -0.61  -1  -1  -1  -1  -1  -1  -1   1
-1  -0.83  -0.75  -1.11  -0.77  -1.41  -1.00  -0.73  -1  -1  -1  -1  -1  -1  -1  -1
-1  -0.76  -1.16  -0.06  -1.23  -0.95  -0.71  -1.04  -1  -1  -1  -1  -1  -1  -1  -1
-1  -0.69  -0.45  -0.73  -0.73  -0.56  -0.76  -0.59  -1  -1  -1  -1  -1  -1  -1   1
-1  -0.62  -0.95  -0.73  -1.31  -1.55  -0.60  -0.20  -1  -1  -1  -1  -1  -1  -1  -1
-1  -0.56  -0.85  -0.14  -1.45  -0.94  -0.89  -1.38  -1  -1  -1  -1  -1  -1  -1  -1
-1  -0.01   0.57  -0.18  -0.03  -0.05  -0.83  -0.99  -1   1  -1  -1  -1  -1  -1  -1
-1   0.07  -2.47   0.70  -0.44  -0.59  -1.02  -0.88   1  -1   1  -1  -1  -1  -1  -1
-1   0.55   0.53   0.11   0.73   0.15   1.00  -0.54   1   1   1   1   1   1  -1  -1
-1   0.85   0.27   1.17   0.26   0.06   1.00  -0.55   1   1   1   1   1   1  -1  -1
-1   1.01   1.23   1.29   0.89   1.09   1.00   1.13   1   1   1   1   1   1   1  -1
-1   1.16   0.78   1.04   0.23  -0.41  -0.88  -0.34   1   1   1   1  -1  -1  -1  -1
 1  -0.26  -0.74  -0.20  -0.04  -0.92  -0.02  -0.64  -1  -1  -1  -1  -1  -1  -1   1
 1   2.19   2.27   0.34   2.27   1.21   0.98   0.40   1   1   1   1   1   1   1   1
 1   1.34   1.33   1.36   0.12   3.06   1.00   1.15   1   1   1   1   1   1   1   1
 1   0.08  -0.24  -0.29  -1.45   0.43   1.00   0.62   1  -1  -1  -1   1   1   1   1
 1  -0.41   0.67  -1.60  -1.22  -0.97   1.00  -0.29  -1   1  -1  -1  -1   1  -1   1
 1   1.56   1.43   0.95   1.76   3.58   1.00   1.71   1   1   1   1   1   1   1   1
 1  -0.05   1.02  -0.64  -0.33   2.17   0.99  -0.56  -1   1  -1  -1   1   1  -1   1
 1  -0.98  -0.36   0.28  -0.10  -0.59   1.00  -0.16  -1  -1   1  -1  -1   1  -1   1
 1   1.96   1.61   0.66   0.92   3.60   1.00   0.42   1   1   1   1   1   1   1   1
 1   0.20  -0.47  -0.11  -1.41  -1.16   0.96   0.42   1  -1  -1  -1  -1   1   1   1
 1   1.95   3.03   1.41   2.68   2.34   1.00   1.21   1   1   1   1   1   1   1   1
 1  -1.24  -0.31  -0.06  -0.92  -1.02  -0.55   0.67  -1  -1  -1  -1  -1  -1   1   1
 1   0.25   0.89   2.21   0.89   1.23   1.00   0.25   1   1   1   1   1   1   1   1
 1   1.43   1.13   1.42   2.47   1.06   1.00   1.09   1   1   1   1   1   1   1  -1
 1   1.47   1.88   0.72   1.86   0.95   1.00   0.47   1   1   1   1   1   1   1   1
 1   1.03  -0.57   0.39  -0.34  -0.03   1.00   1.21   1  -1   1  -1  -1   1   1  -1
 1   1.02   1.04   1.17   1.36   0.72   1.00   1.11   1   1   1   1   1   1   1   1
 1   1.67   0.93   1.87   0.40   1.97   1.00   1.27   1   1   1   1   1   1   1   1
 1   2.47   1.59   1.17   1.52   1.13   1.00   0.89   1   1   1   1   1   1   1  -1
 1   1.27   2.11   1.06   2.21   0.80   1.00   1.29   1   1   1   1   1   1   1   1
 1  -0.94  -1.77  -1.67  -0.92  -1.31   0.59  -1.12  -1  -1  -1  -1  -1   1  -1  -1
 1  -0.97   0.04   0.77  -0.50  -1.05   0.20  -1.44  -1   1   1  -1  -1   1  -1   1

1 -0.38 0.14 1.23 -1.41 -0.67 1 0.05 1 1.03 1 1 1 1 1.00 -0.02 -1 1 1.25 1.15 1.60 -0.45 1.00 0.95 1 1 1.40 -0.09 1.00 0.84 1.00 0.15 1 1 1.60 2.32 0.19 0.39 1.76 1.00 0.89 1 1 1 3.09 0.62 2.04 0.83 -0.71 1.00 0.98 1 1 1 -0.10 1.10 -0.16 -0.08 0.94 -0.94 0.35 -1 1 -1 1.55 0.23 1.25 0.68 0.76 1.00 2.25 1 1 1 1 1.60 1.21 0.72 2.06 0.76 1.00 0.50 1 1 1 1.22 0.23 1.55 1.85 0.78 1.00 0.94 1 1 0.91 1.37 1.01 0.64 1.12 1.00 0.53 1 1 0.18 -0.09 0.29 -0.65 -0.47 1.00 0.67 1 1 -1 -1 1 -1 1 1 1 -1 1 1 1 -1 1 1 1 1 1 1 1 1 1 1 1 -1 1 1 -1 -1 1 -1 1 1 1 1 1 1 -1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 -1 1 -1 -1 1 1 1

Companion CD

A CD is attached that includes this report in PDF format, the proteomic data, the data for MATLAB and the PDF file of the clustering analysis.

CD Content:
1. dissertation.pdf (dissertation in PDF format)
2. proteomic data.xls (proteomic data and transformation in Excel format)
3. ann_data_set.mat (data used for training and simulation of the artificial neural network)
4. cluster.pdf (high-resolution dendrogram of average linkage clustering of samples and proteins in PDF format)