BCM57810 HW Failure Analysis Using Machine Learning Algorithms
Tel Aviv University, School of Computer Sciences
Machine Learning Course – Prof. Yishay Mansour
Final Project by Ofir Hermesh, 3/3/2013
Data collection in cooperation with Broadcom Israel Research ©

1. Abstract

This work describes an experiment in applying machine learning algorithms to a set of HW failure reports from an ASIC device designed at Broadcom Israel Research Labs. The experiment was conducted as a final project for the Machine Learning course given by Prof. Yishay Mansour at Tel Aviv University. The document describes the problems of analyzing the HW failure reports and the steps used to build the data set. It then describes the general strategy for learning this data, the coding practices used, and the usage and performance of each classification algorithm. Finally, it compares the algorithms and analyzes the results. The algorithms used for classification are Naïve Bayes, KNN, SVM and AdaBoost.

2. Background and Motivation

In Broadcom Israel Research Labs we are developing a 10GbE Converged Network Controller for the server market (BCM57810), supporting various networking and storage protocols (Ethernet, TCP, iSCSI, FCoE, RDMA). The device is installed in millions of servers all over the world, and when one of them hits a HW failure, the user collects a post-mortem report using diagnostic software. This report is referred to as an "Idle Check", as it checks for "non-idle" conditions in the HW blocks, such as non-empty queues or pending interrupts. There are ~400 kinds of errors which can be detected by the diagnostic software.

When we receive such an idle check report, its analysis is not trivial. The BCM57810 is an integrated device, with several processors, HW blocks and memories which are all interconnected. In many cases it is difficult to analyze the idle check report, and only skilled engineers who work on the HW core are able to do so. However, we would like field engineers to also be able to extract some meaningful analysis from the idle check report. We tried several approaches for automating the idle check analysis, all based on classical algorithms and classification methods. None of them has been satisfactory so far. In this project I tried to apply machine learning algorithms to a data set of idle check reports from previous failures, which were classified manually.

3. Creating the Data Set

The first stage of the work was creating a data set which would be digestible for machine learning algorithms. In our logging folder I found 430 idle check files from various cases at our labs and at customer premises. Appendix A shows an example of a raw idle check file; this is a simple text file listing the non-idle conditions detected by the diagnostic software.

Figure 1 describes the BCM57810 device block diagram. I decided to classify the examples according to the cause of the failure in the main 5 blocks of the device:

1. RFE – Network Interface: queues, buffers and classifiers for ingress and egress traffic to the network.
2. RX – Receiver: HW and processors which perform the packet processing on the ingress path.
3. TX – Transmitter: HW and processors which perform the packet processing on the egress path.
4. PXP – PCI Interface: HW which manages the PCI protocol and credits for all traffic.
5. IGU – Interrupt Generation Unit: handles interrupt generation towards the server CPU.

The complicated part of the analysis is that in some cases a failure in one block causes a non-idle condition in another. For example, an ingress packet error in the RFE can cause a failure at the RX processing stage, which will in turn cause a non-idle condition in one of its sub-blocks.

[Figure 1 – BCM57810 Block Diagram: the RFE Network Interface sits between the Ethernet interface and the RX (Receiver) and TX (Transmitter) blocks; the PXP PCI Express Interface connects to the PCI bus; the IGU is the Interrupt Generation Unit.]

I applied the following stages in order to prepare the data set, assisted by simple Perl scripts:

1. I accessed the failures database of the device and scanned for idle check files with detected non-idle conditions (some of the files were generated on systems which were working properly). This resulted in ~280 files.
2. The files were divided among several engineers, who manually analyzed them and classified each one to one of the 5 labels. Some files which were difficult to classify were put aside.
3. For the files put aside, I reviewed the history of the issue in the bug tracking software, understood which issue was eventually discovered, and classified accordingly.
4. I converted the whole data set into a CSV file, with a single bit indicating the existence of each idle check condition. This resulted in a file with 280 examples, each one with 400 attributes.
5. I merged similar features. For example, errors such as "buffer is full with 5 elements" and "buffer is full with 3 elements" were considered the same.
6. After this merging, many data points became identical. I verified that their classifications were the same and removed the duplicates.

The final data set, stored in idlechk.csv, includes 120 examples with 119 binary features. The examples are classified into 5 different labels (1-5 as described in the list above). The goal of the project is to train an algorithm to deduce the label of a new example.
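As a minimal sketch, the data set can be loaded in Matlab as follows (assuming the label is stored in the last column of idlechk.csv; the actual column layout may differ):

    % Load the idle check data set (sketch; assumes the label is the last column).
    data = csvread('idlechk.csv');   % 120 examples, 119 binary features + label
    X = data(:, 1:end-1);            % binary feature matrix (0/1 entries)
    y = data(:, end);                % labels 1..5 (RFE, RX, TX, PXP, IGU)

The snippets in the following sections assume X and y are loaded this way.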
4. Guidelines for Learning the Data Set

There are several difficulties in learning this specific data set:

1. The number of examples is low. At the beginning I was hoping to have ~400 examples, but during the processing stage I was left with only 120.
2. The number of features is high, though I was able to reduce it to 119 features.
3. There are 5 different labels, which makes the algorithms more complicated. Also, the number of examples per label is small. I tried to reduce the number of labels (by uniting some of them), but this turned out to have a bad impact on the results.

In order to overcome these difficulties I used two methods:

1. 10-fold cross validation allows evaluating a learner on a small data set by cutting the data set into 10 equal pieces, training the learner on 90% of the data set and then testing it on the remaining 10%. By averaging the results of the learner over the 10 pieces we can evaluate its quality even though we have a small amount of data.
2. Feature selection is used to reduce the number of features. This process adds features iteratively until adding a feature no longer improves the accuracy of the learner.

The fact that the features are all binary makes the classification process easier, as it is easy to put boundaries between the values of each attribute with good margins. While applying the algorithms, I used several techniques suited for binary features, described later.

5. General Classification Scheme

For the usage of cross validation and feature selection I applied a general classification scheme, which was used three times: with the Naïve Bayes, KNN and SVM algorithms. The scheme is described in Figure 2 and follows these steps:

1. Run the classification algorithm on the entire data set, and calculate the misclassification rate on the entire set. This error is marked as Err1.
2. Run a 10-fold cross validation on the entire data set. This cuts the set into 10 equal pieces and runs the classification algorithm 10 times. The misclassification rate is the average over the 10 pieces, and is marked as Err2.
3. Run backward feature selection with 10-fold cross validation. This results in a set of selected features, FS_B. Create a data set including only these features.
4. Run forward feature selection with 10-fold cross validation. This results in a set of selected features, FS_F. Create a data set including only these features.
5. Repeat steps (1) and (2) for the samples with the FS_B features only. The misclassification rate on the whole set is Err3; the misclassification rate with cross validation is Err4.
6. Repeat steps (1) and (2) for the samples with the FS_F features only. The misclassification rate on the whole set is Err5; the misclassification rate with cross validation is Err6.

After applying steps 1-6 to the three algorithms, we create the united set of features selected by the 6 runs of feature selection:

NS = FS_F(ALG1) ∪ FS_B(ALG1) ∪ FS_F(ALG2) ∪ FS_B(ALG2) ∪ FS_F(ALG3) ∪ FS_B(ALG3)

We then run the next steps:

7. Run the classification algorithm on the entire data set with the NS features. The misclassification rate is marked as Err7.
8. Run a 10-fold cross validation on the entire data set with the NS features. The misclassification rate is marked as Err8.

[Figure 2 – General Classification Scheme: the full data set (M examples, N features) is fed to the classification algorithm directly (Err1) and through 10-fold cross validation (Err2); backward and forward feature selection, each with 10-fold cross validation, produce the FS_B and FS_F feature sets, which are evaluated the same way (Err3/Err4 and Err5/Err6); finally the united feature set NS is evaluated on the whole set (Err7) and with cross validation (Err8).]

6. Programming Practice

I implemented the classification scheme and ran the tests in Matlab, using the Statistics Toolbox, which provides many useful functions for machine learning.

Feature selection is implemented using the "sequentialfs" function. This function selects a subset of features from a data matrix that best predicts the data labels, by sequentially selecting features until there is no improvement in prediction. Starting from an empty feature set, sequentialfs creates candidate feature subsets by sequentially adding each of the features not yet selected. For each candidate feature subset, sequentialfs performs 10-fold cross-validation by repeatedly calling the classification algorithm with different training subsets. When working in "backward" mode it starts with the full feature set and removes features sequentially.

Cross validation is implemented using the "crossval" function with MCR (misclassification rate) loss. The function performs a 10-fold cross validation estimate of the misclassification rate of the classification algorithm.

The functions used for each classification algorithm are described in the following sections, together with the results.
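As a minimal sketch of these building blocks (a KNN learner is used here as a placeholder criterion; the exact invocations in the project scripts may differ):

    % Criterion function for sequentialfs: train on (Xtrain, ytrain) and return
    % the number of misclassified examples in (Xtest, ytest).
    classf = @(Xtrain, ytrain, Xtest, ytest) ...
        sum(predict(ClassificationKNN.fit(Xtrain, ytrain, ...
            'NumNeighbors', 5, 'Distance', 'hamming'), Xtest) ~= ytest);

    % 10-fold cross validation estimate of the misclassification rate (Err2).
    err2 = crossval('mcr', X, y, 'Predfun', ...
        @(Xtrain, ytrain, Xtest) predict(ClassificationKNN.fit(Xtrain, ytrain, ...
            'NumNeighbors', 5, 'Distance', 'hamming'), Xtest), 'KFold', 10);

    % Forward and backward feature selection, each with 10-fold CV inside.
    fs_f = sequentialfs(classf, X, y, 'cv', 10, 'direction', 'forward');
    fs_b = sequentialfs(classf, X, y, 'cv', 10, 'direction', 'backward');

    % Reduced data sets; a union of the logical masks implements the NS set.
    Xf = X(:, fs_f);
    Xb = X(:, fs_b);
    ns = fs_f | fs_b;   % extended over all three algorithms in the full scheme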
7. Learning with Naïve Bayes

For Naïve Bayes learning I used the Matlab class NaiveBayes, provided in the Statistics Toolbox. The classifier selects the most probable label given the independent distribution of each feature. The NaiveBayes class expects a hint as to the distribution of the features; for the binary case I selected the "mvmn" option, a multivariate multinomial distribution. The issue of multiple labels is trivial in Naïve Bayes: the algorithm finds the most likely label, and this can be done for several labels in the same way it is done for 2 labels. The results for Naïve Bayes classification can be seen in Table 1, and the cross validation process is also described in Figure 3.

Step   Description                                          Misclassification Rate
Err1   Entire data set as train and test set                32%
Err2   Cross validation on entire data set                  65%
Err3   Entire data set after backward feature selection     26%
Err4   Cross validation after backward feature selection    42%
Err5   Entire data set after forward feature selection      21%
Err6   Cross validation after forward feature selection     23%
Err7   Entire data set after converged feature selection    26%
Err8   Cross validation after converged feature selection   47%

Table 1 – Results of Naïve Bayes Classification

[Figure 3 – Naïve Bayes Cross Validation: classification error (%) per feature selection step, for backward and forward feature selection.]
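A minimal sketch of this step (X and y as loaded above; the surrounding cross validation and feature selection are omitted):

    % Train a Naive Bayes classifier; 'mvmn' treats each binary feature as a
    % multivariate multinomial (here: two-valued) variable.
    nb = NaiveBayes.fit(X, y, 'Distribution', 'mvmn');

    % Misclassification rate on the training set itself (Err1).
    err1 = mean(predict(nb, X) ~= y);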
8. Learning with KNN

For KNN learning I used the Matlab class ClassificationKNN, provided in the Statistics Toolbox. The algorithm selects a label according to the labels of the K nearest neighbors; I used K=5 after several experiments. For the distance metric I used the Hamming distance, which is a good indicator of the difference between two binary strings [i]. It counts the number of differing bits between the strings, and is therefore more suitable for the binary case than a general Euclidean distance. The issue of multiple labels is easy to solve in KNN, as the algorithm simply picks the most common label among the K nearest neighbors. The results for KNN classification can be seen in Table 2, and the cross validation process is also described in Figure 4.

Step   Description                                          Misclassification Rate
Err1   Entire data set as train and test set                29%
Err2   Cross validation on entire data set                  28%
Err3   Entire data set after backward feature selection     23%
Err4   Cross validation after backward feature selection    23%
Err5   Entire data set after forward feature selection      23%
Err6   Cross validation after forward feature selection     25%
Err7   Entire data set after converged feature selection    22%
Err8   Cross validation after converged feature selection   29%

Table 2 – Results of KNN Classification

[Figure 4 – KNN Cross Validation: classification error (%) per feature selection step, for backward and forward feature selection.]
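A minimal sketch of this step (the period API ClassificationKNN.fit is assumed; newer Matlab releases use fitcknn):

    % Train a 5-nearest-neighbors classifier with Hamming distance, which
    % counts the differing bits between two binary feature vectors.
    knn = ClassificationKNN.fit(X, y, 'NumNeighbors', 5, 'Distance', 'hamming');

    % Misclassification rate on the training set itself (Err1).
    err1 = mean(predict(knn, X) ~= y);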
9. Learning with SVM

For SVM learning I used libsvm [ii], a free Matlab library downloaded from the Internet. The library implements various types of SVMs with optional kernels. I tried several kernel functions, and finally decided on the one suggested on the libsvm website for learning binary data sets:

K(x, z) = (x · z + 1)^3

A polynomial kernel seems to fit the problem of binary classification well (similar to the XOR problem we analyzed in class). The degree of the polynomial affects the accuracy and the running time of the algorithm, and a degree of 3 seems good enough. Finally, the C parameter of the SVM algorithm (the cost of misclassification) is set to 10. In the binary domain it is usually possible to find separators with good margins (as the values are only 0 and 1); therefore we would like the algorithm not to compromise on accuracy in order to enlarge the margins.

The issue of multiple labels is handled internally by libsvm with the "one-against-one" method: a separator is trained between each pair of classes, and the final classification is decided by voting among the pairwise separators. The results for SVM classification can be seen in Table 3, and the cross validation process is also described in Figure 5.

Step   Description                                          Misclassification Rate
Err1   Entire data set as train and test set                3%
Err2   Cross validation on entire data set                  29%
Err3   Entire data set after backward feature selection     7%
Err4   Cross validation after backward feature selection    22%
Err5   Entire data set after forward feature selection      16%
Err6   Cross validation after forward feature selection     17%
Err7   Entire data set after converged feature selection    6%
Err8   Cross validation after converged feature selection   23%

Table 3 – Results of SVM Classification

[Figure 5 – SVM Cross Validation: classification error (%) per feature selection step, for backward and forward feature selection.]
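A minimal sketch using the libsvm Matlab interface (the option string reproduces the kernel and C value described above; variable names are assumptions):

    % Train an SVM with the polynomial kernel K(x,z) = (x'*z + 1)^3 and C = 10.
    % libsvm options: -t 1 polynomial kernel, -g 1 gamma, -r 1 coef0,
    % -d 3 degree, -c 10 misclassification cost.
    model = svmtrain(y, X, '-t 1 -g 1 -r 1 -d 3 -c 10');

    % Predict the labels of the training set and compute Err1.
    yhat = svmpredict(y, X, model);
    err1 = mean(yhat ~= y);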
10. Learning with AdaBoost

For AdaBoost learning I used the Matlab function fitensemble, provided in the Statistics Toolbox. This is a general boosting framework which supports several kinds of boosting algorithms. For coping with multiple labels I used the AdaBoostM2 algorithm, a variation of the classical AdaBoost algorithm [iii]. AdaBoostM2 allows the weak learners to generate more expressive hypotheses which, rather than identifying a single label, choose a set of "plausible" labels. It also allows the weak learner to indicate a "degree of confidence" for each label.

For the weak learners I used a classification tree, implemented with the Matlab class ClassificationTree. This seems like a good simple learner for a binary domain. I limited the trees to 10 nodes in order to keep the weak learners simple.

Feature selection did not work well with AdaBoost, as AdaBoost performs some native feature selection through the way it changes the probabilities of the examples. Therefore a smaller set of experiments was run. The results for AdaBoost classification can be seen in Table 4.

Step   Description                                          Misclassification Rate
Err1   Entire data set as train and test set                18%
Err2   Cross validation on entire data set                  25%
Err7   Entire data set after converged feature selection    23%
Err8   Cross validation after converged feature selection   25%

Table 4 – Results of AdaBoost Classification
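A minimal sketch of this step (the exact tree size option and the number of boosting rounds are assumptions; the report only states that the trees were limited to 10 nodes):

    % Weak learner: a small classification tree. A large minimum leaf size is
    % one way to keep the trees small (assumed option).
    weak = ClassificationTree.template('MinLeaf', 10);

    % Boost with AdaBoost.M2, which handles the 5 labels directly.
    % 100 rounds is an assumed value.
    ens = fitensemble(X, y, 'AdaBoostM2', 100, weak);

    % Misclassification rate on the training set itself (Err1).
    err1 = mean(predict(ens, X) ~= y);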
11. Conclusions

This seems to be a difficult data set to learn, and yet all the algorithms came close to 20% misclassification, which can be a basis for a useful implementation of an automatic idle check analyzer.

The best results of the algorithms were usually produced using forward feature selection. Backward feature selection takes much longer and does not achieve much better results. Also, running the algorithms on the converged feature set does not seem to improve their performance. It seems best to run the feature selection process for each algorithm separately, and in this way choose the features which contribute the most to the accuracy of the specific algorithm.

In all cases the algorithms achieved better results on the whole data set than with cross validation, which is expected: the algorithms tend to fit the actual data set and face bigger difficulties with new examples.

The best results of the algorithms (with cross validation) were:

1. Naïve Bayes, with forward feature selection – 23%.
2. KNN, with backward feature selection – 23%.
3. SVM, with forward feature selection – 17%.
4. AdaBoost – 25%.

As expected, SVM gives the best results, and it also runs faster than the other algorithms. It seems that SVM with the polynomial kernel finds good separators for the data set, and the feature selection process allows finding even better separators as the dimension is much lower.

In all cases the feature selection process allowed the algorithms to achieve good results based on 10 or fewer features. This is an important result on its own, as it shows us the features to focus on when manually analyzing future idle check reports.

Appendix A – Idle Check File Example

# idle_chk. Error if no traffic (level 2) : CFC: AC is neither 0 nor 2 on connType 0 (ETH). Activity counter value is 0x16. LCID 0 CID_CAM 0x1
# idle_chk. Error if no traffic (level 2) : QM: VOQ_0, VOQ credit is not equal to initial credit, Values are 0xec 0x140
# idle_chk. Error if no traffic (level 2) : XCM: XX protection CAM is not empty, Value is 0x1
# idle_chk. Error if no traffic (level 2) : BRB1: BRB is not empty, Value is 0x1d0
# idle_chk. Error if no traffic (level 2) : TCM: FIC0_INIT_CRD is not 64, Value is 0x24
# idle_chk. Error if no traffic (level 2) : PRS: TCM current credit is not 0, Value is 0x38
# idle_chk. Error if no traffic (level 2) : PRS: PENDING_BRB_PRS_RQ is not 0, Value is 0x2
# idle_chk. Warning (level 3): CDU: parity status is not 0, Value is 0x1
# idle_chk. Warning (level 3): CSDM: parity status is not 0, Value is 0x1
# idle_chk. Warning (level 3): TCM: parity status is not 0, Value is 0x1
# idle_chk. Warning (level 3): CCM: parity status is not 0, Value is 0x1
# idle_chk. Warning (level 3): UCM: parity status is not 0, Value is 0x1
# idle_chk. Warning (level 3): PXP: parity status is not 0, Value is 0x1
# idle_chk. Warning (level 3): TSDM: parity status is not 0, Value is 0x1
# idle_chk. Warning (level 3): USDM: parity status is not 0, Value is 0x1
# idle_chk. Warning (level 3): XSDM: parity status is not 0, Value is 0x1
# idle_chk. Warning (level 3): CSEM: parity status 0 is not 0, Value is 0x1
# idle_chk. Warning (level 3): PXP2: parity status 0 is not 0, Value is 0x1
# idle_chk. Warning (level 3): USEM: parity status 0 is not 0, Value is 0x1
# idle_chk. Warning (level 3): MISC: pcie_rst_b was asserted without perst assertion, Value is 0x1
# idle_chk. Warning (level 3): TSEM: interrupt 0 is active, Value is 0x10000
# idle_chk. Warning (level 3): XSEM: interrupt 0 is active, Value is 0x10000
# idle_chk. Warning (level 3): SRCH: parity status is not 0, Value is 0x1
# idle_chk. Error if no traffic (level 2) : QM: Byte credit 0 is not equal to initial credit, Values are 0x5a1c 0x8000
# idle_chk. Warning (level 3): IGU: parity status is not 0, Value is 0x1
# idle_chk. Error if no traffic (level 2) : NIG: Port 0 EOP FIFO is not empty., Value is 0x0
# Idle_chk failed !!! (with errors: 9 warnings: 17 of 585 checks)

References

[i] Mohammad Norouzi, David J. Fleet and Ruslan Salakhutdinov, 2012. Hamming Distance Metric Learning. Neural Information Processing Systems (NIPS).
[ii] Chih-Chung Chang and Chih-Jen Lin. LIBSVM – A Library for Support Vector Machines. http://www.csie.ntu.edu.tw/~cjlin/libsvm.
[iii] Yoav Freund and Robert E. Schapire, 1996. Experiments with a New Boosting Algorithm. Machine Learning: Proceedings of the Thirteenth International Conference (ICML '96), edited by Lorenza Saitta, pages 148-156, Morgan Kaufmann.