BCM57810 HW Failure Analysis Utilizing Machine Learning Algorithms
Tel Aviv University
School of Computer Science
Machine Learning Course – Prof. Yishay Mansour
Final Project
By Ofir Hermesh
3/3/2013
Data collection in cooperation with Broadcom Israel Research ©
1. Abstract
This work describes an experiment in applying machine learning algorithms to a set of HW failure
reports from an ASIC device designed at Broadcom Israel Research Labs. The experiment was conducted
as a final project for the Machine Learning course at Tel Aviv University, taught by Prof. Yishay Mansour.
The document describes the problems of analyzing the HW failure reports and the steps used to build
the data set. It then describes the general strategy for learning this data, the coding practices used and
the usage and performance of each classification algorithm. Finally, it compares the algorithms and
analyzes the results.
The algorithms used for classification are Naïve Bayes, KNN, SVM and AdaBoost.
2. Background and Motivation
In Broadcom Israel Research labs we are developing a 10GBE Converged Network Controller for the
server market (BCM57810), supporting various networking and storage protocols (Ethernet, TCP, iSCSI,
FCoE, RDMA). The device is installed in millions of servers all over the world, and when one of them
hits a HW failure, the user collects a post-mortem report using diagnostic software. This report is
referred to as “Idle Check” as it checks for “non idle” conditions in the HW blocks, such as non-empty
queues or pending interrupts. There are ~400 kinds of errors which can be detected by the diagnostic
software.
When we receive such an “idle check” report, its analysis is not trivial. The BCM57810 is an integrated
device, with several processors, HW blocks and memories which are all interconnected. In many cases
it is difficult to analyze the idle-check report, and only skilled engineers who work on the HW core are
able to do so. However, we would like field engineers to also be able to extract some meaningful
analysis from the idle check report.
We tried several approaches for automating the idle-check analysis, all based on classical algorithms
and classification methods. None of them has been satisfactory so far. In this project I tried to apply
machine learning algorithms to a data set of idle checks from previous failures, which were classified manually.
3. Creating the Data Set
The first stage of the work was creating a data set that would be digestible for machine learning
algorithms. In our logging folder I found 430 idle check files from various cases at our labs and at
customer premises. Appendix A shows an example of a raw idle check file. This is a simple text file
including the non-idle conditions as detected by the diagnostic software.
Figure 1 describes the BCM57810 device block diagram. I decided to classify the examples according to
the cause of the failure in the 5 main blocks of the device:
1. RFE – Network Interface: queues, buffers and classifiers for ingress and egress traffic to the network.
2. RX – Receiver: HW and processors which perform the packet processing on the ingress path.
3. TX – Transmitter: HW and processors which perform the packet processing on the egress path.
4. PXP – PCI Express Interface: HW which manages the PCI protocol and credits for all traffic.
5. IGU – Interrupt Generation Unit: handles interrupt generation towards the server CPU.
The complicated part of the analysis is that in some cases a failure in one block causes a non-idle
condition in another. For example, an ingress packet error in the RFE can cause a failure at the RX
processing stage, which will cause a non-idle condition in one of its sub-blocks.
[Figure 1 – BCM57810 Block Diagram, showing the Ethernet Interface, RFE Network Interface, RX Receiver HW, TX Transmitter HW, PXP PCI Express Interface, IGU Interrupt Generation Unit, and PCI Bus.]
I applied the following stages in order to prepare the data set, assisted by simple Perl scripts:
1. I accessed the failures database of the device, and scanned for idle check files with detected
non-idle conditions (some of the files were generated on systems which were working properly).
This resulted in ~280 files.
2. The files were divided among several engineers, who manually analyzed them and classified each
into one of the 5 labels. Files which were difficult to classify were put aside.
3. For the files put aside, I reviewed the history of the issue in the bug tracking software,
understood which issue was eventually discovered, and classified the file accordingly.
4. I converted the whole data set into a CSV file, with a single bit per attribute indicating the
existence of a given idle check condition. This resulted in a file with 280 examples, each with 400 attributes.
5. I then started merging similar features. For example, errors such as “buffer is full with 5 elements”
and “buffer is full with 3 elements” were considered the same.
6. After this merge, many data points became identical. I verified that their classifications were
the same and removed the duplicates.
The final data set, described in idlechk.csv, includes 120 examples with 119 binary features. The
examples are classified into 5 different labels (1-5, as described in the list above).
The goal of the project is to train an algorithm to deduce the label from a new example.
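For illustration, a minimal Matlab sketch of loading this data set could look as follows (the assumption that each row holds the 119 feature bits followed by the label in the last column is mine):

    % Load the prepared data set (layout assumed: one example per row,
    % 119 binary feature columns, label 1-5 in the last column).
    data = csvread('idlechk.csv');
    X = data(:, 1:end-1);   % 120x119 binary feature matrix
    y = data(:, end);       % labels: 1=RFE, 2=RX, 3=TX, 4=PXP, 5=IGU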
4. Guidelines for Learning the Data Set
There are several difficulties in learning this specific data set:
1. The number of examples is low. At the beginning I was hoping to have ~400 examples, but after
the processing stage I was left with only 120.
2. The number of features is high, though I was able to reduce it to 119.
3. There are 5 different labels, which makes the algorithms more complicated. Also, the number of
examples per label is small. I tried to reduce the number of labels (by uniting some of them), but
this turned out to have a bad impact on the results.
In order to overcome the difficulties described above I used these methods:
1. 10-fold cross validation allows evaluating a learner on a small data set by cutting the data set
into 10 equal pieces, training the learner on 90% of the data set and then testing it on the
remaining 10%. By averaging the results over the 10 pieces we can evaluate the learner's
quality even though we have a small amount of data.
2. Feature selection is used to reduce the number of features. This process adds features iteratively
until the addition of a feature no longer improves the accuracy of the learner.
The fact that the features are all binary makes the classification process easier, as it is easy to put
boundaries between the values of each attribute with good margins. While applying the algorithms, I
used several techniques suited for binary features, described later.
5. General Classification Scheme
For the usage of cross-validation and feature selection I applied a general classification scheme, which
was used three times: once each with the Naïve Bayes, KNN and SVM algorithms. The scheme is described
in Figure 2, and follows these steps:
1. Run the classification algorithm on the entire data set, and calculate the misclassification rate on
the entire set. This error is marked as Err1.
2. Run a 10-fold cross validation on the entire data set. This will cut the set into 10 equal pieces,
and run the classification algorithm 10 times. The misclassification rate is the average among the
10 pieces, and is marked as Err2.
3. Run backward feature selection with 10-fold cross validation. This results in a set of
selected features, FS_B. Create a data set including only these features.
4. Run forward feature selection with 10-fold cross validation. This results in a set of selected
features, FS_F. Create a data set including only these features.
5. Repeat steps (1) and (2) for the samples with FS_B features only (the results of backward feature
selection). The misclassification rate on the whole set is Err3; the misclassification rate with cross
validation is Err4.
6. Repeat steps (1) and (2) for the samples with FS_F features only (the results of forward feature
selection). The misclassification rate on the whole set is Err5; the misclassification rate with cross
validation is Err6.
After applying steps 1-6 to the three algorithms, we create NS, the union of the feature sets selected
by the 6 runs of feature selection. We then run the next steps:
7. Run the classification algorithm on the entire data set with NS features. The misclassification
rate is marked as Err7.
8. Run a 10-fold cross validation on the entire data set with NS features. The misclassification rate
is marked as Err8.
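As an illustration, steps 1-6 can be sketched in Matlab using the Statistics Toolbox routines described in the next section; trainpredict is a hypothetical handle of the form predfun(Xtrain, ytrain, Xtest), returning predicted labels for Xtest (the concrete per-algorithm handles appear in the following sections). Steps 7-8 then simply repeat steps 1-2 on the columns in NS.

    % Sketch of steps 1-6 of the general scheme for one algorithm.
    function [errs, fs_b, fs_f] = run_scheme(X, y, trainpredict)
        % Step 1: train and test on the entire data set (Err1).
        errs(1) = mean(trainpredict(X, y, X) ~= y);
        % Step 2: 10-fold cross validation on the entire set (Err2).
        errs(2) = crossval('mcr', X, y, 'Predfun', trainpredict, 'kfold', 10);
        % sequentialfs criterion: number of misclassified test samples.
        crit = @(Xtr, ytr, Xte, yte) sum(trainpredict(Xtr, ytr, Xte) ~= yte);
        % Steps 3-4: backward / forward feature selection with 10-fold CV.
        fs_b = sequentialfs(crit, X, y, 'direction', 'backward', 'cv', 10);
        fs_f = sequentialfs(crit, X, y, 'direction', 'forward',  'cv', 10);
        % Steps 5-6: repeat steps 1-2 on the reduced feature sets.
        errs(3) = mean(trainpredict(X(:,fs_b), y, X(:,fs_b)) ~= y);
        errs(4) = crossval('mcr', X(:,fs_b), y, 'Predfun', trainpredict);
        errs(5) = mean(trainpredict(X(:,fs_f), y, X(:,fs_f)) ~= y);
        errs(6) = crossval('mcr', X(:,fs_f), y, 'Predfun', trainpredict);
    end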
[Figure 2 – General Classification Scheme: the full data set (M examples, N features) is fed to the classification algorithm directly (Err1) and through 10-fold cross validation (Err2); backward and forward feature selection with 10-fold cross validation produce the reduced feature sets FS_B and FS_F, each evaluated on the whole set (Err3, Err5) and with cross validation (Err4, Err6); finally the union NS = FS_F(ALG1) U FS_B(ALG1) U FS_F(ALG2) U FS_B(ALG2) U FS_F(ALG3) U FS_B(ALG3) is evaluated on the whole set (Err7) and with cross validation (Err8).]
6. Programming Practice
I implemented the classification scheme and ran the tests in Matlab, using the Statistics Toolbox,
which provides many useful functions for machine learning.
Feature selection is implemented using the “sequentialfs” function. This function selects a subset of
features from a data matrix that best predict the data labels, by sequentially selecting features until
there is no improvement in prediction. Starting from an empty feature set, sequentialfs creates
candidate feature subsets by sequentially adding each of the features not yet selected. For each
candidate feature subset, sequentialfs performs 10-fold cross-validation by repeatedly calling the
classification algorithm with different training subsets. When working in “backward” mode, it starts
from the full feature set and removes features sequentially.
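For example, a call for one of the algorithms could be sketched as follows, where mypredict is a stand-in for the per-algorithm prediction routine; sequentialfs sums the returned misclassification counts over the folds and divides by the total number of test samples:

    % Criterion function: the number of misclassified test samples.
    crit = @(Xtrain, ytrain, Xtest, ytest) ...
        sum(mypredict(Xtrain, ytrain, Xtest) ~= ytest);
    % Forward selection is the default; 'direction','backward' starts
    % from the full feature set and removes features instead.
    [inmodel, history] = sequentialfs(crit, X, y, 'cv', 10);
    selected = find(inmodel);   % indices of the selected features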
Cross validation is implemented using the “crossval” function, with MCR loss. The function performs a
10-fold cross validation estimate of misclassification rate for the classification algorithm.
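A sketch of such a call (again with the hypothetical mypredict):

    % 10-fold cross-validation estimate of the misclassification rate.
    % The 'Predfun' handle trains on each training fold and returns
    % predicted labels for the held-out fold.
    mcr = crossval('mcr', X, y, 'Predfun', @mypredict, 'kfold', 10);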
The functions used for each classification algorithm are described in the following sections, together
with the results.
7. Learning with Naïve Bayes
For Naïve Bayes learning I used the Matlab class NaiveBayes, provided in the Statistics Toolbox. The
classifier selects the most probable label given each feature's independent distribution.
The Naïve Bayes class expects some hint as to the distribution of the features. For the binary case I
selected the “mvmn” option- multivariate multinomial distribution.
Handling multiple labels in Naïve Bayes is trivial: the algorithm finds the most likely label, and this
is done for several labels in the same way it is done for 2 labels.
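A minimal sketch of the resulting prediction handle, as plugged into the general scheme, could be (NaiveBayes.fit is the constructor in this Matlab release; newer releases provide fitcnb):

    % Naive Bayes with multivariate multinomial ('mvmn') marginals,
    % which treats each binary feature as a categorical variable.
    nbpredict = @(Xtrain, ytrain, Xtest) ...
        predict(NaiveBayes.fit(Xtrain, ytrain, 'Distribution', 'mvmn'), Xtest);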
The results for Naïve Bayes classification can be seen in Table 1, and the cross validation process is also
described in Figure 3.
Step   Description                                          Misclassification Rate
Err1   Entire data set as train and test set                32%
Err2   Cross validation on entire data set                  65%
Err3   Entire data set after backward feature selection     26%
Err4   Cross validation after backward feature selection    42%
Err5   Entire data set after forward feature selection      21%
Err6   Cross validation after forward feature selection     23%
Err7   Entire data set after converged feature selection    26%
Err8   Cross validation after converged feature selection   47%

Table 1 – Results of Naïve Bayes Classification
[Figure 3 – Naïve Bayes Cross Validation: classification error (%) as a function of the feature selection step, for the backward and forward feature selection runs.]
8. Learning with KNN
For KNN learning I used the Matlab class ClassificationKNN, provided in the statistics toolbox. The
function selects a label according to its nearest K neighbors. I used K=5 after several experiments.
For the distance metric, I used the Hamming distance, which is a good indicator of the difference
between two binary strings [1]. It counts the number of differing bits between the strings, and by
this it is more suitable for the binary case than a general Euclidean distance.
The issue of multiple labels is easy to solve in KNN, as the algorithm simply finds the most common
label among the K nearest neighbors.
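A minimal sketch of the corresponding prediction handle could be (ClassificationKNN.fit is the constructor in this release; newer releases provide fitcknn):

    % 5-nearest-neighbors with Hamming distance, i.e. the fraction of
    % coordinates in which two binary vectors differ.
    knnpredict = @(Xtrain, ytrain, Xtest) ...
        predict(ClassificationKNN.fit(Xtrain, ytrain, ...
            'NumNeighbors', 5, 'Distance', 'hamming'), Xtest);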
The results for KNN classification can be seen in Table 2, and the cross validation process is also
described in Figure 4.
Step   Description                                          Misclassification Rate
Err1   Entire data set as train and test set                29%
Err2   Cross validation on entire data set                  28%
Err3   Entire data set after backward feature selection     23%
Err4   Cross validation after backward feature selection    23%
Err5   Entire data set after forward feature selection      23%
Err6   Cross validation after forward feature selection     25%
Err7   Entire data set after converged feature selection    22%
Err8   Cross validation after converged feature selection   29%

Table 2 – Results of KNN Classification
[Figure 4 – KNN Cross Validation: classification error (%) as a function of the feature selection step, for the forward and backward feature selection runs.]
9. Learning with SVM
For SVM learning I used libsvm [2], a free Matlab library downloaded from the Internet. The library
implements various types of SVMs with optional kernels.
I tried several kernel functions, and finally decided on the one which was suggested in the libsvm
website for learning binary data sets:
K(x, z) = (x · z + 1)^3
It seems like a polynomial kernel fits the problem of binary classification well (similar to the XOR
problem we analyzed in class). The degree of the polynomial affects the accuracy and the running time
of the algorithm, and it seems like a degree of 3 is good enough.
Finally, the C parameter of the SVM algorithm (the cost of misclassification) is set to 10. In the binary
domain it is usually possible to find separators with good margins (as the values are only 0 and 1).
Therefore we would like the algorithm not to compromise on accuracy in order to enlarge the margins.
The issue of multiple labels is handled internally by libsvm with the “one-against-one” method: the
learner trains a separator between each pair of classes, and the final classification is decided by
voting among the pairwise separators.
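In libsvm's Matlab interface the polynomial kernel is (gamma*u'*v + coef0)^degree, so the kernel and C value above translate to the option string sketched below; the zero vector passed to svmpredict is a dummy label argument required by its signature:

    % -t 1: polynomial kernel; -g 1, -r 1, -d 3 give K(x,z) = (x.z + 1)^3;
    % -c 10 sets the misclassification cost; -q silences training output.
    opts = '-t 1 -g 1 -r 1 -d 3 -c 10 -q';
    svmpred = @(Xtrain, ytrain, Xtest) ...
        svmpredict(zeros(size(Xtest, 1), 1), Xtest, ...
                   svmtrain(ytrain, Xtrain, opts));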
The results for SVM classification can be seen in Table 3, and the cross validation process is also
described in Figure 5.
Step   Description                                          Misclassification Rate
Err1   Entire data set as train and test set                3%
Err2   Cross validation on entire data set                  29%
Err3   Entire data set after backward feature selection     7%
Err4   Cross validation after backward feature selection    22%
Err5   Entire data set after forward feature selection      16%
Err6   Cross validation after forward feature selection     17%
Err7   Entire data set after converged feature selection    6%
Err8   Cross validation after converged feature selection   23%

Table 3 – Results of SVM Classification
[Figure 5 – SVM Cross Validation: classification error (%) as a function of the feature selection step, for the backward and forward feature selection runs.]
10. Learning with AdaBoost
For AdaBoost learning I used the Matlab function fitensemble, provided in the statistics toolbox. This is
a general boosting framework, which allows several kinds of boosting algorithms.
For coping with multiple labels, I used the AdaBoostM2 algorithm, which is a variation of the classical
AdaBoost algorithm [3]. AdaBoostM2 allows the weak learners to generate more expressive hypotheses
which, rather than identifying a single label, choose a set of “plausible” labels. It also allows the weak
learner to indicate a “degree of confidence” for each label.
For the weak learners I used a classification tree, implemented with the Matlab class ClassificationTree.
This seems like a good simple learner for a binary domain. I limited the trees to 10 nodes in order to
keep the weak learners simple.
Feature selection did not work well with AdaBoost, as AdaBoost includes some native feature selection
through the way it changes the probability of the examples. Therefore a smaller set of experiments was run.
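A minimal sketch of the ensemble training could look as follows; note that the templateTree/'MaxNumSplits' form shown here is the newer template API (in the release used for this project, ClassificationTree.template played the same role), and the 100 boosting rounds are an illustrative choice:

    % AdaBoostM2 over small classification trees as weak learners;
    % capping the number of splits keeps the weak learners simple.
    tree = templateTree('MaxNumSplits', 10);
    ens  = fitensemble(X, y, 'AdaBoostM2', 100, tree);
    err1 = mean(predict(ens, X) ~= y);   % whole-set error (Err1)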
The results for AdaBoost classification can be seen in Table 4.
Step   Description                                          Misclassification Rate
Err1   Entire data set as train and test set                18%
Err2   Cross validation on entire data set                  25%
Err7   Entire data set after converged feature selection    23%
Err8   Cross validation after converged feature selection   25%

Table 4 – Results of AdaBoost Classification
11. Conclusions
This seems to be a difficult data set to learn, and yet all the algorithms came close to 20%
misclassification, which can be a basis for a useful implementation of an automatic idle check analyzer.
The best results of the algorithms were usually produced using forward feature selection. Backward
feature selection takes much longer and does not achieve much better results.
Also, running the algorithms on the converged feature set does not seem to improve their performance.
It seems best to run the feature selection process for each algorithm separately, and by this choose
the features which contribute the most to the accuracy of the specific algorithm.
In all cases the algorithms achieved better results on the whole data set than with cross-validation,
which is expected: the algorithms tend to fit the actual training data and face bigger difficulties with
new examples.
The best results of the algorithms (with cross validation) were:
1. Naïve Bayes, with forward feature selection – 23%.
2. KNN, with backward feature selection – 23%.
3. SVM, with forward feature selection – 17%.
4. AdaBoost – 25%.
As expected, SVM gives the best results, and also runs faster than the other algorithms. It seems that
SVM with the polynomial kernel finds good separators for the data set, and the feature selection
process allows finding even better separators as the dimension is much lower.
In all cases the feature selection process allowed the algorithms to achieve good results based on
10 or fewer features. This is an important result on its own, as it shows us which features we should
focus on when manually analyzing future idle check reports.
Appendix A – Idle Check File Example
# idle_chk. Error if no traffic (level 2)
: CFC: AC is neither 0 nor 2 on connType 0 (ETH).
Activity counter value is 0x16. LCID 0 CID_CAM 0x1
# idle_chk. Error if no traffic (level 2)
: QM: VOQ_0, VOQ credit is not equal to initial
credit, Values are 0xec 0x140
# idle_chk. Error if no traffic (level 2) : XCM: XX protection CAM is not empty, Value is 0x1
# idle_chk. Error if no traffic (level 2) : BRB1: BRB is not empty, Value is 0x1d0
# idle_chk. Error if no traffic (level 2) : TCM: FIC0_INIT_CRD is not 64, Value is 0x24
# idle_chk. Error if no traffic (level 2) : PRS: TCM current credit is not 0, Value is 0x38
# idle_chk. Error if no traffic (level 2) : PRS: PENDING_BRB_PRS_RQ is not 0, Value is 0x2
# idle_chk. Warning (level 3): CDU: parity status is not 0, Value is 0x1
# idle_chk. Warning (level 3): CSDM: parity status is not 0, Value is 0x1
# idle_chk. Warning (level 3): TCM: parity status is not 0, Value is 0x1
# idle_chk. Warning (level 3): CCM: parity status is not 0, Value is 0x1
# idle_chk. Warning (level 3): UCM: parity status is not 0, Value is 0x1
# idle_chk. Warning (level 3): PXP: parity status is not 0, Value is 0x1
# idle_chk. Warning (level 3): TSDM: parity status is not 0, Value is 0x1
# idle_chk. Warning (level 3): USDM: parity status is not 0, Value is 0x1
# idle_chk. Warning (level 3): XSDM: parity status is not 0, Value is 0x1
# idle_chk. Warning (level 3): CSEM: parity status 0 is not 0, Value is 0x1
# idle_chk. Warning (level 3): PXP2: parity status 0 is not 0, Value is 0x1
# idle_chk. Warning (level 3): USEM: parity status 0 is not 0, Value is 0x1
# idle_chk. Warning (level 3): MISC: pcie_rst_b was asserted without perst assertion, Value is
0x1
# idle_chk. Warning (level 3): TSEM: interrupt 0 is active, Value is 0x10000
# idle_chk. Warning (level 3): XSEM: interrupt 0 is active, Value is 0x10000
# idle_chk. Warning (level 3): SRCH: parity status is not 0, Value is 0x1
# idle_chk. Error if no traffic (level 2) : QM: Byte credit 0 is not equal to initial credit,
Values are 0x5a1c 0x8000
# idle_chk. Warning (level 3): IGU: parity status is not 0, Value is 0x1
# idle_chk. Error if no traffic (level 2) : NIG: Port 0 EOP FIFO is not empty., Value is 0x0
# Idle_chk failed !!! (with errors: 9 warnings: 17 of 585 checks)
References
[1] Mohammad Norouzi, David J. Fleet, Ruslan Salakhutdinov, 2012. Hamming Distance Metric Learning. Neural Information Processing Systems (NIPS).
[2] Chih-Chung Chang and Chih-Jen Lin. LIBSVM – A Library for Support Vector Machines. http://www.csie.ntu.edu.tw/~cjlin/libsvm.
[3] Yoav Freund and Robert E. Schapire, 1996. Experiments with a new boosting algorithm. Machine Learning: Proceedings of the Thirteenth International Conference (ICML '96), edited by Lorenza Saitta, pages 148-156. Morgan Kaufmann.