DATA MINING OF MINE EQUIPMENT DATABASES

TAD S. GOLOSINSKI
University of Missouri-Rolla, MO 65409-0450, USA

HUI HU
University of Missouri-Rolla, MO 65409-0450, USA

ABSTRACT

The paper presents research into the use of data mining methods for knowledge discovery in mining databases. The data was collected using the Caterpillar VIMS system installed on several trucks operating in a surface mine. It was mined with IBM Intelligent Miner for Data. Data mining was found to allow identification and quantification of relations between the various types of VIMS data. As such, it offers the potential for development of a truck model that can be used for predicting truck condition and performance. Development of this capability requires further research.

INTRODUCTION

Modern mining equipment is fitted with numerous sensors that monitor its condition and performance. The data collected by these sensors is used to alert the operator to abnormal operating conditions and to perform an emergency shut-down if the pre-set values of the monitored parameters are exceeded. This data is also used for post-failure diagnostics and for reporting and analysis of equipment performance.

It is believed that the availability of this voluminous data, together with the availability of sophisticated data processing methods and tools, may allow extraction of a variety of additional information contained in the data. One method that may be of value is data mining (Golosinski, 2001).

The research presented in this paper analyzes data collected from various sensors installed on several mining trucks, with the aim of developing a model of truck operation that may facilitate reliable projection of truck performance and condition into the future. Data was collected from Caterpillar 789B trucks equipped with VIMS (Vital Information Management System) during the period of January to October 2000.
IBM Intelligent Miner for Data was used to conduct data mining.

VIMS OPERATION

Caterpillar's Vital Information Management System (VIMS) is installed on selected CAT mining equipment. It is intended to assist with machine management by informing operators, service personnel and supervisors of the status of selected machine functions and by providing information on equipment production and performance. VIMS monitors and records parameters of numerous sensors that are integrated into the vehicle design. It can alert the operator if these parameters exceed the pre-set critical values. In addition, it can conduct an emergency equipment shut-down if so programmed (Caterpillar, 2000).

The on-board VIMS unit records the collected data as well as the occurrence of certain VIMS events. The recorded data can be downloaded into a notebook computer. Alternatively, it can be sent to the central control unit via radio (VIMS Wireless).

VIMS DATA

VIMS records data in seven different formats. These are:

Event Summary List (ESL). A VIMS event is recorded when the measured value of a monitored parameter exceeds that considered acceptable. The Event List is a record of events that have occurred on the machine. It is limited to the last 500 events, listed in chronological order.

Snapshot. Snapshot stores a segment of machine history that consists of values of all monitored parameters recorded at one-second intervals. The snapshot is triggered by a VIMS event and as such it is related to an abnormal condition or emergency situation of the machine.

Data Logger. Data Logger records values of all the machine parameters that are monitored by VIMS, sampled at one-second intervals. The logger is started and stopped by operator command and can record data for up to 30 minutes.

Trends. Trends record the minimums, maximums and averages of the selected machine condition parameters for a pre-selected period of time.

Cumulative.
Cumulative records the number of occurrences of specific events over a pre-set period of time. An example of cumulative information is the engine revolutions or fuel consumption over the life of the machine or of its component.

Histogram. Histogram records the performance history of a selected parameter since the last reset. For example, a histogram of the engine speed would indicate the percentages of time that the engine operated within pre-specified speed ranges.

Payload. Payload carried by the machine can be recorded if so specified, provided that the machine is equipped with an appropriate sensor.

Four different data types are recorded. These are:

Sensed Data. This data contains values of sensor parameters and positions of switches installed on the machine.

Internal Data. This data is generated internally within the VIMS main module. It includes records of date and time.

Communicated Data. This data is acquired through the data links to various machine components, including non-CAT components. For example, the engine speed may be monitored and recorded through the data link to the electronic engine control system.

Calculated Data. This data is calculated by the VIMS main module as a function of other data that is being collected. As an example, event duration may be calculated from internal data and stored in the event list.

INTELLIGENT MINER

A variety of data mining software is available from numerous vendors. It includes Intelligent Miner of International Business Machines Corporation, MineSet of Silicon Graphics Inc., Clementine of Integral Solutions Limited of U.K. and others (Westphal and Blaxton, 1998). The IBM Intelligent Miner (IM) version 6.1 was used for the data mining reported in this paper (IBM, 2000). It offers a choice of algorithms, is easy to use, and has proven itself useful in many commercial applications. The following mining and statistical functions are included in Intelligent Miner:

1.
Mining functions: associations, demographic and neural clustering, sequential patterns and similar sequences, tree and neural classification, and neural and RBF (Radial Basis Function) prediction.

2. Statistics functions: bivariate statistics, linear regression, principal component analysis, univariate curve fitting and factor analysis.

The IM allows modeling of events and processes that can be either usual or unusual. Usual events describe the situation that is considered normal and for which the relations between different attributes are sought. For example, relations between truck operating and mechanical attributes can be defined, such as the relation between engine load and truck payload. Definition and quantification of these relations may help in improving the efficiency of truck operation or in operator training. Unusual events are failures of the monitored machine or its components. Data mining of these events may allow definition of algorithms that would facilitate modeling of truck operation to help with planning of its maintenance and reduction of downtime.

To facilitate data mining of VIMS databases with IM, the data format has to be adapted to one accepted by the IM. The original VIMS data, downloaded from an on-board VIMS unit, can be easily merged into an MS Access 97 database using the VIMS PC99 software. However, IM does not accept the Access data format, so the data has to be transferred to a DB2 database that is compatible with version 6.1 of Intelligent Miner.

DATA MINING METHODS

Of the various data mining methods offered by IM, the following were used in this investigation: Major Factor Analysis, Clustering, Classification, and Sequential Pattern. Data mining was done on a VIMS database that consisted of 300 MB of records collected on several Caterpillar model 789B trucks that operated in a surface mine between February and October 2000.
Data collected by the VIMS data logger consisted of 105 data sets, each set covering a period of up to 30 minutes of truck operation. Overall, 85 parameters of truck condition and performance were monitored, with their values recorded each second by the on-board VIMS. The original data was transferred to DB2 and pre-processed. This included data clean-up, and identification and extraction of the data that is of interest to the problem at hand.

INVESTIGATIONS

Relationship between Truck Parameters in VIMS Data Logger Data

Not all truck parameters are independent and a variety of relations exist between them. The preliminary research done by Ataman (2001) found that significant correlations exist between various parameters. Two VIMS parameters, engine speed and fuel flow, were found to show strong correlation with many other parameters. Also confirmed was a relatively strong relation between engine oil pressure and engine speed, as indicated in the VIMS manual. The relation between engine coolant temperature and aftercooler temperature was another expected result. No other significant relations were identified.

This work confirms that the linear regression method of IM can be used to define and quantify the relations that may exist between various parameters describing truck performance and condition. It is believed that these relations, in turn, can be used in modeling truck operation.

In relation to VIMS data, the major problem is the incompatibility of its data format with that of IM. An interface between VIMS and IM needs to be developed that would allow easy data transfer and manipulation.

Major Factor Analysis (MFA) of VIMS Data Logger Data

In statistical terms, all parameters constitute variables. The relationship between two variables is defined by the correlation coefficient. For the purpose of modeling truck condition and performance, high correlation between any two variables indicates redundancy. MFA eliminates this redundancy by combining correlated variables into factors.
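The notion of redundancy described above can be illustrated with a minimal sketch. It is not the IM implementation: the parameter names, the sample values and the 0.9 threshold are all assumptions chosen for illustration. The sketch computes the Pearson correlation coefficient for every pair of parameters and lists the pairs that are correlated strongly enough to be treated as redundant.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def redundant_pairs(columns, threshold=0.9):
    """Return parameter pairs whose absolute correlation is >= threshold.

    The threshold is an assumed cut-off, not a value taken from the study.
    """
    pairs = []
    for (name_a, a), (name_b, b) in combinations(columns.items(), 2):
        r = pearson(a, b)
        if abs(r) >= threshold:
            pairs.append((name_a, name_b, round(r, 3)))
    return pairs


# Synthetic one-second samples; names and values are illustrative only.
samples = {
    "engine_speed": [700, 1200, 1500, 1800, 1900, 1300],
    "fuel_flow": [10, 42, 60, 81, 88, 47],
    "payload": [0, 0, 150, 150, 150, 0],
}

for pair in redundant_pairs(samples):
    print(pair)
```

With these synthetic values only the engine speed and fuel flow pair exceeds the threshold. Factor analysis goes one step further than this pairwise screen: it combines whole groups of mutually correlated variables into a single factor.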
A lower number of factors simplifies further analysis. In the described research all monitored truck parameters constituted inputs into MFA. The analysis was performed using varimax rotation, which maximizes the variance of the factor loadings for each input variable. The rotated factors have a high correlation with one set of input variables and little or no correlation with another set of input variables. The varimax rotation strategy can give a clearer interpretation of the results by classifying variables into new independent factors.

Figure 1 presents the factor loadings that quantify the strength of relationships between variables in the investigated databases. The value of a factor loading reflects the linear relationship between an input variable and the corresponding factor, and varies between -1 and +1. If the factor loading is +1 there is a perfect positive relationship between the variable and the factor. A factor loading of -1 denotes a perfect negative relationship. If the factor loading is 0, there is no relationship between the input variable and the factor.

Figure 1. The Factor Loading View

In the factor loading window, the vertical axis represents one of the factors while the horizontal one represents another. The dots depict the factor loadings. The labels next to the dots show the numbers of the input variables; the name of each variable can be identified from the label list on the right side of figure 1. If a dot has a high coordinate value on one of the axes and lies in close proximity to it, there is a distinct relationship between one of the two factors and this variable (IBM, 2000).

The results of this analysis identified 19 statistically independent factors that represent the original 85 truck parameters. The variables that are included in the same factor are highly correlated. Table 1 summarizes the results of the Major Factor Analysis. The first factor accounts for 29% of variables, or 24 truck parameters.
These are highly correlated with each other as well as with the first factor. All 24 parameters define temperature and pressure, including atmospheric temperature, engine coolant temperature, turbocharger inlet air pressure, etc. Therefore this class of parameters represents the temperature/pressure indicators of the truck.

The second factor accounts for 12% of the parameters. It groups engine load indicators and includes such variables as engine speed, throttle position, boost pressure, and so on. Interestingly, the ECM (Electronic Control Module) calculates engine load as a function of engine speed, throttle switch position, throttle position, boost pressure, and atmospheric pressure.

The third factor can be thought of as the payload indicator, and the fourth factor as the fluid level indicator. A physical interpretation cannot be provided for all factors at present.

The MFA output also includes factor scores, the actual values of individual observations for the factors. These factor scores are particularly useful when further analysis of factors is to be performed.

In conclusion, the Major Factor Analysis can be used to reduce the number of truck performance and condition parameters that one needs to be concerned with, thus simplifying further analysis. A lower number of variables in the input to clustering and classification saves evaluation time and minimizes problems created by missing variable values.

Table 1. Machine Parameter Indicators (Factors)

No.  Factor (percentage)  Indicator            Parameters (Variables)
1    Factor 1 (29%)       Temperature          Atmospheric Temperature, Engine Coolant Temperature, Turbocharger Inlet Air Pressure, etc.
2    Factor 2 (12%)       Engine Load          Engine Speed, Throttle Position, Boost Pressure, etc.
3    Factor 3 (5%)        Payload              Payload, Suspension Cylinder Pressures, Payload Status, Machine Pitch
4    Factor 4 (6%)        Fluid Level          Engine Oil Level, Low Steering Pressure, Engine Oil Pressure, etc.
5    Factor 6 (2.9%)      Road Condition       RTR-LTR and RTF-LTF Suspension Level, Machine Rack
6    Factor 8 (2.7%)      Transmission Switch  Torque Converter Screen, Transmission Charge Filter, etc.
7    Factor 13 (2%)       Auto Lube            Auto Lube Datalink, Auto Lube
8    Factor 14 (2.28%)    Body Level           Body Level, Body Position
9    Factor 16 (3.2%)     Fan Speed            High or Low Speed Fan, Ground Speed, etc.
10   Factor 17 (2.4%)     Engine Fuel Rate     Engine Fuel Rate

Clustering of VIMS Data Logger Data

Clustering searches for characteristics that most frequently occur in common and groups the related data into clusters. The number of detected clusters and the properties of each cluster are the results. In addition, the distribution of characteristics within the clusters is quantified.

Demographic Clustering provides fast and natural clustering of very large databases. It automatically determines the number of clusters to be generated. Similarities between records are determined by comparing their field values. The clusters are then defined so that Condorcet's criterion is maximized (IBM, 2000).

Following the Major Factor Analysis the remaining data set was data mined using the IM demographic clustering. As a result the data set was segmented into 9 clusters, as shown in fig. 2 (Golosinski, Hu, and Elias, 2001). The three largest clusters each account for 14% of the whole data set.

Figure 2. Demographic Clustering - IM Output

Fig. 3 and 4 show a zoom of the clusters related to haul distance and to truck payload. The haul distance cluster, shown in fig. 3, indicates that the haul distance is one of the main determinants of fuel consumption rate. Interestingly, the percentage of 6 to 10 mile long hauls in this cluster is approximately 40%, while the same percentage for the whole population is only 5%. One possible explanation is that on the long hauls the truck fuel consumption rate is larger since the truck spends more time running at full load. On short hauls more time is spent on loading, dumping, maneuvering and waiting activities, during which fuel consumption is low.

Figure 3. Demographic Clustering: Haul Distance Cluster (Horizontal Scale: Haul Distance in Miles)

Fig. 4 shows the payload cluster. It indicates that all trucks in this cluster were empty (100% of the cluster), while the percentage of empty trucks in the whole database is only around 50%. All the trucks in the payload cluster were traveling in 4th gear at a speed of 25 to 35 MPH and the fuel consumption rate was average.

Figure 4. Demographic Clustering: Payload Cluster (Horizontal Scale: Payload in Tons)

The other clusters identified in this work are presented in fig. 2. These contain a variety of information related to truck performance.

Classification of VIMS Data Logger Data

Classification is used to segregate database records into pre-defined classes based on specific criteria. Thus this technique can be used to define which truck operating or condition parameters determine the fuel consumption rate, which parameters determine the cycle time, and the like.

The tree-classification mining function builds a classification model as a binary decision tree. Each interior node of the binary decision tree tests an attribute of a record. If the attribute value satisfies the test, the record is sent down the left branch of the node. If the attribute value does not meet the requirements, the record is sent down the right branch of the node.

In figure 5, the 4 classes are marked with different colors in the upper left corner. They are reflected in the tree map as solid squares. The solid circles are the decision nodes. The binary decision tree consists of the root node on top, followed by non-leaf nodes and leaf nodes. Branches connect a node to 2 other nodes. Root and non-leaf nodes are represented as pie charts. Leaf nodes are represented as rectangles. Each node can display its characteristics in the window shown at the bottom of fig. 5.
This information includes:

Label: The predominant class label of the selected node.

Test: The split criterion for this node. This applies only to non-leaf nodes and specifies a simple selection.

Records: The number of records contained in each of the sub-nodes of the selected node.

Distribution: The number of records corresponding to each of the possible class labels.

Purity: The percentage of correctly classified records assigned to a node.

The classification is most meaningful if all records belong to one leaf node only. However, by pruning the binary decision tree, records of other nodes can be assigned to the selected node.

Figure 5. Classification-Tree

The Tree Classification was done for the fuel consumption rate, leading to the definition of four classes of parameters that indicate high fuel consumption. These can be identified by tracking the thick black line with the arrow that links the nodes and continues on to the rectangles at the foot of figure 5. Use of color in the computer-generated figure 5 makes the tracking easy. Selected observations that can be made in this case indicate that:

1. When ground speed is in the range from 12.25 MPH to 15.5 MPH and the payload is over 126.85 t, 96.8% of the 283 analyzed records indicate a high engine fuel consumption rate.

2. When ground speed is more than 31.5 MPH and actual gear is higher than 5, all 146 records show a low engine fuel consumption rate.

3. The ground speed has more impact on the engine fuel consumption rate than do other parameters.

Sequential Pattern Recognition in VIMS Event Logs

Thousands of events are recorded during the life of an average mining truck. These are stored in the Event data log of the VIMS database. All VIMS events are classified into two categories: data events and maintenance events. A data event is related to the machine operating status, such as low engine oil level. A maintenance event is related to the machine control system, that is, a problem of the VIMS itself, such as a severed sensor wire.
VIMS and related tools can only list and tabulate events. Other tools are needed to discover additional knowledge that may exist in the VIMS database. One such tool is the Sequential Pattern function, one of the data mining methods of Intelligent Miner. It can be used to discover similar sequential data patterns in VIMS databases.

The inputs to sequential pattern analysis were: Serial Number, System Measurement Unit (SMU) and Event Identifier, the latter being a combination of event description and event level. The minimum support level was set at 80%. A database of 42,514 events recorded on 12 trucks was data mined. These included 69 event types. As a result, 77,327 sequential patterns were identified.

The sequential patterns were found to exist in all 12 machines at some point in time. The events Engine Oil Level, TC Out Temperature and VIMS Snapshot show a particularly strong relationship with each other.

Engine Oil Level monitors the engine oil level and informs the engine ECM when it drops below the acceptable minimum. This is an on/off switch type signal, with the switch open when the oil level is low.

TC Out Temperature monitors oil temperature on the outlet side of the torque converter. The sensor signal pulse width changes as the torque converter oil temperature changes. VIMS then determines the temperature based on the width of the sensor signal pulse.

VIMS Snapshot is the percentage of memory space that is left available for storing the VIMS Snapshot data.

The physical interpretation of the relation between the first two of the above is clear; the last one is rather unusual. Since the VIMS Snapshot event is triggered when the VIMS Snapshot memory is full, it may occur when the data is not downloaded in a timely manner. On the other hand, more detailed analysis indicates that the first two events occupy a large part of the VIMS Snapshot memory, especially when the events take place frequently, or when the operator repeatedly ignores these two events, leading to overfilling of the memory.
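The role of the support threshold in the analysis above can be illustrated with a minimal sketch. It assumes that the support of an ordered event pair is the share of machines on which the first event is recorded at an earlier SMU than the second; this definition, the event identifiers and the records are assumptions made for illustration, and the sketch does not reproduce the IM algorithm or the study data.

```python
from collections import defaultdict
from itertools import permutations


def ordered_pair_support(events):
    """Support of each ordered event pair (a, b): the share of machines
    on which some occurrence of a precedes some occurrence of b."""
    by_machine = defaultdict(list)
    for serial, smu, event_id in events:
        by_machine[serial].append((smu, event_id))

    counts = defaultdict(int)
    for log in by_machine.values():
        first_seen, last_seen = {}, {}
        for smu, event_id in sorted(log):
            first_seen.setdefault(event_id, smu)  # earliest SMU per event
            last_seen[event_id] = smu             # latest SMU per event
        for a, b in permutations(first_seen, 2):
            # a precedes b somewhere if the first a is before the last b.
            if first_seen[a] < last_seen[b]:
                counts[(a, b)] += 1

    n_machines = len(by_machine)
    return {pair: c / n_machines for pair, c in counts.items()}


# Synthetic records: (serial number, SMU, event identifier).
events = [
    ("T1", 100, "ENG_OIL_LEVEL"), ("T1", 130, "TC_OUT_TEMP"),
    ("T2", 210, "ENG_OIL_LEVEL"), ("T2", 260, "TC_OUT_TEMP"),
    ("T3", 90, "TC_OUT_TEMP"), ("T3", 95, "ENG_OIL_LEVEL"),
    ("T3", 140, "TC_OUT_TEMP"),
    ("T4", 300, "ENG_OIL_LEVEL"), ("T4", 310, "TC_OUT_TEMP"),
    ("T5", 50, "TC_OUT_TEMP"),
]

MIN_SUPPORT = 0.8  # the 80% minimum support level quoted in the text
support = ordered_pair_support(events)
frequent = {pair: s for pair, s in support.items() if s >= MIN_SUPPORT}
print(frequent)
```

In this synthetic log the pair (ENG_OIL_LEVEL, TC_OUT_TEMP) occurs on four of the five machines and passes the threshold, while the reverse order occurs on one machine only and is discarded. Longer sequences can be counted in the same way, at a higher computational cost.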
Information on the relation between the engine oil level event and the torque converter temperature event allows an increase of torque converter temperature to be predicted from the reading of engine oil level, and vice versa. As such it is important to the task of this work.

Overall, the sequential patterns found to exist in the VIMS Event database with a high confidence level constitute a very low percentage of all events. This does not disqualify Sequential Pattern as a data mining method useful for mining VIMS databases. However, it indicates the need to revise the approach in future investigations. Possible changes are: use of larger databases that include data collected from a variety of trucks and from different mines, inclusion in the analyzed databases of other data external to VIMS, and increasing the size of VIMS data sets so that these cover extended periods of truck operation.

DISCUSSION

Work presented in this paper indicates that data mining of VIMS-generated databases allows discovery of knowledge contained in these databases. In particular, relations that exist between various VIMS-collected data can be identified, described and quantified. This can be achieved through use of two IM statistical functions: linear regression and factor analysis. While both these functions can serve this purpose, Factor Analysis also significantly reduces the number of variables that need to be considered and groups all related parameters into factors.

Clustering can be used to segment the VIMS database by grouping data that have similar characteristics. This allows identification of the parameters that are of key importance to truck performance. Thus, as an example, the parameters that influence truck fuel consumption rate can be defined. Further work is needed to fully define the applicability of clustering to the problem at hand. It appears that a clear definition of the relations or goals sought may be needed to realize the full potential of this method.
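The way a cluster is read in the clustering results, by comparing the share of a characteristic inside the cluster with its share in the whole data set, can be sketched as follows. The records, the cluster number and the field names are synthetic and hypothetical; they were chosen only so that the shares resemble the 40% and 5% quoted for the haul distance cluster.

```python
def share(records, predicate):
    """Fraction of records for which predicate is true."""
    records = list(records)
    return sum(1 for r in records if predicate(r)) / len(records)


def is_long_haul(record):
    # "Long haul" follows the 6 to 10 mile range quoted in the text.
    return 6 <= record["haul_miles"] <= 10


# Synthetic records: 5 in a hypothetical cluster 3, 35 in another cluster.
records = (
    [{"cluster": 3, "haul_miles": m} for m in (7, 9, 2, 3, 1)]
    + [{"cluster": 1, "haul_miles": 1 + i % 5} for i in range(35)]
)

in_cluster = [r for r in records if r["cluster"] == 3]
cluster_share = share(in_cluster, is_long_haul)
population_share = share(records, is_long_haul)

print(f"cluster: {cluster_share:.0%}, population: {population_share:.0%}")
```

A characteristic whose share in the cluster differs strongly from its share in the population is a candidate explanation of what the cluster represents; whether it is a meaningful one still has to be judged against knowledge of truck operation.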
Classification was applied to interpret the clustering results. It allowed quantification of the impact that various VIMS parameters have on truck fuel consumption. It was further shown that classification can be used to build a model that describes the behaviour of the VIMS parameters that are of interest. The research indicates that classification alone does not yield meaningful results. It does, however, yield such results if used in conjunction with clustering.

Sequential patterns are usually used to find predictable patterns of behaviour for a given phenomenon over a period of time. In relation to VIMS parameters the intent was to be able to predict the occurrence of a specific event based on the occurrence of a similar event in the past. However, this approach was unsuccessful. While a multitude of similar patterns was found to exist in the data, their variability was too large to permit drawing of valid conclusions on their repeatability. It is believed that a more complete VIMS database, collected over an extended period of time and in a variety of truck operating conditions, may overcome this problem and permit development of predictive models of truck behaviour.

The ultimate goal of this work is to construct a model of truck operation that would allow projection of truck condition and performance into the future. The relations discovered to exist between various VIMS parameters lay a foundation for this modeling work. Of particular importance is the use of Factor Analysis, which reduces the number of parameters to a manageable level. Classification and clustering allow analysis of truck performance and, indirectly, its optimization. Use of sequential patterns needs further study.

CONCLUSIONS

The investigations presented above show that data mining can be used to analyze the performance of mining equipment. In particular, the relations between its various operating, condition and performance parameters can be defined and quantified using regression and factor analysis methods.
VIMS provides a variety of data that quantify truck condition and performance. To maximize use of this data, it needs to be collected continuously over extended periods of time and under a variety of operating and climatic conditions.

Intelligent Miner contains a variety of data mining tools, many of which can be used to successfully data mine VIMS databases. However, the input data format of IM is not compatible with that of VIMS-generated data. Therefore an interface between the two is needed that facilitates fast data reformatting.

More investigations are needed to fully define the applicability of data mining to knowledge discovery in VIMS databases. Use of databases collected over extended periods of time and containing data external to VIMS that define truck operating conditions may be the key factor in determining this applicability.

ACKNOWLEDGEMENT

Financial support of the investigations presented in this paper by the University of Missouri Research Board is gratefully acknowledged.

REFERENCES

Ataman, K. 2001. M.S. Thesis: "Data Mining for Prediction of Condition and Performance of Mine Machinery". University of Missouri-Rolla publication.

Berson, A. and Smith, S.J. 1997. "Data Warehousing, Data Mining and OLAP". McGraw-Hill.

Caterpillar, Inc. 1999. "Vital Information Management System (VIMS): System Operation Testing and Adjusting". Company publication.

Golosinski, T.S. 2001. "Data Mining Uses in Mining". Proceedings, APCOM 2001, Beijing, China.

Golosinski, T.S., Hu, H. and Elias, R. 2001. "Data Mining VIMS for Information on Truck Condition". Proceedings, APCOM 2001, Beijing, China.

IBM (International Business Machines Corporation). 2000. Manual: "Using the Intelligent Miner for Data". Company publication.

Madiba, E. 2001. M.S. Thesis: "Application of IBM DB2 Intelligent Miner for Data to Mine Vital Information Management System (VIMS) Data". University of Missouri-Rolla publication.

Westphal, C. and Blaxton, T. 1998. "Data Mining Solutions". John Wiley and Sons, Inc.