56:166 PRODUCTION SYSTEMS
Homework No. 4
Due: T, Sep. 27, 2005
Posted: T, Sept 20, 2005

Problem 1 (20%)

Table 1. Weather data from WEKA

outlook   temperature  humidity  windy  play
sunny     hot          high      FALSE  no
sunny     hot          high      TRUE   no
overcast  hot          high      FALSE  yes
rainy     mild         high      FALSE  yes
rainy     cool         normal    FALSE  yes
rainy     cool         normal    TRUE   no
overcast  cool         normal    TRUE   yes
sunny     mild         high      FALSE  no
sunny     cool         normal    FALSE  yes
rainy     mild         normal    FALSE  yes
sunny     mild         normal    TRUE   yes
overcast  mild         high      TRUE   yes
overcast  hot          normal    FALSE  yes
rainy     mild         high      TRUE   no

1. Calculate the information content, entropy, and information gain for the 4 parameters (features) in Table 1 (assume that “play” is the decision variable).
2. Select the parameter with the highest information gain and check whether it is the same as the one (the tree root) selected by WEKA’s decision tree algorithm (J48).

Problem 2 (30%)

A small manufacturing company produces a small tractor. Customers can go to the company’s homepage and configure their desired tractor by selecting among the option codes provided by the company. The company has realized that the sales data collected through its “web product configuring system” could contain important information about its customers’ buying behavior, so it has decided to use data mining to extract knowledge from the sales data. This knowledge can be used to guide the company’s manufacturing. Currently the company is interested in knowledge about the “Comfort Package”: which option codes usually go with “Comfort Package 1200”, and which option codes usually go with “Comfort Package 1210”. Based on this information, the company can pre-assemble some options together with “Comfort Package 1210” or “Comfort Package 1200”, thus reducing the response time and improving customer satisfaction. Data is provided in the separate Excel file.
List all the knowledge you find in terms of rules. Submit the confusion matrix and the 10-fold cross-validation accuracy. Justify why you selected those rules.

Hint: Use WEKA’s PART algorithm and make “Comfort Package” the decision variable.

Problem 3 (30%)

An energy company is very interested in improving the combustion efficiency of its coal-fired boilers. High combustion efficiency optimizes fuel usage and has a positive impact on emissions. The company has set up a collection of sensors that continuously monitor and record real-time values of the boiler parameters, such as feeder speed, fan speed, pressure, steam temperature, and megawatt load, all saved to a database every minute. At the same time, the boiler’s efficiency is computed from predefined equations and experimental data. Based on the data collected by the sensors and the calculated boiler efficiency, the company is going to apply a data mining approach to find strong rules for improving combustion efficiency. The boiler operator can then adjust the parameter values according to the strong rules (knowledge) that indicate higher efficiency levels.

Use WEKA’s PART algorithm to mine dataset 2 in the Excel file and find the strong rules for each efficiency level. Submit the strong rules you select and justify why you selected them.

Problem 4 (10%)

Apply the neural network model in WEKA to dataset 2 and tune the parameters of the neural network to see whether you can improve your 10-fold cross-validation accuracy. (Try at least 3 settings of the parameters and report the corresponding cross-validation accuracies.)

Problem 5 (10%)

Apply the simple k-means clustering algorithm in WEKA to dataset 1 with 10 clusters. Try at least 3 different seeds and compare whether the centroids differ. Report your cluster centroids and the corresponding seeds, and explain the results.
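
Note on Problem 1: the entropy and information gain calculation can be cross-checked with a short script. The sketch below is only an illustration and is not part of the assignment or of WEKA; the names ROWS, FEATURES, entropy, and information_gain are hypothetical. It computes plain information gain in bits. J48 implements C4.5, which by default ranks splits by gain ratio rather than raw gain, so agreement on the root attribute should be checked rather than assumed.

```python
from collections import Counter
from math import log2

# Rows of Table 1: (outlook, temperature, humidity, windy, play)
ROWS = [
    ("sunny", "hot", "high", "FALSE", "no"),
    ("sunny", "hot", "high", "TRUE", "no"),
    ("overcast", "hot", "high", "FALSE", "yes"),
    ("rainy", "mild", "high", "FALSE", "yes"),
    ("rainy", "cool", "normal", "FALSE", "yes"),
    ("rainy", "cool", "normal", "TRUE", "no"),
    ("overcast", "cool", "normal", "TRUE", "yes"),
    ("sunny", "mild", "high", "FALSE", "no"),
    ("sunny", "cool", "normal", "FALSE", "yes"),
    ("rainy", "mild", "normal", "FALSE", "yes"),
    ("sunny", "mild", "normal", "TRUE", "yes"),
    ("overcast", "mild", "high", "TRUE", "yes"),
    ("overcast", "hot", "normal", "FALSE", "yes"),
    ("rainy", "mild", "high", "TRUE", "no"),
]
FEATURES = ["outlook", "temperature", "humidity", "windy"]


def entropy(labels):
    """Shannon entropy, in bits, of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(rows, column):
    """Entropy of "play" minus the weighted entropy after splitting on one column."""
    labels = [row[-1] for row in rows]
    remainder = 0.0
    for value in set(row[column] for row in rows):
        subset = [row[-1] for row in rows if row[column] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return entropy(labels) - remainder


# Gain of each feature, printed from highest to lowest.
gains = {name: information_gain(ROWS, i) for i, name in enumerate(FEATURES)}
for name, gain in sorted(gains.items(), key=lambda item: -item[1]):
    print(f"{name:12s} gain = {gain:.3f} bits")
```

The feature printed first is the candidate root under plain information gain; comparing it with the root of the J48 tree answers the second part of Problem 1.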