Chapter 1 Case Studies
Introduction to Data Mining with Case Studies
Author: G. K. Gupta
Prentice Hall India, 2006
27 November 2008 ©GKGupta

Case Study – Aviation

Wipro (a large Indian IT company) reported a study of frequent flyer data from an Indian airline. Before data mining was carried out, the data was selected and prepared. For example, it was decided to use only the three most common sectors flown by each customer and the three most common sectors on which each customer redeemed points.

It was discovered that much of the data supplied by the airline was incomplete or inaccurate. It was also found that the customer data captured by the company could have been more complete. For example, the airline did not know its customers' marital status, their income or their reasons for taking a journey.

Case Study – Astronomy

Astronomers produce huge amounts of data every night on the fluctuating intensity of around 20 million stars, which are classified by their spectra and their surface temperature.

Some 90% of stars are called main sequence stars, including some stars that are very large, very hot and blue in colour. The main sequence stars are fuelled by nuclear fusion and are very stable, lasting billions of years. Smaller main sequence stars include the Sun (star type G in the table below).

There are a number of classes, including stars called yellow dwarf, red dwarf and white dwarf.
We show the seven major classes:

Different Types of Stars

Star type  Colour           Approximate temperature  Average brightness (Sun = 1)  Average radius (Sun = 1)
O          Blue             > 25,000 K               > 1 million                   60
B          Blue             11,000 to 25,000 K       20,000                        18
A          Blue             7,500 to 11,000 K        80                            3.2
F          Blue to White    6,000 to 7,500 K         6                             1.7
G          White to Yellow  5,000 to 6,000 K         1.2                           1.1
K          Orange to Red    3,500 to 5,000 K         0.4                           0.8
M          Red              < 3,500 K                0.04                          0.3

When a clustering program was used to group a large amount of astronomical data, it found four classes: stars, galaxies with bright central cores, galaxies without bright central cores, and stars with a visible "fuzz" around them. The clustering program found meaningful results without any understanding of astronomical data.

Case Study – Mail Order

A direct mail company held a list of a large number of potential customers, with a response rate of only 1%. The company wanted to improve the response rate.

To carry out data mining, the company first had to prepare the data. This included sampling the data to select a subset of customers that contained both those who had responded to direct mail and those who had not.

For each customer, there were more than 200 variables. These included basic personal information such as the locality where they lived, their gender and their marital status, and their buying habits, including when they last responded to a mailout, how much they spent the last time they responded, and the product they bought the last time.

Using the decision tree approach, the company was able to identify characteristics of customers who were more likely to respond. The company was thus able to reduce the number of customers it mailed to, reducing cost while simultaneously improving the response rate.
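The decision tree idea in the mail order case can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the twelve customer records, the two variables (months since last response, last spend) and the use of Gini impurity to choose the split are hypothetical and are not taken from the case study, which had more than 200 variables and would have grown a much deeper tree. The sketch finds only the single best split (a one-level tree) and then compares the response rate of the selected segment with the overall rate.

```python
# Hypothetical customer records, invented for illustration only:
# (months_since_last_response, last_spend, responded)
CUSTOMERS = [
    (2, 120, 1), (3, 80, 1), (1, 40, 1), (4, 95, 0),
    (12, 30, 0), (18, 60, 0), (24, 20, 0), (9, 150, 1),
    (15, 45, 0), (30, 10, 0), (6, 70, 0), (20, 35, 0),
]
FEATURES = ["months_since_last_response", "last_spend"]


def gini(rows):
    """Gini impurity of the responded / did-not-respond labels in rows."""
    if not rows:
        return 0.0
    p = sum(r[-1] for r in rows) / len(rows)
    return 2 * p * (1 - p)


def best_split(rows):
    """Return (feature_index, threshold, weighted_gini) of the best binary split."""
    best = None
    for f in range(len(FEATURES)):
        values = sorted({r[f] for r in rows})
        # Candidate thresholds are midpoints between consecutive distinct values.
        for lo, hi in zip(values, values[1:]):
            threshold = (lo + hi) / 2
            left = [r for r in rows if r[f] <= threshold]
            right = [r for r in rows if r[f] > threshold]
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(rows)
            if best is None or score < best[2]:
                best = (f, threshold, score)
    return best


def response_rate(rows):
    """Fraction of customers in rows who responded."""
    return sum(r[-1] for r in rows) / len(rows)


feature, threshold, score = best_split(CUSTOMERS)
left = [r for r in CUSTOMERS if r[feature] <= threshold]
right = [r for r in CUSTOMERS if r[feature] > threshold]
# Mail only the side of the split with the higher response rate.
segment = left if response_rate(left) >= response_rate(right) else right

print(f"Split on {FEATURES[feature]} at {threshold}")
print(f"Overall response rate: {response_rate(CUSTOMERS):.2f} ({len(CUSTOMERS)} customers)")
print(f"Segment response rate: {response_rate(segment):.2f} ({len(segment)} customers)")
```

On this toy data the selected segment is smaller than the full list and has a higher response rate, which is the effect the case study describes: fewer mailouts and a better response rate.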
Case Study 1A Inventory Control

This case study reports the results of using data mining in inventory control at Medicorp, a US pharmaceutical company that is the largest retail distribution company, with 4100 stores in 25 US states.

Medicorp maintained an inventory worth almost one billion dollars to ensure that any drug required by a customer had a 95% chance of being available from any outlet of the company. To achieve this goal, the company followed a rule of thumb of maintaining "three weeks supply" of every drug.

The study involved collecting relevant data and then carrying out some preliminary studies. Models were developed for predicting demand for various drugs. The models were not very accurate for daily predictions but were more accurate for weekly forecasts and even better for monthly forecasts.

The weekly forecasting model was chosen, since it better suited the company's needs.

The study concluded that the company needed to change its rule of thumb of maintaining three weeks supply of drugs. It recommended that the three weeks be reduced for popular drugs and extended for less popular items, since high-selling items can easily be replenished on a weekly basis. The company was reported to have reduced its inventory by half, resulting in considerable savings.

Case Study 1B Crime Prevention

This case study was published in the magazine IEEE Computer in April 2004.

Crime data was grouped into eight categories: traffic violation, sex crime, theft, fraud, arson, gang/drug offences, violent crimes and cybercrime. The violent crime category includes some of the major crimes, such as murder, assault, armed robbery, sexual and hate crimes.
The study focussed on three aspects of crime: extracting named entities from narrative reports, detecting deceptive criminal identities and identifying criminal groups and key members of the groups.
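To give a feel for what detecting deceptive identities can involve, the sketch below compares two identity records field by field using edit distance and flags pairs that are suspiciously similar. This is one simple approach under stated assumptions, not necessarily the method used in the study: the records, the choice of fields (name and date of birth) and the threshold of 0.2 are all hypothetical.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character edits that turn a into b."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def disagreement(rec_a, rec_b):
    """Mean normalised edit distance over the fields of two records (0 = identical)."""
    total = 0.0
    for x, y in zip(rec_a, rec_b):
        longest = max(len(x), len(y)) or 1  # avoid division by zero on empty fields
        total += levenshtein(x.lower(), y.lower()) / longest
    return total / len(rec_a)


# Hypothetical records: (name, date of birth). Invented for illustration.
r1 = ("John Smith", "1975-03-12")
r2 = ("Jon Smith", "1975-03-21")
r3 = ("Maria Lopez", "1988-11-02")

THRESHOLD = 0.2  # assumed cut-off; a real system would tune this on labelled data

for a, b in [(r1, r2), (r1, r3)]:
    d = disagreement(a, b)
    verdict = "possible same person" if d < THRESHOLD else "different"
    print(f"{a[0]!r} vs {b[0]!r}: disagreement {d:.2f} -> {verdict}")
```

A low disagreement value between two records that claim to be different people is a signal worth investigating, since small deliberate changes to a name or date of birth leave most characters intact.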