CS490/584 – Data Mining Homework 1

NAME ______Jacob Adams______________________________ SID ________800165301___

Briefly answer each of the following questions. (10 points each)

1. What is a "decision list"?

A: A set of rules learned from data that are intended to be interpreted in sequence: the first rule whose condition matches an instance determines its class.

2. What does it mean to say that a set of rules is "complete"?

A: It means that at least one rule applies to every possible combination of attribute values.

3. Briefly describe the differences between the following approaches for the integration of a data mining system with a database or data warehouse system: no coupling, loose coupling, semi-tight coupling, and tight coupling. State and explain which approach is most popular based on the information you can find on the web. Be sure to include the link.

A: No coupling means that the data mining system will not make use of a database or data warehouse at all. The data mining system must itself perform the collecting, cleaning, organizing, and transforming of data that the database or data warehouse would otherwise handle.

Loose coupling means that the data mining system will make use of some, but not all, of the operations provided by the database or data warehouse. These systems often use querying and indexing functionality, but not query optimization.

Semi-tight coupling means that the database or data warehouse is not only linked to the data mining system, but can also perform basic data mining tasks and store intermediate mining results in order to improve performance.

Tight coupling means that a data mining system is completely and smoothly integrated into the database/data warehouse system, such that the entire system is considered one functional component.

Tight coupling is preferred to loose coupling since it provides higher performance. One study even showed that tight coupling had an almost twofold performance advantage over loose coupling.

http://www.almaden.ibm.com/cs/projects/iis/hdb/Publications/papers/kdd96_udf.pdf

4. Write a rule based on the following weather data. Note that your rule should (a) correctly classify one or more of the instances and (b) not misclassify any instance.

outlook    temperature  humidity  windy  play
sunny      hot          high      FALSE  no
sunny      hot          low       TRUE   yes
sunny      normal       high      TRUE   no
overcast   normal       normal    FALSE  yes

A: If humidity = high then play = no

5. Discuss the differences and similarity between a data warehouse and a database.

A: Both are databases and both store data. Regular databases are intended to store the current state of the data, so both reads and writes are allowed. Data warehouses are designed for querying and analysis. They can contain data from several databases and can hold the state of the data over a period of time. After data is entered into a data warehouse, it is typically non-volatile.

6. Recent applications pay special attention to spatiotemporal data streams. A spatiotemporal data stream contains spatial information that changes over time, and is in the form of stream data, i.e., the data flowing in-and-out like possibly infinite streams.

(a) Present an application example of spatiotemporal data streams.

A: Highway traffic

(b) Discuss what kind of interesting knowledge can be mined from such data streams, with limited time and resources.

A: Outlier detection, anomaly detection, rare event detection, surprising patterns, concept drift, emerging events

(c) Identify and discuss the major challenges in spatiotemporal data mining.

A: There is a large amount of data constantly being created. This means that the processing either has to be limited or very efficient. The data is also coming from an array of different places, which may themselves be changing. This means that some data sources may be slow to report or may be missing entirely. The data mining system needs to be flexible enough to handle this.

(d) Sketch a method to mine one kind of knowledge from such stream data efficiently.
A: If there were several speed sensors at various points along a highway, you could monitor the average speed of the traffic in each section. All that would have to be stored is the current average and the number of instances for each location. When a new instance at that location is recorded, the new average can easily be calculated as (old average × n + new speed) / (n + 1), where n is the number of instances recorded so far, and then stored again. Short-term averages, such as over the last minute or hour, could also be kept in a similar fashion. Unusually fast or slow speeds could be stored in a separate table for analysis later on. The average speed in each section could also be used to dynamically change the speed limit in adjacent sections of highway.

http://www.academypublisher.com/jcp/vol01/no03/jcp01034350.pdf
http://www.springerlink.com/content/k3hq90812024777m/
http://www.cs.purdue.edu/research/technical_reports/2006/TR%2006-020.pdf
http://www-users.cs.umn.edu/~mokbel/demos/PlaceDemo5.pdf

You are encouraged to use info from the web for this question. Be sure to include the link.

Due Friday, January 23rd at 10 AM. Submit a softcopy on Moodle @ classes.cs.siue.edu
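
The running-average method described in the answer to 6(d) could be sketched roughly as follows. This is only an illustration under stated assumptions: the class names, the section label, the 50% tolerance, and the three-reading warm-up are all hypothetical choices, not part of the original answer.

```python
from dataclasses import dataclass


@dataclass
class SectionStats:
    """Running statistics for one highway section (hypothetical structure)."""
    count: int = 0
    mean: float = 0.0

    def update(self, speed: float) -> None:
        # Incremental mean: only the count and the current average are kept,
        # so memory use per section is constant no matter how long the stream runs.
        self.count += 1
        self.mean += (speed - self.mean) / self.count


class HighwayMonitor:
    """Tracks the average speed per section and logs unusual readings."""

    def __init__(self, tolerance: float = 0.5, warmup: int = 3):
        # tolerance: assumed fraction of the mean beyond which a reading is "unusual".
        # warmup: assumed number of readings required before anything is flagged.
        self.tolerance = tolerance
        self.warmup = warmup
        self.sections = {}   # section id -> SectionStats
        self.unusual = []    # (section id, speed) pairs kept for later analysis

    def record(self, section: str, speed: float) -> float:
        stats = self.sections.setdefault(section, SectionStats())
        # Compare against the average seen so far, before folding in the new reading.
        if (stats.count >= self.warmup
                and abs(speed - stats.mean) > self.tolerance * stats.mean):
            self.unusual.append((section, speed))
        stats.update(speed)
        return stats.mean
```

Calling `record("mile_12", speed)` for each sensor reading returns the updated average for that section, and `unusual` plays the role of the separate table of unusually fast or slow readings. A short-term average over the last minute or hour would need a different structure, such as a time-bounded window, which this sketch does not attempt.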