HW1 - jakeadams

CS490/584 – Data Mining
Homework 1
NAME ______Jacob Adams______________________________ SID ________800165301___
Briefly answer each of the following questions. (10 points each)
1. What is a "decision list"?
A: A set of rules learned from data that are intended to be interpreted in sequence.
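To illustrate "interpreted in sequence", here is a minimal sketch of a decision list in Python. The two rules, the attribute names, and the default class are hypothetical and chosen only for illustration. The point is that the first matching rule decides the class, so the order of the rules matters.

```python
from typing import Callable, Dict, List, Tuple

Instance = Dict[str, str]
# A rule pairs a condition with the class it predicts.
Rule = Tuple[Callable[[Instance], bool], str]

# Hypothetical rules, listed in the order they must be tried.
RULES: List[Rule] = [
    (lambda x: x["outlook"] == "overcast", "yes"),
    (lambda x: x["humidity"] == "high", "no"),
]
DEFAULT = "yes"  # hypothetical fallback used when no rule fires


def classify(instance: Instance) -> str:
    """Return the class of the first rule whose condition matches."""
    for condition, label in RULES:
        if condition(instance):
            return label
    return DEFAULT
```

An instance with outlook = overcast and humidity = high matches both rules, but the list returns "yes" because the overcast rule comes first. Swapping the two rules would change the answer to "no".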
2. What does it mean to say that a set of rules is "complete"?
A: It means that at least one rule applies to every possible combination of
attribute values.
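For small attribute domains, completeness can be checked by brute force: enumerate every combination of attribute values and look for one that no rule covers. The sketch below assumes hypothetical domains and rules, with each rule written as a conjunction of attribute = value tests.

```python
from itertools import product

# Hypothetical attribute domains, for illustration only.
DOMAINS = {
    "outlook": ["sunny", "overcast", "rainy"],
    "humidity": ["high", "normal"],
}

# Each rule is a conjunction of attribute = value tests.
# An empty dict has no tests, so it covers every instance (a default rule).
RULES = [
    {"outlook": "overcast"},
    {"humidity": "high"},
]


def covers(rule, instance):
    """True if the instance satisfies every test in the rule."""
    return all(instance[attr] == value for attr, value in rule.items())


def uncovered(rules, domains):
    """List every attribute combination that no rule covers."""
    names = list(domains)
    gaps = []
    for values in product(*(domains[name] for name in names)):
        instance = dict(zip(names, values))
        if not any(covers(rule, instance) for rule in rules):
            gaps.append(instance)
    return gaps
```

With these two rules, `uncovered(RULES, DOMAINS)` reports the sunny/normal and rainy/normal combinations, so the set is not complete. Appending a default rule (`{}`) closes the gaps.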
3. Briefly describe the differences between the following approaches for the integration of a
data mining system with a database or data warehouse system: no coupling, loose
coupling, semi-tight coupling, and tight coupling. State and explain which approach is
most popular based on the information you can find on the web. Be sure to include the
link.
A: No coupling means that the data mining system will not make use of a database
or data warehouse at all. The data mining system must itself perform the cleaning,
organizing, collecting, and transforming that the database or data warehouse would
otherwise provide.
Loose coupling means that the data mining system will make use of some, but not all
of the operations provided by the database or data warehouse. These systems often
use querying and indexing functionality, but not query optimization.
Semi-tight coupling means that the database or data warehouse is not only linked to the
data mining system, but can also perform basic data mining tasks and store
intermediate mining results in order to improve performance.
Tight coupling means that a data mining system is completely and smoothly
integrated into the database/data warehouse system, such that the entire system is
considered one functional component.
Tight coupling is preferred to loose coupling since it provides higher performance.
One study even showed that tight coupling had almost a twofold performance
advantage over loose coupling.
http://www.almaden.ibm.com/cs/projects/iis/hdb/Publications/papers/kdd96_udf.pdf
4. Write a rule based on the following weather data. Note that your rule should (a) correctly
classify one or more of the instances and (b) not misclassify any instance.
outlook   temperature  humidity  windy  play
sunny     hot          high      FALSE  no
sunny     hot          low       TRUE   yes
sunny     normal       high      TRUE   no
overcast  normal       normal    FALSE  yes
A: If humidity = high then play=no
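The rule can be checked mechanically against both conditions in the question. The sketch below encodes the four instances as read from the question's table (values lowercased, which is an assumption about the flattened layout) and counts how many instances the rule classifies correctly and how many it misclassifies. Instances where the rule does not fire are skipped.

```python
# The four instances as read from the question's table (values lowercased).
DATA = [
    {"outlook": "sunny", "temperature": "hot", "humidity": "high", "windy": False, "play": "no"},
    {"outlook": "sunny", "temperature": "hot", "humidity": "low", "windy": True, "play": "yes"},
    {"outlook": "sunny", "temperature": "normal", "humidity": "high", "windy": True, "play": "no"},
    {"outlook": "overcast", "temperature": "normal", "humidity": "normal", "windy": False, "play": "yes"},
]


def humidity_rule(row):
    """If humidity = high then play = no; None means the rule does not fire."""
    return "no" if row["humidity"] == "high" else None


def evaluate(rule, data):
    """Return (correct, misclassified) over the instances the rule fires on."""
    correct = wrong = 0
    for row in data:
        prediction = rule(row)
        if prediction is None:
            continue  # the rule makes no claim about this instance
        if prediction == row["play"]:
            correct += 1
        else:
            wrong += 1
    return correct, wrong
```

Here `evaluate(humidity_rule, DATA)` gives `(2, 0)`: the rule fires on the two high-humidity instances, classifies both correctly, and misclassifies none.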
5. Discuss the differences and similarity between a data warehouse and a database.
A: Both are databases and both store data. Regular databases are intended to store
the current state of the data, so both reads and writes are allowed. Data warehouses
are designed for querying and analysis. They can contain data from several
databases and can hold the state of the data over a period of time. After data
is entered into a data warehouse, it is typically non-volatile.
6. Recent applications pay special attention to spatiotemporal data streams. A
spatiotemporal data stream contains spatial information that changes over time, and is in
the form of stream data, i.e., the data flowing in-and-out like possibly infinite streams.
(a) Present an application example of spatiotemporal data streams.
A: Highway traffic
(b) Discuss what kind of interesting knowledge can be mined from such data streams,
with limited time and resources.
A: Outlier detection, anomaly detection, rare event detection, surprising
patterns, concept drift, and emerging events.
(c) Identify and discuss the major challenges in spatiotemporal data mining.
A: There is a large amount of data constantly being created. This means that
the processing either has to be limited or very efficient. The data is also
coming from an array of different places which may be changing. This
means that some data sources may be slow to report or completely
nonexistent. The data mining system needs to be flexible enough to handle
this.
(d) Sketch a method to mine one kind of knowledge from such stream data
efficiently.
A: If there were several speed sensors at various points along a highway, you
could monitor the average speed of the traffic at each given section. All that
would have to be stored is the current average and the number of instances
for each location. When a new instance at that location is recorded, a new
average can easily be calculated and stored again. Short-term averages, such as
over the last minute or hour, could also be kept in a similar fashion.
Unusually fast or slow times could be stored in a separate table for analysis
later on. The average speed in each section could also be used to dynamically
change the speed limit in adjacent sections of highway.
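The running-average idea above can be sketched in a few lines of Python. Only a count and a mean are kept per highway section, and each new reading updates the mean incrementally without storing past readings. The names `SectionStats`, `record`, and `THRESHOLD`, as well as the threshold value and the in-memory `outliers` list standing in for the separate table, are assumptions made for illustration.

```python
from collections import defaultdict


class SectionStats:
    """Running count and mean speed for one highway section."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0

    def update(self, speed):
        # Incremental mean: new_mean = old_mean + (x - old_mean) / n
        self.count += 1
        self.mean += (speed - self.mean) / self.count
        return self.mean


THRESHOLD = 20.0  # hypothetical deviation that counts as unusually fast or slow

stats = defaultdict(SectionStats)  # section id -> running statistics
outliers = []                      # stands in for the separate outlier table


def record(section, speed):
    """Process one sensor reading and return the updated section average."""
    section_stats = stats[section]
    # Compare against the average seen so far, before folding the reading in.
    if section_stats.count > 0 and abs(speed - section_stats.mean) > THRESHOLD:
        outliers.append((section, speed))
    return section_stats.update(speed)
```

For example, after readings of 60 and 70 in one section the average is 65, so a following reading of 30 is logged as an outlier before it is folded into the average. A short-term average could be kept the same way by resetting the count and mean at the start of each window.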
http://www.academypublisher.com/jcp/vol01/no03/jcp01034350.pdf
http://www.springerlink.com/content/k3hq90812024777m/
http://www.cs.purdue.edu/research/technical_reports/2006/TR%2006-020.pdf
http://www-users.cs.umn.edu/~mokbel/demos/PlaceDemo5.pdf
You are encouraged to use info from the web for this question. Be sure to include the
link.
Due Friday, January 23rd at 10 AM. Submit a softcopy on Moodle @ classes.cs.siue.edu