CS 563 May Interim 1

KNOWLEDGE DISCOVERY AND DATA MINING

Dr. Christos Nikolopoulos

Office: BR 197

(309) 677-2456 chris@bradley.edu

Class web site: http://hilltop.bradley.edu/~chris/cs563.html

Required Textbook:

Witten I. and Frank E., DATA MINING: Practical Machine Learning Tools and Techniques, Morgan Kaufmann Publishers.

Optional References:

1. Cios et al., DATA MINING: A Knowledge Discovery Approach, Springer, 2007.

2. Chris Nikolopoulos, Expert Systems: An Introduction to First, Second Generation and Hybrid Knowledge based Systems, Marcel Dekker, 1997.

Description:

Advances in Knowledge Discovery and Data Mining bring together the latest research in statistics, databases, machine learning, and artificial intelligence, which together contribute to the rapidly growing field of knowledge discovery and data mining. Topics covered include fundamental issues, knowledge representation, cleaning and preprocessing of data sets, classification and clustering, machine learning algorithms, comparing machine learning algorithms and models, and evaluating performance. The complementary topic of Data Warehousing and OLAP is covered in CS 572, Advanced Databases.

Evaluation:

Tests will be online.

Midterm:

The midterm will be sent by email to each student by 10:00 a.m. on Monday, June 2nd, and each student will have until 10:00 p.m. the next day (Tuesday, June 3rd) to complete it and email the answers back to the instructor. The answers may be typed in a Word file and emailed, or handwritten, scanned, and emailed. All times are Peoria (Central) time.

Final:

The final project replaces the final exam, and the data set to be analyzed must be chosen by Friday 5/23. Send me an email to tell me which data set you chose. The project is due on Friday, June 6th, by 5:00 p.m. It must be written in research paper format (abstract, introduction, main sections, conclusions, bibliography) and submitted by email.

Homework:

Homework assignments are also due by email on Friday June 6th by 5:00 p.m.

The table below gives the reading assignments each day from the books and online sources.

M day 1
Topics for online discussion: Introduction to machine learning tools and techniques
Readings/Assignments: Witten Part I, Chapter 1, pp. 4-39. Check out the class website (http://hilltop.bradley.edu/~chris/cs563.html) for PowerPoint notes, homework assignments, etc. Download WEKA (see link on class web site).

T day 2
Topics for online discussion: Input: Concepts, instances and attributes
Readings/Assignments: Witten Part I, Chapter 2, pp. 41-60

W day 3
Topics for online discussion: Output: Knowledge representation
Readings/Assignments: Witten Part I, Chapter 3, pp. 61-82

TH day 4
Topics for online discussion: Machine learning: the basic methods
Readings/Assignments: Witten Part I, Chapter 4, pp. 83-111, sections 4.1-4.4. Watch video 1.

F day 5
Topics for online discussion: Machine learning: the basic methods
Readings/Assignments: Witten Part I, Chapter 4, pp. 112-139, sections 4.5-4.9. Watch video 2.

M day 6
Topics for online discussion: The WEKA machine learning workbench. Decide on a data set to use for the Final Project/Exam (University of California Irvine Machine Learning data sets, http://archive.ics.uci.edu/ml/) and send an email to the instructor to notify him of which data set you chose.
Readings/Assignments: Witten Part II, Chapter 9, pp. 365-368 and Chapter 10, pp. 369-401. Send me an email on which data set you chose for your project, by 5:00 p.m.

T day 7
Topics for online discussion: The WEKA machine learning workbench
Readings/Assignments: Witten Part II, Chapter 10, pp. 401-423

W day 8
Topics for online discussion: Evaluating what has been learned
Readings/Assignments: Witten Part I, Chapter 5, pp. 143-160. Watch video 3.

TH day 9
Topics for online discussion: Evaluating what has been learned
Readings/Assignments: Witten Part I, Chapter 5, pp. 160-183. Watch video 4.

F day 10
Topics for online discussion: Engineering the input and output, attribute selection, discretizing, automatic data cleansing
Readings/Assignments: Witten Part I, Chapter 7, pp. 285-341

M day 11
Topics for online discussion: MIDTERM EXAM mailed
Readings/Assignments: The take-home test covers chapters 1, 2, 3, 4, 5 from Witten's book. Test emailed to you by 10:00 a.m. Due back next day by 10:00 p.m. by email.

T day 12
Topics for online discussion: MIDTERM EXAM due date
Readings/Assignments: Midterm answers due back by 10:00 p.m. by email.

W day 13
Topics for online discussion: Details on decision trees, classification rules, extending linear models, neural nets
Readings/Assignments: Answers to test due back by 9:00 a.m. Witten Part I, Chapter 6, pp. 187-235, sections 6.1-6.3. Watch video 5.

TH day 14
Topics for online discussion: Instance-based learning, numeric prediction, clustering, Bayesian networks
Readings/Assignments: Witten Part I, Chapter 6, pp. 235-283

F day 15
Topics for online discussion: FINAL PROJECT REPORT DUE and HOMEWORK ASSIGNMENT DUE
Readings/Assignments: By 5:00 p.m. both HW and the PROJECT are due, by email.

Assessment

100 Points Total

50% Midterm Exam

30% Final Data Mining project report (in place of final exam)

20% homework assignments

Some Videos on DM/KD to watch:

Video 1: IIT lecture 1: http://www.bing.com/videos/watch/video/lecture-34-data-miningand-knowledge-discovery/1d0668894dc732fe82b91d0668894dc732fe82b9-83872645560

Video 2: IIT lecture 2: http://www.bing.com/videos/watch/video/lecture-35-data-miningand-knowledge-discovery-part-ii/f2c1c8cfcc5e319417f6f2c1c8cfcc5e319417f6-29437526744

Video 3: DM and KD: http://videolectures.net/mps07_lavrac_dmkd/

Video 4: Data Mining at NASA: http://videolectures.net/kdd09_srivastava_dmnasata/

Video 5: SQL Know How Video: http://www.microsoft.com/showcase/en/us/details/38b7e057-42d2-4a8c-b4d2-3154bc35d87a

More: http://videolectures.net/Top/Computer_Science/Data_Mining/

Data Mining Project (in place of final exam):

The project may be done as a team project (teams of at most two members), but individual projects are also fine if you so choose. The project is open ended and involves applying WEKA to analyze a data set. Which algorithms to use, which are the most appropriate, how to clean the data, etc. is entirely up to you. To find a good data set for your project, look at the machine learning repository at the University of California Irvine's ML site: http://archive.ics.uci.edu/ml/ . The data mining software you will use for your project is WEKA (see Witten's book). Download the WEKA software from the link on my main class page.
