CPSC 6127 - Zanev - Columbus State University

Course Description and Objectives
Textbook
Software
Methods of Instruction
Evaluation
Student Responsibilities
Attendance Policy
Academic Dishonesty
ADA Accommodation Notice
Instructor: Dr. Vladimir Zanev
Office Location/Phone Number: CCT 442/ 569-3056
Office Hours: Mon-Thu, 3:00 p.m. - 4:00 p.m., Fri: 10:00 a.m.-11:00 a.m.
E-mail: WebCT class e-mail or zanev_vladimir@colstate.edu
Website: http://webct.colstate.edu
http://csc.colstate.edu/zanev/current_courses.asp
This course is offered as an online class in the Spring 2007 semester. The class meets
100% online at http://webct.colstate.edu
Online Interface:
WebCT Vista will be the primary method of online interaction in this course. Course
materials (course outline, schedule, assignments, projects, course notes, datasets,
discussions, resources, and grading) will be available through WebCT Vista. You can
access WebCT Vista at: http://webct.colstate.edu
At this page, click on the "Log-in" link to activate the WebCT Vista logon dialog box,
which will ask for your WebCT Vista username and password. Your WebCT Vista
username and password are:
Username: lastname_firstname
Password: DDMMYY
where DDMMYY is the student's birth date (for example, a birthday of Oct. 25, 1978 is
251078).
If you try the above and WebCT Vista will not let you in, please use the
"Comments/Problems" link at the bottom of the WebCT Vista home page to request help.
If you are still having problems gaining access a day or so after the class begins, please email me. Once you have clicked on the course's name and accessed the course itself, you
will find a home page with links to other sections and tools, and a menu on the left-hand
side. This course homepage and the left-hand menu will give you access to all course
materials.
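The DDMMYY password format described above can be checked with a short Python sketch. The function name `webct_password` is hypothetical and is used only for illustration; it is not part of WebCT Vista.

```python
from datetime import date


def webct_password(birth_date: date) -> str:
    """Format a birth date as DDMMYY: two-digit day, month, and year.

    Hypothetical helper for illustration only; it mirrors the format
    described in this syllabus and is not a WebCT Vista function.
    """
    return birth_date.strftime("%d%m%y")


# The syllabus example: a birthday of Oct. 25, 1978.
print(webct_password(date(1978, 10, 25)))  # prints 251078
```

Note that single-digit days and months are zero-padded, so a birthday of March 7, 1985 would give 070385.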
Course Description and Objectives
Course Description:
Prerequisites: CPSC 5115 Algorithm Analysis and Design, CPSC 5138 Advanced DBMS.
These prerequisites will not be enforced. Consider them suggested background that
will make this course considerably easier. You are not required to have taken the
courses above. However, completing them and/or having a working knowledge of the
respective areas will greatly help you to succeed in this class.
This course is an introduction to data mining. Recent advances in database technology along
with the phenomenal growth of the Internet have resulted in an explosion of data collected,
stored, and disseminated by various organizations. Because of its massive size, it is difficult for
analysts to sift through the data even though it may contain useful information. Data mining holds
great promise to address this problem by providing efficient techniques to uncover useful
information hidden in the large data repositories. Data mining is a modern area of computer
science concerned with the automated or convenient extraction of patterns that represent
previously unknown knowledge implicitly stored in large databases, data warehouses,
and other massive information repositories. In this course we will approach the data
mining problem from the position of database design and programming. We will discuss
suitable data models, data preparation, and finally the different methods and algorithms one
can implement to discover new knowledge from raw data. The key objectives of this course
are two-fold: (1) to teach the fundamental concepts of data mining and (2) to provide extensive
hands-on experience in applying the concepts to real-world applications. The core topics to be
covered in this course include:
data and exploring/preprocessing data
classification data mining algorithms and methods
association analysis data mining algorithms and methods
cluster data mining algorithms and methods
SQL Server 2005 data mining environment
Expected Outcomes
At the completion of this course, students will have an understanding and knowledge of:
What is data mining?
Data and exploring data: sampling, data cleaning, feature selection, and dimensionality reduction
Classification: basic concepts, decision trees, model evaluation
Classification: naive Bayes, time series, neural networks
Association analysis: basic concepts and algorithms, Apriori algorithm
Cluster analysis: basic concepts and algorithms, partitional and hierarchical clustering methods
SQL Server 2005 environment, tools, and algorithms
How to use SQL Server 2005 for data mining
Textbook
Textbooks - required
Title: Introduction to Data Mining
Authors: Pang-Ning Tan, Michael Steinbach and Vipin
Kumar
Edition: 2006
Publisher: Addison-Wesley
ISBN: 0-321-32136-7
Title: Data Mining with SQL Server 2005
Authors: ZhaoHui Tang and Jamie MacLennan
Edition: 2005
Publisher: Wiley Publishing Inc.
ISBN: 0-471-46261-6
Software
To complete all lessons, the data mining project, assignments, discussions, and exams,
you will need a computer with:
Windows 2000/XP, Internet Explorer, PowerPoint, and Word
SQL Server 2005 (see the Resources Web page for details on how to obtain SQL Server 2005)
Access to WebCT Vista at CSU
Methods of Instruction
Online Study
Assignments
Data Mining Project
Discussions
Midterm Exam
Final Exam
Online Study
Each student is expected to complete all readings from the textbooks following the course
schedule. Make your own notes. You can use your own notes during the exams.
Assignments
Four to six assignments will be given that build upon the concepts covered in the
textbooks and have to be completed on your own time. Assignment deadlines are not
flexible for any reason. Late assignments are not accepted for credit. Assignment
submissions are usually via WebCT Vista email.
Data Mining Project
The purpose of this project is to give you experience with a data mining implementation.
It is an opportunity to apply the concepts, techniques, and tools studied in class to
real data. The project is developed individually. The objective is to implement and run
a data mining algorithm that analyzes real data sets. You can use SQL Server 2005 as the
implementation tool or other data mining software (see the Resources Web page).
Discussions
A special Discussion Board with three discussions will be opened in the course WebCT
site. Online discussions will be based on the discussion questions posted by the instructor
(threaded discussions). Your participation in the discussions will be evaluated through
your contributions (questions, answers, remarks, and essays) in the three discussions. For
details see the Discussion area on the WebCT class Web site.
Exams
Your performance in this class will be measured by two online exams: the Midterm Exam
and the Final Exam. No make-up tests will be given unless an exam was missed due to a
documented emergency. The exams will be closed textbook, but you can use your own
notes. Questions on the exams may include the following:
problem solving
essay questions
multiple choice answer selection
filling in the blanks
Evaluation
The final grade will be obtained from the following:
Assignments     20%
Discussions     15%
Project         20%
Midterm Exam    20%
Final Exam      25%
The letter grade will be assigned as follows:
Grade   Points
A       90-100
B       80-89
C       70-79
D       60-69
F       0-59
Grading Example:
Assignments     85, 90, 80, 70
Discussions     85, 90, 90
Project         95
Midterm Exam    80
Final Exam      94
G = (85+90+80+70)/4*0.2 + (85+90+90)/3*0.15 + 95*0.2 + 80*0.2 + 94*0.25 = 88.00
It is a B.
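The weighted-average calculation can also be sketched in a few lines of Python, using the weights and sample scores from this section. The function and variable names are illustrative assumptions. The syllabus does not say how fractional totals such as 89.5 are rounded, so this sketch compares the unrounded total against the cutoffs.

```python
# Component weights from the Evaluation section of this syllabus.
WEIGHTS = {
    "assignments": 0.20,
    "discussions": 0.15,
    "project": 0.20,
    "midterm": 0.20,
    "final": 0.25,
}


def mean(scores):
    """Arithmetic mean of a non-empty list of scores."""
    return sum(scores) / len(scores)


def final_grade(scores):
    """Weighted sum of the per-component average scores."""
    return sum(WEIGHTS[name] * mean(scores[name]) for name in WEIGHTS)


def letter_grade(points):
    """Map points to a letter using the 90/80/70/60 cutoffs.

    Assumption: no rounding is applied before the comparison.
    """
    for cutoff, letter in ((90, "A"), (80, "B"), (70, "C"), (60, "D")):
        if points >= cutoff:
            return letter
    return "F"


# Sample scores from the Grading Example.
example = {
    "assignments": [85, 90, 80, 70],
    "discussions": [85, 90, 90],
    "project": [95],
    "midterm": [80],
    "final": [94],
}

g = final_grade(example)
print(f"{g:.2f} -> {letter_grade(g)}")  # prints 88.00 -> B
```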
Student Responsibilities
Each student is responsible for managing his/her time and maintaining the discipline
required to meet the course requirements.
Each student is responsible for reading from the textbooks all topics covered in the
class.
Each student is responsible for reading from the textbook all chapter topics,
bibliographic notes, and summaries.
Each student is responsible for completing the data mining project and all discussions.
Each student is responsible for adhering to all course deadlines.
Each student is responsible for taking the exams as they are scheduled in the course
schedule.
“I didn’t know” is not an acceptable excuse for failing to meet the course requirements.
Students who fail to meet their responsibilities do so at their own risk.
Attendance Policy
Attendance at all classes and other activities (lecture periods, laboratory sessions, tests,
examinations, or other scheduled meetings) is required of every student at Columbus State
University. The attendance record begins with the first meeting of the class, and one who
registers late is responsible for class work missed. Students should note that the Computer
Science Faculty does not initiate "class drops". A student wishing to drop should
complete the official procedure before the deadline. Those who violate the attendance
policy after that deadline may receive an "F" at the discretion of the instructor. After the
midpoint of the semester, no drop slip will be signed by the Dean unless extreme
circumstances can be proved.
Academic Dishonesty
Academic Dishonesty: Academic dishonesty includes, but is not limited to, activities
such as cheating and plagiarism
(http://aa.colstate.edu/advising/a.htm#AcademicDishonesty/Academic Misconduct). It is
a basis for disciplinary action. Any work turned in for individual credit must be entirely
the work of the student submitting the work. All work must be your own. You may
share ideas but submitting identical assignments (for example) will be considered
cheating. You may discuss the material in the course and help one another with
debugging; however, any work you hand in for a grade must be your own. A simple
way to avoid inadvertent plagiarism is to talk about the assignments, but don't read each
other's work or write solutions together unless otherwise directed. For your own
protection, keep scratch paper and old versions of assignments to establish ownership,
until after the assignment has been graded and returned to you. If you have any
questions about this, please see me immediately. For assignments, access to notes, the
course textbooks, books, and other publications is allowed. All work that is not your own
MUST be properly cited. This includes any material found on the Internet. Stealing or
giving or receiving any code, diagrams, drawings, text or designs from another person
(CSU or non-CSU, including the Internet) is not allowed. Having access to another
person’s work on the computer system or giving access to your work to another person is
not allowed. It is your responsibility to keep your work confidential.
No cheating in any form will be tolerated. Penalties for academic dishonesty may
include:
a zero grade on the assignment or exam/quiz
a failing grade for the course
suspension from the Computer Science program
dismissal from the Computer Science program.
All instances of cheating will be documented in writing with a copy placed in the
Department’s files. Students will be expected to discuss the academic misconduct with
the faculty members and the chairperson. For more details see the Faculty
Handbook: http://aa.colstate.edu/faculty/FacHandbook0203/sec100.htm#109.14 and the
Student Handbook: http://sa.colstate.edu/handbook/handbook2003.pdf
ADA Accommodation Notice
If you have a documented disability as described by the Rehabilitation Act of 1973 (P.L.
93-112 Section 504) and the Americans with Disabilities Act (ADA) and would like to
request academic and/or physical accommodations, please contact the Office of Disability
Services in the Center for Academic Support and Student Retention, Tucker Hall 100 or
at (706) 568-2330, as soon as possible. Course requirements will not be waived but
reasonable accommodations may be provided as appropriate.
Tentative Schedule
CPSC 6127 is an online class, so the schedule below is flexible and should be read as a suggestion. However, the scheduled
studies, software installation, assignments, project, and discussions have to be completed on a strict weekly
basis. The exam dates and times are firm.
Textbooks:
DM: P. Tan, M. Steinbach, V. Kumar, Introduction to Data Mining
SS05: Z. Tang, J. MacLennan, Data Mining with SQL Server 2005
Week  Date   Lecture Topics
1     01/09  Class organization and administration. WebCT class site.
      01/11  Install SQL Server 2005 (see the Resources Web page for details). SQL Server 2005
             will be used with the SS05 textbook. SQL Server 2005 environment: Documentation
             and Tutorials, Business Intelligence Developer Studio.
2     01/16  Chapter 1. Introduction, DM; Chapter 1. Introduction to Data Mining, SS05
      01/18  Chapter 2. Data, DM; Chapter 2. OLE DB for Data Mining, SS05
3     01/23  Chapter 2. Data, DM; Chapter 2. OLE DB for Data Mining, SS05
      01/25  Chapter 2. Data, DM; Chapter 2. OLE DB for Data Mining, SS05
4     01/30  Chapter 2. Data, DM; Chapter 3. Using SQL Server Data Mining, SS05
      02/01  Chapter 3. Exploring Data, DM; Chapter 3. Using SQL Server Data Mining, SS05
5     02/06  Chapter 3. Exploring Data, DM
      02/08  Chapter 3. Exploring Data, DM
6     02/13  Chapter 4. Classification: Basic Concepts, Decision Trees, and Model Evaluation, DM;
             Chapter 5. Microsoft Decision Trees, SS05
      02/15  Chapter 4. Classification: Basic Concepts, Decision Trees, and Model Evaluation, DM;
             Chapter 5. Microsoft Decision Trees, SS05
7     02/20  Chapter 4. Classification: Basic Concepts, Decision Trees, and Model Evaluation, DM;
             Chapter 5. Microsoft Decision Trees, SS05
      02/22  Review for Midterm Exam (Chapters 1-4, DM; Chapters 1, 2, 3, and 5, SS05)
8     02/27  Midterm Exam on February 27, 2007, Tue
      03/01  Midpoint of semester, March 1, Thu. Chapter 6. Microsoft Time Series, SS05
9            March 5-9. Spring Break. No classes.
10    03/13  Chapter 11. Mining OLAP Cubes, SS05
      03/15  Chapter 5. Classification: Alternative Techniques (sections 5.1-5.4), DM;
             Chapter 4. Microsoft Naïve Bayes, SS05; Chapter 10. Microsoft Neural Networks, SS05
11    03/20  Chapter 5. Classification: Alternative Techniques (sections 5.1-5.4), DM;
             Chapter 4. Microsoft Naïve Bayes, SS05; Chapter 10. Microsoft Neural Networks, SS05
      03/22  Chapter 5. Classification: Alternative Techniques (sections 5.1-5.4), DM;
             Chapter 4. Microsoft Naïve Bayes, SS05; Chapter 10. Microsoft Neural Networks, SS05
12    03/27  Chapter 5. Classification: Alternative Techniques (sections 5.1-5.4), DM;
             Chapter 4. Microsoft Naïve Bayes, SS05; Chapter 10. Microsoft Neural Networks, SS05
      03/29  Chapter 5. Classification: Alternative Techniques (sections 5.1-5.4), DM;
             Chapter 4. Microsoft Naïve Bayes, SS05; Chapter 10. Microsoft Neural Networks, SS05
13    04/03  Chapter 6. Association Analysis: Basic Concepts and Algorithms (sections 6.1-6.4), DM;
             Chapter 9. Microsoft Association Rules, SS05
      04/05  Chapter 6. Association Analysis: Basic Concepts and Algorithms (sections 6.1-6.4), DM;
             Chapter 9. Microsoft Association Rules, SS05
14    04/10  Chapter 6. Association Analysis: Basic Concepts and Algorithms (sections 6.1-6.4), DM;
             Chapter 9. Microsoft Association Rules, SS05
      04/12  Chapter 6. Association Analysis: Basic Concepts and Algorithms (sections 6.1-6.4), DM;
             Chapter 9. Microsoft Association Rules, SS05
15    04/17  Chapter 8. Cluster Analysis: Basic Concepts and Algorithms, DM;
             Chapter 7. Microsoft Clustering, SS05
      04/19  Chapter 8. Cluster Analysis: Basic Concepts and Algorithms, DM;
             Chapter 7. Microsoft Clustering, SS05
16    04/24  Chapter 8. Cluster Analysis: Basic Concepts and Algorithms, DM;
             Chapter 7. Microsoft Clustering, SS05
      04/26  Review for the Final Exam
17    05/03  Final Exam on May 3, 2007, Thu

Assignments, Project, Discussions, and Exams
Discussion 1 opened on 01/18/2007, Thu
Begin Assignment 1
Begin Project
Discussion 1 closed on 02/10/2007, Sat
Assignment 1 due on 02/13/2007, Tue
Begin Assignment 2
Discussion 2 opened on 02/15/2007, Thu
Project Proposal due on 02/20/2007, Tue
Begin Assignment 3
Assignment 2 due on 02/22/2007, Thu
Midterm Exam on 02/27/2007, Tue
Assignment 3 due on 03/15/2007, Thu
Discussion 2 closed on 03/29/2007, Thu
Begin Assignment 4
Discussion 3 opened on 04/03/2007, Tue
Assignment 4 due on 04/10/2007, Tue
Begin Assignment 5
Assignment 5 due on 04/24/2007, Tue
Project Final Report and Project Implementation due on 04/26/2007, Thu
Discussion 3 closed on 04/27/2007, Fri
Final Exam on 05/03/2007, Thu