DSCI 5240 – Data Mining Fall 2015

advertisement
SYLLABUS
DSCI 5240 – Data Mining
Fall 2015
CLASS (DAY/TIME):
INSTRUCTOR:
OFFICE HRS:
CONTACT INFO:
Tuesdays 6:30-9:20, BLB 070
Dr. Nick Evangelopoulos
TW 1:00-2:00pm, T 5:00-6:00pm, BLB 365D
OFFICE PHONE: 940-565-3056
E-MAIL (preferred): Nick.Evangelopoulos@unt.edu
Textbooks in printed and PDF file format





Kattamuri Sarma, Predictive Modeling with SAS Enterprise Miner, Second Edition, SAS
Press 2013, ISBN: 978-1-60764-767-6 (required, printed text)
Data Mining Using SAS Enterprise Miner, A Case Study Approach, 3rd Edition, SAS
Publishing 2013 (required, free PDF), or ISBN 978-1-61290-638–6 (printed text)
Getting Started with SAS Enterprise Miner 13.1, SAS Publishing 2013 (required, free
PDF)
Getting Started with SAS Text Miner 13.2, SAS Publishing 2014 (required, free PDF)
Getting Started with SAS Enterprise Miner 5.3, SAS Pub. 2008 (recommended, free PDF)
Software
IBM SPSS Statistics 22, IBM SPSS Modeler 15, SAS Enterprise Miner 13.1, SAS
Text Miner 13.1. All these are available at the CoB lab, physically and via RDS.
Web Site
http://www.cob.unt.edu/itds/courses/DSCI4520/DSCI4520.htm
http://www.cob.unt.edu/itds/faculty/evangelopoulos/evangelopoulos.htm
Purpose of the Course
This course deals with the problem of extracting information from large databases and designing
data-based decision support systems. The extracted knowledge is subsequently used to support
human decision-making in the areas of summarization, prediction, and the explanation of
observed phenomena (e.g. patterns, trends, and customer behavior).
Techniques such as visualization, statistical analysis, decision trees, and neural networks can be
used to discover relationships and patterns that shed light on business problems. This course will
examine methods for transforming massive amounts of data into new and useful information,
uncovering factors that affect purchasing patterns, and identifying potential profitable
investments and opportunities.
Learning Objectives
1. Understand the problems and opportunities when dealing with extremely large databases.
2. Review data visualization software used for interpreting complex patterns in
multidimensional data. Learn to identify what information is useful and what is not.
© 2003-2015 Nicholas Evangelopoulos
3. Provide an understanding of predictive models and algorithms, as well as exploratory
algorithms.
4. Examine all phases of decision making, including discovery and data query, data analysis
and confirmation, presentation, and implementation of results.
Class Attendance
Regular class attendance and informed participation are expected.
Academic Integrity
This course adheres to the UNT policy on academic integrity. The policy can be found at
http://vpaa.unt.edu/academic-integrity.htm. If you engage in academic dishonesty related to this
class, you will receive a failing grade on the test or assignment, or a failing grade in the course.
In addition, the case may be referred to the Dean of Students for appropriate disciplinary action.
Students with Disabilities
The College of Business complies with the Americans with Disabilities Act in making
reasonable accommodations for qualified students with disability. If you have an established
disability and would like to request accommodation, please see your instructor as soon as
possible. You will need to register with the UNT Office for Disability Accommodation.
Deadlines
Dates of drop deadlines, final exams, etc., are published in the university catalog and the
schedule of classes. Please be sure you keep informed about these dates.
Teaching Evaluation
Evaluation of the instructor’s teaching effectiveness by the students is a requirement for all
organized classes at UNT. At the end of the semester/session there will be a short Web-based
survey, and/or an in-class paper survey, providing you a chance to comment on how this class is
taught. I am very interested in the feedback I get from students, as I work to continually improve
my teaching. I consider your participation as an important part of in this class.
Cell Phones
As a courtesy to your instructor and to your fellow classmates, you are asked to silence your cell
phone or set it to vibrate. In case of a personal emergency, if you must use your cell phone, you
are asked to step out of the classroom.
Incomplete Grade (I)
The grade of "I" is not given except for rare and very unusual emergencies, as per University
guidelines. An “I” grade cannot be used to substitute your poor performance in class. If you think
you will not be able to complete the class satisfactorily, please drop the course.
Campus Closures
Should UNT close campus, it is your responsibility to keep checking your official UNT e-mail
account (EagleConnect) to learn if your instructor plans to modify class activities, and how. This
may include changing assignment due dates, rescheduling quizzes and exams, etc.
2
Point Allocation
DSCI 4520
25%
5%
25%
20%
25%
Homework exercises (8 exercises)
In-class quizzes (8 quizzes, 3 dropped)
Mid-term Exam (in-class)
Final Exam (take-home)
Project (4 individual parts and 1 group part)
Graduate Presentation
TOTAL
100%
Bonus for attending/giving a graduate presentation
Bonus for graduate presentation – Option C
Letter Grades:
90% or more = A
70% or more = C
0.5%
80% or more = B
60% or more = D
DSCI 5240
22.5%
4.5%
22.5%
18.0%
22.5%
10.0%
100.0%
0.5%
up to 20%
Below 60% = F
Homework Exercises
There will be 8 homework exercises that you will have to turn in. Exercises will be using IBM
SPSS Statistics, IBM SPSS Modeler, SAS Enterprise Miner, and SAS Text Miner. The
homework exercises ask you to perform certain types of analysis, capture screen shots, and
answer questions. Related handouts and PowerPoint slides with data description, step-by-step
instructions, and assignment details, will be available on Blackboard. HW7 closely follows the
text in Getting Started with SAS Text Miner, referred to as “GSTM text” below. Homework is
turned in electronically using Blackboard, in the form of a report document. If you turn in
your HW report late, 50% of HW credit is awarded.
HW1. Multiple Regression for TargetD using IBM SPSS Statistics. MYRAW data.
HW2. Logistic Regression for TargetB using IBM SPSS Statistics. Small sample effects.
MYRAW data.
HW3. Overview of SEMMA process in SAS Enterprise Miner. Decision Tree and Logistic
Regression. Model comparison. HMEQ data.
HW4. Scoring, Reporting in SAS EM. HMEQ data.
HW5. Clustering in SAS EM. SHOESTORE data.
HW6. Association Analysis in SAS EM. ASSOCS data.
HW7. Text Analytics in SAS Text Miner. Text cleanup, synonyms, stop list, topic extraction,
and predictive modeling using text data. VAEREXT data. Based on the GSTM text.
HW8. Introduction to IBM SPSS Modeler. Decision Tree and Logistic Regression. HMEQ
data.
Term Project
This course has a term project. You will be asked to analyze data related to the KDD-cup 98, an
International competition for professional data miners. The data set will be available on
Blackboard. Handouts describing what you have to do will be distributed in class. During the
first 4 parts you will work individually and submit your work as a Word document that includes
3
screen shots from Enterprise Miner and answers to various questions as described on the
handouts. You will turn in your reports by uploading them on Blackboard. Grading and late
penalty policies for PR1-PR4 are the same as with HW1-HW8. During the last part of the
project you will form groups. Each group will include at least one undergraduate student and
at least one graduate student. The maximum group size will be 6. Groups will be selfmanaged. If the group is not satisfied with some member’s contribution they may choose to
dismiss that person from the group. In such a case, alternative individual assignment will be
given to the dismissed group member. The group will turn in a single PR5 report, listing all
group member names, in printed hard copy format (i.e., brought to class, not uploaded on
Blackboard). A summary of the project parts follows below.
PR1.
PR2.
PR3.
PR4.
PR5.
Topic
Open the data, produce statistics and graphs
Neural Networks
Project Part 2: Regression/Decision Trees
Decision Trees/Regression
Final Written Report. Comparison and evaluation of 3 models
(Decision tree, logistic regression, neural net).
Work type
Individual
Individual
Individual
Individual
Group
Graduate Presentations (for DSCI 5240 students only)
If you are taking DSCI 5240 for credit, you will be asked to prepare a class presentation, on top
of doing the regular DSCI 4520 assignments and tests. Presentation topics (lectures, papers,
projects) will be reserved for presentation on a first-come-first-served basis, so you may want to
decide soon. Contact your instructor by e-mail at your earliest convenience, but no later than
one week before the presentation, to avoid a grade of 0. Your presentation will be graded as
“adequate” (90% to 100%) or problematic (below 90%). The final grade for graduate students
will be calculated using the following weighs:
All undergraduate assignments:
90%
Graduate Presentation:
10% (+ up to 20% for option C)
A. Existing Research Paper Presentation option. You will be asked to select a recent
research journal, practitioner journal, or conference proceedings paper (e.g. from the
SIAM Conference on Data Mining, etc.) and present it to class in your own words. Use
Google Scholar to search for relevant papers. Your presentation should be about 8 to 12
slides and last about 10 minutes. The selected paper will have to be approved by the
instructor, so that the same paper is not presented twice. Presentations in option B are
individual.
B. Original Analysis Using your Own Data option. You will be asked to obtain a data set
and perform your own analysis, using SAS EM, SAS TM, or IBM SPSS Modeler. You
are encouraged to try topic extraction in SAS TM, or a comparison of solutions obtained
using SAS EM and IBM SPSS Modeler, but a single model in SAS EM may be enough,
depending on the complexity of your project. Your presentation should be about 8 to 12
slides and last about 10 minutes.
C. Original Research Project option (up to 20% bonus pts.). With instructor’s approval,
you will propose an original data mining project, collect or download data, and then
4
present your findings. This option may involve a research-oriented project (strongly
recommended for doctoral students). If the project is expected to be time consuming, up
to 20% of your total grade may be offered as a bonus. This bonus is added to your grade
at the end, after your DSCI 4520/5240 assignment grades are multiplied by 0.9 and your
Graduate Presentation grade is added. The purpose of this bonus is to allow you to skip
some of the regular assignments in order to focus on your graduate project. The bonus
will compensate for missing some of the regular DSCI 4520/5240 assignments. You are
particularly encouraged to skip PR5 (this is worth 4.5% of your grade). In order to
receive the full 20% bonus, you need to convince your instructor that you made an effort
to explore funding opportunities or that your project involved publishable research.
Option C could be up to 4 times more time-consuming than options A or B (which are
expected to be equivalent.) See your instructor for more details on this option. Your
presentation should be about 8 to 12 slides and last about 10 minutes. For option C,
working in groups of two graduate students is preferred.
Signing up for a graduate presentation
The instructor will announce in class the minimum and maximum number of students who can
sign up for each option. Students will reserve a spot on a first-come-first-served basis. Signing
up for option C will conclude on the day of the mid-term exam. Signing up for option A or B
will conclude two weeks before your class presentation date.
Providing feedback on other students’ presentations
After you prepare your presentation slides, you will be asked to turn them in one week before
your class presentation date by 5PM. The instructor will post the PowerPoint files on
Blackboard. You will then download and review presentations of the same type (A, B, C) as your
own, prepared by your classmates, and provide feedback to the instructor (excluding your own
presentation) in the form of an e-mail. If you fail to turn in your feedback, you will receive a 0%
(out of 10%) as your Graduate Presentation grade. The process of reviewing and providing
feedback is not expected to be very time-consuming. Feedback e-mails are due 2 days before
presentation date by 5PM.
Giving a presentation and receiving a grade
The instructor will compile the results and invite the top individuals or teams to actually present.
All presentation files will be graded by the instructor and receive a Graduate Presentation grade
(out of 10%.) If you are selected to present to class, you will receive a 0.5% presentation bonus.
If you are selected to give a presentation but you fail to show up, you will receive a 5% (out of
10%) as your Graduate Presentation grade.
5
DSCI 4520/5240 TIME SCHEDULE – Fall 2015
The schedule below is a tentative outline for the semester. It is meant to be a guide and several
items are subject to change. Certain topics may be stressed more or less than indicated.
Date
Topics
Aug. 25
Intro to Data Mining, Ch1
Multiple Linear Regression
Sept. 1
Logistic Regression, Ch 6
Stepwise Procedure
Assignment due
HW1 (regression in SPSS, MYRAW)
Sept. 8
SEMMA, CRISP-DM,
Model comparison, Ch7
HW2 (Log Reg in SPSS, MYRAW)
Sept. 15
Scoring & Deployment
HW3 (SEMMA, HMEQ)
Sept. 22
Decision Trees, Sarma Ch 4
HW4 (scoring, HMEQ)
Sept. 29
Decision Trees, Sarma Ch 4
Neural Networks, Sarma Ch 5
PR1 (data explor., DONOR_RAW1)
Oct. 6
Clustering Analysis
PR2 (neural nets, DONOR_RAW1)
Oct. 13
Review for Exam 1
“Project Part 2”:
PR3 (trees/reg, DONOR_RAW1…)
Oct. 20
Buffer lecture (use as needed)
PR4 (reg/trees, DONOR_RAW1…)
Oct. 27
*** Exam 1 (in-class) ***
Grad Pres. signup deadline (type C only)
Nov. 3
Association Analysis
HW5 (clustering, SHOESTORE)
Nov. 10
Text Mining, Sarma Ch. 9
Buffer lecture (use as needed)
Grad Pres. signup deadline (types A-B)
HW6 (market basket, ASSOCS)
Nov. 17
Grad student presentations
Buffer lecture (use as needed)
Grad Presentations are due by 5PM
HW7 (text mining, VAEREXT)
0.5 pt. extra credit for attendance (UG st.)
Nov. 24
Grad student presentations
Buffer lecture (use as needed)
HW8 (SPSS Modeler, HMEQ)
0.5 pt. extra credit for attendance (UG st.)
Dec 1
Course review
Take-home final handed out
PR5 (project report)
0.5 pt. extra credit for attendance
Dec 8
*** Final Exam (take-home, due 11:59PM on Blackboard) ***
6
Download