IST 565: Data Mining - Web-based Information Science Education

IST 565: Data Mining
Syracuse University School of Information Studies
Updated Syllabus – 10/28/2009
Instructor:
Dr. Ozgur Yilmazel
Research Assistant Professor
School of Information Studies
Syracuse University, Syracuse, NY 13244
Email: oyilmaz@syr.edu
phone: 315-849-5666
This course will introduce popular data mining methods for extracting knowledge from data. It
will balance theory and practice. The principles of data mining methods will be discussed, but
students will also acquire hands-on experience using software to develop data mining solutions to
scientific and business problems. The focus of this course is on understanding data and how
to formulate data mining tasks in order to solve problems using the data.
The course will cover the key tasks of data mining, including data preparation, concept
description, association rule mining, classification, clustering, evaluation and analysis.
Throughout the class, students will learn through both readings and hands-on work
with real data in a data mining project.
The format of the course will be a combined lecture and lab format, with lectures to cover
material and lab time to investigate small examples for the topic of the week. There will
be weekly readings based on the textbook and on other materials which will be posted
on-line.
Objectives:
1. Understand the fundamental processes, concepts and techniques of data mining,
2. Develop familiarity with data mining techniques and be able to apply them to
real-world problems, and
3. Advance your understanding of contemporary data-mining systems.
Course Materials:
1. Data Mining: Practical Machine Learning Tools and Techniques, Second
Edition, by Ian H. Witten & Eibe Frank. ISBN 0120884070
2. Software as needed for assignments and projects. Students will need access to a
PC where they can install and run programs.
Other materials will be assigned and will be made available through electronic means,
and these will be discussed in class.
Technology Requirements:
This course will be run on LMS. There will be optional video components.
In addition, students will be using various tools to clean and manipulate data throughout
the semester. This may require students to install various programs on a PC-based
machine, so students should not only have access to a PC but also be comfortable with
installing and uninstalling programs.
Graduate students are expected to meet the minimum and recommended information
technology literacy skills required of students in all School of Information
Studies master's programs. Please refer to:
http://istweb.syr.edu/prospective/graduate/literacyreq.asp for the "Computer Literacy
Requirements" document.
Student Responsibilities:
You are expected to participate fully in all activities and discussions for the duration of
the class, as well as to turn in assignments by the designated time. Just as in the real world,
late assignments will be penalized. If you do not understand assignments, readings, etc., it
is your responsibility to inform the instructor well before the due date. If you are having
difficulty, please contact us early so that we can resolve problems. You must complete all
assignments to pass the course.
A grade of Incomplete will only be given in the most extreme emergency cases. If, for
some reason, you feel that it is imperative that you receive an Incomplete, you must bring
your concerns to our attention as soon as the emergency situation arises.
Please note: Effective time management is the most critical factor in the
success of students in a distance course. You are expected to log in to class
at least four times a week, if not every day. Logging in to class and
participating regularly and in small amounts makes it much easier to avoid
a build-up of hours of work.
Statement of Academic Integrity:
Undergraduate, graduate and doctoral students enrolled in IST courses are required to
follow the guidelines for academic honesty described in the School of Information
Studies Statement on Academic Integrity, available:
• in any IST Student Handbook,
• on the Web or
• on request at the IST Student Services Office.
Academic dishonesty includes but is not limited to plagiarism, cheating on examinations,
unauthorized collaboration, multiple submission of work, misusing resources for teaching
and learning, falsifying information, forgery, bribery, and any other acts that deceive
others about one's academic work or record. Students should be aware that standards for
documentation and intellectual contribution may depend on the course content and
method of teaching, and should consult instructors for guidance.
Sanctions for academic dishonesty may include but are not limited to the following:
• requiring students to re-produce work under the supervision of a proctor;
• rejecting the student work that was dishonestly created, and giving the student a
zero or failing grade for that work
• lowering the course grade
• giving a failing grade in the course
• formal reprimand and warning
• disciplinary probation
• administrative withdrawal from the course
• suspension from the University
• or expulsion from the University.
Instructors who impose sanctions must notify the student promptly and indicate any
formal or informal hearing procedures available. Students accused of academic
dishonesty have the right to challenge accusations.
iLMS information
This course will be given via the iSchool Learning Management System powered by
WebCT/Blackboard (the iSchool iLMS). On the iSchool iLMS homepage
(http://ischool.syr.edu/learn), you will find links to student FAQs, troubleshooting tips, tutorials, and
other information that may prove useful for both new and experienced iLMS users. As
students are responsible for being fluent on this educational platform, it is recommended
that you review these resources as necessary.
Guidelines for preparing assignments
Prepare a professional document. Include tables and graphs that support your content
where appropriate.
When you prepare assignments be sure to provide proper bibliographical information for
any sources referenced, for direct quotations and for the source of key concepts or ideas.
Any citation format is acceptable (I personally use APA format), as long as it provides
sufficient information for a reader to find the source (i.e., authors' names, title of article or
book, title, volume and issue of journal (if appropriate), page numbers, publisher, date of
publication).
If you cite a webpage, be sure to indicate the URL and the date on which you accessed
the page, as pages do change. Failure to cite sources is considered plagiarism and subject
to sanctions ranging from being required to redo the assignment through expulsion (see
above).
If you have any questions about what must be cited or how to cite, please feel free to ask.
In addition to punctuality, grammar, presentation and ability to follow instructions are
very important, as in the real world. If your work does not meet professional standards,
up to 30% of your score may be deducted. It is essential that you spell check and
proofread your documents.
According to the Family Educational Rights and Privacy Act, all student work produced as
part of this course may be used during this semester for educational purposes. Later use
of this work is also permitted provided that either the work is rendered anonymous or the
student provides written permission for such usage. If you do not wish your work to be
used even if rendered anonymous, let me know and I will not use it.
Special consideration
Students who may need special consideration because of a disability should contact
instructors at the start of the course, and anytime thereafter if further consideration is
needed. In addition, you are encouraged to register with the Office of Student Assistance
in 306 Steele Hall (315-443-4357) and/or the Center for Academic Achievement in 804
University Avenue, Room 303 (315-443-2622).
Course Topics:
Lecture / Date
1. 1/19 – 1/24
2. 1/25 – 1/31
3. 2/1 – 2/7
4. 2/8 – 2/14
5. 2/15 – 2/21
6. 2/22 – 2/28
7. 3/1 – 3/7
8. 3/8 – 3/14
   3/15 – 3/21
9. 3/22 – 3/28
10. 3/29 – 4/4
11. 4/5 – 4/11
12. 4/12 – 4/18
13. 4/19 – 4/25
14. 4/26 – 5/2
15. 5/3 – 5/9
Topic
• Introduction to Course Structure, Content & Requirements
• Data Mining Basic Concepts
• Introduction to Data Issues
• Data Preprocessing
• Data Cleansing, Transformation, Integration & Reduction
• Data Warehousing
• Styles of Machine Learning
• Decision Trees
• Classification
• Clustering
• Introduction to WEKA
• Spring Break (Review and Discuss Track A Projects)
• Data Output
• Visualization of Data Mining Results
• Privacy of Data and Legal Implications
• Data Mining Products
• How to Select Data Mining Software
• TBA
• Future Directions of Data Mining & Knowledge Discovery
Tentative Readings
• Witten & Frank, Ch. 1, What's it all about?
• Benoit, 2002, ARIST Chapter 6, Data Mining
• Witten & Frank, Ch. 2, Input: Concepts, Instances & Attributes
• Witten & Frank, pp. 58-63; 70-72; 89-92; 159-170
• Witten & Frank, pp. 170-183 (or 190)
• Witten & Frank, pp. 75-76; 210-227
• Colet, E. Clustering and Classification: Data Mining Approaches.
http://www.tgc.com/dsstar/00/0704/000704.html
• Witten & Frank, Ch. 8
• Witten & Frank, Ch. 3, Output: Knowledge Representation
• Thearling, Becker, DeCoste, Mawby, Pilote & Sommerfield. Visualizing Data Mining Models.
http://www3.shore.net/~kht/text/dmviz/modelviz.htm
• Ramakrishnan & Grama. Mining Scientific Data. Advances in Computers.
http://people.cs.vt.edu/~ramakris/papers/scimining.pdf
• Witten & Frank, Ch. 9, Looking Forward
• www.kdnuggets.com
• Elder & Abbott. A Comparison of Leading Data Mining Tools.
http://www.datamininglab.com/pubs/kdd98_elder_abbott_nopics_bw.pdf
ASSIGNMENTS
These are just outlines. As we approach the sections in the class dealing with each
assignment, more detail will be provided to you. The instructor reserves the right to
change the assignments, add new assignments, and alter the percentages attached to the
assignments with ample notice.
Individual Assignments 20% - There will be 3 or 4 assignments during the semester. They
will be take-home problem sets.
Overall Participation 20% - In a distance class, a large part of your learning experience
comes from participating in class discussions. In this course, you’ll be learning a
combination of theory and practice, and the participation in course discussions will
provide the theory. This participation is essential and the weight of this grade should
indicate to students that it will be taken seriously.
Course Project 60% - You will work in a group of at most 3 people on a data mining (DM)
problem of your choice. It will be up to you to find your data and run experiments.
Here are a few ground rules about posting in the public iLMS forums:
1 – The tone of your messages should be similar to the tone you would use in a classroom
discussion, and should be placed in the appropriate forum. The “Informal Chit-chat”
board is designed for social communication. “Questions to the Professor” should be used
whenever possible so that all students may benefit from the answer.
2 – In classroom discussion, present more than your opinion. If you present an opinion,
also present some support from the readings or from other sources you have discovered
or a logical argument from commonly accepted beliefs. Part of the graduate education
experience is to help you learn how to present information with support and not just say
“Well, I think that…”. This also applies to agreeing with someone; the statement “I
agree” should be presented with some other fact or information new to the discussion.
3 – Remember that there is someone behind every post, and there is a reason that they
wrote what they did. Before taking offense, getting upset, or lashing out, think about why
they may have made that post. Treat each other with respect.
4 – Review the standard rules of Net Etiquette (a.k.a. netiquette) at
http://www.albion.com/netiquette/corerules.html
5 – When discussing a point from a previous post, copy and paste the appropriate points
into your e-mail (and just post the portion you need for the discussion). The typical
symbol for showing a quote is > before the line.
The instructors maintain the right to move or delete discussions from the public forums as
they see fit. This might be done because a post is inflammatory, off-topic for the board,
or contains private information.
Grading Policy
Grades will be numeric, ranging from 0 to 100, with the most likely grades for a thorough
job being between 70 and 90. There are likely to be few, if any, grades of 100 earned. We
reserve this grade for truly exceptional performance. Grades lower than 70 suggest a need
to discuss your performance with the instructors.
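As a rough illustration of how the percentages listed under ASSIGNMENTS could combine into a single numeric grade, the sketch below computes a simple weighted sum. The syllabus does not state the exact formula, so the weighted-sum rule is an assumption, and the function name and example scores are hypothetical. The instructor may also alter the percentages with notice.

```python
# Weights taken from the ASSIGNMENTS section of this syllabus.
# Assumption: components are combined as a plain weighted sum.
WEIGHTS = {
    "individual_assignments": 0.20,
    "participation": 0.20,
    "course_project": 0.60,
}


def course_grade(scores):
    """Return the weighted course grade on the 0-100 scale.

    `scores` maps each component name in WEIGHTS to a score from 0 to 100.
    """
    missing = set(WEIGHTS) - set(scores)
    if missing:
        raise ValueError(f"missing component scores: {sorted(missing)}")
    for name in WEIGHTS:
        if not 0 <= scores[name] <= 100:
            raise ValueError(f"{name} must be between 0 and 100")
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)


# Hypothetical example scores, not real data.
example = {
    "individual_assignments": 85,
    "participation": 90,
    "course_project": 80,
}
print(round(course_grade(example), 1))
```

Because the course project carries 60% of the weight, a ten-point change in the project score moves the overall grade by six points under this assumed rule.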