Document

advertisement
Introduction to Data Analytics,
Current Challenges.
Course Outline
Peter Fox
Data Analytics – ITWS-4600/ITWS-6600
Week 1a, January 26, 2016, Lally 102
1
Admin info (keep/ print this slide)
•
•
•
•
•
•
•
•
•
Class: ITWS-4600/ITWS 6600
Hours: 12:00pm-1:50pm Tuesday/ Friday
Location: Lally 102
Instructor: Peter Fox
Instructor contact: pfox@cs.rpi.edu, 518.276.4862 (do not
leave a msg)
Contact hours: Monday** 3:00-4:00pm (or by email appt)
Contact location: Winslow 2120 (sometimes Lally 207A
announced by email)
TA: Rahul Divekar, divekr@rpi.edu
Web site: http://tw.rpi.edu/web/courses/DataAnalytics/2016
– Schedule, lectures, syllabus, reading, assignments, etc.
– <location of data and R script fragments – coming soon>
2
Contents
•
•
•
•
Intro – about this course
Learning objectives
Outline of the course
Definitions and why
Analytics is more than
Analysis
• What skills are needed
• What is expected
3
Truth in Advertising
4
5
Assessment and Assignments
• Via written assignments with specific percentage of
grade allocation provided with each assignment
• Via individual oral presentations with specific
percentage of grade allocation provided
• Via presentations – depending on class size
• Via participation in class (not to exceed 5% of total,
start with 5% and lose % by not participating)
• Late submission policy: first time with valid reason –
no penalty, otherwise 20% of score deducted each
late day. Talk to me EARLY if you are having
schedule problems completing assignments
6
Assessment and Assignments
• Reading assignments
– Are given when needed to support key topics or
to complete assignments
– Will not be discussed in class unless there are
questions
• You will mostly perform individual work
7
Project options (examples)
•
•
•
•
•
•
Social networks
Financial
Social-economic, marketing
Network/ security data
Linked data
Competitions (Web and local)
8
Objectives
• Introduce students to relevant methods to recognize and apply
quantitative algorithms, techniques and interpretation
• To develop students' strategic thinking skills, combined with a solid
technical foundation in data and model-driven decision-making.
• Develop ability to apply critical and analytical methods to formulate and
solve science, engineering, medical, and business problems
• In groups, students will identify qualitative problems and apply content
analytics
• Students will examine real-world examples to place data-mining
techniques in context, to develop data-analytic thinking, and to illustrate
that proper application is as much an art as it is a science.
• By the end of the course, students can effectively communicate analytic
findings to non-specialists
• Decision making under uncertainty, how optimize models that
incorporate random parameters: static stochastic optimization, two-stage
optimization with recourse, chance-constrained optimization, and
sequential decision making, weak models, mixed models
9
Learning Objectives
• Through class lectures, practical sessions, written and oral
presentation assignments and projects, students should:
– Students to demonstrate knowledge of relevant analytic methods, and to
recognize and apply quantitative algorithms, techniques and interpret results
– Students to demonstrate strategic thinking skills, combined with a solid technical
foundation in data and model-driven decision-making.
– Students to develop ability to apply critical and analytical methods to formulate
and solve science, engineering, medical, and business problems
– Students will examine real-world examples to place data-mining techniques in
context, to develop data-analytic thinking, and to illustrate that proper application
is as much an art as it is a science.
– Students must effectively communicate analytic findings to non-specialists.
– [graduate level] Students must develop and demonstrate a working knowledge of
decision making under uncertainty, be able to build optimization models that
incorporate random parameters: static stochastic optimization, two-stage
optimization with recourse, chance-constrained optimization, and sequential
decision making.
10
Undergraduates/ Grads
• Graduate students are assessed at:
– Higher level of demonstration
– Additional questions or tasks in assignments
• Undergraduates are welcome to complete
these higher requirements to extra grade
• Extra points for outstanding/ above and
beyond are given**
11
Academic Integrity
•
Student-teacher relationships are built on trust. For example, students must trust that
teachers have made appropriate decisions about the structure and content of the courses
they teach, and teachers must trust that the assignments that students turn in are their
own. Acts, which violate this trust, undermine the educational process. The Rensselaer
Handbook of Student Rights and Responsibilities defines various forms of Academic
Dishonesty and you should make yourself familiar with these. In this class, all assignments
that are turned in for a grade must represent the student’s own work. In cases where help
was received, or teamwork was allowed, a notation on the assignment should indicate your
collaboration.
• Submission of any assignment that is in violation of this policy will result
in a penalty. If found in violation of the academic dishonesty policy,
students may be subject to two types of penalties. The instructor
administers an academic (grade) penalty of full loss of grade for the work
in violation, and the student may also enter the Institute judicial process
and be subject to such additional sanctions as: warning, probation,
suspension, expulsion, and alternative actions as defined in the current
Handbook of Student Rights and Responsibilities.
• If you have any question concerning this policy before submitting
an assignment, please ask for clarification.
12
Current Syllabus/Schedule
• Web site:
http://tw.rpi.edu/web/courses/DataAnalytics/2
016
13
Questions so far?
14
Introductions
• Who you are, background?
• Why you are here?
• What you expect to learn?
15
The nature of the challenge
16
17
18
Perspective
• People make decisions every day and
increasingly they are using computers to
assist them.
• Knowledge is power:
– Or accurate/ reliable knowledge is actionable
• Gaining knowledge and how to use that
knowledge - from (often multiple ones)
information and data sources
• A model = formula/ equation that could
depend on parameters and variables
19
So what are we talking about?
20
Definitions (at least for this course)
• Data - are encodings that represent the
qualitative or quantitative attributes of a
variable or set of variables.
• Data (plural of "datum", which is seldom
used) - are typically the results of
measurements, computations, or
observations and can be the basis of graphs,
images of a set of variables.
• Data - are often viewed as the lowest level of
abstraction from which information and
knowledge are derived***
21
And then there is Big Data
22
~ Data for this course
• FaceBook 100 - Facebook networks (from a
date in Sept. 2005) for 100 colleges and
universities. These files only include intraschool links. Anonymized. Well curated. Very
good quality. Matlab.
• NY Housing datasets
• Many from R packages
• Linked data – logd.tw.rpi.edu/datasets, e.g.
EPA Facility Register System for each U.S.
state. Linked. Quality unknown. RDF.
23
A view from IBM …
• “Anyone who wants to learn something about data
analytics should take a road trip. Myriad real-time
decisions must be made based on analysis of static
information as well as ever-changing conditions.
Data about traffic, weather, road construction, fuel,
time, current location and available funds are just a
few of the factors.”
• This information and much more are needed to
answer questions like:
– If I skip this gas station, will I run out of gas before the next one?
– Is it worth driving 50 miles out of the way to see the Corn Palace?
How late will that side trip make us?
– Can I make it to Billings, Mont., by sunset or should I look for a place
to stop?
24
Case Studies (warming up)
• Sports Analytics – Moneyball
(http://www.imdb.com/title/tt1210166/), Nate Silver
(http://en.wikipedia.org/wiki/Nate_Silver)
• Google Analytics http://www.marketingscoop.com/google-analyticscasestudy.htm
• Marketing Analytics – products for pregnant
(women)
• Amazon Recommender – “If you liked, …”
• http://www.slideshare.net/lsakoda/case-studiesutilizing-real-time-data-analytics
25
Data…
• Finding it …
• Getting ready to use it … yes, on your laptop!
• Using it:
– Big data technologies: Hadoop (MapReduce) +
Pig, HDFS, HIVE – see
http://projects.apache.org/indexes/category.html#
database
– NoSQL, Graph, Hbase, Cassandra, Mongo DB,
Riak, CouchDB
– MPP Databases: Storm, Drill, Dremel
26
Analysis
• Software packages / environments:
– Gnu R
– Rstudio
• Extensive libraries
• Going from preliminary to initial analysis…
• Parametric (assumes or asserts a probability
distribution) and non-parametric statistics
27
What is "statistics"?
• The term "statistics" has two common
meanings, which we want to clearly
separate: descriptive and
inferential statistics.
• But to understand the difference between
descriptive and inferential statistics, we
must first be clear on the difference
between populations and samples.
Courtesy Marshall Ma (and prior sources)
28
Populations and samples
• A population is a set of well-defined objects
– We must be able to say, for every object, if it is in the population or
not
– We must be able, in principle, to find every individual of the
population
A geographic example of a population is all pixels in a multi-spectral
satellite image
• A sample is a subset of a population
– We must be able to say, for every object in the population, if it is in
the sample or not
– Sampling is the process of selecting a sample from a population
Continuing the example, a sample from this population could be a set
of pixels from known ground truth points
Courtesy Marshall Ma (and prior sources)
29
Populations and samples
• A population = “all” of the data, if you can
get it (BIG Data)
– This is what is different about the methods you
use
• A sample = “some” of the data, and you
may not know how representative it is
– This is what limits analysis but certainly the
development of models
Courtesy Marshall Ma (and prior sources)
30
What do we mean by
"statistics"?
• Two common uses of the word:
• Descriptive statistics: numerical summaries of samples
– i.e., what was observed
– Note the ‘sample’ may be exhaustive, i.e., identical to the population
• Inferential statistics: from samples to populations
– i.e., what could have been or will be observed in a larger population
Example:
Descriptive "The adjustments of 14 GPS control points for this
orthorectification ranged from 3.63 to 8.36 m with an arithmetic mean of
5.145 m"
Inferential "The mean adjustment for any set of GPS points taken under
specified conditions and used for orthorectification is no less than 4.3
and no more than 6.1 m; this statement has a 5% probability of being
wrong"
Courtesy Marshall Ma (and prior sources)
31
Patterns and Relationships
• Stepping from elementary/ distribution
analysis to algorithmic-based analysis
• Often: data mining: classification, clustering,
rules; machine learning; support vector
machine, non-parametric models
• Outcome: model and an evaluation of its
fitness for purpose
32
Prediction
•
•
•
•
Choosing applicable models
Combining models
Confidence levels
Multi-variate
• Future, event, pattern
• Past event, relation
• Etc.
33
Prescription
• Decisions and Effects: What should you do
and why? “Business Rules”
• Benefit or mitigate or adapt? [Personal e.g.]
• Builds on Prediction, often involves scenarios
and post-analysis
34
Summary
• We’ll work our way through the stages of
analytics
• We’ll use current both laptop installed software
and some server data infrastructures for analytics
to give you practical experience
• We’ll cover algorithms, models, and software to
use them
• Aim: ~ Tuesday lecture, Friday hands-on lab and
interactions
• Course will be “adapted” as we go (“different”
people in the class this year)
35
Just for fun…
36
Traversal for new patterns
37
Smart visual exploration
38
Skills needed
• Database or data structures
• Literacy with computers and applications that
can handle the data we will use
• Pick up R programming, terminology and
syntax
• Ability to access internet, servers and
retrieve/ acquire data, install/ configure
software
• Presentation of proposal projects and
assignment results
39
Tentative assignment structure (no exam)
• Assignment 1: Review of a DA Case Study. Due ~ week 2
(Friday). 10% (written/ discuss; individual);
• Assignment 2: Datasets and data infrastructures – lab
assignment. Due ~ week 3. 10% (lab; individual);
• Assignment 3: Preliminary and Statistical Analysis. Due ~
week 4. 15% (15% written and 0% oral; individual);
• Assignment 4: Pattern, trend, relations: model development
and evaluation. Due ~ week 5. 15% (10% written and 5%
oral; individual);
• Assignment 5: Term project proposal. Due ~ week 6. 5%
(0% written and 5% oral; individual);
• Assignment 6: Predictive and Prescriptive Analytics. Due ~
week 8. 15% (15% written and 5% oral; individual);
• Term project. Due ~ week 13. 30% (25% written, 5% oral;
individual).
40
What is expected
• Attend class, complete assignments,
participate
• Ask questions, offer answers in class
• Work individually on assignments
• In a group at your table, learn from each
other, help each other) Work constructively in
class sessions
• Next class is Jan. 29 – Introduction or
refresher to statistics
41
Reading/ watching
• Sports Analytics – Moneyball
(http://www.imdb.com/title/tt1210166/),
• Nate Silver (http://en.wikipedia.org/wiki/Nate_Silver)
• Google Analytics http://www.marketingscoop.com/google-analyticscasestudy.htm
• http://www.slideshare.net/lsakoda/case-studiesutilizing-real-time-data-analytics
• http://www.marketquotient.com/case-studies.html
• http://www.ibm.com/analytics/us/en/case-studies/
• More in the Assignment (on Web)
42
Reference Material
• See on Website after class 14 reading
section
• Data Analytics – various intro material
• Using R
43
Files
• TBD
• This is where the files for assignments,
exercise will be placed – data, code
(fragments), etc.
44
Assignment 1
• Choose a DA case study from a) readings, or
b) your choice (must be approved by me)
• Read it and provide a short written review/
critique (business case, area of application,
approach/ methods, tools used, results,
actions, benefits). Hand in a written report.
• Be prepared to discuss it on class on Friday
5th..
• Details on the Web site (under
Reading/Assignments; Week 1)
45
Gnu R • http://lib.stat.cmu.edu/R/CRAN/ - load this first
• http://cran.r-project.org/doc/manuals/
• http://cran.r-project.org/doc/manuals/R-lang.html
• R Studio – see R-intro.html in
manualshttp://www.rstudio.com/ide/download
/
– Manuals - Libraries – at the command line –
library(), or select the packages tab, and check/
uncheck as needed
46
•
Exercises – getting data in –
try
before
Friday?
Rstudio
– read in csv file (two ways to do this) GPW3_GRUMP_SummaryInformation_2010.csv
– Read in excel file (directly or by csv convert) 2010EPI_data.xls (2010EPI_data tab)
– See if you can plot some variables
– Anything in common between them?
47
Download