Introduction - University of Warwick

CS910: Foundations of Data Analytics
Graham Cormode
G.Cormode@warwick.ac.uk
Introduction
Agenda
• Introductions
• Introduction to Foundations of Data Analytics
• Course Admin
• Marketplace Survey
CS910 Foundations of Data Analytics
Data Analytics
• What is Data Analytics?
  – The science of studying data to draw conclusions
• Why?
  – More organizations are collecting more data than ever before
    • Business, Government, Healthcare, Charity – everything!
  – This data holds many insights into their operations and beyond
  – Data Analytics is required to extract these insights
    • Requires analytical, statistical and computational skills
  – A lot of focus (investment) on analytics, big data and data science

• How to predict flu outbreaks from search queries?
• How to recommend which movie to watch?
• How to find friends in a social network?
• How to predict house prices from listings?

Analytics in Action: Flu Trends
Digging deeper into Flu Trends
• Privacy concerns raised: users did not consent to this use of data
  – The slippery slope argument: what else will data be used for?
• Accuracy concerns raised: will it remain accurate?
  – Initial report of 0.97 correlation with official CDC data
  – Prevalence overestimated by 50% in 2013 flu season
  – Possible explanation: media speculation about flu epidemic caused more searches for related terms
  – Models need to be continually tuned and refined
• Lesson: data analytics can be a moving target…

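The 0.97 figure reported for Flu Trends is a correlation between a search-based estimate and official data. As a minimal sketch of how such a figure could be computed, the Python below calculates the Pearson correlation coefficient of two short series. The function name `pearson` and the numbers are assumptions made up for illustration; they are not Flu Trends or CDC data.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations, and sums of squared deviations.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical weekly values (illustration only, not real data).
estimated = [1.2, 1.9, 3.1, 4.8, 4.1, 2.6]
reported = [1.0, 2.0, 3.0, 5.0, 4.0, 2.5]
r = pearson(estimated, reported)
print(f"correlation = {r:.2f}")
```

A value near 1 means the two series rise and fall together. It does not show that the estimate has the right magnitude, which is how a highly correlated model can still overestimate prevalence.
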
Warwick connections
• Warwick is one of five universities partnering in the Alan Turing Institute for Data Science
  – Joint with Oxford, Cambridge, UCL, Edinburgh
  – Focusing on developing new, scalable methods for data analytics
  – Drawing on strength in mathematical sciences (maths, CS, stats)
• Warwick Institute for the Science of Cities
  – Applying analytics to data from cities
  – Partnering with Centre for Urban Science in New York and London

This Module: CS910
• What is Foundations of Data Analytics about?
  – The tools to manipulate and aggregate data
  – Dealing with data problems (missing values, changing format)
  – Models to represent data
  – Building and testing hypotheses about the data
  – Algorithms to analyze data
  – Ways to scale up analytics to big data
• This module emphasizes the foundations
  – Will focus on the theoretical underpinnings of these methods
  – Will require some mathematical and computational thinking

Module Outline
Part 1: Preliminaries
• Statistics and data handling
• Introduction to useful tools
• Case studies of analytics in action
Part 2: Core methods
• Regression: fitting a curve to data
• Classification: learning a model from data
• Clustering: finding groups in data
Part 3: Advanced topics
• Social Network Analysis
• Recommender systems
• Time series analysis
• Data management systems

Topics in Detail: Preliminaries and core
1. Statistical tools:
  – Refresher on probability, distributions, significance tests
2. Introduction to analytics, case studies
  – How analytics is used in practice. Examples from YouTube, Facebook, Kaggle, and Twitter.
3. Basic tools: command line, plotting, programming tools
4. Modeling data via regression:
  – Linear regression, least squares, logistic regression
5. Classification to predict values
  – Decision tree, Naive Bayes, Support Vector Machines
6. Clustering methods
  – Finding clusters in data (hierarchical, k-means, k-center)

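Least squares is listed among the regression topics. As a minimal sketch, assuming a single predictor, the Python below fits a line y = a + b·x using the closed-form least-squares solution. The function name `fit_line` and the sample points are illustrative assumptions, not course material.

```python
def fit_line(xs, ys):
    """Return (intercept, slope) minimizing the sum of squared residuals."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical; slope is undefined")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Hypothetical points lying exactly on the line y = 1 + 2x.
xs = [0, 1, 2, 3]
ys = [1, 3, 5, 7]
intercept, slope = fit_line(xs, ys)
print(intercept, slope)
```

With noisy data the same formulas return the line that minimizes the total squared vertical distance to the points.
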
Topics in Detail: advanced topics
7. Recommender systems
  – Making recommendations (movies, music, products) for people
8. Time series data
  – Predicting data from a sequence of observations
9. Data management systems
  – MapReduce, Data Warehouse, Relational data, SQL, NoSQL
10. Graphs and networks
  – Graph representations of data (application to social networks)
11. [If time] Data structures for big data and data streams
  – The Bloom filter and sketch data structures

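The Bloom filter named in the last topic is a compact summary of a set: a lookup answers either "definitely not present" or "possibly present". The Python below is a minimal sketch under stated assumptions, not the version taught in the module. The class name, the default sizes, and the use of salted SHA-256 as the hash family are all choices made here for illustration.

```python
import hashlib


class BloomFilter:
    """Approximate set membership: no false negatives, some false positives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for clarity

    def _positions(self, item):
        # Derive num_hashes positions by salting SHA-256 with an index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def __contains__(self, item):
        # True means "possibly present"; False means "definitely absent".
        return all(self.bits[pos] for pos in self._positions(item))


seen = BloomFilter()
for word in ["flu", "movie", "friend"]:
    seen.add(word)
print("flu" in seen)
print("house" in seen)
```

Every added item is always reported as present. An item that was never added is usually reported as absent, but may collide with set bits and be reported as present; more bits per stored item lowers that chance.
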
Course Administration
• Lectures start at 5 past the hour, should finish by 5 to the hour
  – To allow time to get to next lecture/get held up by traffic
• Attendance is not taken
• Phones off/silent in lectures
  – No one wants to hear your “wacky” ringtone
• Laptops/tablets/phones permitted but not recommended
  – Too easy to get distracted messaging/surfing
• Questions welcomed in lectures
  – Quick clarifications at any point
  – Detailed queries best saved for the end, or via email

Course Assessment
• Exam in 2016
  – 2 hours, contributes 50% to final grade
• Project worth 35% due 16 December 2015 (after end of term 1)
  – Project briefing lecture in a couple of weeks
• 5 assessed homeworks applying skills from lectures (15%)
  – Due dates: Wednesdays @ noon, Weeks 2, 4, 6, 8, 10
  – Lab drop-in sessions: Mondays @ 10am, Weeks 2, 4, 6, 8, 10
  – Lab tutors: Helen McKay and Jack Kirton
  – Goal: prepare you for the project
• Updates/news on course webpage and via email
  – www2.warwick.ac.uk/fac/sci/dcs/teaching/modules/cs910

First piece of coursework
• Warm-up exercise in using Weka
  – Load a data set, explore it, make observations
  – Hopefully will not be taxing
  – Can do whenever you like, wherever you like
  – Lab session: tutors on hand to help and advise
  – Make sure you are registered on CS910 for computer access
• Submission: complete the worksheet, hand in to CS reception
  – Deadline: next Wednesday 12 noon
  – Print a cover sheet (read carefully the notes on plagiarism): http://www2.warwick.ac.uk/fac/sci/dcs/teaching/pgcoversheet/

Course Material
• A developing topic, so no textbook covers everything
• Slides will be put on the course webpage after lectures
  – Handouts available at the start of each section
• Plenty of material on the web on each topic
  – Wikipedia is a good place to start (but not to finish)
• Data Mining: Concepts and Techniques, 3rd ed., Han, Kamber, Pei
  – Good coverage of many core data analytic ideas
  – Text available online via Warwick Library (ebook)
  – Also useful for CS909 Data Mining
• Other sources will be linked to from slides, course page

DESIRABLE SKILLS IN DATA ANALYTICS

Senior Data Scientist - Expedia
The successful candidate will have the following skills and experience:
• A (Masters or PhD) background in computer science or statistics with strong machine learning component.
• Will have expert knowledge of at least one of the following programming languages or equivalents: Ruby, Python, R, and/or functional languages such as Lisp, Haskell or Erlang.
• Have very good understanding of database technologies: Hadoop, Mongo or equivalent, and standard relational database structures along with query languages such as Hive, Pig and SQL.
• As well as these programming skills, the candidate should be able to demonstrate a very good understanding of one of the following: Bayesian networks, Neural networks, Heuristics, Support vector machines, genetic algorithms, or PAC learning. Along with good knowledge of statistical classification techniques such as k-means and hierarchical clustering, partition trees, and logistic regression.

Yahoo! Experienced Data Analyst
• We are looking for a Data Analyst with industry experience who is able to take large datasets and analyze them using statistical methods to draw out insights and data trends. They will be experienced at analysis techniques using Excel or R and comfortable using Unix, working with big data, scripting with Perl/Python and be able to quickly construct SQL queries to interrogate databases.
• Independence, logical reasoning, and motivation are important. Being able to work in an Agile environment is very important. The candidate should demonstrate the ability to learn new technologies and be happy to take on responsibility. They should have excellent communication skills and be able to present their findings in a clear and concise way.

Google Statistician/Engineering Analyst
• MS or PhD in Statistics or other quantitative disciplines such as Engineering, Applied Mathematics, etc.
• Broad work experience with large data sets.
• Considerable practical experience in quantitative analysis.
• Specific positions can benefit from experience in one or more of:
  – Operations Research, Online advertising, search, commerce
  – Machine Learning
  – Languages such as Python, JavaScript
  – Forecasting, Time-series modeling
  – Proficiency in foreign languages.
• Excellent written and verbal presentation skills.

“One of the largest global tech companies”
The company buys ad impressions in real time auctions and algorithmically delivers the most relevant ad possible.
• Identify and work with large datasets from multiple sources
• Visualize and analyse data, developing hypotheses and ideas for experiments.
• Run experiments to improve the relevance and efficiency of all advertising.
• Identify relevant research from industry and academia.
Preferred Qualifications
• Masters in a relevant field and/or experience is highly regarded
• Iteratively analysing data, integrating new data, experimenting and optimizing
• Near real-time data analysis, feeding into decisioning systems.
• Practical experience in a variety of machine learning and modelling techniques including time series forecasting, decision trees, multi-linear/logistic regression and Bayesian analysis.
• Presenting data effectively.
• Experience using R, SAS or equivalent.

Head of Data Science - Global FinTech
• An MSc or Ph.D. in a quantitative discipline e.g. statistics, mathematics, computer science
• Expert programming experience in R and/or Python
• Experience in implementing predictive analytics models and machine learning
• Proven experience leading a team, mentoring and developing data scientists
• Experience in distributed computing systems e.g. Spark, Hadoop, AWS etc
• As the Head of Data Science you could earn between £100,000 - £130,000 + benefits

Facebook Quantitative Engineer
Requirements
• MS/PhD in computer science, computational statistics, computational econometrics, operations research or related field.
• Hands-on, deep knowledge of Python as a user of scientific libraries (numpy, scipy, pandas, scikit-learn, etc.) and as a generalist. Alternatively, R or MATLAB with strong C++ or Java experience.
• 2+ years experience and an excellent understanding of machine learning techniques (classification, clustering, dimensionality reduction)
• 2+ years hands on experience working with large datasets (>10TB) on distributed systems.
• Good understanding of fundamentals of statistics.
• Good understanding of fundamentals of SQL.

Recommended Reading
• Data Mining Concepts and Techniques, Chapter 1: Introduction
  – http://0-www.sciencedirect.com.pugwash.lib.warwick.ac.uk/science/article/pii/B9780123814791000010
• “Detecting influenza epidemics using search engine query data”, Jeremy Ginsberg, Matthew H. Mohebbi, Rajan S. Patel, Lynnette Brammer, Mark S. Smolinski & Larry Brilliant
  – http://static.googleusercontent.com/external_content/untrusted_dlcp/research.google.com/en/us/archive/papers/detecting-influenza-epidemics.pdf
• “When Google got flu wrong”, Nature (news)
  – http://www.nature.com/news/when-google-got-flu-wrong-1.12413