Lecture 1 - dataengineering.org

Lecture 1:
Data Science &
Data Engineering
CS 6071
Big Data Engineering,
Architecture, and Security
Fall 2015, Dr. Rozier
Homework 1
• Data Structures and Basic Programming
• Due: September 1st at the beginning of class
This assignment must be completed individually. It
is worth 25% of your homework grade for the
semester, and should help you judge if you are
prepared for the course.
Homework 2
• Presentations on Biomedical Data Science
• Due: September 10th during class
You will be divided into five groups, one at NG
Xetron, four at UC.
Each group will read an assigned article and prepare
a 10 minute presentation on the topic for the rest of
the class.
The Big News about Sanders
Data Science and Engineering
From the Information Age to the
Data Age
What is Data Science?
What is Data Engineering?
Drew Conway’s Venn Diagram of
Data Science
The Foundations of Data Science
• Statistics
• Computer Science
• Domain Expertise
Doing Data Science
Back to Bernie and Clinton…
Problems with Anecdotal Data
• Small number of observations
• Selection bias
• Confirmation bias
• Inaccuracy
Some Basic Definitions
• Population – the set of objects or units to be
measured.
• Observations – extracted or measured characteristics about
the objects.
• Sample – the subset of objects examined in order to draw
conclusions and make inferences about the population.
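The three definitions can be made concrete with a small sketch. The population below is synthetic and purely hypothetical (5,000 units with one numeric characteristic each); the point is only to show a sample being used to infer something about the population.

```python
import random

rng = random.Random(0)

# Hypothetical population: 5,000 units, each carrying one observation
# (a single measured characteristic, drawn here from a normal distribution).
population = [rng.gauss(3.0, 0.5) for _ in range(5000)]


def draw_sample(units, k, rng):
    """Return k units chosen uniformly at random, without replacement."""
    return rng.sample(units, k)


# The sample is the subset we actually examine.
sample = draw_sample(population, 200, rng)

# Inference: the sample mean is used as an estimate of the population mean.
population_mean = sum(population) / len(population)
sample_mean = sum(sample) / len(sample)
```

Because every unit is equally likely to be chosen, the sample mean should land close to the population mean; a biased selection rule would not have that property.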
Example
• Let’s say we want to infer information about
the quality of students admitted to UC.
• Define the population, a single observation,
and a sample.
• How might we introduce biases into the data?
• What might the consequences be?
Estimating the e-mail generated by
employees
• Bearcats Health Insurance Inc has hired you to
help them understand their e-mail traffic.
They have 5,000 employees, and it is
infeasible to capture all mailing records. They
have asked you to evaluate a possible method
for sampling:
– Select 10% of their employees at random, and
sample all e-mail they have ever sent.
– Select 10% of all e-mail sent during the day at
random.
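One way to compare the two proposals is to simulate them. The sketch below is an illustration under stated assumptions, not a model of any real company: the per-employee message counts are synthetic and skewed, and the time window ("ever sent" versus "during the day") is ignored. It estimates "typical messages per sender" both ways.

```python
import random

rng = random.Random(42)

# Assumption: message counts per employee are skewed (a few heavy senders).
counts = [int(rng.expovariate(1 / 50)) + 1 for _ in range(5000)]

# Method 1: pick 10% of employees at random and keep everything they sent.
sampled_employees = rng.sample(range(5000), 500)
mean_by_employee = sum(counts[e] for e in sampled_employees) / len(sampled_employees)

# Method 2: pick 10% of individual messages at random.
# Each entry in `messages` is the id of the employee who sent it.
messages = [sender for sender, c in enumerate(counts) for _ in range(c)]
sampled_messages = rng.sample(messages, len(messages) // 10)
mean_by_message = sum(counts[s] for s in sampled_messages) / len(sampled_messages)
```

Under these assumptions the second figure comes out larger: a randomly chosen message is more likely to come from a heavy sender, so sampling messages over-represents those employees. Which method is appropriate depends on whether the question is about employees or about messages.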
But this is the age of BIG DATA!
• Why not just sample every message?
Measurements
• Measurements have inherent assumptions
• Measurements are often stated very
informally
– Formalize our measures!
Measurements
Measure theory is a bit like grammar, many
people communicate clearly without worrying
about all the details, but the details do exist and
for good reasons.
- Maya Gupta, University of Washington
The Problem of Measures
• Physical intuition for the measure of length:
given a body E, the measure of this body, m(E),
might be the sum of its components, or
points.
• Let’s take two bodies on the real number line
– Body A is the line A = [0, 1]
– Body B is the line B = [0, 2]
Which is “longer”?
The Problem of Measures
• Physical intuition for the measure of length:
given a body E, the measure of this body, m(E),
might be the sum of its components, or
points.
• Let’s take two bodies on the natural number
line
– Body A is the line A = [0, 1]
– Body B is the line B = [0, 2]
Which is “longer”?
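The contrast between the two versions of the question can be sketched in code. On the real line both bodies contain uncountably many points, so "adding up points" cannot tell them apart and a length measure is needed; on the natural numbers, counting points works. The function names below are illustrative only.

```python
def length(a, b):
    """Length of the closed interval [a, b] on the real line."""
    return b - a


def count_naturals(a, b):
    """Number of natural numbers in [a, b], for integers 0 <= a <= b."""
    return b - a + 1


# Real line: compare by length.
real_a = length(0, 1)
real_b = length(0, 2)

# Natural numbers: compare by counting points.
nat_a = count_naturals(0, 1)   # the points 0, 1
nat_b = count_naturals(0, 2)   # the points 0, 1, 2
```

Both measures agree that B is "longer" than A, but they assign different values, which is why the choice of measure has to be stated explicitly.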
Solving the Problem of Measures
• What does it mean for some body (or subset)
to be measurable?
• If a set E is measurable, how does one define its
measure?
• What properties or axioms does measure (or the
concept of measurability) obey?
Measure Theory
• Before we can measure anything we need
something to measure!
• Let’s define a measurable space
– A measurable space is a pair (Ω, B): the set of all
outcomes, Ω, also called the sample space, together
with a collection of events B.
Events and Sample Spaces
• Each event, F, is a set containing zero or more
outcomes.
– Each outcome can be viewed as a realization of an
event. The real world can be viewed as a player in
a game that makes some move.
– All events that contain the selected outcome
are said to “have occurred”.
Events and Sample Space
• Take a deck of 52 cards + 2 jokers
• Draw a single card from the deck.
• Sample space: 54-element set; each card is a possible outcome.
• An event is any subset of the sample space, including a
singleton set, or the empty set.
Events and Sample Space
• Potential events:
– “Red and black at the same time without being a joker” – (0 elements)
– “The 5 of hearts” – (1 element)
– “A king” – (4 elements)
– “A face card” – (12 elements)
– “A card” – (54 elements)
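The card example maps directly onto Python sets. The sketch below builds the 54-element sample space and the listed events; the card encoding (rank, suit) and the joker labels are arbitrary choices made for illustration.

```python
from itertools import product

RANKS = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K"]
RED = {"hearts", "diamonds"}
BLACK = {"clubs", "spades"}

# Sample space: 52 ordinary cards plus 2 jokers.
sample_space = set(product(RANKS, RED | BLACK)) | {("joker", 1), ("joker", 2)}

# Events are subsets of the sample space.
red_and_black = {c for c in sample_space if c[1] in RED and c[1] in BLACK}
five_of_hearts = {("5", "hearts")}
kings = {c for c in sample_space if c[0] == "K"}
face_cards = {c for c in sample_space if c[0] in {"J", "Q", "K"}}
any_card = set(sample_space)


def occurred(event, outcome):
    """An event has occurred if it contains the drawn outcome."""
    return outcome in event
```

Drawing the king of spades, for example, means that `kings`, `face_cards`, and `any_card` have all occurred, while `five_of_hearts` has not.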
Forming an Algebra on B and Ω
• In order to define measures on B, we need to
make sure it has certain properties, those of a
σ-algebra.
• A σ-algebra is a special kind of collection of
subsets that is closed under countable-fold
set operations (complement, union of
countably many sets, and intersection of
countably many sets).
• “Vanilla” algebras are closed only under finite
set operations.
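For a finite sample space the distinction between an algebra and a σ-algebra disappears, because only finitely many distinct subsets exist. That makes the closure properties easy to check by brute force. The helper names below are hypothetical, and the check is a sketch for small finite examples only.

```python
from itertools import combinations


def power_set(omega):
    """All subsets of omega, as a set of frozensets."""
    items = list(omega)
    return {
        frozenset(c)
        for r in range(len(items) + 1)
        for c in combinations(items, r)
    }


def is_algebra(omega, collection):
    """Check that collection contains omega and is closed under
    complement and (pairwise) union. On a finite omega this is
    equivalent to being a sigma-algebra."""
    omega = frozenset(omega)
    sets = {frozenset(s) for s in collection}
    if omega not in sets:
        return False
    for a in sets:
        if omega - a not in sets:        # closed under complement
            return False
        for b in sets:
            if a | b not in sets:        # closed under union
                return False
    return True
```

Closure under intersection does not need a separate check: it follows from closure under complement and union by De Morgan's laws.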
Countable Sets
• Countable sets are those that are finite or have the same
cardinality as the natural numbers.
• Quick refresher: Prove the cardinality of
integers and natural numbers are the same.
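One standard way to approach the refresher is to exhibit an explicit pairing between the integers and the natural numbers. The sketch below interleaves non-negative and negative integers; it checks the pairing on a finite range only and is not a substitute for the proof.

```python
def int_to_nat(z):
    """Map 0, 1, 2, ... to 0, 2, 4, ... and -1, -2, ... to 1, 3, ..."""
    return 2 * z if z >= 0 else -2 * z - 1


def nat_to_int(n):
    """Inverse of int_to_nat."""
    return n // 2 if n % 2 == 0 else -((n + 1) // 2)
```

Every natural number is hit exactly once, so the map is a bijection and the two sets have the same cardinality.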
σ-algebra
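• If we have a σ-algebra B on our sample space Ω,
then:
– $\Omega \in B$
– If $A \in B$, then $A^c \in B$
– If $A_1, A_2, \ldots \in B$, then $\bigcup_i A_i \in B$ and $\bigcap_i A_i \in B$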
Measures
• A measure µ takes a set A from a measurable
collection of sets B and returns the measure
of A, which is some nonnegative real number (or ∞).
Formally: $\mu : B \to [0, \infty]$
Example Measure
• Let’s define a measure of “Volume”.
• The triple $(\Omega, B, \mu)$ combines a
measurable space and a measure; the triple
is called a measure space. The measure is
defined by two properties:
– Nonnegativity: $\mu(A) \ge 0$ for all $A \in B$
– Countable additivity: if $A_i \in B$ are disjoint
sets for i = 1, 2, …, then the measure of the union
of the $A_i$ is equal to the sum of the measures of
the $A_i$: $\mu\left(\bigcup_i A_i\right) = \sum_i \mu(A_i)$
Example Measure
• Does the ordinary concept of volume satisfy
these two properties?
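– Nonnegativity: $\mu(A) \ge 0$ for all $A \in B$
– Countable additivity: if $A_i \in B$ are disjoint
sets for i = 1, 2, …, then the measure of the union
of the $A_i$ is equal to the sum of the measures of
the $A_i$: $\mu\left(\bigcup_i A_i\right) = \sum_i \mu(A_i)$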
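A quick numerical check on simple shapes suggests the answer is yes. The sketch below assumes axis-aligned boxes described by half-open intervals, so that cutting a box in two gives disjoint pieces; the representation and function names are illustrative only, and only finite additivity is exercised.

```python
def volume(box):
    """Volume of a box given as ((x0, x1), (y0, y1), (z0, z1)).
    Each side is the half-open interval [lo, hi)."""
    v = 1
    for lo, hi in box:
        v *= max(hi - lo, 0)   # an empty side gives volume 0, never negative
    return v


def split_x(box, cut):
    """Cut a box into two disjoint boxes at x = cut."""
    (x0, x1), y, z = box
    return ((x0, cut), y, z), ((cut, x1), y, z)


whole = ((0, 4), (0, 3), (0, 2))
left, right = split_x(whole, 1)
```

Here `volume(left) + volume(right)` equals `volume(whole)`, and no box has negative volume, matching the two properties on the slide.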
Two Special Kinds of Measures
• Signed measure – can be negative
• Probability measure – a measure defined over a
probability space.
– A probability measure, P, has the normal
properties of a measure, but it is also normalized
such that: $P(\Omega) = 1$
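On a finite sample space the simplest probability measure gives every outcome equal weight, which makes the normalization $P(\Omega) = 1$ easy to verify. The sketch below reuses the 54-card deck; the uniform assumption and the card encoding are choices made for illustration.

```python
from fractions import Fraction
from itertools import product

RANKS = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K"]
SUITS = ["hearts", "diamonds", "clubs", "spades"]

omega = set(product(RANKS, SUITS)) | {("joker", 1), ("joker", 2)}


def P(event):
    """Uniform probability measure: each of the 54 outcomes is equally likely."""
    return Fraction(len(event & omega), len(omega))


kings = {c for c in omega if c[0] == "K"}
not_kings = omega - kings
```

Exact fractions avoid rounding error, so additivity on the disjoint events `kings` and `not_kings` can be checked with equality.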
Sets of Measure Zero
• A set of measure zero is some set $A \in B$ with $\mu(A) = 0$.
• For a probability measure, any set of measure
zero has probability zero, so it almost never
occurs.
– It can thus be ignored when stating things about
the collection of sets B.
Borel Sets
• A common σ-algebra is the Borel σ-algebra. A
Borel set is an element of a Borel σ-algebra.
– Almost any set you can describe on the real line is
a Borel set: for example, the unit line segment
[0,1], the irrational numbers, etc.
– The Borel σ-algebra on the real line is a collection
of sets that is the smallest σ-algebra that includes
the open subsets of the real line.
Borel Sets
• For some space X, the collection of all Borel
sets on X forms a σ-algebra known as the
Borel algebra (or Borel σ-algebra) on X.
• Important!
• Why? Any measure defined on the open sets of
a space, or the closed sets of a space, must also
be defined on all Borel sets of that space.
Borel Sets
• Borel sets are powerful because if you know
what a probability measure does on every
interval, then you know what it does on all the
Borel sets.
• Allows us to define equivalence of measures.
Borel Sets
• Let’s say we have two measures:
• To show they are equivalent we just need to show
that:
– They are equivalent on all intervals
• By definition they are then equivalent for all Borel
sets, and hence over the measurable space.
• Example: Given probability distributions A and B
with equal cumulative distribution functions,
the probability distributions must also be equal.
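The link between a cumulative distribution function and interval probabilities is $P((a, b]) = F(b) - F(a)$, so two distributions with the same F agree on every such interval. The sketch below uses an exponential distribution purely as an example; the choice of distribution and the function names are assumptions.

```python
import math


def exp_cdf(x, rate=1.0):
    """CDF of an exponential distribution with the given rate."""
    return 1.0 - math.exp(-rate * x) if x > 0 else 0.0


def interval_prob(cdf, a, b):
    """Probability of the half-open interval (a, b], for a <= b."""
    return cdf(b) - cdf(a)


# Half of the probability mass of Exp(1) lies in (0, ln 2].
p_median = interval_prob(exp_cdf, 0.0, math.log(2))

# Additivity on adjacent, disjoint intervals.
p_left = interval_prob(exp_cdf, 0.0, 1.0)
p_right = interval_prob(exp_cdf, 1.0, 2.0)
p_both = interval_prob(exp_cdf, 0.0, 2.0)
```

Because `interval_prob` depends only on the CDF, passing in two equal CDFs necessarily gives equal interval probabilities.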
Measure Theory and Data Science
• Data Science is about working with, and
deriving observations or features from data.
• Features are effectively measures of some
sort, but often not for the underlying space of
interest.
• Important to realize the limitations of
measurable spaces for metrics of interest, and
what can and cannot be measured.
Example
Bearcats Elementary School had 300 students in their
5th grade class. 77% of them graduated to middle
school. 12% failed their mathematics Standards Of
Learning, 11% failed their reading Standards of
Learning.
The new class of 1st graders had interventions in
mathematics and grammar; their graduation rate
improved to 88%, with 7% failing mathematics and 5%
failing reading.
What can we infer? How does measure theory relate?
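One connection to measure theory is additivity: the percentage failing mathematics or reading equals the sum of the two percentages only if the two groups are disjoint. Without knowing the overlap, only bounds are available. The sketch below computes those bounds from the figures in the example; the function name is hypothetical.

```python
def union_bounds(p_a, p_b):
    """Lower and upper bounds on the percentage in A or B,
    given the percentages in A and in B (overlap unknown)."""
    lower = max(p_a, p_b)          # one group entirely inside the other
    upper = min(100, p_a + p_b)    # the two groups are disjoint
    return lower, upper


# 5th grade class: 12% failed mathematics, 11% failed reading, 77% graduated.
first_bounds = union_bounds(12, 11)
first_not_graduated = 100 - 77

# Class with interventions: 7% and 5% failed, 88% graduated.
second_bounds = union_bounds(7, 5)
second_not_graduated = 100 - 88
```

In both classes the share that did not graduate equals the upper bound. That would be consistent with the two failure groups being disjoint and accounting for all non-graduates, but the figures alone do not establish it.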
Measure Theory: Further Reading
• M. Capinski and E. Kopp, “Measure, Integral,
and Probability”, Springer Undergraduate
Mathematics Series, 2004
• S. I. Resnick, “A Probability Path”, Birkhäuser,
1999.
• A. Gut, “Probability: A Graduate Course”,
Springer, 2005.
• R. M. Gray, “Entropy and Information Theory”,
Springer Verlag (available free online), 1990.
The Data Science Pipeline
• Metric identification
• Data collection
• Data exploration and summary statistics
• Feature generation
• Feature importance testing
• Modeling
• Validation
Automating the Data Pipeline
Drake – Like make for data.
For next time
• Homework 1
• Due this Tuesday!!!