Lecture 1: Data Science & Data Engineering
CS 6071 Big Data Engineering, Architecture, and Security
Fall 2015, Dr. Rozier

Homework 1
• Data Structures and Basic Programming
• Due: September 1st at the beginning of class
This assignment must be completed individually. It is worth 25% of your homework grade for the semester, and should help you judge whether you are prepared for the course.

Homework 2
• Presentations on Biomedical Data Science
• Due: September 10th during class
You will be divided into five groups: one at NG Xetron and four at UC. Each group will read an assigned article and prepare a 10-minute presentation on the topic for the rest of the class.

The Big News about Sanders

Data Science and Engineering

From the Information Age to the Data Age

What is Data Science?

What is Data Engineering?

Drew Conway’s Venn Diagram of Data Science

The Foundations of Data Science
• Statistics
• Computer Science
• Domain Expertise

Doing Data Science

Back to Bernie and Clinton…

Problems with Anecdotal Data
• Small number of observations
• Selection bias
• Confirmation bias
• Inaccuracy

Some Basic Definitions
• Population – the set of objects or units to be measured.
• Observations – extracted or measured characteristics about the objects.
• Sample – the subset of objects examined in order to draw conclusions and make inferences about the population.

Example
• Let’s say we want to infer information about the quality of students admitted to UC.
• Define the population, a single observation, and a sample.
• How might we introduce biases into the data?
• What might the consequences be?

Estimating the e-mail generated by employees
• Bearcats Health Insurance Inc has hired you to help them understand their e-mail traffic.
They have 5,000 employees, and it is infeasible to capture all mailing records. They have asked you to evaluate two possible methods for sampling:
– Select 10% of their employees at random, and sample all e-mail they have ever sent.
– Select 10% of all e-mail sent during the day at random.

But this is the age of BIG DATA!
• Why not just sample every message?

Measurements
• Measurements have inherent assumptions
• Measurements are often stated very informally
– Formalize our measures!

Measurements
Measure theory is a bit like grammar, many people communicate clearly without worrying about all the details, but the details do exist and for good reasons.
- Maya Gupta, University of Washington

The Problem of Measures
• Physical intuition of the measure of length: given a body E, the measure of this body, m(E), might be the sum of its components, or points.
• Let’s take two bodies on the real number line
– Body A is the line A = [0, 1]
– Body B is the line B = [0, 2]
Which is “longer”?

The Problem of Measures
• Let’s take two bodies on the natural number line
– Body A is the line A = [0, 1]
– Body B is the line B = [0, 2]
Which is “longer”?

Solving the Problem of Measures
• What does it mean for some body (or subset) to be measurable?
• If a set E is measurable, how does one define its measure?
• What properties or axioms does measure (or the concept of measurability) obey?

Measure Theory
• Before we can measure anything we need something to measure!
• Let’s define a measurable space
– A measurable space is a pair (Ω, B): the set of all outcomes, Ω, also called the sample space, together with a collection of events, B.

Events and Sample Spaces
• Each event, F, is a set containing zero or more outcomes.
– Each outcome can be viewed as a realization of an event. The real world can be viewed as a player in a game that makes some move:
– All events in B that contain the selected outcome are said to “have occurred”.

Events and Sample Space
• Take a deck of 52 cards + 2 jokers
• Draw a single card from the deck.
• Sample space: a 54-element set; each card is a possible outcome.
• An event is any subset of the sample space, including a singleton set or the empty set.

Events and Sample Space
• Potential events:
– “Red and black at the same time without being a joker” – (0 elements)
– “The 5 of hearts” – (1 element)
– “A king” – (4 elements)
– “A face card” – (12 elements)
– “A card” – (54 elements)

Forming an Algebra on B and Ω
• In order to define measures on B, we need to make sure it has certain properties: those of a σ-algebra.
• A σ-algebra is a special kind of collection of subsets that is closed under countable set operations (complement, union of countably many sets, and intersection of countably many sets).
• “Vanilla” algebras are closed only under finite set operations.

Countable Sets
• Countable sets are those that are finite or have the same cardinality as the natural numbers.
• Quick refresher: Prove that the integers and the natural numbers have the same cardinality.

σ-algebra
• If we have a σ-algebra B on our sample space Ω, then:
– $\Omega \in B$
– If $A \in B$, then $A^c \in B$
– If $A_1, A_2, \ldots \in B$, then $\bigcup_{i=1}^{\infty} A_i \in B$

Measures
• A measure µ takes a set A from a measurable collection of sets B and returns the measure of A, which is some nonnegative real number (or infinity). Formally: $\mu : B \to [0, \infty]$

Example Measure
• Let’s define a measure of “Volume”.
• The triple (Ω, B, µ) combines a measurable space and a measure; the triple is called a measure space.
The measure on this space must satisfy two properties:
– Nonnegativity: $\mu(A) \ge 0$ for all $A \in B$
– Countable additivity: if $A_i \in B$ are disjoint sets for i = 1, 2, …, then the measure of the union of the $A_i$ is equal to the sum of the measures of the $A_i$: $\mu\left(\bigcup_{i=1}^{\infty} A_i\right) = \sum_{i=1}^{\infty} \mu(A_i)$

Example Measure
• Does the ordinary concept of volume satisfy these two properties?

Two Special Kinds of Measures
• Signed measure – can be negative
• Probability measure – a measure, P, defined over a probability space (Ω, B, P).
– A probability measure, P, has the normal properties of a measure, but it is also normalized such that: $P(\Omega) = 1$

Sets of Measure Zero
• A set of measure zero is some set $A \in B$ with $\mu(A) = 0$
• For a probability measure, any set of measure zero has probability zero, so it almost surely never occurs.
– It can thus be ignored when stating things about the collection of sets B.

Borel Sets
• A common σ-algebra is the Borel σ-algebra. A Borel set is an element of a Borel σ-algebra.
– Almost any set you can describe on the real line is a Borel set, for example the unit line segment [0,1], the irrational numbers, etc.
– The Borel σ-algebra on the real line is the smallest σ-algebra that includes the open subsets of the real line.

Borel Sets
• For some space X, the collection of all Borel sets on X forms a σ-algebra known as the Borel algebra (or Borel σ-algebra) on X.
• Important!
• Why? Any measure defined on the open sets of a space, or the closed sets of a space, must also be defined on all Borel sets of that space.

Borel Sets
• Borel sets are powerful because if you know what a probability measure does on every interval, then you know what it does on all the Borel sets.
• Allows us to define equivalence of measures.
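
A minimal Python sketch of the interval idea above. It assumes a uniform distribution on [0, 1] as the example probability measure; that choice and the helper names `uniform_cdf`, `interval_probability`, and `union_probability` are illustrative and not from the lecture. The sketch recovers the probability of a half-open interval (a, b] from a cumulative distribution function alone, then uses additivity to extend that to a finite union of disjoint intervals.

```python
def uniform_cdf(x: float) -> float:
    """CDF of the uniform distribution on [0, 1] (an assumed example measure)."""
    return min(max(x, 0.0), 1.0)


def interval_probability(cdf, a: float, b: float) -> float:
    """P((a, b]) computed from the CDF alone."""
    if b < a:
        raise ValueError("need a <= b")
    return cdf(b) - cdf(a)


def union_probability(cdf, intervals) -> float:
    """P of a finite union of pairwise disjoint half-open intervals (a, b].

    Relies on additivity: the measure of a disjoint union is the sum
    of the measures of its pieces.
    """
    ordered = sorted(intervals)
    # Reject overlapping intervals, since additivity needs disjoint sets.
    for (_, b1), (a2, _) in zip(ordered, ordered[1:]):
        if a2 < b1:
            raise ValueError("intervals must be disjoint")
    return sum(interval_probability(cdf, a, b) for a, b in ordered)


# Split (0, 1] into three disjoint pieces and compare with the whole interval.
pieces = [(0.0, 0.25), (0.25, 0.5), (0.5, 1.0)]
whole = interval_probability(uniform_cdf, 0.0, 1.0)
summed = union_probability(uniform_cdf, pieces)
print(whole, summed)  # prints "1.0 1.0"
```

The sketch only handles finite unions of intervals. The extension from intervals to every Borel set is a theorem of measure theory and is not something the code demonstrates.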
Borel Sets
• Let’s say we have two probability measures on the real line.
• To show they are equivalent we just need to show that:
– They are equivalent on all intervals
• It follows that they are equivalent for all Borel sets, and hence over the measurable space.
• Example: Given probability distributions A and B with equal cumulative distribution functions, the probability distributions must also be equal.

Measure Theory and Data Science
• Data Science is about working with, and deriving observations or features from, data.
• Features are effectively measures of some sort, but often not for the underlying space of interest.
• It is important to realize the limitations of measurable spaces for metrics of interest, and what can and cannot be measured.

Example
Bearcats Elementary School had 300 students in their 5th grade class. 77% of them graduated to middle school. 12% failed their mathematics Standards of Learning, and 11% failed their reading Standards of Learning. The new class of 1st graders had interventions in mathematics and grammar; their graduation rates improved to 88%, with 7% failing mathematics and 5% failing reading.
What can we infer?
How does measure theory relate?

Measure Theory: Further Reading
• M. Capinski and E. Kopp, “Measure, Integral, and Probability”, Springer Undergraduate Mathematics Series, 2004.
• S. I. Resnick, “A Probability Path”, Birkhäuser, 1999.
• A. Gut, “Probability: A Graduate Course”, Springer, 2005.
• R. M. Gray, “Entropy and Information Theory”, Springer-Verlag (available free online), 1990.

The Data Science Pipeline
• Metric identification
• Data collection
• Data exploration and summary statistics
• Feature generation
• Feature importance testing
• Modeling
• Validation

Automating the Data Pipeline
Drake – Like make for data.

For next time
• Homework 1
• Due this Tuesday!!!