Chapter 1 Introduction

1.1 Data
Element: a person, object, or other entity about which we wish to draw a conclusion.
Variable: a characteristic of a population or sample element.
● Quantitative: a variable having values that are numbers representing quantities. Eg. selling price, temperature, car mileage
● Qualitative: a variable having values that indicate into which of several categories a population element belongs. Eg. weather, gender, car color
Data set: facts and figures, taken together, that are collected for a statistical study.
● Cross-sectional data: data collected at the same point in time. Eg. cell phone costs of different employees in June.
● Time series data: data collected over different time periods. Eg. temperature of each month

1.2 Data sources, data warehousing, and big data
Types of data sources:
● Primary data: data collected by an individual or business directly through planned experimentation or observation.
○ Experimentation: a statistical study in which the analyst is able to set or manipulate the values of the factors.
○ Observation: a statistical study in which the analyst is not able to control the values of the factors.
● Secondary data: data taken from an existing source. Eg. Internet, company reports, business journals

Experimental and observational studies (DIY)
Response variable: variable of interest that we wish to study.
Factors: other variables that may be related to the response variable.
● Experimental study
● Observational study
Eg. in studies of diet and cholesterol, patients are unlikely to follow the prescribed diets, so diet can only be observed, not controlled. Diet is a factor, and cholesterol level is the response variable.
● Survey

Transactional data, data warehousing and big data
Data warehousing: a process of centralized data management and retrieval whose ideal objective is the creation and maintenance of a central repository for all of an organization’s data. Ie. the process of storing customers’ transactional data (eg.
address, phone number, etc.)
Big data: massive amounts of data, often collected at very fast rates in real time and in different forms, and sometimes needing quick preliminary analysis for effective business decision making.
Note: the huge capacity of data warehouses has given rise to the term big data.

1.3 Populations, samples, and traditional statistics
Population: the set of all elements about which we wish to draw conclusions. Eg. all current MasterCard holders, all of last year’s graduates.
Population measurements: carry out a measurement to assign a value of a variable to each and every population element.
Census: examination of all elements in a population. (usually small group)
Sample: a subset of the elements of a population. (for large group)
Note: when we measure a characteristic of the elements in a sample, we have a sample of measurements.
Descriptive statistics: the science of describing the important aspects of a set of measurements.
Statistical inference: the science of using a sample of measurements to make generalizations about the important aspects of a population of measurements. (for large group)

Traditional statistics
Traditional statistics consists of a set of concepts and techniques that are used to describe populations and samples and to make statistical inferences about populations by using samples.
Note: much of this book is devoted to traditional stats, but traditional stats is sometimes not sufficient to analyze big data. 2 related extensions help (chapter 1.5):
1. Business analytics: the use of traditional and newly developed stats methods, advances in information systems, and techniques from management science to continuously and iteratively explore and investigate past business performance, with the purpose of gaining insight and improving business planning and operations.
2. Data mining: the process of discovering useful knowledge in extremely large data sets.
1.4 Random sampling and 3 case studies that illustrate stats inference
Random sample: a sample selected in such a way that every set of n elements in the population has the same chance of being selected.
In making random selections, we can sample with or without replacement:
● Sample with replacement: place the element chosen back into the population. Thus, this element has a chance to be chosen again.
● Sample without replacement: do not place the element chosen back into the population.
Note: it is best to sample without replacement, as it guarantees that all of the elements in the sample will be different, and thus we will have the fullest possible look at the population.

3 case studies with 3 goals:
1. The need for a random sample
2. How to select the needed sample
3. The use of the sample in making stats inferences

Processes
Sometimes we are interested in studying the population of all of the elements that will be or could potentially be produced by a process.
Process: a sequence of operations that takes inputs (labor, materials, machines...) and turns them into outputs (products, services, ...)
● Finite population: a population that contains a finite number of elements.
● Infinite population: a population that is defined so that there is no limit to the number of elements that could potentially belong to the population.

Probability sampling
Probability sampling: sampling where we know the chance that each element in the population will be included in the sample.
Note: if we employ probability sampling, the sample obtained can be used to make valid stats inferences about the sampled population.

Non-probability sampling
● Convenience sampling: sampling where we select elements because they are easy or convenient to sample.
● Voluntary response sample: sampling in which the sample participants self-select. (eg. polls run by television and radio). Such samples overrepresent people with strong opinions.
● Judgement sampling: sampling where an expert selects population elements that he/she feels are representative of the population.

Unethical stats practices:
● Improper sampling: cherry-picking
● Misleading charts, graphs and descriptive measures
● Inappropriate stats analysis or inappropriate interpretation of stats results

1.5 Business analytics and data mining
3 categories of business analytics:
1. Descriptive analytics: graphical and numerical methods used to find and visualize patterns, associations, anomalies and other relationships in data sets, with the purpose of business improvement.
a. Graphical descriptive analytics
It uses traditional and newer graphics to present easy-to-understand visual summaries of the operational status of a business. Eg. gauges, bullet graphs, treemaps, sparklines, etc. These are combined to form analytic dashboards, which are part of executive information systems.
b. Numerical descriptive analytics
i. Association learning
ii. Text mining
iii. Cluster analysis
iv. Factor analysis
2. Predictive analytics: methods used to find anomalies, patterns, and associations in data sets, with the purpose of predicting future outcomes.
Response variable: variable of interest that we wish to study.
3. Prescriptive analytics: the use of internal and external variables, along with the predictions obtained from predictive analytics, to recommend one or more courses of action.

1.6 Ratio, interval, ordinal, and nominative scales of measurement
Quantitative variables
1. Ratio: quantitative variable such that the ratios of its values are meaningful (eg. $5k/month is twice as much as $2.5k/month) and for which there is an inherently defined zero value (eg. 0 km is “no distance at all”). Eg. salary, height, weight, time, distance
2. Interval: quantitative variable where ratios of its values are not meaningful and there is no inherently defined zero value. Eg. temperature: 60 degrees is not twice as hot as 30 degrees, and 0 degrees doesn’t mean “no heat at all”.
Note: there are very few interval variables; almost all quantitative variables are ratio variables.
Qualitative variables
1. Ordinal: qualitative variable for which there is a meaningful ordering, or ranking, of the categories. Ordinal variables can be numerical or nonnumerical. Eg. satisfaction ranking from 0 to 5, or from “not satisfied” to “very satisfied”
2. Nominative: qualitative variable for which there is no meaningful ordering, or ranking, of the categories. Eg. colors of car, gender

1.7 Stratified random, cluster, and systematic sampling
Sample designs: methods for obtaining a sample.
Frame: a list of all of the population elements.
3 sample designs that are alternatives to random sampling:
1. Stratified random sampling
A sampling design in which we divide a population into non-overlapping subgroups (strata) and then select a random sample from each subgroup (stratum). Eg. city, suburban and rural populations can be the 3 selected strata for a consumer study.
2. Cluster sampling (multistage)
A sampling design in which we sequentially cluster population elements into subpopulations.
Note: “cluster” because at each stage we cluster the elements (eg. voters) into subgroups.
3. Systematic sampling
A sample taken by moving systematically through the population. Eg. pick a random starting point, then select every 200th person on the list.

1.8 More about surveys and errors in survey sampling
Survey questions can be:
1. Dichotomous (yes or no)
2. Multiple-choice
3. Open-ended

Types of surveys
1. Phone survey (low response rate)
2. Mail survey (low response rate)
3. Web survey (low response rate)
4. Personal interview survey (high response rate) Eg. mall survey

Errors occurring in surveys
● The target population and sample frame are not well defined.
Sample frame: list of sampling elements from which the sample will be selected. It should closely agree with the target population.
Eg. consider a study to estimate the avg starting salary of students who have graduated from JMSB over the last 5 years.
Target population: the group of graduates from JMSB
Sample frame: JMSB’s IB program graduates for the past 5 years.
● 2 general classes of survey errors:
Sampling error: the difference between a numerical descriptor of the population and the corresponding descriptor of the sample.
1. Errors of nonobservation: sampling error related to population elements that are not observed.
a. Undercoverage: occurs when some population elements are excluded from the process of selecting the sample.
b. Nonresponse
2. Errors of observation: sampling error that occurs when the data collected in a survey differ from the truth.
a. Recording error: occurs when either the respondent or interviewer incorrectly marks an answer.
b. Response bias: bias in the result obtained when carrying out a statistical study that is related to how survey participants answer the questions.

Assignment 1
1. Probability sampling is where we know the chance that each element will be included in the sample, which allows us to make stats inferences about the sampled population.
2. Data collected for a particular study are referred to as a data set.
3. Descriptive stats refers to describing the important aspects of a set of measurements.
4. Sampling error is the difference between a numerical descriptor of the population and the corresponding descriptor of the sample.
5. Traditional stats consists of a set of concepts and techniques that are used to describe populations and samples and to make statistical inferences about populations by using samples.
6. Methods for obtaining a sample are called sampling designs.
7. Which of the following is a type of question used in survey research? Dichotomous, open-ended, and multiple-choice
8. A data set provides information about some group of individual elements.
9. When the data being studied are gathered from a published source, this is referred to as an existing data source.
10. A ratio variable has the following characteristic: inherently defined zero value.
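
The sampling designs from sections 1.4 and 1.7 can be sketched in a few lines of Python. This is only an illustrative sketch: the function names, the frame of 1,000 numbered elements, and the strata in the usage note are assumptions made for the example, not part of the course material.

```python
import random

def sample_without_replacement(frame, n, seed=None):
    """Random sample of n distinct elements; every set of n is equally likely."""
    rng = random.Random(seed)
    return rng.sample(frame, n)

def sample_with_replacement(frame, n, seed=None):
    """Each chosen element is placed back, so it may be chosen again."""
    rng = random.Random(seed)
    return [rng.choice(frame) for _ in range(n)]

def systematic_sample(frame, step, seed=None):
    """Pick a random start among the first `step` elements, then take every step-th one."""
    rng = random.Random(seed)
    start = rng.randrange(step)
    return frame[start::step]

def stratified_sample(strata, n_per_stratum, seed=None):
    """Draw a random sample (without replacement) from each non-overlapping stratum."""
    rng = random.Random(seed)
    return {name: rng.sample(members, n_per_stratum)
            for name, members in strata.items()}

# Hypothetical frame: a list of 1,000 element IDs.
frame = list(range(1000))
print(sample_without_replacement(frame, 5, seed=1))
print(systematic_sample(frame, 200, seed=1))
```

For example, `stratified_sample({"city": frame[:500], "suburban": frame[500:800], "rural": frame[800:]}, 10)` would draw 10 elements from each of three made-up strata.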
Chapter 2

2.1 Graphically summarizing qualitative data
Frequency distribution: a table that summarizes the number (or frequency) of items in each of several non-overlapping classes.
Relative frequency = \( \dfrac{\text{frequency of item}}{n} \), where n is the total # of items.
Bar chart: a graphical display of data in categories made up of vertical or horizontal bars. Each bar gives the frequency, relative frequency, or percentage frequency of items in its corresponding category.
Pie chart: a graphical display of data in categories made up of pie slices representing the frequency, relative frequency, or percentage frequency of items in its corresponding category.
Pareto charts: a bar chart of the frequencies or percentages for various types of defects. These are used to identify opportunities for improvement.
Note: Pareto charts are sometimes plotted together with a cumulative percentage line (up to 100%).

2.2 Graphically summarizing quantitative data
Histogram: displays frequency distribution data
1. Find the number of classes
k is the smallest number of classes such that \( 2^k \) is greater than the total number of items (n) in the data set.
2. Find the class length
Class length = (largest measurement - smallest measurement) / k
Frequency polygons: graphical display in which we plot points representing each class frequency above their corresponding class midpoints and connect the points with lines.
Ogives (cumulative distribution)
Cumulative frequency distribution: a table that gives, for each class, the running total of the frequencies of that class and all previous classes.

2.3 Dot plots
Dot plot: graphical portrayal of a data set that shows the data set’s distribution by plotting individual data points above a horizontal axis.
Note: dot plots are useful for detecting outliers (unusually large or small observations that are well separated from the remaining observations)

2.4 Stem-and-leaf displays
Stem-and-leaf display: graphical portrayal of a data set that shows the data set’s distribution by using stems consisting of leading digits and leaves consisting of trailing digits.
Leaf unit: the unit that the leaf is representing. Eg. if leaf 5 represents 500, the leaf unit is 100.

2.5 Contingency tables

2.6 Scatter plots
Scatter plot: a graph that is used to study the possible relationship between 2 variables x and y. The observed values of y are plotted on the vertical axis and x on the horizontal. Eg. time series plot

2.8 Graphical descriptive analytics (recent)
● Analytic dashboard: a graphical representation of the current status and historical trends of a business’ key performance indicators. (like a car dashboard)
● Gauge: graphic similar to the speedometer of a car
● Bullet graphs: graphic that features a single measure and displays it as a horizontal/vertical bar that extends into ranges representing qualitative measures of performance, such as poor and good.
● Treemaps
● Sparklines

Summary
Qualitative data
● Pareto charts: specialized bar chart that orders the bars from the highest frequency to the lowest frequency
● Bar charts
● Pie charts
Quantitative data
● Histogram
● Frequency polygons
● Stem-and-leaf
● Dot plot
● Ogive plot (cumulative)
● Bullet graph
Dot plot displays individual data points.
Ogive plot is a curved display of the cumulative distribution of the data.
Box plot does not easily group measurements into classes.
Scatter plot is for looking at the relationship between 2 variables.

Assignment 2
1. Pareto charts are frequently used to identify the most common types of defects.
2. A stem-and-leaf is best used to display the shape of the distribution.
3. 30 items are rejected daily by a manufacturer because of defects for the last 30 days.
How many classes should be used in constructing a histogram? 5
4. What would be the first class interval for the frequency histogram? 5.2 < 6.6
There are 50 data measurements, so k is the smallest integer such that \( 2^k > 50 \). Since \( 2^6 = 64 \), use 6 classes. Class length = (13.5 - 5.2)/6 = 1.38, so the upper boundary of the first interval is 5.2 + 1.38 = 6.58. The first interval will contain the values 5.2 < 6.6.
5. A graphical portrayal of a quantitative data set that divides the data into classes and gives the frequency of each class is a histogram.
6. An example of manipulating a graphical display to distort reality is stretching the axes.
7. A MCQ on an exam has 4 possible responses (a, b, c, d). When 390 students take the exam, 117 give response a, 39 b, 78 c, and 156 d.
a. How many degrees would be assigned to the “pie slice” for a? (117/390) x 360 = 108 deg
b. How many degrees would be assigned to the “pie slice” for b? (39/390) x 360 = 36 deg
8. With the same 50 data in Q4, the shape of the distribution of the data is skewed to the right. With outliers at the stem of 13 and the majority of the data grouped around stems 6, 7, 8, the shape is skewed with the outliers to the right.
9. Bar chart displays the frequency of each class with qualitative data. Histogram displays the frequency of each class with quantitative data.
10. As a general rule, when creating a stem-and-leaf display, there should be 5-20 stem values. This range enables a reasonable display of the shape of the distribution.
11. A histogram that has a longer tail extending toward smaller values is skewed to the left.

Chapter 3

3.1 Describing central tendency
Central tendency: refers to the middle of a population or sample.
Population parameter: a descriptive measure of a population. It is a number calculated using the population measurements that describes some aspect of the population. Eg. the population mean (average) is a parameter.
Point estimate: a one-number estimate for the value of a population parameter.
Sample statistic: a descriptive measure of a sample. It is one way to find a point estimate of a population parameter. Eg. the sample mean (average) is a statistic.
Exam: among mean, median and mode, which is the best to use? Median
For a positively skewed distribution, the mean will typically be the highest estimate of central tendency and the mode the lowest (assuming that the distribution has only one mode).

3.2 Measures of variation
In addition to estimating a population’s central tendency, it is important to estimate the variability of the population’s individual values.
Variability (aka. spread/dispersion): refers to how spread out a set of data is. It gives you a way to describe how much data sets vary and allows you to use statistics to compare your data to other sets of data.
The 4 main ways to describe variability in a data set are:
1. Range
2. IQR
3. Variance: the average of the squared deviations of the individual population measurements from the mean.
4. Standard deviation

Empirical Rule
Tolerance interval: an interval of numbers that contains a specified percentage of the individual measurements in a population.
Under a normal distribution:
● \( \mu \pm \sigma \): 68.26%
● \( \mu \pm 2\sigma \): 95.44%
● \( \mu \pm 3\sigma \): 99.73%

Chebyshev’s Theorem
It allows us to find an interval that contains a specified percentage of the individual measurements in the population.
Chebyshev’s theorem: Consider any population that has mean \( \mu \) and standard deviation \( \sigma \). Then for any value of k greater than 1, at least \( 100\left(1 - \frac{1}{k^2}\right)\% \) of the population measurements lie in the interval \( [\mu \pm k\sigma] \).

Z-score
Z-score (aka. standardized value): the number of standard deviations that a measurement is from the mean. The quantity indicates the relative location of a measurement within its distribution.
● Positive z-score: x is above the mean
● Negative z-score: x is below the mean
Note: the z-score standardizes measurements from samples with different means and standard deviations, which makes them comparable.
Eg. Class A has an average of 65 and a standard deviation of 10; Class B has an average of 80 and a standard deviation of 5. A student in Class A who scores 85 stands in the same relative position as a student in Class B who scores 90, because their z-scores are equal: (85-65)/10 = 2 and (90-80)/5 = 2.

The coefficient of variation
Coefficient of variation: measures the variation of a population or sample relative to its mean.
\( COV = \dfrac{\text{standard deviation}}{\text{mean}} \times 100 \)

3.3 Percentiles and box-and-whisker display
Percentiles
● First quartile (Q1): 25th percentile
● Second quartile (median): 50th percentile
● Third quartile (Q3): 75th percentile
● Interquartile range (IQR) = Q3 - Q1
To find the index i of the pth percentile for a set of n measurements (arranged in increasing order):
\( i = \left(\dfrac{p}{100}\right) n \)
Note:
● If i is not an integer: round up to the next integer greater than i.
● If i is an integer: take the number at this index and the number at the next index, and average them.

Box-and-whiskers displays (box plots)
Box plot: a graphical portrayal of a data set that depicts both the central tendency and variability of the data. It is constructed using Q1, the median, and Q3.
1. Draw a box that extends from Q1 to Q3 and draw a vertical line at the median.
2. Determine the values of the lower and upper limits.
a. Lower limit: Q1 - 1.5 IQR
b. Upper limit: Q3 + 1.5 IQR
3. Draw whiskers as dashed lines that extend below Q1 and above Q3.
a. Draw one whisker from Q1 to the smallest number that is between the lower and upper limits.
b. Draw one whisker from Q3 to the largest number that is between the lower and upper limits.
4. Any number that is less than the lower limit or greater than the upper limit is an outlier.
Plot each outlier with “*”.

3.4 Covariance, correlation, and the least squares line
Covariance: a measure of the strength of the linear relationship between x and y.
Correlation coefficient (r): a measure of the strength of the linear relationship between x and y that lies between -1 and 1 and is independent of the units of x and y.
● r close to 1: x and y have a strong tendency to move together in a straight-line fashion with a positive slope. So x and y are highly related and positively correlated.
● r close to -1: x and y have a strong tendency to move together in a straight-line fashion with a negative slope. So x and y are highly related and negatively correlated.
Least squares line: the line that minimizes the sum of the squared vertical differences between points on a scatter plot and the line.
● Slope: \( b_1 = \dfrac{s_{xy}}{s_x^2} \)
● Y-intercept: \( b_0 = \bar{y} - b_1 \bar{x} \)

3.5 Weighted means and grouped data
Weighted mean: a mean where different measurements are given different weights based on their importance.
Weighted mean = \( \dfrac{\sum w_i x_i}{\sum w_i} \), where
\( x_i \) = the value of the ith measurement
\( w_i \) = the weight applied to the ith measurement
Eg. the percentage returns are the measurements and the weights applied are the amounts invested. We are weighting the percentage returns by the amounts invested.
Grouped data: data presented in the form of a frequency distribution or a histogram.

3.6 Geometric mean
Geometric mean: the constant return \( R_g \) that yields the same wealth at the end of the investment period as do the actual returns.
Note: unlike the arithmetic mean, the geometric mean takes compounding over time into consideration.

Assignment 3
Population or sample: the question will specify it. If it says “the numbers are collected from a larger group”, then it is a sample. If not specified, it is a population.
Histogram, standard deviation, box plot are a must for the exam.
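
The rules from sections 3.2 to 3.4 (Chebyshev's bound, z-scores, the percentile index rule, box plot limits, and the least squares line) can be checked numerically with a short Python sketch. The function names and the small data set are assumptions made for illustration. The covariance and variance here use the sample divisor n - 1, which is also an assumption; the slope is the same with either divisor because it cancels in the ratio.

```python
def chebyshev_min_percent(k):
    """At least this % of measurements lie within k standard deviations (k > 1)."""
    return 100 * (1 - 1 / k**2)

def z_score(x, mean, std_dev):
    """Number of standard deviations that x is from the mean."""
    return (x - mean) / std_dev

def percentile(data, p):
    """p-th percentile (integer p, 0 < p < 100) via the index rule i = (p/100) * n."""
    xs = sorted(data)
    n = len(xs)
    whole, remainder = divmod(p * n, 100)    # i = whole + remainder/100
    if remainder:                            # i not an integer: round up
        return xs[whole]                     # 1-based position whole + 1
    return (xs[whole - 1] + xs[whole]) / 2   # i an integer: average positions i and i + 1

def box_plot_limits(data):
    """Lower limit, upper limit, and the outliers that fall outside them."""
    q1, q3 = percentile(data, 25), percentile(data, 75)
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    outliers = [x for x in data if x < lower or x > upper]
    return lower, upper, outliers

def least_squares_line(xs, ys):
    """Return (b0, b1) with b1 = s_xy / s_x^2 and b0 = y_bar - b1 * x_bar."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (n - 1)
    s_x2 = sum((x - x_bar) ** 2 for x in xs) / (n - 1)
    b1 = s_xy / s_x2
    return y_bar - b1 * x_bar, b1

# Hypothetical data set: the values 1..12 plus one unusually large value.
data = list(range(1, 13)) + [30]
print(box_plot_limits(data))
print(z_score(85, 65, 10), z_score(90, 80, 5))
```

With this made-up data set, n = 13, so the index for Q1 is 3.25 (round up to position 4) and the index for Q3 is 9.75 (round up to position 10); the value 30 falls above the upper limit and is flagged as an outlier.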
Chapter 4 Probability and probability models

4.1 Probability, sample spaces, and probability models
Probability: number that measures the chance, or likelihood, that an event will occur when an experiment is carried out.
Experiment: a process of observation that has an uncertain outcome.
2 ways of collecting data:
1. Performing a controlled experiment
2. Observing uncontrolled events (eg. watching the stock market)
Sample space: the set of all possible experimental outcomes.
Note: the possible outcomes are also called sample space outcomes or experimental outcomes.

Methods of assigning probabilities
Classical method: method of assigning probabilities that can be used when all of the sample space outcomes are equally likely. Eg. dice, coin
Relative frequency method (long-run): method of estimating a probability by performing an experiment (in which an outcome of interest might occur) many times. Eg. sample testing
Subjective probability method: using experience, intuition or expertise to assess the probability of an event. Eg. horse betting

Probability models
Definition: a mathematical representation of a random phenomenon.
Types of random phenomenon:
● Experiment (Chap 4)
The probability model describing an experiment consists of
○ The sample space of the experiment
○ A procedure for calculating probabilities concerning the sample space outcomes
● Random variable (Chap 6, 7): a variable whose value is numeric and is determined by the outcome of an experiment
The probability model describing a random variable is called a probability distribution, and consists of
○ A specification of the possible values of the random variable
○ A table, graph, or formula that can be used to calculate probabilities concerning the values that the random variable might equal
2 types of probability distribution:
1. Discrete probability distribution (chap 6)
2. Continuous probability distribution (chap 7)

4.2 Probability and events
Event: a set of one or more sample space outcomes.
P(event): the sum of the probabilities of the sample space outcomes that correspond to the event.

4.3 Some elementary probability rules
1. Rule of complements
2. Addition rule
\( P(A \cup B) = P(A) + P(B) - P(A \cap B) \)
\( P(A \cup B \cup C) = P(A) + P(B) + P(C) - P(A \cap B) - P(A \cap C) - P(B \cap C) + P(A \cap B \cap C) \)
3. Mutually exclusive events
Events A and B are mutually exclusive if they have no sample space outcomes in common, thus events A and B cannot occur simultaneously.
\( P(A \cap B) = 0 \)

4.4 Conditional probability and independence
Conditional probability: the probability that one event will occur given that we know that another event has occurred.
\( P(A \mid B) = \dfrac{P(A \cap B)}{P(B)} \)
\( P(A \cap B) = P(A \mid B)\,P(B) = P(B \mid A)\,P(A) \)
Independent events
2 events A and B are independent iff:
1. \( P(A \mid B) = P(A) \) or, equivalently,
2. \( P(B \mid A) = P(B) \)
Assume that P(A) and P(B) are greater than 0.
If A and B are independent events, then \( P(A \cap B) = P(A)\,P(B) \)

4.5 Bayes’ theorem
1. We start with a prior (initial) probability that an event will occur.
2. When new information appears, we use Bayes’ theorem to revise the prior probability.
3. The revised probability is called the posterior probability.

4.6 Counting rules
Counting rule for combinations
The number of combinations of n items selected from N items is:
\( \dbinom{N}{n} = \dfrac{N!}{n!\,(N-n)!} \)

Contingency table:
● Marginal probability: probability of the occurrence of 1 event.
● Joint probability: probability of the occurrence of 2 or more events together.

Practice 4
1. A manager has just received the expense checks for 6 of her employees. She randomly distributes the checks to the 6 employees. What is the probability that exactly 5 of them will receive the correct checks?
0. If 5 employees receive their correct checks, the 6th person must receive the correct check as well. So the probability that exactly 5 receive the correct checks and the 6th receives the wrong check is 0.
2. A group has 12 men and 4 women.
If 3 people are selected at random from the group, what is the probability that they are all men?
Probability = \( \dfrac{\binom{12}{3}}{\binom{16}{3}} = \dfrac{220}{560} = 0.3929 \)

Quiz 4
1. Container 1 has 8 items, 3 of which are defective. Container 2 has 5 items, 2 of which are defective. If one item is drawn from each container, what is the probability that only one of the items is defective?
\( \dfrac{3}{8} \times \dfrac{3}{5} + \dfrac{5}{8} \times \dfrac{2}{5} = 0.475 \)
2. A family has two children. What is the probability that both are girls, given that at least one is a girl?
Sample space = {BB, BG, GB, GG}
At least 1 girl: P(G1) = 3/4
Both girls: P(GG) = 1/4
\( P(GG \mid G1) = \dfrac{P(GG \cap G1)}{P(G1)} = \dfrac{1/4}{3/4} = \dfrac{1}{3} \)
3. A lot contains 12 items, and 4 are defective. If three items are drawn at random from the lot, what is the probability they are not defective?
\( \dfrac{\binom{8}{3}}{\binom{12}{3}} = \dfrac{8 \times 7 \times 6}{12 \times 11 \times 10} = 0.2545 \)
Tip: the number drawn (3) appears in both binomial coefficients; the smaller group (8 non-defective) goes on top and the whole lot (12) on the bottom.
4. Three data entry specialists enter requisitions into a computer. Specialist 1 processes 33 percent of the requisitions, specialist 2 processes 38 percent, and specialist 3 processes 29 percent. The proportions of incorrectly entered requisitions by data entry specialists 1, 2, and 3 are .04, .02, and .04, respectively. Suppose that a random requisition is found to have been incorrectly entered. What is the probability that it was processed by data entry specialist 1? By data entry specialist 2? By data entry specialist 3?
P(S1) = 0.33, P(S2) = 0.38, P(S3) = 0.29
P(I | S1) = 0.04, P(I | S2) = 0.02, P(I | S3) = 0.04
By Bayes’ theorem:
\( P(S1 \mid I) = \dfrac{P(S1)\,P(I \mid S1)}{P(S1)\,P(I \mid S1) + P(S2)\,P(I \mid S2) + P(S3)\,P(I \mid S3)} \)
5. If events A and B are independent, then the probability of simultaneous occurrence of event A and event B can be found with ________.
All of these choices are correct:
● P(A)P(B|A)
● P(B)P(A|B)
● P(A)P(B)

Chapter 6 - Discrete Random Variables

6.1 Two types of random variable
Random variable: a variable whose value is uncertain and numerical and is determined by the outcome of an experiment.
A random variable assigns one and only one numerical value to each experimental outcome.
Discrete random variable: a random variable whose possible values can be counted or listed, either as a finite set of values or as a countably infinite list. Eg. the number of cars sold next month, x = 0, 1, 2, 3…
Continuous random variable: a random variable that may assume any numerical value in one or more intervals on the real number line. Not countable. Eg. interest rate (%), time (s), temperature (F), weight (kg), car mileage (km/l)

6.2 Discrete probability distributions p(x)
Discrete probability distribution: table, graph or formula that gives the probability associated with each of the discrete random variable’s values.
Properties of a DPD p(x):
1. \( p(x) \ge 0 \) for each value of x
2. \( \sum_{\text{all } x} p(x) = 1 \)
Expected value (or mean) of a DRV
\( \mu_x = \sum_{\text{all } x} x\,p(x) \)
Variance of a DRV
\( \sigma_x^2 = \sum_{\text{all } x} (x - \mu_x)^2\,p(x) \)
Equivalent computational form: \( \sigma_x^2 = \sum_{\text{all } x} x^2\,p(x) - \left( \sum_{\text{all } x} x\,p(x) \right)^2 \)
Standard deviation of a DRV
\( \sigma_x = \sqrt{\sigma_x^2} \)

6.3 Binomial distribution
Binomial distribution (or binomial model): the probability distribution that describes a binomial random variable, which is defined to be the total number of successes in n trials of a binomial experiment.
The number of ways to arrange x successes among n trials: \( \dfrac{n!}{x!\,(n-x)!} \)

The binomial distribution (binomial model)
Binomial experiment’s characteristics:
1. It has n identical trials.
2. Each trial results in a success or a failure (2 results, thus “bi”).
3. The probability of a success on any trial is p and remains constant from trial to trial. Thus, the probability of failure on any trial, q = 1 - p, remains constant too.
4. Trials are independent.
If a binomial random variable x = the total number of successes in n trials of a binomial experiment, then the probability of obtaining x successes in n trials is:
\( p(x) = \dfrac{n!}{x!\,(n-x)!}\,p^x q^{n-x} \)
Binomial tables: show the probability of x successes in n trials, with success rate p.
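
A binomial table can be reproduced directly from the formula above. The following Python sketch is illustrative only: the function names and the values n = 5, p = 0.1 are assumptions chosen for the example. It builds p(x) for every x and then applies the expected value and variance formulas from section 6.2 to the resulting distribution.

```python
from math import comb

def binomial_pmf(x, n, p):
    """Probability of exactly x successes in n independent trials with success rate p."""
    q = 1 - p
    return comb(n, x) * p**x * q**(n - x)

def expected_value(dist):
    """mu_x = sum of x * p(x) over all x; dist maps each value x to p(x)."""
    return sum(x * px for x, px in dist.items())

def variance(dist):
    """sigma_x^2 = sum of (x - mu_x)^2 * p(x) over all x."""
    mu = expected_value(dist)
    return sum((x - mu) ** 2 * px for x, px in dist.items())

# Illustrative values (assumed): n = 5 trials, success probability p = 0.1.
n, p = 5, 0.1
dist = {x: binomial_pmf(x, n, p) for x in range(n + 1)}

for x, px in dist.items():
    print(f"p({x}) = {px:.4f}")
print("mean =", round(expected_value(dist), 4))
print("variance =", round(variance(dist), 4))
```

Computing the mean and variance from the full table is slower than using a closed-form shortcut, but it works for any discrete distribution, not only the binomial.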
Mean: \( \mu_x = np \)
Variance: \( \sigma_x^2 = npq \)
Standard deviation: \( \sigma_x = \sqrt{npq} \)
Where n is the number of trials, p is the probability of success on each trial, and q = 1 - p.

6.4 Poisson distribution (Poisson model)
Poisson distribution: describes a Poisson random variable, which is the number of occurrences of an event over a specified interval of time or space.
Assume:
1. The probability of the event’s occurrence is the same for any 2 intervals of equal length.
2. Whether the event occurs in any interval is independent of whether the event occurs in any other non-overlapping interval.
The probability that the event will occur x times in a specified interval is:
\( p(x) = \dfrac{e^{-\mu}\,\mu^x}{x!} \)
where \( \mu \) is the mean (or expected) number of occurrences of the event in the specified interval, and e = 2.71828... is the base of Napierian (natural) logarithms.
Mean: \( \mu_x = \mu \)
Variance: \( \sigma_x^2 = \mu \)
Standard deviation: \( \sigma_x = \sqrt{\mu} \)
Where \( \mu \) is the mean number of occurrences of an event over the specified interval of time or space of interest.

6.5 Hypergeometric distribution
Suppose a population consists of N items, of which r are successes and N - r are failures. If we randomly select n items without replacement from the population, the probability that x of the n randomly selected items will be successes is given by the hypergeometric probability formula:
\( p(x) = \dfrac{\binom{r}{x}\binom{N-r}{n-x}}{\binom{N}{n}} \)
Where:
\( \binom{r}{x} \) is the number of ways x successes can be selected from the total of r successes in the population.
\( \binom{N-r}{n-x} \) is the number of ways n - x failures can be selected from the total of N - r failures in the population.
\( \binom{N}{n} \) is the number of ways a sample of size n can be selected from a population of size N.
Mean: \( \mu_x = n\left(\dfrac{r}{N}\right) \)
Variance: \( \sigma_x^2 = n\left(\dfrac{r}{N}\right)\left(1 - \dfrac{r}{N}\right)\left(\dfrac{N-n}{N-1}\right) \)
Note: if the population size N is “much larger” than the sample size n (at least 20 times larger), then making selections will not substantially change the probability of a success.
We can assume that the probability of a success stays essentially constant from selection to selection, and that the different selections are essentially independent of each other. In this case, we can approximate the hypergeometric distribution by using the binomial distribution with p = r/N:
$p(x) = \frac{n!}{x!\,(n-x)!}\,p^x (1-p)^{n-x} = \frac{n!}{x!\,(n-x)!}\left(\frac{r}{N}\right)^x \left(1 - \frac{r}{N}\right)^{n-x}$

6.6 Joint distributions and the covariance
Joint probability distribution of (x, y): a probability distribution that assigns probabilities to all combinations of values of x and y.
To further measure the association between x and y, we calculate the covariance between x and y.
Covariance: a measure of how 2 random variables vary together, linearly, around their expected values. Covariance only tells us the direction of the relationship, not its strength.
● A positive covariance says that as x increases, y tends to increase in a linear fashion.
● A negative covariance says that as x increases, y tends to decrease in a linear fashion.
$\sigma_{xy}^2 = \sum (x - \mu_x)(y - \mu_y)\,P(x, y)$
($\sigma_{xy}^2$ is the symbol these notes use for the covariance. Unlike a variance, it can be negative.)
Note: covariance helps us understand the importance of investment diversification.
Property of the expected value of mixed investments (say P = 0.5x + 0.5y):
$\mu_{(ax+by)} = a\mu_x + b\mu_y$
Property of the variance of mixed investments:
$\sigma_{(ax+by)}^2 = a^2\sigma_x^2 + b^2\sigma_y^2 + 2ab\,\sigma_{xy}^2$
Correlation: measures the strength of the linear relationship between variables. Correlation is the covariance scaled by the two standard deviations.
Correlation coefficient between x and y: $\rho = \frac{\sigma_{xy}^2}{\sigma_x \sigma_y}$
4 properties of expected values and variances:
1. If a is a constant and x is a random variable, $\mu_{ax} = a\mu_x$
2. If x1, x2, … are random variables, $\mu_{(x_1+x_2+\dots)} = \mu_{x_1} + \mu_{x_2} + \dots$
3. If a is a constant and x is a random variable, $\sigma_{ax}^2 = a^2\sigma_x^2$
4. If x1, x2, … are independent random variables, then the covariance between any 2 of these random variables is 0 and $\sigma_{(x_1+x_2+\dots)}^2 = \sigma_{x_1}^2 + \sigma_{x_2}^2 + \dots$

Assignment 6
1. A total of 50 raffle tickets are sold for a contest to win a car.
If you purchase one ticket, what are your odds against winning? Answer: 49 to 1
2. If p = .1 and n = 5, then the corresponding binomial distribution is: Right skewed
3. If you were asked to play a game in which you tossed a fair coin three times and were given $2 for every head you threw, how much would you expect to win on average? Answer: $3. The expected number of heads is E(x) = np = 3 × 0.5 = 1.5, so the money earned on average is 1.5 × $2 = $3.
4. For a random variable X, the mean value of the squared deviations of its values from their expected value is called its ________. Answer: Variance
5. Which one of the following statements is not an assumption of the binomial distribution? Answer: Sampling with replacement
6. Which of the following is a valid probability value for a discrete random variable? Answer: 0.2 (a probability must be between 0 and 1)
7. An insurance company will insure a $75,000 particular automobile make and model for its full value against theft at a premium of $1500 per year. Suppose that the probability that this particular make and model will be stolen is .0075. Find the premium that the insurance company should charge if it wants its expected net profit to be $2000.
Answer: Premium - $75,000 × 0.0075 = $2000, so Premium = $2000 + $562.50 = $2562.50

Chapter 7 Continuous random variables

7.1 Continuous probability distributions
Continuous random variable: a random variable that may assume any numerical value in one or more intervals on the real number line.
Continuous probability distribution (aka. probability curve or probability density function): the curve f(x) is the continuous probability distribution of the random variable x if the probability that x will be in a specified interval of numbers is the area under the curve f(x) corresponding to the interval.
Properties of a CPD:
1. $f(x) \ge 0$ for any value of x
2. The total area under the curve f(x) = 1

7.2 Uniform distribution
Uniform distribution: a continuous probability distribution having a rectangular shape, meaning the probability is distributed evenly over an interval of numbers.
If c and d are numbers on the real line (c < d), the equation describing the uniform distribution is
$f(x) = \frac{1}{d-c}$ for $c \le x \le d$, and $f(x) = 0$ otherwise
Mean: $\mu_x = \frac{c+d}{2}$
Standard deviation: $\sigma_x = \frac{d-c}{\sqrt{12}}$
Eg. imagine the waiting time for an elevator is uniformly distributed between 0 and 4 minutes. The uniform distribution is f(x) = ¼ for $0 \le x \le 4$, having the shape of a rectangle with base 4 - 0 and height ¼.

7.3 Normal probability distribution
Normal distribution: the most important continuous probability distribution. Its probability curve is the bell-shaped normal curve:
$f(x) = \frac{1}{\sigma\sqrt{2\pi}}\,e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2}$
where $\mu$ and $\sigma$ are the mean and standard deviation of the population, and e ≈ 2.71828.
Note: we use a normal curve table to find areas (thus probabilities) under the normal curve.
Properties of the normal curve:
1. The shape of each normal distribution is determined by its mean and its standard deviation.
2. The highest point on the normal curve is located at the mean $\mu$, which is also the median and the mode of the distribution.
● The higher the mean $\mu$, the further the curve is shifted to the right
● The higher the standard deviation $\sigma$, the flatter the curve becomes
3. The normal distribution is symmetrical:
a. The area under the normal curve to the right of the mean equals the area under the curve to the left of the mean, and each area = 0.5
4. The tails of the normal curve extend to infinity but never touch the horizontal axis. The tails get close enough to the horizontal axis to ensure that the total area under the normal curve = 1.
The Empirical Rule comes in handy here with 3 important percentages: about 68% of the area lies within 1 standard deviation of the mean, about 95% within 2, and about 99.7% within 3.

The Standard Normal Distribution
If a random variable x is normally distributed with mean $\mu$ and standard deviation $\sigma$, then the random variable $z = \frac{x-\mu}{\sigma}$ is normally distributed with mean 0 and standard deviation 1. A normal distribution with mean 0 and standard deviation 1 is called a standard normal distribution.
Note: $z = \frac{x-\mu}{\sigma}$ expresses the number of standard deviations that x is from the mean.
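The standardization step can be sketched in Python as a check on table lookups. This is an illustrative sketch only, not part of the original notes: the function names are made up, the numbers (mean 100, standard deviation 15) are assumed, and the area to the left of z is computed from the standard library's error function `math.erf` rather than read from a printed normal table.

```python
from math import erf, sqrt


def z_score(x: float, mu: float, sigma: float) -> float:
    """Number of standard deviations that x lies from the mean."""
    return (x - mu) / sigma


def standard_normal_cdf(z: float) -> float:
    """Area under the standard normal curve to the left of z."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))


def normal_probability_between(a: float, b: float, mu: float, sigma: float) -> float:
    """P(a <= x <= b) for a normal random variable with mean mu and SD sigma."""
    return standard_normal_cdf(z_score(b, mu, sigma)) - standard_normal_cdf(
        z_score(a, mu, sigma)
    )


# Illustrative values (assumed for this sketch): mu = 100, sigma = 15
z = z_score(130, 100, 15)  # 130 is two standard deviations above the mean
within_one_sd = normal_probability_between(85, 115, 100, 15)  # roughly 0.68
```

The second result is the area within one standard deviation of the mean, which is the first of the Empirical Rule percentages.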
Cumulative normal table: gives the area under the standard normal curve to the left of z, for many different values of z.
Positive and negative z-values
● Positive z-value: $z_\alpha$ is the point on the horizontal axis under the standard normal curve that gives a right-hand tail area equal to $\alpha$. Eg. the number of cases to order so that there is only a 5% chance the store will run short uses $z_{0.05}$, with the number of cases ordered on the x-axis.
● Negative z-value: $-z_\alpha$ is the point on the horizontal axis under the standard normal curve that gives a left-hand tail area equal to $\alpha$. Eg. the number of months to guarantee so that only 1% of the batteries will need to be replaced free of charge uses $-z_{0.01}$, with battery life on the x-axis.

7.4 Approximating the binomial distribution by using the normal distribution
Consider a binomial random variable x, where n is the number of trials and p is the probability of success on each trial. If $np \ge 5$ and $n(1-p) \ge 5$, then x is approximately normally distributed with mean $\mu = np$ and standard deviation $\sigma = \sqrt{npq}$

7.5 Exponential distribution
A probability distribution with mean $\frac{1}{\lambda}$ that describes the time or space between successive occurrences of an event when the number of times the event occurs over an interval of time is described by a Poisson distribution with mean $\lambda$.
If x is described by an exponential distribution with mean $\frac{1}{\lambda}$, then the equation of the probability curve describing x is
$f(x) = \lambda e^{-\lambda x}$ for $x \ge 0$, and $f(x) = 0$ otherwise
Using this probability curve, it can be shown that:
$P(a \le x \le b) = e^{-\lambda a} - e^{-\lambda b}$
Mean and standard deviation of the exponential distribution: $\mu_x = \sigma_x = \frac{1}{\lambda}$
Note: exponential and related Poisson distributions are useful in analyzing waiting lines or queues. Eg. queuing theory attempts to determine the number of servers that strikes an optimal balance between the time customers wait for service and the cost of providing service.

Quiz 4
1. Consider a normal population with a mean of 10 and a variance of 4. Find P(X > 18). Answer: approximately 0. The standard deviation is $\sqrt{4} = 2$, so z = (18 - 10)/2 = 4.
The normal table stops just below z = 4, so the area to the right of z = 4 is treated as 0 (the exact value is about 0.00003).
2. The relationship between the standard normal random variable, z, and normal random variable, X, is that the standard normal variable z counts the number of standard deviations that the value of the normal random variable X is away from its mean.
3. The weight of a product is normally distributed with a mean of 5 ounces. A randomly selected unit of this product weighs 7.1 ounces. The probability of a unit weighing more than 7.1 ounces is .0014. The production supervisor has lost files containing various pieces of information regarding this process, including the standard deviation. Determine the value of the standard deviation for this process.
Answer: P(x > 7.1) = 0.0014, so P(x ≤ 7.1) = 1 - 0.0014 = 0.9986. The normal table gives $z_{.0014} = 2.98$. Since $z = \frac{x-\mu}{\sigma}$, $\sigma = \frac{7.1 - 5}{2.98} \approx 0.70$

Midterm questions
1. From a population of size 2,000, a random sample of 200 items is selected. The mean of the sample: Can be larger, smaller or equal to the population mean
2. When a class interval is expressed as: 100 to under 200, it implies that: The class must contain an observation with a value of 100
3. Consider a statistic defined as the distance between the 33rd percentile and 67th percentile. This statistic would give us information concerning: Variability
4. Long question: let s = the sum of the returns from 2 projects, find $p(s \ge 18{,}000 \mid s \ge 12{,}000)$.
$p(s \ge 18{,}000 \mid s \ge 12{,}000) = \frac{p(s \ge 18{,}000 \,\cap\, s \ge 12{,}000)}{p(s \ge 12{,}000)} = \frac{p(s \ge 18{,}000)}{p(s \ge 12{,}000)}$,
because if s ≥ 18,000 then s ≥ 12,000, so $p(s \ge 18{,}000 \,\cap\, s \ge 12{,}000) = p(s \ge 18{,}000)$
$p(s \ge 12{,}000) = p(6)p(6) + p(18)p(18) + p(6)p(18) + p(18)p(6) = 0.7569$
$p(s \ge 18{,}000) = p(18)p(18) + p(18)p(6) + p(6)p(18) = 0.3344$
$p(s \ge 18{,}000 \mid s \ge 12{,}000) = \frac{0.3344}{0.7569} = 0.4418$
5.
For a positively skewed distribution, the mean will always be the highest estimate of central tendency and the mode will always be the lowest estimate of central tendency (assuming that the distribution has only one mode). In a right-skewed distribution the order from lowest to highest is: mode → median → mean

Chapter 8 Sampling distributions

8.1 Sampling distribution of the sample mean $\bar{x}$
The sampling distribution of $\bar{x}$ is the probability distribution of the population of all possible sample means that could be obtained from all possible samples of the same size.
Note: one purpose of the sampling distribution of $\bar{x}$ is to tell how accurate the sample mean is likely to be as a point estimate of the population mean. But when the population is large, it is hard to tell.
Unbiased point estimate: a sample statistic is an unbiased point estimate of a population parameter if the mean of the population of all possible values of the sample statistic equals the population parameter. For the sample mean, $\mu_{\bar{x}} = \mu$.
The population of all possible sample means has:
1. A normal distribution, if the sampled population has a normal distribution
2. Mean $\mu_{\bar{x}} = \mu$: the sampling distribution of $\bar{x}$ has a mean equal to the population mean
3. Standard deviation $\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}}$, if the sampled population is infinite or at least 20 times the sample size.
Note: $\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}}$ means that if the sample size n > 1, the SD of the sampling distribution is smaller than the SD of the population.
The larger the sample size n, the smaller the spread of the sampling distribution, so the sample means cluster more closely around the population mean $\mu$ and it is more likely to obtain a sample mean that is near the population mean.

8.2 Central limit theorem
If the sample size n is large ($n \ge 30$), then the sampling distribution of $\bar{x}$ is approximately normal, even if the sampled population is not normally distributed.
Note:
● The larger the sample size n is, the more nearly normally distributed is the population of all possible sample means.
● The more skewed the probability distribution of the sampled population, the larger the sample size must be for the population of all possible sample means to be approximately normally distributed.
● As the sample size increases, the spread of the distribution of all possible sample means decreases (ie. the spread is measured by $\sigma_{\bar{x}}$, so $\sigma_{\bar{x}}$ decreases as well)

Unbiasedness and min-variance estimates
Sampling distribution of a sample statistic: the probability distribution of the population of all possible values of the sample statistic (a descriptive measure, eg. sample mean, sample median, sample SD, etc).
Unbiased point estimate:
● The sample mean is also called a min-variance unbiased point estimate of $\mu$.
● $s^2$ is an unbiased point estimate of $\sigma^2$ if the sampled population is infinite.

8.2 The sampling distribution of the sample proportion $\hat{p}$
The population of all possible sample proportions:
1. Approximately has a normal distribution, if the sample size n is large
2. Has mean $\mu_{\hat{p}} = p$
3. Has standard deviation $\sigma_{\hat{p}} = \sqrt{\frac{p(1-p)}{n}}$
Note: n should be considered large if both np and n(1-p) are at least 5.

Chapter 9 Confidence intervals

9.1 Z-based confidence intervals for a population mean: σ known
Confidence interval for a population mean: an interval constructed around the sample mean so that we are reasonably confident that this interval contains the population mean.
Confidence level: the percentage of time that a confidence interval would contain a population parameter if all possible samples were used to calculate the interval.
Margin of error: the quantity that is added to and subtracted from a point estimate of a population parameter to obtain a confidence interval for the parameter. Eg.
$[\bar{x} \pm \text{margin of error}]$
Increasing the confidence level has:
● The advantage of being more confident that $\mu$ is contained in the confidence interval
● The disadvantage of increasing the margin of error and thus providing a less precise estimate of the true value of $\mu$

9.2 T-based confidence interval for a population mean: σ unknown
T-distribution: a commonly used continuous probability distribution that is described by a distribution curve similar to a normal curve. The t curve is symmetrical about 0 and is more spread out than a standard normal curve.
If we don’t know $\sigma$, we can use s to help construct a confidence interval for $\mu$:
$t = \frac{\bar{x} - \mu}{s/\sqrt{n}}$
Degrees of freedom (df): determine the spread of the t curve, $df = n - 1$
When the sample size is at least 30, it is safe to use the t-table.
Note: the z-table and t-table give nearly the same values when the sample size (df) is large. It is reasonable to approximate the value of $t_\alpha$ by $z_\alpha$ when df is greater than 100.

9.3 Sample size determination

9.4 Confidence intervals for a population proportion
Note: if both np and n(1-p) are larger than 5, you can use the z-table.
Note: if the computed sample size is not a whole number, always round up to the next integer.

Quiz 5
1. The width of a confidence interval will be
a. Narrower for 99% confidence than 95% confidence
b. Wider for a sample size of 100 than for size of 50
c. Narrower for 90% confidence than 95% confidence
d. Wider when the sample s is small than when s is large
2. The internal auditing staff of a local manufacturing company performs a sample audit each quarter to estimate the proportion of accounts that are current (between 0 and 60 days after billing). The historical records show that over the past 8 years 70 percent of the accounts have been current. Determine the sample size needed in order to be 95% confident that the sample proportion of the current customer accounts is within .03 of the true proportion of all current accounts for this company.
$n = \frac{(z_{\alpha/2})^2\,p(1-p)}{E^2} = \frac{z_{0.025}^2 \times 0.7 \times 0.3}{0.03^2} = \frac{1.96^2 \times 0.21}{0.0009} \approx 896.4$, rounded up to n = 897
3.
In the case where E is not given: if the interval is [100, 200], then E = (200 - 100)/2 = 50, since the point estimate sits at the midpoint of the interval.

Chapter 10 Hypothesis testing

10.1 The null and alternative hypotheses and errors in hypothesis testing
When doing hypothesis testing, it’s important to decide which of the statements is the null hypothesis and which is the alternative hypothesis.
1. Null hypothesis ($H_0$): the statement being tested. It’s given the benefit of the doubt and is not rejected unless there is convincing sample evidence that it is false. Ie. we assume that $H_0$ is true and will reject $H_0$ only if there is convincing sample evidence.
2. Alternative hypothesis ($H_1$): the statement that is assigned the burden of proof. It is accepted only if there is convincing sample evidence that it is true.
We always test the statement in $H_0$.
$H_0$ contains: =, ≤ or ≥
$H_1$ contains: ≠, > or <
State $H_0$ first, then state the matching $H_1$.
Eg. wording of a conclusion when $H_0$ is not rejected: "I don’t have sufficient information to claim that the speed is less than 7."
P-value
● If p-value < α, z is in the rejection area (reject $H_0$)
● If p-value > α, z is not in the rejection area (do not reject $H_0$)

Chapter 13 Chi-square tests
Goodness of fit:
Condition: E = np > 5. If np < 5, we need to use a bigger sample size.
Chi-square tests are right-tailed in this course (the chi-square curve is skewed to the right).
Steps in doing Chi tests:
1. $H_0$: $P_1 = 10\%$, $P_2 = 20\%$, $P_3 = 25\%$, $P_4 = 30\%$, $P_5 = 15\%$
$H_1$: at least one $P_i$ differs from its hypothesized value
2. Critical value (CV): the Chi-score from the Chi table with df = k - 1, where k is the number of proportions. Note: always take the Chi-score for the right tail. df = 5 - 1 = 4 for this case.
3. $\chi^2 = \sum_{i=1}^{k} \frac{(O_i - E_i)^2}{E_i}$
4. Reject or don’t reject $H_0$
Note: in the hypotheses, we never put the numbers collected from samples! Put the sample numbers in the Observed data. So don’t put $P_1$ = 253/1200 in the hypothesis. Use the hypothesis given in the question text instead, eg. $H_0$: $P_1 = P_2 = P_3 = P_4 = P_5 = 0.2$
The rejection area is on the right side; for anything to its left, $H_0$ is not rejected.
The Chi-square formula basically measures how far the observed counts are from the counts expected under the hypothesis.
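The goodness-of-fit steps above can be sketched in Python. This is a minimal illustration, not the course's prescribed method: the function name is made up, the observed counts are invented for the example (a hypothetical sample of 200), and the critical value 9.488 is the usual chi-square table entry for df = 4 with a right-tail area of 0.05. Only the hypothesized proportions come from the notes.

```python
def chi_square_goodness_of_fit(
    observed: list[int], proportions: list[float]
) -> tuple[float, list[float]]:
    """Return the chi-square statistic and the expected counts E_i = n * P_i."""
    n = sum(observed)
    expected = [n * p for p in proportions]
    statistic = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    return statistic, expected


# Step 1: H0 proportions from the notes (10%, 20%, 25%, 30%, 15%)
h0_proportions = [0.10, 0.20, 0.25, 0.30, 0.15]

# Hypothetical observed counts, invented for this sketch (they sum to 200)
observed_counts = [25, 35, 55, 50, 35]

# Step 3: the test statistic
chi_square, expected_counts = chi_square_goodness_of_fit(observed_counts, h0_proportions)

# Condition from the notes: every expected count should be large enough
condition_ok = all(e >= 5 for e in expected_counts)

# Step 2: critical value for df = 5 - 1 = 4, right-tail area 0.05 (table value)
critical_value = 9.488

# Step 4: reject H0 only if the statistic falls in the right-hand rejection area
reject_h0 = chi_square > critical_value
```

With these made-up counts the statistic is below the critical value, so $H_0$ would not be rejected.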
Test of independence or homogeneity
Expected value = (row total) × (column total) / (grand total)
$H_0$ always assumes the variables are independent (not related). Ie. $H_0$: Gender and owning a cell phone are not related. $H_1$: Gender and owning a cell phone are related.
Note: if the 2 variables are independent, the proportions should be approximately evenly distributed. Eg. for Age against categories A, B, C:
0-10: ≅33%
10-30: ≅33%
>30: ≅33%

Chapter 13 (Nov 16 class)
Slope coefficient: B1

Chapter 15
For multiple tests, use the F test.
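Going back to the expected-value rule for the test of independence, here is a small Python sketch of how the expected counts and the chi-square statistic could be computed from a two-way table. The function name and the 2 × 2 counts (gender by cell phone ownership) are hypothetical and invented only for illustration.

```python
def expected_counts(observed: list[list[float]]) -> list[list[float]]:
    """Expected count per cell: (row total) x (column total) / (grand total)."""
    row_totals = [sum(row) for row in observed]
    column_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)
    return [[r * c / grand_total for c in column_totals] for r in row_totals]


# Hypothetical counts: rows = gender, columns = owns a cell phone (yes, no)
observed_table = [[40, 10], [30, 20]]

expected_table = expected_counts(observed_table)

# Same chi-square formula as for goodness of fit, summed over every cell
chi_square_independence = sum(
    (o - e) ** 2 / e
    for observed_row, expected_row in zip(observed_table, expected_table)
    for o, e in zip(observed_row, expected_row)
)
```

If the two variables were independent, the observed counts would sit close to the expected counts and the statistic would be small.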