Uploaded by kabico7051

COMM 215 Notes

Chapter 1 Introduction
1.1 Data
Element: a person, object, or other entity about which we wish to draw a conclusion.
Variable: a characteristic of a population or sample element.
● Quantitative: a variable having values that are numbers representing quantities.
Eg. Selling price, temperature, car mileage
● Qualitative: a variable having values that indicate into which of several categories a
population element belongs.
Eg. weather, gender, car color
Data set: facts and figures, taken together, that are collected for a statistical study.
● Cross-sectional data: data collected at the same point of time.
Eg. cell phone costs of different employees in June.
● Time series data: data collected over different time periods.
Eg. Temperature of each month
1.2 Data sources, data warehousing, and big data
Type of data sources:
● Primary data: data collected by an individual or business directly through planned
experimentation or observation.
○ Experimentation: a statistical study in which the analyst is able to set or
manipulate the values of the factors.
○ Observation: a statistical study in which the analyst is not able to control
the values of the factors.
● Secondary data: data taken from an existing source.
Eg. Internet, company reports, business journals
Experimental and observational studies (DIY)
Response variable: variable of interest that we wish to study.
Factors: other variables that may be related to the response variable.
● Experimental study
● Observational study
Eg. in studies of diet and cholesterol, patients are unlikely to follow the prescribed
diets, thus diet is a factor, and cholesterol level is the response variable.
● Survey
Transactional data, data warehousing and big data
Data warehousing: a process of centralized data management and retrieval and has as
its ideal objective the creation and maintenance of a central repository for all of an
organization’s data.
Ie. the process of storing customers’ transactional data (eg. address, phone number,
etc)
Big data: massive amounts of data, often collected at very fast rates in real time and in
different forms and sometimes needing quick preliminary analysis for effective business
decision making.
Note: the huge capacity of data warehouses has given rise to the term big data.
1.3 Populations, samples, and traditional statistics
Population: the set of all elements about which we wish to draw conclusions.
Eg. all current MasterCard holders, all of last year’s graduates.
Population measurements: carry out a measurement to assign a value of a variable to
each and every population element.
Census: examination of all elements in a population. (usually small group)
Sample: a subset of the elements of a population. (for large group)
Note: when we measure a characteristic of the elements in a sample, we have a sample of
measurements.
Descriptive statistics: the science of describing the important aspects of a set of
measurements.
Statistical inference: the science of using a sample of measurements to make
generalizations about the important aspects of a population of measurements. (for large
group)
Traditional statistics
Traditional statistics consists of a set of concepts and techniques that are used to describe
populations and samples and to make statistical inferences about populations by using
samples.
Note: much of this book is devoted to traditional stats, but traditional stats is
sometimes not sufficient to analyze big data.
2 related extensions to help (chapter 1.5):
1. Business analytics: the use of traditional and newly developed stats methods,
advances in information systems, and techniques from management science to
continuously and iteratively explore and investigate past business
performance, with the purpose of gaining insight and improving business planning
and operations.
2. Data mining: the process of discovering useful knowledge in extremely large data
sets.
1.4 Random sampling and 3 case studies that illustrate
stats inference
Random sample: a sample selected in such a way that every set of n elements in the
population has the same chance of being selected.
In making a random selection, we can sample with or without replacement:
● Sample with replacement: place the element chosen back into the population.
Thus, this element has a chance to be chosen again.
● Sample without replacement: do not place the element chosen back into the
population.
Note: it is best to sample without replacement, as it guarantees that all of the elements in
the sample will be different, and thus we will have the fullest possible look at the population.
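The difference between the two schemes can be sketched in Python. This is only an illustration: the population of 20 ID numbers, the sample size of 5, and the fixed seed are all assumptions made for the example.

```python
import random

# Hypothetical population: ID numbers 1..20 standing in for population elements.
population = list(range(1, 21))
rng = random.Random(0)  # fixed seed only so the sketch is repeatable

# Without replacement: each element can appear at most once in the sample.
without_replacement = rng.sample(population, 5)

# With replacement: the same element may be drawn more than once.
with_replacement = rng.choices(population, k=5)

print("without replacement:", without_replacement)
print("with replacement:   ", with_replacement)
```

Only the sample drawn without replacement is guaranteed to contain 5 different elements.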
3 case studies with 3 goals:
1. The need for a random sample
2. How to select the needed sample
3. The use of the sample in making stats inferences
Processes
Sometimes we are interested in studying the population of all of the elements that will be or
could potentially be produced by a process.
Process: a sequence of operations that takes inputs (labor, materials, machines...) and
turns them into output (products, services, ...)
● Finite population: a population that contains a finite number of elements.
● Infinite population: a population that is defined so that there is no limit to the number
of elements that could potentially belong to the population.
Probability sampling
Probability sampling: sampling where we know the chance that each element in the
population will be included in the sample.
Note: if we employ probability sampling, the sample obtained can be used to make valid
stat inferences about the sampled population.
Non-probability sampling
● Convenience sampling: sampling where we select elements because they are easy
or convenient to sample.
● Voluntary response sample: sampling in which the sample participants self-select.
(eg. employed by television and radio). This kind of sample overrepresents people with
strong opinions.
● Judgement sampling: sampling where an expert selects population elements that
he/she feels are representative of the population.
Unethical stats practices:
● Improper sampling: cherry-picking
● Misleading charts, graphs and descriptive measures
● Inappropriate stats analysis or inappropriate interpretation of stats results
1.5 Business analytics and data mining
3 categories of business analytics:
1. Descriptive analytics: graphical and numerical methods used to find and visualize
patterns, associations, anomalies and other relationships in data sets, with the
purpose of business improvement.
a. Graphical descriptive analytics
It uses the traditional/newer graphics to present easy-to-understand visual
summaries of the operational status of a business.
Eg. gauges, bullet graphs, treemaps, sparklines, etc. These are combined to form
analytic dashboards, which are part of executive information systems.
b. Numerical descriptive analytics
i. Association learning
ii. Text mining
iii. Cluster analysis
iv. Factor analysis
2. Predictive analytics: methods used to find anomalies, patterns, and associations in
data sets, with the purpose of predicting future outcomes.
Response variable: variable of interest that we wish to study.
3. Prescriptive analytics: the use of internal and external variables, along with the
predictions obtained from predictive analytics, to recommend one or more courses of
action.
1.6 Ratio, interval, ordinal, and nominative scales of
measurement
Quantitative variables
1. Ratio: quantitative variable such that the ratios of its values are meaningful (eg.
$5k/month is twice as much as $2.5k/month) and for which there is an inherently
defined zero value (eg. 0 km is “no distance at all”).
Eg. salary, height, weight, time, distance
2. Interval: quantitative variable where ratios of its values are not meaningful and
there is not an inherently defined zero value.
Eg. temperature: 60 deg is not twice as hot as 30 deg, and 0 deg doesn’t mean “no heat
at all”.
Note: very few interval variables, almost all quantitative variables are ratio variables.
Qualitative variables
1. Ordinal: qualitative variable for which there is a meaningful ordering, or ranking of
the categories. Ordinal variables can be numerical or nonnumerical.
Eg. satisfaction ranking from 0 to 5, or from “not satisfied” to “very satisfied”
2. Nominative: qualitative variable for which there is no meaningful ordering, or
ranking, of the categories.
Eg. colors of car, gender
1.7 Stratified random, cluster, and systematic sampling
Sample designs: methods for obtaining a sample.
Frame: a list of all of the population elements.
3 sample designs that are alternatives to random sampling:
1. Stratified random sampling
A sampling design in which we divide a population into non overlapping subgroups
(strata) and then select a random sample from each subgroup (stratum)
Eg. city, suburban and rural population can be the 3 selected strata for a consumer
study.
2. Cluster sampling (multistage)
A sampling design in which we sequentially cluster population elements into
subpopulations.
Note: “cluster” because at each stage we cluster the voters into subgroups.
3. Systematic sampling
A sample taken by moving systematically through the population
Eg. pick a random starting point in the frame, then select every 200th person after it.
1.8 More about surveys and errors in survey sampling
Survey questions can be:
1. Dichotomous (yes or no)
2. Multiple-choice
3. Open-ended
Types of surveys
1. Phone survey (low response rate)
2. Mail survey (low response rate)
3. Web survey (low response rate)
4. Personal interview survey (high response rate)
Eg. Mall survey
Errors occurring in surveys
● The target population and sample frame are not well defined.
Sample frame: list of sampling elements from which the sample will be selected. It
should closely agree with the target population.
Eg. consider a study to estimate the avg starting salary of students who have
graduated from JMSB over the last 5 years.
Target population: the group of graduates from JMSB
Sample frame: JMSB’s IB program graduates for the past 5 years.
● 2 general classes of survey errors:
Sampling error: the difference between a numerical descriptor of the population and
the corresponding descriptor of the sample.
1. Errors of non observation: sampling error related to population elements
that are not observed.
a. Undercoverage: occurs when some population elements are
excluded from the process of selecting the sample.
b. Nonresponse
2. Errors of observation: survey error that occurs when the data collected in
a survey differ from the truth
a. Recording error: occurs when either the respondent or interviewer
incorrectly marks an answer.
b. Response bias: bias in the result obtained when carrying out a
statistical study that is related to how survey participants answer the
questions.
Assignment 1
1. Probability sampling is where we know the chance that each element will be
included in the sample, which allows us to make stats inferences about the sampled
population.
2. Data collected for a particular study are referred to as a data set.
3. Descriptive stats refers to describing the important aspects of a set of
measurements.
4. Sampling error is the difference between a numerical descriptor of the population
and the corresponding descriptor of the sample.
5. Traditional stats consists of a set of concepts and techniques that are used to
describe populations and samples and to make statistical inferences about
populations by using samples.
6. Methods for obtaining a sample are called sampling designs.
7. Which of the following is a type of question used in survey research?
Dichotomous, open-ended, and multiple-choice
8. A data set provides information about some group of individual elements.
9. When the data being studied are gathered from a published source, this is referred to
as an existing data source.
10. A ratio variable has the following characteristic: inherently defined zero value.
Chapter 2
2.1 Graphically summarizing qualitative data
Frequency distribution: a table that summarizes the number (or frequency) of items in
each of several non overlapping classes.
$\text{Relative frequency} = \dfrac{\text{frequency of item}}{n}$, where n is the total # of items.
Bar chart: a graphical display of data in categories made up of vertical or horizontal bars.
Each bar gives the frequency, relative frequency, or percentage frequency of items in its
corresponding category.
Pie chart: a graphical display of data in categories made up of pie slices representing the
frequency, relative frequency, or percentage frequency of items in its corresponding
category.
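As a small illustration, relative frequencies and the matching pie slice angles (relative frequency × 360 degrees) can be computed as follows. The counts are the exam responses used in Assignment 2; the variable names are made up for this sketch.

```python
# Counts from the exam example in Assignment 2 (390 responses in total).
counts = {"a": 117, "b": 39, "c": 78, "d": 156}
n = sum(counts.values())

# Relative frequency = frequency of item / n
relative_frequency = {item: freq / n for item, freq in counts.items()}

# A full pie is 360 degrees, so each slice gets its share of 360.
pie_degrees = {item: rf * 360 for item, rf in relative_frequency.items()}

for item in counts:
    print(item, round(relative_frequency[item], 2), round(pie_degrees[item], 1))
```

The relative frequencies always add up to 1, so the slice angles add up to 360 degrees.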
Pareto charts: a bar chart of the frequencies or percentages for various types of defects.
These are used to identify opportunities for improvement.
Note: Pareto charts are sometimes plotted with a cumulative percentage curve (rising to 100%).
2.2 Graphically summarizing quantitative data
Histogram: a graphical display of a frequency distribution. To construct one:
1. Find the number of classes
K is the smallest number of classes such that $2^K$ is greater than the total
number of items (n) in the data set.
2. Find the class length
(largest measurement - smallest measurement) / K
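The two steps can be sketched in Python. The function names are invented for this example; the numbers (50 measurements ranging from 5.2 to 13.5) are the ones used in Assignment 2.

```python
def number_of_classes(n: int) -> int:
    """Smallest K such that 2**K is greater than n."""
    k = 1
    while 2 ** k <= n:
        k += 1
    return k


def class_length(smallest: float, largest: float, k: int) -> float:
    """(largest measurement - smallest measurement) / K"""
    return (largest - smallest) / k


k = number_of_classes(50)            # 2**5 = 32 is too small, 2**6 = 64 works
length = class_length(5.2, 13.5, k)
print(k, round(length, 2))
```

In practice the class length is often rounded up to the precision of the data so that the classes cover every measurement.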
Frequency polygons: graphical display in which we plot points representing each class
frequency above their corresponding class midpoints and connect the points with lines.
Ogives (cumulative distribution)
Cumulative frequency distribution: a table that summarizes, for each class, the number of
measurements that are less than the upper boundary of that class (the running total of the
class frequencies).
2.3 Dot plots
Dot plot: graphical portrayal of a data set that shows the data set’s distribution by plotting
individual data points above a horizontal axis.
Note: dot plots are useful for detecting outliers (unusually large or small observations that
are well separated from the remaining observations)
2.4 Stem-and-leaf displays
Stem-and-leaf display: graphical portrayal of a data set that shows the data set’s
distribution by using stems consisting of leading digits and leaves consisting of trailing digits.
Leaf unit: the unit that the leaf is representing.
Eg. if leaf 5 represents 500, the leaf unit is 100.
2.5 Contingency tables
2.6 Scatter plots
Scatter plot: a graph that is used to study the possible relationship between 2 variables
x and y. The observed values of y are plotted on the vertical axis and x on the horizontal.
Eg. time series plot
2.8 Graphical descriptive analytics (recent)
● Analytic dashboard: a graphical representation of the current status and historical
trends of a business’ key performance indicators. (like a car dashboard)
● Gauge: graphic similar to the speedometer on a car
● Bullet graph: graphic that features a single measure and displays it as a
horizontal/vertical bar that extends into ranges representing qualitative measures
of performance, such as poor and good.
● Treemaps
● Sparklines
Summary
Qualitative data
● Pareto charts: specialized bar charts that order the bars from the highest
frequency to the lowest frequency
● Bar charts
● Pie charts
Quantitative data
● Histogram
● Frequency polygons
● Stem-and-leaf
● Dot plot
● Ogive plot (cumulative)
● Bullet graph
Dot plot displays individual data points
Ogive plot is a curved display of the cumulative distribution of the data
Box plot does not easily group measurements into classes.
Scatter plot is for looking at the relationship between 2 variables.
Assignment 2
1. Pareto charts are frequently used to identify the most common types of defects.
2. A stem-and-leaf is best used to display the shape of the distribution.
3. 30 items are rejected daily by a manufacturer because of defects for the last 30 days.
How many classes should be used in constructing a histogram?
5
4. What would be the first class interval for the frequency histogram?
5.2 < 6.6
With 50 data measurements, k is the smallest integer such that 2^k is larger than 50.
Since 2^5=32 and 2^6=64, use 6 classes. Class length = (13.5-5.2)/6=1.38. So the upper
boundary of the first interval is 5.2+1.38=6.58. The first interval will contain the values 5.2 < 6.6.
5. A graphical portrayal of a quantitative data set that divides the data into classes and
gives the frequency of each class is a histogram.
6. An example of manipulating a graphical display to distort reality is stretching the
axes.
7. A MCQ on an exam has 4 possible responses (a,b,c,d). When 390 students take the
exam, 117 give response a, 39 b, 78 c, and 156 d.
a. How many degrees would be assigned to the “pie slice” for a?
108 deg
b. How many degrees would be assigned to the “pie slice” for b?
36 deg
8. With the same 50 data in Q4, the shape of the distribution of the data is skewed to
the right.
With outliers at the stem of 13 and the majority of the data grouped around stems
6,7,8, the shape is skewed with the outliers to the right.
9. Bar chart displays the frequency of each class with qualitative data
Histogram displays the frequency of each class with quantitative data
10. As a general rule, when creating a stem-and-leaf display, there should be 5-20 stem
values.
By definition, there should be 5-20 stems to enable reasonable display of the
shape of the distribution.
11. A histogram that has a longer tail extending toward smaller values is skewed to the
left.
Chapter 3
3.1 Describing central tendency
Central tendency: refers to the middle of a population or sample.
Population parameter: a descriptive measure of a population. It is a number calculated
using the population measurements that describes some aspect of the population.
Eg. population mean -> parameter average
Point estimate: a one-number estimate for the value of a population parameter.
Sample statistic: a descriptive measure of a sample. It is one way to find a point estimate
of a population parameter.
Eg. sample mean -> statistic average
Exam: among mean, median and mode, which is the best to use?
Median
For a positively skewed distribution, the mean will always be the highest estimate of
central tendency and the mode will always be the lowest estimate of central
tendency (assuming that the distribution has only one mode).
3.2 Measures of variation
In addition to estimating a population’s central tendency, it is important to estimate
the variability of the population’s individual values.
Variability (aka. spread/dispersion): refers to how spread out a set of data is. It gives you
a way to describe how much data sets vary and allows you to use statistics to compare your
data to other sets of data. The 4 main ways to describe variability in a data set are:
1. Range
2. IQR
3. Variance: the average of the squared deviations of the individual population
measurements from the mean.
4. Standard deviation
Empirical Rule
Tolerance interval: an interval of numbers that contains a specified percentage of the
individual measurements in a population.
Under normal distribution:
● µ ± σ: 68.26%
● µ ± 2σ:95.44%
● µ ± 3σ: 99.73%
Chebyshev’s Theorem
It allows us to find an interval that contains a specified percentage of the individual
measurements in the population.
Chebyshev’s theorem:
Consider any population that has mean µ and standard deviation σ. Then for any value of k
greater than 1, at least $100\left(1 - \dfrac{1}{k^2}\right)\%$ of the population measurements lie in the
interval [µ ± kσ]
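A minimal sketch of the theorem in Python. The function names and the example values µ = 50, σ = 5 are assumptions made for illustration.

```python
def chebyshev_lower_bound(k: float) -> float:
    """Minimum percentage of measurements within k standard deviations of the mean."""
    if k <= 1:
        raise ValueError("Chebyshev's theorem requires k > 1")
    return 100 * (1 - 1 / k ** 2)


def chebyshev_interval(mu: float, sigma: float, k: float) -> tuple:
    """The interval [mu - k*sigma, mu + k*sigma]."""
    return (mu - k * sigma, mu + k * sigma)


# Hypothetical population with mean 50 and standard deviation 5.
print(chebyshev_interval(50, 5, 2), chebyshev_lower_bound(2))  # at least 75%
print(chebyshev_interval(50, 5, 3), chebyshev_lower_bound(3))  # at least about 88.9%
```

Unlike the Empirical Rule, these bounds hold for any population shape, which is why they are lower (75% instead of 95.44% for k = 2).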
Z-score
Z-score (aka. Standardized value): the number of standard deviations that a measurement
is from the mean. The quantity indicates the relative location of a measurement within its
distribution.
● Positive z-score: x is above the mean
● Negative z-score: x is below the mean
Note: z-scores standardize measurements from samples that have different means and
standard deviations, which makes the measurements comparable.
Eg. Class A has an average of 65 and standard deviation of 10; Class B has an average
of 80 and standard deviation of 5. A student in Class A who scores 85 has the same relative
standing as a student in Class B who scores 90, because their z-scores are equal:
(85-65)/10=2 and (90-80)/5=2.
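The class comparison can be checked with a short Python sketch (the function name is made up for this example):

```python
def z_score(x: float, mean: float, std_dev: float) -> float:
    """Number of standard deviations that x lies from the mean."""
    return (x - mean) / std_dev


z_a = z_score(85, 65, 10)  # student in Class A
z_b = z_score(90, 80, 5)   # student in Class B
print(z_a, z_b)            # both are 2 standard deviations above their class mean
```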
The coefficient of variation
Coefficient of variation: measures the variation of a population or sample relative to its
mean.
$\text{COV} = \dfrac{\text{standard deviation}}{\text{mean}} \times 100$
3.3 Percentiles and box-and-whisker display
Percentiles
● First quartile (Q1): 25th percentile
● Second quartile (median): 50th percentile
● Third quartile (Q3): 75th percentile
● Interquartile range (IQR) = Q3-Q1
To find the index i of the pth percentile for a set of n measurements:
$i = \left(\dfrac{p}{100}\right) n$
Note:
● If i is not an integer: round up to the next integer greater than i.
● If i is an integer: Take the number at this location index and the number at the
next index, and average them.
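The index rule can be sketched as follows, assuming the measurements are first arranged in increasing order and positions are counted from 1. The data set of 10 values is hypothetical.

```python
import math


def percentile(data, p):
    """pth percentile (0 < p < 100) using the index rule i = (p/100) * n."""
    if not 0 < p < 100:
        raise ValueError("p must be strictly between 0 and 100")
    values = sorted(data)          # positions below are 1-based in this ordered list
    n = len(values)
    i = p * n / 100
    if not i.is_integer():
        pos = math.ceil(i)         # round up to the next integer
        return values[pos - 1]
    pos = int(i)                   # average positions i and i + 1
    return (values[pos - 1] + values[pos]) / 2


data = [3, 7, 8, 5, 12, 14, 21, 13, 18, 10]   # hypothetical measurements
q1 = percentile(data, 25)      # i = 2.5, so use the 3rd value
median = percentile(data, 50)  # i = 5, so average the 5th and 6th values
q3 = percentile(data, 75)      # i = 7.5, so use the 8th value
print(q1, median, q3, q3 - q1)
```

Statistical software often uses slightly different interpolation rules, so library results may not match this rule exactly.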
Box-and-whiskers displays (box plots)
Box plots: a graphical portrayal of a data set that depicts both the central tendency and
variability of the data. It is constructed using Q1, Median, and Q3
1. Draw a box that extends from Q1 to Q3 and draw a vertical line at the median
2. Determine the values of the lower and upper limits.
a. Lower limit: Q1 - 1.5 IQR
b. Upper limit: Q3 + 1.5 IQR
3. Draw whiskers as dashed lines that extend below Q1 and above Q3.
a. Draw one whisker from Q1 to the smallest number that is between the lower
and upper limits.
b. Draw one whisker from Q3 to the largest number that is between the lower
and upper limits.
4. Any measurement that is less than the lower limit or greater than the upper limit is an
outlier. Plot each outlier with “*”
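Steps 2 to 4 can be sketched in Python. The data set and the quartiles passed in (Q1 = 7, Q3 = 14) are hypothetical, and the function name is invented for this example.

```python
def box_plot_summary(data, q1, q3):
    """Limits, whisker ends, and outliers for a box plot, given Q1 and Q3."""
    iqr = q3 - q1
    lower_limit = q1 - 1.5 * iqr
    upper_limit = q3 + 1.5 * iqr
    inside = [x for x in data if lower_limit <= x <= upper_limit]
    outliers = sorted(x for x in data if x < lower_limit or x > upper_limit)
    return {
        "lower_limit": lower_limit,
        "upper_limit": upper_limit,
        "whisker_low": min(inside),    # smallest value between the limits
        "whisker_high": max(inside),   # largest value between the limits
        "outliers": outliers,          # each would be plotted with "*"
    }


data = [3, 5, 7, 8, 10, 12, 13, 14, 18, 40]   # hypothetical measurements
summary = box_plot_summary(data, q1=7, q3=14)
print(summary)
```

Here the value 40 lies above the upper limit of 24.5, so it is flagged as an outlier and the upper whisker stops at 18.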
3.4 Covariance, correlation, and the least squares line
Covariance: a measure of the strength of the linear relationship between x and y.
Correlation coefficient (r): a measure of the strength of the linear relationship between x
and y that lies between -1 and 1 and is independent of the units of x and y.
● r close to 1: x and y have a strong tendency to move together in a straight-line
fashion with a positive slope. So x and y are highly related and positively correlated.
● r close to -1: x and y have a strong tendency to move together in a straight-line
fashion with a negative slope. So x and y are highly related and negatively
correlated.
Least squares line: the line that minimizes the sum of the squared vertical differences
between points on a scatter plot and the line.
● Slope: $b_1 = \dfrac{s_{xy}}{s_x^2}$
● Y-intercept: $b_0 = \bar{y} - b_1 \bar{x}$
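A minimal sketch of the slope and intercept calculation. It assumes the usual sample definitions with an n − 1 divisor for the covariance and variance (the divisor cancels in the slope), and the five data points are hypothetical.

```python
def least_squares_line(xs, ys):
    """Return (b0, b1) for the least squares line y = b0 + b1 * x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Sample covariance of x and y, and sample variance of x.
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (n - 1)
    s_x2 = sum((x - x_bar) ** 2 for x in xs) / (n - 1)
    b1 = s_xy / s_x2
    b0 = y_bar - b1 * x_bar
    return b0, b1


xs = [1, 2, 3, 4, 5]   # hypothetical x values
ys = [2, 4, 5, 4, 5]   # hypothetical y values
b0, b1 = least_squares_line(xs, ys)
print(round(b0, 2), round(b1, 2))
```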
3.5 Weighted means and grouped data
Weighted mean: a mean where different measurements are given different weights based
on their importance.
Weighted mean = $\dfrac{\sum w_i x_i}{\sum w_i}$, where $x_i$ = the value of the ith measurement
and $w_i$ = the weight applied to the ith measurement
Eg. the percentage returns are the measurements and the weights applied are the amounts
invested. We are weighting the percentage returns by the amounts invested.
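The investment example can be sketched in Python. The returns and amounts invested below are hypothetical.

```python
def weighted_mean(values, weights):
    """Sum of w_i * x_i divided by the sum of w_i."""
    return sum(w * x for x, w in zip(values, weights)) / sum(weights)


returns_pct = [10, 5, -2]       # hypothetical percentage returns (the measurements)
amounts = [1000, 3000, 1000]    # hypothetical amounts invested (the weights)
print(weighted_mean(returns_pct, amounts))
```

The unweighted mean of the three returns would be about 4.33%, while the weighted mean is 4.6% because most of the money sits in the investment that returned 5%.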
Grouped data: data presented in the form of a frequency distribution or a histogram.
3.6 Geometric mean
Geometric mean: the constant return Rg that yields the same wealth at the end of the
investment period as do the actual returns.
Note: unlike arithmetic mean, geometric mean takes time into consideration.
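The notes do not give the formula, so the sketch below assumes the standard definition $R_g = \left[(1+R_1)(1+R_2)\cdots(1+R_n)\right]^{1/n} - 1$, with returns written as decimals. The two yearly returns are hypothetical.

```python
def geometric_mean_return(returns):
    """Constant per-period return giving the same ending wealth as the actual returns."""
    growth = 1.0
    for r in returns:
        growth *= 1 + r
    return growth ** (1 / len(returns)) - 1


returns = [1.00, -0.50]   # hypothetical: +100% in year 1, -50% in year 2
arithmetic = sum(returns) / len(returns)
geometric = geometric_mean_return(returns)
print(arithmetic, geometric)
```

An investment that doubles and then halves ends where it started, which the geometric mean (0%) reflects and the arithmetic mean (25%) does not.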
Assignment 3
Population or sample:
The question will specify it. If it says “the numbers are collected from a larger group”, then it
is a sample. If not specified, it is a population.
Histogram, standard deviation, box plot are a must for the exam.
Chapter 4 Probability and probability models
4.1 Probability, sample spaces, and probability models
Probability: number that measures the chance, or likelihood, that an event will occur when
an experiment is carried out.
Experiment: a process of observation that has an uncertain outcome.
2 ways of collecting data:
1. Performing a controlled experiment
2. Observing uncontrolled events (eg. watch stock market)
Sample space: the set of all possible experimental outcomes (sample space outcomes)
Note: the possible outcomes aka. sample space outcomes or experimental outcomes.
Methods of assigning probabilities
Classical method: method of assigning probabilities that can be used when all of the
sample space outcomes are equally likely. Eg. dice, coin
Relative frequency method (long-run): method of estimating a probability by performing
an experiment (in which an outcome of interest might occur) many times. Eg. Sample testing
Subjective probability method: using experience, intuition or expertise to assess the
probability of an event. Eg. Horse bet
Probability models
Definition: a mathematical representation of a random phenomenon.
Types of random phenomenon:
● Experiment (Chap 4)
The probability model describing an experiment consists of
○ The sample space of the experiment
○ Procedure for calculating probabilities concerning the sample space
outcomes
● Random variable (Chap 6,7): a variable whose value is numeric and is determined
by the outcome of an experiment
The probability model describing a random variable is called probability
distribution, and consists of
○ Specification of the possible values of the random variable
○ Table, graph, or formula that can be used to calculate probabilities concerning
the values that the random variable might equal
2 types of probability distribution:
1. Discrete probability distribution (chap 6)
2. Continuous probability distribution (chap 7)
4.2 Probability and events
Event: a set of one or more sample space outcomes.
P(event): the sum of the probabilities of the sample space outcomes that correspond to the
event.
4.3 Some elementary probability rules
1. Rule of complements: 𝑃(not 𝐴) = 1 − 𝑃(𝐴)
2. Addition rule
𝑃(𝐴 ∪ 𝐵) = 𝑃(𝐴) + 𝑃(𝐵) − 𝑃(𝐴 ∩ 𝐵)
𝑃(𝐴 ∪ 𝐵 ∪ 𝐶) = 𝑃(𝐴) + 𝑃(𝐵) + 𝑃(𝐶) − 𝑃(𝐴 ∩ 𝐵) − 𝑃(𝐴 ∩ 𝐶) − 𝑃(𝐵 ∩ 𝐶) + 𝑃(𝐴 ∩ 𝐵 ∩ 𝐶)
3. Mutually exclusive event
Event A and B are mutually exclusive if they have no sample space outcomes in
common, thus events A and B cannot occur simultaneously.
𝑃(𝐴 ∩ 𝐵) = 0
4.4 Conditional probability and independence
Conditional probability: the probability that one event will occur given that we know that
another event has occurred.
$P(A \mid B) = \dfrac{P(A \cap B)}{P(B)}$
𝑃(𝐴 ∩ 𝐵) = 𝑃(𝐴 | 𝐵 ) 𝑃(𝐵) = 𝑃(𝐵 | 𝐴) 𝑃(𝐴)
Independent events
2 events A and B are independent iff:
1. 𝑃(𝐴 | 𝐵) = 𝑃(𝐴) or, equivalently,
2. P(B | A) = P(B)
Assume that P(A) and P(B) are greater than 0.
If A and B are independent events, then
𝑃(𝐴 ∩ 𝐵) = 𝑃(𝐴) 𝑃(𝐵)
4.5 Bayes’ theorem
Sometimes we have:
1. A prior probability (initial) that an event will occur.
2. When new information appears, we use Bayes’ Theorem to revise the prior
probability
3. The revised probability is called posterior probability.
4.6 Counting rules
Counting rule for combinations
The number of combinations of n items selected from N items is:
$\binom{N}{n} = \dfrac{N!}{n!(N-n)!}$
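As a quick check of the counting rule, Python's standard library function `math.comb` gives the same result as the factorial formula. The values N = 16 and n = 3 are just an example.

```python
from math import comb, factorial


def combinations(N: int, n: int) -> int:
    """N! / (n! (N - n)!), the number of ways to select n items from N items."""
    return factorial(N) // (factorial(n) * factorial(N - n))


# Example: number of ways to select 3 items from 16 items.
print(combinations(16, 3), comb(16, 3))
```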
Contingency table:
● Marginal probability
Probability of the occurrence of 1 event.
● Joint probability
Probability of the occurrence of 2 or more events together.
Practice 4
1. A manager has just received the expense checks for 6 of her employees. She
randomly distributes the checks to the 6 employees. What is the probability
that exactly 5 of them will receive the correct checks?
0. If 5 employees receive their correct checks, the 6th must receive the correct check
as well. So the probability that exactly 5 receive the correct checks while the 6th
receives the wrong check is 0.
2. A group has 12 men and 4 women. If 3 people are selected at random from the
group, what is the probability that they are all men?
Probability = $\dfrac{\binom{12}{3}}{\binom{16}{3}} = \dfrac{220}{560} = 0.3929$
Quiz 4
1. Container 1 has 8 items, 3 of which are defective. Container 2 has 5
items, 2 of which are defective. If one item is drawn from each container,
what is the probability that only one of the items is defective?
$\dfrac{3}{8} \times \dfrac{3}{5} + \dfrac{5}{8} \times \dfrac{2}{5} = 0.475$
2. A family has two children. What is the probability that both are girls,
given that at least one is a girl?
Sample space = {BB,BG,GB,GG}
At least 1 girl: P(G1) = ¾
Both girls: P(GG)=¼
$P(GG \mid G1) = \dfrac{P(GG \cap G1)}{P(G1)} = \dfrac{1/4}{3/4} = \dfrac{1}{3}$
3. A lot contains 12 items, and 4 are defective. If three items are drawn at
random from the lot, what is the probability they are not defective?
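$\dfrac{\binom{8}{3}}{\binom{12}{3}} = \dfrac{8 \times 7 \times 6}{12 \times 11 \times 10} = 0.2545$
Note: the number drawn (3) appears in both binomial coefficients. The number of
non-defective items (8) goes on top and the lot size (12) on the bottom.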
4. Three data entry specialists enter requisitions into a computer.
Specialist 1 processes 33 percent of the requisitions, specialist 2
processes 38 percent, and specialist 3 processes 29 percent. The
proportions of incorrectly entered requisitions by data entry specialists
1, 2, and 3 are .04, .02, and .04, respectively. Suppose that a random
requisition is found to have been incorrectly entered. What is the
probability that it was processed by data entry specialist 1? By data
entry specialist 2? By data entry specialist 3?
P(S1)=0.33, P(S2)=0.38, P(S3)=0.29
P(I | S1)=0.04, P(I | S2)=0.02, P(I | S3)=0.04
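By Bayes’ theorem
$P(S1 \mid I) = \dfrac{P(S1)P(I \mid S1)}{P(S1)P(I \mid S1)+P(S2)P(I \mid S2)+P(S3)P(I \mid S3)} = \dfrac{0.0132}{0.0324} = 0.4074$
Replacing the numerator with P(S2)P(I | S2) or P(S3)P(I | S3) gives
P(S2 | I) = 0.0076/0.0324 = 0.2346 and P(S3 | I) = 0.0116/0.0324 = 0.3580.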
5. If events A and B are independent, then the probability of simultaneous
occurrence of event A and event B can be found with ________.
All of these choices are correct:
● P(A)P(B|A)
● P(B)P(A|B)
● P(A)P(B)
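The Bayes' theorem calculation in question 4 can be reproduced with a short Python sketch. The dictionary names are invented for this example; the probabilities are the ones given in the question.

```python
# Prior probabilities that a requisition was processed by each specialist.
priors = {"S1": 0.33, "S2": 0.38, "S3": 0.29}
# P(I | S): proportion of incorrectly entered requisitions for each specialist.
error_rates = {"S1": 0.04, "S2": 0.02, "S3": 0.04}

# Joint probabilities P(S)P(I | S), and their sum P(I).
joint = {s: priors[s] * error_rates[s] for s in priors}
p_incorrect = sum(joint.values())

# Posterior probabilities P(S | I).
posteriors = {s: joint[s] / p_incorrect for s in priors}

for s, prob in posteriors.items():
    print(s, round(prob, 4))
```

The three posterior probabilities add up to 1, which is a useful check on the arithmetic.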
Chapter 6 - Discrete Random Variables
6.1 Two types of random variable
Random variable: a variable whose value is uncertain and numerical and is determined
by the outcome of an experiment. A random variable assigns one and only one numerical
value to each experimental outcome.
Discrete random variable: when the possible values of a random variable can be counted
or listed by a finite number of possible values or by a countably infinite list.
Eg. The number of cars sold next month, x = 0,1,2,3…
Continuous random variable: when a random variable may assume any numerical value
in one or more intervals on the real number line. Not countable.
Eg. interest rate (%), time (s), temperature (F), weight (kg), car mileage (km/l)
6.2 Discrete probability distributions p(x)
Discrete probability distribution: table, graph or formula that gives the probability
associated with each of the discrete random variable’s values.
Properties of DPD p(x):
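1. $p(x) \ge 0$ for each value of x
2. $\sum_{\text{all } x} p(x) = 1$
Expected value (or mean) of DRV
$\mu_x = \sum_{\text{all } x} x\,p(x)$
Variance of DRV
$\sigma_x^2 = \sum_{\text{all } x} (x - \mu_x)^2\,p(x)$
Equivalent computational form: $\sigma_x^2 = \sum_{\text{all } x} x^2\,p(x) - \left(\sum_{\text{all } x} x\,p(x)\right)^2$
Standard deviation of DRV
$\sigma_x = \sqrt{\sigma_x^2}$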
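These formulas can be checked with a small Python sketch. The probability distribution below is hypothetical.

```python
from math import sqrt

# Hypothetical discrete probability distribution: value x -> p(x).
distribution = {0: 0.1, 1: 0.3, 2: 0.4, 3: 0.2}

# Property 2: the probabilities must sum to 1.
total_probability = sum(distribution.values())

mean = sum(x * p for x, p in distribution.items())
variance = sum((x - mean) ** 2 * p for x, p in distribution.items())
# Computational form: sum of x^2 p(x) minus the squared mean.
variance_shortcut = sum(x ** 2 * p for x, p in distribution.items()) - mean ** 2
std_dev = sqrt(variance)

print(round(mean, 2), round(variance, 2), round(std_dev, 2))
```

Both variance formulas give the same value, up to floating-point rounding.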
6.3 Binomial distribution
Binomial distribution (or binomial model): the probability distribution that describes a
binomial random variable, which is defined to be the total number of successes in n trials
of a binomial experiment.
The number of ways to arrange x successes among n trials:
$\dfrac{n!}{x!(n-x)!}$
The binomial distribution (binomial model)
Binomial experiment’s characteristics:
1. It has n identical trials
2. Each trial results in a success or a failure (2 results thus Bi)
3. The probability of a success on any trial is p and remains constant from trial to trial.
Thus, the probability of failure, q, on any trial is (1-p) and remains constant too.
4. Trials are independent
If a binomial random variable x = the total number of successes in n trials of a binomial
experiment, then the probability of obtaining x successes in n trials is:
$p(x) = \dfrac{n!}{x!(n-x)!}\,p^x q^{n-x}$
Binomial tables: show the probability of x successes in n trials, with success rate p.
Mean: $\mu_x = np$
Variance: $\sigma_x^2 = npq$
Standard deviation: $\sigma_x = \sqrt{npq}$
Where n is the number of trials, p is the probability of success on each trial and q=1-p
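A minimal Python sketch of the binomial formula, using `math.comb` for the number of arrangements. The values n = 5 and p = 0.5 are hypothetical.

```python
from math import comb, sqrt


def binomial_pmf(x: int, n: int, p: float) -> float:
    """Probability of exactly x successes in n independent trials."""
    q = 1 - p
    return comb(n, x) * p ** x * q ** (n - x)


n, p = 5, 0.5                       # hypothetical experiment: 5 fair coin tosses
mean = n * p
variance = n * p * (1 - p)
std_dev = sqrt(variance)

print(binomial_pmf(2, n, p))        # probability of exactly 2 successes
print(mean, variance, round(std_dev, 4))
```

Summing `binomial_pmf(x, n, p)` over x = 0, ..., n gives 1, as required of any discrete probability distribution.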
6.4 Poisson distribution (Poisson model)
Poisson distribution: describes a Poisson random variable, which describes the number
of occurrences of an event over a specified interval of time or space.
Assume:
1. The probability of the event’s occurrence is the same for any 2 intervals of equal
length
2. Whether the event occurs in any interval is independent of whether the event occurs
in any other non overlapping interval
The probability that the event will occur x times in a specified interval is:
$p(x) = \dfrac{e^{-\mu}\mu^x}{x!}$, where µ is the mean (or expected) number of occurrences of the
event in the specified interval, and e=2.71828 is the base of Napierian (natural) logarithms.
Mean: $\mu_x = \mu$
Variance: $\sigma_x^2 = \mu$
Standard deviation: $\sigma_x = \sqrt{\mu}$
Where µ is the mean number of occurrences of an event over the specified interval of
time or space of interest.
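The Poisson formula can be sketched in Python as follows. The mean of 2 occurrences per interval is a hypothetical value.

```python
from math import exp, factorial, sqrt


def poisson_pmf(x: int, mu: float) -> float:
    """Probability that the event occurs exactly x times in the interval."""
    return exp(-mu) * mu ** x / factorial(x)


mu = 2.0                              # hypothetical mean number of occurrences
print(round(poisson_pmf(0, mu), 4))   # probability of no occurrences
print(round(poisson_pmf(3, mu), 4))   # probability of exactly 3 occurrences
print(mu, mu, round(sqrt(mu), 4))     # mean, variance, standard deviation
```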
6.5 Hypergeometric distribution
Suppose a population consists of N items and that r of these items are successes and N-r
items are failures. If we randomly select n items without replacement from the population,
the probability that x items of the n randomly selected items will be successes is given by the
hypergeometric probability formula:
$p(x) = \frac{\binom{r}{x}\binom{N-r}{n-x}}{\binom{N}{n}}$
Where:
$\binom{r}{x}$ is the number of ways x successes can be selected from the total of r successes
in the population.
$\binom{N-r}{n-x}$ is the number of ways n-x failures can be selected from the total of N-r failures
in the population.
$\binom{N}{n}$ is the number of ways a sample of size n can be selected from a population of size N.
Mean: $\mu_x = n\left(\frac{r}{N}\right)$
Variance: $\sigma_x^2 = n\left(\frac{r}{N}\right)\left(1-\frac{r}{N}\right)\left(\frac{N-n}{N-1}\right)$
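The hypergeometric formula can be sketched in Python as below. The population values N = 20, r = 5 and the sample size n = 4 are made up for illustration. The sketch also recomputes the mean and variance directly from the probabilities to compare them with the closed-form expressions.

```python
from math import comb


def hypergeom_pmf(x: int, N: int, r: int, n: int) -> float:
    """P(x successes) when drawing n items without replacement from N items,
    r of which are successes."""
    return comb(r, x) * comb(N - r, n - x) / comb(N, n)


N, r, n = 20, 5, 4  # hypothetical population and sample sizes
probs = {x: hypergeom_pmf(x, N, r, n) for x in range(min(n, r) + 1)}

# Mean and variance computed directly from the distribution
mean_from_pmf = sum(x * p for x, p in probs.items())
var_from_pmf = sum((x - mean_from_pmf) ** 2 * p for x, p in probs.items())

# Mean and variance from the closed-form expressions
mean_formula = n * (r / N)
var_formula = n * (r / N) * (1 - r / N) * ((N - n) / (N - 1))

print(probs)
print(mean_from_pmf, mean_formula)
print(var_from_pmf, var_formula)
```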
Note: if the population size N is “much larger” than the sample size n (at least 20
times larger), then making selections will not substantially change the probability of a
success. We can assume that the probability of a success stays essentially constant from
selection to selection, and the different selections are essentially independent of each other.
In this case, we can approximate the hypergeometric distribution by using the
binomial distribution:
$p(x) = \frac{n!}{x!\,(n-x)!}\, p^x (1-p)^{n-x} = \frac{n!}{x!\,(n-x)!} \left(\frac{r}{N}\right)^x \left(1-\frac{r}{N}\right)^{n-x}$
6.6 Joint distributions and the covariance
Joint probability distribution of (x,y): a probability distribution that assigns probabilities to
all combinations of values of x and y.
To further measure the association between x and y, we calculate the covariance
between x and y.
Covariance: measures linearly the total variation of 2 random variables from their
expected values. Using covariance, we can only gauge the direction of the
relationship.
● A positive covariance says that as x increases, y tends to increase in a linear
fashion.
● A negative covariance says that as x increases, y tends to decrease in a linear
fashion.
$\sigma_{xy}^2 = \sum (x-\mu_x)(y-\mu_y)\,P(x,y)$, summed over all combinations of x and y
Note: covariance helps us understand the importance of investment diversification.
Property of expected value of mixed investments (say P=0.5x+0.5y)
$\mu_{(ax+by)} = a\mu_x + b\mu_y$
Property of variances of mixed investments
$\sigma_{(ax+by)}^2 = a^2\sigma_x^2 + b^2\sigma_y^2 + 2ab\,\sigma_{xy}^2$
Correlation: measures linearly the strength of the relationship between variables.
Correlation is the scaled measure of covariance.
Correlation coefficient between x and y:
$\rho = \frac{\sigma_{xy}^2}{\sigma_x \sigma_y}$
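To make the covariance, correlation and mixed-investment formulas concrete, here is a small Python sketch. The joint distribution is invented for illustration, and `cov_xy` stands for the covariance that these notes write as $\sigma_{xy}^2$. The variance of the mix P = 0.5x + 0.5y is computed twice, once with the property above and once directly from the joint distribution.

```python
from math import sqrt

# Hypothetical joint distribution: (x, y) -> P(x, y)
joint = {(1, 1): 0.3, (1, 2): 0.2, (2, 1): 0.1, (2, 2): 0.4}

mu_x = sum(x * p for (x, y), p in joint.items())
mu_y = sum(y * p for (x, y), p in joint.items())
var_x = sum((x - mu_x) ** 2 * p for (x, y), p in joint.items())
var_y = sum((y - mu_y) ** 2 * p for (x, y), p in joint.items())

# Covariance and correlation coefficient
cov_xy = sum((x - mu_x) * (y - mu_y) * p for (x, y), p in joint.items())
rho = cov_xy / (sqrt(var_x) * sqrt(var_y))

# Mixed investment P = a*x + b*y
a, b = 0.5, 0.5
mean_mix = a * mu_x + b * mu_y
var_mix = a**2 * var_x + b**2 * var_y + 2 * a * b * cov_xy

# Same variance computed directly from the joint distribution
var_mix_direct = sum(
    (a * x + b * y - mean_mix) ** 2 * p for (x, y), p in joint.items()
)

print(f"covariance = {cov_xy:.4f}, correlation = {rho:.4f}")
print(f"variance of the mix: {var_mix:.4f} (direct: {var_mix_direct:.4f})")
```

A negative covariance would make the last term negative and lower the variance of the mix, which is the diversification argument.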
4 properties of expected values and variances:
1. If a is a constant and x is a random variable, $\mu_{ax} = a\mu_x$
2. If x1, x2… are random variables, $\mu_{(x_1+x_2+\dots)} = \mu_{x_1} + \mu_{x_2} + \dots$
3. If a is a constant and x is a random variable, $\sigma_{ax}^2 = a^2\sigma_x^2$
4. If x1, x2… are independent random variables, then the covariance between any
2 of these independent random variables is 0 and $\sigma_{(x_1+x_2+\dots)}^2 = \sigma_{x_1}^2 + \sigma_{x_2}^2 + \dots$
Assignment 6
1. A total of 50 raffle tickets are sold for a contest to win a car. If you
purchase one ticket, what are your odds against winning?
49 to 1
2. If p = .1 and n = 5, then the corresponding binomial distribution is:
Right skewed
3. If you were asked to play a game in which you tossed a fair coin three
times and were given $2 for every head you threw, how much would you
expect to win on average?
$3. The expected number of heads is E(x) = np = 3 × 0.5 = 1.5.
Money earned on average = 1.5 × $2 = $3
4. For a random variable X, the mean value of the squared deviations of its
values from their expected value is called its ________.
Variance
5. Which one of the following statements is not an assumption of the
binomial distribution?
Sampling with replacement
6. Which of the following is a valid probability value for a discrete random
variable?
0.2 (a probability must be between 0 and 1)
7. An insurance company will insure a particular $75,000 automobile make
and model for its full value against theft at a premium of $1500 per year.
Suppose that the probability that this particular make and model will be
stolen is .0075. Find the premium that the insurance company should
charge if it wants its expected net profit to be $2000.
-$75,000 × 0.0075 + Premium = $2000, so Premium = $2000 + $562.50
$2562.50
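The two expected-value answers above (the coin game and the insurance premium) can be verified with a few lines of Python. This is only a sketch; `expected_value` is a hypothetical helper.

```python
def expected_value(outcomes):
    """Expected value from a list of (value, probability) pairs."""
    return sum(value * prob for value, prob in outcomes)


# Coin game: $2 per head, 3 tosses of a fair coin
expected_heads = 3 * 0.5
expected_winnings = 2 * expected_heads

# Insurance: the premium must cover the expected payout plus the target profit
car_value, p_theft, target_profit = 75_000, 0.0075, 2_000
expected_payout = expected_value([(car_value, p_theft), (0, 1 - p_theft)])
premium = target_profit + expected_payout

print(f"expected winnings: ${expected_winnings:.2f}")
print(f"required premium: ${premium:.2f}")
```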
Chapter 7 Continuous random variables
7.1 Continuous probability distributions
Continuous random variable: when a random variable assumes any numerical
value in one or more intervals on the real number line.
Continuous probability distributions (aka. Probability curve or probability
density function)
The curve f(x) is the continuous probability distribution of the random variable x if the
probability that x will be in a specified interval of numbers is the area under the
curve f(x) corresponding to the interval.
Properties of a continuous probability distribution (CPD):
1. f(x) ≥ 0 for any value of x
2. The total area under the curve f(x) = 1
7.2 Uniform distribution
Uniform distribution: a continuous probability distribution having a rectangular shape that
says the probability is distributed evenly over an interval of numbers.
If c and d are numbers on the real line (with c < d), the equation describing the uniform distribution is
$f(x) = \frac{1}{d-c}$ for c ≤ x ≤ d
$f(x) = 0$ otherwise
Mean: $\mu_x = \frac{c+d}{2}$
Standard deviation: $\sigma_x = \frac{d-c}{\sqrt{12}}$
Eg. imagine the waiting time for an elevator is uniformly distributed between 0 and 4
minutes. The uniform distribution is f(x)=¼ for 0 ≤ 𝑥 ≤ 4, having the shape of a rectangle
with base 4-0 and height ¼ .
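For the elevator example, probabilities are areas of rectangles, which a short Python sketch makes explicit. The interval from 1 to 3 minutes is an assumed query, and `uniform_prob` is a hypothetical helper name.

```python
from math import sqrt


def uniform_prob(a: float, b: float, c: float, d: float) -> float:
    """P(a <= x <= b) for x uniformly distributed on [c, d]."""
    low, high = max(a, c), min(b, d)  # clip the query interval to [c, d]
    return max(high - low, 0) / (d - c)  # base of the rectangle times height 1/(d-c)


c, d = 0, 4  # waiting time between 0 and 4 minutes
mean = (c + d) / 2
sd = (d - c) / sqrt(12)
p_1_to_3 = uniform_prob(1, 3, c, d)  # wait between 1 and 3 minutes

print(f"mean = {mean}, sd = {sd:.4f}, P(1 <= x <= 3) = {p_1_to_3}")
```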
7.3 Normal probability distribution
Normal distribution: the most important continuous probability distribution. Its probability
curve is the bell-shaped normal curve.
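$f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2}$, where µ and σ are the mean and standard deviation of the population and e ≈ 2.71828.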
Note: We use a normal curve table to find areas (thus probabilities) under the normal
curve.
Normal curve’s properties:
1. The shape of each normal distribution is determined by its mean and its standard
deviation.
2. The highest point on the normal curve is located at the mean µ , which is also
the median and the mode of the distribution.
● The higher the mean µ, the further the curve is shifted to the right
● The higher the standard deviation σ, the flatter the curve becomes
3. The normal distribution is symmetrical:
a. Meaning the area under the normal curve to the right of the mean equals the
area under the curve to the left of the mean, and each area = 0.5
4. The tails of the normal curve extend to infinity but never touch the horizontal axis.
The tails get close enough to the horizontal axis to ensure that the total area under
the normal curve = 1.
The Empirical Rule comes in handy here with 3 important percentages: about 68.26% of the values lie within µ ± σ, 95.44% within µ ± 2σ, and 99.73% within µ ± 3σ.
The Standard Normal Distribution
If a random variable x is normally distributed with mean µ and standard deviation σ, then the
random variable $z = \frac{x-\mu}{\sigma}$ is normally distributed with mean 0 and standard deviation 1. A
normal distribution with mean 0 and standard deviation 1 is called a standard normal
distribution.
Note: $z = \frac{x-\mu}{\sigma}$ expresses the number of standard deviations that x is from the
mean.
Cumulative normal table: gives the area under the standard normal curve to the left of z,
for many different values of z.
Positive and negative Z-value
● Positive Z-value, $z_a$, is the point on the horizontal axis under the standard normal
curve that gives a right-hand tail area equal to a.
Eg. number of cases ordered so that there is only a 5% chance the store will run short, $z_{0.05}$.
#cases ordered as the x-axis.
● Negative Z-value, $-z_a$, is the point on the horizontal axis under the standard
normal curve that gives a left-hand tail area equal to a.
Eg. number of months to guarantee so that only 1% of the batteries will need to be
replaced free of charge, $-z_{0.01}$. Battery life as the x-axis.
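Instead of reading the table backwards, the z points can be computed with Python's standard library, as in this sketch. The mean and standard deviation used to convert z back to x are hypothetical numbers, not values from the course examples.

```python
from statistics import NormalDist

std_normal = NormalDist()  # standard normal: mean 0, standard deviation 1


def z_right_tail(a: float) -> float:
    """z_a: the point with right-hand tail area a under the standard normal curve."""
    return std_normal.inv_cdf(1 - a)


z_05 = z_right_tail(0.05)       # right-hand tail area of 5%
neg_z_01 = -z_right_tail(0.01)  # left-hand tail area of 1%

# Converting a z point back to the original scale: x = mu + z * sigma
mu, sigma = 100, 10  # hypothetical weekly demand (cases)
cases_to_order = mu + z_05 * sigma

print(f"z_0.05 = {z_05:.3f}, -z_0.01 = {neg_z_01:.3f}")
print(f"order about {cases_to_order:.1f} cases for a 5% chance of running short")
```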
7.4 Approximating the binomial distribution by using the
normal distribution
Consider a binomial random variable x, where n is the number of trials and p is the
probability of success on each trial.
If np ≥ 5 and n(1 − p) ≥ 5, then x is approximately normally distributed with mean
$\mu = np$ and standard deviation $\sigma = \sqrt{npq}$, where q = 1 − p.
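The quality of the approximation can be checked numerically. The sketch below uses assumed values n = 50 and p = 0.3 and compares the exact binomial probability P(x ≤ 12) with the normal approximation. The continuity correction (using 12.5 instead of 12) is a standard refinement that these notes do not mention, so treat it as an added assumption.

```python
from math import comb, sqrt
from statistics import NormalDist

n, p = 50, 0.3  # assumed values for illustration
q = 1 - p

# Rule of thumb from the notes: both np and n(1 - p) must be at least 5
assert n * p >= 5 and n * q >= 5

approx = NormalDist(mu=n * p, sigma=sqrt(n * p * q))

# Exact binomial probability P(x <= 12)
exact = sum(comb(n, k) * p**k * q ** (n - k) for k in range(13))

# Normal approximation with a continuity correction (assumption, see above)
approx_prob = approx.cdf(12.5)

print(f"exact = {exact:.4f}, normal approximation = {approx_prob:.4f}")
```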
7.5 Exponential distribution
A probability distribution with mean $\frac{1}{\lambda}$ that describes the time or space between
successive occurrences of an event when the number of times the event occurs over an
interval of time is described by a Poisson distribution with mean λ.
If x is described by an exponential distribution with mean $\frac{1}{\lambda}$, then the equation of the
probability curve describing x is
$f(x) = \lambda e^{-\lambda x}$ for x ≥ 0
$f(x) = 0$ otherwise
Using this probability curve, it can be shown that:
$P(a \le x \le b) = e^{-\lambda a} - e^{-\lambda b}$
Mean and standard deviation of exponential distribution:
$\mu_x = \sigma_x = \frac{1}{\lambda}$
Note: Exponential and related Poisson distributions are useful in analyzing waiting lines or
queues.
Eg. Queuing theory attempts to determine the number of servers that strikes an optimal
balance between the time customers wait for service and the cost of providing service.
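A minimal Python sketch of the exponential probability formula follows. The rate λ = 0.5 arrivals per minute is an assumed value, and `exponential_prob` is a hypothetical helper.

```python
from math import exp


def exponential_prob(a: float, b: float, lam: float) -> float:
    """P(a <= x <= b) for an exponential distribution with rate lam (mean 1/lam)."""
    return exp(-lam * a) - exp(-lam * b)


lam = 0.5  # assumed: on average 0.5 arrivals per minute
mean = sd = 1 / lam  # mean and standard deviation are both 1/lam

# Probability that the time between arrivals is between 1 and 3 minutes
p_1_to_3 = exponential_prob(1, 3, lam)

print(f"mean = sd = {mean} minutes, P(1 <= x <= 3) = {p_1_to_3:.4f}")
```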
Quiz 4
1. Consider a normal population with a mean of 10 and a variance of 4.
Find P(X > 18).
0
σ = √4 = 2, so z = (18−10)/2 = 4. This z lies beyond the largest z value listed in the normal
table, so the probability above it is approximately 0.
2. The relationship between the standard normal random variable, z, and
normal random variable, X, is that
the standard normal variable z counts the number of standard deviations that
the value of the normal random variable X is away from its mean.
3. The weight of a product is normally distributed with a mean of 5 ounces.
A randomly selected unit of this product weighs 7.1 ounces. The
probability of a unit weighing more than 7.1 ounces is .0014. The
production supervisor has lost files containing various pieces of
information regarding this process, including the standard deviation.
Determine the value of the standard deviation for this process.
P(x > 7.1) = 0.0014, so P(x ≤ 7.1) = 1 − 0.0014 = 0.9986.
Look at the normal table to find $z_{0.0014} = 2.98$. Since z = (x − µ)/σ, σ = (7.1 − 5)/2.98 = 0.70
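Questions 1 and 3 can be double-checked with Python's `statistics.NormalDist` instead of the printed table. This is a sketch; the small difference from the table answer comes from the table's rounding of z.

```python
from statistics import NormalDist

# Question 1: normal population with mean 10 and variance 4 (so sigma = 2)
X = NormalDist(mu=10, sigma=4**0.5)
p_above_18 = 1 - X.cdf(18)

# Question 3: P(x > 7.1) = 0.0014 with mean 5, solve z = (x - mu) / sigma for sigma
z = NormalDist().inv_cdf(1 - 0.0014)
sigma = (7.1 - 5) / z

print(f"P(X > 18) = {p_above_18:.6f}")
print(f"z = {z:.3f}, sigma = {sigma:.2f}")
```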
Midterm questions
1. From a population of size 2,000, a random sample of 200 items is selected. The
mean of the sample:
Can be larger, smaller or equal to the population mean
2. When a class interval is expressed as: 100 to under 200, it implies that:
The class must contain an observation with a value of 100
3. Consider a statistic defined as the distance between the 33rd percentile and
67th percentile. This statistic would give us information concerning:
Variability
4. Long question:
Let s = the sum of the returns from 2 projects, find $p(s \ge 18{,}000 \mid s \ge 12{,}000)$.
$p(s \ge 18{,}000 \mid s \ge 12{,}000) = \frac{p(s \ge 18{,}000 \,\cap\, s \ge 12{,}000)}{p(s \ge 12{,}000)} = \frac{p(s \ge 18{,}000)}{p(s \ge 12{,}000)}$,
because if s ≥ 18,000 then s ≥ 12,000, so
$p(s \ge 18{,}000 \,\cap\, s \ge 12{,}000) = p(s \ge 18{,}000)$
$p(s \ge 12{,}000) = p(6)p(6) + p(18)p(18) + p(6)p(18) + p(18)p(6) = 0.7569$
$p(s \ge 18{,}000) = p(18)p(18) + p(18)p(6) + p(6)p(18) = 0.3344$
$p(s \ge 18{,}000 \mid s \ge 12{,}000) = \frac{0.3344}{0.7569} = 0.4418$
5. For a positively skewed distribution, the mean will always be the highest estimate of
central tendency and the mode will always be the lowest estimate of central
tendency (assuming that the distribution has only one mode).
In a right skewed distribution: Mode -> median -> mean
Chapter 8 Sampling distributions
8.1 Sampling distribution of the sample mean $\bar{x}$
The sampling distribution of $\bar{x}$ is the probability distribution of the population of all possible sample means that could be
obtained from all possible samples of the same size.
Note: one purpose of the sampling distribution of $\bar{x}$ is to tell how accurate the sample mean is likely to be as a point
estimate of the population mean. But when the population is large, it is hard to tell.
Unbiased point estimate: a sample stat is an unbiased estimate of a population parameter
if the mean of the population of all possible values of the sample stat equals the
population parameter (eg. $\mu_{\bar{x}} = \mu$ for the sample mean).
The population of all possible sample means has:
1. Normal distribution, if the sampled population has a normal distribution
2. Mean $\mu_{\bar{x}} = \mu$: the sampling distribution of $\bar{x}$ has mean $\mu_{\bar{x}}$ equal to the population
mean
3. Standard deviation $\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}}$, if the sampled population is infinite or ≥ 20 times the
sample size.
Note: $\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}}$ means that if the sample size n > 1, the SD of the sampling distribution
is smaller than the SD of the population. See the spread of the graph below
The larger the sample size n, the smaller the spread of the sampling distribution, so the possible
sample means cluster more closely around the population mean µ and it’s more likely to obtain a
sample mean that is near the population mean.
8.2 Central limit theorem
If the sample size n is large ($n \ge 30$), then the sampling distribution of $\bar{x}$ is
approximately normal, even if the sampled population is not normally distributed.
Note:
● the larger the sample size n is, the more nearly normally distributed is the
population of all possible sample means.
● The more skewed the probability distribution of the sampled population, the
larger the sample size must be for the population of all possible sample means to be
approx. normally distributed.
● As the sample size increases, the spread of the distribution of all possible sample
means decreases (ie. the spread is measured by $\sigma_{\bar{x}}$, so $\sigma_{\bar{x}}$ decreases as well)
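A small simulation can illustrate these points. The sketch below draws many samples of size n = 30 from a right-skewed population (an exponential distribution with mean 1 and SD 1, chosen only as an example) and checks that the sample means are centred on µ with a spread close to σ/√n. The seed and the number of samples are arbitrary choices.

```python
import random
from math import sqrt
from statistics import mean, stdev

random.seed(0)  # fixed seed so the run is repeatable

n = 30              # sample size
num_samples = 5000  # number of simulated samples

# Skewed population: exponential with mean 1 and standard deviation 1
sample_means = [
    sum(random.expovariate(1.0) for _ in range(n)) / n
    for _ in range(num_samples)
]

mean_of_means = mean(sample_means)  # should be close to mu = 1
sd_of_means = stdev(sample_means)   # should be close to sigma / sqrt(n)

print(f"mean of sample means = {mean_of_means:.3f}")
print(f"SD of sample means = {sd_of_means:.3f}, sigma/sqrt(n) = {1 / sqrt(n):.3f}")
```

Rerunning with a larger n should shrink the spread further, in line with the third bullet above.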
Unbiasedness and min-variance estimates
Sampling distribution of a sample statistic: the probability distribution of the pop of all
possible values of the sample statistic (descriptive measure eg. sample mean, sample
median, sample SD, etc).
Unbiased point estimate:
● The sample mean is also called a min-variance unbiased point estimate of µ.
● $s^2$ is an unbiased point estimate of $\sigma^2$ if the sampled population is infinite.
8.2 The sampling distribution of the sample proportion $\hat{p}$
The population of all possible sample proportions:
1. Approximately has a normal distribution, if the sample size n is large
2. Has mean $\mu_{\hat{p}} = p$
3. Has standard deviation $\sigma_{\hat{p}} = \sqrt{\frac{p(1-p)}{n}}$
Note: n should be considered large if both np and n(1-p) are at least 5.
Chapter 9 Confidence intervals
9.1 Z-based confidence intervals for a population mean: σ known
Confidence interval for a pop mean: an interval constructed around the sample mean so
that we are reasonably confident that this interval contains the pop mean.
Confidence level: the percentage of time that a confidence interval would contain a
population parameter if all possible samples were used to calculate the interval.
Margin of error: the quantity that is added to and subtracted from a point estimate of a pop
parameter to obtain a confidence interval for the parameter.
Eg. [$\bar{x}$ ± margin of error]
Increasing the confidence level has:
● Advantage of being more confident that µ is contained in the confidence interval
● Disadvantage of increasing the margin of error and thus providing a less precise
estimate of the true value of µ
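The trade-off between confidence level and precision can be seen in a short Python sketch. It assumes the standard z-based interval $\bar{x} \pm z_{\alpha/2}\,\sigma/\sqrt{n}$, which these notes do not write out, and the sample values are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def z_interval(x_bar: float, sigma: float, n: int, confidence: float):
    """Z-based confidence interval for a population mean (sigma known)."""
    alpha = 1 - confidence
    z = NormalDist().inv_cdf(1 - alpha / 2)  # z_{alpha/2}
    margin = z * sigma / sqrt(n)             # margin of error
    return x_bar - margin, x_bar + margin


# Hypothetical sample: mean 50, known sigma 10, sample size 100
low95, high95 = z_interval(50, 10, 100, 0.95)
low99, high99 = z_interval(50, 10, 100, 0.99)

print(f"95% interval: [{low95:.2f}, {high95:.2f}]")
print(f"99% interval: [{low99:.2f}, {high99:.2f}]")
```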
9.2 T-based confidence interval for a pop mean: σ unknown
T-distribution: commonly used continuous prob distribution that is described by a
distribution curve similar to a normal curve. The t curve is symmetrical about 0 and is
more spread out than a standard normal curve.
If we don’t know σ, we can use s to help construct a confidence interval for µ:
$t = \frac{\bar{x}-\mu}{s/\sqrt{n}}$
Degrees of freedom (df): determine the spread of the t curve
$df = n - 1$
When the sample size ≥ 30, you are safe to use t-table.
Note: z-table and t-table give nearly the same values when the sample size (df) is large. It is reasonable to
approximate the value of $t_\alpha$ by $z_\alpha$ when df is greater than 100.
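The t statistic and its degrees of freedom are easy to compute from raw data, as in this sketch. The five measurements and the value µ = 7 are invented for illustration. The critical value $t_\alpha$ still has to come from a t-table, since Python's standard library has no t distribution.

```python
from math import sqrt
from statistics import mean, stdev

sample = [7.2, 6.8, 7.5, 6.9, 7.1]  # hypothetical measurements
mu_0 = 7.0                          # assumed value of the population mean

n = len(sample)
x_bar = mean(sample)
s = stdev(sample)  # sample standard deviation (divides by n - 1)

t = (x_bar - mu_0) / (s / sqrt(n))
df = n - 1

print(f"x_bar = {x_bar:.2f}, s = {s:.4f}, t = {t:.4f}, df = {df}")
```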
9.3 Sample size determination
9.4 Confidence intervals for a population proportion
Note: if both np and n(1-p) are larger than 5, you can use z-table.
Note: if the computed sample size is not a whole number, always round up to the next integer.
Quiz 5
1. The width of a confidence interval will be
a. Narrower for 99% confidence than 95% confidence
b. Wider for a sample size of 100 than for size of 50
c. narrower for 90% confidence than 95% confidence
d. Wider when the sample s is small than when s is large
Answer: c. A lower confidence level uses a smaller z value, which shrinks the margin of error.
2. The internal auditing staff of a local manufacturing company performs a sample audit
each quarter to estimate the proportion of accounts that are current (between 0 and 60
days after billing). The historical records show that over the past 8 years 70 percent of
the accounts have been current. Determine the sample size needed in order to be
95% confident that the sample proportion of the current customer accounts is
within .03 of the true proportion of all current accounts for this company.
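$n = \frac{(z_{\alpha/2})^2\, p(1-p)}{E^2} = \frac{(z_{0.025})^2 \times 0.7 \times 0.3}{0.03^2} = \frac{1.96^2 \times 0.21}{0.0009} \approx 896.4$, rounded up to n = 897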
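The sample size calculation above can be reproduced in Python. This is a sketch with a hypothetical function name, and it rounds up as the notes require.

```python
from math import ceil
from statistics import NormalDist


def sample_size_for_proportion(p: float, E: float, confidence: float) -> int:
    """Smallest n so that the sample proportion is within E of p
    at the given confidence level."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)  # z_{alpha/2}
    return ceil(z**2 * p * (1 - p) / E**2)  # always round up


n = sample_size_for_proportion(p=0.7, E=0.03, confidence=0.95)
print(n)
```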
3. In the case where E is not given: if the interval is [100, 200], E = 50, since the point
estimate (the sample mean) sits at the midpoint of the interval and E is half of its width.
Chapter 10 Hypothesis testing
10.1 The null and alternative hypotheses and errors in
hypothesis testing
When doing hypothesis testing, it’s important to decide which of the statements is the null
hypothesis and which is the alternative hypothesis.
1. Null hypothesis (𝐻0): the statement being tested. It’s given the benefit of doubt
and is not rejected unless there is convincing sample evidence that it is false.
Ie. we assume that the 𝐻0 is true and will reject 𝐻0 only if there is convincing
sample evidence.
2. Alternative hypothesis (𝐻1): the statement that is assigned the burden of proof. It
is accepted only if there is convincing sample evidence that it is true.
We always test what is in $H_0$.
$H_0$: =, ≤ or ≥
$H_1$: ≠, > or < (respectively)
State $H_0$ first, then pair it with the matching $H_1$.
Eg. I don’t have sufficient information to claim that the speed is less than 7.
P-value
● If p-value < α, z is in the rejection area, so we reject $H_0$
● If p-value > α, z is not in the rejection area, so we do not reject $H_0$
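The p-value rule can be sketched in Python for a left-tailed z test. All numbers are hypothetical: $H_0$: µ ≥ 7 against $H_1$: µ < 7, with σ assumed known.

```python
from math import sqrt
from statistics import NormalDist

# Hypothetical left-tailed test: H0: mu >= 7, H1: mu < 7
x_bar, sigma, n = 6.8, 1.0, 36
mu_0, alpha = 7.0, 0.05

z = (x_bar - mu_0) / (sigma / sqrt(n))  # test statistic
p_value = NormalDist().cdf(z)           # area to the left of z
reject_h0 = p_value < alpha

print(f"z = {z:.2f}, p-value = {p_value:.4f}, reject H0: {reject_h0}")
```

Here the p-value is larger than α, so there is not sufficient evidence to claim that the mean is less than 7.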
Chapter 13 Chi-square tests
Goodness of fit:
Condition: E=np > 5.
If np < 5, we need to use a bigger sample size.
Chi-square tests are right-tailed in this course (the Chi-square distribution itself is right skewed).
Steps in doing Chi tests:
1. 𝐻0: 𝑃1 = 10%, 𝑃2 = 20%, 𝑃3 = 25%, 𝑃4 = 30%, 𝑃5 = 15%
𝐻1: 𝑎𝑡 𝑙𝑒𝑎𝑠𝑡 𝑜𝑛𝑒 𝑃 𝑖𝑠 𝑤𝑟𝑜𝑛𝑔
2. Critical value (CV): the Chi-score from the Chi-square table with df = k − 1, where k is the number of proportions (categories).
Note: always take the Chi-score for the right-hand tail.
Df = 5 − 1 = 4 for this case.
3. $\chi^2 = \sum_{i=1}^{k} \frac{(O_i - E_i)^2}{E_i}$
4. Reject or don’t reject 𝐻0
Note: In the hypotheses, we never put the numbers collected from samples! Put the
sample numbers in the Observed data. So don’t put P1=253/1200 in the hypothesis. Use the
hypothesized proportions from the question text.
Instead, use $H_0: P_1 = P_2 = P_3 = P_4 = P_5 = 0.2$
The rejection area is on the right side; for anything to the left of the critical value, $H_0$ is not rejected.
The Chi-square statistic basically measures how far the observed counts are from the counts
expected under the hypothesis.
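The goodness-of-fit steps can be sketched in Python as follows. The hypothesized proportions are the ones from step 1, the observed counts are invented for illustration, and the critical value 9.488 is the Chi-square table entry for df = 4 with a right-hand tail area of 0.05 (α = 0.05 is an assumed significance level).

```python
# Step 1: proportions stated in H0 (never the sample proportions)
hypothesized = [0.10, 0.20, 0.25, 0.30, 0.15]

# Hypothetical observed counts, one per category
observed = [130, 253, 290, 350, 177]
n = sum(observed)

# Expected counts E = n * p, each of which must be larger than 5
expected = [n * p for p in hypothesized]
assert all(e > 5 for e in expected)

# Step 2: critical value from the Chi-square table
df = len(hypothesized) - 1
critical_value = 9.488  # table value for df = 4, right-hand tail area 0.05

# Step 3: Chi-square statistic
chi_square = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Step 4: reject H0 only if the statistic falls in the right-hand tail
reject_h0 = chi_square > critical_value

print(f"chi-square = {chi_square:.4f}, df = {df}, reject H0: {reject_h0}")
```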
Test of independence or homogeneity
(row total) x (column total) / total = expected value
$H_0$ always assumes the variables are independent (not related).
Ie. $H_0$: Gender and owning a cell phone are not related.
$H_1$: Gender and owning a cell phone are related.
Note: if the 2 variables are independent, the proportions should be approximately evenly
distributed:
Age
A
B
C
0-10
≅33%
10-30
≅33%
>30
≅33%
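The expected-value rule for a test of independence can be sketched in Python with a small contingency table. The counts below (gender by cell phone ownership) are invented for illustration.

```python
# Hypothetical observed counts: rows = gender, columns = owns a phone / does not
observed = [
    [30, 20],
    [25, 25],
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
total = sum(row_totals)

# Expected count for each cell = (row total) x (column total) / total
expected = [[r * c / total for c in col_totals] for r in row_totals]

# Chi-square statistic summed over every cell of the table
chi_square = sum(
    (o - e) ** 2 / e
    for o_row, e_row in zip(observed, expected)
    for o, e in zip(o_row, e_row)
)

print(expected)
print(f"chi-square = {chi_square:.4f}")
```

The statistic is then compared with a right-tail critical value from the Chi-square table, as in the goodness-of-fit test.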
Chapter 13
Nov 16 Class
Coefficient slope B1
Chapter 15
For multiple test, use f test