
DATA ANALYTICS FINAL

Prediction
Prediction: assigning a value to y, the target variable or outcome variable, for one or more
target observations.
Predictor: The variables (x) whose values are known and are used to predict the unknown
variable (y)
Predictive data analysis: Data analysis with the aim of prediction
Original Data: The data on variables y and x that is available and is used to make a model for
prediction.
Live data: The data that includes the target observations, i.e., the observations for which y is to be predicted.
The original data (observations on x and y) doesn’t include the target observations
for which the prediction is to be made. This original data is used to uncover patterns of
association between y and x which is used in making the prediction. To uncover
those patterns, a model is estimated and with the help of that model, the values of y
are predicted for the target observations, for which we can observe x but not y. The
data that includes the target observations is called the live data. Thus, in short,
patterns in the original data are uncovered and then used for making the prediction
in the live data.
Function: a rule that gives a value for y if we plug in values for x. Mathematically, those
patterns of association are expressed as a function.
Notations
j – a specific target observation in the live data
𝑥𝑗 – the specific observation of the x variable(s) for target observation j in the live data
𝑦𝑗 – the value of y for target observation j; the variable and observation we want to predict
𝑦̂𝑗 – the predicted value of y for target observation j in the live data
f – the function that gives a value for y when we plug in values for x
𝑓̂ – the estimated version of that function used in a model, also called the predictive model
𝑦 = 𝛽0 + 𝛽1 𝑥1 + 𝛽2 𝑥2 + ⋯
𝑦̂ = 𝛽̂0 + 𝛽̂1 𝑥1 + 𝛽̂2 𝑥2 + ⋯
In the abstract, the linear regression of y on x1, x2, … is a model for the conditional expected
value of y, and it has coefficients (𝛽). We need estimated coefficients (𝛽̂ ) and actual x
values (𝑥𝑗 ) to produce a predicted value 𝑦̂𝑗.
Here the function, or model, is 𝑓(𝑥1 , 𝑥2 , … ) = 𝛽0 + 𝛽1 𝑥1 + 𝛽2 𝑥2 + ⋯ , and its estimated
version, with specific values for its coefficients, is 𝑓̂(𝑥1 , 𝑥2 , … ) = 𝛽̂0 + 𝛽̂1 𝑥1 + 𝛽̂2 𝑥2 + ⋯
The fundamental task of predictive analytics is finding the best model. The best model is
best in the sense that it gives the best prediction in the live data, not in the original data.
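As a minimal sketch in R (the data frames and variable names below are made up for illustration), the estimated function 𝑓̂ can be obtained with lm() on the original data and then applied to the live data with predict():
# Estimate f-hat on the original data, then predict y-hat for target observations
original <- data.frame(x1 = c(1, 2, 3, 4, 5),
                       x2 = c(2, 1, 4, 3, 5),
                       y  = c(3, 4, 8, 9, 12))
live <- data.frame(x1 = c(6, 7), x2 = c(5, 6))   # x observed, y unknown
fit <- lm(y ~ x1 + x2, data = original)           # estimated coefficients (beta-hats)
predict(fit, newdata = live)                      # predicted values y-hat for the target observations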
Various Kinds of Prediction
Quantitative prediction: predicting the value of a quantitative outcome y.
Probability prediction: if y is binary (or categorical), predicting the probability that y takes a
specific value (for example, y = 1); the prediction is a probability value.
Classification: for a binary y variable, predicting whether the outcome is 0 or 1, not the
probability.
Types of Prediction
With quantitative predictions and with probability predictions, predicting a single value 𝑦̂𝑗
is called point prediction.
An interval prediction produces a prediction interval, or PI, that tells where to find the value
of yj with a certain likelihood (e.g., a 95% prediction interval or an 80% prediction interval).
The difference between the true value of a variable (𝑦𝑗 ) and its point prediction (𝑦̂𝑗 ) is the
prediction error.
𝑒𝑗 = 𝑦𝑗 − 𝑦̂𝑗
The prediction error can be decomposed into three components: estimation error, model
error, and irreducible error – also called idiosyncratic error or genuine error.
The estimation error comes from the fact that, with a model f, we use our original data to
find 𝑓̂. When using a linear regression for prediction, this error is due to the fact that we
don’t know the values of the coefficients of the regression (𝛽), only their estimated values
(𝛽̂ ). This error is captured by the SE of the regression line.
The model error reflects the fact that we may not have selected the best model for our
prediction. More specifically, instead of model f, there may be a better model, g, which uses
the same predictor variables x or different predictor variables from the same original data.
The irreducible error is due to the fact that we are not able to make a perfect prediction for
𝑦𝑗 with the help of the 𝑥𝑗 variables even if we find the best model and we can estimate it
without any estimation error. Maybe, if we had data with more x variables, we could have a
better prediction using those additional variables. But, with the variables we have, we can’t
do anything about this error. That’s why its name is irreducible error.
To help decisions, we would attach a value to the consequences of the prediction error: a
loss that we incur due to decisions we make because of a bad prediction. This idea is
formulated in a loss function. A loss function translates the prediction errors into a number
that makes more sense for the decisions and their consequences. Typically, we have more
than one target observation. The loss function helps to rank predictions.
Symmetric loss functions attach the same value to positive and negative errors of the same
magnitude. Asymmetric loss functions attach different values to errors of the same
magnitude if they are on different sides. Linearity versus convexity is about the magnitudes
of error in quantitative predictions (or probability predictions). Linear loss functions attach a
loss that is proportional to the size of the error, regardless of its magnitude. In contrast,
convex loss functions attach a disproportionately larger loss to larger errors.
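A tiny illustration of these loss shapes in R (the error values and the asymmetry factor of 2 are made up):
e <- c(-2, -1, 0, 1, 2)                                # prediction errors e = y - y-hat
absolute_loss   <- abs(e)                              # linear (and symmetric) loss
squared_loss    <- e^2                                 # convex (and symmetric) loss
asymmetric_loss <- ifelse(e < 0, 2 * abs(e), abs(e))   # negative errors penalized twice as much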
For example, if our ice cream shop has a lot of competition, we may be more worried about
damaged reputation than incurring more costs, and so our loss function would be
asymmetric, with larger loss attached to a negative prediction error than to a positive
prediction error of the same size. Most likely, our loss function would be convex in both
directions: a larger prediction error would have disproportionately larger consequences
than a smaller prediction error.
Or consider the inflation forecasts used by central banks to decide on monetary policy. The
loss function of central banks should describe the social costs of wrong decisions based on
erroneous predictions. The larger the prediction error, the larger the chances of wrong
decisions and thus the larger the social costs.
R Dashboard
An interactive data application, or dashboard, can help improve business performance by
providing key performance indicators (or KPIs) to stakeholders.
Key Benefits
There are many benefits to using dashboards to visualize data.
1) With real-time visuals on a dashboard, understanding the moving parts of a business
becomes easy.
2) Based on the report type and data, you can create suitable graphs and charts in one
central location, thereby providing an easy way for stakeholders to quickly access the data
and understand what is going right or wrong, and what needs to be improved.
3) Also, seeing the big picture in one place can help businesses make informed decisions,
thereby improving performance and reducing hours spent on analyzing the data.
In general, the best dashboards answer critical business questions.
Components
A dashboard consists of four components.
1. Analyze. In this component, you manipulate and summarize data using a backend library.
2. Visualize. You create graphs using the graphing library.
3. Interact. Here, you use frontend libraries to accept user inputs.
4. Serve. This component listens to user requests and returns web pages using a web server.
R Package
Shiny is an R package for building interactive web applications that can help you with
effective data storytelling. With Shiny, prototyping an application is fast because it is
developed entirely in R. Using Shiny, you can easily build dashboards, host standalone apps on a
webpage, or embed them in R Markdown documents.
Parts of Shiny
A Shiny app consists of two parts: the server and the user interface (or UI.)
The server powers the app and holds the app logic. It tells the user interface which contents
to show when the user interacts with the application. The server that runs the application
can be hosted on either your own computer or on a remote server.
The user interacts with the UI when they run the app through a web browser. The UI has a
layout that describes the locations and types of input and output fields for the app. The user
interacts with the input fields, which can include text, files, checkboxes, sliders, dropdowns,
and more.
Output fields display the results of the analysis and can include graphs, tables, text, images,
and other visualizations. For example, assume you want to know the number of data science
courses released in 2021, split by programming language. You will enter the year in the text
input field of the UI. This information is sent as a request to the server and the input is
analyzed on the backend. Then, the server returns the result to the UI, and the visualization
displays in the corresponding output field.
Shiny is an external library, so you need to install it before you can use it. Use the
install.packages() command to install the package.
install.package("shiny")
install.package("digest")
Now, check the installation by running a demo app that is included with the package. If the
installation went well and the example app starts running, it should look like this.
library(shiny)
runExample("01_hello")
Close the screen to stop the app.
Getting into Shiny
There are two main parts that make up a Shiny application. The first part is the user
interface, or UI, which is a web document that displays the application to the user. It is
made up of HTML components that you create using Shiny functions. The second part is the
Server. This is where the application logic sits. Let’s create a minimal Shiny application.
You can code for both components in one file, or in two separate files. Coding in separate
files is recommended. However, you must place the ui.R and server.R files into a single
folder. RStudio knows that it is a Shiny application when it sees the two files together.
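As a minimal sketch of such a two-file app (the input and output names are illustrative), ui.R returns the page layout and server.R returns the server function:
# ui.R
library(shiny)
fluidPage(
  titlePanel("Hello Shiny"),
  sidebarLayout(
    sidebarPanel(
      textInput("name", "Enter your name:", value = "world")
    ),
    mainPanel(
      textOutput("greeting")
    )
  )
)

# server.R
library(shiny)
function(input, output) {
  output$greeting <- renderText({
    paste("Hello,", input$name)
  })
}
With both files saved in the same folder, calling runApp() on that folder starts the app.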
The Panel functions are used to group UI elements together into a single panel. For
example, if you want to have text input and numeric input boxes on the side, you can group
these elements into a sidebar panel. Some useful panel functions are absolutePanel(),
conditionalPanel(), headerPanel(), inputPanel(), mainPanel(), navlistPanel(), sidebarPanel(),
tabPanel(), tabsetPanel(), titlePanel(), and wellPanel(). Panel functions return HTML div
tags with class attributes defined by Bootstrap.
The Layout functions are used to organize panels containing UI elements in the application
layout. Here are a few examples. The fluidPage() function creates a fluid page layout, which
consists of rows that in turn include columns. Rows ensure that their elements appear on the same
line (if the browser has adequate width). Columns define how much horizontal space within
a 12-unit-wide grid their elements should occupy. Fluid pages scale their components in real
time to fill the entire browser width. The flowLayout() function places elements in a left-to-right, top-to-bottom arrangement. The sidebarLayout() function arranges elements in a
layout with a side bar and main area. The splitLayout() function lays out elements
horizontally, dividing the available horizontal space into equal parts, by default. And the
verticalLayout() function creates a container that includes one or more rows of content.
Control widgets are web elements that users can interact with. They allow users to send
input to the Server, which in turn performs the logic. As the user updates the widget values,
the output corresponding to the input will be updated as well.
The Shiny package has numerous pre-built widget functions, which not only makes it easier
to create widgets but also makes them look better. Some examples include buttons, date
range input, date input, radio buttons, single checkboxes, checkbox groups, help text, select boxes,
file input, slider input, numeric input, text input, and more.
The Server performs the logic whenever input widgets change and then sends the result
back to the respective UI output elements. Let’s say you have an application that calculates
the square of a number. You will need numeric input and output UI elements. With the
numeric input widget in the UI, you can send input to the server. The Server will compute
the square of the input, then it will send it back to the UI output element.
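A rough single-file sketch of that square-of-a-number example (the widget and output names are made up):
library(shiny)

ui <- fluidPage(
  numericInput("num", "Enter a number:", value = 2),   # input widget sends the number to the server
  textOutput("square")                                  # output element displays the result
)

server <- function(input, output) {
  output$square <- renderText({
    paste("The square of", input$num, "is", input$num^2)
  })
}

shinyApp(ui = ui, server = server)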
Lecture 1
Introduction:
• Data - “factual information used as a basis for reasoning, discussion, or calculation”
(Merriam-Webster dictionary).
• Data Table - consists of observations and variables. Observations are also known as
cases. Variables are also called features. In the language of mathematics, a data table
is also called a data matrix.
• Identifier – observations are identified by an identifier or ID variable – either a single ID
variable or a combination of multiple ID variables.
• Dataset - a broader concept that includes, potentially, multiple data tables with
different kinds of information to be used in the same analysis.
• Data Structures:
1. Cross-sectional Data – often abbreviated as xsec data; observations come from the same
time but refer to different units.
2. Time Series Data – observations refer to a single unit observed multiple
times. A common abbreviation used for time series data is tseries data.
3. Panel Data – has multiple dimensions: multiple units, each observed multiple
times. It is also called longitudinal data, or cross-section time series data,
abbreviated as xt data.
a) xt data is called balanced if all cross-sectional units have observations
for the very same time periods.
b) It is called unbalanced if some cross-sectional units are observed more
times than others.
• Aggregation
1. Individual Level – age, income, name, etc.
2. Firm Level – sales, revenue, number of employees, etc. in a firm.
3. Industry/Sector Level – total exports by a specific industry, employment in a
specific industry/sector.
4. Country Level – GDP, unemployment rate, etc.
5. Global Level – total population of the world, global emission rate, etc.
• Analytics - the science of using data to build models that lead to better decisions,
that add value to individuals, to companies, to institutions.
• Models …?
Steps
1. Collecting Data (ensure quality)
2. Understanding your data
3. Building Models
4. Analysis
5. Presentation
6. Story Telling
Collection of the Data
▪ Existing Sources
▪ Surveys
▪ Interviews
▪ Web Scraping and Application Programming Interfaces (APIs)
▪ Sampling
▪ Random Sampling
▪ Snowball Sampling
▪ Cluster Sampling etc.
Quality of the Data
▪ Content – determined by how the variable/data was measured, not by what
it was meant to measure. As a consequence, just because a variable is given a
particular name, it does not necessarily measure that.
▪ Validity – the content of a variable should be as close as possible to what it is
meant to measure (its intended content).
▪ Reliability – measurement of a variable should be stable, leading to the same value
if measured the same way again.
▪ Comparability – a variable should be measured the same way for all
observations.
▪ Coverage – observations in the collected dataset should include all of those
that were intended to be covered (complete coverage). In practice, they may
not (incomplete coverage).
▪ Unbiased selection – if coverage is incomplete, the observations that are
included should be similar to all observations that were intended to be
covered (and, thus, to those that are left uncovered).
Types of the Variables (in terms of measurement scale)
• Nominal variables are qualitative variables with values that cannot be
unambiguously ordered. Different individual decision makers may have different
ordering of these options, as there is no universally agreed ranking of all options for
these types of variables.
• Ordinal or ordered variables take on values that are unambiguously ordered. All
quantitative variables can be ordered; some qualitative variables can be ordered,
too.
• Interval variables have the property that a difference between values means the
same thing regardless of the magnitudes. All quantitative variables have this
property, but qualitative variables don’t have this property.
• Ratio variables, also known as scale variables, are interval variables with the
additional property that their ratios mean the same regardless of the magnitudes.
This additional property also implies a meaningful zero in the scale. Many but not all
quantitative variables have this property.
❖ Zero distance is unambiguous, and a 10 km run is twice as long as a 5 km run. An
example of an interval variable that is not a ratio variable is temperature: 20 degrees
is not twice as warm as 10 degrees, be it Celsius or Fahrenheit.
Lecture 2
• Relational data (or relational database) is the term often used for datasets that
have various kinds of observations linked to each other through various
relations.
• The process of pulling different variables from different data tables for well-identified
entities to create a new data table is called linking, joining, merging, or
matching data tables.
• One-to-One (1:1) Merging: merging tables with the same type of
observations.
• Many-to-one (m:1) Merging: linking many observations in one table to a single
observation in the other.
• One-to-Many (1:m) Merging: the opposite of m:1 – one observation in the first
table is matched with many observations in the second.
• Many-to-Many (m:m) Merging: one observation in the first table may be
matched with many observations in the second table, and an observation in
the second data table may be matched with many observations in the first
data table.
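A minimal sketch of a 1:1 and an m:1 merge in base R (the data frames, IDs, and values are made up for illustration):
firms <- data.frame(firm_id = c(1, 2, 3),
                    industry = c("retail", "tech", "retail"))
sales <- data.frame(firm_id = c(1, 2, 3),
                    sales = c(100, 250, 80))
industries <- data.frame(industry = c("retail", "tech"),
                         avg_wage = c(30, 55))

# One-to-one (1:1): each firm_id appears once in both tables
firm_data <- merge(firms, sales, by = "firm_id")

# Many-to-one (m:1): several firms link to a single industry-level row
firm_data <- merge(firm_data, industries, by = "industry")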
Some Good Practices
• Understand your data and objectives well before anything else.
• Make a plan (points, flowchart, etc.) before proceeding.
• Keep the original file separate; make tidy tables of the relevant data, then build your
workfile separately.
• Make a folder and keep all related files in that folder. Set your program’s working directory to
that folder as well.
• Keep a record of your code (separately or within the program you are
using).
• Checking back and forth is a good habit.
Entity Resolution
• After you know your objectives and what each variable means, and you are ready
to start work on the data, the first step is entity resolution, i.e., resolving identification
issues in the data.
• Duplication: some IDs are repeated. Duplication is perfect when all duplicates carry the
same information, and imperfect when different information is recorded against the same
ID for the same characteristics.
• Ambiguous identification: the same entity having different IDs across different data
tables
• Non-entity observations: rows that do not belong to an entity we want in the data
table.
• Example: summary row in a table that adds up variables across some
entities.
• Missing Values - the value of a variable is not available for some observations.
1. Missing values are not always straightforward to identify, and they may be
mistakenly interpreted as some valid value. Coding missing values explicitly (e.g., as NA) in the data is helpful.
2. Missing values mean fewer observations in the data with valid information, making
the generalization of the results questionable. The magnitude of the problem
matters in two ways: what fraction of the observations is affected, and how many
variables are affected.
3. Missing values may lead to incomplete coverage or selection bias. Missingness
can be systematic (a source of bias) or random. We can detect this by benchmarking – comparing
the distribution of variables that are available for all observations.
• Think of missing values of a variable y. Then benchmarking involves comparing some
statistics, such as the mean or median of variables x, z, …, each of which is thought to
be related to variable y, in two groups: observations with missing y and observations
with non-missing y. If these statistics are different, we know there is a problem.
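A minimal sketch of such a benchmarking check in R, with a made-up data frame (the variable names and values are illustrative):
# Compare the mean of x for observations with missing y versus non-missing y
df <- data.frame(x = c(10, 12, 9, 15, 11, 14),
                 y = c(3, NA, 2, NA, 4, 5))
tapply(df$x, is.na(df$y), mean)
# A large difference between the two group means suggests y is not missing at random.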
• Cleaning the Data: The data cleaning process starts with the raw data and results in
clean and tidy data.
1. Make sure all variables are stored in an appropriate format. Binary variables are best stored
as 0 and 1 (sometimes as 1 and 2). Qualitative variables with several values may be
stored as text or numbers; good practice is to store them as numbers and label the
values.
2. Cleaning variables may include slicing up text, extracting numerical information, or
transforming text into numbers.
3. Identify missing values and make appropriate decisions on them.
4. Make sure that the values of each variable are within their admissible range. Values
outside the range are best replaced as missing unless it is obvious what the value
was meant to be.
5. Sometimes changing the units of measurement is also needed, such as converting prices in
another currency or replacing very large numbers with measures in thousands or
millions.
6. Variable descriptions (also called variable labels) should be prepared showing the
content of variables with all the important details.
7. Be economical with time.
8. Reproducible workflow – write down the steps and code and keep them updated.
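A small illustration of several of these steps in R on made-up raw data (the column names, the invalid age value, and the exchange rate are invented):
raw <- data.frame(id = 1:4,
                  female = c("yes", "no", "yes", "no"),
                  age = c("23", "41", "-1", "35"),      # stored as text; -1 is invalid
                  price_eur = c(10, 12, 9, 11),
                  stringsAsFactors = FALSE)

clean <- raw
clean$female <- ifelse(clean$female == "yes", 1, 0)      # store binary variable as 0/1
clean$age <- as.numeric(clean$age)                        # transform text into numbers
clean$age[clean$age < 0 | clean$age > 120] <- NA          # out-of-range values set to missing
clean$price_usd <- clean$price_eur * 1.1                  # change units (illustrative exchange rate)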
Review of Basic Statistics and Mathematics
Probability
• Experiment: Any activity that generates data
• Random Experiment: Any experiment with more than one possible outcome, where
which outcome will occur cannot be known with certainty beforehand.
• Event: Outcome of an experiment or survey
• Elementary Event: An outcome that satisfies only one criterion.
• Joint Event: An outcome that satisfies two or more criteria
• Random Variable: A variable whose numerical values represent the events of a
random experiment
• The phrase random variable refers to a variable that has no data values until
an experimental trial is performed or a survey question is asked and
answered. Random variables are either discrete, in which the possible
numerical values are a set of integers (or coded values, in the case of
categorical data), or continuous, in which any value is possible within a
specific range.
• Probability - A number that represents the chance that a particular event will occur
for a random variable: 𝑃(𝐴) = 𝑛(𝐴)/𝑛(𝑆), the number of outcomes in which A occurs
divided by the total number of possible outcomes.
• Exhaustive Events: A set of events that includes all the possible events. The sum of the
individual probabilities associated with a set of collectively exhaustive events is
always 1.
• Mutually Exclusive Events – two (or more) events that cannot occur at the same time.
The occurrence of one automatically means that none of the others have occurred.
• Independent Events: Two events are independent if the occurrence of one event in
no way affects the probability of the second event.
SOME BASIC AND SIMPLE RULES
1. Probability of an event must be between 0 and 1.
2. P(A’) = 1 – P(A)
3. If two events A and B are mutually exclusive, the probability of either event A or
event B occurring is the sum of their separate probabilities.
𝑃(𝐴 𝑜𝑟 𝐵) = 𝑃(𝐴 ∪ 𝐵) = 𝑃(𝐴) + 𝑃(𝐵)
4. If events in a set are mutually exclusive and collectively exhaustive, the sum of their
probabilities must add up to 1.
5. If two events A and B are not mutually exclusive, the probability of either event A or
event B occurring is the sum of their separate probabilities minus the probability of
their simultaneous occurrence (the joint probability).
𝑃(𝐴 𝑜𝑟 𝐵) = 𝑃(𝐴 ∪ 𝐵) = 𝑃(𝐴) + 𝑃(𝐵) − 𝑃(𝐴 ∩ 𝐵)
SOME BASIC AND SIMPLE RULES (Continued)
6. If two events A and B are independent, the probability of both events A and B
occurring is equal to the product of their individual probabilities.
𝑃(𝐴 𝑎𝑛𝑑 𝐵) = 𝑃(𝐴 ∩ 𝐵) = 𝑃(𝐴). 𝑃(𝐵)
7. If two events A and B are not independent, the probability of both events A and B
occurring is the product of the probability of event A multiplied by the probability of
event B occurring, given that event A has occurred.
𝑃(𝐴 𝑎𝑛𝑑 𝐵) = 𝑃(𝐴 ∩ 𝐵) = 𝑃(𝐴) · 𝑃(𝐵|𝐴)
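A quick numeric illustration of rules 5 and 6 in R (the probabilities are made up, and A and B are assumed independent):
p_a <- 0.4
p_b <- 0.3
p_a * p_b               # rule 6: P(A and B) = 0.12 for independent events
p_a + p_b - p_a * p_b   # rule 5: P(A or B) = 0.58 for these events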
Assigning Probabilities
A. Classical Approach – Assigning probabilities based on prior knowledge of the process
involved. The probability that a particular event will occur is defined by the number
of ways the event can occur divided by the total number of elementary events.
B. Empirical Approach - Assigning probabilities based on frequencies obtained from
empirically observed data.
C. Subjective Approach - Assign probabilities based on expert opinions or other
subjective methods such as “gut” feelings or hunches.
Probability Distribution
• Discrete Probability Distribution - A list of all possible distinct (elementary) events
for a (discrete) variable and their probabilities of occurrence.
• Expected Value of a Variable – The sum of the products formed by multiplying each
possible event in a discrete probability distribution by its corresponding probability.
𝐸(𝑋) = ∑ 𝑋𝑖 . 𝑃(𝑋𝑖 )
• The expected value tells you the value of the variable that you could expect
in the “long run” – that is, after many experimental trials.
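A quick sketch of this calculation in R, with a made-up discrete distribution:
x <- c(0, 1, 2, 3)           # possible values (illustrative)
p <- c(0.1, 0.4, 0.3, 0.2)   # their probabilities (must sum to 1)
sum(x * p)                   # expected value E(X) = 1.6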
Binomial and Poisson Probability Distributions
• The probability distributions for certain types of discrete variables can be modeled
using a mathematical formula. For a discrete variable, we use either the binomial
probability distribution or the Poisson probability distribution.
Binomial Distribution
• The probability distribution for a discrete variable that meets these criteria:
1. The variable is for a sample that consists of a fixed number of experimental trials
(the sample size).
2. The variable has only two mutually exclusive and collectively exhaustive events,
typically labeled as success and failure.
3. The probability of an event being classified as a success, p, and the probability of an
event being classified as a failure, q = 1 – p, are both constant in all experimental
trials.
4. The event (success or failure) of any single experimental trial is independent of (not
influenced by) the event of any other trial.
• Using the binomial distribution prevents you from having to develop the probability
distribution by using a table of outcomes and applying the multiplication rule.
• Whenever p = 0.5, the binomial distribution is symmetrical, regardless of how
large or small the sample size is. If p < 0.5, the distribution is positively (right-) skewed;
if p > 0.5, it is negatively (left-) skewed. The distribution becomes
more symmetrical as p gets closer to 0.5 and as the sample size, n, gets larger.
• Formula: P(X = x | n, p) = (n choose x) · p^x · q^(n−x)
• Example: An online social networking website defines success if a web surfer stays
and views its website for more than three minutes. Suppose that the probability that
the surfer does stay for more than three minutes is 0.16. What is the probability that
at least four (either four or five) of the next five surfers will stay for more than three
minutes?
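A quick check of this example in R, using the built-in binomial functions with n = 5 and p = 0.16:
dbinom(4, size = 5, prob = 0.16) + dbinom(5, size = 5, prob = 0.16)
# equivalently: 1 - pbinom(3, size = 5, prob = 0.16), which gives about 0.0029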
Characteristics of the binomial distribution
• Mean: The sample size (n) times the probability of success (p), n × p. Sample size (n)
is the number of experimental trials.
• Variance: The product of the sample size, probability of success, and probability of
failure, n × p × q
Poisson Distribution
• The probability distribution for a discrete variable that meets these criteria:
1. You are counting the number of times a particular event occurs in a unit.
2. The probability that an event occurs in a particular unit is the same for all other
units.
3. The number of events that occur in a unit is independent of the number of events
that occur in other units.
4. As the unit gets smaller, the probability that two or more events will occur in that
unit approaches zero.
• Examples: The number of computer network failures per day, the number of
customers arriving at a bank during the 12 noon to 1 p.m. hour.
• To use the Poisson distribution, you define an area of opportunity, which is a
continuous unit of area, time, or volume in which more than one event can occur.
• The Poisson distribution can model many variables that count the number of defects
per area of opportunity or count the number of times items are processed from a
waiting line.
• Formula: P(X = x | λ) = (e^(−λ) · λ^x) / x!
• e represents the mathematical constant approximated by the value 2.71828.
• Greek symbol lambda, λ represents the mean number of times that the event occurs
per area of opportunity.
Characteristics
• The mean and the variance of the Poisson probability distribution are both equal to λ.
• Example: Determine the probabilities that a specific number of customers will arrive
at a bank branch in a one-minute interval during the lunch hour:
• You determine that you can use the Poisson distribution for the following
reasons:
• The variable is a count per unit—that is, customers per minute.
• Assume that the probability that a customer arrives during a specific
one-minute interval is the same as the probability for all the other
one-minute intervals.
• Each customer’s arrival has no effect on (is independent of) all other
arrivals.
• The probability that two or more customers will arrive in a given time
period approaches zero as the time interval decreases from one
minute.
• Using historical data, you determine that the mean number of arrivals of
customers is three per minute during the lunch hour
• The probability of zero arrivals is 0.0498.
• The probability of one arrival is 0.1494.
• The probability of two arrivals is 0.2240.
• Therefore, the probability of two or fewer customer arrivals per minute at
the bank during the lunch hour is 0.4232, the sum of the probabilities for
zero, one, and two arrivals (0.0498 + 0.1494 + 0.2240 = 0.4232).
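The same probabilities can be checked in R with the built-in Poisson functions, using λ = 3 from the example:
dpois(0, lambda = 3)   # 0.0498
dpois(1, lambda = 3)   # 0.1494
dpois(2, lambda = 3)   # 0.2240
ppois(2, lambda = 3)   # P(X <= 2) = 0.4232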
Important
• We considered the binomial probability distribution, which is used for variables that have
only two mutually exclusive events, and the Poisson probability distribution, which is
used when you are counting the number of outcomes that occur in a unit.
Continuous Probability Distribution
• The area under a curve that represents the probabilities for a continuous variable.
• Mathematical expression involves integral calculus.
NORMAL DISTRIBUTION
• The probability distribution for a continuous variable that meets these criteria:
1. The graphed curve of the distribution is bell-shaped and symmetrical.
2. The mean, median, and mode are all the same value.
3. The population mean, μ, and the population standard deviation, σ, determine
probabilities.
4. The distribution extends from negative to positive infinity. (The distribution has an
infinite range.)
5. Probabilities are always cumulative and expressed as inequalities, such as P(X < x) or
P(X ≥ x), where x is a value of the variable.
• Probabilities associated with variables as diverse as physical characteristics such as
height and weight, scores on standardized exams, and the dimension of industrial
parts, tend to follow a normal distribution.
• Under certain circumstances, the normal distribution also approximates various
discrete probability distributions, such as the binomial and Poisson distributions.
1. Convert an X value of a variable to its corresponding Z score
𝑍 = (𝑋 − 𝜇) / 𝜎
When the mean is 0 and the standard deviation is 1, the X value and Z score will be the
same, and no conversion is necessary.
2. Then we use standard normal table to find probabilities
• EXAMPLE: Packages of chocolate candies have a labeled weight of 6 ounces. In order
to ensure that very few packages have a weight below 6 ounces, the filling process
provides a mean weight above 6 ounces. In the past, the mean weight has been 6.15
ounces with a standard deviation of 0.05 ounce. Suppose you want to determine the
probability that a single package of chocolate candies will weigh between 6.15 and
6.20 ounces.
Solution
𝑍1 = (6.15 − 6.15) / 0.05 = 0 and 𝑍2 = (6.20 − 6.15) / 0.05 = 1
From the standard normal table, P(0 ≤ Z ≤ 1) = 0.8413 − 0.5000 = 0.3413, so the probability
that a single package weighs between 6.15 and 6.20 ounces is about 0.3413.
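The same probability can be checked quickly in R, using the mean and standard deviation from the example:
pnorm(6.20, mean = 6.15, sd = 0.05) - pnorm(6.15, mean = 6.15, sd = 0.05)
# equivalently, with Z scores: pnorm(1) - pnorm(0), about 0.3413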
SAMPLING and ITS METHODS
• Sampling is a technique of selecting individual members or a subset of the
population to make statistical inferences from them and estimate characteristics of
the whole population.
• Probability sampling: a sampling technique where a researcher sets a
selection of a few criteria and chooses members of a population randomly.
All the members have an equal opportunity to be a part of the sample with
this selection parameter.
• Non-probability sampling: the researcher selects members arbitrarily rather than
through a fixed or predefined selection process. This makes it difficult for all
elements of a population to have equal opportunities to be included in a sample.
Probability Sampling
• Simple random sampling: every single member of a population is chosen randomly,
merely by chance. Each individual has the same probability of being chosen to be a
part of a sample.
• Cluster sampling: the researchers divide the entire population into sections or
clusters that represent the population and then randomly select one or more clusters.
• Systematic sampling: Researchers use this to choose the sample members of a
population at regular intervals. It requires the selection of a starting point for the
sample and sample size that can be repeated at regular intervals. This type of
sampling method has a predefined range, and hence this sampling technique is the
least time-consuming.
• Stratified random sampling: the researcher divides the population into smaller
groups that don’t overlap but together represent the entire population. While sampling, these
groups can be organized and a sample drawn from each group separately.
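A rough illustration of two of these methods in base R (the population and sample size are made up):
population <- 1:1000

# Simple random sampling: each member has the same chance of selection
srs <- sample(population, size = 50)

# Systematic sampling: pick a random start, then take every k-th member
k <- length(population) / 50
start <- sample(1:k, 1)
systematic <- population[seq(start, length(population), by = k)]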
Non – Probability Sampling
• Convenience sampling: This method is dependent on the ease of access to subjects
such as surveying customers at a mall or passers-by on a busy street.
• Judgmental or purposive sampling: formed by the discretion of the researcher.
Researchers purely consider the purpose of the study, along with the understanding
of the target audience.
• Snowball sampling: a sampling method that researchers apply when the subjects are
difficult to trace. Here, researchers take the assistance of existing participants to identify
more participants. For example, it would be extremely challenging to survey homeless
people or illegal immigrants; in such cases, using the snowball approach, researchers
can track down a few initial subjects to interview and derive results from them.
• Quota sampling: The selection of members in this sampling technique happens
based on a pre-set standard.