DATA ANALYTICS FINAL

Prediction
Prediction: assigning a value to y, the target variable (or outcome variable), for one or more target observations.
Predictor: the variables (x) whose values are known and are used to predict the unknown variable (y).
Predictive data analysis: data analysis with the aim of prediction.
Original data: the data on variables y and x that is available and is used to build a model for prediction.
Live data: the data that includes the target observations, i.e., the observations for which y is to be predicted.
The original data (observations on x and y) does not include the target observations for which the prediction is to be made. The original data is used to uncover patterns of association between y and x, which are then used in making the prediction. To uncover those patterns, a model is estimated, and with the help of that model the values of y are predicted for the target observations, for which we can observe x but not y. The data that includes the target observations is called the live data. In short, patterns are uncovered in the original data and then used for making the prediction in the live data.
Function: a rule that gives a value for y if we plug in values for x. Mathematically, the patterns of association are expressed as a function.
Notation
• j: a specific target observation in the live data
• xj: the value of the x variable for target observation j in the live data
• yj: the value of y for target observation j – the value we want to predict
• ŷj: the predicted value of y for target observation j
• f: the function, also called the model (models used for prediction are also called predictive models)
• f̂: the estimated function, with specific values for its coefficients
y = β0 + β1x1 + β2x2 + ⋯
ŷ = β̂0 + β̂1x1 + β̂2x2 + ⋯
In the abstract, the linear regression of y on x1, x2, … is a model for the conditional expected value of y, and it has coefficients (β). We need estimated coefficients (β̂) and actual x values (xj) to produce a predicted value ŷj. Here the function, or model, is f(x1, x2, …) = β0 + β1x1 + β2x2 + ⋯, and its estimated version, with specific values for its coefficients, is f̂(x1, x2, …) = β̂0 + β̂1x1 + β̂2x2 + ⋯
The fundamental task of predictive analytics is finding the best model – best in the sense that it gives the best prediction in the live data, not in the original data.
Various Kinds of Prediction
• Quantitative prediction: predicting the value of a quantitative outcome y.
• Probability prediction: if y is binary (or categorical) and we want to predict the probability that y takes a specific value (for example, y = 1), the prediction is a probability.
• Classification: if, for a binary y variable, we want to predict whether the outcome is 0 or 1, not the probability, this is called classification.
Types of Prediction
With quantitative predictions and probability predictions, predicting a single value ŷj is called point prediction. An interval prediction produces a prediction interval, or PI, that tells where to find the value of yj with a certain likelihood (e.g., a 95% prediction interval or an 80% prediction interval).
The difference between the true value of a variable (yj) and its point prediction (ŷj) is the prediction error:
ej = yj − ŷj
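A minimal sketch in R of how these pieces fit together, assuming hypothetical data frames original_data (with y, x1, x2) and live_data (with x1 and x2 but no observed y):
fhat <- lm(y ~ x1 + x2, data = original_data)    # estimate f-hat on the original data
coef(fhat)                                       # the estimated coefficients (beta-hats)
yhat_live <- predict(fhat, newdata = live_data)  # point predictions yhat_j in the live data
# Prediction errors can only be computed where y is observed,
# for example on the original data (or a holdout part of it):
e <- original_data$y - predict(fhat, newdata = original_data)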
The prediction error can be decomposed into three components: estimation error, model error, and irreducible error – also called idiosyncratic error or genuine error.
The estimation error comes from the fact that, with a model f, we use our original data to find f̂. When using a linear regression for prediction, this error is due to the fact that we don't know the values of the coefficients of the regression (β), only their estimated values (β̂). This error is captured by the SE of the regression line.
The model error reflects the fact that we may not have selected the best model for our prediction. More specifically, instead of model f, there may be a better model, g, which uses the same predictor variables x or different predictor variables from the same original data.
The irreducible error is due to the fact that we cannot make a perfect prediction for yj with the help of the xj variables, even if we find the best model and can estimate it without any estimation error. If we had data with more x variables, we might make a better prediction using those additional variables; but with the variables we have, we can't do anything about this error. That is why it is called irreducible error.
To help decisions, we attach a value to the consequences of the prediction error: a loss that we incur due to decisions we make because of a bad prediction. This idea is formalized in a loss function. A loss function translates the prediction errors into a number that makes more sense for the decisions and their consequences. Typically, we have more than one target observation, and the loss function helps to rank predictions.
Symmetric loss functions attach the same value to positive and negative errors of the same magnitude. Asymmetric loss functions attach different values to errors of the same magnitude, depending on their sign.
Linearity versus convexity is about the magnitudes of error in quantitative predictions (or probability predictions). Linear loss functions attach a loss that is proportional to the magnitude of the error: an error twice as large incurs twice the loss. In contrast, convex loss functions attach a disproportionately larger loss to larger errors.
For example, if our ice cream shop has a lot of competition, we may be more worried about a damaged reputation than about incurring more costs, so our loss function would be asymmetric, with a larger loss attached to a negative prediction error than to a positive prediction error of the same size. Most likely, our loss function would also be convex in both directions: a larger prediction error would have disproportionately larger consequences than a smaller one.
Or consider the inflation forecasts used by central banks to decide on monetary policy. The loss function of central banks should describe the social costs of wrong decisions based on erroneous predictions. The larger the prediction error, the larger the chances of wrong decisions and thus the larger the social costs.
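A minimal sketch of these loss-function ideas in R; the error values and the 3:1 asymmetry weight are purely illustrative:
e <- c(-2, -1, 0, 1, 2)                               # illustrative prediction errors (y - yhat)
quadratic_loss <- e^2                                 # symmetric, convex: an error of 2 costs four times an error of 1
asymmetric_loss <- ifelse(e < 0, 3 * abs(e), abs(e))  # negative errors penalized 3x, as in the ice cream example
mean(quadratic_loss)                                  # average loss can be used to rank competing predictions
mean(asymmetric_loss)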
R Dashboard
An interactive data application, or dashboard, can help improve business performance by providing key performance indicators (KPIs) to stakeholders.
Key Benefits
There are many benefits to using dashboards to visualize data.
1) With real-time visuals on a dashboard, understanding the moving parts of a business becomes easy.
2) Based on the report type and data, you can create suitable graphs and charts in one central location, providing an easy way for stakeholders to quickly access the data and understand what is going right or wrong and what needs to be improved.
3) Seeing the big picture in one place can help businesses make informed decisions, improving performance and reducing the hours spent analyzing the data.
In general, the best dashboards answer critical business questions.
Components
A dashboard consists of four components.
1. Analyze. In this component, you manipulate and summarize data using a backend library.
2. Visualize. You create graphs using the graphing library.
3. Interact. Here, you use frontend libraries to accept user inputs.
4. Serve. This component listens to user requests and returns web pages using a web server.
R Package
Shiny is an R package for building interactive web applications that can help you with effective data storytelling. With Shiny, prototyping an application is fast because it is developed entirely in R. Using Shiny, you can easily build dashboards, host standalone apps on a webpage, or embed them in R Markdown documents.
Parts of Shiny
A Shiny app consists of two parts: the server and the user interface (UI). The server powers the app and holds the app logic. It tells the user interface which contents to show when the user interacts with the application. The server that runs the application can be hosted either on your own computer or on a remote server. The user interacts with the UI when they run the app through a web browser. The UI has a layout that describes the locations and types of input and output fields for the app. The user interacts with the input fields, which can include text, files, checkboxes, sliders, dropdowns, and more. Output fields display the results of the analysis and can include graphs, tables, text, images, and other visualizations.
For example, assume you want to know the number of data science courses released in 2021, split by programming language. You enter the year in the text input field of the UI. This information is sent as a request to the server, and the input is analyzed on the backend. Then the server returns the result to the UI, and the visualization is displayed in the corresponding output field.
Shiny is an external library, so you need to install it before you can use it. Use the install.packages() command to install the package.
install.packages("shiny")
install.packages("digest")
Now check the installation by running a demo app that is included with the package. If the installation went well, the example app starts running in a new window.
library(shiny)
runExample("01_hello")
Close the window to stop the app.
Getting into Shiny
There are two main parts that make up a Shiny application. The first part is the user interface, or UI, which is a web document that displays the application to the user. It is made up of HTML components that you create using Shiny functions. The second part is the server, where the application logic sits.
Let's create a minimal Shiny application. You can code both components in one file or in two separate files. Coding in separate files is recommended; however, you must place the ui.R and server.R files in a single folder. RStudio recognizes a Shiny application when it sees the two files together.
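A minimal sketch of what the two files might contain; the folder name, input, and output used here are illustrative, not part of the course example:
# ui.R – the last expression returns the UI
library(shiny)
fluidPage(
  titlePanel("Minimal app"),
  textInput("name", "Your name:"),   # an input field
  textOutput("greeting")             # an output field
)

# server.R – the last expression returns the server function
library(shiny)
function(input, output) {
  # fill the output field whenever the input changes
  output$greeting <- renderText(paste("Hello,", input$name))
}

# With both files saved in a folder, e.g. "minimal_app", run the app with:
# shiny::runApp("minimal_app")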
The Panel functions are used to group UI elements together into a single panel. For example, if you want to have text input and numeric input boxes on the side, you can group these elements into a sidebar panel. Some useful panel functions are absolutePanel(), conditionalPanel(), headerPanel(), inputPanel(), mainPanel(), navlistPanel(), sidebarPanel(), tabPanel(), tabsetPanel(), titlePanel(), and wellPanel(). Panel functions return HTML div tags with class attributes defined by the Bootstrap framework.
The Layout functions are used to organize panels containing UI elements in the application layout. Here are a few examples. The fluidRow() function creates rows within a fluid page layout (created with fluidPage()); rows in turn include columns. Rows ensure that their elements appear on the same line (if the browser has adequate width). Columns define how much horizontal space, within a 12-unit-wide grid, their elements should occupy. Fluid pages scale their components in real time to fill the entire browser width. The flowLayout() function places elements in a left-to-right, top-to-bottom arrangement. The sidebarLayout() function arranges elements in a layout with a sidebar and a main area. The splitLayout() function lays out elements horizontally, dividing the available horizontal space into equal parts by default. And the verticalLayout() function creates a container that includes one or more rows of content.
Control widgets are web elements that users can interact with. They allow users to send input to the server, which in turn performs the logic. As the user updates the widget values, the output corresponding to the input is updated as well. The Shiny package has numerous pre-built widget functions, which not only makes it easier to create widgets but also makes them look better. Some examples include action buttons, date input, date range input, radio buttons, single checkboxes, checkbox groups, help text, select boxes, file input, slider input, numeric input, text input, and more.
The server performs the logic whenever input widgets change and then sends the result back to the respective UI output elements. Let's say you have an application that calculates the square of a number. You will need numeric input and output UI elements. With the numeric input widget in the UI, you can send input to the server. The server will compute the square of the input and send it back to the UI output element.
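A minimal sketch of that squaring app, again split into ui.R and server.R; the element names are illustrative:
# ui.R
library(shiny)
fluidPage(
  numericInput("num", "Enter a number:", value = 2),  # control widget that sends input to the server
  textOutput("square")                                # output element that shows the result
)

# server.R
library(shiny)
function(input, output) {
  # re-computed whenever the numeric input changes
  output$square <- renderText(input$num^2)
}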
Lecture 1
Introduction
• Data – "factual information used as a basis for reasoning, discussion, or calculation" (Merriam-Webster dictionary).
• Data table – consists of observations and variables. Observations are also known as cases; variables are also called features. In the language of mathematics, a data table is also called a data matrix.
• Identifier – observations are identified by an identifier, or ID variable – either a single ID variable or a combination of multiple ID variables.
• Dataset – a broader concept that includes, potentially, multiple data tables with different kinds of information to be used in the same analysis.
• Data Structures:
1. Cross-sectional data – often abbreviated as xsec data; observations come from the same time but refer to different units.
2. Time series data – observations refer to a single unit observed multiple times; a common abbreviation is tseries data.
3. Panel data – has multiple dimensions: multiple units, each observed multiple times. It is also called longitudinal data, or cross-section time series data, abbreviated as xt data.
a) xt data is called balanced if all cross-sectional units have observations for the very same time periods.
b) It is called unbalanced if some cross-sectional units are observed more times than others.
Aggregation
1. Individual level – age, income, name, etc.
2. Firm level – sales, revenue, number of employees, etc. in a firm.
3. Industry/sector level – total exports by a specific industry, employment in a specific industry/sector.
4. Country level – GDP, unemployment rate, etc.
5. Global level – total population of the world, global emission rate, etc.
Analytics – the science of using data to build models that lead to better decisions, adding value to individuals, companies, and institutions.
• Models …?
Steps
1. Collecting data (ensure quality)
2. Understanding your data
3. Building models
4. Analysis
5. Presentation
6. Storytelling
Collection of the Data
▪ Existing sources
▪ Surveys
▪ Interviews
▪ Web scraping and application programming interfaces (APIs)
▪ Sampling: random sampling, snowball sampling, cluster sampling, etc.
Quality of the Data
▪ Content – determined by how the variable/data was measured, not by what it was meant to measure. As a consequence, just because a variable is given a particular name, it does not necessarily measure that.
▪ Validity – the content of a variable should be as close as possible to what it is meant to measure (its intended content).
▪ Reliability – measurement of a variable should be stable, leading to the same value if measured the same way again.
▪ Comparability – a variable should be measured the same way for all observations.
▪ Coverage – observations in the collected dataset should include all of those that were intended to be covered (complete coverage). In practice, they may not (incomplete coverage).
▪ Unbiased selection – if coverage is incomplete, the observations that are included should be similar to all observations that were intended to be covered (and, thus, to those that are left uncovered).
Types of Variables (in terms of measurement scale)
• Nominal variables are qualitative variables with values that cannot be unambiguously ordered. Different individual decision makers may order the options differently, as there is no universally agreed ranking of all options for these types of variables.
• Ordinal, or ordered, variables take on values that are unambiguously ordered. All quantitative variables can be ordered; some qualitative variables can be ordered, too.
• Interval variables have the property that a difference between values means the same thing regardless of the magnitudes. All quantitative variables have this property; qualitative variables do not.
• Ratio variables, also known as scale variables, are interval variables with the additional property that their ratios mean the same regardless of the magnitudes. This additional property also implies a meaningful zero on the scale. Many, but not all, quantitative variables have this property.
❖ Zero distance is unambiguous, and a 10 km run is twice as long as a 5 km run. An example of an interval variable that is not a ratio variable is temperature: 20 degrees is not twice as warm as 10 degrees, be it Celsius or Fahrenheit.
Lecture 2
• Relational data (or relational database) is the term often used for datasets with multiple linked data tables: they have various kinds of observations that are linked to each other through various relations.
• The process of pulling different variables from different data tables for well-identified entities to create a new data table is called linking, joining, merging, or matching data tables.
• One-to-one (1:1) merging: merging tables with the same type of observations.
• Many-to-one (m:1) merging: linking many observations in one table to a single observation in the other (sketched below).
• One-to-many (1:m) merging: the opposite of m:1.
• Many-to-many (m:m) merging: one observation in the first table may be matched with many observations in the second table, and an observation in the second data table may be matched with many observations in the first data table.
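A minimal sketch of an m:1 merge in base R, using two small hypothetical tables (firms and industries) invented for illustration:
firms <- data.frame(firm_id = c(1, 2, 3),
                    industry_id = c(10, 10, 20),
                    sales = c(100, 250, 80))
industries <- data.frame(industry_id = c(10, 20),
                         industry_name = c("Retail", "Manufacturing"))
# Many firms link to a single industry observation (m:1), matched on industry_id.
merged <- merge(firms, industries, by = "industry_id", all.x = TRUE)
merged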
Some Good Practices
• Understand your data and objectives well before anything else.
• Make a plan (points, flowchart, etc.) before proceeding.
• Keep the original file separate; make tidy tables of the relevant data, then create a separate workfile.
• Make a folder and keep all related files in that folder. Set your program's working directory to that folder as well.
• Keep recording your code (separately, or somehow within the program you are using).
• Checking back and forth is a good habit.
Entity Resolution
• Once you know your objectives and what each variable means, and you are ready to start working on the data, the first step is entity resolution, i.e., resolving identification issues in the data.
• Duplication: some IDs being repeated. Duplication is perfect when all duplicates carry the same information, and imperfect when different information is recorded against the same ID for the same characteristics.
• Ambiguous identification: the same entity having different IDs across different data tables.
• Non-entity observations: rows that do not belong to an entity we want in the data table. Example: a summary row in a table that adds up variables across some entities.
Missing Values – the value of a variable is not available for some observations.
1. Missing values are not always straightforward to identify, and they may be mistakenly interpreted as some valid value. An explicit NA code in the data is helpful.
2. Missing values mean fewer observations in the data with valid information, making generalization of the results questionable. The magnitude of the problem matters in two ways: what fraction of the observations is affected, and how many variables are affected.
3. Missing values may lead to incomplete coverage or selection bias. Values may be missing at random or because of a systematic reason (bias). We can detect this by benchmarking – comparing the distribution of variables that are available for all observations.
• Think of missing values of a variable y. Benchmarking involves comparing some statistics, such as the mean or median of variables x, z, …, each of which is thought to be related to variable y, in two groups: observations with missing y and observations with non-missing y. If these statistics are different, we know there is a problem.
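A minimal sketch of this benchmarking check in R, assuming a hypothetical data frame df with a variable y (containing some NAs) and related variables x and z:
missing_y <- is.na(df$y)
# Compare the mean of x and of z across the two groups;
# large differences suggest the missing values are not random.
tapply(df$x, missing_y, mean, na.rm = TRUE)
tapply(df$z, missing_y, mean, na.rm = TRUE)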
Cleaning the Data
The data cleaning process starts with the raw data and results in clean and tidy data.
1. Make sure all variables are stored in an appropriate format. Binary variables may be stored as 0 and 1, or as 1 and 2. Qualitative variables with several values may be stored as text or as numbers; good practice is to store them as numbers and label the values.
2. Cleaning variables may include slicing up text, extracting numerical information, or transforming text into numbers.
3. Identify missing values and make appropriate decisions about them.
4. Make sure that the values of each variable are within their admissible range. Values outside the range are best replaced as missing unless it is obvious what the value was meant to be.
5. Sometimes changing the units of measurement is also needed, such as converting prices to another currency or replacing very large numbers with measures in thousands or millions.
6. Variable descriptions (also called variable labels) should be prepared, showing the content of the variables with all the important details.
7. Be economical with time.
8. Keep the workflow reproducible – write down the steps and the code, and keep them updated.
Review of Basic Statistics and Mathematics
Probability
• Experiment: any activity that generates data.
• Random experiment: any experiment with more than one possible outcome, where it cannot be known for sure beforehand which outcome will occur.
• Event: an outcome of an experiment or survey.
• Elementary event: an outcome that satisfies only one criterion.
• Joint event: an outcome that satisfies two or more criteria.
• Random variable: a variable whose numerical values represent the events of a random experiment. The phrase refers to a variable that has no data values until an experimental trial is performed or a survey question is asked and answered. Random variables are either discrete, in which the possible numerical values are a set of integers (or coded values, in the case of categorical data), or continuous, in which any value is possible within a specific range.
• Probability – a number that represents the chance that a particular event will occur for a random variable:
P(A) = n(A) / n(S)
where n(A) is the number of outcomes in which event A occurs and n(S) is the total number of possible outcomes.
• Exhaustive events: a set of events that includes all the possible events. The sum of the individual probabilities associated with a set of collectively exhaustive events is always 1.
• Mutually exclusive events: two (or more) events that cannot occur at the same time. The occurrence of one automatically means that none of the others has occurred.
• Independent events: two events are independent if the occurrence of one event in no way affects the probability of the other event.
Some Basic and Simple Rules
1. The probability of an event must be between 0 and 1.
2. P(A′) = 1 − P(A)
3. If two events A and B are mutually exclusive, the probability of either event A or event B occurring is the sum of their separate probabilities:
P(A or B) = P(A ∪ B) = P(A) + P(B)
4. If the events in a set are mutually exclusive and collectively exhaustive, the sum of their probabilities must add up to 1.
5. If two events A and B are not mutually exclusive, the probability of either event A or event B occurring is the sum of their separate probabilities minus the probability of their simultaneous occurrence (the joint probability):
P(A or B) = P(A ∪ B) = P(A) + P(B) − P(A ∩ B)
6. If two events A and B are independent, the probability of both events A and B occurring is equal to the product of their individual probabilities:
P(A and B) = P(A ∩ B) = P(A) · P(B)
7. If two events A and B are not independent, the probability of both events A and B occurring is the product of the probability of event A and the probability of event B given that event A has occurred:
P(A and B) = P(A ∩ B) = P(A) · P(B | A)
Assigning Probabilities
A. Classical approach – assigning probabilities based on prior knowledge of the process involved. The probability that a particular event will occur is defined by the number of ways the event can occur divided by the total number of elementary events.
B. Empirical approach – assigning probabilities based on frequencies obtained from empirically observed data (illustrated below).
C. Subjective approach – assigning probabilities based on expert opinions or other subjective methods such as "gut" feelings or hunches.
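A minimal sketch of the empirical approach in R; the simulated purchase data here are purely illustrative:
set.seed(1)
purchase <- sample(c(0, 1), size = 200, replace = TRUE, prob = c(0.7, 0.3))  # hypothetical observed outcomes
# Empirical probabilities are the relative frequencies of the observed events.
prop.table(table(purchase))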
Probability Distribution
• Discrete probability distribution – a list of all possible distinct (elementary) events for a (discrete) variable and their probabilities of occurrence.
• Expected value of a variable – the sum of the products formed by multiplying each possible event in a discrete probability distribution by its corresponding probability:
E(X) = Σ Xi · P(Xi)
The expected value tells you the value of the variable that you could expect in the "long run" – that is, after many experimental trials.
Binomial and Poisson Probability Distributions
• The probability distributions for certain types of discrete variables can be modeled using a mathematical formula. For a discrete variable, we use either the binomial probability distribution or the Poisson probability distribution.
Binomial Distribution
• The probability distribution for a discrete variable that meets these criteria:
1. The variable is for a sample that consists of a fixed number of experimental trials (the sample size).
2. The variable has only two mutually exclusive and collectively exhaustive events, typically labeled as success and failure.
3. The probability of an event being classified as a success, p, and the probability of an event being classified as a failure, q = 1 − p, are both constant in all experimental trials.
4. The event (success or failure) of any single experimental trial is independent of (not influenced by) the event of any other trial.
• Using the binomial distribution prevents you from having to develop the probability distribution by using a table of outcomes and applying the multiplication rule.
• Whenever p = 0.5, the binomial distribution is symmetrical, regardless of how large or small the sample size. If p < 0.5, the distribution is positive, or right-skewed; if p > 0.5, it is negative, or left-skewed. The distribution becomes more symmetrical as p gets close to 0.5 and as the sample size, n, gets large.
• Formula:
P(X = x | n, p) = (n choose x) · p^x · q^(n−x)
where (n choose x) = n! / (x!(n−x)!) is the number of ways to obtain x successes in n trials.
• Example: An online social networking website defines success if a web surfer stays and views its website for more than three minutes. Suppose that the probability that a surfer stays for more than three minutes is 0.16. What is the probability that at least four (either four or five) of the next five surfers will stay for more than three minutes? (A worked R version appears after this list.)
Characteristics of the binomial distribution
• Mean: the sample size (n) times the probability of success (p), n × p. The sample size (n) is the number of experimental trials.
• Variance: the product of the sample size, the probability of success, and the probability of failure, n × p × q.
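The surfer example worked in R with the built-in binomial functions (n = 5, p = 0.16):
p_four <- dbinom(4, size = 5, prob = 0.16)    # exactly four successes
p_five <- dbinom(5, size = 5, prob = 0.16)    # exactly five successes
p_four + p_five                               # at least four successes, about 0.0029
pbinom(3, size = 5, prob = 0.16, lower.tail = FALSE)  # the same upper-tail probability directly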
Poisson Distribution
• The probability distribution for a discrete variable that meets these criteria:
1. You are counting the number of times a particular event occurs in a unit.
2. The probability that an event occurs in a particular unit is the same for all other units.
3. The number of events that occur in a unit is independent of the number of events that occur in other units.
4. As the unit gets smaller, the probability that two or more events will occur in that unit approaches zero.
• Examples: the number of computer network failures per day, the number of customers arriving at a bank during the 12 noon to 1 p.m. hour.
• To use the Poisson distribution, you define an area of opportunity, which is a continuous unit of area, time, or volume in which more than one event can occur.
• The Poisson distribution can model many variables that count the number of defects per area of opportunity or count the number of times items are processed from a waiting line.
• Formula:
P(X = x | λ) = (e^(−λ) · λ^x) / x!
• e represents the mathematical constant approximated by the value 2.71828, and the Greek symbol lambda, λ, represents the mean number of times that the event occurs per area of opportunity.
Characteristics
• The mean and the variance of the Poisson probability distribution are both λ.
• Example: Determine the probabilities that a specific number of customers will arrive at a bank branch in a one-minute interval during the lunch hour. You can use the Poisson distribution for the following reasons:
• The variable is a count per unit – that is, customers per minute.
• Assume that the probability that a customer arrives during a specific one-minute interval is the same as the probability for all other one-minute intervals.
• Each customer's arrival has no effect on (is independent of) all other arrivals.
• The probability that two or more customers will arrive in a given time period approaches zero as the time interval decreases from one minute.
Using historical data, you determine that the mean number of customer arrivals is three per minute during the lunch hour. The probability of zero arrivals is 0.0498, the probability of one arrival is 0.1494, and the probability of two arrivals is 0.2240. Therefore, the probability of two or fewer customer arrivals per minute at the bank during the lunch hour is 0.4232, the sum of the probabilities for zero, one, and two arrivals (0.0498 + 0.1494 + 0.2240 = 0.4232).
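The same bank-arrival example worked in R with the built-in Poisson functions (λ = 3 customers per minute):
dpois(0:2, lambda = 3)        # P(X = 0), P(X = 1), P(X = 2): about 0.0498, 0.1494, 0.2240
sum(dpois(0:2, lambda = 3))   # P(X <= 2), about 0.4232
ppois(2, lambda = 3)          # the same cumulative probability directly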
Important
• We use the binomial probability distribution for variables that have only two mutually exclusive events, and the Poisson probability distribution when counting the number of outcomes that occur in a unit.
Continuous Probability Distribution
• The area under a curve represents the probabilities for a continuous variable.
• The mathematical expression involves integral calculus.
Normal Distribution
• The probability distribution for a continuous variable that meets these criteria:
1. The graphed curve of the distribution is bell-shaped and symmetrical.
2. The mean, median, and mode are all the same value.
3. The population mean, μ, and the population standard deviation, σ, determine probabilities.
4. The distribution extends from negative to positive infinity (it has an infinite range).
5. Probabilities are always cumulative and expressed as inequalities, such as P(X < x) or P(X ≥ x), where x is a value of the variable.
• Probabilities associated with variables as diverse as physical characteristics such as height and weight, scores on standardized exams, and the dimensions of industrial parts tend to follow a normal distribution.
• Under certain circumstances, the normal distribution also approximates various discrete probability distributions, such as the binomial and Poisson distributions.
To find normal probabilities:
1. Convert an X value of the variable to its corresponding Z score:
Z = (X − μ) / σ
When the mean is 0 and the standard deviation is 1, the X value and the Z score are the same, and no conversion is necessary.
2. Then use the standard normal table to find probabilities.
• Example: Packages of chocolate candies have a labeled weight of 6 ounces. In order to ensure that very few packages have a weight below 6 ounces, the filling process provides a mean weight above 6 ounces. In the past, the mean weight has been 6.15 ounces with a standard deviation of 0.05 ounce. Suppose you want to determine the probability that a single package of chocolate candies will weigh between 6.15 and 6.20 ounces.
Solution:
Z1 = (6.15 − 6.15) / 0.05 = 0 and Z2 = (6.20 − 6.15) / 0.05 = 1
From the standard normal table, P(0 ≤ Z ≤ 1) = 0.3413, so the probability is about 0.34.
Sampling and Its Methods
• Sampling is a technique of selecting individual members or a subset of the population in order to make statistical inferences from them and estimate characteristics of the whole population.
• Probability sampling: a sampling technique in which the researcher sets selection criteria and chooses members of the population at random, so that all members have an equal opportunity to be part of the sample.
• Non-probability sampling: the researcher chooses members for the research arbitrarily rather than randomly; there is no fixed or predefined selection process. This makes it difficult for all elements of the population to have an equal opportunity to be included in the sample.
Probability Sampling
• Simple random sampling: every single member of the population is chosen randomly, merely by chance. Each individual has the same probability of being chosen to be part of the sample (see the R sketch at the end of this section).
• Cluster sampling: the researchers divide the entire population into sections or clusters that represent the population and then pick clusters at random.
• Systematic sampling: researchers use this to choose the sample members of a population at regular intervals. It requires selecting a starting point and a fixed interval at which members are then selected. This type of sampling method has a predefined range, and hence it is the least time-consuming.
• Stratified random sampling: the researcher divides the population into smaller groups (strata) that don't overlap but together represent the entire population. These groups can be organized, and a sample is then drawn from each group separately.
Non-Probability Sampling
• Convenience sampling: this method depends on the ease of access to subjects, such as surveying customers at a mall or passers-by on a busy street.
• Judgmental or purposive sampling: formed at the discretion of the researcher, who considers the purpose of the study along with an understanding of the target audience.
• Snowball sampling: a sampling method researchers apply when the subjects are difficult to trace; participants help identify further participants. For example, it would be extremely challenging to survey homeless people or illegal immigrants. In such cases, using the snowball method, researchers can track down a few participants to interview and ask them to refer others.
• Quota sampling: the selection of members in this sampling technique happens based on a pre-set standard (quota).
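A minimal sketch in R of simple random and stratified random sampling, using a small hypothetical population invented for illustration:
set.seed(42)
population <- data.frame(id = 1:1000,
                         region = sample(c("North", "South", "East", "West"), 1000, replace = TRUE))
# Simple random sampling: every member has the same chance of selection.
srs <- population[sample(nrow(population), size = 100), ]
# Stratified random sampling: draw 25 members separately from each region (stratum).
strata <- split(population, population$region)
stratified <- do.call(rbind, lapply(strata, function(s) s[sample(nrow(s), 25), ]))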