ENGINEERING DATA ANALYSIS
MODULE 1.1: Introduction to Statistics and Data Analysis

DEFINITION
In the plural sense, STATISTICS means a set of numerical facts and figures. We speak, for example, of "statistics on birth, statistics on crime, statistics on unemployment, or statistics on marriage."
In the singular sense, STATISTICS is a body of knowledge concerned with the collection, presentation, analysis, and interpretation of data.

CATEGORIES
Descriptive Statistics: deals with the methods of organizing, summarizing, and presenting a mass of data.
Inferential Statistics: concerned with making generalizations about a body of data when only a part of it is examined.

TERMS
1. Universe: the set of all the individuals or entities under consideration
2. Measurement: the assignment of numbers to objects, events, or observations according to logically accepted rules
3. Random Variable: a characteristic of interest measurable on each and every individual of the universe
4. Quantitative Variable: a variable which may take on numerical values; observations vary in degree
5. Qualitative Variable: a variable which takes on nonnumerical values; observations vary in kind
6. Discrete Variable: observations take a finite (or countable) number of values
7. Continuous Variable: observations can take infinitely many values; a variable which can assume any value in a given interval of values
8. Constant: observations do not vary
9. Population: the set of all possible values of a variable
10. Sample: a subset of the population
11. Parameter: a numerical characteristic of a population
12. Statistic: a quantity calculated from the observations in a sample

SCALES OF MEASUREMENT
1. Nominal
2. Ordinal
3. Interval
4. Ratio

Sampling methods often use what are called random numbers to select samples. Once the research problem and design have been identified, the next step is to collect information.
SAMPLING PROCEDURES AND COLLECTION OF DATA
There are two types of sampling techniques:
1. Probability Sampling Technique
2. Non-Probability Sampling Technique

SIMPLE RANDOM SAMPLING
In the simple random sampling technique, every item in the population has an equal chance of being selected in the sample. Since the selection of items depends entirely on chance, this method is known as the "Method of Chance Selection". When the sample size is large and the items are chosen randomly, the sample tends to reflect the population, so the method is also known as "Representative Sampling".
Example: Suppose we want to select a simple random sample of 200 students from a school of 500 students. We can assign a number from 1 to 500 to every student in the school database and use a random number generator to select a sample of 200 numbers.

SYSTEMATIC SAMPLING
In the systematic sampling method, the items are selected from the target population by choosing a random starting point and then selecting the other items at a fixed sampling interval. The interval is calculated by dividing the total population size by the desired sample size.
Example: Suppose the names of 300 students of a school are sorted in reverse alphabetical order and we want a sample of 20 students, so the sampling interval is 300/20 = 15. We randomly select a starting number between 1 and 15, say 5. From number 5 onwards, we select every 15th person from the sorted list (positions 5, 20, 35, ..., 290), which gives a sample of 20 students.

STRATIFIED SAMPLING
In the stratified sampling method, the total population is divided into smaller groups (strata) to complete the sampling process. The groups are formed based on a few characteristics of the population. After separating the population into these smaller groups, the statistician randomly selects a sample from each group.
For example, there are three bags (A, B and C), each with different balls. Bag A has 50 balls, bag B has 100 balls, and bag C has 200 balls.
We have to choose a sample of balls from each bag proportionally. Taking 10% from each bag, for instance, gives 5 balls from bag A, 10 balls from bag B and 20 balls from bag C.

CLUSTERED SAMPLING
In the clustered sampling method, clusters or groups of people are formed from the population set. The groups have similar characteristics, and each group has an equal chance of being part of the sample. This method applies simple random sampling to the clusters of the population.
Example: An educational institution has ten branches across the country with almost the same number of students. If we want to collect some data regarding facilities and other things, we cannot travel to every branch to collect the required data. Hence, we can use random sampling to select three or four branches as clusters.

All four of these methods can be understood better with the help of the figure given below. The figure contains various examples of how samples are taken from the population using the different techniques.

NON-PROBABILITY SAMPLING TECHNIQUE
In non-probability sampling, not every member of the population has an equal chance of being selected. Selection can rely on the subjective judgment of the researcher. This method of sampling is resorted to when it is difficult to estimate the population of the study because its members are moving or in transition in a given location. It is also useful in exploratory or descriptive studies with a qualitative implication. Its purpose is to characterize the direction of the study, rather than to qualify it. Thus, the respondents are chosen on the basis of specific criteria formulated by the researcher, rather than randomly selected.

ACCIDENTAL OR CONVENIENCE SAMPLING
This method is implemented by seeking out elements that are readily available to respond to a question. In other words, the first person who comes along and typifies the candidate serves as the respondent of the study.
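The probability sampling examples given earlier in this module (simple random, systematic, and stratified) can be sketched in a few lines of Python. This is only an illustrative sketch: the function names, the use of integer IDs in place of student records, and the 10% sampling fraction for the bags are assumptions made for the example, not part of the module.

```python
import random


def simple_random_sample(population, n, seed=None):
    """Draw n distinct items so that every item has an equal chance."""
    rng = random.Random(seed)  # seed is only for reproducibility
    return rng.sample(population, n)


def systematic_sample(ordered_population, interval, start):
    """Take the item at 1-based position `start`, then every `interval`-th item."""
    return ordered_population[start - 1::interval]


def proportional_allocation(stratum_sizes, fraction):
    """Number of items to draw from each stratum for a common sampling fraction."""
    return {name: round(size * fraction) for name, size in stratum_sizes.items()}


# Simple random sampling: 200 of 500 numbered students (IDs are placeholders).
students = list(range(1, 501))
srs = simple_random_sample(students, 200, seed=0)

# Systematic sampling: 300 sorted students, start at 5, every 15th afterwards.
sorted_students = list(range(1, 301))
sys_sample = systematic_sample(sorted_students, interval=15, start=5)

# Stratified sampling: bags A, B, C with an assumed 10% sampling fraction.
bags = {"A": 50, "B": 100, "C": 200}
allocation = proportional_allocation(bags, 0.10)

print(len(srs), sys_sample[:3], len(sys_sample), allocation)
```

Starting at position 5 and stepping by 15 in a list of 300 reaches positions 5, 20, 35, and so on up to 290, so the systematic sample here contains 20 students.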
PURPOSIVE SAMPLING
Purposive sampling is a useful methodology in qualitative or exploratory studies, since the qualities of the key person or informant are identified by the researcher. The purpose is to get information from respondents who are involved in the situation.

QUOTA SAMPLING
Quota sampling entails grouping elements according to certain characteristics and ensuring that each group is represented. It is similar to stratified sampling, minus the randomization.

SNOWBALL OR REFERRAL SAMPLING
This involves having respondents refer other people who are in a position to answer some of the questions of the researcher. This sampling technique is useful for highly sensitive topics where the identity of respondents is difficult to divulge. Hence, for topics that are highly confidential, a referral system is appropriate.

ENGINEERING DATA ANALYSIS
MODULE 1.2: Measures of Central Tendency and Dispersion

MEASURES OF CENTRAL TENDENCY: UNGROUPED DATA
Measure of Central Tendency or Location: a single value about which the set of observations tends to cluster.

Arithmetic Mean: the sum of all the observations divided by the total number of observations; denoted as $\mu$ (Greek letter mu).

$$\mu = \frac{\sum_{i=1}^{N} X_i}{N}$$

where
$X_i$ is the value of the ith observation, i = 1, ..., N
$N$ is the total number of observations

Median: a single value which divides an array (a data set arranged in ascending or descending order) of observations into two equal parts such that 50% of the observations fall below it and 50% of the observations fall above it; denoted as Md.
If N (the number of observations) is odd, the median is the middle value of the array.
If N is even, the median is the mean of the two middle values of the array.

Mode: the value which occurs most frequently in the data set; denoted as Mo.

Geometric Mean: the Nth root of the product of N positive numbers, $\sqrt[N]{X_1 X_2 \cdots X_N}$.
- used mainly to average ratios, rates of change, economic indices, etc.
- in practice, geometric means are calculated by making use of the fact that the logarithm of the geometric mean of a set of positive numbers equals the arithmetic mean of their logarithms.

Comparison among measures of central tendency

MEAN
• Reflects the magnitude of each observation
• Easily affected by the presence of extreme values
• Most commonly used measure of central tendency (mct) because of its good statistical properties
• Most meaningful mct when there are no extreme values

MEDIAN
• It is a positional value and hence is not affected by the presence of extreme values (suggested mct when there are a few extreme values)
• The median of grouped data can be calculated even with open-ended intervals, provided the median class is not open-ended

MODE
• Determined by the frequency and not by the values of the observations
• Used when a quick measure of location is needed
• Cannot be manipulated algebraically
• Can be defined for quantitative as well as qualitative random variables
• Very much affected by the method of grouping data
• Can be computed with open-ended intervals, provided the modal class is not open-ended

MEASURES OF CENTRAL TENDENCY: GROUPED DATA

MEDIAN OF GROUPED DATA
In grouped data, the median cannot be read off simply by looking at the cumulative frequencies, because the middle value of the data lies somewhere inside a class interval. So, it is necessary to find the value inside the class interval that divides the whole distribution into two halves.
To do this, we must first find the median class. Compute the cumulative frequencies of all the classes and the value n/2, then locate the first class whose cumulative frequency is greater than or equal to n/2. This class is called the median class.

After finding the median class, use the formula below to find the median value:

$$\text{Median} = l + \left(\frac{\frac{n}{2} - cf}{f}\right) h$$

where
l is the lower limit of the median class
n is the number of observations
f is the frequency of the median class
h is the class size
cf is the cumulative frequency of the class preceding the median class

EXAMPLE
The following data represent a survey of the heights (in cm) of 51 girls of Class X. Find the median height.
Answer: Median = 149.03

MODE OF GROUPED DATA
In the case of grouped data, it is not possible to identify the mode just by looking at the frequencies. Instead, we determine the mode by locating the class with the maximum frequency, called the modal class. Inside the modal class, we locate the mode value using the formula

$$\text{Mode} = l + \left(\frac{f_1 - f_0}{2f_1 - f_0 - f_2}\right) h$$

where
$f_1$ is the frequency of the modal class
$f_0$ is the frequency of the class preceding the modal class
$f_2$ is the frequency of the class succeeding the modal class
h is the size of the class intervals
l is the lower limit of the modal class

EXAMPLE
A survey was conducted by a group of students on 20 households in a locality, as shown in the following frequency distribution table. Find the mode for the given data.
Answer: Mode = 3.286

MEASURES OF VARIATION
Measure of Dispersion: a quantity that measures the spread or variability of the observations in a given population.

Illustration:
Data Set 1: 3, 3, 3, 3, 3
Data Set 2: 1, 2, 3, 4, 5
Data Set 3: 2, 2, 3, 4, 4
All three data sets have a mean equal to 3, yet they are not identical. There is a need for another quantity to measure the spread of the values in a given population.
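The illustration above can be checked numerically. The short Python sketch below computes the mean of each of the three data sets together with three spread measures that this module defines next (range, population variance, and standard deviation). The helper names are chosen for this example only.

```python
from math import sqrt


def mean(values):
    """Arithmetic mean: sum of the observations divided by their count."""
    return sum(values) / len(values)


def population_variance(values):
    """Mean of the squared deviations of the observations from the mean."""
    mu = mean(values)
    return sum((x - mu) ** 2 for x in values) / len(values)


data_sets = {
    "Data Set 1": [3, 3, 3, 3, 3],
    "Data Set 2": [1, 2, 3, 4, 5],
    "Data Set 3": [2, 2, 3, 4, 4],
}

summary = {}
for name, values in data_sets.items():
    variance = population_variance(values)
    summary[name] = {
        "mean": mean(values),
        "range": max(values) - min(values),
        "variance": variance,
        "std_dev": sqrt(variance),
    }
    print(name, summary[name])
```

All three means come out equal to 3, while the ranges (0, 4, 2) and the variances (0, 2, 0.8) differ, which is exactly the information the mean alone does not carry.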
Some common measures of dispersion:
1. Range
2. Variance
3. Standard Deviation
4. Coefficient of Variation

Range: the difference between the highest value and the lowest value of the population.
Example. The range of the actual body weight values is 46.8 - 8.00 = 38.8.
Properties:
1. It is a quick but rough measure of dispersion.
2. The larger the value of the range, the more dispersed are the observations.
3. It considers only the highest and lowest observations in the population. Hence, it may not be reflective of the dispersion characteristic of the majority.

Variance: the mean of the squared deviations of the observations from the mean, denoted by $\sigma^2$.

$$\sigma^2 = \frac{\sum_{i=1}^{N} (X_i - \mu)^2}{N} = \frac{\sum_{i=1}^{N} X_i^2}{N} - \mu^2$$

Properties:
1. The variance is always non-negative.
2. A large variance corresponds to a highly dispersed set of values.
3. The variance is easy to manipulate for further mathematical treatment.
4. The variance makes use of all observations.
5. The variance comes in a unit of measure that is the square of the unit of measure of the given set of values.

Standard deviation: the positive square root of the variance. That is, $\sigma = \sqrt{\sigma^2}$.
Properties: The standard deviation has the same set of properties as the variance, except that its unit of measurement is the same as the unit of measurement of the observations.
Example. For the actual body weight of sheep, the standard deviation is $\sigma = \sqrt{66.3533917} = 8.145759091$.
Remark: The standard deviation, coupled with the arithmetic mean, gives a lot of information about the distribution of a given population.

Interquartile range: the difference between the third and the first quartiles of a set of data. It is denoted by IR, so that $IR = Q_3 - Q_1$. It provides a measure of the range of the middle 50% of the observations.
Quartiles are values from a given array of data which divide the array into four equal parts.
The First Quartile, denoted by $Q_1$, is the value for which 25% of the observations are less than $Q_1$ and 75% are greater than it.
The Third Quartile, denoted by $Q_3$, is the value for which 75% of the observations are less than $Q_3$ and 25% are greater than it.

The Empirical Rule states that if the distribution of the data values appears to be mound-shaped or bell-shaped with mean $\mu$ and standard deviation $\sigma$, then approximately
a) 68% of the population values lie between $\mu - \sigma$ and $\mu + \sigma$
b) 95% of the population values lie between $\mu - 2\sigma$ and $\mu + 2\sigma$
c) 99.7% of the population values lie between $\mu - 3\sigma$ and $\mu + 3\sigma$

A Russian mathematician named Chebyshev showed that, for any distribution, at least $1 - 1/k^2$ of the observations fall within k standard deviations from the mean (k > 1). In particular:
• At least 75% of the observations fall within 2 standard deviations from the mean
• At least 89% of the observations fall within 3 standard deviations from the mean
• At least 94% of the observations fall within 4 standard deviations from the mean

Coefficient of Variation: the ratio of the standard deviation to the mean; denoted as CV.

$$CV = \frac{\sigma}{\mu}, \quad \text{provided } \mu \neq 0$$

Properties:
1. CV can be expressed as a decimal or as a percentage.
2. CV is a relative measure of dispersion.
3. The CV, being unitless, can be used to compare the dispersion of two or more populations measured in different units.

GRAPHICAL SUMMARY

Stem-and-Leaf Plots
A stem-and-leaf plot is a simple way to summarize a data set. Figure 1.5 presents a stem-and-leaf plot of the geyser data. Each item in the sample is divided into two parts: a stem, consisting of the leftmost one or two digits, and a leaf, which consists of the next digit.

Dotplots
A dotplot is a graph that can be used to give a rough impression of the shape of a sample. It is useful when the sample size is not too large and when the sample contains some repeated values. Figure 1.7 presents a dotplot for the geyser data in Table 1.3.

Histograms
A histogram is a graphic that gives an idea of the "shape" of a sample, indicating regions where sample points are concentrated and regions where they are sparse.
We will construct a histogram for the PM emissions of 62 vehicles driven at high altitude, as presented in Table 1.2. The sample values range from a low of 1.11 to a high of 23.38, in units of grams of emissions per gallon of fuel. The first step is to construct a frequency table, shown in Table 1.4.

Histogram for the data in Table 1.4.

In this histogram the heights of the rectangles are the relative frequencies. Since the class widths are all the same, the frequencies, relative frequencies, and densities are proportional to one another, so it would have been equally appropriate to set the heights equal to the frequencies or to the densities.
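The relationship between frequencies, relative frequencies, and densities can be sketched with a small frequency-table routine. Since Table 1.2 and Table 1.4 are not reproduced in this module, the values below are hypothetical, and the class layout (six classes of width 2 starting at 1) is likewise an assumption made only for illustration.

```python
def frequency_table(values, start, width, n_classes):
    """Build a frequency table with equal-width classes [lower, lower + width).

    Values outside the covered range are ignored and do not count
    toward the total used for the relative frequencies.
    """
    counts = [0] * n_classes
    for v in values:
        k = int((v - start) // width)  # index of the class containing v
        if 0 <= k < n_classes:
            counts[k] += 1
    total = sum(counts)
    rows = []
    for k, count in enumerate(counts):
        lower = start + k * width
        relative = count / total
        rows.append({
            "interval": (lower, lower + width),
            "frequency": count,
            "relative_frequency": relative,
            "density": relative / width,  # relative frequency per unit width
        })
    return rows


# Hypothetical sample, NOT the emissions data from Table 1.2.
sample = [1.5, 2.2, 3.1, 3.8, 4.4, 5.0, 5.9, 7.3, 8.6, 12.4]
table = frequency_table(sample, start=1, width=2, n_classes=6)

for row in table:
    print(row)
```

Because every class has the same width, each density is the relative frequency divided by the same constant, and each relative frequency is the frequency divided by the same total. The three sets of heights therefore differ only by a scale factor, and the histogram has the same shape whichever one is plotted.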