STATISTICS
• Branch of mathematics that deals with systematic methods of collecting, classifying, presenting, analyzing, and interpreting quantitative data

DIVISION OF STATISTICS
• DESCRIPTIVE
  • Summarizes and describes the group characteristics of data
• INFERENTIAL
  • Draws conclusions or judgments about a population based on a representative sample

POPULATION
• Consists of the totality of the observations with which one is concerned

SAMPLE
• A subset of objects or observations taken from a population

VARIABLE
• A characteristic or piece of information of interest that is observable or measurable for every individual or object under consideration

Types of Variables
• Qualitative or Categorical Variable
• Quantitative or Numerical Variable

Types of Quantitative Variables
• Discrete Quantitative
• Continuous Quantitative

Levels of Measurement of Variables
• Nominal Level (Classificatory Scale): The lowest level of measurement. It simply labels, names, or categorizes without any implicit or explicit ordering of the categories.
• Ordinal Level (Ranking Scale): Labels or classes with an implied ordering among them.
• Interval Level: The unit of measurement is arbitrary and there is no “true zero” point.
• Ratio Level: Contains all the properties of the interval level and, in addition, has a “true zero” point.

STEPS IN STATISTICAL INQUIRY
• Collection of Data
• Processing of Data
• Presentation of Data
• Analysis of Data
• Interpretation of Data

TYPES OF DATA
• INTERNAL DATA
  • The company’s own data
• EXTERNAL DATA
  • Data from outside sources

METHODS OF DATA COLLECTION
• Interview or direct method
• Questionnaire or indirect method
• Registration method (e.g. NSO)
• Observation
• Experimentation

SAMPLING TECHNIQUES
One of the parts of research work that most needs preparation and planning is choosing an appropriate sampling method.
• Random Sampling: A recommended process to prevent the possibility of a biased or erroneous inference.
  Under the concept of randomness, each member of the population has an equal chance of being included in the sample.
• Stratified Random Sampling: The population is divided into categories or strata, and members are drawn at random from each stratum (subgroup) in proportion to its size.
• Systematic Random Sampling: Selects every nth element in the population until the desired sample size is reached.
• Cluster Sampling: An advantageous procedure when the population is spread over a wide geographical area. A CLUSTER is an intact group whose members share common characteristics.
• Multistage Sampling: A more complex sampling technique that includes the following steps:
  a) Divide the population into strata.
  b) Divide each stratum into clusters.
  c) Draw a sample from each cluster using simple random sampling.

PROCESSING OF DATA
• EDITING: detecting errors
• CODING: assigning numerals and other symbols to the data so that they can be grouped
• CLASSIFYING: sorting and grouping

PRESENTATION OF DATA
• Textual
• Tabular
• Graphic
  • Bar
  • Line
  • Pie Chart
  • Scatter
  • Pictograph

OTHER TERMS
• VARIABLE: a fundamental quantity that changes
• DISCRETE VARIABLE: takes no in-between values
• CONTINUOUS VARIABLE: can take in-between values
• CONSTANT: does not change

HOW TO ORGANIZE DATA
Frequency Distribution Table: the organization of raw data in table form.

Consider the midyear scores of 45 students in Statistics:

29 27 28 27 34 29 27 27 28
25 23 35 25 29 33 23 27 33
27 22 40 27 21 29 22 25 29
25 21 20 21 23 25 30 20 28
30 29 28 30 27 27 27 19 30

Steps in Constructing a Frequency Distribution Table
• Find the range r. The range is the difference between the highest score and the lowest score.
• Decide on the number of classes. A class is a grouping or category. The ideal number of classes is between 5 and 15.
• Determine the class interval i. The class interval, or simply the interval, is the size of each class.
• Determine the classes, starting with the lowest class.
• Determine the class frequency (f) for each class by counting the tally. The tally column is optional.

The following numerical values are relevant in dealing with a frequency distribution:
1. Class mark. It is the middle value in a class.
2. Class boundaries. They are often described as the true limits. The lower boundary of a class is 0.5 less than its lower limit, and the upper boundary is 0.5 more than its upper limit.
3. Cumulative frequency. It is found by adding the frequencies, starting from the lowest class.

Grouped Frequency Distribution

Class Limits   Class Boundaries   Class Mark (X)   Tally        Frequency   Cumulative Frequency
24 – 30        23.5 – 30.5        27               III          3           3
31 – 37        30.5 – 37.5        34               I            1           4
38 – 44        37.5 – 44.5        41               IIIII        5           9
45 – 51        44.5 – 51.5        48               IIIII IIII   9           18
52 – 58        51.5 – 58.5        55               IIIII I      6           24
59 – 65        58.5 – 65.5        62               I            1           25
                                                   Total = 25

Example 1: These data represent the record high temperatures in °F for each of the 50 states.

112 100 127 120 134 118 105 110 109 112
110 118 117 116 118 122 114 114 105 109
107 112 114 115 118 117 118 122 106 110
116 108 110 121 113 120 119 111 104 111
120 113 120 117 105 110 118 112 114 114

Construct a grouped frequency distribution for the data using 7 classes.

Example 2: Statistics Test Scores of 50 Students

88 85 60 75 63 55 78 90 86 40
62 83 46 78 90 62 40 47 55 52
63 76 85 87 63 62 51 48 76 72
88 72 71 70 60 83 56 54 52 43
65 63 67 42 73 79 80 77 76 60

Construct the GFD (grouped frequency distribution) for the Statistics Test Scores with 11 classes.
Using the data:
1. Construct a frequency distribution with 11 classes.
2. Construct a histogram, a frequency polygon, and an ogive from the data.

GRAPHICAL PRESENTATION OF DATA
• A histogram is a bar-graph-like representation of a frequency distribution. The rectangular bars are drawn without spaces between them. The height of each bar corresponds to the frequency of the class, and the width corresponds to the class interval (each bar spans its class boundaries and is centered on the class mark).
• A well-balanced histogram should have a height of 60%, 67%, or 75% of its width.
• A frequency polygon is a line graph in which the frequency of each class is plotted against the corresponding class mark.
• An ogive (pronounced "OH-jive") is a line graph in which the cumulative frequency of each class is plotted against the corresponding upper class boundary.

[Figure: Cumulative Frequency Graph (ogive)]

Exercises #1:
I. Classify the following according to the scale of measurement. Write N if your answer is nominal, O if ordinal, I if interval, or R if ratio.
______ 1. Newborns arranged according to gender.
______ 2. Banking hours of the different types of banks in Metro Manila.
______ 3. Peso-dollar exchange rate.
______ 4. Temperature range of patients afflicted with pneumonia.
______ 5. Lead content in toys manufactured in the Philippines.

II. Classify the following as descriptive statistics or inferential statistics. Write D if your answer is descriptive and I if your answer is inferential.
______ 1. The time it takes a shipment of perishable goods to reach its destination.
______ 2. The number of times the peso-dollar rate fluctuates during the week.
______ 3. Based on the medical record of the patient, the patient has a high sugar level considered to be critical.
______ 4. The farm produce in Baguio.
______ 5. For the past month, there was an increased number of cases of cholera in hog farms in Bulacan.

III. Statistics Test Scores of 50 Students

88 85 60 75 63 55 78 90 86 40
62 83 46 78 90 62 40 47 55 52
63 76 85 87 63 62 51 48 76 72
88 72 71 70 60 83 56 54 52 43
65 63 67 42 73 79 80 77 76 60

Construct the GFD for the Statistics Test Scores with 11 classes.

Example 1: The following data represent the weekly savings of employees in a manufacturing company.

49 57 54 84 82 52 91 29 43 67
47 67 38 38 65 50 16 78 18 65
56 58 35 35 48 71 59 39 73 71
26 24 34 46 57 52 39 34 61 63
39 28 42 85 61 25  9 44 46 29

Using the data:
1. Construct a frequency distribution with 11 classes.
2. Construct a histogram and a frequency polygon from the data.
3. Construct a frequency distribution using 9 class intervals.
4. Construct a histogram and a frequency polygon from the data.

MEASURES OF CENTRAL TENDENCY
• A measure of central tendency is a statistic that serves as a representative of the data under investigation.
• It tends to lie at the center of the data set.
• There are three measures of central tendency: the mean, the median, and the mode.

The Mean ($\bar{x}$)
• It is the most important, the most useful, and the most widely used measure of central tendency.
• It is the sum of all the given values or items in a distribution divided by the number of values or items summed.
• The mean has both uses and limitations.

The Mean is Used
• for interval and ratio measurements;
• if higher statistical computations are wanted;
• if there are no extreme values in the distribution, since the mean is easily affected by extremely low or extremely high scores; that is, when the distribution is approximately normal;
• when greater reliability of the measure of central tendency is wanted, since its computation includes all the given values.

The Limitations of the Mean
• It is the most widely used average because it is the most familiar; however, it is often misused. It cannot be used if the values or items do not cluster substantially. An example is representing the values 10 and 100, which are far apart.
• When the given values do not tend to cluster around a central value, the mean is a poor measure of central location.
• It is easily affected by extremely large or small values. One small value can easily pull down the mean.
• The mean alone cannot be used to compare distributions, since the means of two or more distributions may be the same while their characteristics are entirely different. The means of distribution A, whose values are 80, 85, and 90, and distribution B, whose values are 86, 85, and 84, are both 85.
  However, we cannot conclude that both distributions possess the same characteristics, since their patterns of dispersion or variation are markedly different despite having the same mean.

The formulas for computing the mean are:

• Ungrouped Data

\[ \bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} \]

  where:
  • $\bar{x}$ is the mean
  • $x_i$ stands for the values or items
  • $n$ is the number of respondents

• Grouped Data: the midpoint formula

\[ \bar{x} = \frac{\sum X_i f_i}{n} \]

  where:
  • $\bar{x}$ is the mean
  • $X_i f_i$ is the product of the class mark and the frequency of each class
  • $n$ is the number of respondents

• The mean for grouped data can also be computed using the CODED FORMULA:

\[ \bar{x} = AM + \left(\frac{\sum x_i f_i}{n}\right) i \]

  where:
  • AM is the assumed mean
  • $x_i$ is the deviation of each class from the assumed mean, in units of the class size
  • $i$ (the multiplier outside the sum) is the class size
  • $n$ is the number of cases

Example: Compute the mean using the two formulas.

Class Interval   f
90 – 94          2
85 – 89          6
80 – 84          3
75 – 79          8
70 – 74          5
65 – 69          2
60 – 64          10
55 – 59          3
50 – 54          4
45 – 49          3
40 – 44          4

Solution for Mean (Using Midpoint Formula)

Class Interval   Class Mark (X)   Frequency (f)   $X_i f_i$
90 – 94          92               2               184
85 – 89          87               6               522
80 – 84          82               3               246
75 – 79          77               8               616
70 – 74          72               5               360
65 – 69          67               2               134
60 – 64          62               10              620
55 – 59          57               3               171
50 – 54          52               4               208
45 – 49          47               3               141
40 – 44          42               4               168
                                  $\sum f_i = 50$   $\sum X_i f_i = 3{,}370$

Using the midpoint formula to solve for the mean:

\[ \bar{x} = \frac{\sum X_i f_i}{n} = \frac{3{,}370}{50} = 67.4 \]

Solution for Mean (Using Unit Deviation Formula)

Class Interval   Class Mark (X)   Frequency (f)   $x_i$   $x_i f_i$
90 – 94          92               2               5       10
85 – 89          87               6               4       24
80 – 84          82               3               3       9
75 – 79          77               8               2       16
70 – 74          72               5               1       5
65 – 69          67 (AM)          2               0       0
60 – 64          62               10              -1      -10
55 – 59          57               3               -2      -6
50 – 54          52               4               -3      -12
45 – 49          47               3               -4      -12
40 – 44          42               4               -5      -20
                                  $n = 50$                $\sum x_i f_i = 4$

Using the unit deviation method to solve for the mean:

\[ \bar{x} = AM + \left(\frac{\sum x_i f_i}{n}\right) i = 67 + \left(\frac{4}{50}\right)(5) = 67 + 0.4 = 67.4 \]

Note: The assumed mean (AM) may be any one of the class marks, but preferably one located at the center of the distribution or one with the highest frequency.

The Median ($\tilde{x}$)
• This is the middle value in a set of quantities.
  It separates an ordered set of data into two equal parts. Half of the quantities are found above the median and the other half below it.
• To find the median of ungrouped data, follow these steps:
  1. Arrange the quantities in either ascending or descending order.
  2. Number the quantities consecutively from 1 to n.
  3. If n is odd, the median is the ((n + 1)/2)th quantity. If n is even, the median is the mean of the (n/2)th and (n/2 + 1)th quantities.

The Median is Used
• for ordinal or ranked measurement;
• if there are extreme cases, so that the distribution is markedly skewed;
• if we desire to know whether the cases fall within the upper half or the lower half of the distribution;
• for an open-ended distribution; that is, when the lowest or the highest class interval, or both, are open-ended, such as "50 and below" or "100 and above".

Limitations of the Median
• It is easily affected by the number of items in a distribution.
• It cannot be determined if the given values are not arranged according to magnitude.
• If a distribution contains many values, arranging them according to magnitude becomes a laborious task.
• Its value is not as accurate as the mean because it is just an ordinal statistic.

Formula for Finding the Median
• To get the median for ungrouped data, we simply arrange the data from the highest value to the lowest value or vice versa. The median is the middle value in the distribution.
• If there is an odd number of observations, the middle value is the median.
  Ex. 6, 7, 8, 9, 10, 12, 16 (the median is 9)
• If the number of observations is even, the average of the two middle scores is the median.
  Ex. 8, 7, 6, 5, 4, 3 (the median is (6 + 5)/2 = 5.5)

Grouped Data:

\[ \tilde{x} = u + \left(\frac{\frac{n}{2} - cf}{f_i}\right) i \]

where:
• $u$ is the exact lower limit of the class interval containing the median
• $\frac{n}{2}$ is one half of the total number of cases
• $cf$ is the cumulative frequency of the class immediately below the class containing the median
• $i$ is the class interval
• $f_i$ is the frequency of the class interval containing the median

Solution for Median

Class Interval        Frequency (f)   Cumulative Frequency (cf)
90 – 94               2               50
85 – 89               6               48
80 – 84               3               42
75 – 79               8               39
70 – 74               5               31
65 – 69 (u = 64.5)    2               26
60 – 64               10              24 (= cf)
55 – 59               3               14
50 – 54               4               11
45 – 49               3               7
40 – 44               4               4
                      $\sum f = n = 50$

Solving for the median:

\[ \tilde{x} = u + \left(\frac{\frac{n}{2} - cf}{f_i}\right) i = 64.5 + \left(\frac{25 - 24}{2}\right)(5) = 67 \]

Examples: Solve for the median of the following ungrouped data.
• 23, 15, 9, 30, 27, 10, 18, 14, 13
• 12.6, 15.0, 19.8, 17.9, 11.7, 18.6, 14.1, 13.4

The Mode ($\hat{x}$)
• It is the value that occurs with the greatest frequency.
• A set of data is unimodal if it contains only one mode. For instance, the set 11, 15, 13, 15, 14, 13, 15 is unimodal. The mode is 15, with a frequency of 3.
• A set is bimodal if it contains two modes. For example, the sets 88, 89, 82, 82, 82, 89, 88, 89 and 63, 55, 57, 60, 60, 66, 56, 58, 57 are bimodal. The modes are 82 and 89 for the first set and 57 and 60 for the second.
• A set of data with three modes is trimodal.
• The distribution 40, 44, 37, 37, 44, 40 has no mode, since every value occurs equally often.

The Mode is Used
• for nominal or categorical data;
• if the most popular or most typical case or value in the distribution is wanted;
• if a rough or quick estimate of a central value is wanted.

The Limitations of the Mode
• It is seldom used since it does not always exist.
• It is very unstable because its value changes depending on the approach used in finding it.
• Its value is just a rough estimate of the center of concentration of a distribution.

Formula for Mode of Grouped Data
• The mode in grouped data is the class mark or midpoint of the class with the highest frequency.
\[ \hat{x} = u + \left(\frac{d_1}{d_1 + d_2}\right) i \]

where:
• $u$ is the exact lower limit of the modal class
• $d_1$ is the difference between the frequency of the modal class and that of the next class lower in value
• $d_2$ is the difference between the frequency of the modal class and that of the next class higher in value
• $i$ is the class size of the modal class

Solution for Mode

Class Interval           Frequency (f)
90 – 94                  2
85 – 89                  6
80 – 84                  3
75 – 79                  8
70 – 74                  5
65 – 69                  2
60 – 64 (modal class)    10
55 – 59                  3
50 – 54                  4
45 – 49                  3
40 – 44                  4
                         $\sum f = n = 50$

$d_1 = 10 - 3 = 7$
$d_2 = 10 - 2 = 8$

Solving for the mode:

\[ \hat{x} = u + \left(\frac{d_1}{d_1 + d_2}\right) i = 59.5 + \left(\frac{7}{15}\right)(5) \approx 61.8 \approx 62 \]

Example: Compute the mean, median, and mode given the age brackets of the workers in a certain factory.

Age        No. of Workers (f)
42 – 44    15
39 – 41    18
36 – 38    23
33 – 35    20
30 – 32    24
27 – 29    16
24 – 26    25
21 – 23    12
18 – 20    10
15 – 17    13

Skewness in Relation to Central Tendency
• The measures of central tendency are helpful in describing the characteristics of a given distribution.
• When the values of the mean, median, and mode are all equal, they are all represented by a single point in the distribution.
• The distribution in such a case is normal or symmetrical.
• If the values of the mean, median, and mode are not the same, the curve or distribution is skewed or asymmetrical.
• There are two types of skewed distribution.
  * Positively Skewed: the curve has a long right tail. This means that a few extremely high values stretch the tail to the right, while most scores accumulate at the left. Therefore, the mean is pulled into the tail of the distribution and its value is higher than the median. The mean here is easily affected by extreme cases, which in a positively skewed distribution are found to the right. Moreover, the mean is also found to the right of the mode, since skewness in this case is approximated by the distance of the mean from the mode.
  * Negatively Skewed: the curve has a long left tail. This implies that a few extremely low scores stretch the tail to the left, while most values accumulate at the right.
    Therefore, the mean is pulled into the tail of the curve, which is found at the left. The value of the mean is thus lower than the median, because the extreme cases are found at the left of the distribution.

Quantiles
• Quantiles are values that divide the distribution into a given number of equal parts.
• There are three types of quantiles:
  • Quartiles divide the distribution into four equal parts.
  • Deciles divide the distribution into ten equal parts.
  • Percentiles divide the distribution into one hundred equal parts.

Percentiles (for ungrouped data)
• Percentiles are position measures used in educational and health-related fields to indicate the position of an individual in a group.

Percentile formula:
• The percentile corresponding to a given value X is computed by using the following formula:

\[ \text{Percentile} = \frac{(\text{number of values below } X) + 0.5}{\text{total number of values}} \times 100\% \]

Example 1. A teacher gives a 20-point test to 10 students. The scores are shown here. Find the percentile rank of a score of 12.
18, 15, 12, 6, 8, 2, 3, 5, 20, 10

Solution:
• Arrange the data in order from lowest to highest.
  2, 3, 5, 6, 8, 10, 12, 15, 18, 20
• Then substitute into the formula. Six values lie below 12, so

\[ \text{Percentile} = \frac{6 + 0.5}{10} \times 100\% = 65\text{th percentile} \]

  Thus, a student who scored 12 did better than 65% of the class.

Procedure Table
• Finding a data value corresponding to a given percentile
  STEP 1: Arrange the scores according to magnitude (lowest to highest).
  STEP 2: Compute

\[ c = \frac{n \cdot p}{100} \]

  where n = total number of values and p = percentile.
  STEP 3A: If c is not a whole number, round up to the next whole number. Starting at the lowest value, count over to the value in that position.
  STEP 3B: If c is a whole number, use the value halfway between the cth and (c + 1)st values when counting up from the lowest value.

EXAMPLE 2:
• Using the scores in Example 1:
  a. Find the 25th percentile.
  b. Find the 60th percentile.
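The percentile formula and the procedure table above can be sketched in Python. This is only an illustrative sketch, not part of the original material: the function names `percentile_rank` and `value_at_percentile` are hypothetical, and the procedure is assumed to apply only to percentiles strictly between 0 and 100.

```python
import math


def percentile_rank(values, x):
    """Percentile rank of x: (number of values below x + 0.5) / n * 100."""
    below = sum(1 for v in values if v < x)
    return (below + 0.5) * 100 / len(values)


def value_at_percentile(values, p):
    """Data value at the p-th percentile, following Steps 1 to 3B."""
    if not 0 < p < 100:
        # Assumption of this sketch: the procedure is used for 0 < p < 100 only.
        raise ValueError("p must be strictly between 0 and 100")
    ordered = sorted(values)              # Step 1: lowest to highest
    c = len(ordered) * p / 100            # Step 2: c = n * p / 100
    if c != int(c):
        # Step 3A: round up and take the value in that position (1-based).
        return ordered[math.ceil(c) - 1]
    # Step 3B: halfway between the c-th and (c + 1)-st values.
    c = int(c)
    return (ordered[c - 1] + ordered[c]) / 2


scores = [18, 15, 12, 6, 8, 2, 3, 5, 20, 10]   # scores from Example 1
print(percentile_rank(scores, 12))       # 65.0
print(value_at_percentile(scores, 25))   # 5
print(value_at_percentile(scores, 60))   # 11.0
```

The positions are 1-based in the procedure but 0-based in Python lists, which is why the code subtracts 1 when indexing.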
SOLUTION:
• For a:
  STEP 1: 2, 3, 5, 6, 8, 10, 12, 15, 18, 20
  STEP 2: \( c = \frac{10 \cdot 25}{100} = 2.5 \)
  STEP 3: Since c is not a whole number, round up to c = 3. The 3rd value is 5; hence, the value 5 corresponds to the 25th percentile.
• For b:
  STEP 1: 2, 3, 5, 6, 8, 10, 12, 15, 18, 20
  STEP 2: \( c = \frac{10 \cdot 60}{100} = 6 \)
  STEP 3: Since c is a whole number, take the 6th value = 10 and the 7th value = 12; then \( \frac{10 + 12}{2} = 11 \). Hence, 11 corresponds to the 60th percentile.

Examples:
1. Find the 20th percentile, or P20, of the following scores: 25, 22, 20, 16, 17, 12, 8, 6, 5
2. Find the 60th percentile of the following scores: 99, 95, 80, 75, 70, 60, 40

Quartiles and Deciles (for ungrouped data)
• Finding data values corresponding to Q1, Q2, and Q3
  STEP 1: Arrange the data in order from lowest to highest.
  STEP 2: Find the median of the data values. This is the value for Q2.
  STEP 3: Find the median of the data values that fall below Q2. This is the value for Q1.
  STEP 4: Find the median of the data values that fall above Q2. This is the value for Q3.

Example:
• Find Q1, Q2, and Q3 for the data set 15, 13, 6, 5, 12, 50, 22, 18

SOLUTION:
  STEP 1: 5, 6, 12, 13, 15, 18, 22, 50
  STEP 2: \( Q_2 = \frac{13 + 15}{2} = 14 \)
  STEP 3: Values less than 14: 5, 6, 12, 13, so \( Q_1 = \frac{6 + 12}{2} = 9 \)
  STEP 4: Values greater than 14: 15, 18, 22, 50, so \( Q_3 = \frac{18 + 22}{2} = 20 \)

Computation of the Quantiles for Grouped Data
• The computation for grouped data is similar to that of the median.
• The formula is

\[ P_p = u + \left(\frac{np - cf}{f}\right) i \]

where:
• $P_p$ is the desired quantile
• $u$ is the exact lower limit of the class interval containing $P_p$
• $n$ is the number of cases
• $p$ is the proportion corresponding to the desired quantile
• $cf$ is the cumulative frequency immediately below the class interval containing $P_p$
• $f$ is the frequency of the class interval containing $P_p$
• $i$ is the class interval

The efficiency ratings of 200 faculty members of a certain college were taken and are shown below.

CI         f
73 – 75    2
76 – 78    6
79 – 81    11
82 – 84    18
85 – 87    20
88 – 90    39
91 – 93    55
94 – 96    39
97 – 99    10

1. Compute the value of the mean, median, and mode.
2. Determine the value of the following:
   a. lower boundary of the 2nd quartile class
   b. upper limit of the 3rd quartile class
   c. class mark of the 78th percentile class
   d. frequency of the 8th decile class
   e. cumulative frequency before the 5th decile class
3. Determine the value of the following:
   a. Q1
   b. P36
   c. D5
   d. D7
   e. D4
   f. P55
   g. P79
   h. Q4
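
The grouped-data quantile formula, $P_p = u + \left(\frac{np - cf}{f}\right) i$, can be sketched in Python as a way to check hand computations. This is a minimal illustration under stated assumptions, not a definitive implementation: the function name `grouped_quantile` is hypothetical, the classes are assumed to have integer limits (so each exact lower limit is 0.5 below the stated lower limit), and p is given as a proportion (0.5 for the median, 0.25 for Q1, and so on). The data used are the 50 scores from the worked median example, for which the hand computation gave 67.

```python
def grouped_quantile(classes, p):
    """Quantile of grouped data using P_p = u + ((n*p - cf) / f) * i.

    classes: list of (lower_limit, upper_limit, frequency) tuples,
             ordered from the lowest class to the highest.
    p:       proportion between 0 and 1 (e.g. 0.5 for the median).
    """
    n = sum(f for _, _, f in classes)
    target = n * p
    cf = 0  # cumulative frequency below the current class
    for lower, upper, f in classes:
        if f > 0 and cf + f >= target:
            u = lower - 0.5          # exact lower limit (assumes integer limits)
            i = upper - lower + 1    # class size
            return u + (target - cf) / f * i
        cf += f
    raise ValueError("p must be between 0 and 1")


# Frequency distribution from the worked mean/median example (n = 50).
scores = [
    (40, 44, 4), (45, 49, 3), (50, 54, 4), (55, 59, 3), (60, 64, 10),
    (65, 69, 2), (70, 74, 5), (75, 79, 8), (80, 84, 3), (85, 89, 6),
    (90, 94, 2),
]
print(grouped_quantile(scores, 0.5))   # 67.0, matching the median solved by hand
```

The same function can be pointed at the efficiency-ratings table by listing its classes from 73 – 75 upward and passing, for example, p = 0.36 for P36 or p = 0.7 for D7.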