Fundamentals of Bio-Statistics
Tewachew G.
University of Gondar, Department of Statistics
E-mail: tewachew93@gmail.com

1 Definition and classification of Statistics

Definition
Statistics can be defined in two senses:
- Statistics in its plural sense: a collection of numerical information that describes aspects of social and economic phenomena. In this sense statistics are the raw data themselves, such as statistics of births, statistics of deaths, and statistics of imports and exports.
- Statistics in its singular sense: the science of collecting, organizing, presenting, analyzing, and interpreting data to assist in making more effective decisions.

Types of Statistics
- Descriptive statistics is concerned with methods of organizing, summarizing, and presenting data in an informative way.
- Inferential statistics consists of the methods used to determine something about a population on the basis of a sample.

1.2 Stages in statistical investigation
- Collection of data: the process of measuring, gathering, and assembling the raw data on which the statistical investigation is to be based. Data can be collected in a variety of ways.
- Organization of data: summarization of data in some meaningful way. It may involve editing, coding, and classification of the collected data.
- Presentation of data: the collected and organized data are presented in a systematic order to facilitate statistical analysis, with the help of tables, diagrams, and graphs.
- Analysis of data: the process of extracting numerical descriptions of the data, mainly through elementary mathematical operations (such as computing the mean or the standard deviation).
- Interpretation of data: giving meaning to the analyzed data and drawing conclusions. Statistical techniques based on probability theory are required.
1.3 Definitions of some terms
- Population: the complete set of possible measurements for which inferences are to be made.
- Census: a complete enumeration of the population. In most real problems a census cannot be carried out, so we take a sample.
- Sample: the set of measurements from a population that are actually collected in the course of an investigation. It should be selected using a pre-defined sampling technique so that it represents the population well.
- Parameter: a characteristic or measure obtained from a population.
- Statistic: a characteristic or measure obtained from a sample.
- Sampling: the process or method of selecting a sample from the population.
- Sample size: the number of elements or observations to be included in the sample.

1.4 Scales of Measurement
Variable: an attribute or characteristic that can assume different values. Variables are divided into two types: qualitative and quantitative.
- Qualitative variables are nonnumeric variables and cannot be measured. Examples: gender, religious affiliation, and state of birth.
- Quantitative variables are numerical variables and can be measured. Examples: balance in a checking account, number of children in a family.

Quantitative variables are either discrete or continuous.
- Discrete variable: assumes a finite or countable number of possible values. Examples: number of children in a family, number of cars at a traffic light.
- Continuous variable: can assume any value within a defined range. Examples: weight in kg, height, time, air pressure in a tire.

Measurement scale refers to the property of the values assigned to the data, based on the properties of order, distance, and fixed zero.

Nominal Scales
Nominal scales are measurement systems that possess none of the three properties stated above.
With the nominal level, the data are sorted into categories with no particular order to the categories.
Examples: sex (male or female), marital status (married, single, widowed, divorced), country code, regional differentiation of Ethiopia.

Ordinal Scales
Ordinal scales are measurement systems that possess the property of order, but not distance or a true zero point. This level of measurement classifies data into categories that can be ranked, but the differences between ranks are not meaningful.
Examples: rating scales (excellent, very good, good, fair, poor), military status.

Interval Scales
Interval scales are measurement systems that possess the properties of order and distance, but not the property of fixed zero.
Examples: temperature in degrees Celsius or Fahrenheit, a score on an individual intelligence test as a measure of intelligence. A temperature of 0°C does not mean that there is no temperature.

Ratio Scales
Ratio scales are measurement systems that possess all three properties: order, distance, and fixed zero. The ratio level of measurement has all the characteristics of the interval level, plus a true zero point, so the ratio of two values is meaningful.
Examples: weight, height, number of students, age.

1.6 Introduction to Methods of Data Collection
Statistical data may be classified into two categories, depending on the source: (1) primary data and (2) secondary data.

Primary data: data collected by the investigators themselves for the purpose of a specific inquiry or study. The main methods of primary data collection are:
- Observation
- Interview (face to face or by telephone)
- Questionnaire (mailed or self-administered)
- Laboratory experiment
- Life histories, case studies, etc.
Secondary data: when an investigator uses data that have already been collected by others, such data are called "secondary data". Some sources of secondary data are government documents, official statistics, technical reports, scholarly journals, trade journals, review articles, reference books, research universities, hospitals, libraries, library search engines, computerized databases, and the World Wide Web (www).

CHAPTER-2 : Methods of Data Presentation
Having collected and edited the data, the next important step is to organize it. The presentation of data is broadly classified into the following three categories:
- Tabular presentation
- Diagrammatic presentation
- Graphic presentation

The process of arranging data into classes or categories according to similarities is technically called classification.
Raw data: recorded information in its original collected form, whether counts or measurements.
Frequency: the number of values in a specific class of the distribution.

Frequency Distribution
A frequency distribution is the organization of raw data in table form, using classes and frequencies. There are three basic types of frequency distributions:
- Categorical frequency distribution
- Ungrouped frequency distribution
- Grouped frequency distribution

Categorical frequency distribution: used for data that can be placed in specific categories, such as nominal or ordinal data. E.g. marital status.

Steps of constructing a categorical frequency distribution
1. Confirm that the data are on a nominal or ordinal scale of measurement.
2. Make a table with columns A (class), B (tally), C (frequency), and D (percent).
3. Put the distinct values of the data set in column A.
4. Tally the data and place the result in column B.
5. Count the tallies and place the results in column C.
6.
Find the percentage of values in each class by using the formula % = (f / n) × 100, where f is the frequency of the class and n is the total number of values.

Example 2.1: Twenty-five army inductees were given a blood test to determine their blood type. The data set is given as follows:

A   B   B   AB  O
O   O   B   AB  B
B   B   O   A   O
A   O   O   O   AB
AB  A   O   B   A

Construct a frequency distribution for the above data.

Solution:

Class   Frequency   Percent
A           5          20
B           7          28
O           9          36
AB          4          16
Total      25         100

Ungrouped Frequency Distribution
When the data are numerical instead of categorical, the range of the data is small, and each class is only one unit wide, the distribution is called an ungrouped frequency distribution. The major components of this type of frequency distribution are class, tally, frequency, relative frequency, and cumulative frequency.

Example 2.2: The following data represent the marks of 20 students. Construct an ungrouped frequency distribution.
Solution:
Step 1: Find the range. Range = Max - Min = 90 - 60 = 30.
Step 2: Make a table as shown.
Step 3: Tally the data.
Step 4: Compute the frequency.
Each individual value is presented separately, which is why it is called an ungrouped frequency distribution.

Grouped Frequency Distribution
When the range of the data is large, the data must be grouped into classes that are more than one unit in width.

Definitions:
Grouped frequency distribution: a frequency distribution in which several numbers are grouped in one class.
Class limits: the values that separate one class in a grouped frequency distribution from another. There are gaps between the upper limit of one class and the lower limit of the next.
Units of measurement (U): the distance between two possible consecutive measures. It is usually taken as 1, 0.1, 0.01, 0.001, and so on.
Class boundaries: the values that separate one class in a grouped frequency distribution from another without gaps: the upper boundary of one class equals the lower boundary of the next. The lower class boundary is found by subtracting U/2 from the corresponding lower class limit, and the upper class boundary is found by adding U/2 to the corresponding upper class limit.
Class width: the difference between the upper and lower class boundaries of any class.
Class mark (midpoint): the average of the lower and upper class limits, or equivalently the average of the lower and upper class boundaries.
Cumulative frequency: the number of observations less than or equal to (or greater than or equal to) a specific value.
Cumulative frequency above: the total frequency of all values greater than or equal to the lower class boundary of a given class.
Cumulative frequency below: the total frequency of all values less than or equal to the upper class boundary of a given class.
Cumulative frequency distribution (CFD): the tabular arrangement of class intervals together with their corresponding cumulative frequencies.
Relative frequency (rf): the frequency divided by the total frequency.
Relative cumulative frequency (rcf): the cumulative frequency divided by the total frequency.

Guidelines for classes:
1. There should be between 5 and 20 classes.
2. The classes must be mutually exclusive. This means that no data value can fall into two different classes.
3. The classes must be all inclusive or exhaustive. This means that all data values must be included.
4. The classes must be continuous. There are no gaps in a frequency distribution.
5. The classes must be equal in width. The exception is the first or last class: it is possible to have a "below ..." or "... and above" class. This is often used with ages.
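The definitions above can be checked on a small table. The following is a minimal sketch, assuming four hypothetical classes with whole-number data (U = 1); the class limits and frequencies are made up for illustration and do not come from these notes.

```python
# Hypothetical class limits and frequencies (illustrative only).
U = 1  # unit of measurement: data recorded to the nearest whole number
limits = [(10, 19), (20, 29), (30, 39), (40, 49)]
freqs = [3, 8, 6, 3]
n = sum(freqs)

# Class boundaries: lower limit - U/2 and upper limit + U/2.
boundaries = [(lo - U / 2, hi + U / 2) for lo, hi in limits]

# Class width: upper boundary minus lower boundary.
widths = [ub - lb for lb, ub in boundaries]

# Class mark: average of the lower and upper class limits.
marks = [(lo + hi) / 2 for lo, hi in limits]

# Cumulative frequency below ("less than" type): running total of frequencies.
less_than_cf = []
running = 0
for f in freqs:
    running += f
    less_than_cf.append(running)

# Cumulative frequency above ("more than" type): this class and all later ones.
more_than_cf = [n - c + f for c, f in zip(less_than_cf, freqs)]

# Relative and relative cumulative frequencies.
rel_freq = [f / n for f in freqs]
rel_cum_freq = [c / n for c in less_than_cf]

for row in zip(limits, boundaries, marks, freqs, less_than_cf, more_than_cf):
    print(row)
```

Note that the upper boundary of each class coincides with the lower boundary of the next, which is what makes the classes continuous.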
Steps for constructing a Grouped Frequency Distribution
1. Find the largest and smallest values.
2. Compute the range: R = Maximum - Minimum.
3. Select the number of classes desired, using Sturges' rule K = 1 + 3.32 log(n), where K is the number of classes desired, n is the total number of observations, and log is the base-10 logarithm.
4. Find the class width by dividing the range by the number of classes and rounding up, not off: W = R / K.
5. Pick a suitable starting point less than or equal to the minimum value. The starting point is called the lower limit of the first class. Continue to add the class width to this lower limit to get the rest of the lower limits.
6. To find the upper limit of the first class, subtract U from the lower limit of the second class. Then continue to add the class width to this upper limit to find the rest of the upper limits.
7. Find the boundaries by subtracting U/2 from the lower limits and adding U/2 to the upper limits.
8. Tally the data.
9. Find the frequencies.
10. Find the cumulative frequencies.
11. If necessary, find the relative frequencies and/or relative cumulative frequencies.

Example-2.3: Consider the following set of data and construct the frequency distribution.

11 29 6 33 14 21 18 17 22 38
31 22 27 19 22 23 26 39 34 27

Diagrammatical and Graphical Presentation of Data
One of the most effective and interesting ways of presenting statistical data is through diagrams and graphs.
The three most commonly used diagrammatic presentations for discrete as well as qualitative data are:
- Pie chart
- Pictogram
- Bar chart

Pie Chart
A pie chart is a circle that is divided into sections or wedges according to the percentage of frequencies in each category of the distribution. The angle of each sector is obtained using:
Angle of sector = (value of the part / total) × 360°
Example 2.4: The following table gives the details of the monthly budget of a family. Represent these figures by a suitable diagram.
Solution: The necessary computations are given below:

Bar Charts
The bar graph (simple bar chart, multiple bar chart, component bar chart) uses vertical or horizontal bars to represent the frequencies of a distribution. When drawing a bar chart, consider the following two points:
- Make the bars the same width.
- Make the units on the axis used for the frequency equal in size.

Simple Bar Chart: used to display data on one variable classified on a spatial, quantitative, or temporal basis.
Example: Draw a simple bar diagram to represent the profits of a bank for 5 years.

Multiple Bars
When two or more interrelated series of data are depicted by a bar diagram, the diagram is known as a multiple-bar diagram. Suppose, for example, that we have export and import figures for a few years.
Fig 2.2 Multiple Bars

Component Bar Chart
A component bar chart is used to represent data in which the total magnitude is divided into different components.
Example: The table below shows the quantity, in hundred kg, of wheat, barley, and oats produced on a certain farm during the years 1991 to 1994. Draw a stratified bar chart.
Solution: To make the component bar chart, first take the year-wise total production. The required diagram is given below:

Graphical Presentation of Data
The histogram, the frequency polygon, and the cumulative frequency graph (ogive) are the most commonly applied graphical representations for continuous data.

Procedure for constructing statistical graphs:
- Draw and label the X and Y axes.
- Choose a suitable scale for the frequencies or cumulative frequencies and label it on the Y axis.
- Represent the class boundaries (for the histogram or ogive) or the midpoints (for the frequency polygon) on the X axis.
- Plot the points.
- Draw the bars, or draw lines to connect the points.

Histogram: a graph that displays the data by using vertical bars whose heights represent the frequencies. Class boundaries are placed along the horizontal axis.
Example: Take the data in Example 2.3 above.

Frequency Polygon: a line graph. The frequency is placed along the vertical axis and the class midpoints are placed along the horizontal axis.

Ogive (cumulative frequency polygon): a graph showing the cumulative frequency (less than or more than type) plotted against the upper or lower class boundaries, respectively. That is, class boundaries are plotted along the horizontal axis and the corresponding cumulative frequencies along the vertical axis. The points are joined by a freehand curve.
Example: Draw a frequency polygon and ogive curves (less than type and more than type) for the data in Example 2.3.
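The construction steps and the graphs can be sketched together for the Example 2.3 data. This is one possible working under stated assumptions, not necessarily the solution shown on the original slides: the value from Sturges' rule is rounded up, the starting point is taken to be the minimum value, and U = 1 because the data are whole numbers. The code only computes the coordinates needed for the histogram, frequency polygon, and ogives; drawing them is left to any plotting tool.

```python
import math
from itertools import accumulate

data = [11, 29, 6, 33, 14, 21, 18, 17, 22, 38,
        31, 22, 27, 19, 22, 23, 26, 39, 34, 27]
U = 1  # unit of measurement (assumption: whole-number data)

n = len(data)
value_range = max(data) - min(data)
k = math.ceil(1 + 3.32 * math.log10(n))   # Sturges' rule, rounded up (assumption)
w = math.ceil(value_range / k)            # class width, rounded up, not off

start = min(data)                         # assumed starting point
limits = [(start + i * w, start + (i + 1) * w - U) for i in range(k)]
freqs = [sum(lo <= x <= hi for x in data) for lo, hi in limits]
assert sum(freqs) == n                    # every value falls in exactly one class

boundaries = [(lo - U / 2, hi + U / 2) for lo, hi in limits]
marks = [(lo + hi) / 2 for lo, hi in limits]
less_cf = list(accumulate(freqs))                        # "less than" type
more_cf = [n - c + f for c, f in zip(less_cf, freqs)]    # "more than" type

# Coordinates for the graphs.
histogram_bars = [(lb, ub, f) for (lb, ub), f in zip(boundaries, freqs)]
polygon_points = list(zip(marks, freqs))
less_than_ogive = [(ub, c) for (_, ub), c in zip(boundaries, less_cf)]
more_than_ogive = [(lb, c) for (lb, _), c in zip(boundaries, more_cf)]

print("classes:", limits)
print("frequencies:", freqs)
print("less-than ogive:", less_than_ogive)
print("more-than ogive:", more_than_ogive)
```

With these choices the classes come out as 6-11, 12-17, 18-23, 24-29, 30-35, and 36-41. A different starting point or rounding convention would give a different, equally valid table.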
CHAPTER-3 : Measures of Central Tendency (MCT)
A measure of central tendency is a summary measure that attempts to describe a whole set of data with a single value that represents the middle or center of its distribution. This single value is called the average of the group. Averages are also called measures of central tendency.

Objectives of Measures of Central Tendency
- To summarize a set of data by a single value
- To facilitate comparison among different data sets
- To use for further statistical analysis or manipulation

Summation Notation
The sum of the values X1, X2, …, Xn is written ΣXi = X1 + X2 + … + Xn, where the index i runs from 1 to n.

Properties of Summation
For a constant c and two sets of values X1, …, Xn and Y1, …, Yn:
1. Σc = nc
2. Σ(cXi) = cΣXi
3. Σ(Xi + Yi) = ΣXi + ΣYi

Example 3.1: Considering the following data, determine the required sums.

Types of Measures of Central Tendency
In statistics there are various types of measures of central tendency. The most commonly used include:
- Mean
- Mode
- Median
- Quartiles
- Deciles
- Percentiles

Mean
The mean is defined as the sum of the values of all observations in a data set divided by the number of observations. The mean of X1, X2, X3, …, Xn is denoted by A.M., m, or X̄ and is given by
X̄ = (X1 + X2 + … + Xn) / n = ΣXi / n
For a grouped frequency distribution,
X̄ = Σ(fi Xi) / Σfi
where Xi is the class mark and fi the frequency of the i-th class.
Example 3.2: Obtain the mean of the following numbers: 2, 7, 8, 2, 7, 3, 7.
Example 3.3: Calculate the mean for the following age distribution.

The Mode
The mode is the value that occurs most frequently in a set of values. The mode may not exist, and even if it does exist, it may not be unique.
Examples 3.7
1. Find the mode of 5, 3, 5, 8, 9. Mode = 5.
2. Find the mode of 8, 9, 9, 7, 8, 2, and 5. The data are bimodal: 8 and 9.
3. Find the mode of 4, 12, 3, 6, and 7. There is no mode for these data.
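The mean in Example 3.2 and the three mode examples can be verified with a short sketch. The helper names `mean` and `modes` are illustrative. The sketch follows the convention used in the examples above that a data set in which every value occurs exactly once has no mode; other texts treat that case differently.

```python
from collections import Counter

def mean(values):
    """Arithmetic mean: sum of the observations divided by their number."""
    return sum(values) / len(values)

def modes(values):
    """Return the most frequent value(s); an empty list means 'no mode'."""
    counts = Counter(values)
    top = max(counts.values())
    if top == 1:          # every value occurs once: no mode (convention above)
        return []
    return sorted(v for v, c in counts.items() if c == top)

print(mean([2, 7, 8, 2, 7, 3, 7]))      # Example 3.2: 36 / 7
print(modes([5, 3, 5, 8, 9]))           # one mode
print(modes([8, 9, 9, 7, 8, 2, 5]))     # two modes (bimodal)
print(modes([4, 12, 3, 6, 7]))          # no mode
```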
The mode of a set of numbers X1, X2, …, Xn is usually denoted by X̂. If the data are given in the form of a continuous frequency distribution, the mode is defined as:
X̂ = L + (Δ1 / (Δ1 + Δ2)) × w
where L is the lower class boundary of the modal class, Δ1 is the difference between the frequency of the modal class and that of the preceding class, Δ2 is the difference between the frequency of the modal class and that of the following class, and w is the class width.
Note: The modal class is the class with the highest frequency.
Example-3.8: The following is the distribution of the size of certain farms selected at random from a district. Calculate the mode of the distribution.

The Median
In a distribution, the median is the value of the variable that divides it into two equal halves. In an ungrouped frequency distribution, if the n values are arranged in ascending order of magnitude, the median is the middle value when n is odd. When n is even, the median is the mean of the two middle values.
Example-3.9: Find the median of the following numbers.
a) 6, 5, 2, 8, 9, 4
b) 2, 1, 8, 3, 5
For grouped data:
Median = L + ((n/2 - F) / f) × w
where L is the lower class boundary of the median class, F is the cumulative frequency (less than type) of the class preceding the median class, f is the frequency of the median class, and w is the class width.
Remark: The median class is the class with the smallest cumulative frequency (less than type) greater than or equal to n/2.
Example-3.9: Find the median of Example 3.8 above.

Quartiles
Quartiles are measures that divide the frequency distribution into four equal parts. They are usually denoted by Q1, Q2, and Q3, known respectively as the first, second, and third quartile, and are obtained after arranging the data in increasing order.
For grouped data we have the following formula:
Qi = L + ((i·n/4 - F) / f) × w, i = 1, 2, 3
where L, F, f, and w are defined as for the median but refer to the class containing Qi.

Deciles
Deciles are measures that divide a given ordered data set into ten equal parts, each containing an equal number of elements. There are nine such points, known as the 1st, 2nd, …, 9th deciles and denoted by D1, D2, …, D9 respectively.
For ungrouped data the i-th decile is given by
Di = the value of the (i(n + 1)/10)-th item, i = 1, 2, …, 9
For grouped (continuous) data, deciles can be obtained by using
Di = L + ((i·n/10 - F) / f) × w, i = 1, 2, …, 9
Remark: The decile class (the class containing Di) is the class with the smallest cumulative frequency (less than type) greater than or equal to i·n/10.

Percentiles
Percentiles are measures that divide the frequency distribution into one hundred equal parts. The values of the variable corresponding to these divisions are denoted P1, P2, …, P99 and are called the first, the second, …, the ninety-ninth percentile respectively.
For ungrouped data the i-th percentile is
Pi = the value of the (i(n + 1)/100)-th item, i = 1, 2, …, 99
For grouped (continuous) data, percentiles can be obtained by using
Pi = L + ((i·n/100 - F) / f) × w, i = 1, 2, …, 99
Remark: The percentile class (the class containing Pi) is the class with the smallest cumulative frequency (less than type) greater than or equal to i·n/100.

CHAPTER-4 : Measures of Dispersion
The scatter or spread of the items of a distribution is known as dispersion or variation. Measures of dispersion are statistical measures that quantify the extent to which data are dispersed or spread out.

Objectives of measuring variation:
- To judge the reliability of measures of central tendency
- To control variability itself
- To compare two or more groups of numbers in terms of their variability
- To make further statistical analysis

Types of Measures of Dispersion
Various measures of dispersion are in use. The most commonly used are:
1. Range and relative range
2. Variance
3. Standard deviation
4. Coefficient of variation and standard score

The Variance
Population variance: if we divide the variation (the sum of squared deviations from the mean) by the number of values in the population, we get the population variance. This variance is the "average squared deviation from the mean":
σ² = Σ(Xi - μ)² / N
Sample variance: the sum of the squares of the deviations is divided by one less than the sample size:
S² = Σ(Xi - X̄)² / (n - 1)
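The difference between the two divisors can be seen in a minimal sketch. It uses the small data set 5, 17, 12, 10 from the examples of this chapter and treats it first as a whole population and then as a sample. The function names are illustrative, and Python's standard `statistics` module is used only as a cross-check.

```python
import statistics

def population_variance(values):
    """Sum of squared deviations from the mean, divided by N."""
    mu = sum(values) / len(values)
    return sum((x - mu) ** 2 for x in values) / len(values)

def sample_variance(values):
    """Sum of squared deviations from the mean, divided by n - 1."""
    xbar = sum(values) / len(values)
    return sum((x - xbar) ** 2 for x in values) / (len(values) - 1)

data = [5, 17, 12, 10]           # mean is 11, squared deviations sum to 74

print(population_variance(data))  # 74 / 4
print(sample_variance(data))      # 74 / 3

# Cross-check against the standard library.
assert abs(population_variance(data) - statistics.pvariance(data)) < 1e-12
assert abs(sample_variance(data) - statistics.variance(data)) < 1e-12
```

Taking the square root of either result gives the corresponding standard deviation.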
Standard Deviation
The standard deviation is defined as the square root of the mean of the squared deviations of individual values from their mean, that is, the square root of the variance.
Examples: Find the variance and standard deviation of the following sample data.
1. 5, 17, 12, 10.
2. The data are given in the form of a frequency distribution.

Special properties of Standard Deviations
Chebyshev's Theorem: for any data set and any number k > 1, at least (1 - 1/k²) of the values lie within k standard deviations of the mean.
Example: Suppose a distribution has mean 50 and standard deviation 6. What percent of the numbers are:
a) Between 38 and 62
b) Between 32 and 68
c) Less than 38 or more than 62
d) Less than 32 or more than 68
Solution
a) 38 and 62 are 2 standard deviations from the mean (k = 2), so at least 1 - 1/4 = 75% of the numbers lie between them.
b) 32 and 68 are 3 standard deviations from the mean (k = 3), so at least 1 - 1/9 ≈ 88.9% of the numbers lie between them.
c) At most 25%.
d) At most about 11.1%.

Coefficient of Variation (C.V)
The coefficient of variation expresses the standard deviation as a percentage of the mean:
C.V = (S / X̄) × 100%

Standard Scores (Z-scores)
If X is a measurement from a distribution with mean X̄ and standard deviation S, then its value in standard units is
Z = (X - X̄) / S
- Z gives the deviation from the mean in units of standard deviation.
- Z gives the number of standard deviations a particular observation lies above or below the mean.
- It is used to compare two observations coming from different groups.
Examples: Student A from section 1 scored 90 and student B from section 2 scored 95. Relatively speaking, who performed better?

CHAPTER-5 : Elementary Probability
Probability is the chance of an outcome of an experiment. It is the measure of how likely an outcome is to occur.

Definitions of some probability terms
Experiment: any process of observation or measurement, or any process that generates a well-defined outcome.
Probability experiment: an experiment whose outcome cannot be known in advance.
Outcome: the result of a single trial of a random experiment.
Sample space: the set of all possible outcomes of a probability experiment.
Event: a subset of the sample space. It is a statement about one or more outcomes of a random experiment. Events are denoted by capital letters.
Mutually exclusive events: two events are said to be mutually exclusive if both cannot occur at the same time as the outcome of a single experiment. Events E1 and E2 are mutually exclusive if there is no sample point common to both E1 and E2.
Equally likely outcomes: outcomes that have the same chance of occurring.
Independent events: two events A and B are said to be independent if the occurrence of event A has no influence on the occurrence of event B.
Dependent events: two events are dependent if the first event affects the outcome or occurrence of the second event.

Fundamental Principles of Counting Techniques
If the number of possible outcomes in an experiment is small, it is relatively easy to list and count all possible events. When there is a large number of possible outcomes, an enumeration of cases is often difficult, tedious, or both. To overcome such problems one can use various counting techniques or rules.

Addition rule: Suppose that a procedure, designated by 1, can be performed in n1 ways, and that a second procedure, designated by 2, can be performed in n2 ways. Suppose furthermore that it is not possible to perform both procedures 1 and 2 together. Then the number of ways in which we can perform procedure 1 or procedure 2 is n1 + n2.
This can be generalized as follows: if there are k procedures and the i-th procedure may be performed in ni ways, i = 1, 2, …, k, then the number of ways in which we may perform procedure 1 or 2 or … or k is n1 + n2 + … + nk.
Example 5.1: Suppose that we are planning a trip and are deciding between bus and train transportation. If there are 3 bus routes and 2 train routes to go from A to B, find the number of available routes for the trip. There are 3 + 2 = 5 possible routes.

The Multiplication Rule: If a choice consists of k steps, of which the first can be made in n1 ways, the second in n2 ways, …, and the k-th in nk ways, then the whole choice can be made in (n1 × n2 × … × nk) ways.
Example 5.2: An airline has 6 flights from A to B and 7 flights from B to C per day. If the flights are to be made on separate days, in how many different ways can the airline offer a trip from A to C?
Example 5.3: The digits 0, 1, 2, 3, and 4 are to be used in a 4-digit identification card. How many different cards are possible if
a) Repetitions are permitted.
b) Repetitions are not permitted.

Permutation Rule: A permutation is an arrangement of all or part of a set of objects with regard to order.
Rule 1: The number of permutations of n distinct objects taken all together is n!. In particular, the number of permutations of n objects taken n at a time is nPn = n!.
Rule-2: A permutation of n different objects taken r at a time is an arrangement of r out of the n objects, with attention given to the order of arrangement.
The number of permutations of n objects taken r at a time is denoted by nPr or P(n, r) and is given by
nPr = n! / (n - r)!
Rule-3: The number of permutations of n objects taken all at a time, when n1 objects are alike of one kind, n2 objects are alike of a second kind, …, and nk objects are alike of a k-th kind, is given by:
n! / (n1! × n2! × … × nk!)
Examples:
1. Suppose we have the letters A, B, C, D.
a) How many permutations are there taking all four?
b) How many permutations are there taking two letters at a time?
2. Find the number of permutations of the letters of the word STATISTICS taken all at a time.

Combination Rule: A combination is a selection of objects without regard to the order of arrangement. A combination of n different objects taken r at a time is a selection of r out of the n objects, denoted by nCr or C(n, r) and given by
nCr = n! / (r! (n - r)!)
Examples:
1. Suppose a box contains 3 red, 3 white, and 5 black balls of equal size. We want to draw 3 balls at a time. How many ways do we have from each type?
2. In how many ways can a committee of 5 people be chosen out of 9 people?
3. Among 15 clocks there are two defectives. In how many ways can an inspector choose three of the clocks for inspection so that:

Different Approaches to Probability
Classical or Mathematical Approach
If a random experiment results in N exhaustive, mutually exclusive, and equally likely outcomes, of which n are favorable to the happening of an event A, then the probability of occurrence of A, usually denoted by P(A), is given by:
P(A) = n / N
Examples:
1. A basket contains 3 yellow, 4 black, and 3 white balls. What is the probability of selecting one black ball?
2. A box of 80 candles consists of 30 defective and 50 non-defective candles. If 10 of these candles are selected at random, what is the probability that:
a) All will be defective.
b) 6 will be non-defective
c) All will be non-defective

Empirical or Frequency Approach
This approach is based on the relative frequency of occurrence of the event when the number of observations is very large.
Definition: The probability of an event A is the proportion of outcomes favorable to A in the long run, when the experiment is repeated under the same conditions.
Examples:
1. If 1000 tosses of a coin result in 529 heads, the relative frequency of heads is 529/1000 = 0.529. If another 1000 tosses result in 493 heads, the relative frequency in the total of 2000 tosses is (529 + 493)/2000 = 0.511.
2. Records show that 60 out of 100,000 bulbs produced are defective. What is the probability that a newly produced bulb is defective?

Axiomatic Approach
Given a sample space S of a random experiment, the probability of the occurrence of any event A is defined as a set function P(A) satisfying the following axioms:
1. P(A) ≥ 0 for every event A.
2. P(S) = 1.
3. If A and B are mutually exclusive events, then P(A ∪ B) = P(A) + P(B).

Subjective Approach
A subjective probability is derived from an individual's personal judgment about whether a specific outcome is likely to occur. Subjective probabilities involve no formal calculations and only reflect the person's opinions and past experience. They differ from person to person, and because the probability is subjective, it contains a high degree of personal bias.

Events as sets: If A and B are two events, then
- A ∪ B is the event that A or B (or both) occurs;
- A ∩ B is the event that both A and B occur;
- A' is the event that A does not occur.

Conditional Probability
Conditional events: if the occurrence of one event has an effect on the occurrence of the other event, then the two events are conditional or dependent events. Let there be two events A and B.
Then the conditional probability of event A, given that event B has occurred, is
P(A/B) = P(A ∩ B) / P(B), provided P(B) > 0.

Example 1: 120 employees of a certain factory are given a performance test and are divided into two groups, those with good performance (G) and those with poor performance (P). The result is given below.

              Good performance (G)   Poor performance (P)   Total
Male (M)               60                     20              80
Female (F)             25                     15              40
Total                  85                     35             120

Example 2: If the probability that a research project will be well planned is 0.60 and the probability that it will be well planned and well executed is 0.54, what is the probability that it will be well executed given that it is well planned?
Example 3: For a student enrolling as a freshman at a certain university, the probability is 0.25 that he/she will get a scholarship and 0.75 that he/she will graduate. If the probability is 0.2 that he/she will get a scholarship and will also graduate, what is the probability that a student who gets a scholarship will graduate?

Probability of Independent Events
Two events A and B are said to be independent if the occurrence of event A in a probability experiment does not affect the probability of event B. In other words, events A and B are independent if the conditional probability of A given B is the same as the unconditional probability of A, i.e. P(A/B) = P(A) (B does not affect event A). This leads to a useful formula, which is also our definition of independence:
P(A ∩ B) = P(A) × P(B)
Example: A box contains four black and six white balls. What is the probability of getting two black balls in drawing one after the other under the following conditions?
a) The first ball drawn is not replaced
b) The first ball drawn is replaced
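The conditional-probability examples above can be worked through with exact fractions. This is a minimal sketch: the helper name `conditional` is illustrative, and because the question attached to the employee table is not stated in these notes, P(G/M) is computed only as an assumed illustration of how the table is used.

```python
from fractions import Fraction

def conditional(p_a_and_b, p_b):
    """P(A/B) = P(A and B) / P(B), defined only when P(B) > 0."""
    if p_b <= 0:
        raise ValueError("P(B) must be positive")
    return p_a_and_b / p_b

# Example 1 (illustrative question): P(good performance given male).
counts = {("M", "G"): 60, ("M", "P"): 20, ("F", "G"): 25, ("F", "P"): 15}
total = sum(counts.values())
p_m = Fraction(counts[("M", "G")] + counts[("M", "P")], total)
p_m_and_g = Fraction(counts[("M", "G")], total)
p_g_given_m = conditional(p_m_and_g, p_m)

# Example 2: P(well executed given well planned) = 0.54 / 0.60.
p_exec_given_plan = conditional(Fraction(54, 100), Fraction(60, 100))

# Example 3: P(graduate given scholarship) = 0.2 / 0.25.
p_grad_given_schol = conditional(Fraction(20, 100), Fraction(25, 100))

# Two black balls from a box of 4 black and 6 white balls.
without_replacement = Fraction(4, 10) * Fraction(3, 9)   # draws are dependent
with_replacement = Fraction(4, 10) * Fraction(4, 10)     # draws are independent

print(p_g_given_m, p_exec_given_plan, p_grad_given_schol)
print(without_replacement, with_replacement)
```

Exact fractions are used instead of floating-point numbers so that results such as 0.54 / 0.60 come out as exactly 9/10.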
CHAPTER 6: Sampling and Sampling Distribution

Sampling
Definitions:
Population: A group that includes all the cases (individuals, objects, or groups) in which the researcher is interested.
Sample: A relatively small subset of a population.
Parameter: A characteristic or measure obtained from a population.
Statistic: A characteristic or measure obtained from a sample.
Sampling: The process or method of selecting a sample from the population.
Sampling unit: The ultimate unit to be sampled, that is, the elements of the population to be sampled.
Sampling frame: The list of all elements in a population.

Errors in sample surveys: There are two types of errors.
a) Sampling error: The discrepancy between the population value and the sample value. It may arise from the use of inappropriate sampling techniques.
b) Non-sampling errors: Errors due to procedural bias, such as incorrect responses, measurement errors, and errors at different stages of processing the data.

Use of sampling: reduced cost, greater speed, greater accuracy, and greater scope. Sampling also avoids destructive testing, and it is the only option when the population is infinite.

Sampling Techniques
There are two types of sampling techniques:
1. Random sampling, or probability sampling.
2. Non-random sampling, or non-probability sampling.
Probability sampling methods are those in which every item in the population has a known chance, or probability, of being chosen for the sample.
Non-probability sampling is a sampling technique in which the researcher selects samples based on subjective judgment rather than random selection. In non-probability sampling, not all population members have a known chance of participating in the study, unlike in probability sampling.
Probability Sampling Methods
Simple Random Sampling: A method of selecting items from a population such that every possible sample of a specific size has an equal chance of being selected. Sampling may be with or without replacement. Simple random sampling can be done using either the lottery method or a table of random numbers.
Stratified Random Sampling: The population is divided into non-overlapping but exhaustive groups called strata, and a simple random sample is chosen from each stratum. Elements within the same stratum should be more or less homogeneous, while elements in different strata should differ. It is applied when the population is heterogeneous. Some of the criteria for dividing a population into strata are: sex (male, female); age (under 18, 18 to 28, 29 to 39); occupation (blue-collar, professional, other).
Cluster Sampling: The population is divided into non-overlapping groups called clusters. A simple random sample of clusters is chosen, and all the sampling units in the selected clusters are surveyed. Clusters are formed in such a way that elements within a cluster are heterogeneous. Cluster sampling is useful when it is difficult or costly to generate a simple random sample.
Systematic Sampling: The first element is selected randomly from a list or from sequential files, and then every k-th element after it is selected. The procedure therefore starts by determining the first element to be included in the sample and then takes every k-th item from the sampling frame.

Non Random Sampling
Judgment/Purposive Sampling: Sampling is done based on previous ideas about the composition and behavior of the population. An expert with knowledge of the population decides which units in the population should be sampled.
Convenience Sampling: Samples are selected from the population only because they are conveniently available to the researcher.
Quota Sampling: The researcher selects the sample based on some quota, and makes sure that the final sample meets the quota criteria.
Example: A researcher wants to survey individuals about which smartphone brand they prefer to use. He/she considers a sample size of 500 respondents and is only interested in surveying ten states in the US. The researcher can divide the population by quotas as follows:
Gender: 250 males and 250 females.
Age: 100 respondents in each of the age groups 16-20, 21-30, 31-40, 41-50, and 51+.
Employment status: 350 employed and 150 unemployed people.

Sampling Distribution
Sampling distributions are probability distributions used to describe the variability of sample statistics. The sampling distribution of a statistic is the probability distribution of that statistic.

Sampling Distribution of the sample mean
The sampling distribution of the sample mean is a theoretical probability distribution that shows the functional relationship between the possible values of the sample mean, based on samples of size n, and the probability associated with each value, for all possible samples of size n drawn from that particular population.

Steps for the construction of the sampling distribution of the mean
1. From a finite population of size N, randomly draw all possible samples of size n.
2. Calculate the mean for each sample.
3. Summarize the means obtained in step 2 in terms of a frequency distribution or relative frequency distribution.

Example: Take samples of size 2 with replacement and construct the sampling distribution of the sample mean.
Solution:
Sampling distribution of means

$\bar{X}$      6     7     8     9     10    11    12    13    14
$f(\bar{X})$   1/25  2/25  3/25  4/25  5/25  4/25  3/25  2/25  1/25

Solution:
$\bar{X} \sim N(\mu, \sigma/\sqrt{n}) \implies \bar{X} \sim N(5.7, 0.33)$, where 0.33 is the standard error $\sigma/\sqrt{n}$, and
$Z = \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} \sim N(0, 1)$.
$P(\bar{X} > 6) = P\left(Z > \frac{6 - 5.7}{0.33}\right) = P(Z > 0.91) = 0.5 - P(0 \le Z \le 0.91) = 0.1814$
$P(5 < \bar{X} < 6) = P\left(\frac{5 - 5.7}{0.33} < Z < \frac{6 - 5.7}{0.33}\right) = P(-2.12 < Z < 0.91) = P(0 \le Z \le 2.12) + P(0 \le Z \le 0.91) = 0.8016$
$P(\bar{X} < 5.2) = P\left(Z < \frac{5.2 - 5.7}{0.33}\right) = P(Z < -1.52) = 0.5 - P(0 \le Z \le 1.52) = 0.0643$

CHAPTER 7: Statistical Inference
Inference is the process of making interpretations or drawing conclusions from sample data about the totality of the population. In statistics there are two ways through which inference can be made:
Statistical estimation
Statistical hypothesis testing

1. Statistical Estimation
This is one way of making inference about a population parameter when the investigator does not have any prior notion about the value of the parameter.
Point estimation: A procedure that results in a single value as an estimate of a parameter.
Interval estimation: A procedure that results in an interval of values as an estimate of a parameter. It deals with identifying the upper and lower limits of a parameter.

Definitions
Confidence Interval: An interval estimate with a specific level of confidence.
Confidence Level: The percentage of the time that the true value will lie in the interval estimate given.
Degrees of Freedom: The number of data values that are allowed to vary once a statistic has been determined.
Estimator: A sample statistic that is used to estimate a population parameter. It should be unbiased, consistent, and relatively efficient.
Estimate: A particular value that an estimator assumes for a given sample.
Interval Estimate: A range of values used to estimate a parameter.
Point Estimate: A single value used to estimate a parameter.

Properties of the best estimator
Unbiased Estimator: An estimator whose expected value is the value of the parameter being estimated.
Consistent Estimator: An estimator that gets closer to the value of the parameter as the sample size increases.
Relatively Efficient Estimator: The estimator of a parameter with the smallest variance.

Point and interval estimation of the population mean: µ
Point Estimation
Another term for a statistic is point estimate, since we are estimating the parameter value. A point estimator is the mathematical rule by which we compute the point estimate. For instance, the sample mean $\bar{X}$ is a point estimator of the population mean µ.
Confidence interval estimation of the population mean
Although $\bar{X}$ possesses nearly all the qualities of a good estimator, because of sampling error it is not likely that the sample statistic will be exactly equal to the population parameter. We have to be satisfied with knowing that the statistic is "close to" the parameter, and so we report an interval of values around it. There are different cases to be considered when constructing confidence intervals.

Case 1: If the sample size is large, or if the population is normal with known variance
A $100(1 - \alpha)\%$ confidence interval for µ is
$\bar{X} \pm Z_{\alpha/2}\,\frac{\sigma}{\sqrt{n}}$
Here are the z values corresponding to the most commonly used confidence levels.

Confidence level    $Z_{\alpha/2}$
90%                 1.645
95%                 1.96
99%                 2.576
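The critical values $Z_{\alpha/2}$ for the commonly used confidence levels can be reproduced from the standard normal distribution. The sketch below assumes Python's standard-library `statistics.NormalDist`; the helper names `z_critical` and `z_interval` are illustrative and not part of the original notes.

```python
from math import sqrt
from statistics import NormalDist


def z_critical(confidence):
    """Two-sided critical value z_{alpha/2} for a confidence level such as 0.95."""
    alpha = 1 - confidence
    return NormalDist().inv_cdf(1 - alpha / 2)


def z_interval(sample_mean, sigma, n, confidence):
    """Case 1 interval: sample_mean +/- z_{alpha/2} * sigma / sqrt(n)."""
    margin = z_critical(confidence) * sigma / sqrt(n)
    return sample_mean - margin, sample_mean + margin


for level in (0.90, 0.95, 0.99):
    print(f"{level:.0%}: z = {z_critical(level):.3f}")
# 90%: z = 1.645
# 95%: z = 1.960
# 99%: z = 2.576
```

`z_interval` returns the lower and upper confidence limits as a pair. It applies only when the population standard deviation is known or the sample is large.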
Case 2: If the sample size is small and the population variance is not known
A $100(1 - \alpha)\%$ confidence interval for µ is
$\bar{X} \pm t_{\alpha/2}(n - 1)\,\frac{S}{\sqrt{n}}$
where S is the sample standard deviation and $t_{\alpha/2}(n - 1)$ is the critical value of the t distribution with n - 1 degrees of freedom.

Examples:
1. From a normal sample of size 25 a mean of 32 was found. Given that the population standard deviation is 4.2, find
a) A 95% confidence interval for the population mean.
b) A 99% confidence interval for the population mean.
Solution: Since the population standard deviation is known, the interval is $\bar{X} \pm Z_{\alpha/2}\,\sigma/\sqrt{n}$ with $\sigma/\sqrt{n} = 4.2/\sqrt{25} = 0.84$.
a) $32 \pm 1.96(0.84) = 32 \pm 1.65$, i.e. (30.35, 33.65).
b) $32 \pm 2.576(0.84) = 32 \pm 2.16$, i.e. (29.84, 34.16).
2. A drug company is testing a new drug which is supposed to reduce blood pressure. From the six people who are used as subjects, it is found that the average drop in blood pressure is 2.28 points, with a standard deviation of 0.95 points. What is the 95% confidence interval for the mean change in pressure?

2. Hypothesis Testing
This is also a way of making inference about a population parameter, used when the investigator has a prior notion about the value of the parameter.
Definitions:
Statistical hypothesis: An assertion or statement about the population whose plausibility is to be evaluated on the basis of the sample data.
Test statistic: A statistic whose value serves to determine whether to reject or not reject the hypothesis being tested.
Statistical test: A test or procedure used to evaluate a statistical hypothesis; its value depends on the sample data.
There are two types of hypothesis:
1. Null Hypothesis (H0): A statistical hypothesis that states there is no difference between a parameter and a specific or hypothesized value. H0: $\mu = \mu_0$, where µ is the population mean and $\mu_0$ is the hypothesized mean.
2. Alternative Hypothesis (H1): A statistical hypothesis that states there exists a difference between a parameter and a specific or hypothesized value. H1: $\mu \ne \mu_0$, $\mu > \mu_0$, or $\mu < \mu_0$.
Types and size of errors
Hypothesis testing is based on sample data, which may involve sampling and non-sampling errors.
Type I error: Rejecting the null hypothesis when it is true.
Type II error: Failing to reject the null hypothesis when it is false.
NOTE: These errors are present in any two-choice decision-making problem; there is always a possibility of committing one or the other. For a fixed sample size, the probability of a type I error ($\alpha$) and the probability of a type II error ($\beta$) are inversely related and therefore cannot be minimized at the same time.

Steps of hypothesis testing
1. Specify the null hypothesis (H0) and the alternative hypothesis (H1).
2. Select a significance level, $\alpha$.
3. Identify the sampling distribution of the estimator.
4. Calculate the test statistic.
5. Identify the critical region.
6. Make the decision.
7. Summarize the result.

Hypothesis testing about the population mean, µ

Case 1: When sampling is from a normal distribution with $\sigma^2$ known
The relevant test statistic is
$Z_{cal} = \frac{\bar{X} - \mu_0}{\sigma/\sqrt{n}}$
After specifying $\alpha$, we have the following regions (critical and acceptance) on the standard normal distribution corresponding to the three alternative hypotheses.
Summary table for decision rule:

H1                   Reject H0 if
$\mu \ne \mu_0$      $|Z_{cal}| > Z_{\alpha/2}$
$\mu > \mu_0$        $Z_{cal} > Z_{\alpha}$
$\mu < \mu_0$        $Z_{cal} < -Z_{\alpha}$

Case 2: When sampling is from a normal distribution with $\sigma^2$ unknown and the sample size is small
The relevant test statistic is
$t_{cal} = \frac{\bar{X} - \mu_0}{S/\sqrt{n}}$
After specifying $\alpha$, we have the corresponding regions (critical and acceptance) on the t distribution with n - 1 degrees of freedom for the three alternative hypotheses; the decision rule is the same as in Case 1, with $t_{\alpha/2}(n - 1)$ and $t_{\alpha}(n - 1)$ in place of $Z_{\alpha/2}$ and $Z_{\alpha}$.

Examples:
1. Test the hypothesis that the average content of containers of a certain lubricant is 10 liters if the contents of a random sample of 10 containers are 10.2, 9.7, 10.1, 10.3, 10.1, 9.8, 9.9, 10.4, 10.3, and 9.8 liters.
Use the $\alpha = 0.01$ level of significance and assume that the distribution of contents is normal.
2. The mean lifetime of a sample of 16 fluorescent light bulbs produced by a company is computed to be 1570 hours. The population standard deviation is 120 hours. Suppose the hypothesized value for the population mean is 1600 hours. Can we conclude that the lifetime of light bulbs is increasing? (Use $\alpha = 0.05$ and assume the normality of the population.) (exercise!)

Cont'd (Example 1)
For the sample of 10 containers, $\bar{X} = 10.06$ and $S \approx 0.246$, so $t_{cal} = \frac{10.06 - 10}{0.246/\sqrt{10}} \approx 0.77$. This is smaller than the critical value $t_{0.005}(9) = 3.25$, so H0 is not rejected.
Step 7: Conclusion. At the 1% level of significance, we have no evidence to say that the average content of containers of the given lubricant is different from 10 liters, based on the given sample data.

Chi-Square Test of Association
Suppose attribute A has r mutually exclusive and exhaustive classes and attribute B has c mutually exclusive and exhaustive classes. The entire set of data can then be represented using an r × c contingency table. The chi-square test is used to test the hypothesis of independence of two attributes. For instance, we may be interested in:
Whether the presence or absence of hypertension is independent of smoking habit.
Whether the size of the family is independent of the level of education attained by the mothers.
Whether there is an association between father and son regarding baldness.

Test of Association
The test statistic compares the observed frequencies $O_{ij}$ with the frequencies $e_{ij}$ expected under independence:
$\chi^2_{cal} = \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{(O_{ij} - e_{ij})^2}{e_{ij}}$, where $e_{ij} = \frac{(\text{total of row } i)(\text{total of column } j)}{n}$.
Decision Rule: Reject H0 (independence) at level $\alpha$ if $\chi^2_{cal} > \chi^2_{\alpha}\big((r - 1)(c - 1)\big)$.
Examples:
1. A geneticist took a random sample of 300 men to study whether there is an association between father and son regarding baldness. He obtained the following results.
2.
A random sample of 200 men, all retired, was classified according to education and number of children, as shown below. Test the hypothesis that the size of the family is independent of the level of education attained by fathers. (Use the 5% level of significance.)

Summary Worksheet
1. A random sample of 200 adults is classified below by sex and level of education attained.

Education     Male   Female
Elementary     38      45
Secondary      28      50
College        22      17

If a person is picked at random from this group, find the probability that
a) the person is a male, given that the person has a secondary education;
b) the person does not have a college education, given that the person is a female.
Solution: Let A = the person has secondary education, M = the person is male, F = the person is female, C = the person has college education, and $C^c$ = the person does not have college education. Then
a) $P(M/A) = \frac{28}{78} \approx 0.359$
b) $P(C^c/F) = \frac{45 + 50}{112} = \frac{95}{112} \approx 0.848$
2. Suppose we have a population of size N = 5 consisting of the ages of five children (3, 6, 9, 12, 15), with population mean 9 and population variance 18. If a sample of size 3 is selected randomly without replacement, how many different samples are possible? Construct the sampling distribution of the sample mean.
3. A random sample of 200 men, all retired, was classified according to education and number of children, as shown below. Test the hypothesis that the size of the family is independent of the level of education attained by fathers. (Use the 5% level of significance.)
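The construction of a sampling distribution asked for in worksheet problem 2 (five ages, samples of size 3 without replacement) can be carried out by brute-force enumeration. The following is a rough Python sketch under the assumption that the order of selection is ignored, so that samples are combinations. The variable names are illustrative and not taken from the notes.

```python
from collections import Counter
from fractions import Fraction
from itertools import combinations

population = [3, 6, 9, 12, 15]  # ages of the five children (N = 5)
n = 3                           # sample size

# Steps 1: all possible samples of size n, drawn without replacement.
samples = list(combinations(population, n))

# Step 2: the mean of each sample, kept as an exact fraction.
sample_means = [Fraction(sum(s), n) for s in samples]

# Step 3: relative frequency of each distinct sample mean.
counts = Counter(sample_means)
distribution = {m: Fraction(c, len(samples)) for m, c in sorted(counts.items())}

# Mean and variance of the sampling distribution.
mean_of_means = sum(m * p for m, p in distribution.items())
variance = sum((m - mean_of_means) ** 2 * p for m, p in distribution.items())

print("number of samples:", len(samples))
for m, p in distribution.items():
    print(f"mean = {m}, probability = {p}")
print("mean of sample means:", mean_of_means)
print("variance of sample means:", variance)
```

The enumeration gives 10 possible samples. The mean of the sample means equals the population mean 9, and their variance is 3, which agrees with $\frac{\sigma^2}{n} \cdot \frac{N - n}{N - 1} = \frac{18}{3} \cdot \frac{2}{4}$ for sampling without replacement from a finite population.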
Summary worksheet
1. A research report claims that the average salary of veterinarians is more than $42000. A sample of 30 veterinarians has a mean salary of $43260. Test the report's claim. Assume the population standard deviation is $5230.
2. A national magazine claims that the average college student watches less television than the general public. The national average is 29.4 hours per week, with a standard deviation of 2 hours. A random sample of 25 college students has a mean of 27 hours. Test the claim. Assume normality.
3. A merchant believes that the average age of customers who purchase a certain brand of wear is 13 years. A random sample of 35 customers had an average age of 15.6 years. At α = 0.01, should this conjecture be rejected? The standard deviation of the population is 1 year.
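The worksheet problems above all follow the Case 1 procedure (population standard deviation known). As a way to check a hand calculation, here is a minimal Python sketch of the one-sample z test. It assumes the standard-library `statistics.NormalDist`, and the function name `one_sample_z_test` and its `alternative` argument are illustrative choices. Problem 1 does not state a significance level, so α = 0.05 is used there purely as an assumption.

```python
from math import sqrt
from statistics import NormalDist


def one_sample_z_test(sample_mean, mu0, sigma, n, alpha, alternative="two-sided"):
    """Return (z, critical value, reject H0?) for H0: mu = mu0 with sigma known."""
    z = (sample_mean - mu0) / (sigma / sqrt(n))
    std_normal = NormalDist()
    if alternative == "two-sided":    # H1: mu != mu0
        critical = std_normal.inv_cdf(1 - alpha / 2)
        reject = abs(z) > critical
    elif alternative == "greater":    # H1: mu > mu0
        critical = std_normal.inv_cdf(1 - alpha)
        reject = z > critical
    elif alternative == "less":       # H1: mu < mu0
        critical = std_normal.inv_cdf(1 - alpha)
        reject = z < -critical
    else:
        raise ValueError("alternative must be 'two-sided', 'greater' or 'less'")
    return z, critical, reject


# Problem 3: H0: mu = 13 against H1: mu != 13, alpha = 0.01 as stated.
z3, crit3, reject3 = one_sample_z_test(15.6, 13, 1, 35, alpha=0.01)
print(f"problem 3: z = {z3:.2f}, critical = {crit3:.3f}, reject H0: {reject3}")

# Problem 1: H0: mu = 42000 against H1: mu > 42000.
# alpha = 0.05 is an ASSUMPTION; the worksheet gives no level for this problem.
z1, crit1, reject1 = one_sample_z_test(
    43260, 42000, 5230, 30, alpha=0.05, alternative="greater"
)
print(f"problem 1: z = {z1:.2f}, critical = {crit1:.3f}, reject H0: {reject1}")
```

The direction of the alternative hypothesis decides which tail holds the critical region. Problem 2 (less television than the general public) would use `alternative="less"` once a significance level is chosen.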