Module-1: Statistics O nl in e Notes Learning Objective: ●● To get introduced with types of statistics and its application in different phases ●● To develop an understanding of data representation techniques ●● To understand the MS Excel applications of numerical measures of central tendency and dispersion Learning Outcome: At the end of the course, the learners will be able to – To be able to arrange and describe statistical information using numerical and graphical procedures ●● To be able to use the tool MS Excel for answering business problems based on numerical measures er s ity ●● “ It is the science of collection, presentation, analysis, and interpretation of numerical data from logical analysis” Croxton and Cowden define- ni v 1.1.1 Statistical Thinking and Analysis U Data is a collection of any number of related observations. We can collect the number of telephones installed in a given day by several workers or the numbers of telephones installed per day over a period of several days by one worker and call the results our data. A collection of data is called a data set and a single observation is called as a data point. )A m ity Statistics is not restricted to only information about the State, but it also extends to almost every realm of the business. Statistics is about scientific methods to gather, organize, summarize and analyze data. More important still is to draw valid conclusions and make effective decisions based on such analysis. To a large degree, company performance depends on the preciseness and accuracy of the forecast. Statistics is an indispensable instrument for manufacturing control and market research. Statistical tools are extensively used in business for time and motion study, consumer behaviour study, investment decisions, credit ratings, performance measurements and compensations, inventory management, accounting, quality control, distribution channel design, etc. (c For managers, therefore, understanding statistical concepts and knowledge about using statistical tools is essential. With an increase in a company’s size and market uncertainty due to reduced competition, the need for statistical knowledge and statistical analysis of various business circumstances has greatly increased. Prior to this, when the size of business used to be small without much complexities, a single person, usually owner or manager of the firm, used to take all decisions regarding the business. Example: A manager used to decide, from where the necessary raw materials and other factors of production were to be acquired, how much of output will Amity Directorate of Distance & Online Education 2 Statistics Management O nl in e be produced, where it will be sold, etc. This type of decision making was usually based on experience and expectations of this single individual and as such had no scientific basis. Notes 1.1.2 Limitations and Applications of Statistics Statistical techniques, because of their flexibility have become popular and are used in numerous fields. But statistics is not a cure-all technique and has few limitations. It cannot be applied to all kinds of situations and cannot be made to answer all queries. The major limitations are: Statistics deals with only those problems, which can be expressed in quantitative terms and amenable to mathematical and numerical analysis. These are not suitable for qualitative data such as customer loyalty, employee integrity, emotional bonding, motivation etc. 2. 
Statistics deals only with the collection of data and no importance is attached to an individual item. 3. Statistical results are only an approximation and not mathematically correct. There is always a possibility of random error. 4. Statistics, if used wrongly, can lead to misleading conclusions, and therefore, should be used only after complete understanding of the process and the conceptual base. 5. Statistics laws are not exact laws and are liable to be misused. 6. The greatest limitation is that the statistical data can be used properly only by a professional. A person having thorough knowledge of the methods of statistics and proper training can only come to conclusions. 7. If statistical data are not uniform and homogenous, then the study of the problem is not possible. Homogeneity of data is essential for a proper study. 8. Statistical methods are not the only method for studying a problem. There are other methods as well, and a problem can be studied in various ways. U ni v er s ity 1. ity 1.1.3 Types of Statistical Methods: Descriptive & Inferential (c )A m The study of statistics can be categorized into two main branches. These branches are descriptive statistics and inferential statistics. Descriptive statistics is used to sum up and graph the data for a category picked. This method helps to understand a particular collection of observations. A sample is defined on descriptive statistics. There is no confusion in concise numbers, since you just identify the individuals or things which are calculated. Descriptive statistics give information that describes the data in some manner. For example, suppose a pet shop sells cats, dogs, birds and fish. If 100 pets are sold, and 35 out of the 100 were dogs, then one description of the data on the pets sold would be that 35% were dogs. Inferential statistics are techniques that allow us to use certain samples to generalize the populations from which the samples were taken. Hence, it is crucial that the sample represents the population accurately. The method to get this done is called sampling. Since Amity Directorate of Distance & Online Education 3 Statistics Management ●● Define the population we are studying. ●● Draw a representative sample from that population. ●● Use analyses that incorporate the sampling error. Notes O nl in e the inferential statistics aim at drawing conclusions from a sample and generalizing them to a population, we need to be sure that our sample accurately represents the population. This requirement affects our process. At a broad level, we must do the following: 1.1.4 Importance and Scope of Statistics Condensation: Statistics compresses a mass of figures to small meaningful information, for example, average sales, BSE index, the growth rate etc. It is impossible to get a precise idea about the profitability of a business from a mere record of income and expenditure transactions. The information of Return On Investment (ROI), Earnings Per Share (EPS), profit margins, etc., however, can be easily remembered, understood and thus used in decision-making. ●● Forecast: Statistics helps in forecasting by analyzing trends, which are essential ity ●● er s for planning and decision-making. Predictions based on the gut feeling or hunch can be harmful for the business. 
For example, to decide the refining capacity for a petrochemical plant, it is required to predict the demand of petrochemical product mix, supply of crude oil, the cost of crude, substitution products, etc., for next 10 to 20 years, before committing an investment. Testing of hypotheses: Hypotheses are the statements about population parameters based on past knowledge or information. It must be checked for its validity in the light of current information. Inductive inference about the population based on the sample estimates involves an element of risk. However, sampling keeps the decision-making costs low. Statistics provides quantitative base for testing our beliefs about the population. ●● Relationship between Facts: Statistical methods are used to investigate the cause and effect relationship between two or more facts. The relationship between demand and supply, money-supply and price level can be best understood with the help of statistical methods. ●● Expectation: Statistics provides the basic building block for framing suitable policies. For example how much raw material should be imported, how much ity U ni v ●● m capacity should be installed, or manpower recruited, etc., depends upon the expected value of outcome of our present decisions )A 1.1.5 Population and Sample Sample (c A sample consists of one or more observations drawn from the population. Sample is the group of people who actually took part in your research. They are the people that are questioned (for example, in a qualitative study) or who actually complete the survey (for example, in a quantitative study). Participants who may have been research participants but didn’t personally participate are not considered part of the survey. Amity Directorate of Distance & Online Education 4 Statistics Management O nl in e A sample data set contains a part, or a subset, of a population. The size of a sample is always less than the size of the population from which it is taken. [Utilizes the count n - 1 in formulas.] Notes Population A population includes all of the elements from a set of data. Population is the broader group of people that you expect to generalize your study results to. Your sample is just going to be a part of the population. The size of your sample will depend on your exact population. A population data set contains all members of a specified group (the entire list of possible data values). [Utilizes the count n in formulas.] ity Example: The population may be “ALL people living in India. For example – Mr. Tom wants to do a statistical analysis on students’ final examination scores in her math class for the past year. Should he consider her data to be a population data set or a sample data set? er s Mrs. Tom is only working with the scores from his class. There is no reason for him to generalize her results to all management students in the school. He has all of the data that pertaining to his investigation = population. 1.2.1 Importance of Graphical Representation of Data ●● One of the most convincing and appealing ways in which statistical results may be represented is through graphs and diagrams.. U ●● ni v Data needs to be process and analyze the data obtained from the field. The processing consists mainly of recording, labeling, classifying and tabulating the collected data so that it is consistent with the report. The data may be viewed either in tabulation form or via charts. Effective use of the data collected primarily depends on how it is organized, presented, and summarized. 
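As a minimal sketch of the recording-and-tabulating step described above, the short Python snippet below groups a small set of raw observations into a frequency table before any chart is drawn; the data values are hypothetical.

```python
from collections import Counter

# Hypothetical raw observations: telephones installed per day by one worker
installs = [3, 5, 4, 5, 3, 6, 5, 4, 3, 5, 6, 4]

# Classify and tabulate: count how often each value occurs
freq_table = Counter(installs)

print("Value  Frequency")
for value in sorted(freq_table):
    print(f"{value:>5}  {freq_table[value]:>9}")
```

Once the data are condensed into such a table, they can be presented through the charts discussed next.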
Diagrams and graphs are extremely used because of the following reasons: Diagrams and Graphs attract to the eye. ●● They have more memorizing effect. ●● It facilitates for easy comparison of data from one period to another. ●● Diagram and graphs give bird’s eye view of entire data; therefore, it conveys meaning very quickly (c )A m ity ●● 1.2.2 Bar Chart In a bar diagram, only the length of the bar is taken into account but not the width. In other words bar is a thick line whose width is merely shown, but length of the bar is taken into account and is called one-dimensional diagram. Simple Bar Diagram It represents only one variable. Since these are of the same width and vary only in lengths (heights), it becomes very easy for a comparative study. Simple bar diagrams Amity Directorate of Distance & Online Education 5 Statistics Management Illustration - 1 Notes O nl in e are very popular in practice. A bar chart can be either vertical or horizontal; for example sales, production, population figures etc. for various years may be shown by simple bar charts The following table gives the birth rate per thousand of different countries over a certain period of time. India Germany U. K. New Zealand Sweden China Birth rate 33 16 20 30 15 40 ni v er s ity Country Sub-divided Bar Diagram U Comparing the size of bars, China’s birth rate is highest, next is India whereas Germany and Sweden equal in the lowest positions. Illustration - 1 ity In a subdivided bar diagram, each bar representing the magnitude of given value is further subdivided into various components. Each component occupies a part of the bar proportional to its share in total. m Present the following data in a sub-divided bar diagram. Science Humanities Commerce 2014-2015 240 560 220 2015-2016 280 610 280 (c )A Year/Faculty Amity Directorate of Distance & Online Education 6 Statistics Management ity O nl in e Notes Multiple Bar Diagram er s In a multiple bar diagram, two or more sets of related data are represented and the components are shown as separate adjoining bars. The height of each bar represents the actual value of the component. The components are shown by different shades or colours. ni v Illustration 1 - Construct a suitable bar diagram for the following data of number of students in two different colleges in different faculties. College Arts Science Commerce Total A 1200 800 600 2600 700 500 600 1800 (c )A m ity U B Amity Directorate of Distance & Online Education 7 Statistics Management In percentage bar diagram the length of the entire bar kept equal to 100 (Hundred). Various segment of each bar may change and represent percentage on an aggregate. Illustation 1 Men Women Children 1995 45% 35% 20% 1996 44% 34% 22% 1997 48% 36% 16% ni v er s ity Year Notes O nl in e Percentage bar Diagram 1.2.3 Pie Chart U A pie chart or a circle chart is a circular statistical graphic, that is divided into slices to illustrate a numerical proportion. In a pie chart, the arc length of each slice is proportional to the quantity it represents. While it is named for its resemblance to a pie which has been sliced, there are variations on the way it can be presented.. In a pie chart, categories of data are represented by wedges in the circle and are proportional in size to the percent of individuals in each category. m ity Pie charts are very widely used in the business world and the mass media. 
Pie charts are generally used to show percentage or proportional data, and usually the percentage represented by each category is written next to the corresponding slice. Pie charts work best when there are about six categories or fewer.

1.2.4 Histogram

A histogram is a graphical display of data that uses bars of different heights. It is similar to a bar chart, but in a histogram each bar represents a range (class interval) rather than a single category, and the height of each bar shows how many observations fall within that range. A histogram can be used when:

●● The data is numerical
●● The shape of the data's distribution is to be viewed, especially when determining whether the output of a process is distributed approximately normally
●● Analysing whether a process can meet the customer's requirements
●● Analysing what the output from a supplier's process looks like
●● Seeing whether a process change has occurred from one time period to another
●● Determining whether the outputs of two or more processes are different
●● You wish to communicate the distribution of data quickly and easily to others

1.2.5 Frequency Polygon

In a frequency polygon, the frequencies are plotted against the mid-points of the class intervals and the points thus obtained are joined by line segments. Comparing a histogram with a frequency polygon, in the frequency polygon points replace the bars (rectangles). Also, when several distributions are to be compared on the same graph paper, frequency polygons are better than histograms.

Illustration 1 - Draw a histogram and frequency polygon from the following data:

Age in Years:       10-20  20-30  30-40  40-50  50-60  60-70  70-80
Number of Persons:  3      16     22     35     24     15     2

1.2.6 Ogives

When frequencies are added progressively, they are called cumulative frequencies. The curve obtained by plotting cumulative frequencies is called a cumulative frequency curve or an ogive (pronounced "ojive"). To construct an ogive: (i) add up the progressive totals of frequencies, class by class, to get the cumulative frequencies; (ii) plot the classes on the horizontal (x) axis and the cumulative frequencies on the vertical (y) axis.

Less than Ogive: To plot a less-than ogive, the data is arranged in ascending order of magnitude and the frequencies are cumulated from the top, i.e. by adding. The cumulative frequencies are plotted against the upper class limits. An ogive drawn under this method gives a rising curve.

Greater than Ogive: To plot a greater-than ogive, the data is arranged in ascending order of magnitude and the frequencies are cumulated from the bottom, i.e. subtracted progressively from the total. The cumulative frequencies are plotted against the lower class limits. An ogive drawn under this method gives a falling curve.

Uses: Certain values like the median, quartiles, quartile deviation, coefficient of skewness, etc. can be located using ogives. Ogives are also helpful in comparing two distributions.

Illustration 1 - Draw less-than and more-than ogive curves for the following frequency distribution and obtain the median graphically. Verify the result.
CI        f     Less than c.f.    More than c.f.
0-20      5     5                 100
20-40     12    17                95
40-60     18    35                83
60-80     25    60                65
80-100    15    75                40
100-120   12    87                25
120-140   8     95                13
140-160   5     100               5

The less-than cumulative frequencies are plotted against the upper class limits (20, 40, ..., 160) and the more-than cumulative frequencies against the lower class limits (0, 20, ..., 140). The two ogives intersect at the median. By formula, N/2 = 50 falls in the class 60-80, so Median = 60 + ((50 - 35)/25) × 20 = 72, which should agree with the value read from the graph.

1.2.7 Pareto Chart

A Pareto chart is a graph showing the frequency of defects and their cumulative effect. Pareto charts are helpful in identifying the defects that should be prioritised to achieve the greatest overall improvement.

The Pareto principle (also known as the 80/20 rule, the law of the vital few, or the principle of factor sparsity) states that, for many events, roughly 80% of the effects come from 20% of the causes.

When to use a Pareto chart:

●● A Pareto chart must be used when analysing data about the frequency of problems or causes in a process
●● A Pareto chart must be used when there are many problems or causes and you want to focus on the most significant
The results are easy to read and explain. This is makes it useful in any type of presentation. 1.2.10 Scatter plot and Trend line er s ity Scatter diagram is the most fundamental graph plotted to show relationship between two variables. It is a simple way to represent bivariate distribution. Bivariate distribution is the distribution of two random variables. Two variables are plotted one against each of the X and Y axis. Thus, every data pair of (xi , yi ) is represented by a point on the graph, x being abscissa and y being the ordinate of the point. From a scatter diagram we can find if there is any relationship between the x and y, and if yes, what type of relationship. Scatter diagram thus, indicates nature and strength of the correlation. The pattern of points obtained by plotting the observed points are knows as scatter diagram. ni v It gives us two types of information. ●● Whether the variables are related or not. ●● If so, what kind of relationship or estimating equation that describes the relationship. U If the dots cluster around a line, the correlation is called linear correlation. If the dots cluster around a curve, the correlation is called a non-linear or curve linear correlation. (c )A m ity Scatter diagram is drawn to visualize the relationship between two variables. The values of more important variable are plotted on the X-axis while the values of the variable are plotted on the Y-axis. On the graph, dots are plotted to represent different pairs of data. When dots are plotted to represent all the pairs, we get a scatter diagram. The way the dots scatter gives an indication of the kind of relationship which exists between the two variables. While drawing scatter diagram, it is not necessary to take at the point of sign the zero values of X and Y variables, but the minimum values of the variables considered may be taken. ●● When there is a positive correlation between the variables, the dots on the scatter diagram run from left hand bottom to the right hand upper corner. In case of perfect positive correlation all the dots will lie on a straight line. ●● When a negative correlation exists between the variables, dots on the scatter diagram run from the upper left hand corner to the bottom right hand corner. In case of perfect negative correlation, all the dots lie on a straight line. Amity Directorate of Distance & Online Education 13 Statistics Management Advertisement cost in ‘000’ 40 65 60 90 85 75 35 90 34 Sales in Lakh ` 45 56 58 82 65 70 64 85 50 76 85 er s ity Solution: Notes O nl in e Example: Figures on advertisement expenditure (X) and Sales (Y) of a firm for the last ten years are given below. Draw a scatter diagram. U ni v A scatter diagram gives two very useful types of information. First, we can observe patterns between variables that indicate whether the variables are related. Secondly, if the variables are related we can get idea of what kind of relationship (linear or nonlinear) would describe the relationship. Correlation examines the first question of determining whether an association exists between the two variables, and if it does, to what extent. Regression examines the second question of establishing an appropriate relation between the variables. 1.3.1 Arithmetic mean - intro and application ity The mean is the average of the numbers. It is easy to calculate: add up all the numbers, then divide by how many numbers there are. In other words it is the sum divided by the count. 
m Arithmetic mean is defined as the value obtained by dividing the total values of all items in the series by their number. In other word is defined as the sum of the given observations divided by the number of observations, i.e., add values of all items together and divide this sum by the number of observations. )A Symbolically – x = x1 + x2 + x3 + xn/n Properties of Arithmetic Mean The sum of the deviations, of all the values of x, from their arithmetic mean, is zero. 2. The product of the arithmetic mean and the number of items gives the total of all items. 3. Finding the combined arithmetic mean when different groups are given. (c 1. Amity Directorate of Distance & Online Education 14 Demerits of Arithmetic Mean Notes O nl in e Statistics Management 1. Arithmetic mean is affected by the extreme values. 2. Arithmetic mean cannot be determined by inspection and cannot be located graphically. 3. Arithmetic mean cannot be obtained if a single observation is lost or missing. 4. Arithmetic mean cannot be calculated when open-end class intervals are present in the data. Arithmetic Mean for Ungrouped Data ity Individual Series 1. Direct Method The following steps are involved in calculating arithmetic mean under an individual series using direct method: Add up all the values of all the items in the series. - Divide the sum of the values by the number of items. The result is the arithmetic mean. er s - The following formula is used: X = Ʃ x/N ni v Where, X = Arithmetic mean Ʃx = Sum of the values N = Number of items. Illustration 1 – Value(x) – 125 128 132 135 140 148 155 157 159 191 Calculate the arithmetic mean U Solution – Total number of terms = N = 10 Mean = Ʃ x = 125 128 132 135 140 148 155 157 159 191 = 1440 ity X = Ʃ x/n = 1440/10 = 144 (c )A m 2. Short-cut Method or Indirect method The following steps are involved in calculating arithmetic mean under individual series using short-cut or indirect method: 1. Assume one of the values in the series as an average. It is called as working mean or assumed average. 2. Find out the deviation of each value from the assumed average. 3. Add up the deviations 4. Apply the following formula. X = A d N + Ʃ where, X = Arithmetic mean A = Assumed average Ʃd = Sum of the deviations N = Number of items Amity Directorate of Distance & Online Education 15 Statistics Management Roll No 1 2 3 4 5 6 7 8 9 10 Marks 43 48 65 57 31 60 37 48 78 59 Solution – Marks Obtained D = 60 1 43 -17 2 48 -12 3 65 5 4 57 5 31 6 60 7 37 8 48 9 78 10 59 Combined Arithmetic Mean -29 er s 0 -23 -12 18 -1 Ʃd = – 74 U Ʃ 60 + (- 74/10) = 52.6 marks -3 ni v X = a +Ʃd/N ity Roll No Notes O nl in e Illustration - 1 Calculate the arithmetic average of the data given below using short–cut method ity Arithmetic mean and number of items of two or more related groups are known as combined mean of the entire group. The combined average of two series can be calculated by the given formula – n1x1 + n2x2/ n1 + n2 m Where, n1 = No. of items of the first group, n2 = No. of items of the second group x1 = A.M of the first group, x2 = A.M of the second group, )A Example - From the following data ascertain the combined mean of a factory consisting of 2 branches namely branch A and Branch B. In branch A the number of workers is 500, and their average salary of 300. In branch B the number of workers is 1,000 and their average salary is 250 (c Solution: Let the no. of workers in branch A be n1 = 500 Let the no. 
of workers in branch B be n2 = 1,000

Average salary x̄1 = 300
Average salary x̄2 = 250

Combined mean = (n1x̄1 + n2x̄2) / (n1 + n2) = (500 × 300 + 1000 × 250) / (500 + 1000) = (1,50,000 + 2,50,000) / 1500 = 266.67

Weighted Arithmetic Mean

Sometimes some observations carry relatively more importance than others. Such observations must be given weights on the basis of their relative importance. In the weighted arithmetic mean, the value of each item is multiplied by its weight, and the sum of these products is divided by the sum of the weights.

Symbolically, X̄w = Σwx / Σw

Example - Calculate the simple and weighted average price from the following data:

Month:                    Jan    Feb     March  April  May    June
Price per tonne:          42.5   51.25   50     52     44.25  54
No. of tonnes purchased:  25     30      40     50     10     45

Solution:

Month    Price per tonne (x)    No. of tonnes purchased (w)    wx
Jan      42.5                   25                             1062.5
Feb      51.25                  30                             1537.5
March    50                     40                             2000
April    52                     50                             2600
May      44.25                  10                             442.5
June     54                     45                             2430
         Σx = 294 (N = 6)       Σw = 200                       Σwx = 10027.5

Simple AM: X̄ = Σx/N = 294/6 = 49
Weighted AM: X̄w = Σwx/Σw = 10027.5/200 = 50.14 (approx.)

The correct average price paid is about ₹50.14 and not ₹49, i.e. the weighted arithmetic mean is more appropriate here than the simple arithmetic mean.

1.3.2 Median - Intro and Application

Median is defined as the value of the item dividing the series into two equal halves, where one half contains all values less than (or equal to) it and the other half contains all values greater than (or equal to) it. It is also defined as the "central value" of the variable. To find the median, the values of the items must be arranged in order of their size or magnitude. Median is a positional average. The term position refers to the place of a value in the series; the place of the median is such that an equal number of items lies on either side of it, and therefore it is also called a locative average.

Merits of Median

Following are the advantages of median:

●● It is rigidly defined.
●● It is easy to calculate and understand.
●● It can be located graphically.
●● It is not affected by extreme values like the arithmetic mean.
●● It can be found by mere inspection.
●● It can be used for qualitative studies.
●● Even if the extreme values are unknown, the median can be calculated if the number of items is known.

Demerits of Median

Following are the disadvantages of median:

●● In the case of individual observations, the values have to be arranged in order of their size to locate the median. Such an arrangement of data is a tedious task if the number of items is large.
●● If the median is multiplied by the number of items, the total value of all the items cannot be obtained, as it can in the case of the arithmetic average.
●● It is not suitable for complex algebraic or mathematical treatment.
●● It is more affected by sampling fluctuations.

Application of Median

Example - Determine the median of the following values: 25, 15, 23, 40, 27, 25, 23, 25, 20

Solution - Arranging the figures in ascending order:

S.no     1    2    3    4    5    6    7    8    9
Value    15   20   23   23   25   25   25   27   40

Median = size of the (N + 1)/2 th item = (9 + 1)/2 = 5th item = 25

1.3.3 Mode - Intro and Application

The word "mode" is derived from the French word "la mode", which means fashion. So the mode can be regarded as the most fashionable item in the series or the group.
er s Croxtan and Cowden regard mode as “the most typical of a series of values”. As a result it can sum up the characteristics of a group more satisfactorily than the arithmetic mean or median. Mode is defined as the value of the variable occurring most frequently in a distribution. In other words it is the most frequent size of item in a series. ni v Merits of Mode ●● The most important advantage of mode is that it is usually on an actual value. ●● In the case of discrete series, mode can be easily located by inspection. ●● Mode is not affected by extreme values. ●● U The following are the merits of mode: It is easy to understand and this average is used by people in their every day speech. ity ●● Mode can be determined even if extreme values are not given. Demerits of Mode (c )A m The following are the demerits of mode: ●● It is not based on all the observation of the data ●● In a number of cases there will be more than one mode in the series. ●● If mode is multiplied by the number of items, the product will not be equal to the total value of the items. ●● It will not truly represent the group if there are a small number of items of the same size in a large group of items of different sizes ●● It is not suitable for further mathematical treatment Applications of Mode Mode in Ungrouped Data Amity Directorate of Distance & Online Education 19 Statistics Management Notes O nl in e Individual Series The mode of this series can be obtained by mere inspection. The number which occurs most often is the mode. Illustration - 1 Locate mode in the data 7, 12, 8, 5, 9, 6, 10, 9, 4, 9, 9 Solution: On inspection, it is observed that the number 9 has maximum frequency i.e., repeated maximum of 4 times than any other number. Therefore mode (Z) = 9 Discrete Series ity The mode is calculated by applying grouping and analysis table. Grouping Table: Consisting of six columns including frequency column, 1st column is the frequency 2nd and 3rd column is the grouping two way frequencies and 4th, 5th and 6th column is the grouping three way frequencies. ●● Analysis table: consisting of 2 columns namely tally bar and frequency Steps in Calculating Mode in Discrete Series er s ●● ni v The following steps are involved in calculating mode in discrete series: Group the frequencies by two’s. ●● Leave the frequency and group the other frequencies in two’s. ●● Group the frequencies in threes. ●● Leave the frequency of the first size and add the frequencies of other sizes in three’s. ●● Leave the frequencies of the first two sizes and add the frequencies of the other sizes in threes. ●● Prepare an analysis table to know the size occurring the maximum number of times. Find out the size, which occurs the largest number of times. That particular size is the mode. ity U ●● m Continuous Series The following steps are involved in calculating mode in continuous series. )A Find out the modal class. Modal class can be easily found out by inspection. The group containing maximum frequency is the modal group. Where two or more classes appear to be a modal class group, it can be decided by grouping process and preparing an analyzed table as was discussed in question number 2.102. The actual value of mode is calculated by applying the following formula. (c Mo = l + fm – f1 / 2fm – f1 – f2 . i Amity Directorate of Distance & Online Education 20 Statistics Management O nl in e 1.3.4 Partition values - Quartiles and Percentiles Notes A percentile is the value below which a percentage of data falls. 
Example: You are the fourth tallest person in a group of 20 80% of people are shorter than you: That means you are at the 80th percentile. If your height is 1.65m then “1.65m” is the 80th percentile height in that group. Quartiles are the values that split data into quarters. Quartiles are values that divide a (part of a) data table into four groups containing an approximately equal number of observations. The total of 100% is split into four equal parts: 25%, 50%, 75% and 100%. ity The Quartiles also divide the data into divisions of 25%, so: Quartile 1 (Q1) can be called the 25th percentile ●● Quartile 2 (Q2) can be called the 50th percentile ●● Quartile 3 (Q3) can be called the 75th percentile er s ●● Example: For 1, 3, 3, 4, 5, 6, 6, 7, 8, 8: The 25th percentile = 3 ●● The 50th percentile = 5.5 ●● The 75th percentile = 7 ni v ●● The percentiles and quartiles are computed as follows: 1. The f-value of each value in the data table is computed: i–1 n–2 U fi = where i is the index of the value, and n the number of values. The first quartile is computed by interpolating between the f-values immediately below and above 0.25, to arrive at the value corresponding to the f-value 0.25. (c )A m ity 2. 3. The third quartile is computed by interpolating between the f-values immediately below and above 0.75, to arrive at the value corresponding to the f-value 0.75. 4. Any other percentile is similarly calculated by interpolating between the appropriate values. 1.3.5 Measures of Dispersion - Range - intro and Application A measure of dispersion or variation in any data shows the extent to which the numerical values tend to spread about an average. If the difference between items is small, the average represents and describes the data adequately. For large differences it is proper to supplement information by calculating a measure of dispersion in addition to an average. It is useful to determine data for the knowledge it may serve: ●● To compare the current results with the past results. Amity Directorate of Distance & Online Education 21 Statistics Management To compare two are more sets of observations. ●● To suggest methods to control variation in the data. Notes O nl in e ●● A study of variations helps us in knowing the extent of uniformity or consistency in any data. Uniformity in production is an essential requirement in industry. Quality control methods are based on the laws of dispersion. Absolute and Relative Measures of Dispersion ity The measures of dispersion can be either ‘absolute’ or “relative”. Absolute measures of dispersion are expressed in the same units in which the original data are expressed. For example, if the series is expressed as Marks of the students in a particular subject; the absolute dispersion will provide the value in Marks. The only difficulty is that if two or more series are expressed in different units, the series cannot be compared on the basis of dispersion. er s ‘Relative’ or ‘Coefficient’ of dispersion is the ratio or the percentage of a measure of absolute dispersion to an appropriate average. The basic advantage of this measure is that two or more series can be compared with each other despite the fact they are expressed in different units. A precise measure of dispersion is one that gives the magnitude of the variation in a series, i.e. it measures in numerical terms, the extent of the scatter of the values around the average. 
Measures of Dispersion

When dispersion is measured in terms of the original units of a series, it is absolute dispersion or variability. It is difficult to compare absolute values of dispersion across different series, especially when the series are in different units or have different sets of values. A good measure of dispersion should have properties similar to those described for a good measure of central tendency. The common absolute measures and their corresponding relative measures are:

Absolute Measure           Relative Measure
The Range                  Relative Range (Coefficient of Range)
The Quartile Deviation     Relative (Coefficient of) Quartile Deviation
The Mean Deviation         Relative (Coefficient of) Mean Deviation
The Standard Deviation     Coefficient of Variation
(Graphical methods may also be used to display variability.)

Range

Definition: The 'Range' of the data is the difference between the largest value and the smallest value of the data.

This is an absolute measure of variability. However, if we have to compare two sets of data, the range may not give a true picture. In such a case, the relative measure of range, called the coefficient of range, is used.

Formulae: Range = L - S, where L is the largest value and S is the smallest value.
Coefficient of Range = (L - S) / (L + S)

In individual observations and discrete series, L and S are easily identified. In continuous series, the following two methods are used:

Method 1: L - upper boundary of the highest class; S - lower boundary of the lowest class.
Method 2: L - mid value of the highest class; S - mid value of the lowest class.

Example: Find the range of the set of observations 10, 5, 8, 11, 12, 9.

Solution: L = 12, S = 5
Range = L - S = 12 - 5 = 7
Coefficient of range = (L - S) / (L + S) = (12 - 5) / (12 + 5) = 7/17 = 0.4118

Interquartile Range and Deviations

Inter-quartile range and deviations are described in the following sub-sections.

Inter-quartile Range

The inter-quartile range is the difference between the upper quartile (third quartile) and the lower quartile (first quartile). Thus, Inter Quartile Range = (Q3 - Q1)

Quartile Deviation

Quartile deviation is half the difference between the upper and lower quartiles. Thus, Quartile Deviation = QD = (Q3 - Q1)/2. It gives the average deviation of the upper and lower quartiles from the median. The corresponding relative measure is the Coefficient of Quartile Deviation = (Q3 - Q1) / (Q3 + Q1).

Example: The weekly wages of labourers are given below. Calculate the Q.D. and the coefficient of Q.D.

Weekly wages:     100   200   400   500   600   Total
No. of weeks:     5     8     21    12    6     52

Solution:

Weekly wages    No. of weeks    Cumulative frequency
100             5               5
200             8               13
400             21              34
500             12              46
600             6               52
                N = 52

Q1 = size of the (N + 1)/4 th item = (52 + 1)/4 = 13.25th item
Q1 = 13th value + 0.25 (14th value - 13th value)
   = 200 + 0.25 (400 - 200)
   = 200 + 50
   = 250

Q3 = size of the 3(N + 1)/4 th item = 3 × 13.25 = 39.75th item
Q3 = 39th value + 0.75 (40th value - 39th value)
   = 500 + 0.75 (500 - 500)
   = 500

Q.D. = (Q3 - Q1)/2 = (500 - 250)/2 = 250/2 = 125

Coefficient of Q.D. = (Q3 - Q1) / (Q3 + Q1) = (500 - 250) / (500 + 250) = 250/750 = 0.333

1.3.7 Standard Deviation and Variance

Variance is defined as the average of the squared deviations of the data points from their mean. When the data constitute a sample, the variance is denoted by σx² and the averaging is done by dividing the sum of the squared deviations from the mean by 'n - 1'.
When the observations constitute the population, the variance is denoted by σ² and we divide by N for the average.

Formulas for calculating variance:

Sample Variance: Var(x) = σx² = Σ(xi - x̄)² / (n - 1), for i = 1, 2, ..., n

Population Variance: Var(x) = σ² = Σ(xi - µ)² / N

Where,
x̄ = sample mean
n = sample size
µ = population mean
N = population size
xi for i = 1, 2, ..., n are the observation values

The population variance can also be written as:

Var(x) = σ² = Σ(xi - µ)² / N
            = Σ(xi² - 2µxi + µ²) / N
            = [Σxi² - 2µΣxi + Nµ²] / N
            = Σxi² / N - µ²

That is, Var(x) = E(X²) - [E(X)]²

Standard Deviation

Definition: Standard deviation is the root mean square deviation of the values from their arithmetic mean. S.D. is denoted by the symbol σ (read sigma). The standard deviation (SD) of a set of data is the positive square root of the variance of the set. It is also referred to as the Root Mean Square (RMS) value of the deviations of the data points. The SD of a sample is the square root of the sample variance, i.e. σx, and the standard deviation of a population is the square root of the variance of the population, denoted by σ.

The properties of standard deviation are:

●● It is the most important and widely used measure of variability.
●● It is based on all the observations.
●● Further mathematical treatment is possible.
●● It is affected least by sampling fluctuations.
●● It is affected by extreme values and gives more importance to values that are far away from the mean.
●● The main limitation is that we cannot compare the variability of different data sets given in different units.

Formula for Calculating S.D.

For the set of values x1, x2, ..., xn:

σ = √[ Σx²/n - (Σx/n)² ]

If an assumed value A is taken as the mean and d = X - A, then

σ = √[ Σd²/n - (Σd/n)² ]

For a frequency distribution,

σ = √[ Σfd²/N - (Σfd/N)² ] × C

where d = (X - A)/C, C is the class interval and N is the total frequency.

Application of Standard Deviation

Example: Find the standard deviation for the following data:

Class Interval:  0-10  10-20  20-30  30-40  40-50  50-60  60-70
Frequency:       6     14     10     8      1      3      8

Solution:

Class Interval   Class Mark mi   Frequency fi   fi × mi   di = (mi - A)   di²    fi × di²
0-10             5               6              30        -25             625    3750
10-20            15              14             210       -15             225    3150
20-30            25              10             250       -5              25     250
30-40            35              8              280       5               25     200
40-50            45              1              45        15              225    225
50-60            55              3              165       25              625    1875
60-70            65              8              520       35              1225   9800
Total                            Σfi = 50       1500                             19250

(Here the assumed mean is A = 30.)

Mean = 1500/50 = 30
SD = √(19250/50) = √385 = 19.62

Combined Standard Deviation

The mean and S.D. of two groups are given in the following table:

Group    Mean    S.D.    Size
I        x̄1      σ1      n1
II       x̄2      σ2      n2

Let x̄ and σ be the mean and S.D. of the combined group of (n1 + n2) items. Then x̄ and σ are determined by the formulae:

x̄ = (n1x̄1 + n2x̄2) / (n1 + n2)

σ² = [n1σ1² + n2σ2² + n1d1² + n2d2²] / (n1 + n2)

or equivalently, σ = √{ [n1σ1² + n2σ2² + n1d1² + n2d2²] / (n1 + n2) }

where d1 = x̄1 - x̄ and d2 = x̄2 - x̄.

These results can be extended to three samples as follows:

x̄ = (n1x̄1 + n2x̄2 + n3x̄3) / (n1 + n2 + n3)

σ² = [n1σ1² + n2σ2² + n3σ3² + n1d1² + n2d2² + n3d3²] / (n1 + n2 + n3)
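To tie these formulas back to the worked example above, the brief Python sketch below recomputes the grouped mean and population standard deviation from the class marks and frequencies in that table and reproduces SD ≈ 19.62; a sample standard deviation would divide by n − 1 instead of n.

```python
from math import sqrt

# Grouped data from the worked example above: class marks and frequencies
mid = [5, 15, 25, 35, 45, 55, 65]
f   = [6, 14, 10,  8,  1,  3,  8]

n = sum(f)                                                # 50
mean = sum(fi * m for fi, m in zip(f, mid)) / n           # 30.0

# Population variance of the grouped data (divide by n)
var_pop = sum(fi * (m - mean) ** 2 for fi, m in zip(f, mid)) / n
sd_pop = sqrt(var_pop)                                    # ~19.62, matching the example

print(mean, round(sd_pop, 2))
```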
1.3.8 Relative Measure of Dispersion - Coefficient of Variation

The coefficient of variation (CV) is defined as the ratio of the standard deviation to the mean, multiplied by 100:

CV = (σ / µ) × 100

It is a relative measure of variability. A smaller value of CV indicates greater stability and lesser variability.

Example: Two batsmen A and B made the following scores in the preliminary round of a World Cup series of cricket matches.

A: 14, 13, 26, 53, 17, 29, 79, 36, 84 and 49
B: 37, 22, 56, 52, 28, 30, 37, 48, 20 and 40

Who will you select for the final? Justify your answer.

Solution: We first calculate the mean, standard deviation and Karl Pearson's coefficient of variation for each player. We will select the player based on the average score as well as consistency: we want a player who not only scores at a high average but also does so consistently, so that the probability of his playing a good innings in the final is high.

For Player 'A' (using the direct method):

Score xi     Deviation (xi - µ)    (xi - µ)²              xi²
14           -26                   676                    196
13           -27                   729                    169
26           -14                   196                    676
53           13                    169                    2809
17           -23                   529                    289
29           -11                   121                    841
79           39                    1521                   6241
36           -4                    16                     1296
84           44                    1936                   7056
49           9                     81                     2401
Σxi = 400    Σ(xi - µ) = 0         Σ(xi - µ)² = 5974      Σxi² = 21974

Mean = 400/10 = 40
Variance = Var(x) = Σ(xi - µ)²/N = 5974/10 = 597.4
Standard deviation = σ = √597.4 = 24.44
CV(A) = (24.44/40) × 100 = 61.1%

For Player 'B', the same steps give Σxi = 370, so the mean = 370/10 = 37, Σ(xi - µ)² = 1360, variance = 136, standard deviation = √136 = 11.66 and CV(B) = (11.66/37) × 100 = 31.5%.

Player A has a slightly higher average (40 against 37), but Player B is far more consistent (CV of about 31.5% against 61.1%). On the criterion of consistency combined with a comparable average, Player B should be selected for the final.

Key Terms

●● Sample: A sample consists of one or more observations drawn from the population. The sample is the group of people who actually took part in your research.
●● Population: A population includes all of the elements from a set of data. The population is the broader group of people to whom you expect to generalise your study results.
●● Frequency Polygon: The frequencies are plotted against the mid-points of the class intervals and the points thus obtained are joined by line segments.
●● Bar Diagram: Only the length of the bar is taken into account, not the width. In other words, a bar is a thick line whose width is shown merely for appearance; since only the length matters, it is called a one-dimensional diagram.
●● Simple Bar Diagram: It represents only one variable. Since the bars are of the same width and vary only in length (height), comparative study becomes very easy. Simple bar diagrams are very popular in practice.
●● Percentage Bar Diagram: The length of the entire bar is kept equal to 100. The segments of each bar vary and represent the percentage share of each component in the aggregate.
●● Range: The 'Range' of the data is the difference between the largest value and the smallest value of the data.

Check your progress

1. A frequency polygon is constructed by plotting the frequency of the class interval against the
   a) Lower limit of the class
   b) Upper limit of the class
   c) Any value of the class
   d) Middle limit of the class
2. Numerical methods and graphical methods are specialized procedures used in
   a) Social Statistics
   b) Descriptive Statistics
   c) Education Statistics
   d) Business Statistics
3. A histogram consists of a set of
   a) Adjacent triangles
   b) Adjacent rectangles
   c) Non adjacent rectangles
   d) Adjacent squares
4. Component bar charts are used when data is divided into
   a) Circles
   b) Squares
   c) Parts
   d) Groups
5. A circle in which sectors represent various quantities is called a(n)
   a) Histogram
   b) Pie chart
   c) Frequency Polygon
   d) Ogive

Questions and Exercises

1. What do you mean by statistics?
2. What are the various types of bar diagrams?
3. What are the merits of mean, median and mode?
4. What do you understand by standard deviation and combined standard deviation?
5. Find the standard deviation for the following data:

   Class Interval:  0-10  10-20  20-30  30-40  40-50  50-60  60-70
   Frequency:       8     14     10     6      4      8      3

Check your progress - Answers

1. d) Middle limit of the class
2. b) Descriptive Statistics
3. b) Adjacent rectangles
4. c) Parts
5. b) Pie chart

Further Readings

1. Richard I. Levin, David S. Rubin, Sanjay Rastogi, Masood Husain Siddiqui, Statistics for Management, Pearson Education, 7th Edition, 2016.
2. Prem S. Mann, Introductory Statistics, 7th Edition, Wiley India, 2016.
3. Gareth James, Daniela Witten, Trevor Hastie, Robert Tibshirani, An Introduction to Statistical Learning with Applications in R, Springer, 2016.

Bibliography

1. Srivastava V. K. et al. - Quantitative Techniques for Managerial Decision Making, Wiley Eastern Ltd.
2. Richard I. Levin and Charles A. Kirkpatrick - Quantitative Approaches to Management, McGraw Hill, Kogakusha Ltd.
3. Prem S. Mann, Introductory Statistics, 7th Edition, Wiley India, 2016.
4. Budnik, Frank S., Dennis Mcleaavey, Richard Mojena - Principles of Operation Research, AITBS, New Delhi.
5. Sharma J. K. - Operation Research: Theory and Applications, Macmillan, New Delhi.
6. Kalavathy S. - Operation Research, Vikas Publishing.
7. Gould F. J. - Introduction to Management Science, Englewood Cliffs, N.J., Prentice Hall.
8. Naray J. K. - Operation Research: Theory and Applications, Macmillan, New Delhi.
9. Taha Hamdy - Operations Research, Prentice Hall of India.
10. Tulasian - Quantitative Techniques, Pearson Education.
11. Vohra N. D. - Quantitative Techniques in Management, TMH.
12. Stevenson W. D. - Introduction to Management Science, TMH.

Module-2: Probability Theory

Learning Objective:

●● To get familiarized with business problems associated with the concept of probability and probability distributions
●● To understand the MS Excel applications of Binomial, Poisson and Normal probabilities

Learning Outcome:

At the end of the course, the learners will be able to -

●● Compute Binomial, Poisson and Normal probabilities through MS Excel
●● Understand various theorems and principles of probability

Prof.
Boddington- er s “defined statistics as the science of estimates and probabilities” 2.1.1 Probability – Introduction ni v A probability is the quantitative measure of risk. Statistician I.J. Good suggests, “The theory of probability is much older than the human species, since the assessment of uncertainty incorporates the idea of learning from experience, which most creatures do.” U Probability and sampling are inseparable parts of statistics. Before we discuss probability and sampling distributions, we must be familiar with some common terms used in theory of probability. Although these terms are commonly used in business, they have precise technical meaning. ity Random Experiment: In theory of probability, a process or activity that results in outcomes under study is called experiment, for example, sampling from a production lot. Random experiment is an experiment whose outcome is not predictable in advance. There is a chance or risk (sometimes also called as uncertainty) associated with each outcome. (c )A m Sample Space: It is a set of all possible outcomes of an experiment. It is usually represented as S. Example: If the random experiment is rolling of a die, the sample space is a set, S = {1, 2, 3, 4, 5, 6}. Similarly, if the random experiment is tossing of three coins, the sample space is, S = {HHH, HHT, HTH, THH, HTT, THT, TTH, TTT} with total of 8 possible outcomes. (H is heads, and T is Tails showing up.) If we select a random sample of 2 items from a production lot and check them for defect, the sample space will be S = {DD, DS, DR, RS, RR, SS} where D stands for defective, S stands for serviceable and R stands for re-workable. ●● Event: One or more possible outcomes that belong to certain category of our interest are called as event. A sub set E of the sample space S is an event. In other words, an event is a favorable outcome. Amity Directorate of Distance & Online Education 33 Statistics Management Event space: It is a set of all possible events. It is usually represented as E. Note that usually in probability and statistics; we are interested in number of elements in sample space and number of elements in event space. ●● Union of events: If E and F are two events, then another event defined to include all outcomes that are either in E or in F or in both is called as a union of events E and F. It is denoted as E U F. ●● Intersection of events: If E and F are two events, then another event defined to include all outcomes that are in both E and F is called as an intersection of events E and F. It is denoted as E∩ F. ●● Mutually exclusive events: The events E and F are said to be mutually exclusive events if they have no outcome of the experiment common to them. In other words, events E and F are said to be mutually exclusive events if E∩ F = φ, where φ is a null or empty set. ●● Collectively exhaustive events: The events are collectively exhaustive if their union is the sample space. ●● Complement of event: Complement of an event E is an event which consists of all outcomes that are not in the E. It is denoted as EC. Thus, E ∩ EC = φ and E U EC = S Notes er s ity O nl in e ●● 2.1.2 Types of Events ni v A probability event can be defined as a set of outcomes of an experiment. In other words, an event in probability is the subset of the respective sample space. A random experiment ‘s entire potential set of outcomes is the sample space or the individual space of that encounter. The probability of an occurrence happening is called chance. 
The likelihood of any event happening lies between 0 and 1. U For example – The sample space for the tossing of three coins simultaneously is given by: ity S = {(T, T, T), (T, T, H), (T, H, T), (T, H, H), (H, T, T), (H, T, H), (H, H, T), (H, H, H)} Suppose, if we want to find only the outcomes which have at least two heads; then the set of all such possibilities can be given as: E = { (H , T , H) , (H , H ,T) , (H , H ,H) , (T , H , H)} m Thus, an event is a subset of the sample space, i.e., E is a subset of S. )A There could be a lot of events associated with a given sample space. For any event to occur, the outcome of the experiment must be an element of the set of event E. By event it is meant one or more than one outcomes. Example Events: Getting a Tail when tossing a coin is an event ●● Rolling a “5” is an event. (c ●● An event can include several outcomes: ●● Choosing a “King” from a deck of cards (any of the 4 Kings) is also an event Amity Directorate of Distance & Online Education 34 ●● Notes Rolling an “even number” (2, 4 or 6) is an event Events can be: O nl in e Statistics Management ●● Independent (each event is not affected by other events), ●● Dependent (also called “Conditional”, where an event is affected by other events) ●● Mutually Exclusive (events can’t happen at the same time) 2.1.3 Algebra of Events er s Complementary Events ity Events are the outcome of an experiment. The likelihood of an event occurring is the ratio of number of favourable events to total number of occurrences. Often they will happen together with two things occurring or it can happen that just one of them is going to happen. Event algebra can offer an event that performs certain operations over two given events. The operations are union, intersection, complement and difference of two events. As events are the subset of sample space, these operations are performed as set operations. ni v For an event AA, there is a complimentary event BB such that BB represent the set of events which are not in the set AA. For example, if two coins are tossed together then the sample space will be {HT,TH,HH,TT}{HT,TH,HH,TT}. Let AA be the event of getting one head, then the set AA = {HT,TH}{HT,TH}. The complementary events of A, BA, B = {HH,TT}{HH,TT}. Events with AND U AND stands for the intersection of two sets. An event is the intersection of two events if it has got the members present in both the event. For example, if a pair of dice is rolled then the sample space will have 3636 members. Suppose AA is the event of getting both dice having same members and BB is the event having the sum as 66. AA = {(1,1),(2,2),(3,3),(4,4),(5,5),(6,6)}{(1,1),(2,2),(3,3),(4,4),(5,5),(6,6)} ity BB = {(3,3),(1,5),(5,1),(2,4),(4,2)}{(3,3),(1,5),(5,1),(2,4),(4,2)} AA AND BB = {(3,3)}{(3,3)} (c )A m Events with OR OR stands for union of two sets. An event is called union of two events if it has got members present in either of the sets. For example, if two coins are tossed together the sample space, SS = {HT,TH,TT,HH}{HT,TH,TT,HH}. Let event AA be the event having only one head and event BB be the event having two heads. AA = {HT}{HT} BB = {HH}{HH} Union of AA and BB, AA OR BB = {HT,HH}{HT,HH} Events with BUT NOT For two events AA and BB, AA but not BB is the event having all the elements of AA but excluding the elements of BB. This can also be represented as AA - BB. 
Suppose there is an experiment of choosing 4 cards from a deck of 52 cards. Event A is having all cards as red cards and event B is having all cards as kings. Then the event "A but not B" will have all red cards excluding the two red kings.

2.1.4 Addition Rule of Probability

If one task can be done in n1 ways and another task can be done in n2 ways, and if these tasks cannot be done at the same time, then there are (n1 + n2) ways of doing one of these tasks (either one task or the other). When a logical OR is used in deciding the outcomes of the experiment and the events are mutually exclusive, this 'Sum Rule' is applicable.

The addition rule of probability states that:
1. If A and B are any two events, then the probability of the occurrence of either A or B is given by
P(A U B) = P(A) + P(B) – P(A ∩ B)
2. If A and B are two mutually exclusive events, then the probability of the occurrence of either A or B is given by
P(A U B) = P(A) + P(B)

Example: An urn contains 10 balls of which 5 are white, 3 black and 2 red. If we select one ball randomly, how many ways are there that the ball is either white or red?
Solution: The answer is 5 + 2 = 7.

Example: In a triangular series the probability of the Indian team winning the match with Zimbabwe is 0.7 and that with Australia is 0.4. If the probability of India winning both matches is 0.3, what is the probability that India will win at least one match so that it can enter the final?
Solution: Let A denote India winning the match with Zimbabwe and B India winning the match with Australia. Given that P(A) = 0.7, P(B) = 0.4 and P(A ∩ B) = 0.3, the probability that India will win at least one match is
P(A U B) = P(A) + P(B) – P(A ∩ B) = 0.7 + 0.4 – 0.3 = 0.8

2.1.5 Multiplication Rule of Probability

Suppose that a procedure can be broken down into a sequence of two tasks. If there are n1 ways to do the first task and n2 ways to do the second task after the first task has been done, then there are (n1 × n2) ways to do the procedure. In general, if r experiments are to be performed such that the first experiment can result in n1 outcomes, the second (after the first is completed) in n2 outcomes, the third in n3 outcomes, and so on, then there is a total of n1 × n2 × n3 × … × nr possible outcomes of the r experiments.

The multiplicative rule is stated as: if A and B are two independent events, then the probability of the occurrence of both A and B is given by
P(A ∩ B) = P(A) P(B)
It must be remembered that when a logical AND is used to indicate successive experiments, this 'Product Rule' is applicable.

Example: How many outcomes are there if we toss a coin and then throw a die?
Solution: The answer is 2 × 6 = 12.

Example: It has been found that 80% of all tourists who visit India visit Delhi, 70% of them visit Mumbai and 60% of them visit both.
1. What is the probability that a tourist will visit at least one city?
2. Also, find the probability that he will visit neither city.
Solution: Let D indicate a visit to Delhi and M a visit to Mumbai. Given P(D) = 0.8, P(M) = 0.7 and P(D ∩ M) = 0.6.
1. The probability that a tourist will visit at least one city is
P(D U M) = P(D) + P(M) – P(D ∩ M) = 0.8 + 0.7 – 0.6 = 0.9
2. The probability that he will visit neither city is
P(D′ ∩ M′) = 1 – P(D U M) = 1 – 0.9 = 0.1

2.1.6 Conditional, Joint and Marginal Probability

As a measure of uncertainty, probability depends on the information available. If we know that event F has occurred, the probability of event E happening may be different from the original probability of E when we had no knowledge of F. The probability that E occurs given that F has occurred is the conditional probability, denoted P(E|F). If event F occurs, then our sample space is reduced to the event space of F. Also, for event E to occur we must now have both E and F occur simultaneously. Hence the probability that E occurs, given that F has occurred, equals the probability of EF (that is, E ∩ F) relative to the probability of F. Thus,
P(E|F) = P(EF) / P(F)
Another form of the conditional probability rule is
P(EF) = P(E|F) × P(F)
Conditional probability satisfies all the properties and axioms of probability. From now on we write (E ∩ F) as EF, which is a common convention. Conditional probability is the probability that an event will occur given that another event has already occurred. If A and B are two events, then the conditional probability of A given B is written as P(A|B) and read as "the probability of A given that B has already occurred".

Example: The probability that a new product will be successful if a competitor does not launch a similar product is 0.67. The probability that a new product will be successful in the presence of a competitor's new product is 0.42. The probability that the competitor will launch a new product is 0.35. What is the probability that the product will be a success?
Solution: Let S denote that the product is successful, L that the competitor launches a product and LC that the competitor does not launch a product. From the given data,
P(S|LC) = 0.67, P(S|L) = 0.42, P(L) = 0.35. Hence P(LC) = 1 – P(L) = 1 – 0.35 = 0.65.
Using the conditional (total) probability formula, the probability that the product will be a success is
P(S) = P(S|L)P(L) + P(S|LC)P(LC) = 0.42 × 0.35 + 0.67 × 0.65 = 0.5825

2.1.7 Bayes' Theorem

Consider two events, E and F. Whatever the events, we can always say that the probability of E is equal to the probability of the intersection of E and F, plus the probability of the intersection of E and the complement of F. That is,
P(E) = P(E ∩ F) + P(E ∩ FC)

Bayes' Formula
Let E and F be events. We can write E = (E ∩ F) U (E ∩ FC), since any element of E must either be in both E and F, or be in E but not in F. (E ∩ F) and (E ∩ FC) are mutually exclusive, since the former must be in F and the latter must not be in F. We therefore have, by Axiom 3,
P(E) = P(E ∩ F) + P(E ∩ FC)
= P(E|F) × P(F) + P(E|FC) × P(FC)
= P(E|F) × P(F) + P(E|FC) × [1 – P(F)]
Suppose now that E has occurred and we are interested in determining the probability that Fi has occurred. Using the above equations, we have the following proposition:
P(Fi|E) = P(EFi) / P(E) = [P(E|Fi) × P(Fi)] / [Σj P(E|Fj) × P(Fj)], for all i = 1, 2, …, n,
where the sum in the denominator runs over j = 1, 2, …, n. This equation is known as Bayes' formula. If we think of the events Fi as possible 'hypotheses' about some subject matter, say the market shares of competitors, then Bayes' formula tells us how these should be modified by the new evidence of an experiment, say a market survey.
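To make the mechanics concrete, here is a small illustrative Python sketch (not from the original text) that applies the total probability rule and Bayes' formula to three competing hypotheses; the priors and likelihoods are the same figures as the lamp example worked immediately below.

```python
# Priors P(F_i) for three hypotheses (lamp types) and likelihoods P(E | F_i)
# of the evidence E (lamp lasts more than 100 hours)
priors      = {"F1": 0.2, "F2": 0.3, "F3": 0.5}
likelihoods = {"F1": 0.7, "F2": 0.4, "F3": 0.3}

# Total probability rule: P(E) = sum_i P(E | F_i) * P(F_i)
p_e = sum(likelihoods[f] * priors[f] for f in priors)

# Bayes' formula: P(F_i | E) = P(E | F_i) * P(F_i) / P(E)
posteriors = {f: likelihoods[f] * priors[f] / p_e for f in priors}

print(round(p_e, 2))                                    # 0.41
print({f: round(p, 3) for f, p in posteriors.items()})  # ≈ 0.341, 0.293, 0.366

# The posteriors must sum to one
assert abs(sum(posteriors.values()) - 1.0) < 1e-12
```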
Example: A bin contains 3 different types of lamps. The probability that a type 1 lamp will give over 100 hours of use is 0.7, with the corresponding probabilities for type 2 and type 3 lamps being 0.4 and 0.3 respectively. Suppose that 20 per cent of the lamps in the bin are of type 1, 30 per cent are of type 2 and 50 per cent are of type 3. What is the probability that a randomly selected lamp will last more than 100 hours? Given that a selected lamp lasted more than 100 hours, what are the conditional probabilities that it is of type 1, type 2 and type 3?

Solution: Let type 1, type 2 and type 3 lamps be denoted by T1, T2 and T3 respectively. Also, let S denote that a lamp lasts more than 100 hours and SC that it does not. From the given data,
P(S|T1) = 0.7, P(S|T2) = 0.4, P(S|T3) = 0.3
P(T1) = 0.2, P(T2) = 0.3, P(T3) = 0.5
(a) Using the total probability formula,
P(S) = P(S|T1)P(T1) + P(S|T2)P(T2) + P(S|T3)P(T3) = 0.7 × 0.2 + 0.4 × 0.3 + 0.3 × 0.5 = 0.41
(b) Using Bayes' formula,
P(T1|S) = P(S|T1)P(T1) / P(S) = (0.7 × 0.2) / 0.41 = 0.341
P(T2|S) = P(S|T2)P(T2) / P(S) = (0.4 × 0.3) / 0.41 = 0.293
P(T3|S) = P(S|T3)P(T3) / P(S) = (0.3 × 0.5) / 0.41 = 0.366

2.2.1 Random Variables – Introduction

In many practical situations, the random variable of interest follows a specific pattern. Random variables are classified according to the probability mass function in the case of a discrete random variable, and the probability density function in the case of a continuous random variable. When the distributions are entirely known, all statistical calculations are possible. In practice, however, the distributions may not be fully known; but the random variable can often be approximated by one of the known standard types by examining the processes that make it random. These standard distributions are also called 'probability models' or sample distributions. Various characteristics of a distribution, such as its mean, variance and moments, can then be calculated using known closed formulae. We will study some of the common types of probability distributions. The normal distribution is the backbone of statistical inference and hence we will study it in more detail.

There are broadly four theoretical distributions which are generally applied in practice. They are:
1. Bernoulli distribution
2. Binomial distribution
3. Poisson distribution
4. Normal distribution

2.2.2 Mean/Expected Value of a Random Variable

In probability theory, the expected value of a random variable is a generalization of the weighted average and, intuitively, is the arithmetic mean of a large number of independent realizations of that variable. The expected value is also known as the expectation, mathematical expectation, mean, average or first moment.

A random variable is a set of possible values from a random experiment. The mean of a discrete random variable X is a weighted average of the possible values that the random variable can take. Unlike the sample mean of a group of observations, which gives each observation equal weight, the mean of a random variable weights each outcome xi according to its probability pi. The common symbol for the mean (also known as the expected value of X) is μ. It is defined as
μX = x1p1 + x2p2 + … + xkpk = Σ xipi
The formula changes slightly according to what kinds of events are happening. For most simple events, either the expected value formula for a binomial random variable or the expected value formula for multiple events is used.
2.2.3 Variance and Standard Deviation of a Random Variable

The variance is a numerical description of the spread, or dispersion, of a random variable. That is, the variance of a random variable X is a measure of how spread out the values of X are, given how likely each value is to be observed.

Variance: Var(X)
The variance is: Var(X) = Σ x²p − μ²
To calculate the variance:
●● square each value and multiply by its probability,
●● sum them up to get Σ x²p,
●● then subtract the square of the expected value, μ².

Standard Deviation: σ
The standard deviation is the square root of the variance: σ = √Var(X)

2.2.4 Binomial Distribution – Introduction

We often conduct many trials which are independent and identical. Suppose we perform n independent Bernoulli trials (each with two possible outcomes), each of which results in a success with probability p and a failure with probability (1 – p). If the random variable X represents the number of successes that occur in the n trials (the order of the successes is not important), then X is said to be a Binomial random variable with parameters (n, p).

Note that a Bernoulli random variable is a Binomial random variable with parameters (1, p), i.e. n = 1. The probability mass function of a binomial random variable with parameters (n, p) is given by
P(X = i) = nCi p^i (1 – p)^(n – i), for i = 0, 1, 2, …, n
The expected value and variance of a Binomial random variable are
μ = E[X] = np and Var[X] = np(1 – p)

2.2.5 Binomial Distribution – Application

When to use the binomial distribution is an important decision. The binomial distribution can be used when the following conditions are satisfied:
●● Trials are finite (and not very large), performed repeatedly 'n' times.
●● Each trial (random experiment) is a Bernoulli trial, one that results in either success or failure.
●● The probability of success in any trial is 'p' and is constant for each trial.
●● All the trials are independent.
These trials are usually experiments of selection 'with replacement'. In cases where the population is very large, drawing a small sample from it does not change the probability of success significantly; hence each draw can still be treated as a Bernoulli trial.

Following are some real-life examples of applications of the binomial distribution (a short code sketch follows the list):
●● Number of defective bulbs in a lot of n items produced by a machine.
●● Number of female births out of n births in a hospital.
●● Number of correct answers in a multiple-choice test.
●● Number of seeds germinated in a row of n planted seeds.
●● Number of recaptured fish in a sample of n fish.
●● Number of missiles hitting the targets out of n fired.
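The sketch below is an illustrative Python fragment (standard library only, not from the original text) that evaluates the binomial probability mass function together with its mean and variance; the classroom-lights example worked next can be checked with it.

```python
from math import comb

def binom_pmf(i, n, p):
    """P(X = i) for a Binomial(n, p) random variable."""
    return comb(n, i) * p**i * (1 - p)**(n - i)

n, p = 5, 1/3                       # five lights, burn-out probability 1/3
mean = n * p                        # E[X] = np
variance = n * p * (1 - p)          # Var[X] = np(1 - p)

# Probability that 4 or 5 lights are burnt out (classroom unusable)
p_unusable = binom_pmf(4, n, p) + binom_pmf(5, n, p)
print(round(mean, 3), round(variance, 3), round(p_unusable, 4))  # ≈ 0.0453
```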
Example: Suppose that the probability that a light in a classroom will be burnt out is 1/3. The classroom has in all five lights, and it is unusable if the number of lights burning is less than two. What is the probability that the classroom is unusable on a random occasion?

Solution: This is a case of the binomial distribution with n = 5 and p = 1/3. The classroom is unusable if the number of burnouts is 4 or 5, that is, i = 4 or 5. Noting that
P(X = i) = nCi p^i (1 – p)^(n – i),
the probability that the classroom is unusable on a random occasion is
P(X = 4) + P(X = 5) = 5C4 (1/3)^4 (2/3) + 5C5 (1/3)^5 = 0.0412 + 0.0041 = 0.0453

Example: It is observed that 80% of T.V. viewers watch the Aap Ki Adalat programme. What is the probability that at least 80% of the viewers in a random sample of 5 watch this programme?
Solution: This is a case of the binomial distribution with n = 5 and p = 0.8. At least 80% of a sample of 5 means i = 4 or 5, so the required probability is
P(X ≥ 4) = P(X = 4) + P(X = 5) = 5C4 (0.8)^4 (0.2) + 5C5 (0.8)^5 = 0.4096 + 0.3277 = 0.7373

We must remember that a cumulative binomial probability refers to the probability that the binomial random variable falls within a specified range (e.g., is greater than or equal to a stated lower limit and less than or equal to a stated upper limit).

2.2.6 Poisson Distribution – Introduction

A random variable X, taking one of the values 0, 1, 2, …, is said to be a Poisson random variable with parameter λ if, for some λ > 0,
P(X = i) = e^(–λ) λ^i / i!, for i = 0, 1, 2, …
P(X = i) is the probability mass function (p.m.f.) of the Poisson random variable. Its expected value and variance are
μ = E[X] = λ and Var(X) = λ
The Poisson random variable has a wide range of applications. It can also be used as an approximation for a binomial random variable with parameters (n, p) if n is large and p is small enough to make the product np of moderate size. In this case we call np = λ the average rate. Some common examples where the Poisson random variable can be used to define the probability distribution are:
1. Number of accidents per day on an expressway.
2. Number of earthquakes occurring over a fixed time span.
3. Number of misprints on a page.
4. Number of arrivals of calls at a telephone exchange per minute.
5. Number of interrupts per second on a server.

2.2.7 Poisson Distribution – Application

Procedure for Using the Cumulative Poisson Probabilities Table
The Poisson p.m.f. for a given λ and i can be easily calculated using a scientific calculator, but manual calculation of cumulative probabilities (the c.d.f.) becomes too tedious. In such cases we can use the Cumulative Poisson Probabilities table, which is referred to as follows (the same calculation is sketched in code after the list):
●● To find the cumulative Poisson probability for a given λ and i,
●● look for the given value of λ, i.e. the average rate, in the first column of the table;
●● in the first row, look for the value of i, the number of successes;
●● locate the cell at the intersection of the i column and the λ row. The value contained in this cell is the cumulative Poisson probability.
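Such table look-ups can be reproduced by computing the Poisson p.m.f. and c.d.f. directly; the fragment below is an illustrative Python sketch (standard library only, not part of the original text).

```python
from math import exp, factorial

def poisson_pmf(i, lam):
    """P(X = i) for a Poisson random variable with average rate lam."""
    return exp(-lam) * lam**i / factorial(i)

def poisson_cdf(i, lam):
    """P(X <= i), the cumulative Poisson probability."""
    return sum(poisson_pmf(k, lam) for k in range(i + 1))

print(round(poisson_pmf(2, 5), 4))    # exactly 2 events at rate 5  ≈ 0.0842
print(round(poisson_cdf(2, 5), 4))    # at most 2 events at rate 5  ≈ 0.1247
print(round(poisson_cdf(1, 1), 4))    # λ = 1, i = 1 gives ≈ 0.7358, matching the table
```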
Example: The average number of accidents on an expressway is five per week. Find the probability that exactly two accidents take place in a given week. Also find the probability that at most two accidents take place in the next week.
Solution: The number of accidents per week is a Poisson random variable with λ = 5. Hence
P(X = 2) = e^(–5) 5² / 2! = 0.0842
P(X ≤ 2) = p(0) + p(1) + p(2) = e^(–5) (1 + 5 + 12.5) = 0.1247

The Poisson distribution is also useful as an approximation to the binomial when n is large and p is small. For instance, suppose n = 10 and p = 0.1 and we want P{X ≤ 1}.

Method I: Using the binomial distribution with parameters (n = 10, p = 0.1), we get
P{X ≤ 1} = p(0) + p(1) = 10C0 (0.1)^0 (0.9)^10 + 10C1 (0.1)^1 (0.9)^9 = 0.7361
Or, using the Cumulative Binomial Probabilities table, we can read for n = 10, p = 0.1 and i = 1 the cumulative probability 0.7361.

Method II: Using the Poisson distribution (as an approximation to the binomial distribution) with parameter λ = np = 10 × 0.1 = 1, we get
P{X ≤ 1} = p(0) + p(1) = e^(–1) (1)^0 / 0! + e^(–1) (1)^1 / 1! = e^(–1) + e^(–1) = 0.7358
Or, using the Cumulative Poisson Probabilities table, we can read for λ = 1 and i = 1 the cumulative probability 0.7358.
Note that the Poisson distribution gives a reasonably good approximation.

Example: The average time for updating a passbook by a bank clerk is 15 seconds. Someone arrives just ahead of you. Find the probability that you will have to wait for your turn
1. more than 1 minute;
2. less than ½ minute.
Solution: The clerk updates λ = 60/15 = 4 passbooks per minute. You wait more than t minutes only if no passbook is completed within t minutes, i.e. if the Poisson count in t minutes is zero, so P{X > t} = e^(–λt). Hence
P{X > 1} = 1 – F(1) = e^(–4) = 0.0183
P{X < 0.5} = F(0.5) = 1 – e^(–2) = 1 – 0.1353 = 0.8647

2.2.8 Normal Distribution – Introduction, including the Empirical Rule

The normal random variable and its distribution are commonly used in many business and engineering problems. Many other distributions, like the binomial, Poisson, beta, chi-square, Student's t and exponential, can also be approximated by the normal distribution under specific conditions (usually when the sample size is large).

If a random variable is affected by many independent causes, and the effect of each cause is not significantly large compared to the other effects, then the random variable will closely follow the normal distribution. For example, the weights of coffee filled in packs, the lengths of nails manufactured on a machine, the hardness of a ball-bearing surface, the diameters of shafts produced on a lathe and the effectiveness of a training programme on employees' productivity are all examples of normally distributed random variables. Further, many sampling statistics, e.g. sample means X̄, are normally distributed.

Empirical Rule
The empirical rule, also referred to as the three-sigma rule, is a statistical rule which states that for a normal distribution almost all observed data will fall within three standard deviations (denoted by σ) of the mean or average (denoted by μ). The empirical rule can be broken down into three parts:
●● 68% of the data falls within one standard deviation of the mean,
●● 95% falls within two standard deviations,
●● 99.7% falls within three standard deviations.
The empirical rule is often used in statistics for forecasting, especially when obtaining the right data is difficult or impossible. The rule can give a rough estimate of what the data would look like if the entire population could be surveyed.

A random variable X is a normal random variable with parameters μ and σ if the probability density function (p.d.f.) of X is given by
f(x) = (1 / (σ√(2π))) e^(–(x – μ)² / (2σ²)), where –∞ < x < ∞

Properties of the Normal Distribution
This distribution is a bell-shaped curve that is symmetric about μ. It gives a theoretical base to the observation that, in practice, many random phenomena obey, approximately, a normal probability distribution. The mean of a normal random variable is E(X) = μ and its variance is Var(X) = σ². If X is normally distributed with parameters μ and σ, then the random variable aX + b is also normally distributed, with parameters (aμ + b) and |a|σ.
1. It is perfectly symmetric about the mean μ.
2. For a normal distribution, mean = median = mode.
3. It is uni-modal (one mode), with skewness = 0 and excess kurtosis = 0.
4. The normal distribution is a limiting form of the binomial distribution when the number of trials n is large and neither p nor (1 – p) is very small.
5. The normal distribution is a limiting case of the Poisson distribution when the mean μ = λ is very large.
6. While working on probabilities of the normal distribution we usually use normal distribution tables (more often, standard normal distribution tables). While reading these tables, the relevant properties are:
(a) The probability that a normally distributed random variable with mean μ and variance σ² lies between two specified values a and b is P(a < X < b) = the area under the curve p(x) between X = a and X = b.
(b) The total area under the curve p(x) is equal to 1, of which 0.5 lies on either side of the mean.

2.2.9 Standard Normal Distribution

Calculating the cumulative density of the normal distribution involves integration. Further, tabulation has the problem that we would need tables for every possible value of μ and σ² (which is not feasible). Hence, we transform the normal random variable to another random variable known as the standard normal random variable. For this we use the transformation z = (X – μ)/σ; z is a normally distributed random variable with parameters μ = 0 and σ = 1. Any normal random variable can be transformed to the standard normal random variable z. Its cumulative distribution function is
F(a) = ∫ from –∞ to a of (1/√(2π)) e^(–z²/2) dz
This has been calculated for various values of 'a' and tabulated. Also, we know that
F(–a) = 1 – F(a) and P(a < Z < b) = F(b) – F(a)

Example: Tea is filled into packs of 200 gm by a machine with a variability (variance) of 0.25 gm². Packs weighing less than 200 gm would be rejected by customers and are not legally acceptable. Therefore, the marketing and legal departments request the production manager to set the machine to fill slightly more quantity in each pack. However, the finance department objects to this, since it would lead to financial loss due to overfilling the packs. The general manager wants to know the 99% confidence interval when the machine is set at 200 gm, so that he can take a decision. Find the confidence interval. What is your advice to the production manager?
Solution: Let the weight of tea in a pack be a random variable X. We know that the mean μ = 200 gm and the variance σ² = 0.25 gm², i.e. σ = 0.5 gm.
First, we find the value of z for 99% confidence. The standard normal distribution curve is symmetric about the mean; hence, corresponding to 99% confidence, the area under the curve on each side of the mean is 0.99/2 = 0.495. The value of z corresponding to a probability of 0.495 is 2.575. Thus the 99% confidence interval in terms of the variable z is ±2.575, which in terms of the variable x is 200 ± 1.2875, or (198.71 to 201.29).
Note that x = σz + μ = 0.5 × (±2.575) + 200 = 200 ± 1.2875.
Hence, we can advise the production manager to set his machine to fill tea with a mean weight of 201.2875, or say 201.29 gm. In that case we have 99% confidence of meeting the legal requirement while keeping the cost of excess filling of the tea to a minimum.
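The z-transformation and table look-up in this example can be reproduced numerically. The following is an illustrative Python sketch (standard library only, using the same figures as the tea-pack example) that evaluates the standard normal c.d.f. via the error function.

```python
from math import erf, sqrt

def std_normal_cdf(a):
    """F(a) = P(Z <= a) for the standard normal distribution."""
    return 0.5 * (1 + erf(a / sqrt(2)))

mu, sigma = 200.0, 0.5        # pack weight mean (gm) and standard deviation (gm)
z_star = 2.575                # two-sided 99% critical value from the table

# Check that +/- z_star covers about 99% of the standard normal distribution
coverage = std_normal_cdf(z_star) - std_normal_cdf(-z_star)
print(round(coverage, 3))                        # ≈ 0.99

# Convert back to the x scale: x = mu + z * sigma
lower, upper = mu - z_star * sigma, mu + z_star * sigma
print(round(lower, 2), round(upper, 2))          # 198.71, 201.29
```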
Key Terms
●● Probability: The probability of a given event is an expression of the likelihood or chance of occurrence of that event. A probability is a number which ranges from zero to one.
●● Continuous Probability Distributions: Continuous random variables are those that take on any value, including fractions and decimals. Continuous random variables give rise to continuous probability distributions. Continuous is the opposite of discrete.
●● Random Experiment: In the theory of probability, a process or activity that results in outcomes under study is called an experiment, for example, sampling from a production lot.
●● Sample: A sample is that part of the universe which we select for the purpose of investigation. A sample exhibits the characteristics of the universe. The word sample literally means a small universe.
●● Sampling: Sampling is defined as the selection of some part of an aggregate or totality on the basis of which a judgment or inference about the aggregate or totality is made. Sampling is the process of learning about the population on the basis of a sample drawn from it.
●● Stratified random sampling: Stratified random sampling requires the separation of the defined target population into different groups called strata and the selection of a sample from each stratum.
●● Cluster sampling: Cluster sampling is a probability sampling method in which the sampling units are divided into mutually exclusive and collectively exhaustive subpopulations called clusters.
●● Hypothesis testing: Hypothesis testing refers to the formal procedures used by statisticians to accept or reject statistical hypotheses. A hypothesis is an assumption about a population parameter; this assumption may or may not be true.

Check your progress
1. In probability theory, events which can never occur together are classified as
a. Collectively exclusive events
b. Mutually exhaustive events
c. Mutually exclusive events
d. Collectively exhaustive events
2. The value used to measure the distance between the mean and a random variable x in terms of standard deviations is called the
a. Z-value
b. Variance
c. Probability of x
d. Density function of x
3. The ____ test is applied when the sample size is less than 30.
a. T
b. Z
c. Rank
d. None of these
4. Under the non-random sampling method, samples are selected on the basis of
a. Stages
b. Strategy
c. Originality
d. Convenience
5. The probability of a second event, given that the first event has occurred, is classified as
a. Series probability
b. Conditional probability
c. Joint probability
d. Dependent probability

Questions and Exercises
1. What is probability? What do you mean by probability distributions?
2. What is the normal distribution? What are the merits of the normal distribution?
3. What is hypothesis testing?
4. What do you mean by the t-test and the z-test?
5. Explain the Poisson distribution and its applications.

Check your progress – Answers
1. c) Mutually exclusive events
2. a) Z-value
3. a) T test
4. d) Convenience
5. b) Conditional probability

Further Readings
1. Richard I. Levin, David S. Rubin, Sanjay Rastogi and Masood Husain Siddiqui, Statistics for Management, Pearson Education, 7th Edition, 2016.
2. Prem S. Mann, Introductory Statistics, 7th Edition, Wiley India, 2016.
3. Gareth James, Daniela Witten, Trevor Hastie and Robert Tibshirani, An Introduction to Statistical Learning with Applications in R, Springer, 2016.

Bibliography
1. Srivastava V. K. et al., Quantitative Techniques for Managerial Decision Making, Wiley Eastern Ltd.
2. Richard I. Levin and Charles A. Kirkpatrick, Quantitative Approaches to Management, McGraw-Hill Kogakusha Ltd.
3. Prem S. Mann, Introductory Statistics, 7th Edition, Wiley India, 2016.
4. Frank S. Budnick, Dennis McLeavey and Richard Mojena, Principles of Operations Research, AITBS, New Delhi.
5. Sharma J. K., Operations Research: Theory and Applications, Macmillan, New Delhi.
6. Kalavathy S., Operations Research, Vikas Publishing.
7. Gould F. J., Introduction to Management Science, Prentice Hall, Englewood Cliffs, N.J.
8. Naray J. K., Operations Research: Theory and Applications, Macmillan, New Delhi.
9. Taha Hamdy, Operations Research, Prentice Hall of India.
10. Tulasian, Quantitative Techniques, Pearson Education.
11. Vohra N. D., Quantitative Techniques in Management, TMH.
12. Stevenson W. D., Introduction to Management Science, TMH.

Module-3: Sampling, Sampling Distribution and Estimation

Learning Objective:
●● To understand the basic concepts of sampling distribution and estimation techniques
●● To get familiarized with MS Excel for confidence interval construction

Learning Outcome:
At the end of the course, the learners will be able to –
●● Use sampling methods and estimation techniques in order to answer business queries
●● Understand the purpose and need of sampling.

3.1.1 Sampling – Introduction

Sampling is an important concept which is practiced in every activity. Sampling involves selecting a relatively small number of elements from a large defined group of elements and expecting that the information gathered from the small group will allow judgments to be made about the large group. The basic idea of sampling is that by selecting some of the elements in a population, a conclusion about the entire population is drawn. Sampling is used when conducting a census is impossible or unreasonable.

Meaning of Sampling
Sampling is defined as the selection of some part of an aggregate or totality on the basis of which a judgment or inference about the aggregate or totality is made. Sampling is the process of learning about the population on the basis of a sample drawn from it.

Purpose of Sampling
There are several reasons for sampling. They are explained below:
1. Lower cost: The cost of conducting a study based on a sample is much less than the cost of conducting a census study.
2. Greater accuracy of results: It is generally argued that the quality of a study is often better with sample data than with a census. Research findings also substantiate this opinion.
3. Greater speed of data collection: The speed of execution of data collection is higher with a sample. It also reduces the time between the recognition of a need for information and the availability of that information.
4. Availability of population elements: Some situations require sampling. When the breaking strength of materials is to be tested, the material has to be destroyed; a census method cannot be resorted to, as it would mean complete destruction of all materials. Sampling is the only process possible if the population is infinite.

Features of Sampling Method
The sampling technique has the following good features of value and significance:
1. Economy: The sampling technique brings about cost control of a research project, as it requires much fewer physical resources as well as less time than the census technique.
2.
Reliability: In sampling technique, if due diligence is exercised in the choice of sample unit and if the research topic is homogenous then the sample survey can have almost the same reliability as that of census survey. 3. Detailed Study: An intensive and detailed study of sample units can be done since their number is fairly small. Also multiple approaches can be applied to a sample for an intensive analysis. 4. Scientific Base: As mentioned earlier this technique is of scientific nature as the underlined theory is based on principle of statistics. 5. Greater Suitability in most Situations: It has a wide applicability in most situations as the examination of few sample units normally suffices. 6. Accuracy: The accuracy is determined by the extent to which bias is eliminated from the sampling. When the sample elements are drawn properly some sample elements underestimates the population values being studied and others overestimate them. ni v er s ity 1. Essentials of Sampling In order to reach a clear conclusion, the sampling should possess the following essentials: It must be representative: The sample selected should possess the similar characteristics of the original universe from which it has been drawn. 2. Homogeneity: Selected samples from the universe should have similar nature and should not have any difference when compared with the universe. 3. Adequate Samples: In order to have a more reliable and representative result, a good number of items are to be included in the sample. 4. Optimization: All efforts should be made to get maximum results both in terms of cost as well as efficiency. If the size of the sample is larger, there is better efficiency and at the same time the cost is more. A proper size of sample is maintained in order to have optimized results in terms of cost and efficiency. m ity U 1. )A 3.1.2 Types of Sampling (c The sampling design can be broadly grouped on two basis viz., representation and element selection. Representation refers to the selection of members on a probability or by other means. Element selection refers to the manner in which the elements are selected individually and directly from the population. If each element is drawn individually from the population at large, it is an unrestricted sample. Restricted sampling is where additional controls are imposed, in other words it covers all other forms of sampling. Amity Directorate of Distance & Online Education 50 Statistics Management O nl in e The classification of sampling design on the basis of representation and element selection is - Notes Probability Sampling Probability sampling is where each sampling unit in the defined target population has a known non-zero probability of being selected in the sample. The actual probability of selection for each sampling unit may or may not be equal depending on the type of probability sampling design used. Specific rules for selecting members from the operational population are made to ensure unbiased selection of the sampling units and proper sample representation of the defined target population. The results obtained by using probability sampling designs can be generalized to the target population within a specified margin of error. er s ity Probability samples are characterised by the fact that, the sampling units are selected by chance. In such a case, each member of the population has a known, non- zero probability of being selected. 
However, it may not be true that all samples would have the same probability of selection, but it is possible to say the probability of selecting any particular sample of a given size. It is possible that one can calculate the probability that any given population element would be included in the sample. This requires a precise definition of the target population as well as the sampling frame. ni v Probability sampling techniques differ in terms of sampling efficiency which is a concept that refers to trade off between sampling cost and precision. Precision refers to the level of uncertainty about the characteristics being measured. Precision is inversely related to sampling errors but directly related to cost. The greater the precision, the greater the cost and there should be a trade-off between sampling cost and precision. The researcher is required to design the most efficient sampling design in order to increase the efficiency of the sampling. U The different types of probability sampling designs are discussed below: Simple Random Sampling (c )A m ity The following are the implications of random sampling: ●● It provides each element in the population an equal probability chance of being chosen in the sample, with all choices being independent of one another and ●● It offers each possible sample combination an equal probability opportunity of being selected. In the unrestricted probability sampling design every element in the population has a known, equal non-zero chance of being selected as a subject. For example, if 10 employees (n = 10) are to be selected from 30 employees (N = 30), the researcher can write the name of each employee in a piece of paper and select them on a random basis. Each employee will have an equal known probability of selection for a sample. The same is expressed in terms of the following formula: Probability of selection = Size of sample / Size of population Each employee would have a 10/30 or .333 chance of being randomly selected in a drawn sample. When the defined target population consists of a larger number Amity Directorate of Distance & Online Education 51 Statistics Management Notes O nl in e of sampling units, a more sophisticated method can be used to randomly draw the necessary sample. A table of random numbers can be used for this purpose. The table of random numbers contains a list of randomly generated numbers. The numbers can be randomly generated through the computer programs also. Using the random numbers the sample can be selected. Advantages and Disadvantages ity The simple random sampling technique can be easily understood and the survey result can be generalized to the defined target population with a pre specified margin of error. It also enables the researcher to gain unbiased estimates of the population’s characteristics. The method guarantees that every sampling unit of the population has a known and equal chance of being selected, irrespective of the actual size of the sample resulting in a valid representation of the defined target population. er s The major drawback of the simple random sampling is the difficulty of obtaining complete, current and accurate listing of the target population elements. Simple random sampling process requires all sampling units to be identified which would be cumbersome and expensive in case of a large population. Hence, this method is most suitable for a small population. 
Systematic Random Sampling U ni v The systematic random sampling design is similar to simple random sampling but requires that the defined target population should be selected in some way. It involves drawing every nth element in the population starting with a randomly chosen element between 1 and n. In other words individual sampling units are selected according their position using a skip interval. The skip interval is determined by dividing the sample size into population size. For example, if the researcher wants a sample of 100 to be drawn from a defined target population of 1000, the skip interval would be 10(1000/100). Once the skip interval is calculated, the researcher would randomly select a starting point and take every 10th until the entire target population is proceeded through. The steps to be followed in a systematic sampling method are enumerated below: Total number of elements in the population should be identified ●● The sampling ratio is to be calculated ( n = total population size divided by size of the desired sample) ●● A sample can be drawn by choosing every nth entry ity ●● It is important that the natural order of the defined target population list be unrelated to the characteristic being studied. )A 1. m Two important considerations in using the systematic random sampling are: 2. Skip interval should not correspond to the systematic change in the target population. Advantages and Disadvantages (c The major advantage is its simplicity and flexibility. In case of systematic sampling there is no need to number the entries in a large personnel file before drawing a Amity Directorate of Distance & Online Education 52 Statistics Management O nl in e sample. The availability of lists and shorter time required to draw a sample compared to random sampling makes systematic sampling an attractive, economical method for researchers. Notes The greatest weakness of systematic random sampling is the potential for the hidden patterns in the data that are not found by the researcher. This could result in a sample not truly representative of the target population. Another difficulty is that the researcher must know exactly how many sampling units make up the defined target population. In situations where the target population is extremely large or unknown, identifying the true number of units is difficult and the estimates may not be accurate. Stratified Random Sampling er s ity Stratified random sampling requires the separation of defined target population into different groups called strata and the selection of sample from each stratum. Stratified random sampling is very useful when the divisions of target population are skewed or when extremes are present in the probability distribution of the target population elements of interest. The goal in stratification is to minimize the variability within each stratum and maximize the difference between strata. The ideal stratification would be based on the primary variable under study. Researchers often have several important variables about which they want to draw conclusions. ni v A reasonable approach is to identify some basis for stratification that correlates well with other major variables. It might be a single variable like age, income etc. or a compound variable like on the basis of income and gender. Stratification leads to segmenting the population into smaller, more homogeneous sets of elements. 
In order to ensure that the sample maintains the required precision in terms of representing the total population, representative samples must be drawn from each of the smaller population groups. ●● U There are three reasons as to why a researcher chooses a stratified random sample: ●● To provide adequate data for analyzing various sub populations ●● To enable different research methods and procedures to be used in different strata. ity To increase the sample’s statistical efficiency (c )A m Cluster Sampling Cluster sampling is a probability sampling method in which the sampling units are divided into mutually exclusive and collectively exhaustive subpopulation called clusters. Each cluster is assumed to be the representative of the heterogeneity of the target population. Groups of elements that would have heterogeneity among the members within each group are chosen for study in cluster sampling. Several groups with intragroup heterogeneity and intergroup homogeneity are found. A random sampling of the clusters or groups is done and information is gathered from each of the members in the randomly chosen clusters. Cluster sampling offers more of heterogeneity within groups and more homogeneity among the groups. Amity Directorate of Distance & Online Education 53 Statistics Management Notes O nl in e Single Stage and Multistage Cluster Sampling In single stage cluster sampling, the population is divided into convenient clusters and required number of clusters are randomly chosen as sample subjects. Each element in each of the randomly chosen cluster is investigated in the study. Cluster sampling can also be done in several stages which is known as multistage cluster sampling. For example: To study the banking behaviour of customers in a national survey, cluster sampling can be used to select the urban, semi-urban and rural geographical locations of the study. At the next stage, particular areas in each of the location would be chosen. At the third stage, the banks within each area would be chosen. er s Advantages and Disadvantages of Cluster Sampling ity Thus multi-stage sampling involves a probability sampling of the primary sampling units; from each of the primary units, a probability sampling of the secondary sampling units is drawn; a third level of probability sampling is done from each of these secondary units, and so on until the final stage of breakdown for the sample units are arrived at, where every member of the unit will be a sample. ni v The cluster sampling method is widely used due to its overall cost-effectiveness and feasibility of implementation. In many situations the only reliable sampling unit frame available to researchers and representative of the defined target population, is one that describes and lists clusters. The list of geographical regions, telephone exchanges, or blocks of residential dwelling can normally be easily compiled than the list of all the individual sampling units making up the target population. Clustering method is a cost efficient way of sampling and collecting raw data from a defined target population. ity U One major drawback of clustering method is the tendency of the cluster to be homogeneous. The greater the homogeneity of the cluster, the less precise will be the sample estimate in representing the target population parameters. The conditions of intra- cluster heterogeneity and inter-cluster homogeneity are often not met. For these reasons this method is not practiced often. 
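The selection mechanics described in the preceding sections can be expressed in a few lines. The following illustrative Python sketch (with a made-up sampling frame, not part of the original text) draws a simple random sample, a systematic sample using a skip interval, and a proportionate stratified sample.

```python
import random

random.seed(1)

# Hypothetical sampling frame: 1,000 employees tagged with a department (stratum)
frame = [{"id": i, "dept": random.choice(["sales", "ops", "hr"])} for i in range(1000)]
n = 100

# Simple random sampling: every element has an equal chance of selection
srs = random.sample(frame, n)

# Systematic sampling: skip interval k = N / n, with a random start between 0 and k - 1
k = len(frame) // n
start = random.randrange(k)
systematic = frame[start::k]

# Proportionate stratified sampling: sample each stratum in proportion to its size
strata = {}
for unit in frame:
    strata.setdefault(unit["dept"], []).append(unit)
stratified = []
for dept, units in strata.items():
    share = round(n * len(units) / len(frame))   # stratum's share of the total sample
    stratified.extend(random.sample(units, share))

print(len(srs), len(systematic), len(stratified))
```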
Area Sampling )A m Area sampling is a form of cluster sampling in which the clusters are formed by geographic designations. For example, state, district, city, town etc., Area sampling is a form of cluster sampling in which any geographic unit with identifiable boundaries can be used. Area sampling is less expensive than most other probability designs and is not dependent on population frame. A city map showing blocks of the city would be adequate information to allow a researcher to take a sample of the blocks and obtain data from the residents therein. Sequential/Multiphase Sampling (c This is also called Double Sampling. Double sampling is opted when further information is needed from a subset of groups from which some information has already been collected for the same study. It is called as double sampling because initially a sample is used in the study to collect some preliminary information of interest and later a sub-sample of this primary sample is used to examine the matter in more detail The Amity Directorate of Distance & Online Education 54 Statistics Management Sampling with Probability Proportional to Size O nl in e process includes collecting data from a sample using a previously defined technique. Based on this information, a sub sample is selected for further study. It is more convenient and economical to collect some information by sampling and then use this information as the basis for selecting a sub sample for further study. Notes Non-probability Sampling er s ity When the case of cluster sampling units does not have exactly or approximately the same number of elements, it is better for the researcher to adopt a random selection process, where the probability of inclusion of each cluster in the sample tends to be proportional to the size of the cluster. For this, the number of elements in each cluster has to be listed, irrespective of the method used for ordering it. Then the researcher should systematically pick the required number of elements from the cumulative totals. The actual numbers thus chosen would not however reflect the individual elements, but would indicate as to which cluster and how many from them are to be chosen by using simple random sampling or systematic sampling. The outcome of such sampling is equivalent to that of simple random sample. This method is also less cumbersome and is also relatively less expensive. ni v In non probability sampling method, the elements in the population do not have any probabilities attached to being chosen as sample subjects. This means that the findings of the study cannot be generalized to the population. However, at times the researcher may be less concerned about generalizability and the purpose may be just to obtain some preliminary information in a quick and inexpensive way. Sometimes when the population size is unknown, then non probability sampling would be the only way to obtain data. Some non-probability sampling techniques may be more dependable than others and could often lead to important information with regard to the population. U Convenience Sampling (c )A m ity Non-probability samples that are unrestricted are called convenient sampling. Convenience sampling refers to the collection of information from members of population who are conveniently available to provide it. Researchers or field workers have the freedom to choose as samples whomever they find, thus it is named as convenience. 
It is mostly used during the exploratory phase of a research project and it is the best way of getting some basic information quickly and efficiently. The assumption is that the target population is homogeneous and the individuals selected as samples are similar to the overall defined target population with regard to the characteristics being studied. However, in reality there is no way to accurately assess the representativeness of the sample. Due to the self selection and voluntary nature of participation in data collection process the researcher should give due consideration to the non-response error. Advantages and Disadvantages Convenient sampling allows a large number of respondents to be interviewed in a relatively short time. This is one of the main reasons for using convenient sampling in the early stages of research. However the major drawback is that the Amity Directorate of Distance & Online Education 55 Statistics Management Notes O nl in e use of convenience samples in the development phases of constructs and scale measurements can have a serious negative impact on the overall reliability and validity of those measures and instruments used to collect raw data. Another major drawback is that the raw data and results are not generalizable to the defined target population with any measure of precision. It is not possible to measure the representativeness of the sample, because sampling error estimates cannot be accurately determined. Judgment Sampling ity Judgment sampling is a non-probability sampling method in which participants are selected according to an experienced individual’s belief that they will meet the requirements of the study. The researcher selects sample members who conform to some criterion. It is appropriate in the early stages of an exploratory study and involves the choice of subjects who are most advantageously placed or in the best position to provide the information required. This is used when a limited number or category of people have the information that are being sought. The underlying assumption is that the researcher’s belief that the opinions of a group of perceived experts on the topic of interest are representative of the entire target population. er s Advantages and Disadvantages ni v If the judgment of the researcher or expert is correct then the sample generated from the judgment sampling will be much better than one generated by convenience sampling. However, as in the case of all non-probability sampling methods, the representativeness of the sample cannot be measured. The raw data and information collected through judgment sampling provides only a preliminary insight Quota Sampling ity U The quota sampling method involves the selection of prospective participants according to pre specified quotas regarding either the demographic characteristics (gender, age, education, income, occupation etc.,) specific attitudes (satisfied, neutral, dissatisfied) or specific behaviours (regular, occasional, rare user of product). The purpose of quota sampling is to provide an assurance that pre specified subgroups of the defined target population are represented on pertinent sampling factors that are determined by the researcher. It ensures that certain groups are adequately represented in the study through the assignment of the quota. Advantages and Disadvantages )A m The greatest advantage of quota sampling is that the sample generated contains specific subgroups in the proportion desired by researchers. 
In those research projects that require interviews the use of quotas ensures that the appropriate subgroups are identified and included in the survey. The quota sampling method may eliminate or reduce selection bias. (c An inherent limitation of quota sampling is that the success of the study will be dependent on subjective decisions made by the researchers. As a non-probability method, it is incapable of measuring true representativeness of the sample or accuracy of the estimate obtained. Therefore, attempts to generalize the data results beyond those respondents who were sampled and interviewed become very questionable and may misrepresent the given target population. Amity Directorate of Distance & Online Education 56 Statistics Management O nl in e Snowball Sampling Notes Advantages and Disadvantages ity Snowball sampling is a non-probability sampling method in which a set of respondents are chosen who help the researcher to identify additional respondents to be included in the study. This method of sampling is also called as referral sampling because one respondent refers other potential respondents. This method involves probability and non-probability methods. The initial respondents are chosen by a random method and the subsequent respondents are chosen by non-probability methods. Snowball sampling is typically used in research situations where the defined target population is very small and unique and compiling a complete list of sampling units is a nearly impossible task. This technique is widely used in academic research. While the traditional probability and other non-probability sampling methods would normally require an extreme search effort to qualify a sufficient number of prospective respondents, the snowball method would yield better result at a much lower cost. The researcher has to identify and interview one qualified respondent and then solicit his help to identify other respondents with similar characteristics. ni v er s Snowball sampling enables to identify and select prospective respondents who are small in number, hard to reach and uniquely defined target population. It is most useful in qualitative research practices. Reduced sample size and costs are the primary advantage of this sampling method. The major drawback is that the chance of bias is higher. If there is a significant difference between people who are identified through snowball sampling and others who are not then, it may give rise to problems. The results cannot be generalized to members of larger defined target population. 3.1.3 Types of Sampling & Non Sampling Errors and Precautions U A sampling error represents a statistical error occuring when an analyst does not select a sample that represents the entire population of data and the results found in the sample do not represent the results that would be obtained from the entire population. Regardless of the fact that the sample is not representative of the population or skewed in any way, a sampling error is a difference in sampled value versus true population value. ●● Also randomized samples may have some sampling error, since it is just a population estimate from which it is derived. ●● Sampling errors can be eliminated when the sample size is increased and also by ensuring that the sample adequately represents the entire population. For example, ABC Company provides a subscription-based service that allows consumers to pay a monthly fee to stream videos and other programming over the web. 
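The point that sampling error shrinks as the sample size grows can be checked with a quick simulation. Below is an illustrative Python sketch (made-up population figures, standard library only) that measures the average gap between sample means and the true population mean for increasing sample sizes.

```python
import random
import statistics

random.seed(7)

# Made-up population: 50,000 monthly subscription spends
population = [random.gauss(12.0, 4.0) for _ in range(50_000)]
true_mean = statistics.mean(population)

def avg_sampling_error(n, trials=500):
    """Average absolute difference between a sample mean and the population mean."""
    return statistics.mean(
        abs(statistics.mean(random.sample(population, n)) - true_mean)
        for _ in range(trials)
    )

for n in (25, 100, 400, 1600):
    print(n, round(avg_sampling_error(n), 3))   # the error falls roughly as 1/sqrt(n)
```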
●● A non-sampling error is a statistical term referring to an error resulting from data collection, which causes the data to differ from the true values. A non-sampling error is different from a sampling error.
●● A non-sampling error refers to either random or systematic errors, and these errors can be challenging to spot in a survey, sample or census.
●● Systematic non-sampling errors are worse than random non-sampling errors, because systematic errors may result in the study, survey or census having to be scrapped.
●● The higher the number of errors, the less reliable the information.
●● When non-sampling errors occur, the rate of bias in a study or survey goes up.

3.1.4 Central Limit Theorem

In the study of probability theory, the central limit theorem (CLT) states that the distribution of sample means approximates a normal distribution, also known as a "bell curve", as the sample size becomes larger, assuming that all samples are identical in size and regardless of the shape of the population distribution.

It is a statistical theory stating that, given a sufficiently large sample size from a population with a finite variance, the mean of all sample means from the same population will be approximately equal to the population mean. Furthermore, the sample means will follow an approximately normal distribution pattern, with variance approximately equal to the variance of the population divided by the sample size.
●● The central limit theorem (CLT) states that the distribution of sample means approximates a normal distribution as the sample size gets larger.
●● Sample sizes equal to or greater than 30 are considered sufficient for the theorem to hold.
●● A key aspect of the theorem is that the average of the sample means and standard deviations will equal the population mean and standard deviation.
●● A sufficiently large sample size can therefore predict the characteristics of a population quite accurately.

3.1.5 Sampling Distribution of the Mean

A sample is that part of the universe which we select for the purpose of investigation. A sample exhibits the characteristics of the universe; the word sample literally means a small universe. For example, suppose the microchips produced in a factory are to be tested. The aggregate of all such items is the universe, but it is not possible to test every item. So in such a case a part of the universe is taken and then tested, and this quantity extracted for testing is known as a sample.

If we take a certain number of samples and for each sample compute various statistical measures such as the mean and standard deviation, we find that each sample may give its own value for the statistic under consideration. All such values of a particular statistic, say the mean, together with their relative frequencies, constitute the sampling distribution of that statistic (the sampling distribution of the mean, of the standard deviation, and so on). A short simulation sketch follows.
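A small simulation makes both the central limit theorem and the idea of a sampling distribution tangible. The sketch below is illustrative Python (standard library only, with an arbitrary, deliberately skewed population) that draws repeated samples and summarises the distribution of their means.

```python
import random
import statistics

random.seed(42)

# An arbitrary, clearly non-normal (exponential) population with mean about 50
population = [random.expovariate(1 / 50) for _ in range(100_000)]

def sampling_distribution_of_mean(n, draws=2000):
    """Means of `draws` random samples of size n: an empirical sampling distribution."""
    return [statistics.mean(random.sample(population, n)) for _ in range(draws)]

for n in (5, 30, 100):
    means = sampling_distribution_of_mean(n)
    print(n,
          round(statistics.mean(means), 2),    # close to the population mean (~50)
          round(statistics.stdev(means), 2))   # close to population sd / sqrt(n)
```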
The theory describes the spread of all sample proportions much more precisely than simply saying that there is less spread for larger samples. The standard deviation of all sample proportions is inversely related to the sample size n, as shown below.

The standard deviation of all sample proportions (p̂) is exactly

σ_p̂ = √( p(1 − p) / n )

Since the sample size n appears in the denominator of the square root, the standard deviation decreases as the sample size increases. Finally, the distribution of p̂ will be reasonably normal as long as the sample size n is sufficiently large; the usual convention is that np and n(1 − p) should each be at least 10. In other words, p̂ is approximately normally distributed with mean μ_p̂ = p and standard deviation σ_p̂ = √( p(1 − p) / n ), as long as np > 10 and n(1 − p) > 10.

3.1.7 Estimation – Introduction

Let X be a random variable with probability density function (or probability mass function) f(X; θ1, θ2, ..., θk), where θ1, θ2, ..., θk are the k parameters of the population.

Given a random sample x1, x2, ..., xn from this population, we may be interested in estimating one or more of the k parameters θ1, θ2, ..., θk. To be specific, let X be a normal variate so that its probability density function can be written as N(x; μ, σ). We may be interested in estimating μ or σ or both on the basis of a random sample obtained from this population.

It should be noted that there can be several estimators of a parameter; for example, we can use the sample mean, median, mode, geometric mean, harmonic mean, etc., as an estimator of the population mean μ. Similarly, either

s = √[ (1/n) Σ (xi − x̄)² ]   or   s = √[ (1/(n − 1)) Σ (xi − x̄)² ]

may be used as an estimator of the population standard deviation σ. This method of estimation, where a single statistic such as the mean, median, standard deviation, etc. is used as an estimator of a population parameter, is known as point estimation.

3.1.8 Types of Estimation

Statisticians use sample statistics to estimate population parameters. For example, sample means are used to estimate population means, and sample proportions to estimate population proportions. An estimate of a population parameter may be expressed in two ways:

●● Point estimate. A point estimate of a population parameter is a single value of a statistic. For example, the sample mean x̄ is a point estimate of the population mean μ. Similarly, the sample proportion p̂ is a point estimate of the population proportion P.

A population parameter is denoted by θ, which is an unknown constant. The available information is in the form of a random sample x1, x2, ..., xn of size n drawn from the population. We formulate a function of the sample observations x1, x2, ..., xn; the estimator of θ is denoted by θ̂. Different random samples provide different values of the statistic θ̂, so θ̂ is a random variable with its own sampling probability distribution.

●● Interval estimate. An interval estimate is defined by two numbers between which a population parameter is said to lie. For example, a < x̄ < b is an interval estimate of the population mean μ; it indicates that the population mean is greater than a but less than b.
This range of values used to estimate a population parameter is known as an interval estimate, or estimation by a confidence interval, and is defined by two numbers between which the population parameter is expected to lie. The purpose of an interval estimate is to convey how close the point estimate is likely to be to the true parameter.

3.1.9 Using z Statistic for Estimating Population Mean

Estimating a population mean from a random sample is a very common task. If the population standard deviation (σ) is known, the construction of a confidence interval for the population mean (μ) is based on the normally distributed sampling distribution of the sample means. The 100(1 − α)% confidence interval for μ is given by

CI: x̄ ± z*_{α/2} × σ_x̄,   where σ_x̄ = σ / √n

The value z*_{α/2} is the critical value and is obtained from the standard normal table or computed with the qnorm() function in R. The critical value is determined by the desired level of confidence: typical values of z*_{α/2} are 1.64, 1.96 and 2.58, corresponding to confidence levels of 90%, 95% and 99%. This critical value is multiplied by the standard error, σ_x̄, to widen or narrow the margin of error.

The standard error (σ_x̄) is the ratio of the population standard deviation (σ) to the square root of the sample size n. It describes the degree to which the computed sample statistic may be expected to differ from one sample to another. The product of the critical value and the standard error is called the margin of error; it is the quantity that is subtracted from and added to x̄ to obtain the confidence interval for μ.

3.1.10 Confidence Interval for Estimating Population Mean When Population SD is Unknown

A confidence interval gives an estimated range of values which is likely to include an unknown population parameter, the range being calculated from a given set of sample data. The common notation for the parameter in question is θ. Often this parameter is the population mean μ, which is estimated through the sample mean x̄.

The level C of a confidence interval gives the probability that the interval produced by the method employed includes the true value of the parameter θ.

In many situations the value of σ is unknown and must be estimated with the sample standard deviation s, and/or the sample size is small (less than 30) and we cannot be sure that the data came from a normal distribution. (In the latter case the Central Limit Theorem cannot be relied upon.) In either situation, the z*-value from the standard normal (Z) distribution can no longer be used as the critical value; a larger critical value is needed to allow for the additional uncertainty. The formula for a confidence interval for one population mean in this case is

x̄ ± t*_{n−1} × s / √n

where t*_{n−1} is the critical t-value from the t-distribution with n − 1 degrees of freedom (n being the sample size). A short sketch of both the z-based and the t-based intervals follows.
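As a rough illustration of sections 3.1.9 and 3.1.10, the sketch below computes both interval types in Python with SciPy. The data values and the assumed known σ are invented for demonstration; only the two formulas come from the notes.

```python
import numpy as np
from scipy import stats

# Hypothetical sample data (invented for illustration).
x = np.array([102.0, 98.5, 101.2, 97.8, 100.4, 99.1, 103.6, 98.9, 100.0, 101.7])
n = len(x)
x_bar = x.mean()

# Case 1: population standard deviation known (z-based interval).
sigma = 2.5                      # assumed known population SD
z_crit = stats.norm.ppf(0.975)   # about 1.96 for a 95% confidence level
margin_z = z_crit * sigma / np.sqrt(n)
print("95% z-interval:", (x_bar - margin_z, x_bar + margin_z))

# Case 2: population standard deviation unknown (t-based interval).
s = x.std(ddof=1)                         # sample SD with n-1 in the denominator
t_crit = stats.t.ppf(0.975, df=n - 1)     # critical t-value with n-1 degrees of freedom
margin_t = t_crit * s / np.sqrt(n)
print("95% t-interval:", (x_bar - margin_t, x_bar + margin_t))
```

Note that the t-based interval is slightly wider than the z-based one for the same data, reflecting the extra uncertainty from estimating σ with s.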
Estimating population mean using t Statistic

The t statistic is also used for a statistical examination of two population means. A two-sample t-test examines whether two samples are different and is commonly used when the variances of two normal distributions are unknown and when an experiment uses a small sample size. For a single sample, the formula is

t = (x̄ − μ) / (s / √n)

where x̄ is the sample mean, μ is the specified value to be tested, s is the sample standard deviation and n is the size of the sample. The calculated value is compared with the critical value from the t-table for the chosen significance level.

When the standard deviation of the sample is substituted for the standard deviation of the population, the statistic does not have a normal distribution; it has what is called the t-distribution. Because there is a different t-distribution for each sample size, it is not practical to list a separate area-of-the-curve table for each one. Instead, critical t-values for common alpha levels (0.10, 0.05, 0.01, and so forth) are usually given in a single table for a range of sample sizes. For very large samples, the t-distribution approximates the standard normal (z) distribution. In practice, it is best to use t-distributions any time the population standard deviation is not known.

Values in the t-table are not actually listed by sample size but by degrees of freedom (df). The number of degrees of freedom for a problem involving the t-distribution with sample size n is simply n − 1 for a one-sample mean problem.

Uses of T Test

Among the most frequently used t-tests are:

●● A one-sample location test of whether the mean of a normally distributed population has a value specified in a null hypothesis.
●● A two-sample location test of the null hypothesis that the means of two normally distributed populations are equal.

All such tests are usually called Student's t-tests, though strictly speaking that name should only be used if the variances of the two populations are also assumed to be equal; the form of the test used when this assumption is dropped is sometimes called Welch's t-test. These tests are often referred to as "unpaired" or "independent samples" t-tests, as they are typically applied when the statistical units underlying the two samples being compared are non-overlapping.

A further use is a test of the null hypothesis that the difference between two responses measured on the same statistical unit has a mean value of zero. For example, suppose we measure the size of a cancer patient's tumour before and after a treatment; if the treatment is effective, we expect the tumour size for many of the patients to be smaller following the treatment. This is often referred to as the "paired" or "repeated measures" t-test. Finally, a t-test can be used to test whether the slope of a regression line differs significantly from 0. The sketch below shows how the independent-samples and paired variants can be run in practice.
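A minimal sketch of how the unpaired and paired variants might be run with SciPy is given below; the group sizes, means and random seed are hypothetical, chosen only to produce example data.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=7)

# Hypothetical measurements for two independent groups (invented data).
group_a = rng.normal(loc=50.0, scale=8.0, size=25)
group_b = rng.normal(loc=55.0, scale=8.0, size=30)

# Independent (unpaired) two-sample t-test; equal_var=False gives Welch's t-test.
t_ind, p_ind = stats.ttest_ind(group_a, group_b, equal_var=False)
print(f"Welch's t-test: t = {t_ind:.3f}, p = {p_ind:.4f}")

# Paired (repeated-measures) t-test: e.g., tumour size before and after treatment.
before = rng.normal(loc=4.0, scale=0.5, size=20)
after = before - rng.normal(loc=0.4, scale=0.3, size=20)   # simulated shrinkage
t_rel, p_rel = stats.ttest_rel(before, after)
print(f"Paired t-test:  t = {t_rel:.3f}, p = {p_rel:.4f}")
```

Setting equal_var=True in ttest_ind would instead give the classic Student's t-test, which assumes equal population variances.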
3.1.12 Confidence Interval Estimation for Population Proportion

The confidence interval (CI) for a population proportion can be used to show the statistical probability that a characteristic is likely to occur within the population.

For example, if we wish to estimate the proportion of people with diabetes in a population, we consider a diagnosis of diabetes as a "success" (i.e., an individual who has the outcome of interest) and a lack of diagnosis of diabetes as a "failure." In this example, X represents the number of people with a diagnosis of diabetes in the sample.

The sample proportion is p̂ (called "p-hat"), and it is computed by taking the ratio of the number of successes in the sample to the sample size, that is,

p̂ = x / n

where x is the number of successes in the sample and n is the size of the sample.

The formula for the confidence interval for a population proportion follows the same format as that for an estimate of a population mean. From the sampling distribution for the proportion, the standard deviation is

σ_p̂ = √( p(1 − p) / n )

The confidence interval for a population proportion therefore becomes

p = p′ ± z_{α/2} × √( p′(1 − p′) / n )

where z_{α/2} is set according to our desired degree of confidence and √( p′(1 − p′)/n ) is the standard deviation of the sampling distribution. The sample proportions p′ and q′ are estimates of the unknown population proportions p and q; the estimated proportions p′ and q′ are used because p and q are not known.

Key Terms

●● Sample: A sample is that part of the universe which we select for the purpose of investigation. A sample exhibits the characteristics of the universe; the word sample literally means a small universe.
●● Sampling: Sampling is defined as the selection of some part of an aggregate or totality on the basis of which a judgement or inference about the aggregate or totality is made. Sampling is the process of learning about the population on the basis of a sample drawn from it.
●● Stratified random sampling: Stratified random sampling requires the separation of the defined target population into different groups called strata and the selection of a sample from each stratum.
●● Cluster sampling: Cluster sampling is a probability sampling method in which the sampling units are divided into mutually exclusive and collectively exhaustive subpopulations called clusters.
●● Confidence interval: A confidence interval (CI) for a population proportion can be used to show the statistical probability that a characteristic is likely to occur within the population.
●● Point estimate: A point estimate of a population parameter is a single value of a statistic.
●● Interval estimate: An interval estimate is defined by two numbers between which a population parameter is said to lie.

Check your progress

1. _____ states that the distribution of sample means approximates a normal distribution as the sample size gets larger.
   a) Probability b) Central Limit Theorem c) Z test d) Sampling Theorem
2. ____ error is a statistical term referring to an error resulting from data collection, which causes the data to differ from the true values.
   a) Sampling b) Non-sampling c) Probability d) Central Sampling
3. The sampling method in which a set of respondents is chosen who help the researcher to identify additional respondents to be included in the study is?
   a) Quota Sampling b) Judgment Sampling c) Snowball Sampling d) Convenience Sampling
4. The value used to measure the distance between the mean and a random variable x in terms of standard deviations is
   a) Z-value b) Variance c) Probability of x d) Density function of x
5. The ____ test is applied when samples are less than 30.
   a) T b) Z c) Rank d) None of these

Questions and Exercises

1. What is sampling? Explain the features of sampling.
2. Differentiate between sampling and non-sampling errors.
3. Explain any five types of sampling techniques.
4. What do you mean by the t-test and the z-test?
5. Explain confidence interval estimation for a population proportion.
Check your progress:

1. b) Central Limit Theorem
2. b) Non-sampling
3. c) Snowball Sampling
4. a) Z-value
5. a) T

Further Readings

4. Richard I. Levin, David S. Rubin, Sanjay Rastogi, Masood Husain Siddiqui, Statistics for Management, Pearson Education, 7th Edition, 2016.
5. Prem S. Mann, Introductory Statistics, 7th Edition, Wiley India, 2016.
6. Gareth James, Daniela Witten, Trevor Hastie, Robert Tibshirani, An Introduction to Statistical Learning with Applications in R, Springer, 2016.

Bibliography

13. Srivastava V. K. et al., Quantitative Techniques for Managerial Decision Making, Wiley Eastern Ltd.
14. Richard I. Levin and Charles A. Kirkpatrick, Quantitative Approaches to Management, McGraw Hill, Kogakusha Ltd.
15. Prem S. Mann, Introductory Statistics, 7th Edition, Wiley India, 2016.
16. Budnik, Frank S., Dennis McLeavey, Richard Mojena, Principles of Operations Research, AITBS, New Delhi.
17. Sharma J. K., Operations Research: Theory and Applications, Macmillan, New Delhi.
18. Kalavathy S., Operations Research, Vikas Publishing Co.
19. Gould F. J., Introduction to Management Science, Englewood Cliffs, N.J.: Prentice Hall.
20. Naray J. K., Operations Research: Theory and Applications, Macmillan, New Delhi.
21. Taha Hamdy, Operations Research, Prentice Hall of India.
22. Tulasian, Quantitative Techniques, Pearson Education.
23. Vohr N. D., Quantitative Techniques in Management, TMH.
24. Stevenson W. D., Introduction to Management Science, TMH.

Module-4: Concepts of Hypothesis Testing

Learning Objective:
●● To get introduced to the concept of hypothesis testing and to learn parametric and non-parametric tests

Learning Outcome:
At the end of the course, the learners will be able to –
●● Perform tests of hypothesis as well as calculate confidence intervals for a population parameter for single-sample and two-sample cases.

4.1.1 Hypothesis Testing - Introduction

A hypothesis test is a method of making decisions using data from a scientific study. In statistics, a result is called statistically significant if it is unlikely to have occurred by chance alone, according to a pre-determined threshold probability, the significance level. The phrase "test of significance" was coined by statistician Ronald Fisher. These tests are used to determine which outcomes of a study would lead to a rejection of the null hypothesis for a pre-specified level of significance; this can help to decide whether results contain enough information to cast doubt on conventional wisdom, given that conventional wisdom has been used to establish the null hypothesis. The critical region of a hypothesis test is the set of all outcomes which cause the null hypothesis to be rejected in favor of the alternative hypothesis. Statistical hypothesis testing is sometimes called confirmatory data analysis, in contrast to exploratory data analysis, which may not have pre-specified hypotheses. Statistical hypothesis testing is a key technique of frequentist inference.

Characteristics of Hypothesis

The important characteristics of a hypothesis are as follows:

Hypothesis must be conceptually clear

The concepts used in the hypothesis should be clearly defined, operationally if possible.
Such definitions should be commonly accepted and easily communicable among research scholars.

Hypothesis should have empirical referents

The variables contained in the hypothesis should be empirical realities. If they are not empirical realities, it will not be possible to make the observations; being handicapped in data collection, the researcher may not be able to test the hypothesis. Watch for words like ought, should and bad.

Hypothesis must be specific

The hypothesis should not only be specific to a place and situation but should also be narrowed down with respect to its operation. The researcher should avoid the global use of concepts, where a concept is so broad and all-inclusive that it cannot tell us anything definite. For example, somebody may propose a relationship between urbanization and family size. Urbanization does have an influence in reducing the size of families, but urbanization is such a comprehensive variable that it hides the operation of many other factors which emerge as part of the urbanization process. These factors could be the rise in education levels, women's levels of education, women's empowerment, the emergence of dual-earner families, the decline in patriarchy, accessibility to health services, the role of mass media, and more. Therefore the global use of the word "urbanization" may not tell much; hence it is suggested that the hypothesis should be specific.

Hypothesis should be related to available techniques of research

A hypothesis may have empirical reality, yet we still need tools and techniques that can be used for the collection of data. If the techniques are not there, the researcher is handicapped. Therefore, either the techniques should already be available or the researcher should be in a position to develop suitable techniques for the study.

Hypothesis should be related to a body of theory

A hypothesis has to be supported by theoretical argumentation. For this purpose the researcher may develop his or her theoretical framework, which can help in the generation of relevant hypotheses. For the development of such a framework the researcher depends on the existing body of knowledge; in this effort a connection between the study in hand and the existing body of knowledge can be established. That is how the study can benefit from existing knowledge and, later on, through testing the hypothesis, contribute to the reservoir of knowledge.

Hypothesis testing procedure

Hypothesis testing refers to the formal procedures used by statisticians to accept or reject statistical hypotheses. A statistical hypothesis is an assumption about a population parameter; this assumption may or may not be true.

The best way to determine whether a statistical hypothesis is true would be to examine the entire population. Since that is often impractical, researchers typically examine a random sample from the population. If the sample data are not consistent with the statistical hypothesis, the hypothesis is rejected. In doing so, one takes the help of certain assumptions or hypothetical values about the characteristics of the population, if such information is available. Such a hypothesis about the population is termed a statistical hypothesis, and it is tested on the basis of sample values. The procedure enables one to decide on a certain hypothesis and test its significance.
"A claim or hypothesis about the population parameters is known as the null hypothesis and is written as H0." This hypothesis is then tested with the available evidence and a decision is made whether to accept it or reject it. If this hypothesis is rejected, then we accept the alternate hypothesis, which is written as H1. For testing a hypothesis, or a test of significance, we use both parametric tests and non-parametric (distribution-free) tests. Parametric tests make assumptions about the properties of the population from which we draw samples, for example about population parameters or sample size. In the case of non-parametric tests, we do not make such assumptions; we assume only nominal or ordinal data.

4.1.2 Developing Null and Alternate Hypothesis

Null Hypothesis

It is used for testing the hypothesis formulated by the researcher. Researchers treat evidence that supports a hypothesis differently from evidence that opposes it: they give negative evidence more importance than positive evidence, because the negative evidence tarnishes the hypothesis by showing that the predictions made by the hypothesis are wrong. The null hypothesis simply states that there is no relationship between the variables, or that the relationship between the variables is "zero." Symbolically the null hypothesis is denoted as "H0". For example:

H0 = There is no relationship between the level of job commitment and the level of efficiency.
Or: H0 = The relationship between the level of job commitment and the level of efficiency is zero.
Or: The two variables are independent of each other.

The null hypothesis does not take into consideration the direction of association (i.e. H0 is non-directional); direction may be considered in a second step of testing the hypothesis. First we look at whether or not there is an association; then we go for the direction and the strength of the association. Experts recommend that we test our hypothesis indirectly by testing the null hypothesis: if our hypothesis has any credibility, the research data should reject the null hypothesis. Rejection of the null hypothesis leads to the acceptance of the alternative hypothesis.

Alternative Hypothesis

The alternative (to the null) hypothesis simply states that there is a relationship between the variables under study. In our example it could be: there is a relationship between the level of job commitment and the level of efficiency. Where the null hypothesis is associated with a "zero" relationship, the alternative hypothesis asserts that a relationship exists and is symbolically denoted as "H1". It can be written like this:

H1: There is a relationship between the level of job commitment of the officers and their level of efficiency.

4.1.3 Type I Error and Type II Error

A statistically significant result cannot prove that a research hypothesis is correct (as this would imply 100% certainty). Because a p-value is based on probabilities, there is always a chance of making an incorrect conclusion regarding accepting or rejecting the null hypothesis (H0). Any time we make a decision using statistics there are four possible outcomes, with two representing correct decisions and two representing errors.

Type 1 error

A type I error is also known as a false positive and occurs when a researcher incorrectly rejects a true null hypothesis.
This means that you report that your findings are significant when in fact they occurred by chance.

●● The probability of making a type I error is represented by your alpha level (α), which is the p-value below which you reject the null hypothesis. An alpha of 0.05 indicates that you are willing to accept a 5% chance of being wrong when you reject the null hypothesis.
●● The risk of committing a type I error can be reduced by using a lower value for alpha. For example, an alpha of 0.01 would mean there is a 1% chance of committing a type I error.
●● However, using a lower value for alpha means that you will be less likely to detect a true difference if one really exists (thus risking a type II error).

Type 2 error

A type II error is also known as a false negative and occurs when a researcher fails to reject a null hypothesis which is really false. Here the researcher concludes there is no significant effect when in fact there is one.

The probability of making a type II error is called beta (β), and it is related to the power of the statistical test (power = 1 − β). The risk of committing a type II error can be decreased by ensuring that the test has enough power.

4.1.4 Level of Significance and Critical Region

Level of Significance

The level of significance, often referred to as alpha or α, is a measure of the strength of the evidence that must be present in your sample before the null hypothesis is rejected and the effect is declared statistically significant. The researcher decides the significance level before performing the experiment.

●● The significance level is the probability of rejecting the null hypothesis when it is true. For example, a significance level of 0.06 indicates a 6% risk of concluding that a difference exists when there is no actual difference. Lower significance levels indicate that stronger evidence is required before the null hypothesis is rejected.
●● The significance level is used during hypothesis testing to help determine which hypothesis the data supports, by comparing the p-value with the significance level. If the p-value is less than the significance level, the null hypothesis can be rejected and the effect declared statistically significant; in other words, the evidence in the sample is strong enough to reject the null hypothesis at the population level.

Critical Region

A critical region, also known as the region of rejection, is the set of test statistic values for which the null hypothesis is rejected. That is to say, if the observed test statistic falls in the critical region, we reject the null hypothesis and accept the alternative hypothesis. The critical region defines how far away our sample statistic must be from the null hypothesis value before we can say it is unusual enough to reject the null hypothesis.

The "best" critical region is one where the likelihood of making a type I or type II error is minimised; in other words, the uniformly most powerful (UMP) rejection region is the region with the smallest chance of making a type I or II error, and it is also the region that provides the largest (or equally greatest) power function for a UMP test. The relationship between the significance level and the boundary of a two-sided critical region is sketched below.
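The sketch below (Python with SciPy assumed available) simply recovers the familiar two-sided critical values quoted earlier in these notes, showing how the chosen significance level fixes the boundary of the critical region for a z statistic.

```python
from scipy import stats

# Two-sided critical z-values for common significance levels.
for alpha in (0.10, 0.05, 0.01):
    z_crit = stats.norm.ppf(1 - alpha / 2)
    print(f"alpha = {alpha:>4}: reject H0 when |z| > {z_crit:.2f}")

# The printed critical values are roughly 1.64, 1.96 and 2.58,
# matching the 90%, 95% and 99% confidence levels quoted earlier.
```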
4.1.5 Standard Error

A statistic's standard error is the standard deviation of its sampling distribution, or an estimate of that standard deviation. If the statistic is the mean, it is called the standard error of the mean. It is defined as

SE = σ / √n

where SE is the standard error of the sample mean, n is the sample size and σ is the standard deviation.

The standard error increases when the standard deviation, i.e. the variance of the population, increases. The standard error decreases when the sample size increases: as the sample size gets closer to the true size of the population, the sample means cluster more and more closely around the true population mean.

The standard error tells how accurate the mean of any given sample from a population is likely to be compared with the true population mean. As the standard error increases, i.e. as the sample means become more spread out, it becomes more likely that any given sample mean is an inaccurate representation of the true population mean.

4.1.6 Confidence Interval

A confidence interval is a range of values within which the true value is expected to lie. It is a type of estimate computed from the statistics of the observed data, and it proposes a range of plausible values for an unknown parameter (for example, the mean). The interval has an associated confidence level that the true parameter is in the proposed range. Given the observations and a confidence level, a valid confidence interval has that probability of containing the true underlying parameter. The level of confidence can be chosen by the investigator. In general terms, a confidence interval for an unknown parameter is based on the sampling distribution of a corresponding estimator.

●● The confidence level represents the frequency (i.e. the proportion) of possible confidence intervals that contain the true value of the unknown population parameter.
●● In other words, if confidence intervals are constructed using a given confidence level from an infinite number of independent sample statistics, the proportion of those intervals that contain the true value of the parameter will be equal to the confidence level.

For example, if the confidence level is 90%, then in a hypothetical indefinite series of data collections, 90% of the interval estimates will contain the population parameter. The confidence level is designated before examining the data. Most commonly a 95% confidence level is used, although confidence levels of 90% and 99% are also often used in analysis.

Factors affecting the width of the confidence interval include the size of the sample, the confidence level and the variability in the sample. A larger sample will tend to produce a better estimate of the population parameter, all other factors being equal. A higher confidence level will tend to produce a broader confidence interval.

4.2.1 For Single Population Mean Using t-statistic

When σ is not known, we use its estimate computed from the given sample. Here the nature of the sampling distribution of X̄ depends upon the sample size n. There are two possibilities:

If the parent population is normal and n < 30 (popularly known as the small sample case), use the t-test. The unbiased estimate of σ in this case is given by

s = √[ Σ (xi − x̄)² / (n − 1) ]

If n ≥ 30 (the large sample case), use the standard normal test.
The estimate of σ in this case can be taken as

s = √[ Σ (xi − x̄)² / n ]

since the difference between n and n − 1 is negligible for large values of n. Note that the parent population may or may not be normal in this case.

Application

Statisticians use tα to represent the t statistic that has a cumulative probability of (1 − α). For example, suppose we are interested in the t statistic having a cumulative probability of 0.95. In this example α would be equal to (1 − 0.95), or 0.05, and we would refer to the t statistic as t0.05.

Of course, the value of t0.05 depends on the number of degrees of freedom. For example, with 2 degrees of freedom t0.05 is equal to 2.92, but with 20 degrees of freedom t0.05 is equal to 1.725.

Example: ABC Corporation manufactures light bulbs. The CEO claims that an average bulb lasts 300 days. A researcher randomly selects 15 bulbs for testing. The sampled bulbs last an average of 290 days, with a standard deviation of 50 days. If the CEO's claim were true, what is the probability that 15 randomly selected bulbs would have an average life of no more than 290 days?

Note: The solution follows the traditional approach and requires the computation of the t statistic based on the data presented in the problem description. A t-distribution calculator (or table) is then used to find the probability.

Solution: We compute the t statistic from the following equation:

t = (x̄ − μ) / (s / √n)
t = (290 − 300) / (50 / √15)
t = −10 / 12.909945 = −0.7745966

where x̄ is the sample mean, μ is the population mean, s is the standard deviation of the sample and n is the sample size.

●● The degrees of freedom are equal to 15 − 1 = 14.
●● The t statistic is equal to −0.7745966.

The t-distribution calculator gives the cumulative probability 0.226. Hence, if the true bulb life were 300 days, there is a 22.6% chance that the average bulb life for 15 randomly selected bulbs would be less than or equal to 290 days. The same calculation is sketched below.
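The probability quoted above can be reproduced with a few lines of Python, assuming SciPy is available; the figures are exactly those of the light-bulb example.

```python
import math
from scipy import stats

# Figures from the light-bulb example above.
n, x_bar, mu0, s = 15, 290.0, 300.0, 50.0

t_stat = (x_bar - mu0) / (s / math.sqrt(n))
p_lower = stats.t.cdf(t_stat, df=n - 1)   # P(T <= t) with n - 1 = 14 degrees of freedom

print(f"t = {t_stat:.4f}")                 # about -0.7746
print(f"P(mean <= 290) = {p_lower:.3f}")   # about 0.226
```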
(c Z = x – µ / σ / √n Where, x is the sample mean, u is the population mean σ is the population standard deviation and n is the sample size Amity Directorate of Distance & Online Education 72 Statistics Management O nl in e Example: Notes The mean length of the lumber is supposed to be 8.5 feet. A builder wants to check whether the shipment of lumber she receives has a mean length different from 8.5 feet. If the builder observes that the sample mean of 61 pieces of lumber is 8.3 feet with a sample standard deviation of 1.2 feet. What will she conclude? Is 8.3 very different from 8.5? Solution: Whether the value is different or not depends on the standard deviation of x Thus, Z = x – µ / σ / √n ity = 8.3 -8.5 / 1.2 √ 61 = - 1.3 er s Thus, It is been asked if −1.3 is very far away from zero, since that corresponds to the case when x¯ is equal to μ0. If it is far away so the null statement is unlikely to be valid and one refuses it. Otherwise the null hypothesis can not be discarded. 4.2.3 Hypothesis Testing for Population Proportion. ni v Using independent samples means that there is no relationship between the groups. The values in one sample have no association with the values in the other sample. These populations are not related, and the samples are independent. We look at the difference of the independent means. U As with comparing two population proportions, when we compare two population means from independent populations, the interest is in the difference of the two means. In other words, if μ1 is the population mean from population 1 and μ2 is the population mean from population 2, then the difference is μ1−μ2. ity It is important to be able to distinguish between an independent sample and a dependent sample. (c )A m Independent sample The samples from two populations are independent if the samples selected from one of the populations have no relationship with the samples selected from the other population. Dependent sample The samples are dependent if each measurement in one sample is matched or paired with a particular measurement in the other sample. Another way to consider this is how many measurements are taken off of each subject. If only one measurement, then independent; if two measurements, then paired. Exceptions are in familial situations such as in a study of spouses or twins. In such cases, the data is almost always treated as paired data. Amity Directorate of Distance & Online Education 73 Example - Compare the time that males and females spend watching TV. a. We randomly select 15 men and 15 women and compare the average time they spend watching TV. Is this an independent sample or paired sample? b. We randomly select 15 couples and compare the time the husbands and wives spend watching TV. Is this an independent sample or paired sample? a. Independent Sample b. Paired sample Application Notes O nl in e Statistics Management ity The null hypothesis to be tested is H0: π = π0 against Ha: π ≠ π0 for a two tailed test and π > or < π0 for a one tailed test. The test statistic is er s p − π0 n = ( p − π0 ) π 0 (1 − π 0 ) π 0 (1 − π 0 ) n = zcal Example : A wholesaler in oranges claims that only 4% of the apples supplied by him are defective. A random sample of 600 apples contained 36 defective apples. Test the claim of the wholesaler. ni v Solution. We have to test H0 : π £ 0.04 against Ha : π > 0.04. It is given that p = 36/ 600 = 0.06 and n = 600. 
z_cal = (0.06 − 0.04) / √( 0.04 × 0.96 / 600 ) = 2.5

This value is highly significant in comparison to 1.645; therefore H0 is rejected at the 5% level of significance.

Example: 470 tails were obtained in 1,000 throws of an unbiased coin. Can the difference between the proportion of tails in the sample and their proportion in the population be regarded as due to fluctuations of sampling?

Solution: We have to test H0: π = 0.5 against Ha: π ≠ 0.5. It is given that p = 470/1000 = 0.47 and n = 1000.

z_cal = (0.47 − 0.50) / √( 0.5 × 0.5 / 1000 ) = −1.90

Since the magnitude of this value is less than 1.96, the coin can be regarded as fair, and thus the difference between the sample and population proportions of tails is only due to fluctuations of sampling.

4.3.1 Inference about the Difference Between two Population Means

a. When the population standard deviation is known

This test is applicable when the random sample X1, X2, ..., Xn is drawn from a normal population. We can write H0: μ = μ0 (specified) against Ha: μ ≠ μ0 (two-tailed test).

The test statistic is (X̄ − μ) / (σ/√n) ~ N(0, 1). Let the value of this statistic calculated from the sample be denoted as

z_cal = (X̄ − μ0) / (σ/√n)

The decision rule would be: reject H0 at the 5% (say) level of significance if |z_cal| > 1.96; otherwise, there is no evidence against H0 at the 5% level of significance.

Example - A company claims that the average mileage of its bikes is 40 km/l. A random sample of 20 bikes of the company showed an average mileage of 42 km/l. Test the claim of the manufacturer on the assumption that the mileage of the bikes is normally distributed with a standard deviation of 2 km/l.

Here we have to test H0: μ = 40 against Ha: μ ≠ 40.

z_cal = (X̄ − μ) / (σ/√n) = (42 − 40) / (2/√20) = 4.47

Since z_cal > 1.96, H0 is rejected at the 5% level of significance.

b. When the population standard deviation is unknown

When σ is not known, we use its estimate computed from the given sample. Here the nature of the sampling distribution of X̄ depends upon the sample size n. There are two possibilities:

If the parent population is normal and n < 30 (popularly known as the small sample case), use the t-test. As with the normal test, the hypothesis may be one- or two-tailed.

If n ≥ 30 (the large sample case), use the standard normal test, since the difference between n and n − 1 is negligible for large values of n. Note that the parent population may or may not be normal in this case.

Example: Daily sales figures of 40 shopkeepers showed that their average sales and standard deviation were Rs 528 and Rs 600 respectively. Is the assertion that daily sales on the average are Rs 400 contradicted at the 5% level of significance by the sample?

Solution: Since n > 30, the standard normal test is applicable. It is given that n = 40, X̄ = 528 and S = 600. We have to test H0: μ = 400 against Ha: μ ≠ 400.

z_cal = (528 − 400) / (600/√40) = 1.35

Since this value is less than 1.96, there is no evidence against H0 at the 5% level of significance. Hence the given assertion is not contradicted by the sample.

4.3.2 Inference about the Difference Between two Population Proportions

A test of two population proportions is very similar to a test of two means, except that the parameter of interest is now "p" instead of "μ".

With a one-sample proportion test, p̂ = x/n is used as the point estimate of p, and we expect p̂ to be close to p.
With a test of two proportions, we have two p̂'s, and we expect that (p̂1 − p̂2) will be close to (p1 − p2). The test statistic accounts for both samples.

●● With a one-sample proportion test, the test statistic is

z = (p̂ − p) / √( p(1 − p) / n )

and it has an approximate standard normal distribution.

●● For a two-sample proportion test, we would expect the test statistic to be based on the difference (p̂1 − p̂2) and its standard error. However, the null hypothesis will be that p1 = p2. Because H0 is assumed to be true, the test assumes that p1 and p2 equal a common population proportion p. Since p is unknown, we must compute a pooled estimate of it from our sample data, giving the test statistic

z = (p̂1 − p̂2) / √( p̂(1 − p̂)(1/n1 + 1/n2) ),   where p̂ = (x1 + x2)/(n1 + n2) is the pooled sample proportion.

Application

When we have a categorical variable of interest measured in two populations, we are quite often interested in comparing the proportions of a certain category for the two populations.

Example: Men and women were asked what they would do if they received a $100 bill by mail, addressed to their neighbour but wrongly delivered to them. Would they return it to their neighbour? Of the 69 males sampled, 52 said "yes", and of the 131 females sampled, 120 said "yes." Does the data indicate that the proportions that said "yes" are different for males and females?

If the proportion of males who said "yes, they would return it" is denoted as p1 and the proportion of females who said "yes, they would return it" is denoted as p2, the question is whether p1 = p2, i.e. whether p1 − p2 = 0 (or p1/p2 = 1). We can develop a confidence interval or perform a hypothesis test for one of these expressions. Here,

Men: n1 = 69, p̂1 = 52/69
Women: n2 = 131, p̂2 = 120/131

Using the formula

(p̂1 − p̂2) ± z_{α/2} √( p̂1(1 − p̂1)/n1 + p̂2(1 − p̂2)/n2 )
= (52/69 − 120/131) ± 1.96 √( (52/69)(1 − 52/69)/69 + (120/131)(1 − 120/131)/131 )
= −0.1624 ± 1.96 (0.05725)
= −0.1624 ± 0.1122, or (−0.2746, −0.0502)

We are 95% confident that the difference between the population proportions of men who said "yes" and women who said "yes" is between −0.2746 and −0.0502. Since both ends of the interval are negative, it seems that the proportion of females who would return it is higher than the proportion of males who would return it.

4.3.3 Independent Samples and Matched Samples

Matched samples, also called matched pairs, paired samples or dependent samples, are paired such that all characteristics except the one under review are shared by the participants. A "participant" is a member of the sample and can be a person, object or thing. Typical designs include:

●● The same study participants are measured before and after an intervention.
●● The same study participants are measured twice for two different interventions.

Matched pairs are also widely used to assign one person to a treatment group and another to a control group; this method, called matching, is used in the design of matched pairs. The "pairs" need not be different persons; they can be the same individuals measured at different times.

An independent sample is the opposite of a matched sample and deals with unrelated groups. Although matched pairs are intentionally selected, independent samples are typically selected at random (through simple random sampling or a similar technique).

4.3.4 Inference about the Ratio of two Population Variances

A test that compares two population variances is an essential step in checking the equal-variances assumption when you want to use pooled variances.
Many people use this test as a guide to see whether there are any clear violations, much like using a rule of thumb. An F-test is used to test whether the variances of two populations are equal. The test can be two-tailed or one-tailed. The two-tailed version tests against the alternative that the variances are not equal. The one-tailed version tests in one direction only, that is, that the variance of the first population is either greater than or less than (but not both) the variance of the second population. The choice is determined by the problem: if we are testing a new process, for example, we might only be interested in knowing whether the new process is less variable than the old process.

Application: To compare the variances of two quantitative variables, the hypotheses of interest are:

Null: H0: σ1²/σ2² = 1
Alternatives: Ha: σ1²/σ2² ≠ 1, or Ha: σ1²/σ2² > 1, or Ha: σ1²/σ2² < 1

Example: Suppose 7 women are randomly selected from a population of women, and 12 men from a population of men. The table below shows the standard deviation in each sample and in each population. Compute the F statistic.

Population   Population standard deviation   Sample standard deviation
Women        30                              35
Men          50                              45

Solution: The F statistic can be computed from the population and sample standard deviations using the following equation:

f = [ s1²/σ1² ] / [ s2²/σ2² ]

where σ1 is the standard deviation of population 1, s1 is the standard deviation of the sample drawn from population 1, σ2 is the standard deviation of population 2, and s2 is the standard deviation of the sample drawn from population 2.

f = ( 35²/30² ) / ( 45²/50² ) = (1225/900) / (2025/2500) = 1.361 / 0.81 = 1.68

For this calculation, the numerator degrees of freedom v1 are 7 − 1 = 6, and the denominator degrees of freedom v2 are 12 − 1 = 11. On the other hand, if the men's data appear in the numerator, we can calculate an F statistic as follows:

f = ( 45²/50² ) / ( 35²/30² ) = (2025/2500) / (1225/900) = 0.81 / 1.361 = 0.595

For this calculation, the numerator degrees of freedom v1 are 12 − 1 = 11, and the denominator degrees of freedom v2 are 7 − 1 = 6. When trying to find the cumulative probability associated with an F statistic, you need to know v1 and v2.

Assumptions

Several assumptions are made for the test. The populations must be approximately normally distributed (i.e. fit the shape of a bell curve) in order to use the test, and the samples must be independent. In addition, a few important points should be borne in mind:

●● The larger variance should always go in the numerator (the top number) to force the test into a right-tailed test; right-tailed tests are easier to calculate.
●● For two-tailed tests, divide alpha by 2 before finding the right critical value.
●● If you are given standard deviations, they must be squared to obtain the variances.
●● If your degrees of freedom are not listed in the F table, use the larger critical value. This helps to avoid the possibility of Type I errors.

A minimal sketch of the usual two-sample F-test appears below.
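The following is a minimal sketch of the two-sample F-test for equal variances in Python; the sample variances and sample sizes are invented for illustration, and the right-tailed p-value is doubled for the two-tailed test as described above.

```python
from scipy import stats

# Hypothetical sample variances and sizes (invented figures for illustration).
s1_sq, n1 = 12.5, 16    # larger sample variance goes in the numerator
s2_sq, n2 = 7.2, 21

f_stat = s1_sq / s2_sq
df1, df2 = n1 - 1, n2 - 1

# Right-tailed p-value; double it for the two-tailed test of equal variances.
p_one_tailed = stats.f.sf(f_stat, df1, df2)
p_two_tailed = min(2 * p_one_tailed, 1.0)

print(f"F = {f_stat:.3f}, df = ({df1}, {df2})")
print(f"one-tailed p = {p_one_tailed:.4f}, two-tailed p = {p_two_tailed:.4f}")
```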
4.4.1 Analysis of Variance

Variance is defined as the average of the squared deviations of data points from their mean. When the data constitute a sample, the variance is denoted by σ²x and the averaging is done by dividing the sum of the squared deviations from the mean by (n − 1). When the observations constitute the population, the variance is denoted by σ² and we divide by N for the average.

The formulas for calculating variance are:

Sample variance: Var(X) = σ²x = Σ (xi − X̄)² / (n − 1)
Population variance: Var(X) = σ² = Σ (xi − μ)² / N

where the xi for i = 1, 2, ..., n are the observed values, X̄ is the sample mean, n is the sample size, μ is the population mean and N is the population size.

The population variance can also be written as

Var(X) = σ² = Σ (xi − μ)² / N
            = [ Σ xi² − 2μ Σ xi + μ² Σ 1 ] / N
            = Σ xi² / N − μ²

that is, Var(X) = E(X²) − [E(X)]².

4.5.1 Chi Square Test

This is the test that uses the chi-square statistic to test the fit between a theoretical frequency distribution and a frequency distribution of observed data in which each observation may fall into one of several classes.

Formula of the chi-square test:

χ² = Σ (O − E)² / E

If χ²cal < χ²table (the table value of χ² for the given d.f. and α), accept H0.

Conditions of the Chi-square Test

A chi-square test can be used when the data satisfy four conditions:

●● There must be two observed sets of data, or one observed set of data and one expected set of data (generally there are n rows and c columns of data).
●● The two sets of data must be based on the same sample size.
●● Each cell in the data must contain an observed or expected count of five or larger.
●● The different cells in a row or column must represent categorical variables (male, female; less than 25 years of age, 25 to 40 years of age, older than 40 years of age; etc.).

Application Areas of the Chi-square Test

The chi-square distribution is a continuous distribution taking only positive values; it is skewed to the right, with a long tail. The test has the following applications:

●● To test whether the differences among various sample proportions are significant or can be attributed to chance.
●● To test the independence of two variables in a contingency table.
●● To use it as a test of goodness of fit.

Example 1: The operations manager of a company that manufactures tires wants to determine whether there are any differences in the quality of work among the three daily shifts. She randomly selects 496 tires and carefully inspects them. Each tire is classified as perfect, satisfactory or defective, and the shift that produced it is also recorded. The two categorical variables of interest are the shift and the condition of the tire produced. The data can be summarised by the accompanying two-way table. Does the data provide sufficient evidence at the 5% significance level to infer that there are differences in quality among the three shifts?

Solution: Observed counts, with expected counts in brackets:

            Perfect         Satisfactory     Defective    Total
Shift 1     106 (97.80)     124 (130.87)     1 (2.33)     231
Shift 2      67 (64.78)      85 (86.68)      1 (1.54)     153
Shift 3      37 (47.42)      72 (63.45)      3 (1.13)     112
Total       210             281              5            496

Chi-Sq = 8.647, DF = 4, P-Value = 0.071

There are 3 cells with expected counts less than 5.0. In this example there is no significant result at the 5% significance level, since the p-value (0.071) is greater than 0.05. The same test is sketched in code below.
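For readers who prefer software to tables, the sketch below reproduces Example 1 with SciPy's chi2_contingency (assuming SciPy is available); the observed counts are taken directly from the table above.

```python
import numpy as np
from scipy import stats

# Observed counts from the tire-quality example: rows are shifts,
# columns are Perfect, Satisfactory, Defective.
observed = np.array([
    [106, 124, 1],
    [ 67,  85, 1],
    [ 37,  72, 3],
])

chi2, p_value, dof, expected = stats.chi2_contingency(observed)

print(f"Chi-Sq = {chi2:.3f}, DF = {dof}, P-Value = {p_value:.3f}")
# Roughly Chi-Sq = 8.647, DF = 4, P-Value = 0.071, matching the example.

print(np.round(expected, 2))   # expected counts; several fall below 5
```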
Even if we did have a significant result, we still could not fully trust it, because 3 cells (33.3% of the cells) have expected counts below 5.0.

Example 2: A food services manager for a baseball park wants to know if there is a relationship between gender (male or female) and the preferred condiment on a hot dog. The following table summarises the results. Test the hypothesis with a significance level of 10%.

          Ketchup   Mustard   Relish   Total
Male      15        23        10        48
Female    25        19         8        52
Total     40        42        18       100

Solution: The hypotheses are:

●● H0: Gender and condiments are independent
●● Ha: Gender and condiments are not independent

Observed counts, with expected counts in brackets:

          Ketchup      Mustard       Relish      Total
Male      15 (19.2)    23 (20.16)    10 (8.64)    48
Female    25 (20.8)    19 (21.84)     8 (9.36)    52
Total     40           42            18          100

None of the expected counts in the table is less than 5, so we can proceed with the chi-square test. The test statistic is

χ²* = (15 − 19.2)²/19.2 + (23 − 20.16)²/20.16 + (10 − 8.64)²/8.64 + (25 − 20.8)²/20.8 + (19 − 21.84)²/21.84 + (8 − 9.36)²/9.36 = 2.95

The p-value is found from P(χ² > χ²*) = P(χ² > 2.95) with (3 − 1)(2 − 1) = 2 degrees of freedom. Using a table or software, we find the p-value to be 0.2288. With a p-value greater than 10%, we conclude that there is not enough evidence in the data to suggest that gender and preferred condiment are related.

Assumptions of the Chi-square Test

The chi-squared test, when used with the standard approximation that a chi-squared distribution is applicable, has the following assumptions:

●● Simple random sample: The sample data is a random sampling from a fixed distribution or population where each member of the population has an equal probability of selection. Variants of the test have been developed for complex samples, such as where the data is weighted.
●● Sample size (whole table): A sufficiently large sample size is assumed. If a chi-squared test is conducted on a sample with a small size, the test will yield an inaccurate inference; by using the chi-squared test on small samples, the researcher might end up committing a Type II error.
●● Expected cell count: Adequate expected cell counts are required. Some require 5 or more, and others require 10 or more. A common rule is 5 or more in all cells of a 2-by-2 table, and 5 or more in 80% of cells in larger tables, but no cells with zero expected count. When this assumption is not met, Yates's correction is applied.
●● Independence: The observations are always assumed to be independent of each other. This means the chi-squared test cannot be used to test correlated data (like matched pairs or panel data); in those cases you might want to turn to McNemar's test.

Degrees of Freedom (d.f.)

The degrees of freedom, abbreviated as d.f., denote the extent of independence (freedom) enjoyed by a given set of observed frequencies. Degrees of freedom are usually denoted by the Greek letter 'v' (nu). Suppose we are given a set of 'n' observed frequencies which are subject to 'k' independent constraints (restrictions). Then
Degrees of freedom = Number of frequencies − Number of independent constraints, i.e. v = n − k.

Key Terms

●● Hypothesis test: A hypothesis test is a method of making decisions using data from a scientific study.
●● Type I error: A type I error is also known as a false positive and occurs when a researcher incorrectly rejects a true null hypothesis.
●● Type II error: A type II error is a false negative and occurs when a researcher fails to reject a null hypothesis which is really false.
●● Confidence interval: A confidence interval is a range of values within which the true value is expected to lie. It is a type of estimate computed from the statistics of the observed data.
●● Z-test: A z-test is a statistical test to determine whether two population means are different when the variances are known and the sample size is large.
●● p-value: The p-value is the probability of obtaining outcomes at least as extreme as the observed outcomes of a statistical hypothesis test, assuming the null hypothesis is correct.
●● Simple random sample: The sample data is a random sampling from a fixed distribution or population where each member of the population has an equal probability of selection.
●● Degrees of freedom: The degrees of freedom, abbreviated as d.f., denote the extent of independence or freedom enjoyed by a given set of observed frequencies.

Check your progress

1. A ____ is a range of values within which the true value lies.
   a) Confidence Interval b) Quartile range c) Sample d) Mean
2. A ____ is a statistical test to determine whether two population means are different when the variances are known.
   a) T test b) Quartile c) z test d) Median
3. What denotes the extent of independence enjoyed by a given set of observed frequencies?
   a) Standard deviation b) Median c) Degree of freedom d) Hypothesis
4. Which test is used as a test of goodness of fit?
   a) Z test b) T test c) Chi square test d) Fitness test
5. A _____ is also known as a false positive and occurs when a researcher incorrectly rejects a true null hypothesis.
   a) Type I error b) Type II error c) T test error d) Probability error

Questions & Exercises

1. What do you understand by a hypothesis? Explain its characteristics.
2. Explain the types of hypothesis and how to develop them.
3. What is the p-value approach to hypothesis testing?
4. Explain the chi-square test and its assumptions.
5. What do you infer about the difference between two population means?

Check your progress:

1. a) Confidence Interval
2. c) z test
3. c) Degree of freedom
4. c) Chi square test
5. a) Type I error

Further Readings

1. Richard I. Levin, David S. Rubin, Sanjay Rastogi, Masood Husain Siddiqui, Statistics for Management, Pearson Education, 7th Edition, 2016.
2. Prem S. Mann, Introductory Statistics, 7th Edition, Wiley India, 2016.
3. Gareth James, Daniela Witten, Trevor Hastie, Robert Tibshirani, An Introduction to Statistical Learning with Applications in R, Springer, 2016.

Bibliography

1. Srivastava V. K. et al., Quantitative Techniques for Managerial Decision Making, Wiley Eastern Ltd.
2. Richard I. Levin and Charles A. Kirkpatrick, Quantitative Approaches to Management, McGraw Hill, Kogakusha Ltd.
3. Prem S. Mann, Introductory Statistics, 7th Edition, Wiley India, 2016.
4. Budnik, Frank S., Dennis McLeavey, Richard Mojena, Principles of Operations Research, AITBS, New Delhi.
5. Sharma J. K., Operations Research: Theory and Applications, Macmillan, New Delhi.
6. Kalavathy S., Operations Research, Vikas Publishing.
7. Gould F. J., Introduction to Management Science, Prentice Hall, Englewood Cliffs, N.J.
8. Naray J. K., Operations Research: Theory and Applications, Macmillan, New Delhi.
9. Taha Hamdy, Operations Research, Prentice Hall of India.
10. Tulasian, Quantitative Techniques, Pearson Education.
11. Vohra N. D., Quantitative Techniques in Management, TMH.
12. Stevenson W. D., Introduction to Management Science, TMH.

Module-5: Forecasting Techniques

Learning Objective:
●● To understand the measures of linear relationship between variables
●● To get familiarised with Time Series Analysis

Learning Outcome:
●● Understand and apply forecasting techniques for business decision making and uncover relationships between variables to produce forecasts of the future values of strategic variables

L.R. Conner says, "If two or more quantities vary in sympathy so that the movement in one tends to be accompanied by corresponding movements in the others, then they are said to be correlated."

5.1.1 Measures of Linear Relationship: Covariance & Correlation – Intro

We often encounter situations where data appear as pairs of figures relating to two variables, for example, price and demand of a commodity, money supply and inflation, industrial growth and GDP, advertising expenditure and market share, etc.

Examples of correlation problems are found in the study of the relationship between IQ and aggregate percentage marks obtained in a mathematics examination, or between blood pressure and metabolism. In these examples, both variables are observed as they naturally occur, since neither variable can be fixed at predetermined levels. Correlation and regression analysis show how to determine the nature and strength of the relationship between the variables.

●● According to Croxton and Cowden, "When the relationship is of a quantitative nature, the appropriate statistical tool for discovering and measuring the relationship and expressing it in a brief formula is known as correlation."
●● A.M. Tuttle says, "Correlation is an analysis of the covariation between two or more variables."

Correlation is a degree of linear association between two random variables. In correlation analysis we do not differentiate between dependent and independent variables. It may be the case that one is the cause and the other an effect, i.e. independent and dependent variables respectively. On the other hand, both may depend on a third variable, and in some cases there may not be any cause-effect relationship at all. Therefore, if we do not consider and study the underlying economic or physical relationship, correlation may sometimes give absurd results.

5.1.2 Covariance and Correlation – Application in Real Life

Consider, for example, global average temperature and the population of India. Both have been increasing over the past 50 years but are obviously not related. Correlation is an analysis of the degree to which two or more variables fluctuate with reference to each other.

Correlation is expressed by a coefficient ranging between −1 and +1. A positive (+ve) sign indicates movement of the variables in the same direction; for example, the quantity of fertiliser used on a farm and the yield show a positive relationship within technological limits. A negative (−ve) coefficient indicates movement of the variables in opposite directions, i.e. when one variable decreases, the other increases; for example, the price and demand of a commodity have an inverse relationship. Absence of correlation is indicated by a coefficient close to zero, while a value close to ±1 denotes a very strong linear relationship.
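As a small illustration of the two measures just introduced, the sketch below (with made-up numbers) computes the covariance and the correlation coefficient for a pair of variables; it shows that correlation is simply covariance rescaled by the two standard deviations, which is why it is unit-free and bounded by ±1.

```python
import numpy as np

# Illustrative (made-up) data: monthly promotion spend and units sold
spend = np.array([2.0, 3.5, 4.0, 5.5, 7.0, 8.5])
units = np.array([120, 150, 155, 190, 220, 240], dtype=float)

# Covariance: mean product of deviations from the respective means
cov = np.mean((spend - spend.mean()) * (units - units.mean()))

# Correlation: covariance divided by the product of the standard deviations
r = cov / (spend.std() * units.std())

print(round(cov, 2), round(r, 3))   # r always lies between -1 and +1
```

Rescaling either variable (say, quoting spend in thousands instead of lakhs) changes the covariance but leaves r unchanged, which is why the correlation coefficient, rather than the covariance, is used to compare the strength of association.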
The study of correlation helps managers in the following ways:
●● To identify the relationship between various factors and decision variables.
●● To estimate the value of one variable for a given value of another, if the two are correlated.
●● To understand economic behaviour and market forces.
●● To reduce uncertainty in decision-making to a large extent.

In business, correlation analysis often helps managers take decisions by estimating the effects of changing the values of decision variables such as promotion, advertising, price or production processes on objective parameters such as costs, sales, market share, consumer satisfaction and competitive price. The decision becomes more objective by removing subjectivity to a certain extent. However, it must be understood that correlation analysis only tells us whether two or more variables in a data set fluctuate together or not; this need not be due to a cause-and-effect relationship. Whether the fluctuations in one variable actually affect the other has to be established through a logical understanding of the business environment.

5.1.3 Types of Correlation

Correlation can be studied as positive and negative, simple and multiple, partial and total, and linear and non-linear. The methods of studying correlation are plotting graphs on the x–y axes (usually scatter diagrams or line diagrams) or the algebraic calculation of a coefficient of correlation. Correlation coefficients have been defined in different ways; among them are Karl Pearson's correlation coefficient, Spearman's rank correlation coefficient and the coefficient of determination.

1. Positive or negative correlation: In positive correlation, both factors increase or decrease together. Positive or direct correlation refers to the movement of the variables in the same direction: the correlation is said to be positive when an increase (decrease) in the value of one variable is accompanied by an increase (decrease) in the value of the other. Negative or inverse correlation refers to the movement of the variables in opposite directions: correlation is said to be negative if an increase (decrease) in the value of one variable is accompanied by a decrease (increase) in the value of the other.

When there is perfect correlation, the scatter diagram shows a linear (straight-line) plot with all points falling on a straight line. With an appropriate choice of scale, the inclination of the line can be adjusted to 45°, although this is not necessary as long as the inclination is not 0° or 90°, where there is no correlation at all because the value of one variable changes without any change in the value of the other.

In the case of negative correlation, when one variable increases the other decreases, and vice versa. If the scatter diagram shows the points distributed closely around an imaginary line, we say there is a high degree of correlation. On the other hand, if we can hardly see any unique imaginary line around which the observations are scattered, we say correlation does not exist.
Even when the imaginary line is parallel to one of the axes, we say no correlation exists between the variables. If the imaginary line is a straight line, we say the correlation is linear.

2. Simple or multiple correlation: In simple correlation, the variation is between only two variables under study and is hardly influenced by any external factor; in other words, if one of the variables remains the same, there is no change in the other. For example, the variation in sales against a price change for a price-sensitive product under stable market conditions shows a negative correlation. In multiple correlation, more than two variables affect one another. In such a case, we need to study the correlation between all the pairs that affect each other and the extent of their influence.

3. Partial or total correlation: In multiple correlation analysis there are two approaches to studying the correlation. In partial correlation, we study the variation of two variables while excluding the effects of other variables by keeping them under controlled conditions. In total correlation, we allow all relevant variables to vary with respect to each other and find the combined effect. With few variables it is feasible to study total correlation; as the number of variables increases, it becomes impractical. For example, the coefficient of correlation between the yield of wheat and chemical fertilisers, excluding the effects of pesticides and manures, is a partial correlation; total correlation is based on all the variables.

4. Linear and non-linear correlation: When the amount of change in one variable tends to bear a constant ratio to the amount of change in the other variable, the correlation is said to be linear. The distinction between linear and non-linear correlation is based on the consistency of this ratio of change. The manager must be careful when analysing correlation using coefficients, because most coefficients are based on the assumption of linearity; hence plotting a scatter diagram is good practice. In the case of linear correlation, the derivative of the relationship is constant and the graph of the data is a straight line. In the case of non-linear correlation, the rate of variation changes as the values increase or decrease; the relationship may be approximated by a polynomial (parabolic, cubic, etc.), exponential, sinusoidal or other curve. In such cases, correlation coefficients based on the linearity assumption will be misleading unless used over a very short data range. Using computers, a non-linear correlation can be analysed to a certain extent under some simplifying assumptions.

5.1.4 Correlation of Grouped Data

Many times the observations are grouped into a two-way frequency distribution table, called a bivariate frequency distribution. It is a matrix in which the rows correspond to the classes of one variable and the columns to the classes of the other. Each cell (i, j) contains the frequency, i.e. the count of observations that fall jointly in the i-th class of the first variable and the j-th class of the second.
In this case the correlation coefficient is given by

r = [Σ fxy·mx·my − (Σ fx·mx)(Σ fy·my)/n] / { √[Σ fx·mx² − (Σ fx·mx)²/n] × √[Σ fy·my² − (Σ fy·my)²/n] }

where mx and my are the class marks of the frequency distributions of the X and Y variables, fx and fy are the marginal frequencies of X and Y, fxy are the joint (cell) frequencies, and n is the total frequency.

Example: Calculate the coefficient of correlation for the following data (the column classes refer to X and the row classes to Y; cell entries are joint frequencies).

Y \ X        0–500   500–1000   1000–1500   1500–2000   2000–2500   Total
0–200          12        6           –           –           –        18
200–400         2       18           4           2           1        27
400–600         –        4           7           3           –        14
600–800         –        1           –           2           1         4
800–1000        –        –           1           2           3         6
Total          14       29          12           9           5        69

Solution: Let the assumed mean for X be a = 1250 and the scaling factor g = 500. We can then calculate Σ f·dx and Σ f·dx² from the marginal distribution of X as:

X class       Class mark mx   dx = (mx − a)/g   Frequency f    f·dx    f·dx²
0–500              250              −2              14          −28      56
500–1000           750              −1              29          −29      29
1000–1500         1250               0              12            0       0
1500–2000         1750               1               9            9       9
2000–2500         2250               2               5           10      20
Total                                               69          −38     114

The corresponding sums for Y, and the joint sum Σ fxy·dx·dy over the cells, are computed in the same way and substituted into the formula above to obtain r; a programmatic completion of this calculation is sketched below.
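The grouped-data calculation above can be finished by computer. The sketch below is a minimal, illustrative completion of this example using NumPy; it works directly with the class marks and the joint frequency table, which is equivalent to the coded (dx, dy) computation.

```python
import numpy as np

# Joint frequencies: rows are Y classes (0-200 ... 800-1000),
# columns are X classes (0-500 ... 2000-2500)
f = np.array([[12,  6,  0,  0,  0],
              [ 2, 18,  4,  2,  1],
              [ 0,  4,  7,  3,  0],
              [ 0,  1,  0,  2,  1],
              [ 0,  0,  1,  2,  3]], dtype=float)

mx = np.array([250, 750, 1250, 1750, 2250], dtype=float)  # class marks of X
my = np.array([100, 300,  500,  700,  900], dtype=float)  # class marks of Y

n  = f.sum()              # total frequency (69)
fx = f.sum(axis=0)        # marginal frequencies of X (14, 29, 12, 9, 5)
fy = f.sum(axis=1)        # marginal frequencies of Y (18, 27, 14, 4, 6)

sum_joint = (f * np.outer(my, mx)).sum()   # sum of f_xy * mx * my over all cells
sum_fx_mx = (fx * mx).sum()
sum_fy_my = (fy * my).sum()

numerator   = sum_joint - sum_fx_mx * sum_fy_my / n
denominator = np.sqrt(((fx * mx**2).sum() - sum_fx_mx**2 / n) *
                      ((fy * my**2).sum() - sum_fy_my**2 / n))

print(round(numerator / denominator, 3))   # grouped-data correlation coefficient
```

The same value is obtained from the coded deviations dx and dy, since the correlation coefficient is unaffected by a change of origin and scale.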
Definition: The correlation coefficient measures the degree of association between two variables X and Y. The coefficient is given as

r = Cov(X, Y) / (σX σY),  where  Cov(X, Y) = (1/n) Σ (X − X̄)(Y − Ȳ)     ...(1)

Here r is the correlation coefficient (or product moment correlation coefficient) between X and Y, σX and σY are the standard deviations of X and Y respectively, and n is the number of pairs of values of X and Y in the given data. The expression (1/n) Σ (X − X̄)(Y − Ȳ) is known as the covariance between X and Y and is denoted Cov(X, Y).

The correlation coefficient r is a dimensionless number whose value lies between +1 and −1. Positive values of r indicate positive (or direct) correlation between X and Y, i.e. both X and Y increase or decrease together. Negative values of r indicate negative (or inverse) correlation, meaning that an increase in one variable results in a decrease in the value of the other. A zero correlation means that there is no association between the two variables.

The formula can be modified as

r = [ΣXY/n − (ΣX/n)(ΣY/n)] / { √[ΣX²/n − (ΣX/n)²] × √[ΣY²/n − (ΣY/n)²] }     ...(2)

  = (E[XY] − E[X]E[Y]) / { √(E[X²] − (E[X])²) × √(E[Y²] − (E[Y])²) }     ...(3)

Equations (2) and (3) are alternative forms of equation (1). They have the advantage that the deviation of each value from the mean need not be computed separately.

Example: The data on advertisement expenditure (X) and sales (Y) of a company for a past 10-year period are given below. Determine the correlation coefficient between these variables and comment on the correlation.

X     50    50    50    40    30    20    20    15    10     5
Y    700   650   600   500   450   400   300   250   210   200

Solution: We take U to be the deviation of the X values from the assumed mean 30 divided by 5, and V to be the deviation of the Y values from the assumed mean 400 divided by 10.

Sl. No.   X = xi   Y = yi   U = ui   V = vi   ui·vi    ui²     vi²
   1        50      700       4        30      120      16     900
   2        50      650       4        25      100      16     625
   3        50      600       4        20       80      16     400
   4        40      500       2        10       20       4     100
   5        30      450       0         5        0       0      25
   6        20      400      −2         0        0       4       0
   7        20      300      −2       −10       20       4     100
   8        15      250      −3       −15       45       9     225
   9        10      210      −4       −19       76      16     361
  10         5      200      −5       −20      100      25     400
Total                        −2        26      561     110    3136

Short-cut procedure for calculation of the correlation coefficient:

r = [Σuivi − (Σui)(Σvi)/n] / { √[Σui² − (Σui)²/n] × √[Σvi² − (Σvi)²/n] }
  = [561 − (−2)(26)/10] / { √[110 − (−2)²/10] × √[3136 − (26)²/10] }
  = 566.2 / (√109.6 × √3068.4) = 0.976

Interpretation of r

The correlation coefficient r ranges from −1 to 1. A value of 1 implies that a linear equation describes the relationship between X and Y perfectly, with all data points lying on a line for which Y increases as X increases. A value of −1 implies that all data points lie on a line for which Y decreases as X increases. A value of 0 implies that there is no linear correlation between the variables.

More generally, note that (Xi − X̄)(Yi − Ȳ) is positive if and only if Xi and Yi lie on the same side of their respective means. Thus the correlation coefficient is positive if Xi and Yi tend to be simultaneously greater than, or simultaneously less than, their respective means, and negative if they tend to lie on opposite sides of their respective means.

●● The coefficient of correlation r lies between −1 and +1, inclusive of those values.
●● When r is positive, the variables x and y increase or decrease together.
●● r = +1 implies that there is a perfect positive correlation between the variables x and y.
●● When r is negative, the variables x and y move in opposite directions.
●● When r = −1, there is a perfect negative correlation.
●● When r = 0, the two variables are uncorrelated.
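The short-cut computation above is easy to verify by machine. The following minimal sketch (using NumPy, an assumption on tooling) codes the same U and V deviations and applies the short-cut formula; the built-in np.corrcoef gives the same answer directly from the raw data.

```python
import numpy as np

X = np.array([50, 50, 50, 40, 30, 20, 20, 15, 10, 5], dtype=float)
Y = np.array([700, 650, 600, 500, 450, 400, 300, 250, 210, 200], dtype=float)

u = (X - 30) / 5          # deviations from assumed mean 30, scaled by 5
v = (Y - 400) / 10        # deviations from assumed mean 400, scaled by 10
n = len(X)

num = np.sum(u * v) - np.sum(u) * np.sum(v) / n
den = np.sqrt((np.sum(u**2) - np.sum(u)**2 / n) *
              (np.sum(v**2) - np.sum(v)**2 / n))

print(round(num / den, 3))                 # approx. 0.976
print(round(np.corrcoef(X, Y)[0, 1], 3))   # same value from the raw data
```

The agreement between the two printed values illustrates that the correlation coefficient is unchanged by the change of origin and scale used in the short-cut method.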
5.1.5 Spearman Rank Correlation Method – Intro & Application

Quite often the data are available in the form of rankings on different variables. There are also occasions where it is difficult to measure the cause-and-effect variables. For example, while selecting a candidate there are a number of factors on which experts base their assessment, and many of these cannot be measured in physical units, e.g. sincerity, loyalty, integrity, tactfulness, initiative, etc. The same is true of dance contests. In such cases the experts may rank the candidates, and it is then necessary to find out whether the two sets of ranks are in agreement with each other. This is measured by the rank correlation coefficient. The purpose of computing a correlation coefficient in such situations is to determine the extent to which the two sets of rankings are in agreement.

The coefficient determined from these ranks is known as Spearman's rank correlation coefficient, rs. It is defined by the following formula:

rs = 1 − [6 Σ di²] / [n(n² − 1)]

where n is the number of observation pairs and di = xi − yi is the difference between the ranks xi and yi assigned to the i-th individual on variables X and Y.

Rank Correlation when Ranks are Given

Example: The ranks obtained by a set of ten students in a mathematics test (variable X) and a physics test (variable Y) are shown below. Determine the coefficient of rank correlation, rs.

Rank for Variable X:   1   2   3   4   5   6   7   8   9   10
Rank for Variable Y:   3   1   4   2   6   9   8  10   5    7

Solution: The computation of Spearman's rank correlation is shown below.

Individual   Rank in Maths (xi)   Rank in Physics (yi)   di = xi − yi   di²
    1                1                    3                  −2           4
    2                2                    1                   1           1
    3                3                    4                  −1           1
    4                4                    2                   2           4
    5                5                    6                  −1           1
    6                6                    9                  −3           9
    7                7                    8                  −1           1
    8                8                   10                  −2           4
    9                9                    5                   4          16
   10               10                    7                   3           9
Total                                                                    50

Now n = 10 and Σdi² = 50. Using the formula,

rs = 1 − (6 × 50) / [10(100 − 1)] = 1 − 300/990 = 0.697

It can be said that there is a high degree of correlation between the performance in mathematics and physics.

Rank Correlation when Ranks are Not Given

Example: Find the rank correlation coefficient for the following data.

X:    75    88    95    70    60    80    81    50
Y:   120   134   150   115   110   140   142   100

Solution: Let R1 and R2 denote the ranks in X and Y respectively. In this method the biggest item gets the first rank, the next biggest the second rank, and so on.

  X      Y     R1    R2    d = R1 − R2    d²
 75    120      5     5         0          0
 88    134      2     4        −2          4
 95    150      1     1         0          0
 70    115      6     6         0          0
 60    110      7     7         0          0
 80    140      4     3         1          1
 81    142      3     2         1          1
 50    100      8     8         0          0
Total                                       6

Coefficient of rank correlation:  ρ = 1 − 6Σd² / [n(n² − 1)] = 1 − (6 × 6) / [8(64 − 1)] = +0.93
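The first rank-correlation example can be checked with a few lines of code. This is a minimal sketch using NumPy; scipy.stats.spearmanr would give the same coefficient directly for untied ranks.

```python
import numpy as np

# Ranks of ten students in Mathematics (x) and Physics (y), from the example above
x = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
y = np.array([3, 1, 4, 2, 6, 9, 8, 10, 5, 7])

d = x - y                     # rank differences
n = len(x)
r_s = 1 - 6 * np.sum(d**2) / (n * (n**2 - 1))

print(np.sum(d**2), round(r_s, 3))   # 50 and approx. 0.697
```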
5.1.6 Regression Model

There is a need for a statistical model that extracts information from the given data to establish the regression relationship between an independent and a dependent variable. The model should capture the systematic behaviour of the data. The non-systematic behaviour cannot be captured and is called error. The error is due to a random component that cannot be predicted, as well as to components not adequately considered in the statistical model. A good statistical model captures the entire systematic component, leaving only random errors. In any model we attempt to capture everything that is systematic in the data; random errors cannot be captured in any case. Assuming the random errors are normally distributed, we can specify a confidence level and interval for them, which makes our estimates more reliable.

If the variables in a bivariate distribution are correlated, the points in the scatter diagram cluster approximately around some curve. If the curve is a straight line we call it linear regression; otherwise it is curvilinear regression. The equation of the curve that is closest to the observations is called the best fit. The best fit is calculated according to Legendre's principle of least squares, i.e. by minimising the sum of squared deviations of the observed data points from the corresponding values on the best-fit curve. This is called the minimum squared error criterion. It may be noted that the deviation (error) can be measured in the X direction or in the Y direction, and accordingly we get two best-fit curves. If we measure the deviation in the Y direction, i.e. for a given x value of a data point (x, y) we take the difference between the observed y and the corresponding y value on the best-fit curve, we call it the regression of Y on X. In the other case, where we measure deviations in the X direction, we call it the regression of X on Y.

Definition: According to Morris Myers Blair, regression is the measure of the average relationship between two or more variables in terms of the original units of the data.

Applicability of Regression Analysis

Regression analysis is a branch of statistical theory that is widely used in all scientific disciplines. It is a basic technique for measuring or estimating the relationship among economic variables that constitute the essence of economic theory and economic life. Its uses are not confined to economic and business activities; its applications extend to almost all the natural, physical and social sciences.

Regression analysis is one of the most popular and commonly used statistical tools in business, and the availability of computer packages has simplified its use. However, one must be careful before using this tool, as it gives only a mathematical measure based on the available data: it does not check whether a cause-effect relationship really exists, nor, if it exists, which variable is dependent and which independent. Regression analysis helps in the following ways:
●● It provides a mathematical relationship between two or more variables. This relationship can then be used for further analysis and treatment of information using more complex techniques.
●● Since most business analysis and decisions are based on cause-effect relationships, regression analysis is a highly valuable tool for providing a mathematical model of such relationships.
●● The widest use of regression analysis is in analysis, estimation and forecasting.
●● Regression analysis is also used in establishing theories based on the relationships between various parameters.
●● Some common examples are demand and supply, money supply and expenditure, inflation and interest rates, promotion expenditure and sales, productivity and profitability, health of workers and absenteeism, etc.

5.1.7 Estimating the Coefficients Using the Least Squares Method

Generally, the method used to find the best fit that a straight line can give is the least squares method. To use it efficiently, we first determine the deviation sums (lower-case x and y denote deviations from the means X̄ and Ȳ):

Σxi²  = ΣXi² − nX̄²
Σyi²  = ΣYi² − nȲ²
Σxiyi = ΣXiYi − nX̄Ȳ

b = Σxiyi / Σxi²,   a = Ȳ − bX̄

These measures define a and b, which give the best possible fit through the original X and Y points, and the value of r can then be worked out as

r = b √(Σxi² / Σyi²)

Thus, regression analysis is a statistical method that deals with the formulation of a mathematical model depicting the relationship among variables, which can be used for predicting the value of the dependent variable given the values of the independent variable.

Alternatively, for fitting a regression equation of the type Y = a + bX to the given values of the X and Y variables, we can find the two constants a and b by using the following two normal equations:

ΣYi = na + bΣXi
ΣXiYi = aΣXi + bΣXi²

Solving these equations gives the values of a and b. Once these values are obtained and put into the equation Y = a + bX, we say that we have fitted the regression equation of Y on X to the given data. In a similar fashion, we can develop the regression equation of X on Y, viz. X = a + bY, presuming Y to be the independent variable and X the dependent variable.
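To make the normal-equation procedure concrete, the sketch below fits the regression of sales (Y) on advertisement expenditure (X) for the data used in the correlation example earlier; it is a minimal illustration with NumPy, not a full diagnostic workflow.

```python
import numpy as np

# Advertisement expenditure (X) and sales (Y) from the earlier example
X = np.array([50, 50, 50, 40, 30, 20, 20, 15, 10, 5], dtype=float)
Y = np.array([700, 650, 600, 500, 450, 400, 300, 250, 210, 200], dtype=float)
n = len(X)

# Deviation sums: S_xy = sum(XY) - n*Xbar*Ybar, S_xx = sum(X^2) - n*Xbar^2
S_xy = np.sum(X * Y) - n * X.mean() * Y.mean()
S_xx = np.sum(X**2) - n * X.mean()**2

b = S_xy / S_xx               # slope of the regression of Y on X
a = Y.mean() - b * X.mean()   # intercept

print(round(a, 2), round(b, 3))   # roughly 126.4 and 10.33
print(round(a + b * 35, 1))       # predicted sales when X = 35
```

Solving the two normal equations directly (for example with np.linalg.solve on the 2×2 system) gives exactly the same a and b; the deviation-sum formulas are simply the closed-form solution of that system.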
5.1.8 Assessing the Model: Method of Least Squares – Parabolic Trend

The mathematical form of a parabolic trend is given by Yt = a + bt + ct², or Y = a + bt + ct² (dropping the subscript for convenience). Here a, b and c are constants to be determined from the given data. Using the method of least squares, the normal equations for the simultaneous solution of a, b and c are:

ΣY   = na + bΣt + cΣt²
ΣtY  = aΣt + bΣt² + cΣt³
Σt²Y = aΣt² + bΣt³ + cΣt⁴

By selecting a suitable year of origin, i.e. defining X = t − origin such that ΣX = 0, the computational work can be considerably simplified. Note also that if ΣX = 0, then ΣX³ = 0 as well. Thus the above equations can be rewritten as:

ΣY   = na + cΣX²          ...(i)
ΣXY  = bΣX²               ...(ii)
ΣX²Y = aΣX² + cΣX⁴        ...(iii)

From equation (ii) we get

b = ΣXY / ΣX²             ...(iv)

Solving equations (i) and (iii) simultaneously gives

c = [nΣX²Y − (ΣX²)(ΣY)] / [nΣX⁴ − (ΣX²)²]     ...(v)

and then, from equation (i),

a = (ΣY − cΣX²) / n       ...(vi)

Thus, equations (iv), (v) and (vi) can be used to determine the values of the constants a, b and c.

5.1.9 Standard Error of Estimate

The standard error of estimate is the measure of variation around the computed regression line. The standard error of estimate of Y measures the variability of the observed values of Y around the regression line, i.e. the scatter of the observations about the line of regression.

The standard error of estimate of Y on X is:

SE(Y on X) = √[ Σ(Y − Ye)² / (n − 2) ]

where Y is the observed value, Ye is the estimated value from the fitted equation corresponding to each Y value, e = Y − Ye is the error term, and n is the number of observations in the sample.

A convenient computational formula is:

SE(Y on X) = √[ (ΣY² − aΣY − bΣXY) / (n − 2) ]

where X is the value of the independent variable, Y is the value of the dependent variable, a is the Y-intercept, b is the slope of the estimating equation, and n is the number of data points.

Regression Coefficient of X on Y

The regression coefficient of X on Y, represented by the symbol bxy, measures the change in X for a unit change in Y. When the deviations x = X − X̄ and y = Y − Ȳ are taken from the actual means of X and Y, it is obtained as

bxy = Σxy / Σy²

When the deviations dx and dy are taken from assumed means, the formula becomes

bxy = [nΣdxdy − (Σdx)(Σdy)] / [nΣdy² − (Σdy)²]

Regression Coefficient of Y on X

The symbol byx measures the change in Y corresponding to a unit change in X. When the deviations are taken from the actual means,

byx = Σxy / Σx²

and when the deviations are taken from assumed means,

byx = [nΣdxdy − (Σdx)(Σdy)] / [nΣdx² − (Σdx)²]

●● The regression coefficient is also called a slope coefficient because it determines the slope of the regression line, i.e. the change in the dependent variable for a unit change in the independent variable.
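The two regression coefficients and the standard error of estimate can be computed for the advertisement–sales data as follows. This is an illustrative sketch; it also uses the relation, discussed in the next subsection, that the correlation coefficient is the geometric mean of the two regression coefficients.

```python
import numpy as np

# Advertisement expenditure (X) and sales (Y), as in the earlier examples
X = np.array([50, 50, 50, 40, 30, 20, 20, 15, 10, 5], dtype=float)
Y = np.array([700, 650, 600, 500, 450, 400, 300, 250, 210, 200], dtype=float)

x = X - X.mean()      # deviations from the actual means
y = Y - Y.mean()

b_yx = np.sum(x * y) / np.sum(x**2)   # regression coefficient of Y on X
b_xy = np.sum(x * y) / np.sum(y**2)   # regression coefficient of X on Y
r = np.sqrt(b_yx * b_xy)              # both slopes are positive here, so r is positive

# Standard error of estimate of Y on X
n = len(X)
a = Y.mean() - b_yx * X.mean()
Y_e = a + b_yx * X                    # estimated values from the fitted line
se_yx = np.sqrt(np.sum((Y - Y_e)**2) / (n - 2))

print(round(b_yx, 3), round(b_xy, 4), round(r, 3), round(se_yx, 1))
```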
5.1.10 Regression Coefficient

The coefficients of regression are byx and bxy. They have the following implications:
●● The slopes of the regression lines of Y on X and of X on Y, viz. byx and bxy, must have the same sign (because r² cannot be negative).
●● The correlation coefficient is the geometric mean of byx and bxy.
●● If both slopes byx and bxy are positive, the correlation coefficient r is positive; if both are negative, r is negative.
●● Both regression lines intersect at the point (X̄, Ȳ).

As in the calculation of the correlation coefficient, we can directly write the formulas for the two regression coefficients of a bivariate frequency distribution:

byx = [N ΣΣ fij xi yj − (Σ fi xi)(Σ fj yj)] / [N Σ fi xi² − (Σ fi xi)²]

or, if we define ui = (xi − A)/h and vj = (yj − B)/k,

byx = (k/h) × [N ΣΣ fij ui vj − (Σ fi ui)(Σ fj vj)] / [N Σ fi ui² − (Σ fi ui)²]

Similarly,

bxy = [N ΣΣ fij xi yj − (Σ fi xi)(Σ fj yj)] / [N Σ fj yj² − (Σ fj yj)²]

or

bxy = (h/k) × [N ΣΣ fij ui vj − (Σ fi ui)(Σ fj vj)] / [N Σ fj vj² − (Σ fj vj)²]

5.2.1 Time Series

Time series analysis systematically identifies and isolates different kinds of time-related patterns in the data. Four common patterns are horizontal, trend, seasonal and cyclic; a random component is superimposed on these patterns. There is a procedure for decomposing the time series into these patterns, which are then used for forecasting. A more accurate and statistically sound procedure, however, is to identify the patterns in a time series using autocorrelations, i.e. the correlation between values of the same variable at different time lags. When the time series consists of completely random data, the autocorrelations at the various time lags are close to zero, fluctuating between positive and negative values. If the autocorrelation drops to zero only slowly, and more than two or three of the autocorrelations differ significantly from zero, this indicates the presence of a trend in the data. The trend can be removed by taking differences between consecutive values and constructing a new series; this is called numerical differentiation. A brief programmatic illustration is given at the end of this subsection.

Definition: A time series is a collection of data obtained by observing a response variable at periodic points in time. If repeated observations on a variable produce a time series, the variable is called a time series variable. We use Yi to denote the value of the variable at time i.

Objectives of Time Series

The analysis of a time series implies its decomposition into the various factors that affect the value of its variable in a given period. It is a quantitative and objective evaluation of the effects of various factors on the activity under consideration. There are two main objectives of the analysis of any time series data:
1. To study the past behaviour of the data.
2. To make forecasts for the future.

The study of past behaviour is essential because it provides knowledge of the effects of various forces. This facilitates anticipating the future course of events and, thus, forecasting the value of the variable as well as planning for the future.
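The autocorrelation check and differencing described above can be sketched as follows (illustrative, simulated data; any autocorrelation routine, such as pandas' Series.autocorr, could be used instead).

```python
import numpy as np

def autocorr(series, lag):
    """Sample autocorrelation of a one-dimensional series at the given lag."""
    s = np.asarray(series, dtype=float) - np.mean(series)
    return np.sum(s[lag:] * s[:-lag]) / np.sum(s**2)

# Simulated series: a linear trend plus random noise (illustrative data only)
rng = np.random.default_rng(0)
t = np.arange(40)
y = 100 + 2.5 * t + rng.normal(0, 3, size=t.size)

# Autocorrelations of the raw series decay slowly, signalling a trend
print([round(autocorr(y, k), 2) for k in (1, 2, 3, 4)])

# First differences (numerical differentiation) remove the trend,
# and the slow decay of the autocorrelations disappears
dy = np.diff(y)
print([round(autocorr(dy, k), 2) for k in (1, 2, 3, 4)])
```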
5.2.2 Variation in Time Series

Time Series Analysis – Secular Component

Secular trend, or simply trend, is the general tendency of the data to increase, decrease or stagnate over a long period of time. Most business and economic time series reveal a tendency to increase or to decrease over a number of years. For example, data on industrial production, agricultural production, population, bank deposits, deficit financing, etc., show that, in general, these magnitudes have been rising over a fairly long period. As opposed to this, a time series may also reveal a declining trend: when one commodity is substituted by another, the demand for the substituted commodity declines, as with the demand for cotton clothes or for coarse grains like bajra and jowar; with improved medical facilities, the death rate is likely to show a declining trend, and so on. The change in trend, in either case, is attributable to fundamental forces such as changes in population, technology, the composition of production, etc.

Time Series Analysis – Seasonal Component

Seasonal variations are cycles that occur over short periods of time, normally less than one year, e.g. monthly, weekly or daily. A time series in which the time interval between successive observations is less than or equal to one year may contain the effects of both seasonal and cyclical variations; seasonal variations are absent if the interval between successive observations is greater than one year.

Causes of seasonal variations: The main causes of seasonal variations are:
●● Climatic conditions
●● Customs and traditions

Climatic conditions: Changes in climatic conditions affect the value of the time series variable, and the resulting changes are known as seasonal variations. For example, the sale of woollen garments is generally at its peak in November and December because of the onset of winter. Similarly, timely rainfall may increase agricultural output, and the prices of agricultural commodities are lowest during their harvesting season; both reflect the effect of climatic conditions on the time series variable.

Customs and traditions: The customs and traditions of the people also give rise to seasonal variations in a time series. For example, the purchase of clothing and ornaments may be highest during the marriage season, and the sale of sweets peaks during Diwali; such variations are the result of customs and traditions.

Time Series Analysis – Cyclical Component

Cyclical variations are revealed by most economic and business time series and are therefore also termed trade or business cycles. Any trade cycle has four phases, known respectively as boom, recession, depression and recovery.
●● The various phases repeat themselves regularly, one after another, in the given sequence. The time interval between two identical phases is known as the period of the cyclical variations. The period is always greater than one year; normally it lies between 3 and 10 years.

Objectives of measuring cyclical variations: The main objectives are:
●● To analyse the behaviour of cyclical variations in the past.
●● To predict the effect of cyclical variations so as to provide guidelines for future business policies.

Time Series Analysis – Random Component

As the name suggests, these variations do not reveal any regular pattern of movement. They are caused by random factors such as strikes, fires, floods, wars, famines, etc. Random variation is the component of a time series that cannot be explained in terms of any of the components discussed so far. This component is obtained as a residue after the elimination of the trend, seasonal and cyclical components, and hence is often termed the residual component. Random variations are usually short-term variations, but sometimes their effect may be so intense that the value of the trend is permanently affected.

Numerical Application

Example 1: Using the free-hand method, determine the trend of the following data.

Year:                  1998   1999   2000   2001   2002   2003   2004   2005
Production (tonnes):     42     44     48     42     46     50     48     52

Solution: The production figures are plotted against the years and a smooth free-hand curve is drawn through the plotted points to show the general upward trend. (Graph omitted.)

Example 2: Find the trend values from the following data using three-yearly moving averages and show the trend line on a graph.

Year:       1994  1995  1996  1997  1998  1999  2000  2001  2002  2003  2004  2005
Price (`):    52    65    58    63    66    72    75    70    64    78    80    73

Solution: Computation of trend values.

Year   Price (`)   3-yearly moving total   3-yearly moving average
1994       52               –                        –
1995       65              175                     58.33
1996       58              186                     62.00
1997       63              187                     62.33
1998       66              201                     67.00
1999       72              213                     71.00
2000       75              217                     72.33
2001       70              209                     69.67
2002       64              212                     70.67
2003       78              222                     74.00
2004       80              231                     77.00
2005       73               –                        –

The moving averages are then plotted against the years to show the trend line. (Graph omitted.)
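The moving-average computation in Example 2 is easy to automate. The sketch below (NumPy assumed) reproduces the three-yearly moving totals and averages shown in the table above.

```python
import numpy as np

# Yearly prices from Example 2 (1994-2005)
years  = list(range(1994, 2006))
prices = np.array([52, 65, 58, 63, 66, 72, 75, 70, 64, 78, 80, 73], dtype=float)

# Three-yearly moving totals, centred on the middle year of each window
window = 3
totals = np.convolve(prices, np.ones(window), mode="valid")
averages = totals / window

for year, total, avg in zip(years[1:-1], totals, averages):
    print(year, int(total), round(avg, 2))
# 1995 175 58.33, 1996 186 62.0, ..., 2004 231 77.0
```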
Key Terms

●● Correlation: Correlation is expressed by a coefficient ranging between −1 and +1; a positive sign indicates movement of the variables in the same direction.
●● Positive correlation: The correlation is said to be positive when an increase (decrease) in the value of one variable is accompanied by an increase (decrease) in the value of the other variable.
●● Negative correlation: Negative or inverse correlation refers to the movement of the variables in opposite directions.
●● Linear correlation: When the amount of change in one variable tends to bear a constant ratio to the amount of change in the other variable, the correlation is said to be linear.
●● Regression: Regression is a basic technique for measuring or estimating the relationship among economic variables that constitute the essence of economic theory and economic life.
●● Time Series: A time series is a collection of data obtained by observing a response variable at periodic points in time.
●● Standard Error of Estimate: The standard error of estimate is the measure of variation around the computed regression line.

Check your progress:

1. In ____ correlation, both factors increase or decrease together.
   a) Constant  b) Positive  c) Negative  d) Probability
2. The correlation that refers to the movement of the variables in opposite directions is
   a) Constant  b) Positive  c) Negative  d) Probability
3. A ____ is a collection of data obtained by observing a response variable at periodic points in time.
   a) Mean deviation  b) Sample  c) Time Series  d) Hypothesis
4. The technique for estimating the relationship among economic variables that constitute the essence of economic theory is
   a) Correlation  b) Time Series  c) Regression  d) Standard deviation
5. In ____ correlation, more than two variables affect one another.
   a) Partial correlation  b) Total correlation  c) Standard correlation  d) Multiple correlation

Questions and exercises

1. Explain the measures of linear relationship.
2. What is correlation? What are the various types of correlation?
3. Explain correlation in grouped data.
4. The data on advertisement expenditure (X) and sales (Y) of a company for a past 10-year period are given below. Determine the correlation coefficient between these variables and comment on the correlation.

X     50    50    50    40    30    20    20    15    10     5
Y    700   650   600   500   450   400   300   250   210   200

5. What do you understand by time series analysis? Explain its components.

Answers to Check your progress:

1. b) Positive
2. c) Negative
3. c) Time Series
4. c) Regression
5. d) Multiple correlation

Further Readings

1. Richard I. Levin, David S. Rubin, Sanjay Rastogi and Masood Husain Siddiqui, Statistics for Management, 7th Edition, Pearson Education, 2016.
2. Prem S. Mann, Introductory Statistics, 7th Edition, Wiley India, 2016.
3. Gareth James, Daniela Witten, Trevor Hastie and Robert Tibshirani, An Introduction to Statistical Learning with Applications in R, Springer, 2016.

Bibliography

1. Srivastava V. K. et al., Quantitative Techniques for Managerial Decision Making, Wiley Eastern Ltd.
2. Richard I. Levin and Charles A. Kirkpatrick, Quantitative Approaches to Management, McGraw-Hill Kogakusha Ltd.
3. Prem S. Mann, Introductory Statistics, 7th Edition, Wiley India, 2016.
4. Budnick Frank S., Dennis McLeavey and Richard Mojena, Principles of Operations Research, AITBS, New Delhi.
5. Sharma J. K., Operations Research: Theory and Applications, Macmillan, New Delhi.
6. Kalavathy S., Operations Research, Vikas Publishing.
7. Gould F. J., Introduction to Management Science, Prentice Hall, Englewood Cliffs, N.J.
8. Naray J. K., Operations Research: Theory and Applications, Macmillan, New Delhi.
9. Taha Hamdy, Operations Research, Prentice Hall of India.
10. Tulasian, Quantitative Techniques, Pearson Education.
11. Vohra N. D., Quantitative Techniques in Management, TMH.
12. Stevenson W. D., Introduction to Management Science, TMH.