LESSON 1: Basic Statistical Concepts

1.1 CONCEPTS AND NATURE OF STATISTICS

The purpose of this lesson is to convince the learner that information resulting from a good statistical analysis is always concise, often precise, and never useless! The spirit of statistics is, in fact, very well captured by the quotation commonly attributed to John Maynard Keynes: “It is better to be roughly right than precisely wrong.”

Statistics is a science that helps us make better decisions in business and economics, as well as in other fields. Today, the field of statistics is widely recognized as an important tool for testing concepts and for perceiving new directions in various fields of specialization.

Almost daily, we make “educated guesses” about future events in our lives in order to plan for new situations or experiences. As these experiences occur, we are sometimes able to confirm or support our ideas. At other times, however, we are not so lucky and must face some unpleasant consequences. Sometimes we win, sometimes we lose. Thus, we may make a sound investment in the stock market but regret our voting decision; win money at a game but discover we have taken the wrong medicine for our illness; do well on a final exam but give a miserable defense of the marketing plan. We therefore need statistical thinking, or the ability to weigh the merits of different options based on whatever data are available.

Statistical analysis often involves an attempt to generalize from the data. Our data are summarized, displayed in meaningful ways and analyzed. Deriving statistical information requires tedious work: describing society and nature by means of a set of numbers. The collection, organization, presentation, analysis and interpretation of numerical data must be done with accuracy and precision in order to arrive at valid and reliable statistical information. Actually, without noticing it, people often apply statistics in their everyday lives.
A worker in any field, such as education, social sciences, behavioral sciences, applied sciences, engineering, research, business and economics, is expected to have at least statistical literacy. Today, college students in almost all disciplines, including business and economics, are required to take at least one statistics subject. In fact, it is virtually impossible to practice one’s profession without a minimal understanding of statistics.

In business and economics, statistics plays an important role in market feasibility studies for new products, forecasting of business trends, control and maintenance of high-quality products, improvement of employer-employee relationships, and analysis of data concerning insurance, investment, sales, employment, transportation, communication, auditing and accounting procedures. In research, methods for the statistical design of surveys and experiments are valuable to researchers. Studying the causes and effects of factors affecting experiments may lead to discoveries, which have to be supported by statistical data obtained from repeated trials.

1.2 DEFINITION OF STATISTICS

Statistics is defined as a branch of mathematics/science which deals with the collection, organization, presentation, analysis and interpretation of numerical data for the purpose of assisting in making more effective decisions.

Like almost all fields of study, statistics has two aspects: theoretical and applied. Theoretical or mathematical statistics deals with the development, derivation and proof of statistical theorems, formulas, rules and laws. Applied statistics involves the application of these theorems, formulas, rules and laws to solve real-world problems.

1.3 DIVISIONS OF APPLIED STATISTICS

There are two divisions of applied statistics which help decision-makers extract the maximum usefulness from limited information. These areas of applied statistics are descriptive statistics and inferential statistics.

DESCRIPTIVE STATISTICS.
Descriptive statistics is concerned with collecting, organizing, summarizing and presenting data, using techniques that summarize values describing the group characteristics of the data. These techniques distinguish the important regularities and patterns of variation from the nonsystematic component of data. The most common values are the measures of central tendency, variation, skewness and kurtosis. Preparation of tables, construction of graphs and computation of measures, such as averages and percentages, fall within this area of statistics.

Examples of situations involving the use of descriptive statistics:

1. According to the Human Resource Department of a certain company, the total number of employees is 5,000. The figure 5,000 merely describes the company’s total employment. Thus, the figure 5,000 is a descriptive statistic.
2. The guard in a department store records the number of buying customers daily for the past 7 days.
3. The market researcher of a manufacturing company constructs a graph showing the fluctuations in sales for a major product line during the last 3 years.

INFERENTIAL STATISTICS.
Another facet of statistics is inferential statistics. It is concerned with making predictions and inferences about a population based on samples selected from it. Selection of the single most desirable course of action from among a set of alternative actions is the concern of this technique. For example, the decision of a manufacturer to introduce a new product depends on the size of the market and the marketing cost. As another example, suppose a company receives a shipment of parts from a manufacturer that are to be used in DVD players manufactured by the company. To check the quality of the whole shipment, the company will select a few items from the shipment, inspect them and make a decision. The area of statistics that deals with drawing conclusions and making decisions of this kind is referred to as inferential statistics.
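The shipment example above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the shipment contents, the sample size of 50, the 5% acceptance threshold and the function name inspect_shipment are all hypothetical and do not come from the lesson.

```python
import random

def inspect_shipment(shipment, sample_size, max_defect_rate, seed=0):
    """Inspect a random sample and decide whether to accept the whole shipment.

    shipment: list of booleans, where True means the part is defective.
    Returns (sample_defect_rate, accept).
    """
    rng = random.Random(seed)                   # fixed seed so the sketch is repeatable
    sample = rng.sample(shipment, sample_size)  # select items without replacement
    defect_rate = sum(sample) / sample_size     # proportion of defective items in the sample
    return defect_rate, defect_rate <= max_defect_rate

# Hypothetical shipment: 1,000 parts, 30 of them defective.
shipment = [True] * 30 + [False] * 970
rate, accept = inspect_shipment(shipment, sample_size=50, max_defect_rate=0.05)
print(f"Sample defect rate: {rate:.2%}; accept shipment: {accept}")
```

Only 50 of the 1,000 parts are inspected, yet the decision applies to the whole shipment. That step from sample to population is what makes the procedure inferential rather than descriptive.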
Examples of situations involving the use of inferential statistics:

1. The manager of a department store records the number of buying customers daily for seven consecutive weeks and then estimates the average number of buying customers for the following weeks.
2. The dean recorded enrolment statistics of the college for the last 6 semesters and then determined whether there will be a relative increase or decrease in the enrolment for the next semester.
3. A market researcher wants to find out the relationship between the product cost and the number of products.
4. A market researcher asked a sample of 1,960 consumers to try a newly developed frozen bangus dinner by a Bonoan called Bangus Delight. Out of the 1,960 sampled, 1,176, or 60%, said they would purchase the dinner if it is marketed.

1.4 BASIC STEPS IN CONDUCTING A STATISTICAL INQUIRY

There are basically 5 steps in conducting a statistical investigation. These are:

1. Defining the problem
2. Collecting and organizing relevant information
3. Presenting the data
4. Analyzing the data
5. Interpreting the results

1.5 VARIABLE AND TYPES OF DATA

A variable is an observable characteristic or attribute associated with the population or sample being studied which makes one element different from another. It is represented by a set of values that may arise from counting and/or from measurement. It may differ in kind or in degree among various elementary units. A variable may be classified as quantitative or qualitative.

1.5.1 QUANTITATIVE VARIABLES

Quantitative variables are variables that are classified according to numerical value. These are expressed numerically because they differ in degree rather than in kind among members of the group. The data collected about a quantitative variable are called quantitative data. Age, height, test scores, weight, prices of cars, number of cars owned, annual income, market sales and stock prices are examples of quantitative variables, which can be classified as either discrete or continuous.
1.5.1.1 Discrete Variables

Discrete variables can assume values only at specific points on a scale of values, with gaps between them. They are obtained by counting and, hence, are countable. Examples of discrete variables are the number of days in a week, the number of children in the family, the number of students in the classroom, the number of teachers in school, the number of house-and-lot units sold on a particular day, the number of people visiting a bank, the number of cars in a parking lot, the number of poultry owned by a farmer and the number of employees of a company.

1.5.1.2 Continuous Variables

A continuous variable may take any value within a defined range of values. The possible values of the variable belong to a continuous series. Between any two values of the variable, an indefinitely large number of in-between values may occur. Examples of continuous variables are values obtained by measurement, such as weight, height, volume, temperature, distance, area, density, age and price of commodity.

1.5.2 QUALITATIVE OR CATEGORICAL VARIABLES

A qualitative or categorical variable is not normally expressed numerically because it differs in kind rather than in degree among elementary units. It is also referred to as an attribute variable. This variable can be classified into two or more non-numeric categories according to its characteristics or attributes. The data collected about such a variable are called qualitative data. Data falling under this category cannot be added, subtracted, multiplied or divided.

Qualitative variables can be dichotomous or multinomial. Observations about a dichotomous qualitative variable can be made in only two categories: yes or no, defective or non-defective, present or absent, etc.
Observations about a multinomial qualitative variable can be made in more than two categories, such as educational attainment, nationality, religion, type of colleges and universities, regions, brand of soft drinks, name of companies, occupation, level of job performance, level of job satisfaction, etc. Qualitative data are often summarized in charts and bar graphs.

The types of variables are shown in the following diagram:

[Diagram: a variable is either quantitative (discrete or continuous) or qualitative (dichotomous or multinomial).]

1.6 PARAMETER AND STATISTICS

A parameter is a value or measurement obtained from a population. If one uses the mean, median, mode and standard deviation of entire classes to compare the achievement of one class with that of another, these measures are parameters.

A statistic is any value or measurement obtained from a sample. It is an estimate of the parameter.

1.7 POPULATION AND SAMPLE

A population consists of a complete set of individuals, objects, places, items, events or measurements of interest whose characteristics are being studied. The population that is being studied is called the target population. Like any other set, a population (also known as a universe) is classified as either finite or infinite.

The children attending school in Butuan City, the percentage of all females who earn less than 100,000.00 a year, the 2007 gross sales of all companies in Metro Manila, the prices of all mathematics books published in the Philippines during the past three years and the cards in a deck are examples of finite populations. The members of such populations can presumably be counted. By contrast, the observations in any specific experiment are samples from an infinite or indefinitely large population. The number of rolls of a die or the number of scientific observations may, at least theoretically, be increased without any finite limit. The 92 million or so people living in the Philippines constitute a large but finite population.
This population is so large that for many types of statistical inference it may be assumed to be infinite. Because of the large size of the population, it may be either impractical or impossible for the investigator to gather statistics from all the members. If a population is indefinitely large, it is of course impossible to produce complete population statistics. Under circumstances such as these, the investigator selects what is called a sample.

A sample is a portion, or part, of the population selected for study, drawn from the population by some appropriate method. Please note, however, that the method used in drawing the sample from the population is very important.

A survey is a collection of information from the elements of a population or a sample. Decisions are based on the sample information. To conduct a survey, we usually select a sample and collect the required information from the elements included in the sample. It is the most common method of generating data not only in business and economics, but also in many other fields.

A census is a survey that includes every element of the target population. A sample survey is a technique of collecting information from a portion of the population. The purpose of conducting a sample survey is to make a decision about the corresponding population. The results obtained from the sample survey should closely match the results that would be obtained by conducting a census. For example, to find the average income of families living in Davao City, the sample must include families from all the different income groups. This means that the proportions of the income groups in the sample must be the same as their proportions in the population. A representative sample reflects the characteristics of the population, so inferences derived from it are more reliable.

1.7.1 REASONS FOR TAKING A SAMPLE

Why take a sample instead of studying every member of the population?
There are good reasons why surveys are conducted using only a sample from the population. One good reason is that it is not practical or feasible to use the entire population, considering factors such as cost, time, staff requirement, difficulty in reaching the respondents, etc. Some of the common reasons for using a sample instead of the population are the following:

1. Some tests are destructive in nature. If wine testers at the La Tondeña Distillery were to drink all the wines in order to evaluate them, they would consume the entire stock and none would be available for sale. In the area of industrial production, for instance, steel plates, wires and similar products must often have a certain minimum tensile strength. To ensure that the products meet the minimum standard, a relatively small sample is selected.

2. Considering all the items in the population is impossible. For instance, there is no way we can count the populations of fish, birds, snakes, mosquitoes and the like because these populations are too large and are constantly moving from one place to another; some members are born while others die. Instead of even attempting to count all the ducks in the Philippines or all the fish found in Lake Lanao, we can make estimates using various techniques, such as counting all the ducks in a pond picked at random or setting nets at predetermined places in the lake.

3. The cost of studying all the items in a population is often prohibitive. Public opinion polls and consumer testing organizations, such as Pulse Asia, Social Weather Stations and Ibon Facts and Figures, usually contact fewer than 2,000 persons out of approximately 92 million people in the Philippines. One consumer panel-type organization charges about 1,600,000 to mail out samples and tabulate the responses in order to test a product such as a rice variety, dog food or perfume. The same product test using 92 million persons would cost about 9 billion.

4. Sample results are adequate.
Even if funds were available, it is doubtful whether the additional accuracy of a 100% sample, that is, studying the entire population, is essential in most problems. The reasons cited for preferring sampling have something to do with reducing the cost of getting a given type of information or with increasing the quality or quantity of information. With a sample, fewer personnel are needed, and they can be better trained and more effectively supervised.

1.7.2 SPECIFIC USES OF SAMPLING

Taking a sample to find out something about a population is done extensively in business, agriculture, politics, and government. Some examples are the following:

1. Before an election, professional polling organizations such as Pulse Asia and the Social Weather Stations take a sample of about 2,000 registered voters from the millions of eligible voters. Based on their sample results, general inferences are made regarding how all the voters will cast their ballots on election day. Historically, the actual election results have generally been remarkably close to the sample results.
2. The Department of Labor and Employment constantly monitors data on employment, unemployment, salaries, and labor turnover based on sample surveys.
3. Television networks regularly monitor the popularity of their programs by hiring AGB Nielsen and other organizations to conduct surveys using sample data to find out the preferences of TV viewers. These program ratings are used to determine advertising rates as well as to cancel programs.
4. Marine biologists tag a few seals to chart migration patterns.
5. Wine tasters sip a few drops of wine to make a decision with respect to all the wines waiting to be released for sale.
6. The accounting department checks only a few invoices to find out something about the accuracy of all the invoices.
7. Consumers sample pizzas and other products at the grocery store to decide whether to purchase the whole pizza or not.
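The polling example in this list can be illustrated with a small simulation. The sketch below rests on assumptions that are not in the lesson: a made-up population of 1,000,000 voters, a made-up 55% level of support for a hypothetical candidate, and a fixed random seed. It compares the sample proportion (a statistic) with the population proportion (a parameter).

```python
import random

rng = random.Random(42)  # fixed seed so the sketch is repeatable

# Hypothetical population of 1,000,000 voters:
# 1 = favors the candidate, 0 = does not.
population = [1] * 550_000 + [0] * 450_000
parameter = sum(population) / len(population)  # population proportion (parameter)

# Simple random sample of 2,000 voters, drawn without replacement.
sample = rng.sample(population, 2_000)
statistic = sum(sample) / len(sample)          # sample proportion (statistic)

print(f"Population proportion: {parameter:.3f}")
print(f"Sample proportion:     {statistic:.3f}")
```

In a real poll the parameter is unknown, and only the statistic is available. The simulation can show both because the population here is invented.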
1.8 SCALES (or LEVELS) OF MEASUREMENT

Data can also be classified based on levels of measurement or scales of measurement. There are four levels of measurement used in preparing data for analysis, namely: nominal, ordinal, interval, and ratio. Nominal-level data are the lowest, most primitive or most limited type of measurement, while ratio-level data are the highest.

1.8.1 NOMINAL SCALE

It is a measurement scale that involves the process of naming or labeling the items by placing cases into categories and counting their frequency of occurrence. While any numbers assigned indicate that the elements are different, such difference is not according to order or magnitude. Each case must be placed in one and only one category, and the cases within a category must be alike with respect to some attribute or property. The categories must be non-overlapping or mutually exclusive. There are no measurements and no scales involved. Instead, these are just counts.

Examples under nominal scale are gender (male or female), political affiliation (Team Unity, Genuine Opposition, KBL, Lakas CMD, Laban, Kampi, Liberal Party), mode of adaptation (conformity, innovation, ritualism, retreatism, and rebellion), time orientation (past, present, future), religion, region, civil status, names of companies, dichotomous responses or preferences, and car makers (Toyota, Honda, Ford, Kia, Hyundai, Volkswagen, BMW). These data are not graded, ranked or scaled for qualities such as better or worse, higher or lower, more or less. They are merely labeled, with no meaningful ranking of the categories applied. This indicates that for the nominal level of measurement, there is no particular order for the groupings. The numbers may not be added, subtracted, multiplied, or divided. Only the frequency and percentage of observations falling into each category are usually computed.
While we can also determine the mode under this scale of measurement, we cannot determine the mean or the median. Data under nominal level are often summarized in charts and bar graphs.

1.8.2 ORDINAL SCALE

It is a measurement scale that yields information about the ordering of categories. The magnitude of numerical differences between and/or among cases is not determined, though. The intervals between the points or ranks in an ordinal level are not known. Therefore, it is not possible to assign scores to cases located at points along the scale.

Examples are ranking of honor students, assessment of levels of job performance (poor, average, excellent), ranking of faculty members (instructor, assistant professor, associate professor, professor), hardness of material, IQ (low, average, high), ranking of candidates in a beauty contest, graded response to a certain issue (weak, moderate, strong), rating of a company commander (inferior, poor, average, good, superior) and evaluation of a product (poor, good, excellent). In this scale, one case can be said to be greater than or less than another according to some criterion, rather than merely equal to or different from the others as in the nominal scale of measurement.

1.8.3 INTERVAL SCALE

It is a measurement scale that orders cases and also indicates the exact differences between and among them. It uses constant units of measurement, for example, pesos, centavos, Fahrenheit, Celsius, yards, feet, minutes, seconds, which yield equal intervals between points on the scale. Calendar time is an interval variable with an arbitrarily defined zero point. An interval variable does not have a “true zero” point, but a zero point may be arbitrarily defined for convenience only. A temperature of 30 degrees Celsius in Manila cannot be said to be twice as hot as a temperature of 15 degrees Celsius in Baguio.
It does not make sense, therefore, to talk of a temperature of 0 degrees Celsius as indicating the absence of heat or the absence of temperature in a particular place. Scores on an SAT examination and scores in a history or a mathematics examination are also examples of the interval scale of measurement.

1.8.4 RATIO SCALE

It is a measurement scale that possesses all the characteristics of the interval scale and for which both the interval size and the ratio of two values have meaning. In the ratio scale, it is appropriate to speak of one number in relation to another. Measurements of weights, heights, lengths and ages appropriately use the ratio scale. For example, a tree 6 meters tall is twice as tall as a 3-meter tree, and a baby girl who weighs 10 lbs. is twice as heavy as one weighing 5 lbs.; that is, one measurement may be spoken of as double or triple another. An absolute zero is always implied. Any number used represents a distance from a natural origin. One object may be twice as long, three times as heavy or four times as numerous as another object.

The essential difference between a ratio-level and an interval-level variable is that measurements of the former are made from a true zero point, whereas measurements of the latter are made from an arbitrarily defined zero point of origin. Therefore, ratios can be formed directly from the values of a ratio variable and interpreted meaningfully.

1.9 SUMMATION NOTATION

The most commonly used notation in statistics is the summation notation, which is used to denote the sum of values. The uppercase Greek letter Σ (read “sigma”) is employed to signify that the sum of the values of the variable that follows is desired. Using this notation, the sum of 20 values x_1, x_2, ..., x_20 can be written as follows:

\[ \sum_{i=1}^{20} x_i = x_1 + x_2 + \cdots + x_{20} \]
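As a small illustration of the notation, the sketch below adds up 20 values in Python. The data values are hypothetical and chosen only for the example. The loop index i runs from 1 to 20, just as in the summation above.

```python
# Hypothetical data: 20 observed values x_1, x_2, ..., x_20.
x = [3, 7, 2, 9, 4, 6, 1, 8, 5, 10, 3, 7, 2, 9, 4, 6, 1, 8, 5, 10]

# Explicit loop that mirrors the notation: i runs from 1 to 20.
total = 0
for i in range(1, 21):
    total += x[i - 1]  # Python lists are 0-indexed, so x_i is stored at x[i - 1]

# The built-in sum() gives the same result in one step.
assert total == sum(x)
print(total)  # prints 110
```

The explicit loop is there only to show the role of the index i. In practice the built-in sum() is the idiomatic choice.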