BUSINESS STATISTICS (BUS 215)
TOPIC 1: DATA AND DATA COLLECTION
LECTURER: MR ELISA EBIYAMU
BUS 215 @ MCA

Meaning of Data
• Data are observations that have been collected.
• Data are sometimes used to find statistics.
• A data set is all the data collected in a particular study; on its own it is just a list of numbers awaiting statistical analysis.
• Data are collected from a population or from a subset of the population called a sample.
• When data are processed, organized, structured and presented in a given context so as to make them useful, they are called information.

Population
• A population is the complete collection of all elements to be studied.
• The collection is complete in the sense that it includes all the subjects to be studied.
• A census is a study that involves collection of data from every element in a population.

Sample
• A sample is a sub-collection of elements drawn from a population.
• It is a subset of a population.

Example
• A business produces three brands of a product sold to a population of 500 people in the neighborhood. The management hired a consultant to investigate consumer preference for each brand and determine the brand that is preferred by most customers. 65 customers were selected at random for interviews as part of the research.
• In this example the 65 customers make up the sample for the research.
• A survey is the collection of data from elements in a sample.

[Figure: Population and Sample]

Meaning of Statistics
• Two common definitions of the word statistics are as follows:
• Statistics refer to facts or data, either numerical or nonnumerical, organized and summarized so as to provide useful and accessible information about a particular subject.
• This definition is used with a plural verb.
• Statistics is a collection of methods for planning experiments, obtaining data, and then organising, summarizing, presenting, analysing, interpreting and drawing conclusions based on data.
• This definition is used with a singular verb.

Statistic and Parameter
• A statistic is a numerical measurement describing some characteristic of a sample.
• A parameter is a numerical measurement describing some characteristic of a population.

Example
• The average score of students in Business Statistics at MCA is 60 percent. A sample of 45 students was drawn; its mean was 58 percent, with a standard deviation of 3.
• In this example 60 is a parameter, while 58 is a statistic (the sample standard deviation of 3 is also a statistic).

CLASSIFICATION OF DATA

Classification of Data
• Data classification can be difficult; even statisticians occasionally disagree over data type.
• In most cases, however, data classification is fairly clear and will help you choose the correct statistical method for analyzing the data.
• The statistical method appropriate for summarizing data depends upon the type of data.
• Data may be classified in several ways:
1. By measurement: qualitative and quantitative data
2. By time of collection: cross-sectional and time series data
3. By source: primary and secondary data
4. By preciseness: discrete and continuous data
5. By number of variables: univariate, bivariate and multivariate data

Qualitative and Quantitative Data
• Data can be classified by measurement into two groups:
1. Categorical or qualitative data
2. Quantitative data
• Data that can be grouped by specific categories are referred to as categorical data.
• Categorical data use either the nominal or ordinal scale of measurement.
• Data that use numeric values to indicate how much or how many are referred to as quantitative data.
• Quantitative data are obtained using either the interval or ratio scale of measurement.
• A categorical or qualitative variable is a variable with categorical data, and a quantitative variable is a variable with quantitative data.
• The statistical analysis appropriate for a particular variable depends upon whether the variable is categorical or quantitative.
• If the variable is categorical, the statistical analysis is limited.
• We can summarize categorical data by counting the number of observations in each category or by computing the proportion of the observations in each category.
• However, even when the categorical data are identified by a numerical code, arithmetic operations such as addition, subtraction, multiplication, and division do not provide meaningful results.
• Arithmetic operations provide meaningful results for quantitative variables.
• For example, quantitative data may be added and then divided by the number of observations to compute the average value.
• This average is usually meaningful and easily interpreted.
• In general, more alternatives for statistical analysis are possible when data are quantitative.

Primary and Secondary Data
• Primary data is the name given to data that are used for the specific purpose for which they were collected.
• Because the user collected them, nothing is unknown about the method of collection, the accuracy of measurement or the number of respondents.
• Secondary data is the name given to data that are being used for some purpose other than that for which they were originally collected.

Advantages of Secondary Data
Secondary data is used when:
• The time, manpower and resources necessary for the study are not available.
• It already exists and provides most, if not all, of the information required.
Advantages of secondary data:
• It saves time.
• It saves manpower.
• It saves resources.

Disadvantages of Secondary Data
• Data quality may be questionable.
• The data collected may now be out of date.
• Geographical coverage of the data may not coincide with the study location.
• Strata of the population covered may not be appropriate for the purposes of the current study.
• Some terms used may have different meanings.

Cross-Sectional and Time Series Data
• For purposes of statistical analysis, distinguishing between cross-sectional data and time series data is important.
• Cross-sectional data are data collected at the same or approximately the same point in time.
• Each respondent provides data on one or more variables.
• Time series data are data collected over several time periods or at regular intervals.
• Each variable is observed at several points in time, for example daily, weekly, monthly or annually.

[Figure: Cross-sectional data]

[Figure: Time series data]

Discrete and Continuous Data
• Quantitative variables can be classified as either discrete or continuous.
• A discrete variable is a variable whose possible values can be listed, even though the list may continue indefinitely.
• It refers to data that can be measured precisely.
• One way of obtaining discrete data is by counting, for example:
i. Number of products that a firm produces
ii. Number of employees working at a firm
• Discrete data can also be obtained from non-counting situations, for example:
i. Shoe sizes of a sample of students
ii. Weekly wages of a set of workers
• A continuous variable is a variable whose possible values form some interval of numbers.
• It is a variable whose values cannot be measured precisely but can only be approximated.
• Typically, a continuous variable involves a measurement of something, such as:
i. The height of a person
ii. The weight of a newborn baby
iii. The length of time a car battery lasts

Scales of Measurement
• Data collection requires one of the following scales of measurement:
i. nominal
ii. ordinal
iii. interval
iv. ratio
• The scale of measurement determines the amount of information contained in the data and indicates the most appropriate data summarization and statistical analyses.

Nominal scale
• The scale of measurement for a variable when the data are labels or names used to identify an attribute of an element.
• Nominal data may be nonnumeric or numeric, but arithmetic operations on nominal data are not meaningful.
• For example, to facilitate data collection and to prepare the data for entry into a computer database, we might use a numeric code by letting 1 denote BBME, 2 denote BMPR, and 3 denote BIAAS.
• In this case the numeric values 1, 2, and 3 identify the category of programme at MCA.
• The scale of measurement is nominal even though the data appear as numeric values.

Ordinal scale
• The scale of measurement for a variable if the data exhibit the properties of nominal data and the order or rank of the data is meaningful.
• Ordinal data may be nonnumeric or numeric.
• We can rank the data.
• For example, Eastside Automotive sends customers a questionnaire designed to obtain data on the quality of its automotive repair service.
• Each customer provides a repair service rating of excellent, good, or poor.
• Because the data obtained are the labels (excellent, good, or poor), the data have the properties of nominal data.
In addition, the data can be ranked, or ordered, with respect to the service quality.

Interval scale
• The scale of measurement for a variable if the data demonstrate the properties of ordinal data and the interval between values is expressed in terms of a fixed unit of measure.
• Interval data are always numeric, but zero is not meaningful: it does not indicate that nothing exists for the variable.
• We can add or subtract values, but multiplying or dividing them does not give meaningful results.
• Scholastic Aptitude Test (SAT) scores are an example of interval-scaled data. For example, three students with SAT math scores of 620, 550, and 470 can be ranked or ordered from best performance to poorest performance.
• In addition, the differences between the scores are meaningful.

Ratio scale
• The scale of measurement for a variable if the data demonstrate all the properties of interval data and the ratio of two values is meaningful.
• Hence we can perform addition, subtraction, multiplication and division.
• Ratio data are always numeric.
• Variables such as distance, height, weight, and time use the ratio scale of measurement.
• This scale requires that a zero value be included to indicate that nothing exists for the variable at the zero point.

SOURCES OF DATA AND DATA COLLECTION TECHNIQUES

Data Sources
• Data can be obtained from:
1. Existing or secondary data sources
2. Studies or surveys

Existing Sources
• In some cases, data needed for a particular application already exist.
• In Malawi, the National Statistical Office (NSO) is a government department created to conduct surveys and censuses. As such, it holds a lot of data that can be used in research.
• Such sources may publish their data on their websites.
• Secondary data sources fall broadly into two categories:
1. Internal sources
2. External sources

[Figure or table: Internal Existing Data Sources]

[Figure or table: External Existing Data Sources]

Statistical Studies
• Sometimes the data needed for a particular application are not available through existing sources.
• In such cases, the data can often be obtained by conducting a statistical study.
• Statistical studies can be classified as either experimental or observational.
• In an experimental study, a variable of interest is first identified, and then one or more other variables are identified and controlled so that data can be obtained about how they influence the variable of interest.
• For example, a pharmaceutical firm might be interested in conducting an experiment to learn about how a new drug affects blood pressure.
• Blood pressure is the variable of interest in the study.
• Non-experimental, or observational, statistical studies make no attempt to control the variables of interest.
• A survey is perhaps the most common type of observational study.
• For instance, in a personal interview survey, research questions are first identified.
• Then a questionnaire is designed and administered to a sample of individuals.

DATA COLLECTION TECHNIQUES
• Data collection can be thought of as the means by which information is obtained from the selected subjects of an investigation.
• There are several methods of data collection. Sometimes the sampling technique will dictate which method is used; in other cases there will be a choice.
• Some of the common data collection techniques are:
i. Individual (personal) interviews
ii. Postal questionnaires
iii. Street interviews
iv. Telephone interviews
v. Direct observation

Individual (Personal) Interviews
• Individual interviews are usually used with random sampling. They have the advantage of completeness and accuracy.
• Questions can be thoroughly tested.
• The approach is uniform if only one interviewer is used.
• Follow-up questions can be put where a question has not been addressed thoroughly.
However,
• This method is very expensive.
• Interviewers need to be trained.
• Interviews need arranging.

Postal Questionnaire
• Postal questionnaires can be used with many sampling methods.
• This is a much cheaper method than individual interviews.
• Questions should be easy to understand.
• The response rate is low.
• There is no need for prior arrangement.
• Posted questionnaires may be filled in by the wrong persons.

Street Interviews
• This method of data collection is normally used in conjunction with quota sampling, where the interviewer is often just one of a team.
• Some factors involved are:
i. Possible differences in interviewer approach to the respondents and in the way replies are recorded.
ii. Questions must be short and simple.
iii. Non-response is not normally a problem, since refusals are ignored and another subject is selected.
iv. The method is convenient and cheap.

Telephone Interview
• This method is sometimes used in conjunction with a systematic sample.
• It would generally be used within a local area and is often connected with selling a product or a service.
• It has an in-built bias if private homes are being telephoned (rather than businesses), since only those people with telephones can be contacted and interviewed.
• It can cause aggravation, and the interviewer needs to be very skilled.

Direct Observation
• This method can be used to examine items sampled from a production line, in traffic surveys or in work study.
• It is normally considered to be the most accurate form of data collection, but it is very labour intensive and cannot be used in many situations.

SAMPLING TECHNIQUES

Sampling
• You will recall that a sample is a subset of a study population.
• The researcher needs to get a sample that is large enough and representative of the population. This process is called sampling.
• A sampling frame is a listing of the elements the sample will be selected from.
• For some populations a sampling frame is not known, so an investigation would be required before a sample is taken.
• A sample is representative if its characteristics are as close to those of the population as possible.

Sampling Techniques
Sampling techniques can be put in three categories.
1. Random sampling
   i. Simple random sampling
   ii. Stratified random sampling
2. Quasi-random sampling
   i. Systematic sampling
   ii. Multi-stage sampling
3. Non-random sampling
   i. Cluster sampling
   ii. Quota sampling

Simple Random Sampling
• Simple random sampling is a sampling procedure for which each possible sample of a given size is equally likely to be the one obtained.
• We can also say that it is a sampling technique in which each member of the population has an equal chance of being selected.
• Simple random sampling is used when a small proportion of the population is to be taken as a sample.
• A random sample can be drawn using random numbers.

Advantages
• Selection of elements is unbiased.
• It is a fair method.
Disadvantages
• It needs a population listing.
• The chosen elements might be so geographically dispersed that the cost of interviewing becomes too high.
• There is a chance that certain attributes of the population may be over- or under-represented.

Stratified Random Sampling
• Stratification of a population is a process which identifies certain attributes (strata levels) that are considered significant to the investigation at hand and partitions the population accordingly into groups based on the strata levels.
• For example, to study factors that affect the performance of students at MCA, a researcher would start by putting students into groups based on mode of study (full-time, evening and weekend) and then select elements randomly from each group (stratum).
• A population like that of MCA students is said to be heterogeneous.
• In stratified sampling the population is first divided into subpopulations, called strata, and then sampling is done from each stratum.
• Ideally, the members of each stratum should be homogeneous relative to the characteristic under consideration.

Procedure for Stratified Random Sampling
Step 1: Divide the population into subpopulations (strata).
Step 2: From each stratum, obtain a simple random sample.
Step 3: Use all the members obtained in Step 2 as the sample.

Advantage
• The sample is free from bias.
Disadvantages
• It needs an extensive sampling frame.
• Strata levels are selected subjectively.
• It is costly and time consuming, and needs more resources to organize and implement the sample.

Proportional allocation
• In stratified sampling, the strata are often sampled in proportion to their size, which is called proportional allocation.
• Suppose a researcher would like to draw a sample of 20 elements from a population of 250 homeowners, of which 25 are upper income, 175 are middle income, and 50 are lower income.
• The sample size for the upper-income homeowners is, therefore, (25/250) × 20 = 2.
• By the same rule, the middle-income and lower-income strata contribute (175/250) × 20 = 14 and (50/250) × 20 = 4 homeowners, giving 2 + 14 + 4 = 20 in total.

Systematic Sampling
• Systematic sampling is a sampling technique in which a starting point is chosen randomly and then every nth element of the population is selected.
• It is used when the population is listed (such as invoice values or a company’s fleet of vehicles) or some of it is physically in evidence (such as rows of houses or items coming off a production line).
• This technique is particularly useful for populations whose elements are similar (homogeneous populations).
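
The systematic procedure described above can be sketched in Python as follows. This is only an illustration under stated assumptions: the function name systematic_sample and the list of 250 numbered invoices are hypothetical and do not come from the course material, and the sampling interval is taken as the population size divided by the sample size, rounded down.

```python
import random


def systematic_sample(population, sample_size, seed=None):
    """Choose a random start, then take every k-th element of a listed population.

    Hypothetical helper for illustration only.
    """
    if sample_size <= 0 or sample_size > len(population):
        raise ValueError("sample_size must be between 1 and the population size")
    # Sampling interval: the "n" in "every nth element".
    k = len(population) // sample_size
    rng = random.Random(seed)
    # Random starting point within the first interval.
    start = rng.randrange(k)
    return [population[start + i * k] for i in range(sample_size)]


# Hypothetical sampling frame: 250 invoices numbered 1 to 250.
invoices = list(range(1, 251))
sample = systematic_sample(invoices, 20, seed=1)
print(sample)
```

With 250 invoices and a sample of 20, the interval is 12, so the selected invoice numbers are always 12 apart. If the list has a recurring pattern with the same period as the interval, the sample can be biased.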

Systematic Sampling
Advantages
• It is easy to use.
• It can be used even where a sampling frame is not available.
Disadvantages
• Bias can occur if recurring sets in the population are possible.

Multi-Stage Sampling
Multi-stage sampling involves the following:
• Split the area up into a number of regions.
• Randomly select a small number of these regions.
• Confine sub-samples to these selected regions, with the size of each sub-sample proportional to the size of the area or the population of the area.
• The procedure can be repeated for sub-regions within regions.
Once the final regions (or sub-regions) have been selected, the final sampling technique could be random or systematic.

Advantages
• It needs less time.
• It needs less manpower.
• It is cheaper.
Disadvantages
• There is possible bias if a very small number of regions is selected.
• The method is not random.

Cluster Sampling
• Cluster sampling is a non-random sampling method which can be employed where no sampling frame exists and for a population that is distributed over a geographical area.
• The technique involves:
i. Selecting one or more geographical areas
ii. Sampling all the members of the targeted population that can be identified

Advantages
• It is a good alternative to multi-stage sampling where no sampling frame exists.
• It is generally cheaper than other methods since little organisation is needed.
Disadvantages
• Selection bias could be significant because the method is not random.

Quota Sampling
• Quota sampling uses a team of interviewers, each with a set number (quota) of subjects to interview.
• Normally the population is stratified in some way and each interviewer's quota will reflect this.
• The method places a huge responsibility on the interviewers, since selection of subjects is left to them entirely.
• As such, the interviewers must be well trained and have a responsible, professional attitude.

Quota Sampling
Advantages
• Stratification of the population is usual.
• There is no non-response.
• It is low cost and convenient.
Disadvantages
• Sampling is non-random and therefore subject to bias.

BIAS AND VALIDITY

What is bias?
• Bias can be defined as the tendency of a pattern of errors to influence data in an unrepresentative way.
• More generally, a statistic is called an unbiased estimator of a parameter if the mean of all its possible values equals the parameter; otherwise, the statistic is called a biased estimator of the parameter.
• Ideally, we want our statistic to be unbiased and to have a small standard error.
• Then chances are good that our point estimate (the value of the statistic) will be close to the parameter.

Types of Bias
1. Selection bias
• This occurs when the sample is not truly representative of the population.
• For example, selecting residents only from Area 10 in Lilongwe for purposes of estimating the average income of all Malawians can lead to selection bias.
2. Structure and wording bias
• It results from badly worded questions.
3. Interview bias
• This occurs when the interviewer projects a biased opinion or attitude that may not gain the full cooperation of the subject.
4. Recording bias
• This could result from badly recorded responses or clerical errors made by an untrained workforce.

Validity
• Validity of data refers to how well the data measure what they are supposed to measure.
• Validity should not be confused with reliability.
• Reliability of data refers to the consistency with which the results occur.
• Before analyzing a data set, statisticians usually make a variety of checks to ensure the validity of the data.
• In a large study it is not uncommon for errors to be made in recording data values or in entering the values into a computer.
• Identifying outliers is one tool used to check the validity of the data.

THANK YOU