Descriptive Statistics Lecture Notes

DESCRIPTIVE STATS:
• Measures of location (both central and non-central location)
• Measures of spread (or dispersion) about the central location value
• Measures of shape (skewness)

NOMINAL:
Observations of a qualitative variable can only be classified and counted. There is no particular order to the labels.
e.g. eye color, gender, religion.

ORDINAL:
Data classifications are represented by sets of labels or names (high, medium, low) that have relative values. The data classified can be ranked or ordered, but the differences between data values cannot be determined or are meaningless.
e.g. your rank in class; a juice tasting ranking in which Sprite is 1, Coke is 2, Fanta is 3, Pepsi is 4.

INTERVAL:
Data classifications are ordered according to the amount of the characteristic they possess. Equal differences in the characteristic are represented by equal differences in the measurements.
e.g. temperature, dress size.

RATIO:
Interval level with an inherent zero starting point. Data classifications are ordered according to the amount of the characteristic they possess, and equal differences in the characteristic are represented by equal differences in the numbers assigned to the classifications. The zero point is the absence of the characteristic, so both differences and ratios between two numbers are meaningful.
e.g. distance traveled; salary.

Central Location Measures:
1. Mean
2. Mode
3. Median

Non Central Location Measures:
1. Quartiles: divide an ordered data set into 4 parts
2. Percentiles
3. Geometric Mean

MEAN:
1. The most widely used measure of location.
2. Features: all values are used; it is unique; the sum of the deviations from the mean is 0; calculated by summing the values and dividing by the number of values.
3. Weakness: can be distorted by outliers.

MEDIAN:
1. There is a unique median for each data set.
2. It is not affected by extremely large or small values and is therefore a valuable measure of central tendency when such values occur.
3. It can be computed for ratio-level, interval-level, and ordinal-level data.
4. It can be computed for an open-ended frequency distribution if the median does not lie in an open-ended class.

MODE:
1. The value of the observation that appears most frequently.
2. Especially useful in summarizing nominal-level data.
3. Advantage: not affected by outliers.
4. Disadvantages: for many sets of data, there is no mode because no value appears more than once, OR there is more than one mode.

GEOMETRIC MEAN:
1. Used where data represents percentage changes.
2. The data must be represented as decimal values: a 4% increase is 1.04, a 4% decrease is 0.96.
3. Geometric mean = nth root of the data points multiplied by each other (where n = sample size).
4. GM = (X1 × X2 × X3 × … × Xn)^(1/n)

Z-score:
Formulae: z = (x − µ)/σ
Solve for x: x = µ + zσ

The standard normal distribution:
A normal distribution with a mean of 0 and a standard deviation of 1. It is also called the z distribution.

Measure of Dispersion:
1. Range
2. Standard deviation
3. Variance
4. Coefficient of variation

RANGE:
1. The simplest measure of dispersion.
2. Computed by subtracting the lowest value of a data set from the highest value in the set.
3. Not a reliable measure of dispersion, since it only uses two values from the data set.
4. Extreme values can distort the range to be very large while most of the elements may actually be very close together.
5. Widely used in statistical process control (SPC) applications.

VARIANCE AND STANDARD DEVIATION:
1. Variance is the arithmetic mean of the squared deviations from the mean. The standard deviation is the square root of the variance.
2. The standard deviation is the most common and useful measure of dispersion because it reflects the typical distance of the observations from the mean.
3. Commonly used as a measure to compare the spread in two or more sets of observations.
4. Advantages: both use all the values of a data set; the standard deviation is expressed in the same units as the observations (the variance is in squared units).
5. The variance and standard deviation are nonnegative and are zero only if all observations are the same.
6. For populations whose values are near the mean, the variance and standard deviation will be small.
7. For populations whose values are dispersed from the mean, the population variance and standard deviation will be large.
8. The variance overcomes the weakness of the range by using all the values in the population.

Coefficient of Variation: (expressed as %)
• CV = (standard deviation / mean) × 100%
• CV is a measure of relative variability.
• It is therefore possible to compare the variability of data across different samples, especially if the …

NORMAL DISTRIBUTION AND STANDARD DEVIATIONS OF THE MEAN:
• 68.3% of all data values lie within 1 SD of the mean (between the lower limit of [mean − SD] and the upper limit of [mean + SD]).
• 95.5% of all data values lie within 2 SD of the mean (between the lower limit of [mean − 2SD] and the upper limit of [mean + 2SD]).
• 99.7% of all data values lie within 3 SD of the mean (between the lower limit of [mean − 3SD] and the upper limit of [mean + 3SD]).

Outliers:
• A data value (x) that has a z-score either below −3 or above +3.

Measure of Skewness:
Pearson's coefficient of skewness (SK). Shapes commonly observed:
• SK = 0, distribution is symmetrical. Hence mean = median = mode.
• SK > 0, distribution is positively skewed. Hence mean > median.
• SK < 0, distribution is negatively skewed. Hence mean < median.

POSITIVELY SKEWED:
• Median is preferred.
• Mean will be most influenced (inflated and distorted) by large outliers and hence will lie furthest to the RHS of the mode and median.
• Distribution will have a long 'tail' to the RHS.

NEGATIVELY SKEWED:
• Median is preferred.
• Mean will be most influenced (deflated and distorted) by small outliers and hence will lie furthest to the LHS of the mode and median.
• Distribution will have a long 'tail' to the LHS.

BINOMIAL:
BINOM.DIST (x; n; p; cumulative?)
Descriptive stats:
1. Mean: µ = np
2. Standard deviation: σ = √(np(1 − p))
n = sample size (number of independent trials); p = probability of a success outcome on a single independent trial.

POISSON:
POISSON.DIST (x; mean; cumulative?)
Descriptive stats:
1. Mean: µ = λ
2. Standard deviation: σ = √λ
λ = the mean number of occurrences of a given outcome of the random variable for a predetermined time, space or volume interval.

NORMAL:
NORM.DIST (x; mean; SD; cumulative?)

Sample Mean:
• It is normally distributed (approximately so for large samples when the population itself is not normal).
• It has a mean equal to the population mean, µ.
• It has a standard deviation, called the standard error, σx̄, equal to σ/√n.

Non-probability sampling methods:
1. Convenience sampling
2. Judgment sampling
3. Quota sampling
4. Snowball sampling

Convenience sampling:
• The sample is drawn to suit the convenience of the researcher.
• E.g. select motorists from only one petrol station; select items for inspection from only one shift instead of a number of shifts.

Snowball sampling:
• Used when it is not easy to identify the members of the target population for reasons of sensitivity or confidentiality (e.g. in studies related to HIV, gangster activity, sexuality, illegal immigrants).
• If one member can be identified, then this person is asked to identify other members of the same target population.
• This selection of sampling units is non-random and potentially biased.

Disadvantages of non-probability sampling:
• The samples are likely to be unrepresentative of their target population. This will introduce bias into the statistical findings, because significant sections of the population are likely to have been omitted from the selection process.
• It is not possible to measure the sampling error from data based on a non-probability sample. Sampling error is the difference between the actual population parameter value and its sample statistic. As a result, it is not valid to draw statistical inferences from non-probability sample data.
• However, non-probability samples can be useful in exploratory research situations or in less-scientific surveys to provide initial insights into and profiles of random variables under study.
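The definition of sampling error and the convenience-sampling example of inspecting only one shift can be illustrated with a small sketch. The shift names and weights below are invented for illustration, and the whole population is known here only because the data are made up; in practice the population mean is unknown, which is why the error of a non-probability sample cannot be measured.

```python
from statistics import mean

# Hypothetical inspection data: part weights (grams) recorded on three shifts.
# All numbers are invented for illustration only.
shifts = {
    "morning": [50.0, 51.0, 49.0, 50.0],
    "afternoon": [53.0, 54.0, 52.0, 53.0],
    "night": [56.0, 57.0, 55.0, 56.0],
}

# Population parameter: the mean weight of every part from all shifts.
population = [w for weights in shifts.values() for w in weights]
population_mean = mean(population)

# Convenience sample: only the morning shift is inspected.
convenience_sample = shifts["morning"]
sample_mean = mean(convenience_sample)

# Sampling error = sample statistic minus population parameter.
sampling_error = sample_mean - population_mean

print(f"population mean = {population_mean}")  # 53.0
print(f"sample mean     = {sample_mean}")      # 50.0
print(f"sampling error  = {sampling_error}")   # -3.0
```

Because the morning shift differs systematically from the other two in this made-up data, the convenience sample underestimates the population mean, which is the kind of bias the notes warn about.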
Quota Sampling:
• Setting of quotas of sampling units to interview from specific subgroups of a population. When the quota for any one subgroup is met, no more sampling units are selected from that subgroup for interview. This introduces selection bias into the sampling process.
• The main feature of quota sampling is the non-random selection of sampling units to fulfill the quota limits.
• E.g. a researcher may set a quota to interview 40 males and 70 females from the 25- to 40-year age group on savings practices. When the quota of interviews for any one subgroup is reached (either male or female), no further eligible sampling units from that subgroup are selected for interview purposes.

Judgment Sampling:
• Researchers use their judgment to select the best sampling units to include in the sample.
• E.g. only professional rugby players are interviewed on the need for rule changes in the sport; only labour union leaders (instead of general workers) are selected to respond to a study regarding working conditions in the mining industry.

Simple Random Sampling:
• It is assumed that the population is homogeneous with respect to the random variable under study (i.e. the sampling units share similar views on the research questions, or the objects in a population are influenced by the same background factors).
• One way to draw a simple random sample is to assign a number to every element of the population and then effectively 'draw numbers from a hat'. If a database of names exists, then a random number generator can be used to draw a simple random sample.
• E.g. the population of Cape Town motorists is to be surveyed for their views on toll roads. A simple random sample of Cape Town motorists is assumed to be representative of this population as their views are unlikely to differ significantly across gender, age, or car type driven.
• E.g. in a production process, parts that come off the same production line can be selected using simple random sampling to check the quality of the entire batch produced.

Systematic Random Sampling:
• It is used when a sampling frame (i.e. an address list or database of population members) exists. Sampling begins by randomly selecting the first sampling unit. Thereafter subsequent sampling units are selected at a uniform interval relative to the first sampling unit. Since only the first sampling unit is randomly selected, some randomness is sacrificed.
• To draw a systematic random sample, first divide the size of the sampling frame by the sample size to determine the size of a sampling block. Randomly choose the first sample member from within the first sampling block. Then choose subsequent sample members by selecting one member from each sampling block at a constant interval from the previously sampled member.

Stratified Random Sampling:
• Used when the population is assumed to be heterogeneous with respect to the random variable under study. The population is divided into segments (strata), where the population members within each stratum are relatively homogeneous. Thereafter, simple random samples are drawn from each stratum.
• If the random samples are drawn in proportion to the relative size of each stratum, then this method of sampling is called proportional stratified random sampling.

Cluster Random Sampling:
• Certain target populations form natural clusters, which make for easier sampling.
• E.g. labour forces cluster within factories; students cluster within educational institutions; outputs from different production runs (e.g. margarine tubs) are batched and labelled separately, forming clusters.
• A random sample of clusters is selected first. Sampling units within these sampled clusters may themselves be randomly selected to provide a representative sample from the population. For this reason it is called two-stage cluster sampling (e.g. select schools (stage 1) as clusters, then pupils within schools (stage 2)).
• Tends to be used when the population is large and geographically dispersed.
• Advantage: reduces the per unit cost of sampling.
• Disadvantage: tends to produce larger sampling errors than those resulting from simple random sampling.

Advantages of random sampling methods:
• Random sampling reduces selection bias, meaning that the sample statistics are likely to be 'better' (unbiased) estimates of their population parameters.
• The error in sampling (sampling error) can be calculated from data that is recorded using random sampling methods. This makes the findings of inferential analysis valid.

Level of confidence (1 − α)    α/2     z
90%                            5%      1.645
95%                            2.5%    1.96
98%                            1%      2.33
99%                            0.5%    2.575
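The z-values in the level-of-confidence table can be reproduced from the standard normal distribution: each is the point that leaves an area of α/2 in the upper tail. The sketch below uses Python's standard-library NormalDist; the helper name z_critical is hypothetical. The computed values for 98% and 99% are roughly 2.326 and 2.576, which the table shows as the rounded textbook values 2.33 and 2.575.

```python
from statistics import NormalDist


def z_critical(confidence: float) -> float:
    """Return z such that the central area under the z distribution equals `confidence`."""
    alpha = 1.0 - confidence
    # Standard normal: mean 0, standard deviation 1.
    # The upper-tail area is alpha/2, so the cumulative area is 1 - alpha/2.
    return NormalDist().inv_cdf(1.0 - alpha / 2.0)


for level in (0.90, 0.95, 0.98, 0.99):
    print(f"{level:.0%}  alpha/2 = {(1 - level) / 2:.3f}  z = {z_critical(level):.3f}")
```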
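The simple, systematic and proportional stratified random sampling procedures described in these notes can be sketched as follows. This is a minimal illustration, not a reference implementation: the function names, the numbered sampling frame and the two strata are all hypothetical, and the stratified helper simply rounds each stratum's share, which only adds up to the requested sample size when the proportions divide evenly.

```python
import random


def simple_random_sample(frame, n, rng):
    """Draw n units without replacement ('numbers from a hat')."""
    return rng.sample(frame, n)


def systematic_sample(frame, n, rng):
    """Random start within the first sampling block, then one unit per block."""
    k = len(frame) // n          # size of a sampling block
    start = rng.randrange(k)     # random position inside the first block
    return [frame[start + i * k] for i in range(n)]


def proportional_stratified_sample(strata, n, rng):
    """Simple random sample from each stratum, sized in proportion to the stratum."""
    total = sum(len(units) for units in strata.values())
    sample = []
    for units in strata.values():
        share = round(n * len(units) / total)  # assumption: shares round cleanly
        sample.extend(rng.sample(units, share))
    return sample


rng = random.Random(42)                 # fixed seed so the sketch is repeatable
frame = list(range(1, 101))             # hypothetical frame of 100 numbered units
strata = {"A": frame[:60], "B": frame[60:]}  # hypothetical strata of 60 and 40 units

srs = simple_random_sample(frame, 10, rng)
systematic = systematic_sample(frame, 10, rng)
stratified = proportional_stratified_sample(strata, 10, rng)

print(srs)
print(systematic)   # consecutive members are 10 units apart
print(stratified)   # 6 units from stratum A, 4 from stratum B
```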
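Returning to the descriptive measures at the start of these notes, the sketch below computes the location, dispersion, skewness and z-score measures for a small made-up data set. Two assumptions are worth flagging: the variance is the population form (the mean of the squared deviations, as the notes define it), and skewness uses Pearson's median-based coefficient SK = 3(mean − median)/SD, since the notes name the coefficient without giving its formula.

```python
from statistics import geometric_mean, mean, median, multimode, pstdev, pvariance

data = [2, 4, 4, 4, 5, 5, 7, 9]      # hypothetical observations

# Central location
mu = mean(data)                      # 5
med = median(data)                   # 4.5
modes = multimode(data)              # [4]; a list, because there may be several modes

# Dispersion (population formulas: divide by n)
data_range = max(data) - min(data)   # 7
var = pvariance(data)                # 4
sd = pstdev(data)                    # 2.0
cv = 100 * sd / mu                   # 40.0, coefficient of variation in %

# Skewness (assumed formula: Pearson's median-based coefficient)
sk = 3 * (mu - med) / sd             # 0.75 > 0, so positively skewed

# z-scores and the outlier rule (z below -3 or above +3)
z_scores = [(x - mu) / sd for x in data]
outliers = [x for x, z in zip(data, z_scores) if abs(z) > 3]

# Geometric mean of percentage changes: +4% then -4%, written as decimals
gm = geometric_mean([1.04, 0.96])

print(mu, med, modes, data_range, var, sd, cv, sk)
print(outliers)
print(round(gm, 4))
```

The geometric mean comes out slightly below 1, reflecting that a 4% rise followed by a 4% fall leaves the value a little lower than where it started.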
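The mean and standard deviation formulas for the binomial and Poisson distributions, and the standard error of the sample mean, are simple enough to check by hand. The following sketch wraps them in small helper functions; the function names and the example inputs are hypothetical.

```python
import math


def binomial_mean_sd(n, p):
    """Mean np and standard deviation sqrt(np(1 - p)) of a binomial variable."""
    return n * p, math.sqrt(n * p * (1 - p))


def poisson_mean_sd(lam):
    """Mean lambda and standard deviation sqrt(lambda) of a Poisson variable."""
    return lam, math.sqrt(lam)


def standard_error(sigma, n):
    """Standard deviation of the sample mean: sigma / sqrt(n)."""
    return sigma / math.sqrt(n)


# Hypothetical inputs chosen so the results are round numbers.
print(binomial_mean_sd(100, 0.5))   # (50.0, 5.0)
print(poisson_mean_sd(4))           # (4, 2.0)
print(standard_error(10, 25))       # 2.0
```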
