INTRODUCTION
Statistics is the scientific study of numerical data based on natural phenomena.
Statistics can be regarded as a kit of tools that can be extremely valuable in research. Only
when you know which tool to use, how to use it, and how to interpret the results can you hope
to do productive research. Data in statistics are generally based on individual observations or
measurements taken on the smallest sampling unit. The science of statistics therefore has much
to offer the research worker in planning, analyzing, and interpreting the results of
investigations. The science of statistics deals with:
1) Collecting and summarizing data
2) Designing experiments and surveys
3) Measuring the magnitude of variation in both experimental and survey data
4) Estimating population parameters and providing various measures of the accuracy and
precision of these estimates
5) Testing hypotheses about populations
6) Studying relationships among two or more variables
1.1 Deductive and Inductive Reasoning
There are problems in which we are given a general principle or set of principles
and asked to determine what would happen under a specific set of conditions. Reasoning
from the general to the particular in this way is called deductive reasoning. For
example: given the general formula for the area of a circle, $A = \pi r^2$ (where $r$ is the
radius), what is the area of a circle whose diameter is 6 m? Or: given Boyle’s and Charles’s laws,
how would you expect a certain volume of gas to change when subjected to certain changes in
pressure and temperature?
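The circle question can be worked through as a short calculation. The sketch below assumes the standard formula A = πr² with r equal to half the diameter; the function name is only illustrative.

```python
import math


def circle_area_from_diameter(diameter_m: float) -> float:
    """Deduce the area of one circle from the general rule A = pi * r**2."""
    radius_m = diameter_m / 2.0  # the radius is half the diameter
    return math.pi * radius_m ** 2


# Apply the general principle to the specific case of a 6 m diameter.
area_m2 = circle_area_from_diameter(6.0)
print(f"Area of a circle with diameter 6 m: {area_m2:.2f} m^2")
```

Here the general rule is taken as given and only the specific case is computed, which is what makes the reasoning deductive.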
Nearly all the problems you have been dealing with so far are of this type where the
solution required deductive reasoning. The second type of problem is the opposite of the first.
We are given some specific cases and asked to arrive at some general principles that will apply
to all members of the class represented by these cases. For example, given the areas and
diameters of several circles, what general formula expresses the relation between
the areas and diameters of all circles? Or, given a series of observations of the volume of a gas
under different conditions of pressure and temperature, what general laws will account for
these observations? Reasoning from the particular to the general in this way is called
inductive reasoning.
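The circle example can also be run in the inductive direction. The sketch below assumes, purely for illustration, that the area is proportional to the square of the diameter (A = c·d²) and estimates the constant c by least squares through the origin. The measurements are hypothetical values made up for this example, not data from the text.

```python
import math


def fit_area_constant(diameters, areas):
    """Least-squares estimate of c in the assumed model A = c * d**2."""
    numerator = sum(a * d ** 2 for d, a in zip(diameters, areas))
    denominator = sum(d ** 4 for d in diameters)
    return numerator / denominator


# Hypothetical measurements (diameter in m, area in m^2), rounded to 2 decimals.
diameters = [1.0, 2.0, 3.0, 4.0, 5.0]
areas = [0.79, 3.14, 7.07, 12.57, 19.63]

c_hat = fit_area_constant(diameters, areas)
print(f"Estimated constant c: {c_hat:.4f}")
print(f"pi / 4 for comparison: {math.pi / 4:.4f}")
```

The estimate comes out close to π/4, which suggests the general rule A = (π/4)d² from a handful of specific cases.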
In general, experiments are conducted to provide specific facts from which general
conclusions or principles are established, and thus involve inductive reasoning. When
experiments are carried out, the results obtained under the same conditions are not always
exactly the same, owing to differences beyond the control of the experimenter. These differences
represent the variability among experimental units and are called experimental error.
The variability observed during experimentation creates a problem, since we need to decide
whether a difference between experimental units results from unaccounted variability or from real
treatment effects.
Statistical science helps to overcome this difficulty by requiring the collection of data to
provide unbiased estimates of treatment effects and the evaluation of treatment differences by
tests of significance based on measuring unaccounted variability.
1.2 Definitions and Terms
Accuracy: The level of agreement between replicate determinations of a measurable property
and its reference or target value.
Alternative hypothesis: A statement reflecting a difference or change in the level of a response
as a result of experimental intervention, denoted by HA or H1
Analysis of variance (ANOVA): The technique of separating, mathematically, the total
variation within experimental measurements into sources corresponding to controlled and
uncontrolled components.
Blocking: The grouping of experimental units into homogeneous blocks to remove an
extraneous source of response variation.
Confidence Interval: An interval or range of values which contains an unknown parameter
with a specified probability.
Confounding: The design technique for blocking a factorial experiment where information on
certain treatment effects is sacrificed because they are indistinguishable from the block effects.
Degrees of Freedom: Number of independent measurements that are available for estimation,
generally corresponds to number of measurements minus number of parameters to estimate.
Dispersion: The level of variation within collected data corresponding to the way data cluster
around their centre value, i.e. the mean.
Dot plot: A data plot of recorded data where each observation is presented as a dot to display
its position relative to other measurements within the data set.
Experiment: A planned inquiry to obtain new information on a measurable, or observable,
outcome or to confirm results from previous studies.
Experimental design: The experimental structure used to generate practical data for
interpretative purposes.
Experimental plan: Step-by-step guide to experimentation and subsequent data analysis.
Experimental unit: The physical experimental material to which a single application of a treatment
is made, e.g. a manufactured product, water sample, or food specimen.
Interaction: The joint influence of treatment combinations on a response which cannot be
explained by the sum of the individual factor effects.
Mean: The arithmetic average of a set of experimental measurements.
Median: The middle observation of a set of experimental measurements when expressed in
ascending order of magnitude.
Model: The statistical mechanism where an experimental response is explained in terms of the
factors controlled in the experiment.
Normal (Gaussian): The most commonly applied population distribution in statistics, the
assumed distribution for a measured response in parametric inference.
Null hypothesis: A statement reflecting no difference between observations and target or
between sets of observations as a result of experimental intervention, denoted by HN or H0.
Observation: A measured or observed data value from a study or an experiment.
Outlier: A recorded response measurement which differs markedly from the majority of the
data collected.
P value: The probability, calculated on the assumption that the null hypothesis is true, of
obtaining a test statistic at least as extreme as the one observed. It measures how readily the
level of treatment difference detected could have occurred purely by chance, and is compared
with the significance level.
Paired Sampling: A design principle where experimental material to be tested is split into two
equal parts with each part tested on one of two possible treatments.
Parameters: The terms included within a response model which need to be estimated and
assessed for their statistical significance.
Population: The total aggregate of observations that conceptually might occur as the result of
performing a particular operation in a particular way. It is usually infinite, but it will be taken as
finite with size N, where N is large. It is a set of measurements or counts of a single variable
taken on the units specified to be in the population (e.g. the heights of all men over 25 years of
age in Nigeria).
Random Sample: A sample obtained in such a way that each member of the population has
an equal chance of being chosen.
Precision: The level of agreement between replicate measurements of a measurable property.
Random effect: A treatment effect for which the treatments tested represent a random sample
from a larger population.
Random error: Error that causes response measurements to fall on either side of a target, affecting
data precision.
Randomization: The random selection of experimental units for use within an experiment and of
the run order of experiments, which reduces the risk of bias in experimental results.
Range: A simple measure of data spread, given by the difference between the largest and
smallest measurements.
Ranking: Ordinal number corresponding to the position of a measurement value when
measurements are placed in ascending order of magnitude.
Repeatability: A measure of the precision of a method expressed as the agreement attainable
between independent determinations performed by a single individual using the same
instrument and techniques in a short period of time.
Replication: The concept of repeating experimentation to produce multiple measurements of
the same response to enable data accuracy and precision to be estimated.
Residuals: Estimates of model error, determined as the difference between the recorded
observations and the model’s fitted values.
Response: The characteristic measured or observed in a study.
Sample: A set of representative measurements of a measurable or observable outcome.
Significance level: The probability of rejecting a true null hypothesis typically set at 5% or 0.05.
Skewness: A shape measure of data for assessing lack of symmetry.
Standard deviation: A magnitude dependent measure of the absolute precision of replicate
experimental data.
Test statistic: A mathematical formula which provides a measure of the evidence that the
study data provide in respect of acceptance or rejection of the null hypothesis, numerically
estimable using study data.
Treatment: The controlled effect being assessed in an experiment for its influence on a
measurable or observable outcome.
Variability: The level of variation present within collected data, also referred to as consistency or spread.
1.3 Interpretation of Experimental Data
When summarizing large masses of data, it is sometimes useful to distribute the data into
classes or categories. To interpret experimental data, it is necessary first to present them in summary
form to provide the basis for analysis and interpretation. Two approaches can be employed to analyze
experimental data, namely descriptive statistics and inferential statistics. Descriptive statistics
covers graphical presentations (simple data plots) and numerical summaries, while inferential
statistics covers statistical tests and estimation.
Graphical Presentations
Data gathered from experiments can be difficult to understand and interpret in their raw form.
Data plots are simple pictorial representations that summarize data concisely
in simple and meaningful formats. There are several forms of graphical representation used in
presenting or interpreting raw data. These include the histogram, dot plot, scatter diagram, standard error
plot, normal plot, control chart, time series plot and interaction plot.
One of the most commonly used and simplest to interpret data plots is the dot plot. In such a
plot, each response measurement is presented separately, enabling each measurement’s position to be
displayed as illustrated in Fig. 1.1. The plot consists of a horizontal axis covering the range of
measurements, with each measurement marked by a dot at the requisite value on the axis. The dot
plot or diagram is a valuable device for displaying the distribution of a small body of data (say up to
20 observations). The dot diagram shows the general location of the observations and the spread of
the observations. A dot plot is particularly useful when comparing measurements from two or more
data sets.
Fig. 1.1. Dot plot for a sample of 10 observations
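A dot plot of this kind can be sketched in plain text. The example below is a minimal illustration that assumes integer-valued observations. The sample of 10 values is hypothetical and is not the data behind Fig. 1.1.

```python
from collections import Counter


def dot_plot_lines(values):
    """Build a text dot plot: one dot per observation above a horizontal axis."""
    counts = Counter(values)
    positions = range(min(values), max(values) + 1)  # axis covers the data range
    tallest = max(counts.values())
    lines = []
    # Draw from the top row down; a dot appears where the stack is tall enough.
    for level in range(tallest, 0, -1):
        row = "".join("  ." if counts[p] >= level else "   " for p in positions)
        lines.append(row.rstrip())
    # Last line is the axis, with each value right-aligned under its dots.
    lines.append("".join(f"{p:>3d}" for p in positions))
    return lines


# Hypothetical sample of 10 integer observations.
sample = [61, 63, 63, 64, 64, 64, 65, 66, 66, 68]
print("\n".join(dot_plot_lines(sample)))
```

Repeated values stack vertically, so both the location and the spread of the sample are visible at a glance.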
When a larger number of results is available, the dots become hard to distinguish from each
other, and we are better able to appreciate the data by constructing a frequency distribution, also
called a frequency diagram or histogram. Figure 1.2 shows a frequency distribution for N = 500
observations. Each observation was recorded to one decimal place. Since the smallest of the 500
observations lies between 56 and 57 and the largest between 73 and 74, it is convenient to classify the
observations into 18 intervals, each covering a range of one unit.
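The classification into one-unit intervals can be sketched as follows. The sketch assumes each interval includes its lower limit and excludes its upper limit, a convention the text does not state. The six observations are hypothetical and are not the 500 values behind Fig. 1.2.

```python
import math
from collections import Counter


def unit_interval_counts(observations):
    """Count observations in one-unit intervals [k, k + 1), including empty ones."""
    counts = Counter(math.floor(x) for x in observations)
    lowest, highest = min(counts), max(counts)
    return {(k, k + 1): counts.get(k, 0) for k in range(lowest, highest + 1)}


# Hypothetical observations recorded to one decimal place.
observations = [56.3, 57.1, 57.8, 59.0, 59.4, 59.9]
frequency_table = unit_interval_counts(observations)
for (low, high), frequency in frequency_table.items():
    print(f"{low} to {high}: {frequency}")
```

Intervals with no observations are kept with a frequency of zero so that the horizontal axis of the diagram stays continuous.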
Most often, frequency diagrams have intervals of equal length. However, frequency diagrams
can also be constructed for data in which, for some reason, the grouping intervals are of different
lengths. The important thing to remember is that the area of the rectangle constructed on each
interval must be proportional to the frequency of observations within that interval.
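One way to satisfy this area rule is to draw each rectangle with height equal to frequency divided by interval width. The sketch below illustrates that choice; the interval limits and frequencies are hypothetical.

```python
def rectangle_heights(edges, frequencies):
    """Heights that make each rectangle's area equal to its interval frequency."""
    if len(edges) != len(frequencies) + 1:
        raise ValueError("need exactly one more edge than frequencies")
    return [
        frequency / (upper - lower)
        for frequency, lower, upper in zip(frequencies, edges[:-1], edges[1:])
    ]


# Hypothetical grouping with intervals of width 1, 1, 2 and 4.
edges = [0, 1, 2, 4, 8]
frequencies = [4, 6, 8, 8]

heights = rectangle_heights(edges, frequencies)
print(heights)
```

The last two intervals hold the same number of observations, but the wider one is drawn half as tall, so the two areas stay equal.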
Fig. 1.2. Frequency diagram (frequency distribution, histogram) for a sample of 500
observations.
Fig. 1.3. Frequency histogram and polygon plotted from a sample of 201
observations.
Numerical presentations
Plotting data is the first step in analysis. The second step requires determination of
numerical summary measures which are thought to be "typical" of the measurements collected.
Two basic forms of measure are generally used: a measure of location and a measure of
variability. The measure of location is a single measurement value specifying the position of the
"centre" of the data set, i.e. a point the data tend to cluster around. The mean or average and the
median are two commonly used measures of location.
The mean of a set of observations is the sum of the observations or measurements
divided by the number n of measurements collected. It is denoted by $\bar{x}$, i.e.
$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$$
where $x_i$ is the $i$th measurement.
The median is the value which splits the data into two equal halves, with half of the
measurements below it and half above. The median can be obtained by arranging the measurements
in order of magnitude from smallest to largest and finding the value which divides the ordered
data into two equal halves. For a sample of k measurements, this will correspond to the [(k+1)/2]th
ordered observation when k is odd, and is the average of the (k/2)th and (k/2+1)th ordered
observations when k is even.
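The two measures of location can be sketched directly from these definitions. The function names and the sample values below are illustrative only.

```python
def sample_mean(values):
    """Sum of the measurements divided by the number of measurements."""
    return sum(values) / len(values)


def sample_median(values):
    """Middle ordered value for odd k, average of the two middle values for even k."""
    ordered = sorted(values)
    k = len(ordered)
    if k % 2 == 1:
        # The [(k + 1) / 2]th ordered observation; lists are indexed from 0.
        return ordered[(k + 1) // 2 - 1]
    # Average of the (k / 2)th and (k / 2 + 1)th ordered observations.
    return (ordered[k // 2 - 1] + ordered[k // 2]) / 2


odd_sample = [2, 4, 4, 5, 10]   # hypothetical sample with k = 5
even_sample = [7, 1, 3, 5]      # hypothetical sample with k = 4

print(sample_mean(odd_sample), sample_median(odd_sample))
print(sample_mean(even_sample), sample_median(even_sample))
```

In the odd sample the single large value 10 pulls the mean above the median, which shows how the two measures can differ for the same data.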
A measure of data variability is used to provide a numerical summary of the level of
variation present in the measurements in respect to how the data cluster around their centre value.
The centre value is the arithmetic mean or average. Variability is also referred to as spread,
consistency, or precision.