COMPUTATIONAL CHEMISTRY

SOME STATISTICAL CONCEPTS AND DEFINITIONS

Population: A population is a finite or infinite set of measurements of some property of a finite or infinite collection of objects. The use of the word implies completeness, and it refers to the measurements of the variable of interest, not to the objects whose properties are being measured. Thus the population refers to the set of numerical measurements x1, x2, ..., xN, which are the outcomes of N measurements of some property of the objects.
Example: the heights and/or weights of students attending PSU.

Sample: A sample is a portion or subset of the population and implies incompleteness. A sample always yields a finite set of measured values due to the limitations of measurement. Samples are studied because of time and financial limitations, as well as the impossibility of measuring an infinite set. The important question is this: WHAT CAN ONE LEARN ABOUT THE NATURE OF THE POPULATION FROM A STUDY OF THE PROPERTIES OF SAMPLES DRAWN FROM THE POPULATION?
Example: Harris polls attempt to predict election results for the entire voting population (i.e., the statistical population is everyone who can vote) from a detailed study of a statistical sample, which is usually about 1260 randomly selected individuals.

Random Variable: A random variable X is a real-valued function defined for all possible occurrences in an experiment. 'Random' implies that each measurement in the set of possible values is independent of all of the others. Thus X(x1, x2, ..., xn) refers symbolically to the set x1, x2, ..., xn.
Example: In the rolling of a die, X includes the set of values 1, 2, ..., 6.

Random variables may be of two kinds:

a) Discrete Variables: These are variables that can assume only specific values (usually integers), such as the number of male students with brown hair, the number of wormy apples on a tree, or the number of defective products in a batch.
This includes all counting data from radioactive isotopes, die-rolling problems, coin-flipping problems, and opinion polls, to name only a few examples. Discrete variables are very common in the sciences. The distribution that represents a discrete variable is a ‘discrete distribution’, and the corresponding probability function is the ‘probability mass function’, or pmf for short.

b) Continuous Variables: If a variable can assume any value within a finite interval, then we say that the random variable is continuous. This concept is not limited by the resolution of the measuring process!
Examples: The height or weight of a person, or the speed of a car. In fact, most of the variables studied in the physical sciences are continuous. The fact that we cannot measure the variable to infinite precision is irrelevant. The distribution that represents a continuous variable is a ‘continuous distribution’, and the corresponding probability function is called a ‘density’ function (probability density function, or pdf for short).

Experimental Distributions: An experimental distribution is a curve or graph (bar graph, pie chart, etc.) that results from plotting the experimental frequencies of the measured data against the specific values that the experimental data take on.
Example: the frequencies of people saying they are for, against, or undecided in an election poll.

Theoretical Distributions: These are model equations that serve to represent the mathematical behavior of populations (not samples!). They may be discrete or continuous. Theoretical distributions are usually used to model experimental distributions: once you know the theoretical model that represents an experimental distribution, you can derive information about the experimental distribution that has probably not been measured, or cannot easily be measured. Of course, the validity of this procedure depends on how closely the theoretical distribution models the experimental distribution.

8/21/2006
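The relationship between an experimental distribution (built from a sample) and a theoretical distribution (a model of the population) can be illustrated with the die-rolling example. The following is a minimal Python sketch and is not part of the original notes. It assumes a fair die, so the theoretical pmf is P(X = k) = 1/6 for k = 1, ..., 6. The function names (theoretical_pmf, experimental_distribution, draw_sample), the sample sizes, and the random seed are all illustrative assumptions.

```python
import random
from collections import Counter
from fractions import Fraction

FACES = range(1, 7)  # possible values of the discrete random variable X


def theoretical_pmf():
    """Theoretical (population) pmf of a fair die: P(X = k) = 1/6."""
    return {k: Fraction(1, 6) for k in FACES}


def draw_sample(n, seed=0):
    """Draw a finite sample of n independent rolls (seed is an assumption)."""
    rng = random.Random(seed)
    return [rng.randint(1, 6) for _ in range(n)]


def experimental_distribution(sample):
    """Relative frequency of each face observed in a finite sample."""
    counts = Counter(sample)
    n = len(sample)
    return {k: counts[k] / n for k in FACES}


if __name__ == "__main__":
    pmf = theoretical_pmf()
    for n in (60, 6000):
        freqs = experimental_distribution(draw_sample(n))
        # Largest gap between the sample frequencies and the model.
        worst = max(abs(freqs[k] - float(pmf[k])) for k in FACES)
        print(f"n = {n:5d}  largest |experimental - theoretical| = {worst:.4f}")
```

Running the script prints, for each sample size, the largest difference between the observed relative frequencies and the model value of 1/6. Larger samples tend to follow the theoretical distribution more closely, although this is not guaranteed for any single run.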