RI 6 Descriptive Statistics

Descriptive Statistics
Unit 6
Variable
• Any characteristic (data) recorded for the subjects of a study
ex. blood pressure, nesting orientation, phytoplankton count
• Can be classified as either:
-categorical
-quantitative:
*discrete
*continuous
Categorical Variable
• Data belongs to one of a set of categories
Exs:
1. Gender (Male or Female)
2. Pets owned (dog, cat, great white…)
3. Type of food imported (beef, pork, shellfish …)
4. Engage in 30 minutes of exercise daily (Yes or No)
• Type of graph(s): bar, pie
Pie Charts
• Summarizes categorical
variable
• Drawn as circle where each
category is a slice
• The size of each slice is
proportional to the
percentage in that category
Bar Graphs
• Summarizes categorical variable
• Vertical bars for each category
• Height of each bar represents either
counts or percentages
• Easier to compare categories with
bar graph than with pie chart
• Called Pareto Charts when ordered
from tallest to shortest
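A minimal sketch of the numbers behind a bar graph or pie chart: the count and percentage in each category, plus the Pareto ordering. The `pets` list is hypothetical data made up for illustration.

```python
from collections import Counter

# Hypothetical answers for the categorical variable "pets owned".
pets = ["dog", "cat", "dog", "fish", "dog", "cat", "dog", "cat", "fish", "dog"]

counts = Counter(pets)                       # count in each category
total = sum(counts.values())
percentages = {cat: 100 * n / total for cat, n in counts.items()}

# Pareto order: categories sorted from the tallest bar to the shortest.
pareto_order = [cat for cat, _ in counts.most_common()]

# Print one text "bar" per category, tallest first.
for cat in pareto_order:
    print(f"{cat:<5} {counts[cat]:>2} {percentages[cat]:5.1f}%  {'#' * counts[cat]}")
```

The same percentages would set the slice sizes of a pie chart; the bars make the comparison between categories easier to see.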
Quantitative Variable
• Data takes numerical values that represent different magnitudes.
Exs:
1. Age of test subjects
2. Number of siblings
3. Seasonal changes in pH of pond water
• Type of graph: scatter-plot, line, stem and leaf
Quantitative vs. Categorical
• For Quantitative variables, key features are the center (a
representative value) and spread (variability).
• For Categorical variables, a key feature is the percentage of data in
each of the categories
Discrete Quantitative Variable
• Quantitative variable is discrete if its
possible values form a set of separate
numbers: 0,1,2,3,….
• Exs:
1. Number of calico cats sold
2. Number of nests with down linings
3. Number of students who fall asleep
in Stats class
Continuous Quantitative Variable
• Quantitative variable is continuous if its
possible values form an interval
• Measurements
• Examples:
1. Height/Weight
2. Age
3. Blood pressure
Most Common Way to Describe Data
• Central tendency
• Statistical variation
Central Tendency
• Used to represent entire data set
• Highlights distribution of data
• Measured by one of the following: mode, mean, or median
Mode
• Value that occurs most often
• Highest bar in the histogram
• Mode is most often used with categorical data
• Best if not used alone
12, 12, 13, 14, 14, 15, 15, 15, 15, 37, 38
2, 3, 3, 4, 5, 5- bimodal
65, 68, 69, 71, 72, 73, 75, 77- mode?
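The three data sets above can be checked with a short sketch. The helper `modes` is a hypothetical name, and it assumes one common convention: when every value occurs equally often, the set is treated as having no mode.

```python
from collections import Counter

def modes(values):
    """Return the most frequent value(s); an empty list means no mode.

    Assumes a non-empty list of values.
    """
    counts = Counter(values)
    highest = max(counts.values())
    # Assumed convention: if all values occur equally often, none stands out.
    if len(counts) > 1 and all(c == highest for c in counts.values()):
        return []
    return sorted(v for v, c in counts.items() if c == highest)

print(modes([12, 12, 13, 14, 14, 15, 15, 15, 15, 37, 38]))  # a single mode
print(modes([2, 3, 3, 4, 5, 5]))                            # bimodal: two modes
print(modes([65, 68, 69, 71, 72, 73, 75, 77]))              # no value repeats
```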
Mean
• The sum of the observations divided by the
number of observations
• Measure of centermost point when there is a
symmetrical distribution of values in a data set
• Mean = Σx / n
Σ- sum
n- total number of values
Ex. 8 g/cm³, 10 g/cm³, 7 g/cm³, 9 g/cm³
Mean = (8 g/cm³ + 10 g/cm³ + 7 g/cm³ + 9 g/cm³) / 4 = 34 g/cm³ / 4 = 8.5 g/cm³
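The same calculation as a short sketch, using the four density values from the example above (the variable names are illustrative):

```python
densities = [8, 10, 7, 9]   # g/cm³, the values from the example above

total = sum(densities)       # Σx, the sum of the observations
n = len(densities)           # n, the number of observations
mean = total / n

print(f"{total} / {n} = {mean} g/cm³")
```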
Median
• Midpoint of the observations when ordered from least to
greatest
• Used when there are extremes in data
1. Order observations
2. If the number of observations is:
a) Odd: the median is the middle observation
b) Even: the median is the average of the two middle
observations
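The two steps above can be sketched as follows. The function name `median` is illustrative, and the two data sets are reused from the mode examples in these notes.

```python
def median(values):
    """Median by the procedure above: order, then take the middle."""
    ordered = sorted(values)              # step 1: order observations
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:                        # odd: the middle observation
        return ordered[mid]
    # even: the average of the two middle observations
    return (ordered[mid - 1] + ordered[mid]) / 2

odd_set = [12, 12, 13, 14, 14, 15, 15, 15, 15, 37, 38]   # 11 observations
even_set = [65, 68, 69, 71, 72, 73, 75, 77]              # 8 observations

print(median(odd_set))
print(median(even_set))
```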
Central Tendency
• If data set has normal distribution: mean, median and mode are the
same value
• If data set is not distributed normally: values of central tendency will
vary.
*requires inferential statistics: t-test, ANOVA
Comparing the Mean and Median
• Mean and median of a symmetric distribution are close
• In a skewed distribution: the mean is pulled farther toward the
long tail than the median
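A quick comparison, as a sketch: the skewed set is the first mode example from these notes (two large values, 37 and 38, form the tail), and the symmetric set is hypothetical.

```python
import statistics

skewed = [12, 12, 13, 14, 14, 15, 15, 15, 15, 37, 38]   # tail toward large values
symmetric = [2, 3, 4, 5, 6]                              # hypothetical symmetric data

for name, data in [("skewed", skewed), ("symmetric", symmetric)]:
    mean = statistics.mean(data)
    med = statistics.median(data)
    print(f"{name}: mean = {mean:.2f}, median = {med}")
```

In the skewed set the mean lands above the median because the two large values drag it toward the tail; in the symmetric set the two agree.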
Statistical Variation
• Shows how scores differ from one another
• AKA: variation, dispersion, spread
• Represent average difference from the mean
• Four measures of variation: range, interquartile range, standard
deviation, variance
Range
• Most general measure of variation
• Measures difference between highest and lowest values: spread
of data
Ex. pH 6, 6, 6, 7, 7, 7, 7, 5, 3
range: 7 - 3 = 4 pH units
Range
• Range is strongly affected by outliers.
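A small sketch of the range calculation on the pH readings from the example, and of how much a single extreme reading changes it. The helper name `value_range` is illustrative.

```python
def value_range(values):
    """Range = highest value minus lowest value."""
    return max(values) - min(values)

ph = [6, 6, 6, 7, 7, 7, 7, 5, 3]
without_low = [v for v in ph if v != 3]   # same readings without the single pH 3

print(value_range(ph))            # all readings
print(value_range(without_low))   # the low reading removed
```

Dropping one reading halves the range here, which is why the range is best reported alongside a more resistant measure.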
Interquartile Range- IQR
• AKA mid-fifty or midspread
• Organizes data into 4 quartiles, each with 25% of data
• To calculate IQR:
1. Find median of entire data set
2. Find median of lower half of set- lower quartile
3. Find median of upper half of set- upper quartile
4. Subtract lower quartile from upper quartile
Quartiles
Measure of Spread: Quartiles
* 25% of the data at or below Q1 and
75% above
* 50% of the obs are above the
median and 50% are below
* 75% of the data at or below Q3 and
25% above
Q1= first quartile = 2.2
M = median = 3.4
Q3= third quartile = 4.35
Calculating Interquartile Range
• Interquartile range: distance between the third and first
quartile, giving spread of middle 50% of the data:
• IQR = Q3 - Q1
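A sketch of the median-of-halves procedure. It assumes one convention: with an odd number of observations, the overall median is left out of both halves. Other conventions (and many software packages) give slightly different quartiles. The data set is reused from the mode example in these notes.

```python
import statistics

def quartiles(values):
    """Return (Q1, median, Q3) by the median-of-halves method.

    Assumption: for an odd number of observations the overall median
    is excluded from both halves. Needs at least two observations.
    """
    ordered = sorted(values)
    half = len(ordered) // 2
    lower = ordered[:half]        # lower half of the ordered data
    upper = ordered[-half:]       # upper half of the ordered data
    return (statistics.median(lower),
            statistics.median(ordered),
            statistics.median(upper))

data = [12, 12, 13, 14, 14, 15, 15, 15, 15, 37, 38]
q1, med, q3 = quartiles(data)
iqr = q3 - q1                     # IQR = Q3 - Q1

print(q1, med, q3, iqr)
```

For this set the range is 38 - 12 = 26, while the IQR stays small because the two large values fall outside the middle 50% of the data.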
Standard Deviation
• Each data value has an associated deviation from
the mean, x − x̄
• A deviation is positive if it falls above the mean and
negative if it falls below the mean
• The sum of the deviations is always zero
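These properties can be checked on the density values from the mean example (a sketch; with other data, floating-point rounding can leave a tiny nonzero sum):

```python
densities = [8, 10, 7, 9]                    # g/cm³, from the mean example
mean = sum(densities) / len(densities)       # 8.5

# Deviation of each value from the mean: positive above it, negative below it.
deviations = [x - mean for x in densities]

print(deviations)
print(sum(deviations))
```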

Standard Deviation
Standard deviation: summarizes the deviations of each
observation from the mean and calculates an adjusted
average of these deviations:
1. Find mean
2. Find each deviation
3. Square deviations
4. Sum squared
deviations
5. Divide sum by n-1
6. Take square root
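The six steps can be sketched directly. The function name `sample_sd` is illustrative; the result is compared with `statistics.stdev` from the Python standard library, which also divides by n - 1.

```python
import math
import statistics

def sample_sd(values):
    """Standard deviation following the six steps above."""
    n = len(values)
    mean = sum(values) / n                      # 1. find mean
    deviations = [x - mean for x in values]     # 2. find each deviation
    squared = [d ** 2 for d in deviations]      # 3. square deviations
    total = sum(squared)                        # 4. sum squared deviations
    variance = total / (n - 1)                  # 5. divide sum by n - 1
    return math.sqrt(variance)                  # 6. take square root

densities = [8, 10, 7, 9]                       # g/cm³, from the mean example
sd = sample_sd(densities)

print(round(sd, 3))
print(math.isclose(sd, statistics.stdev(densities)))
```

The value obtained after step 5, before the square root, is the variance.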
Outlier
An outlier falls far from the rest of the data
Graphs for Quantitative Data
1. Dot Plot: shows a dot for each observation placed above its
value on a number line
2. Stem-and-Leaf Plot: displays individual observations
3. Histogram: uses bars to portray the data
Which Graph?
• Dot-plot and stem-and-leaf plot:
• More useful for small data sets
• Data values are retained
• Histogram
• More useful for large data sets
• Most compact display
• More flexibility in defining
intervals
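A sketch of the counting behind a histogram, showing how the choice of intervals changes the picture. The function `histogram_counts` and the `heights` data are hypothetical.

```python
def histogram_counts(values, start, width, bins):
    """Count observations in `bins` equal-width intervals.

    Interval i covers [start + i*width, start + (i+1)*width).
    """
    counts = [0] * bins
    for v in values:
        i = int((v - start) // width)
        if 0 <= i < bins:             # values outside the intervals are ignored
            counts[i] += 1
    return counts

# Hypothetical measurements (cm)
heights = [150, 152, 158, 161, 163, 164, 167, 170, 171, 178]

for width, bins in [(10, 3), (15, 2)]:
    counts = histogram_counts(heights, start=150, width=width, bins=bins)
    print(f"interval width {width}:")
    for i, c in enumerate(counts):
        low = 150 + i * width
        print(f"  [{low}, {low + width})  {'#' * c}")
```

Unlike a dot plot or stem-and-leaf plot, the bars keep only the count in each interval, not the individual values.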
Dot Plots
To construct a dot plot
1. Draw and label horizontal line
2. Mark regular values
3. Place a dot above each value on the
number line
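The construction steps above can be sketched as a text dot plot. The function `dot_plot` and the `siblings` data are hypothetical, and the sketch assumes a small set of whole-number values.

```python
from collections import Counter

def dot_plot(values, dot="*"):
    """Return a text dot plot (list of lines) for small whole-number data."""
    counts = Counter(values)
    # Regular values marked along the number line.
    axis = list(range(min(values), max(values) + 1))
    rows = []
    # One row per stacking level, top row first: a dot for each observation.
    for level in range(max(counts.values()), 0, -1):
        cells = [f"{dot:^3}" if counts[v] >= level else "   " for v in axis]
        rows.append(" ".join(cells).rstrip())
    rows.append("-" * (4 * len(axis) - 1))            # the horizontal line
    rows.append(" ".join(f"{v:^3}" for v in axis))    # value labels
    return rows

# Hypothetical data: number of siblings for eight subjects
siblings = [0, 1, 1, 2, 2, 2, 3, 5]
lines = dot_plot(siblings)
print("\n".join(lines))
```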
Sodium in Cereals
Stem-and-leaf plots
• Summarizes quantitative
variables
• Separates each observation
into a stem (first part of #)
and a leaf (last digit)
• Write each leaf to the right of
its stem; order leaves if
desired
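A sketch of the stem-and-leaf split, assuming non-negative whole numbers with the last digit as the leaf. The function name is illustrative, and the data are reused from the mode example in these notes (the cereal sodium values are not listed here).

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Map each stem (all but the last digit) to its ordered leaves (last digit).

    Assumes non-negative whole numbers. Stems with no observations are
    omitted in this sketch.
    """
    plot = defaultdict(list)
    for v in sorted(values):           # sorting keeps the leaves in order
        plot[v // 10].append(v % 10)   # stem = first part, leaf = last digit
    return dict(plot)

data = [65, 68, 69, 71, 72, 73, 75, 77]
plot = stem_and_leaf(data)

# Write each leaf to the right of its stem.
for stem, leaves in plot.items():
    print(f"{stem} | {' '.join(str(leaf) for leaf in leaves)}")
```

Each original value can be read back from the plot (stem 6, leaf 5 is 65), which is why the data values are retained.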
Sodium in Cereals