Week 1 September 2-6

Week 1
September 1-5
Six Mini-Lectures
QMM 510
Fall 2014
ML 1.1
• self-introductions (Moodle mini-biographies)
• course format, syllabus, projects
• grading, communication
• goals: short run vs long run
You can watch the instructor’s
introductory welcome video for MBA
students (posted on Moodle)
0-2
Chapter 0
Getting Started
Textbook
David P. Doane and Lori E. Seward, Applied Statistics in
Business and Economics, 4th edition (McGraw-Hill, 2013),
ISBN 0077931505. This is an omnibus ISBN that includes
several components (textbook, Connect access, MegaStat
download). All of these components are essential because this
is an online course. The Oakland University campus book
center (248-370-2404) has this package ISBN in stock (and
can ship to you if necessary).
0-3
Chapter 0
Getting Started
Online Resources
Homework, testing, and grading will utilize McGraw-Hill's
Connect Plus. The Online Learning Center (OLC) has
downloadable data sets for exercises and examples, as well
as Big Data Sets, PowerPoint slides, self-graded practice
quizzes, and step-by-step guided examples. The instructor
will post mini-lectures on Moodle.
0-4
Chapter 0
Getting Started
Course Organization
Unless otherwise indicated, online quizzes, exercises, and
written projects are due by midnight on Monday of the week
shown in the syllabus. Use e-mail (doane@oakland.edu) or
call me (cell 248-766-7605). Note: The instructor is in the Pacific
time zone (please use judgment when calling). Post questions
on Moodle forum.
0-5
Chapter 0
Getting Started
Grading
Students will complete several written projects (50%
weight, graded by instructor) and several Connect
assignments with online feedback (50% weight). Basically,
you will submit one assignment (Connect or Project) per
week except for weeks 9 and 13. Grades will be posted on
Moodle.
0-6
Chapter 0
Getting Started
Homework using Connect
C-1 Chapters 2-3 (Sep 8)
C-2 Chapter 4 (Sep 15)
C-3 Chapters 5-6 (Sep 29)
C-4 Chapter 7 (Oct 6)
C-5 Chapter 8 (Oct 20)
C-6 Chapters 9-10 (Nov 3)
C-7 Chapter 15 (Nov 10)
C-8 Chapter 12 (Nov 17)
Chapter 0
Getting Started
Note: Connect assignments allow three attempts. Online feedback
increases with each attempt. Assignments will be auto-submitted on due
date. Your score will be the average of all three attempts, so it pays to try
hard on each attempt. You may complete them in advance (they are
accessible anytime up to due date). Be sure to save your work when you
exit Connect.
0-7
Projects
P-1 Describing a sample (Sep 22)
P-2 Making forecasts (Oct 13)
P-3 Regression modeling (Dec 3)
Note: For each project, submit a concise (5-10 page) report (not a
spreadsheet or PowerPoint) using Microsoft Word or equivalent
that answers the questions posed along with your own comments
and interpretations. Strive for effective writing (see textbook
Appendix I). Creativity and initiative will be rewarded. In projects
done with partners or teams, submit only one report.
0-8
Chapter 0
Getting Started
Short Run
Complete weekly assignments successfully
Improve Excel and report-writing skills
Balance this course against other responsibilities
Enjoy learning and want to learn more
Long Run
Succeed in other MBA classes that use statistics
Develop confidence and lose fear of quant methods
Use resources to learn on your own (web, textbook)
0-9
Chapter 0
Goals: Short Run / Long Run
ML 1.2
• textbook, e-book
• OLC (http://www.mhhe.com/doane4e)
• Connect (http://connect.mcgraw-hill.com/class/d_doane_qmm_510_-_fall_2014)
• Moodle (https://moodle.oakland.edu/)
• MegaStat (http://www.mhhe.com/megastat)
• LearningStats (http://www.mhhe.com/doane4e)
0-10
Chapter 0
Resources Available
Textbook, e-book
• Basically, we will cover the first 14 chapters
• Within chapters, some topics get less weight
• Focus on what you need for assignments
Chapter 0
Resources Available
Not covered in this class
Pre-paid registration code is required to use Connect Plus
Connect Plus (http://connect.mcgraw-hill.com/class/d_doane_qmm_510_-_fall_2014)
E-book: In addition to the textbook, you have an e-book
Premium content: ScreenCam videos on Excel and MegaStat
0-12
Chapter 0
Resources Available
Connect Plus (http://connect.mcgraw-hill.com/class/d_doane_qmm_510_-_fall_2014)
OLC (http://www.mhhe.com/doane4e)
A pre-paid registration code is required to use Connect Plus and premium content
Premium content: 5-minute tutorials on Excel and MegaStat
0-13
Chapter 0
Resources Available
The OLC is available to anyone (without premium content)
Connect Plus (http://connect.mcgraw-hill.com/class/d_doane_qmm_510_-_fall_2014)
Chapter 0
Resources Available
A pre-paid registration code is required to use Connect Plus and premium content
ScreenCam tutorials on Excel statistics by Professor Doane (4 videos, 5 min each), if you need them
OLC (http://www.mhhe.com/doane4e)
Course: Big Data Sets, LearningStats, etc.
Click on a chapter: Quizzes, PowerPoints for that chapter
No registration code required to use OLC
Chapter 0
Resources Available
MegaStat (http://www.mhhe.com/megastat)
Click to download: Prepaid with code (with ISBN 0077931505)
0-16
Chapter 0
Resources Available
MegaStat (http://www.mhhe.com/megastat)
Drop-down menu: Adds statistical capability to Excel
Add-Ins tab: Click on this tab to see MegaStat drop-down menu
Chapter 0
Resources Available
OLC (http://www.mhhe.com/doane4e)
Files are zipped: Download one chapter at a time
LearningStats is a supplement – nice but not part of the textbook (demos, spreadsheets, slides)
Appendix A F Tables (346.0K)
Appendix I Business Reports (1011.0K)
Unit 01 Overview of Statistics (5925.0K)
Unit 02 Data Collection (815.0K)
Unit 03 Data Presentation (9572.0K)
Unit 04 Describing Data (3337.0K)
Unit 05 Probability (478.0K)
Unit 06 Discrete Distributions (550.0K)
Unit 07 Continuous Distributions (1409.0K)
Unit 08 Estimation (2103.0K)
Unit 09 Hypothesis Tests I (1135.0K)
Unit 10 Hypothesis Tests II (420.0K)
Unit 11 ANOVA (192.0K)
Unit 12 Simple Regression (2245.0K)
Unit 13 Multiple Regression (2756.0K)
Unit 14 Time Series I (1519.0K)
Unit 15 Chi Square Tests (627.0K)
Unit 16 Nonparametric Tests (1385.0K)
Unit 17 Quality Management (1329.0K)
Unit 18 Simulation (1460.0K)
Chapter 0
Resources Available
1.1 What is Statistics?
1.2 Why Study Statistics?
1.3 Uses of Statistics
1.4 Statistical Challenges
1.5 Critical Thinking
0-19
ML 1.3
Chapter 1
Challenges for MBAs
Statistics is the science of collecting, organizing, analyzing,
interpreting, and presenting data.
A statistic is a single measure (number) used to summarize
a sample data set; for example, the average height of a
sample of students at a university.
1-20
Chapter 1
What is Statistics?
• Data mining, neural tools, simulation, spreadsheet modeling, etc
• Costly software
• Specialized expertise required
• Huge databases (millions of records, complex file structure, sparse or
missing data, proprietary concerns, privacy issues)
1-21
Chapter 1
Big Data, Big Tools
Descriptive statistics – the collection, organization, presentation, and
summary of data.
Inferential statistics – generalizing from a sample to a
population, estimating unknown parameters, drawing
conclusions, making decisions.
1-22
Chapter 1
Uses of Statistics
• Statistical knowledge gives a company a competitive advantage
against organizations that cannot understand their internal or
external market data.
• Mastery of basic statistics gives an individual manager a
competitive advantage as one works one’s way through the
promotion process, or when one moves to a new employer.
1-23
Chapter 1
Why Study Statistics
• Is technically current (e.g., software-wise).
• Communicates well.
• Is proactive.
• Has a broad outlook.
• Is flexible.
• Focuses on the main problem.
• Meets deadlines.
• Knows his/her limitations and is willing to ask for help.
• Can deal with imperfect information.
• Has professional integrity.
1-24
Chapter 1
The Ideal Data Analyst
• Treat customers in a fair and honest manner.
• Comply with laws that prohibit discrimination.
• Ensure that products and services meet safety regulations.
• Stand behind warranties.
• Advertise in a factual and informative manner.
• Encourage employees to ask questions and voice concerns.
• Accurately report information to management.
1-25
Chapter 1
Business Ethics
• Know and follow accepted procedures.
• Maintain data integrity.
• Carry out accurate calculations.
• Report procedures faithfully.
• Protect confidential information.
• Cite sources.
• Acknowledge sources of financial support.
1-26
Chapter 1
Upholding Ethical Standards
Pitfall 1: Big Conclusions from a Small Sample
Pitfall 2: Conclusions from Nonrandom Samples
Pitfall 3: Conclusions From Rare Events
Pitfall 4: Poor Survey Methods
Pitfall 5: Assuming a Causal Link
Pitfall 6: Generalization from Groups
Pitfall 7: Unconscious Bias
Pitfall 8: Significance versus Importance
1-27
Chapter 1
Critical Thinking
Hire consultants at the beginning of the project, when
your team lacks certain skills or when an unbiased or
informed view is needed.
1-28
Chapter 1
Using Consultants
Chapter Contents
2.1 Definitions
2.2 Level of Measurement
2.3 Sampling Concepts
2.4 Sampling Methods
2.5 Data Sources
2.6 Surveys
2-29
ML 1.4
Chapter 2
Collecting Data
2-30
• Observation: a single member of a collection of items that we want to study, such as a person, firm, or region.
• Variable: a characteristic of the subject or individual, such as an employee’s income or an invoice amount.
• Data Set: consists of all the values of all of the variables for all of the observations we have chosen to observe.
Chapter 2
Definitions
Time Series Data
• Each observation in the sample represents a different equally spaced point
in time (e.g., years, months, days).
• Periodicity may be annual, quarterly, monthly, weekly, daily, hourly, etc.
• We are interested in trends and patterns over time (e.g., personal
bankruptcies from 1980 to 2008).
2-31
Chapter 2
Time Series vs Cross-Sectional Data
Cross Sectional Data
• Each observation represents a different individual unit (e.g., person) at
the same point in time (e.g., monthly VISA balances).
• We are interested in:
- variation among observations or
- relationships.
• We can combine the two data types to get pooled cross-sectional and
time series data.
2-32
Chapter 2
Time Series vs Cross-Sectional Data
Caution: Ambiguity is introduced when continuous data are rounded
to whole numbers so they seem discrete (e.g., round your weight from
166.4 to 166). When the range is large, it is usually best to treat
integers as continuous data.
(Figure 2.1)
2-33
Chapter 2
Data Types
2-34
Chapter 2
Level of Measurement
Level of Measurement | Characteristics | Example
Nominal | Categories only | Eye color (blue, brown, green, etc.)
Ordinal | Rank has meaning. No clear meaning to distance | Exercise frequency (often, rarely, never)
Interval | Distance has meaning | Temperature (57° Celsius)
Ratio | Meaningful zero exists | Accounts payable ($21.7 million)
2-35
Chapter 2
Level of Measurement
Nominal Measurement
• Nominal data merely identify a category.
• Nominal data can be coded numerically (e.g., 1 = Apple, 2 =
Toshiba, 3 = Dell, 4 = HP, 5 = Other).
• Only mathematical operation allowed is counting (e.g.,
frequencies) or calculating percent in each category.
Ordinal Measurement
• Ordinal data codes can be ranked (e.g., 1 = Frequently, 2 =
Sometimes, 3 = Rarely, 4 = Never).
2-36
Chapter 2
Level of Measurement
Ordinal Measurement
• Distance between codes is not meaningful
(e.g., distance between 1 and 2, or between 2 and 3, or between 3
and 4 lacks meaning).
• Many useful statistical tests exist for ordinal data, especially in social
science, marketing and human resource research.
Interval Measurement
• Data can not only be ranked, but also have meaningful intervals
between scale points (e.g., difference between 60F and 70F is
same as difference between 20F and 30F).
2-37
Chapter 2
Level of Measurement
Interval Measurement
• Intervals between numbers represent distances, so math operations
can be performed (e.g., take the average).
• Zero point of interval scales is arbitrary, so ratios are not meaningful
(e.g., 60F is not twice as warm as 30F).
Ratio Measurement
• Ratio data have all properties of nominal, ordinal, and interval data
types and also a meaningful zero.
• Because of this zero point, ratios of data values are meaningful (e.g.,
$20 million profit is twice as much as $10 million).
• Zero does not have to be observable; it is a reference point.
2-38
Chapter 2
Level of Measurement
• A Likert scale is a special case of interval data frequently used in survey research.
• The coarseness of a Likert scale refers to the number of scale points
(typically 5 or 7). Responses are often coded as numbers (e.g., 1, 2, 3, 4,
5) but technically are ordinal measurements.
• Researchers generally treat Likert scales as interval data (no true zero)
so they can calculate the mean and standard deviation.
2-39
Chapter 2
Likert Scales
Use the following procedure to recognize data types:
Question | If “Yes”
Q1. Is there a meaningful zero point? | Ratio data (all statistical operations are allowed)
Q2. Are intervals between scale points meaningful? | Interval data (common statistics allowed, e.g., means and standard deviations)
Q3. Do scale points represent rankings? | Ordinal data (restricted to certain types of nonparametric statistical tests)
Q4. Are there discrete categories? | Nominal data (only counting allowed, e.g., finding the mode)
2-40
Chapter 2
Level of Measurement
• In order to simplify data or when exact data magnitude is of little
interest, ratio data can be recoded downward into ordinal or nominal
measurements (but not conversely).
• For example, recode systolic blood pressure as “normal” (under 130),
“elevated” (130 to 140), or “high” (over 140).
• Or recode your income (a ratio measurement) as ordinal (low, medium,
high) by specifying cutoff points.
• The above recoded data are ordinal (ranking is preserved), but intervals
are unequal and some information is lost.
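The blood pressure recoding above can be sketched in a few lines of Python. The function name `recode_systolic` is hypothetical, and the treatment of the exact boundary values (130 and 140 both counted as "elevated") is an assumption read from the wording of the slide.

```python
def recode_systolic(bp: float) -> str:
    """Recode a ratio measurement (systolic pressure) into an ordinal category."""
    if bp < 130:
        return "normal"      # under 130
    if bp <= 140:
        return "elevated"    # 130 to 140 (boundary handling is an assumption)
    return "high"            # over 140

readings = [118, 130, 137, 140, 152]          # made-up readings for illustration
categories = [recode_systolic(bp) for bp in readings]
print(list(zip(readings, categories)))
```

Ranking survives the recoding, but the original magnitudes cannot be recovered from the categories, which is why the conversion only works downward.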
2-41
Chapter 2
Changing Data By Recoding
• A sample involves looking only at some items selected from the
population.
• A census is an examination of all items in a defined population.
• Why sample instead of census?
• Cost, time, budget constraints.
• Accuracy may be better in a sample (training, etc).
• For example, the United States Census cannot survey every person in
the population (mobility, un-documented workers, budget constraints,
incomplete responses, etc).
2-42
Chapter 2
Sample or Census?
Situations Where a Sample or Census May Be Preferred
Sample: infinite population, destructive testing, timely results, accuracy, cost, sensitive information
Census: small population, large sample size, database exists, legal requirements
2-43
Chapter 2
Sampling Concepts
• Statistics are computed from a sample of n items, chosen from a
population of N items.
• Statistics can be used as estimates of parameters found in the
population.
• Specific symbols are used to represent population parameters and
sample statistics.
Example: If you use the symbol s, the statistician assumes that you
are referring to a sample standard deviation, whereas σ would denote
a population standard deviation.
2-44
Chapter 2
Parameters and Statistics
Rule of Thumb: A population may be treated
as infinite when N is at least 20 times n
(i.e., when N/n ≥ 20 or equivalently if n/N ≤ .05).
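As a quick check of the rule of thumb, here is a minimal Python sketch. The function name `treat_as_infinite` is hypothetical; the two test cases reuse population and sample sizes that appear elsewhere in these slides (N = 875 with n = 10, and N = 78 with n = 20).

```python
def treat_as_infinite(N: int, n: int) -> bool:
    """Rule of thumb: the population is effectively infinite when N/n >= 20."""
    return N >= 20 * n

print(treat_as_infinite(875, 10))   # N/n = 87.5
print(treat_as_infinite(78, 20))    # N/n = 3.9
```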
2-45
Chapter 2
Parameters and Statistics
Random Sampling
Simple random sample: Use random numbers to select items from a list (e.g., VISA cardholders).
Systematic sample: Select every kth item from a list or sequence (e.g., restaurant customers).
Stratified sample: Select randomly within defined strata (e.g., by age, occupation, gender).
Cluster sample: Like stratified sampling except strata are geographical areas (e.g., zip codes).
Chapter 2
Sampling Methods
Non-random Sampling
Judgment sample: Use expert knowledge to choose “typical” items (e.g., which employees to interview).
Convenience sample: Use a sample that happens to be available (e.g., ask co-worker opinions at lunch).
Focus groups: In-depth dialog with a representative panel of individuals (e.g., iPod users).
Chapter 2
Sampling Methods
With or Without Replacement
• If we allow duplicates when sampling, then we are sampling with
replacement.
• Duplicates are unlikely when n is much smaller than N.
• If we do not allow duplicates when sampling, then we are sampling
without replacement.
2-48
Chapter 2
Sampling Methods
Computer Methods
Chapter 2
Sampling Methods
Excel - Option A
Enter the Excel function =RANDBETWEEN(1,875) into 10
spreadsheet cells. Press F9 to get a new sample.
Excel - Option B
Enter the function =INT(1+875*RAND()) into 10
spreadsheet cells. Press F9 to get a new sample.
Internet
The website www.random.org will give you many kinds of
excellent random numbers (integers, decimals, etc).
Minitab
Use Minitab’s Random Data menu with the Integer option.
The software generators (Excel, Minitab) are pseudo-random because even the
best algorithms eventually repeat themselves.
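For readers who prefer a scripting language, the same idea can be sketched with Python's standard `random` module. This is an illustration, not part of the course tools; the fixed seed is an assumption made only so that the sketch is repeatable.

```python
import random

N = 875   # population size used in the Excel example above
n = 10    # sample size

rng = random.Random(42)  # fixed seed (an assumption) so repeated runs match

# With replacement: like 10 cells of =RANDBETWEEN(1,875); duplicates are possible.
with_replacement = [rng.randint(1, N) for _ in range(n)]

# Without replacement: sample() never returns the same item twice.
without_replacement = rng.sample(range(1, N + 1), n)

print(with_replacement)
print(without_replacement)
```

Like Excel's functions, Python's generator is pseudo-random.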
2-49
Chapter 2
Sampling Methods
Row – Column Data Arrays
• When the data are arranged in a rectangular array, an item can be chosen
at random by selecting a row and column.
• For example, in the 4 x 3 array, select a random column between 1 and 3
and a random row between 1 and 4.
• This way, each item has an equal chance of being selected.
2-50
Randomizing a List
• In Excel, use function =RAND() beside each row to create a column of
random numbers between 0 and 1.
• Copy and paste these numbers into the same column using Paste Special >
Values in order to paste only values and not the formulas.
• Sort the spreadsheet on the random number column.
Demonstration: CEO compensation (362 CEOs).
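The three Excel steps (attach a random number to each row, freeze the values, sort on them) can be mirrored in Python. This is a minimal sketch, assuming a plain list of names; the helper `randomize` is hypothetical, and only six of the 362 CEO names are used.

```python
import random

def randomize(rows, seed=None):
    """Shuffle rows by sorting on a random key, as in the Excel procedure."""
    rng = random.Random(seed)
    keyed = [(rng.random(), row) for row in rows]   # like =RAND() beside each row
    keyed.sort(key=lambda pair: pair[0])            # sort on the random column
    return [row for _, row in keyed]

ceos = ["Terry S Semel", "Barry Diller", "William W McGuire",
        "Howard Solomon", "George David", "Lew Frankfort"]

shuffled = randomize(ceos, seed=1)
first_k = shuffled[:3]     # the first k rows form a simple random sample
print(first_k)
```

Building the keys once before sorting plays the role of Paste Special > Values: the random numbers do not change while the sort runs.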
2-51
Chapter 2
Sampling Methods
Chapter 2
Sampling Methods
Randomizing a List of 362 CEOs
Before: CEOs are arranged in descending order of compensation.
Rank | Name | Company | Total Comp ($thou)
1 | Terry S Semel | Yahoo | 230,554
2 | Barry Diller | IAC/InterActiveCorp | 156,168
3 | William W McGuire | UnitedHealth Group | 124,774
4 | Howard Solomon | Forest Labs | 92,116
5 | George David | United Technologies | 88,712
6 | Lew Frankfort | Coach | 86,481
7 | Edwin M Crawford | Caremark Rx | 77,864
8 | Ray R Irani | Occidental Petroleum | 64,136
9 | Angelo R Mozilo | Countrywide Financial | 56,956
10 | Richard D Fairbank | Capital One Financial | 56,660
11 | Richard M Kovacevich | Wells Fargo | 53,083
After: Sorted on RAND() column. The first k CEOs are a random sample.
Rand() | Rank | Name | Company | Total Comp ($thou)
0.0015203 | 254 | Gary L Bloom | Veritas Software | 3,492
0.0060530 | 173 | Edmond J English | TJX Cos | 6,938
0.0074301 | 350 | William V Hickey | Sealed Air | 1,049
0.0087558 | 202 | William Clay Ford Jr | Ford Motor | 5,603
0.0093715 | 169 | David N Farr | Emerson Electric | 7,154
0.0140494 | 305 | Carl E Jones Jr | Regions Financial | 2,471
0.0153532 | 309 | James S Tisch | Loews | 2,380
0.0161077 | 81 | James E Rogers | Cinergy | 14,574
0.0210922 | 184 | Luke R Corbett | Kerr-McGee | 6,435
0.0222110 | 242 | John B Hess | Amerada Hess | 3,912
Systematic Sampling
• Sample by choosing every kth item from a list, starting from a randomly
chosen entry on the list.
• For example, starting at item 2, we sample every 4th item to obtain a
sample of n = 20 items from a list of N = 78 items.
Note that N/n = 78/20 ≈ 4 (periodicity).
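The slide's example can be reproduced with a short Python sketch. The helper `systematic_sample` is hypothetical, and the start is fixed at item 2 to match the example (in practice it would be chosen at random).

```python
def systematic_sample(items, start, k):
    """Take every kth item, beginning at the 1-based position `start`."""
    return items[start - 1::k]

population = list(range(1, 79))        # items numbered 1..78, so N = 78
k = round(len(population) / 20)        # periodicity: 78/20 = 3.9, rounded to 4
sample = systematic_sample(population, start=2, k=k)
print(len(sample), sample[:5], sample[-1])
```

Because N/n is not a whole number, some starting points give slightly fewer than 20 items (starting at item 3 yields 19), so the realized sample size is worth checking.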
2-53
Chapter 2
Sampling Methods
Stratified Sampling
• Requires prior information about the population.
• Applicable when the population can be divided into relatively
homogeneous subgroups of known size (strata).
• A simple random sample of the desired size is taken within each stratum.
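A minimal Python sketch of stratified sampling follows. The helper `stratified_sample`, the employee list, and the per-stratum sample sizes are all made up for illustration; the only idea taken from the slide is that a simple random sample is drawn separately within each stratum.

```python
import random
from collections import defaultdict

def stratified_sample(units, stratum_of, n_per_stratum, seed=None):
    """Draw a simple random sample of the requested size within each stratum."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for unit in units:
        strata[stratum_of(unit)].append(unit)        # group units by stratum
    return {name: rng.sample(members, n_per_stratum[name])
            for name, members in strata.items()}

# Hypothetical population: 60 employees tagged by occupation (made-up data).
employees = [(f"emp{i:02d}", "clerical" if i < 40 else "managerial")
             for i in range(60)]

sample = stratified_sample(employees, lambda e: e[1],
                           {"clerical": 4, "managerial": 2}, seed=7)
print(sample)
```

Here the sample sizes are proportional to the stratum sizes (40 and 20), which is one common choice when the strata sizes are known.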
2-54
Chapter 2
Sampling Methods
Cluster Sample
• Strata consist of geographical regions.
• One-stage cluster sampling – sample consists of all elements in each of k
randomly chosen subregions (clusters).
• Two-stage cluster sampling – first choose k subregions (clusters), then
choose a random sample of elements within each cluster.
2-55
Chapter 2
Sampling Methods
Cluster Sample
• Here is an example of 4 elements sampled from each of 3 randomly chosen clusters (two-stage cluster sampling).
2-56
Chapter 2
Sampling Methods
Chapter 2
Sampling Methods
Judgment Sample
• A non-probability sampling method that relies on the expertise of
the sampler to choose items that are representative of the
population.
• Can be affected by subconscious bias (i.e., non-randomness in
the choice).
Convenience Sample
• Take advantage of whatever sample is available at that moment. A quick
way to sample.
Focus Groups
• A panel of individuals chosen to be representative of a wider
population, formed for open-ended discussion and idea gathering.
2-58
Chapter 2
Sampling Methods
ML 1.5
3.1 Stem-and-Leaf Displays and Dot Plots
3.2 Frequency Distributions and Histograms
3.3 Excel Charts
3.4 Line Charts
3.5 Bar Charts
3.6 Pie Charts
3.7 Scatter Plots
3.8 Tables
3.9 Deceptive Graphs
So many topics, so little time …
3-59
Chapter 3
Describing Data Visually
For univariate data (a set of n observations on one variable)
the statistician would consider the following:
3-60
Chapter 3
Describing Data
• Look and Think: Look at the data and visualize how they were collected and measured. Maybe the data values were rounded off?
• Sorting (Example: Price/Earnings Ratios): Sort the data. Without fancy calculations, you can see the range and get an idea of typical values. Note that these surely are rounded (price/earnings would not be exactly an integer).
3-61
Chapter 3
Visualizing Data
To visualize small integer data sets we can use a stem-and-leaf plot. It is basically a
frequency tally, except that we write digits instead of tally marks. For two-digit
integer data, the stem is the tens digit of the data, and the leaf is the ones digit. For
the 44 P/E ratios, the stem-and-leaf plot is:
Use equally spaced stems (even if some stems are empty). The stem-and-leaf can reveal center (24 P/E
ratios were in the 10–19 stem) as well as variability (the range is from 7 to 59) and shape (right-skewed,
mode in the 2nd stem). In this illustration, the leaf digits have been sorted, although this is not necessary.
An advantage of the stem-and-leaf is that we can retrieve the raw data. For example, the data values in
the fourth stem are 31, 37, 37, 38.
Caution: Teachers like it, but you rarely see this display in business because it
only works for simple integer data (at least, without heroic modifications).
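To see how the tally is built, here is a small Python sketch for two-digit integer data. The function `stem_and_leaf` is hypothetical and the twelve values are made up (the 44 P/E ratios themselves are not listed on the slide); only the range 7 to 59 and the fourth-stem values 31, 37, 37, 38 echo the text.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Return stem-and-leaf lines: tens digit as stem, ones digits as leaves."""
    leaves = defaultdict(list)
    for v in sorted(values):                 # sorting makes the leaves ordered
        leaves[v // 10].append(v % 10)
    lines = []
    # Equally spaced stems, printed even when a stem has no leaves.
    for stem in range(min(leaves), max(leaves) + 1):
        digits = "".join(str(d) for d in leaves.get(stem, []))
        lines.append(f"{stem} | {digits}")
    return lines

data = [7, 12, 15, 15, 18, 21, 24, 31, 37, 37, 38, 59]   # illustrative values
for line in stem_and_leaf(data):
    print(line)
```

Reading a row back (stem 3 with leaves 1, 7, 7, 8) recovers the raw values 31, 37, 37, 38.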
3-62
Chapter 3
Stem-and-Leaf
Dot plots
- are easy to understand.
- reveal center, variability, and shape of the distribution.
Steps in Making a Dot Plot
1. Make a scale that covers the data range.
2. Mark the axes and label them.
3. Plot each data value as a dot above the scale at its approximate location.
Note: If more than one data value lies at about the same
axis location, the dots are stacked vertically.
3-63
Chapter 3
Dot Plot
• The range is from 7 to 59.
• All but a few data values lie between 10 and 25.
• A typical “middle” data value would be around 17 or 18.
• The data are not symmetric due to a few large P/E ratios.
Caution: Dot plots work best for integers and small samples. Avoid
dot plots if n is large or if you have decimal data.
3-64
Chapter 3
Dot Plot: Example
Bins and Bin Limits
Chapter 3
Frequency Distributions
• A frequency distribution is a table formed by classifying n data values into k classes (bins).
• Bin limits define the values to be included in each bin. Widths must all be the same except when we have open-ended bins.
• Frequencies are the number of observations within each bin.
• Frequencies are often expressed as relative frequencies (frequency divided by the total) or percentages (relative frequency times 100).
3-65
What is the ideal number of bins (k)
to classify n data values?
Herbert Sturges proposed adding bins
at a declining rate as n increases:
k = 1 + log₂(n) or k = 1 + 3.3 log₁₀(n)
The Excel formula for k is =1+log(n)/log(2). Add one bin when n doubles. This is
only a guideline. Use more or fewer bins to make “nice” bin limits.
3-66
Chapter 3
How Many Bins?
Sturges suggests: k = 1 + 3.3 log₁₀(n)
k = 1 + 3.3 log₁₀(44)
k = 1 + 3.3(1.64345)
k = 6.42
so 6 or 7 bins seems reasonable
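The arithmetic is easy to verify in Python. This is a small sketch; the function names are hypothetical, and the two versions differ slightly because 3.3 is a rounded value of 1/log₁₀(2) ≈ 3.32.

```python
import math

def sturges_bins(n: int) -> float:
    """Sturges' guideline using base-2 logs: k = 1 + log2(n)."""
    return 1 + math.log2(n)

def sturges_bins_log10(n: int) -> float:
    """The same guideline with the rounded constant: k = 1 + 3.3 log10(n)."""
    return 1 + 3.3 * math.log10(n)

n = 44
print(round(sturges_bins(n), 2), round(sturges_bins_log10(n), 2))
print(sturges_bins(2 * n) - sturges_bins(n))   # doubling n adds one bin
```

Either version points to 6 or 7 bins for n = 44; the final choice still depends on getting "nice" bin limits.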
3-67
Chapter 3
Example: n = 44 P/E ratios:
A histogram is a bar chart whose Y-axis shows the frequency within each bin,
and whose X-axis ticks show end points of each bin.
Consider 3 histograms for the P/E ratio data with different bin widths. In what
ways do they differ? In what ways are they similar?
3-68
Chapter 3
Histograms
Prototype distribution shapes
3-69
Chapter 3
Shape
• A frequency polygon connects midpoints of the histogram intervals, with
extra intervals at the beginning and end so that the line will touch the X-axis.
Attractive when you need to compare data sets (since more than one
polygon can be plotted on the same scale).
• An ogive is a line graph of the cumulative frequencies. It is useful for
finding percentiles or in comparing the shape of the sample with a known
benchmark such as the normal distribution.
Examples for P/E Data Using 6 Bins
3-70
Chapter 3
Frequency Polygons and Ogives
Examples Using 11 Bins
3-71
Chapter 3
Frequency Polygons and Ogives
Scatter plots can convey patterns in (x, y) data pairs
that would not be apparent from a table.
3-72
Chapter 3
Scatter Plots
Example: Miles per gallon vs weight for 93 cars.
3-73
Chapter 3
Scatter Plots
Tips for effective tables:
1. Keep the table simple, consistent with its purpose.
2. Put summary tables in the main body of the written report.
3. Put detailed tables in an appendix (or insert a hyperlink).
4. Display the data to be compared in columns rather than rows.
5. For presentation, round off to three or four significant digits.
6. Physical table layout should guide the eye toward the comparison you wish to emphasize.
7. Row and column headings should be simple yet descriptive.
8. Within a column, use a consistent number of decimal digits.
Chapter 3
Effective Tables
Log Scales
Chapter 3
Line Charts
• Arithmetic scale – distances on the Y-axis are proportional to the magnitude of the variable being displayed.
• Logarithmic scale – (ratio scale) equal distances represent equal ratios.
• Use a log scale for the vertical axis when data vary over a wide range, say, by more than an order of magnitude. This will reveal more detail for smaller data values.
3-75
Log Scales
A log scale is useful for time series data that might be expected to grow at a compound
annual percentage rate (e.g., GDP, the national debt, or your future income). It reveals
whether the quantity is growing at an
- increasing percent (concave upward), or
- constant percent (straight line), or
- declining percent (concave downward).
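Why constant-percent growth plots as a straight line on a log scale can be checked numerically. The sketch below uses made-up values (a starting level of 100 growing 5 percent per year for 10 years) and shows that the logged series rises by the same amount every year.

```python
import math

start, rate, years = 100.0, 0.05, 10           # illustrative values only
values = [start * (1 + rate) ** t for t in range(years + 1)]

logs = [math.log10(v) for v in values]         # what a log-scale axis displays
steps = [b - a for a, b in zip(logs, logs[1:])]

# Every yearly step equals log10(1.05), so the plotted points lie on a line.
print([round(s, 6) for s in steps])
```

If the growth rate were rising over time, the steps would get larger (concave upward); if it were falling, they would shrink (concave downward).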
3-76
both growing at a constant percent?
Chapter 3
Line Charts
Error 1: Nonzero Origin
Error 2: Elastic Graph Proportions
Error 3: Dramatic Title and Distracting Pictures
Error 4: 3D and Novelty Graphs
Error 5: Rotated Graphs
Error 6: Unclear Definitions or Scales
Error 7: Vague Sources
Error 8: Complex Graphs
Error 9: Gratuitous Effects
Error 10: Estimated Data
Error 11: Area Trick
3-77
Chapter 3
Deceptive Graphs
Chapter 3
Deceptive Graphs
Error 1: Nonzero Origin
A nonzero origin will exaggerate the trend.
Deceptive
3-78
Objective
Error 4: 3-D and Novelty Graphs
• 3D is acceptable (e.g., 3D column) but harder to read data values.
• Avoid novelty charts (e.g., pyramid). They distort the data.
3-79
Chapter 3
Deceptive Graphs
Error 5: 3-D and Rotated Graphs
Trends may appear to dwindle into the distance or loom towards you. Harder
to read data values. Label each data value if there is room.
3-80
Chapter 3
Deceptive Graphs
Error 8: Complex Graphs
• Keep your main objective in mind.
• Break graph into smaller parts if necessary.
• Use clear labels and descriptive titles.
Chapter 3
Deceptive Graphs
Assignments
ML 1.6
• Connect C-1 (covers chapters 2-3)
- You get three tries
- Connect gives you feedback
- Printable if you wish
- Deadline is midnight each Monday
• Project P-1 (data, tasks, questions)
- Review instructions
- Look at the data
- Your task is to write a nice, readable report (not a spreadsheet)
- Paste Excel graphs and tables into your Word document
- Length is up to you
Projects: General Instructions
General Instructions
For each team project, submit a short (5-10 page) report (using Microsoft Word
or equivalent) that answers the questions posed. Strive for effective writing (see
textbook Appendix I). Creativity and initiative will be rewarded. Avoid careless
spelling and grammar. Paste graphs and computer tables or output into your
written report. It may be easier to format tables in Excel and then use Paste
Special > Picture to avoid weird formatting and permit sizing within Word.
Allocate tasks among team members as you see fit, but all should review and
proofread the report (submit only one report).
0-83
Project P-1
Random teams are assigned on Moodle (submit only one report).
Data: Download from Moodle or from the instructor’s web page. Your team is assigned one crime category
(but you can change it if you wish). Copy the city names and the chosen crime data
column to a new spreadsheet. Delete lines (if any) with missing data.
Analysis:
(a) Sort the observations (with city names).
(b) List the top 10 and bottom 10 data values (with city names).
(c) For the entire data set, calculate the mean and median. What do they tell you about center? Would the mode be helpful for this type of data? Explain.
(d) Calculate the standard deviation.
(e) Calculate the standardized z-value for each observation.
(f) Are there outliers or unusual data values (see p. 137)? Discuss.
(g) Use MegaStat (or Minitab or Excel) to make a histogram. Describe its shape.
(h) Calculate the quartiles. Make a boxplot and describe it.
(i) Make a scatter plot of your kind of crime versus a different type of crime. What does it show?
(j) Ambitious students: Sort the database in random order (see bottom of page 36) using Excel’s function =RAND(). Copy and paste the first few sorted lines into your report to illustrate your sorting method.
Comment on anything unusual (or interesting things that you might find on the web).
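The project is meant to be done in Excel or MegaStat, but the first few calculations can be cross-checked with a short Python sketch. It uses the "All Violent" rates for the first ten metropolitan areas in the 2010 data excerpt shown later in this deck; a real analysis would use all 365. Quartile conventions differ between packages, so the quartiles from `statistics.quantiles` may not match MegaStat or Excel exactly.

```python
import statistics

# "All Violent" crimes per 100,000 for the first ten areas in the 2010 excerpt.
violent = {
    "Abilene, TX": 423.0, "Akron, OH": 304.7, "Albany, GA": 566.0,
    "Albany-Schenectady-Troy, NY": 310.4, "Albuquerque, NM": 670.4,
    "Alexandria, LA": 638.0, "Allentown-Bethlehem-Easton, PA-NJ": 228.2,
    "Altoona, PA": 243.6, "Amarillo, TX": 513.1, "Ames, IA": 299.5,
}

# (a)-(b) Sort with city names, highest first.
ranked = sorted(violent.items(), key=lambda kv: kv[1], reverse=True)

rates = list(violent.values())
mean = statistics.mean(rates)                   # (c) center
median = statistics.median(rates)
sd = statistics.stdev(rates)                    # (d) sample standard deviation
q1, q2, q3 = statistics.quantiles(rates, n=4)   # (h) quartiles (default method)

print(ranked[0], ranked[-1])
print(round(mean, 2), round(median, 2), round(sd, 2), (q1, q2, q3))
```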
Watch the video walkthrough using Voting,
North Carolina Births, and CEO compensation
as examples (posted on Moodle)
0-84
Project P-1
Your 2010 data will look like this (2005 and 2000 are also available)
Crime Rates in U.S. Metropolitan Areas, 2010 (n = 365)
Violent Crimes Per 100,000: All Violent, Murder, Rape, Robbery, Assault
Property Crimes Per 100,000: All Property, Burglary, Larceny, Car Theft
Metropolitan Statistical Area | All Violent | Murder | Rape | Robbery | Assault | All Property | Burglary | Larceny | Car Theft
Abilene, TX M.S.A. | 423.0 | 3.1 | 48.9 | 72.7 | 298.3 | 3617.3 | 1009.0 | 2459.8 | 148.5
Akron, OH M.S.A. | 304.7 | 3.7 | 40.9 | 105.1 | 155.0 | 3185.6 | 947.7 | 2074.5 | 163.3
Albany, GA M.S.A. | 566.0 | 8.7 | 24.9 | 150.4 | 382.1 | 4512.6 | 1417.8 | 2803.4 | 291.4
Albany-Schenectady-Troy, NY M.S.A. | 310.4 | 1.5 | 21.0 | 98.5 | 189.4 | 2693.6 | 512.1 | 2076.2 | 105.4
Albuquerque, NM M.S.A. | 670.4 | 5.8 | 44.8 | 124.3 | 495.6 | 3896.1 | 920.6 | 2586.2 | 389.4
Alexandria, LA M.S.A. | 638.0 | 5.8 | 23.1 | 132.3 | 476.7 | 4592.9 | 1203.3 | 3176.3 | 213.3
Allentown-Bethlehem-Easton, PA-NJ M.S.A. | 228.2 | 3.5 | 20.3 | 93.6 | 110.9 | 2298.0 | 432.2 | 1758.1 | 107.7
Altoona, PA M.S.A. | 243.6 | 0.8 | 38.0 | 49.8 | 155.0 | 1811.7 | 425.4 | 1318.2 | 68.0
Amarillo, TX M.S.A. | 513.1 | 5.7 | 40.8 | 98.9 | 367.8 | 4812.7 | 1137.2 | 3390.5 | 285.0
Ames, IA M.S.A. | 299.5 | 1.1 | 41.7 | 12.4 | 244.4 | 2528.1 | 478.6 | 1966.1 | 83.3
Anchorage, AK M.S.A. | 812.9 | 4.2 | 85.9 | 148.5 | 574.4 | 3506.3 | 416.1 | 2813.4 | 276.8
Anderson, IN M.S.A. | 205.8 | 2.3 | 33.4 | 70.6 | 99.5 | 3353.8 | 848.1 | 2294.6 | 211.1
Anderson, SC M.S.A. | 586.0 | 5.3 | 36.4 | 75.9 | 468.4 | 4707.8 | 1297.6 | 3041.7 | 368.4
Ann Arbor, MI M.S.A. | 338.5 | 1.4 | 43.2 | 69.8 | 224.0 | 2713.7 | 659.7 | 1879.5 | 174.4
Appleton, WI M.S.A. | 155.8 | 0.0 | 21.4 | 13.8 | 120.5 | 2136.7 | 378.5 | 1708.2 | 50.0
Asheville, NC M.S.A. | 229.7 | 1.9 | 21.8 | 59.9 | 146.1 | 2454.9 | 749.6 | 1534.9 | 170.3
Athens-Clarke County, GA M.S.A. | 374.9 | 4.2 | 19.6 | 70.5 | 280.5 | 3843.7 | 1018.0 | 2588.1 | 237.5
Atlanta-Sandy Springs-Marietta, GA M.S.A. | 413.8 | 6.1 | 20.9 | 149.7 | 237.1 | 3462.6 | 957.0 | 2135.7 | 370.0
Atlantic City-Hammonton, NJ M.S.A. | 529.8 | 8.0 | 18.9 | 245.5 | 257.5 | 3550.3 | 741.5 | 2685.7 | 123.1
Augusta-Richmond County, GA-SC M.S.A. | 412.9 | 10.2 | 37.4 | 156.6 | 208.7 | 4815.3 | 1355.1 | 3037.7 | 422.5
Austin-Round Rock-San Marcos, TX M.S.A. | 327.9 | 3.4 | 24.7 | 84.0 | 215.8 | 3792.0 | 754.3 | 2866.9 | 170.8
Bakersfield-Delano, CA M.S.A. | 593.0 | 9.0 | 19.9 | 148.4 | 415.7 | 3713.1 | 1148.0 | 1931.6 | 633.6
Baltimore-Towson, MD M.S.A. | 685.3 | 10.3 | 23.6 | 214.4 | 437.0 | 3090.7 | 649.5 | 2135.5 | 305.7
Bangor, ME M.S.A. | 68.4 | 2.0 | 12.6 | 27.2 | 26.6 | 3098.2 | 573.3 | 2429.3 | 95.7
Barnstable Town, MA M.S.A. | 434.6 | 0.5 | 36.1 | 57.6 | 340.3 | 2972.8 | 1116.6 | 1764.7 | 91.5
Battle Creek, MI M.S.A. | 697.6 | 4.5 | 75.3 | 109.6 | 508.3 | 3703.5 | 1145.6 | 2411.1 | 146.8
Bay City, MI M.S.A. | 335.2 | 0.9 | 78.1 | 50.8 | 205.2 | 2472.4 | 610.1 | 1776.6 | 85.7
Beaumont-Port Arthur, TX M.S.A. | 498.3 | 5.6 | 37.7 | 157.9 | 297.0 | 3865.3 | 1156.9 | 2488.4 | 220.1
Bellingham, WA M.S.A. | 267.0 | 2.5 | 44.7 | 50.6 | 169.1 | 3197.8 | 694.2 | 2372.7 | 130.8
Bend, OR M.S.A. | 304.9 | 4.3 | 29.0 | 30.9 | 240.7 | 2973.7 | 497.5 | 2360.2 | 116.0
Definitions
Violent crime
Murder and nonnegligent manslaughter
Forcible rape
Robbery
Aggravated assault
Property crime
Burglary
Larceny-theft
Motor vehicle theft
Example: CEO Compensation
sorting is a good first step
0-86
Example: CEO Compensation
Highlight all data (including the
headings) and use Custom Sort
0-87
Example: CEO Compensation
now you can clearly see the high and low data
values (and comment on any weird data values)
0-88
Example: CEO Compensation
use MegaStat’s Descriptive
Statistics to get your basic stats
along with a nice boxplot
0-89
Example: CEO Compensation
severely skewed
use MegaStat’s Frequency Distributions to get a
frequency table, histogram, etc
annotated by user
normal if logs used?
0-90
Example: CEO Compensation
standardize the sorted list by subtracting the mean from each x value and then
dividing by the standard deviation (or use =STANDARDIZE function)
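The standardizing step amounts to z = (x − mean) / standard deviation for each value. Here is a minimal Python sketch of that calculation; the helper `standardize` is hypothetical, and because the CEO figures are not all listed here, it uses the murder rates of the first ten metropolitan areas in the P-1 data excerpt as stand-in data.

```python
import statistics

def standardize(values):
    """Return z-values: subtract the mean, divide by the sample std deviation."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return [(x - mean) / sd for x in values]

# Stand-in data: murders per 100,000 for ten areas from the P-1 excerpt.
murder = sorted([3.1, 3.7, 8.7, 1.5, 5.8, 5.8, 3.5, 0.8, 5.7, 1.1])
z = standardize(murder)

for x, zx in zip(murder, z):
    print(f"{x:5.1f}  z = {zx:+.2f}")
```

Because the list is sorted first, the z-values come out sorted too, so any unusually large or small values sit at the ends of the list.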
0-91
Example: CEO Compensation
after standardizing the sorted list, unusual z values can be seen
0-92
Example: CEO Compensation
to randomize the list, paste values of
=RAND() beside data and custom
sort on =RAND()
0-93