Relationships between Variables

• Two variables are related if they move
together in some way
• Relationship between two variables can be
strong, weak or none at all
• A strong relationship means that knowing the
value of one variable tells us a lot about the
value of the other
Example
• A catalog mailer has tested mailings of two
different catalogs (A and B)
• Which customers, old or new, buy more from
which catalog, A or B?
• To answer the question, the analyst pulls a sample of
100 names
• The two variables, customer type and percentage
buying Catalog A, are plotted on a graph
• Steep lines indicate strong relationships and flat
lines indicate lack of relationships
Correlation Analysis
• Correlations can be calculated for Categorical
variables and Scalar variables
• For the former the values range from 0 to 1 and
for the latter from –1 to 1
• For Scalar variables, correlations indicate both
direction and degree
• Positive Correlation (for scalar var): Tendency for
a high value of one variable to be associated with
a high value in the second
Sample Correlation (r)
• Measure is based on a sample
• Reflects tendency for points to cluster systematically
about a straight line on a scatter diagram
- rising from left to right means positive association
- falling from left to right means negative association
• r lies in the range -1 ≤ r ≤ +1
• r = 0 means absence of linear association
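As a sketch, r can be computed directly from its definition; the data points below are hypothetical, chosen to cluster about a rising line:

```python
import math

# Hypothetical sample of (x, y) points lying near a rising line,
# used to illustrate the sample correlation coefficient r.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 6]

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n

# Numerator: sum of cross-products of deviations from the means
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
# Denominator: square root of the product of the sums of squares
sxx = sum((x - mean_x) ** 2 for x in xs)
syy = sum((y - mean_y) ** 2 for y in ys)

r = sxy / math.sqrt(sxx * syy)
print(round(r, 4))  # 0.8528: a fairly strong positive association
```

A value near +1 here reflects points rising from left to right; points falling from left to right would give a value near -1.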
Correlation Coefficient in Practice
• Issues to consider
- Is the relationship linear (do the points lie
near a straight line)?
- Is the relationship between the
variables significant?
Simple Regression
• Moving from association between two variables to
predicting value of one from the value of the other
• Variable to be predicted is Dependent variable (Y) and
variable used to make prediction is Independent
variable (X)
• Output of regression permits us to:
1. Explain why the values of Y vary as they do
2. Predict Y based on the known values of X
Idea behind Simple Regression
• Cataloger wants to know if there is relation
between time a customer is on file and sales
• Define variables:
- Independent var X (Length of time) is number
of months since first purchase
- Dependent var Y is dollar sales within last
month
• Draw Scatter plot, draw line through the points
and calculate slope of the line
• Eye-fitted regression line is Y=10 + 1*X
Fitting the Simple Regression line
• Goal is to minimize some measure of variation
between Actual observations and Fitted
observations
• This variation is called Residual
Residual = Actual - Fit
• The measure of variation is called Residual Sum of
Squares
• Most common fitting rule called Least-Squares
minimizes the Residual Sum of Squares
• The equation for simple regression is Y = b0 + b1*X
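The least-squares rule has a closed-form solution for simple regression; a minimal sketch, using made-up data:

```python
# Least-squares fit for simple regression (hypothetical data).
# Goal: choose b0, b1 minimizing the Residual Sum of Squares,
# RSS = sum((Actual - Fit)^2).
xs = [1, 2, 3, 4, 5]
ys = [12, 11, 14, 13, 16]

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n

# Closed-form least-squares solution:
# b1 = sum((Xi - Xbar)(Yi - Ybar)) / sum((Xi - Xbar)^2)
# b0 = Ybar - b1 * Xbar
b1 = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
      / sum((x - mean_x) ** 2 for x in xs))
b0 = mean_y - b1 * mean_x

fits = [b0 + b1 * x for x in xs]
residuals = [y - f for y, f in zip(ys, fits)]  # Residual = Actual - Fit
rss = sum(e ** 2 for e in residuals)
print(round(b0, 2), round(b1, 2), round(rss, 2))  # 10.2 1.0 4.8
```

Any other choice of b0 and b1 would give a larger RSS for these points; that is what "least squares" means.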
Simple Regression in Practice
1. Turn observations into data (variables)
2. Assess whether the relationship between X and Y is
linear
3. Straighten out the relationship if needed
4. Perform the regression analysis using any
standard computer program
5. Interpret the findings
Example
• Do customers who buy more frequently also
buy bigger ticket items?
Step 1. Transform into data as follows:
- Independent var (X) is number of
purchases in last 12 months
- Dependent var (Y) is largest dollar
item (LDI) amount
Example (cont)
Step 2. Draw Scatter plot to check for linearity
Step 3. No straightening out needed
Step 4. Regression output is
Variation of Y: Variance = 792.94
Total sum of squares = 6343.55
Correlation coefficient: r = 0.97254
Intercept: b0 = -18.22
Regression coefficient: b1= 10 with p= .001
Regression equation is Y = -18.22 + 10X
Example (cont)
Step 5. (i) Large positive value of r indicates
strong positive relation between X and Y.
(ii) This supports our hypothesis that large sales
are associated with frequent purchases
(iii) The r squared statistic may be the most
important item in the regression output. Also called
the Coefficient of Determination. 0 ≤ r² ≤ 1
(iv) Here r squared is .946
(v) Thus 94.6% of variation in Y is explained
by X
(vi) The p value indicates the significance of b1
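The reported figures can be cross-checked; this sketch uses only the values quoted above (r = 0.97254, b0 = -18.22, b1 = 10), with a hypothetical customer for the prediction:

```python
# Cross-check of the regression output reported in the example.
r = 0.97254
r_squared = r ** 2          # Coefficient of Determination
print(round(r_squared, 3))  # 0.946: 94.6% of the variation in Y explained by X

# Prediction from the fitted equation Y = -18.22 + 10*X,
# e.g. for a hypothetical customer with 6 purchases in the last 12 months:
b0, b1 = -18.22, 10
y_hat = b0 + b1 * 6
print(round(y_hat, 2))  # 41.78: predicted largest-dollar-item (LDI) amount
```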
Simple Correlation Co-efficient
• Some formulae:
- Cov(x, y) = Σ(Xi - X̄)(Yi - Ȳ) / (n - 1)
- rxy = [1/(n - 1)] × Σ [(Xi - X̄)/Sx] × [(Yi - Ȳ)/Sy]
- rxy = Cov(x, y) / (Sx × Sy)
Computation of Correlation Coefficient
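A sketch of that computation, applying two equivalent formulas for rxy (via the covariance, and via standardized deviations) to a small hypothetical sample and confirming they agree:

```python
import math

# Hypothetical sample used to evaluate r_xy two equivalent ways.
xs = [2, 4, 6, 8]
ys = [1, 3, 2, 5]

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n

# Sample covariance and sample standard deviations (n - 1 in the denominator)
cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)
s_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs) / (n - 1))
s_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys) / (n - 1))

# Formula 1: r_xy = Cov(x, y) / (Sx * Sy)
r1 = cov_xy / (s_x * s_y)
# Formula 2: average product of standardized deviations
r2 = sum(((x - mean_x) / s_x) * ((y - mean_y) / s_y)
         for x, y in zip(xs, ys)) / (n - 1)

print(round(r1, 4), round(r2, 4))  # 0.8315 0.8315 - both formulas agree
```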