
Applications for Management

Session 10:
Be able to download data and put the data sets together; go over the steps from last time.
European Social Survey: understand the variables and structure, which round number and which countries are used, and how to create dummy variables given the data set.
DUMMY Variables:
- Example: male or female. We want to take whether an employee is male or female into account when thinking about the relationship between experience (x) and income (y). If females do have lower income than males of equal experience, then by ignoring (OMITTING) sex we would inflate the size of the errors. Sex is not an ordinal variable but a categorical one: it cannot be included in the regression as-is, so it is coded as a dummy.
- If we estimate an equation of that kind (wage = b0 + b1·experience + b2·female + ε), b2 tells us how much, on average, the wage differs between the two groups; interaction effects are covered below under slope dummies.
- If on average men earn more than women, b2 will be negative, because it takes the intercept and shifts the line downwards. If b2 is close to zero or not significant: with this data we have no evidence that women earn less than men. If b2 is close to zero but significant: women do earn less, but not by very much.
Interpreting coefficients:
- A dummy covariate identifies two groups of respondents: those on which we observe 1 and those on which we observe 0.
- Its coefficient is the difference between the average (mean) values of the response in the two groups when all other covariates are kept constant and fixed (controlling for all other covariates).
- E.g. the coefficient on the Female dummy is the difference we expect to observe between the salary of a man and the salary of a woman, controlling for experience.
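A minimal Stata sketch of such an intercept-dummy regression, assuming variables named wage, exper and female (names illustrative, following the example above):
reg wage exper female
* the coefficient on female is the average wage difference between women and men, controlling for exper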
Simpson's paradox (with vs. without the female dummy):
- a trend that appears in different groups of data disappears when these groups are combined, and the reverse trend appears for the aggregate data
- On average, when controlling for Education and Tenure, female employees earn less than male employees.
- The average difference between the hourly pay of a woman and the hourly pay of a man is -1.788 (women earn 1.788 $/h less).
- The model with Female as an additional covariate has a higher Adjusted R Square and a smaller Standard Error of the Estimate.
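A hedged sketch of that comparison in Stata, assuming the wage example variables (wage, educ, tenure, female):
reg wage educ tenure
reg wage educ tenure female
* compare Adjusted R-squared and Root MSE between the two models; the one with female should fit better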
Slope dummy:
- Up to now we have assumed that the impact of a covariate on the response is the same regardless of the values taken by the other covariates.
- Two covariates may interact, i.e. we may have a different effect for certain combinations. E.g. does experience play the same role for males and for females?
- A slope dummy variable is given by the interaction between a dummy and a numeric variable. It takes the value zero in some rows (when the dummy = 0) and the value of the numeric independent variable elsewhere (when the dummy = 1).
- Slope dummies are used when the data are not well modeled by two equally sloped lines but are better fit by two lines with different slopes (one per dummy category).
- Interactions are modeled by including as an additional covariate the product of the two interacting variables.
(Separate fitted equations for males and for females.)
- The interaction term is significant: an additional year of working experience with the current employer has a different impact for males and for females. It increases males' wage by 0.2019 $/h and females' wage by 0.0656 $/h (= 0.2019 - 0.1363), controlling for other covariates.
Slope dummy: generated by multiplying the dummy by the numeric covariate (see the sketch below).
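A minimal sketch, assuming the wage example variables (female, tenure); the interaction variable name is illustrative:
gen female_tenure = female*tenure
reg wage educ tenure female female_tenure
* the coefficient on female_tenure is the difference in the tenure slope between women and men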
Dummy variable trap:
- Multicollinearity frequently arises with dummy variables. The easiest solution is simply to omit one category when including dummy variables; this avoids the so-called "dummy variable trap". E.g. we did not introduce both a dummy for female and a dummy for male.
ANOVA and multiple regression:
- The ANOVA model reduces to a multiple regression model having the factor as its unique covariate.
Multinomial Categorical Covariate (k levels): k-1 dummy variables
- A categorical variable taking k values can be included as a covariate by representing it by means of a set of (k-1) dummy variables, each being the indicator of a given level of the variable.
- The category left out is referred to as the reference category or control group.
- The coefficient of each dummy is interpreted as the difference we would expect to observe between the value of the response for an individual on which the category labelled by the dummy is observed and the value of the response for an individual on which the reference category is observed.
Wage example
- wage: salary (in dollars per hour)
- educ: years of education
- tenure: number of years in the firm
- female: dummy variable (1 = female, 0 = not female)
- geo: multinomial categorical covariate (1 = North and Central US, 2 = South, 3 = West)
'North and Central US' is taken as the reference category (the 'zero'). Two dummies are created:
- South: = 1 if the employee lives in the South
- West: = 1 if the employee lives in the West US
The coefficient on West is the difference in wage we expect to observe between an employee living in the West and an employee living in North-Central US.
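A hedged sketch of this regression, assuming the two geo dummies have already been generated and are named south and west:
reg wage educ tenure female south west
* each geo coefficient is the expected wage difference relative to the reference category (North-Central US)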
Interpretation of coefficients: Recall Heteroscedasticity – Remedies
- Sometimes we need to take the log transformation of the dependent variable to regain efficiency: the log transformation smoothes out the excess variability of the observations.
- STATA -> gen ln_wage=ln(wage)
- When the response variable in a multiple linear regression model is the logarithm of a variable (as with ln_wage), each covariate coefficient estimate is the average change in the logarithm of the response variable associated with a unit change in the covariate, controlling for all other covariates. Multiplied by 100, it therefore quantifies the average percentage change in the original response variable associated with a unit increase in the covariate when all other covariates are kept constant.
Logarithms in Regression Analysis: interpretation of the coefficient(s)
Standard simple linear regression model: Y = β0+ β1 X+ε
o As X increases by 1 unit, Y changes on average by β1 units
The semi-log model is: Ln(Y) = β0+ β1 X+ε
o As X increases by 1 unit, Y changes by (100⋅β1)%
The log-log model is: Ln(Y) = β0+ β1 ln(X)+ε
o As X increases by 1%, Y changes by β1%
o Remark: when we suspect that heteroscedasticity is due to one (or more) covariates, we can also apply a log transformation to the covariate(s).
Regression of ln(wage) on Education and Tenure
- The Education coefficient estimate is 0.087: a one-unit increase in the number of years of Education on average leads to an 8.7% increase (100·b1 %) in wage, when controlling for Tenure.
- The Tenure coefficient estimate is 0.026: one additional year of experience with the current employer on average leads to a 2.6% increase (100·b2 %) in wage, when controlling for Education.
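A minimal Stata sketch of this semi-log regression, using the wage example variables:
gen ln_wage = ln(wage)
reg ln_wage educ tenure
* multiply each coefficient by 100 to read it as an approximate percentage change in wage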
Regression of ln(wage) on Education, Tenure, Gender and Geo
The wage of employees living in the West is on average 13.9% higher than the wage of
employees living in North and Central US.
Research paper: it needs to look like a research paper, with data, background and conclusion sections.
Professional writing: do not assume the reader knows the ESS; start from the beginning, e.g. still explain that the ESS is a biennial survey…
NEW HELP!
- Click File > Example datasets… > Example datasets installed with Stata > use auto.dta,
or load a data file by command:
command: sysuse dataset_name (opens the data file)
command: sysuse dataset_name, clear (clears the data currently in memory first)
Start a log file (to record the output)
command: log using "Log", replace
Browse data
you can browse the data by entering the command browse or by clicking the Data Editor (Browse) icon in the toolbar.
Describe data
- describes the variables; does not show min, max, etc.
command: describe
or
command: codebook variable1 variable2 variable3
Replace variable: replace variable=10 if variable>2
Find missing data: missing cases are shown as dots
-
command: tab variable, missing
-
command: tab variable, missing nolab (No labels of categorical variables but numbers instead)
How to deal with missing data: should a variable with a high proportion (higher than 5.8%) of missing values be dropped from the analyses?
- Omit the case from all analyses (listwise deletion).
- Impute (estimate) missing values. Techniques include mean substitution or estimation by regression. This provides a full set of data, but results may be invalid if our estimates are wrong. Results of analyses with and without imputation should be compared.
command: drop if dependent_variable==.
Data screening: univariate: impossible or improbable values; recover the correct value if possible.
Extreme values (outliers)
- inspect using a boxplot, or flag values beyond +/- 3.5 standard deviations from the sample mean
command: graph box dependent_variable
Normality of distribution of metric variables
- as the sample size increases, the distribution will usually approach normal; skewness must be between -1 and 1
Summarize
- only for continuous variables: command: sum variable or command: sum variable, detail
Frequency bar graph: command: graph bar (count), over(variable_to_be_counted)
Histogram (continuous variables)
- command: histogram variable, norm
Two-way frequency table for 2 categorical variables
- depends on which frequencies you want to show:
command: tabulate variable1 variable2, cell
command: tabulate variable1 variable2, row
command: tabulate variable1 variable2, column
Two-way table for CATEGORICAL – CONTINUOUS
command: tabulate categorical_variable, sum(continuous_variable)
command: bys categorical_variable: summarize continuous_variable
- this calculates summary statistics for the continuous variable separately for each category of the categorical variable.
If clause:
- can be used to set conditions for many things
summary table: summarize price if length==200, detail
scatter plot: scatter price length if foreign==0
missing values:
count if foreign>0
count if foreign>0 & foreign!=. (excludes missing values, which Stata treats as larger than any number)
Create new variable in stata
command: generate variable
ex: generate price2 = price^2
creates a new variable called "price2" in your dataset, where each value in the "price2" variable is the
result of squaring the corresponding value in the "price" variable.
ex: generate young=(agea<25) if agea!=.
creates a dummy variable equal to 1 when age is lower than 25 and 0 otherwise, leaving it missing when agea is missing
Delete variables or observations from the dataset
command: drop variable
ex: drop if cntry != "United Kingdom" & cntry != "Italy" & cntry != "Denmark"
ex: drop if Distance==.
Keep only some observations in the dataset (using the "or" operator |)
command: keep variable1 variable2 variable3 ...
ex: keep if cntry == "United Kingdom" | cntry=="Italy" | cntry=="Denmark"
Rescaling variables (e.g. to a 0–10 range)
command: summarize variable (stores r(min) and r(max))
command: replace variable = ( variable - r(min) ) / ( r(max) - r(min) ) * 10
check that it rescaled correctly by looking at the new min and max:
command: summarize variable
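A minimal worked sketch, assuming a variable named trstplc (name illustrative) to be rescaled to 0–10:
summarize trstplc
replace trstplc = (trstplc - r(min)) / (r(max) - r(min)) * 10
summarize trstplc
* min should now be 0 and max should be 10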
Summary table of variables: mean, standard deviation, min, max, skewness and kurtosis
- kurtosis lower than that of the normal distribution means the curve of the subgroup (by level of the categorical variable) is flatter than the normal distribution; kurtosis higher than normal means the curve is more peaked/concentrated around the mean
command: tabstat dependent_variable, by(independent_variable) s(mean sd min max skewness kurtosis)
ANOVA:
- If there is only one categorical variable, the analysis is referred to as one-way ANOVA.
- If there are n factors, the analysis is called n-way ANOVA.
One-way ANOVA
- used when you want to understand the influence of ONE categorical variable (ONE factor) on a numerical variable
- if the p-value (Prob > F) is smaller than alpha, then reject the null hypothesis; the null hypothesis states that all group means are equal
- rejecting it means that at least one mean differs, so the categorical variable has an effect on the dependent variable
command: oneway dependent_variable independent_variable
SCHEFFE TEST:
- checks which mean from which sub-sample is different
- if the p-value is less than alpha, reject the null hypothesis, meaning that that pair of means differs
- in this case the mean of package 4 differs from all the rest, while the other means are equal
- in the output, always look at the values below the mean differences in each row (those are the p-values)
command: oneway dependent_variable independent_variable, scheffe
REGRESSION:
Regression model coefficients and significance of coefficients (simple linear regression)
- if the p-value is smaller than alpha, we reject the null hypothesis, which states that the coefficient (or constant) is equal to 0 —> the coefficient/constant is significant
- if the confidence interval is not too wide and does not include 0 —> the coefficient/constant is significant
- p-value and confidence interval do not always match up, but they often do
- R squared shows the proportion of variability explained by the model
- the higher the better
- the more variables, the higher it gets
command: reg dependent_variable independent_variable
- reg is used to understand the dependency between two variables
- (reading the fit plot:) on the left the confidence interval is slightly larger, and a few points are not near the line – outliers that increase variability
SCATTER PLOT
- only for one dependent and one independent variable
command: scatter dependent_variable independent_variable || lfitci dependent_variable independent_variable
- the lfitci part adds the fit line (with its confidence interval); without it there is no fit line
or
command: scatter dependent_variable independent_variable if dummy_variable==0
- this creates the scatterplot between the dependent and the independent variable only for observations where the dummy equals 0
ex: a scatterplot of price against length for domestic cars (foreign==0), and then for foreign cars (foreign==1)
command: scatter dependent_variable independent_variable if dummy_variable==1
- this creates the scatterplot only for observations where the dummy equals 1
CREATE A SCATTERPLOT OF ONE VARIABLE AGAINST ANOTHER IN TWO SUBGROUPS:
command: ssc install sepscatter
command: sepscatter variable1 variable2, sep(subgroup_variable)
ex: sepscatter price length, sep(foreign)
MULTIPLE REGRESSION MODEL
- same as the simple linear regression model
- every time you add a variable that makes sense, the whole model will improve
command: reg dependent_variable independent_variable1 independent_variable2 ...
COMPARING THE BETAS
- which independent variable contributes more to the model —> to understand the strength of each relationship
- the higher the standardized beta, the higher the contribution of that independent variable
- beta coefficient for Xk: bk * (std. deviation of Xk / std. deviation of Y)
- a standardized coefficient represents the change, in standard deviations, of the dependent variable that results from an increase of one standard deviation in an independent variable (so we can compare across variables!)
command: reg dependent_variable independent_variable1 independent_variable2 ..., beta
How to fit the regression model for only one level of a dummy variable:
command: regress dependent_variable independent_variable if dummy_variable==1
command: regress dependent_variable independent_variable if dummy_variable==0
weights in Stata
- weights are not crucial – they usually don't change the model itself, maybe the significance of some coefficients a bit, but the model is still good
- Stata offers 4 weighting options:
- frequency weights (fweight): samples with many duplicate observations
- analytic weights (aweight): samples where each observation is a group mean
- probability weights (pweight): samples where observations had a different probability of being sampled
- importance weights (iweight): weights that indicate the "importance" of the observation in some vague sense; iweights have no formal statistical definition, and any command that supports iweights defines exactly how they are treated
command to add weights:
- we add the weight to a command by adding [w=] before the options (that is, before the comma)
ex: summarize variable [w=pspwght] (pspwght is a specific weight in the ESS data)
ex: regress dependent_variable independent_variable [w=pspwght]
creating a new weight
command: gen overallwght = pspwght * pweight
BEST LINEAR FIT LINE FOR A REGRESSION INCLUDING WEIGHTS
command: twoway lfit dependent_variable independent_variable [w=pspwght]
EVALUATION OF A MODEL
Coefficient of determination, R Squared
- R2 measures the proportion of total variation in y explained by all X variables taken together
- 0 ≤ R2 ≤ 1: the closer R2 is to 1, the better the goodness of fit of the model
- when comparing models, it is better to use adjusted R2 (adjusted by the degrees of freedom, which depend on the sample size and the number of covariates)
- R2 will always increase as you add a new variable, whereas adjusted R2 won't necessarily
- keep adding variables as long as adjusted R2 does not decrease; if a new variable makes only a small impact, you need to decide
command: reg dependent_variable independent_variables
MODEL ERROR VARIANCE
- The model with the lower standard error of the estimate fits the data better.
- If its value is higher, then the standard errors of the coefficients are also higher and, ceteris paribus, we will be more likely to find non-significant results.
- In Stata it is called "Root MSE".
command: reg dependent_variable independent_variables
F-STATISTIC
- must be higher than 1; as a rule of thumb, if it is bigger than 4 we reject the null hypothesis
- if you do not reject the null hypothesis, the estimated model is NO GOOD (the regressors you introduced have no explanatory power for the dependent variable, y): the "variability" of the estimated model is purely random, not in any way different from the "variability" of the residuals
- F is the ratio of the measures of variability of the two components; the worse the model is, the closer F will be to 1
- the p-value must be smaller than alpha to reject the null hypothesis, which states that all coefficients are equal to 0
command: reg dependent_variable independent_variables
DIFFERENCE BETWEEN F-TEST AND T-TEST:
- The F-test for overall significance compares two models (do Education and Tenure taken together linearly affect wages?) —> it shows whether there is a linear relationship between all of the X variables considered together and Y. The general rule of thumb is F > 4: we reject the null hypothesis (—> at least one independent variable affects Y).
- The t-test (e.g. on Education) compares two models that differ in one variable (is there a significant marginal effect of Education on Wages after Tenure has been controlled for?).
- If at least one t-test is significant, then the F-test will be significant.
- It could be the case that all t-tests are not significant while the F-test is significant. This happens when the independent variables are highly correlated (multicollinearity).
MULTICOLLINEARITY
- It can affect the reliability of regression results.
- As a result, standard errors will be large and t-statistics small (unstable coefficients).
- Coefficient signs may not match prior expectations.
- The Variance Inflation Factor (VIF) is an indicator of a multicollinearity problem.
- VIF measures how much the standard errors of our estimated regression coefficients increase compared to the case in which our X had no relation with the other variables. We can see it as an indication of how much 'precision' we lose in our estimates.
Interpreting VIF:
- VIF values range from 1 to positive infinity. A VIF of 1 indicates no multicollinearity (perfect independence).
- Common thresholds for concern: VIF > 5 or 10 (serious multicollinearity problem).
- E.g.: VIF = 2 and SE = 6 for Xk mean that the standard error of the coefficient βk is 2 times larger than if Xk had no correlation with the other Xs (it would be 6/2 = 3 in the absence of correlation).
command: vif (run the command after fitting the regression model)
CORRELATION
- Multicollinearity occurs when two or more independent variables in a regression model are highly correlated.
- Use a correlation matrix to check for multicollinearity (ONLY FOR NUMERICAL VARIABLES).
command: pwcorr independent_variable1 independent_variable2, sig
- pwcorr drops observations that have a missing value in either variable of each pair — better than corr
- with the sig option you ask Stata not only to calculate the correlation coefficients but also to display the significance levels (p-values) associated with each correlation coefficient; these p-values indicate whether the correlations are statistically significant
or
command: corr independent_variable1 independent_variable2
- corr drops observations that have at least one missing value in any of the variables in the matrix
or
command: pwcorr independent_variable1 independent_variable2, obs
- shows the number of non-missing observations used for each correlation in the matrix
or
command: pwcorr independent_variable1 independent_variable2, obs casewise
- the casewise option tells Stata to drop observations with a missing value in any of the listed variables (casewise deletion, like corr), instead of handling missing values pair by pair
F-TEST ON A SUBSET OF THE MODEL
- used to understand whether two (or more) variables taken together are significant; this is used when you detect multicollinearity between some variables
- the test helps decide whether you should take all the correlated variables out of the model, keep some in, or use an average of them
- if the p-value < alpha, reject the null hypothesis: the variables are jointly statistically significant
command: test independent_variable1 independent_variable2 ...
- only include the subset of independent variables you think are correlated; run this after fitting the regression model
REMEDIES FOR MULTICOLLINEARITY
- Drop one (or more) collinear variables.
- Transform the collinear variables into a single new indicator (e.g. take the mean of the collinear variables).
- Retain the collinear variables but interpret the p-values with care.
- The best option depends on the aims of the analysis and on the research questions.
- If, after removing a variable, the other variables are all significant, this means that because of the collinearity we found non-significant results in the previous model.
HOW TO CALCULATE A CHANGE IN AN INDEPENDENT VARIABLE FROM THE REGRESSION MODEL
(worked example from the slides: how much should 30280 charge for its hot dog to maintain its current market share?)
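A hedged sketch of the mechanics only, with hypothetical variable names (share, price, comp_price) and hypothetical target values: after reg, the estimated coefficients are stored as _b[varname] and can be plugged into the fitted equation, which is then solved for the covariate of interest.
reg share price comp_price
scalar target_share = 0.30   // assumed current market share (illustrative)
scalar comp_p = 2.50         // assumed competitor price (illustrative)
* solve target_share = _b[_cons] + _b[price]*p + _b[comp_price]*comp_p for p
display (target_share - _b[_cons] - _b[comp_price]*comp_p) / _b[price]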
EVALUATING ASSUMPTIONS
Gauss-Markov theorem: the OLS estimators bj are the best linear unbiased estimators (BLUE) for βj.
- Best: smallest variance (in the class of linear unbiased estimators); that is, the OLS estimators are the most efficient. For efficiency, assumption 4 must also be valid (as well as 1, 2 and 6); it is also necessary that the error terms are random variables with mean 0 (one of the conditions stated in assumption 3).
- Linear: they are a linear function of yi.
- Unbiased: E(bj) = βj (on average they hit the true parameter value). For linearity and unbiasedness only assumptions 1, 2 and 6 are required.
- If the error term has not only mean zero but is also normally distributed (assumption 3), then the OLS estimators are normally distributed.
- By the central limit theorem, even if the errors are not perfectly normal, when the sample size is high the OLS estimators are approximately normal.
- If the errors are not normal and the sample size is small, the OLS estimators are not normal and the standard errors can be biased. Inference in this case may be invalid (usually, standard errors are biased downward, and significance levels are higher than the correct ones).
Assumption 1: linearity
- check the shape of the relationship using a scatter plot:
command: scatter dependent_variable independent_variable || lfitci dependent_variable independent_variable (linear fit line)
compare to
command: scatter dependent_variable independent_variable || qfit dependent_variable independent_variable (quadratic fit line)
- if the relationship is not linear but quadratic, change the regression model:
command: gen educ_squared=educ^2
command: reg wage educ educ_squared
- age and tenure are two common variables where you might need a squared term; for example, to add tenure squared, see the sketch below
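A minimal sketch, following the same pattern as educ_squared above (variable names from the wage example):
gen tenure_sq = tenure^2
reg wage educ tenure tenure_sq
* a significant coefficient on tenure_sq suggests the effect of tenure on wage is non-linear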
Assumption 2: No serial correlation of errors
- We are not worried about serial correlation as long as we work with cross-sectional data.
- Generally there is no serial correlation when the residuals show no pattern around the x axis.
Assumption 3: Normality of the error term
- The error terms are normally distributed with mean 0: ε ~ N(0, σ2).
command: predict new_variable_name (ex: predict y_hat)
- generates the predicted (fitted) values of the dependent variable (y) based on the regression model you estimated earlier
command: predict residuals, r
- generates the residuals of the regression model: the differences between the actual values of the dependent variable (y) and the predicted values (y_hat) obtained from the regression model
command: sum residuals, detail
- check that skewness is between -1 and 1 and kurtosis between 2 and 4; the mean of the residuals must be (approximately) 0 (for standardized residuals the variance is approximately 1)
command: kdensity residuals, normal
Assumption 4: Homoskedasticity
- errors have equal variance, σ2 (homoscedasticity)
- to test for homoskedasticity: command: estat hettest (run it after the regression command)
Remedies:
- Sometimes it is sufficient to take the log transformation of the dependent variable to regain efficiency: the log transformation smoothes out the "excess" variability of the observations.
command: generate ln_dependent_variable=ln(dependent_variable)
- Change the estimation method (e.g. Weighted Least Squares).
- Apply formulas to get correct standard errors (White heteroskedasticity-consistent standard errors); see the sketch below.
- Solve the cause of heteroscedasticity (non-linearity, omitted variable...).
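A minimal sketch of the test plus the robust (White) standard-error remedy, assuming the wage example variables; vce(robust) is Stata's option for heteroskedasticity-consistent standard errors:
reg wage educ tenure
estat hettest
* if the test rejects homoskedasticity, refit with robust standard errors:
reg wage educ tenure, vce(robust)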
Assumption 6: No endogeneity
- The error term ε is not correlated with any of the explanatory variables x1, x2, ..., xk. In other words, there is no endogeneity.
- Often referred to as omitted variable bias: omitted relevant variables are variables correlated both with the dependent variable and with the independent variables included in the regression.
- This is very hard to detect... you can notice it if, after adding a variable, the other variables change sign.
Remedies:
- Add a correlated regressor that is not in the model.
- Identify a problem in the sample selection: if the sample selection is correlated with the variable of interest, narrow the generalizability of the interpretation to the chunk of the population that satisfies the sample restriction.
How to save in Stata statistics that can later be used:
command: sum independent_variable
command: display r(min)
HOW TO INTERPRET DUMMY VARIABLES
- if the coefficient of the dummy variable female is 5, this means that for females the dependent variable is on average 5 units higher, holding the other variables constant
- a dummy variable "female" with a coefficient of -500 would mean that, on average, females earn $500 less than males, assuming all other variables in the model are held constant
How to introduce a categorical variable as dummy variables in a regression model
command: tabulate categorical_variable, gen(categorical_variable)
- this creates a dummy for each country: they will all be called cntry_* where * is the alphabetical rank attached to the country in the variable cntry (cntry_1 has value 1 for Denmark and 0 otherwise, cntry_2 has value 1 for Italy and 0 otherwise, etc.)
command: describe categorical_variable* (to check that it worked)
command: regress dependent_variable independent_continuous_variables categorical_variable1 categorical_variable2 [w = overallwght]
- this includes a weight; in this case level 3 of the categorical variable is the reference category (the dummy left out)
HOW TO CREATE A DUMMY VARIABLE
command: gen variable_name = 0
command: replace variable_name = 1 if other_variable=="x"
- i.e. set the new variable to 1 if another variable equals a certain value
example:
gen cee = 0
replace cee = 1 if cntry=="Poland"
replace cee = 1 if cntry=="Hungary"
replace cee = 1 if cntry=="Czech Republic"
- cee is now a dummy variable taking value 1 if the country is from Central and Eastern Europe and 0 otherwise
example:
generate high_educ=(educ>12)
- this dummy will be equal to 1 for values of education above 12 and equal to 0 for values of education of 12 or below (add "if educ!=." to keep missing values missing)
- the coefficient for high_educ is equal to 1.942451: in other words, people with more than 12 years of education have, on average, an hourly wage 1.94 dollars higher than people with 12 or fewer years of education
HOW TO ADD A DATASET TO AN EXISTING DATASET AND COMBINE THEM
command: append using "directory"
example: append using "/Users/angelici/Dropbox/3. Corsi Bocconi/30280 - Applications For Management/2023-24/STATA/ESS6_Lab_East.dta"
DUMMY VARIABLES: a dummy variable is a variable that takes only a value of 0 or 1 to indicate the absence or presence of some categorical effect that can shift the outcome.
Data analyses section
- types of variables for dependency models:
- Dependent variable (denoted Y): a variable that is to be predicted or that changes in response to an intervention; the value of this variable depends on the values of other variables. The outcome to be changed (e.g. skills, social conditions) is a dependent variable (Y).
- Independent variable (denoted X): a variable that is used to predict values of another variable, or that records experimental categories such as participant/non-participant. The intervention is an independent variable (X).
- Confounding variable: a variable that might interfere with or 'confound' the relationship between the independent and the dependent variable.
- background section
Common elements of a problem definition:
- Research question
- Selection of a theory or other basis for the definition of the specific study: an in-depth literature search has to be run on "recognized sources"
- Clear identification of the unit of analysis
- Clear identification of the fundamental concepts to be studied
- Expected or proposed relationships between the concepts
- Respondents from Poland are more favourable (because the sign is positive) to immigrants from different race/ethnic groups compared to respondents from Great Britain, by 0.015 on a scale from 0 to 1. The hypothesis test performed here is the following: the coefficient on PL is not statistically significant, because the p-value of 0.152 is larger than the threshold level of 0.05. The answer is thus no, attitudes towards immigrants do not differ in Great Britain compared to Poland.
- To answer, we regress the immigration dummy on lrscale and interpret the coefficient. We keep the country dummy in the regression to test this relationship taking into account differences across countries. A one-unit increase in political orientation, i.e. moving from left to right by one unit on a scale from 0 to 10, reduces positive attitudes towards different immigrants by 0.003. The answer is no, because this reduction is not statistically different from zero (p-value > 0.05).
STATA allows us to speed up this process by using a special notation: we specify whether the variables involved are categorical (as in the case of a dummy variable, indicated with the prefix i.) or continuous (as in the case of the lrscale variable, indicated with the prefix c.) and add the symbol ## between the variables we want to interact.
command: reg immigration_dummy i.cntry_num##c.lrscale [w=overallweight]
Factor analysis III: factor extraction and rotation
- extraction of factors (equivalent to fitting the model)
- interpretation of the factors, including rotation of factors to assist interpretation
The two main analytical stages of FA are:
1. Extraction stage
- Select the factor extraction method
- Produce an "initial solution"
- Decide how many factors to extract
2. The rotation stage
- We give an interpretation to the factors
Kaiser's rule: if a factor's eigenvalue is less than one, it does not pay off to extract that additional factor (it explains less variance than a single original variable).
Step 2: Extraction: common variance or communality
- The total variance of a variable is split into the sum of:
o Common variance: variance of a variable that is shared with all other variables in the analysis. A variable's communality is the estimate of its shared or common variance among the variables, as represented by the derived factors.
o Unique variance (specific plus error variance): the part of the variance that can be attributed to the single variable.
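A hedged Stata sketch of the two stages, assuming a set of survey items named item1–item6 (names illustrative); factor, screeplot and rotate are the standard Stata commands here:
factor item1-item6
screeplot
* keep factors with eigenvalue > 1 (Kaiser's rule), e.g. two factors:
factor item1-item6, factors(2)
rotate
* rotate applies an orthogonal (varimax) rotation by default to ease interpretation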