Part 1

SKEMA – Ph.D programme
2010 – 2011
Quantitative Methods
For Social Sciences
Lionel Nesta
Observatoire Français des Conjonctures Economiques
Lionel.nesta@ofce.sciences-po.fr
Objective of The Course
 The objective of the class is to provide students with a set
of techniques to analyze quantitative data. It concerns the
application of quantitative and statistical approaches as
developed in the social sciences, for future decision
makers, policy makers, stakeholders, managers, etc.
 All courses are computer-based classes using the STATA
statistical package. The objective is to reach levels of
competence which provide the students with skills to both
read and understand the work of others and to carry out
one's own research.
Examples
 Rise in biotechnology
 Should the EU fund fundamental research in biotechnology?
 Has biotechnology increased the productivity of firm-level R&D?
 Did it increase the speed of discovery in pharmaceutical R&D?
 Increasing university-industry collaborations
 Does it facilitate innovation by firms?
 Does it increase the production of new knowledge by academics?
 Does it modify the fundamental/applied nature of research?
Examples
 Economic (productivity) Growth
 Does it come mainly from new firms or improving existing firms?
 Is market selection operating correctly?
 Why do good firms exit the market?
 How does the organisation of knowledge impact on performance?
 How do knowledge stock and specialisation impact on productivity?
 How do firms enter into new technological fields?
 Do firms diversify in new technologies/businesses purposively?
Structure of the Class
 Part 1 : Descriptive Statistics
 Mean, variance, standard deviation
 Data management
 Part 2 : Statistical Inference
 Distributions
 Comparison of means
 Part 3 : Relationship Between Variables
 ANOVA, Chi-Square
 Correlation
 Part 4 : Ordinary Least Squares (OLS)
 Correlation coefficient, simple regression
 Multiple regression
 Part 5 : Extension to OLS
 Regression diagnostics
 Qualitative explanatory variables
 Part 6 : Qualitative Dependent variables
 Linear probability model
 Maximum likelihood (logit, probit)
Part 1
Descriptive Statistics
Types of Data
Descriptive statistics is the branch of statistics which gathers all
techniques used to describe and summarize quantitative and
qualitative data.
Quantitative data
 Continuous
 Measured on a scale (can take any value within its range)
 The size of the number reflects the amount of the variable
 Age; wage, sales; height, weight; GDP
Qualitative data
 Discrete, categorical
 The number only identifies the category of the variable
 Type of work; gender; nationality
Descriptive Statistics
Any tool that summarizes the data concisely is useful: graphs,
charts, tables.
Quantitative data
 Graphs: scatter plots; line plots; histograms
 Central tendency
 Dispersion
Qualitative data
 Graphs: pie graphs; histograms
 Tables, frequency, percentage, cumulative percentage
 Cross tables
Central Tendency and Dispersion
 A distribution is an ordered set of numbers showing how many
times each occurred, from the lowest to the highest number or the
reverse
 Central tendency: measures of the typical value around which the
scores of a distribution are clustered
 Dispersion: measures of how much the scores fluctuate around the
characteristics of central tendency
 In other words, the characteristics of central tendency produce
stylized facts, whereas the characteristics of dispersion indicate how
representative a given stylized fact is.
Central Tendency
 The mode
 The most frequent score in a distribution is
called the mode.
 The median
 The middle value of all observed values: 50% of the
observed values are higher and 50% are lower
than the median
 The mean
 The sum of all the values divided by the
number of values
$$ \bar{X} = \frac{1}{N} \sum_{i=1}^{N} x_i $$
The mode, the mean and the median are equal if the distribution is symmetrical and unimodal.
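The three measures of central tendency can be checked by hand on a small sample. The sketch below uses Python's standard library rather than Stata, simply to make the arithmetic explicit; the scores are made up for illustration.

```python
from statistics import mean, median, mode

# Hypothetical sample of seven scores (not taken from the course data).
scores = [1, 2, 2, 3, 4, 7, 9]

sample_mode = mode(scores)      # most frequent score
sample_median = median(scores)  # middle value once the scores are sorted
sample_mean = mean(scores)      # sum of the scores divided by their number

print("mode:", sample_mode)
print("median:", sample_median)
print("mean:", sample_mean)
```

In this sample the distribution has a long right tail, so the three measures differ: the mode is below the median, which is below the mean.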
Dispersion
 The range
 Difference between the maximum and
minimum values
$$ R = x_{max} - x_{min} $$
 The variance
 Average of the squared differences between
data points and the mean (the mean
quadratic deviation)
$$ \sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \bar{X})^2}{N} $$
 The standard deviation
 Square root of variance, therefore measures
the spread of data about the mean,
measured in the same units as the data
in
  
2
x
i 1
i
X
N

2
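As a worked illustration of the range, the variance and the standard deviation, here is a minimal Python sketch that applies the formulas exactly as written on the slide, dividing by N. The data are invented for the example.

```python
from math import sqrt

# Hypothetical data, chosen so that the results are round numbers.
data = [2, 4, 4, 4, 5, 5, 7, 9]

n = len(data)
x_bar = sum(data) / n                               # mean
value_range = max(data) - min(data)                 # R = x_max - x_min
variance = sum((x - x_bar) ** 2 for x in data) / n  # mean squared deviation
std_dev = sqrt(variance)                            # same units as the data

print("range:", value_range)
print("variance:", variance)
print("standard deviation:", std_dev)
```

Note that these are the population formulas. Stata's summarize reports the sample standard deviation, which divides by N - 1 instead of N, so its output will be slightly larger on the same data.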
Research Productivity in the
Bio-pharmaceutical Industry
EU Framework Programme 7
Stylised Facts about Modern Biotech
1. Innovations emerge from uncertain, complex processes
involving knowledge and markets: Roles of networks.
2. Economic value is created in many ways, both globally and
in geographical agglomerations.
3. Various linkages exist among diverse actors (LDFs,
DBFs, Univ, Venture Capital) in innovation processes,
but the firm plays a particularly important role.
4. Regulations, social structures and institutions affect ongoing
innovation processes as well as their impacts on
society: Importance of IPR.
STATA software
The Stata software
 Stata Corp, spinoff from Texas A&M – College Station – Texas
(1985)
 Among the most widely used programs for statistical analysis
in social sciences.
 Probably the most widely used econometric software among
economists
 Data management (case selection, file reshaping, creating
derived data)
 Features of Stata are accessible via pull-down menus
 The pull-down menu interface generates command syntax.
The Stata software
 STATA is statistical software in constant evolution
 Official updates are regularly made available on the web
(command update all)
 User-written modules are shared with other Stata users; most are
available through the Boston College server
 ssc install module_name, all
 Hundreds of others can be reached as follows:
 net search key_words
 net install module_name, all
The Stata software
[Screenshot of the Stata interface: pull-down menus, Review window, Results window, Variable window, Command window]
The Stata software
 How to use STATA ?
 Using pull-down menus
 Typing STATA instructions in the Command window
 Grouping a series of STATA instructions in a .do file
 Programming new functions (.ado files)
 Programming new functions with a powerful matrix language
(MATA) similar to C (Version 9.0 of STATA onwards)
The Stata software
 All STATA commands used from the menu or the command
window are automatically stored in the Review window
 At the end of a session, the content of the Review window can be
saved by right-clicking on it
 Save all: saves the commands as a do-file (.do)
 Send to do-file editor: opens the commands in a new window
 A do-file is a text file containing a list of STATA commands
which STATA executes one after the other.
It is recommended to explore results and methods with the
command window. Once the methods are settled, save the
series of commands as a do-file.
The Stata software
 All STATA results are displayed in the Results window
 This window is a buffer. Once output scrolls out of the buffer, it
is lost. That is why you may want to record results.
 log using log_name.txt (beginning of a session)
 log close (end of a session)
It is recommended to save results in a log file. Moreover, if you
work with a do-file, you can always reproduce old results by re-running it.
The Stata software
 Memory settings
 By default, 10 megabytes are available for loading a database. If
a database is larger than 10 MB, STATA does not load it.
There are also other limits (matrix size, number of variables)
which can be managed using the commands below.
 Useful commands
describe using database_name.dta
query memory
clear
set memory 500m, permanently
set maxvar n , permanently
set matsize n , permanently
set virtual on , permanently
Data Handling (1): Database creation
 1st step: Creating a database
 Typing data in the database through Data Editor (edit)
 Importing delimited text data
insheet using myfile.txt , options
options : tab ; comma ; delimiter("char") ; clear ; names
 Importing data from a .txt file
- Without fixed format (without dictionary)
infile var1 var2 var3 using myfile.txt , options
- With a fixed format (with dictionary)
infile using mydict.dct , using(myfile.txt) options
DH(2): Database Exploration
 2nd step: Exploring the Data
 To obtain a description of the database
describe [varlist], options
inspect [varlist]
codebook [varlist], options
nmissing [varlist], options
npresent [varlist], options
 To display the values of variables, observation by observation
list [varlist] [if] [in], options
Example : list var1 if var2 > var3 in 1/100
DH(3): Database Organisation
 3rd step: Organisation of the database
 Sorting observations
sort varlist
gsort [ + | - ] varlist
 Sorting variables
order varlist
aorder varlist
(If no varlist is specified, _all is assumed.)
 Merging several databases (adding variables)
merge varlist using base1.dta [base2.dta], options
 Merging several databases (adding observations)
append using base1.dta [base2.dta], options
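The difference between adding variables (merge) and adding observations (append) can be sketched on toy data. The example below is written in Python rather than Stata, and every dataset and variable name in it is hypothetical. It only mimics the simplest case, where each identifier appears once in each file.

```python
# Hypothetical master file and two files to combine with it.
firms = [{"id": 1, "assets": 500}, {"id": 2, "assets": 900}]
patents = [{"id": 1, "patent": 12}, {"id": 2, "patent": 3}]
more_firms = [{"id": 3, "assets": 250}]

# "merge": match observations on the key and add the new variables (columns).
patents_by_id = {row["id"]: row for row in patents}
merged = [{**row, **patents_by_id.get(row["id"], {})} for row in firms]

# "append": stack the observations (lines) of the second file under the first.
appended = firms + more_firms

print(merged)
print(appended)
```

Unlike this sketch, Stata's merge also keeps observations found only in the using file and creates a _merge variable reporting where each observation came from.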
DH(3): Database Organisation
 3rd step: Organisation of the database
 Modifying the shape of the database
reshape long stubnames, i(varlist) j(varlist)
reshape wide stubnames, i(varlist) j(varlist)
Wide form (i = id)
id  sex  inc80  inc81  inc82
----------------------------
1   0    5000   5500   6000
2   1    2000   2200   3300
3   0    3000   2000   1000

Long form (i = id, j = year)
id  year  sex  inc
------------------
1   80    0    5000
1   81    0    5500
1   82    0    6000
2   80    1    2000
2   81    1    2200
2   82    1    3300
3   80    0    3000
3   81    0    2000
3   82    0    1000
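To make the logic of reshape long concrete, the following Python sketch turns the wide table above into the long one. It is an illustration of the idea only, not of how Stata implements it; the function name reshape_long and its arguments are hypothetical, chosen to echo the stubnames, i() and j() of the Stata syntax.

```python
# The wide-form data from the slide: one line per id, one column per year.
wide = [
    {"id": 1, "sex": 0, "inc80": 5000, "inc81": 5500, "inc82": 6000},
    {"id": 2, "sex": 1, "inc80": 2000, "inc81": 2200, "inc82": 3300},
    {"id": 3, "sex": 0, "inc80": 3000, "inc81": 2000, "inc82": 1000},
]


def reshape_long(rows, stub, i, j):
    """Create one line per (i, j) pair from columns named stub + suffix."""
    long_rows = []
    for row in rows:
        # Columns that do not start with the stub are constant within i.
        fixed = {k: v for k, v in row.items() if not k.startswith(stub)}
        for key, value in row.items():
            if key.startswith(stub):
                suffix = int(key[len(stub):])  # e.g. "inc80" -> 80
                long_rows.append({**fixed, j: suffix, stub: value})
    long_rows.sort(key=lambda r: (r[i], r[j]))
    return long_rows


long = reshape_long(wide, stub="inc", i="id", j="year")
for line in long:
    print(line)
```

Three lines with three income columns each become nine lines with a single inc column, while sex is simply repeated within each id.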
DH(4) : Saving, Opening, Exporting
 4th step: Save and re-use STATA database files (.dta files)
 Changes the working directory to the specified drive and directory
cd "C:\STATA SKEMA"
 Saves the database as a STATA file (.dta)
save myfile.dta , replace
 Opens a STATA format database (.dta)
use myfile.dta , clear
 Exports a database as a .txt file
outsheet [varlist] using myfile.txt , options
options : comma ; nonames ; replace
Handling Variables
 Create a new variable
 By assigning a value to it
generate var1 = expression [if] [in]
 Using a predefined function: Extensions to generate
egen var1 = fcn(arguments) [if] [in] [, options by(varlist)]
fcn : min ; max ; mode ; mean ; median ; sd ; total ;
pctile ; group ; count ; etc.
Examples : egen mean_salaire = mean(salaire) , by(age)
egen nom_id = group(nom)
egen n_id = count(id) , by(sector)
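What an egen function combined with by() does can be sketched in a few lines of Python: the group statistic is computed once per group and then written back on every observation of that group. The data and the name mean_salaire are hypothetical.

```python
from collections import defaultdict

# Hypothetical observations: wage (salaire) and age of three individuals.
rows = [
    {"age": 30, "salaire": 2000},
    {"age": 30, "salaire": 3000},
    {"age": 45, "salaire": 4000},
]

# First pass: accumulate the sum and the count of salaire within each age.
totals = defaultdict(float)
counts = defaultdict(int)
for row in rows:
    totals[row["age"]] += row["salaire"]
    counts[row["age"]] += 1

# Second pass: attach the group mean to every observation of the group.
for row in rows:
    row["mean_salaire"] = totals[row["age"]] / counts[row["age"]]

for row in rows:
    print(row)
```

The number of observations is unchanged: each line simply gains a new variable holding the mean of its group.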
Handling Variables
 Variables modifications and removal
 Modifying a variable which has already been created
replace var1 = expression [in] [if]
 Removing variables (drop removes the listed variables, keep
removes all the others)
drop varlist
keep varlist
 Removing observations
drop [if] [in]
keep [if] [in]
Examples : drop if revenu < 100
keep if age >= 18
Handling Variables
 Time series and panel data utilities
 Declaring data as time series or panel data
tsset [panelvar] timevar , options
options : daily ; weekly ; monthly ; quarterly ; yearly
Example : tsset id annee , yearly
 Using time series operators
Lagged values: L. ; L2. or LL.
→ L.X = X_{t-1}
→ L2.X = X_{t-2}
Forward values: F. ; F2. or FF.
→ F.X = X_{t+1}
→ F2.X = X_{t+2}
Differenced values: D. ; D2. or DD.
→ D.X = X_t - X_{t-1}
→ D2.X = (X_t - X_{t-1}) - (X_{t-1} - X_{t-2})
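The lag, forward and difference operators can be illustrated on a short series. The Python sketch below is an assumption-laden simplification: it handles a single series observed in consecutive periods with no gaps, whereas Stata relies on tsset to deal with panels and missing periods. The function names lag, lead and diff are hypothetical.

```python
def lag(series, k=1):
    """L<k>.X: the value k periods earlier, None when it does not exist."""
    return [series[t - k] if t - k >= 0 else None for t in range(len(series))]


def lead(series, k=1):
    """F<k>.X: the value k periods later, None when it does not exist."""
    n = len(series)
    return [series[t + k] if t + k < n else None for t in range(n)]


def diff(series):
    """D.X = X_t - X_{t-1}, None when either term is missing."""
    return [
        None if (x is None or previous is None) else x - previous
        for x, previous in zip(series, lag(series))
    ]


x = [10, 12, 15, 19]   # hypothetical yearly values
l1 = lag(x)            # L.X
f1 = lead(x)           # F.X
d1 = diff(x)           # D.X
d2 = diff(d1)          # D2.X, the difference of the difference

print(l1, f1, d1, d2, sep="\n")
```

Each lag or difference loses one observation at the start of the series, and each forward value loses one at the end, which is why the first entries of D2.X are missing.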
Descriptive Statistics with STATA
Using log files
log using xxx, replace / log close
Defining and using labels
label variable
label define
label values
Descriptive statistics
summarize
table
table, content()
tabulate
Manipulating .dta files and exporting
collapse
save
outsheet using...
Log files
Log files save the content of the Results window. They are useful when
producing descriptive statistics on the .dta files and on the variables.
log using nom_fichier_log, replace
(STATA commands)
log close
 Advantage. Very useful for retrieving old results (replication and refutation)
 Drawbacks. Tedious to manipulate
Labelling variables
 Labelling is too often neglected.
 No influence on the results
 Large influence on correct interpretation of variables and results
label variable. Describes a variable
label variable asset "real capital"
label define. Defines a label
label define firm_type 0 "Pharma" 1 "biotech"
label values. Applies the label to a variable
label values type firm_type
Descriptive statistics: summarize
summarize var1 var2....varN
Produces the number of observations, mean, standard deviation, min and max
We can add a condition using if
summarize var1 var2 ....varN if [condition]
We can produce descriptive statistics by subsets of the database
using bysort
bysort varcat: summarize var1 var2 ....varN
 Beware: the if condition is never preceded by a comma. The comma only
introduces options, and comes after any if condition. If you get an error
message, chances are high that it comes from a missing or misplaced comma.
Descriptive statistics: table
The table command applies to categorical variables (string or
numeric).
table varcat1
Provides the number of observations by categories of varcat1
table varcat1 varcat2
Provides a cross table between varcat1 and varcat2
table varcat, content(count var1 mean var1 sd var1...)
Provides descriptive statistics on var1 by categories of varcat
Descriptive statistics: tabulate
The tabulate command is similar to table, but the options are different.
tabulate varcat, gen(varcat_)
Generates dummy variables for each category of varcat
tabulate varcat1 varcat2, [options]
Produces a cross table of two categorical variables; options such as chi2
add measures of association
tabulate varcat1, summarize(var2)
Provides descriptive statistics on var2 by categories of varcat1
Stacking observations: collapse
 The collapse command produces a new database which is an
aggregation of the old database.
 collapse aggregates lines (observations) by categories of a
categorical variable of your choice
collapse (mean) var1 var2 (sum) var3, by(varcat)
Will generate a new database with as many lines as there are categories
of varcat, with 3 aggregated variables (means of var1 & var2, sum of var3)
collapse (mean) var1 var2 (sd) sdvar1=var1 sdvar2=var2,
by(varcat1 varcat2)
Will generate a new database with as many lines as there are combinations
of categories of varcat1 & varcat2, with 4 aggregated variables (means of
var1 & var2, standard deviations of var1 & var2)
 Note: collapse is useful for exporting tables of results to Excel.
 Note: Please save the old and new database under different names!
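The first collapse example above can be mimicked on a toy dataset. This Python sketch only illustrates the aggregation logic (means of var1 and var2, sum of var3, one line per category of varcat); the values and the category labels are invented.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical database: three observations in two categories.
rows = [
    {"varcat": "DBF", "var1": 10, "var2": 1.0, "var3": 100},
    {"varcat": "DBF", "var1": 20, "var2": 3.0, "var3": 200},
    {"varcat": "LDF", "var1": 30, "var2": 5.0, "var3": 300},
]

# Group the observations by category.
groups = defaultdict(list)
for row in rows:
    groups[row["varcat"]].append(row)

# One output line per category, as collapse would produce.
collapsed = [
    {
        "varcat": cat,
        "var1": mean(r["var1"] for r in members),  # (mean) var1
        "var2": mean(r["var2"] for r in members),  # (mean) var2
        "var3": sum(r["var3"] for r in members),   # (sum) var3
    }
    for cat, members in sorted(groups.items())
]

for line in collapsed:
    print(line)
```

The original observations are gone from the collapsed table, which is why the old and the new database should be saved under different names.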
Keywords for table & collapse
mean      means (default)
sd        standard deviations
sum       sums
rawsum    sums, ignoring optionally specified weight
count     number of nonmissing observations
max       maximums
min       minimums
iqr       interquartile range
median    medians
p1        1st percentile
p2        2nd percentile
...       3rd-49th percentiles
p50       50th percentile (same as median)
...       51st-97th percentiles
p98       98th percentile
p99       99th percentile
Graphs
Graphic representations are a very effective means of synthesis.
- Pie graphs, which display proportions of a population or a sample
- Two-way graphs linking any two quantitative dimensions
- Distribution graphs (histograms), which plot the central tendency
and dispersion of a variable
Pie Graphs
graph pie, over(varcat)
[Figure: pie chart of the sample by category of varcat; legend entries C1 to C4, D0, E1 to E3, F1 to F6]
Two-way Graphs
Two-way graphs link two continuous variables, var1 and var2.
There are several types of two-way graphs :
- Line graphs
twoway line var1 var2
- Classical scatterplot
twoway scatter var1 var2
- Connected graphs
twoway connected var1 var2
Line graphs
twoway line var1 var2
twoway line rdi year if name=="Abbott"
[Figure: line graph of rdi (vertical axis, .1 to .115) against year (1988 to 1998) for Abbott]
Line graphs
twoway line var1 var2
twoway (line rdi year if name=="Amgen", sort) (line rdi year if
name=="Abbott", sort), legend(on order(1 "Amgen" 2 "Abbott"))
[Figure: line graph of rdi (vertical axis, .1 to .25) against year (1988 to 1998), one line each for Amgen and Abbott]
Connected graphs
twoway connected var1 var2
twoway (connected rdi year if name=="Abbott")
[Figure: connected graph of rdi (vertical axis, .1 to .115) against year (1988 to 1998) for Abbott]
Scatterplots
twoway scatter var1 var2
twoway scatter lrdi lassets
[Figure: scatterplot of lrdi (vertical axis, -6 to 0) against lassets (horizontal axis, 8 to 18)]
Distribution graphs
Distribution graphs plot the distribution of one quantitative variable var1
at a time by means of a histogram:
 On the horizontal axis, classes of var1 are displayed.
 On the vertical axis, the density of each class is displayed.
$$ f_j = \frac{n_j}{d_j} = \frac{\text{Number of observations in class } j}{\text{Class range } (c_j - c_{j-1})} $$
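The density of a class, as defined above, can be computed by hand. The Python sketch below uses invented values and class boundaries, and a hypothetical helper named class_densities; it shows why the class range matters when classes have unequal widths.

```python
# Hypothetical values of var1 and hypothetical class boundaries c_0 ... c_3.
values = [8.2, 9.1, 9.7, 10.4, 10.9, 11.3, 11.8, 12.6, 13.5, 15.0]
bounds = [8, 10, 12, 16]


def class_densities(values, bounds):
    """Return (lower, upper, n_j, f_j) for each class, with f_j = n_j / d_j."""
    result = []
    for lower, upper in zip(bounds[:-1], bounds[1:]):
        is_last = upper == bounds[-1]
        # Classes are closed on the left; the last one also includes its upper bound.
        n_j = sum(
            1 for v in values
            if lower <= v < upper or (is_last and v == upper)
        )
        d_j = upper - lower  # class range c_j - c_{j-1}
        result.append((lower, upper, n_j, n_j / d_j))
    return result


densities = class_densities(values, bounds)
for lower, upper, n_j, f_j in densities:
    print(f"[{lower}, {upper}): n_j = {n_j}, f_j = {f_j}")
```

The first and last classes contain the same number of observations, but the last one is twice as wide, so its density is half as large. Dividing each f_j by the total number of observations rescales the bars so that their areas sum to one.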
Histograms
hist var1
hist lassets
[Figure: histogram of lassets (horizontal axis, 8 to 18) with Density on the vertical axis (0 to .3)]
Kernel distributions
Using a kernel density estimate, one can approximate the probability density
function of var1. The probability density function is important to visually
assess the normality of the distribution.
Normal distributions are also called Gaussian distributions. They are very
frequently used in the sciences to account for random processes. Their
importance rests on the law of large numbers and the central limit theorem.
Kernel density
kdensity var1
kdensity lassets
[Figure: kernel density estimate of lassets (horizontal axis, 8 to 18) with Density on the vertical axis (0 to .25); kernel = epanechnikov, bandwidth = 0.5319]
Exporting Graphs
One can simply copy and paste a graph into any Microsoft Office software.
One can also use .do files, and write:
graph export graph_name.extension, as(extension) options
Example :
graph export SKEMA_rdi.wmf, as(wmf) replace
Possible extensions: PostScript (ps), Encapsulated PostScript (eps), Windows
Metafile (wmf), Windows Enhanced Metafile (emf), Macintosh PICT format
(pict), Portable Document Format (pdf)
SPSS software
Statistical Package for the Social Sciences
SPSS : Opening SPSS
SPSS : Importing data
 Settings in the “import text” dialogue box
 No predefined format (1)
 Delimited (2)
 First line contains the variable names (2)
 One observation per line // all observations (3)
 Tab delimited only (4)
 Finish (6)
SPSS windows
 SPSS automatically opens two windows
 The datasheet window
 Observe, manage, modify, create, data
 The results window
 Everything you do will be stored there
 The syntax window can be opened
SPSS : Data sheet (1)
SPSS : Data sheet (2)
SPSS : Result / Journal
SPSS : Saving data
SPSS : working, at last!
Recoding Variables
 Changing existing values to new values (biotechnologie → DBF,
pharmaceutique → LDF)
Computing New Variables
 Taking logarithm (normalization of continuous variables)
Creating Dummy Variables
Computation of Descriptive Statistics
Descriptive Statistics
Variable  N    Range        Minimum     Maximum      Mean         Std. dev.    Variance
patent    457  286          0           286          11.92        22.901       524.470
assets    457  35788473.97  4422.18     35792896.15  4358371.54   6086530.85   3.705E+013
rd        457  1917997.980  858.53204   1918856.512  330236.630   405160.516   164155043889
spe       457  2.0235309    -1.1298400  .8936909     -.056808610  .3374751802  .114
pharma    457  1            0           1            .63          .482         .232
biotech   457  1            0           1            .37          .482         .232
Valid N (listwise)  457
Splitting Database
Descriptive Statistics (by type)
type  Variable  N    Range      Minimum    Maximum   Mean        Std. dev.   Variance
DBF   patent    167  202        0          202       12.11       21.066      443.764
DBF   assets    167  2442619    4422.18    2447041   342934.49   478511.938  2E+011
DBF   rd        167  495443.5   858.53204  496302.1  58116.590   88638.5347  8E+009
DBF   spe       167  1.7544527  -1.12984   .6246127  -.10630582  .343286812  .118
DBF   pharma    167  0          0          0         .00         .000        .000
DBF   biotech   167  0          1          1         1.00        .000        .000
DBF   Valid N (listwise)  167
LDF   patent    290  286        0          286       11.81       23.929      572.609
LDF   assets    290  4E+007     218006.47  4E+007    6670709.4   6605972.68  4E+013
LDF   rd        290  1912600    6256.248   1918857   486940.24   432514.940  2E+011
LDF   spe       290  1.6904465  -.7967556  .8936909  -.02830504  .331330781  .110
LDF   pharma    290  0          1          1         1.00        .000        .000
LDF   biotech   290  0          0          0         .00         .000        .000
LDF   Valid N (listwise)  290
Logarithm
 Normalization
 Taking the logarithm is a transformation which usually brings a
skewed distribution closer to a normal distribution.
 Elasticities http://en.wikipedia.org/wiki/Elasticity_(economics)
 A change in the log of x is a relative change of x itself.
 Cobb-Douglas production function
$$ \frac{\partial \log x}{\partial x} = \frac{1}{x} \;\Rightarrow\; \partial \log x = \frac{\partial x}{x} $$
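The statement that a change in the log of x is a relative change of x can be verified numerically. The Python sketch below uses invented numbers; it compares the exact relative change with the difference in natural logs, first for a small change and then for a large one.

```python
from math import log

# Small change: x goes from 100 to 102 (hypothetical values).
x_old, x_new = 100.0, 102.0
relative_change = (x_new - x_old) / x_old  # exact relative change
log_change = log(x_new) - log(x_old)       # change in the natural log of x

# Large change: x doubles, and the approximation deteriorates.
big_relative_change = (200.0 - x_old) / x_old
big_log_change = log(200.0) - log(x_old)

print("small change:", relative_change, log_change)
print("large change:", big_relative_change, big_log_change)
```

The two measures nearly coincide for the small change, which is why coefficients in log-log specifications such as the Cobb-Douglas production function are read as elasticities. For large changes the log difference understates the relative change, so the interpretation holds only approximately.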