Master APE & PPD Econometrics Professor: Karen Macours

STATA APPLICATIONS

Task 1 (last year): Computer assignment

The data set busind.dta contains information on Gross National Income (GNI) per capita and the number of days to open a business and to enforce a contract in a sample of 135 countries. It was extracted from the “Doing Business” dataset, which the World Bank collects based on expert opinions in each country. The variable gnipc measures GNI per capita in thousands of US $. The variable daysopen measures the average number of days needed to open a business in that country, and daysenforce measures the average number of days needed to enforce a given type of contract.

(i) Find the average GNI per capita and the average number of days to open a business, and the average number of days to enforce a contract.

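Answer to question (i)

Stata commands:

. use busind, clear
. su daysenforce daysopen gnipc

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
 daysenforce |       129    352.9612    162.1636         27        909
    daysopen |       135    50.75556    38.42408          2        203
       gnipc |       135     6.56983    10.02707        .09      43.35

On average, GNI per capita is about 6.57 thousand US $, it takes about 51 days to open a business, and it takes about 353 days to enforce a contract. Note that daysenforce is observed for only 129 of the 135 countries.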

 (ii) In how many countries does it take on average less than 5 days to open a business? What is the maximum number of days to open a business in the dataset? In which countries does it take more than 200 days to open a business?

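Answer to Question (ii)

Stata commands:

. su daysopen if daysopen<=5

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
    daysopen |         4         3.5    1.290994          2          5

. list country if daysopen>=200

       country
135.   Haiti

Four countries need 5 days or fewer to open a business. The condition daysopen<=5 includes countries with exactly 5 days; for a strict "less than 5 days", use daysopen<5 instead. The maximum number of days in the dataset is 203 (see the Max column in (i)), and Haiti is the only country where it takes more than 200 days.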
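The strict versions of the two conditions in the question can be checked directly. This is a minimal sketch, assuming busind.dta is in the current working directory; it is not part of the original answer.

```stata
* Sketch: strict inequalities for question (ii)
use busind, clear

* Countries where it takes strictly fewer than 5 days
count if daysopen < 5

* Stata treats missing values as larger than any number, so exclude them
* explicitly whenever a ">" condition is applied to a variable with gaps
list country daysopen if daysopen > 200 & !missing(daysopen)
```

Here daysopen has no missing values (135 observations), so the `!missing()` guard changes nothing, but it would matter for a variable such as daysenforce.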

Question (iii)

Estimate the following simple regression model:

$gnipc = \beta_0 + \beta_1 \, daysopen + u$

Give a careful interpretation of the estimates $b_1$ and $b_0$. Are the signs what you expected them to be?

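Answer to Question (iii)

Stata commands:

. reg gnipc daysopen

      Source |       SS       df       MS              Number of obs =     135
-------------+------------------------------           F(  1,   133) =   20.08
       Model |  1766.98652     1  1766.98652           Prob > F      =  0.0000
    Residual |  11705.6636   133  88.0125084           R-squared     =  0.1312
-------------+------------------------------           Adj R-squared =  0.1246
       Total |  13472.6501   134  100.542165           Root MSE      =  9.3815

------------------------------------------------------------------------------
       gnipc |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
    daysopen |  -.0945063   .0210919    -4.48   0.000    -.1362253   -.0527873
       _cons |   11.36655   1.340889     8.48   0.000     8.714322    14.01878
------------------------------------------------------------------------------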
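As a reading aid for the output above, the fitted line can be written out explicitly. This only restates the reported coefficients (rounded), with gnipc in thousand US $:

```latex
\widehat{gnipc} = 11.367 - 0.0945 \cdot daysopen
```

Each additional day needed to open a business is associated with a predicted GNI per capita that is lower by about 0.0945 thousand US $, i.e. roughly 94.5 US $. The intercept, about 11.37 thousand US $, is the predicted GNI per capita for a hypothetical country with daysopen = 0, which lies outside the sample (the minimum is 2 days). The slope describes an association in the data, not necessarily a causal effect.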

Question (iv)

Question: What kind of factors are contained in u? Are these likely to be correlated with the number of days to open a business?

Answer: u contains all the factors, other than the number of days to open a business, that explain GNI per capita. There are many such factors: economic institutions, education, savings, consumption, R&D, and so on. Some of them, such as the quality of economic institutions, are likely to be correlated with the number of days to open a business.

Question (v)

Question: According to this model, what is the predicted income for a country where it takes 5 days to open a business? And for a country where it takes 200 days? Show how you can calculate the answers by hand (once you have obtained the estimation results). Do the obtained levels of income seem reasonable? Explain.

Answer to Question (v)

You can compute predicted values for the dependent variable in two ways. The first is to "display" $\widehat{gnipc} = \hat{\beta}_0 + \hat{\beta}_1 \cdot daysopen$ for daysopen=5 and daysopen=200.

Stata commands:

. display _b[daysopen]*5+_b[_cons]
10.894018

. display _b[daysopen]*200+_b[_cons]
-7.5347099
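The hand calculation asked for in the question follows directly from the coefficients reported in (iii). Written out as a worked example (values in thousand US $):

```latex
\begin{aligned}
\widehat{gnipc}(5)   &= 11.36655 - 0.0945063 \times 5   = 11.36655 - 0.47253  \approx 10.894 \\
\widehat{gnipc}(200) &= 11.36655 - 0.0945063 \times 200 = 11.36655 - 18.90126 \approx -7.535
\end{aligned}
```

The first value lies within the observed range of gnipc (0.09 to 43.35). The second is negative, which is impossible for an income level: it comes from extending a straight line to the far end of the daysopen range, where only one country (Haiti) is observed.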

Alternatively, generate the fitted values of the dependent variable:

. reg gnipc daysopen
. predict gnipc_hat

. list gnipc_hat if daysopen==5

       gnipc_~t
  4.   10.89402

. list gnipc_hat if daysopen==200

A problem arises with this second method: no observation has daysopen=200, so the list is empty and the value of gnipc_hat for daysopen=200 cannot be obtained this way.
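One possible workaround, offered here as a suggestion rather than as the course's own solution, is to ask Stata for the prediction at a chosen value of daysopen instead of looking it up among the observations. Both `margins` and `lincom` are standard post-estimation commands; neither appears in the original answer.

```stata
* Sketch: predictions at values of daysopen that need not occur in the data
use busind, clear
regress gnipc daysopen

* Predicted gnipc at daysopen = 5 and daysopen = 200, with standard errors
margins, at(daysopen=(5 200))

* The same prediction for daysopen = 200 as a linear combination of coefficients
lincom _cons + 200*daysopen
```

The point estimates should match the values obtained with display; the extra benefit is that both commands also report a standard error and confidence interval for the prediction.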

 To illustrate our fitted values, we can draw the OLS regression line: scatter gnipc daysopen||lfit gnipc daysopen

[Figure: scatter plot of gross national income ('000 US $) against the number of days to open a business (0 to 200), with the line of fitted values.]

Question (vi)

Estimate the following simple regression model and give a careful interpretation of $\beta_1$:

$gnipc = \beta_0 + \beta_1 \, daysenforce + u$

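Answer to Question (vi)

Stata command:

. reg gnipc daysenforce

      Source |       SS       df       MS              Number of obs =     129
-------------+------------------------------           F(  1,   127) =   33.90
       Model |  2768.61703     1  2768.61703           Prob > F      =  0.0000
    Residual |   10372.442   127  81.6727718           R-squared     =  0.2107
-------------+------------------------------           Adj R-squared =  0.2045
       Total |  13141.0591   128  102.664524           Root MSE      =  9.0373

------------------------------------------------------------------------------
       gnipc |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
 daysenforce |  -.0286796   .0049258    -5.82   0.000    -.0384269   -.0189322
       _cons |   16.66315   1.912056     8.71   0.000     12.87954    20.44676
------------------------------------------------------------------------------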
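Because a single day is a small change relative to a mean of about 353 days, the slope is easier to read when scaled. A worked example using only the coefficients reported above (gnipc in thousand US $):

```latex
\widehat{gnipc} = 16.663 - 0.0287 \cdot daysenforce,
\qquad
\Delta\widehat{gnipc} = -0.0286796 \times 100 \approx -2.87 \quad \text{for 100 additional days}
```

One additional day to enforce a contract is associated with a predicted GNI per capita lower by about 28.7 US $, and 100 additional days with about 2,870 US $ less. The estimate is based on the 129 countries for which daysenforce is observed.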

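Question (viii): regression output

. reg lngnipc daysopen

      Source |       SS       df       MS              Number of obs =     135
-------------+------------------------------           F(  1,   133) =   18.32
       Model |  42.8638154     1  42.8638154           Prob > F      =  0.0000
    Residual |  311.155695   133   2.3395165           R-squared     =  0.1211
-------------+------------------------------           Adj R-squared =  0.1145
       Total |   354.01951   134  2.64193664           Root MSE      =  1.5295

------------------------------------------------------------------------------
     lngnipc |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
    daysopen |  -.0147194   .0034388    -4.28   0.000    -.0215212   -.0079176
       _cons |     1.4396    .218617     6.59   0.000     1.007184    1.872016
------------------------------------------------------------------------------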
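In this log-level regression the slope has a percentage interpretation. The worked example below uses the coefficient on daysopen from the output above; the first expression is the usual approximation and the second the exact percentage change:

```latex
\%\Delta\widehat{gnipc} \approx 100 \cdot \hat{\beta}_1 = 100 \times (-0.0147194) \approx -1.47\%,
\qquad
100\left(e^{-0.0147194} - 1\right) \approx -1.46\%
```

One additional day needed to open a business is therefore associated with a GNI per capita that is about 1.5% lower. The approximation and the exact figure are close because the coefficient is small in absolute value.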

Question (vii)

Comparing the estimates of the models in (iii) and (vi), which one explains more of the variation in income per capita across countries? Can you infer whether the duration to open a business or the duration for enforcing contracts is more strongly correlated with income per capita?

Answer: The share of the variation in GNI per capita (y) that is explained by an independent variable is given by the $R^2$: the greater the $R^2$, the more of the variation in y is explained by x. The $R^2$ of the regression of GNI per capita on the number of days to open a business is about 13%, and the $R^2$ of the regression of GNI per capita on the number of days to enforce a contract is about 21%. The number of days to enforce a contract therefore explains more of the variation in GNI per capita, and it is more strongly correlated with income per capita than the number of days to open a business: the correlation between gnipc and daysenforce is -0.46, and the correlation between gnipc and daysopen is -0.36. Keep in mind that the two regressions are estimated on slightly different samples (129 and 135 countries).
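The link between the two $R^2$ values and the two correlations can be made explicit: in a simple regression with one regressor, $R^2$ equals the squared sample correlation between y and x, and the correlation has the sign of the slope. Using the $R^2$ values reported in (iii) and (vi):

```latex
|r_{gnipc,\,daysopen}| = \sqrt{0.1312} \approx 0.36,
\qquad
|r_{gnipc,\,daysenforce}| = \sqrt{0.2107} \approx 0.46
```

Both slopes are negative, so the correlations are -0.36 and -0.46.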

Question (viii)

Estimate the following simple regression model and give a careful interpretation of $\beta_1$:

$\log(gnipc) = \beta_0 + \beta_1 \, daysopen + u$

Answer to Question (viii)

Stata commands:

. gen lngnipc=ln(gnipc)
. reg lngnipc daysopen

Do these results allow you to draw conclusions regarding the desirability of policies aimed at reducing the number of days for opening a business in certain developing countries?

The dataset contains 135 countries, and hence does not contain information about all the countries in the world. Do you think one should account for that when interpreting the regression results? Why?

Task 2 (last year): Computer exercise

The dataset nepalind.dta contains data on 706 children aged 15 in Nepal. The data come from the 2003 Nepal Living Standards Survey (NLSS), a Living Standards Measurement Study (LSMS) survey. We want to analyze these data to understand the number of years of education. Illiteracy and low levels of education are a major concern in Nepal, so it would be useful to know which factors explain the education of the present generation, and hence which policies to implement. The dataset has some information on the characteristics of the household, of the child, and of the household head.

LSMS-type surveys such as the NLSS are nationally representative surveys that statistical offices in developing countries conduct with the support of the World Bank to determine poverty levels, determinants of poverty, etc. See www.worldbank.org/lsms for more info.

Question 1

Write a paragraph describing the dataset using the standard descriptive statistics (also called summary statistics, or “d-stats”). Add a table with the d-stats.

. su

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
      r2_sex |       706    1.473088    .4996292          1          2
r2_healths~t |       655    1.309924    .4726228          1          3
     nrchild |       706    3.478754    1.739072          1         14
     nractad |       706    3.002833    1.487534          0         13
       nrold |       706    .3427762    .6291367          0          5
    head_age |       706    46.56232    10.56877         15         85
   head_educ |       706    2.827195    4.070464          0         14
value_jewe~y |       706    13984.75    26725.61          0     400000
  distschool |       656    .2925051    .3067064   .0166667        2.5
   r2_supown |       706    .7404253    1.053909          0   9.811792
        educ |       706     5.51983    3.565441          0         16

Question (1): table of descriptive statistics

Child characteristics
  Male (%)                              52
  Health status (%)
    Good                                69.5
    Fair                                30
    Poor                                0.5
  Years of education                    5.5 (3.6)

Household characteristics
  Number of household members           6.8 (2.73)
    under 18 years old                  3.5 (1.74)
    between 18 and 59                   3.0 (1.49)
    60 or older                         0.3 (0.63)
  Age of the head                       46 (10.5)
  Education of the head                 2.8 (4.1)
  Land owned (in ha)                    0.74 (1.05)
  Value of jewelry (in rupees)          13985 (26726)
  Distance to school (in hours)         0.29 (0.31)

Number of observations                  706

Standard deviations in parentheses.

Question 2: Show the distribution of the different values of years of education in the dataset. Drop the observations that have values higher than 10. Explain why that might be a smart thing to do before doing any regression analysis.

. hist educ,discrete

(start=0, width=1)
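Before dropping anything, it can help to see exactly which values are affected. The following is a minimal sketch, assuming nepalind.dta is in the current working directory; the threshold of 10 years is the one given in the question.

```stata
* Sketch: inspect the distribution of educ before dropping observations
use nepalind, clear

* Frequency table of every value of educ
tabulate educ

* Number of observations reporting more than 10 years of education
count if educ > 10 & !missing(educ)
```

For children aged 15, more than 10 years of education is hard to reconcile with their age (the maximum in the data is 16), which points to reporting or coding errors. The drop if educ>10 used later in the answers removes 9 such observations.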

 Question (3): Specify a model that allows explaining the number of years of education as a function of father’s age, the number of active adults (between 18 and 60 years old) and the number of elderly (60 or older) and all other variables you think are interesting and appropriate.

Make sure only to include variables that are exogenous and discuss why the variables you include can be considered exogenous. Estimate the model and give a careful interpretation of each of the coefficients (sign, size, and significance!). Do you find any of your results counterintuitive?

Tips to answer question (3)

Each variable that you add to the model must be related to educ in some way and should not violate the zero conditional mean (ZCM) assumption, i.e. it must be exogenous. Ask yourself:

- Is x caused by y, i.e. is there a possibility of reverse causality?
- Does a third factor determine both x and y? In that case correlation is not causation, and x is not exogenous.
- Are u and x related for some other reason?

Candidate variables: Gender? Head's age? Number of active adults? Number of elderly? Head's education? Land owned? Distance to school? Value of jewelry? Number of children? Health?

A reasonable model to estimate:

$educ = \beta_0 + \beta_1 \, headage + \beta_2 \, headeduc + \beta_3 \, nractad + \beta_4 \, nrold + \beta_5 \, supown + \beta_6 \, dist + \beta_7 \, female + u$

 Expected signs of coefficients? Argue.


. drop if educ>10
(9 observations deleted)

. ge male= r2_sex==1 //we create a dummy, =1 if r2_sex equals 1

. reg educ head_age head_educ nractad nrold r2_supown distschool male

      Source |       SS       df       MS              Number of obs =     649
-------------+------------------------------           F(  7,   641) =   20.22
       Model |  1463.51225     7  209.073179           Prob > F      =  0.0000
    Residual |  6628.93151   641  10.3415468           R-squared     =  0.1808
-------------+------------------------------           Adj R-squared =  0.1719
       Total |  8092.44376   648  12.4883391           Root MSE      =  3.2158

------------------------------------------------------------------------------
        educ |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
    head_age |    .033665   .0135985     2.48   0.014     .0069621    .0603679
   head_educ |   .3317472   .0339146     9.78   0.000     .2651501    .3983444
     nractad |   -.093363   .0904375    -1.03   0.302    -.2709525    .0842265
       nrold |   .1106236   .2244013     0.49   0.622    -.3300268    .5512741
   r2_supown |   .3162876   .1255367     2.52   0.012     .0697746    .5628006
  distschool |  -.5371565    .418622    -1.28   0.200    -1.359193    .2848797
        male |   1.061818   .2538363     4.18   0.000     .5633668    1.560269
       _cons |   2.509268   .6762122     3.71   0.000     1.181409    3.837127
------------------------------------------------------------------------------

Question 4: What is the minimum significance level at which one can reject the hypothesis that the age of the household head does not affect education levels?

The p-value gives the smallest significance level at which a null hypothesis H0 can be rejected. In other words, a low p-value indicates that the data provide strong evidence against the tested hypothesis. The minimum significance level at which one can reject the hypothesis that the age of the household head does not affect education levels is the p-value of the test of $\beta_1 = 0$. It can be read directly from the Stata output: 1.4%.
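The p-value in the table can also be recomputed from the coefficient and its standard error. This is a minimal sketch, assuming nepalind.dta is in the working directory and repeating the data preparation used for the regression; the scalar name t_age is only illustrative.

```stata
* Sketch: two-sided p-value for H0: coefficient on head_age = 0
use nepalind, clear
drop if educ > 10
generate male = r2_sex == 1
regress educ head_age head_educ nractad nrold r2_supown distschool male

* t statistic = coefficient / standard error
scalar t_age = _b[head_age] / _se[head_age]
display "t statistic: " t_age

* ttail() returns the upper-tail probability of the t distribution
display "p-value:     " 2*ttail(e(df_r), abs(t_age))
```

The result should agree with the 0.014 reported for head_age in the regression table.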

Question (5)

Do your results allow you to conclude that the effect of the number of active adults in the household is different from the effect of the number of elderly? State the null hypothesis and the alternative hypothesis you are testing, and the significance level you are considering. Does your answer differ depending on which significance level you consider?

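Answer to Question (5)

We need to test the null hypothesis $H_0: \beta_3 = \beta_4$ against $H_1: \beta_3 \neq \beta_4$. The command "test" is all you need.

. test nractad = nrold

 ( 1)  nractad - nrold = 0

       F(  1,   641) =    0.72
            Prob > F =    0.3965

With a p-value of 0.3965, H0 cannot be rejected at the 1%, 5% or 10% level, so the answer does not depend on which of these significance levels is considered.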
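The same hypothesis can be examined as a t test on the difference between the two coefficients. The sketch below uses `lincom`, which is not part of the original answer, and assumes nepalind.dta is in the working directory.

```stata
* Sketch: H0: beta_3 - beta_4 = 0 as a test on a linear combination
use nepalind, clear
drop if educ > 10
generate male = r2_sex == 1
regress educ head_age head_educ nractad nrold r2_supown distschool male

* Coefficient on nractad minus coefficient on nrold, with its
* standard error, t statistic and confidence interval
lincom nractad - nrold
```

The point estimate is the difference of the two reported coefficients, -0.093363 - 0.1106236, or about -0.204. For a single restriction the squared t statistic equals the F statistic from test, so the p-value should be the same 0.3965.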

Question (6)

Test whether the characteristics of the household head are jointly significant. Show how to do this in Stata, and calculate the test by hand in 2 different ways. What can you conclude about the role of household head characteristics in the education of the children?

Answer to Question (6)

. test (head_age=0)(head_educ=0)

 ( 1)  head_age = 0
 ( 2)  head_educ = 0

       F(  2,   641) =   47.96
            Prob > F =    0.0000

Question 6: compute F-test

Run the unrestricted and restricted models, and compute either the SSR or the $R^2$ form of the F-statistic:

$F = \dfrac{(R^2_{ur} - R^2_{r})/q}{(1 - R^2_{ur})/(n - k - 1)}$

Here q = 2 restrictions and n - k - 1 = 641.

. reg educ head_age head_educ nractad nrold r2_supown distschool male
. scalar r2_ur=e(r2)
. scalar df=e(df_r)
. reg educ nractad nrold r2_supown distschool male
. scalar r2_r=e(r2)

. dis ((r2_ur-r2_r)/2)/((1-r2_ur)/df)
47.961109
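The second way of computing the test by hand is the SSR form, $F = \dfrac{(SSR_r - SSR_{ur})/q}{SSR_{ur}/(n - k - 1)}$. The sketch below is one possible implementation, assuming nepalind.dta is in the working directory; the scalar names are illustrative, and `if e(sample)` is added so that the restricted model uses exactly the observations of the unrestricted one.

```stata
* Sketch: SSR form of the F statistic for H0: head_age = head_educ = 0
use nepalind, clear
drop if educ > 10
generate male = r2_sex == 1

* Unrestricted model: keep its residual sum of squares and residual df
regress educ head_age head_educ nractad nrold r2_supown distschool male
scalar ssr_ur = e(rss)
scalar df_ur  = e(df_r)

* Restricted model on the same estimation sample
regress educ nractad nrold r2_supown distschool male if e(sample)
scalar ssr_r = e(rss)

* F statistic with q = 2 restrictions, and its p-value
scalar F_ssr = ((ssr_r - ssr_ur)/2) / (ssr_ur/df_ur)
display "F = " F_ssr
display "p = " Ftail(2, df_ur, F_ssr)
```

Because the two forms are algebraically equivalent when both models are estimated on the same sample, the result should reproduce the F statistic of 47.96 reported by test.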

Question (9)

                                        Non missing    Missing
Child characteristics
  Male (%)                              53             48
  Health status (%)
    Good                                69.5           74
    Fair                                30             26
    Poor                                0.5            .
  Years of education                    5.3            6.8

Household characteristics
  Number of household members           6.9            6.1
    under 18 years old                  3.5            2.8
    between 18 and 59                   3.0            2.9
    60 or older                         0.3            0.4
  Age of the head                       46.5           48.2
  Education of the head                 2.6            5.5
  Land owned (in ha)                    0.77           0.29
  Value of jewelry (in rupees)          12212          35488
  Distance to school (in hours)         0.29           .

Number of observations                  600            46