90-776 Manipulation of Large Data Sets
Homework 4 Solutions
1) Program
/* u:\class\90776\class\HW4P1.SAS performs the tasks in Homework 4, problem 1. */
/* In Lab 4, I took the data set l:\academic\90776\data\text\country.txt, created an
   improved comma-delimited ASCII file, read it into SAS, and saved it as
   u:\class\90776\data\lab4.sd2. This program cleans up that data set. */
/* Rob Greenbaum */
/* 4/6/1999 */
options pageno=1 ps=150 ls=150;
/* create library reference for my SAS data sets */
libname class 'l:\academic\90776\data';
libname mydat 'u:\class\90776\data\';
/* 1a. First, find the means (with PROC MEANS and PROC UNIVARIATE) of the numeric variables */
PROC MEANS DATA=MYDAT.LAB4;
RUN;
/* I can see which observations are the outliers by including an ID statement
   in my univariate procedure */
PROC UNIVARIATE DATA=MYDAT.LAB4;
VAR YR_IND;
ID COUNTRY;
RUN;
/* 1b. Let's set negative YR_IND observations to missing */
DATA FIXYR;
SET MYDAT.LAB4;
IF YR_IND < 0 THEN YR_IND = .;
RUN;
/* Let's see how the mean has changed */
PROC MEANS DATA=FIXYR;
RUN;
/*1c. Change the missing values for the character variables to blanks */
DATA FIXCHAR;
SET FIXYR;
ARRAY ROB[*] _CHARACTER_;
DO I=1 TO DIM(ROB);
IF ROB[I]="?" OR ROB[I]="-" THEN ROB[I] = " ";
END;
DROP I;
RUN;
/* 1d. Print out the data set */
PROC PRINT DATA=FIXCHAR;
RUN;
/* 1e. Check religion */
PROC FREQ DATA=FIXCHAR;
TABLES RELIGION;
RUN;
/* For one of the observations, Lutheran was misspelled as Lutherian. That is easy to fix. */
DATA FIXSPELL;
SET FIXCHAR;
IF RELIGION = "Lutherian" THEN RELIGION = "Lutheran";
RUN;
/* Make sure we fixed the problem */
PROC FREQ DATA=FIXSPELL;
TABLES RELIGION;
RUN;
/* 1f. Let's create an ID variable */
DATA NEWVAR;
SET FIXSPELL;
LENGTH CSMALL $2; /* set the length of the new variable explicitly */
CSMALL = SUBSTR(COUNTRY,1,2); /* CSMALL is the first 2 chars of the country name */
ENDYR = MOD(YR_UN,100); /* last 2 digits of the year the country entered the UN */
YR = PUT(ENDYR, 2.); /* convert ENDYR to character; a missing ENDYR becomes " ." */
IF YR = " ." THEN YR = "00"; /* convert missings to 00 */
ID = CSMALL||YR; /* put CSMALL and YR together */
RUN;
/*1g. print out the data */
PROC PRINT DATA=NEWVAR ;
VAR COUNTRY ID;
RUN;
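The ID logic in step 1f can be tried on a few rows without the course library. The following is only a sketch under stated assumptions: the data set name ID_DEMO and the three inline rows are made up for illustration (the values are taken from the listing in this handout), and it swaps in the Z2. format with an explicit MISSING() test in place of the program's 2. format and " ." comparison, so that a two-digit remainder below 10 would keep its leading zero.

```sas
/* Hypothetical demo data set; not part of the homework program */
data id_demo;
  length country $14 csmall $2 yr $2 id $4;
  infile datalines dlm=',';
  input country $ yr_un;
  csmall = substr(country, 1, 2);           /* first 2 characters of the name */
  if missing(yr_un) then yr = '00';         /* no UN entry year recorded      */
  else yr = put(mod(yr_un, 100), z2.);      /* last 2 digits, zero-padded     */
  id = csmall || yr;
  datalines;
Afghanistan,1946
El Salvador,.
Hong Kong,.
;
run;

proc print data=id_demo;
  var country yr_un id;
run;
```

For these three rows the IDs should agree with the ones in the 1g listing (Af46, El00, Ho00). Testing MISSING() before the conversion also avoids the log note that MOD would generate for a missing argument.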
1) Output
The SAS System                                 16:05 Tuesday, April 6, 1999

Variable    N           Mean        Std Dev        Minimum        Maximum
-------------------------------------------------------------------------
TOTALPOP   93    55462086.65      151888397       27816.00     1149667000
CAPPOP     91     2196361.60     2379700.09      134393.00    10726000.00
YR_IND     92        1796.02    395.9593435       -1000.00        1980.00
YR_UN      89        1954.08     10.6602754        1945.00        1991.00
-------------------------------------------------------------------------
1)a. There is at least one observation with a negative value for YR_IND.
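One way to list the offending observation directly, rather than reading it off the Extremes table, is a WHERE filter. This is a minimal sketch: YR_DEMO is a hypothetical three-row stand-in built from values that appear in this handout, not the actual MYDAT.LAB4 file. Because SAS orders a numeric missing value below every number, the condition YR_IND < 0 on its own would also select missing values, so the sketch excludes them explicitly.

```sas
/* Hypothetical stand-in for MYDAT.LAB4 with three rows from the listing */
data yr_demo;
  length country $14;
  infile datalines dlm=',';
  input country $ yr_ind;
  datalines;
Denmark,800
Ethiopia,-1000
Hong Kong,.
;
run;

/* Show only genuinely negative years; missing values are filtered out */
proc print data=yr_demo;
  where yr_ind < 0 and not missing(yr_ind);
  var country yr_ind;
run;
```

With these rows only Ethiopia is printed. The same ordering rule is why the statement IF YR_IND < 0 THEN YR_IND = .; in step 1b is harmless for observations that are already missing.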
Univariate Procedure

Variable=YR_IND

                    Moments

N                 92    Sum Wgts          92
Mean        1796.022    Sum           165234
Std Dev     395.9593    Variance    156783.8
Skewness    -4.69388    Kurtosis    27.93015
USS         3.1103E8    CSS         14267326
CV          22.04647    Std Mean    41.28162
T:Mean=0    43.50658    Pr>|T|        0.0001
Num ^= 0          92    Num > 0           91
M(Sign)           45    Pr>=|M|       0.0001
Sgn Rank        2134    Pr>=|S|       0.0001

                Quantiles(Def=5)

100% Max        1980    99%             1980
 75% Q3         1960    95%             1964
 50% Med      1922.5    90%             1962
 25% Q1       1821.5    10%             1499
  0% Min       -1000     5%              900
                         1%            -1000
Range           2980
Q3-Q1          138.5
Mode            1960

                    Extremes

Lowest    ID              Highest    ID
 -1000(Ethiopia)             1964(Zambia  )
   800(Denmark )             1971(Banglade)
   836(Sweden  )             1975(Angola  )
   843(France  )             1975(Mozambiq)
   900(USSR    )             1980(Zimbabwe)

Missing Value           .
Count                   1
% Count/Nobs         1.08
Variable    N           Mean        Std Dev        Minimum        Maximum
-------------------------------------------------------------------------
TOTALPOP   93    55462086.65      151888397       27816.00     1149667000
CAPPOP     91     2196361.60     2379700.09      134393.00    10726000.00
YR_IND     91        1826.75    265.9089064    800.0000000        1980.00
YR_UN      89        1954.08     10.6602754        1945.00        1991.00
-------------------------------------------------------------------------
1)b. The mean of YR_IND has increased from 1796 to 1827, and its standard deviation has fallen from about 396 to about 266. Setting the one negative outlier to missing made a large difference.
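The size of the change can be checked by hand from the PROC UNIVARIATE output, which reports N = 92 and Sum = 165234 for YR_IND before the fix. As a worked check (the only assumption is that the single value set to missing is the -1000 shown in the Extremes table):

```latex
\[
\bar{x}_{\text{before}} = \frac{165234}{92} \approx 1796.02,
\qquad
\bar{x}_{\text{after}} = \frac{165234 - (-1000)}{92 - 1} = \frac{166234}{91} \approx 1826.75
\]
```

Both values agree with the means that PROC MEANS reports before and after step 1b.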
OBS  COUNTRY           TOTALPOP  CAPITAL    CAPPOP  REGION         YR_IND  YR_UN  LANG1
  1  Afghanistan       16922000  Kabul     1424400  Asia             1919   1946  Pashto
  2  Algeria           25888000  Algiers   1507241  Africa           1962   1962  Arabic
  3  Angola            10284000  Luanda    1134000  Africa           1975   1976  Portuguese
  4  Argentina         32470000  Buenos A  2922829  Latin America    1816   1945  Spanish
  5  Australia         17337000  Canberra   310000  Oceania          1901   1945  English
  6  Austria            7815000  Vienna    1487577  Europe           1918   1955  German
  7  Bangladesh       108760000  Dhaka     5731000  Asia             1971   1974  Bengali
  8  Belgium            9978000  Brussels   137966  Europe           1830   1945  Dutch
  9  Bolivia            7528000  La Paz     669400  Latin America    1825   1945  Spanish
 10  Brazil           153322000  Brasilia  1567709  Latin America    1822   1945  Portuguese
 11  Bulgaria           9005000  Sofia     1217024  Europe           1908   1955  Bulgarian
 12  Burkina Faso       9261000  Ouagadou   441514  Africa           1960   1960  French
 13  Burundi              27816  Bujumbar   226628  Africa           1962   1962  Rundi
 14  Cambodia           8781000  Phnom Pe   564000  Orient           1953   1955  Khmer
 15  Cameroon          12239000  Yaounde    649000  Africa           1960   1960  French
 16  Canada            26941000  Ottawa     863900  America          1867   1945  English
 17  Chad               5823000  H'Djamen   500000  Africa           1960   1960  Arabic
 18  Chile             13385000  Santiago  5236300  Latin America    1810   1945  Spanish
 19  China           1149667000  Beijing   6800000  Orient           1523   1945  Mandarin
 20  Colombia          33613000  Santafe   4819696  Latin America    1810   1945  Spanish
 21  Cote d'Ivorie     12464000  Abidjan   1850000  Africa           1960   1960  French
 22  Cuba              10700000  Havana    2077938  Latin America    1902   1945  Spanish
 23  Czechoslovakia    15577000  Prague    1212010  Europe           1960   1960  Czech
 24  Denmark            5146000  Copenhag  1343916  Europe            800   1945  Danish
 25  Dominican Repu       48443  Santo Do  1600000  Latin America    1844   1945  Spanish
 26  Ecuador           11079000  Quito     1094318  Latin America    1822   1945  Spanish
 27  Egypt             54609000  Cairo     6452000  Africa           1922   1945  Arabic
 28  El Salvador        5392000                  .  Latin America    1841      .  Spanish
 29  Ethiopia          51617000  Addis Ab  1495266  Africa              .   1945  Amharic
 30  Finland             500100  Helsinki   492240  Europe           1917   1955  Finnish
 31  France            56942000  Paris     2152423  Europe            843   1945  French
 32  Germany           79096000  Berlin    3376800  Europe           1955   1973  German
 33  Ghana             15509000  Accra      949100  Africa           1957   1957  English
 34  Greece            10272000  Athens     885737  Europe           1830   1945  Greek
 35  Guatemala          9177000  Guatemal  1095677  Latin America    1821   1945  Spanish
 36  Guinea             7052000  Conakry    705280  Africa           1958   1958  French
 37  Haiti              6617000  Port-au    514438  Latin America    1804   1945  Haitian Creole
 38  Hong Kong          5862000                  .  Orient              .      .  Chinese
 39  Hungary           10326000  Budapest  2016132  Europe           1918   1955  Hungarian
 40  India            871158000  New Delh  7174755  Asia             1947   1945  Hindu
 41  Indonesia        181451000  Jakarta   7829000  Oceania          1945   1950  Bahasa Indonesian
 42  Iran              57050000  Tehran    6042584  Asia             1906   1945  Persian
 43  Iraq              18317000  Baghdad   5348117  Asia             1932   1945  Arabic
 44  Italy             57590000  Rome      2803931  Europe           1861   1955  Italian
 45  Japan            123920000  Tokyo     8163127  Orient           1660   1956  Japanese
 46  Kenya             25905000  Nairobi   1504900  Africa           1963   1963  Swahili

RELIGION, nonmissing values for observations 1-46 in the order printed (observation 1 first; the row positions of the others are not preserved in this listing):
Islam, Islam, Catholicism, Islam, Catholicism, Buddhism, Lutherian, Islam, Greek Orthodox, Monotheism, Islam, Islam

LANG2 and LANG3, nonmissing values in the order printed (observation 1 first; the row positions of the others are not preserved in this listing):
Persian, French, Aymara, French, English, French, French, Slovak, Swedish, French, English, English, English, German, Quechua

                                        Cumulative  Cumulative
RELIGION          Frequency   Percent    Frequency     Percent
--------------------------------------------------------------
Buddhism                  2       8.7            2         8.7
Catholicism               3      13.0            5        21.7
Greek Orthodox            1       4.3            6        26.1
Hinduism                  1       4.3            7        30.4
Islam                    13      56.5           20        87.0
Lutheran                  1       4.3           21        91.3
Lutherian                 1       4.3           22        95.7
Monotheism                1       4.3           23       100.0

Frequency Missing = 70
                                        Cumulative  Cumulative
RELIGION          Frequency   Percent    Frequency     Percent
--------------------------------------------------------------
Buddhism                  2       8.7            2         8.7
Catholicism               3      13.0            5        21.7
Greek Orthodox            1       4.3            6        26.1
Hinduism                  1       4.3            7        30.4
Islam                    13      56.5           20        87.0
Lutheran                  2       8.7           22        95.7
Monotheism                1       4.3           23       100.0

Frequency Missing = 70
OBS  COUNTRY         ID
  1  Afghanistan     Af46
  2  Algeria         Al62
  3  Angola          An76
  4  Argentina       Ar45
  5  Australia       Au45
  6  Austria         Au55
  7  Bangladesh      Ba74
  8  Belgium         Be45
  9  Bolivia         Bo45
 10  Brazil          Br45
 11  Bulgaria        Bu55
 12  Burkina Faso    Bu60
 13  Burundi         Bu62
 14  Cambodia        Ca55
 15  Cameroon        Ca60
 16  Canada          Ca45
 17  Chad            Ch60
 18  Chile           Ch45
 19  China           Ch45
 20  Colombia        Co45
 21  Cote d'Ivorie   Co60
 22  Cuba            Cu45
 23  Czechoslovakia  Cz60
 24  Denmark         De45
 25  Dominican Repu  Do45
 26  Ecuador         Ec45
 27  Egypt           Eg45
 28  El Salvador     El00
 29  Ethiopia        Et45
 30  Finland         Fi55
 31  France          Fr45
 32  Germany         Ge73
 33  Ghana           Gh57
 34  Greece          Gr45
 35  Guatemala       Gu45
 36  Guinea          Gu58
 37  Haiti           Ha45
 38  Hong Kong       Ho00
 39  Hungary         Hu55
 40  India           In45
 41  Indonesia       In50
 42  Iran            Ir45
 43  Iraq            Ir45
 44  Italy           It55
 45  Japan           Ja56
 46  Kenya           Ke63
 47  North Korea     No91
 48  South Korea     So91
 49  Madagascar      Ma60