90-776 Manipulation of Large Data Sets
Homework 4 Solutions

1) Program

/* u:\class\90776\class\HW4P1.SAS does the tasks in Homework 4, problem 1 */
/* In lab 4, I took the data set l:\academic\90776\data\text\country.txt,
   created an improved comma-delimited ASCII file, then brought it into SAS
   and saved the file as u:\class\90776\data\lab4.sd2.
   This program cleans up that data set. */
/* Rob Greenbaum */
/* 4/6/1999 */

options pageno=1 ps=150 ls=150;

/* create library references for my SAS data sets */
libname class 'l:\academic\90776\data';
libname mydat 'u:\class\90776\data\';

/* 1a. First, summarize the numeric variables with PROC MEANS and PROC UNIVARIATE */
PROC MEANS DATA=MYDAT.LAB4;
RUN;

/* I can see which observations are the outliers by including an ID
   statement in my univariate procedure */
PROC UNIVARIATE DATA=MYDAT.LAB4;
   VAR YR_IND;
   ID COUNTRY;
RUN;

/* 1b. Let's set negative YR_IND observations to missing */
DATA FIXYR;
   SET MYDAT.LAB4;
   IF YR_IND < 0 THEN YR_IND = .;
RUN;

/* Let's see how the mean has changed */
PROC MEANS DATA=FIXYR;
RUN;

/* 1c. Change the missing-value codes in the character variables to blanks */
DATA FIXCHAR;
   SET FIXYR;
   ARRAY ROB[*] _CHARACTER_;
   DO I = 1 TO DIM(ROB);
      IF ROB[I] = "?" OR ROB[I] = "-" THEN ROB[I] = " ";
   END;
   DROP I;
RUN;

/* 1d. Print out the data set */
PROC PRINT DATA=FIXCHAR;
RUN;

/* 1e. Check religion */
PROC FREQ DATA=FIXCHAR;
   TABLES RELIGION;
RUN;

/* For one of the observations, Lutheran was misspelled as Lutherian.
   That is easy to fix */
DATA FIXSPELL;
   SET FIXCHAR;
   IF RELIGION = "Lutherian" THEN RELIGION = "Lutheran";
RUN;

/* Make sure we fixed the problem */
PROC FREQ DATA=FIXSPELL;
   TABLES RELIGION;
RUN;

/* 1f. Let's create an ID variable */
DATA NEWVAR;
   SET FIXSPELL;
   LENGTH CSMALL $2;                /* need to set the length of the new variable */
   CSMALL = SUBSTR(COUNTRY, 1, 2);  /* CSMALL is the first 2 chars of country name */
   ENDYR = MOD(YR_UN, 100);         /* last 2 digits of the year they entered the UN */
   YR = PUT(ENDYR, 2.);             /* convert ENDYR to character */
   IF YR = " ." THEN YR = "00";     /* convert missings to 00 */
   ID = CSMALL || YR;               /* put CSMALL and YR together */
RUN;

/* 1g. Print out the data */
PROC PRINT DATA=NEWVAR;
   VAR COUNTRY ID;
RUN;

1) Output

The SAS System                                16:05 Tuesday, April 6, 1999   1

Variable   N          Mean       Std Dev       Minimum       Maximum
--------------------------------------------------------------------
TOTALPOP  93   55462086.65     151888397      27816.00    1149667000
CAPPOP    91    2196361.60    2379700.09     134393.00   10726000.00
YR_IND    92       1796.02   395.9593435      -1000.00       1980.00
YR_UN     89       1954.08    10.6602754       1945.00       1991.00
--------------------------------------------------------------------

1)a. There is at least one observation with a negative value for YR_IND.

Univariate Procedure
Variable=YR_IND

Moments

N           92          Sum Wgts    92
Mean        1796.022    Sum         165234
Std Dev     395.9593    Variance    156783.8
Skewness    -4.69388    Kurtosis    27.93015
USS         3.1103E8    CSS         14267326
CV          22.04647    Std Mean    41.28162
T:Mean=0    43.50658    Pr>|T|      0.0001
Num ^= 0    92          Num > 0     91
M(Sign)     45          Pr>=|M|     0.0001
Sgn Rank    2134        Pr>=|S|     0.0001

Quantiles(Def=5)

100% Max    1980        99%    1980
 75% Q3     1960        95%    1964
 50% Med    1922.5      90%    1962
 25% Q1     1821.5      10%    1499
  0% Min    -1000        5%     900
                         1%   -1000
Range       2980
Q3-Q1       138.5
Mode        1960

Extremes

Lowest   ID             Highest   ID
 -1000(Ethiopia)           1964(Zambia  )
   800(Denmark )           1971(Banglade)
   836(Sweden  )           1975(Angola  )
   843(France  )           1975(Mozambiq)
   900(USSR    )           1980(Zimbabwe)

Missing Value      .
Count              1
% Count/Nobs    1.08

Variable   N          Mean       Std Dev       Minimum       Maximum
--------------------------------------------------------------------
TOTALPOP  93   55462086.65     151888397      27816.00    1149667000
CAPPOP    91    2196361.60    2379700.09     134393.00   10726000.00
YR_IND    91       1826.75   265.9089064   800.0000000       1980.00
YR_UN     89       1954.08    10.6602754       1945.00       1991.00
--------------------------------------------------------------------

1)b. The mean of YR_IND has now increased from 1796 to 1827, and the standard deviation has dropped from 396 to 266. Removing the one outlier (Ethiopia, -1000) made a large difference.

1)d. Listing of the cleaned data set (first 46 observations):

OBS  COUNTRY           TOTALPOP  CAPITAL    CAPPOP  REGION         YR_IND  YR_UN  LANG1
  1  Afghanistan       16922000  Kabul     1424400  Asia             1919   1946  Pashto
  2  Algeria           25888000  Algiers   1507241  Africa           1962   1962  Arabic
  3  Angola            10284000  Luanda    1134000  Africa           1975   1976  Portuguese
  4  Argentina         32470000  Buenos A  2922829  Latin America    1816   1945  Spanish
  5  Australia         17337000  Canberra   310000  Oceania          1901   1945  English
  6  Austria            7815000  Vienna    1487577  Europe           1918   1955  German
  7  Bangladesh       108760000  Dhaka     5731000  Asia             1971   1974  Bengali
  8  Belgium            9978000  Brussels   137966  Europe           1830   1945  Dutch
  9  Bolivia            7528000  La Paz     669400  Latin America    1825   1945  Spanish
 10  Brazil           153322000  Brasilia  1567709  Latin America    1822   1945  Portuguese
 11  Bulgaria           9005000  Sofia     1217024  Europe           1908   1955  Bulgarian
 12  Burkina Faso       9261000  Ouagadou   441514  Africa           1960   1960  French
 13  Burundi              27816  Bujumbar   226628  Africa           1962   1962  Rundi
 14  Cambodia           8781000  Phnom Pe   564000  Orient           1953   1955  Khmer
 15  Cameroon          12239000  Yaounde    649000  Africa           1960   1960  French
 16  Canada            26941000  Ottawa     863900  America          1867   1945  English
 17  Chad               5823000  H'Djamen   500000  Africa           1960   1960  Arabic
 18  Chile             13385000  Santiago  5236300  Latin America    1810   1945  Spanish
 19  China           1149667000  Beijing   6800000  Orient           1523   1945  Mandarin
 20  Colombia          33613000  Santafe   4819696  Latin America    1810   1945  Spanish
 21  Cote d'Ivorie     12464000  Abidjan   1850000  Africa           1960   1960  French
 22  Cuba              10700000  Havana    2077938  Latin America    1902   1945  Spanish
 23  Czechoslovakia    15577000  Prague    1212010  Europe           1960   1960  Czech
 24  Denmark            5146000  Copenhag  1343916  Europe            800   1945  Danish
 25  Dominican Repu       48443  Santo Do  1600000  Latin America    1844   1945  Spanish
 26  Ecuador           11079000  Quito     1094318  Latin America    1822   1945  Spanish
 27  Egypt             54609000  Cairo     6452000  Africa           1922   1945  Arabic
 28  El Salvador        5392000                  .  Latin America    1841      .  Spanish
 29  Ethiopia          51617000  Addis Ab  1495266  Africa              .   1945  Amharic
 30  Finland             500100  Helsinki   492240  Europe           1917   1955  Finnish
 31  France            56942000  Paris     2152423  Europe            843   1945  French
 32  Germany           79096000  Berlin    3376800  Europe           1955   1973  German
 33  Ghana             15509000  Accra      949100  Africa           1957   1957  English
 34  Greece            10272000  Athens     885737  Europe           1830   1945  Greek
 35  Guatemala          9177000  Guatemal  1095677  Latin America    1821   1945  Spanish
 36  Guinea             7052000  Conakry    705280  Africa           1958   1958  French
 37  Haiti              6617000  Port-au    514438  Latin America    1804   1945  Haitian Creole
 38  Hong Kong          5862000                  .  Orient              .      .  Chinese
 39  Hungary           10326000  Budapest  2016132  Europe           1918   1955  Hungarian
 40  India            871158000  New Delh  7174755  Asia             1947   1945  Hindu
 41  Indonesia        181451000  Jakarta   7829000  Oceania          1945   1950  Bahasa Indonesian
 42  Iran              57050000  Tehran    6042584  Asia             1906   1945  Persian
 43  Iraq              18317000  Baghdad   5348117  Asia             1932   1945  Arabic
 44  Italy             57590000  Rome      2803931  Europe           1861   1955  Italian
 45  Japan            123920000  Tokyo     8163127  Orient           1660   1956  Japanese
 46  Kenya             25905000  Nairobi   1504900  Africa           1963   1963  Swahili

The listing also has RELIGION, LANG2 and LANG3 columns. For OBS 1 (Afghanistan), RELIGION is Islam, LANG2 is Persian and LANG3 is blank. For OBS 2-46 these columns are mostly blank, and the row each value belongs to cannot be recovered from this copy. In listing order, the non-blank values are:
RELIGION: Islam, Catholicism, Islam, Catholicism, Buddhism, Lutherian, Islam, Greek Orthodox, Monotheism, Islam, Islam
LANG2/LANG3: French, Aymara, French, English, French, French, Slovak, Swedish, French, English, English, English, German, Quechua

1)e. RELIGION frequencies before the spelling fix:

                                       Cumulative   Cumulative
RELIGION         Frequency   Percent    Frequency      Percent
--------------------------------------------------------------
Buddhism                 2       8.7            2          8.7
Catholicism              3      13.0            5         21.7
Greek Orthodox           1       4.3            6         26.1
Hinduism                 1       4.3            7         30.4
Islam                   13      56.5           20         87.0
Lutheran                 1       4.3           21         91.3
Lutherian                1       4.3           22         95.7
Monotheism               1       4.3           23        100.0

Frequency Missing = 70

RELIGION frequencies after the spelling fix:

                                       Cumulative   Cumulative
RELIGION         Frequency   Percent    Frequency      Percent
--------------------------------------------------------------
Buddhism                 2       8.7            2          8.7
Catholicism              3      13.0            5         21.7
Greek Orthodox           1       4.3            6         26.1
Hinduism                 1       4.3            7         30.4
Islam                   13      56.5           20         87.0
Lutheran                 2       8.7           22         95.7
Monotheism               1       4.3           23        100.0

Frequency Missing = 70

1)g. COUNTRY and the new ID variable (first 49 observations):

OBS  COUNTRY         ID
  1  Afghanistan     Af46
  2  Algeria         Al62
  3  Angola          An76
  4  Argentina       Ar45
  5  Australia       Au45
  6  Austria         Au55
  7  Bangladesh      Ba74
  8  Belgium         Be45
  9  Bolivia         Bo45
 10  Brazil          Br45
 11  Bulgaria        Bu55
 12  Burkina Faso    Bu60
 13  Burundi         Bu62
 14  Cambodia        Ca55
 15  Cameroon        Ca60
 16  Canada          Ca45
 17  Chad            Ch60
 18  Chile           Ch45
 19  China           Ch45
 20  Colombia        Co45
 21  Cote d'Ivorie   Co60
 22  Cuba            Cu45
 23  Czechoslovakia  Cz60
 24  Denmark         De45
 25  Dominican Repu  Do45
 26  Ecuador         Ec45
 27  Egypt           Eg45
 28  El Salvador     El00
 29  Ethiopia        Et45
 30  Finland         Fi55
 31  France          Fr45
 32  Germany         Ge73
 33  Ghana           Gh57
 34  Greece          Gr45
 35  Guatemala       Gu45
 36  Guinea          Gu58
 37  Haiti           Ha45
 38  Hong Kong       Ho00
 39  Hungary         Hu55
 40  India           In45
 41  Indonesia       In50
 42  Iran            Ir45
 43  Iraq            Ir45
 44  Italy           It55
 45  Japan           Ja56
 46  Kenya           Ke63
 47  North Korea     No91
 48  South Korea     So91
 49  Madagascar      Ma60
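
The ID rule from part 1f (first two characters of COUNTRY, then the last two digits of YR_UN, with "00" when YR_UN is missing) can be checked on a few rows in isolation. The following is only a sketch, not part of the homework program: the data set names DEMO and DEMOID are made up for illustration, and the three rows are copied from the listings above. It is also a slight variation on the step in the program: it tests for a missing YR_UN before calling MOD (which avoids the "missing values were generated" note in the log) and uses the Z2. format, so a two-digit remainder below 10 would be written with a leading zero instead of a leading blank.

```sas
/* Hypothetical three-row test data set; values copied from the listings */
DATA DEMO;
   INFILE DATALINES DSD;        /* DSD: comma-delimited, so "El Salvador" stays intact */
   LENGTH COUNTRY $14;
   INPUT COUNTRY $ YR_UN;
   DATALINES;
Afghanistan,1946
El Salvador,.
Kenya,1963
;
RUN;

/* Build the ID the same way as part 1f, with the missing check up front */
DATA DEMOID;
   SET DEMO;
   LENGTH CSMALL $2 YR $2 ID $4;
   CSMALL = SUBSTR(COUNTRY, 1, 2);            /* first 2 chars of country name */
   IF YR_UN = . THEN YR = "00";               /* missing UN year becomes 00    */
   ELSE YR = PUT(MOD(YR_UN, 100), Z2.);       /* last 2 digits, zero-padded    */
   ID = CSMALL || YR;
   DROP CSMALL YR;
RUN;

PROC PRINT DATA=DEMOID;
   VAR COUNTRY YR_UN ID;
RUN;
```

If the sketch behaves like the homework program, the three IDs should match the listing: Af46, El00 and Ke63. Note that the rule does not guarantee unique IDs; in the listing, Chile and China are both Ch45, and Iran and Iraq are both Ir45.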