Kaz SAS

Kaz's SAS manual
To liberate Research Assistants of the World
Version 11/6/2004
by Kazuaki Uekawa, Ph.D.
kuekawa@alumni.uchicago.edu
Copyright © 2002 By Kazuaki Uekawa. All rights reserved.

Profile: Kazuaki (Kaz) Uekawa, Ph.D.

I am from Japan, but I have been in the US for about ten years. In 2000 I got my doctorate in Sociology at the University of Chicago. While working for a research project led by Charles Bidwell and Anthony Bryk, I learned SAS. Currently I am a research analyst at AIR, American Institutes for Research, located in Washington DC.

I am also a professional writer with a fairly large audience. I design and write essays that are used in the ESL industry in Japan, i.e., by those who are in the business of testing students' English competency. My favorite essay, used as material for practicing reading English, is about how Japanese boys collect beetles and exchange them among themselves just like baseball cards in America. I also wrote about how Japanese, Americans, and Mexicans have different techniques to cure common colds. I practice what literary theorists call "deconstruction," which is to doubt what is taken for granted in a culture and show how strange and arbitrary the cultural practice may appear to people outside the culture.

On weekends I am writing a book on English pronunciation. I discovered techniques that let Japanese people pronounce English phonemes correctly at their first attempt, without any practice. I have decided that the linguistic theory that says adult learners of a foreign language cannot pronounce things correctly is just an excuse. I plan to rock the non-English speaking world with my book as soon as I can in the year 2005. I believe this will be bigger than the Beatles' revolution for the impact it has on the non-English speaking communities of this globe.
Immediately after Japanese people read my book, they will be able to tell the differences between "wonder" and "wander" or "lice" and "rice."

Table of Contents

I. Basic Operations
   1. Ask questions to SAS by emailing support@sas.com
   2. How do I start and what mini-windows do I look at?
   3. How do I look at data sets?
   4. Assigning library name and create folders
   5. How do we create SAS data?
      A) Create a SAS data set using SAS syntax
      B) Create SAS data via MS-Excel sheets
      C) Create a SAS data set via an external text file
   6. Examples of data steps
   7. Manipulating variables in data steps
   8. Lots of manipulation techniques to be used in a data step
   9. Using character functions to create new variables
   10. Application: How do we restrict analytical samples using NMISS function
II. Procedures
   11. PROC CONTENTS: Description of Contents
   12. PROC PRINT: See Data
   13. PROC SORT: Sorting Observations based on a value of variable
   14. PROC MEANS: Get Descriptive Statistics (Mean, STD, Min, Max)
   15. PROC FREQ: Get Frequencies
   16. PROC UNIVARIATE: Get elaborate statistics and a univariate plot
   17. PROC PLOT: Plotting Two Variables
   18. PROC TIMEPLOT: Time Plot
   19. PROC CORR: Correlation
   20. PROC REG: OLS Regression
   21. PROC LOGISTIC: Logistic Regression
   22. MAKE AN ASCII FILE
III. More Procedures
   23. PROC STANDARD: Standardize Values
   24. PROC RANK: Rank observations
   25. PROC SQL: Creating group-level mean variables
   26. PROC IMPORT
IV. Merging Data Sets
V. MACROs
   27. Typical Macro: I use this most often
   28. LET MACRO: looks useful and it is useful, but in a limited way
VI. ODS and PROC EXPORT
VII. Application: Do PROC MEANS and save results as excel sheet using ODS
VIII. Application: Read from many tables embedded within Excel sheets

I. Basic Operations

1. Ask questions to SAS by emailing support@sas.com

When you have a question about SAS, you can email SAS Institute's technical support team. The address is support@sas.com.
At the beginning of your email, copy the information that appears at the head of your log file. The log file is what you get when you run SAS. It looks like this:

NOTE: Copyright (c) 1999-2001 by SAS Institute Inc., Cary, NC, USA.
NOTE: SAS (r) Proprietary Software Release 8.2 (TS2M0)
      Licensed to UNIVERSITY OF XXXXX, Site XXXXX.
NOTE: This session is executing on the WIN_ME platform.

I developed my SAS skills mostly by communicating with the SAS tech team. I also often use Google to get answers to my questions.

2. How do I start and what mini-windows do I look at?

In Windows, you can start SAS by going to START > ALL PROGRAMS > The SAS System. Confirm that you get three windows.

1. Editor. This is where you write your syntax.
2. Log. This window shows your errors.
3. Output. You get results in this window.

On the toolbar, click the running-man icon to run your program. Click the ! mark to cancel while the program is running. Click Explorer to look at the data sets (see the next section).

3. How do I look at data sets?

This syntax (type it into the editor) gets you an example data set to look at.

data abcd;
set sashelp.Prdsale;
run;

To open the data set and examine it, follow these four steps:

1. Click Explorer.
2. Click Libraries.
3. Click Work or other folders.
4. Click the data set.

Notes: I look at the data sets to check if there is anything wrong with them. Look closely for any irregularity in the data. You must close the data set before you run anything else if the syntax you wrote affects that data set.

4. Assigning library name and create folders

You need libname statements at the head of your SAS programs. With these, you assign nicknames (library names) to the folders that host your SAS data sets.
For example:

libname here "C:\TEMP";
libname there "C:\";

Running the above creates two folders, "here" and "there", under Libraries in the Explorer view (see section 3 for how to get to this view).

Imagine you want a data set called MYDATA in C:\TEMP. You can create it in this way:

libname here "C:\TEMP";
data here.MYDATA;
X=1;
run;

This silly data set has one observation and one variable, X, whose value is 1. Because you decided to call that folder by the nickname HERE, you refer to the data set as "here.MYDATA". For example, to print the contents of that data set, do this:

proc print data=here.MYDATA;
run;

To see what variables are in the data, do this:

proc contents data=here.MYDATA;
run;

What are the other folders? Sashelp hosts lots of data sets that SAS Institute ships with the SAS software for demonstration's sake. I have never opened Sasuser or Maps. "Work" hosts temporary data sets that you create as you program in SAS. Temporary data sets disappear when you close SAS. Permanent data sets, on the other hand, are the data sets you create to keep even after you quit SAS. The examples below elaborate on these points.

Here is some silly example syntax to show you what the folders do and what temporary and permanent data sets are.

/*libname statements just need to occur at the beginning of the syntax file*/
libname here "C:\TEMP";
libname there "C:\";

/*this creates a data set called Wally in the WORK folder*/
data Wally;
x=1; y=2; z=3;
run;

/*this creates a data set called Wally in the HERE folder*/
data here.Wally;
x=4; y=5; z=6;
run;

/*this creates a data set called Wally in the THERE folder*/
data there.Wally;
x=7; y=8; z=9;
run;

In the Explorer, click on these folders to find the different "Wally" data sets.

/*Use proc print to see the content of the data sets*/
proc print data=work.Wally; run;
proc print data=here.Wally; run;
proc print data=there.Wally; run;

The following does the same as proc print data=work.Wally ("work." can be omitted in this way; I always omit it):

proc print data=Wally; run;

When data= is not specified, SAS uses the most recently created data set:

proc print; run;

5. How do we create SAS data?

A) Create a SAS data set using SAS syntax

Of course you can create data in your syntax.

libname here "C:\";
data kaz;
input ID 1 SEX $ 4-9 height 13-15;
cards;
1  Male     170
2  Female   165
3  Male
4  Male     168
5  Female   170
;
run;
proc print;
run;

When a value is missing, it is safe to enter a dot instead of leaving it empty as in the third data line. But empty is also okay here because the INPUT statement explicitly tells SAS where to find the values for each variable (e.g., height 13-15).

After creating a data set, you want to look at it to see if there is anything wrong. Because this is a small data set, you can use PROC PRINT to print it in your output window. The other useful way is to click on the actual SAS data set to see its content, as explained in section 3.

B) Create SAS data via MS-Excel sheets

Put the variable names in the first row of the Excel sheet. Then use this syntax to import the Excel sheet (C:\mary.xls) as a SAS data set (JOHN):

PROC IMPORT OUT=JOHN
DATAFILE="C:\mary.xls"
DBMS=EXCEL2000 REPLACE;
RUN;

/*This one ignores variable names. It also specifies the sheet from which to take data*/
PROC IMPORT OUT=JOHN
DATAFILE="C:\mary.xls"
DBMS=EXCEL2000 REPLACE;
GETNAMES=NO;
SHEET="Sheet1";
RUN;

Be sure to close the Excel sheet before you run the syntax to import it. Otherwise, you get this error message:

ERROR: File _IMEX_.'Sheet1$'n.DATA does not exist.
ERROR: Import unsuccessful.
       See SAS Log for details.
NOTE: The SAS System stopped processing this step because of errors.

C) Create a SAS data set via an external text file

Imagine you have a text file (say, kaz.txt) in your C:\TEMP folder that holds lines of data like the ones in A). It's okay for a value to be missing. A dot "." is often used to indicate a missing value, though. It is safer that way.

If you know exactly where the data points are in the file, you can indicate the locations in the following way.

/*truncover: if a line ends before column 15 (e.g., height left empty), height is set to missing instead of SAS reading on into the next line*/
data kaz;
infile "C:\TEMP\kaz.txt" truncover;
input ID 1 SEX $ 4-9 height 13-15;
run;
proc print;
run;

$ indicates that SEX is a character variable. SAS always needs to know whether a variable is character or numeric.

If a character variable is just one word (e.g., Male), then we don't really need to tell SAS the exact locations. SAS will treat each block of characters separated by blanks as one value. But you need to say "missover", so that when SAS does not find a value at an expected place (as in the third observation in this data set), it treats it as a missing value instead of going to the next line. If a character variable contains more than one word, use the method above instead of the one below.

libname here "C:\TEMP";
data kaz;
infile "C:\TEMP\kaz.txt" missover;
input ID SEX $ height;
run;
proc print;
run;

Data Steps and Creating New Variables

6. Examples of data steps

Any SAS program consists of two elements. One is DATA STEPS and the other is PROCs (such as proc print or proc means). I discuss data steps in this chapter. I show you some variations of data steps, so you understand them by examples.

libname here "C:\TEMP";
libname there "C:\";

/*I am creating a new temporary data set XYZ (to be found in the WORK folder) based on an already existing temporary data set called ABC (found in the WORK folder).*/
data xyz;
set abc;
/*here manipulation of data */
run;

/*I am creating a new temporary data set ABC (in the WORK folder) based on an already existing temporary data set called ABC (in the WORK folder). The old ABC will be overwritten by the new ABC. This is perfectly okay.*/
data abc;
set abc;
/*here manipulation of data */
run;

/*I am creating a new temporary data set XYZ based on an already existing permanent data set called ABC (found in the HERE folder, which is C:\TEMP).*/
data xyz;
set here.abc;
/*here manipulation of data */
run;

/*I am creating a new permanent data set ABC in the HERE folder (which is C:\TEMP) based on an already existing temporary data set called XYZ.*/
data here.abc;
set xyz;
/*here manipulation of data */
run;

/*I am creating a new permanent data set ABC in the THERE folder (which is C:\) based on an already existing permanent data set called ABC in the HERE folder (which is C:\TEMP).*/
data there.abc;
set here.abc;
/*here manipulation of data */
run;

Reminder:
Temporary data sets: Found in the WORK folder. They disappear when a session ends.
Work folder: Click on Explorer > Libraries > Work.
The HERE folder and THERE folder: HERE and THERE are arbitrary names that I assigned with the LIBNAME statements. They refer to the paths that I specified.

7. Manipulating variables in data steps

We use a SAS sample data set, sashelp.Class (a data set called Class stored in the SASHELP folder), to practice creating new variables. Do this to find out what this data set has:

proc contents data=sashelp.Class;
run;

You get the information below, telling you that the data set has Age, Height, Name, Sex, and Weight.

#  Variable  Type  Len  Pos
---------------------------
3  Age       Num     8    0
4  Height    Num     8    8
1  Name      Char    8   24
2  Sex       Char    1   32
5  Weight    Num     8   16

Here is a sample of how you can work on this data set to create a Body Mass Index (BMI), as well as other useful variables. You always need to create a new data set to create new variables.

data ABC;
set sashelp.Class;
/*Convert weight to kilograms and height to meters (assuming the source values are in pounds and inches), then compute BMI*/
weight_metric=weight*0.45359237;
height_metric=(height*2.54)/100;
BMI=weight_metric/(height_metric**2);
/*Creating a character variable indicating a person's BMI status
Normal weight = 18.5-24.9
Overweight = 25-29.9
Obesity = BMI of 30 or greater
*/
length status $ 15;
if BMI < 18.5 then status="Underweight";
if BMI >= 18.5 and BMI < 25 then status="Normal";
if BMI >= 25 and BMI < 30 then status="Overweight";
if BMI >= 30 then status="Obese";
run;

Without the length statement, SAS would set the length of the character variable to the length of the first value assigned to it in the program, which would be "Underweight" in this case.

I have cleaned up this document up to here. I am still working on the rest.

The rest of this manual is based on this data set: http://www.estat.us/sas/kazclass.txt

Download the digital version of this document and cut and paste the following data. The data come from TIMSS (Third International Mathematics and Science Study). MAT7 is 7th graders' and MAT8 is 8th graders' nation-mean mathematics score. NATEXAM is 1 when a nation has a national examination system, NATTEXT is 1 if a nation decides on textbooks at the national level, and NATSYLB is 1 when a nation decides on the syllabus at the national level. BLOCK is a geographical area. PROP is the proportion of kids in middle school.
/*NATION is read from columns 6-14 and NAME from columns 15-33, so keep the data lines aligned exactly as below*/
data kaz;
input acro $ NATION $ 6-14 NAME $ 15-33 MAT7 MAT8 GNP14 PROP NATEXAM NATSYLB NATTEXT block $;
cards;
aus  Australi Australia          498 529.63 -0.15526 84 0 1 0 ocea
aut  Austria  Austria            509 539.43 -0.29163 100 0 0 1 weuro
bfl  Belgi_FL Belgium (Fl)       558 565.18 -0.25157 100 1 1 0 weuro
bfr  Belgi_FR Belgium (Fr)       507 526.26 -0.25157 100 0 1 0 weuro
can  Canada   Canada             494 527.24 0.07184 88 0 0 0 namer
col  Colombia Colombia           369 384.76 -0.23699 62 0 1 0 samer
cyp  Cyprus   Cyprus             446 473.59 -0.41906 95 0 1 1 seuro
csk  Czech    Czech Republic     523 563.75 -0.34840 86 0 1 0 eeuro
dnk  Denmark  Denmark            465 502.29 -0.34057 100 1 0 0 weuro
fra  France   France             492 537.83 0.55791 100 0 1 0 weuro
deu  Germany  Germany            484 509.16 0.91992 100 0 0 0 weuro
grc  Greece   Greece             440 483.90 -0.32620 99 0 1 1 seuro
hkg  HongKong Hong Kong          564 588.02 -0.31638 98 1 1 1 seasia
hun  Hungary  Hungary            502 537.26 -0.37602 81 0 0 0 eeuro
isl  Iceland  Iceland            459 486.78 -0.42606 100 0 0 0 neuro
irn  Iran     Iran, Islamic Rep. 401 428.33 -0.17095 66 0 1 1 meast
irl  Ireland  Ireland            500 527.40 -0.38919 100 1 1 0 weuro
isr  Israel   Israel             . 521.59 -0.35464 87 0 1 0 meast
jpn  Japan    Japan              571 604.77 1.85543 96 0 1 0 seasia
kor  Korea    Korea              577 607.38 -0.01168 93 0 1 1 seasia
kwt  Kuwait   Kuwait             . 392.18 -0.40359 60 0 1 1 meast
lva  Latvia   Latvia (LSS)       462 493.36 -0.42319 87 0 0 0 eeuro
ltu  Lithuani Lithuania          428 477.23 -0.41785 78 1 1 1 eeuro
nld  Netherla Netherlands        516 540.99 -0.18184 93 1 0 0 weuro
nzl  NewZeala New Zealand        472 507.80 -0.38319 100 1 1 0 ocea
nor  Norway   Norway             461 503.29 -0.35450 100 0 1 1 neuro
prt  Portugal Portugal           423 454.45 -0.32588 81 0 1 0 weuro
rom  Romania  Romania            454 481.55 -0.35396 82 1 1 1 eeuro
rus  RussianF Russian Federation 501 535.47 0.12827 88 1 0 0 eeuro
sco  Scotland Scotland           463 498.46 0.48017 100 0 0 0 weuro
sgp  Singapor Singapore          601 643.30 -0.37279 84 1 1 1 seasia
slv  SlovakRe Slovak Republic    508 547.11 -0.40217 89 0 1 0 eeuro
svn  Slovenia Slovenia           498 540.80 -0.41310 85 0 1 1 eeuro
esp  Spain    Spain              448 487.35 0.03461 100 0 1 1 weuro
swe  Sweden   Sweden             477 518.64 -0.30049 99 0 1 0 neuro
che  Switzerl Switzerland        506 545.44 -0.27916 91 0 0 0 weuro
tha  Thailand Thailand           495 522.37 -0.14533 37 0 1 1 seasia
usa  USA      United States      476 499.76 5.37506 97 0 0 0 namer
;
run;
proc print;
run;

8. Lots of manipulation techniques to be used in a data step

data abc;
set sashelp.Class;
var1=height+weight;
var2=sum(of height weight);
var3=weight-height;
var4=height*weight;
var5=height/weight;
var6=1/(height+weight);
var7=mean(of height weight);
var7B=mean(height, weight); /*this way is okay too*/
var8=max(of height weight);
var9=min(of height weight);
var10=log(height);
var11=abs(var3); /*Absolute values: this takes out negative signs*/
var12=nmiss(of height weight); /*N of missing values*/
var13=n(of height weight); /*N of non-missing values*/
run;
proc print;
run;

How is Z=mean(of X1 X2 X3) different from Z=(X1+X2+X3)/3;?
How is Z=sum(of X1 X2 X3) different from Z=X1+X2+X3;?

Functions such as mean(of ...) or sum(of ...) compute statistics over the non-missing values. They return a value even when some of the variables in the parentheses are missing.
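The difference can be checked directly. Below is a minimal sketch, assuming a hypothetical one-row data set (the names check, z_mean_fn, z_mean_ar, z_sum_fn, and z_sum_ar are made up for this illustration and appear nowhere else in this manual):

```sas
/* Hypothetical one-row data set: x1 is missing on purpose */
data check;
x1=.; x2=4; x3=6;
z_mean_fn=mean(of x1 x2 x3); /* function: averages the non-missing values x2 and x3 */
z_mean_ar=(x1+x2+x3)/3;      /* arithmetic: missing, because x1 is missing */
z_sum_fn=sum(of x1 x2 x3);   /* function: adds the non-missing values x2 and x3 */
z_sum_ar=x1+x2+x3;           /* arithmetic: missing, because x1 is missing */
run;
proc print data=check;
run;
```

In the printout, the two function results hold numbers while the two arithmetic results show a dot. The log also carries a note that missing values were generated, which is a useful warning sign in real programs.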
For example, if X1 is missing:

X=mean(of X1 X2 X3);

will return the average of X2 and X3. In contrast,

X=(X1+X2+X3)/3;

will return a missing value, namely, ".".

9. Using Character Functions to create new variables

data abc;
set sashelp.Class;
var1=name||sex;
var2=compress(name||sex); /*COMPRESS gets rid of the spaces in between*/
var3=substr(name,1,3); /*take 3 letters starting from the first letter*/
var4=upcase(name); /*upper case*/
run;
proc print;
run;

10. Application: How do we restrict analytical samples using NMISS function

When we compare several regression models (e.g., coefficients, R-squared, goodness-of-fit, etc.), we want to keep the number of observations the same across the models. Because predictors may have different patterns of missing values, this does not happen automatically; you have to restrict the sample yourself. For example, mat7, which is 7th graders' mathematics score, includes some missing cases, because some nations only let their 8th graders participate in this international test. Use the NMISS function to create a new variable john.

data kaz2;
set kaz;
john=nmiss(of GNP14 mat8 mat7); /*this returns the number of missing values among the three variables*/
run;

/*check what the data look like now*/
proc print data=kaz2;
var name gnp14 mat8 mat7 john;
run;

/*Apply OLS regression to cases with complete data (no missing values). In this way, model 1 and model 2 will have the same number of cases, or to be more precise, the same data.*/
proc reg data=kaz2;
where john=0; /*Run only when john=0, namely, the number of missing values is 0*/
model mat8=mat7;
model mat8=mat7 gnp14;
run;

II. Procedures

11. PROC CONTENTS: Description of Contents

data ABC;
set sashelp.Prdsale;
run;

/*1. Simple way*/
proc contents data=ABC;
run;

/*I like the "position" option because it gives me a table that is sorted by the position of variables in the data, in addition to the alphabetically sorted table*/
proc contents data=ABC position;
run;

/*2. Easiest way to produce RTF or EXCEL documents off PROC CONTENTS*/
/*but I don't like this way because it comes with too many details*/
ods rtf file="C:\TEMP\datadictionary1.rtf";
proc contents data=ABC position;
run;
ods rtf close;

ods html file="C:\TEMP\datadictionary1.xls";
proc contents data=ABC position;
run;
ods html close;

/*3. Using ODS OUTPUT we get only the data we want.*/
proc contents data=ABC position;
ods output position=whatever_name_you_want;
run;

ods rtf file="C:\TEMP\datadictionary2.rtf";
proc print data=whatever_name_you_want noobs;
title "data dictionary in RTF";
var variable label;
run;
ods rtf close;

ods html file="C:\TEMP\datadictionary2.xls";
proc print data=whatever_name_you_want noobs;
title "data dictionary in Excel";
var variable label;
run;
ods html close;

12. PROC PRINT: See Data

PROC PRINT data=kaz;
VAR nation mat7 mat8 natexam; /*without this, all variables will be printed*/
run;

Advanced topic: You can selectively print observations.

/*print only when natexam=1*/
proc print data=kaz;
where natexam=1;
var nation mat7 mat8;
run;

/*print by group units: sort first, then print the sorted data*/
proc sort data=kaz out=kaz2;
by block;
run;
proc print data=kaz2;
by block;
var nation mat7 mat8;
run;

/*print only up to a certain number of observations*/
proc print data=kaz2 (obs=5); /*shows only five observations*/
run;

If you want a nicer print-out, try proc report.

13. PROC SORT: Sorting Observations based on a value of variable

You will be using this procedure a lot, but be careful with large data sets. This procedure consumes lots of computation time.
PROC SORT data=kaz out=kaz2; /*If you don't want to create a new data set, just write "out=kaz"*/
by mat8;
run;

Advanced topics:

proc sort data=kaz out=kaz2 nodupkey;
by block;
run;
proc print data=kaz2;
run;

This keeps only the first observation of each block. Imagine that you have data with individual-level variables (e.g., 100 students) and group-level variables (e.g., 10 schools), and you want school-level information from these data. The procedure above would take just the first observation of each school and give you ten lines of data for the 10 schools. In that output, ignore the individual-level variables, because they only describe whichever student happened to come first in each school.

You can use more than one variable in the by line.

proc sort data=kaz out=kaz2;
by natexam block;
run;
/*What would the new data look like?*/
proc print data=kaz2;
run;

14. PROC MEANS: Get Descriptive Statistics (Mean, STD, Min, Max)

PROC MEANS data=kaz;
VAR mat7 mat8;
run;

Advanced topic: Group means.

/*Report group means*/
proc sort data=kaz out=kaz2;
by block;
run;
proc means data=kaz2;
by block;
var mat7 mat8;
run;

You can also use a "class" statement instead of a "by" statement. The class statement is easier because you don't need to sort the data by the by-variable first. I forgot what the downside of it was.

proc means data=kaz2; /*now, kaz2 does not have to be sorted by block*/
class block;
var mat7 mat8;
run;

/*Save group means*/
ods listing close; /*printing of results suppressed*/
proc means data=kaz2; /*make sure kaz2 is already sorted by group ID*/
by block;
var mat7 mat8;
ods output summary=john; /*Output Delivery System used. See SAS manual 2*/
run;
ods listing; /*printing of results resumed*/
proc print data=john;
run;

/*Get standard errors by adding STDERR*/
/*But once you list a statistic, only the listed statistics are reported, so you must also specify the other statistics you would like: mean, N, STD, MAX, and MIN*/
PROC MEANS data=kaz mean n std max min stderr;
VAR mat7 mat8;
run;

I recommend reading the chapter on PROC MEANS in SAS CD-online.
It is a very versatile procedure.

15. PROC FREQ: Get Frequencies

PROC FREQ data=kaz;
tables natexam;
run;

Advanced topics: Get a cross tabulation:

PROC FREQ data=kaz;
tables natexam*block;
run;

16. PROC UNIVARIATE: Get elaborate statistics and a univariate plot

PROC UNIVARIATE PLOT DATA=KAZ;
var mat7 mat8 gnp14;
run;

Advanced topic: Get a whisker plot by subgroups, so you can compare group values. But the output is text-based and pretty ugly.

proc sort data=kaz out=kaz2;
by block;
run;
PROC UNIVARIATE data=kaz2 plot;
by block;
var mat8;
run;

17. PROC PLOT: Plotting Two Variables

This is a text-based graph. Use proc gplot for nicer graphics.

PROC PLOT data=KAZ;
plot mat7*mat8;
run;

18. PROC TIMEPLOT: Time Plot

proc timeplot data=KAZ;
plot mat8='*';
id NAME;
run;

Advanced topics:

/*Sort first by the variable of your interest and see it*/
/*you will be seeing a ranking of nations*/
proc sort data=kaz out=kaz2;
by mat8;
run;
proc timeplot data=KAZ2;
plot mat8='*';
id NAME;
run;

Add bells and whistles. Below, I am asking, "Does GNP have anything to do with test scores?"

/*First sort by GNP*/
proc sort data=kaz out=kaz2;
by gnp14;
run;
proc timeplot data=KAZ2;
title "TIMSS countries sorted by GNP";
plot mat7 mat8/overlay hiloc npp;
id NAME block gnp14 prop;
run;

19. PROC CORR: Correlation

PROC CORR DATA=KAZ;
VAR mat7 mat8 gnp14;
run;

20. PROC REG: OLS Regression

PROC REG DATA=KAZ;
MODEL mat8=natexam gnp14;
run;

Advanced Topic: http://www.estat.us/sas/OLS%20tables%20for%20learning.txt

21. PROC LOGISTIC: Logistic Regression

/*I don't know if natexam can be considered a dependent variable, but for the sake of demonstration*/
PROC LOGISTIC data=kaz descend;
model natexam=gnp14;
run;
/*The descend option makes sure that PROC LOGISTIC is modeling the probability that the outcome=1. Without this option, it would model the probability that the outcome=0*/

22. MAKE AN ASCII FILE

To use a stand-alone software program, you may have to create a simple ASCII file.
But I rarely use this lately because many software programs read SAS data directly.

data timss;
set kaz;
file "aschi_example.txt";
/*$10. writes the character variable nation in 10 columns; 8.0 writes the scores without decimals (use 8.2 to keep the decimals of mat8)*/
put (nation) ($10.) (mat7 mat8) (8.0);
run;

III. More Procedures

23. PROC STANDARD: Standardize Values

Make Z-scores with a mean of 0 and a standard deviation of 1.

proc standard data=kaz out=kaz2 mean=0 std=1;
var mat7 mat8;
run;
/*then see what you did*/
proc print data=kaz2;
run;

Advanced technique: Standardize within groups.

/*First sort by group ID*/
proc sort data=kaz out=kaz2;
by block;
run;
/*Use by statement*/
proc standard data=kaz2 out=kaz3 mean=0 std=1;
by block;
var mat7 mat8;
run;

24. PROC RANK: Rank observations

proc rank data=kaz out=kaz2 groups=3; /*Creates 3 groups. The new values will be 0, 1, and 2.*/
var mat7 mat8;
ranks Rmat7 Rmat8; /*give names to the new variables*/
run;
/*see what happened*/
proc print data=kaz2;
var mat7 Rmat7 mat8 Rmat8;
run;

Research Tip: Why do we use ranks?

a. We can split the sample based on the rank, e.g., a high-SES student sample versus a low-SES student sample.
b. We can create dummy variables quickly by specifying groups=2, e.g., high-SES students receive 1; else 0.

This grouping occurs at the median of a variable, which may or may not be the best strategy. An alternative is to assign 1 and 0 based on some meaningful threshold. For example, if I have temperature data, I may use the median to split the data if it makes sense, but maybe I use 0 degrees (the freezing point) as a meaningful point to split the data instead.

25. PROC SQL: Creating group-level mean variables

One could use proc means to derive group-level means. I don't recommend this since it involves the extra step of merging the mean data back to the main data set. Extra steps always create room for errors. PROC SQL does it at once.
proc sql;
create table kaz2 as
select *,
mean(mat7) as mean_mat7,
mean(mat8) as mean_mat8,
mean(gnp14) as mean_gnp
from kaz
group by block;
quit; /*proc sql runs each statement immediately and is ended with quit rather than run*/

proc print data=kaz2;
run;

26. PROC IMPORT

PROC IMPORT is the counterpart of PROC EXPORT (see chapter VI). You can read Excel data into SAS with it. As an experiment, create an Excel sheet on the C drive and import it into SAS using the following code.

PROC IMPORT OUT=mine
DATAFILE="C:\example.xls"
DBMS=EXCEL2000 REPLACE;
GETNAMES=YES;
RUN;
proc print data=mine;
run;

IV. Merging Data Sets

libname here "C:\";

/*Create two data sets A and B.*/
data A;
set kaz; /*I am assuming that you already have this data set "kaz"*/
keep nation mat7;
run;
data B;
set kaz;
keep nation mat8;
run;

/*MERGE DATA SETS*/
/*First sort them by a common ID*/
/*Here they are already sorted, so the following two lines are not really necessary*/
proc sort data=A; by nation; run;
proc sort data=B; by nation; run;
data NEW;
merge A B;
by nation;
run;

/*Confirm*/
proc print data=NEW;
run;

V. MACROs

Macros can save time by reducing repetitive parts in the program.

27. Typical Macro: I use this most often.

%macro john (group=,var1=,var2=);
proc means data=kaz;
class &group;
var &var1;
run;
%mend john;

%john(group=natexam,var1=mat7 mat8);
%john(group=block,var1=gnp14 prop);

28. LET MACRO: looks useful and it is useful, but in a limited way

%let john=weight1; /*change this to weight2 if needed*/

data kaz2;
set kaz;
/*hypothetical weights. Unrealistic but for practice*/
weight1=1;
weight2=2;
run;

proc reg data=kaz2;
weight &john;
title "Modeling with &john";
model mat8=mat7;
run;

VI. ODS and PROC EXPORT

ODS can customize the results of statistical procedures. It can save statistical results as data sets. This is useful when making a table to go with a paper.
Without printing results on paper, one can manipulate the result data so that it comes out of SAS in almost paper-ready quality. One can also create graphs from those results right after the statistical procedures. ODS exists for all procedures.

Example: PROC MEANS.

1. Find the table names available by doing the following. (You can do this with any other PROC as well.)

   ods trace on;
   proc means data=kaz;
      var mat8;
   run;
   ods trace off;

2. Look at the log to find the names of the tables available. The log will say:

   Output Added:
   -------------
   Name:       Summary
   Label:      Summary statistics
   Template:   base.summary
   Path:       Means.Summary
   -------------

This means that PROC MEANS has a table called SUMMARY, in which you will see the results of PROC MEANS.

3. Add the ODS line to PROC MEANS in this way. You get a data set john in which you will find the results of the PROC MEANS procedure.

   proc means data=kaz;
      var mat8;
      ods output summary=john;
   run;

4. See what is inside john.

   proc print data=john;
   run;

5. If you like, you can manipulate the john data in any way you want by adding a data step here.

6. Save it in an Excel file.

   PROC EXPORT DATA= john
      OUTFILE= "C:\john2.xls"
      DBMS=EXCEL2000 REPLACE;
   RUN;

PROC TRANSPOSE

Before going too far with ODS, learn how to transpose data. This is useful when you want to change the form of result data sets obtained through ODS. For example, you got the john data on the previous page, and you may want to transpose it to get the form of table that you like. Transpose means to go from a row, [1 2 3], to a column with 1, 2, and 3 stacked vertically.

We are still using the john data created on the previous page. The listing below has one row per block, which suggests it came from a PROC MEANS run that also included a class block; statement. The original john looks like the listing below. This may not be what you want the final table to look like.
   Obs  block    NObs  MAT8_N     MAT8_Mean    MAT8_StdDev  MAT8_Min  MAT8_Max
    1   eeuro       8       8     522.06625   32.954715655    477.23    563.75
    2   meast       3       3  447.36666667   66.772247479    392.18    521.59
    3   namer       2       2         513.5   19.431294347    499.76    527.24
    4   neuro       3       3  502.90333333   15.933519176    486.78    518.64
    5   ocea        2       2       518.715   15.436141033     507.8    529.63
    6   samer       1       1        384.76              .    384.76    384.76
    7   seasia      5       5       593.168   44.409074185    522.37     643.3
    8   seuro       2       2       478.745    7.290270914    473.59     483.9
    9   weuro      12      12        519.52   30.411673895    454.45    565.18

   proc transpose data=john out=john3;
      id block;
   run;
   proc print data=john3; run;

Now, the transposed john, or john3, looks like this. If you like this, you can export this as an excel file.

   Obs  _NAME_       _LABEL_     eeuro    meast    namer    neuro     ocea   samer   seasia    seuro    weuro
    1   NObs         N Obs       8.000    3.000    2.000    3.000    2.000    1.00    5.000    2.000   12.000
    2   MAT8_N       N           8.000    3.000    2.000    3.000    2.000    1.00    5.000    2.000   12.000
    3   MAT8_Mean    Mean      522.066  447.367  513.500  502.903  518.715  384.76  593.168  478.745  519.520
    4   MAT8_StdDev  Std Dev    32.955   66.772   19.431   15.934   15.436     .     44.409    7.290   30.412
    5   MAT8_Min     Minimum   477.230  392.180  499.760  486.780  507.800  384.76  522.370  473.590  454.450
    6   MAT8_Max     Maximum   563.750  521.590  527.240  518.640  529.630  384.76  643.300  483.900  565.180

There are a lot more bells and whistles to proc transpose. One is the use of by-statement. It allows transposing of data within by-groups, but the data must be sorted by the group variable right before proc transpose.

Save it in an excel file.

   PROC EXPORT DATA= john3
      OUTFILE= "C:\john3.xls"
      DBMS=EXCEL2000 REPLACE;
   RUN;

VII. Application: Do PROC MEANS and save results as excel sheet using ODS

Try making a more sophisticated table off PROC MEANS.
An example program for PROC REG is at www.src.uchicago.edu/users/ueka

   data kaz2; set kaz;
      /*create a constant variable for the whole sample*/
      /*this will be used as a classification variable*/
      wholesample="whole";
   run;

   %macro klas (var=);
   ods listing close; /*printing suppressed*/
   /*Get statistics and save them in a result data set*/
   proc means data=kaz2 /*mean std stderr max min n*/;
      class &var;
      var mat7 mat8 GNP14 PROP;
      ods output summary = &var; /*the result data set's name will be the same as the classification variable, i.e., wholesample and block*/
   run;
   /*Transpose the result data, so it looks better*/
   proc transpose data=&var out=&var.T;
      id &var;
   run;
   %mend klas;

   %klas (var=wholesample);
   %klas (var=block);

   data all;
      merge wholesampleT blockT;
      /*by statement (by _name_) not necessary because the two data sets have identical structure*/
   run;
   ods listing; /*printing resumed*/
   proc print data=all;
   run;

   /*create an excel file*/
   PROC EXPORT DATA= all
      OUTFILE= "C:\all.xls"
      DBMS=EXCEL2000 REPLACE;
   RUN;

VIII. APPLICATION: Read from many tables embedded within Excel sheets

Scenario: We have 50 Excel sheets, one for each of fifty schools, and each sheet holds students' achievement data for its school. The bad news is that the data are not stored in a conventional form (rows as observations, columns as variables). Instead, each Excel sheet has tables and charts within it. But at least the formats of those tables are common across the sheets. How can we extract data from these 50 Excel sheets and make them usable for an analysis of students' achievement?

Old way: Hire research assistants and let them manually pick the relevant information from all fifty Excel workbooks.

New way: Read each Excel sheet using PROC IMPORT and save it as a SAS data set. Manipulate the SAS data into an analyzable form (where rows are observations and columns are variables).

Example.
Imagine that we have 50 Excel sheets like the one read below (the sheet's contents appear in the Step 2 listing).

Step 1: Read one sheet using PROC IMPORT:

   PROC IMPORT OUT= JOHN
      DATAFILE= "C:\temp\Blue Sky High School.xls"
      DBMS=EXCEL2000 REPLACE;
      GETNAMES=NO;
   RUN;

Step 2: Examine the SAS data you created, i.e., JOHN, by doing:

   proc print; run;

We get:

   Obs  F1                                            F2
    1   Blue Sky High School Math achievement score    .
    2   Boys                                          60
    3   Girls                                         55
    4   Hispanic                                      45
    5   Black                                         46
    6   White                                         48
    7   Asian                                         49
    8   Native American                               43

Step 3: Think of a way to get this data into a shape where rows are observations. I want it to look like this:

   NAME    Boys  Girls  Hispanics
   SCORE     60     55         45

Step 4: So I must TRANSPOSE the data (PROC TRANSPOSE). But before that, I want to get rid of the first observation because it looks useless. It is just the title of the table (though I could be creative and use that information as an ID variable).

   data john; set john;
      if _n_ ne 1;
   run;

Step 5: Now I transpose the john data.

   proc transpose data=john out=John2;
      id F1;
      var F2;
   run;
   proc print data=john2;
   run;

   Obs  _NAME_  _LABEL_  Boys  Girls  Hispanic  Black  White  Asian  Native_American
    1   F2      F2         60     55        45     46     48     49               43

The first two variables, _NAME_ and _LABEL_, are useless, so I could get rid of them, but I just leave them for now. Note that so far I have gone from the original Excel sheet to the SAS data set JOHN2, stored in the WORK folder.

Step 6 (FINAL): Now I look at what I have done so far and think of a way to automate it using a MACRO, so I can process all 50 Excel sheets. I used PROC IMPORT to read an Excel sheet, one data step to get rid of one observation, and PROC TRANSPOSE to get the format I wanted. Now I use a macro to apply this process not only to one Excel sheet but also to the other sheets.

   libname here "C:\TEMP";
   %macro Edward (var1=, var2=);
   PROC IMPORT OUT= JOHN
      DATAFILE= "C:\temp\&var1..xls"
      DBMS=EXCEL2000 REPLACE;
      GETNAMES=NO;
   RUN;
   data john; set john;
      if _n_ ne 1;
   run;
   proc transpose data=john out=John2;
      id F1;
      var F2;
   run;
   data data&var2; set john2;
      length ID $ 50;
      ID="&var1";
      drop _NAME_ _LABEL_;
   run;
   %mend Edward;
   %Edward (var1=Blue Sky High School, var2=1);
   %Edward (var1=Central High School, var2=2);
   %Edward (var1=West High School, var2=3);
   data here.ALLData;
      set data1 data2 data3;
   run;

Notes on the macro:
- The MACRO begins with %macro and ends with %mend.
- &var1 is a token to be replaced by the words specified later. When the first %Edward line is read by SAS, all occurrences of &var1 are replaced by "Blue Sky High School".
- In "&var1..xls" you see two dots, which is okay. The first dot indicates the end of &var1. The second dot is part of the file name.
- _N_ is the sequence number of the observation. So "if _n_ ne 1" reads "if the sequence number is not 1, keep the observation." Thus, the first observation is dropped.
- In "data&var2" you don't need a dot to indicate the end of the macro token, because it is obvious that it ends there. We need a dot when it is not clear. If I chose to name this data "&var2.data", I'd need the dot in between; otherwise, &var2data would be read as an entirely different macro token called &var2data rather than &var2 followed by data.
- Executing the first iteration: all occurrences of &var1 are replaced by Blue Sky High School, and &var2 is replaced by 1.
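The dot rules described for the macro can be checked without any Excel files. The following is only a small sketch: it uses %LET and %PUT (which writes the resolved text to the log) in place of the macro parameters, with the same values as the first %Edward call.

```sas
/* Stand-ins for the macro parameters of the first %Edward call */
%let var1 = Blue Sky High School;
%let var2 = 1;

/* Two dots: the first ends &var1, the second is the literal dot before xls */
%put C:\temp\&var1..xls;

/* No dot needed: the token ends where the text ends */
%put data&var2;

/* Dot needed: without it SAS would look for a macro variable named var2data */
%put &var2.data;
```

The log should show C:\temp\Blue Sky High School.xls, then data1, then 1data. The last line only illustrates how the text resolves; a name beginning with a digit would not be a valid data set name.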
APPLICATION

   /*And you can use data steps to manipulate the result data set to customize it*/
   /*Here I do something tedious but worthwhile*/
   /*Merge content data with descriptive statistics*/
   /*Feels tedious, but once you write this, you can keep it for later use or even use this program as is for your purpose*/

   /*proc contents here*/
   proc contents data=ABC position;
      ods output position=whatever_name_you_want;
   run;

   /*get means here*/
   proc means data=ABC;
      ods output summary=result_from_proc_mean;
   run;

   proc transpose data=result_from_proc_mean out=transposed_data;
   run;

   data transposed_data; set transposed_data;
      /*get rid of part of the names*/
      _name_=tranwrd(_name_,"_Mean","");
      _name_=tranwrd(_name_,"_StdDev","");
      _name_=tranwrd(_name_,"_Max","");
      _name_=tranwrd(_name_,"_Min","");
      _name_=tranwrd(_name_,"_N","");
   run;

   proc transpose data=transposed_data out=transposed_data2;
      by _name_ notsorted;
      var col1;
      id _label_;
   run;

   data transposed_data2;
      length variable $ 32; /*I needed to do this because in the content data the length is 32*/
      set transposed_data2;
      variable=_name_;
   run;

   proc sort data=whatever_name_you_want; by variable; run;
   proc sort data=transposed_data2; by variable; run;

   data newdata;
      merge whatever_name_you_want transposed_data2;
      by variable;
   run;

   /*I want to restore the original sequence of variables (which I lost through the PROC SORT that I had to use before merging)*/
   proc sort data=newdata; by Num; run;

   ods rtf file="C:\TEMP\datadictionary3.rtf";
   proc print data=newdata noobs;
      title "data dictionary in RTF";
      var variable label N Mean STD_dev Minimum Maximum;
   run;
   ods rtf close;

   ods html file="C:\TEMP\datadictionary3.xls";
   proc print data=newdata noobs;
      title "data dictionary in Excel";
      var variable label N Mean STD_dev Minimum Maximum;
   run;
   ods html close;
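One caveat about the TRANWRD calls in the program above: TRANWRD replaces every occurrence of the search string, not only a trailing suffix, and a zero-length replacement is treated as a single blank. A variable whose own name happens to contain "_N" would therefore be altered. The sketch below uses a hypothetical variable name, pct_Native, which is not part of the original data, to show the effect next to one possible alternative that strips only the last underscore-delimited piece. That alternative is an assumption of mine, not the author's method.

```sas
data _null_;
   length name stripped stat base $ 32;

   /* Hypothetical column name as PROC MEANS would build it: variable + _Mean */
   name = "pct_Native_Mean";

   /* Approach used above: every match is replaced, including the "_N" inside the name */
   stripped = tranwrd(tranwrd(name, "_Mean", ""), "_N", "");

   /* Alternative: take the last underscore-delimited word as the statistic */
   stat = scan(name, -1, "_");
   /* ...and keep everything before that word and its underscore */
   base = substr(name, 1, length(name) - length(stat) - 1);

   /* Write the three results to the log for comparison */
   put stripped= base= stat=;
run;
```

In the log, stripped should no longer match the original variable name, while base should hold pct_Native and stat should hold Mean. The alternative relies on the statistic suffixes (N, Mean, StdDev, Min, Max) containing no underscore themselves.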