Stat 530 Lab 1: Spring 2016

Part 1: SAS

Let us start with two problems and see how we can use software to deal with them. In this class we will use SAS as the primary software. SAS is a programming language and it is Windows based. However, it is NOT a click-and-drag program: you have to write a program and submit it. Let's look at how a simple problem is done in SAS.

Consider this data set, in which 25 students wrote down their height, age, shoe size, eye color, and hair color. A period (.) marks a missing value. Let's see how we can read this data set into SAS.

    id  height  age  shoe  eyecolor  hair
     1    62     18   6.5  brown     blonde
     2    69     19  11    hazel     auburn
     3    71     20  13    brown     brown
     4    69     20   6.5  hazel     blonde
     5    66     20   8    blue      blonde
     6    63     18   7.5  brown     brown
     7    66     20   9.5  blue      blonde
     8    65     20   8.5  blue      blonde
     9    67     19   9.5  brown     brown
    10    67     23   8.5  brown     brown
    11    67     19   9.5  green     brown
    12    69      .   8.5  hazel     blonde
    13    66     20  10    blue      blonde
    14    70     20   9.5  hazel     brown
    15    69     20   9.5  hazel     brown
    16    72     20  11    blue      blonde
    17    70     19  10.5  blue      brown
    18    71     24  12.5  blue      brown
    19    63     19   8    green     blonde
    20    65     44   7    blue      brown
    21    68     22   9.5  hazel     brown
    22    69     20  10    hazel     brown
    23    70     35  10    brown     brown
    24    70     22  10    blue      brown
    25    64     19   7.5  blue      brown

There are many ways of entering data in SAS. The most important ones are:
1. Physically typing in the data
2. Importing a data set

1. Physically entering data:

    data example1;
    input id height age shoesize eyecolor $ haircolor $;
    cards;
    1 62 18 6.5 brown blonde
    2 69 19 11 hazel auburn
    3 71 20 13 brown brown
    4 69 20 6.5 hazel blonde
    5 66 20 8 blue blonde
    6 63 18 7.5 brown brown
    7 66 20 9.5 blue blonde
    8 65 20 8.5 blue blonde
    9 67 19 9.5 brown brown
    10 67 23 8.5 brown brown
    11 67 19 9.5 green brown
    12 69 . 8.5 hazel blonde
    13 66 20 10 blue blonde
    14 70 20 9.5 hazel brown
    15 69 20 9.5 hazel brown
    16 72 20 11 blue blonde
    17 70 19 10.5 blue brown
    18 71 24 12.5 blue brown
    19 63 19 8 green blonde
    20 65 44 7 blue brown
    21 68 22 9.5 hazel brown
    22 69 20 10 hazel brown
    23 70 35 10 brown brown
    24 70 22 10 blue brown
    25 64 19 7.5 blue brown
    ;
    run;

The dollar sign ($) after eyecolor and haircolor tells SAS that these are character (categorical) variables.

2. Importing an Excel file:

Now suppose the same data were saved as an Excel file named class1.xls or class1.csv.

    *IMPORTING DATA;
    PROC IMPORT OUT= WORK.lab1
        DATAFILE= "YOUR PATH WHERE FILE IS SAVED"
        DBMS=EXCEL REPLACE;
        RANGE="Sheet1$";
        GETNAMES=YES;
        MIXED=NO;
        SCANTEXT=YES;
        USEDATE=YES;
        SCANTIME=YES;
    RUN;

Problems to look out for

If you read a file that is wider than 80 columns, you may need to use the lrecl= parameter on the infile statement.

Once you have the data in, we need to think about analysis. However, often the first question is about graphics and summaries, so let's look at the simplest graphics options in SAS.

PLOTTING data

1. SAS does line-printer quality plots with Proc Plot, or
2. High-resolution plots with Proc Gplot.

You can use one or the other for any plotting you do in class.

Example for Proc Plot:

    proc plot data=dataname;
    plot yvar*xvar / hpos=somenumber vpos=somenumber;
    run;

Example of Gplot:

    proc gplot data=dataname;
    symbol1 v=plus c=black;
    symbol2 v=u c=black;
    symbol3 v=l c=black;
    plot var1*var2 var3*var2 var4*var2 / overlay;
    run;

There are various options here, and you can get quite fancy with Gplot. For example 1, we do this as follows:

    proc gplot data=example1;
    plot shoesize*height;
    run;

For the categorical data we might want to do bar graphs or pie charts.

    proc gchart data=example1;
    pie haircolor eyecolor;
    run;

    proc gchart data=example1;
    pie3d haircolor eyecolor / explode="brown";
    run;

    proc gchart data=example1;
    hbar3d haircolor eyecolor;
    run;

To summarize the data we need to use a PROCEDURE.
First let us get simple descriptive measures for the data. Procedures in SAS are called PROCs. To look at means and variances we could use MEANS, UNIVARIATE, or CAPABILITY.

    proc means data=example1;
    var height shoesize age;
    run;

    proc univariate data=example1;
    var height shoesize age;
    run;

    proc capability data=example1;
    var height shoesize age;
    run;

For simple descriptive measures for the categorical data we could use FREQ. The first call gives a one-way frequency table for each variable; the second gives the two-way table of hair color by eye color.

    proc freq data=example1;
    table haircolor eyecolor;
    run;

    proc freq data=example1;
    table haircolor*eyecolor;
    run;

Part 2: For the ones among us who are R-inclined

Data in R can be of various types.

1. Vectors: one-dimensional data. Example:

    x1=c(1,2,3,5,7)
    x2=c(2,7,8,4,1)

2. Matrix: two-dimensional data with rows and columns. Example:

    m1=cbind(x1,x2)

   OR

    m2=matrix(c(1,2,3,5,7,2,7,8,4,1),nrow=5,ncol=2)

3. Data frames (the one we will use the most): the idea of a data set with various elements.

    mydata= data.frame(x1,x2,m1,m2)

4. Arrays: collections of matrices, that is, multidimensional matrices.

5. Lists: collections of data frames, matrices, vectors, etc. The elements can be of different sizes and need not all be numerical.

Inputting data from a file:

I will show you how to read from a csv file (which will be common for us).

Reading data from a csv file

Put the data in the correct directory and make sure that it is your working directory. I created a directory called myRfolder on my C drive, and I will use it as my default directory. I put in a data set called classdata.csv (this is the class data on the students' height, age, shoe size, eye color, and hair color). Use the commands:

    > setwd("/myRfolder")
    > mydata<-read.table("classdata.csv",header=TRUE,sep=",",row.names="id",na.strings=" ")
    > mydata   #to print the read dataset

The data is saved in R as a data frame called mydata. It is a temporary data set and will be lost when we quit R, unless we save it. R lets us index a data frame like a matrix, by row and column.
So if we want a summary measure for height, all we have to do is refer to it as mydata[ ,1], meaning the first column of the data, or use the $ sign and write mydata$height.

Simple summaries:

    > summary(mydata[,1])
       Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
      62.00   66.00   68.00   67.52   70.00   72.00
    > summary(mydata$height)
       Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
      62.00   66.00   68.00   67.52   70.00   72.00
    > mean(mydata$height)
    [1] 67.52
    > sd(mydata$height)
    [1] 2.77068

SIMPLE GRAPHICS

    > barplot(table(mydata$hair))
    > pie(table(mydata$hair))
    > hist(mydata$shoe)
    > plot(mydata$shoe,mydata$height)
    > table(mydata$hair,mydata$eye)
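
If you do not have classdata.csv at hand, the same ideas can be tried by typing the numeric columns straight into R. The following is only a minimal sketch: the values are copied from the class table at the start of this lab, while the object name classdata and the choice to leave out the eye and hair color columns are assumptions made here for illustration. It shows that indexing by position and by name give the same column, and what happens with the one missing age.

```r
# Numeric columns copied from the class table at the start of this lab.
# The name "classdata" is an illustrative choice, not part of the lab files.
height <- c(62, 69, 71, 69, 66, 63, 66, 65, 67, 67, 67, 69, 66,
            70, 69, 72, 70, 71, 63, 65, 68, 69, 70, 70, 64)
age <- c(18, 19, 20, 20, 20, 18, 20, 20, 19, 23, 19, NA, 20,
         20, 20, 20, 19, 24, 19, 44, 22, 20, 35, 22, 19)
shoesize <- c(6.5, 11, 13, 6.5, 8, 7.5, 9.5, 8.5, 9.5, 8.5, 9.5, 8.5, 10,
              9.5, 9.5, 11, 10.5, 12.5, 8, 7, 9.5, 10, 10, 10, 7.5)

# Build the data frame, using the student ids 1..25 as row names.
classdata <- data.frame(height, age, shoesize, row.names = 1:25)

# Column access by position and by name returns the same vector.
same_column <- identical(classdata[, 1], classdata$height)
print(same_column)

# Mean height, for comparison with the summary output shown earlier.
mean_height <- mean(classdata$height)
print(mean_height)

# Student 12 did not report an age, so mean() returns NA
# unless the missing value is dropped with na.rm = TRUE.
mean_age_raw <- mean(classdata$age)
mean_age <- mean(classdata$age, na.rm = TRUE)
print(mean_age_raw)
print(mean_age)
```

Most summary functions in R (mean, sd, median, and so on) accept na.rm = TRUE in the same way, so it is worth checking for missing values before summarizing a column.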