Variable Types, Data Set Structure, and Data File Types

UNC-Wilmington
Department of Economics and Finance
ECN 377
Dr. Chris Dumas
Variables, Variable Types, Data Set Structure, and Data File Types
Often, the type of econometric analysis that is appropriate in a given situation depends on the types of variables
and data that are available. This handout covers the issues and terminology involved in describing variables and
variable types, data observations (rows), data fields (columns), and data file types.
Types of Variables
Recall that variables are things that (1) can be observed, (2) can be measured/described, and (3) have the
potential to vary. Variables whose values can be measured are called Cardinal/Measurement/Scale Variables.
Variables whose values cannot be measured, but instead simply describe different categories of things, are called
Ordinal/Nominal/Qualitative/Categorical/Character/String Variables. These two primary variable types
can be further divided into sub-categories.
A variable's type is important, because some analysis methods can be applied to only some types of variables
and not to others!
Cardinal/Measurement/Scale Variables -- numeric (number) variables where the distance between any two
values has the same meaning from one data observation to the next.
• Continuous -- cardinal numeric variables for which there is an infinite number of fractional values
between one number and the next
• Discrete -- cardinal numeric variables for which there are no fractional values between one number
and the next (for example, a variable that takes only integer values is a discrete variable)
Ordinal/Nominal/Qualitative/Categorical/Character/String Variables -- data values that are composed of
numbers or text characters (e.g., persons' names, city names, colors, etc.) that indicate categories rather than
measurements.
• Ordinal Numeric Variables -- numeric (number) variables that indicate order or rank, rather than
measurement; the distance between any two values is not necessarily the same from one data
observation to the next. The numbers indicate ordered/ranked categories rather than measurements.
For example, suppose two people are rating product desirability on a scale of 1 to 10. Suppose both
people rate the first product “5” and the second product “3.” Both people rank the first product higher
than the second product, but the key is that the difference between a “5” and a “3” for the first person
might be different from the difference between a “5” and a “3” for the second person. So, the “5” and
the “3” simply indicate ordered/ranked categories rather than cardinal measurements.
• Ordinal Character Variables -- the same as ordinal numeric variables, but character (text) values are
used to indicate the different ordered/ranked categories (e.g., Likert scale data from questions asking
“strongly agree, agree, don't care, disagree, strongly disagree”)
• Nominal Numeric Variables -- numeric variables that indicate unordered/unranked categories. For
example, the values 0, 1, and 2 might be used to represent the colors Red, Blue, and Green. In this case,
the numbers do not indicate that the colors are ranked, with Red coming before Blue, etc.; the numbers
simply indicate different (but unranked) categories of color. (Green is not twice as big as Blue just
because a 2 is used to indicate Green and a 1 is used to indicate Blue.)
• Nominal Character Variables -- the same as nominal numeric variables, but character (text) values are
used to indicate the unordered/unranked categories (e.g., persons' names, colors, states)
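The practical consequence of these variable types can be sketched in a few lines of Python (used here only for illustration; it is not SAS). The response and color data below are made up for this example. For an ordinal variable, the ordering is meaningful, so a median category makes sense; for a nominal variable, only counts (such as the most common category) make sense, and an average of the numeric codes has no meaning.

```python
from statistics import mode

# Ordinal character variable: the categories have a meaningful order.
LIKERT_ORDER = ["strongly disagree", "disagree", "don't care", "agree", "strongly agree"]
responses = ["agree", "strongly agree", "disagree", "agree", "don't care"]  # made-up data

# Convert each response to its rank, sort the ranks, and take the middle one.
ranks = sorted(LIKERT_ORDER.index(r) for r in responses)
median_response = LIKERT_ORDER[ranks[len(ranks) // 2]]

# Nominal numeric variable: the codes 0, 1, 2 are labels, not measurements.
COLOR_CODES = {0: "Red", 1: "Blue", 2: "Green"}
colors = [0, 2, 2, 1, 2]  # made-up data

# The most common category is meaningful; the mean of the codes would not be.
most_common_color = COLOR_CODES[mode(colors)]

print(median_response)    # agree
print(most_common_color)  # Green
```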
Rules for Naming Variables in SAS -- In SAS, variable names must follow these rules:
• Names must be 32 characters or fewer in length
• Names must start with a letter or underscore (a name may NOT begin with a number!)
• Names may contain only letters, underscores and numbers (after the first character)
• Names may not contain blank spaces, dashes, or the special characters % $ ! * & # @, etc.
• Names may contain any mix of upper and lowercase letters. SAS is not case-sensitive. That is,
capitalization doesn't matter when it comes to variable names.
• Reserved Words are words that SAS uses for special purposes and so cannot be used as variable names.
The names _N_, _ERROR_, _FILE_, _INFILE_, _MSG_, _IORC_, and _CMD_ are reserved words in
SAS. Note that reserved words in SAS start and end with an underscore; to avoid conflicts with reserved
words, it is recommended that you do not use variable names that start and end with an underscore.
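As a rough illustration, the naming rules above can be expressed as a small checker. This is a Python sketch, not part of SAS; the function name `is_valid_sas_name` is hypothetical, and it encodes only the rules and reserved words listed in this handout.

```python
import re

# Reserved words listed in the handout. They are compared in upper case
# because SAS ignores capitalization in variable names.
RESERVED = {"_N_", "_ERROR_", "_FILE_", "_INFILE_", "_MSG_", "_IORC_", "_CMD_"}

# First character: a letter or underscore. Then up to 31 more letters,
# digits, or underscores, for a maximum total length of 32 characters.
NAME_PATTERN = re.compile(r"[A-Za-z_][A-Za-z0-9_]{0,31}")


def is_valid_sas_name(name: str) -> bool:
    """Return True if `name` follows the naming rules listed in the handout."""
    if name.upper() in RESERVED:
        return False
    return NAME_PATTERN.fullmatch(name) is not None


for candidate in ["Height", "_temp1", "1stQuarter", "unit price", "net-income", "_n_"]:
    print(candidate, is_valid_sas_name(candidate))
```

Under these rules the first two names pass and the last four fail: one starts with a number, one contains a space, one contains a dash, and one is a reserved word.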
Missing Values -- A Missing Value is a data value that is missing from a data set, because the value was
never collected, was collected but not entered into the database, or was lost or deleted from the database.
In SAS, a missing numeric data value is indicated by a single period ".". The period is used to "hold the place"
of the missing data value in the data set.
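The "place-holder" idea can be sketched in Python (for illustration only; this is not how SAS is implemented). The helper name `parse_numeric` and the sample values are assumptions made for this example. A lone period is read as "missing" and is then left out of the average rather than being treated as zero.

```python
from typing import Optional


def parse_numeric(field: str) -> Optional[float]:
    """Convert a text field to a number, treating a lone period as missing."""
    text = field.strip()
    if text == ".":
        return None  # missing value: the period only holds the place
    return float(text)


raw_weights = ["150", ".", "215"]  # made-up data; the second value is missing
weights = [parse_numeric(value) for value in raw_weights]

# Compute the average over the non-missing values only.
observed = [w for w in weights if w is not None]
average_weight = sum(observed) / len(observed)

print(weights)         # [150.0, None, 215.0]
print(average_weight)  # 182.5
```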
Observations/Cases (Rows) and Variables/Fields (Columns)
• Data are typically arranged in horizontal rows and vertical columns, such as the rows and columns of a
table or the rows and columns in a spreadsheet.
• An "Observation" or "Case" is a person, place, thing or time on which (during which) data are
collected. Observations/Cases are usually the rows in your data set. Sometimes, an observation
number is displayed at the left of each row to number the observations for ease of reference. (SAS
automatically creates a column of observation numbers, which it names "Obs", and adds these numbers to
your data set.) In SAS, the maximum number of observations in a data set is limited only by your
computer memory.
• A "Variable" or "Field" is a type of data collected for each observation. Variables/Fields are usually the
columns in your data set. Often, the variable/field names are displayed at the tops of the columns. If a
data set has a column of observation numbers used to number the rows, the observation numbers are
usually not counted as a variable. In SAS, the maximum number of variables is 32,767 (for full
compatibility with earlier versions of SAS).
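Typical Data Set Structure
In the example below, the top row holds the Variable Names, the "Obs" column holds the Observation Numbers,
each row is an Observation (or Case), and each column is a Variable (or Field).

Obs   Name    Height   Weight   Sign
1     Larry   6.2      150      Aquarius
2     Moe     5.7      180      Virgo
3     Curly   5.4      215      Libra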
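The same rows-and-columns structure can be sketched in Python (for illustration only), using the example values from the handout's table. Each dictionary is one observation (row), and each key is one variable (column). The observation numbers are generated separately and are not counted as a variable.

```python
# Each dictionary is one observation (row); each key is one variable (column).
data_set = [
    {"Name": "Larry", "Height": 6.2, "Weight": 150, "Sign": "Aquarius"},
    {"Name": "Moe", "Height": 5.7, "Weight": 180, "Sign": "Virgo"},
    {"Name": "Curly", "Height": 5.4, "Weight": 215, "Sign": "Libra"},
]

variable_names = list(data_set[0].keys())
n_observations = len(data_set)    # number of rows
n_variables = len(variable_names)  # number of columns, not counting Obs

# Observation numbers start at 1 and are attached to the rows for reference.
numbered = list(enumerate(data_set, start=1))

# Reading "down a column" collects one variable across all observations.
weights = [row["Weight"] for row in data_set]

print(n_observations, n_variables)  # 3 4
print(weights)                      # [150, 180, 215]
```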
Data File Types
Data can be stored in several types of computer files. The type of computer file is usually indicated by the
"extension" at the end of the computer file name. The extension is the part of the file name that follows the dot in
the file name. For example, the extension on a data file named "DumasData.xls" is the "xls" part. Some common
file types and extensions are listed below:

File type                                 Extensions
Space or tab-delimited text file          .txt .prn .dat
Comma-delimited text file                 .csv
Microsoft Excel spreadsheet file          .xls .xlsx .xlsm
Lotus 123 spreadsheet file                .wks .wk1 .wk3
Data Interchange Format database file     .dif
Microsoft Access database file            .accdb .mdb
dBase database file                       .dbf
Stata data file                           .dta
SPSS data file                            .sav
SAS data file                             .sd2 .sd7 .sas7bdat
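As a small illustration of reading the extension from a file name, the following Python sketch (not SAS) looks up a file's type from the part of the name after the last dot. The function name `describe_file` is hypothetical, and the lookup table contains only the file types listed in this handout.

```python
from pathlib import Path

# Extensions and file types as listed in the handout.
FILE_TYPES = {
    ".txt": "Space or tab-delimited text file",
    ".prn": "Space or tab-delimited text file",
    ".dat": "Space or tab-delimited text file",
    ".csv": "Comma-delimited text file",
    ".xls": "Microsoft Excel spreadsheet file",
    ".xlsx": "Microsoft Excel spreadsheet file",
    ".xlsm": "Microsoft Excel spreadsheet file",
    ".wks": "Lotus 123 spreadsheet file",
    ".wk1": "Lotus 123 spreadsheet file",
    ".wk3": "Lotus 123 spreadsheet file",
    ".dif": "Data Interchange Format database file",
    ".accdb": "Microsoft Access database file",
    ".mdb": "Microsoft Access database file",
    ".dbf": "dBase database file",
    ".dta": "Stata data file",
    ".sav": "SPSS data file",
    ".sd2": "SAS data file",
    ".sd7": "SAS data file",
    ".sas7bdat": "SAS data file",
}


def describe_file(file_name: str) -> str:
    """Return the handout's file type for a file name, based on its extension."""
    extension = Path(file_name).suffix.lower()  # e.g. ".xls"
    return FILE_TYPES.get(extension, "Unknown file type")


print(describe_file("DumasData.xls"))  # Microsoft Excel spreadsheet file
```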
Some data files are simple text files; common examples include ".txt", ".prn", ".dat", and ".csv" files. Simple text
files must use "delimiters" to indicate where one data value ends and the next begins. A "delimiter" is simply a
special character that separates one data value from the next. A "space-delimited file" has
each data value separated by a space, a "comma-delimited file" has each data value separated by a comma, and
so on. IMPORTANT: If a data file uses a space as a delimiter, then none of the data values should contain
spaces, or SAS will think that there are two data values instead of only one. Also, a blank cannot be used to
represent a missing character value if a space is used as a delimiter -- either change the delimiter to something else
or use something other than a blank to represent a missing character value. Similarly, if a data file uses a comma
as a delimiter, then none of the numbers in the data should contain commas.
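The delimiter problem described above can be seen with a simple split, sketched here in Python (for illustration only; SAS's own reading rules are more elaborate). The values "Larry Fine" and "1,250" are made up for this example.

```python
# A clean space-delimited line: four values, four fields.
clean_line = "Larry 6.2 150 Aquarius"
clean_fields = clean_line.split(" ")

# The name now contains a space, so one value is split into two fields.
problem_line = "Larry Fine 6.2 150 Aquarius"
problem_fields = problem_line.split(" ")

# A comma inside a number causes the same problem in a comma-delimited file:
# three intended values (Moe, 5.7, 1,250) are read as four fields.
comma_line = "Moe,5.7,1,250"
comma_fields = comma_line.split(",")

print(len(clean_fields))    # 4
print(len(problem_fields))  # 5
print(comma_fields)         # ['Moe', '5.7', '1', '250']
```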
In Microsoft Windows, the file name extensions are sometimes "hidden" to shorten the file names. If this is the
case on your computer, you need to "un-hide" the file name extensions so that you can see what type of data file
you are working with. You can un-hide file name extensions by opening the folder options in Windows
and turning off the "Hide extensions for known file types" setting.
SAS can open and "read" all of the data file types listed above, but you must tell SAS which data file type you
want to open. This is covered in more detail in the handout on Proc Import and Proc Export.