|
|
|
||||
|
|
|||||
This page was adapted from a page created by Oliver Schabenberger. We thank Professor Schabenberger for permission to adapt and distribute this page via our web site.
1. Reading data inline
2. Reading data with column pointers
3. Reading data from ASCII files
4. Reading dBase files
5. Creating a permanent data set
6. Reading Excel files into SAS
6.1.
Using the Import facility
A DATA step creates a SAS data set. Each data set consists of observations and variables. You can think of a data set as a two-dimensional table where variables are the columns and observations the rows. The columns are identified by variable names. The name of a variable has to start with a character and can be up to eight characters long.
| Variable 1 | Variable 2 |
Variable 3 |
|
| Observation 1 ----> | |||
| Observation 2 ----> | |||
| Observation 3 ----> |
Reading data inline means that the raw data is part of the SAS program. You enter the actual data points inside the PROGRAM EDITOR. The following DATA step creates a data set named SURVEY with ten observations and seven variables. The variables are called ID, SEX, AGE, INC, R1, R2, and R3. The data stems from a consumer survey where INC denotes annual income in thousands, R1 through R3 are rating scores of three particular products. The INPUT statement defines the variables to be read in each line of data. The DATALINES statement indicates to SAS that DATA step statements are completed and the next line contains real data. Notice that the lines of data do not end in a semicolon. Instead of the DATALINES statement you can also use the CARDS statement. The two are equivalent. CARDS stems from way back when reading data meant feeding punch cards. The lone semicolon before the PROC PRINT statement is something I am used to doing. You could delete that line altogether.
options nocenter;
DATA survey;
INPUT id sex $ age inc r1 r2 r3 ;
DATALINES;
1 F 35 17 7 2 2
17 M 50 14 5 5 3
33 F 45 6 7 2 7
49 M 24 14 7 5 7
65 F 52 9 4 7 7
81 M 44 11 7 7 7
2 F 34 17 6 5 3
18 M 40 14 7 5 2
34 F 47 6 6 5 6
50 M 35 17 5 7 5
;
PROC PRINT; RUN;
The output from this run looks as follows:
OBS ID SEX AGE INC R1 R2 R3
1 1 F
35 17 7
2 2
2 17 M
50 14 5
5 3
3 33 F
45 6 7
2 7
4 49 M
24 14 7
5 7
5 65 F
52 9 4
7 7
6 81 M
44 11 7
7 7
7 2 F
34 17 6
5 3
8 18 M
40 14 7
5 2
9 34 F
47 6 6
5 6
10 50 M
35 17 5
7 5
The variable SEX in the INPUT statement is followed by a dollar sign ($) indicating that this is a character variable. Without specific instructions, SAS assumes that variables are numeric. If it then encounters a character, it will assign a missing value (.) to the variable for that observation. If the dollar sign is removed, the
result is as follows:DATA survey;
INPUT id sex age inc r1 r2 r3 ;
DATALINES;
1 F 35 17 7 2 2
17 M 50 14 5 5 3
33 F 45 6 7 2 7
49 M 24 14 7 5 7
65 F 52 9 4 7 7
81 M 44 11 7 7 7
2 F 34 17 6 5 3
18 M 40 14 7 5 2
34 F 47 6 6 5 6
50 M 35 17 5 7 5
;
PROC PRINT; RUN;
The output from this run looks as follows:
OBS ID SEX AGE INC R1 R2 R3
1 1 .
35 17 7
2 2
2 17 .
50 14 5
5 3
3 33 .
45 6 7
2 7
4 49 .
24 14 7
5 7
5 65 .
52 9 4
7 7
6 81 .
44 11 7
7 7
7 2 .
34 17 6
5 3
8 18 .
40 14 7
5 2
9 34 .
47 6 6
5 6
10 50 .
35 17 5
7 5
3.2. Reading data with column pointers
Sometimes it is useful to tell SAS exactly where to find an observation. For example, if the data set contains a characer variable whose entries are of different length or contain embedded blanks.
data thisone;
input name $1-20 age city $29-37 state $ ;
datalines;
Oliver Schabenberger 33 Lansing MI
John T. Smith 37 New
York NY
;
run;
proc print data=thisone; run;
Following the variable name NAME with $1-20 indicates that NAME is a character variable whose entries are found in columns 1-20 of the datalines. Since blanks automatically delimit variables in datalines, the fact that the two ages are not aligned has no effect. They will be read correctly without pointing SAS to a specific column of the dataline. The city then is again read by pointing SAS to specific columns. This is not necessary for the variable STATE since its entries begin in the same column and do not contain embedded blanks. The data set THISONE looks as follows:
OBS NAME AGE CITY STATE
1 Oliver Schabenberger 33
Lansing MI
2 John T. Smith
37
New York NY
If the $29-37 pointer after CITY is replaced with a simple $ to indicate a character variable in the INPUT statement, here is what happens
.data thisone;
input name $1-20 age city $ state $ ;
datalines;
Oliver Schabenberger 33 Lansing MI
John T. Smith 37 New
York NY
;
run;
proc print data=thisone; run;
OBS NAME AGE CITY STATE
1 Oliver Schabenberger 33
Lansing MI
2 John T. Smith
37
New York
For the second observation, SAS finds the characters 'New' and assigns it to the CITY variable thinking the blank between 'New' and 'York' delimits two variables. The string 'York' is then assigned to the next variable in the INPUT statement, which is STATE.
Data lines can contain more information than is necessary for a single observation. Assume a data set contains three variables, X, Y, and Z. A DATA step could look like:
data XYZ;
input x y z;
datalines;
1 1 12.4
1 2 11.3
1 3 1.4
2 1 2.1
2 2 19.4
2 3 10.0
;
run;
You can also read multiple observations in each line of data. Simply add the @@ symbol at the end of the input statement:
data XYZ;
input x y z @@;
datalines;
1 1 12.4 1 2 11.3 1 3 1.4
2 1 2.1 2 2 19.4 1 3 10.0
;
run;
SAS will go through each line of data and fill the variables X, Y, and Z in turn. Without the @@ symbol, it would switch to the next line as soon as the last variable was filled (which by the way prevented in the last example the value 'NY' from ever being read). With the @@ symbol, SAS will continue filling the variables until it reaches the end of the line.
3.3. Reading data from ASCII files
There are two versions of using the INFILE statement. You can specify expicitly in the INFILE statement which file to read:
data readasc;
infile 'D:\Research\Toxin\Data1998\DON.txt';
input location truck probe DON;
run;
or you can use the FILENAME statement. The FILENAME statement associates a physical file with a short-cut name inside a SAS program. The FILENAME statement appears outside the DATA step. For example
filename toxic 'D:\Research\Toxin\Data1998\DON.txt';
data readasc;
infile toxic;
input location truck probe DON;
run;
associates the file D:\Research\Toxin\Data1998\DON.txt with the short-cut name TOXIC. The INFILE statement inside the DATA step then refers only to the short-cut. The short-cut in SAS lingo is known as a fileref. If the columns of the ASCII file are not delimited by blanks (MSExcel files saved as *.prn text files are comma delimited, for example) you can add a delimiter option to the INFILE statement:
filename toxic 'D:\Research\Toxin\Data1998\DON.txt';
data readasc;
infile toxic delimiter=',';
input location truck probe DON;
run;
Many files contain header information which you do not want to make part of the SAS data set. To skip a records at the top of the file use the FIRSTOBS= option. To read only a specified number of records use the OBS= option. For example,
filename toxic 'D:\Research\Toxin\Data1998\DON.txt';
data readasc;
infile toxic delimiter=',' firstobs=12 obs=100;
input location truck probe DON;
run;
reads 100 - (12) + 1 = 89 observations from the file D:\Research\Toxin\Data1998\DON.txt beginning with the 12th line of the file and looks for commas to delimit variables.
Although dBase has lost importance as a database programming environment, the dBase data format is still very common and many packages including spreadsheets can read and write data in dBase format. A dBase file can easily be transported into a SAS data set. The procedure DBF is especially designed for this purpose. However, this procedure is not documented in the SAS help files and thus a well-kept secret. Its usage is rather simple however. (Remark: This page was written before 2005 and proc dbf is now documented in SAS. 07/2007) Assume a dBase file mydata.dbf resides in C:\Temp. You want to convert it into a SAS data set named dbdata. The statements
filename inf 'C:\Temp\mydata.dbf';
proc dbf db4=inf out=dbdata;
run;
will do the trick. If the dBase format is not compatible with dBaseIV, e.g. dBaseIII+, use db3= instead of db4=. If you use dBaseV format replace db4= with db5=.
3.5. Creating a permanent data set
By default all data sets created during a SAS session are temporary data sets and are deleted when you close SAS. While this is not a real problem for small data sets that you can recreate on the fly in the program, it is inconvenient, if you work with large data sets or your data sets require a lot of manipulation before they are in the form that can be used by a certain procedure. A permanent data set on the other hand is a data set that will not be deleted when SAS is exited. It is available directly when a new SAS session is started. To understand how permanent data sets are created and work, let's revisit a simple example used earlier. When executing the DATA step
data XYZ;
input x y z;
datalines;
1 1 12.4
1 2 11.3
1 3 1.4
2 1 2.1
2 2 19.4
2 3 10.0
;
run;
SAS will write to the LOG window
202 data XYZ;
203 input x y z;
204 datalines;
NOTE: The data set WORK.XYZ has 6 observations and 3 variables.
NOTE: The DATA statement used 0.26 seconds.
211 ;
212 run;
213
Although the data set was created with name XYZ, SAS created a data set named WORK.XYZ. The name in front of the period (WORK) is the name of a library which can contain many data sets (among other things, it can contain programs, graphs, etc.). All data sets associated with the library WORK are deleted at the end of the SAS session (they are temporary). To prevent this and thereby create a permanent data set, simply use a different library name to create a data set. Before you can use a library name, however, you have to associate the library with a directory accessible from your computer. This is accomplished with the LIBNAME statement. For example
libname mylib 'D:\Research\Data\1996';
associates the directory D:\Research\Data\1996 with the name mylib. mylib is known as a libref (a library reference) similar to file references created with the FILENAME statement. Once a libref is defined, you can create data sets in that library (=that directory) by preceding the data set name with the libref and delimiting the two with a period. For example,
libname mylib 'C:\Research\Data\Manure';
data mylib.XYZ;
input x y z;
datalines;
1 1 12.4
1 2 11.3
1 3 1.4
2 1 2.1
2 2 19.4
2 3 10.0
;
run;
will create a permanent data set and store it in the directory C:\Research\Data\Manure'. The name of the physical file stored in the directory will be xyz.sd2. When you start a new SAS session, the data can be accessed directly as soon as a libref pointing to the particular directory exists. The name of the libref can be different from the name you used when creating the permanent data set. For example if a new session is started,
libname what 'C:\Research\Data\Manure';
proc print data=what.xyz;
run;
will directly access the permanent data set. By the way, you may find that SAS data sets are often much more compact than, e.g., spreadsheet or database files. Storing your data in SAS form is a very economical and convenient way.
3.6. Reading Excel files into SAS
MSExcel has become a very common format for exchanging data. Although a spreadsheet application Excel data can be converted to SAS data relatively easy. There are three ways this can be accomplished (there may be more, but these are the ones I know off)
3.6.1. Using the Import facility
The Import facility is accessed through the File menu. It starts a wizard that takes you through the process of converting the Excel spreadsheet into SAS. While Excel can be running at the time of import, the conversion will not work, if the spreadsheet that you wish to convert is open. The easiest conversion is done if the spreadsheet is organized in variables and observations similar to a SAS data set with the data to be converted in the top-most worksheet and the variable names in the first row. When I convert Excel files, I make sure that the column names in Excel already comply to SAS rules for naming variables (no more than eight characters, start with a letter, no embedded blanks). SAS has rules for converting non-complying column names into variable names. Once the import is completed, print the SAS data set with PROC PRINT to make sure all variable names and observations have been converted correctly.
Important: When selecting the file format to be converted version 6.12 of SAS will indicate that it can convert Excel5.0 and Excel7.0 spreadsheets. Spreadsheets created with Excel97 will not convert properly. Make sure that spreadsheets are saved in Excel5.0 (=Excel95) format.
Of the three methods to read Excel data this is no doubt the most flexible, but also the most cryptic method. In order for it to work SAS/ACCESS must be installed on your system. To see whether it is, execute
proc access;
run;
If PROC ACCESS is installed, a window will appear.
To convert an Excel spreadsheet into a SAS data set on the fly in your program, you can use syntax such as the following. It requires that a libref is assigned to which interim data sets are written.
proc access dbms=xls;
create convert.xlsa.access;
path = 'C:\Courses\Independent Study\Data\Soyref95.xls';
scantype = yes;
worksheet = REF95;
getnames = yes;
assign = yes;
unique = yes;
create Convert.xlsv.view;
select all;
run;
data Soybean;
set convert.xlsv;
run;
proc datasets library=convert;
delete xlsv / memtype=view;
delete xlsa / memtype=access;
run;
quit;
proc print data=soybean;
run;
This code reads the Excel spreadsheet C:\Courses\Independent Study\Data\Soyref95.xls and converts it into the SAS data set SOYBEAN. The data is stored on worksheet REF95. The getnames=, assign=, and unique= statements of PROC ACCESS tell SAS to look up variable names from column names in the spreadsheet which are stored in the first row, and to keep those variable names unique, even if two column names in Excel (if truncated to eight characters) would be the same. The PROC DATASETS code deletes the two intermediate files created by SAS in the conversion process.
3.6.3. Using Dynamic Data Exchange (DDE)
DDE is used to dynamically exchange data between applications. You can use it not only to read data from Excel into SAS, but also to populate an Excel spreadsheet with information stored in a SAS data set. For saving data in Excel format, I find the Export wizard (File-Export menu) very helpful, easy to use and quick. I am concentrating here on how to read data from Excel into SAS via DDE. In contrast to import with the
Import facility, Excel must be running and the spreadsheet from which to read data must be open. To read the data converted in the PROC ACCESS example, use the following code:filename soy dde 'Excel|C:\Courses\Independent
Study\Data\[SOYREF95.XLS]REF95!R2C2:R229C10';
data bean;
infile soy missover;
input labid $ month reps moisture abn dead abn1 abn2 abn3;
run;
proc print data=bean; run;
The string after dde in the filename statement is called the DDE triplet consisting of
application | topic ! item.The specific form of the DDE triplet differs from application to application and may be hard to figure out. Fortunately, there is a simple trick to get the DDE triplet. In Excel, highlight the rows and columns you want to transport to SAS and copy them to the clipboard (Ctrl-C). Then go into SAS and choose Solutions-DDE Triplet from the menu. The dialog box that appears contains the DDE triplet. Just copy it from the dialog to the filename statement (Ctrl-C works here too). Notice that the triplet begins reading the spreadsheet in row 2, since the first row of the spreadsheet contains the column names. The MISSOVER option on the INFILE statement prevents SAS from going to a new input line if it does not find values in the current line for all the INPUT statement variables, that is some of the values are missing. With the MISSOVER option, when an INPUT statement reaches the end of the current record, values that are expected but not found are set to missing.
UCLA Researchers are invited to our Statistical Consulting Services
We recommend others to our list of Other Resources for Statistical Computing Help
These pages are Copyrighted (c) by UCLA Academic Technology Services