|
|
|
||||
|
Help the Stat Consulting Group by
giving a gift
| |||||
|
Loading
|
|||||
A SAS index file is used to identify certain records (i.e., observations) in a data file, usually a very large data file. For example, if you had a data file with information on one million people and you only wanted to analyze the data for people who were 52 years old (age = 52), you might want to use an index to select only those records. You could use other methods to identify those records, such as a where statement. However, SAS would have to evaluate each record and determine if "age = 52" is true or false, and that could take a while. By creating an index, SAS directly accesses those observations. The index is a SAS file separate from the data file that is (usually) stored in the same library or directory as the data file. The index file may show up as a separate file or as part of the data file, depending on the operating system on the computer. The index file is automatically associated with the data file when it is created. You can have more than one index on a single data file, and all indexes associated with a given data file are stored in a single index file. If you make changes to your data file, the index file is automatically updated. You can use proc contents to display information on indexes, and this will be illustrated below. To display information regarding the use of indexes in the log, you can use the system option option msglevel = i.
There are two types of indexes: simple and composite. A simple index has only one key variable. A key variable is the variable upon which the data file is indexed. You can have multiple simple indexes associated with a given data file. The name of the simple index is the same as the name of the key variable. A composite index is one in which the values of two or more variables have been concatenated together to form one value. When creating a composite index, you also need to give it a name.
There are two options that you can use when creating indexes: unique and nomiss. The unique option declares that the values for a variable are unique, such as the values of an id variable or social security numbers would be unique. If an attempt is made to add an observation with a duplicate value, that update is rejected. The nomiss option instructs SAS not to include missing values in the index. This can make the index more efficient if there are many missing data points in the key variable. Unlike the unique option, observations with missing values on the key variable can be added to the data set.
When deciding whether to use an index, several factors should be considered. Essentially, you want to figure out if the time and effort of creating and maintaining an index will be more or less than the time and effort saved by using the index. Using an index will save you computer processing time when you need to identify a relatively small number of cases in a large data set, and they become less efficient the more cases that qualify. If the index will select 25% or fewer of the cases, using an index will likely be more efficient than using a where statement. Efficiency will also be increased if the data set is saved in sorted order on the key variable.
There are at least three ways that you can create an index. They include using proc datasets, using a data step (usually while creating the data set), and using proc sql. Examples of each of these three methods are presented below. Although indexes are best used on large data sets, we will use a relatively small data set for the purposes of illustration. We will use the hsb2i data set for our examples.
1. Using proc datasets
In the example below, we use a libname statement to create a library called mylib. On the proc datasets statement we need to specify the input library on the library option. On the modify statement we need to specify the name of the data file on which we want to create the index. On the first index statement, we create the simple index id using the unique and the nomiss options. (These options are explained above.) Finally, we create the composite index dem. When creating a composite index, you will want to specify the most discriminating variable first. In this case, the variable race is the most discriminating of the two variables listed. We use proc contents to see that the two indexes were created and to get information regarding the indexes, which is at the bottom of the output.
libname mylib 'd:\'; proc datasets library = mylib; modify hsb2i; index create id / unique nomiss; index create dem = (female race); run; proc contents data = mylib.hsb2i; run;The CONTENTS Procedure Data Set Name: MYLIB.HSB2I Observations: 200 Member Type: DATA Variables: 11 Engine: V8 Indexes: 2 Created: 10:19 Wednesday, July 30, 2003 Observation Length: 88 Last Modified: 10:43 Wednesday, July 30, 2003 Deleted Observations: 0 Protection: Compressed: NO Data Set Type: Sorted: NO Label: -----Engine/Host Dependent Information----- Data Set Page Size: 8192 Number of Data Set Pages: 4 First Data Page: 1 Max Obs per Page: 92 Obs in First Data Page: 68 Index File Page Size: 4096 Number of Index File Pages: 3 Number of Data Set Repairs: 0 File Name: d:\hsb2i.sas7bdat Release Created: 8.0202M0 Host Created: WIN_PRO -----Alphabetic List of Variables and Attributes----- # Variable Type Len Pos ------------------------------------ 2 female Num 8 8 1 id Num 8 0 9 math Num 8 64 6 prog Num 8 40 3 race Num 8 16 7 read Num 8 48 5 schtyp Num 8 32 10 science Num 8 72 4 ses Num 8 24 11 socst Num 8 80 8 write Num 8 56 -----Alphabetic List of Indexes and Attributes----- # of Unique Nomiss Unique # Index Option Option Values Variables ------------------------------------------------------- 1 dem 8 female race 2 id YES YES 200
Now we will use proc datasets to delete the index id that we created above.
proc datasets library = mylib; modify hsb2i; index delete id; run; proc contents data = mylib.hsb2i; run;The CONTENTS Procedure Data Set Name: MYLIB.HSB2I Observations: 200 Member Type: DATA Variables: 11 Engine: V8 Indexes: 1 Created: 10:19 Wednesday, July 30, 2003 Observation Length: 88 Last Modified: 10:52 Wednesday, July 30, 2003 Deleted Observations: 0 Protection: Compressed: NO Data Set Type: Sorted: NO Label: -----Engine/Host Dependent Information----- Data Set Page Size: 8192 Number of Data Set Pages: 4 First Data Page: 1 Max Obs per Page: 92 Obs in First Data Page: 68 Index File Page Size: 4096 Number of Index File Pages: 3 Number of Data Set Repairs: 0 File Name: d:\hsb2i.sas7bdat Release Created: 8.0202M0 Host Created: WIN_PRO -----Alphabetic List of Variables and Attributes----- # Variable Type Len Pos ------------------------------------ 2 female Num 8 8 1 id Num 8 0 9 math Num 8 64 6 prog Num 8 40 3 race Num 8 16 7 read Num 8 48 5 schtyp Num 8 32 10 science Num 8 72 4 ses Num 8 24 11 socst Num 8 80 8 write Num 8 56 -----Alphabetic List of Indexes and Attributes----- # of Unique # Index Values Variables ----------------------------------- 1 dem 8 female race
2. Using a data step while creating a data set
In the example below, we will create again the simple index on id. We will use the unique and nomiss options. We will use proc contents to see the results.
data hsb2ia (index =(id / nomiss /unique)) ; set mylib.hsb2i; run; proc contents data = hsb2ia; run;The CONTENTS Procedure Data Set Name: WORK.HSB2IA Observations: 200 Member Type: DATA Variables: 11 Engine: V8 Indexes: 2 Created: 10:24 Wednesday, July 30, 2003 Observation Length: 88 Last Modified: 10:30 Wednesday, July 30, 2003 Deleted Observations: 0 Protection: Compressed: NO Data Set Type: Sorted: NO Label: -----Engine/Host Dependent Information----- Data Set Page Size: 8192 Number of Data Set Pages: 4 First Data Page: 1 Max Obs per Page: 92 Obs in First Data Page: 68 Index File Page Size: 4096 Number of Index File Pages: 3 Number of Data Set Repairs: 0 File Name: D:\hsb2ia.sas7bdat Release Created: 8.0202M0 Host Created: WIN_PRO -----Alphabetic List of Variables and Attributes----- # Variable Type Len Pos ------------------------------------ 2 female Num 8 8 1 id Num 8 0 9 math Num 8 64 6 prog Num 8 40 3 race Num 8 16 7 read Num 8 48 5 schtyp Num 8 32 10 science Num 8 72 4 ses Num 8 24 11 socst Num 8 80 8 write Num 8 56 The SAS System The CONTENTS Procedure -----Alphabetic List of Indexes and Attributes----- # of Unique Unique # Index Option Values Variables --------------------------------------------- 1 id YES 200 2 pgm 6 prog schtyp
3. Using proc sql
In the example below, we use proc sql to drop the index id from the data set hsb2ia. It is often a good idea to delete indexes before creating new ones, as deleting indexes often makes more computer resources available for the indexes that will be created. In this example, we drop the index id and then recreate it only to illustrate how to drop an index using proc sql. On the third line, we recreate the simple index called id. The name of the index is given after the keyword index, and the variable on which the index is created is listed in the parentheses. For this index, we have used the unique option. Note that the nomiss option is not available in proc sql. On the fourth line, we create a new composite index called pgm, using the variables prog and schtye in this index. Both indexes are created on the hsb2ia data file.
proc sql; drop index id from hsb2ia; create unique index id on hsb2ia (id); create index pgm on hsb2ia (prog, schtyp); quit; proc contents data = hsb2ia; run;The CONTENTS Procedure Data Set Name: WORK.HSB2IA Observations: 200 Member Type: DATA Variables: 11 Engine: V8 Indexes: 2 Created: 10:24 Wednesday, July 30, 2003 Observation Length: 88 Last Modified: 10:56 Wednesday, July 30, 2003 Deleted Observations: 0 Protection: Compressed: NO Data Set Type: Sorted: NO Label: -----Engine/Host Dependent Information----- Data Set Page Size: 8192 Number of Data Set Pages: 4 First Data Page: 1 Max Obs per Page: 92 Obs in First Data Page: 68 Index File Page Size: 4096 Number of Index File Pages: 3 Number of Data Set Repairs: 0 File Name: D:\hsb2ia.sas7bdat Release Created: 8.0202M0 Host Created: WIN_PRO -----Alphabetic List of Variables and Attributes----- # Variable Type Len Pos ------------------------------------ 2 female Num 8 8 1 id Num 8 0 9 math Num 8 64 6 prog Num 8 40 3 race Num 8 16 7 read Num 8 48 5 schtyp Num 8 32 10 science Num 8 72 4 ses Num 8 24 11 socst Num 8 80 8 write Num 8 56 The CONTENTS Procedure -----Alphabetic List of Indexes and Attributes----- # of Unique Unique # Index Option Values Variables --------------------------------------------- 1 id YES 200 2 pgm 6 prog schtyp
Now that we have created indexes, let's use them. In this first example, we will also show the time savings (for the CPU) for having used the index.
proc means data = hsb2ia; var math; where prog = 3 and schtyp = 1; run;
Below are the notes from the log.
NOTE: There were 48 observations read from the data set WORK.HSB2IA.
WHERE (prog=3) and (schtyp=1);
NOTE: PROCEDURE MEANS used:
real time 0.00 seconds
cpu time 0.00 seconds
Now let's do the same thing, but on the data file without the index. The notes from the log are below.
NOTE: There were 48 observations read from the data set WORK.HSB2IA.
WHERE (prog=3) and (schtyp=1);
NOTE: PROCEDURE MEANS used:
real time 0.06 seconds
cpu time 0.01 seconds
As you can see, using the index made the SAS program run faster. However, it may sometimes be the case that using the index does not make the program run faster; it may make it run slower. Because of this possibility, SAS does not always use an index associated with the data file. In other words, just because there is an index associated with a data file does not automatically mean that SAS will use it. Instead, SAS takes a moment to "guess" if it will be more or less efficient to use the index, and then acts accordingly. You can override SAS's decision and force SAS to use an index by using the data set option (given on the set statement in a data step) idxwhere = y. The "y" after the idxwhere = stands for "yes". Your other option is "n", for "no", and specifying that option tells SAS not to use the index even if it would be more efficient to do so. The other option that you can use is idxname = name_of_index. If you use this data set option, SAS will not try to determine if this index is the most efficient one to use, or if not using an index at all would be more efficient, and instead will use the index that you have specified.
With respect to composite indexes, to be most efficient, at least the first two variables in the composite index must be listed in the where conditions. If you use both a where statement and a by statement, SAS will look for an index to satisfy both. If none is found, the by statement will take precedence. The variables on the by statement have to be in the same order as they are in the composite index. The by statement will not use the index if 1) the by statement includes the descending or notsorted option, 2) the index was created with the nomiss option, or 3) the data file is physically stored in a sorted order based on the by variables. Indexing allows you to use a by statement without sorting. You can sort an indexed file only if you output the sorted file to a new data file, but that file will not be indexed.
You can use an index to merge a smaller data file with a larger one. To do this, use a data step in which you create one (or two) data sets; in this example we will call them match and nomatch. We will put the cases that come from both data sets into the file called match, and those that do not match in the file called nomatch. (Creating the nomatch data set is not necessary; we do it here only for illustration.) We use two set statements, one for the smaller data set and the other for the larger data set that has the composite index. We use the key = option on the set statement so that we can name the index that we want to use. Next, we use the SAS automatic variable _IORC_, which stands for input/output return code, and it is set to zero when an observation comes from both data sets. We use the SAS automatic variable _error_ to specify when the case should be sent to the nomatch output data set. If this statement is omitted, the nomatch data set appears to be created correctly, but an error message is printed in the log. We have used proc print to show the contents of both data files. Be aware that a by statement is not allowed in a data step with a key = option, and where processing is not allowed on a data file made with one.
data hs1; input id prog schtyp newtest; cards; 1 4 4 98 2 1 2 97 3 2 1 87 4 2 2 96 5 3 1 81 ; run; data match nomatch; set hs1; set hsb2ia key = pgm; if _IORC_ = 0 then do; total = sum(newtest, math, read, write, socst); output match; end; else do; _error_ = 0; output nomatch; end; run; proc print data = match; run;Obs id prog schtyp newtest female race ses read write math science socst total 1 195 1 2 97 0 4 2 57 57 60 58 56 327 2 172 2 1 87 0 4 2 47 52 57 53 61 304 3 196 2 2 96 0 4 3 44 38 49 39 46 273 4 121 3 1 81 1 4 2 68 59 53 63 61 322proc print data = nomatch; run;Obs id prog schtyp newtest female race ses read write math science socst total 1 1 4 4 98 . . . . . . . . .
UCLA Researchers are invited to our Statistical Consulting Services
We recommend others to our list of Other Resources for Statistical Computing Help
These pages are Copyrighted (c) by UCLA Academic Technology Services