Robin High
Statistical Programmer and Consultant
robinh@uoregon.edu
In SAS, DATA and PROC steps consist of series of statements which often begin with a keyword and always end with a semicolon. The statement that begins with the keyword BY followed by a list of categorical or integer variable names has several important functions, including:
However, the BY statement entered into a DATA step has one very important application that is not as well known or understood: it gives SAS the ability to process data included in one dataset where the records are organized by combinations of levels of classification variables.
Suppose an existing SAS dataset contains one or more variables coded as character data or integers with a finite number of values. The dataset can be organized into blocks of records or subgroups defined by the unique combinations of categorical or integer values. Within these defined subgroups you can process all records of this dataset with a DATA step itself.
For example, consider a hypothetical dataset called survey which contains
the following variables to identify groups of records: Gender (coded here as
gndr with m=male and f=female) and the category age
(coded as integers 1, 2, and 3 to represent three distinct age groups). Score
was computed from the summation of several Likert scale responses given by
each person:
DATA survey;
LABEL gndr='Gender';
INPUT id $ gndr $ age score @@;
CARDS;
b f 1 30 g m 1 27 l m 2 39 c f 2 23 h f 3 37
e m 1 49 n m 2 25 o m 3 38 a f 1 28 k f 3 32
d m 1 41
;
First, sort the dataset with PROC SORT by gndr and age (in ascending order). The unique ID for each record is also included on the BY statement as the third variable listed to organize the dataset, even though it is not required for subsequent data processing illustrated here:
PROC SORT DATA=survey;
BY gndr age id;
RUN;
Following PROC SORT is a DATA step that writes a new dataset called sum_var. Enter a BY statement following the SET statement (which reads the sorted dataset called survey) to indicate the records in the dataset have been sorted by gndr and age. Since the objective is to process the data across all records identified by levels of gender or across all records for both gender and age, the ID variable does not need to be included on this BY statement:
DATA sum_var;<enter additional DATA step statements - see below>
SET survey;
BY gndr age;
RUN;
These three statements give SAS the capability to identify the first and last record within each subgroup of records defined by the variables on the BY statement. This feature of the DATA step has many helpful data processing applications, but only one will be described here to illustrate how it works.
In order to process a sorted dataset with the DATA step, SAS automatically
assigns two internal variables for each variable listed on the BY statement
which are dummy coded as 0/1 to represent a false/true condition. For gender,
the variable is called FIRST.gndr and is equal to 1 to identify the first record
of each subgroup defined by the sorted values of gender; otherwise, it is assigned
a value of 0. A second variable called LAST.gndr is equal to 1 for the last
record of each value of gender in each subgroup; it is assigned 0 otherwise.
Analogous variables are computed internally for the three levels of age. These
variables are not included in the new dataset; they only exist to help SAS
identify the first and last records of the subgroups into which the records
have been sorted.
You'll recall that the survey dataset described above was sorted by
gndr, age, and id. Below you'll find the sorted dataset of values for
id, gndr, and age variables followed by the respective "internal" values
of FIRST.<var> and LAST.<var>, and a score variable with which
SAS will calculate some summary statistics.
Because only two values of gender exist that were listed first on the BY statement,
the columns for FIRST.gndr and LAST.gndr each contain only two 1's. However,
since age is sorted within each level of gender, its associated FIRST and LAST
variables change much more often (as indicated by the larger number of 1's
in the respective columns). If there is only 1 record for a group defined by
age, both FIRST and LAST values equal 1 (as is the case for gender=m and age=3).
| id | gndr | FIRST.gndr | LAST.gndr | age | FIRST.age | LAST.age | score |
| a | f | 1 | 0 | 1 | 1 | 0 | 28 |
| b | f | 0 | 0 | 1 | 0 | 1 | 30 |
| c | f | 0 | 0 | 2 | 1 | 1 | 23 |
| h | f | 0 | 0 | 3 | 1 | 0 | 37 |
| k | f | 0 | 1 | 3 | 0 | 1 | 32 |
| d | m | 1 | 0 | 1 | 1 | 0 | 41 |
| e | m | 0 | 0 | 1 | 0 | 0 | 49 |
| g | m | 0 | 0 | 1 | 0 | 1 | 27 |
| l | m | 0 | 0 | 2 | 1 | 0 | 39 |
| n | m | 0 | 0 | 2 | 0 | 1 | 25 |
| o | m | 0 | 1 | 3 | 1 | 1 | 38 |
The dataset survey can now be considered to contain both the actual data values and the internal values SAS assigns. As a result, you can process the records defined by variables listed on the BY statement with IF statements entered in the DATA step. One simple example to demonstrate how this data processing works is to accumulate counts and sums in the DATA step for all persons across values of age for each level of gender:
DATA sum_var;
SET survey;
BY gndr;
DROP id score age;
RETAIN gndr_sum gndr_count;
IF (FIRST.gndr EQ 1) THEN
DO; gndr_count=0; gndr_sum =0; END;
gndr_count = gndr_count + 1;
gndr_sum = gndr_sum + score;
IF (LAST.gndr EQ 1) THEN
DO; gndr_mean = gndr_sum/gndr_count; OUTPUT; END;
RUN;
The purpose of these two IF statements should not be confused with a statement
such as "IF gndr EQ 1 then..." (which is incorrect since gender
itself is character data that can only equal 'f' or 'm').
The actual values of gndr are not referred to; the value "1" entered
above refers only to the answer to the question that is asked for each record: "Is
this record the first (or last) of a group of records sorted by gender?"
The objective of this DATA step is to accumulate summary statistics by values
of gender. When the first record in any subgroup is encountered (e.g., FIRST.gndr
= 1), the summary statistics must be set to 0. Also, any new variables computed
within the DATA step should be included in a RETAIN statement since all computed
variables are automatically set to missing whenever the SET statement reads
a new record.
Records should only be output to the new dataset called sum_var when LAST.gndr equals 1, so the inclusion of an OUTPUT statement must follow "IF
LAST.gndr".
Since values of score, id, and age are no longer relevant, their names are
entered on the DROP statement.
PROC PRINT DATA=sum_var NOobs Label;
RUN;
Gender |
gndr_sum |
gndr_count |
gndr_mean |
f |
150 |
5 |
30.0 |
m |
219 |
6 |
36.5 |
PROC MEANS would be a more natural and efficient procedure to make the computations of this particular example and to collect other summary statistics. However, these statements demonstrate the versatility of processing data tasks you can perform within a DATA step that includes sorted BY variables. It essentially allows you more control and the ability to delete or modify records with additional statements.
A few situations where this approach has proved most helpful are:
The short example presented here is one of many illustrations of why the SAS DATA step is a very powerful component of the SAS system. Along with arrays, FIRST and LAST data processing can be among the most difficult concepts to grasp. However, when you start to think of data processing by subgroups, the BY statement entered into the DATA step can become a commonly applied feature for which you will find many applications.