Return to UOCC HomeComputing News Home
Header bar

How to Design Your Data Entry for Transfer into SAS or SPSS: the Excel Advantage

Robin High
Statistical Programmer and Consultant
robinh@uoregon.edu

Data entry can be a very laborious, and often a very boring, task, prone to errors resulting from fatigue or carelessness. Accuracy is likely to suffer when you're attempting to manually enter data from a data source directly into a computer file.

A primary rule of data entry is that it should never be done with a word processing program. Various specialized programs are available for data entry that are far more appropriate. Since most software programs allow data and information to be exchanged between applications, it makes sense to use the most efficient data entry process available.

Entering data by scanning. Some data types (e.g., discrete categories or Likert scales) can be entered from questionnaires onto individual Scantron forms, one for each subject. The Computing Center is able to scan them for you at a nominal cost and provide you with a text file that can be read by most statistical programs.

Entering data into a spreadsheet: the advantages of Microsoft Excel. If data compiled from other sources are to be entered by hand, one of the most popular and convenient ways is to enter them directly into the rows and columns of a spreadsheet such as Microsoft Excel. Although SPSS and some other statistical programs have their own integrated spreadsheets, they do not offer the flexibility that Excel gives you.


A warning! With data entry as the primary goal, always remember that Excel is first of all a spreadsheet and should not be considered a database management system nor a data analysis program.


If your intent is to analyze data with SAS, SPSS, or some other program, learn to use the features of Excel that will minimize potential problems with data transfer later on. With that in mind, consider some of Excel's limitations for data entry:

Notwithstanding these and several other limitations, Excel remains a reasonable and effective method to enter and store moderate-sized data sets because:

From the very beginning, design each individual worksheet so that it can be analyzed with another program: columns should contain data of one—and only one—type (e.g., numeric, character, or date) and rows should be the subjects or experimental units in your study. Getting to know your data in this structure is visually helpful, because it is exactly how SAS, SPSS, and other statistical programs work with data.

You'll find a helpful guide on how to effectively use spreadsheet packages for data entry at http://www.rdg.ac.uk/ssc/dfid/booklets/topsde.html

Getting Started with Excel

If you want to transfer your data into SAS, SPSS, or some other program, follow these guidelines:

The cells in Row 1 should contain the column's eventual data set name. Each name should be a relatively short and unique acronym that clearly identifies the data. It should begin with a letter and contain only letters, numbers, or an underscore ( _ ) where spaces would naturally fall. Avoid using special characters such as $, &, @, in variable names. Since each row represents the values from one subject, the first column(s) should contain one or more variables that give each subject a unique identifier. They become especially important if you need to merge two or more data files.

In Excel, data formats are defined for a range of cells rather than for a complete column. For this reason it is important that each entire column, including cells with missing or uncollected data, have one, and only one, format. Actually, you do not need to format the entire column, only the portion you will eventually use. Highlight that portion and select the appropriate format from the Format/Cells option. Do not select formats that will enter commas, dollar signs, or other visual enhancements. Numeric, text, and date formats (e.g. mm/dd/yy is often a good choice) are probably the only formats you'll ever need.

The "Split" option (under the "Window" pull-down menu) keeps the row of variable names and the columns of identifiers in view, whatever range of cells in the worksheet you may need to review. First place the cursor at the most extreme upper left-hand corner where data entry begins (e.g., the intersection of Row 2 and the column in the upper left-hand corner where data appear) and then select "Split" from this menu. For any row or column of the worksheet you move to, you'll know exactly which variables you are observing (column names) and their associated ID values (rows).

For versions of Excel later than 4.0, one file can contain multiple worksheets. By default, the tabs at the bottom of these sheets are supplied names ("sheet1," "sheet2," etc.). You can change these names by clicking this space with your mouse and entering a new name. Use the same conventions for first-row variable names: use a short acronym of the page contents that begins with a letter, use only letters or numbers, and enter the underscore ( _ ) where a space naturally falls.

Planning Ahead: Multivariate or Univariate Format?

Another reason to give careful thought to the design your data entry worksheet is to consider the objectives of your data analysis. Some statistical procedures require data to be entered in "multivariate" format; others require "univariate" format. A simple example demonstrates their difference. If you collect repeated measurements of data on individuals over time, your data would typically be stored in a multivariate format with this structure:

SUBJECT DATE P1 P2 P3 T1 T2 T3
P001 11/01/2002 123 234 345 30 28 27
P002 11/01/2002 111 232 143 31 23 22

The numbers on the variable names (attached to P and T) indicate the order two different types of variables were collected over time for each individual (row). This is the data structure required for multivariate analysis with PROC GLM in SAS or repeated measures with SPSS. However, to use this same data in procedures that require a univariate format (such as with procedures MEANS, PLOT, or MIXED) you should store the data with the structure shown below:

SUBJECT DATE TIME P T
P001 11/01/2002 1 123 30
P001 11/01/2002 2 234 28
P001 11/01/2002 3 345 27
P002 11/01/2002 1 111 31
P002 11/01/2002 2 232 23
P002 11/01/2002 3 143 22

Time now becomes a new index variable. The values of the two variables are now listed in two individual columns. You may need to be able to convert data from one structure to the other. Several methods of transposing data in SAS are available at http://darkwing.uoregon.edu/~robinh/06appl.txt

Transferring Data from Excel

After you've entered a few records into Excel, you can test the procedure to read them into SAS or SPSS. To transfer your data to another computer system such as Darkwing, your one choice with SAS version 8.2 is to save it as a text file (comma, *.csv, or tab-delimited, *.txt) and move it with a file transfer program. The column headers you entered into Row 1 will be your variable names. This row can be extracted and form the variable names of the INPUT statement.

By default, statistical programs assume all variables are numeric, so if you have text or date formats, you'll need to specify them on the respective input statement. Also, indicate firstobs=2 on the INFILE statement to show the data begin in Row 2:

DATA one;
INFILE ‘mydata.csv' dlm=',' dsd firstobs=2;
INPUT subject $4. date mm/dd/yy/10. p1 p2 p3 h1 h2 h3;
RUN;


† Note: Unix SAS 8.2 doesn't have the capability to access PC files directly across a network, but you can expect to be able to do this with SAS 9.1, which is scheduled for release sometime next year.


To read data into PC SAS, you can follow the procedure above for saving data as text files and read them in an analogous manner, or you can review several data transfer methods summarized at http://www.ita.doc.gov/industry/otea/dcsug/excelsas.pdf

Instructions for importing to SAS or SPSS. Assuming you have followed the above steps for data entry and your data begin in Row 2 of an Excel worksheet, both SPSS and PC SAS have the capability to read them directly. In SPSS, select "Open/Data" and choose *.xls as the filetype. PC SAS has an IMPORT wizard that will lead you through these steps one at a time, but it's helpful to know how to use the SAS IMPORT procedure.

PROC IMPORT DATAFILE="c:\<path>\test.xls" OUT=test REPLACE;

SHEET="sequence";
GETNAMES=yes;

run;

The DATAFILE= option tells SAS where to find the Excel file (path and file name).

The OUT= option tells SAS to create a temporary dataset called "test." This dataset could be made "permanent" by using a reference to a LIBNAME statement with a libref that appears prior to PROC IMPORT.

The REPLACE option tells SAS to over-write the data set listed on the OUT= option if it currently exists.

If the Excel file contains multiple worksheets, you'll need to specify the name of the worksheet you want to read with the SHEET= statement.

If SAS variable names have been placed in the first row, the GETNAMES=yes statement tells SAS to use these names rather than to create default names and formats of VAR1 VAR2 VAR3, etc.

To verify that data transfer worked, immediately print the data file with PROC PRINT (use the obs= option for large files) and check the data formats with PROC CONTENTS:

PROC PRINT DATA=test(obs=20);
PROC CONTENTS DATA=test;
RUN;

Another Data Entry Option

You might also consider using EpiData, a comprehensive, easy-to-use tool for simple or programmed data entry and for data documentation. Thanks to donations and volunteer efforts, EpiData is free. It is currently developed for Windows 95/98/NT/2000/XP, but also works on PowerMacs with emulators. You can download EpiData from http://www.epidata.dk/

EpiData is easy to get, easy to set up, easy to learn, and has valid-range checking capability. It also outputs the entered dataset for use in SAS, SPSS, Stata, etc., by producing files of commands plus data, so going through the process described above with PROC IMPORT is not necessary.

References

Bourque, Michele. "Reading External Data Files Using SAS®: Examples Handbook." Cary, NC. SAS Institute, Inc., 2002.

Bourque, L.B., & Clark, V. (1992) Processing data: The survey example (Sage University Paper series on Quantitative Applications in the Social Sciences, series no. 07-085). Newbury Park, CA: Sage. (Email: order@sagepub.com)


Spring 2003 Computing News | Computing Center Home Page