The Importance of Documentation in Data Analysis
Robin High
robinh@darkwing.uoregon.edu
This article is designed to motivate you to develop disciplined habits of careful data analysis. It walks you through some essential preparatory steps.
A statistical analysis plan should clearly state your objectives and list the most important tasks. Beyond these essential steps, this plan should also provide you with a detailed description of exactly what you want to do and why.
Write out the details on paper with plenty of visual aids showing what results could look like. Create mock versions of tables or graphs you would eventually include in your final report. From that point, the data you will collect, as well as appropriate analysis procedures, will become much more obvious.
Here are a few of the many basic questions to ask yourself to help you with this step:
a. Will you be working with categorical (nominal or ordinal) or continuous (interval or ratio) data, or some combination of both?
b. Will you need to create scales or perform computations with existing data? If so, what formulas will you use? Do they already exist, or will you need to budget time to develop them as part of your research?
c. Will data transformations be of value? How will you decide and what will be their interpretation?
d. How will you handle missing data? What will be the criteria for judging a data value to be an outlier?
e. Given your answers to a through d, what data analysis techniques will you apply? You may discover a need to learn how to use, or better understand, some new or more advanced techniques. Perhaps logistic regression would be more appropriate than linear regression. Or there may be a repeated measures or multivariate structure in your data, so that applying a standard technique would violate the usual assumption of independence.
As you proceed, continuing to spend time in this disciplined effort will make your future data analysis needs much clearer. Keep written notes and tables with you and look at them frequently. A single glance at your handwritten notes a few months (or even weeks) from now can bring back a flood of valuable memories.
Writing summary results of what you do is just as important as the final written document, and it's an easy and natural task if you do it as you proceed. As you work on a project, budget enough time to successfully plan, execute, and document your analysis tasks as you go. If you don't, you'll very likely find yourself rushing the job, making unnecessary mistakes, and having to redo your work--spending much more time in the long run.
What seems intuitively obvious in the data analysis plan you are currently writing may be only remotely familiar several weeks from now. For example, exactly why did you choose a particular procedure to calculate the means and variances? How did you define outliers and what did you do with them?
One rule-of-thumb to consider is to include at least one short paragraph of written documentation for each data analysis task. While this exercise will increase the amount of written material you need to manage, it ensures you'll have a clearer picture of what you did and why when you revisit summary files later on.
Almost all data analysis tasks can be written into a program file that serves as written documentation of what you did. Within the data analysis program, make liberal use of comments to describe what each step does, and include TITLE statements that print a brief description of the analysis and a date on your printed output. This simple habit will save you from having to guess the type of analysis and its creation date down the road.
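The idea of a commented program whose output carries its own title and date can be sketched in a few lines. The example below uses Python rather than SAS purely for illustration; the function name print_title and the sample scores are hypothetical and not part of any package.

```python
from datetime import date

def print_title(description, run_date=None):
    """Print a header naming the analysis and the date it was run."""
    run_date = run_date or date.today()
    header = f"{description} (run {run_date.isoformat()})"
    print(header)
    print("-" * len(header))
    return header

# Step 1: describe the sample before any modeling.
# Hypothetical values; replace with your own data.
scores = [12, 15, 11, 14]
print_title("Descriptive statistics for pretest scores")
print("mean =", sum(scores) / len(scores))
```

Every page of output produced this way states what the analysis was and when it was run, which is the same job a TITLE statement does.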
Set up a retrieval system to keep your program files and output organized. The name you assign to programs should give a brief description of what the program does. Folders or subdirectories on the computer system you use are a very valuable aid to organizing numerous types of data files.
It shouldn't matter whether you use paper copies or electronic files (safely backed up, of course), but your retrieval system must be structured to allow you easy access to your programs and data files, as well as quick orientation to the purpose of the analysis as it was carried out.
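One possible naming scheme, shown here as a small Python sketch, combines a project folder, a step number, and a short description. The layout, the function name program_path, and the project name are all assumptions made for illustration; any convention works as long as you apply it consistently.

```python
from pathlib import PurePosixPath

def program_path(project, step, description, ext="sas"):
    """Build a path whose folder and file name describe the program's purpose."""
    # Turn "read raw survey" into "read_raw_survey".
    slug = "_".join(description.lower().split())
    # The two-digit step number keeps programs sorted in the order they run.
    return PurePosixPath(project) / "programs" / f"{step:02d}_{slug}.{ext}"

# Hypothetical project and program descriptions.
print(program_path("smoking_study", 1, "read raw survey"))
print(program_path("smoking_study", 2, "anova by clinic"))
```

A listing of the programs folder then reads almost like a table of contents for the analysis.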
When writing a statistical program, structure it with two kinds of steps in mind: those that perform data management and those that perform the data analyses. Arrange them in a logical sequence in which the output of one step is passed on for use in the next.
Data management steps involve reading data from external files, merging separate files together, making transformations or recodings, or creating new variables with formulas. Original data files should be left unchanged; let the program itself calculate new variables for you (such as sums or averages) rather than store the original and computed values in the external data file.
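As a minimal sketch of this practice (in Python for illustration, with made-up item responses), the program below reads the raw records and computes a scale average itself, leaving the raw data untouched. Averaging only the non-missing items is an assumption chosen for the example, not a recommendation; the point is that the rule is written down in the program.

```python
import csv
import io

# Stand-in for an external raw data file; in practice, open the file read-only.
RAW = "id,item1,item2,item3\n1,4,5,3\n2,2,,4\n"

def read_raw(text):
    """Read raw records; blank item responses become None (missing)."""
    rows = []
    for rec in csv.DictReader(io.StringIO(text)):
        items = [int(rec[k]) if rec[k] else None
                 for k in ("item1", "item2", "item3")]
        rows.append({"id": rec["id"], "items": items})
    return rows

def add_scale(rows):
    """Derive scale_mean from the non-missing items (assumed rule)."""
    out = []
    for row in rows:
        present = [v for v in row["items"] if v is not None]
        mean = sum(present) / len(present) if present else None
        # Build a new record so the rows read from the file stay as they were.
        out.append({**row, "scale_mean": mean})
    return out

analysis_rows = add_scale(read_raw(RAW))
for row in analysis_rows:
    print(row["id"], row["scale_mean"])
```

If the formula for the scale changes later, only add_scale needs editing, and the raw file never has to be touched.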
Procedural steps perform the actual data analyses (for example, computing summary statistics for a collection of variables, performing an analysis of variance, plotting residuals to check assumptions, and so forth). The program itself should consist of combinations of these steps structured in a meaningful way.
In general, think of data analysis processing in a step-by-step manner, where each step can't take place until required information is passed to it from a previous step.
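The step-by-step structure might look like the following Python sketch, in which two data management steps feed one analysis step. The function names and the numbers are hypothetical; centering on the mean simply stands in for whatever transformation your plan calls for.

```python
from statistics import mean, stdev

def read_data():
    """Step 1 (data management): supply the raw values. Hypothetical numbers."""
    return [52.0, 48.5, 61.2, 55.1, 49.9]

def transform(values):
    """Step 2 (data management): center the values on their mean."""
    m = mean(values)
    return [v - m for v in values]

def summarize(values):
    """Step 3 (analysis): summary statistics for whatever step 2 produced."""
    return {"n": len(values), "mean": mean(values), "sd": stdev(values)}

# Each step receives exactly what the previous step returned.
raw = read_data()
centered = transform(raw)
summary = summarize(centered)
print(summary)
```

Because no step reaches around the one before it, rerunning the program from the top always reproduces the same result.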
You should never rely on your own memory for steps taken in data analysis. You may remember small facts of a program you ran yesterday, but how clearly will you recall the steps you took two months--or even two weeks--from now?
If you had the printed document in front of you and someone asked you to do it again with a minor change, could you still reconstruct the steps it took to reach your conclusions? For example, do you remember all the data items used to compute scales, or how the derived variables were coded (e.g., age < 60 or age <= 60)?
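The age example shows how little it takes for two codings to disagree. In this Python sketch (the function names and group labels are hypothetical), the two definitions differ only for someone who is exactly 60, which is precisely the detail memory tends to lose.

```python
AGE_CUTOFF = 60  # cutoff taken from the example in the text

def age_group_strict(age):
    """Code the younger group with a strict inequality: age < 60."""
    return "younger" if age < AGE_CUTOFF else "older"

def age_group_inclusive(age):
    """Code the younger group with an inclusive inequality: age <= 60."""
    return "younger" if age <= AGE_CUTOFF else "older"

# Compare the two codings on either side of the cutoff and at the cutoff.
for age in (59, 60, 61):
    print(age, age_group_strict(age), age_group_inclusive(age))
```

Writing the rule into the program, with a comment saying which version you chose and why, removes the need to remember it.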
Perhaps you haven't realized it, but one of the main goals of this article is to highlight a major advantage of using SAS: SAS allows you to document exactly what you do within the program itself. While this may also be true of other data analysis software, some packages have taken an "easy-to-use" approach that makes it all too tempting to sit down and try out an analysis already planned "in your head" simply by clicking a button. The problem is that not documenting your work as you proceed invites frustration--or even disaster--when you try to explain or reproduce results later.
A SAS program can be written that essentially describes everything in great detail from start to finish, and all the steps in the process will read very much like your data analysis plan. There is no doubt what you did, and with inserted comments, the reasons will be clear when you need to review it or modify it later. Another advantage is that you can easily modify these descriptions for use with a similar analysis of new data.