Data Quality Macros

The following macro programs are useful for testing the quality of data.

CheckLog

Checks the log for issues, including lines with error, warning, invalid, or uninitialized messages; see its dedicated page for details. Includes options to check the current log, a directory of logs, or a specific log, as well as to email results, display pop-up messages, play a sound, and more.

CompCon

COMPares the CONtents of two data sets to see how closely they match. Includes options to redirect output to a library other than WORK, rename the output, change the ID variable from VARNUM to NAME, exclude specific types of comparisons (e.g., formats), and change the maximum number of comparisons to make (defaulted to the maximum, 32767).
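
The core of the technique is roughly the following sketch (not the macro's actual code): dump PROC CONTENTS output for both data sets and compare the metadata with PROC COMPARE. The data set names and the use of NAME as the ID variable are illustrative assumptions.

    proc contents data=work.one out=work.one_meta noprint; run;
    proc contents data=work.two out=work.two_meta noprint; run;

    proc sort data=work.one_meta; by name; run;
    proc sort data=work.two_meta; by name; run;

    /* Compare the metadata, matching variables by NAME */
    proc compare base=work.one_meta compare=work.two_meta;
        id name;
    run;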

DatePart

Applies the DATEPART function, which extracts the date from a datetime variable, without generating log messages when missing values are encountered. The macro has options to set the output variable name and format, to create a label with information about the source column, and to output a trailing comma if requested.
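
A minimal sketch of the underlying idea, assuming a hypothetical datetime variable ADMIT_DTTM in WORK.IN: only call DATEPART when the source value is non-missing, so the log stays free of missing-value notes.

    data work.out;
        set work.in;                                  /* hypothetical input       */
        if not missing(admit_dttm) then admit_dt = datepart(admit_dttm);
        else admit_dt = .;
        format admit_dt date9.;                       /* output format            */
        label admit_dt = "Date part of ADMIT_DTTM";   /* label noting the source  */
    run;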

DstCnt

Outputs the distinct count of a variable in a data set either to the log or to both the log and a macro variable. The name of the macro variable is specified by the user, and there is one additional option to filter the data set.
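
The underlying step is roughly the following sketch, assuming a hypothetical data set WORK.CLAIMS and variable PATIENT_ID; the count lands in both a macro variable and the log.

    proc sql noprint;
        select count(distinct patient_id)
            into :dstcnt trimmed
            from work.claims;
    quit;
    %put NOTE: Distinct count of PATIENT_ID = &dstcnt.;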

DupCheck

Checks for duplicates in a data set by the specified "by" variables (grouping variables). Options include:
  • Splitting the duplicates (if found) from the single records into two separate data sets, with all duplicates in one data set (see the sketch after this list). This differs from the PROC SORT DUPOUT= option, which keeps the first record of each group and moves only the subsequent duplicates to a separate data set.
  • Sorting the output, set to NO by default, since sorting takes time, especially with large data sets.
  • Renaming the output prefix (defaulted to the input data set name).
  • "Suggesting" possible variables involved in duplication using the DupVar macro (see below).
  • Setting the output message as "expected", meaning the user is aware of the duplicates and does not need WARNING messages in the log. This option is useful for splitting a data set with known duplicates in order to handle the duplicates separately from the single records, without having to code the same logic separately.
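
A minimal sketch of the splitting idea using BY-group processing (not the macro itself), assuming a hypothetical input WORK.IN with by variables ID and VISIT:

    proc sort data=work.in out=work.in_sorted;
        by id visit;
    run;

    data work.singles work.dups;
        set work.in_sorted;
        by id visit;
        /* A unique ID/VISIT combination is both the first and the last record
           of its group; everything else belongs to a duplicate group. */
        if first.visit and last.visit then output work.singles;
        else output work.dups;
    run;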

DupVar

The DupVar macro is similar to the DupCheck macro, except it searches for variables that may be involved in duplication given the "by" variables (grouping variables) specified. The only available option is whether to run the macro in test mode.

Exist*

The following three macro programs check whether a directory, file, or variable exists and output messages and global macro variables, either to notify the user of the status or to drive dynamic code that follows.
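
The checks themselves typically rest on standard SAS functions; a rough sketch with hypothetical targets (the macros add the messaging and global macro variables on top of this, and the variable check assumes the data set exists):

    %let dir_ok = %sysfunc(fileexist(C:\projects\data));   /* 1 if the directory exists */
    %let ds_ok  = %sysfunc(exist(work.claims));            /* 1 if the data set exists  */

    /* Variable check: open the data set and look up the variable's position */
    %let dsid   = %sysfunc(open(work.claims));
    %let varpos = %sysfunc(varnum(&dsid, patient_id));     /* 0 if the variable is absent */
    %let rc     = %sysfunc(close(&dsid));

    %put NOTE: directory=&dir_ok data set=&ds_ok variable position=&varpos;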

FreqLoop

A simple PROC FREQ, but allowing for specification of multiple variables or all variables in a data set. The second argument specifies the variable(s) to output; specifying "_ALL_" outputs frequencies on all variables in the data set. This is useful for reviewing categorical values, validating dates, and many other variable-specific reviews. Allows for splitting the frequencies by a "by" variable (grouping variable) and renaming the output data set prefix.
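
A minimal sketch of the looping idea (not the macro's actual signature), assuming a space-delimited variable list:

    %macro freq_loop(data=, vars=);
        %local i var;
        /* Loop over each word in the variable list */
        %do i = 1 %to %sysfunc(countw(&vars));
            %let var = %scan(&vars, &i);
            proc freq data=&data;
                tables &var / missing;
            run;
        %end;
    %mend freq_loop;

    %freq_loop(data=sashelp.class, vars=sex age);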

Freqs

A variation on FreqLoop, except it outputs frequencies for all variables in a data set, with additional options for including/excluding specific variables and outputting frequencies using a BY variable. Each frequency is output to a master data set.

LibCount

This macro outputs the number of observations in each data set in a library. This is useful for tracking a program through its process and validating correct processing on data sets (e.g., record counts match from beginning to end, record counts sum when files are split, or records drop an expected amount when filters are applied). The macro includes options to rename the output and apply a filter to the input table (SASHELP.VTABLE).
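
The underlying lookup is roughly a query against SASHELP.VTABLE; a sketch for the WORK library:

    proc sql;
        select memname, nobs
            from sashelp.vtable
            where libname = 'WORK'
            order by memname;
    quit;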

MinMaxDt

Calculates the minimum and maximum values of a date field, stored either as a number (date) or text (date format). The third argument specifies the format of the variable if it is text. The macro does not recalculate the min/max values for data sets it has already processed; to override that behavior, use the OVERRIDE=Y option. The macro outputs two macro variables, MINDT and MAXDT, holding the minimum and maximum dates, respectively. This is useful for checking date violations (e.g., dates of birth are generally not before 1900 or after the current year).
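
For a numeric date variable, the core step looks something like this sketch, assuming a hypothetical WORK.PATIENTS with a date variable BIRTH_DT:

    proc sql noprint;
        select put(min(birth_dt), date9.), put(max(birth_dt), date9.)
            into :mindt trimmed, :maxdt trimmed
            from work.patients;
    quit;
    %put NOTE: BIRTH_DT ranges from &mindt to &maxdt.;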

NObs

This macro outputs the number of observations/rows in a data set, and it can be used to compare the length of a data set to other data sets or to dynamically change a program.
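
The classic technique behind this kind of macro is to open the data set and read its attributes; a sketch is below (the NVARS attribute gives the column count the same way for NVars, which follows):

    %macro nobs(data);
        %local dsid n rc;
        %let dsid = %sysfunc(open(&data));
        %let n    = %sysfunc(attrn(&dsid, nlobs));   /* logical observation count */
        %let rc   = %sysfunc(close(&dsid));
        &n
    %mend nobs;

    %put NOTE: SASHELP.CLASS has %nobs(sashelp.class) observations.;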

NVars

This macro outputs the number of variables/columns in a data set, and it can be used to compare the width of a data set to other data sets or to dynamically change a program.

QA_Counts

Compares the record counts of two data sets, using equality as the default comparison. The third argument specifies the mnemonic for the comparison operator: EQ, GE, GT, LE, LT, or NE. Very useful when a subsequent step should produce the same record count. Uses the NObs macro.
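
The comparison logic reduces to a macro %IF with a mnemonic operator; a sketch that reuses a %nobs-style helper like the one sketched under NObs (data set names are hypothetical):

    %macro qa_counts(data1, data2, op=EQ);
        %if %nobs(&data1) &op %nobs(&data2) %then
            %put NOTE: Record-count check passed (&data1 &op &data2).;
        %else
            %put WARNING: Record-count check failed (&data1 &op &data2).;
    %mend qa_counts;

    %qa_counts(work.before, work.after, op=EQ);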

SaveLog

Saves the current SAS log to a file. This is great when you forget to redirect the output from the start. The CheckLog macro uses the same utility to export the current log, and the two programs work well together as long as CheckLog points to the log file written by SaveLog. Includes an option to append a date/time stamp to the filename.

SortComp

Sorts and compares two data sets, record by record, using the specified ID variable. This uses the COMPARE procedure, hence the need for the sorts. Additionally, it searches for records in one data set but not the other. The final output includes the output from COMPARE, a data set of records only in the first data set, another of records only in the second, and a final data set showing records in common that differ.

Options are available for renaming the output prefix, changing the output library (instead of WORK), excluding specific variables from comparison, changing the output to "expect" differences (that is, no warnings in the output), setting the maximum number of records to compare (defaulted to the maximum of 32767), and a test mode. Sometimes the output is quite redundant, but it helps in reviewing the data line-by-line.
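
The core steps amount to two sorts and a PROC COMPARE with an ID statement; a sketch with hypothetical names (the macro layers the "only in one data set" outputs on top of this):

    proc sort data=work.old out=work.old_s; by subject_id; run;
    proc sort data=work.new out=work.new_s; by subject_id; run;

    proc compare base=work.old_s compare=work.new_s
                 out=work.diffs outnoequal noprint;   /* keep only differing records */
        id subject_id;
    run;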

Time

This macro puts the time into the log and a macro variable for tracking a program. The macro should be invoked at least twice within a piece of code: once at the beginning and once at the end. Midpoints can be specified, as well as an additional note for the log (e.g., "Section 1" or "Lab Data"). Further, an output data set with each time point and its note can be written out. This is excellent for tracking down troublesome queries in a piece of code or for simple optimization.
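
The underlying idea is simple; a sketch of logging a start time, an end time, and the elapsed seconds (the note text is illustrative):

    %let start_dt = %sysfunc(datetime());
    %put NOTE: Start - %sysfunc(putn(&start_dt, datetime20.)) (Section 1);

    /* ... the code being timed ... */

    %let end_dt = %sysfunc(datetime());
    %put NOTE: End   - %sysfunc(putn(&end_dt, datetime20.));
    %put NOTE: Elapsed seconds: %sysevalf(&end_dt - &start_dt);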

ValidArg

An expansion of the BlankFind macro that generates code to validate macro arguments based on various criteria, including looking for blanks, Y/N values, a list of valid values (e.g., DATA PROC), and whether the values are numeric or character.
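
The kinds of checks it generates look roughly like this sketch (a hypothetical macro with a required DATA= argument and a Y/N PRINT= argument, not ValidArg's actual output):

    %macro demo(data=, print=Y);
        /* Blank check */
        %if %length(&data) = 0 %then %do;
            %put ERROR: DATA= is required.;
            %return;
        %end;
        /* Y/N check */
        %if %upcase(&print) ne Y and %upcase(&print) ne N %then %do;
            %put ERROR: PRINT= must be Y or N, not &print..;
            %return;
        %end;
        %put NOTE: Arguments are valid.;
    %mend demo;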

VarCount

Counts the number of populated variables in a data set. This helps with reducing clutter in a data set or identifying the magnitude of missing values in critical variables. Options allow renaming the output and sorting it by various columns, including ORDER (variable order), NAME (variable name), COUNT, or PERCENT.

ZeroCheck

Checks whether a data set is empty. If it is not, the macro generates a warning or an error, based on the user's preference, along with a custom message. I usually use this to check for invalid records by filtering a data set for the invalid value or pattern and expecting the result to be empty.
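
A rough sketch of that usage pattern, assuming a %nobs-style helper like the one under NObs, a hypothetical input WORK.PATIENTS, and a hypothetical rule that birth dates cannot be in the future:

    data work.bad_dates;
        set work.patients;          /* hypothetical input                */
        where birth_dt > today();   /* the invalid pattern being trapped */
    run;

    %macro zero_check(data, msg);
        %if %nobs(&data) > 0 %then %put WARNING: &msg (%nobs(&data) records in &data).;
    %mend zero_check;

    %zero_check(work.bad_dates, Future birth dates found);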