Fundamentals of Programming in SAS. James Blum
run;
proc freq data=BookData.IPUMS2005Basic;
table MortgageStatus*Metro*HomeValue/nocol nopercent format=comma10.;
format MortgageStatus $MortStatus. Metro MetroB. HomeValue Hvalue.;
where MortgageStatus ne ‘N/A’;
run;
Output 2.7.6: A Three-Way Table in PROC FREQ
Table 1 of METRO by HomeValue | ||||||
Controlling for MortgageStatus=No | ||||||
METRO(Metropolitan status) | HomeValue(House value) | |||||
FrequencyRow Pct | $65,000 and Below | $65,001 to $110,000 | $110,001 to $225,000 | $225,001 to $500,000 | Above $500,000 | Total |
Not Identifiable | 10,77735.49 | 5,46017.98 | 10,41534.29 | 2,5848.51 | 1,1343.73 | 30,370 |
Not in Metro Area | 34,76640.57 | 16,26118.98 | 26,88931.38 | 5,5536.48 | 2,2272.60 | 85,696 |
In a Metro Area | 34,17618.55 | 23,70612.86 | 71,13338.60 | 33,59018.23 | 21,67811.76 | 184,283 |
Total | 79,719 | 45,427 | 108,437 | 41,727 | 25,039 | 300,349 |
Table 2 of METRO by HomeValue | ||||||
Controlling for MortgageStatus=Yes | ||||||
METRO(Metropolitan status) | HomeValue(House value) | |||||
FrequencyRow Pct | $65,000 and Below | $65,001 to $110,000 | $110,001 to $225,000 | $225,001 to $500,000 | Above $500,000 | Total |
Not Identifiable | 7,48617.55 | 7,14216.75 | 19,45345.61 | 6,46815.17 | 2,1004.92 | 42,649 |
Not in Metro Area | 24,44325.34 | 19,39620.11 | 40,66842.16 | 9,1649.50 | 2,7902.89 | 96,461 |
In a Metro Area | 26,3516.33 | 37,3458.97 | 175,48242.16 | 110,41226.52 | 66,67116.02 | 416,261 |
Total | 58,280 | 63,883 | 235,603 | 126,044 | 71,561 | 555,371 |
It is also possible for the FREQ procedure to count based on a quantitative variable using the WEIGHT statement, effectively tabulating the sum of the weights. Program 2.7.7 uses the weight statement to summarize total HomeValue for combinations of Metro and MortgagePayment.
Program 2.7.7: Using the WEIGHT Statement to Summarize a Quantitative Value.
proc freq data=BookData.IPUMS2005Basic;
table Metro*MortgagePayment /nocol nopercent format=dollar14.;
weight HomeValue;
format Metro MetroB. MortgagePayment Mort.;
run;
Output 2.7.7: Using the WEIGHT Statement to Summarize a Quantitative Value
Table of HomeValue by METRO | ||||
HomeValue(House value) | METRO(Metropolitan status) | |||
FrequencyRow Pct | Not Identifiable | Not in Metro Area | In a Metro Area | Total |
$65,000 and Below | $2,737,53012.20 | $8,736,98638.93 | $10,969,60048.88 | $22,444,116 |
$65,001 to $110,000 | $3,770,84010.74 | $9,887,45428.16 | $21,448,05261.09 | $35,106,346 |
$110,001 to $225,000 | $15,896,8547.82 | $30,632,55615.07 | $156,700,07477.10 | $203,229,484 |
$225,001 to $500,000 | $8,192,9084.80 | $10,741,2586.30 | $151,601,29488.90 | $170,535,460 |
Above $500,000 | $3,854,2802.60 | $4,735,2883.19 | $139,862,78094.21 | $148,452,348 |
Total | $34,452,412 | $64,733,542 | $480,581,800 | $579,767,754 |
2.8 Reading Raw Data
Often data is not available as a SAS data set; in practice, data often comes from external sources including raw files such as text files, spreadsheets such as Microsoft Excel ®, or relational databases such as Oracle ®. In this section, work with external data sources begins by exploring how to read raw data files.
Raw data refers to certain files that contain unprocessed data that is not in a SAS data set. (Certain other structures also qualify as raw data. See Chapter Note 6 in Section 2.12 for additional details.) Generally, these are plain-text files and some common file types are:
tab-delimited text (.txt or .tsv)
comma-separated values (.csv)
fixed-position files (.dat)
Choices for file extensions are not fixed; therefore, the extension does not dictate the type of data the file contains, and many other file types exist. Therefore, it is always important to explore the raw data before importing it to SAS. While SAS provides multiple ways to read raw data; this chapter focuses on using the DATA step due to its flexibility and ubiquity—understanding the DATA step is a necessity for a successful SAS programmer.
To assist in finding the column numbers when displaying raw files, a ruler is included in the first line when presenting raw data in the book, but the ruler is not present in the actual data file. Input Data 2.8.1 provides an example of such a ruler. Each dash in the ruler represents a column, while a plus represents multiples of five, and a digit represents multiples of ten. For example, the 1 in the ruler represents column 10 in the raw file and the plus sign between the 1 and the 2 represents column 15 in the raw file.
Input Data 2.8.1: Space Delimited Raw File (Partial Listing)
----+----1----+----2----+ |
1 1800 9998 9998 9998 |
2 480 1440 9998 9998 |
3 2040 360 100 9998 |
4 3000 9998 360 9998 |
5 840 1320 90 9998 |
2.8.1 Introduction to Reading Delimited Files
Delimiters, often used in raw files, are a single character such as a tab, space, comma, or pipe (vertical bar) used to indicate the break between one value and the next in a single record. Input Data 2.8.1 includes a partial representation of the first five records from a space-delimited file (Utility 2001.prn). Reading in this file, or any raw file, requires determining whether the file is delimited and, if so, what delimiters are present. If a file is delimited, it is important to note whether the delimiters also appear as part of the values for one or more variables. The data presented in Input Data 2.8.1 follows a basic structure and uses spaces to separate each record into five distinct values or fields. SAS can read this file correctly using simple list input without the need for additional options or statements using the following rules:
1. At least one blank/space must separate the input values and SAS treats multiple, sequential blanks as a single blank.
2. Character values cannot contain embedded blanks.
3. Character variables are given a length of eight bytes by default.
4. Data must be in standard numeric or character format. Standard numeric values must only contain digits, decimal point, +/-, and E for scientific notation.
Input Data 2.8.1 satisfies these rules using the default delimiter (space). Options and statements are available to help control the behavior associated with rules 1 through 3, which are covered in subsequent sections of this chapter. Violating rule 4 precludes the use of simple list input but is easily addressed with modified list input, as shown in Chapter 3. However, no such options or modifications are required to read Input Data 2.8.1, which is done using Program 2.8.1.
Program 2.8.1: Reading the Utility 2001 Data
data Utility2001;
infile “--insert path here--\Utility 2001.prn”;
input Serial$ Electric Gas Water Fuel;
run;
proc print data = Utility2001 (obs=5 );
run;
The DATA statement begins the DATA step and here names the data set as Utility2001, placing it in the Work library given the single-level naming. Explicit specification of the library is available with two-level naming, for example, Sasuser.Utility2001 or Work.Utility2001—see