Fundamentals of Programming in SAS. James Blum

2.8.6: Warning Generated by Attempting to Reset Length

      WARNING: Length of character variable State has already been set. Use the LENGTH statement as the very first statement in the DATA STEP to declare the length of a character variable.

      Tab-Delimited Files

      If the delimiter is not a standard keyboard character, such as the tab used in tab-delimited files, the delimiter is specified through its hexadecimal code instead. While the correct hexadecimal representation depends on the operating system, Microsoft Windows and Unix/Linux machines typically use ASCII codes. The ASCII hexadecimal code for a tab is 09 and is written in the SAS language as '09'x; the x appended to the literal value of 09 instructs the compiler to make the conversion from hexadecimal. Program 2.8.7 uses hexadecimal encoding in the DLM= option to correctly set the delimiter to a tab. The results of Program 2.8.7 are identical to those of Program 2.8.5.

      Program 2.8.7: Reading Tab-Delimited Data

      data work.Ipums2005Basic;
        length state $ 20 City $ 25 MortgageStatus $ 50;
        infile RawData ('ipums2005basic.txt') dlm = '09'x;  /* '09'x is the tab character */
        input Serial State $ City $ CityPop Metro
              CountyFIPS Ownership $ MortgageStatus $
              MortgagePayment HHIncome HomeValue;
      run;

      Because there are no missing values denoted by sequential tabs, nor any tabs included in data values, the DSD option is no longer needed in the INFILE statement for this program.
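      For contrast, the following is a minimal sketch of a case where DSD would matter: a record in which two sequential tabs mark a missing value. The fileref gapdemo, the data set work.gapCheck, and the single record are hypothetical and are not part of the IPUMS files.

```sas
/* Hypothetical temporary file; not one of the book's data files. */
filename gapdemo temp;

/* Write one record in which two sequential tabs mark a missing CityPop. */
data _null_;
  file gapdemo;
  length line $ 40;
  line = 'Alabama' || '09'x || '09'x || '4';
  put line;
run;

/* DSD treats sequential delimiters as a missing value;
   DLM= keeps the tab, rather than the comma, as the delimiter. */
data work.gapCheck;
  length state $ 20;
  infile gapdemo dlm = '09'x dsd;
  input state $ cityPop metro;
run;

proc print data = work.gapCheck;
run;
```

      With DSD in place, the empty field between the two tabs is read as a missing value for cityPop, and the value 4 is assigned to metro.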

      To specify multiple delimiters that include the tab, each must use a hexadecimal representation. For example, DLM= '2C09'x selects commas and tabs as delimiters because 2C is the hexadecimal value for a comma. For records with different delimiters within the same DATA step, see Chapter Note 7 in Section 2.12.
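      As a small illustration of a two-character delimiter list, the sketch below writes two records that mix commas and tabs and then reads them back. The fileref mixdemo, the data set work.mixCheck, and the record contents are assumptions made for this example only.

```sas
/* Hypothetical temporary file; not one of the book's data files. */
filename mixdemo temp;

/* Build two records that mix tabs and commas as separators. */
data _null_;
  file mixdemo;
  length line $ 40;
  line = 'Alabama' || '09'x || '0' || ',' || '4';
  put line;
  line = 'Alaska' || ',' || '0' || '09'x || '1';
  put line;
run;

/* '2C09'x lists two delimiter characters: comma (2C) and tab (09). */
data work.mixCheck;
  length state $ 20;
  infile mixdemo dlm = '2C09'x;
  input state $ cityPop metro;
run;

proc print data = work.mixCheck;
run;
```

      Each character in the DLM= value acts as a delimiter on its own, so a field may end at either a comma or a tab.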

      While delimited data takes advantage of delimiting characters in the data, other files depend on the starting and stopping position of the values being read. These types of files are referred to by several names: fixed-width, fixed-position, and fixed-field, among others. The first five records from a fixed-position file (IPUMS2005Basic.dat) are shown in Input Data 2.8.8. As with Input Data 2.8.4, truncation of this display occurs due to the length of the record—now occurring in each of the five records.

      Input Data 2.8.8: Excerpt from a Fixed-Position Data File

----+----1----+----2----+----3----+----4----+----5----+----6----+----7----+----8----+
       2 Alabama              Not in identifiable city (or size group)     0   4  73
       3 Alabama              Not in identifiable city (or size group)     0   1   0
       4 Alabama              Not in identifiable city (or size group)     0   4  73
       5 Alabama              Not in identifiable city (or size group)     0   1   0
       6 Alabama              Not in identifiable city (or size group)     0   3  97

      Since fixed-position files do not use delimiters, reading a fixed-position file requires knowledge of the starting position of each data value. In addition, either the length or stopping position of the data value must be known. Using the ruler, the first displayed field, Serial, appears to begin and end in column 8. However, inspection of the complete raw file reveals that this is only the case for the single-digit values of Serial. The longest value is eight digits wide, so the variable Serial truly starts in column 1 and ends in column 8. Similarly, the next field, State, begins in column 10 and ends in column 29. Some text editors, such as Notepad++ and Visual Studio Code, show the column number in a status bar as the cursor moves across lines in the file.
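      Column positions can also be inspected from within SAS. The sketch below is one possible approach, not a program from this section: it assumes the RawData fileref used throughout this chapter is already assigned and uses the LIST statement to echo the first five records to the log beneath a column ruler.

```sas
/* Assumes the RawData fileref from the surrounding programs is assigned. */
data _null_;
  infile RawData('ipums2005basic.dat') obs = 5;
  input;   /* load the record into the input buffer without creating variables */
  list;    /* write the buffered record to the log with a column ruler */
run;
```

      Because no variables are named in the INPUT statement and the DATA step uses _NULL_, nothing is stored; the step only displays the raw records.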

      The DATA step for reading fixed-position data looks similar to the DATA step for reading delimited data, but there are several important modifications. For fixed-position files, the syntax of the INPUT statement provides information about column positions of the variable values in the raw file, as it cannot rely on delimiters for separating values. Therefore, delimiter-modifying INFILE options such as DSD and DLM= have no utility with fixed-position data. Two different forms of input are commonly used for fixed-position data: column input or formatted input. This section focuses on column input while Chapter 4 discusses formatted input.

      Column Input

      Column input takes advantage of the fixed positions in which variable values are found by directly placing the starting and ending column positions into the INPUT statement. Program 2.8.8 shows how to use column input to read the IPUMS CPS 2005 basic data. The results of Program 2.8.8 are identical to Output 2.8.5.

      Program 2.8.8: Reading Data Using Column Input

      data work.ipums2005basicFPa;
        infile RawData ('ipums2005basic.dat');
        input serial 1-8 state $ 10-29 city $ 31-70 cityPop 72-76
              metro 78-80 countyFips 82-84 ownership $ 86-91
              mortgageStatus $ 93-137 mortgagePayment 139-142
              HHIncome 144-150 homeValue 152-158;
      run;

      The LENGTH statement is no longer needed. When using column input, SAS assigns the length based on the number of columns read if the length attribute is not previously defined. Here, SAS assigns State a length of 20 bytes, just as was done in the LENGTH statement in Program 2.8.5.

      In the specification city $ 31-70, the first number, 31, is the column position from which SAS starts reading the current variable, City. The second number, 70, is the last column SAS reads to determine the value of City.

      The default length of eight bytes is still used for numeric variables, regardless of the number of columns.
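      The length behavior described above can be checked on a small scale. The following sketch uses a hypothetical two-field layout with in-stream data; the data set name work.lengthCheck and the records are illustrative only.

```sas
/* Hypothetical two-field layout: serial in columns 1-8, state in columns 10-29. */
data work.lengthCheck;
  input serial 1-8 state $ 10-29;
  datalines;
       2 Alabama
       3 Alabama
;

/* The variable listing reports the length assigned to each variable. */
proc contents data = work.lengthCheck;
run;
```

      In the PROC CONTENTS listing, state has a length of 20 because 20 columns were read, while the numeric variable serial keeps the default length of 8.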

      Beyond the differences between column input and list input shown in Program 2.8.8, since column input uses the column positions, the INPUT statement can read variables in any order, and can even reread columns if necessary. Furthermore, the INPUT statement can skip unwanted variables. Program 2.8.9 reads Input Data 2.8.8 and demonstrates the ability to reorder and reread columns.

      Program 2.8.9: Reading the Input Variables Differently than Column Order

      data work.ipums2005basicFPb;
        infile RawData('ipums2005basic.dat');
        input serial 1-8 hhIncome 144-150 homeValue 152-158   /* read out of column order */
              ownership $ 86-91 ownershipCoded $ 86           /* column 86 is read twice */
              state $ 10-29 city $ 31-70 cityPop 72-76
              metro 78-80 countyFips 82-84
              mortgageStatus $ 93-137 mortgagePayment 139-142;
      run;

      proc print data = work.ipums2005basicFPb(obs = 5);
        var serial -- state;   /* all variables from serial through state */
      run;

       Output 2.8.9 shows that HHIncome and HomeValue are now earlier in the data set. Column input allows for reading variables in a user-specified order.

       Column 86 is read twice: first as part of a full value for Ownership, and second as a simplified version using only the first character as the value of a new variable, OwnershipCoded.

       As discussed in Chapter Note 3 in Section 1.7, the double-dash selects all variables between Serial and State, inclusive.

      Output 2.8.9: Reading the Input Variables Differently than Column Order

Obs  serial  hhIncome  homeValue  ownership  ownershipCoded  state
  1       2     12000    9999999  Rented     R               Alabama
  2       3     17800    9999999  Rented     R               Alabama
  3       4    185000     137500  Owned      O               Alabama
  4       5      2000    9999999  Rented     R               Alabama
  5       6     72600      95000  Owned      O               Alabama
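      Program 2.8.9 reorders and rereads columns but still reads every field. Skipping a field works the same way: its columns are simply left out of the INPUT statement. The sketch below assumes a hypothetical four-field layout with in-stream data; the data set name work.skipDemo, the variable stateInitial, and the records are illustrative only.

```sas
/* Hypothetical layout: serial 1-8, state 10-16, cityPop 18-22, metro 24-26. */
data work.skipDemo;
  /* cityPop (columns 18-22) is never named, so it is not read. */
  input metro 24-26 serial 1-8 stateInitial $ 10;
  datalines;
       2 Alabama     0   4
       3 Alabama     0   1
;

proc print data = work.skipDemo;
run;
```

      The resulting data set contains only metro, serial, and stateInitial, in that order, with stateInitial holding the single character found in column 10.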

      Mixed Input

