If this is known, the comparison is relatively straightforward, although deciding on how ‘similar’ they should be to be acceptable is not clear-cut. If differences are discovered then, again, these can simply be reported along with suitable caveats applied to the results. Alternatively, the researcher may try to compensate for the problem by using a weighted adjustment of responses. A weight is a multiplying factor applied to some or all of the responses given in a survey in order to eliminate or reduce the impact of bias caused by types of case that are over- or under-represented in the sample. Thus if there are too few women aged 20–24 in a sample survey compared with the proportion of this age group known to exist in the population of cases, so that only 50 are in the achieved sample where 60 are required, then the responses of this group (for example, the number who said ‘Yes’ to a question) are multiplied by a weight calculated by dividing the target sample number by the actual sample number: in this example, 60/50 or 1.2.
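A minimal sketch of this weighting calculation, using the target and achieved counts from the example above (the ‘Yes’ count of 30 is invented purely for illustration):

```python
# Minimal sketch of the weighting adjustment described above.
# The target and achieved counts (60 and 50) follow the worked example;
# the 'Yes' count is invented purely for illustration.

target_n = 60    # number of women aged 20-24 required in the sample
achieved_n = 50  # number actually obtained

weight = target_n / achieved_n  # 60/50 = 1.2

observed_yes = 30                      # responses given by the under-represented group
weighted_yes = observed_yes * weight   # counts as 36 in the weighted results

print(f"Weight applied: {weight:.2f}")
print(f"Weighted 'Yes' count: {weighted_yes:.1f}")
```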
Even when individuals do respond, there may be differences between respondents’ reported answers and actual or ‘true’ values. Response errors arising through dishonesty, forgetfulness, faulty memory, unwillingness or misunderstanding of the questions being asked are notoriously difficult to measure. Research on response error, furthermore, is limited due to the difficulty of obtaining some kind of external validation. In interview surveys, whether face to face or by telephone, interviewers may themselves misunderstand questions or the instructions for filling them in, and may be dishonest, inaccurate, make mistakes or ask questions in a non-standard fashion. Interviewer training, along with field supervision and control, can, to a large extent, reduce the likelihood of such errors, but they will never be entirely eliminated, and there is always the potential for systematic differences between the results obtained by different interviewers.
Errors arising from non-response, erroneous responses or interviewer mistakes are specific to questionnaire survey research. Errors from the inappropriate specification of cases, from biased case selection, from random sampling error or poor data capture techniques may arise in all kinds of research. Errors of different kinds will affect the record of variables or set memberships for each property in various ways and to different extents.
What researchers do in practice is, separately for each property, to focus on likely measurement error – discrepancies between the values recorded and the actual or ‘true’ values. The size of such error is usually unknown since the true value is unknown, but evidence from various sources can be gathered in order to estimate or evaluate the likelihood of such errors. Researchers focus on two aspects of such discrepancies: reliability and validity.
A measure is said to be reliable to the extent that it produces consistent results if repeat measures are taken. We expect bathroom scales to give us the same reading if we step on them several times in quick succession. If we cannot rely on the responses that a questionnaire item produces, then any analysis based on the question will be suspect. For a single-item question or a multi-item question (generating a derived measure) the measures can be retaken at a later date and the responses compared. Such test–retests give an indication of a measure’s stability over time (a brief sketch of such a check follows the list below), but there are several problems with this way of assessing reliability:
it may not be practical to re-administer a question at a later date;
it may be difficult to distinguish between real change and lack of reliability;
the administration of the first test may affect people’s answers the second time around;
it is unclear how long to wait between tests;
it is unclear what differences between measures count as ‘significant’.
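As a minimal sketch of the test–retest check referred to above (all scores are invented for illustration), the two administrations of a measure can simply be correlated:

```python
# Minimal sketch: correlating scores from two administrations of the same
# measure. The scores below are invented purely for illustration.
from statistics import correlation  # Pearson correlation, Python 3.10+

time_1 = [12, 15, 9, 20, 14, 18, 11, 16]   # first administration
time_2 = [13, 14, 10, 19, 15, 17, 11, 15]  # retest some weeks later

test_retest_r = correlation(time_1, time_2)
print(f"Test-retest correlation: {test_retest_r:.2f}")
```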
An alternative is to give respondents two different but equivalent measures on the same occasion. The extent to which the two measures covary can then be taken as a measure of reliability. The problem here is that it may be difficult to obtain truly equivalent tests. Even when they are possible, the result may be long and repetitive questionnaires. Another version of this equivalent measures test is the split-half test. This randomly splits the values on a single variable into two sets. A score for each case is then calculated for each half of the measure. If the measure is reliable, each half should give the same or similar results and across the cases the scores should correlate. The problem with this method is that there are several different ways in which the values can be split, each giving different results.
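A minimal sketch of the split-half idea, assuming a hypothetical eight-item scale with invented responses; note that a different random split could well give a different correlation, which is precisely the weakness just described:

```python
# Minimal sketch of a split-half check. Each row is one case's responses to
# a hypothetical eight-item scale; the data are invented purely for illustration.
import random
from statistics import correlation  # Pearson correlation, Python 3.10+

responses = [
    [4, 5, 3, 4, 5, 4, 3, 4],
    [2, 1, 2, 3, 1, 2, 2, 1],
    [5, 4, 5, 5, 4, 5, 4, 5],
    [3, 3, 2, 3, 3, 2, 3, 3],
    [1, 2, 1, 1, 2, 1, 2, 2],
    [4, 4, 5, 4, 3, 4, 4, 5],
]

items = list(range(8))
random.shuffle(items)                  # one of many possible random splits
half_a, half_b = items[:4], items[4:]

# Score each case on each half, then correlate the two sets of scores
score_a = [sum(row[i] for i in half_a) for row in responses]
score_b = [sum(row[i] for i in half_b) for row in responses]

print(f"Split-half correlation: {correlation(score_a, score_b):.2f}")
```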
Where the measure taken is a multi-item scale, for example a summated rating scale, it is possible to review the internal consistency of the items. Internal consistency is a matter of the extent to which the items used to measure a concept ‘hang’ together. Ideally, all the items used in the scale should reflect some single underlying dimension; statistically this means that they should correlate one with another (the concept of correlation is taken up in detail in Chapter 5). An increasingly popular measure for establishing internal consistency is a coefficient developed in 1951 by Cronbach that he called alpha. Cronbach’s coefficient alpha takes the average correlation among the items and adjusts for the number of items. Reliable scales are ones with high average correlation and a relatively large number of items. The coefficient varies from zero for no reliability to one for maximum reliability. The result approximates taking all possible split halves, computing the correlation for each split and taking the average. It is therefore a superior measure to taking a single split-half measure. However, there has been some discussion over the interpretation of the results. This discussion is summarized in Box 1.1.
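As a minimal sketch, one common form of the coefficient (the standardized alpha, which matches the description above of adjusting the average inter-item correlation for the number of items) can be written as alpha = k·r̄ / (1 + (k − 1)·r̄), where k is the number of items and r̄ the average correlation between pairs of items. The responses below are invented purely for illustration:

```python
# Minimal sketch of the standardized form of Cronbach's alpha, computed from
# the average inter-item correlation (r_bar) and the number of items (k).
# The responses are invented purely for illustration.
from itertools import combinations
from statistics import correlation  # Pearson correlation, Python 3.10+

responses = [  # rows = cases, columns = items of a summated rating scale
    [4, 5, 3, 4],
    [2, 1, 2, 3],
    [5, 4, 5, 5],
    [3, 3, 2, 3],
    [1, 2, 1, 1],
    [4, 4, 5, 4],
]

k = len(responses[0])                                    # number of items
items = [[row[i] for row in responses] for i in range(k)]

# Average correlation across all pairs of items
pair_rs = [correlation(a, b) for a, b in combinations(items, 2)]
r_bar = sum(pair_rs) / len(pair_rs)

alpha = (k * r_bar) / (1 + (k - 1) * r_bar)
print(f"Average inter-item correlation: {r_bar:.2f}")
print(f"Standardized Cronbach's alpha:  {alpha:.2f}")
```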
A record of a value for a case is said to be valid to the extent that it measures what it is intended to measure: in other words, the measure and the concept must be properly matched. Validity relates to the use that is made of the measure, so stepping on bathroom scales may be a valid measure of weight, but not of overall health. A reliable measure, furthermore, is not necessarily valid. Our bathroom scales may consistently over- or under-estimate our real weight (although they will still measure change). A valid measure, on the other hand, will, of course, be reliable.
There is no conclusive way of establishing validity. Sometimes it is possible to compare the results of a new measure with a more well-established one. The results of a device for measuring blood pressure at home might be compared with the results from a doctor’s more sophisticated equipment. This assumes, of course, that our GP’s results are valid. For many concepts in the social sciences there are few or no well-established measures. In this situation, researchers might focus on what is called ‘content’ or ‘face’ validity. The key to assessing content validity lies in reviewing the procedures that researchers use to develop the instrument. Is there a clear definition of the concept, perhaps relating this to how the concept has been defined in the past? Do the items that have been used to measure the concept adequately cover all the different aspects of the construct? Have the items been pruned and refined so that, for example, items that do not discriminate between respondents or cases are excluded, and any items that overlap to too great an extent with other items are avoided?
Another way of establishing validity is to assess the extent to which the measures produce the kind of results that would be expected on the basis of experience or well-established theories. Thus a measure of alienation might be expected to associate inversely with social class – the lower the class, the higher the alienation. If the two measures do indeed behave in this way, then this is evidence of what is often called construct validity. If they do not, the researcher may be unclear whether either or both measures are faulty or whether the relationship between the two measures is contrary to theoretical expectations. Conversely, the expectation might be that the two measures are unconnected and therefore should not correlate. If it turns out that they are indeed not correlated, then this is sometimes called discriminant validity.
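A minimal sketch of the kind of construct-validity check just described, with invented scores and an assumed coding in which higher numbers indicate higher social class:

```python
# Minimal sketch of a construct-validity check: a measure of alienation is
# expected to correlate negatively with social class. All scores are invented
# purely for illustration; the class coding (higher = higher class) is assumed.
from statistics import correlation  # Pearson correlation, Python 3.10+

social_class = [1, 2, 2, 3, 4, 5, 5, 6]
alienation   = [78, 70, 74, 60, 55, 40, 44, 35]

r = correlation(social_class, alienation)
print(f"Correlation: {r:.2f}")
if r < 0:
    print("Inverse association, as theory predicts: evidence of construct validity.")
else:
    print("No inverse association: one or both measures, or the theory, may be suspect.")
```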
Other more complex methods of measuring construct validity have been developed, such as multitrait–multimethod validity (Campbell and Fiske, 1959), pattern matching (Cook and Campbell, 1979) or factor analysis (see Chapter 6). Evidence of validity has to be argued for and may be gathered in a number of ways. No single way is likely to provide full evidence and even a combination of approaches is unlikely to constitute final proof.