Interpreting and Using Statistics in Psychological Research. Andrew N. Christopher


need to be able to use it in their research.

Measurement Reliability and Validity

Let’s take a fairly simple variable to operationalize: foot size. It is a variable because people have various foot sizes. Suppose you measured your foot size right now. If you measured your foot size again tomorrow, I bet the two measurements would give you close to, if not precisely, the same number. It is unlikely that your foot size would change much between now and tomorrow. This is the essence of measurement reliability. The measurement of a variable is reliable to the extent that the measurement provides consistent scores. Think about a friend of yours whom you consider to be “reliable.” That means you can count on him or her to behave consistently over time. If she says she will meet you at 11 a.m., she will be there by that time or even several minutes before that time. It is the same notion with measurement reliability. If the same person completes a measure of a personality construct (e.g., venturesomeness), his or her score on that measure should be consistent.

Reliability: extent to which a measure produces consistent results.

There are several forms of measurement reliability. We will briefly discuss two of them here. A measurement has test–retest reliability when people complete a measurement twice and they tend to have similar scores on that measurement each time they take it. For instance, in psychology, personality traits are understood to be stable (consistent) behavioral patterns that a person displays across situations. So, we would expect if people completed a measure of a personality trait today, they should score similarly when they complete that measure at another time.

Test–retest reliability: extent to which people tend to score similarly on a measurement that is completed at two different points in time.

It is important to realize that measurement reliability is established by examining a large sample of data, not scores from one or two people. For instance, I took the Scholastic Aptitude Test (SAT®; College Board, New York, NY) twice and earned very different scores on the two testing occasions. However, when we examine a large group of students taking the SAT, we will see that, in general, people who scored low the first time tend to score low when they take the test again. Similarly, people who scored well the first time tend to score well when they take the test again. Remember the law of small numbers: results based on a small sample of data (such as only my SAT scores) are not enough to draw conclusions about the world.
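To make the idea of "similar scores on two occasions" concrete, here is a minimal sketch. It assumes that consistency is summarized with a Pearson correlation between the two sets of scores, which is a common choice but not one this passage names. The scores are invented for illustration, and a real reliability study would use far more than eight people.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length lists with at least 2 scores")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    # Sum of cross-products of deviations from each mean
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


# Hypothetical scores for the same eight people on two testing occasions
time1 = [12, 18, 25, 31, 36, 40, 44, 50]
time2 = [14, 17, 27, 29, 38, 39, 47, 49]

test_retest_r = pearson_r(time1, time2)
print(f"test-retest r = {test_retest_r:.2f}")
```

A correlation near 1 would indicate that people who scored low the first time also tended to score low the second time, which is the pattern described above for the SAT.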

To establish test–retest reliability, people must complete a measure twice. By definition, this is a problem. Not every student will take the SAT twice. In Terrell et al.’s (2008) research, I doubt participants would have wanted to complete the measures a second time. Therefore, there is a second type of reliability called internal reliability. Take a look at Figure 2.2. This figure contains an instrument to measure a construct called grit. Grit is a construct of task persistence and tenacity (Duckworth, Peterson, Matthews, & Kelly, 2007). As you look at each item on the Grit Scale, it should be reasonable to assume that each person would tend to respond to these items fairly consistently; after all, they are all supposed to tap into a person’s grit. To get an idea of how reliable this measure is, researchers can divide it in half, and then see whether scores on one half of the measure (6 items) are consistent with scores on the other half of the measure (the other 6 items).² This is why it is called “internal” reliability; we are looking for consistency between halves of the measure.

Internal reliability: extent to which people tend to score similarly on different parts of a measurement that is completed only once.
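The split-half idea described above can be sketched in a few lines. This is only an illustration under stated assumptions: the responses are invented (they are not Grit Scale data), all items are assumed to be scored in the same direction already, and the split is simply the first six items versus the last six. The Spearman-Brown adjustment at the end is a standard formula that this passage does not discuss; it is included as an optional extra step.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


# Hypothetical responses: one row per person, 12 items rated 1 to 5
responses = [
    [5, 4, 5, 5, 4, 5, 4, 5, 5, 4, 5, 5],
    [2, 1, 2, 2, 3, 1, 2, 2, 1, 2, 2, 1],
    [3, 3, 4, 3, 3, 4, 3, 3, 4, 3, 3, 3],
    [4, 4, 3, 4, 5, 4, 4, 3, 4, 4, 5, 4],
    [1, 2, 1, 1, 2, 2, 1, 1, 2, 1, 1, 2],
    [3, 2, 3, 3, 2, 3, 3, 2, 3, 3, 2, 3],
]

# Total score on each half of the measure for every person
first_half = [sum(row[:6]) for row in responses]
second_half = [sum(row[6:]) for row in responses]

split_half_r = pearson_r(first_half, second_half)

# Spearman-Brown adjustment: estimates reliability of the full 12-item
# measure from the correlation between its two 6-item halves
full_length_estimate = 2 * split_half_r / (1 + split_half_r)

print(f"split-half r = {split_half_r:.2f}")
print(f"Spearman-Brown estimate = {full_length_estimate:.2f}")
```

Other ways of splitting the items (for example, odd-numbered versus even-numbered items) are equally plausible and may give somewhat different values.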

Now suppose we took a measure of foot size and used it to predict performance as a counselor. Foot size is reliable; however, to use it to predict job performance is not a valid use of this measurement. We say a measurement of a variable is valid when that measurement is used as it was intended to be used. I do not think that foot size would in any way be related to job performance as a counselor. However, a measure of knowledge about appropriate counseling techniques probably is a valid indicator of job performance as a counselor.

Figure 2.2 Duckworth et al.’s (2007) Grit Scale

Source: Copyright © 2007 American Psychological Association. Reproduced [or Adapted] with permission. The official citation that should be used in referencing this material is [list the original APA bibliographic citation]. No further reproduction or distribution is permitted without written permission from the American Psychological Association.

Valid: extent to which a measure is appropriate to use in a given context.

Just as there are different types of reliability, there are different types of validity. We will focus on two forms of validity here. First, construct validity refers to how well a variable, such as a person’s level of narcissism, is operationalized. Construct validity is generally not all-or-nothing; rather, a measure possesses it to some degree. To assess the extent to which a measure has construct validity, we can see how well it is associated with measures of closely related constructs. For instance, what constructs might be related to being narcissistic? Perhaps self-esteem? If we find our measure of narcissism is related to measures of self-esteem, we can be confident that it possesses some degree of construct validity.

Construct validity: degree to which a variable is operationalized appropriately.

We also have criterion validity, which refers to the extent to which a measure is related to some outcome of interest. For instance, when evaluating the job performance of counselors, we can see whether our measure of knowledge of counseling techniques predicted such performance. If our measure did predict performance, it would be high in criterion validity. Likewise, to the extent that the SAT predicts academic performance in college, we can say that the SAT possesses criterion validity.

Criterion validity: how well a measure predicts an outcome.
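As a rough illustration of what "predicts an outcome" can mean in practice, the sketch below fits a least-squares line that predicts a performance rating from a knowledge-test score. Both the data and the variable names are hypothetical, and simple linear regression is only one of several ways a researcher might examine criterion validity.

```python
def fit_line(x, y):
    """Least-squares slope and intercept for predicting y from x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def r_squared(x, y, slope, intercept):
    """Proportion of variance in y accounted for by the fitted line."""
    mean_y = sum(y) / len(y)
    ss_res = sum((b - (intercept + slope * a)) ** 2 for a, b in zip(x, y))
    ss_tot = sum((b - mean_y) ** 2 for b in y)
    return 1 - ss_res / ss_tot


# Hypothetical data: knowledge-test scores and supervisor ratings (1 to 5)
knowledge = [55, 60, 65, 70, 75, 80, 85, 90]
performance = [2.9, 3.1, 3.0, 3.6, 3.8, 3.7, 4.3, 4.4]

slope, intercept = fit_line(knowledge, performance)
fit = r_squared(knowledge, performance, slope, intercept)

print(f"predicted rating = {intercept:.2f} + {slope:.3f} * knowledge score")
print(f"proportion of variance explained = {fit:.2f}")
```

The stronger the relationship between the measure and the outcome, the more evidence there is for criterion validity. With only eight invented cases, the numbers here show the mechanics and nothing more.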

Researchers take great care in ensuring that their measurements are both reliable and valid. It requires a great deal of work to provide evidence of reliability and validity of measurements in published scientific research. Indeed, it generally requires large samples of data, often involving thousands of respondents, to establish both reliability and validity. That is why when conducting their research, investigators often use measurements that have already been published because they can be more confident in the reliability and validity of those measurements. For these reasons, Terrell and her colleagues (2008) used measurements that other researchers had demonstrated were both reliable and valid.

Learning Check

1 How might we operationally define each of the following constructs? Realize that for each construct, there are many possible ways to operationalize it, more than I’ve provided in the answers.

Hostility
A: blood pressure while interacting with another person; self-report measure of the tendency to experience hostility toward others

Helping behavior
A: number of times people hold the door open to a building for other people over a period of time; number of hours a person volunteers in his or her community each week

Intelligence
A: grade-point average (GPA); scores on a standardized test such as the Wechsler Adult Intelligence Scale (WAIS)

Investment returns
A: interest earned on a savings account this past year; stock market gains or losses in the past year

Exercise behavior
A: number of miles walked each day; number of pull-ups a person does each week

Diet
A: daily sodium intake; number of servings of fruits and vegetables each day

Stress
A: cortisol levels; blood pressure

Job burnout
A: number of cynical comments a person makes at work each day; scores on a self-report measure of burnout (e.g., Maslach & Jackson, 1981)

2 In Terrell et al.’s (2008) research, how was the dependent variable operationally defined?
A: The dependent variable of aggression was operationalized as the number of noise blasts delivered during the 10-minute experimental
