Practical Data Analysis with JMP, Third Edition. Robert Carver
Using the Distribution Platform for Continuous Data
As before, we will first use the Distribution platform to do most of the work here.
1. Select Analyze ► Distribution. Cast LifeExp into the role of Y, Columns and click OK.
2. When the distribution window opens, click the red triangle next to Distributions, and select Stack. This reorients the output horizontally, making it a bit easier to interpret.
The histogram (Figure 3.5) is one representation of the distribution of life expectancy around the world in 2015, and it gives us one view of how much life expectancy varies. Above the histogram is a box plot (also known as a box-and-whiskers plot), which will be explained later in this chapter.
Figure 3.5: A Typical Histogram
As in the bar charts that we have studied earlier, there are two dimensions in the graph. Here, the horizontal axis displays values of the variable and the vertical axis displays the frequency of each small interval of values. For example, we can see that only a few countries have projected life expectancies of 51 to 54 years, but many have life expectancies between 74 and 78 years.
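To make the idea of "frequency of each small interval" concrete, here is a minimal Python sketch of the counting that lies behind a histogram. It is not how JMP builds its graph. The data values, the 5-year interval width, and the function name bin_counts are all assumptions chosen for illustration.

```python
from collections import Counter

# Hypothetical life expectancies in years; NOT values from the JMP data table.
life_exp = [52.1, 58.4, 63.0, 66.7, 69.2, 71.5, 72.8, 74.1, 74.9, 75.3,
            75.8, 76.4, 77.0, 78.6, 80.2, 81.7, 83.1]


def bin_counts(values, start, width):
    """Count how many values fall in each interval [lower, lower + width)."""
    counts = Counter()
    for v in values:
        # Lower edge of the interval that contains v.
        lower = start + width * int((v - start) // width)
        counts[lower] += 1
    return dict(sorted(counts.items()))


counts = bin_counts(life_exp, start=50, width=5)

# Print a rough text histogram: one asterisk per observation in the interval.
for lower, n in counts.items():
    print(f"{lower:>3}-{lower + 5:<3} {'*' * n}")
```

Each row of the printout corresponds to one bar: the interval sits on one axis and the number of asterisks is the frequency.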
When we look at a histogram, we want to develop the habit of looking for four things: the shape, the center (or central tendency), the dispersion of the distribution, and unusual observations. The histogram can very often clearly represent these four aspects of the distribution.
Shape: Shape refers to the symmetry of the histogram and to the presence of peaks in the graph. A graph is symmetric if you could find a vertical line in the center defining two sides that are mirror images of one another. In Figure 3.5, we see an asymmetrical graph. There are few observations in the tails on the left, and most observations clump together on the right side. We say this is a left-skewed (or negatively skewed) distribution.
Many distributions have one or more peaks—data values that occur more often than the other values. Here we have a distinct peak around 75 to 76 years, and others closer to 72 and 83. Some distributions have multiple peaks, and some have no distinctive peaks at all. In short, we might describe the shape of this distribution as “multi-peaked and left-skewed.”
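The direction of skew can also be checked numerically. The sketch below uses a simple moment-based skewness coefficient, which is one common definition and may differ slightly from the adjusted value that a statistics package reports. The data and the function name skewness are hypothetical.

```python
# Hypothetical values with a long left tail; not taken from the book's data.
values = [40, 55, 62, 68, 71, 73, 74, 75, 75, 76, 77, 78, 80, 82]


def skewness(xs):
    """Moment-based skewness: third central moment over (second)^(3/2)."""
    n = len(xs)
    mean = sum(xs) / n
    m2 = sum((x - mean) ** 2 for x in xs) / n
    m3 = sum((x - mean) ** 3 for x in xs) / n
    return m3 / m2 ** 1.5


g1 = skewness(values)

# A negative coefficient indicates a longer left tail (left-skewed).
if g1 < 0:
    shape = "left-skewed"
elif g1 > 0:
    shape = "right-skewed"
else:
    shape = "symmetric"

print(f"skewness = {g1:.2f} ({shape})")
```

A few very low values pull the coefficient below zero, which matches the visual impression of a tail stretching to the left.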
Center (or central tendency): Where do the values congregate on the number line? In other words, what values does this variable typically assume? As you might already know, there are several definitions of center as reflected in the mean, median, and mode statistics. Visually, we might think of the center of a histogram as the point that divides the observations into two equal halves (the median, which is approximately 74 years in this case), as the highest-frequency region (the highest peak near 75), perhaps as a type of visual balancing point (the mean, which is approximately 72), or in some other way. Any of these interpretations has legitimacy, and each responds to the question in a slightly different way.
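As a rough illustration of how the three measures of center can disagree in a left-skewed distribution, the following sketch uses Python's standard statistics module. The data are hypothetical and are not drawn from the Life Expectancy table.

```python
import statistics

# Hypothetical values with a long left tail; not taken from the book's data.
values = [40, 55, 62, 68, 71, 73, 74, 75, 75, 76, 77, 78, 80, 82]

mean = statistics.mean(values)      # the balancing point
median = statistics.median(values)  # half the observations lie on each side
mode = statistics.mode(values)      # the most frequent single value

print(f"mean   = {mean:.1f}")
print(f"median = {median}")
print(f"mode   = {mode}")
```

In this made-up sample the mean falls below the median, which in turn falls below the mode. This is the same ordering described above for the 2015 histogram, where the low values in the left tail pull the mean down.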
Dispersion (or spread): While the concept of center focuses on the typical, the concept of spread focuses on departures from the typical. The question here is, “how much does this variable vary?” and again there are several reasonable ways to respond. We might think in terms of the lowest and highest observed values (from about 40 to 85), in terms of a vicinity of the center (for example, “life expectancy tends to vary in most countries between about 65 and 85”), or in some other relative sense.
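The different ways of describing spread can likewise be sketched in a few lines. The example below computes the range, the interquartile range (one way to express "a vicinity of the center"), and the standard deviation. The data are hypothetical, and because software packages use different quartile conventions, the quartiles here may not match what JMP would report for the same values.

```python
import statistics

# Hypothetical values; not taken from the book's data.
values = [40, 55, 62, 68, 71, 73, 74, 75, 75, 76, 77, 78, 80, 82]

# Range: distance between the lowest and highest observed values.
value_range = max(values) - min(values)

# Quartiles, using the statistics module's default ("exclusive") method.
q1, q2, q3 = statistics.quantiles(values, n=4)

# Interquartile range: the width of the middle half of the data.
iqr = q3 - q1

# Sample standard deviation: a typical distance from the mean.
sd = statistics.stdev(values)

print(f"range = {value_range}, IQR = {iqr}, std dev = {sd:.1f}")
```

The range reacts to a single extreme value, while the interquartile range describes only the middle half of the observations, so the two can tell quite different stories about the same variable.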
Unusual Observations: We can summarize the variability of a distribution by citing its shape, center, and dispersion, but in some distributions, there may be a small number of observations that deviate substantially from the pattern. In 2015, there was no such grouping, but let’s explore the shifts in the distribution over time and also find some unusual observations.
3. Re-open the global data filter (Rows ► Data Filter). Click Clear.
4. Click the red triangle next to Distributions and choose Redo ► Automatic Recalc. You will see the histogram change and might notice that it now represents more observations—we are looking at all twelve years of data.
5. Again, click the red triangle next to Distributions and choose Local Data Filter; choose year and click Add.
6. Rather than choosing one year, click the red triangle next to Local Data Filter and choose Animation, as shown in Figure 3.6. This will step through the twelve years, briefly selecting each one and changing the histogram for each year.
Figure 3.6: Animating a Local Data Filter
7. In the Animation Controls, click the blue “play” arrow and watch what happens. Take special notice of how life expectancy has tended to improve from 1960 through 2015.
8. After a few cycles, pause the animation in the year 1995.
Look at the box plot above the histogram. There are two dots at the far left end; these represent two nations with extraordinarily brief life expectancies. We refer to such values as outliers.
9. Hover the cursor over the left-most point in the box plot. You will see a pop-up note that this is Rwanda, with a life expectancy of only 31.977 years in 1995, reflecting the genocide that took place in 1994.
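One widely used convention for flagging points like these is to mark any value that lies more than 1.5 interquartile ranges beyond the quartiles. The sketch below applies that rule as an assumption for illustration. Only the Rwanda figure comes from the text; every other entry, the generic country labels, and the function name flag_outliers are made up.

```python
import statistics

# Only the Rwanda figure comes from the text; all other entries are made up.
life_exp_1995 = {
    "Rwanda": 31.977,
    "Country A": 55.0,
    "Country B": 60.2,
    "Country C": 63.5,
    "Country D": 66.1,
    "Country E": 68.4,
    "Country F": 69.9,
    "Country G": 71.3,
    "Country H": 72.8,
    "Country I": 74.0,
    "Country J": 76.5,
    "Country K": 78.1,
}


def flag_outliers(data, k=1.5):
    """Return the entries lying more than k * IQR beyond the quartiles."""
    q1, _, q3 = statistics.quantiles(list(data.values()), n=4)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return {name: v for name, v in data.items() if v < lower or v > upper}


outliers = flag_outliers(life_exp_1995)
print(outliers)
```

In this made-up sample only the Rwanda value falls below the lower cutoff, so it is the only point flagged.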
Often, it’s easier to think about shape, center, dispersion, and outliers by comparing two distributions. For example, Figure 3.7 shows two histograms using the life expectancy data from 1965 and 2015. We might wonder how human life expectancy changed during a 50-year period, and in these two histograms, we can note differences in shape, center, dispersion and unusual observations.
Figure 3.7: Comparing Two Distributions
To create the results shown in Figure 3.7, do the following:
10. Return to the original Life Expectancy data table.
11. Re-open the Data Filter dialog box (either choose Window and find the filter, or choose Rows ► Data Filter). Clear the Select check box but leave Show and Include checked.
12. Hold down the Ctrl key and highlight 1965 and 2015.
13. From the menu bar, choose Analyze ► Distribution.
14. Select LifeExp as Y, just as you did earlier.
15. Cast year into the role of By and click OK.
This creates the two distributions with vertically oriented histograms. When you look at them, notice that the axis of the first one runs from 25 to 75 years, and the axis on the second graph runs from 50 to 85 years.
To facilitate the comparison, it is helpful to orient the histograms horizontally in a stacked arrangement and to set the axes to a uniform scale, an option that is available in the red triangle menu next to Distributions. This makes it easy to compare their shapes, centers, and spreads at a glance.
16. In the Distribution report, while pressing the Ctrl key, click the uppermost red triangle and select Uniform Scaling.
If you click the red triangle without pressing the Ctrl key, the uniform scaling option applies only to the upper histogram. Pressing the Ctrl key applies the choice to all graphs in the window.
17. Hold down the Ctrl key, click the red triangle once again, and choose Stack.
The histograms on your screen should now look like Figure 3.7. How does the shape of the 1965 distribution compare to that of the 2015 distribution? What might have caused these changes in the shape of the distribution?
We see that people tend to live longer now than they did in 1965. The location (or central tendency) of the 2015 distribution is to the right of the 1965 distribution. These two distributions also have quite different spreads (degrees of dispersion): the values were far more spread out in 1965 than they are in 2015, and there were no outliers in either year. What does that reveal about life expectancy around the world during the past 50 years?
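The value of a uniform scale can be sketched outside JMP as well. The example below places two groups on the same set of intervals so that their counts line up row by row, and then compares their medians and ranges. The values for both years, the 10-year interval width, and the function name counts_on_shared_bins are all hypothetical.

```python
import statistics

# Hypothetical values for two years; not the book's data.
by_year = {
    1965: [30, 38, 42, 47, 51, 55, 60, 64, 68, 71, 73],
    2015: [52, 60, 66, 70, 72, 74, 75, 76, 78, 81, 84],
}

# Uniform scaling: one axis range that covers every group.
lo = min(min(v) for v in by_year.values())
hi = max(max(v) for v in by_year.values())
width = 10
start = lo - lo % width  # round the lower limit down to an interval edge


def counts_on_shared_bins(values, start, stop, width):
    """Count values per interval, using the same intervals for every group."""
    counts = {lower: 0 for lower in range(start, stop + 1, width)}
    for v in values:
        counts[start + width * ((v - start) // width)] += 1
    return counts


shared = {year: counts_on_shared_bins(v, start, hi, width)
          for year, v in by_year.items()}
medians = {year: statistics.median(v) for year, v in by_year.items()}
ranges = {year: max(v) - min(v) for year, v in by_year.items()}

for year in by_year:
    print(year, "median:", medians[year], "range:", ranges[year])
    for lower, n in shared[year].items():
        print(f"  {lower}-{lower + width - 1}: {'*' * n}")
```

Because both groups share the same intervals, an empty row is as informative as a full one: in this made-up sample the later year has no observations in the lowest intervals, while the earlier year has none in the highest.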