soon as you get the results, and so these types of tests have a major impact on how websites are managed.
So, how good are you at guessing which version is better? Table 1.6 shows the winning treatment for each of the tests (in bold) along with the lift in performance. If you were able to guess the results of all four examples, then you are a gifted web designer. Even experienced website managers frequently guess incorrectly, and user behavior changes over time and from one website to another; the only way to figure out which version is better is to run a test.
Table 1.6 Summary of web test results.
| Test | A treatment | B treatment | Response measures | Result |
|---|---|---|---|---|
| Email sign‐up | No incentive | **$10 incentive** | # of sign‐ups | 300% lift |
| Skirt images | Head‐to‐toe | **Cropped** | Skirt sales ($/session) | 7% lift |
| Location search | Zip search | **GPS search** | Sign‐ups | 40% lift |
| | | | Rentals | 23% lift |
| Video icon | No icon | Icon | % to product detail | No significant difference |
| | | | Sales ($/session) | No significant difference |

Note: Winning treatment shown in boldface.
Notice that Table 1.6 shows a different response measure for each test. The response measure (also called a key performance indicator, or KPI) in an experiment is simply the measure that is used to compare the performance of the two treatments. This measure is usually directly related to the business goal of the treatment. For instance, the purpose of the Dell Landing Page is to get people to sign up to talk to a Dell representative, so the percentage of users who submit a request is a natural response measure. Dell could have selected a different response measure, such as the % of users who actually speak with a Dell representative or who sign up and pay for services. In some cases, the test will include several response measures; the video icon test used both the % of users who viewed a product detail page, which is closely related to the goal of the video icon, and the sales per session, which reflects the ultimate business goal of any retail website. We will discuss the selection of response measures later, but for now it is sufficient to recognize that choosing a response measure that relates to business goals is a critical (and sometimes overlooked) part of test design.
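To make this concrete, here is a minimal sketch in R of how the two response measures for the video icon test might be computed from session-level data. The data frame and all of its values are invented for the sketch; they are not from the actual test.

```r
# Hypothetical session-level data (values invented for illustration)
sessions <- data.frame(
  treatment     = c("A", "A", "B", "B", "B"),
  viewed_detail = c(TRUE, FALSE, TRUE, TRUE, FALSE),  # reached a product detail page?
  sales         = c(0, 12.50, 30.00, 0, 8.75)         # $ spent in the session
)

# Two response measures per treatment: % of sessions reaching a product
# detail page, and sales per session ($/session)
aggregate(cbind(pct_detail = viewed_detail, sales_per_session = sales)
          ~ treatment, data = sessions, FUN = mean)
```

With real data there would be thousands of sessions per treatment, but each response measure is computed the same way: as an average over sessions within a treatment.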
Table 1.6 reports the test results in terms of a percentage lift. For example, in the Dell Landing Page test, the lift was 36%, which means that the hero image produced 36% more submissions. Table 1.6 only reports the lift numbers for test results that were found to be significant. Significance tests are used to determine whether there is enough data to say that there really is a difference between the two treatments. Imagine, for example, that we had test data on only five users: two who saw version A and looked at product details and three who saw version B and did not look at product details. Is this enough data to say that A is better than B? Your intuition probably tells you that it isn't, which is true, but when samples are a bit bigger, we can't rely on intuition to determine whether there is enough data to draw a conclusion. Testing for significance is one of the tools we use in analyzing A/B tests, and Chapter 2 will show you how to do it. As we will explain in the next few sections, we need more than just the lift numbers to perform the significance test.
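As a rough sketch of both calculations, the R snippet below uses invented counts (chosen so that the lift happens to match the 36% from the Dell example) and applies `prop.test()`, one standard significance test for comparing two proportions; Chapter 2 covers significance testing in detail.

```r
# Invented counts, not the actual Dell data: 1,000 sessions per treatment
n_A <- 1000; x_A <- 25   # treatment A: 25 submissions
n_B <- 1000; x_B <- 34   # treatment B: 34 submissions

# Lift: percentage improvement of B's conversion rate over A's
lift <- (x_B / n_B - x_A / n_A) / (x_A / n_A) * 100  # = 36%

# Two-sample test of proportions; here the p-value is well above 0.05,
# so this 36% lift would NOT be declared significant at these sample sizes
prop.test(x = c(x_A, x_B), n = c(n_A, n_B))
```

Note that the same 36% lift can be significant or not depending on the sample sizes behind it, which is precisely why the lift numbers alone are never enough.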
Most website testing managers will tell you that more than half of the website tests that they run are not significant, meaning that they cannot conclude that one version is better than the other. For example, in the video icon test in Figure 1.7, there were no significant differences in the % of users who viewed the product detail pages or the average sales per session. If we looked at the raw data, there would probably be some small differences, but they were not large enough to rise to the level of significance. The analyst has wisely chosen not to report the lift numbers and has instead simply said, “there was no significant difference.” While the manager who came up with the video icon idea might not be too happy to find that it doesn't work, it is important to know that it doesn't work so that attention can be shifted to more promising improvements to the website. Smart testing managers realize that it is important to run many tests to find the features of the website that really do change user behavior.
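Continuing with invented numbers, here is what a non-significant result looks like in R: the raw conversion rates differ slightly, but the test cannot distinguish the two treatments, so no lift is reported.

```r
# Invented counts for a video-icon-style test: 5,000 sessions per treatment
prop.test(x = c(210, 222), n = c(5000, 5000))
# Raw rates are 4.20% vs. 4.44%, but the p-value is well above 0.05,
# so we report "no significant difference" rather than a lift number
```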
Exercises
1.5.1 Visit a retail website and identify five opportunities for A/B tests on the website. For each test, clearly define the A and B treatments that you would test and identify a response variable to measure performance.
1.5.2 Find an article that reports the results of a medical experiment, a business experiment, or a psychological experiment. How are the results reported? Do they use a graph to display the data? Does the article indicate whether the difference between treatments was significant?
1.5.3 Visit a retail store and identify five opportunities for A/B tests. For each test, clearly define the A and B treatments that you would test and identify a response variable to measure performance.
1.6 A Brief History of Experiments
Experiments are as old as the Bible. From the Book of Daniel (1:11–16):
Daniel then said to the guard whom the chief official had appointed over Daniel, Hananiah, Mishael, and Azariah, “Please test your servants for ten days: Give us nothing but vegetables to eat and water to drink. Then compare our appearance with that of the young men who eat the royal food, and treat your servants in accordance with what you see.” So he agreed to this and tested them for ten days. At the end of the ten days they looked healthier and better nourished than any of the young men who ate the royal food. So the guard took away their choice food and the wine they were to drink and gave them vegetables instead.
The first clinical trial was conducted in 1747 by the Scottish physician James Lind, who was trying to find a cure for scurvy. Scurvy was a serious problem, since it killed more British sailors than the French and the Spanish combined. After two months at sea, when the men were afflicted with scurvy, Lind divided 12 sick sailors into six groups of two. Each day five of the groups were administered, respectively, cider, 25 drops of sulfuric acid, vinegar, a cup of seawater, and barley water, and the final group received two oranges and one lemon. After six days the fruit ran out, but by then one of the two sailors who received citrus had completely recovered and the other had almost recovered.
Randomization was introduced into experimental design in the nineteenth century by Peirce and Jastrow (1885) (many people incorrectly attribute this to R. A. Fisher in the twentieth century).