End-to-end Data Analytics for Product Development. Chris Jones
tendency, variability, shape of data) and is unknown.
Statistical inference uses sample data to draw conclusions about a population with a known level of risk. In general, statistical inference proceeds as follows:
1 We are interested in a population.
2 We identify parameters of that population that will help us understand it better.
3 We take a random sample and compute sample statistics.
4 Through inferential techniques, we use the sample statistics to infer facts about the population parameters of interest.
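The four steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the population is simulated (a normal distribution with mean 50 and standard deviation 5, chosen arbitrarily), and the sample size of 40 is likewise hypothetical. In a real study the population mean would not be available for comparison.

```python
import random
import statistics

random.seed(42)  # fixed seed so the sketch is reproducible

# Steps 1-2: a hypothetical population (simulated for illustration) and the
# parameter of interest, its mean. In practice this value is unknown.
population = [random.gauss(50.0, 5.0) for _ in range(100_000)]
population_mean = statistics.fmean(population)

# Step 3: take a simple random sample and compute a sample statistic.
sample = random.sample(population, k=40)
sample_mean = statistics.fmean(sample)

# Step 4: use the sample statistic to infer something about the parameter.
print(f"population mean (normally unknown): {population_mean:.2f}")
print(f"sample mean (our estimate):         {sample_mean:.2f}")
```

The sample mean will generally be close to, but not exactly equal to, the population mean; quantifying that gap is the job of the inferential techniques introduced below.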
Stat Tool 1.13 Inferential Problems
As mentioned in Stat Tool 1.3, we often want to answer questions about our processes or products to make improvements and predictions, save money and time, and increase customer satisfaction:
What is the stability of a new formulation?
Which attributes of a product do consumers find most appealing?
What is the performance of a new product compared with products currently on the market?
What is causing high levels of variation and waste during processing?
Can a process change reduce production time to get the product in stores more quickly?
These questions are examples of inferential problems.
How can we use inferential techniques to answer these questions?
Inferential problems are usually related to:
Estimation of a population parameter: What is the stability of a new formulation? Which attributes of a product do consumers find most appealing?
Comparison of a population parameter to a specified value or among groups: What is the performance of a new product compared with the industry standard or products currently on the market?
Assessing relationships among variables: What is causing high levels of variation and waste during processing? Can a process change reduce production time to get the product in stores more quickly?
We may use several inferential techniques to answer different questions:
Estimation of a population parameter: Point estimates and confidence intervals
Comparison among groups: Hypothesis testing (one‐sample tests; two‐sample tests; analysis of variance, ANOVA)
Assessing relationships among variables: Regression models
Stat Tool 1.14 Estimation of Population Parameters and Confidence Intervals
Let's introduce the problem of the estimation of a population parameter.
Because it is often impractical or impossible to gather data on the entire population, we must estimate the population parameters using sample statistics.
Statistics, such as the sample mean and standard deviation, are called point estimators.
A point estimate is a single sample value that approximates the true unknown value of a population parameter.
Point estimators:
sample mean x̄ | sample proportion p | sample standard deviation S |
Population parameters:
population mean μ | population proportion π | population standard deviation σ |
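As a minimal sketch, the three point estimators can be computed with Python's standard library. The satisfaction scores below are hypothetical, and the threshold of 7 used to define the proportion is an arbitrary choice made for illustration.

```python
import statistics

# Hypothetical satisfaction scores (0-10) from a sample of 10 consumers.
scores = [7, 6, 8, 5, 9, 7, 6, 8, 7, 7]

# Sample mean: point estimate of the population mean (mu).
x_bar = statistics.fmean(scores)

# Sample standard deviation (n - 1 denominator): point estimate of sigma.
s = statistics.stdev(scores)

# Sample proportion of consumers scoring 7 or higher: point estimate of pi.
p = sum(score >= 7 for score in scores) / len(scores)

print(f"sample mean = {x_bar:.2f}")
print(f"sample standard deviation = {s:.2f}")
print(f"sample proportion (score >= 7) = {p:.2f}")
```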
Point estimates, such as the sample mean or standard deviation, provide a lot of information, but they don't give us the full picture.
It is highly unlikely that, for example, the sample mean and standard deviation we obtain are exactly equal to the population parameters. To get a better sense of the true population values, we can use confidence intervals.
A confidence interval is a range of likely values for a population parameter, such as the population mean or standard deviation.
Usually, a confidence interval is a range centered on the point estimate: point estimate ± margin of error.
Using confidence intervals, we can say that it is likely that the population parameter is somewhere within this range.
Example 1.3. To illustrate this point, suppose that a research team wants to know the mean satisfaction score (from 0: completely not satisfied, to 10: completely satisfied) for the population of people who use a new formulation of a product. From a random sample of consumers, the sample mean is 6.8, and the confidence interval is CI = (6.2; 7.4). The population parameter, the mean satisfaction score μ, is unknown. So the true unknown population mean satisfaction score is likely to be somewhere between 6.2 and 7.4. The central point of the confidence interval is the sample mean: x̄ = 6.8 (point estimate of μ).
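The text does not say how the interval in Example 1.3 was calculated. One common approach, shown here as a sketch rather than as the method behind the example, is the large-sample normal approximation x̄ ± z · S / √n. The function name `mean_confidence_interval` and the scores are hypothetical; for small samples a t quantile is usually preferred over the z quantile used here.

```python
import math
import statistics

def mean_confidence_interval(data, confidence=0.95):
    """Approximate confidence interval for the mean: x_bar +/- z * s / sqrt(n).

    Uses the normal approximation, which is an assumption of this sketch.
    """
    n = len(data)
    x_bar = statistics.fmean(data)          # point estimate of the mean
    s = statistics.stdev(data)              # sample standard deviation
    alpha = 1.0 - confidence
    z = statistics.NormalDist().inv_cdf(1.0 - alpha / 2.0)
    margin = z * s / math.sqrt(n)           # margin of error
    return x_bar - margin, x_bar + margin

# Hypothetical satisfaction scores (0-10); not the data behind Example 1.3.
scores = [7, 6, 8, 5, 9, 7, 6, 8, 7, 7, 6, 8]

lower, upper = mean_confidence_interval(scores, confidence=0.95)
print(f"95% CI for the mean: ({lower:.2f}; {upper:.2f})")
```

By construction the interval is centered on the sample mean, which matches the observation in Example 1.3 that the point estimate is the central point of the interval.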
There's always a chance that the confidence interval won't contain the true population mean.
When we use confidence intervals, we must decide how sure we need to be that the confidence interval contains the actual population parameter value, taking into account that we cannot be 100% sure.
We quantify how sure we need to be with a value called the confidence level, usually denoted by (1 − α).
The confidence level is set by the researcher before calculation of a confidence interval.
The most common confidence level is 95% (0.95). Other common levels are 90% and 99%.
The confidence level is how sure we are that the confidence interval contains the actual population parameter value.
Example 1.4. To illustrate the meaning of the confidence level, let's return to the previous example and suppose we drew 100 samples from the same population and calculated a confidence interval for each sample. If we used 95% confidence intervals, on average 95 out of 100 of them would contain the population parameter, while 5 out of 100 would not. In practice, when we calculate a 95% confidence interval for our sample, we are confident that our sample is one of the 95% of samples for which the CI covers the true parameter value.
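The repeated-sampling idea in Example 1.4 can be checked with a small simulation. The sketch below assumes a normally distributed population with a hypothetical mean of 6.8 and standard deviation of 1.5, uses the normal-approximation interval x̄ ± z · S / √n, and draws 1000 samples instead of 100 so that the observed proportion is less noisy.

```python
import math
import random
import statistics

random.seed(1)  # fixed seed so the sketch is reproducible

TRUE_MEAN, TRUE_SD = 6.8, 1.5        # hypothetical population values
N_SAMPLES, SAMPLE_SIZE = 1000, 50    # number of repeated samples, size of each
z = statistics.NormalDist().inv_cdf(0.975)  # z quantile for a 95% level

covered = 0
for _ in range(N_SAMPLES):
    sample = [random.gauss(TRUE_MEAN, TRUE_SD) for _ in range(SAMPLE_SIZE)]
    x_bar = statistics.fmean(sample)
    margin = z * statistics.stdev(sample) / math.sqrt(SAMPLE_SIZE)
    # Count the intervals that contain the true population mean.
    if x_bar - margin <= TRUE_MEAN <= x_bar + margin:
        covered += 1

coverage = covered / N_SAMPLES
print(f"{covered} of {N_SAMPLES} intervals contain the true mean ({coverage:.1%})")
```

The observed proportion should land close to 95%, though not exactly on it, because the simulation is itself random and the interval is an approximation.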
Stat Tool 1.15 Hypothesis Testing
A common task in statistical studies is the comparison of mean values, variances, proportions, and so on, to a hypothesized value of interest or among different groups, for example:
What is the performance of a new product compared with the industry standard or products