End-to-end Data Analytics for Product Development. Chris Jones
What is the performance of a new product compared with the industry standard or products currently on the market?
What is causing high levels of variation and waste during processing?
These questions are examples of inferential problems.
Inferential problems are usually related to:
Estimation of a population parameter (e.g. a mean). Example: What is the stability of a new formulation?
Comparison among groups. Example: What is the performance of a new product compared with the industry standard or products currently on the market?
Assessing relationships among variables. Example: What is causing high levels of variation and waste during processing?
We can use several inferential techniques to answer different questions. Later on, we will review the following ones:
Estimation of a population parameter: point estimate; confidence intervals
Comparison among groups: hypothesis testing (one‐sample tests; two‐sample tests; ANOVA)
Assessing relationships among variables: regression models
Stat Tool 1.4 Shapes of Data Distributions
Frequency distributions may be shown by tables or graphs. Use bar charts for categorical or quantitative discrete variables, histograms for continuous variables, and dot plots (especially useful for small data sets) for discrete or continuous variables.
By observing the frequency distribution of a categorical or quantitative variable, several shapes may be detected:
When values or classes have similar percentages, the distribution is said to be fairly uniform. In a fairly uniform distribution there are no values or classes predominant over the others (a).
When there is one value or class predominant over the others, the distribution is said to be nonuniform and unimodal with one peak (b).
When there is more than one value or class predominant over the others, the distribution is said to be nonuniform and multimodal with more than one peak (c).
The value or class with the highest frequency is the mode of the distribution (see Figure 1.2).
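As a minimal sketch of the ideas above, the following Python snippet builds a percentage frequency table and finds the value or values with the highest frequency. The function names and the defect data are hypothetical and serve only as illustration. A tie for the highest frequency is just one simple way in which more than one value can predominate.

```python
from collections import Counter


def frequency_table(values):
    """Return {value: percentage} for categorical or discrete values."""
    counts = Counter(values)
    total = len(values)
    return {value: 100 * count / total for value, count in counts.items()}


def modes(values):
    """Return every value that shares the highest frequency, sorted."""
    counts = Counter(values)
    top = max(counts.values())
    return sorted(value for value, count in counts.items() if count == top)


# Hypothetical defect categories recorded during processing
defects = ["scratch", "dent", "scratch", "stain", "scratch", "dent"]

table = frequency_table(defects)   # percentage of each category
mode_values = modes(defects)       # category or categories with the highest frequency

for category, percentage in sorted(table.items()):
    print(f"{category}: {percentage:.1f}%")
print("Mode:", mode_values)
```

If all percentages in the table are similar, the distribution is fairly uniform. If one category clearly stands out, it is unimodal.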
Figure 1.2 Shapes of distributions.
Stat Tool 1.5 Shapes of Data Distributions for Quantitative Variables
By observing the frequency distribution of a quantitative discrete or continuous variable, further shapes may be detected, related to the presence or absence of symmetry (Figures 1.3 and 1.4).
Figure 1.3 Shapes of distributions (symmetric and skewed distributions).
Figure 1.4 Other shapes of distributions.
If one side of the histogram (or bar chart for quantitative discrete variables) is close to being a mirror image of the other, then the data are fairly symmetric (a). Middle values are more frequent, while low and high values are less frequent. If data are not symmetric, they may be skewed to the right (b) or skewed to the left (c). In (b) low and middle values are more frequent than high values. In (c) high and middle values are more frequent than low values.
If histograms (or bar charts for quantitative discrete variables) show ever‐decreasing or ever‐increasing frequencies, the distribution is said to be J‐shaped (d). If frequencies are decreasing on the left side of the graph and increasing on the right side, the distribution is said to be U‐shaped (e). Sometimes there are values that do not fall near any others. These extremely high or low values are called outliers (f).
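Symmetry is usually judged by eye from the histogram, but a numerical check can support that judgement. The sketch below uses the moment-based skewness coefficient, a technique that the text above does not introduce. The 0.5 tolerance is an assumed rule of thumb, and the function names are hypothetical.

```python
import statistics


def sample_skewness(data):
    """Moment-based skewness: mean of the cubed standardized deviations."""
    n = len(data)
    mean = statistics.fmean(data)
    sd = statistics.pstdev(data)  # population standard deviation
    if sd == 0:
        return 0.0  # all values identical: no asymmetry to measure
    return sum(((x - mean) / sd) ** 3 for x in data) / n


def describe_shape(data, tolerance=0.5):
    """Classify the shape; the tolerance is an assumed rule of thumb."""
    g = sample_skewness(data)
    if g > tolerance:
        return "skewed to the right"
    if g < -tolerance:
        return "skewed to the left"
    return "fairly symmetric"


# Hypothetical data sets
print(describe_shape([1, 2, 3, 4, 5]))            # mirror-image values
print(describe_shape([1, 1, 1, 2, 2, 3, 10]))     # long tail of high values
print(describe_shape([10, 10, 10, 9, 9, 8, 1]))   # long tail of low values
```

A positive coefficient indicates a longer tail of high values (skewed to the right). A negative coefficient indicates a longer tail of low values (skewed to the left).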
Stat Tool 1.6 Measures of Central Tendency: Mean and Median
When quantitative data distributions tend to concentrate around certain values, we can try to locate these values by calculating the so‐called measures of central tendency: the mean and the median. These measures describe the area of the distribution where most values occur.
The mean is the sum of all data values divided by the number of values. It represents the “balance point” of a set of values.
The median is the middle value in a sorted list of data. It divides the data in half: 50% of the data are greater than the median and 50% are less than the median.
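The two definitions can be sketched in a few lines of Python. The function names and the measurements are hypothetical. Averaging the two central values for an even number of data is a common convention assumed here, not one stated in the text above.

```python
def mean(data):
    """Sum of all values divided by the number of values."""
    return sum(data) / len(data)


def median(data):
    """Middle value of the sorted data."""
    ordered = sorted(data)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    # Even count: average the two central values (assumed convention)
    return (ordered[mid - 1] + ordered[mid]) / 2


# Hypothetical measurements
measurements = [12, 15, 11, 14, 13]

print("Mean:", mean(measurements))
print("Median:", median(measurements))
```

Python's standard library provides the same calculations as `statistics.mean` and `statistics.median`.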
For