Quality Progress - March 2017

Solving quality quandaries through statistics

Statistics Spotlight
DATA ANALYSIS

What's Driving Uncertainty?
The influences of model and model parameters in data analysis
by Christine M. Anderson-Cook
One of the substantial improvements to the practice of data
analysis in recent decades is the change from reporting just
a point estimate for a parameter or characteristic to now
including a summary of uncertainty for that estimate. Understanding the precision of the estimate for the quantity of interest
provides a better understanding of what to expect and how well
we are able to predict future behavior from the process.
For example, when we report a sample average as an estimate of the population mean, it is good practice to also provide
a confidence interval (CI), or a credible interval if you are doing
a Bayesian analysis, to accompany that summary. This helps to
calibrate what ranges of values are reasonable given the variability observed in the sample and the amount of data included in
producing the summary.
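
As a minimal sketch of the kind of summary described above, the Python snippet below computes a t-based 95% CI for a sample mean. The measurements, the function name mean_confidence_interval and the tabulated critical value are illustrative assumptions; none of them come from the example discussed in this article.

```python
import math
import statistics


def mean_confidence_interval(sample, t_crit):
    """Return (mean, lower, upper) for a two-sided t-based interval.

    t_crit is the t critical value for the chosen confidence level with
    len(sample) - 1 degrees of freedom (from a table or software).
    """
    n = len(sample)
    mean = statistics.fmean(sample)
    std_err = statistics.stdev(sample) / math.sqrt(n)  # standard error of the mean
    half_width = t_crit * std_err
    return mean, mean - half_width, mean + half_width


# Hypothetical measurements, invented for illustration only.
sample = [9.8, 10.2, 10.1, 9.9, 10.4, 10.0, 9.7, 10.3, 10.1, 9.9]

# 2.262 is the tabulated 97.5th percentile of the t distribution with 9 df.
mean, lower, upper = mean_confidence_interval(sample, t_crit=2.262)
print(f"mean = {mean:.3f}, 95% CI = ({lower:.3f}, {upper:.3f})")
```

The interval narrows as the sample grows or as the observed variability shrinks, which is the calibration the paragraph above refers to.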

Estimating density example

Recently, I encountered an example that demonstrates the
contributions from several sources we may wish to include in our
assessment of the uncertainty. An engineer had obtained a data
set with 30 observations that she wanted to use to estimate the
density of a material of interest as a function of the concentration of the key ingredient. The overall goal was to identify the
concentration at which the density is minimized.
Subject matter expertise for the process suggested that a
quadratic model of the form $\text{Dens}_i = \beta_0 + \beta_1 \text{Conc}_i + \beta_2 \text{Conc}_i^2 + \varepsilon_i$
should be adequate to summarize the relationship between
the explanatory variable, concentration, and the response,
density. Figure 1 shows the results when that model was fit to
the available data (using least-squares estimation) and a 95%
CI for the curve. The CI provides uncertainty bounds for where
the estimated mean curve lies, and differs from a prediction
interval, which shows where we would expect new observations


to be found if more data
were collected from the same
underlying mechanism.1
At first glance, the model
seems to fit reasonably well
with the overall trends in the
data being appropriately captured by the estimated model.
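
To make the distinction between the CI for the mean curve and a prediction interval concrete, here is a rough sketch in plain Python. It fits a quadratic by least squares and reports both half-widths at one concentration, along with the estimated concentration at which density is smallest. The data are simulated, and the helper names (design_row, invert, fit_polynomial, intervals_at), the true coefficients and the tabulated critical value are all assumptions made for illustration; they are not the engineer's data or software.

```python
import math
import random


def design_row(conc, degree):
    """Polynomial terms [1, conc, conc^2, ...] for one observation."""
    return [conc ** p for p in range(degree + 1)]


def invert(matrix):
    """Invert a small square matrix by Gauss-Jordan elimination."""
    n = len(matrix)
    aug = [list(row) + [float(i == j) for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def fit_polynomial(conc, dens, degree):
    """Least-squares fit; returns (coefficients, (X'X)^-1, residual variance)."""
    rows = [design_row(c, degree) for c in conc]
    k = degree + 1
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(rows, dens)) for i in range(k)]
    xtx_inv = invert(xtx)
    beta = [sum(xtx_inv[i][j] * xty[j] for j in range(k)) for i in range(k)]
    fitted = [sum(b * x for b, x in zip(beta, r)) for r in rows]
    sse = sum((y - f) ** 2 for y, f in zip(dens, fitted))
    return beta, xtx_inv, sse / (len(conc) - k)


def intervals_at(c0, beta, xtx_inv, s2, t_crit):
    """Fitted value plus CI and prediction-interval half-widths at c0."""
    x0 = design_row(c0, len(beta) - 1)
    k = len(x0)
    fitted = sum(b * x for b, x in zip(beta, x0))
    leverage = sum(x0[i] * xtx_inv[i][j] * x0[j]
                   for i in range(k) for j in range(k))
    ci_half = t_crit * math.sqrt(s2 * leverage)          # mean curve
    pi_half = t_crit * math.sqrt(s2 * (1.0 + leverage))  # new observation
    return fitted, ci_half, pi_half


# Simulated stand-in data: 30 concentrations (as fractions) and densities.
rng = random.Random(0)
conc = [i / 29 for i in range(30)]
dens = [2.0 - 1.6 * c + 2.0 * c * c + rng.gauss(0.0, 0.05) for c in conc]

beta, xtx_inv, s2 = fit_polynomial(conc, dens, degree=2)
# 2.052 is the tabulated 97.5th percentile of the t distribution with 27 df.
fitted, ci_half, pi_half = intervals_at(0.4, beta, xtx_inv, s2, t_crit=2.052)
conc_at_min = -beta[1] / (2.0 * beta[2])  # vertex of the fitted parabola

print(f"fit at 0.4: {fitted:.3f} +/- {ci_half:.3f} (CI), +/- {pi_half:.3f} (PI)")
print(f"estimated concentration at minimum density: {conc_at_min:.3f}")
```

The prediction interval is always wider than the CI at the same concentration, because it adds the variability of an individual new observation to the uncertainty in the estimated mean curve.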
The engineer also decided
to explore a slightly more
complicated cubic model, which
allows extra flexibility, of the form
$\text{Dens}_i = \beta_0 + \beta_1 \text{Conc}_i + \beta_2 \text{Conc}_i^2 + \beta_3 \text{Conc}_i^3 + \varepsilon_i$,
to see whether
this provided an improved fit.
The results of this fit are shown
in Figure 2, with the accompanying 95% CI.
Superficially, the curve also
seems to fit the data well,
although the general shape
does show some notable
differences from the quadratic
model. For larger concentrations (on the right-hand side of
the plot), the rate of increase
of the curve seems to diminish
with the cubic model, and the

shape around the minimum
also seems to differ. Table 1
shows a formal comparison of
the two models.
R², optimized by maximizing, summarizes the fraction
of the total variability of density observed in the sample
that is explained by each model.
Adjusted R², also optimized by
maximizing, includes a penalty
for larger models and generally is a better summary than
R² for comparing models of
different sizes. The predicted
residual error sum of squares
(PRESS) statistic2 (the
smaller, the better) is a form
of cross-validation to assess
the ability of the model to
predict.
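
The three summaries can be computed directly from their definitions. The sketch below, again in plain Python on simulated data, computes R², adjusted R² and PRESS for a quadratic and a cubic fit, with PRESS obtained by literally refitting the model with each observation left out in turn. The function names (solve, fit, predict, compare) and the data are assumptions for illustration only, and the numbers it prints are not those of Table 1.

```python
import random


def solve(a, b):
    """Solve the square system a x = b by Gaussian elimination."""
    n = len(b)
    aug = [list(row) + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - tail) / aug[i][i]
    return x


def fit(conc, dens, degree):
    """Least-squares polynomial coefficients via the normal equations."""
    rows = [[c ** p for p in range(degree + 1)] for c in conc]
    k = degree + 1
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(rows, dens)) for i in range(k)]
    return solve(xtx, xty)


def predict(beta, c):
    """Evaluate the fitted polynomial at concentration c."""
    return sum(b * c ** p for p, b in enumerate(beta))


def compare(conc, dens, degree):
    """Return (R2, adjusted R2, PRESS) for a polynomial of this degree."""
    n, k = len(conc), degree + 1
    beta = fit(conc, dens, degree)
    mean = sum(dens) / n
    sst = sum((y - mean) ** 2 for y in dens)
    sse = sum((y - predict(beta, c)) ** 2 for c, y in zip(conc, dens))
    r2 = 1.0 - sse / sst
    adj_r2 = 1.0 - (sse / (n - k)) / (sst / (n - 1))  # penalizes extra terms
    # PRESS: refit without observation i, then predict observation i.
    press = 0.0
    for i in range(n):
        beta_i = fit(conc[:i] + conc[i + 1:], dens[:i] + dens[i + 1:], degree)
        press += (dens[i] - predict(beta_i, conc[i])) ** 2
    return r2, adj_r2, press


# Simulated stand-in data: 30 concentrations (as fractions) and densities.
rng = random.Random(1)
conc = [i / 29 for i in range(30)]
dens = [2.0 - 1.6 * c + 2.0 * c * c + rng.gauss(0.0, 0.05) for c in conc]

quad_stats = compare(conc, dens, degree=2)
cubic_stats = compare(conc, dens, degree=3)
for name, (r2, adj_r2, press) in (("quadratic", quad_stats), ("cubic", cubic_stats)):
    print(f"{name}: R2 = {r2:.4f}, adj R2 = {adj_r2:.4f}, PRESS = {press:.4f}")
```

Because the quadratic is nested inside the cubic, R² can never decrease when the cubic term is added, which is why adjusted R² and PRESS are the more informative comparisons between models of different sizes.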
Using R² and adjusted R²,
the cubic model is preferred.
Using the PRESS statistic, the quadratic model is
preferred. When we look at
a formal test of the cubic
term, we reject the null
hypothesis that it has a value
of zero (p-value ≈ 0.001)
and conclude that there is
strong evidence that this
term should not be removed

