When Are Data Not Data?
Revised and updated
Now there's nothing wrong with models, as such. F=G(M*m)/d² is a model. It says the force of gravity between two masses is like the solution to that equation. (It is thus also a metaphor, and hence, a poem. h/t Br. Guy for noting that in an interview.) The difficulty comes when we confuse the output of the model for actual, you know, data. And the peril is especially fraught when the model is a statistical assemblage produced by regression rather than a mathematical relationship deduced from principles.
[Figure: The farther the distance from the city, the greater the area to be influenced. For a star or candle or planet replace with the inner surface of a notional sphere.]
Design Matrices and Transfer Functions
The designer's task is to determine the Key Quality Characteristics (KQCs) whose specification and control will ensure achievement of the Critical-to-Quality performance metrics. This can be done with a Quality Function Deployment matrix and a matrix of design coefficients, bi. The transfer function may be a linear combination of the KQCs and, if not, a linear approximation is often adequate over the range of expected variation.
But the design engineer does not start out with a grab bag of possible Xs and then do a regression to see which are "significant." Rather, he starts with engineering principles and scientific relationships in mechanics or electronics and determines which design factors ought to come into play. This is the ideal function. For example, the press force (interference fit) of a shaft and a pulley is
Y = f(X) = Z0X²
where X = relative interference:
(ODshaft – IDpulley)/ODshaft, and
Z0 = joint material stiffness coefficient.
The regression is used to quantify the relationships. The Functional Response (FR), what the design does, depends upon the independent variables, the design parameters (DPs), which are sometimes divided into the signal, the noise, and the DPs proper. The signal is simply the most important of the DPs; the noise consists of those "DPs" which cannot (or will not) be controlled.
The problem now is to determine the values of the coefficients b0, b1, b2, ..., bn in the linear approximation. Strictly speaking, since there is always more than one FR (or Y), there is a transfer function for each row of the design matrix, and since the various Ys typically share many of the same Xs, the designer has his work cut out trying to achieve optimum performance for each Yj by targeting each Xi. To pick targets for a suite of Xi to optimize one Y will often end up de-optimizing other Ys.
As George Box was fond of saying:
"All models are wrong. Some are useful."
Enter Regression
Some are more wrong than others. When science and engineering are no aid, one must rely on multiple linear regression of sample data to find the coefficients, and here is where two bogies come into play.
- Kaczynski's Rule.* With seven or more Xs, you can always fit an equation by playing with the coefficients. Sometimes, if more data become available, re-running the regression with the additional data will produce a different regression equation with different coefficients. C'est la guerre.
- Multicollinearity. If any of the Xs are correlated with each other, the coefficients, and even their +/- signs, may change. This shows up in a calculation of the Variance Inflation Factors and is a signal to drop one or more factors from the model, a practical application of Ockham's Razor in its original intent.
(*) Before he became Unabomber, Kaczynski taught graduate math at U. Berkeley, where one of his students was my cosmological friend, who passed that tidbit on to me.
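When the model has exactly two Xs, the Variance Inflation Factor reduces to 1/(1 − r²), where r is the correlation between them. A minimal Python sketch of that special case follows; the data are made up, and the function names (`pearson_r`, `vif_two_predictors`) are illustrative, not from any particular package.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def vif_two_predictors(x1, x2):
    """VIF shared by both predictors when the model has exactly two Xs."""
    r = pearson_r(x1, x2)
    return 1.0 / (1.0 - r * r)

# Illustrative data only: x2 tracks x1 closely, x3 does not.
x1 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
x2 = [1.1, 1.9, 3.2, 3.9, 5.1, 6.0]
x3 = [2.0, -1.0, 0.5, 3.0, -2.0, 1.0]

print(vif_two_predictors(x1, x2))  # large: the pair is nearly collinear
print(vif_two_predictors(x1, x3))  # close to 1: little shared variation
```

With more than two Xs, each VIF comes from regressing one X on all the others, which needs a proper least-squares routine rather than a single correlation.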
The uncertainty in Y due to uncertainties in the Xs is propagated in the one-variable linear case as:
σ²Y = (b1σX)²
in the non-linear case, approximately as
σ²Y = (dY/dX·σX)², where the derivative is evaluated at the mean value,
and in the multilinear case approximately as the Pythagorean sum:
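The Pythagorean sum is presumably the usual first-order propagation formula. Written out in LaTeX, and assuming the Xs are uncorrelated (an assumption added here; correlated Xs bring in covariance terms), it reads:

```latex
\sigma_Y^2 \;\approx\; \sum_{i=1}^{n} \left( \frac{\partial Y}{\partial X_i}\,\sigma_{X_i} \right)^{2}
% For a strictly linear transfer function Y = b_0 + b_1 X_1 + \dots + b_n X_n,
% each partial derivative is just the coefficient b_i:
\sigma_Y^2 \;=\; \sum_{i=1}^{n} \left( b_i\,\sigma_{X_i} \right)^{2}
% For the press-force example Y = Z_0 X^2, with the derivative taken at the mean \bar{X}:
\sigma_Y \;\approx\; 2\,Z_0\,\bar{X}\,\sigma_X
```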
TOF! Why Did You Do That to Us? Partial Derivatives? Really? Eeeuuw!
There are some things that are easy to forget.
- We can get so immersed in our model that we forget that the predicted Y-values are not themselves actually data.
- The model itself is often a linear approximation of a more complex system. (People forget what else Billy Ockham had to say.*)
- Models usually begin to stink in the tails; that is, at the extreme values. They predict white swans rather well, but don't handle the black ones.
- We forget about the propagation of variances and so the uncertainty builds as the model iterates.
(*) He warned us to keep our models simple for our own understanding. The real world, he said, could be as complex as God wished.
On Models Who Stutter
Suppose we had a model that predicted, oh, say "temperature" for tomorrow based on lots of data on many variables today.
Well, if there are enough such Xs, we are almost certain to get a good-fitting model, and this may deceive us into supposing that the relationships in the model mimic the relationships in the Real World™. For such folks, TOF has one word: epicycles. The modern term for epicycles is "feedback loops," which can be added as needed to ensure the proper results. A model can work rather well even though it is nothing like reality.*
(*) Artificial Intelligence people would do well to remember this. That a computer model can mimic the output of human thought does not mean the processes in the model are the same as those of the human mind.
(*) His razor was epistemological, not ontological.
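The point about "enough Xs" can be illustrated with pure noise. In the Python sketch below, eight random "observations" of Y are fitted with an intercept plus seven random Xs that have nothing to do with Y. The helper `solve_linear_system` and all the data are made up for the demonstration. With as many coefficients as data points, the fit comes out exact.

```python
import random

def solve_linear_system(a, b):
    """Solve a·x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Work on an augmented copy so the caller's data is untouched.
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        # Bring the largest remaining entry in this column onto the diagonal.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back-substitution.
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x

random.seed(1)
n_obs = 8
n_xs = 7  # seven unrelated predictors

# Design matrix: a column of 1s (intercept) plus seven columns of noise.
design = [[1.0] + [random.gauss(0, 1) for _ in range(n_xs)] for _ in range(n_obs)]
y = [random.gauss(0, 1) for _ in range(n_obs)]  # the "response" is noise too

coeffs = solve_linear_system(design, y)
fitted = [sum(b * x for b, x in zip(coeffs, row)) for row in design]
worst_residual = max(abs(f - obs) for f, obs in zip(fitted, y))
print(worst_residual)  # effectively zero: a "perfect" fit to nothing at all
```

Change the seed and the coefficients change completely while the fit stays perfect, which is the whole trouble.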
Reanalysis. TOF tells you three times, be wary of taking model output and using it as the input for a second iteration; say, by taking tomorrow's predicted temperature and predicting the day after tomorrow. This is known to
[Figure: If a process is re-targeted based on the previous result, we get a runaway process. The above is not global warming but the position of a marble dropped through a funnel.]
Remember the propagation of variances? You just did that twice. σ²Y = (b1σX)² has become σ²Z = (b1σY)² = (b1·b1σX)² = (b1²σX)². Now do it over and over to learn the temperature next year, and the year after, and... The prediction interval will swell up like a blowfish.
This is Deming's "Rule 4" in his famous Funnel Experiment. In Rule 4, the funnel is aimed at the point where the marble landed in the previous trial. It is akin to the game of telephone, or to learning the job from the previous job-holder, or to using the model output from a previous run as input "data" to the next. The net result is always, always a time plot that runs off to the Empyrean.
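A rough simulation shows the runaway. The Python sketch below is a simplification: it works in one dimension rather than on a tabletop, assumes Gaussian scatter around the aim point, and uses made-up function names. It compares Rule 1 (leave the funnel aimed at the target) with Rule 4 by looking at how widely the final marble scatters over many repeated runs.

```python
import random

def funnel_rule_1(n_drops, sigma, rng):
    """Rule 1: leave the funnel aimed at the target (taken as 0)."""
    return [rng.gauss(0.0, sigma) for _ in range(n_drops)]

def funnel_rule_4(n_drops, sigma, rng):
    """Rule 4: aim the funnel at wherever the previous marble came to rest."""
    landings = []
    aim = 0.0
    for _ in range(n_drops):
        landing = aim + rng.gauss(0.0, sigma)
        landings.append(landing)
        aim = landing  # the last output becomes the next input
    return landings

def spread_of_final_position(rule, n_drops, n_runs=2000, sigma=1.0, seed=42):
    """Standard deviation of the last landing spot across many repeated runs."""
    rng = random.Random(seed)
    finals = [rule(n_drops, sigma, rng)[-1] for _ in range(n_runs)]
    mean = sum(finals) / n_runs
    return (sum((f - mean) ** 2 for f in finals) / (n_runs - 1)) ** 0.5

for n in (1, 25, 100):
    print(n,
          spread_of_final_position(funnel_rule_1, n),
          spread_of_final_position(funnel_rule_4, n))
```

Under Rule 1 the spread stays near sigma however many marbles are dropped. Under Rule 4 each landing is the previous landing plus fresh scatter, a random walk, so the spread grows roughly as the square root of the number of drops.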
So Why Model?
Because you have to. Suppose you have measured the angles on a triangular part and they sum to 179º 30'. But we know from external information that the interior angles of a plane triangle must sum to 180º. The data are clearly wrong because Euclidean geometry isn't. How would you allocate the additional 30' to adjust the figures? There are circumstances when the best tactic is to add 10' to each angle; there are other circumstances when the best strategy is to add a proportionate share of the 30' to each angle, depending on what exactly was wrong with the protractor originally used.
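The two allocation schemes can be sketched in a few lines of Python. The three measured angles are hypothetical, chosen only so that they sum to 179º 30', and the function names are illustrative. The proportional scheme recovers the true angles exactly if the protractor misreads every angle by the same multiplicative factor; the equal scheme treats the shortfall as shared evenly regardless of angle size.

```python
def adjust_equal(angles, target):
    """Spread the closure error evenly across all angles."""
    share = (target - sum(angles)) / len(angles)
    return [a + share for a in angles]

def adjust_proportional(angles, target):
    """Spread the closure error in proportion to each measured angle."""
    total = sum(angles)
    error = target - total
    return [a + error * a / total for a in angles]

def as_degrees_minutes(arcminutes):
    """Format an angle given in arcminutes as degrees and minutes."""
    degrees, minutes = divmod(arcminutes, 60)
    return f"{int(degrees)}º {minutes:.1f}'"

# Hypothetical measurements in arcminutes: 30º 10', 59º 50', 89º 30' (sum 179º 30').
measured = [30 * 60 + 10, 59 * 60 + 50, 89 * 60 + 30]
TARGET = 180 * 60  # a plane triangle, in arcminutes

for adjust in (adjust_equal, adjust_proportional):
    fixed = adjust(measured, TARGET)
    print(adjust.__name__, [as_degrees_minutes(a) for a in fixed])
```

Both adjusted sets sum to exactly 180º; they differ only in how much of the 30' each angle is asked to absorb.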
Another situation is when there are missing data. In industry we often found hourly data with one or more missing hours. Depending on circumstances, it might be important to fill in the missing cells with estimates. A typical strategy is to replace the missing measurements with the average of all the other data. Another is to interpolate between the data before and after the empty slot. But this assumes that the quality being plotted is a continuous function and changes smoothly from hour to hour. In many contexts the hourly figure is a random excursion from a constant central tendency.*
[Figure: One of the 44 batches is "missing." If you interpolate between the preceding and succeeding points (red line in upper graph) you will come nowhere near the actual value (red dot in lower graph).]
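The two fill-in strategies mentioned above, sketched in Python with made-up hourly readings and illustrative function names. Neither is "right"; which one is defensible depends on whether the process drifts smoothly or merely scatters around a constant level.

```python
def fill_with_mean(series):
    """Replace each None with the mean of all observed values."""
    observed = [v for v in series if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in series]

def fill_by_interpolation(series):
    """Replace each isolated interior None with the midpoint of its two neighbors."""
    filled = list(series)
    for i in range(1, len(series) - 1):
        before, after = series[i - 1], series[i + 1]
        if series[i] is None and before is not None and after is not None:
            filled[i] = (before + after) / 2
    return filled

# Made-up hourly readings with one missing hour.
hourly = [10.2, 9.8, 10.1, None, 9.9, 10.3, 10.0]

print(fill_with_mean(hourly)[3])         # overall average of the six observed hours
print(fill_by_interpolation(hourly)[3])  # midpoint of the hours on either side
```

Either way, the filled-in cell is a model output sitting in a column of measurements, and it ought to be flagged as such.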
OTOH measurements made at different locations on a product -- a bolt of cloth, a roll of paper, a sheet of steel -- may reasonably vary continuously in both directions. The temperatures measured at different geographical points may fall in the same category. The temperature at longitude 25 may indeed be halfway between those of 24 and 26.
In temperature series, surface stations are not distributed either randomly or uniformly, but are sited based on purposes that were independent of the need for an unbiased regional or global average. (E.g., to supply a local air temperature for pilots, since it affects the handling of the plane.)
But to obtain such a regional/global average, how do we handle those sites on the grid where there is no surface station, or where the existing station is compromised? If there were a uniform distribution of sites, a reasonable approach would be to take the average of all neighboring stations, as shown below, left. But the Ugly Reality™ is that there might not be all the neighboring stations, or they might not be in the neighborhood, or there might be ranges of hills or a lake or a river which would form a discontinuity in the temperature across the landscape. Below, right, shows the surface stations around the Four Corners region in the US Southwest. You will note that the "nearest neighbor" stations used for statistical adjustment are not always that near.
[Figure: An ideal homogenization network (left) vs. Ugly Reality (right) in Northern New Mexico.]
This sort of spatial interpolation is called kriging [hard g] and can get quite complicated. But it has been of great practical significance in geology and the like. More to the point, kriging is well-understood by statisticians, produces error bars, and its applications (and forbearances) are known.
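Kriging proper weights the neighbors using a fitted model of how readings co-vary with distance. The much cruder Python sketch below is not kriging; it is only the plain neighbor average described earlier, with a distance cutoff added so that it declines to estimate when no station is actually nearby. The coordinates, readings, cutoff, and the flat-grid distance are all assumptions made for illustration.

```python
import math

def neighbor_average(target, stations, max_distance):
    """Mean reading of stations within max_distance of target, or None if there are none."""
    tx, ty = target
    nearby = [
        reading
        for (x, y, reading) in stations
        if math.hypot(x - tx, y - ty) <= max_distance
    ]
    if not nearby:
        return None  # no station close enough to justify an estimate
    return sum(nearby) / len(nearby)

# Hypothetical stations: (x_km, y_km, temperature_C) on a flat grid.
stations = [
    (0.0, 10.0, 14.0),
    (10.0, 0.0, 16.0),
    (0.0, -10.0, 15.0),
    (-10.0, 0.0, 13.0),
    (250.0, 300.0, 22.0),  # a distant "neighbor"
]

print(neighbor_average((0.0, 0.0), stations, max_distance=50.0))
print(neighbor_average((500.0, 500.0), stations, max_distance=50.0))
```

Note that nothing here knows about the range of hills or the lake in between; a discontinuity in the landscape has to be handled by whoever chooses the neighbors.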
The same cannot be said for the iterative models described earlier, especially when the model is so complex that no one really knows what's in it or how it works -- as the ghost of Ockham reminds us with an evil chortle (see "mwahaha," above). The problem of the data deluge is mentioned in "To Know, but Not Understand" (an article in The Atlantic excerpted from a forthcoming book, Too Big To Know).
As the great statistician, George E. P. Box, once wrote:
All models are wrong, but some are useful.
-- "Robustness in the Strategy of Scientific Model Building"
It is well to keep this in mind. He also wrote:
Since all models are wrong the scientist cannot obtain a "correct" one by excessive elaboration. On the contrary following William of Occam he should seek an economical description of natural phenomena. Just as the ability to devise simple but evocative models is the signature of the great scientist so overelaboration and overparameterization is often the mark of mediocrity.
-- "Science and Statistics," Journal of the American Statistical Association, Vol. 71, No. 356. (Dec., 1976)
Thus proving that at least one scientist in the Modern Ages actually understood what Billy Ockham was trying to say!
- Box, George E. P., Hunter, William G., Hunter, J. Stuart. Statistics for Experimenters: An Introduction to Design, Data Analysis, and Model Building. (Wiley & Sons, 1978; now in a second edition)
- W. Edwards Deming. Statistical Adjustment of Data. (Dover, 2011)
- E. Steirou and D. Koutsoyiannis. "Investigation of methods for hydroclimatic data homogenization." European Geosciences Union General Assembly 2012