Friday, May 17, 2013

America's Next Top Model - revised

When Are Data Not Data?

Revised and updated

Now there's nothing wrong with models, as such.  F=G(M*m)/d² is a model.  It says the force of gravity between two masses is like the solution to that equation.  (It is thus also a metaphor, and hence, a poem.  h/t Br. Guy for noting that in an interview.)  The difficulty comes when we confuse the output of the model for actual, you know, data.  And the peril is especially fraught when the model is a statistical assemblage produced by regression rather than a mathematical relationship deduced from principles.

The farther the distance from the city, the greater the area to be influenced.  For a star or candle or planet, replace with the inner surface of a notional sphere.
Take the inverse square law.  If we imagine gravity emanating outward from a body, or light from a candle, or influence from a population center, the area to be attracted, illuminated, or influenced will increase as the square of the distance from the source.  Recollect that Newton derived his laws from Euclidean geometry, not from regression on experimental data.
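A throwaway sketch of that geometric claim (the function name and numbers below are invented for illustration): spread a fixed output over the inner surface of a sphere of radius d and the intensity per unit area falls off as 1/d².

```python
import math

def intensity(source_strength, distance):
    # Spread a fixed output over the inner surface of a sphere of radius `distance`;
    # the per-unit-area intensity therefore falls off as 1/distance**2.
    return source_strength / (4 * math.pi * distance ** 2)

# Doubling the distance quarters the intensity:
print(intensity(100.0, 1.0) / intensity(100.0, 2.0))   # 4.0
```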

The confidence interval is the "slop in the slope."  That is, assuming that the model is a linear regression, the regression line probably lies within the inner bounds.  Not to be confused with the forecast on the output, which hinges on the prediction interval (likewise model-dependent).
This sort of model is relatively secure, since its coefficients do not rely on a finite set of variable data.  Less secure are models like the mean value or a regression surface, which do.  They typically come with a certain amount of uncertainty, sometimes called a "confidence interval."  Confidence intervals apply to the coefficients of the model, not to the outputs.  That is, in the case of a simple linear regression, the confidence interval deals with the uncertainty of the slope, not with the uncertainty of the projected values.  That is why models usually appear more certain in the press release than they do in the Real World™.
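To see the distinction in practice, here is a minimal sketch using statsmodels on invented data: the confidence interval attaches to the fitted slope, while the (wider) prediction interval attaches to a projected value.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(42)
x = np.linspace(0.0, 10.0, 30)
y = 2.0 + 0.5 * x + rng.normal(0.0, 1.0, size=x.size)   # invented data

fit = sm.OLS(y, sm.add_constant(x)).fit()

# The "slop in the slope": confidence intervals on the coefficients
print(fit.conf_int(alpha=0.05))          # rows: intercept, slope

# The uncertainty of a projected value at x = 12
new_X = sm.add_constant(np.array([12.0]), has_constant="add")
pred = fit.get_prediction(new_X)
print(pred.conf_int(obs=False))          # interval for the mean response
print(pred.conf_int(obs=True))           # wider prediction interval for a new observation
```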

Design Matrices and Transfer Functions

The designer's task is to determine the Key Quality Characteristics whose specification and control will ensure achievement of the Critical-to-Quality performance metrics.  This can be done with a Quality Function Deployment matrix and a matrix of design coefficients, bi.  The transfer function may be a linear combination of the KQCs and, if not, a linear approximation is often adequate over the range of expected variation.

But the design engineer does not start out with a grab bag of possible Xs and then do a regression to see which are "significant."  Rather, he starts with engineering principles and scientific relationships in mechanics or electronics and determines which design factors ought to come into play.  This is the ideal function.  For example, the press force (interference fit) of a shaft and a pulley is
Y = f(X) = Z0·X
where X = relative interference:
(ODshaft – IDpulley)/ODshaft , and
Z0 = joint material stiffness coefficient
The regression is then used to quantify these relationships. The Functional Response (FR), what the design does, depends upon the independent variables, the design parameters (DPs), which are sometimes divided into the signal, the noise, and the DPs proper.  The signal is simply the most important of the DPs; the noise consists of those "DPs" which cannot (or will not) be controlled.
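As a toy illustration of that ideal function (the stiffness value and diameters below are invented, not real design data):

```python
def press_force(od_shaft, id_pulley, z0):
    # Ideal function Y = Z0 * X, with X the relative interference
    x = (od_shaft - id_pulley) / od_shaft
    return z0 * x

# Illustrative numbers only: a 25.00 mm shaft pressed into a 24.95 mm bore
print(press_force(od_shaft=25.00, id_pulley=24.95, z0=5.0e4))
```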

The problem now is to determine the values of the coefficients b0, b1, b2, ..., bn in the linear approximation.  Strictly speaking, since there is always more than one FR (or Y), there is a transfer function for each row of the design matrix; and since the various Ys typically share many of the same Xs, the designer has his work cut out trying to achieve optimum performance for each Yj by targeting each Xi.  Picking targets for a suite of Xi to optimize one Y will often de-optimize other Ys.
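A toy numerical version of that predicament, with an invented coefficient matrix of three responses sharing two design parameters: hitting the Y1 target exactly throws the other responses off, and a least-squares compromise hits none of them exactly.

```python
import numpy as np

# Invented coefficient matrix: three responses Y1..Y3 sharing two DPs X1, X2
B = np.array([[2.0, -1.0],    # Y1 = 2*X1 - 1*X2
              [1.0,  1.5],    # Y2 = 1*X1 + 1.5*X2
              [0.5,  1.0]])   # Y3 = 0.5*X1 + 1*X2
targets = np.array([10.0, 4.0, 3.0])

# Target Y1 alone: minimum-norm X that puts Y1 exactly on target
x_y1 = np.linalg.pinv(B[:1]) @ targets[:1]
print("Ys when only Y1 is targeted:", B @ x_y1)

# Compromise: least-squares fit to all three targets at once
x_all, *_ = np.linalg.lstsq(B, targets, rcond=None)
print("Ys with the compromise targets:", B @ x_all)
```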

As George Box was fond of saying:
"All models are wrong.  Some are useful."

Enter Regression

Some are more wrong than others.  When science and engineering are of no aid, one must rely on multiple linear regression of sample data to find the coefficients, and here is where two bogies come into play.
  1. Kaczynski's Rule.*  With seven or more Xs, you can always fit an equation by playing with the coefficients.  Sometimes, if more data become available, re-running the regression with the additional data will produce a different regression equation with different coefficients.  C'est la guerre.
  2. Multicollinearity.  If any of the Xs are correlated with each other, the coefficients, and even the +/- signs, may change.  This shows up in a calculation of the Variance Inflation Factors and is a signal to drop one or more factors from the model, a practical application of Ockham's Razor in its original intent.  (A sketch follows the footnote below.)
(*) Before he became the Unabomber, Kaczynski taught graduate math at UC Berkeley, where one of his students was my cosmological friend, who passed that tidbit on to me.
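Here is a bare-bones sketch of the multicollinearity check, using statsmodels' variance_inflation_factor on invented data in which x2 is nearly a copy of x1:

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.1, size=n)   # nearly a copy of x1: multicollinear
x3 = rng.normal(size=n)

X = sm.add_constant(pd.DataFrame({"x1": x1, "x2": x2, "x3": x3}))
for i, name in enumerate(X.columns):
    print(name, round(variance_inflation_factor(X.values, i), 1))
# x1 and x2 show enormous VIFs; the signal is to drop one of them.
```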

The uncertainty in Y due to uncertainties in the Xs is propagated, in the one-variable linear case, as
σ²Y = (b1σX)²
in the non-linear case, approximately as
σ²Y ≈ (dY/dX · σX)², where the derivative is evaluated at the mean value,

and in the multilinear case approximately as the Pythagorean sum:
σ²Y ≈ (∂Y/∂X1 · σX1)² + (∂Y/∂X2 · σX2)² + ... + (∂Y/∂Xn · σXn)²
Boo!  Mwahahaha!
which is easily seen as a generalization of the one-variable case.
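A quick brute-force check of the propagation formula, with invented coefficients and input sigmas: the Pythagorean sum and a straight simulation give the same answer.

```python
import numpy as np

rng = np.random.default_rng(1)

# Invented transfer function Y = b0 + b1*X1 + b2*X2 with known input sigmas
b0, b1, b2 = 5.0, 2.0, -3.0
sigma_x1, sigma_x2 = 0.4, 0.7

# Pythagorean sum of the propagated terms
sigma_y_formula = np.sqrt((b1 * sigma_x1) ** 2 + (b2 * sigma_x2) ** 2)

# Brute-force simulation
x1 = rng.normal(0.0, sigma_x1, size=1_000_000)
x2 = rng.normal(0.0, sigma_x2, size=1_000_000)
y = b0 + b1 * x1 + b2 * x2

print(sigma_y_formula, y.std())   # the two agree to a few decimals
```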

TOF!  Why Did You Do That to Us?  Partial Derivatives?  Really? Eeeuuw!

There are some things that are easy to forget.
  1. We can get so immersed in our model that we forget that the predicted Y-values are not themselves actually data. 
  2. The model itself is often a linear approximation of a more complex system.  (People forget what else Billy Ockham had to say.*)  
  3. Models usually begin to stink in the tails; that is, at the extreme values.  They predict white swans rather well, but don't handle the black ones. 
  4. We forget about the propagation of variances and so the uncertainty builds as the model iterates. 
(*) He warned us to keep our models simple for our own understanding.  The real world, he said, could be as complex as God wished. 

On Models Who Stutter

Suppose we had a model that predicted, oh, say "temperature" for tomorrow based on lots of data on many variables today.

Well, if there are enough such Xs, we are almost certain to get a good-fitting model, and this may deceive us into supposing that the relationships in the model mimic the relationships in the Real World™.  For such folks, TOF has one word: epicycles.  The modern term for epicycles is "feedback loops," which can be added as needed to ensure the proper results.  A model can work rather well even though it is nothing like reality.*
(*) Artificial Intelligence people would do well to remember this.  That a computer model can mimic the output of human thought does not mean the processes in the model are the same as those of the human mind.
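A sketch of how easily a good fit can be had from nothing: regress an invented response on fifteen purely random predictors and admire the R².

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
n, p = 20, 15                       # few observations, many candidate Xs
X = rng.normal(size=(n, p))         # predictors with no relation to y whatsoever
y = rng.normal(size=n)

fit = sm.OLS(y, sm.add_constant(X)).fit()
print(round(fit.rsquared, 3))       # impressively high, and entirely meaningless
```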

Mwahahaha
The problem is compounded if we neglect multicollinearity and forget to reduce the model by discarding mutually correlated predictors.  Remember what Billy Ockham said: Don't have too many Xs in your model or else you won't understand your model.*
(*) His razor was epistemological, not ontological.

Reanalysis

If a process is re-targeted based on the previous result, we get a runaway process.  The above is not global warming, but the position of a marble dropped through a funnel.
TOF tells you three times: be wary of taking model output and using it as the input for a second iteration; say, by taking tomorrow's predicted temperature and predicting the day after tomorrow.  This is known to some as "re-analysis," but is gravely suspect in the world of statisticians.  The inputs to the model in the second iteration are not actual data.

Remember the propagation of variances?  You just did that twice.  σ²Y = (b1σX)² has become σ²Z = (b1σY)² = (b1²σX)².  Now do it over and over to learn the temperature next year, and the year after, and...  The prediction interval will swell up like a blowfish.

This is Deming's "Rule 4" in his famous Funnel Experiment.  In Rule 4, the funnel is aimed at the point where the marble landed in the previous trial.  It is akin to the game of telephone, or to learning the job from the previous job-holder, or to using the model output from a previous run as input "data" to the next.  The net result is always, always a time plot that runs off to the Empyrean.
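A minimal simulation of Rule 4 (invented noise, no plotting): aim the funnel at the previous landing point and the marble's position becomes a random walk, compared with leaving the funnel alone.

```python
import numpy as np

rng = np.random.default_rng(3)
trials = 1000
noise = rng.normal(0.0, 1.0, size=trials)   # where each marble lands, relative to the aim point

# Rule 1: leave the funnel aimed at the target; the spread stays put.
rule1 = noise

# Rule 4: aim the funnel at wherever the last marble landed.
aim = 0.0
rule4 = np.empty(trials)
for i in range(trials):
    rule4[i] = aim + noise[i]
    aim = rule4[i]           # yesterday's output becomes today's "data"

print("Rule 1 spread:", round(rule1.std(), 1))
print("Rule 4 spread:", round(rule4.std(), 1))   # a random walk that wanders off to the Empyrean
```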

So Why Model?

Because you have to.  Suppose you have measured the angles on a triangular part and they sum to 179º 30'.  But we know from external information that the interior angles of a plane triangle must sum to 180º.  The data are clearly wrong because Euclidean geometry isn't.  How would you allocate the additional 30' to adjust the figures?  There are circumstances when the best tactic is to add 10' to each angle; there are other circumstances when the best strategy is to add a proportionate share of the 30' to each angle, depending on what exactly was wrong with the protractor originally used. 
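In miniature, the two adjustment tactics look like this (the measured angles are invented, but they sum to 179° 30'):

```python
# Invented measurements (degrees) that sum to 179.5°, i.e. 179° 30'
measured = [62.0, 58.5, 59.0]
shortfall = 180.0 - sum(measured)               # 0.5° = 30'

# Tactic 1: split the shortfall equally (10' to each angle)
equal = [a + shortfall / len(measured) for a in measured]

# Tactic 2: allocate the shortfall in proportion to each angle's size
proportional = [a + shortfall * a / sum(measured) for a in measured]

print(sum(equal), sum(proportional))            # both now sum to 180.0
```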

Another situation arises when there are missing data.  In industry we often found hourly data with one or more missing hours.  Depending on circumstances, it might be important to fill in the missing cells with estimates.  A typical strategy is to replace the missing measurements with the average of all the other data.  Another is to interpolate between the data before and after the empty slot.  But this assumes that the quality being plotted is a continuous function and changes smoothly from hour to hour.  In many contexts the hourly figure is a random excursion from a constant central tendency.*
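A sketch of the two fill-in strategies on invented hourly data with one missing hour; for a process that is only noise about a constant level, neither guess recovers the actual value.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(4)
hourly = pd.Series(10.0 + rng.normal(0.0, 1.0, size=24))   # invented hourly readings
true_value = hourly.iloc[12]
hourly.iloc[12] = np.nan                                   # one missing hour

mean_fill = hourly.fillna(hourly.mean())
interpolated = hourly.interpolate()                        # straight line between hours 11 and 13

print("actual:", round(true_value, 2),
      "| mean fill:", round(mean_fill.iloc[12], 2),
      "| interpolated:", round(interpolated.iloc[12], 2))
```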

One of the 44 batches is "missing."  If you interpolate between the preceding and succeeding points (red line in upper graph) you will come nowhere near the actual value (red dot in lower graph).
In the sulfur batch data above, each batch is an independent iteration of the same process and is (for the most part) independent of the values of other batches.  Much of the observed variation is in fact variation in the laboratory measurement process.  There is no a priori reason why the value of batch number 25 should be halfway between the values of batches 24 and 26. 

OTOH measurements made at different locations on a product -- a bolt of cloth, a roll of paper, a sheet of steel -- may reasonably vary continuously in both directions.  The temperatures measured at different geographical points may fall in the same category.  The temperature at longitude 25 may indeed be halfway between those of 24 and 26. 

In temperature series, surface stations are not distributed either randomly or uniformly, but are sited based on purposes that were independent of the need for an unbiased regional or global average.  (E.g., to supply a local air temperature for pilots, since it affects the handling of the plane.)

But to obtain such a regional/global average, how do we handle those sites on the grid where there is no surface station, or where the existing station is compromised?  If there were a uniform distribution of sites, a reasonable approach would be to take the average of all neighboring stations, as shown below, left.  But the Ugly Reality™ is that all the neighboring stations might not exist, or they might not really be in the neighborhood, or there might be ranges of hills or a lake or a river forming a discontinuity in the temperature across the landscape.  Below, right, shows the surface stations around the Four Corners region in the US Southwest.  You will note that the "nearest neighbor" stations used for statistical adjustment are not always that near.
An ideal homogenization network (left) vs. Ugly Reality (right) in Northern New Mexico.

This sort of spatial interpolation is called kriging [hard g] and can get quite complicated.  But it has been of great practical significance in geology and the like.  More to the point, kriging is well-understood by statisticians, produces error bars, and its applications (and forbearances) are known. 
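For the curious, here is a bare-bones ordinary-kriging sketch with an assumed exponential variogram; the station coordinates, temperatures, and variogram parameters below are all invented, and a real analysis would fit the variogram to data (or use a package such as PyKrige) rather than assume it.

```python
import numpy as np

def exp_variogram(h, sill=1.0, vrange=50.0, nugget=0.0):
    # Assumed exponential variogram; the parameters are illustrative only
    return nugget + sill * (1.0 - np.exp(-h / vrange))

def ordinary_krige(stations, values, target):
    # Solve the ordinary kriging system (variogram form) for one target point
    n = len(stations)
    d = np.linalg.norm(stations[:, None, :] - stations[None, :, :], axis=-1)
    A = np.ones((n + 1, n + 1))
    A[:n, :n] = exp_variogram(d)
    A[n, n] = 0.0                      # Lagrange-multiplier row/column for unbiasedness
    b = np.ones(n + 1)
    b[:n] = exp_variogram(np.linalg.norm(stations - target, axis=1))
    w = np.linalg.solve(A, b)
    estimate = w[:n] @ values
    variance = w @ b                   # the kriging variance: the error bar comes for free
    return estimate, variance

# Invented station coordinates (km) and temperatures (°C)
stations = np.array([[0.0, 0.0], [30.0, 5.0], [10.0, 40.0], [45.0, 35.0]])
temps = np.array([21.0, 23.5, 18.0, 19.5])
print(ordinary_krige(stations, temps, np.array([20.0, 20.0])))
```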

The same cannot be said for the iterative models described earlier, especially when the model is so complex that no one really knows what's in it or how it works -- as the ghost of Ockham reminds us with an evil chortle (see "mwahaha," above).  The problem of the data deluge is mentioned in "To Know, but Not Understand" (an article in The Atlantic excerpted from a forthcoming book, Too Big To Know).

Boxing Match

Box's Rule

As the great statistician, George E. P. Box, once wrote:
All models are wrong, but some are useful.
-- "Robustness in the Strategy of Scientific Model Building"
It is well to keep this in mind.   He also wrote:
Since all models are wrong the scientist cannot obtain a "correct" one by excessive elaboration. On the contrary following William of Occam he should seek an economical description of natural phenomena. Just as the ability to devise simple but evocative models is the signature of the great scientist so overelaboration and overparameterization is often the mark of mediocrity.
-- "Science and Statistics," Journal of the American Statistical Association, Vol. 71, No. 356. (Dec., 1976)
Thus proving that at least one scientist in the Modern Ages actually understood what Billy Ockham was trying to say!

References

  1. Box, George E. P., Hunter, William G., and Hunter, J. Stuart.  Statistics for Experimenters: An Introduction to Design, Data Analysis, and Model Building.  (Wiley & Sons, 1978; now in a second edition)
  2. Deming, W. Edwards.  Statistical Adjustment of Data.  (Dover, 2011)
  3. Steirou, E. and Koutsoyiannis, D.  "Investigation of methods for hydroclimatic data homogenization."  European Geosciences Union General Assembly 2012.

