Friday, March 11, 2011

Infantile Regression

Here is an illustration of the limitations of linear thinking alluded to in a previous statistical post: When is an Average Not an Average?  Both charts involve everyone's favorite topic: Global Warming/Climate Change/Climate Disruption. 

The first is a time series of the satellite data compiled by RSS [Remote Sensing Systems].  The series, plotted by Bob Tisdale, was for the purpose of comparing the recent revision of the data algorithm -- certain satellite feeds have been added, others discounted or weighted differently -- to the previous revision.  (The overall effect is that v.3.3 produces slightly cooler temps than v.3.2.)  But our purpose here is to observe the linear regression thrown through the data series, for everyone needs a good laugh now and then.

Now I hesitate to say that no one but a scientist would dare try to fit a linear trend to what is clearly a non-linear time series; and in fact Tisdale will take it further shortly.  But one sees this sort of thing constantly.  A linear trend indicates a single cause which moves the data continually in the same direction up to a boundary condition or causal regime change.  For example, the speed of a falling apple will increase uniformly under the influence of the single cause of gravitational acceleration -- up to the point it strikes you on the head.  A certain amount of noise can be expected around the regression line due to, among other things, measurement system variation. 

But in actual processes, it is not uncommon to find multiple causes operating simultaneously. 
  1. Random* variation is due to common causes; i.e., causes that are "commonly present" all or most of the time.  Random variation will cumulatively produce a statistical distribution, such as the normal, lognormal, Poisson, extreme value, etc. 
  2. Spikes (or Icicles) are due to transient special causes that happen at a particular time. Example: a fill weight of 57 is entered accidentally as a 75.
  3. Shifts are due to a special cause that occurs at a particular point in time, raising or lowering the process mean.  Example: a fresh batch of higher density oxide paste is started, and all battery grids thereafter run a higher paste weight.
  4. Trends are due to a special cause that occurs during a span of time, as mentioned above. Example: due to progressive tool wear, the residual metal thickness of a stamped part grows gradually thicker.
  5. Cycles are due to a special cause that occurs repeatedly at more-or-less-equal intervals of time. Example: Cans used to measure coating weights are used three times. But the second and third time, the coating is sprayed over a previous coating layer, and this tends to absorb more of the coating. Hence, coating weights show a cyclical pattern: low-high-high.
 (*) random. Randomness is not a cause of anything. What we call "random variation" is due to a great many causes, none of which dominates the process, acting simultaneously. From time to time, some of these causes will increase the effect; some will decrease it. The aggregate effect then resembles the result of a random process. But this is simply a way of saying that it is not worthwhile chasing our tails over one of a multitude of minor causes.
 The above are simple patterns, so-called because there is a simple (direct) relation between the nature of the pattern and the time sequence of the data.  For complex patterns, the root cause is not tied into the time sequence.  These include Stratification, Mixtures, etc.  The pattern called Instability (or Chaos) usually indicates the presence of more than one special cause. Example: a mixture pattern on visual pre-fill inspections of glass phials indicated that two inspection crews were counting defects differently.
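For readers who like to see such things, the five simple patterns listed above can be mimicked with synthetic data. What follows is only a rough sketch: the function name `simulate`, the series length, and every effect size (a spike of 8, a shift of 3, a slope of 0.1, and so on) are invented for illustration and correspond to no real process.

```python
import random

def simulate(n=60, seed=1):
    """Return a dict of synthetic series, one per simple pattern (all values hypothetical)."""
    rng = random.Random(seed)
    # Common-cause ("random") variation: many small causes, none dominating
    noise = [rng.gauss(0.0, 1.0) for _ in range(n)]
    series = {"random": list(noise)}
    # Spike: a transient special cause at a single point (t = 20)
    series["spike"] = [e + (8.0 if t == 20 else 0.0) for t, e in enumerate(noise)]
    # Shift: the process mean moves at t = 30 and stays there
    series["shift"] = [e + (3.0 if t >= 30 else 0.0) for t, e in enumerate(noise)]
    # Trend: a cause acting continually over the whole span
    series["trend"] = [e + 0.1 * t for t, e in enumerate(noise)]
    # Cycle: low-high-high, repeating every three points
    series["cycle"] = [e + (0.0 if t % 3 == 0 else 3.0) for t, e in enumerate(noise)]
    return series

s = simulate()
for name, ys in s.items():
    first, second = ys[:30], ys[30:]
    # Compare the mean of the first half with the mean of the second half
    print(name, round(sum(first) / 30, 2), round(sum(second) / 30, 2))
```

Note that a shift and a trend both raise the second-half mean, so a half-versus-half comparison alone cannot tell them apart; the plot has to be looked at.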

Complex patterns must usually be "parsed" in order to identify the causal factors, either through re-analysis of the data: a) chart-matching, b) breakdown, c) replot; or through "active statistics": d) elimination of a factor, e) designed experiment.

Which brings us back to the chart above.  In an industrial situation, variables are removed by the simple act of actually removing them.  In a famous capability study at the old Western Electric, the first variable suspected was operator over-adjustment of a certain machine setting; and this was removed by telling the operator to keep his cotton-picking fingers off the knob.  The before-and-after charts showed a marked decrease in variation and revealed a mixture pattern that identified a loose fixture.  When the fixture was, well, fixed, a skewness identified premature removal of the parts from their chucks; and so on until the chart was in a state of statistical control: i.e., the variation was indistinguishable from random variation. 

For found data, however well-massaged they may be, this is not practical.  So what folks normally do is determine an equation f(X) that describes the effect of the variable X on the charted values of Y, and then plot the residual Ŷ = Y - f(X).  So Tisdale adjusted the RSS data for the effect of volcanic eruptions (anthropogenic volcanic eruptions?) using Goddard Institute's Aerosol Optical Thickness data for the amount of particulates in the atmosphere.  We won't worry here about how he may have done the regression, as that is not our topic du jour.  The result of his adjustment for vulcanism is below: 
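The mechanics of "removing a variable mathematically" can be sketched in a few lines. This is not Tisdale's method, which is not described here; it assumes the simplest possible case, in which the effect of X is a straight line, f(X) = b*X, with b estimated by ordinary least squares. The names `fit_line` and `adjust` and the toy numbers are made up for the illustration.

```python
def fit_line(x, y):
    """Ordinary least squares for y = a + b*x; returns (a, b)."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    b = sxy / sxx
    a = my - b * mx
    return a, b

def adjust(x, y):
    """Return Y - f(X), taking f(X) = b*X (assumed linear effect of X)."""
    _, b = fit_line(x, y)
    return [yi - b * xi for xi, yi in zip(x, y)]

# Toy data: a hypothetical aerosol index x that depresses y by exactly 2 per unit
x = [0.0, 0.1, 0.3, 0.2, 0.05, 0.0]
y = [0.5 - 2.0 * xi for xi in x]
print(adjust(x, y))  # every value is close to 0.5 once the x effect is removed
```

With real data the adjusted series would not come out flat; whatever is left over is what one goes on to examine for the next signal.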


The red lines are the volcano-adjusted mean values for the time intervals between major ENSO events (El Niño/La Niña-Southern Oscillation).  The green line is a 13-month rolling mean.  What he finds using this approach is a series of shifts rather than a single trend. This "step-wise" increase in temperature seems to fit the data better than the straight linear trend in the first graph. In consequence, the mind is led away from a single cause operating continuously across a time interval, and toward a series of "kicks" that shift the process from one mean to another.  These kicks seem to be related to the ENSO events; i.e., ENSOs seem to have persistent, lingering effects. 
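One rough way to compare "a series of shifts" against "a single trend" is to compute the sum of squared errors (SSE) around each description. The sketch below assumes the break points are supplied by the analyst (here they stand in for the ENSO events); the function names and the toy series are hypothetical, and a rolling mean like the green line is included for completeness.

```python
def sse_linear(y):
    """SSE around a least-squares straight line fitted against time index 0..n-1."""
    n = len(y)
    t = range(n)
    mt = (n - 1) / 2
    my = sum(y) / n
    b = (sum((ti - mt) * (yi - my) for ti, yi in zip(t, y))
         / sum((ti - mt) ** 2 for ti in t))
    a = my - b * mt
    return sum((yi - (a + b * ti)) ** 2 for ti, yi in zip(t, y))

def sse_steps(y, breaks):
    """SSE around segment means; breaks are the indices where a new segment starts."""
    edges = [0] + list(breaks) + [len(y)]
    total = 0.0
    for lo, hi in zip(edges, edges[1:]):
        seg = y[lo:hi]
        m = sum(seg) / len(seg)
        total += sum((v - m) ** 2 for v in seg)
    return total

def rolling_mean(y, window=13):
    """Rolling mean over consecutive windows; returns len(y) - window + 1 values."""
    return [sum(y[i:i + window]) / window for i in range(len(y) - window + 1)]

# Toy series: three flat levels separated by two "kicks"
y = [0.0] * 20 + [1.0] * 20 + [2.0] * 20
print(sse_steps(y, [20, 40]), sse_linear(y))
```

A caution on reading the result: the step description uses more fitted values than the line does, and break points picked after looking at the data will always flatter it, so a smaller SSE is suggestive rather than conclusive.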

I am not entirely convinced -- the middle phase seems weak to me, and he is only looking at the northern hemisphere -- but it illustrates the bottom-up empirical approach to the data.  Instead of springing a linear regression through the data, start from the data and look for signals in the time series, identify the source of the signal, remove it (physically or mathematically) and look for another signal. 



What you should avoid is the top-down approach: brainstorm a bunch of factors that you think might be important, build a model using those factors, adjust the model coefficients so the model outputs are kinda like the actual data, then declare that other factors are unimportant because your model leaves no room for them.  It is well known that with seven factors you can fit any set of data, providing only that you can play with the coefficients.  This gives the illusion that the chosen coefficients reflect real-world relationships and that collectively the chosen factors account for virtually all the variation. 
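A narrow illustration of the point about playing with coefficients: take eight observations of pure noise and seven "factors" that are themselves pure noise, add an intercept, and solve for the coefficients. The fit comes out essentially perfect even though nothing is related to anything. The exact fit is guaranteed here only because the number of coefficients equals the number of observations; the seed, the sample size, and the `solve` helper are all choices made for this sketch.

```python
import random

def solve(a, b):
    """Solve a square linear system by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        # Bring the largest remaining entry in this column to the pivot row
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    # Back substitution
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x

rng = random.Random(0)
n_obs = 8
y = [rng.gauss(0, 1) for _ in range(n_obs)]             # pure noise "data"
design = [[1.0] + [rng.gauss(0, 1) for _ in range(7)]   # intercept + 7 unrelated factors
          for _ in range(n_obs)]
coef = solve(design, y)
fitted = [sum(c * v for c, v in zip(coef, row)) for row in design]
max_err = max(abs(f - v) for f, v in zip(fitted, y))
print(max_err)  # tiny: the "model" reproduces the noise almost exactly
```

The coefficients it finds mean nothing, which is the whole point: a close fit by itself says little about whether the chosen factors are the real causes.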

