Ruminations
In earlier installments, we learned that as science proceeded from matters of organized simplicity (a few elements connected one to another) to disorganized complexity (many elements acting randomly) to organized complexity (many elements organized into patterns), the scientific approach progressed from mathematics (e.g., Boyle's Law) to statistics (e.g., thermodynamics) to models (e.g., general climate models). As it did, our understanding of the phenomena decreased. It is easy to understand a simple deterministic relationship like Boyle's Law; less easy to understand a statistical description, like the thermodynamics of a gas of randomly moving molecules; and nearly impossible to understand complex nondeterministic models having millions of degrees of freedom. For example, says Judy Curry:
if a model is producing shortwave surface radiation fluxes that are substantially biased relative to observations, it is impossible to determine whether the error arises from the radiative transfer model, incoming solar radiation at the top of the atmosphere, concentrations of the gases that absorb shortwave radiation, physical and chemical properties of the aerosols in the model, morphological and microphysical properties of the clouds, convective parametrization that influences the distribution of water vapor and clouds, and/or characterization of surface reflectivity.
-- Judy Curry, "Climate Science and the Uncertainty Monster"
Previously, TOF discussed various aspects of uncertainty: viz.,
- the kind of uncertainty,
- the degree of uncertainty, and
- the location of uncertainty within the model.
In summary:
- Kind: epistemological or ontological
- Degree: determinate, statistical, scenario, recognized ignorance, or total ignorance
- Location: context, model structure, model execution, inputs, parameters, or outputs
Two lips
A Tiptoe Through the Tulips
Now let's take a look at some typical uncertainties in models, starting at the planning end and proceeding to the calibration. In illustration, we'll use a simple linear model, just for slaps and giggles. In the slide below, the Y-values are the performances that we are trying to model -- say Performance, Cost, and Reliability of an airframe or some such thing. The X-values are the inputs and internal variables, such as properties of materials, dimensions, clearances, and the like. The idea is that if you
- plug in actual values of X, and
- calculate the model outputs Ŷ,
then those outputs ought to be "close" to the actual performances, Y.
Calibration data are data for which both Y and X are known. By plugging both into the model, we can solve for the parameters b that make the model work, jiggering the b-values until the sum of squared errors, Σ(Y − Ŷ)², has been minimized.
The assumption is that when we plug in new Xs, the resulting Ŷ-values -- the model outputs -- will also be good predictions of those unknown Ys. Difficulties arise when the new Xs lie outside the boundaries of the original data.
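For the concretely-minded, here is a minimal sketch of that calibration step in Python. All numbers are invented for illustration: generate some calibration data, solve for the b's by least squares, and note where the fitted model may and may not be trusted.

```python
import numpy as np

# Hypothetical calibration data: 20 runs where both the inputs X
# and the measured performances Y are known (all values invented).
rng = np.random.default_rng(42)
X = rng.uniform(1.0, 5.0, size=(20, 3))          # three input factors
true_b = np.array([2.0, -1.0, 0.5])
Y = 10.0 + X @ true_b + rng.normal(0, 0.3, 20)   # "actual" performances

# Calibration: jigger the b-values until sum((Y - Yhat)^2) is minimized.
# Ordinary least squares does the jiggering in one shot.
A = np.column_stack([np.ones(len(X)), X])        # intercept column + Xs
b, *_ = np.linalg.lstsq(A, Y, rcond=None)

Yhat = A @ b
print("fitted b:", np.round(b, 3))
print("residual sum of squares:", round(float(np.sum((Y - Yhat) ** 2)), 4))

# The fit is trustworthy only near the calibration region (X in [1, 5]).
# Plugging in X-values far outside it is extrapolation, and the model
# offers no warrant for the result.
```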
Example 1: Suppose some psychologist did some testing with an instrument* used to measure (say) "attractiveness to males" and then performed regression on a gazillion possible factors such as "licking her lips"** to identify the important factors influencing attractiveness-to-males, and obtained b-values for the regression model. But if all of the test subjects were WEIRD people***, can the psychologist really make predictions about what non-WEIRD people find attractive?
Attractive to Ubangis.
Yes, that's a solar panel in her lip piercing.
(*) instrument. Actually, a questionnaire; but social scientists call them "instruments." Aren't they just so cute when they play science dress-up?
(**) licking her lips. No, TOF is not making this up. Someone has to pay for the grants and it may as well be the taxpayers.
(***) WEIRD. Western, Educated, Industrialized, Rich, and Democratic. They comprise 95% of all respondents in academic psychology papers.
Attractive to Choctaws: Lanena Grace John (right), 2013/14 Choctaw Princess
Example 2: Suppose a climatologist measures tree ring growth on a stand of bristlecone pines in Patagonia and, for the thermometer era, finds a good linear correlation with measured temperatures in that region. But ring growth is also sensitive to many other causal factors, including rainfall, drainage, and whether a deer took a dump on its nearby roots. So does the correlation found in the calibration data hold for other trees in other locations in other years, when these other factors may have been different?

To put it another way, an empirical correlation Temp = b0 + b1·Ring does not have the same sort of ontological standing as a causal equation like F = ma or E = mc². It looks like an equation from traditional science, but it's not the same kind of thing. You can't look at organized complexity with the same mindset as at organized simplicity, or even at disorganized complexity.
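A toy sketch of how such an empirical calibration can betray you outside its domain. All numbers are invented; "rainfall" here stands in for any neglected causal factor:

```python
import numpy as np

# Ring growth depends on temperature AND rainfall; we calibrate in an
# era where rainfall happened to be steady, then apply elsewhere.
rng = np.random.default_rng(5)

temp_cal = rng.normal(12.0, 1.0, 80)            # calibration-era temps (C)
rain_cal = np.full(80, 600.0)                   # steady rainfall (mm)
ring_cal = 0.4 * temp_cal + 0.005 * rain_cal + rng.normal(0, 0.1, 80)

# Fit the empirical correlation Temp = b0 + b1 * Ring.
b1, b0 = np.polyfit(ring_cal, temp_cal, 1)

# Now a different stand of trees, where rainfall was 300 mm instead:
temp_new = 12.0
ring_new = 0.4 * temp_new + 0.005 * 300.0
print("actual temp:", temp_new, "| reconstructed:", round(b0 + b1 * ring_new, 1))
# The reconstruction is badly biased, because the ring "thermometer"
# was calibrated under rainfall conditions that no longer hold.
```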
The First Uncertainty: What Xs Should We Include?
Much depends on the context of the problem the model is intended to resolve. When Napoleon asked Laplace where in his "System of the World" God appeared, the mathematician famously replied, "Sire, I have no need of that hypothesis." Similarly, Sean Carroll wrote that if the soul existed, it would have to have a term in Dirac's equation. But "it doesn't and we can't imagine what it would mean for it to have one." Rather than conclude that the equations must be incomplete, many suppose that this means the objects do not exist.

But the parsimony of models -- see William of Ockham for details -- requires that factors be excluded if they're not needed to estimate Y "close enough" for practical purposes. The speed of light does not appear in Newton's model because it served no purpose in what he was trying to accomplish. Similarly, Darwinian evolution does not appear in the manuals used by auto mechanics rebuilding a transmission.
Equally vexing is neglecting factors that ought to have been included. Climate models famously excluded solar effects and clouds, largely because no one initially knew how to account for them. It now appears that both play more important roles than previously supposed.
In engineering, a Quality Function Deployment (QFD) matrix is often useful for identifying key X factors, as long as we remember that the matrix is a repository for our knowledge and not a replacement for it. Too many QFD matrices are hastily done or blown off with too little thought.
The example shown, for a pocket calculator, seeks to model the performance "Easy to Read" as a function of five factors, with the matrix entries indicating the weight or importance of each factor on the performance. Four other factors are thought to have some lesser impact on the performance metric.
Note: "Mat'l cost" really has no direct impact on Y=Easy to read. It is a design constraint, not a design input, and so is somewhat misplaced in the analysis.Each row in the matrix is a different Y. Each column is a different X. Each Y depends upon multiple Xs; and the same X may loads onto multiple Ys. This is what makes it so hard to design a fully satisfactory product. The values of X that optimize one Y may sub-optimize other Ys.
Another way of identifying important Xs is the Parameter Diagram (P-diagram), which breaks the Xs up into Signals, Design Parameters, and Noise; and Performances into Outputs and Waste or Side Effects or Error States.
Don't take the little bell curves seriously. They indicate only that each of the Xs (and Ys) is variable, and that these variations combine in the model. Researchers often overlook this and focus instead on the precision of the estimates of the model's parameters. More on this "propagation of variances" later.
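A minimal sketch of what "propagation of variances" means, using a toy model and invented distributions:

```python
import numpy as np

# Each X is not a fixed number but a distribution; the model pushes
# that spread through into Y. All distributions here are invented.
rng = np.random.default_rng(0)
n = 100_000

thickness = rng.normal(2.0, 0.05, n)    # mm
density   = rng.normal(1.2, 0.02, n)    # g/cm^3

# Toy model Y = f(X): part weight proportional to thickness * density.
Y = thickness * density

print("mean of Y:", round(Y.mean(), 4))
print("std  of Y:", round(Y.std(), 4))
# Y varies even though the model's *parameters* are known exactly;
# the spread comes entirely from the spread in the Xs.
```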
Example: In "The Predictability of Coups d'Etat: A Model with African Data," Robert W. Jackman developed a set of four Xs by which to estimate the coup-proneness of Sub-Saharan African states:
Y: Coup index. The weighted sum of successful coups (5), unsuccessful coups (3), and plots (1).
M: Social mobilization. The percentage of the labor force in nonagricultural occupations (c.1966) plus the percentage of the population that is literate (c.1965).
C: Ethnic pluralism. The percentage of the population belonging to the largest ethnic group, treated as a binary variable: either greater than 44% or not.
D: Party dominance. The percentage of the vote going to the winning party in the election closest but prior to independence.
P: Political participation. The percentage of the population voting in the aforesaid pre-independence election, treated as a binary variable: either greater than 20% or not.

The selection of which variables to include in the model is a potent source of uncertainty. Has a key variable been omitted? A useless variable included? In his paper, Jackman gives sound reasons, based on prior research, for the conceptual variables he is attempting to operationalize. IOW, he did not simply pop them out of thin air. Two of them have been made binary for heuristic reasons; that is, they seemed to work better that way. But M as defined defies reality. What is the basis for adding those two percentages? What does the resulting sum mean?
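To make the oddity concrete, here is the arithmetic as a sketch. The counts and percentages are invented; only the 5/3/1 weights come from the paper:

```python
# Coup index Y: weighted sum of successful coups (5), unsuccessful (3),
# and plots (1), for a hypothetical country.
successful, unsuccessful, plots = 1, 2, 3
Y = 5 * successful + 3 * unsuccessful + 1 * plots   # = 14

# M: percentage of labor force in nonagricultural work, plus percentage
# of population that is literate. Two percentages with two different
# denominators -- what is their sum a percentage *of*?
pct_nonag_labor = 35.0   # % of labor force, c.1966 (invented value)
pct_literate    = 25.0   # % of population, c.1965 (invented value)
M = pct_nonag_labor + pct_literate                  # = 60.0 ... of what?

print("coup index Y =", Y, "; social mobilization M =", M)
```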
The Second Uncertainty: A Close Shave with Billy Ocks.
There are many more ways to be wrong in a 10^6-dimensional space than there are ways to be right.
-- Leonard Smith

Why does Ockham get all the credit for the well-known Principle of Parsimony? Why do Moderns think it is a scientificalistic fact?
Digression: What's the probability that you know some "A" who knows "B" who knows Buzz Aldrin?
You→"A" (+1)→"B" (+2)→Buzz (+3)
"Unlikely," most people have said in TOF's training classes -- in which he was trying to make a point about estimating likelihoods for failure modes and effects analysis (FMEA). But as it turns out, TOF knows SF writer John Barnes and John Barnes knows Buzz Aldrin. So if you know TOF, you are separated from Buzz by no more than three degrees of separation.
That's because you know easily 1000 people -- family, neighbors, school-chums, co-workers, acquaintances, etc. -- and each of them knows 1000 people, and each of them knows 1000 people, which makes a billion people out at step 3. So even allowing for overlap and duplication, it's almost dead certain that you know someone who knows someone who knows American X, no matter who X is.

TOF has collected examples over the years. You know TOF (sorta) (+1), and TOF knows:
- a friend of the Incomparable (+2) who went to high school with Robert Redford (+3).
- a client named Norayr (+2), who knew the Armenian Patriarch of Jerusalem (+3).
- Jerry Pournelle (+2), who knew Ronald Reagan (+3) and knows Newt Gingrich (+3).
- Ed Schrock (+2), who knew W. Edwards Deming (+3) -- who knew Buffalo Bill Cody (+4!).
- Rep. Tim Wirth (+2) and Sen. Gary Hart (+2), who knew inter alia then-Arkansas governor Bill Clinton (+3).
- a nuclear inspector in Vienna (+2) who knew Nelson Mandela (+3).
IOW, probabilities are often different than you think.
Unabomber before he was Unabomber
Now, TOF told you that to tell you this.
Kaczynski told his class (so a friend of TOF reports): "With seven variables you can model any set of data -- as long as you can play with the coefficients." There will be no "unexplained" variation left over to warrant an eighth factor. That's why big models always seem to fit so well and have large R² values, and why the modelers "don't need" the eighth factor.
Except that if you drop one of the other seven and replace it with number eight, you can get another good fit (with different b-coefficients). Many moons ago, TOF read an article in Scientific American -- the title and author now escape the keen steel trap of his mind -- in which an R² value was improved by adding a variable that was completely irrelevant to the model! MLB batting averages, as he recalls.
So you can screw up your model by having too many terms in it. It will look better than it is, because by fiddling with the b's you can get a good-looking R².
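A sketch of the Kaczynski effect: pure noise in, rising R² out. All data below are invented by construction:

```python
import numpy as np

# R^2 climbs as we add regressors, even when every one of them is
# irrelevant by construction. Few observations make it worse.
rng = np.random.default_rng(1)
n = 10
y = rng.normal(size=n)                  # the "data": pure noise

def r_squared(X, y):
    A = np.column_stack([np.ones(len(y)), X])
    b, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ b
    ss_tot = (y - y.mean()) @ (y - y.mean())
    return 1 - (resid @ resid) / ss_tot

X = np.empty((n, 0))
for k in range(1, 8):                   # add one useless X at a time
    X = np.column_stack([X, rng.normal(size=n)])
    print(k, "irrelevant regressors -> R^2 =", round(r_squared(X, y), 3))
# R^2 rises steadily toward 1.0 even though every X is noise; with
# nine regressors on ten points the "fit" would be exact.
```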
The Third Uncertainty: The Curse of Multicollinearity
When two or more Xs are correlated, the estimated regression coefficients may be wrong. A coefficient that ought to be positive may even come out negative, and vice versa. The model is unreliable and may identify "signals" that aren't there.

Example: Thickness of a molded plastic part was correlated against a laundry list of potentially important factors involving the material and the molding machine:
- Film Density
- Tensile Strength
- Air Bubbles
- PreHeat Temp.
- Plug Temp.
- Air Pressure
- Vacuum Pressure
- Mold Temp.
- Dwell Time
which returned an R² = 86%:

Thickness = 12.7 + 0.254 FilmDens + 0.00008 Tensile - 0.00517 AirBubbles
            - 0.0421 PreHeat - 0.00610 PlugTemp - 0.0106 AirPres
            - 0.0282 VacPress + 0.00419 MoldTemp - 0.0310 DwellTime
However, it turned out that Film Density and Tensile Strength were positively correlated and that Air Pressure and Vacuum Pressure were negatively correlated, meaning that including both of each pair of factors added no information to the model. After consideration of possible physical causations, the decision was made to drop Tensile and AirPres, yielding a reduced equation:
Thickness = 13.2 + 0.258 FilmDens - 0.00469 AirBubbles - 0.0430 PreHeat
            - 0.00635 PlugTemp - 0.0182 VacPress + 0.00373 MoldTemp
            - 0.0232 DwellTime

which had an R² = 86%. That is, dropping two variables had no impact on explanatory power. Note too that the coefficients of the remaining factors have changed: it's a different equation, a different model. If Napoleon or Sean Carroll asked, "Where is Air Pressure in this system?" we would answer, "Sire, we had no need of that hypothesis!"
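The trap is easy to reproduce. A sketch with invented data (not the molding study itself):

```python
import numpy as np

# x2 is nearly a copy of x1, and Y truly depends on x1 alone.
rng = np.random.default_rng(7)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(0, 0.05, n)        # x2 = x1 plus a whisper of noise
y = 3.0 * x1 + rng.normal(0, 1.0, n)    # truth: y = 3*x1

def fit(X, y):
    A = np.column_stack([np.ones(len(y)), X])
    b, *_ = np.linalg.lstsq(A, y, rcond=None)
    return b                            # [intercept, coefficients...]

print("x1 and x2 both in:", np.round(fit(np.column_stack([x1, x2]), y), 2))
print("x1 alone:         ", np.round(fit(x1[:, None], y), 2))
# With both in, the two coefficients split the credit erratically
# (one may even come out negative); dropping x2 barely changes the fit
# but gives a stable, interpretable coefficient near 3.
```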
We will look at the related problem of serial correlation in Part V.
The Fourth Uncertainty: Conjuring Factors from the Vasty Deep
Glendower: I can call spirits from the vasty deep!
Hotspur: Why, so can I, or so can any man. But will they come when you do call for them?
-- Shakespeare, Henry IV Part 1: Act 3, Scene 1
Some folks try to "de-correlate" their X-factors by transformations. This is called Orthogonal Factor Analysis, a/k/a Principal Component Analysis. Factor analysis describes the variation among observed, correlated variables "in terms of [a potentially lower number of] unobserved factors." In the scatter diagram to the left, the two Xs -- yield strength and tensile strength (of aluminum coil stock) -- are mutually correlated. But if we rotate the axes to Z1, Z2, the correlation is removed.
The penalty is that the new "hidden" factors do not correspond to any factor in the real world. In the graph, X1 is Yield Strength and X2 is Tensile Strength. But Z1 and Z2 correspond to combinations of little bits of both. If that makes any sense. A model based on these Zs may have good predictability, but it is no longer clear what the model means. This is a particular trouble with Big Data, which often leads to models of surpassing opaqueness.
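A sketch of the rotation, with invented strengths standing in for the coil-stock data:

```python
import numpy as np

# Two correlated Xs, standing in for yield and tensile strength.
rng = np.random.default_rng(3)
n = 500
x1 = rng.normal(50, 5, n)               # "yield strength"
x2 = 1.2 * x1 + rng.normal(0, 2, n)     # "tensile strength", correlated

X = np.column_stack([x1, x2])
Xc = X - X.mean(axis=0)

# Eigenvectors of the covariance matrix define the rotated axes Z1, Z2.
vals, vecs = np.linalg.eigh(np.cov(Xc.T))
Z = Xc @ vecs

print("corr(X1, X2):", round(np.corrcoef(x1, x2)[0, 1], 3))
print("corr(Z1, Z2):", round(np.corrcoef(Z[:, 0], Z[:, 1])[0, 1], 3))
# The Zs are uncorrelated -- but each Z is a blend of both strengths:
# good for prediction, bad for knowing what the model *means*.
```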
Weinberger tells of a computer program called Eureqa that seines the ocean of data and iteratively constructs equations which adequately predict the outputs. This sounds like orthogonal factor analysis by way of step-wise regression -- on steroids. But the resulting equations will be heuristic and not built on any insight into the problem. "After chewing over the brickyard of data that Suel had given it, Eureqa came out with two equations that expressed constants within the cell. Suel had his answer. He just doesn't understand it and doesn't think any person could." But that is probably because the various Xs in such equations need not correspond to any physical factor in the system.

This is exactly what Billy Ockham warned us against: don't have too many terms in your model, or you won't understand your model. It gets worse when you don't even understand the terms. For discussion relating to global temperature, see "Step 3" here.
Summulae
The first group of uncertainties (once we have gotten past correctly assessing the context in which the model will be applied) deals with the identification of the predictors (Xs) in the model. Like Little Bear's porridge, these should be neither too many nor too few.

- Selection of the wrong Xs; non-selection of pertinent Xs. The knee bone's connected to the thigh bone, and the thigh bone's connected to the hip bone. The skeleton won't dance unless all the bones are in place.
- Too many Xs lead to overfitting and the modeling of random noise instead of signal. A model that includes everything including the kitchen sink is likely to give results much like one finds in the trap of a kitchen sink.
- Mutually correlated Xs lead to overfitting and to unreliable coefficients. You actually have one factor masquerading as two and will believe yourself less uncertain than you really should be.
- Principal components analysis removes the correlation among the Xs, but produces "factors" that do not correspond to physical reality. An excessive fascination with the mechanics of modeling may lead you to forget the reality you were trying to model in the first place.
Suggested Reading
- Box, George E.P., William G. Hunter, and J. Stuart Hunter. Statistics for Experimenters, Pt. IV, "Building Models and Using Them" (John Wiley and Sons, 1978)
- Carroll, Sean. "Physics and the Immortality of the Soul," Scientific American guest blog (May 23, 2011)
- Curry, Judith and Peter Webster. "Climate Science and the Uncertainty Monster," Bull. Am. Met. Soc., Vol. 92, Issue 12 (December 2011)
- El-Haik, Basem and Kai Yang. "The components of complexity in engineering design," IIE Transactions (1999) 31, 925-934
- Hamblin, Robert L., R. Brooke Jacobsen, and Jerry L.L. Miller. A Mathematical Theory of Social Change (John Wiley and Sons, 1973). [link is to a journal article preceding the book]
- Hayek, Friedrich August von. "The Pretence of Knowledge," Lecture to the memory of Alfred Nobel, December 11, 1974
- Jackman, Robert W. "The Predictability of Coups d'Etat: A Model with African Data," Am. Pol. Sci. Rev., 72(4) (Dec. 1978)
- Kuttler, Christina. "Reaction-Diffusion equations with applications," Technische Universität München, Lehrstuhl für Mathematische Modellierung, Sommersemester 2011
- Petersen, Arthur Caesar. Simulating Nature (dissertation, Vrije Universiteit, 2006)
- Pielou, E.C. An Introduction to Mathematical Ecology (Wiley-Interscience, 1969)
- Rashevsky, N. Looking at History Through Mathematics (MIT Press, 1968)
- Ravetz, Jerome R. NUSAP - The Management of Uncertainty and Quality in Quantitative Information
- Renfrew, Colin and Kenneth Cooke (eds.). Transformations: Mathematical Approaches to Culture Change (Academic Press, 1979)
- Rodionov, Sergei N. "The Problem of Red Noise in Climate Regime Shift Detection"
- Rosen, Robert. "Morphogenesis in Biological and Social Systems," in Transformations: Mathematical Approaches to Culture Change, pp. 91-111 (Academic Press, 1979)
- Steirou, E. and D. Koutsoyiannis. "Investigation of methods for hydroclimatic data homogenization," European Geosciences Union General Assembly, Vienna 2012
- Swanson, Kyle L. "Emerging selection bias in large-scale climate change simulations," Geophysical Research Letters
- Turney, Jon. "A model world," aeon magazine (16 December 2013)
- Walker, W.E., et al. "Defining Uncertainty: A Conceptual Basis for Uncertainty Management in Model-Based Decision Support," Integrated Assessment (2003), Vol. 4, No. 1, pp. 5-17
- Weaver, Warren. "Science and Complexity," American Scientist, 36:536 (1948)
- Weinberger, David. "To Know, but Not Understand," The Atlantic (3 Jan 2012)
- Zeeman, E.C. "A geometrical model of ideologies," in Transformations: Mathematical Approaches to Culture Change, pp. 463-479 (Academic Press, 1979)
I suspect that the Scientific American article you can't remember is http://www.scientificamerican.com/article/steins-paradox-in-statistics/
which made a pretty strong impression on me when I read it and I kept looking for a place where I could USE it.
On a separate subject, I have had an idea for a HS-level science project which I think could provide naked-eye evidence for a non-geocentric solar system. The idea seems so simple that I am SURE someone has thought of this before-- yet I can't find any sources. May I bother you for your opinion on this? Thanks in advance, Pete Peterson
You might be right. It sounds familiar. I kept thinking "Simpson's Paradox," but I knew that wasn't right.
+++
I'm not a HS science teacher, but feel free to explain your experiment. Does it involve stellar aberration, parallax, or Coriolis?
I was watching Jupiter setting behind our barn (at least 200 feet away), which has a pretty sharp roof, and saw it take what seemed a few seconds to disappear from the time it seemed to start getting dimmer to the time it disappeared altogether. A few minutes later, I watched a nearby star set and wink out instantaneously. I backed up so I could see Jupiter again and this time counted; it took about 4 seconds to disappear. I backed up again and again and got around 4 seconds each time (no stopwatch; "timed" it by singing the fastest fiddle tune I know repeatedly, which is about 5 notes per second). Did the arithmetic: if the earth turns (or the stars turn) 15 degrees per hour, then my four seconds corresponds to an angular diameter of about 1 minute of arc. A few nights ago I looked at Mars, nearly at opposition, and it took about 1.4 seconds (seven notes of "Soldier's Joy") to disappear behind a vertical straightedge I had rigged up. That's what I have DONE; what I BELIEVE is that if a patient observer were to follow the visible planets (say at 2-week intervals), their apparent angular diameters would change with time, and be largest when that planet is brightest -- Venus and Mars would change greatly, while Jupiter would barely change and Saturn barely if at all. It's very hard not to jump ahead to "what we all know," but I think this could be strong evidence against a geocentric universe. Which makes me wonder -- if Tycho had his huge equipment, didn't he notice angular diameter? I haven't seen anything suggesting that he had! (Note that even by 1616 we didn't have reliable ways of measuring seconds of time, which is why I did it by singing tunes to myself. I may not be accurate, but I think I am reasonably consistent with timing -- and I am not unmindful of information that Galileo's father was a lute player.) Again, thanks in advance!
That's a pretty good measurement given the instruments used, Mr. Peterson. I just looked it up: the maximum angular diameter of Jupiter as viewed from Earth is a hair over 50 seconds of arc. (The minimum is about 30, so it does change more than you suggest.)
DeleteI suppose part of the reason that naked-eye astronomers didn’t make such measurements is that they never thought of the planets as having appreciable angular diameters. Another part would probably be that observatories were deliberately constructed in such a way as to give the widest possible unobstructed view, without any handy barns and things to take such measurements from.
Actually, they did measure the angular diameter. It was the reason they thought the stars, including the planets, were so much closer than they are. So close, in fact, that parallax should have been visible if the earth went round the sun. But the diameters are inflated by atmospheric aberration (which they knew about but discounted when the star was higher in the sky) and the Airy disks (which they did not know about, and which even telescopes did not eliminate). IOW, these diameters are to some extent optical illusions.
Did they all think the planets and stars were much closer? What about the Almagest saying the distances to the stars should just be treated as infinite and the Earth as an infinitesimal point of no dimensions? (Or was it that even the smaller-than-actual distances they estimated were so much bigger than their "only off by 1.6%-16.3% depending whose 'stadium' unit Eratosthenes was using" estimate of the size of the Earth, that the Earth could be treated as a point?)
I just learned about Airy disks; from the Wikipedia article it looks like the resolving power of the human eye is about 20 arc-seconds. But I can still remember the star going out abruptly (between one note and the next) while Jupiter took about four seconds... as Feynman said, "the first rule is that you must not try to fool anyone, and the easiest person to fool is yourself". Thanks.
Sophie, the distance to the inner surface of the shell of the fixed stars was estimated at an unimaginable 73 million miles (equiv.). For practical purposes in those days, that might as well be infinite wrt the diameter of the earth.
There being No Coincidences, we have also the recent posting:
wattsupwiththat.com/2014/04/30/nobody-expects-the-climate-inquisition/
okay, the bit about women licking their lips being thought more attractive/available made me laugh because I was firmly taught at early adolescence (early 80s) to never lick my lips in public, most especially when some man might be watching, because it indicated a willingness to perform fellatio. No, I'm not making this up. My reaction as a 12 yr. old girl was, "you've got to be joking."
I nominate for the reading list Vapnik, _The Nature of Statistical Learning Theory_, a book that I found particularly useful for systematically understanding inference and overfitting. It's not for the faint of heart, but neither is it ridiculously hard as technical books go, and it's concise enough that it's a physically small book that still explains how to implement computer software which learns to recognize handwritten digits by training itself from a body of examples. It's also structured in such a way that one can learn a lot by semi-skimming it without actually grinding through all the details: unlike some math-heavy technical books, it has quite a lot of framing text summarizing, explaining motivation and tradeoffs, and illustrating with examples what it's doing. I also have a particular interest in reading work by people who were involved in working out some of the details (this is the Vapnik of "Vapnik-Chervonenkis dimension") and in this case that helps me smile instead of groaning when a plus-size ego seems to be on display: correspondingly plus-size accomplishment is a mitigating factor.
Worth mentioning also is MacKay's _Information Theory, Inference, and Learning Algorithms_, which only touches on overfitting instead of targeting it and making the rubble bounce the way Vapnik does, but which seems to be a pretty good more general book with the special virtue of being downloadable for free.
Dear Michael Flynn:
This doesn't relate directly to the subject under discussion here, but it IS a query regarding a recent survey and the conclusions reached as a result of the data gathered in that study:
While reading io9 (which I enjoy a lot), I came across this article: (http://io9.com/are-religious-beliefs-going-to-screw-up-first-contact-1574043211?utm_campaign=socialflow_io9_facebook&utm_source=io9_facebook&utm_medium=socialflow)
The title: “Are Religious Beliefs Going To Screw Up First Contact?”
Based on: “...[clinical neuropsychologist Gabriel G. de la Torre]…sent a questionnaire to 116 American, Italian, and Spanish university students.”
Reached this conclusion: “…he found that many of the students — and by virtue the rest of society — lack awareness on many astronomical aspects. He also learned that the majority of people assess these subjects according to their religious beliefs.”
As a statistician, what do you make of the significance of the conclusion of this survey?
...What makes a neuropsychologist think he's qualified to assess anyone's knowledge of any science, except neuropsychology? (Leaving to one side whether that's not just a sexier name for "neo-phrenology".) Aside from the fact many scientists are absolute laymen outside their specialty (meaning he's not only unlikely to know what ideas are important, in astronomy, but also how to ask intelligent questions about those important ideas), constructing a questionnaire in such a way that its results are actually meaningful, is a specialized skill in and of itself—one neuropsychologists are not ordinarily trained in.
Even taking his study at face value, shouldn't the actual headline be "science education is so laughably bad that even in college, people rely on other fields of inquiry to judge questions of astronomy"? That doesn't actually seem to be something those other fields of inquiry have to answer for.
Finally, io9 needs to fire its editor. Like, in a kiln, or out of a cannon—that article's errors might've been perfectly calculated to sound like a pseudo-intellectual who doesn't actually understand the nuances of the highfalutin phrases they use. "By virtue" is simply misused (at the very least it would be "by virtue of [something]"); it really ought to be "by extension" or "by extrapolation". It should also be "aspects of astronomy" not "astronomical aspects"—those are not interchangeable ("astronomical aspects" are aspects of something else that are related to astronomy, e.g. "astronomical aspects of calendar-making").