Thursday, May 16, 2013

Fun With Statistics - the Saga Continues

Sometimes BadStats are simply trivially bad and stem from poor math skills on the part of reporters and/or researchers.  For example, we lately had the following story:

"Hispanic High School Graduates Pass Whites in Rate of College Enrollment"
"High School Drop-Out Rate at Record Low."
    A record seven-in-ten (69%) Hispanic high school graduates in the class of 2012 enrolled in college that fall, two percentage points higher than the rate (67%) among their white counterparts, according to a Pew Research Center analysis of new data from the U.S. Census Bureau. ...

...    The positive trends in Hispanic educational indicators also extend to high school. The most recent available data show that in 2011 only 14% of Hispanic 16- to 24-year-olds were high school dropouts, half the level in 2000 (28%). Starting from a much lower base, the high school dropout rate among whites also declined during that period (from 7% in 2000 to 5% in 2011), but did not fall by as much.
So let's take the percentages as given.  If 14% of Hispanics drop out, then 86% of Hispanics finish high school.  The 69% of HS graduates who go on to college is thus 69% of the 86%, or 59%.  Similarly, the 67% of white HS graduates who go on to college is 67% of the 95% who did not drop out along the way, or 64%.  In sum:
  • Whites: 5% drop out, 31% finish high school but don't go to college, 64% go to college
  • Hispanics: 14% drop out, 27% finish high school but don't go to college, 59% go to college
There has been considerable progress on these measures on the part of Hispanics, but the headline is tendentiously exaggerated.  Would "closing the gap" have been that much worse a proclamation?
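
The arithmetic above can be checked with a short Python sketch.  It takes the quoted percentages as given, as the post does, and treats the dropout rate as applying to the same cohort as the college-enrollment rate.  The function name `breakdown` and the labels are illustrative only.

```python
def breakdown(dropout_rate, college_rate_among_grads):
    """Split a cohort into dropouts, HS-only graduates, and college enrollees."""
    graduates = 1.0 - dropout_rate                  # share who finish high school
    college = graduates * college_rate_among_grads  # share of the whole cohort enrolling
    hs_only = graduates - college                   # finish HS but do not enroll
    return {"drop out": dropout_rate, "HS only": hs_only, "college": college}


# Percentages quoted in the Pew story (assumed here to describe one cohort).
groups = {
    "Whites": breakdown(0.05, 0.67),
    "Hispanics": breakdown(0.14, 0.69),
}

for name, shares in groups.items():
    summary = ", ".join(f"{label}: {share:.0%}" for label, share in shares.items())
    print(f"{name}: {summary}")
```

On these inputs the college share of the whole cohort comes to roughly 64% for whites and 59% for Hispanics, matching the bullets.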

But this is like other examples of simple innumeracy.  Sometimes, when the BadStat comes from people wearing white lab coats, the problems are deeper.
During the great epidemic of Carcinogen-of-the-Month Stories, now thankfully forgotten under the present administration, we learned that according to peer-reviewed studies:

"Bug Spray Could Lead to Parkinson’s"
-- New York Post (6 May 2000)

This seemed like bad news for the nation's home-and-gardeners, but good news was right around the corner.  Peer-review told us later that same month that:

"Coffee might cure Parkinson’s"
-- Daily News (24 May 2000)

The lesson we learn from really-truly Science is thus that we can spray the roses while drinking a mocha grande!

One of the problems with Late Modern (Statistical) Science, compared to Early Modern (Mathematical) Science, is a metastatic mass of peer-reviewed papers far exceeding the volume of actual useful science being done.  In the old days, scientists would discover that F=G(M*m)/d² and after years of "the work of the intellect" (negotiatio intellectus) produce seminal works.  Nowadays, in the age of publish-or-perish, they "discover" that R² differs significantly (i.e., with Teeny p-Values™) from 0 in a multiple regression model of impressive kitchen-sinkery: Y=f(X1, X2, ... , Xn) and publish scores of papers in peer-reviewed journals.

The significance level (or α-risk) is the probability (given specific presuppositions) of a false alarm.  If one uses a p-value threshold of 5% as a marker, which is typical in scientific research, then testing an outcome simultaneously against twenty potential factors will, on average, produce one "significant" correlation by chance alone, and is more likely than not (about 64%, if the tests are independent) to produce at least one.  This is the one that gets published.  It is also why many of these correlations, breathlessly announced on the evening news, are never heard of again.  Follow-up studies do not confirm the initial results.  Retraction Watch is a useful site for tracking such things.
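
A rough sketch of the twenty-factor arithmetic, assuming the twenty tests are independent and that none of the factors has any real effect: the chance of at least one false alarm is 1 - (1 - α)^n, and the expected number of false alarms is α·n.  The function names and the simulation settings below are illustrative choices, not anything taken from a particular study.

```python
import random


def prob_at_least_one_false_alarm(alpha, n_tests):
    """Chance that at least one of n independent null tests comes up 'significant'."""
    return 1.0 - (1.0 - alpha) ** n_tests


def simulate(alpha, n_tests, n_studies=20_000, seed=0):
    """Monte Carlo check: the fraction of simulated studies with >= 1 false alarm."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_studies):
        # Under the null, each test is "significant" with probability alpha.
        if any(rng.random() < alpha for _ in range(n_tests)):
            hits += 1
    return hits / n_studies


exact = prob_at_least_one_false_alarm(0.05, 20)
expected_false_alarms = 0.05 * 20

print(f"P(at least one false alarm) = {exact:.3f}")
print(f"expected false alarms       = {expected_false_alarms:.1f}")
print(f"simulated frequency         = {simulate(0.05, 20):.3f}")
```

With α = 5% and twenty factors the exact figure is about 0.64, with one false alarm expected per study.  Correlated tests would change the numbers, but not the moral.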

Hence, one learns that Belief in Heaven "is associated with" high crime rates, followed by useless maundering about why this is so. This is a bit like theorizing why a pair of dice show specifically 11 (p=5.6%).  Slightly deeper is the reason why 9-Month-Old Babies Are Racist: and that is the confusion of surrogate measurements with the real thing and the supposition that a questionnaire is an "instrument" on a par with a micrometer or a gas chromatograph. 
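
The dice figure is easy to verify by brute force.  A minimal sketch, assuming two fair six-sided dice:

```python
from fractions import Fraction
from itertools import product

# Enumerate all 36 equally likely outcomes of two fair dice.
rolls = list(product(range(1, 7), repeat=2))

# Only (5, 6) and (6, 5) sum to 11.
p_eleven = Fraction(sum(1 for a, b in rolls if a + b == 11), len(rolls))

print(p_eleven, f"{float(p_eleven):.1%}")  # 1/18 5.6%
```

An outcome this rare happens all the time to anyone who rolls often enough, which is why a single "significant" result needs no special explanation.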

Correlations are a dime a dozen.  For example, as more lemons were imported from Mexico, US highway fatalities declined; or consider the claim that Coal-Fired Power Plants Fuel Suicide.

Now the latter, linked to Wm. Briggs, Statistician to the Stars, stems from the use of aggregate data to reach conclusions about individuals.
[Spangler] gathered county-level suicide rates and various demographics, such as percent whites, median income and the like, and counted the number of coal-fired power plants. He also took genuine air-quality measurements of metals and other pollutants, which was wise. He then “regressed”, i.e. used an unnecessarily complicated statistical model, the suicide rate and the other variables together.
None of the variables except percent whites, median age, and number of coal-fired power plants were “significant.” Spangler claimed that for every increase of one plant the suicide rate increases by about 2 per 100,000. This led Spangler and the press to conclude, as summarized for instance in Scientific American, “that county suicide rates correlated very predictably with the number of coal-fired electricity plants within said county.”
But of course, counties do not commit suicide, and no effort was made to relate actual individual suicides to proximity to the power plant(s).  This was the same poor practice that was used to prove the Parkinson's result referred to above. 
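
To see why county-level correlations cannot stand in for individual-level ones, here is a toy sketch.  The numbers are invented purely for illustration; they are not Spangler's data and say nothing about the actual study.  Each hypothetical county lists (exposure, outcome) pairs for its residents, arranged so that county averages rise together while, within every county, the more exposed individuals have the lower outcome.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


def mean(values):
    return sum(values) / len(values)


# Made-up data: (exposure, outcome) for three residents in each of three counties.
counties = {
    "A": [(1, 3), (2, 2), (3, 1)],
    "B": [(4, 6), (5, 5), (6, 4)],
    "C": [(7, 9), (8, 8), (9, 7)],
}

# Aggregate view: one (mean exposure, mean outcome) point per county.
county_mean_x = [mean([x for x, _ in people]) for people in counties.values()]
county_mean_y = [mean([y for _, y in people]) for people in counties.values()]
aggregate_r = pearson(county_mean_x, county_mean_y)

# Individual view: the correlation among residents inside each county.
within_r = {
    name: pearson([x for x, _ in people], [y for _, y in people])
    for name, people in counties.items()
}

print(f"county-level r: {aggregate_r:+.2f}")
for name, r in within_r.items():
    print(f"within county {name}: r = {r:+.2f}")
```

In this contrived case the county averages correlate perfectly and positively, while inside every county the relationship runs perfectly the other way.  Real data are never so tidy, but the aggregate figure alone cannot tell the two situations apart.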


