My defense of psychometric g has attracted more attention than I expected. It has been discussed on Metafilter, Noahpinion, Less Wrong, and iSteve, among other places. In this post, I will address some criticisms of my arguments and comment on a couple of issues I did not discuss earlier.
1. What did Cosma Shalizi claim about the positive manifold?
I wrote that, according to Shalizi, the positive manifold is an artifact of test construction and that full-scale scores from different IQ batteries correlate only because they are designed to do so. This has been criticized as misrepresenting Shalizi’s argument. I based my interpretation of his position on the following passages, also quoted in my original post:
The correlations among the components in an intelligence test, and between tests themselves, are all positive, because that’s how we design tests. […] So making up tests so that they’re positively correlated and discovering they have a dominant factor is just like putting together a list of big square numbers and discovering that none of them is prime — it’s a necessary side-effect of the construction, nothing more.
What psychologists sometimes call the “positive manifold” condition is enough, in and of itself, to guarantee that there will appear to be a general factor. Since intelligence tests are made to correlate with each other, it follows trivially that there must appear to be a general factor of intelligence. This is true whether or not there really is a single variable which explains test scores or not.
By this point, I’d guess it’s impossible for something to become accepted as an “intelligence test” if it doesn’t correlate well with the Weschler [sic] and its kin, no matter how much intelligence, in the ordinary sense, it requires, but, as we saw with the first simulated factor analysis example, that makes it inevitable that the leading factor fits well.
It is true that Shalizi does not explicitly claim that it would be possible to construct intelligence tests that do not show the usual pattern of positive correlations. However, he does assert, repeatedly, that the correlations are positive because that’s how intelligence tests are designed. I don’t think it’s an uncharitable interpretation to infer that Shalizi thinks a different approach to designing intelligence tests would produce tests that are not always positively correlated. Why else would he compare test construction to “putting together a list of big square numbers and discovering that none of them is prime”?
Another issue is that it can never be inductively proven that all cognitive tests are correlated. However, intelligence testing has been around for more than 100 years, and if there were tests of important cognitive abilities that were independent of the others, they would surely have been discovered by now. Correlation with Wechsler’s tests or the like is not how test makers decide which tests reflect intelligence. There’s a long history of attempts to go beyond the general intelligence paradigm, and none of them has turned up important abilities that are uncorrelated with the rest.
2. Can random numbers generate the appearance of g?
Steve Hsu noted that people who approvingly cite Shalizi’s article often don’t actually understand it. A big source of confusion is Shalizi’s simulation experiment, in which he shows that if hypothetical tests draw, in a particular manner, on abilities based on randomly generated numbers, the tests will be positively correlated. This has led some to think that factor analysis, the method used by intelligence researchers, will generate the appearance of a general factor from any random data. This is not the case, and Shalizi makes no such claim. The correlations in his simulation result from the fact that different tests tap into some of the same abilities, thus sharing sources of variance. If the randomly generated abilities were not shared between tests, there’d be no positive manifold.
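To see concretely what the simulation does and does not show, here is a minimal sketch in Python. It is not Shalizi’s own code, and the numbers of people, abilities, and tests are arbitrary assumptions, but the logic is the same: each test is scored as the sum of a subset of many independent, randomly generated abilities. When the tests sample from a common pool, the overlap alone produces a positive manifold and a dominant first factor; when each test gets its own abilities, the manifold disappears.

```python
import numpy as np

rng = np.random.default_rng(42)
n_people, n_abilities, n_tests, per_test = 5000, 500, 10, 250

# A pool of independent, normally distributed "abilities" -- no general
# factor is built into the data anywhere.
pool = rng.standard_normal((n_people, n_abilities))

def simulate(shared):
    """Score each test as the sum of `per_test` abilities.

    shared=True  : abilities are drawn from the common pool, so any two
                   tests overlap in roughly half of their abilities.
    shared=False : every test gets its own freshly generated abilities,
                   so tests share no sources of variance.
    """
    scores = np.empty((n_people, n_tests))
    for t in range(n_tests):
        if shared:
            cols = rng.choice(n_abilities, size=per_test, replace=False)
            scores[:, t] = pool[:, cols].sum(axis=1)
        else:
            scores[:, t] = rng.standard_normal((n_people, per_test)).sum(axis=1)
    return scores

for shared in (True, False):
    r = np.corrcoef(simulate(shared), rowvar=False)
    mean_r = r[np.triu_indices(n_tests, k=1)].mean()
    # Share of total variance on the largest eigenvalue of the correlation
    # matrix ~ strength of the first (general-looking) factor.
    top_share = np.linalg.eigvalsh(r).max() / n_tests
    print(f"shared={shared}: mean r = {mean_r:.2f}, "
          f"first factor share = {top_share:.2f}")
```

With shared abilities, the mean inter-test correlation comes out around .5 and the first factor dominates; with disjoint abilities, the correlations hover around zero. The general-looking factor comes from shared sources of variance, not from factor analysis conjuring structure out of noise.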
3. What I mean by “sampling”
Several commenters have thought that when I wrote about the sampling model, I was referring specifically to Thomson’s original model which is the basis for Shalizi’s simulation experiment. This is not what I meant. It’s clear that Thomson’s model as such is not a plausible description of how intelligence works, and Shalizi does not present it as one. I should have been more explicit that I regard sampling as a broad class of different models that are similar only in positing that many different, possibly uncorrelated neural elements acting together can cause tests to be correlated. Shalizi argues that evolutionary considerations and neuroscience findings favor sampling as an explanation of g. My argument is that none of this falsifies general intelligence as a trait.
4. Race and g
Previously, I did not discuss racial differences in g, because that issue is largely orthogonal to the question of whether g is a coherent trait. Arthur Jensen argued that the black-white test score gap in America is due to g differences, but the existence of the gap is not contingent on what causes it. James Flynn pointed this out when criticizing Stephen Jay Gould’s book The Mismeasure of Man:
Gould’s book evades all of Jensen’s best arguments for a genetic component in the black-white IQ gap, by positing that they are dependent on the concept of g as a general intelligence factor. Therefore, Gould believes that if he can discredit g, no more need be said. This is manifestly false. Jensen’s arguments would bite no matter whether blacks suffered from a score deficit on one or 10 or 100 factors.
Regarding whether group differences in IQ reflect real ability differences, Shalizi has the following to say:
The question is whether the index measures the trait the same way in the two groups. What people have gone to great lengths to establish is that IQ predicts other variables the same way for the two groups, i.e., that when you plug it into regressions you get the same coefficients. This is not the same thing, but it does have a bearing on the question of measurement bias: it provides strong reason to think it exists. As Roger Millsap and co-authors have shown in a series of papers going back to the early 1990s […] if there really is a difference on the unobserved trait between groups, and the test has no measurement bias, then the predictive regression coefficients should, generally, be different.  Despite the argument being demonstrably wrong, however, people keep pointing to the lack of predictive bias as a sign that the tests have no measurement bias. (This is just one of the demonstrable errors in the 1996 APA report on intelligence occasioned by The Bell Curve.)
Firstly, the APA report does not claim that a lack of predictive bias suggests that there’s no measurement bias. The report simply states that as predictors of performance, IQ tests are not biased, at least not against underrepresented minorities. This is important because a primary purpose of standardized tests is to predict performance.
Secondly, research does actually show that the performance of lower-IQ groups is often overpredicted by IQ tests and other g-loaded tests, something which is alluded to in the APA report as well. This means that the best-fitting prediction equation is not the same for all groups. For example, the following chart (from this paper) shows how SAT scores (and high school GPA) over- or underpredict first-year college GPA for different groups:
There’s consistent overprediction for blacks, Hispanics, and Native Americans compared to whites and Asians. Women’s performance is underpredicted compared to men’s, which I’d suggest is largely due to sex differences in personality traits.
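The statistical logic behind this pattern is easy to demonstrate. Below is a small simulation of my own (the 1 SD latent gap and the 0.8 reliability are illustrative assumptions, not estimates from any dataset): a test and a criterion both measure the same latent trait identically in two groups, i.e., with no measurement bias, yet, in line with Millsap’s point, the within-group regressions have different intercepts, and a single common regression line overpredicts the lower-scoring group, which is just the pattern seen in the SAT data above.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000

# Latent trait: group B averages 1 SD below group A (illustrative gap).
group = rng.integers(0, 2, n)               # 0 = group A, 1 = group B
trait = rng.standard_normal(n) - group      # group B's mean shifted by -1

# Test and criterion measure the trait identically in both groups:
# same loading, same intercept, same error variance -> no measurement bias.
test = trait + 0.5 * rng.standard_normal(n)       # reliability ~ 0.8
criterion = trait + 0.5 * rng.standard_normal(n)

def ols(x, y):
    """Slope and intercept of the least-squares regression of y on x."""
    cov = np.cov(x, y)
    slope = cov[0, 1] / cov[0, 0]
    return slope, y.mean() - slope * x.mean()

# Within-group regressions: same slope, but the intercepts differ.
for g, label in ((0, "group A"), (1, "group B")):
    m = group == g
    slope, icept = ols(test[m], criterion[m])
    print(f"{label}: slope = {slope:.2f}, intercept = {icept:+.2f}")

# A single common regression line overpredicts the lower-scoring group.
slope, icept = ols(test, criterion)
resid = (icept + slope * test) - criterion        # predicted minus actual
for g, label in ((0, "group A"), (1, "group B")):
    print(f"{label}: mean overprediction = {resid[group == g].mean():+.2f}")
```

On these assumptions, the lower-scoring group’s intercept is lower, so the common line inflates its predicted performance. This is regression toward the group mean, and it means that overprediction of lower-scoring groups is exactly what an unbiased test would be expected to produce.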
As another example, here are some results from a new study that investigated predictive bias in the Medical College Admission Test (MCAT):
Again, the performance of blacks and Hispanics is systematically overpredicted by the MCAT. Blacks and Hispanics are less likely to graduate on time and to pass a medical licensure exam than whites with similar MCAT scores.
But as noted by Shalizi, the question of predictive bias is separate from the question of whether a test measures the same thing across groups. These days, psychometricians maintain that to establish that the same traits are being measured in different groups, one must test for what is called measurement invariance. Several studies have investigated this question with respect to the black-white IQ gap, and they affirm that the gap can generally be regarded as reflecting genuine differences in the latent traits measured (Dolan 2000; Dolan & Hamaker 2001; Lubke et al. 2003; Edwards & Oakland 2006).
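To illustrate what invariance buys you, here is a minimal numerical sketch (the loadings, the 1 SD latent gap, and the biased intercept are all made-up values). In a one-factor model, if intercepts and loadings are equal across groups, observed subtest gaps must be proportional to the subtests’ factor loadings; an intercept that differs between groups breaks that pattern, which is the kind of violation the multi-group factor analyses in the studies cited above are designed to detect.

```python
import numpy as np

# One-factor model: subtest score x_j = tau_j + lambda_j * eta + error.
# Measurement invariance = the intercepts tau and loadings lambda are the
# SAME in both groups, so any mean gap must flow through the latent trait.
lam = np.array([0.8, 0.7, 0.6, 0.5])   # loadings (illustrative values)
tau = np.zeros(4)                      # intercepts, equal across groups
latent_gap = 1.0                       # group difference in eta, in SD units

# Under invariance, expected subtest gaps are proportional to the loadings.
gap_invariant = lam * latent_gap
print("invariant gaps:", gap_invariant)    # [0.8, 0.7, 0.6, 0.5]

# Violate invariance: subtest 3 has a lower intercept in one group
# (e.g., culturally loaded content), unrelated to the latent trait.
tau_biased = tau.copy()
tau_biased[2] = -0.4
gap_biased = lam * latent_gap + (tau - tau_biased)
print("biased gaps:   ", gap_biased)       # [0.8, 0.7, 1.0, 0.5]
```

The biased subtest’s gap no longer lines up with its loading, so the single-latent-difference model fails to fit, which is how invariance tests flag it.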
In contrast, analyses of test score gaps between generations (or the Flynn effect) indicate that score gains by younger cohorts cannot be used to support the view that intelligence is genuinely increasing (Wicherts et al. 2004; Must et al. 2009; Wai & Putallaz 2011). Measurement invariance generally holds for black-white differences but not for cohort differences. Wicherts et al. 2004 put it this way:
It appears therefore that the nature of the Flynn effect is qualitatively different from the nature of B–W [black-white] differences in the United States. Each comparison of groups should be investigated separately. IQ gaps between cohorts do not teach us anything about IQ gaps between contemporary groups, except that each IQ gap should not be confused with real (i.e., latent) differences in intelligence. Only after a proper analysis of measurement invariance of these IQ gaps is conducted can anything be concluded concerning true differences between groups.