As an online discussion about IQ or general intelligence grows longer, the probability of someone linking to statistician Cosma Shalizi’s essay g, a Statistical Myth approaches 1. Usually the link is accompanied by an assertion to the effect that Shalizi offers a definitive refutation of the concept of general mental ability, or psychometric g.
In this post, I will show that Shalizi’s case against g appears strong only because he misstates several key facts and because he omits all the best evidence that the other side has offered in support of g. His case hinges on three clearly erroneous arguments on which I will concentrate.
I. Positive manifold
Shalizi writes that when all tests in a test battery are positively correlated with each other, factor analysis will necessarily yield a general factor. He is correct about this. All subtests of any given IQ battery are positively correlated, and subjecting an IQ correlation matrix to factor analysis will produce a first factor on which all subtests are positively loaded. For example, the 29 subtests of the revised 1989 edition of the Woodcock-Johnson IQ test are correlated in the following manner:
All the subtest intercorrelations are positive, ranging from a low of 0.046 (Memory for Words – Visual Closure) to a high of 0.728 (Quantitative Concepts – Applied Problems). (See Woodcock 1990 for a description of the tests.) This is the reason why we talk about general intelligence or general cognitive ability: individuals who get a high score on one cognitive test tend to do so on all kinds of tests regardless of test content or type (e.g., verbal, numerical, spatial, or memory tests), while those who do badly on one type of cognitive test usually do badly on all tests.
This phenomenon of positive correlations among all tests, often called the “positive manifold”, is routinely found among all collections of cognitive ability tests, and it is one of the most replicated findings in the social and behavioral sciences. The correlation between a given pair of ability tests is a function of the shared common factor variance (g and other factors) and imperfect test reliabilities (the higher the reliabilities, the higher the correlation). All cognitive tests load on g to a smaller or greater degree, so all tests covary at least through the g factor, if not other factors.
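To make the role of shared factor variance and reliability concrete, here is a minimal sketch in Python. It assumes a single common factor with standardized loadings, so that the error-free correlation between two tests is the product of their loadings, and it applies the classical attenuation formula for unreliability. The function names and the numbers are hypothetical and are not taken from any real battery.

```python
import math

def implied_correlation(loading_x, loading_y):
    """Correlation implied by one common factor: the product of the
    two tests' standardized loadings on that factor."""
    return loading_x * loading_y

def attenuated(r_true, reliability_x, reliability_y):
    """Classical attenuation: the observed correlation shrinks by the
    square root of the product of the two reliabilities."""
    return r_true * math.sqrt(reliability_x * reliability_y)

# Hypothetical tests loading 0.8 and 0.7 on the common factor,
# measured with reliabilities of 0.9 and 0.8.
r_true = implied_correlation(0.8, 0.7)
r_observed = attenuated(r_true, 0.9, 0.8)
print(round(r_true, 3), round(r_observed, 3))
```

With these made-up values the error-free correlation of 0.56 drops to about 0.475 once measurement error is taken into account, which is why more reliable tests show higher intercorrelations.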
John B. Carroll factor-analyzed the WJ-R matrix presented above, using confirmatory analysis to successfully fit a ten-factor model (g and nine narrower factors) to the data (Carroll 2003):
Loadings on the g factor range from a low of 0.279 (Visual Closure) to a high of 0.783 (Applied Problems). The g factor accounts for 59 percent of the common factor variance, while the other nine factors together account for 41 percent. This is a routine finding in factor analyses of IQ tests: the g factor explains more variance than the other factors put together. (Note that in addition to the common factor variance, there is always some variance specific to each subtest as well as variance due to random measurement error.)
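For readers unfamiliar with how a figure like "59 percent of the common factor variance" is obtained, the following is a rough sketch. It assumes orthogonal factors, in which case the common variance attributable to a factor is the sum of the squared loadings on it. The loadings below are invented for a toy four-test battery and are not Carroll's.

```python
def variance_share(loadings):
    """Given {factor name: [loading of each test]}, return each factor's
    share of the total common variance (sum of squared loadings),
    assuming the factors are orthogonal."""
    sums_of_squares = {
        factor: sum(value * value for value in values)
        for factor, values in loadings.items()
    }
    total = sum(sums_of_squares.values())
    return {factor: ss / total for factor, ss in sums_of_squares.items()}

# Hypothetical loadings: every test loads on g, and each test also loads
# on one of two narrower factors.
hypothetical_loadings = {
    "g":       [0.7, 0.6, 0.5, 0.6],
    "verbal":  [0.5, 0.4, 0.0, 0.0],
    "spatial": [0.0, 0.0, 0.5, 0.4],
}
shares = variance_share(hypothetical_loadings)
for factor, share in shares.items():
    print(factor, round(share, 2))
```

In this toy example g accounts for about 64 percent of the common variance and the two narrower factors for about 18 percent each.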
II. Shalizi’s first error
Against the backdrop of results like the above, Shalizi makes the following claims:
The correlations among the components in an intelligence test, and between tests themselves, are all positive, because that’s how we design tests. […] So making up tests so that they’re positively correlated and discovering they have a dominant factor is just like putting together a list of big square numbers and discovering that none of them is prime — it’s necessary side-effect of the construction, nothing more.
What psychologists sometimes call the “positive manifold” condition is enough, in and of itself, to guarantee that there will appear to be a general factor. Since intelligence tests are made to correlate with each other, it follows trivially that there must appear to be a general factor of intelligence. This is true whether or not there really is a single variable which explains test scores or not.
By this point, I’d guess it’s impossible for something to become accepted as an “intelligence test” if it doesn’t correlate well with the Weschler [sic] and its kin, no matter how much intelligence, in the ordinary sense, it requires, but, as we saw with the first simulated factor analysis example, that makes it inevitable that the leading factor fits well.
Shalizi’s thesis is that the positive manifold is an artifact of test construction and that full-scale scores from different IQ batteries correlate only because they are designed to do that. It follows from this argument that if a test maker decided to disregard the g factor and construct a battery for assessing several independent abilities, the result would be a test with many zero or negative correlations among its subtests. Moreover, such a test would not correlate highly with traditional tests, at least not positively. Shalizi alleges that there are tests that measure intelligence “in the ordinary sense” yet are uncorrelated with traditional tests, but unfortunately he does not give any examples.
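The purely mathematical part of this argument, which I conceded in section I, is easy to check. The sketch below uses a made-up 4 by 4 correlation matrix with all-positive entries and extracts the first principal component by power iteration. The principal component stands in for a general factor here, which is a simplifying assumption of mine and not the method used by Shalizi or Carroll.

```python
import math

def first_component(R, iterations=500):
    """Power iteration on a correlation matrix R (list of lists).
    Returns the largest eigenvalue and the loadings of each variable
    on the first principal component."""
    n = len(R)
    v = [1.0 / math.sqrt(n)] * n
    for _ in range(iterations):
        w = [sum(R[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # Rayleigh quotient of the converged unit vector gives the eigenvalue.
    eigenvalue = sum(
        v[i] * sum(R[i][j] * v[j] for j in range(n)) for i in range(n)
    )
    loadings = [math.sqrt(eigenvalue) * x for x in v]
    return eigenvalue, loadings

# Hypothetical correlations: all positive, some of them very small.
hypothetical_R = [
    [1.00, 0.05, 0.30, 0.20],
    [0.05, 1.00, 0.10, 0.40],
    [0.30, 0.10, 1.00, 0.15],
    [0.20, 0.40, 0.15, 1.00],
]
eigenvalue, loadings = first_component(hypothetical_R)
print(round(eigenvalue, 2), [round(x, 2) for x in loadings])
```

Every loading comes out positive, as the Perron-Frobenius theorem guarantees for a matrix with all-positive entries. What the sketch cannot tell us is whether real cognitive tests have to correlate positively in the first place. That is an empirical question, and it is the one at issue here.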
Inadvertent positive manifolds
There are in fact many cognitive test batteries designed without regard to g, so we can put Shalizi’s allegations to test. The Woodcock-Johnson test discussed above is a case in point. Carroll, when reanalyzing data from the test’s standardization sample, pointed out that its technical manual “reveals a studious neglect of the role of any kind of general factor in the WJ-R.” This dismissive stance towards g is also reflected in Richard Woodcock’s article about the test’s theoretical background (Woodcock 1990). (Yes, the Woodcock-Johnson test was developed by a guy named Dick Woodcock, together with his assistant Johnson. You can’t make this up.) The WJ-R was developed based on the idea that the g factor is a statistical artifact with no psychological relevance. Nevertheless, all of its subtests are intercorrelated and, when factor analyzed, it reveals a general factor that is no less prominent than those of more traditional IQ tests. According to the WJ-R technical manual, test results are to be interpreted at the level of nine broad abilities (such as Visual Processing and Quantitative Ability), not any general ability. Similarly, the manual reports factor analyses based only on the nine factors. But when Carroll reanalyzed the data, allowing for loadings on a higher-order g factor in addition to the nine factors, it turned out that most of the tests in the WJ-R have their highest loadings on the g factor, not on the less general (“broad”) factors which they were specifically designed to measure.
While the WJ-R is not meant to be a test of g, it does provide a measure of “broad cognitive ability”, which correlates at 0.65 and 0.64 with the Stanford-Binet and Wechsler full-scale scores, respectively (Kamphaus 2005, p. 335). Typically, correlations between full-scale scores from different IQ tests are around 0.8.
The WJ-R broad cognitive ability scores are probably less g-loaded than those of other tests, because they are based on unweighted sums of scores on subtests selected solely on the basis of their content diversity; hence the lower correlations, I believe. The lower than expected correlation appears to be due to range restriction in the sample used. In any case, the WJ-R is certainly not uncorrelated with traditional tests. (The WJ-III, which is the newest edition of the test, now recognizes the g factor.)
The WJ-R serves as a forthright refutation of Shalizi’s claim that the positive manifold and inter-battery correlations emerge by design rather than because all cognitive abilities naturally intercorrelate. But perhaps the WJ-R is just a giant fluke, or perhaps its 29 tests correlate as a carryover from the previous edition of the test which had several of the same tests but was not based on anti-g ideas. Are there other examples of psychometricians accidentally creating strongly g-loaded tests against their best intentions? In fact, there is a long history of such inadvertent confirmations of the ubiquity of the g factor. This goes back at least to the 1930s and Louis Thurstone’s research on “primary mental abilities”.
Thurstone and Guilford
In a famous study published in 1938, Thurstone, one of the great psychometricians, claimed to have developed a test of seven independent mental abilities (verbal comprehension, word fluency, number facility, spatial visualization, associative memory, perceptual speed, and reasoning; see Thurstone 1938). However, the g men quickly responded, with Charles Spearman and Hans Eysenck publishing papers (Spearman 1939, Eysenck 1939) showing that Thurstone’s independent abilities were not independent, indicating that his data were compatible with Spearman’s g model. (Later in his career, Thurstone came to accept that perhaps intelligence could best be conceptualized as a hierarchy topped by g.)
The idea of non-correlated abilities was taken to its extreme by J.P. Guilford who postulated that there are as many as 160 different cognitive abilities. This made him very popular among educationalists because his theory suggested that everybody could be intelligent in some way. Guilford’s belief in a highly multidimensional intelligence was influenced by his large-scale studies of Southern California university students whose abilities were indeed not always correlated. In 1964, he reported (Guilford 1964) that his research showed that up to a fourth of correlations between diverse intelligence tests were not different from zero. However, this conclusion was based on bad psychometrics. Alliger 1988 reanalyzed Guilford’s data and showed that when you correct for artifacts such as range restriction (the subjects were generally university students), the reported correlations are uniformly positive.
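To see why range restriction matters, consider the standard correction for direct range restriction (often called Thorndike's Case II). The sketch below is only an illustration with hypothetical numbers. It is not Alliger's actual procedure, which dealt with more than one artifact, and the function name is my own.

```python
import math

def correct_for_range_restriction(r_restricted, sd_ratio):
    """Thorndike Case II correction for direct range restriction.

    r_restricted: correlation observed in the selected sample
    sd_ratio: unrestricted SD divided by restricted SD (at least 1
              when the sample is less variable than the population)
    """
    numerator = r_restricted * sd_ratio
    denominator = math.sqrt(1.0 + r_restricted ** 2 * (sd_ratio ** 2 - 1.0))
    return numerator / denominator

# Hypothetical case: a correlation of 0.20 among university students whose
# ability SD is half that of the general population.
print(round(correct_for_range_restriction(0.20, 2.0), 2))
```

In this made-up case a correlation of 0.20 in the selected sample corresponds to roughly 0.38 in the full population. The correction never changes the sign of a correlation, but a small positive correlation in a homogeneous sample is easily mistaken for zero.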
British Ability Scales
Psychometricians have not been discouraged by past failures to discover abilities that are independent of the general factor. They keep constructing tests that supposedly take the measurement of intelligence beyond g.
For example, the British Ability Scales was carefully developed in the 1970s and 1980s to measure a wide variety of cognitive abilities, but when the published battery was analyzed (Elliott 1986), the results were quite disappointing:
Considering the relatively large size of the test battery […] the solutions have yielded perhaps a surprisingly small number of common factors. As would be expected from any cognitive test battery, there is a substantial general factor. After that, there does not seem to be much common variance left […]
What, then, are we to make of the results of these analyses? Do they mean that we are back to square one, as it were, and that after 60 years of research we have turned full circle and are back with the theories of Spearman? Certainly, for this sample and range of cognitive measures, there is little evidence that strong primary factors, such as those postulated by many test theorists over the years, have accounted for any substantial proportion of the common variance of the British Ability Scales. This is despite the fact that the scales sample a wide range of psychological functions, and deliberately include tests with purely verbal and purely visual tasks, tests of fluid and crystallized mental abilities, tests of scholastic attainment, tests of complex mental functioning such as in the reasoning scales and tests of lower order abilities as in the Recall of Digits scale.
An even better example is the CAS battery. It is based on the PASS theory (which draws heavily on the ideas of Soviet psychologist A.R. Luria, a favorite of Shalizi’s), which disavows g and asserts that intelligence consists of four processes called Planning, Attention-Arousal, Simultaneous, and Successive. The CAS was designed to assess these four processes.
However, Keith et al. 2001 did a joint confirmatory factor analysis of the CAS together with the WJ-III battery, concluding that not only does the CAS not measure the constructs it was designed to measure, but that notwithstanding the test makers’ aversion to g, the g factor derived from the CAS is large and statistically indistinguishable from the g factor of the WJ-III. The CAS therefore appears to be the opposite of what it was supposed to be: an excellent test of the “non-existent” g and a poor test of the supposedly real non-g abilities it was painstakingly designed to measure.
A particularly amusing confirmation of the positive manifold resulted from Robert Sternberg’s attempts at developing measures of non-g abilities. Sternberg introduced his “triarchic” theory of intelligence in the 1980s and has tirelessly promoted it ever since while at every turn denigrating the proponents of g as troglodytes. He claims that g represents a rather narrow domain of analytic or academic intelligence which is more or less uncorrelated with the often much more important creative and practical forms of intelligence. He created a test battery to test these different intellectual domains. It turned out that the three “independent” abilities were highly intercorrelated, which Sternberg absurdly put down to common-method variance.
A reanalysis of Sternberg’s data by Nathan Brody (Brody 2003a) showed that not only were the three abilities highly correlated with each other and with Raven’s IQ test, but also that the abilities did not exhibit the postulated differential validities (e.g., measures of creative ability and analytic ability were equally good predictors of measures of creativity, and analytic ability was a better predictor of practical outcomes than practical ability), and in general the test had little predictive validity independently of g. (Sternberg, true to his style, refused to admit that these results had any implications for the validity of his triarchic theory, prompting the exasperated Brody to publish an acerbic reply called “What Sternberg should have concluded” [Brody 2003b].)
The administration of several different IQ batteries to the same sample of individuals offers another good way to test the generality of the positive manifold. As part of the Minnesota Study of Twins Reared Apart (or MISTRA), three batteries comprising a total of 42 different cognitive tests were taken by the twins studied and also by many of their family members. The three tests were the Comprehensive Ability Battery, the Hawaii Battery, and the Wechsler Adult Intelligence Scale. The tests are highly varied content-wise, with each battery measuring diverse aspects of intelligence. See Johnson & Bouchard 2011 for a description of the tests. Correlations between the 42 tests are presented below:
All 861 correlations are positive. Subtests of each IQ battery correlate positively not only with each other but also with the subtests of the other IQ batteries. This is, of course, something that the developers of the three different batteries could not have planned – and even if they could have, they would not have had any reason to do so, given their different theoretical presuppositions. (Later in this post, I will present some very interesting results from a factor analysis of these data.)
As a final example of the impossibility of doing away with the positive manifold I will discuss a test battery which is rather exotic from a traditional psychometric perspective. The Swiss developmental psychologist Jean Piaget devised a number of cognitive tasks in order to investigate the developmental stages of children. He was not interested in individual differences (a common failing among developmental psychologists) but rather wanted to understand universal human developmental patterns. He never created standardized batteries of his tasks. See here for a description of many of Piaget’s tests. Some of them, such as those assessing Logical Operations, are quite similar to traditional IQ items, but others, such as Conservation tasks, are unlike anything in IQ tests. Nevertheless, most would agree that all of them measure cognitive abilities.
Humphreys et al. 1985 studied a battery of 27 Piagetian tasks completed by a sample of 150 children. A factor analysis of the Piagetian battery showed that a strong general factor underlies the tasks, with loadings ranging from 0.32 to 0.80:
But it is possible that the Piagetian general factor is not at all the same as the general factor of IQ batteries or achievement tests. Whether this is the case was tested by having the same sample take Wechsler’s IQ test and an achievement test of spelling, arithmetic, and reading. The result was that scores on the Piagetian battery, Wechsler’s Performance (“fluid”) and Verbal (“crystallized”) scales, and the achievement test were highly correlated, clearly indicating that they are measuring the same general factor. (A small caveat here is that the study included an oversample of mildly mentally retarded children in addition to normal children. Such range enhancement tends to inflate correlations between tests, so in a more adequate sample the correlations and g loadings would be somewhat lower. On the other hand, the data have not been corrected for measurement error, which reduces correlations.) The correlations looked like this:
When this correlation matrix of four different measures of general ability is factor analyzed, it can be seen that all of them load very strongly (~0.9) on a single factor:
It can be said that a battery of Piagetian tasks is about as good a measure of g as Wechsler’s test. It does not matter at all that Piagetian and psychometric ideas of intelligence are very different and that the research traditions in which IQ tests and Piagetian tasks were conceived have nothing to do with each other. The g factor will emerge regardless of the type of cognitive abilities called for by a test.
Positive manifold as a fact of nature
These examples show that, contrary to Shalizi’s claims, all cognitive abilities are intercorrelated. We can be confident about this because the best evidence for it comes not from the proponents of g but from numerous competent researchers who were hell-bent on disproving the generality of the positive manifold, only to be refuted by their own work.
Quite contrary to what Shalizi believes, IQ tests are usually constructed to measure several different abilities, not infrequently with the (stubbornly unrealized) objective of measuring abilities that are completely independent of g. IQ tests are not devised with the aim of maximizing variance on the first common factor, or g; rather, the prominence of the g factor is a fact of human nature, and it is impossible to do away with it.
The g factor is thus not an artifact of test construction but a genuine explanandum, something that any theory of intelligence must account for. The only way to deny this is to redefine intelligence to include skills and talents with little intellectual content. For example, Howard Gardner claims that there is a “bodily-kinesthetic intelligence” which athletes and dancers have plenty of. I don’t think such semantic obfuscation contributes anything to the study of intelligence.
III. Shalizi’s second error
Towards the end of his piece, Shalizi makes this bizarre claim:
It is still conceivable that those positive correlations are all caused by a general factor of intelligence, but we ought to be long since past the point where supporters of that view were advancing arguments on the basis of evidence other than those correlations. So far as I can tell, however, nobody has presented a case for g apart from thoroughly invalid arguments from factor analysis; that is, the myth.
One can only conclude that if Shalizi really believes that, he has made no attempt whatsoever to familiarize himself with the arguments of g proponents, preferring his own straw man version of g theory instead. For example, in 1998 the principal modern g theorist, Arthur Jensen, published a book (Jensen 1998) running to nearly 700 pages, most of which consists of arguments and evidence that substantiate the scientific validity and relevance of the g factor beyond the mere fact of the positive manifold (which in itself is not a trivial finding, contra Shalizi). The evidence he puts forth encompasses genetics, neurophysiology, mental chronometry, and practical validity, among many other things.
I will next describe some of the most important findings that support the existence of g as the central, genetically rooted source of individual differences in cognitive abilities. Together, the different lines of evidence indicate that human behavioral differences cannot be properly understood without reference to g.
Evidence from confirmatory factor analyses
Shalizi spends much time castigating intelligence researchers for their reliance on exploratory factor analysis even though more powerful, confirmatory methods are available. This is a curious criticism in light of the fact that confirmatory factor analysis (CFA) was invented for the very purpose of studying the structure of intelligence. The trailblazer was the Swedish statistician Karl Jöreskog who was working at the Educational Testing Service when he wrote his first papers on the topic. There are in fact a large number of published CFAs of IQ tests, some of them discussed above. Shalizi must know this because he refers to John B. Carroll’s contribution in the book Intelligence, Genes, and Success: Scientists Respond to The Bell Curve. In his article, Carroll discusses classic CFA studies of g (e.g., Gustafsson 1984) and reports CFAs of his own which indicate that his three-stratum model (which posits that cognitive abilities constitute a hierarchy topped by the g factor) shows good fit to various data sets (Carroll 1995).
Among the many CFA studies showing that g-based factor models fit IQ test data well, two published by Wendy Johnson and colleagues are particularly interesting. In Johnson et al. 2004, the MISTRA correlation matrix of three different IQ batteries, discussed above, was analyzed, and it turned out that the g factors computed from the three tests were statistically indistinguishable from one another, despite the fact that the tests clearly tapped into partly different sets of abilities. The results of Johnson et al. 2004, which have since been replicated in another multiple-battery sample (Johnson et al. 2008), are in accord with Spearman and Jensen’s argument that any diverse collection of cognitive tests will provide an excellent measure of one and the same g; what specific abilities are assessed is not important because they all measure the same g. In contrast, these results are not at all what one would have expected based on the theory of intelligence that Shalizi advocates. According to Shalizi’s model, g factors reflect only the average or sum of the particular abilities called for by a given test battery, with batteries comprising different tests therefore almost always yielding different g factors. (I have more to say about Shalizi’s preferred theory later in this post.) The omission of Johnson et al. 2004 and other CFA studies of intelligence (such as the joint CFA of the CAS and WJ-III tests discussed earlier) from Shalizi’s sources is a conspicuous failing.
Behavioral genetic evidence
It has been established beyond any dispute that cognitive abilities are heritable. (Shalizi has some quite wrong ideas on this topic, too, but I will not discuss them in this post.) What is interesting is that the degree of heritability of a given ability test depends on its g loading: the higher the g loading, the higher the heritability. A meta-analysis of the correlations between g loadings and heritabilities even suggested that the true correlation is 1.0, i.e., g loadings appear to represent a pure index of the extent of genetic influence on cognitive variation (see Rushton & Jensen 2010).
Moreover, quantitative genetic analyses indicate that g is an even stronger genetic variable than it is a phenotypic variable. I quote from Plomin & Spinath 2004:
Multivariate genetic analysis yields a statistic called genetic correlation, which is an index of the extent to which genetic effects on one trait correlate with genetic effects on another trait independent of the heritability of the two traits. That is, two traits could be highly heritable but the genetic correlation between them could be zero. Conversely, two traits could be only modestly heritable but the genetic correlation between them could be 1.0, indicating that even though genetic effects are not strong (because heritability is modest) the same genetic effects are involved in both traits. In the case of specific cognitive abilities that are moderately heritable, multivariate genetic analyses have consistently found that genetic correlations are very high—close to 1.0 (Petrill 1997). That is, although Spearman’s g is a phenotypic construct, g is even stronger genetically. These multivariate genetic results predict that when genes are found that are associated with one cognitive ability, such as spatial ability, they will also be associated just as strongly with other cognitive abilities, such as verbal ability or memory. Conversely, attempts to find genes for specific cognitive abilities independent of general cognitive ability are unlikely to succeed because what is in common among cognitive abilities is largely genetic and what is independent is largely environmental.
Thus behavior genetic findings support the existence of g as a genetically rooted dimension of human differences.
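The distinction between heritability and genetic correlation drawn in the passage quoted above can be made concrete with the standard bivariate decomposition of a phenotypic correlation: the genetic contribution is the genetic correlation weighted by the square roots of the two heritabilities, and the environmental contribution is weighted by the square roots of the remaining variance shares. The sketch below is a minimal illustration with hypothetical values, and the function name is my own.

```python
import math

def phenotypic_correlation(h2_x, h2_y, r_genetic, r_environmental):
    """Split a phenotypic correlation into its genetic and environmental
    parts, given the two heritabilities and the two correlations.
    Returns (genetic part, environmental part, total)."""
    genetic_part = math.sqrt(h2_x * h2_y) * r_genetic
    environmental_part = (
        math.sqrt((1.0 - h2_x) * (1.0 - h2_y)) * r_environmental
    )
    return genetic_part, environmental_part, genetic_part + environmental_part

# Modest heritability but identical genetic influences (r_G = 1.0).
print(phenotypic_correlation(0.4, 0.4, 1.0, 0.0))
# High heritability but entirely different genetic influences (r_G = 0.0).
print(phenotypic_correlation(0.8, 0.8, 0.0, 0.0))
```

In the first hypothetical case two modestly heritable traits correlate at about 0.4 purely for genetic reasons. In the second, two highly heritable traits do not correlate at all.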
The sine qua non of IQ tests is that they reveal and predict current and future real-world capabilities. IQ is the best single predictor of academic and job performance and attainment, and one of the best predictors of a plethora of other outcomes, from income, welfare dependency, and criminality (Gottfredson 1997) to health and mortality and scientific and literary creativity (Robertson et al. 2010), and any number of other things, including even investing success (Grinblatt et al. 2011). If you had to predict the life outcomes of a teenager based on only one fact about them, nothing would be nearly as informative as their IQ.
One interesting thing about the predictive validity of a cognitive test is that it is directly related to the test’s g loading. The higher the g loading, the better the validity. In fact, although the g factor generally accounts for less than half of all the variance in a given IQ battery, a lot of research indicates that it accounts for almost all of the predictive validity. The best evidence here comes from several large-scale studies of US Air Force personnel. These studies contrasted g and a number of more specific abilities as predictors of performance in Air Force training (Ree & Earles 1991) and jobs (Ree et al. 1994). The results indicated that g is the best predictor of training and job performance across all specialties, and that specific ability tests tailored for each specialty provide little or no incremental validity over g. Thus if you wanted to predict someone’s performance in training or a job, it would be much more useful for you to get their general mental ability score rather than scores on any specific ability tests that are closely matched to the task at hand. This appears to be true in all jobs (Schmidt & Hunter 1998, 2004), although specific ability scores may provide substantial incremental validity in the case of high-IQ individuals (Robertson et al. 2010), which is in accord with Charles Spearman’s view that abilities become more differentiated at higher levels of g. (This is why it makes sense for selective colleges to use admission tests that assess different abilities.)
For more evidence of how general the predictive validity of g is, we can look at the validity of g as a predictor of performance in GCSEs, which are academic qualifications awarded in different school subjects at age 14 to 16 in the United Kingdom. Deary et al. 2007 conducted a prospective study with a very large sample where g was measured at age 11 and GCSEs were obtained about five years later. The g scores correlated positively and substantially with the results of all 25 GCSEs, explaining (to give some examples) about 59 percent of individual differences in math, about 40 to 50 percent in English and foreign languages, and, at the low end, about 18 percent in Art and Design. In contrast, verbal ability, independently of g, explained an average of only 2.2 percent (range 0.0-7.2%) of the results in the 25 exams.
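Percentages of "individual differences explained" are squared correlations, so they can look smaller than the underlying relationship really is. As a quick sketch, assuming a simple linear relation between g and each exam result, the percentages quoted above can be converted back into correlations. The helper name is my own.

```python
import math

def correlation_from_variance_explained(proportion):
    """Magnitude of the correlation corresponding to a given proportion
    of variance explained in a simple linear relation."""
    if not 0.0 <= proportion <= 1.0:
        raise ValueError("proportion must lie between 0 and 1")
    return math.sqrt(proportion)

for label, proportion in [("math", 0.59), ("Art and Design", 0.18)]:
    print(label, round(correlation_from_variance_explained(proportion), 2))
```

On this reading, 59 percent of variance corresponds to a correlation of about 0.77, and even the 18 percent figure for Art and Design corresponds to a correlation of about 0.42.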
Arthur Jensen referred to g as the “active ingredient” of IQ tests, because g accounts for most if not all of the predictive validity of IQ even though most variance in IQ tests is not g variance. From the perspective of predictive validity, non-g variance seems to be generally just noise. In other words, if you statistically remove g variance from IQ test results, what is left is almost useless for the purposes of predicting behavior (except among high-IQ individuals, as noted above). This is a very surprising finding if you think, like Shalizi, that different mental abilities are actually independent, and g is just an uninteresting statistical artifact caused by an occasional recruitment of many uncorrelated abilities for the same task (more on this view of Shalizi’s below).
Hollowness of IQ training effects
Another interesting fact about g is that there is a systematic relation between g loadings and practice effects in IQ tests. A meta-analysis of re-testing effects on IQ scores showed that there is a perfect negative correlation between score gains and g loadings of tests (te Nijenhuis et al. 2007). It appears that specific abilities are trainable but g is generally not (see also Arendasy & Sommer 2013). Similarly, a recent meta-analysis of the effects of working memory training on intelligence showed, in line with many earlier reviews, that cognitive training produces short-term gains in the specific abilities trained, but no “far transfer” to any other abilities (Melby-Lervåg & Hulme 2013). Jensen called such gains hollow because they do not seem to represent actual improvements in real-world intellectual performance. These findings are consistent with the view that g is a “central processing unit” that cannot be defined in terms of specific abilities and is not affected by changes in those abilities.
Chabris 2007 pointed out that findings in neurobiology “establish a biological basis for g that is firmer than that of any other human psychological trait”. This is a far cry from Shalizi’s claim that nothing has been done to investigate g beyond the fact of positive correlations between tests. There are a number of well-replicated, small to moderate correlations between g and features of brain physiology, including brain size, the volumes of white and grey matter, and nerve conduction velocity (ibid.; Deary et al. 2010). Currently, we do not have a well-validated model of “neuro-g”, but certainly the findings so far are consistent with a central role for g in intelligence.
IV. Shalizi’s third error
Besides his misconception that the positive manifold is an artifact of test construction and his disregard for evidence showing that g is a central variable in human affairs, there is a third reason why Shalizi believes that the g factor is a “myth”. It is his conviction that correlations between cognitive tests are best explained in terms of the so-called sampling model. This model holds that there are a large number of uncorrelated abilities (or other “neural elements”) and that correlations between tests emerge because all tests measure many different abilities at the same time, with some subset of the abilities being common to all tests in a given battery. Thus, according to Shalizi, there is no general factor of intelligence, but only the appearance of one due to each test tapping into some of the same abilities. Moreover, Shalizi’s model suggests that g factors from different batteries are dissimilar, reflecting only the particular abilities sampled by each battery. The sampling model is illustrated in the following figure (from Jensen 1998, p. 118):
The sampling model can be contrasted with models based on the idea that g is a unitary capacity that contributes to all cognitive efforts, reflecting some general property of the brain. For example, Arthur Jensen hypothesized that g is equivalent to mental speed or efficiency. In Jensen’s model, there are specific abilities, but all of them depend, to a smaller or greater degree, on the overall speed or efficiency of the brain. In contrast, in the sampling model there are only specific abilities, overlapping samples of which are recruited for each cognitive task. Statistically, both models are equally able to account for empirically observed correlations between cognitive tests (see Bartholomew et al. 2009).
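To make the sampling model concrete, here is a minimal simulation sketch in Python. It is an illustration only, not code from Shalizi, Jensen, or Bartholomew et al. The function names, the number of elements, and the number of elements per test are all arbitrary assumptions. Each simulated person has a set of mutually independent "elements", and each test score is the unweighted sum of a random subset of them. Under these assumptions, the expected correlation between two tests is the number of shared elements divided by the square root of the product of the two test sizes.

```python
import numpy as np


def simulate_sampling_model(n_people=5000, n_elements=200, n_tests=6,
                            elements_per_test=80, seed=0):
    """Simulate test scores as sums of independent elements (all values hypothetical)."""
    rng = np.random.default_rng(seed)
    # Independent unit-variance elements: uncorrelated by construction.
    elements = rng.standard_normal((n_people, n_elements))
    # membership[t, e] is True when test t draws on element e.
    membership = np.zeros((n_tests, n_elements), dtype=bool)
    for t in range(n_tests):
        chosen = rng.choice(n_elements, size=elements_per_test, replace=False)
        membership[t, chosen] = True
    # Each test score is the unweighted sum of the elements the test samples.
    scores = elements @ membership.T.astype(float)
    return scores, membership


def expected_correlations(membership):
    """Overlap-based correlations: shared / sqrt(size_a * size_b)."""
    m = membership.astype(float)
    overlap = m @ m.T                      # shared element counts
    sizes = np.sqrt(np.diag(overlap))      # sqrt of each test's element count
    return overlap / np.outer(sizes, sizes)


scores, membership = simulate_sampling_model()
observed = np.corrcoef(scores, rowvar=False)
expected = expected_correlations(membership)

# Largest eigenvalue of the correlation matrix as a rough stand-in for a
# general factor (a principal-component shortcut, not a full factor analysis).
eigenvalues = np.linalg.eigvalsh(observed)[::-1]
first_share = eigenvalues[0] / eigenvalues.sum()

print(np.round(observed, 2))
print("largest deviation from overlap formula:",
      round(float(np.abs(observed - expected).max()), 3))
print("share of variance on first component:", round(float(first_share), 2))
```

A run of this sketch shows only that overlapping samples of uncorrelated elements can reproduce a positive manifold and a dominant first component. It says nothing about which model is the better description of real test data.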
There are many flaws in Shalizi’s argument. Firstly, the sampling model has several empirical problems which he ignores. I quote from Jensen 1998, pp. 120–121:
But there are other facts the overlapping elements theory cannot adequately explain. One such question is why a small number of certain kinds of nonverbal tests with minimal informational content, such as the Raven matrices, tend to have the highest g loadings, and why they correlate so highly with content-loaded tests such as vocabulary, which surely would seem to tap a largely different pool of neural elements. Another puzzle in terms of sampling theory is that tests such as forward and backward digit span memory, which must tap many common elements, are not as highly correlated as are, for instance, vocabulary and block designs, which would seem to have few elements in common. Of course, one could argue trivially in a circular fashion that a higher correlation means more elements in common, even though the theory can’t tell us why seemingly very different tests have many elements in common and seemingly similar tests have relatively few.
And how would sampling theory explain the finding that choice reaction time is more highly correlated with scores on a nonspeeded vocabulary test than with scores on a test of clerical checking speed?
Perhaps the most problematic test of overlapping neural elements posited by the sampling theory would be to find two (or more) abilities, say, A and B, that are highly correlated in the general population, and then find some individuals in whom ability A is severely impaired without there being any impairment of ability B. For example, looking back at Figure 5.2 [see above], which illustrates sampling theory, we see a large area of overlap between the elements in Test A and the elements in Test B. But if many of the elements in A are eliminated, some of its elements that are shared with the correlated Test B will also be eliminated, and so performance on Test B (and also on Test C in this diagram) will be diminished accordingly. Yet it has been noted that there are cases of extreme impairment in a particular ability due to brain damage, or sensory deprivation due to blindness or deafness, or a failure in development of a certain ability due to certain chromosomal anomalies, without any sign of a corresponding deficit in other highly correlated abilities. On this point, behavioral geneticists Willerman and Bailey comment: “Correlations between phenotypically different mental tests may arise, not because of any causal connection among the mental elements required for correct solutions or because of the physical sharing of neural tissue, but because each test in part requires the same ‘qualities’ of brain for successful performance. For example, the efficiency of neural conduction or the extent of neuronal arborization may be correlated in different parts of the brain because of a similar epigenetic matrix, not because of concurrent functional overlap.” A simple analogy to this would be two independent electric motors (analogous to specific brain functions) that perform different functions both running off the same battery (analogous to g). 
As the battery runs down, both motors slow down at the same rate in performing their functions, which are thus perfectly correlated although the motors themselves have no parts in common. But a malfunction of one machine would have no effect on the other machine, although a sampling theory would have predicted impaired performance for both machines.
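The differing predictions in Jensen's argument can be spelled out in a small Python sketch. This is an illustration under stated assumptions, not an analysis of real data: the group sizes, the number of elements, and the size of the deficit are arbitrary, and the assumption that a lesion in the battery model hits only one motor's own machinery is built in by construction. In both models the two scores correlate at about 0.5 among unimpaired people; the models differ only in what happens to the second score when the first ability is damaged.

```python
import numpy as np

rng = np.random.default_rng(1)
n_people, n_shared, n_unique = 20_000, 10, 10
lesioned = np.arange(n_people) < 1_000   # hypothetical impaired subgroup
deficit = 0.5                            # loss per damaged element (arbitrary)

# Sampling model: each test is a sum of independent elements, half of them shared.
shared = rng.standard_normal((n_people, n_shared))
only_a = rng.standard_normal((n_people, n_unique))
only_b = rng.standard_normal((n_people, n_unique))
# The impairment damages every element test A draws on, shared ones included.
shared_hit = shared - deficit * lesioned[:, None]
only_a_hit = only_a - deficit * lesioned[:, None]
sampling_a = shared_hit.sum(axis=1) + only_a_hit.sum(axis=1)
sampling_b = shared_hit.sum(axis=1) + only_b.sum(axis=1)

# Battery model: two motors with no parts in common run off one battery.
battery = rng.standard_normal(n_people)
# The impairment damages motor A's own machinery only (an assumption).
motor_a = battery + rng.standard_normal(n_people) - 20 * deficit * lesioned
motor_b = battery + rng.standard_normal(n_people)


def group_gap(score):
    """Mean score of the impaired group minus mean score of everyone else."""
    return float(score[lesioned].mean() - score[~lesioned].mean())


def healthy_corr(x, y):
    """Correlation between two scores among unimpaired people only."""
    return float(np.corrcoef(x[~lesioned], y[~lesioned])[0, 1])


print("sampling model, corr(A, B):", round(healthy_corr(sampling_a, sampling_b), 2))
print("battery model,  corr(A, B):", round(healthy_corr(motor_a, motor_b), 2))
print("sampling model, gap on B:", round(group_gap(sampling_b), 2))
print("battery model,  gap on B:", round(group_gap(motor_b), 2))
```

In the sampling sketch the impaired group falls behind on test B as well, because the damaged elements are shared. In the battery sketch the second motor is untouched. The empirical question Jensen raises is which of these two patterns the clinical cases resemble.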
But the fact that the sampling model has empirical shortcomings is not the biggest flaw in Shalizi’s argument. The most serious problem is that he mistakenly believes that if the sampling model is deemed to be the correct description of the workings of intelligence, it means that there can be no general factor of intelligence. This inference is unwarranted and is based on a confusion of different levels of analysis. The question of whether or not there is a unidimensional scale of intelligence along which individuals can be arranged is independent of the question of what the neurobiological substrate of intelligence is like. Indeed, at a sufficiently basal (neurological, molecular, etc.) level, intelligence necessarily becomes fractionated, but that does not mean that there is no general factor of intelligence at the behavioral level. As explained above, many types of evidence show that g is indeed a centrally important unidimensional source of behavioral differences between individuals. One can compare this to a phenotype like height, which is simply a linear combination of the lengths of a number of different bones, yet at the same time unmistakably represents a unidimensional phenotype on which individuals differ, and which can, among other things, also be a target for natural selection.
While he rejected the sampling model, Arthur Jensen noted that sampling represents an alternative model of g rather than a refutation thereof. This is because of the many lines of evidence showing that there is indeed a robust general factor of intellectual behavior. It is undoubtedly possible, with appropriate modifications, to devise a version of the sampling theory to account for all the empirical facts about g. However, this would mean that those uncorrelated abilities that are shared between all tests would have to show great invariance and permanence between different test batteries as well as be largely impervious to training effects, and they would also have to explain almost all of the practical validity and heritability of psychometric intelligence. Thus preferring the sampling model to a unitary g model is, in many ways, a distinction without a difference. The upshot is that regardless of whether “neuro-g” is unitary or the result of sampling, people differ on a highly important, genetically-based dimension of cognition that we may call general intelligence. Sampling does not disprove g. (The same applies to “mutualism”, a third model of g introduced in van der Maas et al. 2006, so I will not discuss it in this post.)
Shalizi’s first error is his assertion that cognitive tests correlate with each other because IQ test makers exclude tests that do not fit the positive manifold. In fact, more or less the opposite is true. Some of the greatest psychometricians have devoted their careers to disproving the positive manifold only to end up with nothing to show for it. Cognitive tests correlate because all of them truly share one or more sources of variance. This is a fact that any theory of intelligence must grapple with.
Shalizi’s second error is to disregard the large body of evidence that has been presented in support of g as a unidimensional scale of human psychological differences. The g factor is not just about the positive manifold. A broad network of findings related to both social and biological variables indicates that people do in fact vary, both phenotypically and genetically, along this continuum that can be revealed by psychometric tests of intelligence and that has widespread significance in human affairs.
Shalizi’s third error is to think that were it shown that g is not a unitary variable neurobiologically, it would refute the concept of g. However, for most purposes, brain physiology is not the most relevant level of analysis of human intelligence. What matters is that g is a remarkably powerful and robust variable that has great explanatory force in understanding human behavior. Thus g exists at the behavioral level regardless of what its neurobiological underpinnings are like.
In many ways, criticisms of g like Shalizi’s amount to “sure, it works in practice, but I don’t think it works in theory”. Shalizi faults g for being a “black box theory” that does not provide a mechanistic explanation of the workings of intelligence, disparaging psychometric measurement of intelligence as a mere “stop-gap” rather than a genuine scientific breakthrough. However, the fact that psychometricians have traditionally been primarily interested in validity and reliability is a feature, not a bug. Intelligence testing, unlike most fields of psychology and social science, is highly practical, being widely applied to diagnose learning problems and medical conditions and to select students and employees. What is important is that IQ tests reliably measure an important human characteristic, not the particular underlying neurobiological mechanisms. Nevertheless, research on general mental ability extends naturally into the life sciences, and continuous progress is being made in understanding g in terms of neurobiology (e.g., Lee et al. 2012, Penke et al. 2012, Kievit et al. 2012) and molecular genetics (e.g., Plomin et al., in press, Benyamin et al., in press).
P.S. See some of my further thoughts on these issues here.
Alliger, George M. (1988). Do Zero Correlations Really Exist among Measures of Different Intellectual Abilities? Educational and Psychological Measurement, 48, 275–280.
Arendasy, Martin E., & Sommer, Marcus (2013). Quantitative differences in retest effects across different methods used to construct alternate test forms. Intelligence, 41, 181–192.
Bartholomew, David J. et al. (2009). A new lease of life for Thomson’s bonds model of intelligence. Psychological Review, 116, 567–579.
Benyamin, Beben et al. (in press). Childhood intelligence is heritable, highly polygenic and associated with FNBP1L. Molecular Psychiatry.
Brody, Nathan (2003a). Construct validation of the Sternberg Triarchic Abilities Test. Comment and reanalysis. Intelligence, 31, 319–329.
Carroll, John B. (1995). Theoretical and Technical Issues in Identifying a Factor of General Intelligence. In Devlin, Bernard et al. (Eds.), Intelligence, Genes, and Success: Scientists Respond to The Bell Curve. New York, NY: Springer.
Carroll, John B. (2003). The higher-stratum structure of cognitive abilities: Current evidence supports g and about ten broad factors. In Nyborg, Helmuth (Ed.), The scientific study of general intelligence: Tribute to Arthur R. Jensen. Oxford, UK: Elsevier Science/Pergamon Press.
Chabris, Christopher F. (2007). Cognitive and Neurobiological Mechanisms of the Law of General Intelligence. In Roberts, M. J. (Ed.), Integrating the mind: Domain general versus domain specific processes in higher cognition. Hove, UK: Psychology Press.
Keith, Timothy Z. et al. (2001). What Does the Cognitive Assessment System (CAS) Measure? Joint Confirmatory Factor Analysis of the CAS and the Woodcock-Johnson Tests of Cognitive Ability (3rd Edition). School Psychology Review, 30, 89–119.
Robertson, Kimberley F. et al. (2010). Beyond the Threshold Hypothesis: Even Among the Gifted and Top Math/Science Graduate Students, Cognitive Abilities, Vocational Interests, and Lifestyle Preferences Matter for Career Choice, Performance, and Persistence. Current Directions in Psychological Science, 19, 346–351.
Schmidt, Frank L., & Hunter, John E. (1998). The Validity and Utility of Selection Methods in Personnel Psychology: Practical and Theoretical Implications of 85 Years of Research Findings. Psychological Bulletin, 124, 262–274.