Standardized tests used as part of academic admissions decisions have many critics. Many students disdain the tests they are required to take when applying to academic institutions and are convinced that the tests measure nothing useful. Moreover, scholarly critiques include accusations that tests such as the SAT primarily measure students’ family wealth rather than true academic ability, and that they contribute to gender and ethnic inequities in access to higher education (Guinier, 2015). Concerns about the potential disadvantages of using standardized tests to inform admissions decisions are long-standing, and hundreds of colleges no longer require applicants to submit test scores (Peligri, 2014). Despite occasional ideological vitriol, admissions tests do predict academic performance. This simply means that scores on standardized tests correlate positively with later academic success. Rather than dismissing the broad evidence for this correlation, discussions about the appropriateness of using standardized tests might more productively revolve around the strength of the association and whether the potential disadvantages of using tests to inform admissions decisions outweigh the advantages.
To evaluate the predictive validity of a test – the degree to which a test predicts performance on some other variable – researchers examine the correlation between test scores and measures of subsequent performance such as college grade point average (GPA). If the test scores correlate significantly with later performance, researchers conclude that the test has some criterion-related validity. That is, the test can be used to predict subsequent performance on some criterion, which is usually academic grades. In practice, however, the interpretation of such correlations often becomes more complicated. Correlations indicate associations between variables for groups of people. Many people know of someone who did relatively poorly on college admission tests but did very well in college, and someone else who did well on the exams but did poorly in college. Such anecdotes do not negate the correlation – they simply reflect the reality that the correlation will always be far less than perfect. Doubting the association because of noteworthy exceptions is no different than assuming that smoking poses no health hazard because some smokers do not become ill. Few people would make such an argument.
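To make the idea of a criterion-related validity coefficient concrete, the brief sketch below computes a Pearson correlation between test scores and first-year grades. The numbers are invented for illustration and are not drawn from any study cited in this chapter.

```python
# Minimal sketch with invented data: a criterion-related validity coefficient
# is simply the Pearson correlation between test scores and a later criterion.
import numpy as np

test_scores = np.array([1050, 1180, 1300, 1420, 980, 1250, 1100, 1360])  # hypothetical scores
first_year_gpa = np.array([2.7, 3.1, 3.4, 3.8, 2.5, 3.0, 3.2, 3.6])      # hypothetical GPAs

r = np.corrcoef(test_scores, first_year_gpa)[0, 1]
print(f"criterion-related validity coefficient r = {r:.2f}")
```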
Over the past 80 years or so, researchers have conducted hundreds of studies examining the validity of standardized admissions tests for predicting subsequent academic performance in undergraduate and graduate programs. It is impossible to cover them all in a single chapter, but it is useful to examine findings based on large data sets and meta-analyses. Researchers use meta-analysis to integrate the findings from many existing studies in order to overcome some of the limitations of the individual studies and draw more reliable conclusions. The findings from meta-analyses and large-scale studies on the association between standardized tests and subsequent academic performance are remarkably consistent: nearly all large studies reveal correlations that are significant, positive, and modest in magnitude. This pattern holds for tests used to predict undergraduate academic performance, as well as those used to inform decisions on graduate school admissions.
By far the most commonly used criterion in studies examining the validity of academic admissions tests is GPA in higher education programs. Researchers typically focus on the correlation between standardized test scores and the grades students achieve during their first year of college or graduate school. Correlation coefficients can range from -1.0 to +1.0, but validity coefficients of this nature are reported as positive decimals between 0 and 1 – with a coefficient of zero indicating no association between the variables and a coefficient of 1.0 indicating a perfect positive association. As one example, Ramist, Lewis, and McCamley-Jenkins (1994) conducted a meta-analysis using data from more than 46,000 students who completed the SAT and attended one of 45 colleges in the 1980s. The researchers reported an overall correlation of .36 between SAT scores and first-year GPA. All large-scale SAT validity studies reveal correlations of similar magnitude.
The SAT has undergone many revisions over the years. One major revision took effect in 1994. The verbal portion of the exam changed, with the elimination of antonym items and the addition of items requiring students to evaluate different points of view. The time allotted to complete the test was also extended by 15 minutes so that students could complete more items. Each time a test is revised, new questions arise concerning its utility, and new validity studies must be conducted. Bridgeman, McCamley-Jenkins, and Ervin (2000) compared the predictive validity of the revised SAT to that of the previous version of the test. The researchers compared data from two large samples: one whose members had taken the SAT in 1994 prior to the revisions, and the other consisting of students who had taken the SAT in 1995 after the revisions had been implemented. There were more than 45,000 students in each sample. Bridgeman and colleagues found that the predictive validity of the SAT had remained essentially unchanged. The correlation between SAT scores and first-year college GPA was .35 for the new version of the SAT, compared with .34 for the prior version. Despite some noteworthy changes to the test, its association with college GPA was quite stable.
Another major revision of the SAT went into effect in 2005. A writing section was added – something that had never before been part of the test. The College Board, which publishes the SAT, commissioned a study (Kobrin, Patterson, Shaw, Mattern, & Barbuti, 2008) to evaluate the validity of the revised and expanded test. Kobrin and colleagues analyzed data from a sample of more than 151,000 students from more than 100 colleges and universities. The researchers reported a correlation of .35 between a composite test score that included all three sections of the SAT – critical reading, math, and writing – and first-year college GPA. Even with the addition of a major new task, the correlation between the test and college performance was equivalent to the correlations from studies evaluating earlier versions of the test. Interestingly, the newly added writing subtest by itself predicted first-year college performance nearly as well as the test as a whole, correlating .33 with first-year GPA.
It is apparent that across large samples of students, SAT scores predict first-year college performance. Camara and Echternacht (2000) explained why first-year GPA is the criterion most commonly used in SAT validity studies. They noted that courses at the freshman level are more similar to each other in difficulty than upper-level courses, so first-year courses provide a more reliable validity criterion. Furthermore, the largest available data sets focus on first-year GPA, and first-year GPA is highly correlated with later cumulative GPA. Camara and Echternacht also cite potential problems with using cumulative GPA in validity studies of college admission tests. For instance, because upper-level courses vary more in difficulty, the correlation between pre-college tests and academic performance becomes suppressed when students do not pursue equally difficult courses. However, Wilson (1983) reviewed all known studies conducted between 1930 and 1980 in which the SAT was used to predict cumulative college GPA. He concluded that standardized admissions tests are just as valid for predicting cumulative GPA as they are for predicting first-year GPA. Burton and Ramist (2001) reviewed studies that were not part of Wilson’s review because they were conducted after 1980. Analyzing data from more than 30,000 students, the researchers reported a correlation of .36 between SAT scores and cumulative GPA at graduation – the same level of association observed in studies of first-year grades.
It is important to note that most of the large-scale studies of SAT validity were commissioned by the College Board, which also publishes the test. This is somewhat understandable since the College Board has access to large data sets and also has a vested interest in demonstrating the validity of the SAT. However, the results from studies conducted by the College Board and by independent researchers are quite consistent. As one example, Geiser and Studley (2002) analyzed data from nearly 78,000 students entering the University of California between 1996 and 1999. For these students the correlation between SAT scores and first-year GPA was .36 – equivalent to correlations observed in other studies. It is possible to locate studies of smaller and more select student samples where the predictive validity of the SAT appears more equivocal. However, meta-analyses and studies with very large and diverse samples produce far more reliable and generalizable results (Sackett, Borneman, & Connelly, 2008). For reasons addressed in more detail later in this chapter, SAT validity coefficients are particularly vulnerable to suppression when researchers use restricted samples.
The conclusions summarized above concerning the predictive validity of standardized tests are not limited to tests used in undergraduate admissions. Kuncel, Hezlett, and Ones (2001) conducted a large meta-analysis of studies of the Graduate Record Examination (GRE). They analyzed data from more than 82,000 graduate students from nearly 1,800 separate research samples. The correlation between GRE scores and graduate school GPA was very similar to SAT validity coefficients. Correlations between GRE scores and graduate GPA ranged from .32 to .36 for different subsections – verbal, quantitative, and analytical – of the GRE. Moreover, GRE scores correlated more highly than undergraduate GPA with both graduate GPA and scores on comprehensive examinations in graduate school. A smaller meta-analysis (Kuncel, Wee, Serafin, & Hezlett, 2010) revealed somewhat smaller GRE validity coefficients, but demonstrated that GRE scores predict both first-year and cumulative GPA for both master’s and doctoral students.
Julian (2005) analyzed data from more than 4,000 medical students and found that scores on the Medical College Admission Test (MCAT) correlated .44 with cumulative medical school GPA. In a large meta-analysis of more than 65,000 students who took the Graduate Management Admission Test (GMAT), test scores correlated .32 with first-year GPA and .31 with cumulative GPA in graduate business school (Kuncel, Crede, & Thomas, 2007). Finally, in a large meta-analysis of more than 90,000 law students, scores on the Law School Admission Test (LSAT) correlated .38 with first-year law school grades (Linn & Hastings, 1984). On all three of these tests – the MCAT, GMAT, and LSAT – the validity of test scores for predicting graduate GPA surpassed the predictive validity of undergraduate GPA. In a recent synthesis of meta-analyses of graduate admissions exams, Kuncel and Hezlett (2007) conclude: “For all tests across all relevant success measures, standardized test scores are positively related to subsequent measures of student success” (pp. 1080–1081).
Although the positive association between standardized test scores and later academic performance is remarkably consistent across studies of both undergraduate and graduate performance, it is also consistently modest in magnitude. Critics of standardized academic admissions tests often cite the rather modest correlations as reason to question the validity and utility of such tests. However, researchers have offered several explanations for why the observed correlations are not higher. Foremost among these is the issue of range restriction.
Researchers evaluating test validity must use correlation coefficients that show the association between two continuous variables – such as test scores and GPA. To accurately reveal a correlation, a data set must contain a full range of scores on both variables. If the range is restricted, the correlation becomes suppressed. For instance, if one could measure both the jumping ability and basketball skill of every person in the United States, the data set would contain a wide range of each of the abilities. There would be some people whose jumping ability was extremely limited and others whose jumping ability would approach world-record levels – with everyone else falling somewhere in between. The same would be true for basketball skill. Given the range of data and the fact that jumping ability provides an advantage in basketball, the correlation between the two variables would likely be quite high. However, the correlation between the same two variables in a data set consisting only of professional basketball players would be much weaker – not because jumping ability is less important among elite basketball players, but because the range of the data set would be greatly restricted. With so little variability on both measures, the correlation coefficient would be artificially suppressed.
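A small simulation makes the effect of range restriction concrete. The sketch below generates two variables with a known population correlation and then recomputes the correlation after keeping only the highest scorers on one of them; the specific values are illustrative assumptions, not data from any study discussed in this chapter.

```python
# Illustrative simulation of range restriction: selecting only high scorers
# on one variable suppresses the observed correlation between two variables.
import numpy as np

rng = np.random.default_rng(0)
n, true_r = 100_000, 0.60

# Two standardized variables with a population correlation of about .60
# (think jumping ability and basketball skill, or test scores and grades).
x = rng.standard_normal(n)
y = true_r * x + np.sqrt(1 - true_r**2) * rng.standard_normal(n)

full_r = np.corrcoef(x, y)[0, 1]

# Keep only the top 5% on x, mimicking a highly selective group.
selected = x > np.quantile(x, 0.95)
restricted_r = np.corrcoef(x[selected], y[selected])[0, 1]

print(f"full range:        r = {full_r:.2f}")        # close to .60
print(f"restricted sample: r = {restricted_r:.2f}")  # substantially lower
```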
Range restriction is a common concern among researchers who study the predictive validity of standardized tests (e.g., Burton & Ramist, 2001; Sackett et al., 2008). When test scores are correlated with academic performance, the data sets generally do not contain a full range of scores. For example, among the more than 1 million students who take the SAT each year, many will never attend college. Although this occurs for many reasons, the students who ultimately do not attend college are disproportionately those with low SAT scores. Since these students cannot be included in validity studies, the range in available data sets is restricted. The range becomes further restricted because students are admitted to higher education based in part on their test scores. Therefore, the range of scores for students at any particular school – especially elite schools – will be further limited. If the students at a particular school were drawn randomly from the population of SAT takers, the correlation between test scores and academic performance would be higher.
A second factor that suppresses the correlations between test scores and academic outcomes has to do with the reliability of the measures used to evaluate test validity – typically subsequent course grades. Reliability is simply another name for the consistency with which something is measured. A particular course grade can have different meanings in different contexts. Since any particular grade depends not only on student learning and performance, but also on institution and instructor standards, the reliability of course grades tends to be low. When the outcome measure has low reliability, the correlation with other variables such as test scores is further reduced. There are several reasons why grades are unreliable. First, college courses vary widely in difficulty. Sackett and colleagues (2008) explain that two students with the same level of ability may earn different grades because of the courses they choose. Accordingly, the GPAs that show up in data sets are unreliable in that they do not control for course difficulty. Whereas admission test scores are standardized in the sense that everyone is assessed on the same scale, college and graduate school grades are far more contingent on the difficulty of chosen courses and programs, and on the grading idiosyncrasies of individual instructors. Further, students with low SAT scores tend to choose different courses and majors than students with high SAT scores (Bridgeman et al., 2000). This inconsistency serves to reduce observed validity coefficients.
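The impact of an unreliable criterion can be expressed with the classical attenuation formula from test theory; the reliability values in the example that follows are illustrative assumptions, not estimates reported in the studies cited here:

$$ r_{\text{observed}} = r_{\text{true}}\sqrt{\rho_{xx}\,\rho_{yy}} $$

where $\rho_{xx}$ and $\rho_{yy}$ are the reliabilities of the test and of the grade criterion. If, for example, the true association were .55, the reliability of the test .90, and the reliability of GPA only .60, the observed correlation would fall to roughly $.55\sqrt{.90 \times .60} \approx .40$.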
Fortunately, researchers can statistically correct for problems such as range restriction and inconsistency in grading to obtain estimates of what the validity coefficients would be without such limitations. When researchers apply this strategy, the predictive validity of admissions tests looks much more impressive. For example, Ramist and colleagues (1994) arrived at a correlation of .57 between SAT scores and first-year GPA after correcting for range restriction and the unreliability of college grades. Similarly, when Bridgeman et al. (2000) corrected for range restriction and course difficulty, the SAT validity coefficient increased to .56. Correcting for range restriction only, Kobrin et al. (2008) arrived at an estimated SAT validity coefficient of .53, and Julian (2005) estimated the corrected validity of the MCAT to be .59.
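The exact corrections differ across the studies cited above, but the widely used Thorndike Case II adjustment for range restriction gives a sense of the arithmetic involved; the standard-deviation ratio in the example below is an illustrative assumption rather than a value taken from those studies.

```python
# Sketch of the Thorndike Case II correction for range restriction.
# The studies cited in this chapter may apply different or additional corrections.
import math

def correct_for_range_restriction(r_restricted: float, sd_ratio: float) -> float:
    """Estimate the unrestricted correlation from the restricted-sample r and the
    ratio of the unrestricted to the restricted predictor standard deviation."""
    rk = r_restricted * sd_ratio
    return rk / math.sqrt(1 - r_restricted**2 + rk**2)

# An observed r of .35 with a population-to-sample SD ratio of 1.5
print(round(correct_for_range_restriction(0.35, 1.5), 2))  # about 0.49
```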
In one particularly comprehensive study of range restriction and course choice with respect to SAT validity, Berry and Sackett (2009) analyzed course-level data from more than 5 million grades earned by more than 168,000 students from 41 colleges. When the researchers corrected for range restriction at the national level – as if admitted students had been drawn randomly from all students who took the SAT – the estimated validity coefficient was .51. After controlling for course choice, the researchers concluded that typical validity studies underestimate the predictive validity of the SAT by 30–40%.
Although standardized academic admission tests do predict subsequent academic performance, there are many issues pertaining to the use of such tests that are not readily resolved by such validity data. For instance, correlations between test scores and later performance tend to be modest – although they increase notably when researchers correct for limitations in the data such as range restriction and unreliable grading criteria. Without such corrections, the validity coefficients for admissions tests hover around .35. This means that about 12% of the variation in students’ academic performance is associated with their admissions test performance. Obviously, test scores are just one of many variables that predict academic performance. Nonetheless, many researchers argue that even correlations at this level can enhance prediction of success in meaningful ways (Sackett et al., 2008; Kuncel & Hezlett, 2010).
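The 12% figure above comes from squaring the validity coefficient, which gives the proportion of variance in the criterion shared with the predictor:

$$ r^2 = (.35)^2 = .1225 \approx 12\% $$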
Another common criticism is that standardized test scores are a proxy measure of socioeconomic status (SES) (Guinier, 2015). Two recent studies involving very large data sets call this claim into question. In the first of these studies (Sackett, Kuncel, Arneson, Cooper, & Waters, 2009), researchers analyzed data from more than 155,000 students. They reported that the correlation between SAT scores and first-year GPA was .47 after correcting for range restriction. Controlling for student SES reduced this correlation only slightly to .44. In a follow-up meta-analysis of various college admissions tests, the researchers found that the validity coefficient was reduced from .37 to .36 after controlling for SES. The second study (Sackett et al., 2012) included two large data sets totaling more than 250,000 students. Again the correlation between SAT scores and college grades was barely affected by controlling for SES. In both studies, the authors concluded that controlling for SES does not reduce the predictive validity of the SAT in any meaningful way and therefore the SAT is not simply a measure of SES.
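The studies above relied on their own statistical models, but the basic logic of “controlling for SES” can be sketched with a partial correlation, which removes the portion of the test–GPA association attributable to a third variable:

$$ r_{TG\cdot S} = \frac{r_{TG} - r_{TS}\,r_{GS}}{\sqrt{(1 - r_{TS}^{2})(1 - r_{GS}^{2})}} $$

where $T$ is the test score, $G$ the subsequent GPA, and $S$ socioeconomic status. If test scores were merely a proxy for SES, partialling out SES would push the test–GPA correlation toward zero; the reported drops of only .47 to .44 and .37 to .36 indicate that this is not the case.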
Finally, concerns often arise that standardized tests are biased against minorities. For example, Freedle (2003) argues that the SAT is both statistically and culturally biased. As evidence, he notes the well-known discrepancies in average test scores across racial and ethnic groups, and states specifically that some items are differentially valid for Whites and African Americans. Although he acknowledges that such item-level differences are small, he argues for an alternative score calculation method, “to increase dramatically the number of minority individuals who might qualify for admission into our nation’s select colleges and universities” (p. 28). Importantly, Freedle provides no evidence of bias in criterion-related validity – differences in the degree to which the SAT predicts academic performance across groups – which is the most important concern among test developers and those using tests to predict academic performance. Research on this question shows that instead of underpredicting academic performance for minorities, standardized tests in fact tend to overpredict performance for minorities; that is, minority students’ subsequent grades tend to fall at or below, not above, the level predicted from their test scores (Kuncel & Hezlett, 2007; Sackett et al., 2008). Camara and Sathy (2004) cite numerous flaws in Freedle’s alternate scoring proposal, demonstrating that his method would result in a test that is far less valid for predicting college performance. Despite mean differences in scores, there is little evidence that college admission tests are differentially valid across ethnic groups (Fleming & Garcia, 1998).
Many critics of standardized academic admissions tests likely think of individuals – perhaps themselves or people they know – whose abilities they feel are not adequately revealed by standardized tests. They may not consider that using tests to screen large numbers of people is based on a different perspective. The question is whether knowing how students performed on a standardized test provides any information about their likely academic success. Psychological assessments have always been more prone to measurement error than the kinds of measurement used in the physical sciences. Furthermore, tests do not measure all of the personal characteristics that predict success in higher education (Burton & Ramist, 2001; Kuncel & Hezlett, 2007). Nonetheless, prediction of success is enhanced when test scores are considered. No predictor of success is ever going to approach perfect accuracy. Regardless of the criteria used to admit students, there will always be some candidates who could have been successful but who are not selected. Admissions tests have the advantage of being the only measure that is standardized across all applicants. Other admissions criteria such as past academic grades, personal statements, and letters of recommendation are vulnerable to many subjective biases. Standardized test scores predict academic success beyond what is predicted by prior grades alone, and researchers have consistently found that the best predictor is a combination of prior academic performance and standardized test scores (Ramist et al., 1994; Camara & Echternacht, 2000; Linn & Hastings, 1984; Julian, 2005; Kobrin et al., 2008; Berry & Sackett, 2009). Test scores certainly do not provide all the information that admissions officers need to know about candidates. Moreover, the use of test scores has pros and cons that merit ongoing debate in the specific institutional contexts in which they are used. However, it is inaccurate to assert that standardized admission tests are uncorrelated with academic performance.