Statistics are king in the world of education. No matter the complexity and variety of human thought and expression in all fields, from physics to poetry, we are compelled by our culture’s obsession with superficial notions of success to find numbers to define that success. This obsession goes even further in education, favoring drama-laden single numbers over a profile of measures, as in the Adequate Yearly Progress (AYP) scores for schools introduced under the current federal education policy, No Child Left Behind (NCLB).
AYP scores, and similar one-stop shopping indicators, are supposed to give a quick and clear window into the educational advancement of schools and demographic subgroups. The problem, as has been painfully clear for too long, is that the statistics and the tests they are derived from are seriously flawed.
States have gamed the system by having low-scoring students unavailable for testing or kicking students in clever ways, thereby increasing the likelihood of higher test scores. At the other end of the spectrum, schools that have scored near the top are penalized because, for all practical purposes, they have achieved excellence and have nowhere further to go. Unable to show further progress, they have been deemed failing schools.
These sorts of problems are indicators of flawed implementations of flawed policies. But they are also indicators of a more widespread problem: the misuse of statistics and data. Pointing out this problem, and especially how it impacts placement in colleges and advanced courses in high schools, is the goal of the recently published book Uneducated Guesses, by Howard Wainer. Wainer is professor of statistics at the Wharton School at the University of Pennsylvania. His competency to explore this area is further enhanced by his position as a former principal research scientist at Educational Testing Service, the folks who bring us the SAT, the test used by almost all U.S. colleges and universities in placement decisions.
As a researcher who has long been involved in both the theoretical and applied areas of statistics, Wainer has clearly maintained great faith in our ability to develop meaningful standardized tests that generate reliable data, which can then be used to make certain decisions and as evidence to reach certain conclusions. For instance, he is unequivocal in his argument for the SAT as an efficient measure of aptitude that can be confidently used to make college placement decisions. Wainer’s confidence is based on the high correlation between SAT scores and first-year grades. He is equally convinced that the PSAT (the practice version of the SAT) is a solid predictor of who will get the most out of Advanced Placement courses, again pointing to a strong correlation between PSAT scores and passing Advanced Placement courses.
However, there are some important topics lurking underneath Wainer’s analysis that he fails to address. The most important is that he offers no analysis of the SAT in and of itself, but instead starts from the premise that it is, de facto, a fair measure of aptitude across the pool of students taking the test. This means male and female students, students from different economic backgrounds, students from different cultural backgrounds and more.
Wainer believes that the SAT, the PSAT, and similar tests are an even playing field for all students to show their potential. Commenting on the famous case of Jaime Escalante’s high-achieving math students from a struggling high school in East LA, Wainer states: “The performance of Escalante’s students might have seemed miraculous based on stereotypes. But standardized tests are blind to such biases. Through the use of cheap but reliable aptitude tests like the PSAT, jewels can be discovered that might otherwise be missed. And once such promise is uncovered, some students previously thought to be unqualified can be given an opportunity and perform successfully.” (p. 54).
A key element that Wainer can’t see is that biases are most assuredly built into the SAT, as described by many analysts, including such organizations as FairTest, which, among other types of analysis, tracks differences in test score performance based on ethnicity and gender. Another dimension Wainer fails to address is how well the SAT correlates with overall college success. Wainer doesn’t even acknowledge that there is an entire sector challenging the validity of the SAT, the PSAT and other tests, which is a shame, because as a well-informed insider and advocate, his responses to these criticisms would have been interesting. Ignoring them weakens his position, as it leaves important questions lingering.
Wainer’s exclusion of the criticisms of the SAT and PSAT is somewhat surprising since the last chapter of his book is a deep dive into one of the most controversial issues within today’s education debates: Value Added Models (VAM). Ever since the LA Times released its report looking at teacher effectiveness using a value-added model, the approach has come under great scrutiny, with a host of educational thinkers questioning its validity, Stanford education professor Linda Darling-Hammond being one of the most notable.
Wainer adds his voice to the mix, using VAM as the poster child for his underlying premise that data from standardized test scores can be used as excellent evidence for questions only if the assumptions that go into making the data are solid. He shows in great detail how this is simply not the case, identifying what he sees as three key dimensions, asking in particular: (1) can causality between the circumstances and the data be established; (2) are the data complete; and (3) is it valid to make comparisons of test scores over time?
Wainer finds great weaknesses in each of the areas defined above, answering in the negative for each. In the first area, the variation in the makeup of students in each classroom introduces too much inconsistency to allow cross-classroom comparison, either within one school year or across school years. In the second area, he demonstrates that VAM calculations assume a substantial amount of missing data (specifically, student test scores) to be missing at random.
But in fact, these data are likely to be missing in non-random ways. For instance, students who are likely to score well on tests are also likely to show up for class regularly, so it is not safe to assume that the missing scores resemble the scores of students who are present. In the third area, Wainer argues that the standardized test data are absolutely not evidence that can speak to the question of teacher effectiveness. At the most basic level, what is being measured changes so much from year to year that comparing test scores over time, even within a single subject such as math, is not really possible. The changes in test scores measure something, but not teaching effectiveness.
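The missing-data point above can be made concrete with a small simulation. This is only an illustrative sketch, not Wainer’s analysis or any real VAM calculation: the function name `simulate_missing_scores`, the score distribution, and the rule linking a student’s score to the chance of missing the test are all hypothetical assumptions chosen for the example.

```python
import random
import statistics


def simulate_missing_scores(n_students=10_000, seed=0):
    """Compare the average of all scores with the average of observed scores.

    Hypothetical assumption: the lower a student's score would be, the more
    likely that student is to be absent on test day.
    """
    rng = random.Random(seed)
    all_scores = []
    observed_scores = []
    for _ in range(n_students):
        # Made-up score distribution, clipped to a 0-100 scale.
        score = min(100.0, max(0.0, rng.gauss(70, 12)))
        all_scores.append(score)
        # Made-up rule: 60% chance of a missing score at 0, 5% at 100.
        p_missing = 0.60 - 0.55 * (score / 100)
        if rng.random() >= p_missing:
            observed_scores.append(score)
    return statistics.mean(all_scores), statistics.mean(observed_scores)


true_mean, observed_mean = simulate_missing_scores()
print(f"average of all students:     {true_mean:.2f}")
print(f"average of students present: {observed_mean:.2f}")
```

Under these assumptions, the average computed only from students who were present comes out higher than the average for the whole class. Treating the absent students as if they resembled the present ones would therefore overstate how the class is doing, which is the kind of error the random-missingness assumption invites.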
To risk making the type of dangerous causal inference Wainer warns against, if only Wainer had looked at his own assumptions (and those of his field), his book would likely have been quite path-breaking. However, the absence of any interrogation of the underlying validity of tests such as the SAT is a tremendous weakness in the foundation of all of his analysis. Keeping that limitation in mind, Wainer’s book still provides an accessible and informative read for education activists interested in better arming themselves to ask detailed, critical questions about the use of statistics to address complex education issues.
Lisa Schiff is the parent of two children in the San Francisco Unified School District and is a member of Parents for Public Schools of San Francisco and the PTA.