Almost immediately, members of the local media trumpeted the results as "proof" that charter schools are realizing meaningful gains in student outcomes:
> For the first time, Texas charter schools have outperformed traditional public schools in reading and closed the gap in math, researchers at Stanford University have found.
>
> Students at Texas charter schools, on average, received the equivalent of 17 additional days of learning per year in reading and virtually the same level of education in math when compared to traditional public schools, according to a study released Wednesday by the Center for Research on Education Outcomes, or CREDO.
>
> Rather than looking at raw standardized test scores, CREDO researchers quantify the impact of a school by looking at student improvement on the tests relative to other schools. The researchers then translate those results into an equivalent number of "days of learning" gained or lost in a 180-day school year.
>
> The center's staff produced similar analyses in 2013 and 2015, finding Texas charter schools had a negative impact on reading and math performance.
>
> "The most recent results are positive in two ways. Not only do they show a positive shift over time, but the values themselves are both positive for the first time," the researchers wrote.
>
> CREDO's studies of charter school performance are widely respected in education circles. The center compares students from charter and traditional public schools by matching them based on demographic characters -- race, economic status, geography and English proficiency, among others -- and comparing their growth on standardized tests. Scores from 2011-12 to 2014-15 were analyzed for the most recent report. [emphasis mine]

That's from the Houston Chronicle, which published just one paragraph suggesting the CREDO studies might have credible critics:

> Skeptics of CREDO's study typically offer three main criticisms of the research: it focuses exclusively on standardized test results, incentivizing schools that "teach to the test"; it ignores other advantages of traditional public schools, such as better access to extracurricular activities; and it doesn't account for the fact that charter school students are more likely to have strong, positive parental influence on their education.

Sorry, but that's, at best, an incomplete description of the serious limitations of these studies, which include:
- The matching variables that create the counterfactuals are far too crude to do the job properly.
- The definition of the treatment -- enrolling in a charter school -- does not account for factors such as increased spending, peer effects, and other advantages which have nothing to do with "charteriness."
- The consistently small effect sizes have been pumped up by a conversion into "days of learning" that the authors have never validated or properly justified.
Here is how the CREDO Texas study reports its findings:
Stanley Pogrow published a paper earlier this year that didn't get much attention, and that's too bad, because he quite rightly points out that it's much more credible to describe results like the ones reported here as "small" than as substantial. An effect of 0.03 standard deviations is tiny: plug it into a standard normal distribution and you'll see it translates into moving from the 50th to roughly the 51st percentile (and that's the most generous possible interpretation when converting to percentiles).
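Here's a quick check of that arithmetic -- a minimal sketch assuming outcomes are normally distributed, which is the most generous case for this kind of conversion:

```python
from statistics import NormalDist

# Under a normal distribution, the average student receiving a 0.03 SD boost
# lands at roughly the 51st percentile of the comparison distribution.
effect_size = 0.03
percentile = NormalDist().cdf(effect_size) * 100
print(f"A {effect_size} SD effect moves the average student to the "
      f"{percentile:.1f}th percentile")  # ~51.2
```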
I have been working on something more formal than a blog post to delve into this issue. I've decided to publish an excerpt now because, frankly, I am tired of seeing "days of learning" conversions reported in the press and in research -- both peer-reviewed and not -- as if there were no debate about their validity.
The fact is that many people who know what they are talking about have a problem with how CREDO and others use "days of learning," and it's well past time that the researchers who make this conversion justify it.
The excerpt below refers to what the eminent psychometrician Michael T. Kane calls a "validity argument." To quote Kane: "Public claims require public justification." I sincerely hope I can spark a meaningful conversation here and get the CREDO team to adequately and publicly justify their use of "days of learning." As of now, their validity argument is cursory at best -- and that's just not good enough.
I have added some bolding to the excerpt below to highlight key points.
* * *
Avoiding the Validity Argument: A Case Study
As an illustration of the problem of avoiding the validity
argument in education policy, I turn to an ongoing series of influential
studies of charter school effects. Produced by The Center for Research on
Education Outcomes at Stanford University, the so-called CREDO reports have
driven a great deal of discussion about the efficacy of charter school
proliferation. The studies have been cited often in the media, where the
effects they find are reported as “days of learning.”[1]
Both the National Charter School Study
(Raymond et al., 2013) and the Urban
Charter School Study Report on 41 Regions (CREDO, 2015) include tables that
translate the effect sizes found in the study into “days of learning.” Since an
effect size of 0.25 SD is translated into 180 days, the clear implication is
that an effect of this size moves a student ahead a grade level (a typical
school year being 180 days long). Yet neither study explains the rationale
behind the tables; instead, they cite two different sources, each authored by
economist Eric Hanushek, as the source for the translations.
The 2015 study (p. 5) cites a paper published in Education Next (Hanushek, Peterson &
Woessmann, 2012) that asserts: “On most measures of student performance, student
growth is typically about 1 full std. dev. on standardized tests between 4th
and 8th grade, or about 25 percent of a std. dev. from one grade to the next.”
(p. 3-4) No citation, however, is given to back up this claim: it is simply
stated as a received truth.
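To make the implied arithmetic explicit: if one grade level of growth is taken to equal 0.25 standard deviations and a school year is taken to be 180 days, the conversion is simply linear, so every hundredth of a standard deviation is reported as 7.2 "days of learning." The sketch below is my reconstruction of that logic, not code published by CREDO:

```python
def effect_size_to_days(effect_size_sd, sd_per_grade=0.25, days_per_year=180):
    """Implied linear conversion: if one grade level of growth is assumed to
    equal sd_per_grade standard deviations, then an effect of effect_size_sd
    is reported as this many 'days of learning'."""
    return effect_size_sd / sd_per_grade * days_per_year

# Very small effects come out sounding like weeks of extra schooling:
for es in (0.01, 0.02, 0.03, 0.05, 0.25):
    print(f"{es:.2f} SD -> {effect_size_to_days(es):5.1f} days of learning")
```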
The 2013 study (p. 13) cites a chapter by Hanushek
in the Handbook of the Economics of
Education (Hanushek & Rivkin, 2006), in which the author cites his own
earlier work:
“Hanushek (1992) shows that teachers near the top of the quality distribution can get an entire year’s worth of additional learning out of their students compared to those near the bottom. That is, a good teacher will get a gain of 1.5 grade level equivalents while a bad teacher will get 0.5 year for a single academic year.” (p. 1068)

No other references are made within the chapter as to how student gains could be presented as years or fractions of a year’s worth of learning.
The 1992 citation is to an investigation by Hanushek of
the correlation between birth order and student achievement, and between family
size and student achievement. The test scores used to measure achievement come
from the “Iowa Reading Comprehension and Vocabulary Tests.” (p. 88) The Iowa Assessments: Forms E and F Research and
Development Guide (2015), which traces the development of the Iowa
Assessments back to the 1970s, states:
“To describe the developmental continuum or learning progression in a particular achievement domain, students in several different grade levels must answer the same questions in that domain. Because of the range of item difficulty in the scaling tests, special Directions for Administration were prepared to explain to students that they would be answering some very easy questions and other very difficult questions.” (p. 55-56)
In other words: to
have test scores reported in a way that allows for comparisons across grade
levels (or, by extension, fractions of a grade level), the Iowa Assessments
deliberately place the same questions across those grade levels. There is no
indication, however, that all, or any, of the statewide tests used in the CREDO
studies have this property.[2]
Harris (2007) describes the process of creating a common
score scale for different levels of an assessment as vertical scaling. She notes: “Different decisions can lead to
different vertical scales, which in turn can lead to different reported scores
and different decisions.” (p. 233) In her discussion of data collection, Harris
emphasizes that several designs can be used to facilitate a vertical scale,
such as a scaling test, common items, or single group to scaling test. (p. 241)
In all of these methods, however, there must be some form of overlapping: at
least some students in concurrent grades must have at least some common items
on their tests. And yet students in different grades still take tests that
differ in form and content; Patz (2007) describes the process of merging their
results into a common scale as linking
(p. 6). He notes, however, that there is a price to be paid for linking: “In
particular, since vertical links provide for weaker comparability than
equating, the strength of the validity of interpretations that rest on the
vertical links between test forms is weaker.” (p. 16)
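To illustrate why some form of overlap is indispensable, here is a toy version of the single-group scaling-test design, using entirely invented numbers and a crude mean-sigma linear link; operational vertical scales are built with far more elaborate IRT machinery, so this is only a sketch of the principle:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500

# Invented latent abilities: grade 5 is assumed to be ~0.4 SD ahead of grade 4.
ability_g4 = rng.normal(0.0, 1.0, n)
ability_g5 = rng.normal(0.4, 1.0, n)

# Each grade takes its own form, reported on its own arbitrary scale...
form_g4 = 150 + 25 * ability_g4 + rng.normal(0, 8, n)
form_g5 = 500 + 30 * ability_g5 + rng.normal(0, 8, n)

# ...and the same students also answer a common scaling test that spans the
# grades, which supplies the shared developmental metric.
scaling_g4 = 300 + 40 * ability_g4 + rng.normal(0, 10, n)
scaling_g5 = 300 + 40 * ability_g5 + rng.normal(0, 10, n)

def mean_sigma_link(form_scores, scaling_scores):
    """Place form scores on the scaling-test metric by matching the mean and
    standard deviation observed in the same group of students."""
    a = scaling_scores.std() / form_scores.std()
    b = scaling_scores.mean() - a * form_scores.mean()
    return a * form_scores + b

linked_g4 = mean_sigma_link(form_g4, scaling_g4)
linked_g5 = mean_sigma_link(form_g5, scaling_g5)

# Only after linking can grade-to-grade growth be read off one continuum;
# without the common scaling test, the two forms share no metric and "a
# year's worth of learning" is undefined.
growth = linked_g5.mean() - linked_g4.mean()
pooled_sd = np.sqrt((linked_g4.var() + linked_g5.var()) / 2)
print(f"Growth on the common scale: {growth:.1f} points ({growth / pooled_sd:.2f} SD)")
```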
So even if the CREDO studies used assessments that were
vertically scaled, the authors would have to acknowledge that the validity of
their effect sizes was at least somewhat compromised compared to effect sizes
derived from other assessments. In this case, however, the point is moot: it
appears that many of the assessments used by CREDO are not vertically scaled[3],
which is a minimal requirement for making the case that effect sizes can be
translated into fractions of a year’s worth of learning. The authors are,
therefore, presenting their results in a metric that has not been validated and
could be misleading.
I use this small but important example to illustrate a
larger point: when influential education policy research neglects to validate
the use of assessments, it may lead stakeholders to conclusions that cannot be
justified. In the case of the CREDO reports, avoiding a validity argument for
presenting effect sizes in “days of learning” has led to media reports on the
effects of charter schools and policy decisions regarding charter proliferation
that are based on conclusions that have not been validated. That is not to say
these decisions are necessarily harmful; rather, that they are based on a
reporting of the effects of charter schools that avoided having to make an
argument for the validity of using test scores.
[1] Some examples include:
http://www.chicagotribune.com/news/opinion/commentary/ct-charter-schools-public-teachers-unions-20161015-story.html
http://www.economist.com/news/international/21705697-liberating-schools-run-their-own-affairs-produces-some-great-ones-also-plenty
https://www.usnews.com/opinion/knowledge-bank/2015/03/19/new-study-shows-charter-schools-making-a-difference-in-cities
[2]
Nor is there any indication the national and international tests Hanushek cites
in his 2006 paper, such as the National Assessment of Educational Progress,
share questions across grade levels. In fact, Patz (2007) notes: “NAEP
attempted to vertically link forms for grades 4, 8, and 12, but abandoned the
approach as the comparisons of students separated by 4 or 8 years were found to
be ‘largely meaningless’ (Haertel, 1991).” (p.17)
[3] Some
statewide assessments are vertically scaled, including the Smarter Balanced
assessments; see: https://portal.smarterbalanced.org/library/en/2014-15-technical-report.pdf
References
Center for Research
on Education Outcomes (CREDO) (2015). Urban Charter School Study Report on
41 Regions. Palo Alto, CA: Center
for Research on Education Outcomes (CREDO), Stanford University. Retrieved
from: http://urbancharters.stanford.edu/summary.php
Dunbar, S., & Welch, C. (2015). Iowa Assessments: Forms E and F Research and Development Guide.
Iowa City, IA: University of Iowa. Retrieved from: https://itp.education.uiowa.edu/ia/documents/Research-Guide-Form-E-F.pdf
Hanushek, E. A. (1992). The trade-off between child quantity
and quality. Journal of Political Economy, 100(1),
84-117.
Hanushek, E. A., & Rivkin, S. G. (2006). Teacher
quality. Handbook of the Economics of Education, 2,
1051-1078.
Hanushek, E. A., Peterson, P. E., & Woessmann, L.
(2012). Achievement Growth: International and US State Trends in Student
Performance. PEPG Report No.: 12-03. Program on Education Policy and
Governance, Harvard University.
Harris, D. J. (2007). Practical issues in vertical scaling. In Linking and aligning scores and scales (pp. 233-251). New York: Springer.
ITBS Research Guide. Retrieved from: https://itp.education.uiowa.edu/ia/ITBSResearchGuide.aspx
Kane, M. (2013). Validating the interpretations and uses of
test scores. Journal of Educational Measurement 50(1), 1–73.
Raymond, M. E., Woodworth, J. L., Cremata, E., Davis, D.,
Dickey, K., Lawyer, K., & Negassi, Y. (2013). National Charter School Study 2013. Palo Alto, CA: Center for Research on Education Outcomes
(CREDO), Stanford University. Retrieved from: http://credo.stanford.edu/research-reports.html
@MoskowitzEva earlier this week linked to a similar +0.03 standard deviation study "proving" charters are better, even for public schools!
Gadfly, is that the Cordes study? Because I've got plenty to say about that in a bit...
Since I've commented previously on the weeks/days of learning thing, a few thoughts:
ReplyDelete1. I'm not sure what to make of the inevitable squishiness of the basis for "a year of learning," but it's not inherently worse than measurement of effect sizes, which also make a universal-metric claim.
2. It's not clear that an effect size of 0.03 is either large or small, except when placed in some context. I'd argue that the right context for "enrollment in charter school vs. local public school" is other large, population-level school policies/practices, such as school assignment for desegregation, as opposed to a classroom-level practice such as formative assessment guiding instruction. So you'd need to make the comparison among all big, hairy policies and practices, and in that context, I suspect 0.03 is around the middle of the pack. Doing big things well is hard, and all that.
Hi Jersey Jazzman. Aside from the content here, it must be pointed out that CREDO is not part of Stanford University itself, not a scholarly research program as that implies. CREDO is a program of the Hoover Institution, a right-wing, free-market "think tank" (aka propaganda operation) that is located on the Stanford campus. CREDO formerly described itself as a program designed to PROMOTE charter schools, and used to be open about being run by Hoover (NOT Stanford). The materials about Hoover and CREDO have become more and more cagey about the connection, so now it's almost impossible to discern that. Also, CREDO's director, Hoover fellow Macke Raymond, is married to Eric Hanushek, whose "research" she cites. Hanushek is also a Hoover fellow whose longtime specialty is promoting propaganda aimed at disparaging and discrediting teachers.
So, just be aware that describing CREDO as part of Stanford and implying that it's a scholarly research project rather than a propaganda operation helps promote its credibility.
That said, I do recognize that CREDO has produced some research that didn't reflect favorably on charters, so it seems to zigzag. But still, it needs to be characterized clearly and accurately.
The very term "think tank" is in itself a piece of propaganda, as these operations are propaganda operations, not scholarly research organizations.
Disclosure that I did a large freelance writing job for the Hoover Institution many years ago, so I gained quite a bit of familiarity.
"In the case of the CREDO reports, avoiding a validity argument for presenting effect sizes in “days of learning” has led to media reports on the effects of charter schools and policy decisions regarding charter proliferation that are based on conclusions that have not been validated. That is not to say these decisions are necessarily harmful; rather, that they are based on a reporting of the effects of charter schools that avoided having to make an argument for the validity of using test scores."
The testing industry has avoided the whole validity, or better put, invalidity issues that Noel Wilson raised in 1997. I have yet to see any rebuttal or refutation anywhere.
JJ, if you have come across any in your research I'd greatly appreciate if you'd cite them. Thanks in advance, Duane
For a short discussion of the invalidity issues in standards and testing see Wilson's "A Little Less than Valid: An Essay Review" at http://edrev.asu.edu/index.php/ER/article/view/1372/43
For a much fuller discussion see Wilson's “Educational Standards and the Problem of Error” found at: http://epaa.asu.edu/ojs/article/view/577/700
For a look at the junction of the purpose of public education, truth, justice, ethics and the standards and testing regime see my forthcoming book (this week, I'll have the hard copies) "Infidelity to Truth: Education Malpractice in American Public Education". To obtain an advanced electronic draft copy, please email me at duaneswacker@gmail.com and I'll be happy to send it to anyone at no charge.
Hi Sherman,
First, thanks for the link - it's a really good piece (Manny!).
I agree that any effect size needs to be put into context. Even "statistical significance" is contextual. But if we're trying to get to useful heuristics, I'd argue it's very hard to make a case that 0.03 is anything but small.
In a typical statewide accountability test, how many more answers would a group of students have to get correct to realize a gain of 0.03SD? Less than one? Sure, you'd rather have a gain than not, but is anyone prepared to say a gain that small is indicative of superior instruction/practices/whatever?
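Here's the back-of-the-envelope version of that question, with invented but plausible numbers:

```python
# Hypothetical statewide test: ~50 items, raw-score standard deviation of
# about 8 items. A 0.03 SD gain is then roughly a quarter of one additional
# correct answer per student, on average.
raw_score_sd_items = 8
effect_size_sd = 0.03
print(f"{effect_size_sd * raw_score_sd_items:.2f} additional items correct")  # 0.24
```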
Yes, it is hard to do big things well - that's the point. I haven't done this in a while (stand by...), but when you look at the claims of the charter sector, they are of the magnitude of "closing the achievement gap" and "proving poverty can be overcome" and so on. There's just no way to credibly say 0.03 is on that order.
I appreciate the thoughtful response. More to come...
Caroline, you have a point. Personally, I've become less interested in arguing in these terms - but that doesn't mean what you're saying isn't important.
Duane, looking forward to the book. I am not a measurement guy; I only started reading seriously in the field over the last year. I have to say that I was surprised at first at how cautious the psychometric community is about the use of testing outcomes in things like policy formation. If you look at the "Standards for Educational and Psychological Testing" from AERA, APA, and NCME, you'll find the bar for establishing validity is much higher than is typically found in much of the econometric-style education research out there.
Test scores are different than many other measures used by economists. I often think this distinction is lost - that researchers just plow ahead with their regression models without stopping to ask first what the measures they are using are actually measuring.
Thank you. This days of learning baloney drives me nuts. Which day -- a day in September or a day in April (because they sure aren't the same)? And if we can quantify a day of learning, why stop there? Why not hours of learning? What an absurd measure.
Duke,
Feel free to email me at duaneswacker@gmail.com and I'll send you an electronic copy at no charge. I have sent the final proofs back to the printer and they should start printing it this week. The book will be available at Amazon when I get the copies, or you can email me.
Anyway, I do have a copy of the "testing bible" and have read it, and am reading the past three editions to see how the language and concerns have changed over the years. One thing I've noticed is the insistence by the authors, whoever they are, that they are "measuring" something -- that the "measurement" of students can actually be accomplished (it can't, as I show in my book). Terms related to measure, measurement, and metrics are used, it appears, around 4-5 times per page on average. I guess if one states something, even a falsehood, often enough it becomes true, eh? What a load of pure bovine excrement.
Thought you might be interested in a new analysis demonstrating serious flaws in the "days of learning" translation: www.rand.org/t/WR1226
John