I will protect your pensions. Nothing about your pension is going to change when I am governor. - Chris Christie, "An Open Letter to the Teachers of NJ" October, 2009

Saturday, December 30, 2017

Test Score Gains Are Not Necessarily a Sign of Better Instruction: A Cautionary Tale From Newark

This post is part of a series on recent research into Newark schools and education "reform."

Here's Part I.

Here's Part II.

* * *

In this series, I've been breaking down recent research about Newark, NJ's schools. Reformy types have been attempting to make the case that "reforms" in Newark over the past several years -- including charter school expansion, merit pay, Common Core alignment, school closures, and universal enrollment -- have led to gains in student learning. These "reforms" are purportedly the result of Facebook CEO Mark Zuckerberg's high-profile, $100 million grant to the city's schools back in 2010.

Zuckerberg recently funded a study, published by the Center for Education Policy Research at Harvard University this past fall, that shows a gain in "value-added" on tests for Newark compared to the rest of the state. (A technical paper, published by the National Bureau of Economic Research, is found here.)

Bruce Baker and I looked carefully at this study, and added our own analysis of statewide data, to produce a review of this research. One of our most important findings is that most of the "gains" -- which are, in our opinion, educationally small anyway (more on this later) -- can be tied to a switch New Jersey made in 2015 from the NJASK statewide exams to the newer PARCC exams.

As I noted in the last post, even the CEPR researchers suggest this is the most likely explanation for the "gains."
Assuming both tests have similar levels of measurement error, this implies that the PARCC and NJASK were assessing different sets of skills and the districts that excelled in preparing students for PARCC were not necessarily the same as the districts that excelled at preparing students for NJASK. Thus, what appears to be a single-year gain in performance may have been present before 2015, but was simply undetected by earlier NJASK tests. (p. 22, NBER, emphasis mine)
As I pointed out last time, there has never been, to my knowledge, any analysis of whether the PARCC does a better job measuring things we care about compared to the NJASK. So, while the PARCC has plenty of supporters, we really don't know if it's any better than the old test at detecting "good" instructional practices, assuming we can hold things like student characteristics constant.

But even if we did have reason to believe the PARCC was a "better" test, I still would find the sentence above that I bolded to be highly problematic. Let's look again at the change in "value-added" that the CEPR researchers found (p. 35 of the NBER report, with my annotations):


"Value-added" -- ostensibly, the measure of how much the Newark schools contributed to student achievement gains -- was trending downward prior to 2014 in English language arts. It then trended upward after the change to the new test in 2015. But the CEPR authors say that the previous years may have actually been a time when Newark students would have been doing better, if they had been taking the PARCC instead of the NJASK.

The first problem with this line of thinking is that there's no way to prove it's true. But the more serious problem is that the researchers assume, on the basis of nothing, that the bump upwards in value-added represents real gains, as opposed to variations in test scores which have nothing to do with student learning.

To further explore this, let me reprint an extended quote we used in our review from a recent book by Daniel Koretz, an expert on testing and assessment at Harvard's Graduate School of Education. The Testing Charade should be required reading for anyone opining about education policy these days. Koretz does an excellent job explaining what tests are, how they are limited in what they can do, and how they've been abused by education policy makers over the years.

I was reading Koretz's book when Bruce and I started working on our review. I thought it was important to include his perspective, especially because he explicitly takes on the writings of Paul Bambrick-Santoyo and Doug Lemov, who both just happen to hold leadership positions at Uncommon Schools, which manages North Star Academy, one of Newark's largest charter chains.

Here's Koretz:

One of the rationales given to new teachers for focusing on score gains is that high-stakes tests serve a gatekeeping function, and therefore training kids to do well on tests opens doors for them. For example, in Teaching as Leadership[i] – a book distributed to many Teach for America trainees – Steven Farr argues that teaching kids to be successful on a high-stakes test “allows teachers to connect big goals to pathways of opportunity in their students’ future.” This theme is echoed by Paul Bambrick-Santoyo in Leverage Leadership and by Doug Lemov in Teach Like a Champion, both of which are widely read by new teachers. For example, in explaining why he used scores on state assessments to identify successful teachers, Lemov argued that student success as measured by state assessments is predictive not just of [students’] success in getting into college but of their succeeding there. 
Let’s use Lemov’s specific example to unpack this. 
To start, Lemov has his facts wrong: test scores predict success in college only modestly, and they have very little predictive power after one takes high school grades into account. Decades of studies have shown this to be true of college admissions tests, and a few more recent studies have shown that scores on states’ high-stakes tests don’t predict any better. 
However, the critical issue isn’t Lemov’s factual error; it’s his fundamental misunderstanding of the link between better test scores and later success of any sort (other than simply taking another similar test). Whether raising test scores will improve students’ later success – in contrast to their probability of admission – depends on how one raises scores. Raising scores by teaching well can increase students’ later success. Having them memorize a couple of Pythagorean triples or the rule that b is the intercept in a linear equation[ii] will increase their scores but won’t help them a whit later. 
[...] 
Some of today’s educators, however, make a virtue of this mistake. The[y] often tell new teachers that tests, rather than standards or a curriculum, should define what they teach. For example, Lemov argued that “if it’s ‘on the test,’ it’s also probably part of the school’s curriculum or perhaps your state standards… It’s just possible that the (also smart) people who put it there had a good rationale for putting it there.” (Probably? Perhaps? Possible? Shouldn’t they look?) Bambrick-Santoyo was more direct: “Standards are meaningless until you define how to assess them.” And “instead of standards defining the sort of assessments used, the assessments used define the standard that will be reached.” And again: “Assessments are not the end of the teaching and learning process; they’re the starting point.” 
They are advising new teachers to put the cart before the horse.[iii] [emphasis mine; the notes below are from our review]
Let's put this into the Newark context:

  • One of the most prominent "reforms" in Newark has been the closing of local public district schools while moving more students into charter schools like North Star.
  • By their own admission, these schools focus heavily on raising test scores.
  • The district also claims it has focused on aligning its curriculum with the PARCC (as I point out in our review, however, there is little evidence presented to back up the claim).
  • None of these "reforms," however, are necessarily indicators of improved instruction.
How did Newark get its small gains in value-added, most of which were concentrated in the year the state changed its tests? The question answers itself: the students were taught with the goal of improving their test scores on the PARCC. But those test score gains are not necessarily indicative of better instruction. 

As Koretz notes in other sections of his book, "teaching to the test" can take various forms. One of those is curricular narrowing: focusing on tested subjects at the expense of instruction in other domains of learning that aren't tested. Did this happen in Newark?


More to come...


[i] Farr, S. (2010). Teaching as leadership: The highly effective teacher’s guide to closing the achievement gap. San Francisco: Jossey-Bass. We note here that Russakoff reports that Teach for America received $1 million of the Zuckerberg donation “to train teachers for positions in Newark district and charter schools.” (Russakoff, D. (2016). The prize: Who’s in charge of America’s schools? New York, NY: Houghton Mifflin Harcourt. p. 224)
[ii] A “Pythagorean Triple” is a set of three whole numbers, such as 3, 4, and 5, that satisfies the Pythagorean theorem for the side lengths of a right triangle. Koretz critiques the linear intercept rule, noting that b is typically taught as the intercept of a linear equation in high school (y = mx + b), but usually denotes a slope coefficient in the notation used in college courses. In both cases, Koretz contends test prep strategies keep students from gaining a full understanding of the concepts being taught. See: Koretz, D. (2017). The testing charade: Pretending to make schools better. Chicago, IL: University of Chicago Press. pp. 104-108.
[iii] Koretz, D. (2017). The testing charade: Pretending to make schools better. Chicago, IL: University of Chicago Press. pp. 114-115.

Tuesday, December 19, 2017

What Are Tests Really Measuring? A Tale of Education "Reform" in Newark

This post is part of a series on recent research into Newark schools and education "reform."

Here's Part I.

"What is a test, and what does it really measure?"

I often get the sense that more than a few stakeholders and policy makers in education don't take a lot of time to think carefully about this question.

There aren't many people who would claim that a test score, by itself, is the ultimate product of education. And yet test scores dominate discussions of education policy: if your beloved program can show a gain in a test outcome, you're sure to cite that gain as evidence in favor of it.

That's what's been happening in Newark, New Jersey these days. As I said in my last post, new research was published by the Center for Education Policy Research at Harvard University this past fall that purportedly showed a gain in "value-added" on tests for Newark compared to the rest of the state. The researchers have attempted to make the case that a series of reforms, initiated by a $100 million grant from Mark Zuckerberg, prompted those gains. (A more technical study of their research, published by the National Bureau of Economic Research, is found here.)

To make their case, the CEPR researchers do what many others have done: take test scores from students, input them into a sophisticated statistical model, and compare the gains for various groups. To be clear, I do think using test scores this way is fine -- to a point.

Test outcomes can and often do contain useful information that, when properly used, tell us important things. But we always have to remember that a test is a sample of knowledge or ability at a particular point in time. Like all samples, test outcomes are subject to error. Give a child who ate a good breakfast and got enough sleep a test in a quiet room with the heat set properly, and you'll get one score. Give that same child the same test but on an empty stomach in a freezing cold room, and you'll almost certainly get something else.
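If you want to see what that kind of measurement error looks like in the abstract, here's a minimal sketch with completely made-up numbers: one child, one fixed underlying ability, five sittings of the "same" test, each nudged by conditions that have nothing to do with what the child actually knows.

```python
# A toy simulation (all numbers invented) of measurement error:
# the child's underlying ability never changes, but every sitting of
# the test adds noise from conditions unrelated to that ability.
import numpy as np

rng = np.random.default_rng(42)
true_ability = 220                                     # the "true" score we never observe directly
sittings = true_ability + rng.normal(0, 12, size=5)    # five sittings, each with random error

print(np.round(sittings))            # observed scores scatter around 220
print(round(sittings.mean(), 1))     # averaging the sittings pulls the estimate back toward 220
```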

The variation in outcomes here illustrates a critical point: Often the scores on a test vary because of factors that have nothing to do with what the test is trying to measure. Psychometricians will often talk about construct validity: the extent to which a test is measuring what it is supposed to be measuring. Making a valid test requires not only creating test items that vary based on what we're trying to measure; it also requires defining what we're trying to measure.

Take, for example, New Jersey's statewide assessments in Grades 3 through 8 -- assessments required by federal law. For a number of years, the state administered the NJASK: the New Jersey Assessment of Skills and Knowledge. It was a paper-and-pencil test that assessed students in two domains: math and English language arts (ELA).

Those are very big domains. What, exactly, comes under ELA? Reading and writing, sure... but reading what? Fiction? Informational texts? Toward what end? Comprehension, sure... but what does that mean? How does anyone demonstrate they comprehend something? By summarizing the text, or by responding to it in an original way? Is there a fool-proof way to show comprehension? And at what level?

These questions aren't merely a philosophical exercise -- they matter when building a test. What goes into the construct we are trying to measure? And, importantly, do the tests we give vary based on what we intend to use the tests to measure?

In the case of the recent Newark research, the economists who conducted the study made an assumption: they believed the test scores they used vary based on the actions of school systems, which implement programs and policies of various kinds. They assumed that after applying their models -- models that attempt to strip away differences in student characteristics and abilities to learn --  the variation in outcomes can be attributed to things the Newark publicly-financed schools, including the charter schools, do that differ from schools in other parts of the state.
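To make that assumption concrete, here is a minimal sketch of a covariate-adjusted growth regression of the general kind the study describes. This is emphatically not the CEPR model: the variable names, coefficients, and data below are all invented for illustration. The logic is simply that once prior scores and student characteristics are regressed away, whatever variation remains gets attributed to the district.

```python
# A hypothetical, simplified "value-added"-style regression on simulated data.
# None of this is the CEPR model; it only illustrates the structure of the assumption.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 5000
df = pd.DataFrame({
    "prior_score": rng.normal(0, 1, n),       # last year's standardized score
    "econ_disadv": rng.binomial(1, 0.7, n),   # economic disadvantage indicator
    "lep": rng.binomial(1, 0.1, n),           # limited English proficiency indicator
    "newark": rng.binomial(1, 0.2, n),        # enrolled in a Newark school (hypothetical flag)
})

# Simulated outcome: driven mostly by prior achievement and demographics,
# plus a small district effect the regression will try to recover.
df["score"] = (0.7 * df.prior_score - 0.2 * df.econ_disadv - 0.3 * df.lep
               + 0.05 * df.newark + rng.normal(0, 0.5, n))

model = smf.ols("score ~ prior_score + econ_disadv + lep + newark", data=df).fit()
print(model.params["newark"])   # the leftover variation attributed to "Newark"
```

The catch is that anything else the model doesn't account for also ends up in that leftover term.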

It's a big assumption. It requires showing that the policies and programs implemented can be documented and, if appropriate, measured. It requires showing that those policies and programs only took place in Newark. And it requires making the argument that the variation found in test outcomes came only from those policies and programs -- what social scientists would call the treatment.

Further, this assumption requires making yet another assumption:

In 2015, New Jersey switched its statewide exam from the NJASK to the PARCC: the Partnership for Assessment of Readiness for College and Careers. PARCC is (mostly) a computerized exam. Its supporters often claim it's a "better" exam, because, they say, it measures things that matter more. I'm not going to get into that debate now, but I will note that, so far as I know, no one ever conducted a validity study of the PARCC compared to the NJASK. In other words: we're not sure how the two tests differ in what they measure.

What I can say is that everyone agrees the two exams are different. From what I've seen and heard from others, the PARCC math exam relies more on language skills than the NJASK math exam did, requiring students to do more verbal problem solving (which would put non-native English speakers at a disadvantage). The PARCC ELA exam seems to put more emphasis on writing than the NJASK, although how that writing is graded remains problematic.

Keeping this in mind, let's look at this graph from the CEPR research (p.35):


Until 2014, Newark's test score "growth" is pretty much the same as the other Abbott districts in the state. The Abbotts are a group of low-income districts that brought the famous Abbott v. Burke lawsuit, which forced the state toward more equitable school funding. They stand as a comparison group for Newark, because they have similar students and got similar test outcomes...

Until 2015. The Abbotts, as a group, saw gains compared to the rest of the state -- but Newark saw greater gains. Whether the size of those gains is educationally significant is something we'll talk about later; for right now, let's acknowledge they are statistically significant.

But why did they occur? Let me annotate this graph:


Newark's gains in "growth," relative to other, similar New Jersey districts, occurred in the same year the state switched exams.

And it's not just the CEPR research that shows this. As Bruce Baker and I showed in our review of that research, the state's own measure of growth, called Student Growth Percentiles (SGPs), also shows a leap in achievement gains for Newark in the same year.
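(For readers who haven't met SGPs: a student's growth percentile is, roughly, where that student's current score ranks among students who scored similarly in prior years. New Jersey's actual SGPs are computed with quantile regression; the sketch below, which uses simulated data and a crude binning shortcut, is only meant to show the idea.)

```python
# A deliberately simplified illustration of the SGP idea using simulated data.
# Real SGPs use quantile regression, not this binning shortcut.
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
df = pd.DataFrame({"prior": rng.normal(200, 30, 10000)})            # prior-year scale scores
df["current"] = 40 + 0.8 * df.prior + rng.normal(0, 20, 10000)      # current-year scale scores

# Group students with similar prior achievement, then rank each student's
# current score within that group to get a percentile between 0 and 100.
df["prior_bin"] = pd.qcut(df.prior, 20, labels=False)
df["sgp"] = df.groupby("prior_bin")["current"].rank(pct=True) * 100

print(df[["prior", "current", "sgp"]].head())
```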


Again, the red line is the dividing point between the NJASK and the PARCC. In this case, however, we break down the districts into the Newark Public Schools, Newark's charter schools, and only those Abbotts in the same county as Newark. The districts close to Newark with similar demographics had similar gains in achievement "growth."

Let's step back and remember what the CEPR study was trying to understand: how a series of policies, initiated by Zuckerberg's donation, affected test score growth in Newark. What would we have to assume, based on this evidence, to believe that's true?
  • That the Newark reforms, which began in 2011, didn't kick in until 2015, when they suddenly started affecting test scores.
  • That the gains in the other Essex County Abbott districts (Irvington, Orange, and East Orange) were caused by some other factor completely separate from anything affecting Newark.
  • That the switch from the NJASK to the PARCC didn't create any gains in growth that were unrelated to the construct the tests are purportedly measuring.
Test makers will sometimes refer to the concept of construct-irrelevant variation: test outcomes vary, in part, because of things we do not want the test to measure but that affect the scores anyway. If two children with equal mathematical ability take a computerized test, but one has greater facility in using a computer, their test scores will differ. The problem is that we don't want their scores to differ, because we're trying to measure math ability, not familiarity with computers.
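Here's a minimal sketch of that example, with simulated numbers: two groups of students are given the same distribution of math ability, but one group loses points to unfamiliarity with the testing interface. A gap shows up that has nothing to do with the construct we care about.

```python
# A toy simulation of construct-irrelevant variation: equal math ability,
# unequal comfort with the computerized test, unequal scores.
import numpy as np

rng = np.random.default_rng(1)
ability_a = rng.normal(500, 50, 5000)    # group A: underlying math ability
ability_b = rng.normal(500, 50, 5000)    # group B: same ability distribution

score_a = ability_a + rng.normal(0, 10, 5000)         # measurement noise only
score_b = ability_b - 15 + rng.normal(0, 10, 5000)    # plus a 15-point interface penalty

print(round(score_a.mean() - score_b.mean(), 1))      # a gap of roughly 15 points, none of it "math"
```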

Did Newark's students -- and Orange's and East Orange's and Irvington's -- do better on the PARCC simply because they felt more at ease with the new PARCC test than students around the rest of the state? Did these districts engage in test prep activities specific to the PARCC that brought scores up, but didn't necessarily reflect better instruction?

The CEPR study admits this is likely:
Assuming both tests have similar levels of measurement error, this implies that the PARCC and NJASK were assessing different sets of skills and the districts that excelled in preparing students for PARCC were not necessarily the same as the districts that excelled at preparing students for NJASK. Thus, what appears to be a single-year gain in performance may have been present before 2015, but was simply undetected by earlier NJASK tests. (p. 22, NBER, emphasis mine)
I'll get into that last sentence more in a future post. For now, it's enough to note this: Even the CEPR team acknowledges that the most likely explanation for Newark's gains is the state's switch from the NJASK to the PARCC. But aligning instruction with one test more than another is not the same as providing better instruction. 

Gains like these are not necessarily an indication of curricular or instructional improvements. They are not necessarily brought about by moving students into "better" schools. They could very easily be the result of the tests measuring different things that we don't really want them to measure.

We'll talk more about this -- and get the views of a Harvard education expert -- next.

Thursday, December 14, 2017

Education "Reform" in Newark: The Facts - Prelude

Miss me?

Yesterday, the National Education Policy Center published a lengthy report written by Dr. Bruce Baker and myself that looks closely at school "reform" in Newark. I wrote a short piece about our report at NJ Spotlight that summarizes our findings. We've also got a deep dive into the data for our report at the NJ Education Policy website.

You might be wondering why anyone outside of New Jersey, let alone Newark, should care about what we found. Let me give you a little background before I try to answer that question...

In 2010, Mark Zuckerberg, the CEO and founder of Facebook, went on The Oprah Winfrey Show and announced that he was giving $100 million in a challenge grant toward the improvement of Newark's schools. Within the next couple of years, Newark had a new superintendent, Cami Anderson. Anderson attempted to implement a series of "reforms" that were supposed to improve student achievement within the city's entire publicly-financed school system.

In the time following the Zuckerberg donation, Newark has often been cited by "reformers" as a proof point. It has a large and growing charter school sector, it implemented a teacher contract with merit pay, it has a universal enrollment system, it "renewed" public district schools by churning school leadership, it implemented Common Core early (allegedly), and so on.

So when research was released this fall that purported to show that students had made "educationally meaningful improvements" in student outcomes, "reformers" both in and out of New Jersey saw it as a vindication. Charter schools are not only good -- they don't harm public schools, because they "do more with less." Disruption in urban schools is good, because the intractable bureaucracies in these districts need to be shredded. Teachers unions are impeding student learning because we don't reward the best teachers and get rid of the worst...

And so on. If Newark's student outcomes have improved, it has to be because these and other received truths of the "reformers" must be true.

But what if the data -- including the research recently cited by Newark's "reformers" -- doesn't show Newark has improved? What if other factors account for charter school "successes"? What if the test score gains in the district, relative to other, similar districts, aren't unique or educationally meaningful? What if all the "reforms" supposedly implemented in Newark weren't actually put into place? What if the chaos and strife that have dogged Newark's schools during this "reform" period haven't been worth it?

What if Newark, NJ isn't an example of "reform" leading to success, but is instead a cautionary tale?

These are the questions we set out to tackle. And in the next series of posts here, I am going to lay out, in great detail, exactly what we found, and explain what the Newark "reform" experiment is actually telling us about the future of American education.

One other thing: as I have said before, the "reformers" often appear to misunderstand how research should be used to inform public policy. Often, you will hear them say some variation of this: "We have to do something to improve our schools -- and you can't prove my preferred reform won't make schools better!"

In fact, the burden of proof is on those who contend that charter school expansion, school closures/"renewals," merit pay, Common Core, universal enrollment, and so on will improve schools. If I or others present evidence that calls into question any of these policies, it is up to the promoters of those policies to show why that evidence is not germane.

For example: there is no question charter schools -- particularly the Newark charters run by big charter management organizations (CMOs) like TEAM/KIPP and North Star -- enroll very few Limited English Proficient (LEP) students. This puts the burden of educating these children on the public district schools. It explains why charters can spend less on instruction and student support services compared to NPS.

It is up to charter defenders to show why this doesn't matter when asserting charter claims of relatively better productivity. The burden of proof is on them, not on those who question their claims.

Much more to come...

