I will protect your pensions. Nothing about your pension is going to change when I am governor. - Chris Christie, "An Open Letter to the Teachers of NJ" October, 2009

Monday, July 22, 2019

How Student "Creaming" Works

There is, as usual, so much wrong in this Star-Ledger editorial on Camden's schools that it will probably take several posts for me to correct all of its mistakes. But there's one assertion, right at the very top, that folks have been making recently about Newark's schools that needs to be corrected immediately:
Last year, for the first time ever, the low-income, mostly minority kids in Newark charter schools beat the state’s average scores in reading and math in grades 3-8 – incredible, given the far more affluent pool of kids they were competing against.
This is yet another example, like previous ones, of a talking point that is factually correct but utterly meaningless for evaluating the effectiveness of education policies like charter schooling. It betrays a fundamental misunderstanding of test scores and student characteristics, which keeps the people who make statements like this from having to answer the questions that really matter.

The question in this case is: Do "successful" urban charter schools get their higher test scores, at least in part, by "creaming" students?

Creaming has become a central issue in the whole debate about the effectiveness of charters. A school "creams" when it enrolls students who are more likely to get higher scores on tests due to their personal characteristics and/or their backgrounds. The fact that Newark's charter schools enroll, as a group, fewer students with special education needs -- particularly high-cost needs -- and many fewer students who are English language learners is an indication that creaming may be in play.

The quote above, however, doesn't address this possibility. The SL's editors argue instead that these schools' practices have caused the disadvantaged children in Newark's charters to "beat" the scores of children who aren't disadvantaged. And because the students in Newark's charters are "beating the state's average scores," they must be "incredible."

Last month, I wrote about some very important context specific to Newark that has to be addressed when making such a claim. But let's set that aside and get to a more fundamental question: given the concerns about creaming, is the SL's argument -- that charter students "beat" the state average -- a valid way to assess these schools' effectiveness?

No. It is not.

Let's go through this diagram one step at a time. The first point we have to acknowledge is that tests, by design, yield a distribution of scores. That distribution is usually a "bell curve": a few students score high, a few score low, and most score in the middle.

This is the distribution of all test takers. But you could also pull out a subpopulation of students, based on any number of characteristics: race, gender, socio-economic status, and so on. Unless you delineate the subpopulation specifically on test scores, you're almost certainly going to get another distribution of scores.

Think of a school in a relatively affluent suburb, where none of the students qualify for free lunch (the standard measure of socio-economic status in educational research). Think of all the students in that school. Their test scores will vary considerably -- even if the school scores high, on average, compared to less-affluent schools. Some of the kids will have a natural affinity for doing well on tests; some won't. Some will have parents who place a high value on scoring well on tests; some parents will place less value on scoring well. The students will have variations in their backgrounds and personal characteristics that we can't see in the crude variables collected in the data; consequently, their scores will vary.

The important point is that there will be a range of scores in this school. Intuitively, most people will understand this. But can they make the next leap? Can they understand that there will also be a range of scores in a lower-performing school?

There is, in my opinion, a tendency for pundits who opine on education to sometimes see children in disadvantaged communities as an undifferentiated mass. They seem not to understand that the variation in unmeasured student characteristics can be just as great in a school located in a disadvantaged community as it is in an affluent community; consequently, the test scores in less-affluent schools will also vary.

The children enrolled in Newark's schools will have backgrounds and personal characteristics that vary widely. Some will be more comfortable with tests than others. Some will have parents who value scoring well on tests more than others. It is certainly possible that the variation in a disadvantaged school -- the shape of the bell curve -- will differ from the variation in affluent schools, but there will be variation.

In my graph above (which is simply for illustrative purposes) I show that the scores of disadvantaged and not-disadvantaged students vary. On average, the disadvantaged students will score lower -- but their scores will still vary. And because the not-disadvantaged students' scores will also vary, it is very likely that there will be some overlap between the two groups. In other words: there will be some relatively high-scoring students who are disadvantaged who will "beat" some relatively low-scoring students who are not disadvantaged.

And here's where the opportunity for creaming arises. If a charter school can find a way to get the kids at the top of the disadvantaged students' distribution to enroll -- while leaving the kids in the middle and the bottom of the distribution in the public district schools -- they will likely be able to "beat" the average of all test takers.
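To make the mechanism concrete, here's a minimal simulation. The means, standard deviations, and group sizes are purely hypothetical -- invented for illustration, not drawn from any actual Newark data:

```python
import random
import statistics

random.seed(42)

# Hypothetical score distributions (illustrative numbers only):
# disadvantaged students score lower on average, but both groups vary.
disadvantaged = [random.gauss(40, 10) for _ in range(10_000)]
advantaged    = [random.gauss(55, 10) for _ in range(10_000)]

state_average = statistics.mean(disadvantaged + advantaged)

# "Creaming": enroll only the top quartile of the disadvantaged distribution,
# leaving the middle and bottom in the district schools.
cutoff = statistics.quantiles(disadvantaged, n=4)[2]  # 75th percentile
creamed = [s for s in disadvantaged if s >= cutoff]

print(f"State average:           {state_average:.1f}")
print(f"Creamed charter average: {statistics.mean(creamed):.1f}")
# With these assumed parameters, the creamed group's average exceeds the
# state average even though every one of its students is "disadvantaged."
```

No claim here about how much creaming actually occurs in any real school -- the point is only that selective enrollment from the top of a lower-scoring distribution is enough, by itself, to "beat the state average."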

Is that what's happening in Newark? Again, the differences in the special education and English language learner rates suggest there is a meaningful difference in the characteristics of the student populations between charters and public district schools. But further opportunities for creaming come from separating students based on unmeasured characteristics.

For example: charter schools require that families apply for admission. It is reasonable to assume that there is a difference between a family that actively seeks to enroll their child in a charter, and a family that does not. Some of the "high-performing" charters in Newark have high suspension and attrition rates; this may send a signal to families that only a certain type of child is a good "fit" for a charter (some charter operators are quite honest about this). These schools also tend to have much longer school days and years; again, this may signal that only students who have the personal characteristics to spend the extra time in class should apply.

There is a very real possibility that these practices have led to creaming -- again, in a way that won't show up in the data. If the creaming is extensive enough -- and is coupled with test-prep instruction and curriculum, more resources, and a longer school day/year -- it wouldn't be too hard for a charter to "beat the state's average scores."

Is this a bad thing? That's an entirely different question. Given the very real segregation in New Jersey's schools, and the regressive slide away from adequate and equitable funding in the last decade, it's hard to find fault with Newark and Camden parents who want to get their children into a "better" school if they can. On the other hand, the fiscal pressures of chartering are real and can affect the entire system of schooling. Further, concentrating certain types of students into certain schools can have unexpected consequences.

A serious discussion of these issues is sorely needed in this state (and elsewhere). Unfortunately, because they refuse to acknowledge some simple realities, the Star-Ledger's editorial board once again fails to live up to that task. I'll get to some other mistakes they make in this piece in a bit.


Monday, July 8, 2019

Who Put the "Stakes" In "High-Stakes Testing"?

Peter Greene has a smart piece (as usual) about Elizabeth Warren's position on accountability testing. Nancy Flanagan had some smart things to say about it (as usual) on Twitter. Peter's piece and the back-and-forth on social media have got me thinking about testing again -- and when that happens these days, I find myself running back to the testing bible: Standards for Educational and Psychological Testing:
"Evidence of validity, reliability, and fairness for each purpose for which a test is used in a program evaluation, policy study, or accountability system should be collected and made available." (Standard 13.4, p. 210, emphasis mine)
This statement is well worth unpacking, because it dwells right in the heart of the ongoing debate about "high-stakes testing" and, therefore, influences even the current presidential race.

A core principle of psychometrics is that the evaluation of tests can't be separated from the evaluation of how their outcomes will be used. As Samuel Messick, one of the key figures in the field, put it:
"Hence, what is to be validated is not the test or observation device as such but the inferences derived from test scores or other indicators -- inferences about score meaning or interpretation and about the implications for action that the interpretation entails." [1] (emphasis mine)
He continues:
"Validity always refers to the degree to which empirical evidence and theoretical rationales support the adequacy and appropriateness of interpretations and actions based on test scores." [1] (emphasis mine)
I'm highlighting "actions" here because my point is this: You can't fully judge a test without considering what will be done with the results.

To be clear: I'm not saying items on tests, test forms, grading rubrics, scaling procedures, and other aspects of test construction can't and don't vary in quality. Some test questions are bad; test scoring procedures are often highly questionable. But assessing these things is just the start: how we're going to use the results has to be part of the evaluation.

Michael Kane calls on test makers and test users to make an argument to support their proposed uses of test results:
"To validate an interpretation or use of test scores is to evaluate the plausibility of the claims based on the test scores, and therefore, validation requires a clear statement of the claims inherent in the proposed interpretations and uses of the test scores. Public claims require public justification.
"The argument-based approach to validation (Cronbach, 1988; House, 1980; Kane, 1992, 2006; Shepard, 1993) provides a framework for the evaluation of the claims based on the test scores. The core idea is to state the proposed interpretation and use explicitly, and in some detail, and then to evaluate the plausibility of these proposals." [2]  (emphasis mine)
As I've stated here before: standardized tests, by design, yield a normal or "bell-curve" distribution of scores. Test designers prize variability in scores: they don't want most test takers at the high or low end of the score distribution, because that tells us little about the relative position of those takers. So items are selected, forms are constructed, and scores are scaled such that a few test takers score low, a few score high, and most score in the middle. In a sense, the results are determined first -- then the test is made.

The arguments some folks make about how certain tests are "better" than others often fail to acknowledge this reality. Here in New Jersey, a lot of hoopla surrounded the move from the NJASK to the PARCC; and then later, the change from the PARCC to the NJSLA. But the results of these tests really don't change much.

If you scored high on the old test, you scored high on the new one. So the issue isn't the test itself, because different tests are yielding the same outcomes. What really matters is what you do with these results after you get them. The central issue with "high-stakes testing" isn't the "testing"; it's the "high-stakes." 

So how are we using test scores these days? And how good are the validity arguments for each use?

- Determining an individual student's proficiency. I know I've posted this graphic on the blog dozens of times before, but people seem to respond to it, so...

"Proficiency" is not by any means an objective standard; those in power can set the bar for it pretty much wherever they want. Education officials who operate in good faith will try to bring some reason and order to the process, but it will always be, at its core, subjective.

In the last few years, policymakers decided that schools needed "higher standards"; otherwise, we'd be plagued by "white suburban moms" who were lying to themselves. This stance betrayed a fundamental misunderstanding of what tests are and how they are constructed. Again, test makers like variation in outcomes, which means someone has got to be at the bottom of the distribution. That isn't the same as not being "proficient," because the definition of "proficiency" is fluid. If it isn't, why can policymakers change it on a whim?

I'll admit I've not dug in on this as hard as I could, but I haven't seen a lot of evidence that telling a kid and her family that she is not proficient -- especially after previous tests said she was -- does much to help that kid improve her math or reading skills by itself. If the test spurs some sort of intervention that can yield positive results, that's good. 

But those of us who work with younger kids know that giving feedback to a child about their abilities is tricky business. A test that labels a student as "not proficient" may have unintended negative consequences for that student. A good validity argument for using tests this way should include an exploration of how students themselves will benefit from knowing whether they clear some arbitrary proficiency cut score. Unfortunately, many of the arguments I hear endorsing this use of tests are watery at best.

Still, as far as stakes go, this one isn't nearly as high as...

- Making student promotion or graduation decisions. When a test score determines a student's progression through or exit from the K-12 system, the stakes are much higher; consequently, the validity argument has to be a lot stronger. Whether grade retention based on test scores "works" is constantly debated; we'll save that discussion for another time (I do find this evidence to be interesting).

It's the graduation test that I'm more concerned with, especially as I'm from New Jersey and the issue has been a key education policy debate over the past year. Proponents of graduation testing never want to come right out and say this in unambiguous terms, but what they're proposing is withholding a high school diploma -- a critical credential for entry into the workforce -- from high school students who did all their work and passed their courses, yet can't pass a test.

I don't see any validity argument that could possibly justify this action. Again, the tests are set up so someone has to be at the bottom of the distribution; is it fair to deny someone a diploma based on a test that must have low scoring test takers? And no one has put forward a convincing argument that not showing proficiency in the Algebra I exam is somehow a justification for withholding a diploma. A decision this consequential should never be made based on a single test score.

- Employment consequences for teachers. Even if you can make a convincing argument that standardized tests are valid and reliable measures of student achievement, you haven't made the argument that they're measures of teacher effectiveness. A teacher's contribution to a student's test score only explains a small part of the variation in those scores. Teasing out that contribution is a process rife with the potential for error and bias.

If you want to use value-added models or student growth percentiles as signals to alert administrators to check on particular teachers... well, that's one thing. Mandating employment consequences is another. I've yet to see a convincing argument that firing staff solely or largely on the basis of error-prone measures will help much in improving school effectiveness.

- Closing and/or reconstituting schools. It's safe to say that the research on the effects of school closure is, at best, mixed. That said, it's undeniable that students and communities can suffer real damage when their school is closed. Given the potential for harm, the criteria for targeting a school for closure should be based on highly reliable and valid evidence.

Test scores are inevitably part of the evidence -- in fact, many times they're most of the evidence -- deployed in these decisions... and yet the policymakers who think school closure is a good idea almost never question whether those scores are valid measures of a student's chances of seeing their educational environment actually improve after a closure.

Closing a school or converting it to a charter is a radical step. It should only be attempted if it's clear there is no other option. There's just no way test outcomes, by themselves, give enough information to make that decision. It may well be that a school that is "failing" by one measure is actually educating students who started out well behind their peers in other schools. It may be that the school is providing valuable supports that can't be measured by standardized tests.

- Evaluating policy interventions. Using test scores to determine the efficacy of particular policy interventions is the bread-and-butter of labor economists and other quant researchers who work in the education field. I rarely see, however, well-considered, fully-formed arguments for the use of test outcomes in this research. More often, there is a simple assumption that the test score is measuring something that can be affected by the intervention; therefore, its use must be valid.

In other words: the argument for using test scores in research is often that they are measuring something: there is signal amid the noise. I don't dispute that, but I also know that the signal is not necessarily indicative of what we really want to measure. Test scores are full of construct-irrelevant variance: they vary because of factors other than the ones test-makers are trying to assess. Put another way: a kid may score higher than another not because she is a better reader or mathematician after a particular intervention, but because she is now a better test-taker.

This is particularly relevant when the effect sizes measured in research are relatively small. We see this all the time, for example, in charter school research: effect sizes of 0.2 or less are commonly referred to as "large" and "meaningful." But when you teach to the test -- often to the exclusion of other parts of the curriculum -- it's not that hard to pump up your test scores a bit relative to those who don't. Daniel Koretz has written extensively on this.
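For a sense of scale on those effect sizes: under a normal distribution, an effect of 0.2 standard deviations moves the average treated student from the 50th to roughly the 58th percentile -- a real but modest shift. A quick sketch of that arithmetic, using the standard normal CDF:

```python
from math import erf, sqrt

def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1 + erf(z / sqrt(2)))

# An effect size of 0.2 shifts the average treated student 0.2 standard
# deviations up the score distribution. Where does that land them?
effect_size = 0.2
new_percentile = normal_cdf(effect_size) * 100
print(f"{new_percentile:.0f}th percentile")  # roughly the 58th
```

An eight-percentile move is not nothing -- but it's worth keeping in mind when such effects are described as "large," especially given how easily test-prep alone can produce gains of this magnitude.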

These are only five proposed uses for test scores; there are others. But the initial reason for instituting a high-stakes standardized testing regime was "accountability." Documents from the early days of No Child Left Behind make clear that schools and school districts were the entities being held accountable. Arguably, so were states -- but primarily to monitor schools and districts.

I don't think anyone seriously thinks schools and districts -- and the staffs within them -- shouldn't be held accountable for their work. Certainly, taxpayers deserve to know whether their money is being used efficiently and effectively, and parents deserve to know whether their children's schools are doing their job. The question that we seem to have skipped over, however, is whether using standardized tests to dictate actions with high-stakes is a valid use of those tests' outcomes.

Yes, there are people who would do away with standardized testing altogether. But most folks don't seem to have a problem with some level of standardized testing, nor with using test scores as part of an accountability system (although many, like me, would question why it's only the schools and districts that are held accountable, and not the legislators and executives at the state and federal level who consistently fail to provide schools the resources they need for success). 

What they also understand, however -- and on this, the public seems to be ahead of many policymakers and researchers -- is that these are limited measures of school effectiveness, and that we are using them in ways that introduce corrupting pressures, which make schools worse. That, more than any problem with the tests themselves, seems to be driving the backlash against high-stakes testing.

As Kane says: "Public claims require public justification." The burden of proof, then, is on those who would use tests to take all sorts of highly consequential actions. Their arguments need to be made clearly and publicly, and they have an obligation to not only demonstrate that the tests themselves are good measures of student learning; they also have to argue convincingly that the results should be used for each separate purpose for which policymakers would use them.

I would suggest to those who question the growing skepticism of high-stakes testing: go back and look at your arguments for their use in undertaking specific actions. Are those arguments as strong as they should be? If they aren't, perhaps you should reconsider the stakes, and not just the test.

[1] Messick, S. (1989). "Validity." In R. Linn (Ed.), Educational measurement (3rd ed., pp. 13–100). Washington, DC: American Council on Education.

[2] Kane, M. (2013). Validating the interpretations and uses of test scores. Journal of Educational Measurement, 50(1), 1–73.

Sunday, June 30, 2019

The Facts About Newark's Schools: An Update

Thanks to, among other things, Cory Booker's presidential campaign, Newark's schools remain in the spotlight. Back in 2017, Bruce Baker and I released Newark’s Schools: The Facts in an attempt to give some context to the topic. The report is admittedly long, but the story of Newark's schools can't be told in a few talking points.

That said, if I had to boil down what we found, it would be the following:
  • Newark's school system gained significant resource advantages over comparable districts, through the Abbott rulings -- a series of court cases that directed more funds to a select group of disadvantaged New Jersey school districts -- and through the proliferation of charter schools, especially those run by large charter management organizations (CMOs) such as KIPP and Uncommon Schools.
  • Much of the vaunted "growth" in Newark's test outcomes can be explained by the transition from one form of the state test (NJASK) to another (PARCC). Other, similar districts close to Newark showed similar gains in student growth, suggesting Newark wasn't doing anything unique to realize its own modest gains.
  • While Newark's charter schools have resource advantages, they aren't particularly efficient producers of student gains.
  • Newark's high-profile charter schools enroll a fundamentally different type of student than the Newark Public Schools (NPS), the city's public school district. NPS enrolls more special needs students, especially those with costly disabilities. NPS enrolls far more Limited English Proficient students, and there are differences in socio-economic status, although this varies across the charter sector.
  • Newark’s high-profile charters show substantial cohort attrition: many students leave between grades 7 and 12 and are not replaced. As those students leave, the relative test scores of those schools rise. These schools also have very high suspension rates.
  • By their own admission, the high-profile charters focus intensely on passing the state tests. There is evidence they do not put as many resources into non-tested subjects as NPS. 
  • The charters have a unique resource model: they hire many teachers who are relatively inexperienced, yet are paid more relative to similar teachers. Those teachers, however, work longer days and years.
Now, these points stand in opposition to the conventional wisdom on Newark's schools, which says that superior leadership, superior instruction and curriculum, and a culture of "high expectations" have turned Newark's education system around.

Let me be very clear on this: I'm not saying the city didn't make improvements in the way it runs its schools or educates its students. I am saying, however, that the influence of charter school instructional practices and school/community leadership has been largely oversold. In addition: the gains the city has made do not come anywhere close to overcoming the disadvantages of inadequate funding, intense segregation, economic inequality, and structural racism Newark's beautiful and deserving children must contend with every day.

The only way to understand this reality is to take the time to understand the context in which Newark's schools operate. A lot of folks would rather not do that; they'd rather believe a few well-crafted data points designed to uphold a preferred narrative. 

I find this point of view to be enormously frustrating. If we really care about improving the education of children in disadvantaged communities, we owe it to them to take the time to get the facts right. Otherwise, we're going to learn the wrong lessons and implement the wrong policies.

To that end, I want to set down, for the record, a few more critical facts about Newark schools:

- There is nothing special about Newark's graduation rate increases. Back in 2016, I wrote a brief that tested the claim that Newark's (and Camden's) graduation rates were growing at an accelerated rate. The problem with the claim, I noted, was that the city's graduation rate was being compared to the entire state's. But that comparison is invalid because many New Jersey high schools already have grad rates near 100 percent -- and you can't go higher than that!

And yet people who really should know better continue to make the claim. What they should be doing instead is comparing changes in Newark's graduation rate to changes for similar districts. Here's the latest data:

Yes, Newark's graduation rates have been climbing -- but so have the rates of similar districts. "DFG" stands for "District Factor Group"; DFG-A districts, like Newark and Camden, are the most disadvantaged districts in the state. Aside from an initial leap in 2012, there is little reason to believe Newark has seen outsized gains in graduation rates.

I am increasingly convinced these gains are due to a policy of "credit recovery," where students at risk of dropping out receive alternate instruction -- often on-line -- in core subjects. The quality of this instruction, however, may not be very good. This is an area of policy crying out for a meaningful investigation.

- The claim that more Newark students are enrolled in schools that "beat the state average" is largely pointless, because you can get gains in this measure simply by shuffling around students. I went into detail about this in a post earlier this year:

This is a great example of a talking point designed to sell an agenda, as opposed to illuminating an issue. It sounds great: more kids are in "better" schools! But the fact that the measure can be improved without changing any students' outcomes renders it useless.

- Newark's demographics have changed; the student population is fundamentally different compared to other DFG-A districts. Take, for example, child poverty:

There's some noise here, but in general Newark's childhood poverty rate, which used to be high compared to other DFG-A districts, is now matching those districts. And there have been other demographic changes.

Newark's Hispanic student population has ticked up slightly (again, we've got some noise in the data). But the proportion of Hispanic students in the other DFG-A districts has risen substantially. And that means...

Newark's proportion of Limited English Proficient (LEP) students has stayed relatively constant, while the other DFG-A districts have seen big increases.

You might wonder why I'm comparing Newark to other DFG-A districts, or why I'm going back to around 2006. The NJ Children's Foundation, a new education "reform" group, released a report that asserts Newark has made outsized gains in outcomes since 2006. What the authors don't do, however, is present a clear picture of how the student populations may have changed. 

The only attempt they make is to track free and reduced-price lunch eligibility; however, as I've pointed out before, this measure is increasingly unreliable as more districts, including Newark, have moved to universal enrollment for federally-subsidized meals.

There's little point in analyzing relative changes in student outcomes if student population characteristics are shifting. Before anyone claims that Newark has seen outsized success, they should at least acknowledge how its demographic changes may impact test scores.

- Newark has seen gains in test scores relative to the rest of the state; however, those gains are nowhere near enough to indicate the district has closed the opportunity gap. One of the most interesting graphs from the NJCF report was this (p. 16):

As Bruce and I noted in our 2017 report: the change from the NJASK to the PARCC in 2014-15 explains a lot of the "growth" in Newark's test outcomes. When you have a leap that large in one year, there's little reason to believe it's because your district suddenly and dramatically improved its instruction; more likely, the gains are due to Newark aligning its curriculum better to the PARCC than other districts. That's not necessarily because Newark's students are now better readers and mathematicians; they may simply be better test takers.

In any case, while the district improved its standing relative to other DFG-A districts, it made very little change in its standing relative to the entire state. Again: while Newark may have seen its poverty rate decline more than other districts, it's still a disadvantaged district compared to all districts.

Getting gains on tests is better than not getting gains, or suffering losses. But the magnitude of those gains is important. The "percentile rank"* of Newark compared to all of New Jersey has not shifted so significantly that the disadvantages from which the city's children suffer are being substantially mitigated.

- Newark's high profile charters schools continue to have high rates of cohort attrition and/or student suspensions. Bruce Baker and I have made so many versions of this graph I've lost count:

North Star Academy Charter School, the Newark branch of Uncommon Schools, sees its cohorts -- the students in the "Class of 'xx" --  shrink substantially between Grades 5 and 12. The Class of 2019 had 254 students in Grade 5; that was down to 149 in Grade 12.
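The arithmetic on that cohort is straightforward:

```python
# North Star's Class of 2019, per the figures above:
grade5_enrollment = 254
grade12_enrollment = 149

attrition = (grade5_enrollment - grade12_enrollment) / grade5_enrollment
print(f"Cohort attrition, Grade 5 to Grade 12: {attrition:.0%}")  # 41%
```

In other words, roughly two of every five students in that cohort left and were not replaced.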

TEAM Academy Charter School shows similar patterns. One of the arguments I hear is that the attrition occurs because students place into competitive magnet or private high schools. But the attrition occurs between all grade levels, not just Grades 8 and 9. Something else is going on.

This is a new metric from the NJDOE: total days of school missed due to out-of-school suspensions (OSS). I divided this total by the number of students, then multiplied by 100, to get the average number of days missed per 100 students. The rate is higher at TEAM than at NPS, but far higher at North Star. There's good reason to believe the suspension rate contributes to the shrinking cohorts at North Star -- and that likely affects test scores.
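The metric itself is simple arithmetic; here's a sketch with invented figures (the actual NJDOE values for each school are in the chart above):

```python
# Hypothetical figures, for illustration only -- not any real school's data.
total_oss_days = 600   # total school days missed to out-of-school suspensions
enrollment = 1_200     # total students enrolled

# Days missed per 100 students: divide by (enrollment / 100),
# which is the same as dividing by enrollment and multiplying by 100.
days_per_100 = total_oss_days / (enrollment / 100)
print(days_per_100)  # 50.0 days missed per 100 students
```

Normalizing per 100 students makes schools of very different sizes comparable on the same scale.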

- Newark's charter schools employ strategies to gain resource advantages that can't be scaled up. I included this table in my last report on NJ charter schools:

Let's break this down: the teaching staffs in Newark's charter schools are far less experienced than those at NPS. This, among other factors, explains why costs per pupil are lower at Newark's charters. But, as we pointed out in 2017, the high-profile Newark charter schools tend to pay teachers more than NPS schools do at the same level of experience. In exchange for this extra pay, the charter teachers work a longer school day and year.

In the absence of other learning opportunities, more time in school is good for kids; unquestionably, the extra time helps boost test scores. But it's difficult to imagine a city the size of Newark recruiting and retaining an entire teaching staff with this little experience. If extra time in school is good for students, the practical solution is not to keep churning the teaching staff; instead, the district should pay teachers more to work longer hours and years.

As I've noted before, there's also a real question as to whether the charters are "free-riding" on public school district teacher wages. In other words: would these charters still be able to recruit enough inexperienced teachers if those teachers didn't see a transfer to a public school -- with fewer hours, better work conditions, and eventually better pay -- in their future?

Look, I am all for Newark talking up its educational successes. The students, families, educators, and community leaders should be proud of their schools -- and, yes, that includes the charter schools. But when we enter the realm of state and national education policy, we've got to be more judicious. We need to see the larger picture, with all of its subtleties.

Pre-digested talking points and incomplete analyses don't help us get where we need to be. We have to move past preferred narratives and look at all of the relevant facts. I'll keep putting them out there.

I hope some of you will challenge yourselves to listen.

* One of these days, we're going to have to have a chat about whether "percentile rank" is a good measure of relative standing. I would argue it spreads out what may be very small differences in outcomes. More to come...
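To see the concern, here's a toy example (entirely made-up scores) in which districts differ by less than a point on the raw scale but end up spread across most of the percentile range:

```python
# Ten districts with raw scores that differ by at most 0.9 points
scores = [750.1, 750.2, 750.3, 750.4, 750.5, 750.6, 750.7, 750.8, 750.9, 751.0]

def percentile_rank(value, population):
    # Percent of the population scoring strictly below this value
    below = sum(1 for s in population if s < value)
    return 100 * below / len(population)

# Near-identical raw scores map to percentile ranks spanning 0 to 90
ranks = [percentile_rank(s, scores) for s in scores]
print(ranks)  # 0.0, 10.0, 20.0, ... up to 90.0
```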

Monday, June 24, 2019

Things Education "Reformers" Still Don't Understand About Testing

There was a new report out last week from the education "reform" group JerseyCAN, the local affiliate of 50CAN. In an op-ed at NJ Spotlight, Executive Director Patricia Morgan makes an ambitious claim:
New Jersey students have shown significant improvements in English Language Arts (ELA) and math across the grade levels since we adopted higher expectations for student learning and implemented a more challenging exam. And these figures are more than just percentages. The numbers represent tens of thousands more students reading and doing math on grade level in just four years.
None of this has happened by accident. For several decades, our education and business community leaders have come together with teachers and administrators, parents and students, and other stakeholders to collaborate on a shared vision for the future. Together, we’ve agreed that our students and educators are among the best in the nation and are capable of achieving to the highest expectations. We’ve made some positive changes to the standards and tests in recent years in response to feedback from educators, students, and families, but we’ve kept the bar high and our commitment strong to measuring student progress toward meeting that bar.
A New Jersey high school diploma is indeed becoming more meaningful, as evidenced by the academic gains we’ve seen year over year and the increase in students meeting proficiency in subjects like ELA 10 and Algebra I. Our state is leading the nation in closing ELA achievement gaps for African American and Hispanic students since 2015. [emphasis mine]
This is a causal claim: according to Morgan, academic achievement in New Jersey is rising because the state implemented a tougher test based on tougher standards. If there's any doubt that this is JerseyCAN's contention, look at the report itself:
The name of the new exam, the Partnership for Assessment of Readiness for College and Careers, or PARCC, has become a political lightning rod that has co-opted the conversation around the need for an objective measure of what we expect from our public school graduates. 
This report is not about the politics of PARCC, but rather the objective evidence that our state commitment to high expectations is bearing fruit. As we imagine the next generation of New Jersey’s assessment system, we must build on this momentum by focusing on improvements that will help all students and educators without jeopardizing the gains made across the state.
That's their emphasis, not mine, and the message is clear: implementing the PARCC is "bearing fruit" in the form of better student outcomes. Further down, the report points to a specific example of testing itself driving better student outcomes:
Educators and district and state leaders also need many forms of data to identify students who need additional support to achieve to their full potential, and to direct resources accordingly. New Jersey’s Lighthouse Districts offer one glimpse into the way educators have used assessment and other data to inform instruction and improve student outcomes. These seven diverse school districts were named by the New Jersey Department of Education (NJDOE) in 2017 for their dramatic improvements in student math and ELA performance over time.29 
The K-8 Beverly City district, for example, has used assessment data over the past several years as part of a comprehensive approach to improve student achievement. Since 2015, Beverly City has increased the students meeting or exceeding expectations by 20 percentage points in ELA and by 15 in math.30 
These districts and schools have demonstrated how test results are not an endpoint, but rather a starting point for identifying areas of strength and opportunities for growth in individual students as well as schools and districts. This emphasis on using data to improve instruction is being replicated in schools across the state — schools that have now invested five years in adjusting to higher expectations and working hard to prepare students to meet them. [emphasis mine]
Go through the entire report and you'll note there is no other claim made as to why proficiency rates have risen over the past four years: the explicit argument here is that implementing PARCC has led to significant improvements in student outcomes.

Before I continue, stop for a minute and ask yourself this: does this argument, on its face, make sense?

Think about it: what JerseyCAN is claiming is that math and English Language Arts (ELA) instruction wasn't as good as it could have been a mere 4 years ago. And all that was needed was a tougher test to improve students' ability to read and do math -- that's it.

The state didn't need to deploy more resources, or improve the lives of children outside of school, or change the composition of the teaching workforce, or anything that would require large-scale changes in policy. All that needed to happen was for New Jersey to put in a harder test.

This is JerseyCAN's theory; unfortunately, it's a theory that shows very little understanding of what tests are and what they can do. There are at least three things JerseyCAN failed to consider before making their claim:

1) Test outcomes will vary when you change who takes the test.

JerseyCAN's report acknowledges that the rate of students opting out of the test has declined:
When the PARCC exam first rolled out in 2014-15, it was met with opposition from some groups of educators and parents. This led to an opt-out movement where parents refused to let their children sit for the test. Over the past four years, this trend has sharply declined. The chart below shows a dramatic increase in participation rates in a sample of secondary schools across four diverse counties. As schools, students, and families have become more comfortable with the assessment, participation has grown significantly.31
We'll come back to that last sentence in a bit; for now, let's acknowledge that JerseyCAN is right: participation rates for the PARCC have been climbing. We don't have good breakdowns of the data, so it's hard to know whether participation rates are growing consistently across all grades and types of students. In the aggregate, however, there's no question a higher percentage of students are taking the tests.

The graph above shows participation rates in the past 4 years have climbed by 10 percentage points. Nearly all eligible students are now taking the tests.

So here's the problem: if opt-out rates are lower, that means the overall group of students taking the test each year is different from the previous group. And if there is any correlation between opting out and student achievement, that change will bias the average outcomes of the test.

Put another way: if higher achieving kids were opting-out in 2015 but are now taking the test, the test scores are going to rise. But that isn't because of superior instruction; it's simply that the kids who weren't taking the test now are.

Is this the case? We don't know... and if we don't know, we should be very careful to make claims about why test outcomes are improving.

2) Test outcomes will vary due to factors other than instruction.

We've been over this a thousand times on this blog, so I won't have to post yet another slew of scatterplots when I state: there is an iron-clad correlation between student socio-economic status and test scores. Which means that changes in economic conditions will likely have an effect on aggregate test outcomes.

When PARCC was implemented, New Jersey was just starting to come out of the Great Recession. Over the next four years, the poverty rate declined significantly; see the graph above.* Was this the cause of the rise in proficiency rates? Again, we don't know... which is why, again, JerseyCAN shouldn't be making claims about why proficiency rates rose without taking things like poverty into account.

These last two points are important; however, this next one is almost certainly the major cause of the rise in New Jersey PARCC proficiency rates:

3) Test outcomes rise as test takers and teachers become more familiar with the form of the test.

All tests are subject to construct-irrelevant variance, which is a fancy way of saying that test outcomes vary because of things other than the student abilities we are attempting to measure.

Think about a test of mathematics ability, for example. Now imagine a non-native English speaker taking that test in English. Will it be a good measure of what we're trying to measure -- mathematical ability? Or will the student's struggles with language keep us from making a valid inference about that student's math skills?

We know that teachers are feeling the pressure to have students perform well on accountability tests. We have evidence that teachers will target instruction on those skills that are emphasized in previous versions of a test, to the exclusion of skills that are not tested. We know that teaching to the test is not necessarily the same as teaching to a curriculum.

It is very likely, therefore, that teachers and administrators in New Jersey, over the last four years, have been studying the PARCC and figuring out ways to pump up scores. This is, basically, score inflation: The outcomes are rising not because the kids are better readers and mathematicians, but because they are better test takers.

Now, you might think this is OK -- we can debate whether it is. What we shouldn't do is make any assumption that increases in proficiency rates -- especially modest increases like the ones in New Jersey over the last four years -- are automatically indicative of better instruction and better student learning.

Almost certainly, at least some of these gains are due to schools having a better sense of what is on the tests and adjusting instruction accordingly. Students are also more comfortable taking the tests on computers, which boosts scores. Again: this isn't the same as the kids gaining better math and reading skills; they're gaining better test-taking skills. And JerseyCAN itself admits students "...have become more comfortable with the assessment." If that's the case, what did they think was going to happen to the scores?

It wasn't so long ago that New York State had to admit its increasingly better test outcomes were nothing more than an illusion. Test outcomes -- especially proficiency rates -- turned out to wildly exaggerate the progress the state's students had allegedly been making. You would think, after that episode, that people who position themselves as experts on education policy would exercise caution in interpreting proficiency rate gains immediately after introducing a new test.

I've been in the classroom for a couple of decades, and I will tell you this: improving instruction and student learning is a long, grinding process. It would be great if simply setting higher standards for kids and giving them tougher tests was some magic formula for educational success -- it isn't.

I'm all for improving assessments. The NJASK was, by all accounts, pretty crappy. I have been skeptical about the PARCC, especially when the claims made by its adherents have been wildly oversold -- particularly the claims about its usefulness in informing instruction. But I'm willing to concede it was probably better than the older test. We should make sure whatever replaces the PARCC is yet another improvement.

The claim, however, that rising proficiency rates in a new exam are proof of that exam's ability to improve instruction is just not warranted. Again, it's fine to advocate for better assessments. But I wish JerseyCAN would deploy some of its very considerable resources toward advocating for a policy that we know will help improve education in New Jersey: adequate and equitable funding for all students.

Because we can keep raising standards -- but without the resources required, the gains we see in tests are likely just a mirage.

ADDING: I probably shouldn't get into such a complex topic in a footnote to a blog post...

Yes, there is evidence accountability testing has, to a degree, improved student outcomes. But the validity evidence for showing the impact of testing on learning is... another test. It seems to me it's quite likely the test-taking skills acquired in a regime like No Child Left Behind transfer to other tests. In other words: when you learn to get a higher score on your state test, you probably learn to get a higher score on another test, like the NAEP. That doesn't mean you've become a better reader or mathematician; maybe you're just a better test taker.

Again: you can argue test-taking skills are valuable. But making a claim that testing, by itself, is improving student learning based solely on test scores is inherently problematic.

In any case, a reasonable analysis is going to factor in the likelihood of score inflation as test takers get comfortable with the test, as well as changes in the test-taking population.

* That dip in 2012 is hard to explain, other than some sort of data error. It's fairly obvious the decrease in poverty over recent years, however, is real.

Saturday, June 1, 2019

NJ's Student Growth Measures (SGPs): Still Biased, Still Used Inappropriately

What follows is yet another year's worth of data analysis on New Jersey's student "growth" measure: Student Growth Percentiles (SGPs).

And yet another year of my growing frustration and annoyance with the policymakers and pundits who insist on using these measures inappropriately, despite all the evidence.

Because it's not like we haven't looked at this evidence before. Bruce Baker started back in 2013, when SGPs were first being used in making high-stakes decisions about schools and in teacher evaluations; he followed up with a more formal report later that year.

Gerald Goldin, a professor emeritus at Rutgers, expressed his concerns back in 2016. I wrote about the problems with SGPs in 2018, including an open letter to members of the NJ Legislature.

But here we are in 2019, and SGPs continue to be employed in assessments of school quality and in teacher evaluations. Yes, the weight of SGPs in a teacher's overall evaluation has been cut back significantly, down to 5 percent -- but it's still part of the overall score. And SGPs are still a big part of the NJDOE's School Performance Reports.

So SGPs still matter, even though they have a clear and substantial flaw -- one acknowledged by their creator himself -- that renders them invalid for use in evaluating schools and teachers. As I wrote last year:
It's a well-known statistical precept that variables measured with error tend to bias positive estimates in a regression model downward, thanks to something called attenuation bias. Plain English translation: Because test scores are prone to error, the SGPs of higher-scoring students tend to be higher, and the SGPs of lower-scoring students tend to be lower.

Again: I'm not saying this; Betebenner -- the guy who invented SGPs -- and his coauthors are:
It follows that the SGPs derived from linear QR will also be biased, and the bias is positively correlated with students’ prior achievement, which raises serious fairness concerns.... 
The positive correlation between SGP error and latent prior score means that students with higher X [prior score] tend to have an overestimated SGP, while those with lower X [prior score] tend to have an underestimated SGP. (Shang et al., 2015)
Here's an animation I found on Twitter this year* that illustrates the issue:

Test scores are always -- always -- measured with error. Which means that if you try to estimate the relationship between last year's scores and this year's -- and that's what SGPs do -- you're going to have a problem, because last year's scores weren't the "real" scores: they were measured with error. This means the ability of last year's scores to predict this year's scores is underestimated. That's why the regression line in this animation flattens out: as more error is added to last year's scores, the estimated correlation between last year and this year falls further below its true value.**
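Here's a quick simulation of that flattening -- my own sketch, not anything from the SGP literature. We build current-year scores that track "true" prior scores one-for-one, then regress them on a noisy, error-laden version of the prior scores:

```python
import random

random.seed(0)

def ols_slope(xs, ys):
    # Ordinary least squares slope: cov(x, y) / var(x)
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var = sum((x - mx) ** 2 for x in xs)
    return cov / var

# "True" prior-year scores; current-year scores track them with a true slope of 1.0
true_prior = [random.gauss(0, 10) for _ in range(5000)]
current = [x + random.gauss(0, 3) for x in true_prior]

# Observed prior scores = true scores + measurement error (same variance as true scores)
noisy_prior = [x + random.gauss(0, 10) for x in true_prior]

slope_clean = ols_slope(true_prior, current)   # close to the true slope of 1.0
slope_noisy = ols_slope(noisy_prior, current)  # attenuated toward roughly 0.5
```

With the error variance equal to the true-score variance, classical attenuation predicts the estimated slope falls to about half its true value; the more error you add, the flatter the line gets.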

Again: the guy who invented SGPs is saying they are biased -- not just me. He goes on to propose a way to reduce that bias, which is highly complex and, by his own admission, never fully addresses all the biases inherent in SGPs. But we have no idea whether NJDOE is using this method.

What I can do instead, once again, is use NJDOE's latest data to show these measures remain biased: Higher-scoring schools are more likely to show high "growth," and lower-scoring schools "low" growth, simply because of the problem of test scores being measured with error.

Here's an example:

There is a clear correlation between a school's average score on the Grade 5 math test and its Grade 5 math mSGP***. The bias is such that if a school has an average (mean) test score 10 points higher than another, it will have, on average, an SGP 4.7 points higher as well.

What happens if we compare this year's SGP to last year's score?

The bias is smaller but still statistically significant and still practically substantial: a 10-point jump in test scores yields a 2-point jump in mSGPs.

Now, we know test scores are correlated with student characteristics. Those characteristics are reported at the school level, so we have to compare them to school-level SGPs. How does that look? Let's start with race.

Schools with larger percentages of African American students tend to have lower SGPs.

But schools with larger percentages of Asian students tend to have higher SGPs. Remember: when SGPs were being sold to us, NJDOE leadership at the time told us these measures would "fully take into account socio-economic status." Is that true?

There is a clear and negative correlation between SGPs and the percentage of a school's population that is economically disadvantaged.

Let me be clear: in my own research work, I have used SGPs to assess the efficacy of certain policies. But I always acknowledge their inherent limitations, and I always, as best as I can, try to mitigate their inherent bias. I think this is a reasonable use of these measures.

What is not reasonable, in my opinion, is to continue to use them as they are reported to make judgments about school or school district quality. And it's certainly unfair to make SGPs a part of a teacher accountability system where high-stakes decisions are compelled by them.

I put more graphs below; they show the bias in SGPs varies somewhat depending on grade and student characteristics. But it's striking just how pervasive this bias is. The fact that this can be shown year after year should be enough for policymakers to, at the very least, greatly limit the use of SGPs in high-stakes decisions.

I am concerned, however, that around this time next year I'll be back with a new slew of scatterplots showing exactly the same problem. Unless and until someone in authority is willing to acknowledge this issue, we will continue to be saddled with a measure of school and teacher quality that is inherently flawed.

* Sorry to whomever made it, but I can't find an attribution.

** By the way: it really doesn't matter that this year's scores are measured with error; the problem is last year's scores are.

*** The "m" in "mSGP" stands for "median." Half of the school's individual student SGPs are above this median; half are below.
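So, for example, with a set of hypothetical individual student SGPs (invented values, each between 1 and 99):

```python
from statistics import median

# Hypothetical individual student SGPs for one school
student_sgps = [12, 35, 41, 48, 52, 60, 71, 77, 88]

school_msgp = median(student_sgps)
print(school_msgp)  # 52 -- half the student SGPs fall below, half above
```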

ADDITIONAL GRAPHS: Here are the full set of scatterplots showing the correlations between test scores and SGPs. SGPs begin in Grade 4, because the first year of testing is Grade 3, which becomes the baseline for "growth."

For each grade, I show the correlations between this year's test score and this year's SGP in math and English Language Arts (ELA). I also show the correlation between this year's SGP and last year's test score. In general, the correlation with last year's score is not as strong, but still statistically significant.

Here's Grade 4:

Grade 5:

Grade 6:

Grade 7:

I only include ELA for Grade 8, as many of those students take the Algebra I test instead of the Grade 8 Math test.

Here are correlations on race for math and ELA, starting with Hispanic students:

White students:

Asian students:

African American students:

Here are the correlations with students who qualify for free or reduced-price lunch:

Finally, students with learning disabilities:

This last graph shows the only correlation that is not statistically significant; in other words, the percentage of SWDs in a school does not predict that school's SGP for math.