Jersey Jazzman: June 2019

Sunday, June 30, 2019

The Facts About Newark's Schools: An Update

Thanks to, among other things, Cory Booker's presidential campaign, Newark's schools remain in the spotlight. Back in 2017, Bruce Baker and I released Newark’s Schools: The Facts in an attempt to give some context to the topic. The report is admittedly long, but the story of Newark's schools can't be told in a few talking points.

That said, if I had to boil down what we found, it would be the following:

- Newark's school system gained significant resource advantages over comparable districts, through the Abbott rulings -- a series of court cases that directed more funds to a select group of disadvantaged New Jersey school districts -- and through the proliferation of charter schools, especially those run by large charter management organizations (CMOs) such as KIPP and Uncommon Schools.
- Much of the vaunted "growth" in Newark's test outcomes can be explained by the transition from one form of the state test (NJASK) to another (PARCC). Other, similar districts close to Newark showed similar gains in student growth, suggesting Newark wasn't doing anything unique to realize its own modest gains.
- While Newark's charter schools have resource advantages, they aren't particularly efficient producers of student gains.
- Newark's high-profile charter schools enroll a fundamentally different type of student than the Newark Public Schools (NPS), the city's public school district. NPS enrolls more special needs students, especially those with costly disabilities. NPS enrolls far more Limited English Proficient students, and there are differences in socio-economic status, although this varies across the charter sector.
- Newark’s high-profile charters show substantial cohort attrition: many students leave between grades 7 and 12 and are not replaced. As those students leave, the relative test scores of those schools rise. These schools also have very high suspension rates.
- By their own admission, the high-profile charters focus intensely on passing the state tests. There is evidence they do not put as many resources into non-tested subjects as NPS.
- The charters have a unique resource model: they hire many teachers who are relatively inexperienced, yet are paid more relative to similar teachers. Those teachers, however, work longer days and years.

Now, these points stand in opposition to the conventional wisdom on Newark's schools, which says that superior leadership, superior instruction and curriculum, and a culture of "high expectations" have turned Newark's education system around.

Let me be very clear on this: I'm not saying the city didn't make improvements in the way it runs its schools or educates its students. I am saying, however, that the influence of charter school instructional practices and school/community leadership has been largely oversold. In addition: the gains the city has made do not come anywhere close to overcoming the disadvantages of inadequate funding, intense segregation, economic inequality, and structural racism Newark's beautiful and deserving children must contend with every day.

The only way to understand this reality is to take the time to understand the context in which Newark's schools operate. A lot of folks would rather not do that; they'd rather believe a few well-crafted data points designed to uphold a preferred narrative.

I find this point of view to be enormously frustrating. If we really care about improving the education of children in disadvantaged communities, we owe it to them to take the time to get the facts right. Otherwise, we're going to learn the wrong lessons and implement the wrong policies.

To that end, I want to set down, for the record, a few more critical facts about Newark schools:

- There is nothing special about Newark's graduation rate increases. Back in 2016, I wrote a brief that tested the claim that Newark's (and Camden's) graduation rates were growing at an accelerated rate. The problem with the claim, I noted, was that the city's graduation rate was being compared to the entire state's. But that comparison is invalid because many New Jersey high schools already have grad rates near 100 percent -- and you can't go higher than that!

And yet people who really should know better continue to make the claim. What they should be doing instead is comparing changes in Newark's graduation rate to changes for similar districts. Here's the latest data:

Yes, Newark's graduation rates have been climbing -- but so have the rates of similar districts. "DFG" stands for "District Factor Group"; DFG-A districts, like Newark and Camden, are the most disadvantaged districts in the state. Aside from an initial leap in 2012, there is little reason to believe Newark has seen outsized gains in graduation rates.
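To see why the choice of comparison group matters so much, here is a toy calculation. Every number in it is hypothetical and chosen only for illustration; none of it is actual New Jersey data, and the group names are placeholders.

```python
# Hypothetical graduation rates (start year, end year), in percent.
# These are illustrative placeholders, NOT actual NJ data.
rates = {
    "district": (65.0, 75.0),
    "similar_districts": (66.0, 75.5),
    "statewide": (90.0, 91.0),
}

# Percentage-point gain for each group.
gains = {name: end - start for name, (start, end) in rates.items()}

# Headroom: how far each group could possibly have risen before hitting 100.
headroom = {name: 100.0 - start for name, (start, _) in rates.items()}

gap_vs_state = gains["district"] - gains["statewide"]
gap_vs_peers = gains["district"] - gains["similar_districts"]

print(f"Gain relative to the state: {gap_vs_state:+.1f} points")
print(f"Gain relative to similar districts: {gap_vs_peers:+.1f} points")
```

In this made-up case the district looks like it is racing ahead of the state, but the statewide group started with only 10 points of headroom. Against districts that started in the same place, the apparent advantage nearly vanishes.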

I am increasingly convinced these gains are due to a policy of "credit recovery," where students at risk of dropping out receive alternate instruction -- often on-line -- in core subjects. The quality of this instruction, however, may not be very good. This is an area of policy crying out for a meaningful investigation.

- The claim that more Newark students are enrolled in schools that "beat the state average" is largely pointless, because you can get gains in this measure simply by shuffling around students. I went into detail about this in a post earlier this year:

This is a great example of a talking point designed to sell an agenda, as opposed to illuminating an issue. It sounds great: more kids are in "better" schools! But the fact that the measure can be improved without changing any students' outcomes renders it useless.
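A small sketch makes the shuffling problem concrete. The scores, the two schools, and the state average below are all invented for illustration; the function name is mine, not anything from an official report.

```python
def share_in_schools_above(schools, state_avg):
    """Fraction of all students enrolled in schools whose mean score beats state_avg."""
    total = sum(len(school) for school in schools)
    above = sum(len(school) for school in schools
                if sum(school) / len(school) > state_avg)
    return above / total

STATE_AVG = 50  # hypothetical state average score

# Eight hypothetical students, assigned to two schools.
before = [[70, 70, 20, 20], [40, 40, 40, 40]]  # school means: 45 and 40
# The same eight students with the same scores, reassigned.
after = [[70, 70, 40, 40], [20, 20, 40, 40]]   # school means: 55 and 30

share_before = share_in_schools_above(before, STATE_AVG)
share_after = share_in_schools_above(after, STATE_AVG)

print(share_before, share_after)
```

No student's score changes between the two arrangements, yet the share of students in schools that "beat the state average" goes from none to half.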

- Newark's demographics have changed; the student population is fundamentally different compared to other DFG-A districts. Take, for example, child poverty:

There's some noise here, but in general Newark's childhood poverty rate, which used to be high compared to other DFG-A districts, is now matching those districts. And there have been other demographic changes.

Newark's Hispanic student population has ticked up slightly (again, we've got some noise in the data). But the proportion of Hispanic students in the other DFG-A districts has risen substantially. And that means...

Newark's proportion of Limited English Proficient (LEP) students has stayed relatively constant, while the other DFG-A districts have seen big increases.

You might wonder why I'm comparing Newark to other DFG-A districts, or why I'm going back to around 2006. The NJ Children's Foundation, a new education "reform" group, released a report that asserts Newark has made outsized gains in outcomes since 2006. What the authors don't do, however, is present a clear picture of how the student populations may have changed.

The only attempt they make is to track free and reduced-price lunch eligibility; however, as I've pointed out before, this measure is increasingly unreliable as more districts, including Newark, have moved to universal enrollment for federally-subsidized meals.

There's little point in analyzing relative changes in student outcomes if student population characteristics are shifting. Before anyone claims that Newark has seen outsized success, they should at least acknowledge how its demographic changes may impact test scores.

- Newark has seen gains in test scores relative to the rest of the state; however, those gains are nowhere near enough to indicate the district has closed the opportunity gap. One of the most interesting graphs from the NJCF report was this (p. 16):

As Bruce and I noted in our 2017 report: the change from the NJASK to the PARCC in 2014-15 explains a lot of the "growth" in Newark's test outcomes. When you have a leap that large in one year, there's little reason to believe it's because your district suddenly and dramatically improved its instruction; more likely, the gains are due to Newark aligning its curriculum better to the PARCC than other districts. That's not necessarily because Newark's students are now better readers and mathematicians; they may simply be better test takers.

In any case, while the district improved its standing relative to other DFG-A districts, it made very little change in its standing relative to the entire state. Again: while Newark may have seen its poverty rate decline more than other districts, it's still a disadvantaged district compared to all districts.

Getting gains on tests is better than not getting gains, or suffering losses. But the magnitude of those gains is important. The "percentile rank"* of Newark compared to all of New Jersey has not shifted so significantly that the disadvantages from which the city's children suffer are being substantially mitigated.

- Newark's high-profile charter schools continue to have high rates of cohort attrition and/or student suspensions. Bruce Baker and I have made so many versions of this graph I've lost count:

North Star Academy Charter School, the Newark branch of Uncommon Schools, sees its cohorts -- the students in the "Class of 'xx" -- shrink substantially between Grades 5 and 12. The Class of 2019 had 254 students in Grade 5; that was down to 149 in Grade 12.
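For anyone who wants the arithmetic behind that shrinkage, here is a minimal sketch using the two enrollment figures just cited. The function name is my own; it simply expresses the loss as a share of the starting cohort and says nothing about why students left.

```python
def cohort_attrition(start_enrollment, end_enrollment):
    """Share of the starting cohort no longer enrolled at the end."""
    return (start_enrollment - end_enrollment) / start_enrollment

# North Star Class of 2019: 254 students in Grade 5, 149 in Grade 12.
north_star_2019 = cohort_attrition(254, 149)

print(f"{north_star_2019:.1%}")  # prints 41.3%
```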

TEAM Academy Charter School shows similar patterns. One of the arguments I hear is that the attrition occurs because students place into competitive magnet or private high schools. But the attrition occurs between all grade levels, not just Grades 8 and 9. Something else is going on.

This is a new metric from the NJDOE: total days of school missed due to out-of-school suspensions (OSS). I divided this total by enrollment (in hundreds) to get the average number of days missed per 100 students. The rate is higher at TEAM than at NPS, but far higher at North Star. There's good reason to believe the suspension rate contributes to the shrinking cohorts at North Star -- and that likely affects test scores.
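The rate calculation is simple enough to sketch. The function name and the example figures are hypothetical; they are not the NJDOE numbers behind the graph.

```python
def oss_days_per_100(total_oss_days, enrollment):
    """Days missed to out-of-school suspension per 100 enrolled students."""
    return total_oss_days / (enrollment / 100)

# Hypothetical school: 1,200 total OSS days across 4,000 students.
example_rate = oss_days_per_100(1200, 4000)

print(example_rate)  # prints 30.0
```

Scaling by enrollment is what lets a small charter network be compared with a district many times its size.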

- Newark's charter schools employ strategies to gain resource advantages that can't be scaled up. I included this table in my last report on NJ charter schools:

Let's break this down: the teaching staffs in Newark's charter schools are far less experienced than those at NPS. This, among other factors, explains why the costs per pupil are lower at Newark's charters. But, as we pointed out in 2017, the high-profile Newark charter schools tend to pay teachers more than NPS schools given the same level of experience. For this extra pay, the charter teachers work a longer school day and year.

In the absence of other learning opportunities, it's good for kids to have more time in school; unquestionably, the extra time helps boost test scores. But it's difficult to imagine a city the size of Newark could recruit and maintain an entire teaching staff with this little experience. If extra time in school is good for students, the practical solution is not to keep churning the teaching staff; instead, the district should pay teachers more to work longer hours and years.

As I've noted before, there's also a real question as to whether the charters are "free-riding" on public school district teacher wages. In other words: would these charters still be able to recruit enough inexperienced teachers if those teachers didn't see a transfer to a public school -- with fewer hours, better work conditions, and eventually better pay -- in their future?

Look, I am all for Newark talking up its educational successes. The students, families, educators, and community leaders should be proud of their schools -- and, yes, that includes the charter schools. But when we enter the realm of state and national education policy, we've got to be more judicious. We need to see the larger picture, with all of its subtleties.

Pre-digested talking points and incomplete analyses don't help us get where we need to be. We have to move past preferred narratives and look at all of the relevant facts. I'll keep putting them out there.

I hope some of you will challenge yourselves to listen.

* One of these days, we're going to have to have a chat about whether "percentile rank" is a good measure of relative standing. I would argue it spreads out what may be very small differences in outcomes. More to come...
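As a preview of that chat, here is a toy example of the spreading effect. It assumes one common definition of percentile rank (the percent of the group scoring strictly below a given value), and the district mean scores are invented.

```python
def percentile_rank(value, population):
    """Percent of the population scoring strictly below value (one common definition)."""
    below = sum(v < value for v in population)
    return 100 * below / len(population)

# Five hypothetical districts whose mean scale scores differ by a single point each.
district_means = [748, 749, 750, 751, 752]

ranks = {m: percentile_rank(m, district_means) for m in district_means}

for mean_score, rank in ranks.items():
    print(mean_score, rank)
```

A four-point spread in mean scores becomes an 80-point spread in percentile rank, because rank only records order, not distance.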

Monday, June 24, 2019

Things Education "Reformers" Still Don't Understand About Testing

There was a new report out last week from the education "reform" group JerseyCAN, the local affiliate of 50CAN. In an op-ed at NJ Spotlight, Executive Director Patricia Morgan makes an ambitious claim:

New Jersey students have shown significant improvements in English Language Arts (ELA) and math across the grade levels since we adopted higher expectations for student learning and implemented a more challenging exam. And these figures are more than just percentages. The numbers represent tens of thousands more students reading and doing math on grade level in just four years.

None of this has happened by accident. For several decades, our education and business community leaders have come together with teachers and administrators, parents and students, and other stakeholders to collaborate on a shared vision for the future. Together, we’ve agreed that our students and educators are among the best in the nation and are capable of achieving to the highest expectations. We’ve made some positive changes to the standards and tests in recent years in response to feedback from educators, students, and families, but we’ve kept the bar high and our commitment strong to measuring student progress toward meeting that bar.

A New Jersey high school diploma is indeed becoming more meaningful, as evidenced by the academic gains we’ve see year over year and the increase in students meeting proficiency in subjects like ELA 10 and Algebra I. Our state is leading the nation in closing ELA achievement gaps for African American and Hispanic students since 2015. [emphasis mine]

This is a causal claim: according to Morgan, academic achievement in New Jersey is rising because the state implemented a tougher test based on tougher standards. If there's any doubt that this is JerseyCAN's contention, look at the report itself:

The name of the new exam, the Partnership for Assessment of Readiness for College and Careers, or PARCC, has become a political lightning rod that has co-opted the conversation around the need for an objective measure of what we expect from our public school graduates.

This report is not about the politics of PARCC, but rather the objective evidence that our state commitment to high expectations is bearing fruit. As we imagine the next generation of New Jersey’s assessment system, we must build on this momentum by focusing on improvements that will help all students and educators without jeopardizing the gains made across the state.

That's their emphasis, not mine, and the message is clear: implementing the PARCC is "bearing fruit" in the form of better student outcomes. Further down, the report points to a specific example of testing itself driving better student outcomes:

Educators and district and state leaders also need many forms of data to identify students who need additional support to achieve to their full potential, and to direct resources accordingly. New Jersey’s Lighthouse Districts offer one glimpse into the way educators have used assessment and other data to inform instruction and improve student outcomes. These seven diverse school districts were named by the New Jersey Department of Education (NJDOE) in 2017 for their dramatic improvements in student math and ELA performance over time.

The K-8 Beverly City district, for example, has used assessment data over the past several years as part of a comprehensive approach to improve student achievement. Since 2015, Beverly City has increased the students meeting or exceeding expectations by 20 percentage points in ELA and by 15 in math.

These districts and schools have demonstrated how test results are not an endpoint, but rather a starting point for identifying areas of strength and opportunities for growth in individual students as well as schools and districts. This emphasis on using data to improve instruction is being replicated in schools across the state — schools that have now invested five years in adjusting to higher expectations and working hard to prepare students to meet them. [emphasis mine]

Go through the entire report and you'll note there is no other claim made as to why proficiency rates have risen over the past four years: the explicit argument here is that implementing PARCC has led to significant improvements in student outcomes.

Before I continue, stop for a minute and ask yourself this: does this argument, on its face, make sense?

Think about it: what JerseyCAN is claiming is that math and English Language Arts (ELA) instruction wasn't as good as it could have been a mere 4 years ago. And all that was needed was a tougher test to improve students' ability to read and do math -- that's it.

The state didn't need to deploy more resources, or improve the lives of children outside of school, or change the composition of the teaching workforce, or anything that would require large-scale changes in policy. All that needed to happen was for New Jersey to put in a harder test.

This is JerseyCAN's theory; unfortunately, it's a theory that shows very little understanding of what tests are and what they can do. There are at least three things JerseyCAN failed to consider before making their claim:

1) Test outcomes will vary when you change who takes the test.

JerseyCAN's report acknowledges that the rate of students who opt-out of taking the test has declined:

When the PARCC exam first rolled out in 2014-15, it was met with opposition from some groups of educators and parents. This led to an opt-out movement where parents refused to let their children sit for the test. Over the past four years, this trend has sharply declined. The chart below shows a dramatic increase in participation rates in a sample of secondary schools across four diverse counties. As schools, students, and families have become more comfortable with the assessment, participation has grown significantly.

We'll come back to that last sentence in a bit; for now, let's acknowledge that JerseyCAN is right: participation rates for the PARCC have been climbing. We don't have good breakdowns on the data, so it's hard to know if the participation rates are growing consistently across all grades and types of students. In the aggregate, however, there's no question a higher percentage of students are taking the tests.

The graph above shows participation rates in the past 4 years have climbed by 10 percentage points. Nearly all eligible students are now taking the tests.

So here's the problem: if the opt-out rates are lower, that means the overall group of students taking the test each year is different from the previous group. And if there is any correlation between opting-out and student achievement, it will bias the average outcomes of the test.

Put another way: if higher achieving kids were opting-out in 2015 but are now taking the test, the test scores are going to rise. But that isn't because of superior instruction; it's simply that the kids who weren't taking the test now are.
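Here is a minimal sketch of that scenario. The ten scores, the cut score, and the assumption that the two highest scorers were the ones opting out are all hypothetical; we don't know who actually opted out, and if lower scorers had been the ones sitting out, the bias would run the other way.

```python
def proficiency_rate(scores, cut_score):
    """Share of tested students scoring at or above the cut score."""
    return sum(s >= cut_score for s in scores) / len(scores)

CUT_SCORE = 60  # hypothetical proficiency threshold

# Ten hypothetical students whose true performance never changes.
all_students = [40, 45, 50, 55, 58, 62, 65, 70, 80, 90]

# Early year: assume (hypothetically) the two highest scorers opt out.
tested_early = all_students[:8]
# Later year: everyone sits for the test.
tested_later = all_students

rate_early = proficiency_rate(tested_early, CUT_SCORE)
rate_later = proficiency_rate(tested_later, CUT_SCORE)

print(rate_early, rate_later)
```

The proficiency rate climbs from 37.5 percent to 50 percent, and not one student learned anything new.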

Is this the case? We don't know... and if we don't know, we should be very careful about making claims about why test outcomes are improving.

2) Test outcomes will vary due to factors other than instruction.

We've been over this a thousand times on this blog, so I won't have to post yet another slew of scatterplots when I state: there is an iron-clad correlation between student socio-economic status and test scores. Which means that changes in economic conditions will likely have an effect on aggregate test outcomes.

When PARCC was implemented, New Jersey was just starting to come out of the Great Recession. Over the next four years, the poverty rate declined significantly; see the graph above.* Was this the cause of the rise in proficiency rates? Again, we don't know... which is why, again, JerseyCAN shouldn't be making claims about why proficiency rates rose without taking things like poverty into account.

These last two points are important; however, this next one is almost certainly the major cause of the rise in New Jersey PARCC proficiency rates:

3) Test outcomes rise as test takers and teachers become more familiar with the form of the test.

All tests are subject to construct-irrelevant variance, which is a fancy way of saying that test outcomes vary because of things other than the student abilities we are attempting to measure.

Think about a test of mathematics ability, for example. Now imagine a non-native English speaker taking that test in English. Will it be a good measure of what we're trying to measure -- mathematical ability? Or will the student's struggles with language keep us from making a valid inference about that student's math skills?

We know that teachers are feeling the pressure to have students perform well on accountability tests. We have evidence that teachers will target instruction on those skills that are emphasized in previous versions of a test, to the exclusion of skills that are not tested. We know that teaching to the test is not necessarily the same as teaching to a curriculum.

It is very likely, therefore, that teachers and administrators in New Jersey, over the last four years, have been studying the PARCC and figuring out ways to pump up scores. This is, basically, score inflation: The outcomes are rising not because the kids are better readers and mathematicians, but because they are better test takers.

Now, you might think this is OK -- we can debate whether it is. What we shouldn't do is make any assumption that increases in proficiency rates -- especially modest increases like the ones in New Jersey over the last four years -- are automatically indicative of better instruction and better student learning.

Almost certainly, at least part of these gains are due to schools having a better sense of what is on the tests, and adjusting instruction accordingly. Students are also more comfortable with taking the tests on computers, thus boosting scores. Again: this isn't the same as the kids gaining better math and reading skills; they're gaining better test-taking skills. And JerseyCAN themselves admit students "...have become more comfortable with the assessment." If that's the case, what did they think was going to happen to the scores?

It wasn't so long ago that New York State had to admit its increasingly better test outcomes were nothing more than an illusion. Test outcomes -- especially proficiency rates -- turned out to wildly exaggerate the progress the state's students had allegedly been making. You would think, after that episode, that people who position themselves as experts on education policy would exercise caution in interpreting proficiency rate gains immediately after introducing a new test.

I've been in the classroom for a couple of decades, and I will tell you this: improving instruction and student learning is a long, grinding process. It would be great if simply setting higher standards for kids and giving them tougher tests was some magic formula for educational success -- it isn't.

I'm all for improving assessments. The NJASK was, by all accounts, pretty crappy. I have been skeptical about the PARCC, especially when the claims made by its adherents have been wildly oversold -- particularly the claims about its usefulness in informing instruction. But I'm willing to concede it was probably better than the older test. We should make sure whatever replaces the PARCC is yet another improvement.

The claim, however, that rising proficiency rates in a new exam are proof of that exam's ability to improve instruction is just not warranted. Again, it's fine to advocate for better assessments. But I wish JerseyCAN would deploy some of its very considerable resources toward advocating for a policy that we know will help improve education in New Jersey: adequate and equitable funding for all students.

Because we can keep raising standards -- but without the resources required, the gains we see in tests are likely just a mirage.

ADDING: I probably shouldn't get into such a complex topic in a footnote to a blog post...

Yes, there is evidence accountability testing has, to a degree, improved student outcomes. But the validity evidence for showing the impact of testing on learning is... another test. It seems to me it's quite likely the test-taking skills acquired in a regime like No Child Left Behind transfer to other tests. In other words: when you learn to get a higher score on your state test, you probably learn to get a higher score on another test, like the NAEP. That doesn't mean you've become a better reader or mathematician; maybe you're just a better test taker.

Again: you can argue test-taking skills are valuable. But making a claim that testing, by itself, is improving student learning based solely on test scores is inherently problematic.

In any case, a reasonable analysis is going to factor in the likelihood of score inflation as test takers get comfortable with the test, as well as changes in the test-taking population.

* That dip in 2012 is hard to explain, other than some sort of data error. It's fairly obvious the decrease in poverty over recent years, however, is real.

Saturday, June 1, 2019

NJ's Student Growth Measures (SGPs): Still Biased, Still Used Inappropriately

What follows is yet another year's worth of data analysis on New Jersey's student "growth" measure: Student Growth Percentiles (SGPs).

And yet another year of my growing frustration and annoyance with the policymakers and pundits who insist on using these measures inappropriately, despite all the evidence.

Because it's not like we haven't looked at this evidence before. Bruce Baker started back in 2013, when SGPs were first being used in making high-stakes decisions about schools and in teacher evaluations; he followed up with a more formal report later that year.

Gerald Goldin, a professor emeritus at Rutgers, expressed his concerns back in 2016. I wrote about the problems with SGPs in 2018, including an open letter to members of the NJ Legislature.

But here we are in 2019, and SGPs continue to be employed in assessments of school quality and in teacher evaluations. Yes, the weight of SGPs in a teacher's overall evaluation has been cut back significantly, down to 5 percent -- but it's still part of the overall score. And SGPs are still a big part of the NJDOE's School Performance Reports.

So SGPs still matter, even though they have a clear and substantial flaw -- one acknowledged by their creator himself -- that renders them invalid for use in evaluating schools and teachers. As I wrote last year:

It's a well-known statistical precept that variables measured with error tend to bias positive estimates in a regression model downward, thanks to something called attenuation bias. Plain English translation: Because test scores are prone to error, the SGPs of higher-scoring students tend to be higher, and the SGPs of lower-scoring students tend to be lower.

Again: I'm not saying this; Betebenner -- the guy who invented SGPs -- and his coauthors are:

It follows that the SGPs derived from linear QR will also be biased, and the bias is positively correlated with students’ prior achievement, which raises serious fairness concerns....

The positive correlation between SGP error and latent prior score means that students with higher X [prior score] tend to have an overestimated SGP, while those with lower X [prior score] tend to have an underestimated SGP. (Shang et al., 2015)

Here's an animation I found on Twitter this year* that illustrates the issue:

Test scores are always -- always -- measured with error. Which means that if you try to show the relationship between last year's scores and this year's -- and that's what SGPs do -- you're going to have a problem, because last year's scores weren't the "real" scores: they were measured with error. This means the ability of last year's scores to predict this year's scores is under-estimated. That's why the regression line in this animation flattens out: the more error is added to last year's scores, the further the estimated correlation between last year and this year falls below what it actually is.**
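The flattening can be reproduced with a few lines of simulation. This is a sketch on made-up data with ordinary least squares, not the quantile regression SGPs actually use, and the sample size, error sizes, and random seed are all arbitrary choices of mine.

```python
import random

def ols_slope(x, y):
    """Slope of the least-squares line of y on x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    return sxy / sxx

rng = random.Random(2019)  # fixed seed so the run is repeatable
N = 20000

# "Real" prior achievement, which no test observes directly.
true_prior = [rng.gauss(0, 1) for _ in range(N)]
# This year's score: true slope of 1 on prior achievement, plus its own noise.
current = [t + rng.gauss(0, 0.5) for t in true_prior]

slopes = {}
for error_sd in (0.0, 1.0, 2.0):
    # Last year's observed score = real achievement + measurement error.
    observed_prior = [t + rng.gauss(0, error_sd) for t in true_prior]
    slopes[error_sd] = ols_slope(observed_prior, current)

for error_sd, slope in slopes.items():
    print(f"error sd {error_sd}: estimated slope {slope:.2f}")
```

With no error in last year's scores the estimated slope sits near the true value of 1. In this setup theory says the slope shrinks toward 1 / (1 + error_sd squared), so roughly 0.5 and 0.2 for the two noisier cases. Notice that the noise in this year's score is there in every run and doesn't cause the flattening.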

Again: The guy who invented SGPs is saying they are biased, not just me. He goes on to propose a way to reduce that bias which is highly complex and, by his own admission, never fully addresses all the biases inherent in SGPs. But we have no idea if NJDOE is using this method.

What I can do instead, once again, is use NJDOE's latest data to show these measures remain biased: Higher-scoring schools are more likely to show high "growth," and lower-scoring schools "low" growth, simply because of the problem of test scores being measured with error.

Here's an example:

There is a clear correlation between a school's average score on the Grade 5 math test and its Grade 5 math mSGP***. The bias is such that if a school has an average (mean) test score 10 points higher than another, it will have, on average, an SGP 4.7 points higher as well.

What happens if we compare this year's SGP to last year's score?

The bias is smaller but still statistically significant and still practically substantial: a 10-point jump in test scores yields a 2-point jump in mSGPs.

Now, we know test scores are correlated with student characteristics. Those characteristics are reported at the school level, so we have to compare them to school-level SGPs. How does that look? Let's start with race.

Schools with larger percentages of African-American students will see their SGPs fall.

But schools with larger percentages of Asian students will see their SGPs rise. Remember: when SGPs were being sold to us, we were told by NJDOE leadership at the time that these measures would "fully take into account socio-economic status." Is it true?

There is a clear and negative correlation between SGPs and the percentage of a school's population that is economically disadvantaged.

Let me be clear: in my own research work, I have used SGPs to assess the efficacy of certain policies. But I always acknowledge their inherent limitations, and I always, as best as I can, try to mitigate their inherent bias. I think this is a reasonable use of these measures.

What is not reasonable, in my opinion, is to continue to use them as they are reported to make judgments about school or school district quality. And it's certainly unfair to make SGPs a part of a teacher accountability system where high-stakes decisions are compelled by them.

I put more graphs below; they show the bias in SGPs varies somewhat depending on grade and student characteristics. But it's striking just how pervasive this bias is. The fact that this can be shown year after year should be enough for policymakers to, at the very least, greatly limit the use of SGPs in high-stakes decisions.

I am concerned, however, that around this time next year I'll be back with a new slew of scatterplots showing exactly the same problem. Unless and until someone in authority is willing to acknowledge this issue, we will continue to be saddled with a measure of school and teacher quality that is inherently flawed.

* Sorry to whoever made it, but I can't find an attribution.

** By the way: it really doesn't matter that this year's scores are measured with error; the problem is last year's scores are.

*** The "m" in "mSGP" stands for "median." Half of the school's individual student SGPs are above this median; half are below.

ADDITIONAL GRAPHS: Here is the full set of scatterplots showing the correlations between test scores and SGPs. SGPs begin in Grade 4, because the first year of testing is Grade 3, which becomes the baseline for "growth."

For each grade, I show the correlations between this year's test score and this year's SGP in math and English Language Arts (ELA). I also show the correlation between this year's SGP and last year's test score. In general, the correlation with last year's score is not as strong, but still statistically significant.

Here's Grade 4:

Grade 5:

Grade 6:

Grade 7:

I only include ELA for Grade 8, as many of those students take the Algebra I test instead of the Grade 8 Math test.

Here are correlations on race for math and ELA, starting with Hispanic students:

White students:

Asian students:

African American students:

Here are the correlations with students who qualify for free or reduced-price lunch:

Finally, students with learning disabilities:

This last graph shows the only correlation that is not statistically significant; in other words, the percentage of students with disabilities (SWDs) in a school does not predict that school's SGP for math.