I will protect your pensions. Nothing about your pension is going to change when I am governor. - Chris Christie, "An Open Letter to the Teachers of NJ" October, 2009

Wednesday, April 27, 2016

The PARCC Silly Season

Miss me?

There's a lot to get to that I've had to miss over the past couple of weeks, and I'll get to it all in due time. But it looks like we're still in the middle of the standardized test silly season, where all sorts of wild claims about the PARCC and other exams are made by folks who have consistently demonstrated that they really know very little about what these tests are and what their scores tell us.

So let's go over it one more time:

- All standardized tests, by design, yield normal, "bell curve" distributions of scores.

I will be the first to say that tests can vary significantly in their quality, reliability, and validity. But they all crank out bell curve score distributions. When New York switched to its "new" tests in 2014, the score distributions looked pretty much the same as the distributions back in 2009.

Same with New Jersey -- just ask Bruce Baker. This is by design - the tests are scored so that a few kids get low scores, a few kids get high scores, and most get somewhere in the middle.

Do I need to point out the obvious? When a test's scores are normalized, someone has got to be "below average." The notion that everyone can be high achieving makes no sense when achievement is judged in relative terms.
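If that sounds abstract, here's a minimal sketch in Python. The scores are simulated, and the mean of 50, the spread of 10, and the 20-point "improvement" are all numbers I made up for illustration; none of this is real test data. It only shows that when everyone's score rises by the same amount, the share of students below average doesn't budge.

```python
import random
from statistics import mean

random.seed(0)  # fixed seed so the simulation is repeatable

# Simulated raw scores: mean 50 and spread 10 are invented for illustration.
raw_scores = [random.gauss(50, 10) for _ in range(10_000)]

def share_below_average(scores):
    """Fraction of students scoring below the group's own average."""
    avg = mean(scores)
    return sum(s < avg for s in scores) / len(scores)

# Hypothetical scenario: every single student improves by 20 points.
improved_scores = [s + 20 for s in raw_scores]

print(f"below average before: {share_below_average(raw_scores):.1%}")
print(f"below average after:  {share_below_average(improved_scores):.1%}")
```

Both lines should print a figure close to one half: everyone learned more, and roughly the same share of kids is still "below average."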

- Proficiency rates can be set any place those in power choose to set them.

You will hear reformy types say that proficiency rates tanked because the PARCC is a more "rigorous" test than what came before. We could actually have a debate about that -- if we were allowed to see the test. What isn't under debate, however, is that the proficiency rates are simply cut scores that can be set wherever those who have the power choose to set them. The NJASK and the PARCC yielded the same distribution of scores:

There was a bit of a ceiling effect on the old NJASK in some grades, but largely the distributions of the two tests are the same. All that changed was the cut score -- a score that could have been set anywhere.

The change in the test didn't cause the cut score to change; that was a completely different decision.
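Here's a rough sketch of that point in Python, assuming a perfectly normal score distribution. The mean of 740, the standard deviation of 30, and the three cut scores are hypothetical values I picked for illustration; they are not the actual PARCC or NJASK figures.

```python
from statistics import NormalDist

# Hypothetical score distribution: mean and spread are invented for illustration.
scores = NormalDist(mu=740, sigma=30)

def proficiency_rate(cut_score, dist=scores):
    """Share of students scoring at or above the cut score."""
    return 1.0 - dist.cdf(cut_score)

# Same students, same scores; only the cut score moves.
for cut in (710, 740, 770):
    print(f"cut score {cut}: {proficiency_rate(cut):.0%} proficient")
```

Under these assumptions the "proficiency rate" comes out at 84%, 50%, or 16%, depending entirely on where the line is drawn. Nothing about the students changed.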

- The new proficiency rates are largely based on the scores of tests that are similarly normalized.

The PARCC proficiency rates were set using other tests, like the ACT and SAT, that also yield normal distributions of test scores.

The purpose of the SAT and the ACT is to order and rank students so college admissions offices can make decisions -- not to determine whether students meet some sort of objective standard of minimally acceptable education.

Colleges want to be able to judge the relative likelihood of different students achieving success in their institutions. The SAT cut score of 1550 -- often reported as the "college and career ready standard" -- roughly represents a cut score where there's about a fifty-fifty chance of a student getting a B or higher in a freshman course at a selected sample of four-year colleges or universities (most of which have competitive admissions; some, like Northwestern and Vanderbilt, are extremely competitive).

Note that about one in three Americans holds a bachelor's degree. I am still waiting for my friends on the reformy side to reveal their plans to triple the number of four-year college seats in America. I'm also waiting to hear how much more they'll pay their own gardeners and dishwashers and home health care aides and garbage haulers when they all earn bachelor's degrees.

Oh, I forgot: these people don't rely on non-college educated workers. They clean their own offices and pick their own lettuce and bus their own dishes at their favorite restaurants...

Don't they?

- The idea that "proficiency" for all current students should be the cut score level attained by the top one-third of yesterday's students flies in the face of all reason.

Seriously: does anyone really think all students should achieve at an academic level that would track them toward getting a B in math or English at a competitive admission, four-year university? How does that make any sense?

But let's suppose by some miracle it actually happened -- then what? Again, are we going to admit everyone into a four-year college? Who's going to fund that?

Some folks say that I am consigning certain students to a life of low standards by pointing all this out. But I didn't make the system; I'm just describing it. When you turn human learning into bell curves, this is what you get: somebody's got to be on the left side. There's a serious conversation to be had about how these tests convert class and race advantage into "merit," but even if we removed all of the biases in these tests, somebody would still have to be getting less than average scores.

If you can't even acknowledge this, I can't even talk to you. And, speaking of class and race...

- The best predictor of a school's test scores is how many of its students are in economic disadvantage.

How many times must I show some variation of this?

Nothing predicts a test score as well as relative student economic status measures -- nothing. No one serious debates this anymore.

So why aren't we doing anything about poverty if we want to equalize educational opportunity?

- Standardized tests could yield the same information about school effectiveness with far less cost and intrusion.

All of the above said, I still believe there is a place for strong academic standards and standardized tests. The truth is that this country does have a history of accepting unequal educational opportunity, and it's hard to make a case for, say, adequate and equitable school funding without some sort of metric that shows how students compare in academic achievement.

And I don't even have a problem with test scores, properly controlled for student characteristics, as markers for exploring whether certain schools could improve compared to others. Compulsory actions on test scores are idiotic, but using the data to inform decisions? Fine.

But why must we test every child in every grade for an accountability measure? If we're trying to determine if a school has "failed," we could do so with far less cost, far less intrusion, and far less Campbell's Law-type corruption. If the point is to show inequities in the system, we could do that with a lot less testing than we're currently doing.

- Testing supporters should be more concerned with what happens after a test raises a red flag.

Once we identify the schools in question that are lagging, what's our response? No Child Left Behind said: "Choice! Private tutoring! Shut 'em down!"

Turns out that is some seriously weak-ass tea. "Choice" hasn't come close to creating the large societal changes its adherents promise. "Turnarounds" aren't working out well either.

If we really cared about equalizing educational opportunities for all children, we'd start doing some stuff that actually seems to work, like:

  • Lowering class sizes.
  • Elevating the teaching profession.
  • Spending more in our schools, especially the ones serving many children in disadvantage.
  • Dismantling institutional segregation.
  • Improving the lives of children and their families outside of school.

Of course, this would mean shifting some of the massive wealth accumulated by the wealthiest people in this country toward the people who actually do most of the work. Given the historic inequality this country faces, I think the rich folks who support outlets like The 74 and Education Post could handle keeping a little less for themselves.

Don't you?

As we come out of the PARCC silly season, it behooves us to ask: If these tests are so damn important for showing that America's schools are unequal, why don't we actually do some meaningful stuff to help them after we get the scores back? Why do we waste our time with reformy nonsense that doesn't work?

Like vouchers. Stand by...


Again, once you find the "failing" schools, the real question becomes: what are you going to do?
From 2004 to 2015, Karen DeJarnette was the director of planning, research, and evaluation in the Little Rock school district, where she was in charge of monitoring black student achievement. In her inspections, she found that some schools, predominantly in the poorer (and minority) parts of town, were plagued with mold and asbestos, had water that dripped through the ceiling, and, sometimes, lacked functioning toilets. Most of the subpar schools were in the east and south parts of town, where test scores were lower, which is no coincidence, she told me. “There was a direct correlation with under or poorly-resourced schools and poor results of students on standardized tests,” she said. 
DeJarnette pointed out the disparities in the reports she compiled for the district, but her comments weren’t acknowledged, she said. Instead, according to her, the board and administrators would talk about how badly some schools were performing, without talking about how under-resourced those schools were. [emphasis mine]

You don't cure a fever by yelling at the woman holding the thermometer.


c.l. ball said...

PARCC is not normed. This gives even more power to those setting the scales for the raw scores and the performance levels/cut scores.

The link near the NJASK v. PARCC graphs is broken, so why is the mean scale score for schools being reported rather than the distribution of student scores?

Showing a bivariate correlation between test scores and income does not mean that other explanations are not more powerful. We need a multivariate model to begin to make that claim.

I think you are burying the lead on what happens after the test part.

Duke said...

Hi c.l.:

1) I am with Dan Koretz: ALL tests are more or less "normed." You don't test people without setting some standard for them, and that standard, high or low, is based on a norm.

2) I am reporting schools because I don't have student data. Here's a graph from the NJDOE:


Normal distribution, of course.

3) I have run plenty of multivariate models, and posted the results here and many other places. I can say with great confidence that, at the school level, nothing we measure predicts test scores better than proxy measures of student economic disadvantage (assuming, of course, the sample has substantial variability in that measure).

4) I suppose I did bury the lead. But if you take this blog as a whole, the last thing you'd ever say was that I'm a guy who downplays the stuff I listed.

Melissa Westbrook said...

Jersey, I write/moderate a public ed blog in Seattle (Seattle Schools Community Forum.) I did a round-up of news on opting out and I included your thread. Here's what one of my readers said about it and I would love if you would give me an answer to fire back on (given the tone of the comment.) I don't know as much about standardized testing as you do.

"The Jersey Jazzman post is ridiculously funny. He clearly has no idea that there is a difference between norm-referenced tests and criterion-referenced tests. Not all standardized tests are "normed" and not all scores are distributed along a "normal" bell curve. PARCC and SBAC tests are criterion-referenced tests.

It's always amazing to me that people will so boldly put their ignorance on display."

Melissa Westbrook

Duke said...

Hi Melissa,

Your snarky friend will notice that I said the test yields normally distributed scores. This is not under debate. I did not say the test is norm-referenced, largely because that phrase is one I find to be meaningless in this conversation.

When I did my master's in education, I took my assessment course like everyone else. The textbook, like all texts on this subject that I have seen, talked about the difference between norm- and criterion-referenced tests. This is a standard paradigm presented to teachers-in-training, and it's admittedly helpful for understanding testing basics.

But it's not particularly sophisticated. Daniel Koretz is generally considered one of the nation's best experts on testing. I'll let him explain:


"Norm-referenced, Criterion-Referenced, and Standards-Based Assessments

Standards-based assessments are now in vogue, and disparagement of norm-referenced tests is commonplace. Much of the debate about these terms, however, is misleading. The differences between standards-based and other assessments are often greatly overstated.

Most test scores are in themselves arbitrary and meaningless numbers. How do you know whether a score is high or low? Is a score of 250 on the National Assessment high or low? There is nothing magical about the number 250; NAEP could just as easily have used a scale on which that same level of performance was labeled, say, 36, 900, or any other number. The only way to make sense of arbitrary scores is to compare them to something.

One obvious comparison is to the performance of others. How do you know that a four-minute mile is an extremely fast time? Because only the world’s best runners can reach that standard. How do parents and students know that an SAT score of 1500 is very good? Because they know that few students reach that level. [Koretz is surely talking about the 1600-maximum SAT - JJ]

That is really all that "norm-referenced" means. It means that scores on the test are reported in terms of how one score—of a student, a school, or whatever—compares to others. Some people misuse the term and believe it refers to conventional, multiple-choice achievement tests. It does not necessarily refer to such tests. The NAEP state assessment, for example, is a norm-referenced assessment; it allows states to see how they compare to others. The Third International Mathematics and Science Study (TIMSS) is perhaps the world’s most expensive set of norms, one in which the unit of comparison is not students or schools, but rather entire countries.

Continued in the next comment.

Duke said...

"Another way to make sense of scores is to compare them to a defined domain of knowledge or skills. For example, one might design a test to measure 10 skills and then report the percentage of those skills mastered. This is criterion-referenced testing.

Another approach, standards-based assessment, entails reporting scores in terms of judgments about how much students should know. Those judgments are typically called "performance standards," and "standards-based" assessments typically report the percentage of students who reach a given standard.

There is nothing inherently bad about any of these approaches. They are all natural, useful ways of making sense of test scores. They are not mutually exclusive. For example, the NAEP state assessments use both standards-based and norm-referenced reporting. That is, each state is told what percentage of its students reaches each of the NAEP performance standards, but states are also compared to each other in terms of both these percentages and simple average scores.

Indeed, it makes sense to use these forms of reporting together. For example, while it may be useful to set performance standards, it is not helpful to set them in a vacuum. It helps no one, for example, to tell a group of eighth-grade track team members that this year’s standard is a three-minute mile, and an end-of-year test that showed that all students failed to reach this standard would be foolish and uninformative. Any sensible coach would take norm-referenced information—information about the distribution of performance among eighth-grade athletes—into account when setting performance goals. Educational testing is no different. If we set performance standards so high, for example, that few students in even the highest-performing countries can meet them, we are probably accomplishing nothing.

Some people note that assessments designed for standards-based reporting are inherently different than those designed for norm-referenced reporting. For example, you might include on a standards-based or criterion-referenced test items that most students answer correctly or incorrectly, even though such items are next to useless if your goal is to be able to differentiate high-performing students from poorly performing ones. Similarly, in standards-based assessment, items may be chosen in part on the basis of how well they exemplify the verbal descriptions of the standards.

The differences between standards-based and norm-referenced testing, however, are often overstated. The National Assessment, for example, did not change dramatically when the National Assessment Governing Board decided to adopt standards-based reporting."

I think that's all that needs to be said about this, except for one thing: I showed my graphs of the score distributions. They are reasonably normal. Does your friend deny this?

Duke said...

One important point to add: the Seattle Schools Community Forum was one of the first links I put on my blogroll. Melissa Westbrook is a champion of public education and does great work. I am honored to have her repost my stuff.

jamescfinney said...

I generally agree with your points, particularly that the students can't all be above average. My only quibble is the choice of the normal distribution as a descriptor of either the raw test scores or the scaled scores that have been twisted to have a certain mean and deviation. In my experience, the skew in real data can be very interesting, and if I was sophisticated enough to figure out the importance, I can imagine that kurtosis might also be important.

Sara Calleja said...

Subject: Stop Inappropriate Testing in Our Public Schools!


Can you please share this petition with your network?

"Why is Pearson forcing teachers to sign gag orders and removing this analytical expose of how the PARCC test uses texts that are far above grade level and questions that do not represent the Common Core State Standards? https://celiaoyler.wordpress.com/2016/05/07/the-parcc-test-exposed/. Our children's educations are at stake---tell everyone this election year that we will NOT stand for secret, inappropriate demands being placed on our children while corporations profit off of orchestrated school failure! "
