I will protect your pensions. Nothing about your pension is going to change when I am governor. - Chris Christie, "An Open Letter to the Teachers of NJ" October, 2009

Saturday, August 10, 2013

Scoring NY Tests With the Triple Lindy!

One of my guilty pleasures is screwball comedies, and one of my favorites stars the great Rodney Dangerfield: Back To School. In the climactic scene of the movie, Rodney wins the big diving meet ("Hey gang, forget about the football game; let's get pumped for the big diving meet!") by flawlessly executing a dive that defies all known laws of physics: the Triple Lindy!

What makes the Triple Lindy so difficult is that Rodney starts out on the high platform, then has to bounce from board to board over and over again before he finally slices through the water. It's one massive leap after another...

Just like the scoring on the New York State exams.

By now, you've undoubtedly heard about the huge dive test scores took this year in New York, thanks to new exams and new scoring methods that supporters claim are much more "realistic." This year, 31 percent of students were deemed "proficient" in reading and math; last year, 55 percent passed reading, and 65 percent passed math. I've argued that this was a deliberate ploy to make New York's schools look as bad as possible; reformies hope to usher in an era of privatization and deunionization on the "evidence" of these scores.

But how did NYSED come up with their definition of "proficient"? How did they determine the cut scores - the scores students would have to earn to be deemed to have achieved one of the four levels of "proficiency"?

Leonie Haimson pointed me to a fascinating document from NYSED that shows the method to this madness. Just like Rodney in the diving meet, NYSED bounced around from one external benchmark to another, taking massive leaps each time to justify where it placed its cut scores. By executing its own Triple Lindy, NYSED moved from its goal of defining "proficiency" as "college and career ready" to cut scores on the tests for levels as low as 3rd grade.

How about we play Greg Louganis for a bit and break down exactly how NYSED pulls off their Triple Lindy:

Starting Platform: "College and Career Ready." As I blogged before, this is a phony and useless phrase meant to conflate the education necessary to obtain a four-year bachelor's degree with the education needed for a job that should pay well but often doesn't. Reformy NYSED Commissioner John King, reformy SecEd Arne Duncan, reformy Regents Chancellor Merryl Tisch, and many other reformy stalwarts say this is the goal for all children, which means, right off the platform, we have to make a massive leap:

Springboard #1: First Year Grade Point Average at a Four-Year College. As the NYSED document clearly shows, the benchmark used for "proficiency" on the state test is earning a B- or better in a freshman English course, or earning at least a C+ in a math course, at a four-year college or university.

For this benchmarking, NYSED relied on the College Board, the folks in charge of the SAT. Demonstrating that the SAT predicts college GPA is one of the most important research tasks of the College Board: if the SAT didn't show some correlation between its score and college grades, there wouldn't be much of a point to using it for college admissions.

So the College Board regularly publishes "validity research," designed to show how well SAT scores match up with first year college grades (keep in mind: the College Board is hardly a disinterested party in this research). When you read these studies, you'll see that the researchers draw from a sample of around one to two hundred colleges and universities, all granting four-year degrees; community colleges need not apply. The researchers match different courses to different sections of the SAT: the reading section, for example, is matched to history and English courses, while the math section is matched with (surprise!) math courses.

There is a correlation: somewhere around 0.5, which is geek-speak for a moderate relationship, but hardly air-tight (the practical meaning of this correlation is naturally a subject of great debate). This is, of course, to be expected. After all, would anyone say that an "A" in Stochastic Calculus at M.I.T. was equivalent to an "A" in Intro To Statistics For Social Sciences at SUNY Binghamton? For that matter, how consistent are the grades given at the same institution for the same class taught by two different professors?
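To make the "moderate relationship" point concrete, here is a minimal simulation sketch. It assumes nothing about the actual College Board data; it simply constructs two variables with a correlation of about 0.5 (the figure cited above) to show how much of the variation in one is left unexplained by the other:

```python
import random
import math

random.seed(42)

# Illustrative only: simulate standardized SAT scores and first-year
# GPAs built so that their correlation is about 0.5 by construction.
n = 10_000
r = 0.5

sat_z = [random.gauss(0, 1) for _ in range(n)]
# GPA = a dose of SAT "signal" plus independent noise, mixed so that
# corr(sat, gpa) comes out near r.
gpa_z = [r * s + math.sqrt(1 - r**2) * random.gauss(0, 1) for s in sat_z]

def corr(xs, ys):
    """Pearson correlation, computed from scratch."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs) / len(xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys) / len(ys))
    return cov / (sx * sy)

r_hat = corr(sat_z, gpa_z)
print(f"correlation ~ {r_hat:.2f}")
print(f"variance explained ~ {r_hat**2:.0%}")
```

A correlation of 0.5 means the test accounts for only about 25 percent of the variance in grades; the other 75 percent is everything the test doesn't capture.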

The notion that there is a uniform standard for a "B-" in freshman classes, consistent across course content, professors, and institutions, flies in the face of reason. But this is exactly what NYSED did: getting a "B-" in any course they choose with any professor at any college or university on their list is now equivalent to "college and career ready." Which gets us ready for our next big jump:

Springboard #2: SAT Scores. Again, there is a correlation between SAT scores and first-year college GPA. And I'm not saying SAT scores have no value for college admissions offices (again, let's put aside the controversy about using the SAT in college admissions). I am saying no admissions office I've ever heard of sets arbitrary cut scores for the SAT, because the test is hardly perfect when predicting college GPA. And, again, first-year GPA is not the only measure of college "success."

Unfortunately, NYSED ignored all these caveats when benchmarking the state tests: "Proficient," they say, "is equivalent to a 560 on the SAT Critical Reading section, a 530 in Writing, and a 540 in Math." We won't even get into the many ways the College Board and NYSED played around to get those numbers; for right now, it's enough to say that we've leapt from "college and career ready" to a largely arbitrary cut score on the SAT (and PSAT).

And, in fact, both the College Board researchers and NYSED acknowledge in the document that these cut scores on the SAT are prone to error. The SAT cut scores are matched to probabilities that a student will earn a specific grade in a college course. You'll notice there are no such nods to testing error on the NY state tests.

So now the SAT/PSAT cut scores are set; time to ricochet over to another measure:

Springboard #3: 8th Grade New York State Test Scores. Diane Ravitch points us to a terrific post from Dr. Maria Baldassarre-Hopkins, Assistant Professor of Language, Literacy and Technology at Nazareth College. Baldassarre-Hopkins served on the committee that made the recommendations to Commissioner King as to where to set the cut scores - after the tests had been administered and graded. In the comments below the post, Baldassarre-Hopkins confirms the SAT and PSAT were used as external benchmarks.

Baldassarre-Hopkins describes the process for setting cut scores, known as "bookmarking":

Cut scores were to be decided upon after NYS students in grades 3-8 took the tests.  By looking alternately at test questions, Common Core State Standards, Performance Level Descriptors, and other data-which-shall-not-be-named (thank you non-disclosure agreement!), 100 educators sat in four meeting rooms at the Hilton using a method known as “bookmarking” to decide what score a student needed to earn in order to be labeled as a particular “level” (i.e., 1-4).  How many questions does a student need to answer correctly in order to be considered a 3?  How about a 4?  2?  1?

In each room sat teachers, administrators and college faculty from across the state.  This mix made for some interesting discussion, heated debates, and a touch of hilarity.  There were smartly dressed psychometricians everywhere (i.e., Pearson stats people) and silent “gamemakers” unable to participate sitting in back of room looking on, clicking away on their laptops.  Sometimes they nodded to each other or whispered, other times they furrowed their brows, and at least twice when the tension was high in the room, one gamemaker (who I called “the snappy dresser” and others called “the Matrix guy”) stood up and leaned over the table like he was going to do something to make us rue the day.  I kept my eye on that one. [emphasis mine]
(Another guilty pleasure: The Matrix trilogy. Can't get enough.)

Now, here's the critical point to understand in all this: it's not that these people sat in the Hilton and used their expertise to say, for example: "A 'proficient' 7th grader should be able to do this, this, and that." No, what Baldassarre-Hopkins makes clear is that the distribution of passing scores for each item on the tests would determine whether passing that item demonstrated proficiency. In other words: a test question was considered "hard" or "easy" not because it required a particular skill, but because of how many students got it correct. In fact, Baldassarre-Hopkins tells us NYSED gave her group data that showed how many students passed each test question:

  • Time to “bookmark”:  Each of us would place a post-it note in the OIB on the last question a student at a particular threshold level of proficiency would have a 2/3 chance of answering correctly.
  • But before we began, we were told which page numbers correlate with external benchmark data (I could tell you what those data were, but then I would have to kill you).  So, it was sort of like this:  “Here is how students who are successful in college do on these ‘other’ assessments.  If you put your bookmark on page X for level 3, it would be aligned with these data.”
  • We had three rounds of  paging through the OIB, bookmarking questions, getting feedback data on our determined cut scores, and revising.  We had intense discussion as we began to realize the incredible weight of our task.  We were given more data in the form of p-values for each question in the OIB – the percentage of students who answered it correctly on the actual assessment. Our ultimate results were still not the final recommendation. [emphasis mine]
So here's what happened: this group was given the percentage of 8th Graders who passed each item on the test. They were told that students who meet the "B- in English/C+ in math" college grades standard - a standard set by SAT scores - should be put at a level 3 for the state test. Keep in mind that these are 8th graders, but the SAT is typically taken in 11th grade. How did they account for the discrepancy? No one's saying.
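The mechanics of that "2/3 chance" bookmark rule can be sketched with toy numbers. A caveat: real bookmark standard-setting uses IRT-based response probabilities for a hypothetical borderline student, not raw p-values; the simplification below (treating each item's p-value as that student's chance of answering correctly) and all the item data are invented for illustration:

```python
# Toy sketch of the "bookmarking" step: items are sorted easiest to
# hardest into an Ordered Item Booklet (OIB), and the bookmark goes
# on the last item a borderline student would answer correctly at
# least 2/3 of the time. Item IDs and p-values are made up.

items = [
    ("Q1", 0.92), ("Q2", 0.85), ("Q3", 0.78), ("Q4", 0.71),
    ("Q5", 0.66), ("Q6", 0.60), ("Q7", 0.52), ("Q8", 0.41),
]

# The OIB orders items from easiest to hardest.
oib = sorted(items, key=lambda item: item[1], reverse=True)

RP = 2 / 3  # the "2/3 chance" response-probability criterion

# Walk the booklet; the bookmark lands on the last item still at or
# above the 2/3 threshold. Everything after it is "too hard" for the
# borderline student at that proficiency level.
bookmark = None
for item_id, p in oib:
    if p >= RP:
        bookmark = item_id
    else:
        break

print(f"bookmark placed on {bookmark}")
```

Notice what drives the placement: the observed percentages of students answering correctly, not any independent judgment of what skill the item demands.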

We'll talk later about the issues with using the SAT, a normative test, as a benchmark for what should be a criteria-based test. But it's now time to take another leap: this time, to the other grade levels:

Springboard #4: 7th through 3rd Grade New York State Test Scores. Baldassarre-Hopkins makes clear that the cut scores for the other grades were set largely by comparing them to the 8th grade standard:
We began the bookmarking process with grade 8, later repeating the entire process for grades 7 and 6.  I will try to be as brief as possible:  

  • On our final day of bookmarking we came back to grade 8 (after the process took place for grades 6 and 7) and did one last round.  This 4th round determined the actual cut scores that would go to the commissioner as a recommendation.

I, along with the people in my room, completed this entire process for grades 6-8.   A group of educators in a similarly tiny room did the same for grades 3-5.  On day five, table leaders got together for vertical articulation.  This meant we looked across all of the cut scores to see whether or not there was some general consistency across all grades, 3-8. [emphasis mine]
I'll bet that was the most interesting day of the entire week: getting Grade 6 and Grade 5 to match up must have been a bear. In any case, I think Baldassarre-Hopkins gives us an important clue when she says her group worked from the top down. Grade 8 was informed by the external benchmarks - the SAT and PSAT. Grade 7 was then informed by grade 8. Grade 6 was informed by grade 7, and the elementary grades then were made to match up with the middle school grades. But it all started with the Grade 8 alignment with the SAT/PSAT.

Now, there's one other leap we have to make here - probably the biggest one of all:

Springboard #5: Teacher/Principal Evaluations, School Ratings, and Student Instruction Decisions. It would be one thing if these test scores didn't have high stakes consequences. The results would be published; Mike Bloomberg would spin them as he is wont to do; parents, teachers, principals and students would use the data to inform instruction and classroom practice as they saw fit; and that would be that.

But we live in John King's reformy New York, where test scores are used to evaluate teachers, dictate which schools should be closed, and determine whether children should be retained or given differentiated instruction. These test results have consequences - often serious consequences. So, even though the final cut scores are the result of a convoluted process, people's lives are profoundly affected by them.

Let's review the steps of the New York Test Score Triple Lindy one more time:

  • Start with "college and career ready," an ill-defined phrase that could mean just about anything.
  • Leap to freshman year GPA in selected courses at a limited number of four-year colleges. Could be graded on or off a curve (normative or criteria-based - more on this later); varies widely between professors, schools, and courses; doesn't necessarily indicate whether the student's entire college experience was "successful."
  • Spring to SAT/PSAT scores, somewhat correlated to first year college GPA, but a normative assessment (meaning a set number of students must score at each percentile - someone's got to lose). This is a test, by the way, tightly correlated to family income.
  • Bounce to 8th grade NY State test scores, which are given three years before the SAT.
  • Carom (got a thesaurus?) to 3rd through 7th grade NY State test scores, which would assume all children follow the same learning trajectory.
  • Jounce (SAT word!) to teacher/principal evaluations and school evaluations and student retention decisions.
May I give my informed opinion here? Pardon my technical language, but:

This is friggin' nuts.

No wonder John King, Merryl Tisch, Andrew Cuomo, and Arne Duncan can't get no respect. More to come...

I may get no respect, but I'm still "college and career" ready!

ADDING: In the middle of writing this, I realized where I had first heard this metaphor. Yeah, big surprise: it's you-know-who...


Tom Hoffman said...

When they went through the same process for the 11th grade math NECAPs a few years ago, they figured out that if 11th graders were going to pass the test at roughly the same rate that middle school students had been passing, the cut from "1" to "2" would have to come after the first item in the book, and from "2" to "3" (proficient) would be after item seven in a book with over 40 items.

You might not be surprised to know that they specifically withheld that information from the committee (but did note the omission in the technical documentation).

Also, you might not be surprised to know that the NECAP math scores featured prominently in justifying the Central Falls debacle and other high school closings and turnarounds in RI.


Marie said...

My eyes started to glaze over about halfway through this post. Your second to last line sums it up: "This is friggin' nuts!"

Anyone who would design a system like this clearly has no clue what makes kids tick. I am appalled.

KenS said...

So, doesn't all this mean that the NY State test scores are pegged to the performance of college freshmen who most likely managed to get those B-'s without having to endure learning under the Common Core?

Michael Fiorillo said...

Every self-respecting educator should boycott anything having to do with this insidious farce.

By participating, they give credence to the lie that teachers are integrally involved in the process.

jcg said...

OMG. It's inevitable. This Rube Goldberg will fail & displace schools, teachers & kids year, after year, after year.

Worse yet, the decisions are completely arbitrary.

nuff said said...

The misuse of these farcical scores will result in principals once again focusing all resources on ELA and math to the detriment of every student in the system. One can only hope that when parents get THEIR child's score, they reject the results for what they are: a hidden agenda. Sorry, but parents are smart enough to know that 75% of their children are NOT failures, as NYSED would have you believe. Who fixes your computer or phone when there is a problem? Your 10-year-old, that's who. Children are being persecuted beyond anything ever seen before, and it has to stop.

It's a good thing they tried to paint these scores as fact, because now the backlash may take hold in earnest and NCLB and RTT may be dumped for good. How many states want to go through the farce of Common Core if this is the result? We have seen this before: a few years ago the State rescored tests, retroactively failing kids who had already passed.

Remember one thing: this whole system is simply a bell curve, and you can move the line anywhere you want to get any numbers you want to justify anything you want. No matter what King and Tisch say, watch the repercussions and vilification of schools, teachers and administrators unfold over the weeks and months to come. Just watch.

Karen Foster said...

Just as NJ selects someone who is eyeball deep in this crap to send to the US Senate. Great.
