Jersey Jazzman: California Dreamin'

A big story in the LA Times about value-added assessment of teachers:

Seeking to shed light on the problem, The Times obtained seven years of math and English test scores from the Los Angeles Unified School District and used the information to estimate the effectiveness of L.A. teachers — something the district could do but has not.

The Times used a statistical approach known as value-added analysis, which rates teachers based on their students' progress on standardized tests from year to year. Each student's performance is compared with his or her own in past years, which largely controls for outside influences often blamed for academic failure: poverty, prior learning and other factors.

Though controversial among teachers and others, the method has been increasingly embraced by education leaders and policymakers across the country, including the Obama administration.

In coming months, The Times will publish a series of articles and a database analyzing individual teachers' effectiveness in the nation's second-largest school district — the first time, experts say, such information has been made public anywhere in the country.

This article examines the performance of more than 6,000 third- through fifth-grade teachers for whom reliable data were available.

There's quite a bit to discuss here.

The Times published the technical methods for the article. There are others better suited to analyze this, but a few things stick out right away to me:

An empirical concern is that if a school is particularly effective in teaching kindergarten and 1st grade students, then they may have less potential to improve the outcomes for students in 2nd through 5th grades. Alternatively, schools with poor kindergarten and 1st grade preparations may set the stage for strong performance in 2nd through 5th grades. In our analysis, we implicitly assume that school elementary school performance is relatively consistent across grades. [p. 2, emphasis mine]

Honestly, I don't even know what that means, but look at the example. Learning in primary grades isn't necessarily about how much stuff you learn - it is also about how you learn HOW to learn. A strong K-1 program may lead to BETTER gains in 2-5; a weak program may lead to WORSE gains. So you shouldn't "assume" anything. And if performance isn't consistent across classrooms, why would anyone believe it's consistent across grades? (Again, I really don't know what this phrase means, because the writing is so unclear.)

...standardized tests are imperfect measures of learning because students may misunderstand what is expected or because individual students may have test anxiety or other issues on the day of the test. Some of these problems will "average" out across students in a classroom or school. (p. 3)

That's a HUGE assumption! If a teacher teaches by simulating the standardized test in her classroom as much as possible so as to build up her students' comfort levels, her students will probably see gains in their scores, but that doesn't mean they've learned more - they've just become better test takers. Richard Buddin, the author, admits these are imperfect measures, yet these are the only ones he uses.
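
To make the distinction concrete, here is a minimal sketch in Python. It is not Buddin's model; the class size, the noise level, and the 5-point "test prep" bump are all numbers I made up for illustration. It only shows that random bad-day noise shrinks when you average over a classroom, while a shift that hits every student in the class the same way does not shrink at all.

```python
import random
import statistics


def classroom_average_error(n_students, noise_sd, shared_shift, rng):
    """Average gap between observed and true scores for one classroom.

    noise_sd is random student-level noise (test anxiety, a bad day).
    shared_shift is a hypothetical bump every student in the class gets,
    e.g. from heavy test-prep drilling. Both values are assumptions.
    """
    errors = [rng.gauss(0.0, noise_sd) + shared_shift for _ in range(n_students)]
    return statistics.mean(errors)


def simulate(shared_shift, n_classrooms=2000, n_students=25, noise_sd=10.0, seed=0):
    """Return (mean, standard deviation) of the classroom-average errors."""
    rng = random.Random(seed)
    averages = [
        classroom_average_error(n_students, noise_sd, shared_shift, rng)
        for _ in range(n_classrooms)
    ]
    return statistics.mean(averages), statistics.stdev(averages)


if __name__ == "__main__":
    for shift in (0.0, 5.0):
        center, spread = simulate(shift)
        print(f"shared shift {shift}: average error {center:.2f}, spread {spread:.2f}")
```

With no shared shift, the classroom averages cluster around zero, and the spread is much smaller than the student-level noise. With a shared shift, the averages cluster around the shift itself. Averaging only helps with the first kind of error.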

Value-added approaches are not intended to replace measures of student proficiency as indications of academic success, but the new approaches offer valuable insights into how districts might align their resources to improve the proficiency levels of all students. (p.3)

Which is why you've decided the correct way to offer these "valuable insights" is to publish them on a public website run by the West Coast's largest newspaper? Is anyone else bothered by this?

Again, I'm not qualified to wade through the math here and give an informed opinion (Bruce?). But I do want to point out something really important:

Students and teachers are not allocated randomly into schools or classrooms. Families with higher preferences for schooling will try to allocate their children in better schools or classrooms, principals may not allocate teachers to classrooms randomly, and good teachers may have more negotiation power to locate themselves in schools or classrooms with higher-achieving students. These choices will lead to endogeneity of observed inputs with respect to unobserved student and teacher inputs or endowments. (p. 4, emphasis mine)

The biggest problem I have with that paragraph is the use of the word "may." There is no doubt that students are not assigned to teachers randomly.
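
Here is a rough sketch, again in Python, of why that matters. It is a naive comparison of average gains, not the fixed-effects model in the paper, and every number in it is hypothetical. Two teachers have exactly the same true effect on their students. The only difference is whether students with more of some unobserved advantage (call it support at home) are spread evenly between them or sorted toward the first teacher.

```python
import random
import statistics


def estimated_gap(sorted_assignment, n_students=1000, true_effect=5.0, seed=1):
    """Difference in average student gains between two identical teachers.

    Each student's gain = the teacher's true effect (same for both teachers)
    + an unobserved 'support' term + random noise. All values are made up.
    """
    rng = random.Random(seed)
    # Unobserved input that adds to a student's yearly gain.
    support = [rng.gauss(0.0, 3.0) for _ in range(n_students)]
    if sorted_assignment:
        support.sort(reverse=True)  # high-support students go to the first teacher
    else:
        rng.shuffle(support)  # students are split between teachers at random

    half = n_students // 2

    def mean_gain(group):
        return statistics.mean(
            true_effect + s + rng.gauss(0.0, 2.0) for s in group
        )

    return mean_gain(support[:half]) - mean_gain(support[half:])


if __name__ == "__main__":
    print("random assignment gap:", round(estimated_gap(False), 2))
    print("sorted assignment gap:", round(estimated_gap(True), 2))
```

Both teachers are identical by construction, so any gap that shows up under sorted assignment is purely an artifact of who was put in whose class. Whether a real value-added model removes that artifact depends on how well it accounts for the unobserved inputs.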

Add to that (OK, way over my head here - probably about to make an ass of myself, but what the hell):

Following Todd and Wolpin (2003), let T_it be the test score measure of student i that is observed in year t and ε_it be a measurement error, and let X_it and ν_it represent observed and unobserved inputs for student i at time t. Finally, let μ_i0 be the student’s endowed ability that does not vary over time. (p. 3)

What do you mean, "... does not vary over time"? Are you telling me a child's ability is not dynamic? Seriously?

Same here with teacher quality:

The model includes individual student and teacher fixed effects (α_i and φ_j). Finally, ε_it contains individual and teacher time variant unobserved characteristics. (p. 4)

Buddin is saying (I think) that the changes in a child's life are basically written off as error in this model because they are "unobserved."

Is there anyone who has ever worked with children who thinks that they don't change over time, due to circumstances of their lives or the individual ways they mature? Do you think, for example, that a divorce might seriously affect a child's ability to learn? Has anyone ever known a "late-bloomer"?

So, we have a model here where teacher "value" is assessed without accounting for the changes in a child's life over the four years they are observed. We know that kids are not assigned to teachers randomly. We are assessing teacher "value" with an imperfect instrument that was not designed to measure teacher effectiveness in the first place. And we don't seem to much care about the interactions of the observed and unobserved variables.

And yet, we are now going to publish said "value" publicly with each teacher's name attached to his or her "value," and advocate for pay, tenure, and even job security based on this "assessment." This is journalism in America today.

By the way: how would you like to be the principal in an LA school this week? Think you might get a call or two from parents about who their kids' teachers are this fall? I doubt the principals of the LA schools are very happy with how little thought the publishers of the LA Times gave to the consequences of this report.

Again, Mathematica found error rates of 25% in value-added models assessing teacher effectiveness. Looking at the problems above, it's easy to see why, but here's the real-world upshot:

Even at Third Street Elementary in Hancock Park, one of the most well-regarded schools in the district, Karen Caruso stands out for her dedication and professional accomplishments.

A teacher since 1984, she was one of the first in the district to be certified by the National Board for Professional Teaching Standards. In her spare time, she attends professional development workshops and teaches future teachers at UCLA.

She leads her school's teacher reading circle. In her purse last spring, she carried a book called "Strategies for Effective Teaching."

Third Street Principal Suzie Oh described Caruso as one of her most effective teachers. But seven years of student test scores suggest otherwise.

In the Times analysis, Caruso, who teaches third grade, ranked among the bottom 10% of elementary school teachers in boosting students' test scores. On average, her students started the year at a high level — above the 80th percentile — but by the end had sunk 11 percentile points in math and 5 points in English.

Caruso said she was surprised and disappointed by her results, adding that her students did well on periodic assessments and that parents seemed well-satisfied.

"Ms. Caruso was an amazing teacher," said Rita Gasparetti, whose daughter was in Caruso's class a few years ago. "She really worked with Clara, socially and academically."

Are we really at the point where we are prepared to drum a Karen Caruso out of the profession based on statistical models with admitted errors?

Apparently so:

During recent classes observed by a reporter, Caruso set clear expectations for her students but seemed reluctant to challenge them. In reviewing new vocabulary, for instance, Caruso asked her third-graders to find the sentence where the word "route" appeared in a story.

"Copy it just like it's written," she instructed the class, most of whom started the year advanced for their grade.

"Some teachers have kids use new words in their own sentences," Caruso explained. "I think that's too difficult."

She dismissed the weekly vocabulary quizzes that other teachers give as "old school."

The writers here seem to feel they have the skills to assess what's "wrong" with Caruso, based on a short observation in her classroom. Do they have training in this? Were they teachers? Supervisors? Are they conversant in evaluation techniques?

Maybe, but maybe not. I do know LOTS of politicians, pundits, and radio loudmouths are absolutely convinced they can spot bad teaching, even though they have no training or experience. They are the ones pushing value-added assessment and merit pay while railing against collective bargaining and tenure. But I dare say very few of them know the implications of what they are pushing.

Don't get me wrong: I'm all for standardized testing. I just want the testing transparent and the tests evaluated thoroughly.

I'm all for getting bad teachers out of schools. I just want the teachers evaluated fairly.

I don't think tenure should protect anyone who is bad at their job. I just want politics out of schools, and tenure is the best way to do that.

And there is definitely a place for using statistics as a tool to point out who is and isn't doing their job in the classroom. But I damn sure don't think we should be publishing the results next to the names of teachers who haven't seen the data, especially when those results come from a model that wasn't peer-reviewed and has admitted major flaws.

This profession deserves more respect than that.

Jersey Jazzman

Sunday, August 15, 2010
