I will protect your pensions. Nothing about your pension is going to change when I am governor. - Chris Christie, "An Open Letter to the Teachers of NJ" October, 2009

Wednesday, June 19, 2013

An Exchange With Jonah Rockoff

This is a long post, but I think it's valuable. Let me set it up a bit:

Earlier this month, I was given a remarkable piece of audio: testimony by Jonah Rockoff, professor of finance and economics at Columbia University, before the New Jersey State Board of Education. Rockoff's subject was the use of test score data in teacher evaluation, which is obviously a subject near and dear to my heart. I've reposted the audio below if you care to listen.

After I posted the audio, I wrote a series of posts on Dr. Rockoff's presentation: here, here and here. Bruce Baker at Rutgers also weighed in: I'd suggest Bruce's post stands as a comprehensive counterargument to the wholesale embrace of test-based teacher evaluation.

My posts, on the other hand, are a bit more parochial. Here in New Jersey, as in so many other states, our Department of Education is pushing a teacher evaluation system that has been, frankly, sold with a bunch of distortion, misinterpretation, and outright mistruth. I've been following their sales pitches carefully and it's apparent to me that these people are out of their league: they are trying to construct a plan without having the requisite training to come up with something that actually makes sense.

So that's the thrust of my posts: while there was much in Dr. Rockoff's testimony that is important, there is also a clear danger that his work is being abused by the ideologues at NJDOE and in the Christie administration (and, undoubtedly, in many other reformy circles) to push policies that are not in the best interests of New Jersey's students. I said my piece and left it at that, confident that plenty of folks at NJDOE would read my cautions (I know that you do!) and promptly ignore them.

But then I looked in my mailbox last week, and saw an email from Jonah Rockoff himself.

I'll be honest: I was more than a little surprised. Dr. Rockoff is undoubtedly a busy man, but - as you'll see - the emails he wrote me were quite lengthy and extremely well thought out. That he would take the time to engage a snarky teacher-blogger was, to me, unexpected. I take it as a sign of Rockoff's good will: I think he genuinely wants to engage in a good faith discussion with teachers about the consequences of his work.

I'll have more to say later, but here's the exchange. I've taken out some personal pleasantries so we can get right to the matter at hand.

**********

Jonah Rockoff, 6/10/13


Dear Jazzman,
(Apologies for not writing to you by name; I could not find your name in your blogs.)

I was looking to see if the NJ Board had posted information about my testimony and came across your blog post.  You made three points regarding my analogy of batting average in baseball.  Your points are well made but do not address the intended meaning of my testimony, and I’ve tried to clarify this below. 

In addition, if you post from this message, I would draw your attention to the first point I made during my testimony.  The goal should be to find ways to evaluate performance in meaningful ways.  An argument against introducing test-based measures is fine; it’s obviously important to have a healthy debate.  However, if such an argument is not accompanied by a proposal for some other means of measurement (which I did not see in your post, but maybe you have one) then it is, from a practical standpoint, an argument that current performance evaluation is meaningful and should not be changed.  Put differently, whether test-based measures have limits/drawbacks is not in dispute.  The key question is whether evaluation would be more meaningful than it is currently if this kind of quantitative performance measure was included, or if some other evaluation system that excludes tests would be even more meaningful.

Point #1: (A) “in baseball, many other quantitative measures are used”; (B) “it would be wildly expensive to create and collect so much data [for teachers]”; (C) “the plate appearances of a teacher in a test-based evaluation model come down to the number of students they teach”

(A) I use batting average when I give talks because it is likely the most well-known statistic.  You are right that there are many others, and they were created precisely because they were able to help evaluate player performance in ways that batting average does not. 

(B) The cost of collecting data on teacher performance may be high, but I would argue that good teaching is so valuable (and performance sufficiently varied) that it would be worth investing a lot in tools that provide more meaningful evaluation.  They do not have to be test-based.  However, as I stated in my testimony (this time using a surgical technique as an analogy), they should be valid in the sense that they are related to some student outcome that we care about.

(C) It’s unclear to me if the analogy that students = plate appearances is correct (why not individual test items? why not classrooms?).  Anyway, the important point is to understand how stable the test-based measures are.  They have year-to-year reliability of roughly 0.4, and as you say are typically based on “25 or 30” students.  Batting average, despite players having hundreds of at-bats, also only has reliability of roughly 0.4.  So, if you think test-based measures are too unstable to tell you something about performance, then you should think the same about batting average (and ERA for pitchers, which has reliability of around 0.35).  If you think batting average (or ERA) is stable enough to be informative, then you should think the same about test-based performance measures.

Point #2:  “Rockoff talks about a .300 batting average as the mark of a great hitter.  This is a cut score.”  “[SGP] necessitates the use of cut scores” “The NJDOE is proposing to take mSGP scores for teachers and convert them into ranked categories.”

As you point out, it would be silly to interpret someone batting .300 as much better than someone batting .299.  I didn’t have time to talk about this in my testimony, but I often do when I speak to groups of administrators/teachers and point out the drawbacks of cut scores.  I remarked that batting .300 is considered the mark of a great hitter because it is true (and helps with my analogy), not because .300 has any special statistical property.

It is incorrect to say SGP necessitates cut scores; SGP is a number and if it gets 30% of the weight in a teacher’s evaluation it could easily be converted into a score from 0-30 to be averaged with the factors making up the other 70%.  I also do not think NJDOE is putting the SGP into categories before combining it with the other (70%) measures.  That is not what New York State is doing and would, as you point out, discard valuable information.

As to whether categories are useful, that is a tough one.  In baseball, categorization is not often used except in cases of rare contract clauses; most accolades like “Gold Gloves” and “Silver Sluggers” (which are commonly linked to big bonuses in players’ contracts) are decided based on managers’ votes.  (Of course, the managers consider the statistics when they make their votes.  They also can’t vote for their own players, so they have little incentive to be untruthful.)  However, there are many other professions in which performance measures are used this way.  Many businesses give employees performance targets, like quarterly or annual sales numbers, rather than just continuous incentives (like a sales commission percentage). 

In addition, student performance is measured categorically all the time based on letter grades.  The student who got the lowest A- usually performed at basically the same level as the student who got the highest B+, and if teachers tweaked their grading (e.g. slightly changed the weighting of questions on the final exam) those students could easily swap places.  I point this out not because categories do not have their drawbacks, but because teachers should recognize that they use categories all the time.  I am not a fan of letter grades.  I would much prefer to give out continuous grades in my course, like a class rank, but changing the school’s grading system is above my pay grade.

Still, since categories are exceedingly common, we might ask why.  I think it is because they create clarity around incentives and administration in institutions where there is considerable regulation or not a lot of managerial accountability or manager-employee trust.  That doesn’t make categories ideal, but coming up with a state-wide teacher evaluation system that has rewards and consequences without using categories would have its own problems too.

Point #3: “It's quite conceivable that when a hitter is hovering around the .300 level, he starts changing his behavior.”

We have plenty of examples of Campbell’s law and I would not disagree that this is a concern.  The main question is whether the behavioral changes reduce the value of the performance measure to the point at which it is no longer useful.  Nobody can know the answer to this question ex ante.  A broader point is that if we are going to use test-based measures, we ought to (1) make sure the material being tested represents the knowledge/skills we want students to acquire and (2) give tests where the best thing a teacher can do to get their students to perform well is to do a good job teaching them the material. 

Derek Neal from the University of Chicago has written a lot on the second issue.  Notably, he proposes that teachers/schools be given very little information about testing format in advance and that the format be allowed to change from year to year, so that test-prep is not a worthwhile use of teachers’ time.   Theoretically I think this is a nice argument, but I don’t see teachers and principals going for it.  (I am imagining my students’ reactions if I refused to tell them anything about the format of the final exam and changed it from year to year.)

Additional point: “Rockoff says that if you are in the bottom quartile of hitters in the majors, you don't get to bat. But there are terrible batters in the National League who bat regularly throughout the season: they're called pitchers.”

Good point, but they only hit because the league says they must.  No team would let their pitcher bat if they didn’t have to do so.  If we extend your analogy to teachers, it would be that someone is terrible at giving instruction but does something else that is very valuable to the school and therefore is allowed in the classroom.  Perhaps some regulation does not allow them to be removed from the classroom while still contributing to the school in another way, but my view would then be that the regulation is the problem.

Happy to clarify anything further if you are interested.

***********

Jersey Jazzman, 6/10/13

Before I answer your comments, I should tell you that I've posted again on your testimony here:


I will warn you: I am more than a little tart in this next post, but I think I have very good reason. You may or may not be aware of the debate here in New Jersey over SGPs. With all due respect, I think you muddied the issue further when you implied in your testimony that SGPs are designed to assess whether a student is "growing" toward a "fair benchmark."

It is quite clear that this is NOT what NJDOE is proposing at all. The mSGP scores my colleagues will receive - like the mSGPs for our schools that were included in the state schools "report cards" - are clearly normative measures. In every presentation I know of by NJDOE, and in all materials I have reviewed from the state, it has been made clear the SGPs will simply measure a student's "growth" relative to his or her "peers." While Damian Betebenner's technical overview includes a method of judging a student's growth toward an "absolute" goal over time and expressing that in percentiles, it's clear this is not what NJDOE plans to do with regard to individual teacher evaluations.

This means, as I say in my post, that a teacher whose students all exhibit "growth" and who are all meeting proficiency levels (even advanced proficiency levels) can still be found to be deficient under AchieveNJ. It's New York City all over again - probably worse, because there are no covariates for student characteristics involved.
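For readers following along at home, here's a deliberately simplified sketch of how an SGP works. Betebenner's real method uses quantile regression; this toy version just bins students by prior score and ranks them within bins, and every number in it is made up for illustration:

```python
import numpy as np

# A deliberately simplified sketch of the SGP idea. Betebenner's actual
# method uses quantile regression; this toy version just bins students
# by prior score and ranks them within each bin. All numbers here are
# made up for illustration.
rng = np.random.default_rng(0)

n = 10_000
prior = rng.normal(500, 100, n)                  # last year's scale scores
current = 0.7 * prior + rng.normal(200, 60, n)   # this year: everyone "grows"

def simple_sgp(prior, current, n_bins=20):
    """Percentile rank of each student's current score among peers
    with similar prior scores."""
    edges = np.quantile(prior, np.linspace(0, 1, n_bins + 1))
    which = np.clip(np.digitize(prior, edges) - 1, 0, n_bins - 1)
    sgp = np.empty(len(prior))
    for b in range(n_bins):
        idx = np.where(which == b)[0]
        ranks = current[idx].argsort().argsort()   # 0 .. len(idx)-1
        sgp[idx] = 100 * (ranks + 0.5) / len(idx)
    return sgp

print(np.median(simple_sgp(prior, current)))   # ~50, by construction
```

No matter how much the whole cohort improves, half the students - and so, roughly, half the teachers - must land below the median. That's what "normative" means here.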

Further, there is a significant correlation at the school level between SGPs and student characteristics.


Dr. Rockoff, I have been accused more than once of shooting myself in the foot with my tone. But I ask you to understand where my frustration - which drives my prose - comes from. I am a working teacher, and I and my colleagues are watching a debate swirling around us (a debate where we often have very little say) where radical changes are being proposed to our profession on the basis of work like yours. I will be the first to say that I agree with your conclusions from your last paper: teachers matter, there is a variation in teacher quality, it does affect student performance (how much is still up for debate), and the consequences extend past the school years (again, the intensity of those consequences is debatable).

But it is a huge, unwarranted leap to go from your research to a teacher evaluation system like AchieveNJ, which FORCES administrators to act on the basis of test score analysis. There are all sorts of unintended consequences that may come from AchieveNJ that your research simply hasn't covered - how could it? 

I'll leave my comments below. Again, I greatly appreciate the time you took to write.

[What follows are my comments on Dr. Rockoff's original email: the original is in italics, and my comments are in plain text.]

In addition, if you post from this message, I would draw your attention to the first point I made during my testimony.  The goal should be to find ways to evaluate performance in meaningful ways.  An argument against introducing test-based measures is fine; it’s obviously important to have a healthy debate.  However, if such an argument is not accompanied by a proposal for some other means of measurement (which I did not see in your post, but maybe you have one) then it is, from a practical standpoint, an argument that current performance evaluation is meaningful and should not be changed.  Put differently, whether test-based measures have limits/drawbacks is not in dispute.  The key question is whether evaluation would be more meaningful than it is currently if this kind of quantitative performance measure was included, or if some other evaluation system that excludes tests would be even more meaningful.

I do have one, but let me say this: the burden of proof is not on me. I am asking what I think are legitimate questions about AchieveNJ and other test-based teacher evaluation proposals. I don't believe I have any affirmative obligations here; the burden of proof should be on those who are claiming that a system like AchieveNJ will improve student performance. 

I say this because I think the notion of teacher quality being the most important factor in student learning has been completely overblown. You and I both know that school-based factors pale in comparison to student characteristics and home environment as inputs in determining variance in student learning. And yet it seems the entire debate right now revolves SOLELY around the quality of the teacher. "Oh, yes, poverty matters, and we should do something about it, but..." is about all the acknowledgment you will find from those who push systems like AchieveNJ.

I've come around in recent years to this position: like your hypothetical baseball front office, I'd want as much data as I could get on my players before making personnel decisions. But what I would never want is to be FORCED to act on any particular data in a particular way. But this is exactly where we are headed, in New Jersey at least.

It seems to me that it would be far more appropriate for the state to make its data and analyses available to districts, and those districts could then use it as a screening mechanism. The state could also use it for screening as an accountability measure for school or district performance.

What should NOT happen is what's being proposed: a rigid, inflexible percentage of an evaluation with a series of actions that are REQUIRED to be taken if a particular target is not met.

(A) I use batting average when I give talks because it is likely the most well-known statistic.  You are right that there are many others, and they were created precisely because they were able to help evaluate player performance in ways that batting average does not. 

But if you were a manager, would you like to be FORCED to have 35% or 30% (or whatever number Commissioner Cerf decides he likes today) of every personnel decision you make based on ONE statistic like batting average? Which, I think we can agree, is a far less noisy measure of a player's skill than a student's score on a bubble test is of a teacher's effectiveness?

(B) The cost of collecting data on teacher performance may be high, but I would argue that good teaching is so valuable (and performance sufficiently varied) that it would be worth investing a lot in tools that provide more meaningful evaluation.  They do not have to be test-based.  However, as I stated in my testimony (this time using a surgical technique as an analogy), they should be valid in the sense that they are related to some student outcome that we care about.

Dr. Rockoff, do you think there are many potentially great teachers waiting in the wings to replace the ones who will be fired under a system like AchieveNJ? Do you think we have many teachers who could be great, but choose not to be because they don't get "meaningful" evaluations like they will get in AchieveNJ?

Again: teaching matters. No one gets that more than a teacher. But weighing the pig doesn't fatten it up - especially when the scale is busted (better to say "noisy," but that would ruin my metaphor).

Your paper with Drs. Chetty and Friedman implied that the value of a "great" teacher was somewhere around a quarter million dollars per classroom. I take it, based on that observation, that you think "great" teachers are underpaid relative to their true economic worth. If that's true, how does just firing the bottom performers help get better candidates into the teaching pool? Don't we need to replace them with "great" teachers? Where are they going to come from? And if the goal is for every child to have a "great" teacher, how are we going to figure out whether they have one if we judge teachers by normative measures?

All of this is meant to say: if we are going to spend all kinds of money assessing teachers, but we don't do anything to make the profession more attractive, what's the point? And how does it make teaching more attractive to have an evaluation system that FORCES decisions based on measures that everyone admits are noisy? 

(C) It’s unclear to me if the analogy that students = plate appearances is correct (why not individual test items? why not classrooms?).  Anyway, the important point is to understand how stable the test-based measures are.  They have year-to-year reliability of roughly 0.4, and as you say are typically based on “25 or 30” students.  Batting average, despite players having hundreds of at-bats, also only has reliability of roughly 0.4.  So, if you think test-based measures are too unstable to tell you something about performance, then you should think the same about batting average (and ERA for pitchers, which has reliability of around 0.35).  If you think batting average (or ERA) is stable enough to be informative, then you should think the same about test-based performance measures.

Nobody says test measures aren't informative; of course they are. The real question is whether they are reliable and valid enough to FORCE high-stakes decisions on administrators. Again, if you were a manager, would you want to HAVE to make batting average a rigidly defined part of your player evaluations, with cut points that FORCED you to act in particular ways?

I contend AchieveNJ does exactly that.
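For readers who want a feel for what "year-to-year reliability of roughly 0.4" means, here's a quick simulation sketch. The class size and the split between the "true" teacher effect and the noise are my own illustrative assumptions, calibrated to hit a 0.4 reliability; they are not NJDOE's model.

```python
import numpy as np

# What does "year-to-year reliability of roughly 0.4" look like?
# The class size and variance numbers below are illustrative assumptions
# calibrated to produce a 0.4 reliability; they are not NJDOE's model.
rng = np.random.default_rng(0)

n_teachers = 100_000
class_size = 25
true_sd = 0.15          # true teacher effect, in student test-score SDs

# Pick per-student noise so the reliability of a class average is 0.4:
#   reliability = true_var / (true_var + noise_var / class_size)
noise_sd = np.sqrt(true_sd**2 * (1 / 0.4 - 1) * class_size)

true_effect = rng.normal(0.0, true_sd, n_teachers)

def one_year_estimate():
    """One year's measured effect: truth plus averaged student noise."""
    noise = rng.normal(0.0, noise_sd, (n_teachers, class_size)).mean(axis=1)
    return true_effect + noise

year1, year2 = one_year_estimate(), one_year_estimate()
print(np.corrcoef(year1, year2)[0, 1])        # ~0.40: year-to-year reliability
print(np.corrcoef(year1, true_effect)[0, 1])  # ~0.63: correlation with truth
```

A measure like this correlates only about 0.63 with the "true" effect it is supposed to capture - and that's under ideal assumptions, with none of the nonrandom assignment problems Bruce discusses.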

As you point out, it would be silly to interpret someone batting .300 as much better than someone batting .299.  I didn’t have time to talk about this in my testimony, but I often do when I speak to groups of administrators/teachers and point out the drawbacks of cut scores.  I remarked that batting .300 is considered the mark of a great hitter because it is true (and helps with my analogy), not because .300 has any special statistical property.

It is incorrect to say SGP necessitates cut scores; SGP is a number and if it gets 30% of the weight in a teacher’s evaluation it could easily be converted into a score from 0-30 to be averaged with the factors making up the other 70%.  I also do not think NJDOE is putting the SGP into categories before combining it with the other (70%) measures.  That is not what New York State is doing and would, as you point out, discard valuable information.

I contend that this is EXACTLY what NJDOE is doing; see here:


And here:


The other 70% in AchieveNJ is an ordinal rank of 1 to 4. NJDOE is proposing to convert the SGP into another ordinal rank from 1 to 4. 

Sir, I'm just a music teacher, so please correct me if I'm wrong. But even I know you're not allowed to average - weighted or otherwise - ordinal measures. Isn't that exactly what's going on here?
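A toy example makes the problem concrete. The cut points and the 30/70 weighting below are my own illustrative guesses, not NJDOE's actual values:

```python
# A toy illustration of the cut-score problem. The cut points and the
# 30/70 weighting are illustrative guesses, not NJDOE's actual values.

def msgp_to_category(msgp, cuts=(35, 45, 65)):
    """Collapse a continuous mSGP (1-99) into an ordinal 1-4 rating."""
    return 1 + sum(msgp >= c for c in cuts)

def summative(msgp, observation_rating, sgp_weight=0.30):
    """The kind of weighted average of ordinal ranks critiqued above."""
    return (sgp_weight * msgp_to_category(msgp)
            + (1 - sgp_weight) * observation_rating)

# Two teachers with identical observation ratings and nearly identical
# mSGPs, straddling a cut point:
for msgp in (64, 66):
    print(msgp, msgp_to_category(msgp), summative(msgp, observation_rating=3))
# 64 -> category 3 -> summative 3.0
# 66 -> category 4 -> summative 3.3
```

A two-point difference in mSGP - well inside the measurement noise discussed above - moves the final rating by three tenths of a point, which could itself push a teacher across a summative cut line.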

As to whether categories are useful, that is a tough one.  In baseball, categorization is not often used except in cases of rare contract clauses; most accolades like “Gold Gloves” and “Silver Sluggers” (which are commonly linked to big bonuses in players’ contracts) are decided based on managers’ votes.  (Of course, the managers consider the statistics when they make their votes.  They also can’t vote for their own players, so they have little incentive to be untruthful.)  However, there are many other professions in which performance measures are used this way.  Many businesses give employees performance targets, like quarterly or annual sales numbers, rather than just continuous incentives (like a sales commission percentage). 

That makes sense in a business where your own performance is being evaluated, and there are concrete measures of performance. Teaching is not one of those professions: we are judged on the performance of others, and the measures are quite noisy.

Obviously, you would know far better than I - but isn't there a serious debate about the usefulness of these incentives in business?

In addition, student performance is measured categorically all the time based on letter grades.  The student who got the lowest A- usually performed at basically the same level as the student who got the highest B+, and if teachers tweaked their grading (e.g. slightly changed the weighting of questions on the final exam) those students could easily swap places.  I point this out not because categories do not have their drawbacks, but because teachers should recognize that they use categories all the time.  I am not a fan of letter grades.  I would much prefer to give out continuous grades in my course, like a class rank, but changing the school’s grading system is above my pay grade.

Still, since categories are exceedingly common, we might ask why.  I think it is because they create clarity around incentives and administration in institutions where there is considerable regulation or not a lot of managerial accountability or manager-employee trust.  That doesn’t make categories ideal, but coming up with a state-wide teacher evaluation system that has rewards and consequences without using categories would have its own problems too.

Again, the question for me is less about the categories, and more about the consequences. If a cut score just out of my reach keeps me from paying for my kid's braces, the placement of that cut score matters. 

We have plenty of examples of Campbell’s law and I would not disagree that this is a concern.  The main question is whether the behavioral changes reduce the value of the performance measure to the point at which it is no longer useful.  Nobody can know the answer to this question ex ante.  A broader point is that if we are going to use test-based measures, we ought to (1) make sure the material being tested represents the knowledge/skills we want students to acquire and (2) give tests where the best thing a teacher can do to get their students to perform well is to do a good job teaching them the material. 

Derek Neal from the University of Chicago has written a lot on the second issue.  Notably, he proposes that teachers/schools be given very little information about testing format in advance and that the format be allowed to change from year to year, so that test-prep is not a worthwhile use of teachers’ time.   Theoretically I think this is a nice argument, but I don’t see teachers and principals going for it.  (I am imagining my students’ reactions if I refused to tell them anything about the format of the final exam and changed it from year to year.)

Teaching to the test is a genuine worry - but it's probably the least of our worries when the testing regime becomes the basis for high-stakes decision making. Atlanta and Washington, D.C. are just the tip of the iceberg. 

We have a serious question to ask ourselves before any high-stakes decisions are made based on tests: is the supposed benefit worth the inevitable bad behavior that will follow?



**********


Jonah Rockoff, 6/12/13

It seems like you have a good handle on the issues, so I will only add a few more thoughts.

You are right that when we compare two teachers with equal mSGPs who teach very different populations of students, we do not have very strong evidence that these two teachers are of similar skill in general, only that they have similar skill in relation to teachers of similar students.  In other words, we cannot know if the best teacher in Newark would also be great in Montclair, or vice versa, but (I would argue) we can identify the best among teachers in Newark and the best among teachers in Montclair.  The question of what to do with this information is a much larger one.  In past work I have found evidence supporting a policy of providing this information to principals and letting them make decisions with it.  This was in NYC, where there is a high degree of accountability (schools graded A-F and principals fired for repeatedly low grades) and flexibility for the principal in making personnel decisions and allocating resources.  Obviously, NJ (like other states) is moving forward with a policy that takes away some discretion from the principal by mandating consequences for low performance.  (Though principals still have some control via classroom observation ratings, etc.) Whether we should let principals make all final decisions is a big and open question.  If we let principals make the decision, we have to trust they will make good ones. 

Related to this point is that, in my earlier work, the number of times that teachers with high test-based performance were rated poorly by the principal was almost as large as the number of times that teachers with low test-based performance were rated well by the principal.  In other words, when principals disagree with the test data, they do it in both directions.  Who is right, the data or the principal?  We cannot know of course, but it’s important to realize that test-based measures may help teachers who, for some spurious reason, are not liked by their principals but are doing a good job in the classroom.  If principals make errors too, then you would not want to put all the weight on the principal’s opinion.

On reliability, I understand that people do not like risk and nobody would want to be evaluated on something that was close to a lottery. This is why it is important that something like SGP be combined with other measures, which can increase reliability substantially.  If we take one measure that is correlated 0.4 over time, its correlation with underlying skill in raising achievement is actually about 0.6.  (If I have two years of test data, both have noise in them, so their correlation with each other is lower than their correlation with actual underlying skill.)  A 0.6 correlation may not be a very strong signal of job performance, but what if we have three measures, each with 0.4 reliability, and we average them? (This might be something like a system that gives 1/3 weight to SGP, 1/3 to classroom observation, and 1/3 to, say, portfolio analysis or student surveys.)  The correlation of this average with underlying skill is about 0.8.  If we have two years of this data (so 6 pieces of information, each with reliability 0.4) the correlation of the two-year average evaluation with actual underlying skill is 0.9. 
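[A note from me: Dr. Rockoff's arithmetic here can be checked with two standard psychometric results. The sketch below is my reconstruction, not his code, and note that the Spearman-Brown formula assumes the measures are "parallel," which SGPs, observations, and surveys won't exactly be:]

```python
from math import sqrt

# My reconstruction of the arithmetic above, using two standard results:
#   1) corr(measure, true skill) = sqrt(reliability)
#   2) Spearman-Brown: averaging k parallel measures with reliability r
#      gives reliability k*r / (1 + (k-1)*r)

def spearman_brown(r, k):
    return k * r / (1 + (k - 1) * r)

r = 0.4
print(sqrt(r))                     # ~0.63 -> "about 0.6" for one measure
print(sqrt(spearman_brown(r, 3)))  # ~0.82 -> "about 0.8" for three measures
print(sqrt(spearman_brown(r, 6)))  # ~0.89 -> "about 0.9" for six measures
```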

I want to be clear on what these numbers imply.  Suppose we had 6 pieces of data with 0.4 reliability and asked, “what is the probability that someone in the bottom quartile of evaluations is truly in the bottom 25%?”  Almost 80%.  But that’s not all we should be asking, because (as you alluded to in your previous email) if they are truly at the 26th percentile that’s not a serious difference.  “What is the chance someone in the bottom quartile is actually above average?”  About 2%.  “What’s the probability they are actually in the top third (but got really unlucky for two years)?”  About 0.1%.   
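[And the quartile probabilities can be checked by simulation. Again, this is my reconstruction, assuming everything is normally distributed:]

```python
import numpy as np

# Monte Carlo check of the probabilities above: six measures, each with
# reliability 0.4, averaged into one evaluation. Normality is assumed.
rng = np.random.default_rng(0)
n = 1_000_000

true_skill = rng.normal(size=n)
# reliability = 1 / (1 + noise_var) = 0.4  =>  noise_var = 1.5
measures = true_skill[:, None] + rng.normal(scale=np.sqrt(1.5), size=(n, 6))
evaluation = measures.mean(axis=1)

bottom_q = evaluation <= np.quantile(evaluation, 0.25)
print((true_skill[bottom_q] <= np.quantile(true_skill, 0.25)).mean())  # ~0.78
print((true_skill[bottom_q] > 0).mean())                               # ~0.02
print((true_skill[bottom_q] >= np.quantile(true_skill, 2/3)).mean())   # ~0.002
```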

Risk will always be present, but, with reasonable reliability, the risk that we mistakenly assign consequences to an average teacher would be pretty low, and the risk of assigning consequences to a highly effective teacher would be extremely low.  Moreover, while we have very good evidence on reliability and validity of test-based measures, I would argue the bigger hole in our understanding is on the reliability and validity of classroom observation, portfolio analysis, etc.  (And the hole was much bigger prior to the Gates-MET study.)  These types of non-test measures are also susceptible to personal biases, which won’t show up in research studies like MET but could surface in the real world.

Lastly, I’d note something that Commissioner Cerf pointed out in a comment after my testimony last week.  He remarked (I’m paraphrasing based on my recollection) that the system will entail more risk for teachers but that the current system places all the risk on students (i.e. being assigned a low performing teacher who is rated satisfactorily), and that this balance of risk is unacceptable in his view.  That’s the commissioner’s view, not a research question, but I think it is extremely relevant to the discussion of risk. 

Teacher evaluation is a very complex problem and I appreciate your engaging on the issues here.  I think research does speak to potential avenues for policy, but there are no perfect solutions and the success of any policy will also depend on how things play out on the ground.


***********

Jonah Rockoff, 6/12/13

Sorry, I’m not sure I fully addressed the content of your second blog post.   If you go through my testimony, you’ll see (I’m pretty sure) that I specifically did make it clear that SGP is a relative measure and that, as you put it, “other kids themselves are the benchmark.”  Half of the teachers will be above average, half below.  This would not be the case with student learning goals, which teachers set (with some oversight); all of the kids could meet those learning goals, and of course all of them could fall short.  This obviously raises the question of whether the goals were set well—too low a bar and everyone passes, too high and nobody does—and means that there is risk entailed in how the goals are established. 

***********

Jersey Jazzman 6/12/13

I would say one thing about reliability and multiple measures: you use the phrase "underlying skill in raising achievement." I assume you are basing that on test scores, like the Gates MET report. The logic in that report strikes me as circular: they say the year-to-year reliability (which they essentially equate with validity - already problematic) is better when you combine other measures with test score outcomes; however, they judge that reliability largely through the strength of correlation to next year's test scores (and some other test-based outcomes).

If we can't capture teaching effectiveness solely through test scores, why are we judging other measures' validity entirely on them? In any case, as Morgan Polikoff has pointed out, Gates MET says nothing about the use of SGPs. If we have research that explores the use of them with multiple measures, I am unaware of it.
***********

Jonah Rockoff, 6/12/13


You are right that the validation of these other measures has been on test scores.  One important contribution of MET is that they used both high stakes standardized tests and low stakes tests that were supposed to capture other skills (e.g. math reasoning, writing, etc.), but they were still tests of academic achievement.  That does not mean that tests do not pick up on teachers’ ability to teach other skills that we value, e.g., students may do well on reading tests because their teacher gets them to love reading and to read more at home.  However, you are totally right that tests can miss (or not be heavily weighted towards) dimensions of knowledge/skills that we value.  This is another reason (beyond the issue of reliability) to mix test-based performance measures with other measures (like student surveys) which may pick up on other dimensions of good teaching.  If we care about A and B and only give strong incentives to do A, we’ll likely get less of B.  For example, in a recent paper I (and co-authors) find some evidence that NCLB raises math and reading achievement (even on low-stakes tests) but crowds out time spent on subjects like science and social studies.

I’d like to point you to my testimony again, particularly to the part where I paraphrased Tom Kane and used an analogy to surgical technique (which I won’t repeat here).  My point was that teacher evaluations should include measures that pick up on the student outcomes we care about, and they should exclude measures that are unrelated to these student outcomes.  We have evidence that prior test-based performance measures and things like classroom observation predict future performance in raising test scores.  We don’t care about test scores directly, but we have evidence that raising test scores affects outcomes like college enrollment and earnings, which we do care about.  This supports including measures that are linked to test scores.

Now, while we might suspect that including some other measures would pick up on other things we care about (and that do not show up in tests) we do not have much evidence to support that suspicion.  This is a place where more research should focus.  There is a nice paper by Kirabo Jackson at Northwestern that examines teachers’ impacts on outcomes like suspensions and GPA, which he argues are more correlated with “non-cognitive” skills and not measured well by tests.  The difficulty with these types of measures is that they are not standardized (e.g., an A grade means different things depending on the particular class or school), which makes it hard to distinguish performance from the application of differing standards.

Lastly, on the fact that MET uses a different value-added model than SGP, I can only say that the correlation between SGP (which is a form of value-added) and any other model that controls flexibly for prior test performance is likely to be very high.  In other words, technically they didn’t use SGP (neither do we in our long-term outcomes paper), but they do something very close to it.  I tend not to use SGP in research because conversions to percentiles and medians are very computationally intensive, but there are some people who argue that, on theoretical grounds, using percentiles and medians is better.  My view is that there is no correct answer and the differences are small.

I am very glad you have found our correspondence helpful.  It’s been helpful to me to see where the sticking points are and try to clarify as best I can. 

**********

Here's Rockoff's original testimony:



6 comments:

Sherman Dorn said...

This is a bit into the weeds, but since you're already there... one of the items in the CFR paper is a method they devised to test whether assignment of students within a school was random (i.e., a diagnostic that didn't require the type of falsification test Jesse Rothstein used to argue that VAM in most elementary schools ignored nonrandom assignment).

My guess from your and Bruce's posts is that Rockoff did NOT say anything in his testimony to urge that NJDOE perform the diagnostic that he and his colleagues devised and even displayed the code for in the paper. Am I right? And if so, why didn't Rockoff urge NJDOE to take that simple step to address the concerns teachers would have about nonrandom assignment?

Duke said...

I'd have to go back and reread CFR, Sherman. Are you talking about the part of the paper where they look at grade-level scores? They looked at the effect of moving a high-VA teacher into a different school and then watching the grade level scores rise, which would suggest that the teacher wasn't merely benefitting from cream-skimming when getting a high VA score.

Rockoff did talk about that part of the paper in his testimony. But I don't know how you implement this in a practical way. Or are you referring to something else?

FWIW, I get the sense that Dr. Rockoff came in to talk about his research, and how it could inform NJDOE's plans. But I definitely didn't get the sense he was there doing a program evaluation of AchieveNJ. More on that later...

Thanks for posting.

andrewsaltz said...

This was really interesting. And while I am not up on my education policy (at this level, anyway), I know a fair amount about sports.

Batting average is a really good metaphor for test data because batting average is incredibly inefficient. The point of Moneyball was not that teams weren't using data; it's that the data they were using was terrible. This is how we got OBP and other really complicated measures.

And, as you wrote, that still assumes a concrete, agreed-upon goal.

Deb said...

To me, one of the starkest realizations coming out of this truly fascinating dialogue is how complex and inaccessible these evaluation methods are to those being evaluated. (And I mean no disrespect here to any teacher. But I think we can agree that the Duke's understanding is above average.)

Apart from the many legitimate points of concern raised by Duke here, one of my many concerns is the intentional use of systems that appear to me too complicated to be accessed by those using or dependent on them -- and here I include evaluations, the new school report cards, etc. This is how teachers, parents and students come to feel, if they don't already, disempowered, disenfranchised and disillusioned.

The insanity of data-driven reform, which runs the gamut from student privacy to teacher evaluations, is deeply disturbing to me. Insidious. As Duke points out, if we are not fostering new excellent teachers, what good does it do to identify the bad ones? Data demonstrates many things but alone solves nothing. In our current culture of reform, data is being used to drive an agenda, not to seek solutions.

Sadly, our discussions in education focus on these data debates and not nearly enough on actual methods to solve (not just identify) problems. From my unscientific survey, these evaluations may, or may not, identify good and not-so-good teachers, but they do nothing to foster success in the classroom and certainly do not foster trust, morale and confidence among the teachers we hope will excel.

I've gone on far too long. My apologies

Duke said...

To the contrary, Deb, you are very concise and exactly right. Subject of an upcoming post...

Mrs. King's music students said...

As usual, you can count on me to take an argument about data into the very subjective arena of what's happening in my pilot school in Camden right now, where I was notified in a text message from my principal on June 17th that she submitted a request for my termination because I was non-responsive to her suggestions. I will counter that my evaluations this year, heavily monitored by the RACs in a Danielson pilot school, point irrefutably in the opposite direction: the first one was Satisfactory, the second incorporated my principal's suggestions and resulted in a slightly higher score, and the third was a through-the-roof "highly effective." Under normal circumstances none of this would matter one whit, since I'm untenured and my union "has no obligation to untenured teachers." I think my attorney's access to uncorrupted data collected by the state in my school will support Dr. Rockoff's musings about principals' subjective evaluations of teachers differing wildly from actual data - A LOT.