coach_hanes

Logit score, part deux


Hi Chaos,

 

I'm not sure I understand your question. Do you mean using the logit score to figure out which judges give low or high speaker points? You could, but that's kind of unnecessary. An easier way to estimate a judge's bias is to compare the judge's scores for the teams {x1, x2, x3, ...} he or she judged to those teams' average or median scores across all judges. An average difference that's negative indicates the judge is too harsh; positive, too generous.
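
If you wanted to compute that, here's a minimal sketch in Python. Everything in it is hypothetical: I'm assuming speaker points are available somewhere as (judge, team, points) records.

```python
# Average-difference bias check: how far a judge's points sit from each
# team's average across all judges. The records and names are invented.
from collections import defaultdict
from statistics import mean

records = [
    ("Judge A", "Team 1", 28.5), ("Judge A", "Team 2", 27.0),
    ("Judge B", "Team 1", 29.0), ("Judge B", "Team 2", 28.0),
    ("Judge C", "Team 1", 28.0), ("Judge C", "Team 2", 27.5),
]

# Each team's average points across all judges.
team_points = defaultdict(list)
for judge, team, pts in records:
    team_points[team].append(pts)
team_avg = {team: mean(pts) for team, pts in team_points.items()}

# A judge's bias = mean of (judge's score minus team's overall average).
# Negative means harsher than the pool, positive means more generous.
judge_diffs = defaultdict(list)
for judge, team, pts in records:
    judge_diffs[judge].append(pts - team_avg[team])

for judge, diffs in judge_diffs.items():
    print(judge, round(mean(diffs), 2))
```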

 

It's also possible to compare a judge's spread of speaker points to other judges' in the same way. A spread that's too small means the judge gives too many middling scores; one that's too large means too many very high and very low scores.
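
Same idea in code, again with invented numbers; one simple way is to compare each judge's standard deviation to the typical spread across judges.

```python
# Spread check: ratio of each judge's standard deviation of awarded points
# to the median spread across judges. The scores are hypothetical.
from statistics import median, pstdev

judge_scores = {
    "Judge A": [28.5, 27.0, 26.5, 28.0],
    "Judge B": [27.5, 27.5, 28.0, 27.5],  # narrow spread: everything mediocre
    "Judge C": [25.0, 29.5, 26.0, 29.0],  # wide spread: lots of extremes
}

spreads = {judge: pstdev(points) for judge, points in judge_scores.items()}
typical = median(spreads.values())
for judge, s in spreads.items():
    # well below 1.0 = too flat, well above 1.0 = too many extremes
    print(judge, round(s / typical, 2))
```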


I don't actually know what I'm talking about. Most of my knowledge of statistics comes from feel and very casual self-study, so probably more than half of it is wrong. I was sort of hoping that by asking you an open-ended question, I'd provoke some discussion that would reveal something useful to me. I was probably a little too vague earlier, though.

I was thinking about possibly detecting judge biases against certain teams or argument styles. To do that, I think you'd want a measure of team similarity, which you could get by comparing judges' decisions to the predicted decisions and observing which teams judges tended to rate better or worse than expected. Call doing better than expected a bonus and doing worse than expected a detriment. If the bonuses and detriments of teams A, B, and C are highly correlated across judges, then teams A, B, and C are highly similar. You could probably do something analogous to build a measure of judge similarity: if the bonuses and detriments that judges A, B, and C hand out are highly correlated across teams, then judges A, B, and C are highly similar. It might be that we can't get both a good measure of judge similarity and a good measure of team similarity, though, and that they'd confound each other. Not really sure.
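
To make that a little more concrete, here's roughly what I'm imagining, with completely made-up numbers. The predicted side would have to come from some fitted model (your logit scores, maybe), which I'm just assuming exists; judge similarity would be the same computation with the table flipped.

```python
# Rough sketch of the bonus/detriment idea. The residual for a (team, judge)
# pair is the actual result minus what some model predicted; teams whose
# residuals correlate strongly across shared judges count as similar.
# All names and numbers here are hypothetical.
from statistics import correlation  # Python 3.10+

# residuals[team][judge] = actual win (1/0) minus predicted win probability
residuals = {
    "Team A": {"J1": 0.3, "J2": -0.1, "J3": 0.4},
    "Team B": {"J1": 0.2, "J2": -0.2, "J3": 0.5},
    "Team C": {"J1": -0.4, "J2": 0.1, "J3": -0.3},
}

def team_similarity(x, y):
    """Pearson correlation of two teams' residuals over their shared judges."""
    shared = residuals[x].keys() & residuals[y].keys()
    if len(shared) < 3:
        return None  # not enough shared judges to say anything
    xs = [residuals[x][j] for j in shared]
    ys = [residuals[y][j] for j in shared]
    return correlation(xs, ys)

print(team_similarity("Team A", "Team B"))  # strongly positive: similar teams
print(team_similarity("Team A", "Team C"))  # negative: dissimilar teams
```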

If this idea worked out, debaters could potentially look at stylistically similar teams that do better than expected with certain types of judges, as a basis for figuring out how to adapt better.

I guess you don't really need the logit score for this, and could just use the season's win-loss record, but incorporating the additional information about speaker points and opponent strength seems useful.

Also, I vaguely feel that because judges are not randomly assigned to rounds, things get complicated in a way that makes the logit score a better foundation for a similarity score than raw win/loss records or speaker points, but I can't articulate why and it could be nonsense. Non-random in two senses: judges aren't randomly assigned to K vs. policy debates because of pref sheets, and more experienced judges end up judging more rounds late in the tournament.

I feel like I might be reinventing, sloppily, concepts that sports statisticians have already mastered. This looks relevant, but debate doesn't have the kind of granular data sports do, so maybe that research won't be helpful.

Edited by Chaos


@Chaos

 

Ehhhhhhhhhhhhhh

 

I'm not really sure how judge preferences are handled in high school (I was a regional high school debater), but people do this informally in college using MPJ-style prefs (mutual preference judging). While strategically good in the short term, it seems to lead to echo chambers and polarization. There is a massive ideological split in college debate, and strategically exploiting it doesn't seem to help.


I actually had that in mind. My motivation was to detect whether judges are voting for arguments from certain camps more often than they should, all other factors considered, or, relatedly, inflating speaker points for them. A firm determination of bias for individual judges is probably beyond the level of confidence the data can offer, but it's hard not to be interested in the possibility. I guess knowing that judge bias exists won't really do anything to curtail it, though, and you're right that such tools could be used to make things worse, so maybe investigating further would be a bad idea. On the other hand, I don't know how much room is left for technical innovations to make polarization worse. At least this way, knowledge about judge biases would be more accessible to everyone and less reliant on institutional experience.

Regardless of whether building such tools is a good idea, I hope Hanes will chime in with his thoughts on the technical feasibility and push back against any mistakes I might be making.

I don't know that MPJ is a major cause of polarization. I get that there are arguments for that position, but I feel like polarization would have happened regardless. Polarization is almost the inevitable consequence of (1) people with diverse beliefs thinking that debate has more significance than just a game, and (2) specialization producing more competitive arguments than cross-pollination across camps does.


The type of statistics you're talking about is called cluster analysis. You could use the scores judges give teams to cluster judges into types, or use those same scores to cluster teams into types. You'd probably want to use adjusted scores so every judge is on the same scale, but I don't believe you need anything as fancy as logit scores to do that adjusting. (Besides, the logit score is an adjusted strength for each team; it isn't calculated by adjusting individual rounds one by one. In other words, it's a summary that doesn't let you work backward to the individual data points.)
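
For what it's worth, here's a toy version of what that clustering could look like, using numpy and scipy's hierarchical clustering on a made-up, fully observed judges-by-teams matrix of speaker points. Real data would be far sparser, which is the problem below; clustering teams instead of judges would be the same thing on the transposed matrix.

```python
# Toy cluster analysis: rows are judges, columns are teams, entries are
# speaker points. Each judge's row is z-scored so every judge is on the
# same scale, then hierarchical clustering groups judges who score the
# same teams similarly. The matrix is invented.
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

scores = np.array([
    [28.5, 27.0, 28.0, 26.5],   # judge 1
    [29.0, 28.0, 28.5, 27.0],   # judge 2 (same pattern as judge 1, shifted up)
    [27.5, 28.5, 26.5, 28.0],   # judge 3 (roughly the opposite pattern)
])

# Put every judge on the same scale by z-scoring each row.
z = (scores - scores.mean(axis=1, keepdims=True)) / scores.std(axis=1, keepdims=True)

# Average-linkage clustering on the standardized rows, cut into 2 clusters.
labels = fcluster(linkage(z, method="average"), t=2, criterion="maxclust")
print(labels)  # e.g. [1 1 2]: judges 1 and 2 cluster together, judge 3 is its own type
```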

 

The real problem you'd run into is the sparseness of the data set. Each team might have six or eight judges giving scores in prelims, and the likelihood they share any judge with another team is quite small. Even over an entire season, it's only going to be a couple of shared judges, which is probably too little raw data for the cluster analysis to mean much of anything. Ideally, I'd want any two teams to share 10 judges for comparison. Given that a team might only see 60 or 70 judges over an entire season, 10 shared judges is a huge amount of overlap. It might happen in a really small league, but not on any bigger circuit.
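
If you wanted to check how bad the overlap actually is on a real circuit, it's a quick count once you have a team-to-judges mapping; the mapping below is invented.

```python
# Count shared judges for every pair of teams (hypothetical season data).
from itertools import combinations

judges_seen = {
    "Team A": {"J1", "J2", "J3", "J4", "J5", "J6"},
    "Team B": {"J2", "J5", "J7", "J8", "J9", "J10"},
    "Team C": {"J3", "J11", "J12", "J13", "J14", "J15"},
}

for x, y in combinations(judges_seen, 2):
    print(x, y, len(judges_seen[x] & judges_seen[y]), "shared judges")
```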

 

The better way is probably just to ask judges in a survey to rate different arguments. I'm sure most people would be honest.

