I Hate the REF (and all such ilk)

Welp, we are doing the annual RRE thing and you have to submit and write words. So instead of doing stressful bullshit every 6-7 years we get to do it every year! It’s so dumb.

If you want to read a reasoned attack there are plenty. I like Josh Robinson’s. I saw on Twitter that you can return academics you just laid off, which is just plain mean.

These metrics are high stakes and the result is what you expect: escalation in gaming them and counter-gaming with not much change in the underlying phenomenon. People get crushed by it. So much time gets wasted.

Randomise your funding decisions, people! Weight it a bit if you must but come on. Lower your admin costs and eliminate biases (including biases toward the “hot” and the “good”).

Don’t Beat On IRBs From Ignorance

Institutional Review Boards are a key part of research that involves animal (esp. human) participants. They are super important and have curtailed a lot of abuse and otherwise improved research (we need more pre-discussion of experiments across the board!).

They are not always tuned to very low risk studies. Which makes sense! But it does mean that they can sometimes be frustrating and a barrier to perfectly sensible research. Again, no surprise and not a big deal. At Manchester, we’ve gone through different mechanisms to expedite very low risk research without compromising safety or quality.

I’ve never served on an IRB nor have I had a lot of interaction with one. But I’ve met with a couple. Generally, they are working researchers who are doing some important committee work (far more important than most). But it’s work, and no one that I’ve encountered wants to make a big deal of it. Certainly, no one I’ve encountered has any desire to do it for kicks or to block research unnecessarily.

IRBs are important, necessary, and a lot of work. I encourage real horror tales because we want them to be better, but this crap just pisses me off:

Originally, IRBs were set up in response to a Wild West atmosphere at universities and medical centers in which the human subjects of research could genuinely be harmed. Medical research was one obvious area, as was psychological research of various kinds (the Milgram shock experiment, for example). But then it expanded. Were you doing econ experiments with the students in your class? Better get IRB approval. Were you doing telephone surveys? There are humans on the other end of the line. Better get IRB approval. Are you just mailing surveys to people? You’ll still need IRB approval. Over time, the roadblocks to research multiplied, and as near as I can tell, IRBs today are widely considered less to be genuine safeguards against abuse than they are mini-empires dedicated primarily to demonstrating their own power over the research community.

Let’s take Drum’s three “escalating” examples:

  • Were you doing econ experiments with the students in your class?
    If you are doing in-class role play sorts of things, there’s no need for an IRB. If you are going to publish research on your experiments, then you probably do need it. And rightly so. Obvious things that need to be checked: 1) are you compromising the educational experience for the sake of your research? 2) are students compelled to participate on threat of poorer grades (or could they reasonably think so)? 3) can you protect against students trying to please you qua instructor (so is it even worth doing)? and 4) is your privacy plan sufficient? These are just off the top of my head. It’s not necessarily hard to make all this work out (3 is challenging), but making sure people work through this and get independent feedback is a really good idea.
  • Were you doing telephone surveys?
    The “There are humans on the other end of the line” is supposed to be “funny”, indicating the absurdity. But there are in fact humans on the other end of the line. There are possible harms. Are you asking intrusive questions? Might you reveal personal data that exposes them to harm? Is it a well designed survey likely to achieve its results (or just waste people’s time)? Are people participating with a clear understanding? All surveys should be reviewed independently just for quality control!
  • Are you just mailing surveys to people?
    I don’t know why the modality of the survey is such a big deal. Most of the quality issues are similar. There’s less interaction, but that has pros and cons (e.g., live, you can provide person-specific clarification).

The “as near as I can tell” is pure bullshit. I’ve never heard anyone routinely working with IRBs suggest that they are at all about “demonstrating power”, much less mostly so. This is a staggering libel with the evidence obviously being nothing at all. (I’ll bet a lot that that “as near as I can tell” isn’t based on, you know, research of any kind but just some impressions.)

Wildly irresponsible and gross.

Oh, and all the stuff at the end about Would The IRB Approve It and If They Did Would They Leak It … just fuck off. The people on the IRB are professionals. The reason not to approve it is the impossibility of getting informed consent. It relies on deception. I can think of a few ways around that (get journal editors to give a general warning, though that would distort; but informed consent often distorts; too fucking bad!). I can think of alternative designs (e.g., an actual controlled experiment wherein people are given real and fraudulent papers and asked to pick out the frauds). Etc.

There’s a case that the people involved are all “experts” and this is a kind of expert review. (Bit of a stretch.) Or it could be framed as a quality control test (where the quality is fraud detection which isn’t a focus of peer review).

But the experiment as offered? I might well choose to reject it. It’s more stunt than useful research. It’s clear that the experimenters didn’t consider alternative mechanisms for the result they got (e.g., supportiveness), which makes the whole thing a bit dubious.

So the question I have is whether other such experiments (e.g., with computer generated papers or citation farm faked papers) have IRB approval. Again, if it’s a quality control measure, it might not need it.

We Win the 2018 SWASA 10 Year Award

What is this award? Well:

The primary criteria will be the number of citations to the paper in ten years. We will use Google Scholar. While citation count will be the primary criteria, the panel will also consider other impact factors for the top-cited papers. If two or more papers are very close, more than one paper may be honored. The determination is done in the Spring of the calendar year for the conference.

That’s pretty cool! It also won best paper at the time. So yay! Matt’s thesis won the BCS Distinguished Dissertation award so Triple Yay!

When Uli and I were supervising Matt, we had plenary meetings. They were enjoyable, but it’s difficult enough to get a word in edgewise with either Uli or myself on our own…when we’re bouncing off each other, fuggetaboutit.

A Survey of Online Coding Tutorials

The paper, “A Pedagogical Analysis of Online Coding Tutorials”, provides an analytical review of a sample of online coding tutorials. One of my project students did something similar (they should have published!). The analytical framework is useful but not surprising: they have a set of types (interactive tutorial, MOOCs, web references, etc.) and “nine groups of 24 [analytical] dimensions” including content, organisation, and context. It all seems sensible, though I’m a bit leery. It seems almost too sensible. There’s no empirical work on actual effects (completion, satisfaction, and learning). It’s super tempting to think we can extrapolate from this beautiful set of features to those effects. Consider their key conclusion:

Our results suggest that most online coding tutorials are still immature and do not yet achieve many key principles in learning sciences. Future research and commercial development needs to better emphasize personalized support and precise, contextualized feedback and explore ways of explaining to learners why and when to use particular coding concepts. Based on our sampled tutorials, we recommend that teachers be very selective in their use of materials, focusing on the more evidence-based tutorials, particularly the educational games. All educational games in the list provide hierarchical structure, immediate feedback, and opportunities that learners actively write code and use subsequent knowledge for coding throughout the tutorial.

But they’re games. What kind of learners are responding to them? Are students responding to them? No coding game that I know of has bubbled up in the popular consciousness or trade press the way, say, Khan Academy or MOOCs in general have. That doesn’t mean they aren’t educationally superior, but it needs some explanation.

Overall, however, it seems like a solid, worthwhile paper and a good and necessary starting point. Someone needs to do this sort of work and we need more of it.

It’s also the sort of work that needs a dynamic, ongoing database that’s kept up to date, with periodic snapshot papers. One-off papers get stale quickly! But it’s eminently replicable, so…have at it!

The Loss of Loss Aversion

As with ego depletion, loss aversion turns out to probably not be a thing:

However, as documented in a recent critical review of loss aversion by Derek Rucker of Northwestern University and myself, published in the Journal of Consumer Psychology, loss aversion is essentially a fallacy. That is, there is no general cognitive bias that leads people to avoid losses more vigorously than to pursue gains. Contrary to claims based on loss aversion, price increases (ie, losses for consumers) do not impact consumer behavior more than price decreases (ie, gains for consumers). Messages that frame an appeal in terms of a loss (eg, “you will lose out by not buying our product”) are no more persuasive than messages that frame an appeal in terms of a gain (eg, “you will gain by buying our product”).

People do not rate the pain of losing $10 to be more intense than the pleasure of gaining $10. People do not report their favorite sports team losing a game will be more impactful than their favorite sports team winning a game. And people are not particularly likely to sell a stock they believe has even odds of going up or down in price (in fact, in one study I performed, over 80 percent of participants said they would hold on to it).

I have not dug into the paper so…who knows?! But I find it plausible.

This is super annoying. The ego depletion one was extra annoying because the literature had seemed good. The loss of loss aversion is annoying because of how pervasively the concept is used. It was the canonical example of behavioural economics.

We really need to separate out the work that is inherently high risk in fields like psychology and nutrition.

Note: when looking up the ego depletion stuff I came across a post touting recent “strong” evidence for ego depletion in the form of two large preregistered studies. That’s prima facie interesting, but I’m going to retain a pretty high level of skepticism. Certainly when folks write (emphasis added)

Moreover, combining results from the two studies, there was an overall small, but statistically significant, ego depletion effect even after removing outlier participants (and this was after only a five-minute self control challenge, so you can imagine the effects being larger after more arduous real life challenges).

Arrrrrgh! The result of two studies with a combined n of around 1000 is a small but “statistically significant” (I presume at the p = 0.05 level) effect. No no no no. That’s super dangerous.

Worse, speculating about how much bigger the effects would be with a bigger manipulation is super duper dangerous. This is stoking confirmation bias. And we shouldn’t be looking at current tiny effects as evidence for future awesome effects.
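
To make the “small but statistically significant” worry concrete, here is a rough back-of-envelope sketch (entirely my own illustration; the even two-arm split and the two-sided test are assumptions, not details from the studies). The point is just that with ~1000 participants an effect barely a tenth of a standard deviation can clear the p < 0.05 bar, which is exactly why a bare “significant” isn’t very reassuring.

```python
# Back-of-envelope sketch (my own illustration, not from the studies): with a
# combined n of ~1000 (assumed: two arms of ~500), how small a standardized
# effect (Cohen's d) can cross the p < .05 line?
from math import sqrt
from scipy.stats import norm

n_per_arm = 500            # assumption: the ~1000 participants split evenly
alpha = 0.05

# Large-sample standard error of a standardized mean difference.
se_d = sqrt(1 / n_per_arm + 1 / n_per_arm)

# Smallest |d| that is "statistically significant" at the two-sided alpha level.
d_min = norm.ppf(1 - alpha / 2) * se_d
print(f"minimum detectable d at p < .05: {d_min:.3f}")   # ~0.12, a tiny effect

# If the true effect really were that tiny, a single study of this size would
# come out "significant" only about half the time (power ~ 50%).
power_at_d_min = 1 - norm.cdf(norm.ppf(1 - alpha / 2) - d_min / se_d)
print(f"approx. power if the true d equals d_min: {power_at_d_min:.2f}")
```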

Renata Wassermann on Belief Revision in DLs

I used some of Renata’s work in my thesis and we’ve corresponded on and off. One of her students is visiting us and she came and gave a talk! It was very nice.

One interesting bit was that they did some experiments on partial meet vs kernel based revision and found, “contrary to computer science intuition”, that partial meet is generally more efficient. Ok, that’s a lot of jargon; here’s an attempt to sort it out succinctly.

Given a set of beliefs, B (think propositional sentences, i.e., things which can be true or false), and some sentence S which follows from B, how can we shrink B so S no longer follows? This isn’t easy! S may not be a member of B. S might be entailed by lots of different parts of B.

One approach is to find all the minimal subsets of B which entail S. Since they are minimal, we can break each subset’s entailment by deleting just one of its elements. If we break every such subset then we have a fix for B. These subsets are called kernels (or justifications). They correspond nicely to typical debugging approaches.

Alternatively, we could try to build a maximal subset of B which doesn’t entail S. There will be many such subsets but obviously each does the job. Call such a set a remainder. We can just pick one remainder, or take the intersection of several (or all). If we take fewer than all of them, we have partial meet contraction.
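
To make the two constructions concrete, here is a toy brute-force sketch (my own illustration of the textbook definitions, not anything from Renata’s talk or implementation): sentences are just Python predicates over truth assignments, and we enumerate subsets of a four-sentence base to find its kernels and remainders for the unwanted consequence r.

```python
# Toy kernels-vs-remainders demo over a tiny propositional belief base.
from itertools import combinations, product

ATOMS = ("p", "q", "r")

def models():
    """Enumerate every truth assignment over the atoms."""
    for values in product((True, False), repeat=len(ATOMS)):
        yield dict(zip(ATOMS, values))

def entails(beliefs, goal):
    """Brute-force entailment: the goal holds in every model of the beliefs."""
    return all(goal(m) for m in models() if all(b(m) for b in beliefs))

# The belief base B and the unwanted consequence S = r.
B = {
    "p":      lambda m: m["p"],
    "p -> q": lambda m: (not m["p"]) or m["q"],
    "q -> r": lambda m: (not m["q"]) or m["r"],
    "p -> r": lambda m: (not m["p"]) or m["r"],
}
S = lambda m: m["r"]

def subsets(names):
    names = list(names)
    for k in range(len(names) + 1):
        yield from combinations(names, k)

# Kernels: minimal subsets of B that entail S.
kernels = [
    set(sub) for sub in subsets(B)
    if entails([B[n] for n in sub], S)
    and not any(entails([B[n] for n in sub if n != drop], S) for drop in sub)
]

# Remainders: maximal subsets of B that do NOT entail S.
non_entailing = [set(sub) for sub in subsets(B) if not entails([B[n] for n in sub], S)]
remainders = [c for c in non_entailing if not any(c < other for other in non_entailing)]

print("kernels:   ", kernels)     # two kernels: {p, p -> r} and {p, p -> q, q -> r}
print("remainders:", remainders)  # three remainders, each the base minus a hit on every kernel
```

Deleting at least one element from every kernel, or keeping one remainder (or the intersection of several), both stop r from following; which bookkeeping is cheaper clearly depends on how many kernels and remainders the base has and how big they are.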

Now Renata said something that didn’t make sense to me, i.e., that the reason kernel contraction has been preferred is that computer scientists think it’s more efficient because “kernels are smaller”. But…I’ve never heard that. The concepts are dual, but kernels are easier for humans to deal with. They capture the logic of how the undesired entailment works. It never occurred to me to even ask which approach is more efficient. It depends on the nature of the sets!

One interesting bit is that a difference between debugging and revision folks is that debugging folks usually consider minimal repairs, i.e., selections from the set of justifications that contain no smaller repairs. This corresponds to full meet contraction, which has a number of issues. If you go for partial meet then you have to do a bit of work to get an algorithm that finds desirable contractions, compared to the remainder based approach.

Of course, even from a debugging perspective a partial meet approach might make sense. When you figure out a bug, you might make more changes than just the minimal one that fixes the broken test in focus. After all, you might get an insight about a particular function call and change how you call it everywhere. You might realise that a module is just irredeemably broken and replace it entirely.

Quantitative Social Sciences vs. the Humanities

Post Mortems

As we inch closer to realizing the Trump disaster, the election post-mortems continue. Obama has claimed that he would have beaten Trump. I’m unsure about the wisdom of that from either an analytical or political perspective. Qua analysis, it could be banal, reasonable, or silly:

  1. Banal: Your take on the election could be, roughly, that while the fundamentals favored a generic Republican, Trump was an unusually bad candidate running an unusually bad campaign, so that, absent extraordinary interventions esp. from the FBI, a reasonable Democrat would have won. A bit more subtly, he could be claiming that Democrats can win when they aren’t also running against the press plus FBI plus Russia plus Wikileaks and he is a candidate that the press (a key enabler of the others) doesn’t run against.
    This isn’t quite as banal as “A major party candidate always has a good shot in this polarised age” in that it posits that Clinton-specific features strengthened the Trump campaign just enough. However, it doesn’t posit any Obama-specific features, hence the banality.
  2. Reasonable: Your take on the election could be, roughly, that given the closeness of Trump’s victory, a bit more juicing of Democratic turnout would have been sufficient (esp. when combined with all the items under the banal scenario) for victory. Obama has a good record of turnout which seems to be some combination of his personal qualities as well as his GOTV operation. If we posit that Clinton had the equivalent GOTV operation, then we’re left with his personal qualities which are a superset of “not having the Clinton failings”. I think you can probably make a case like this based on the exit polls. While reasonable, it’s highly defeasible. What’s more, it’s not clear that you add much over the banal case. You need something like what’s in the reasonable case to distinguish Obama vs. Sanders.
  3. Silly: Obama would have crushed Trump because Trump is an extremely bad candidate while Obama is an extremely good candidate. I feel like both those statements are true, but we really need to take seriously the idea that candidate quality matters at best at the margins. It’s not just that fundamentals models tend to do well empirically, but that the causal mechanisms for candidate or even campaign quality mattering are opposed by a lot of evidence and a lot of alternative causal stories. What voters hear, how they come to make decisions, the small number of “true independents”, etc. tend to point toward the partisan identity thesis of voting, to wit, voters tend to vote their party identity regardless of the policy implications or political behavior of the candidate. Voter attributions of decision making based on campaign specifics can be plausibly attributed (for many voters) to things like (supported) rationalisation.

Politically, all this seems to do is set up Clinton as a scapegoat or perhaps, better, set up Obama as the leader of the opposition. The former is pointless. The latter is perhaps worthwhile. It’s clear that Obama campaigning on behalf of others isn’t effective (he’s not had notably strong coattails, for example). More significantly, I rather suspect he’s going to take a traditional ex-president role and be relatively quiet about Trump. If that’s the case, it would be bad for him to become leader of the opposition.

There’s lots to unpack about the election, and we have the problem that, on the one hand, good analysis and data gathering take time while, on the other hand, the further the election recedes into the past, the more evidence evaporates. This is all next to the fact that post-mortems serve political goals and thus are subject to motivated distortion.

The Loomis Hypotheses

Ok, that was a digression. What prompted this more directly is Erik Loomis’ latest entry in his war/trolling on the scientific status of social sciences like economics and political science. This is a bit more general than attempts to use the election outcome against specific models/prognosticators/etc. and, of course, Erik is provocatively overstating:

It’s time to put my humanities hat on for a bit. Obviously there are political scientists and economists who do good work. And we need people studying politics and economics, of course. But the idea that there is anything scientific about these fields compared to what historians or philosophers or literature critics do is completely laughable. As I tweeted at some point right after the election, the silver lining to November 8 is that I never have to even pretend to take political science seriously as a field ever again. Of course that’s overstated, but despite the very good political scientists doing good work (including my blog colleagues!) the idea that this field (Sam Wang, Nate Silver, etc., very much included) had some sort of special magic formula to help us understand politics this year, um, did not turn out to be true. They are just telling stories like I do, but with the pretense of scientific inquiry and DATA(!!!) around it. It’s really the same with economists, far too many of whom are completely deluded by their own models and disconnected from the real life of people.

Before trying to structure these a bit, I want to point out that we have some serious challenges to making either a defensive or offensive claim about methodological validity or superiority based on prognostic outcomes of elections: all the models are probabilistic with extremely small test cases. So, even Sam Wang’s prediction of a 99% chance of a Clinton win is consistent with what happened. Silver’s higher odds for Trump aren’t necessarily validated by Trump’s winning! You have to dig into the details in order to find grounds for determining which one actually overstated the odds, and your arguments are going to be relatively weak. But conversely, your arguments that these models serve no useful purpose have to do more than say, “They got the election outcome wrong!!!” Highly accurate models might be only “empirically valid”, that is, they succeed but provide no insight and don’t track the underlying causal structure. Highly uncertain models might tell you a lot about why certain outcomes are easily predictable.
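
Here is a minimal scoring sketch of that point (my own illustration; the 99% figure is Wang’s as mentioned above, while the 70% alternative is an assumed stand-in rather than Silver’s actual published number). Proper scoring rules do penalise the more confident forecast once the outcome is known, but a single Bernoulli draw is far too little data to call either model wrong.

```python
# Score two probabilistic forecasts against the single observed outcome (a Trump win).
# Toy illustration only: 0.30 for Trump is an assumed stand-in, not Silver's actual
# figure; 0.01 corresponds to the 99% Clinton forecast quoted above.
from math import log

forecasts = {
    "99% Clinton model": 0.01,  # probability the model gave to the outcome that happened
    "70% Clinton model": 0.30,
}

for name, p_outcome in forecasts.items():
    brier = (p_outcome - 1.0) ** 2   # squared error against the realised outcome
    log_loss = -log(p_outcome)       # log score for the same single observation
    print(f"{name}: Brier = {brier:.2f}, log loss = {log_loss:.2f}")

# The 99% model scores worse (Brier 0.98 vs 0.49), but a 1-in-100 event happening
# once is still consistent with that model being well calibrated. You need many
# elections, or state-level detail, before the scores say much at all.
```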

Overall, I think the burden of argument is on the model proposers rather than the skeptics. First, this is the natural placement of burden: the person making the claim has to defend it. Models need content, and if you rely on the fact that both Wang and Silver had a Trump win as a possibility, then you risk making them all essentially equivalent to coin toss models. In which case, Erik’s attack gets some purchase.

There seem to be three rough claims:

  1. (Quantitative) Social Science is no more scientific than history, philosophy, or literary criticism.
  2. (Quantitative) Social Science wrongly claims to have a “formula” that provides superior understanding of politics. Instead, they are “just telling stories.”
  3. The problem with (Quantitative) Social Science is that its practitioners are deluded by their models and thus disconnected from the real lives of people.
    This could mean many things including: current models are oversimplistic (i.e., disconnected) yet treated as gold, models in principle are oversimplifying so will never be a good tool, or models are only useful in conjunction with other (qualitative) methods.

2 can be seen as a refinement of 1, that is, the way that (Quantitative) Social Science is no more scientific than history, philosophy, or literary criticism is that it doesn’t do anything more than “tell stories,” albeit with a quantitative gloss. Obviously, there’s some difference in what they do: a novel about lost love is topic-distinct from a history of glass blowing in Egypt. Even when topic congruent, we expect a novel about the Civil War to be a different kind of thing than a history of the Civil War. Not all stories have the same structure or purpose or value for a task, after all.

A Standard Caveat

Many debates about the “scienciness” of a field are prestige fights and as a result tend to be pretty worthless. That something is or isn’t a science per se doesn’t necessarily tell you about the difficulty or significance of it or much at all about its practitioners. There are sensible versions but they tend to be more focused on specific methodological, evidential, sociological, or ontological questions.

Comparative Scientivisity

(I’m not going to resolve this issue in this post. But here are some gestures.)

While there’s some degree of “qualitative humanities is superior” in Erik’s posts (cf claim 3 and, wrt 1 and 2, the idea that they at least know their limits), let’s stick to the comparative scienciness claim. These points (the categorical and the superiorness) aren’t fully separable. (I.e., science is successful in certain enviable ways, thus other fields try to glom on.)

Let’s pick a distant pair: election forecasting and interpretative literary criticism. It does seem that these two things are really different. If the literary criticism teases out a possible interpretation of, say, a poem, then the evaluative criterion for the interpretation is whether it is “correct”, or “valid”, or “insightful”, and the evaluative mechanism is (typically) either brute human judgement or more criticism (i.e., the presentation of other interpretations either of the criticism or of the original poem). The most obvious evaluative criterion for election forecasts is predictive success (and usually rather easy to verify predictive success). Prediction, of course, is a key indicator of science, so the fact that election forecasting (inherently) aims at prediction might be enough to cast a sciency feel on its parent discipline, political science.

Of course, astrology and tarot also aim at prediction. Their lack of science status doesn’t solely rest on their predictive failure. Indeed, predictive failure alone won’t give us a categorical judgement (science/nonscience) since it could just as easily indicate bad or failing science. Throwing in some math won’t do the job either, as astrology and numerology are happy to generate lots of math. The fact that the math tends to generate models that reasonably cohere with other knowledge of the physical world is a better indicator.

If we move over to history, it’s tempting to say that the main difference is analogous to autopsy vs. diagnosis: it’s much easier to figure out what killed someone (and when) than what will kill someone (and when). Even the fact that there are epistemically or ontologically ambiguous cases (i.e., we can’t tell which bullet killed them, or multiple simultaneous bullets were each sufficient to kill them) doesn’t make autopsy harder. (For one, it’s generally easier to tell when one is in such a situation.)

But there’s plenty of backward-looking science. Cosmology and paleontology and historical climate studies come to mind. They do try to predict things we’ll find (if we look in the right place), but it’s hard to say that they are fundamentally easier. What’s more, they all rest on a complex web of science.

I feel confident that history could (and probably does) do a lot of that as well. Surely more than most literary criticism would or perhaps should (even granting that some literary criticism, such as author attribution, has become fairly sciency).

What does this mean for Erik’s claims?

I’m not sure. A lot of what we want from understanding phenomena is the ability to manipulate those phenomena. But one thing we can learn is that we don’t have the capacity to manipulate something the way we’d like. This goes for buildings as well as elections.

(Oops. Gotta run to a play. But I don’t want to leave this hanging, so I’ll leave it with a hanging ending. But I’m also genuinely unsure where to go with this. I still have trouble interpreting Erik’s claims in a way that leads me to any action.)