Word2Vec Misleadings

Distributional semantics is all the rage and is very cool and omg it’s a bit of a fraud!

Just to be clear: there is nothing wrong with the algorithm itself! It is conceptually very interesting and works very well for a lot of cases. Done right, it can give a decent representation of word similarity or meaning. But the “King – Man + Woman = Queen” example far overstates what the algorithm is actually capable of.

Here are some reasons why I think we should stop using that classical example to introduce Word2Vec:

1) It turns out that for the example to work in the first place, you have to include some ‘cheating’. The actual result is King – Man + Woman = King: the resulting vector is more similar to King than to Queen. The widely known example only works because the implementation of the algorithm excludes the original vector from the possible results! That is, the word vector for King – Man + Woman is closest to the word vector for King. Second comes Queen, which is what the routine will then pick. Quite disappointing, isn’t it?

Why yes, yes it is. (I wonder how close ‘Queen’ is to ‘King’ without the subtraction. Or under random subtractions on ‘King’.)
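Here’s a rough sketch of the ‘cheat’, assuming gensim and some pretrained vectors (the file name and the exact neighbours depend on which model you load):

```python
# Sketch only: assumes gensim and the (large) GoogleNews vectors are available.
import numpy as np
from gensim.models import KeyedVectors

wv = KeyedVectors.load_word2vec_format("GoogleNews-vectors-negative300.bin", binary=True)

# The usual demo: most_similar() drops the input words ('king', 'man', 'woman')
# from the candidate set, so 'queen' comes out on top.
print(wv.most_similar(positive=["king", "woman"], negative=["man"], topn=3))

# Doing the arithmetic by hand and ranking *all* words typically puts 'king'
# itself first, with 'queen' only second.
target = wv["king"] - wv["man"] + wv["woman"]
sims = wv.cosine_similarities(target, wv.vectors)
top = np.argsort(-sims)[:3]
print([wv.index_to_key[i] for i in top])
```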

Gah! This drives me nuts. Please don’t mislead in this way!

I Hate the REF (and all such ilk)

Welp, we are doing the annual RRE thing and you have to submit and write words. So instead of doing stressful bullshit every 6-7 years we get to do it every year! It’s so dumb.

If you want to read a reasoned attack, there are plenty; I like Josh Robinson’s. I saw on Twitter that you can “return” academics you’ve just laid off, which is just plain mean.

These metrics are high stakes and the result is what you’d expect: escalating gaming and counter-gaming, with not much change in the underlying phenomenon. People get crushed by it. So much time gets wasted.

Randomise your funding decisions, people! Weight it a bit if you must, but come on. Lower your admin costs and eliminate biases (including biases toward the “hot” and the “good”).
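A minimal sketch of what I mean by a weighted lottery (the proposals, scores, and budget here are made up purely for illustration):

```python
# Hypothetical weighted funding lottery: everything over a basic quality bar
# goes into the draw, with modest weights from reviewer scores if you must.
import random

proposals = {"A": 3, "B": 2, "C": 2, "D": 1}  # proposal -> reviewer score (weight)
n_awards = 2

pool = list(proposals)
awards = []
for _ in range(n_awards):
    weights = [proposals[p] for p in pool]
    pick = random.choices(pool, weights=weights, k=1)[0]
    awards.append(pick)
    pool.remove(pick)  # draw without replacement

print(awards)
```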

Don’t Beat On IRBs From Ignorance

Institutional Review Boards are a key part of research that involves animal (and especially human) participants. They are super important and have curtailed a lot of abuse and otherwise improved research (we need more pre-discussion of experiments across the board!).

They are not always tuned to very low risk studies. Which makes sense! But it does mean that they can sometimes be frustrating and a barrier to perfectly sensible research. Again, no surprise and not a big deal. At Manchester, we’ve gone through different mechanisms to expedite very low risk research without compromising safety or quality.

I’ve never served on an IRB, nor have I had a lot of interaction with them, but I’ve met with a couple. Generally, their members are working researchers who are doing some important committee work (far more important than most). But it’s work, and no one I’ve encountered wants to make a big deal of it. Certainly, no one I’ve encountered has any desire to do it for kicks or to block research unnecessarily.

IRBs are important, necessary, and a lot of work. I encourage real horror tales because we want them to be better, but this crap just pisses me off:

Originally, IRBs were set up in response to a Wild West atmosphere at universities and medical centers in which the human subjects of research could genuinely be harmed. Medical research was one obvious area, as was psychological research of various kinds (the Milgram shock experiment, for example). But then it expanded. Were you doing econ experiments with the students in your class? Better get IRB approval. Were you doing telephone surveys? There are humans on the other end of the line. Better get IRB approval. Are you just mailing surveys to people? You’ll still need IRB approval. Over time, the roadblocks to research multiplied, and as near as I can tell, IRBs today are widely considered less to be genuine safeguards against abuse than they are mini-empires dedicated primarily to demonstrating their own power over the research community.

Let’s take Drum’s three “escalating” examples:

  • Were you doing econ experiments with the students in your class?
    If you are doing in-class role-play sorts of things, there’s no need for IRB approval. If you are going to publish research on your experiments, then you probably do need it. And rightly so. Obvious things that need to be checked: 1) are you compromising the educational experience for the sake of your research? 2) are students compelled to participate on threat of poorer grades (or might they reasonably think so)? 3) can you protect against students trying to please you qua instructor (so is it even worth doing)? and 4) is your privacy plan sufficient? These are just off the top of my head. It’s not necessarily hard to make all this work out (3 is challenging), but making sure people work through this and get independent feedback is a really good idea.
  • Were you doing telephone surveys?
    The “There are humans on the other end of the line” is supposed to be “funny”, indicating the absurdity. But there are in fact humans on the other end of the line. Harms are possible. Are you asking intrusive questions? Might you reveal personal data that exposes them to harm? Is it a well-designed survey likely to achieve its results (or will it just waste people’s time)? Are people participating with a clear understanding? All surveys should be reviewed independently just for quality control!
  • Are you just mailing surveys to people?
    I don’t know why the modality of the survey is such a big deal. Most of the quality issues are similar. There’s less interaction, but that has pros and cons (e.g., live, you can provide person-specific clarification).

The “as near as I can tell” is pure bullshit. I’ve never heard anyone routinely working with IRBs suggest that they are at all about “demonstrating power”, much less mostly so. This is a staggering libel, with the evidence obviously being nothing at all. (I’ll bet a lot that that “as near as I can tell” isn’t based on, you know, research of any kind, but just some impressions.)

Wildly irresponsible and gross.

Oh, and all the stuff at the end about Would The IRB Approve It and If They Did Would They Leak It … just fuck off. The people on the IRB are professionals. The reason not to approve it is the impossibility of getting informed consent. It relies on deception. I can think of a few ways around that (get journal editors to give a general warning, though that would distort; but informed consent often distorts; too fucking bad!). I can think of alternative designs (e.g., an actual controlled experiment wherein people are given real and fraudulent papers and asked to pick out the frauds). Etc.

There’s a case that the people involved are all “experts” and this is a kind of expert review. (Bit of a stretch.) Or it could be framed as a quality control test (where the quality being tested is fraud detection, which isn’t a focus of peer review).

But the experiment as offered? I might well choose to reject it. It’s more stunt than useful research. It’s clear that the experimenters didn’t consider alternative mechanisms for the result they got (e.g., supportiveness), which makes the whole thing a bit dubious.

So the question I have is whether other such experiments (e.g., with computer-generated papers or citation-farm faked papers) have IRB approval. Again, if it’s a quality control measure, it might not need it.

We Win the 2018 SWASA 10 Year Award

What is this award? Well:

The primary criteria will be the number of citations to the paper in ten years. We will use Google Scholar. While citation count will be the primary criteria, the panel will also consider other impact factors for the top-cited papers. If two or more papers are very close, more than one paper may be honored. The determination is done in the Spring of the calendar year for the conference.

That’s pretty cool! It also won best paper at the time. So yay! Matt’s thesis won the BCS Distinguished Dissertation award so Triple Yay!

When Uli and I were supervising Matt, we had plenary meetings. They were enjoyable but it’s difficult enough to get a word in edgewise with either Uli or myself on our own…when we’re bouncing off each other fuggetaboutit.

A Survey of Online Coding Tutorials

The paper, “A Pedagogical Analysis of Online Coding Tutorials”, provides an analytical review of a sample of online coding tutorials. One of my project students did something similar (they should have published!). The analytical framework is useful but not surprising: they have a set of types (interactive tutorial, MOOCs, web references, etc.) and “nine groups of 24 [analytical] dimensions” including content, organisation, and context. It all seems sensible, though I’m a bit leery. It seems almost too sensible. There’s no empirical work on actual effects (completion, satisfaction, and learning). It’s super tempting to think we can extrapolate from this beautiful set of features to those effects. Consider their key conclusion:

Our results suggest that most online coding tutorials are still immature and do not yet achieve many key principles in learning sciences. Future research and commercial development needs to better emphasize personalized support and precise, contextualized feedback and explore ways of explaining to learners why and when to use particular coding concepts. Based on our sampled tutorials, we recommend that teachers be very selective in their use of materials, focusing on the more evidence-based tutorials, particularly the educational games. All educational games in the list provide hierarchical structure, immediate feedback, and opportunities that learners actively write code and use subsequent knowledge for coding throughout the tutorial.

But they’re games. What kind of learners are responding to them? Are students responding to them? No coding game that I know of has bubbled up in the popular consciousness or trade press the way, say, Khan Academy or MOOCs in general have. That doesn’t mean they aren’t educationally superior, but it needs some explanation.

Overall, however, it seems like a solid, worthwhile paper and a good and necessary starting point. Someone needs to do this sort of work and we need more of it.

It’s also the sort of work that needs a dynamic, ongoing database that’s kept up to date, with periodic snapshot papers. One-off papers get stale quickly! But it’s eminently replicable, so…have at it!

The Loss of Loss Aversion

As with ego depletion, loss aversion turns out to probably not be a thing:

However, as documented in a recent critical review of loss aversion by Derek Rucker of Northwestern University and myself, published in the Journal of Consumer Psychology, loss aversion is essentially a fallacy. That is, there is no general cognitive bias that leads people to avoid losses more vigorously than to pursue gains. Contrary to claims based on loss aversion, price increases (ie, losses for consumers) do not impact consumer behavior more than price decreases (ie, gains for consumers). Messages that frame an appeal in terms of a loss (eg, “you will lose out by not buying our product”) are no more persuasive than messages that frame an appeal in terms of a gain (eg, “you will gain by buying our product”).

People do not rate the pain of losing $10 to be more intense than the pleasure of gaining $10. People do not report their favorite sports team losing a game will be more impactful than their favorite sports team winning a game. And people are not particularly likely to sell a stock they believe has even odds of going up or down in price (in fact, in one study I performed, over 80 percent of participants said they would hold on to it).

I have not dug into the paper so…who knows?! But I find it plausible.

This is super annoying. The ego depletion one was extra annoying because the literature had seemed good. The loss of loss aversion is annoying because of how pervasively the concept is used. It was the example of behavioural economics.

We really need to separate out the work that is inherently high risk in fields like psychology and nutrition.

Note: when looking up the ego depletion stuff I came across a post touting recent “strong” evidence for ego depletion in the form of two sort-of-large preregistered studies. That’s prima facie interesting, but I’m going to retain a pretty high level of skepticism. Certainly when folks write (emphasis added)

Moreover, combining results from the two studies, there was an overall small, but statistically significant, ego depletion effect even after removing outlier participants (and this was after only a five-minute self control challenge, so you can imagine the effects being larger after more arduous real life challenges).

Arrrrrgh! The results of two studies with a combined n of around 1000 amount to a small but “statistically significant” (I presume p < 0.05) effect. No no no no. That’s super dangerous.
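Back-of-envelope (my numbers, not theirs): with roughly 500 people per group, an effect can clear p < 0.05 while still being tiny:

```python
# Rough back-of-envelope, not the paper's actual analysis: how small a
# standardised effect (Cohen's d) can reach p < 0.05 with ~500 per group?
from math import sqrt
from scipy.stats import norm

n_per_group = 500
alpha = 0.05

se = sqrt(1 / n_per_group + 1 / n_per_group)   # SE of a difference in standardised means
d_crit = norm.ppf(1 - alpha / 2) * se
print(f"smallest 'significant' effect: d ~ {d_crit:.2f}")  # about d = 0.12
```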

Worse, speculating about how much bigger the effects would be with a bigger manipulation is super duper dangerous. This is stoking confirmation bias. And we shouldn’t be looking at current tiny effects as evidence for future awesome effects.

Renata Wassermann on Belief Revision in DLs

I used some of Renata’s work in my thesis and we’ve corresponded on and off. One of her students is visiting us and she came and gave a talk! It was very nice.

One interesting bit was that they did some experiments on partial meet vs kernel-based revision and found that, “contrary to computer science intuition”, partial meet is generally more efficient. OK, that’s a lot of jargon; here’s an attempt to sort it out succinctly.

Given a set of beliefs, B (think propositional sentences, i.e., things which can be true or false), and some sentence S which follows from B, how can we shrink B so that S no longer follows? This isn’t easy! S may not be a member of B. S might be entailed by lots of different parts of B.

One approach is to find all the minimal subsets of B which entail S. Since they are minimal, we can break each subset’s entailment by deleting just one of its elements. If we break every such subset, we have a fix for B. These subsets are called kernels (or justifications). They correspond nicely to typical debugging approaches.

Alternatively, we could try to build a maximal subset of B which doesn’t entail S. There will be many such subsets, but obviously each does the job. Call such a set a remainder. We can just pick one remainder, or take the intersection of several (or all). If we take fewer than all of them, we have partial meet contraction.
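Here’s a toy, brute-force sketch of the two notions (entailment is abstracted as a callback and the example is hard-coded; only sensible for tiny belief bases):

```python
# Toy sketch of kernels vs. remainders over a finite belief base.
from itertools import combinations

def kernels(B, entails_S):
    """All minimal subsets of B that entail S."""
    found = []
    for k in range(1, len(B) + 1):
        for sub in combinations(B, k):
            if entails_S(set(sub)) and not any(f < set(sub) for f in found):
                found.append(set(sub))
    return found

def remainders(B, entails_S):
    """All maximal subsets of B that do not entail S."""
    found = []
    for k in range(len(B), -1, -1):
        for sub in combinations(B, k):
            if not entails_S(set(sub)) and not any(set(sub) < f for f in found):
                found.append(set(sub))
    return found

# Tiny example: B = {p, p->q, r}, S = q. Entailment is hard-coded for the toy.
B = {"p", "p->q", "r"}
entails_q = lambda X: {"p", "p->q"} <= X
print(kernels(B, entails_q))     # [{'p', 'p->q'}] -- the single justification for q
print(remainders(B, entails_q))  # the maximal q-free subsets: {p, r} and {p->q, r}
```

Kernel contraction then removes a hitting set that touches every kernel; partial meet contraction keeps the intersection of a selection of the remainders.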

Now Renata said something that didn’t make sense to me, i.e., that the reason kernel contraction has been preferred is that computer scientists think it’s more efficient because “kernels are smaller”. But…I’ve never heard that. The concepts are dual, but kernels are easier for humans to deal with. They capture the logic of how the undesired entailment works. It never occurred to me to even ask which approach is more efficient. It depends on the nature of the sets!

One interesting difference between the debugging and revision folks is that debugging folks usually consider minimal repairs, i.e., selections from the set of justifications that contain no smaller repair. This corresponds to full meet contraction, which has a number of issues. If you go for partial meet, then you have to do a bit of work to get an algorithm that finds desirable contractions, compared to the remainder-based approach.

Of course, even from a debugging perspective a partial meet approach might make sense. When you figure out a bug, you might make more changes than just the minimum needed to fix the broken test you’re focused on. After all, you might get an insight about a particular function call and change how you call it everywhere. You might realise that a module is just irredeemably broken and replace it entirely.