Word2Vec Misleadings

Distributional semantics is all the rage and is very cool and omg it’s a bit of a fraud!

Just to be clear: there is nothing wrong with the algorithm itself! It is conceptually very interesting and works very well for a lot of cases. Done right, it can give a decent representation of word similarity or meaning. But the “King – Man + Woman = Queen” example by far overstates what the algorithm actually is capable of.

Here are some reasons why I think we should stop using that classical example to introduce Word2Vec:

1) It turns out that for the example to work in the first place, you have to include some ‘cheating’. The actual result would namely be King – Man + Woman = King. So, the resulting vector would be more similar to King than to Queen. The widely known example only works because the implementation of the algorithm will exclude the original vector from the possible results! That means the word vector for King – Man + Woman is closest to the word vector for King. Second comes Queen, which is what the routine will then pick. Quite disappointing, isn’t it?

Why yes, yes it is. (I wonder how close ‘Queen’ is to ‘King’ without the subtraction. Or under random subtractions on ‘King’.)

Gah! This drives me nuts. Please don’t mislead in this way!

Advertisements

Spreadsheet Errors Deep and Shallow

Spreadsheets are almost certainly the most common form of end user programming. And they are both cool (enabling amazing systems to be built by all sorts of people) and dreadful (those systems are even more buggy than we fear). There’s lots of cool research on them…even inside Microsoft!

But I’ll tell you right now that the thing that fucking kills me with Excel for Mac is that scrolling is just the worse. It’s slow, buggy, and just plain hard to use. I switched to Numbers for dealing with a spreadsheet of less than 10 columns and less than 200 rows just because the scrolling was KILLING ME LOUDLY…

…of course, Numbers doesn’t let you delete specific cells…only whole rows why for the love of every sort of sanity on this good green earth WHY?!?!?

It just boggles me that these simple things are so broken. For generations of versions.

The Tabulator JavaScript Library

I have to build a website. It’s a browsable repository of modesty complex structured documents largely represented as graphs. There’s lots of possible entry points, summaries, and views.

Fine. We know this will be sorta easy on the one hand and brutally suck on the other.

This is exemplified by the excellent table library I’ve been using, Tabulator.

I mean, out of the box (or in the demos at least) the tables look nice. They can sort. You can resize columns. You feed it some JSON and bob is all avuncular toward you.

Except.

Whoof, cutting and pasting the example code is an exercise in weirdness. It took me a lot of dork age to realise that the HTML5 doctype was essential, entirely essential, to moderately normal functioning.

And then there are the filters. Seems really nice…add a bit of search without any server mucking. Well, we jump from sorta declarative specs to a rat nest of bizarre (to these naive eyes) JavaScript barfing. Cutting and pasting example code yields disappeared table with no clue to what’s going on.

Plus this documentation says “source code” and then gives you something which is, at best, partial snippets, not working source code for the example.

Don’t get me wrong. It’s very cool and maybe if I was a wired in front end developer the documentation would make perfect sense.

But oy! It’s some brutal, empirical try and mostly fail to do anything.

As with so much web dev stuff, some select simple things aren’t so horrible then it fucking drives off a cliff of doom. It’s all so unforgiving and weird. Why isn’t there a simple “filter” flag? Am I supposed to make the buttons and fields or do they magically show up? If the form shows up why do I have to call all the filter functions? Including “clear filters”?

Maybe there’s some point where I’ll get it. But adding tree data was adding a flag and structuring the data appropriately. Adding filters is some mass of coding. It feels uneven to my untutored mind.

Apple Store Suckery

Zoe’s phone died. It was doing the wonky battery thing. This was one of the recalled iPhone 6Ss. Last year I had the battery replaced under the program. It didn’t quite fix the problem but did seriously mitigate it.

Ok but now it won’t charge. As the Apple Store guy pointed out…it rattled.

Yikes! I guess they did a crap job last year.

Nope! It was a knock off battery poorly installed.

Now of course they didn’t “accuse us” of anything…they just thought that they fixed it properly and sometime in the past year, probably the past month, someone meddled with it and so no fix for us.

This phone has literally never been out of our possession. We had no motive to have any other repair or battery repair…we thought it was still under warranty!

And don’t fucking tell me that you aren’t accusing me because you did it by implication. That just sucks. Just say we can’t comment on the mechanism but since we can’t prove it’s their fault we’re shit out of luck.

Of course this doesn’t just mean they won’t repair it for free, it means they won’t repair it.

There was an iSmash in the mall that repaired it in under 15 minutes while we watched. Apple drove us to what they forbade.

We’ve dumped serious cash into that store. I don’t have a good feeling about them.

Paraglare is Doubleplusgood

Every now and again I look for a Python parser framework. I tend to want something highly declarative but also as easy to use and debug as PetitParser (which inclines me toward PEG grammar systems). I’d really like multi language targeting so people can reuse my grammars. (But then I want the actions to be specifiable as well.)

Usually I’m disappointed. I read about a bazillion systems and nothing happens.

I did this again the other day because I wanted to write my reasoner tests in Jewel and 1) I didn’t want to find, much less get working, my Smalltalk Jewel parser and 2) hooking that up to Python tests seemed even more awful.

So, I did my usual cursory look and stumbled on Paraglare. Feature list looked promising…esp. the idea of good error recovery so I installed it…

…and it just worked! What a treat! Here’s my grammar:

ONT: AX*;

CE: CE '&' CE  {left, 1}
 | '' CE  {left, 1}
 | name;

AX: CAX | RAX;
CAX: CE '=>' CE end;
RAX: name '==>' name end;

terminals
name: /[A-Za-z][A-Za-z0-9]*/;
end: /\./;

And here are my actions:

actions = {
    "CAX": [lambda _, nodes: SubClassOf(nodes[0], nodes[2])],
    "RAX": [lambda _, nodes: SubPropertyOf(Property(nodes[0]), Property(nodes[2]))],
    "CE": [lambda _, nodes: And(nodes[0], nodes[2]),
           lambda _, nodes: Some(Property(nodes[1]), nodes[3]),
           lambda _, nodes: Class(nodes[0])]}

Daaaaamn that was easy! It’s not a complex grammar but still. It’s parsing my tests no problem!

Good stuff.

Test test test

My little reasoner project hit a milestone today: it terminated in reasonable time (2 minutes or so) on one of the EL variants of Galen.

Yay! No more unboundedly growing to do lists!

And it wasn’t anything clever! No change in the balance of SQL and Python!

Nope, first I refactored so I would be able to more systematically explore variants and then I finally wrote some super basic tests. Eg

A=>B. B=>C.

I did this because I should and because I suspect some subtle unsoundness was causing the unbounded growth.

It wasn’t subtle problems but blatant, serious ones. I mean trouble with retrieving chains and reasoning with conjunctions.

Fixing them fixed the leak. I don’t know if the Galen classification is correct (yet) but I have a hell of a lot more confidence. And it terminates in reasonable time!!

So simple testing pays off again.

Students Surprise You

I have a group of pretty good students. They are very nice and likable and bright…all good things.

They went from most of them saying “I don’t have the faintest idea what I’m doing” to “here’s a pretty full featured first version” in three weeks.

It wasn’t totally unexpected given their progress last week but overall it still was a fun surprise.

Non linear progress is one of the frustrations and pleasures of teaching.