Some Query Links

I have a buncha tabs open about SQL and query optimization. So, a bit of clean-up:

Lots more tabs to be cleaned.


Eno: Another Data Markup Language

Eno seems interesting. I don’t know that it’s interesting enough to justify existing alongside established formats, but it’s in the YAML/TOML family of config formats and sort of a cousin to Markdown/Setext/Wikiformat (as all these data formats are).

It doesn’t seem to have a spec per se, and the documentation is silent about primitive datatypes, which is annoying. The Python library’s documentation suggests that it uses an on-demand coercion approach: basically, you coerce an element when you retrieve it from the parsed structure:

>>> from enopy import loaders, parse

>>> doc = parse("""
...   publish: yes
...   location: 36.987094, -25.091719
...   contact: contact@faulty
... """)

>>> doc.boolean('publish')
>>> doc.lat_lng('location')
{ 'lat':  36.987094, 'lng': -25.091719 }
ValidationError: 'contact' must contain a valid email address, for instance ''.

I can’t say I’m a big fan of this API. It seems really dreadful to have the field name as a string inside a coercing method call on the document, which magically gets the right “loaders”. I mean, you have to specify the type of everything, including built-in structured types? Yeek.

OK, there does seem to be a generic interface:

>>> document = enopy.parse(input)

>>> document.element('title')
<class Field name="title" value="Artfest 2018">
>>> document.element('tags')
<class List name="tags" items=2>
>>> document.element('content')
<class Section name="content" items=14>

This feels like a step back from e.g. JSON. And this documentation hurts me!

Still, it’s kinda interesting and one could always layer a different API on top.
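As a sketch of what layering might look like: an eager, schema-driven API could do all the coercion once at load time, leaving a plain dict afterwards. Everything below is hypothetical; the parser stub, coercer names, and schema are stand-ins, not the real enopy API.

```python
def parse_to_fields(text):
    """Toy stand-in for an Eno-style parser: returns raw string fields."""
    fields = {}
    for line in text.strip().splitlines():
        key, _, value = line.partition(":")
        fields[key.strip()] = value.strip()
    return fields

def coerce_boolean(raw):
    # Eno-style truthy/falsy words; illustrative, not the library's rules.
    if raw in ("yes", "true"):
        return True
    if raw in ("no", "false"):
        return False
    raise ValueError(f"not a boolean: {raw!r}")

def coerce_lat_lng(raw):
    lat, lng = (float(part) for part in raw.split(","))
    return {"lat": lat, "lng": lng}

# The schema, not the retrieval call, names each field's type.
SCHEMA = {"publish": coerce_boolean, "location": coerce_lat_lng}

def load(text, schema):
    raw = parse_to_fields(text)
    # Coerce everything up front; unknown fields stay strings.
    return {key: schema.get(key, str)(value) for key, value in raw.items()}

doc = load("""
publish: yes
location: 36.987094, -25.091719
""", SCHEMA)
```

After `load`, `doc["publish"]` is just `True` and `doc["location"]` is a plain dict; no per-access coercion calls needed.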

The Difficulty of Simple Markup

XML was intended to be a simple markup language, certainly simpler than SGML. It may have started out easier to parse than SGML, but it is by no means easy.

JSON is supposed to be simple, but just read this investigation to get a sense of the complexity. Part of that complexity comes from underspecification, but there are also just plain confusing bits.
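Two of those underspecified bits are easy to demonstrate with Python's stdlib json module (the behavior shown is Python-specific; other parsers make different choices):

```python
import json

# 1. Duplicate keys: the JSON grammar doesn't forbid them, and the spec
# doesn't say which one wins. Python keeps the last; other parsers may
# keep the first, keep both, or reject the document.
doc = json.loads('{"a": 1, "a": 2}')

# 2. Numbers: JSON fixes no numeric model. Python parses integers
# exactly, but parsers that route everything through IEEE doubles
# cannot represent this value (2**53 + 1) and will round it.
big = json.loads('{"n": 9007199254740993}')
```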

I had long thought of YAML as simple but always found the spec a nightmare. It’s pretty expressive, which is nice but hurts some aspects of simplicity. Here’s a good negative read on it.

S-expressions are genuinely simple, although they were born pre-Unicode and rely a lot on the background Lisp implementations.

What I don’t have is a good sense of which complexity is essential and which is accidental. A lot of the conflict comes from the tension between authoring, parsing, expressivity, and API niceness, with interoperability as a pervasive issue.

Apigee Awesome API Consoles Bite The Dust


They were cool: web based, with no sign-in (except for the individual services), and with lots of standard services (Twitter, Facebook, YouTube, etc.) predefined. Perfect for class discussion. The consoles could use the authentication from your currently logged-in services, so it was very trivial to get going.

I mean, check it out (until April 15th, 2018).

The alternatives look pretty grim, thus far. There are some nice tools, but none seem to fill the niche.

Beyond JSON

JSON pretty clearly won, and won big. This is perhaps inevitable given the dominance of Javascript. And that’s ok! JSON is a pretty decent sort of sexpr, and having both lists and dicts makes it pretty useful for quick externalisation of all sorts of data. The friction, in a typical scripting-derived language, of manipulating JSON in memory is super low. Slinging dicts and lists is something any Python (or Javascript, or…) programmer is going to find congenial.
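Just how low that friction is shows up in a round trip: a nested record goes out and comes back in two stdlib calls, with no schema or class declarations anywhere.

```python
import json

# A nested record of dicts and lists externalises in one call each way.
record = {"tags": ["sql", "yaml"], "meta": {"draft": False, "rev": 3}}
text = json.dumps(record)
back = json.loads(text)
```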

But it has some infelicities, and I don’t just mean the lack of query and schema languages (which is sorta being addressed). JSON is rather annoying to hand-author and doesn’t seem great for documents and document formats. Or even for hacking existing documents like HTML…if only because there’s no standard reflection of HTML structure into JSON.

There are some moves to improve this situation.

JSON5 tackles writability. Probably the biggest move is not having to quote (certain) keys in objects. That helps both reading and writing! For reading, there’s a clear visual difference between key strings and “value” strings; for writing, less quoting!!

The other big one is multi-line strings (with the ‘\’ as the continuation character). Having to have a continuation character sucks, but it’s much better than the status quo ante.

Comments are also a good idea! The rest seem minor, but these definitely make a difference.
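Put together, a made-up JSON5 document exercising those features might look like this (the keys and values here are purely illustrative):

```json5
{
  // comments are allowed
  title: "Artfest 2018",           // unquoted key, visually distinct value
  description: "Spans \
multiple lines",                   // '\' continues the string
  "strange key!": true,            // quoting is still there when needed
  count: 2,                        // trailing commas are also legal
}
```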

Mark Notation is aimed at bringing XML-like structuring and extensibility to JSON. It does this by adding a key syntactic (and semantic!) extension, the Mark object, which is a name (think tag/element name), a set of properties (think attributes, but with complex contents), and a list of content objects (think child content). It builds on JSON5, so it has those authoring felicities.

Semantically, Mark objects get mapped into pretty simple Javascript objects. I don’t fully understand this claim:

contents: an ordered list of content objects, which are like child nodes of elements in HTML/XML. Mark utilizes a novel feature of JS that JS object can be array-like. It can store both named properties and indexed properties.

I don’t see why this matters as you have a special Mark object with has an explicit contents variable. Ah, maybe:

properties: can be accessed through markObj.prop or markObj['prop'] when prop is not a proper JS identifier. You can also use JS for … in loop to iterate through the properties. Unlike normal JS array, Mark object has been specially constructed so that Mark contents are not enumerable, thus do not appear in for … in loop.
contents: can be accessed through markObj[index]. You can also use JS for … of loop to iterate through the content items.

So you don’t have to do a field access; you can just use the special loops. I don’t see that this would be painful in, say, Python even with field accessing. I might default to making Python Mark objects iterable over the contents (on the theory that that’s more “normal”).
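A minimal sketch of that default, with illustrative names not drawn from any real Mark library: a Python MarkObject whose integer indexing and iteration reach the contents, while string keys reach the properties.

```python
class MarkObject:
    """Hypothetical Python-side Mark object; names are illustrative."""

    def __init__(self, name, properties=None, contents=None):
        self.name = name                          # think: tag/element name
        self.properties = dict(properties or {})  # think: attributes
        self._contents = list(contents or [])     # think: child nodes

    def __getitem__(self, key):
        # Integer keys reach contents, string keys reach properties,
        # mirroring Mark's markObj[index] / markObj['prop'] split.
        if isinstance(key, int):
            return self._contents[key]
        return self.properties[key]

    def __iter__(self):
        # Default iteration walks the contents, which feels more
        # "normal" in Python than JS's non-enumerable-properties trick.
        return iter(self._contents)

    def __len__(self):
        return len(self._contents)

div = MarkObject("div", {"class": "note"},
                 ["Hello, ", MarkObject("b", contents=["world"])])
```

With this, `for child in div:` loops over child content directly, and `div["class"]` still reads like attribute lookup.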

It would be interesting to compare APIs to see whether this really brings JSON-like ease of programmer use.

And, of course, there’s YAML, which you can think of as JSON++++. (JSON is a subset these days.) It’s designed from the ground up for writability and for capturing complex structures. And that it does. The price is considerably more complexity. Like, a ton more. (You can define entity-like things (actually, more like IDs) on the fly! Inline!) It has excellent embedded complex multiline strings (essentially “here-strings”).
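For illustration, a made-up YAML fragment showing both features: an inline ID-like anchor with a reference to it, and a block scalar for here-string-style text (the `<<` merge key is a common YAML 1.1 convention, not supported by every implementation):

```yaml
defaults: &base        # define an "entity"/ID on the fly, inline
  retries: 3
  timeout: 10

service:
  <<: *base            # reference it; keys merge into this mapping
  notes: |
    Multiline text survives
    with line breaks intact.
```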

I have to say that it might be easier to grow Mark Notation a bit toward YAML than the reverse. Here-like strings plus ID references go a long way.


XML at 20

Tim Bray has a reflective piece on XML’s birthday last month. The key bit is in the middle:

Twenty years later, it seems obvious that the most important thing about XML is that it was the first. The first data format that anyone could pack anything up into, send across the network to anywhere, and unpack on the other end, without asking anyone’s permission or paying for software, or for the receiver to have to pay attention to what the producer thought they’d produced it for or what it meant.

Hmm. Really? I mean, CSV dates to the 70s. It is less well specified, I guess, and simpler. The first isn’t really mentioned, but maybe this is part of the “pack anything up into”. But then S-expressions are easily as expressive as XML and go way, way back, though largely de facto standardised. And then there’s ASN.1…maybe that needed permission? I can’t find anything that suggests this at all, though. I don’t remember any such claims at the time. I do remember a lot of struggling to find XML parsers!

So, I’m very, very skeptical of this claim. Maybe it was the first to grab a lot of attention? But then I’m also skeptical about its influence. A lot of trends were coming together, and I think some human-readable notation would have pushed forward regardless. XML definitely failed entirely to become the language of web pages or of the web.

Update: On Facebook, Peter points out the other giant howler, to wit: “for the receiver to have to pay attention to what the producer thought they’d produced it for or what it meant.” I guess I’m just inured to this because it’s so ubiquitous in the XML and RDF communities, but yeah, the idea that you don’t have to care what it meant is bonkers, and part of that is paying attention to what the producer thought they produced it for. And metric tons of effort went into that (bonkers attribution of magic powers to namespaces, anyone?).

(I’m a bit more charitable to Bray in that thread. Maybe I’ll tackle it another day.)

Randomly Generated, Realistic Everyday Data

I teach a data modelling course, or rather a “formats and databases” modelling course. For this, it’s helpful to have randomly generated but realistic data. Real data is nice, but it can be tricky to find, much less deal with. For grading, it’s nice to give out a reasonable amount of data and then grade on other data, and it’s nice if that new data is 1) reasonably like the given data but 2) not too much like it. (You want to avoid overfitting!)

I played a bit with Mockaroo. Even the free tier is kind of interesting, and the idea of simulating a data API is nice. But I don’t know what makes their data realistic. They have clearly modelled a bunch of domains (animal names, common and scientific; buzzwords; all sorts of car stuff; etc.), but I don’t know what the models are! At least one alternative is open source and extensible. (Though PHP…not for me.) There are some other options as well, but I don’t see much discussion of the characteristics of the data (except perhaps the “draw from a distribution” function).

Maybe that’s ok? How do people generate data for data-cleaning exercises? What should it look like? I’d love to incorporate stuff from the Falsehoods series (e.g., Falsehoods Programmers Believe About Names), but with some idea of the prevalence of the counterexamples (perhaps for a given region or cohort).
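As a sketch of the prevalence idea: a toy generator, with entirely made-up name pools, that injects falsehood-style counterexamples (mononyms, non-ASCII, unusual forms) at a controlled rate, so a grader knows roughly how much messiness to expect.

```python
import random

# Toy pools; not a model of any real region or cohort.
FIRST = ["Ada", "Grace", "Alan", "Edsger"]
LAST = ["Lovelace", "Hopper", "Turing", "Dijkstra"]
# Falsehood-style counterexamples: a mononym, a non-ASCII name, etc.
EDGE_CASES = ["Prince", "Björk Guðmundsdóttir", "X Æ A-12"]

def generate_names(n, edge_prevalence=0.05, seed=None):
    """Generate n names; each is an edge case with probability
    edge_prevalence. A fixed seed makes the dataset reproducible,
    which matters when handing the same data to a whole class."""
    rng = random.Random(seed)
    names = []
    for _ in range(n):
        if rng.random() < edge_prevalence:
            names.append(rng.choice(EDGE_CASES))
        else:
            names.append(f"{rng.choice(FIRST)} {rng.choice(LAST)}")
    return names

sample = generate_names(100, edge_prevalence=0.1, seed=42)
```

The same pattern extends to other fields (dates, addresses) by swapping in different pools and counterexample lists per column.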