Beyond JSON

JSON pretty clearly won and won big. This is perhaps inevitable given the dominance of Javascript. And that’s ok! JSON is a pretty decent sort of sexpr and having both lists and dicts makes it pretty useful for quicky externalisation of all sorts of data. The friction, in a typical scripting derived language, of manipulating JSON in memory is super low. Slinging dicts and lists is something any Python (or Javascript, or…) programmer is going to find congenial.

But it has some infelicities and I just don’t mean the lack of query and schema languages (which is sorta being addressed). JSON is rather annoying to hand author and doesn’t seem great for documents and document formats. Or even hacking existing documents like HTML…if only because there’s no standard reflection of HTML structure into JSON.

There are some moves to improve this situation.

JSON5 tackles the writability. Probably the biggest move is not having to quote (certain) keys in objects. That helps both reading and writing! For reading, there’s a clear visual difference between key strings and “value” strings. For writing, less quoting!!

The other big one is multi-line strings (with the ‘\’ as the continuation character). Having to have a continuation character sucks, but it’s much better than the status quo ante.

Comments are also a good idea! The rest seem minor, but these definitely make a difference.

Mark Notation is aimed at bringing XML-like structuring and extensibility to JSON. It does this by adding a key syntactic (and semantic!) extension, the Mark object which is a name (thing tag/element name), a set of properties (think attributes, but with complex contents), and a list of content objects (think child content). It builds on JSON5 so has those authoring felicities.

Semantically, Mark objects get mapped into pretty simple Javascript objects. I don’t fully understand this claim:

contents: an ordered list of content objects, which are like child nodes of elements in HTML/XML. Mark utilizes a novel feature of JS that JS object can be array-like. It can store both named properties and indexed properties.

I don’t see why this matters as you have a special Mark object with has an explicit contents variable. Ah, maybe:

properties: can be accessed through markObj.prop or markObj[‘prop’] when prop is not a proper JS identifier. You can also use JS for … in loop to iterate through the properties. Unlike normal JS array, Mark object has been specially constructed so that Mark contents are not enumerable, thus do not appear in for … in loop.
contents: can be accessed through markObj[index]. You can also use JS for … of loop to iterate through the content items.

So you don’t have to do a field access but just can use special loops. I don’t see that this would be painful in, say, Python even with field accessing. I might default to making Python Mark Objects iteratable over the contents (on the theory that that’s more “normal”).

It would be interesting to compare APIs to see whether this really brings JSON like easy of programmer use.

And, of course, there’s YAML, which you can think of as JSON++++. (JSON is a subset these days.) It’s designed from the ground up for writability and capturing complex structures. And that it does. The price is considerably more complexity. Like a ton more. (You can define entity like things (actually, more like IDs) on the fly! Inline!) It has excellent embedded complex multiline strings (essentially “here-strings“).

I have to say that it might be easier to grow Mark Notation a bit toward YAML than the reverse. Here-like-strings plus id references go a long way.



XML at 20

Tim Bray has a reflective piece on XML’s birthday last month. The key bit is in the middle:

Twenty years later, it seems obvious that the most important thing about XML is that it was the first. The first data format that anyone could pack anything up into, send across the network to anywhere, and unpack on the other end, without asking anyone’s permission or paying for software, or for the receiver to have to pay attention to what the producer thought they’d produced it for or what it meant.

Hmm. Really? I mean, csv dates to the 70s. It is less well specified, I guess, and simpler. The first isn’t really mentioned but maybe this is part of the “pack anything up into”. But then S-Expressions are easily as expressive as XML and go way way back, though largely de facto standardised. But then there’s ASN.1…maybe that needed permission? I can’t find anything that suggests this, at all, though. I don’t remember any such claims at the time. I do remember a lot of struggling to find XML parsers!

So, I’m very very skeptical about this claim. Maybe, it was the first to grab a lot of attention? But then I’m v. skeptical about its influence. A lot of trends were coming together and I think some notation which was human readable would have pushed forward. XML definitely failed entirely to become the language of web pages or of the web.

Update: On Facebook, Peter points out the other giant howler, to wit, “for the receiver to have to pay attention to what the producer thought they’d produced it for or what it meant.” I guess I’m just inured to this because its so ubiquitous in the XML and RDF communitie, but yeah, the idea that you don’t have to care what it meant is bonkers and part of that is paying attention to what the producer thought they produced it for. And metric tons of effort went into that (bonkers attribution of magic powers to namespaces, anyone?)

(I’m a bit more charitable to Bray in that thread. Maybe I’ll tackle it another day.)

Apple’s Keynote XML

Apple’s Keynote file format is, apparently, a zipped directory with a big ole chunk of XML in there. This is, I hoped, good, because then there’s a chance I could extract presentations to S5 and thus use Keynote as a presentation editor. I’d settle for “mostly text” to start, but there’d be no reason not to extract overlays, drawings, and the like. 3rd year project fodder! Furthermore, I could make some variant of the extraction (either way) into part of my MSc class, e.g., an assignment. Cool.

One tiny problem: The format is from HEEEEEEEELLLLLLLLLLLL. Really brutal. There’s no schema and the “documentation” is, as one might say, minimal.

The real puzzler for me is that there’s absolutely no reuse of any standard whatsoever. The draw objects are all some homebrew thing instead of SVG. Heck, the text formating has p and br tags in a custom Apple namespace. Then there is a TON of weird stuff…strange stylesheety stuff. The stylesheety stuff massively overwhelms the content. (And it sure ain’t CSS or XSL-FO.)

Ok, this is a private Apple format and they basically just reflected out some internal structures. But then…er…why use XML? The file is HUGE and it’s filled with nonsense, afaict. Lots of nonsense. I lost the will to extract from this and certain the will to try to generate it. There’s no standards used, so I can’t reuse my tools or knowledge there. I could accept that bending the standards to their purposes might have been…challenging and thus not worth the effort. (Accept, not necessarily believe it!) But I’m rather mystified as to what it buys them.

(ODF wasn’t looking too much better, IMHO. OOXML neither. I find all this quite suggestive.)

(I’m pretty late to this game. While the earlier version had an XML Schema, the current does not have one available. There was a lot of discussion about the graphics. Some brave person did target that version as output using XSLT. I’ll need to examine these careful to see how nice it is. Someone has a converter from SVG to Keynote. Some SVG developers seem focused more on why nto SVG as the whole format rather than why not incorporate SVG inside the Apple format.)

Posted in XML