The Difficulty of Simple Markup

XML was intended to be a simple markup language, certainly simpler than SGML. It may have started out simpler than SGML to parse, but it is by no means easy.

JSON is supposed to be simple, but just read this investigation to get a sense of the complexity. Part of it comes from underspecification, but there are some just plain confusing bits.

I had long thought of YAML as simple, but I always found the spec a nightmare. It’s pretty expressive, which is nice, but that hurts some aspects of simplicity. Here’s a good negative read on it.

S-expressions are genuinely simple, although they were born pre-Unicode and rely a lot on their background Lisp implementations.

What I don’t have is a good sense of which complexity is essential and which is accidental. A lot of the conflict comes from the tension between authoring, parsing, expressivity, and API niceness, with interoperability as a pervasive issue.


Apigee Awesome API Consoles Bite The Dust


They were cool: web based, with no sign-in (except for the individual services), and with lots of standard services (Twitter, Facebook, YouTube, etc.) predefined. Perfect for class discussion. They could use the authentication from your currently logged-in services, so it was trivial to get going.

I mean, check it out (until April 15th, 2018).

The alternatives look pretty grim, thus far. There are some nice tools, but none seem to fill the niche.

Beyond JSON

JSON pretty clearly won and won big. This is perhaps inevitable given the dominance of Javascript. And that’s ok! JSON is a pretty decent sort of sexpr, and having both lists and dicts makes it pretty useful for quick externalisation of all sorts of data. The friction, in a typical scripting-derived language, of manipulating JSON in memory is super low. Slinging dicts and lists is something any Python (or Javascript, or…) programmer is going to find congenial.

But it has some infelicities, and I don’t just mean the lack of query and schema languages (which is sort of being addressed). JSON is rather annoying to hand-author and doesn’t seem great for documents and document formats. Or even for hacking existing documents like HTML…if only because there’s no standard reflection of HTML structure into JSON.

There are some moves to improve this situation.

JSON5 tackles the writability. Probably the biggest move is not having to quote (certain) keys in objects. That helps both reading and writing! For reading, there’s a clear visual difference between key strings and “value” strings. For writing, less quoting!!

The other big one is multi-line strings (with the ‘\’ as the continuation character). Having to have a continuation character sucks, but it’s much better than the status quo ante.

Comments are also a good idea! The rest seem minor, but these definitely make a difference.
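A quick stdlib sketch of the difference (the data and keys here are invented for illustration): Python’s built-in json module rejects exactly the JSON5 niceties, which is a handy way to see what JSON5 adds.

```python
import json

# Strict JSON: every key quoted, no comments, no trailing commas.
strict = '{"name": "demo", "tags": ["a", "b"]}'

# The same data with JSON5 niceties: an unquoted key, a comment,
# and a trailing comma. (Multi-line strings with '\' continuations
# are another JSON5 addition, not shown here.)
relaxed = '''{
  name: "demo",  // a comment
  tags: ["a", "b",],
}'''

data = json.loads(strict)   # parses fine
rejected = False
try:
    json.loads(relaxed)     # stdlib json rejects all three extensions
except json.JSONDecodeError:
    rejected = True
```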

Mark Notation is aimed at bringing XML-like structuring and extensibility to JSON. It does this by adding a key syntactic (and semantic!) extension, the Mark object, which is a name (think tag/element name), a set of properties (think attributes, but with complex contents), and a list of content objects (think child content). It builds on JSON5, so it has those authoring felicities.

Semantically, Mark objects get mapped into pretty simple Javascript objects. I don’t fully understand this claim:

contents: an ordered list of content objects, which are like child nodes of elements in HTML/XML. Mark utilizes a novel feature of JS that JS object can be array-like. It can store both named properties and indexed properties.

I don’t see why this matters, as you have a special Mark object which has an explicit contents variable. Ah, maybe:

properties: can be accessed through markObj.prop or markObj[‘prop’] when prop is not a proper JS identifier. You can also use JS for … in loop to iterate through the properties. Unlike normal JS array, Mark object has been specially constructed so that Mark contents are not enumerable, thus do not appear in for … in loop.
contents: can be accessed through markObj[index]. You can also use JS for … of loop to iterate through the content items.

So you don’t have to do a field access; you can just use the appropriate loop. I don’t see that this would be painful in, say, Python, even with field accessing. I might default to making Python Mark objects iterable over the contents (on the theory that that’s more “normal”).
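For instance, here’s a minimal sketch of what a Python-side Mark-ish object might look like. The class and its API are my invention, not anything from the Mark spec: contents are reached by integer index or iteration, properties by attribute access.

```python
class MarkObject:
    """Hypothetical Python analogue of a Mark object: a name (think
    tag), properties (think attributes), and a list of contents."""

    def __init__(self, name, properties=None, contents=None):
        self.name = name
        self.properties = properties or {}
        self.contents = contents or []

    def __getitem__(self, index):
        # markObj[0] -> first content item, mirroring Mark's indexed access
        return self.contents[index]

    def __iter__(self):
        # iterate over contents by default, the "normal" Python choice
        return iter(self.contents)

    def __getattr__(self, prop):
        # fall back to the properties dict for attribute access (markObj.prop)
        try:
            return self.__dict__["properties"][prop]
        except KeyError:
            raise AttributeError(prop)

note = MarkObject("div", {"id": "n1"}, ["hello", "world"])
# note[0] == "hello"; list(note) == ["hello", "world"]; note.id == "n1"
```

A real design would have to disambiguate property names from instance attributes like `name`, but the sketch shows that the “array-like object” trick isn’t essential to a congenial API.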

It would be interesting to compare APIs to see whether this really brings JSON-like ease of use for programmers.

And, of course, there’s YAML, which you can think of as JSON++++. (JSON is a subset these days.) It’s designed from the ground up for writability and for capturing complex structures. And that it does. The price is considerably more complexity. Like, a ton more. (You can define entity-like things (actually, more like IDs) on the fly! Inline!) It has excellent embedded complex multiline strings (essentially “here-strings“).
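A made-up fragment showing the two features I mean: anchors/aliases (the inline, ID-like “entities”) and block scalars (the here-string-ish multiline strings):

```yaml
defaults: &base        # '&base' defines an anchor, an inline ID-like entity
  retries: 3
  timeout: 30

service:
  <<: *base            # '*base' references it (merge keys are YAML 1.1)
  motd: |              # block scalar: multiline text, no escaping needed
    Welcome!
    Second line survives as-is.
```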

I have to say that it might be easier to grow Mark Notation a bit toward YAML than the reverse. Here-like-strings plus id references go a long way.


XML at 20

Tim Bray has a reflective piece on XML’s birthday last month. The key bit is in the middle:

Twenty years later, it seems obvious that the most important thing about XML is that it was the first. The first data format that anyone could pack anything up into, send across the network to anywhere, and unpack on the other end, without asking anyone’s permission or paying for software, or for the receiver to have to pay attention to what the producer thought they’d produced it for or what it meant.

Hmm. Really? I mean, CSV dates to the 70s. It is less well specified, I guess, and simpler. The first isn’t really mentioned, but maybe this is part of the “pack anything up into”. But then S-expressions are easily as expressive as XML and go way, way back, though they were largely de facto standardised. But then there’s ASN.1…maybe that needed permission? I can’t find anything that suggests this at all, though. I don’t remember any such claims at the time. I do remember a lot of struggling to find XML parsers!

So, I’m very, very skeptical about this claim. Maybe it was the first to grab a lot of attention? But then I’m very skeptical about its influence, too. A lot of trends were coming together, and I think some human-readable notation would have pushed forward regardless. XML definitely failed entirely to become the language of web pages or of the web.

Update: On Facebook, Peter points out the other giant howler, to wit: “for the receiver to have to pay attention to what the producer thought they’d produced it for or what it meant.” I guess I’m just inured to this because it’s so ubiquitous in the XML and RDF communities, but yeah, the idea that you don’t have to care what it meant is bonkers, and part of caring is paying attention to what the producer thought they produced it for. And metric tons of effort went into exactly that (bonkers attribution of magic powers to namespaces, anyone?)

(I’m a bit more charitable to Bray in that thread. Maybe I’ll tackle it another day.)

Randomly Generated, Realistic Everyday Data

I teach a data modelling course, or rather a “formats and databases” modelling course. For this, it’s helpful to have randomly generated but realistic data. Real data is nice, but can be tricky to find, much less deal with. For grading, it’s nice to give a reasonable amount of data and then grade on other, held-out data. It’s also nice if that new data is 1) reasonably like the given data but 2) not too much like it. (You want to avoid overfitting!)

I played a bit with Mockaroo. Even the free tier is kind of interesting, and the idea of simulating a data API is nice. But I don’t know what makes their data realistic. They clearly have modelled a bunch of domains (animal names, common and scientific; buzzwords; all sorts of car stuff; etc.), but I don’t know what the models are! is open source and extensible, at least. (Though PHP….not for me.) There are some other options as well, but I don’t see much discussion of the characteristics of the data (except perhaps the “draw from a distribution” function).

Maybe that’s ok? How do people generate data for data cleaning exercises? What should it look like? I’d love to incorporate stuff from the Falsehood series (e.g., Falsehoods Programmers Believe about Names) but with some idea of the prevalence of the counterexamples (perhaps for a given region or cohort).
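For what it’s worth, the toy version of a generator is easy to sketch; all the realism lives in the pools and distributions you feed it. Everything below (names, domains, the age distribution) is invented for illustration:

```python
import random

# Tiny, invented pools; genuinely "realistic" data needs far richer models.
FIRST_NAMES = ["Ana", "Bob", "Chen", "Dmitri", "Eve"]
DOMAINS = ["example.com", "mail.example", "uni.example"]

def fake_person(rng):
    name = rng.choice(FIRST_NAMES)
    age = max(18, int(rng.gauss(35, 12)))  # a crude "draw from a distribution"
    email = f"{name.lower()}{rng.randrange(100)}@{rng.choice(DOMAINS)}"
    return {"name": name, "age": age, "email": email}

rng = random.Random(42)  # seed it, so a grading dataset is reproducible
people = [fake_person(rng) for _ in range(5)]
```

Swapping the pools per region or cohort is also where you’d inject Falsehoods-style counterexamples, at whatever prevalence you pick.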

A Panoply of SQL Injection Horrors

I hope that this year we’ll be able to migrate our Data on the Web course to Python and to focus a bit on manipulating data and formats we design.

Which means we can talk about APIs and the crappiness of string hacking for anything. Thus, SQL Injection!

The Code Curmudgeon maintains a SQL Injection Hall-of-Shame, which is fascinating and depressing reading. (The page includes helpful links, including the invaluable SQL Injection Prevention Cheat Sheet.)

On the one hand, the lesson seems to write itself. On the other, it’s really important to teach this stuff!
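The core of that lesson fits in a few lines. A sketch with Python’s sqlite3 (table and data invented): string hacking lets the input rewrite the query, while a parameterized query keeps it as data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, secret TEXT)")
conn.execute("INSERT INTO users VALUES ('alice', 'hunter2')")

evil = "nobody' OR '1'='1"

# String hacking: the attacker's input becomes part of the SQL itself.
injected = conn.execute(
    "SELECT secret FROM users WHERE name = '%s'" % evil
).fetchall()   # -> [('hunter2',)]: alice's secret leaks

# Parameterized query: the driver passes the input as a plain value.
safe = conn.execute(
    "SELECT secret FROM users WHERE name = ?", (evil,)
).fetchall()   # -> []: no user is literally named that
```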

(I’ll throw the XSS Prevention Cheat Sheet on here too.)

What’s up with NoSQL?

A blog post (by a database vendor) caught my eye, entitled Why SQL is beating NoSQL, and what this means for the future of data. For years I’ve co-taught a course roughly on the prior generation of anti-SQL database tech (primarily XML, all under the rubric of “semi-structured data”). For a similar time frame, I’ve researched and advocated for a sort of “anti-SQL” information tech: the RDF stack and ontologies. And I remember the MySQL arguments against ACIDity, which are a kind of anti-SQL move. Most of the time, the arguments are against either the relational calculus (too inexpressive!) or the tech infrastructure (doesn’t scale!). Modern NoSQL was more focused on the latter, so it would be no surprise if, as we technically catch up on scaling, SQL made a comeback. If your objection to relational is performance, that’s not a good long-term bet. Infrastructure catches up, and crappy piles of code doing all sorts of little query, maintenance, and analytical jobs suck. Big piles of SQL suck too, but they suck so much less:

Developers soon found that not having SQL was actually quite limiting. Each NoSQL database offered its own unique query language, which meant: more languages to learn (and to teach to your coworkers); increased difficulty in connecting these databases to applications, leading to tons of brittle glue code; a lack of a third party ecosystem, requiring companies to develop their own operational and visualization tools.

The post gives a reasonable (if polemical) history and observes that the infrastructure is catching up:

First came the SQL interfaces on top of Hadoop (and later, Spark), leading the industry to “back-cronym” NoSQL to “Not Only SQL” (yeah, nice try).

Then came the rise of NewSQL: new scalable databases that fully embraced SQL. H-Store (published 2008) from MIT and Brown researchers was one of the first scale-out OLTP databases. Google again led the way for a geo-replicated SQL-interfaced database with their first Spanner paper (published 2012) (whose authors include the original MapReduce authors), followed by other pioneers like CockroachDB (2014).

This is all fine…buuuuuut…it doesn’t actually tell you about market or mindshare. Boo! We want numbers!

And…the article presumed that NoSQL had been winning and that SQL has now made a comeback. That seems unlikely, especially over the given time frame.

A bit of searching shows a different picture.

First, there’s this hilarious InfoWorld article from 2016 entitled “NoSQL chips away at Oracle, IBM, and Microsoft dominance”.

Back in 2014, Network World’s Brandon Butler declared that NoSQL was “giving SQL database vendors and users a scare,” and a year later InfoWorld’s Andy Oliver quipped that “the once red-hot database technology is losing its luster, as NoSQL reaches mass adoption,” becoming boringly mainstream.

the SQL incumbents must be a little nervous. A new Gartner report suggests that NoSQL continues to kick the shins of its legacy RDBMS competition.

The bad news, however, is that their dominance is slipping,

Oooo! Scaaaary!!!! But wait:

Yet relational database vendors continue to print money; the NoSQL contenders, many of which are open source — not so much.

A new Gartner report suggests that NoSQL continues to kick the shins of its legacy RDBMS competition. As Gartner analyst Merv Adrian notes, “Over the past five years, the megavendors have collectively lost share,” dropping 2 percentage points to a still-hegemonic 89 percent market share.


All this big data infrastructure, in short, is only 3 percent of the overall paid DBMS market.

(This is a friendly reminder that prima facie you should treat Gartner reports as garbage. If the garbage is helpful because it pimps your stuff, fine. That’s advertising. Otherwise, ignore.)

But take revenue out of the equation, and cracks in the DBMS market share numbers start to appear. According to DB-Engines — which measures database popularity across a range of factors (including job listings and search interest) but excludes revenue numbers — Oracle, Microsoft, and IBM are joined at the top by some noisy neighbors:

Ok! This is a possible story. Commodification and expanding the un- and underpaid database market might be a thing (maybe). But let’s look at the DB-Engines trend line:

A ranking based on a bunch of indirect metrics.

Yeah, see that big cluster of lines up top? Oracle, MySQL, and SQL Server. See that orange line that rises up a tiny bit at the end? PostgreSQL. This is a reminder not to trust InfoWorld.

Now their ranking definition is interesting: it’s basically “web presence” (total search results, mentions in job adverts, questions about the systems, etc.). An established base is going to skew these numbers toward old tech, and hype toward new tech. Maybe these balance out?

A StackOverflow survey doesn’t have any trendlines, but does say something about recent DB popularity:

So, NoSQL as a (paid or unpaid) market force seems to be a bit of a bust at least relative to the hype.

Evaluating tech (and tech trends) is hard!