A Panoply of SQL Injection Horrors

I hope that this year we’ll be able to migrate our Data on the Web course to Python and to focus a bit on manipulating data and formats we design.

Which means we can talk about APIs and the crappiness of string hacking for anything. Thus, SQL Injection!
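Here's a minimal sketch of the sort of thing I have in mind, using Python's built-in sqlite3 module. The users table, its contents, and the two function names are made up for illustration; the point is just the contrast between gluing input into the query string and passing it as a parameter.

```python
import sqlite3

# Throwaway in-memory database with a hypothetical users table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, secret TEXT)")
conn.executemany(
    "INSERT INTO users VALUES (?, ?)",
    [("alice", "a-secret"), ("bob", "b-secret")],
)


def find_user_unsafe(name):
    # String hacking: the input becomes part of the SQL text itself.
    query = "SELECT name, secret FROM users WHERE name = '" + name + "'"
    return conn.execute(query).fetchall()


def find_user_safe(name):
    # Parameterized query: the input is bound as a value, never parsed as SQL.
    return conn.execute(
        "SELECT name, secret FROM users WHERE name = ?", (name,)
    ).fetchall()


# A classic injection payload: closes the quote and adds an always-true clause.
attack = "nobody' OR '1'='1"
unsafe_rows = find_user_unsafe(attack)  # every row in the table comes back
safe_rows = find_user_safe(attack)      # no user has that literal name
```

The unsafe version hands over everyone's secrets because the payload rewrites the WHERE clause; the safe version just looks for a user with a very odd name and finds nobody.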

The Code Curmudgeon maintains a SQL Injection Hall-of-Shame which is fascinating and depressing reading. (The page includes helpful links including the invaluable SQL Injection Prevention Cheat Sheet.)

On the one hand, the lesson seems to write itself. On the other, it’s really important to teach this stuff!

(I’ll throw the XSS Prevention Cheat Sheet on here too.)


What’s up with NoSQL?

A blog post (by a database vendor) entitled Why SQL is beating NoSQL, and what this means for the future of data caught my eye. For years I’ve co-taught a course roughly on the prior generation of anti-SQL database tech (primarily XML, all under the rubric of “semi-structured data”). For a similar time frame, I’ve researched and advocated for a sort of “anti-SQL” information tech: the RDF stack and ontologies. And I remember the MySQL arguments against ACIDity, which are a kind of anti-SQL move.

Most of the time, the arguments are against either the relational calculus (too inexpressive!) or the tech infrastructure (doesn’t scale!). Modern NoSQL was more focused on the latter, so it’d be no surprise if SQL made a comeback as we technically catch up on scaling. If your objection to relational is performance, that’s not a good long-term bet. Infrastructure catches up, and crappy piles of code doing all sorts of little query, maintenance, and analytical jobs suck. Big piles of SQL suck too, but they suck so much less:

Developers soon found that not having SQL was actually quite limiting. Each NoSQL database offered its own unique query language, which meant: more languages to learn (and to teach to your coworkers); increased difficulty in connecting these databases to applications, leading to tons of brittle glue code; a lack of a third party ecosystem, requiring companies to develop their own operational and visualization tools.

The post gives a reasonable (if polemical) history and observes that the infrastructure is catching up:

First came the SQL interfaces on top of Hadoop (and later, Spark), leading the industry to “back-cronym” NoSQL to “Not Only SQL” (yeah, nice try).

Then came the rise of NewSQL: new scalable databases that fully embraced SQL. H-Store (published 2008) from MIT and Brown researchers was one of the first scale-out OLTP databases. Google again led the way for a geo-replicated SQL-interfaced database with their first Spanner paper (published 2012) (whose authors include the original MapReduce authors), followed by other pioneers like CockroachDB (2014).

This is all fine…buuuuuut…it doesn’t actually tell you about market or mindshare. Boo! We want numbers!

And…the article presumes that NoSQL had been winning and that SQL has since made a comeback! That seems unlikely, especially over the given time frame.

A bit of searching shows a different picture.

First, there’s this hilarious InfoWorld article from 2016 entitled “NoSQL chips away at Oracle, IBM, and Microsoft dominance”.

Back in 2014, Network World’s Brandon Butler declared that NoSQL was “giving SQL database vendors and users a scare,” and a year later InfoWorld’s Andy Oliver quipped that “the once red-hot database technology is losing its luster, as NoSQL reaches mass adoption,” becoming boringly mainstream.

the SQL incumbents must be a little nervous. A new Gartner report suggests that NoSQL continues to kick the shins of its legacy RDBMS competition.

The bad news, however, is that their dominance is slipping,

Oooo! Scaaaary!!!! But wait:

Yet relational database vendors continue to print money; the NoSQL contenders, many of which are open source — not so much.

A new Gartner report suggests that NoSQL continues to kick the shins of its legacy RDBMS competition. As Gartner analyst Merv Adrian notes, “Over the past five years, the megavendors have collectively lost share,” dropping 2 percentage points to a still-hegemonic 89 percent market share.


All this big data infrastructure, in short, is only 3 percent of the overall paid DBMS market.

(This is a friendly reminder that prima facie you should treat Gartner reports as garbage. If the garbage is helpful because it pimps your stuff, fine. That’s advertising. Otherwise, ignore.)

But take revenue out of the equation, and cracks in the DBMS market share numbers start to appear. According to DB-Engines — which measures database popularity across a range of factors (including job listings and search interest) but excludes revenue numbers — Oracle, Microsoft, and IBM are joined at the top by some noisy neighbors:

Ok! This is a possible story. Commodification and expanding the un- and underpaid database market might be a thing (maybe). But let’s look at the DB-Engines trend line:

A ranking based on a bunch of indirect metrics.

Yeah, see that big cluster of lines up top? Oracle, MySQL, and SQL Server. See that orange line that rises up a tiny bit at the end? PostgreSQL. This is a reminder not to trust InfoWorld.

Now their ranking definition is interesting: It’s basically “web presence”, total search results, mentions in job adverts, questions about the systems, etc. Established base is going to skew these numbers toward old tech and hype toward new tech. Maybe these balance out?

A Stack Overflow survey doesn’t have any trendlines, but does say something about recent DB popularity:

So, NoSQL as a (paid or unpaid) market force seems to be a bit of a bust at least relative to the hype.

Evaluating tech (and tech trends) is hard!

Chinook Example Database

I’m developing new material for a course on semi-structured data, so I’ve been looking for examples. Since we are going to incorporate a SQL bit, I need me some nice SQL databases. The Chinook example database looks really promising. It’s derived from the iTunes XML, which is really nice for my purposes as we can compare models and queries directly. Students could even generate their own example databases from their own iTunes, which is cool.
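To give a flavour of the "compare models and queries directly" idea, here's a rough sketch in Python. Big caveats: the XML below is a toy format I made up (not the real iTunes library format), the table and column names are my own (not Chinook's actual schema), and the data is invented. It asks the same question, tracks per artist, of both representations.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Toy XML, NOT the real iTunes format; artists and tracks are invented.
XML_DOC = """
<library>
  <track><artist>Artist A</artist><name>Song One</name></track>
  <track><artist>Artist A</artist><name>Song Two</name></track>
  <track><artist>Artist B</artist><name>Song Three</name></track>
</library>
""".strip()

root = ET.fromstring(XML_DOC)

# XML side: walk the tree and count by hand.
xml_counts = {}
for track in root.iter("track"):
    artist = track.findtext("artist")
    xml_counts[artist] = xml_counts.get(artist, 0) + 1

# Relational side: a hypothetical two-table schema loaded from the same XML.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE artist (
    artist_id INTEGER PRIMARY KEY,
    name TEXT NOT NULL UNIQUE
);
CREATE TABLE track (
    track_id INTEGER PRIMARY KEY,
    name TEXT NOT NULL,
    artist_id INTEGER NOT NULL REFERENCES artist(artist_id)
);
""")
for track in root.iter("track"):
    artist = track.findtext("artist")
    # Artist names repeat in the XML; here each artist is stored once.
    conn.execute("INSERT OR IGNORE INTO artist (name) VALUES (?)", (artist,))
    (artist_id,) = conn.execute(
        "SELECT artist_id FROM artist WHERE name = ?", (artist,)
    ).fetchone()
    conn.execute(
        "INSERT INTO track (name, artist_id) VALUES (?, ?)",
        (track.findtext("name"), artist_id),
    )

# SQL side: the same question as a declarative query.
sql_counts = dict(conn.execute("""
    SELECT artist.name, COUNT(*)
    FROM track JOIN artist USING (artist_id)
    GROUP BY artist.name
"""))
```

Both dictionaries should agree; what differs is where the work lives (a loop over the tree vs. a join and a GROUP BY), which is exactly the sort of thing I'd want students to argue about.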

Where it sorta fails for me as it stands is that it doesn’t clearly expose interesting modelling choices. It is sort of an existence proof that XML-like features are unnecessary.

(These address tables are nice in that respect. It’s clear why you might want to factor multiple phone numbers into a separate table and having UK vs. US addresses is a good forcing function for thinking about wacky variation and how easy it is to overfit.)
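A minimal sketch of the kind of modelling choice I mean, again with sqlite3. The schema, names, phone numbers, and addresses are all invented for illustration and are not taken from those address tables.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE person (
    person_id INTEGER PRIMARY KEY,
    name TEXT NOT NULL
);
-- Phones factored out: a person can have zero, one, or many.
CREATE TABLE phone (
    phone_id INTEGER PRIMARY KEY,
    person_id INTEGER NOT NULL REFERENCES person(person_id),
    kind TEXT NOT NULL,
    number TEXT NOT NULL
);
-- One address table that tries not to overfit to a single country.
CREATE TABLE address (
    address_id INTEGER PRIMARY KEY,
    person_id INTEGER NOT NULL REFERENCES person(person_id),
    country TEXT NOT NULL,
    line1 TEXT NOT NULL,
    city TEXT NOT NULL,
    region TEXT,              -- a US state fits here; left NULL for the UK row below
    postal_code TEXT NOT NULL -- ZIP code or UK postcode, kept as free text
);
""")

conn.execute("INSERT INTO person (person_id, name) VALUES (1, 'Example Person')")
conn.executemany(
    "INSERT INTO phone (person_id, kind, number) VALUES (?, ?, ?)",
    [(1, "mobile", "555-0100"), (1, "work", "020 7946 0000")],
)
conn.executemany(
    "INSERT INTO address (person_id, country, line1, city, region, postal_code) "
    "VALUES (?, ?, ?, ?, ?, ?)",
    [
        (1, "US", "100 Example Ave", "Springfield", "IL", "62701"),
        (1, "UK", "1 Example Street", "Manchester", None, "M1 1AA"),
    ],
)

# How many phone numbers does person 1 have?
(phone_count,) = conn.execute(
    "SELECT COUNT(*) FROM phone WHERE person_id = 1"
).fetchone()

# Which addresses have no region at all?
regionless = [
    row[0]
    for row in conn.execute("SELECT country FROM address WHERE region IS NULL")
]
```

A schema designed only around the US row would likely make region required and validate postal_code as five digits, and the UK row would break it. That's the overfitting worry in miniature.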

(There was an awesome site with hundreds of simple ER diagrams/schemas for all sorts of common scenarios, most quite small. I can’t find it at the moment, grr!)