Archive for the 'Computer Science' Category

The GPL Won

January 18, 2017

It seems like GPL and related licenses dominate open source projects, by a lot:

At the beginning of his talk, DiBona said that according to Google’s net crawlers, the web now contains over 31 million open source projects, spanning 2 billion lines of code. Forty-eight per cent of these projects are under the GPL, 23 per cent use the LGPL, 14 per cent use the BSD license, 6 per cent use Apache, and 5 per cent use the MIT license. All other licenses are used with under 5 per cent of the projects.

So, GPL variants govern 71% of 31 million projects. Daaaamn. That’s a lot. A lot more than the rest, which are less restrictive.

I confess to being a bit surprised given the hostility (or exasperation) one often encounters from, e.g., business folks when dealing with the GPL. Of course, it has two strong factors: it’s viral (so derived projects must use it) and it has a lot of advocacy, both dedicated (think Stallman) and more incidental (think Linux in general).

Ooo, I really have an itch to find out whether virality is a big factor….


On Facebook, Dan Brickley (thanks Dan!) points out that 1) this survey is from 2011 and 2) more recent surveys point to a shift away from GPL to more permissive licenses, to wit, MIT and Apache:

Indeed, if we contrast each license’s share of the repositories surveyed by Black Duck [January 2017] versus January 2010, the shift is quite apparent….

In Black Duck’s sample, the most popular variant of the GPL – version 2 – is less than half as popular as it was (46% to 19%). Over the same span, the permissive MIT has gone from 8% share to 29%, while its permissive cousin the Apache License 2.0 jumped from 5% to 15%. What this means is that over the course of a seven year period, the GPLv2 has gone from being roughly equal in popularity to the next nine licenses combined to 10% out of first place.

All of which suggests that if we generally meant copyleft when we were talking about open source in 2007, we typically mean permissive when we discuss it today.

Read the whole thing, esp. the bit about the rise of unlicensed projects on Github.

Now, methodologically, their survey is smaller:

This open source licensing data reflects analysis of over two million open source projects from over 9,000 global forges and repositories.

So, it might be the case that the Google population wouldn’t show this shift. But, ex ante, a focused crawl is more likely (perhaps) to be dominated by “high quality” repositories, thus may reflect best or active practice better.

This all still cries out for some causal investigation.

The Concentration of Web Power

January 17, 2017

In the early days of the Web (and the Internet in general), there was a belief that it was a different kind of system. Distributedly anarchistic, that is, there was no central authority. No one organisation owned…or could own the whole network. Oh we worried about some centralising tendencies…browser makers had a lot of power. DNS is pretty centrally controlled. But the decentralised, distributed nature was also supposed to be resilient against various sorts of attack, including attempts to control it. The internet “routes around damage”, including control freaks. The contrast class were closed “walled gardens” like Compuserve and AOL. These were behemoths of online activity until they were crushed by open systems. Internet systems just scaled better because they were open and decentralised…so the story went.

But the world has changed. While we have more online activity than ever before, the capability for single organisations to control it all has also increased. Many organisations (Google, Apple, Facebook, Amazon to name a few) have the technical wherewithal and economic resources to build a global infrastructure that could handle the traffic of the current Web. (More orgs could do it if we reuse the current communication infrastructure. Google experiments notwithstanding, it would be a heavy lift for them to build out broadband and cell towers across the globe. These organisations could, of course, buy mobile carriers and backbone providers…)

Furthermore, governments have been rather effective in controlling the internet, cf China. The ability to route around such breakage on a mass scale is proving fairly limited.

As a result, power is getting increasingly concentrated in the online world (which, given the economic gains, is translating to power everywhere). We are pretty close to the Compuserve/AOL walled garden world. Even nominally distinct organisations rely heavily on the same high level infrastructure, which is why you see the same awful ads everywhere.

Google seems to be trying hardest for vertical integration. They push Chrome relentlessly and that lets them play with protocols specific to Chrome and their servers. Sure, they generally push them toward standards…but this is a new capability. Facebook tried this a bit (Facebook phone anyone?) but didn’t do so well. Mobile apps make this true for everyone.

Strangely we’re getting some refragmented experiences. I was trying to debug Zoe’s new YouTube Channel (a bad experience, to be sure) and some things were different in the iPad app than on the website. Fine fine. I mean, terrible, but ok. But I tried to debug it on my iPad and I could not open the YouTube website in a browser. It forced me into the app with no other options (“Open in YouTube App? Ok. Cancel.” where “Cancel” means don’t open the page!). (Ok, I eventually “googled” how to modify the URL (from “www.” to “m.”) to make it work but holy hell that was wrong.) I was trying to share a page from the Guardian and got stuck in AMP hell. I could not get rid of the AMP/Google URL to get to the Guardian one (without string hacking).

That AMP page sure loaded fast…

…but the URL, server, connection all belong to Google.

It is interesting that mass phenomena are easier to control than small scale ones, in some respects. There’s more money available and most people are not very technically sophisticated. Hell, I’m not. I don’t want to be. I leave a lot of things broken because while I could suss it out, I just don’t have the time or energy to fight through all the nonsense. (And there is a ton of a ton of nonsense.)

So, this is the world we live in. The problem isn’t that the big players will never stumble and fall…they might! Yahoo died. If Facebook dies (depending on how it does) it will be traumatic. If Google dies (depending on how it does), it will be very bad. But, these are probably survivable in most scenarios. They’ll degrade slowly and other big players will grab some and some new players will get big.

However, it’s hard to see how it won’t be big players from here on out.


Ruben Verborgh has a related discussion. My quick hit on the key difference is that Ruben isn’t focused on the centralisation and web architectural aspects, but on, roughly, publishers vs. consumers. A world where every site has a preferred app and directs you to it is still (potentially) a decentralised and distributed one. He’s more focused on a different sort of power, i.e., the power of individual site owners over their viewers. Now, obviously, this is related to the concentration of power I focused on since one of the bad things an “owner of the web” can do is exploit the rest of us in a myriad of ways. But I think it’s important to note that concentration and centralisation have not been “app” driven. Network effects (plus quality!) seem sufficient to explain the rise of Google and Apple. Facebook and Twitter rose by network effects alone, I’d say. Once you have a ton of cash and a big user base, you can exploit those in a variety of ways to gain more power. Though this isn’t trivial, witness Google’s persistent failures in social (Google+!?).

Kyle Schreiber writes about AMP lock in (via Daring Fireball):

Make no mistake. AMP is about lock-in for Google. AMP is meant to keep publishers tied to Google. Clicking on an AMP link feels like you never even leave the search page, and links to AMP content are displayed prominently in Google’s news carousel. This is their response to similar formats from both Facebook and Apple, both of which are designed to keep users within their respective ecosystems. However, Google’s implementation of AMP is more broad and far reaching than the Apple and Facebook equivalents. Google’s implementation of AMP is on the open web and isn’t limited to just an app like Facebook or Apple.

AMP (and other URL/browser behavior hijackings like search result URLs) is extra offensive because it hits deep into the working of the Web. But if we all end up living in Facebook all the time, it won’t matter if the URLs “look” independent.

Note that I host my blog on WordPress, for “free.” WordPress is a pretty popular Content Management System with an ecosystem that is hard to rival. But the results are pretty vanilla Web. WordPress.com itself isn’t super dominant and it’s very easy to break free. There seems to be a material difference in the situations.

Ontology Management on the Gartner Hype Cycle!

September 28, 2016

The Gartner hype cycle is an analytical construct (of sorts) which tries to capture the relation between a technology and the expectations we have for that technology. It’s based on the pretty reasonable observation that esp with new technology, there’s a tendency for expectations to outrun the current or even potential benefits. Everyone wants to use the new glittery magic, so vendors and specialising consultants do very well for a while. But it turns out that the new technology isn’t magic, so people find that they’ve spent a bunch of money and time and energy and they still have the problems the tech was supposed to magically solve. This leads to a crash in expectations and a backlash against the tech. But lots of new tech is actually useful, used appropriately, so some of the new tech, its shiny worn off, finds a place in our toolkit and tech landscape. The Gartner hype cycle is a pretty iconic graph with fun-ish labels:

(The y-axis gets different labels over time.)

And people try to operationalise it:


But I’m skeptical about a lot of this as being rigorously evaluated.

Of course, sometimes a tech takes off and doesn’t really stop. It goes pretty straight from trigger to productivity. The iPhone and iPhone-style phones come to mind. It Just Grew. It may level off as it hits saturation, but that’s a completely different phenomenon.

This is all pretty banal stuff, but Gartner takes it very seriously (they’ve branded it!).

ANYWAY, this year’s hype cycle, excitingly, includes ontology management for the first time! WE’RE ON THE MAP!

  • 16 new technologies included in the Hype Cycle for the first time this year. These technologies include 4D Printing, Blockchain, General-Purpose Machine Intelligence, 802.11ax, Context Brokering, Neuromorphic Hardware, Data Broker PaaS (dbrPaaS), Personal Analytics, Smart Workspace, Smart Data Discovery, Commercial UAVs (Drones), Connected Home, Machine Learning, Nanotube Electronics, Software-Defined Anything (SDx), and Enterprise Taxonomy and Ontology Management,

Alas, if you look at the graph, we’re on the downslope into the Trough of Disillusionment:

And it has a “more than 10 years” to mainstream adoption label.


This is discouraging and perhaps hopeful. Remember that the hype cycle doesn’t tell you much about the quality, maturity, or utility of the technology, only the perception and influence of perception on the market. (To the degree you believe it at all.) 10 years to mainstream adoption is not 10 years from being a boon for your business or a viable business itself. It means you will often have a hard sell, because people are skeptical.

Update: Oh WordPress. Picture management please.

Grumpy about Textbooks

September 20, 2016

I definitely need to do more research but I don’t feel that there is a really solid textbook on software engineering. I use Steve McConnell’s Code Complete (second edition) and Making Software for readings.

These are both pretty good. Code Complete is a bible for many people (not for me!) but regardless it’s definitely on a “you should read this if you are a software engineer” list. It has a few problems though:

  1. It’s not written with courses in mind, as far as I can tell. It introduces a lot of stuff and sometimes in a helpful order, but other times not. The “learning objects” are not clear at all.
  2. It’s not super well written. You get a lot of interesting lists (e.g., of program qualities) but they are often not coherent, have some redundancies, and are perfunctorily designed. These often feel revelatory on a first read but if you try to work with them you get a bit grumpy. For example, we have 4 kinds of tests: unit, component, integration, and system. Unit and component test bits of the same size: a unit. The difference is whether the unit is maintained by one team (thus a unit test) or more than one team (a component test). This is bonkers. It’s esp. bonkers to compare with integration or system tests. It could be part of an interesting axis (who’s the owner vs. who’s writing the tests). But there are much better frameworks out there.
  3. It’s a bit dated. The second edition came out in 2004 and is thus 12 years old. This doesn’t invalidate it per se, but given that the book itself has a prominent discussion of the need for life long learning because the fundamentals of software engineering keep changing, it’s a problem. I’d prefer something other than Basic as the “other” example language.
  4. It pretends to focus on code construction, but has just enough architecture, etc. to be almost a reasonably complete text. But the scattershot approach is a bit disorienting.

If you read it cover to cover and absorbed it all with an appropriately skeptical eye and organised it appropriately, then you’d be in great shape.

My pal Mark suggested reorienting on The Pragmatic Programmer, which is another classic and definitely on the must read list. But a lot of my concerns apply to it too. (That there’s a basic divide between those pushing Code Complete and those pushing the Pragmatic Programmer is interesting. The line I’ve seen is that Code Complete aspires to be encyclopaedic and the Pragmatic Programmer is more opinionated and thus effective. Roughly. They both feel scattered to me.)

I could try both (not this year!). I could go with Pragmatic Programmer because it’s smaller and thus they could possibly read the whole thing.

But neither feels satisfactory as a textbook. The systematicity and pedagogic logic just don’t seem to be there. So I’m left imposing some order on them.

Software Gripes: Scrivener and ConcertWindow (and WordPress)

April 16, 2016

I think I need regular “features” i.e., columns of a particular type or theme, to keep the blogging going, so here’s a new one near and dear to my heart: ranting about software problems (I’ll throw in other system gripes but the most common is software).


I want to love Scrivener. It certainly is enticing, if a bit complex. I’m trying to use it as a course materials (lectures, quizzes, etc.) management and editing tool. People certainly seem to have had some success with it as such. I think it could also be handy for paper or book writing and esp. grant writing. Grants have a VERY complex and finicky structure, for which Scrivener’s “break it into bits” and “annotate and organise” and “hey, templates all the way down” approach looks to be quite good.

But there’s a fundamental problem: The whole Scrivener model is “compiling” the project into a single final document. Really. Uhm…that’s bonkers. Even if your final output is conceptually a single book, you very well may want the “out of Scrivener” view to be split up in multiple files. (Think Website with a separate HTML page per Chapter. Or just Website.)  For courses, I don’t want one output to contain it all, I want lots of documents (syllabus, references, slides broken out by day or by lecture, quizzes, lab sheets, etc.) Scrivener HAS THAT STRUCTURE, but, as far as I can tell, it doesn’t like to spit it out. You can “export” the file structure, and maybe that will turn out to be good enough. (I only figured that out today.) But I want some of the structure to be flattened! E.g., if I make a Lecture which has separate subdocuments as “slides”, for some workflows they should be combined! But the whole Lecture shouldn’t be combined with all the other lectures. (Except for the global print version.)

Ok, “export” at least lets me write my own custom compiler. But then why do I have to deal with the “project” structure and explicitly set “export”? Why can’t a project just be a directory/file structure in the file system. In other words, why “export”? That adds a really painful step to the process. It makes synching harder, etc.
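
Here is a rough sketch of what such a custom compiler might look like. It is only an illustration under assumptions of mine: that the exported tree has one directory per lecture with one Markdown file per slide, that files use a `.md` extension, and that sorting by file name gives the right slide order. None of that is documented Scrivener behaviour, and `flatten_lectures` is a made-up name.

```python
from pathlib import Path


def flatten_lectures(export_dir: Path, out_dir: Path) -> list:
    """Combine the slide files inside each lecture folder into one file per lecture.

    Assumed (hypothetical) layout: export_dir/<lecture>/<slide>.md
    Files sitting directly in export_dir (syllabus, references, ...) are left alone.
    """
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for lecture in sorted(p for p in export_dir.iterdir() if p.is_dir()):
        slides = sorted(lecture.glob("*.md"))
        if not slides:
            continue
        # Flatten within a lecture, but never across lectures.
        combined = "\n\n".join(
            slide.read_text(encoding="utf-8").strip() for slide in slides
        )
        target = out_dir / f"{lecture.name}.md"
        target.write_text(combined + "\n", encoding="utf-8")
        written.append(target)
    return written
```

Calling `flatten_lectures(Path("export"), Path("build"))` would then give one combined file per lecture in `build/`, which is the “some of the structure flattened, but not all of it” behaviour I’m after. It still requires the explicit export step first, which is the real complaint.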

Additionally, Scrivener has some simple WYSIWYG formatting (bold, italic, tables, lists, etc.) It also has export to MultiMarkdown. This all seems extremely promising for downstream processing: Write using the GUI, export to Markdown, then run tools that parse and manipulate the Markdown to generate the final versions.

Oh, silly me! All Scrivener does is compile snippets written in MultiMarkdown to other formats (HTML or PDF via LaTeX)! You have to write the Markdown.

Well that sucks. It’s not like Scrivener is a world class Markdown editor with syntax checking etc. The key formatting features it supports in the GUI are eminently Markdownable, so why not export to it? Indeed, for things like Tables, having a reasonable GUI is much much nicer than hacking Markdown syntax directly. Sigh.

Finally, they have this cork board view. Before 2.7, it defaulted to a cork textured background and index card looking cards. Very skeuomorphic, but in a good way. It took you out of the UI and forced a cognitive mode shift. In 2.7, it defaulted to a “flat” interface that was 1) bland and 2) merged it visually with every other view.

Sigh. But wait! You can tweak it back. But now, in my preferred Index Card style, they stuck a pushpin.

Why, why, why, why?! It doesn’t read; it doesn’t help; it forces a “vertical” orientation (I actually viewed them as piles before). This little tumour does exactly nothing positive. It serves no visual-informatics purpose and, indeed, distracts. It’s centred, bright, and in line with meaningful information. This is skeuomorphic madness, where the designer slavishly emulates the real world object without thinking about the design. Pushpins are not a useful informational part of the design…they are there to hold the cards in place. If you lay the cork board flat, you don’t need them.

“But Bijan,” you say, “the cork only exists to have pins pushed in! Isn’t that the same problem?”

No, gentle reader, while the cork in the real object is there functionally to be stuck with pins, it has several user interface functions: 1) visual mode switching; it’s a very strong cue about the difference in working style; it provides an information cue, 2) it supports the illusion without affecting other information per se, and 3) it is high contrast yet not obtrusive. The main problem with skeuomorphism is that people take it too far. The idea shouldn’t be to exactly replicate the real world object, but to design an interface that works. Flat interfaces generally suck because they are generally designed on the idea that chrome should be indistinguishable from content (or not be perceptible at all) and content should have few distinguishing sub-features. (Microsoft’s Metro interface is something of an exception.)


Zoe tried to do a ConcertWindow concert last Sunday. There were numerous technical hassles, but we managed to struggle through most of them and have a reasonable concert which most viewers could see most of. One cool feature is that you can get the full recording of the stream and the website lets you post a one song snippet of the recording on their website. This was exactly what we wanted to promote the new album (in progress).

We do not have such a recording.

The reason we do not have one is that they have a “feature” that is supposed to help you debug your streaming. For a given concert slot, you can set up a “test” session which will not be exposed to anyone except your testers and can happen at other times than your scheduled slot. This sounded sensible, but there were a few problems:

  1. It doesn’t work from the iOS app, which is how we were going to broadcast the concert. Grr. But ok, we can at least test the basic setup via the browser version.
  2. Testing via the browser version just doesn’t help very much. You still need to test via the iOS app. A lot. So we were scheduling test concerts all over the place. That was better in some ways, since that’s what exposed that the “Pay what you want” option is really “Pay what you want as long as it is at least $1”. Grr.
  3. When you go to look at your video, they prepend “for your reference” all the test video you did. What? Why? Who wants that? Who wants that in their concert recording? Shouldn’t you just save that as a separate file, if at all? Weird.
  4. Oh, and if you tested in your browser, but recorded from iOS, you now have a video that is half test video and half corrupted nothing. That’s right, the “test” mode can corrupt your concert recording. So we have no video of the concert, whatsoever.
  5. In the FAQ for “Preparing for the show” they have “How can I sound check before the show?” which says

    Choose if you’re going to broadcast with Web, iOS, or RTMP, then switch to “Test” mode and start broadcasting. No one will be able to see it on your channel. Click the “Test URL” link below the broadcaster and you’ll be able to see your test stream in real time. You can also send this link to a friend.

    In the FAQ for  “After the show” they have “My archived video file has errors and/or the recording is corrupted” (it’s on the SECOND PAGE of this FAQ)

    This can sometimes happen if you broadcasted to the same show via multiple devices (iOS + laptop) or in different frame rates / formats.

    To avoid this happening, be sure to broadcast to each show using only one device and one video/audio format.

    If you do broadcast using multiple devices or formats, the live stream will work totally fine, but the archived recording may be corrupted.

    So, the advice they give beforehand can corrupt your recording because they have a feature (prepending test video) which is completely worthless. And their own help leads you there.

Message to the ConcertWindow programmers who did this: Never corrupt important data. Never. Ever. Especially don’t corrupt real data with test data. I mean…come on. Shame.

Message to the ConcertWindow documentation writers who did this: If there is a risk of data corruption…DON’T RECOMMEND ACTIONS THAT RAISE THAT RISK. Oh, and WARN PEOPLE ABOUT THE RISK AND HOW TO MITIGATE IT before they might do the action that destroys their data.

You should be profoundly ashamed of yourselves.

While we’re talking documentation nonsense, let’s consider this gem:

At Concert Window, we give the artist a full private copy of their show, for free. You can use it for any non-commercial use, including uploading it to YouTube. 

The video files are in .mp4 format, which is playable with most major video players including VLC and can be imported into iMovie and Final Cut Pro.

Sometimes, due to errors during broadcast or other reasons, the video files may be corrupted or unplayable. In that case we’re sorry but there’s nothing we can do. This is part of why we offer video archives as a free service.

In addition to downloading your full show recording, you can also create a short highlight video. Here’s an article with more details: How to create a highlight video

*Artists are not allowed to sell their show videos due to copyright restrictions.

First, note the “or other reasons” for corruption…like BEING MISLED BY THE DOCUMENTATION TO HIT A DESIGN BUG WHICH IS KNOWN TO CORRUPT YOUR CONCERT. Maybe you should fix that.

Second, note the nonsense of the highlight blocks. Zoe owns the copyright for the songs she played and the performance. The terms of service explicitly SAY that she owns the copyright.

(BTW, the terms of service are absurd and horrible. I’ll break that out in another post.)


Current gripe: Adding a category doesn’t put the new category under the parent one you’ve selected.

Also, I want to have categories be more meaningful. I’m currently inserting two key categories into my post title (see current post’s title): Music Monday and Software Gripe. This is wrong. I’m polluting my title with Metadata about my post in order to get the visual effect I want. Boo!

APIs on the Web Platform

January 26, 2016

A blog post about Microsoft Edge (their new browser) contains an extraordinary tidbit. They are talking about compatibility with other browsers and one metric is “shared APIs”. Then they have this nifty little table:

                       Google Chrome 48           Apple Safari 9
Internet Explorer 11   4076 shared APIs           3769 shared APIs
EdgeHTML 13            4724 shared APIs (+16%)    4157 shared APIs (+10%)

Ok, clear improvement, but what’s staggering is the sheer number of APIs to share!!!

Is there even a list of these APIs publicly available?! And 4724 needs to be regarded as a lower bound on the number of APIs (even standard APIs)! One of the comments complains about lack of RSS support! So even very common APIs didn’t make it in yet.
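
As a quick sanity check on the percentages in that table, here is a few lines of Python. The counts are copied from the table; the dictionary layout and the `percent_gain` helper are just my own scaffolding for the arithmetic.

```python
# Shared-API counts copied from the table above.
shared = {
    "Chrome 48": {"IE 11": 4076, "EdgeHTML 13": 4724},
    "Safari 9": {"IE 11": 3769, "EdgeHTML 13": 4157},
}


def percent_gain(old: int, new: int) -> int:
    """Relative increase from old to new, rounded to a whole percent."""
    return round(100 * (new - old) / old)


gains = {
    browser: percent_gain(counts["IE 11"], counts["EdgeHTML 13"])
    for browser, counts in shared.items()
}
print(gains)  # {'Chrome 48': 16, 'Safari 9': 10}
```

So the +16% and +10% are relative gains over IE 11, i.e., 648 and 388 additional shared APIs respectively.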

The web platform is extraordinarily complex.

I am practicing British understatement.

eBook Sucking: Skeuofunctionality Limitations

January 14, 2016

Oh ebooks. I want to love you more. I’m fine reading them, but the endless sucking of all aspects of the software miffs me.

Given that we’ve lost stable pagination, you would think that issues with selection would be eliminated. Nope! As you can see from the following screen shots (of Scribd on iPad), you cannot make a highlight that spans a page boundary. But the “page” isn’t a thing! It’s just a bogus limitation that comes from slavishly adhering to a tiny part of physical books. Keeping page turning is good and functional. Keeping highlighting within page boundaries is daft.

I want to select the paragraph starting at the bottom. So I select the bit there at the bottom of this page…

…then I select the rest of the paragraph on the next page…

…and I have two selections instead of one 😦

A Case Study in Evaluating a Knowledge Representation (part 1)

January 10, 2016

I had to give a little ad hoc talk on “OWL vs. Rules”. This comes up a lot, and not just within rules communities.

When answering the question “Why would we use OWL over rules?” or “What’s the difference between OWL and rules?” or any of these variants (which often are, “Someone convinced me to use a rules engine that I don’t fully understand…does that get me out of having to use Yet Another Thing?”), I prefer not to evangelise. I like various rule languages. Datalog is pretty sweet. Production rules are nifty. I’ve written my share of XSLT and Prolog. And for many things, they are the right thing. So, of course, the first problem is “Which rule flavour do you mean?”

What I like to do is talk about evaluation techniques and criteria. Most people who are asking the question don’t know how to do this. They don’t know how to tell whether a KR formalism is fit for purpose. It’s not a trivial problem!

I use something like the following framework:

  1. Check the adequacies:
    1. Representational
      • The language can express required knowledge more or less directly, naturally, effectively
    2. Cognitive
      • The language is usable
    3. Computational
      • The language is reasonably implementable with appropriate performance
  2. What are the services (both possible and actually implemented)?
  3. Do we get surprising reuse?

Obviously, the ground issue is whether a particular representation does the application job at all! But I’m presuming we believe that all candidates can get somewhere, esp. if we add a lot of code to fill in the gaps. Minimising the hacks is a key goal!

My current example use scenario is the following: We’re trying to implement the CHA2DS2-VASc scoring system against an Electronic Health Record (EHR) system. So we have some databases we need to query.

CHA2DS2-VASc has a number of risk factors (or variables) that we need to determine, e.g., whether the patient has hypertension (HTN).  Each of these corresponds to a clinical concept (for hypertension, call it…HTN). These concepts can have various definitions for various purposes, most obviously, they may have different operationalisations. So, I might say that a patient has HTN if they have a typical blood pressure of over 140/80, or if in the last three months they had a blood pressure reading of >140/80, or if they are on 2 classes of anti-HTN drugs. These definitions are operational: They are in terms of metrics, not underlying concepts. But when trying to translate research into practice, the (various) operational definitions are key!
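
To make “operational” concrete, here is a small sketch of the three definitions as plain predicates over a toy patient record. Everything about the data model is an assumption of mine for illustration (the `Reading` and `Patient` classes, reading “typical” as the median, reading “>140/80” as either number being over its threshold, and approximating three months as 90 days); it is not how any real EHR exposes this.

```python
from dataclasses import dataclass, field
from datetime import date, timedelta
from statistics import median


@dataclass
class Reading:
    taken: date
    systolic: int
    diastolic: int


@dataclass
class Patient:
    readings: list = field(default_factory=list)    # Reading objects
    drug_classes: set = field(default_factory=set)  # anti-HTN drug classes the patient is on


def is_high(systolic, diastolic):
    # Assumption: ">140/80" means either number is over its threshold.
    return systolic > 140 or diastolic > 80


def by_typical_bp(p, today):
    # Assumption: "typical" is read as the median over all readings.
    if not p.readings:
        return False
    return is_high(median(r.systolic for r in p.readings),
                   median(r.diastolic for r in p.readings))


def by_recent_reading(p, today):
    # Assumption: "the last three months" is approximated as 90 days.
    cutoff = today - timedelta(days=90)
    return any(r.taken >= cutoff and is_high(r.systolic, r.diastolic)
               for r in p.readings)


def by_drug_classes(p, today):
    return len(p.drug_classes) >= 2


# Each operationalisation is kept separate rather than OR-ed together,
# because they are alternative definitions that can disagree.
OPERATIONALISATIONS = {
    "typical_bp": by_typical_bp,
    "recent_reading": by_recent_reading,
    "two_drug_classes": by_drug_classes,
}


def classify(p, today):
    """Report, per operational definition, whether the patient counts as HTN."""
    return {name: rule(p, today) for name, rule in OPERATIONALISATIONS.items()}
```

The point of keeping the definitions side by side is that the same patient can be “with HTN” under one and not under another, which is exactly why the choice of operationalisation matters when moving from research to practice.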

We can, of course, capture these in a variety of ways in a variety of formalisms.

HTN Example

Here’s a simple example to work through:

We want to capture the concept of a patient with hypertension with (one of many) the operational definition of “on at least one member of each of two distinct classes of anti-HTN drug” (derived from a given paper on afib risk assessment).

Ultimately, we want to capture this in such a way that we can use it to help guide the formulation of queries against a medical record, or building a capture form, or similar clinical informatics task. We want it to be elaboration tolerant in the sense that we can fit in other operationalisations usefully.

RDFS Attempt

We can capture some of this in as simple a formalism as RDFS (using a slight variant of Manchester Syntax):

Class: P1-PatientWithHTN SubClassOf: PatientWithHTN
Class: Drug1 SubClassOf: ClassA
Class: Drug2 SubClassOf: ClassA
Class: Drug3 SubClassOf: ClassB
Class: Drug4 SubClassOf: ClassB
Class: ClassA SubClassOf: HTNDrug
Class: ClassB SubClassOf: HTNDrug
Class: OnClassA
Class: OnClassB

ObjectProperty: on Range: HTNDrug

(The range is a bit gnarly as we don’t want it to be restricted to anti HTN drugs only, but that’s not the limitation I’m focused on.)

As is clear, we get stuck fairly quickly. We can’t, in RDFS, express that the classes of drug (or the drugs themselves!) are disjoint, which is important! It’s easy to imagine a case where the classes of drugs represented functionality, and, in some cases, the same drug could be on both lists (e.g., both a cough suppressant and a decongestant). In that case, sometimes being on a single drug would be enough. Not so in this case!

We also can’t represent the fact that the sufficient condition is being on two drugs. If we say:

Class: OnClassA SubClassOf: P1-PatientWithHTN

We’ve missed the OnClassB requirement. We can’t just add the corresponding ClassB requirement:

Class: OnClassB SubClassOf: P1-PatientWithHTN

You could try to hack it with a named bridge class:

Class: OnClassAandB SubClassOf: OnClassA, OnClassB, P1-PatientWithHTN

This is a bit tempting because if you are a member of OnClassAandB you will be a member of OnClassA, OnClassB, and P1-PatientWithHTN, which is also true if being both OnClassA and OnClassB implies being a P1-PatientWithHTN. But this is compatible with someone being in each of OnClassA and OnClassB but in neither of OnClassAandB or P1-PatientWithHTN. We either have to have the discipline to always enter people in OnClassAandB (even if we discover that they are OnClassA and OnClassB in very different contexts) or we need to add a rule, OWL axiom, or code to do the recognition.

Neither move works. Adding the OnClassB axiom just adds in people who are only on ClassB drugs, and the bridge class only helps if someone remembers to assert membership in it. We have no way to express a conjunctive condition. And, of course, we can’t express that being “with HTN” is a function of being “on certain drugs”. The relational aspects don’t show up in any of our definitions, mostly because we have no definitions!
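Both failure modes can be checked mechanically. The following Python sketch is a toy, not a real RDFS reasoner: it assumes the only inference available is propagating class membership up subClassOf links, and the function names are my own.

```python
# Toy RDFS-style inference: membership propagates up subClassOf links and
# nothing else. This is an illustrative assumption, not a full RDFS reasoner.

def superclasses(cls, sub_class_of):
    """Return cls plus every class reachable from it via subClassOf."""
    seen, stack = {cls}, [cls]
    while stack:
        for parent in sub_class_of.get(stack.pop(), ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def inferred_types(asserted, sub_class_of):
    """All classes an individual belongs to, given its asserted classes."""
    result = set()
    for cls in asserted:
        result |= superclasses(cls, sub_class_of)
    return result


HTN = "P1-PatientWithHTN"

# Attempt 1: one subclass axiom per drug class.
two_axioms = {"OnClassA": {HTN}, "OnClassB": {HTN}}

# Attempt 2: the named bridge class.
bridge = {"OnClassAandB": {"OnClassA", "OnClassB", HTN}}

only_b = {"OnClassB"}            # patient on a ClassB drug only
both = {"OnClassA", "OnClassB"}  # on both, but asserted separately

over_recognised = HTN in inferred_types(only_b, two_axioms)
under_recognised = HTN not in inferred_types(both, bridge)
bridge_member_ok = HTN in inferred_types({"OnClassAandB"}, bridge)
print(over_recognised, under_recognised, bridge_member_ok)
```

The first attempt recognises too much (a ClassB-only patient counts), the second too little (a patient asserted into both classes separately is missed). The bridge class only works when membership in it is asserted directly.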

Of course, we can express the concrete, extensional, relational situations that we want. We can assert that some patient is OnClassA, or is in an on relation to some specific class A drug. What we can’t do is make such a relational structure trigger recognition in a class. And, of course, we would need out of band information to say, “Hey! It’s this relational structure that forms the recognition conditions for Patients with HTN!!!”

So, we are forced into another formalism or ad hoc application code. We can conclude that for this task in this domain, RDFS is representationally inadequate. I would also claim that it is cognitively inadequate, because it’s not too hard to trick yourself into thinking you can squeeze out more than you can. Worse, it seems easy to “fix” using a bit of code or some hacky rules, and thus not recognise that you are producing a code smell.
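For concreteness, this is roughly what the “bit of code” fix tends to look like. It is a hedged sketch under my own assumptions: the triples, the patient names, and the recognise_htn function are invented for illustration, and drugs are treated as individuals typed by their class rather than as subclasses, purely to keep it short.

```python
# An ad hoc "recognition rule" living outside the ontology. The triples and
# the rule are illustrative assumptions, not part of any RDFS vocabulary.
TYPE, ON = "rdf:type", "on"

triples = {
    ("Drug1", TYPE, "ClassA"),
    ("Drug3", TYPE, "ClassB"),
    ("alice", ON, "Drug1"),
    ("alice", ON, "Drug3"),
    ("bob", ON, "Drug3"),
}


def recognise_htn(graph):
    """Return new type triples for patients on a ClassA and a ClassB drug."""
    # Map each drug to the drug classes it is typed with.
    drug_class = {}
    for s, p, o in graph:
        if p == TYPE and o in ("ClassA", "ClassB"):
            drug_class.setdefault(s, set()).add(o)
    # Collect, per patient, the classes of all drugs they are on.
    classes_by_patient = {}
    for s, p, o in graph:
        if p == ON:
            classes_by_patient.setdefault(s, set()).update(drug_class.get(o, ()))
    return {
        (patient, TYPE, "P1-PatientWithHTN")
        for patient, classes in classes_by_patient.items()
        if {"ClassA", "ClassB"} <= classes
    }


inferred = recognise_htn(triples)
print(inferred)
```

Note where the knowledge went: the recognition condition now lives in a Python function, invisible to anything that only reads the RDFS. That is the code smell.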

In the next instalment, we’ll look at using a richer language to capture the missing bits.

Note: Oy, this is where I got stuck. It’s taking so long to write it all out. I’m sorta transliterating a talk, and I kinda wish I would just put up the slides! I seem committed now though. These were in Powerpoint so….

License Troubles: Wikipedia vs. Flickr

September 25, 2015

I’m making slides for classes and for a talk.

I’d like to use images in these slides to make them less horrible to look at. I do not draw and even my diagramming skills are limited. I can take photos reasonably, but I can’t track everything I need, much less compose a photo while generating slides. This leaves the web! Yay! Lots of images!

But then…licenses. Damn. Ok, Creative Commons has been on the job for over a decade now. Google Image search will even filter by license type. Yay!

Wikipedia has lots of useful images. I often use them. But they don’t have everything. I find a nice image which seems perfect: A photo of an antique wooden French flat file cabinet.

I’d love to use this as my illustration of “flat file” database. License looks ok, non-commercial…

But wait, is my use non-commercial or commercial? What’s the rule for education use? Looking…not obvious. The CC website isn’t super helpful, then I find:

The NC licenses may not be compatible for remixing with many works. For example, a person may not remix BY-SA content (such as Wikipedia content) with BY-NC content.

Well, damn. Most Wikipedia content is copylefted (which is annoying as I don’t necessarily want my slides unrestrictedly reused, but whatev; I can live with it), but a good chunk of the content on Flickr (and elsewhere) is NC. So it doesn’t matter what my use is, I can’t do it.

Sigh. What was effortless now becomes impossible. And I’ve burned time worrying about it.

Chinook Example Database

September 1, 2015

I’m developing new material for a course on semi-structured data, so I’ve been looking for examples. Since we are going to incorporate a SQL bit, I need me some nice SQL databases. The Chinook example database looks really promising. It’s derived from the iTunes XML, which is really nice for my purposes as we can compare models and queries directly. Students could even generate their own example databases from their own iTunes, which is cool.

Where it sorta fails for me as it stands is that it doesn’t clearly expose interesting modelling choices. It is sort of an existence proof that XML-like features are unnecessary.

(These address tables are nice in that respect. It’s clear why you might want to factor multiple phone numbers into a separate table and having UK vs. US addresses is a good forcing function for thinking about wacky variation and how easy it is to overfit.)
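As a rough idea of the phone number factoring, here is a sketch using Python’s built-in sqlite3 module. The table and column names are my own invention for illustration; they are not taken from Chinook or from the address tables mentioned above.

```python
import sqlite3

# In-memory database, so nothing touches disk.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE person (
        person_id INTEGER PRIMARY KEY,
        name      TEXT NOT NULL
    );
    -- One row per phone number, so a person can have zero, one, or many.
    CREATE TABLE phone (
        phone_id  INTEGER PRIMARY KEY,
        person_id INTEGER NOT NULL REFERENCES person(person_id),
        kind      TEXT,
        number    TEXT NOT NULL
    );
    INSERT INTO person VALUES (1, 'Ada'), (2, 'Grace');
    INSERT INTO phone (person_id, kind, number) VALUES
        (1, 'home', '555-0100'),
        (1, 'mobile', '555-0101');
""")

# Count phone numbers per person; the LEFT JOIN keeps people with none.
rows = conn.execute("""
    SELECT p.name, COUNT(ph.phone_id)
    FROM person AS p
    LEFT JOIN phone AS ph ON ph.person_id = p.person_id
    GROUP BY p.person_id
    ORDER BY p.person_id
""").fetchall()
print(rows)
```

The alternative, fixed phone1/phone2 columns on person, is the overfitting trap: it bakes in a guess about how many numbers anyone will ever have.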

(There was an awesome site with hundreds of simple ER diagrams/schemas for all sorts of common scenarios, most quite small. I can’t find it at the moment, grr!)