A Case Study in Evaluating a Knowledge Representation (part 1)

I had to give a little ad hoc talk on “OWL vs. Rules”. This comes up a lot, and not just within rules communities.

When answering the question “Why would we use OWL over rules?” or “What’s the difference between OWL and rules?” or any of these variants (which often are, “Someone convinced me to use a rules engine that I don’t fully understand…does that get me out of having to use Yet Another Thing?”), I prefer not to evangelise. I like various rule languages. Datalog is pretty sweet. Production rules are nifty. I’ve written my share of XSLT and Prolog. And for many things, they are the right thing. So, of course, the first problem is “Which rule flavour do you mean?”

What I like to do is talk about evaluation techniques and criteria. Most people who are asking the question don’t know how to do this. They don’t know how to tell whether a KR formalism is fit for purpose. It’s not a trivial problem!

I use something like the following framework:

  1. Check the adequacies:
    1. Representational
      • The language can express required knowledge more or less directly, naturally, effectively
    2. Cognitive
      • The language is usable
    3. Computational
      • The language is reasonably implementable with appropriate performance
  2. What are the services (both possible and actually implemented)?
  3. Do we get surprising reuse?

Obviously, the ground issue is whether a particular representation does the application job at all! But I’m presuming we believe that all candidates can get somewhere, esp. if we add a lot of code to fill in the gaps. Minimising the hacks is a key goal!

My current example use scenario is the following: We’re trying to implement the CHA2DS2-VASc scoring system against an Electronic Health Record (EHR) system. So we have some databases we need to query.
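As a reminder of the target, the score itself is just a weighted sum of risk factors; here's a minimal sketch in Python (function and parameter names are mine, not from any EHR API):

```python
def cha2ds2_vasc(chf, htn, age, diabetes, stroke_tia, vascular, female):
    """Sum the CHA2DS2-VASc risk factors: congestive heart failure (1),
    hypertension (1), age >= 75 (2) or 65-74 (1), diabetes (1),
    prior stroke/TIA/thromboembolism (2), vascular disease (1),
    female sex (1). Returns an integer from 0 to 9."""
    score = 0
    score += 1 if chf else 0
    score += 1 if htn else 0
    score += 2 if age >= 75 else (1 if 65 <= age < 75 else 0)
    score += 1 if diabetes else 0
    score += 2 if stroke_tia else 0
    score += 1 if vascular else 0
    score += 1 if female else 0
    return score
```

The hard part, of course, isn't the arithmetic; it's deciding when something like `htn` is true of a given patient record, which is where the rest of this post lives.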

CHA2DS2-VASc has a number of risk factors (or variables) that we need to determine, e.g., whether the patient has hypertension (HTN).  Each of these corresponds to a clinical concept (for hypertension, call it…HTN). These concepts can have various definitions for various purposes, most obviously, they may have different operationalisations. So, I might say that a patient has HTN if they have a typical blood pressure of over 140/80, or if in the last three months they had a blood pressure reading of >140/80, or if they are on 2 classes of anti-HTN drugs. These definitions are operational: They are in terms of metrics, not underlying concepts. But when trying to translate research into practice, the (various) operational definitions are key!
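To make the operational/conceptual split concrete, here's how two of those operationalisations might look as code (a sketch only: the record format, the 90-day reading of "three months", and interpreting ">140/80" as "systolic over 140 or diastolic over 80" are all my assumptions):

```python
from datetime import date, timedelta

def htn_by_recent_reading(readings, today):
    """True if any blood pressure reading in the last three months
    exceeded 140/80. `readings` is a list of
    (date, systolic, diastolic) tuples."""
    cutoff = today - timedelta(days=90)  # "three months", approximated
    return any(when >= cutoff and (systolic > 140 or diastolic > 80)
               for when, systolic, diastolic in readings)

def htn_by_drug_classes(classes_on):
    """True if the patient is on members of at least two distinct
    classes of anti-HTN drug."""
    return len(set(classes_on)) >= 2
```

Note that each definition queries entirely different parts of the record, which is exactly why the formalism needs to keep the concept (HTN) separate from its operationalisations.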

We can, of course, capture these in a variety of ways in a variety of formalisms.

HTN Example

Here’s a simple example to work through:

We want to capture the concept of a patient with hypertension using one (of many) operational definitions: “on at least one member of each of two distinct classes of anti-HTN drug” (derived from a given paper on afib risk assessment).

Ultimately, we want to capture this in such a way that we can use it to help guide the formulation of queries against a medical record, the building of a capture form, or similar clinical informatics tasks. We want it to be elaboration tolerant in the sense that we can fit in other operationalisations usefully.

RDFS Attempt

We can capture some of this in as simple a formalism as RDFS (using a slight variant of Manchester Syntax):

Class: P1-PatientWithHTN SubClassOf: PatientWithHTN
Class: Drug1 SubClassOf: ClassA
Class: Drug2 SubClassOf: ClassA
Class: Drug3 SubClassOf: ClassB
Class: Drug4 SubClassOf: ClassB
Class: ClassA SubClassOf: HTNDrug
Class: ClassB SubClassOf: HTNDrug
Class: OnClassA
Class: OnClassB

ObjectProperty: on Range: HTNDrug

(The range is a bit gnarly, as we don’t want on to be restricted to anti-HTN drugs only, but that’s not the limitation I’m focused on.)

As is clear, we get stuck fairly quickly. We can’t, in RDFS, express that the classes of drugs (or the drugs themselves!) are disjoint, which is important! It’s easy to imagine a case where the drug classes represent function, and the same drug could be on both lists (e.g., as both a cough suppressant and a decongestant). In that case, being on a single drug would sometimes be enough. Not so in this case!

We also can’t represent the fact that the sufficient condition is being on two drugs. If we say:

Class: OnClassA SubClassOf: P1-PatientWithHTN

We’ve missed the OnClassB requirement. We can’t just add the corresponding ClassB requirement:

Class: OnClassB SubClassOf: P1-PatientWithHTN

You could try to hack it with a named bridge class:

Class: OnClassAandB SubClassOf: OnClassA, OnClassB, P1-PatientWithHTN

This is a bit tempting because if you are a member of OnClassAandB you will be a member of OnClassA, OnClassB, and P1-PatientWithHTN, which is also true if being both OnClassA and OnClassB implies being a P1-PatientWithHTN. But this is compatible with someone being in each of OnClassA and OnClassB but in neither OnClassAandB nor P1-PatientWithHTN. We either have to have the discipline to always enter people in OnClassAandB (even if we discover that they are OnClassA and OnClassB in very different contexts) or we need to add a rule, OWL axiom, or code to do the recognition.
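For instance, the recognition step that RDFS leaves to us could be a couple of lines of application code (a Python sketch; representing a patient as a set of class names is my own device):

```python
def recognise(classes):
    """Close a patient's asserted class memberships under the one rule
    RDFS cannot state: OnClassA and OnClassB together imply
    P1-PatientWithHTN (and hence, via the subclass axiom,
    PatientWithHTN)."""
    closed = set(classes)
    if {"OnClassA", "OnClassB"} <= closed:
        closed |= {"P1-PatientWithHTN", "PatientWithHTN"}
    return closed
```

Which is precisely the smell: the conjunctive condition now lives outside the ontology, invisible to any RDFS-aware tool.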

Adding the ClassB axiom on its own just adds in people who are only on ClassB drugs. We have no way to express a conjunctive condition. And, of course, we can’t express that being “with HTN” is a function of being “on certain drugs”. The relational aspects don’t show up in any of our definitions, mostly because we have no definitions!

Of course, we can express the concrete, extensional, relational situations that we want. We can assert that some patient is OnClassA, or is in an on relation to some specific class A drug. What we can’t do is make such a relational structure trigger recognition in a class. And, of course, we would need out-of-band information to say, “Hey! It’s this relational structure that forms the recognition conditions for patients with HTN!!!”
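For example, sticking with the Manchester-ish syntax from above, the extensional assertions are unproblematic (the individual names are made up):

```
Individual: patient1 Types: OnClassA, OnClassB
Individual: patient2 Facts: on drugA1
Individual: drugA1 Types: Drug1
```

But nothing here connects patient1’s two memberships, or patient2’s on fact, to P1-PatientWithHTN.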

So, we are forced into another formalism or ad hoc application code. We can conclude that for this task in this domain, RDFS is representationally inadequate. I would also claim that it is cognitively inadequate, because it’s not too hard to trick yourself into thinking you can squeeze out more than you can. Worse, it seems easy to “fix” using a bit of code or some hacky rules, and thus not recognise that you are producing a code smell.

In the next instalment, we’ll look at using a richer language to capture the missing bits.

Note: Oy, this is where I got stuck. It’s taking so long to write it all out. I’m sorta transliterating a talk, and I kinda wish I would just put up the slides! I seem committed now though. These were in Powerpoint so….


OWL 2: A Medical Informatics Perspective

This is my second “publicity” article about OWL 2. The first didn’t really get into OWL 2 per se, but rather OWL altogether. I’m trying to stay within 1000-1500 words. The first focused on the Semantic Web angle, whereas this one is focused on the use of OWL and ontologies for bio-health applications.

Medicine has a huge and rapidly growing vocabulary even before you get to the complex scientific names. Is your pain shooting, or stabbing, or throbbing? Is your cough dry, wet, brassy, or barking? There is a huge difference between a “broken leg” that is a “greenstick fracture of the femur” and one that is a “displaced transverse fracture of the patella”. There are dozens of brand names for plain aspirin, including, in some countries, the term “aspirin” itself. (To avoid registered trademarks entirely, you must say “acetylsalicylic acid”.)

Furthermore, there’s lots to know about each term, and what you need to know varies by context. For example, acetylsalicylic acid is an analgesic (pain fighter), an antipyretic (fever fighter), and an NSAID (that is, a non-steroidal anti-inflammatory drug, a kind of inflammation fighter).

There are a lot of concepts in medicine (analgesic, acetylsalicylic acid, pain) and even more terms (for example, “acetaminophen”, “paracetamol”, and “para-acetylaminophen” are all names for the same thing, Tylenol) corresponding to those concepts. When different care providers use different terms for the same thing there is the possibility of miscommunication. The more variation in terminology in health care records, the harder it is to analyze those records (for example, to monitor potential medical errors or to find candidates for clinical trials).

A core challenge in medical informatics is managing the huge, evolving terminologies that permeate all aspects of medicine. Most of these terminologies have a complex hierarchical structure, e.g., breast cancer is a kind of cancer, which is a kind of disease, which is a kind of pathology. The problem is that there are hundreds of thousands of terms in any reasonable terminology, with a rat’s nest of connections between them. And these terminologies grow very fast. For example, the NCI Cancer Thesaurus grew from around 20,000 terms in 2004 to over 50,000 terms today. Each term corresponds to a potentially complex concept of specialized medical knowledge which is related to many more concepts in a variety of ways. Various problems emerge with manual curation of such terminologies: sometimes there are wrong connections between terms, or missing connections, or the text defining the term is out of date, confused, or just garbled. None of these errors are detectable from natural language or from simple, graph-based representations of the terminology. These are semantic errors, that is, gaps between what the curators wrote down and what is true.

One way to improve terminology development is to write down the meanings of terms in a language that a program can understand. That way, we can run a program (an automated reasoner) to sanity check what we wrote and to find new connections that a person would recognize if they read all the definitions and didn’t get tired. Such a language is an ontology language, and a representation of a terminology with the definitions written so a program can reason with them is called an ontology. Ontologies and ontology languages have a rich history in computer science, artificial intelligence, and bio-medical informatics. A popular family of ontology languages is built on so-called description logics, which allow people to express the definitions of their concepts reasonably directly while still being amenable to state-of-the-art automated reasoning techniques. Description logics form the basis of the standardized ontology language OWL and its latest version, OWL 2.

The first version of OWL (the “Web Ontology Language”) was standardized by the W3C in 2004 and proved a rousing success in providing a common default language for ontology development. Key bio-medical ontologies, such as SNOMED-CT and the NCI Thesaurus, migrated to OWL and to an OWL-based toolchain, allowing them to move away from proprietary languages and their vendor-locked toolchains.

In 2009, the W3C announced the finalization of the next generation of OWL, OWL 2. OWL 2 is based on continuing research from the Universities of Manchester and Oxford into all aspects of ontology engineering. Professor Ian Horrocks, whose early work in reasoning with description logics at Manchester made them a feasible technology, co-chaired the OWL 2 working group. [Ugh. it’s at this point I suck out 🙂 I hate writing people pimpage, even if true!]

OWL 2 addresses key expressive and computational limitations of OWL 1. By adding new constructs to the language, OWL 2 more directly supports medical applications. For example, so-called “role chains” allow ontologists to express the connection between spatial relations and part-whole relations, e.g., that if a fracture is located on a bone which is part of a leg, that fracture is a fracture of that leg. General reasoning with such constructs in the presence of other OWL features was an open problem solved by Ian Horrocks and Prof. Uli Sattler (of Manchester).
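In (roughly) Manchester Syntax, such a role chain might look like this (the property names are illustrative, not drawn from any particular ontology):

```
ObjectProperty: isLocatedIn SubPropertyChain: isLocatedIn o isPartOf
```

Read: if x isLocatedIn y and y isPartOf z, then x isLocatedIn z; so a fracture located in a bone that is part of a leg is located in that leg.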

OWL 2: A Semantic Web Pitch

OWL 2 is about to go to Recommendation. This is a PR moment and I want Manchester to take advantage of it. To that end, I’m trying to produce material that will be helpful for promoting OWL and Manchester. This is a general piece that tries to sell OWL 2 (and Manchester’s contribution) from a Semantic Web perspective. It’s a bit didactic, thus gets to OWL 2 fairly late in the game. Feedback is more than welcome, esp. if you are a lay reader. (It’s 1100 words.) I know it needs linkification, but I’m trying to pound out text that might go into press release type things, so linkification is delayed. Volunteers welcomed!

The World Wide Web had humble beginnings: It was intended as a simple way to share fairly simple information (notes, room schedules, phone directories) within an organization (CERN) and perhaps between like-minded organizations. It quickly outgrew its humble beginnings to encompass the sharing of all sorts of information between all sorts of people, and much more besides. Today, people do much that feels far different from reading and publishing documents. They chat, play games, shop, or use programs such as word processors or spreadsheets.

The technology underlying the Web is recognizably descended from that of its salad days: The Hypertext Markup Language (HTML) is still the lingua franca of the Web; Cascading Style Sheets (CSS) allow designers to style HTML to make beautiful sites; and Javascript has come into its own as the key client side programming language for complex browser based applications. Aside from being (ever more) capable technical foundations for the dazzling Web sites and applications we see everywhere, the Web trio are both standardized (by the W3C) and, unlike competing technologies such as Flash and Silverlight, not controlled by a single vendor.

As amazing as the Web is, it remains, at heart, a Web primarily for end user consumption. HTML allows authors to describe the structure of a page in limited, mostly document oriented ways. HTML, of course, allows for hyperlinks and we need to look no further than Google to see how this human generated, human intended information can be exploited to the benefit of human readers: By interpreting links to a Web page as a vote for that page’s importance, the search engine can prioritize its results far better than any prior attempt.

The idea of the Web as a collection of documents has served us very well. But there are places where that idea creaks. Not everything published on the Web is published for humans first, last, and always, and some things published for humans first also have a useful life as pure data. For example, while online maps are typically provided by one Web site (e.g., Google maps), people want to mix them with data from another. Whether one is plotting locations from Craigslist’s real estate ads or finding where a photo from your last vacation was taken, the ability to get at the data of a Web page in a form amenable to programmatic manipulation is critical to making such “mash-ups” robust, reliable, and easy to produce. The alternative, to wit, “scraping” data out of human-oriented HTML, is rather difficult and fraught with pitfalls. Essentially, the consuming program has to filter out the irrelevant parts of the page (ads, or narrative text) and interpret the HTML as the sort of data in question, typically by reverse engineering the generating program. Since both the irrelevant parts of a page and the presentational structure of the data on the page tend to change a lot between pages, and on the same page over time, the consuming program is faced with a Sisyphean task. Fortunately, there are several popular, well-supported formats for data exchange, and web site publishers are increasingly socialized to provide “data” views of their websites.

Thus, we now have data “on” the Web, but this move does not fundamentally move us away from the Web as a collection of documents. Each piece of data is like a little document, and thus suffers the problems of data in HTML on the Web: The data aren’t hyperlinked (so the data are on the Web, not “of” the Web) and a programmer has to interpret the data in order to write a program that uses it sensibly.

There is an alternative conceptualization of the Web that aims to overcome these problems; that is, to make a true Web for programs that is on a par with the Web for people. This conceptualization is known as the Semantic Web. Like the Web, the Semantic Web has enabling technologies addressing the key goals of linking and meaning: The Resource Description Framework (RDF) is a “Web native” data model that incorporates hyperlinking deeply. The Web Ontology Language (OWL) is a “Web native” ontology language that extends RDF with the ability to write logic-based descriptions of things so that an automated reasoning tool can draw conclusions about data incorporating these descriptions.

The W3C has just standardized a new version of OWL, OWL 2. OWL is based on a family of logics, so-called “description logics”, which have played a prominent role in the field of knowledge representation, esp. bio-medical informatics, for over 30 years. The University of Manchester has played a key role in the development of these logics, from the theory to the implementation to the application, and in their standardization in the form of OWL and OWL 2. Prof. Uli Sattler, in collaboration with Prof. Ian Horrocks (while he was at the University of Manchester; he is now at Oxford University), designed the logic and reasoning techniques for the description logics underlying both OWL (the logic “SHOIQ”) and OWL 2 (the logic “SROIQ”). The difficulty in logic engineering for ontologies is allowing sufficient expressivity to be useful (e.g., so that modellers can say things like a foot has 5 toes which are part of it and it, in turn, is part of a leg) while keeping the reasoning procedure computationally reasonable (so that we can write reasoners that can figure out that each toe is also part of a leg, and do so before the heat death of the universe). FaCT++, an OWL 2 reasoner from Manchester, is the practical realization of their design and can handle such enormous and complex ontologies as the Systematized Nomenclature of Medicine — Clinical Terms (SNOMED-CT), a key component of many national clinical information management systems.

Trying to build something on the scale and nature of the Web, especially when the Web already exists, is a grand, perhaps grandiose, project. The Web, of course, grew more than was built and perforce so will the Semantic Web. Like an economy, the Web is the result of millions of people performing millions of interactions focused on their specific interests and needs. Similarly, languages like OWL 2 will succeed if they meet specific needs and do so well. As the bio-medical community standardizes on OWL 2 and pushes its boundaries, we can see in that microcosm what the Semantic Web might one day be like. And, unlike utopian fairy-tales, that microcosm can make a valuable direct contribution to human welfare.

“The recent success…”

Paul Miller writes:

The recent success of the Linked Data meme is long overdue, very welcome, and entirely capable of carrying the Web of Data far beyond its current niche adherents. A lot of my current work involves arguing that more organisations should adopt this approach

There are four claims about Linked Data (well, the Linked Data meme (LDM)):

  1. There is (recent) success
  2. That success is (long) overdue
  3. That success is welcome
  4. That success has certain powers (carrying the Web of Data far beyond…etc.); roughly, that the recent, overdue, welcome success (ROWS) is sufficient to generate good growth, perhaps exponential(?)/weblike growth

My main issue (which is totally orthogonal to Miller’s post) is that I don’t know what the ROWS is or how to measure it. I really don’t know how to figure out with any degree of confidence that it has the powers claimed for it.

I mean, I’ve been there and done that. RSS 1.0. The Dean campaign and LiveJournal dumping all that FOAF. MusicBrainz RDF. Mozilla having RDF Inside. Etc. etc. etc. Not only did these things, afaict, go nowhere, many of them were ultimately quite counterproductive. They are sticks to beat the Semantic Web with.

Now, past performance is no guarantee of future lack of results, so perhaps something is different this time around. But it’s hard to see exactly what. That Yahoo and Google sorta support RDFa is somewhat interesting, I guess. But Google supported a SOAP API back in the day too.

I’ve been shouted at by some Linked Data adherents, usually about how my work on OWL (esp. at the W3C) is evil because it either doesn’t help Linked Data or draws attention from Linked Data or…well…Linked Data is the Semantic Web and the only reason anyone could possibly like OWL is the Semantic Web so go away. I find this puzzling as the use of OWL still seems (to me, afaict) to be growing and in a pretty healthy way. I expect we’ll be hearing about Linked Data for quite some time (though it does seem to be more a marketing difference than a fundamental technical one over the RDFing of the past). Perhaps I’m missing the good stuff but the way Linked Data comes across to me is with the big Parasitic Node Diagrams where DBPedia has a link to DBLPinRDF. (Parasitic because they generally are scrapings from existing data. I’m never clear what the value add is per se. I am, of course, sometimes happy to have data in a preferred format, but that can’t be the point of all this. I hope.)

I’d be grateful to any pointers to some metrics about the ROWS. I totally recognize that there may be none or they may be impossible to divine. But I would find it helpful.

And I’m really just curious. I’ve no brief against the LDM except that it seems to be “mere marketing”. I don’t mind marketing per se but I would like to distinguish it from other matters. I certainly don’t think the LDM is a threat to anything I’m doing or care about, nor the reverse.

(Kinda sad that I feel compelled to spell that out even in my not-read-by-anyone blog.)