Goals for a Software Engineering Curriculum

The IEEE & ACM 2009 Software Engineering Curriculum Recommendations suggest the following outcomes for an MSc program in software engineering:

A student who has mastered the [Core Body of Knowledge (CBOK)] will be able to develop a modest-sized software system of a few thousand lines of code from scratch, be able to modify a pre-existing large-scale software system exceeding 1,000,000 lines of code, and be able to integrate third-party components that are themselves thousands of lines of code. Development and modification include analysis, design, and verification, and should yield high-quality artefacts, including the final software product.

I’ve been very excited by this, though I currently teach only 1 MSc class and have some influence on 1 other (out of 4 specifically devoted to Software Engineering). I’ve made progress in my class toward getting students able to develop a modest-sized software system from scratch (though on the order of 100s, not 1000s, of SLOC) and a tiny bit of integrating third-party components. But if we break this out:

  1. Develop a 1k-5k LOC (let’s say) program from 0 lines of code.
  2. Modify an existing system with >1 million LOC.
  3. Integrate 3rd party components of 1k-10k LOC (let’s say).

I’m struggling with the 1 million LOC. I mean, why 1 million specifically? I’d wager that relatively few developers have to cope with such a system, and even those who do deal with much smaller fragments thereof. (Necessarily!) If the million-line system has a clean plug-in architecture with a small API, then it’s not clear how writing such a plug-in for a million-line system differs from writing one for a system with 100k LOC or, for that matter, 10k LOC. Obviously, there are management differences, but as long as the whole system isn’t easily readable in the time available, I think the differences between those systems are immaterial. That is, being able to modify a 10k program roughly predicts being able to modify a 100k program or a 1 million LOC program, modulo tooling and infrastructure.

It seems to me that the heart of each of these is:

  1. Building a non-trivial system from scratch.
  2. Modifying a large system in a significant way.
  3. Assessing and integrating (new) third-party components.

Even with this, there’s so much variance in each category that it’s not clear that what we can do in a course setting will cleanly transfer. That is, there’s a big difference between being able to create some 1-5k program and being able to build a specific one (or a specific category of one). Someone who has built a bunch of command-line utilities might struggle with building an ecommerce website even if the target LOCs are similar.

At the moment, for my class, I have them build a reverse-engineered version of wc (word count…the GNU version). This is the first time many of them have had to actually, you know, write a full-fledged program. The average SLOC for their final version (a full-fledged GNU wc with --files0-from=F and --max-line-length) was 263 (min=69 and max=1084). I’ve not yet correlated this with correctness, so it’s a bit hard to say what’s going on at the extremes. I think a reasonable clone should be comfortably doable in ≈200 SLOC.
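For a sense of scale: the counting core of such a clone fits in a handful of lines. Here’s a minimal sketch (assuming byte-oriented counts, which is GNU wc’s default behaviour), not the full assignment:

```python
import sys

def wc(data: bytes):
    """Count lines, words, and bytes in the style of a bare-bones GNU wc."""
    lines = data.count(b"\n")
    words = len(data.split())  # whitespace-delimited runs, like wc's words
    return lines, words, len(data)

if __name__ == "__main__":
    counts = wc(sys.stdin.buffer.read())
    print("{:7d} {:7d} {:7d}".format(*counts))
```

The bulk of the students’ extra SLOC goes into option handling, multiple files, and the trickier flags, not into this core.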

They also had to rework a version of their code to use the standard-library module argparse, write various sorts of tests, and examine the performance. I think this approximates 1 and 3, though 1 better than 3. Given the wide range of skill levels in the class, I don’t know if we could hope to do a “longer” project. I’m also not sure what the gain would be.
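The argparse rework amounts to declaring the flags rather than hand-parsing sys.argv. A sketch of what a wc-style interface might declare (the flag names follow GNU wc; the rest is illustrative):

```python
import argparse

def build_parser():
    """Declare a wc-style command-line interface (flag names follow GNU wc)."""
    parser = argparse.ArgumentParser(
        prog="wc", description="Print newline, word, and byte counts for each FILE.")
    parser.add_argument("files", nargs="*", default=["-"],
                        help="files to count; '-' means stdin")
    parser.add_argument("-l", "--lines", action="store_true",
                        help="print the newline counts")
    parser.add_argument("-w", "--words", action="store_true",
                        help="print the word counts")
    parser.add_argument("-c", "--bytes", action="store_true",
                        help="print the byte counts")
    parser.add_argument("-L", "--max-line-length", action="store_true",
                        help="print the maximum line length")
    parser.add_argument("--files0-from", metavar="F",
                        help="read NUL-terminated file names from file F")
    return parser

# Example: parse a typical invocation.
args = build_parser().parse_args(["-l", "-w", "notes.txt"])
```

You get --help, error messages, and flag abbreviation for free, which is most of the pedagogical point.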

But 2 is missing. I don’t get them to engage with a significant body of existing code. That’s what I’m searching for now. My leading candidates are various static site generation systems.


“AI” and cheating

Oh Wired:

For years, students have turned to CliffsNotes for speedy reads of books, SparkNotes to whip up talking points for class discussions, and Wikipedia to pad their papers with historical tidbits. But today’s students have smarter tools at their disposal—namely, Wolfram|Alpha, a program that uses artificial intelligence to perfectly and untraceably solve equations. Wolfram|Alpha uses natural language processing technology, part of the AI family, to provide students with an academic shortcut that is faster than a tutor, more reliable than copying off of friends, and much easier than figuring out a solution yourself.

This is gibberish. If the main form of cheating is solving equations, the NLP front end is largely irrelevant. It’s not very good either. Indeed, the article goes on to say it’s not very AIy:

The system is constrained by the limits of its data library: It can’t interpret every question. It also can’t respond in natural language, or what a human would recognize as conversational speech. This is a stumbling block in AI in general. Even Siri, which relies heavily on Mathematica—another Wolfram Research product and the engine behind Wolfram|Alpha—can only answer questions in programmed response scripts, which are like a series of Mad Libs into which it plugs answers before spitting them out of your speaker or onto your screen.

Alpha indeed makes Mathematica more accessible (if only by price!) which makes it easier to use for cheating. But afaict this is a web and economic change, not an “AI” change.

And of course there’s the standard Wolfram silliness. Alpha was crazily hyped when it came out but it really isn’t all that. The denials of Wolfram and his employees are hilarious:

Alan Joyce, the director of content development for Wolfram Alpha, says that cheating is “absolutely the wrong way to look at what we do.” But the staff understands what might make teachers uncomfortable. Historically, education had to emphasize hand calculations, says John Dixon, a program manager at Wolfram Research.

Suuuure dude. Sure. And:

Indeed, the people who are directing the tool’s development view it as an educational equalizer that can give students who don’t have at-home homework helpers—like tutors or highly educated and accessible parents—access to what amounts to a personal tutor. It also has enormous potential within the classroom. A “show steps” button, which reveals the path to an answer, allows teachers to break down the components of a problem, rather than getting bogged down in mechanics. The “problem generator” can pull from real datasets to create relevant examples. “When you start to show educators the potential,” Dixon says, “you can see points where their eyes light up.”

This isn’t reporting. They don’t interview educators (except a random Prof of Astronomy). They don’t talk to people trying to cope with the cheating. They don’t look at anything except Wolfram propaganda.

regex101.com

I’m almost done with my horrorshow task, so maybe I can get back to having a backlog of posts and write up the ideas I have lingering in the drafts folder…

…but today, it’s another quick hit. Last week, I pointed to pythex, an online regex debugger/IDE/pastebin for Python. This week I found another, regex101.com.

It takes everything a bit further. It has a little “Explanation” pane with…far…too many colors. It handles several flavors of regex for different languages. It too can save ’em. It’s more cluttered than pythex, and if I had to pick only one, it’d be pythex. But the explainer could be useful.
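For the kind of thing either tool is debugging, here’s a pattern you might paste in, shown in Python’s flavor with named groups (the date pattern is just an illustrative example, not from either site):

```python
import re

# An ISO-date pattern with named groups; regex101's "Explanation" pane
# would break each piece of this down.
pattern = re.compile(r"(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})")

match = pattern.search("released on 2018-02-10, updated later")
print(match.groupdict())  # {'year': '2018', 'month': '02', 'day': '10'}
```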

Beyond JSON

JSON pretty clearly won and won big. This is perhaps inevitable given the dominance of Javascript. And that’s ok! JSON is a pretty decent sort of sexpr, and having both lists and dicts makes it pretty useful for quick externalisation of all sorts of data. The friction, in a typical scripting-derived language, of manipulating JSON in memory is super low. Slinging dicts and lists is something any Python (or Javascript, or…) programmer is going to find congenial.
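To illustrate the low friction: a round trip through the standard library is one line in each direction (the record here is made up):

```python
import json

# Any nest of dicts, lists, strings, and numbers externalises directly.
record = {"name": "wc", "flags": ["-l", "-w", "-c"], "sloc": 263}

text = json.dumps(record)        # to the wire / a file
round_tripped = json.loads(text) # and back to plain dicts and lists
```

No schema, no class declarations, no binding layer; that is most of why it won.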

But it has some infelicities, and I don’t just mean the lack of query and schema languages (which is sorta being addressed). JSON is rather annoying to hand-author and doesn’t seem great for documents and document formats. Or even hacking existing documents like HTML…if only because there’s no standard reflection of HTML structure into JSON.

There are some moves to improve this situation.

JSON5 tackles the writability. Probably the biggest move is not having to quote (certain) keys in objects. That helps both reading and writing! For reading, there’s a clear visual difference between key strings and “value” strings. For writing, less quoting!!

The other big one is multi-line strings (with the ‘\’ as the continuation character). Having to have a continuation character sucks, but it’s much better than the status quo ante.

Comments are also a good idea! The rest seem minor, but these definitely make a difference.
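Putting those three together, a JSON5 document might look like this (a made-up config, sketching the unquoted keys, continuation-character strings, and comments described above):

```json5
{
  // comments are legal in JSON5
  unquotedKey: "no quotes needed when the key is a valid identifier",
  multiline: "a long string can be \
continued on the next line",
  trailingCommasOk: true,
}
```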

Mark Notation is aimed at bringing XML-like structuring and extensibility to JSON. It does this by adding a key syntactic (and semantic!) extension, the Mark object, which is a name (think tag/element name), a set of properties (think attributes, but with complex contents), and a list of content objects (think child content). It builds on JSON5, so it has those authoring felicities.

Semantically, Mark objects get mapped into pretty simple Javascript objects. I don’t fully understand this claim:

contents: an ordered list of content objects, which are like child nodes of elements in HTML/XML. Mark utilizes a novel feature of JS that JS object can be array-like. It can store both named properties and indexed properties.

I don’t see why this matters, as you have a special Mark object with an explicit contents variable. Ah, maybe:

properties: can be accessed through markObj.prop or markObj[‘prop’] when prop is not a proper JS identifier. You can also use JS for … in loop to iterate through the properties. Unlike normal JS array, Mark object has been specially constructed so that Mark contents are not enumerable, thus do not appear in for … in loop.
contents: can be accessed through markObj[index]. You can also use JS for … of loop to iterate through the content items.

So you don’t have to do a field access but can just use special loops. I don’t see that this would be painful in, say, Python, even with field accessing. I might default to making Python Mark objects iterable over the contents (on the theory that that’s more “normal”).
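Here’s a hypothetical Python rendering of that design choice, with properties as a plain dict and default iteration over the contents (the names are mine, not from the Mark spec):

```python
class MarkObject:
    """A hypothetical Python analogue of a Mark object: a name,
    named properties, and ordered child contents."""

    def __init__(self, name, properties=None, contents=None):
        self.name = name
        self.properties = dict(properties or {})
        self.contents = list(contents or [])

    def __getitem__(self, index):
        # Indexed access reaches the contents, mirroring markObj[index].
        return self.contents[index]

    def __iter__(self):
        # Default iteration walks the child contents, not the properties.
        return iter(self.contents)

# A <div class="note"> with a text child and an empty <br> child.
div = MarkObject("div", {"class": "note"}, ["hello", MarkObject("br")])
```

Properties then come out via ordinary field/key access (`div.properties["class"]`), and `for child in div:` does what the Mark design wants, without any array-like JS trickery.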

It would be interesting to compare APIs to see whether this really brings JSON-like ease of programmer use.

And, of course, there’s YAML, which you can think of as JSON++++. (JSON is a subset these days.) It’s designed from the ground up for writability and capturing complex structures. And that it does. The price is considerably more complexity. Like a ton more. (You can define entity-like things (actually, more like IDs) on the fly! Inline!) It has excellent embedded complex multiline strings (essentially “here-strings”).

I have to say that it might be easier to grow Mark Notation a bit toward YAML than the reverse. Here-strings plus id references go a long way.
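For comparison, a small made-up YAML fragment showing the inline anchor/alias “IDs” and a block-scalar here-string:

```yaml
defaults: &defaults        # define an anchor inline, like an ID
  adapter: postgres
  host: localhost

development:
  <<: *defaults            # reference it by alias (YAML 1.1 merge key)
  motd: |                  # block scalar: an embedded multiline string
    Welcome!
    No continuation characters needed.
```

These two features are essentially the “id references” and “here-strings” that Mark Notation would need to grow.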


XML at 20

Tim Bray has a reflective piece on XML’s birthday last month. The key bit is in the middle:

Twenty years later, it seems obvious that the most important thing about XML is that it was the first. The first data format that anyone could pack anything up into, send across the network to anywhere, and unpack on the other end, without asking anyone’s permission or paying for software, or for the receiver to have to pay attention to what the producer thought they’d produced it for or what it meant.

Hmm. Really? I mean, csv dates to the 70s. It is less well specified, I guess, and simpler. The first isn’t really mentioned but maybe this is part of the “pack anything up into”. But then S-Expressions are easily as expressive as XML and go way way back, though largely de facto standardised. But then there’s ASN.1…maybe that needed permission? I can’t find anything that suggests this, at all, though. I don’t remember any such claims at the time. I do remember a lot of struggling to find XML parsers!

So, I’m very, very skeptical about this claim. Maybe it was the first to grab a lot of attention? But then I’m v. skeptical about its influence. A lot of trends were coming together, and I think some human-readable notation would have pushed forward regardless. XML definitely failed entirely to become the language of web pages or of the web.

Update: On Facebook, Peter points out the other giant howler, to wit, “for the receiver to have to pay attention to what the producer thought they’d produced it for or what it meant.” I guess I’m just inured to this because it’s so ubiquitous in the XML and RDF communities, but yeah, the idea that you don’t have to care what it meant is bonkers, and part of that is paying attention to what the producer thought they produced it for. And metric tons of effort went into that (bonkers attribution of magic powers to namespaces, anyone?).

(I’m a bit more charitable to Bray in that thread. Maybe I’ll tackle it another day.)

Professionalism and “The Programmer’s Oath”

I talk about professionalism in my software engineering class (and in my academic writing class a bit), so I find this attempt at a “Programmer’s Oath” interesting:

2. I will respect the learnings of those programmers who came before me, and share my learnings with those to come.

3. I will remember that programming is art as well as science, and that warmth, empathy and understanding may outweigh a clever algorithm or technical argument.

4. I will not be ashamed to say “I don’t know”, and I will ask for help when I am stuck.

I’m not sure how I feel about this. I think having common constraints that force certain ethical behavior is one of the only justifications for calling certain jobs “professions” in this heightened sense. Let’s face it, part of the big two professions’ (lawyers and doctors) schtick is caste elevation, i.e., trying to make being a doctor or lawyer a high-status thing. It also suggests a certain amount of self-policing and regulation. The latter is generally good. The former….eh. If the former helps the latter, then I’m more ok with it.

I’m not sure oaths are a good way to go. They feel a bit weird and are rather short. Maybe codes of conduct are better? The line is blurry, and perhaps the main difference is length?

One weird thing is how they all have stuff that goes beyond what’s specific to the profession, e.g. (BCS):

You shall:

a) have due regard for public health, privacy, security and wellbeing of others and the environment.

Wait…that’s part of my professional obligation? Really? This just seems to be a general thing. If I’m not a member of the BCS it’s not like this lapses.

The ACM code tries to make the connection more explicit:

1. GENERAL MORAL IMPERATIVES.

As an ACM member I will ….

1.1 Contribute to society and human well-being.

This principle concerning the quality of life of all people affirms an obligation to protect fundamental human rights and to respect the diversity of all cultures. An essential aim of computing professionals is to minimize negative consequences of computing systems, including threats to health and safety. When designing or implementing systems, computing professionals must attempt to ensure that the products of their efforts will be used in socially responsible ways, will meet social needs, and will avoid harmful effects to health and welfare.

In addition to a safe social environment, human well-being includes a safe natural environment. Therefore, computing professionals who design and develop systems must be alert to, and make others aware of, any potential damage to the local or global environment.

Ok, that makes more sense, I guess, though, again, it just seems to be what a person should do.

The sanctions available are pretty minimal so maybe it doesn’t matter that much. The ACM does go in a rather strange direction:

1.5 Honor property rights including copyrights and patent.

Violation of copyrights, patents, trade secrets and the terms of license agreements is prohibited by law in most circumstances. Even when software is not so protected, such violations are contrary to professional behavior. Copies of software should be made only with proper authorization. Unauthorized duplication of materials must not be condoned.

1.6 Give proper credit for intellectual property.

Computing professionals are obligated to protect the integrity of intellectual property. Specifically, one must not take credit for other’s ideas or work, even in cases where the work has not been explicitly protected by copyright, patent, etc.

I’m down with the spirit of 1.6 but hate the execution. First, it takes “intellectual property” to be a real thing, so it “exists” even if not “explicitly protected.” No. That’s bogus. Attribution (or so-called “moral rights”) can be asserted without property talk.

I hate 1.5. I mean, are you kidding me? So any sort of protest against the increasingly bogus IP system is forbidden? I mean, I hereby condone all sorts of unauthorised duplication of materials (e.g., for whistleblowing). I think a pretty strong case could be made that circulating scientific papers esp. to poorer countries is a moral imperative.

And it’s unnecessary. We have a huge legal regime bent on enforcing this…why have it as a professional obligation as well? (Well, we know why.)