Unit Tests’ Effects on “Testability”

If your code base looks like this attic, I pity you.

Unit tests seem to be generally regarded as critical for quality code bases. Prima facie, the key effect we might expect is correctness. After all, that’s what most unit tests aim to test!

Unit testing may not be sufficient for correctness (no test approach is, in the extreme) but it does seem that having lots of tests should promote correctness.

However, this primary, most direct outcome is not the only outcome we might expect:

  1. Proponents of test-driven development argue that having good unit tests promotes refactoring and other good practices: you can make changes “with confidence” because your tests protect you from unintended effects.
  2. Some units are harder to test than others, i.e., are less testable. Intuitively, long functions or methods, complex ones with lots of code paths, and complex signatures all make a given unit hard to test! So we might expect that writing lots of tests tends to promote testable code. We might expect synergy with 1.

It all sounds plausible (but defeatable). But what does reality say?

We are living in a golden age for empirical study of software engineering in many ways. There’s so much stuff freely accessible on the web (code of all sorts, with revision history, and a vast amount of side matter…issue and mailing lists, documentation, etc). It’s a lot easier to get a survey or experiment going.

That’s what Erik Dietrich did in a very nice blog post. He looked at 100 projects from GitHub, characterized them, then binned them by the percentage of methods that were test methods. If 50% of your methods are test methods, it’s a pretty good bet that your code base is heavily tested.
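To make the binning idea concrete, here is a minimal sketch of how one might do it. The bin cut-offs, the project names, and the method counts are all made up for illustration; they are not Dietrich's actual bins or data.

```python
from collections import Counter


def share_of_test_methods(test_methods: int, total_methods: int) -> float:
    """Fraction of a project's methods that are test methods."""
    if total_methods == 0:
        return 0.0
    return test_methods / total_methods


def bin_label(share: float) -> str:
    """Map a share to a coarse bin. The cut-offs are illustrative assumptions."""
    if share == 0:
        return "no tests"
    if share < 0.10:
        return "under 10%"
    if share < 0.25:
        return "10-25%"
    if share < 0.50:
        return "25-50%"
    return "50% or more"


# Hypothetical (test_methods, total_methods) counts, not data from the post.
projects = {
    "alpha": (0, 400),
    "beta": (30, 600),
    "gamma": (120, 500),
    "delta": (450, 900),
}

# Count how many projects land in each bin.
bins = Counter(
    bin_label(share_of_test_methods(tests, total))
    for tests, total in projects.values()
)

for label, count in bins.items():
    print(label, count)
```

Once every project has a bin, you can compare other metrics (method length, complexity, and so on) across the bins.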

Right off the bat we have the striking results:

Of the 100 codebases, 70 had unit tests, while 30 did not.

(I’m really loving the WordPress iPhone app EXCEPT for the fact that I can’t strip formatting when pasting text and can’t keep that formatting from contaminating subsequent text. That sucks, WP, especially FOR A FREAKING BLOGGING APP!!!

Update: It seems that the formatting nonsense is only in the app but doesn’t come through in the actual post. Yay!)

This could be an artifact of his detector or maybe the tests are elsewhere. Still!

Overall, only 5 of his 10 very natural hypotheses were correct. For example, testing anticorrelated with method length and complexity.

For cyclomatic complexity…this may not be surprising. You generally need more tests (to hit all the code paths). Also, as supported by “Beyond Lines of Code: Do We Need More Complexity Metrics?” (from the awesome Making Software, which needs a second edition!!), complexity metrics, including cyclomatic complexity, tend to correlate closely with lines of code. So larger methods and more complex methods are going to march together (and probably nesting too).
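For a feel of how the two metrics travel together, here is a rough sketch that approximates cyclomatic complexity as "decision points plus one" using Python's `ast` module and reports it next to each function's line count. The set of node types counted is a simplifying assumption (real tools count more constructs), and the two sample functions are invented for illustration.

```python
import ast

# Node types treated as decision points. This is a simplification:
# dedicated tools also count things like comprehension conditions.
BRANCH_NODES = (ast.If, ast.For, ast.While, ast.ExceptHandler, ast.IfExp)


def cyclomatic_complexity(func: ast.FunctionDef) -> int:
    """Approximate McCabe complexity: one plus the number of decision points."""
    complexity = 1
    for node in ast.walk(func):
        if isinstance(node, BRANCH_NODES):
            complexity += 1
        elif isinstance(node, ast.BoolOp):
            # 'a and b and c' adds two extra paths, one per operator.
            complexity += len(node.values) - 1
    return complexity


def line_count(func: ast.FunctionDef) -> int:
    """Number of source lines the function spans (needs Python 3.8+)."""
    return func.end_lineno - func.lineno + 1


# Invented sample code, only here to have something to measure.
SOURCE = '''
def short(x):
    return x + 1

def branchy(x, y):
    if x > 0 and y > 0:
        return "both"
    for i in range(x):
        if i == y:
            return "hit"
    return "miss"
'''

metrics = {}
for func in ast.parse(SOURCE).body:
    if isinstance(func, ast.FunctionDef):
        metrics[func.name] = (line_count(func), cyclomatic_complexity(func))

for name, (lines, complexity) in metrics.items():
    print(name, "lines:", lines, "complexity:", complexity)
```

The longer function is also the one with more paths to cover, which is the pattern the chapter describes at scale.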

In any case, this is a very nice start.


The GPL Won

It seems like GPL and related licenses dominate open source projects, by a lot:

At the beginning of his talk, DiBona said that according to Google’s net crawlers, the web now contains over 31 million open source projects, spanning 2 billion lines of code. Forty-eight per cent of these projects are under the GPL, 23 per cent use the LGPL, 14 per cent use the BSD license, 6 per cent use Apache, and 5 per cent use the MIT license. All other licenses are used with under 5 per cent of the projects.

So, GPL variants govern 71% of 31 million projects. Daaaamn. That’s a lot. A lot more than the rest, which are less restrictive.
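The arithmetic behind that 71% is easy to check. This is just a sketch over the percentages quoted above; grouping BSD, Apache, and MIT together as the "less restrictive" bucket is my own labeling.

```python
# Percentages as quoted from DiBona's talk.
shares = {"GPL": 48, "LGPL": 23, "BSD": 14, "Apache": 6, "MIT": 5}
total_projects = 31_000_000

# GPL plus LGPL, the "GPL variants" above.
gpl_family = shares["GPL"] + shares["LGPL"]

# The less restrictive licenses named in the quote (my grouping).
permissive = shares["BSD"] + shares["Apache"] + shares["MIT"]

# Whatever is left over for all other licenses.
unlisted = 100 - sum(shares.values())

# Integer arithmetic keeps the project count exact.
gpl_projects = total_projects * gpl_family // 100

print("GPL family:", gpl_family, "%")
print("Named permissive licenses:", permissive, "%")
print("Everything else:", unlisted, "%")
print("Projects under GPL variants:", gpl_projects)
```

So the GPL family covers roughly 22 million of the 31 million projects, nearly three times the share of the named permissive licenses combined.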

I confess to being a bit surprised given the hostility (or exasperation) one often encounters from, e.g., business folks when dealing with the GPL. Of course, it has two strong factors: it’s viral (so derived projects must use it) and it has a lot of advocacy, both dedicated (think Stallman) and more incidental (think Linux in general).

Ooo, I really have an itch to find out whether virality is a big factor….


On Facebook, Dan Brickley (thanks Dan!) points out that 1) this survey is from 2011 and 2) more recent surveys point to a shift away from the GPL to more permissive licenses, to wit, MIT and Apache:

Indeed, if we contrast each license’s share of the repositories surveyed by Black Duck [January 2017] versus January 2010, the shift is quite apparent….

In Black Duck’s sample, the most popular variant of the GPL – version 2 – is less than half as popular as it was (46% to 19%). Over the same span, the permissive MIT has gone from 8% share to 29%, while its permissive cousin the Apache License 2.0 jumped from 5% to 15%. What this means is that over the course of a seven year period, the GPLv2 has gone from being roughly equal in popularity to the next nine licenses combined to 10% out of first place.

All of which suggests that if we generally meant copyleft when we were talking about open source in 2007, we typically mean permissive when we discuss it today.
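The claims in that quote can be sanity-checked from the figures it gives. A quick sketch, using only the three licenses the quote names (so "leader" here means the leader among those three, which is an assumption about the full table):

```python
# Shares (in percent) as quoted from the Black Duck comparison.
share_2010 = {"GPLv2": 46, "MIT": 8, "Apache-2.0": 5}
share_2017 = {"GPLv2": 19, "MIT": 29, "Apache-2.0": 15}

# "Less than half as popular as it was": the ratio should be below 0.5.
gplv2_ratio = share_2017["GPLv2"] / share_2010["GPLv2"]

# "10% out of first place": gap between GPLv2 and the 2017 leader.
leader = max(share_2017, key=share_2017.get)
gap_to_leader = share_2017[leader] - share_2017["GPLv2"]

print("GPLv2 2017 share relative to 2010:", round(gplv2_ratio, 2))
print("2017 leader among the quoted licenses:", leader)
print("Percentage points behind the leader:", gap_to_leader)
```

Both claims hold up: GPLv2 kept well under half of its earlier share and sits 10 points behind MIT.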

Read the whole thing, esp. the bit about the rise of unlicensed projects on GitHub.

Now, methodologically, their survey is smaller:

This open source licensing data reflects analysis of over two million open source projects from over 9,000 global forges and repositories.

So, it might be the case that the Google population wouldn’t show this shift. But, ex ante, a focused crawl is more likely (perhaps) to be dominated by “high quality” repositories, thus may reflect best or active practice better.

This all still cries out for some causal investigation.

License Troubles: Wikipedia vs. Flickr

I’m making slides for classes and for a talk.

I’d like to use images in these slides to make them less horrible to look at. I do not draw and even my diagramming skills are limited. I can take photos reasonably, but I can’t track down everything I need, much less compose a photo while generating slides. This leaves the web! Yay! Lots of images!

But then…licenses. Damn. Ok, Creative Commons has been on the job for decades now. Google Image search will even filter by license type. Yay!

Wikipedia has lots of useful images. I often use them. But they don’t have everything. I find a nice image which seems perfect: A photo of an antique wooden French flat file cabinet.

I’d love to use this as my illustration of “flat file” database. License looks ok, non-commercial…

But wait, is my use non-commercial or commercial? What’s the rule for educational use? Looking…not obvious. The CC website isn’t super helpful, then I find:

The NC licenses may not be compatible for remixing with many works. For example, a person may not remix BY-SA content (such as Wikipedia content) with BY-NC content.

Well, damn. Most Wikipedia content is copylefted (which is annoying as I don’t necessarily want my slides unrestrictedly reused, but whatev; I can live with it), but a good chunk of the content on Flickr (and elsewhere) is NC. So it doesn’t matter what my use is, I can’t do it.
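If I wanted to check a slide deck mechanically, it might look something like the sketch below. It records only the single incompatibility stated in the CC quote above; every other pairing is reported as unknown rather than assumed to be fine. The function name, the image names, and the license tags on them are hypothetical.

```python
from itertools import combinations
from typing import Optional

# Only the one pairing stated in the CC quote is recorded here.
# Anything else is deliberately left unknown, not assumed compatible.
KNOWN_INCOMPATIBLE = {frozenset({"BY-SA", "BY-NC"})}


def can_remix(license_a: str, license_b: str) -> Optional[bool]:
    """Return False for a known-bad pairing, None when this table has no answer."""
    if frozenset({license_a, license_b}) in KNOWN_INCOMPATIBLE:
        return False
    return None


# Hypothetical slide assets and the licenses I believe they carry.
slide_images = [
    ("Wikipedia diagram", "BY-SA"),
    ("Flickr cabinet photo", "BY-NC"),
]

# Collect every pair of images that cannot share a deck.
blocked = [
    (name_a, name_b)
    for (name_a, lic_a), (name_b, lic_b) in combinations(slide_images, 2)
    if can_remix(lic_a, lic_b) is False
]

for name_a, name_b in blocked:
    print("Cannot combine:", name_a, "and", name_b)
```

Returning "unknown" instead of "ok" for unlisted pairs is the point of the design: the table only knows what the quote told it.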

Sigh. What was effortless now becomes impossible. And I’ve burned time worrying about it.