Quick Hit on Replication

“Quick”…hopefully quicker than the last one as I’m fading fast.

I can tell you right now that there are parts of the Imperial model we’re just going to borrow, not rederive…at least for version 1. Maybe for every version, unless someone new steps up.

Why? Let’s consider a simple example: The formula for workplace size distribution. My current plan is just to use their reported equation instead of tracking down the source data (or comparable source data) and applying the same fitting and modelling that they did.

Why not? First, there have to be limits. I’m not going to try to rederive the Landscan data either. We only have so much time and expertise. My team’s current focus and expertise is on agent based epidemiological model engineering. We don’t have a ton of time either.

Second, it’s not clear that this will make a difference. Since we have the equation, we can use it and see if everything works out. We can easily try more ad hoc models to do some sensitivity testing. If I were going to go further, I might chase data on real workplaces.

Third, extending the first reason, there’s so much more to explore. We see lots of things in the Imperial model that we find a bit uncongenial and workplace size distributions ain’t one of them.

That being said, we will extract as much detail as we can and put it in a nice clear place so someone who is interested in doing such work can easily find what’s needed and how to plug it into code.

A replication can serve many purposes. For us we want to:

  1. understand the Imperial model
  2. raise confidence in the results (or identify problems in a clear and rectifiable way)
  3. build out our epidemiological framework and expertise; in particular, we really want to establish a best of class infrastructure where reuse of models and parts of models is easy; unlike Ferguson et al, we’re less concerned with individual results or experiments and more with enabling such experiments
  4. learn more about the replicability of such models and build some tools for facilitating it

I certainly hope once we get this replication stood up that epidemiologists will want to use it and build on it in a variety of ways.

I’m not as concerned about “speed to market” though that’s an issue. At the moment we’re backward looking. We’re trying to understand a published and highly consequential model. One thing that will up the need for speed is if people and policy makers start rejecting epidemiological advice based on Ferguson’s personal issues and thus far rather silly complaints about the released code. I’d be very very surprised if our replication revealed any shenanigans about the results. This is, in part, because the Imperial results or, rather, the policy that was triggered by the results, are pretty straightforwardly correct and justifiable on a wide range of model outcomes. Simply put, you don’t need a complex model to know that lockdowns were necessary and came too late.

However, if people use facts about Ferguson to try to open too soon, that’d be a problem and a successful replication could help there.

So there’s some policy responsibility here. But again, it’s best served by us doing a careful job.

It’d be great to have a few more experienced Python programmers (it’s really me and one other with a bunch of CS students getting up to speed) and some Cython mentorship would be welcome. Any epidemiologist who wants to help with the extraction and validation is more than welcome.

(He wrote on a blog that is, to a first approximation, read by no one.)

Ok, I just spent a few hours I could have spent working on Cython or extraction or grading or…sleeping!

Imperial Model Quick Update

It’s just been over a week since I blogged last and a lot has been going on, not just in this replication effort. Zoe had a remote, prerecorded Zoom gig (which I’ve been helping out with) and tons of other things. Still haven’t heard about the promotion (though I think the latest they could tell me would be the end of June). I had some exercise energy early in the week but that’s waned.

Oh, and I’ve been fighting with two things: measuring Python memory consumption and wrestling with Cython. There’s lots of good stuff to write about from those efforts…but it’s another thing and a tiring one. I need a smoother format than WordPress…maybe some Markdowny thingy.

First bit, other team members have made good progress on population density. We can load the appropriate Landscan data and a rasterisation derived from ONS local authority based census data. So, getting closer.

Second bit, whoof, it’s super hard to measure the “deep” object memory weight of a Python object. I’ve tried three installed tools and some manual approaches, and not only are the results inconsistent between techniques, even within a single technique things are weird. Really weird.

I quickly found that our strings are almost certainly interned, so no big wins there. I tried the bit twiddling consolidation of booleans and saw memory consumption shrink, but at the cost of roughly 10x the access time and a ton of ugly code. (Would want macros.) I quickly decided that the only productive way forward was Cython. Since a lot of our attribute values are essentially primitive, we could hope that they’d be captured as thin C types and popped in the struct Cython uses to represent extension object instance variables. That also would open the way for sub-word width representation with C speed bit operations. I found this cool feature of Cython where you can leave your Python file untouched and “just” add Cython declarations in a separate file…this seems perfect!
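
(To give a flavour of the bit twiddling I abandoned, here’s roughly the kind of thing I mean…a minimal sketch, not our actual code. The flag names are the ones from our Person class; the class name and helpers are just for the sketch. Every access pays for the masking, which is where the 10x comes from.)

class PackedFlags:
    """Six boolean 'attributes' packed into one int, with properties
    preserving normal attribute access. Sketch only."""
    __slots__ = ("_flags",)

    _BITS = {"is_infected": 0, "acquired_infection": 1, "dead": 2,
             "is_immune": 3, "timetable_may_change": 4, "is_at_hospital": 5}

    def __init__(self):
        self._flags = 0

    def _get(self, name):
        return bool(self._flags & (1 << self._BITS[name]))

    def _set(self, name, value):
        bit = 1 << self._BITS[name]
        self._flags = (self._flags | bit) if value else (self._flags & ~bit)

    is_infected = property(lambda s: s._get("is_infected"),
                           lambda s, v: s._set("is_infected", v))
    dead = property(lambda s: s._get("dead"),
                    lambda s, v: s._set("dead", v))
    # ...and so on for the other four flags (hence "would want macros")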

Well, “perfect” heh. Let’s say that I haven’t quite got it all working, esp with inheritance. (It’s a messy story for another day.) But I got it working well enough to see significant benefit in memory consumption. Here’s a guppy “relative” heap dump:

Partition of a set of 9000002 objects. Total size = 980889350 bytes.
Index  Count   %     Size   % Cumulative  % Kind (class / dict of class)
    0 1000000  11 408000000  42 408000000  42 dict of model.AgentsPlain.Person
    1 1000000  11 240000000  24 648000000  66 dict (no owner)
    2 1000000  11 104000000  11 752000000  77 list
    3 3000001  33 84000028   9 836000028  85 int
    4 1000000  11 64888890   7 900888918  92 str
    5 1000000  11 56000000   6 956888918  98 model.AgentsPlain.Person
    6 1000000  11 24000000   2 980888918 100 float
    7      1   0      432   0 980889350 100 types.FrameType

Here I just created 1 million Person objects. (Remember for the UK we need 60 million or so.) To a first approximation, the total size is probably about right. So 1GB for a million Persons. Index 5 gives us the “base” overhead of an object, about 56 bytes. Index 0 is the dictionary which holds all the instance variables, which weighs in at around 400 bytes. Anything that has a count of 1,000,000 is probably part of our Person object. (I created 1 distinct string per object, just to see. This makes the total size a bit of an overstatement, but not as much as you might think, since schools, workplaces, and social centres are each one string and there would normally be more. It’s only 64 bytes for a total of 0.06GB/million or 3.6GB for 60 million. It’s currently a rounding error.)
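
(If you want to reproduce this kind of measurement, here’s a sketch of how to get a “relative” dump with guppy3. The Person constructor arguments are made up; the module path matches our plain-Python agent class.)

from guppy import hpy
from model.AgentsPlain import Person   # our plain-Python agent class

h = hpy()
h.setrelheap()                          # only count objects allocated from here on

people = [Person(i) for i in range(1_000_000)]   # constructor args are illustrative

print(h.heap())                         # produces a partition table like the one above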

After a naive Cython conversion, we see:


Partition of a set of 4000002 objects. Total size = 576889350 bytes.
Index  Count   %     Size   % Cumulative  % Kind (class / dict of class)
    0 1000000  25 240000000  42 240000000  42 dict (no owner)
    1 1000000  25 168000000  29 408000000  71 Agents.Person
    2 1000000  25 104000000  18 512000000  89 list
    3 1000000  25 64888890  11 576888890 100 str
    4      1   0      432   0 576889322 100 types.FrameType
    5      1   0       28   0 576889350 100 int


Only 60% or so of the memory needed! That was totally worth the Cython learning curve.

Index 0 can be driven out just by a simple Python level refactor (that makes things a lot nicer) and perhaps the lists can too. If both of those work out (the dict is a sure thing, the list less so) then we get to around 0.16GB per million agents or 9.6GB for the UK population. This is with almost no changes to the substantive Python code. It looks the same. It runs the same. You can run it with or without the Cython version.

This is still about twice our overall model budget (if we’re going to run on certain sections of Manchester’s Condor setup) and we have some other stuff to represent. Not a lot, though, and a lot of it is covered by our too-many strings. 4GB is a tight tight budget. 10GB for a model is low enough to run on my development laptop.

It does mean that the team’s efforts at tessellating the map so we can work on parts aren’t so critical. I think it still can help in a variety of ways and for places like the US it will be critical. But assuming I’ve not messed up the measurement, we might be ok.

I do think it’s worth pressing on a bit with layout and alignment. I’m pretty sure all the fields of the Cython struct are 64bit and…we don’t need that for almost anything except population and even there, 32 bits gets us 4 billion. Unless we want to literally run 1 model with the entire world population, we have a lot of wasted space.

Just to hint, here’s the Cython declaration:


cdef class Person:
   cdef public bint is_infected, acquired_infection, dead, is_immune, timetable_may_change, is_at_hospital
   cdef public str  social_center, workplace, school, location, symptoms
   cdef public int id, hours_since_infection, incubation_length, infection_length, timetable_group
   cdef public float age, p_infection, p_death
   cdef public tuple  timetable
   cdef public dict hospital
   cdef public int household

(That household declaration is a bit of an experiment.)

Here’s the corresponding struct:


struct __pyx_obj_6Agents_Person {
 PyObject_HEAD
 int is_infected;
 int acquired_infection;
 int dead;
 int is_immune;
 int timetable_may_change;
 int is_at_hospital;
 PyObject *social_center;
 PyObject *workplace;
 PyObject *school;
 PyObject *location;
 PyObject *symptoms;
 int id;
 int hours_since_infection;
 int incubation_length;
 int infection_length;
 int timetable_group;
 float age;
 float p_infection;
 float p_death;
 PyObject *timetable;
 PyObject *hospital;
 int household;
};


So, it’s pretty clear that all the PyObjects except timetable and hospital are dispensable. social_center et al are just ids and could be ints with almost no loss of code clarity. You’d have to add a bit of pretty printing code but that’s it. The names are all generated by interpolating an incremented integer into a name like ‘social_center’.

Most of the ints have very short ranges, even beyond the booleans. We should be able to pack them quite tightly. I’m not sure if alignment is even worth all that much given how much memory we’d drop. I’m no C programmer, even at the novice level. So I’m going to have to research a bit on bit twiddling vs. packed and unaligned struct access.

Back of the envelope suggests that we could comfortably get down to 32 bytes a Person which would be around 2GB for all of the UK.
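
(To make the back of the envelope a bit more concrete, here’s one possible packed layout using the struct module. Every field width here is a guess, not a design, and timetable and hospital would stay as separate Python objects.)

import struct

PERSON_FORMAT = (
    "<"     # little-endian, no padding
    "I"     # id (32 bits)
    "I"     # household id (32 bits)
    "4H"    # social_center, workplace, school, location as 16-bit ids
    "B"     # symptoms as a small enum
    "B"     # the six booleans packed into one byte (two spare bits)
    "B"     # age band
    "B"     # timetable_group
    "B"     # incubation_length
    "B"     # infection_length
    "H"     # hours_since_infection
    "2H"    # p_infection, p_death as 16-bit fixed point
)

print(struct.calcsize(PERSON_FORMAT))   # 28 bytes, under the 32 bytes/Person target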

There is a super modest speed win at the moment at least for tests comfortably in memory. Just compiling Person doesn’t gain all that much and I suspect there’s not huge gains from Cythonising Person alone. But there are places where we iterate over the whole population doing simple stat gathering. That seems like an easy win if we can push the whole thing into C.

Ok, this went on a bit!

I have to get back to the spec extraction. I’m trying to keep the extraction ahead of the team and they are catching up!

But all in all, I’m feeling pretty confident that it will work out ok. I’m going to push, at our Monday meeting, to open the repo, though it won’t be “nice” yet. (I mean, it’s pretty good right now, but the master branch is the ad hoc model we threw together. There’s some but nowhere enough documentation. Etc.) It just bugs me. But maybe opening it early will bug others! I don’t think we can absorb a lot of contributions at the moment (though I don’t see people champing at the bit…I’ve gotten no feedback or help on the extraction.)

I have thoughts there, but that’s another post.

Scaling the Imperial Model

We’re at the point where we have two population density models (Landscan for various years and ONS census data for local authorities projected onto a Landscan like grid). So we’re starting to test loading and initialising. And we do hit some snags. From a comment on our issue list:

We cannot have 60 million agents in one go, as that required 35+GB RAM/simulation. We thus need to “compartmentalize”. We should keep each compartment to under 1 million agents so that these can be simulated more or less in parallel.

From our Slack channel:

person object is 500 bytes in size
which isn’t much
but
when we multiply it by the number of people we [expect] [60+ million]
it starts being a lot

(And if we try the US, that’s 330+ million, etc.)

Scaling is an issue and no, the solution is not to rewrite it in Rust 🙂

(It’s going to be a running joke. The sad thing is that I had been reading up on various Rust-Python bridges before the silly tweet because I think it might be helpful. We’ll have to look at Cython as well.)

Dario’s current thought is to tessellate the map (perhaps even by local authority) and iterate over each tile, perhaps in parallel. Still 35+GB RAM, but not all in one process. (For some of the HPC systems we have access to, we have a bound of 4GB/process.) I think this is a good strategy and will allow other optimisations to be simpler (e.g., if we have no infectious people in a tile, we don’t need to evaluate it at all, pace inter-tile travel).

One issue about simple tessellation is that we might have agents crossing boundaries. Even daily if a tile boundary crosses a commute. We could make each tile “commute closed” but that might end up being the whole map if there’s enough overlap.

But we will have agents cross tile boundaries if we include air travel. I don’t think this is a big deal, but each tile will need to synch entries and exits per epoch.

One thing we’ve discussed a bit from the start is whether we’d benefit from a fundamentally different internal architecture, something that would be compact, fast, and potentially allow exploiting GPUs. There’s some work on agent based frameworks on GPUs (which I will have to redig out). Given that the current implementation scales pretty reasonably to a few million agents, I think the current “naive object” approach has a lot of merit. I’d love to have a transformation based implementation so that you can still write a more or less naive object model.

But I’m getting ahead of myself. The first question is how much of that 500 bytes is eliminable. Python trades memory for convenience, so a lot of things in Python are “fat” by default. Plus, we were often simplistic in data representations, e.g., using non-interned* strings for small enums (e.g., various status indicators).

  • Actually, it’s unclear to me whether these strings (like “hospital0”) are interned or not. Python will intern strings which “look like” identifiers but they have to be visible at compile time. We generate all hospital, school, etc. IDs, so they are likely not interned. But I haven’t verified that yet.

So, 500*85 million is a lot…how small do we need to get? Well, 500*85 million is 42,500,000,000 bytes, which is 42GB. That’s too much, but 100*85,000,000 is 8.5GB, which is reasonable but means we can’t use our local Condor pool lab machines. If we get to about 40 bytes/person, that’d be better. (We might want to get a little slimmer as we have other overheads.)
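
(The arithmetic, for anyone playing along at home; 85 million is the population figure used just above.)

population = 85_000_000

for bytes_per_agent in (500, 100, 40):
    gb = bytes_per_agent * population / 1e9
    print(f"{bytes_per_agent} bytes/agent -> {gb:.1f} GB")
# 500 -> 42.5 GB, 100 -> 8.5 GB, 40 -> 3.4 GB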

(Note that even if we got it this small, we’d probably still want some sort of tessellation for parallelism, distribution, and just managing things esp if we ever scale up to the world. Of course, smaller objects makes tessellation work better too.)

Time to do some baking, so I’ll just outline some iterative strategies from least invasive (and thus likely least effective) to most:

  1. Only look at attribute value ranges and try to tune them. For example, instead of using non-interned strings for school ids, use interned strings (or some other terse object). Similarly, we store age as a float, but we don’t need precise ages, only age bands, so perhaps convert to an enum.
  2. Consider a “fused” attribute representation. For example, a boolean object reference (e.g., to True) seems to cost 8 bytes. We have 6 boolean attributes per person, thus 6*8=48 bytes. But obviously, 6 boolean values can be stored in a single byte and we can use bit operations to access them independently. Now, getting a single byte object with low/no overhead is a bit tricky in Python, but we might be able to do something. Similarly, a lot of values are strings which identify a place. We could turn these into references. We’d use getters and setters to preserve the current interface.
  3. Consider lower overhead objects both for attributes and for persons themselves. Class instances aren’t the most heavyweight object type in Python, but we might be able to do better. Dataclasses and slots, for example, might help. (attrs looks cool but doesn’t seem to focus on size.)
  4. Ditch objects altogether and try for a packed data representation. Ideally, we’d have something of uniform size so we can pack sets of them into something like a numpy array. This would be a path toward GPU-ization. This potentially is the biggest rewrite and will really change how the code looks for modellers.
  5. Try to use a database like SQLite. We usually are working with small numbers of agents (e.g., nearby ones) so we could store them in a database and iteratively retrieve an “area” (say, a pixel’s worth of people). We’d still want a tight representation since moving in and out of the database gets pricey. But maybe we can push some of our iteration into the database’s queries and…that might be faster? My luck with that sort of thing hasn’t been great. Still, imagine having a packed representation of most attributes in one column, and a few key attributes in other, indexed, columns. We could grab all the infectious…or all in a region. For each one, we could grab the people “in range”. We could probably pack this all into one query with group by… (See the sketch after this list.)
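
(Here’s the promised sketch for option 5, using the standard-library sqlite3 module. The table layout and column names are invented for illustration; “packed” would hold the blob of the remaining attributes.)

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE person (
        id          INTEGER PRIMARY KEY,
        pixel_x     INTEGER,
        pixel_y     INTEGER,
        infectious  INTEGER,
        packed      BLOB        -- everything else, tightly packed
    )
""")
conn.execute("CREATE INDEX idx_infectious ON person (infectious)")
conn.execute("CREATE INDEX idx_pixel ON person (pixel_x, pixel_y)")

# "Grab all the infectious in a region" becomes one indexed query:
rows = conn.execute("""
    SELECT id, pixel_x, pixel_y, packed FROM person
    WHERE infectious = 1
      AND pixel_x BETWEEN ? AND ?
      AND pixel_y BETWEEN ? AND ?
""", (100, 120, 200, 220)).fetchall()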

One key feature of the Imperial model, I believe, is that it’s fundamentally static at an agent level. We aren’t simulating people walking down the street or anything like that. They have their places and they teleport to them and that’s it. (Which is like our current model, by the way.) I’m not detecting any stochasticity in going to work or school. (Are weekends treated differently?) So we really can “invert” our step evaluation. Instead of iterating over all agents, we iterate over the currently infectious and grab their neighbours. At some point, a good swath of the population will be infected (or infectious), but then we can reverse and grab only the non-infected, get their neighbours, and see if they get infected. There will be a bulge in the middle where we have to directly iterate over half the population, who might be in range of the other half, so we still have to worry about peaks, but only in terms of time.
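
(A rough sketch of that inversion; all the helpers here, neighbours_of and try_infect, are placeholders for whatever the real model provides.)

def step(population):
    infectious = [p for p in population if p.is_infected and not p.dead]

    if len(infectious) <= len(population) / 2:
        # Early/late phase: walk the infectious and push infection outward.
        for source in infectious:
            for neighbour in neighbours_of(source):
                if not neighbour.is_infected and not neighbour.is_immune:
                    try_infect(source, neighbour)
    else:
        # The bulge in the middle: walk the susceptible and pull infection inward.
        susceptible = [p for p in population
                       if not p.is_infected and not p.is_immune]
        for target in susceptible:
            for neighbour in neighbours_of(target):
                if neighbour.is_infected and not neighbour.dead:
                    try_infect(neighbour, target)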

Imperial Model Structure: Workplaces (Part II)

Caveat: These are working blog posts with a lot of “thinking aloud”. I’m a computer science academic studying the engineering of agent based models, not an epidemiologist. Critique and comment welcomed. If you are interested in collaborating, send me a note.

The Landscan data access came through yesterday, yay! It…was a bit trickier to deal with than we had hoped from the documentation, but we’re managing. More in a separate post.

(What I write here are really initial draft brain dumps. We refine it in our discussions.)

So, school placement and allocation is done. Now we need to do workplace placement and allocation. (Workplaces can include schools! But the school bit is discussed…with schools not with other workplaces. Sigh.)

(Primary documents are  this  supplementary data and that supplementary data.)

So, my takeaway from writing the last post is that workplace placement and allocation are not fully specified in the text. But we have some clear constraints:

developing an algorithm which matches both workplace size data and the observed distribution of distances travelled to work…

the result was non-biased and met the marginal distributional constraints on commuting distribution distance (see Fig 1 of main text) and workplace size.

So, really, any placement algorithm which generates a set of workplaces that is plausibly a sample drawn from the given distribution, with an allocation algorithm that makes the set of commutes plausibly a sample drawn from the commuting distance distribution, is fine.

At least this gives us an empirical sanity check. Even if our algorithm isn’t guaranteed to adhere to these constraints, we can always re-run it until we get an output which meets the constraints. I’d guess that my prior algorithm sketch would do the job.
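
(Something like this is what I have in mind for the sanity check…a sketch only: generate_workplaces, SIZE_BINS, and expected_fraction_for are stand-ins, and the binning and threshold would need thought.)

from scipy import stats

def generate_until_plausible(max_tries=20, alpha=0.05):
    """Re-run placement until the generated sizes look like a sample
    from the target workplace size distribution."""
    for _ in range(max_tries):
        workplaces = generate_workplaces()
        observed = [sum(1 for w in workplaces if lo <= w.size < hi)
                    for lo, hi in SIZE_BINS]
        expected = [expected_fraction_for(lo, hi) * len(workplaces)
                    for lo, hi in SIZE_BINS]
        _, p_value = stats.chisquare(observed, expected)
        if p_value > alpha:          # can't reject "drawn from the target"
            return workplaces
    raise RuntimeError("placement never matched the size distribution")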

CALLOUT: (I’m really not sure what they mean by “non-biased and met the marginal distributional constraints”. Non-biased wrt what? Is ‘marginal’ doing some work here? We’re definitely outside my personal stats/epidemiology knowledge. So we need to revisit these.)

CALLOUT: So how do we check that our generated sets are plausibly drawn from the corresponding distributions. Only clues: “Figure SI3: (a) Data on numbers of workers in US working in workplaces of different size categories (see text for source), compared with maximum likelihood fit of offset power model for workplace size distribution. (b) Best fit modelled distribution of workplace sizes.” “Figure SI3: (a) Data on numbers of workers in Thailand working in workplaces of different size categories7 , compared with maximum likelihood fit of Zipf model for workplace size distribution. (b) Best fit probability distribution of workplace sizes.” I guess it’s maximum likelihood fit across the board. So that’s straightforward enough.

Now I’m a bit nervous. We’re definitely outside my bailiwick and expertise. (Other teammates may be able to help out.) It’s hard for me to tell whether this is poor reporting or whether I’m simply not in the standard audience. Perhaps an experienced computational epidemiologist knows exactly which page of which standard textbook to go to. If that’s true, then for that audience, this is perfectly fine reporting for replication. It’d be nice if they reported for a somewhat more general audience. We can’t make everything perfectly self contained, but I don’t feel like an algorithm sketch or a few references would be so burdensome.

When looking around for help, I found a book which had this line in the intro:

Whenever appropriate, researchers should use available codes rather than writing their own. However, the true value of the new approach to choice modeling is the ability to create tailor-made models. The computational and programming steps that are needed to implement a new model are usually not difficult. An important goal of the book is to teach these skills as an integral part of the exposition of the models themselves. I personally find programming to be extremely valuable pedagogically. The process of coding a model helps me to understand how exactly the model operates, the reasons and implications of its structure, what features constitute the essential elements that cannot be changed while maintaining the basic approach, and which features are arbitrary and can easily be changed. I imagine other people learn this way too.

I love all of this, but I’m really feeling the first sentence right now!

There are multiple overlapping literatures and code bases to navigate here.

Aha! I was feeling blue about all this but then I jumped over to the “model” description in the oldest paper (yes, this is absurd!):

The proportions of individuals attending different place types varies by age (see Section 1 above), and we assumed individuals are members of only one place. Places are distributed randomly in space, with a density proportional to that of the population. Individuals are randomly allocated to places at the start of a simulation, with a probability which can depend on the distance between the individual and place and the number of individuals the place still has space for. Individuals are always allocated to places in the same country as their household location. For place types 1..3 (schools), only the total number of places of each type is constrained (thus constraining mean place size) – no constraint on the maximum membership of any one school is imposed. For place type 4 (workplaces), membership is constrained to produce a distribution of workplace sizes which matches the data discussed above, while also reproducing the observed distance to work distribution for Thailand. We define n_i and m_i^j as the number of people in the household and place type j of individual i respectively.

Ok! This basically validates a lot of my thinking!

Ok, so I think in the first pass, we’ll do something like the free choice allocation model for schools with a post check that the distribution looks ok. It won’t be quite free choice since workplaces can fill up.

CALLOUT: We need to study the effects of free choice workplace allocation and to work on “a probabilistic dynamical algorithm based around the concept of workers competing for available vacancies with a choice function which depended on distance from household location (the ‘choice kernel’)” or the equivalent.

First, placement (which we combine with generation):


jobs_needed = load(total_employment)

# We need to get workforce size for the schools.
for s in schools:
    s.worker_count = s.student_count * teacher_student_ratio

# All schools are workplaces.
workplaces = set(schools)

for p in pixels:  # we iterate randomly over location squares (shuffle first)
    if jobs_needed <= 0:
        break
    population = len(p.population)
    number_of_workplaces = some_prob_function_of(population)
    p.workplaces = workplace_generate(number_of_workplaces) + p.schools
    workplaces.update(p.workplaces)
    jobs_needed -= sum(wp.worker_count for wp in p.workplaces)

Now allocation:


for person in population:
    person.commute_distance = generate_commute_distance()
    # We need a set of workplaces "near" the commute distance.
    # If there are workplaces on the circle with radius commute_distance,
    # they go in. We might add more, closer or further, if we don't have any
    # or not enough. All should be "about" commute_distance away and have
    # a free slot.
    candidate_workplaces = hiring_workplaces_nearest(person.home_pixel, person.commute_distance)
    person.workplace = choose(candidate_workplaces)

There’s a bunch of un/semi defined functions in there. But the basics seem ok for a first cut.

Don’t Be a Tech Silly (or Like Tech Silly Tweets)

(I don’t mean the nice kind of silly but the jerky kind of silly.)

Some things are predictable, such as people will tweet silly things. My current rule is that I try to give the benefit of the doubt to mere silliness (i.e., not to bigotry or the like) for three exchanges and then I block. This is to keep me from wasting time and energy.

Today, I found that I should block someone if enough people like their silliness. Because each like is really annoying. Don’t like silly tweets!

What set me off today was having mentioned our replication effort, someone tweeted:

That may take weeks of runtime and hundreds of GB of memory to replicate a full scale run. If you want a highly robust, yet still fast, implementation, why not something like Rust?

Oh. My. God.

I really really like Rust. I’ve written some small programs in it and hope to write some longer ones. But the “rewrite it in Rust” drive by commenters are terrible.

  1. They have no idea what the performance characteristics of our framework and model are or will be. They heard “Python” and knee jerked “Rust”.
  2. Weeks of runtime and hundreds of GB aren’t a problem (esp for something like this).
  3. The hard part at the moment is the modelling…indeed the extraction of the model.
  4. We want our model to be broadly accessible for correctness and understanding. Python is better for that (esp. with non software engineer scientists).
  5. We want people to participate even if they’re students, newbies, or casual contributors. Python is better for that.
  6. Getting people trained in a new language before getting to the replication work is…extremely silly.

The tweeter may be a generally nice person but a) this was not a remotely nice tweet and b) they doubled down.

Don’t do this. It doesn’t make you look smart or good or helpful or savvy. It just makes you look like a jerk.

It reminds me that before we open up the model we need to write up a code of conduct and an FAQ. It’s sad that we need to do this.

Imperial Model Structure: Workplaces (Part I)

Caveat: These are working blog posts with a lot of “thinking aloud”. I’m a computer science academic studying the engineering of agent based models, not an epidemiologist. Critique and comment welcomed.

The Landscan data access came through yesterday, yay! It…was a bit trickier to deal with than we had hoped from the documentation, but we’re managing. More in a separate post.

(What I write here are really initial draft brain dumps. We refine it in our discussions.)

So, school placement and allocation is done. Now we need to do workplace placement and allocation. (Workplaces can include schools! But the school bit is discussed…with schools not with other workplaces. Sigh.)

(Primary document is the  supplementary data for 5.)

(Screenshotted for equation and graphs.)

Unlike schools, which seem to have “emergent” sizes based on density, we seem to draw workplaces from a more complex size distribution. But…there’s no discussion of workplace placement, and allocation is only discussed in the prior paper’s supplementary data:

The algorithm to allocate individuals to workplaces was necessarily more complex than the school allocation algorithm – while the school size distributions could be approximately matched with a ‘free choice’ algorithm, the highly over-dispersed workplace size distribution requires constraints on workplace choice to be imposed. The challenge in developing an algorithm which matches both workplace size data and the observed distribution of distances travelled to work is one of computational efficiency – it is very easy to find algorithms which are O(N²) and take days to allocate large populations. We developed a probabilistic dynamical algorithm based around the concept of workers competing for available vacancies with a choice function which depended on distance from household location (the ‘choice kernel’). A full description is beyond the scope of this paper, but the result was non-biased and met the marginal distributional constraints on commuting distribution distance (see Fig 1 of main text) and workplace size.

Ah yes, we have to bop up to the main text for Fig 1. (Journals! You’re electronic. Stop with the stupid page restrictions esp for supplementary data. Authors! Make the supplementary data self contained!)

The bolded part is just Argh! “It is easy to find…” so give us one of them! Just one! It’s not all that easy to find since your spec isn’t exactly clear! If you gave a naive one we would have something to build on. And having your own magic algorithm that you don’t share is bonkers. It really isn’t beyond the scope of the paper…we can’t replicate without it so it’s pretty damn in scope!

What still seems to be missing is workplace placement. We know how to get their sizes but not their (synthetic) locations.

(I feel that census or similar data would be useful here. I can’t tell how much is historical lack of access.)

Ok, first crack:

  1. Every school is a workplace with number of workers decided by student/teacher ratios. (Thus we must place and allocate schools first.)
  2. Workplace placement will use the same basic approach as school placement
    1. We have a pool of workplaces drawn from a distribution (most simply, number of workers divided by average workers per location).
    2. A pixel gets a workplace (or maybe more than one workplace?!) with a probability dependent on the population density.
      1. It seems a bit odd to have “one workplace per square km” (remember workplaces include retail establishments, post offices, etc.) so maybe it’d be better to use a proportion of the population as a guide. So, suppose the population is 100 people. We seed a choice with 100 (it could be less or more) and that gives us the “working population” (say 75, say 150). Then we draw workplaces from the general size distribution until we exceed the limit. Sometimes we’ll exceed it a lot (we draw a 5000 person workplace on the first draw) but that’s ok! Big workplaces can be in low density areas.
  3. Workplace allocation is also a bit similar to school allocation except we draw candidate workplaces by first selecting a commuting distance (from that distribution). In particular, for each person we pick two commuting distances and grab all the workplaces with free slots within the band formed by the two circles formed by taking each commuting distance as a radius. Then we select a workplace randomly, perhaps weighted by distance or by preferring one of the commuting distances (via a coin flip). If we have no free slots, we redraw the commuting distances and try again. If we fail three times, then we create a new workplace.

I’d guess this would “match both workplace size data and the observed distribution of distances travelled to work”. Maybe it’s quadratic in…something? Certainly it wouldn’t be the n of all workers times the m of all workplaces, or anything like that.

There are obvious alternatives, like picking one commuting distance and getting the nearest to that (either on both sides, or only outside, or…).

I don’t know why we need to have workers “compete” other than by first come first serve… we could iterate over the whole population in random order to avoid cluster effects.

I’m not ready to pseudocode this and am very tired. So I’m going to pause here.

I’m confused as to why we got a better description of the school situation than the workplace situation other than the workplace algorithms are…more complex so more of a pain to describe? Let’s just say that that’s not the right heuristic.

Ferguson Resigns from SAGE

Oh damn:

The epidemiologist credited with convincing the UK government to abandon thoughts of pursuing herd immunity in favour of physical distancing has resigned amid allegations he breached lockdown rules.

According to the Daily Telegraph, Prof Neil Ferguson stepped down from his government advisory position over claims a woman with whom he is in a romantic relationship, but who lives elsewhere, visited him at his home during the lockdown.

Full report is paywalled:

The scientist whose advice prompted Boris Johnson to lock down Britain resigned from his Government advisory position on Tuesday night as The Telegraph can reveal he broke social distancing rules to meet his married lover.

Professor Neil Ferguson allowed the woman to visit him at home during the lockdown while lecturing the public on the need for strict social distancing in order to reduce the spread of coronavirus. The woman lives with her husband and their children in another house.

Apparently in an open marriage.

Gah!

Well, resigning from SAGE is probably the right thing to do, though annoying. Qua lockdown violation, this wasn’t great:

The first visit was on 30 March and coincided with Prof Ferguson saying lockdown measures were showing the first signs of success. A second visit on 8 April reportedly came despite Staats telling friends she “suspected her husband, an academic in his thirties, had symptoms”.

Ferguson believing he was immune doesn’t really mitigate things. Staats clearly shouldn’t have left her home (I hope she didn’t take public transportation).

Is it the worst violation? No. But it’s bad enough. And obviously he knew better. Smart people often do stupid things, sometimes very very stupid things.

It doesn’t change his and his team’s science. We’re pressing forward on the replication. It still has a weight and a cost.

Imperial Model Structure: School Allocation

Caveat: These are working blog posts with a lot of “thinking aloud”. I’m a computer science academic studying the engineering of agent based models, not an epidemiologist. Critique and comment welcomed.

Going on to the next paragraph of supplementary data for 5…

Terminology is always an issue 🙂 I’d tend to say that

  • school generation is the process of creating the right number and type of schools
  • school placement or distribution is the process of assigning schools to specific locations
  • school allocation is the association of a student with a school

these together form school initialisation.

In the Imperial work there are two basic approaches to generation and placement:

  1. use a database of existing schools and addresses
  2. generate a synthetic set of schools based on whole unit stats and placement with a probability proportionate to tile density

CALLOUT: When looking at country level statistics, I doubt that there’s that much difference between the approaches, esp if you do multiple initialisations and average it out. If we want to assess risk from a school perspective (in general do more local stuff) then actual schools are helpful, but how helpful without real attendance figures is unclear to me. Might be worth some experiments.

CALLOUT: To be clear, let me give an example. If I am a Greater Manchester official, I might want to use a simulation to know which schools are higher risk for spreading the disease to riskier populations. This will depend on the particular age/health structure of the populations interacting with the school populations. Let’s assume we won’t have real enrolment data and certainly not family data. So we have to generate the population. So…does it matter that we also generate the schools? Might we just case match schools in our community with schools in our model to get a reasonable estimate? Probably! I would take a profile of a school (size, surrounding population density, etc.) and extract all relevantly similar artificial schools in my model and do some averaging and exploration of that set. That might give me better insight than looking at a location matched school in my model.

Since we have a synthetic population, we need some function to assign students  (and teachers!) to schools (whether the schools are synthetic or real) (bopping back to the supplementary data for 5):

These methods allocate children to schools using a free selection algorithm; each child ‘picks’ a school at random from the nearest 3 (for primary) or 6 (for secondary) schools of the appropriate type, with a probability weighted by the distance kernel derived from UK census data on commuting behaviour (see below). Restricting school choice to the nearest 3 or 6 schools avoids unrealistic long tails to the distribution of distances travelled to school, albeit at the cost of slightly underestimating mean distances travelled to school by approximately 30% 8 . Staff student ratios were used to determine the proportion of adults to schools rather than other workplaces.

There’s a lot of detail to pick out.

For each child, we need their student type (primary or secondary) presumably determined by their age (I guess we can assume faster or slower progressors balance out and we can use strict age bands?). Then we need their nearest 3 or 6 appropriate schools with the distances. AND then we need to select one using a distance weighted probability.

The distance distribution function is…this?

The parameters and functional form of the choice kernel used in the worker-to-workplace assignment algorithm were then adjusted until the model fit the empirical distributions well. In the case of GB a 2 parameter offset-power law kernel fit well: f(d)~1/[1+(d/a)^b], where a=4 km and b=3. For the US, the change in slope of the tail of the distribution (on a log scale) meant that a 4 parameter function was needed: f(d)~1/[1+d/a]^b + k/[1+d/a]^c where a=35km, b=6.5, k=0.0004 and c=2.2. Figure SI4 illustrates how well the model fitted the data.

Ok, we have a function f(d)~1/[1+(d/a)^b], where a=4km and b=3. I think, certainly for v1 of our replication, we’re not going to attempt to re-derive this distribution. Actually, I suspect a simpler choice model like equal weighting would be fine. Our goal here will be to make sure that our model uses a choice function (and indeed their choice function) and we can always rederive that if it seems important (or someone wants to!).
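
(Here’s a sketch of how that kernel plugs into the choice step; choose_school_from below takes (school, distance) pairs, and the numbers in the usage example are made up.)

import random

def kernel(d, a=4.0, b=3.0):
    """The GB offset-power kernel from the quote: f(d) ~ 1 / (1 + (d/a)**b)."""
    return 1.0 / (1.0 + (d / a) ** b)

def choose_school_from(candidates):
    """Pick one of the nearest 3 (or 6) schools, weighted by the kernel.
    candidates: list of (school, distance_km) pairs."""
    schools = [s for s, _ in candidates]
    weights = [kernel(d) for _, d in candidates]
    return random.choices(schools, weights=weights, k=1)[0]

# e.g. choose_school_from([(school_a, 1.2), (school_b, 3.5), (school_c, 7.0)])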

So the school allocation algorithm:


# This has age stratified people at each pixel!
# It also has any schools
pixels = load_current_map() 

for p in pixels:
    primary_students = p.get_people(primary_age_band)
    secondary_students = p.get_people(secondary_age_band)
    
    # We only have pixel to pixel resolution on distance.
    # So all students at a pixel have the same "nearest" schools
    
    # Note that we might well have equidistant schools so we might
    # want to be a bit clever, e.g., pull the wider pool and take a subset
    primary_schools = nearest_schools('primary', p, pixels, 3)
    secondary_schools = nearest_schools('secondary', p, pixels, 6)

    for s in primary_students:
        # Uses probability function!
        s.school = choose_school_from(primary_schools) 
    for s in secondary_students:
        s.school = choose_school_from(secondary_schools)
    # Note this assignment should probably add each student to their 
    # school but that's a convenience implementation detail.

Imperial Model Structure: School Initialisation

Caveat: These are working blog posts with a lot of “thinking aloud”. I’m a computer science academic studying the engineering of agent based models, not an epidemiologist. Critique and comment welcomed.

Going on to the next paragraph of supplementary data for 5, we have the school generation and assignment model. They use two different strategies for school placement: for the UK, they generate a synthetic set and placement based on population density while for the US they use actual (near) locations based on a database of schools.

If we get the appropriate database of schools, then we’ll “just” use that, i.e., assign each school to their corresponding tile.

Let’s look at the text for synthetic generation:

For GB, we used data7 on the average size of primary (218 pupils in 2004) and secondary schools (949 pupils in 2004) and on average class sizes and staff-student ratios and applied the methods used in past work6 to generate a synthetic population of schools distributed in space with a density proportional to local population density.

Ok, we have:

  • average size of two kinds of school
  • average class sizes (why?)
  • staff-student ratios (this is to help us assign adults to schools as their workplace)

I’m a bit perplexed why they gave us the average size of schools but not the class sizes or staff-student ratios. I mean, those numbers are important too! Then I wouldn’t have to spelunk for that data. Ideally, I want both the values they used and enough details so I can rederive them (e.g., if I want to try a different dataset).

Let’s follow this reference 6 to see if there’s a more detailed description of the synthetic school method. 6 is:

Ferguson, N. M. et al. Strategies for containing an emerging influenza pandemic in Southeast Asia. Nature 437, 209-14 (2005).

I believe this is the ur-paper which first introduced the model. I hope! Notice that gives us 3 prior papers in total that may or may not contain critical details.

This is partially an artefact of the scholarly publishing system. They are writing a narrative description and it’s not systematic at all. It’s not bad per se, but it’s also not documentation about their model. I may well be the first person to try to tease out all the details.

This isn’t only a problem for replication! It also is a problem for interpretation of their results. It doesn’t feel that way, of course. Indeed, the first few times I read it, I thought that it was  thin for replication but probably enough for interpretation.

But is it? In our prototype model (based on a much simpler network structure of houses and other locations, generated randomly), I had to think really hard about whether our initialisation is going to muck with infection chains. Now, I have the advantage that I could inspect runs and monitor them…if I implemented such telemetry and ran more experiments. But I can get a long way because I know the code! My measurements would be hypothesis confirming.

Here…well, I don’t quite know what’s going on from a casual read. I think they make fixed size schools and then allocate “enough” of them to keep student commutes sensible. But do they just underpopulate schools placed in sparse areas (and have some kids not go to school)? I would think that in addition to more schools in dense areas you have more populated schools, which might affect infection patterns. I can’t relate any such thinking to their work because I don’t know what they actually did. Yet! Certainly not on a casual read.

Of course, too much detail can inhibit understanding as well. So the interpretation problem is much less clear cut than the replicability problem.

Ok, 6 also has supplementary notes! It’s very similar to the previous notes overall. Ooo, some discussion of the household generation function!

Household size and age structure data The sources for these data2,3 are detailed in the main text. Here we note that it is not straightforward to construct a realistic demographic model of households which matches marginal distributions of household size and population age structure. Randomly assigning ages within households produces consistent marginal distributions but fails to capture a realistic generational based age structure within households (i.e. households of 4 people tend to have 2 parents and 2 school-age children in, not 4 people with randomly selected ages). We therefore developed a rather heuristic model which maintained generational age gaps with households while matching overall household size and age distributions.

(Remember how I said I sometimes yell out “Fuck you” when frustrated by reading a paper and it’s venting? I yelled out “fuck you” here. That bolded bit upset me.)

Are there page limits in supplemental data? Do I really have to bounce up to the main text to check for details which aren’t even relevant to the model I actually want to replicate?

Looking up, the size and age structure seem national and are similarly given as column charts instead of tables (use tables! gimme the numbers! I hate trying to infer values from a chart! these charts are worthless):

The model used Landscan data24 to generate a simulated population realistically distributed across geographic space (Fig. 1a). Thai census data25,26 on household size and age distributions were used for demographic parameterization (Fig. 1bc).

CALLOUT: We don’t know what the heuristics are. I’m still guessing it’s something equivalent to: for a new household, draw a person from the age distribution, then draw the next person from a weighted age distribution based on household structure (e.g., if an adult, draw another adult from a similar age band; if a kid, draw an adult next, then adults or kids randomly for the rest of the members who are “close” in age to the existing ones; this is for a UK-like scenario where 3 generation households are relatively rare (I think)). Again, some sketch beyond “rather heuristic” and “generational age gaps” (WHAT ARE THESE?!) would help. This could be simple or complex.
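
(Just to pin down what I’m imagining, here’s a minimal sketch of that kind of heuristic. This is entirely my guess, not their algorithm; draw_age_from, draw_age_near, is_adult, and the age bands are placeholders.)

def generate_household(size, age_distribution):
    members = [draw_age_from(age_distribution)]           # first member: any age
    while len(members) < size:
        if len(members) == 1 and is_adult(members[0]):
            # likely partner: another adult within a few years of the first
            members.append(draw_age_near(members[0], spread=5, band=ADULT_BAND))
        elif not any(is_adult(a) for a in members):
            # a household of kids needs a parent-aged adult
            members.append(draw_age_near(members[0] + 30, spread=5, band=ADULT_BAND))
        else:
            # remaining members: a generation below the adults, close in age
            youngest_adult = min(a for a in members if is_adult(a))
            members.append(draw_age_near(youngest_adult - 30, spread=8, band=CHILD_BAND))
    return members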

Ok. Let’s do school generation (saving allocation of students and teachers for another post):

Data on school sizes was obtained for a sample of 24,000 schools in Thailand5 , roughly 2/3 of all schools in the nation. This sample was patchy in its spatial coverage, however – with only approximately 60% of provinces being included. Distributions of school types and sizes did not vary widely by province, but we were not able to verify anecdotal reports that combined elementary+secondary schools were more common in rural areas from the district-level stratification the sample provided. We therefore assumed no geographic variation in the distribution of school types. Overall, 41,700 primary schools, 12,500 elementary+secondary schools, and 6,900 secondary schools were modelled, with average sizes of 175, 420 and 750 pupils respectively. The simulation locates schools randomly across the modelled region with a probability proportional to the population density, the total number of schools being selected to give the correct average size for each type of school at a national level.

Ok. So they don’t weight the probability of a school type based on location. But do they do age stratified densities? If I have a region where the population is 2/3 primary school age, are 2/3 of the schools primary or primary+secondary? Maybe you take the overall school age population and generate enough schools so that you get roughly the right average but don’t otherwise try to guide enrolment?

Ok, I can see an algorithm possibility, but I’m unclear whether a pixel by pixel loop would “locate schools randomly across the modelled region with a probability proportional to the population density”. (Do we need to look at pixel neighbourhoods?) Eh. We have a given number of schools to start, so perhaps:


import random

# I'm only going to consider two school types as that fits with the UK model.
# The number of schools for a type is the total number of school age children
# in the population divided by the average school size.
primary_schools = generate_primary_schools(NROFPRIMARYSCHOOLS)
secondary_schools = generate_secondary_schools(NROFSECONDARYSCHOOLS)

# We don't tune type to specific student population density
schools = primary_schools + secondary_schools
random.shuffle(schools)  # randomise order
pixels = load_density_data()

for p in pixels:
    # Guard against running out of schools before we run out of pixels
    if schools and contains_school_based_on_density_based_probability(p):
        p.add_school(schools.pop())

return pixels

The probability function needs to be such that we don’t run out of pixels too much before we run out of schools. Nor can it be so likely that we run out of schools early. That seems like it could be fiddly.

We could shuffle the pixels first to avoid getting caught in weird adjacent runs. We could also generate extra schools if we run out?
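
(One way to side-step the fiddliness is to turn the loop around: instead of a per-pixel probability, draw a pixel for each school with probability proportional to its population. A sketch, using the same made-up pixel API as above.)

import random

def place_schools(schools, pixels):
    # Placement probability is proportional to population density by construction,
    # and we can't run out of schools or pixels.
    weights = [len(p.population) for p in pixels]
    for school in schools:
        pixel = random.choices(pixels, weights=weights, k=1)[0]
        pixel.add_school(school)
    return pixels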

After the student allocation, we should then examine all the schools to get the mean (preferably something more like quintiles) to check that we look reasonable wrt the actual school size range.

Imperial Model Structure: Lower Resolution Population Density

Caveat: These are working blog posts with a lot of “thinking aloud”. I’m a computer science academic studying the engineering of agent based models, not an epidemiologist. Critique and comment welcomed.

While we’re waiting for the LandScan data, I was wondering if we could implement an alternative (if only to nail down some APIs and stuff). It’d be great to use other data sources, probably public census data which is likely to be lower resolution.

Peek ahead: I rather suspect that the 1km raster grid is built deeply into the Imperial model. (E.g., agents might move around a grid and community infection chance uses a spatial proximity model.) This would be a big change to our current implementation which uses a network model (i.e., people and places are related explicitly by links rather than implicitly by proximity).

However, I think we can defer the issue of how to represent things internally for a bit.

I’m going to make the assumption that we generally want to deal with the 1km grid in the generation algorithms. So we need to upsample/interpolate lower resolution data to get the finer grid.

To do this, we need a function that goes from a given 1km pixel to the associated admin area. Each admin area then assigns a fraction of its total population to that pixel. So something like:


pixels = load_pixels_with_coordinates()
for p in pixels:
    admin_area = admin_area_for(p)
    admin_area.assign_population_to(p)

assign_population_to should keep track of the assignments thus far and make sure not to go over.

We can use different assignment functions. The simplest would be some sort of uniform random assignment, i.e., each member of the population has an equal chance of living in a pixel. But we could use some other function such as equal number per pixel (very unrealistic) or some weighted function.
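
(For the uniform random version, something like this is what I have in mind…a sketch which assumes each admin area knows its pixel list and total population up front, rather than being fed one pixel at a time.)

import random

def assign_population_uniformly(admin_area):
    # Every member of the admin area's population is equally likely
    # to land in any of its pixels; totals are conserved by construction.
    counts = [0] * len(admin_area.pixels)
    for _ in range(admin_area.total_population):
        counts[random.randrange(len(admin_area.pixels))] += 1
    for pixel, n in zip(admin_area.pixels, counts):
        pixel.population_count = n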

Given that the ONS data has granular age distributions (e.g., by year and per admin unit) we probably should use that (though use a country wide one when doing a precise replication).

The resulting pixels would be indistinguishable from the Landscan data from an API perspective, so we can just use them as a drop-in replacement downstream.

I’m not sure what the implications are for using the lower resolution data. Admin units range in area and population, but I’m not yet sure how. They are hierarchical (e.g., Greater Manchester (Met County) is a Metropolitan County with 2,812,569 people while Manchester is a Metropolitan District with 547,627) so we could play a bit with that. We could compare the output of this generation algorithm to the LandScan data.

If I had to guess, I think this method (using random assignment) would result in fewer really dense areas (even taking into account that random assignment will produce clumps). This might have implications for spread, i.e., it could slow it down a bit.

On the other hand, using admin unit specific age distributions might be helpful for local authorities as it would get a better picture of the likely critical care burden.