Oh 2021

Thus far, not looking great.

Thus I shall recommence with the blogging!

Robert Farley pointed out that the Monkee’s song, “Daydream believer” is actually a dark piece about a dissolving, shallow, materialistic marriage:

It’s not clear from the lyrics:

Oh, I could hide ‘neath the wings
Of the bluebird as she sings
The six o’clock alarm would never ring
But it rings, and I rise
Wipe the sleep out of my eyes
My shavin’ razor’s cold and it stings

Cheer up, sleepy Jean
Oh, what can it mean
To a daydream believer
And a homecoming queen

You once thought of me
As a white knight on his steed
Now, you know how happy I can be
Whoa, and our good times start and end
Without dollar one to spend
But how much, baby, do we really need

(OMG…the code block editor is unmitigated garbage…JUST LET ME HAVE A BLOCKQUOTE!!!!)

I mean:

Stewart was beginning to explore a more personal side to his writing when the Kingston Trio disbanded in 1967. In a 2006 interview, he recalled that he envisioned “Daydream Believer” as “part of a suburbia trilogy” that focused on the growing distance in a couple’s marriage. Its comparatively serious subject matter and its setting echoed “Pleasant Valley Sunday.”

The song’s oblique lyrics focused on the endgame of a comfy but increasingly distant relationship. The narrator is caught in mid-gaze before the bathroom mirror, reflecting on the quiet dissolution of his materialistic marriage – a union between “a daydream believer and a homecoming queen,” now curdled and (as originally written) “funky,” driven more by money than by romance.

It was surprisingly mature subject matter for ‘60s pop consumption, but the tune’s blissfully melodic, irresistible chorus screamed “major hit potential,” and overrode the vaguely sketched darkness at its narrative

Oblique is fucking right! It needs a third verse at least. The last verse line suggests a non materialistic relationship or hopes of one!

Perhaps something like:

It turns out we need lots
More than each of us will give
And it is the money not our love
No, there’s more to life and love
Than the emptiness of us
A facade cannot hide wood that rots

Ok, that’s just overt!

I listen to a song
That was ours and was all wrong
Promises we broke before they were even made
The razor’s work is done
Off to ride the train again
And to dream of bluebirds singing all day long

Then the last chorus could shift to:

Good bye, sleepy Jean
Don’t know what it’ll mean
To a daydream believer
And a homecoming queen

I think there still loads of ambiguity, but it’s not utterly obscure.

(Oh god WordPress…why why why why why why why why?!??!)

When Supposedly Smart People…

Sometimes I do something silly.

I god some fermenting gear. I’d wanted to make ginger beer for ages. But you really need appropriately strong bottles for the carbonation phase (the “secondary fermentation”). But I got some! But then I had no ginger.


But hey I had some strawberries and rhubarb and found a recipe for strawberry-rhubarb fermented soda and why not?

The primary fermentation went pretty well. It was easy. I ran it about 3 days and then decanted it into my flip top secondary fermentation bottle. There were bubbles!

I left it 24 hrs and put it in the fridge this morning. All is fine.

I was going to make baps but I wanted to try the soda. I wasn’t superconvinced by the bubbles. I had the bright idea of videoing the uncorking.

You can see the result.

So yeah. I got that on film.

Feel free to laugh. Zoe did when she saw the video. I haven’t seen her laugh so hard or so helplessly for a while so really, I did this for her. That’s the story and we’re sticking with it.

There wasn’t much soda left.

But it was delicious.

And I’ve learned my lesson: don’t make food videos!

Been a While

And not for lack of stuff to write about! I’ve still been v tired (exercise/going out levels have been v low) and I had grading (bleah) and eye troubles (triple bleah…I’m going to need to massively adjust my setups; dark mode for me).

BUT, my promotion case to Professor was successful so as of Aug 1, 2020, I will be a Professor of Computer Science. I never again have to think about a promotion! I won’t get the customary pay rise (due to COVID belt tightening).

There are still things I can apply for (pay rises in the future where they exist again, fellowships, a higher doctorate, etc.) but…they are all nice to have, not need to have. For some reason, making it to Professor was need to have for me. At least, in the sense that it felt obligatory to go for. Each year that I failed to make my promotion case felt like a real failure and cost me. I hope I’m now free of all that.

(Part was due to the weird detour my career took and feeling “behind”. Part was the unfortunate instillation of the feeling that if you didn’t make “full professor”/Professor you somehow didn’t have a proper career. <– This is super toxic! In any case, I feel relief.)

There’s a bunch of other cools stuff including that we have a first draft of a feature complete Imperial Model…which desperately needs optimising just to ru with 50,000 agents. The ball is back in my court and I’ve already had some interesting results which I’ll write up (I hope) shortly (I hope). I’m feeling pretty good about our chances of doing a full UK run by the end of the week.

(And I’m feeling SO GOOD about my adhering to the “measure first” philosophy. It’s been winning hard for me right now and we’ve had some nice object lessons for some students.)

It’s Obvious and Everyone Knows It

Dominic Cummings should resign.

We all know he broke the lockdown rules and rather dangerously. It’s immediate. This isn’t arguable or questionable.

The law has been more permissive than all the guidance (and this has been a problem), but there’s no question at all that he violated the guidance and that’s more that sufficient to require his resignation. And really, he pretty clearly violated the law. The only excuse that even remotely is relevant:

(j)in relation to children who do not live in the same household as their parents, or one of their parents, to continue existing arrangements for access to, and contact between, parents and children, and for the purposes of this paragraph, “parent” includes a person who is not a parent of the child, but who has parental responsibility for, or who has care of, the child;

isn’t appropriate as none of his extended family have parental responsibility or care of their child.

The pathetic orchestrated lies are particularly disgusting, of course. It is utterly bizarre to me. I hate Cummings but it’s not like I thought Ferguson should get a pass. I do not understand why all these politicians would go down this path of repugnant wagon circling.

Obviously, I would be happy for this inept government to be replaced but…in the middle of a crises it’d be better if they just got more apt!

Quick Hit on Replication

“Quick”…hopefully quicker than the last one as I’m fading fast.

I can tell you right now that there are parts of the Imperial model we’re just going to borrow, not rederive…at least for version 1. Maybe version all unless someone new steps up.

Why? Let’s consider a simple example: The formula for workplace size distribution. My current plan is just to use their reported equation instead of tracking down the source data (or comparable source data) and applying the same fitting and modelling that they did.

Why not? First, there has to be limits. I’m not going to try to rederive the Landscan data either. We only have so much time and expertise. My team’s current focus and expertise is on agent based epidemiological model engineering. We don’t have a ton of time either.

Second, it’s not clear that this will make a difference. Since we have the equation, we can use it and see if everything works out. We can easily try more ad hoc models to do some sensitivity testing. If I were going to go further, I might chase data on real workplaces.

Third, extending the first reason, there’s so much more to explore. We see lots of things in the Imperial model that we find a bit uncongenial and workplace size distributions ain’t one of them.

That being said, we will extract as much detail as we can and put it in a nice clear place so someone who is interesting in doing such work can easily find what’s needed and how to plug it into code.

A replication can serve many purposes. For us we want to:

  1. understand the Imperial model
  2. raise confidence in the results (or identify problems in a clear and rectifiable way)
  3. build out our epidemiological framework and expertise; in particular, we really want to establish a best of class infrastructure where reuse of models and parts of models is easy; unlike Ferguson et al, we’re less concerned with individual results or experiments and more with enabling such experiments
  4. learn more about the replicability of such models and build some tools for facilitating it

I certainly hope once we get this replication stood up that epidemiologists will want to use it and build on it in a variety of ways.

I’m not as concerned about “speed to market” though that’s an issue. At the moment we’re backward looking. We’re trying to understand a published and highly consequential model. One thing that will up the need for speed is if people and policy makers start rejecting epidemiological advice based on Ferguson’s personal issues and thus far rather silly complaints about the released code. I’d be very very surprised if our replication revealed any shenanigan’s about the results. This is, in part, because the Imperial results or, rather, the policy that was triggered by the results, are pretty straightforwardly correct and justifiable on a wide range of model outcomes.Simply put, you don’t need a complex model to know that lockdowns were necessary and came too late.

However, if people use facts about Ferguson to try to open too soon, that’d be a problem and a successful replication could help there.

So there’s some policy responsibility here. But again, it’s best served by us doing a careful job.

It’d be great to have a few more experienced Python programmers (it’s really me and one other with a bunch of CS students getting up to speed) and some Cython mentorship would be welcome. Any epidemiologist who wants to help with the extraction and validation is more than welcome.

(He wrote on a blog that is, to a first approximation, read by no one.)

Ok, I just spent a few hours I could have spent working on Cython or extraction or grading or…sleeping!

Imperial Model Quick Update

It’s just been over a week since I blogged last and a lot has been going on, not just in this replication effort. Zoe had a remote, prerecorded Zoom gig (which I’ve been helping out with) and tons of other things. Still haven’t heard about the promotion (though I think the latest they could tell me would be the end of June). I had some exercise energy early in the weak but that’s waned.

Oh, and I’ve been fighting with two things: measuring Python memory consumption and wrestling with Cython. There’s lots of good stuff to write about from those efforts…but it’s another thing and a tiring one. I need a smoother format than WordPress…maybe some Markdowny thingy.

First bit, other team members have made good progress on population density. We can load the appropriate Landscape and a rasterisation derived from ONS local authority based census data. So, getting closer.

Second bit, whoof, it’s super hard to measure the “deep” object memory weight of a Python object. I’ve tried 3 installed things, some manual stuff, and it’s not only not consistent between techniques, even within techniques, things are weird. Really weird.

I quickly found that our strings are almost certainly interned, so no big wins there. I tried the bit twiddling consolidation of booleans and saw memory consumption shrink at 10x the access time and a ton of ugly code. (Would want macros.) I quickly decided that the only productive way forward was Cython. Since a lot of our attribute values are essentially primitive, we could hope that they’d be captured as thin C types and popped in the struct Cython uses to represent extension object instance variables.That also would open the way for sub-word width representation with C speed bit operations. I found this cool feature of Cython where you can leave your Python file untouched and “just” add Cython declarations in a separate file…this seems perfect!

Well, “perfect” heh. Let’s say that I haven’t quite got it all working, esp with inheritance. (It’s a messy story for another day.) But I got it working well enough to see significant benefit in memory consumption. Here’s a guppy “relative” heap dump:

Partition of a set of 9000002 objects. Total size = 980889350 bytes.
Index  Count   %     Size   % Cumulative  % Kind (class / dict of class)
    0 1000000  11 408000000  42 408000000  42 dict of model.AgentsPlain.Person
    1 1000000  11 240000000  24 648000000  66 dict (no owner)
    2 1000000  11 104000000  11 752000000  77 list
    3 3000001  33 84000028   9 836000028  85 int
    4 1000000  11 64888890   7 900888918  92 str
    5 1000000  11 56000000   6 956888918  98 model.AgentsPlain.Person
    6 1000000  11 24000000   2 980888918 100 float
    7      1   0      432   0 980889350 100 types.FrameType

Here I just created 1 million Person objects. (Remember for the UK we need 60 million or so.) To a first approximation, the total size is probably about right. So 1GB for a million Persons. Index 5 gives us the “base” overhead of an object, about 56bytes. Index 0 is the dictionary which holds all they instance variables, which weighs in at around 400bytes. Anything that has a count of 1,000,000 is probably part of our Person object. (I created 1 distinct string per object, just to see. This makes the total size a bit of an overstatement but not as much as you might thing as schools, workplaces, and social centres we each one string and there would normally be more. It’s only 64byes for a total of 0.06GB/million or 3.6GB for 60 million. It’s currently a rounding error.)

After a naive Cython conversion, we see:

Partition of a set of 4000002 objects. Total size = 576889350 bytes.
Index  Count   %     Size   % Cumulative  % Kind (class / dict of class)
    0 1000000  25 240000000  42 240000000  42 dict (no owner)
    1 1000000  25 168000000  29 408000000  71 Agents.Person
    2 1000000  25 104000000  18 512000000  89 list
    3 1000000  25 64888890  11 576888890 100 str
    4      1   0      432   0 576889322 100 types.FrameType
    5      1   0       28   0 576889350 100 int


Only 60% or so the memory needed! That was totally worth the Cython learning curve.

Index 0 can be driven out just by a simple Python level refactor (that makes things a lot nicer) and perhaps the lists can too. If both of those work out (the dict is 100%, the list less so) then we get to around 0.16GB per million agents or 9.6GB for the UK population. This is with almost no changes to the substantive Python code. It looks the same. It runs the same. You can run it with our without the Cython version.

This is still about twice our overall model budget (if we’re going to run on certain sections of Manchester’s Condor setup) and we have some other stuff to represent. Not a lot, though and a lot of them are covered by our too many strs. 4GBs is a tight tight budget. 10GB for a model is low enough to run on my development laptop.

It does mean that the teams efforts at tessellating the map so we can work on parts isn’t so critical. I think it still can help in a variety of ways and for places like the US it will be critical. But assuming I’ve not messed up the measurement, we might be ok.

I do think it’s worth pressing on a bit with layout and alignment. I’m pretty sure all the fields of the Cython struct are 64bit and…we don’t need that for almost anything except population and even there, 32 bits gets us 4 billion. Unless we want to literally run 1 model with the entire world population, we have a lot of wasted space.

Just to hint, here’s the Cython declaration:

def class Person:
   cdef public bint is_infected, acquired_infection, dead, is_immune, timetable_may_change, is_at_hospital
   cdef public str  social_center, workplace, school, location, symptoms
   cdef public int id, hours_since_infection, incubation_length, infection_length, timetable_group
   cdef public float age, p_infection, p_death
   cdef public tuple  timetable
   cdef public dict hospital
   cdef public int household

(That household declaration is a bit of an experiment.)

Here’s the corresponding struct:

struct __pyx_obj_6Agents_Person {
 int is_infected;
 int acquired_infection;
 int dead;
 int is_immune;
 int timetable_may_change;
 int is_at_hospital;
 PyObject *social_center;
 PyObject *workplace;
 PyObject *school;
 PyObject *location;
 PyObject *symptoms;
 int id;
 int hours_since_infection;
 int incubation_length;
 int infection_length;
 int timetable_group;
 float age;
 float p_infection;
 float p_death;
 PyObject *timetable;
 PyObject *hospital;
 int household;


So, it’s pretty clear that all the PyObjects except timetable and hospital are dispensable. social_center et al are just ids and could be ints with almost no loss of code clarity. You’d have to add a bit of pretty printing code but that’s it. The names are all generated by interpolating an incremented integer into a name like ‘social_center’.

Most of the ints have very short ranges, even beyond the booleans. We should be able to pack them quite tightly. I’m not sure if alignment is even worth all that much given how much memory we’d drop. I’m no C programmer, even at the novice level. So I’m going to have to research a bit on bit twiddling vs. packed and unaligned struct access.

Back of the envelope suggests that we could comfortably get down to 32 bytes a Person which would be around 2GB for all of the UK.

There is a super modest speed win at the moment at least for tests comfortably in memory. Just compiling Person doesn’t gain all that much and I suspect there’s not huge gains from Cythonising Person alone. But there are places where we iterate over the whole population doing simple stat gathering. That seems like an easy win if we can push the whole thing into C.

Ok, this went on a bit!

I have to get back to the spec extraction. I’m trying to keep the extraction ahead of the team and they are catching up!

But all in all, I’m feeling pretty confident that it will work out ok. I’m going to push, at our Monday meeting, to open the repo, though it won’t be “nice” yet. (I mean, it’s pretty good right now, but the master branch is the ad hoc model we threw together. There’s some but nowhere enough documentation. Etc.) It just bugs me. But maybe opening it early will bug others! I don’t think we can absorb a lot of contributions at the moment (though I don’t see people champing at the bit…I’ve gotten no feedback or help on the extraction.)

I have thoughts there, but that’s another post.

Scaling the Imperial Model

We’re at the point where we have two population density models (Landscan for various years and ONS census data for local authorities projected onto a Landscan like grid). So we’re starting to test loading and initialising. And we do hit some snags. From a comment on our issue list:

We cannot have 60 million agents in one go, as that required 35+GB RAM/simulation. We thus need to “compartmentalize”. We should keep each compartment to under 1 million agents so that these can be simulated more or less in parallel.

From our Slack channel:

person object is 500 bytes in size
which isn’t much
when we multiply it by the number of people we [expect] [60+ million]
it starts being a lot

(And if we try the US, that’s 330+ million, etc.)

Scaling is an issue and no the solution is not to rewrite it in Rust 🙂

(It’s going to be a running joke. The sad thing is that I had been reading up on various Rust-Python bridges before silly tweet because I think it might be helpful. We’ll have to look at Cython as well.)

Dario’s current thought is to tesselate the map (perhaps even by local authority and iterate over each tile perhaps in parallel. Still 35+GB RAM but not all in one process. (For some of the HPC systems we have access to, we have a  bound of 4GB/process.) I think this is a good strategy and will allow other optimisations to be simpler (e.g., if we have no infectious people in a tile, we don’t need to evaluate it at all, pace inter-tile travel).

One issue about simple tessellation is that we might have agents crossing boundaries. Even daily if a tile boundary crosses a commute. We could make each tile “commute closed” but that might end up being the whole map if there’s enough overlap.

But we will have agents cross tile boundaries if we include air travel. I don’t think this is a big deal, but each tile will need to synch entries and exits per epoch.

One thing we’ve discussed a bit from the start is whether we’d benefit from a fundamentally different internal architecture, something that would be compact, fast, and potentially allow exploiting GPUs. There’s some work on agent based frameworks on GPUs (which I will have to redig out). Given that the current implementation scales pretty reasonably to a few million agents, I think the current “naive object” approach has a lot of merit. I’d love to have a transformation based implementation so that you can still write a more or less naive object model.

But I’m getting ahead of myself. The first question is how much of that 500 bytes is eliminable. Python trades memory for convenience, so a lot of things in Python are “fat” by default. Plus, we were often simplistic in data representations, e.g., using non-interned* strings for small enums (e.g., various status indicators).

  • Actually, it’s unclear to me whether these strings (like “hospital0” are interned or not. Python will intern strings which “look like” identifiers but they have to be visible at compile time. We generate all hospital, school, etc. IDs, so they are likely not interned. But I haven’t verified that yet.

So, 50085 million is a lot…how small do we need to get? Well, 50085 million is 42500000000 bytes which is 42GB. That’s too much, but 100*85000000 is 8.5GB which is reasonable but means we can’t use our local Condor pool lab machines. If we get to about 40 bytes/person, that’d be better. (We might want to get a little slimmer as we have other overheads.)

(Note that even if we got it this small, we’d probably still want some sort of tessellation for parallelism, distribution, and just managing things esp if we ever scale up to the world. Of course, smaller objects makes tessellation work better too.)

Time to do some baking, so I’ll just outline some iterative strategies from least invasive and thus (likely) effective to most:

  1. Only look at attribute value ranges and try to tune them. For example, instead of using non-interned strings for school ids, use interned strings (or some other terse object). Similarly, we store age as a float, but we don’t need precise ages, only age bands, so perhaps convert to an enum.
  2. Consider a “fused” attribute representation. For example, a boolean object (e.g.,<code>True</code> reference seem to cost 8 bytes. We have 6 boolean attributes per person thus 6*8=48 bytes. But obvious, 6 boolean values can be stored in a single byte and we can use bit operations to access them independently. Now, getting a single byte object with low/no overhead is a bit tricky in Python, but we might be able to do something. Similarly, a lot of values are strings which identify a place. We could turn these into references. We’d use getters and setters to persevere the current interface.
  3. Consider lower overhead objects both for attributes and for persons themselves. Class instances aren’t the most heavyweight object type in Python, but we might be able to do better. Dataclasses and slots, for example, might help. (attrs looks cool but doesn’t seem to focus on size.)
  4. Ditch objects altogether and try for a packed data representation. Ideally, we’d have something of uniform size so we can pack sets of them into something like a numpy array. This would be a path toward GPUiization. This potentially is the biggest rewrite and will really change how the code looks for modellers.
  5. Try to use a database like SQLite. We usually are working with small numbers of agents (e.g., nearby ones) so we could store them in a database and iteratively retrieve an “area” (say, a pixel’s worth of people). We’d still want a tight representation since moving in and out of the database gets pricey.But maybe we can push some of our iteration into the database’s queries and…that might be faster? My luck here hasn’t been high. Still, imagine having a packed representation of most attributes in one column, and a few, key attributes in other, indexed, columns. We could grab all the infectious…or all in a region. For each one, we could grab the people “in range”. We could probably pack this all into one query with group by…

One key feature of the Imperial model, I believe, is that it’s fundamentally static at an agent level. We aren’t simulating people walking down the street or anything like that. They have their places and they teleport to them and that’s it. (Which is like our current model, by the way.) I’m not detecting any stochasticity in going to work or school. (Are weekends treated differently?) So we really can “invert” our step evaluation. Instead of iterating over all agents, we iterate over the currently infectious and grab their neighbours. At some point, a good swatch of the population will be infected (or infectious), but then we can reverse and grab only the non-infected, get their neighbours, and see if they get infected. There will be a bulge in the middle where we have to directly iterate over half the population and they might be in range of the other half, so we still have to worry about peaks, but only for time.

Imperial Model Structure: Workplaces (Part II)

Caveat: These are working blog posts with a lot of “thinking aloud”. I’m a computer science academic studying the engineering of agent based models, not an epidemiologist. Critique and comment welcomed. If you are .interested in collaborating, send me a note.

The Landscan data access came through yesterday, yay! It…was a bit trickier to deal with than we had hoped from the documentation, but we’re managing. More in a separate post.

(What I write here are really initial draft brain dumps. We refine it in our discussions.)

So, school placement and allocation is done. Now we need to do workplace placement and allocation. (Workplaces can include schools! But the school bit is discussed…with schools not with other workplaces. Sigh.)

(Primary documents are  this  supplementary data and that supplementary data.)

So, my takeaway from writing the last post is that workspace placement and allocation are not fully specified in the text. But we have some clear constraints:

developing an algorithm which matches both workplace size data and the observed distribution of distances travelled to work…

the result was non-biased and met the marginal distributional constraints on commuting distribution distance (see Fig 1 of main text) and workplace size.

So, really, any placement algorithm which generates a set of workplaces that is plausibly a sample drawn from the given distribution with an allocation algorithm that makes the set of commutes  plausibly a sample drawn from the commuting distance distribution is fine.

At least this gives us an empirical sanity check. Even if our algorithm isn’t guaranteed to adhere to these constraints, we can always re run it until we get an output which meets the constraints. I’d guess that my prior algorithm sketch would do the job.

CALLOUT: (I’m really not sure what they mean by “non-biased and met the marginal distributional constraints”. Non-biased wrt what? Is ‘marginal” doing some work here? We’re definitely outside my personal stats/epidemiology knowledge. So we need to revisit these.)

CALLOUT: So how do we check that our generated sets are plausibly drawn from the corresponding distributions. Only clues: “Figure SI3: (a) Data on numbers of workers in US working in workplaces of different size categories (see text for source), compared with maximum likelihood fit of offset power model for workplace size distribution. (b) Best fit modelled distribution of workplace sizes.” “Figure SI3: (a) Data on numbers of workers in Thailand working in workplaces of different size categories7 , compared with maximum likelihood fit of Zipf model for workplace size distribution. (b) Best fit probability distribution of workplace sizes.” I guess it’s maximum likelihood fit across the board. So that’s straightforward enough.

Now  I’m a bit nervous. We’re definitely outside my bailiwick and expertise. (Other teammates may be able to help out.) It’s hard  for me to tell whether this is poor reporting or that I’m not in the standard audience. Perhaps an experienced computational epidemiologist knows exactly which page of which standard textbook to go to. If that’s true, then for that audience, this is perfectly fine reporting for replication. It’d be nice if they reported for a somewhat more general audience. We can’t make everything perfectly self contained, but I don’t feel like an algorithm sketch or a few references would be so burdensome.

When looking around for help, I found a book which had this line in the intro:

Whenever appropriate, researchers should use available codes rather than writing their own. However, the true value of the new approach to choice modeling is the ability to create tailor-made models. The computational and programming steps that are needed to implement a new model are usually not difficult. An important goal of the book is to teach these skills as an integral part of the exposition of the models themselves. I personally find programming to be extremely valuable pedagogically. The process of coding a model helps me to understand how exactly the model operates, the reasons and implications of its structure, what features constitute the essential elements that cannot be changed while maintaining the basic approach, and which features are arbitrary and can easily be changed. I imagine other people learn this way too.

I love all of this, but I’m really feeling the first sentence right now!

There are multiple overlapping literatures and code bases to navigate here.

Aha! I was feeling blue about all this but then I jumped over to the “model” description in the oldest paper (yes, this is absurd!):

The proportions of individuals attending different place types varies by age (see Section 1 above), and we assumed individuals are members of only one place. Places are distributed randomly in space, with a density proportional to that of the population. Individuals are randomly allocated to places at the start of a 8 simulation, with a probability which can depend on the distance between the individual and place and the number of individuals the place still has space for. Individuals are always allocated to places in the same country as their household location. For place types 1..3 (schools), only the total number of places of each type is constrained (thus constraining mean place size) – no constraint on the maximum membership of any one school is imposed. For place type 4 (workplaces), membership is constrained to produce a distribution of workplace sizes which matches the data discussed above, while also reproducing the observed distance to work distribution for Thailand. We define i n and j mi as the number of people in the household and place type j of individual i respectively.

Ok! This basically validates a lot of my thinking!

Ok, so I think in the first pass, we’ll do something like the free choice allocation model for schools with a post check that the distribution looks ok. It won’t be quite free choice since workplaces can fill up.

CALLOUT: We need to study the effects of free choice workplace allocation and to work on “a probabilistic dynamical algorithm based around the concept of workers competing for available vacancies with a choice function which depended on distance from household location (the ‘choice kernel’)” or the equivalent.

First, placement (which we combine with generation):

jobs_needed = load(total_employment)

# We need to get workforce size for the schools.
for s in schools:
    s.worker_count = s.student_count*(teacher_student_ratio)

workplaces = set()
# All schools are workplaces.
workplaces += schools

for p in pixels: # we iterate randomly over location squares
    if jobs_needed > 0:
        population = len(p.population)
        number_of_workplaces = some_prob_function_of(population)
        p.workplaces = workplace_generate(number_of_workplaces) + p.schools
        jobs_needed -= sum([wp.worker_count for wp in p.workplaces])

Now allocation:

for person in population:
    person.commute_distance = generate_commute_distance()
    # We need a set of workplace "near" the commute distance.
    # If there are workplaces on the circle with radius commute distance
    # they go in. We might add more closer or further if we don't have any
    # or not enough. All should be "about" commute distance away and have a 
    # a free slot.
    candidate_workplaces = hiring_workplaces_nearest(person.home_pixel, person.commute_distance) 
    p.workplace = choose(candidate_workplaces)

There’s a bunch of un/semi defined functions in there. But the basics seem ok for a first cut.

Don’t Be a Tech Silly (or Like Tech Silly Tweets)

(I don’t mean the nice kind of silly but the jerky kind of silly.)

Some things are predictable, such as people will tweet silly things. My current rule is that I try to give the benefit of the doubt to mere silliness (i.e., not to bigotry or the like) for three exchanges and then I block. This is to keep me from wasting time and energy.

Today, I found that I should block someone if enough people like their silliness. Because each like is really annoying. Don’t like silly tweets!

What set me off today was having mentioned our replication effort, someone tweeted:

That may take weeks of runtime and hundreds of GB of memory to replicate a full scale run. If you want a highly robust, yet still fast, implementation, why not something like Rust?

Oh. My. God.

I really really like Rust. I’ve written some small programs in it and hope to write some longer ones. But the “rewrite it in Rust” drive by commenters are terrible.

  1. They have no idea what the performance characteristics of our framework in model are or will be. They heard “Python” and knee jerked “Rust”.
  2. Weeks of runtime and hundreds of GB aren’t a problem (esp for something like this).
  3. The hard part at the moment is the modelling…indeed the extraction of the model.
  4. We want our model to be broadly accessible for correctness and understanding. Python is better for that (esp. with non software engineer scientists).
  5. We want people to participate even if students, newbies, or causal. Python is better for that.
  6. Getting people trained in a new language before getting to the replication work is…extremely silly.

The tweeter may be a generally nice person but a) this was not a remotely nice tweet and b) they doubled down.

Don’t do this. It doesn’t make you look smart or good or helpful or savvy. It just makes you look like a jerk.

It reminds me that before we open up the model we need to write up a code of conduct and an FAQ. It’s sad that we need to do this.

Imperial Model Structure: Workplaces (Part I)

Caveat: These are working blog posts with a lot of “thinking aloud”. I’m a computer science academic studying the engineering of agent based models, not an epidemiologist. Critique and comment welcomed.

The Landscan data access came through yesterday, yay! It…was a bit trickier to deal with than we had hoped from the documentation, but we’re managing. More in a separate post.

(What I write here are really initial draft brain dumps. We refine it in our discussions.)

So, school placement and allocation is done. Now we need to do workplace placement and allocation. (Workplaces can include schools! But the school bit is discussed…with schools not with other workplaces. Sigh.)

(Primary document is the  supplementary data for 5.)

(Screenshotted for equation and graphs.)

Unlike schools which seem to have “emergent” sizes based on density, we seem to draw workplaces from a more complex size distribution. But…there’s no discussion of workplace placement and allocation is in the prior paper’s supplementary data:

The algorithm to allocate individuals to workplaces was necessarily more complex than the school allocation algorithm – while the school size distributions could be approximately matched with a ‘free choice’ algorithm, the highly over-dispersed workplace size distribution requires constraints on workplace choice to be imposed. The challenge in developing an algorithm which matches both workplace size data and the observed distribution of distances travelled to work is one of computational efficiency – it is very easy to find algorithms which are O(N2 ) and take days to allocate large populations. We developed a probabilistic dynamical algorithm based around the concept of workers competing for available vacancies with a choice function which depended on distance from household location (the ‘choice kernel’). A full description is beyond the scope of this paper, but the result was non-biased and met the marginal distributional constraints on commuting distribution distance (see Fig 1 of main text) and workplace size.

Ah yes, we have to bop up to the main text for Fig 1. (Journals! You’re electronic. Stop with the stupid page restrictions esp for supplementary data. Authors! Make the supplementary data self contained!)

The bolded part is just Argh! “It is easy to fine…” so give us one of them! Just one! It’s not all the easy to find since your spec isn’t exactly clear! If you gave a naive one we would have something to build on. And having your own magic algorithm that you don’t share is bonkers. It really isn’t beyond the scope of the paper…we can’t replicate with out it so it’s pretty damn in scope!

What still seems to be missing is workplace placement. We know how to get their sizes but not their (synthetic) locations.

(I feel that census or similar data would be useful here. I can’t tell how much is historical lack of access.)

Ok, first crack:

  1. Every school is a workplace with number of workers decided by student/teacher ratios. (Thus we must place and allocate schools first.)
  2. Workplace placement will use the same basic approach as school placement
    1. We have a pool of workplaces drawn from a distribution (most simply, number of workers divided by average workers per location).
    2. A pixel gets a workplace (or maybe more than one workplace?!) with a probability dependent on the population density.
      1. It seems bit odd to have “one workplace per square km” (remember workplaces include retail establishments, post offices, etc.) so maybe it’d be better to use a proportion of the population as a guide. So, suppose the population is 100 people. We seed a choice with 100 (it could be less or more) and that gives us the “working population” (say 75, say 150). The we draw workplaces from the general size distribution until we exceed the limit. Sometimes we’ll exceed it a lot (we draw a 5000 person workplace on the first draw) but that’s ok! Big workplaces can be in low density areas.
  3. Workplace allocation is also a bit similar to school allocation except we draw candidate workplaces from first selecting a community distance (from that distribution). In particular, for each person we pick two commuting distances and grab all the workplaces with free slots within the band formed by the two circles formed by the taking each commuting distance as a radius. Then we select a workplace randomly perhaps weighted by distance or by preferring one of the commuting distances (via a coin flip). If we have no free slots, we redraw the commuting distances and try again. If we fail three times, then we create a new workspace.

I’d guess this would “matches both workplace size data and the observed distribution of distances travelled to work”. Maybe it’s quadratic in…something? Certainly it wouldn’t be the  n of all workers times the m of all workplaces, or anything like that.

There are obvious alternatives, like picking one commuting distance and getting the nearest to that (either on both sides, or only outside, or…).

I don’t know why we need to have workers “compete” other than by first come first serve… we could iterate over the whole population in random order to avoid cluster effects.

I’m not ready to pseudocode this and am very tired. So I’m going to pause here.

I’m confused as to why we got a better description of the school situation than the workplace situation other than the workplace algorithms are…more complex so more of a pain to describe? Let’s just say that that’s not the right heuristic.