Imperial Model Structure: School Allocation

Caveat: These are working blog posts with a lot of “thinking aloud”. I’m a computer science academic studying the engineering of agent based models, not an epidemiologist. Critique and comment welcomed.

Going on to the next paragraph of supplementary data for 5,

Terminology is always an issue 🙂 I’d tend to say that

  • school generation is the process of creating the right number and type of schools
  • school placement or distribution is the process of assigning schools to specific locations
  • school allocation is the association of a student with a school

these together form school initialisation.

In the Imperial work there are two basic approaches to generation and placement:

  1. use a database of existing schools and addresses
  2. generate a synthetic set of schools based on whole unit stats and placement with a probability proportionate to tile density

CALLOUT: When looking at country level statistics, I doubt that there’s that much difference between the approaches, esp if you do multiple initialisations and average it out. If we want to assess risk from a school perspective (in general do more local stuff) then actual schools are helpful, but how helpful without real attendance figures is unclear to me. Might be worth some experiments.

CALLOUT: To be clear, let me give an example. If I am a Greater Manchester official, I might want to use a simulation to know which schools are higher risk for spreading the disease to riskier populations. This will depend on the particular age/health structure of the populations interacting with the school populations. Let’s assume we won’t have real enrolment data and certainly not family data. So we have to generate the population. So…does it matter that we also generate the schools? Might we just case match schools in our community with schools in our model to get a reasonable estimate? Probably! I would take a profile of a school (size, surrounding population density, etc.) and extract all relevantly similar artificial schools in my model and do some averaging and exploration of that set. That might give me better insight than looking at a location matched school in my model.

Since we have a synthetic population, we need some function to assign students  (and teachers!) to schools (whether the schools are synthetic or real) (bopping back to the supplementary data for 5):

These methods allocate children to schools using a free selection algorithm; each child ‘picks’ a school at random from the nearest 3 (for primary) or 6 (for secondary) schools of the appropriate type, with a probability weighted by the distance kernel derived from UK census data on commuting behaviour (see below). Restricting school choice to the nearest 3 or 6 schools avoids unrealistic long tails to the distribution of distances travelled to school, albeit at the cost of slightly underestimating mean distances travelled to school by approximately 30% 8 . Staff student ratios were used to determine the proportion of adults to schools rather than other workplaces.

There’s a lot of detail to pick out.

For each child, we need their student type (primary or secondary), presumably determined by their age (I guess we can assume faster or slower progressors balance out and we can use strict age bands?). Then we need their nearest 3 or 6 appropriate schools, with the distances. AND then we need to select one using a distance weighted probability.

The distance distribution function is…this?

The parameters and functional form of the choice kernel used in the worker-to-workplace assignment algorithm were then adjusted until the model fit the empirical distributions well. In the case of GB a 2 parameter offset-power law kernel fit well: f(d)~1/[1+(d/a)^b], where a=4 km and b=3. For the US, the change in slope of the tail of the distribution (on a log scale) meant that a 4 parameter function was needed: f(d)~1/[1+d/a]^b + k/[1+d/a]^c where a=35km, b=6.5, k=0.0004 and c=2.2. Figure SI4 illustrates how well the model fitted the data.

Ok, we have a function f(d)~1/[1+(d/a)^b], where a=4 km and b=3. I think, certainly for v1 of our replication, we’re not going to attempt to re-derive this distribution. Actually, I suspect a simpler choice model like equal weighting would be fine. Our goal here will be to make sure that our model uses a choice function (and indeed their choice function) and we can always re-derive it if that seems important (or someone wants to!).

So the school allocation algorithm:


# This has age stratified people at each pixel!
# It also has any schools
pixels = load_current_map() 

for p in pixels:
    primary_students = p.get_people(primary_age_band)
    secondary_students = p.get_people(secondary_age_band)
    
    # We only have pixel to pixel resolution on distance.
    # So all students at a pixel have the same "nearest" schools
    
    # Note that we might well have equidistant schools so we might
    # want to be a bit clever, e.g., pull the wider pool and take a subset
    primary_schools = nearest_schools('primary', p, pixels, 3)
    secondary_schools = nearest_schools('secondary', p, pixels, 6)

    for s in primary_students:
        # Uses probability function!
        s.school = choose_school_from(primary_schools) 
    for s in secondary_students:
        s.school = choose_school_from(secondary_schools)
    # Note this assignment should probably add each student to their 
    # school but that's a convenience implementation detail.
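
For concreteness, here’s a minimal sketch of what choose_school_from might look like if we plug in their GB kernel directly (a = 4 km, b = 3), assuming nearest_schools returns (school, distance_km) pairs. The names and the interface are my placeholders, not anything from the papers; a uniform-weight version would simply drop the kernel.


import random

# Offset-power law kernel from the supplementary data: f(d) ~ 1/[1+(d/a)^b]
# with a = 4 km and b = 3 for GB. Everything else here is my own sketch.
KERNEL_A_KM = 4.0
KERNEL_B = 3.0

def kernel(distance_km):
    # Weight falls off quickly once distance exceeds a few km
    return 1.0 / (1.0 + (distance_km / KERNEL_A_KM) ** KERNEL_B)

def choose_school_from(schools_with_distances):
    # schools_with_distances: the nearest 3 (primary) or 6 (secondary)
    # schools for this pixel, as (school, distance_km) pairs
    schools = [school for school, _ in schools_with_distances]
    weights = [kernel(d) for _, d in schools_with_distances]
    return random.choices(schools, weights=weights, k=1)[0]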

Strike Deductions

If you strike, you may not get paid for the days you were on strike. That feels basic to a lot of people, in some sense. You don’t work and you don’t get paid. Conversely, if I don’t get paid, then I won’t make up the work.

Of course, it’s not quite so simple. Employers forgive strike days for all sorts of reasons (generate good will; get make up work done; etc.). I’ve taken so few vacation days over the years that the University would come out well ahead even if it forgave all my strike days ever.

Conversely, strike pay deductions are punitive. They are intended to exact a cost from the strikers and to discourage them. This is particularly the case in an unresolved dispute.

How you deduct strike day pay matters. We were on strike for 21 days. For a lot of people, if you took that all from one pay packet they’d be screwed badly. So it’s often the case that the employer and employees negotiate some sort of stagger.

Well, I had no strike days deducted in March, but all of them deducted in April. I’m not the only one.

Gotta say, it pisses me off. It pisses me off more that people who can’t easily weather it the way I can got hit (note I’m not taking strike fund pay either; I can afford it; I’m senior staff and have low expenses, etc.; obviously, I don’t enjoy it but it doesn’t put me in any sort of bad position this month the way it does for a lot of people).

It actually makes me regret all the extra work I did getting colleagues online for their classes. It doesn’t make me happy about doing extra work now. The fact that they are doing weird patterns of pay (which seems more due to problems in HR, as our last rounds were messed up too, but the COVID crisis can’t have made that easier) adds some salt.

There’s a straightforward case that forgiveness of deductions was the right move morally and strategically. The COVID lockdown is a highly unusual event and people who went on strike probably didn’t anticipate it. Forgiving the deductions would make the uni look generous and caring, which surely would zap some enthusiasm for industrial action in the future.

Our Fearless Leader apparently asserted “Those who didn’t go on strike want the deductions.” I have to say that even if this were true, it’s an interestingly rare case of leadership listening to staff. In any case, they didn’t do any kind of survey to determine what staff actually wanted and, for fuck’s sake, they are supposed to be leading.

(Note: I would be nearly as happy if there were a spine point cutoff for forgiveness and I was above it. As I said, I don’t like getting paid less (no one does) but I’ll be fine. Lots of people who went on strike won’t be. Show some compassion!)

Imperial Model Structure: School Initialisation

Caveat: These are working blog posts with a lot of “thinking aloud”. I’m a computer science academic studying the engineering of agent based models, not an epidemiologist. Critique and comment welcomed.

Going on to the next paragraph of supplementary data for 5, we have the school generation and assignment model. They use two different strategies for school placement: for the UK, they generate a synthetic set and placement based on population density while for the US they use actual (near) locations based on a database of schools.

If we get the appropriate database of schools, then we’ll “just” use that, i.e., assign each school to their corresponding tile.

Let’s look at the text for synthetic generation:

For GB, we used data7 on the average size of primary (218 pupils in 2004) and secondary schools (949 pupils in 2004) and on average class sizes and staff-student ratios and applied the methods used in past work6 to generate a synthetic population of schools distributed in space with a density proportional to local population density.

Ok, we have:

  • average size of two kinds of school
  • average class sizes (why?)
  • staff-student ratios (this is to help us assign adults to schools as their workplace)

I’m a bit perplexed why they gave us the average size of schools but not the class sizes or staff-student ratios. I mean, those numbers are important too! Then I wouldn’t have to spelunk for that data. Ideally, I want both the values they used and enough detail so I can re-derive them (e.g., if I want to try a different dataset).

Let’s follow this reference 6 to see if there’s a more detailed description of the synthetic school method. 6 is:

Ferguson, N. M. et al. Strategies for containing an emerging influenza pandemic in Southeast Asia. Nature 437, 209-14 (2005).

I believe this is the ur-paper which first introduced the model. I hope! Notice that this gives us 3 prior papers in total that may or may not contain critical details.

This is partially an artefact of the scholarly publishing system. They are writing a narrative description and it’s not systematic at all. It’s not bad per se, but it’s also not documentation of their model. I may well be the first person to try to tease out all the details.

This isn’t only a problem for replication! It is also a problem for interpretation of their results. It doesn’t feel that way, of course. Indeed, the first few times I read it, I thought that it was thin for replication but probably enough for interpretation.

But is it? In our prototype model (based on a much simpler network structure of houses and other locations, generated randomly), I had to think really hard about whether our initialisation was going to muck with infection chains. Now I have the advantage that I could inspect runs and monitor them…if I implemented such telemetry and ran more experiments. But I can get a long way because I know the code! My measurements would be hypothesis confirming.

Here…well, I don’t quite know what’s going on from a casual read. I think they make fixed size schools and then allocate “enough” of them to keep student commutes sensible. But do they just underpopulate schools placed in sparse areas (and have some kids not go to school)? I would think that in addition to more schools in dense areas you have more populated schools, which might affect infection patterns. I can’t relate any such thinking to their work because I don’t know what they actually did. Yet! Certainly not from a casual read.

Of course, too much detail can inhibit understanding as well. So the interpretation problem is much less clear cut than the replicability problem.

Ok, 6 also has supplementary notes! It’s very similar to the previous notes overall. Ooo, some discussion of the household generation function!

Household size and age structure data The sources for these data2,3 are detailed in the main text. Here we note that it is not straightforward to construct a realistic demographic model of households which matches marginal distributions of household size and population age structure. Randomly assigning ages within households produces consistent marginal distributions but fails to capture a realistic generational based age structure within households (i.e. households of 4 people tend to have 2 parents and 2 school-age children in, not 4 people with randomly selected ages). We therefore developed a rather heuristic model which maintained generational age gaps with households while matching overall household size and age distributions.

(Remember how I said I sometimes yell out “Fuck you” when frustrated by reading a paper and it’s venting? I yelled out “fuck you” here. That bolded bit upset me.)

Are there page limits in supplemental data? Do I really have to bounce up to the main text to check for details which aren’t even relevant to the model I actually want to replicate?

Looking up, the size and age structure seem national and are similarly given as column charts instead of tables (use tables! gimme the numbers! I hate trying to infer values from a chart! these charts are worthless):

The model used Landscan data24 to generate a simulated population realistically distributed across geographic space (Fig. 1a). Thai census data25,26 on household size and age distributions were used for demographic parameterization (Fig. 1bc).

CALLOUT: We don’t know what the heuristics are. I’m still guessing it’s something equivalent to: for a new household, draw a person from the age distribution. Draw the next person from a weighted age distribution based on household structure (e.g., if an adult, draw another adult from a similar age band; if a kid, draw an adult next, then adults or kids randomly for the rest of the members who are “close” in age to the remaining; this is for a UK-like scenario where 3 generation households are relatively rare (I think)). Again, some sketch beyond “rather heuristic” and “generational age gaps” (WHAT ARE THESE?!) would help. This could be simple or complex.
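
To make that guess concrete, here’s a tiny sketch of what a sequential, generation-aware draw could look like. Everything here (the 25–35 year “generation gap”, the partner age band, the sampling interface) is my assumption for illustration, not anything from the papers.


import random

# Sketch only: first member is an adult drawn from the age distribution;
# the second member may be a partner of similar age; everyone else is a
# child roughly one (assumed) generation younger than the first adult.
GENERATION_GAP = (25, 35)   # assumed parent-child age gap in years
PARTNER_GAP = 5             # assumed max age difference between the adults

def generate_household_ages(size, age_dist):
    ages = [age_dist.sample_adult()]    # assumed interface
    for i in range(1, size):
        if i == 1 and random.random() < 0.7:
            # second adult, close in age to the first
            ages.append(max(18, ages[0] + random.randint(-PARTNER_GAP, PARTNER_GAP)))
        else:
            # a child one generation younger than the first adult
            ages.append(max(0, ages[0] - random.randint(*GENERATION_GAP)))
    return ages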

Ok. Let’s do school generation (saving allocation of students and teachers for another post):

Data on school sizes was obtained for a sample of 24,000 schools in Thailand5 , roughly 2/3 of all schools in the nation. This sample was patchy in its spatial coverage, however – with only approximately 60% of provinces being included. Distributions of school types and sizes did not vary widely by province, but we were not able to verify anecdotal reports that combined elementary+secondary schools were more common in rural areas from the district-level stratification the sample provided. We therefore assumed no geographic variation in the distribution of school types. Overall, 41,700 primary schools, 12,500 elementary+secondary schools, and 6,900 secondary schools were modelled, with average sizes of 175, 420 and 750 pupils respectively. The simulation locates schools randomly across the modelled region with a probability proportional to the population density, the total number of schools being selected to give the correct average size for each type of school at a national level.

Ok. So they don’t weight the probability of a school type based on location. But do they do age stratified densities? If I have a region where the population is 2/3s primary school age, are 2/3s of the schools primary or primary+secondary? Maybe you take the overall school age population and generate enough schools so that you get roughly the right average but don’t otherwise try to guide enrolment?

Ok, I can see an algorithm possibility, but I’m unclear whether a pixel by pixel loop would “locate schools randomly across the modelled region with a probability proportional to the population density”. (Do we need to look at pixel neighbourhoods?) Eh. We have a given number of schools to start, so perhaps:


import random

# I'm only going to consider two school types as that fits with the UK model
# The number of schools for a type is the total number of school age children
# in the population divided by the average school size
primary_schools = generate_primary_schools(NROFPRIMARYSCHOOLS)
secondary_schools = generate_secondary_schools(NROFSECONDARYSCHOOLS)

# We don't tune type to specific student population density
schools = primary_schools + secondary_schools
random.shuffle(schools) # randomise order
pixels = load_density_data()

for p in pixels:
    if not schools:
        break # we've placed them all
    if contains_school_based_on_density_based_probability(p):
        p.add_school(schools.pop())

return pixels

The probability function needs to be such that we don’t run out of pixels too much before we run out of schools. Nor can it be so likely that we run out of schools early. That seems like it could be fiddly.

We could shuffle the pixels first to avoid getting caught in weird adjacent runs. We could also generate extra schools if we run out?
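
One way to make the placement less fiddly: instead of walking the pixels and flipping a biased coin, draw a pixel for each school directly, with probability proportional to that pixel’s population. Every school gets placed exactly once, and dense pixels naturally end up with more (possibly several) schools. A sketch, with the attribute and method names being my assumptions:


import random

# Sketch: density-proportional placement without the "running out" problem.
# Assumes each pixel carries a population count and an add_school method.
def place_schools(schools, pixels):
    weights = [p.population for p in pixels]
    for school in schools:
        pixel = random.choices(pixels, weights=weights, k=1)[0]
        pixel.add_school(school)
    return pixels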

After the student allocation, we should then examine all the schools to get the mean (preferably something more like quintiles) to check that we look reasonable wrt the actual school size range.
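
Something like this would do as a first sanity check, assuming each school object keeps a list of its allocated students:


import statistics

# Post-allocation check: mean and quintile boundaries of school sizes.
def school_size_summary(schools):
    sizes = [len(s.students) for s in schools]
    return {
        "mean": statistics.mean(sizes),
        "quintiles": statistics.quantiles(sizes, n=5),
    }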

Imperial Model Structure: Lower Resolution Population Density

Caveat: These are working blog posts with a lot of “thinking aloud”. I’m a computer science academic studying the engineering of agent based models, not an epidemiologist. Critique and comment welcomed.

While we’re waiting for the LandScan data, I was wondering if we could implement an alternative (if only to nail down some APIs and stuff). It’d be great to use other data sources, probably public census data which is likely to be lower resolution.

Peek ahead: I rather suspect that the 1km raster grid is built deeply into the Imperial model. (E.g., agents might move around a grid and community infection chance uses a spatial proximity model.) This would be a big change to our current implementation which uses a network model (i.e., people and places are related explicitly by links rather than implicitly by proximity).

However, I think we can defer the issue of how to represent things internally for a bit.

I’m going to make the assumption that we generally want to deal with the 1km grid in the generation algorithms. So we need to upsample/interpolate lower resolution data to get the finer grid.

To do this, we need a function that goes from a given 1km pixel to the associated admin area. Each admin area then assigns a fraction of its total population to that pixel. So something like:


pixels = load_pixels_with_coordinates()
for p in pixels:
    admin_area = admin_area_for(p)
    admin_area.assign_population_to(p)

assign_population_to should keep track of the assignments thus far and make sure not to go over.

We can use different assignment functions. The simplest would be some sort of uniform random assignment, i.e., each member of the population has an equal chance of living in a pixel. But we could use some other function such as equal number per pixel (very unrealistic) or some weighted function.
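
A minimal sketch of the uniform version, with the admin area and pixel attribute names being my assumptions: each member of the admin area’s population independently picks one of the area’s pixels.


import random

# Sketch: disaggregate an admin area's population uniformly over its pixels.
# For real population sizes you'd want a single multinomial draw rather than
# a Python loop, but this shows the idea.
def assign_population_uniformly(admin_area, area_pixels):
    for _ in range(admin_area.population):
        random.choice(area_pixels).population += 1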

Given that the ONS data has granular age distributions (e.g., by year and per admin unit) we probably should use that (though use a country wide one when doing a precise replication).

The resulting pixels would be indistinguishable from the LandScan data from an API perspective, so we can just use them as a drop-in replacement downstream.

I’m not sure what the implications are for using the lower resolution data. Admin units range in area and population, but I’m not yet sure how. They are hierarchical (e.g., Greater Manchester (Met County) is a Metropolitan County with 2,812,569 people while Manchester is a Metropolitan District with 547,627) so we could play a bit with that. We could compare the output of this generation algorithm to the LandScan data.

If I had to guess, I think this method (using random assignment) would result in fewer really dense areas (even taking into account that random assignment will produce clumps). This might have implications for spread, i.e., slow it down a bit.

On the other hand, using admin unit specific age distributions might be helpful for local authorities as it would get a better picture of the likely critical care burden.

Imperial Model Structure: Population Density

Caveat: These are working blog posts with a lot of “thinking aloud”. I’m a computer science academic studying the engineering of agent based models, not an epidemiologist. Critique and comment welcomed.

We need our four functions:

  1. Household generator
  2. School generator
  3. Workplace generator
  4. Person assigner (to households, schools, and workplaces)

All of these are very underspecified in Imperial Report 9:

In brief, individuals reside in areas defined by high-resolution population density data. Contacts with other individuals in the population are made within the household, at school, in the workplace and in the wider community. Census data were used to define the age and household distribution size. Data on average class sizes and staff-student ratios were used to generate a synthetic population of schools distributed proportional to local population density. Data on the distribution of workplace size was used to generate workplaces with commuting distance data used to locate workplaces appropriately across the population. Individuals are assigned to each of these locations at the start of the simulation.

It seems clear that the first step is understanding (and getting mastery of) the “high-resolution population density data”. It’s not in Report 9, so we need to look at the prior studies. The referenced papers:

5. Ferguson NM, Cummings DAT, Fraser C, Cajka JC, Cooley PC, Burke DS. Strategies for mitigating an influenza pandemic. Nature 2006;442(7101):448–52.

6. Halloran ME, Ferguson NM, Eubank S, et al. Modeling targeted layered containment of an influenza pandemic in the United States. Proc Natl Acad Sci U S A 2008;105(12):4639–44.

Both have supplementary data where you would expect to find the nitty-gritty details. Yay!

Let’s start with the supplementary data for 5. Population density is the very first thing!

Population density data The 2003 Landscan1 dataset prepared by Oakridge National Laboratory was used as the source of population density information for the continental US and GB. This dataset has a 30 arc second resolution, equating to approximately <1km for US and GB latitudes. The dataset is a model of population density which is constructed from census, remote sensing, land use and transport network data. It is a model of instantaneous population density (i.e. where people are at an instant of time) rather than residential population density, but is very comparable to high resolution census data for GB and US, and has the advantage of being rasterized – while census datasets are defined for populations within with irregularly shaped and variable size administrative units (e.g. census tracts).

Ok, we have a reference, and it seems to be to a standardish dataset. What does this dataset look like (from this description)? It is “rasterized”, which I take to mean that the map (of the UK; we’re just looking at the UK) is broken down into something like (square?) pixels with values (number of people? age stratified? otherwise stratified?) assigned to each pixel. And the “pixels” are about 1km (or a bit less? “approximately <1km” is a weird phrase. How is it different from “approximately 1km”? Is it weighted toward being smaller than 1km? I guess this will be square km?)

That’s an ok overview, but to code anything we need some details. We need to follow reference 1:

Oakridge National Laboratory. Landscan global population data http://www.ornl.gov/sci/gist/landscan (2003).

(Note: even if you don’t release your model code, releasing a version of this data would be nice…all packaged up! Or a good data description.)

It’s got a link! To nowhere:

Oy. Ok. Cool URLs Don’t Change but the Web has been super uncool for decades.

But the site works and the organisation exists and we have a critical keyword, “Landscan.” And we have joy:

ORNL’s LandScan™ is a community standard for global population distribution data.  At approximately 1 km (30″ X 30″) spatial resolution, it represents an ambient population (average over 24 hours) distribution.  The database is refreshed annually and released to the broader user community around October.

LandScan™ is now available at no cost to the educational community.  The latest LandScan™ dataset available is LandScan Global 2018.  Older LandScan Global data sets (LandScan 1998, 2000-2017) are available through site.   These data set can be licensed for commercial and other applications through multiple third-party vendors.

(No approximately less than nonsense HERE! ;))

We want the 2003 data for our replication. Probably? That’s what I got a reference for. Maybe they updated to more recent data. “Landscan” isn’t mentioned in Report 9.

CALLOUT: This seems to be an undocumented point. We don’t actually know which Landscan dataset they use. One would hope they used recent data! We could assume this, but I’m ok with starting with what’s documented. In our framework, we should make popping in arbitrary Landscan datasets super easy…as long as we have the computing power to run the simulations, at least.

CALLOUT: Oh how WordPress vexes me. I want to put notes in a marked container so I can retrieve them later. I don’t want to install a plugin but that “block” system is evil. So I’ll just use some text markers.

Ok! Getting to the 2003 dataset page is super easy. But you have to be a registered user to download the data. The registration form was simple enough. I got a confirmation email quickly, yay! I verified and then got the following email:

Your request has been queued for review.  This process may take anywhere from 3-5 business days.  Once approved, you will be able to log in and download the available LandScan datasets.  Please note that during this time you will be able to log into the site, but you will not be able to access the data.

Best regards,
LandScan Staff

Well that sucks. I mean, I’m sure they’ll approve it and they are surely busy, but it kinda stops me.

Though perhaps I can get some feel for the shape of the data! There should be some documentation. Ah! There’s an FAQ:

ESRI ArcView with the Spatial Analyst extension, ESRI ArcGIS with the Spatial Analyst extension, or ArcInfo will allow you to read and analyze the database files.

The data files can also be read by any software product that allows the user to import ESRI GRID-formatted data.

The data files are also provided the data in the ESRI Raster Binary format. Almost any programming language can read this file directly.
See also “What data formats are used?” below.

*crickets*

Ooookay. Yikes.

Quick, let’s see the data formats answer!

ESRI grid format. The archive contains grids for the world and each of the six continents, excluding Antarctica. Each of the archive files contains two folders, an ArcInfo GRID folder and an INFO folder. Both folders must be extracted from the archive file into one new folder.

ESRI binary format. The archive contains contains raster binary data for the world and a rich text formatted file that describes the binary format. The binary raster file format is a simple format that can be used to transfer raster data among various applications. Almost any programming language can read this file directly.

For both formats, downloaded files need to be uncompressed using a standard Zip utility (e.g., WinZip, PKZIP, etc.) before they can be imported to GIS or other software. Users should expect a substantial increase in the size of downloaded data after uncompression. For additional information see the ReadMe file for 2010.

(Ok, in my dauntage, I am still laughing that we have to look at a ReadMe file for a specific year that’s…not the first year in the dataset. Funny!)

Fortunately, we do have a bit of description of the data:

The dataset values are people per cell.

At this point my frustration often is expressed by my screaming “I hate you!” Rest assured, dear Reader, that this is just venting.

Ok ok…don’t panic. We don’t need the proprietary software (probably?). I’m a bit concerned that all the links in this critical FAQ are dead:

The links below contain instructions for calculating population density:

(In a weird way.)

But the other FAQ said “ESRI binary format. The archive contains contains raster binary data for the world and a rich text formatted file that describes the binary format. The binary raster file format is a simple format that can be used to transfer raster data among various applications. Almost any programming language can read this file directly.”

Of course, Wikipedia says:

An Esri grid is a raster GIS file format developed by Esri, which has two formats:

  1. A proprietary binary format, also known as an ARC/INFO GRID, ARC GRID and many other variations

  2. A non-proprietary ASCII format, also known as an ARC/INFO ASCII GRID

So maybe they meant that the binary format is the ASCII format?
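
If it does turn out to be the ASCII variant, that format is just a plain text file: a short header (ncols, nrows, corner coordinates, cell size, NODATA value) followed by rows of space-separated cell values. A minimal reader sketch, untested against the actual LandScan files and assuming the usual six header lines:


# Sketch of reading an ARC/INFO ASCII GRID file. Header keys are the
# standard ones (ncols, nrows, xllcorner, yllcorner, cellsize, NODATA_value);
# for LandScan the cell values would be people per cell.
def read_ascii_grid(path):
    header = {}
    with open(path) as f:
        for _ in range(6):
            key, value = f.readline().split()
            header[key.lower()] = float(value)
        rows = [[float(v) for v in line.split()] for line in f]
    return header, rows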

Ok. Ok. Ok.

Clearly, I need to wait for access to the data to see what’s really going on.

But I did glean some crucial stuff:

  1. We need to deal with a grid of 1km cells
  2. All we have from this is population count per cell.

I’m missing why population density is something which needs instructions or a special script. (Here’s where my lack of domain knowledge screws me.)

CALLOUT: Research population density calculations to understand potential issues.

From a demographic perspective, this isn’t helpful data. We just have counts, not even ages. They use census data for the rest:

Household size and age structure data Census data was used for age2,3 and household size distribution data4,5 for both the US and UK. We used a heuristic model6 to generate the ages of individuals in households which maintained generational age gaps with households while matching marginal household size and age distributions (Figure SI2). We assumed household size and age distributions did not vary geographically.

This still doesn’t get us to how they placed and populated households. And it’s weird that they generate household and age distributions that then need to be matched to the data.

Hmm…….Ok, I have an idea for an algorithm:


density_data = load_somehow(landscan2003)

households = list()
for location, people_count in density_data:
    # We're assuming that the snapshot location data
    # is where people live. So we need a household for each
    # person at a location.

    c = 0
    while c < people_count:
        # We need a new household.
        # gen_new_household uses the household size distribution
        # to pick a size for it.
        cur_home = gen_new_household(at=location)
        if (people_count - c) < cur_home.size:
            # Last household at this location: shrink it to fit.
            cur_home.size = people_count - c
        cur_home.add_people() # adds cur_home.size people
        c += cur_home.size
        households.append(cur_home)
return households

Obviously, “add_people” needs to use two things:

  1. Age distribution data (the model seems to ignore sex, race, and other demographic features)
  2. Household composition heuristics (no house of only kids; it’s probably less likely to have lots of households with 5 adults (?)).

I think you just do something similar…pick size people from the distribution. If it doesn’t make household sense, then adjust (redraw, or redraw from an “adult only” distribution).
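
As a concrete (and entirely guessed) sketch of that generate-and-reject idea, with the age distribution interface and the plausibility rules being my assumptions:


# Sketch: draw `size` ages, redraw if the composition looks implausible,
# and fall back to forcing an adult if we keep failing.
def draw_household_ages(size, age_dist, max_tries=100):
    for _ in range(max_tries):
        ages = [age_dist.sample() for _ in range(size)]
        if plausible_household(ages):
            return ages
    # Redraw the first member from an adult-only distribution as a fallback.
    return [age_dist.sample_adult()] + [age_dist.sample() for _ in range(size - 1)]

def plausible_household(ages, adult_age=18):
    # Crude rules: at least one adult, and not a big group house of adults.
    adults = sum(1 for a in ages if a >= adult_age)
    return 1 <= adults <= 4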

CALLOUT: Clearly, we could have more sophisticated demographics and household data. The most interesting ones are those which affect morbidity and mortality, e.g., being immunocompromised. Once we have the basic approach, adding these isn’t hard.

Ok! This makes sense. And it suggests how one might substitute other data for the Landscan.

It also has profound implications for our model. Our current model uses a “network style” model where places are objects that people are linked to. Here we might well need pathfinding to see where people go on a commute. The UK is around 240k km², which is a fair bit already even without 58 million or so people. We could use lower resolution density data or alternative datasets.

Progress!

Imperial Model Structure: Initialisation and Ambulatory Contact Model

Caveat: These are working blog posts with a lot of “thinking aloud”. I’m a computer science academic studying the engineering of agent based models, not an epidemiologist. Critique and comment welcomed.

In the study, the model is described over three sections in “Methods” comprising a scant three pages:

  1. Transmission Model
  2. Disease Progression and Healthcare Demand
  3. Non-Pharmaceutical Intervention Scenarios

1 and 2 are what we need in order to implement the non-intervention scenario (obviously). Getting this seems like a reasonable v1.0 for a replication.

In this post, we’ll just look at (part of) the transmission model. I’m going to work through it nearly line by line to start.

We modified an individual-based simulation model developed to support pandemic influenza planning5,6 to explore scenarios for COVID-19 in GB. The basic structure of the model remains as previously published.

“Individual-based simulation” and “agent-based simulation” are synonyms with the contrast class being “equation based” or “mathematical” models, i.e., systems of differential equations describing the aggregate behavior of the system. I won’t go into the differences, but just note that there are reported advantages and disadvantages of each paradigm.

Individual-based simulations are very much like SimCity or the Sims…we try to simulate the behavior of individual people in an environment. Thus, in order to replicate, we need to know:

  1. the demographics of the agents in the model
  2. the distribution of those agents in the model
  3. their behavior
  4. how the disease works on them (infection, disease course, etc)

That they modify an existing model described elsewhere suggests we might have to go elsewhere. But let’s see:

In brief, individuals reside in areas defined by high-resolution population density data. Contacts with other individuals in the population are made within the household, at school, in the workplace and in the wider community. Census data were used to define the age and household distribution size. Data on average class sizes and staff-student ratios were used to generate a synthetic population of schools distributed proportional to local population density. Data on the distribution of workplace size was used to generate workplaces with commuting distance data used to locate workplaces appropriately across the population. Individuals are assigned to each of these locations at the start of the simulation.

(We’re going to have to look elsewhere.)

We have 4 “contact locations” i.e., places where one agent might encounter (and thus infect or be infected by) another:

  1. home “Census data were used to define the age and household distribution size.”
  2. school “Data on average class sizes and staff-student ratios were used to generate a synthetic population of schools distributed proportional to local population density.”
  3. the workplace “Data on the distribution of workplace size was used to generate workplaces with commuting distance data used to locate workplaces appropriately across the population.”
  4. “in the wider community”

We need at least 4 initialisation functions: one generating households, one schools, one workplaces, and one generating agents and assigning them to the locations.

Now, if we were trying to replicate their modelling (as epidemiologists), this might be enough, because we’d have our own ideas of how to use census data to build these initialisation functions. Indeed, immediately, you might think that it’d be nice to replace a synthetic population of schools with a real population of schools. Indeed, for a simulation on the scale of Manchester, dumping the synthetic data for real data might allow us make more precise predictions or work out more specific interventions. But that’s not our goal here. We want to reimplement their model and this isn’t enough detail.

Transmission events occur through contacts made between susceptible and infectious individuals in either the household, workplace, school or randomly in the community, with the latter depending on spatial distance between contacts.

This gives us a big hint that we’ll need some sort of spatial relation between agents. This can make the implementation a bit complex and computationally expensive if we have to look “nearby” an agent over some radius and maintain coordinate information about each agent (and simulate their movement? at various speeds?). The rest of the paragraph  gives a tweak to the contact rates plus an observation of how that cashes out:

Per-capita contacts within schools were assumed to be double those elsewhere in order to reproduce the attack rates in children observed in past influenza pandemics7. With the parameterisation above, approximately one third of transmission occurs in the household, one third in schools and workplaces and the remaining third in the community. These contact patterns reproduce those reported in social mixing surveys8.

Ok! Per-capita contacts within schools are 2x those elsewhere…but what’s the rate elsewhere? This seems like a ‘modification to an existing model’ report, but we want the original model details!

The rest of the transmission model section discusses infectiousness (how likely an infectious person is likely to transmit per encounter) and seeding (how many infectious people we start with and where). These are important, but until we get the distribution of people and places as well as how contacts are made, it’s a bit moot.

The problem is that the specific generative functions for place distribution, people distribution, and contact generation are not given. There doesn’t seem to be supplementary material given directly on the report’s page. And, alas, on the resources page there’s nothing for report 9.

Well, that’s a bit of a bummer. I’ll note again that this doesn’t necessarily mean that this work isn’t replicable even if this is all we have! It’s just a bunch of work and not the thing I’m trying to do.

We have to hope that the details of the initialisation and contact functions are available in the referenced work:

5. Ferguson NM, Cummings DAT, Fraser C, Cajka JC, Cooley PC, Burke DS. Strategies for mitigating an influenza pandemic. Nature 2006;442(7101):448–52.

6. Halloran ME, Ferguson NM, Eubank S, et al. Modeling targeted layered containment of an influenza pandemic in the United States. Proc Natl Acad Sci U S A 2008;105(12):4639–44.

We’ll see, as we delve into these, that we’ll probably have to jump at least one more citation level.

First Look at the Imperial Model

Caveat: These are working blog posts with a lot of “thinking aloud”. I’m a computer science academic studying the engineering of agent based models, not an epidemiologist. Critique and comment welcomed.

My starting point is the 16 March 2020 report from the Imperial College COVID-19 Response Team entitled “Impact of non-pharmaceutical interventions (NPIs) to reduce COVID-19 mortality and healthcare demand”. They have other variants and targets, but this is a great target. Successful replication will enable replicating other work. And it’s an important one to validate! But it may be outdated in a number of ways. That’s ok. We’re doing some of the slower bits of science.

The core background assumptions are that:

  1. There is a novel, fairly infectious flu like pathogen causing a fairly serious disease
  2. There’s no prior immunity in the population (follows from the novelty!)
  3. There’s no existing vaccine or disease reducing treatment (also follows from the novelty!)

The fundamental scenario is the no-intervention case. Beyond that, they consider a number of different policies (see table 2) including various degrees of social distancing.

The core outcomes are

  1. degree and timing of peak infection,
  2. total infection (over the whole epidemic),
  3. total mortality (directly due to COVID19),
  4. timing and degree of healthcare saturation (e.g., when critical cases exceed available beds).

By comparing these outcomes in different scenarios, we can get a sense of whether a given policy is acceptable.

And we have the central and dramatic outcome:

In the (unlikely) absence of any control measures or spontaneous changes in individual behaviour, we would expect a peak in mortality (daily deaths) to occur after approximately 3 months (Figure 1A). In such scenarios, given an estimated R0 of 2.4, we predict 81% of the GB and US populations would be infected over the course of the epidemic…In total, in an unmitigated epidemic, we would predict approximately 510,000 deaths in GB and 2.2 million in the US, not accounting for the potential negative effects of health systems being overwhelmed on mortality.

For an uncontrolled epidemic, we predict critical care bed capacity would be exceeded as early as the second week in April, with an eventual peak in ICU or critical care bed demand that is over 30 times greater than the maximum supply in both countries (Figure 2).

That’s a lot of dead people in a very short time. Fortunately, we’re not (in the UK) living through the scenario where “critical care bed capacity would be exceeded as early as the second week in April” due, almost certainly, to moving to fairly extreme lockdown. In their simulations, you can see the proverbial flattening of the curve with each more extreme approach:

What’s weird about this chart is that it doesn’t contain the most extreme intervention…social distancing of the entire population. I think this is because these all fall under the “mitigation” not the “suppression” strategy? Yes:

Given that mitigation is unlikely to be a viable option without overwhelming healthcare systems, suppression is likely necessary in countries able to implement the intensive controls required. Our projections show that to be able to reduce R to close to 1 or below, a combination of case isolation, social distancing of the entire population and either household quarantine or school and university closure are required (Figure 3, Table 4).

The UK did expand its critical care facilities, but probably not 8-fold. Thus, our current…”success”…ok, it is a success in the sense that we managed not to end up with a serious shortage of critical care beds…seems probably due to the lockdown. It’s certainly congruent with the Imperial modelling.

So, at the moment, we can think of the model as a black box (with a “GB” or “US” label on it; I’m going to focus on the “GB”one). We pick a policy strategy plus some parameters about the disease and turn the crank. Out pops predictions of the key outcomes. (This is a highly simplified overview, but it’s useful.)

Ideally, if we rebuild the black box from scratch and run it with the same strategy and parameters, we’ll get similar results. It’s not immediately obvious “how similar” we need to be in order to claim that we replicated their results. Their model has a fair bit of randomness in it, so even if you had their code and reran it, you could get slightly different numbers. The precise numbers don’t matter as much as the general shape of the outcomes and what policies they suggest. I’m going to put a pin in this for the moment, but I am going to pre-commit to a specific similarity to their outcomes as my success metric. It might turn out to be the wrong metric, and I’m certainly going to make it fairly broad. But the idea of pre-committing is to help avoid confirmation bias and post-hoc rationalisation of results. Ideally, if we hit the pre-commit targets, then that strengthens our case.

Time for bed…

Replicating “The Imperial Study”

I have several students at various levels working on agent based modelling for biology esp epidemiology. Obviously, the COVID19 study by Ferguson et al at Imperial is probably the single most consequential ABM in history given that it seems to have flipped the UK Government’s response from “let it spread” to “fairly strong lockdown”. (It also probably affected the Trump administration’s response.)

Ferguson and his team have been modelling novel flu-like pandemics for a long time. They seem to be good folks and world class epidemiologists. However, there’s this:

I believe that the code is coming out, but this seemed like a good time to attempt an independent replication from their published work.

One of the things I’m studying is how models are shared, communicated, and interpreted. We’ve already looked at dozens of papers and attempted some replications. It ain’t easy!

Thus, we have an interesting case study with both first order benefits (having an independent recreation of their model appropriately raises confidence in their conclusions, if congruent) and second order ones (we’ll test out our homegrown framework and learn a lot about what needs to be shared in order to replicate a model). We’re replicating in Python, as that’s what our homegrown framework (Panaxea) is written in. That should make the model more broadly accessible.

Two important caveats:

  1. The starting team does not have any formally trained epidemiologists. We’re, at best, epidemiology adjacent computer scientists. Our focus is not redoing any of the data derivation or thinking through of assumptions etc. but on replicating the given model. I hope we’ll get some epidemiologists on board, of course! But we’re still working out the software engineering and computer sciencey aspects.
  2. In the current climate, we have to be super careful about presenting results or even sharing a “more easy to manipulate” model. I think we will be that careful, but it’s tricky and obviously we can’t control everything.

I will say that it’s not immediately obvious how to replicate the Imperial model. Or models, I should say. Consider this paragraph from the key report:

We modified an individual-based simulation model developed to support pandemic influenza planning5,6 to explore scenarios for COVID-19 in GB. The basic structure of the model remains as previously published.

It’s not at all clear that this report is self-contained, i.e., that you can fully understand the model (up to being able to replicate it) without spelunking into their prior work…maybe quite a bit. (Indeed, that’s what I believe currently.) That’s not ideal for replication (though probably sufficient for many publication purposes).

In this series, I’ll detail my attempts to read this literature and extract a model structure, parameters, key algorithms, and evaluation criteria. (This is also being done by some of my students, but I’m finding I can’t follow things well enough without writing it out.) These are working posts, not final ones, so there may be all sorts of problems and nothing written is intended to be definitive. Corrections and suggestions gratefully welcomed.

We haven’t yet opened up our Github repos as we sorta wanted to get to a reasonable place first. I expect to do something by mid May.

Alright, this was the intro. I’m going to make lots of posts on this topic instead of a big monster one.

All About Blackboard Collaborate

I spent the weekend playing with Blackboard Collaborate and successfully held a Y1 tutorial and Y3 project group meeting in it. It was fine!

It can work well for lectures, allowing you to stream your lecture with a recording made from anywhere (no need to be in a lecture theatre). The recording can be downloaded.

Students just need a reasonable browser and decent connection. If they can Skype, they can do this.

They don’t need to be logged in to Blackboard to join in.

I’ve made a starter video, to show you how to expose it:

More coming! I’ll add them here.

I can give you access to a session for you to play around with it and can do demos. Email me (UoM staff and students first, please).