Caveat: These are working blog posts with a lot of “thinking aloud”. I’m a computer science academic studying the engineering of agent based models, not an epidemiologist. Critique and comment welcomed.
We need our four functions:
- Household generator
- School generator
- Workplace generator
- Person assigner (to households, schools, and workplaces)
All of these are very underspecified in Imperial Report 9:
In brief, individuals reside in areas defined by high-resolution population density data. Contacts with other individuals in the population are made within the household, at school, in the workplace and in the wider community. Census data were used to define the age and household distribution size. Data on average class sizes and staff-student ratios were used to generate a synthetic population of schools distributed proportional to local population density. Data on the distribution of workplace size was used to generate workplaces with commuting distance data used to locate workplaces appropriately across the population. Individuals are assigned to each of these locations at the start of the simulation.
It seems clear that the first step is understanding (and getting mastery of) the “high-resolution population density data”. It’s not in Report 9, so we need to look at prior studies. The two referenced papers:
5. Ferguson NM, Cummings DAT, Fraser C, Cajka JC, Cooley PC, Burke DS. Strategies for mitigating an influenza pandemic. Nature 2006;442(7101):448–52.
6. Halloran ME, Ferguson NM, Eubank S, et al. Modeling targeted layered containment of an influenza pandemic in the United States. Proc Natl Acad Sci U S A 2008;105(12):4639–44.
Both have supplementary data, which is where you would expect to find the nitty-gritty details. Yay!
Let’s start with the supplementary data for 5. Population density is the very first thing!
Population density data The 2003 Landscan1 dataset prepared by Oakridge National Laboratory was used as the source of population density information for the continental US and GB. This dataset has a 30 arc second resolution, equating to approximately <1km for US and GB latitudes. The dataset is a model of population density which is constructed from census, remote sensing, land use and transport network data. It is a model of instantaneous population density (i.e. where people are at an instant of time) rather than residential population density, but is very comparable to high resolution census data for GB and US, and has the advantage of being rasterized – while census datasets are defined for populations within with irregularly shaped and variable size administrative units (e.g. census tracts).
Ok, we have a reference, and it seems to be to a standardish dataset. What does this dataset look like (from this description)? It is “rasterized”, which I take to mean that the map (of the UK; we’re just looking at the UK) is broken down into something like (square?) pixels, with values (number of people? age stratified? otherwise stratified?) assigned to each pixel. And the “pixels” are about 1km across (or a bit less? “approximately <1km” is a weird phrase. How is it different from “approximately 1km”? Is it weighted toward being smaller than 1km? I guess each cell will be about a square km?)
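To pin down what I think I’m expecting (this is purely my reading of that paragraph, not anything from the dataset itself), something like a grid of cells, each carrying a people count:

from dataclasses import dataclass

@dataclass
class Cell:
    lat: float          # cell centre latitude
    lon: float          # cell centre longitude
    people_count: int   # people the model places in this roughly 1km cell

The whole dataset would then be, conceptually, a big collection of these cells covering the UK.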
That’s an ok overview, but to code anything we need some details. We need to follow reference 1:
Oakridge National Laboratory. Landscan global population data http://www.ornl.gov/sci/gist/landscan (2003).
(Note: Even if you don’t release your model code, releasing a version of this data would be nice…all packaged up! Or at least a good data description.)
It’s got a link! To nowhere:
Oy. Ok. Cool URLs Don’t Change but the Web has been super uncool for decades.
But the site works and the organisation exists and we have a critical keyword, “Landscan.” And we have joy:
ORNL’s LandScan™ is a community standard for global population distribution data. At approximately 1 km (30″ X 30″) spatial resolution, it represents an ambient population (average over 24 hours) distribution. The database is refreshed annually and released to the broader user community around October.
LandScan™ is now available at no cost to the educational community. The latest LandScan™ dataset available is LandScan Global 2018. Older LandScan Global data sets (LandScan 1998, 2000-2017) are available through site. These data set can be licensed for commercial and other applications through multiple third-party vendors.
(No approximately less than nonsense HERE! ;))
We want the 2003 data for our replication. Probably? That’s what I got a reference for. Maybe they updated to more recent data. “Landscan” isn’t mentioned in Report 9.
CALLOUT: This seems to be an undocumented point. We don’t actually know which Landscan dataset they use. One would hope they used recent data! We could assume this, but I’m ok with starting with what’s documented. In our framework, we should make popping in arbitrary Landscan datasets super easy…as long as we have the computing power to run the simulations, at least.
CALLOUT: Oh how WordPress vexes me. I want to put notes in a marked container so I can retrieve them later. I don’t want to install a plugin but that “block” system is evil. So I’ll just use some text markers.
Ok! Getting to the 2003 dataset page is super easy. But you have to be a registered user to download the data. The registration form was simple enough. I got a confirmation email quickly, yay! I verified and then got the following email:
Your request has been queued for review. This process may take anywhere from 3-5 business days. Once approved, you will be able to log in and download the available LandScan datasets. Please note that during this time you will be able to log into the site, but you will not be able to access the data.
Well that sucks. I mean, I’m sure they’ll approve it and they are surely busy, but it kinda stops me.
Though perhaps I can get some feel for the shape of the data! There should be some documentation. Ah! There’s an FAQ:
ESRI ArcView with the Spatial Analyst extension, ESRI ArcGIS with the Spatial Analyst extension, or ArcInfo will allow you to read and analyze the database files.
The data files can also be read by any software product that allows the user to import ESRI GRID-formatted data.
The data files are also provided the data in the ESRI Raster Binary format. Almost any programming language can read this file directly.
See also “What data formats are used?” below.
Quick, let’s see the data formats answer!
ESRI grid format. The archive contains grids for the world and each of the six continents, excluding Antarctica. Each of the archive files contains two folders, an ArcInfo GRID folder and an INFO folder. Both folders must be extracted from the archive file into one new folder.
ESRI binary format. The archive contains contains raster binary data for the world and a rich text formatted file that describes the binary format. The binary raster file format is a simple format that can be used to transfer raster data among various applications. Almost any programming language can read this file directly.
For both formats, downloaded files need to be uncompressed using a standard Zip utility (e.g., WinZip, PKZIP, etc.) before they can be imported to GIS or other software. Users should expect a substantial increase in the size of downloaded data after uncompression. For additional information see the ReadMe file for 2010.
(Ok, even in my daunted state, I am still laughing that we have to look at a ReadMe file for a specific year that’s…not the first year in the dataset. Funny!)
Fortunately, we do have a bit of description of the data:
The dataset values are people per cell.
At this point my frustration often is expressed by my screaming “I hate you!” Rest assured, dear Reader, that this is just venting.
Ok ok…don’t panic. We don’t need the proprietary software (probably?). I’m a bit concerned that all the links in this critical FAQ are dead:
The links below contain instructions for calculating population density:
(In a weird way.)
But the other FAQ answer said the ESRI binary raster format “is a simple format that can be used to transfer raster data among various applications. Almost any programming language can read this file directly.”
Of course, Wikipedia says:
An Esri grid is a raster GIS file format developed by Esri, which has two formats:
A proprietary binary format, also known as an ARC/INFO GRID, ARC GRID and many other variations
A non-proprietary ASCII format, also known as an ARC/INFO ASCII GRID
So maybe they meant that the binary format is the ASCII format?
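If it does turn out to be the non-proprietary ASCII variant, we shouldn’t need any ESRI software at all. A rough sketch of a reader (I haven’t seen the actual files yet, so the header field names here are assumptions based on the standard ARC/INFO ASCII GRID layout):

def read_ascii_grid(path):
    """Parse an ARC/INFO ASCII GRID file into a header dict plus rows of
    cell values (which, for LandScan, should be people per cell)."""
    header, rows = {}, []
    with open(path) as f:
        for line in f:
            parts = line.split()
            if not parts:
                continue
            # Header lines look like "ncols 43200"; everything after the
            # header is whitespace-separated cell values, row by row.
            # (Assuming one grid row per line; if long rows wrap, we would
            # need to reshape using ncols afterwards.)
            if parts[0].lower() in ("ncols", "nrows", "xllcorner",
                                    "yllcorner", "cellsize", "nodata_value"):
                header[parts[0].lower()] = float(parts[1])
            else:
                rows.append([float(v) for v in parts])
    return header, rows

Cells holding the nodata value (sea, presumably) would get filtered out before we generate any households.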
Ok. Ok. Ok.
Clearly, I need to wait for access to the data to see what’s really going on.
But I did glean some crucial stuff:
- We need to deal with a grid of 1km cells
- All we have from this is population count per cell.
I’m missing why population density is something which needs instructions or a special script. (Here’s where my lack of domain knowledge screws me.)
CALLOUT: Research population density calculations to understand potential issues.
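One guess (and it’s only a guess until I can see the data): the cells are 30 arc-seconds on a side, so their ground area shrinks with latitude, and turning a per-cell count into people per km² needs a little geometry rather than a constant divisor. Something like:

import math

CELL_DEG = 1.0 / 120.0        # 30 arc-seconds, in degrees
KM_PER_DEG = 111.32           # rough spherical-Earth kilometres per degree

def cell_area_km2(latitude_deg):
    """Approximate ground area of a 30-arc-second cell centred at this latitude."""
    height_km = CELL_DEG * KM_PER_DEG
    width_km = CELL_DEG * KM_PER_DEG * math.cos(math.radians(latitude_deg))
    return height_km * width_km

def density_per_km2(people_count, latitude_deg):
    return people_count / cell_area_km2(latitude_deg)

(This might also explain the “approximately <1km” phrasing earlier: at UK latitudes a 30″ cell is noticeably narrower than 1km east to west.)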
From a demographic perspective, this isn’t helpful data. We just have counts, not even ages. They use census data for the rest:
Household size and age structure data Census data was used for age2,3 and household size distribution data4,5 for both the US and UK. We used a heuristic model6 to generate the ages of individuals in households which maintained generational age gaps with households while matching marginal household size and age distributions (Figure SI2). We assumed household size and age distributions did not vary geographically.
This still doesn’t get us to how they placed and populated households. And it’s a bit odd that they generate household and age distributions which then need to be matched back to the data.
Hmm…….Ok, I have an idea for an algorithm:
density_data = load_somehow(landscan2003)
households = list()
for location, people_count in density_data:
    # We're assuming that the snapshot location data
    # is where people live. So we need a household for each
    # person at a location.
    c = 0
    while c < people_count:
        # We need a new household. We use the household size
        # distribution to pick a household size.
        cur_home = gen_new_household(at=location)
        if (people_count - c) < cur_home.size:
            # Last household at this location: shrink it to fit
            # the people who are left.
            cur_home.size = people_count - c
        c += cur_home.size
        cur_home.add_people()  # adds size people
        households.append(cur_home)
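gen_new_household really only needs the household size distribution (the census data from references 4 and 5, which I don’t have in hand yet). A rough sketch of what I have in mind; here I pass the distribution in explicitly, and the Household class is just my own placeholder:

import random
from dataclasses import dataclass, field

@dataclass
class Household:
    location: tuple
    size: int
    ages: list = field(default_factory=list)

def gen_new_household(at, size_distribution):
    """Draw a household size from a census-derived size distribution,
    e.g. {1: 0.29, 2: 0.35, 3: 0.16, 4: 0.13, 5: 0.07} (made-up weights)."""
    sizes, weights = zip(*size_distribution.items())
    return Household(location=at, size=random.choices(sizes, weights=weights)[0])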
Obviously, “add_people” needs to use two things:
- Age distribution data (the model seems to ignore sex, race, and other demographic features)
- Household composition heuristics (no household made up only of kids; and households of, say, five adults are presumably rarer than a raw draw would suggest).
I think you just do something similar…pick size people from the distribution. If it doesn’t make household sense, then adjust (redraw, or redraw from an “adult only” distribution).
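Here’s a minimal sketch of that draw-and-redraw idea (the adult cutoff, the redraw cap, and writing add_people as a free function rather than a method are all my own placeholder choices):

import random

ADULT = 18  # placeholder cutoff for "this household has a grown-up in it"

def sample_age(age_distribution):
    ages, weights = zip(*age_distribution.items())
    return random.choices(ages, weights=weights)[0]

def add_people(household, age_distribution, max_redraws=100):
    # Draw ages for the whole household; redraw if the mix makes no
    # household sense (here: nobody is an adult).
    for _ in range(max_redraws):
        ages = [sample_age(age_distribution) for _ in range(household.size)]
        if any(age >= ADULT for age in ages):
            household.ages = ages
            return
    # Give up redrawing: force one adult from an adult-only slice of the
    # distribution and fill the rest from the full distribution.
    adult_dist = {a: w for a, w in age_distribution.items() if a >= ADULT}
    household.ages = [sample_age(adult_dist)] + [
        sample_age(age_distribution) for _ in range(household.size - 1)
    ]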
CALLOUT: Clearly, we could have more sophisticated demographics and household data. The most interesting ones are those which affect morbidity and mortality, e.g., being immunocompromised. Once we have the basic approach, adding these isn’t hard.
Ok! This makes sense. And it suggests how one might substitute other data for the Landscan.
It also has profound implications for our model. Our current model uses a “network style” approach where places are objects that people are linked to. Here we might well need pathfinding to see where people go in a commute. The UK is around 240k km², which is a fair number of cells before we even add the 58 million or so people. We could use lower-resolution density data or alternative datasets.
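A back-of-envelope number for how many cells we’re talking about, using a rough UK bounding box (roughly 50–61°N, 8°W–2°E; my approximation, and much of that box is sea):

CELLS_PER_DEGREE = 120                     # 30 arc-second cells
rows = (61 - 50) * CELLS_PER_DEGREE        # latitude span  -> 1320 rows
cols = (2 - (-8)) * CELLS_PER_DEGREE       # longitude span -> 1200 columns
print(rows * cols)                         # ~1.6 million cells to hold in memory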