December 15, 2014

Just a weird little thing.

When scraping data from a paper (or any source), I grab both the numbers they give including derived numbers and try to rederive the numbers. This provides a couple of sanity checks (e.g., that my scraping was accurate) and gives me their “model” (even if it’s trivial).

Of course, you find stuff!

For example, I’m scraping the breakdown of a population across categories. The size of the population is 73,538 and they give both the number and the percentage in the breakdown. Thus, it’s trivial to rederive the percentage. So that’s what I do, but then I get four values that are off by one:

Paper Rederived
17.9 17.8
59.7 59.6
14.8 14.7
2.6 2.5

So, an off by one error. GRRR! Clearly this is a rounding problem, and looking at the unrounded results confirms this:

Paper Rederived Unrounded
17.9 17.8 17.8492752
59.7 59.6 59.64807311
14.8 14.7 14.74883734
2.6 2.5 2.548342354

Excel is doing the “right” thing here: It only looks at the digit before the target digit. Of course, this sort of rounding is not equivalent to the fixed point iteration (i.e., if I round(17.849) to 2 places I get 17.85 and if I round that to 1 place I get 17.9). But it’s far more common to do things the “right” way. (And it makes a lot of sense.)

What confuses me is how the heck did the paper get the iterated rounding version? Is there software out there that does it that way? My spot checking of Excel, Google Spreadsheets, and Python all yield the same behaviour.

Is this a big deal? Well, obviously not. Arguably, I don’t care about what’s beyond the decimal for these purposes and nothing about these differences is critical — or even  marginally relevant — for the paper’s results. However, it is an interopt and validation problem. What should have been two seconds took me 20 minutes. And what’s this software doing this weird rounding? Is it causing problems elsewhere?

%d bloggers like this: