Data and Method Sharing: Unnecessary sleuthing

December 8, 2014

Disclaimer: I do not endorse any inferences from my complaining here to “the paper authors suck”. In fact, I think all these paper authors are very cool and I value their work a lot. Stuff happens. Any venting that happens in this post is just that, venting, not considered judgement.

So, I’m trying to put together a paper about how computer-based risk assessment tools that use score-based risk assessment schemes unnecessarily homogenize risk strata, and in a bad way. (I’m coming more and more to the conclusion that score-based risk stratification is a pretty bad idea.) I have some good preliminary observations, but now I want to demonstrate how bad it can get. To do this, I’m trying to replicate the methods behind one tool both exactly (i.e., do what they did) and using a more granular set of probabilities (new contribution). But I’m running into problems.

First, no one, as far as I can tell, provides R scripts or spreadsheets of their data and calculations. Grrr. CSVs would be fine at this point. Ok, this means I’m pulling data out of PDFs. Such fun!

Second, and more importantly, I’m trying to pull data out of this paper’s PDF the same way the other paper did. And even in the supplement, it’s not clear what’s going on. Here’s an example:

We derived a theoretical annual risk of stroke without treatment by adjusting SSE [stroke and systemic embolism] rates from a large cohort (n = 73 538) of ‘real world’ patients in the Danish National Patient Registry [13] who have non-valvular AF and were not treated with warfarin. We chose to utilize the 10-year follow-up data from this study because the one-year follow-up data may overestimate the true event rate, given that all of the patients who were included in the analysis had recently been admitted to hospital. The rates in the Danish non-OAC cohort were adjusted to account for antiplatelet use within each group, assuming that antiplatelet use confers a 22% relative risk (RR) reduction [2]. The adjusted rates were calculated by prorating the patient years according to the percentage of patients within each group who were taking anti-platelets, and then dividing the number of events by the adjusted patient years. The reported and adjusted annual risks of SSE by the CHA2DS2-VASc score are shown in Figure 1A. The event rates and exact confidence intervals were calculated independently for each score assuming a Poisson distribution. [bolds added]

So, we need:

  1. Find the SSE rate from the 10-year follow-up data.
  2. Find the percentage of patients in each group who were on anti-platelets.
  3. Do some calculations.
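
The three steps above can be sketched in Python under one reading of the quoted method. This is a guess at what "prorating the patient years" means, not a confirmed reconstruction of the paper's calculation: each patient-year from an antiplatelet user is counted as (1 − 0.22) of an untreated patient-year, and the events are then divided by the shrunken total. The function names and the example inputs are hypothetical. The exact Poisson interval is found by bisection so that the sketch needs only the standard library.

```python
import math

ANTIPLATELET_RRR = 0.22  # relative risk reduction assumed in the quoted passage


def adjusted_patient_years(patient_years, antiplatelet_fraction, rrr=ANTIPLATELET_RRR):
    """Shrink patient-years to 'untreated-equivalent' patient-years.

    Assumption: antiplatelet users contribute (1 - rrr) of an untreated
    patient-year each; everyone else contributes a full one.
    """
    return patient_years * (1.0 - antiplatelet_fraction * rrr)


def _poisson_cdf(k, mu):
    """P(X <= k) for X ~ Poisson(mu), with each term computed in log space."""
    if k < 0:
        return 0.0
    terms = (math.exp(i * math.log(mu) - mu - math.lgamma(i + 1)) for i in range(k + 1))
    return min(1.0, math.fsum(terms))


def _bisect(f, lo, hi, iterations=200):
    """Find a root of a monotone function f on [lo, hi] by bisection."""
    f_lo = f(lo)
    for _ in range(iterations):
        mid = (lo + hi) / 2.0
        f_mid = f(mid)
        if (f_mid > 0) == (f_lo > 0):
            lo, f_lo = mid, f_mid
        else:
            hi = mid
    return (lo + hi) / 2.0


def exact_poisson_interval(events, alpha=0.05):
    """Exact (Garwood) confidence interval for a Poisson mean, given a count."""
    search_ceiling = events + 10.0 * math.sqrt(events + 1.0) + 10.0
    if events == 0:
        lower = 0.0
    else:
        # Lower bound solves P(X >= events | mu) = alpha / 2.
        lower = _bisect(lambda mu: 1.0 - _poisson_cdf(events - 1, mu) - alpha / 2,
                        1e-12, float(events))
    # Upper bound solves P(X <= events | mu) = alpha / 2.
    upper = _bisect(lambda mu: _poisson_cdf(events, mu) - alpha / 2,
                    max(float(events), 1e-12), search_ceiling)
    return lower, upper


def adjusted_rate_per_100py(events, patient_years, antiplatelet_fraction, alpha=0.05):
    """Return (rate, lower, upper) per 100 adjusted patient-years."""
    adj_py = adjusted_patient_years(patient_years, antiplatelet_fraction)
    lower, upper = exact_poisson_interval(events, alpha)
    scale = 100.0 / adj_py
    return events * scale, lower * scale, upper * scale


# Hypothetical inputs for one score group (NOT taken from either paper).
rate, low, high = adjusted_rate_per_100py(events=10, patient_years=1000.0,
                                          antiplatelet_fraction=0.347)
print(f"{rate:.2f} ({low:.2f} to {high:.2f}) per 100 patient-years")
```

Everything here hinges on having events, patient-years, and the antiplatelet fraction for each score group, which is exactly the data that is hard to locate.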

Ok, but where in the source paper is this data? They have 8 tables in there! Would it kill you to mention the particular table you used? Yes, I probably can figure it out, but why should I have to? Anyway, I look at Table 1, “Baseline characteristics of patients. Values are numbers (percentages)”. First hit, there’s a column with n=73 538, which is the n in the quote above! Woohoo! Table 1 also has a row, “Antiplatelet drug”, with the value “25 503 (34.7)” in the appropriate place. Yay! Except Boo! because I don’t have a number to cross-check this against. Plus, this is a value for the cohort as a whole, not “for each group”, by which I infer that a group is a set of people with the same CHA2DS2-VASc score. But it’s highly unlikely that antiplatelet therapy is evenly distributed over these scores. It could be that the sicker (more risk factors) groups have more people on antiplatelet therapy. Perhaps CHA2DS2-VASc=0 will have only 20% on aspirin. Or it could be the other way around.

Of course, for my purposes it doesn’t matter since I just want to replicate. But I find this line ambiguous. I don’t like concluding that they just used 34.7% across every group from the fact that that’s the only number I could find. Tell me, please!
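
To get a feel for how much the choice matters, here is a small sketch of the multiplier the adjustment applies to a group's observed rate. It assumes (this is one reading, not something the paper confirms) that prorating means scaling antiplatelet users' patient-years by 1 − 0.22. Only the 34.7% comes from Table 1; the 20% and 50% fractions are made up for illustration, and `adjustment_factor` is a hypothetical helper name.

```python
RRR = 0.22  # relative risk reduction from antiplatelets, per the quoted passage


def adjustment_factor(antiplatelet_fraction, rrr=RRR):
    """Multiplier turning an observed rate into a 'no antiplatelet' rate.

    Assumes events are divided by patient-years scaled by
    (1 - antiplatelet_fraction * rrr).
    """
    return 1.0 / (1.0 - antiplatelet_fraction * rrr)


# 0.347 is the cohort-wide value from Table 1; 0.20 and 0.50 are hypothetical.
for fraction in (0.20, 0.347, 0.50):
    factor = adjustment_factor(fraction)
    print(f"{fraction:.1%} on antiplatelets -> observed rate x {factor:.3f}")
```

Under this reading the multiplier moves from roughly 1.05 to roughly 1.12 across those fractions, so a per-group percentage and a cohort-wide one would give visibly different adjusted rates.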

Now, I still need the event rate by CHA2DS2-VASc score from the 10-year follow-up. Enter Table 2, “Event rate (95% CI) of hospital admission and death due to thromboembolism* per 100 person years”.

Dear reader, you may think I’m being a bit of a lazybones. Table 1 and Table 2 don’t seem that hard to find. Yes, but finding isn’t the same as being told.

In any case, it’s found! Now I do a quick cross check:

CHA2DS2-VASc   Table 2   Figure 1A   Figure 1A
0                 0.66         0.6         0.6
1                 1.45         1.2         1.3
2                 2.92         2.6         2.8
3                 4.28         3.9         4.2
4                 6.46         6.0         6.6
5                 9.97         9.4        10.5
6                12.52        11.6        13.2
7                13.96        13.0        15.0
8                14.10        13.2        15.4
9                15.89        13.9        15.9

Join in my confusion!

So Table 2, by itself, is not the direct source of the numbers in Figure 1A. Some munging has been done. Hint: The units differ. Table 2 is events per 100 person-years and Figure 1A is annual risk. I don’t believe this is a simple translation, frankly. I mean, I think I have a way to do a calculation, and maybe I’ll get the numbers in Figure 1A, but I think it’s a bit dodgy statistically (though what do I know). In any case, TELLING ME WHAT YOU DID would make my life easier. The supplement just has the same paragraph 😦
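
For concreteness, here is one textbook way to turn a rate into an annual risk, assuming a constant hazard over the year: risk = 1 − exp(−rate). Nothing in the paper confirms that this is what was done, and it is not necessarily the calculation hinted at above; it is only a sketch for checking whether a plain unit conversion could explain the gap. The numbers are copied from the cross-check table, and the names `ROWS` and `rate_to_annual_risk_pct` are made up for this example.

```python
import math

# (score, Table 2 rate per 100 person-years, Figure 1A first value, Figure 1A second value)
ROWS = [
    (0, 0.66, 0.6, 0.6),
    (1, 1.45, 1.2, 1.3),
    (2, 2.92, 2.6, 2.8),
    (3, 4.28, 3.9, 4.2),
    (4, 6.46, 6.0, 6.6),
    (5, 9.97, 9.4, 10.5),
    (6, 12.52, 11.6, 13.2),
    (7, 13.96, 13.0, 15.0),
    (8, 14.10, 13.2, 15.4),
    (9, 15.89, 13.9, 15.9),
]


def rate_to_annual_risk_pct(rate_per_100py):
    """Convert events per 100 person-years to a one-year risk in percent.

    Assumes a constant hazard, so risk = 1 - exp(-rate * 1 year).
    """
    return 100.0 * (1.0 - math.exp(-rate_per_100py / 100.0))


print("score  table2  converted  fig1A_a  fig1A_b")
for score, table2, fig_a, fig_b in ROWS:
    converted = rate_to_annual_risk_pct(table2)
    print(f"{score:>5}  {table2:>6.2f}  {converted:>9.2f}  {fig_a:>7.1f}  {fig_b:>7.1f}")
```

With these inputs the conversion lands near the first Figure 1A value for a couple of the higher scores but not for the lower ones (for score 1, 1.45 converts to about 1.44, against 1.2 and 1.3 in Figure 1A), so a unit conversion alone does not account for the differences.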

Update 1

It’s definitely not derived from Table 2 nor is the antiplatelet cohort calculated. The supplement contains a table with particular numbers for the antiplatelet cohort and the total number of incidents. They also have raw numbers of patient years broken down by category. That’s good data, but not in the original paper. Time to look for the other data supplement!

Update 2

Alas, the other supplement does not have this data either. I’ll do a few more searches before contacting authors.

I’ll note that this discrepancy is still worrisome. Presumably the rates should come out the same if the calculations are done in a similar manner.

Update 3

Contact the author time! There’s an overlapping author, thus I think they used data directly and made some slight changes in method. That’s the only thing that makes sense to me right now.


My tentative thought is that I can’t repeat the derivation in the second paper as described based on the data published in the first paper. I could perhaps do a replication, but some bits will be inherently lower quality. (I.e., not just different, but worse. E.g., the antiplatelet adjustment.) I can’t just use the supplemental data from the second paper, because it’s missing all the stuff on covariates, which is what I’m interested in. So I can’t do my preferred replication off the data in the second paper. I probably still can do two replications of the second paper based on the published data to compare them, and that might be adequate for my purpose. It’s not as nice as being able to compare against the second paper directly, but it will probably work.
