So you’re going to develop a model that predicts a specific healthcare outcome. You’ve been fortunate in that your dataset has information from 100,000 admissions: 25,000 from each of the previous four years (2012 through 2015). First and foremost, you can’t use all 100,000 admissions to develop your model. But you already knew that.

So, how are you going to partition your dataset into development and validation sets¹? There are numerous approaches available, such as simple random sampling, k-fold cross-validation, and bootstrapping. All of these methods randomize at the admission (i.e., patient) level. That’s all well and good if you want to be confident that your model isn’t spuriously accurate. However, in healthcare we’re concerned about two important issues:

1. How will my model fare when applied to new hospitals?

2. Will my model hold up in the next cohort of patients?

The examples of sampling methods given above don’t really address these two questions. But there are alternative sampling methods that allow you to get a reasonable idea of whether or not your model will do well at other hospitals and in future time frames. I’ve successfully utilized both of these methods, and describe them below.

All Hospitals Are Not Created Equal

A small community hospital does not deliver the same level of care as Johns Hopkins, nor does it treat the same case-mix of patients. Extending this further to all of the hospitals in your dataset, there will undoubtedly be tremendous variation among institutions. But by including patients from every hospital in both the development and validation datasets you don’t gain insight about your model’s performance when applied to new hospitals. However, stratifying by hospitals is an incredibly powerful sampling method that will give you some confidence about how your model will fare when applied to additional institutions.

An example of this strategy is shown in Figure 1. Note that the hospitals differ in their number of admissions as well as in their type (academic, community, rural). Hospitals are shaded by “type”: blue for academic, white for community, and yellow for rural. The left side of Figure 1 shows the pools of admissions produced by simple random sampling at the admission level (70% development, 30% validation): each hospital contributes 70% of its patients to the development dataset and 30% to the validation dataset. The right side of Figure 1 shows what happens when one out of every four hospitals, randomly selected within each “type”, contributes all of its admissions to the validation dataset, while every other hospital contributes all of its admissions to the development dataset. This second sampling method gives what data scientists call external validity to your predictive model.

Figure 2 shows the resulting number of patients by hospital type in the development and validation datasets. Stratifying by hospital type did not materially alter the 70:30 allocation of admissions to the two datasets.

Figure 1

Figure 2
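The hospital-level split described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the code used to produce the figures: the function name, the record keys (hospital_id, hospital_type), and the one-in-four holdout_fraction default are all hypothetical.

```python
import random
from collections import defaultdict


def split_by_hospital(admissions, holdout_fraction=0.25, seed=0):
    """Send whole hospitals to development or validation, within each type.

    `admissions` is a list of dicts; the keys 'hospital_id' and
    'hospital_type' are assumed names used only for this sketch.
    """
    rng = random.Random(seed)

    # Collect the distinct hospitals that belong to each hospital type.
    hospitals_by_type = defaultdict(set)
    for adm in admissions:
        hospitals_by_type[adm["hospital_type"]].add(adm["hospital_id"])

    # Within each type, randomly hold out roughly one hospital in four.
    # At least one hospital per type is held out, so a type with a single
    # hospital would end up entirely in the validation set.
    validation_hospitals = set()
    for hosp_type in sorted(hospitals_by_type):
        hospitals = sorted(hospitals_by_type[hosp_type])
        n_holdout = max(1, round(len(hospitals) * holdout_fraction))
        validation_hospitals.update(rng.sample(hospitals, n_holdout))

    # Every admission follows its hospital; no hospital appears in both sets.
    development = [a for a in admissions
                   if a["hospital_id"] not in validation_hospitals]
    validation = [a for a in admissions
                  if a["hospital_id"] in validation_hospitals]
    return development, validation, validation_hospitals
```

Because hospitals differ in size, the share of admissions that lands in the validation set will only approximate the share of hospitals held out, so it is worth checking the resulting allocation after the split.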

Something Old, Something New

Medical practice changes over time, and as a result the incidence of outcomes changes. For example, mortality before hospital discharge for patients admitted to an intensive care unit declined from 13.6% to 10.5% over the last 15 years. Thus a model created a decade ago will now substantially overestimate mortality. That’s why so many predictions are prone to what I call “model fade”².

There is a way to partly protect your predictive models from a rapid decline in accuracy. Instead of doing random sampling, stratify by date of admission. Patients admitted to hospitals during 2012-2014 are allocated to the development dataset, and admissions in 2015 make up the validation dataset. In this way you can quickly see whether your model’s accuracy starts to decline. I have used this method numerous times when creating models, and I find it helpful for determining whether a model is fading swiftly.
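A minimal sketch of this temporal split follows. The function name and the admission_year key are assumptions made for illustration; the 2015 cutoff is the one used in the example above.

```python
def split_by_year(admissions, validation_year=2015):
    """Earlier admissions go to development; the cutoff year goes to validation.

    `admissions` is a list of dicts; the key 'admission_year' is an
    assumed name used only for this sketch.
    """
    # Admissions before the cutoff year (2012-2014 here) train the model.
    development = [a for a in admissions
                   if a["admission_year"] < validation_year]
    # Admissions from the cutoff year itself are held back for validation.
    # Anything later than the cutoff year is left out of both sets.
    validation = [a for a in admissions
                  if a["admission_year"] == validation_year]
    return development, validation
```

With 25,000 admissions per year, this yields 75,000 development admissions and 25,000 validation admissions.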

Combining the two methods described above is optimal: the development dataset consists of 2012-2014 admissions from 75% of the hospitals, while the validation dataset consists of 2015 admissions from the remaining 25% of the hospitals. The patients excluded by this sampling scheme (2015 admissions at the development hospitals and 2012-2014 admissions at the validation hospitals) can be used in a further validation dataset.
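The combined scheme can be sketched as below, assuming the set of held-out hospitals has already been chosen (for instance, at random within each hospital type). The function name and the record keys (hospital_id, admission_year) are hypothetical.

```python
def split_by_hospital_and_year(admissions, validation_hospitals,
                               validation_year=2015):
    """Combine the hospital-level and the temporal holdout.

    `admissions` is a list of dicts with the assumed keys 'hospital_id'
    and 'admission_year'; `validation_hospitals` is a set of hospital ids.
    """
    development, validation, leftover = [], [], []
    for adm in admissions:
        held_out = adm["hospital_id"] in validation_hospitals
        year = adm["admission_year"]
        if not held_out and year < validation_year:
            # Earlier admissions at the development hospitals.
            development.append(adm)
        elif held_out and year == validation_year:
            # Cutoff-year admissions at hospitals the model never saw.
            validation.append(adm)
        else:
            # Everything else is set aside for a further validation set.
            leftover.append(adm)
    return development, validation, leftover
```

The validation set produced this way is new in both place and time, which is a stricter test than either split alone, at the cost of a smaller validation sample.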

Healthcare outcomes can be difficult to predict. But by eschewing random sampling at the admission level, you can gain more confidence that your model’s accuracy will hold up.

1 Ideally one would have development, test, and validation datasets. I’ve narrowed this down to just development and validation datasets for simplicity.
2 See the article “Predictive mortality models are not like fine wine” in the October 2005 edition of the journal Critical Care Medicine.