In predictive models of binary outcomes such as mortality, model accuracy has two broad components: calibration and discrimination. Calibration measures how closely predicted probabilities match observed outcome rates across the range of predicted values. Discrimination is a model’s ability to distinguish individuals who have the outcome from those who do not. Calibration is usually assessed via the Hosmer-Lemeshow statistic and the ratio of observed to predicted outcomes, which can be misleading. Discrimination is assessed by the area under the Receiver Operating Characteristic curve (AU-ROC). All of these metrics have faults, and none should be used on its own to appraise a model’s accuracy.

The Culprit is …

There is a measure that captures both calibration and discrimination and can be compared across models regardless of sample size: the Brier score. This metric was created over 60 years ago to determine the accuracy of weather forecasts. For a predictive model of a healthcare outcome, the Brier score is simply the mean squared error of the predictions, as shown in the formula below:

$$
\text{Brier} = \frac{1}{N}\sum_{i=1}^{N}(P_i - O_i)^2
$$

where O_i is either 0 or 1 depending on whether patient i has the outcome; P_i is the predicted probability for patient i; and N is the total number of patients. If we assigned every patient a predicted probability of 0.5, the equivalent of a coin flip, then every patient would have a squared error of 0.25, meaning that the overall Brier score is 0.25. I have seen studies where researchers looked at the actual Brier score of their model, saw that it was considerably less than 0.25, and concluded that their model was accurate.
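To make the formula concrete, here is a minimal Python sketch of the calculation. The function name `brier_score` and the five-patient toy data are my own illustrative choices, not taken from any particular library or study.

```python
def brier_score(outcomes, predictions):
    """Mean squared difference between predicted probabilities and 0/1 outcomes."""
    if len(outcomes) != len(predictions):
        raise ValueError("outcomes and predictions must have the same length")
    if not outcomes:
        raise ValueError("need at least one patient")
    n = len(outcomes)
    return sum((p - o) ** 2 for o, p in zip(outcomes, predictions)) / n


# Hypothetical toy data: 1 = patient had the outcome, 0 = did not.
outcomes = [0, 0, 1, 0, 1]

# Give every patient a predicted probability of 0.5 (the "coin flip" case).
coin_flip = [0.5] * len(outcomes)

print(brier_score(outcomes, coin_flip))  # 0.25
```

Whatever the mix of outcomes, a constant prediction of 0.5 gives a squared error of 0.25 for every patient, which is where the 0.25 reference value comes from.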

Not so fast. There is a major problem with using the Brier score “as is”: its value is directly linked to the overall incidence of the outcome. To see why, imagine every patient was given a predicted probability equal to the overall outcome incidence, γ. The Brier score then reduces to this formula:

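$$
\text{Brier}_{\text{baseline}} = \frac{1}{N}\sum_{i=1}^{N}(\gamma - O_i)^2 = \gamma(1-\gamma)^2 + (1-\gamma)\gamma^2 = \gamma(1-\gamma)
$$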


So the real baseline value for a model’s Brier score is not 0.25 but γ(1-γ). Let’s take an example. Suppose mortality in a group of hospitals is 4%. If we gave every patient a mortality probability of 4%, then our baseline Brier score would be 0.04 × 0.96 = 0.0384. The actual Brier score of our model tells us how much it reduced inaccuracy relative to that baseline. So if our actual Brier score was 0.0192, then our model reduced uncertainty by 50%.
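The baseline is easy to check numerically. The sketch below builds a hypothetical cohort of 100 patients with 4 deaths (my own toy data, chosen only to match the 4% example), gives every patient the same predicted probability γ, and compares the resulting Brier score with γ(1-γ).

```python
# Hypothetical cohort: 100 patients, 4 deaths (4% mortality).
outcomes = [1] * 4 + [0] * 96

# Overall incidence of the outcome.
gamma = sum(outcomes) / len(outcomes)

# Brier score when every patient is assigned the incidence as their prediction.
constant_brier = sum((gamma - o) ** 2 for o in outcomes) / len(outcomes)

# Closed-form baseline.
closed_form = gamma * (1 - gamma)

print(round(constant_brier, 4), round(closed_form, 4))  # 0.0384 0.0384
```

The two numbers agree, so a model that knows nothing except the overall mortality rate already scores far below 0.25.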

Thus the raw Brier score needs to be adjusted as follows:

$$
\text{Brier}_{\text{adjusted}} = 1 - \frac{\text{Brier}}{\gamma(1-\gamma)}
$$
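Here is a small sketch of the adjustment in Python. It assumes the adjusted score is one minus the ratio of the raw Brier score to the baseline γ(1-γ), which is the form consistent with the 50% example earlier; the function name `adjusted_brier` is a hypothetical helper of mine.

```python
def adjusted_brier(raw_brier, incidence):
    """Fraction of the baseline Brier score that the model removes.

    Assumes adjusted = 1 - raw / (incidence * (1 - incidence)).
    1.0 is a perfect model, 0.0 is no better than predicting the
    incidence for everyone, and negative values are worse than that.
    """
    if not 0.0 < incidence < 1.0:
        raise ValueError("incidence must be strictly between 0 and 1")
    baseline = incidence * (1.0 - incidence)
    return 1.0 - raw_brier / baseline


# The 4% mortality example: a raw score of 0.0192 halves the baseline error.
print(round(adjusted_brier(0.0192, 0.04), 4))  # 0.5

# A raw score of 0.25 (the coin-flip value) at 4% incidence.
print(round(adjusted_brier(0.25, 0.04), 2))  # -5.51
```

The second call shows why 0.25 is the wrong yardstick when the outcome is rare: at 4% incidence, a model scoring 0.25 does far worse than simply predicting 4% for everyone.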

The Worst and Best Compared

Here’s a table giving the results of two models: one to predict hospital mortality and the other to predict a patient acquiring pneumonia.

[Fig 4: table of raw and adjusted Brier scores for the hospital mortality and pneumonia models]

If we compared the models using the raw Brier score, we might conclude that the hospital mortality model is more accurate than the pneumonia model. But the adjusted Brier score clearly shows that the pneumonia model is superior to the mortality model.

The Brier score can be one of the most useful metrics for gauging a predictive model’s accuracy, if the appropriate adjustment is made. Otherwise it can be one of the most misleading statistics.