Everyone who wants to delve into predictive modeling in medicine starts out studying linear regression. After learning the basics, we're told that this type of model is unrealistic: associations between biologic characteristics and health outcomes are almost never linear, i.e., the whole is not the sum of the parts. You can't just add effect 1 to effect 2 and expect a reasonable prediction to emerge; nature isn't usually additive. Yet many of the most popular statistical techniques are additive in some form: multiple regression, logistic regression, analysis of variance, Cox regression, and so on are all variants of an additive model.
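The additive structure these models share can be made concrete: each predictor contributes its own term, and the prediction is simply the sum. Here is a minimal sketch; the feature names and coefficients are hypothetical, chosen only to illustrate the arithmetic.

```python
def additive_predict(features, coefficients, intercept=0.0):
    """Additive model: prediction = intercept + sum of per-feature effects."""
    return intercept + sum(coefficients[name] * value
                           for name, value in features.items())

# Hypothetical coefficients and patient values, for illustration only.
coefs = {"age": 0.04, "systolic_bp": 0.02, "creatinine": 0.30}
patient = {"age": 70, "systolic_bp": 150, "creatinine": 2.0}
risk_score = additive_predict(patient, coefs, intercept=-1.5)
# Each variable's effect is independent of the others; no interaction
# between, say, age and creatinine can be expressed in this form.
```

That last comment is the crux of the author's point: whatever the link function, the model's core is a sum of separate effects.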
Additivity is how many concepts in medicine are constructed. An example is the Sequential Organ Failure Assessment (SOFA) score for grading severity of illness, particularly in patients with sepsis. Points are assigned to each of six organ systems based on the extent of its failure, and these subscores are added together to produce the SOFA score.
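The arithmetic behind a score like SOFA really is plain addition. The sketch below captures just that structure: six organ systems, each scored 0 to 4, summed into a total. The subscore values in the example are placeholders, not the actual clinical thresholds used to assign SOFA points.

```python
# SOFA-style severity score: six organ systems, each subscore 0-4,
# total = simple sum (so the range is 0-24).
ORGAN_SYSTEMS = ("respiratory", "coagulation", "liver",
                 "cardiovascular", "neurologic", "renal")

def total_sofa(subscores):
    """Sum the per-organ subscores; each must be an integer in 0..4."""
    for system in ORGAN_SYSTEMS:
        if not 0 <= subscores[system] <= 4:
            raise ValueError(f"{system} subscore out of range")
    return sum(subscores[system] for system in ORGAN_SYSTEMS)

# Placeholder subscores for illustration, not derived from real criteria.
patient_subscores = {"respiratory": 2, "coagulation": 1, "liver": 0,
                     "cardiovascular": 3, "neurologic": 1, "renal": 2}
```

The appeal of such scores is exactly their additivity: a clinician can see at a glance which organ systems are driving the total.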
The need to perceive processes as additive goes back to how we handle information. People are very good at assessing relationships when looking at a two-dimensional graph. But throw up a three-dimensional graphic and most individuals, including incredibly bright ones, get a blank stare. More's the pity, as many machine learning techniques are decidedly non-additive. Unfortunately, these techniques are sometimes labelled "black boxes", not because they are inherently opaque but because they are just too hard to explain, or because writing code to implement them is too challenging. They force people out of the comfort of thinking in terms of additive concepts.
So just what am I getting at? Predictive models in medicine built on non-additive foundations should not be abandoned because the underlying methods are inaccessible. The developers of these kinds of models need to come up with visualizations that make the techniques approachable. For example, the tree diagram in classification and regression trees can be easily grasped by non-analysts, and dendrograms are a good way of displaying agglomerative clustering.
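Part of why a tree diagram is so graspable is that it reads as a sequence of yes/no questions. The toy tree below makes that concrete; the variables, thresholds, and risk labels are invented for illustration and do not come from any real clinical model.

```python
# A hand-built toy classification tree. Every prediction is just a walk
# down a short list of yes/no questions, which is why tree diagrams
# are easy for non-analysts to follow.
def classify(patient):
    """Return a risk label by walking the (invented) tree."""
    if patient["lactate"] > 4.0:
        if patient["systolic_bp"] < 90:
            return "high risk"
        return "intermediate risk"
    if patient["age"] > 75:
        return "intermediate risk"
    return "low risk"

def tree_as_text(indent="  "):
    """Render the same tree as indented text, like a tree diagram."""
    return "\n".join([
        "lactate > 4.0?",
        indent + "yes: systolic_bp < 90?",
        indent * 2 + "yes: high risk",
        indent * 2 + "no:  intermediate risk",
        indent + "no:  age > 75?",
        indent * 2 + "yes: intermediate risk",
        indent * 2 + "no:  low risk",
    ])
```

Note that the tree is inherently non-additive: the effect of blood pressure here depends entirely on the lactate value, an interaction no sum of separate effects can express.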
But what about methods such as support vector machines and neural networks? The answer is that data scientists must use novel forms of visual analytics to shed light on how their predictions were derived. These graphics should be dynamic and interactive: the user should be able to change the variables being assessed on the fly, taking in the importance of specific variables and their association with outcomes. Heat maps are a popular way of doing this. Finally, visual analytics should be easy to use yet impart a large amount of information. Lacking that, trying to explain what a predictive model is telling us just won't add up for most people.
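One common way to surface variable importance for an opaque model, and the kind of number such dynamic graphics would display, is permutation importance: shuffle one variable's values and measure how much the model's error worsens. Below is a minimal sketch against a stand-in "black box" model; the model and data are made up purely to show the mechanics.

```python
import random

def permutation_importance(model, rows, outcomes, feature,
                           n_repeats=20, seed=0):
    """Mean increase in squared error when `feature` is shuffled.

    `model` maps a dict of feature values to a prediction;
    larger return values mean the model leans harder on `feature`.
    """
    rng = random.Random(seed)

    def mse(data):
        return sum((model(r) - y) ** 2
                   for r, y in zip(data, outcomes)) / len(data)

    baseline = mse(rows)
    values = [r[feature] for r in rows]
    worsening = 0.0
    for _ in range(n_repeats):
        shuffled = values[:]
        rng.shuffle(shuffled)
        permuted = [dict(r, **{feature: v})
                    for r, v in zip(rows, shuffled)]
        worsening += mse(permuted) - baseline
    return worsening / n_repeats

# Stand-in black box: only "x" actually matters; "z" is noise.
black_box = lambda r: 3.0 * r["x"]
rows = [{"x": float(i), "z": random.random()} for i in range(30)]
outcomes = [black_box(r) for r in rows]
```

Plotting these importances across variables (or across variable pairs, as a heat map) is one way to let a user explore, on the fly, which inputs a neural network or support vector machine is actually using.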