Big Data, AI and Psychiatry

Some Commentary

My colleague Lia Ali pointed me to this – at this year’s Royal College of Psychiatrists’ International Congress, there’s a debate titled Clinic appointment with Skynet? This house believes that the RCPsych should embrace Artificial Intelligence and Big Data in guiding clinical decision making and service development. Regrettably, I’m going to miss the Congress so won’t be able to attend this exciting panel discussion, but here are some thoughts.

Fair advance warning, this is a highly opinionated and somewhat personal/anecdotal piece so you know the risks …

First of all – in my view – the two topics (Big Data and AI) would benefit from being debated separately, especially as two application domains are mentioned (clinical decision making and service development). And I’ll explain why as we go. I wonder if the two topics (AI and Big Data) are often headlined together because there’s an assumption that AI techniques – almost always, inductive learning algorithms – often require vast datasets. But not all inductive methods require huge amounts of data (more on this later). Further, as I’ll elaborate, I think there is a tendency to see AI as some computational alchemy which turns variable quality (but voluminous) data into robust inferences and insights you couldn’t possibly have obtained using mature, established disciplines like statistics and epidemiology. My view is that intellectually, AI (however you conceive it or divide up the many sub-fields) inherits from the traditions of formal (rather than empirical) sciences and engineering. In this regard, given that evidence-based medicine (EBM) inherits more from empirical science, my view is that statistics, biostatistics and epidemiology should act as sanity-checks for the direction of travel and outputs of AI, machine learning and engineering-based treatments of clinical data.

1 Reputation Risk

AI is a broad concept and is used today (in my view) vaguely and without qualification – almost as if everyone implicitly knows what the term means. But let’s be clear – the term carries substantial historical baggage and investing blindly in “AI will solve this problem” is risky because if we fail, the cost in terms of resource, enthusiasm (often, to be read as “hype”), energy and reputation will be substantial. For this reason, I was left wanting by the Topol review – look at p. 5 of the Topol review for mental health to see why.

An anecdote: when I was writing up my Ph.D. in early 2000, my supervisor (one of the most helpful, tolerant and supportive academics I’ve ever met) told me bluntly “Do not juxtapose the words artificial and intelligence in your thesis title … No one will examine it”.

And this was precisely because at that time, there was a history of at least two “waves” of AI that had failed – the so-called AI winters – and, as I discuss shortly, skepticism about the status of the machine learning methods of the day.

Debatably, the first AI winter resulted from the failures of what became known in the 1980s-1990s as connectionism – the period of neural networks research dating back to the McCulloch-Pitts model neuron and early learning algorithms such as the Widrow-Hoff rule and Frank Rosenblatt’s perceptron – that culminated in 1969, when Minsky and Papert’s book “Perceptrons” showed that certain kinds of neural networks can’t learn a fundamental logical operator, the exclusive OR (XOR) function. Like most stories, there’s subtlety that gets ignored but ultimately, in the UK at least, the 1973 Lighthill report certainly kiboshed the field. Neural networks (at least, the feed-forward, function-approximating kind most like modern deep networks) came alive again in the 1980s, primarily because it was discovered that if you add more layers of McCulloch-Pitts style model neurons between the input and output and combine this architecture with an algorithm to adjust the parameters between model neurons (the back-propagation rule), suddenly, you could do much more. Notice something here: a way out of a difficult slump is to a) build a bigger network and b) innovate on learning algorithms that can optimise and estimate the model’s parameters to make it perform. Sound familiar?
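To make the XOR point concrete, here’s a minimal numpy sketch (mine, with hand-set illustrative weights, not anything from the historical papers): a single threshold unit can’t represent XOR because the two classes aren’t linearly separable, but one hidden layer of McCulloch-Pitts style units is enough.

```python
import numpy as np

# The four XOR input patterns and their targets.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
t = np.array([0, 1, 1, 0])

def step(z):
    # McCulloch-Pitts style threshold activation
    return (z > 0).astype(int)

# No single threshold unit step(w1*x1 + w2*x2 + b) can reproduce t: XOR is not
# linearly separable, which is the core of Minsky and Papert's observation.

# One hidden layer fixes it. Hand-set (not learned) weights: h1 computes OR,
# h2 computes AND, and the output computes "OR and not AND", i.e. XOR.
W_hidden = np.array([[1.0, 1.0],     # h1: x1 + x2 - 0.5 > 0  -> OR
                     [1.0, 1.0]])    # h2: x1 + x2 - 1.5 > 0  -> AND
b_hidden = np.array([-0.5, -1.5])
w_out = np.array([1.0, -1.0])        # y: h1 - h2 - 0.5 > 0   -> XOR
b_out = -0.5

h = step(X @ W_hidden.T + b_hidden)
y = step(h @ w_out + b_out)
print(y, "matches targets:", np.array_equal(y, t))
```

The twist described above is that back-propagation lets you learn weights like these from data rather than hand-wiring them.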

More anecdotal data: toward the end of the 1990s, where I was working, the enthusiasm for neural networks – reinvigorated in 1986 by the publication of the two-volume book “Parallel Distributed Processing: Explorations in the Microstructure of Cognition” by Rumelhart, McClelland and the PDP Research Group – was waning because

  • it could be shown that multilayer perceptrons (the precursor to today’s deep learning networks) essentially implemented multivariable (and in some cases, multivariate) regression
  • there was a Bayesian formulation of multilayer perceptrons and their training by the back-propagation method (nowadays, autodiff) which offered a solid mathematical framework and dispelled some of the mythology around these techniques (that is to say, the story was “they don’t do anything special but are another way of implementing regression”)
  • additionally, where I was working, techniques like support vector machines (SVMs) and decision trees (such as Breiman’s CART method) could perform as well (if not better) on classification tasks. And many people were excited about fuzzy logic.

Alongside this, “classical” AI – based on symbolic reasoning, rather than the numerical, probabilistic and statistical methods of neural networks – had been shown to be brittle and in some cases, computationally intractable; examples include the frame problem and symbolic search methods for planning actions. Expert systems – the 1980s version of engineering applications of symbolic AI to problems such as decision support in medicine – didn’t get very far either.

Anyway – in the 1990s and 2000s researchers attempted to position themselves outside the umbrella of AI because, in essence, it was a toxic brand – there had been too many promises that failed to deliver in the previous hype-cycles of the 1960s, 70s, 80s and 90s.

2 AI and the NHS

Anecdotally, when speaking with colleagues across industry, academia and the NHS, I’ve heard the following:

  1. “Real AI” is contemporary neural networks (by implication, anything else is not real AI) – this is just machismo and patently nonsense. I’ve heard it said that the “NHS doesn’t do real AI” because it doesn’t deploy large, sophisticated neural networks. Let’s assume that “real AI” is equated specifically with the branch of machine learning that studies large neural networks and then let’s cherry-pick one impressive achievement – beating a human at Go. By some estimates, DeepMind spent around $35 million training a system to play Go. Take a look at the paper, and you’ll see that they used either 48 CPUs and 8 GPUs or, for the distributed version, 1,202 CPUs and 176 GPUs. And it took 3 weeks of compute time to train.

You might reasonably ask “So what?” But if I want to buy cloud-based resource to implement and run my neural network-based solution to some problem, I might go to AWS and cost an on-demand p4d.24xlarge instance with 8 GPUs, intending to drive this thing hard for training and then use it in production for inference – that’s over 23,000 USD (or £18,800) per month if I want it switched on 24/7. If I buy a reserved instance, I can get this down to just over 8,000 USD (£6,500) per month if I purchase for three years. Buying my own single GPU (of a similar specification to those in the AWS instance described) will set me back about £9,000, ignoring the cost of a host server, storage and maintenance for running the rig.
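For the arithmetic behind those monthly figures, here’s a minimal sketch; the hourly rates are assumptions back-calculated to match the numbers quoted above, and will drift as cloud pricing changes.

```python
# Back-of-the-envelope cloud cost arithmetic. The hourly rates below are
# assumptions chosen to roughly match the figures quoted above (prices change
# often); the point is the order of magnitude, not the exact bill.
HOURS_PER_MONTH = 24 * 30

on_demand_usd_per_hour = 32.77   # assumed on-demand rate for an 8-GPU instance
reserved_usd_per_hour = 11.57    # assumed effective 3-year reserved rate

print(f"On-demand, running 24/7: ${on_demand_usd_per_hour * HOURS_PER_MONTH:,.0f} per month")
print(f"3-year reserved, 24/7:   ${reserved_usd_per_hour * HOURS_PER_MONTH:,.0f} per month")
```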

So, if we’re going to equate “real AI” with neural networks, the NHS better be sure it’s going to be worth it.

  2. Contemporary large, deep neural networks can locate non-linear and long-range interactions (say, between variables) in a way that standard statistical techniques cannot – I’ve heard this said often, and the architectures of modern neural networks are certainly configurable for this to be the case … but I’ve not seen this advantage demonstrated in applications (alluded to, certainly, but not demonstrated).

This is one of those alchemy-like principles which I wonder may be based on an analogy with some actual theory. Back in the day (1989), Cybenko published a universal approximation theorem which showed multilayer perceptrons could approximate any continuous real-valued mathematical function to an arbitrary degree of accuracy. Similar theorems have been developed for modern deep networks. These essentially tell us that neural networks can capture very sophisticated and complex mappings from inputs to outputs, but these theorems don’t deliver a specification for the actual, realisable network (that is, the inputs, hidden layers, parameterisation and so on). I don’t know for sure, but perhaps using these ideas (universal approximation) analogically – combined with the representational sophistication afforded by very large networks – provides us with hope that these long-range interactions, dependencies and correlations (say, in very long time series) might be captured or exploited.

  3. There seems to be an implicit belief that if one throws a big enough neural network at a problem, it can magically divine signal from noise and “add value” in ways that traditional inductive inference (e.g. statistical methods) cannot. It seems (to me at least) that there’s a tacit belief that low-fidelity data can be transformed by contemporary AI without critically looking at the source data and the desired application.

I think this last point is particularly pernicious. Some neural network architectures can do amazing things, but they tend to be in domains where – from the design of these neural networks – you’d predict that they should work. For example, convolutional neural networks in radiology and imaging tasks. But still, they’ve been shown to learn aberrant mappings which are kinds of “short-cuts” from inputs to desired outputs that bear no resemblance to clinical reasoning; roughly, correlations between image features and the classification target that turn out to be unlike any information a skilled radiologist would use.

I’ll finish this section with this thought: it’s easy to be a nay-sayer and dump on others’ work. But my point is simply that we do require more critical thinking about how the data that are available mesh with robust and reliable methods, and how the two might reasonably be combined to produce useful results – rather than blindly expecting that AI (however conceived) will “just solve the problem”, which, as described above, didn’t get us very far.

Also, I’d add that we should be more prosaic with our language – in some applications of AI or machine learning (ML) we are really just saying “We used methods derived from computational sciences and engineering to solve a problem”. A great example is when we use regularised regression to combine an element of feature selection with learning the probability of some outcome, alongside internal validation techniques like cross-validation, resampling and so on. If you’re going to call (essentially) regression “AI” then where do you stop? You might as well decide that linear algebra is basically AI.
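To make that concrete, here’s a minimal sketch of the kind of pipeline I mean (synthetic data and scikit-learn, purely illustrative): L1-regularised logistic regression gives an element of feature selection by shrinking uninformative coefficients to zero, and cross-validation provides the internal validation. Whether this deserves the label “AI” is exactly the question.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for a modest clinical dataset: 500 patients, 20 candidate
# predictors of which only 5 actually carry signal about a binary outcome.
X, y = make_classification(n_samples=500, n_features=20, n_informative=5,
                           n_redundant=5, random_state=42)

# L1 (lasso) penalised logistic regression: the penalty shrinks uninformative
# coefficients to exactly zero, which is the "element of feature selection".
model = make_pipeline(
    StandardScaler(),
    LogisticRegression(penalty="l1", solver="liblinear", C=0.5),
)

# Internal validation by 5-fold cross-validation rather than a single split.
scores = cross_val_score(model, X, y, cv=5, scoring="roc_auc")
print(f"Cross-validated AUC: {scores.mean():.2f} ± {scores.std():.2f}")

# How many predictors survived the penalty?
model.fit(X, y)
coefs = model.named_steps["logisticregression"].coef_.ravel()
print(f"Non-zero coefficients: {np.count_nonzero(coefs)} of {coefs.size}")
```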

3 Decision Support and Service Development

Helping clinicians and patients make a decision is very different from service development (again, in my opinion). The latter is (I think) about changing how we deliver care in response to demand, capacity, shifts in medical technology and the demographics of the people that a healthcare system serves. The former is about sitting alongside a patient and their data, helping decide on management, establish a diagnosis and so on.

Somewhat over-simplistically, one could say that service development requires understanding e.g. data about the locality/population characteristics, audits of the operational aspects of services and their delivery against targets or clinical guidelines, and methods for marrying the two with a view to changing how care is delivered. Certainly, this requires lots of data and perhaps, opportunistic re-use of existing routinely-collected data. And statistical process control (SPC) is described as being widely used in the NHS for understanding how changes in services deliver differences in outcomes.
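For readers unfamiliar with SPC, here’s a minimal sketch of the individuals (XmR) control chart often used for this kind of monitoring – the monthly referral counts below are entirely made up.

```python
import numpy as np

# Synthetic monthly referral counts before and after a service change
# (illustrative numbers only).
counts = np.array([112, 108, 121, 117, 109, 115, 119, 111,   # baseline period
                   131, 128, 135, 140, 133, 138, 142, 136])  # after the change

baseline = counts[:8]

# Individuals (XmR) chart limits from the baseline period:
# centre line = mean, limits = mean +/- 2.66 * mean moving range.
moving_range = np.abs(np.diff(baseline))
centre = baseline.mean()
ucl = centre + 2.66 * moving_range.mean()
lcl = centre - 2.66 * moving_range.mean()

print(f"Centre line: {centre:.1f}, control limits: [{lcl:.1f}, {ucl:.1f}]")
print("Months beyond the upper limit after the change:",
      np.sum(counts[8:] > ucl))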

But I doubt this is the same data required when sitting alongside a patient, trying to arrive at a shared decision using decision support tools – the QRISK cardiovascular risk score being an example. It could be the case that if a service collects high-fidelity, granular data about patients, this same data could be reused at the population/service level, but I think the two things remain very different.

And this leads us to the next point: what data and methods you need, or use, depends on detailed understanding of the task at hand.

4 Big Is Not Necessarily Better

When I trained in medicine, we got well-intentioned, but rudimentary, training on statistical methods under the guise of the evidence-based medicine (EBM) curriculum. As part of this, we learned many examples of where the precision of some estimate increases as the sample size, \(n\), increases. For example, the standard error of the mean is \(\sigma/\sqrt{n}\) where \(\sigma\) is the sample standard deviation. Almost everywhere you look, you find \(n\) in the denominator and this leads us to internalise the idea that “bigger \(n\) means more precision” and that we should trust large studies over small studies.
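As a minimal numerical sketch of that textbook intuition (simulated measurements, illustrative numbers only):

```python
import numpy as np

rng = np.random.default_rng(1)

# Draw samples of increasing size from the same noisy "measurement" process
# and watch the standard error of the mean shrink as sigma / sqrt(n).
for n in (10, 100, 1_000, 10_000):
    sample = rng.normal(loc=50.0, scale=10.0, size=n)
    sem = sample.std(ddof=1) / np.sqrt(n)
    print(f"n = {n:>6}: mean = {sample.mean():5.2f}, SEM = {sem:.3f}")
```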

Anecdotally, I’ve been involved in a few studies where – in spectacular examples of terrible research practices – the team has decided to collect “just a few more samples” to increase \(n\) and shrink the standard error of some inferential estimates and, even worse, to shrink uncertainty estimates to the point where the result “reaches statistical significance”. That’s a whole other debate.

But of course, it’s never that easy. Achieving more precision requires better measurement, not just larger \(n\), and psychiatry (alongside biological sciences and medicine more generally) has noisy and imprecise measurements.

4.1 Noise and Measurement Error

An example, lifted from Loken and Gelman’s 2017 paper “Measurement error and the replication crisis”: If I tell you I can run a mile in 5 minutes, that’s impressive. If I inform you that I did this with a rucksack loaded with bricks, then you’d likely conclude I would have been even faster without the weight slowing me down. By analogy: I perform a study involving \(n=100\) people, collecting the kind of data that allows me to estimate the odds ratio of having a disorder given some factors I want to control for and including a categorical variable of interest, representing membership of some tentative risk group (e.g. natal sex, membership of a certain ethnic group and so on). I estimate the odds ratio for the categorical variable of interest at 1.2, but my uncertainty is wide and the odds ratio might be between 0.7 and 5.7. I write this up, stating we can’t be conclusive about the contribution of the categorical risk variable’s effect because the uncertainty interval [0.7, 5.7] includes 1.0. In my “future work, discussion and limitations” section, I write that “future work should include larger sample sizes” because (as above) I expect that my effect would be larger or more visible if it weren’t for the noise, and further, I can overcome the noise – increasing the precision of my estimate – by increasing \(n\).

Loken and Gelman show that if you take two variables, \(x\) and \(y\), with an actual “ground truth” small effect of \(x\) on \(y\), and then consider two scenarios where the observed measurements have either low or high measurement error (i.e. we add either a small or a large amount of noise), then:

  • with large \(n\) and a truly modest effect size (i.e. the correlation between \(x\) and \(y\) is in reality small) – adding measurement error almost certainly reduces the observed correlation, making the effect less visible. Analogously, running a mile in 5 minutes but with a heavy rucksack.
  • with small \(n\) and a truly modest effect size – adding measurement error can result in the observed correlation being larger, making the effect artificially more “visible” to us. This is the analogy and assumption that we would have run faster (the effect size/correlation would be larger) if we weren’t burdened by the rucksack (measurement error)

This tells us that when we draw a conclusion from small samples, we should not rely on the idea that the impact of measurement error is obscuring what would have been a more impressive effect size.
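Here’s a minimal simulation sketch of that point (my own toy version, not Loken and Gelman’s code, with illustrative parameter choices): generate a small true effect, add measurement error to both variables, and count how often the noisy estimate of the correlation comes out larger than the error-free one at small versus large \(n\).

```python
import numpy as np

rng = np.random.default_rng(7)

def fraction_inflated(n, true_slope=0.15, noise_sd=1.0, reps=2000):
    """Fraction of simulations where adding measurement error to x and y
    makes the observed correlation LARGER than the error-free correlation."""
    inflated = 0
    for _ in range(reps):
        x = rng.normal(size=n)
        y = true_slope * x + rng.normal(size=n)          # small true effect
        x_obs = x + rng.normal(scale=noise_sd, size=n)   # noisy measurements
        y_obs = y + rng.normal(scale=noise_sd, size=n)
        r_true = np.corrcoef(x, y)[0, 1]
        r_obs = np.corrcoef(x_obs, y_obs)[0, 1]
        inflated += r_obs > r_true
    return inflated / reps

for n in (20, 3000):
    print(f"n = {n:>5}: noisy correlation exceeds the error-free one in "
          f"{fraction_inflated(n):.0%} of simulations")
```

At large \(n\), the added noise essentially always attenuates the observed correlation; at small \(n\), a substantial fraction of runs show the opposite – the inflation described in the second bullet above.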

4.2 Big Data by Analogy with Big Surveys

Now, I hear you: isn’t this why we need Big Data? But let’s consider the Big Data Paradox – introduced in Xiao-Li Meng’s 2018 paper “Statistical Paradises and Paradoxes in Big Data” in the context of population surveys. Meng’s work shows that while increasing the sample size may reduce estimates of uncertainty, e.g. shrink confidence intervals for some statistic, it can also magnify bias in the statistic; consequently, we end up being very confident in summary statistics which are completely off-the-mark.

Meng considers a finite sample of size \(n\) from some population of size \(N\) and focuses on the difference between some sample statistic (say, the average) \(\bar{G}_n\) and the “true” population value of the same, \(\bar{G}_N\).

  • for some sample \(n\), the difference (error) between the sample and population average is \(\bar{G}_n - \bar{G}_N\) – how much we can trust the sample average \(\bar{G}_n\) depends on this difference/error
  • The error \(\bar{G}_n - \bar{G}_N\) can be “decomposed” into a formula consisting of three terms

The three terms are:

  1. a data quality measure – or the data defect correlation measure – which captures total bias as the correlation between the event that an individual’s data was recorded in the sample (a function of the sampling method) and the actual value recorded; in a high-quality sample, the values recorded should bear no relationship to whether or not the individual was included in the sample. Hence, the value of the data defect correlation will be low and ideally, close to zero.

A concrete example: In one electronic health record (EHR) study I was involved in, patients with long histories (and therefore, having larger volumes of clinical data in the EHR) were absent from the sample for purely technical reasons that we hadn’t spotted at the time (FYI, there was a bug in the software that just “dumped” patients from the sample if they had lots of historical notes). Assume we’re studying recurrent depression and using EHR samples to estimate the mean number of episodes as a function of age. In this case, the data defect correlation will be high because the event that an individual’s data was ‘captured’ in any sample would strongly correlate with both age and how many episodes they’d experienced (older people are more likely to have experienced more episodes and therefore have longer records).

  2. a data quantity measure – expressed as a function of \(\sqrt{(N-n)/n}\), so that if the sample is the entire population (complete data), \(n = N\) and this expression becomes zero (i.e. contributes nothing to the error).

Note that the data quantity is defined such that the size of the sample \(n\) alone is not so important; rather, it’s the sample size as a proportion of the population size \(N\) that contributes to the overall error \(\bar{G}_n - \bar{G}_N\).

  3. a problem difficulty measure – expressed as a function of the standard deviation of the values measured in the population, which is to say, the more heterogeneous the population values are, the harder it is to estimate a robust average \(\bar{G}_n\). Putting the three together gives the identity sketched below.
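Written out (from my reading of Meng’s paper, so treat the notation as a sketch rather than a verbatim quotation), the error decomposes as a product of the three terms:

\[
\bar{G}_n - \bar{G}_N \;=\; \underbrace{\hat{\rho}_{R,G}}_{\text{data quality}} \;\times\; \underbrace{\sqrt{\frac{N-n}{n}}}_{\text{data quantity}} \;\times\; \underbrace{\sigma_G}_{\text{problem difficulty}}
\]

where \(R\) is the indicator that an individual’s value was recorded in the sample, \(\hat{\rho}_{R,G}\) is the data defect correlation and \(\sigma_G\) is the population standard deviation of \(G\).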

Narratively, Meng is asking (see p. 687) “Which one should we trust more, a 5% survey sample or an 80% administrative dataset?” or alternatively, “Should we invest in a small but well-designed, robust sample or large, routinely collected data from electronic health records, population surveillance systems or a big online social-media opportunistic sample?”. The headline lesson of Meng’s paper is the Big Data Paradox: “The more the data, the surer we fool ourselves”.

A great example of the Big Data Paradox using Meng’s decomposition is Bradley et al’s (2021) paper “Unrepresentative big surveys significantly overestimated US vaccine uptake”. In this paper, they estimated the components above for three different survey designs of COVID-19 vaccine uptake in the US. They had access to the actual number of vaccines delivered (i.e. the “true” population values) and could look at how biased the estimates were for the three surveys (samples). One of the surveys was a robustly designed, probabilistic sample of the population, but was relatively small at 995 people. Another was an opportunistically sampled survey of active social media users (with 181,949 people), and another was based on census addresses for which there was a mobile phone or email contact available, consisting of 76,068 people.

The results are interesting. Bradley et al showed how the two very large surveys (relying more on opportunistic sampling) had high data defect correlations and the estimated average vaccine uptake differed systematically from the actual vaccinations administered. The small (\(n=995\)) but higher-quality, probability-sampled survey was, however, much closer to the population average.

And the headline in Bradley et al’s paper is that a survey of 250,000 people (with properties similar to the two large surveys) produces estimates of the population’s mean that are no more accurate than a truly random sample of just 10 (yes, ten) people.
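To see the mechanism in miniature, here’s a toy simulation (entirely synthetic numbers, not a re-analysis of Bradley et al): a large sample whose inclusion probability is correlated with the outcome versus a small simple random sample from the same population.

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic population of 1 million people; 60% have the outcome of interest
# (say, "vaccinated"). All numbers are illustrative.
N = 1_000_000
vaccinated = (rng.random(N) < 0.60).astype(float)
true_mean = vaccinated.mean()

# Biased "big data" sample: vaccinated people are somewhat more likely to be
# captured (e.g. more engaged online), so n is huge but the data defect
# correlation is non-zero.
p_include = np.where(vaccinated == 1, 0.25, 0.15)
captured = rng.random(N) < p_include
big_sample = vaccinated[captured]

# Small but genuinely random sample.
srs = vaccinated[rng.choice(N, size=1_000, replace=False)]

def naive_estimate(sample):
    """Mean with the usual (sampling-error-only) 95% confidence interval."""
    m = sample.mean()
    se = sample.std(ddof=1) / np.sqrt(sample.size)
    return m, m - 1.96 * se, m + 1.96 * se

# Data defect correlation: correlation between being captured and the value.
ddc = np.corrcoef(captured.astype(float), vaccinated)[0, 1]

print(f"True population mean: {true_mean:.3f}")
for label, sample in [("Biased big sample", big_sample), ("Small random sample", srs)]:
    m, lo, hi = naive_estimate(sample)
    print(f"{label:<20} n={sample.size:>9,}  estimate={m:.3f}  95% CI=[{lo:.3f}, {hi:.3f}]")
print(f"Data defect correlation of the big sample: {ddc:.3f}")
```

In runs of this sketch, the big sample reports a very tight confidence interval around a badly biased estimate, while the small random sample is typically close to the truth with an honest, wider interval – which is Meng’s point.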

5 Conclusion

The points I’m trying to articulate here are:

  1. We need more precision in how we describe our methods – “AI” obscures what we mean, especially in the age of hype and over-promise. Sometimes, a more prosaic description of our methods might be less attention-grabbing but perhaps more honest – so let’s see more papers saying “We used resampling methods for internal validation, on a modest data set, estimating a linear model’s parameters using maximum likelihood” rather than assuming we need to call this AI to persuade readers of the value of what was undertaken. And let’s not assume contemporary AI has to be equated with the current trend for deep learning / massive neural networks.
  2. Large neural-network models with millions/billions of parameters and sophisticated architectures are expensive to train and deploy – so we better be sure that they’re worth the investment.
  3. Knowing the history of previous cycles of AI (and the subsequent “AI winters”) and understanding where and why they failed (which domains, applications and so on) is important; spectacular results enabled by increases in “raw” computational power and resources shouldn’t persuade us that everything is a breakthrough or a game-changer.
  4. Big Data is not always better data – sure, size matters, but we should understand and attend to the intended use (or re-use) of any large data set; some opportunistically acquired data sets might be useful for some tasks, but give unreliable results for other tasks. For example, how we use routinely collected data for service development might radically differ from using the same data for individual-level clinical decision support

I’m aware that this blog covered none of the profound ethical or sociotechnical aspects, but that’s for another time.

Dan W Joyce

I’m interested in how principles from computation can be used to understand things like clinical state, trajectories and how to augment clinical decision making using data, multivariate statistics and (cautiously) AI and ML.