This series of articles is all about asking the right questions around data, and one that I get asked often is ‘how much data do I need?’. The answer depends on who you ask and, ultimately, on what you want to do with the data. The concept of ‘more’ has come to dominate thinking across defence, with an emerging ethos of ‘let’s collect and store as much data as possible so we can do clever stuff with it later’. This approach feels like a response to almost a decade of perceived slow adoption of Machine Learning (ML) and Artificial Intelligence (AI).
The value of both to defence is well understood, and in the last year the MoD has published a Data Ethics policy paper as well as the Defence AI Strategy. It has also created the Defence AI Centre (DAIC), spanning both research and development (DAIC-X) and, more significantly, an operationally facing team (DAIC-Ops) getting tools into the hands of users. A key limitation people have identified is a lack of data preventing the use of ML; however, I don’t think this is actually the case. There is a huge amount of data in defence; it isn’t always in the right place or format, or well understood, but it does exist. I’d like to borrow some thinking from the NATO Data Exploitation Framework Policy, which articulates this problem very well and offers some alternative thinking.
When you have no data, or rather insufficient data, then volume does matter. It won’t be possible to train a high-performance ML model with 10 records, no matter what you try to do with it! So the first step on the data value chain is volume. Automated systems generate far larger volumes of data than any human process: sensors generating readings every second a platform is in use, for example, compared to a manual data entry process.
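Before investing in model training, it can help to check whether you have enough examples to learn from at all. A minimal sanity-check sketch in Python, assuming a hypothetical `min_per_class` threshold (the right number depends entirely on the problem; 30 here is purely illustrative):

```python
from collections import Counter

def enough_data(labels, min_per_class=30):
    """Report, per class label, whether we have at least
    `min_per_class` examples to learn from."""
    counts = Counter(labels)
    return {label: count >= min_per_class for label, count in counts.items()}

# Ten records split across two classes is nowhere near enough
labels = ["serviceable"] * 7 + ["fault"] * 3
print(enough_data(labels))  # {'serviceable': False, 'fault': False}
```

A check like this is cheap to run before any modelling effort and makes the ‘not enough data’ conversation concrete rather than anecdotal.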
However, that isn’t the end. Whilst ML models can easily ingest huge volumes of data, for them to predict outcomes accurately, and ultimately to be useful, data variability is more important. In other words, having lots of data about one scenario but nothing on anything else means a model will default to a single outcome. This is known as imbalance in data science language: the model ignores the edge cases it lacks examples of and instead predicts the majority class every time. You can compensate for this, but you risk overfitting to a small number of examples, and the model will not be able to generalise well.
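The majority-class failure mode is easy to demonstrate. A deliberately naive sketch in pure Python, with illustrative labels and numbers only: a baseline that always predicts the most common training label scores high accuracy on imbalanced data while never detecting the minority case.

```python
from collections import Counter

def majority_baseline(train_labels):
    """Return a 'model' that always predicts the most common training label."""
    majority = Counter(train_labels).most_common(1)[0][0]
    return lambda record: majority

# 99 routine records for every fault: heavily imbalanced
train = ["routine"] * 99 + ["fault"] * 1
model = majority_baseline(train)

test_labels = ["routine"] * 99 + ["fault"] * 1
predictions = [model(y) for y in test_labels]

accuracy = sum(p == y for p, y in zip(predictions, test_labels)) / len(test_labels)
fault_recall = sum(
    p == y for p, y in zip(predictions, test_labels) if y == "fault"
) / 1

print(accuracy)      # 0.99 - looks impressive
print(fault_recall)  # 0.0 - the model never finds a fault
```

Headline accuracy of 99% hides the fact that the one outcome we probably care about most, the fault, is never predicted.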
So having some data is preferable to no data, and given that you’ve got some data, it’s more important that it has variability and reflects all the likely outcomes you want to model. The next step on the chain is quality. Poor quality data undermines all the analysis that stems from it. There is a concept I’ve experienced on a number of occasions that I have termed ‘paralysis by perfection’: organisations don’t view their data as being perfect and so can’t use it for anything. Pragmatically, data are never perfect; they just need to be good enough to be useful. If you wait for everything to be perfect quality then nothing will ever happen, hence the term.
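‘Good enough’ can be made concrete with simple, measurable checks rather than an appeal to perfection. A minimal completeness sketch, with field names and records invented purely for illustration:

```python
def completeness(records, required_fields):
    """Fraction of required field values that are actually populated."""
    total = len(records) * len(required_fields)
    filled = sum(
        1
        for record in records
        for field in required_fields
        if record.get(field) not in (None, "")
    )
    return filled / total if total else 0.0

records = [
    {"platform": "A1", "hours": 120, "location": "Depot North"},
    {"platform": "A2", "hours": None, "location": "Depot South"},
    {"platform": "A3", "hours": 87, "location": ""},
]
score = completeness(records, ["platform", "hours", "location"])
print(round(score, 2))  # 0.78 - imperfect, but quite possibly still usable
```

Agreeing a threshold like ‘80% complete on the fields this analysis needs’ turns the perfection debate into a yes/no decision and lets work proceed.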
Finally, if data is of sufficiently high quality to be useful, then what you do with it matters the most. A huge amount of effort goes into producing high quality analytical outputs, but to be useful these need to drive business behaviour. Although this is an article about data, my closing point is a cultural one. To be of value, a data product needs to be used for something rather than remaining an academic exercise. That isn’t to discount research, since understanding what’s possible in a particular domain or developing new techniques is hugely valuable, but those need to go on to feed into practical outputs. The DAIC is well set up to capitalise on this with the passing of knowledge between DAIC-X and DAIC-Ops. The key challenge won’t be around data volumes, though; it will be about the cultural and organisational change needed to introduce AI/ML.
Capability Lead – Data Science