In my previous article, I wrote about the abundance of data around us coupled with the proliferation of data analysis, which is often of questionable quality. However, what can’t be ignored is the increasing reliance on data to make business decisions. This means that, as consumers of those analyses, we need to be able to critically assess the output while understanding the process by which it was created and the implications of the results.
Therefore, in this second article I want to explore the question ‘how can you make sure you know exactly what is being measured?’
The challenges of data analysis
When presented with a piece of analysis, it is useful to understand not just how it has been arrived at – i.e. what steps have been taken to get the raw data into the figure now presented – but also what data is included. This question typically has two parts. First, what data is collected often dictates what data is used: if something is not already being captured then, without some investment, it will go uncounted. The second part is more nuanced, as analysts will often make decisions about what to include in a calculation or, more importantly, what to exclude.
Looking at the data collection angle first, let’s take the example of a survey. You can only analyse the data available at the end of the process, and there are a number of well-documented issues with bias here, most commonly selection bias and survivorship bias. If you collect data on the wrong aspect, or only from a subset of the population, any conclusions drawn will be limited at best and misleading at worst. In a business context, some systems lend themselves to analysis by generating lots of data points, and these are the systems most often analysed. However, you may be missing significant elements by focusing on what’s available rather than what’s actually useful.
The second question is harder to grasp because data may be removed or manipulated for good reasons. Take Machine Learning tasks as an example: missing data is commonly removed, or gaps are filled with the average for a field, so that models can be trained. Test data may have appeared in a live system by mistake, and you would want to remove it so that it doesn’t influence results. Outliers are more difficult. These are values that sit well outside the expected range for a field: they could represent errors in the source data, or they could be particularly interesting edge cases. Without a thorough approach to exploring the data and a clear explanation of the method, it is impossible to know which is the case.
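These two preparation steps can be sketched in a few lines. The figures, column name and thresholds below are purely illustrative (the interquartile-range rule is one common convention for "well outside the expected range"; nothing here represents any particular project's method):

```python
import pandas as pd

# Hypothetical daily sales figures with one gap and one extreme value
sales = pd.Series([120.0, 115.0, None, 118.0, 122.0, 990.0])

# Common Machine Learning preparation step: fill the gap with the field's mean
filled = sales.fillna(sales.mean())

# Flag outliers with the interquartile-range rule: anything more than
# 1.5 * IQR beyond the quartiles sits well outside the expected range
q1, q3 = filled.quantile(0.25), filled.quantile(0.75)
iqr = q3 - q1
outliers = filled[(filled < q1 - 1.5 * iqr) | (filled > q3 + 1.5 * iqr)]
```

Here `outliers` contains only the 990.0 reading, but whether that figure is a data-entry error or a genuinely exceptional day is exactly what the numbers alone cannot tell you.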
Finding the source
Good analysis should therefore make clear the source of the data used, as well as anything that has been removed or altered in the process. As a consumer, understanding this on a case-by-case basis is a huge effort and is clearly not going to be possible every time.
Making use of analytical pipelines is an approach that Techmodal has used on a number of our projects. Here, data is consistently collected in the same format and fed into a ‘pipe’. This pipe represents a series of steps joined together, ensuring you always get the same answer when you process the same data. It also gives clarity to the consumer: instead of asking questions of every result you are shown, you only need to validate the process once.
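A minimal sketch of the idea: a pipeline as an ordered list of named steps, each applied in turn. The step names, column names and sample data below are hypothetical illustrations, not the actual pipeline used on any project:

```python
import pandas as pd

def drop_test_records(df: pd.DataFrame) -> pd.DataFrame:
    """Remove rows flagged as test data in the source system."""
    return df[~df["is_test"]]

def fill_missing_with_mean(df: pd.DataFrame) -> pd.DataFrame:
    """Fill gaps in the value column with the column mean."""
    return df.assign(value=df["value"].fillna(df["value"].mean()))

# The agreed, auditable sequence of steps: validate this once,
# and every future run of the same data gives the same answer
PIPELINE = [drop_test_records, fill_missing_with_mean]

def run_pipeline(df: pd.DataFrame) -> pd.DataFrame:
    """Apply each step in order."""
    for step in PIPELINE:
        df = step(df)
    return df

raw = pd.DataFrame({"value": [1.0, None, 3.0, 99.0],
                    "is_test": [False, False, False, True]})
cleaned = run_pipeline(raw)
```

Because each step is a named function in a fixed order, the pipeline doubles as documentation of exactly what was measured and what was excluded.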
This agreed process or pipe can then be applied across lots of data in the future. As an approach, this gives consistency, efficiency and a clear, auditable process – useful attributes of any analysis. It also means the question of what is being measured has already been answered for any analysis you might be presented with – thereby enabling you to make more accurate and effective business decisions based on data.
Capability Lead – Data Science