August 18, 2020

Is bias a problem in Data Science?

In a world of ever-increasing data volume, Data Science is a necessity for delegating aspects of the thinking to machines, just as mechanisation moved the doing to machines. Data Science and Modelling generally require the simplification of a system to create tools that allow us to ask and answer questions: Can we do that with fewer people? How much will that cost?

Even though it would be an excellent challenge to address on a day rate, there is no conceivable way to model all of the uncertainty in the universe to answer these questions definitively. This is a heuristic challenge by necessity and that exposes technology to the same challenges which the human brain faces. Proverbially, “the best is the enemy of the good.” 

Humans are heuristic beasts, and that characteristic has been essential to survival and productivity for us and our forebears since the dawn of the nervous system some 600 million years ago. By using our own neural models, trained by a combination of genetics and experience, we instinctively know that fire is bad, and that the big green thing is a tree. But, as with everything in life, there is a compromise being made for all this functionality. 

How many people that you know would foreseeably make a deliberate choice to disadvantage a person because of race, gender or sexual orientation? Now ask yourself how many might make a decision with those same consequences, though with no ill intent, because of something they hadn’t considered. I expect the latter would yield a longer list. 

We are neurologically programmed to focus on the primary factors which have influenced outcomes important to us. When something is overlooked it is seldom by choice; our neural models have subconsciously made decisions for us. 

It may not surprise you then, that the ability of human infants to differentiate between faces of other races starts strong, but diminishes significantly between 3 and 9 months as “unimportant” neural pathways are brushed aside (Kelly et al., 2008). The human brain, in pursuit of a useful heuristic solution, prunes out what it does not need. Luckily, humans can (when we remember) overcome this by reaching out for new, broader sources of information. 

Any system which learns will be innately biased towards the lessons which it is taught. No model, in vivo or in silico, can ever be better than the data on which it was trained, and a Python package has no way of accessing knowledge beyond what it has been given. Algorithms to craft Machine Learning Models are completely unbiased – which is the source of our problem.

In practical terms, if you were to use Italy as the sole input to create a model for the study of population dynamics then you would have, in effect, given it a Catholic upbringing. The same is (in)famously true for facial recognition, in that many source data sets for training facial recognition algorithms consist primarily of white males, with unsurprising results when it comes to gender classification accuracy (Buolamwini et al., 2018).
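One simple way to make this kind of skew visible is to report accuracy per group rather than as a single headline figure. The sketch below is only an illustration: the group names and counts are invented, they are not taken from Buolamwini et al., and `accuracy_by_group` is a hypothetical helper rather than part of any library.

```python
from collections import defaultdict


def accuracy_by_group(records):
    """Return accuracy per group.

    `records` is an iterable of (group, true_label, predicted_label) tuples.
    """
    totals = defaultdict(int)
    correct = defaultdict(int)
    for group, truth, predicted in records:
        totals[group] += 1
        correct[group] += int(truth == predicted)
    return {group: correct[group] / totals[group] for group in totals}


# Invented evaluation results: group_a dominates the sample, group_b is rare.
records = (
    [("group_a", 1, 1)] * 90   # group_a, classified correctly
    + [("group_a", 1, 0)] * 2  # group_a, misclassified
    + [("group_b", 1, 1)] * 5  # group_b, classified correctly
    + [("group_b", 1, 0)] * 3  # group_b, misclassified
)

overall = sum(truth == predicted for _, truth, predicted in records) / len(records)
per_group = accuracy_by_group(records)

print(f"overall accuracy: {overall:.3f}")
for group, accuracy in sorted(per_group.items()):
    print(f"{group}: {accuracy:.3f}")
```

With these made-up numbers the headline accuracy looks healthy, while the under-represented group fares far worse. A single aggregate figure hides exactly that gap.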

Bias, however, is not always born of omission or blissful ignorance. In the field of bioscience, p-hacking (Head et al., 2015) is a recognised problem, where methods and results are cherry-picked and manipulated to produce a “significant” result. Here, bias is deliberately introduced to manipulate a system or result for gain.
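To see why cherry-picking works so well, consider how quickly the chance of a spurious “significant” result grows with the number of tests run. This is a minimal sketch, not a method described by Head et al. It assumes the tests are independent, that there is no real effect, and that p-values are therefore uniformly distributed between 0 and 1. The function names are hypothetical.

```python
import random


def chance_of_false_positive(n_tests, alpha=0.05):
    """Probability that at least one of n independent null tests has p < alpha."""
    return 1 - (1 - alpha) ** n_tests


def simulate_false_positive_rate(n_tests, alpha=0.05, n_studies=10_000, seed=0):
    """Estimate the same probability by simulation.

    Each simulated study draws n_tests p-values from a uniform distribution,
    which is how p-values behave when there is no real effect.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_studies):
        if any(rng.random() < alpha for _ in range(n_tests)):
            hits += 1
    return hits / n_studies


for n in (1, 5, 20):
    analytic = chance_of_false_positive(n)
    simulated = simulate_false_positive_rate(n)
    print(f"{n:>2} tests: analytic {analytic:.3f}, simulated {simulated:.3f}")
```

Under these assumptions, a researcher who tries twenty analyses and reports only the best one has roughly a two in three chance of finding something “significant” in pure noise.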

In contrast to a perfectly valid piece of research which yields nothing of interest, an output confirming a researcher’s hypotheses would be regarded as a success. Success in science is often measured in high impact publications. This in turn creates more funding opportunities for the individuals with more high impact publications, and so the behaviour propagates. 

The same incentives could plausibly arise in industry, enticing Data Scientists to include (or deliberately not remove) irrelevant variables in their models to boost apparent performance.

If we want to delegate the thinking to Machine Learning Models, then they must not be a reproduction of our reflexive observational powers but should include an appropriate level of critical thought and contextual awareness in their design. That can only be achieved by capturing the appropriate data and insight in design. 

We must also consider how to remove opportunity for obfuscation and mystification, which can lead to misunderstanding, distrust, and dishonesty. As experts and advisors we have a responsibility to our customers to ensure that work is explained at an appropriate level of complexity, and to encourage deeper questioning.

So, how can we ensure quality design, and minimise bias in Data Science?  

Guidance and best practice are becoming available to help. The UK Government’s Data Ethics Framework provides principles that can be used to hold Data Scientists to account, and help us to design appropriate analyses and models. 

We can also look to more established, related disciplines in our industry for inspiration. 

The field of cost modelling encourages good practice by accepting and acknowledging the likelihood of human error, and uses the mechanism of independent Validation and Verification to identify and address problems. Similarly, we must encourage and exploit diversity of thought to avoid bias and ensure that we capture contextual subtleties. 

Agile development principles are also an excellent fit to the field of data science. Regularly exploring work with our customers will foster open discussions with subject matter experts. In contrast to a waterfall-style project, we can flexibly change direction as we compare developing analyses and models to what is appropriate for our customers, their data, and their values. 

Ultimately, the problems underpinning bias in Data Science are not new, only the context. Luckily for us, our own neural models are sophisticated enough to draw on the relevant practices and experience to address the challenges.  

Thank you to Gwyn Wilkinson at UWE for the link to the Buolamwini paper, and Techmodal’s Eron Cemal and Pranav Patel for inspiration and an introduction to the fascinating concept of Perceptual Narrowing. 


To keep things topical, the 2020 A-Level results deserve a mention here. 

The 2020 A-Level results…

Ofqual (and the incumbent government) have received rather a lot of negative attention over “the Algorithm” designed to moderate predicted A-level results. While I’ve not had time to go through it in detail, Ofqual have published a relatively comprehensive 79-page requirements document which shows clearly that this was not a whimsical endeavour. In addition, some of the outputs are discussed in an interim report which indicates a good degree of accuracy on certain KPIs (more than 90% of students within plus or minus one grade).

This interim report goes into detail about attempts to address bias in teacher estimated grades and the rationale for implementing standardisation to reduce it. One particularly poignant paragraph states: “We have carefully considered whether it would be possible to centralise checks for systemic bias within centres. We concluded that this would be very difficult, if not impossible, to do in a timely fashion. Further, a centralised approach might be perceived to undermine rather than support teachers.” To Ofqual’s credit, they have stated that they will not seek to implement a system to overcome wide-reaching systemic issues.

To some this might seem a machinated plan to disenfranchise the unfortunate. I see it as an acknowledgement of the injustices in society, accompanied by a reminder of the commitment that Ofqual had to deliver outcomes on a pressing timescale. They are simply attempting to replicate the status quo, which is a justifiable compromise. 

Further to this, a heightened difference between prediction and awarded grade at lower performing schools is likely to have occurred based on past teacher behaviour with predicted grades. Independent analysis of GCSE estimations identified that “lower attaining schools tended to submit more optimistic results.” That may go some way to explaining why the algorithm seems (or is) targeted.

It would be much more of an entertaining read if I could slam Ofqual for failing to think, or plan appropriately, but at this time – I think that would be disingenuous. Despite a tight timeline, Ofqual engaged in an open consultation to which they received over 12,500 individual responses which were then factored into their decisions. They reviewed appropriate literature and were open and transparent in their approach. It seems that Ofqual made an honest attempt at creating a standardisation algorithm, to minimise bias, under some rather challenging conditions. So, what went wrong? 

On announcement of the U-turn Ofqual chair Roger Taylor made a very telling statement, noting that the approach “has technical merits” but “has not been an acceptable experience for young people”. The focus has not been as much on population-level accuracy as it has on the extremes. The backlash has been severe because for those who have been dealt an unfair hand, the consequences to their life (would) have been profound. 
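The gap between a reassuring population-level KPI and the experience at the extremes can be shown with a few lines of arithmetic. This is a sketch on invented data, not Ofqual’s figures or method: grades are assumed to sit on a hypothetical integer scale, and `summarise` is an illustrative helper.

```python
def summarise(predicted, awarded):
    """Return (share within one grade, count two or more grades down).

    Grades are integers on a hypothetical scale where higher is better.
    """
    differences = [a - p for p, a in zip(predicted, awarded)]
    within_one = sum(abs(d) <= 1 for d in differences) / len(differences)
    two_or_more_down = sum(d <= -2 for d in differences)
    return within_one, two_or_more_down


# Invented cohort of 100 students, all predicted grade 5 by their teachers.
predicted = [5] * 100
awarded = (
    [5] * 60    # awarded exactly the predicted grade
    + [4] * 20  # one grade down
    + [6] * 12  # one grade up
    + [3] * 6   # two grades down
    + [2] * 2   # three grades down
)

within_one, two_or_more_down = summarise(predicted, awarded)
print(f"within one grade: {within_one:.0%}")
print(f"students two or more grades down: {two_or_more_down}")
```

In this made-up cohort the headline KPI is comfortably above 90%, yet eight students still lose two or more grades. The summary statistic and the individual consequences are answering different questions.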

In 1753 Sir William Blackstone wrote (with respect to criminal law): 

“It is better that ten guilty persons escape than that one innocent suffer.” 

I would argue the same rule applies here. Closing the door on someone’s future with an algorithm designed to catch instances of positive bias does not seem a palatable trade… Unless that algorithm decides whether you get a mortgage, but that is one for another day. 

Author: Daniel Jones, Data Science Capability Lead