The AI data selection problem… and best practice to solve it

To gain the benefits of AI we need to minimise decisioning biases

We know that bias within data selection and hidden correlations within data sets are part of the problem. However, there’s very little agreement or accepted best practice on how to select low-bias data sets. We simply don’t have the “How to…” guides that help us in other areas of business and computing.

The inevitability of biases in data sets

To conduct any experiment, to design anything, to test anything, one has to select data. To select data a series of choices have to be made as to which data is relevant and which data can be ignored. You simply can’t include all data about everything. Lines have to be drawn and relevance established.

The problems with data selection often occur at the edges. Imagine a local council deciding on a new traffic scheme. It consults the residents of the roads that are directly affected. Should it consult the adjacent roads where overspill traffic could be redirected? It measures the traffic to baseline the scheme. For how long? During which seasons? Does the nature of measurement change the things that are being observed?

Our view is that working hard on data selection is at the heart of minimising bias in AI decisioning, and it's hugely important. The problem is that there is no foolproof, easy methodology available. The same is true of many hard problems.

It’s beginning to look a lot like science

Classically, science has been defined as Physics, Chemistry, Biology, and subsets of those disciplines. These subjects are governed by the scientific method, in which hypotheses are formed and experiments are conceived to test these hypotheses.

If a hypothesis survives repeatable experimental tests, its likely validity is strengthened, and it may then be described as a theory or law. However, the theory or law can always be disproven by experimental evidence that contradicts it. For something to be deemed scientific it always has to be capable of being disproved.

Data Science is not like this, nor is Artificial Intelligence. Yes, Data Science contains the word science, but it never sets out to create robust generalisable theories. Data Science is an amalgam of other disciplines: statistics, probability, computer science, psychology. It attempts to offer descriptions and workable predictions, but it isn’t science in the same sense that Physics is science. Data Science is neither generalisable nor refutable.

Part of our problem with AI topics such as the demand for explicable AI is that we are expecting scientific standards of explanation and repeatability from disciplines that are not scientific. It looks like science, because numbers and computers are involved, but it is not.

Causal and correlated versus just correlated

Here’s where science and Data Science really part company. True science is interested in causal relationships. A useful analogy for this is found in the highly topical subject of viruses.

If a virologist wants to understand the likely impact of viral mutations, she studies viruses in great detail, using a variety of methods. If a Data Scientist wants to advise a pharmaceutical company on stock-piling common flu treatments for the winter, she may look for correlations with seasons, weather, previous outbreaks etc. She does not necessarily have to understand viral mutation or transmission mechanisms in detail, although it may help.

Data Science, at least in business, is quite happy to deal with correlations, provided that these allow it to function. If the Data Scientist found that flu outbreaks had a high likelihood of occurring after the end of the Thanksgiving holiday, that would be useful information. If a Virologist found the same pattern, her work would be incomplete until she had explored the causal mechanisms and had a testable hypothesis as to why. Data Science does not have to deal with why.

Historic biases in social data

Our biggest bias-in-data problems occur with social data, because social data are a reflection of cumulative decisions made by people.

If we take wage distribution data, occupational data, educational achievement data, occupational health data, arrest data… they all contain social history and that social history reflects how different groups have been treated over decades. It therefore follows that any forward-looking decisioning that touches social data is immediately at risk of bias from the history that is embedded in that data.

Here’s an example of how difficult this becomes. It makes sense to direct policing towards areas where crime is likely to occur. Any measure of where crimes are likely to occur is based on crime reporting and crime detection. Crime detection is likely to be influenced by past policing patterns.

Some things are simply circular and it’s hard to break out of them. One way of addressing this particular example would be to distribute police evenly for a while. This would help reset the observed crime distribution. However, it is probably not a viable option as it would lead to accusations that police had been withdrawn from areas of higher crime and deployed to low-crime areas – a scientific approach that is politically unacceptable.

Making progress

At reputable.AI we believe in curtailing interminable debates and making practical progress.

Here is a series of actionable and practical steps derived from our observations of best practice in reducing embedded bias in AI and algorithmic systems. Note that we say reducing rather than eliminating bias. Bias cannot be eliminated.

1) Before selecting training data, think about consequences

This step is not strictly about data selection. It’s about the project as a whole.

There’s a tendency to jump straight into the fascinating world of data, but first we should think about the world of unintended consequences. At the outset of any project there’s a project purpose, and with AI projects this often involves an optimisation goal. It is worth working through, step by step, what inadvertently gets de-optimised in pursuit of that goal. Three techniques we consider worth exploring are:

      • Consequence Scanning from the responsible technology think tank DotEveryone
      • Stakeholder Equilibrium Theory
      • Conducting a pre-mortem on the project (what’s the most embarrassing cause for this project to fail?)

2) Avoid “data-donation” creep

This item comes early in the list because it happens first and then gets extended. Anybody who has developed a software product has probably done this.

In the absence of the data set that you really want, you use your own data – banking records, heart rate, image data – it matters not. Then you ask the rest of the team to donate their data. Once you’ve exhausted the immediate team, you move on to other colleagues in the same firm, or family and friends. The obvious result is you’ve biased the dataset from the outset, simply in the interests of getting the job done.

There is one way of partially avoiding this, which is to delete all of the initial training data and start again, i.e. retrain. The risk, though, is that some design decisions have already become embedded.

3) Match the market for which you are making decisions with your training data

To do this, you need to decide what market the application is to serve. For example, if you were building a mortgage decisioning application to offer loans to the whole of the UK market, you would select data derived and weighted to the whole of the UK, including economically deprived areas.

4) Once trained, test responses to sensitive sub-segments

A useful sensitivity check is to use a set of test data and look at the results as they apply to subsets of the population that could be at risk of not being treated fairly. For example, in the mortgage example, above, does the model treat gig workers fairly? They may not be able to show long-term work from a single income source, but nonetheless have resilient incomes.
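As a sketch of what such a sensitivity check might look like in practice (the segment names, incomes, and decision rule below are all invented for illustration):

```python
# Hypothetical sketch: compare a model's approval rates across sensitive
# sub-segments of a test set. Segment names and thresholds are illustrative.
from collections import defaultdict

def approval_rates(records, decide):
    """Group test records by segment and report the approval rate for each."""
    approved = defaultdict(int)
    totals = defaultdict(int)
    for r in records:
        totals[r["segment"]] += 1
        if decide(r):
            approved[r["segment"]] += 1
    return {seg: approved[seg] / totals[seg] for seg in totals}

# Toy decision rule: approve if monthly income exceeds a threshold.
decide = lambda r: r["monthly_income"] >= 2000

test_set = [
    {"segment": "salaried", "monthly_income": 2500},
    {"segment": "salaried", "monthly_income": 1800},
    {"segment": "gig",      "monthly_income": 2200},
    {"segment": "gig",      "monthly_income": 2100},
]

rates = approval_rates(test_set, decide)
```

A large gap between segment approval rates does not prove unfairness by itself, but it flags the model for closer human review.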

5) Weight the data set towards causal data (but not exclusively so)

AI looks for correlations, and correlations are not necessarily causal. It may be perfectly possible to construct a credit-rating application based on, for example, a correlation between creditworthiness and the proximity of a person’s home to a high-end shop. Of course, living near a designer boutique doesn’t cause you to be creditworthy; it is simply a plausible correlate of creditworthiness.

We should focus the training data set on things that are likely to be causally linked to the decision in question. In this case the set might include: income; other outstanding loans; average bank balance.

However, we are also relying on the AI to find patterns that humans miss, so outright data minimisation, or an attempt to be 100% causal, is not advisable.
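A minimal sketch of what this weighting could look like in code, assuming a simple dictionary-per-applicant representation (all feature names are hypothetical):

```python
# Hypothetical sketch of step 5: weight the feature set towards plausibly
# causal attributes, while retaining a small exploratory set so the AI can
# still surface patterns humans miss. All names are illustrative.
CAUSAL_FEATURES = {"income", "outstanding_loans", "avg_bank_balance"}
EXPLORATORY_FEATURES = {"postcode_region"}  # kept deliberately small

def select_features(record):
    """Drop everything outside the causal core and the exploratory set."""
    keep = CAUSAL_FEATURES | EXPLORATORY_FEATURES
    return {k: v for k, v in record.items() if k in keep}

applicant = {
    "income": 42_000,
    "outstanding_loans": 5_000,
    "avg_bank_balance": 1_200,
    "postcode_region": "NW",
    "distance_to_boutique_km": 0.3,  # correlational proxy, deliberately dropped
}
selected = select_features(applicant)
```

The point of the exploratory set is exactly the caveat above: outright data minimisation would stop the AI finding patterns humans miss.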

6) Actively select data

We know that data reflect biases within our society. For example, there is a well-documented gender pay gap in the UK. Within a given company this reflects many factors, possibly including a gravitation of women towards lower-paid and part-time roles.

A gender pay gap is not illegal; it is something that firms are being encouraged to reduce. What is illegal is paying a man and a woman differently for doing the same job. It happens, but it is illegal.

If we are asking an AI to perform a decision on wage data and we anticipate that the AI would find a probabilistic relationship between the attribute Gender = Woman and lower-than-average pay, we have a couple of options:

a) Strip out some of the historic data (assuming the M/F position is improving);

b) Normalise the distributions, i.e. select equal volumes of M/F data at particular pay points.

Doing b) means that the inference Woman ⇨ Lower_Pay cannot be made by the AI (other things being equal).

This is an example of training in the treatment that you want to happen, through training data selection, rather than accidentally training to historic patterns.
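Option b) could be sketched roughly as follows, assuming records carry a gender field and a salary; the band width and field names are illustrative:

```python
# Hypothetical sketch of option b): sample equal numbers of male and female
# records within each pay band, so Woman -> Lower_Pay cannot be inferred from
# volume imbalances alone. Assumes gender is recorded as "M" or "F".
import random

def balance_by_pay_band(records, band_width=10_000, seed=0):
    rng = random.Random(seed)  # fixed seed for reproducible selection
    bands = {}
    for r in records:
        band = r["salary"] // band_width
        bands.setdefault(band, {"M": [], "F": []})[r["gender"]].append(r)
    balanced = []
    for groups in bands.values():
        # Keep only as many records of each gender as the smaller group has.
        n = min(len(groups["M"]), len(groups["F"]))
        balanced += rng.sample(groups["M"], n) + rng.sample(groups["F"], n)
    return balanced
```

Note that this deliberately discards data, which is the trade-off of active selection: you are shaping the training distribution to the treatment you want, not to history.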

7) Capture protected attribute data

There are two schools of thought on this:

a) If you don’t capture protected attribute data then an algorithm can’t make decisions linked to it, so it can’t be explicitly biased;

b) Collect protected attribute data so that you can later check for bias.

Approach a) is probably wrong anyway: an AI is likely to find proxy correlations, so protected attributes can feature invisibly in its decisions.

That’s why we prefer approach b): it allows for validation. But data consent mechanisms will have to reflect this, and those using the data will have to ask questions that could be perceived as intrusive, e.g. “We are asking for data on your sexuality to help us treat all sexual orientations fairly.” That’s a tough one. However, there will be spending patterns that are particular to different sexualities. How do you reduce the likelihood of bias if you can’t collect this attribute data?
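With approach b), a post-hoc bias check becomes possible. One common rule of thumb is the “four-fifths” disparate-impact ratio; the sketch below assumes a simple list of outcome records and is illustrative, not a legal test:

```python
# Hypothetical sketch: because protected attributes were collected (approach b),
# we can compute a disparate-impact ratio after decisions are made. A ratio
# below 0.8 (the "four-fifths" rule of thumb) is often treated as a warning
# sign, not a legal finding.
def disparate_impact(outcomes, attribute, group_a, group_b):
    """Ratio of approval rates for group_a vs group_b; closer to 1.0 is fairer."""
    def rate(group):
        rows = [o for o in outcomes if o[attribute] == group]
        return sum(o["approved"] for o in rows) / len(rows)
    return rate(group_a) / rate(group_b)

# Toy outcome records; attribute values are placeholders.
outcomes = [
    {"orientation": "A", "approved": True},
    {"orientation": "A", "approved": True},
    {"orientation": "B", "approved": True},
    {"orientation": "B", "approved": False},
]
ratio = disparate_impact(outcomes, "orientation", "B", "A")
```

The key design point is that the protected attribute is used only in this audit step, never as a model input.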

8) Check the results

Ensure that AI is monitored and safeguarded with human review. Part of the definition of AI is that it is adaptive. This implies that weightings are being adjusted as the agent seeks a goal. It therefore needs to be monitored to ensure that in an effort to reach a goal the AI is not generating deficits or unintended consequences in other areas.

Good tools are coming onto the market to help with AI monitoring.

9) Set parameters and dates for retraining

How do we know when or if an AI needs to be retrained? Set explicit parameters to indicate when retraining should be considered. Set dates for revisiting (expiry dates).
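One way to make such parameters explicit is to pair a distribution-shift metric, such as the Population Stability Index (PSI), with a hard expiry date. This is a sketch under assumed threshold and date values, not a recommendation of specific numbers:

```python
# Hypothetical sketch of step 9: explicit, pre-agreed retraining triggers.
# Combines a Population Stability Index check over bucketed feature
# distributions with a hard expiry date. Threshold and date are illustrative.
import math
from datetime import date

PSI_THRESHOLD = 0.2        # common rule of thumb: > 0.2 suggests a major shift
EXPIRY = date(2026, 1, 1)  # revisit the model by this date regardless

def psi(expected, actual):
    """PSI between two bucketed distributions (lists of non-zero proportions)."""
    return sum((a - e) * math.log(a / e) for e, a in zip(expected, actual))

def needs_retraining(expected, actual, today):
    """Trigger retraining on either distribution drift or calendar expiry."""
    return psi(expected, actual) > PSI_THRESHOLD or today >= EXPIRY
```

The point is not the particular metric but that the trigger conditions are written down in advance, rather than left to someone noticing degraded performance.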

10) Make explicit “Use For” or “Trained For” statements

We can regard these as active statements of justified and acknowledged bias (in a neutral sense of the word).

Here’s an example of a Use For statement:

“This credit rating AI has been trained and tested for use in New Jersey. It is not intended for use in other geographies.”

The intention of such statements is to reduce generalisation of software to deal with patterns that have not been “trained in”. We can make similar statements for many AI systems. Nobody would propose using autonomous vehicle software trained in the UK in the US, without systematic retraining and testing.

If you’re wondering how granular such statements should be, this is determined by the training data set. Once the in-life data characteristics differ from the training data set, then you’ve probably moved beyond the Use For statement.
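A “Use For” statement can even be made machine-checkable: reject inputs whose characteristics fall outside the training data. This sketch invents a toy statement for the New Jersey example above; the fields and ranges are placeholders:

```python
# Hypothetical sketch: a machine-checkable "Use For" statement. The
# application refuses inputs outside the characteristics of its training
# data. Region and income range here are invented for illustration.
USE_FOR = {
    "regions": {"New Jersey"},
    "income_range": (10_000, 250_000),  # span of incomes seen in training
}

def within_use_for(record):
    """True only if the record matches the trained-for characteristics."""
    lo, hi = USE_FOR["income_range"]
    return record["region"] in USE_FOR["regions"] and lo <= record["income"] <= hi
```

Records that fail this check would be routed to human review rather than scored, keeping the model inside the envelope its training data justifies.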

A few conclusions

There are three real conclusions to this piece and our approach. Firstly, applying pure science expectations in the area of Data Science and AI would be mistaken. Secondly, expecting bias-free decisioning is unrealistic. Thirdly – and this is the optimistic part – we can do two things about bias: limit it and understand it. If we accept that all decisioning systems – human and machine – contain biases, and we are clear about the nature of these biases, we can progress.