HUMAN Protocol and the intricacies of bias

With the distinction between human and statistical bias highlighted in The basics of bias, this article examines the specific forms of bias that can adversely affect research. Bias is simply an imbalance in a dataset, and is best thought of as error. It can manifest as poor voice recognition technology or as accidental racism; while these differ in severity, the underlying problem and its solution are the same. Briefly, statistical bias occurs when the results derived from a sample of respondents over- or underestimate the results that would be found if the entire population were surveyed.

The error of bias can, to a large extent, be mitigated by volume, which increases diversity in the sample datasets. However, even the largest datasets do not necessarily eliminate bias; the choice of respondents is only one fragment of the bias problem, which is, in fact, systemic across the entire research process, from the sourcing of respondents (selection bias), to the procedure (measurement bias), to the interpretation of results (confounding). As we define each of these forms of bias, we examine the ways in which HUMAN Protocol offers real solutions to mitigate bias of all kinds throughout the research process.

Selection bias

At the highest level, selection bias indicates an error by which the tested population does not represent the population the researcher intends to study. There is an important distinction to be aware of here: selection bias does not relate to the idiosyncrasies in a study group, but rather to the error of that group not representing the group the researcher wishes to study.

When we are talking about AI products that understand the subtleties of the world they operate in, the target population cannot be too big. Accuracy improves with scale. The intention is to receive global feedback. The means, however, are often missing.

Selection bias is often not unconscious; resource limitations, rather than a wayward intention, determine what is practically feasible. Research institutions are currently so limited in resources that the researchers themselves are often used as data sources. ML PhD students are a minute niche, and a heavily biased sample of data labelers.

If you send a CAPTCHA to a website selling dentures, your sample population is more likely to self-select as elderly. This is an example of sampling bias, a subtype of selection bias.

While selection bias is often defined as the result of non-random methods for curating respondents, randomness does not equate to representative data sets. Randomness is simply the best way of preventing selection bias: what is in fact meant by randomness is really the exclusion of extraneous variables introduced into the study group once the parameters have been set. Randomness is not simply a case of complete blindness; in fact, the opposite is true.

If a researcher wants to poll opinion about food in a restaurant, a random selection of the target population is required. The target population is ‘all people who have eaten at this restaurant’. To achieve a representative sample population, the researcher must omit non-random factors by including, for example, diners from breakfast, lunch, and dinner across every day of the week.

That said, a researcher wouldn’t wish for complete randomness if the area of study is opinions about food in a restaurant. The researcher would want only people who have eaten the food. So the selection is not random, but highly specified. Randomness refers to there being no extra variable unaccounted for, such as the mistake of only surveying people who go into the restaurant on Tuesdays. Selecting a specific day would make it non-random, which is a problem if the purpose is to measure overall sentiment about the food in general.
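The Tuesday-only mistake can be illustrated with a small simulation. This is a minimal sketch with made-up numbers: the diner population, the ratings, and the assumption that Tuesday diners rate the food lower are all hypothetical and chosen only to make the effect visible.

```python
import random
from statistics import mean

DAYS = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]

def simulate_diners(per_day=1000, seed=0):
    """Build a made-up population of (day, rating) pairs.

    Assumption for illustration only: Tuesday diners rate the food lower.
    """
    rng = random.Random(seed)
    diners = []
    for day in DAYS:
        typical = 2.5 if day == "Tue" else 4.0
        for _ in range(per_day):
            # Clamp each rating to a 1-5 scale.
            rating = min(5.0, max(1.0, rng.gauss(typical, 0.5)))
            diners.append((day, rating))
    return diners

diners = simulate_diners()
true_mean = mean(rating for _, rating in diners)

# Non-random: survey only people who came in on a Tuesday.
tuesday_sample = [rating for day, rating in diners if day == "Tue"][:200]

# Random: every diner in the target population is equally likely to be picked.
random_sample = [rating for _, rating in random.Random(1).sample(diners, 200)]

print(f"all diners:     {true_mean:.2f}")
print(f"Tuesday only:   {mean(tuesday_sample):.2f}")
print(f"random sample:  {mean(random_sample):.2f}")
```

Both samples contain 200 diners who have eaten the food, so both are equally "specified". Only the random one leaves the day of the week free to vary, which is why its average should land much closer to the average for all diners.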

How HUMAN Protocol mitigates

What is required is both randomization and precision: precision in defining the target population, and randomization thereafter in selecting respondents from the corresponding pool. HUMAN Protocol provides a balance of both.

To specify the target population, the Protocol aims to give the researcher control of the bias. Requesters have tools to determine and pinpoint their target population. If they want the pool to be only users of a sporting website in the USA, that can be achieved. Equally, a job can be distributed across many websites, in different languages, reaching countries across each continent.

To facilitate randomness, the Protocol draws participants from, currently, 15% of the Internet. In AI, the parameters are likely to be broad, for the goal is to create globally representative AI products that understand the cultural, social, and racial subtleties of the world they operate in. HUMAN Protocol hosts the largest distributed workforce in the world: a diverse crowd of hundreds of millions of data labelers.
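The two steps described in this section, targeting first and randomizing second, can be sketched in a few lines. This is an illustration only and not the Protocol's actual API: the function `select_respondents`, the respondent fields (`country`, `site`), and the pool itself are hypothetical.

```python
import random

def select_respondents(pool, criteria, k, seed=None):
    """Narrow the pool to the target population, then draw k members at random."""
    # Step 1 (precision): keep only respondents matching every criterion.
    target = [
        person for person in pool
        if all(person.get(field) == value for field, value in criteria.items())
    ]
    if k > len(target):
        raise ValueError("target population is smaller than the requested sample")
    # Step 2 (randomization): sample without replacement from the target population.
    return random.Random(seed).sample(target, k)

# Hypothetical pool of 600 respondents with two made-up attributes.
pool = [
    {
        "id": i,
        "country": "US" if i % 2 == 0 else "DE",
        "site": "sports" if i % 3 == 0 else "news",
    }
    for i in range(600)
]

sample = select_respondents(pool, {"country": "US", "site": "sports"}, k=20, seed=42)
print(len(sample), "respondents drawn from the US sports-site population")
```

The criteria are the only non-random part. Within the population they define, no further variable (time of day, order of sign-up, and so on) influences who is chosen.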

Measurement bias

Measurement bias in the field of ML can be broken down into many categories. One example is interviewer bias, whereby interviewers skew the test results through their particular interview style; discrepancies between the groups are then attributed to innate group dynamics and qualities rather than to the difference in interviewer style.

Essentially, in this instance, the interviewer is the data scientist who frames the question differently from her colleague. With the same research brief of determining the likelihood of a cat in a given image, they send out for a million images to be labelled. One data scientist asks ‘Are there cats in this photo?’, and the other asks ‘Click all the squares containing a cat.’

The question itself has many presumptions and a subtle fingerprint of the scientist. The scientist has selected words – each choice reflects a position, an assumption, and will, inevitably, affect the outcome of the test.

If a CAPTCHA includes images without cats, the question ‘Are there cats in this photo?’ has a clear response. The responder is aware that the answer is binary and can therefore answer ‘no’ for an image in which there are no cats. If the question is ‘Click all the images containing cats,’ the case of ‘no cats’ is not accounted for; the question, while not ruling out the possibility of an image being without a cat, subtly suggests that there is a cat to be found, and that the test is measuring one’s ability to detect it. It is likely that the latter question would return inaccurate responses, with responders eager to find a cat.

How HUMAN Protocol mitigates

Increase the number of scientists involved in asking the question. HUMAN Protocol is permissionless; any ML shop can plug in to purchase the data labeling they need. If this kind of bias is innate and unavoidable – a consequence of being a matter of perspective – then the answer simply relies on having more perspectives to broaden consensus, and to highlight outliers.
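One way to picture how more perspectives can highlight outliers is to compare the results each question framing produces on the same set of images. The sketch below is an assumption about how such a comparison could be done, not a description of the Protocol; the framings, the positive rates, and the 0.10 tolerance are invented for illustration.

```python
from statistics import median

def flag_outlier_framings(positive_rates, tolerance=0.10):
    """Return the framings whose 'cat found' rate is far from the median rate."""
    mid = median(positive_rates.values())
    return {
        question: rate
        for question, rate in positive_rates.items()
        if abs(rate - mid) > tolerance
    }

# Hypothetical share of images labeled as containing a cat, per question framing.
rates = {
    "Are there cats in this photo?": 0.31,
    "Does this photo contain a cat?": 0.30,
    "Is a cat visible here?": 0.33,
    "Click all the images containing cats": 0.47,
}

outliers = flag_outlier_framings(rates)
for question, rate in outliers.items():
    print(f"possible framing effect: {question!r} -> {rate:.2f}")
```

With a single framing there is nothing to compare against. With several, a framing that nudges responders toward finding a cat stands out from the consensus.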

Pause, cancel, and relaunch jobs if an error is detected. While it is unlikely that an error will be detected, because measurement bias is subjective, personal, and subtle, the ability to cancel a job (and pay only for the work done) demonstrates the upside of the Protocol’s automated labeling platform.

Confounding

Whether confounding is technically a type of bias is up for debate, but the bottom line is the same: confounding occurs when an unaccounted-for variable influences the ‘cause and effect’ relationship being established. A confounding variable is correlated with both the supposed cause and the supposed effect, which can make the two appear causally linked when they are not.

For example, using a grammar labeling tool, a researcher observes that increased usage of the US English ‘soccer’ (as opposed to the UK English ‘football’) is associated with increased usage of the US English spelling ‘flavor’ (as opposed to the UK ‘flavour’). The mistake would be to presume causality, whereby an increase in usage of one word causes an increase in usage of the other.

Are they causal? No. The confounding factor could be nationality, education system, or the audience being written for.
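The ‘soccer’ and ‘flavor’ example can be reproduced with simulated writers. This is a minimal sketch under stated assumptions: nationality is the only confounder, each word choice depends on nationality alone, and the probabilities (0.9 and 0.1) are invented.

```python
import random

def simulate_writers(n=10000, seed=0):
    """Made-up writers: each word choice depends only on nationality."""
    rng = random.Random(seed)
    writers = []
    for _ in range(n):
        nationality = rng.choice(["US", "UK"])
        p_us_usage = 0.9 if nationality == "US" else 0.1
        # The two choices are drawn independently: neither causes the other.
        uses_soccer = rng.random() < p_us_usage
        uses_flavor = rng.random() < p_us_usage
        writers.append((nationality, uses_soccer, uses_flavor))
    return writers

def flavor_rate_gap(writers):
    """P(writes 'flavor' | writes 'soccer') minus P(writes 'flavor' | writes 'football')."""
    with_soccer = [flavor for _, soccer, flavor in writers if soccer]
    with_football = [flavor for _, soccer, flavor in writers if not soccer]
    return sum(with_soccer) / len(with_soccer) - sum(with_football) / len(with_football)

writers = simulate_writers()
overall_gap = flavor_rate_gap(writers)
us_gap = flavor_rate_gap([w for w in writers if w[0] == "US"])
uk_gap = flavor_rate_gap([w for w in writers if w[0] == "UK"])

print(f"all writers: {overall_gap:+.2f}")
print(f"US only:     {us_gap:+.2f}")
print(f"UK only:     {uk_gap:+.2f}")
```

Across all writers the two usages appear strongly linked. Once nationality is held fixed, the gap should shrink to roughly zero, which is what controlling for a confounding variable means in practice.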

How HUMAN Protocol mitigates

The Protocol has tools which allow you to target and specify your respondents in a detailed way using control variables. This provides the ability to mitigate confounding variables by controlling them. As such, and as already mentioned, the Protocol puts the bias into the hands of the researcher for them to account for.

The oracle system creates a real-time assessment of answer quality. If the sample of correct answers you feed to the Protocol (the ‘ground truth’) is sufficiently accurate to account for confounding variables, responses affected by them will be flagged.
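The general idea of scoring answers against a ground truth can be sketched as follows. This is not the Protocol's oracle implementation, whose details are not given here; the function name, the data layout, and the 0.8 threshold are assumptions made for illustration.

```python
def accuracy_on_ground_truth(answers, ground_truth):
    """Share of a labeler's answers that match the ground truth.

    Only items that appear in the ground truth are scored. Returns None
    when the labeler answered none of them.
    """
    scored = [item for item in answers if item in ground_truth]
    if not scored:
        return None
    correct = sum(answers[item] == ground_truth[item] for item in scored)
    return correct / len(scored)

# Hypothetical ground truth supplied by the requester.
ground_truth = {"img1": "cat", "img2": "no cat", "img3": "cat"}

# Hypothetical answers from two labelers; img4 has no known answer.
labelers = {
    "labeler_a": {"img1": "cat", "img2": "no cat", "img3": "cat", "img4": "cat"},
    "labeler_b": {"img1": "cat", "img2": "cat", "img3": "no cat"},
}

scores = {name: accuracy_on_ground_truth(ans, ground_truth) for name, ans in labelers.items()}
flagged = [name for name, score in scores.items() if score is not None and score < 0.8]
print("flagged for review:", flagged)
```

The check is only as good as the ground truth itself: if the known answers do not cover the cases where a confounding variable matters, answers skewed by it will pass unnoticed.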

The Protocol delivers results in real time, uploading them to an Amazon S3 bucket that only the Requester holds the private key to access. If the Requester is dissatisfied for any reason, they can cancel the job, paying only for the questions answered, and relaunch it with a new perspective. This feature can help with all kinds of bias. With measurement bias, for example, if you realise the question was too narrow (you asked for cows to be labeled, but meant cattle, because you want to include bulls), you can cancel.

For the latest updates on HUMAN Protocol, follow us on Twitter or join our community Telegram channel.

Legal Disclaimer

The HUMAN Protocol Foundation makes no representation, warranty, or undertaking, express or implied, as to the accuracy, reliability, completeness, or reasonableness of the information contained here. Any assumptions, opinions, and estimations expressed constitute the HUMAN Protocol Foundation’s judgment as of the time of publishing and are subject to change without notice. Any projection contained within the information presented here is based on a number of assumptions, and there can be no guarantee that any projected outcomes will be achieved.