Predictive text, voice recognition, and navigation systems; face identification, spam filters, and product recommendations: machine learning is used to build products we interact with every day. But what is it?
An algorithm is the set of rules that a machine is programmed to follow in order to complete a task. Machine learning is the practice of training a computer to develop its own algorithm, so that it can complete increasingly difficult tasks. The computer creates an algorithm from samples of data, often called ‘training data’, which it uses to generate predictions of appropriate answers and actions. The computer can, therefore, make decisions on its own, without being explicitly programmed in the unique context of that decision.
Machine learning is most useful when the relationships within data are complex, as they are in what is often called high-dimensional data. Formula 1 is a good example. The secret to winning a race is a complex balance of choices based on innumerable variables, from tire pressure and wind speed to track temperature and air humidity. Each car, with 120 sensors to relay race data, will send and receive about 750 million data points over a two-hour race. No human could process the relationships within that data. A machine, however, can.
It is important to note that however smart a computer may seem, it fundamentally does not know anything. At best, it can make highly accurate predictions based on what it has already seen. That is why practitioners have to be careful about the data they show machines - more on that below.
Machine learning techniques start with a training dataset. In the case of your computer’s predictive text, this may be tens of thousands of sentences that are used to detect similarity and, thereby, meaning between specific data points. By looking at ten thousand people typing an email to their boss, the machine can make a fairly accurate prediction that, at the end of an email, the word ‘Best’ will likely be followed by ‘wishes’.
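The ‘Best’ followed by ‘wishes’ prediction can be sketched as a simple word-pair counter. This is a minimal illustration, not how any particular keyboard works: the four training sentences and the function names `train_bigram_model` and `predict_next` are hypothetical, and real predictive text uses far larger datasets and more sophisticated models.

```python
from collections import Counter, defaultdict

# Hypothetical toy corpus standing in for tens of thousands of real sentences.
training_sentences = [
    "best wishes",
    "best wishes and thanks",
    "best regards",
    "all the best wishes",
]

def train_bigram_model(sentences):
    """Count how often each word is followed by each other word."""
    followers = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.lower().split()
        for current_word, next_word in zip(words, words[1:]):
            followers[current_word][next_word] += 1
    return followers

def predict_next(model, word):
    """Return the most frequent follower of `word`, or None if it was never seen."""
    counts = model.get(word.lower())
    if not counts:
        return None
    return counts.most_common(1)[0][0]

model = train_bigram_model(training_sentences)
suggestion = predict_next(model, "Best")
print(suggestion)
```

The model has no notion of what a wish is; it only reports which word followed ‘best’ most often in the data it was shown.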
Data labeling is the process of taking raw data and ‘labeling’ it to make it useful for machines to learn from. If the raw piece of data is an image of a road, then data labeling, or data annotation, is the practice of labeling the car, the fire hydrant, and the crosswalk in that image. The labels create the association between a word and an object, which is the foundation of making intelligent machines. Once the machine can recognize a crosswalk, it can learn the appropriate actions to take.
More often than not, data labeling requires a human to label the images. Sometimes, software built with machine learning can undertake the labeling process.
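To make the difference between raw and labeled data concrete, here is one way a labeled road image might be represented. The class names, file name, and pixel coordinates are all assumptions made for illustration; real annotation tools each define their own formats.

```python
from dataclasses import dataclass, field

@dataclass
class Annotation:
    label: str   # the word a human attached to the object
    box: tuple   # (x_min, y_min, x_max, y_max) in pixels, a hypothetical layout

@dataclass
class LabeledImage:
    image_path: str
    annotations: list = field(default_factory=list)

# Raw data: just an image file, with nothing said about its contents.
raw_image_path = "road_scene.jpg"  # hypothetical file name

# Labeled data: the same image plus human-provided labels.
labeled_image = LabeledImage(
    image_path=raw_image_path,
    annotations=[
        Annotation("car", (120, 200, 380, 340)),
        Annotation("fire hydrant", (410, 290, 440, 350)),
        Annotation("crosswalk", (0, 360, 640, 420)),
    ],
)

labels_present = sorted(a.label for a in labeled_image.annotations)
print(labels_present)
```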
There are three key machine learning techniques: supervised learning, unsupervised learning, and reinforcement learning.
In supervised learning, the machine learns from labeled data. Over time, and through the sheer volume of data, the computer can become more accurate. This is the most common type of machine learning used today.
Example: Humans label images of fire hydrants, traffic cones, and crosswalks. These labeled images are fed to a machine to create an algorithm for a driverless car.
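A minimal sketch of learning from labeled examples is shown below. It uses a nearest-centroid classifier, chosen here only because it is short; it is not the method used in driverless cars. The two numeric features (object height and width in meters) and all the values are invented for illustration.

```python
import math
from collections import defaultdict

# Hypothetical human-labeled examples: (height_m, width_m) -> label.
labeled_examples = [
    ((0.90, 0.40), "fire hydrant"),
    ((0.80, 0.35), "fire hydrant"),
    ((0.45, 0.25), "traffic cone"),
    ((0.50, 0.30), "traffic cone"),
    ((0.00, 3.50), "crosswalk"),
    ((0.00, 3.00), "crosswalk"),
]

def train_centroids(examples):
    """Average the feature vectors of each label into one centroid per label."""
    grouped = defaultdict(list)
    for features, label in examples:
        grouped[label].append(features)
    return {
        label: tuple(sum(column) / len(rows) for column in zip(*rows))
        for label, rows in grouped.items()
    }

def predict(centroids, features):
    """Return the label whose centroid is closest to the new example."""
    return min(centroids, key=lambda label: math.dist(centroids[label], features))

centroids = train_centroids(labeled_examples)
print(predict(centroids, (0.82, 0.36)))
print(predict(centroids, (0.05, 3.20)))
```

The quality of the predictions depends entirely on the labels supplied: mislabeled examples would shift the centroids and the predictions with them.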
In unsupervised learning, the machine detects patterns in unlabeled data. Rather than being supervised, the computer is left to identify patterns in the data on its own, and thereby creates ‘clusters’ of data based on similar qualities across the raw data.
Example: Fed a thousand news stories from across the web, a computer can, on its own, begin to find patterns in the text and thereby create categories of news. Articles that mention the Super Bowl, football, and a full-time score of 34-7 can be tagged under ‘Sports’, just as those referencing the Senate and Joe Biden can be tagged under ‘Politics’.
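The news example can be sketched with a very simple grouping rule based on shared words. This is an illustrative assumption, not a production clustering algorithm: the four headlines, the Jaccard similarity measure, and the 0.2 threshold are all chosen for the example.

```python
def tokenize(text):
    """Split a story into a set of lowercase words."""
    return set(text.lower().split())

def jaccard(words_a, words_b):
    """Share of words two sets have in common (0 = none, 1 = identical)."""
    return len(words_a & words_b) / len(words_a | words_b)

def cluster_stories(stories, threshold=0.2):
    """Put each story in the first cluster it resembles, else start a new one."""
    clusters = []  # each cluster: {"words": set of words, "members": list of indices}
    for index, story in enumerate(stories):
        words = tokenize(story)
        for cluster in clusters:
            if jaccard(words, cluster["words"]) >= threshold:
                cluster["members"].append(index)
                cluster["words"] |= words
                break
        else:
            clusters.append({"words": set(words), "members": [index]})
    return [cluster["members"] for cluster in clusters]

# Hypothetical unlabeled headlines.
stories = [
    "superbowl football final score",
    "senate vote on new bill",
    "football team wins superbowl",
    "senate debate over bill",
]

groups = cluster_stories(stories)
print(groups)
```

Note that the output is two unnamed groups of story indices. The algorithm finds that the stories belong together; attaching a name such as ‘Sports’ or ‘Politics’ to each group is a separate step.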
Reinforcement learning is the machine equivalent of human conditioning: a machine learns through trial and error, with a predetermined reward system.
Example: Models are trained to play games or drive vehicles by letting the machine know when it has made the right decision. Over time, it uses this information to determine the actions it should take.
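Trial and error with a reward can be sketched using tabular Q-learning, one common reinforcement learning algorithm. The environment below is a hypothetical five-cell corridor in which only reaching the rightmost cell is rewarded; the learning rate, discount, and exploration rate are arbitrary illustrative values.

```python
import random

N_STATES = 5              # cells 0..4 along a corridor; cell 4 is the goal
GOAL = N_STATES - 1
ACTIONS = ("left", "right")

def step(state, action):
    """Move one cell; reaching the goal pays a reward of 1, anything else 0."""
    next_state = max(0, state - 1) if action == "left" else state + 1
    reward = 1.0 if next_state == GOAL else 0.0
    return next_state, reward

def choose_action(q_values, state, rng, epsilon):
    """Mostly pick the best-known action, but sometimes explore at random."""
    if rng.random() < epsilon:
        return rng.choice(ACTIONS)
    best = max(q_values[state].values())
    return rng.choice([a for a in ACTIONS if q_values[state][a] == best])

def train(episodes=500, alpha=0.5, gamma=0.9, epsilon=0.2, seed=0):
    rng = random.Random(seed)
    q_values = {s: {a: 0.0 for a in ACTIONS} for s in range(N_STATES)}
    for _ in range(episodes):
        state = 0
        while state != GOAL:
            action = choose_action(q_values, state, rng, epsilon)
            next_state, reward = step(state, action)
            # Nudge the estimate toward reward plus discounted future value.
            target = reward + gamma * max(q_values[next_state].values())
            q_values[state][action] += alpha * (target - q_values[state][action])
            state = next_state
    return q_values

q_values = train()
# The learned policy: the highest-valued action in each non-goal cell.
policy = {s: max(ACTIONS, key=lambda a: q_values[s][a]) for s in range(GOAL)}
print(policy)
```

The machine is never told that ‘right’ is correct. It only receives the reward signal, and the preference for moving right emerges from repeated attempts.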
The problems with data are numerous. Data can be scarce, low quality, incomplete, or noisy; noisy data is unstable, unpredictable, and hard to glean useful information from. Below are some details of the problems faced in data, and the solutions HUMAN Protocol provides.
Data labeling services have not been accessible or affordable to most practitioners of machine learning. As a result, data labeling has often been left to graduate students and engineers, which creates a problem of bias.
Applications running on HUMAN Protocol already access hundreds of millions of data labelers. The Protocol is designed to fulfil jobs of many different scales: from the large BPO (business process outsourcing) of major ML practitioners, to university researchers who want a dataset annotated or feedback given to a model, to AI startups looking to have small or specific datasets labeled.
Bias in machine learning is different to regular human bias, as we discuss in this article. Bias in data would be, as described above, having only graduate students, researchers, and PhDs labeling data. It is one thing if they are labeling images of a dog, but another if they are creating an emotion-recognition technology to distinguish a happy face from a sad face. A small pool of labelers will likely imprint a bias - reflecting their specific world-view - onto the data. Given that emotion recognition varies from culture to culture, it would be appropriate to have a representative labeling workforce of many cultures to label the data.
That is why good data is representative data. To create globally appropriate AI products, the data used needs to be representative of different cultures, societies, and backgrounds.
Bias can occur at both ends of the process. Limited access to data labeling services limits the voices used to create AI products; limited data-labeling services do not necessarily provide representative workpools. HUMAN Protocol helps to democratize access to data within a permissionless system, meaning anyone can buy the data they need to create the products they want. Most importantly, however, the data they use comes from a global source: applications running on the Protocol are accessed by hundreds of millions of data labelers, across 247 countries and territories.
For a comprehensive account of how HUMAN Protocol mitigates bias, read our in-depth article on the subject.
When it comes to data labeling, poor labels are a problem. For applications like CVAT, where the labeler is required to draw a bounding box around, for example, a truck, it is easy for the box to be too small, too big, or simply inaccurate. This is a problem for the machines trained to find the truck. Such data creates ‘noise’ in the dataset, and can lead to inaccuracies in the learning.
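One common way to quantify how far a drawn box is from a trusted reference box is intersection over union (IoU). The sketch below is an illustration under assumptions: the article does not say which metric is used, and the box coordinates are invented.

```python
def intersection_over_union(box_a, box_b):
    """IoU of two boxes given as (x_min, y_min, x_max, y_max).

    Returns 1.0 for identical boxes and 0.0 for boxes that do not overlap.
    """
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Width and height of the overlapping region (zero if there is none).
    overlap_w = max(0, min(ax2, bx2) - max(ax1, bx1))
    overlap_h = max(0, min(ay2, by2) - max(ay1, by1))
    intersection = overlap_w * overlap_h
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - intersection
    return intersection / union if union > 0 else 0.0

# Hypothetical reference box around a truck, and three labeler attempts.
reference = (100, 100, 300, 200)
too_small = (150, 125, 250, 175)
slightly_off = (110, 100, 310, 200)
wrong_place = (400, 400, 500, 500)

for name, box in [("too small", too_small),
                  ("slightly off", slightly_off),
                  ("wrong place", wrong_place)]:
    print(name, round(intersection_over_union(reference, box), 3))
```

A quality check could then reject boxes whose score falls below a chosen threshold, which keeps the noisiest labels out of the training set.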
The Protocol is designed to support an independent and, therefore, decentralized network of oracles to manage job quality. Each component can be incentivized to catch human errors, to reward quality work, and to disincentivize any malicious behaviour. Workers who perform well can build their reputation, and then be prioritised for future jobs.
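The general idea of rewarding quality work with reputation can be illustrated with a majority-vote sketch. This is not the Protocol's actual mechanism, which the article does not specify; the worker names, the one-point scoring rule, and the functions `majority_label` and `update_reputation` are hypothetical.

```python
from collections import Counter

def majority_label(votes):
    """Return the label chosen by the most workers.

    `votes` maps worker name -> label. On a tie, the label seen first wins.
    """
    return Counter(votes.values()).most_common(1)[0][0]

def update_reputation(reputation, votes, step=1):
    """Raise the score of workers who agree with the consensus, lower the rest."""
    consensus = majority_label(votes)
    updated = dict(reputation)
    for worker, label in votes.items():
        change = step if label == consensus else -step
        updated[worker] = updated.get(worker, 0) + change
    return consensus, updated

# Three hypothetical workers label the same image.
votes = {"alice": "truck", "bob": "truck", "carol": "car"}
consensus, reputation = update_reputation({}, votes)
print(consensus, reputation)
```

Workers with higher scores could then be offered future jobs first. A real system would also need to handle cases where the majority itself is wrong, which simple voting cannot detect.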
For the latest updates on HUMAN Protocol, follow us on Twitter or join our community Telegram channel. Alternatively, to enquire about integrations, usage, or to learn more about how HUMAN Protocol supports machine-learning technologies, get in contact with the HUMAN team.
The HUMAN Protocol Foundation makes no representation, warranty, or undertaking, express or implied, as to the accuracy, reliability, completeness, or reasonableness of the information contained here. Any assumptions, opinions, and estimations expressed constitute the HUMAN Protocol Foundation’s judgment as of the time of publishing and are subject to change without notice. Any projection contained within the information presented here is based on a number of assumptions, and there can be no guarantee that any projected outcomes will be achieved.