Like it or not, many of the social media platforms we browse every day use machine learning in their recommendation systems. Your next recommended YouTube video, the Twitter topics you might be interested in, your next hotel booking: all of these rely on a black box of machine learning algorithms.
Perhaps the word "algorithm" is a little misleading here, since these are a different breed from the ordinary algorithms most programmers and mathematicians are familiar with. Machine learning algorithms act quite literally like a black box: you cannot see what is inside, but if you feed it data, it will return an output.
How would someone create such a black box, you might ask? Well, most machine learning algorithms are trained on real data. Say, for example, you want to create a black box that can tell pictures of hotdogs apart from pictures that are not hotdogs.
https://youtu.be/pqTntG1RXSY
To do so, we train the black box, or more technically the machine learning model, on labelled pictures of hotdogs and labelled pictures of things that are not hotdogs. The model works by extracting features of each category. For instance, it may pick up a hotdog-like feature, such as a sausage tucked between the two halves of a roll. Conversely, it may learn that a feature like a person's face signals that the image is not a hotdog.
This task is called image classification, and it generally falls under supervised learning. Supervised learning, as the name suggests, is when we feed the model labelled data, such as images, and the model is tasked with correctly predicting the labels of new images that are not in the training data.
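To make this concrete, here is a minimal sketch of such a supervised image classifier in PyTorch. The folder layout (`data/train/hotdog`, `data/train/not_hotdog`), the choice of ResNet-18, and the hyperparameters are assumptions made purely for illustration, not a reference to any particular app.

```python
# A minimal sketch of binary image classification (hotdog vs. not hotdog).
# Assumes a hypothetical folder layout: data/train/hotdog, data/train/not_hotdog.
import torch
import torch.nn as nn
from torchvision import datasets, models, transforms

transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
])

# Labels are inferred from the folder names: "hotdog" and "not_hotdog".
train_set = datasets.ImageFolder("data/train", transform=transform)
loader = torch.utils.data.DataLoader(train_set, batch_size=32, shuffle=True)

# A small pretrained backbone whose final layer is replaced with a 2-class head.
model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
model.fc = nn.Linear(model.fc.in_features, 2)

optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
criterion = nn.CrossEntropyLoss()

model.train()
for epoch in range(3):
    for images, labels in loader:
        optimizer.zero_grad()
        loss = criterion(model(images), labels)  # compare predictions with the labels
        loss.backward()                          # this is where the "learning" happens
        optimizer.step()
```

Notice that nowhere in this sketch do we tell the model which features to look for; it discovers them on its own from the labelled data, which brings us to the next point.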
Unfortunately, with these models you cannot manually tweak the features they learn. More specifically, you cannot stop the model from learning features you would rather it did not learn. This may seem harmless at first, but it leads to the following important point.
Algorithmic Bias
A lot of real-world training data is biased. The causes vary from case to case; for example, Rachel Thomas, in her post Five Things That Scare Me About AI, stated that “2/3 of the images in ImageNet (the most studied image data set in the world) are from the Western world (USA, England, Spain, Italy, Australia).”
What does that mean? It means that datasets like ImageNet lack geo-diversity, and studies have shown that, as a consequence, they exhibit an observable amerocentric and eurocentric representation bias.
In the long run, machine learning algorithms that learn from such data end up amplifying these biases when deployed. In the same post, Rachel Thomas provides examples of bias magnification across a variety of applications, including:
- Software used to decide prison sentences that has a false positive rate twice as high for Black defendants as for white defendants.
- Computer vision software from Amazon, Microsoft, and IBM performs significantly worse on people of color.
- Word embeddings, which are a building block for language tools like Gmail’s SmartReply and Google Translate, generate useful analogies such as Rome:Italy :: Madrid:Spain, as well as biased analogies such as man:computer programmer :: woman:homemaker (a short sketch of how such analogies are probed follows this list).
- Machine learning used in recruiting software developed at Amazon penalized applicants who attended all-women’s colleges, as well as any resumes that contained the word “women’s.”
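Word-embedding analogies like the ones above come from simple vector arithmetic over the learned embeddings. Below is a minimal sketch using gensim and its downloadable "glove-wiki-gigaword-100" vectors; the specific neighbours returned, and how stark the bias looks, depend on which embedding model you load, so treat this as an illustration rather than a reproduction of the studies cited above.

```python
# A minimal sketch of probing word-embedding analogies with gensim.
# Assumes the "glove-wiki-gigaword-100" vectors from gensim's downloader;
# results vary with the embedding model used.
import gensim.downloader as api

vectors = api.load("glove-wiki-gigaword-100")

# "Rome is to Italy as Madrid is to ?"  ->  vector(italy) - vector(rome) + vector(madrid)
print(vectors.most_similar(positive=["italy", "madrid"], negative=["rome"], topn=3))

# The same arithmetic can surface learned stereotypes:
# "man is to programmer as woman is to ?"
print(vectors.most_similar(positive=["woman", "programmer"], negative=["man"], topn=3))
```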
Bias in data is a real problem that we have yet to tackle, and we can clearly pinpoint its negative effects. As Rachel Thomas explains, as much as algorithmic bias reflects how biased the world really is, “our algorithms and products impact the world and are part of feedback loops.” Machine learning algorithms’ outputs are therefore not just an effect, but also a cause.
Is Bias Inherent to Algorithms?
It may be convenient to pin the blame on the algorithm that allows such biases. However, understand that none of these problems is inherent to algorithms. As an engineer behind such systems, you too can make a difference. Rachel Thomas gives some notable ways we can do better:
- Make sure there is a meaningful, human appeals process. Plan for how to catch and address mistakes in advance.
- Take responsibility, even when our work is just one part of the system.
- Be on the lookout for bias. Create datasheets for data sets.
- Choose not to just optimize metrics.
- Push for thoughtful regulations and standards for the tech industry.
Recommendation Systems
Perhaps the examples above sound a bit distant, or unrelatable to the average day-to-day internet user. Understand, however, that even the seemingly insignificant actions we take on online platforms become signals to a machine learning model. These signals are how the black box represents you, the user.
A more concrete example is YouTube, the platform that used to carry the tagline “broadcast yourself.” Nowadays, you have probably found yourself binge-watching recommended videos, or videos served up by autoplay. These recommendations are determined largely by the videos you click on and show interest in, and by the keywords you search for.
Yet, as harmless as this sounds, the YouTube algorithm is also tasked with keeping you on the platform for as long as possible. Realize that most online platforms are essentially the world’s best advertising companies: the longer you stay, the more ads you watch, and the more they get paid.
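To make that incentive concrete, here is a toy, entirely hypothetical sketch of engagement-driven ranking. The signals (predicted click probability and predicted watch time) and the scoring rule are invented for illustration; this is not YouTube's actual system.

```python
# A toy sketch of engagement-driven ranking. The signals and numbers are
# invented for illustration; this is NOT how any real platform is documented to work.
from dataclasses import dataclass

@dataclass
class Candidate:
    video_id: str
    predicted_click_prob: float     # e.g. from a model trained on your past clicks
    predicted_watch_minutes: float  # e.g. from a model trained on your watch history

def engagement_score(c: Candidate) -> float:
    # Expected watch time: the longer you are predicted to stay, the higher the rank.
    return c.predicted_click_prob * c.predicted_watch_minutes

candidates = [
    Candidate("calm_documentary", predicted_click_prob=0.20, predicted_watch_minutes=40.0),
    Candidate("outrage_clickbait", predicted_click_prob=0.60, predicted_watch_minutes=25.0),
]

for c in sorted(candidates, key=engagement_score, reverse=True):
    print(c.video_id, round(engagement_score(c), 2))
```

Under a score like this, the clickbait video wins simply because it is predicted to keep you watching, regardless of whether its content is true.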
So, how does YouTube make you stay longer? Couldn’t it just feed you videos related to the one you’re watching? The answer is yes and no. Yes, because related videos can almost certainly make you stay; no, because at the same time the company bears some responsibility toward the truth.
In particular, not all uploaded videos are factual, nor are they backed by scientific research. Inevitably, users upload fake news, hoaxes, conspiracy theories, and clickbait, and YouTube has a role to play here.
As much as it seems like the uploader is the one responsible for this false content, YouTube is equally responsible for keeping it out of the top charts. It is for the same reason that YouTube constantly removes adult content, which does not comply with its guidelines.
Truthful Content vs. False Beliefs
Here’s the catch: truthful, factual content doesn’t make as much money as conspiracy theories, hoaxes, and clickbait. To restate, these platforms are advertising companies, and their goal is to make money. Providing users with nothing but the truth, or letting only Ph.D.-holding experts make explainer videos, doesn’t make people stay longer. To many, such content is boring, dull, or at odds with their personal beliefs.
People want to watch and hear from people they resonate with, people who share their emotions, and things they perceive as true. These are what make consumers stay longer, and since it works, the platform ends up catering to this content as well. Unlike truthful content presented by an expert, conspiracy theories are, unfortunately, more entertaining to many.
So how would a platform like YouTube reconcile the need to provide truthful content with the need to keep its users around? Balance. Provide only the truth, and not many will be satisfied or stay long on the platform. Serve nothing but fake news, and people will start bombarding the platform with reports. So, balance the two. Perhaps the recommendation system starts out with truthful content, but gradually steers the user toward false content for the sake of audience retention.
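One way to picture this trade-off is as a weighted score between engagement and truthfulness. The sketch below is purely illustrative: the signals, numbers, and weight alpha are invented, and real platforms do not publish whether or how they do anything like this.

```python
# A toy illustration of the "balance" idea: a hypothetical ranking score that
# trades predicted engagement off against an estimated truthfulness signal.
# All numbers here are invented for illustration only.
def balanced_score(engagement: float, truthfulness: float, alpha: float) -> float:
    # alpha near 1.0 favours retention; alpha near 0.0 favours truthful content.
    return alpha * engagement + (1.0 - alpha) * truthfulness

# A factual video vs. a conspiracy video, ranked under different weightings.
for alpha in (0.9, 0.5, 0.1):
    factual = balanced_score(engagement=0.3, truthfulness=0.95, alpha=alpha)
    conspiracy = balanced_score(engagement=0.8, truthfulness=0.10, alpha=alpha)
    winner = "factual" if factual > conspiracy else "conspiracy"
    print(f"alpha={alpha}: factual={factual:.2f}, conspiracy={conspiracy:.2f} -> {winner}")
```

Sliding alpha toward engagement flips which video wins, which is exactly the tension described above.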
Remember, this happens not only on YouTube but on almost every social media platform. Whatever you want to see, the platform will find a way to provide it. Tom Scott gives a great example of this in his talk titled There is No Algorithm for Truth, and much of this discussion is inspired by it.
Consider this: say you were once a religious believer, you have since left your former faith, and you now search for content that speaks to your situation. What would a recommendation system suggest? Will it throw you videos of evangelical preachers of every denomination to win you back, or will it bombard you with videos of angry atheists? The answer is both. Whatever you want to believe, it will help you believe.
This is a central problem with online platforms. It is hard to control, and it presents you, the consumer, with content from both ends of the extreme. If you are inclined to believe Einstein is still alive, it will present you with conspiracy theories claiming exactly that. If, instead, you believe Einstein has truly passed away, it will provide you with videos whose evidence supports that claim.
What does this mean for you, the consumer? In this golden age of content, realize that as much as these algorithms try to steer you toward or away from the values, stories, and claims you believe in, you ultimately have a choice to make.
Know that machine learning algorithms are not 100% foolproof, and it is not simply the fault of the backend engineers that this phenomenon happens. It is immensely difficult to filter good content from bad, let alone juggle both truth and audience retention.
One way to put this into a day-to-day analogy is to imagine you’re in a supermarket. You walk up to a rack of shampoos, ranging from the cheapest to the most expensive. TV advertisements have told you time and again that Brand A works best: it kills 99.9% of germs and has tons of health benefits. But your best friend says Brand A is no good and doesn’t recommend it.
On the other hand, you know that Brand B works for you: you’ve used it for years, and your hair has been perfectly fine with it. Unfortunately, Brand B’s advertising isn’t as loud as Brand A’s, and you start doubting your choice.
Now, as a consumer, which will you pick? Brand A has endless advertisements, but you don’t know whether it suits your hair. With Brand B, you’re sure your hair will be fine; it’s just that TV doesn’t promote it as grandly as Brand A.
In the end, you walk out with Brand B. You chose to stick with what you know works best, and it was you who made the decision: not the TV advertisement, not the packaging, not your best friend, but you. External opinions merely influence your decision; it is ultimately yours to make.
Closing Remark
Allow me to conclude this article with the same question Tom Scott raises at the opening of his talk. What if a machine learning system were able to produce only objective truths, and its “objective truth” disagreed with one of your fundamental core values? Which is more likely: option A, that you just go “oh okay, I guess I’m gonna be heard less and less and we’ll just have to deal with that,” or option B, that you decide you need to shout louder and that the algorithm is wrong?
Whichever option you choose, know that biases exist in the world, and know that recommendation systems build upon the actions you take. The decision is yours to make, shaped by all your prior external influences, and the effects are yours to carry.
To wrap up this discussion, I shall quote a very meaningful paragraph written by Rachel Thomas in the same post with which we started this article:
“The problems we are facing can feel scary and complex. However, it is still very early on in this age of AI and increasing algorithmic automation. Now is a great time to take action: we can change our culture, cultivate a greater sense of responsibility for our work, seek out thoughtful accountability to counterbalance the inordinate power that major tech companies have, and choose to create more humane products and systems. Technology is just a tool, and it can be used for good or bad. Let’s work to use it for good, to improve the lives of many, rather than just generate wealth for a small number of people.”
Featured Image by Thomas Rowlandson, CC0, via Wikimedia Commons.