January 16, 2021

Let's not let garbage in

Two weeks ago, a crowd of people stormed the Capitol building in the U.S. If you followed the news on this, you probably know that many of those involved have already been arrested and charged.

Even though it is necessary for actions like these to have serious consequences, on one level I feel bad for these people.

Many of them did this because they have been living in an alternate reality. They were lied to, and they did what seemed reasonable to them given the information they had. For example, they believed the election was stolen. Don’t get me wrong though, I don’t think acting on incorrect and incomplete information justifies any sort of violence.

With that said, this whole situation reminded me of a saying we have in computer/data science: garbage in, garbage out. It means that the output of a program can only be as good as its input.

To a large extent, guarding against this is exactly what we do as data scientists. We decide what goes into our models so that they don’t produce garbage.

That’s why we care so much about cleaning our data, not making incorrect assumptions about it, and making sure there is no hidden bias in it. It's no coincidence that I focus on data preparation in great detail in my course Hands-on Data Science. Real-life data is dirty, confusing and inconsistent. As professionals, it is up to us to eliminate these issues and not let them corrupt the output.
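
To make "garbage in, garbage out" concrete, here is a minimal Python sketch. Everything in it is made up for illustration: the readings are hypothetical, and the assumption that -999 stands for a missing value is mine, not something taken from a real dataset. It only shows how a single unchecked value can wreck a simple average.

```python
from statistics import mean

# Hypothetical temperature readings in degrees Celsius.
# Assumption: -999 is a "missing value" sentinel and None is an empty record.
raw_readings = [21.5, 22.0, -999, 21.8, None, 22.3]


def clean(readings, sentinel=-999):
    """Keep only real measurements: drop None and the sentinel value."""
    return [r for r in readings if r is not None and r != sentinel]


# Garbage in: only the empty record is dropped, the sentinel is treated as data.
naive_mean = mean(r for r in raw_readings if r is not None)

# Cleaned input: the same calculation on validated data.
clean_mean = mean(clean(raw_readings))

print(f"naive mean: {naive_mean:.1f}")
print(f"clean mean: {clean_mean:.1f}")
```

The calculation is identical in both cases. Only the input changed, and so did the answer: the naive mean comes out far below zero, while the cleaned mean lands near 22.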

I believe we haven’t seen the half of it when it comes to how data can affect our lives. As artificial intelligence becomes a bigger part of our world, we as (future) practitioners bear a significant share of the responsibility. So be sure to equip yourself with a sound understanding of data and AI ethics from the beginning. Don’t brush them off as secondary skills.

Let’s not let garbage in.

Inspiration for this article came from Dan Carlin’s latest episode of the podcast “Common Sense with Dan Carlin”.