December 27, 2020

Pandas fundamentals every data scientist needs to know

Prefer to watch this? Check out the video version on YouTube.

Isn’t Pandas the best? It is such a great library with so much potential and so much flexibility. I remember the times when I just started using it. I immediately fell in love with it.

Don’t get me wrong, it does have a steep-ish learning curve. Not everything is obvious from the start, and a couple of concepts in Pandas are tricky. But those few concepts are exactly the things that will take you 80% of the way with just 20% of the effort.

Here are the main working principles of Pandas I have picked up over the years that will boost your coding performance.

Dataframes and Series are the main building blocks

You load your data into a data frame. A data frame consists of rows and columns. One row is one data point and one column is, well, a feature when you think about it in machine learning terms.

When you look closely though, a data frame consists of multiple Series objects. A Series object is basically a column. And when you put multiple columns (or Series objects) together, you get a data frame.
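To make the relationship concrete, here is a minimal sketch. The column names match the taxi example used later in this post, but the values are made up for illustration.

```python
import pandas as pd

# Two Series objects; the values are invented for this example.
trip_distance = pd.Series([2.5, 0.0, 7.1], name="trip_distance")
passenger_count = pd.Series([1, 2, 1], name="passenger_count")

# Placing Series side by side (axis=1) produces a data frame.
taxi_df = pd.concat([trip_distance, passenger_count], axis=1)

# Selecting a single column gives you a Series back.
column = taxi_df["trip_distance"]

print(type(taxi_df).__name__)  # DataFrame
print(type(column).__name__)   # Series
```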

As I mention in the Pandas Common Functions Cheat Sheet too, most functions of the Pandas library either apply to or return a Series object or a data frame.

When you realize this distinction and the relationship between the two, it becomes much easier to understand how dataframes work.

The index has a supporting role

Columns are the stars of the Pandas dataframe but rows can also be used for actions. Pandas has built-in functions that will apply to each row and also gives you the option to iterate through rows. 
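As a rough sketch of those two row-based options, the example below uses apply with axis=1 and iterrows() on a tiny made-up data frame. The passenger_miles name is something I invented just for this illustration.

```python
import pandas as pd

# Made-up data for illustration.
taxi_df = pd.DataFrame({
    "trip_distance": [2.5, 0.0, 7.1],
    "passenger_count": [1, 2, 1],
})

# apply with axis=1 calls the function once per row.
passenger_miles = taxi_df.apply(
    lambda row: row["trip_distance"] * row["passenger_count"], axis=1
)

# iterrows() yields (index label, row as a Series) pairs.
labels = []
for label, row in taxi_df.iterrows():
    labels.append(label)
```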

The index is the way to identify rows. In the example dataframe below, the index is the column of numbers at the very left, shown in bold. You can have other types of index too, such as strings.

Even though we don't have much to do with indexes, we still need to be aware of how they work. Occasionally, a Pandas function uses the index as a reference, so the index needs to be correct. As an example, if I sorted the above data frame by trip_distance, this is what I would get. You can see that the index is out of order now. The first data point's index value is 2 instead of 0.
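Here is a small sketch of that effect. The values are made up, and I am assuming an ascending sort, which is the default for sort_values.

```python
import pandas as pd

# Made-up distances; the default index is 0, 1, 2, 3.
taxi_df = pd.DataFrame({"trip_distance": [2.5, 7.1, 0.0, 1.2]})

# Sorting moves the rows, and each row keeps its original index label.
sorted_df = taxi_df.sort_values("trip_distance")
print(sorted_df.index.tolist())  # [2, 3, 0, 1]

# Label-based and position-based lookups now disagree.
by_label = sorted_df.loc[0, "trip_distance"]       # row labelled 0
by_position = sorted_df["trip_distance"].iloc[0]   # first row after sorting
print(by_label, by_position)  # 2.5 0.0
```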

This is normally not a big problem, but if you then want to do an operation on the data frame that takes indexes as reference, you might get unexpected results. For example, if I tried to plot the trip_distance column now, I would get this weird plot:

This is happening because the function that makes the plot takes the index as reference: the x-axis is the index and the y-axis is the trip_distance value. With the index out of order, this weird graph is produced.

There is a very neat solution to this though and it is resetting the index with reset_index(). So we get this dataframe:

It is again sorted by trip_distance but the index is in order. Don’t get confused by the fact that there is a column called index. It’s just the old index values. You can easily drop that column if you don’t need it.
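A minimal sketch of both variants, again on made-up values: plain reset_index() keeps the old labels in a column called index, while drop=True discards them.

```python
import pandas as pd

# Made-up distances, sorted so that the index is out of order.
taxi_df = pd.DataFrame({"trip_distance": [2.5, 7.1, 0.0, 1.2]})
sorted_df = taxi_df.sort_values("trip_distance")

# reset_index() returns a new data frame; the old labels become a column.
reset_df = sorted_df.reset_index()
print(reset_df.columns.tolist())  # ['index', 'trip_distance']

# drop=True throws the old labels away instead of keeping them.
clean_df = sorted_df.reset_index(drop=True)
print(clean_df.index.tolist())  # [0, 1, 2, 3]
```

Note that reset_index() does not change sorted_df itself, so you need to assign the result to a variable.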

Now if I plot this data frame, I get a proper plot:

Think of the entirety of the data frame when you want to change something

If you are coming from a programming background, like me, your first instinct would be to write loops (for or while) for everything you want to change in the data frame.

For example, if I want to increase passenger_count of each data point by 1, old me would have wanted to write a for loop, reading each row and increasing the passenger count by one.

But you don’t have to do things that way with Pandas. It is even discouraged. Instead, you can just say, hey, you see that column called passenger_count, increase all values by one. And this can be done with just 1 line of code.

This again comes from the fact that columns are objects in themselves. And just like saying increase an integer by one in Python, you can increase all values in a column by one.
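Here is a small sketch of the difference, with made-up passenger counts. The loop is only there for comparison.

```python
import pandas as pd

# Made-up passenger counts for illustration.
taxi_df = pd.DataFrame({"passenger_count": [1, 2, 4]})

# The loop-style approach: visit each value one by one.
looped = []
for value in taxi_df["passenger_count"]:
    looped.append(value + 1)

# The Pandas way: one line that applies to the whole column.
taxi_df["passenger_count"] = taxi_df["passenger_count"] + 1
```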

You can also do more of course. On top of basic math:

  • you can combine multiple columns to create a new column, 
  • you can change the format of columns, 
  • you can fill in missing values with a given value 

and much more.
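The three items above could look roughly like this. The fare_amount, tip_amount and total_amount columns are names I made up for this sketch, and so are the values. I fill the missing tip first so that the combined column has no gaps.

```python
import pandas as pd

# Made-up data; None marks a missing tip.
taxi_df = pd.DataFrame({
    "fare_amount": [10.0, 3.5, 24.0],
    "tip_amount": [2.0, None, 4.5],
    "passenger_count": [1.0, 2.0, 1.0],
})

# Fill in missing values with a given value.
taxi_df["tip_amount"] = taxi_df["tip_amount"].fillna(0)

# Combine multiple columns to create a new column.
taxi_df["total_amount"] = taxi_df["fare_amount"] + taxi_df["tip_amount"]

# Change the format (data type) of a column, here from float to integer.
taxi_df["passenger_count"] = taxi_df["passenger_count"].astype(int)
```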

The main takeaway is this: when you think of changing something in a data frame, think of the whole data frame, or at least of the columns as separate objects, rather than working row by row.

I have more examples of this in the video version of this article.

Being efficient in filtering will save you a lot of time

You will need to filter data points in a Pandas dataframe very frequently. There could be many reasons why you want to filter out some values in a data frame:

  • to see only a subset of the data frame, 
  • to use only data points with a certain value in your analysis, 
  • to exclude data points with a certain combination of values

and much more.

Pandas makes it super easy to do filtering. A filter consists of two things: the data frame shell and the conditions. These are not really official names; I came up with them.

The data frame shell is basically the name of the data frame and two brackets. Inside the data frame shell, you write the conditions that decide which data points are included and which are excluded. As in:

my_dataframe[<this is where conditions go>]

Here is an example. If I only want to see the data points that have a trip_distance greater than 0 in a dataframe called taxi_df, I would write a filtering line like this:

taxi_df[taxi_df['trip_distance'] > 0]

The inner phrase, which is taxi_df['trip_distance'] > 0, is what I call a condition. It returns a Series object where each value is either True or False, depending on whether the corresponding trip_distance is greater than 0 or not.
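You can try the condition and the shell separately to see what each one does. This is a minimal sketch on made-up values.

```python
import pandas as pd

# Made-up distances for illustration.
taxi_df = pd.DataFrame({"trip_distance": [2.5, 0.0, 7.1]})

# The condition on its own is a Series of True/False values.
condition = taxi_df["trip_distance"] > 0
print(condition.tolist())  # [True, False, True]

# Putting the condition inside the shell keeps only the True rows.
filtered = taxi_df[condition]

# The same thing written in one line.
filtered_inline = taxi_df[taxi_df["trip_distance"] > 0]
```

Notice that the filtered data frame keeps the original index values (0 and 2 here).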

You can have any sort of condition, including comparing strings, checking if values are in a certain list and more. You can negate a condition by placing a tilde (~) before it. You can also combine multiple conditions by wrapping each one in parentheses and joining them with the logical operators & (and) and | (or). It’s all very easy when you know that all you need are:

  • the dataframe shell and
  • a condition statement that returns True or False for each data point/row
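Here is a sketch of negating and combining conditions. The payment_type column and its values are made up for this example.

```python
import pandas as pd

# Made-up data; payment_type is a hypothetical column.
taxi_df = pd.DataFrame({
    "trip_distance": [2.5, 0.0, 7.1, 1.2],
    "passenger_count": [1, 2, 1, 5],
    "payment_type": ["card", "cash", "card", "dispute"],
})

# Check if values are in a certain list, then negate with a tilde.
paid_normally = taxi_df["payment_type"].isin(["card", "cash"])
not_normal = taxi_df[~paid_normally]

# Combine conditions with & (and); each one is wrapped in parentheses.
long_and_card = taxi_df[
    (taxi_df["trip_distance"] > 2) & (taxi_df["payment_type"] == "card")
]

# Combine conditions with | (or).
zero_or_crowded = taxi_df[
    (taxi_df["trip_distance"] == 0) | (taxi_df["passenger_count"] > 4)
]
```

The parentheses matter because & and | bind more tightly than comparison operators like > and == in Python.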

The community is always there for you

Sure, maybe they’re not there for you individually but they’ve been there for others and you can use the knowledge they have accumulated. There is so much information, so many questions and so many answers that I think it’s nearly impossible to ask a question that hasn’t been resolved yet.

That’s why when I get stuck, when I don’t know how to do something, the first thing I do is type what I’m trying to do into Google in a very simple and straightforward way. Most of the time, I find an answer to my question on the first try.

These are just some simple but effective secrets about Pandas that made my life much easier once I learned them. That’s why I wanted to share them with you.

If you want to dive deeper into Pandas and learn more about how some of the most-used functions work, check out my Pandas Common Functions Cheat Sheet. It includes definitions of functions grouped by what they’re used for, some tips on how they work and some general Pandas advice from me.

Good luck on your Pandas journey! I hope you enjoy it as much as I do!