How to build code you can be proud of as a data scientist
It is pretty much a stereotype that data scientists can’t write clean and understandable code. This doesn't have to be the case for you. By learning a few principles of how to write code properly, you can use the stereotype to your advantage and set yourself apart from the competition.
To find the best practices, we should look no further than the reliable practice of software engineering. Here are some software engineering principles to get you writing clean, readable and easy-to-work-with code as a data scientist:
When it comes to working with new technologies, one of the biggest mistakes I see data scientists make is to ignore the documentation. The documentation of a tool or library is written specifically to help users apply the tool to their work more easily.
When users fail to consult the documentation, the development phase could take too long or the library might not be used to its full potential.
So when you're about to use a tool you have never used before, make sure to read and understand the documentation. It will make your life easier.
While we’re on the topic, I also have to emphasise the importance of writing documentation. You might not be developing state-of-the-art code that will be used by thousands of people, but simply noting down what everything in your code does will be good enough.
It will be useful for the people who use your code after you, for your teammates who build upon your code, and also for you when you look back at what you wrote a couple of years down the road.
Your documentation doesn’t have to look super professional. Just a Word document describing what this piece of code does, what the input is and what the output is would be good enough.
Comment comment comment
Similar to creating documentation, commenting on the code itself is a lifesaver. Make it a habit to add a sentence or two at the beginning of every function you write, explaining what the function does and what the expected inputs and outputs are. And if there is some complicated logic in the function, just note down what the key lines do. For example: “this line calculates the average of each column”, “this line deletes the ones where the average is less than 10”, and so on.
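As a rough illustration of this habit, here is a minimal sketch in Python. The function name `drop_low_average_columns`, the layout of `table` (a dictionary mapping column names to lists of numbers) and the sample numbers are assumptions made for the example, not part of any particular library.

```python
from statistics import mean


def drop_low_average_columns(table, threshold=10):
    """Remove columns whose average is below a threshold.

    Input:  table, a dict mapping column names to non-empty lists of numbers;
            threshold, the minimum average a column needs to be kept.
    Output: a new dict containing only the columns that were kept.
    """
    # This line calculates the average of each column.
    averages = {name: mean(values) for name, values in table.items()}

    # This line deletes the ones where the average is less than the threshold.
    return {name: values for name, values in table.items()
            if averages[name] >= threshold}


# Hypothetical sample data: "south" averages 5, so it is dropped.
sales = {"north": [12, 15, 9], "south": [3, 4, 8]}
kept = drop_low_average_columns(sales)
```

Someone who opens this function a year from now can tell what goes in, what comes out and why each step is there without running anything.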
Design before coding
This is not the color/layout design. It is the design where you decide on the general flow of your project. In my head, I divide designing my code into two categories:
- Knowing what your input and output need to be: For everything you’re writing, there will be an input and an output. Decide what is going to be fed into your code and what it should produce. This could be a requirement from your business stakeholders or the type of file you were able to find online. Some examples would be “a CSV file I will download from the internet” or “answers in JSON format to a query I will send to the database”, etc.
- Writing code purposefully: While you’re developing your code, at every step, take some time to think about what’s about to come. You know your bigger goal, but what is the short-term goal? What are you trying to achieve in this phase of the project? In Master the Data Science Method, the planning step is built in as assignments. There are a couple of questions you need to answer before you move on to the implementation. Here is what a student says about these assignments:
“I find if I don't write my goals or hypothesis or outline for how I want to visualize something... then my brain can easily get lost or sidelined in the mechanics of setting the thing up. It really is easy to lose track of time and even lose track of the specific tangible baby step I was trying to achieve. Loved these helpful coaching questions!” — Nathan Eckel
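One way to put both points into practice is to write down the input, the output and the short-term goal as function descriptions before worrying about the details. The sketch below is only an illustration: the column names `city` and `temperature`, the sample CSV text and both function names are hypothetical.

```python
import csv
import io


def read_measurements(csv_text):
    """Input:  CSV text with the columns 'city' and 'temperature'.
    Output: a list of (city, temperature) tuples, temperature as a float.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    return [(row["city"], float(row["temperature"])) for row in reader]


def average_temperature_per_city(measurements):
    """Short-term goal for this phase: one average per city.

    Input:  the list returned by read_measurements.
    Output: a dict mapping each city to its average temperature.
    """
    # Group the temperatures by city before averaging them.
    temperatures_by_city = {}
    for city, temperature in measurements:
        temperatures_by_city.setdefault(city, []).append(temperature)
    return {city: sum(temps) / len(temps)
            for city, temps in temperatures_by_city.items()}


# Hypothetical input standing in for "a CSV file downloaded from the internet".
sample_csv = "city,temperature\nOslo,4\nOslo,6\nRome,18\n"
averages = average_temperature_per_city(read_measurements(sample_csv))
```

Writing the two docstrings first forces the decision about input and output to happen before any logic is written.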
Clean your code
It’s a bit abstract of course to say you need to clean your code. What I mean here specifically is that your finished code should not contain anything that is not necessary. This is especially a problem for data scientists. In Jupyter Notebooks, it’s very easy to try a couple of lines of code just to see what the outcome will be, and then forget to delete them or lose track of which cells are part of the program and which ones are not used anymore.
What I do to overcome this is to keep a draft notebook and a final notebook open. I start by coding in the draft notebook. I try everything I want to try in the draft notebook and implement the functionality step by step. And every time I finish a phase of the project (e.g. data exploration), I review and transfer the code to the final notebook. This way, I always keep my code organised and no unnecessary lines of code exist in the final notebook. It also helps keep errors or mistakes to a minimum since I review my code at the end of every phase.
Follow naming conventions and name clearly
It’s a big deal in software engineering to follow strict naming guidelines. For example, a guideline might say that variable names should start with a lower-case letter and that the first letter of every following word should be capitalised (soLikeThis), or that function names should start with a capital letter (FunctionNameXyZ).
I am not aware of a strict naming policy for the data science community and, to be honest, I don’t think it would be feasible to enforce one anyway. Nevertheless, I think it’s a great idea to be consistent with your own naming.
Just decide what your style is. You have a couple of options: capital letter (NewVariable) or no capital letter (newVariable), underscore (new_variable) or no underscore (newVariable) when it comes to naming your variables, classes and functions. Choose a style for each of these and stick with it. This will make your code much easier to follow.
There is one thing I am strict about though, and that is clear naming. You should name your variables, functions and classes based on their role, not randomly. Dataframe1 is not acceptable, especially if the data frame plays a key role in your code. This doesn’t mean that you should name it dataframe_where_location_and_time_is_merged_for_model_training. Keep it short and understandable for someone who would read your code. But then again, naming gets easier as you gain some experience, so don't stress yourself out trying to get it perfect.
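To make the difference concrete, here is a small sketch in Python. It uses underscores for variables and functions and capitalised words for classes, which is what Python's PEP 8 style guide recommends; any other style works as long as it is applied consistently. All names and values in it are invented for the example.

```python
# Unclear naming, shown only as a comment:
#   dataframe1 = ...
#   dataframe2 = ...
#   def f(a, b): ...

# Clearer: short names that describe the role of each object.
locations = [{"id": 1, "city": "Oslo"}, {"id": 2, "city": "Rome"}]
visit_times = [{"id": 1, "hour": 9}, {"id": 2, "hour": 17}]


class TrainingRow:
    """One merged record, ready to be used for model training."""

    def __init__(self, city, hour):
        self.city = city
        self.hour = hour


def merge_location_and_time(locations, visit_times):
    """Join location and time records that share the same 'id'."""
    hour_by_id = {record["id"]: record["hour"] for record in visit_times}
    return [TrainingRow(record["city"], hour_by_id[record["id"]])
            for record in locations if record["id"] in hour_by_id]


training_rows = merge_location_and_time(locations, visit_times)
```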
Keep a tidy file structure
From what I have observed, data scientists tend to download data, start notebooks, write code snippets and just keep them lying about on their desktop, and only when they really have to do they put everything together in a folder. This does not just look unprofessional, it’s also a killer of productivity. Who wants to start working when they know they first need to locate the notebook they need among 10 other notebooks they randomly created last week?
Keep all the project files in one place. Have a folder for your data and name your notebooks clearly so you know which one is the one you need to work on. That’s all. Nothing fancy. Just start this structure at the beginning of a project so you don’t have to spend hours later trying to bring everything together.
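If you want to set this structure up the same way every time, a few lines of Python can do it. This is only a sketch: the folder names `data`, `notebooks` and `outputs` and the function name `create_project_folders` are assumptions made for the example, so rename them to whatever suits your project.

```python
from pathlib import Path

# Hypothetical layout; adjust the folder names to your own project.
PROJECT_FOLDERS = ("data", "notebooks", "outputs")


def create_project_folders(project_root):
    """Create the project folder and its subfolders if they do not exist yet.

    Returns the list of subfolder paths so the caller can check the result.
    """
    root = Path(project_root)
    created = []
    for name in PROJECT_FOLDERS:
        folder = root / name
        # parents=True also creates the project root; exist_ok=True makes
        # the function safe to run again on an existing project.
        folder.mkdir(parents=True, exist_ok=True)
        created.append(folder)
    return created
```

Running it once at the very start of a project means the data folder and the notebooks folder are already there before the first file is downloaded.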
Use version control
Version control, for example keeping your code in a Git repository hosted on GitHub, will help you make changes confidently without being scared of breaking everything. It will help you track what changed, and it will keep your code happy and cosy on a server somewhere so it won’t be deleted just because someone left your company.
By “use version control” I don’t only mean that you should set it up. You should also use it actively during your project. Commit your changes with Git and push them to the remote repository every time you finish a new part of the project. Make it a daily habit to upload your changes and your work. This way, everything stays nice and organised and it will be easy for new people joining your team to get on with the work. And most importantly, when something does break, it will be trivial to revert to the latest working version.
When in doubt, debug
Debugging is the ability to find out what went wrong if something breaks in your code. It is one of the most common skills new data scientists lack if they don’t have a coding background. But that's no problem because much like everything else, it's just a skill that you need to learn.
The main thing you need is the instinct to debug, so that when you run into a problem, you won't feel helpless or lost. Instead, you will start debugging! This could include changing the values of variables, commenting out lines, or printing the values of variables to see if everything works as you expect it to. If not, that means there is a problem and you should address it.
If there is one thing I can tell you about debugging though, it should be this: change only one thing at a time when debugging. It’s sort of scientific when you think about it. If you want to know why something is not working and you have 5 different variables, you need to change them one by one to see which one affects the outcome. If you change two at the same time and see a change in the outcome, you cannot deduce which one caused the change. It sounds simple, but many new data scientists are too eager to fix the mistake, change several things at once, and in turn make their debugging last longer.
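The one-change-at-a-time rule can be sketched in a few lines of Python. The function `run_pipeline` below is a toy stand-in for whatever code is being debugged, and its settings (`scale`, `offset`, `drop_negative`) and the sample values are hypothetical.

```python
def run_pipeline(values, scale=1.0, offset=0.0, drop_negative=False):
    """Toy stand-in for the code being debugged.

    Returns the mean of the transformed values.
    """
    transformed = [value * scale + offset for value in values]
    if drop_negative:
        transformed = [value for value in transformed if value >= 0]
    return sum(transformed) / len(transformed)


values = [-2.0, 1.0, 4.0]
baseline = {"scale": 1.0, "offset": 0.0, "drop_negative": False}
changes_to_try = {"scale": 2.0, "offset": 1.0, "drop_negative": True}

baseline_result = run_pipeline(values, **baseline)

# Change exactly one setting per run and keep the rest at the baseline,
# so any difference in the result can be traced back to that one setting.
effects = {}
for name, new_value in changes_to_try.items():
    settings = {**baseline, name: new_value}
    effects[name] = run_pipeline(values, **settings) - baseline_result
    print(f"{name}={new_value!r}: result changed by {effects[name]:+.2f}")
```

Because every run differs from the baseline in a single setting, each printed difference points at exactly one possible cause.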
All in all, no one is perfect when it comes to coding. But if you follow some simple rules used by software engineers, you will improve your coding. Once you start coding every day as part of your job, it will be your habits that count, and what better way to build better habits than to start adopting best practices today!