April 2, 2020

Easiest way to set up your data science environment

Data science is analysing data, and analysing data means writing code. But where? It’s fairly straightforward when you need to write code for a normal program. You can do it in a text editor, and as long as there is syntax highlighting you’ll be fine. That is not the case for data science, though. You want a way to run a couple of lines of code at a time, see their result and correct as necessary. That's why what you need is a notebook.

A notebook is a piece of software that looks like the image below and lets you run cells. Cells are just lines of code, as many or as few lines as you want. Inside the cells, you can import the data, analyse it, create inline visualisations to see how your data looks, print tables, check that everything is correct, train your model, watch the training progress and so on.
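
To make the idea of cells concrete, here is a rough sketch of what two cells might contain. The numbers are made up for illustration, and only Python's standard library is used so that it runs anywhere. In a real notebook, each commented section would live in its own cell.

```python
# Cell 1: define some data (these daily temperatures are invented for illustration)
temperatures = [21.5, 23.0, 19.8, 22.4, 24.1]

# Cell 2: analyse the data and look at the result straight away
from statistics import mean

average_temperature = mean(temperatures)
print(f"Average temperature: {average_temperature:.2f}")
```

If the result looks wrong, you would fix the first cell and rerun only the cells that depend on it, instead of running a whole program again.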

You can go as fancy as you want when setting up a data science environment. There are many tutorials out there showing how to customise and personalise your environment for perfect comfort. I want to show you the simplest way that you can get set up, so that you can stop worrying about having a working environment and start coding away. If you prefer watching a video tutorial, see the video below and follow the steps there.

Here is what we are going to do:

  • Install Anaconda, which gives us Python, Jupyter Notebook and the libraries we need
  • Create a Jupyter Notebook
  • Check that we have the libraries we need to start with
  • Set up version control, to be able to track changes in our code

Step 1 - Python through Anaconda

Python is the most commonly used language for data science, and if you are just starting out I would suggest you start with it. Anaconda is a popular distribution of Python that also makes it easier to have Jupyter notebooks on your computer. It is actually the way of getting Jupyter notebooks that the people behind Jupyter recommend. It comes with several common libraries pre-installed, so you don't have to worry about installing them yourself. Some of these common libraries are pandas, numpy, scikit-learn and matplotlib.

Here are the steps for installing Anaconda:

1. Go to Anaconda’s website

2. You'll see that they have two options for installers. This is because Python 2 and 3 have syntactical differences. If you do not have a preference or you do not know what the differences are, download the Python 3 version. Choose the graphical installer.

3. Follow the installation process.

4. After the installation is done, go to your applications folder and start Anaconda-Navigator

5. Locate the Jupyter Notebook app and click launch
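
The second installation step mentions syntactical differences between Python 2 and 3. As a small illustration (not an exhaustive list), here are two of the best-known ones, written as Python 3 code with the Python 2 behaviour described in the comments:

```python
# In Python 3, print is a function and needs parentheses.
# In Python 2 it was a statement, so you could write: print "Hello"
print("Hello")

# In Python 3, dividing two integers with / gives a float.
# In Python 2 the same expression, 7 / 2, gave 3.
half = 7 / 2

# Floor division with // gives the whole-number result in both versions.
whole = 7 // 2

print(half, whole)
```

You can run this in a notebook cell once you have one open. If it runs without errors, you are on Python 3.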

Step 2 - Create your first notebook and check for the necessary libraries

1. In the new browser tab that opened after you launched Jupyter Notebook, go to the directory you want your notebook to be in

2. On the right of the screen, click New > Python 3. This creates and opens a new notebook that uses Python 3 as the language.

3. In the first cell, write or copy and paste the following lines
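
A minimal sketch of such a check, assuming the libraries you want are the four named in Step 1 (pandas, numpy, scikit-learn and matplotlib), would be:

```python
# Each import fails with an error if the library is not installed
import pandas as pd
import numpy as np
import sklearn  # scikit-learn is imported under the name "sklearn"
import matplotlib.pyplot as plt
```

The short names pd, np and plt are only conventions, but you will see them in most tutorials.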

If everything went well, there should be no output and the brackets to the left of your cell should show a number.
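
If the cell raises an error instead, one of the libraries is probably missing. The sketch below is one way to find out which. Note that find_missing is a hypothetical helper name chosen for this example, and the list of names is an assumption based on the libraries listed in Step 1.

```python
import importlib.util


def find_missing(module_names):
    """Return the names from module_names that cannot be found for import."""
    return [name for name in module_names
            if importlib.util.find_spec(name) is None]


# Import names assumed from the libraries listed in Step 1
# (scikit-learn is imported as "sklearn")
required = ["pandas", "numpy", "sklearn", "matplotlib"]
missing = find_missing(required)
print("Missing libraries:", missing if missing else "none")
```

Anything it lists can usually be installed from the Environments tab in Anaconda Navigator.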

Step 3 - Set up version control

First of all, you need a GitHub account for this part of the guide. If you don't have an account, go ahead and create one now.

1. Once you have logged in to your account, click the plus sign on the top right and choose New Repository

2. Name your repository anything you want

3. On the "Add .gitignore" drop-down menu, choose Python. A ".gitignore" file tells Git which files to leave untracked, so that redundant and unnecessary files never get committed or pushed to the repository. Which files those are depends on the language, which is why the menu asks you to pick one.

4. After the repository is created, click the Clone or download button and copy the clone link. You can use the small button shown below.

5. Now start a terminal window and navigate to the directory you want to have your project in. In the terminal, the command ls shows you everything inside the directory you're in, and cd followed by a directory name moves you to that directory. If you have never used the terminal and are not sure how to do it, just type: cd ~/Desktop (on macOS or Linux). This will navigate you to your Desktop.

6. Now write "git clone " and follow it up with the clone link you copied. An example line would look like this: git clone https://github.com/<your username>/<name of your repository>.git

7. You will see that a new folder is created in the location you chose to host your repository.

8. Now, move the notebook file you created into your new repository folder

9. In the terminal type: cd <name of your repository>

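10. Type the following three commands, one at a time, to push your changes to GitHub: git add . to stage your changes, git commit -m "<a short message describing the change>" to record them, and git push to upload them.

Now if you go back to GitHub in your browser, you will see that your notebook file is there. Every time you make a change to your project, you should use the last three commands to push it to GitHub.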

I hope all the steps worked for you and you now have a first understanding of how to use these tools. As you work more and more with them, you will learn to use them better. So don't rush. If you encounter problems, Google them until you figure out a solution. When you're just starting out, it's normal to feel like you are making a lot of mistakes. Just accept them and trust yourself to solve them.

Let me know if you start some personal/side data science projects!