Data Science project in 3 steps

When working on Data Science projects, you really need to know what you are doing. We learned that the hard way, so I want to give you some tips on how to prepare yourself and structure the process of creating your project. This post isn’t only for developers; it is for non-technical people too. You can adapt the process to your situation, but the core ideas should remain the same. Let’s get right to it!

Step 1: Define the outcome

You might think that it is quite obvious what you want to get at the end. Remember that things that are obvious to you might not be obvious to Finance, Developers, Marketing or any other department you will be working with on your project. Stating exactly what you want to get and defining the vocabulary (what are sales, what is balance, etc.) is essential. Everyone on your team must speak the same language, or you will end up in situations like this:

Boss: Please provide me with the data in the form of this Excel template.
Me: Sure, I’m right on it.
— 3 days of work later —
Me: Here is the file you wanted.
Boss: Wait… this isn’t what I had in mind.

If you give a template for the outcome, also provide a dictionary that explains what each field means and how you expect it to be calculated. Once everyone knows what the goal is and it has been approved (ask people to describe it back to you, to be sure!), we can move to the next step.
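To make the idea of a field dictionary more concrete, here is a minimal sketch in Python. Everything in it is an assumption made for illustration: the field names, their meanings, the calculations and the `check_row` helper are hypothetical, and in practice the dictionary could just as well live in a spreadsheet or a shared document.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FieldDefinition:
    name: str         # column name in the output template
    dtype: type       # expected Python type of the value
    meaning: str      # what the field means, in plain language
    calculation: str  # how the value is supposed to be calculated


# Hypothetical entries; agree on the real ones with your team.
DATA_DICTIONARY = [
    FieldDefinition("sales", float,
                    "Net revenue for the month, excluding tax",
                    "sum of invoice totals minus refunds"),
    FieldDefinition("balance", float,
                    "Account balance on the last day of the month",
                    "opening balance plus payments minus charges"),
]


def check_row(row, dictionary):
    """Return a list of problems found in one output row (empty means OK)."""
    problems = []
    for field in dictionary:
        if field.name not in row:
            problems.append(f"missing field: {field.name}")
        elif not isinstance(row[field.name], field.dtype):
            problems.append(
                f"{field.name}: expected {field.dtype.__name__}, "
                f"got {type(row[field.name]).__name__}"
            )
    # Fields nobody defined are a sign that the vocabulary has drifted.
    defined = {field.name for field in dictionary}
    for extra in sorted(set(row) - defined):
        problems.append(f"undefined field: {extra}")
    return problems
```

Calling `check_row({"sales": 100.0, "balance": 20.5}, DATA_DICTIONARY)` returns an empty list, while a row with a missing, mistyped or undefined field returns one message per problem. The point is less the code than the habit: the definitions are written down once and everyone checks against the same ones.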

Step 2: Prepare the data

Image: Dilbert comic about data. Taken from https://dilbert.com/strip/2008-05-07

Working with data is hard - there is no doubt about it. It gets worse when you have to work with a huge amount of data taken straight out of your company database. While it is tempting to grab everything, you should never take data that you don’t need to fulfil the goals of this exact project. This means that the data for each project should be gathered separately and then cleaned according to the current task. You can reuse some of the methods from other projects, but think of each project as a closed environment. So, to sum up the steps:

  1. Define the data you will need to complete the project
  2. Gather the data from the data sources
  3. Clean the data so that it is useful for achieving the goal

Each step should be discussed with your team. We don’t want to remove data that might be useful, and we don’t want to take too much into consideration either.
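The three data steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the field names, the sample rows and the cleaning rules are all made up, and a plain list of dictionaries stands in for whatever your company database returns.

```python
# Step 1: define only the fields this project needs (hypothetical names).
REQUIRED_FIELDS = ("customer_id", "month", "sales")


def gather(source_rows, required_fields=REQUIRED_FIELDS):
    """Step 2: keep only the required fields from each source row."""
    return [
        {name: row.get(name) for name in required_fields}
        for row in source_rows
    ]


def clean(rows):
    """Step 3: drop incomplete rows and convert values for this task."""
    cleaned = []
    for row in rows:
        if any(value is None or value == "" for value in row.values()):
            continue  # assumed rule: incomplete rows are dropped
        cleaned.append({
            "customer_id": int(row["customer_id"]),
            "month": str(row["month"]).strip(),
            "sales": float(row["sales"]),
        })
    return cleaned


# Stand-in for rows read from a company database.
source = [
    {"customer_id": "1", "month": "2020-01 ", "sales": "120.50",
     "email": "a@example.com"},
    {"customer_id": "2", "month": "2020-01", "sales": "",
     "email": "b@example.com"},
]

project_data = clean(gather(source))
```

Note that the `email` column never makes it into `project_data`, because the project did not ask for it, and that the rule for incomplete rows is exactly the kind of decision worth agreeing on with the team before it is coded.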

Step 3: Write code that will achieve the goal

Once you have defined the outcome and prepared the data, the most interesting part can start - writing the code that fulfils your project’s needs. After all this time you might want to jump right into it, but that might be a bad idea. Start by writing down the steps you need to code to get the output, think of the algorithms that will help you achieve that, write tests, and only then implement the steps. Remember that doing anything in Data Science (or in programming in general) without thinking about the process first will blow up your project faster than you can say “I can’t imagine why”.
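Here is a small sketch of what “steps first, then tests, then implementation” can look like in Python. The task (monthly sales totals), the function name and the sample rows are hypothetical; the test functions use plain `assert` statements so that a runner such as pytest can pick them up, but they also work when called directly.

```python
from collections import defaultdict

# Steps written down before any code:
#   1. group the rows by month
#   2. add up the sales within each month
#   3. return one total per month


def total_sales_by_month(rows):
    """Sum the 'sales' value of every row, grouped by its 'month'."""
    totals = defaultdict(float)
    for row in rows:
        totals[row["month"]] += row["sales"]
    return dict(totals)


# Tests that describe the expected output for known input.
def test_sums_sales_within_each_month():
    rows = [
        {"month": "2020-01", "sales": 100.0},
        {"month": "2020-01", "sales": 50.0},
        {"month": "2020-02", "sales": 25.0},
    ]
    assert total_sales_by_month(rows) == {"2020-01": 150.0, "2020-02": 25.0}


def test_empty_input_gives_empty_result():
    assert total_sales_by_month([]) == {}
```

The tests double as a written record of what the team agreed the output should be, which ties this step back to the outcome defined in Step 1.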

Conclusion

These steps should help you organize your project and your team. Remember that jumping straight into tasks isn’t always the best idea (it almost never works in a proper project environment, and if it does… well, let’s see in a year). This post is only here to guide you: modify or change the process however it fits your needs. For us, this seems to work, and I hope it will work for you too.