Software engineering for data scientists – Part 3, No Man Is an Island

As we’ve seen in our previous posts, developing software is not only about writing code. You have to use quality tools to write maintainable code that both humans and machines can read, and you have to automate your workflow while following industry best practices (e.g. using git and writing unit tests). We can’t emphasize enough that software development is teamwork. Development methodologies help you manage the complexity of a project and coordinate your and your co-workers’ efforts to build good software. Putting Machine Learning (or Data Science/AI/etc.) into the mix complicates the situation, but it’s worth trying to follow a structured approach, since no man is an island and you write software for others, not for your drawer.


Following a software development methodology to the letter is very similar to ideological fundamentalism. As Heraclitus said, you cannot step into the same river twice. Likewise, you cannot do the same project twice, so you will never face exactly the same challenges twice. Software development methodologies provide you with a framework to deal with problems, but you have to fill in the frame yourself.

A Brief Philosophical Aside

Naur describes software development as a way to externalize common knowledge. Although knowledge has often been thought of as a rigid, unchangeable thing, in reality our understanding of a problem tends to change over time, and so the software should change over time too. In this sense, programming is a way of theory building.

[…] Accepting that programs will not only have to be designed and produced, but also modified so as to cater for changing demands, it is concluded that the proper, primary aim of programming is, not to produce programs, but to have the programmers build theories of the manner in which the problems at hand are solved by program execution. […] programming properly should be regarded as an activity by which the programmers form or achieve a certain kind of insight, a theory, of the matters at hand. This suggestion is in contrast to what appears to be a more common notion, that programming should be regarded as a production of a program and certain other texts.

Peter Naur: Programming as Theory Building

If you are interested in this very philosophical side of software development, read Michael Polanyi’s Personal Knowledge to learn more about the tacit knowledge we use to build our (scientific) theories.

How to start your data science project?

The CRISP-DM method gives you a good framework to start designing your data science project. We think most projects can be described with its six phases.

The CRoss Industry Standard Process for Data Mining (CRISP-DM) is a process model with six phases that naturally describes the data science life cycle. It’s like a set of guardrails to help you plan, organize, and implement your data science (or machine learning) project.

1. Business understanding – What does the business need?

2. Data understanding – What data do we have / need? Is it clean?

3. Data preparation – How do we organize the data for modeling?

4. Modeling – What modeling techniques should we apply?

5. Evaluation – Which model best meets the business objectives?

6. Deployment – How do stakeholders access the results?

CRISP is not a methodology but rather a framework to think about your projects. Pay special attention to steps 1-3, because the success of your project depends on the quality (and suitability) of your data. Read our previous posts on Data gathering and annotation, and Strategies and tools to get the right data in the right quality.
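To make the “Is it clean?” question from step 2 a bit more concrete, here is a minimal sketch of a missing-value report. The records and the field names (`age`, `income`) are invented for illustration; a real project would most likely use a dataframe library and far richer checks.

```python
from collections import Counter


def missing_value_report(rows, fields):
    """Count missing values (absent key, None, or empty string) per field."""
    missing = Counter()
    for row in rows:
        for field in fields:
            if row.get(field) in (None, ""):
                missing[field] += 1
    return {field: missing[field] for field in fields}


# Hypothetical toy records, not real data.
rows = [
    {"age": 34, "income": 52000},
    {"age": None, "income": 48000},
    {"age": 29, "income": ""},
    {"age": 41},  # the "income" key is absent altogether
]

report = missing_value_report(rows, ["age", "income"])
print(report)
```

Even a crude report like this can tell you early on whether a field is usable at all, before any modeling effort is spent on it.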

Be agile

CRISP helps you to frame your or your client’s problem; Agile helps you to deliver a working solution. Yes, Agile is not specifically about data science projects; it is about delivering software.

In software development, agile (sometimes written Agile) practices include requirements discovery and solutions improvement through the collaborative effort of self-organizing and cross-functional teams with their customer(s)/end user(s), adaptive planning, evolutionary development, early delivery, continual improvement, and flexible responses to changes in requirements, capacity, and understanding of the problems to be solved.

If you’ve never worked on an Agile project, think of it as an iterative process. Instead of planning everything ahead and working towards the end goal, it urges you to build up the deliverable step by step. This way the end users can give you feedback at every iteration, which gives you enormous flexibility and lets the software evolve naturally. This methodology comes with ceremonies such as daily standups, which can be frustrating for technical people. But please keep in mind that you have to work together with other stakeholders to achieve a common goal. You have to talk to each other, and it is always better to do this in a structured way than in a chaotic one. If you really want to make progress on your projects, you need top-notch project managers and Scrum masters, so your developers and data scientists can spend most of their time on delivering high-quality software.

Rules of Machine Learning

We love the Rules of Machine Learning! Just like CRISP-DM, it is a kind of framework to think about development projects. It can be summarized as follows:

To make great products:

do machine learning like the great engineer you are, not like the great machine learning expert you aren’t.

Most of the problems you will face are, in fact, engineering problems. Even with all the resources of a great machine learning expert, most of the gains come from great features, not great machine learning algorithms. So, the basic approach is:

1. Make sure your pipeline is solid end to end.
2. Start with a reasonable objective.
3. Add common-sense features in a simple way.
4. Make sure that your pipeline stays solid.

This approach will work well for a long period of time. Diverge from this approach only when there are no more simple tricks to get you any farther. Adding complexity slows future releases.
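The basic approach above can be sketched as a tiny end-to-end pipeline. This is only an illustration under our own assumptions: the spam/ham toy data, the single `has_free` feature, and all function names are hypothetical, and the accuracy is measured on the training data purely to keep the example short.

```python
from collections import Counter


def featurize(text):
    """One hypothetical common-sense feature: does the text mention 'free'?"""
    return {"has_free": "free" in text.lower()}


def train_majority_baseline(labels):
    """Baseline 'model': always predict the most frequent label."""
    return Counter(labels).most_common(1)[0][0]


def train_rule_model(features, labels):
    """Learn the most common label for each value of the single feature."""
    by_value = {}
    for feats, label in zip(features, labels):
        by_value.setdefault(feats["has_free"], []).append(label)
    return {value: Counter(ys).most_common(1)[0][0]
            for value, ys in by_value.items()}


def accuracy(predictions, labels):
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)


# Invented toy data; a real pipeline would read from a data source
# and evaluate on a held-out set.
data = [
    ("Free tickets, click now", "spam"),
    ("Meeting moved to 3pm", "ham"),
    ("Get your FREE prize", "spam"),
    ("Lunch tomorrow?", "ham"),
    ("Quarterly report attached", "ham"),
]
texts, labels = zip(*data)
features = [featurize(t) for t in texts]

# First, get a trivial baseline working through the whole pipeline.
baseline_label = train_majority_baseline(labels)
baseline_acc = accuracy([baseline_label] * len(labels), labels)

# Then add a simple feature and check that it actually beats the baseline.
model = train_rule_model(features, labels)
model_acc = accuracy([model[f["has_free"]] for f in features], labels)

print(f"baseline: {baseline_acc:.2f}, simple feature: {model_acc:.2f}")
```

The point is the order of work, not the model: every stage exists and runs before any of them becomes sophisticated, and each new feature is compared against the baseline.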

Be ethical

Software affects people’s lives. It does not need to contain any kind of AI to do harm. A simple tool which helps you to fill out your tax form can make your life easier if it helps you to fill out the form easily AND correctly. It can cause you harm if it contains a bug and you submit a bad form. An AI system at a bank which helps with credit scoring might help to get more impartial results, or it can reinforce racial prejudices if it learns from bad data. Keep in mind that you have to work within a legal AND an ethical framework!
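For the credit scoring example, one very simple first check is to compare approval rates across groups. The sketch below uses invented decisions and hypothetical group labels `A` and `B`. A gap between groups does not prove that a system is unfair, and a small gap does not prove that it is fair; it is only a prompt to look more closely at the data and the model.

```python
def approval_rates(decisions):
    """Compute the approval rate per group from (group, approved) pairs."""
    totals, approved = {}, {}
    for group, ok in decisions:
        totals[group] = totals.get(group, 0) + 1
        approved[group] = approved.get(group, 0) + int(ok)
    return {group: approved[group] / totals[group] for group in totals}


# Invented decisions for two hypothetical groups; not real data.
decisions = [
    ("A", True), ("A", True), ("A", False), ("A", True),
    ("B", True), ("B", False), ("B", False), ("B", False),
]

rates = approval_rates(decisions)
# Difference between the highest and the lowest approval rate.
gap = max(rates.values()) - min(rates.values())
print(rates, gap)
```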

The Institute for Ethical Machine Learning’s website is a great resource for guidelines and best practices. The European Commission’s Ethics Guidelines for Trustworthy AI is also a good place to start exploring this topic.

We strongly recommend Mark Coeckelbergh’s short yet comprehensive book on the topic. Read our review of it here.

The Value Sensitive Design Lab offers a design framework to bring ethical and societal considerations into your design process.


A good data scientist is also a good software developer. (S)he knows the best practices of software engineering and (s)he can work as a part of a bigger team. Also, a good data scientist knows (and applies) the ethical and legal guidelines of the field.


