Let’s face it: data scientists work with software engineers, and at the end of the day, they have to ship software. An effective data scientist should be able to work within a software development team, so they must be able to use the tools of the trade. Our previous post introduced the very basics; now we take a further step.
Where do you save your plots? Where do you put your raw data and its intermediate derivatives? Are you working on an API, a GUI-based tool, or are you simply preparing basic descriptive statistics of your data? Have you ever tried to follow someone’s advice and keep a consistent folder structure, putting your code into the src/ folder, your data into data/, and so on? And have you ever forgotten and called your source folder source? Use cookiecutter, or make project templates in your IDE (PyCharm, for example, supports project templates too). There are great templates for Python packages, Flask/FastAPI projects, data analysis projects, and more. You can even define your own templates.
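For a feel of what a template gives you, the popular cookiecutter-data-science template generates a layout roughly along these lines (simplified here; the exact folders depend on the template version):

```
├── data
│   ├── raw        # the original, immutable data dump
│   ├── interim    # intermediate, transformed data
│   └── processed  # the final datasets for modeling
├── notebooks      # exploratory Jupyter notebooks
├── reports
│   └── figures    # generated plots, ready for reporting
└── src            # reusable source code for this project
```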
Project templates not only save you time and make your projects consistent; they also force you to think about the purpose of your project. If you are about to start an ad hoc data analysis project, you will probably work on your own laptop and ship a report and/or data visualizations. If you have used almost the same functions to clean and prepare your data across projects and you know you’ll need them hundreds of times, put them into a package. Next time you can just pip install the package and import those handy functions into your new project. Templates are the first step towards automation: they usually contain everything you need to build a good project, from dependency configuration to testing. If you never build a package or an API, you are a data analyst.
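As a sketch of the kind of helper worth packaging (the function name and cleaning rules here are invented for illustration, not from any particular library):

```python
def normalize_column_names(columns):
    """Return cleaned column names: stripped, lower-cased, spaces -> underscores.

    A typical data-preparation helper you end up rewriting in every
    project -- exactly the kind of function that belongs in a shared,
    pip-installable package instead of being copy-pasted around.
    """
    return [c.strip().lower().replace(" ", "_") for c in columns]


print(normalize_column_names([" Sale Price ", "Region", "order ID"]))
# ['sale_price', 'region', 'order_id']
```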
Manage your dependencies
If you pip install anything into your virtual environment, you will need dependency management. You have probably used a requirements.txt to install the dependencies of a project, but things can get complicated, especially if you develop a package that will be used in different environments. Dependency management tools help you handle these situations. We prefer Poetry over other tools such as conda and pipenv.
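With Poetry, dependencies are declared in pyproject.toml; a minimal configuration might look like this (the project name, authors, and version pins are placeholders):

```toml
[tool.poetry]
name = "my-analysis"
version = "0.1.0"
description = "Data preparation helpers"
authors = ["Your Name <you@example.com>"]

[tool.poetry.dependencies]
python = "^3.10"
pandas = "^2.0"

# Dev-only tools stay out of your users' installs.
[tool.poetry.group.dev.dependencies]
pytest = "^8.0"

[build-system]
requires = ["poetry-core"]
build-backend = "poetry.core.masonry.api"
```

Running `poetry add` and `poetry install` keeps this file and a lock file in sync, so every environment resolves to the same versions.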
Test your software
No software comes without bugs, but this doesn’t mean that you cannot mitigate the risk of failures. Testing is an art of its own, and sometimes it is done by dedicated test engineers. The most common types of testing, which are usually done by the developers themselves, are unit testing and doctests. The Hitchhiker’s Guide to Python, our favorite resource on software development with Python, has a brief and to-the-point section on testing here. For a short intro to unit testing, we strongly recommend this tutorial on Real Python.
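A tiny illustration of both styles (the function itself is invented for the example): the doctest lives in the docstring and doubles as documentation, while the unit test is a plain function with assertions, the way pytest expects it.

```python
def mean(values):
    """Return the arithmetic mean of a non-empty sequence.

    The example below is a doctest; check it with `python -m doctest file.py`.
    >>> mean([1, 2, 3])
    2.0
    """
    if not values:
        raise ValueError("mean() of an empty sequence")
    return sum(values) / len(values)


# A unit test in the pytest style: pytest collects test_* functions
# automatically and reports any failing assertion.
def test_mean():
    assert mean([1, 2, 3]) == 2.0
    assert mean([10]) == 10.0


if __name__ == "__main__":
    import doctest
    doctest.testmod()  # verifies the example in the docstring
    test_mean()
```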
The best place to start learning about testing is Brian Okken’s page. Brian runs a superb podcast called Test &amp; Code, and he has written an excellent book on the subject too.
Good news! Cookiecutter templates usually come with test templates too.
Automate the software development workflow
So far, you’ve read about lots of tools, but the most basic one is the repo where you keep your code. Usually, this repo lives on a remote server or in the cloud with a provider like GitHub or GitLab (these are referred to as source code management providers). There is also a version control system on your machine, which helps you keep your version of the code in sync with the central repo. On your side, you can use Git to automatically check your code before committing it, via the so-called pre-commit hooks. For example, you can set up your version control system to run the black code formatter on your source files before committing changes (check out this resource on how to set up a hook for this), or to run your unit tests.
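With the pre-commit framework, for instance, a minimal configuration that runs black before every commit might look like this (the pinned revision is just an example; use a current release):

```yaml
# .pre-commit-config.yaml
# Install with `pip install pre-commit`, then activate the Git hook
# in your repo with `pre-commit install`.
repos:
  - repo: https://github.com/psf/black
    rev: 24.4.2  # example pin -- replace with a current release
    hooks:
      - id: black
```

After that, `git commit` refuses to complete until black has reformatted (or approved) the staged files.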
On the side of the central repo, there are various tools that automate building your software. Usually more than one developer works on a repo, and it is not trivial to make sure that there are no conflicts, that the tests pass, and that the software can be shipped/deployed. The two main cloud source code management providers offer these tools under the names GitHub Actions and GitLab CI/CD. These help you define complex pipelines to test, build, and even deploy your software to production in the cloud.
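As a sketch, a minimal GitHub Actions workflow that runs the test suite on every push might look like this (the file path, Python version, and requirements file are illustrative):

```yaml
# .github/workflows/tests.yml
name: tests
on: [push, pull_request]

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.12"
      - run: pip install -r requirements.txt
      - run: pytest
```

Once this file is committed, every push triggers the pipeline, and the results show up directly on your pull requests.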
Use the cloud
Every reasonably sized project uses some sort of cloud service. Usually, the end result (aka the software) runs in the cloud, and the accompanying machine learning models are also stored, trained, and run in the cloud. There is a plethora of tools to store, analyze, and manage your data, serve your models, and run your APIs. Amazon, Google, and Microsoft all offer their own technology stacks, and you can easily find other players on the market too. Typically, you will encounter object stores (like Amazon’s S3), databases (like Google’s BigQuery), ways to run virtual machines (e.g. DigitalOcean’s droplets), and ML tools (like Amazon SageMaker or Google’s AI and machine learning tools). All the major providers have made serious efforts to document their products and support learners with good tutorials, and you can try out their main tools for free for a short period of time.
This long list of tools can be horrifying. Don’t panic! No one knows all these tools perfectly! Start using them and get comfortable with their documentation, so you can look up the necessary information and communicate with your co-workers. No one expects you to know everything, but you must be able to work within a team. If you know the basics and devote some time to picking up these technologies, your DevOps and software engineer colleagues will appreciate you. If you’d like to get a feeling for how these things work, read (and work through) the Hypermodern Python series.
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.