If data is the new oil, then getting and enriching your own data is like fracking and refining it, at least in the case of textual data. This post gives you an overall picture of how to think about gathering and labeling data. It also offers some tips on which business questions to consider.
The Data Science Hierarchy of Needs
These days more and more people try to build a so-called vertical AI startup or solution. These endeavors aim to solve industry-specific problems by combining AI and subject matter expertise. They have four distinct features: 1) they are full-stack products, 2) they rely on subject matter expertise, 3) they are built on top of a proprietary dataset, and 4) AI delivers the core value. Our experience suggests that the third point, getting the right proprietary dataset, is the hardest and most decisive factor in every data-driven project, whether it is an intrapreneurial or an entrepreneurial endeavor.
Most people take data for granted. We get news about the newest deep learning algorithms every day. We live in the era of big data. We hear (at least those who work in the tech field) about new machine learning/artificial intelligence startups every day. So it must be easy to get data!
On the one hand, yes, there are awesome data repositories, like the UCI Machine Learning Repository. Governments are opening up and publishing their data via their own platforms or through something like CKAN. But keep in mind that your competitors can access this data too!
On the other hand, you have to get your own domain-specific dataset and annotate it to train your model(s)! Deep learning and other fancy ML algorithms are just the tip of the iceberg. There are plenty of things to do underneath. If you can’t get the underlying levels right, even the sexiest new deep learning algorithm will perform badly on your specific problem. Again, you can start by combining open datasets, but your competitors are doing the same thing. If you want to deliver real value that sets you apart from your competitors (i.e. something better or more precise), you have to build and annotate your own dataset. The popular data science hierarchy of needs pyramid should look as follows.
Separate your tasks
Harvesting and annotating data are two separate tasks done by two different groups. Data collection is often carried out by traditional software engineers or by the data infrastructure team, while annotation is often led (and sometimes even done) by data scientists or analysts. A good product manager keeps his or her hands on the data and involves every stakeholder in the process. A PM should always remind the team that getting and annotating data is a process, so you should constantly check the quality and scope of your raw and annotated data. The performance of the model you built using the data should also be monitored. You can use evaluation metrics and even some user feedback to plan further data gathering and data annotation task(s), which will help you build even better models.
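As a rough illustration of how evaluation metrics can feed back into annotation planning, here is a minimal Python sketch. It assumes a single-label classification task with a held-out set of gold labels and model predictions. The function names and the F1 threshold are hypothetical choices made for this example, not part of any particular tool.

```python
from collections import Counter


def per_label_f1(gold, predicted):
    """Return {label: (precision, recall, f1)} for two parallel label lists."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, predicted):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # the model wrongly predicted label p
            fn[g] += 1  # the model missed the true label g
    scores = {}
    for label in set(gold) | set(predicted):
        found = tp[label] + fp[label]
        actual = tp[label] + fn[label]
        precision = tp[label] / found if found else 0.0
        recall = tp[label] / actual if actual else 0.0
        total = precision + recall
        f1 = 2 * precision * recall / total if total else 0.0
        scores[label] = (precision, recall, f1)
    return scores


def labels_needing_data(gold, predicted, threshold=0.75):
    """List labels whose F1 is below the (assumed) threshold, weakest first."""
    scores = per_label_f1(gold, predicted)
    weak = [(f1, label) for label, (_, _, f1) in scores.items() if f1 < threshold]
    return [label for f1, label in sorted(weak)]


gold = ["pos", "pos", "neg", "neg", "neu", "neu"]
predicted = ["pos", "pos", "neg", "pos", "neg", "neu"]
print(labels_needing_data(gold, predicted))
```

The labels at the top of the returned list are reasonable candidates for the next round of data gathering and annotation. In practice you would combine such a ranking with user feedback rather than rely on a single metric.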
Before you consider various options to gather and label data, keep in mind that you should build your initial dataset AND a pipeline/process that will help you train better and better models. Choosing a solution at one phase doesn’t mean that you cannot move to another one at a later phase. But note that transitioning from outsourcing to in-house scraping and labeling can be hard and very costly.
In theory, you have an idea about a product, and you need a special-purpose dataset to train its magical AI part. Before you think over your options, you have to answer a few questions. What kind of data do you need in order to train a model? How can you get the data? Should you clean up the raw data before annotation? How much data should be annotated for the first model(s)? What does it mean to make a representative dataset in your case? You probably won’t have final answers at first, but don’t worry: a rough idea is enough initially.
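One possible reading of "representative" is that the first annotation batch mirrors the mix of sources in your raw data. The sketch below illustrates that idea with proportional stratified sampling. It assumes each raw record is a dictionary with a `source` field; the field name, the function name, and the batch size are hypothetical and only serve the example.

```python
import random
from collections import defaultdict


def stratified_sample(records, key, total, seed=0):
    """Draw roughly `total` records, keeping each group's share of the raw data.

    Every group gets at least one record, so the result can be slightly
    larger than `total` when there are many small groups.
    """
    rng = random.Random(seed)  # fixed seed so the batch is reproducible
    groups = defaultdict(list)
    for record in records:
        groups[key(record)].append(record)
    sample = []
    for name in sorted(groups):
        members = groups[name]
        quota = max(1, round(total * len(members) / len(records)))
        quota = min(quota, len(members))
        sample.extend(rng.sample(members, quota))
    return sample


raw = [{"source": "forum", "text": f"post {i}"} for i in range(80)]
raw += [{"source": "news", "text": f"article {i}"} for i in range(20)]
batch = stratified_sample(raw, key=lambda r: r["source"], total=10)
print(len(batch))
```

Whether the source is the right thing to stratify on depends on your product. Date, language, or document length may matter more in your case.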
As a next step you should consider your options of data gathering and annotation, like
- building in-house competency
You should know about your constraints like
Keep in mind that if something is legal, it is not necessarily ethical. Your project should be legal AND ethical. It is hard to define what ethical means. Your colleagues probably follow the ethical regulations and guidelines published by professional bodies and governments in your region. If not, ask them to do so! The team should also agree that the goal of the project is in accordance with its members’ ethical norms. Scraping sites that require a login is a shady part of the business. Imagine that a colleague thinks it is actually stealing data and harming the privacy of the site’s users. Will such a colleague build the best scraper for the task? Presumably not. So, even if you have nothing against scraping data from certain sources, accept that someone may find it unacceptable, even if it is legal.
Last but not least, you have budgetary and time constraints too. The more ready-made a solution is, the more expensive it is, but usually the less time it takes to deliver the data. In-house solutions require hiring permanent and temporary workers. Finding the right people takes time. You can employ juniors who are willing to learn a new field, but again, this takes time. If you have enough money, start by outsourcing the tasks to reliable partners. Later you can build up your own capabilities. If you are very short of money, bring data scraping in-house and crowdsource annotation. Otherwise read on and consider the tools and options you have.
That’s all for now. If you’d like to learn more about tools used for data gathering and annotation, stay tuned. The second part of this series will come soon!
If you face any issues during data gathering and annotation, don’t hesitate to contact us at email@example.com
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.