Do I have enough data to do machine learning?

This is the most common question our clients ask us. Indeed, if you browse the latest research or news on machine learning, it may seem that the amount of data is the most critical success factor. However, not every problem requires large amounts of data. Here are seven questions you should ask yourself when assessing your data needs for machine learning.

1. Are there publicly accessible datasets?

The machine learning community strongly believes in open science, which means that plenty of existing datasets may be relevant to your use case.

You should always search for existing datasets. On the one hand, you may be able to use them to improve your own dataset. On the other hand, the existence of a public dataset means that someone has already attempted to solve the problem. That helps you gauge the feasibility of the problem, and sometimes you will even find the implementation you need.

2. Can I create a dataset myself?

The data you are looking for may not always exist, be in the proper format, or fit your specific use case. So, you might want to create it yourself. This generally happens in three ways:

  • Web crawling. The internet contains an incredible amount of information. If you are looking for data to train a sentiment analysis model, you might want to check the reviews people leave on Amazon. If you want to build a speech-to-text model, you might want to check out TED’s videos, as they are expertly subtitled. And so on.
  • Crowdsourcing. Some data may reside in your colleagues’ or your customers’ heads, and therefore, you might need to find ways to collect it. It can be as simple as sending out a form within the office or running a focus group.
  • Application Data. The best way by far is to use data coming from your applications. Nearly every successful use of machine learning follows this trend. Think of the Google search engine, Netflix’s recommendations, or even your trusty spam filter. All of these applications log user interactions and automatically create datasets that can be used to update the machine learning models running in the background. The significant advantage of this approach is that the data is yours, and you are free to alter the way it is collected based on your needs.
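As a rough illustration of the application-data route, the sketch below appends each user interaction to a JSON Lines log and reads it back as labeled examples. The field names (query, shown_item, clicked) and the functions log_interaction and load_examples are hypothetical, chosen only for this example; a real application would decide what to log based on the model it wants to train.

```python
import io
import json


def log_interaction(log_file, query, shown_item, clicked):
    """Append one user interaction as a JSON line (hypothetical schema)."""
    record = {"query": query, "shown_item": shown_item, "clicked": clicked}
    log_file.write(json.dumps(record) + "\n")


def load_examples(log_file):
    """Read the log back as (features, label) pairs for later training."""
    examples = []
    for line in log_file:
        record = json.loads(line)
        features = (record["query"], record["shown_item"])
        examples.append((features, record["clicked"]))
    return examples


# An in-memory buffer stands in for a real log file in this sketch.
log = io.StringIO()
log_interaction(log, "running shoes", "item_42", True)
log_interaction(log, "running shoes", "item_7", False)
log.seek(0)
examples = load_examples(log)
```

Because you own the logging code, adding a new field later (for instance, how long the user stayed on the item) is a one-line change to the record.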

3. Do I even need data in the first place?

Nowadays, plenty of off-the-shelf ML solutions exist, meaning that you don’t need any data to get started! Cloud providers like Amazon, Google, Microsoft, and IBM all provide a series of ML services with pre-trained models. Some even offer a way to fine-tune these models with your data. These services run on the providers’ respective cloud platforms, so your data must leave your system to make predictions. If that is an issue, you can also look for open-source pre-trained models. A website like Model Zoo is a great place to start.

4. Is my data relevant?

There is a rule of thumb that the more data you have, the better your model will be. It is essential to understand that this is a rule of thumb, not an absolute truth. Data is only useful when it is correctly annotated and covers the scenarios you expect to encounter. Think of your efforts to create a machine learning model as a chef trying to create a delicious meal for their restaurant. It doesn’t matter how many ingredients are in the fridge if these ingredients are expired or if they don’t allow you to create the meals on the menu.
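One simple way to check coverage, sketched here under assumptions, is to count how many annotated examples exist for each scenario you expect to encounter. The function coverage_report, the label names, and the threshold of 50 examples are all hypothetical and only serve the illustration.

```python
from collections import Counter


def coverage_report(labels, expected_labels, min_examples=1):
    """Return the expected labels that have fewer than min_examples examples."""
    counts = Counter(labels)
    return {
        label: counts[label]
        for label in expected_labels
        if counts[label] < min_examples
    }


# Toy data: plenty of examples overall, but one expected scenario is
# barely covered and another is missing entirely.
labels = ["positive"] * 500 + ["negative"] * 480 + ["neutral"] * 3
expected = ["positive", "negative", "neutral", "sarcastic"]
gaps = coverage_report(labels, expected, min_examples=50)
```

Here the dataset has almost a thousand examples, yet gaps still flags "neutral" and "sarcastic" as under-represented, which is the fridge full of ingredients that do not match the menu.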

5. Can I solve (parts of) the problem without machine learning?

Because academia is spearheading machine learning efforts, there is an unhealthy obsession with creating a single model that can do it all. While this is a crucial scientific goal, it may not always be relevant in practice. It is essential to understand that the primary use of a machine learning model is to handle unknown cases. Known cases, on the other hand, can often be handled by hard-coded rules. If we already know the answer to a problem, it makes little sense to ask our machine learning model to solve it again.
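The split between known and unknown cases can be sketched as a small router: look up the answer in hard-coded rules first, and only call the model when no rule applies. Everything named here (make_router, KNOWN_ANSWERS, placeholder_model) is hypothetical, and the placeholder function merely stands in for whatever trained model you would actually use.

```python
def make_router(known_answers, model_predict):
    """Build a function that answers known cases from a lookup table
    and sends everything else to the model."""
    def route(question):
        key = question.strip().lower()
        if key in known_answers:
            return known_answers[key], "rule"
        return model_predict(question), "model"
    return route


# Hypothetical hard-coded answers for cases we already understand.
KNOWN_ANSWERS = {
    "what are your opening hours?": "We are open 9:00 to 17:00 on weekdays.",
}


def placeholder_model(question):
    """Stand-in for a trained model; a real one would go here."""
    return "Let me look into that for you."


answer = make_router(KNOWN_ANSWERS, placeholder_model)
```

Calling answer("What are your opening hours?") is served by the rule, while any other question falls through to the model. Returning the source ("rule" or "model") alongside the answer also makes it easy to measure how much traffic really needs machine learning.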

6. What business problem am I trying to solve?

If there is one tip to remember, it is probably this one. Your goal should not be to build a machine learning model; it should be to solve a business problem. Therefore, you should consider non-ML alternatives like hardcoding the decision-making or doing a root cause analysis. It’s great to build a chatbot that can solve the problems your customers face, but what you should investigate is why they are having these problems in the first place!

7. And above all, start now!

Yes, you may need a skilled machine learning engineer to build a model suitable for your business. But you should not wait until you think you have enough data. The only way to know that you have enough data is to build the solution! Enhanced business scalability and improved business operations are some of the many benefits you can get from machine learning. Your competitors may already be enjoying these advantages. Isn’t it time for you to start as well?
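One common way to find out by building, offered here as a sketch and not as a prescribed method, is a learning curve: train on growing slices of your data and watch the accuracy on held-out examples. If the curve flattens, more data of the same kind is unlikely to help much. The example uses synthetic one-dimensional data and a deliberately simple nearest-centroid classifier so that it runs with the Python standard library alone; all function names are hypothetical.

```python
import random


def make_data(n, rng):
    """Toy 1-D data: class 0 centered at 0.0, class 1 centered at 2.0."""
    data = []
    for _ in range(n):
        label = rng.randint(0, 1)
        data.append((rng.gauss(2.0 * label, 1.0), label))
    return data


def fit_centroids(train):
    """Nearest-centroid 'model': the mean feature value per class."""
    sums, counts = {}, {}
    for x, y in train:
        sums[y] = sums.get(y, 0.0) + x
        counts[y] = counts.get(y, 0) + 1
    return {y: sums[y] / counts[y] for y in sums}


def accuracy(centroids, test):
    """Fraction of test examples whose closest centroid matches the label."""
    correct = 0
    for x, y in test:
        predicted = min(centroids, key=lambda c: abs(x - centroids[c]))
        correct += predicted == y
    return correct / len(test)


def learning_curve(train, test, sizes):
    """Held-out accuracy after training on the first n examples, per size."""
    return [(n, accuracy(fit_centroids(train[:n]), test)) for n in sizes]


rng = random.Random(0)  # fixed seed so the sketch is repeatable
train, test = make_data(1000, rng), make_data(500, rng)
curve = learning_curve(train, test, sizes=[10, 50, 200, 1000])
for n, acc in curve:
    print(f"{n:5d} training examples -> accuracy {acc:.2f}")
```

With your own data, you would swap in your real model and evaluation metric; the shape of the curve, not the toy classifier, is the point.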

This blog is written by Valentin Calomme – AI Engineer at Mediaan.