The coronavirus (COVID-19) outbreak has affected the lives of millions, the businesses we rely upon, and the way we live our lives. The internet is full of relevant data accumulated over the last 2 years. Wouldn’t it be great if all of this available data were used to understand the situation better and to know what we can expect next? Last time, our interns created a data entry automation tool. In this blog, we bring you another exciting and successful internship project! This time, using available data related to the pandemic, our interns developed a data crawling tool that can scrape data from various websites and make predictions. Curious? Let’s jump right into it!
The use case
If history has taught us anything, it is that we reach the top through struggle. The pandemic has produced large quantities of data that can be used to inform the world about the current situation and to “predict” what is to come. With that in mind, we created an internship project to extract data from websites, aggregate it in a BI tool, and learn how to use AI algorithms to make predictions.
Similar to the previous project, the students received an overview of the project and its challenges. They were also given information about some websites containing important data about COVID-19, a tutorial on creating a crawler for data extraction, and some tips, while keeping full creative freedom. As usual, we encouraged them to think alongside the customer and to fully “own” the project. At the same time, we actively helped them improve their soft skills and work with the Agile methodology.
As this use case was quite big, the students quickly realized that it would be easier to complete by splitting the project into three main parts.
- The first part was to create a data crawler tool that would gather information about infection rates, hospital admissions and deaths caused by the virus. To make sure that it worked, the students decided to start small by focusing only on Belgium. As the data was readily available on the internet, the remaining job was to create a crawler that takes a website URL as input and automatically collects all the necessary information.
- The second part was to get the students familiar with BI tools, in particular Power BI. The idea was to use the collected data to create graphs of the current situation. This way, the students gained some experience using the tool.
- Last but not least, the third part was to let them apply the knowledge they gained at school and during internal workshops at Mediaan to make predictions using the data. As this was considered a “research” project, the students had to try different methods and algorithms to find the best one for the predictions.
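To give an idea of what the first part involves, here is a minimal sketch of a crawler that takes a URL and returns the rows of an HTML table as records. It is not the students' implementation: the names `TableParser`, `parse_table` and `crawl` are hypothetical, and it assumes the target page publishes its figures in a plain HTML table whose first row is the header.

```python
from html.parser import HTMLParser
from urllib.request import urlopen


class TableParser(HTMLParser):
    """Collects the text of every <td>/<th> cell, grouped by table row."""

    def __init__(self):
        super().__init__()
        self.rows = []      # finished rows, each a list of cell strings
        self._row = None    # row currently being filled, or None
        self._cell = None   # text fragments of the current cell, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None  # drop any cell left open by broken markup


def parse_table(html):
    """Turn table rows into dicts, using the first row as column names."""
    parser = TableParser()
    parser.feed(html)
    parser.close()
    if not parser.rows:
        return []
    header, *body = parser.rows
    return [dict(zip(header, row)) for row in body]


def crawl(url):
    """Download the page at `url` and extract its table rows (assumed layout)."""
    with urlopen(url) as response:
        html = response.read().decode("utf-8", errors="replace")
    return parse_table(html)
```

Keeping the download (`crawl`) separate from the parsing (`parse_table`) makes it possible to test the extraction logic on a saved page without touching the network.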
For the data crawler tool, the students not only delivered the requested results, they went above and beyond! The user can export the collected data in different formats, such as CSV and JSON. Using NLP.4, the data crawler also lets the user search on specific keywords and their synonyms when gathering specific information.
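The export and keyword features can be pictured roughly as follows. This is only a sketch under assumptions: the synonym lookup is a small hand-written dictionary (`SYNONYMS`) standing in for whatever NLP component the students actually used, and the function names and the `text` field are hypothetical.

```python
import csv
import io
import json

# Hypothetical, hand-written synonym table; a real tool would derive these
# from an NLP resource rather than hard-coding them.
SYNONYMS = {
    "deaths": {"deaths", "fatalities", "mortality"},
    "admissions": {"admissions", "hospitalisations", "hospitalizations"},
}


def expand_keyword(keyword):
    """Return the keyword together with its known synonyms, lower-cased."""
    key = keyword.lower()
    return SYNONYMS.get(key, set()) | {key}


def search(records, keyword, field="text"):
    """Keep the records whose `field` mentions the keyword or a synonym."""
    terms = expand_keyword(keyword)
    return [r for r in records if any(t in r[field].lower() for t in terms)]


def to_json(records):
    """Serialise the records as a JSON array."""
    return json.dumps(records, indent=2, ensure_ascii=False)


def to_csv(records):
    """Serialise the records as CSV, using the first record's keys as header."""
    if not records:
        return ""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=list(records[0].keys()), lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()
```

With this sketch, searching for "deaths" would also return a record that only mentions "fatalities", and the same result list can be handed to either `to_csv` or `to_json`.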
As for Power BI, we let the students have some fun. They created different graphs to show the current situation and added filters so that you, as an end user, can easily adjust each graph to the information you would like to see.
The predictions, as predicted, were the hardest part of the entire project. Using different methods like Random Forest, CNN, and LSTM, the students concluded that it was indeed possible to make predictions, but only a few days ahead. We later compared these with reality, and the results were quite close. The further ahead we predicted, however, the more the “small errors” compounded, eventually producing unrealistic predictions. This was mainly due to the lack of data, which underlines the importance of clean, large quantities of historical data.
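Why small errors grow with the horizon can be illustrated with a toy example. The sketch below is not one of the students' models (Random Forest, CNN or LSTM). It assumes a deliberately simple setting in which daily cases grow by a fixed factor, and in which the model's estimate of that factor is slightly off. Every number in it is made up for illustration.

```python
def recursive_forecast(last_value, growth_rate, horizon):
    """Forecast `horizon` days ahead, feeding each prediction back as input."""
    forecasts = []
    value = last_value
    for _ in range(horizon):
        value = value * growth_rate
        forecasts.append(value)
    return forecasts


# Hypothetical numbers: the "true" daily growth factor is 1.05,
# while the model has learned a slightly wrong factor of 1.07.
true_path = recursive_forecast(1000.0, 1.05, 14)
predicted = recursive_forecast(1000.0, 1.07, 14)

# Relative error of the forecast for each day of the horizon.
relative_errors = [abs(p - t) / t for p, t in zip(predicted, true_path)]

for day, err in enumerate(relative_errors, start=1):
    print(f"day {day:2d}: relative error {err:.1%}")
```

In this toy setting the forecast is off by about 2% after one day but by roughly 30% after two weeks, because every step reuses the previous, already slightly wrong, prediction.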
We can say that it was another successful project! We now have a data crawler tool that can be used to extract data from various websites. Thanks to its prediction ability, there are many new opportunities that we can showcase to our customers to help them find (new) solutions to their problems. As for the students, they got everything they asked for from their internship and learned how to work on a real project at Mediaan. We have another exciting internship project to cover next time, so stay tuned to find out!
This blog is written by Sébastien Bodart – Consulting Project Manager and Internship Supervisor at Mediaan