When programming in Python, there is no need to reinvent the wheel and write all your code from scratch. That is what packages are for. A package is a bundle of reusable code, such as functions and classes, that lets you perform many tasks without writing that code yourself. With nearly 300,000 Python packages hosted on PyPI (the Python Package Index), it can be a little overwhelming to know which one is best for your project. Widely used packages like NumPy, pandas, and scikit-learn are already part of almost any Data Scientist’s skill set. But how about lesser-known packages that live somewhere in this large pool? This blog post highlights four interesting Python packages that you may have overlooked but that are definitely worth your attention. It’s time to give these hidden gems some love!
1. Pandas Profiling
This one is a personal favorite: gone are the days when you frantically tried to explore, visualize, and understand every individual variable in your dataset. One line of code is enough to perform a quick exploratory analysis of your dataset and generate a report with all that information. Even better, you can also generate interactive web reports that are easy to interpret and therefore easy to present to non-programmers and stakeholders!
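As a rough illustration, the sketch below wraps that one line in a small helper. The helper name `profile_csv` and the file names are hypothetical. The sketch also assumes that pandas and the profiling package are installed; the package is published as `ydata-profiling` in recent releases and as `pandas-profiling` in older ones, so both import names are tried.

```python
from pathlib import Path


def profile_csv(csv_path, html_path="report.html"):
    """Read a CSV file and write an HTML profiling report for it."""
    # Imported inside the function so the third-party dependencies are only
    # needed when a report is actually generated.
    import pandas as pd
    try:
        from ydata_profiling import ProfileReport   # current package name
    except ImportError:
        from pandas_profiling import ProfileReport  # older package name

    df = pd.read_csv(csv_path)
    # The "one line": build the report and save it as a standalone HTML page.
    ProfileReport(df, title="Exploratory report").to_file(html_path)
    return Path(html_path)
```

Calling `profile_csv("passengers.csv")` on a file of your own should then leave a `report.html` next to your script that you can open in any browser.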
2. Great Expectations
There’s that elusive 80/20 split again. As a Data Scientist, you spend roughly 80% of your time collecting, understanding, and processing your data. But how can you document and pass on all the knowledge you gained from that exploratory analysis? How can you capture those insights in the code, so that any successor can easily retrieve them? Often, that knowledge is lost. And when your model’s performance suddenly starts declining because of a change in the data, that lost knowledge is exactly what you need. Great Expectations is built to solve this problem.
Basically, Great Expectations pairs a dataset with a set of expectations – call them unit tests for your data. From the docs: “With Great Expectations, you can assert what you expect from the data you load and transform and catch data issues quickly.” For example, you expect the number of passengers per car in your data to be between 1 and 8. If at some point the data changes beyond the expectations and more than 8 passengers appear in a car (Hello, US soccer moms!), warnings are raised so that you can address this data quality issue. This way, you proactively discover data quality issues before model performance drops and your stakeholders start asking questions. Well done!
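To make the idea concrete, here is a minimal plain-Python sketch of what an expectation does. It deliberately does not use the Great Expectations API: the names `ExpectationResult` and `expect_values_between` and the sample data are hypothetical, chosen only to mirror the passengers example above.

```python
from dataclasses import dataclass, field


@dataclass
class ExpectationResult:
    """Outcome of checking one expectation against a column of values."""
    success: bool
    unexpected_values: list = field(default_factory=list)


def expect_values_between(values, min_value, max_value):
    """Check that every value lies in the closed range [min_value, max_value]."""
    unexpected = [v for v in values if not (min_value <= v <= max_value)]
    return ExpectationResult(success=not unexpected, unexpected_values=unexpected)


# Hypothetical passenger counts: the old batch is clean, the new batch drifted.
old_batch = [1, 2, 4, 3, 1]
new_batch = [2, 9, 1, 12, 5]

old_result = expect_values_between(old_batch, min_value=1, max_value=8)
new_result = expect_values_between(new_batch, min_value=1, max_value=8)

print(old_result.success)            # True
print(new_result.success)            # False
print(new_result.unexpected_values)  # [9, 12]
```

The real package adds much more on top of this idea, such as reusable suites of expectations and generated documentation, but the core loop is the same: state what the data should look like, then check every new batch against it.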
3. SHAP
During the meteoric rise of deep learning in recent years, the main measure of whether a model “works” has been its output performance. However, a growing number of experts are calling for these black-box models to be opened up. Being able to explain why and how a model reaches a decision is becoming more and more important. Furthermore, business and process owners are often reluctant to include a model in their decision-making process if they do not understand how it reaches its decisions.
A very helpful package for explaining your models is SHAP (SHapley Additive exPlanations). Applied to a model, SHAP can show how much each variable contributes to the predictions globally. It can also explain predictions for individual observations, giving you more insight into what contributed to a particular observation’s prediction. SHAP is model-agnostic, so in principle it can be applied to any machine learning model to explain the how and why of its predictions. The package is invaluable for anyone looking to include their stakeholders in the model decision-making process.
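The idea behind these contributions can be sketched in a few lines of plain Python. The example below computes exact Shapley values by brute force for a tiny toy model. It is not the SHAP package’s API, and the function `shapley_values`, the toy model, and the all-zero baseline are assumptions made for illustration. Brute force is only feasible for a handful of features, which is why the package relies on much more efficient algorithms.

```python
from itertools import permutations


def shapley_values(predict, x, baseline):
    """Exact Shapley values of predict at x, relative to a baseline input.

    Features that are not yet "revealed" keep their baseline value. Each
    feature's contribution is its average marginal effect on the prediction
    over all possible orderings in which features can be revealed.
    """
    n = len(x)
    totals = [0.0] * n
    orderings = list(permutations(range(n)))
    for order in orderings:
        current = list(baseline)
        previous = predict(current)
        for i in order:
            current[i] = x[i]  # reveal feature i
            value = predict(current)
            totals[i] += value - previous
            previous = value
    return [t / len(orderings) for t in totals]


# Hypothetical toy model with an interaction between the first two features.
def toy_model(features):
    a, b, c = features
    return 2.0 * a + a * b - 3.0 * c


x = [1.0, 2.0, 1.0]
baseline = [0.0, 0.0, 0.0]
contributions = shapley_values(toy_model, x, baseline)

print(contributions)  # [3.0, 1.0, -3.0]
```

Note that the contributions add up to the difference between the prediction for `x` and the prediction for the baseline (here 1.0), which is the "additive" part of the name. The interaction term `a * b` is split evenly between the two features involved.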
4. Antigravity
Last but not least, what is life without a little fun? Import the module antigravity to find an easter egg added simply to amuse users: a webpage opens with Randall Munroe’s xkcd comic about Python. The module ships with Python’s standard library, so there is nothing to install. Read the comic, and you’ll soon understand why it is called antigravity!
If your favorite Python library or framework didn’t make the list, don’t take offense. The Python ecosystem has produced so many valuable packages that it would be impossible to include them all. The packages above are simply my personal top picks, and many great hidden gems are still waiting to be found. I highly recommend you take a dip into the vast lake of Python packages out there. Who knows, you might just find exactly what you’ve been missing all this time!
Since you are already checking out our blog, you might as well have a look at our Data Scientist vacancy. Maybe you will be the next to join our Data Science team!
This blog is written by Kai Mevissen – Data Scientist at Mediaan.