My top 10 Python packages for data science
Over the last four years I have transitioned from using SAS exclusively for all data processing and statistical modelling tasks to using Python instead. One barrier I had to overcome was the need to keep discovering and learning the great packages put together by the open source community.
There are many benefits to adopting these open source packages, including:
- Everything is free
- Most likely they are constantly being updated and improved
- There’s a large community whose members support each other on websites like Stack Overflow
It does take some time to get familiar with these packages. However, if you are the kind of person who gets excited about learning new things, you’ll actually enjoy the process.
Today I’m sharing my top 10 Python packages for data science, grouped by tasks. Hopefully you find it useful!
Data processing
pandas
Developed by Wes McKinney more than a decade ago, this package offers powerful data table processing capabilities. For people with a SAS background, it offers something like SAS data step functionality. You can sort, merge, filter and so on. The key difference is that in pandas you call a function to perform each of these tasks.
By the way, I was really amazed to learn that Wes McKinney was able to develop pandas after only a few years of Python experience. Some people are just really gifted!
His book Python for Data Analysis is highly recommended if you are just starting out on your Python data science journey.
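To give a flavour of the sorting, merging and filtering mentioned above, here is a minimal sketch. The two tables and their column names are made up for illustration, and the comparison to SAS is only a loose one.

```python
import pandas as pd

# Hypothetical customer and policy tables, invented for this example.
customers = pd.DataFrame({
    "customer_id": [1, 2, 3, 4],
    "age": [34, 67, 45, 72],
})
policies = pd.DataFrame({
    "customer_id": [1, 2, 2, 4],
    "premium": [500.0, 320.0, 150.0, 410.0],
})

# Merging: join the two tables on their shared key.
merged = customers.merge(policies, on="customer_id", how="inner")

# Filtering: keep only rows for customers aged 65 or over.
seniors = merged[merged["age"] >= 65]

# Sorting: order by premium, largest first, then renumber the rows.
result = seniors.sort_values("premium", ascending=False).reset_index(drop=True)
print(result)
```

Each step is a method call that returns a new DataFrame, so the steps can also be chained into a single expression if you prefer.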
numpy
Pandas is built on top of another important package, numpy, so when you work with data you will often rely on numpy for basic data manipulation. For example, when you need to create a new column based on the age of the customer, you can do something like:
df['isRetired'] = np.where(df['age']>=65, 'yes', 'no')
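For completeness, here is a self-contained version of that line with the imports it needs. The ages are made up, and the retirement cut-off of 65 simply follows the line above.

```python
import numpy as np
import pandas as pd

# Made-up ages, for illustration only.
df = pd.DataFrame({"age": [30, 65, 70, 64]})

# np.where(condition, value_if_true, value_if_false) is applied element-wise,
# so every row gets 'yes' or 'no' depending on its own age.
df["isRetired"] = np.where(df["age"] >= 65, "yes", "no")
print(df)
```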
qgrid
An amazing package which allows you to sort, filter, and edit DataFrames in Jupyter Notebooks.
Graphing
The next three packages are all about graphing, which is a key step in exploratory data analysis.
matplotlib
This package allows you to create all sorts of graphs. If you are using it in a Jupyter Notebook, remember to run this line of code to enable the display of the graphs:
%matplotlib inline
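As a minimal sketch, a basic line plot might look like the following. The numbers are made up, and the `matplotlib.use("Agg")` line is my assumption for running this as a plain script without a display; in a notebook you would drop it and rely on the magic command above.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless script; omit this in a notebook
import matplotlib.pyplot as plt

# Illustrative data only.
ages = [25, 35, 45, 55, 65]
claim_frequency = [0.12, 0.10, 0.09, 0.11, 0.15]

fig, ax = plt.subplots(figsize=(6, 4))
ax.plot(ages, claim_frequency, marker="o")  # one line with point markers
ax.set_xlabel("Age")
ax.set_ylabel("Claim frequency")
ax.set_title("Claim frequency by age (illustrative data)")
fig.tight_layout()
```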
seaborn
With the help of this package, you can make matplotlib graphs look much more attractive.
plotly
Nowadays we come across interactive graphs everywhere. They offer a much better user experience. For example:
- when we hover the mouse over a line plot we expect some text to pop up.
- when we select a line, we expect it to stand out from the other lines.
- sometimes we would like to zoom into parts of the graph.
plotly allows you to build these interactive graphs easily within a Jupyter Notebook. A great way to share work with your colleagues and stakeholders is to send them a webpage (for example, a Jupyter Notebook exported to HTML) with beautiful, interactive plotly graphs embedded.
The best part is there is no need for the recipient to install any special software other than a modern internet browser.
Modelling
statsmodels
This package allows you to build Generalized Linear Models (GLMs), which are still widely used by actuaries today.
It also offers time series analysis and other statistical modelling capabilities.
scikit-learn
This is the main machine learning package, and it lets you complete most machine learning tasks, including classification, regression, clustering, and dimensionality reduction.
I also use the model selection and pre-processing functions. From k-fold cross validation to scaling data and encoding categorical features, it has so much to offer.
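The sketch below shows one way these pieces can fit together: scaling, categorical encoding and k-fold cross validation wrapped around a simple classifier. The data is simulated, and the choice of logistic regression is just an assumption for the example.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Simulated data with two numeric features and one categorical feature.
rng = np.random.default_rng(42)
n = 200
X = pd.DataFrame({
    "age": rng.integers(18, 80, size=n),
    "income": rng.normal(60000, 15000, size=n),
    "state": rng.choice(["NSW", "VIC", "QLD"], size=n),
})
y = (X["age"] + rng.normal(0, 10, size=n) > 50).astype(int)

# Scale the numeric columns and one-hot encode the categorical column.
preprocess = ColumnTransformer([
    ("num", StandardScaler(), ["age", "income"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["state"]),
])

# Putting pre-processing inside the pipeline means it is re-fitted on each
# training fold, so nothing leaks from the validation fold.
pipeline = Pipeline([
    ("preprocess", preprocess),
    ("model", LogisticRegression(max_iter=1000)),
])

cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(pipeline, X, y, cv=cv, scoring="accuracy")
print(scores.mean())
```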
lightgbm
This is one of my favourite machine learning packages for Gradient Boosting Machines (GBMs). I gave a talk about this package at the 2018 Data Analytics seminar.
For a fraction of the time and effort needed to build GLMs, you could run a GBM, look at the feature importance to find out the most important features for your model, and gain a good initial understanding of the problem. This can be a standalone step, or a quick first step before building a full GLM that’s more readily accepted by the stakeholders.
lime
Model interpretation is still a challenge for machine learning models like GBM. When stakeholders don’t understand a model, they can’t trust it, and as a result there’s no adoption.
However, I feel model interpretation packages like lime are starting to change this. They allow you to examine each model prediction and work out what’s driving the prediction.
Conclusion
I’ve listed my top 10 packages. Have you come across any other useful packages? Please share in your comments below.
“Exploration is really the essence of the human spirit.” – Frank Borman
This article was originally published on Medium.com