To sklearn, or to not sklearn?

Sam Dedes
4 min read · Dec 12, 2020


This article is a friendly overview of the paper “Scikit-learn: Machine Learning in Python,” published in volume 12 of the Journal of Machine Learning Research. The intention is to inform those considering adding scikit-learn to their toolkit, and to quench the curiosity of the occasional philomath.

Whether you’re a veteran software engineer looking into data analysis and modelling, a quarantined college student trying to demystify the term “machine learning”, or an entrepreneur looking to expand the data capabilities of your business, you’re likely to encounter scikit-learn (aka sklearn).

What is scikit-learn?

Scikit-learn is a library in Python designed for user-friendly machine learning. It includes tools for model selection and analysis, and supports both supervised and unsupervised learning. It is built on popular libraries like numpy and scipy, as well as the lesser-known Cython.

What can scikit-learn do?

Short answer: A LOT (but not everything).

Scikit-learn is a library able to run various machine learning algorithms on data sets of varying sizes. Following the YAGNI philosophy (“you aren’t gonna need it”), the library’s developers focus on providing high-quality implementations of core functionality rather than including as many features as possible. With that said, its functionality spans OLS regression, polynomial features, data preprocessing, and other machine learning fundamentals.*

*You’re welcome to explore the plethora of features scikit-learn has to offer on its official website.
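To make the “machine learning fundamentals” point concrete, here’s a minimal sketch of one of them: fitting an OLS regression with scikit-learn’s LinearRegression. The tiny synthetic data set is my own invention for illustration, not something from the paper.

```python
# A minimal OLS regression sketch on tiny, exactly-linear synthetic data.
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[1.0], [2.0], [3.0], [4.0]])  # one feature per sample
y = np.array([2.0, 4.0, 6.0, 8.0])          # y = 2x, no noise

model = LinearRegression()
model.fit(X, y)

# Because the data is exactly linear, the fit recovers slope 2, intercept 0.
slope = model.coef_[0]
prediction = model.predict([[5.0]])[0]
```

The same `fit`/`predict` interface carries over to nearly every estimator in the library, which is a big part of its user-friendliness.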

Should I use scikit-learn over more mature frameworks like mlpy or pymvpa?

Well, it depends on what you’re trying to do. While it doesn’t have every capability, scikit-learn focuses on the performance of the machine learning algorithms and data preprocessing tools included in its library. Even if you don’t plan on using scikit-learn to create your model, you can take advantage of model selection tools like train_test_split and metrics such as f1_score.
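Those two utilities can be sketched in a few lines. The toy data and the stand-in predictions below are assumptions of mine; the point is that the predictions graded by f1_score could come from a model built in any framework.

```python
# Sketch: scikit-learn's model-selection utilities used on their own.
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score

X = [[i] for i in range(10)]               # ten one-feature samples
y = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]         # binary labels

# Hold out 30% of the data; random_state makes the split reproducible.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

# Pretend these predictions came from a model built elsewhere.
y_pred = [1] * len(y_test)
score = f1_score(y_test, y_pred)           # a value between 0 and 1
```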

Scikit-learn has a breadth of capabilities while maintaining high performance standards. According to the paper, scikit-learn has comparable or superior performance when compared to its Python-based counterparts.

Table 1 from “Scikit-learn: Machine Learning in Python”

Specifically, performance was compared by running several machine learning algorithms on the Madelon data set. These runs were timed and compared, showing scikit-learn outperformed mlpy, pybrain, pymvpa, and mdp in all areas except k-nearest neighbors and k-means. In k-nearest neighbors, pymvpa came a hundredth of a second ahead; in k-means, scikit-learn was beaten by mlpy, as well as by shogun, a C++-based machine learning library.

If versatility is what you’re after, scikit-learn is a great place to start for many machine learning applications. If you’re looking for performance, scikit-learn has the competitive edge in many areas, but this will depend on the desired model and size of data being analyzed.

How does scikit-learn achieve its efficiency?

There’s more than one answer to this question, and none are truly complete without the others. The long and short of it is that there are already libraries that are very good at what they do, and scikit-learn combines the best of these for optimal performance.

“Scikit-learn stands on the shoulders of giants.” — Anonymous

Well, what are these core libraries?

Numpy is at the core of how scikit-learn handles its data. Originally, numpy was created to unlock the speed of C implementations inside the versatile Python environment. Unlike a numpy array, a single Python list can contain entries of different data types. For example, the list below is valid Python, containing a string, an integer, a float, and a tuple, respectively.

foo = 1
mixed = ['string', foo, 1.0, (42, 'forty-two')]

In contrast, the entries of a numpy array must all be of the same type, whether integers, floats, strings, or objects.

(After all, isn’t dinner planning easier if everyone has the same food allergy?)
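A quick illustration of that homogeneity (my own example, not from the paper): when you hand numpy mixed numeric inputs, it coerces everything to a single common dtype rather than storing a mixed bag.

```python
# Numpy arrays are homogeneous: mixed inputs get promoted to one dtype.
import numpy as np

ints = np.array([1, 2, 3])        # all integers -> an integer dtype
mixed = np.array([1, 2.5, 3])     # one float forces promotion to float64
```

That single shared dtype is what lets numpy lay the data out in a contiguous C-style buffer and loop over it at C speed.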

Another core package is scipy, which contains efficient algorithms for linear algebra (matrix math) and bindings to legacy numerical packages written in Fortran. In other words, numpy handles how the data is stored, while scipy handles how it is manipulated. This is somewhat analogous to the order of operations; consider the equation below:

Recall PEMDAS, the order of operations

2 * 4 - 4 = (2 * 4) - 4 = 8 - 4 = 4

NOT

2 * 4 - 4 = 2 * (4 - 4) = 2 * 0 = 0

Numpy handles storing our numbers, while scipy gives us instructions on what to do with them.
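That division of labor can be shown in a few lines. The small system of equations below is my own example: numpy holds the matrices, and scipy’s linear-algebra routine does the computation.

```python
# Sketch: numpy stores the data, scipy manipulates it.
import numpy as np
from scipy import linalg

# Solve the system  3x + y = 9,  x + 2y = 8.
A = np.array([[3.0, 1.0],
              [1.0, 2.0]])   # numpy: storage of the coefficient matrix
b = np.array([9.0, 8.0])

x = linalg.solve(A, b)       # scipy: the actual linear-algebra work
```

Here `linalg.solve` dispatches to compiled LAPACK routines under the hood, which is exactly the kind of Fortran-backed speed the article alludes to.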

Numpy and scipy are two of the essential Python libraries on which scikit-learn is based. While there are more underlying libraries and conventions scikit-learn uses to achieve its efficiency, their explanations grow more complex and specific to particular functions, classes, methods, and workflow capabilities within scikit-learn, and as such are best saved as a topic for a future article.

Takeaways:

  • By combining various optimized packages including those mentioned in this article and a good deal more, scikit-learn offers a breadth of functionality while maintaining high performance.
  • It’s a great choice for those exploring machine-learning, and offers powerful data preprocessing tools which can be used in tandem with other machine learning packages.

References: Pedregosa, Fabian, et al. “Scikit-learn: Machine Learning in Python.” Journal of Machine Learning Research, vol. 12, 2011, pp. 2825–2830.
