Python continues to lead the way in addressing data science tasks and challenges. Last year, we published a blog post, Top 15 Python Libraries for Data Science in 2017, outlining the libraries that proved most helpful at the time. This year, we expanded the list with new libraries and revisited those already discussed, focusing on the updates made over the year.

Our selection actually covers more than 20 libraries, because some of them are alternatives to each other and solve the same problem, so we grouped them together.

▌ Core libraries and statistics

1.NumPy (Commits: 17911, Contributors: 641)

Website: www.numpy.org/

NumPy is one of the core packages for scientific computing in Python. It handles large multidimensional arrays and matrices and provides an extensive collection of high-level mathematical functions for operating on them.
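As a minimal sketch of what operating on such arrays looks like (the array values are made up for illustration), the snippet below builds a small matrix, applies element-wise arithmetic with broadcasting, and computes a reduction and a matrix product:

```python
import numpy as np

# A 2x3 matrix built from a flat range: [[0, 1, 2], [3, 4, 5]]
a = np.arange(6).reshape(2, 3)

# Broadcasting: the 1-D row is added to every row of the matrix
row = np.array([10, 20, 30])
shifted = a + row

# Reduction along an axis (column sums) and a matrix product
col_sums = a.sum(axis=0)
gram = a @ a.T

print(shifted)
print(col_sums)
print(gram)
```

No explicit Python loop is needed: the operations are expressed on whole arrays.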

2.SciPy (Commits: 19150, Contributors: 608)

Website: scipy.org/scipylib/

Another core library for scientific computing is SciPy. It is built on NumPy and extends its functionality. SciPy's main data structure is again the multidimensional array, implemented by NumPy. The package contains tools for linear algebra, probability theory, integral calculus, and many other tasks. In addition, SciPy wraps many new BLAS and LAPACK functions.
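To illustrate two of the task areas mentioned, here is a small sketch (with an arbitrary example system and integrand) that solves a linear system and evaluates a definite integral on top of NumPy arrays:

```python
import numpy as np
from scipy import integrate, linalg

# Linear algebra: solve A x = b for x
A = np.array([[3.0, 1.0],
              [1.0, 2.0]])
b = np.array([9.0, 8.0])
x = linalg.solve(A, b)

# Integral calculus: integrate sin(t) from 0 to pi
area, abs_err = integrate.quad(np.sin, 0.0, np.pi)

print(x)
print(area)
```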

3.Pandas (Commits: 17144, Contributors: 1165)

Website: pandas.pydata.org/

Pandas is a Python library that provides high-level data structures and a wide variety of analysis tools. The main feature of this package is the ability to express fairly complex data operations in one or two commands. Pandas contains many built-in methods for grouping, filtering, and combining data, as well as time series functionality.
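A short sketch of the "one or two commands" idea, using a made-up sales table (the column names and values are assumptions for illustration): filtering, grouping, and aggregating are chained into a single expression.

```python
import pandas as pd

# Hypothetical sales records
sales = pd.DataFrame({
    "city": ["Oslo", "Oslo", "Rome", "Rome", "Rome"],
    "units": [3, 5, 2, 7, 1],
    "price": [10.0, 10.0, 12.0, 12.0, 12.0],
})
sales["revenue"] = sales["units"] * sales["price"]

# Filter rows, group by city, and sum revenue in one chained expression
summary = (
    sales[sales["units"] > 1]
    .groupby("city")["revenue"]
    .sum()
)
print(summary)
```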

4.StatsModels (Commits: 10067, Contributors: 153)

Website: www.statsmodels.org/devel/

Statsmodels is a Python module that provides many opportunities for statistical data analysis, such as estimating statistical models, performing statistical tests, and so on. With its help, you can implement many machine learning methods and explore different plotting possibilities.

The library is constantly evolving and gaining new capabilities. This year brought time series improvements and new count models, namely GeneralizedPoisson, zero-inflated models, and NegativeBinomialP, as well as new multivariate methods: factor analysis, MANOVA (multivariate analysis of variance), and repeated measures within ANOVA.

▌ Visualization

5.Matplotlib (Commits: 25747, Contributors: 725)

Website: matplotlib.org/index.html

Matplotlib is a low-level library for creating two-dimensional diagrams and graphs. With its help, you can build all kinds of charts, from histograms and scatter plots to graphs in non-Cartesian coordinates. In addition, many popular plotting libraries are designed to work in conjunction with Matplotlib.
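A small sketch of two of the chart types mentioned, a histogram and a scatter plot, drawn side by side from random data. The non-interactive "Agg" backend and the in-memory PNG buffer are choices made here so the example runs without a display; they are not requirements of the library.

```python
import io

import matplotlib
matplotlib.use("Agg")  # render off-screen, no GUI needed
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)
values = rng.normal(size=500)

fig, (ax_hist, ax_scatter) = plt.subplots(1, 2, figsize=(8, 3))

ax_hist.hist(values, bins=20)
ax_hist.set_title("Histogram")

# Plot each value against the next one in the sequence
ax_scatter.scatter(values[:-1], values[1:], s=5)
ax_scatter.set_title("Scatter plot")

fig.tight_layout()

# Write the figure as PNG into memory instead of a file on disk
png_buffer = io.BytesIO()
fig.savefig(png_buffer, format="png")
```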



6.Seaborn (Commits: 2044, Contributors: 83)

Website: seaborn.pydata.org/

Seaborn is essentially a higher-level API based on the Matplotlib library. It contains default settings that are better suited for working with charts. It also offers a rich gallery of visualizations, including complex types such as time series, joint plots, and violin plots.



7.Plotly (Commits: 2906, Contributors: 48)

Website: plot.ly/python/

Plotly is a popular library that lets you build sophisticated graphics easily. The package is suited to interactive web applications, and its visualizations include contour plots, ternary plots, and 3D charts.

8.Bokeh (Commits: 16983, Contributors: 294)

Website: bokeh.pydata.org/en/latest/

The Bokeh library uses JavaScript widgets to create interactive and scalable visualizations in the browser. The library provides a versatile collection of graphs, styling possibilities, interaction abilities in the form of linked plots, widgets, and callbacks, and many more useful features.


9.Pydot

Website: pypi.org/project/pyd…

Pydot is a library for generating complex directed and undirected graphs. It is an interface to Graphviz written in pure Python. With its help, you can display the structure of a graph, which is often needed when building neural networks and decision tree-based algorithms.

▌ Machine learning

10.Scikit-learn (Commits: 22753, Contributors: 1084)

Website: scikit-learn.org/stable/

This Python module, based on NumPy and SciPy, is one of the best libraries for working with data. It provides algorithms for many standard machine learning and data mining tasks, such as clustering, regression, classification, dimensionality reduction, and model selection.
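A minimal classification sketch using the iris data set bundled with scikit-learn. The choice of a random forest, the 75/25 split, and the fixed random seeds are assumptions made for illustration:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# Hold out a quarter of the samples for evaluation, keeping class balance
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)

# Mean accuracy on the held-out samples
accuracy = model.score(X_test, y_test)
print(f"test accuracy: {accuracy:.2f}")
```

Most estimators in the library follow the same `fit` / `predict` / `score` pattern, so swapping in another algorithm usually changes only the line that constructs the model.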

11.XGBoost / LightGBM / CatBoost (Commits: 3277 / 1083 / 1509, Contributors: 280 / 79 / 61)

Website:

xgboost.readthedocs.io/en/latest/h…

Gradient boosting is one of the most popular machine learning algorithms. It builds an ensemble of successively refined elementary models, namely decision trees. Special libraries have therefore been designed for a fast and convenient implementation of this method. In particular, we think XGBoost, LightGBM, and CatBoost deserve special attention. They compete to solve a common problem and are used in much the same way. These libraries offer highly optimized, scalable, and fast implementations of gradient boosting, which makes them very popular among data scientists and Kaggle competitors, as many contests have been won with the help of these algorithms.
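To make the idea of successively refined trees concrete, here is a from-scratch sketch of gradient boosting for squared-error regression. It uses scikit-learn's `DecisionTreeRegressor` as the elementary model rather than XGBoost, LightGBM, or CatBoost, and the data, learning rate, tree depth, and number of rounds are all assumptions chosen for illustration. The libraries above implement the same principle with many optimizations.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression problem: noisy sine curve
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=200)

learning_rate = 0.1

# Start from a constant model: the mean of the targets
prediction = np.full_like(y, y.mean())
baseline_mse = np.mean((y - prediction) ** 2)

trees = []
for _ in range(100):
    # For squared loss, the negative gradient is simply the residual
    residuals = y - prediction
    tree = DecisionTreeRegressor(max_depth=2, random_state=0)
    tree.fit(X, residuals)
    # Each new tree corrects a fraction of the current error
    prediction += learning_rate * tree.predict(X)
    trees.append(tree)

boosted_mse = np.mean((y - prediction) ** 2)
print(f"MSE before: {baseline_mse:.4f}, after: {boosted_mse:.4f}")
```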

12.Eli5 (Commits: 922, Contributors: 6)

Website: eli5.readthedocs.io/en/latest/

Often, the results predicted by machine learning models are not entirely clear, and this is the challenge Eli5 helps address. It is a package for visualizing and debugging machine learning models and tracking the work of an algorithm step by step. It supports the scikit-learn, XGBoost, LightGBM, lightning, and sklearn-crfsuite libraries and performs different tasks for each of them.

▌ Deep learning

13.TensorFlow (Commits: 33339, Contributors: 1469)

Website: www.tensorflow.org/

TensorFlow is a popular framework for deep learning and machine learning developed by Google Brain. It lets you work with artificial neural networks on multiple data sets. Among the most popular TensorFlow applications are object identification, speech recognition, and more. There are also various layer helpers built on top of regular TensorFlow, such as TFLearn, TF-Slim, skflow, etc.

14.PyTorch (Commits: 11306, Contributors: 635)

Website: pytorch.org/

PyTorch is a large framework that allows you to perform tensor computations with GPU acceleration, create dynamic computational graphs, and automatically compute gradients. On top of that, PyTorch provides a rich API for solving applications related to neural networks. The library is based on Torch, an open source deep learning library implemented in C.
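A brief sketch of the two core features named above: automatic gradient computation and tensor operations that run on a GPU when one is available. The tensors are arbitrary examples, and the code falls back to the CPU if CUDA is not present.

```python
import torch

# Automatic differentiation: y = sum(x^2), so dy/dx = 2x
x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
y = (x ** 2).sum()
y.backward()          # the graph is built dynamically as the code runs
print(x.grad)         # gradient of y with respect to x

# Tensor computation on the GPU if available, otherwise on the CPU
device = "cuda" if torch.cuda.is_available() else "cpu"
a = torch.ones(2, 3, device=device)
b = a @ a.T           # 2x2 matrix product
print(b)
```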

15.Keras (Commits: 4539, Contributors: 671)

Website: keras.io/

Keras is a high-level library for working with neural networks. It runs on top of TensorFlow and Theano and, thanks to new releases, can also use CNTK and MXNet as backends. It simplifies many specific tasks and greatly reduces the amount of boilerplate code. However, it may not be suitable for some complicated tasks.

▌ Distributed deep learning

16.Dist-keras / elephas / spark-deep-learning (Commits: 1125 / 170 / 67, Contributors: 5 / 13 / 11)

Website:

joerihermans.com/work/distri…

Deep learning problems are becoming increasingly important, as more and more use cases demand considerable effort and time. Processing that much data is much easier with distributed computing systems like Apache Spark, which again extends the possibilities for deep learning. As a result, dist-keras, elephas, and spark-deep-learning are gaining popularity and developing rapidly, and it is difficult to single out one of them because they are all designed to solve a common task. These packages allow you to train neural networks based on the Keras library directly with the help of Apache Spark. Spark-deep-learning also provides tools for creating pipelines with Python neural networks.

▌ Natural language processing

17.NLTK (Commits: 13041, Contributors: 236)

Website: www.nltk.org/

NLTK is a set of libraries, a complete platform for natural language processing. With the help of NLTK, you can process and analyze text in a variety of ways, tokenize and tag it, extract information, and more. NLTK is also used for prototyping and building research systems.
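A small sketch of basic text processing with NLTK: tokenizing, counting word frequencies, and stemming. The sample sentence is made up, and the example deliberately sticks to components that, as far as we know, work without downloading extra corpora or models (`wordpunct_tokenize`, `FreqDist`, `PorterStemmer`).

```python
from nltk import FreqDist
from nltk.stem import PorterStemmer
from nltk.tokenize import wordpunct_tokenize

text = "The cats are chasing the mice. The mice are running."

# Split the text into word and punctuation tokens
tokens = wordpunct_tokenize(text)

# Count lower-cased alphabetic tokens
freq = FreqDist(token.lower() for token in tokens if token.isalpha())

# Reduce words to their stems
stemmer = PorterStemmer()
stems = [stemmer.stem(token) for token in tokens if token.isalpha()]

print(tokens)
print(freq.most_common(3))
print(stems)
```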

18.SpaCy (Commits: 8623, Contributors: 215)

Website: spacy.io/

SpaCy is a natural language processing library with excellent examples, API documentation, and demo applications. The library is written in Cython, a C extension of Python. It supports almost 30 languages, provides easy deep learning integration, and promises robustness and high accuracy. Another important feature of SpaCy is an architecture designed for processing entire documents, without breaking them into phrases.

19.Gensim (Commits: 3603, Contributors: 273)

Website: radimrehurek.com/gensim/

Gensim is a Python library for robust semantic analysis, topic modeling, and vector space modeling, built on top of NumPy and SciPy. It provides implementations of popular NLP algorithms such as word2vec. Although Gensim has its own models.wrappers.fasttext implementation, the fastText library can also be used for efficient learning of word representations.

▌ Data collection

20.Scrapy (Commits: 6625, Contributors: 281)

Website: scrapy.org/

Scrapy is a library for creating spider bots that scan web pages and collect structured data. In addition, Scrapy can extract data from an API. The library is very convenient to use thanks to its extensibility and portability.

▌ Conclusion

The list above is our extended collection of Python libraries for data science in 2018. Some new, modern libraries have gained popularity compared with last year, while those that have become classics for data science tasks keep improving.

The following table shows the detailed statistics for GitHub activity:




