- Lesser Known Python Libraries for Data Science
- Original article by Parul Pandey
- Translated from: the Gold Translation Project
- Permalink to this article: github.com/xitu/gold-m…
- Translator: haiyang-tju
- Proofreader: TrWestdoor
Photo by Hitesh Choudhary on Unsplash
Python is a great language. It is one of the fastest-growing programming languages in the world. It has proven time and again to be useful in both developer roles and data science positions across industries. The entire ecosystem of Python and its libraries makes it an apt choice for users (beginners and advanced) all over the world. One of the reasons for its success and popularity is its powerful collection of third-party libraries that keep it alive and efficient.
In this article, we’ll look at some Python libraries for data science tasks other than the usual ones like Pandas, scikit-learn, and Matplotlib. While libraries like Pandas and scikit-learn are the defaults for machine learning tasks, it’s always good to learn about other Python offerings in the field.
Wget
Extracting data from the web is one of the most important tasks of a data scientist. Wget is a free utility for the non-interactive download of files from the Web. It supports the HTTP, HTTPS, and FTP protocols, as well as retrieval through HTTP proxies. Since it is non-interactive, it can work in the background even if the user is not logged in. So the next time you want to download a website or all the images on a page, Wget is there to help.
Installation:
$ pip install wget
Example:
import wget
url = 'http://www.futurecrew.com/skaven/song_files/mp3/razorback.mp3'
filename = wget.download(url)
100% [................................................] 3841532 / 3841532
filename
'razorback.mp3'
Pendulum
For those who get frustrated working with dates and times in Python, Pendulum is for you. It is a Python package that simplifies datetime manipulation and acts as an easy replacement for Python’s native datetime classes. See the documentation for further details.
Installation:
$ pip install pendulum
Example:
import pendulum
dt_toronto = pendulum.datetime(2012, 1, 1, tz='America/Toronto')
dt_vancouver = pendulum.datetime(2012, 1, 1, tz='America/Vancouver')
print(dt_vancouver.diff(dt_toronto).in_hours())
3
imbalanced-learn
Most classification algorithms work best when the number of samples in each class is roughly equal, that is, when the data are balanced. In real life, however, most datasets are imbalanced, which can heavily affect both the learning phase and the subsequent predictions of machine learning algorithms. Fortunately, this library was built to solve exactly that problem. It is compatible with scikit-learn and is part of the scikit-learn-contrib project. Try it the next time you encounter an imbalanced dataset.
Installation:
pip install -U imbalanced-learn
# or
conda install -c conda-forge imbalanced-learn
Example:
Please refer to the documentation for usage methods and examples.
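To illustrate what the library’s random over-sampler does (duplicate minority-class rows until the classes are balanced), here is a plain-NumPy sketch. The data are made up for the illustration; with the library itself, the same effect is a one-liner, `RandomOverSampler().fit_resample(X, y)`.

```python
import numpy as np

rng = np.random.default_rng(42)
X = rng.normal(size=(100, 2))              # 100 samples, 2 features
y = np.array([0] * 90 + [1] * 10)          # a 9:1 class imbalance

# Randomly re-draw minority-class rows (with replacement) until 90 vs. 90
minority = np.flatnonzero(y == 1)
extra = rng.choice(minority, size=80, replace=True)
X_res = np.vstack([X, X[extra]])
y_res = np.concatenate([y, y[extra]])

print(np.bincount(y_res))                  # [90 90] -- now balanced
```

Note that naive duplication can encourage overfitting; imbalanced-learn also offers smarter resamplers such as SMOTE.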
FlashText
In NLP tasks, cleaning text data often requires replacing or extracting keywords from sentences. Normally, this can be done using regular expressions, but it can become cumbersome if you have to search for thousands of terms. Python’s FlashText module, which is based on the FlashText algorithm, provides a suitable alternative for this situation. The great thing about FlashText is that it takes the same amount of time to run regardless of the number of search terms. You can learn more here.
Installation:
$ pip install flashtext
Example:
Extracting keywords
from flashtext import KeywordProcessor
keyword_processor = KeywordProcessor()
# keyword_processor.add_keyword(<unclean name>, <standardised name>)
keyword_processor.add_keyword('Big Apple', 'New York')
keyword_processor.add_keyword('Bay Area')
keywords_found = keyword_processor.extract_keywords('I love Big Apple and Bay Area.')
keywords_found
['New York', 'Bay Area']
Replace keywords
keyword_processor.add_keyword('New Delhi', 'NCR region')
new_sentence = keyword_processor.replace_keywords('I love Big Apple and new delhi.')
new_sentence
'I love New York and NCR region.'
For more practical cases, please refer to the official documentation.
Fuzzywuzzy
The library’s name sounds strange, but FuzzyWuzzy is a very helpful library when it comes to string matching. It makes operations like computing string similarity ratios and token ratios easy, and it is also handy for matching records kept in different databases.
Installation:
$ pip install fuzzywuzzy
Example:
from fuzzywuzzy import fuzz
from fuzzywuzzy import process
# Simple matching
fuzz.ratio("this is a test", "this is a test!")
97
# Fuzzy matching
fuzz.partial_ratio("this is a test", "this is a test!")
100
More interesting examples can be found in the GitHub repository.
PyFlux
Time series analysis is one of the most common problems in machine learning. PyFlux is an open source library in Python that was built to deal with time series issues. The library has an excellent set of modern time series models, including but not limited to the ARIMA, GARCH, and VAR models. In short, PyFlux provides a probabilistic approach to modeling time series. It’s worth a try.
Installation:
pip install pyflux
Example:
Refer to the official documentation for detailed usage and examples.
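PyFlux itself may not install cleanly on recent Python versions, so as a stand-in, here is a minimal sketch of the idea behind the AR family of models PyFlux offers: regress each value of a series on its recent past. The simulated series and the coefficients 0.6 and -0.2 are illustrative assumptions, and the fit uses plain NumPy least squares rather than PyFlux’s API.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate an AR(2) process: x_t = 0.6*x_{t-1} - 0.2*x_{t-2} + noise
n = 500
x = np.zeros(n)
for t in range(2, n):
    x[t] = 0.6 * x[t - 1] - 0.2 * x[t - 2] + rng.normal(scale=0.1)

# Recover the two lag coefficients by ordinary least squares
X = np.column_stack([x[1:-1], x[:-2]])   # lag-1 and lag-2 columns
y = x[2:]
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
print(coef)  # close to the true values (0.6, -0.2)
```

PyFlux layers full probabilistic inference (MLE, MCMC, variational) on top of models like this, which is what makes it worth a look.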
Ipyvolume
Presenting results is also an important aspect of data science, and being able to visualize them is a big advantage. Ipyvolume is a Python library that lets you visualize 3D volumes and glyphs (such as 3D scatter plots) in the Jupyter Notebook with minimal configuration and effort. It is still in a pre-1.0 stage, however. A good analogy: Ipyvolume’s volshow is to 3D arrays what Matplotlib’s imshow is to 2D arrays. You can get more here.
Installation:
$ pip install ipyvolume
# or, with conda/Anaconda
$ conda install -c conda-forge ipyvolume
Example:
- Animation
- Volume rendering
Dash
Dash is a productive Python framework for building web applications. It is built on top of Flask, Plotly.js, and React.js, and it ships with modern UI elements like dropdowns, sliders, and charts, so you can write your analysis entirely in Python instead of JavaScript. Dash is ideal for building data visualization applications, which can then be rendered in the web browser. The user guide is available here.
Installation:
pip install dash==0.29.0                  # core Dash backend
pip install dash-html-components==0.13.2  # HTML components
pip install dash-core-components==0.36.0  # supercharged components
pip install dash-table==3.1.3             # interactive DataTable component (latest!)
Example:
The following example shows a highly interactive chart with a dropdown. As the user selects a value in the dropdown, the application code dynamically exports data from Google Finance into a Pandas DataFrame. The source code is here.
Gym
OpenAI Gym is a toolkit for developing and comparing reinforcement learning algorithms. It is compatible with any numerical computation library, such as TensorFlow or Theano. The Gym library is a collection of test problems, also known as environments, that you can use to work out your reinforcement learning algorithms. These environments have a shared interface, allowing you to write general algorithms.
Installation:
pip install gym
Example:
The following example runs an instance of the CartPole-v0 environment for 1000 time steps, rendering the environment at each step.
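The loop just described looks roughly like this, using the classic Gym step() API (newer Gym releases changed step() to return five values, so the flag is read by position here); the render call is commented out so the loop also runs on machines without a display.

```python
import gym

env = gym.make('CartPole-v0')
env.reset()
for _ in range(1000):
    # env.render()  # draws the scene; requires a display
    step_result = env.step(env.action_space.sample())  # take a random action
    if step_result[2]:   # episode finished (the pole fell over), start a new one
        env.reset()
env.close()
```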
You can find information about other environments here.
Conclusion
I have carefully selected these useful data science Python libraries, avoiding the usual ones like NumPy and Pandas. If you know of other libraries that belong on this list, please mention them in the comments below. And don’t forget to try running them first.
If you find any mistakes in this translation or other areas that could be improved, you are welcome to revise it and open a PR in the Gold Translation Project, for which you can earn corresponding bonus points. The permalink at the beginning of this article is the Markdown link to this article on GitHub.
The Gold Translation Project is a community that translates high-quality technical articles from the Internet, shared as English-to-Chinese articles on Juejin. Its content covers Android, iOS, front end, back end, blockchain, product, design, artificial intelligence, and other fields. For more high-quality translations, please follow the Gold Translation Project, its official Weibo account, and its Zhihu column.