The following article comes from Tencent Cloud. Author: Shen Condolence
Network
General
- Urllib – Network library (stdlib).
- Requests – Network library.
- Grab – Network library (based on PyCurl).
- Pycurl – Network library (binding to libcurl).
- Urllib3 – Python HTTP library with thread-safe connection pooling, file post support, and a sanity-friendly API.
- Httplib2 – Network library.
- RoboBrowser – A simple, Pythonic library for browsing the web without a standalone browser.
- MechanicalSoup – A Python library for automating interaction with websites.
- Mechanize – Stateful, programmable web browsing library.
- Socket – Low-level networking interface (stdlib).
- Unirest for Python – Unirest is a set of lightweight HTTP libraries available in multiple languages.
- Hyper – HTTP/2 client for Python.
- PySocks – An updated and actively maintained version of SocksiPy, with bug fixes and some extra features. Acts as a drop-in replacement for the socket module.
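As a rough illustration of the stdlib option in this list, the sketch below builds a GET request with `urllib` and a custom User-Agent header. The helper names `build_request` and `fetch`, the example URL, and the User-Agent string are assumptions made for illustration only; `fetch` performs real network I/O and is therefore defined but not called here.

```python
import urllib.parse
import urllib.request


def build_request(base_url, params, user_agent="example-crawler/0.1"):
    """Build a GET request with query parameters and a custom User-Agent.

    Hypothetical helper: the name and the default User-Agent are illustrative.
    """
    query = urllib.parse.urlencode(params)
    url = f"{base_url}?{query}"
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def fetch(request, timeout=10):
    """Send the request and return the decoded body (performs network I/O)."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        # Fall back to UTF-8 when the server does not declare a charset.
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


# Building the request does not touch the network.
req = build_request("https://example.com/search", {"q": "python crawler", "page": 1})
print(req.full_url)
```

Third-party libraries such as Requests wrap these steps in a shorter API, which is one reason they are often preferred over the standard library for crawling.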
Asynchronous
- Treq – A Requests-like API (based on Twisted).
- Aiohttp – HTTP client/server for asyncio (PEP 3156).
Web crawler frameworks
Full-featured crawlers
- Grab – Web crawler framework (based on pycurl/multicurl).
- Scrapy – Web crawler framework (based on Twisted). Current releases support Python 3.
- Pyspider – A powerful crawler system.
- Cola – A distributed crawler framework.
Other
- Portia – Visual crawler based on Scrapy.
- Restkit – HTTP resource toolkit for Python. It lets you easily access HTTP resources and build objects around them.
- Demiurge – PyQuery-based crawler micro-framework.
HTML/XML parsers
General
- Lxml – Efficient HTML/XML processing library written in C. Supports XPath.
- Cssselect – Queries the DOM tree with CSS selectors.
- Pyquery – Queries the DOM tree with jQuery-like selectors.
- BeautifulSoup – Lower-performance HTML/XML processing library, implemented in pure Python.
- Html5lib – Builds the DOM of HTML/XML documents according to the WHATWG specification, which all current browsers follow.
- Feedparser – Parses RSS/Atom feeds.
- MarkupSafe – Provides safely escaped strings for XML/HTML/XHTML.
- Xmltodict – A Python module that makes working with XML feel like working with JSON.
- Xhtml2pdf – Converts HTML/CSS to PDF.
- Untangle – Easily converts XML files into Python objects.
Cleaning
- Bleach – Cleans up HTML (requires html5lib).
- Sanitize – Brings sanity to the chaotic world of data.
Text processing
Libraries for parsing and manipulating plain text.
General
- Difflib – (Python standard library) Helps compute differences between sequences.
- Levenshtein – Fast computation of Levenshtein distance and string similarity.
- Fuzzywuzzy – Fuzzy string matching.
- Esmre – Regular expression accelerator.
- Ftfy – Automatically fixes broken Unicode text (mojibake).
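To make the string-similarity entries above concrete, here is a minimal sketch using only `difflib` from the standard library. The wrapper names `similarity` and `suggest` are hypothetical; the inputs are the ones used in the Python documentation for `SequenceMatcher` and `get_close_matches`.

```python
import difflib


def similarity(a, b):
    """Return a similarity ratio in [0, 1] between two strings."""
    return difflib.SequenceMatcher(None, a, b).ratio()


def suggest(word, candidates, n=3, cutoff=0.6):
    """Return up to n candidates that resemble word, best match first."""
    return difflib.get_close_matches(word, candidates, n=n, cutoff=cutoff)


score = similarity("abcd", "bcde")
matches = suggest("appel", ["ape", "apple", "peach", "puppy"])
print(score)    # 0.75
print(matches)  # ['apple', 'ape']
```

Libraries such as Levenshtein and Fuzzywuzzy cover the same kind of task with different metrics and are typically chosen when speed or scoring flexibility matters.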
Conversion
- Unidecode – Converts Unicode text to ASCII.
Character encoding
- Uniout – Prints readable characters instead of escaped strings.
- Chardet – Character encoding detector compatible with Python 2 and 3.
- Xpinyin – A library for converting Chinese characters into pinyin.
- Pangu.py – Adds spacing between CJK and alphanumeric characters in text.
Slugify
- Awesome-slugify – A Python slugify library that can preserve Unicode.
- Python-slugify – A Python slugify library that converts Unicode to ASCII.
- Unicode-slugify – A tool for generating Unicode slugs.
- Pytils – A simple tool for handling Russian strings (including pytils.translit.slugify).
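For readers unfamiliar with slugs, the following stdlib-only sketch shows the general idea behind an ASCII slugifier: decompose accented characters, drop whatever is not ASCII, and join the remaining words with hyphens. It is not the implementation of any library listed above, and the function name `simple_slugify` is hypothetical.

```python
import re
import unicodedata


def simple_slugify(text):
    """Rough ASCII slug: a simplified stand-in, not any library's algorithm."""
    # Split accented characters into base letter + combining mark.
    normalized = unicodedata.normalize("NFKD", text)
    # Drop everything that cannot be represented in ASCII.
    ascii_text = normalized.encode("ascii", "ignore").decode("ascii")
    # Collapse runs of non-alphanumeric characters into single hyphens.
    slug = re.sub(r"[^a-z0-9]+", "-", ascii_text.lower())
    return slug.strip("-")


print(simple_slugify("Héllo, Wörld!"))  # hello-world
```

Note that this naive approach discards CJK text entirely (`simple_slugify("中文")` returns an empty string), which is the gap that transliterating or Unicode-preserving slug libraries are meant to fill.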
General-purpose parsers
- PLY – Python implementation of the lex and yacc parsing tools.
- Pyparsing – A general-purpose framework for generating parsers.
Personal names
- Python-nameparser – Parses human names into their individual components.
Phone numbers
- Phonenumbers – Parses, formats, stores and validates international phone numbers.
User agent strings
- Python-user-agents – Parser for browser user agents.
- HTTP Agent Parser – HTTP user agent parser for Python.
Specific format file processing
Libraries that parse and process specific text formats.
General
- Tablib – A module that exports data to XLS, CSV, JSON, YAML, etc.
- Textract – Extracts text from various files, such as Word, PowerPoint, PDF, etc.
- Messytables – A tool for parsing messy tabular data.
- Rows – A common data interface that supports many formats (currently CSV, HTML, XLS, TXT; more to come).
Office
- Python-docx – Reads, queries and modifies Microsoft Word 2007/2008 docx files.
- Xlwt/Xlrd – Writes (xlwt) and reads (xlrd) data and formatting information in Excel files.
- XlsxWriter – A Python module for creating Excel .xlsx files.
- Xlwings – A BSD-licensed library that makes it easy to call Python from Excel and vice versa.
- Openpyxl – A library for reading and writing Excel 2010 xlsx/xlsm/xltx/xltm files.
- Marmir – Takes Python data structures and turns them into spreadsheets.
PDF
- PDFMiner – A tool for extracting information from PDF documents.
- PyPDF2 – A library that can split, merge, and transform PDF pages.
- ReportLab – Allows fast creation of rich PDF documents.
- Pdftables – Extracts tables directly from PDF files.
Markdown
- Python-markdown – A Python implementation of John Gruber’s Markdown.
- Mistune – A fast, full-featured Markdown parser in pure Python.
- Markdown2 – A fast Markdown implementation written entirely in Python.
YAML
- PyYAML – A YAML parser for Python.
CSS
- Cssutils – A CSS library for Python.
Atom/RSS
- Feedparser – Universal feed parser.
SQL
- Sqlparse – A non-validating SQL statement parser.
HTTP
- Http-parser – HTTP request/response message parser implemented in C.
Microformats
- Opengraph – A Python module for parsing Open Graph protocol tags.
Portable executable
- Pefile – A multi-platform module for parsing and working with Portable Executable (PE) files.
PSD
- Psd-tools – Reads Adobe Photoshop PSD files into Python data structures.
Natural language processing
Libraries for working with human language.
- NLTK – A leading platform for building Python programs that work with human language data.
- Pattern – A web mining module for Python. It has tools for natural language processing, machine learning and more.
- TextBlob – Provides a consistent API for diving into common natural language processing tasks. It stands on the shoulders of NLTK and Pattern.
- Jieba – Chinese word segmentation tool.
- SnowNLP – Chinese text processing library.
- Loso – Another Chinese word segmentation library.
- Genius – Chinese word segmentation based on conditional random fields.
- Langid.py – Stand-alone language identification system.
- Korean – A Korean morphology library.
- Pymorphy2 – Russian morphological analyzer (POS tagging + inflection engine).
- PyPLN – A distributed natural language processing pipeline written in Python. The project aims to provide a simple way to use NLTK to process large corpora through a web interface.
Browser automation and emulation
- Selenium – Automates real browsers (Chrome, Firefox, Opera, IE).
- Ghost.py – PyQt wrapper around WebKit (requires PyQt).
- Spynner – PyQt wrapper around WebKit (requires PyQt).
- Splinter – Generic browser emulation API (Selenium WebDriver, Django client, Zope).
Multiprocessing
- Threading – Thread-based execution from the Python standard library. Well suited to I/O-bound tasks, but offers little speedup for CPU-bound tasks because of the Python GIL.
- Multiprocessing – Process-based execution from the Python standard library.
- Celery – Asynchronous task queue/job queue based on distributed message passing.
- Concurrent.futures – A standard library module that provides a high-level interface for asynchronously executing callables.
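The point about threads suiting I/O-bound work can be sketched with `concurrent.futures`. In the example below, `fake_download` is a hypothetical stand-in for a blocking network call (it only sleeps), and the worker count and delay are arbitrary assumptions.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_download(page_id, delay=0.05):
    """Stand-in for an I/O-bound call such as an HTTP request (hypothetical)."""
    time.sleep(delay)  # sleeping releases the GIL, much like blocking I/O does
    return f"page-{page_id}"


def download_all(page_ids, max_workers=4):
    """Run fake_download over all ids using a pool of worker threads."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() yields results in the same order as the inputs.
        return list(pool.map(fake_download, page_ids))


results = download_all(range(8))
print(results)
```

Because the workers spend their time waiting rather than computing, several of them can overlap despite the GIL. For CPU-bound work, swapping in `ProcessPoolExecutor` (under an `if __name__ == "__main__":` guard) is the usual alternative.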
Asynchronous
Asynchronous network programming libraries
- Asyncio – (Python standard library, Python 3.4+) Asynchronous I/O, event loop, coroutines, and tasks.
- Twisted – An event-driven networking engine framework.
- Tornado – A web framework and asynchronous networking library.
- Pulsar – Event-driven concurrency framework for Python.
- Diesel – Greenlet-based event I/O framework for Python.
- Gevent – A coroutine-based Python networking library that uses greenlet.
- Eventlet – Asynchronous framework with WSGI support.
- Tomorrow – Magic decorator syntax for asynchronous code.
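As a small illustration of the event loop, coroutines, and tasks mentioned for Asyncio, the sketch below runs several simulated fetches concurrently. `fake_fetch` is a hypothetical placeholder that only sleeps; a real crawler would await an HTTP client such as Aiohttp instead. It assumes Python 3.7 or later for `asyncio.run`.

```python
import asyncio


async def fake_fetch(url, delay=0.05):
    """Simulated non-blocking fetch (hypothetical; no network access)."""
    await asyncio.sleep(delay)  # yields control to the event loop while waiting
    return f"fetched {url}"


async def crawl(urls):
    """Run all fetches concurrently and return results in input order."""
    return await asyncio.gather(*(fake_fetch(u) for u in urls))


pages = asyncio.run(crawl(["a", "b", "c"]))
print(pages)  # ['fetched a', 'fetched b', 'fetched c']
```

All three coroutines wait at the same time on a single thread, so the total run time is close to one delay rather than three.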