- JSON encoding/decoding with Python
- Originally written by Martin Thoma
- The Nuggets translation Project
- Permanent link to this article: github.com/xitu/gold-m…
- Translator: Luo Zhu, snow thorn
- Proofreader: Zoe
JSON is a cornerstone of data exchange on the Internet: REST APIs around the world use it as their standardized message format. As a subset of JavaScript, it has been hugely popular since its inception, and its clear, readable syntax has certainly helped its adoption.
There are JSON libraries for serialization and deserialization in practically every language. Python alone has several of them. Below, I compare them for you.
The libraries
CPython itself ships a json module. It was originally developed by Bob Ippolito as simplejson and was merged into Python in version 2.6 (source code). CPython is licensed under the Python Software Foundation License.
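As a point of reference for the comparisons that follow, here is a minimal sketch of a round trip with the built-in json module. The sample record is made up for illustration and is not taken from the benchmark data.

```python
import json

# A small nested Python object with non-ASCII text (illustrative data only).
record = {"user": "Zoë", "tags": ["json", "python"], "score": 4.5, "active": True}

# Serialize: Python object -> JSON string.
# ensure_ascii=False keeps "ë" readable instead of escaping it.
record_json = json.dumps(record, ensure_ascii=False)

# Deserialize: JSON string -> Python object.
restored = json.loads(record_json)

print(record_json)
print(restored == record)
```

The other libraries in this article expose similarly named loads and dumps functions, although return types and keyword arguments can differ between them.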
simplejson still exists as a separate library, which you can install via pip. It is a pure Python library with optional C extensions. simplejson is dual-licensed under the MIT License and the Academic Free License (AFL).
ujson is a binding to the C library UltraJSON. UltraJSON was developed by ESN, an Electronic Arts studio, and is licensed under the 3-clause BSD license. On GitHub it has 3k stars, 305 forks, and 50 contributors; the last commit was 12 days ago and the last issue was opened 5 days ago. It is reportedly in “maintenance mode” (source), meaning no new features are being developed.
pysimdjson is a binding to the C++ library simdjson. simdjson received funding from Canada. On GitHub it has 12.2k stars, 611 forks, and 63 contributors; the last commit was 11 hours ago and the last issue was created 2 hours ago.
python-rapidjson is a binding to the C++ library RapidJSON. RapidJSON was developed by Tencent. On GitHub it has 9.8k stars, 2.7k forks, and 150 contributors; the last commit was about 2 months ago and the last issue was created 17 days ago.
orjson is a Python package that relies on Rust to do the heavy lifting.
Maturity and operational safety
All of the libraries above handled the benchmark examples without problems, and switching JSON modules should not be a big deal, but I still want to know that the library I pick is actively supported.
CPython, simplejson, ujson, and orjson all consider themselves production-ready.
python-rapidjson marks itself as alpha, but a maintainer says this is a mistake and will be fixed soon (source).
Issues
A direct way to find out whether problems with a library get resolved is to open an issue in its repository and see how the maintainers respond:
- simplejson: I got an answer the next day, and it was clear, easy to understand, and kind. It came from Bob Ippolito, who originally developed the library and is mentioned in the Python documentation for the json module!
- ujson: Within 30 minutes, I had a clear, friendly, easy-to-follow answer. @hugovank
- orjson: No response for 10 days, then the issue was closed without any comment.
- pysimdjson: No reply after 15 days.
- python-rapidjson: Within 30 minutes, I got a clear, friendly, and easy-to-follow answer. Ten days later a simple PR was merged.
I’ve come up with an answer that is basically unrelated.
Benchmark
To properly benchmark different libraries, I consider the following:
- API: Web services exchange information through APIs. The payload may contain Unicode and be nested. A JSON file from the Twitter API seemed like a good test case.
- API JSON error: I was curious how performance changes when the JSON returned by an API is malformed, so I removed a curly brace in the middle.
- GeoJSON: I got a JSON file in GeoJSON format from Overpass Turbo, an OpenStreetMap exporter. It produces very large JSON files that consist mostly of coordinates and are fairly nested.
- Machine learning: Just a large list of floating point numbers. These could be the weights of a neural network layer.
- JSON Lines: Structured logging is widely used in industry. If you analyze such logs, you may need to go through gigabytes of data. Each line is a simple dictionary with a datetime, a message, the logger, the log level, and so on.
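To make the structured-logging case concrete, here is a minimal sketch of reading line-delimited JSON one record at a time with the standard library. The field names and log entries are assumptions made up for this example; they are not the benchmark data.

```python
import io
import json

# Hypothetical structured-log lines; the field names are assumptions.
log_text = (
    '{"time": "2020-06-01T12:00:00", "level": "INFO", "logger": "app", "message": "started"}\n'
    '{"time": "2020-06-01T12:00:05", "level": "ERROR", "logger": "app.db", "message": "timeout"}\n'
)


def iter_json_lines(stream):
    """Yield one parsed object per non-empty line, without loading the whole file."""
    for line in stream:
        line = line.strip()
        if line:
            yield json.loads(line)


# io.StringIO stands in for an open log file here.
errors = [
    entry
    for entry in iter_json_lines(io.StringIO(log_text))
    if entry["level"] == "ERROR"
]
print(errors)
```

Because each line is parsed on its own, memory use stays small even for very large log files, and the per-call speed of loads is what dominates the run time.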
Deserialization speed
I included my hard drive's read speed as a lower bound, which serves as the baseline in the following three charts.
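The article does not show the benchmark code, so the following is only a sketch of how such a deserialization timing could be set up. The synthetic payload, the helper name best_time, and the repeat counts are all assumptions; third-party libraries are timed only if they happen to be installed.

```python
import importlib
import json
import timeit

# Synthetic payload standing in for a real test file (an assumption for this sketch).
payload = json.dumps(
    [{"id": i, "coords": [i * 0.5, i * 1.5], "name": f"node-{i}"} for i in range(1000)]
)


def best_time(loads, text, repeat=5, number=20):
    """Return the best observed per-call time in seconds for loads(text)."""
    timings = timeit.repeat(lambda: loads(text), repeat=repeat, number=number)
    return min(timings) / number


results = {"json": best_time(json.loads, payload)}

# Optional third-party libraries are measured only if they can be imported.
for name in ("simplejson", "ujson", "orjson", "rapidjson"):
    try:
        module = importlib.import_module(name)
    except ImportError:
        continue
    results[name] = best_time(module.loads, payload)

# Print the fastest library first.
for name, seconds in sorted(results.items(), key=lambda item: item[1]):
    print(f"{name:12s} {seconds * 1e6:10.1f} us per call")
```

Taking the minimum over several repeats reduces noise from other processes. Absolute numbers depend heavily on the machine and on the shape of the data, so they should not be compared with the charts in this article.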
The conclusions are as follows:
- rapidjson is slow, but for small JSON files such as twitter.json you will not notice the difference. The difference does show up with the structured logs.
- simdjson, orjson, and ujson are all surprisingly fast.
- For most libraries, JSON files with structural errors are read at the same speed as valid ones. One notable exception is rapidjson. My guess is that it stops reading the file as soon as it finds an error.
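For the malformed-input case, here is a small sketch of what a structural error looks like with the built-in json module. The sample string and the helper try_parse are hypothetical; the brace is removed in roughly the way described for the benchmark.

```python
import json

valid = '{"user": {"name": "alice", "ids": [1, 2, 3]}, "ok": true}'
# Drop the closing brace of the nested object, so the document never closes properly.
broken = valid.replace("}, ", ", ", 1)


def try_parse(text):
    """Return (parsed, None) on success or (None, error) on malformed input."""
    try:
        return json.loads(text), None
    except json.JSONDecodeError as exc:
        return None, exc


parsed, error = try_parse(valid)
_, broken_error = try_parse(broken)

print(parsed["user"]["name"])
print(f"line {broken_error.lineno}, column {broken_error.colno}: {broken_error.msg}")
```

In the standard library, json.JSONDecodeError is a subclass of ValueError. Other libraries define their own exception types, so error handling is one of the places to check when swapping modules.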
Serialization speed
In this case, I pre-created the JSON string and measured the time required to write to disk as a baseline.
My conclusion from this is:
- orjson is very fast, extremely close to my hard drive's write speed. ujson is also very close.
- rapidjson is also fast, but not in the same league as orjson or ujson.
- simdjson is slow.
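The baseline idea for serialization (writing a pre-created JSON string to disk) can be sketched as follows. This is not the author's benchmark code: the data, the helper names, and the single-run timing are assumptions, and a real measurement would repeat each step several times.

```python
import json
import os
import tempfile
import time

# Synthetic data resembling the "large list of floats" case (an assumption).
weights = [i / 7.0 for i in range(100_000)]


def timed(func):
    """Run func once and return the elapsed wall-clock seconds."""
    start = time.perf_counter()
    func()
    return time.perf_counter() - start


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "weights.json")

    def write_text(text):
        with open(path, "w", encoding="utf-8") as handle:
            handle.write(text)

    # Baseline: the JSON string already exists, so only the write is timed.
    prebuilt = json.dumps(weights)
    baseline = timed(lambda: write_text(prebuilt))

    # Measured case: serialization plus the same write.
    total = timed(lambda: write_text(json.dumps(weights)))

    # Read the file back to confirm the written JSON is complete and valid.
    with open(path, encoding="utf-8") as handle:
        round_trip = json.loads(handle.read())

print(f"write only:        {baseline:.4f} s")
print(f"serialize + write: {total:.4f} s")
```

The gap between the two timings is a rough estimate of the serialization cost for the library under test. A library that comes "close to the write speed" is one for which that gap is small.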
Professional JSON workflow
To conclude, I’d like to point out some of the issues I’ve seen and documented earlier:
- Calling variables foo_json: JSON is a string format. If it is not a string, it is not JSON. If you deserialize JSON with bar = json.loads(foo), then bar is not JSON. You can serialize bar to JSON that is equivalent to the JSON in foo, but bar itself is a Python object, most likely a dictionary, so do not name it foo_json.
- Checking for attributes all over the place: When you receive JSON data, you can easily convert it into a Python object (such as a dictionary) and work with it directly. That is fine for proof-of-concept code or very small JSON strings. For anything larger, it becomes a mess if you do not convert it to something like a dataclass.
Pydantic is a very useful validation library. You can use your favorite JSON library to parse the JSON string into a basic Python representation made of dictionaries, lists, strings, numbers, and Booleans, and then let Pydantic parse that into typed objects. The advantage is that you know what you are dealing with afterwards: no more Dict[str, Any] as a type annotation, no more editor without useful auto-completion, no more checks for the existence of attributes scattered throughout the code.
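Both points can be sketched together. To stay free of third-party dependencies, this example uses a standard-library dataclass with hand-written checks rather than Pydantic, which would generate equivalent validation from the type annotations. The User schema and the from_dict helper are hypothetical.

```python
import json
from dataclasses import dataclass
from typing import List


@dataclass
class User:
    # Hypothetical schema used only for this illustration.
    name: str
    age: int
    tags: List[str]

    @classmethod
    def from_dict(cls, data: dict) -> "User":
        """Validate types once, at the boundary, instead of throughout the code."""
        name, age, tags = data["name"], data["age"], data.get("tags", [])
        if not isinstance(name, str):
            raise TypeError("name must be a string")
        # bool is a subclass of int in Python, so exclude it explicitly.
        if not isinstance(age, int) or isinstance(age, bool):
            raise TypeError("age must be an integer")
        if not isinstance(tags, list) or not all(isinstance(t, str) for t in tags):
            raise TypeError("tags must be a list of strings")
        return cls(name=name, age=age, tags=tags)


user_json = '{"name": "alice", "age": 30, "tags": ["admin"]}'  # a str, so it is JSON
user_dict = json.loads(user_json)  # a dict, no longer JSON
user = User.from_dict(user_dict)  # a typed object with known attributes
print(user)
```

After the conversion, the rest of the code can rely on user.name and user.age existing with the expected types, and editors can offer auto-completion for them.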
To use a JSON package other than the default json module, I recommend this import pattern:
import ujson as json
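A common variation on this pattern, shown here as a sketch rather than as the author's recommendation, falls back to the standard library when the optional package is not installed. It only works for the part of the API that both modules share, such as loads and dumps with default arguments.

```python
try:
    import ujson as json  # optional third-party library
except ImportError:
    import json  # standard library fallback

# The calling code is identical whichever module was imported.
data = json.loads('{"library": "whichever was imported", "values": [1, 2, 3]}')
text = json.dumps(data)
print(json.__name__, text)
```

Not every library can be aliased this way. For example, orjson.dumps returns bytes rather than str, so code that expects a string would need adjusting.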
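For Flask, you can plug in a different encoder/decoder as follows (these attributes were deprecated in Flask 2.2 and removed in Flask 2.3, so this applies to older versions):
from flask import Flask
from simplejson import JSONEncoder, JSONDecoder

app = Flask(__name__)
app.json_encoder = JSONEncoder
app.json_decoder = JSONDecoder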
See also:
- Daniel Lemire: Parsing JSON Really Quickly: Lessons Learned
- Ng Wai Foong: Introduction to orjson
- Nicolas Seriot: Parsing JSON is a Minefield