preface
This is the 10th article in the “Python Craftsman” series. [View all articles in the series]
Programming actually has a lot in common with playing video games. Before you play different games, you need to learn the different rules of each game. Only when you are familiar with and flexibly use the rules of the game, you are more likely to win the game.
Programming, too, has different “rules” for different programming languages. From whether it supports object orientation to whether it can define constants, the rules of a programming language are more complex than most video games.
Most of the time when we’re programming, if we just take what we’ve learned in one language and apply it to another language, we’re not going to get the best results. It’s like a COUNTER-strike expert playing PUBG without knowing the rules. He might be a perfect shot, but there’s a good chance he’ll be ambushed by someone lurking in the grass before he spots the first enemy.
Rules in Python
Python is a language that looks simple at first, but becomes more complex as we go deeper. Take the most important concept of “object” in Python. Python defines so many rules for it that you can’t remember them all, such as:
- Defines the
__str__
Method object, can be usedstr()
Function to return a readable name - Defines the
__next__
ε__iter__
Method can then be iterated over - Defines the
__bool__
Method objects that use custom logic when making Boolean judgments - . .
** Being familiar with the rules and adapting your code to them can help you write better code and get the job done with less effort. ** Next, let’s look at a story about adaptation rules.
Case study: Obtaining a list of people from two travel data
One day, in a travel company specializing in outbound tourism from New Zealand, my business colleague suddenly rushed to me excitedly and said that he had obtained two important data from a partner:
- All persons who have been to Phuket, Thailand and their contact information
- All persons who have been to New Zealand and their contact information
The data is in JSON format, as follows:
# Data on people who have been to Phuket
users_visited_puket = [
{"first_name": "Sirena"."last_name": "Gross"."phone_number": "650-568-0388"."date_visited": "2018-03-14"},
{"first_name": "James"."last_name": "Ashcraft"."phone_number": "412-334-4380"."date_visited": "2014-09-16"},... . ]# Data of people who have been to New Zealand
users_visited_nz = [
{"first_name": "Justin"."last_name": "Malcom"."phone_number": "267-282-1964"."date_visited": "2011-03-13"},
{"first_name": "Albert"."last_name": "Potter"."phone_number": "702-249-3714"."date_visited": "2013-09-11"},... . ]Copy the code
Each data has a surname, first name, mobile phone number, travel time four fields. Based on this data, the business student proposed a * (seemingly unreasonable) * hypothesis: “Anyone who has been to Phuket should also be interested in traveling to New Zealand. We need to take that data, identify people who have been to Phuket but not to New Zealand, and target them with products.
First attempt at brute force
Armed with raw data and clear requirements, the next question is how to write code. By brute force, I quickly came up with my first solution:
def find_potential_customers_v1(a):
""" Find someone who has been to Phuket but not to New Zealand. ""
for puket_record in users_visited_puket:
is_potential = True
for nz_record in users_visited_nz:
if puket_record['first_name'] == nz_record['first_name'] and \
puket_record['last_name'] == nz_record['last_name'] and \
puket_record['phone_number'] == nz_record['phone_number']:
is_potential = False
break
if is_potential:
yield puket_record
Copy the code
Since the raw data doesn’t have unique identifiers like “user ID” *, we can only rely on “identical name and phone number” as the criterion for determining whether it’s the same person.
Find_potential_customers_v1 loops through all the people who have been to Phuket and then through New Zealand. If there is no perfect match in the New Zealand record, the find_potential_customers_v1 returns it as a potential customer.
This function does the job, but I don’t have to tell you. ** It has very serious performance issues. ** For every visit to Phuket, we need to go through all the New Zealand visits to try and find a match. The time complexity of the whole algorithm is O(n*m) and it would take a very long time to execute if New Zealand had a large number of access items.
To optimize the inner loop performance, we need to reduce the overhead of the linear lookup match portion.
Try using set optimization functions
If you know anything about Python, you know that dictionaries and collection objects in Python are implemented based on Hash tables. The average time to determine whether something is in the set is order 1, which is very fast.
So, for the above function, we can first try to initialize a set of New Zealand access records, then the search match part can become very fast, the overall time complexity of the function can become O(n+m).
Let’s look at the new function:
def find_potential_customers_v2(a):
""" Find people who have been to Phuket but not To New Zealand. """
First, iterate over all New Zealand access records to create lookup indexes
nz_records_idx = {
(rec['first_name'], rec['last_name'], rec['phone_number'])
for rec in users_visited_nz
}
for rec in users_visited_puket:
key = (rec['first_name'], rec['last_name'], rec['phone_number'])
if key not in nz_records_idx:
yield rec
Copy the code
With the use of collection objects, the new function is a leap forward in speed compared to the old version. However, the optimization of this problem doesn’t end there, or the title of this article should have been “How to Use Collections to Improve Program performance.”
A rethinking of the problem
Let’s try to think abstractly about the nature of the problem. First, we have A container A* (Phuket visit Record) with A lot of stuff, then we are given another container B (New Zealand visit Record) * with A lot of stuff, and then define the equality rule: “Name and number match”. Finally, based on the equality rule, find the ** “difference set” ** between A and B.
If you’re not particularly familiar with collections in Python, I’ll tell you a little bit more about them. If we have two sets A and B, we can directly use mathematical expressions such as A – B to calculate the difference between them.
>>> a = {1.3.5.7}
>>> b = {3.5.8}
Create a new set: all elements in A but not in B
>>> a - b
{1.7}
Copy the code
So counting “all the people who have been to Phuket but not to New Zealand” is essentially a differential operation of a set. So how do we fit our problem into the rules of the set game?
Use the rules of the set game
In Python, if you want to put something into a collection or dictionary, one basic condition must be met: “It must be Hashable.” What is “Hashable”?
For example, all mutable objects in Python, such as dictionaries, are not Hashable. This error occurs when you try to put a dictionary into a collection:
>>> s = set()
>>> s.add({'foo': 'bar'})
Traceback (most recent call last):
File "<stdin>", line 1.in <module>
TypeError: unhashable type: 'dict'
Copy the code
So, to solve our problems with collections, we first had to define our own “Hashable” object: VisitRecord. To make a custom object Hashable, the only thing you have to do is define the object’s __hash__ method.
class VisitRecord:
""" "Travel Records """
def __init__(self, first_name, last_name, phone_number, date_visited):
self.first_name = first_name
self.last_name = last_name
self.phone_number = phone_number
self.date_visited = date_visited
Copy the code
A good hash algorithm should make the values of different objects as unique as possible to minimize the probability of “hash collisions.” By default, all Python objects are hashed from their memory address.
In this case, we need to customize the object’s __hash__ method to use the (first name, first name, phone number) tuple as the hash source for the VisitRecord class.
def __hash__(self):
return hash(
(self.first_name, self.last_name, self.phone_number)
)
Copy the code
After customizing the __hash__ method, VisitRecord instances can be put into the collection as normal. But that’s not enough. In order for the difference algorithm mentioned earlier to work, we also need to implement the __eq__ special method.
__eq__ is the special method Python calls to determine whether two objects are equal. By default, it returns True only if it has exactly the same memory address as another object. But here, we reuse the hash value of the VisitRecord object, and when they are equal, they are considered the same.
def __eq__(self, other):
If the names of two access records are the same as the phone number, check whether they are equal.
if isinstance(other, VisitRecord) and hash(other) == hash(self):
return True
return False
Copy the code
After the proper data modeling is done, the subsequent differential calculation is a natural step. The new version of the function does this with a single line of code:
def find_potential_customers_v3(a):
return set(VisitRecord(**r) for r in users_visited_puket) - \
set(VisitRecord(**r) for r in users_visited_nz)
Copy the code
Hint: If you’re using Python 2, you’ll need to customize the class’s __ne__ (used to determine inequality) method, in addition to the __eq__ method.
Simplify code with dataclass
The story doesn’t end there. In the above code, we manually define our own data class VisitRecord, implementing __init__, __eq__ and other initialization methods. But there’s an easier way to do it.
Because the need to define dataclasses is so common in Python, the dataclasses module was added to the standard library in version 3.7 to make it easier.
Using the features provided by dataclasses, our code could be simplified as follows:
@dataclass(unsafe_hash=True)
class VisitRecordDC:
first_name: str
last_name: str
phone_number: str
# skip the "Access time" field as any comparison condition
date_visited: str = field(hash=False, compare=False)
def find_potential_customers_v4(a):
return set(VisitRecordDC(**r) for r in users_visited_puket) - \
set(VisitRecordDC(**r) for r in users_visited_nz)
Copy the code
You don’t have to do any dirty work, and you get the job done in less than ten lines of code.
Case summary
With that out of the way, let’s do a little summary. In dealing with this problem, we used three solutions altogether:
- A normal two-level loop is used to filter the result set that matches the rule
- Using hash table structure (set object) to create index, improve processing efficiency
- Convert data to custom objects, using rules, directly using collection operations
Why is the third way better than the first two?
First, the performance problems of the first solution were so obvious that it was quickly abandoned. What about the second option? If you think about it, there are no obvious drawbacks to plan 2. It may even be slightly better than the third option in terms of performance and memory footprint because of the lack of custom object customization.
But think about it for a second. If you change plan 2 code to another language, like Java, does it translate almost exactly 1:1? In other words, it’s efficient and straightforward, but it doesn’t take full advantage of the rules that the Python world offers to make the most of it.
If the “rule” in this question is to be specified, it is the fact that Python has a set of built-in structures that can perform four operations, such as differences, between collections. Scheme 3 code written after matching rules has the following advantages:
- After you model the data, you can more easily define other methods
- If the requirements change, it is easy to do reverse difference and intersection operations
- After understanding the collection and dataclasses logic, the code is much cleaner than the other versions
- If you want to modify the equality rule, such as “records that only have the same last name count as the same,” you only need inheritance
VisitRecord
cover__eq__
Methods can be
How do other rules affect us
Earlier, we spent a lot of time talking about how to use “rules of collections” to write more code with less effort. In addition, there are many other rules in the Python world. If you can master these rules, you can design apis that conform to Python conventions and make your code more concise.
Here are two concrete examples.
use__format__
Do object string formatting
If your custom object needs to define multiple string representations, like this:
class Student:
def __init__(self, name, age):
self.name = name
self.age = age
def get_simple_display(self):
return f'{self.name}({self.age}) '
def get_long_display(self):
return f'{self.name} is {self.age} years old.'
piglei = Student('piglei'.'18')
# OUTPUT: piglei(18)
print(piglei.get_simple_display())
# OUTPUT: piglei is 18 years old.
print(piglei.get_long_display())
Copy the code
So in addition to adding this extra get_xxx_display() method, you can also try customizing the __format__ method of the Student class, because that’s the standard rule for turning objects into strings.
class Student:
def __init__(self, name, age):
self.name = name
self.age = age
def __format__(self, format_spec):
if format_spec == 'long':
return f'{self.name} is {self.age} years old.'
elif format_spec == 'simple':
return f'{self.name}({self.age}) '
raise ValueError('invalid format spec')
piglei = Student('piglei'.'18')
print('{0:simple}'.format(piglei))
print('{0:long}'.format(piglei))
Copy the code
use__getitem__
Define object slicing operations
If you’re designing a container type that can hold something, you’ll probably define methods like “empty” and “get the NTH object” for it:
class Events:
def __init__(self, events):
self.events = events
def is_empty(self):
return not bool(self.events)
def list_events_by_range(self, start, end):
return self.events[start:end]
events = Events([
'computer started'.'os launched'.'docker started'.'os stopped',]),Print the second and third objects to determine if there is content
if not events.is_empty():
print(events.list_events_by_range(1.3))
Copy the code
However, this is not the best course of action. Because Python already provides us with a set of object rules, we don’t need to define additional methods ourselves as we would if we were writing OO* (object-oriented) * code in other languages. We have better options:
class Events:
def __init__(self, events):
self.events = events
def __len__(self):
"" custom length will be used to make Boolean judgments. ""
return len(self.events)
def __getitem__(self, index):
"" custom slicing method ""
Pass the slice object directly to Events for processing
return self.events[index]
Print the second and third objects to determine if there is content
if events:
print(events[1:3])
Copy the code
The new way of writing it fits the rules of the Python world better than the old code, and the API is simpler.
On how to adapt rules and write better Python code. Raymond Hettinger gave a great talk at PyCon 2015 “Beyond PEP8 – Best practices for beautiful intelligible code”. This talk has long been on my personal “PyCon TOP5 videos” list, and if you haven’t seen it yet, I highly recommend you do so now π
Hint: A more comprehensive set of Python object model rules can be found in the official documentation, which is a bit hard to read, but worth it.
conclusion
The Python world has a very complex set of rules that can cover things like whether objects are equal to objects, which ones are bigger or smaller. Most of these need to be implemented by redefining the “double underscore method __xxx__”.
Being familiar with these rules and using them in everyday coding can help us solve problems more efficiently and design apis that are more consistent with the Python philosophy. Here are some highlights from the article:
- Always do an abstract analysis of the original requirements, such as whether the problem can be solved with a difference set
- If you want to put an object into a collection, you need to customize the object’s
__hash__
δΈ__eq__
methods __hash__
Method determines performance (collision probability),__eq__
Determines the equality logic between objects- Using the Dataclasses module will save you a lot of code
- use
__format__
Method replaces your own string formatting methods - Used on container class objects
__len__
,__getitem__
Method instead of implementing it yourself
After reading the article, do you have anything to tease? Let me know in the comments or on Github Issues.
The appendix
- Photo by JESHOOTS.COM on Unsplash
- More articles: github.com/piglei/one-…
Other articles in the series:
- Index of all Articles [Github]
- Python craftsman: Techniques for writing conditional branching code
- Python Craftsman: Three good habits for exception handling
- Python Craftsman: Two suggestions for writing idiomatic loops