When we use Python to read data from MongoDB, we might write code like this:
```python
import pymongo

handler = pymongo.MongoClient().db.col
for row in handler.find():
    parse_data(row)
```
In just four lines of code, this reads every row of data in MongoDB and passes it to parse_data for processing, moving on to the next row once processing finishes. The logic is clear and simple. What could go wrong? As long as parse_data(row) doesn't raise, this code is perfect.
But that’s not the case.
Your code may fail at the for row in handler.find() line. The reason is a long story.
To explain: handler.find() does not return the data in the database, but a cursor object.
It’s only when you start iterating through it with the for loop that the cursor actually reads from the database.
However, if you connect to the database every time you loop, the network connection can waste a lot of time.
So pymongo fetches rows in batches, 100 at a time. On the first iteration of for row in handler.find(), it connects to MongoDB, reads 100 rows, and caches them in memory. Iterations 2 through 100 are served straight from memory without touching the database.
On the 101st iteration, it connects to the database again and reads rows 101-200, and so on.
This logic is very effective in reducing network I/O time.
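The batching behaviour can be sketched in pure Python with a generator that only contacts the server when the previous chunk is exhausted. This is a sketch of the idea, not pymongo's real code; fetch_batch and its parameters are hypothetical stand-ins for the driver's internal request:

```python
def batched_cursor(fetch_batch, batch_size=100):
    """Yield rows one at a time, fetching them from the server
    in chunks of batch_size (a sketch of what a driver-side
    cursor does, not pymongo's actual implementation)."""
    skip = 0
    while True:
        batch = fetch_batch(skip, batch_size)  # one network round trip
        if not batch:
            return
        for row in batch:  # the rest of the chunk comes from memory
            yield row
        skip += len(batch)

# Simulate a 250-row collection and count the "connections" made.
calls = []

def fake_fetch(skip, limit):
    calls.append(skip)
    data = list(range(250))
    return data[skip:skip + limit]

rows = list(batched_cursor(fake_fetch))
print(len(rows), len(calls))  # 250 rows in 4 round trips (the last one is empty)
```

The loop consumer sees one row at a time, yet the network is only touched once per chunk, which is exactly the I/O saving described above.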
However, MongoDB's default cursor timeout is 10 minutes. Within 10 minutes of the last fetch, you must fetch from MongoDB again, which resets the cursor's timer. Otherwise, a cursor timeout error occurs:
pymongo.errors.CursorNotFound: cursor id 211526444773 not found
So, going back to the original code: if each call to parse_data takes more than 6 seconds, then 100 calls take more than 10 minutes. At that point, the program raises the error above when it tries to read the 101st row.
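The arithmetic behind that failure can be spelled out. The 6-second figure is the article's example, not a measured value:

```python
SECONDS_PER_ROW = 6          # assumed cost of one parse_data call
BATCH_SIZE = 100             # rows cached per fetch in this example
CURSOR_TIMEOUT = 10 * 60     # MongoDB's default cursor idle timeout, in seconds

time_to_drain_batch = SECONDS_PER_ROW * BATCH_SIZE
print(time_to_drain_batch)                    # 600 seconds
print(time_to_drain_batch >= CURSOR_TIMEOUT)  # True: the cursor has already expired
```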
To solve this problem, we have four approaches:
- Modify the MongoDB configuration to extend the cursor timeout, then restart MongoDB. Since MongoDB in a production environment cannot be restarted at will, this approach works in principle but is ruled out.
- Read all the data at once, and then do the processing:
```python
all_data = [row for row in handler.find()]
for row in all_data:
    parse_data(row)
```
The downside of this approach is that if you have a lot of data, it may not all fit into memory. And even if it does fit, the list comprehension iterates over all the data once, and the for loop iterates over it again, wasting time.
- Let the cursor return fewer than 100 items at a time, so that each batch is consumed in under 10 minutes:

```python
# Only 50 rows of data are returned on each trip to the database
for row in handler.find().batch_size(50):
    parse_data(row)
```
However, this solution increases the number of database connections, and thus the I/O time.
- Make the cursor never time out by setting the parameter no_cursor_timeout=True:
```python
cursor = handler.find(no_cursor_timeout=True)
for row in cursor:
    parse_data(row)
cursor.close()  # Be sure to close the cursor manually
```
This is dangerous, however, because if your Python program stops unexpectedly for some reason, the cursor will never be closed. Unless MongoDB is restarted, those cursors remain on the server, hogging resources.
Of course, some might say: wrap the data-reading code in try…except, and close the cursor while handling the exception:
```python
cursor = handler.find(no_cursor_timeout=True)
try:
    for row in cursor:
        parse_data(row)
except Exception:
    parse_exception()
finally:
    cursor.close()  # Be sure to close the cursor manually
```
The code in finally executes with or without exceptions.
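A minimal, MongoDB-free demonstration of that guarantee, where the 'closed' event stands in for cursor.close():

```python
events = []

def read_rows(fail):
    try:
        events.append('read')
        if fail:
            raise ValueError('boom')
    except ValueError:
        events.append('handled')
    finally:
        events.append('closed')  # stands in for cursor.close(); runs either way

read_rows(fail=False)
read_rows(fail=True)
print(events)  # ['read', 'closed', 'read', 'handled', 'closed']
```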
But that would make the code really ugly. To solve this problem, we can use the cursor’s context manager:
```python
with handler.find(no_cursor_timeout=True) as cursor:
    for row in cursor:
        parse_data(row)
```
As soon as the program leaves the with block, the cursor is automatically closed. If an error occurs partway through, the cursor is closed as well.
Its principle can be explained by the following two pieces of code:
```python
class Test:
    def __init__(self):
        self.x = 1

    def echo(self):
        print(self.x)

    def __enter__(self):
        print('Enter context')
        return self

    def __exit__(self, *args):
        print('Exit context')

with Test() as t:
    t.echo()
print('Exit indent')
```
Running it prints "Enter context", then "1", then "Exit context", and finally "Exit indent".
Next, raise an artificial exception inside the with block:
```python
class Test:
    def __init__(self):
        self.x = 1

    def echo(self):
        print(self.x)

    def __enter__(self):
        print('Enter context')
        return self

    def __exit__(self, *args):
        print('Exit context')

with Test() as t:
    t.echo()
    1 + 'a'  # this line deliberately raises a TypeError
print('Exit indent')
```
This time, "Enter context", "1", and "Exit context" are printed before the TypeError traceback; "Exit indent" never runs.
No matter what happens inside the with block, the code in the Test class's __exit__ always runs.
In pymongo’s cursor object, __exit__ is written as follows:
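The original screenshot is missing here; paraphrased, the relevant pattern in pymongo's Cursor looks roughly like this (a sketch modelled on the context-manager protocol, not the exact source):

```python
class CursorSketch:
    """Mimics how a cursor object can support the with statement."""

    def __init__(self):
        self.closed = False

    def close(self):
        # pymongo would tell the server to kill the cursor here.
        self.closed = True

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_val, exc_tb):
        self.close()  # runs on normal exit and on exceptions alike

cursor = CursorSketch()
with cursor:
    pass
print(cursor.closed)  # True
```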
As you can see, this is exactly the operation to close the cursor.
Therefore, if we use a context manager, we can feel free to use the no_cursor_timeout=True parameter.