• Moving a large and old codebase to Python3
  • By Anders Hovmoller
  • The Nuggets translation Project
  • Permanent link to this article: github.com/xitu/gold-m…
  • Translator: Starrier
  • Proofread by: LynnShaw, Steinliber

Migrating a large, old project to Python 3

A little over a year and a half ago, we decided to move to Python 3. We had talked about it for a long time, and now it was time to do it. The move is now finished: we have migrated the last production deployment to Python 3.

  • The code base is about 240k lines, excluding blank lines and comments.
  • It is a web-based system with batch jobs, and there is only one production deployment.
  • The code base is about 15 years old.
  • Although this is a Django application, some of the code was written before Django was released.

Some basic statistics about the move to Python 3, based on a rough grep of the Git commit history:

  • 275 commits
  • 4080 lines added
  • 3432 lines deleted

I found 109 JIRA issues related to this project.

Py2 → six → py3

Our approach was always py2 → py2/py3 → py3, because we simply could not make a big-bang change in production, an intuition that proved correct in surprising ways. That meant 2to3 was a non-starter, which I think is very common. We tried for a while to use 2to3 to detect Python 3 compatibility issues, but quickly found that untenable too: it suggests changes that break the code under Python 2, which was not acceptable.

The conclusion was to use six, a library that makes it easy to build a code base that works in both Python 2 and 3.

The first step was updating our dependencies. This work had to start right away; more on it in the dependencies section.

Modernize

Python-modernize was our tool of choice for the migration. It automatically converts a Py2 code base into a six-compatible code base. We started by adding a test to CI that used modernize to check that new code was Py3-ready. The biggest effect was to make those who still used Py2 idioms aware of the new way of doing things, but it obviously did little to convert the existing 240k lines of code to six. We all had the bad habit of using old idioms, so the education was worthwhile even though it barely changed the number of lines left to convert. We also used modernize for my experimental branch:

The experimental branch

I created a new branch called “Python 3” and did the following:

  • Ran “python-modernize -n -w” on the entire code base. This modifies the code in place. I often started fixing code after this step without committing first. That was always a mistake I regretted, and more than once it forced me to start over. Even if you know something went wrong at this stage, it is best to commit it first: keeping what the machine did separate from what you did by hand is very important.
  • Moved the imports of all dependencies we had not yet fixed for Py3 into function bodies.
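
Moving a dependency's import into a function body means the module itself can still be imported on Python 3 even when that dependency cannot. A minimal sketch of the idea, assuming a hypothetical module `legacy_py2_only_driver` that stands in for any dependency not yet ported; the function names are illustrative as well:

```python
def format_row(row):
    """Pure-Python helper: usable on Python 3 right away."""
    return ",".join(str(value) for value in row)


def store_row(row):
    """Needs the unported dependency, so the import happens at call time."""
    # Hypothetical Py2-only package (an assumption for this sketch).
    # Importing it here instead of at module level keeps the rest of
    # this module importable on Python 3.
    import legacy_py2_only_driver

    return legacy_py2_only_driver.insert(format_row(row))
```

With this layout only the code paths that actually call `store_row` fail, with an ImportError, so the application can start and unit tests for everything else can already run.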

The idea here is to “run ahead”: to see what problems we would have if we did not have those outdated dependencies. This branch let me very quickly start the application in a super-broken state and run at least some unit tests. The branch was very messy, but I discovered a lot and applied the findings where relevant. I used the excellent GitUp to split, merge, and rearrange commits. When a commit looked good, I cherry-picked it to a fresh branch and sent that through code review.

No one else could work on this branch because it was constantly rebased, force-pushed, and generally abused, but it did allow the project to move forward without waiting for all the dependencies to be updated. I highly recommend this approach!

Static analysis

We added a pre-commit hook, so if you edited a file you were encouraged to update it to Python 3 at the same time using modernize.

Custom static analysis for quote_plus: there are some subtle differences in how quote_plus behaves in Py2, in Py3, and in six. In the end we created our own wrapper and statically enforced that code uses this wrapper instead of the standard library version or the six version. We also statically checked that bytes are never passed to quote_plus.
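
A wrapper of this kind might look like the sketch below. It is not the project's actual wrapper: the name `safe_quote_plus` and the decision to always encode as UTF-8 are assumptions made for illustration. It accepts text only and rejects bytes at runtime, which complements a static check:

```python
import sys

try:
    from urllib.parse import quote_plus as _stdlib_quote_plus  # Python 3
except ImportError:
    from urllib import quote_plus as _stdlib_quote_plus  # Python 2

# Text type on each version: unicode on Python 2, str on Python 3.
text_type = str if sys.version_info[0] >= 3 else unicode  # noqa: F821


def safe_quote_plus(value):
    """Hypothetical project-wide wrapper: text in, never bytes."""
    if not isinstance(value, text_type):
        raise TypeError(
            "safe_quote_plus() expects text, got %s" % type(value).__name__
        )
    # Encoding explicitly gives the same percent-escapes on both versions.
    return _stdlib_quote_plus(value.encode("utf-8"))
```

For example, `safe_quote_plus(u"a b&c")` gives `a+b%26c`, while passing `b"a b"` on Python 3 raises a TypeError instead of silently producing output.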

We fixed all Python 3 problems one Django application at a time and enforced this with a whitelist in CI, so you could not break an application once it had been fixed.
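
The whitelist idea can be sketched roughly as follows. Everything here is hypothetical: the app names, the assumption that each Django app is a top-level directory, and the assumption that some earlier CI step produces a list of files with Python 3 problems:

```python
# Hypothetical whitelist: Django apps that are already Python 3 clean.
PY3_CLEAN_APPS = {"billing", "reports"}


def find_regressions(files_with_py3_issues, clean_apps=PY3_CLEAN_APPS):
    """Return the offending files that belong to an already-fixed app."""
    regressions = []
    for path in files_with_py3_issues:
        # Assumes repo-relative paths such as "billing/views.py".
        app = path.split("/", 1)[0]
        if app in clean_apps:
            regressions.append(path)
    return regressions
```

A CI job would fail the build whenever `find_regressions(...)` returns a non-empty list, while problems in apps that are not yet on the whitelist are tolerated.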

Dependencies

Resolving dependencies was the hardest part for us. We had a lot of them, so it took a lot of time, and two were particularly tricky:

  • Splunk-lib. We rely on Splunk, but to this day they ignore all the angry customers who want Py3 compatibility for their client library. In the end, someone on our team took matters into their own hands and did it. Splunk handled this so badly that they even locked the issue for comments! This is simply unacceptable.
  • Cassandra. We use this database across our product, but we used an old driver with an old API model. This was a big part of the Py3 migration for us, because we had to rewrite all of that code piece by piece.

Tests

Our tests cover approximately 65% of the code: unit, integration, and UI tests combined. We did write more tests, but the overall number did not change much. That is not surprising, considering that going from 65% to 66% coverage means writing nearly 2,000 lines of tests.

We had to skip the tests that required Cassandra while we fixed that dependency. I invented a fun little hack to make this work and wrote about it.

Code changes

Some notes on code changes that were not covered in the documentation on migrating from Py2 to six (or perhaps we missed them):

StringIO

We used StringIO a lot in our code. The first instinct was to use six.StringIO, but that turned out to be wrong in almost all cases (but not all!). Basically, we had to think very carefully about every place we used StringIO and work out whether to replace it with io.StringIO, io.BytesIO, or six.StringIO. Mistakes here usually took the form of code that looked Py3-compatible and worked in Py2, but was actually broken in Py3.
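
The underlying distinction is text versus bytes. six.StringIO is an alias for StringIO.StringIO on Python 2, which happily accepts byte strings, and for io.StringIO on Python 3, which accepts only text. That is how code can pass on Py2 and break on Py3. A small sketch of choosing the buffer by the kind of data; the function names are illustrative only:

```python
import io


def render_report(lines):
    """Text output: io.StringIO holds text on both Python versions."""
    buffer = io.StringIO()
    for line in lines:
        buffer.write(line + u"\n")
    return buffer.getvalue()


def build_payload(chunks):
    """Binary output: io.BytesIO holds bytes on both Python versions."""
    buffer = io.BytesIO()
    for chunk in chunks:
        buffer.write(chunk)
    return buffer.getvalue()
```

Writing bytes to an io.StringIO, or text to an io.BytesIO, raises a TypeError on Python 3, so a wrong choice shows up as soon as the code path is exercised.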

from __future__ import unicode_literals

This is a mixed blessing. You can find bugs by adding it to many files, but it sometimes introduces bugs in Py2. The logs can also become confusing when they suddenly show a “u” in front of strings in strange places. Overall, this was not the clear win I was hoping for.

str/bytes/unicode

Much of this is what you would expect. What surprised me were the places that need a native str in both Py2 and Py3. If you use from __future__ import unicode_literals, some strings need to be changed from 'foo' to str('foo').

six.moves

The implementation of six.moves is a very strange hack, so it does not behave like the normal Python module it pretends to be. I also disagree with their choice not to include mock in six.moves. We had to use their API to add it ourselves, but that was hard to get working, and it required us to change from mock import patch to from six.moves import mock, which also means that patch is now mock.patch.

CSV parsing is different

If you use the csv module, you need to know about csv342. In my opinion it should be part of six; its absence means you do not realize there is a problem. However, we did not end up using csv342 in many places, so your mileage may vary.
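
The root of the difference is that the csv module works on byte strings in Python 2 but on text in Python 3. The sketch below shows only the Python 3 side with the standard library; it does not use csv342, and the function name is illustrative:

```python
import csv
import io


def parse_csv_text(text):
    """Parse CSV data held in a text string (Python 3 semantics)."""
    # newline="" leaves line endings untouched so the csv module can
    # handle them itself, including newlines inside quoted fields.
    return list(csv.reader(io.StringIO(text, newline="")))
```

When reading from a file on Python 3, the equivalent is to open it in text mode with an explicit encoding and `newline=""` before handing it to `csv.reader`.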

Rollout order

We started with the tests:

  • Unit tests in CI
  • Integration and UI tests in CI (without Cassandra)
  • Cassandra tests in CI (this came much later than the previous step!)

Then came the product itself. We built the ability to switch a batch machine to Py3 in one step and, crucially, to switch it back. This became important when things broke on Py3. The approach works for us because we can requeue broken jobs, but we cannot have too many jobs break, or any of the truly critical ones. We use Sentry to collect crash logs, so it was easy to review all the problems we hit when moving to Py3. Once we had fixed them all, we switched to Py3 again until new problems appeared, and so on.

We have the following environments:

  • Devtest: Used internally by developers, mostly to test database migrations. This environment is very easy to use, so it doesn’t cause problems very often.
  • IAT (Internal Acceptance Testing): Used to validate changes and run regression tests before we push changes to production.
  • UAT (User Acceptance Testing): Test environment that customers can access. Used to prepare changes in customer systems or to let customers view changes before they go live. This environment is migrated only a few days before the database migration.
  • Production

We released Python 3 to these environments in the following order:

  • Devtest
  • IAT, temporarily
  • IAT, permanently
  • One production batch machine, temporarily
  • One production batch machine, during working hours
  • Production SFTP
  • Half of the production batch machines
  • All production batch machines
  • Production web (after a long manual test run in the test environment)
  • Production load machine. This is a special subset of batch processing. It does the most CPU- and memory-intensive work in our product.

The load machine exposed customer data configurations that were incompatible with Python 3, so we had to implement warnings for these cases in Python 2 and make sure we had fixed them all before turning Python 3 on again. This took a few days: we receive customer data daily, so every time we got a warning we had to wait another day.

Surprises in production

  • 'ß'.upper() is 'ß' in Py2 but 'SS' in Py3. When the last part of the product was migrated to Py3, this ended up crashing production!
  • Comparing and sorting objects of different types works in Py2, but it hides a lot of bugs. We got some nasty surprises because this behavior leaks across the stack in non-obvious ways, especially because some sorted lists occasionally contained None. Overall it was a win, because we found quite a few bugs. It may come as a surprise that None sorts first in a Py2 list (you might expect it to sort next to zero!). Now we just have to deal with these cases.
  • '{}'.format(b'asd') is 'asd' in Python 2 but "b'asd'" in Python 3. Almost any other behavior would be better in Python 3: outputting hexadecimal (obviously different), keeping the old behavior (existing code keeps working), or raising an exception (the best option!).
  • int('1_0') is 10 in Py3 but invalid in Py2. This bit us even before we switched to Py3: another team that moved to Py3 before us sent us values they considered valid and we considered invalid. I personally think this decision was wrong: very strict parsing is the better default, and I fear this will continue to haunt us in subtle ways for years to come.
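
The Python 3 side of these surprises can be reproduced with a few assertions. The two helpers after them, `parse_int_strict` and `sort_none_first`, are hypothetical examples of how such cases could be handled explicitly; they are not taken from the project:

```python
import re

# Python 3 behavior described above (underscores in int() need 3.6+).
assert "ß".upper() == "SS"
assert "{}".format(b"asd") == "b'asd'"
assert int("1_0") == 10

_PLAIN_INT = re.compile(r"[+-]?[0-9]+")


def parse_int_strict(text):
    """Accept only an optional sign followed by ASCII digits."""
    if not _PLAIN_INT.fullmatch(text):
        raise ValueError("not a plain decimal integer: %r" % (text,))
    return int(text)


def sort_none_first(values):
    """Sort with None first, as Py2 did, without comparing None to numbers."""
    # The first tuple element decides None versus non-None, so the values
    # themselves are only compared with other values of the same kind.
    return sorted(values, key=lambda value: (value is not None, value))
```

On Python 3, a plain `sorted([3, None, 1])` raises a TypeError, whereas `sort_none_first([3, None, 1])` makes the intended ordering explicit.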

Conclusion

In the end, we felt we had no real choice in the matter: Python 2 maintenance would stop at some point, and our dependencies, most notably Django, were going Py3-only. However, we wanted to make this switch anyway, because we were often plagued by bytes/unicode problems, and Python 3 simply fixes many small warts in Python 2. During this migration we have already found some real bugs and misconfigurations in production. We are also looking forward to using f-strings and ordered dictionaries everywhere.

