It is well known that SVG, as a dialect of the descriptive markup language XML, does not support logical operations directly.

Stroke-related attributes (stroke-dasharray and stroke-dashoffset, for instance) rely heavily on the length of the path, and the XML DOM does not hand that length to you, so you rely on a function called getTotalLength().

In front-end development, we usually use this function to get the length of a path directly. In JavaScript you simply call it, and when faced with a complex multi-segment document, the worst case is writing a quick loop, as in the sketch below.
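A minimal browser-side sketch (the selectors and variable names here are mine, purely for illustration):

```javascript
// Read the length of every <path> already in the document and sum them.
// Assumes a browser environment with an SVG in the DOM.
const paths = document.querySelectorAll('path');

let total = 0;
paths.forEach((path, i) => {
  const len = path.getTotalLength(); // browser-computed length of this path
  console.log(`path #${i}: ${len}`);
  total += len;
});

console.log('sum of all paths:', total);
```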

But when the development process leaves the browser or Node.js environment, and the need for a length arises in Python because of more complex requirements, something interesting happens.

That is exactly what I ran into recently. Call it a phenomenon, or a bug, if you like.

① Small numbers: Python vs JavaScript

The thing is, very few third-party libraries in the Python ecosystem support SVG at all, and after trying several of the ones available through pip, I found their accuracy, for length alone, almost unacceptably rough.

So I built the wheel myself and implemented a high-precision length calculation function for path data, also known as the “d” attribute. (This was genuinely difficult; I read countless forums and GitHub threads and even ported part of the browser kernel code to Python, but I will skip the details here.)

Benchmarked against the browser console’s getTotalLength(), it agrees to within about one part in ten million, and on most curves its accuracy is actually higher than the browser’s result.

Of course, the price of high-precision results is the heavy time cost of the mathematical operations. The cost is not unreasonable, though, and the Python length algorithm is stable and, most importantly, produces very good results.

So for now, we can use this wheel as a new benchmark.

Following the usual process, the next step is to move from the development environment toward production. Can it withstand production-scale testing?

SVG path-data test set [big.txt]:

Raw.githubusercontent.com/fontello/sv…

Interested readers can check it out for themselves.

To recap, the test set is a text document containing a total of 2,036 lines of “d” strings, each line holding multiple subpaths. The “A/a” arc command has been stripped from the test set, but that is irrelevant and will not affect the rest of this article.


This document is perfect for testing length calculation: run it once through the Python version, then iterate over it with the JavaScript built-in function, and compare the two.

At the same time, to compute the average error, a cumulative variable TotalLength is kept on both sides, representing the length of the entire document treated as a single path.

Tip: Here TotalLength is calculated by piecewise length accumulation.

From the perspective of algorithm correctness, this method is undoubtedly correct.
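The JS side of that benchmark might look like the sketch below (bigTxt, the temporary &lt;path&gt; element, and the variable names are my own assumptions, not the article’s actual harness):

```javascript
// Piecewise accumulation: one line of big.txt = one "d" string = one path.
// Assumption: bigTxt already holds the contents of big.txt.
const svgNS = 'http://www.w3.org/2000/svg';
const svg = document.createElementNS(svgNS, 'svg');
const path = document.createElementNS(svgNS, 'path');
svg.appendChild(path);
document.body.appendChild(svg);

let TotalLength = 0;
for (const d of bigTxt.split('\n')) {
  if (!d.trim()) continue;          // skip blank lines
  path.setAttribute('d', d);        // reuse one <path> for every line
  TotalLength += path.getTotalLength();
}
console.log('TotalLength:', TotalLength);
```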

After 2,036 executions in each language:

Python TotalLength: 11407602.3134924
JavaScript TotalLength: 11407603.5324707
Average error: 0.000000107 (1.07 parts in ten million)

Since both the browser kernel (Chromium here) and my own length algorithm use the idea of fitting the curve with straight line segments (loosely, a discrete Newton-style iteration), a larger value means a more accurate one, i.e. one closer to the upper bound.

The upper bound here is the mathematically exact value, which under discrete computation can only be approached ever more closely, never reached exactly, except in trivial integer cases.

For example, the arc length of a curve joining two endpoints is obviously greater than the straight chord between them, so when line segments are used to fit the curve, the finer and denser the subdivision, the closer the summed chord lengths come to the true arc length. That is the basic idea; the real work of the computation is [precision] control.
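Here is that idea in toy form (this is neither my Python implementation nor Chromium’s code, just the chord-fitting principle described above, applied to a single cubic Bézier segment):

```javascript
// Toy illustration of chord-length fitting for one cubic Bézier segment.
function cubicPoint(p0, p1, p2, p3, t) {
  const u = 1 - t;
  return {
    x: u * u * u * p0.x + 3 * u * u * t * p1.x + 3 * u * t * t * p2.x + t * t * t * p3.x,
    y: u * u * u * p0.y + 3 * u * u * t * p1.y + 3 * u * t * t * p2.y + t * t * t * p3.y,
  };
}

function chordLength(p0, p1, p2, p3, n) {
  let len = 0;
  let prev = cubicPoint(p0, p1, p2, p3, 0);
  for (let i = 1; i <= n; i++) {
    const cur = cubicPoint(p0, p1, p2, p3, i / n);
    len += Math.hypot(cur.x - prev.x, cur.y - prev.y); // chord between samples
    prev = cur;
  }
  return len; // always <= the true arc length; grows toward it as n increases
}

// The denser the subdivision, the larger (and more accurate) the result:
// chordLength(a, b, c, d, 16) <= chordLength(a, b, c, d, 1024) <= true length
```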

So far, so good.

And then, here’s the funny thing.

② Split vs. Whole

An introduction to “d” can be found at the beginning of a previous article, for readers who want a deeper understanding of how the SVG “d” attribute works. Here we simply treat the entire document as one giant “d” string and feed it in as the input.
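On the JS side this second experiment is essentially one call (again assuming bigTxt holds the file contents, and reusing the hypothetical &lt;path&gt; element from the earlier sketch):

```javascript
// Whole-document experiment: join every line into a single giant "d" string
// and ask the browser for its length in one call.
const giantD = bigTxt.split('\n').join(' ');
path.setAttribute('d', giantD);
console.log(path.getTotalLength()); // the browser's whole-document answer
```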

In theory, this huge “d” should yield exactly the same TotalLength as the accumulation algorithm, or at least the same at the scale of values this large.

The Python version outputs a total error of 0.00000055 after nearly a minute of computation.

In other words:
Cumulative calculation: 11407602.313492425
Direct calculation: 11407602.313491877

This error is negligible, considering that the integer part is already up in the tens of millions.

JavaScript, however, returned its result in a matter of seconds, and that result was unexpected: 11407201

There is no decimal part at all, the total error is as large as 401.3, and the average error across the 2,036 fragments is 0.00003518 (352 parts in ten million).

For comparison, the average error of the piecewise summation was 1.07 parts in ten million.

With everything done correctly, the result simply comes out 401 units (px in this case) too short.

Did you feel a little scared?

When all the procedures, algorithms, and theoretical support are correct, a wrong answer appears.

It is a phenomenon; you could also call it a rather high-level “bug”.

③ Distance: Pythagorean theorem or Newton’s method?

Front-end engineering, in the end, lands in the front end. However good Python’s calculation is, the “wrong soil” of the browser environment is where the result ultimately has to live.

For developers, this phenomenon is disconcerting: the main premise for judging the stability of a technique shifts from absolute mathematical theory to the environment in which the technique actually runs, even if that environment is wrong.

The next thing to consider, then, is how to get a mathematically correct Python program to produce the same wrong answer as the browser, so the two can be reconciled.

In plain terms: how do I deliberately make it worse?

During the testing phase, an observant developer will notice that the browser is extremely fast at these calculations.

Even the single giant “d” string assembled from all 2,036 lines takes only about a second. Python, on the other hand, takes far longer.

Part of this time cost is easy to understand. Looking at the source code, the browser mostly treats “d” as a string-processing, i.e. array-processing, problem. From a data-structure point of view, an array is a random-access structure whose advantage is fast reads.

In Python, because of more advanced requirements, I took an OOP, object-oriented approach and built an object for every subpath. That involves costly operations such as memory allocation and deallocation, and on top of that the browser kernel is implemented in a lower-level language, C++, which is inherently faster than Python.

But set all of that aside; the one thing I care about is precision. After all, when you see an error of 400, your first suspect is going to be precision.

Under this Newton-style fitting scheme, high precision requires more iterations and finer subdivision. Lower the precision a little and there are fewer segments, fewer executions, and naturally more speed.

So, a small experiment: use JS to obtain the length of a straight line.

The result was very unexpected. The line runs from (100,100) to (200,200), and anyone who remembers primary-school mathematics knows the distance between two points comes from the Pythagorean theorem. The theoretical value is sqrt(100^2 + 100^2), the square root of 20,000, i.e. 100 times the square root of two.

We all know that the square root of two is 1.414, which is fine.

But expand the decimals and you will find:
Theoretical value: 141.4213562373095
Function value: 141.42135620117188
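A sketch of how the comparison can be reproduced (the element setup is mine; the values in the comments are the ones reported above):

```javascript
// Straight-line experiment: (100,100) -> (200,200).
const svgNS = 'http://www.w3.org/2000/svg';
const svg = document.createElementNS(svgNS, 'svg');
const line = document.createElementNS(svgNS, 'path');
line.setAttribute('d', 'M 100 100 L 200 200');
svg.appendChild(line);
document.body.appendChild(svg);

const theoretical = Math.sqrt(100 ** 2 + 100 ** 2); // Pythagorean theorem
const measured = line.getTotalLength();             // the browser's answer

console.log(theoretical); // 141.4213562373095
console.log(measured);    // 141.42135620117188 in the run described here
```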

The sqrt() function was called explicitly for this purpose, and the result was the same as above.

This proves at least one thing: getTotalLength() does not use the Pythagorean theorem when calculating the distance between two points.

I have been digging through the source code, but so far I have not found the reason or the specific algorithm.

Combined with the definition of the upper bound above, my blind guess is that a fitting algorithm is used here too. Perhaps, to speed up the calculation, the relatively expensive squaring and square-root operations were dropped in favor of a fitting method built mostly on addition. This is only a guess; if anyone knows the actual reason, I would sincerely like to hear it.

Fitting always involves a precision term, and the fitting algorithm used by the browser kernel does have some effect, but judging from the earlier experiments that effect is more of a low-order fine-tuning.

After all, the piecewise accumulation is accurate, which shows there is no huge error while the amount of data stays small.

There are two main factors affecting accuracy: one is the approximate fitting algorithm, and the other is the rounding (carry) problem.

④ Floating point traps

If your fundamentals are solid, floating-point numbers and how they are constructed will already be familiar.

There is a well-known phenomenon on the front end that 0.1 + 0.2 does not equal 0.3.

Can you call it wrong? Well, yes and no.

If it were simply a bug, the problem would not have persisted to this day. So what causes it? The answer lies in the floating-point representation.

Any discussion of floating-point numbers inevitably reaches the IEEE 754 standard, which strictly specifies how floating-point numbers are constructed and implemented. Browsers currently follow IEEE 754-2019; look it up if you want the details.

The point is that 0.1, 0.2 and 0.3 have no exact binary representation; the computer stores the nearest representable values, and the stored 0.1 plus the stored 0.2 comes out as 0.30000000000000004, not 0.3.

Those two values are plainly different, so the computer refuses to treat them as equal.

While reviewing the standard, you come across a static property: Number.EPSILON.

Its role is effectively a precision limit, also known as a tolerance.

When the difference between two different values is less than this value, the two values are considered the same.

When two values need to be treated as equal, the trailing digits can be approximated away.
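A quick console sketch of both the phenomenon and a tolerance-based comparison (the helper name nearlyEqual is mine):

```javascript
// The classic phenomenon:
console.log(0.1 + 0.2);          // 0.30000000000000004
console.log(0.1 + 0.2 === 0.3);  // false

// Tolerance-based equality using Number.EPSILON:
// treat two values as equal when their difference is below the tolerance.
function nearlyEqual(a, b, tolerance = Number.EPSILON) {
  return Math.abs(a - b) < tolerance;
}

console.log(nearlyEqual(0.1 + 0.2, 0.3)); // true
```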

We all learned about rounding in elementary school, and there are three main algorithms for numerical approximation in computers.

They are:

  1. ceil(): round up; ceil(0.3) = 1
  2. round(): round to the nearest; round(0.3) = 0, round(0.8) = 1
  3. floor(): round down; floor(0.8) = 0

Given that the output is a large integer with no decimal part, it is very likely that the kernel performs this rounding itself somewhere during the calculation.

So I took the per-segment sample data from the JS run and applied each of these rounding operations to it in turn, as in the sketch below.
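Roughly like this (segmentLengths is a placeholder name for the 2,036 sampled per-segment values; it is not from the article’s actual code):

```javascript
// segmentLengths: the per-segment lengths sampled from the JS run
// (placeholder; fill it with the real sample data).
function sumWith(roundFn, values) {
  return values.reduce((acc, v) => acc + roundFn(v), 0);
}

const baseline = segmentLengths.reduce((a, b) => a + b, 0);

console.log('exact :', baseline);
console.log('ceil  :', sumWith(Math.ceil, segmentLengths));  // far above the baseline
console.log('round :', sumWith(Math.round, segmentLengths)); // stays near the baseline
console.log('floor :', sumWith(Math.floor, segmentLengths)); // the candidate to check
```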

ceil() rounds upward, and its result comes out far larger than the reference value, so it is discarded.

Over a large sample, the expected error of round(), whose core is round-to-nearest, is essentially zero, and the experiment confirms that its result stays almost identical to the baseline. That leaves the floor() approximation as the last one to verify.

The results confirm some guesses.

⑤ A possible truth

Let us call the browser’s whole-document result, the one that is off by about 400, the target value.

Starting from the accurate per-segment sample data, flooring each value at the ones place gives a total smaller than the target value, while flooring at the first decimal place gives a total larger than it.
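Flooring at a given decimal place is just a scaled floor; a small helper (floorAt is my name, reusing the hypothetical segmentLengths from above) makes the two runs explicit:

```javascript
// Floor a value at the k-th decimal place (k = 0 floors at the ones place).
function floorAt(value, k) {
  const scale = 10 ** k;
  return Math.floor(value * scale) / scale;
}

const sumAtOnes   = segmentLengths.reduce((a, v) => a + floorAt(v, 0), 0); // below the target
const sumAtTenths = segmentLengths.reduce((a, v) => a + floorAt(v, 1), 0); // above the target
```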

This result is welcome: the target value is successfully squeezed between the two, which means this line of thinking has brought the reproduced value into the target’s range.

No offense intended, but if you have studied mathematics you know the squeeze theorem, and I am borrowing it here to describe the upper and lower bounds.

In other words, somewhere between those two truncation precisions (roughly between keeping 0.000001 and 0.99999 of the fractional part) there may exist a cutoff that makes the result land exactly on the target value.

Of course, floor() deserved priority here anyway: a programmer’s instinct, when numeric precision suddenly changes, is to ask whether a type conversion is involved. After all, most beginners know how an int cast behaves, which is essentially the same as floor().

In the floor() experiment above, every sample value was rounded down; if instead only part of the samples were processed that way, we could presumably find a combination that reaches the target value exactly.

Now flip that idea around: the browser kernel probably decides on the fly that full precision is not worth keeping, and keeps fitting while shedding precision as it goes. In other words, the rounding here happens dynamically.

That approach is reasonable, appropriate, and correct, given the cost ultra-high precision would impose on data format, storage footprint, and computation time. As a real-world tool, a browser implementation has to trade accuracy for usability.

And each rounding step along the way is the most appropriate choice at that particular moment.

This philosophy of exchanging accuracy for efficiency has always existed in all kinds of engineering development. The ultimate charm, art, philosophy and wisdom of engineering lies in how to make the best use of all resources to achieve trade-off and balance. This is where the ultimate test of a developer’s ability and experience comes in.

A brief digression: programmers do pursue perfection, but “perfect” itself only holds within a limited scope. Absolute perfectionism should not exist.

Of course, this explanation of the underlying mechanism is personal speculation. Since I have not yet found the actual implementation in the source code (whether it is even part of what is open-sourced and documented is a mystery), this is the best explanation I can offer for now.

But not knowing the truth does not stop us from working around the problem. Based on this assumption, the tip, and the lesson, for developers is to avoid feeding in very large data all at once. In real development and production, decide, based on the actual data scale and requirements, whether to split the data into segments or to compress its scale.

Since the error is born inside a perfectly correct flow, the way to deal with it is to avoid triggering it.

From a development perspective: “escape” is not shameful, but useful.

The right conditions, the right range, the right process, the right result: so-called engineering is, in the end, nothing more than this.

Of course, local optimality does not lead directly to global optimality; that belongs to a higher-level cycle of the same philosophy.

END