In machine learning, we often use classes and functions to define the various parts of a model: functions to read data, functions to preprocess data, functions that define the model architecture and training procedure, and so on. So what makes a function beautiful, pleasing code? This tutorial explores how to write great functions, from naming to length.
In Python, as in most modern programming languages, functions are one of the basic means of abstraction and encapsulation. You have probably written hundreds of functions as a developer, but not all functions are created equal. Writing “bad” functions directly hurts the readability and maintainability of your code. So what is a “bad” function? More importantly, how do you write a “good” one?
A simple review
Mathematics is full of functions, although we may not remember them. So let’s start with everyone’s favorite topic: calculus. You may remember an equation like this: f(x) = 2x + 3. This is a function called “f” that takes an argument x and “returns” 2 * x + 3. It may not look like the functions we see in Python, but the basic idea is the same as in programming languages.
Functions have a long history in mathematics, but they are even more versatile in computer science. That versatility comes with pitfalls, however. Next we’ll discuss what makes a “good” function and which warning signs indicate that a function needs refactoring.
What makes a function good or bad
What is the difference between a good Python function and a bad one? It’s surprising how many definitions of a “good” function there are. For our purposes, I’ll define good Python functions as those that follow most of the rules in this list (some of which are harder to follow than others):
- Naming is reasonable
- Single responsibility
- Include a docstring (documentation comment)
- Return a value
- No more than 50 lines
- Idempotent, as pure as possible
For many people, this list may be too restrictive. But I guarantee that if your functions follow these rules, your code will look pretty good. I’ll walk you through each rule step by step and then summarize what makes a “good” function.
Naming
My favorite quote on this subject (by Phil Karlton, though often misattributed to Donald Knuth) is:
There are only two hard things in computer science: cache invalidation and naming things.
It sounds silly, but good naming really is hard. Here’s an example of a badly named function:
def get_knn(from_df):
I’ve seen bad naming pretty much everywhere, but this example comes from data science (or machine learning), where practitioners are always writing code in Jupyter notebooks and then trying to turn the different cells into an understandable program.
The first problem with this function’s name is its use of acronyms and abbreviations. Full English words are better than abbreviations and obscure acronyms. The only reason to abbreviate is to save typing, but modern editors have autocomplete, so you only have to type the full name once. Abbreviations are a problem because they are usually specific to one domain. In the code above, knn refers to “K-nearest neighbors” and df refers to “DataFrame”, the ubiquitous pandas data structure. Another programmer who is not familiar with these abbreviations will be confused when reading the code.
There are two other small problems with this function name. The word “get” is superfluous: for most well-named functions, it is obvious that the function returns something, and the name reflects that. The from_df part is also unnecessary: if the parameter name does not make the type clear, the function’s docstring or a type annotation will describe it.
So how could we rename this function? For example:
def k_nearest_neighbors(dataframe):
Now even a layperson can tell what the function computes, and the parameter name (dataframe) clearly tells us what type of argument to pass.
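When a parameter name alone cannot convey the expected type, type annotations and a docstring can carry that information instead of the function name. The following is only a minimal sketch, not the original author's code: to stay dependency-free it assumes the data is a list of (x, y) tuples rather than a pandas DataFrame, and the `points`, `target`, and `k` parameters are hypothetical.

```python
import math


def k_nearest_neighbors(
    points: list[tuple[float, float]],
    target: tuple[float, float],
    k: int = 3,
) -> list[tuple[float, float]]:
    """Return the *k* points closest to *target* by Euclidean distance."""
    # Assumption: points are plain (x, y) tuples, not rows of a DataFrame.
    return sorted(points, key=lambda point: math.dist(point, target))[:k]
```

The name says what is computed, while the annotations and docstring say what goes in and what comes out.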
Single responsibility principle
The “single function principle” (better known in English as the single responsibility principle) comes from a book by “Uncle” Bob Martin and applies to functions just as well as to classes and modules (Martin’s original targets). The principle states that a function should have a single responsibility; in other words, it should do only one thing. One big reason is that if each function does only one thing, it needs to change only when the way it does that thing must change. It also becomes clear when a function can be deleted: if changes elsewhere mean its single responsibility is no longer needed, simply remove it.
Let me give you an example. Here is a function that does more than one “thing”:
import statistics


def calculate_and_print_stats(list_of_numbers):
    # Named "total" so that the built-in sum() is not shadowed.
    total = sum(list_of_numbers)
    mean = statistics.mean(list_of_numbers)
    median = statistics.median(list_of_numbers)
    mode = statistics.mode(list_of_numbers)

    print('-----------------Stats-----------------')
    print('SUM: {}'.format(total))
    print('MEAN: {}'.format(mean))
    print('MEDIAN: {}'.format(median))
    print('MODE: {}'.format(mode))
This function does two things: it calculates a set of statistics about a list of numbers, and it prints them to stdout. It violates the rule that a function should have only one reason to change. There are clearly two reasons this function might change: new or different statistics need to be calculated, or the output format needs to change. It is better written as two separate functions: one that performs the calculation and returns the results, and another that takes those results and prints them. A dead giveaway that a function has multiple responsibilities is the word “and” in its name.
This separation also makes the functions’ behavior much easier to test. The two parts need not merely be separate functions in one module; where appropriate, they can even live in different modules. That makes the tests cleaner and easier to maintain.
Functions that do exactly two things are actually quite rare. More often, a function is responsible for many, many tasks. Again, for the sake of readability and testability, we should break these jack-of-all-trades functions into smaller functions, each of which performs a single task.
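One possible way to carry out that split is sketched below. The names `calculate_stats` and `print_stats` and the choice of a dict as the hand-off format are assumptions made for illustration, not part of the original example.

```python
import statistics


def calculate_stats(list_of_numbers):
    """Return the sum, mean, median, and mode of *list_of_numbers* as a dict."""
    return {
        'SUM': sum(list_of_numbers),
        'MEAN': statistics.mean(list_of_numbers),
        'MEDIAN': statistics.median(list_of_numbers),
        'MODE': statistics.mode(list_of_numbers),
    }


def print_stats(stats):
    """Print each entry of *stats* on its own line and return the lines."""
    lines = ['-----------------Stats-----------------']
    lines.extend('{}: {}'.format(name, value) for name, value in stats.items())
    print('\n'.join(lines))
    return lines


print_stats(calculate_stats([1, 2, 2, 3]))
```

With this layout, a change to the statistics touches only `calculate_stats`, and a change to the output format touches only `print_stats`.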
Documentation comments
Many Python developers are aware of PEP-8, the style guide for Python code, but fewer know PEP-257, which defines the conventions for docstrings (documentation comments). PEP-257 is not covered here in detail, but readers can learn more about the agreed docstring conventions from the guide itself.
- PEP-8:www.python.org/dev/peps/pe…
- PEP-257:www.python.org/dev/peps/pe…
First, a docstring is a string literal that appears as the first statement in the definition of a module, function, class, or method. This string should clearly describe the function’s purpose, input parameters, return value, and so on. The main points of PEP-257 are:
- Every function needs a docstring;
- Write complete sentences using proper grammar and punctuation;
- Start with a one-sentence summary of the function’s main purpose;
- Use prescriptive rather than descriptive language (“Return the sum”, not “Returns the sum”).
These rules are easy to follow when writing functions. We just need to get into the habit of writing docstrings, and of writing them before the function body. If you can’t clearly describe what the function does, you need to think more about why you’re writing it.
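As an illustration of those points, here is a small sketch of a docstring written in that style. The `clamp` function is a hypothetical example chosen only because its behavior is easy to describe; it does not come from the original article.

```python
def clamp(value, lower, upper):
    """Return *value* limited to the inclusive range [*lower*, *upper*].

    Return *lower* if *value* is below the range and *upper* if it is
    above it. Raise ValueError if *lower* is greater than *upper*.
    """
    if lower > upper:
        raise ValueError('lower must not be greater than upper')
    return max(lower, min(value, upper))


# The docstring is stored on the function object and is what help() displays.
print(clamp.__doc__.splitlines()[0])
```

The first line is a one-sentence summary phrased as a command, and the remaining sentences cover the edge cases a caller needs to know about.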
The return value
A function can and should be thought of as a small, self-contained program. It takes some input as parameters and returns some output. Parameters are, of course, optional, but from the point of view of Python internals the return value is not. Even if you try to write a function that returns nothing, you cannot: when a function has no return statement, the Python interpreter makes it return None. Unconvinced readers can test it with the following code:
❯ python3
Python 3.7.0 (default, Jul 23 2018, 20:22:55)
[Clang 9.1.0 (clang-902.0.39.2)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> def add(a, b):
...     print(a + b)
...
>>> b = add(1, 2)
3
>>> b
>>> b is None
True
Run the code above and you will see that the value of b is indeed None. So even a function that contains no return statement still returns something. And a function should return something, because it is itself a small program. How useful is a program with no output, and how would we test it?
I’d even go so far as to say that every function should return a useful value, even if only for the sake of testing. The code we write should be tested, and a function without a return value is hard to test for correctness; the function above may need its I/O redirected in order to be tested. In addition, returning a value makes method chaining possible, as shown in the following code:
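To make the testing question concrete, the sketch below compares a printing function with a returning one. The name `print_sum` is hypothetical, and capturing output with `contextlib.redirect_stdout` is just one way to test printed output.

```python
import contextlib
import io


def print_sum(a, b):
    """Print a + b; the function implicitly returns None."""
    print(a + b)


def add(a, b):
    """Return a + b."""
    return a + b


# Testing the printing version means capturing stdout first.
buffer = io.StringIO()
with contextlib.redirect_stdout(buffer):
    result = print_sum(1, 2)
assert result is None
assert buffer.getvalue() == '3\n'

# Testing the returning version is a single comparison.
assert add(1, 2) == 3
```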
with open('foo.txt', 'r') as input_file:
    for line in input_file:
        if line.strip().lower().endswith('cat'):
            # ... do something useful with these lines
            pass
The line if line.strip().lower().endswith('cat'): works because strip() and lower() each return a string, so the next method can be called directly on the result; endswith() then returns the Boolean that the if statement tests.
Here are some common reasons people give when asked why a function they wrote doesn’t return a value:
“The function does I/O-like operations, such as saving a value to a database, and cannot return useful output.”
I disagree because the function can return True if the operation completes successfully.
“I need to return multiple values, because returning one value doesn’t mean anything.”
It is also possible to return a tuple containing multiple values. In short, even in existing code bases, returning a value from a function is definitely a good idea and is unlikely to break anything.
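Both answers can be sketched in a few lines. Everything here is hypothetical: a plain dict stands in for the database, the rule that an existing key is not overwritten is an assumption made so that there is a failure case, and `save_value` and `value_range` are illustrative names.

```python
def save_value(database, key, value):
    """Store *value* under *key* and return True if it was stored."""
    # Assumption: a plain dict stands in for a real database connection,
    # and an existing key is treated as a failed save.
    if key in database:
        return False
    database[key] = value
    return True


def value_range(numbers):
    """Return the smallest and largest of *numbers* as a (low, high) tuple."""
    return min(numbers), max(numbers)


database = {}
assert save_value(database, 'answer', 42) is True
assert save_value(database, 'answer', 43) is False

# Tuple unpacking gives each returned value its own name.
low, high = value_range([3, 1, 4, 1, 5])
assert (low, high) == (1, 5)
```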
The length of the function
The length of a function directly affects readability and therefore maintainability, so keep your functions short. Fifty lines seems to me a reasonable upper limit.
If functions follow the single responsibility principle, they will generally be quite short. Functions that are pure or idempotent (discussed below) also tend to be shorter. These ideas all work together to produce clean code.
So what do you do if a function is too long? Refactor! Refactoring is probably something you do all the time when writing code, even if you’re not familiar with the term. It means changing the structure of a program without changing its behavior. So extracting a few lines of code from a long function into a function of their own is a form of refactoring. It is also the fastest and most common way to shorten long functions. Give the new functions appropriate names and the code becomes easier to read as well.
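A small sketch of this kind of extraction follows. The score-averaging task and all function names are hypothetical; the point is only that the refactored version behaves the same as the long one while each step gets a descriptive name.

```python
def average_passing_score_long(lines):
    """Return the mean passing score (version before refactoring)."""
    scores = []
    for line in lines:
        if line.strip():
            scores.append(int(line))
    passing = []
    for score in scores:
        if score >= 50:
            passing.append(score)
    if not passing:
        return 0.0
    return sum(passing) / len(passing)


def parse_scores(lines):
    """Return the integer scores found on the non-blank *lines*."""
    return [int(line) for line in lines if line.strip()]


def keep_passing(scores, threshold=50):
    """Return the scores that are at least *threshold*."""
    return [score for score in scores if score >= threshold]


def mean_or_zero(values):
    """Return the mean of *values*, or 0.0 if there are none."""
    return sum(values) / len(values) if values else 0.0


def average_passing_score(lines):
    """Return the mean passing score (version after refactoring)."""
    return mean_or_zero(keep_passing(parse_scores(lines)))


sample = ['40\n', '60\n', '\n', '80\n']
assert average_passing_score(sample) == average_passing_score_long(sample)
```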
Idempotent and functional purity
An idempotent function returns the same value no matter how many times it is called with the same set of arguments. (This is the sense in which this article uses the term; such functions are also commonly called deterministic.) The result does not depend on non-local variables, on the mutability of its arguments, or on data from any I/O stream. The following add_three(number) function is idempotent:
def add_three(number):
    """Return *number* + 3."""
    return number + 3
Whenever add_three(7) is called, its return value is 10. The following is an example of a non-idempotent function:
def add_three():
    """Return 3 + the number entered by the user."""
    number = int(input('Enter a number: '))
    return number + 3
This function is not idempotent because its return value depends on I/O, namely the number entered by the user. Each call may return a different value: if it is called twice, the user can type 3 the first time and 7 the second, so the two calls to add_three() return 6 and 10, respectively.
Why is idempotence important?
Testability and maintainability. Idempotent functions are easy to test because they return the same result for the same arguments. A test simply checks that different calls to the function return the expected values. Such tests are also fast, which matters in unit testing but is often overlooked. Refactoring idempotent functions is easy too: no matter how much you change the code outside the function, calling it with the same arguments returns the same value.
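The difference in test effort can be sketched as follows. To keep both versions in one snippet, the input-reading variant is given the hypothetical name `add_three_from_input`; replacing the built-in `input()` with `unittest.mock.patch` is one common approach, not the only one.

```python
from unittest import mock


def add_three(number):
    """Return *number* + 3."""
    return number + 3


def add_three_from_input():
    """Return 3 + the number entered by the user."""
    number = int(input('Enter a number: '))
    return number + 3


# Same argument, same result, no setup required.
assert add_three(7) == 10
assert add_three(7) == add_three(7)

# The input-dependent version needs the built-in input() replaced first.
with mock.patch('builtins.input', return_value='7'):
    assert add_three_from_input() == 10
```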
What is a “pure” function?
In functional programming, a function is pure if it is idempotent and has no observable side effects. Remember, an idempotent function always returns the same result for a given set of arguments and cannot use any external factors to calculate that result. However, this does not mean that an idempotent function cannot affect non-local variables, I/O streams, and so on. For example, if the idempotent version of add_three(number) above printed the result before returning it, it would still be idempotent: it writes to an I/O stream, but that does not affect its return value. The call to print() is simply a side effect: an interaction with the rest of the program or system other than returning a value.
Let’s extend the add_three(number) example. We can use the following code snippet to see how many times the add_three(number) function is called:
add_three_calls = 0


def add_three(number):
    """Return *number* + 3."""
    global add_three_calls
    print(f'Returning {number + 3}')
    add_three_calls += 1
    return number + 3


def num_calls():
    """Return the number of times *add_three* was called."""
    return add_three_calls
We now output the result to the console (one side effect) and modify the non-local variable (another side effect), but since these side effects do not affect the return value of the function, the function is still idempotent.
Pure functions have no side effects. Not only do they use no “outside data” to compute their value, they also do not interact with the rest of the system or program except to compute and return that value. Thus, although our newly defined add_three(number) is still idempotent, it is no longer pure.
Pure functions contain no logging statements or print() calls, do not use database or Internet connections, and do not access or modify non-local variables. They also do not call any impure functions.
In short, pure functions are incapable of what Einstein called “spooky action at a distance” (in a computer science setting). They do not modify the rest of the program or system in any way. In imperative programming (which is what you are doing when you write Python code), they are the safest functions of all. They are very easy to test and maintain, even more so than functions that are merely idempotent. Testing a pure function is almost as fast as executing it. And the tests are simple: there are no database connections or other external resources involved, no setup code is required, and there is nothing to clean up afterward.
Obviously, idempotence and purity are icing on the cake rather than requirements. That is, we would like to write only pure or idempotent functions because of the advantages above, but that is not always possible. The key is that we habitually start writing code with an eye to removing side effects and external dependencies. That makes every line of code we write easier to test, even when we don’t end up with pure or idempotent functions.
Conclusion
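One way to act on that habit is to keep the calculation pure and push the I/O into a thin wrapper. The sketch below is an assumption about how that might look, not something prescribed by the article: `run_add_three` and its `read_line` and `write_line` parameters are hypothetical names.

```python
def add_three(number):
    """Return *number* + 3."""
    return number + 3


def run_add_three(read_line=input, write_line=print):
    """Read a number, write the number plus three, and return the result."""
    # All I/O lives here; the arithmetic stays in the pure add_three().
    number = int(read_line('Enter a number: '))
    result = add_three(number)
    write_line('Result: {}'.format(result))
    return result


# In a test, simple stand-ins replace the real keyboard and console.
written = []
assert run_add_three(read_line=lambda prompt: '7', write_line=written.append) == 10
assert written == ['Result: 10']
```

The impure wrapper still exists, but it is only a few lines long, and the part most likely to grow can be tested without any setup.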
The secret to writing good functions is no longer a secret. Just follow a few well-established best practices and rules of thumb. I hope you found this tutorial helpful.