• Building a Map of Your Python Project Using Graph Technology — Visualize Your Code
  • Kasper Muller
  • The Nuggets translation Project
  • Permanent link to this article: github.com/xitu/gold-m…
  • Translator: Ashira97
  • Proofread: PassionPenguin, chzh9311

Code Visualization – Use graph techniques to draw structure diagrams for Python projects

As a mathematician working on data science, I am interested in programming languages, machine learning, data and mathematics.

These technologies, tools, and arts are vital to our society. Even as you read this, they are changing our lives. At the same time, another emerging technology is growing fast.

This mathematics-based technique was discovered (or invented?) by the great Leonhard Euler while he was trying to solve a problem that no one else could solve. (Whether “discovered” or “invented” is the right word deserves a separate discussion.)

The problem has to do with an underlying structure or shape, which is usually represented by a form of relation that connects things.

Euler needed a tool for examining the relationships and structure among specific entities, where the distance between the entities does not matter, only the connections between them.

To solve this problem, he developed a tool called a mathematical graph (or graph for short).

This was the birth of graph theory and topology.

Fast forward, 286 years later…

Discovery of higher order structures

Not long ago, I had to deal with a fairly large project at work. This project contains hundreds of Python classes, methods, and functions that communicate with each other by sharing data or making calls to each other.

In the middle of development, I was working on a folder containing code to solve a problem throughout the project, and an idea occurred to me:

Wouldn’t it be nice to see where this problem fits into the overall project, and to see the invocation and data passing relationships between different entities?

What would the whole thing look like?

After a few evenings and seventeen cups of espresso, I had written a Python program. It takes code as input, parses it into nodes and relationships in the form of objects, calls, scopes, and instances, and stores the results in a Neo4j graph database.

The picture shown at the beginning of this article is the result of running this graph-building program on an NLP project (NLP, natural language processing, is the branch of machine learning that works with human language).

If you don’t know what a graph database is, let’s pause the main story here.

First, a graph is a mathematical model consisting of nodes and edges. In Neo4j nomenclature an edge is called a relationship, which is a very appropriate name because an edge represents a relationship between two nodes.

A classic example of such a graph is a social network like Facebook, where nodes represent people and relationships represent friendships.

A graph database stores graphs like this so you can explore the patterns hidden behind the thousands of edges.

It is important to keep in mind that graph databases store relationships as well as nodes. This means that some lookups are much faster than joining many tables in a relational database and then querying the result.

To make it clear what kind of graph we have built, I will briefly describe the structure of the graph.

Let’s start by specifying the root directory of the Python project.

The nodes in the graph represent objects in our project files, specifically functions, classes, and methods. Each node has its own properties, such as the file it is defined in and, if it has one, its parent. (A class can be the parent of a method, a function can be defined inside another function, and so on.)

As for relationships, we have call relationships, instantiation relationships, relationships that indicate which class a method belongs to, and so on.
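As a rough illustration of this structure, nodes and relationships could be modelled with two small dataclasses. This is only a sketch: the class names, the field names, the example file paths, and the `CALLS` relationship type are assumptions made for the example (only `IS_METHOD_IN` and `IS_FUNCTION_IN` are named later in this article).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class CodeNode:
    """A function, class, or method found in the project (hypothetical model)."""
    name: str
    kind: str                     # "function", "class", or "method"
    source_file: str              # file in which the object is defined
    parent: Optional[str] = None  # enclosing class or function, if any


@dataclass(frozen=True)
class Relationship:
    """A directed edge between two nodes."""
    start: CodeNode
    end: CodeNode
    kind: str                     # e.g. "CALLS" or "IS_METHOD_IN"


# A tiny example: a method that belongs to a class and calls a function.
tokenizer = CodeNode("Tokenizer", "class", "nlp/tokenizer.py")
tokens = CodeNode("tokens", "method", "nlp/tokenizer.py", parent="Tokenizer")
clean = CodeNode("clean", "function", "nlp/utils.py")

relationships = [
    Relationship(tokens, tokenizer, "IS_METHOD_IN"),
    Relationship(tokens, clean, "CALLS"),
]
```

Keeping the source file on every node matters later, when two objects in different files share the same name.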

We want to track calls and dependencies in our code.

So we have a new visualization tool at our disposal that visualizes the structure of the code rather than the data that Matplotlib visualizes.

At first, I didn’t see much use for it beyond seeing it as a fun tool for creating posters for award-winning projects.

However, after a discussion with a colleague who also studied mathematics and is interested in graph databases, we realized that this could be much more than a visualization tool.

Testing and Security

It’s nice to be able to see dependencies in the code, catch a bug or two with the diagram, or optimize the code just by looking at the diagram, but the real advantage of this structure diagram is that it shows the structure of the code.

For example, unless you’re really serious about breaking up your code into small testable units and testing them one by one before running larger integration tests, it’s hard to answer the following questions:

  • How well did you test your code? For example: which functions have been indirectly tested but not directly, or vice versa?
  • Is there code that has never been used or tested?
  • Which function is called the most times?

First of all, what does indirect testing mean?

When I call a function (or a method, generator, etc.), that function may call another function. A function reached at second hand in this way is said to be called indirectly. The first function is called explicitly, or directly.
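A minimal illustration of the difference, using hypothetical function names: the test below calls `load` directly, and because `load` calls `parse`, the test exercises `parse` only indirectly.

```python
def parse(text):
    """Split a string into words."""
    return text.split()


def load(text):
    """Call parse() directly."""
    return parse(text)


def test_load():
    # load() is tested directly; parse() is tested only indirectly.
    assert load("a b") == ["a", "b"]
```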

Ok… Why is this important? Specifying the type of call can help us measure the extent to which bugs in different functions are damaging the system.

If a function with a subtle bug is called indirectly by many different functions, then, first, the bug is more likely to surface than if the function were called only once; and second, the more functions that call it, the greater the potential threat the bug poses to the system.

This shows that graphs are an excellent tool for this kind of problem. Using the graph, we can compute an importance index for the different methods, functions, and classes. Sorting by this index lets us prioritize the important functions when fixing bugs.
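The article does not define the importance index precisely. One possible definition, sketched here purely as an assumption, is the number of other functions that reach a given function through direct or indirect calls. The call graph and the function names below are made up for the example.

```python
from collections import deque

# Hypothetical call graph: caller -> functions it calls directly.
calls = {
    "main": ["load", "train"],
    "load": ["parse"],
    "train": ["parse", "fit"],
    "parse": [],
    "fit": [],
}


def reachable(graph, start):
    """Return every function that `start` calls directly or indirectly."""
    seen, queue = set(), deque([start])
    while queue:
        for callee in graph.get(queue.popleft(), []):
            if callee not in seen:
                seen.add(callee)
                queue.append(callee)
    return seen


def importance(graph):
    """Count, for each function, how many other functions depend on it."""
    score = {name: 0 for name in graph}
    for caller in graph:
        for callee in reachable(graph, caller):
            if callee != caller:
                score[callee] = score.get(callee, 0) + 1
    return score


# Most depended-upon functions first.
ranking = sorted(importance(calls).items(), key=lambda kv: kv[1], reverse=True)
```

In this toy graph `parse` ranks first because `main`, `load`, and `train` all depend on it, so a bug there would have the widest reach.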

The solution

Before Python can talk to Neo4j, we need a running Neo4j instance (for example, Neo4j Desktop) and the Neo4j Python driver, which is installed with the following command:

pip install neo4j

Next, build a class in Python that is responsible for communicating with Neo4j.
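A minimal sketch of what such a class might look like follows. The class name `LoadGraphData` and the method name `work_with_data` are taken from the snippets in this article; the method bodies, the `close` method, and the optional `driver` argument (which lets the class be tried out with a stand-in object instead of a live server) are assumptions, not the author's implementation.

```python
class LoadGraphData:
    """Thin wrapper around a Neo4j driver (a sketch, not the author's class)."""

    def __init__(self, username, password, uri, driver=None):
        if driver is None:
            # Imported lazily so the class can also be used with a stand-in driver.
            from neo4j import GraphDatabase
            driver = GraphDatabase.driver(uri, auth=(username, password))
        self.driver = driver

    def close(self):
        """Release the connection to the database."""
        self.driver.close()

    def work_with_data(self, query):
        """Run a Cypher query and return each record as a list of values."""
        with self.driver.session() as session:
            result = session.run(query)
            # Consume the result while the session is still open.
            return [record.values() for record in result]
```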

We can now simply build a graph loader in any class with the following code:

self.loader = LoadGraphData("Kasper", "strong_pw_123", "bolt://localhost:7687")

Let’s take a look at the implementation of the mapping algorithm described above on my personal project:

In this project, the blue nodes represent classes, the orange nodes represent methods, and the red nodes represent functions.

Note that some of the code in this project is dummy code written only to test the mapping algorithm, although it is still valid Python.

We want to know which methods are tested, and how they are invoked indirectly. For example: how many calls separate a given non-test method from the nearest test method?

Now that we have the graph, it’s time to query the graph database.

Look at the following query, which finds the shortest paths between test functions and non-test functions, and compare it to the picture.
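A query of this kind might look roughly like the following, stored as a Python string. This is a guess at its general shape, not the author's query: the node label `Object`, the properties `is_test`, `name`, and `source`, and the relationship type `CALLS` are all assumptions about how the graph was modelled. The returned columns are chosen to line up with the column names used for the DataFrame later in this article.

```python
# Hypothetical Cypher query: shortest call chain from each test to each
# non-test function that it reaches.
query = """
MATCH (t:Object {is_test: true}), (f:Object {is_test: false})
MATCH p = shortestPath((t)-[:CALLS*]->(f))
RETURN length(p) AS distance,
       t.name AS test_function, t.source AS test_source,
       f.name AS target, f.source AS target_source
ORDER BY distance
"""
```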

Note that here a test method means any object defined in a file, class, or function whose name begins with “test”, or simply a function, method, or class whose own name begins with “test”. (A test class should start with “Test”.)

This premise seems far-fetched, but I’ve rarely written a test method in a Python file that starts with anything other than “test”, because most of the time the method name itself starts with “test”.

If you have a file that starts with “test”, I assume that all functions and methods are test methods in this file.
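The naming rule above can be sketched as a small predicate. The function name `is_test_object` and its parameters are hypothetical; the comparison is case-insensitive so that classes starting with “Test” are covered as well.

```python
from pathlib import Path


def is_test_object(name, source_file, parents=()):
    """Return True if the object counts as a test under the naming rule.

    `parents` holds the names of the enclosing classes or functions, if any.
    """
    candidates = [name, Path(source_file).stem, *parents]
    return any(c.lower().startswith("test") for c in candidates)
```

For example, `is_test_object("helper", "tests/test_utils.py")` is true because the file name starts with “test”, while `is_test_object("parse", "nlp/utils.py")` is false.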

The output of the above query is shown in the table below:

Hmm… It would be nice to get this table as a pandas DataFrame.

Let’s change the code:

We store the query statement as a string wrapped in triple quotes in the variable query. Then, within the selected method or function, the code might look like this:

import pandas as pd

# Run the Cypher query and collect the returned rows
loader = self.loader
records = loader.work_with_data(query)
df = pd.DataFrame(records, columns=["distances", "test_functions", "test_source", "targets", "target_source"])

You then have the table stored as a DataFrame object for further processing.

We can then immediately identify all non-test methods. Let’s build this code diagram

Before continuing our exploration, we should define what a safety index means.

For a given function f, define its test norm as the distance in the graph between f and the nearest test method.

  • By convention, the test norm of every test method is 0.
  • If a method is called directly by a test method, its test norm is 1.
  • If a method is not called by any test method but is called by another function that a test method calls, its test norm is 2, and so on.
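The distances described above can be computed with a breadth-first search that starts from all test methods at once. The sketch below works on a plain dictionary rather than on Neo4j; the function name `compute_test_norms` and the example call graph are assumptions made for illustration.

```python
from collections import deque


def compute_test_norms(calls, tests):
    """Map each reachable function to its distance from the nearest test."""
    norm = {t: 0 for t in tests}      # every test method has distance 0
    queue = deque(tests)
    while queue:
        current = queue.popleft()
        for callee in calls.get(current, []):
            if callee not in norm:    # first visit is the shortest distance
                norm[callee] = norm[current] + 1
                queue.append(callee)
    return norm


# Hypothetical call graph: caller -> functions it calls directly.
calls = {
    "test_load": ["load"],
    "load": ["parse"],
    "parse": [],
    "orphan": [],
}
norms = compute_test_norms(calls, tests=["test_load"])
```

Here `load` gets distance 1 and `parse` distance 2, while `orphan` does not appear in the result because no test reaches it.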

Next we define a safety index, σ, for the entire project. Let NT be the set of all non-test functions, and let N = |NT|. Then we define

$$\sigma = \frac{1}{N} \sum_{f \in \mathbb{NT}} \frac{1}{|f|_{T}}, \quad \text{where } |f|_{T} \text{ is the test norm of } f.$$

Please note:

  • If every function in a project is tested directly, that is, every function has a test norm of 1, then σ = 1.
  • If no function is tested directly or indirectly, σ is an empty sum, which by convention equals zero.

In general 0 ≤ σ ≤ 1, and the closer σ is to 1, the better your testing and the more secure your code.

The idea behind this formula is that the farther a given function is from a test function, the less thoroughly it is tested. However, here I assume that, on “average”, the poorly tested functions are the ones out at the edges of the graph. But this is just a definition, and we can change it as needed. For example, a method called by many test functions is better tested than a method called by just one test function, a situation we don’t consider in the current version of the project but may address in a later release.

Let’s make it happen!
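A minimal sketch of the computation, assuming the distance of each function from its nearest test is already available as a dictionary. The function name `safety_index` and the sample data are hypothetical; functions that no test reaches contribute zero to the sum, and the average is taken over the non-test functions.

```python
def safety_index(norms, non_test_functions):
    """Average of 1 / distance-to-nearest-test over the non-test functions."""
    if not non_test_functions:
        return 0.0
    total = 0.0
    for f in non_test_functions:
        distance = norms.get(f)   # None if no test reaches f
        if distance:              # untested functions contribute nothing
            total += 1.0 / distance
    return total / len(non_test_functions)


norms = {"load": 1, "parse": 2}           # hypothetical distances
functions = ["load", "parse", "orphan"]   # "orphan" is never tested
sigma = safety_index(norms, functions)    # (1/1 + 1/2 + 0) / 3
```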

This works, and the above project got a score of about 0.2, but remember, it only works if the names of the files or objects used to test your code start with “test”.

I’ll leave building this Python map as an exercise for the reader, since I am not able to open-source the code for this project. However, here are some tips to help you build your own.

  • I have a main class that tracks and stores nodes and relationships as I iterate through each file line by line.
  • During iteration, we record the current scope: are we inside a class? Inside a method? And so on.
  • If there is a call to another function or an instantiation of another class, we store the object and create a relationship.
  • If the current line contains a definition statement, we store the current object and its parent node (if present), which is a method or class. The relationship is then stored as IS_METHOD_IN or IS_FUNCTION_IN.
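The tips above can be sketched with Python's standard `ast` module. Note that this swaps in a different technique: the author reads files line by line, whereas the sketch below lets `ast` do the parsing and only tracks scope. The class name `CallMapper`, the tuple layouts, and the `CALLS` relationship type are assumptions; calls are matched by bare name only, which is cruder than a real implementation would need.

```python
import ast


class CallMapper(ast.NodeVisitor):
    """Collect definitions and calls while tracking the current scope."""

    def __init__(self, source_file):
        self.source_file = source_file
        self.scope = []          # stack of enclosing (kind, name) pairs
        self.nodes = []          # (name, kind, parent name or None, source file)
        self.relationships = []  # (start name, relationship type, end name)

    def _define(self, node, kind):
        parent = self.scope[-1][1] if self.scope else None
        self.nodes.append((node.name, kind, parent, self.source_file))
        if parent and kind != "class":
            rel = "IS_METHOD_IN" if kind == "method" else "IS_FUNCTION_IN"
            self.relationships.append((node.name, rel, parent))
        self.scope.append((kind, node.name))
        self.generic_visit(node)  # descend with the new scope active
        self.scope.pop()

    def visit_ClassDef(self, node):
        self._define(node, "class")

    def visit_FunctionDef(self, node):
        in_class = bool(self.scope) and self.scope[-1][0] == "class"
        self._define(node, "method" if in_class else "function")

    visit_AsyncFunctionDef = visit_FunctionDef

    def visit_Call(self, node):
        func = node.func
        callee = func.id if isinstance(func, ast.Name) else getattr(func, "attr", None)
        if callee and self.scope:
            self.relationships.append((self.scope[-1][1], "CALLS", callee))
        self.generic_visit(node)


source = '''
class Tokenizer:
    def tokens(self, text):
        return clean(text)

def clean(text):
    return text[:100]
'''
mapper = CallMapper("nlp/tokenizer.py")
mapper.visit(ast.parse(source))
```

After the visit, `mapper.nodes` holds the class, its method, and the function, and `mapper.relationships` records that `tokens` is a method in `Tokenizer` and calls `clean`.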

This is essentially the same as writing a parser for Python syntax.

It’s a little more complicated than we thought at first.

We need to keep track of imports from, and calls into, other files as we parse line by line, because we don’t know how deep the project is. Each object we store has a source file in which it was created, and we store that file as a property of the object: if two objects in two different files share the same name, we must not merge them into one node in the structure diagram.

After iterating through all the .py files in the project, I create CSV files from the stored data, and Python loads them into Neo4j with a LOAD CSV query sent through the LoadGraphData class from earlier.
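A rough sketch of this export step follows. The column names, the file name `nodes.csv`, the node label `Object`, and the helper `nodes_to_csv` are assumptions; the Cypher statement also assumes the CSV file has been placed in Neo4j's import directory. Merging on both name and source file keeps same-named objects from different files apart.

```python
import csv
import io


def nodes_to_csv(nodes):
    """Serialize (name, kind, source) tuples to CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(["name", "kind", "source"])
    writer.writerows(nodes)
    return buffer.getvalue()


# Hypothetical nodes collected while parsing the project.
nodes = [
    ("Tokenizer", "class", "nlp/tokenizer.py"),
    ("clean", "function", "nlp/utils.py"),
]
csv_text = nodes_to_csv(nodes)

# Hypothetical Cypher statement that loads the file into Neo4j.
load_nodes_query = """
LOAD CSV WITH HEADERS FROM 'file:///nodes.csv' AS row
MERGE (n:Object {name: row.name, source: row.source})
SET n.kind = row.kind
"""
```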

Structure diagram of an existing project

Here is a ready-made project structure diagram

Beautiful Soup

This picture perfectly illustrates what happens in Beautiful Soup. Its clusters are very tightly connected to each other.

While this code isn’t perfect, I believe it will be quite useful in the future. I’m currently working on a more stable version that takes account of how many times a file is opened.
