At least from an engineering point of view, your projects need far less secrecy than you might think.

Google manages code in an unusual way: it develops on a single "trunk" and keeps more than 90% of its code in one repository, called Piper, shared by tens of thousands of software developers across dozens of offices around the world. Projects that are open source and require external collaboration, chiefly Android and Chrome, are kept in Git, the version control system, instead.

The entire repository is a tree structure in which each team has its own directory, and the directory path serves as the code's namespace. Each directory has an owner who approves changes to the files under it.
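
The directory-owner model can be approximated outside Google as well. As a rough illustration (not Google's actual tooling), here is a minimal Python sketch that resolves a file's owners by walking up the directory tree looking for a hypothetical OWNERS file with one reviewer name per line:

```python
from pathlib import Path

def find_owners(file_path: str, repo_root: str = ".") -> list[str]:
    """Walk up from a file's directory toward the repository root and return
    the names listed in the nearest OWNERS file (a hypothetical convention)."""
    root = Path(repo_root).resolve()
    current = (root / file_path).parent.resolve()
    while True:
        owners_file = current / "OWNERS"
        if owners_file.is_file():
            return [
                stripped
                for stripped in (line.strip() for line in owners_file.read_text().splitlines())
                if stripped and not stripped.startswith("#")  # '#' starts a comment
            ]
        if current == root:
            return []  # no owners declared anywhere along the path
        current = current.parent

# Example: who must approve a change to a team's SQL script? (path is made up)
print(find_owners("analytics/revenue/daily_revenue.sql"))
```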

This approach has worked at Google for more than 20 years. In 2015, Google's code base contained about a billion files and a history of roughly 35 million commits. Code is generally committed to the head of the trunk, ensuring that every user sees the latest version of the same code. Permissions are enforced at the file level, and about 99% of the code is visible to all users; access is restricted only for a small number of important configuration files and business-critical confidential material. All reads and writes are logged, so administrators can find out who has read any given file.

Compared with managing code in a patchwork of different ways, the benefits of Google's approach are clear: anyone can view and use code from across the company, which greatly facilitates sharing and reuse; there is a single version and a single path, so nobody struggles to find the latest version of a file; and every code change can easily be rolled back, or its impact assessed with a pre-commit check.

Google follows these principles, at least to some extent, even for managing SQL. The author of the piece below is a data engineer who spent two years working as a contractor for Google and found that Google's data engineers treat SQL in much the same way software engineers treat code. He believes this attitude is important enough to be worth adopting in the data strategy of businesses large and small. We've translated the author's article to see how Google's approach to SQL as code helps, and what it can teach smaller organizations.

1. SQL is just a query language, so why does Google treat it as code?

Like object-oriented code, SQL is time-consuming to write, difficult to debug, difficult to understand (which makes version control harder), and costly to maintain. In Google's data engineering department alone, SQL is used to build data pipelines, and those pipelines in particular need to be easy to debug and fix. With these factors in mind, the more centralized the code, the more smoothly a data strategy can be implemented.

Thinking of SQL as code therefore means we can bring code-management tools into the process: we can easily see who is responsible for a particular change or for maintaining a SQL script, and track adjustments the same author has made in other related queries. That way we can quickly find failed commits, roll back changes, or apply the necessary fixes. Once SQL code is submitted, it can be deployed to the development environment immediately; the next step is to run the development pipelines to identify and fix failures in real time.
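
As a concrete illustration of what "bringing code-management tools into the process" can look like, here is a minimal sketch, assuming the SQL scripts live in an ordinary Git repository; the file path is hypothetical and this is not Google's internal tooling:

```python
import subprocess

def recent_changes(sql_path: str, count: int = 5) -> list[dict]:
    """List the most recent commits that touched a given SQL script,
    so the responsible author can be identified quickly."""
    out = subprocess.run(
        ["git", "log", f"-n{count}", "--pretty=format:%H|%an|%s", "--", sql_path],
        capture_output=True, text=True, check=True,
    ).stdout
    return [
        dict(zip(("sha", "author", "subject"), line.split("|", 2)))
        for line in out.splitlines() if line
    ]

def revert_commit(sha: str) -> None:
    """Roll back a failed commit without rewriting history."""
    subprocess.run(["git", "revert", "--no-edit", sha], check=True)

# Example: inspect recent edits to a pipeline query, then revert the broken one.
for change in recent_changes("pipelines/daily_revenue.sql"):
    print(change["sha"][:8], change["author"], change["subject"])
# revert_commit("<sha-of-the-broken-change>")
```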

In addition, we regularly cut releases to the test environment and keep it up to date to make sure the code will run in production. Once testing succeeds, the SQL code committed between the old and the new production versions can be promoted, keeping the chance of problems to a minimum.

What should small companies learn?

Take a look at your own software engineering culture, benchmark it against more mature engineering cultures such as Google's, and try out the tools they favor, including Git, IDEs, and so on. We should index all of our code, take the time to turn proprietary scripts into globally shared ones, and eliminate unnecessary views, materialized views, stored procedures, and the like.

2. How does Google manage SQL code?

Google keeps almost all of its code in a single, centralized repository. So when a change needs to be made to existing SQL, or a new script needs to be created, Google's engineers create a changelist, similar in nature to a pull request. The change then has to be tested and approved by other engineers; only after approval can the author commit the changes to the repository.

While this form of change control is fairly common in the enterprise, one of Google's hallmarks is its strong emphasis on code formatting. I never paid much attention to formatting myself, but experience has taught me that high-quality formatting greatly reduces the difficulty of understanding and debugging code, and also cuts the time other authors spend making changes to it. Google takes formatting so seriously that it has an automated mechanism for rejecting code that doesn't meet its standards.
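
Google's formatting gate is internal, but a small stand-in is easy to build from open-source pieces. The sketch below assumes the sqlparse library and treats "reindented with uppercase keywords" as the house style; it rejects any SQL file that does not already match the formatter's output:

```python
import sys
import sqlparse  # open-source SQL formatter: pip install sqlparse

def is_canonically_formatted(sql_text: str) -> bool:
    """Return True if the script already matches the assumed house format
    (reindented, with uppercase keywords)."""
    formatted = sqlparse.format(sql_text, reindent=True, keyword_case="upper")
    return sql_text.strip() == formatted.strip()

def check_files(paths: list[str]) -> int:
    """Pre-submit style gate: fail the change if any SQL file is unformatted."""
    bad = [p for p in paths if not is_canonically_formatted(open(p).read())]
    for p in bad:
        print(f"REJECTED: {p} does not follow the SQL formatting standard")
    return 1 if bad else 0

if __name__ == "__main__":
    sys.exit(check_files(sys.argv[1:]))
```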

What should small companies learn?

Choose one code repository and stick with it. Ideally this repository is shared across engineering teams, or at the very least holds all SQL code in one place, covering the full range of data work: data engineering, analytics, business intelligence, and more.

Also, be sure to standardize code formatting. There are plenty of open-source formatting tools available today; they are not difficult to use and can greatly improve the readability and maintainability of code. Take the time to run the existing code through such a formatter (or an in-house tool), and make sure all subsequent submissions follow the standard. Within days, or weeks at most, data engineers will adapt to the new standard, and the readability, authoring quality, and comprehensibility of the company's SQL code will improve.

3. Version control + multiple test environments = time saved

Code changes are so frequent that without version control it would be hard to roll back and remedy unexpected errors. When a submitted change breaks a pipeline, produces unexpected values, or simply doesn't work, we need version control to restore a known-good state.

Google's approach to code integration follows this line of thinking. A well-structured test environment can absorb even the most outrageous changes (and sometimes the submitted code is exactly that) without disrupting normal business. The impact of a SQL change shows up directly in the development environment, helping engineers detect failures quickly.

There is, of course, a small amount of code that doesn't break in the development environment but does break in production. Many factors contribute to this, which is why we also need a separate pre-production environment in our testing setup.

Google uses environment variables to manage its multiple test environments, and these variables can easily be injected into table names through an interpolation layer.
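
Google's interpolation layer is internal, but the idea is simple to sketch. Assuming a hypothetical PIPELINE_DATASET environment variable that names the dataset for the current environment, table names can be resolved at render time like this:

```python
import os

def qualified_table(base_name: str) -> str:
    """Resolve a logical table name to its physical name for the current
    environment, e.g. 'events' -> 'dev_alice.events' or 'prod.events'."""
    dataset = os.environ.get("PIPELINE_DATASET", "dev")  # hypothetical variable name
    return f"{dataset}.{base_name}"

def render_query(template: str) -> str:
    """Inject environment-specific table names into a SQL template."""
    return template.format(events=qualified_table("events"),
                           users=qualified_table("users"))

sql = render_query("""
SELECT u.country, COUNT(*) AS event_count
FROM {events} AS e
JOIN {users} AS u ON u.id = e.user_id
GROUP BY u.country
""")
print(sql)
```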

What should small companies learn?

Establish at least one development environment, and have code tests cover as much of the data infrastructure as possible to minimize the chance of failure. Free open-source tools such as dbt make testing significantly easier through an abstraction layer in which every table has two versions, one for development and one for production. With a daily, weekly, or even monthly release schedule, we can comfortably push everything committed since the last release straight into production.
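
In dbt itself this is handled by the ref() macro and target profiles; the sketch below only imitates that behavior with plain Jinja templating (the schema names and the TARGET switch are assumptions, not dbt's real configuration):

```python
from jinja2 import Template  # pip install jinja2; dbt uses Jinja templating similarly

TARGET = "dev"  # or "prod"; in dbt this would come from the active target profile
SCHEMAS = {"dev": "dbt_dev", "prod": "analytics"}  # hypothetical schema names

def ref(model_name: str) -> str:
    """Imitation of dbt's ref(): point the same model at a dev or prod schema."""
    return f"{SCHEMAS[TARGET]}.{model_name}"

query = Template("""
SELECT order_id, SUM(amount) AS total
FROM {{ ref('orders') }}
GROUP BY order_id
""").render(ref=ref)

print(query)  # reads from dbt_dev.orders in dev, analytics.orders in prod
```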

4. Extensive code access

Google's practice of putting almost all of its code into a single repository makes it easy to tell who owns a product and who uses it. For example, without broad access to this centralized code base, it would be difficult for software engineers to understand the downstream effects of updating a production-grade application. With broad access, they can easily search for the scripts, queries, and other applications that depend on the one being changed and notify those engineers to coordinate the change.
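
Even without Google-scale indexing, a shared repository makes this kind of impact analysis a simple search. Here is a minimal sketch, assuming all SQL scripts live under one repository root (the table name is just an example):

```python
from pathlib import Path

def downstream_references(table_name: str, repo_root: str = ".") -> list[tuple[str, int]]:
    """Scan every SQL file in the shared repository for references to a table,
    so the owners of dependent scripts can be notified before a change."""
    hits = []
    for path in Path(repo_root).rglob("*.sql"):
        for lineno, line in enumerate(path.read_text(errors="ignore").splitlines(), start=1):
            if table_name.lower() in line.lower():
                hits.append((str(path), lineno))
    return hits

# Example: who depends on the users table before we rename a column?
for location, line_number in downstream_references("analytics.users"):
    print(f"{location}:{line_number}")
```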

I know that many organizations want to wall off different parts of development with code secrecy. Yes, there are some highly sensitive code bases that shouldn't be left open, but they are few and far between. If a company as large as Google is willing to build its code architecture on trust, there is really no need for smaller companies to be so secretive.

What should small companies learn?

Trust and communication should be built into the structure of the code base and the repository. At least from an engineering point of view, your projects need far less secrecy than you might think. After all, if you can't trust the engineers on your own team, how can the business function? In short, take the initiative to push boundaries and encourage collaboration between software engineering and data engineering in your business processes. That way, the negative downstream effects of a change can be addressed before the code reaches production.

It's spring recruitment season, the so-called "golden March, silver April," when Internet companies hire frantically; it is also the best time for technical staff at outsourcing firms or small companies to jump to a big company.

Here I'm sharing, for free, a comprehensive set of advanced Java notes from a GitHub project with 120K stars. It covers Java fundamentals, Java containers, Java concurrency, the Java Virtual Machine, and Java I/O, as well as networking, Linux, data structures and algorithms, databases, system design, required tools, interview guidelines, and more.


(1) Basics

1. Java basic skills

  • Getting started with Java (Basic Concepts and General Knowledge)
  • Java syntax
  • Basic data types
  • Methods (functions)

2. Java object oriented

  • Classes and objects
  • Three characteristics of object orientation
  • Modifiers
  • Interfaces and abstract classes
  • Other important points

3. Java core technology

  • Collections
  • Exceptions
  • Multithreading
  • Files and I/O streams

(2) Concurrency

1. Concurrent containers

  • Summary of concurrent containers provided by the JDK
  • ConcurrentHashMap
  • CopyOnWriteArrayList
  • ConcurrentLinkedQueue
  • BlockingQueue
  • ConcurrentSkipListMap

2. Thread pools

  • The benefits of using thread pools
  • Executor framework
  • (Important) An introduction to the ThreadPoolExecutor class
  • (Important) Example of ThreadPoolExecutor usage
  • Several common thread pools in detail
  • ScheduledThreadPoolExecutor
  • Determining thread pool size

3. Optimistic and pessimistic locks

  • What pessimistic and optimistic locks are
  • Two common implementations of optimistic locking
  • Disadvantages of optimistic locking
  • Using CAS and synchronized

(3) The JVM

1. Java memory area

  • Overview
  • Runtime data area
  • Exploring HotSpot virtual machine objects
  • Key supplementary content

2. JVM garbage collection

  • Demystify JVM memory allocation and reclamation
  • Is the object dead? (determining object liveness)
  • Garbage collection algorithm
  • Garbage collector

3. JDK monitoring and troubleshooting tools

  • JDK command line tools
  • JDK visual analysis tools

Networking, Linux, data structures and algorithms, databases, system design, required tools, interview guide

I won't show any more here for lack of space. These advanced notes run to 512 pages; they should be very helpful for anyone looking to level up, and I hope they help you too.

