
Technical articles about ID Mapping are scarce online. I came across this one by Peng Wenhua and am sharing it here so that we can learn from it together.

Why ID Mapping?

Technology exists to solve real business problems. Without the problem of data silos, data technology would not have developed and changed so dramatically.

More than ten years ago, the IT industry was still busy with the “Four Databases and Twelve Golden Projects” program. I was given the job of cleaning up all the addresses in one area. That is hard, because the same address can be written in many ways. Take the building nicknamed “Big Pants”: its full name is “CCTV Headquarters Building”, its street address is “32 East Third Ring Middle Road, Chaoyang District, Beijing”, it is also called the “new CCTV site”, and it has specific latitude and longitude coordinates as well.

With addresses in such a mess, it is easy to get things wrong. Our project was to unify them and provide base data for the geographic information database. The workload was enormous!

How did we do it? Put simply, by comparison. We wrote rules: when simple rules failed to match, we used complex ones, and when complex rules failed, we fell back on checking by eye. First we cleaned the building names, door numbers and the like, fixing typos and similar errors. Then we matched the records against a relatively accurate data source and marked the identical ones. Next we handled the records that were similar, then the ones that did not match, bringing in latitude and longitude. Finally we went over everything twice by eye and set aside whatever was still unresolved.
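The cascade described above (clean, exact match, similar match, manual review) might be sketched like this. It is only an illustration, not the original project's code: the function names, the 0.85 similarity threshold and the use of `difflib` are all assumptions, and the latitude/longitude step is left out.

```python
from difflib import SequenceMatcher


def normalize(address: str) -> str:
    """Cleaning step: lowercase and drop punctuation and whitespace."""
    return "".join(ch for ch in address.lower() if ch.isalnum())


def match_address(raw, reference, threshold=0.85):
    """Return (reference_name, stage); stage is 'exact', 'similar' or 'manual'.

    `reference` is a list of trusted address names (hypothetical data source).
    `threshold` is an assumed cut-off for the similarity rule.
    """
    cleaned = normalize(raw)
    normalized_ref = {normalize(name): name for name in reference}

    # Rule 1: identical after cleaning.
    if cleaned in normalized_ref:
        return normalized_ref[cleaned], "exact"

    # Rule 2: the most similar reference entry, if it is similar enough.
    best_name, best_score = None, 0.0
    for norm, name in normalized_ref.items():
        score = SequenceMatcher(None, cleaned, norm).ratio()
        if score > best_score:
            best_name, best_score = name, score
    if best_score >= threshold:
        return best_name, "similar"

    # Rule 3: nothing matched, leave it for a person to check by eye.
    return None, "manual"


reference = ["CCTV Headquarters Building"]
exact = match_address("cctv headquarters building.", reference)
similar = match_address("CCTV Headquaters Building", reference)
nickname = match_address("big pants", reference)
```

A nickname such as “big pants” shares almost no characters with the full name, so string rules alone send it to manual review, which is exactly the pain point described above.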

It was painful! But I did not know how to use more advanced technology back then. Baidu, for its part, uses a graph database to solve this problem, which is how a Baidu search can now find the place for you under any of these names:

Examples abound on the Internet. Before the data middle platform became popular, ID Mapping was already used in DSPs (Internet advertising platforms). A DSP must recognize the same user logging in from different ends (a home computer, a work computer). It cannot get much detailed data and can only identify users by the browser's cookie data, so ID Mapping in a DSP system is based on cookies. When the same customer logs in from different ends, the DMP (data management platform) uses the cookie data to identify them as the same customer.

There is another problem, however: a cookie belongs to a single domain. The cookies set when you visit your mailbox are not the same as those of the Baidu advertising alliance, so ID Mapping must also be done between the website and the DSP. With this mapping, they can see what you viewed on those sites and then recommend relevant content to you.

That is why, after you search for “health preservation” on Baidu, shopping websites recommend goji berries to you.

Now, as our systems grow more complex and we want to discover more value from our customers, we face similar requirements in everyday scenarios. For example, the user transaction information on our trading platform may be linked only to orders in the ERP, while the user records in the two systems are completely different. The customer information in our CRM is separate again. Because users in the trading platform, ERP and CRM are independent of one another, we cannot see the full picture of our contact with a customer or accurately assess that customer's value.

Alibaba’s situation was more complex than ours. It was not just data silos between systems but, worse, separate lines of business. That is a killer. So Alibaba borrowed the ID Mapping logic from DSPs to connect all of its data thoroughly. This is the One ID foundation of Alibaba’s data middle platform.

The core technology of ID Mapping

ID Mapping is used in scenarios such as: 1. identifying the same object across data from multiple ends; 2. connecting multi-source data. Both are handled in basically the same way. Take an example: Lao Wang browses products on the mall’s PC site and places an order on the mobile app. The back end automatically generates the order and sends it to the ERP for subsequent order and logistics processing. Later, Lao Wang gets impatient and calls the supply chain finance customer service line for advice. So Lao Wang’s data is as follows:

(Note: In most cases, the UUID is the same for web and mobile. This is just an example.)

In this case, plain SQL does not work very well, because too many joins are needed. And this is only one example; connecting all the data end to end is not easy.

You can do it in SQL if you must, but the rules become far more complex and the load on the system is heavy. You also have to account for occasional logins, where someone signs in once on a device that is not theirs. That makes things even messier, and you will run into many problems when putting it into production.

Today, in a big data environment where technology develops quickly, I would not do data cleaning the way I did back then; writing SQL for this looks clumsy. As mentioned earlier, Baidu solves the problem with a graph database. Graph computing abstracts the data into “points” and “edges” and then uses the graph’s natural “connectivity” to identify and link the data automatically.

What we actually want to do is link these identifiers together, something like this:

Notice that the edges in this graph have no direction, and it probably contains no cycle. It is an undirected connected graph. In this way, the pieces of information match up.
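To make the “points” and “edges” idea concrete, here is a minimal sketch in plain Python. The records and every identifier value in them are invented for illustration (the article's original data table is not reproduced here), and `build_graph` and `connected_ids` are hypothetical helper names. Each identifier becomes a point, identifiers that appear in the same record are joined by edges, and a breadth-first search collects everything connected to one starting point.

```python
from collections import defaultdict, deque

# Hypothetical records for Lao Wang; all identifier values are made up.
records = [
    {"source": "mall_pc", "ids": ["cookie:pc-001", "account:laowang"]},
    {"source": "mall_mobile",
     "ids": ["device:mob-777", "account:laowang", "phone:138-0000-0000"]},
    {"source": "erp", "ids": ["order:SO-1001", "phone:138-0000-0000"]},
    {"source": "customer_service", "ids": ["ticket:T-42", "phone:138-0000-0000"]},
    # An unrelated visitor, to show that separate users stay separate.
    {"source": "mall_pc", "ids": ["cookie:pc-999", "account:someone_else"]},
]


def build_graph(records):
    """Each identifier is a point; identifiers in one record are linked by edges."""
    graph = defaultdict(set)
    for record in records:
        ids = record["ids"]
        for identifier in ids:
            graph[identifier]  # register the point even if it has no edge
        # Chaining neighbours is enough to keep one record connected.
        for left, right in zip(ids, ids[1:]):
            graph[left].add(right)
            graph[right].add(left)
    return graph


def connected_ids(graph, start):
    """Breadth-first search: every identifier reachable from `start`."""
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbour in graph[node]:
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)
    return seen


graph = build_graph(records)
lao_wang = connected_ids(graph, "cookie:pc-001")
```

Starting from the PC cookie, the search reaches the account, the mobile device, the phone number, the ERP order and the service ticket, without writing a single join.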

Writing this in SQL would be really hard, wouldn’t it? On a graph, though, it is easy to work out. Process the data into the format a graph database needs, run the graph computation, and the results we want come out easily. What’s more, we can set thresholds on the “edges” to drop temporary logins, such as one at a print shop, and filter out noise. Isn’t that great?
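One simple way to put a threshold on edges is to weight each edge by how often the two identifiers were seen together and drop the weak ones. The sketch below assumes that weighting scheme, hypothetical login events, and an arbitrary `min_weight` of 2; a real system would likely use richer signals.

```python
from collections import Counter

# Hypothetical login events as (device, account) pairs.
login_events = [
    ("device:home-pc", "account:laowang"),
    ("device:home-pc", "account:laowang"),
    ("device:home-pc", "account:laowang"),
    ("device:phone", "account:laowang"),
    ("device:phone", "account:laowang"),
    ("device:print-shop", "account:laowang"),   # a one-off login
    ("device:print-shop", "account:stranger"),
    ("device:print-shop", "account:stranger"),
]


def weighted_edges(events):
    """Edge weight = number of times the two identifiers appeared together."""
    return Counter(events)


def filter_edges(edge_weights, min_weight):
    """Keep only edges seen at least `min_weight` times (assumed threshold)."""
    return {edge for edge, weight in edge_weights.items() if weight >= min_weight}


strong_edges = filter_edges(weighted_edges(login_events), min_weight=2)
```

With the single print-shop login removed, Lao Wang's account is no longer linked to the shared device, so it cannot be merged with the stranger who also used it.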

Therefore, the process of ID Mapping is basically as follows:

1. Identify the elements in each source/end, that is, the elements that can identify user information; the original IDs are also useful;

2. Abstract the data into “point” and “edge” data sets, set edge thresholds and filter out weak connections;

3. Construct a graph model and use the connected subgraph algorithm to get the IDs that belong to the same object;

4. Get the result set and assign a new ID;

5. Deduplicate and merge the data to generate the final results;

6. Repeat steps 3-5 on each run, feeding the existing result set into step 3, so that IDs that already exist keep their old ID.
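Steps 3 to 6 can be sketched with a union-find structure instead of a graph database. This is an illustration under assumptions, not the article's implementation: One IDs are plain integers, `assign_one_ids` is a hypothetical function, and when two previously separate groups merge, the smallest old ID is the one kept.

```python
class UnionFind:
    """Minimal disjoint-set structure for grouping connected identifiers."""

    def __init__(self):
        self.parent = {}

    def find(self, item):
        self.parent.setdefault(item, item)
        root = item
        while self.parent[root] != root:
            root = self.parent[root]
        # Path compression: point every visited node straight at the root.
        while self.parent[item] != root:
            self.parent[item], item = root, self.parent[item]
        return root

    def union(self, left, right):
        root_left, root_right = self.find(left), self.find(right)
        if root_left != root_right:
            self.parent[root_right] = root_left


def assign_one_ids(edges, previous=None):
    """Return {identifier: one_id} for the given edges.

    `previous` is the mapping from an earlier run; its IDs are reused.
    """
    previous = previous or {}
    uf = UnionFind()
    for left, right in edges:
        uf.union(left, right)

    # Identifiers that shared a One ID before stay in the same group.
    anchor = {}
    for identifier, one_id in previous.items():
        uf.union(anchor.setdefault(one_id, identifier), identifier)

    # Step 3: collect the connected components.
    groups = {}
    for identifier in list(uf.parent):
        groups.setdefault(uf.find(identifier), []).append(identifier)

    # Steps 4-6: reuse an old ID where one exists, otherwise issue a new one.
    next_id = max(previous.values(), default=0) + 1
    result = {}
    for members in groups.values():
        old_ids = {previous[m] for m in members if m in previous}
        if old_ids:
            one_id = min(old_ids)  # assumption: smallest old ID survives a merge
        else:
            one_id = next_id
            next_id += 1
        for member in members:
            result[member] = one_id
    return result


first = assign_one_ids([("cookie:a", "account:x"), ("device:b", "account:y")])
second = assign_one_ids(
    [("account:x", "phone:p"), ("cookie:c", "account:z")], previous=first
)
merged = assign_one_ids([("account:x", "account:y")], previous=first)
```

In the second run the new phone number joins the existing group and inherits its ID, while the unseen cookie and account get a fresh ID. The third run shows the merge rule: once an edge links the two old groups, all four identifiers end up under ID 1.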

Finally, generate a dictionary of ID mappings, which looks roughly like this:

In this way, the data of isolated systems is connected at the ID level, and we can do much more on top of the dictionary, such as drawing a more complete portrait of each user.
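As a rough illustration of how such a dictionary might be used, the sketch below keys it by (source system, source ID) and uses it to pull rows from three independent systems into one profile per One ID. The layout, the One ID strings, the sample rows and `build_profiles` are all hypothetical.

```python
from collections import defaultdict

# Hypothetical ID Mapping dictionary: (source system, source ID) -> One ID.
id_mapping = {
    ("mall", "account:laowang"): "oneid-0001",
    ("erp", "customer:E-88"): "oneid-0001",
    ("crm", "contact:C-17"): "oneid-0001",
    ("mall", "account:other"): "oneid-0002",
}

# Hypothetical rows from three systems that know nothing about each other.
rows = [
    {"source": "mall", "id": "account:laowang",
     "field": "last_viewed", "value": "goji berries"},
    {"source": "erp", "id": "customer:E-88", "field": "open_orders", "value": 1},
    {"source": "crm", "id": "contact:C-17", "field": "support_calls", "value": 2},
    {"source": "mall", "id": "account:other", "field": "last_viewed", "value": "tea"},
    {"source": "crm", "id": "contact:unknown", "field": "support_calls", "value": 5},
]


def build_profiles(rows, id_mapping):
    """Group rows by One ID; rows without a mapping are returned separately."""
    profiles = defaultdict(dict)
    unmatched = []
    for row in rows:
        one_id = id_mapping.get((row["source"], row["id"]))
        if one_id is None:
            unmatched.append(row)
            continue
        profiles[one_id][f'{row["source"]}.{row["field"]}'] = row["value"]
    return dict(profiles), unmatched


profiles, unmatched = build_profiles(rows, id_mapping)
```

Rows whose ID is not yet in the dictionary are kept aside rather than dropped, since a later mapping run may be able to place them.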

The data can also be stored. As for where, it is best to put it in a fast-query store such as ES and expose a One ID query service externally.

So that’s the core of ID Mapping. In real deployments you run into all kinds of problems. What do you do in many-to-many situations? What if records could not be matched earlier for lack of elements, but the user later adds information and they now match? What is the best way to store the resulting data, and where should it live? Should you build a DV model so the data is easy to find? These questions have to be worked out while building the project. It all comes down to practice.

Conclusion

The core value of One ID is to break down data silos and connect systems that were built in isolation at different times through a unified ID. One ID works like a bridge, joining silos of data into one.

Once the data silos are broken down, we can understand our users, products and businesses more fully, evaluate their value more accurately, discover further value, and lay a solid data foundation for refined operations.

The core technology of One ID is ID Mapping. Its principle is to abstract the key elements of each system into the “points” and “edges” used in graph computing. Graph algorithms make it easy to determine which elements belong to the same “object”, building an undirected connected graph for each and generating the ID Mapping dictionary. The dictionary is the bridge between data silos: it lets us join the data of the same “object” across different islands. In this way, we see the whole, not just a part.