Author: Yan Fei
Yan Fei, a big data veteran at Kyligence, has more than 15 years of experience in big data and data warehousing, with in-depth research and hands-on practice in construction planning, architecture design, technology stacks, methodology, and the products and solutions of the mainstream vendors.
“Appetizer”
Fifteen years ago, I had just started working and came home from Beijing for a visit.
One of my elders kindly asked me: “So you’re working now. What do you do?”
I hesitated for a long time and answered, “I work in IT.”
Without thinking, the elder replied: “Oh, so you sell CDs in Zhongguancun!”
I... I... I... (Uncle, you know too much.)
“Side dish”
Five years ago, seemingly overnight, the term “big data” became popular all over the country. When people asked what I did, I could simply answer with the buzzword: “I do big data!”
(Thanks to the growth of the mobile Internet, to the hype from the major IT vendors, to the government’s attention and planning, and to every kind of TV and AV.)
But last weekend, when I was talking to my mom on the phone, she asked me an insightful question: “I know you’re in big data, but what do you actually do with big data?”
I was at a loss for words and didn’t know where to begin. Similar questions often come up on Zhihu, from students about to look for jobs, from “big data” fans, and from people who want to work with data. Because I am lazy, I have never answered them properly.
Therefore, to contribute to the popularization of “big data”, to give people on the fringe a little more basic understanding of it, and to answer Mom’s question, I decided to write an article (and save on phone bills) introducing big data and the daily work of data people.
Although big data has become a well-known buzzword, many terms and concepts in the data field still confuse people. So I am going to start from cooking, a field that ordinary people have a basic understanding of and that Mom knows well.
“Main course”
As the saying goes, you can’t make bricks without straw. Cooking needs ingredients first, and big data is the same: without data, everything else is empty talk. Data is what data people live on (as long as there is data, I don’t have to eat).
Cooking usually involves four necessary steps: buying vegetables, washing them, prepping them, and stir-frying them. Whether you run a restaurant or cook three meals a day at home, the scale differs but the process is the same. These steps correspond to the daily work of data people: buying vegetables (data collection), washing vegetables (data cleaning), prepping vegetables (data modeling), and cooking vegetables (data processing).
1. Buying vegetables (data collection)
To buy groceries, you first decide where to shop. Once you get there, you walk around to see which ingredients to buy. When something catches your eye, you haggle, counter-offer, and pay. Meat, eggs, green vegetables: every ingredient goes through this same process, and once everything is bought you walk home.
For data people, we call this grocery shopping process data collection.
The vegetable market is what we usually call the data source.
There are many to choose from: supermarkets (less variety, good quality), farmers’ markets (more variety, mediocre quality), and open-air morning markets (anything goes, even wild game if you’re lucky).
Data sources are the same. Structured business and transaction data is stored in databases (the supermarket); large volumes of semi-structured log and machine data are generated by sensors and devices (the farmers’ market); and the Internet (the morning market) is chock full of unstructured data of uneven quality.
At the market we have to choose. I want every ingredient, but there is never enough money, so I can only buy selectively. This process is called data research: we screen which data is useful and which data can actually be used.
After walking around, we decide to buy pork, eggs and cucumbers. Picking, bargaining and weighing with the seller is a process called agreeing on the data interface specification.
Carrying the groceries home is a process called data transmission.
Data collection can be divided into several types, according to how and how often you shop:
- Meat keeps well, so you can buy a week’s worth at once. This is called full collection.
- Green vegetables must be fresh, so each time you buy only that day’s vegetables. This is called incremental collection.
- If you go grocery shopping every morning on a fixed schedule, it is called batch collection.
- If the seller delivers new produce to your door as soon as it arrives (a privilege of the rich), it is called streaming collection.
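To make the shopping habits above a little more concrete, here is a minimal Python sketch. Everything in it is hypothetical and made up for illustration: the `SOURCE` rows, the `updated` field, and the function names are assumptions, not any real tool’s API. It shows full collection, incremental collection driven by a “last seen” date, and streaming (flow) collection where rows are handled one by one as they arrive.

```python
from datetime import date

# Hypothetical "market stall": rows in a source system, each with an update date.
SOURCE = [
    {"item": "pork", "updated": date(2024, 1, 1)},
    {"item": "eggs", "updated": date(2024, 1, 2)},
    {"item": "cucumber", "updated": date(2024, 1, 3)},
]


def full_collect(source):
    """Full collection: copy every row, regardless of its age."""
    return list(source)


def incremental_collect(source, watermark):
    """Incremental collection: copy only rows updated after the watermark date."""
    return [row for row in source if row["updated"] > watermark]


def stream_collect(events):
    """Streaming collection: handle each row as soon as the source pushes it."""
    for row in events:
        yield row["item"]
```

Batch collection is not a separate function here: under these assumptions it would simply mean running `full_collect` or `incremental_collect` on a fixed schedule, such as once every morning.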
2. Washing vegetables (data cleaning)
Washing vegetables is easy to understand. Wherever the ingredients come from, they have some hygiene or quality problems. After buying them, you must wash and sort them before they can be eaten; otherwise the taste suffers at best and your health suffers at worst.
The same goes for data. After bringing it back, we have to check whether the data content is short of weight and whether there are rotten leaves among the data values. Otherwise, the reports and analyses that follow will all reach wrong conclusions.
Because data sources in the digital world are far more diverse and complex than real-life vegetable markets, data cleaning has to deal with far more problems than washing vegetables does. To solve and prevent the problems that arise as data is produced and used, the data field has split off a dedicated discipline called data governance. For example:
- To understand each vegetable market, we record the size, color, price and other characteristics of the pork, eggs, cucumbers and other ingredients at each market and each seller. This is called metadata management.
- After recording, we find that every seller describes these characteristics differently, so they cannot be compared at all. We therefore define uniform standards for the size, color and price of pork, eggs and cucumbers. This is called data standards management.
- After setting the standards, we have to inspect each market regularly to see if they are complying with the standards. This is called data quality management.
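As a rough illustration of the washing step, the sketch below checks rows against a tiny made-up standard. The field names (`item`, `weight_kg`, `price`) and the rules themselves are assumptions for this example only; real data quality rules are far richer.

```python
# Hypothetical standard: these fields are required, and numbers must be positive.
REQUIRED = ("item", "weight_kg", "price")


def is_clean(row):
    """Return True if the row has every required field and sane values."""
    if any(row.get(field) is None for field in REQUIRED):
        return False  # "short of weight": a required field is missing
    # "rotten leaves": values that are present but make no sense
    return row["weight_kg"] > 0 and row["price"] > 0


def clean(rows):
    """Split rows into those that pass the checks and those that are rejected."""
    good = [row for row in rows if is_clean(row)]
    bad = [row for row in rows if not is_clean(row)]
    return good, bad
```

Keeping the rejected rows instead of silently dropping them is a deliberate choice in this sketch: the rejected pile is what a regular quality inspection would review.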
3. Prepping the ingredients (data modeling)
Prepping means getting the ingredients for each dish ready in advance, according to what you are going to cook. For example, if we are going to stir-fry moo shu pork, we wash and cut the pork, eggs and cucumbers and put them together in a bowl. That way everything is at hand when cooking and there is no hunting for ingredients, which makes cooking much more efficient.
Home cooks may not follow this strictly, but for the restaurant industry it is an essential step. Think about it: ingredients are bought by the truckload, then washed and cut, and if there are no rules for where they go, efficiency cannot be guaranteed. If customers wait half a day for their food while the chef looks for ingredients, a high table turnover rate is out of the question and the restaurant had better close early. (Mom is a meticulous person with strong planning skills. Whether for a dinner party or three meals a day, she lays out the ingredients needed for each dish before cooking, which is how I learned this.)
In data engineering there is also a highly specialized, almost mythologized prepping process: the legendary data modeling. Data modeling means building a data storage model that re-plans, designs and arranges data from the various data sources according to certain business rules or application requirements.
This step may seem insignificant in cooking, sometimes even dispensable, but in data engineering, data modeling is a key step, so let me say a few more words.
Data is far more varied and complex than ingredients. A bank, for example, typically has more than 100 IT systems for business, process and internal management, which is more than 100 vegetable markets. Each market offers at least hundreds and often thousands of ingredients, so together that makes hundreds of thousands of ingredients, plus the even more complex external data sources. With so many types of ingredients under so many different standards, how to make later cooking more efficient and scientific is a far more complex and research-worthy problem than real-life prepping.
Because of this, a number of specialized modeling (prepping) methodologies have emerged over the history of data:
- For example, when ingredients are arranged by kind, each kind in its own pile, we call it paradigm (normal-form) modeling. If you run a hotpot restaurant or plan to eat hotpot, you will definitely prep with paradigm modeling.
- For example, when ingredients are arranged by dish (everything for stir-fried moo shu pork in one pile, everything for kung pao chicken in another), we call it dimensional modeling. If you are cooking home-style dishes, it is more reasonable to prep with dimensional modeling.
Each methodology has its own background, applicable scenarios, and proponents, so we won’t go into that here, so as not to start a war.
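The contrast between the two ways of arranging ingredients can be sketched loosely in Python. This only mirrors the kitchen analogy and is not a formal treatment of either methodology (real dimensional models are built from fact and dimension tables); all table names, keys and contents below are invented for illustration.

```python
# Paradigm (normal-form) style: each kind of thing is stored once, linked by keys.
ingredients = {1: "pork", 2: "eggs", 3: "cucumber", 4: "chicken", 5: "peanuts"}
dishes = {10: "moo shu pork", 20: "kung pao chicken"}
recipe_lines = [(10, 1), (10, 2), (10, 3), (20, 4), (20, 5)]  # (dish_id, ingredient_id)


def prep_by_dish(dishes, ingredients, recipe_lines):
    """Arrange by dish: join the tables into one ready-to-cook bowl per dish."""
    bowls = {name: [] for name in dishes.values()}
    for dish_id, ingredient_id in recipe_lines:
        bowls[dishes[dish_id]].append(ingredients[ingredient_id])
    return bowls
```

In this sketch the keyed tables avoid repeating any ingredient name, while the per-dish bowls repeat data but let the cook grab everything for one dish in a single lookup.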
Building on these methodologies, and through continual practice and research, some leading vendors have launched standardized data models for specific industries, known as industry data models. The business characteristics of each industry are different: banking, telecoms and retail have very different business models, and their data differs too. How to lay out the data and how to design the data model is therefore highly industry-specific, so each industry needs its own data model. Every trade has its specialty.
Did you not understand the above paragraph? That’s OK. To put it simply, an industry data model is a “restaurant start-up guide”.
For example, say you think Sichuan cuisine is very profitable and you want to open a Sichuan restaurant, but you are just an ordinary foodie who has eaten pork but never seen a pig run. What to do? No problem: I have this “Sichuan restaurant start-up guide”, and it covers everything:
- First, the guide tells you which Sichuan dishes are famous, popular and best-selling (such as boiled pork and maoxuewang), regularly updated and illustrated, so the menu is ready.
- Second, which ingredients each dish needs, and in what proportions, is already summarized in the guide, distilled from the experience of various famous chefs, so the recipes are ready too.
- Third, where each ingredient should sit in the kitchen so that chefs can work as efficiently as possible in the limited space is drawn out for you in detailed design drawings, so the kitchen layout is ready too.
- Finally, the guide tells you where to buy each ingredient and where it is cheapest, so even the supply chain is sorted out.
So all you have to do is find a storefront and hire two Lanxiang graduates, and you can open for business and make a fortune. What, finding a storefront is too much trouble? No problem, we can even provide the premises. Welcome to our franchise plan: we provide not only the guide but also the store, fully decorated, with pots and pans included. (The legendary all-in-one machine is about to make an appearance, but that’s a story for another time.)
Of course, if you don’t want to open a Sichuan restaurant, I also have Cantonese cuisine, Hunan cuisine, Shandong cuisine... In fact, I have start-up guides for all eight major cuisines.
(The above is a little exaggerated: a guide alone is not enough to open a restaurant, and a model alone is not enough to do big data. But a lot of the time, data models are just that.)
Now as an aside: because data modeling is so specialized and demands so much experience, the data industry has spawned a job dedicated to prepping, called “model designer”. The model designers of T company, a world-famous vendor, are sought after by every major headhunter and client, and T company has become a prime target for poaching.
4. Stir-frying (data processing)
I believe we are all familiar with cooking. If prepping is an art, cooking is definitely a craft. Chefs not only need to combine the ingredients well, they also need to use oil, salt, soy sauce, vinegar and other seasonings flexibly to make the food delicious. And since the doors are open for business, they must be able to respond to all kinds of customer demands, quickly and well.
Data processing is cooking. It is the process of calculating, summarizing and preparing all kinds of data, and it serves the final data applications and data consumers. Customer requirements are always varied, so data processing takes many forms depending on what data consumers need.
- Bosses are short on time and focused on the big picture, so they watch only the most important indicators and want them illustrated and easy to understand. This is like the emperor eating a full banquet every day: the dishes are fixed, but the food has to be delicious and served fast. So data chefs process the data in advance into dashboards, visualize the key indicators so they are clear at a glance, present the data in polished applications, and use various technical means to guarantee application performance (serving speed). Otherwise, if the emperor is hungry and the food is late, nobody can afford to take the blame.
- Officials are each in charge of their own area and face all kinds of routine work and emergencies every day. They need regular dishes for daily management and extra dishes for emergencies, and serving must not be slow; after all, as the saying goes, the official on the spot matters more than the distant magistrate. Therefore, following the buffet model, data chefs can process data into applications such as multidimensional analysis and self-service analysis. Based on experience and on the officials’ tastes, they lay out every dish that might be wanted, and when officials are hungry they can help themselves as needed, which warms both heart and stomach.
- Employees also have data requirements, but usually simple ones. The difficulty lies in the number of people and the volume of demand. Therefore, processing data into reports, the fast food of data applications, is the best way.
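As a small, hedged illustration of these three serving styles, the sketch below turns a handful of made-up sales rows into a single headline figure (dashboard style) and into totals grouped by a chosen dimension (the buffet of multidimensional analysis, which also doubles as a simple report). The data, field names and function names are all assumptions for this example.

```python
from collections import defaultdict

# Hypothetical cleaned sales rows (the prepped ingredients).
sales = [
    {"store": "A", "dish": "moo shu pork", "amount": 30},
    {"store": "A", "dish": "kung pao chicken", "amount": 25},
    {"store": "B", "dish": "moo shu pork", "amount": 40},
]


def total_by(rows, key):
    """Multidimensional analysis in miniature: sum amounts grouped by one dimension."""
    totals = defaultdict(int)
    for row in rows:
        totals[row[key]] += row["amount"]
    return dict(totals)


def dashboard(rows):
    """Dashboard-style headline: one key indicator computed in advance."""
    return {"total_sales": sum(row["amount"] for row in rows)}
```

Calling `total_by(sales, "store")` or `total_by(sales, "dish")` slices the same rows along different dimensions, which is the self-service idea in its simplest form.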
Besides meeting the data needs above, data processing also carries the responsibility for data innovation. It is likewise the chef’s job to introduce new dishes from time to time, to keep the restaurant competitive and the customers interested. In data circles, innovating through data has become a trend and a consensus, so roles such as data analyst and data scientist have begun to emerge.
Their job is to explore and discover new business opportunities by experimenting with various combinations of data (ingredients) and metrics (seasonings). Because there are so many ingredients and the possible proportions are endless, it is hard to exhaust every combination by hand. So, as mathematical theory and technology have developed, it has become possible to make new discoveries by using algorithms to combine ingredients and seasoning ratios automatically. This is the data mining and machine learning we so often hear about.
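To give a flavor of letting an algorithm try the combinations for you, here is a toy Python sketch that counts which pairs of items show up together in orders, a very basic data mining idea. It is only an illustration under made-up assumptions: the `orders` data and the `frequent_pairs` helper are hypothetical, and real data mining and machine learning go far beyond this.

```python
from collections import Counter
from itertools import combinations

# Hypothetical orders: each order is the set of items on one table.
orders = [
    {"moo shu pork", "rice", "soup"},
    {"moo shu pork", "rice"},
    {"kung pao chicken", "rice"},
]


def frequent_pairs(orders, min_count):
    """Count every pair of items ordered together; keep pairs seen at least min_count times."""
    counts = Counter()
    for order in orders:
        # Sort first so the same pair is always written in the same order.
        for pair in combinations(sorted(order), 2):
            counts[pair] += 1
    return {pair: n for pair, n in counts.items() if n >= min_count}
```

A pair that keeps showing up together might hint at a new set meal, which is the kind of discovery the chef could then test in the kitchen.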
“Dessert”
Writing is tiring. I have rambled on at length, yet I feel some points are not fully covered and some analogies are rather far-fetched. What matters most is getting the spirit of it; take in the big picture and you will get the idea.
My skills are modest and my ability limited. I offer this as a brick to attract jade, and welcome all opinions and discussion.
[When reproducing, please credit the author and source]