0x00 Preface
Today I was chatting with friends about how to show the breadth of my skills in interviews and when talking with others, and I found it quite interesting. There are two points to discuss: 1. in which areas technical breadth can be improved; 2. how to improve it. The first point can roughly be translated into a skill tree for data development engineers; it is not exactly the same thing, but you can think of it that way.
A data development engineer can actually end up doing a great many things, because wherever data is involved you will usually find one. Take recommendation systems: even when there are recommendation algorithm engineers, data development engineers are deeply involved in the final engineering implementation, and quite often there is no algorithm engineer at all, so the data engineer implements both the algorithm and the system. That requires data development to understand algorithms as well as system development. The same goes for operations: usually there is no dedicated ops team to build clusters and try out components for you, and even where there is one (we have one), whenever a new component needs to be evaluated it is still data development that handles the initial installation and operation, only handing it over once things are mature. So data development needs a fair amount of ops knowledge, at the very least solid Linux skills, and sometimes even the ability to rack machines in the server room (I have done this before). Then there is system development: many systems are highly customized. For OLAP systems, for example, it is hard to find a dedicated front-end engineer to help you early on, so you do it yourself. Some open-source combinations can be used directly, such as Kylin plus Saiku, but much of the time custom development is needed, which means data development also needs to know a lot about front-end and web development.
You do not necessarily have to master all of these things, but you should know about many of them. At the very least, when the boss mentions wanting a data visualization system, you should be able to think of simple presentation in Zeppelin, Superset, or Gephi; if a real system needs to be built, you can think of ECharts for chart display, or D3.js for something more specialized; and if the system's backing data source is ES, you can also use Kibana to draw the charts directly.
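If the system ends up being built in Python, for example, a chart can be produced with pyecharts, a Python wrapper around ECharts. This is only a quick sketch with made-up numbers, assuming pyecharts 1.x is installed; it is not from the original post.

```python
from pyecharts.charts import Bar

# Daily active users per channel (made-up numbers, purely illustrative).
chart = (
    Bar()
    .add_xaxis(["app", "web", "h5"])
    .add_yaxis("DAU", [12000, 8300, 4100])
)
chart.render("dau_by_channel.html")  # writes a standalone HTML page that renders with ECharts
```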
0x01 Skill tree of data development engineers
Enough rambling, let's get to the point. Below is a rough skill tree for data development engineers, which I have broken up into modules: big data components, development ability, data warehousing, algorithms, and other skills. Let's go through them one by one.
1. Big data components
There are a lot of big data components, far too many in the last couple of years, but in general a few representative systems stand out. On the storage side there are HDFS and BigTable; on the compute side, MapReduce and Spark lead the wave. Then there are some very important supporting components: Hive as a data warehouse engine, ES as a retrieval system, Kafka as a message queue, Flume for log collection, Sqoop for database synchronization, and various NoSQL stores.
What do we do with so many components?
First of all, you must understand HDFS thoroughly: be very familiar with its principles, ideally understand it in depth, and be able to solve all kinds of problems with it. Beyond the components you use day to day, also be familiar with the principles and usage scenarios of MapReduce and HBase, because even if MapReduce and HBase are not part of your job, HDFS certainly will be. (Google's three papers spawned three systems: HDFS, MapReduce, and HBase; HDFS and MapReduce belong to Hadoop.)
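To make the MapReduce model concrete, here is a minimal word-count sketch in plain Python that simulates the map, shuffle, and reduce phases locally; the toy input is invented for illustration and nothing here is the Hadoop API itself.

```python
# A local simulation of the MapReduce word-count flow:
# map emits (word, 1), shuffle groups pairs by key, reduce sums each group.
from collections import defaultdict

def map_phase(lines):
    for line in lines:
        for word in line.split():
            yield word, 1

def shuffle(pairs):
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    return {word: sum(counts) for word, counts in groups.items()}

if __name__ == "__main__":
    docs = ["hdfs stores blocks", "mapreduce counts words", "hdfs replicates blocks"]
    print(reduce_phase(shuffle(map_phase(docs))))
    # {'hdfs': 2, 'blocks': 2, 'stores': 1, ...}
```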
Then, for the various other components: if you encounter one at work, dig deep into one or two of them. If you do not encounter them, do three things: 1. understand the architecture; 2. understand the usage scenarios; 3. look at use cases from other companies.
Finally, you should still read the core source code of some components. There are too many of them, so pick the core components and read their core parts. For real-time processing frameworks such as Spark Streaming, Storm, and Flink, for example, you can consider reading only the representative one, Storm, and even then not all of it, just the most central design; the core design and code are still something a human being can get through.
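As a point of reference for what these frameworks actually do, here is a minimal Spark Streaming word-count sketch in Python, roughly following the shape of the example in the official Spark documentation; the socket source on localhost:9999 is just an assumption for illustration.

```python
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext("local[2]", "StreamingWordCount")
ssc = StreamingContext(sc, 1)  # one micro-batch per second

# Read lines from a local socket source and count words per batch.
lines = ssc.socketTextStream("localhost", 9999)
counts = (lines.flatMap(lambda line: line.split(" "))
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))
counts.pprint()

ssc.start()
ssc.awaitTermination()
```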
2. Development ability
Development ability breaks down into three parts: data cleaning, systems, and languages.
Data cleaning goes without saying; most data development work is writing Spark or MapReduce jobs or Hive scripts for cleaning. The point is to distill and generalize from that work: for example, how to handle data skew, how to guarantee that real-time cleaning neither loses nor duplicates a record, and how to deal with abnormal data.
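As one concrete example of the data skew problem, here is a sketch of the common key-salting trick in PySpark: a hot key is split by appending a random suffix, partially aggregated, and then merged back. The key names and the number of salt buckets are assumptions for illustration only.

```python
import random
from pyspark import SparkContext

sc = SparkContext("local[2]", "SaltedAggregation")

# Skewed input: one hot key dominates the data set.
pairs = sc.parallelize([("hot_key", 1)] * 1000 + [("cold_key", 1)] * 10)

NUM_SALTS = 8

# Spread the hot key across NUM_SALTS sub-keys before the first aggregation...
salted = pairs.map(lambda kv: (f"{kv[0]}#{random.randint(0, NUM_SALTS - 1)}", kv[1]))
partial = salted.reduceByKey(lambda a, b: a + b)

# ...then strip the salt and aggregate again to recover the true totals.
totals = (partial.map(lambda kv: (kv[0].rsplit("#", 1)[0], kv[1]))
                 .reduceByKey(lambda a, b: a + b))

print(totals.collect())  # [('hot_key', 1000), ('cold_key', 10)]
```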
Systems means building all kinds of systems. Data development usually cannot escape the reporting system; it is the most basic way to show value to the boss. For back-end systems, it is best to also understand how a metadata system and a scheduling system are designed. As for recommendation systems and advertising systems, even if you never get the chance to work on one, at least know their design ideas and understand the overall framework, so that if you were asked to design a recommendation system right now you could at least design a simple version.
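For what a "simple version" of a recommender might look like, one option is item-based co-occurrence counting: items that often appear in the same user's history are recommended together. This sketch and its toy data are purely illustrative, not anything from the original post.

```python
from collections import defaultdict
from itertools import combinations

# user -> items the user interacted with (toy data)
history = {
    "u1": ["a", "b", "c"],
    "u2": ["a", "c"],
    "u3": ["b", "c", "d"],
}

# Count how often two items co-occur in the same user's history.
co_occurrence = defaultdict(lambda: defaultdict(int))
for items in history.values():
    for x, y in combinations(set(items), 2):
        co_occurrence[x][y] += 1
        co_occurrence[y][x] += 1

def recommend(item, top_n=2):
    """Return the items that most often co-occur with the given item."""
    neighbors = co_occurrence[item]
    return sorted(neighbors, key=neighbors.get, reverse=True)[:top_n]

print(recommend("a"))  # ['c', 'b']
```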
As for languages, Java is a must, and you should go particularly deep with it. Python is well worth learning, because more and more systems support it, especially machine learning and deep learning, which are almost all Python. Scala and Go only need a passing acquaintance; learn them when you use them. SQL deserves a special mention: SQL can solve a lot of problems, so do not look down on it. A lot of logic can be handled easily with SQL, and even in Spark Streaming you can drop in SQL to solve the problem.
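As a small example of letting SQL do the work inside Spark, here is a sketch using Spark SQL from Python; the table name, columns, and data are made up for illustration.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("SqlExample").getOrCreate()

# A tiny in-memory table of page view events (illustrative data).
events = spark.createDataFrame(
    [("u1", "home"), ("u1", "detail"), ("u2", "home")],
    ["user_id", "page"],
)
events.createOrReplaceTempView("events")

# Plain SQL instead of chained DataFrame calls.
spark.sql("""
    SELECT page, COUNT(DISTINCT user_id) AS uv
    FROM events
    GROUP BY page
    ORDER BY uv DESC
""").show()
```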
3. Data warehouse
The data warehouse deserves special attention. Many big data developers do not know much about data warehousing and may still think it is something only traditional companies do, not Internet companies. In fact, the data warehouse is a complete theoretical system covering data cleaning, management, modeling, and presentation. I plan to share more on this later; I have already written several articles on data warehousing, which you can find among my earlier posts, such as: Rambling on Dimensional Modeling of Data Warehousing, and How to Elegantly Design Data Layering in a Big Data Scenario.
I want to write about data warehousing in detail separately, so I will not go deep here. What matters is that you understand what dimensional modeling, OLAP, and data marts are for, why there are wide tables, dimension tables, and fact tables, and why data is organized into layers.
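To make the fact table and dimension table idea concrete, here is a toy star-schema query using Python's built-in sqlite3; the table names, columns, and numbers are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Dimension table: one row per product, descriptive attributes only.
cur.execute("CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, category TEXT)")
cur.executemany("INSERT INTO dim_product VALUES (?, ?)", [(1, "book"), (2, "phone")])

# Fact table: one row per order line, a foreign key plus measures.
cur.execute("CREATE TABLE fact_order (product_id INTEGER, amount REAL)")
cur.executemany("INSERT INTO fact_order VALUES (?, ?)", [(1, 30.0), (2, 999.0), (1, 25.0)])

# A typical star-schema query: join the fact to the dimension, aggregate a measure.
for row in cur.execute("""
        SELECT d.category, SUM(f.amount) AS revenue
        FROM fact_order f
        JOIN dim_product d ON f.product_id = d.product_id
        GROUP BY d.category"""):
    print(row)  # ('book', 55.0), ('phone', 999.0)
```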
You should also have a decent understanding of user profiling and feature engineering; feature engineering in particular is something people outside the field know relatively little about.
4. Algorithms
Algorithms are a good thing. When algorithms come up, people mostly think of classical algorithms and machine learning algorithms, and more recently deep learning algorithms.
I will not say much about classical algorithms; they are a basic skill. Graph theory and trees in particular are worth a good look, they are definitely useful: scheduling system design depends on graph theory, social relationship algorithms depend on graph theory, a file system's directory tree is a tree, and plenty of data mining algorithms are tree-related as well.
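As one example of graph theory showing up in a scheduling system, here is a sketch of topological sorting (Kahn's algorithm) to order jobs by their dependencies; the job names are made up for illustration.

```python
from collections import defaultdict, deque

def topological_order(dependencies):
    """dependencies: {job: [jobs it depends on]} -> a valid run order."""
    dependents = defaultdict(list)
    in_degree = {job: len(deps) for job, deps in dependencies.items()}
    for job, deps in dependencies.items():
        for dep in deps:
            dependents[dep].append(job)
            in_degree.setdefault(dep, 0)

    ready = deque(job for job, degree in in_degree.items() if degree == 0)
    order = []
    while ready:
        job = ready.popleft()
        order.append(job)
        for nxt in dependents[job]:
            in_degree[nxt] -= 1
            if in_degree[nxt] == 0:
                ready.append(nxt)
    if len(order) != len(in_degree):
        raise ValueError("cycle detected in the job graph")
    return order

# extract must run before clean, and clean before the report job.
print(topological_order({"clean": ["extract"], "report": ["clean"], "extract": []}))
# ['extract', 'clean', 'report']
```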
It is best to understand some data mining algorithms. Even if you do not do algorithm research, implementing algorithms is still very necessary, otherwise all the credit goes to the algorithm engineers. And as big data developers, we should take on the design and optimization of data mining algorithms in distributed systems, which is where the real technical content lies. Here the goal is to understand the general principle of each algorithm and its usage scenarios, and ideally to be able to implement one yourself when the job calls for it.
Deep learning algorithms I will leave aside for now.
Then there are big data algorithms. What is a big data algorithm? It is hard to say precisely, but you cannot simply lump all the data mining algorithms in here, because the problem data mining algorithms solve is extracting value from data, not large-scale data processing. A big data algorithm can be understood as an algorithm designed for processing massive data. Of the ones I have come across, HyperLogLog, a cardinality estimation algorithm, is one of them; it is used in Redis and Druid. There are also the external-memory algorithms and sublinear algorithms described in the book Big Data Algorithms. I have not studied this class of algorithms deeply yet, and will keep learning as I write.
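For instance, Redis exposes HyperLogLog through the PFADD and PFCOUNT commands. Here is a small sketch using redis-py, assuming a Redis server is running locally; the key name and user IDs are illustrative.

```python
import redis

r = redis.Redis(host="localhost", port=6379)

# PFADD feeds elements into a HyperLogLog; memory stays roughly constant
# (around 12 KB per key) no matter how many distinct elements are added.
for user in ["u1", "u2", "u3", "u1", "u2"]:
    r.pfadd("uv:2017-09-04", user)

# PFCOUNT returns an *estimate* of the number of distinct elements.
print(r.pfcount("uv:2017-09-04"))  # ~3
```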
5. Other
Beyond the algorithms there are other things, not strictly necessary, but in my view important.
First, Linux. I feel the need for this is relatively high, and it was mentioned earlier in this article: most big data engineers should have it, the only question is depth. Those who work closer to the bottom of the stack should understand Linux more deeply; for a while, for example, I was regularly installing systems and swapping hard disks.
Then the crawler. I have written two earlier articles on crawlers, and I have always felt that crawling is a basic skill for data engineers, for no other reason than that it is fun: once you can get the data, you can do whatever you want with it. Of course, dedicated crawler engineers have to go much deeper.
Cloud computing is also worth knowing about. At first I always wondered whether platforms like Hadoop and Spark should count as cloud computing; it does not really matter what they are called, just think of them as related but different. Cloud computing is mainly divided into three layers: IaaS, PaaS, and SaaS, and after container technology became popular, CaaS, containers as a service, also appeared. As the cloud market and big data technology mature, Microsoft Azure, Amazon AWS, and Alibaba Cloud are embedding big data components such as Hadoop and Spark into their cloud products. So my understanding is that many of the big data components we use can be considered PaaS, that is, platform as a service.
0x02 How to improve technical breadth
The earlier sections ran a bit long, so I will keep this part short and continue it later when I have time.
Personally, I think there are a few points to note about improving technical breadth:
- Interest matters most; without it, it is hard to find the motivation to learn so many things.
- Attend talks and chat with people; knowing your own gaps gives you a useful sense of crisis.
- Build your own fun little projects, and use new technology in them as often as possible.
To be more concrete, I am currently working on a small series: crawl some data myself, run the PageRank algorithm and the LPA algorithm on it, play with it in various databases, turn these algorithms into MapReduce and Spark versions, and then use D3.js to draw dot graphs. After that I plan to crawl some text data, do some NLP processing, try some information retrieval work, and run various data mining algorithms on it, such as LDA for topic extraction. A whole series of things, in short. Along the way I will mix in my understanding of the algorithms, personal ideas, scheme designs, and some code implementations, and occasionally try to share the design of the systems involved.
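For reference, here is a minimal PageRank sketch in plain Python using power iteration on a toy link graph; the graph, iteration count, and damping factor of 0.85 are the usual illustrative choices, not anything specific from this series.

```python
def pagerank(links, damping=0.85, iterations=50):
    """links: {page: [pages it links to]} -> {page: rank score}."""
    pages = list(links)
    n = len(pages)
    ranks = {p: 1.0 / n for p in pages}

    for _ in range(iterations):
        new_ranks = {p: (1.0 - damping) / n for p in pages}
        for page, outgoing in links.items():
            if not outgoing:  # dangling page: spread its rank evenly
                share = damping * ranks[page] / n
                for p in pages:
                    new_ranks[p] += share
            else:
                share = damping * ranks[page] / len(outgoing)
                for target in outgoing:
                    new_ranks[target] += share
        ranks = new_ranks
    return ranks

print(pagerank({"a": ["b", "c"], "b": ["c"], "c": ["a"]}))
```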
Many of the things mentioned above are things I have touched before but forgotten after being away from them for a long time; others I never got familiar with. I will catch up on them in this round; every bit learned is a bit gained. I feel this is also a way of expanding technical breadth.
0xFF Written at the end
This is all just my own take, written for my own amusement, half joke and half idle talk. There are many things I cannot do myself yet, but I still hope to master them early, so that I have something to talk about when I go out.
If you have thoughts after reading, feel free to exchange ideas; please go easy on the criticism.
- The author: Mudong Koshi
- Links to this article: www.mdjs.info/2017/09/04/…
- Copyright notice: All articles on this blog are licensed under CC BY-NC-SA 3.0 unless otherwise stated. Please credit the source when reposting!