Abstract: Starting from horsepower as a standard of power measurement, this paper introduces TPCBB, a computing-power measurement standard in the field of big data, and MaxCompute 2.0's excellent performance on Big Bench. It also shares the latest developments of MaxCompute 2.0 in detail. In addition, on the topic of data security, which is of great concern to cloud users, Alibaba ensures data security and enables safe data exchange and sharing through logical isolation, resource isolation, and operational isolation mechanisms.
Speaker introduction:
Yunlang, MaxCompute Senior Product Specialist
The following content is compiled from the speaker's video presentation and PPT.
PPT material download address: click.aliyun.com/m/100000306…
Video address: edu.aliyun.com/lesson_1010…
Product address: click.aliyun.com/m/100000306…
This sharing mainly covers the following four aspects:
I. Background introduction
II. TPCBB
III. MaxCompute 2.0 evolution
IV. Data security
I. Background introduction
I have been watching some documentaries recently and was greatly inspired by them. They retell human history from many points of view, and in the process I found that many threads in human history are interlinked. It reminds me of a saying attributed to Lu Xun, the famous Chinese scholar: "History is the first step in studying science." By studying history we can spark more ideas at a deeper level, even rising to the level of philosophy or culture. So big data itself is not intimidating; what is intimidating is big data combined with culture.
Have you seen the documentary The Great History of Mankind? It runs for dozens of episodes, one of which is about the revolution of the horse: what impact did the horse have on human history? When I was young, one four-character phrase left a deep impression on me: "Hu dress and mounted archery." Looking at the picture below, it is easy to think of Genghis Khan's iron cavalry. We are also familiar with Western civilization: when we talk about Rome and see white robes, we know the aristocracy usually wore very white robes. This documentary shows how ancient Roman dress changed, as in the picture below, gradually evolving from everyone wearing robes to three people wearing trousers. So the first effect is that humans put on trousers because of horses. Second, whether for the Roman Empire in the West or Genghis Khan's empire in the East, the horse played a very important role in creating those empires. The horse decided the extent of territory, so it has had a very important influence on us.
Of all the 66 million species on earth, why is the horse our most important friend? With so many choices, why did humans ultimately choose the horse? Many insects and birds fly in the sky; could they be trained? The problem is that they are too small to carry the weight of a human being. Second, could lions and tigers be trained? First of all, they are very difficult to domesticate; besides, they are carnivores and eat food more expensive than a human's. Third, elephants can perform and are very smart, so why not elephants? Because they are too big, and it takes a very, very long time to tame them. In the end, only the horse combines the qualities we want: first, strength; second, temperament; and third, most important of all, speed. These are the reasons why the horse became man's most important friend.
In the earliest days, the train was called the "iron horse." When the car was first invented, it was not called a car; it was called a horseless carriage. Today, for trains, planes, cars, and so on, the standard measure of power is still called "horsepower": 1 metric horsepower = 0.735 kilowatts (kW). Through human history, we can see why horsepower came into being and why the horse is our friend.
II. TPCBB
So in big data, how do we define weights and measures? As a computing platform, how do we evaluate computing power?
In recent years we have run many big data benchmarks with international standardization organizations, including the Sort Benchmark, TPC-H, TPC-DS, and, last year, TPCBB (Big Bench). Last year we concluded that TPCBB is the standard measure of big data computing power. Big data must account for the relevant data types: semi-structured, structured, and unstructured data all appear in its scenarios. In the analysis scenarios it includes common modeling, such as static modeling, as well as data analysis and mining. It also includes reporting, because SQL is written differently for reporting than for static analysis. These are the three basic scenarios. As for query types, it covers not only SQL but also machine learning, natural language processing through standard languages, and stream computing. For big data, the whole range of data processing, operations, and technology is covered from the aspects above.
This set of 30 test cases completely covers the different data types, technologies, and query types. Everyone can download the 30 test sets and see how TPCBB is done. TPCBB was preceded by TPC-DS, which built on the foundations of many OLTP-oriented scenario tests and extended benchmark coverage to OLAP, the online analysis scenario. TPCBB then extends further from OLAP to all the features of big data, such as log files, machine learning, and so on. These 30 test scenarios completely cover all our demands on big data computing. It is very important to represent real scenarios as much as possible, so that the result is closer to actual performance in production. In many cases data at PB and EB magnitude is common; the official website currently offers 1 TB, 3 TB, 10 TB, and 30 TB data sets, with 100 TB as the largest specification.
Last year Alibaba Cloud held the Computing Conference in Beijing and was inspired to ask whether MaxCompute, like a mobile phone, could also run a benchmark score. A live run was made last year on the Beijing and Shenzhen clusters, and Beijing's result was very good; the figure below shows the result from three dimensions. The data set was 100 TB, and the first test used this largest scale. The second dimension is how many tasks can be run per minute, which reached the world's highest at 8,200 QPM. The third is cost, reaching the lowest at $354.7/QPM. So MaxCompute is a comprehensive breakthrough in data capacity, performance, and cost effectiveness. We have found the measurement standard for big data, and at the same time we keep optimizing, hoping for more breakthroughs.
III. MaxCompute 2.0 evolution
What have we done behind this breakthrough in computing power? MaxCompute has a ten-year history at Alibaba, and it has not stood still. From the very beginning it served the internal business of Alibaba Group, and then it replaced Hadoop to become the real computing platform and data infrastructure. At Alibaba, 99% of stored data and 95% of computation run on MaxCompute, so let's see how it evolved. Deep use of MaxCompute brings optimization in performance, cost, full scans, and so on. This includes the data organization structure and ORC compatibility, because, as is well known, different data organization structures and ORC strike different balances between compression ratio and storage performance. In this respect we pick the best way to organize the data, the advantage shows up quickly, and the end result is that MaxCompute is cheaper. In addition, we have fully optimized and upgraded the language level: NewSQL optimization and coverage at the optimizer and language levels. It has been nearly a year since 2.0 was launched, and the process was very long; it was not until May that the original 2.0 trial switch was turned off and it was opened across the whole network.
Here are the job types supported on MaxCompute. Last year we started thinking about how to integrate the ecosystem: with only one copy of the data, how do we support more task types, without moving the data and without extra copies? Data is shared at the project level, and on top of it we integrate a federated computing platform to support more job types. What can be revealed in advance is that we will support real-time interactive analysis jobs, called "Lightning": when the amount of data in a table is relatively small, results come back in seconds. Second, we will support Spark job types; Spark can directly access MaxCompute tables without importing/exporting or moving data. In this way we achieve the goal of the federated computing platform: on the basis of unified data, support more job types and serve more computing scenarios.
Optimization in SQL includes compiler optimization, support for complex data types, and so on. The overall work focuses on improving usability and development efficiency, improving compatibility, and reducing migration costs; these optimizations are made from the developer's perspective.
SQL is blessed with being declarative: what you write everyone can understand, and problems are solved without digging through code, but some flexibility is lost in the process. So we combine the two in the form of functions: Java functions and Python functions. That is, SQL plus Function is the combination that keeps declarations simple while supporting complex business. On the function side, MaxCompute 2.0 also introduced more built-in functions to make complex business logic more convenient.
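As a sketch of that SQL-plus-Function combination, below is the shape of the logic a Python UDF might carry, shown as a plain, locally testable class. (In MaxCompute itself such a class would additionally be decorated with `@annotate(...)` from `odps.udf`, uploaded as a resource, and registered with `CREATE FUNCTION`; those steps are omitted here, and the function name and logic are illustrative, not from the talk.)

```python
# Illustrative UDF-style logic: extract the host name from a URL column.
# In MaxCompute this would be invoked from SQL, e.g.
#     SELECT my_domain(url) FROM access_log;
# Here it is a plain class so the row-at-a-time logic can be tested locally.
class DomainUDF(object):
    def evaluate(self, url):
        if url is None:          # SQL NULL arrives as None
            return None
        if "://" in url:         # strip the scheme if present
            url = url.split("://", 1)[1]
        # keep only the host part, dropping path and query string
        return url.split("/", 1)[0].split("?", 1)[0]

print(DomainUDF().evaluate("https://www.aliyun.com/product?x=1"))  # → www.aliyun.com
```

The declarative query stays readable, while the per-row transformation that would be awkward in pure SQL lives in the function.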
So how do we deal with unstructured data? On Alibaba Cloud, unstructured data is stored in OSS, which holds files, pictures, videos, and so on, while semi-structured data is stored in TableStore. These are the two most important data sources on Alibaba Cloud. Here we define external tables, with OSS as the unstructured data source and TableStore as the semi-structured data source, so you can directly SELECT over unstructured and semi-structured data. Because OSS and TableStore are both accessed through APIs, providing them as new data sources further reduces the cost of operating on unstructured data and improves development convenience. External tables thus effectively extend MaxCompute's computing scenarios and the scope of its data.
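For reference, here is a sketch of what such an external-table definition looks like, following the pattern in MaxCompute's public documentation for reading CSV files from OSS; the table name, columns, and path placeholders are made up for illustration. The DDL is held in a string so its shape is easy to inspect; in practice it would be submitted through a MaxCompute client.

```python
# Illustrative external-table DDL over OSS. Once created, the table can be
# queried with ordinary SELECT statements, without copying data into MaxCompute.
EXTERNAL_TABLE_DDL = """
CREATE EXTERNAL TABLE IF NOT EXISTS access_log_ext (
    ip     STRING,
    url    STRING,
    status BIGINT
)
STORED BY 'com.aliyun.odps.CsvStorageHandler'  -- built-in CSV handler
LOCATION 'oss://<endpoint>/<bucket>/access_log/';
"""

print(EXTERNAL_TABLE_DDL)
```

The storage handler tells MaxCompute how to parse the raw files; other handlers (or custom ones) cover other formats.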
Performance has always been Alibaba's ultimate pursuit, as can be seen from our continuous efforts every year to optimize it. MaxCompute 2.0 has doubled performance. When jobs start running at 12 o'clock at night, we don't want them to finish at 9 o'clock the next morning; if performance doubles, the first, second, and third tables finish at 5 or 6 o'clock in the morning, and there is less chance of being caught out by business colleagues. So performance is very, very critical for offline jobs. There is also improved scheduling: with tens of thousands of jobs accumulating every night, reducing that load is also a very important aspect of performance.
Because Alibaba builds the computing engine itself, Studio brings many new features on the tool side. Studio is a custom-built IDE for MaxCompute based on the IntelliJ platform, and it iterates very quickly.
There is also the command-line client, which many administrators appreciate as a very efficient "black screen" for running commands, granting authorization, and managing data. We keep releasing new versions of the command line, and different roles can use different tools: developers can use Studio, while administrators are recommended to use the client for daily management.
Logview is a shared service that really impresses users: they finally no longer have to hunt for a distributed task across piles of log information. After a task is submitted, we push a link to the user; opening it shows all the relevant information with the complete context, including common performance issues such as full scans and data cleaning, presented through a DAG diagram, from which the task can be further analyzed, located, and diagnosed. It is often the case that a submitted task is very large, requiring thousands of cores at the same time, and if one of them runs slowly the whole job is slowed down; we can use Logview to tune and diagnose this.
We introduced PyODPS, with which you can orchestrate very complex computations in a very simple development style. In addition, the Python SDK (PyODPS) integrates easily with the Pandas DataFrame, which leads to high development efficiency.
In addition, there is R. We believe R deserves better promotion in China, so we support RODPS.
There is also JDBC, which you may not use much at the moment, since most of you access MaxCompute through the SDK; JDBC can be used to integrate further with third parties.
In addition, we have open-sourced many projects for ETL tool integration, including the Flume plug-in, OGG plug-in, Sqoop, Kettle plug-in, and Hive Data Transfer. You can integrate ETL tools through the link below.
At Internet companies, including when I first joined Alibaba, people were obviously reluctant to write documentation. But after communicating with customers, I found that documentation is what enables long-term communication when they cannot see the code. We do a lot of work on this as a product, such as the MaxCompute "sunflower manual" in the cloud community, and we focus not only on technology but also on tools, documentation, and distribution.
IV. Data security
Data security is the top priority; without security there can be no cloud computing. From a product manager's point of view this always comes first: any hint of a security problem must be the first thing solved, because data security is a matter of life and death. On April 27th, Hadoop YARN was hit by a security vulnerability. Alibaba has a very powerful security department called Cloud Shield, which gave a lot of security advice, including a note that Alibaba Cloud MaxCompute is immune to the Hadoop vulnerability. In all the time MaxCompute has run on the public cloud, we have had zero data-security incidents.
How do we achieve security? Here is our data center, one per region. First, at the logical level, MaxCompute is a standard serverless cloud service, isolated by tenant within the cluster. On top of that, logical isolation is applied: the largest unit of isolation is the project, within which finer-grained authorization is controlled down to tables and columns. So the logical layer is isolated very finely to ensure the model is adequate, rather than simply exposing the file system. But even when data at rest cannot be taken or stolen, once a job runs the data goes into memory and the CPU, so how is it isolated then? Resource isolation is also required: each UDF is deployed into an independent resource pool and runs in an isolated resource environment, ensuring the security of resources and runtime memory. Overall, cloud security is not just about multi-tenancy; it needs full isolation at the logical, resource, and operational levels to ensure each tenant is secure and reliable while sharing.
In addition, data security alone is not enough; security and sharing must go together. Through these isolation mechanisms we not only achieve security but also enable safe data exchange and sharing. That is what we want: not an absolute safety in which nothing ever happens and the data becomes a stagnant pool.
We also recently worked with Forrester, which reviews cloud services. Our MaxCompute, DataWorks, and Alibaba Cloud on-cloud data warehousing solutions were rated in the first quadrant, and we were very honored to rank number 2: first AWS, second Alibaba, third Google, fourth Microsoft, and many more. The process also reflects how they understand the current Cloud Data Warehouse (CDW), which they divide into three categories: the first is the standard multi-tenant CDW, the second is the independent, exclusive CDW, and the third is the hosted CDW. These three are very different. The first category includes shared clusters such as MaxCompute, and BigQuery, which is serverless. In the exclusive model, a whole slice is handed over completely: all resources are locked and cannot be shared, with only a logical division of resources. The hosted model treats virtual machines as physical resources, at that level.
In addition, a comprehensive judgment is made from different perspectives, including self-service, elasticity, automatic upgrades, data loading and unloading capacity, hybrid cloud, data recovery, and so on.
Back to the beginning: as the saying goes, "a long road tests a horse's strength." Everyone has the dream of a data empire, and how far you can expand the territory determines the radius of the empire, so choosing the right horse is very important. We really want MaxCompute to be the "swift horse" of the big data journey. MaxCompute is still very young; going back to the beginning, we want to be more open, have a better ecosystem, be faster and cheaper, simpler and easier to use, and more stable, able to travel a thousand li by day and eight hundred by night. From horsepower to computing power, we can think further. We also hope you choose your favorite big data "horse," so that everyone can run further with big data and build a stronger empire.
The original link