This article was originally published by AI Frontier. Original link: t.cn/RTkgXrG
By Peter Bailis et al.
Translator | Debra
Editor | Emily
“It’s a project to spread machine learning practice, but no background in AI is required. It’s called DAWN, which stands for next-generation data analytics. It comes from the team behind Spark, Mesos, DeepDive, HogWild!, and more. I can imagine it being another successful project.”
As ML technology evolves, more and more organizations are using it in production to improve efficiency. In practice, however, this “high-end” technology has been available only to organizations with sufficient funding and large technical teams. To dramatically simplify the process of building AI applications and popularize AI technology so that non-ML experts can use it for the benefit of society, Stanford University has launched a five-year project called DAWN (Data Analytics for What’s Next).
DAWN, an industrial affiliates program at Stanford University, was launched in 2017 with unrestricted funding of about $500,000 a year from several companies, including Intel, Microsoft, Teradata, and VMware. The project also receives support from government agencies such as the NSF and DARPA (the Defense Advanced Research Projects Agency).
According to the team, its past research (Spark, Mesos, DeepDive, HogWild!, etc.) is already serving Silicon Valley and the wider world, in applications ranging from combating human trafficking to assisting cancer diagnosis and high-throughput genome sequencing. The next step is to make such AI and data product development tools more efficient and accessible across the entire workflow, from training set creation and model design to monitoring and efficient execution.
In a nutshell, the problems DAWN is trying to solve are:
How can a domain expert, without a PhD in machine learning, who is not a systems expert and does not understand the latest hardware, build their own high-quality ML product?
The following is an introduction to the project’s paper on infrastructure for machine learning practice:
Background:
Despite recent incredible advances in machine learning, building ML applications remains prohibitively time-consuming and expensive for all but the best-trained and best-funded engineering organizations. The high cost of applying machine learning is not due to a need for new statistical models, but rather to a lack of systems and tools that support end-to-end ML application development, from data preparation and labeling to productionization and monitoring.
DAWN Project members
Introduction and objectives of DAWN Project
We are in a golden age of machine learning and AI development. Constant advances in algorithms, coupled with vast available datasets and fast parallel computing, have turned scenarios that would have been science fiction only a few years ago into reality. In the past five years, voice-based personal assistants have become ubiquitous, image recognition systems have become comparable to humans, and self-driving cars are rapidly becoming a reality. There is no doubt that machine learning will transform large parts of our economy and society. Companies, governments, and science labs are eager to see how machine learning can solve their practical problems.
However, while new machine learning (ML) applications have made impressive progress, they are expensive to build. Every major ML product, like Apple’s Siri, Amazon’s Alexa, and Tesla’s self-driving cars, needs the backing of a large and expensive team of domain experts, data scientists, data engineers, and DevOps. Even in organizations that have successfully adopted ML, ML remains a rare commodity within the reach of a small group of team members. In addition, many ML models require large amounts of training data, and obtaining such data is extremely challenging in many application domains. For example, while ML algorithms can accurately identify an image of a dog (because millions of tagged images are available online), they cannot achieve the same accuracy in identifying tumors in medical images unless an organization has human experts spend years labeling training data. Finally, once an ML product is built, it requires a great deal of deployment, operations, and monitoring, especially when it becomes a mainstay of critical business processes. To sum up, ML technology today is at a stage similar to early digital computers, when a handful of technicians in white lab coats kept a small number of machines running properly: ML clearly has enormous potential, but today, ML-based applications are too expensive to build for most domains.
The goal of DAWN is not to improve ML algorithms, which for many important applications are already good enough, but to make ML usable, so that small teams of non-ML experts can apply ML to their problems, achieve high-quality results, and deploy production systems for their particular applications. The question is: how can everyone with domain expertise build their own high-quality data products (without a team of PhDs in machine learning, big data, or distributed systems, and without understanding the latest hardware)? In short, make AI usable for everyone.
DAWN and research on systems for ML practice
To maximize the opportunity presented by our project goals, and to build on lessons from earlier work on large analytics systems such as Apache Spark [27], Apache Mesos [11], Delite [5], and DeepDive [6], we will spend the next five years researching and building tools that address end-to-end problems in machine learning practice. By combining research from algorithms to systems hardware, and by working closely with partners in data-intensive fields, we plan to pursue DAWN’s goals in phases across the entire ML lifecycle.
Our design philosophy in the DAWN stack is based on three principles:
A) Target end-to-end ML workflows. ML application development involves much more than model training. The biggest challenges in developing new ML applications today are not in model training, but in data preparation, feature selection/extraction, and productionization (serving, monitoring, debugging, etc.). Systems should therefore target the entire end-to-end ML workflow.
B) Empower domain experts. The most impactful ML applications will be developed by domain experts, not ML experts. However, few systems today let domain experts encode their knowledge without writing low-level ML code. Future systems should empower users who are not ML specialists by providing them with tools for labor-intensive tasks such as labeling, feature engineering, and data augmentation.
C) Optimize end-to-end. Execution speed matters in model training, because faster training enables better modeling (for example, through more input data or wider parameter searches). Speed also matters for deployed production services, and it is what makes these models cost-effective in practice.
Figure 1: The DAWN stack for machine learning practice. In the Stanford DAWN project, we are building a research stack of software and tools across each phase of the ML lifecycle, with abstractions from new hardware to new interfaces, to emphasize the importance of end-to-end systems for ML practice. We believe this combined end-to-end, interface-to-hardware approach is necessary to fully realize the potential of ML in practice.
However, today’s ML tools typically perform 10-100x below hardware limits, so expensive software engineering is needed to stand up production systems. Our early results show that by building tools that optimize the end-to-end ML workflow and exploit statistical properties of the algorithms, such as their tolerance for inexact execution, ML applications can run 10-100x faster on current and emerging hardware.
In summary, we believe that systems addressing the needs of non-expert users across the whole application development process, combined with end-to-end optimization of software and hardware, are critical to realizing the full potential of ML in practice.
DAWN’s research direction
To embody these principles, we are conducting research along several directions. Below is an overview of each direction, with citations to early results:
New interfaces for ML.
To empower domain experts who are not ML experts, we need to develop new interfaces to ML technology, spanning model specification, model monitoring, and more:
A) Simplifying model specification by observing ML (data preparation, feature engineering): Can we build systems that learn high-quality ML models simply by observing domain experts? When labeling data, for example, domain experts often apply a set of heuristic rules to determine the label of a particular data point (for instance, if the phrase “preheat the oven” appears repeatedly in a document, the document is probably about cooking). By providing a simple interface that lets users express their understanding of the data as rules (e.g., regular expressions), we can combine a small number of such rules and apply them to massive datasets. We use unsupervised ML to de-noise the rules and learn their accuracies, then train supervised ML models on the resulting probabilistic labels, establishing a new paradigm we call data programming [18]. We have had promising early results with Snorkel [17], a new system that generates high-quality models from low-quality rules. We are also developing new lines of research that use weakly supervised ML to improve model quality without manual user effort, such as feature discovery [24,25] and structure learning [2]. A sketch of the data programming idea follows.
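To make the idea concrete, here is a minimal Python sketch of data programming. It is not Snorkel’s actual API: the labeling functions, label values, and the simple vote-ratio combiner (standing in for the learned generative model) are all illustrative assumptions.

```python
import re

ABSTAIN, COOKING, OTHER = -1, 1, 0

# Each "labeling function" encodes one heuristic rule a domain
# expert might write; a rule may abstain on any given document.
def lf_preheat(doc):
    return COOKING if re.search(r"preheat (the )?oven", doc, re.I) else ABSTAIN

def lf_tablespoon(doc):
    return COOKING if "tablespoon" in doc.lower() else ABSTAIN

def lf_stock_ticker(doc):
    return OTHER if re.search(r"\bNASDAQ\b|\bNYSE\b", doc) else ABSTAIN

LABELING_FUNCTIONS = [lf_preheat, lf_tablespoon, lf_stock_ticker]

def probabilistic_label(doc):
    """Combine noisy rule votes into a probabilistic label.

    A real data-programming system learns each rule's accuracy with
    an unsupervised generative model; a plain vote ratio is used
    here as a stand-in.
    """
    votes = [v for v in (lf(doc) for lf in LABELING_FUNCTIONS)
             if v != ABSTAIN]
    if not votes:
        return None  # no rule fired; leave the document unlabeled
    return sum(v == COOKING for v in votes) / len(votes)

docs = [
    "Preheat the oven to 350F and add one tablespoon of butter.",
    "Shares rose 3% on the NASDAQ after strong earnings.",
]
for d in docs:
    print(probabilistic_label(d))  # -> 1.0, then 0.0
```

The probabilistic labels produced this way would then serve as the training set for an ordinary supervised model.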
B) Explaining results to humans (feature engineering, productionization): How do we explain the results of a particular ML model to humans? As ML models are applied to ever more business-critical applications, it will be critical to explain their classification decisions in human-understandable terms. The challenge is that large, complex models produce highly accurate results, but those results are extremely difficult to interpret. One promising approach exploits the fact that ML predictions do not happen in a “vacuum”: each user has dozens to hundreds of attributes that can be used to segment, correlate, and contextualize predictions (for example, users running version v47 of the software may be abnormally likely to be flagged as spammers). We have especially promising preliminary results based on basic correlation analysis [4], and we plan to extend this capability to other domains, including textual, visual, and time series data [21]. The sketch below illustrates the correlation idea.
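As a rough illustration of attribute-based correlation analysis (this is not the method of [4]; the data, the attribute name, and the simple risk-ratio score are hypothetical), one can compare the rate of flagged predictions within each attribute value against the overall rate:

```python
from collections import Counter

def risk_ratios(records, flagged, attribute):
    """For one attribute, compare the flag rate within each value
    group against the overall flag rate (a simple risk ratio).
    `records` is a list of dicts; `flagged` is a parallel list of bools."""
    overall = sum(flagged) / len(flagged)
    counts, flags = Counter(), Counter()
    for rec, f in zip(records, flagged):
        v = rec[attribute]
        counts[v] += 1
        flags[v] += f
    return {v: (flags[v] / counts[v]) / overall for v in counts}

# Hypothetical data: users on software version "v47" are flagged
# as spammers far more often than the base rate.
records = [{"version": "v47"}] * 20 + [{"version": "v46"}] * 80
flagged = [True] * 15 + [False] * 5 + [True] * 5 + [False] * 75
print(risk_ratios(records, flagged, "version"))
# -> {'v47': 3.75, 'v46': 0.3125}: v47 stands out as an explanation
```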
C) Debugging and observability (feature engineering, productionization): ML models can “drift,” sometimes catastrophically, so they must be monitored and updated periodically after deployment. We are interested in developing and deploying inexpensive, useful monitoring tools to track the predictive quality of ML models, especially when new models serve potentially heterogeneous users and device platforms. In turn, identifying and correcting deviations from expected behavior will drive progress in interfaces and model training. A minimal monitoring sketch follows.
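Here is a minimal sketch of the kind of cheap quality monitor described above. The choice of signal (e.g., prediction confidence or spot-check accuracy), the window size, and the threshold are illustrative assumptions, not a DAWN tool:

```python
from collections import deque

class DriftMonitor:
    """Cheap drift check: compare the rolling mean of a model quality
    signal against a fixed baseline and alert when it degrades past
    a tolerance."""

    def __init__(self, baseline_mean, window=1000, tolerance=0.05):
        self.baseline = baseline_mean
        self.window = deque(maxlen=window)  # recent quality signals
        self.tolerance = tolerance

    def observe(self, value):
        self.window.append(value)
        mean = sum(self.window) / len(self.window)
        # Only alert once the window is full, to avoid noisy starts.
        if (len(self.window) == self.window.maxlen
                and mean < self.baseline - self.tolerance):
            return f"drift alert: rolling mean {mean:.3f} vs baseline {self.baseline:.3f}"
        return None
```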
D) Assessing and improving data quality (data preparation, feature engineering): Producing high-quality models requires diverse, high-quality training data. As more and more data sources are digitized, integrating structured (e.g., data warehouses, CSV) and unstructured (e.g., text, images, time series) data to extract signal for model building will become increasingly important. The question is: which of these diverse data sources can be trusted? Which sources should be extended and enriched, whether by manual labeling or by extending an existing knowledge base? Our early results [20] show that if the quality of each data source is modeled, the sources most in need of enrichment can be identified automatically, reducing the cost of data cleaning and acquisition. A toy ranking sketch follows this paragraph.
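As a toy illustration of prioritizing sources for enrichment (the scoring rule and the source statistics are assumptions for illustration, not the method of [20]):

```python
def rank_sources_for_enrichment(sources):
    """Toy heuristic: prioritize sources whose estimated label accuracy
    is low but whose coverage of the data is high, since cleaning them
    offers the largest expected gain."""
    return sorted(sources,
                  key=lambda s: (1 - s["accuracy"]) * s["coverage"],
                  reverse=True)

sources = [
    {"name": "web_scrape", "accuracy": 0.70, "coverage": 0.50},
    {"name": "curated_kb", "accuracy": 0.98, "coverage": 0.20},
    {"name": "user_logs",  "accuracy": 0.85, "coverage": 0.90},
]
for s in rank_sources_for_enrichment(sources):
    print(s["name"])  # web_scrape first: noisy but widely used
```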
End-to-end ML systems.
We believe that in many important domains it is possible to design end-to-end systems, analogous to a search engine or a SQL database, that encapsulate the entire ML workflow and hide its internals from the user. We are exploring these areas:
A) Large-scale data classification (data preparation, feature engineering, productionization): Classification and ranking are the core technologies behind every modern search engine. But beyond classifying static text and images, how can we classify sensor data, time series, and other data streams that can arrive tens of millions of times per second? We are interested in developing high-quality, optimized operators for classifying and summarizing disparate data: feature transformation, classification, and aggregation over data streams. Preliminary results from our MacroBase engine [3] suggest that a small number of operators can be reused at scale across domains including manufacturing sensors, mobile analytics, and automotive. We are interested in extending this capability to other areas such as video processing, where a $0.50 image sensor currently requires a $1,200 graphics card for real-time processing; combining traditional systems techniques such as caching, incremental memoization, branch-and-bound pruning, and adaptation (e.g., training a scene-specific object detector) within a unified system framework and a toolbox of “classifiers” can deliver very large speedups without compromising accuracy [8,12]. See the detection sketch below.
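For flavor, here is a minimal streaming-detection sketch in the spirit of MacroBase. MacroBase itself uses richer classifiers and, crucially, also explains the outliers it finds; the z-score rule and warm-up period here are illustrative assumptions:

```python
class StreamingOutlierDetector:
    """Flag stream values far from the running mean, using Welford's
    online algorithm for mean/variance so each point is processed
    in O(1) time and memory."""

    def __init__(self, z_threshold=3.0):
        self.n, self.mean, self.m2 = 0, 0.0, 0.0
        self.z = z_threshold

    def observe(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)
        if self.n < 30:          # warm up before flagging anything
            return False
        std = (self.m2 / (self.n - 1)) ** 0.5
        return std > 0 and abs(x - self.mean) / std > self.z
```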
B) Personalized recommendation (feature engineering, productionization): Personalization is critical to many popular ML applications, and there is an extensive literature on personalized recommendation algorithms. However, beyond simple inputs and outputs, practitioners still have to combine low-level algorithms and tools to build an engine from scratch. We plan to build a generic end-to-end recommendation platform, including a simple input interface (such as clicks or ratings from users), automatic model tuning, and automated serving, monitoring, and model retraining. Early results suggest these tasks can be accomplished incrementally and in a distributed setting: feed in interaction data, and a “plug and play” personalized recommendation system emerges, where users simply supply user interactions and receive up-to-date recommendations in real time. A toy version appears below.
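As a toy version of such a plug-and-play engine (item co-occurrence counting is our stand-in here; an actual platform would select, tune, retrain, and serve real models automatically):

```python
from collections import defaultdict

class CoVisitRecommender:
    """Feed in raw user interactions, get recommendations back.
    Scores candidate items by how often they co-occur with items
    the user has already interacted with."""

    def __init__(self):
        self.history = defaultdict(set)              # user -> items seen
        self.cooc = defaultdict(lambda: defaultdict(int))

    def record(self, user, item):
        for prev in self.history[user]:
            self.cooc[prev][item] += 1
            self.cooc[item][prev] += 1
        self.history[user].add(item)

    def recommend(self, user, k=3):
        scores = defaultdict(int)
        for item in self.history[user]:
            for other, count in self.cooc[item].items():
                if other not in self.history[user]:
                    scores[other] += count
        return sorted(scores, key=scores.get, reverse=True)[:k]
```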
C) Combining inference and action (feature engineering, data preparation, productionization): As ML delivers ever deeper insight and stronger decision-making capability, how should we build processes that make decisions autonomously? Today, with the exception of a few applications such as self-driving vehicles, inference/prediction (estimating what will happen) and action/decision-making (acting on a prediction) are usually performed separately by two systems (often an automated inference engine and a human “decision maker”). How do we make decisions a first-class part of the ML workflow? Fortunately, with the rise of automation APIs, taking action has never been easier (for example, sending a POST request to an automation service); what is missing is the glue that integrates ML with automation APIs, plus the reasoning logic about how actions compose. We are therefore developing combinations of inference and decision-making that range from alerts and notifications to physical operations on the environment. A toy glue sketch follows.
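Here is a toy sketch of that “glue” between a prediction and an automation API. The endpoint URL, the payload schema, and the confidence threshold are all hypothetical, and the network call is left disabled since the endpoint is fictional:

```python
import json
from urllib import request

def act_on_prediction(prediction, threshold=0.9,
                      automation_url="https://automation.example.com/actions"):
    """If the model is confident enough, prepare a POST to an
    automation API; otherwise fall back to a human decision maker."""
    if prediction["score"] < threshold:
        return "escalate to human decision maker"
    payload = json.dumps({"action": prediction["action"]}).encode()
    req = request.Request(automation_url, data=payload,
                          headers={"Content-Type": "application/json"})
    # request.urlopen(req)  # disabled in this sketch: fictional endpoint
    return f"would POST {payload!r} to {automation_url}"

print(act_on_prediction({"score": 0.95, "action": "quarantine_account"}))
```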
D) Unified SQL, graph, and linear algebra (productionization): ML product pipelines contain a diverse set of operations, including SQL, graph computation, and ML training and evaluation. Unfortunately, most execution engines are optimized for only one of these computation patterns. So how do we build an engine that optimizes all of them? Many of these patterns can in fact be expressed as instances of classical relational joins, and project PI Ré recently helped develop faster join algorithms [14]. In practice, we have found that combining these optimized joins with SIMD-optimized operators yields very fast execution, matching specialized engines for both SQL and graph workloads [1]. What about ML? We believe ML can benefit in the same way, by extending these theoretical results to classical ML patterns expressed over linear algebra and sparse matrix operations. By combining these operations in a single engine, we can optimize end-to-end workflows that mix SQL, graph computation, linear algebra, and more. The join sketch below shows how one primitive can serve both worlds.
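To see how a single relational primitive can serve graph queries, here is a small pandas sketch (the edge table and the query are illustrative): a two-hop neighborhood query becomes a self-join, exactly the operation a relational engine already knows how to optimize.

```python
import pandas as pd

# One relational table of edges can answer graph queries via joins.
edges = pd.DataFrame({"src": ["a", "a", "b", "c"],
                      "dst": ["b", "c", "c", "d"]})

# Two-hop reachability = self-join of the edge table on dst == src.
two_hop = edges.merge(edges, left_on="dst", right_on="src",
                      suffixes=("_1", "_2"))

# Count distinct nodes reachable in exactly two hops from each node.
result = two_hop.groupby("src_1")["dst_2"].nunique()
print(result)  # a: 2 (via b and c), b: 1 (via c)
```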
New substrates for ML.
Training and deploying ML quickly, cheaply, and efficiently requires developing new computational substrates, from language support and distributed runtimes to accelerator hardware.
A) End-to-end compiler optimization (feature engineering, productionization): Today’s ML applications combine a variety of libraries and systems such as TensorFlow, Apache Spark, scikit-learn, and Pandas. Although each library has its advantages, real workflows usually combine several of them, so large-scale production often requires a software engineering team to rewrite the entire application in low-level code. We are developing Weld [15], a new runtime that optimizes data-intensive code across multiple libraries and automatically generates efficient code for the combined workflow.
It may come as a surprise that modern data analysis tools such as Apache Spark, Pandas, and TensorFlow can be made to run 10x faster, and cross-library workloads up to 30x faster, just by optimizing across their operators. In addition, Weld’s design enables porting across heterogeneous hardware, so these libraries can also run on GPUs, mobile processors, and FPGAs. Besides Weld, we are developing new compiler technology for ML in Delite [5], a framework for developing domain-specific languages, and Splinter [26], a privacy-preserving data analysis platform.
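The core trick behind this kind of cross-library speedup can be sketched in a few lines of Python (a teaching toy, not Weld’s actual IR or API): record operations lazily, then fuse them into a single pass over the data instead of materializing an intermediate result after every library call.

```python
class LazyVec:
    """Record element-wise operations lazily and fuse them into one
    loop at evaluation time, avoiding one temporary array per call."""

    def __init__(self, data):
        self.data = data
        self.ops = []            # deferred element-wise functions

    def map(self, fn):
        self.ops.append(fn)
        return self              # allow chaining, nothing computed yet

    def evaluate(self):
        out = []
        for x in self.data:      # a single fused pass over the data
            for fn in self.ops:
                x = fn(x)
            out.append(x)
        return out

v = LazyVec([1, 2, 3]).map(lambda x: x * 2).map(lambda x: x + 1)
print(v.evaluate())  # [3, 5, 7], computed in one pass, no temporaries
```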
B) Reduced-precision processing (productionization): ML computation is famously stochastic and probabilistic; how can we exploit this property during execution? Early on, HogWild! [19] was the first project to show that asynchrony in computation can actually reduce convergence time, and its basic algorithm is now used in daily production at Google, Microsoft, and other large technology companies. But we believe much more is possible, such as improving performance and reducing power consumption by controlling randomness at the bit level: we can design ML-specific chips that perform low-precision computation at low power while still producing useful results. Our recent theoretical results show that low-precision computation is possible without compromising accuracy [7], and we have obtained encouraging results in practice [22]. A quantization sketch appears below.
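Here is a minimal NumPy sketch of the low-precision idea: symmetric int8 quantization with an int32 accumulator. This is a standard textbook scheme, not the specific chip design the text alludes to:

```python
import numpy as np

def quantize_int8(w):
    """Symmetric linear quantization of float values to int8.
    Returns the int8 tensor plus the scale needed to dequantize."""
    scale = np.abs(w).max() / 127.0
    return np.round(w / scale).astype(np.int8), scale

rng = np.random.default_rng(0)
w = rng.standard_normal(4).astype(np.float32)   # weights
x = rng.standard_normal(4).astype(np.float32)   # activations

qw, sw = quantize_int8(w)
qx, sx = quantize_int8(x)

# int8 dot product accumulated in int32, rescaled to float at the end.
approx = int(np.dot(qw.astype(np.int32), qx.astype(np.int32))) * sw * sx
print(np.dot(w, x), approx)  # close agreement at a quarter of the bits
```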
C) Reconfigurable hardware with core kernels (feature engineering, productionization): Computer architects like to joke that next year is always the true “year of the FPGA.” Programming FPGAs remains difficult and expensive. Still, ML could be a turning point: as of 2017, compute is the bottleneck that ML analytics needs to crack, in both training time and inference time. Given the coming competition between CPUs and FPGAs, reconfigurable hardware with high-level programmability will become increasingly important. We are therefore developing new reconfigurable substrates that expose modular, efficient computational kernels [16], which could deliver significant gains in performance per watt, especially as the upper layers of the software stack continue to evolve.
D) Distributed runtime (productionization): As models keep growing, the ability to train at scale and to execute inference efficiently becomes ever more important. Combining ML with distributed systems is a real headache: is a model underperforming because it is allocated too many servers, or because the allocation is poorly matched to the workload? What is the optimal amount of asynchrony? What is the best distributed training framework? We are very interested in exploiting parallelism both within devices (e.g., FPGAs, GPU vectorization) and across devices (e.g., cluster computing), with automatic, dynamic allocation of computation to heterogeneous hardware in a cluster. Moreover, some of our recent theory [10] shows that we can automatically tune and match the optimal low-level learning algorithm to different hardware and communication networks. But many questions remain: how can asynchronous distributed computation benefit inference time (i.e., model serving)? Can new computational substrates such as serverless computing (AWS Lambda) further scale out inference? What is the unified programming model for distributed computation? We intend to build new tools, and to integrate with existing frameworks such as TensorFlow and Spark, to answer these questions. A Hogwild!-style sketch follows.
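To illustrate the asynchrony discussed above, here is a HogWild!-style sketch in Python: several threads update a shared weight vector for least-squares SGD without any lock, tolerating stale reads. The data, step size, and thread count are illustrative, and Python’s GIL masks most races, so treat this as a schematic of the lock-free update pattern rather than a performance demo:

```python
import threading
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((1000, 10))
true_w = rng.standard_normal(10)
y = X @ true_w

w = np.zeros(10)                      # shared parameters, no lock

def worker(rows, lr=0.01, epochs=5):
    for _ in range(epochs):
        for i in rows:
            grad = (X[i] @ w - y[i]) * X[i]
            w[:] = w - lr * grad      # racy in-place update, on purpose

threads = [threading.Thread(target=worker, args=(range(t, 1000, 4),))
           for t in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(np.linalg.norm(w - true_w))     # small despite unsynchronized updates
```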
Research ideas and success indicators
According to DAWN, the team will work with target research partners within the program, both on and off campus, to pursue these research goals. The project’s main success metrics concern usability, including i) the time and cost of building an ML application (including data sources and features); ii) the time and cost of running the application in production (including the hardware and human resources needed to monitor ML models); and iii) the benefit to end users. In addition, DAWN plans to open-source all of its research so that every industry can benefit from the DAWN project.
Results and flagship projects
Early production deployments of DAWN systems, including Snorkel, MacroBase, and DeepDive, are said to be in use in Silicon Valley and around the world, confirming the great potential of the DAWN project and its promise of radically improving on existing technology.
DAWN’s flagship projects are listed on the project’s website, including:
MacroBase: Prioritizing attention in fast data
Github.com/stanford-fu…
MacroBase is a new analytic monitoring engine designed to prioritize attention in large datasets and data streams. Unlike traditional analysis engines, MacroBase is dedicated to one task: finding and explaining unusual or interesting trends in data.
Snorkel: A system for creating training sets
hazyresearch.github.io/snorkel/
Snorkel is a system for quickly creating, modeling, and managing training data. It currently focuses on accelerating the development of structured or “dark” data extraction applications for domains where large hand-labeled training sets are impractical to obtain.
Spatial: A DSL for FPGAs
Github.com/stanford-pp…
Spatial is a new domain-specific language for programming reconfigurable hardware from parameterized high-level abstractions.
Weld: Accelerated data analysis
Github.com/weld-projec…
Official website link:
dawn.cs.stanford.edu/
Original link:
Arxiv.org/pdf/1705.07…