AI Front introduction:
For more quality content, please follow the WeChat official account "AI Front" (ID: AI-Front).

There’s nothing bigger than data these days! We have more data than ever before, and we have more ways to store and analyze it: SQL databases, NoSQL databases, distributed OLTP databases, distributed OLAP platforms, and distributed hybrid OLTP/OLAP platforms. The 2018 Bossie Award winners for databases and data analysis platforms also include innovators in stream processing.

Despite the proliferation of new products, Apache Spark continues to dominate the data analytics space. If you work in distributed computing, data science, or machine learning, chances are you use Apache Spark. Spark 2.3, released in February, continues to focus on developing, integrating, and enhancing its Structured Streaming API. The new version also adds a Kubernetes scheduler, making it easy to run Spark directly on a container platform. Overall, the current version of Spark feels refreshed, with many tweaks and improvements.

AI Front Related reports:

Spark 2.3 major release: continuous stream processing introduced to compete with Flink

Crisis and opportunity for Spark: the future will be AI frameworks driving data processing frameworks

Apache Pulsar was originally developed at Yahoo, then entered the Apache Incubator and recently graduated as an Apache top-level project. Pulsar aims to unseat Apache Kafka, which has dominated the space for years. In many cases Pulsar provides higher throughput and lower latency than Kafka, and it offers a compatible set of APIs so developers can easily switch from Kafka to Pulsar.

The biggest advantage of Pulsar is that it provides a simpler and more robust set of operational capabilities than Apache Kafka, particularly in terms of visibility, geo-replication, and multi-tenancy. Enterprises that find it difficult to run large Kafka clusters may want to consider switching to Pulsar.

AI Front Related reports:

Apache Pulsar has been promoted to a top project, creating a data hub for the real-time era

Why did we end up with Apache Pulsar when Kafka already existed?

Pulsar: one system in place of Kafka + Flink + DB

Over the years, the difference between batch processing and stream processing has slowly narrowed. Batches get smaller and smaller, becoming micro-batches, and as the batch size approaches one, batch data becomes streaming data. Many processing architectures are trying to capture this transition in a single programming paradigm.
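The convergence is easy to see in miniature: the same per-record logic can drive both modes, and shrinking the batch size to one turns batch processing into record-at-a-time streaming. This is an illustrative sketch only, not any particular framework's API:

```python
# Illustrative sketch: one processing function applied in batches.
# As batch_size shrinks toward 1, "batch" processing becomes
# record-at-a-time "streaming" - the results are identical.

def process_batch(batch):
    # Per-batch logic; here we just square each record.
    return [x * x for x in batch]

def run(records, batch_size):
    results = []
    for i in range(0, len(records), batch_size):
        results.extend(process_batch(records[i:i + batch_size]))
    return results

data = [1, 2, 3, 4]
# One big batch, micro-batches, and pure streaming all agree:
assert run(data, batch_size=4) == run(data, batch_size=2) == run(data, batch_size=1)
```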

Apache Beam is Google’s answer. Beam combines a programming model with language-specific SDKs that can be used to define data processing pipelines. Once defined, a pipeline can run on different processing frameworks, such as Hadoop, Spark, or Flink. If you are choosing how to build a data-intensive application (what application isn’t data-intensive these days?), Beam should be on your radar.
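The core idea, a pipeline as a chain of transforms over a collection, can be sketched in a few lines of plain Python. This is an illustration of the concept only; the real Apache Beam Python SDK uses `beam.Pipeline`, `ParDo`, and runner-specific execution:

```python
# Pure-Python sketch of the pipeline idea behind Beam: a pipeline is
# a chain of transforms; the runner decides how to execute it.
# (Illustrative only - not the Apache Beam SDK.)

class Pipeline:
    def __init__(self, data):
        self.data = list(data)

    def map(self, fn):
        # Apply fn to every element, yielding a new pipeline stage.
        return Pipeline(fn(x) for x in self.data)

    def filter(self, pred):
        # Keep only elements matching the predicate.
        return Pipeline(x for x in self.data if pred(x))

    def collect(self):
        return self.data

result = (Pipeline(["spark", "flink", "hadoop"])
          .map(str.upper)
          .filter(lambda s: len(s) > 5)
          .collect())
# result == ["HADOOP"]
```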

AI Front Beam technology column (ongoing):

Apache Beam practical guide | foundation primer

Apache Beam practical guide | a hands-on walkthrough of KafkaIO and Flink

Although everyone thinks of Apache Solr as a search engine built on Lucene’s indexing technology, it’s actually a text-oriented document database, and a very good one at that. Whether you’re looking for a needle in a haystack or running spatial queries, Solr can help.

The Solr 7 series is out now, and it can run analytical queries at lightning speed. You can add lots of documents and still get results back in under a second. It also improves support for log and event data. Cross-data-center replication (CDCR) is now bidirectional, and Solr’s new autoscaling feature simplifies scaling as cluster load grows.
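Queries go to Solr's standard `/select` handler over HTTP. A small sketch of building such a request URL, where the core name (`articles`) and field names are hypothetical, but `q`, `fq`, and `rows` are standard Solr query parameters:

```python
# Building a Solr search URL against the standard /select handler.
# Core and field names here are made up for illustration.
from urllib.parse import urlencode

def solr_query_url(base, core, query, filters=None, rows=10):
    params = [("q", query), ("rows", rows)]
    for f in (filters or []):
        # fq = filter query; Solr caches these separately from q.
        params.append(("fq", f))
    return f"{base}/solr/{core}/select?{urlencode(params)}"

url = solr_query_url("http://localhost:8983", "articles",
                     "title:spark", filters=["year:2018"])
# url == "http://localhost:8983/solr/articles/select?q=title%3Aspark&rows=10&fq=year%3A2018"
```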

JupyterLab is the next generation of Jupyter, the web-based notebook server popular with data scientists around the world. After three years of development, JupyterLab has completely changed the way we think about notebooks, with support for drag-and-drop cell rearrangement, tabbed notebooks, Markdown editing with live preview, and an improved extension system that makes integration with services like GitHub easy. JupyterLab is expected to reach a stable 1.0 release by the end of 2018.

KNIME Analytics Platform is open source software for creating data science applications and services. It provides a drag-and-drop graphical interface for building visual workflows, support for R and Python scripting, built-in machine learning, and a connector for Apache Spark. KNIME currently has about 2,000 modules that can be used as workflow nodes.

KNIME also offers a commercial version, which aims to improve productivity and support collaboration. However, the open source KNIME analytics platform has no artificial limitations and can handle projects containing hundreds of millions of rows of data.

CockroachDB is a distributed SQL database built on a transactional, consistent key-value store. It is designed to survive disk, machine, rack, and even data center failures with minimal latency and no human intervention. CockroachDB v1.13 earned a high rating even though it still lacked many features, but that has changed.

CockroachDB v2.0, released in April, brought significant performance improvements, extended PostgreSQL compatibility by adding support for JSON (and other types), and added cross-region cluster management capabilities for production environments. The roadmap for v2.1 includes a cost-based query optimizer for better query performance, correlated subqueries (important for ORMs), better support for schema changes, and encryption for the enterprise edition.

Vitess is a database clustering system that scales MySQL horizontally through sharding, written mainly in Go. Vitess combines many important MySQL features with the scalability of a NoSQL database. Its built-in sharding lets users grow the database without adding sharding logic to the application. Vitess has been a core component of YouTube’s database infrastructure since 2011, growing to encompass thousands of MySQL nodes.

Vitess doesn’t use standard MySQL connections, because they consume a lot of RAM and limit the number of connections per node; it uses a more efficient gRPC-based protocol instead. In addition, Vitess automatically rewrites queries that would harm database performance and uses caching to mediate queries, preventing duplicate queries from hitting the database at the same time.
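Transparent sharding means the router, not the application, decides which shard holds a row, typically by hashing a keyspace id. A toy sketch of the idea (this is not Vitess code; Vitess uses its own "vindex" functions for this mapping):

```python
# Toy sketch of transparent sharding: hash the row's key to pick a
# shard, so the application never embeds shard logic itself.
import hashlib

def shard_for(user_id, num_shards):
    # A stable hash (not Python's randomized hash()) so every router
    # instance maps the same key to the same shard.
    digest = hashlib.md5(str(user_id).encode()).hexdigest()
    return int(digest, 16) % num_shards

# The same key always lands on the same shard:
assert shard_for(42, 4) == shard_for(42, 4)
assert 0 <= shard_for(99, 4) < 4
```

A real system also needs resharding support (e.g. consistent hashing or range splits) so that changing `num_shards` doesn't remap every key, which this modulo sketch deliberately glosses over.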

TiDB is a MySQL-compatible distributed database that supports hybrid transactional/analytical processing (HTAP). It is built on a transactional key-value store and provides full horizontal scalability (by adding nodes) and continuous availability. Most early TiDB users are in China, since the TiDB developers are based in Beijing. TiDB’s source code is written mainly in Go.

At the bottom of TiDB is RocksDB, Facebook’s log-structured key-value database engine, written in C++ for best performance. Above RocksDB is the Raft consensus layer, the transaction layer, and then the SQL layer that supports the MySQL protocol.
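The SQL layer's job is to map relational rows onto ordered key-value pairs stored below it. A toy version of that mapping, using a dict as the stand-in KV store; the `t_<table>_r_<rowid>` key layout here is a simplification for illustration, not TiDB's actual encoding:

```python
# Toy sketch of mapping SQL rows onto a key-value store - the idea
# behind running a SQL layer on top of an engine like RocksDB.
kv = {}  # stands in for the transactional KV store

def insert_row(table, row_id, row):
    # Encode (table, rowid) into a single ordered key.
    kv[f"t_{table}_r_{row_id}"] = row

def select_row(table, row_id):
    # A point SELECT by primary key becomes a single KV get.
    return kv.get(f"t_{table}_r_{row_id}")

insert_row("users", 1, {"name": "alice", "age": 30})
assert select_row("users", 1) == {"name": "alice", "age": 30}
```

Because the keys are ordered, a table scan becomes a range scan over the `t_<table>_` prefix, which is what makes log-structured engines like RocksDB a good fit underneath.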

AI Front technical articles:

How TiDB is used for real-time risk control in 360 Finance’s lending business

YugaByte DB combines distributed ACID transactions, multi-region deployment, and support for the Cassandra and Redis APIs, with PostgreSQL support coming soon. Unlike Cassandra, which is eventually consistent, YugaByte is strongly consistent. YugaByte also beats open source Cassandra in benchmarks, though it trails DataStax Enterprise 6, which has tunable consistency. Think of YugaByte as a faster, strongly consistent equivalent of distributed Redis and Cassandra. It lets you standardize on a single database, for example replacing the combination of a Cassandra database with a Redis cache.

For workloads involving connected data, the Neo4j graph database executes faster than SQL and NoSQL databases, though its graph model and Cypher query language require some learning. Recent analyses of Russian Twitter trolls, and the ICIJ’s work on the Panama Papers and Paradise Papers, have shown just how valuable Neo4j can be.

After 18 years of development, Neo4j has become a mature graph database platform that runs on Windows, macOS, Linux, Docker containers, VMs, and clusters. Even the open source version of Neo4j can handle very large graphs, and the enterprise version has no limit on graph size. (The open source version of Neo4j runs on a single server only.)
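The speed advantage on connected data comes from storing relationships as direct links: following an edge is a cheap hop per neighbor, with no join. A toy adjacency-list traversal illustrates the access pattern (illustrative only; Neo4j itself is queried with Cypher, along the lines of `MATCH (a)-[*..2]->(b)`, and these people names are made up):

```python
# Toy adjacency-list graph: find everyone reachable within N hops.
# This breadth-first walk mirrors how a graph database follows
# relationships directly instead of joining tables.
from collections import deque

graph = {
    "alice": ["bob"],
    "bob": ["carol", "dave"],
    "carol": [],
    "dave": ["erin"],
    "erin": [],
}

def within_hops(start, max_hops):
    seen, frontier = {start}, deque([(start, 0)])
    while frontier:
        node, depth = frontier.popleft()
        if depth == max_hops:
            continue  # don't expand past the hop limit
        for nbr in graph.get(node, []):
            if nbr not in seen:
                seen.add(nbr)
                frontier.append((nbr, depth + 1))
    seen.discard(start)
    return seen

assert within_hops("alice", 2) == {"bob", "carol", "dave"}
```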

AI Front Related reports:

Are graph databases really more advanced than relational databases?

InfluxDB is an open source time-series database with no external dependencies, designed to handle high write and query loads; it is useful for recording metrics, events, and analytics. It runs on macOS, Docker, Ubuntu/Debian, Red Hat/CentOS, and Windows. It provides a built-in HTTP API and a SQL-like query language, and aims to answer queries in real time (within 100 milliseconds).
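Writes to InfluxDB's HTTP API use its line protocol, `measurement,tag=value field=value [timestamp]`. A small sketch that builds such lines; the measurement, tag, and field names are hypothetical, and this simplified builder handles only bare numeric field values (real line protocol also distinguishes quoted strings and integer fields with an `i` suffix):

```python
# Building InfluxDB line-protocol strings, the write format accepted
# by InfluxDB's HTTP API:
#   measurement,tag=value field=value [timestamp]

def to_line(measurement, tags, fields, ts=None):
    # Tags and fields are sorted for a deterministic output.
    tag_part = ",".join(f"{k}={v}" for k, v in sorted(tags.items()))
    field_part = ",".join(f"{k}={v}" for k, v in sorted(fields.items()))
    line = f"{measurement},{tag_part} {field_part}"
    return f"{line} {ts}" if ts is not None else line

line = to_line("cpu", {"host": "web01"}, {"usage": 0.64},
               ts=1545000000000000000)
assert line == "cpu,host=web01 usage=0.64 1545000000000000000"
```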

AI Front Related reports:

How to choose an appropriate time-series database?

Original English article:

https://www.infoworld.com/article/3306454/big-data/the-best-open-source-software-for-data-storage-and-analytics.html#slide1



In the era of machine learning and artificial intelligence, the convergence of large amounts of data, cheap storage, elastic computing, and advances in algorithms, especially in deep learning, is driving continuous innovation in enterprise infrastructure platforms, providing the building blocks for intelligent systems. At ArchSummit in Beijing on December 7th, we have invited experts from Cainiao, Baidu, and Netflix to share their relevant experience. We hope you find it helpful.


If you enjoyed this article, or would like to see more quality reporting like this, leave a comment and give me a like!