With the rapid development of Internet technology, many people are asking questions about big data: What is big data development? What technologies are related to big data? In this article, we will talk about big data development and the technologies related to it.

First of all, big data refers to data sets that cannot be captured, managed, and processed with conventional software tools within a tolerable time frame. It is a massive, fast-growing, and diversified information asset that requires new processing modes in order to deliver stronger decision-making power, insight and discovery, and process optimization.

So what are the technologies associated with big data?

1. Cloud technology

Big data is often associated with cloud computing, because real-time analysis of large data sets requires a distributed processing framework to spread the work across dozens, hundreds, or even tens of thousands of computers. Cloud computing is, so to speak, the engine of this industrial revolution, while big data is the electricity.

The idea behind cloud computing was laid out by John McCarthy in the 1960s: computing power provided to users as a utility, like water or electricity.

Now, led by Google, Amazon, Facebook and other Internet companies, an effective model has emerged: cloud computing provides the infrastructure platform on which big data applications run.

The relationship between the two can be described as follows: without the accumulated information of big data, the computing power of cloud computing, however strong, has nowhere to be applied; without the processing power of cloud computing, the accumulated information of big data, however rich, is ultimately just a mirage.

So what cloud computing technologies are needed for big data?

Here are just a few examples: virtualization, distributed processing, massive data storage and management, NoSQL, real-time streaming data processing, and intelligent analysis technologies (such as pattern recognition and natural language understanding).

The relationship between cloud computing and big data can be illustrated by the chart below. Their combination produces the following effects: more innovative services can be provided on top of massive business data, and the continuing development of cloud computing technology reduces the innovation cost of big data businesses.

[Chart: the relationship between cloud computing and big data]

If we compare cloud computing and big data, the most obvious distinctions lie in two areas:

First, the two are conceptually different. Cloud computing changed IT, while big data changed business; however, big data needs the cloud as its infrastructure to operate smoothly.

Second, the target audiences of big data and cloud computing differ. Cloud computing is the technology layer that CIOs care about: an advanced IT solution. Big data is a product that CEOs focus on, and its decision makers sit at the business layer.

2. Distributed processing technology

A distributed processing system connects many computers, which may sit in different locations, have different functions, or hold different data, through a communication network so that, under the unified control of a management system, they cooperate to complete information-processing tasks. That is the definition of a distributed processing system.

Take Hadoop (Yahoo) as an example. Hadoop is a software framework that implements the MapReduce model and can process large amounts of data in a reliable, efficient, and scalable manner.

MapReduce is a core computing model of cloud computing, proposed by Google. It is both a distributed computing technique and a simplified distributed programming model. Its main idea is to decompose a problem to be executed (such as a program) automatically into a Map (mapping) phase and a Reduce (reduction) phase. After the data is split, the Map function maps the pieces into different blocks and distributes them across the computer cluster for processing, achieving the effect of distributed computation; the Reduce function then aggregates the intermediate results and outputs the result the developer needs.
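
To make the Map/Reduce division concrete, here is a minimal single-machine sketch of the classic word-count example in Python. It only imitates the model: in Hadoop, the map and reduce calls would be distributed across a cluster and the shuffle step is done by the framework. The function names here are illustrative, not part of any Hadoop API.

```python
from collections import defaultdict
from itertools import chain

# Map phase: turn one input record (a line of text) into (key, value) pairs.
def map_fn(line):
    return [(word, 1) for word in line.split()]

# Shuffle phase: group all values by key (the framework does this in Hadoop).
def shuffle(pairs):
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

# Reduce phase: aggregate the values of one key into a final result.
def reduce_fn(key, values):
    return key, sum(values)

lines = ["big data needs cloud computing", "cloud computing needs big data"]
mapped = chain.from_iterable(map_fn(line) for line in lines)   # Map
grouped = shuffle(mapped)                                      # Shuffle
counts = dict(reduce_fn(k, v) for k, v in grouped.items())     # Reduce
print(counts)  # {'big': 2, 'data': 2, 'needs': 2, 'cloud': 2, 'computing': 2}
```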

Looking at Hadoop’s features: first, it is reliable, because it assumes that compute elements and storage will fail, so it maintains multiple copies of working data and can redistribute processing away from nodes that fail. Second, Hadoop is efficient, because it works in parallel and speeds up processing accordingly. Hadoop is also scalable and can handle petabytes of data. In addition, Hadoop is community-driven open source and runs on commodity servers, so its cost is low and anyone can use it.

Hadoop = HDFS (file system, data storage technology) + HBase (database) + MapReduce (data processing) + … others

Some of the technologies used in the Hadoop ecosystem are:

HDFS: Hadoop Distributed File System.

MapReduce: parallel computing framework

HBase: a distributed NoSQL column database similar to Google BigTable (see the client sketch after this list).

Hive: Data warehouse tool, contributed by Facebook.

ZooKeeper: a distributed lock facility that provides functionality similar to Google Chubby, contributed by Yahoo.

Avro: A new data serialization format and transmission tool that will gradually replace Hadoop’s existing IPC mechanism.

Pig: Big data analysis platform, providing users with a variety of interfaces.

Ambari: Hadoop management tool, which can quickly monitor, deploy, and manage clusters.

Sqoop: Used to transfer data between Hadoop and traditional databases.
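
As one concrete example from this list, here is a minimal sketch of talking to HBase from Python through the third-party happybase library, which speaks to HBase’s Thrift gateway. The host, table name, and column family are assumptions made up for illustration.

```python
import happybase  # third-party client for HBase's Thrift interface

# Hypothetical HBase Thrift server; replace with your own host.
connection = happybase.Connection('hbase-thrift.example.com')

table = connection.table('orders')  # assumes a table 'orders' with family 'cf'

# Writes address a (row key, column) cell, BigTable-style.
table.put(b'order-001', {b'cf:buyer': b'alice', b'cf:amount': b'42'})

# Reads come back as a dict of column -> value for the given row key.
row = table.row(b'order-001')
print(row[b'cf:buyer'])  # b'alice'
```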

Having said that, let’s look at a practical example. Although it is a little old, the massive-data technical architecture of Taobao is still helpful for understanding how big data is operated and processed:

[Figure: the technical architecture of Taobao’s massive-data products]

As shown in the figure above, the technical architecture of Taobao’s massive-data products is divided into five layers, from top to bottom: data source, computing layer, storage layer, query layer, and product layer.

Data source layer. This layer holds Taobao’s store and transaction data. Data generated at the data source layer is transmitted in quasi real time through DataX, DbSync, and TimeTunnel to the “ladder” described in the computing layer below.

Computing layer. Here Taobao uses a Hadoop cluster, which it calls the “ladder”; it is the main body of the computing layer. On the ladder, the system runs different MapReduce computations over the data products every day.

Storage layer. In this layer Taobao uses two things: one is MyFox, the other is Prom. MyFox is a distributed cluster of relational databases based on MySQL, and Prom is a NoSQL storage cluster based on Hadoop’s HBase.

Query layer. In this layer, Glider provides RESTful interfaces over HTTP: a data product fetches the data it wants through a unique URL, and underneath, the data is queried through MyFox (a rough sketch of such a query appears after this walkthrough).

The last layer is the product layer, which needs no further explanation.
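
As a rough illustration of what the query layer looks like from a data product’s point of view, here is a sketch in Python using the requests library. The URL and the response shape are hypothetical; Glider’s real endpoints are internal to Taobao and not documented here.

```python
import requests

# Hypothetical unique URL for one data product's report; Glider's real
# endpoints are internal to Taobao, so this is purely illustrative.
url = "http://glider.example.com/report/shop-sales/2012-06-01"

response = requests.get(url, timeout=5)
response.raise_for_status()   # fail loudly on HTTP errors

report = response.json()      # the query layer returns structured data
print(report)                 # e.g. {"shop_id": 1, "sales": 42}
```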

3. Storage technology

Big data can be abstractly divided into big data storage and big data analysis, and the relationship between the two is simple: the purpose of big data storage is to support big data analysis. So far they have been two quite different fields of computer technology: big data storage focuses on building data storage platforms that can scale to petabytes or even exabytes, while big data analysis focuses on processing large volumes of diverse data sets in the shortest possible time.

When it comes to storage, there is the famous Moore’s Law, which I’m sure you have all heard of: the complexity of integrated circuits roughly doubles every 18 months. As a result, the cost of storage falls by half roughly every 18 to 24 months, and falling costs have in turn made storing data at big-data scale economically feasible.
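
As a back-of-the-envelope illustration of that halving rule, the sketch below projects storage cost over time in Python; the starting price and the 18-month halving period are illustrative assumptions, not measured figures.

```python
# Project cost per terabyte if it halves every 18 months (1.5 years).
START_COST_PER_TB = 100.0   # illustrative starting price in dollars
HALVING_YEARS = 1.5

def projected_cost(years: float) -> float:
    """Cost per TB after `years`, under the halving assumption."""
    return START_COST_PER_TB * 0.5 ** (years / HALVING_YEARS)

for years in (0, 1.5, 3, 6):
    print(f"after {years:>4} years: ${projected_cost(years):8.2f} per TB")
# after    0 years: $  100.00 per TB
# after  1.5 years: $   50.00 per TB
# after    3 years: $   25.00 per TB
# after    6 years: $    6.25 per TB
```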

Google, for example, manages more than half a million servers and a million hard drives, and it keeps expanding its computing and storage capacity, much of it built on cheap servers and commodity hard drives. This greatly reduces the cost of its services, so more money can be put into technology research and development.

Amazon S3, for example, is an Internet-facing storage service designed to make web-scale computing easier for developers. Amazon S3 provides a simple web services interface that lets users store and retrieve any amount of data, at any time, from anywhere on the web. The service gives all developers access to the same highly scalable, reliable, secure, fast, and inexpensive infrastructure that Amazon uses to run its own global network of websites. Consider S3’s design targets: 99.999999999% durability and 99.99% availability of objects over a given year, and the ability to withstand the concurrent loss of data in two facilities.
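
Here is a minimal sketch of that interface using boto3, the official AWS SDK for Python. The bucket name is a placeholder, and the code assumes AWS credentials are already configured in the environment.

```python
import boto3

s3 = boto3.client("s3")  # picks up credentials from the environment

BUCKET = "my-example-bucket"  # placeholder; bucket names are globally unique

# Store an object of any size under a key...
s3.put_object(Bucket=BUCKET, Key="reports/2012-06-01.json",
              Body=b'{"sales": 42}')

# ...and retrieve it from anywhere, at any time.
obj = s3.get_object(Bucket=BUCKET, Key="reports/2012-06-01.json")
print(obj["Body"].read())  # b'{"sales": 42}'
```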

S3 has worked, and it has been a success: the S3 cloud stores trillions of objects across geographies and performs quite well, with AWS handling peak object request rates in the millions. Hundreds of thousands of companies around the world already run all or part of their daily business through AWS; these businesses span more than 190 countries, with Amazon customers in almost every corner of the world.

4. Perception technology

The collection of big data is closely linked to the development of perception technology. Improvements in perception based on sensors, fingerprint recognition, RFID, and coordinate positioning are also the cornerstone of the Internet of Things. Countless digital sensors on industrial equipment, cars, and electricity meters all over the world measure and transmit vast amounts of data about position, motion, vibration, temperature, humidity, and even changes in the chemical composition of the air.

With the popularity of smartphones, perception technology has advanced rapidly. Beyond the wide application of geolocation information, new modes of perception have also come onto the stage. For example, the iPhone 5S embeds a fingerprint sensor in its home button; a new phone can measure how much fat you burn directly from your breath; a smell sensor for phones can detect everything from air pollution to dangerous chemicals; Microsoft is working on smartphone technology that can sense your current mood; and Google Glass InSight can identify people by their clothing.

In addition, many perception-related innovations are refreshing: a smart tooth that monitors oral activity and diet in real time; wearable devices for babies whose data helps parents raise them; Intel’s work on a 3D laptop camera that tracks eye movement to read emotions; a Japanese company’s new textile material that can monitor the wearer’s heart rate; the industry’s attempts to introduce biometric payment; and so on.

In fact, the gradual spread of perception is the process of the world being digitized; once the world is completely digitized, the essence of the world becomes information.

As a famous saying goes: “Mankind used to pass on civilization; now it passes on information.”