Big data technology is an umbrella term for a family of technologies: a large and complex technical system that integrates data acquisition and transmission, data storage, data processing and analysis, data mining, data visualization, and more.
Following the flow of data from source to application, a big data architecture can be divided into five layers: a data collection layer, a data storage layer, a data processing layer, a data governance and modeling layer, and a data application layer.
1. Data collection layer
The data collection layer uses data collection technology to perform ETL on incoming data. ETL stands for Extract-Transform-Load: data is extracted from the source, transformed, and loaded into the destination.
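As a minimal sketch of this flow in Python, the following extracts rows from a hypothetical orders.csv, transforms them, and loads them into a SQLite file standing in for the destination; all file and column names are illustrative assumptions.

```python
# A minimal ETL sketch using only the standard library.
import csv
import sqlite3

def etl(src="orders.csv", dst="warehouse.db"):
    con = sqlite3.connect(dst)
    con.execute("CREATE TABLE IF NOT EXISTS orders (id TEXT, amount REAL)")
    with open(src, newline="") as f:
        for row in csv.DictReader(f):              # extract: one record per row
            amount = float(row["amount"].strip())  # transform: clean and cast
            con.execute("INSERT INTO orders VALUES (?, ?)",
                        (row["id"], amount))       # load into the destination
    con.commit()
    con.close()
```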
2. Data storage layer
Once a large amount of data has been collected, it must be stored. Storage falls into two categories: persistent and non-persistent. Persistent storage keeps data on disk, so it survives a shutdown or power failure. Non-persistent storage keeps data in memory, which is fast to read and write, but the data is lost on shutdown or power loss.
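To make the contrast concrete, here is a small Python sketch: a plain dict models non-persistent (in-memory) storage, while the standard-library shelve module models persistent (on-disk) storage; the file name is an illustrative assumption.

```python
import shelve

cache = {}                      # non-persistent: lost when the process exits
cache["latest_reading"] = 42

with shelve.open("readings.db") as store:   # persistent: backed by a disk file
    store["latest_reading"] = 42            # survives a restart

with shelve.open("readings.db") as store:
    print(store["latest_reading"])          # -> 42, even in a new process
```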
3. Data processing layer
Once data has been collected, stored, and made readable and writable, what do we do with it? Beyond retaining and backing up the raw data, we want to use it to generate greater value, and the first step is processing it. Big data processing falls into two categories: batch processing and real-time processing.
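The difference can be sketched in a few lines of Python: the batch function sees the complete, bounded dataset at once, while the streaming function updates its state incrementally as each event arrives; the event list is a stand-in for a log file or message queue.

```python
events = [("page_view", 1), ("click", 1), ("page_view", 1)]

# Batch: process the complete, bounded dataset in one pass.
def batch_count(all_events):
    counts = {}
    for name, n in all_events:
        counts[name] = counts.get(name, 0) + n
    return counts

# Real-time: update state incrementally as each event arrives.
def stream_count(event_iter):
    counts = {}
    for name, n in event_iter:
        counts[name] = counts.get(name, 0) + n
        yield dict(counts)            # emit a running result per event

print(batch_count(events))
for snapshot in stream_count(iter(events)):
    print(snapshot)
```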
4. Data governance and modeling layer
Data architecture design and data governance are closely connected. Data collection, data storage, and data processing are the foundations of a big data architecture. Once work at those three layers is complete, the data has been turned into basic data that supports upper-layer business applications. In the big data era, however, data types are diverse and per-unit value is sparse, which calls for data governance and fusion modeling: preprocess the data with ETL tooling in languages such as R or Python, then build models that fuse algorithmic models with business models, so as to supply higher-quality underlying data to business applications.
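As a minimal sketch of this preprocess-then-model idea, assuming pandas and scikit-learn, the following fills missing values (a governance step) and then fits a simple model on the cleaned data; the columns and values are illustrative.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

raw = pd.DataFrame({
    "visits": [10, 12, None, 9, 30],
    "spend":  [100.0, 130.0, 90.0, None, 310.0],
})

# Governance step: repair missing values so downstream models see clean data.
clean = raw.fillna(raw.mean(numeric_only=True))

# Modeling step: fit an algorithmic model on the governed data.
model = LinearRegression().fit(clean[["visits"]], clean["spend"])
print(model.predict(pd.DataFrame({"visits": [20]})))
```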
5. Data application layer
The data application layer is the goal of big data technology and applications. It typically includes functions such as information retrieval and correlation analysis. Open-source projects such as Lucene, Solr, and Elasticsearch make information retrieval practical to implement.
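As a minimal retrieval sketch, assuming the official elasticsearch Python client (8.x API) and a node at localhost:9200, the following indexes one document and runs a match query; the articles index and its fields are illustrative assumptions.

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

es.index(index="articles", id=1,
         document={"title": "Big data architecture",
                   "body": "collection, storage, processing"})
es.indices.refresh(index="articles")     # make the document searchable now

resp = es.search(index="articles", query={"match": {"body": "storage"}})
for hit in resp["hits"]["hits"]:
    print(hit["_score"], hit["_source"]["title"])
```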
A big data architecture provides a generic framework for business applications of data. It still has to be adapted to the industry, the company's accumulated technology, and the business scenario, working from business requirements through product design and technology selection to implementation, and analyzing concrete problems with the help of big data visualization technology. This leads to more specific applications, including big data trading and sharing, applications built on big data development platforms, and tools built on big data.
That is the theoretical side of data architecture design. You may ask: in practice, is there good data architecture design software? Below, I walk through the data architecture design system of SmartBI.
1. Business application: this actually refers to data collection and how you collect data. On the Internet, collection is relatively easy: data can be gathered through websites and apps. Many banks, for example, now have their own apps.
At a deeper level, user behavior data can be collected and broken down along many dimensions for very detailed analysis. For offline industries, however, data collection has to be completed with the help of various business systems, as sketched below for the online case.
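For online collection, a behavior-event endpoint can be sketched as follows, assuming Flask; the /track route and the events.log sink are illustrative stand-ins for a real collection service or message queue.

```python
import json
import time

from flask import Flask, request

app = Flask(__name__)

@app.post("/track")
def track():
    event = request.get_json(force=True)      # e.g. {"user": "u1", "action": "click"}
    event["server_ts"] = time.time()          # stamp arrival time on the server
    with open("events.log", "a") as sink:     # append-only log stands in for a queue
        sink.write(json.dumps(event) + "\n")
    return {"status": "ok"}

if __name__ == "__main__":
    app.run()
```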
2. Data integration: this actually refers to ETL: the user extracts the required data from the data source, cleans it, and finally loads it into the data warehouse according to a pre-defined warehouse model. Kettle is just one ETL tool among many.
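As a small sketch of the cleaning step, assuming pandas, the following removes duplicates, drops rows missing the fact column, and casts types; the columns mirror the hypothetical orders data used earlier.

```python
import pandas as pd

orders = pd.DataFrame({
    "id": ["a1", "a1", "a2", "a3"],
    "amount": ["12.5", "12.5", None, " 40 "],
})

clean = (
    orders
    .drop_duplicates(subset="id")             # remove repeated records
    .dropna(subset=["amount"])                # drop rows missing the fact
    .assign(amount=lambda d: d["amount"].str.strip().astype(float))
)
print(clean)
```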
3. Data storage: refers to building the data warehouse, which can be roughly divided into a business data layer (DW), an index layer, a dimension layer, and a summary layer (DWA).
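As a minimal illustration of the layering, assuming pandas, the following derives a summary (DWA) table from a detail (DW) table by aggregating along a dimension; the column names are illustrative.

```python
import pandas as pd

dw_orders = pd.DataFrame({
    "region": ["north", "north", "south"],
    "amount": [120.0, 80.0, 200.0],
})

# Summary layer: aggregate detail rows by a dimension.
dwa_by_region = dw_orders.groupby("region", as_index=False)["amount"].sum()
print(dwa_by_region)
```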
4. Data sharing layer: provides data sharing services between the data warehouse and the business systems. Web services and Web APIs are one way to connect the data; you can define other ways to suit your own case.
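A data sharing service can be sketched as a small Web API, assuming Flask and the warehouse.db file from the ETL sketch above; the route name is an illustrative choice.

```python
import sqlite3

from flask import Flask, jsonify

app = Flask(__name__)

@app.get("/api/orders")
def orders():
    con = sqlite3.connect("warehouse.db")
    rows = con.execute("SELECT id, amount FROM orders").fetchall()
    con.close()
    return jsonify([{"id": oid, "amount": amount} for oid, amount in rows])

if __name__ == "__main__":
    app.run()   # business systems then consume http://localhost:5000/api/orders
```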
5. Data analysis layer: the analysis functions are relatively easy to understand: various mathematical methods such as K-means and other clustering algorithms, the RFM model, and so on. Column storage lets each disk page store the values of a single column rather than entire rows, which makes compression much more effective; this in turn reduces disk I/O and improves cache utilization, so disk storage is used more efficiently.
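As a minimal sketch, assuming scikit-learn, the following runs K-means over illustrative RFM (recency, frequency, monetary) features, scaling them first so no single feature dominates the distance metric.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# One row per customer: days since last purchase, purchase count, total spend.
rfm = np.array([
    [5, 20, 900.0],
    [40, 3, 120.0],
    [2, 35, 1500.0],
    [60, 1, 40.0],
])

scaled = StandardScaler().fit_transform(rfm)    # put features on one scale
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(scaled)
print(labels)                                   # cluster id per customer
```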
6. Data presentation: the form in which results are presented is data visualization. Agile BI is recommended here: unlike traditional BI, it can generate reports through simple drag and drop, so the learning cost is low. Among agile BI options on the domestic market, Tableau is recommended for individual users, while Yonghong BI is recommended for enterprise requirements such as banks.
7. Data access: this is relatively simple and depends on how you want to view the data. In this example, because a B/S (browser/server) architecture is used, the final visualization results are accessed through the browser.