Definition of big data
Big data refers to data sets whose contents cannot be captured, managed, and processed with conventional software tools within an acceptable time frame.
The concept of big data: 4V plus additional Vs
- 1. Volume: very large amounts of data
- 2. Variety: many different data types and sources
- 3. Velocity: data is generated and must be processed quickly, with high timeliness
- 4. Value: low value density
- 5. Variability
- 6. Veracity
The three stages of big data generation
- Operational system stage
Management information systems
- User-generated content stage
Web 2.0, Weibo, WeChat, and so on
- Perception system stage
Sensors, the Internet of Things
The impact of big data on scientific research
- The first paradigm: experimental science
- The second paradigm: theoretical science
- The third paradigm: computational science
- The fourth paradigm: data-intensive science
The impact of big data on ways of thinking
- The full data set rather than a sample;
- Efficiency rather than precision;
- Correlation rather than causation;
Big data computing modes
- Batch computing: MapReduce (a minimal sketch follows this list)
- Stream computing: Storm, Flink, Spark Streaming
- Graph computing: Pregel, Spark GraphX
- Query and analysis computing: Dremel, Hive, Impala
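As a concrete illustration of the batch computing model, here is a minimal word-count sketch against the Hadoop MapReduce Java API (the classic introductory example). The class name and the input/output paths passed on the command line are assumptions for illustration; they are expected to be HDFS directories.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Minimal word-count job: the mapper emits (word, 1) pairs and the
// reducer sums the counts for each word.
public class WordCount {

    public static class TokenizerMapper
            extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);   // emit (word, 1)
            }
        }
    }

    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();            // add up all counts for this word
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // HDFS input dir
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // HDFS output dir
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

Such a job is typically packaged into a jar and submitted with `hadoop jar wordcount.jar WordCount /input /output`, where the two paths are hypothetical HDFS directories.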
The definition of Hadoop
Hadoop is a distributed storage system and distributed computing framework developed under the Apache Software Foundation. It runs on large clusters of ordinary (commodity) servers and is used to store, compute on, and analyze big data.
Hadoop 2.0 consists of three components
- The distributed file system HDFS (a small client sketch follows this list)
- The resource management system YARN
- The distributed computing framework MapReduce
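To give a feel for how the HDFS layer is used from application code, the sketch below writes a small file and reads it back through the Hadoop FileSystem Java API. The NameNode address and file path are assumptions for illustration; in practice fs.defaultFS usually comes from core-site.xml on the classpath.

```java
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

// Write a small file to HDFS and read it back through the FileSystem API.
public class HdfsQuickStart {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Hypothetical NameNode address; normally taken from core-site.xml instead.
        conf.set("fs.defaultFS", "hdfs://namenode:8020");

        try (FileSystem fs = FileSystem.get(conf)) {
            Path file = new Path("/tmp/hello.txt");   // hypothetical path

            // Create (or overwrite) the file and write a line of text.
            try (FSDataOutputStream out = fs.create(file, true)) {
                out.write("hello hdfs\n".getBytes(StandardCharsets.UTF_8));
            }

            // Read the file back and copy its contents to stdout.
            try (FSDataInputStream in = fs.open(file)) {
                IOUtils.copyBytes(in, System.out, 4096, false);
            }
        }
    }
}
```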
Hadoop and Google
The characteristics of Hadoop
- Scalable: Hadoop can reliably store and process petabytes of data.
- Economical: data can be distributed and processed across clusters of ordinary machines, and these clusters can grow to thousands of nodes.
- Efficient: By distributing data, Hadoop can process it in parallel on the nodes where it resides, which makes processing very fast.
- Reliable: Hadoop automatically maintains multiple copies of data and redeploys computing tasks when a task fails.
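The reliability point above rests on block replication: each HDFS block is stored on multiple DataNodes (three by default, controlled by dfs.replication), and the NameNode re-replicates blocks when a node holding a copy fails. As a small sketch, using a hypothetical file path, the code below reads a file's current replication factor and requests three copies via the FileSystem API.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Check and adjust the replication factor of an existing HDFS file.
public class ReplicationCheck {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();   // reads *-site.xml from the classpath

        try (FileSystem fs = FileSystem.get(conf)) {
            Path file = new Path("/tmp/hello.txt"); // hypothetical file

            FileStatus status = fs.getFileStatus(file);
            System.out.println("current replication: " + status.getReplication());

            // Ask HDFS to keep 3 copies of each block of this file;
            // the NameNode restores the copy count if a DataNode fails.
            boolean ok = fs.setReplication(file, (short) 3);
            System.out.println("setReplication accepted: " + ok);
        }
    }
}
```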