Big Data E-commerce Data Warehouse Project V3.0: Scripts and Configuration Notes
User behavior data collection platform
HDFS storage across multiple directories
- Server disks in the production environment
- Configure multiple directories in the hdfs-site.xml file. Pay attention to the access permissions of newly mounted disks.
The DataNode data directories are determined by the dfs.datanode.data.dir parameter, whose default value is file://${hadoop.tmp.dir}/dfs/data. If the server has multiple disks, change this parameter. For the disk layout shown in the preceding figure, change it to the following value.
<property>
<name>dfs.datanode.data.dir</name>
<value>file:///dfs/data1,file:///hd2/dfs/data2,file:///hd3/dfs/data3,file:///hd4/dfs/data4</value>
</property>
Note: Each server mounts different disks, so the multi-directory configuration may differ from node to node and must be set on each node individually (a hypothetical example follows).
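For instance, a node whose extra disks happen to be mounted at /hd2 and /hd3 only (a hypothetical layout, not taken from the figure above) would carry its own value in hdfs-site.xml:
<property>
<name>dfs.datanode.data.dir</name>
<value>file:///dfs/data1,file:///hd2/dfs/data2,file:///hd3/dfs/data3</value>
</property>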
Cluster Data Balancing
Data balancing between nodes
Command for enabling data balancing:
start-balancer.sh -threshold 10
The parameter 10 means that the difference in disk space usage between the nodes in the cluster should not exceed 10%; adjust it to the actual situation. Command to stop data balancing:
stop-balancer.sh
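If the balancer competes with production traffic, its network bandwidth can be capped before starting it. The command below is a sketch assuming a 100 MB/s limit (the value is an assumption; the command itself is standard HDFS):
# 104857600 bytes per second = 100 MB/s
hdfs dfsadmin -setBalancerBandwidth 104857600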
Data balancing between disks
(1) Generate a balancing plan (we only have one disk, so no plan will actually be generated):
hdfs diskbalancer -plan hadoop103
(2) Execute the balancing plan:
hdfs diskbalancer -execute hadoop103.plan.json
(3) Check the execution status of the current balancing task:
hdfs diskbalancer -query hadoop103
(4) Cancel the balancing task:
hdfs diskbalancer -cancel hadoop103.plan.json
Configuring LZO compression support
- Hadoop itself does not support LZO compression, so the hadoop-lzo open source component provided by Twitter is needed. hadoop-lzo depends on Hadoop and LZO for compilation. The compilation procedure is as follows:
0. Environment preparation:
yum -y install gcc-c++ lzo-devel zlib-devel autoconf automake libtool
1. Download, install, and compile LZO:
wget http://www.oberhumer.com/opensource/lzo/download/lzo-2.10.tar.gz
tar -zxvf lzo-2.10.tar.gz
cd lzo-2.10
./configure -prefix=/usr/local/hadoop/lzo/
make
make install
2. Compile hadoop-lzo from source:
2.1 Download the hadoop-lzo source code: https://github.com/twitter/hadoop-lzo/archive/master.zip
2.2 After decompression, modify pom.xml: <hadoop.current.version>3.1.3</hadoop.current.version>
2.3 Declare two temporary environment variables:
export C_INCLUDE_PATH=/usr/local/hadoop/lzo/include
export LIBRARY_PATH=/usr/local/hadoop/lzo/lib
2.4 Compile: enter the hadoop-lzo-master directory and run the maven command mvn package -Dmaven.test.skip=true
2.5 Go to the target directory; hadoop-lzo-0.4.21-SNAPSHOT.jar is the successfully compiled hadoop-lzo component
- Copy the compiled hadoop-lzo-0.4.20.jar into hadoop-3.1.3/share/hadoop/common/ and synchronize hadoop-lzo-0.4.20.jar to hadoop103 and hadoop104.
- Configure core-site.xml to support LZO compression, then synchronize core-site.xml to hadoop103 and hadoop104 (a hedged sync example follows the configuration block below).
<configuration>
<property>
<name>io.compression.codecs</name>
<value>
org.apache.hadoop.io.compress.GzipCodec,
org.apache.hadoop.io.compress.DefaultCodec,
org.apache.hadoop.io.compress.BZip2Codec,
org.apache.hadoop.io.compress.SnappyCodec,
com.hadoop.compression.lzo.LzoCodec,
com.hadoop.compression.lzo.LzopCodec
</value>
</property>
<property>
<name>io.compression.codec.lzo.class</name>
<value>com.hadoop.compression.lzo.LzoCodec</value>
</property>
</configuration>
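A minimal way to do the synchronization, assuming plain scp rather than a cluster sync script and assuming the /opt/module/hadoop-3.1.3 install path used elsewhere in these notes:
# distribute the LZO jar
scp /opt/module/hadoop-3.1.3/share/hadoop/common/hadoop-lzo-0.4.20.jar hadoop103:/opt/module/hadoop-3.1.3/share/hadoop/common/
scp /opt/module/hadoop-3.1.3/share/hadoop/common/hadoop-lzo-0.4.20.jar hadoop104:/opt/module/hadoop-3.1.3/share/hadoop/common/
# distribute the updated core-site.xml
scp /opt/module/hadoop-3.1.3/etc/hadoop/core-site.xml hadoop103:/opt/module/hadoop-3.1.3/etc/hadoop/
scp /opt/module/hadoop-3.1.3/etc/hadoop/core-site.xml hadoop104:/opt/module/hadoop-3.1.3/etc/hadoop/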
Creating an LZO index
Create an index for LZO files. The splittability of an LZO-compressed file depends on its index, so we need to manually create an index for each LZO compressed file. Without an index, an LZO file yields only one split.
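A sketch of the index command, assuming the jar location from the previous step and a hypothetical LZO file /input/bigtable.lzo already on HDFS (com.hadoop.compression.lzo.DistributedLzoIndexer is the indexer class that ships with hadoop-lzo):
hadoop jar /opt/module/hadoop-3.1.3/share/hadoop/common/hadoop-lzo-0.4.20.jar \
com.hadoop.compression.lzo.DistributedLzoIndexer /input/bigtable.lzo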
Hadoop Benchmarking
Test HDFS write performance
Test contents: Write 10 128 MB files to the HDFS cluster
[atguigu@hadoop102 mapreduce]$ hadoop jar /opt/module/hadoop-3.1.3/share/hadoop/mapreduce/hadoop-mapreduce-client-jobclient-3.1.3-tests.jar TestDFSIO -write -nrFiles 10 -fileSize 128MB
2020-04-16 13:41:24,724 INFO fs.TestDFSIO: ----- TestDFSIO ----- : write
2020-04-16 13:41:24,724 INFO fs.TestDFSIO: Date & time: Thu Apr 16 13:41:24 CST 2020
2020-04-16 13:41:24,724 INFO fs.TestDFSIO: Number of files: 10
2020-04-16 13:41:24,725 INFO fs.TestDFSIO: Total MBytes processed: 1280
2020-04-16 13:41:24,725 INFO fs.TestDFSIO: Throughput mb/sec: 8.88
2020-04-16 13:41:24,725 INFO fs.TestDFSIO: Average IO rate mb/sec: 8.96
2020-04-16 13:41:24,725 INFO fs.TestDFSIO: IO rate std deviation: 0.87
2020-04-16 13:41:24,725 INFO fs.TestDFSIO: Test exec time sec: 67.61
Test HDFS read performance
Test contents: Read 10 128 MB files in the HDFS cluster
[atguigu@hadoop102 mapreduce]$ hadoop jar /opt/module/hadoop-3.1.3/share/hadoop/mapreduce/hadoop-mapreduce-client-jobclient-3.1.3-tests.jar TestDFSIO -read -nrFiles 10 -fileSize 128MB
2020-04-16 13:43:38,857 INFO fs.TestDFSIO: ----- TestDFSIO ----- : read
2020-04-16 13:43:38,858 INFO fs.TestDFSIO: Date & time: Thu Apr 16 13:43:38 CST 2020
2020-04-16 13:43:38,859 INFO fs.TestDFSIO: Number of files: 10
2020-04-16 13:43:38,859 INFO fs.TestDFSIO: Total MBytes processed: 1280
2020-04-16 13:43:38,859 INFO fs.TestDFSIO: Throughput mb/sec: 85.54
2020-04-16 13:43:38,860 INFO fs.TestDFSIO: Average IO rate mb/sec: 100.21
2020-04-16 13:43:38,860 INFO fs.TestDFSIO: IO rate std deviation:
INFO fs.TestDFSIO: Test exec time sec: 53.61
Delete test generated data
[atguigu@hadoop102 mapreduce]$ hadoop jar /opt/module/hadoop-3.1.3/share/hadoop/mapreduce/hadoop-mapreduce-client-jobclient-3.1.3-tests.jar TestDFSIO -clean
Hadoop parameter tuning
HDFS parameter tuning (hdfs-site.xml)
The number of NameNode RPC server threads that listen to requests from clients. If dfs.namenode.servicerpc-address is not configured, these threads also listen to requests from all nodes. They handle the concurrent heartbeats of the DataNodes and the concurrent metadata operations of clients. For a large cluster, or a cluster with many clients, the parameter dfs.namenode.handler.count (default 10) usually needs to be increased.
<property>
<name>dfs.namenode.handler.count</name>
<value>10</value>
</property>
A common rule of thumb is dfs.namenode.handler.count = 20 × ln(cluster size). For example, if the cluster has 8 nodes, set this parameter to 41. The value can be calculated with a few lines of Python:
[atguigu@hadoop102 ~]$ python
Python 2.7.5 (default, Apr 11 2018, 07:36:10)
[GCC 4.8.5 20150623 (Red Hat 4.8.5-28)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import math
>>> print int(20*math.log(8))
41
>>> quit()
YARN parameter tuning (yarn-site.xml)
Problem: HiveSQL is used for data statistics; there is no data skew, small files have been merged, JVM reuse is enabled, I/O is not blocked, and memory usage is below 50%. Yet jobs still run very slowly, and the whole cluster tends to hang during peak data traffic. Is there an optimization scheme for this situation?
Solution: The usable memory is insufficient. This is usually constrained by two YARN settings: the maximum memory a single task may request and the memory available to YARN on a Hadoop node. Adjusting these two parameters improves memory utilization; a configuration sketch follows the two parameters below.
- yarn.nodemanager.resource.memory-mb
Specifies the total physical memory available to YARN on the node. The default value is 8192 MB. If your node memory is less than 8GB, you need to reduce this value.
- yarn.scheduler.maximum-allocation-mb
Maximum amount of physical memory that can be requested by a single task. The default is 8192 (MB).
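A hedged yarn-site.xml sketch assuming a node with 4 GB of memory available to YARN (both values are illustrative assumptions; tune them to the actual hardware):
<!-- total physical memory YARN may use on this NodeManager -->
<property>
<name>yarn.nodemanager.resource.memory-mb</name>
<value>4096</value>
</property>
<!-- maximum physical memory a single task may request -->
<property>
<name>yarn.scheduler.maximum-allocation-mb</name>
<value>4096</value>
</property>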
Kafka machine count
Number of Kafka machines (rule of thumb) = 2 × (peak production speed in MB/s × number of replicas / 100) + 1. Measure the peak production speed first, then estimate the number of Kafka brokers to deploy from the chosen replica count. For example, if our peak production rate is 50 MB/s and the replica count is 2, then the number of Kafka machines = 2 × (50 × 2 / 100) + 1 = 3.
Kafka partition number calculation
- Create a topic with only 1 partition
- Test the Producer throughput and consumer throughput for this topic.
- Suppose they are Tp and Tc, respectively, in MB/s.
- Then assume that the total target throughput is Tt, so the number of partitions =Tt/min (Tp, Tc)
For example, if producer throughput = 20 MB/s, consumer throughput = 50 MB/s, and the desired total throughput is 100 MB/s, then the number of partitions = 100 / min(20, 50) = 100 / 20 = 5. The number of partitions is generally set between 3 and 10.
Blog.csdn.net/weixin_4264…
Flume component selection
Source
- TailDir Source: supports resumable reading (breakpoint continuation) and multiple directories. Before Flume 1.7, you had to write a custom Source that recorded the read position of each file in order to resume from a breakpoint.
- Exec Source: collects data in real time, but data is lost if Flume is not running or the shell command fails.
- Spooling Directory Source: monitors a directory and supports resumable transmission.
How should batchSize be set? Answer: when an Event is around 1 KB, a batchSize of 500-1000 is appropriate (the default is 100). A hedged TailDir Source configuration sketch follows.
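A minimal TailDir Source sketch, assuming an agent named a1 and hypothetical log paths and position file (the property names are standard Flume; the paths and the batchSize value are assumptions, not from the original notes):
# TailDir source: tails all files matching the pattern and records offsets in the position file
a1.sources.r1.type = TAILDIR
a1.sources.r1.positionFile = /opt/module/flume/taildir_position.json
a1.sources.r1.filegroups = f1
a1.sources.r1.filegroups.f1 = /opt/module/applog/log/app.*
# around 500-1000 events per batch when events are ~1 KB
a1.sources.r1.batchSize = 500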
Channel
Kafka Channel is adopted, which removes the need for a Sink and improves efficiency. Kafka Channel stores its data in Kafka, so the data is persisted on disk. Note that Kafka Channel was rarely used before Flume 1.7 because the parseAsFlumeEvent configuration did not work: whether parseAsFlumeEvent was set to true or false, the data was converted to a FlumeEvent. As a result, Kafka messages always had Flume headers mixed into the content, which is not what we want; we only need to write the content. A hedged Kafka Channel configuration sketch follows.
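A Kafka Channel sketch under the same assumptions (agent a1, hypothetical topic topic_log, and the hadoop102-104 brokers); setting parseAsFlumeEvent = false makes Flume write only the event body to Kafka, without Flume headers:
a1.channels.c1.type = org.apache.flume.channel.kafka.KafkaChannel
a1.channels.c1.kafka.bootstrap.servers = hadoop102:9092,hadoop103:9092,hadoop104:9092
a1.channels.c1.kafka.topic = topic_log
# false: write only the raw event body, no FlumeEvent headers
a1.channels.c1.parseAsFlumeEvent = false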
Business data collection platform
Synchronization policy (Critical)
Data synchronization strategies include full synchronization, incremental synchronization, new-and-changed synchronization, and special cases
- Full table: stores the complete data.
- Incremental table: stores the newly added data.
- Add and Change tables: Stores newly added data and changed data.
- Special tables: only need to be stored once.
Full synchronization policy
Incremental synchronization policy
New-and-changed synchronization policy
Special policy
Some special dimension tables may not follow the synchronization policy described above.
- Objective world dimension
Invariable dimensions of the objective world (such as gender, region, ethnicity, political composition, shoe size) can be stored with a single fixed value.
- The date dimension
The date dimension can import data for one year or several years at a time.
Null consistency between Hive and MySQL
Hive stores Null as "\N" on disk, while MySQL stores Null as NULL. To keep the data consistent, the following Sqoop parameters are needed.
- When exporting data (HDFS to MySQL), use the --input-null-string and --input-null-non-string parameters.
- When importing data (MySQL business data into HDFS through Sqoop), use the --null-string and --null-non-string parameters, as shown in the sketch below.
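A sketch of both directions, assuming a hypothetical MySQL database gmall, table order_info, password, and HDFS paths (none of these names come from the original notes); the Sqoop flags themselves are standard:
# export: HDFS -> MySQL, treat Hive's \N as SQL NULL
sqoop export \
--connect jdbc:mysql://hadoop102:3306/gmall \
--username root --password 000000 \
--table order_info \
--export-dir /warehouse/gmall/ads/ads_order_info \
--input-fields-terminated-by '\t' \
--input-null-string '\\N' \
--input-null-non-string '\\N'
# import: MySQL -> HDFS, write SQL NULL as \N so Hive reads it as NULL
sqoop import \
--connect jdbc:mysql://hadoop102:3306/gmall \
--username root --password 000000 \
--table order_info \
--target-dir /origin_data/gmall/db/order_info \
--fields-terminated-by '\t' \
--null-string '\\N' \
--null-non-string '\\N'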