Hadoop and its commands
A Hadoop cluster contains multiple nodes.
Large amounts of data, on the order of several petabytes: how do we store it?
HDFS focuses on distributed storage. A single machine cannot hold that much data, hence the concept of a cluster.
Clusters generally have an odd number of nodes, because an election mechanism is used.
Suppose you have a 300 MB file and a cluster of five machines. How is the file stored in this case?
Instead of storing the 300 MB file whole on a single node, HDFS splits it up.
A block is 128 MB by default, so the file is split into three blocks: 128 MB, 128 MB, and 44 MB.
The leader (NameNode) is in charge of the cluster and decides which nodes each block lives on. The exact allocation depends on the remaining space and the physical distance between the cluster nodes.
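As a quick sanity check of this arithmetic, the configured block size can be read from the cluster. A minimal sketch, assuming a running cluster; dfs.blocksize is the standard property name and the 300 MB file is just the example above:

hdfs getconf -confKey dfs.blocksize    # prints 134217728 by default, i.e. 128 MB
# ceil(300 MB / 128 MB) = 3 blocks: 128 MB + 128 MB + 44 MB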
HDFS:
The NameNode does not store file data; it handles only cluster-level coordination and metadata.
The DataNode reads and writes data.
The SecondaryNameNode assists the NameNode.
So HDFS has these three processes.
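Once the cluster is running, each of these daemons can be inspected individually. A minimal sketch using the --daemon option that the hdfs command itself documents (daemon names as in Hadoop 3.x); it reports whether the given daemon is running:

hdfs --daemon status namenode
hdfs --daemon status datanode
hdfs --daemon status secondarynamenode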
There are two main big data ecosystems:
Hadoop ecosystem: focuses on (1) distributed storage and (2) analytical computing; Hive is used for analysis.
Spark ecosystem: based on in-memory computing; it can be used for offline analysis or real-time computing.
Hadoop overview
Hadoop is a large-scale data processing framework.
Hadoop has three core components:
Distributed file system HDFS — handles big data storage
Distributed computing framework MapReduce — handles big data computation
Distributed resource management system Yarn — schedules and manages cluster resources
The relevant components are installed under the /opt/model folder.
Hadoop Configuration Modification
Version: Hadoop 3.2.1
Example: modify the mapping between the host name and the IP address
[hadoop@hadoop01 model]$ sudo vi /etc/hosts
127.0.0.1   localhost localhost.localdomain localhost4 localhost4.localdomain4
::1         localhost localhost.localdomain localhost6 localhost6.localdomain6
192.168.1.100 hadoop01    # change 192.168.1.100 to your own IP address
After the host mapping is changed, you must restart the system for it to take effect
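A quick way to confirm the mapping works (host name hadoop01 as configured above) is to resolve it from the shell:

ping -c 1 hadoop01       # should reach 192.168.1.100
getent hosts hadoop01    # should print the mapping taken from /etc/hosts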
Modify hadoop-env.sh

- Location: /opt/model/hadoop-3.2.1/etc/hadoop

[hadoop@hadoop01 hadoop]$ vi /opt/model/hadoop-3.2.1/etc/hadoop/hadoop-env.sh
# around line 37
export JAVA_HOME=/opt/model/jdk1.8/
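To check that the configured JDK path is valid and picked up, something like the following can be run (paths as in the example above):

ls /opt/model/jdk1.8/bin/java    # the java binary should exist at the configured JAVA_HOME
hadoop version                   # prints the Hadoop version if the environment is set up correctly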
HDFS Service Management
[Start the services]
[hadoop@hadoop01 hadoop-3.2.1]$ start-dfs.sh     # start HDFS
Starting namenodes on [hadoop01]
Starting datanodes
Starting secondary namenodes [hadoop01]
[hadoop@hadoop01 hadoop-3.2.1]$ start-yarn.sh    # start YARN resource management (allocates cluster resources)
Starting resourcemanager
Starting nodemanagers
[Verify the services]
[hadoop@hadoop01 hadoop-3.2.1]$ jps
2640 NodeManager            # belongs to YARN
2246 SecondaryNameNode      # belongs to HDFS
3034 Jps
2059 DataNode               # belongs to HDFS
2523 ResourceManager        # belongs to YARN
1934 NameNode               # belongs to HDFS
You can also confirm in the web UIs that HDFS and YARN started successfully.
[Verify HDFS]
The NameNode web UI listens on port 9870:
http://192.168.0.104:9870
The overview page shows one live node.
You can also browse the overall directory space.
These folders live under the root path.
[Verify YARN]
The ResourceManager web UI listens on port 8088:
http://192.168.0.104:8088
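Both web services can also be checked from the command line over their HTTP endpoints. A sketch, assuming the IP address above; WebHDFS and the YARN REST API are enabled by default:

curl "http://192.168.0.104:9870/webhdfs/v1/?op=LISTSTATUS"    # list the HDFS root directory via WebHDFS
curl "http://192.168.0.104:8088/ws/v1/cluster/info"           # basic YARN cluster information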
[Stop the services]
[hadoop@hadoop01 hadoop]$ stop-dfs.sh
[hadoop@hadoop01 hadoop]$ stop-yarn.sh
HDFS commands
Prerequisite: start-dfs.sh has been run.
HDFS common file commands
[hadoop@hadoop01 hadoop]$ hdfs --help
Usage: hdfs [OPTIONS] SUBCOMMAND [SUBCOMMAND OPTIONS]

  OPTIONS is none or any of:

--buildpaths                       attempt to add class files from build tree
--config dir                       Hadoop config directory
--daemon (start|status|stop)       operate on a daemon
--debug                            turn on shell script debug mode
--help                             usage information
--hostnames list[,of,host,names]   hosts to use in worker mode
--hosts filename                   list of hosts to use in worker mode
--loglevel level                   set the log4j level for this command
--workers                          turn on worker mode

  SUBCOMMAND is one of:

    Admin Commands:

cacheadmin           configure the HDFS cache
crypto               configure HDFS encryption zones
debug                run a Debug Admin to execute HDFS debug commands
dfsadmin             run a DFS admin client
dfsrouteradmin       manage Router-based federation
ec                   run a HDFS ErasureCoding CLI
fsck                 run a DFS filesystem checking utility
haadmin              run a DFS HA admin client
jmxget               get JMX exported values from NameNode or DataNode.
oev                  apply the offline edits viewer to an edits file
oiv                  apply the offline fsimage viewer to an fsimage
oiv_legacy           apply the offline fsimage viewer to a legacy fsimage
storagepolicies      list/get/set/satisfyStoragePolicy block storage policies

    Client Commands:

classpath            prints the class path needed to get the hadoop jar and the required libraries
dfs                  run a filesystem command on the file system    # commonly used
envvars              display computed Hadoop environment variables
fetchdt              fetch a delegation token from the NameNode
getconf              get config values from configuration
groups               get the groups which users belong to
lsSnapshottableDir   list all snapshottable dirs owned by the current user
snapshotDiff         diff two snapshots of a directory or diff the current directory contents with a snapshot
version              print the version

    Daemon Commands:

balancer             run a cluster balancing utility
datanode             run a DFS datanode
dfsrouter            run the DFS router
diskbalancer         Distributes data evenly among disks on a given node
journalnode          run the DFS journalnode
mover                run a utility to move block replicas across storage types
namenode             run the DFS namenode
nfs3                 run an NFS version 3 gateway
portmap              run a portmap service
secondarynamenode    run the DFS secondary namenode
sps                  run external storagepolicysatisfier
zkfc                 run the ZK Failover Controller daemon

SUBCOMMAND may print help when invoked w/o parameters or with -h.
[hadoop@hadoop01 hadoop]$
Commands that operate on the cluster file system are prefixed with hdfs dfs.
| Function | Command |
| --- | --- |
| List a directory | hdfs dfs -ls / |
| Create a directory | hdfs dfs -mkdir -p /input/weather/data |
| Delete a directory | hdfs dfs -rm -r /input/weather |
| Upload a file | hdfs dfs -put <local file> <cluster path> |
| Append content to a file | hdfs dfs -appendToFile <local file> <cluster file> |
| Download a file (to the current folder by default) | hdfs dfs -get <cluster file> |
| View file contents | hdfs dfs -cat <cluster file> |
# 1) Directory management
[hadoop@hadoop01 hadoop]$ hdfs dfs -ls /                 # list the root directory
Found 4 items
drwxr-xr-x   - hadoop supergroup          0 2021-01-23 14:51 /flume
drwxrwxrwx   - hadoop supergroup          0 2020-07-20 10:51 /hive
drwxrwxrwx   - hadoop supergroup          0 2020-07-20 17:02 /tmp
drwxrwxrwx   - hadoop supergroup          0 2020-09-15 15:50 /warehouse
[hadoop@hadoop01 hadoop]$ hdfs dfs -mkdir -p /input/weather/data     # create a multi-level directory
[hadoop@hadoop01 hadoop]$ hdfs dfs -rm -r /input/weather             # delete a directory
Deleted /input/weather

# 2) File management

# 2.1) Upload the local /opt/data/world.sql file to the /input directory of HDFS
[hadoop@hadoop01 data]$ hdfs dfs -put world.sql /input               # upload to the current default HDFS cluster

# 2.2) Upload a local file to an explicitly specified cluster
[hadoop@hadoop01 data]$ hdfs dfs -put market.sql hdfs://hadoop01:9000/input

# 2.3) Upload order.log to /input
[hadoop@hadoop01 data]$ echo "s10001,Zhang San,icebox,5000,1,2021-01-19" >> order.log
[hadoop@hadoop01 data]$ cat order.log
s10001,Zhang San,icebox,5000,1,2021-01-19
[hadoop@hadoop01 data]$ hdfs dfs -put order.log /input

# 2.4) A new log file is generated every day and appended to order.log
[hadoop@hadoop01 data]$ cat 2021_01_20_order.log
s10002,Zhang San,refrigerator,5000,2,2021-01-20
s10003,Li Si,washing machine,4000,1,2021-01-20
[hadoop@hadoop01 data]$ hdfs dfs -appendToFile 2021_01_20_order.log /input/order.log
# appends the local 2021_01_20_order.log file to /input/order.log on the cluster

# 2.5) View a file
[hadoop@hadoop01 data]$ hdfs dfs -cat /input/order.log
s10001,Zhang San,icebox,5000,1,2021-01-19
s10002,Zhang San,refrigerator,5000,2,2021-01-20
s10003,Li Si,washing machine,4000,1,2021-01-20

# 2.6) Download a cluster file to Linux
[hadoop@hadoop01 hdfs_data]$ hdfs dfs -get /input/order.log          # download to the current folder
[hadoop@hadoop01 hdfs_data]$ ll                                      # check the downloaded file
-rw-rw-r-- 1 hadoop hadoop 159 1月 20 16:49 order.log
[hadoop@hadoop01 hdfs_data]$ cat order.log                           # view the downloaded file
s10002,Zhang San,refrigerator,5000,2,2021-01-20
s10003,Li Si,washing machine,4000,1,2021-01-20
s10004,Wang Wu,color TV,6000,2,2021-01-20
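Two more commands that are often handy after uploading data report space usage; a sketch, with paths taken from the examples above:

hdfs dfs -du -h /input    # size of each file or directory under /input, in human-readable units
hdfs dfs -df -h /         # overall capacity and usage of the file system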
HDFS dfsadmin command
# View the cluster status
[hadoop@hadoop01 hdfs_data]$ hdfs dfsadmin -report
You can also check this directly in the web UI.
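A few other dfsadmin subcommands are useful for quick health checks; a sketch using standard subcommands of the hdfs dfsadmin client:

hdfs dfsadmin -safemode get      # shows whether the NameNode is in safe mode
hdfs dfsadmin -printTopology     # prints the rack / DataNode topology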
Viewing the data blocks of files uploaded to HDFS
For example, where does order.log actually live after being uploaded to the cluster?
Go to the dfs folder under the Hadoop data directory.
It contains two folders; under data there is a numbered block pool directory (one per block pool).
Change into the corresponding directory under the block pool to see the list of data blocks.
We can look at what is inside a data block; for example, the movie reviews uploaded yesterday.
cd subdir29/            # one of the block storage subdirectories
ll                      # list the block files (blk_*) and their metadata files
cat blk_1073749459      # print the raw contents of one block
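Instead of browsing these directories by hand, hdfs fsck can map a file to its block IDs and the DataNodes that hold them; a sketch using the order.log uploaded earlier:

hdfs fsck /input/order.log -files -blocks -locations
# prints each block ID (blk_...) of the file and the DataNodes storing its replicas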