What is a cluster client
A client is a machine (server) that can access the cluster, send data files to and retrieve them from the cluster, and run distributed jobs. The client is like a handle through which we operate the cluster. After the Hadoop and Spark (or MapReduce, or Storm) clusters are set up, if we need to send files to the cluster, fetch files from it, or execute MapReduce or Spark jobs, we usually build a peripheral cluster client and perform these operations on it, instead of working directly on the NameNode or DataNode of the cluster. The cluster-and-client structure is shown below (simplified, without considering NameNode high availability), and this article shows how to quickly set up a cluster client (sometimes called a gateway). Below is the structure of the Hadoop cluster and its client.
In the network configuration shown in the figure above, you can follow the principle that the cluster is open only to intranet access (because the servers in the cluster do not need to communicate with the external environment), while the client is open to extranet access. All access to and management of the cluster is done through the client.
Procedure for configuring a cluster client
<1> Configure hosts
The host name of the client is dc1 (short for DataClient1, 192.168.0.150), and the NameNode host name in the Hadoop cluster is hadoop01 (192.168.0.34). We only need to make the client and the NameNode know each other; the client does not need to know all the machines in the cluster. On dc1, edit the hosts file and add the NameNode:

# vim /etc/hosts
Add: 192.168.0.34 hadoop01

Then modify the hosts file on hadoop01 and add the client, so that the NameNode knows the client:

# vim /etc/hosts
Add: 192.168.0.150 dc1

If there are many servers in the data center, maintaining hosts files may not be convenient. In that case, deploy a Domain Name Service (DNS) server to resolve host names instead.
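To confirm that the two machines resolve each other's names, a quick check can be run on both sides (a minimal sketch, using the host names and IP addresses assumed above):

On dc1:
# ping -c 3 hadoop01
On hadoop01:
# ping -c 3 dc1

Each ping should report the IP address configured in the hosts file (192.168.0.34 and 192.168.0.150, respectively).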
<2> Configure passwordless SSH login
Add the public key in the client's ~/.ssh directory to the ~/.ssh/authorized_keys file on the NameNode server.
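A minimal sketch of how this can be done from dc1, assuming SSH on hadoop01 listens on port 60034 (the port used by the scp commands below) and that the root account is used, as in the HDFS paths later:

# ssh-keygen -t rsa
# ssh-copy-id -p 60034 root@hadoop01
# ssh -p 60034 hadoop01

The first command generates a key pair if none exists, the second appends the public key to ~/.ssh/authorized_keys on the NameNode, and the last one verifies that login now works without a password.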
<3> Copy the ~/.bashrc file (environment variable configuration file)
Next, you need to configure environment variables such as $HADOOP_HOME and $JAVA_HOME. Since they are already set on hadoop01, simply copy its ~/.bashrc to the client:

# scp -P 60034 ~/.bashrc dc1:~/.bashrc
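After copying, the variables can be loaded and spot-checked on dc1:

# source ~/.bashrc
# echo $JAVA_HOME
# echo $HADOOP_HOME

Both echo commands should print the same paths that are used on hadoop01.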
<4> Install Java and Hadoop
Installing Java and Hadoop is as simple as copying the Hadoop and Java folders from hadoop01 to dc1:

# scp -P 60034 -r $HADOOP_HOME dc1:$HADOOP_HOME
# scp -P 60034 -r $JAVA_HOME dc1:$JAVA_HOME

This is almost the same as installing and configuring a Hadoop cluster node, as if adding another machine to the cluster. The biggest difference is that the Hadoop processes (such as DataNode, NameNode, ResourceManager, and NodeManager) do not need to be run, nor do you need to run the start-dfs.sh/start-yarn.sh commands. Also, the $HADOOP_CONF_DIR/slaves file is not modified, so the machine does not join the cluster; it is used only as a client of the cluster. Because we copied the entire $HADOOP_HOME to dc1, which contains all the configuration files, no further configuration is required.
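A quick way to confirm the copies work on dc1 is to check the versions (assuming the ~/.bashrc copied in the previous step has been sourced):

# java -version
# hadoop version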
<5> Verify the installation
Since I already have some test files on the Hadoop cluster, I can verify that the client and the cluster work well together by downloading and uploading files through the command line.

1. Download a file from the cluster to the client:
# hdfs dfs -get /user/root/tmp/file1.txt ~/tmp

2. Rename the local copy and upload it back to the cluster:
# mv ~/tmp/file1.txt ~/tmp/file1_2.txt
# hdfs dfs -put ~/tmp/file1_2.txt /user/root/tmp
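To double-check the upload, the target directory can be listed from the client; the renamed file should appear alongside the original:

# hdfs dfs -ls /user/root/tmp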
<6> Summary
At this point, a simple Hadoop cluster client is set up. Besides performing HDFS file operations on the client, you can also run Hive on it, since Hive is itself a client tool. In addition, you can run the Spark Driver program on it, using the machine as the Spark cluster's client. Workers in the Spark cluster are usually deployed on the same servers as the DataNodes of HDFS to improve data access efficiency.
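For example, once Spark is also installed on dc1, a job could be submitted from the client roughly as follows (a sketch only; it assumes $SPARK_HOME is set on dc1, that Spark runs on the cluster's YARN, and it uses the SparkPi example jar that ships with Spark):

# spark-submit --master yarn --deploy-mode client --class org.apache.spark.examples.SparkPi $SPARK_HOME/examples/jars/spark-examples_*.jar 100

With --deploy-mode client, the Spark Driver runs on dc1 itself, which is exactly the client role described above.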