I. Virtual machine environment

1.1 Configuring a static IP address for campus network NAT

Reference:

  • Ubuntu Server 20.04 LTS: configuring a static IP address in NAT mode
  • Win10: resolving ping errors

Note that the hotspot cannot be enabled while NIC sharing is active. Disable NIC sharing when you need the hotspot, then re-enable it afterwards.

The IP address assigned to the host VMnet8 NIC is 192.168.137.1/24, so the subnet of the VM's VMnet8 network is set to 192.168.137.0/24 and the gateway IP address is set to that of the host VMnet8 NIC.


1.2 Installing Ubuntu 20.04

Reference: Install ubuntu-20.04-live-server-amd64.iso for the VM

Mirror source choice: http://mirrors.aliyun.com/ubuntu.

Configure the static IP address, hostname, and mirror source during the installation to make subsequent modification easier.

Static IP

ip addr
vim /etc/netplan/00-installer-config.yaml
network:
  ethernets:
    ens33:
      addresses:
        - 192.168.137.100/24
      gateway4: 192.168.137.1
      nameservers:
        addresses:
          - 192.168.137.1
        search: []
  version: 2

One caveat: Vim's default indentation handling for YAML is not particularly good, so you may need to adjust the indentation manually.

netplan apply

hostname

hostnamectl set-hostname ubuntu
#Log back in

Mirror source

Reference: [Linux] Switching Ubuntu 20.04 to the Aliyun mirror source

vim /etc/apt/sources.list
deb http://mirrors.aliyun.com/ubuntu focal main restricted universe multiverse
deb http://mirrors.aliyun.com/ubuntu focal-updates main restricted  universe multiverse
deb http://mirrors.aliyun.com/ubuntu focal-backports main restricted universe multiverse
deb http://mirrors.aliyun.com/ubuntu focal-security main restricted universe multiverse

Enable root SSH login

#Setting the root Password
sudo passwd root

#Uncomment PermitRootLogin and change its value to yes
sudo vim /etc/ssh/sshd_config

#Restart the SSHD
sudo systemctl restart sshd

By default the root user's bash prompt is not color-highlighted:

sudo cp /etc/skel/.bashrc /root

#Uncomment force_color_prompt=yes
sudo vim /root/.bashrc

#Log back in

hosts

vim /etc/hosts
# ...
192.168.137.1   win10
192.168.137.101 node101
192.168.137.102 node102
192.168.137.103 node103

Note: to access HDFS normally through its web pages, you also need to configure hostname resolution on the host (C:\Windows\System32\drivers\etc\hosts).

Passwordless SSH login

# node101 node102 node103
ssh-keygen -t rsa
ssh-copy-id root@node101
ssh-copy-id root@node102
ssh-copy-id root@node103

Install OpenJDK 8

apt install openjdk-8-jdk

The default installation location is /usr/lib/jvm/java-8-openjdk-amd64.

vim /etc/profile
# ...
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
source /etc/profile && source ~/.bashrc


1.3 SSH and FTP Schemes

Windows Terminal

Documentation: Windows/Development Environment/Windows Terminal/Overview

Cascadia Code: github.com/microsoft/c…

To make up for the missing "send command to all windows" feature, a small script can be used instead:

vim ~/bin/cluster.sh
#!/bin/bash
case $1 in
    exec)
        for host in node101 node102 node103; do
            echo "========== ${host} =========="
            ssh "${host}" "${@:2}"
        done
        ;;
    rsync)
        for host in node101 node102 node103; do
            if [[ "$(hostname)" != "${host}" ]]; then
                echo "========== ${host} =========="
                rsync -a -r -v "$2" "${host}":"$2/../"
            fi
        done
        ;;
esac
chmod 777 ~/bin/cluster.sh
~/bin/cluster.sh exec "jps"
~/bin/cluster.sh rsync "$HADOOP_HOME/etc"

To transfer files between the host and the virtual machines you may need tools such as Xftp or MobaXterm, but there are other options as well.

sftp

cd <local_dir>

sftp <user>@<host>

cd <remote_dir>

put -r <local_dir>

get -r <remote_file>

Nginx file server + curl

# ...
http {
    # ...
    server {
        # ...
        root /usr/share/nginx/html;
        charset utf-8;
        # ...
        location /public {
            autoindex on;
            autoindex_localtime on;
            autoindex_exact_size off;
        }
    }
}
curl <url> -o <filename>

Docker

docker cp <local_file> <container_id | container_name>:<container_dir>

docker cp <container_id | container_name>:<container_file> <local_dir>


1.4 Other

Vim

vim /etc/vim/vimrc
" ...
filetype plugin indent on
set showcmd
set showmatch
set ignorecase
set smartcase
set incsearch
set autowrite
set hidden
set number
set ruler
set expandtab
set tabstop=4
set cursorline
set confirm
set hlsearch

Time zone

timedatectl set-timezone Asia/Shanghai

timedatectl


II. Hadoop

2.1 Installing hadoop-3.2.2

Documents:

  • Apache Hadoop 3.2.2 – Hadoop: Setting up a Single Node Cluster.
  • Apache Hadoop 3.2.2 – Hadoop Cluster Setup
curl https://mirrors.tuna.tsinghua.edu.cn/apache/hadoop/common/hadoop-3.2.2/hadoop-3.2.2.tar.gz -o hadoop-3.2.2.tar.gz
mkdir /opt/hadoop
tar -zxvf hadoop-3.2.2.tar.gz -C /opt/hadoop
vim /etc/profile
# ...
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
export HADOOP_HOME=/opt/hadoop/hadoop-3.2.2
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin


2.2 ZooKeeper cluster

You can run the following docker-compose file directly on the host:

version: '3'

services:
  zk1:
    image: zookeeper:3.7
    hostname: zk1
    ports:
      - "2181:2181"
    volumes:
      - ./zk1/data/:/data/
      - ./zk1/datalog/:/datalog/
    environment:
      ZOO_MY_ID: 1
      ZOO_SERVERS: server.1=0.0.0.0:2888:3888;2181 server.2=zk2:2888:3888;2181 server.3=zk3:2888:3888;2181

  zk2:
    image: zookeeper:3.7
    hostname: zk2
    ports:
      - "2182:2181"
    volumes:
      - ./zk2/data/:/data/
      - ./zk2/datalog/:/datalog/
    environment:
      ZOO_MY_ID: 2
      ZOO_SERVERS: server.1=zk1:2888:3888;2181 server.2=0.0.0.0:2888:3888;2181 server.3=zk3:2888:3888;2181

  zk3:
    image: zookeeper:3.7
    hostname: zk3
    ports:
      - "2183:2181"
    volumes:
      - ./zk3/data/:/data/
      - ./zk3/datalog/:/datalog/
    environment:
      ZOO_MY_ID: 3
      ZOO_SERVERS: server.1=zk1:2888:3888;2181 server.2=zk2:2888:3888;2181 server.3=0.0.0.0:2888:3888;2181

networks:
  default:
    external: true
    name: local_net
docker network create local_net

docker-compose up -d
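
To verify that the quorum is up, the following is a minimal Java sketch (assuming the org.apache.zookeeper:zookeeper client dependency is on the classpath; the connection string mirrors the ports mapped above):

import org.apache.zookeeper.ZooKeeper;

public class ZkQuorumCheck {
    public static void main(String[] args) throws Exception {
        // Connect to the three ZooKeeper instances exposed on the host (ports 2181-2183)
        ZooKeeper zk = new ZooKeeper("win10:2181,win10:2182,win10:2183", 30000, event -> {});
        // Give the session a moment to establish, then query the root znode
        Thread.sleep(2000);
        System.out.println("state    = " + zk.getState());
        System.out.println("children = " + zk.getChildren("/", false));
        zk.close();
    }
}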


2.3 Hadoop cluster

Configuration files

Default configuration files:

  • core-default.xml
  • hdfs-default.xml
  • hdfs-rbf-default.xml
  • mapred-default.xml
  • yarn-default.xml

Reference:

  • Hadoop: Why do I need to reconfigure JAVA_HOME in hadoop-env.sh?
  • HDFS_NAMENODE_USER, HDFS_DATANODE_USER & HDFS_SECONDARYNAMENODE_USER not defined
hadoop-env.sh
vim $HADOOP_HOME/etc/hadoop/hadoop-env.sh
# ...
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
# ...
export HDFS_NAMENODE_USER=root
export HDFS_SECONDARYNAMENODE_USER=root
export HDFS_DATANODE_USER=root
export HDFS_JOURNALNODE_USER=root
export HDFS_ZKFC_USER=root
export YARN_RESOURCEMANAGER_USER=root
export YARN_NODEMANAGER_USER=root
workers
vim $HADOOP_HOME/etc/hadoop/workers
node101
node102
node103
core-site.xml
vim $HADOOP_HOME/etc/hadoop/core-site.xml
<configuration>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://hdfs-cluster</value>
    </property>
    <property>
        <name>hadoop.tmp.dir</name>
        <value>/opt/hadoop/hadoop-3.2.2/tmp</value>
    </property>
    <property>
        <name>hadoop.http.staticuser.user</name>
        <value>root</value>
    </property>

    <!-- HDFS ZooKeeper address -->
    <property>
        <name>ha.zookeeper.quorum</name>
        <value>win10:2181,win10:2182,win10:2183</value>
    </property>

    <!-- YARN ZooKeeper address -->
    <property>
        <name>hadoop.zk.address</name>
        <value>win10:2181,win10:2182,win10:2183</value>
    </property>
</configuration>

Note: hadoop.tmp.dir cannot reference environment variables.

hdfs-site.xml
vim $HADOOP_HOME/etc/hadoop/hdfs-site.xml
<configuration>
    <!-- HDFS HA -->
    <property>
        <name>dfs.nameservices</name>
        <value>hdfs-cluster</value>
    </property>
    <property>
        <name>dfs.ha.namenodes.hdfs-cluster</name>
        <value>nn1,nn2,nn3</value>
    </property>

    <!-- NameNode RPC communication address -->
    <property>
        <name>dfs.namenode.rpc-address.hdfs-cluster.nn1</name>
        <value>node101:8020</value>
    </property>
    <property>
        <name>dfs.namenode.rpc-address.hdfs-cluster.nn2</name>
        <value>node102:8020</value>
    </property>
    <property>
        <name>dfs.namenode.rpc-address.hdfs-cluster.nn3</name>
        <value>node103:8020</value>
    </property>

    <!-- NameNode HTTP communication address -->
    <property>
        <name>dfs.namenode.http-address.hdfs-cluster.nn1</name>
        <value>node101:9870</value>
    </property>
    <property>
        <name>dfs.namenode.http-address.hdfs-cluster.nn2</name>
        <value>node102:9870</value>
    </property>
    <property>
        <name>dfs.namenode.http-address.hdfs-cluster.nn3</name>
        <value>node103:9870</value>
    </property>

    <!-- JournalNode -->
    <property>
        <name>dfs.namenode.shared.edits.dir</name>
        <value>qjournal://node101:8485;node102:8485;node103:8485/hdfs-cluster</value>
    </property>
    <property>
        <name>dfs.journalnode.edits.dir</name>
        <value>/opt/hadoop/hadoop-3.2.2/tmp/dfs/journalnode/</value>
    </property>

    <!-- Split-brain fencing -->
    <property>
        <name>dfs.ha.fencing.methods</name>
        <value>
            sshfence
            shell(/bin/true)
        </value>
    </property>
    <property>
        <name>dfs.ha.fencing.ssh.private-key-files</name>
        <value>/root/.ssh/id_rsa</value>
    </property>

    <!-- HDFS automatic failover -->
    <property>
        <name>dfs.ha.automatic-failover.enabled</name>
        <value>true</value>
    </property>
    <property>
        <name>dfs.client.failover.proxy.provider.hdfs-cluster</name>
        <value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
    </property>
</configuration>
mapred-site.xml
vim $HADOOP_HOME/etc/hadoop/mapred-site.xml
<configuration>
    <property>
        <name>yarn.app.mapreduce.am.env</name>
        <value>HADOOP_MAPRED_HOME=${HADOOP_HOME}</value>
    </property>
    <property>
        <name>mapreduce.map.env</name>
        <value>HADOOP_MAPRED_HOME=${HADOOP_HOME}</value>
    </property>
    <property>
        <name>mapreduce.reduce.env</name>
        <value>HADOOP_MAPRED_HOME=${HADOOP_HOME}</value>
    </property>

    <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
    </property>
</configuration>
yarn-site.xml
vim $HADOOP_HOME/etc/hadoop/yarn-site.xml
<configuration>
    <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
    </property>

    <!-- ResourceManager HA -->
    <property>
        <name>yarn.resourcemanager.ha.enabled</name>
        <value>true</value>
    </property>

    <property>
        <name>yarn.resourcemanager.cluster-id</name>
        <value>yarn-cluster</value>
    </property>
    <property>
        <name>yarn.resourcemanager.ha.rm-ids</name>
        <value>rm1,rm2,rm3</value>
    </property>

    <property>
        <name>yarn.resourcemanager.hostname.rm1</name>
        <value>node101</value>
    </property>
    <property>
        <name>yarn.resourcemanager.hostname.rm2</name>
        <value>node102</value>
    </property>
    <property>
        <name>yarn.resourcemanager.hostname.rm3</name>
        <value>node103</value>
    </property>

    <property>
        <name>yarn.resourcemanager.webapp.address.rm1</name>
        <value>node101:8088</value>
    </property>
    <property>
        <name>yarn.resourcemanager.webapp.address.rm2</name>
        <value>node102:8088</value>
    </property>
    <property>
        <name>yarn.resourcemanager.webapp.address.rm3</name>
        <value>node103:8088</value>
    </property>

    <!-- ResourceManager automatic recovery -->
    <property>
        <name>yarn.resourcemanager.recovery.enabled</name>
        <value>true</value>
    </property>
    <property>
        <name>yarn.resourcemanager.store.class</name>
        <value>org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore</value>
    </property>
</configuration>

Log aggregation

mapred-site.xml
<configuration>
    <!-- ... -->

    <!-- JobHistoryServer -->
    <property>
        <name>mapreduce.jobhistory.address</name>
        <value>node101:10020</value>
    </property>
    <property>
        <name>mapreduce.jobhistory.webapp.address</name>
        <value>node101:19888</value>
    </property>
</configuration>
yarn-site.xml
<configuration>
    <!-- ... -->

    <!-- Log aggregation -->
    <property>
        <name>yarn.log-aggregation-enable</name>
        <value>true</value>
    </property>
    <property>
        <name>yarn.log.server.url</name>
        <value>http://node101:19888/jobhistory/logs</value>
    </property>
    <!-- 3600 * 24 * 7 -->
    <property>
        <name>yarn.log-aggregation.retain-seconds</name>
        <value>604800</value>
    </property>
</configuration>

Synchronizing configuration files

# node101
rsync -a -r -v $HADOOP_HOME/etc node102:$HADOOP_HOME
rsync -a -r -v $HADOOP_HOME/etc node103:$HADOOP_HOME

Initializing the cluster

#Start the JournalNode
# node101 node102 node103
hdfs --daemon start journalnode

#Format the NameNode
# node101
hdfs namenode -format

#Synchronize NameNode metadata
# node101
hdfs --daemon start namenode
# node102 node103
hdfs namenode -bootstrapStandby

#Format the ZKFailoverController (zkfc)
# node101
hdfs zkfc -formatZK

Delete tmp and logs before reformatting:

rm -rf $HADOOP_HOME/tmp $HADOOP_HOME/logs

Start the cluster

Reference: In Hadoop HA mode, after the active NameNode is killed, the standby NameNode fails to take over automatically

# node101
start-dfs.sh
start-yarn.sh
mapred --daemon start historyserver

Start separately:

hdfs --daemon <start|stop> namenode
hdfs --daemon <start|stop> secondarynamenode
hdfs --daemon <start|stop> datanode

yarn --daemon <start|stop> resourcemanager
yarn --daemon <start|stop> nodemanager

mapred --daemon <start|stop> historyserver

Check the HA status:

hdfs haadmin -getAllServiceState

yarn rmadmin -getAllServiceState

WordCount

cd $HADOOP_HOME
mkdir input
echo -e "i keep saying no\nthis can not be the way it was supposed to be\ni keep saying no\nthere has gotta be a way to get you close to me" > input/word.txt
hadoop fs -mkdir /input
hadoop fs -put input/word.txt /input
hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-3.2.2.jar wordcount /input /output


2.4 HDFS

HDFS Architecture

NameNode
  • Manages the HDFS namespace
  • Configures the replica policy
  • Manages data block mapping information
  • Handles client read/write requests
DataNode
  • Stores the actual data blocks
  • Performs read/write operations on data blocks
SecondaryNameNode
  • Assists the NameNode and shares its workload, for example by periodically merging the Fsimage and Edits and pushing the result to the NameNode
  • In an emergency, it can help recover the NameNode
Client
  • File splitting: when uploading a file to HDFS, the Client splits it into blocks and then uploads them
  • Interacts with NameNode to obtain file location information
  • Interacts with Datanodes to read or write data
  • The Client provides several commands to manage HDFS, such as NameNode formatting
  • The Client can use some commands to access the HDFS, such as adding, deleting, modifying, and querying HDFS

Common HDFS Commands

Documents: FileSystemShell

#Move (cut) and upload
hadoop fs -moveFromLocal <local_file> <hdfs_dir>

#Copy and upload
hadoop fs -copyFromLocal <local_file> <hdfs_dir>
hadoop fs -put <local_file> <hdfs_dir>

#Append to an HDFS file
hadoop fs -appendToFile <local_file> <hdfs_file>

#Download
hadoop fs -copyToLocal <hdfs_file> <local_dir>
hadoop fs -get <hdfs_file> <local_dir>

#Set the replication factor
hadoop fs -setrep <replication> <hdfs_file>

HDFS Java API

Install Hadoop 3.3.0 on Windows 10 Step by Step Guide

Download: github.com/kontext-tec…

pom.xml:

<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-client</artifactId>
    <version>3.2.2</version>
</dependency>

Check file status:

public static void main(String[] args) throws IOException, InterruptedException {
    FileSystem fs = FileSystem.get(URI.create("hdfs://node101:8020"), new Configuration(), "root");
    FileStatus status = fs.getFileStatus(new Path("/input/word.txt"));
    System.out.println(status);
    fs.close();
}
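
A similar sketch for uploading a local file and reading it back (hypothetical local path; same hadoop-client dependency and root user as above):

import java.io.IOException;
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsReadWrite {
    public static void main(String[] args) throws IOException, InterruptedException {
        FileSystem fs = FileSystem.get(URI.create("hdfs://node101:8020"), new Configuration(), "root");

        // Upload a local file into /input
        fs.copyFromLocalFile(new Path("word.txt"), new Path("/input/word.txt"));

        // Read it back and print the contents to stdout
        try (FSDataInputStream in = fs.open(new Path("/input/word.txt"))) {
            IOUtils.copyBytes(in, System.out, 4096, false);
        }

        fs.close();
    }
}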


2.5 MapReduce

MapReduce processes

A complete MapReduce program runs in distributed mode with three types of instance processes:

  • ApplicationMaster: Responsible for process scheduling and state coordination of the entire program
  • MapTask: Responsible for the whole data processing process of the Map phase
  • ReduceTask: Responsible for the entire data processing process in the Reduce phase

MapTask working mechanism

  1. Read phase

    The RecordReader obtained by MapTask in InputFormat is used to parse KV from InputSplit.

  2. The Map phase

    In this stage, the resolved KV is handed over to the map() function written by the user for processing, and a series of new KV are generated.

  3. Collect phase

    In the user-written map() function, outputCollector.collect() is typically called to output the results when the data processing is complete.

    Inside collect(), the generated KV pairs are partitioned and written to a ring buffer (a minimal Partitioner sketch follows this list).

  4. Spill phase

    When the ring cache is full, MapReduce writes data to the local disk to generate a temporary file.

    Note that before data is written to the local disk, the data must be sorted locally and merged or compressed if necessary.
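
The partitioning done in the Collect phase is pluggable. Below is a minimal Java sketch that mirrors the default HashPartitioner behaviour (the class name is made up for illustration; registering it is optional, since this is already the default):

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Records with the same key always get the same partition number,
// so they end up in the same ReduceTask.
public class WordPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numPartitions) {
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
}

// In the driver:
// job.setPartitionerClass(WordPartitioner.class);
// job.setNumReduceTasks(3);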

ReduceTask Working mechanism

  1. Copy stage

    ReduceTask Remotely copies a piece of data from each MapTask. If the size exceeds a certain threshold, the data is written to the disk; otherwise, the data is directly stored in the memory.

  2. Sort stage

    When remotely copying data, ReduceTask starts two background threads to merge files on the memory and disk to prevent excessive memory usage or excessive files on the disk.

    According to the MapReduce semantics, the input data for the user-written Reduce () function is a set of data aggregated by Key.

    To bring together records that share the same key, Hadoop uses a sort-based strategy.

    Since each MapTask has implemented local sorting of its own processing results, the ReduceTask only needs to perform a merge sort for all data.

  3. Reduce phase

    The reduce() function writes the calculated results to HDFS.

WordCount

import java.io.IOException;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
    public static void main(String[] args) throws IOException, InterruptedException, ClassNotFoundException {
        Job job = Job.getInstance();

        job.setJarByClass(WordCount.class);

        job.setMapperClass(WordCount.WordCountMapper.class);
        job.setReducerClass(WordCount.WordCountReducer.class);

        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(IntWritable.class);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.setInputPaths(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        boolean successful = job.waitForCompletion(true);
        System.exit(successful ? 0 : 1);
    }

    public static class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        @Override
        protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
            String line = value.toString();
            String[] words = line.split(" ");
            for (String word : words) {
                context.write(new Text(word), new IntWritable(1));
            }
        }
    }

    public static class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
            int count = 0;
            for (IntWritable value : values) {
                count += value.get();
            }
            context.write(key, new IntWritable(count));
        }
    }
}

Package upload execution:

#Delete the /output directory
hadoop fs -rmdir --ignore-fail-on-non-empty /output
#Execute
hadoop jar word-count-0.0.1.jar xyz.icefery.mr.wc.WordCount /input/word.txt /output


2.6 YARN

YARN Architecture

ResourceManager
  • Handles client requests
  • Monitors NodeManagers
  • Starts and monitors ApplicationMasters
  • Allocates and schedules resources
NodeManager
  • Manages resources on a single node
  • Processes commands from the ResourceManager
  • Processes commands from the ApplicationMaster
ApplicationMaster
  • Requests resources for the application and assigns them to internal tasks
  • Monitors tasks and handles fault tolerance
Container
  • Container is a resource abstraction in YARN. It encapsulates multi-dimensional resources on a node, such as memory, CPU, disk, and network resources

YARN working mechanism

  1. The MR program is submitted to the node where the client resides
  2. The YarnRunner applies for an Application from ResourceManager
  3. ResourceManager returns the resource path of the application to YarnRunner
  4. The program submits the required resources to the HDFS
  5. After application resources are submitted, apply to run ApplicationMaster
  6. ResourceManager initializes user requests into a Task
  7. One NodeManager receives the Task
  8. The NodeManager creates the Container and generates the ApplicationMaster
  9. The Container copies resources from HDFS to the local node
  10. ApplicationMaster applies to ResourceManager for resources to run MapTasks
  11. ResourceManager assigns the MapTasks to two other NodeManagers, which each receive the task and create a Container
  12. The MR program sends startup scripts to the two NodeManagers that received the tasks, and each of them starts a MapTask, which partitions the data
  13. ApplicationMaster waits for all MapTasks to finish, then applies to ResourceManager for containers and runs the ReduceTasks
  14. Each ReduceTask obtains the data of its partition from the MapTasks
  15. After the program finishes, the MR program asks ResourceManager to deregister itself
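
The client-to-ResourceManager interaction above goes through the YARN client API. As a minimal sketch (assuming hadoop-yarn-client is on the classpath and yarn-site.xml is readable from it), the applications currently tracked by the ResourceManager can be listed like this:

import java.util.List;
import org.apache.hadoop.yarn.api.records.ApplicationReport;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class YarnApps {
    public static void main(String[] args) throws Exception {
        // Picks up yarn-site.xml from the classpath (e.g. $HADOOP_HOME/etc/hadoop)
        YarnClient yarnClient = YarnClient.createYarnClient();
        yarnClient.init(new YarnConfiguration());
        yarnClient.start();

        // Ask the active ResourceManager for all applications it knows about
        List<ApplicationReport> apps = yarnClient.getApplications();
        for (ApplicationReport app : apps) {
            System.out.println(app.getApplicationId() + "\t" + app.getName() + "\t" + app.getYarnApplicationState());
        }

        yarnClient.stop();
    }
}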

III. Hive

3.1 Installing hive-3.1.2

Reference: Hive startup error: java.lang.NoSuchMethodError: com.google.common.base.Preconditions.checkArgument

curl https://mirrors.tuna.tsinghua.edu.cn/apache/hive/hive-3.1.2/apache-hive-3.1.2-bin.tar.gz -o apache-hive-3.1.2-bin.tar.gz
mkdir /opt/hive
tar -zxvf apache-hive-3.1.2-bin.tar.gz -C /opt/hive

Environment variables:

# ...
export HIVE_HOME=/opt/hive/apache-hive-3.1.2-bin
export PATH=$PATH:$HIVE_HOME/bin

Jar package conflict:

rm -rf $HIVE_HOME/lib/guava-19.0.jar
cp $HADOOP_HOME/share/hadoop/common/lib/guava-27.0-jre.jar $HIVE_HOME/lib


3.2 QuickStart

MySQL

Start MySQL on the host and create the Hive metadata database:

create database hive;

Add a driver to the Hive lib directory:

curl https://repo1.maven.org/maven2/mysql/mysql-connector-java/8.0.24/mysql-connector-java-8.0.24.jar -o $HIVE_HOME/lib/mysql-connector-java-8.0.24.jar

Configuration files

Reference:

  • Hive: HiveServer2/Beeline startup error "User: root is not allowed to impersonate root"
  • Apache Hadoop 3.2.2 – Proxy user-superusers Acting On Behalf Of Other Users
core-site.xml
vim $HADOOP_HOME/etc/hadoop/core-site.xml
<configuration>
    <!-- ... -->
    <property>
        <name>hadoop.proxyuser.root.hosts</name>
        <value>*</value>
    </property>
    <property>
        <name>hadoop.proxyuser.root.groups</name>
        <value>*</value>
    </property>
</configuration>
hive-site.xml
vim $HIVE_HOME/conf/hive-site.xml
<configuration>
    <property>
        <name>javax.jdo.option.ConnectionDriverName</name>
        <value>com.mysql.cj.jdbc.Driver</value>
    </property>
    <property>
        <name>javax.jdo.option.ConnectionURL</name>
        <value>jdbc:mysql://win10:3306/hive</value>
    </property>
    <property>
        <name>javax.jdo.option.ConnectionUserName</name>
        <value>root</value>
    </property>
    <property>
        <name>javax.jdo.option.ConnectionPassword</name>
        <value>root</value>
    </property>

    <property>
        <name>hive.metastore.uris</name>
        <value>thrift://node101:9083</value>
    </property>

    <property>
        <name>hive.server2.thrift.bind.host</name>
        <value>node101</value>
    </property>
</configuration>

Initialize the metadata database

schematool -initSchema -dbType mysql

Start the MetaStore

nohup hive --service metastore 1>/dev/null 2>&1 &
#Stop it later with jps and kill -9

Hive Shell

create database test;
use test;
create table student (
    name      string,
    gender    tinyint,
    deskmates array<string>,
    score     struct<chinese:int, math:int, english:int>,
    refs      map<string, string>
) row format delimited fields terminated by '|' collection items terminated by ',' map keys terminated by ':' lines terminated by '\n';

The insert statement:

insert into student values ('leader', 1, array('Vivian jade', 'fang'), named_struct('chinese', 90, 'math', 90, 'english', 90), map('height', '165', 'weight', '55', 'eyesight', '0.2'));

File import:

vim ~/student.txt
fang|2|hierarch|120,110,120|height:165,weight:50,eyesight:1.0
Vivian jade|2|hierarch|110,120,120|height:160,weight:45,eyesight:0.2
load data local inpath '/root/student.txt' into table student;
select * from student;
dfs -ls /user/hive/warehouse;

JDBC access to Hive

#Start the hiveserver2
nohup hive --service hiveserver2 1>/dev/null 2>&1 &

The WEB interface is http://node101:10002

Connect using the Beeline client:

beeline -u jdbc:hive2://node101:10000 -n root

DataGrip also supports Hive connections.
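
The same endpoint can also be reached from plain JDBC. A minimal Java sketch (assuming the org.apache.hive:hive-jdbc dependency and the test.student table created above):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveJdbcDemo {
    public static void main(String[] args) throws Exception {
        // Not strictly required with JDBC 4 drivers, but makes the driver dependency explicit
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        try (Connection conn = DriverManager.getConnection("jdbc:hive2://node101:10000/test", "root", "");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery("select name, score.math from student")) {
            while (rs.next()) {
                System.out.println(rs.getString(1) + "\t" + rs.getInt(2));
            }
        }
    }
}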

View logs:

tail -n 300 /tmp/root/hive.log


IV. HBase

4.1 Installing hbase-2.3.5

curl https://mirrors.tuna.tsinghua.edu.cn/apache/hbase/2.3.5/hbase-2.3.5-bin.tar.gz -o hbase-2.3.5-bin.tar.gz
mkdir -p /opt/hbase
tar -zxvf hbase-2.3.5-bin.tar.gz -C /opt/hbase

Environment variables:

# ...
export HBASE_HOME=/opt/hbase/hbase-2.3.5
export PATH=$PATH:$HBASE_HOME/bin


4.2 HBase cluster

Configuration files

HDFS configuration file soft links
ln -s $HADOOP_HOME/etc/hadoop/core-site.xml $HBASE_HOME/conf/core-site.xml
ln -s $HADOOP_HOME/etc/hadoop/hdfs-site.xml $HBASE_HOME/conf/hdfs-site.xml
hbase-env.sh
# ...
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
# ...
export HBASE_MANAGES_ZK=false
# ...
export HBASE_DISABLE_HADOOP_CLASSPATH_LOOKUP=true
# ...
regionservers
node101
node102
node103
backup-masters
node102
node103
hbase-site.xml
<configuration>
    <property>
        <name>hbase.cluster.distributed</name>
        <value>true</value>
    </property>
    <property>
        <name>hbase.rootdir</name>
        <value>hdfs://hdfs-cluster/hbase</value>
    </property>
    <property>
        <name>hbase.zookeeper.quorum</name>
        <value>win10:2181,win10:2182,win10:2183</value>
    </property>
</configuration>

Start the cluster

# node101
start-hbase.sh

The web UIs are at http://node101:16010 (HMaster) and http://node101:16030 (RegionServer)

HBase Shell

#Create a table
create 'student', 'info', 'score'

#List the table
list

#Viewing table structure
describe 'student'
#Insert data
put 'student', '1', 'info:name', 'icefery'
put 'student', '1', 'info:gender', '1'
put 'student', '1', 'score:math', '120'
put 'student', '2', 'info:name', 'fang'
put 'student', '2', 'info:gender', '2'
put 'student', '2', 'score:math', '110'
put 'student', '3', 'info:name', 'wenyu'
put 'student', '3', 'info:gender', '2'
put 'student', '3', 'score:math', '120'
#Scan for table data
scan 'student'

#Count table rows
count 'student'

#View specified rows
get 'student', '1'
get 'student', '1', 'info:name'
#Delete the specified row
delete 'student', '1', 'info:name'
deleteall 'student', '1'

#Disable the table
disable 'student'
#Delete table
drop 'student'
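
The same operations are available from the HBase Java client. A minimal sketch (assuming the org.apache.hbase:hbase-client dependency; the quorum mirrors hbase-site.xml above, and the row key and value are hypothetical):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseClientDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        conf.set("hbase.zookeeper.quorum", "win10:2181,win10:2182,win10:2183");

        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("student"))) {

            // Equivalent of: put 'student', '4', 'info:name', 'someone'
            Put put = new Put(Bytes.toBytes("4"));
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("someone"));
            table.put(put);

            // Equivalent of: get 'student', '2', 'info:name'
            Result result = table.get(new Get(Bytes.toBytes("2")));
            System.out.println(Bytes.toString(result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"))));
        }
    }
}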