CDH offline installation
I first tried CDH about three years ago, when I planned to use it as the company's big data platform. I was not skilled enough at the time, so I later switched to Ambari.
Cloudera recently merged with Hortonworks, whose HDP is the all-in-one big data distribution I have been using. This time I want to try CDH for a change.
There are only two test machines, ES01 and ES02. They are what we used to call the Elastic cluster.
Before the installation
Version compatibility between Cloudera Manager and CDH must be checked first.
For clusters of 3 to 10 machines, Cloudera publishes a recommended role layout. With only two machines, I just install ZooKeeper and CM to get a feel for it.
Hardware requirements
Disk space required by CM
Cloudera lists suggested memory and CPU values for each role (Agent, DataNode, HBase, and so on), which is very thorough. I never found equivalent guidance for HDP, though perhaps I was just careless. CDH is friendly in this respect.
I’m not going to focus on that. After all, it’s just a test environment.
Software requirements
Dependency packages
- Python: Cloudera Enterprise 6, with the exception of Hue, is supported on the Python version that the operating system ships by default, and on later versions, but it is not compatible with Python 3.0 or later. For example, Cloudera Enterprise 6 requires Python 2.6 or higher on RHEL 6 compatible operating systems, but Python 2.7 or higher on RHEL 7 compatible operating systems. Hue in CDH 6 requires Python 2.7 or higher on all operating systems. On RHEL 6 compatible operating systems running Hue, you must install Python 2.7 manually. Spark 2 requires Python 2.7 or higher. If the correct Python version is not picked up by default, set the PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON environment variables to point to the correct Python executable before running the pyspark command. Python 3 is not supported.
- Perl: Cloudera Manager requires Perl.
- python-psycopg2: Cloudera Manager 6 depends on the python-psycopg2 package. Hue in CDH 6 requires a higher version of psycopg2 than the Cloudera Manager dependency does. For more information, see Installing the psycopg2 Python package.
- iproute: Cloudera Enterprise 6 depends on the iproute package. It is required on every host that runs Cloudera Manager Agent, and the required version varies by operating system. CentOS 7.6 is used this time, and the corresponding version is iproute-3.10.
Operating system requirements
File system requirements
The Hadoop Distributed File System (HDFS) is designed to run on top of an underlying file system in the operating system. Cloudera recommends any of the following file systems on the supported operating systems:
- ext3: This is the most tested underlying file system for HDFS.
- ext4: This scalable extension of ext3 is supported in more recent Linux releases. Cloudera does not support in-place upgrades from ext3 to ext4, and recommends formatting disks as ext4 before using them as data directories.
- XFS: This is the default file system in RHEL 7.
- S3: Amazon Simple Storage Service
CDH also includes Kudu, a storage engine developed by Cloudera, which requires ext4 or XFS as its file system.
In addition, Cloudera recommends tuning the file system as follows.
File access time
Linux file systems keep metadata that records when each file was last accessed, which means that even a read results in a write to disk. To speed up file reads, Cloudera recommends disabling this behavior, called atime, with the noatime mount option in /etc/fstab:
/dev/sdb1 /data1 ext4 defaults,noatime 0
Apply the change without rebooting:
mount -o remount /data1
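To see which data mounts still lack the option, a small helper can scan an fstab-style file. This is only a sketch: the function name missing_noatime is my own, and it assumes the standard whitespace-separated fstab layout (device, mount point, type, options) and only looks at ext3, ext4, and XFS entries.

```shell
#!/usr/bin/env bash
# Print the mount point of every ext3/ext4/xfs entry whose options lack noatime.
# Hypothetical helper; defaults to /etc/fstab when no file is given.
missing_noatime() {
    local fstab="${1:-/etc/fstab}"
    awk '
        # Skip comment lines and lines too short to carry an options field.
        /^[ \t]*#/ || NF < 4 { next }
        $3 ~ /^(ext3|ext4|xfs)$/ {
            n = split($4, opts, ",")
            found = 0
            for (i = 1; i <= n; i++) if (opts[i] == "noatime") found = 1
            if (!found) print $2
        }
    ' "$fstab"
}
```

Running missing_noatime with no argument reads /etc/fstab; any mount point it prints is a candidate for adding noatime and remounting.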
File system mount options
File system mount options include a sync option that makes writes synchronous. Using the sync mount option degrades the performance of services that write data to disk, such as HDFS, YARN, Kafka, and Kudu. In CDH, most writes are already replicated, so synchronous writes to disk are unnecessary, expensive, and do not measurably improve stability. NFS and NAS mounts are not supported as DataNode data directories, even with the hierarchical storage feature.
nproc configuration
Cloudera Manager automatically sets the nproc configuration in /etc/security/limits.conf, but individual files in /etc/security/limits.d/ can override it. This can cause problems with Apache Impala and other components. Make sure the nproc limit is set high enough, for example 65536 or 262144.
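A quick way to spot such an override is to list every nproc line from both locations. The sketch below is my own helper (the name nproc_limits is hypothetical); it assumes the usual four-column limits format (domain, type, item, value) and takes the directory as an argument so it can be tried on a copy.

```shell
#!/usr/bin/env bash
# Print every nproc entry found in limits.conf and limits.d/*.conf,
# prefixed with the file it came from. Hypothetical helper.
nproc_limits() {
    local root="${1:-/etc/security}"
    local f
    for f in "$root/limits.conf" "$root"/limits.d/*.conf; do
        [ -f "$f" ] || continue
        # Ignore comments; the third column names the limited item.
        awk -v file="$f" '!/^[ \t]*#/ && $3 == "nproc" { print file ": " $0 }' "$f"
    done
}
```

If a file under limits.d shows a low value such as 4096, that is the entry to raise.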
Database requirements
I use MySQL here. For the other supported databases, see the official Cloudera documentation.
Java version requirements
Configure the network name
sudo hostnamectl set-hostname es01.example.com
[root@es01 ~]# cat /etc/hosts
127.0.0.1 localhost.localhost
::1 localhost.localdomain localhost
172.17.0.11 ES01.ljktest.com ES01
172.16.0.11 ES02.ljktest.com ES02
Edit /etc/sysconfig/network so that HOSTNAME contains only this host's FQDN:
[root@es01 ~]# cat /etc/sysconfig/network
# Created by cloud-init on instance boot automatically, do not edit.
#
NETWORKING=yes
HOSTNAME=es01.ljktest.com
Disable Firewall
sudo systemctl disable firewalld
sudo systemctl stop firewalld
Set the SELinux mode
- Check SELinux status
getenforce
If the output is Permissive or Disabled, you can skip this task. If the output is Enforcing, continue with the next step.
- Open the /etc/selinux/config file (on some systems, the /etc/sysconfig/selinux file) and change SELINUX=enforcing to SELINUX=permissive.
- Restart the system, or run the following command to switch SELinux to permissive mode immediately:
setenforce 0
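The edit to the SELinux config file can also be scripted. A minimal sketch, assuming the file contains a plain SELINUX=enforcing line; the function name selinux_set_permissive is my own, and it keeps a .bak copy next to the file.

```shell
#!/usr/bin/env bash
# Rewrite SELINUX=enforcing to SELINUX=permissive in the given config file.
# Hypothetical helper; a backup with a .bak suffix is left beside the file.
selinux_set_permissive() {
    local cfg="${1:-/etc/selinux/config}"
    sed -i.bak 's/^SELINUX=enforcing[[:space:]]*$/SELINUX=permissive/' "$cfg"
}
```

On a real host this would be run as root, for example selinux_set_permissive /etc/selinux/config, followed by setenforce 0 or a reboot.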
Enable the NTP service
This assumes the hosts can reach the Internet. On an intranet you need your own internal NTP server. You can also synchronize time with chrony as described in my other post; that keeps the clocks consistent within the cluster, but they may still differ from a real NTP server.
- Install the NTP package
yum install ntp
- Edit the /etc/ntp.conf file to add NTP servers, as shown in the following example:
server 0.pool.ntp.org
server 1.pool.ntp.org
server 2.pool.ntp.org
- Start the ntpd service
systemctl start ntpd
- Configure the ntpd service to start at boot
systemctl enable ntpd
- Synchronize the system clock to the NTP server
ntpdate -u <ntp_server>
- Synchronize the hardware clock with the system clock
hwclock --systohc
Development
Use the CDH 6 Maven repository
Configure the local package repository
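Install httpd
sudo yum install httpd
Edit the Apache configuration file (/etc/httpd/conf/httpd.conf by default) and add or edit the following line so that parcel files are served with the right type:
AddType application/x-gzip .gz .tgz .parcel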
sudo systemctl start httpd
Download the CM and CDH packages
The official instructions are to execute the following command on the server where HTTPD is installed
- CM
sudo mkdir -p /var/www/html/cloudera-repos
sudo wget --recursive --no-parent --no-host-directories https://archive.cloudera.com/cm6/6.3.0/redhat7/ -P /var/www/html/cloudera-repos
sudo wget https://archive.cloudera.com/cm6/6.3.0/allkeys.asc -P /var/www/html/cloudera-repos/cm6/6.3.0/
sudo chmod -R ugo+rX /var/www/html/cloudera-repos/cm6
- CDH
sudo mkdir -p /var/www/html/cloudera-repos
sudo wget --recursive --no-parent --no-host-directories https://archive.cloudera.com/cdh6/6.3.0/redhat7/ -P /var/www/html/cloudera-repos
sudo wget --recursive --no-parent --no-host-directories https://archive.cloudera.com/gplextras6/6.3.0/redhat7/ -P /var/www/html/cloudera-repos
sudo chmod -R ugo+rX /var/www/html/cloudera-repos/cdh6
sudo chmod -R ugo+rX /var/www/html/cloudera-repos/gplextras6
I downloaded the packaged archives directly with a download tool, because fetching them with the commands above was too slow. If your network is fast enough, I recommend the official method, mainly because it is less error-prone.
CM package CDH package GPL package
Downloading manually gives you the following files:
cm6.3.0-redhat7.tar.gz
allkeys.asc
CDH-6.3.0-1.cdh6.3.0.p0.1279813-el7.parcel
CDH-6.3.0-1.cdh6.3.0.p0.1279813-el7.parcel.sha1
manifest.json
GPLEXTRAS-6.3.0-1.gplextras6.3.0.p0.1279813-el7.parcel
GPLEXTRAS-6.3.0-1.gplextras6.3.0.p0.1279813-el7.parcel.sha1
manifest.json
Note that the allkeys.asc file must not be missing; otherwise installing the agent reports an error.
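Since the parcels were fetched by hand, it may be worth checking each one against its .sha1 file before serving it. The following is a rough sketch, assuming the .sha1 file holds the hex digest as its first field; sha1_of and verify_parcel are hypothetical helper names of my own.

```shell
#!/usr/bin/env bash
# Compute the SHA-1 digest of a file, using sha1sum or, failing that, shasum.
sha1_of() {
    if command -v sha1sum >/dev/null 2>&1; then
        sha1sum "$1" | awk '{ print $1 }'
    else
        shasum -a 1 "$1" | awk '{ print $1 }'
    fi
}

# Succeed only if <parcel>.sha1 exists, is non-empty, and matches the parcel.
verify_parcel() {
    local parcel="$1" expected actual
    expected=$(awk '{ print $1; exit }' "$parcel.sha1")
    actual=$(sha1_of "$parcel")
    [ -n "$expected" ] && [ "$expected" = "$actual" ]
}
```

For example, verify_parcel CDH-6.3.0-1.cdh6.3.0.p0.1279813-el7.parcel && echo OK prints OK only when the download is intact.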
Configure the internal repository
touch /etc/yum.repos.d/cloudera-repo.repo
Write the following content to it:
[cloudera-cm6.3.0]
name=cloudera-cm6.3.0
baseurl=http://49.234.43.99/cloudera-repos/cm6.3.0
gpgcheck=0
enabled=1
Install JDK 1.8
When I installed cloudera-manager-agent and cloudera-manager-server with the Oracle JDK, the JDK version did not match.
So I have to use the bundled OpenJDK instead.
yum install java-1.8.0-openjdk-devel
Install Cloudera Manager Packages
sudo yum install cloudera-manager-daemons cloudera-manager-agent cloudera-manager-server
MySQL installation
I originally had a local MySQL package, but the CDH documentation is so considerate that it even covers installing MySQL. Let's follow the documentation.
- Install MySQL
wget http://repo.mysql.com/mysql-community-release-el7-5.noarch.rpm
sudo rpm -ivh mysql-community-release-el7-5.noarch.rpm
sudo yum update
sudo yum install mysql-server
- Move the old InnoDB log files /var/lib/mysql/ib_logfile0 and /var/lib/mysql/ib_logfile1 out of /var/lib/mysql/ to a backup location.
mv ib_logfile0 ib_logfile1 ~/mysql_backup/
- Modify the MySQL configuration file my.cnf
- To prevent deadlocks, set the isolation level to READ-COMMITTED.
- Configure the InnoDB engine. Cloudera Manager will not start if its tables use the MyISAM engine. (Typically, tables revert to MyISAM if the InnoDB engine is misconfigured.) To check which engine your tables use, run SHOW TABLE STATUS; from the MySQL shell.
- The default settings in the MySQL installations of most distributions use conservative buffer sizes and memory usage. Cloudera Management Service roles need high write throughput because they may insert many records into the database. Cloudera recommends setting the innodb_flush_method property to O_DIRECT.
- Set the max_connections property according to the size of the cluster.
Fewer than 50 hosts: you can store more than one database (for example, both the Activity Monitor and the Service Monitor) on the same host. If you do this, put each database on its own storage volume, allow a maximum of 100 connections for each database, and then add 50 extra connections. For example, for two databases, set the maximum number of connections to 250. If you store five databases on one host (the databases for Cloudera Manager Server, Activity Monitor, Reports Manager, Cloudera Navigator, and Hive Metastore), set the maximum number of connections to 550.
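The sizing rule above is simple arithmetic: 100 connections per database plus 50 spare. As a small sketch (the function name mysql_max_connections is my own):

```shell
#!/usr/bin/env bash
# max_connections = 100 per database on the host + 50 extra connections.
mysql_max_connections() {
    local databases="$1"
    echo $(( $databases * 100 + 50 ))
}

mysql_max_connections 2   # prints 250
mysql_max_connections 5   # prints 550
```

The two calls reproduce the examples given for two and five databases on one host.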
Finally, Cloudera provides a recommended configuration, which is really nice.
```
[mysqld]
datadir=/var/lib/mysql
socket=/var/lib/mysql/mysql.sock
transaction-isolation = READ-COMMITTED
# Disabling symbolic-links is recommended to prevent assorted security risks;
# to do so, uncomment this line:
symbolic-links = 0
key_buffer_size = 32M
max_allowed_packet = 32M
thread_stack = 256K
thread_cache_size = 64
query_cache_limit = 8M
query_cache_size = 64M
query_cache_type = 1
max_connections = 550
#expire_logs_days = 10
#max_binlog_size = 100M
#log_bin should be on a disk with enough free space.
#Replace '/var/lib/mysql/mysql_binary_log' with an appropriate path for your
#system and chown the specified folder to the mysql user.
log_bin=/var/lib/mysql/mysql_binary_log
#In later versions of MySQL, if you enable the binary log and do not set
#a server_id, MySQL will not start. The server_id must be unique within
#the replicating group.
server_id=1
binlog_format = mixed
read_buffer_size = 2M
read_rnd_buffer_size = 16M
sort_buffer_size = 8M
join_buffer_size = 8M
# InnoDB settings
innodb_file_per_table = 1
innodb_flush_log_at_trx_commit = 2
innodb_log_buffer_size = 64M
innodb_buffer_pool_size = 4G
innodb_thread_concurrency = 8
innodb_flush_method = O_DIRECT
innodb_log_file_size = 512M
[mysqld_safe]
log-error=/var/log/mysqld.log
pid-file=/var/run/mysqld/mysqld.pid
sql_mode=STRICT_ALL_TABLES
```
- Start MySQL
systemctl start mysqld
- Set the MySQL root password and other security options
sudo /usr/bin/mysql_secure_installation
Thoughtfully, the documentation even lists the prompts you will encounter.
[...]
Enter current password for root (enter for none):
OK, successfully used password, moving on...
[...]
Set root password? [Y/n] Y
New password:
Re-enter new password:
Remove anonymous users? [Y/n] Y
[...]
Disallow root login remotely? [Y/n] N
[...]
Remove test database and access to it [Y/n] Y
[...]
Reload privilege tables now? [Y/n] Y
All done!
- Install the MySQL JDBC driver
Install the JDBC driver on the Cloudera Manager Server host and on any other host running services that require database access. For more information about which Cloudera software uses databases, see Required Databases. Cloudera recommends using only version 5.1 of the JDBC driver.
wget https://dev.mysql.com/get/Downloads/Connector-J/mysql-connector-java-5.1.46.tar.gz
tar zxvf mysql-connector-java-5.1.46.tar.gz
Copy the JDBC driver, renamed, to /usr/share/java/. If the target directory does not yet exist, create it. For example:
sudo mkdir -p /usr/share/java/
cd mysql-connector-java-5.1.46
sudo cp mysql-connector-java-5.1.46-bin.jar /usr/share/java/mysql-connector-java.jar
Create a database for Cloudera software
In theory, you need to create a database for every CDH component that requires one. For now, let's simply create the one for Cloudera Manager Server.
- Log in to MySQL
mysql -u root -p
- Create the database
CREATE DATABASE scm DEFAULT CHARACTER SET utf8 DEFAULT COLLATE utf8_general_ci;
GRANT ALL ON scm.* TO 'scm'@'%' IDENTIFIED BY 'scm';
- Verify that the commands above succeeded
SHOW DATABASES;
SHOW GRANTS FOR 'scm'@'%';
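When more components are added later, the same two statements have to be repeated for each database. A rough sketch that generates them from database:user pairs follows. The function name gen_db_sql is my own, the password is set to the user name only because the scm example does the same, and the GRANT ... IDENTIFIED BY form assumes a MySQL 5.x server like the one installed here.

```shell
#!/usr/bin/env bash
# Emit CREATE DATABASE and GRANT statements for each "database:user" argument.
# Hypothetical helper; passwords equal the user name and should be changed.
gen_db_sql() {
    local pair db user
    for pair in "$@"; do
        db="${pair%%:*}"      # text before the colon
        user="${pair##*:}"    # text after the colon
        printf "CREATE DATABASE %s DEFAULT CHARACTER SET utf8 DEFAULT COLLATE utf8_general_ci;\n" "$db"
        printf "GRANT ALL ON %s.* TO '%s'@'%%' IDENTIFIED BY '%s';\n" "$db" "$user" "$user"
    done
}
```

For instance, gen_db_sql scm:scm prints exactly the two statements shown above. Other pairs such as amon:amon or hue:hue are only illustrative names here; check the required databases list for the ones your services need, then pipe the output into mysql -u root -p.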
Set up the Cloudera Manager database
Cloudera Manager Server includes a script that creates and configures a database for itself. The script can:
- Create Cloudera Manager Server database configuration file.
- (MariaDB, MySQL, and PostgreSQL) Create and configure databases for Cloudera Manager Server to use.
- (MariaDB, MySQL, and PostgreSQL) Create and configure user accounts for Cloudera Manager Server.
/opt/cloudera/cm/schema/scm_prepare_database.sh mysql scm scm
Install CDH and other software
- Start cloudera-scm-server
sudo systemctl start cloudera-scm-server
- Wait a few minutes for Cloudera Manager Server to start. To observe the startup process, run the following command on the Cloudera Manager Server host:
sudo tail -f /var/log/cloudera-scm-server/cloudera-scm-server.log
When you see this log entry, the Cloudera Manager administration console is ready
INFO WebServerImpl:com.cloudera.server.cmf.WebServerImpl: Started Jetty server.
- In your web browser, go to http://<server_host>:7180, where <server_host> is the FQDN or IP address of the host running Cloudera Manager Server.
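Instead of watching the log by eye, the wait for the Jetty log entry can be scripted. This is a sketch only: wait_for_cm is a hypothetical name, the 600-second default timeout is an arbitrary choice of mine, and it simply polls the log file for the message quoted above.

```shell
#!/usr/bin/env bash
# Poll the server log until "Started Jetty server" appears.
# Returns 0 when found, 1 when the timeout (in seconds) expires first.
wait_for_cm() {
    local log="${1:-/var/log/cloudera-scm-server/cloudera-scm-server.log}"
    local timeout="${2:-600}"
    local waited=0
    until grep -q 'Started Jetty server' "$log" 2>/dev/null; do
        [ "$waited" -ge "$timeout" ] && return 1
        sleep 5
        waited=$((waited + 5))
    done
    return 0
}
```

A typical use would be wait_for_cm && echo "console ready on port 7180" right after starting the service.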
Distribute the CDH and GPL Extras parcels.
At this point, CDH has been installed offline successfully.
Some warnings during installation in the web interface
You will generally get two warning messages:
- Cloudera recommends setting /proc/sys/vm/swappiness to a maximum of 10. It is currently set to 30. Use the sysctl command to change this setting at run time and edit /etc/sysctl.conf so the setting persists after a reboot. You can continue with the installation, but Cloudera Manager may report that your hosts are unhealthy because they are swapping. The following hosts are affected:
echo 'vm.swappiness=10'>> /etc/sysctl.conf
sysctl -p
- Transparent Huge Page Compaction is enabled and can cause significant performance problems. Run "echo never > /sys/kernel/mm/transparent_hugepage/defrag" and "echo never > /sys/kernel/mm/transparent_hugepage/enabled" to disable it, and then add the same commands to an init script such as /etc/rc.local so that they are applied again when the system reboots. The following hosts are affected:
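To confirm that the transparent huge page change took effect, you can read back the active value. The kernel shows the active choice in square brackets, for example "always madvise [never]". A minimal sketch; the function name thp_active is my own.

```shell
#!/usr/bin/env bash
# Print the bracketed (active) value from a transparent_hugepage sysfs file.
thp_active() {
    sed -n 's/.*\[\(.*\)\].*/\1/p' "$1"
}
```

After running the two echo commands, both thp_active /sys/kernel/mm/transparent_hugepage/enabled and thp_active /sys/kernel/mm/transparent_hugepage/defrag should print never.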
Conclusion
Compared with Ambari, I personally feel that CDH is friendlier in terms of documentation quality.
Although I have not looked at the internal mechanisms in detail, the CDH documentation is clearly more thorough and shows more of the steps.
Of course, CDH is not open source, but custom services seem to be possible as well. I will try that later.
Appendix
- https://www.cloudera.com/documentation/enterprise/6/6.3.html
- CDH custom services
- CDH Package version information