Offline Data Analysis Based on EMR (Alibaba Cloud)

Scenario objectives

With data volumes growing explosively, digital transformation has become a hot topic in the IT industry, and data has to be mined more deeply for value to meet changing future needs. Large-scale offline data analysis applies to many kinds of business systems, for example log analysis for e-commerce, user behavior profiling, and offline computing tasks in scientific research.

In this scenario, you log in to an EMR Hadoop cluster and use Hive to load data and run calculations on it. The scenario demonstrates how to build elastic, low-cost offline big data analytics.

After completing this scenario, you will have learned:

1. Basic operations on an EMR cluster, giving you a preliminary understanding of the EMR product

2. Data transfer to an EMR cluster and simple Hive operations, giving you a preliminary grasp of how to conduct offline big data analysis

Background knowledge

E-MapReduce (EMR for short) is a cloud-based open source big data platform that provides easy-to-integrate open source computing and storage engines such as Hadoop, Hive, Spark, Flink, Presto, ClickHouse, Delta Lake, and Hudi. EMR computing resources can be scaled according to business needs. EMR can be deployed on ECS and ACK in the Alibaba Cloud public cloud and on private cloud platforms.

Product advantages

Open source ecosystem: Provides high-performance, stable versions of open source big data components such as Hadoop, Spark, Hive, Flink, Kafka, HBase, Presto, Impala, and Hudi, which customers can combine flexibly to suit their scenarios.

Engine optimization: Performance is optimized across multiple engines; for example, Spark SQL performance is 6 times that of the open source version. JindoFS combined with OSS ensures data reliability while greatly improving performance.

Convenient O&M: Clusters, nodes, and services can be monitored, operated, and maintained easily through the Alibaba Cloud console and OpenAPI. This greatly improves operation and maintenance efficiency and lets data engineers focus on business development.

Cost savings: Cluster resources can be matched to demand automatically, and you pay only for actual usage, which reduces the cost of wasted resources. Alibaba Cloud preemptible instances and reserved instance (RI) coupons are supported to reduce costs further.

Elastic resources: Cluster resources can be adjusted flexibly, and clusters based on ECS cloud servers or ACK containers can be created within a few minutes to respond quickly to business requirements.

Secure and reliable: Cluster network security policies are configured through VPC and security groups. Kerberos authentication, data encryption, and Ranger-based data access control are supported to keep data secure.

Log in to the cluster

(If you do not have an Alibaba Cloud cluster, you can try this scenario for free in the experience lab.)

Upload data to HDFS

1. Create the HDFS directory.

hdfs dfs -mkdir -p /data/student


2. Upload the file to the Hadoop file system.

A. Run the following command to download the sample data file to the server:

wget https://labfileapp.oss-cn-hangzhou.aliyuncs.com/%E5%85%AC%E5%85%B1%E6%96%87%E4%BB%B6/u.txt

B. Upload the downloaded file to the Hadoop file system.

hdfs dfs -put u.txt /data/student

3. View the file

hdfs dfs -ls /data/student
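Because the table created later in this walkthrough expects four tab-separated fields per line, it can be worth checking the downloaded file before uploading it. The following is a minimal sketch of such a check. count_bad_rows is a hypothetical helper, not an EMR or Hadoop command, and the sample rows are made up so that the sketch runs without the real u.txt.

```shell
#!/bin/sh
# Hypothetical helper (not part of EMR or Hadoop): print the number of lines
# in a file that do not have exactly the expected number of tab-separated fields.
count_bad_rows() {
    file="$1"
    expected="$2"
    awk -F '\t' -v n="$expected" 'NF != n { bad++ } END { print bad + 0 }' "$file"
}

# Made-up stand-in data: the first line has four fields, the second only three.
sample="$(mktemp)"
printf '1\t10\t5\t1000\n2\t10\t4\n' > "$sample"

count_bad_rows "$sample" 4
```

On the cluster, the same function could be pointed at the real file, for example count_bad_rows u.txt 4, before running hdfs dfs -put.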

Create tables using Hive

1. Start the Hive command-line client on the cluster.

[root@emr-header-1 ~]# hive
Logging initialized using configuration in file:/etc/ecm/hive-conf-2.3.7-1.1.7/hive-log4j2.properties Async: true
Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.

2. Create the user table.

CREATE TABLE emrusers (
   userid INT,
   movieid INT,
   rating INT,
   unixtime STRING
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t';

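3. Load the data from the Hadoop file system into the Hive table. Note that LOAD DATA INPATH without the LOCAL keyword moves the file from its HDFS location into the table's storage directory rather than copying it, so /data/student/u.txt will no longer be listed there afterwards.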

 LOAD DATA INPATH '/data/student/u.txt' INTO TABLE emrusers;
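The table creation and load statements above can also be run non-interactively. The following is a rough sketch that writes them to a script file and passes it to the Hive client with the -f option. The file name emrusers.hql and its location are arbitrary choices for illustration, and the script only calls hive where the client is actually installed.

```shell
#!/bin/sh
# Sketch: collect the walkthrough's Hive statements into a script file.
# The file name and location are arbitrary choices, not an EMR convention.
hql_file="${TMPDIR:-/tmp}/emrusers.hql"

# The quoted heredoc delimiter keeps '\t' literal, so Hive sees it unchanged.
cat > "$hql_file" <<'EOF'
CREATE TABLE emrusers (
   userid INT,
   movieid INT,
   rating INT,
   unixtime STRING
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t';

LOAD DATA INPATH '/data/student/u.txt' INTO TABLE emrusers;
EOF

# Run the script only where the hive client exists (for example on the
# cluster's header node); elsewhere just report that it was skipped.
if command -v hive >/dev/null 2>&1; then
    hive -f "$hql_file"
else
    echo "hive client not found; wrote $hql_file without running it"
fi
```

Keeping the statements in a file makes the steps repeatable. Running the script a second time would fail at CREATE TABLE because the table already exists.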

Perform operations on tables

1. View the table data.

select * from emrusers limit 5;

2. Count the rows in the table.

select count(*) from emrusers;

3. Find the three movies with the highest total rating, that is, the largest sum of all ratings each movie received.

select movieid,sum(rating) as rat from emrusers group by movieid order by rat desc limit 3;
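To cross-check the result of the last query, the same aggregation can be reproduced on the raw tab-separated file with standard shell tools. The following is a minimal sketch. top_movies is a hypothetical helper, the sample rows are made up, and breaking ties by movie ID is an added assumption (the Hive query itself does not define an order among ties).

```shell
#!/bin/sh
# Hypothetical helper: sum the rating column (field 3) per movie ID (field 2)
# and print the N movies with the largest totals as "movieid<TAB>total".
top_movies() {
    file="$1"
    limit="$2"
    tab="$(printf '\t')"
    awk -F '\t' '{ total[$2] += $3 } END { for (m in total) print m "\t" total[m] }' "$file" |
        sort -t "$tab" -k2,2nr -k1,1n |
        head -n "$limit"
}

# Made-up sample rows in the same layout as the table:
# userid, movieid, rating, unixtime.
sample="$(mktemp)"
printf '1\t10\t5\t1000\n2\t10\t4\t1001\n3\t20\t5\t1002\n1\t20\t1\t1003\n2\t30\t3\t1004\n3\t40\t2\t1005\n1\t10\t3\t1006\n' > "$sample"

top_movies "$sample" 3
```

This only makes sense for files small enough to process on one machine. For the full data set, the Hive query remains the intended tool.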
