The feed

In the near future, we need to analyze a group of users and calculate the daily usage distribution of these users according to the time of registering the system and The Times of using the system, such as daily usage (1-1000) (1000-10000) (10,000-100000). These data are mainly in CSV format. In order to solve the problem that the CSV format automatically switches to the technical method when the number is too large, all data end with \ T delimiter by default. The data format is as follows.

operation

Pre-processing data [We provide several sample data]

> cd /opt/document/ > vi document.csv 10623429 ,74 ,2020-09-16 04:32:30 30428389 ,1899 ,2021-01-06 14:25:13 35315614 ,15000,2021-01-12 18:43:54 10472929,1406,2020-09-11 04:22:31 10502044,13000,2020-09-12 02:23:18 13829381,504 2020-10-22 00:50:35 17282257,488000,2020-10-26 23:46:29 7250404,7633,2020-06-15 03:08:20 7300069,464,2020-06-16 01:55:44 8192050,1424,2020-07-01 13:27:18 26185180,3434,2020-11-14 20:15:50 2020-11-05 08:20:58 17773641,2859,2020-10-27 07:04:42 18580457 7105481,11000,2020-06-13 06:08:29 12665058,1181000,2020-10-12 23:03:33 10908114,28000 In the 2020-09-22 21:02:29Copy the code

Replaces special characters in files

We refer to the following article to remove some special characters and unnecessary splits from CSV files using python scripts (why? Python requires no compilation and has powerful three-party library support). Replaces special characters in files.

Centos install python2.7 by default. Py import CSV import sys with open(sys.argv[1], 'r') as srcFile, open(sys.argv[2], 'w') as dstFile: fileReader = csv.reader(srcFile) fileWriter = csv.writer(dstFile) for data in list(fileReader): for i,d in enumerate(data): if d.find('\r\n') ! = -1: d = d.replace('\r\n', ' ') if d.find('\n') ! = -1: d = d.replace('\n', ' ') if d.find('\r') ! = -1: d = d.replace('\r', ' ') if d.find('\\') ! = -1: d = d.replace('\\', '') if d.find('\t') ! = -1: d = d.replace('\t', '') data[i] = d fileWriter.writerow(data) dstFile.close() srcFile.close()Copy the code

perform

# Set permissions and go to > chmod 777 filter.py > Python filter.py document.csv documents.csvCopy the code

Display the processed data

10623429,74,2020-09-16 04:32:30 30428389,1899,2021-01-06 14:25:13 35315614,15000,2021-01-12 18:43:54 10472929,1406,2020-09-11 04:22:31 10502044,13000,2020-09-12 02:23:18 13829384,504,2020-10-22 00:50:35 17282257, 488,000,2020-10-26 23:46:29 7250404,763,2020-06-15 03:08:20 7300069,464,2020-06-16 01:55:44 8192050,1424,2020-07-01 13:27:18 26185180,3434,2020-11-14 20:15:50 18928713,2998,2020-10-29 09:07:31 19155065,363,2020-10-29 19:22:26 23245377,14000,2020-11-05 08:20:58 17773641,2859,2020-10-27 07:04:42 18580457,351,2020-10-28 19:20:31 710548,11000,2020-06-13 06:08:29 12665058,1181000,2020-10-12 23:03:33 10908114280, 00202-09-22 21:02:29Copy the code

Convert the [simplified version of data] CSV data to TSV file format and remove the header

The default CSV format uses commas to separate columns. If the columns to be separated contain commas, column errors may occur. We refer to the following article to use Python scripts for file format conversion. How to convert a CSV file to a TSV file.

CSV documentss.csv less documentss.csv 30428389 1899 2021-01-06 14:25:13 35315614 15000 2021-01-12 18:43:54 10472929 1406 2020-09-11 04:22:31 10502044 13000 2020-09-12 02:23:18 13829381 504 2020-10-22 00:50:35 17282257 488000 2020-10-26 23:46:29 7633 2020-06-15 03:08:20 7300069 464 2020-06-16 01:55:44 8192050 1424 2020-07-01 13:27:18 2618185180 3434 2020-11-14 20:15:50 18928713 2998 2020-10-29 09:07:31 19155065 363 2020-10-29 19:22:26 23245377 14000 2020-11-05 08:20:58 17773641 2859 2020-10-27 07:04:42 18580457 72020-10-28 19:20:31 72020-09-22 11000 2020-06-13 06:08:29 12665058 11820-10-12 23:03:33 10908114 28000 2020-09-22 21:02:29Copy the code

Analyze the final table

Create temporary table

hive> CREATE TABLE Document(id String,count String,time String) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' STORED AS  TEXTFILE; hive> LOAD DATA LOCAL INPATH '/opt/document/documentss.csv' INTO TABLE document;Copy the code

Data import final table

hive> CREATE TABLE documents(id String,count String,time String,average DOUBLE) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' STORED AS TEXTFILE;

hive> insert overwrite table documents select id,count,time,count/datediff('2021-03-23',time) from document;
Copy the code

Analysis of the

> hive > SELECT CASE WHEN average < 100 THEN 'a' WHEN average >= 1000 THEN 'b' WHEN average < 2000 THEN 'c' WHEN average  < 3000 THEN 'd' ELSE average END AS `types`, count(*) FROM documents GROUP BY CASE WHEN average < 100 THEN 'a' WHEN average >= 1000 THEN 'b' WHEN average < 2000 THEN  'c' WHEN average < 3000 THEN 'd' ELSE average END ...... OK B 2 c 3 a 13 Time taken: 1.265 seconds, 3 row(s)Copy the code