1. Event Description: parse the contract content and send the result back to Hive for visualization. This involves the following steps; the third step is the main subject of this article.

  1. Add the contract file to HDFS;
  2. Read and parse the contract files;
  3. Send the parsing results back to Hive;
  4. Visualize the resulting data.

2. Operation Process

  1. Preparation

(1) Prepare a CSV file named test_wf_20210930.csv containing two fields. To reduce waiting time and operational complexity during verification, a simplified data file is prepared in advance to simulate the parsing results.

(2) Create the Hive target table test_wf_20210930:

create table XXX.test_wf_20210930(
name string,
tel string)
ROW FORMAT DELIMITED  
FIELDS TERMINATED BY ',';
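For reference, the data file contents might look like the following (these rows are illustrative placeholders, not real parsing output):

zhangsan,13800000001
lisi,13800000002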
  2. Authentication: HDFS read and write permissions are required; in this case, the hive user is used.
kinit -kt hive.service.keytab hive/<host>@<REALM>
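To confirm that a ticket was actually obtained, klist can be used to inspect the credential cache:

# Lists the cached Kerberos tickets for the current user
klist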
  3. Connect to HDFS
from hdfs.ext.kerberos import KerberosClient

try:
    keytab_file = './hive.service.keytab'  # keytab used in the kinit step above
    user = 'hive'                          # principal that owns the ticket
    host = "IP"                            # HDFS NameNode host
    hdfs_port = "50070"                    # WebHDFS port
    hdfs_url = 'http://' + host + ':' + hdfs_port
    # KerberosClient picks up the ticket obtained via kinit
    client = KerberosClient(hdfs_url)
    # List the HDFS root directory to verify the connection
    data = client.list('/')
    print(data)
except Exception as e:
    raise e
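Once connected, other calls from the same hdfs client library can be used to sanity-check the connection, for example:

# Show metadata (owner, permissions, type, ...) for the HDFS root directory
print(client.status('/'))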
  4. Upload the local file to HDFS
client.upload("/","/home/datascience/data/test_wf_20210930.csv")
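Note that upload fails if the target file already exists in HDFS; the library forwards an overwrite flag that replaces it instead (a minimal sketch):

# Overwrite the file in HDFS if it is already there
client.upload("/", "/home/datascience/data/test_wf_20210930.csv", overwrite=True)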

  5. Load the HDFS file into Hive

from pyhive.hive import connect
import os

keytab_file = 'hive.service.keytab'
user = 'hive'
host = 'IP'
port = 10000
# Obtain a Kerberos ticket before connecting
active_str = 'kinit -kt {0} {1}'.format(keytab_file, user)
os.system(active_str)
try:
    con = connect(host=host, port=port, database='XXX',
                  auth='KERBEROS', kerberos_service_name="hive")
    cursor = con.cursor()
    # Move the HDFS file into the target table, then read it back
    dim_sql = "load data inpath '/test_wf_20210930.csv' into table XXX.test_wf_20210930"
    dim_sql_select = "select * from XXX.test_wf_20210930"
    cursor.execute(dim_sql)
    cursor.execute(dim_sql_select)
    curinfo = cursor.fetchall()
    print(curinfo)
finally:
    cursor.close()
    con.close()
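If a DataFrame is more convenient for inspection, the result can also be pulled through pandas over the same kind of pyhive connection (a small sketch; pd.read_sql accepts a DB-API connection, though it may warn that only SQLAlchemy connectables are officially supported):

import pandas as pd
from pyhive.hive import connect

con = connect(host='IP', port=10000, database='XXX',
              auth='KERBEROS', kerberos_service_name="hive")
# Pull the freshly loaded table into a DataFrame
df = pd.read_sql("select * from XXX.test_wf_20210930", con)
print(df.head())
con.close()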
  6. Query the Hive table to check that the data is correct
select * from XXX.test_wf_20210930;
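A quick row count against the source file is another easy consistency check:

select count(*) from XXX.test_wf_20210930;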

3. Notes and Troubleshooting

  1. client.write vs. client.upload

client.write('/test_wf_20210930.csv', '123.csv') writes the string '123.csv' as the contents of the HDFS file /test_wf_20210930.csv, while client.upload("/", "/home/datascience/data/test_wf_20210930.csv") uploads the local file test_wf_20210930.csv to the HDFS root directory.
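In other words, write takes data while upload takes a local path; a short sketch of the distinction:

# write: the second argument is the data to store in the HDFS file
client.write('/test_wf_20210930.csv', '123.csv')

# upload: the second argument is a local file path to copy into HDFS
client.upload('/', '/home/datascience/data/test_wf_20210930.csv')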
  2. All the data written to the Hive table lands in one column, and the other columns are NULL.

This happens when the table was created with ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' but the data file is comma-separated; recreating the table with FIELDS TERMINATED BY ',' solves the problem.
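For example, the table can be rebuilt with the delimiter that matches the data file (the drop table if exists line is added here for convenience):

-- Recreate the table with a comma delimiter to match the CSV file
drop table if exists XXX.test_wf_20210930;
create table XXX.test_wf_20210930(
name string,
tel string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ',';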