1. Event Description: parse the contract content and send the result back to Hive for visualization. This involves the following steps; the third step is the main subject of this article.

  1. Add the contract file to HDFS;
  2. Read and parse the contract files;
  3. Send the parsing results back to Hive;
  4. Visualize the resulting data.

2. Operation Process

  1. Preparation

(1) Prepare a CSV file named test_wf_20210930.csv containing two fields. To reduce waiting time and operational complexity during verification, a simplified data file is prepared in advance to simulate the parsing results.

(2) Create the Hive target table test_wf_20210930:

create table XXX.test_wf_20210930(
name string,
tel string)
ROW FORMAT DELIMITED  
FIELDS TERMINATED BY ',';
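For reference, the data file contents might look like the following (these rows are illustrative placeholders, not real parsing output):

zhangsan,13800000001
lisi,13800000002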
  2. Authentication: HDFS read and write permissions are required; in this case, the hive user is used.
kinit -kt hive.service.keytab hive/<host>@<REALM>
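To confirm that a ticket was actually obtained, klist can be used to inspect the credential cache:

# Lists the cached Kerberos tickets for the current user
klist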
  3. Connect to HDFS
from hdfs.ext.kerberos import KerberosClient

try:
    keytab_file = './hive.service.keytab'  # keytab used in the kinit step above
    user = 'hive'                          # principal that owns the ticket
    host = "IP"                            # HDFS NameNode host
    hdfs_port = "50070"                    # WebHDFS port
    hdfs_url = 'http://' + host + ':' + hdfs_port
    # KerberosClient picks up the ticket obtained via kinit
    client = KerberosClient(hdfs_url)
    # List the HDFS root directory to verify the connection
    data = client.list('/')
    print(data)
except Exception as e:
    raise e
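Once connected, other calls from the same hdfs client library can be used to sanity-check the connection, for example:

# Show metadata (owner, permissions, type, ...) for the HDFS root directory
print(client.status('/'))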
  4. Upload the local file to HDFS
client.upload("/","/home/datascience/data/test_wf_20210930.csv")
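Note that upload fails if the target file already exists in HDFS; the library forwards an overwrite flag that replaces it instead (a minimal sketch):

# Overwrite the file in HDFS if it is already there
client.upload("/", "/home/datascience/data/test_wf_20210930.csv", overwrite=True)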

  5. Load the HDFS file into Hive

from pyhive.hive import connect
import os

keytab_file = 'hive.service.keytab'
user = 'hive'
host = 'IP'
port = 10000
# Obtain a Kerberos ticket before connecting
active_str = 'kinit -kt {0} {1}'.format(keytab_file, user)
os.system(active_str)
try:
    con = connect(host=host, port=port, database='XXX',
                  auth='KERBEROS', kerberos_service_name="hive")
    cursor = con.cursor()
    # Move the HDFS file into the target table, then read it back
    dim_sql = "load data inpath '/test_wf_20210930.csv' into table XXX.test_wf_20210930"
    dim_sql_select = "select * from XXX.test_wf_20210930"
    cursor.execute(dim_sql)
    cursor.execute(dim_sql_select)
    curinfo = cursor.fetchall()
    print(curinfo)
finally:
    cursor.close()
    con.close()
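If a DataFrame is more convenient for inspection, the result can also be pulled through pandas over the same kind of pyhive connection (a small sketch; pd.read_sql accepts a DB-API connection, though it may warn that only SQLAlchemy connectables are officially supported):

import pandas as pd
from pyhive.hive import connect

con = connect(host='IP', port=10000, database='XXX',
              auth='KERBEROS', kerberos_service_name="hive")
# Pull the freshly loaded table into a DataFrame
df = pd.read_sql("select * from XXX.test_wf_20210930", con)
print(df.head())
con.close()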
  6. Query the Hive table to check that the data is correct
select * from XXX.test_wf_20210930;
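A quick row count against the source file is another easy consistency check:

select count(*) from XXX.test_wf_20210930;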

3. Notes and Troubleshooting

  1. client.write vs. client.upload

client.write('/test_wf_20210930.csv', '123.csv') writes the string '123.csv' as the contents of the HDFS file /test_wf_20210930.csv, while client.upload("/", "/home/datascience/data/test_wf_20210930.csv") uploads the local file test_wf_20210930.csv to the HDFS root directory.
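In other words, write takes data while upload takes a local path; a short sketch of the distinction:

# write: the second argument is the data to store in the HDFS file
client.write('/test_wf_20210930.csv', '123.csv')

# upload: the second argument is a local file path to copy into HDFS
client.upload('/', '/home/datascience/data/test_wf_20210930.csv')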
  2. All the data written to the Hive table lands in one column, and the other columns are NULL.

This happens when the table was created with ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' but the data file is comma-separated; recreating the table with FIELDS TERMINATED BY ',' solves the problem.
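For example, the table can be rebuilt with the delimiter that matches the data file (the drop table if exists line is added here for convenience):

-- Recreate the table with a comma delimiter to match the CSV file
drop table if exists XXX.test_wf_20210930;
create table XXX.test_wf_20210930(
name string,
tel string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ',';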