Intersection concatenation in Python

This approach suits intersecting and concatenating two or more small files. Each file is loaded into its own object, the shared key is used as the key of that object, and the records are then concatenated one by one.

#!/usr/bin/env python
# filename: merge_data_by_dict.py

# Concatenate a.txt and b.txt line by line into c.txt
with open('a.txt') as fid_a, open('b.txt') as fid_b, open('c.txt', 'w') as fid_c:
    lines_a = fid_a.readlines()
    lines_b = fid_b.readlines()
    # zip stops at the shorter file, so a missing line cannot raise IndexError
    for line_a, line_b in zip(lines_a, lines_b):
        # Drop the line ending of a; the line ending of b terminates the joined row
        fid_c.write(line_a.rstrip('\n\r') + line_b)
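The script above pairs lines purely by position. A key-based variant, where each file is loaded into its own dictionary and rows are joined on the first field, might look like the following minimal sketch. The function names `load_as_dict` and `join_by_key` are hypothetical, and the sketch assumes tab-separated lines whose first field is a unique key.

```python
def load_as_dict(lines):
    """Map the first tab-separated field (the key) to the remaining fields."""
    table = {}
    for line in lines:
        line = line.rstrip('\r\n')
        if not line:
            continue
        key, *fields = line.split('\t')
        table[key] = fields
    return table


def join_by_key(lines_a, lines_b):
    """Return joined rows for keys present in both inputs, in the order of lines_a."""
    table_a = load_as_dict(lines_a)
    table_b = load_as_dict(lines_b)
    return ['\t'.join([key] + fields_a + table_b[key])
            for key, fields_a in table_a.items()
            if key in table_b]


if __name__ == '__main__':
    sample_a = ['k1\ta1\n', 'k2\ta2\n']
    sample_b = ['k2\tb1\tb2\n', 'k3\tb3\tb4\n']
    for row in join_by_key(sample_a, sample_b):
        print(row)
```

Because both files are held in memory at once, this only fits the small-file case described here.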

Shell + Python Approach to Intersection stitching (Scheme 1)

Assumed file formats, one per file (fields are tab-separated):

  1. Unique string \t field value 1-1
  2. Unique string \t field value 2-1 \t field value 2-2

This scheme assumes the files are not particularly large and can be processed on a local machine, with line counts in the millions.

  1. Join the data from the two files on the unique string, which serves as the aggregate key, keeping the rows in file order
  2. Preprocess each original file by adding a column whose numeric value identifies the file the row came from
  3. Output the contents of both files together and sort them by the aggregate key first and the file index second (both must sit in the columns the sort refers to)
  4. Prepare a Python program that reads the sorted data row by row and compares each row's key with the previous row's. If the keys are the same, merge the two rows. If they differ, skip the row or output it on its own
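The four steps above can be sketched in pure Python for data that fits in memory. This is only an illustration under stated assumptions: the names `tag_rows`, `merge_sorted`, and `join_files` are hypothetical, the file index is placed in the second column, and a key missing from one file is filled with `'0'` placeholders of the assumed column widths.

```python
from itertools import groupby

# Placeholder columns used when a key is missing from one file (assumed
# widths: 3 tagged columns for file 0, 4 tagged columns for file 1).
EMPTY_0 = ['0', '0', '0']
EMPTY_1 = ['0', '0', '0', '0']


def tag_rows(lines, file_index):
    """Step 2: insert the file index as a second column after the key."""
    rows = []
    for line in lines:
        line = line.rstrip('\r\n')
        if line:
            key, *fields = line.split('\t')
            rows.append([key, str(file_index)] + fields)
    return rows


def merge_sorted(rows):
    """Step 4: walk rows sorted by (key, file index) and merge rows sharing a key."""
    merged = []
    for _, group in groupby(rows, key=lambda row: row[0]):
        by_file = {row[1]: row for row in group}
        merged.append(by_file.get('0', EMPTY_0) + by_file.get('1', EMPTY_1))
    return merged


def join_files(lines_0, lines_1):
    """Steps 1 to 4 combined: tag both inputs, sort them together, merge by key."""
    rows = tag_rows(lines_0, 0) + tag_rows(lines_1, 1)
    rows.sort(key=lambda row: (row[0], row[1]))  # step 3
    return merge_sorted(rows)
```

`groupby` only groups adjacent rows, which is why the sort on the key has to happen before the merge.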

Step 1:

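# file_path is the input file to tag; file_index identifies it (0 for the first file, 1 for the second)
file_index=0
# The variable is named idx because index is a built-in awk function and cannot be assigned with -v
awk -F'\t' \
        -v idx="${file_index}" \
    '{ other_str = ""; for (i = 2; i <= NF; i++) { other_str = other_str "\t" $i; } print $1 "\t" idx other_str; }' \
    "${file_path}" > "${file_path}.${file_index}"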

Step 2:

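# Inputs are the two tagged files produced in the previous step.
# Sort on the key (column 1) first, then on the file index (column 2).
sort -t$'\t' -k1,1 -k2,2 file.1 file.2 > file.tmp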

Step 3:

#!/usr/bin/env python
# filename: merge_data.py

import sys


def emit(row):
    """ Print one merged row as tab-separated text """
    print('\t'.join(row))


def main():
    """ Main executive function """
    last_list = []  # pending row from file 0 that has not been matched yet

    # Placeholder columns for a key that is missing from file 0 / file 1
    file0_empty_data = ['0', '0', '0']
    file1_empty_data = ['0', '0', '0', '0']

    for line in sys.stdin:
        line = line.rstrip('\r\n')
        if not line:
            continue

        llist = line.split('\t')
        if len(llist) not in [3, 4]:
            continue

        if llist[1] == '0':
            # New file-0 row: flush any unmatched file-0 row, then hold this one
            if last_list:
                emit(last_list + file1_empty_data)
            last_list = llist
        elif last_list and llist[0] == last_list[0]:
            # Same key in both files: merge the two rows
            emit(last_list + llist)
            last_list = []
        else:
            # File-1 row without a partner: flush the pending row, then this one
            if last_list:
                emit(last_list + file1_empty_data)
                last_list = []
            emit(file0_empty_data + llist)

    # Flush the final file-0 row if nothing matched it
    if last_list:
        emit(last_list + file1_empty_data)


if __name__ == '__main__':
    main()

Step 4:

cat file.tmp | python merge_data.py

Shell + Python Approach to Intersection stitching (Scheme 2)

  1. Use this method when the data files are too large for the two files to be joined directly
  2. Hash each row on the specified key and write the rows out to many small files, one per hash value
  3. Each small file can then be processed with either of the two schemes above
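The partitioning idea can be illustrated with a short Python sketch. It assumes string keys in the first tab-separated field and uses `zlib.crc32` as the hash, which is a choice made for this example rather than something the scheme prescribes. The names `bucket_of` and `partition` are hypothetical.

```python
import zlib
from collections import defaultdict


def bucket_of(key, bucket_count):
    """Map a string key to a bucket number in [0, bucket_count)."""
    # zlib.crc32 is stable across runs, unlike the built-in hash() for str.
    return zlib.crc32(key.encode('utf-8')) % bucket_count


def partition(lines, bucket_count):
    """Group lines by the bucket of their first tab-separated field."""
    buckets = defaultdict(list)
    for line in lines:
        key = line.split('\t', 1)[0]
        buckets[bucket_of(key, bucket_count)].append(line)
    return buckets
```

Since the bucket depends only on the key, rows with the same key from both files always land in the same bucket, so each bucket can be joined independently.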

Assumed file format

  1. Unique string \t field value 1-1
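  2. Unique string \t field value 2-1 \t field value 2-2

#!/bin/bash
file_nums=100

# Split a file into file_nums smaller files by hashing the key in column 1.
# Rows are written as: key \t file index \t remaining fields, the layout merge_data.py expects.
function split_file() {
    local file_path=$1
    local file_pos=$2
    awk -F'\t' \
        -v file_pos="${file_pos}" \
        -v file_nums="${file_nums}" \
        'BEGIN { for (i = 0; i < 256; i++) ord[sprintf("%c", i)] = i }
         {
             # Simple string hash of the key, since the key is not a number
             h = 0
             for (i = 1; i <= length($1); i++) h = (h * 31 + ord[substr($1, i, 1)]) % file_nums
             print $1 "\t" file_pos substr($0, length($1) + 1) >> ("file_data." h)
         }' "${file_path}"
}
split_file file.1 0
split_file file.2 1

function merge_data() {
    local file_path=$1
    sort -t$'\t' -k1,1 -k2,2 "${file_path}" 2>/dev/null \
        | python merge_data.py
}

# Traverse the small files and run the intersection stitching on each one
for (( i = 0; i < file_nums; i++ )); do
    tmp_file_path="file_data.${i}"
    merge_data "${tmp_file_path}"
done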