Intersection and concatenation of files in Python
This approach suits intersecting and concatenating a small number of small files: each file is read into memory as a separate object, the rows are matched up line by line, and the matched rows are concatenated one by one.
#!/usr/bin/env python
# filename: merge_data_by_dict.py
# Concatenate a.txt and b.txt line by line into c.txt.
fid_a = open('a.txt')
fid_b = open('b.txt')
fid_c = open('c.txt', 'w')
lines_a = fid_a.readlines()
lines_b = fid_b.readlines()
lines_c = []
for i in range(len(lines_b)):
    lines_a[i] = lines_a[i].strip('\n\r')
    lines_c.append(lines_a[i] + lines_b[i])
fid_c.writelines(lines_c)
fid_a.close()
fid_b.close()
fid_c.close()
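The script above concatenates purely by line position. A sketch that actually merges on the key in column 1, closer to what the filename merge_data_by_dict.py suggests, might look like this (the file names follow the example above; the helper name `merge_by_dict` is ours):

```python
# Sketch of a key-based merge: assumes tab-separated files whose
# first column is a unique key (a hypothetical variant, not the
# original script).
def merge_by_dict(path_a, path_b, path_out):
    data = {}
    # Index the first file by its key column.
    with open(path_a) as fid_a:
        for line in fid_a:
            key, _, rest = line.rstrip('\r\n').partition('\t')
            data[key] = [rest]
    # Attach the second file's values under the same keys.
    with open(path_b) as fid_b:
        for line in fid_b:
            key, _, rest = line.rstrip('\r\n').partition('\t')
            data.setdefault(key, []).append(rest)
    # Write out one concatenated row per key.
    with open(path_out, 'w') as fid_out:
        for key, values in data.items():
            fid_out.write(key + '\t' + '\t'.join(values) + '\n')
```

Unlike the positional version, this tolerates files whose rows are in different orders, at the cost of holding one file's data in memory.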
Shell + Python intersection stitching (Scheme 1)
Assume file format:
- Unique string \t field value 1-1
- Unique string \t field value 2-1 \t field value 2-2
Assume the files are small enough to process locally, on the order of millions of lines. The approach:
- Aggregate the rows of both files on the unique string (the join key), keeping them in file order
- Preprocess each file, adding a column whose numeric value identifies the source file
- Concatenate the contents of the two files and sort by the specified columns (the join key and the file index must sit in the corresponding columns)
- Write a Python program that reads the sorted stream row by row, comparing each row's key with the previous one's: if the keys match, merge the two rows; if they differ, pad and output the pending row
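The steps above can be illustrated end to end on a tiny made-up dataset (keys and values here are hypothetical):

```python
# Tiny illustration of the tag / sort / single-pass-merge idea.
file_a = ['k1\ta1', 'k3\ta3']          # key \t value
file_b = ['k1\tb1\tb2', 'k2\tb1\tb2']  # key \t value \t value

# Step 1: tag every row with its file index (0 or 1) after the key.
tagged = ['%s\t0\t%s' % tuple(r.split('\t', 1)) for r in file_a] + \
         ['%s\t1\t%s' % tuple(r.split('\t', 1)) for r in file_b]

# Step 2: sort by key, then by file index, so a key's file-0 row
# always directly precedes its file-1 row.
tagged.sort(key=lambda r: (r.split('\t')[0], r.split('\t')[1]))

# Step 3: adjacent rows with equal keys can now be merged in one pass.
merged = [prev + '\t' + cur.split('\t', 2)[2]
          for prev, cur in zip(tagged, tagged[1:])
          if prev.split('\t')[0] == cur.split('\t')[0]]
```

Here `merged` ends up holding the single joined row for `k1`; the unmatched keys `k2` and `k3` would be padded and emitted by the full merge program below.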
Step 1:
# 0 for the first file, 1 for the second
file_index=0
cat ${file_path} \
    | awk -F'\t' \
        -v file_idx="${file_index}" \
        '{ other_str = ""; for (i = 2; i <= NF; i++) { other_str = other_str "\t" $i; } print $1 "\t" file_idx other_str; }' > ${file_path}.${file_index}
Step 2:
cat file.1 file.2 | sort -t$'\t' -k1,1 -k2,2 > file.tmp
Step 3:
#!/usr/bin/env python
# filename: merge_data.py
import sys


def main():
    """Merge adjacent rows of the sorted stream that share a key."""
    # Padding used when a key appears in only one of the two files.
    file0_empty_data = ['0', '0', '0']        # key, file index, 1 value
    file1_empty_data = ['0', '0', '0', '0']   # key, file index, 2 values
    last_list = []                            # pending row from file 0
    for line in sys.stdin:
        line = line.strip()
        if not line:
            continue
        llist = line.split('\t')
        if len(llist) not in [3, 4]:
            continue
        if not last_list:
            if llist[1] == '0':
                last_list = llist
            else:
                # File-1 row with no matching file-0 row: pad on the left.
                print('\t'.join(file0_empty_data + llist))
        else:
            if llist[1] == '0':
                # The pending file-0 row had no partner: pad on the right.
                print('\t'.join(last_list + file1_empty_data))
                last_list = llist
            elif llist[0] == last_list[0]:
                # Keys match: merge the two rows.
                print('\t'.join(last_list + llist))
                last_list = []
            else:
                # Keys differ: both rows are unmatched, pad each side.
                print('\t'.join(last_list + file1_empty_data))
                print('\t'.join(file0_empty_data + llist))
                last_list = []
    if last_list:
        # Flush a pending file-0 row left over at end of input.
        print('\t'.join(last_list + file1_empty_data))


if __name__ == '__main__':
    main()
Step 4:
cat file.tmp | python merge_data.py
Shell + Python intersection stitching (Scheme 2)
- Use this approach when the data files are too large to operate on directly as a pair
- Hash each file on the specified key, writing the rows out across many smaller bucket files
- Each pair of small files can then be processed with either of the schemes above
Assumed file format
- Unique string \t field value 1-1
- Unique string \t field value 2-1 \t field value 2-2
#!/bin/bash
file_nums=100

# Hash-split one input file into ${file_nums} bucket files, rewriting
# each row as "key \t file_index \t values" so merge_data.py can
# consume it directly. Note: "$1 % file_nums" only behaves like a
# hash when the key in column 1 is numeric; a string key needs a
# real hash function instead.
function split_file() {
    local file_path=$1
    local file_pos=$2
    cat ${file_path} | awk -F'\t' \
        -v file_pos="${file_pos}" \
        -v file_nums="${file_nums}" \
        '{ bucket = $1 % file_nums;
           out_path = "file_data." bucket;
           other_str = "";
           for (i = 2; i <= NF; i++) { other_str = other_str "\t" $i; }
           print $1 "\t" file_pos other_str >> out_path; }'
}

function merge_data() {
    local file_path=$1
    cat ${file_path} 2>/dev/null \
        | sort -t$'\t' -k1,1 -k2,2 \
        | python merge_data.py
}

# Split both input files into the shared bucket files
split_file file.1 0
split_file file.2 1

# Traverse the bucket files and run the intersection stitching on each
for (( i = 0; i < file_nums; i++ )); do
    tmp_file_path="file_data."$i
    merge_data "${tmp_file_path}"
done
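The awk split above relies on `$1 % file_nums`, which only works as a hash when the key is numeric. For arbitrary string keys, a sketch of the same partitioning step in Python (bucket count, file names, and the helper name are illustrative) could be:

```python
import hashlib

def partition_file(path, file_pos, num_buckets=100):
    """Hash-split a tab-separated file (key in column 1) into
    num_buckets files named file_data.<n>, rewriting each row as
    "key \t file_index \t values" for the merge step."""
    outs = {}
    with open(path) as fin:
        for line in fin:
            key, _, rest = line.partition('\t')
            # md5 is stable across runs and processes, unlike
            # Python's built-in hash(), so both input files route
            # equal keys to the same bucket.
            bucket = int(hashlib.md5(key.encode()).hexdigest(), 16) % num_buckets
            out = outs.get(bucket)
            if out is None:
                out = outs[bucket] = open('file_data.%d' % bucket, 'a')
            out.write('%s\t%d\t%s' % (key, file_pos, rest))
    for out in outs.values():
        out.close()
```

Calling `partition_file('file.1', 0)` and `partition_file('file.2', 1)` then yields the same bucket layout the shell loop expects, with equal keys from both inputs guaranteed to land in the same bucket.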