Intersection and concatenation of files in Python
This approach suits intersecting and concatenating a small number of small files: each file is read into memory as a separate object, the rows are matched up line by line, and the matched rows are concatenated one by one.
#!/usr/bin/env python
# filename: merge_data_by_dict.py
# Concatenate a.txt and b.txt line by line into c.txt.
fid_a = open('a.txt')
fid_b = open('b.txt')
fid_c = open('c.txt', 'w')
lines_a = fid_a.readlines()
lines_b = fid_b.readlines()
lines_c = []
for i in range(len(lines_b)):
    lines_a[i] = lines_a[i].strip('\n\r')
    lines_c.append(lines_a[i] + lines_b[i])
fid_c.writelines(lines_c)
fid_a.close()
fid_b.close()
fid_c.close()
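The script above concatenates purely by line position. A sketch that actually merges on the key in column 1, closer to what the filename merge_data_by_dict.py suggests, might look like this (the file names follow the example above; the helper name `merge_by_dict` is ours):

```python
# Sketch of a key-based merge: assumes tab-separated files whose
# first column is a unique key (a hypothetical variant, not the
# original script).
def merge_by_dict(path_a, path_b, path_out):
    data = {}
    # Index the first file by its key column.
    with open(path_a) as fid_a:
        for line in fid_a:
            key, _, rest = line.rstrip('\r\n').partition('\t')
            data[key] = [rest]
    # Attach the second file's values under the same keys.
    with open(path_b) as fid_b:
        for line in fid_b:
            key, _, rest = line.rstrip('\r\n').partition('\t')
            data.setdefault(key, []).append(rest)
    # Write out one concatenated row per key.
    with open(path_out, 'w') as fid_out:
        for key, values in data.items():
            fid_out.write(key + '\t' + '\t'.join(values) + '\n')
```

Unlike the positional version, this tolerates files whose rows are in different orders, at the cost of holding one file's data in memory.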
Shell + Python intersection stitching (Scheme 1)
Assume file format:
- Unique string \t field value 1-1
- Unique string \t field value 2-1 \t field value 2-2
Assume the files are small enough to process locally, on the order of millions of lines. The approach:
- Aggregate the rows of both files on the unique string (the join key), keeping them in file order
- Preprocess each file, adding a column whose numeric value identifies the source file
- Concatenate the contents of the two files and sort by the specified columns (the join key and the file index must sit in the corresponding columns)
- Write a Python program that reads the sorted stream row by row, comparing each row's key with the previous one's: if the keys match, merge the two rows; if they differ, pad and output the pending row
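The steps above can be illustrated end to end on a tiny made-up dataset (keys and values here are hypothetical):

```python
# Tiny illustration of the tag / sort / single-pass-merge idea.
file_a = ['k1\ta1', 'k3\ta3']          # key \t value
file_b = ['k1\tb1\tb2', 'k2\tb1\tb2']  # key \t value \t value

# Step 1: tag every row with its file index (0 or 1) after the key.
tagged = ['%s\t0\t%s' % tuple(r.split('\t', 1)) for r in file_a] + \
         ['%s\t1\t%s' % tuple(r.split('\t', 1)) for r in file_b]

# Step 2: sort by key, then by file index, so a key's file-0 row
# always directly precedes its file-1 row.
tagged.sort(key=lambda r: (r.split('\t')[0], r.split('\t')[1]))

# Step 3: adjacent rows with equal keys can now be merged in one pass.
merged = [prev + '\t' + cur.split('\t', 2)[2]
          for prev, cur in zip(tagged, tagged[1:])
          if prev.split('\t')[0] == cur.split('\t')[0]]
```

Here `merged` ends up holding the single joined row for `k1`; the unmatched keys `k2` and `k3` would be padded and emitted by the full merge program below.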
Step 1:
# 0 for the first file, 1 for the second
file_index=0
cat ${file_path} \
    | awk -F'\t' \
        -v file_idx="${file_index}" \
        '{ other_str = ""; for (i = 2; i <= NF; i++) { other_str = other_str "\t" $i; } print $1 "\t" file_idx other_str; }' > ${file_path}.${file_index}
Step 2:
cat file.1 file.2 | sort -t$'\t' -k1,1 -k2,2 > file.tmp
Step 3:
#!/usr/bin/env python
# filename: merge_data.py
import sys


def main():
    """Merge adjacent rows of the sorted stream that share a key."""
    # Padding used when a key appears in only one of the two files.
    file0_empty_data = ['0', '0', '0']        # key, file index, 1 value
    file1_empty_data = ['0', '0', '0', '0']   # key, file index, 2 values
    last_list = []                            # pending row from file 0
    for line in sys.stdin:
        line = line.strip()
        if not line:
            continue
        llist = line.split('\t')
        if len(llist) not in [3, 4]:
            continue
        if not last_list:
            if llist[1] == '0':
                last_list = llist
            else:
                # File-1 row with no matching file-0 row: pad on the left.
                print('\t'.join(file0_empty_data + llist))
        else:
            if llist[1] == '0':
                # The pending file-0 row had no partner: pad on the right.
                print('\t'.join(last_list + file1_empty_data))
                last_list = llist
            elif llist[0] == last_list[0]:
                # Keys match: merge the two rows.
                print('\t'.join(last_list + llist))
                last_list = []
            else:
                # Keys differ: both rows are unmatched, pad each side.
                print('\t'.join(last_list + file1_empty_data))
                print('\t'.join(file0_empty_data + llist))
                last_list = []
    if last_list:
        # Flush a pending file-0 row left over at end of input.
        print('\t'.join(last_list + file1_empty_data))


if __name__ == '__main__':
    main()
Step 4:
cat file.tmp | python merge_data.py
Shell + Python intersection stitching (Scheme 2)
- Use this approach when the data files are too large to operate on directly as a pair
- Hash each file on the specified key, writing the rows out across many smaller bucket files
- Each pair of small files can then be processed with either of the schemes above
Assumed file format
- Unique string \t field value 1-1
- Unique string \t field value 2-1 \t field value 2-2
#!/bin/bash
file_nums=100

# Hash-split one input file into ${file_nums} bucket files, rewriting
# each row as "key \t file_index \t values" so merge_data.py can
# consume it directly. Note: "$1 % file_nums" only behaves like a
# hash when the key in column 1 is numeric; a string key needs a
# real hash function instead.
function split_file() {
    local file_path=$1
    local file_pos=$2
    cat ${file_path} | awk -F'\t' \
        -v file_pos="${file_pos}" \
        -v file_nums="${file_nums}" \
        '{ bucket = $1 % file_nums;
           out_path = "file_data." bucket;
           other_str = "";
           for (i = 2; i <= NF; i++) { other_str = other_str "\t" $i; }
           print $1 "\t" file_pos other_str >> out_path; }'
}

function merge_data() {
    local file_path=$1
    cat ${file_path} 2>/dev/null \
        | sort -t$'\t' -k1,1 -k2,2 \
        | python merge_data.py
}

# Split both input files into the shared bucket files
split_file file.1 0
split_file file.2 1

# Traverse the bucket files and run the intersection stitching on each
for (( i = 0; i < file_nums; i++ )); do
    tmp_file_path="file_data."$i
    merge_data "${tmp_file_path}"
done
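The awk split above relies on `$1 % file_nums`, which only works as a hash when the key is numeric. For arbitrary string keys, a sketch of the same partitioning step in Python (bucket count, file names, and the helper name are illustrative) could be:

```python
import hashlib

def partition_file(path, file_pos, num_buckets=100):
    """Hash-split a tab-separated file (key in column 1) into
    num_buckets files named file_data.<n>, rewriting each row as
    "key \t file_index \t values" for the merge step."""
    outs = {}
    with open(path) as fin:
        for line in fin:
            key, _, rest = line.partition('\t')
            # md5 is stable across runs and processes, unlike
            # Python's built-in hash(), so both input files route
            # equal keys to the same bucket.
            bucket = int(hashlib.md5(key.encode()).hexdigest(), 16) % num_buckets
            out = outs.get(bucket)
            if out is None:
                out = outs[bucket] = open('file_data.%d' % bucket, 'a')
            out.write('%s\t%d\t%s' % (key, file_pos, rest))
    for out in outs.values():
        out.close()
```

Calling `partition_file('file.1', 0)` and `partition_file('file.2', 1)` then yields the same bucket layout the shell loop expects, with equal keys from both inputs guaranteed to land in the same bucket.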