1 Goal

Compare the compression ratios of Parquet files generated with different compression algorithms.

2 Procedure

Read a certain amount of data from a Hive table, write it to HDFS as Parquet with different compression codecs specified, and compare the resulting compression ratios.

2.1 Main Code

from pyspark.sql import SparkSession

if __name__ == '__main__':
    # Initialize SparkSession
    session = SparkSession.builder.getOrCreate()

    # Read data
    data_frame = session.sql("select * from ods.xxx where name = 'xxx'").repartition(100)

    data_frame.cache()

    # Write the same data out in each format: CSV, none, uncompressed, snappy, gzip
    data_frame.write.mode("overwrite").csv("/user/sun/output/parquet_test/csv")
    data_frame.write.mode("overwrite").option("compression", "none").parquet("/user/sun/output/parquet_test/none")
    data_frame.write.mode("overwrite").option("compression", "uncompressed").parquet("/user/sun/output/parquet_test/uncompressed")
    data_frame.write.mode("overwrite").option("compression", "snappy").parquet("/user/sun/output/parquet_test/snappy")
    data_frame.write.mode("overwrite").option("compression", "gzip").parquet("/user/sun/output/parquet_test/gzip")

    data_frame.unpersist()
    session.stop()
# Task submission code
spark-submit --master yarn --deploy-mode cluster --queue queueD --driver-memory 10G --num-executors 10 --executor-cores 2 --executor-memory 10G --name Parquet_test test.py
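The five near-identical write calls in the job above can also be generated in a loop. Below is a minimal sketch of the path-building half; `output_paths` is a hypothetical helper, not part of the original job, and the Spark write itself is shown only as a comment:

```python
# Hypothetical helper: build one HDFS output path per codec so the
# repeated write calls can be driven by a single loop.
BASE = "/user/sun/output/parquet_test"
CODECS = ["none", "uncompressed", "snappy", "gzip"]

def output_paths(base=BASE, codecs=CODECS):
    """Map each Parquet codec name to its output directory."""
    return {codec: f"{base}/{codec}" for codec in codecs}

# Intended use inside the job above:
# for codec, path in output_paths().items():
#     data_frame.write.mode("overwrite").option("compression", codec).parquet(path)
```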

2.2 Data Comparison

$ hadoop fs -du -h /user/sunkangkang/output/parquet_test
18.7 G  56.1 G  /user/sunkangkang/output/parquet_test/csv
1.1 G   3.2 G   /user/sunkangkang/output/parquet_test/gzip
1.4 G   4.3 G   /user/sunkangkang/output/parquet_test/snappy
1.6 G   4.9 G   /user/sunkangkang/output/parquet_test/none
1.6 G   4.9 G   /user/sunkangkang/output/parquet_test/uncompressed
  • File size comparison

  • Compressibility comparison

    Taking the size of the CSV output as the baseline, compute the compression ratio achieved by each compression format.
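The ratios can be computed directly from the sizes reported by `hadoop fs -du` above. A small sketch, where the GiB figures are copied from the listing and CSV is taken as the baseline:

```python
# Single-replica sizes in GiB, taken from the `hadoop fs -du -h` output,
# with the CSV output serving as the uncompressed baseline.
sizes_gib = {
    "csv": 18.7,
    "gzip": 1.1,
    "snappy": 1.4,
    "none": 1.6,
    "uncompressed": 1.6,
}

def compression_ratio(fmt, sizes=sizes_gib, baseline="csv"):
    """Ratio of the baseline (CSV) size to this format's size."""
    return sizes[baseline] / sizes[fmt]

for fmt in ("gzip", "snappy", "none", "uncompressed"):
    print(f"{fmt}: {compression_ratio(fmt):.1f}x smaller than CSV")
```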

3 Summary

Even when no compression is specified (none, uncompressed), the generated Parquet files are much smaller than the plain-text CSV files. Gzip achieves a higher compression ratio than Snappy, so gzip is a good choice for compressing data when cluster storage space is tight.
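Instead of setting the codec on every write call, the default Parquet codec can also be set job-wide through Spark's `spark.sql.parquet.compression.codec` setting. A hypothetical variant of the submit command above (config fragment only; the per-write `.option("compression", ...)` calls can then be dropped):

```shell
# Make gzip the default Parquet codec for the whole job via --conf.
spark-submit --master yarn --deploy-mode cluster \
  --conf spark.sql.parquet.compression.codec=gzip \
  --name Parquet_test test.py
```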