OSS has introduced OSS Select, which lets you select the content you need directly from OSS files using simple SQL statements
Object Storage Service (OSS) offers massive capacity, reliability, security, high performance, and low cost. It provides Standard, Infrequent Access, and Archive storage classes, covering data storage needs from hot to cold. A single file can range from 1 byte to 48.8 TB, and there is no limit on the number of files that can be stored. OSS has become infrastructure for Internet and enterprise data applications. Typically, data is read from object storage either by fetching the whole object or by fetching a specified byte range. Today we are launching OSS Select, which uses simple SQL statements to select the desired content from OSS files.
Introduction to OSS Select
Select the contents of OSS files using SQL
OSS Select (in open beta) allows developers to select content from OSS files directly using SQL statements.
Because OSS Select retrieves only the query results the application needs and supports concurrent, sharded queries, it can significantly improve application performance, typically by up to 400 percent.
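To make the difference concrete, here is a minimal local sketch that makes no OSS calls. It compares the two access patterns: downloading a whole CSV object and filtering it in the application, versus receiving only the rows that match. The data, the function names `fetch_whole_object` and `select_rows`, and the filter are all hypothetical stand-ins; the only point is how many bytes reach the application in each case.

```python
import csv
import io

# Hypothetical in-memory "object": 1,000 CSV records, no header row.
# Every tenth record has an age above 40; the rest do not.
records = []
for i in range(1000):
    age = 45 if i % 10 == 0 else 30
    records.append(f"person{i},city{i},{age}")
OBJECT_BYTES = ("\r\n".join(records) + "\r\n").encode("utf-8")


def fetch_whole_object():
    """Stand-in for a plain GET: the application receives every byte."""
    return OBJECT_BYTES


def select_rows(min_age):
    """Stand-in for a server-side select: only matching rows come back."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\r\n")
    source = io.StringIO(OBJECT_BYTES.decode("utf-8"), newline="")
    for row in csv.reader(source):
        if int(row[2]) > min_age:
            writer.writerow(row)
    return out.getvalue().encode("utf-8")


whole = fetch_whole_object()
selected = select_rows(40)
print(f"whole object:  {len(whole)} bytes")
print(f"selected rows: {len(selected)} bytes")
print(f"fraction transferred: {len(selected) / len(whole):.1%}")
```

How much is saved in practice depends on how selective the query is; a filter that matches most rows would transfer almost as much as a full download.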
Open beta notes:
- File format: during the open beta, unencrypted, UTF-8 encoded CSV files or delimited text files (see RFC 4180) are supported.
- Standard and Infrequent Access objects are supported during the open beta.
- Range queries are supported (Use Header Name is not supported in range query mode during the open beta).
- OSS Select is free during the open beta.
- Alibaba Cloud EMR, DataLakeAnalytics, MaxCompute, HybridDB, and other services will add support for OSS Select over time.
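A file that fits the format constraints listed above can be produced with Python's standard `csv` module. The sketch below writes a small header-less, UTF-8, comma-delimited file with `\r\n` record delimiters. The file name `sample.csv` matches the Python example in this article, but the four-column layout (name, country, occupation, age) and the rows themselves are purely hypothetical.

```python
import csv

# Hypothetical rows: name, country, occupation, age (no header row).
rows = [
    ("Tom Lee", "USA", "actor", 45),
    ("Tom Smith", "UK", "engineer", 38),
    ("Alice Li", "China", "analyst", 52),
    ("Ann Tomlin", "Canada", "teacher", 61),
]

# newline="" stops Python from translating line endings, so the
# "\r\n" record delimiter is written exactly as the csv writer emits it.
with open("sample.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f, delimiter=",", lineterminator="\r\n")
    writer.writerows(rows)

# Read the file back as bytes to confirm the encoding and delimiters.
with open("sample.csv", "rb") as f:
    raw = f.read()
print(raw.decode("utf-8"))
```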
Usage example (Python)
    # -*- coding: utf-8 -*-
    import os
    import oss2


    def select_call_back(consumed_bytes, total_bytes=None):
        # Progress callback: report how many bytes have been consumed so far.
        print('Consumed Bytes:' + str(consumed_bytes) + '\n')


    # First initialize AccessKeyId, AccessKeySecret, Endpoint, etc.
    # Read them from environment variables, or replace placeholders such as
    # '<your AccessKeyId>' with the real values.
    #
    # Taking the Hangzhou region as an example, the Endpoint can be:
    #   http://oss-cn-hangzhou.aliyuncs.com
    #   https://oss-cn-hangzhou.aliyuncs.com
    # for access over HTTP and HTTPS respectively.
    access_key_id = os.getenv('OSS_TEST_ACCESS_KEY_ID', '<your AccessKeyId>')
    access_key_secret = os.getenv('OSS_TEST_ACCESS_KEY_SECRET', '<your AccessKeySecret>')
    bucket_name = os.getenv('OSS_TEST_BUCKET', '<your Bucket>')
    endpoint = os.getenv('OSS_TEST_ENDPOINT', '<your domain name>')

    # Make sure all the parameters above are filled in correctly.
    for param in (access_key_id, access_key_secret, bucket_name, endpoint):
        assert '<' not in param, 'Please set parameter: ' + param

    # Create a Bucket object. All object-related interfaces are called through it.
    bucket = oss2.Bucket(oss2.Auth(access_key_id, access_key_secret), endpoint, bucket_name)

    csvfile = 'sample.csv'
    resultfilename = 'python_select.csv'
    csv_meta_params = {'FileHeaderInfo': 'None', 'RecordDelimiter': '\r\n'}
    # LineRange (optional): specifies the range of rows to query
    select_csv_params = {'FileHeaderInfo': 'None', 'LineRange': (100, 1000)}
    # Get the CSV metadata of the object
    csv_header = bucket.get_csv_object_meta(csvfile, csv_meta_params)

    # Write the query result to a file
    result = bucket.select_csv_object_to_file(
        csvfile, resultfilename,
        "select _1, _3, _4 from ossobject where _4 > 40 and _1 like '%Tom%' ",
        select_call_back, select_csv_params)
The above is a simple Python example that uses SQL to query an OSS object and writes the results to a file.
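To sanity-check what the SQL in this example is expected to return, the same filter can be emulated locally on a small in-memory CSV. This is a rough sketch in plain Python, not how OSS Select is implemented. The helper `emulate_select` and the sample rows are hypothetical, and it assumes that `_1`, `_3`, `_4` are 1-based column positions, that `like '%Tom%'` behaves as a case-sensitive substring match, and that `_4` is compared as an integer.

```python
import csv
import io

# Hypothetical header-less sample data: name, country, occupation, age.
SAMPLE = (
    "Tom Lee,USA,actor,45\r\n"
    "Tom Smith,UK,engineer,38\r\n"
    "Alice Li,China,analyst,52\r\n"
    "Ann Tomlin,Canada,teacher,61\r\n"
)


def emulate_select(text):
    """Local stand-in for:
    select _1, _3, _4 from ossobject where _4 > 40 and _1 like '%Tom%'
    """
    matches = []
    for row in csv.reader(io.StringIO(text, newline="")):
        name, occupation, age = row[0], row[2], row[3]  # _1, _3, _4
        if int(age) > 40 and "Tom" in name:
            matches.append([name, occupation, age])
    return matches


result_rows = emulate_select(SAMPLE)
for row in result_rows:
    print(",".join(row))
```

Comparing a local result like this with the file written by the real query is a quick way to catch an off-by-one column index or a wrong comparison type.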
In addition to writing the query results to a file, you can also have them returned directly:
    result = bucket.select_csv_object(
        csvfile,
        "select * from ossobject where _4 > 40 and _1 like '%Tom%' ",
        select_call_back, select_csv_params)
    content_got = b''
    for chunk in result:
        content_got += chunk
    print(content_got)
Query result:
Test example
You can use OSS Select to speed up many kinds of applications. The OSS Select team has built a Spark example that implements the Spark Data Source API on top of OSS Select. Suppose you need to find the people who match certain criteria in a large personnel list, for example, people over 50 whose names include Tom.
Use OSS Select to improve application performance
- When OSS Select is enabled, Spark uses it to fetch only the data the query needs from the file. When OSS Select is disabled, Spark fetches the entire file.
- Without OSS Select, the query takes 78 seconds (1.3 minutes). With OSS Select, it takes 11 seconds, so the query runs about 7 times as fast.
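The two timings in the list above can be turned into a speedup ratio with a few lines of arithmetic. The sketch uses only the reported figures (78 s and 11 s); the variable names are illustrative.

```python
# Timings reported above, in seconds.
without_select = 78
with_select = 11

speedup = without_select / with_select          # how many times as fast
time_saved = 1 - with_select / without_select   # fraction of run time removed

print(f"speedup: {speedup:.1f}x")       # speedup: 7.1x
print(f"time saved: {time_saved:.0%}")  # time saved: 86%
```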
Test configuration:
Spark test cluster configuration:
| Node | Count | Configuration |
|---|---|---|
| master | 1 | 4 cores, 8 GB |
| workers | 2 | 4 cores, 8 GB |
Spark configuration:
    export SPARK_MASTER_IP=master
    export SPARK_WORKER_MEMORY=6g
    export SPARK_WORKER_CORES=3
    export SPARK_WORKER_INSTANCES=1
    export SPARK_EXECUTOR_CORES=1
    export SPARK_EXECUTOR_MEMORY=2g
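These settings bound how many tasks can run at once. The sketch below works through the arithmetic under several assumptions that the article does not state: Spark standalone mode, an application allowed to use every worker core, one CPU per task, and no memory overhead. The variable names are illustrative, and the worker count of 2 is taken from the cluster table.

```python
workers = 2             # worker nodes, from the cluster table
worker_cores = 3        # SPARK_WORKER_CORES
worker_memory_gb = 6    # SPARK_WORKER_MEMORY
executor_cores = 1      # SPARK_EXECUTOR_CORES
executor_memory_gb = 2  # SPARK_EXECUTOR_MEMORY

# An executor needs both its cores and its memory, so the scarcer
# resource decides how many executors fit on one worker.
executors_per_worker = min(worker_cores // executor_cores,
                           worker_memory_gb // executor_memory_gb)
total_executors = workers * executors_per_worker
parallel_tasks = total_executors * executor_cores  # assumes 1 CPU per task

print(executors_per_worker, total_executors, parallel_tasks)  # 3 6 6
```

Under these assumptions the test cluster runs at most six tasks in parallel.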
Data volume:
The CSV data set is 7 GB.