With the continuous development of cloud services, more and more resources are stored in the cloud. Object storage is one of the most widely used storage services, with offerings such as Alibaba Cloud OSS, Baidu Cloud BOS, and Amazon S3. Convenient as it is for users, developers are often constrained by the SDK and run into many problems, even for a seemingly simple task such as calculating the total size of OSS objects.

The business scenario

Some of our company's project files are stored on Alibaba Cloud OSS. The size of the stored objects needs to be totalled per directory in order to calculate storage cost and estimate downstream traffic cost. These objects have three characteristics:

  1. Large in number and growing: there are over a million of them, and the count keeps rising.
  2. Rarely modified: once uploaded, most objects are never changed.
  3. Short-lived: objects are kept for only three months and are automatically deleted afterwards.

SDK issues

The Alibaba Cloud SDK does not provide an API for aggregating sizes by folder; you have to list every object via the ListObjects function and then sum the sizes yourself.

The official documentation describes the request parameters of the ListObjects function in detail, but there are only a few of them. They can be summarized as follows:

  1. You can set the number of records returned per query; the maximum is 1000.
  2. If the listing cannot be completed in one call, a NextMarker value is returned, which is passed as the Marker parameter of the next call to fetch the next page.
  3. You can set the Prefix parameter to list objects under a directory and its subdirectories.
  4. You can use the Delimiter parameter to list directories.

In other words, the ListObjects function behaves like an iterator: when there are many objects, they can only be fetched page by page via the paging marker. For 100,000 objects you need at least 100 calls; at 2-3 seconds per call, that is 3-5 minutes. Our project, however, has no small number of objects, about 1.2 million, so the total time reaches 40-60 minutes, which is far too long. For an application deployed on Function Compute in particular, 10 minutes already exceeds the timeout.
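The serial paging described above can be sketched as a loop (a minimal sketch, not the project's actual code; `client.list` stands in for the SDK's ListObjects call, with an assumed shape that accepts `{ prefix, marker, 'max-keys' }` and resolves to `{ objects, nextMarker }`):

```javascript
// Serial paging with ListObjects: each call depends on the NextMarker
// returned by the previous one, so the calls cannot run in parallel.
// `client.list` is an assumed stand-in for the SDK's ListObjects call.
async function totalSize(client, prefix) {
  let total = 0
  let marker = null
  do {
    const page = await client.list({ prefix, marker, 'max-keys': 1000 })
    for (const obj of page.objects || []) total += obj.size
    marker = page.nextMarker || null // null once the listing is complete
  } while (marker)
  return total
}
```

With 1.2 million objects this loop performs roughly 1200 sequential round trips, which is where the 40-60 minutes come from.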

How can the query statistics be optimized?

Referring to the ideas of front-end performance optimization, there are generally three kinds of solutions:

1 Compression. For example, HTTP uses Gzip to compress response data. Since the API function provides neither field filtering nor compression parameters, the size of the returned data cannot be reduced, so compression is not an option.

2 Splitting. Splitting can be divided into two types by purpose: delayed data requests, such as lazy-loading front-end route components, and concurrent data requests, such as chunked uploads of large files.

Lazy loading makes no sense for file statistics and does not reduce the final elapsed time.

In theory, concurrency can indeed shorten the query time, but in practice there is a problem: the ListObjects function follows the iterator pattern, and each query depends on the NextMarker returned by the previous one. This design forces the queries to run serially.

3 Caching. Caches are generally an optimization for repeated operations, because a cache hit returns the result directly, as with strong caching and negotiated caching on the front end.

Caching the query results directly makes no sense, because the ListObjects function provides no HTTP-like mechanism to check whether the cache is still valid; the only way to verify the cache would be to call ListObjects again.

But does caching make no sense at all? Of course not.

Splitting and caching

If the parameters of each query (mainly NextMarker) are cached, then repeated queries can run concurrently. The implementation idea is as follows:

  1. When a directory contains more than 1000 objects, cache the current query parameters per directory. With fewer than 1000 objects there is no paging, so caching the parameters is pointless.
  2. On a repeat query, fetch the cache first and check whether each directory has cached parameters. If it does, run concurrent queries based on them; if not, query directly.
  3. If the NextMarker in a returned result is inconsistent with the cached parameter, the files have changed: abandon the current concurrent results and fall back to querying serially.
  4. For every query result with more than 1000 objects, update the cached parameters.
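The steps above can be sketched as follows (a sketch under stated assumptions: `markers` is a hypothetical cached array where `markers[i]` is the Marker for page `i`, with `markers[0]` being `null` for the first page, and `client.list` has the same assumed shape as the SDK's ListObjects; returning `null` signals a stale cache, i.e. fall back to a serial scan):

```javascript
// Concurrent re-query from cached page markers. Fires all page requests at
// once, then validates each page's nextMarker against the cached value for
// the next page; any mismatch means the files have changed (stale cache).
async function totalSizeFromCache(client, prefix, markers) {
  const pages = await Promise.all(
    markers.map(marker => client.list({ prefix, marker, 'max-keys': 1000 }))
  )
  let total = 0
  for (let i = 0; i < pages.length; i++) {
    const expected = i + 1 < markers.length ? markers[i + 1] : null
    if ((pages[i].nextMarker || null) !== expected) return null // stale cache
    for (const obj of pages[i].objects || []) total += obj.size
  }
  return total
}
```

On a cache hit, N pages cost one round trip instead of N, which is where the speed-up comes from.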

In order to further squeeze the concurrent capability of Aliyun function calculation, two measures are taken:

  1. Switch from Python to Node.js. I had originally chosen Python because the project might be maintained by back-end colleagues, but the concurrency performance of Python threads and coroutines is not as good as Node.js, so I rewrote it in Node.js.
  2. Unlimited concurrency. Read out all the cached parameters, call the ListObjects function for each of them, and merge the results at the end.

However, the concurrent execution threw an XML parsing error similar to the one below (truncated, because there were too many errors).

Error: Unclosed root tag Line: 1284 Column: 8 Char: raw xml: <?xml version="1.0" encoding="UTF-8"?>...

Given that the API functions work by making an HTTP request, parsing the returned XML string into an XML object, and then converting it into JSON, my assumption was that the returned XML string was incomplete and therefore failed to parse.

So I queried only the directory that had produced the error, but no error occurred.

I had to step into the SDK source with breakpoint debugging, and finally found that the error was caused by a request timeout.

The root cause was that too many concurrent requests were queued waiting for responses, so the waits timed out. The error message, however, is badly misleading…

Concurrency and queues

The solution to the queuing problem is simple: limit the number of concurrent requests.

How to limit it? Inspired by the JavaScript engine's event loop, I created a task queue:

  1. While the task queue is not empty, keep polling it.
  2. When the number of running tasks has not reached the limit, take a task out, execute it, and increment the counter.
  3. When a task completes, decrement the counter.

The core code is as follows:

function schedule() {
  /* concurrency limit */
  while (queue.length > 0 && running < concurrency) {
    running++ // take one task out of the queue and run it
    let task = queue.shift()
    singleList.call(null, task.params, task.key)
      .then(task.resolve, task.reject)
      .finally(() => {
        running--
        // if tasks are still queued, schedule the next round
        if (queue.length > 0) timeout = setTimeout(schedule, 0)
        console.info('waiting:', queue.length, 'running:', running)
      })
  }
}
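For completeness, the same idea can be written as a self-contained, reusable limiter (a sketch, not the project's exact code; `createLimiter` is a hypothetical helper that bundles the queue, the counter, and the schedule loop together):

```javascript
// A self-contained concurrency limiter in the spirit of schedule():
// a task queue plus a `running` counter capped at `concurrency`.
function createLimiter(concurrency) {
  const queue = []
  let running = 0

  function schedule() {
    // poll the queue while there is work and spare capacity
    while (queue.length > 0 && running < concurrency) {
      running++
      const task = queue.shift()
      task.fn()
        .then(task.resolve, task.reject)
        .finally(() => {
          running--
          // if tasks are still waiting, schedule the next round
          if (queue.length > 0) setTimeout(schedule, 0)
        })
    }
  }

  // wrap a promise-returning function so it waits for a free slot
  return function limit(fn) {
    return new Promise((resolve, reject) => {
      queue.push({ fn, resolve, reject })
      schedule()
    })
  }
}
```

With, say, `createLimiter(20)`, the scan runs at most 20 ListObjects requests at a time, avoiding the response-queue timeouts described above while keeping most of the concurrency benefit.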

The final run on Alibaba Cloud Function Compute counted 1.2 million objects and wrote the results to the database in 69 seconds, using 962 MB of memory.

Duration: 69451.69 ms, Billed Duration: 69500 ms, Memory Size: 3072 MB, Max Memory Used: 962.17 MB

This performance already meets the business requirements; even if the number of files grows by an order of magnitude, the job can still finish within 10 minutes (Function Compute's timeout limit is 600 seconds).

Further optimization

In theory, you can also shorten the time by increasing the number of concurrent requests. Set up a main process and assign tasks to different processes to further increase concurrency. Allocation can be based on the number of objects, such as one process for every 100 tasks.


Original link: tech.gtxlab.com/oss-file.ht… Author: Zhu Delong, senior front-end engineer at Renhe Future.