WeChat public account: Measurement Space

1. What is Zarr?

The Zarr website describes Zarr as a Python package that provides chunked, compressed, n-dimensional arrays. Chunked means Zarr can handle very large datasets and provide fast data access. Compressed means Zarr can store data at a reasonable file size, which means lower storage cost. N-dimensional arrays mean Zarr can handle data cubes just like NetCDF, for example earth science datasets with time, X, Y, and Z dimensions.

Some highlights of Zarr are as follows (a short example follows the list):

  • Create n-dimensional arrays with any NumPy dtype.
  • Chunk arrays along any dimension.
  • Compress and/or filter chunks using any NumCodecs codec.
  • Store arrays in memory, on disk, inside a Zip file, on cloud storage (such as AWS S3), …
  • Read an array concurrently from multiple threads or processes.
  • Write to an array concurrently from multiple threads or processes.
  • Organize arrays into hierarchies via groups.
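
To make these features concrete, here is a minimal sketch (assuming the classic zarr v2 API; the store name example_group.zarr is a placeholder) that creates a compressed, chunked array inside a group hierarchy:

import numpy as np
import zarr

# Create a local store with a group hierarchy
root = zarr.open_group('example_group.zarr', mode='w')

# A 2-D array stored as 1000 x 1000 chunks, compressed with Blosc/zstd
z = root.create_dataset('air', shape=(10000, 10000), chunks=(1000, 1000),
                        dtype='f4', compressor=zarr.Blosc(cname='zstd', clevel=3))

# Write a NumPy array into a region, just like NumPy slice assignment
z[0:1000, 0:1000] = np.random.random((1000, 1000))

print(z.info)  # reports chunk shape, compressor, and compression ratio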

The key feature of Zarr is that it allows you to read and write files on a cloud storage system (such as AWS S3) just like on your local file system, while maintaining the data organization format of NetCDF.

2. Read the NetCDF file

Here, we will use surface temperature data from the NCEP Reanalysis dataset as an example. I first downloaded the 2019 surface temperature file air.sig995.2019.nc to my local computer, then used xarray to read the data from the file.

import xarray as xr

ds = xr.open_dataset('air.sig995.2019.nc')
ds
    <xarray.Dataset>
    Dimensions:    (lat: 73, lon: 144, nbnds: 2, time: 116)
    Coordinates:
      * lat        (lat) float32 90.0 87.5 85.0 82.5 ... -82.5 -85.0 -87.5 -90.0
      * lon        (lon) float32 0.0 2.5 5.0 7.5 10.0 ...
      * time       (time) datetime64[ns] 2019-01-01 2019-01-02 ... 2019-04-26
    Dimensions without coordinates: nbnds
    Data variables:
        air        (time, lat, lon) float32 ...
        time_bnds  (time, nbnds) float64 ...
    Attributes:
        Conventions:    COARDS
        title:          mean daily NMC reanalysis (2014)
        history:        Created 2017/12 by Hoop (netCDF2.3)
        description:    Data is from NMC initialized reanalysis\n(4x/day). These...
        platform:       Model
        References:     http://www.esrl.noaa.gov/psd/data/gridded/data.ncep.reana...
        dataset_title:  NCEP-NCAR Reanalysis 1
ds.air
    <xarray.DataArray 'air' (time: 116, lat: 73, lon: 144)>
    [1219392 values with dtype=float32]
    Coordinates:
      * lat      (lat) float32 90.0 87.5 85.0 82.5 80.0 ... -85.0 -87.5 -90.0
      * lon      (lon) float32 0.0 2.5 5.0 7.5 10.0 ...
      * time     (time) datetime64[ns] 2019-01-01 2019-01-02 ... 2019-04-26
    Attributes:
        long_name:    mean Daily Air temperature at sigma level 995
        units:        degK
        precision:    2
        GRIB_id:      11
        GRIB_name:    TMP
        var_desc:     Air temperature
        dataset:      NCEP Reanalysis Daily Averages
        level_desc:   Surface
        statistic:    Mean
        parent_stat:  Individual Obs
        valid_range:  [185.16 331.16]
        actual_range: [198.4 314]

From the above results, we can see that the air variable is the mean daily air temperature in degK. The variable has three dimensions: time, longitude (lon), and latitude (lat).
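
As a quick check (continuing from the ds opened above; the date picked here is just for illustration), we can inspect the variable's shape and pull out a single day:

# The air variable is a 3-D (time, lat, lon) cube
print(ds.air.shape)  # (116, 73, 144)

# Select one day and compute its global mean temperature (degK)
print(float(ds.air.sel(time='2019-01-01').mean()))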

3. Save the data in Zarr format

We will now save the above data in Zarr format. Since I don't have an AWS account, I will save it to my laptop. Note that if you have an AWS account, you can save it directly to AWS S3 with the help of the s3fs package. s3fs is a Python file interface to S3, built on boto3, the Amazon Web Services (AWS) SDK for Python.

import zarr

# Compress the data if needed
compressor = zarr.Blosc(cname='zstd', clevel=3)
encoding = {vname: {'compressor': compressor} for vname in ds.data_vars}
# Save to zarr
ds.to_zarr(store='zarr_example', encoding=encoding, consolidated=True)
<xarray.backends.zarr.ZarrStore at 0x31a4d1ef0>

We have now saved the data as a local Zarr store.
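
To verify what was written, one option (a sketch, again assuming the zarr v2 API) is to open the store with zarr itself and inspect the hierarchy:

import zarr

# Open the saved store read-only and inspect it
root = zarr.open_group('zarr_example', mode='r')
print(root.tree())       # the group/array hierarchy
print(root['air'].info)  # chunk shape, compressor, compression ratio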

The following code can be used to save the data in Zarr format to AWS S3.

import zarr
import s3fs

# AWS S3 path
s3_path = 's3://your_data_path/zarr_example'
# Initialize the S3 file system
s3 = s3fs.S3FileSystem()
store = s3fs.S3Map(root=s3_path, s3=s3, check=False)
# Compress the data if needed
compressor = zarr.Blosc(cname='zstd', clevel=3)
encoding = {vname: {'compressor': compressor} for vname in ds.data_vars}
# Save to zarr
ds.to_zarr(store=store, encoding=encoding, consolidated=True)

4. Read the Zarr file

Reading Zarr files is also easy. You can read Zarr files directly from cloud storage systems such as AWS S3, which is especially important for earth science data. With Zarr we can directly access the whole dataset or just part of it. Imagine if all weather and climate model output and satellite data (such as the NCEP reanalysis data), which can amount to thousands of terabytes or petabytes, were stored in the cloud: we could read and download exactly what we need directly from AWS S3 with just a few lines of code, without having to manage painful, time-consuming downloads. How wonderful that would make our lives!

Since I don't have an AWS S3 account, I'll use the local Zarr store as an example.

# Read Zarr file
zarr_ds = xr.open_zarr(store='zarr_example', consolidated=True)
zarr_ds
    <xarray.Dataset>
    Dimensions:    (lat: 73, lon: 144, nbnds: 2, time: 116)
    Coordinates:
      * lat        (lat) float32 90.0 87.5 85.0 82.5 ... -82.5 -85.0 -87.5 -90.0
      * lon        (lon) float32 0.0 2.5 5.0 7.5 10.0 ...
      * time       (time) datetime64[ns] 2019-01-01 2019-01-02 ... 2019-04-26
    Dimensions without coordinates: nbnds
    Data variables:
        air        (time, lat, lon) float32 dask.array<shape=(116, 73, 144), chunksize=(58, 37, 72)>
        time_bnds  (time, nbnds) float64 dask.array<shape=(116, 2), chunksize=(116, 2)>
    Attributes:
        Conventions:    COARDS
        References:     http://www.esrl.noaa.gov/psd/data/gridded/data.ncep.reana...
        dataset_title:  NCEP-NCAR Reanalysis 1
        description:    Data is from NMC initialized reanalysis\n(4x/day). These...
        history:        Created 2017/12 by Hoop (netCDF2.3)
        platform:       Model
        title:          mean daily NMC reanalysis (2014)

Easy, right? In fact, Zarr here only reads the metadata of the data file instead of loading all the actual data. This is useful when the dataset is large, because usually we don't use all of the data, only a subset of it. For example, we might read only the temperature data for January 2019.

import pandas as pd

# We'd like to read the data for January 2019
time_period = pd.date_range('2019-01-01', '2019-01-31')
# Select part of the zarr data
zarr_Jan = zarr_ds.sel(time=time_period)
zarr_Jan
    <xarray.Dataset>
    Dimensions:    (lat: 73, lon: 144, nbnds: 2, time: 31)
    Coordinates:
      * lat        (lat) float32 90.0 87.5 85.0 82.5 ... -82.5 -85.0 -87.5 -90.0
      * lon        (lon) float32 0.0 2.5 5.0 7.5 10.0 ...
      * time       (time) datetime64[ns] 2019-01-01 2019-01-02 ... 2019-01-31
    Dimensions without coordinates: nbnds
    Data variables:
        air        (time, lat, lon) float32 dask.array<shape=(31, 73, 144), chunksize=(31, 37, 72)>
        time_bnds  (time, nbnds) float64 dask.array<shape=(31, 2), chunksize=(31, 2)>
    Attributes:
        Conventions:    COARDS
        References:     http://www.esrl.noaa.gov/psd/data/gridded/data.ncep.reana...
        dataset_title:  NCEP-NCAR Reanalysis 1
        description:    Data is from NMC initialized reanalysis\n(4x/day). These...
        history:        Created 2017/12 by Hoop (netCDF2.3)
        platform:       Model
        title:          mean daily NMC reanalysis (2014)

Above, we only selected data from January 2019. Simple, right?
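
Note that up to this point only metadata has been read. An actual computation, as in the sketch below, is what finally pulls data from the store, and only the chunks overlapping January 2019 are fetched:

# Trigger real reads: compute the January mean temperature field
jan_mean = zarr_Jan.air.mean(dim='time').compute()
print(jan_mean.shape)  # (73, 144)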

The following code can be used to read a Zarr file on AWS S3.

# AWS S3 path
s3_path = 's3://your_data_path/zarr_example'
# Initialize the S3 file system
s3 = s3fs.S3FileSystem()
store = s3fs.S3Map(root=s3_path, s3=s3, check=False)
# Read Zarr file
ds = xr.open_zarr(store=store, consolidated=True)

5. Quick access secret: "consolidated=True"

Once the Zarr data is final and can be treated as read-only, we can consolidate the many metadata objects into a single object by passing the parameter consolidated=True. Doing so can greatly increase the speed of reading metadata from the dataset, making data access very fast!
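
If a store was written without consolidated metadata, it can be consolidated after the fact. A minimal sketch using the zarr v2 convenience function, applied to the local zarr_example store from above:

import zarr
import xarray as xr

# Merge the store's many small metadata objects into a single .zmetadata object
zarr.consolidate_metadata('zarr_example')

# Readers can now open the store with a single metadata request
ds = xr.open_zarr('zarr_example', consolidated=True)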

6. Summary

With Zarr, we can easily read and write files on cloud storage systems (such as AWS S3), which is very useful for storing and accessing big data in the cloud. Zarr's compression and consolidated-metadata capabilities also help save storage costs and speed up data access. Reading and writing n-dimensional data in the cloud is a very hot topic, and besides Zarr there are now several other packages in this space. So far, though, Zarr is the leader.