# NCZarr Introduction
Beginning with netCDF version 4.8.0, the Unidata NetCDF group has extended the netcdf-c library to provide access to cloud storage (e.g. Amazon S3 [1]) by providing a mapping from a subset of the full netCDF Enhanced (aka netCDF-4) data model to a variant of the Zarr [6] data model, which already has mappings to key-value pair cloud storage systems. The netCDF version of this storage format is called NCZarr [4].
# The NCZarr Data Model
NCZarr uses a data model [4] that is, by design, a superset of the Zarr Version 2 Specification [6]. Note carefully: an uncompressed legal Zarr dataset is also a legal NCZarr dataset. Conversely, if a Zarr implementation ignores unrecognized objects whose names begin with ".ncz" (e.g. _.nczarray_ or _.nczattr_), then an NCZarr dataset is also a legal Zarr dataset and should be readable by that Zarr implementation. This assumes, of course, that the actual storage format in which the dataset is stored -- a zip file, for example -- can be read by the Zarr implementation.
Briefly, the data model supported by NCZarr is netcdf-4 minus the user-defined types and the String type. As with netcdf-4, chunking is supported. Eventually NCZarr will also support filters in a manner similar to the way filters are supported in netcdf-4. When that is implemented, many compressed Zarr datasets will become readable by NCZarr.
Specifically, the model supports the following.
- "Atomic" types: char, byte, ubyte, short, ushort, int, uint, int64, uint64.
- Shared (named) dimensions
- Attributes with specified types – both global and per-variable
- Chunking
- Fill values
- Groups
- N-Dimensional variables
- Per-variable endianness (big or little)
With respect to full netCDF-4, the following concepts are currently unsupported.
- String type
- User-defined types (enum, opaque, VLEN, and Compound)
- Unlimited dimensions
- Contiguous or compact storage
Note that contiguous and compact are not actually supported because they are HDF5 specific. When specified, they are treated as chunked, with the file consisting of only one chunk. This means that testing for contiguous or compact is not possible; the nc_inq_var_chunking function will always return NC_CHUNKED, and the chunk sizes will be the same as the sizes of the variable's dimensions.
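For illustration, the following sketch (assuming ncid and varid were already obtained from nc_open and nc_inq_varid) shows what a caller inspecting chunking through the standard API would observe.

```
#include <netcdf.h>

/* Sketch: inspect the chunking of a variable in an NCZarr dataset. */
static int
report_chunking(int ncid, int varid)
{
    int storage;                          /* always NC_CHUNKED for NCZarr */
    size_t chunksizes[NC_MAX_VAR_DIMS];   /* one entry per dimension */
    int stat = nc_inq_var_chunking(ncid, varid, &storage, chunksizes);
    if (stat != NC_NOERR) return stat;
    /* For a variable written as "contiguous" or "compact", each
       chunksizes[i] equals the length of the variable's i-th dimension. */
    return NC_NOERR;
}
```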
# Enabling NCZarr Support
NCZarr support is enabled if the _--enable-nczarr_ option is passed to ./configure. If NCZarr support is enabled, then a usable version of libcurl must be specified using the LDFLAGS environment variable (similar to the way that the HDF5 libraries are referenced). Refer to the installation manual for details. NCZarr support can be disabled using the _--disable-nczarr_ option.
# Accessing Data Using the NCZarr Protocol
In order to access an NCZarr data source through the netCDF API, the file name normally used is replaced with a URL of a specific format. Note specifically that there is no NC_NCZARR flag for the mode argument of nc_create or nc_open; the NCZarr format is indicated by the URL itself.
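For illustration, the following minimal sketch (using a hypothetical bucket and dataset name) opens an NCZarr dataset by passing such a URL to nc_open in place of a file path.

```
#include <stdio.h>
#include <netcdf.h>

/* Sketch: open an NCZarr dataset by giving a URL where a file name
   would normally appear. The format is selected by the URL fragment,
   not by a mode flag. */
int
main(void)
{
    int ncid, stat;
    const char *url =
        "https://s3.us-east-1.amazonaws.com/examplebucket/exampledataset"
        "#mode=nczarr,s3";
    if ((stat = nc_open(url, NC_NOWRITE, &ncid)) != NC_NOERR) {
        fprintf(stderr, "nc_open: %s\n", nc_strerror(stat));
        return 1;
    }
    /* ... use the usual netCDF inquiry and read functions ... */
    return nc_close(ncid);
}
```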
## URL Format
The URL has the usual scheme://host:port/path?query#fragment format, but some details are important.
- Scheme: this should be https, s3, or file. The s3 scheme is equivalent to "https" plus setting "mode=nczarr,s3" (see below). Specifying "file" is mostly used for testing, but it also supports directory-tree and zipfile storage.
- Host: Amazon S3 defines two forms: Virtual and Path.
- Virtual: the host includes the bucket name as in bucket.s3.<region>.amazonaws.com
- Path: the host does not include the bucket name, but rather the bucket name is the first segment of the path. For example s3.<region>.amazonaws.com/bucket
- Other: It is possible to use other non-Amazon cloud storage, but that is cloud library dependent.
- Query: currently not used.
- Fragment: the fragment is of the form key=value&key=value&.... Depending on the key, the _=value_ part may be left out and some default value will be used.
## Client Parameters
The fragment part of a URL specifies the data format to be used, as well as additional controls for that format. For NCZarr support, the following key=value pairs are allowed.
- mode=nczarr|zarr|s3|xarray|file|zip
The mode key specifies the particular format to be used by the netcdf-c library for interpreting the dataset specified by the URL.
Using mode=nczarr causes the URL to be interpreted as a reference to a dataset that is stored in NCZarr format.
The modes s3, file, and zip tell the library what storage driver to use.
- The s3 driver is the default and indicates using Amazon S3 or some equivalent.
- The file format stores data in a directory tree.
- The zip format stores data in a local zip file.
Note that zipping a _file_-format directory tree should produce a file readable by the _zip_ storage driver, and vice versa.
The zarr mode tells the library to use NCZarr, but to restrict its operation to pure Zarr Version 2 datasets.
The xarray mode tells the library to support the XArray `_ARRAY_DIMENSIONS` convention.
The netcdf-c library is capable of inferring additional mode flags based on the flags it finds. Currently we have the following inferences.
- xarray => zarr
- zarr => nczarr
So, for example, `...#mode=xarray,zip` is equivalent to `...#mode=nczarr,zarr,xarray,zip`.
# NCZarr Map Implementation {#nczarr_mapimpl}
Internally, the nczarr implementation has a map abstraction that allows different storage formats to be used.
This is closely patterned on the same approach used in the Python Zarr implementation, which relies on the Python _MutableMap_ [5] class.
In NCZarr, the corresponding type is called _zmap_.
The __zmap__ API essentially implements a simplified variant
of the Amazon S3 API.
As with Amazon S3, __keys__ are UTF-8 strings with a specific structure:
that of a path similar to a Unix path, with '/' as the
separator for the segments of the path.
As with Unix, all keys have this BNF syntax:

    key: '/' | keypath ;
    keypath: '/' segment | keypath '/' segment ;
    segment: <sequence of UTF-8 characters except control characters and '/'>
Obviously, one can infer a tree structure from this key structure.
A containment relationship is defined by key prefixes.
One key is "contained" (possibly transitively) by another
if the second key is a prefix (in the string sense) of the first.
In this sense the key "/x/y/z" is contained by the key "/x/y".
In this model all keys "exist" but only some keys refer to
objects containing content -- _content bearing_.
An important restriction is placed on the structure of the tree,
namely that keys are only defined for content-bearing objects.
Further, all the leaves of the tree are these content-bearing objects.
This means that the key for one content-bearing object should not
be a prefix of any other key.
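As an illustration only (this helper is not part of the library API), containment can be expressed as a prefix test that respects segment boundaries.

```
#include <string.h>

/* Sketch: key containment as prefix testing on segment boundaries,
   so that "/x/y" contains "/x/y/z" but not "/x/yz". */
static int
key_contains(const char *container, const char *key)
{
    size_t n;
    if (strcmp(container, "/") == 0) return 1;   /* the root contains every key */
    n = strlen(container);
    if (strncmp(container, key, n) != 0) return 0;
    return key[n] == '/' || key[n] == '\0';      /* prefix must end on a segment boundary */
}
```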
There are several other concepts of note.
1. __Dataset__ - a dataset is the complete tree contained by the key defining
the root of the dataset. Technically, the root of the tree is the key <dataset>/.nczarr, where .nczarr can be considered the _superblock_ of the dataset.
2. __Object__ - equivalent of the S3 object; Each object has a unique key
and "contains" data in the form of an arbitrary sequence of 8-bit bytes.
The zmap API defined here isolates the key-value pair mapping
code from the Zarr-based implementation of NetCDF-4. It wraps
an internal C dispatch table that provides an abstract data
structure implementing the zmap key/object model.
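The following is a hypothetical illustration (not the actual zmap.h declarations) of the kind of dispatch table this model implies: every storage driver (s3, file, zip) supplies the same small set of key/object operations.

```
#include <stddef.h>

/* Illustrative only: names and signatures are assumptions, not the
   library's real zmap interface. */
struct ZmapOps {
    /* Length in bytes of the object at key; an error such as NC_EEMPTY
       if the key bears no content. */
    int (*len)(void* map, const char* key, size_t* lenp);
    /* Read count bytes starting at offset start from the object at key. */
    int (*read)(void* map, const char* key, size_t start, size_t count, void* content);
    /* Write count bytes starting at offset start, creating the object
       if necessary. */
    int (*write)(void* map, const char* key, size_t start, size_t count, const void* content);
    /* Return the names immediately below the given prefix key. */
    int (*search)(void* map, const char* prefix, size_t* nmatchesp, char*** matchesp);
    /* Close the map, optionally deleting the underlying storage. */
    int (*close)(void* map, int deleteit);
};
```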
__Search__: The search function has two purposes:
1. Support reading of pure zarr datasets (because they do not explicitly
track their contents).
2. Debugging to allow raw examination of the storage. See zdump
for example.
The search function takes a prefix path which has a key syntax
(see above). The set of legal keys is the set of keys such that
the key references a content-bearing object -- e.g. /x/y/.zarray
or /.zgroup. Essentially this is the set of keys pointing to the
leaf objects of the tree of keys constituting a dataset. This
set potentially limits the set of keys that need to be examined
during search.
The search function returns a limited set of names, where the
set of names are immediate suffixes of a given prefix path.
That is, if _<prefix>_ is the prefix path, then search returns
all _<name>_ such that _<prefix>/<name>_ is itself a prefix
of a "legal" key. This can be used to implement glob style
searches such as "/x/y/*" or "/x/y/**".
This semantics was chosen because it appears to be the minimum required to implement all other kinds of search using recursion. It was also chosen
to limit the number of names returned from the search. Specifically
1. Avoid returning keys that are not a prefix of some legal key.
2. Avoid returning all the legal keys in the dataset because that set may be very large; although the implementation may still have to examine all legal keys to get the desired subset.
3. Allow for use of partial read mechanisms such as iterators, if available. This can support processing a limited set of keys for each iteration. This is a straightforward tradeoff of space against time.
As a side note, S3 supports this kind of search using common
prefixes with a delimiter of '/', although the implementation is
a bit tricky. For the file system zmap implementation, the legal
search keys can be obtained one level at a time, which directly
implements the search semantics. For the zip file
implementation, this semantics is not possible, so the whole
tree must be obtained and searched.
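To illustrate the minimality claim above, here is a sketch of how a deep ("/x/y/**" style) walk might be built by recursing over the one-level search primitive; zmap_search is a hypothetical name for that primitive, not necessarily the real one.

```
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical one-level search: fill matches with the names
   immediately below prefix, per the semantics described above. */
extern int zmap_search(void* map, const char* prefix,
                       size_t* nmatchesp, char*** matchesp);

/* Sketch: deep walk by recursion; prefix is assumed not to end in '/'. */
static int
walk(void* map, const char* prefix)
{
    size_t i, nmatches = 0;
    char** matches = NULL;
    int stat = zmap_search(map, prefix, &nmatches, &matches);
    if (stat) return stat;
    for (i = 0; i < nmatches && stat == 0; i++) {
        char child[1024];                 /* S3 keys are at most 1024 bytes */
        snprintf(child, sizeof(child), "%s/%s", prefix, matches[i]);
        printf("%s\n", child);
        stat = walk(map, child);          /* descend one level */
    }
    for (i = 0; i < nmatches; i++) free(matches[i]);
    free(matches);
    return stat;
}
```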
__Issues:__
1. S3 limits key lengths to 1024 bytes. Some deeply nested netcdf files
will almost certainly exceed this limit.
2. Besides content, S3 objects can have an associated small set
of what may be called tags, which are themselves of the form of
key-value pairs, but where the key and value are always text. As
far as it is possible to determine, Zarr never uses these tags,
so they are not included in the zmap data structure.
__A Note on Error Codes:__
The zmap API returns two distinguished error codes:
1. NC_NOERR if an operation succeeded
2. NC_EEMPTY is returned when accessing a key that has no content.
Note that NC_EEMPTY is a new error code used to signal that the
caller asked for a non-content-bearing key.
This does not preclude other errors being returned, such as
NC_EACCESS or NC_EPERM or NC_EINVAL, if there are permission
errors or illegal function arguments, for example. It also does
not preclude the use of other error codes internal to the zmap
implementation. So zmap_file, for example, uses NC_ENOTFOUND
internally because it is possible to detect the existence of
directories and files. This does not propagate outside the zmap_file
implementation.
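As an illustration of the intended usage (zmap_len is a hypothetical name for a length query, not necessarily the real function), a caller can treat NC_EEMPTY as "no content at this key" rather than as a failure.

```
#include <stddef.h>
#include <netcdf.h>

/* Hypothetical zmap-style length query used only for illustration. */
extern int zmap_len(void* map, const char* key, size_t* lenp);

static int
object_exists(void* map, const char* key)
{
    size_t len;
    int stat = zmap_len(map, key, &len);
    if (stat == NC_NOERR)  return 1;   /* content-bearing object */
    if (stat == NC_EEMPTY) return 0;   /* key bears no content */
    return -1;                         /* a real error, e.g. permissions */
}
```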
## Zmap Implementations
The primary zmap implementation is _s3_ (i.e. _mode=nczarr,s3_)
and indicates that the Amazon S3 cloud storage
-- or some related appliance -- is to be used.
Another storage format uses a file system tree of directories and
files (_mode=nczarr,file_).
A third storage format uses a zip file (_mode=nczarr,zip_).
The latter two are used mostly for
debugging and testing. However, the _file_ and _zip_ formats
are important because they are intended to match the corresponding
storage formats used by the Python Zarr implementation. Hence
they should provide interoperability between NCZarr and
Python Zarr, although this has not been tested.
Examples of the typical URL form for _file_ and _zip_ are as follows.
file:///xxx/yyy/testdata.file#mode=nczarr,file
file:///xxx/yyy/testdata.zip#mode=nczarr,zip
Note that the extension (e.g. ".file" in "testdata.file")
is arbitrary, so the following would be equally acceptable.
file:///xxx/yyy/testdata.anyext#mode=nczarr,file
As with other URLs (e.g. DAP), these kinds of URLs can be passed
as the path argument to __ncdump__, for example.
# NCZarr versus Pure Zarr. {#nczarr_purezarr}
The NCZarr format extends the pure Zarr format by adding extra objects such as _.nczarr_ and _.nczarray_.
It is possible to suppress the use of these extensions so that the netcdf library can read and write a pure zarr formatted file.
This is controlled by using the _mode=nczarr,zarr_ combination.
The primary effects of using pure zarr are described
in the [Translation Section](@ref nczarr_translation).
# Notes on Debugging NCZarr Access {#nczarr_debug}
The NCZarr support has a trace facility.
Enabling this can sometimes give important information.
Tracing can be enabled by setting the environment variable NCTRACING=n,
where _n_ indicates the level of tracing. A good value of _n_ is 9.
# Zip File Support {#nczarr_zip}
In order to use the _zip_ storage format, the libzip [3]
library must be installed. Note that this is different from zlib.
# Amazon S3 Storage {#nczarr_s3}
The Amazon AWS S3 storage driver currently uses the Amazon AWS S3 Software Development Kit for C++ (aws-sdk-cpp).
In order to use it, the client must provide some configuration information.
Specifically, the `~/.aws/config` file should contain something like this.
[default]
output = json
aws_access_key_id=XXXX...
aws_secret_access_key=YYYY...
## Addressing Style
The notion of "addressing style" may need some expansion. Amazon S3 accepts two forms for specifying the endpoint for accessing the data.
1. Virtual -- the virtual addressing style places the bucket in
the host part of a URL. For example:
https://<bucketname>.s3.<region>.amazonaws.com/
2. Path -- the path addressing style places the bucket at
the front of the path part of a URL. For example:
https://s3.<region>.amazonaws.com/<bucketname>/
The NCZarr code will accept either form, although internally, it is standardized on path style.
The reason for this is that the bucket name forms the initial segment in the keys.
# Zarr vs NCZarr {#nczarr_vs_zarr}
## Data Model
The NCZarr storage format is almost identical to that of the
standard Zarr version 2 format. The data model differs as
follows.
1. Zarr supports filters -- NCZarr as yet does not
2. Zarr only supports anonymous dimensions -- NCZarr supports
only shared (named) dimensions.
3. Zarr attributes are untyped -- or perhaps more correctly
characterized as of type string.
## Storage Format
Consider both NCZarr and Zarr, and assume S3 notions of bucket and object.
In both systems, Groups and Variables (Array in Zarr) map to S3 objects.
Containment is modeled using the fact that the container's key is a prefix of the variable's key.
So for example, if variable _v1_ is contained in the top-level group _g1_ -- key _/g1_ -- then the key for _v1_ is _/g1/v1_.
Additional information is stored in special objects whose names start with ".z".
In Zarr, the following special objects exist.
1. Information about a group is kept in a special object named
_.zgroup_; so for example the object _/g1/.zgroup_.
2. Information about an array is kept as a special object named _.zarray_;
so for example the object _/g1/v1/.zarray_.
3. Group-level attributes and variable-level attributes are stored
in a special object named _.zattr_;
so for example the objects _/g1/.zattr_ and _/g1/v1/.zattr_.
The NCZarr format uses the same group and variable (array) objects as Zarr.
It also uses the Zarr special _.zXXX_ objects.
However, NCZarr adds some additional special objects.
1. _.nczarr_ -- this is in the top level group -- key _/.nczarr_.
It is in effect the "superblock" for the dataset and contains
any netcdf specific dataset level information. It is also used
to verify that a given key is the root of a dataset.
2. _.nczgroup_ -- this is a parallel object to _.zgroup_ and contains any netcdf specific group information. Specifically it contains the following.
* dims -- the name and size of shared dimensions defined in this group.
* vars -- the name of variables defined in this group.
* groups -- the name of sub-groups defined in this group.
These lists allow walking the NCZarr dataset without having to use
the potentially costly S3 list operation.
3. _.nczarray_ -- this is a parallel object to _.zarray_ and contains
netcdf specific information. Specifically it contains the following.
* dimrefs -- the names of the shared dimensions referenced by the variable.
* storage -- indicates if the variable is chunked vs contiguous
in the netcdf sense.
4. _.nczattr_ -- this is parallel to the .zattr objects and stores
the attribute type information.
## Translation {#nczarr_translation}
With some constraints, it is possible for an nczarr library to read Zarr and for a zarr library to read the nczarr format.
The latter case, zarr reading nczarr, is possible if the zarr library is willing to ignore objects whose names it does not recognize; specifically anything beginning with _.ncz_.
The former case, nczarr reading zarr, is also possible if the nczarr library can simulate or infer the contents of the missing _.nczXXX_ objects.
As a rule this can be done as follows.
1. _.nczgroup_ -- The list of contained variables and sub-groups
can be computed using the search API to list the keys
"contained" in the key for a group. Looking for occurrences
of _.zgroup_, _.zattr_, and _.zarray_ makes it possible to infer the keys for the
contained groups, attribute sets, and arrays (variables).
Constructing the set of "shared dimensions" is carried out
by walking all the variables in the whole dataset and collecting
the set of unique integer shapes for the variables.
For each such dimension length, a top level dimension is created
named ".zdim_<len>" where len is the integer length. The name
is subject to change.
2. _.nczarray_ -- The dimrefs are inferred by using the shape
in _.zarray_ and creating references to the simulated shared dimensions.
3. _.nczattr_ -- The type of each attribute is inferred by trying to parse the first attribute value string.
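As an illustration of the inference in item 3, the following sketch guesses a type by attempting to parse the first value string; the library's actual rules may differ in detail.

```
#include <stdlib.h>
#include <netcdf.h>

/* Illustrative heuristic only: not the library's actual parser. */
static nc_type
infer_attr_type(const char* value)
{
    char* end = NULL;
    (void)strtoll(value, &end, 10);
    if (end != value && *end == '\0') return NC_INT64;   /* integral value */
    (void)strtod(value, &end);
    if (end != value && *end == '\0') return NC_DOUBLE;  /* floating-point value */
    return NC_CHAR;                                      /* fall back to text */
}
```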
# Compatibility {#nczarr_compatibility}
In order to accommodate existing implementations, certain mode tags are provided to tell the NCZarr code to look for information used by specific implementations.
## XArray
> XArray support is introduced in netCDF-C version `4.8.1`.
The XArray [7] Zarr implementation uses its own mechanism for
specifying shared dimensions. It uses a special
attribute named `_ARRAY_DIMENSIONS`.
The value of this attribute is a list of dimension names (strings),
for example `["time", "lon", "lat"]`.
It is essentially equivalent to the _dimrefs_ list in the
NCZarr _.nczarray_ object, but stored as a specific variable attribute.
As of netcdf-c version 4.8.1, the XArray `_ARRAY_DIMENSIONS` attribute is supported. This attribute will be read/written if and only if the mode value "xarray" is specified. If enabled and detected, then these dimension names are used to define shared dimensions. Note that "xarray" implies the pure zarr format.
# Examples
Here are a couple of examples using the ncgen and ncdump utilities.
- Create an nczarr file using a local directory tree as storage.
ncgen -4 -lb -o "file:///home/user/dataset.file#mode=nczarr,file" dataset.cdl
- Display the content of an nczarr file using a local zip file as storage.
ncdump "file:///home/user/dataset.zip#mode=nczarr,zip"
- Create an nczarr file using S3 as storage.
ncgen -4 -lb -o "s3://s3.us-west-1.amazonaws.com/datasetbucket" dataset.cdl
- Create an nczarr file using S3 as storage and keeping to the pure zarr format.
ncgen -4 -lb -o "s3://s3.uswest-1.amazonaws.com/datasetbucket#mode=zarr" dataset.cdl
# References
[1] Amazon Simple Storage Service Documentation
[2] Amazon Simple Storage Service Library
[3] The LibZip Library
[4] NetCDF ZARR Data Model Specification
[5] Python Documentation: 8.3. collections — High-performance container datatypes
[6] Zarr Version 2 Specification
[7] XArray Zarr Encoding Specification
# Appendix A. Building NCZarr Support
Currently the following build cases are known to work.
Operating System | Build System | NCZarr | S3 Support
---------------- | ------------ | ------ | ----------
Linux | Automake | yes | yes
Linux | CMake | yes | yes
Cygwin | Automake | yes | no
OSX | Automake | unknown | unknown
OSX | CMake | unknown | unknown
Visual Studio | CMake | yes | tests fail
Note: S3 support includes both compiling the S3 support code as well as running the S3 tests.
## Automake
There are several options relevant to NCZarr support and to Amazon S3 support. These are as follows.
- _--enable-nczarr_ -- Enable the NCZarr support. If disabled, then all of the following options are disabled or irrelevant.
- _--enable-nczarr-s3_ -- Enable NCZarr S3 support.
- _--enable-nczarr-s3-tests_ -- Enable the NCZarr S3 tests. They are currently only usable by Unidata personnel, so they are disabled by default.
A note about using S3 with Automake. If S3 support is desired, and using Automake, then LDFLAGS must be properly set, namely to this.
LDFLAGS="$LDFLAGS -L/usr/local/lib -laws-cpp-sdk-s3"
The above assumes that these libraries were installed in '/usr/local/lib', so the above requires modification if they were installed elsewhere.
Note also that if S3 support is enabled, then you need to have a C++ compiler installed because part of the S3 support code is written in C++.
## CMake
The necessary CMake flags are as follows (with defaults).
- -DENABLE_NCZARR=on -- equivalent to the Automake _--enable-nczarr_ option.
- -DENABLE_NCZARR_S3=off -- equivalent to the Automake _--enable-nczarr-s3_ option.
- -DENABLE_NCZARR_S3_TESTS=off -- equivalent to the Automake _--enable-nczarr-s3-tests_ option.
Note that unlike Automake, CMake can properly locate C++ libraries, so it should not be necessary to specify -laws-cpp-sdk-s3 assuming that the aws s3 libraries are installed in the default location. For CMake with Visual Studio, the default location is here:
C:/Program Files (x86)/aws-cpp-sdk-all
It is possible to install the sdk library in another location. In this case, one must add the following flag to the cmake command.
cmake ... -DAWSSDK_DIR=<awssdkdir>
where "awssdkdir" is the path to the sdk installation. For example, this might be as follows.
cmake ... -DAWSSDK_DIR="c:\tools\aws-cpp-sdk-all"
This can be useful if blanks in path names cause problems in your build environment.
## Testing S3 Support
The relevant tests for S3 support are in the _nczarr_test_ directory. They will be run if _--enable-nczarr-s3-tests_ is on.
Currently, by default, testing of S3 with NCZarr is supported only for Unidata members of the NetCDF Development Group. This is because it uses a specific bucket on a specific internal S3 appliance that is inaccessible to the general user.
However, an untested mechanism exists by which others may be able to run the tests. If someone else wants to attempt these tests, then they need to define the following environment variables:
- NCZARR_S3_TEST_HOST=<host>
- NCZARR_S3_TEST_BUCKET=<bucket-name>
This assumes a Path Style address (see above) where
- host – the complete host part of the url
- bucket – a bucket in which testing can occur without fear of damaging anything.
Example:
NCZARR_S3_TEST_HOST=s3.us-west-1.amazonaws.com
NCZARR_S3_TEST_BUCKET=testbucket
If anyone tries to use this mechanism, it would be appreciated if any difficulties were reported to Unidata as a GitHub issue.
# Appendix B. Building aws-sdk-cpp
In order to use the S3 storage driver, it is necessary to install the Amazon aws-sdk-cpp library.
As a starting point, here are the CMake options used by Unidata to build that library. They assume that cmake is being executed in a build directory, say `build`, and that `build/../CMakeLists.txt` exists.
The expected set of installed libraries are as follows:
- aws-cpp-sdk-s3
- aws-cpp-sdk-core
This library depends on libcurl, so you may need to install that before building the sdk library.
# Appendix C. Amazon S3 Imposed Limits
The Amazon S3 cloud storage imposes some significant limits that are inherited by NCZarr (and Zarr also, for that matter).
Some of the relevant limits are as follows:
- The maximum size of a single object upload is 5 Gigabytes; the maximum size of a single object is 5 Terabytes.
- S3 key names can be any UNICODE name with a maximum length of 1024 bytes. Note that the limit is defined in terms of bytes and not (Unicode) characters. This affects the depth to which groups can be nested because the key encodes the full path name of a group.
# Appendix D. Alternative Mechanisms for Accessing Remote Datasets
The NetCDF-C library contains an alternate mechanism for accessing data stored in Amazon S3: the byte-range mechanism. The idea is to treat the remote data as if it were a big file. This remote "file" can be randomly accessed using the HTTP Byte-Range header.
In the Amazon S3 context, a copy of a dataset, a netcdf-3 or netcdf-4 file, is uploaded into a single object in some bucket. Then, using the key to this object, it is possible to tell the netcdf-c library to treat the object as a remote file and to use the HTTP Byte-Range protocol to access the contents of the object. The dataset object is referenced using a URL with the trailing fragment containing the string "#mode=bytes".
An examination of the test program nc_test/test_byterange.sh shows simple examples using the ncdump program. One such test is specified as follows:
https://s3.us-east-1.amazonaws.com/noaa-goes16/ABI-L1b-RadC/2017/059/03/OR_ABI-L1b-RadC-M3C13_G16_s20170590337505_e20170590340289_c20170590340316.nc#mode=bytes
Note that for S3 access, it is expected that the URL is in what is called "path" format where the bucket, noaa-goes16 in this case, is part of the URL path instead of the host.
The _#mode=bytes_ mechanism generalizes to work with most servers that support byte-range access.
Specifically, Thredds servers support such access using the HttpServer access method as can be seen from this URL taken from the above test program.
https://thredds-test.unidata.ucar.edu/thredds/fileServer/irma/metar/files/METAR_20170910_0000.nc#bytes
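For completeness, here is a sketch of opening the NOAA object shown above through the C API; byte ranges of the remote object are fetched over HTTP as the metadata and data are read.

```
#include <stdio.h>
#include <netcdf.h>

/* Sketch: open a remote netCDF file via the byte-range mechanism. */
int
main(void)
{
    int ncid;
    const char *url =
        "https://s3.us-east-1.amazonaws.com/noaa-goes16/ABI-L1b-RadC/2017/059/03/"
        "OR_ABI-L1b-RadC-M3C13_G16_s20170590337505_e20170590340289_c20170590340316.nc"
        "#mode=bytes";
    if (nc_open(url, NC_NOWRITE, &ncid) != NC_NOERR) {
        fprintf(stderr, "byte-range open failed\n");
        return 1;
    }
    return nc_close(ncid);
}
```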
## Byte-Range Authorization
If using byte-range access, it may be necessary to tell the netcdf-c library about the so-called secretid and accessid values. These are usually stored in the file `~/.aws/config` and/or `~/.aws/credentials`. In the latter file, the entries might look like this.
[default]
aws_access_key_id=XXXXXXXXXXXXXXXXXXXX
aws_secret_access_key=YYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYY
# Point of Contact
Author: Dennis Heimbigner
Email: dmh at ucar dot edu
Initial Version: 4/10/2020
Last Revised: 2/22/2021