Beginning with netCDF version 4.8.0, the Unidata NetCDF group has extended the netcdf-c library to provide access to cloud storage (e.g. Amazon S3 [1]). This extension provides a mapping from a subset of the full netCDF Enhanced (aka netCDF-4) data model to a variant of the Zarr [4] data model. The NetCDF version of this storage format is called NCZarr [4].
A note on terminology in this document.
There are some important caveats to be aware of when using this software.
NCZarr uses a data model [4] that, by design, extends the Zarr Version 2 Specification [6] to add support for the NetCDF-4 data model.
Note carefully: a legal NCZarr dataset is also a legal Zarr dataset under a specific assumption, namely that within Zarr meta-data objects, such as ''.zarray'', unrecognized dictionary keys are ignored. If an implementation makes this assumption, then an NCZarr dataset is a legal Zarr dataset and should be readable by that Zarr implementation. The converse is also true: a legal Zarr dataset is also a legal NCZarr dataset, where "legal" means it conforms to the Zarr Version 2 specification. In addition, certain non-Zarr features are allowed and used; one such feature is the XArray ''_ARRAY_DIMENSIONS'' attribute.
There are two other, secondary assumptions:
Briefly, the data model supported by NCZarr is netcdf-4 minus the user-defined types. However, a restricted form of String type is supported (see Appendix H). As with netcdf-4, chunking is supported. Filters and compression are also supported.
Specifically, the model supports the following.
With respect to full netCDF-4, the following concepts are currently unsupported.
Note that contiguous and compact storage are not actually supported, because they are HDF5 specific. When specified, they are treated as chunked, where the file consists of only one chunk. This means that testing for contiguous or compact is not possible; the nc_inq_var_chunking function will always return NC_CHUNKED, and the chunk sizes will be the same as the dimension sizes of the variable's dimensions.
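For illustration, here is a hedged sketch of how this behaves through the API; the URL and the variable name "t" are assumptions, not part of any real dataset.

```c
#include <stdio.h>
#include <netcdf.h>

/* Hedged sketch: the URL and variable name "t" are assumptions. */
int check_storage(void)
{
    int ncid, varid, storage, stat;
    size_t chunksizes[NC_MAX_VAR_DIMS];

    if ((stat = nc_open("file:///tmp/example.file#mode=nczarr,file", NC_NOWRITE, &ncid))) return stat;
    if ((stat = nc_inq_varid(ncid, "t", &varid))) return stat;

    /* For NCZarr variables, storage is always reported as NC_CHUNKED, even if
     * the variable was declared contiguous or compact when written; in that
     * case the chunk sizes equal the dimension sizes. */
    if ((stat = nc_inq_var_chunking(ncid, varid, &storage, chunksizes))) return stat;
    printf("chunked? %s\n", storage == NC_CHUNKED ? "yes" : "no");

    return nc_close(ncid);
}
```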
Additionally, it should be noted that NCZarr supports scalar variables, but Zarr does not; Zarr only supports dimensioned variables. In order to support interoperability, NCZarr does the following.
These actions allow NCZarr to properly show scalars in its API while still maintaining compatibility with Zarr.
NCZarr support is enabled by default. If the --disable-nczarr option is used with './configure', then NCZarr (and Zarr) support is disabled. If NCZarr support is enabled, then support for datasets stored as files in a directory tree is provided as the only guaranteed mechanism for storing datasets. However, several additional storage mechanisms are available if additional libraries are installed.
In order to access an NCZarr data source through the netCDF API, the file name normally used is replaced with a URL with a specific format. Note specifically that there is no NC_NCZARR flag for the mode argument of nc_create or nc_open; instead, the format is indicated by the URL itself.
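For example, a minimal hedged sketch of opening an NCZarr dataset looks like this; the bucket, region, and object path below are placeholders, not a real dataset.

```c
#include <netcdf.h>

/* Hedged sketch: bucket, region, and object path are placeholders. */
int open_nczarr_example(void)
{
    int ncid, stat;
    /* The URL takes the place of the usual file name; no NC_NCZARR mode
     * flag exists, so the format is selected entirely by the URL. */
    stat = nc_open("https://s3.us-east-1.amazonaws.com/example-bucket/testdata#mode=nczarr,s3",
                   NC_NOWRITE, &ncid);
    if (stat != NC_NOERR) return stat;
    return nc_close(ncid);
}
```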
The URL has the usual general format.
There are some details that are important.
The fragment part of a URL is used to specify information that is interpreted to specify what data format is to be used, as well as additional controls for that data format. For NCZarr support, the following key=value pairs are allowed.
Typically one will specify two mode flags: one to indicate what format to use and one to specify the way the dataset is to be stored. For example, a common combination is "mode=zarr,file".
Using mode=nczarr causes the URL to be interpreted as a reference to a dataset that is stored in NCZarr format. The zarr mode tells the library to use NCZarr, but to restrict its operation to pure Zarr Version 2 datasets.
The modes s3, file, and zip tell the library what storage driver to use.
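The following hedged examples show how the storage driver is selected in the URL fragment; the host, bucket, and paths are placeholders.

```c
/* Placeholder URLs illustrating the three storage drivers. */
const char *url_s3   = "https://s3.us-east-1.amazonaws.com/example-bucket/testdata#mode=nczarr,s3";
const char *url_file = "file:///tmp/testdata.file#mode=nczarr,file"; /* directory tree */
const char *url_zip  = "file:///tmp/testdata.zip#mode=nczarr,zip";   /* zip archive */
```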
Note that zipping a directory tree produced by the file storage format should yield a file readable by the zip storage format, and vice versa.
By default, the XArray convention is supported and used for both NCZarr files and pure Zarr files. This means that every variable in the root group whose named dimensions are also in the root group will have an attribute called *_ARRAY_DIMENSIONS* that stores those dimension names. The noxarray mode tells the library to disable the XArray support.
The netcdf-c library is capable of inferring additional mode flags based on the flags it finds. Currently we have the following inferences.
So, for example, ''...#mode=zarr,zip'' is equivalent to this.
Internally, the nczarr implementation has a map abstraction that allows different storage formats to be used. This is closely patterned on the same approach used in the Python Zarr implementation, which relies on the Python MutableMapping [5] class.
In NCZarr, the corresponding type is called zmap. The zmap API essentially implements a simplified variant of the Amazon S3 API.
As with Amazon S3, keys are UTF-8 strings with a specific structure: that of a path similar to a Unix path, with '/' as the separator between the segments of the path.
As with Unix, all keys have this BNF syntax:
Obviously, one can infer a tree structure from this key structure. A containment relationship is defined by key prefixes. Thus one key is "contained" (possibly transitively) by another if one key is a prefix (in the string sense) of the other. So in this sense the key "/x/y/z" is contained by the key "/x/y".
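A small hedged sketch shows containment as a segment-aware prefix test; this helper is illustrative only and is not part of the netcdf-c API.

```c
#include <string.h>

/* Illustrative helper (not part of the netcdf-c API): nonzero if 'key' is
 * contained, possibly transitively, by 'container'. */
static int key_contained_by(const char *key, const char *container)
{
    size_t clen = strlen(container);
    if (clen == 1 && container[0] == '/') return 1;    /* root contains all keys */
    if (strncmp(key, container, clen) != 0) return 0;  /* not a string prefix */
    return key[clen] == '/' || key[clen] == '\0';      /* prefix ends on a segment boundary */
}
/* key_contained_by("/x/y/z", "/x/y") -> 1; key_contained_by("/x/yz", "/x/y") -> 0 */
```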
In this model all keys "exist" but only some keys refer to objects containing content – aka content bearing. An important restriction is placed on the structure of the tree, namely that keys are only defined for content-bearing objects. Further, all the leaves of the tree are these content-bearing objects. This means that the key for one content-bearing object should not be a prefix of any other key.
There are several other concepts of note.
The zmap API defined here isolates the key-value pair mapping code from the Zarr-based implementation of NetCDF-4. It wraps an internal C dispatch table that implements the zmap key/object model as an abstract data structure. Of special note is the "search" function of the API.
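To make the dispatch-table idea concrete, here is a hedged sketch; the struct name, fields, and signatures are assumptions for illustration and do not reproduce the actual internal zmap API.

```c
#include <stddef.h>

/* Hypothetical zmap-style dispatch table; names and signatures are assumptions. */
typedef struct ZmapOps {
    int (*exists)(void *state, const char *key);               /* does a content-bearing object exist? */
    int (*read)(void *state, const char *key,
                size_t start, size_t count, void *content);    /* read part of an object's content */
    int (*write)(void *state, const char *key,
                 size_t count, const void *content);           /* create or overwrite an object */
    int (*search)(void *state, const char *prefix,
                  size_t *nnamesp, char ***namesp);            /* immediate child names under prefix */
    int (*close)(void *state, int deleteit);                   /* release resources */
} ZmapOps;

/* Each storage driver (s3, file, zip) would supply its own ZmapOps instance,
 * and the Zarr-based NetCDF-4 code would call through it. */
```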
Search: The search function has two purposes:
The search function takes a prefix path which has a key syntax (see above). The set of legal keys is the set of keys such that the key references a content-bearing object – e.g. /x/y/.zarray or /.zgroup. Essentially this is the set of keys pointing to the leaf objects of the tree of keys constituting a dataset. This set potentially limits the set of keys that need to be examined during search.
The search function returns a limited set of names, where the set of names are immediate suffixes of a given prefix path. That is, if _<prefix>_ is the prefix path, then search returns all _<name>_ such that _<prefix>/<name>_ is itself a prefix of a "legal" key. This can be used to implement glob-style searches such as "/x/y/*" or "/x/y/**".
This semantics was chosen because it appears to be the minimum required to implement all other kinds of search using recursion. It was also chosen to limit the number of names returned from the search. Specifically
As a side note, S3 supports this kind of search using common prefixes with a delimiter of '/', although its use is a bit tricky. For the file system zmap implementation, the legal search keys can be obtained one level at a time, which directly implements the search semantics. For the zip file implementation, this semantics is not possible, so the whole tree must be obtained and searched.
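As a hedged illustration of these semantics, the key set and helper below are made up and are not library code; a search over a prefix yields only the immediate child names.

```c
#include <stdio.h>
#include <string.h>

/* Illustrative only: print the immediate child name of each legal key lying
 * under 'prefix'; a real implementation would also remove duplicates. */
static void print_immediate_children(const char *prefix, const char **keys, int nkeys)
{
    size_t plen = strlen(prefix);
    for (int i = 0; i < nkeys; i++) {
        const char *k = keys[i];
        if (strncmp(k, prefix, plen) != 0 || k[plen] != '/') continue;
        const char *name = k + plen + 1;                 /* first segment past the prefix */
        const char *slash = strchr(name, '/');
        int len = slash ? (int)(slash - name) : (int)strlen(name);
        printf("%.*s\n", len, name);
    }
}

/* With keys {"/x/y/.zgroup", "/x/y/v/.zarray", "/x/y/v/0.0"} and prefix "/x/y",
 * this prints ".zgroup", "v", "v" - only immediate names, never "v/.zarray". */
```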
Issues:
A Note on Error Codes:
The zmap API returns some distinguished error codes:
This does not preclude other errors being returned, such as NC_EACCESS or NC_EPERM or NC_EINVAL, if there are permission errors or illegal function arguments, for example. It also does not preclude the use of other error codes internal to the zmap implementation. So zmap_file, for example, uses NC_ENOTFOUND internally because it is possible to detect the existence of directories and files; but this code does not propagate outside the zmap_file implementation.
The primary zmap implementation is s3 (i.e. mode=nczarr,s3) and indicates that the Amazon S3 cloud storage – or some related appliance – is to be used. Another storage format uses a file system tree of directories and files (mode=nczarr,file). A third storage format uses a zip file (mode=nczarr,zip). The latter two are used mostly for debugging and testing. However, the file and zip formats are important because they are intended to match the corresponding storage formats used by the Python Zarr implementation. Hence they should serve to provide interoperability between NCZarr and Python Zarr, although this interoperability has not been tested.
Examples of the typical URL form for file and zip are as follows.
Note that the extension (e.g. ".file" in "testdata.file") is arbitrary, so this would be equally acceptable.
As with other URLs (e.g. DAP), these kinds of URLs can be passed as the path argument to, for example, ncdump.
The NCZARR format extends the pure Zarr format by adding extra keys such as ''_NCZARR_ARRAY'' inside the ''.zarray'' object. It is possible to suppress the use of these extensions so that the netcdf library can read and write a pure zarr formatted file. This is controlled by using ''mode=zarr'', which is an alias for the ''mode=nczarr,zarr'' combination. The primary effects of using pure zarr are described in the Translation Section.
There are some constraints on the reading of Zarr datasets using the NCZarr implementation.
Again, this list should diminish over time.
The NCZarr support has a trace facility. Enabling this can sometimes give important, but voluminous information. Tracing can be enabled by setting the environment variable NCTRACING=n, where n indicates the level of tracing. A good value of n is 9.
In order to use the zip storage format, the libzip [3] library must be installed. Note that this is different from zlib.
The Amazon AWS S3 storage driver currently uses the Amazon AWS S3 Software Development Kit for C++ (aws-s3-sdk-cpp). In order to use it, the client must provide some configuration information. Specifically, the ''~/.aws/config'' file should contain something like this (typically a ''[default]'' profile section specifying at least a ''region'' key).
See Appendix E for additional information.
The notion of "addressing style" may need some expansion. Amazon S3 accepts two forms for specifying the endpoint for accessing the data.
The NCZarr code will accept either form, although internally, it is standardized on path style. The reason for this is that the bucket name forms the initial segment in the keys.
The NCZarr storage format is almost identical to that of the standard Zarr version 2 format. The data model differs as follows.
Consider both NCZarr and Zarr, and assume S3 notions of bucket and object. In both systems, Groups and Variables (Arrays in Zarr) map to S3 objects. Containment is modeled using the fact that the containing group's key is a prefix of the variable's key. So, for example, if variable v1 is contained in the top-level group g1 – _/g1_ – then the key for v1 is _/g1/v1_. Additional meta-data information is stored in special objects whose names start with ".z".
In Zarr, the following special objects exist.
The first three contain meta-data objects in the form of a string representing a JSON-formatted dictionary. The NCZarr format uses the same objects as Zarr, but inserts NCZarr-specific key-value pairs in them to hold NCZarr-specific information. The value of each of these keys is a JSON dictionary containing a variety of NCZarr-specific information.
These keys are as follows:
__nczarr_superblock__ – this is in the top level group – key _/.zgroup_. It is in effect the "superblock" for the dataset and contains any netcdf-specific dataset-level information. It is also used to verify that a given key is the root of a dataset. Currently it contains the following key(s):
__nczarr_group__ – this key appears in every _.zgroup_ object. It contains any netcdf specific group information. Specifically it contains the following keys:
__nczarr_array__ – this key appears in every _.zarray_ object. It contains netcdf specific array information. Specifically it contains the following keys:
__nczarr_attr__ – this key appears in every _.zattr_ object. This means that, technically, it is an attribute, but one for which access is normally suppressed. Specifically it contains the following keys:
With some constraints, it is possible for an nczarr library to read Zarr and for a zarr library to read the nczarr format. The latter case, zarr reading nczarr, is possible if the zarr library is willing to ignore keys whose names it does not recognize; specifically anything beginning with __NCZARR__.
The former case, nczarr reading zarr, is also possible if the nczarr library can simulate or infer the contents of the missing __NCZARR_XXX_ objects. As a rule, this can be done as follows.
In order to accommodate existing implementations, certain mode tags are provided to tell the NCZarr code to look for information used by specific implementations.
The Xarray [7] Zarr implementation uses its own mechanism for specifying shared dimensions. It uses a special attribute named ''_ARRAY_DIMENSIONS''. The value of this attribute is a list of dimension names (strings); an example might be ["time", "lon", "lat"]. It is essentially equivalent to the _nczarr_array_ "dimrefs" list, except that the latter uses fully qualified names so the referenced dimensions can be anywhere in the dataset.
As of netcdf-c version 4.8.2, the Xarray ''_ARRAY_DIMENSIONS'' attribute is supported for both NCZarr and pure Zarr. If possible, this attribute will be read/written by default, but can be suppressed if the mode value "noxarray" is specified. If detected, then these dimension names are used to define shared dimensions. The following conditions will cause ''_ARRAY_DIMENSIONS'' to not be written.
Here are a couple of examples using the ncgen and ncdump utilities.
[1] Amazon Simple Storage Service Documentation
[2] Amazon Simple Storage Service Library
[3] The LibZip Library
[4] NetCDF ZARR Data Model Specification
[5] Python Documentation: 8.3. collections — High-performance container datatypes
[6] Zarr Version 2 Specification
[7] XArray Zarr Encoding Specification
[8] Dynamic Filter Loading
[9] Officially Registered Custom HDF5 Filters
[10] C-Blosc Compressor Implementation
[11] Conda-forge / packages / aws-sdk-cpp
[12] GDAL Zarr
Currently the following build cases are known to work.
Operating System | Build System | NCZarr | S3 Support
---------------- | ------------ | ------- | ----------
Linux            | Automake     | yes     | yes
Linux            | CMake        | yes     | yes
Cygwin           | Automake     | yes     | no
OSX              | Automake     | unknown | unknown
OSX              | CMake        | unknown | unknown
Visual Studio    | CMake        | yes     | tests fail
Note: S3 support includes both compiling the S3 support code as well as running the S3 tests.
There are several options relevant to NCZarr support and to Amazon S3 support. These are as follows.
A note about using S3 with Automake: if S3 support is desired, then LDFLAGS must be properly set, namely to something like ``LDFLAGS="$LDFLAGS -L/usr/local/lib -laws-cpp-sdk-s3"``.
The above assumes that these libraries were installed in '/usr/local/lib', so the above requires modification if they were installed elsewhere.
Note also that if S3 support is enabled, then you need to have a C++ compiler installed because part of the S3 support code is written in C++.
The necessary CMake flags are as follows (with defaults)
Note that unlike Automake, CMake can properly locate C++ libraries, so it should not be necessary to specify -laws-cpp-sdk-s3 assuming that the aws s3 libraries are installed in the default location. For CMake with Visual Studio, the default location is here:
It is possible to install the sdk library in another location. In this case, one must add the following flag to the cmake command.
where "awssdkdir" is the path to the sdk installation. For example, this might be as follows.
This can be useful if blanks in path names cause problems in your build environment.
The relevant tests for S3 support are in the nczarr_test directory. Currently, by default, testing of S3 with NCZarr is supported only for Unidata members of the NetCDF Development Group. This is because it uses a Unidata-specific bucket that is inaccessible to the general user.
In order to use the S3 storage driver, it is necessary to install the Amazon aws-sdk-cpp library.
Building this package from scratch has proven to be a formidable task. This appears to be due to dependencies on very specific versions of, for example, openssl.
For Linux, the following context works. Of course your mileage may vary.
In order to build netcdf-c with S3 sdk support, the following options must be specified for ./configure.
If you have access to the Unidata bucket on Amazon, then you can also test S3 support with this option.
It is possible to build and install aws-sdk-cpp. It is also possible to build netcdf-c using cmake. Unfortunately, testing currently fails.
For Windows, the following context works. Of course your mileage may vary.
This command-line build assumes one is using Cygwin or Mingw to provide tools such as bash.
Notice that the sdk is being installed in the directory "c:\tools\aws-sdk-cpp" rather than the default location "c:\Program Files (x86)/aws-sdk-cpp-all". This is because when using a command line, an install path that contains blanks may not work.
Enabling S3 support is controlled by these two cmake options:
However, to find the aws sdk libraries, the following environment variables must be set:
Then the following options must be specified for cmake.
The Amazon S3 cloud storage imposes some significant limits that are inherited by NCZarr (and Zarr also, for that matter).
Some of the relevant limits are as follows:
The NetCDF-C library contains an alternate mechanism for accessing traditional netcdf-4 files stored in Amazon S3: the byte-range mechanism. The idea is to treat the remote data as if it were a single large file. This remote "file" can be randomly accessed using the HTTP Byte-Range header.
In the Amazon S3 context, a copy of a dataset, a netcdf-3 or netcdf-4 file, is uploaded into a single object in some bucket. Then, using the key to this object, it is possible to tell the netcdf-c library to treat the object as a remote file and to use the HTTP Byte-Range protocol to access the contents of the object. The dataset object is referenced using a URL with the trailing fragment containing the string ''#mode=bytes''.
An examination of the test program nc_test/test_byterange.sh shows simple examples using the ncdump program. One such test is specified as follows:
Note that for S3 access, it is expected that the URL is in what is called "path" format where the bucket, noaa-goes16 in this case, is part of the URL path instead of the host.
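A minimal hedged sketch of opening a remote file via byte ranges; the bucket and object path below are placeholders.

```c
#include <netcdf.h>

/* Hedged sketch: bucket and object path are placeholders. */
int open_byterange_example(void)
{
    int ncid, stat;
    /* Path-style S3 URL; the #mode=bytes fragment selects byte-range access. */
    stat = nc_open("https://s3.us-east-1.amazonaws.com/example-bucket/data/sample.nc#mode=bytes",
                   NC_NOWRITE, &ncid);
    if (stat != NC_NOERR) return stat;
    return nc_close(ncid);
}
```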
The _#mode=bytes_ mechanism generalizes to work with most servers that support byte-range access.
Specifically, THREDDS servers support such access using the HttpServer access method, as can be seen from this URL taken from the above test program.
If byterange support is enabled, the netcdf-c library will parse the AWS configuration files (e.g. ''~/.aws/config'') to extract profile names plus a list of key=value pairs. This example is typical.
The keys in the profile will be used to set various parameters in the library.
The algorithm for choosing the active profile to use is as follows:
The profile named "none" is a special profile that the netcdf-c library automatically defines. It should not be defined anywhere else. It signals to the library that no credentialas are to used. It is equivalent to the "--no-sign-request" option in the AWS CLI. Also, it must be explicitly specified by name. Otherwise "default" will be used.
If the specified URL is of the form
Then this is rebuilt to this form:
However, this requires figuring out the region to use. The algorithm for picking a region is as follows.
Picking an access-key/secret-key pair is always determined by the current active profile. Choosing not to use keys requires that the active profile be "none".
In NCZarr Version 1, the NCZarr specific metadata was represented using new objects rather than as keys in existing Zarr objects. Due to conflicts with the Zarr specification, that format is deprecated in favor of the one described above. However the netcdf-c NCZarr support can still read the version 1 format.
The version 1 format defines three specific objects: _.nczgroup_, _.nczarray_, and _.nczattr_. These are stored in parallel with the corresponding Zarr objects. So if there is a key of the form "/x/y/.zarray", then there is also a key "/x/y/.nczarray". The content of these objects is the same as the content of the corresponding keys in the current format. So the value of the ''_NCZARR_ARRAY'' key is the same as the content of the ''.nczarray'' object. The list of connections is as follows:
The Zarr V2 specification is somewhat vague on what is a legal value for an attribute. The examples all show one of two cases:
However, the Zarr specification can be read to infer that the value can in fact be any legal JSON expression. This "convention" is currently used routinely to help support various attributes created by other packages where the attribute is a complex JSON expression. An example is the GDAL Driver convention [12], where the value is a complex JSON dictionary.
In order for NCZarr to be as consistent as possible with Zarr Version 2, it is desirable to support this convention for attribute values. This means that there must be some way to handle an attribute whose value is not either of the two cases above. That is, its value is some more complex JSON expression. Ideally both reading and writing of such attributes should be supported.
One more point. NCZarr attempts to record the associated netcdf attribute type (encoded in the form of a NumPy "dtype") for each attribute. This information is stored as NCZarr-specific metadata. Note that pure Zarr makes no attempt to record such type information.
The current algorithm to support JSON valued attributes operates as follows.
There are multiple cases to consider.
The process of reading and interpreting an attribute value requires two pieces of information.
Given these two pieces of information, the read process is as follows.
Zarr supports a string type, but it is restricted to fixed size strings. NCZarr also supports such strings, but there are some differences in order to interoperate with the netcdf-4/HDF5 variable length strings.
The primary issue to be addressed is to provide a way for the user to specify the maximum size of the fixed-length strings. This is handled by providing the following new attributes:
Note that when accessing a string through the netCDF API, the fixed length strings appear as variable length strings. This means that they are stored as pointers to the string (i.e. char*) and with a trailing nul character. One consequence is that if the user writes a variable length string through the netCDF API, and the length of that string is greater than the maximum string length for a variable, then the string is silently truncated. Another consequence is that the user must reclaim the string storage.
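A minimal hedged sketch of this, assuming an already-open dataset containing a one-dimensional string variable named "s" of length 10 (both the name and the length are assumptions):

```c
#include <netcdf.h>

/* Hedged sketch: assumes a 1-D string variable "s" of length 10 exists. */
int read_strings_example(int ncid)
{
    int varid, stat;
    char *strings[10] = {NULL};

    if ((stat = nc_inq_varid(ncid, "s", &varid))) return stat;

    /* Fixed-size NCZarr strings are returned as variable-length (char *)
     * strings through the netCDF API. */
    if ((stat = nc_get_var_string(ncid, varid, strings))) return stat;

    /* ... use the strings ... */

    /* The caller owns the returned storage and must reclaim it. */
    return nc_free_string(10, strings);
}
```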
Adding strings also requires some hacking to handle the existing netcdf-c NC_CHAR type, which does not exist in Zarr. The goal was to choose NumPy types for both the netcdf-c NC_STRING type and the netcdf-c NC_CHAR type such that if a pure zarr implementation reads them, it will still work.
For writing variables and NCZarr attributes, the type mapping is as follows:
Admittedly, this encoding is a bit of a hack.
So when reading data with a pure zarr implementation, the above types should always appear as strings, and the type that signals NC_CHAR (in NCZarr) would be handled by Zarr as a string of length 1.
Note, this log was only started as of 8/11/2022 and is not intended to be a detailed chronology. Rather, it provides highlights that will be of interest to NCZarr users. In order to see exact changes, it is necessary to use the 'git diff' command.
Author: Dennis Heimbigner
Email: dmh at ucar dot edu
Initial Version: 4/10/2020
Last Revised: 8/27/2022