The NetCDF NCZarr Implementation
NCZarr Introduction
The NCZarr Data Model
Enabling NCZarr Support
Accessing Data Using the NCZarr Prototocol
- URL Format
- Client Parameters
NCZarr Map Implementation
- Zmap Implementatons
NCZarr versus Pure Zarr.
Notes on Debugging NCZarr Access
Zip File Support
- Addressing Style
Zarr vs NCZarr
Compatibility
- XArray
Examples
Appendix A. Building NCZarr Support
Appendix B. Amazon S3 Imposed Limits
Appendix C. JSON Attribute Convention.
Appendix D. Support for string types
References
Change Log
Point of Contact

The NetCDF NCZarr Implementation

NCZarr Introduction

Beginning with netCDF version 4.8.0, the Unidata NetCDF group has extended the netcdf-c library to support data stored using the Zarr data model and storage format [4,6]. As part of this support, netCDF adds support for accessing data stored using cloud storage (e.g. Amazon S3 [1] ).

The goal of this project, then, is to provide maximum interoperability between the netCDF Enhanced (netcdf-4) data model and the Zarr version 2 [4]data model. This is embodied in the netcdf-c library so that it is possible to use the netcdf API to read and write Zarr formatted datasets.

In order to better support the netcdf-4 data model, the netcdf-c library implements a limited set of extensions to the Zarr data model. This extended model is referred to as NCZarr. Additionally, another goal is to ensure interoperability between NCZarr formatted files and standard (aka pure) Zarr formatted files. This means that (1) an NCZarr file can be read by any other Zarr library (and especially the Zarr-python library), and (2) a standard Zarr file can be read by netCDF. Of course, there limitations in that other Zarr libraries will not use the extra, NCZarr meta-data, and netCDF will have to "fake" meta-data not provided by a pure Zarr file.

As a secondary – but equally important – goal, it must be possible to use the NCZarr library to read and write datasets that are pure Zarr, which means that none of the NCZarr extensions are used. This feature does come with some costs, namely that information contained in the netcdf-4 data model may be lost in the pure Zarr dataset.

Notes on terminology in this document.

The term "dataset" is used to refer to all of the Zarr objects constituting the meta-data and data.
NCZarr currently is not thread-safe. So any attempt to use it with parallelism, including MPIO, is likely to fail.

The NCZarr Data Model

NCZarr uses a data model that, by design, extends the Zarr Version 2 Specification .

Note Carefully: a legal NCZarr dataset is expected to also be a legal Zarr dataset. The inverse is true also. A legal Zarr dataset is expected to also be a legal NCZarr dataset, where "legal" means it conforms to the Zarr specification(s). In addition, certain non-Zarr features are allowed and used. Specifically the XArray [7] ''_ARRAY_DIMENSIONS'' attribute is one such.

There are two other, secondary assumption:

The actual storage format in which the dataset is stored – a zip file, for example – can be read by the Zarr implementation.
The compressors (aka filters) used by the dataset can be encoded/decoded by the implementation. NCZarr uses HDF5-style filters, so ensuring access to such filters is somewhat complicated. See the companion document on filters for details.

Briefly, the data model supported by NCZarr is netcdf-4 minus the user-defined types and full String type support. However, a restricted form of String type is supported (see Appendix D). As with netcdf-4, chunking is supported. Filters and compression are also supported.

Specifically, the model supports the following.

"Atomic" types: char, byte, ubyte, short, ushort, int, uint, int64, uint64, string.
Shared (named) dimensions
Unlimited dimensions
Attributes with specified types – both global and per-variable
Chunking
Fill values
Groups
N-Dimensional variables
Scalar variables
Per-variable endianness (big or little)
Filters (including compression)

With respect to full netCDF-4, the following concepts are currently unsupported.

User-defined types (enum, opaque, VLEN, and Compound)
Contiguous or compact storage

Note that contiguous and compact are not actually supported because they are HDF5 specific. When specified, they are treated as chunked where the file consists of only one chunk. This means that testing for contiguous or compact is not possible; the nc_inq_var_chunking function will always return NC_CHUNKED and the chunksizes will be the same as the dimension sizes of the variable's dimensions.

Additionally, it should be noted that NCZarr supports scalar variables, but Zarr Version 2 does not; Zarr V2 only supports dimensioned variables. In order to support interoperability, NCZarr V2 does the following.

A scalar variable is recorded in the Zarr metadata as if it has a shape of [1].
A note is stored in the NCZarr metadata that this is actually a netCDF scalar variable.

These actions allow NCZarr to properly show scalars in its API while still maintaining compatibility with Zarr.

Enabling NCZarr Support

NCZarr support is enabled by default. If the –disable-nczarr option is used with './configure', then NCZarr (and Zarr) support is disabled. If NCZarr support is enabled, then support for datasets stored as files in a directory tree is provided as the only guaranteed mechanism for storing datasets. However, several addition storage mechanisms are available if additional libraries are installed.

Zip format – if libzip is installed, then it is possible to directly read and write datasets stored in zip files.
If one of the supported AWS SDKs is installed, then it is possible to directly read and write datasets stored in the Amazon S3 cloud storage.

Accessing Data Using the NCZarr Prototocol

In order to access a NCZarr data source through the netCDF API, the file name normally used is replaced with a URL with a specific format. Note specifically that there is no NC_NCZARR flag for the mode argument of nc_create or nc_open. In this case, it is indicated by the URL path.

URL Format

The URL is the usual format.

protocol:://host:port/path?query#fragment

See the document "quickstart_paths" for details about using URLs.

There are, however, some details that are important.

Protocol: this should be https or s3,or file. The s3 scheme is equivalent to "https" plus setting "mode=s3". Specifying "file" is mostly used for testing, but also for directory tree or zipfile format storage.

Client Parameters

The fragment part of a URL is used to specify information that is interpreted to specify what data format is to be used, as well as additional controls for that data format.

For reading, key=value pairs are provided for specifying the storage format.

mode=nczarr|zarr

Additional pairs are provided to specify the Zarr version.

mode=v2

Additional pairs are provided to specify the storage medium: Amazon S3 vs File tree vs Zip file.

mode=file|zip|s3

Note that when reading, an attempt will be made to infer the format and Zarr version and storage medium format by probing the file. If inferencing fails, then it is reported. In this case, the client may need to add specific mode flags to avoid inferencing.

Typically one will specify three mode flags: one to indicate what format to use and one to specify the way the dataset is to be stored. For example, a common one is "mode=zarr,file"

Obviously, when creating a file, inferring the type of file to create is not possible so the mode flags must be set specifically. This means that both the storage medium and the exact storage format must be specified. Using mode=nczarr causes the URL to be interpreted as a reference to a dataset that is stored in NCZarr format. The zarr mode tells the library to use NCZarr, but to restrict its operation to operate on pure Zarr.

The modes s3, file, and zip tell the library what storage medium driver to use.

The s3 driver stores data using Amazon S3 or some equivalent.
The file driver stores data in a directory tree.
The zip driver stores data in a local zip file.

As an aside, it should be the case that zipping a file format directory tree will produce a file readable by the zip storage format, and vice-versa.

By default, the XArray convention is supported for Zarr Version 2 and used for both NCZarr files and pure Zarr files.

This means that every variable in the root group whose named dimensions are also in the root group will have an attribute called *_ARRAY_DIMENSIONS* that stores those dimension names. The noxarray mode tells the library to disable the XArray support.

NCZarr Map Implementation

Internally, the nczarr implementation has a map abstraction that allows different storage formats to be used. This is closely patterned on the same approach used in the Python Zarr implementation, which relies on the Python MutableMap [5] class.

In NCZarr, the corresponding type is called zmap. The zmap API essentially implements a simplified variant of the Amazon S3 API.

As with Amazon S3, keys are utf8 strings with a specific structure: that of a path similar to those of a Unix path with '/' as the separator for the segments of the path.

As with Unix, all keys have this BNF syntax:

key: '/' | keypath ;
keypath: '/' segment | keypath '/' segment ;
segment: <sequence of UTF-8 characters except control characters and '/'>

Obviously, one can infer a tree structure from this key structure. A containment relationship is defined by key prefixes. Thus one key is "contained" (possibly transitively) by another if one key is a prefix (in the string sense) of the other. So in this sense the key "/x/y/z" is contained by the key "/x/y".

In this model all keys "exist" but only some keys refer to objects containing content – aka content bearing. An important restriction is placed on the structure of the tree, namely that keys are only defined for content-bearing objects. Further, all the leaves of the tree are these content-bearing objects. This means that the key for one content-bearing object should not be a prefix of any other key.

There several other concepts of note.

Dataset - a dataset is the complete tree contained by the key defining the root of the dataset. The term File will often be used as a synonym. Technically, the root of the tree is the key <dataset>/.zgroup, where .zgroup can be considered the superblock of the dataset.
Object - equivalent of the S3 object; Each object has a unique key and "contains" data in the form of an arbitrary sequence of 8-bit bytes.

The zmap API defined here isolates the key-value pair mapping code from the Zarr-based implementation of NetCDF-4. It wraps an internal C dispatch table manager for implementing an abstract data structure implementing the zmap key/object model. Of special note is the "search" function of the API.

Search: The search function has two purposes:

Support reading of pure zarr datasets (because they do not explicitly track their contents).
Debugging to allow raw examination of the storage. See zdump for example.

The search function takes a prefix path which has a key syntax (see above). The set of legal keys is the set of keys such that the key references a content-bearing object – e.g. /x/y/.zarray or /.zgroup. Essentially this is the set of keys pointing to the leaf objects of the tree of keys constituting a dataset. This set potentially limits the set of keys that need to be examined during search.

The search function returns a limited set of names, where the set of names are immediate suffixes of a given prefix path. That is, if _<prefix>_ is the prefix path, then search returnsnall _<name>_ such that _<prefix>/<name>_ is itself a prefix of a "legal" key. This can be used to implement glob style searches such as "/x/y/*" or "/x/y/**"

This semantics was chosen because it appears to be the minimum required to implement all other kinds of search using recursion. It was also chosen to limit the number of names returned from the search. Specifically

Avoid returning keys that are not a prefix of some legal key.
Avoid returning all the legal keys in the dataset because that set may be very large; although the implementation may still have to examine all legal keys to get the desired subset.
Allow for use of partial read mechanisms such as iterators, if available. This can support processing a limited set of keys for each iteration. This is a straighforward tradeoff of space over time.

As a side note, S3 supports this kind of search using common prefixes with a delimiter of '/', although its use is a bit tricky. For the file system zmap implementation, the legal search keys can be obtained one level at a time, which directly implements the search semantics. For the zip file implementation, this semantics is not possible, so the whole tree must be obtained and searched.

Issues:

S3 limits key lengths to 1024 bytes. Some deeply nested netcdf files will almost certainly exceed this limit.
Besides content, S3 objects can have an associated small set of what may be called tags, which are themselves of the form of key-value pairs, but where the key and value are always text. As far as it is possible to determine, Zarr never uses these tags, so they are not included in the zmap data structure.

A Note on Error Codes:

The zmap API returns some distinguished error code:

NC_NOERR if a operation succeeded
NC_EEMPTY is returned when accessing a key that has no content.
NC_EOBJECT is returned when an object is found which should not exist
NC_ENOOBJECT is returned when an object is not found which should exist

This does not preclude other errors being returned such NC_EACCESS or NC_EPERM or NC_EINVAL if there are permission errors or illegal function arguments, for example. It also does not preclude the use of other error codes internal to the zmap implementation. So zmap_file, for example, uses NC_ENOTFOUND internally because it is possible to detect the existence of directories and files. But this does not propagate outside the zmap_file implementation.

Zmap Implementatons

The primary zmap implementation is s3 (i.e. mode=nczarr,s3) and indicates that the Amazon S3 cloud storage – or some related applicance – is to be used. Another storage format uses a file system tree of directories and files (mode=nczarr,file). A third storage format uses a zip file (mode=nczarr,zip). The latter two are used mostly for debugging and testing. However, the file and zip formats are important because they are intended to match corresponding storage formats used by the Python Zarr implementation. Hence it should serve to provide interoperability between NCZarr and the Python Zarr, although this interoperability has had only limited testing.

Examples of the typical URL form for file and zip are as follows.

file:///xxx/yyy/testdata.file#mode=nczarr,file

file:///xxx/yyy/testdata.zip#mode=nczarr,zip

Note that the extension (e.g. ".file" in "testdata.file") is arbitrary, so this would be equally acceptable.

file:///xxx/yyy/testdata.anyext#mode=nczarr,file

As with other URLS (e.g. DAP), these kind of URLS can be passed as the path argument to, for example, ncdump.

NCZarr versus Pure Zarr.

The NCZARR format extends the pure Zarr format by adding extra attributes such as ''_nczarr_array'' inside the ''.zattr'' object. It is possible to suppress the use of these extensions so that the netcdf library can write a pure zarr formatted file. But this probably unnecessary since these attributes should be readable by any other Zarr implementation. But these extra attributes might be seen as clutter and so it is possible to suppress them when writing using mode=zarr.

Reading of pure Zarr files created using other implementations is a necessary compatibility feature of NCZarr. This requirement imposed some constraints on the reading of Zarr datasets using the NCZarr implementation.

Zarr allows some primitive types not recognized by NCZarr. Over time, the set of unrecognized types is expected to diminish. Examples of currently unsupported types are as follows:
- "c" – complex floating point
- "m" – timedelta
- "M" – datetime
The Zarr dataset may reference filters and compressors unrecognized by NCZarr.
The Zarr dataset may store data in column-major order instead of row-major order. The effect of encountering such a dataset is to output the data in the wrong order.

Again, this list should diminish over time.

Notes on Debugging NCZarr Access

The NCZarr support has a trace facility. Enabling this can sometimes give important, but voluminous information. Tracing can be enabled by setting the environment variable NCTRACING=n, where n indicates the level of tracing. A good value of n is 9.

Zip File Support

In order to use the zip storage format, the libzip [3] library must be installed. Note that this is different from zlib.

Addressing Style

The notion of "addressing style" may need some expansion. Amazon S3 accepts two forms for specifying the endpoint for accessing the data (see the document "quickstart_path").

Virtual – the virtual addressing style places the bucket in the host part of a URL. For example:

https://<bucketname>.s2.<region&gt.amazonaws.com/

Path – the path addressing style places the bucket in at the front of the path part of a URL. For example:

https://s3.<region&gt.amazonaws.com/<bucketname>/

The NCZarr code will accept either form, although internally, it is standardized on path style. The reason for this is that the bucket name forms the initial segment in the keys.

Zarr vs NCZarr

Data Model

The NCZarr storage format is almost identical to that of the the standard Zarr format. The data model differs as follows.

Zarr only supports anonymous dimensions – NCZarr supports only shared (named) dimensions.
Zarr attributes are untyped – or perhaps more correctly characterized as of type string.
Zarr does not explicitly support unlimited dimensions – NCZarr does support them.

Storage Medium

Consider both NCZarr and Zarr, and assume S3 notions of bucket and object. In both systems, Groups and Variables (Array in Zarr) map to S3 objects. Containment is modeled using the fact that the dataset's key is a prefix of the variable's key. So for example, if variable v1 is contained in top level group g1 – _/g1 – then the key for v1 is _/g1/v_. Additional meta-data information is stored in special objects whose name start with ".z".

In Zarr Version 2, the following special objects exist.

Information about a group is kept in a special object named .zgroup; so for example the object _/g1/.zgroup_.
Information about an array is kept as a special object named .zarray; so for example the object _/g1/v1/.zarray_.
Group-level attributes and variable-level attributes are stored in a special object named .zattr; so for example the objects _/g1/.zattr_ and _/g1/v1/.zattr_.
Chunk data is stored in objects named "\<n1\>.\<n2\>...,\<nr\>" where the ni are positive integers representing the chunk index for the ith dimension.

The first three contain meta-data objects in the form of a string representing a JSON-formatted dictionary. The NCZarr format uses the same objects as Zarr, but inserts NCZarr specific attributes in the .zattr object to hold NCZarr specific information The value of each of these attributes is a JSON dictionary containing a variety of NCZarr specific information.

These NCZarr-specific attributes are as follows:

__nczarr_superblock__ – this is in the top level group's .zattr object. It is in effect the "superblock" for the dataset and contains any netcdf specific dataset level information. It is also used to verify that a given key is the root of a dataset. Currently it contains keys that are ignored and exist only to ensure that older netcdf library versions do not crash.

"version" – the NCZarr version defining the format of the dataset (deprecated).

__nczarr_group__ – this key appears in every group's .zattr object. It contains any netcdf specific group information. Specifically it contains the following keys:

"dimensions" – the name and size of shared dimensions defined in this group, as well an optional flag indictating if the dimension is UNLIMITED.
"arrays" – the name of variables defined in this group.
"groups" – the name of sub-groups defined in this group. These lists allow walking the NCZarr dataset without having to use the potentially costly search operation.

__nczarr_array__ – this key appears in the .zattr object associated with a .zarray object. It contains netcdf specific array information. Specifically it contains the following keys:

dimension_references – the fully qualified names of the shared dimensions referenced by the variable.
storage – indicates if the variable is chunked vs contiguous in the netcdf sense. Also signals if a variable is scalar.

__nczarr_attr__ – this attribute appears in every .zattr object. Specifically it contains the following keys:

types – the types of all attributes in the .zattr object.

Translation

With some loss of netcdf-4 information, it is possible for an nczarr library to read the pure Zarr format and for other zarr libraries to read the nczarr format.

The latter case, zarr reading nczarr, is trival because all of the nczarr metadata is stored as ordinary, String valued (but JSON syntax), attributes.

The former case, nczarr reading zarr is possible assuming the nczarr code can simulate or infer the contents of the missing __nczarr_xxx_ attributes. As a rule this can be done as follows.

__nczarr_group__ – The list of contained variables and sub-groups can be computed using the search API to list the keys "contained" in the key for a group. The search looks for occurrences of .zgroup, .zattr, .zarray to infer the keys for the contained groups, attribute sets, and arrays (variables). Constructing the set of "shared dimensions" is carried out by walking all the variables in the whole dataset and collecting the set of unique integer shapes for the variables. For each such dimension length, a top level dimension is created named "_Anonymous_Dimension_<len>" where len is the integer length.
__nczarr_array__ – The dimension referencess are inferred by using the shape in .zarray and creating references to the simulated shared dimensions. netcdf specific information.
__nczarr_attr__ – The type of each attribute is inferred by trying to parse the first attribute value string.

Compatibility

In order to accomodate existing implementations, certain mode tags are provided to tell the NCZarr code to look for information used by specific implementations.

XArray

The Xarray [7] Zarr implementation uses its own mechanism for specifying shared dimensions. It uses a special attribute named ''_ARRAY_DIMENSIONS''. The value of this attribute is a list of dimension names (strings). An example might be ["time", "lon", "lat"]. It is almost equivalent to the _nczarr_array "dimension_references" list, except that the latter uses fully qualified names so the referenced dimensions can be anywhere in the dataset. The Xarray dimension list differs from the netcdf-4 shared dimensions in two ways.

Specifying Xarray in a non-root group has no meaning in the current Xarray specification.
A given name can be associated with different lengths, even within a single array. This is considered an error in NCZarr.

The Xarray ''_ARRAY_DIMENSIONS'' attribute is supported for both NCZarr and pure Zarr. If possible, this attribute will be read/written by default, but can be suppressed if the mode value "noxarray" is specified. If detected, then these dimension names are used to define shared dimensions. The following conditions will cause ''_ARRAY_DIMENSIONS'' to not be written.

The variable is not in the root group,
Any dimension referenced by the variable is not in the root group.

Note that this attribute is not needed for Zarr Version 3, and is ignored.

Examples

Here are a couple of examples using the ncgen and ncdump utilities.

Create an nczarr file using a local directory tree as storage.
ncgen -4 -lb -o "file:///home/user/dataset.file#mode=nczarr,file" dataset.cdl
Display the content of an nczarr file using a zip file as storage.
ncdump "file:///home/user/dataset.zip#mode=nczarr,zip"
Create an nczarr file using S3 as storage.
ncgen -4 -lb -o "s3://s3.us-west-1.amazonaws.com/datasetbucket" dataset.cdl
Create an nczarr file using S3 as storage and keeping to the pure zarr format.
ncgen -4 -lb -o 's3://s3.uswest-1.amazonaws.com/datasetbucket\#mode=zarr dataset.cdl
Create an nczarr file using the s3 protocol with a specific profile
ncgen -4 -lb -o "s3://datasetbucket/rootkey\#mode=nczarr&awsprofile=unidata" dataset.cdl

Note that the URL is internally translated to this
"https://s2.<region&gt.amazonaws.com/datasetbucket/rootkey\#mode=nczarr&awsprofile=unidata"

Appendix A. Building NCZarr Support

Currently the following build cases are known to work. Note that this does not include S3 support. A separate tabulation of S3 support is in the document cloud.md.

Operating System	Build System	NCZarr
Linux	Automake	yes
Linux	CMake	yes
Cygwin	Automake	yes
Cygwin	CMake	yes
OSX	Automake	yes
OSX	CMake	yes
Visual Studio	CMake	yes

Automake

The relevant ./configure options are as follows.

–disable-nczarr – disable the NCZarr support.

CMake

The relevant CMake flags are as follows.

-DNETCDF_ENABLE_NCZARR=off – equivalent to the Automake –disable-nczarr option.

Testing NCZarr S3 Support

The relevant tests for S3 support are in the nczarr_test directory. Currently, by default, testing of S3 with NCZarr is supported only for Unidata members of the NetCDF Development Group. This is because it uses a Unidata-specific bucket that is inaccessible to the general user.

NetCDF Build

In order to build netcdf-c with S3 sdk support, the following options must be specified for ./configure.

--enable-s3

If you have access to the Unidata bucket on Amazon, then you can also test S3 support with this option.

--with-s3-testing=yes

NetCDF CMake Build

Enabling S3 support is controlled by this cmake option:

-DNETCDF_ENABLE_S3=ON

However, to find the aws sdk libraries, the following environment variables must be set:

AWSSDK_ROOT_DIR="c:/tools/aws-sdk-cpp"
AWSSDKBIN="/cygdrive/c/tools/aws-sdk-cpp/bin"
PATH="$PATH:${AWSSDKBIN}"

Then the following options must be specified for cmake.

-DAWSSDK_ROOT_DIR=${AWSSDK_ROOT_DIR}

-DAWSSDK_DIR=${AWSSDK_ROOT_DIR}/lib/cmake/AWSSDK

Appendix B. Amazon S3 Imposed Limits

The Amazon S3 cloud storage imposes some significant limits that are inherited by NCZarr (and Zarr also, for that matter).

Some of the relevant limits are as follows:

The maximum object size is 5 Gigabytes with a total for all objects limited to 5 Terabytes.
S3 key names can be any UNICODE name with a maximum length of 1024 bytes. Note that the limit is defined in terms of bytes and not (Unicode) characters. This affects the depth to which groups can be nested because the key encodes the full path name of a group.

Appendix C. JSON Attribute Convention.

The Zarr V2 specification is somewhat vague on what is a legal value for an attribute. The examples all show one of two cases:

A simple JSON scalar atomic values (e.g. int, float, char, etc), or
A JSON array of such values.

However, the Zarr specification can be read to infer that the value can in fact be any legal JSON expression. This "convention" is currently used routinely to help support various attributes created by other packages where the attribute is a complex JSON expression. An example is the GDAL Driver convention [12], where the value is a complex JSON dictionary.

In order for NCZarr to be as consistent as possible with Zarr, it is desirable to support this convention for attribute values. This means that there must be some way to handle an attribute whose value is not either of the two cases above. That is, its value is some more complex JSON expression. Ideally both reading and writing of such attributes should be supported.

One more point. NCZarr attempts to record the associated netcdf attribute type (encoded in the form of a NumPy "dtype") for each attribute. This information is stored as NCZarr-specific metadata. Note that pure Zarr makes no attempt to record such type information.

The current algorithm to support JSON valued attributes operates as follows.

Writing an attribute:

There are mutiple cases to consider.

The netcdf attribute is not of type NC_CHAR and its value is a single atomic value.
- Convert to an equivalent JSON atomic value and write that JSON expression.
- Compute the Zarr equivalent dtype and store in the NCZarr metadata.
The netcdf attribute is not of type NC_CHAR and its value is a vector of atomic values.
- Convert to an equivalent JSON array of atomic values and write that JSON expression.
- Compute the Zarr equivalent dtype and store in the NCZarr metadata.
The netcdf attribute is of type NC_CHAR and its value – taken as a single sequence of characters – is parseable as a legal JSON expression.
- Parse to produce a JSON expression and write that expression.
- Use "|J0" as the dtype and store in the NCZarr metadata.
The netcdf attribute is of type NC_CHAR and its value – taken as a single sequence of characters – is not parseable as a legal JSON expression.
- Convert to a JSON string and write that expression
- Use ">S1" as the dtype and store in the NCZarr metadata.

Reading an attribute:

The process of reading and interpreting an attribute value requires two pieces of information.

The value of the attribute as a JSON expression, and
The optional associated dtype of the attribute; note that this may not exist if, for example, the file is pure zarr.

Given these two pieces of information, the read process is as follows.

The JSON expression is a simple JSON atomic value.
- If the dtype is defined, then convert the JSON to that type of data, and then store it as the equivalent netcdf vector of size one.
- If the dtype is not defined, then infer the dtype based on the the JSON value, and then store it as the equivalent netcdf vector of size one.
The JSON expression is an array of simple JSON atomic values.
- If the dtype is defined, then convert each JSON value in the array to that type of data, and then store it as the equivalent netcdf vector.
- If the dtype is not defined, then infer the dtype based on the first JSON value in the array, and then store it as the equivalent netcdf vector.
The attribute is any other JSON structure.
- Un-parse the expression to an equivalent sequence of characters, and then store it as of type NC_CHAR.

Notes

If a character valued attributes's value can be parsed as a legal JSON expression, then it will be stored as such.
Reading and writing are almost idempotent in that the sequence of actions "read-write-read" is equivalent to a single "read" and "write-read-write" is equivalent to a single "write". The "almost" caveat is necessary because (1) whitespace may be added or lost during the sequence of operations, and (2) numeric precision may change.

Appendix D. Support for string types

Zarr supports a string type, but it is restricted to fixed size strings. NCZarr also supports such strings, but there are some differences in order to interoperate with the netcdf-4/HDF5 variable length strings.

The primary issue to be addressed is to provide a way for user to specify the maximum size of the fixed length strings. This is handled by providing the following new attributes:

**_nczarr_default_maxstrlen** — This is an attribute of the root group. It specifies the default maximum string length for string types. If not specified, then it has the value of 128 characters.
**_nczarr_maxstrlen** — This is a per-variable attribute. It specifies the maximum string length for the string type associated with the variable. If not specified, then it is assigned the value of **_nczarr_default_maxstrlen**.

Note that when accessing a string through the netCDF API, the fixed length strings appear as variable length strings. This means that they are stored as pointers to the string (i.e. char*) and with a trailing nul character. One consequence is that if the user writes a variable length string through the netCDF API, and the length of that string is greater than the maximum string length for a variable, then the string is silently truncated. Another consequence is that the user must reclaim the string storage.

Adding strings also requires some hacking to handle the existing netcdf-c NC_CHAR type, which does not exist in Zarr. The goal was to choose NumPY types for both the netcdf-c NC_STRING type and the netcdf-c NC_CHAR type such that if a pure zarr implementation reads them, it will still work.

For writing variables and NCZarr attributes, the type mapping is as follows:

">S1" for NC_CHAR.
"|S1" for NC_STRING && MAXSTRLEN==1
"|Sn" for NC_STRING && MAXSTRLEN==n

Admittedly, this encoding is a bit of a hack.

So when reading data with a pure zarr implementaion the above types should always appear as strings, and the type that signals NC_CHAR (in NCZarr) would be handled by Zarr as a string of length 1.

References

[1] Amazon Simple Storage Service Documentation
[2] Amazon Simple Storage Service Library
[3] The LibZip Library
[4] NetCDF ZARR Data Model Specification
[5] Python Documentation: 8.3. collections — High-performance dataset datatypes
[6] Zarr Version 2 Specification
[7] XArray Zarr Encoding Specification
[8] Dynamic Filter Loading
[9] Officially Registered Custom HDF5 Filters
[10] C-Blosc Compressor Implementation
[11] Conda-forge packages / aws-sdk-cpp
[12] GDAL Zarr

Change Log

[Note: minor text changes are not included.]

Note, this log was only started as of 8/11/2022 and is not intended to be a detailed chronology. Rather, it provides highlights that will be of interest to NCZarr users. In order to see exact changes, It is necessary to use the 'git diff' command.

03/31/2024

Document the change to V2 to using attributes to hold NCZarr metadata.

01/31/2024

Add description of support for Zarr version 3 as an appendix.

3/10/2023

Move most of the S3 text to the cloud.md document.

8/29/2022

Zarr fixed-size string types are now supported.

8/11/2022

The NCZarr specific keys have been converted to lower-case (e.g. "_nczarr_attr" instead of "_NCZARR_ATTR"). Upper case is accepted for back compatibility.
The legal values of an attribute has been extended to include arbitrary JSON expressions; see Appendix D for more details.

Point of Contact

Author: Dennis Heimbigner
Email: dmh at ucar dot edu
Initial Version: 4/10/2020
Last Revised: 4/02/2024

Table of Contents