NetCDF Users Guide v1.2
The netCDF library supports a general filter mechanism to apply various kinds of filters to datasets before reading or writing.
The netCDF enhanced (aka netCDF-4) library inherits this capability since it depends on the HDF5 library. The HDF5 library (1.8.11 and later) supports filters, and netCDF is based closely on that underlying HDF5 mechanism.
The NCZarr/Zarr implementation also supports filters. It uses the HDF5-style filters as its implementation, but extends them to support the NumCodecs JSON-based format as an alternative to the HDF5 unsigned integer format.
In all cases, filters assume that a variable has chunking defined and each chunk is filtered before writing and "unfiltered" after reading and before passing the data to the user.
If multiple filters are defined on a variable, they are applied in first-defined order when writing and in the reverse order when reading.
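The ordering rule can be illustrated with a minimal sketch. It uses two Python standard library codecs (zlib and bz2) as stand-ins for filters; they are not HDF5 plugins, and the names `FILTERS`, `write_chunk`, and `read_chunk` are hypothetical, chosen only for this illustration.

```python
import bz2
import zlib

# Stand-in "filters": each entry is an (encode, decode) pair.
# These are plain stdlib codecs used for illustration, not HDF5 plugins.
FILTERS = [
    (zlib.compress, zlib.decompress),  # defined first
    (bz2.compress, bz2.decompress),    # defined second
]


def write_chunk(raw: bytes) -> bytes:
    """Apply every filter to a chunk in first-defined order."""
    data = raw
    for encode, _ in FILTERS:
        data = encode(data)
    return data


def read_chunk(stored: bytes) -> bytes:
    """Undo the filters in the reverse of the definition order."""
    data = stored
    for _, decode in reversed(FILTERS):
        data = decode(data)
    return data


chunk = bytes(range(256)) * 16
stored = write_chunk(chunk)
restored = read_chunk(stored)
```

Because bz2 was applied last on writing, it has to be undone first on reading; trying to undo the filters in definition order fails because the stored bytes are not a zlib stream.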
The most common kind of filter is a compression-decompression filter, and that is the focus of this document.
This document first covers the HDF5-style filters and then the NCZarr-style filters.
HDF5 supports dynamic loading of compression filters using the following process for reading of compressed data.
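In order to compress a variable with an HDF5 compliant filter, the netcdf-c library must be given three pieces of information: a unique identifier for the filter to be used, a vector of parameters controlling the action of the filter, and access to a shared library implementation of the filter.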
The meaning of the parameters is, of course, completely filter dependent and the filter description [3] needs to be consulted. For bzip2, for example, a single parameter is provided representing the compression level. It is legal to provide a zero-length set of parameters. Defaults are not provided, so this assumes that the filter can operate with zero parameters.
Filter ids are assigned by the HDF Group. See [4] for a current list of assigned filter ids. Note that ids above 32767 can be used for testing without registration.
The first two pieces of information can be provided in one of three ways: using ncgen, via an API call, or via command line parameters to nccopy. In any case, remember that filtering also requires setting chunking, so the variable must also be marked with chunking information. If compression is set for a non-chunked variable, the variable will forcibly be converted to chunked using a default chunking algorithm.
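In a CDL file, compression of a variable can be specified by annotating it with the "_Filter" attribute, whose value is a string holding a filter id followed by a comma separated list of parameter constants.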
This is a "special" attribute, which means that it will normally be invisible when using ncdump unless the -s flag is specified.
This attribute may be repeated to specify multiple filters. For backward compatibility it is probably better to use the "_Deflate" attribute instead of "_Filter". But using "_Filter" to specify deflation will work.
Note that the lexical order of declaration is important when more than one filter is specified for a variable because it determines the order in which the filters are applied.
Note that the assigned filter id for bzip2 is 307 and for szip it is 4.
When copying a netcdf file using nccopy it is possible to specify filter information for any output variable by using the "-F" option on the command line; for example:
nccopy -F "var,307,9" unfiltered.nc filtered.nc
Assume that unfiltered.nc has a chunked but not bzip2 compressed variable named "var". This command will copy that variable to the filtered.nc output file, applying the filter with id 307 (i.e. bzip2) with the single parameter 9, which indicates the compression level. See the section on the parameter encoding syntax for the details on the allowable kinds of constants.
The "-F" option can be used repeatedly, as long as a different variable is specified for each occurrence.
It can be convenient to specify that the same compression is to be applied to more than one variable. To support this, two additional -F cases are defined.
-F *,...
means apply the filter to all variables in the dataset.
-F v1&v2&..,...
means apply the filter to multiple variables.
Multiple filters can be specified using the pipeline notation '|'. For example
-F v1&v2,307,9|4,32,32
means apply filter 307 (bzip2) then filter 4 (szip) to the multiple variables.
Note that the characters '*', '&', and '|' are shell reserved characters, so you will probably need to escape or quote the filter spec in that environment.
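To make the structure of a "-F" argument concrete, here is a minimal sketch that splits one into its variable list and filter pipeline. It is an illustration only, not nccopy's actual parser: the function name `parse_filter_option` is hypothetical, parameters are assumed to be plain unsigned integers, and variable names are assumed not to contain ',' or '&'.

```python
def parse_filter_option(spec):
    """Split a -F argument into (variable names, filter pipeline).

    The pipeline is a list of (filter_id, [params]) tuples in the order
    in which the filters would be applied on writing.
    """
    # Everything before the first comma names the variables.
    varpart, _, filterpart = spec.partition(",")
    variables = varpart.split("&")  # "*" stays as the single entry "*"

    pipeline = []
    for stage in filterpart.split("|"):
        fields = stage.split(",")
        filter_id = int(fields[0])
        params = [int(p) for p in fields[1:]]
        pipeline.append((filter_id, params))
    return variables, pipeline


variables, pipeline = parse_filter_option("v1&v2,307,9|4,32,32")
# variables is ["v1", "v2"]
# pipeline is [(307, [9]), (4, [32, 32])]
```

On a real command line the argument would still need quoting, since the shell would otherwise interpret '*', '&', and '|' itself.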
As a rule, any input filter on an input variable will be applied to the equivalent output variable, assuming the output file type is netcdf-4. It is, however, sometimes convenient to suppress output compression either totally or on a per-variable basis. Total suppression of output filters can be accomplished by specifying a special case of "-F", namely "-F none".
The expression -F *,none is equivalent to -F none.
Suppression of output filtering for a specific set of variables can be accomplished using these formats.
-F var,none
-F v1&v2&..,none
where "var" and the "vi" are the fully qualified names of variables.
The rules for all possible cases of the "-F none" flag are defined by this table.
-F none | -Fvar,... | Input Filter | Applied Output Filter |
---|---|---|---|
true | undefined | NA | unfiltered |
true | none | NA | unfiltered |
true | defined | NA | use output filter(s) |
false | undefined | defined | use input filter(s) |
false | none | NA | unfiltered |
false | defined | NA | use output filter(s) |
false | undefined | undefined | unfiltered |
false | defined | defined | use output filter(s) |
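The table can be read as a short decision procedure. The sketch below is one way to express it, not code taken from nccopy; the function name `applied_output_filters` and the representation of filters as `(id, params)` tuples are assumptions made for this illustration.

```python
def applied_output_filters(global_none, per_variable, input_filters):
    """Return the filters applied to one output variable.

    global_none   -- True if "-F none" was given
    per_variable  -- None if no -F names this variable (undefined),
                     the string "none" for "-F var,none",
                     or a list of (id, params) filters for "-F var,..."
    input_filters -- filters on the input variable (empty list if none)
    """
    if per_variable == "none":
        return []                    # explicitly unfiltered
    if per_variable is not None:
        return list(per_variable)    # explicit output filters always win
    if global_none:
        return []                    # suppressed for the whole dataset
    return list(input_filters)       # default: carry the input filters over


bzip2 = (307, [9])
szip = (4, [32, 32])
```

For example, `applied_output_filters(False, None, [bzip2])` yields `[bzip2]` (row 4 of the table), while `applied_output_filters(True, [szip], [bzip2])` yields `[szip]` (row 3).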
The filter id for an HDF5 format filter is an unsigned integer. Further, the parameters passed to an HDF5 format filter are encoded internally as a vector of 32-bit unsigned integers. It may be that the parameters required by a filter can naturally be encoded as unsigned integers. The bzip2 compression filter, for example, expects a single integer value from zero through nine. This encodes naturally as a single unsigned integer.
Note that signed integers and single-precision (32-bit) float values can also easily be represented as 32-bit unsigned integers by casting to an unsigned integer in a way that preserves the bit pattern. Simple integer values of type short or char (or the unsigned versions) can also be mapped to an unsigned integer by truncating to 16 or 8 bits respectively and then zero extending.
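The bit-preserving conversions described here can be sketched with Python's standard `struct` module. This is an illustration of the idea, not the netcdf-c implementation; all function names are hypothetical.

```python
import struct


def float_to_u32(value):
    """Reinterpret a 32-bit float's bit pattern as an unsigned integer."""
    return struct.unpack("<I", struct.pack("<f", value))[0]


def u32_to_float(bits):
    """Inverse of float_to_u32: recover the float from its bit pattern."""
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def int32_to_u32(value):
    """Keep the two's complement bit pattern of a signed 32-bit integer."""
    return value & 0xFFFFFFFF


def small_int_to_u32(value, bits):
    """Truncate to `bits` (8 or 16) and zero extend to 32 bits."""
    return value & ((1 << bits) - 1)


encoded_float = float_to_u32(789.0)
encoded_int = int32_to_u32(-77)
encoded_byte = small_int_to_u32(-17, 8)
encoded_short = small_int_to_u32(-25, 16)
```

A filter receiving such a parameter has to apply the inverse conversion (for example `u32_to_float`) to recover the original value, since only the unsigned integer is stored.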
Machine byte order (aka endian-ness) is an issue for passing some kinds of parameters. You might define the parameters when compressing on a little endian machine, but later do the decompression on a big endian machine.
When using HDF5 format filters, byte order is not an issue for 32-bit values because HDF5 takes care of converting them between the local machine byte order and network byte order.
Parameters whose size is larger than 32 bits present a byte order problem. This specifically includes double precision floats and (signed or unsigned) 64-bit integers. For these cases, the machine byte order issue must be handled, in part, by the compression code. This is because HDF5 will treat, for example, an unsigned long long as two 32-bit unsigned integers and will convert each to network order separately. This means that on a machine whose byte order differs from that of the machine on which the parameters were initially created, the two integers will be separately endian converted. But this will be incorrect for 64-bit values.
So, we have this situation (for HDF5 only):
In order to properly extract the correct 8-byte value, we need to ensure that the values stored in the HDF5 file have a known format independent of the native format of the creating machine.
The idea is to do sufficient manipulation so that HDF5 will store the 8-byte value as a little endian value divided into two 4-byte integers. Note that little-endian is used as the standard because it is the most common machine format. When read, the filter code needs to be aware of this convention and do the appropriate conversions.
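The following sketch simulates the problem and the convention using Python's `struct` module. It models HDF5 as preserving the value of each 32-bit word individually, which is how the text above describes it; the function names are hypothetical, and storing the low-order word first is this sketch's reading of "little endian divided into two 4-byte integers".

```python
import struct


def split_u64_le(value):
    """Split a 64-bit value into two 32-bit words, low-order word first."""
    return struct.unpack("<II", struct.pack("<Q", value))


def join_u64_le(low, high):
    """Rebuild the 64-bit value from words stored low-order word first."""
    return struct.unpack("<Q", struct.pack("<II", low, high))[0]


value = 0x0102030405060708

# Without a convention: a big endian writer hands HDF5 the words in its
# native memory order (high word first). Each word's value survives, but a
# little endian reader that reinterprets the pair natively gets the halves
# swapped.
be_words = struct.unpack(">II", struct.pack(">Q", value))
misread = struct.unpack("<Q", struct.pack("<II", *be_words))[0]

# With the convention: every writer stores (low, high), and every reader
# joins the words the same way regardless of its own byte order.
low, high = split_u64_le(value)
recovered = join_u64_le(low, high)
```

Here `misread` comes out as `0x0506070801020304` rather than the original value, while `recovered` equals `value`.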
This leads to the following set of rules.
The utilities ncgen and nccopy, and also the output of ncdump, support the specification of filter ids, formats, and parameters in text format. The BNF specification is defined in Appendix A. Basically, these specifications consist of a filter id, a comma, and then a sequence of comma separated constants representing the parameters. The constants are converted within the utility to a proper set of unsigned int constants (see the parameter encoding section).
To simplify things, various kinds of constants can be specified rather than just simple unsigned integers. The ncgen and nccopy programs will encode them properly using the rules specified in the section on parameter encode/decode. Since the original types are lost after encoding, ncdump will always show a simple list of unsigned integer constants.
The currently supported constants are as follows.
Example | Type | Format Tag | Notes |
---|---|---|---|
-17b | signed 8-bit byte | b or B | Truncated to 8 bits and zero extended to 32 bits |
23ub | unsigned 8-bit byte | u or U, then b or B | Truncated to 8 bits and zero extended to 32 bits |
-25S | signed 16-bit short | s or S | Truncated to 16 bits and zero extended to 32 bits |
27US | unsigned 16-bit short | u or U, then s or S | Truncated to 16 bits and zero extended to 32 bits |
-77 | implicit signed 32-bit integer | Leading minus sign and no tag | |
77 | implicit unsigned 32-bit integer | No tag | |
93U | explicit unsigned 32-bit integer | u or U | |
789f | 32-bit float | f or F | |
12345678.12345678d | 64-bit double | d or D | LE encoding |
-9223372036854775807L | 64-bit signed long long | l or L | LE encoding |
18446744073709551615UL | 64-bit unsigned long long | u or U, then l or L | LE encoding |
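As a rough illustration of how the constants in the table above might be turned into unsigned 32-bit parameters, here is a minimal sketch. It is not the ncgen or nccopy implementation: the function names are hypothetical, only decimal constants are handled, and emitting the low-order word first for 64-bit values is an assumption about what "LE encoding" means.

```python
import struct

_TAG_CHARS = "uUbBsSlLfFdD"


def encode_constant(text):
    """Encode one tagged constant as a list of unsigned 32-bit integers."""
    body = text.rstrip(_TAG_CHARS)      # numeric part
    tag = text[len(body):].lower()      # trailing format tag, if any

    if tag == "f":
        return list(struct.unpack("<I", struct.pack("<f", float(body))))
    if tag == "d":
        return list(struct.unpack("<II", struct.pack("<d", float(body))))

    value = int(body)
    if tag in ("b", "ub"):
        return [value & 0xFF]           # truncate to 8 bits, zero extend
    if tag in ("s", "us"):
        return [value & 0xFFFF]         # truncate to 16 bits, zero extend
    if tag in ("", "u"):
        return [value & 0xFFFFFFFF]     # keep the 32-bit pattern
    if tag in ("l", "ul"):
        packed = struct.pack("<Q", value & 0xFFFFFFFFFFFFFFFF)
        return list(struct.unpack("<II", packed))
    raise ValueError("unsupported format tag: " + repr(tag))


def encode_filter_spec(spec):
    """Encode "id,const,const,..." as (filter id, unsigned int params)."""
    fields = [field.strip() for field in spec.split(",")]
    params = []
    for field in fields[1:]:
        params.extend(encode_constant(field))
    return int(fields[0]), params


filter_id, params = encode_filter_spec("307,9")
```

Note that a 64-bit constant contributes two entries to the parameter vector, so the number of encoded parameters can exceed the number of constants written in the spec.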
Some things to note.
Each filter is assumed to be compiled into a separate dynamically loaded library. For HDF5 conformant filters, these filter libraries are assumed to be in some specific location. The details for writing such a filter are defined in the HDF5 documentation[1,2].
The HDF5 loader expects plugins to be in a specified plugin directory. The default directory is:
Windows | * | .dll |
When multiple filters are defined on a variable, the order of application when writing data to the file is the same as the order in which the filters are associated with the variable. When reading a file, the order of application is of necessity the reverse.
There are some special cases.
The current szip plugin code in the HDF5 library has some behaviors that can catch the unwary. These are handled internally to (mostly) hide them so that they should not affect users. Specifically, this filter may do two things.
The reason for these changes has to do with the fact that the szip API provided by the underlying H5Pset_szip function is actually a subset of the capabilities of the real szip implementation. Presumably this is for historical reasons.
In any case, if the caller applies or queries the szip filter, then the parameter values returned may differ from those originally specified.
The current matrix of operating systems and build systems known to work is as follows.
Build System | Supported OS |
---|---|
Automake | Linux, Cygwin, OSX |
Cmake | Linux, Cygwin, OSX, Visual Studio |
Author: Dennis Heimbigner
Email: dmh at ucar dot edu
Initial Version: 1/10/2018
Last Revised: 7/15/2021