NetCDF
4.9.2
See Appendix D.1. NetCDF-4 Filter QuickStart for tips to get started quickly with NetCDF-4 Filter Support.
NetCDF-C filters have some features of which the user should be aware.
The netCDF library supports a general filter mechanism to apply various kinds of filters to datasets before reading or writing. The most common kind of filter is a compression-decompression filter, and that is the focus of this document. But non-compression filters – fletcher32, for example – also exist.
The netCDF enhanced (aka netCDF-4) library inherits this capability since it depends on the HDF5 library. The HDF5 library (1.8.11 and later) supports filters, and netCDF is based closely on that underlying HDF5 mechanism.
Filters assume that a variable has chunking defined. Each chunk is filtered before writing and "unfiltered" after reading, before the data is passed to the user. If multiple filters are defined on a variable, they are applied in first-defined order when writing and in the reverse order when reading.
This document describes the support for HDF5 filters and also the newly added support for NCZarr filters.
The API defined in this document should accurately reflect the current state of filters in the netCDF-c library. Be aware that there was a short period in which the filter code was undergoing some revision and extension. Those extensions have largely been reverted. Unfortunately, some users may experience some compilation problems for previously working code because of these reversions. In that case, please revise your code to adhere to this document. Apologies are extended for any inconvenience.
A user may encounter an incompatibility if any of the following appears in user code.
For additional information, see Appendix B.
HDF5 supports dynamic loading of compression filters using the following process for reading of compressed data.
In order to compress a variable with an HDF5 compliant filter, the netcdf-c library must be given three pieces of information:
The meaning of the parameters is, of course, completely filter dependent and the filter description [3] needs to be consulted. For bzip2, for example, a single parameter is provided representing the compression level. It is legal to provide a zero-length set of parameters. Defaults are not provided, so this assumes that the filter can operate with zero parameters.
Filter ids are assigned by the HDF group. See [4] for a current list of assigned filter ids. Note that ids above 32767 can be used for testing without registration.
The first two pieces of information can be provided in one of three ways: (1) using ncgen, (2) via an API call, or (3) via command line parameters to nccopy. In any case, remember that filtering also requires setting chunking, so the variable must also be marked with chunking information. If compression is set for a non-chunked variable, the variable will forcibly be converted to chunked using a default chunking algorithm.
The necessary API methods are included in netcdf_filter.h by default. These functions implicitly use the HDF5 mechanisms and may produce an error if applied to a file format that is not compatible with the HDF5 mechanism.
Add a filter to the set of filters to be used when writing a variable. This must be invoked after the variable has been created and before nc_enddef is invoked.
Arguments:
Return codes:
Query a variable to obtain a list of the ids of all filters associated with that variable.
Arguments:
Return codes:
The number of filters associated with the variable is stored in nfiltersp (it may be zero). The set of filter ids is returned in filterids. As is usual with the netcdf API, one is expected to call this function twice: first to obtain nfiltersp, and then to get the filter ids in client-allocated memory. Any of these arguments can be NULL, in which case no value is returned.
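The two-call pattern described above can be sketched as follows. This is a minimal illustration, not netCDF code: `fake_inq_filter_ids` is a hypothetical stand-in for an inquiry function (it is not part of the netCDF API and takes no ncid or varid), and `get_filter_ids` is an illustrative helper. The ids 307 and 32015 are the bzip2 and zstandard ids listed elsewhere in this document.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for a netCDF-style inquiry function; it is NOT
 * part of the netCDF API. It reports two filter ids and, like the real
 * inquiry functions, skips any output argument that is NULL. */
int
fake_inq_filter_ids(size_t *nfiltersp, unsigned int *filterids)
{
    static const unsigned int ids[] = {307u, 32015u};
    if (nfiltersp != NULL) *nfiltersp = 2;
    if (filterids != NULL) memcpy(filterids, ids, sizeof ids);
    return 0; /* 0 plays the role of NC_NOERR here */
}

/* Two-call pattern: ask for the count, allocate, then ask for the ids.
 * The caller owns the returned array and must free() it. Returns NULL
 * on error or when there are no filters. */
unsigned int *
get_filter_ids(size_t *countp)
{
    unsigned int *ids = NULL;

    /* First call: only the count is requested. */
    if (fake_inq_filter_ids(countp, NULL) != 0) return NULL;
    if (*countp == 0) return NULL;

    ids = malloc(*countp * sizeof *ids);
    if (ids == NULL) return NULL;

    /* Second call: the ids are copied into client-allocated memory. */
    if (fake_inq_filter_ids(NULL, ids) != 0) {
        free(ids);
        return NULL;
    }
    return ids;
}
```

With a real netCDF file, the same shape would presumably apply with the stand-in replaced by the library's inquiry function and its ncid and varid arguments.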
Query a variable to obtain information about a specific filter associated with the variable.
Arguments:
Return codes:
The id indicates the filter of interest. The actual parameters are stored in params. The number of parameters is returned in nparamsp. As is usual with the netcdf API, one is expected to call this function twice: first to obtain nparamsp, and then to get the parameters in client-allocated memory. Any of these arguments can be NULL, in which case no value is returned. If the specified id is not attached to the variable, then NC_ENOFILTER is returned.
Query a variable to obtain information about the first filter associated with the variable. When netcdf-c was modified to support multiple filters per variable, this function became largely redundant, since it returns information only about the first defined filter for the variable. Internally, it is implemented using the functions nc_inq_var_filter_ids and nc_inq_var_filter_info.
Arguments:
Return codes:
The filter id will be returned in the idp argument. If there are no filters, then zero is stored in this argument. Otherwise, the number of parameters is stored in nparamsp and the actual parameters in params. As is usual with the netcdf API, one is expected to call this function twice: first to obtain nparamsp, and then to get the parameters in client-allocated memory. Any of these arguments can be NULL, in which case no value is returned.
In a CDL file, compression of a variable can be specified by annotating it with the following attribute:
This is a "special" attribute, which means that it will normally be invisible when using ncdump unless the -s flag is specified.
For backward compatibility it is probably better to use the *_Deflate* attribute instead of *_Filter*. But using *_Filter* to specify deflation will work.
Multiple filters can be specified for a given variable by using the "|" separator. Alternatively, this attribute may be repeated to specify multiple filters.
Note that the lexical order of declaration is important when more than one filter is specified for a variable because it determines the order in which the filters are applied.
Note that the assigned filter id for bzip2 is 307 and for szip it is 4.
When copying a netcdf file using nccopy it is possible to specify filter information for any output variable by using the "-F" option on the command line; for example:
nccopy -F "var,307,9" unfiltered.nc filtered.nc
Assume that unfiltered.nc has a chunked variable named "var" that is not bzip2 compressed. This command will copy that variable to the filtered.nc output file, applying the filter with id 307 (i.e. bzip2) with the parameter 9 indicating the compression level. See the section on the parameter encoding syntax for the details on the allowable kinds of constants.
The "-F" option can be used repeatedly, as long as a different variable is specified for each occurrence.
It can be convenient to specify that the same compression is to be applied to more than one variable. To support this, two additional -F cases are defined.
Multiple filters can be specified using the pipeline notation '|'. For example:
Note that the characters '*', '&', and '|' are shell reserved characters, so you will probably need to escape or quote the filter spec in that environment.
As a rule, any input filter on an input variable will be applied to the equivalent output variable — assuming the output file type is netcdf-4. It is, however, sometimes convenient to suppress output compression either totally or on a per-variable basis. Total suppression of output filters can be accomplished by specifying a special case of "-F", namely this.
nccopy -F none input.nc output.nc
The expression -F *,none is equivalent to -F none.
Suppression of output filtering for a specific set of variables can be accomplished using these formats.
nccopy -F "var,none" input.nc output.nc
nccopy -F "v1&v2&...,none" input.nc output.nc
where "var" and the "vi" are fully qualified variable names.
The rules for all possible cases of the "-F none" flag are defined by this table.
-F none | -Fvar,... | Input Filter | Applied Output Filter |
---|---|---|---|
true | undefined | NA | unfiltered |
true | none | NA | unfiltered |
true | defined | NA | use output filter(s) |
false | undefined | defined | use input filter(s) |
false | none | NA | unfiltered |
false | defined | undefined | use output filter(s) |
false | undefined | undefined | unfiltered |
false | defined | defined | use output filter(s) |
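The table above can be read as a small decision function. The following is a minimal sketch in C, not the nccopy implementation; the type and function names (`var_spec`, `output_choice`, `choose_output_filter`) are hypothetical and exist only for illustration.

```c
#include <stdbool.h>

/* State of the per-variable "-F var,..." option (hypothetical names). */
typedef enum { SPEC_UNDEFINED, SPEC_NONE, SPEC_DEFINED } var_spec;

/* Which filters end up on the output variable. */
typedef enum { OUT_UNFILTERED, OUT_USE_INPUT, OUT_USE_OUTPUT } output_choice;

/* Mirror the eight rows of the table:
 *  - an explicit per-variable filter spec always wins;
 *  - "var,none" always suppresses filtering for that variable;
 *  - otherwise the input filters are kept unless "-F none" was given. */
output_choice
choose_output_filter(bool f_none, var_spec spec, bool input_has_filter)
{
    if (spec == SPEC_DEFINED) return OUT_USE_OUTPUT;
    if (spec == SPEC_NONE) return OUT_UNFILTERED;
    if (f_none) return OUT_UNFILTERED;
    return input_has_filter ? OUT_USE_INPUT : OUT_UNFILTERED;
}
```

The rows marked "NA" in the table correspond to cases where the `input_has_filter` argument does not influence the result.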
The utilities ncgen and nccopy, and also the output of ncdump, support the specification of filter ids, formats, and parameters in text format. The BNF specification is defined in Appendix C. Basically, these specifications consist of a filter id, a comma, and then a sequence of comma-separated constants representing the parameters. The constants are converted within the utility to a proper set of unsigned int constants (see the parameter encoding section).
To simplify things, various kinds of constants can be specified rather than just simple unsigned integers. The ncgen and nccopy programs will encode them properly using the rules specified in the section on parameter encode/decode. Since the original types are lost after encoding, ncdump will always show a simple list of unsigned integer constants.
The currently supported constants are as follows.
Example | Type | Format Tag | Notes |
---|---|---|---|
-17b | signed 8-bit byte | b or B | Truncated to 8 bits and sign extended to 32 bits |
23ub | unsigned 8-bit byte | u or U, then b or B | Truncated to 8 bits and zero extended to 32 bits |
-25S | signed 16-bit short | s or S | Truncated to 16 bits and sign extended to 32 bits |
27US | unsigned 16-bit short | u or U, then s or S | Truncated to 16 bits and zero extended to 32 bits |
-77 | implicit signed 32-bit integer | Leading minus sign and no tag | |
77 | implicit unsigned 32-bit integer | No tag | |
93U | explicit unsigned 32-bit integer | u or U | |
789f | 32-bit float | f or F | |
12345678.12345678d | 64-bit double | d or D | LE encoding |
-9223372036854775807L | 64-bit signed long long | l or L | LE encoding |
18446744073709551615UL | 64-bit unsigned long long | u or U, then l or L | LE encoding |
Some things to note.
Each filter is assumed to be compiled into a separate dynamically loaded library. For HDF5 conformant filters, these filter libraries are assumed to be in some specific location. The details for writing such a filter are defined in the HDF5 documentation[1,2].
The HDF5 loader searches for plugins in a number of directories. This search is contingent on the presence or absence of the environment variable named HDF5_PLUGIN_PATH.
As with all other "...PATH" variables, it is a sequence of absolute directories separated by a separator character. For *nix operating systems, this separator is the colon (':') character. For Windows and Mingw, the separator is the semicolon (';') character. So for example:
If HDF5_PLUGIN_PATH is defined, then the loader will search each directory in the path from left to right looking for shared libraries with specific exported symbols representing the entry points into the library.
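The left-to-right traversal of such a path list can be sketched as follows. This is only an illustration of the splitting step, not the HDF5 loader's code; the function name `split_plugin_path` and the fixed buffer size `PLUGIN_DIR_MAX` are assumptions made for the example.

```c
#include <stddef.h>
#include <string.h>

#define PLUGIN_DIR_MAX 256 /* illustrative limit on one directory name */

/* Split a path list such as the value of HDF5_PLUGIN_PATH into its
 * directories, preserving left-to-right order. `sep` is ':' on *nix and
 * ';' on Windows. At most `maxdirs` directories are stored in `dirs`;
 * empty entries and entries that do not fit are skipped.
 * Returns the number of directories stored. */
size_t
split_plugin_path(const char *path, char sep,
                  char dirs[][PLUGIN_DIR_MAX], size_t maxdirs)
{
    size_t count = 0;
    const char *p = path;

    while (p != NULL && *p != '\0' && count < maxdirs) {
        const char *end = strchr(p, sep);
        size_t len = (end != NULL) ? (size_t)(end - p) : strlen(p);

        if (len > 0 && len < PLUGIN_DIR_MAX) {
            memcpy(dirs[count], p, len);
            dirs[count][len] = '\0';
            count++;
        }
        if (end == NULL) break; /* last entry reached */
        p = end + 1;            /* continue after the separator */
    }
    return count;
}
```

A caller would typically obtain `path` from `getenv("HDF5_PLUGIN_PATH")` and then examine each returned directory in order.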
If HDF5_PLUGIN_PATH is not defined, the loader defaults to using these default directories:
It should be noted that there is a difference between the search order for HDF5 versus NCZarr. The HDF5 loader will search only the directories specified in HDF5_PLUGIN_PATH. In NCZarr, the loader searches HDF5_PLUGIN_PATH and, as a last resort, it also searches the default directory.
Given a plugin directory, HDF5 examines every file in that directory that conforms to a specified name pattern as determined by the platform on which the library is being executed.
Platform | Basename | Extension |
---|---|---|
Linux | lib* | .so* |
OSX | lib* | .dylib* |
Cygwin | cyg* | .dll* |
Windows | * | .dll |
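The name patterns in the table above that end in a wildcard (for example, lib*.so* on Linux) can be checked with a few string operations. The sketch below is an assumption about how such a check might look, not the matching code used by HDF5; `plugin_name_matches` is a hypothetical helper.

```c
#include <stdbool.h>
#include <string.h>

/* Return true if `name` matches the pattern <prefix>*<ext>*, e.g.
 * prefix "lib" and ext ".so" for the Linux row (lib*.so*). */
bool
plugin_name_matches(const char *name, const char *prefix, const char *ext)
{
    size_t plen = strlen(prefix);

    /* The file name must start with the platform prefix. */
    if (strncmp(name, prefix, plen) != 0) return false;

    /* The extension may be followed by further characters, such as the
     * version suffix in "libfoo.so.0", so search for it anywhere after
     * the prefix rather than only at the end. */
    return strstr(name + plen, ext) != NULL;
}
```

The Windows row has no trailing wildcard, so an exact-suffix comparison would be the closer match for that platform.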
For each dynamic library located using the previous patterns, HDF5 attempts to load the library and to obtain information from it. Specifically, it looks for two functions with the following signatures.
If plugin verification fails, then that plugin is ignored and the search continues for another, matching plugin.
The inclusion of Zarr support in the netcdf-c library creates the need to provide a new representation consistent with the way that Zarr files store filter information. For Zarr, filters are represented using the JSON notation. Each filter is defined by a JSON dictionary, and each such filter dictionary is guaranteed to have a key named "id" whose value is a unique string defining the filter algorithm: "lz4" or "bzip2", for example.
The parameters of the filter are defined by additional — algorithm specific — keys in the filter dictionary. One commonly used filter is "blosc", which has a JSON dictionary of this form.
So it has three parameters:
NCZarr has four constraints that must be met.
Note that the term "visible parameters" is used here to refer to the parameters provided by nc_def_var_filter or those stored in the dataset's metadata as provided by the JSON codec. The term "working parameters" refers to the parameters given to the compressor itself and derived from the visible parameters.
The standard authority for defining Zarr filters is the list supported by the NumCodecs project [7]. Comparing the set of standard filters (aka codecs) defined by NumCodecs to the set of standard filters defined by HDF5 [3], it can be seen that the two sets overlap, but each has filters not defined by the other.
Note also that it is undesirable that a specific set of filters/codecs be built into the NCZarr implementation. Rather, it is preferable for there to be some extensible way to associate the JSON with the code implementing the codec. This mirrors the plugin model used by HDF5.
The mechanism provided to address these issues is similar to that taken by HDF5. A shared library must exist that has certain well-defined entry points that allow the NCZarr code to determine information about a Codec. The shared library exports a well-known function name to access Codec information and relate it to a corresponding HDF5 implementation. Note that the shared library may optionally be the same library containing the HDF5 filter processor.
There are several paths by which the NCZarr filter API is invoked.
In this case, the filter plugin is located and the set of visible parameters (from nc_def_var_filter) are provided.
In this case, the codec is read from the metadata and must be converted to a visible set of HDF5 style parameters. It is possible that this set of visible parameters differs from the set that was provided by nc_def_var_filter. If this is important, then the filter implementation is responsible for marking this difference using, for example, different number of parameters or some differing value.
Given environmental information such as the associated variable's base type, the visible parameters are converted to a potentially larger set of working parameters. This step also provides the opportunity to modify the visible parameters.
As chunks are read or written, the filter is repeatedly invoked using the working parameters.
The visible parameters from step 2 are stored in the dataset's metadata. It is desirable to determine if the set of visible parameters changes. If no change is detected, then re-writing the compressor metadata may be avoided.
Currently, there is no way to specify use of a filter via Codec through the netcdf-c API. Rather, one must know the HDF5 id and parameters of the filter of interest and use the functions nc_def_var_filter and nc_inq_var_filter. Internally, the NCZarr code will use information about known Codecs to convert the HDF5 filter reference to the corresponding Codec. This restriction also holds for the specification of filters in ncgen and nccopy. This limitation may be lifted in the future.
A new special attribute is defined called *_Codecs* in parallel to the current *_Filter* special attribute. Its value is a string containing the JSON representation of the Codecs associated with a given variable. This can be especially useful when a file is unreadable because it uses a filter not available to the netcdf-c library, that is, when no implementation was found in, for example, the HDF5_PLUGIN_PATH directory. In this case ncdump -hs will display the raw Codec information so that it may be possible to see what filter is missing.
The process for using filters for NCZarr is defined to operate in several steps. First, as with HDF5, all shared libraries in a specified directory (e.g. HDF5_PLUGIN_PATH) are scanned. They are interrogated to see what kind of library they implement, if any. This interrogation operates by seeing if certain well-known (function) names are defined in this library.
There will be two library types:
1. HDF5: exports the functions H5Z_plugin_type and H5Z_get_plugin_info.
2. Codec: exports the function NCZ_get_codec_info.
Note that a given library can export either or both of these APIs. This means that we can have three types of libraries:
Suppose that our HDF5_PLUGIN_PATH location has an HDF5-only library. Then by adding a corresponding, separate, Codec-only library to that same location, it is possible to make an HDF5 library usable by NCZarr. It is possible to do this without having to modify the HDF5-only library. Over time, it is possible to merge an HDF5-only library with a Codec-only library to produce a single, combined library.
The netcdf-c library processes all of the shared libraries by interrogating each one for the well-known APIs and recording the result. Any library that does not export at least one of the well-known APIs is ignored.
Internally, the netcdf-c library pairs up each HDF5 library API with a corresponding Codec API by invoking the relevant well-known functions (see Appendix E). This results in the following table for associated Codec and HDF5 libraries.
HDF5 API | Codec API | Action |
---|---|---|
Not defined | Not defined | Ignore |
Defined | Not defined | Ignore |
Defined | Defined | NCZarr usable |
As a special case, a shared library may be created to hold defaults for a common set of filters. Basically, there is a specially defined function that returns a vector of codec APIs. These defaults are used only if no other library provides codec information for a filter. Currently, the defaults library provides codec defaults for Shuffle, Fletcher32, Deflate (zlib), and SZIP.
Given a set of filters for which the HDF5 API and the Codec API are defined, it is then possible to use the APIs to invoke the filters and to process the meta-data in Codec JSON format.
When writing, the user program will invoke the NetCDF API function nc_def_var_filter. This function is currently defined to operate using HDF5-style id and parameters (unsigned ints). The netcdf-c library examines its list of known filters to find one matching the HDF5 id provided by nc_def_var_filter. The set of parameters provided is stored internally. Then during writing of data, the corresponding HDF5 filter is invoked to encode the data.
When it comes time to write out the meta-data, the stored HDF5-style parameters are passed to a specific Codec function to obtain the corresponding JSON representation. Again see Appendix E. This resulting JSON is then written in the NCZarr metadata.
When reading, the netcdf-c library will read the metadata for a given variable and will see that some set of filters are applied to this variable. The metadata is encoded as Codec-style JSON.
Given a JSON Codec, it is parsed to provide a JSON dictionary containing the string "id" and the set of parameters as various keys. The netcdf-c library examines its list of known filters to find one matching the Codec "id" string. The JSON is passed to a Codec function to obtain the corresponding HDF5-style unsigned int parameter vector. These parameters are stored for later use.
HDF5 supports filter chains. A filter chain is a sequence of filters where the output of one filter is provided as input to the next filter in the sequence. When encoding, the filters are executed in the "forward" direction; when decoding, the filters are executed in the "reverse" direction.
In the Zarr meta-data, a filter chain is divided into two parts: the "compressor" and the "filters". The former is a single JSON codec as described above. The latter is an ordered JSON array of codecs. So if compressor is something like "compressor": {"id": "c"...} and the filters array is like this: "filters": [ {"id": "f1"...}, {"id": "f2"...}...{"id": "fn"...}] then the filter chain is (f1,f2,...fn,c) with f1 being applied first and c being applied last when encoding. On decode, the filter chain is executed in the order (c,fn...f2,f1).
So, an HDF5 filter chain is divided into two parts, where the last filter in the chain is assigned as the "compressor" and the remaining filters are assigned as the "filters". But independent of this, each codec, whether a compressor or a filter, is stored in the JSON dictionary form described earlier.
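The split of a chain into "filters" and "compressor", and the reversed decode order, can be sketched as below. This is a minimal illustration with hypothetical helper names (`split_chain`, `decode_order`); codecs are represented only by their "id" strings, and treating an empty chain as having no compressor is an assumption of the sketch.

```c
#include <stddef.h>

/* Split an encode-order chain of n codec ids into the Zarr "filters"
 * array (chain[0] .. chain[n-2]) and the "compressor" (chain[n-1]).
 * Returns the number of entries that belong to "filters"; *compressor
 * is set to NULL when the chain is empty. */
size_t
split_chain(const char *chain[], size_t n, const char **compressor)
{
    if (n == 0) {
        *compressor = NULL;
        return 0;
    }
    *compressor = chain[n - 1]; /* last filter applied when encoding */
    return n - 1;
}

/* Fill `order` (which must hold n entries) with the decode order,
 * i.e. the encode-order chain reversed. */
void
decode_order(const char *chain[], size_t n, const char *order[])
{
    for (size_t i = 0; i < n; i++)
        order[i] = chain[n - 1 - i];
}
```

For the chain (f1, f2, c), `split_chain` reports two filters with compressor c, and `decode_order` yields (c, f2, f1), matching the ordering described above.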
The Codec style, using JSON, has the ability to provide very complex parameters that may be hard to encode as a vector of unsigned integers. It might be desirable to consider exporting a JSON-based API out of the netcdf-c API to support user access to this complexity. This would mean providing some alternate version of nc_def_var_filter that takes a string-valued argument instead of a vector of unsigned ints. This extension is unlikely to be implemented until a compelling use-case is encountered.
One bad side-effect of this is that we then may have two classes of plugins. One class can be used by both HDF5 and NCZarr, and a second class that is usable only with NCZarr.
As part of its testing, the NetCDF build process creates a number of shared libraries in the netcdf-c/plugins (or sometimes netcdf-c/plugins/.libs) directory. If you need a filter from that set, you may be able to set HDF5_PLUGIN_PATH to point to that directory or you may be able to copy the shared libraries out of that directory to your own location.
Depending on the debugger one uses, debugging plugins can be very difficult. It may be necessary to use the old printf approach for debugging the filter itself.
One case worth mentioning is when there is a dataset that is using an unknown filter. For this situation, you need to identify what filter(s) are used in the dataset. This can be accomplished using this command.
ncdump -s -h <dataset filename>
Since ncdump is not being asked to access the data (the -h flag), it can obtain the filter information without failures. Then it can print out the filter id and the parameters as well as the Codecs (via the -s flag).
Within the netcdf-c source tree, two directories contain test cases for testing dynamic filter operation.
These tests are disabled if --disable-shared, --disable-filter-tests, or --disable-plugins is specified.
A slightly simplified version of one of the HDF5 filter test cases is also available as an example within the netcdf-c source tree directory netcdf-c/examples/C. The test is called filter_example.c and it is executed as part of the run_examples4.sh shell script. The test case demonstrates dynamic filter writing and reading.
The files examples/C/hdf5plugins/Makefile.am and examples/C/hdf5plugins/CMakeLists.txt demonstrate how to build the hdf5 plugin for bzip2.
When multiple filters are defined on a variable, the order of application when writing data to the file is the same as the order in which *nc_def_var_filter* is called. When reading a file, the order of application is of necessity the reverse.
There are some special cases.
Starting with HDF5 version 1.10.*, the plugin code MUST be careful when using the standard malloc(), realloc(), and free() functions.
In the event that the code is allocating, reallocating, or freeing memory that either came from or will be exported to the calling HDF5 library, then one MUST use the corresponding HDF5 functions H5allocate_memory(), H5resize_memory(), H5free_memory() [5] to avoid memory failures.
Additionally, if your filter code leaks memory, then the HDF5 library generates a failure something like this.
H5MM.c:232: H5MM_final_sanity_check: Assertion `0 == H5MM_curr_alloc_bytes_s' failed.
One can look at the code in plugins/H5Zbzip2.c and H5Zmisc.c as illustrations.
The current szip plugin code in the HDF5 library has some behaviors that can catch the unwary. These are handled internally to (mostly) hide them so that they should not affect users. Specifically, this filter may do two things.
The reason for these changes has to do with the fact that the szip API provided by the underlying H5Pset_szip function is actually a subset of the capabilities of the real szip implementation. Presumably this is for historical reasons.
In any case, if the caller uses the nc_inq_var_szip or the nc_inq_var_filter functions, then the parameter values returned may differ from those originally specified.
It should also be noted that the HDF5 szip filter wrapper that is invoked depends on the configuration of the netcdf-c library. If the HDF5 installation supports szip, then the NCZarr szip will use the HDF5 wrapper. If HDF5 does not support szip, or HDF5 is not enabled, then the plugins directory will contain a local HDF5 szip wrapper to be used by NCZarr. This can be confusing, but is generally transparent to the user since the plugins HDF5 szip wrapper was taken from the HDF5 code base.
The current matrix of operating systems and build systems known to work is as follows.
Build System | Supported OS |
---|---|
Automake | Linux, Cygwin, OSX |
Cmake | Linux, Cygwin, OSX, Visual Studio |
If you do not want to use Automake or Cmake, the following has been known to work.
gcc -g -O0 -shared -o libbzip2.so <plugin source files> -L${HDF5LIBDIR} -lhdf5_hl -lhdf5 -L${ZLIBDIR} -lz
The filter id for an HDF5 format filter is an unsigned integer. Further, the parameters passed to an HDF5 format filter are encoded internally as a vector of 32-bit unsigned integers. It may be that the parameters required by a filter can naturally be encoded as unsigned integers. The bzip2 compression filter, for example, expects a single integer value from zero through nine. This encodes naturally as a single unsigned integer.
Note that signed integers and single-precision (32-bit) float values also can easily be represented as 32 bit unsigned integers by proper casting to an unsigned integer so that the bit pattern is preserved. Simple signed integer values of type short or char can also be mapped to an unsigned integer by truncating to 16 or 8 bits respectively and then sign extending. Similarly, unsigned 8 and 16 bit values can be used with zero extensions.
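The conversions described above can be sketched in C as follows. These helpers are illustrative only and are not the encoding routines used by ncgen or nccopy; the function names are hypothetical, and the sketch assumes a 32-bit unsigned int, a 32-bit IEEE 754 float, and the usual two's complement narrowing behavior.

```c
#include <string.h>

/* Signed 8-bit: truncate to 8 bits, then sign extend to 32 bits.
 * (Narrowing an out-of-range int is implementation-defined in C;
 * common compilers simply keep the low 8 bits.) */
unsigned int encode_byte(int v)
{
    return (unsigned int)(int)(signed char)v;
}

/* Unsigned 8-bit: truncate to 8 bits, zero extend to 32 bits. */
unsigned int encode_ubyte(unsigned int v)
{
    return v & 0xFFu;
}

/* Signed 16-bit: truncate to 16 bits, then sign extend to 32 bits. */
unsigned int encode_short(int v)
{
    return (unsigned int)(int)(short)v;
}

/* Unsigned 16-bit: truncate to 16 bits, zero extend to 32 bits. */
unsigned int encode_ushort(unsigned int v)
{
    return v & 0xFFFFu;
}

/* 32-bit float: copy the bit pattern unchanged into an unsigned int. */
unsigned int encode_float(float f)
{
    unsigned int u;
    memcpy(&u, &f, sizeof u);
    return u;
}

/* Inverse of encode_float, as a filter would do when reading. */
float decode_float(unsigned int u)
{
    float f;
    memcpy(&f, &u, sizeof f);
    return f;
}
```

Using `memcpy` rather than a pointer cast keeps the bit pattern intact without running into aliasing problems.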
Machine byte order (aka endian-ness) is an issue for passing some kinds of parameters. You might define the parameters when compressing on a little endian machine, but later do the decompression on a big endian machine.
When using HDF5 format filters, byte order is not an issue for 32-bit values because HDF5 takes care of converting them between the local machine byte order and network byte order.
Parameters whose size is larger than 32-bits present a byte order problem. This specifically includes double precision floats and (signed or unsigned) 64-bit integers. For these cases, the machine byte order issue must be handled, in part, by the compression code. This is because HDF5 will treat, for example, an unsigned long long as two 32-bit unsigned integers and will convert each to network order separately. This means that on a machine whose byte order is different than the machine in which the parameters were initially created, the two integers will be separately endian converted. But this will be incorrect for 64-bit values.
So, we have this situation (for HDF5 only):
In order to properly extract the correct 8-byte value, we need to ensure that the values stored in the HDF5 file have a known format independent of the native format of the creating machine.
The idea is to do sufficient manipulation so that HDF5 will store the 8-byte value as a little endian value divided into two 4-byte integers. Note that little-endian is used as the standard because it is the most common machine format. When read, the filter code needs to be aware of this convention and do the appropriate conversions.
This leads to the following set of rules.
To support these rules, some utility programs exist and are discussed in Appendix B.
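The target layout, an 8-byte little-endian value divided into two 4-byte integers, can be illustrated at the value level. The sketch below is not the library's utility code and does not reproduce the byte-swapping steps; it only shows which 32-bit word goes first. It relies on the statement above that HDF5 preserves each 32-bit value across machines, and the function names are hypothetical.

```c
#include <stdint.h>
#include <string.h>

/* Split a 64-bit value into two 32-bit words, least significant word
 * first. Shifts operate on values, so the result is the same on
 * little-endian and big-endian hosts. Assumes a 32-bit unsigned int. */
void
split64(uint64_t v, unsigned int words[2])
{
    words[0] = (unsigned int)(v & 0xFFFFFFFFu); /* low 32 bits  */
    words[1] = (unsigned int)(v >> 32);         /* high 32 bits */
}

/* Rebuild the 64-bit value from the two words. */
uint64_t
join64(const unsigned int words[2])
{
    return ((uint64_t)words[1] << 32) | (uint64_t)words[0];
}

/* Doubles go through their 64-bit pattern (assumes a 64-bit double). */
void
split_double(double d, unsigned int words[2])
{
    uint64_t bits;
    memcpy(&bits, &d, sizeof bits);
    split64(bits, words);
}

double
join_double(const unsigned int words[2])
{
    uint64_t bits = join64(words);
    double d;
    memcpy(&d, &bits, sizeof d);
    return d;
}
```

A filter that receives the two words as parameters would call the corresponding join function to recover the original value.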
Several functions are exported from the netcdf-c library for use by client programs and by filter implementations. They are defined in the header file netcdf_aux.h. The h5 tag indicates that they assume that the result of the parse is a set of unsigned integers — the format used by HDF5.
Examples of the use of these functions can be seen in the test program nc_test4/tst_filterparser.c.
Some of the above functions use a C struct defined in *netcdf_filter.h*. The definition of that struct is as follows.
This struct in effect encapsulates all of the information about an HDF5 formatted filter: the id, the number of parameters, and the parameters themselves.
The include file netcdf_meta.h contains the following definition.
This, in conjunction with the error code NC_ENOFILTER in netcdf.h, can be used to see what filter mechanism is in place, as described in the section on incompatibilities.
The Codec API mirrors the HDF5 API closely. It has one well-known function that can be invoked to obtain information about the Codec as well as pointers to special functions to perform conversions.
This function returns a pointer to a C struct that provides detailed information about the codec plugin.
The value returned is actually of type struct NCZ_codec_t, but is of type void* to allow for extensions.
The semantics of the non-function fields is as follows:
Given a JSON Codec representation, it will return a corresponding vector of unsigned integers representing the visible parameters.
Return Value: a netcdf-c error code.
Given an HDF5 visible parameters vector of unsigned integers and its length, return a corresponding JSON codec representation of those visible parameters.
Return Value: a netcdf-c error code.
Extract environment information from the (ncid,varid) and use it to convert a set of visible parameters to a set of working parameters; also provide option to modify visible parameters.
Return Value: a netcdf-c error code.
Some compressors may require library initialization. This function is called as soon as a shared library is loaded and matched with an HDF5 filter.
Return Value: a netcdf-c error code.
Some compressors (like blosc) require invoking a finalize function in order to avoid memory loss. This function is called during a call to nc_finalize to do any finalization. If the client code does not invoke nc_finalize then memory checkers may complain about lost memory.
Return Value: a netcdf-c error code.
As an aid to clients, it is convenient if a single shared library can provide multiple NCZ_codec_t instances at one time. This API is not intended to be used by plugin developers. A shared library must only export this function.
Return a NULL terminated vector of pointers to instances of NCZ_codec_t.
The value returned is actually of type *NCZ_codec_t***, but is declared as *void** to allow for extensions. The list of returned items is used to provide defaults for any HDF5 filters that have no corresponding Codec. This is for internal use only.
Support for a select set of standard filters is built into the NetCDF API. Generally, they are accessed using the following generic API, where XXXX is the filter name. As a rule, the names are those used in the HDF5 filter ID naming authority [4] or the NumCodecs naming authority [7].
The first function inserts the specified filter into the filter chain for a given variable. The second function queries the given variable to see if the specified filter is in the filter chain for that variable. The hasfilter argument is set to one if the filter is in the chain and zero otherwise. As is usual with the netcdf API, one is expected to call this function twice: first to obtain nparamsp, and then to get the parameters in the client-allocated memory argument params. Any of these arguments can be NULL, in which case no value is returned.
Note that NetCDF inherits four filters from HDF5, namely shuffle, fletcher32, deflate (zlib), and szip. The APIs for these do not conform to the above API. So aside from those four, the current set of standard filters is as follows.
Filter Name | Filter ID | Reference |
---|---|---|
zstandard | 32015 | https://facebook.github.io/zstd/ |
bzip2 | 307 | https://sourceware.org/bzip2/ |
It is important to note that in order to use each standard filter, several additional libraries must be installed. Consider the zstandard compressor, which is one of the supported standard filters. When installing the netcdf library, the following other libraries must be installed.
A major problem for filter users is finding an implementation of an HDF5 filter wrapper and (optionally) its corresponding NCZarr wrapper. There are several ways to do this.
As part of the overall build process, a number of filter wrappers are built as shared libraries in the "plugins" directory. These wrappers can be installed as part of the overall netcdf-c installation process. WARNING: the installer still needs to make sure that the actual filter/compression libraries are installed: e.g. libzstd and/or libblosc.
The target location into which libraries in the "plugins" directory are installed is specified using a special *./configure* option
or its corresponding cmake option.
This option defaults to the value "yes", which means that filters are installed by default. This can be disabled by one of the following options.
If the option is specified with no argument (automake) or with the value "YES" (CMake), then it defaults (in order) to the following directories:
If NCZarr is enabled, then in addition to wrappers for the standard filters, additional libraries will be installed to support NCZarr access to filters. Currently, this list includes the following:
The shuffle, fletcher32, and deflate filters in this case will be ignored by HDF5 and only used by the NCZarr code. But in order to use them, it needs additional Codec capabilities provided by the lib__nczh5filters.so shared library. Note also that if you disable HDF5 support, but leave NCZarr support enabled, then all of the above filters should continue to work.
At the moment, NetCDF uses the existing HDF5 environment variable HDF5_PLUGIN_PATH to locate the directories in which filter wrapper shared libraries are located. This is used both for the HDF5 filter wrappers and for the NCZarr codec wrappers.
HDF5_PLUGIN_PATH is a typical Windows or Unix style path-list. That is, it is a sequence of absolute directory paths separated by a specific separator character. For Windows, the separator character is a semicolon (';') and for Unix, it is a colon (':').
So, if HDF5_PLUGIN_PATH is defined at build time, and --with-plugin-dir is specified with no argument, then the last directory in the path will be the one into which filter wrappers are installed. Otherwise the default directories are used.
The important thing to note is that at run-time, there are several cases to consider:
Author: Dennis Heimbigner
Email: dmh at ucar dot edu
Initial Version: 1/10/2018
Last Revised: 5/18/2022