Time Partitions


All of the GRIB records in all of the files that you specify constitute a GRIB collection. The CDM creates one or more CDM datasets from this collection. There is always an overall collection dataset, and for large collections you may want to also present smaller subsets, for example yearly subsets of a multiyear model run. To do so, GRIB collections can be divided up into time partitions, based on the reference time of the GRIB records.

You may also want to partition your collection due to memory constraints or in order to speed up indexing. To build an index, all of the metadata of each GRIB record is read into memory. This is approximately 400-500 bytes / record. Another estimate is the sum of all the GRIB index file sizes (gbx9 files) in the collection. (This memory is only required when building the index, and after the indexes are built, the memory usage is proportional to the size of the CDM index files (ncx3 files)). So if you have 1 million records in your collection, you may need ~500 MB of main memory for the indexing. The time it takes to index also increases with the number of GRIB records (probably somewhat more than linearly). When your collection is changing, partitioned collections will only have to update the partitions that change. So all of these are reasons you may want to use time partitioning.

See here for options on how to get your collections indexed.

A time partition is the partitioning of the GRIB records into disjoint sets, based on the reference time of the GRIB records. (For model runs the reference time is usually the model run time). The way the partition is done is controlled by the timePartition attribute on the <collection> element inside the <featureCollection>, eg:

  <featureCollection name="NCEP GFS model" featureType="GRIB1" path="test/all">
    <metadata inherited="true">
      <serviceName>all</serviceName>
      <dataFormat>GRIB-1</dataFormat>
    </metadata>
    <collection name="gfs" spec="/data/GFS/**/.*\.grb$" timePartition="directory"/>
  </featureCollection>

For each partition, a CDM index file is created, and a CDM collection dataset can be acccessed by the user.

The possible values of timePartition are:

  1. none: all files are scanned in, and a collection level index is built, plus a collection for each unique reference time.
  2. directory: each directory is a partition. Nested directories create nested partitions (eg year/month/day).
  3. file: each file is a partition.
  4. time unit: the files are divided into partitions based on the given time unit. The reference time must be encoded in the filenames. Thi is the only time you need to use a dateFormatMark.

Directory Partition

In order to use a directory partition, the directory structure must partition the data by reference time. That is, all of the data for any given reference time must be completely contained in the directory. Directories can be nested to any level.

File Partition

In order to use a file partition, all of the records for a reference time must be contained in a single file. The common case is that each file contains all of the records for a single reference time.

Time Partition

In order to use a time partition, the filenames must contain parseable time information than can be used to partition the data. The directory layout, if any, is not used. The common case is where all files are in a single directory, and each file has the reference date encoded in the name. The split-out of variables does not matter.

If a collection is configured as a time partition, all of the filenames are read into memory at once. A date extractor must be specified, and is used to group the files into partitions. For example, if timePartition = "year", all of the files for each calendar year are made into a collections. The overall dataset is then the collection of all of those yearly collections.

None Partition

If a collection is configured with timePartition=none, all of the records' metadata (excluding the data itself) must be read into memory at once. The records are then grouped by reference time, and a seperate collection is written for each reference time. The overall dataset is then the collection of all of those single-reftime collections.

This option is reasonable for small collections. Record metadata is approximately 400-500 bytes / record. 1M records would then consume 400-500 MB during the indexing process. Our rule of thumb is to use a different partition strategy if the collection exceeds 1M records. Even then, this option takes the longest when indexing, and other strategies are preferred especially if the collection is dynamic and must be reindexed often.

Collection storage strategies

Generally, managing many small files has more overhead than managing smaller numbers of large files. For today's disks, file sizes of of 100 Mb - 10 Gb seems right. Keep the number of files in a directory small, a few hundred is best, and more than a thousand starts to make things like directory listings hard.

Partition your files by reference time, which typically is the model run time. Depending on size and number, you might group files by day, month, year, etc.

When you have complete control over how the collection is stored on disk, and you want to optimize for fastest THREDDS indexing and retrieval, file partitions are recommended. Placing all of the records for a single reference time in a single file is often optimal. If there are a small number of records for each runtime (<10,000 ?) you might want to put more than one reference time in each file. Its essential that all the records for each runtime be in a single file.

GRIB files are unordered collections of GRIB records. So you can concatenate GRIB files together just using the cat command. So you can easily reorganize your collection (breaking up files requires GRIB-aware software, but is easy enough), just delete any old THREDDS indexes (.gbx9 or .ncx3) and regenerate them with the TDM.

Homogeneity requirements

A feature collection dataset is a homogeneous collection of records. That means that the metadata describing the dataset is approximately the same for all of the records. Each GRIB record is self-contained, in the sense that it does not reference other records, or an overall schema. The point of making a collection of GRIB records into a CDM dataset, is to allow the user to access the entire collection at once, using the netCDF API for multidimensionsal arrays. The user cannot, using the netCDF API, access the metadata for individual GRIB records. Instead, we assume that the metadata is uniform in the collection, and expose it with variable and global attributes.

 


This document was last updated April 2015