A collection specification string creates a collection of files by scanning file directories and looking for matches. It can optionally extract a date from a filename. It has these parts:

  • A root directory (absolute file path).
  • Followed by an optional /**/ indicating to scan all subdirectories under the root directory.
  • Followed by a regular expression that is applied to the filename.
  • An optional date extractor may be specified that computes a date from the filename.

Example 1

/data/ldm/pub/native/grid/NCEP/GFS/Alaska_191km/.*nc$

All files ending with nc in the directory /data/ldm/pub/native/grid/NCEP/GFS/Alaska_191km. The .nc$ is a regular expression which tries to match the path name after the top directory /data/ldm/pub/native/grid/NCEP/GFS/Alaska_191km/. The . means any number of any character and the nc$ means “ending with nc”. If you want to make sure it ends with .nc, you need:

 /data/ldm/pub/native/grid/NCEP/GFS/Alaska_191km/.*\.nc$

Since . is a special character in regular expressions, one needs to escape it to match a literal ., so \.nc$ means match the characters .nc at the end of the string.

It’s generally important to use the $ to indicate the end of string, since a common convention is to write auxiliary files by naming them <org file>.<ext>, and you need to eliminate the auxiliary files from the collection.

Example 2

/data/ldm/pub/native/grid/NCEP/GFS/Alaska_191km/**/.*\.nc$

All files ending with .nc in the directory /data/ldm/pub/native/grid/NCEP/GFS/Alaska_191km and its subdirectories.

Example 3

/data/ldm/pub/native/grid/NCEP/GFS/Alaska_191km/**/GFS_Alaska_191km_#yyyyMMdd_HHmm#\.nc$

Search the directory /data/ldm/pub/native/grid/NCEP/GFS/Alaska_191km and its subdirectories for files that match the regular expression:

 GFS_Alaska_191km.........\.nc$

Remember that an unescaped . matches any character. An escaped \. matches the literal . character.

From the filename, extract the date by applying the SimpleDateFormat template yyyyMMdd_HHmm to the portion of the filename after GFS_Alaska_191km.

Method For Constructing Collection Specification Strings

The idea is that one copies an example file path, and then modifies it.
For example, copy an example filename:

/data/ldm/pub/native/grid/NCEP/GFS/Alaska_191km/20090301/GFS_Alaska_191km_20090301_0600.grib1

Modify it to include subdirectories:

/data/ldm/pub/native/grid/NCEP/GFS/Alaska_191km/**/GFS_Alaska_191km_20090301_0600.grib1

Demarcate the part of the filename where the run date is encoded, using # chars:

/data/ldm/pub/native/grid/NCEP/GFS/Alaska_191km/**/GFS_Alaska_191km_#20090301_0600#.grib1

Substitute a SimpleDateFormat:

/data/ldm/pub/native/grid/NCEP/GFS/Alaska_191km/**/GFS_Alaska_191km_#yyyyMMdd_HHmm#.grib1

Make sure the name ends with grib1:

/data/ldm/pub/native/grid/NCEP/GFS/Alaska_191km/**/GFS_Alaska_191km_#yyyyMMdd_HHmm#\.grib1$

Notes

You have to escape any of these regular expression literal characters that you want to match. It’s a good idea to avoid these characters in directory and file names, except the .

 .|*?+(){}[]^$\
  • The dot character . matches any single character.
  • A ^ character matches the null string at the start of a line.
  • A $ character matches the null string at the end of a line.

The date extractor can only be used on the filename in a collection specification string. If the date is part of a directory name, use the more general dateFormatMark on the collection element.

The date extractor element cannot be used after the regular expression. So GFS_Alaska_191km_#yyyyMMdd_HHmm#.*grib$ is ok, but GFS.*km#yyyyMMdd_HHmm#grib$ is not.

Use the more general dateFormatMark:

<collection spec="/data/ldm/pub/native/grid/NCEP/GFS/Alaska_191km/**/GFS.*km.*grib$" dateFormatMark="yyyyMMdd_HHmm#.grib#$" />

References for regular expressions