Collection Specification String

A collection specification string creates a collection of files by scanning file directories and looking for matches. It can optionally extract a date from a filename. It has these parts:

  1. A root directory (absolute file path).
  2. Followed by an optional "/**/", which means that all subdirectories under the root directory are scanned as well.
  3. Followed by a regular expression that is applied to the filename.
  4. An optional date extractor may be specified that computes a date from the filename.
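
As an illustration of how these parts fit together, the following Python sketch splits a specification string into its pieces. The names `ParsedSpec` and `parse_spec` are hypothetical, and splitting at the last "/" when no "/**/" is present is an assumption made for this sketch, not a description of the actual parser.

```python
from typing import NamedTuple, Optional


class ParsedSpec(NamedTuple):
    root_dir: str                 # part 1: directory to scan
    scan_subdirs: bool            # part 2: True when "/**/" is present
    name_filter: str              # part 3: filename pattern, may contain '#...#'
    date_template: Optional[str]  # part 4: text between the '#' marks, if any


def parse_spec(spec: str) -> ParsedSpec:
    """Split a collection specification string into its parts (hypothetical helper)."""
    if "/**/" in spec:
        root_dir, name_filter = spec.split("/**/", 1)
        scan_subdirs = True
    else:
        # Assumption: without "/**/", everything after the last '/' is the filename pattern.
        root_dir, _, name_filter = spec.rpartition("/")
        scan_subdirs = False

    date_template = None
    first, last = name_filter.find("#"), name_filter.rfind("#")
    if first != -1 and last > first:
        date_template = name_filter[first + 1:last]

    return ParsedSpec(root_dir, scan_subdirs, name_filter, date_template)
```

For instance, `parse_spec("/data/ldm/pub/native/grid/NCEP/GFS/Alaska_191km/.*nc$")` yields the root directory `/data/ldm/pub/native/grid/NCEP/GFS/Alaska_191km`, no subdirectory scan, the filter `.*nc$`, and no date template.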

Example 1

 /data/ldm/pub/native/grid/NCEP/GFS/Alaska_191km/.*nc$

This selects all files whose names end with "nc" in the directory /data/ldm/pub/native/grid/NCEP/GFS/Alaska_191km. The ".*nc$" is a regular expression that is matched against the part of the path name after the top directory /data/ldm/pub/native/grid/NCEP/GFS/Alaska_191km/. The ".*" means "any number of any character" and the "nc$" means "ending with nc". If you want to make sure the name ends with ".nc", you need:

 /data/ldm/pub/native/grid/NCEP/GFS/Alaska_191km/.*\.nc$

Since "." is a special character in regular expressions, one needs to escape it to match a literal ".", so "\.nc$" means match the characters "." "n" "c" at the end of the string.

It's generally important to use the '$' to indicate the end of the string, since a common convention is to write auxiliary files by naming them <org file>.<ext>, and you need to eliminate the auxiliary files from the collection.
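
The effect of the "$" and of the escaped "." can be tried out with a few filenames. This is a minimal sketch that uses Python's `re` module as a stand-in for the regular expression engine; the filenames (including the ".gbx8" auxiliary file) and the helper `select` are made up for illustration, and the pattern is assumed to be applied from the start of the filename.

```python
import re
from typing import List

names = [
    "GFS_Alaska_191km_20090301_0600.nc",
    "GFS_Alaska_191km_20090301_0600.nc.gbx8",  # hypothetical auxiliary file
    "GFS_Alaska_191km_20090301_0600_sync",     # ends with "nc" but not ".nc"
]


def select(pattern: str, filenames: List[str]) -> List[str]:
    """Keep the filenames that the pattern matches, starting at the first character."""
    regex = re.compile(pattern)
    return [n for n in filenames if regex.match(n)]


no_anchor = select(r".*nc", names)   # no '$': the auxiliary file is selected too
loose = select(r".*nc$", names)      # '$' drops the auxiliary file, "_sync" still matches
strict = select(r".*\.nc$", names)   # escaped '.': only the real ".nc" file remains
```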

Example 2

 /data/ldm/pub/native/grid/NCEP/GFS/Alaska_191km/**/.*\.nc$

All files ending with ".nc" in the directory /data/ldm/pub/native/grid/NCEP/GFS/Alaska_191km and its subdirectories.
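
The difference between scanning only the root directory and scanning its subdirectories can be sketched as follows. The function `scan` is a hypothetical illustration built on `os.walk`, not the scanner used by the software, and the temporary directory layout is invented for the demonstration.

```python
import os
import re
import tempfile
from typing import List


def scan(root_dir: str, name_pattern: str, scan_subdirs: bool) -> List[str]:
    """Return sorted paths under root_dir whose filename matches name_pattern."""
    regex = re.compile(name_pattern)
    found = []
    for dirpath, dirnames, filenames in os.walk(root_dir):
        if not scan_subdirs:
            dirnames.clear()  # prune in place so os.walk does not descend further
        for name in filenames:
            if regex.match(name):  # the pattern is applied to the filename only
                found.append(os.path.join(dirpath, name))
    return sorted(found)


# Small demonstration on a throwaway directory tree.
with tempfile.TemporaryDirectory() as root:
    os.makedirs(os.path.join(root, "20090301"))
    for rel in ("a.nc", "a.nc.gbx8", os.path.join("20090301", "b.nc")):
        open(os.path.join(root, rel), "w").close()
    flat = [os.path.relpath(p, root) for p in scan(root, r".*\.nc$", scan_subdirs=False)]
    deep = [os.path.relpath(p, root) for p in scan(root, r".*\.nc$", scan_subdirs=True)]
```

Here `flat` contains only the file in the root directory, while `deep` also contains the file in the dated subdirectory.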

Example 3

 /data/ldm/pub/native/grid/NCEP/GFS/Alaska_191km/**/GFS_Alaska_191km_#yyyyMMdd_HHmm#\.nc$

Search the directory /data/ldm/pub/native/grid/NCEP/GFS/Alaska_191km and its subdirectories for files that match the regular expression:

 GFS_Alaska_191km_.............\.nc$

Remember that an unescaped "." matches any character. An escaped "\." matches the literal "." character.
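
A regular expression of this kind can be derived mechanically from the filename filter: each character of the date template between the '#' marks becomes one unescaped ".", and the '#' marks themselves are dropped. The sketch below shows that derivation in Python; `filter_to_regex` is a hypothetical helper written for this illustration.

```python
import re


def filter_to_regex(name_filter: str) -> str:
    """Replace the '#template#' part of a filter with one '.' per template character."""
    first, last = name_filter.find("#"), name_filter.rfind("#")
    if first == -1 or first == last:
        return name_filter  # no complete '#...#' pair: leave the filter unchanged
    template = name_filter[first + 1:last]
    return name_filter[:first] + "." * len(template) + name_filter[last + 1:]


regex = filter_to_regex(r"GFS_Alaska_191km_#yyyyMMdd_HHmm#\.nc$")
matches = bool(re.match(regex, "GFS_Alaska_191km_20090301_0600.nc"))
rejects_aux = re.match(regex, "GFS_Alaska_191km_20090301_0600.nc.gbx8") is None
```

The template yyyyMMdd_HHmm has 13 characters, so the derived expression contains 13 dots between "GFS_Alaska_191km_" and "\.nc$".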

From the filename, extract the date by applying the SimpleDateFormat template yyyyMMdd_HHmm to the portion of the filename after

GFS_Alaska_191km_
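
The date extraction step can be illustrated in Python. This is only a sketch under stated assumptions: the date is taken from the filename at the character position where the first '#' sits in the filter, and the SimpleDateFormat template is translated to `strptime` directives with a small hand-made table that covers only the letters used in this document. The helper `extract_date` is hypothetical.

```python
from datetime import datetime

# Hand-made mapping from the SimpleDateFormat letters used here to strptime
# directives. It is not a general SimpleDateFormat converter.
_DIRECTIVES = [("yyyy", "%Y"), ("MM", "%m"), ("dd", "%d"), ("HH", "%H"), ("mm", "%M")]


def extract_date(filename: str, name_filter: str) -> datetime:
    """Parse the date found in filename at the position marked by '#...#' in name_filter."""
    first, last = name_filter.find("#"), name_filter.rfind("#")
    template = name_filter[first + 1:last]
    # Assumption: the date starts at the same offset as the first '#' in the filter.
    date_text = filename[first:first + len(template)]
    fmt = template
    for java_letters, directive in _DIRECTIVES:
        fmt = fmt.replace(java_letters, directive)
    return datetime.strptime(date_text, fmt)


run_date = extract_date(
    "GFS_Alaska_191km_20090301_0600.nc",
    r"GFS_Alaska_191km_#yyyyMMdd_HHmm#\.nc$",
)
```

With this filename, `run_date` is 1 March 2009 at 06:00.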

Method for constructing collection specification Strings

The idea is to copy an example file path and then modify it step by step. Start with the path of an example file:

 /data/ldm/pub/native/grid/NCEP/GFS/Alaska_191km/20090301/GFS_Alaska_191km_20090301_0600.grib1

Replace the subdirectory with "**" so that all subdirectories are included:

 /data/ldm/pub/native/grid/NCEP/GFS/Alaska_191km/**/GFS_Alaska_191km_20090301_0600.grib1

Demarcate the part of the filename where the run date is encoded, using '#' characters:

 /data/ldm/pub/native/grid/NCEP/GFS/Alaska_191km/**/GFS_Alaska_191km_#20090301_0600#.grib1

Substitute a SimpleDateFormat template for the date:

 /data/ldm/pub/native/grid/NCEP/GFS/Alaska_191km/**/GFS_Alaska_191km_#yyyyMMdd_HHmm#.grib1

Escape the "." and add a '$' so that the name must end with ".grib1":

 /data/ldm/pub/native/grid/NCEP/GFS/Alaska_191km/**/GFS_Alaska_191km_#yyyyMMdd_HHmm#\.grib1$

Notes:

  1. You have to escape any of these regular expression special characters if you want to match them literally:
     .|*?+(){}[]^$\
     It's a good idea to avoid these characters in directory and file names, with the exception of the '.'.
  2. The date extractor can only be used on the filename in a collection specification string. If the date is part of a directory name, use the more general dateFormatMark on the collection element.
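  3. The date extractor cannot be used after a regular expression. So GFS_Alaska_191km_#yyyyMMdd_HHmm#.*grib$ is ok, but GFS.*km#yyyyMMdd_HHmm#grib$ is not. This is because the date must start at a fixed position in the filename, and a regular expression placed before it can match a variable number of characters. In that case, use the more general dateFormatMark: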
    <collection spec="/data/ldm/pub/native/grid/NCEP/GFS/Alaska_191km/**/GFS.*km.*grib$" dateFormatMark="yyyyMMdd_HHmm#.grib#$" />
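
Regarding note 1, escaping special characters by hand is error-prone, so it can help to let a library do it when the literal filename is known. The sketch below uses Python's `re.escape` for illustration (Java offers `Pattern.quote` for the same purpose); the filename and the helper `literal_filter` are invented for this example.

```python
import re


def literal_filter(filename: str) -> str:
    """Build a pattern that matches exactly this filename and nothing longer."""
    return re.escape(filename) + "$"


literal_name = "GFS_Alaska_191km(v2)+extra.nc"  # hypothetical name with special characters

escaped = literal_filter(literal_name)
escaped_matches = re.match(escaped, literal_name) is not None

# Without escaping, "(v2)+" is read as a repeated group and "." as any character,
# so the pattern no longer matches the literal filename.
unescaped_matches = re.match(literal_name + "$", literal_name) is not None
```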

This document was last updated May 2013.