The datasetScan element allows you to serve all files in a directory tree. The files must be homogeneous enough that the same metadata can be applied to all of them.
Here is a minimal catalog containing a datasetScan element:
<?xml version="1.0" encoding="UTF-8"?>
<catalog name="Unidata Workshop 2006 - NCEP Model Data" version="1.0.1"
    xmlns="http://www.unidata.ucar.edu/namespaces/thredds/InvCatalog/v1.0"
    xmlns:xlink="http://www.w3.org/1999/xlink">
  <service name="myserver" serviceType="OpenDAP" base="/thredds/dodsC/" />
  <datasetScan name="NCEP Data" path="ncep" location="/data/ldm/pub/native/grid/NCEP/">
    <serviceName>myserver</serviceName>
  </datasetScan>
</catalog>
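The main points are that the path attribute ("ncep") is the part of the URL that identifies this datasetScan, and the location attribute names the directory on the server's file system that is scanned.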
In the catalog that the TDS server sends to any client, the datasetScan element is shown as a catalog reference:
<catalog name="Unidata Workshop 2006 - NCEP Model Data" version="1.0.1"
    xmlns="http://www.unidata.ucar.edu/namespaces/thredds/InvCatalog/v1.0"
    xmlns:xlink="http://www.w3.org/1999/xlink">
  <service name="myserver" serviceType="OpenDAP" base="/thredds/dodsC/" />
  <catalogRef xlink:href="/thredds/catalog/ncep/catalog.xml" xlink:title="NCEP Data" name="" />
</catalog>
The catalog will be generated dynamically on the server when requested, by scanning the server's directory /data/ldm/pub/native/grid/NCEP/. For example, if the directory looked like:
/data/ldm/pub/native/grid/NCEP/
  GFS/
    CONUS_191km/
      GFS_CONUS_191km_20061107_0000.grib1
      GFS_CONUS_191km_20061107_0000.grib1.gbx9
      GFS_CONUS_191km_20061107_0600.grib1
      GFS_CONUS_191km_20061107_1200.grib1
    CONUS_80km/
      ...
    ...
  NAM/
    ...
  NDFD/
    ...
The result of a request for "/thredds/catalog/ncep/catalog.xml" might look like:
<catalog ...>
<service name="myserver" serviceType="OpenDAP" base="/thredds/dodsC/" />
<dataset name="NCEP Data">
<metadata inherited="true">
<serviceName>myserver</serviceName>
</metadata>
<catalogRef xlink:title="GFS" xlink:href="GFS/catalog.xml" name="" />
<catalogRef xlink:title="NAM" xlink:href="NAM/catalog.xml" name="" />
<catalogRef xlink:title="NDFD" xlink:href="NDFD/catalog.xml" name="" />
</dataset>
</catalog>
and for a "/thredds/catalog/ncep/GFS/CONUS_191km/catalog.xml" request:
<catalog ...>
<service name="myserver" serviceType="OpenDAP" base="/thredds/dodsC/" />
<dataset name="ncep/GFS/CONUS_191km">
<metadata inherited="true">
<serviceName>myserver</serviceName>
</metadata>
<dataset name="GFS_CONUS_191km_20061107_0000.grib1"
urlPath="ncep/GFS/CONUS_191km/GFS_CONUS_191km_20061107_0000.grib1" />
<dataset name="GFS_CONUS_191km_20061107_0000.grib1.gbx9"
urlPath="ncep/GFS/CONUS_191km/GFS_CONUS_191km_20061107_0000.grib1.gbx9" />
<dataset name="GFS_CONUS_191km_20061107_0600.grib1"
urlPath="ncep/GFS/CONUS_191km/GFS_CONUS_191km_20061107_0600.grib1" />
<dataset name="GFS_CONUS_191km_20061107_1200.grib1"
urlPath="ncep/GFS/CONUS_191km/GFS_CONUS_191km_20061107_1200.grib1" />
</dataset>
</catalog>
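The relationship between a dataset's urlPath, the datasetScan path, and the scanned location can be sketched in a few lines of Python. This is only an illustration of the mapping shown in the catalogs above, not the TDS implementation; the function name url_path_to_file is hypothetical, and the sketch does no validation beyond checking the path prefix.

```python
import posixpath


def url_path_to_file(url_path: str, scan_path: str, location: str) -> str:
    """Map a dataset urlPath to a file under the scanned directory.

    Illustrative only: strips the datasetScan path prefix and appends the
    remainder to the location directory.
    """
    prefix = scan_path.strip("/") + "/"
    if not url_path.startswith(prefix):
        raise ValueError(f"{url_path!r} is not under datasetScan path {scan_path!r}")
    relative = url_path[len(prefix):]
    return posixpath.join(location, relative)


print(url_path_to_file(
    "ncep/GFS/CONUS_191km/GFS_CONUS_191km_20061107_0600.grib1",
    "ncep",
    "/data/ldm/pub/native/grid/NCEP/",
))
```

Under these assumptions, the urlPath "ncep/GFS/CONUS_191km/GFS_CONUS_191km_20061107_0600.grib1" corresponds to the file of the same name in /data/ldm/pub/native/grid/NCEP/GFS/CONUS_191km/.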
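Note that subdirectories of the scanned directory appear as catalogRef elements, files appear as dataset elements, and each dataset's urlPath is the datasetScan path ("ncep") followed by the file's location relative to the scanned directory.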
The datasetScan element is an extension of a dataset element, and it can contain any of the metadata elements that a dataset can. Typically you want all of its contained datasets to inherit the metadata, so add an inherited metadata element contained in the datasetScan element, for example:
<catalog name="Unidata Workshop 2006 - NCEP Model Data" version="1.0.1"
    xmlns="http://www.unidata.ucar.edu/namespaces/thredds/InvCatalog/v1.0">
  <service name="myserver" serviceType="OpenDAP" base="/thredds/dodsC/" />
  <datasetScan name="NCEP Data" path="ncep" location="/data/ldm/pub/native/grid/NCEP/">
    <metadata inherited="true">
      <serviceName>myserver</serviceName>
      <authority>unidata.ucar.edu:</authority>
      <dataType>Grid</dataType>
    </metadata>
  </datasetScan>
</catalog>
A datasetScan element can specify which files and directories it will include with a filter element (see the server-side catalog specification for details). When no filter element is given, all files and directories are included in the generated catalog(s). Adding a filter element to your datasetScan element lets you include and/or exclude files and directories as needed. For instance, the following filter and selector elements include only files that end in ".grib1" and exclude any file that ends with "_0000.grib1".
<filter>
  <include wildcard="*.grib1"/>
  <exclude wildcard="*_0000.grib1"/>
</filter>
You can specify which files to include or exclude using either wildcard patterns (with the wildcard attribute) or regular expressions (using the regExp attribute). If the wildcard pattern (or the regular expression) matches the dataset name, the dataset is included or excluded as specified. By default, includes and excludes apply only to regular files (atomic datasets). You can specify that they apply to directories (collection datasets) as well by using the atomic and collection attributes. For instance, the additional selector in this filter element means that only directories that don't start with "CONUS" will be cataloged (since the default value of atomic is true, we have to explicitly set it to false if we only want to filter directories):
<filter>
  <include wildcard="*.grib1"/>
  <exclude wildcard="*_0000.grib1"/>
  <exclude wildcard="CONUS*" atomic="false" collection="true"/>
</filter>
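To make the interaction of include, exclude, atomic, and collection concrete, here is a rough Python sketch of the filter above. It is not the TDS code: the tuple layout, the is_cataloged function, and the rule that a name with no applicable include selectors is kept unless excluded are assumptions made for illustration, and Python's fnmatch is only an approximation of the TDS wildcard syntax.

```python
from fnmatch import fnmatchcase

# Each selector: (kind, wildcard, applies_to_files, applies_to_directories).
# atomic defaults to true and collection to false, as described in the text.
SELECTORS = [
    ("include", "*.grib1", True, False),
    ("exclude", "*_0000.grib1", True, False),
    ("exclude", "CONUS*", False, True),
]


def is_cataloged(name, is_directory, selectors=SELECTORS):
    """Return True if the file or directory name passes the filter."""
    # Keep only the selectors that apply to this kind of dataset.
    relevant = [
        (kind, pattern)
        for kind, pattern, atomic, collection in selectors
        if (collection if is_directory else atomic)
    ]
    includes = [p for kind, p in relevant if kind == "include"]
    excludes = [p for kind, p in relevant if kind == "exclude"]
    if any(fnmatchcase(name, p) for p in excludes):
        return False
    # Assumption: with no applicable includes, anything not excluded is kept.
    return not includes or any(fnmatchcase(name, p) for p in includes)


for name, is_dir in [
    ("GFS_CONUS_191km_20061107_0600.grib1", False),
    ("GFS_CONUS_191km_20061107_0000.grib1", False),
    ("CONUS_191km", True),
    ("GFS", True),
]:
    print(name, is_cataloged(name, is_dir))
```

In this sketch the 0600 file and the GFS directory pass, while the 0000 file and the CONUS_191km directory are filtered out.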
It's a good idea to always use a filter element with explicit includes, so that if stray files accidentally get into your data directories, they won't generate erroneous catalog entries. This is also known as whitelisting.
Complicated matching can be done with regular expressions, for example:
<filter> <include regExp="PROFILER_.*_2013110[67]_[0-9]{4}\.nc"/> </filter>
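A quick way to check what such a regular expression accepts is to try it against a few candidate names. The Python sketch below does this for the pattern above; the file names are made up for illustration, and it assumes the expression must match the whole dataset name (TDS evaluates the pattern with Java's regular expression engine, so edge cases may differ).

```python
import re

# Same pattern as in the filter element above.
INCLUDE = re.compile(r"PROFILER_.*_2013110[67]_[0-9]{4}\.nc")

# Hypothetical dataset names, for illustration only.
names = [
    "PROFILER_wind_06min_20131106_2348.nc",
    "PROFILER_RASS_01hr_20131107_1200.nc",
    "PROFILER_wind_06min_20131108_0000.nc",   # wrong day
    "PROFILER_wind_06min_20131106_2348.ncml", # wrong extension
]

# Assumption: the pattern has to match the entire name.
included = [name for name in names if INCLUDE.fullmatch(name)]
print(included)
```

Only the first two names are kept: the third falls outside the days 06 and 07, and the fourth does not end in ".nc" when the whole name must match.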
A few gotchas to remember:
<namer>
<regExpOnName regExp="GFS" replaceString="NCEP GFS model data" />
<regExpOnName regExp="NCEP" replaceString="NCEP model data"/>
</namer>
More complex renaming is possible as well. The namer uses a regular expression match on the dataset name. If the match succeeds, any regular expression capturing groups are used in the replacement string.
A capturing group is a part of a regular expression enclosed in parentheses. When a regular expression with a capturing group is applied to a string, the substring that matches the capturing group is saved for later use. The captured strings can then be substituted into another string in place of capturing group references, "$n", where "n" is an integer indicating a particular capturing group. (The capturing groups are numbered according to the order in which they appear in the match string.) For example, the regular expression "Hi (.*), how are (.*)\?" when applied to the string "Hi Fred, how are you?" would capture the strings "Fred" and "you". (The final question mark is escaped so that it is matched literally rather than treated as a quantifier.) Following that with a capturing group replacement in the string "$2 are $1." would result in the string "you are Fred."
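The capture-and-substitute step can be tried out in a few lines of Python. This is an illustrative sketch, not TDS code: substitute_groups is a hypothetical helper that expands "$n" references, since Python's own replacement syntax differs from the "$n" form used here. The pattern escapes the trailing question mark so that it matches literally; left unescaped, "?" acts as a quantifier and the second group would capture "you?".

```python
import re


def substitute_groups(match: re.Match, template: str) -> str:
    """Replace each "$n" reference in template with the n-th captured group."""
    return re.sub(
        r"\$(\d+)",
        lambda ref: match.group(int(ref.group(1))),
        template,
    )


match = re.fullmatch(r"Hi (.*), how are (.*)\?", "Hi Fred, how are you?")
print(match.groups())                          # ('Fred', 'you')
print(substitute_groups(match, "$2 are $1."))  # you are Fred.
```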
Here's an example namer:
<namer>
<regExpOnName regExp="([0-9]{4})([0-9]{2})([0-9]{2})_([0-9]{2})([0-9]{2})"
replaceString="NCEP GFS 191km Alaska $1-$2-$3 $4:$5:00 GMT"/>
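</namer>
The regular expression has five capturing groups, which capture the year, month, day, hour, and minute. When applied to the dataset name "GFS_Alaska_191km_20051011_0000.grib1", the strings "2005", "10", "11", "00", and "00" are captured, so the cataloged dataset would look something like this: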
<dataset name="NCEP GFS 191km Alaska 2005-10-11 00:00:00 GMT"
urlPath="models/NCEP/GFS/Alaska_191km/GFS_Alaska_191km_20051011_0000.grib1"/>
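The renaming in this example can be reproduced with a short Python sketch. It is an approximation for illustration only: the rename function is hypothetical, it assumes the pattern may match anywhere in the dataset name, and it assumes a name that does not match is left unchanged.

```python
import re

NAME_PATTERN = re.compile(
    r"([0-9]{4})([0-9]{2})([0-9]{2})_([0-9]{2})([0-9]{2})"
)
REPLACE_STRING = "NCEP GFS 191km Alaska $1-$2-$3 $4:$5:00 GMT"


def rename(dataset_name: str) -> str:
    """Build the display name from the captured date and time fields."""
    match = NAME_PATTERN.search(dataset_name)
    if match is None:
        # Assumption: names that do not match keep their original name.
        return dataset_name
    # Expand "$n" references with the corresponding captured group.
    return re.sub(
        r"\$(\d+)",
        lambda ref: match.group(int(ref.group(1))),
        REPLACE_STRING,
    )


print(rename("GFS_Alaska_191km_20051011_0000.grib1"))
# NCEP GFS 191km Alaska 2005-10-11 00:00:00 GMT
```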
Datasets at each collection level are listed in ascending order by name. With a sort element you can specify that they are to be sorted in reverse order:
<sort>
<lexigraphicByName increasing="false" />
</sort>
Note that the sort is done before renaming.
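Because the sort happens before renaming, the order is determined by the file names rather than by the names shown in the catalog. A minimal Python sketch of the reverse ordering, using the file names from the directory listing earlier and assuming a plain lexicographic comparison:

```python
names = [
    "GFS_CONUS_191km_20061107_0000.grib1",
    "GFS_CONUS_191km_20061107_1200.grib1",
    "GFS_CONUS_191km_20061107_0600.grib1",
]

# increasing="false" corresponds to a descending sort on the original names.
listed = sorted(names, reverse=True)
for name in listed:
    print(name)
```

With time stamps embedded in the names, this puts the most recent file (1200) first and the oldest (0000) last.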
For real-time data you may want to have a special link that points to the "latest" data in the collection. Here, "latest" simply means the last filename in a list sorted by name, so it's only the latest if the time stamp is in the filename and the name sorts correctly by time.
The simplest way to enable this is to add the attribute addLatest="true" to the datasetScan element. In this case, the default values are used: name = "latest.xml", top = "true", and serviceName = "latest".
The addProxies element allows more control over how latest is displayed. The simpleLatest element allows you to change the default values (name, top, serviceName). The latestComplete element adds the ability to exclude files that have changed within a specified amount of time, e.g. to exclude files that are still being written.
Example:
<service name="latest" serviceType="Resolver" base="" />
<datasetScan name="GRIB2 Data" path="grib2" location="c:/data/grib2/" serviceName="myserver">
  <addProxies>
    <simpleLatest/>
    <latestComplete name="latestComplete.xml" top="true" serviceName="latest" lastModifiedLimit="60000" />
  </addProxies>
</datasetScan>
The latestComplete element includes a name attribute, which provides the name of the proxy dataset; the serviceName attribute, which references the service used by the proxy dataset; the top attribute, which indicates whether the proxy dataset should appear at the top or bottom of the list of datasets in this collection; and the lastModifiedLimit attribute, which sets a time period in milliseconds during which the dataset must not have been modified.
The simpleLatest element allows for the same attributes as the latestComplete element minus the lastModifiedLimit attribute. In this case, all the attributes have default values: the name attribute defaults to "latest.xml", the top attribute defaults to "true", and the serviceName attribute defaults to "latest".
While you wouldn't put two latest datasets in the same scan, the example shows both:
<service name="latest" serviceType="Resolver" base="" />
<dataset name="GRIB2 Data" ID="testdata">
  <dataset name="latestComplete.xml" serviceName="latest" urlPath="latestComplete.xml" />
  <dataset name="latest.xml" serviceName="latest" urlPath="latest.xml" />
  <dataset name="200610130730.nc" urlPath="200610130730.nc" />
  <dataset name="200406190916.nc" urlPath="200406190916.nc" />
</dataset>
More details are available in the Server-side Catalog specification document.
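The difference between the simple latest and the latestComplete proxies can be sketched as follows. This is a simplified Python illustration rather than the resolver's actual logic: pick_latest is a hypothetical function, the modification times are invented, and the limit is treated as milliseconds as stated above.

```python
def pick_latest(datasets, now_ms=None, last_modified_limit_ms=None):
    """Return the last name in name order, or None if nothing qualifies.

    datasets is an iterable of (name, last_modified_ms) pairs. When a limit
    is given, files modified more recently than the limit are skipped, on the
    assumption that they may still be being written.
    """
    candidates = [
        name
        for name, modified_ms in datasets
        if last_modified_limit_ms is None
        or now_ms - modified_ms >= last_modified_limit_ms
    ]
    return max(candidates) if candidates else None


now = 1_000_000  # arbitrary "current time" in milliseconds
datasets = [
    ("200406190916.nc", now - 500_000),  # untouched for a long time
    ("200610130730.nc", now - 10_000),   # modified 10 seconds ago
]

print(pick_latest(datasets))                                            # simple latest
print(pick_latest(datasets, now_ms=now, last_modified_limit_ms=60000))  # latest complete
```

Here the simple latest is 200610130730.nc, while the latest complete dataset is 200406190916.nc, because the newer file was modified within the last 60000 milliseconds.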
A datasetScan element may contain an addTimeCoverage element. The addTimeCoverage element indicates that a timeCoverage metadata element should be added to each dataset in the collection and describes how to determine the time coverage for each dataset in the collection.
Currently, the addTimeCoverage element can only construct start/duration timeCoverage elements, and it uses the dataset name to determine the start time. As described in the "Naming Datasets" section above, the addTimeCoverage element applies a regular expression match to the dataset name. If the match succeeds, any regular expression capturing groups are used in the start time replacement string to build the start time string. These attribute values are used to determine the time coverage: datasetNameMatchPattern (the regular expression matched against the dataset name), startTimeSubstitutionPattern (the replacement string, with capturing group references, that yields the start time), and duration (the duration of the time coverage).
Example 1: The addTimeCoverage element,
<datasetScan name="GRIB2 Data" path="grib2" location="c:/data/grib2/" serviceName="myserver">
<addTimeCoverage datasetNameMatchPattern="([0-9]{4})([0-9]{2})([0-9]{2})_([0-9]{2})([0-9]{2}).grib1$" startTimeSubstitutionPattern="$1-$2-$3T$4:00:00"
duration="60 hours" />
</datasetScan>
results in the following timeCoverage element:
<timeCoverage>
<start>2005-07-18T12:00:00</start>
<duration>60 hours</duration>
</timeCoverage>
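As a rough check of how the start time is derived, the following Python sketch applies the same match pattern and substitution pattern to a dataset name. The name "GFS_Alaska_191km_20050718_1200.grib1" is an assumption chosen to be consistent with the start time shown above, and time_coverage is a hypothetical function, not part of TDS.

```python
import re

NAME_MATCH = re.compile(
    r"([0-9]{4})([0-9]{2})([0-9]{2})_([0-9]{2})([0-9]{2}).grib1$"
)
START_SUBSTITUTION = "$1-$2-$3T$4:00:00"
DURATION = "60 hours"


def time_coverage(dataset_name):
    """Return the start and duration for a dataset name, or None if no match."""
    match = NAME_MATCH.search(dataset_name)
    if match is None:
        return None
    # Expand "$n" references with the corresponding captured group.
    start = re.sub(
        r"\$(\d+)",
        lambda ref: match.group(int(ref.group(1))),
        START_SUBSTITUTION,
    )
    return {"start": start, "duration": DURATION}


print(time_coverage("GFS_Alaska_191km_20050718_1200.grib1"))
# {'start': '2005-07-18T12:00:00', 'duration': '60 hours'}
```

Note that the substitution pattern uses only the first four groups; the fifth (minutes) is captured but replaced by a fixed "00".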
A variation is the addition of the datasetPathMatchPattern attribute. It can be used instead of the datasetNameMatchPattern
attribute and changes the target of the match from the dataset name to the dataset path. If both attributes are used, the
datasetNameMatchPattern attribute takes precedence.
Minimal catalog with a datasetScan element:
<?xml version="1.0" encoding="UTF-8"?>
<catalog name="Unidata Workshop 2006 - NCEP Model Data" version="1.0.1"
xmlns="http://www.unidata.ucar.edu/namespaces/thredds/InvCatalog/v1.0"
xmlns:xlink="http://www.w3.org/1999/xlink">
<service name="myserver" serviceType="OpenDAP" base="/thredds/dodsC/" />
<datasetScan name="NCEP Data" path="ncep" location="/data/ldm/pub/native/grid/NCEP/" >
<serviceName>myserver</serviceName>
</datasetScan>
</catalog>
Catalog with wildcard filter element:
<?xml version="1.0" encoding="UTF-8"?>
<catalog name="Unidata Workshop 2006 - NCEP Model Data" version="1.0.1"
xmlns="http://www.unidata.ucar.edu/namespaces/thredds/InvCatalog/v1.0"
xmlns:xlink="http://www.w3.org/1999/xlink">
<service name="myserver" serviceType="OpenDAP" base="/thredds/dodsC/" />
<datasetScan name="NCEP Data" path="ncep" location="/data/ldm/pub/native/grid/NCEP/" >
<serviceName>myserver</serviceName>
<filter>
<include wildcard="*.grib1"/>
<include wildcard="*.grib2"/>
<exclude wildcard="*.gbx"/>
</filter>
</datasetScan>
</catalog>
Catalog with filter and addTimeCoverage elements using regular expressions:
<?xml version="1.0" encoding="UTF-8"?>
<catalog name="Unidata Workshop 2006 - NCEP Model Data" version="1.0.1"
xmlns="http://www.unidata.ucar.edu/namespaces/thredds/InvCatalog/v1.0"
xmlns:xlink="http://www.w3.org/1999/xlink">
<service name="myserver" serviceType="OpenDAP" base="/thredds/dodsC/" />
<datasetScan name="NCEP Data" path="ncep" location="/data/ldm/pub/native/grid/NCEP/" >
<serviceName>myserver</serviceName>
<filter>
<include regExp="PROFILER_wind_06min_2013110[67]_[0-9]{4}\.nc"/>
</filter>
<addTimeCoverage
datasetNameMatchPattern="PROFILER_wind_06min_([0-9]{4})([0-9]{2})([0-9]{2})_([0-9]{2})([0-9]{2})\.nc$"
startTimeSubstitutionPattern="$1-$2-$3T$4:$5:00" duration="1 hour"/>
</datasetScan>
</catalog>