A CDM Object name refers to the name of a Group, Dimension, Variable, Attribute, or EnumTypedef. A CDM object name abstractly is a variable length sequence of Unicode characters. Unicode has various encodings which are used in various contexts, for example:
This document summarizes the various encodings. To transform between them, translate from form1 to Unicode with form1 unescaping as needed, then from Unicode to form2 with form2 escaping as needed.
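The two-step conversion can be sketched generically. In this Python illustration, the function name transcode and the choice of percent-encoding as form1 and XML escaping as form2 are assumptions made for the example; they are not part of any netCDF library.

```python
from urllib.parse import unquote
from xml.sax.saxutils import escape as xml_escape


def transcode(text, unescape_form1, escape_form2):
    """Convert an escaped name from form1 to form2 via plain Unicode."""
    unicode_name = unescape_form1(text)   # form1 -> Unicode
    return escape_form2(unicode_name)     # Unicode -> form2


# Example: a percent-encoded name rewritten for use in an XML document.
ncml_name = transcode("a%3Cb", unquote, xml_escape)
```

Keeping Unicode as the pivot means each encoding needs only one escape and one unescape routine, rather than a converter for every pair of forms.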
NetCDF C library Object names refer to the name of a Group, Dimension, Variable, Attribute, user-defined Type, compound type Member, or enumeration type Symbol.
A netCDF identifier is stored in a netCDF file as UTF-8 Unicode characters, NFC normalized. There are some restrictions on the valid characters used in a netCDF identifier:
ID = ([a-zA-Z0-9_]|{MUTF8})([^\x00-\x1F\x2F/\x7F]|{MUTF8})*
where: MUTF8 = multibyte UTF8 encoded char
which says:
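As a rough illustration of the ID pattern, the following Python sketch checks a name at the level of Unicode code points rather than UTF-8 bytes. The function name is_valid_netcdf_id is hypothetical, and treating every non-ASCII code point as an MUTF8 character is an assumption of this sketch.

```python
import re
import unicodedata

# First character: ASCII letter, digit, underscore, or any non-ASCII character.
# Remaining characters: anything except control characters, "/" and DEL.
_ID_PATTERN = re.compile(r"[A-Za-z0-9_\u0080-\U0010FFFF][^\x00-\x1F/\x7F]*")


def is_valid_netcdf_id(name):
    """Return True if the NFC-normalized name matches the ID pattern."""
    normalized = unicodedata.normalize("NFC", name)
    return _ID_PATTERN.fullmatch(normalized) is not None
```

For example, is_valid_netcdf_id("wind speed") is accepted because a space is not a control character, while is_valid_netcdf_id("a/b") is rejected.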
Also See:
A CDL (netCDF Definition Language) document is encoded in UTF-8. Certain characters must be escaped. The escape mechanism is to prepend a backslash "\" before the character.
Which characters in an identifier must be escaped in CDL? Any character matching the following pattern (using the regular expression syntax accepted by lex or flex):
[^\x00-\x1F\x7F/_.@+-a-zA-Z0-9]
Alternatively, we can enumerate the escaped characters:
idescaped =[ !"#$%&'()*,:;<=>?\[\\\]^`{|}~]
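A minimal Python sketch of the backslash mechanism, using the enumerated idescaped set. The names CDL_ESCAPED and escape_cdl_name are hypothetical, and the sketch covers only the characters listed here, not any further rules a CDL parser may apply.

```python
# The characters enumerated in idescaped: space and ASCII punctuation
# other than / _ . @ + -
CDL_ESCAPED = set(" !\"#$%&'()*,:;<=>?[\\]^`{|}~")


def escape_cdl_name(name):
    """Prepend a backslash to every character in the escaped set."""
    return "".join("\\" + ch if ch in CDL_ESCAPED else ch for ch in name)
```

With this sketch, the name t(2m) is written as t\(2m\), while a.b passes through unchanged because "." is not in the set.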
A NcML (netCDF Markup Language) document uses standard XML encoding and escaping.
The chars '&', '<', '>' must be replaced by these entity references: "&amp;", "&lt;", "&gt;". In some places the single and double quote must be replaced by "&apos;" and "&quot;" respectively.
Typically an XML parser/library will handle this transparently.
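For cases where a name is written into NcML by hand-built strings rather than through an XML library, the replacement can be sketched with Python's standard xml.sax.saxutils.escape. The wrapper name escape_for_ncml and its in_attribute flag are hypothetical.

```python
from xml.sax.saxutils import escape

# Quote replacements, needed only where quotes delimit the value.
QUOTE_ENTITIES = {"'": "&apos;", '"': "&quot;"}


def escape_for_ncml(name, in_attribute=False):
    """Escape &, < and >; also escape quotes when used in an attribute."""
    extra = QUOTE_ENTITIES if in_attribute else {}
    return escape(name, extra)
```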
A CDM object name abstractly is a variable length sequence of Unicode characters. It can be anything except:
A CDM object has a short name (a String) and a full name, which also includes the parent groups and structures that it belongs to. Internally, only short names are used, along with the enclosing Group or Structure objects, so there is generally no problem in comparing or searching for short names. In certain places in the CDM / NetCDF-Java library API (e.g. NetcdfFile.findVariable()), a full name can be passed in as a single String, of the form
groupName/groupName/varName.memberName.memberName
which uses the "/" and "." as group and structure delimiters, respectively. In this case, those characters must be escaped in the object names. Since "/" is not a legal character in an identifier, that leaves just the "." to be escaped.
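To illustrate why the "." needs escaping, here is a Python sketch that splits a full name back into its parts. The function name split_full_name is hypothetical, and the sketch assumes a backslash is the escape character and that "." is the only escaped character in the name.

```python
import re


def split_full_name(full_name):
    """Split 'g/g/var.member' into (groups, variable, members)."""
    # "/" cannot occur inside an identifier, so a plain split is safe.
    *groups, tail = full_name.split("/")
    # Split on dots that are not preceded by a backslash.
    parts = re.split(r"(?<!\\)\.", tail)
    names = [part.replace("\\.", ".") for part in parts]
    return groups, names[0], names[1:]
```

For example, the full name g1/g2/v\.1.m yields the groups g1 and g2, the variable v.1, and the member m.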
The client forms requests of the form endpoint?query. The possible query parameters are:
req=( CDL | NcML | capabilities | header | data)
var=vars
where:
  vars := varspec | varspec[',' varspec]
  varspec := varname[subsetSpec]
  varname := valid variable name
  subsetSpec := '(' fortran-90 arraySpec ')'
  fortran-90 arraySpec := dim | dim ',' dims
  dim := ':' | slice | start ':' end | start ':' end ':' stride
  slice := INTEGER
  start := INTEGER
  stride := INTEGER
  end := INTEGER
So the characters in variable names that need to be escaped are ',' ':' '(' ')' in order not to interfere with this grammar. Actually, you could get away with escaping just the "(", since you can use it as a delimiter.
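The remark about using "(" as the delimiter can be sketched as follows: scan for the first unescaped "(" and treat everything after it as the subset specification. The function name split_varspec is hypothetical, and backslash escaping of "(" inside names is an assumption of this sketch.

```python
def split_varspec(varspec):
    """Split 'name(subset)' at the first unescaped '('."""
    i = 0
    while i < len(varspec):
        if varspec[i] == "\\":
            i += 2          # skip the backslash and the character it escapes
            continue
        if varspec[i] == "(":
            break
        i += 1
    name = varspec[:i]
    subset = varspec[i:] or None   # None when there is no subset spec
    return name, subset
```

Here split_varspec("temp(0:10:2,:)") separates the name temp from the subset (0:10:2,:), and the commas and colons inside the parentheses never reach the name parser.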
The client forms requests of the form endpoint?query. The possible query parameters are:
req=( capabilities | data | form | stations)
accept= (csv | xml | ncstream | netcdf )
time_start,time_end=time range
north,south,east,west=bounding box
var=vars
stn=stns
where:
  vars := varName | varName[,varName]
  stns := stnName | stnName[,stnName]
  varName := valid variable name
  stnName := valid station name
Here we just need to escape the comma "," in the variable names and in the station names.
It should suffice to URL-encode the variable names and station names, and to URL-decode all the query parameters.
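On the client side, URL-encoding each name before joining the list might look like the Python sketch below. The function name build_query is hypothetical, and only the req, var and stn parameters from the grammar are shown.

```python
from urllib.parse import quote


def build_query(req, var_names, station_names):
    """Percent-encode each name, then join the lists with literal commas."""
    var_value = ",".join(quote(v, safe="") for v in var_names)
    stn_value = ",".join(quote(s, safe="") for s in station_names)
    return f"req={req}&var={var_value}&stn={stn_value}"
```

A variable named temp,max is sent as temp%2Cmax, so the only literal commas in the query are list separators.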
Standard practice for escaping names is to use NetcdfFile.escapeName() and unescapeName(). These use backslash escaping. The backslash itself becomes a special character, so it needs to be in the escape set:
[\(\),:\.\\]
Utility routines using this include Variable.getNameEscaped(), and GridDatatype.getNameEscaped().
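The behaviour of backslash escaping with this escape set can be illustrated in Python. This is a sketch of the technique only, not the NetCDF-Java implementation; the names ESCAPE_SET, escape_name and unescape_name are chosen here to mirror the description.

```python
# The escape set [\(\),:\.\\]: parentheses, comma, colon, dot, backslash.
ESCAPE_SET = set("(),:.\\")


def escape_name(name):
    """Prefix every character in the escape set with a backslash."""
    return "".join("\\" + ch if ch in ESCAPE_SET else ch for ch in name)


def unescape_name(escaped):
    """Drop each escaping backslash and keep the character after it."""
    out = []
    chars = iter(escaped)
    for ch in chars:
        if ch == "\\":
            ch = next(chars, "\\")   # a trailing lone backslash is kept as-is
        out.append(ch)
    return "".join(out)
```

Because the backslash is in the escape set, unescape_name(escape_name(s)) returns s for any input string.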
OPeNDAP has an on-the-wire specification that must be followed in order to ensure interoperability. There are two parts to this:
OPeNDAP (we think) uses standard URL encoding, aka percent encoding.
An OPeNDAP dataset as represented in the CDM library looks like any other CDM dataset, i.e. it is not restricted to OPeNDAP encoding. When making a request over the OPeNDAP protocol, a translation between CDM and OPeNDAP identifiers must be made.
From the spec:
A DAP variable’s name MUST contain ONLY US-ASCII characters with the following additional limitation: The characters MUST be either upper or lower case letters, numbers or from the set _ ! ~ * ’ - " . Any other characters MUST be escaped. To escape a character in a name, the character is replaced by the sequence %<Character Code> where Character Code is the two hex digit code corresponding to the US-ASCII character.
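The escaping rule in the quoted passage can be sketched in Python as below. The exact membership of the allowed punctuation set is an assumption: the quoted text is ambiguous about whether "." belongs to it, so the set is a parameter. Encoding non-ASCII characters as one %XX per UTF-8 byte is also an assumption, since the passage speaks only of US-ASCII. The function name escape_dap_name is hypothetical.

```python
import string

# Assumed reading of the quoted set; adjust if "." is meant to be included.
DAP_ALLOWED = set(string.ascii_letters + string.digits + "_!~*'-\"")


def escape_dap_name(name, allowed=DAP_ALLOWED):
    """Replace every character outside the allowed set with %XX."""
    out = []
    for ch in name:
        if ch in allowed:
            out.append(ch)
        else:
            # One %XX per UTF-8 byte (an assumption for non-ASCII input).
            out.extend("%{:02X}".format(b) for b in ch.encode("utf-8"))
    return "".join(out)
```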
From the OPeNDAP lexers:
1. from dds.lex and ce_expr.lex
[-+a-zA-Z0-9_/%.\\*][-+a-zA-Z0-9_/%.\\#*]*
2. from das.lex
[-+a-zA-Z0-9_/%.\\*:()][-+a-zA-Z0-9_/%.\\#*:()]*
(same as dds plus ':','(', and ')' are added)
3. from gse.lex
[-+a-zA-Z0-9_/%.\\][-+a-zA-Z0-9_/%.\\#]*
(same as dds except that '*' is removed)
Their note:
"...Note that the DAS allows Identifiers to have parens and colons while the DDS and expr scanners don't. It's too hard to disambiguate functions when IDs
have parens in them and adding colons makes parsing the array projections hard..."
Standard practice, then, is to translate from CDM identifiers to OPeNDAP identifiers using
ucar.nc.util.net.EscapeStrings.escapeDAPIdentifier()
and to translate from OPeNDAP identifiers to CDM identifiers using
ucar.nc.util.net.EscapeStrings.unescapeDAPIdentifier()
In addition, HTTPMethod(String URI) automatically adds URL encoding. Together these may create a double-escaped URL. On the server, one first unescapes the request and then parses it. Any identifiers in the request are then unescaped again before being compared with the corresponding CDM object.
IS THAT RIGHT??
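The double-escaping sequence described above can be traced with a small Python sketch. It uses generic percent-encoding from urllib.parse for both layers, which is an assumption; it does not call or reproduce EscapeStrings or HTTPMethod, and the function names are hypothetical.

```python
from urllib.parse import quote, unquote


def client_side(identifier):
    """Escape the identifier, then let the HTTP layer encode it again."""
    escaped_once = quote(identifier, safe="")   # identifier-level escaping
    return quote(escaped_once, safe="")         # URL encoding added on send


def server_side(received):
    """Unescape the request, then unescape the identifier found in it."""
    request = unquote(received)                 # first pass, before parsing
    return unquote(request)                     # second pass, before lookup


wire = client_side("temp 2m")
```

In this trace the space becomes %20 after the first pass and %2520 after the second, and two decoding passes on the server recover the original name.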
A direct translation of their grammar would appear to be this:
PathName={AbsolutePathName}|{RelativePathName}
Separator=[/]+
AbsolutePathName={Separator}{RelativePathName}?
RelativePathName={Component}({Separator}|{RelativePathName})*
Component=[.]|{Name}
Name=[.]|({Charx}{Character}*)|{Character}+
/* Ascii set - '/' */
Character={Charx}|[.]
/* Ascii set - '.' and '/' */
Charx=[ !"#$%&'()*+,-0123456789:;<=>?@\[\\\]^`{|}~\x00-\x1e,\x7f]
The Web Map Service Implementation Specification version 1.3.0 states:
6.3.2 Reserved characters in HTTP GET URLs
The URL specification (IETF RFC 2396) reserves particular characters as significant and requires that these be escaped when they might conflict with their defined usage. This International Standard explicitly reserves several of those characters for use in the query portion of WMS requests. When the characters '&', '=', ',' and '+' appear in one of the roles defined in Table 1, they shall appear literally in the URL. When those characters appear elsewhere (for example, in the value of a parameter), they shall be encoded as defined in IETF RFC 2396.
Table 1 -- Reserved Characters in WMS Query String
Character   Reserved Usage
?           Separator indicating start of query string.
&           Separator between parameters in query string.
=           Separator between name and value of parameter.
,           Separator between individual values in list-oriented parameters (such as BBOX, LAYERS and STYLES in the GetMap request).
+           Shorthand representation for a space character.
6.8.2 Parameter lists
Parameters consisting of lists (for example, BBOX, LAYERS and STYLES in WMS GetMap) shall use the comma (",") as the separator between items in the list. Additional white space shall not be used to delimit list items. If a list item value includes a space or comma, it shall be escaped using the URL encoding rules (6.3.2 and IETF RFC 2396).
The URL specification [IETF RFC 2396] states that all characters other than:
shall be encoded as "%xx", where xx is the two hexadecimal digits representing the octet code of the character. Within the query string portion of a URL (i.e., everything after the "?"), the space character (" ") is an exception, and shall be encoded as a plus sign ("+"). A server shall be prepared to decode any character encoded in this manner.
It appears that neither Firefox nor Chrome does standard URL encoding.
HTTPClient 3 will not send out a URL with certain chars in it, such as "[" (possibly the full 2396 set).
The query string is always run through URLDecoder.decode() before further processing:
queryString = URLDecoder.decode(req.getQueryString(), "UTF-8");
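One consequence of decoding the whole query string before parsing is that an encoded delimiter can no longer be told apart from a literal one. The Python sketch below contrasts the two orders; the function names are hypothetical, and urllib's unquote_plus plays the role of URLDecoder.decode here.

```python
from urllib.parse import unquote_plus


def names_decode_first(raw_value):
    """Decode the whole value, then split on commas."""
    return unquote_plus(raw_value).split(",")


def names_split_first(raw_value):
    """Split on literal commas, then decode each item."""
    return [unquote_plus(item) for item in raw_value.split(",")]
```

For the raw value temp%2Cmax,rh the first function returns three names (temp, max, rh) and the second returns two (temp,max and rh). Which behaviour is wanted depends on whether names are escaped a second time at the identifier level.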
http://www.blooberry.com/indexdot/html/topics/urlencoding.htm
http://www.w3schools.com/TAGS/ref_urlencode.asp