An ncstream is an ordered sequence of one or more messages:
ncstream = MAGIC_START, {message}*, MAGIC_END
message = headerMessage | dataMessage | errorMessage
headerMessage = MAGIC_HEADER, vlenb, NcStreamProto.Header
dataMessage = MAGIC_DATA, vlenb, NcStreamProto.Data, regData | vlenData | seqData | structData
errorMessage = MAGIC_ERR, vlenb, NcStreamProto.Error

regData = vlenb, (byte)*vlenb
vlenData = vlenn, {vlenb, (byte)*vlenb}*vlenn
seqData = {MAGIC_VDATA, vlenb, NcStreamProto.StructureData}*, MAGIC_VEND
structData = vlenb, NcStreamProto.StructureData

vlenb = variable length encoded positive integer == length of the following object in bytes
vlenn = variable length encoded positive integer == number of objects that follow

NcStreamProto.Header = Header message encoded by protobuf
NcStreamProto.Data = Data message encoded by protobuf
byte = actual bytes of data, encoding described by the NcStreamProto.Data message

primitives:
MAGIC_START  = 0x43, 0x44, 0x46, 0x53
MAGIC_HEADER = 0xad, 0xec, 0xce, 0xda
MAGIC_DATA   = 0xab, 0xec, 0xce, 0xba
MAGIC_VDATA  = 0xab, 0xef, 0xfe, 0xba
MAGIC_VEND   = 0xed, 0xef, 0xfe, 0xda
MAGIC_ERR    = 0xab, 0xad, 0xba, 0xda
MAGIC_END    = 0xed, 0xed, 0xde, 0xde
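A reader needs only the magic numbers and the vlen lengths to split the stream into messages. Here is a minimal Java sketch of that framing, assuming the vlen integers use protobuf's varint encoding (the class and method names are illustrative, not the CDM API):

import java.io.IOException;
import java.io.InputStream;

// Illustrative helpers for breaking an ncstream into messages.
public class NcStreamFraming {

  // Read a variable-length encoded positive integer (assumed protobuf varint).
  static int readVInt(InputStream in) throws IOException {
    int result = 0;
    int shift = 0;
    while (true) {
      int b = in.read();
      if (b < 0) throw new IOException("EOF while reading vlen");
      result |= (b & 0x7f) << shift;
      if ((b & 0x80) == 0) return result;
      shift += 7;
    }
  }

  // Read the next 4-byte magic number, to be compared against the constants above.
  static byte[] readMagic(InputStream in) throws IOException {
    byte[] magic = new byte[4];
    if (in.readNBytes(magic, 0, 4) != 4)
      throw new IOException("EOF while reading magic");
    return magic;
  }
}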
The protobuf messages are defined in .proto files in Unidata's GitHub repository. These are compiled by the protobuf compiler into the Java and C code that does the actual encoding/decoding of the stream.
Rules
There is just enough information in the stream to break the stream into messages and to know what kind of message it is. To interpret the data message correctly, one must have the definition of the variable.
message Data {
required string varName = 1; // full escaped name. change to hash or index to save space ??
required DataType dataType = 2;
optional Section section = 3; // not required for Sequence
optional bool bigend = 4 [default = true];
optional uint32 version = 5 [default = 0];
optional Compress compress = 6 [default = NONE];
optional fixed32 crc32 = 7;
}
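A sketch of reading the protobuf part of a dataMessage, after MAGIC_DATA has been matched, using the readVInt helper sketched above (NcStreamProto.Data is the class the protobuf compiler generates; the method name here is illustrative):

// Read vlenb, then that many bytes, then let the generated code parse them.
static NcStreamProto.Data readDataMessage(InputStream in) throws IOException {
  int len = NcStreamFraming.readVInt(in);       // vlenb: length of the protobuf message
  byte[] buf = new byte[len];
  if (in.readNBytes(buf, 0, len) != len)
    throw new IOException("EOF in Data message");
  // The regData / vlenData / seqData / structData bytes follow this message,
  // interpreted according to dataType, section, bigend, and compress.
  return NcStreamProto.Data.parseFrom(buf);     // standard protobuf generated API
}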
Primitive types (byte, char, short, int, long, float, double): arrays of primitives are stored in row-major order. The endianness is specified in the NcStreamProto.Data message when needed.
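Decoding a primitive array from the raw regData bytes is then a wrap-and-read; a sketch for doubles, with the byte order taken from the bigend field (names are illustrative):

import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Decode n doubles stored in row-major order from raw regData bytes.
static double[] decodeDoubles(byte[] raw, int n, boolean bigend) {
  ByteBuffer bb = ByteBuffer.wrap(raw);
  bb.order(bigend ? ByteOrder.BIG_ENDIAN : ByteOrder.LITTLE_ENDIAN);
  double[] result = new double[n];
  bb.asDoubleBuffer().get(result);              // row-major: no index arithmetic needed
  return result;
}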
Variable length types (String, Opaque): First the number of objects is written, then each object, preceded by its length in bytes as a vlen. Strings are encoded as UTF-8 bytes. Opaque is just a bag of bytes.
Variable length arrays: First the number of objects is written, then each object, preceded by its length in bytes as a vlen.
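A sketch of writing a String array under this rule, reusing the varint assumption from above (writeVInt is the mirror of readVInt; illustrative, not the CDM API):

import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

// Encode an array of Strings: vlenn (count), then each object's
// UTF-8 byte length (vlenb) followed by the bytes themselves.
static byte[] encodeStrings(String[] values) {
  ByteArrayOutputStream out = new ByteArrayOutputStream();
  writeVInt(out, values.length);                // vlenn: number of objects
  for (String s : values) {
    byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
    writeVInt(out, utf8.length);                // vlenb: length in bytes
    out.write(utf8, 0, utf8.length);            // the bytes themselves
  }
  return out.toByteArray();
}

// Varint writer, the mirror of readVInt above.
static void writeVInt(ByteArrayOutputStream out, int value) {
  while ((value & ~0x7f) != 0) {
    out.write((value & 0x7f) | 0x80);
    value >>>= 7;
  }
  out.write(value);
}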
Structure types (Structure, Sequence): An array of StructureData. Can be encoded row-wise or column-wise (?).
A vlen dimension, for example

int levels(ninst= 23, acqtime=100, *);

is encoded as vlenData: vlenn = 23 * 100 = 2300 objects, each written as its length in bytes (vlenb) followed by that row's values.
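A sketch of producing that vlenData for an int vlen variable, using the writeVInt helper above and the stream's declared byte order (illustrative):

// levels(23, 100, *): rows.length == 23 * 100 == 2300 variable-length rows.
static byte[] encodeVlenInts(int[][] rows, boolean bigend) {
  ByteArrayOutputStream out = new ByteArrayOutputStream();
  writeVInt(out, rows.length);                  // vlenn: number of objects
  for (int[] row : rows) {
    writeVInt(out, row.length * 4);             // vlenb: length in bytes
    ByteBuffer bb = ByteBuffer.allocate(row.length * 4);
    bb.order(bigend ? ByteOrder.BIG_ENDIAN : ByteOrder.LITTLE_ENDIAN);
    for (int v : row) bb.putInt(v);
    out.write(bb.array(), 0, row.length * 4);   // the row's bytes
  }
  return out.toByteArray();
}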
Should be able to pop this in and out of a ByteBuffer (Java) or void* (C), then use pointer manipulation to decode on the fly. Maybe a good candidate for encoding with protobuf.
In that case you have to read everything. If the buffer has no vlens or strings, fixed-size offsets could be used; otherwise the offsets must be recorded.
(each struct contains its own heap)
This suggests we should maybe rewrite ArrayStructureBB to have separate heaps for each struct.
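A sketch of what per-struct heaps might look like over a ByteBuffer: fixed-size members are reached by offset arithmetic, while each record carries its own heap for vlen/String members (names are illustrative, not the actual ArrayStructureBB API):

// Each record is laid out as [fixed-size members][this record's heap].
// Member offsets and recordSize are computed once from the Structure definition.
static int getIntMember(ByteBuffer bb, int recordSize, int recordIndex, int memberOffset) {
  return bb.getInt(recordSize * recordIndex + memberOffset);  // decode on the fly, no copy
}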
A nested variable-length field goes on the heap:
netcdf Q:/cdmUnitTest/formats/netcdf4/vlen/cdm_sea_soundings.nc4 {
  dimensions:
    Sounding = 3;

  variables:
    Structure {
      int sounding_no;
      float temp_vl(*);
    } fun_soundings(Sounding=3);
}
Should have a way to efficiently encode sparse data. Look at Bigtable/HBase.
Should we store ints using vlen?
Forces on the design:
We already have Fortran 90 syntax, with * indicating a variable-length dimension. Do we really want to support arbitrary vlen dimensions?
An obvious thing to do is to use a Java/C "array of arrays" rather than Fortran/netCDF rectangular arrays.
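For illustration, a ragged array in Java, where each row carries its own length:

// An "array of arrays": rows may differ in length, unlike a
// rectangular Fortran/netCDF array.
int[][] levels = new int[3][];
levels[0] = new int[] {1000, 850, 700};   // 3 values for this instance
levels[1] = new int[] {1000, 500};        // 2 values
levels[2] = new int[] {};                 // 0 values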
What does NumPy do?
Java/C arrays assume the data is in memory. Is this useful for very large, i.e. out-of-memory, data?
Nested Tables has taken the approach that it's better to use Structures rather than arrays, since there are usually multiple fields. Fortran programmers prefer arrays, but they are thinking of in-memory data.
What is the notation that allows a high-level specification (e.g. SQL) that can be efficiently executed by a machine?
Extending the array model to very large datasets may not be appropriate. Row vs. column store.
What about a transform language on the netCDF-4 / CDM data model, to allow efficient rewriting of data? Then it also becomes an extraction language?