
Overview

The XML datasheet format was developed for the SketchEl open source project, which is hosted on SourceForge. It is a superset of the SketchEl molecule format, which is used to store individual molecular structures within the datasheet. The format is essentially a representation of a structured table, where each column has a particular type, and each row contains one cell for every column. A special header section stores metadata that applies to the whole datasheet.

The XML datasheet format is often used interchangeably with the industry standard MDL SD file format, but there are some important caveats to be aware of. The MDL SD file format can only include one molecule per entry, while the XML datasheet format can have any number of columns which are typed as molecules. SD files also have no way of specifying field types, and do not enforce a table structure throughout the file. Software which expects the contents of an SD file to conform to a typed table format, which is usually a valid assumption, must scan the entire SD file to ascertain the fields and guess their data types. There is also no approved method for storing metadata in an SD file, such as the document title or information about the meaning of individual fields.

The preferred file extension is .ds, and the MIME type is chemical/x-datasheet.

Example

A very simple datasheet is shown below, rendered using the SketchEl datasheet editor, along with the XML source for the file:
Specification

The basic prototype for an XML datasheet is as follows:
The three header elements, <Summary>, <Extension> and <Header>, must be listed in the XML document before the <Content> element. This makes it possible and convenient to use the datasheet format for streaming purposes, e.g. using a streaming XML parser such as SAX, because the layout of the datasheet is defined before any of the data arrives.

Summary

The <Summary> section contains two elements: <Title> and <Description>. The datasheet title should be contained on a single line, while the description can be multiline text, so it is enclosed within a CDataSection.

Extension

The <Extension> section is optional. It is a way for programs to store metadata within a datasheet. Whenever a datasheet is manipulated, the software should generally leave alone any extensions that it does not recognise. Each extension is encoded within a child element named <Ext>. The name and type attributes are arbitrary, and should be distinguishable by software which understands their nature. The content is a CDataSection which contains arbitrary data. The extension fields are discussed in more detail in: Format: DataSheet Aspects.

Header

The <Header> section defines the size of the table that makes up the content of the datasheet, as well as the properties of each of the columns. The nrows and ncols attributes must exactly specify the dimensions. For every column, there must be a single <Column> child element. The id attribute identifies which column it refers to, and must be a number between 1 and the number of columns. The type attribute specifies what kind of data the column holds, while the description attribute is an arbitrary single-line summary describing the column. The available column types are:
Note that for most of these types, the value of an individual cell must either conform to the format of the data type, or be blank, which is considered to be a special null state. The exceptions are the string and extend types, for which an empty string is valid, and so there is no null state.

Content

The <Content> section is split into individual <Row> elements, each of which is split into individual <Cell> elements. The number of rows is defined in the header, and there must be a row element for each and every one of them. Furthermore, the rows must be in consecutive order. The id attribute specifies the row number, for the sake of consistency checking. If the row identifiers do not start at 1 and increase until reaching the limit, the datasheet is considered to be malformed. The requirement for consecutively numbered rows makes it possible to use the datasheet format for streaming purposes.

Each row must contain exactly one cell for every column that is defined in the header. The id attribute specifies the column number, and it must be between 1 and the number of columns defined in the header. Unlike for rows, the cells are not required to be arranged consecutively, but all must be present and unique. Cells whose data is null must still be defined by an element.
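The structural rules described for the header and content sections can be checked mechanically. The following Python sketch is one possible way to do so, and is not taken from SketchEl itself. The sample document is hypothetical: the root element name (DataSheet) and the sample values are assumptions made for illustration, and only the element and attribute names mentioned in the text above are relied upon. The function validate_datasheet is likewise a name invented here. For brevity it loads the whole document rather than streaming it.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: the root element name and all values are assumptions
# made for illustration. Both columns use the "string" type named in the text.
SAMPLE = """<DataSheet>
  <Summary>
    <Title>Example</Title>
    <Description><![CDATA[Two rows, two columns.]]></Description>
  </Summary>
  <Header nrows="2" ncols="2">
    <Column id="1" type="string" description="Compound name"/>
    <Column id="2" type="string" description="Comment"/>
  </Header>
  <Content>
    <Row id="1"><Cell id="2">aromatic</Cell><Cell id="1">benzene</Cell></Row>
    <Row id="2"><Cell id="1">ethanol</Cell><Cell id="2"></Cell></Row>
  </Content>
</DataSheet>"""


def validate_datasheet(xml_text):
    """Return a list of structural problems; an empty list means none were found.

    Assumes every id, nrows and ncols attribute is present and numeric.
    """
    root = ET.fromstring(xml_text)
    problems = []

    # The header elements must be listed before <Content>.
    tags = [child.tag for child in root]
    if "Header" not in tags or "Content" not in tags:
        return ["missing <Header> or <Content>"]
    content_pos = tags.index("Content")
    for name in ("Summary", "Extension", "Header"):
        if name in tags and tags.index(name) > content_pos:
            problems.append(f"<{name}> appears after <Content>")

    header = root.find("Header")
    nrows = int(header.get("nrows"))
    ncols = int(header.get("ncols"))
    expected_cols = list(range(1, ncols + 1))

    # One <Column> per column, with ids covering 1..ncols.
    col_ids = sorted(int(col.get("id")) for col in header.findall("Column"))
    if col_ids != expected_cols:
        problems.append("column ids must be exactly 1..ncols, one <Column> each")

    # Rows must be numbered consecutively from 1 and match nrows.
    rows = root.find("Content").findall("Row")
    if len(rows) != nrows:
        problems.append(f"expected {nrows} rows, found {len(rows)}")
    for position, row in enumerate(rows, start=1):
        if int(row.get("id")) != position:
            problems.append(f"row at position {position} has id {row.get('id')}")
        # Cells may appear in any order, but each column needs exactly one.
        cell_ids = sorted(int(cell.get("id")) for cell in row.findall("Cell"))
        if cell_ids != expected_cols:
            problems.append(f"row {position} does not have one cell per column")
    return problems
```

Calling validate_datasheet(SAMPLE) returns an empty list, because the sample satisfies every rule, including the out-of-order cells in the first row and the empty string cell in the second. A streaming reader would apply the same checks incrementally, which is what the requirement to place the header first and to number rows consecutively is intended to allow.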