DDI
DDI (Data Documentation Initiative) is an international metadata specification for the social sciences.
There is a custom version of SuperCHANNEL available that is supplied with a DDI driver. Any DDI document being prepared for consumption in SuperCHANNEL must conform to the DDI Specification Version 3.1, and must be valid against the XSD Schema for DDI Version 3.1 (10-18-2009). The supplied DDI driver also validates the document against ISO/IEC Schematron.
This article does not attempt to cover all aspects of DDI, but it does outline some of the DDI terminology and the corresponding SuperCHANNEL terminology. For more information about DDI, see http://www.ddialliance.org/.
Example
The DDI version of SuperCHANNEL is supplied with a simple DDI version 3.1 example. The example files are located in C:\ProgramData\STR\SuperCHANNEL\examples\HealthSurvey\DDI.
Connecting to a DDI Source
To connect to a DDI source in SuperCHANNEL:
- Select File > Connect to Source.
- In the Driver field, select ddi (str.jdbc.ddi.Driver).
- In the Location field, enter a connection string in the following format:
  jdbc:ddi:file:/<path_to_DDI_XML_file>
  For example:
  jdbc:ddi:file:/C:\ProgramData\STR\SuperCHANNEL\examples\HealthSurvey\DDI\HealthSurvey.xml
- Click OK.
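If the location string needs to be generated by a script, the format described in the steps above can be assembled as in the following minimal Python sketch. The helper name `ddi_connection_string` is hypothetical and is not part of SuperCHANNEL; it only prefixes the path in the way the example in this article does.

```python
def ddi_connection_string(path_to_ddi_xml: str) -> str:
    """Build a location string of the form jdbc:ddi:file:/<path_to_DDI_XML_file>.

    Hypothetical helper for illustration only: the path is appended unchanged,
    mirroring the example connection string in this article.
    """
    if not path_to_ddi_xml:
        raise ValueError("A path to the DDI XML file is required")
    return "jdbc:ddi:file:/" + path_to_ddi_xml


location = ddi_connection_string(
    r"C:\ProgramData\STR\SuperCHANNEL\examples\HealthSurvey\DDI\HealthSurvey.xml"
)
print(location)
```

The resulting value is what would be entered in the Location field.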
DDI File Validation
Schematron offers capabilities above and beyond validation that is possible using just the W3C XML Schema. The Schematron file DDIv31ModelValidator.sch is included in the DDI example directory, and DDI document authors are strongly encouraged to use the file to validate their document before attempting to parse it via SuperCHANNEL. Modern XML editors like XMLSpy and Oxygen offer convenient tools to perform Schematron validation against XML documents.
Alternatively, you can use the batch file SchematronUtil.bat, which is also included in the example directory, to perform Schematron validation.
During file processing, the DDI driver will detect and report any referential errors so that the user can diagnose and fix the errors in the DDI creation process. SuperCHANNEL uses TLS encryption and basic authentication to connect to DDI web services, ensuring that only authorised users can gain access to DDI documents.
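As a rough illustration of one kind of referential error, the Python sketch below lists `VariableReference` IDs that do not match the `id` of any `Variable` element in the same document. It is an approximation under stated assumptions, not the driver's actual logic: elements are matched by local name only, scheme, agency and version are ignored, and the sample document and its namespace URIs are made up for the example.

```python
import xml.etree.ElementTree as ET


def local_name(tag: str) -> str:
    """Strip the '{namespace}' prefix that ElementTree adds to tag names."""
    return tag.rsplit("}", 1)[-1]


def unresolved_variable_references(xml_text: str) -> list[str]:
    """Return referenced variable IDs that no Variable element declares."""
    root = ET.fromstring(xml_text)
    declared = {
        el.get("id") for el in root.iter() if local_name(el.tag) == "Variable"
    }
    missing = []
    for ref in root.iter():
        if local_name(ref.tag) != "VariableReference":
            continue
        # Only direct ID children are checked; the ID nested in Scheme is skipped.
        for child in ref:
            if local_name(child.tag) == "ID" and child.text not in declared:
                missing.append(child.text)
    return missing


# Hypothetical sample document (placeholder namespace URIs).
SAMPLE = """
<root xmlns:l="urn:example:logical" xmlns:r="urn:example:reusable">
  <l:Variable id="INCOME_VAR"/>
  <l:VariableGroup id="GROUP_1">
    <l:VariableReference><r:ID>INCOME_VAR</r:ID></l:VariableReference>
    <l:VariableReference><r:ID>INCOME_VAR_RKey</r:ID></l:VariableReference>
  </l:VariableGroup>
</root>
"""

print(unresolved_variable_references(SAMPLE))  # ['INCOME_VAR_RKey']
```

A check like this can be run alongside Schematron validation to catch dangling references before the document is loaded.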
Modelling RKEY Relationships in DDI
If the source DDI has... | Then... |
---|---|
| SuperCHANNEL will transform the RKEY column name to the data variable name, with an added _RKey suffix (to meet SuperSERVER requirements). |
| SuperCHANNEL will transform the RKEY column name to the table name with an added _RKey suffix. |
A record in the "perturbation" variable group with an associated RKEY. | Both must be of the type "double". |
Example DDI Code Fragment:
<l:Variable id="PAR_ITEM_1753726_VAR" version="1.0.2" versionDate="2010-08-11T08:48:07">
<r:UserID type="11179-IRDI">abs.au.ddi:PAR_ITEM_1753726_VAR:1.0.2</r:UserID>
<r:Label>Replicate Weights - Person 60</r:Label>
<r:Description>Replicate Weights - Person 60</r:Description>
<l:ConceptReference>
<r:Scheme>
<r:ID>IW_DATASET_1278248_CS</r:ID>
<r:IdentifyingAgency>abs.au.ddi</r:IdentifyingAgency>
<r:Version>1.0.2</r:Version>
</r:Scheme>
<r:ID>PAR_ITEM_1753726_CON</r:ID>
<r:IdentifyingAgency>abs.au.ddi</r:IdentifyingAgency>
<r:Version>1.0.2</r:Version>
</l:ConceptReference>
<l:Representation measurementUnit="Number" additivity="Flow">
<l:Role>replicate weight</l:Role>
<l:NumericRepresentation type="Double"/>
</l:Representation>
</l:Variable>
<!-- N.B. although this example uses the _RKey suffix for this variable id, it is not required,
only the 'rkey' Representation Role below is needed to denote the RKey variable. -->
<l:Variable id="PAR_ITEM_1810203_VAR_RKey" version="1.0.2" versionDate="2011-07-19T09:34:27">
<r:UserID type="11179-IRDI">abs.au.ddi:PAR_ITEM_1810203_VAR_RKey:1.0.2</r:UserID>
<r:Label>Person weight_RKey</r:Label>
<r:Description>Person weight_RKey</r:Description>
<l:ConceptReference>
<r:Scheme>
<r:ID>IW_DATASET_1278248_CS</r:ID>
<r:IdentifyingAgency>abs.au.ddi</r:IdentifyingAgency>
<r:Version>1.0.2</r:Version>
</r:Scheme>
<r:ID>PAR_ITEM_1810203_CON</r:ID>
<r:IdentifyingAgency>abs.au.ddi</r:IdentifyingAgency>
<r:Version>1.0.2</r:Version>
</l:ConceptReference>
<l:Representation measurementUnit="Number" additivity="Flow">
<l:Role>rkey</l:Role>
<l:NumericRepresentation type="Double"/>
</l:Representation>
</l:Variable>
<!--
This group provides the additional information needed to interpret the RKey variable: it tells us that the variables contained form a VariableGroup of type 'Perturbation'.
In practice we will check that there are at least one and no more than two variables in this group.
There must be at least one variable, which has the Representation Role 'rkey'.
There may be no other variable, as with unweighted census data, in which case the RKey is used to perturb any cross tab counts from the table.
If there is another variable then we can assume that the other variable is the weighted data which needs to be perturbed.
-->
<l:VariableGroup id="PAR_ITEM_1753726_VAR_GROUP" version="1.0.2" versionDate="2011-08-18T09:10:03">
<r:UserID type="11179-IRDI">abs.au.ddi:PAR_ITEM_1753726_VAR_GROUP:1.0.2</r:UserID>
<l:GroupTypeCoded codeListID="Group Type" codeListAgencyName="DDI" otherValue="Perturbation">UseOther</l:GroupTypeCoded>
<r:Label>Specifies relationship between a data variable and its corresponding RKEY variable.</r:Label>
<r:Description>Specifies relationship between a data variable and its corresponding RKEY variable - used to implement Perturbation of the values in the data variable.</r:Description>
<l:VariableReference>
<r:Scheme>
<r:ID>IW_DATASET_1278248_CS</r:ID>
<r:IdentifyingAgency>abs.au.ddi</r:IdentifyingAgency>
<r:Version>1.0.2</r:Version>
</r:Scheme>
<r:ID>PAR_ITEM_1753726_VAR</r:ID>
<r:IdentifyingAgency>abs.au.ddi</r:IdentifyingAgency>
<r:Version>1.0.2</r:Version>
</l:VariableReference>
<l:VariableReference>
<r:Scheme>
<r:ID>IW_DATASET_1278248_CS</r:ID>
<r:IdentifyingAgency>abs.au.ddi</r:IdentifyingAgency>
<r:Version>1.0.2</r:Version>
</r:Scheme>
<r:ID>PAR_ITEM_1753726_VAR_RKey</r:ID>
<r:IdentifyingAgency>abs.au.ddi</r:IdentifyingAgency>
<r:Version>1.0.2</r:Version>
</l:VariableReference>
</l:VariableGroup>
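The rules stated in the comment of the fragment above (one or two variables in the group, at least one of which has the role 'rkey') can be expressed as a small check. The following Python sketch is illustrative only: the function name, the variable IDs and the dictionary of roles are hypothetical, and the sketch does not reproduce how the DDI driver performs this check.

```python
def check_perturbation_group(member_ids, role_by_variable_id):
    """Return a list of problems found in a 'Perturbation' variable group.

    member_ids: variable IDs referenced by the group.
    role_by_variable_id: maps each declared variable ID to its Role text.
    """
    problems = []
    if not 1 <= len(member_ids) <= 2:
        problems.append(f"expected 1 or 2 variables, found {len(member_ids)}")
    for variable_id in member_ids:
        if variable_id not in role_by_variable_id:
            problems.append(f"{variable_id} is not a declared variable")
    if not any(role_by_variable_id.get(v) == "rkey" for v in member_ids):
        problems.append("no variable with the Representation Role 'rkey'")
    return problems


# Hypothetical variable IDs and roles.
roles = {"WEIGHT_VAR": "replicate weight", "WEIGHT_VAR_RKey": "rkey"}

print(check_perturbation_group(["WEIGHT_VAR", "WEIGHT_VAR_RKey"], roles))  # []
print(check_perturbation_group(["WEIGHT_VAR"], roles))
```

The second call reports a problem because the group contains a weight variable but no variable with the 'rkey' role.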
Important Notes
- SuperCHANNEL Multi Response Fields are declared in DDI 3.1 using VariableGroup elements with the type GroupTypeCoded set to MultipleResponse. See the supplied HealthSurvey.xml file for an example. The SuperCHANNEL DDI driver supports empty codes for multiple response variables.
  Previously, if there was no valid response for multiple response questions, the DDI author would have to make the DDI variable nullable by setting the blankIsMissingValue attribute of the CodeRepresentation for the multi-response variable to true and then leaving the no-response columns empty in the associated CSV file.
  Now, the DDI author can choose to explicitly identify certain codes as non-responses by adding a missingValue attribute to the CodeRepresentation element. The missingValue attribute is a space separated XML Schema NMTOKENS list. These missingValue codes will be treated the same as actual empty values. Once the DDI Multi-response variable has one empty code specified in a legal missingValue attribute, an empty registry table will be generated and populated with the empty code values. All empty codes in the registry table will be filtered from the SXV4 and therefore will not affect the count of multiple responses.
- When DDI documents that contain mixed continuous categorical variables are channelled in SuperCHANNEL, the DDI driver separates the data so that:
  - The measure column will not contain (and sum) any classified values (95, 96, 97, 98, 99) and they will instead appear as 0s.
  - The classified column will not contain (and count) any non-classified values. They will all be assigned a safe classified value, as specified in the variable's missingValue attribute, meaning a valid response was given.
During the build process in SuperCHANNEL, any invalid (null value) records are filtered out and are not included in the SXV4.
For example, a data file that consists of the following values:
22000.00 96 97 62000.00 97000.00 98
Would produce two fields: a measure field and a classification field.
Summing the measure field would give 22000.00 + 62000.00 + 97000.00 = 181000.00 and the frequency would be 3 (which is the number of valid responses in the CSV file).
Cross-tabulating the classification field would produce the following:
Code/Label | Count |
---|---|
00 - A valid response was recorded | 3 |
95 - Greater than or equal to this number | 0 |
96 - Not Applicable | 1 |
97 - Refusal | 1 |
98 - Not Known | 1 |
99 - No Source / Negative Income | 0 |
The classification's frequency would be 6 (which is the actual number of original records in the CSV file).
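The arithmetic of this example can be reproduced with a short Python sketch. It is a simplified model of the separation described above, not the driver's implementation: the function name is hypothetical, values are compared as raw strings, and the code 00 is assumed to be the safe classified value for a valid response, as in the table.

```python
from collections import Counter

CLASSIFIED_CODES = {"95", "96", "97", "98", "99"}
VALID_RESPONSE_CODE = "00"  # assumed "safe" code meaning a valid response


def split_mixed_variable(raw_values):
    """Split mixed continuous/categorical values into two parallel columns."""
    measure, classification = [], []
    for raw in raw_values:
        if raw in CLASSIFIED_CODES:
            measure.append(0.0)  # classified values appear as 0 in the measure
            classification.append(raw)
        else:
            measure.append(float(raw))
            classification.append(VALID_RESPONSE_CODE)
    return measure, classification


values = ["22000.00", "96", "97", "62000.00", "97000.00", "98"]
measure, classification = split_mixed_variable(values)
counts = Counter(classification)

print(sum(measure))                 # 181000.0
print(counts[VALID_RESPONSE_CODE])  # 3 valid responses
print(len(classification))          # 6 original records
```

Codes that never occur in the data, such as 95 and 99 here, have a count of 0.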
- SuperCHANNEL measures are defined in DDI 3.1 using the attribute additivity="Stock" or additivity="Flow" against the variable's Representation element. See the supplied HealthSurvey.xml file for an example.
- DDI 3.1 was not designed to cope with complex record relationships. To address this shortcoming, a workaround suggested by the DDI Consortium is used, which involves two Note elements for each RecordRelationship element. Details on the workaround can be found at:
- For a variable with a numeric representation in DDI to be a measure, the representation additivity attribute needs to be marked as either stock or flow. All the main weights, replicate weights and RKEY values need to be marked as measures.
Terminology
The following are selected DDI elements and their corresponding equivalents in SuperCHANNEL terminology. For further details refer to the Supported Elements section below. See the supplied HealthSurvey.xml file for examples.
Element | Description | SuperCHANNEL Equivalent |
---|---|---|
CaseIdentification | Information on which variable is considered the primary key. | |
CategoryScheme | Defines particular categories used as question responses. | Display name for a classification. For example, the gender category could map code M to Male and F to Female. |
CodeScheme | Defines code values used to represent categories for a variable or question. | Classification codes. For example, the gender category could map code M to Male and F to Female. |
LogicalProduct | A high level DDI element that describes and defines the logical data products of the study unit. The LogicalProduct describes its logical content by means of Variables, usually grouped into VariableSchemes and CodeSchemes, which are used to define the values applicable for an individual variable. CategorySchemes are used to define particular categories used as question responses. Their relationships and code values are described in the code scheme. The DataRelationship then defines how exactly those schemes relate to each other to form one or more LogicalRecords (such as a household, family, or person records) of the study unit. | Database. |
LogicalRecord | Defines a grouping of variables into a logical record, such as a household, family, or person record. | Fact tables. |
PhysicalDataProduct | High level DDI element that describes and defines the physical structure of data products of the study unit. | Data source. |
PhysicalInstance | High level DDI element that describes and defines the physical data set. It points to the physical location of the individual record data. | Data source. |
RecordRelationship | A description of how the logical contents of the DDI document relate to each other. All relationships are pair-wise. Multiple pair-wise relationships may be needed to clarify all record relationships within the logical contents of the DDI document. | In JDBC language, this would be a rough equivalent of "imported keys" and "exported keys", so that Database Management System (DBMS) can establish primary key/foreign key relationships through their specific database and database driver implementations. |
StudyUnit | High level DDI element that describes and defines a single unit of (social sciences) study. | Database. |
Variable | Defines an individual data item, such as a person's gender or age. | Column. |
VariableScheme | A collection of variables and variable groups. | |
NMTOKEN | Constrains the contents of attribute values by their form (rather than by their actual value, as would be done with an enumeration). The form of an attribute value declared as type NMTOKEN is such that it may contain only valid name characters, digits, and limited punctuation (such as full stops, hyphens, colons, and various combining and extending characters required for internationalisation). NMTOKEN-type attributes may start with a digit, hyphen or full stop. Internal white space, commas and "/" are prohibited in NMTOKEN type attribute values. Leading and trailing white space is trimmed/ignored before determining whether the value is valid for an NMTOKEN-type attribute. | |
RKey | A record key used as part of the perturbation confidentialisation process. Cubes created by the server for databases that have the perturbation confidentiality process enabled use the RKeys to generate a cell key for each cell in the cube. |
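The NMTOKEN rules in the table above apply to the space separated missingValue attribute. The Python sketch below splits such a value into tokens and rejects tokens containing characters that are not allowed. The regular expression is a deliberate approximation (letters, digits, underscore, full stop, hyphen and colon) and does not cover the combining and extending characters permitted by the XML specification; the function name is hypothetical.

```python
import re

# Approximate NMTOKEN pattern: word characters plus '.', '-' and ':'.
NMTOKEN = re.compile(r"[\w.\-:]+")


def parse_missing_values(attribute_value: str) -> list[str]:
    """Split an NMTOKENS attribute value and validate each token."""
    # str.split() with no argument also trims leading and trailing white space.
    tokens = attribute_value.split()
    for token in tokens:
        if not NMTOKEN.fullmatch(token):
            raise ValueError(f"'{token}' is not a valid NMTOKEN")
    return tokens


print(parse_missing_values(" 98 99 "))  # ['98', '99']

try:
    parse_missing_values("98 9/9")
except ValueError as error:
    print(error)  # '9/9' is not a valid NMTOKEN
```

Tokens may begin with a digit, hyphen or full stop, so values such as `-1` are accepted by this check.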
Supported Elements
Reusable DDI Element Support
DDI | Driver Support |
---|---|
| Must be a plain text entry. Multiple label entries for a single element are not supported. All attributes are ignored. |
| Only ID element is supported. All attributes and other elements are ignored. |
Supported DDI for DDIInstance/s:StudyUnit
DDI | Driver Support |
---|---|
DDIInstance/s:StudyUnit/r:Citation/Title | Required. Included in |
DDIInstance/s:StudyUnit/@version | Required. Included in both |
DDIInstance/s:StudyUnit/@agency | Required. Included in |
DDIInstance/s:StudyUnit/@id | Required. Maps to first part of |
Supported DDI for LogicalProduct
DDI | DDI Driver Support |
---|---|
. | Required. |
l:DataRelationship/l:LogicalRecord/@id | Maps to |
l:DataRelationship/l:LogicalRecord/VariablesInRecord/@allVariablesInLogicalProduct | The DDI Driver will include all variables in the logical product as columns table. |
l:DataRelationship/l:LogicalRecord/VariablesInRecord/VariableSchemeReference/ID | Alternative to |
l:VariableScheme/l:Variable/@id | Maps to |
| Optional. Maps to |
/l:VariableScheme/l:Variable/l:Representation/l:NumericRepresentation/@additivity | Maps to |
l:VariableScheme/l:Variable/l:Representation/l:NumericRepresentation/@type | Column.data Type. |
l:VariableScheme/l:Variable/l:Representation/l:CodeRepresentation/r:CodeSchemeReference/r:ID |
|
l:CodeScheme/@id | Maps to |
l:CodeScheme/r:Label | Optional. Maps to |
l:VariableScheme/l:Variable/l:Representation/l:CodeRepresentation/r:CodeSchemeReference/r:ID | Refers to a code scheme, which is used to populate a value set. |
l:CodeScheme/l:Code/l:Value | Maps to code entries in a value set table. |
l:CodeScheme/l:Code/l:CategoryReference/r:ID | Refers to a category item, which contains a matching label for a value set table. |
l:CategoryScheme/l:Category/r:Label | Maps to a label entry in a value set table. |
| Maps to an entry in the parent column in a value set table. |
l:DataRelationship/l:RecordRelationship/l:RecordReferenceSource/@relation | If
|
l:DataRelationship/l:RecordRelationship/l:RecordReferenceSource/l:VariableReference/r:ID | Refers to a primary or foreign key variable, depending on the value of the containing reference source. |
l:DataRelationship/l:RecordRelationship/l:RecordReferenceTarget/r:ID | Refers to a primary or foreign key variable, depending on the value of |
l:DataRelationship/l:LogicalRecord/l:CaseIdentification/l:VariableSpecificationReference/l:VariableReference/r:ID | Refers to a primary key with |
Supported DDI for PhysicalDataProduct
DDI | DDI Driver Support |
---|---|
. | Required. The DDI Driver processes items listed in this table, subject to the constraints described here. Multiple physical data products within are not supported. |
@id | Maps to By using this as the internal ID, SuperSTAR will be able to link back to parts of the associated product to retrieve descriptive metadata. |
p:PhysicalStructureScheme/p:PhysicalStructure/p:LogicalProductReference/r:ID | Required. The DDI Driver checks that this refers to a valid logical product defined in the DDI document. |
p:PhysicalStructureScheme/p:PhysicalStructure/p:DefaultDataType | Ignored. The data type is taken from the logical level. |
p:PhysicalStructureScheme/p:PhysicalStructure/p:DefaultDelimiter | If defined, the DDI Driver uses the delimiter as the separator character reading from the data file. |
p:PhysicalStructureScheme/p:PhysicalStructure/p:GrossRecordStructure/p:LogicalRecordReference/r:ID | Required. The DDI Driver checks that it matches a logical record defined in the DDI document. |
p:PhysicalStructureScheme/p:PhysicalStructure/p:GrossRecordStructure/p:PhysicalRecordSegment/@id | Required, but only one physical record segment can be defined in the gross record structure. Other than the mandatory ID attribute, values and child elements are ignored. |
p:RecordLayoutScheme/p:RecordLayout/p:PhysicalStructureReference/r:ID | Required. The DDI Driver checks that it matches a physical structure within the same logical data product. |
p:RecordLayoutScheme/p:RecordLayout/p:PhysicalStructureReference/p:PhysicalRecordSegmentUsed | Required. The DDI Driver checks that it matches a physical record defined within the physical structure that is referenced by the segment that the element belongs to. |
p:RecordLayoutScheme/p:RecordLayout/p:CharacterSet | If defined, must be |
p:RecordLayoutScheme/p:RecordLayout/p:ArrayBase | Required. Must be either 0 or 1. |
p:RecordLayoutScheme/p:RecordLayout/p:DefaultVariableSchemeReference/r:ID | If defined, the DDI Driver checks that it matches a valid variable scheme defined in the logical product referred to by the containing physical product. |
p:RecordLayoutScheme/p:RecordLayout/p:DataItem/p:VariableReference/r:Scheme/r:ID | If defined, the default variable scheme defined for the containing layout is ignored. The DDI Driver checks that it references a valid variable scheme defined in the logical product referenced by the containing data product. If neither this item nor the default variable is defined then the DDI Driver reports an error. |
p:RecordLayoutScheme/p:RecordLayout/p:DataItem/p:VariableReference/r:ID | Required. The DDI Driver checks that it refers to a variable within the variable scheme defined, or if a scheme is referenced by the data then the DDI Driver checks in that scheme instead. In either case, the scheme referred to must be defined in the logical product by the containing physical data product. |
p:RecordLayoutScheme/p:RecordLayout/p:DataItem/p:PhysicalLocation/p:StorageFormat | Ignored. |
p:RecordLayoutScheme/p:RecordLayout/p:DataItem/p:PhysicalLocation/p:ArrayPosition | Required. The DDI Driver checks that the array position is validly defined. No upper bound check is made. |
p:RecordLayoutScheme/p:RecordLayout/p:DataItem/p:PhysicalLocation/p:Width | Required. This value must be defined according to the standard if the sibling position is not defined. This is used to check the maximum allowed length of strings defined for the DDI driver. |
Supported DDI for PhysicalInstance
DDI | DDI Driver Support |
---|---|
. | Required. A separate physical instance is required for every CSV file. |
pi:DataFileIdentification/pi:URI | Required. The DDI Driver uses the URI to access the data. It reports error messages on the data (and indicates if the URI type is unsupported, Supported or HTTP). The DDI Driver only supports a single data file identification physical instance and a single URI within it (although it supports multiple identifications for physical as multiple URIs within). If multiple items are defined then the DDI Driver reports an error, with a message indicating that it is not supported. |