DDI

DDI (Data Documentation Initiative) is an international metadata specification for the social sciences.

There is a custom version of SuperCHANNEL available that is supplied with a DDI driver. Any DDI document being prepared for consumption in SuperCHANNEL must conform to the DDI Specification Version 3.1, and must be valid against the XSD Schema for DDI Version 3.1 (10-18-2009). The supplied DDI driver also validates the document against ISO/IEC Schematron.

This article does not attempt to cover all aspects of DDI, but it does outline some of the DDI terminology and the corresponding SuperCHANNEL terminology. For more information about DDI, see http://www.ddialliance.org/.

Example

The DDI version of SuperCHANNEL is supplied with a simple DDI version 3.1 example. The example files are located in C:\ProgramData\STR\SuperCHANNEL\examples\HealthSurvey\DDI.

Connecting to a DDI Source

To connect to a DDI source in SuperCHANNEL:

Select File > Connect to Source.
In the Driver field, select ddi (str.jdbc.ddi.Driver).

In the Location field, enter a connection string in the following format:

CODE

jdbc:ddi:file:/<path_to_DDI_XML_file>

For example:

CODE

jdbc:ddi:file:/C:\ProgramData\STR\SuperCHANNEL\examples\HealthSurvey\DDI\HealthSurvey.xml

Click OK.
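
The Location value can also be assembled programmatically. The following is a minimal sketch only: the helper name `ddi_connection_string` is hypothetical and is not part of SuperCHANNEL; it simply prefixes the file path as described in the format above.

```python
def ddi_connection_string(ddi_xml_path: str) -> str:
    """Build a Location value in the form jdbc:ddi:file:/<path_to_DDI_XML_file>."""
    if not ddi_xml_path:
        raise ValueError("a path to a DDI XML file is required")
    return "jdbc:ddi:file:/" + ddi_xml_path


example = ddi_connection_string(
    r"C:\ProgramData\STR\SuperCHANNEL\examples\HealthSurvey\DDI\HealthSurvey.xml"
)
print(example)
```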

DDI File Validation

Schematron offers capabilities above and beyond validation that is possible using just the W3C XML Schema. The Schematron file DDIv31ModelValidator.sch is included in the DDI example directory, and DDI document authors are strongly encouraged to use the file to validate their document before attempting to parse it via SuperCHANNEL. Modern XML editors like XMLSpy and Oxygen offer convenient tools to perform Schematron validation against XML documents.

Alternatively, you can use the batch file SchematronUtil.bat, which is also included in the example directory, to perform Schematron validation.

During file processing, the DDI driver will detect and report any referential errors so that the user can diagnose and fix the errors in the DDI creation process. SuperCHANNEL uses TLS encryption and basic authentication to connect to DDI web services, ensuring that only authorised users can gain access to DDI documents.
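
To illustrate what a referential error looks like, the following sketch lists variable references that do not resolve to a defined variable. It is not the driver's implementation and does not replace Schematron validation. The sample document is hypothetical and heavily simplified, and its namespace URIs are placeholders; the check matches on local element names so that it does not depend on them.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily simplified document: element names follow the DDI
# fragments in this article, but the namespace URIs are placeholders.
SAMPLE = """
<root xmlns:l="urn:example:logicalproduct" xmlns:r="urn:example:reusable">
  <l:Variable id="AGE_VAR"/>
  <l:Variable id="SEX_VAR"/>
  <l:VariableGroup id="GROUP_1">
    <l:VariableReference><r:ID>AGE_VAR</r:ID></l:VariableReference>
    <l:VariableReference><r:ID>INCOME_VAR</r:ID></l:VariableReference>
  </l:VariableGroup>
</root>
"""


def local_name(tag: str) -> str:
    """Strip the '{namespace}' prefix that ElementTree adds to tag names."""
    return tag.rsplit("}", 1)[-1]


def unresolved_variable_references(xml_text: str) -> list:
    """Return referenced variable IDs that no Variable element defines."""
    root = ET.fromstring(xml_text)
    defined = {
        el.get("id") for el in root.iter() if local_name(el.tag) == "Variable"
    }
    missing = []
    for ref in root.iter():
        if local_name(ref.tag) != "VariableReference":
            continue
        # Only direct ID children name the variable; Scheme/ID names the scheme.
        for child in ref:
            if local_name(child.tag) == "ID" and child.text not in defined:
                missing.append(child.text)
    return missing


print(unresolved_variable_references(SAMPLE))  # ['INCOME_VAR']
```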

Modelling RKEY Relationships in DDI

If the source DDI has...	Then...
A variable group of type "perturbation"; A variable within that group marked with "RKEY"; and A data variable.	SuperCHANNEL will transform the RKEY column name to the data variable name, with an added `_RKey` suffix (to meet SuperSERVER requirements).
A variable group of type "perturbation"; A variable within that group marked with "RKEY"; but No data variable.	SuperCHANNEL will transform the RKEY column name to the table name with an added `_RKey` suffix.
A record in the "perturbation" variable group with an associated RKEY.	Both must be of the type "double".
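
The naming rule in the table above can be sketched as follows. The function name and the example table and variable names are hypothetical; only the `_RKey` suffix and the choice between data variable name and table name come from the table.

```python
from typing import Optional

RKEY_SUFFIX = "_RKey"


def rkey_column_name(table_name: str, data_variable_name: Optional[str] = None) -> str:
    """Apply the naming rule from the table above.

    With a data variable in the perturbation group, the RKEY column is named
    after that variable; without one, it is named after the table.
    """
    base = data_variable_name if data_variable_name else table_name
    return base + RKEY_SUFFIX


print(rkey_column_name("Person", "PersonWeight"))  # PersonWeight_RKey
print(rkey_column_name("Person"))                  # Person_RKey
```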

Example DDI Code Fragment:

XML

<l:Variable id="PAR_ITEM_1753726_VAR" version="1.0.2" versionDate="2010-08-11T08:48:07"> 
 <r:UserID type="11179-IRDI">abs.au.ddi:PAR_ITEM_1753726_VAR:1.0.2</r:UserID> 
 <r:Label>Replicate Weights - Person 60</r:Label> 
 <r:Description>Replicate Weights - Person 60</r:Description> 
 <l:ConceptReference> 
  <r:Scheme> 
   <r:ID>IW_DATASET_1278248_CS</r:ID> 
   <r:IdentifyingAgency>abs.au.ddi</r:IdentifyingAgency> 
   <r:Version>1.0.2</r:Version> 
  </r:Scheme> 
  <r:ID>PAR_ITEM_1753726_CON</r:ID> 
  <r:IdentifyingAgency>abs.au.ddi</r:IdentifyingAgency> 
  <r:Version>1.0.2</r:Version> 
 </l:ConceptReference> 
 <l:Representation measurementUnit="Number" additivity="Flow"> 
  <l:Role>replicate weight</l:Role> 
  <l:NumericRepresentation type="Double"/> 
 </l:Representation> 
</l:Variable> 
   
<!-- N.B. although this example uses the _RKey suffix for this variable id, it is not required, 
     only the 'rkey' Representation Role below is needed to denote the RKey variable. -->
<l:Variable id="PAR_ITEM_1810203_VAR_RKey" version="1.0.2" versionDate="2011-07-19T09:34:27"> 
 <r:UserID type="11179-IRDI">abs.au.ddi:PAR_ITEM_1810203_VAR_RKey:1.0.2</r:UserID> 
 <r:Label>Person weight_RKey</r:Label> 
 <r:Description>Person weight_RKey</r:Description> 
 <l:ConceptReference> 
  <r:Scheme> 
   <r:ID>IW_DATASET_1278248_CS</r:ID> 
   <r:IdentifyingAgency>abs.au.ddi</r:IdentifyingAgency> 
   <r:Version>1.0.2</r:Version> 
  </r:Scheme> 
  <r:ID>PAR_ITEM_1810203_CON</r:ID> 
  <r:IdentifyingAgency>abs.au.ddi</r:IdentifyingAgency> 
  <r:Version>1.0.2</r:Version> 
 </l:ConceptReference> 
 <l:Representation measurementUnit="Number" additivity="Flow"> 
  <l:Role>rkey</l:Role> 
  <l:NumericRepresentation type="Double"/> 
 </l:Representation>
</l:Variable> 
   
<!-- 
     This group provides the additional information needed to interpret the RKey variable: it tells us that the variables it contains form a VariableGroup of type 'Perturbation'. 
 
     In practice we check that the group contains at least one and no more than two variables. 
     At least one variable must have the Representation Role 'rkey'. 
 
     There may be no other variable, as in the case of unweighted census data, where the RKey is used to perturb any cross-tab counts from the table. 
 
     If there is another variable, we can assume that it is the weighted data that needs to be perturbed. 
--> 
   
<l:VariableGroup id="PAR_ITEM_1753726_VAR_GROUP" version="1.0.2" versionDate="2011-08-18T09:10:03"> 
 <r:UserID type="11179-IRDI">abs.au.ddi:PAR_ITEM_1753726_VAR_GROUP:1.0.2</r:UserID> 
 <l:GroupTypeCoded codeListID="Group Type" codeListAgencyName="DDI" otherValue="Perturbation">UseOther</l:GroupTypeCoded> 
 <r:Label>Specifies relationship between a data variable and its corresponding RKEY variable.</r:Label> 
 <r:Description>Specifies relationship between a data variable and its corresponding RKEY variable - used to implement Perturbation of the values in the data variable.</r:Description> 
 <l:VariableReference> 
  <r:Scheme> 
   <r:ID>IW_DATASET_1278248_CS</r:ID> 
   <r:IdentifyingAgency>abs.au.ddi</r:IdentifyingAgency> 
   <r:Version>1.0.2</r:Version> 
  </r:Scheme> 
  <r:ID>PAR_ITEM_1753726_VAR</r:ID>
  <r:IdentifyingAgency>abs.au.ddi</r:IdentifyingAgency>
  <r:Version>1.0.2</r:Version>
 </l:VariableReference>
 <l:VariableReference>
  <r:Scheme>
   <r:ID>IW_DATASET_1278248_CS</r:ID>
   <r:IdentifyingAgency>abs.au.ddi</r:IdentifyingAgency>
   <r:Version>1.0.2</r:Version>
  </r:Scheme> 
  <r:ID>PAR_ITEM_1753726_VAR_RKey</r:ID>
  <r:IdentifyingAgency>abs.au.ddi</r:IdentifyingAgency>
  <r:Version>1.0.2</r:Version>
 </l:VariableReference>
</l:VariableGroup>
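
The checks described in the comment inside the fragment can be sketched in a few lines. This is an illustration under stated assumptions, not the driver's code: the sample document is a hypothetical, cut-down version of the fragment, the namespace URIs are placeholders, and the rule that exactly one variable in the group carries the `rkey` role is an assumption drawn from the comment.

```python
import xml.etree.ElementTree as ET

# Placeholder namespace URIs; a real DDI 3.1 document declares its own.
L_NS = "urn:example:logicalproduct"
NS = {"l": L_NS, "r": "urn:example:reusable"}

SAMPLE = """
<l:VariableScheme xmlns:l="urn:example:logicalproduct" xmlns:r="urn:example:reusable">
  <l:Variable id="WEIGHT_VAR">
    <l:Representation additivity="Flow"><l:Role>replicate weight</l:Role></l:Representation>
  </l:Variable>
  <l:Variable id="WEIGHT_VAR_RKey">
    <l:Representation additivity="Flow"><l:Role>rkey</l:Role></l:Representation>
  </l:Variable>
  <l:VariableGroup id="WEIGHT_VAR_GROUP">
    <l:GroupTypeCoded otherValue="Perturbation">UseOther</l:GroupTypeCoded>
    <l:VariableReference><r:ID>WEIGHT_VAR</r:ID></l:VariableReference>
    <l:VariableReference><r:ID>WEIGHT_VAR_RKey</r:ID></l:VariableReference>
  </l:VariableGroup>
</l:VariableScheme>
"""


def perturbation_pairs(xml_text):
    """Yield (rkey_variable_id, data_variable_id_or_None) per perturbation group."""
    root = ET.fromstring(xml_text)
    roles = {
        var.get("id"): var.findtext("l:Representation/l:Role", default="", namespaces=NS)
        for var in root.iter("{%s}Variable" % L_NS)
    }
    for group in root.iter("{%s}VariableGroup" % L_NS):
        coded = group.find("l:GroupTypeCoded", NS)
        if coded is None or coded.get("otherValue") != "Perturbation":
            continue
        # Direct r:ID children only; r:Scheme/r:ID identifies the scheme instead.
        ids = [el.text for el in group.findall("l:VariableReference/r:ID", NS)]
        if not 1 <= len(ids) <= 2:
            raise ValueError(f"{group.get('id')}: expected 1 or 2 variables, got {len(ids)}")
        # Assumption: exactly one referenced variable has the 'rkey' role.
        rkeys = [i for i in ids if roles.get(i) == "rkey"]
        if len(rkeys) != 1:
            raise ValueError(f"{group.get('id')}: expected exactly one 'rkey' variable")
        others = [i for i in ids if i != rkeys[0]]
        yield rkeys[0], (others[0] if others else None)


print(list(perturbation_pairs(SAMPLE)))  # [('WEIGHT_VAR_RKey', 'WEIGHT_VAR')]
```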

Important Notes

SuperCHANNEL Multi Response Fields are declared in DDI 3.1 using VariableGroup elements whose GroupTypeCoded is set to MultipleResponse. See the supplied HealthSurvey.xml file for an example.
The SuperCHANNEL DDI driver supports empty codes for multiple response variables.
Previously, if there was no valid response for a multiple response question, the DDI author had to make the DDI variable nullable by setting the blankIsMissingValue attribute of the CodeRepresentation for the multi-response variable to true, and then leave the no-response columns empty in the associated CSV file.
Now, the DDI author can choose to explicitly identify certain codes as non-responses by adding a missingValue attribute to the CodeRepresentation element. The missingValue attribute is a space-separated XML Schema NMTOKENS list. These missingValue codes are treated the same as actual empty values. Once the DDI multi-response variable has at least one empty code specified in a legal missingValue attribute, an empty registry table is generated and populated with the empty code values. All empty codes in the registry table are filtered from the SXV4 and therefore do not affect the count of multiple responses.
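
As a rough illustration of how a space-separated missingValue list can be read and applied, consider the sketch below. The function names, the attribute value `98 99`, and the sample row are all hypothetical; the only behaviour taken from the text above is that declared missingValue codes are treated the same as empty values.

```python
def parse_missing_values(missing_value_attr):
    """Split a space-separated NMTOKENS attribute value into a set of codes."""
    # str.split() with no argument splits on any run of whitespace and drops
    # leading and trailing whitespace.
    return set((missing_value_attr or "").split())


def is_non_response(cell, missing_codes):
    """Treat an empty cell and any declared missingValue code the same way."""
    value = cell.strip()
    return value == "" or value in missing_codes


codes = parse_missing_values("98 99")   # hypothetical attribute value
row = ["1", "", "98", "3"]              # hypothetical multi-response columns
responses = [c for c in row if not is_non_response(c, codes)]
print(responses)  # ['1', '3']
```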
When DDI documents that contain mixed continuous categorical variables are channelled in SuperCHANNEL, the DDI driver separates the data so that:
- The measure column does not contain (or sum) any classified values (95, 96, 97, 98, 99); they appear as 0 instead.
- The classified column does not contain (or count) any non-classified values. Each of them is assigned a safe classified value, as specified in the variable's missingValue attribute, indicating that a valid response was given.
During the build process in SuperCHANNEL, any invalid (null value) records are filtered out and are not included in the SXV4.
For example, a data file that consists of the following values:
TEXT
```
22000.00
96
97
62000.00
97000.00
98
```
This would produce two fields: a measure field and a classification field.
Summing the measure field gives 22000.00 + 62000.00 + 97000.00 = 181000.00, and the frequency would be 3 (the number of valid responses in the CSV file).
Cross-tabulating the classification field would produce the following:
Code/Label	Count
00 - A valid response was recorded	3
95 - Greater than or equal to this number	0
96 - Not Applicable	1
97 - Refusal	1
98 - Not Known	1
99 - No Source / Negative Income	0
The classification's frequency would be 6 (the actual number of original records in the CSV file).
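
The worked example above can be reproduced with a short sketch. This is not the driver's implementation: the function name is hypothetical, and the code simply applies the separation rule described above to the six sample values, using 00 as the safe classified value.

```python
from collections import Counter

CLASSIFIED_CODES = {"95", "96", "97", "98", "99"}
VALID_RESPONSE_CODE = "00"  # the safe classified value used in the example above

raw_values = ["22000.00", "96", "97", "62000.00", "97000.00", "98"]


def split_mixed(value):
    """Return (measure, classification_code) for one raw value."""
    if value in CLASSIFIED_CODES:
        return 0.0, value  # classified values appear as 0 in the measure
    return float(value), VALID_RESPONSE_CODE


pairs = [split_mixed(v) for v in raw_values]
measure_total = sum(m for m, _ in pairs)
measure_frequency = sum(1 for _, code in pairs if code == VALID_RESPONSE_CODE)
classification_counts = Counter(code for _, code in pairs)

print(measure_total)                         # 181000.0
print(measure_frequency)                     # 3
print(sum(classification_counts.values()))   # 6
```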
SuperCHANNEL measures are defined in DDI 3.1 using the attribute additivity="Stock" or additivity="Flow" against the variable's Representation element. See the supplied HealthSurvey.xml file for an example.
DDI 3.1 was not designed to cope with complex record relationships. To address this shortcoming, a workaround suggested by the DDI Consortium is used, which involves two Note elements for each RecordRelationship element. Details on the workaround can be found at:
For a variable with a numeric representation in DDI to be a measure, the representation additivity attribute needs to be marked as either stock or flow. All the main weights, replicate weights and RKEY values need to be marked as measures.
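
The measure rule above amounts to a simple test on the additivity value. In the sketch below the function name is hypothetical, and the case-insensitive comparison is an assumption, made because this article writes the values both as "Stock"/"Flow" and as "stock"/"flow".

```python
MEASURE_ADDITIVITY = {"stock", "flow"}


def is_measure(additivity):
    """True when a numeric variable's additivity marks it as a measure."""
    # Assumption: the comparison ignores case and surrounding whitespace.
    return additivity is not None and additivity.strip().lower() in MEASURE_ADDITIVITY


print(is_measure("Flow"))          # True
print(is_measure("non-additive"))  # False
```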

Terminology

The following are selected DDI elements and their corresponding equivalents in SuperCHANNEL terminology. For further details refer to the Supported Elements section below. See the supplied HealthSurvey.xml file for examples.

Element	Description	SuperCHANNEL Equivalent
CaseIdentification	Information on which variable is considered the primary key.
CategoryScheme	Defines particular categories used as question responses.	Display name for a classification. For example, the gender category could map code M to Male and F to Female.
CodeScheme	Defines code values used to represent categories for a variable or question.	Classification codes. For example, the gender category could map code M to Male and F to Female.
LogicalProduct	A high level DDI element that describes and defines the logical data products of the study unit. The LogicalProduct describes its logical content by means of Variables, usually grouped into VariableSchemes and CodeSchemes, which are used to define the values applicable for an individual variable. CategorySchemes are used to define particular categories used as question responses. Their relationships and code values are described in the code scheme. The DataRelationship then defines how exactly those schemes relate to each other to form one or more LogicalRecords (such as a household, family, or person records) of the study unit.	Database.
LogicalRecord	Defines a grouping of variables into a logical record, such as a household, family, or person record.	Fact tables.
PhysicalDataProduct	High level DDI element that describes and defines the physical structure of data products of the study unit.	Data source.
PhysicalInstance	High level DDI element that describes and defines the physical data set. It points to the physical location of the individual record data.	Data source.
RecordRelationship	A description of how the logical contents of the DDI document relate to each other. All relationships are pair-wise. Multiple pair-wise relationships may be needed to clarify all record relationships within the logical contents of the DDI document.	In JDBC language, this would be a rough equivalent of "imported keys" and "exported keys", so that a Database Management System (DBMS) can establish primary key/foreign key relationships through its specific database and database driver implementations.
StudyUnit	High level DDI element that describes and defines a single unit of (social sciences) study.	Database.
Variable	Defines an individual data item, such as a person's gender or age.	Column.
VariableScheme	A collection of variables and variable groups.
NMTOKEN	Constrains the contents of attribute values by their form (rather than by their actual value, as would be done with an enumeration). The form of an attribute value declared as type NMTOKEN is such that it may contain only valid name characters, digits, and limited punctuation (such as full stops, hyphens, colons, and various combining and extending characters required for internationalisation). NMTOKEN-type attributes may start with a digit, hyphen or full stop. Internal white space, commas and "/" are prohibited in NMTOKEN type attribute values. Leading and trailing white space is trimmed/ignored before determining whether the value is valid for an NMTOKEN-type attribute.
RKey	A record key used as part of the perturbation confidentialisation process. Cubes created by the server for databases that have the perturbation confidentiality process enabled use the RKeys to generate a cell key for each cell in the cube.

Supported Elements

Reusable DDI Element Support

DDI	Driver Support
`LabelType` (used for any element that can have a human readable name)	Must be a plain text entry. Multiple label entries for a single element are not supported. All attributes are ignored.
`ReferenceType` (used for many different element types)	Only ID element is supported. All attributes and other elements are ignored.

Supported DDI for DDIInstance/s:StudyUnit

DDI	Driver Support
DDIInstance/s:StudyUnit/r:Citation/Title	Required. Included in `SXV4Database.label`
DDIInstance/s:StudyUnit/@version	Required. Included in both `SXV4Database.label` and `SXV4Database.name`
DDIInstance/s:StudyUnit/@agency	Required. Included in `SXV4Database.name`
DDIInstance/s:StudyUnit/@id	Required. Maps to first part of `SXV4Database.name`

Supported DDI for LogicalProduct

DDI (Starting from //l:LogicalProduct)	DDI Driver Support
.	Required.
l:DataRelationship/l:LogicalRecord/@id	Maps to `FactTable.name`.
l:DataRelationship/l:LogicalRecord/VariablesInRecord/@allVariablesInLogicalProduct	The DDI Driver will include all variables in the logical product as columns in the table.
l:DataRelationship/l:LogicalRecord/VariablesInRecord/VariableSchemeReference/ID	Alternative to `@allVariablesInLogicalProduct`. ID of a variable scheme that contains a set of variables to be included as columns in the table.
l:VariableScheme/l:Variable/@id	Maps to `Column.name` and `Column.label` if there is no name or label defined.
`l:VariableScheme/l:Variable/r:label` or `l:VariableScheme/l:Variable/@id` if none defined	Optional. Maps to `Column.label`.
/l:VariableScheme/l:Variable/l:Representation/l:NumericRepresentation/@additivity	Maps to `Column.type`, either as a plain column type (if `@additivity` is set to `non-additive`) or as a measure otherwise.
l:VariableScheme/l:Variable/l:Representation/l:NumericRepresentation/@type	Maps to the column data type.
l:VariableScheme/l:Variable/l:Representation/l:CodeRepresentation/r:CodeSchemeReference/r:ID	`Column.type` is set to `classification` when a code representation is defined for a variable. The DDI Driver checks that the ID refers to a scheme.
l:CodeScheme/@id	Maps to `ValueSetTable.name` and also `ValueSetTable.label` if there is no variable name or label.
l:CodeScheme/r:Label	Optional. Maps to `ValueSetTable.label`.
l:VariableScheme/l:Variable/l:Representation/l:CodeRepresentation/r:CodeSchemeReference/r:ID	Refers to a code scheme, which is used to populate a value set.
l:CodeScheme/l:Code/l:Value	Maps to code entries in a value set table.
l:CodeScheme/l:Code/l:CategoryReference/r:ID	Refers to a category item, which contains a matching label for a value set table.
l:CategoryScheme/l:Category/r:Label	Maps to a label entry in a value set table.
`parent::l:Code/l:Value` where context is the current code element that is being added to a value set.	Maps to an entry in the parent column in a value set table.
l:DataRelationship/l:RecordRelationship/l:RecordReferenceSource/@relation	If `@relation = "Parent"` then the record reference source refers to the key variable, otherwise if it is `"Child"` then it refers to a foreign variable. `"Sibling"` is not supported.
l:DataRelationship/l:RecordRelationship/l:RecordReferenceSource/l:VariableReference/r:ID	Refers to a primary or foreign key variable, depending on the value of the containing reference source.
l:DataRelationship/l:RecordRelationship/l:RecordReferenceTarget/r:ID	Refers to a primary or foreign key variable, depending on the value of `@relation`.
l:DataRelationship/l:LogicalRecord/l:CaseIdentification/l:VariableSpecificationReference/l:VariableReference/r:ID	Refers to a primary key with `r:ID`.

Supported DDI for PhysicalDataProduct

DDI (Starting from /ns1:DDIInstance/s:StudyUnit/p:PhysicalDataProduct)	DDI Driver Support
.	Required. The DDI Driver processes items listed in this table, subject to the constraints described here. Multiple physical data products within are not supported.
@id	Maps to `SXV4Database.id`. By using this as the internal ID, SuperSTAR will be able to link back to parts of the associated product to retrieve descriptive metadata.
p:PhysicalStructureScheme/p:PhysicalStructure/p:LogicalProductReference/r:ID	Required. The DDI Driver checks that this refers to a valid logical product defined in the DDI document.
p:PhysicalStructureScheme/p:PhysicalStructure/p:DefaultDataType	Ignored. The data type is taken from the logical level.
p:PhysicalStructureScheme/p:PhysicalStructure/p:DefaultDelimiter	If defined, the DDI Driver uses the delimiter as the separator character when reading from the data file.
p:PhysicalStructureScheme/p:PhysicalStructure/p:GrossRecordStructure/p:LogicalRecordReference/r:ID	Required. The DDI Driver checks that it matches a logical record defined in the DDI document.
p:PhysicalStructureScheme/p:PhysicalStructure/p:GrossRecordStructure/p:PhysicalRecordSegment/@id	Required, but only one physical record segment can be defined in the gross record structure. Other than the mandatory ID attribute, values and child elements are ignored.
p:RecordLayoutScheme/p:RecordLayout/p:PhysicalStructureReference/r:ID	Required. The DDI Driver checks that it matches a physical structure within the same logical data product.
p:RecordLayoutScheme/p:RecordLayout/p:PhysicalStructureReference/p:PhysicalRecordSegmentUsed	Required. The DDI Driver checks that it matches a physical record defined within the physical structure that is referenced by the segment that the element belongs to.
p:RecordLayoutScheme/p:RecordLayout/p:CharacterSet	If defined, must be `"US ASCII"`. The DDI Driver reports errors for all other values.
p:RecordLayoutScheme/p:RecordLayout/p:ArrayBase	Required. Must be either 0 or 1.
p:RecordLayoutScheme/p:RecordLayout/p:DefaultVariableSchemeReference/r:ID	If defined, the DDI Driver checks that it matches a valid variable scheme defined in the logical product referred to by the containing physical product.
p:RecordLayoutScheme/p:RecordLayout/p:DataItem/p:VariableReference/r:Scheme/r:ID	If defined, the default variable scheme defined for the containing layout is ignored. The DDI Driver checks that it references a valid variable scheme defined in the logical product referenced by the containing data product. If neither this item nor the default variable scheme is defined then the DDI Driver reports an error.
p:RecordLayoutScheme/p:RecordLayout/p:DataItem/p:VariableReference/r:ID	Required. The DDI Driver checks that it refers to a variable within the default variable scheme, or if a scheme is referenced by the data item then the DDI Driver checks in that scheme instead. In either case, the scheme referred to must be defined in the logical product referenced by the containing physical data product.
p:RecordLayoutScheme/p:RecordLayout/p:DataItem/p:PhysicalLocation/p:StorageFormat	Ignored.
p:RecordLayoutScheme/p:RecordLayout/p:DataItem/p:PhysicalLocation/p:ArrayPosition	Required. The DDI Driver checks that the array position is validly defined. No upper bound check is made.
p:RecordLayoutScheme/p:RecordLayout/p:DataItem/p:PhysicalLocation/p:Width	Required. This value must be defined according to the standard if the sibling position is not defined. This is used to check the maximum allowed length of strings defined for the DDI driver.
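
To make the roles of the delimiter, array base, array position, and width more concrete, the following sketch reads one delimited record against a layout. It is an illustration only, not the driver's logic: the function name, the layout, and the sample record are hypothetical, and the comma default assumes a CSV data file.

```python
def read_record(line, data_items, array_base=0, delimiter=","):
    """Map one delimited record to {variable_id: value}.

    data_items is a list of (variable_id, array_position, width) tuples,
    mirroring p:VariableReference, p:ArrayPosition and p:Width.
    """
    if array_base not in (0, 1):
        raise ValueError("ArrayBase must be either 0 or 1")
    fields = line.rstrip("\r\n").split(delimiter)
    record = {}
    for variable_id, array_position, width in data_items:
        # Convert the declared position to a 0-based list index.
        index = array_position - array_base
        # This range check only guards the list lookup in this sketch.
        if index < 0 or index >= len(fields):
            raise ValueError(f"{variable_id}: array position {array_position} is out of range")
        value = fields[index]
        if len(value) > width:
            raise ValueError(f"{variable_id}: value exceeds width {width}")
        record[variable_id] = value
    return record


layout = [("PERSON_ID", 1, 8), ("SEX", 2, 1), ("INCOME", 3, 10)]  # hypothetical
record = read_record("1001,F,62000.00", layout, array_base=1)
print(record)  # {'PERSON_ID': '1001', 'SEX': 'F', 'INCOME': '62000.00'}
```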

Supported DDI for PhysicalInstance

DDI (Starting from /ns1:DDIInstance/s:StudyUnit/p:PhysicalInstance)	DDI Driver Support
.	Required. A separate physical instance is required for every CSV file.
pi:DataFileIdentification/pi:URI	Required. The DDI Driver uses the URI to access the data. It reports error messages on the data (and indicates whether the URI type is unsupported, supported, or HTTP). The DDI Driver only supports a single data file identification per physical instance and a single URI within it (although DDI allows multiple data file identifications per physical instance, as well as multiple URIs within each). If multiple items are defined, the DDI Driver reports an error with a message indicating that this is not supported.
