
Duke Digital Repository

Standard Ingest Format Documentation

4.16.2016

Overview

The Duke Digital Repository (DDR) accepts batch ingests of objects and metadata using the Standard Ingest Format (SIF).  The SIF defines the organization and naming conventions of files, the format and content of a tab-delimited descriptive metadata file, and other factors relevant to repository deposit.  Collections that conform to the SIF may be ingested by Digital Repository Services (DRS).

The DDR content model has three tiers of hierarchy: collection, item, and component.  For example, a digitized image collection may carry metadata about the source of the collection, curator contact information, etc.  This collection may contain an item that is a scanned photograph.  The item metadata may describe the subject, photographer, etc.  This item may contain scans of both the front and back of the physical object, resulting in 2 components.  Descriptive metadata for each component may describe the side of the object illustrated.

Collections are created at the time of ingest.  Collection metadata is ingested along with descriptive metadata and must be specified in the metadata file.   

At present, the objects and metadata must be loaded onto storage accessible to the ingest server.

File Organization

The SIF is an extension of the Bag-It specification.   Please see https://en.wikipedia.org/wiki/BagIt for more information or go straight to the specification.  Please note that we do not yet support holey bags.

The SIF will traverse the data directory and its immediate sub-directories, but no deeper.  Conceptually, the data directory represents the collection.  That collection contains items, which contain components.  An item may be thought of as an intellectual unit, and the components it contains are the individual pieces that comprise that unit.  A single component must still be deposited as belonging to an item.

In the example below, there is one item, 27613-h, with two components, q172.png and q172.txt.  

Example:

myfirstbag/
|-- data
|   |-- metadata.txt
|   \-- 27613-h
|       |-- q172.png
|       \-- q172.txt
|-- manifest-sha1.txt
|     49afbd86a1ca9f34b677a3f09655eae9  data/27613-h/q172.png
|     408ad21d50cef31da4df6d9ed81b01a7  data/27613-h/q172.txt
|-- bagit.txt
|     BagIt-Version: 0.97
|     Tag-File-Character-Encoding: UTF-8
\-- tagmanifest-sha1.txt
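
The two-level layout described above can be checked before deposit.  The following is a minimal sketch, not part of the repository tooling; the function name check_layout and the wording of the messages are hypothetical, and it assumes that metadata.txt is the only file expected directly under the data directory.

```python
from pathlib import Path


def check_layout(bag_dir):
    """Return a list of layout problems found under <bag_dir>/data.

    Assumption: metadata.txt is the only file expected directly under data/.
    """
    data = Path(bag_dir) / "data"
    problems = []
    for entry in sorted(data.iterdir()):
        if entry.is_file():
            # A component must live inside an item folder, not directly in data/.
            if entry.name != "metadata.txt":
                problems.append(f"component outside an item folder: {entry.name}")
            continue
        # Each sub-directory of data/ is treated as one item.
        for child in sorted(entry.iterdir()):
            if child.is_dir():
                # Folders below the item level are deeper than the SIF traverses.
                problems.append(
                    f"nested folder will not be traversed: {entry.name}/{child.name}"
                )
    return problems
```

An empty list means that every component sits exactly one folder below the data directory.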

A variety of tools exist for creating bags.  Please verify that the tool you choose supports SHA-1, as the repository only understands that algorithm.  One such tool is bagit-python at https://pypi.python.org/pypi/bagit/

Checksum Manifest File - manifest-sha1.txt

Checksums must be supplied to the repository to validate that the ingested object is an exact copy of the file supplied to us.  The repository understands checksums using the SHA-1 algorithm.  The manifest file will contain one row per component.  Each row will begin with the SHA-1 checksum, will separate the checksum from the path with two spaces, and will specify the path to the component relative to the bag, as illustrated in the example.
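
The row format can be produced with the Python standard library alone.  This is a minimal sketch, not the repository's own code; the helper names sha1_of and manifest_lines are hypothetical.  It assumes that only files inside item folders are listed, mirroring the example, so metadata.txt is omitted; general-purpose BagIt tools typically list every file under the data directory.

```python
import hashlib
from pathlib import Path


def sha1_of(path, chunk_size=65536):
    """Return the hexadecimal SHA-1 digest of a file, read in chunks."""
    digest = hashlib.sha1()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def manifest_lines(bag_dir):
    """Build one manifest row per component: checksum, two spaces, path."""
    bag = Path(bag_dir)
    lines = []
    # Assumption: components are the files exactly one folder below data/.
    for path in sorted((bag / "data").glob("*/*")):
        if path.is_file():
            relative = path.relative_to(bag).as_posix()
            lines.append(f"{sha1_of(path)}  {relative}")
    return lines
```

Writing the rows to manifest-sha1.txt is then a matter of joining them with newlines.  A SHA-1 digest is always 40 hexadecimal characters, which gives a quick sanity check on any manifest.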


Bag-It Version File - bagit.txt


To ensure accurate processing of each bag, metadata about the bag must be included.  This metadata will include the Bag-It version and Tag File Character Encoding as illustrated in the example.
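
A depositor can confirm that both declarations are present before submitting a bag.  The sketch below is illustrative only; the names read_bagit_txt, missing_bagit_fields, and REQUIRED_FIELDS are hypothetical, and it checks only that the two labels exist, not which values the repository accepts.

```python
# Labels taken from the bagit.txt example in this document.
REQUIRED_FIELDS = ("BagIt-Version", "Tag-File-Character-Encoding")


def read_bagit_txt(text):
    """Parse 'Label: value' lines from bagit.txt into a dict."""
    fields = {}
    for line in text.splitlines():
        if ":" in line:
            label, value = line.split(":", 1)
            fields[label.strip()] = value.strip()
    return fields


def missing_bagit_fields(text):
    """Return the required labels that do not appear in bagit.txt."""
    fields = read_bagit_txt(text)
    return [label for label in REQUIRED_FIELDS if label not in fields]
```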

Metadata File Structure - metadata.txt


The SIF expects a tab-delimited text file with column headers that match the available metadata fields.  Fields left blank or not included in the metadata file will be ignored.  Fields with repeated values must be delimited by the repeat value separator, a semicolon.  The SIF allows metadata to be applied to the collection, the item, and/or the component.  See this example metadata.txt.

The first column must be labeled ‘path’ and is used to identify the object to which the metadata applies.  An empty value in the ‘path’ column means that the metadata applies to the collection object.  A value of ‘[folder]’, where ‘[folder]’ is the name of a folder within the ‘data’ directory, means that the metadata applies to the item object represented by that folder.  A value of ‘[folder]/[file]’ means that the metadata applies to the component object represented by the filename in ‘[file]’.
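
The rule for the ‘path’ column can be expressed as a small function.  This is a sketch of the rule as stated above, not the repository's implementation; the function name object_level is hypothetical, and ignoring surrounding whitespace and slashes is an assumption.

```python
def object_level(path_value):
    """Map a 'path' column value to the tier the metadata row applies to."""
    # Assumption: surrounding whitespace and slashes are not significant.
    path = (path_value or "").strip().strip("/")
    if not path:
        return "collection"   # empty path: the collection object
    parts = path.split("/")
    if len(parts) == 1:
        return "item"         # [folder]
    if len(parts) == 2:
        return "component"    # [folder]/[file]
    raise ValueError(f"path is deeper than [folder]/[file]: {path_value!r}")
```

For the example bag, an empty value maps to the collection, 27613-h to an item, and 27613-h/q172.png to a component.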

Control Characters

Column separator   “\t” tab character

Metadata fields must be separated by a tab character.  If tab characters occur in a metadata value, the Quote Character must be used to ensure the SIF does not interpret those tabs as column delimiters.

Row separator/EOL   "\n",  "\r", or "\r\n"

To designate the end of a row and the beginning of a new record, please use the Carriage Return, Line Feed, or both.  Standard End of Line characters from all major operating systems should be acceptable.

Quote Character   "  double quotation mark


The Double Quotation Mark should be used to inform the SIF that control characters, such as tab and EOL characters, should not be interpreted as such.  

Repeat value separator   ; semi-colon

Multiple values in repeatable metadata fields should be separated by a semicolon.  If a value contains control characters, please use the Quote Character to escape them.  For example, dc.author may contain the following values:   Washington, George;Read, George;Bedford, Gunning Jr.
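
The control characters above can be exercised locally before deposit.  The sketch below uses Python's csv module as a stand-in; it is not the parser the repository uses and may differ from it in edge cases.  The function name read_metadata and the column names title and creator are hypothetical, chosen only for illustration.

```python
import csv
import io

REPEAT_SEPARATOR = ";"


def read_metadata(text, repeatable=()):
    """Parse tab-delimited metadata into one dict per row.

    Blank fields are dropped; fields named in `repeatable` become lists.
    """
    reader = csv.DictReader(
        io.StringIO(text, newline=""),  # leave \n, \r, and \r\n untranslated
        delimiter="\t",
        quotechar='"',
    )
    records = []
    for row in reader:
        # An empty path is meaningful (collection level), so always keep it.
        record = {"path": row.get("path") or ""}
        for field, value in row.items():
            if field is None or field == "path" or not value:
                continue  # blank or surplus fields are ignored
            if field in repeatable:
                # Simple split; a semicolon inside a value is not handled here.
                record[field] = [
                    part.strip()
                    for part in value.split(REPEAT_SEPARATOR)
                    if part.strip()
                ]
            else:
                record[field] = value
        records.append(record)
    return records
```

Calling read_metadata(text, repeatable=("creator",)) on a row whose creator column holds the example value above yields a list of three names, since the commas inside each name are not separators.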

Encoding  

The Common Batch Ingest tool (SIF) uses the standard Ruby CSV library to parse metadata from the delimited file.  For more information, see http://docs.ruby-lang.org/en/2.1.0/CSV.html.  To minimize display issues, we encourage depositors to submit metadata in UTF-8 expressed as 'u'/'U'.

Metadata Fields

The Duke Digital Repository supports the following Dublin Core metadata terms.  The use of terms is defined in the Dublin Core version 1.1 dictionary at http://www.dublincore.org/documents/dces/.  The metadata fields, repeatability, and necessity are provided in the Standard Ingest Format - Metadata Fields sheet.

Metadata about the Collection

Collection-level metadata, e.g. title, should be included in the metadata file in a row whose ‘path’ column is blank.





