File Format Characterization

Purpose

This document explains our use of file format characterization as a component of our technical metadata approach. 

Background

The transmission and consumption of digital information depend on the consumer and the consumer's software being able to understand and render the information in digital files.  File formats encode intellectual information into forms that require highly specific hardware and software to decode.  Hardware, software, and formats change rapidly, endangering the long-term viability of digital information [1,2].
File format characterization identifies the formats of the files in our preservation collections, validates the files against those formats, and extracts their key characteristics.  Those characteristics include, but are not limited to [3]:

  • Format & Version - Formats are much more nuanced than MIME type.  Identification tools rely on resources like the PRONOM Technical Registry, which catalogs format signatures, magic numbers, and the like.
  • Validity & Well-formedness - By comparing a file to the specification for its format, we can determine whether it conforms to the specification and thus have more confidence that it will be rendered properly.
  • Technical metadata - Color depth, image dimensions, and the like can be important for gauging whether a specific instance of a work is appropriate for a particular purpose.
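
As a rough illustration of signature-based identification, the following Python sketch matches the leading bytes of a file against a handful of well-known magic numbers and reads the version from a PDF header.  It is not how FITS or PRONOM-based tools are implemented; the SIGNATURES table and the function names are assumptions made for this example, and real registries describe many more formats with more precise rules.

```python
from typing import Optional

# A few well-known leading byte signatures ("magic numbers").
# This table is illustrative only; registries such as PRONOM cover
# far more formats and use more detailed matching rules.
SIGNATURES = [
    (b"\x89PNG\r\n\x1a\n", "PNG"),
    (b"%PDF-", "PDF"),
    (b"\xff\xd8\xff", "JPEG"),
    (b"GIF87a", "GIF"),
    (b"GIF89a", "GIF"),
]


def identify(data: bytes) -> Optional[str]:
    """Return a format label for the leading bytes, or None if unrecognized."""
    for magic, label in SIGNATURES:
        if data.startswith(magic):
            return label
    return None


def pdf_version(data: bytes) -> Optional[str]:
    """Read the version from a PDF header such as b'%PDF-1.4'."""
    prefix = b"%PDF-"
    if not data.startswith(prefix):
        return None
    # Take the three bytes that follow the prefix, e.g. b"1.4".
    raw = data[len(prefix):len(prefix) + 3]
    return raw.decode("ascii", errors="replace")
```

The PDF case shows why a MIME type alone is not enough: two files can both be application/pdf and still differ in version, which the header makes visible.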

With detailed information about the formats in our preservation systems and an ongoing commitment to review the evolving state of digital formats, we are able to make informed decisions about when we might need to consider remediating at-risk formats [4,5].

Technical Details

We use Harvard's File Information Tool Set (FITS) version 0.10.2, which invokes other tools to generate technical metadata.  FITS runs as a scheduled job twice an hour.  The job queries the repository index for repository files (content datastreams) that do not have a value in the index field describing the FITS version.  When a new content datastream is created, as when a new version of a file is deposited, the existing FITS datastream is discarded and the scheduled job recreates it.  The output of FITS is stored in the FITS datastream for the object, and some values are indexed, including:

  • Format Label
  • Format Version
  • FITS Version
  • PRONOM Identifier
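
To make the four values above concrete, here is a minimal Python sketch that pulls them from a FITS XML document.  The embedded SAMPLE is a hand-written, trimmed illustration of FITS output rather than a real file from our repository, and the dictionary keys are placeholder names, not the actual index field names.

```python
import xml.etree.ElementTree as ET

# Namespace used by FITS output documents.
NS = {"fits": "http://hul.harvard.edu/ois/xml/ns/fits/fits_output"}

# Hand-written, trimmed example of FITS output (illustrative only).
SAMPLE = """<fits xmlns="http://hul.harvard.edu/ois/xml/ns/fits/fits_output" version="0.10.2">
  <identification>
    <identity format="Portable Document Format" mimetype="application/pdf">
      <version toolname="Droid">1.4</version>
      <externalIdentifier toolname="Droid" type="puid">fmt/18</externalIdentifier>
    </identity>
  </identification>
</fits>
"""


def extract_indexed_values(fits_xml: str) -> dict:
    """Return format label, format version, FITS version, and PRONOM identifier.

    The keys are hypothetical names chosen for this sketch.
    Values that are absent from the document come back as None.
    """
    root = ET.fromstring(fits_xml)
    values = {
        "format_label": None,
        "format_version": None,
        "fits_version": root.get("version"),
        "pronom_identifier": None,
    }
    identity = root.find("fits:identification/fits:identity", NS)
    if identity is not None:
        values["format_label"] = identity.get("format")
        values["format_version"] = identity.findtext(
            "fits:version", default=None, namespaces=NS
        )
        # The PRONOM unique identifier (PUID) is an externalIdentifier of type "puid".
        values["pronom_identifier"] = identity.findtext(
            "fits:externalIdentifier[@type='puid']", default=None, namespaces=NS
        )
    return values
```

Calling extract_indexed_values(SAMPLE) yields one value per bullet in the list above.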

For specific fields being indexed see the TECHMD fields in github.com/duke-libraries.
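
The selection step of the scheduled job can be sketched as follows.  This is a simplified Python model that uses an in-memory list in place of the repository index; the field name fits_version, the record identifiers, and the function names are all hypothetical.  The only rule it encodes is the one described above: a file is selected when the FITS version field has no value.

```python
from typing import Dict, List, Optional

# Hypothetical name of the index field that records the FITS version.
FITS_VERSION_FIELD = "fits_version"


def needs_characterization(doc: Dict[str, Optional[str]]) -> bool:
    """True when the index record has no value for the FITS version field."""
    return not doc.get(FITS_VERSION_FIELD)


def select_pending(index_docs: List[Dict[str, Optional[str]]]) -> List[str]:
    """Return the ids of records the scheduled job would process on this run."""
    return [doc["id"] for doc in index_docs if needs_characterization(doc)]


# Stand-in for the repository index (illustrative records only).
index_docs = [
    {"id": "file:1", "fits_version": "0.10.2"},  # already characterized
    {"id": "file:2", "fits_version": None},      # FITS datastream discarded after a new deposit
    {"id": "file:3"},                            # never characterized
    {"id": "file:4", "fits_version": "0.8.5"},   # older value, used here for illustration
]

pending = select_pending(index_docs)
```

Under this rule, a record that already carries any FITS version value, including an older one such as file:4, is not selected.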

Additional Considerations

At present there is no process in place to regenerate technical metadata when the version of FITS changes.

References

  1. Selecting file formats for long-term preservation - The National Archives
  2. Preserving Digital Information, Report of the Task Force on Archiving of Digital Information
  3. inSPECT Significant Properties Report 2007
  4. OPF File Format Risk Registry
  5. Library of Congress Recommended Formats Statement