M710report

= M7.10 =

Review of standard format options for community contributed bibliographies

Executive Summary
This report provides an overview of the data models and formats used in various bibliography services. It presents the file formats widely used by bibliographic services to import and export references. The report continues with a list of various bibliographic services, their capabilities and available functionality. The report concludes with a summary and recommendations for WP7's preferred import/export formats and external bibliographic service(s) to use.

Scope of Report
The scope of the report is to inform the outputs standards to be used in WP7. The report documents relevant bibliographic formats and supporting services. One of the WP7 provided services is a comprehensive database of taxonomic literature, a bibliography of life. This report reviews the landscape of formats and services so that we can identify what we build upon and what we can re-use.

Citation formats
This section describes publicly available data formats that are widely used for storing and exchanging bibliographic information. The focus of the report is on taxonomic bibliographic data, so formats that are widely used in other domains but not in taxonomy may not be covered.

The formats covered in this section are BibTeX, Dublin Core, EndNote/RIS, MODS, and the CiteBank/GNUB information model (Guido).

The different bibliographic formats are documented and compared by looking at their fields and the sub-structure of those fields (if any); for example, author fields may be decomposable in some formats but not in others.

[Guido's words: granularity (e.g. authors) / expressiveness, comparison, possible mappings]

Other issues: loss of information when converting between different formats.

(For each format discussed, Insert references to other sources of online information, e.g. manufacturer's websites, Wikipedia pages, etc.) Bibtex page in wikipedia

A sample document is encoded in each format. The sample is a letter submitted to Nature by Dave Roberts and Vishwas Chavan about the use of standard identifiers to mobilize data.

(Needs explanation of the internal structure for this section - being broken down into several sub-sections.)

BibTeX
BibTeX uses a single record for each reference. Each record begins with a reference type (see list of BibTeX reference types) and a key that is unique at least within the scope of the individual BibTeX file at hand. All records are on the same level. Each record has a list of attributes (fields). Depending on the reference type, some fields are required, others are optional, and others are ignored. (give some examples, but don't provide a complete manual).

BibTeX has 26 fields defined, although which fields are actually available depends on the reference type; see list of BibTeX reference types and list of attributes for details. Note that unlike some other bibliographic formats (e.g. EndNote), the fields are non-repeatable.

Sample:

@article{roberts_standard_2008,
    title = {Standard identifier could mobilize data and free time},
    volume = {453},
    issn = {0028-0836},
    url = {http://dx.doi.org/10.1038/453449c},
    doi = {10.1038/453449c},
    number = {7194},
    journal = {Nature},
    author = {Roberts, Dave and Chavan, Vishwas},
    month = may,
    year = {2008},
    pages = {449--450},
    annote = {10.1038/453449c}
}

Editing: Dedicated BibTeX editors exist, but because the file is in plain text format it can also be edited with any text editor.

Structure: flat records in a file. There is no nesting of records to indicate, for example, chapters within books or papers within a journal issue. However, the same effect can be achieved by using a record's crossref field to refer to the other record's unique key.
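The crossref mechanism can be illustrated with a minimal Python sketch. It assumes the records have already been parsed into plain dictionaries; the keys and field values (example_book, example_chapter, and so on) are hypothetical placeholders, and the function only mimics BibTeX's behaviour of letting an entry inherit fields it lacks from the entry it cross-references.

```python
def resolve_crossref(records, key):
    """Return a copy of records[key] with missing fields filled in
    from the record named in its crossref field (if any)."""
    entry = dict(records[key])
    parent_key = entry.pop("crossref", None)
    if parent_key in records:
        for name, value in records[parent_key].items():
            # Fields already present in the child take precedence.
            entry.setdefault(name, value)
    return entry


# Hypothetical, already-parsed records keyed by their BibTeX keys.
records = {
    "example_book": {
        "booktitle": "Example Book",
        "publisher": "Example Press",
        "year": "2008",
    },
    "example_chapter": {
        "title": "Example Chapter",
        "pages": "1--10",
        "crossref": "example_book",
    },
}

chapter = resolve_crossref(records, "example_chapter")
print(chapter)
```

The nesting is therefore only implicit: a consumer has to follow the key by hand, and the link breaks if the referenced record is missing from the file.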

Granularity: flat strings within a record. For example, author names are single strings, and multiple names (authors, editors) appear in one field as a list separated with “and”, e.g. author = "Guido Sautter and Dauvit King".
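To show what this granularity means for an importer, here is a minimal Python sketch that splits an author field into individual names. It is a simplification under stated assumptions: it ignores brace-protected text, "von" particles and "Jr" suffixes, and for the "First Last" form it assumes the final token is the family name. The function name split_bibtex_authors is our own.

```python
import re


def split_bibtex_authors(author_field):
    """Split a BibTeX author/editor field into given/family name parts."""
    names = []
    # BibTeX separates names with the word "and" surrounded by whitespace.
    for raw in re.split(r"\s+and\s+", author_field.strip()):
        if not raw:
            continue
        if "," in raw:
            # "Family, Given" form
            family, given = [part.strip() for part in raw.split(",", 1)]
        else:
            # "Given Family" form: assume the last token is the family name
            parts = raw.split()
            given, family = " ".join(parts[:-1]), parts[-1]
        names.append({"given": given, "family": family})
    return names


authors = split_bibtex_authors("Roberts, Dave and Chavan, Vishwas")
print(authors)
```

The guesswork in the second branch is exactly the challenge that unstructured person names pose when BibTeX is used as an import format.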

Audience/Purpose: originally developed for keeping a personal bibliography in a simple text file for easy integration in academic publications written in TeX. TeX is a typesetting markup language, originally devised by Donald Knuth for use in computer science but now widely used in other domains. BibTeX is a bibliographic reference manager extension to TeX.

Assessment: Good for generating the bibliography sections of academic publications. Merging two BibTeX files can be problematic because identifiers have to be unique only within the scope of individual files. Thus good as an export format that brings bibliographic data to users; maybe for import, but the lack of structure in person names might cause a challenge; definitely unsuited for storage of large bibliographies.
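The merging problem can be made concrete with a small Python sketch that reports keys occurring in both of two BibTeX files. The regular expression is a rough approximation of BibTeX syntax rather than a full parser, and the second file's contents (including the key king_2010 and both titles in it) are invented for illustration.

```python
import re

# Matches "@type{key," at the start of an entry (approximate, not a full parser).
ENTRY_KEY = re.compile(r"@(\w+)\s*\{\s*([^,\s]+)\s*,")


def citation_keys(bibtex_text):
    """Return the set of citation keys found in BibTeX source text."""
    return {
        key
        for entry_type, key in ENTRY_KEY.findall(bibtex_text)
        if entry_type.lower() not in {"string", "preamble", "comment"}
    }


def colliding_keys(first, second):
    """Keys that are used in both files and would clash on a naive merge."""
    return citation_keys(first) & citation_keys(second)


file_a = "@article{roberts_standard_2008, title = {Standard identifier could mobilize data and free time}}"
# Hypothetical second file that reuses the same key for a different work.
file_b = (
    "@book{roberts_standard_2008, title = {A different work}}\n"
    "@article{king_2010, title = {Another work}}"
)

print(sorted(colliding_keys(file_a, file_b)))
```

A merge tool would need to rename or deduplicate the reported keys, since the same key may denote either the same work or two unrelated ones.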

RIS File Format
The RIS file format is a standardized tag format that has been developed by Research Information Systems, Incorporated (the format name refers to the company) to enable the interchange of bibliographic data. The format also underlies the EndNote reference management software.

The RIS file format uses a single record per reference, starting with a reference type (see list of RIS reference types) and ending with an end-of-record marker. In between these two markers, the tags can occur in any order. The syntax is tag based: each attribute is identified by a different tag (see list of RIS tags), and a tag consists of a two-character code, two spaces, and a dash. A line starting with a tag indicates the start of an attribute. Attribute values can span multiple lines, for example to separate the names of individual authors.

RIS files are in plain text format that can be edited using a text editor. However, the records can be more complex than the equivalent BibTeX records because there are more fields and the fields are interrelated. RIS is best used as an export/import format, as was its purpose.

Sample:

TY  - JOUR
AU  - Roberts, Dave
AU  - Chavan, Vishwas
TI  - Standard identifier could mobilize data and free time
JA  - Nature
PY  - 2008/05/22/print
VL  - 453
IS  - 7194
SP  - 449
EP  - 450
PB  - Nature Publishing Group
SN  - 0028-0836
UR  - http://dx.doi.org/10.1038/453449c
M3  - 10.1038/453449c
N1  - 10.1038/453449c
ER  - 

Fields/Tags: 40 in total, see list of RIS tags for details

Structure: flat records, no nesting or recursion

Granularity: flat strings, person names are single strings, but strictly formatted as “lastname, firstname, suffix”; lists of person names (authors, editors) separated with line breaks, some attributes are restricted in length
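Because the format is line and tag based, a reader for it is short. The following Python sketch is a minimal, non-validating parser written for illustration: the function name parse_ris and the choice to store every tag as a list of values are our own assumptions, and the embedded sample is an abbreviated version of the record used in this report.

```python
from collections import defaultdict


def parse_ris(text):
    """Parse RIS text into a list of records mapping tag -> list of values."""
    records, current, last_tag = [], None, None
    for line in text.splitlines():
        # A tag line is: two characters, two spaces, a dash, then the value.
        if len(line) >= 5 and line[2:5] == "  -" and line[:2].isalnum():
            tag, value = line[:2], line[5:].strip()
            if tag == "TY":
                current = defaultdict(list)  # start of a new record
            if current is None:
                continue  # ignore tags outside any record
            if tag == "ER":
                records.append(dict(current))  # end-of-record marker
                current, last_tag = None, None
                continue
            current[tag].append(value)
            last_tag = tag
        elif current is not None and last_tag and line.strip():
            # Continuation line: append to the most recent value.
            current[last_tag][-1] += " " + line.strip()
    return records


SAMPLE = """\
TY  - JOUR
AU  - Roberts, Dave
AU  - Chavan, Vishwas
TI  - Standard identifier could mobilize data and free time
JA  - Nature
VL  - 453
ER  -
"""

ris_records = parse_ris(SAMPLE)
print(ris_records[0]["AU"])
```

Keeping a list per tag preserves repeated attributes such as AU, which is the main structural difference from BibTeX's single author string.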

Audience/Purpose: developed as a standard data exchange format and adopted by the EndNote management software for bibliographic references.

Assessment: Good for export, import, and transport of bibliographic data, but might turn out too inflexible as a storage model, even with the attributes written to a relational database.

EndNote
EndNote is a commercial reference management software package, used to manage bibliographies and references when writing essays and articles. It is produced by Thomson Reuters. (Taken from Wikipedia entry.)

For historical reasons EndNote's import/export is based on RIS, though EndNote does support a variety of options including plain text, RTF and HTML. It also has its own structured XML format.
 * insert sample*

Citation Style Language
CSL page in wikipedia

TaxPub
TaxPub is an extension of the NLM/NCBI Journal Archiving DTD for markup of taxonomic treatments. Terry Catapano's paper provides the background to TaxPub.

TaxonX
TaxonX is an XML schema for encoding taxonomic literature. The project's minimal home page points to the latest schema, which is version 1.2, dated 2008-07-08.

Citation formats in metadata descriptions
Comment on the fact that these are XML Formats.

Extreme simplicity of Dublin Core, the middle-way of MODS and the complexity of MARC.

Dublin Core
Dublin Core is a small set of metadata elements used for the basic description of resources. It was principally intended to be embedded in other, more deeply nested XML schemas. As such, Dublin Core is more a namespace than a fully-fledged data format. There is no container element for delimiting individual records, which is one of the limitations of the format for representing bibliographic data.

Fields/Attributes: There are 22 top-level elements and no nested ones.

Structure: depends completely on the schema in which Dublin Core is embedded.

Granularity: flat strings, person names are single strings

Audience/Purpose: originally developed for cataloguing, mostly in libraries.

Assessment: Too simplistic and coarse for import, export, and transport of bibliographic data, not to mention storage. The need for a host schema incurs additional development effort.
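The need for a host schema can be seen in a short Python sketch that wraps Dublin Core elements in a container. The container element name record is a hypothetical choice of ours, since Dublin Core itself defines none; the namespace URI is that of the Dublin Core element set, version 1.1.

```python
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("dc", DC_NS)


def dc_record(title, creators, date, identifier):
    """Build a host element holding flat Dublin Core elements."""
    # "record" is a hypothetical host element, not part of Dublin Core.
    record = ET.Element("record")
    ET.SubElement(record, f"{{{DC_NS}}}title").text = title
    for creator in creators:
        # Each person name is one flat string; there is no sub-structure.
        ET.SubElement(record, f"{{{DC_NS}}}creator").text = creator
    ET.SubElement(record, f"{{{DC_NS}}}date").text = date
    ET.SubElement(record, f"{{{DC_NS}}}identifier").text = identifier
    return record


dc_sample = dc_record(
    "Standard identifier could mobilize data and free time",
    ["Roberts, Dave", "Chavan, Vishwas"],
    "2008",
    "http://dx.doi.org/10.1038/453449c",
)
print(ET.tostring(dc_sample, encoding="unicode"))
```

Journal, volume and page information has no dedicated element in this set, which illustrates why the format is too coarse for transporting full bibliographic records.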

MODS
The Metadata Object Description Schema (MODS) is an XML Schema for, among other things, bibliographic records, derived from the widely used, but aged and highly complex MARC format. See the Wikipedia article and this list of MODS elements for a summary of this complex format. Most elements can be repeated, and MODS records can be nested within one another, e.g. to allow for the data on a book chapter to include the data on the book it belongs to. MODS is relatively flexible, with many optional attributes to its elements that can provide additional semantics. This means that the same data can be represented in different granularities.

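Sample:

<modsCollection xmlns="http://www.loc.gov/mods/v3">
  <mods>
    <titleInfo>
      <title>Standard identifier could mobilize data and free time</title>
    </titleInfo>
    <typeOfResource>text</typeOfResource>
    <genre>journalArticle</genre>
    <genre>periodical</genre>
    <name type="personal">
      <namePart type="family">Roberts</namePart>
      <namePart type="given">Dave</namePart>
      <role>
        <roleTerm type="code" authority="marcrelator">aut</roleTerm>
      </role>
    </name>
    <name type="personal">
      <namePart type="family">Chavan</namePart>
      <namePart type="given">Vishwas</namePart>
      <role>
        <roleTerm type="code" authority="marcrelator">aut</roleTerm>
      </role>
    </name>
    <relatedItem type="host">
      <part>
        <detail type="volume"><number>453</number></detail>
        <detail type="issue"><number>7194</number></detail>
        <extent unit="pages"><start>449</start><end>450</end></extent>
      </part>
      <originInfo>
        <dateIssued>print May 22, 2008</dateIssued>
      </originInfo>
      <identifier type="issn">0028-0836</identifier>
      <titleInfo>
        <title>Nature</title>
      </titleInfo>
      <titleInfo type="abbreviated">
        <title>Nature</title>
      </titleInfo>
    </relatedItem>
    <identifier type="doi">10.1038/453449c</identifier>
    <location><url>http://dx.doi.org/10.1038/453449c</url></location>
  </mods>
</modsCollection>

Elements: There are 17 top-level elements, mostly with multiple nested ones, mostly repeatable, see list of MODS elements for details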

Structure: XML structured records.

Granularity: well structured, person names can be single strings, but individual name parts can also be stored in separate elements.
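The two granularities for person names can be read with the same code, as the following Python sketch suggests. The embedded fragment is a hand-written illustration rather than output from any tool: the first name uses typed name parts, the second deliberately uses a single untyped string. The label "full" for untyped parts is our own convention.

```python
import xml.etree.ElementTree as ET

MODS_NS = {"m": "http://www.loc.gov/mods/v3"}

# Illustrative fragment mixing both granularities for person names.
MODS_SAMPLE = """<mods xmlns="http://www.loc.gov/mods/v3">
  <titleInfo><title>Standard identifier could mobilize data and free time</title></titleInfo>
  <name type="personal">
    <namePart type="family">Roberts</namePart>
    <namePart type="given">Dave</namePart>
  </name>
  <name type="personal">
    <namePart>Chavan, Vishwas</namePart>
  </name>
</mods>"""


def mods_names(xml_text):
    """Return one dict per <name>, keyed by the namePart type attribute."""
    root = ET.fromstring(xml_text)
    names = []
    for name in root.findall("m:name", MODS_NS):
        parts = {}
        for part in name.findall("m:namePart", MODS_NS):
            # Untyped name parts are stored under the key "full".
            parts[part.get("type", "full")] = (part.text or "").strip()
        names.append(parts)
    return names


mods_people = mods_names(MODS_SAMPLE)
print(mods_people)
```

A consumer therefore has to cope with both forms, which is why a few conventions regarding granularity are advisable when MODS is used for storage.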

Audience/Purpose: developed as a data exchange format for libraries in the US.

Assessment: Very well suited as an exchange format for bibliographic data between machines, due to the possibility of fine-grained data representation. Not so well suited as an export format, however, because nested XML syntax is rather verbose and hard to read for human users, and because of the lack (so far) of supporting tools; applications in libraries are probably mostly custom-built and proprietary. Not well suited as an import format, either, because the XML syntax is hard for human users to generate. Well suited for storage if broken down into a set of relational tables and used with a few conventions regarding granularity.

MARC
MARC stands for MAchine-Readable Cataloging.

Integration with existing bibliographic tools
Comparison

Also need to incorporate two sections of Guido's report: the Summary (now incorporated at the end of this report in the Interchange considerations section) and the Mapping on pages 3 and 4 of his report. The mapping (the table in Guido's report) is quite important since it attempts to capture the data that is lost when converting from one format to another.

Summary of citation formats
(Perhaps needs integrating with the previous section on Integration with existing bibliographic tools. Does this need to be one section that summarises and provides recommendations of which are the best bibliographic format(s) to use?)

Storage options
TODO - storage engines: OCLC, Mendeley, Zotero, CiteBank/GNUB, Drupal Biblio Module (Dauvit, Guido) – API, cost of accounts, availability (exists right now? will probably exist for how long?), supported data formats

TODO - others worth mentioning: Google Scholar, CiteSeer, … (all)

Biostor
More info

CiteBank
CiteBank and its underlying data model are still in active design. The data model will likely distinguish multiple sorts of entities, including references, journals, authors, and institutions. Each entity will have a GUID, by means of which it is linked to from other entities. The API is not yet specified and may vary across different implementations.

More info

CiteULike
CiteULike is browser-based, with no browser plug-ins required. It does not provide an API for machine access, so automated access will require page scraping. Nor is there any documentation of the internal data model used for storing references. Supported export formats are BibTeX and the RIS file format.

More info

Connotea
Connotea is a free online reference management service for scientists, researchers, and clinicians, created in December 2004 by Nature Publishing Group. It is one of a breed of social bookmarking tools, similar to CiteULike and del.icio.us, where users can save links to their favourite websites. (Taken from the Wikipedia entry.)

Connotea is a browser-based service, with no browser plug-ins being required. The API is rudimentary and in alpha state (version 0.1). Supported import/export formats are RIS, BibTeX, EndNote and MODS.

The site is intended for private, non-commercial use. We would have to negotiate with Macmillan Publishers Limited to use Connotea in ViBRANT. If we do want to talk to them then we would also have to discuss the current restriction on non-English language material.

Drupal Biblio module
The Drupal Biblio module, a.k.a. Drupal Scholar, supports multiple formats for the import and export of references, including BibTeX, the RIS file format, and XML. There is no documentation on the underlying internal storage format, but there is a comprehensive API.

GNUB
The GNUB Information Model, being developed for the GNUB (Global Names Usage Bank, an NSF-funded project), covers more than strictly bibliographic data: it also comprises taxonomic names and their usages in literature. The information model is abstract, namely modelled in Entity Relationship notation. Thus, records can take many representations, and there can be multiple implementations and storage engines. The bibliographic part of the GNUB model distinguishes three main entities: the publications themselves, authors, and institutions. Publications are recursively nested from individual articles or even parts of them up to entire journals. Likewise, institutions can be nested to express dependencies.

Fields/Attributes: no details available yet, but expected to be similar to, or more extensive than, those of MODS

Structure: abstract, nesting and recursion possible

Granularity: well structured, individual parts of person names will be in separate fields

Audience/Purpose: under development as a global infrastructure and data provider on taxonomic names and respective literature

Assessment: Apart from the possibly interactive parsing of bibliographies from legacy documents, the GNUB project aims at something very similar to our envisioned Bibliography of Life, but on an even larger scale and including the taxonomic part. However, the project has only just started, so it is far from certain when the information model will reach stability, when respective exchange formats will become available, and when GNUB hosts will be available in the envisioned numbers.

Mendeley
Mendeley provides client software and client plug-ins for many applications and platforms. There is no documentation on the internal data storage and data transfer formats. A comprehensive JSON-based REST API is available. Supported import/export formats include the RIS file format, BibTeX and EndNote™ XML.

More info

Papers
More info

Scratchpads
More info

Zotero
Zotero is a plug-in for Mozilla Firefox (with other browsers being in development), for MS Word, and OpenOffice Writer. It further provides a REST API for reading data, with updating functionality being planned. The internal data representation is rather complicated, consisting as it does of 56 relational tables. Data is stored locally, in an SQLite DB inside the Firefox plug-in folder.

Zotero can import from (and possibly export to) the following formats:
 * Zotero RDF (this is loss-less)
 * MODS (Metadata Object Description Schema)
 * BibTeX
 * RIS
 * Refer/BibIX
 * Unqualified Dublin Core RDF

More info

Interchange issues
(This section may need rewriting and the text below may not be appropriate. What we are looking for is a summary of what the storage issues are and recommendations of which storage system(s) we want to use.)

Formats for bibliographic data are available in a wide range of granularities, flexibilities, and scalabilities, from the extremely simplistic Dublin Core and the personal-use oriented BibTeX through the flat, but fine granular and scalable RIS to the highly flexible and rich MODS and the less flexible but even richer and broader GNUB information model. All have their advantages and drawbacks.

(TODO: revise this part once Dave & Dauvit have come up with their requirements.)

Storage requires a scalable and fine granular data format, and optimally one that allows for nesting. MODS or the GNUB information model in combination with a respective relational representation would be ideal here.

Data transfer between machines requires the same granularity as storage, while nesting and scalability are less important, as records can be transferred one after the other and in small portions. MODS appears to be the best choice in this regard.

Data import, especially when done by means of parsing bibliographies from legacy documents, requires a data format that is easy to handle and to generate for human users, yet sufficiently fine granular to fully use the expressiveness of the storage format. Thus, BibTeX or even Dublin Core would be good import formats.

Data export requires a data format that is easy to read for human users and interoperable with as many tools as possible. Thus, BibTeX is probably the best choice for export.
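The relationship between a fine granular, nested storage record and a flat BibTeX export can be sketched in Python as follows. The dictionary layout (authors, host, and so on) is a hypothetical stand-in for whatever storage model is eventually chosen, and the function to_bibtex only handles journal articles.

```python
def to_bibtex(key, record):
    """Flatten a nested, fine granular record into a BibTeX @article entry."""
    # Structured name parts collapse into one "and"-separated string.
    authors = " and ".join(
        f"{a['family']}, {a['given']}" for a in record["authors"]
    )
    host = record.get("host", {})  # the containing journal issue
    fields = {
        "title": record.get("title"),
        "author": authors,
        "journal": host.get("title"),
        "volume": host.get("volume"),
        "number": host.get("issue"),
        "year": record.get("year"),
        "pages": record.get("pages"),
    }
    body = ",\n".join(
        f"  {name} = {{{value}}}" for name, value in fields.items() if value
    )
    return f"@article{{{key},\n{body}\n}}"


# Hypothetical storage record with separate name parts and a nested host.
stored = {
    "title": "Standard identifier could mobilize data and free time",
    "authors": [
        {"family": "Roberts", "given": "Dave"},
        {"family": "Chavan", "given": "Vishwas"},
    ],
    "year": "2008",
    "pages": "449--450",
    "host": {"title": "Nature", "volume": "453", "issue": "7194"},
}

exported = to_bibtex("roberts_standard_2008", stored)
print(exported)
```

Going in this direction is straightforward; the reverse direction has to recover name parts and nesting from flat strings, which is where information is at risk of being lost.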

Recommendations
Overall recommendations that we want to take forward.