RefPars

= Reference parsing =

Introduction
The purpose of this page is to document a review of currently available tools.

Further searches for tools and papers are planned. Keywords to consider: bibliographic, citation, matching, parsing, reference.

Notes
 * bibliography parsing - no good, returns bibliographies on general parsing
 * bibliographic parsing - much better, produces:
   * Bibliographic attribute extraction from erroneous references based on a statistical model
   * Bibliographic component extraction from references based on a text recognition error model
   * The ADS bibliographic reference resolver
   * Autobib: Automatic extraction of bibliographic information on the web
   * Locating and parsing bibliographic references in HTML medical articles
   * Bibliographic attributes extraction with layer-upon-layer tagging

Bibliography managers
A useful collection of tools for bibliography management, with short reviews. The list is not exhaustive, but it mentions one or two tools that are not widely known.


 * https://digitalresearchtools.pbworks.com/w/page/17801648/Citation%20Management%20Tools

Parsing tools
What's out there.

Reference Parser
See David Shorthouse's blog, http://ispiders.blogspot.com/2010/08/reference-parser-revived.html.

The tool itself is at http://refparser.shorthouse.net// where David describes it as:

"This jQuery plugin gives visitors of your pages quick access to a web service that parses verbatim journal article citations then gives them a link to the publisher's resource if the parsing is successful. It works by making a secondary web service call to CrossRef's OpenURL service. The plugin is especially useful if the reference citations you serve are user-generated, varied in format, or may be discoverable at some indefinite time in the future (e.g. a society's back issues scanned and later assigned DOIs)."

The code components are downloadable from:


 * http://refparser.shorthouse.net//jquery.refparser.js
 * http://refparser.shorthouse.net//code.tar.gz
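
The secondary call to CrossRef's OpenURL service mentioned in the description can be pictured roughly as follows. This is a minimal sketch and not the plugin's actual code: the endpoint, the parameter names (`pid`, `aulast`, `title`, `volume`, `spage`, `date`) and the helper `build_openurl_query` are assumptions to be checked against CrossRef's documentation, and the e-mail address is a placeholder. The sketch only builds the query URL and sends no request.

```python
from urllib.parse import urlencode

# Assumed endpoint and parameter names for CrossRef's OpenURL service;
# check CrossRef's documentation before relying on them.
OPENURL_ENDPOINT = "https://www.crossref.org/openurl"


def build_openurl_query(fields: dict, pid: str) -> str:
    """Build an OpenURL lookup from parsed citation fields (no request is sent)."""
    # Only the first page of the range is used for the lookup.
    first_page = fields["pages"].split("-")[0].strip()
    params = {
        "pid": pid,                  # account e-mail (placeholder here)
        "aulast": fields["aulast"],  # first author's surname
        "title": fields["journal"],  # journal title, not article title
        "volume": fields["volume"],
        "spage": first_page,
        "date": fields["year"],
    }
    return OPENURL_ENDPOINT + "?" + urlencode(params)


# Sample values standing in for a parser's output.
parsed = {
    "aulast": "Enderlein",
    "journal": "Stettiner Entomologische Zeitung",
    "volume": "67",
    "pages": "306-316",
    "year": "1906",
}
print(build_openurl_query(parsed, pid="someone@example.org"))
```

Keeping the URL construction separate from the request makes it easy to inspect what would be sent before any usage limits come into play.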

Biblio Citation Parser
A Perl module, Biblio::Citation::Parser, available from CPAN: http://search.cpan.org/~mjewell/Biblio-Citation-Parser-1.10/lib/Biblio/Citation/Parser/Jiao.pm

ParsCit
The homepage is at http://aye.comp.nus.edu.sg/parsCit/, where it states:

"This is the home page of the ParsCit project, which performs two tasks: 1) reference string parsing, sometimes also called citation parsing or citation extraction, and 2) logical structure parsing of scienfific [sic] documents. It is architected as a supervised machine learning procedure that uses Conditional Random Fields as its learning mechanism. You can download the code below, parse strings online, or send batch jobs to our web service. The code contains both the training data, feature generator and shell scripts to connect the system to a web service (used on this web site)."

It was originally developed jointly by Penn State and the National University of Singapore. Only the latter maintains it now, but it is still an active project.

Results

Generally ParsCit has the best performance of any parser tried so far.

ENDERLEIN, G. 1906c. Zehn neue aussereuropäische Copeognathen. Stettiner Entomologische Zeitung 67: 306-316, 1 fig.

ParsCit does a perfect job, but only after I cleaned up the data. On the first pass the en-dash between the page numbers was lost when the reference was transferred to ParsCit, so 'fig' was identified as the pages. Replacing the en-dash with a hyphen gives the accurate results below:
 * author: G ENDERLEIN
 * volume: 67
 * date: 1906
 * title: Zehn neue aussereuropäische Copeognathen.
 * journal: Stettiner Entomologische Zeitung
 * pages: 306-316
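
The manual clean-up described above could be automated before references are submitted. A minimal sketch, assuming Python is used for pre-processing; `normalise_dashes` is a hypothetical helper and not part of ParsCit:

```python
import re

# Unicode dash characters that commonly appear in page ranges:
# hyphen, non-breaking hyphen, figure dash, en-dash, em-dash, minus sign.
DASHES = "\u2010\u2011\u2012\u2013\u2014\u2212"


def normalise_dashes(reference: str) -> str:
    """Replace en-dashes and similar characters with an ASCII hyphen."""
    return re.sub(f"[{DASHES}]", "-", reference)


raw = "Stettiner Entomologische Zeitung 67: 306\u2013316, 1 fig."
print(normalise_dashes(raw))
```

Text that already uses plain hyphens passes through unchanged, so the helper can be applied to every reference.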

P F Mattingly, A Stone, K L Knight (1962) Culex aegypti Linnaeus, 1762 (Insecta, Diptera): proposed validation and interpretation under the plenary powers of the species so named. Z. N. (S.) 1216. Bulletin of Zoological Nomenclature 19: 208 - 219

ParsCit does a very good job, though it muddles the title and journal: the string "Z. N. (S.) 1216." ends up in the journal field.
 * author: P F Mattingly, A Stone, K L Knight
 * volume: 19
 * year: 1962
 * title: Culex aegypti Linnaeus, 1762 (Insecta, Diptera): proposed validation and interpretation under the plenary powers of the species so named.
 * journal: Z. N. (S.) 1216. Bulletin of Zoological Nomenclature
 * pages: 208--219

CiteSeerX
http://sourceforge.net/projects/citeseerx/

FreeCite
Available from http://freecite.library.brown.edu/

There it is described as: 'FreeCite is an open-source application that parses document citations into fielded data. You can use it as a web application or a service. You can also download the source and run FreeCite on your own server. FreeCite is distributed under the MIT license.'

The FreeCite page has links to a dataset (the CORA dataset) that was used to train FreeCite.

Results

Generally FreeCite does very well, but never perfectly.

ENDERLEIN, G. 1906c. Zehn neue aussereuropäische Copeognathen. Stettiner Entomologische Zeitung 67: 306-316, 1 fig.

FreeCite has conflated the title and journal name (possibly because both are in German?) and used the figure count as the volume.
 * authors: G ENDERLEIN
 * title: Zehn neue aussereuropäische Copeognathen. Stettiner Entomologische Zeitung 67
 * volume: 1
 * pages: 306-316
 * year: 1906

P F Mattingly, A Stone, K L Knight (1962) Culex aegypti Linnaeus, 1762 (Insecta, Diptera): proposed validation and interpretation under the plenary powers of the species so named. Z. N. (S.) 1216. Bulletin of Zoological Nomenclature 19: 208 - 219

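FreeCite has truncated the title and moved its last word, together with "Z. N. (S.) 1216.", into the authors field. The page range as recorded (208--209) also does not match the 208 - 219 of the input. The journal, volume and year are correct.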
 * authors: P F Mattingly A Stone K L Knight named Z N 1216
 * title: Culex aegypti Linnaeus, 1762 (Insecta, Diptera): proposed validation and interpretation under the plenary powers of the species so
 * journal: Bulletin of Zoological Nomenclature
 * volume: 19
 * pages: 208--209
 * year: 1962
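
Comparisons like these can be made repeatable by checking each parser's output field by field against a trusted set of values. A minimal sketch, assuming Python: it uses the ParsCit values for the Enderlein reference (described on this page as accurate) as the reference set, and maps the differing field names ("date" and "authors") onto common keys by hand. `compare_fields` is a hypothetical helper.

```python
# Field values copied from the Enderlein results listed on this page.
parscit = {
    "author": "G ENDERLEIN",
    "title": "Zehn neue aussereuropäische Copeognathen.",
    "journal": "Stettiner Entomologische Zeitung",
    "volume": "67",
    "pages": "306-316",
    "year": "1906",  # ParsCit reports this as "date"
}
freecite = {
    "author": "G ENDERLEIN",  # FreeCite reports this as "authors"
    "title": "Zehn neue aussereuropäische Copeognathen. "
             "Stettiner Entomologische Zeitung 67",
    "volume": "1",
    "pages": "306-316",
    "year": "1906",
}


def compare_fields(reference: dict, candidate: dict) -> dict:
    """Map each reference field to 'match', 'differs' or 'missing'."""
    report = {}
    for field, expected in reference.items():
        if field not in candidate:
            report[field] = "missing"
        elif candidate[field] == expected:
            report[field] = "match"
        else:
            report[field] = "differs"
    return report


for field, status in compare_fields(parscit, freecite).items():
    print(f"{field}: {status}")
```

Exact string comparison is deliberately strict; a looser test (for example ignoring case or trailing punctuation) would be a separate design decision.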

ParaCite
Available at http://paracite.eprints.org/

The following description is taken from http://paracite.eprints.org/about.html

'ParaCite is an experimental service, being designed at the University of Southampton, for the location of articles from raw references. When a reference is passed to the service, it is split into its component parts (e.g. author, title, year), and transferred to the search resource. Based on the subject area, and the data provided, a set of resources is presented that the system believes have the highest probability of providing the full text article at no charge.'

Its training data is available at http://paracite.eprints.org/cgi-bin/reflist.cgi.

Citation metadata extraction tool from the California Digital Library
This uses Hidden Markov Models. The code can be downloaded from http://gales.cdlib.org/~egh/hmm-citation-extractor/ and a presentation describing the tool is available at http://gales.cdlib.org/~egh/hmm-citation-extractor/jcdl2008-slides.pdf

Google Code
Gupta, D., Morris, B., Catapano, T. and Sautter, G. (2009), 'A new approach towards bibliographic reference identification, parsing and inline citation matching', Contemporary Computing: Communications in Computer and Information Science, 40(2), pp. 93--102, DOI: 10.1007/978-3-642-03547-0_10, downloaded from http://193.27.218.161:8080/dspace/bitstream/10199/19094/1/GuptaEtAl.pdf, last accessed November 2011.

CrossRef's DOI retriever
CrossRef has a form for retrieving DOIs for bibliographic references. However, the simple text query form has usage limits to prevent bulk use. This will be a problem for us, but CrossRef state that other options are possible. The form is at http://www.crossref.org/SimpleTextQuery/

Paperbase
An old product from Wight Scientific, who no longer support it. It was ported to Apple II BASIC by Dave Roberts, who states: "…it was remarkably successful [at] getting probable hits, some false positives, but it missed very few things. It was intended to run on manuscripts prepared in a word-processor and link the in-text citation with its database. Then EndNote arrived."

We have Dave's source code, which could form the basis for an up-to-date tool.

Reformatting Tools
A catalogue of slightly different tools. These reformat references, so they have a role to play downstream in the workflow.

bibliograph.parsing
This is a suite of parsers: "Each parser accepts input from a given bibliographic reference format and outputs a list of python dictionaries, one for each entry listed in the input source." It is downloadable from http://pypi.python.org/pypi/bibliograph.parsing/1.0.0
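
To illustrate the "list of python dictionaries, one for each entry" shape, here is a minimal sketch of a parser for a simplified subset of the RIS format. It is not the bibliograph.parsing API: `parse_ris` is a hypothetical function, and the dictionary keys (raw RIS tags) are an assumption rather than the keys the package actually uses.

```python
import re

# Matches a RIS line: two-character tag, two spaces, hyphen, optional value.
RIS_LINE = re.compile(r"^([A-Z][A-Z0-9])  - ?(.*)$")


def parse_ris(text: str) -> list:
    """Parse simplified RIS text into a list of dictionaries, one per entry."""
    entries, current = [], {}
    for line in text.splitlines():
        match = RIS_LINE.match(line.strip())
        if not match:
            continue  # skip blank or unrecognised lines
        tag, value = match.groups()
        if tag == "ER":  # end of record
            entries.append(current)
            current = {}
        else:
            # Repeated tags (e.g. several AU lines) accumulate in a list.
            current.setdefault(tag, []).append(value)
    return entries


sample = """TY  - JOUR
AU  - Enderlein, G.
PY  - 1906
TI  - Zehn neue aussereuropäische Copeognathen
JO  - Stettiner Entomologische Zeitung
VL  - 67
SP  - 306
EP  - 316
ER  -
"""
print(parse_ris(sample))
```

Storing every value as a list keeps multi-author entries simple, at the cost of one-element lists for single-valued fields.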

bibutils
"The bibutils program set interconverts between various bibliography formats using a common MODS-format XML intermediate." Downloadable from http://sourceforge.net/p/bibutils/home/Bibutils/.

Conclusion
like it says…