M712report

=M7.12=

A suite of test cases that will be used to test the de-duplication software

Building the test collection
Q. How should we create this collection of references?

A. There are two approaches:


 * 1) take a reference collection and artificially create duplicates and near-duplicates by editing its references. This process should mirror, as far as possible, the duplicates and near-duplicates found in real reference collections.
 * 2) ask for collections of references and analyse them for duplicates and near-duplicates.
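
The first approach could be sketched roughly as below. This is only an illustration: the function name make_near_duplicates, the dict layout of a reference, the choice of edits (exact copy, abbreviated forenames, a dropped character, a case change) and the sample reference are all assumptions made for this example, and real collections would need to be surveyed to decide which edits are realistic.

```python
import random


def make_near_duplicates(reference, seed=0):
    """Return (label, edited_copy) pairs derived from one reference dict.

    Hypothetical helper: the edits below are illustrative guesses at the
    kinds of variation seen in real collections, not a surveyed list.
    """
    rng = random.Random(seed)  # seeded so the generated suite is reproducible
    variants = []

    # Exact duplicate: every field unchanged.
    variants.append(("exact", dict(reference)))

    # Forenames abbreviated to initials: "Doe, Jane Ann" -> "Doe, J. A."
    surname, _, forenames = reference["author"].partition(", ")
    initials = " ".join(name[0] + "." for name in forenames.split())
    variants.append(
        ("author_initials", {**reference, "author": f"{surname}, {initials}"})
    )

    # Typing error: drop one character from the title.
    title = reference["title"]
    i = rng.randrange(len(title))
    variants.append(
        ("title_typo", {**reference, "title": title[:i] + title[i + 1:]})
    )

    # Capitalisation change: title in upper case.
    variants.append(("title_case", {**reference, "title": title.upper()}))

    return variants


# Made-up reference used purely as an example.
sample = {
    "author": "Doe, Jane Ann",
    "title": "A hypothetical revision of an example genus",
    "year": "1999",
}
for label, variant in make_near_duplicates(sample):
    print(label, variant)
```

Because each variant carries a label naming the edit that produced it, the known error and the corrected (original) reference stay paired.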

We should have two suites of test cases:
 * 1) A unit-test style set of references which we will create artificially, so that we have a full set of known errors and the corrected references. (If written in Python, these could be formal test cases, e.g. using unittest, Python's counterpart to JUnit.)
 * 2) A real-world or integration-style test using data and corrections supplied by practicing taxonomists. This could be developed into a JUnit-style framework too. Potential contributors include Vince Smith (his lists of bibliographies), Rod Page, and readers and contributors to the TDWG mailing list.
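
As a minimal sketch of what the first, unit-test style suite might look like with Python's standard unittest module: the functions normalise and are_duplicates are hypothetical stand-ins for the de-duplication software (which is not specified here), and the reference data is made up.

```python
import unittest


def normalise(reference):
    """Stand-in normalisation: lower-case each field and collapse whitespace."""
    return tuple(
        " ".join(str(reference.get(field, "")).lower().split())
        for field in ("author", "title", "year")
    )


def are_duplicates(a, b):
    """Placeholder for the real de-duplication check (an assumption)."""
    return normalise(a) == normalise(b)


# Made-up reference; the real suite would load its artificial collection.
ORIGINAL = {
    "author": "Doe, Jane Ann",
    "title": "A hypothetical revision of an example genus",
    "year": "1999",
}


class KnownDuplicateTests(unittest.TestCase):
    def test_exact_copy_is_duplicate(self):
        self.assertTrue(are_duplicates(ORIGINAL, dict(ORIGINAL)))

    def test_case_and_spacing_variant_is_duplicate(self):
        variant = {
            **ORIGINAL,
            "title": "  A HYPOTHETICAL  revision of an example genus ",
        }
        self.assertTrue(are_duplicates(ORIGINAL, variant))

    def test_different_year_is_not_duplicate(self):
        other = {**ORIGINAL, "year": "2001"}
        self.assertFalse(are_duplicates(ORIGINAL, other))
```

Saved as a module, this can be run with python -m unittest. The integration-style suite could reuse the same structure, with the hand-written cases replaced by data and corrections supplied by contributors.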

The output should be in a known standard such as BibTeX so that we can run simple searches and so that the input and output are human-readable.
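
To illustrate the point about BibTeX, here is a small sketch of writing a reference out as a BibTeX entry and searching it as plain text. The function to_bibtex, the citation key and the sample reference are assumptions made for this example; the sketch does not escape special characters such as braces in field values, which a real exporter would need to handle.

```python
def to_bibtex(key, reference, entry_type="article"):
    """Serialise a reference dict as one BibTeX entry (hypothetical helper).

    Fields are written in alphabetical order so the output is stable
    and easy to compare between runs.
    """
    fields = ",\n".join(
        f"  {name} = {{{value}}}" for name, value in sorted(reference.items())
    )
    return f"@{entry_type}{{{key},\n{fields}\n}}"


# Made-up reference used purely as an example.
sample = {
    "author": "Doe, Jane Ann",
    "title": "A hypothetical revision of an example genus",
    "year": "1999",
}

entry = to_bibtex("doe1999", sample)
print(entry)

# Because the output is plain text, a simple search is a substring test.
print("example genus" in entry)
```

The same plain-text property means that ordinary tools such as a text editor's search can be used on both the input and the output of the test runs.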