
= M7.27 – Publish ViBRANT NLP corpus =


 * Due: 31 October 2013
 * Delivered: 24 October 2013†
 * Purpose: A substantial gold standard corpus is required to develop, refine and assess our data mining work, and the similar work of others aiming to mine biodiversity texts.
 * Benefit: No such corpus currently exists. This milestone addresses the community need for this building block to enable the development and evaluation of text mining tools for legacy biodiversity literature.

† The date recorded here is when the milestone report was begun. The corpus has been available for some time.

Access
Our corpus is available from ViBRANT's git repository at https://git.scratchpads.eu/v/vibrantcorpus.git.

Screenshot of corpus log in ViBRANT's git repository.

It can be downloaded following the instructions at https://git.scratchpads.eu/v/.

Because it is hosted in git, the corpus can accept updates and additions from outside the ViBRANT project in a controlled manner. This supports ViBRANT's sustainability plan of keeping its resources available after project completion.

Licence
As with all content produced by the ViBRANT project, the corpus is released under a Creative Commons CC0 licence.

Sample content
These are the first few lines of c136.txt from aves_v1 (downloadable from https://git.scratchpads.eu/v/vibrantcorpus.git/blob/HEAD:/aves_v1/c136.txt). This is clean, re-keyed text.

136 MNIOTILTIDÆ.

12. Dendrœca decora. (Tab. X. fig. 1.)

Dendrœca graciæ, var. decora, Ridgw. Am. Nat. vii. p. 6081; Baird, Brew. & Ridgw. N. Am. B. i. p. 2402; Coues, B. Col. Vall. i. p. 2923.

Dendrœca graciæ, Salv. Ibis, 1873, p. 4284; Lawr. Bull. U.S. Nat. Mus. no. 4, p. 1655.

Dendrœca decora, Salv. Cat. Strickl. Coll. p. 926.

Supra cinerea, pilei antici plumis in medio nigris; alis et cauda fusco-nigris cinereo limbatis, illis vix pallide cinereo bifasciatis, hujus rectricibus tribus utrinque externis plaga alba gradatim latius notatis; superciliis a naribus, ciliis ipsis, macula suboculari et gutture toto læte flavis; corpore reliquo lactescenti-albo, hypochondriis cinerascentibus vix nigro striatis; rostro nigricante, pedibus corylinis. Long. tota 4, alœ 2·2, caudæ 1·8, rostri a rictu 0·55, tarsi 0·6. (Descr. exempl. ex Guatemala. Mus. Acad. Cantabr.)

Hab. MEXICO, near Zapotitlan (Sumichrast5); BRITISH HONDURAS, Belize (C. Wood1 3), GUATEMALA (Constancia6, Mus. Soc. Econ.4).

Dendrœca decora is a near ally of D. graciæ, a species of New Mexico and Arizona discovered some years ago by Dr. Coues. The differences observable between the two birds are slight, and have been treated by American ornithologists as indicating that their possessors are varieties only one of another and not distinct species. This may prove to be the case; but at present no intermediate links have been discovered blending the two races, nor do we think it very probable that such now exist; and for this reason we prefer to treat D. decora as distinct.

Accompanying the text is an annotation file, identifying taxonomic names and their rank. The matching file for the text above is c136.ann. Here are the relevant annotations for the sample text.

T1 family 4 15 MNIOTILTIDÆ

T2 genus 21 29 Dendrœca

T3 specificepithet 30 36 decora

T4 genus 56 64 Dendrœca

T5 specificepithet 65 71 graciæ

T6 infraspecificrank 73 77 var.

T7 infraspecificepithet 78 84 decora

T8 genus 193 201 Dendrœca

T9 specificepithet 202 208 graciæ

T10 genus 280 288 Dendrœca

T11 specificepithet 289 295 decora

T12 genus 995 1003 Dendrœca

T13 specificepithet 1004 1010 decora

T14 genus-abbrev 1029 1031 D.

T15 specificepithet 1032 1038 graciæ

T16 genus-abbrev 1528 1530 D.

T17 specificepithet 1531 1537 decora
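
The annotation lines above can be read with a few lines of Python. This is a minimal sketch, not part of the corpus tooling. It assumes that the five fields (identifier, rank label, start offset, end offset, annotated text) are separated by whitespace, that offsets count characters rather than bytes, and that lines of the text file are separated by a single newline. All three assumptions are inferred from the sample offsets rather than from corpus documentation. The names `Annotation` and `parse_ann_line` are illustrative only.

```python
from typing import NamedTuple


class Annotation(NamedTuple):
    ann_id: str  # e.g. "T1"
    label: str   # e.g. "family", "genus", "specificepithet"
    start: int   # offset of the first character of the name
    end: int     # offset one past the last character of the name
    text: str    # the annotated string itself


def parse_ann_line(line: str) -> Annotation:
    """Parse one annotation line such as 'T1 family 4 15 MNIOTILTIDÆ'."""
    # Split on any whitespace (spaces or tabs); the fifth field keeps the
    # remainder of the line, so a multi-word name would stay intact.
    ann_id, label, start, end, text = line.strip().split(None, 4)
    return Annotation(ann_id, label, int(start), int(end), text)


# First two lines of c136.txt joined by one newline (an assumption inferred
# from the offsets; in practice the file would be read from disk as UTF-8).
clean_text = "136 MNIOTILTIDÆ.\n12. Dendrœca decora. (Tab. X. fig. 1.)"

ann_lines = [
    "T1 family 4 15 MNIOTILTIDÆ",
    "T2 genus 21 29 Dendrœca",
    "T3 specificepithet 30 36 decora",
]

annotations = [parse_ann_line(line) for line in ann_lines]
for ann in annotations:
    # Slicing the decoded string with the offsets should recover the name.
    assert clean_text[ann.start:ann.end] == ann.text
    print(ann.ann_id, ann.label, clean_text[ann.start:ann.end])
```

Checking each offset pair against the text in this way is a cheap sanity test to run before using the annotations for evaluation.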

The ViBRANT corpus, though, is more than just another set of gold standard marked-up texts: for each re-keyed clean text we also provide the OCR text that is available for download from the BHL.

This is the OCR text for the sample text above, taken from d136.txt (d for dirty as opposed to c for clean).

136 MNIOTILimE.

12. Dendroeca decora. (Tab. X. fig. 1.)

Dendroeca grades, var. decora, Ridgw. Am. Nat. vii. p. 608

1

; Baird, Brew. & Ridgw. N. Am. B. i.

p. 240

2

; Cones, B. Col. Vail. i. p. 292

3

.

Dendroeca gratia, Salv. Ibis, 1873, p. 428

4

; Lawr. Bull. U.S. Nat. Mus. no. 4, p. 16

5

.

Dendroeca decora, Salv. Cat. Strickl. Coll. p. 92

6

.

Supra cinerea, pilei antici plumis in medio nigris ; alis et cauda fusco-nigris cinereo limbatis, illis vix pallide

cinereo bifasciatis, bujus rectricibus tribus utrinque externis plaga alba gradatim latius notatis ; supereiliis a

naribus, ciliis ipsis, macula suboculari et gutture toto laete flavis ; corpore reliquo lactescenti-albo, bypo-

cbondriis cinerascentibus vix nigro striatis ; rostro nigricante, pedibus corylinis. Long, tota 4, alae 2-2,

caudse 1*8, rostri a rictu 0-55, tarsi 0*6. (Descr. exempl. ex Guatemala. Mus. Acad. Cautabr.)

Hob. Mexico, near Zapotitlan (Sumichrast 5

) ; Bkitish Hondueas, Belize (C. Wood 13

),

Guatemala (Constancia

6

, Mus. Soc. Econ.*).

Dendroeca decora is a near ally of D. gracice, a species of New Mexico and Arizona

discovered some years ago by Dr. Coues. The differences observable between the two

birds are slight, and have been treated by American ornithologists as indicating that

their possessors are varieties only one of another and not distinct species. This may

prove to be the case ; but at present no intermediate links have been discovered blending

the two races, nor do we think it very probable that such now exist ; and for this reason

we prefer to treat D. decora as distinct.

These are the relevant annotations taken from annotation file d136.ann.

T1 family 8 18 MNIOTILimE

T2 genus 26 35 Dendroeca

T3 specificepithet 36 42 decora

T4 genus 64 73 Dendroeca

T5 specificepithet 74 80 grades

T6 infraspecificrank 82 86 var.

T7 infraspecificepithet 87 93 decora

T8 genus 218 227 Dendroeca

T9 specificepithet 228 234 gratia

T10 genus 316 325 Dendroeca

T11 specificepithet 326 332 decora

T12 genus 1067 1076 Dendroeca

T13 specificepithet 1077 1083 decora

T14 genus-abbrev 1102 1104 D.

T15 specificepithet 1105 1112 gracice

T16 genus-abbrev 1614 1616 D.

T17 specificepithet 1617 1623 decora

These files enable the development and evaluation of taxonomic name processing tools for both clean and dirty texts. This is especially important for ViBRANT because it opens up the legacy literature: in future projects, we can consider processing OCR-digitised texts with the same tool kit as born-digital literature.

We can see the value of this corpus from a non-scientific review of the sample. Consider the first annotated term, MNIOTILTIDÆ, which is rendered as MNIOTILimE in the OCR, while the second term, Dendrœca, is rendered as Dendroeca. The problems are quickly apparent. In contrast, the third term, decora, is rendered accurately, and so on.
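
The same informal review can be made slightly more systematic by pairing the clean and OCR annotations and counting exact matches. The sketch below is illustrative only. It assumes that annotations with the same identifier (T1 to T17) refer to the same name in both files, which holds for this sample but is not guaranteed elsewhere in the corpus. The name strings are copied from the c136.ann and d136.ann listings above; `pairs` and `exact_matches` are hypothetical names.

```python
# (identifier, name in clean text, name in OCR text), copied from the
# c136.ann and d136.ann samples shown in this report.
pairs = [
    ("T1", "MNIOTILTIDÆ", "MNIOTILimE"),
    ("T2", "Dendrœca", "Dendroeca"),
    ("T3", "decora", "decora"),
    ("T4", "Dendrœca", "Dendroeca"),
    ("T5", "graciæ", "grades"),
    ("T6", "var.", "var."),
    ("T7", "decora", "decora"),
    ("T8", "Dendrœca", "Dendroeca"),
    ("T9", "graciæ", "gratia"),
    ("T10", "Dendrœca", "Dendroeca"),
    ("T11", "decora", "decora"),
    ("T12", "Dendrœca", "Dendroeca"),
    ("T13", "decora", "decora"),
    ("T14", "D.", "D."),
    ("T15", "graciæ", "gracice"),
    ("T16", "D.", "D."),
    ("T17", "decora", "decora"),
]


def exact_matches(pairs):
    """Return the identifiers whose OCR string equals the clean string."""
    return [ann_id for ann_id, clean, dirty in pairs if clean == dirty]


matched = exact_matches(pairs)
print(f"{len(matched)} of {len(pairs)} names survive OCR unchanged")

# List the names that OCR altered, for manual inspection.
for ann_id, clean, dirty in pairs:
    if clean != dirty:
        print(f"{ann_id}: {clean} -> {dirty}")
```

For this sample, 8 of the 17 annotated strings are identical in the clean and OCR text. Exact string equality is a deliberately strict measure: it treats the ligature difference between Dendrœca and Dendroeca as an error in the same way as the misreading of graciæ as grades.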

The biodiversity informatics and natural language processing communities now have a reliable data source to work on.