Zookeys-paper

= Bibliography of Life =

Or, as we prefer, the Bibliography In The Sky, a name chosen to make the issues, approaches and solutions generalisable.

Purpose:  review the state of the art.

Inspired by: Rod's musings and a draft paper by Alistair.

'Note: This is a working draft of the paper. It is not the final submitted version.'

What is it
If we run a Google search for "Bibliography of Life" as our starting point (and why not, these days?), then at the top of the results list is Rod Page's blog post from October 2010, Mendeley, BHL and the "Bibliography of Life" (Page, 2010). In his post, Rod offers this definition:

"bibliography of life," a freely accessible bibliography of every taxonomic paper ever published.

The principle of freely accessible bibliographies already exists in taxonomy, although the existing bibliographies each focus on a particular domain, such as ants (Antbase) or fish (Fishbase). The core idea of the Bibliography of Life is to employ the same technology as these existing bibliographies, but on a far more ambitious scale: the domain covered by this bibliography is to be the whole of taxonomy.

There is a precedent for this ambition. In the domain of Computer Science, the Digital Bibliography & Library Project evolved from a small specialized bibliography to a digital library covering most sub-domains of computer science (Ley, 2009; DBLP). The increase in scope was driven by the library's users. From small beginnings, the bibliography now lists more than 1,700,000 publications (September 2011). The DBLP is hosted by the Universität Trier, in Germany. At a larger scale and in a different discipline, biomedical science, PubMed is a well-known database that provides free access to the MEDLINE database of references and abstracts (PubMed). The PubMed database is maintained by the United States National Library of Medicine (NLM).

There appears to be a similar drive in taxonomy to produce a comprehensive library and matching bibliography. We do not see commercial organisations rising to this challenge. For while there are excellent resources, such as Thomson Reuters' BIOSIS (BIOSIS), the focus in these resources is on modern, generally born-digital material, which is both relatively easy to process and potentially revenue earning through copyright access charges. Taxonomic literature has a different emphasis compared to other sciences because older documents remain relevant. Hence the number of digitisation projects, such as the Biodiversity Heritage Library (BHL), that exist to bring old paper documents into the digital age. There remains the problem, however, of producing a comprehensive bibliography to the newly digitised documents. We suggest there is a problem because, while the concept of a bibliography of life might be easy to define, the simple fact that it does not exist indicates there are practical difficulties with the idea. This article explores some of those difficulties, and possible solutions.

The issues around producing a comprehensive bibliography embracing many sub-domains of a broad research area are not confined to taxonomy. To that end we prefer the term Bibliography In The Sky, as used by ViBRANT's PI, Vince Smith, to define our challenge. This name indicates that the tools we produce can find a use outside of taxonomy too. We will use the term BITS in the rest of this article.

Why is it needed
The immediate benefit of a BITS is the potential to enhance the quality of current work by making it easier for authors to include valid links to original scientific literature. However, the current specialist bibliographic databases can facilitate that too within their own domains. BITS, in contrast, will have a comprehensive nature. This will allow researchers more easily to ask new questions, especially for cross-domain research. For example, work on climate change and on invasive species could draw on literature from across different geographic areas and across taxonomic specialities. Having a single point of reference will not only speed up the work of the researcher, but may lead to serendipitous findings too as the researchers review the integrated results.

What it is not
Google (Google) is seemingly almost all-conquering in terms of popular search on the Internet. Its specialist academic derivative, Google Scholar (Google Scholar), is very popular too, based on the non-scientific data gathering method of peering over researchers' shoulders. Yet these two search engines are not the solution to providing a BITS. Google and Google Scholar are harvesters: they can only work on what is available to harvest. Private and personal bibliographies will not contribute to their results, either in terms of breadth of coverage or accuracy of information. This weakness is particularly evident with taxonomic literature, as many important documents are only now being digitised. All too often, until a document is available on-line it is not visible to on-line search engines such as these.

Another weakness of these on-line search engines, from our perspective, is that they serve a different purpose. Google Scholar, for example, is aimed at helping researchers find articles, or related items such as patent applications. Searches are based on authors or expected key words. If searching for keywords in the article itself, an overwhelming number of results can be returned, and defining a discriminating search query can be an arduous task. The relevance of the results is also affected by the granularity of the reference returned, especially when dealing with volumes, books or journals. It would be far more productive for the researcher if the results referred directly to the relevant article, say, rather than to the volume in which the article is found.

There are several approaches to addressing these weaknesses. One is to establish a domain-specific, academic reference collection. Typically, though, these resources are limited by geography or scientific domain. The constraint can arise from many sources, such as funding body limitations or the personal research interests of the bibliography builder. The result is that many smaller bibliographies have been built. These are useful within their domain, but do not provide an easy means for a researcher to gain an overview of the domain, or to work across domains. Another frequent issue with the smaller bibliographies is that there are resources neither to amend incorrect data nor to add new data once the original funding that created the bibliography runs out.

There are, however, small academic databases that are edited and maintained: the personal databases of researchers in the domain. Personal reference management tools are long established and, with the rise of the Internet, on-line tools have been developed to match. On-line tools have the advantage of making it easier to share references, and some tools, such as CiteULike (CiteULike) and Mendeley, have taken this idea to the current boundaries of on-line technology and incorporated techniques from social networking software into their offering. However, this has not addressed the scalability problem, for there remains a multiplicity of small bibliographies. The result is a multiplicity of small groups within these services, often with overlapping areas of interest. For example, there are seven groups in Mendeley related to ants (Mendeley:ants). As yet, these on-line tools do not deliver the comprehensive vision of a BITS.

What would it be used for
The immediate advantage of a BITS is to facilitate academic rigour. A BITS will make it far easier for a researcher to provide complete and accurate references to accompany their writings.

A longer term advantage of a BITS is that it lets the researcher embrace all relevant works, cutting across existing small bibliographic reference collections. This too should provide increased academic rigour, because all relevant prior works should be easier to retrieve, which could lead to some serendipitous findings. Further, where the original smaller collections were created to address a specific question, the comprehensive approach of a BITS will also make it easier to ask new questions of the old data. Two of the most prominent contemporary drivers for asking new questions of old data are climate change and invasive species. A BITS would facilitate such work.

We expect the primary user of a BITS to be the scientist engaged in a traditional literature search. A BITS will offer this user a targeted search, so reducing the potential for irrelevant results. It will also facilitate a search across domains, making it easier for the scientist to ask new questions of the data and to gain a baseline of understanding across otherwise unconnected bibliographic resources. Quality control will be an additional benefit to the scientist, as results can be compared against other resources, with the potential for data quality issues to be automatically addressed or at least highlighted to the scientist. A second use case for the scientist is to process errors found in the references, either those suggested by BITS or those identified by the scientist. Hence, through use, the quality of the data in BITS should improve over time.

In a similar vein, a BITS can facilitate the work of a citizen scientist. We expect this individual to be a competent taxonomist, either a retired professional researcher or a highly motivated amateur. We do not envisage a role for more casual citizen scientists, such as secondary school students, in using and managing bibliographic references. The expert citizen scientist will be engaged in the same literature searches as the scientist, and will realise the same benefits. They could also provide reference validation and correction in the same manner as the scientist. There is, of course, the necessity to apply quality control to these corrections.

Why has it not been done before
If a bibliography in the sky is potentially so useful, why has it not been built already?

One important reason is money. In general, funding is predicated on breaking a big problem into smaller, manageable chunks. Hence, one would expect an application to fund work in a national park to be more successful than one that encompasses the whole world. There is usually some funding body, at a national level, to pay for such work. In consequence, a multiplicity of databases has been built, and even aggregators have tended to work on a national or genus-specific scale. In the absence of funding, a cottage industry approach has taken hold, with those researchers interested in the technology and problems of bibliographic reference management building systems in their own personal time. This has meant that opportunities for added value are often missed, while large-scale challenges such as de-duplication are not overcome. The resulting resources are useful, but limited.

A factor complicating inter-operability among these existing resources is the lack of standards defining much of what they are meant to do. Standards are needed for everything from data storage itself to the APIs that enable access to the stored data, and for all the aspects in between, such as metadata, exchange, extraction and citation. The Taxonomic Database Working Group (TDWG) has been working hard over many years to define biodiversity information standards. However, its work is not complete, as anyone who subscribes to its mailing lists can testify (tdwg:non-technical discussions; tdwg:technical discussions). In the absence of standards, various answers, each at the time representing best practice, have been implemented; but the resulting variety has only added to the problem of sharing data, and generated a new concern: when will time and money have to be spent restructuring existing data to match some new standard or a colleague's preferred style? Hence, the different approaches are perpetuated.

One solution is not to re-write the existing resources but to aggregate them. However, funding for such projects can be difficult to achieve, especially as having to cope with numerous data formats makes aggregation difficult, and therefore expensive. Aggregators can also suffer in bidding for funding if they do not appear to add value. A further weakness is that they are dependent on the quality of the data they aggregate: if they do not incorporate some form of automatic validation or user correction, that represents an opportunity missed. A related weakness is that, with de-duplication being an unresolved problem in digital library research, there is no simple tool that can be called on to help address the issue, and so an aggregator may perform no de-duplication at all except when records are identical in all respects.

There are an increasing number of commercial offerings, on-line bibliographic reference managers with social network features, that could deliver many of the benefits of a BITS. However, none as yet seems to have the full set of features that a taxonomy-specific BITS could provide. This is particularly so for data validation, as an academic BITS should be able to provide feedback to the data provider to enhance the quality of the source data. Commercial offerings also have the complication of requiring funding. For academics this could be achieved through an institution's library service, much as a subscription to an on-line journal is justified. However, that falls foul of a chicken-and-egg problem: the commercial offering must achieve critical mass before the subscription request would be successful, yet how will it grow to critical mass without subscribers? A second issue with commercial offerings is that they go against the trend in academe towards open science and open access publication.

The next section explores some of the ways we propose to address these issues that have prevented the creation of a BITS already.

How are we going to build it
There are many challenges to building a bibliography at any scale. This section documents only the major ones that apply specifically to a large-scale bibliography.

de-duplication
This remains an open problem in digital libraries research (Kan and Tan, 2008). The problem does not arise when there is a direct match across references, that is, when all fields are identical. In such circumstances the duplicates are easily ignored and only one copy of the reference is retained. A good cue is to check the DOI first. However, even if the DOI is the same, sometimes other data can be contradictory or incomplete. Resolving these near identical references can be difficult. A variety of resolution techniques are required because the problems can come from a variety of sources, such as the use of different journal abbreviations or a mis-match between fascicle and article page numbers.
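The DOI-first check described above can be sketched in a few lines. This is an illustrative sketch only, assuming references are held as plain dictionaries with hypothetical field names (`doi`, `title`, `year`); a production system would need the fuller battery of resolution techniques just mentioned.

```python
# Sketch of a near-duplicate check between two bibliographic records.
# Field names are illustrative assumptions, not a defined schema.

def normalise(text):
    """Lower-case and collapse whitespace so trivial variations do not block a match."""
    return " ".join(text.lower().split())

def is_probable_duplicate(a, b):
    """Return True if two references likely describe the same work.

    A shared DOI is a strong cue but, as noted above, the remaining
    fields can still be contradictory, so we corroborate with the title.
    """
    doi_a, doi_b = a.get("doi"), b.get("doi")
    if doi_a and doi_b:
        if doi_a.lower() != doi_b.lower():
            return False  # distinct DOIs: treat as distinct works
        # Same DOI: require at least one other field to agree as well.
        return normalise(a.get("title", "")) == normalise(b.get("title", ""))
    # DOI missing on one side: fall back on title and year agreement.
    return (normalise(a.get("title", "")) == normalise(b.get("title", ""))
            and a.get("year") == b.get("year"))
```

A real resolver would return a graded score rather than a boolean, so that near misses can be queued for human review.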

Internationalisation is a common cause of near identical matches. This can occur when there are multiple names for the same entity such as place names or person names. Also problems arise with the transliteration of entities into Latin script. A topical example is that of Gaddafi. There are many variations of the name in Latin script, a problem compounded by whether you use the formal Arabic pronunciation of the name or the Libyan dialect, and whether you transliterate for an English or French speaking audience (Time:Gaddafi, 2011). Even equipped with this knowledge, however, no consensus has emerged on the correct Latin rendering of the name (Yahoo:Gaddafi, 2011). Hence, for this example, there is no definitive correct answer.

The personal name problem is compounded by cultural differences affecting such characteristics as name order. This can give rise to further variations depending on whether the name order is amended to match the typical Western style of given name first when the name is transliterated. The W3C (World Wide Web Consortium) has produced advice on handling this aspect of internationalisation (W3C:personal names) and on other aspects of internationalisation too (W3C:internationalisation). Fortunately, personal name variations can be addressed by a variety of techniques, including data mining (Phua et al, 2006), while Biostor (Biostor) implements Feitelson's weighted clique algorithm for finding equivalent names (Feitelson, 2004). However, these techniques do not allow for the occasions when a researcher may deliberately use a different name for different publications, such as to distance themselves from their early work (McKay et al, 2010).
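A much simpler illustration than Feitelson's weighted clique algorithm is to group author strings by a crude key of surname plus first initial. The key scheme, and the assumption that uninverted names end with the surname, are ours for the sake of the sketch, not Biostor's.

```python
# Group author name strings that probably refer to the same person.
# This is a deliberately crude sketch: it assumes either
# 'Surname, Given' or 'Given Surname' ordering, which, as discussed
# above, does not hold across all cultures.
from collections import defaultdict

def name_key(name):
    """Crude matching key: (surname, first initial), both lower-cased."""
    if "," in name:
        surname, given = [p.strip() for p in name.split(",", 1)]
    else:
        parts = name.strip().split()
        surname, given = parts[-1], " ".join(parts[:-1])
    initial = given[:1].lower() if given else ""
    return (surname.lower(), initial)

def group_variants(names):
    """Bucket name strings by their matching key."""
    groups = defaultdict(list)
    for n in names:
        groups[name_key(n)].append(n)
    return dict(groups)
```

Such a key over-merges (distinct people sharing surname and initial) and under-merges (transliteration variants), which is precisely why the weighted clique and data mining approaches cited above exist.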

The problem of reference de-duplication in bibliographic databases is more formally known as citation matching (Lee et al, 2007; Kan and Tan, 2008), and addressing this problem will form one of the core areas of research for WP7 in the ViBRANT project. We will investigate the tools available from the wider digital libraries community. However, it is still an open problem and a preliminary review of the landscape suggests there are few practical examples of solutions. Widening our review though, we have found examples of citation tools finding a role in other domains, such as detecting plagiarism (Plagiarism Today, 2011), that might have transferable techniques we can exploit.

from where do you get bibliographies
It is one thing to state we can build a bibliography from other bibliographies; it is another thing to do it! The primary source of references will be the users themselves, contributing their own bibliographies and making ad hoc additions as they discover more references. Thus, we have the non-technical challenge of building a community of users for whom it is worth their time and effort to contribute to BITS. This problem is potentially self-resolving once there are enough users and enough references to make it a truly useful resource. The question, of course, is how to achieve that desirable critical mass. This is where building BITS through a larger project such as ViBRANT will be crucial, for ViBRANT gives users another reason to engage with the environment in which BITS is hosted. Indeed, in harvesting user contributed bibliographies and ad hoc additions from the users' Scratchpads and published papers tools, we will be able to populate BITS with no additional effort from the users.

Harvesting external resources is another important means of populating BITS. We can build on existing aggregators that crawl the web for references. One such is FaLX, developed as part of the EDIT project (EDIT). It could aggregate references from Connotea, Scratchpads and CiteULike, but has not been enhanced to harvest references from other sources. This highlights one issue with aggregators: maintaining them to use new data sources. This is a particular concern when the data source requires user name and password parameters for access, and where such access rights might change over time.

Usually harvested references will be in a known, publicly defined format. However, the references might still need parsing and re-formatting to be used within BITS. This could lead to maintenance issues with BITS should a new format for defining bibliographic records emerge in a few years' time. Another issue with harvesting records is that of tracking updates to references: an amended reference could be imported into BITS as a new record because it is not identical to an existing reference. Recognising and reconciling near identical references is very difficult, as explained in the section on de-duplication earlier in this article.

Another potential source of data is to parse literature directly for references. This is a difficult problem, even for major commercial concerns such as Mendeley (Mendeley:reference extraction, 2010). One simple technique is to look for the isolated word References in the body of the text and examine the subsequent text. This is one of the methods used by open source tools such as ParaCite (ParaCite) and can be effective on born-digital literature and on well scanned historic literature. However, such keyword searches are limited in scope and depend on references being in a dedicated section within a document. Embedded references, or, worse still from the point of view of automated extraction, references in an endnote or footnote, present greater problems of identification.
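The keyword technique can be illustrated with a minimal sketch. Real tools such as ParaCite are far more robust; this assumes the references sit in a dedicated section headed by the word References alone on its own line, exactly the limitation noted above.

```python
# Minimal sketch of keyword-based reference extraction: find an
# isolated "References" heading and treat the non-blank lines that
# follow as candidate reference strings.
import re

def extract_reference_block(text):
    """Return candidate reference lines, or [] if no heading is found."""
    lines = text.splitlines()
    for i, line in enumerate(lines):
        # An "isolated" heading: the word alone on its line.
        if re.fullmatch(r"\s*References\s*", line):
            return [l.strip() for l in lines[i + 1:] if l.strip()]
    return []
```

Each extracted line would then still need to be parsed into fields (authors, year, title, journal), which is the genuinely hard part of the problem.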

Automatic extraction can be complemented by supported user input, as exemplified by GoldenGATE (GoldenGATE), in which the user first identifies the reference, which can then be parsed by the software's extraction routines. This is a useful facility for a user to add references as they read and review a document.

A further source of references, but one which brings a new set of complications, are micro-citations. By their incomplete nature, satisfactorily resolving the citation is difficult (Gupta et al, 2009) though there are some examples we can build on to address the issues (Page, 2011). If BITS is to be the comprehensive tool envisaged, then we will need to incorporate micro-citation capture.

how do you build it: which architecture is best for our needs
There are several possible architectures for BITS. One is to build on a bibliographic database; the other is to build a search portal.

The first option is to build on a database, for which there are two approaches: either we can build our own database to store references, or we can use an existing database. Building our own database gives us complete control over what we build, so we can tailor it to meet our users' needs. However, that assumes we can achieve a sustainable body of users, willing both to contribute to the database and to use it. We would have to establish our credibility first. Assuming we achieve that, there are the continuing costs of running the servers and of maintaining and enhancing the software. The issue of sustainability is covered later in this article.

The alternative is to build on another's database, leaving us only to ensure the sustainability of our software. Of the currently available storage solutions, there are three front runners: in the commercial sector, Mendeley (Mendeley) and Papers (Papers), and in the public sector, CiteBank (CiteBank).

Mendeley and Papers are both tools for an individual to organise their bibliographies. Both offer social network enhancements to enable papers to be shared among groups, though both restrict the number and size of groups, and the storage of references, that are available for free. If we are to work with either organisation then we will need to enter into a contractual relationship. Concerns about either organisation include their long-term business plans and viability. The two named organisations represent the current leading on-line reference manager tools suitable for our use. There have been other, earlier tools that rose to, and then fell from, prominence, such as CiteULike (CiteULike) and Connotea (Connotea). In a similar vein there is the publicly funded Zotero (Zotero), which has found a niche in the social sciences, but which would also require a commercial arrangement to handle the volumes of data a BITS would produce.

Of the publicly funded bibliographic databases only CiteBank has the ambition to match BITS. Other databases are focused on a sub-domain of taxonomy and lack the scope to expand in line with the potential size of BITS. CiteBank is the bibliographic offshoot of the Biodiversity Heritage Library, which has achieved sustained funding (BHL:funding). One model of sustainability is to co-operate with CiteBank, which could then demonstrate its enhanced impact as the size and scope of its references grow, which in turn could make future funding requests even more likely to succeed. This is one avenue we are actively exploring.

An alternative approach is not to build a BITS database at all, but a functionally equivalent portal offering a federated search across existing taxonomic bibliographic resources. Hence, our task in ViBRANT would be to build a user interface to a global search of these existing data stores, complemented by an index to speed up query results. The latter would be necessary because we would have to do additional processing such as de-duplication on the fly to consolidate the results. The leading, proven indexing technology applicable to this task is Apache SOLR (SOLR). It offers many advantages if used in ViBRANT, not the least being its integration with Drupal, the foundation of Scratchpads. However, to build the index would still require that we address the same issue as if we were to populate a reference database. Given the potential performance penalty, there seems to be no advantage in adopting a purely search portal approach over populating a searchable database.
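The on-the-fly consolidation a portal would need can be sketched as follows. The back-end sources here are stand-ins for real APIs, which are not specified, and collapsing results on a normalised title is a deliberate over-simplification of the de-duplication problem discussed earlier; it serves only to show where the per-query processing cost arises.

```python
# Sketch of a federated search with on-the-fly consolidation.
# Each source is a callable returning a list of reference dicts;
# the callables stand in for real back-end APIs.

def federated_search(query, sources):
    """Query each source and collapse results sharing a normalised title."""
    seen = {}
    for source in sources:
        for ref in source(query):
            key = " ".join(ref.get("title", "").lower().split())
            seen.setdefault(key, ref)  # keep the first copy of each work
    return list(seen.values())
```

Because this merging happens on every query, an index (such as the SOLR index mentioned above) would be needed to keep response times acceptable, which is exactly why the portal approach offers little advantage over a populated database.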

using other people's data: the issue of copyright
Access to copyrighted materials is a well known problem. In principle copyright is a 'good thing' and has been perceived so for a long time. One of the earliest examples of copyright being enshrined in legislation, the Copyright Act of 1709, was also one of the first Acts of the newly unified parliaments of England and Scotland. Its full title was "An Act for the Encouragement of Learning, by vesting the Copies of Printed Books in the Authors or purchasers of such Copies, during the Times therein mentioned". The act can be said to have met its aims in the centuries since then. However, as the Hargreaves report of earlier this year points out:

"Could it be true that laws designed more than three centuries ago, with the express purpose of creating economic incentives for innovation by protecting creators' rights, are today obstructing innovation and economic growth? The short answer is: yes." (Hargreaves, 2011)

Two of the report's recommendations are particularly relevant to ViBRANT. The first is the introduction of a specific exception for text mining to allow researchers to extract and process data. The second is to build an exception into EU frameworks to ensure harmonised exceptions across Europe remain relevant regardless of developments in technology.

Our problem of access to useful material is affected by laws drawn up in a different age, and is only now, hopefully, being addressed by the current United Kingdom government. Unfortunately, conservative laws are just one of the many complications affecting copyright.

Issues with copyright are further complicated in that copyright was originally determined by national law, and so many differing, local interpretations developed. The growth of international trade provoked the creation of a layer of international law to reconcile national differences, most notably the Berne Convention of 1886 and its numerous subsequent amendments. In Western Europe the development of the common market through the then European Community brought in many additional laws aimed at harmonising national practice across community members. This affects ViBRANT because, as an international project, the partners are subject to different copyright laws.

Some organisations choose to avoid working with potentially copyrighted materials as a way of circumventing copyright problems. In our domain, BHL generally follows this approach, though working with information aggregators such as BioOne has enabled BHL to expand access to more recent, copyrighted publications (Rinaldo and Norton, 2010). We do not have the option to ignore copyrighted material if we are to build a truly comprehensive BITS that includes modern literature.

A particular concern to us is the ability to process documents to extract the data they contain. Different jurisdictions offer different interpretations on whether this is possible. For example, in many countries it is legal to use the data within a database because the individual data elements are not subject to copyright; however, their accumulation into a new artefact, the database, is subject to copyright. This distinction between individual facts within a larger work is not always clear. Nor is it clear at what point a new artefact built from the individual facts becomes eligible for copyright in its own right. The situation is further complicated when the law does not permit a copy to be made of the original artefact at all, even if it is only for extraction of individual, non-copyright, items of data from that larger work. Switzerland has a pragmatic approach to the creation of new artefacts and even allows the temporary copying of copyright material if it is needed for the creation of a new, distinct artefact (Agosti and Egloff, 2009). It is possible that we will have to enter into a commercial arrangement with Swiss-based Plazi (PLAZI) to process our data in a more suitable jurisdiction.

We will have to investigate the copyright implications for BITS closely. The initial service will be hosted in the UK, where copyright law may change soon, but there will be a secondary installation in Germany and potentially elsewhere for the failover systems being developed by ViBRANT's work package 2. We will need to ensure that a BITS can function legally in any of these jurisdictions.

the quality of our data: avoiding garbage in, garbage out
The question of data quality is not a new one, and it has many dimensions such as completeness, accuracy, correctness, currency and consistency of data (Redman, 1996). The question of data quality can arise whether the reference is user submitted or harvested from an on-line library. There is no guarantee in either case that the input is validated. In addition, while the data might be correct, there is also the potential for introducing an error during data entry.

Manual validation of the data is possible, and a BITS requires an editor for users to amend references, such as the one provided in the GoldenGATE editor (GoldenGATE). However, care must be taken by users editing bibliographic details since this could allow the introduction of new errors, typically through mis-keying the intended change.

For the automatic addressing of quality issues, Ley and Reuther (2006) suggest there are broadly two approaches.

The first approach to data validation they call database bashing, in which the data is checked against other databases. Unfortunately, this is not a foolproof approach. It is possible that both databases contain wrong data derived from a common source, and so an error can be propagated without detection. Ultimately, a voting mechanism has to be employed if there is no single authoritative database against which to validate the data. We have seen a related problem in our work post-processing scanned literature (King, 2010), in which an OCR engine made a consistent mis-identification of a term, which was then considered to be the correct version of the term because it was the most common form. When the actual, and rare, correct version was encountered, it was considered the variant. A more fundamental issue, however, is the absence of an external authoritative database: in BITS we are aiming to build that authoritative database ourselves. A similar situation exists with taxon names and the work of GBIF (GBIF), which is simultaneously building and providing an authoritative source of information.
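A minimal sketch of the voting mechanism, assuming each database contributes one candidate value for a field. As the OCR example shows, the majority can still be wrong, so a real system would record the minority values for review rather than discard them.

```python
# Majority vote over candidate values for one bibliographic field,
# gathered from several databases. Illustrative sketch only.
from collections import Counter

def vote_on_field(values):
    """Return (winning value, agreement ratio) across the databases.

    Empty or missing values are ignored; if nothing remains,
    there is no basis for a decision and (None, 0.0) is returned.
    """
    values = [v for v in values if v]
    if not values:
        return None, 0.0
    value, count = Counter(values).most_common(1)[0]
    return value, count / len(values)
```

A low agreement ratio would flag the field for human attention instead of silently accepting the majority, mitigating the propagated-error problem described above.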

The second approach to data validation Ley and Reuther call data edits: the application of rules to highlight or resolve discrepancies. This can help address issues such as the Hungarian and Japanese use of family name first when giving names, which may or may not have been amended to given name first in the reference. This approach is clearly limited to addressing known issues, though generic data quality problems, such as those with personal names, are already well documented from more general work with bibliographic reference management tools.
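One such data edit might be sketched as follows. The known-surname list is a hypothetical stand-in for a real authority file, and the rule deliberately handles only the simplest two-part names.

```python
# Sketch of one "data edit" rule: if a known surname appears first in
# an uninverted name string (as in Hungarian or Japanese usage),
# rewrite it to the inverted 'Surname, Given' form.
# KNOWN_SURNAMES is illustrative only, standing in for an authority file.

KNOWN_SURNAMES = {"nagy", "tanaka"}

def apply_name_order_rule(name):
    """Invert 'Surname Given' to 'Surname, Given' when the surname is recognised."""
    parts = name.split()
    if len(parts) == 2 and "," not in name and parts[0].lower() in KNOWN_SURNAMES:
        return f"{parts[0]}, {parts[1]}"
    return name
```

Rules of this kind are cheap to run on every record, but each one encodes a single known issue, which is exactly the limitation noted above.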

We will use both approaches, referring to external resources and applying rule based corrections to enhance data quality.

implementation
There are several options for inputting references into a BITS. Arguably the most obvious is direct user input. This could be by directly keying an entry or, more likely, by enabling users to import their existing bibliographies in the same way that Mendeley allows users to import their EndNote database or CiteULike account.

An extension of this import principle is to harvest publicly available references. A harvester could access publicly available resources and download references. This facility could be extended to import users' reference collections, provided the users ensure they are accessible to the harvester.
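A minimal harvester along these lines might look like the sketch below, assuming users expose their collections as plain BibTeX files at a URL they register with the service. The function names and the naive entry-splitting heuristic are our assumptions, not part of any existing system:

```python
import re
import urllib.request

def fetch_bibtex(url):
    """Download a publicly exposed BibTeX file from a URL the user
    has registered with the harvester."""
    with urllib.request.urlopen(url) as resp:
        return resp.read().decode("utf-8", errors="replace")

def split_entries(bibtex_text):
    """Split raw BibTeX into individual entry strings.

    Naive heuristic: assumes each entry starts with '@' at the
    beginning of a line, which holds for well-formed exports.
    """
    return re.split(r"\n(?=@)", bibtex_text.strip())
```

Each entry string would then go through the parsing, validation and de-duplication steps discussed earlier before being added to the store.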

A third means to enable users to submit references would be to let them upload the original source or article for reference extraction. Anecdotal evidence suggests that many taxonomists record bibliographic references in ad hoc Word documents rather than in reference manager tools. Extracting the references from these documents may be relatively easy because we already know the content is references. It is more difficult to extract references from a complete article, where the reference list must first be located within the surrounding text.
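For a references-only document exported as plain text, the extraction step could be as simple as the following sketch. The paragraph and line heuristics are assumptions about how such ad hoc lists are typically laid out:

```python
def split_reference_list(text):
    """Split the plain text of a references-only document into
    candidate reference strings.

    Assumes one reference per paragraph (blank-line separated),
    joining wrapped lines; if the whole document collapses into a
    single paragraph, falls back to one-reference-per-line.
    """
    paras = [p.strip().replace("\n", " ") for p in text.split("\n\n")]
    refs = [p for p in paras if p]
    if len(refs) == 1 and "\n" in text.strip():
        refs = [line.strip() for line in text.strip().split("\n")
                if line.strip()]
    return refs
```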

It will be necessary to allow users to edit the references so that they can correct errors. Editing could also be used to build a black list of known bad references to prevent their re-submission to BITS. This addresses an issue with any harvesting system: an incorrect entry is corrected by a user, but on the next run of the harvester the incorrect entry is re-imported.
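One way to implement such a black list is to record a normalised fingerprint of each known-bad reference and have the harvester check it before import. This is a sketch; the field choice, normalisation, and hashing scheme are illustrative assumptions:

```python
import hashlib

def fingerprint(ref):
    """Stable fingerprint for a reference dict, tolerant of case and
    whitespace differences, so a corrected-then-reharvested entry
    still matches its black-listed original."""
    key = "|".join(
        " ".join(str(ref.get(field, "")).lower().split())
        for field in ("author", "year", "title")
    )
    return hashlib.sha1(key.encode("utf-8")).hexdigest()

def should_import(ref, blacklist):
    """Harvester gate: skip any reference whose fingerprint is
    on the black list of known bad entries."""
    return fingerprint(ref) not in blacklist
```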

Allowing users to edit the references does lead to an additional complication: version control. A reference may be subject to several edits before the final, definitive version is reached. In addition, there may be genuine disagreement over a reference, which can lead to see-saw changes as contending users edit the reference to match their own preferences.
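A minimal version-control scheme for references could keep every edit, so that earlier versions can be inspected or restored after see-saw changes. The class and its interface below are our assumptions, sketched for illustration:

```python
class VersionedReference:
    """Keep the full edit history of a reference: each edit appends
    a (editor, data) pair, so nothing is ever lost."""

    def __init__(self, data, editor):
        self.history = [(editor, dict(data))]

    @property
    def current(self):
        """The latest version of the reference."""
        return self.history[-1][1]

    def edit(self, data, editor):
        self.history.append((editor, dict(data)))

    def revert(self, version_index):
        """Restore an earlier version by appending it as a new edit,
        so the revert itself is also recorded in the history."""
        editor, data = self.history[version_index]
        self.history.append((f"revert of {editor}", dict(data)))
```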

Sustainability: managing the resource for the long term
This final sub-section on the challenges facing a BITS looks at the longer-term issues of managing it.

A fundamental question is who pays to keep it running. Personal experience shows that there are far more funding calls to build something new than to maintain something existing. Indeed, a given measure of success is often that whatever was built has become self-financing! Funding is further complicated by national and regional boundaries: a funding body will happily pay for a resource covering its own geographic area, but will baulk at supporting any international extension. One possibility is to co-ordinate bids to different national bodies, though the timings of funding calls make this an unlikely scenario.

If conventional project funding is not available, then one alternative is to charge for the service. Various charging models could be used, such as an institutional subscription or a per-access fee; however, this would make it more difficult to achieve a critical mass of users. Indeed, there is a growing resistance to payment, in part because of the rise of the open science movement (Open Science, 2011), which is even making its presence felt outside of academe (Guardian:open science, 2011).

The open science movement is matched by the trend to open access publishing (Wikipedia:open access, 2011). The model is well established with high-profile publishers such as BioMedCentral (BioMedCentral) and ViBRANT partner Pensoft (Pensoft). Indeed, ViBRANT itself is promoting open publishing, for it is a condition of our grant agreement, Special Clause 39, that we publish our work in open access journals or deposit final manuscripts in accessible institutional repositories.

If we do use an external storage provider for bibliographic references, instead of building our own database of references, then we must ensure that we can extract our data from that provider's system should it ever fail, or change its licensing arrangements to something unsuitable for our purposes. This would enable us to transfer the data to another provider, or to provide the service ourselves should that become necessary.

In terms of maintaining the relevance of BITS to working taxonomists, we would endeavour to make the system self-sustaining. Having achieved a critical mass for usefulness, it should become the first point of reference for taxonomic literature. As researchers work with more papers, they will add to BITS because they appreciate its benefits. We would make contributing as simple as possible for our users: rather than having to key in references explicitly, simply exposing a personal web page to our harvester will suffice. New material published through Scratchpads and Pensoft will be added automatically.

Conclusion
The principle of a bibliography in the sky, implemented as a bibliography of life for taxonomists, is one that pretty much everyone agrees would provide a useful resource. However, despite this acclaim, there are technical issues that have so far prevented its creation. Through the work of organisations such as TDWG, and through advances in digital library software, some of these issues are now being addressed. In ViBRANT we have the commitment of a sufficiently large amount of time and resource to build on these advances and deliver a tool that can provide a freely accessible bibliography of every taxonomic paper ever published.

References and Links
Consolidated references and links in Zookeys format here


 * Agosti D, Egloff W (2009) Taxonomic information exchange and copyright: the Plazi approach. BMC Research Notes, 2: 53. doi:10.1186/1756-0500-2-53.
 * Antbase http://antbase.org/
 * BHL http://www.biodiversitylibrary.org/
 * BHL:funding http://biodivlib.wikispaces.com/Funding+Sources
 * BIOSIS http://thomsonreuters.com/products_services/science/science_products/a-z/biosis/
 * BioMedCentral http://www.biomedcentral.com/
 * Biostor http://biostor.org/
 * CiteBank http://citebank.org/
 * CiteULike http://www.citeulike.org/
 * Connotea http://www.connotea.org/
 * DBLP http://www.informatik.uni-trier.de/~ley/db/
 * EDIT http://www.e-taxonomy.eu/
 * Feitelson DG, (2004) On identifying name equivalences in digital libraries. Information Research 9(4): 192.
 * Fishbase http://www.fishbase.org/
 * GBIF http://www.gbif.org/
 * GoldenGATE http://plazi.org/?q=GoldenGATE
 * Google Scholar http://scholar.google.com/
 * Google http://www.google.com/
 * Guardian:open science (2011) http://www.guardian.co.uk/education/2011/may/22/open-science-shared-research-internet
 * Gupta D, Morris B, Catapano T, Sautter G (2009) A new approach towards bibliographic reference identification, parsing and inline citation matching. In: Proceedings of the International Conference on Contemporary Computing, Noida (India), August 2009.
 * Hargreaves I (2011) Digital Opportunity: A review of Intellectual Property and Growth. Crown copyright. http://www.ipo.gov.uk/ipreview.htm?intcmp=239
 * JISC open access publishing (2009) http://www.jisc.ac.uk/news/stories/2009/01/houghton.aspx
 * Kan M-Y, Tan YF (2008) Record matching in digital library metadata. Communications of the ACM, 51: 91-94.
 * King DJ (2010) ABLE - Automatic Biodiversity Literature Enhancement project: making biodiversity literature accessible. Seminar in the Zoology series at the Natural History Museum, London, 16th March 2010.
 * Lee D, Kang J, Mitra P, Giles CL, On B-W (2007) Are your citations clean? Communications of the ACM, 50: 33-38.
 * Ley M (2009) DBLP - Some Lessons Learned. In: Proceedings of the Very Large Databases Conference, Lyon (France), August 2009.
 * Ley M, Reuther P (2006) Maintaining an online bibliographical database: The problem of data quality. In: Actes des sixièmes journées Extraction et Gestion des Connaissances, Lille (France), January 2006.
 * McKay D, Sanchez S, Parker R (2010) What's My Name Again? Sociotechnical Considerations for Author Name Management in Research Database. In: Proceedings of the 22nd Conference of the Computer-Human Interaction Special Interest Group of Australia on Computer-Human Interaction - OZCHI '10, Brisbane (Australia), November 2010. doi:10.1145/1952222.1952274.
 * Mendeley http://www.mendeley.com/
 * Mendeley:ants http://www.mendeley.com/groups/search/?query=ants
 * Mendeley:reference extraction (2010) http://feedback.mendeley.com/forums/4941-mendeley-feedback/suggestions/834313-version-0-9-7-does-not-extract-references-from-the-Mendeley
 * Open Science http://www.openscience.org/blog/
 * PLAZI http://plazi.org/
 * Page R (2010) Mendeley, BHL, and the "Bibliography of Life". http://iphylo.blogspot.com/2010/10/mendeley-bhl-and-of-life.html
 * Page R (2011) Microcitations: linking nomenclators to BHL. http://iphylo.blogspot.com/2011/03/microcitations-linking-nomenclators-to.html
 * Papers http://www.mekentosj.com/papers/
 * ParaCite http://paracite.eprints.org/
 * Pensoft http://www.pensoft.net/
 * Phua C, Lee V, Smith K (2006) The Personal Name Problem and a Recommended Data Mining Solution. In: Wang J, (Ed) Encyclopedia of Data Warehousing and Mining. Idea Group, London.
 * Plagiarism Today (2011) http://www.plagiarismtoday.com/2011/08/08/using-citations-to-detect-plagiarism/
 * PubMed http://www.ncbi.nlm.nih.gov/pubmed
 * Redman TC (1996) Data Quality for the Information Age. Artech House, London.
 * Rinaldo C, Norton CN (2010) The Biodiversity Heritage Library: an expanding international collaboration. In: Proceedings of the 36th International Association of Aquatic and Marine Science Libraries and Information Centers Conference, Mar del Plata (Argentina), October 2010.
 * SOLR http://lucene.apache.org/solr/
 * Scratchpads http://scratchpads.eu/
 * TDWG:non-technical discussions http://lists.tdwg.org/mailman/listinfo/tdwg-content
 * TDWG:technical discussions http://lists.tdwg.org/mailman/listinfo/tdwg-tag
 * Time:Gaddafi (2011) http://newsfeed.time.com/2011/02/23/how-do-you-spell-gaddafi-the-linguistics-behind-libyas-leader/
 * W3C:Internationalisation http://www.w3.org/International/
 * W3C:Personal names http://www.w3.org/International/questions/qa-personal-names
 * Wikipedia:open access http://en.wikipedia.org/wiki/Open_access
 * Yahoo:Gaddafi (2011) http://uk.news.yahoo.com/how-should-you-spell-gaddafi%E2%80%99s-name-.html
 * Zotero http://www.zotero.org/