TechWrkSofiaApril2012

= Technical Workshop =

Background Information

 * Where : Best Western Plus City Hotel, Sofia
 * When : 09:00-16:00 EEST Tuesday 17 April 2012
 * Who : Ben Scott, Dauvit King, Evangelos Pafilis, Guido Sautter, Pavel Stoev, Teodor Georgiev, Alexander Pochinkov
 * Facilities: Wi-fi and coffee will be available. Lunch will probably be in the hotel restaurant to save travel time.

Agenda
''Temporarily kept for reference. Actual agenda is below.''
 * Bibliography of Life:
   * demonstrate current RefBank functionality
   * explore issues integrating RefBank into Scratchpads infrastructure
   * define use-cases for reference management in Scratchpads
   * discuss and define compatibility and data exchange formats between RefBank, Scratchpads Literature Reference Module and Pensoft Writing Tool (PWT)
 * Text-mining:
   * demonstrate current development of Evangelos's software tools
   * explore their possible use within ViBRANT infrastructure, especially their integration into Scratchpads
   * define use-cases for text-mining in Scratchpads

Revised agenda
1. publication module in Scratchpads
 * 1) import and export of XML,
 * 2) compatibility of elements,
 * 3) level of mark up (especially on Scratchpads)
 * 4) [technical issues to be discussed offline]
 * (2 hrs)

2. references - exchange between Pensoft, RefBank and Scratchpads
 * (3~4 hrs)

3. text mining issues - opportunities through Evangelos
 * (3~4 hrs)

For Wednesday morning

 * 1) specimen data - Darwin Core exchange between Pensoft and Scratchpads
 * 2) locality data and Darwin Core

Discussion among Ben, Simon and Teodor to confirm what fields are present/required.

From the meeting
Ben demonstrated the Scratchpads Publication module. Teodor demonstrated the Pensoft Writing Tool (PWT).

Agreed: Scratchpads to post full files, not just references to files. Users can delete files in Scratchpads, so file references could become invalid.

Agreed: Rename elements so that they are self-describing, using the existing attribute values as the element names. This will make XSLT and XPath access to the data much easier, and will make the data self-documenting. Action: Pensoft, due Friday 20 April 2012.
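The renaming agreed above could look roughly like the following sketch. The actual PWT schema was not recorded in these notes, so the generic `field` element, its `name` attribute and the sample values are all assumptions made for illustration. The sketch also assumes every attribute value is a valid XML element name.

```python
import xml.etree.ElementTree as ET

# Hypothetical input: generic <field> elements described only by a "name" attribute.
GENERIC_XML = """
<publication>
  <field name="title">A new species of Example</field>
  <field name="abstract">Short summary text.</field>
</publication>
"""

def rename_to_self_describing(root, generic_tag="field", attr="name"):
    """Rename each generic element to the value of its describing attribute."""
    # Materialise the matches first so renaming cannot disturb the traversal.
    for elem in list(root.iter(generic_tag)):
        new_tag = elem.attrib.pop(attr, None)
        if new_tag:  # elements without the attribute are left untouched
            elem.tag = new_tag
    return root

root = rename_to_self_describing(ET.fromstring(GENERIC_XML))

# After renaming, a plain path replaces the attribute predicate:
#   before: ./field[@name='title']    after: ./title
title = root.findtext("title")
```

With self-describing names, an XSLT template or XPath expression can address `title` directly instead of filtering generic elements by attribute value.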

Overview of the XML suggests that data exchange looks promising, with a large amount of overlap and near-identical atomisation. Areas of difference seem to be principally around authoring information, such as contributors’ roles, in particular co-ordination of the submitting author and the different role-based permissions in the applications. (Ben was taking notes on this and related issues, such as recording the user's affiliation. Ben will need a list of the PWT fields, and which are mandatory. Action: Pensoft, due Friday 20 April 2012. Pensoft need the Drupal Biblio fields. Action: Ben, due Friday 20 April 2012.) This will also be followed up tomorrow in the discussion around Darwin Core.

Reviewing options to refine authors’ workflow given the overlap of data between the two applications.

Also need to consider the automatic return to Scratchpads of the final Pensoft pre-publication version of the paper from PWT.

High level workflow:
 * user starts in Scratchpads
 * populates fields to create publication
 * submits publication to Pensoft's PWT
 * PWT returns ID and link
 * Scratchpads retains the link so that the user can progress their publication in PWT, i.e. after submission the publication can no longer be changed in Scratchpads, only in PWT
 * can adapt existing triggers in PWT e-mail update process as a service to provide notification to user in Scratchpads
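
The workflow above can be sketched as a small state model on the Scratchpads side. This is only an illustration of the agreed behaviour (submit, keep the returned ID and link, lock local editing, accept notifications). All class, field and function names are hypothetical, and `fake_pwt` stands in for the real PWT submission service, whose interface was not defined at the meeting.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class PwtReceipt:
    """The ID and link that PWT returns on submission (field names assumed)."""
    pwt_id: str
    link: str

@dataclass
class Publication:
    title: str
    receipt: Optional[PwtReceipt] = None  # set once submitted to PWT
    status: str = "draft"

    @property
    def locked(self) -> bool:
        # After submission the publication may only be changed in PWT.
        return self.receipt is not None

    def edit_title(self, new_title: str) -> None:
        if self.locked:
            raise PermissionError(f"Edit in PWT instead: {self.receipt.link}")
        self.title = new_title

    def submit(self, send_to_pwt) -> PwtReceipt:
        # send_to_pwt is a stand-in callable for the real submission call.
        self.receipt = send_to_pwt(self)
        self.status = "submitted"
        return self.receipt

    def notify(self, new_status: str) -> None:
        # Called when a PWT update trigger is relayed to Scratchpads.
        self.status = new_status

def fake_pwt(publication: Publication) -> PwtReceipt:
    """Hypothetical stub: pretends PWT accepted the publication."""
    return PwtReceipt(pwt_id="pwt-1", link="https://pwt.example/doc/pwt-1")

pub = Publication("Draft title")
pub.edit_title("Revised title")   # allowed: not yet submitted
pub.submit(fake_pwt)              # Scratchpads keeps the returned ID and link
pub.notify("in review")           # status update pushed from PWT
```

Keeping only the receipt and a status on the Scratchpads side matches the decision that PWT holds the editable copy after submission.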

Tentative due date - PWT update in one month, and then the links between it and Scratchpads to follow: update this with an agreed, realistic due date after Darwin Core discussion.

RefBank-Scratchpads integration:
 * expected workflow is for the user to populate Scratchpads only; RefBank will harvest Scratchpads
 * user can search RefBank from their Scratchpad; the search is to be integrated as a simple view
 * user's search results can be saved, or edited and saved, into the user's Scratchpad - amended references will be automatically harvested by RefBank without action by the user
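
One way the automatic harvesting of amended references might work is an incremental harvest keyed on a modification timestamp. The notes do not say how RefBank will detect changes, so this mechanism, the record layout and the field names below are assumptions for illustration only.

```python
from datetime import datetime

# Hypothetical records as a Scratchpad might expose them; field names are assumed.
scratchpad_refs = [
    {"id": 1, "citation": "Reference A", "modified": datetime(2012, 4, 10)},
    {"id": 2, "citation": "Reference B (amended)", "modified": datetime(2012, 4, 17)},
    {"id": 3, "citation": "Reference C", "modified": datetime(2012, 4, 18)},
]

def harvest(references, last_harvest):
    """Return references created or amended since the previous harvest."""
    return [ref for ref in references if ref["modified"] > last_harvest]

last_harvest = datetime(2012, 4, 15)
to_update = harvest(scratchpad_refs, last_harvest)
harvested_ids = [ref["id"] for ref in to_update]
```

Under this assumption, saving an edited reference only needs to update its timestamp, and the user takes no further action.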

''Action: Dauvit to visit NHM and agree details within the next two weeks. NB Final delivery date (TBA) depends on the stability of the currently changing Drupal Biblio module.''

Evangelos gave a talk on his work at HCMR on finding proteins and environmental terms in PubMed abstracts.

The project has created the SPECIES corpus of 800 PubMed abstracts, 100 in each field (e.g. virology). The corpus will be made available along with the code (written in C++ for speed ;-)) when publicly released. The application is currently being evaluated to provide statistical evaluation data.

Project/software has coped with some:
 * issues of ambiguity, e.g. "water channel" means different things in the context of cells and of the environment
 * dirty data - in one journal, italicised words lost the spaces before and after them, creating one long concatenated word; this was handled by manual curation of the data
 * synonyms, legacy names and some orthographic mistakes, which are handled by reference to the NCBI taxonomy
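
Resolving synonyms against a taxonomy can be illustrated with a simple dictionary-based tagger. This is a minimal sketch, not Evangelos's C++ implementation. The lookup table is a tiny hand-made example rather than an extract from the NCBI taxonomy (the two taxon IDs are the NCBI ones for these species), and the function name is hypothetical.

```python
import re

# Tiny hand-made lookup for illustration only; a real tagger would load
# names and synonyms from the NCBI taxonomy.
NAME_TO_TAXID = {
    "escherichia coli": 562,
    "e. coli": 562,
    "homo sapiens": 9606,
    "h. sapiens": 9606,
}

def tag_species(text):
    """Return (start offset, matched text, taxon id) tuples in document order."""
    hits = []
    for name, taxid in NAME_TO_TAXID.items():
        # Require non-word characters (or text boundaries) around the name.
        pattern = r"(?<!\w)" + re.escape(name) + r"(?!\w)"
        for match in re.finditer(pattern, text, flags=re.IGNORECASE):
            hits.append((match.start(), match.group(0), taxid))
    return sorted(hits)

sample = "E. coli and Escherichia coli were isolated from Homo sapiens samples."
tags = tag_species(sample)
taxon_ids = [taxid for _, _, taxid in tags]
```

Because both the abbreviated and the full name map to the same identifier, downstream code can treat the mentions as the same organism. Ambiguous terms such as "water channel" would need context beyond a lookup like this.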

There is also a beta annotator, http://onthefly.embl.de, for proteins and chemicals. The interface works, but the functionality might not, as there are still some problems linking to NCBI. Keep an eye on developments.