Classifying service providers

= Classifying Service Providers =

Source data
Gold standard data for the initial work was in

Current Gold standard is

Data reformatted
The data has been prepared for training the classifiers using a PHP script.

The output is three SQLite files, one for each level of classification.
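As a rough illustration of how one of these SQLite files might be read for training, here is a minimal sketch. The schema is not documented on this page, so the table name `training` and the columns `name` and `category` are assumptions, and an in-memory database with toy rows stands in for the real file.

```python
import sqlite3


def load_training_rows(connection):
    """Return (name, category) pairs from a hypothetical `training` table."""
    # Table and column names are assumptions, not the real schema.
    cursor = connection.execute(
        "SELECT name, category FROM training ORDER BY rowid"
    )
    return [(name, category) for name, category in cursor]


# An in-memory database stands in for one of the three files.
connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE training (name TEXT, category TEXT)")
connection.executemany(
    "INSERT INTO training VALUES (?, ?)",
    [
        ("california academy of sciences", "general"),
        ("institut national de recherches agronomiques", "agriculture/animal health"),
    ],
)
rows = load_training_rows(connection)
connection.close()
```

With a real file, `sqlite3.connect(":memory:")` would be replaced by a connection to that file's path, and the query adjusted to the actual schema.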

Simple classifier
The initial work was to code a very simple classifier in Python.

The expectation was that, given the sophistication of the previous filter, classifying the selected Service Providers should be relatively easy.

In this context, simple means:


 * naïve: all terms within the Service Provider are given equal value when classifying
 * no thresholds: all categories are of equal value (unlike, say, in a spam filter), and the unknown Service Providers (i.e. our 'other' category) have already been removed at the filter stage
 * no other rules and no hybrid approach to classification
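The three properties above can be sketched as follows. This is an illustrative reconstruction under stated assumptions, not the classifier actually used: the class name `SimpleClassifier`, the tokenizer, and the toy training data are all hypothetical. Every term counts equally, and the category with the highest raw overlap wins, with no threshold.

```python
import re
from collections import defaultdict


def tokenize(name):
    """Split a provider name into a set of lower-case terms."""
    return set(re.findall(r"\w+", name.lower()))


class SimpleClassifier:
    """Hypothetical sketch: unweighted term counts, no thresholds, no rules."""

    def __init__(self):
        # category -> term -> number of training names containing the term
        self.term_counts = defaultdict(lambda: defaultdict(int))

    def train(self, name, category):
        for term in tokenize(name):
            self.term_counts[category][term] += 1

    def classify(self, name):
        if not self.term_counts:
            raise ValueError("classifier has not been trained")
        terms = tokenize(name)
        # Every term has equal value: a category's score is the plain sum
        # of the counts of the terms it shares with the name.
        scores = {
            category: sum(counts.get(term, 0) for term in terms)
            for category, counts in self.term_counts.items()
        }
        # No threshold: the best-scoring category always wins.
        return max(scores, key=scores.get)


# Toy training data, for illustration only.
classifier = SimpleClassifier()
classifier.train("california academy of sciences", "general")
classifier.train(
    "institut national de recherches agronomiques", "agriculture/animal health"
)
result = classifier.classify("academy of natural sciences")
```

Because a common word such as 'of' counts as much as a distinctive one, a sketch like this is easily skewed by frequent terms.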

Initial results:


 * level 1: 90% accuracy
 * level 2: 88% accuracy
 * level 3: 49% accuracy
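For reference, accuracy figures of this kind are the fraction of Service Providers whose predicted category matches the gold standard. A minimal sketch, assuming predictions and gold labels are held in parallel lists (the labels below are toy values, not the real results):

```python
def accuracy(predicted, gold):
    """Fraction of predictions that match the gold standard labels."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")
    if not gold:
        return 0.0
    correct = sum(1 for p, g in zip(predicted, gold) if p == g)
    return correct / len(gold)


# Toy labels, for illustration only.
gold = ["general", "general", "natural sciences", "chemistry"]
predicted = ["general", "humanities", "natural sciences", "chemistry"]
score = accuracy(predicted, gold)
```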

The higher level of abstraction at level three is causing problems. For example, many universities are classified as 'humaninties' (sic) instead of 'general', and 'california academy of sciences' should be 'general' but is placed in 'natural sciences' because of the limited cues available to the classifier. Interestingly, both 'institut national de la recherche agronomique' and 'institut national de recherches agronomiques' should be 'agriculture/animal health', but while the latter is correctly classified, the former is classified as 'development'. In general, 'institute' skews the classification of a Service Provider towards 'chemistry', resulting in three incorrect classifications, which suggests that a more sophisticated classifier is required.

Enhanced classifiers
The simple classifier was enhanced in two ways:


 * used Bayes to bring in weights to the terms when used to classify Service Providers
 * used Fisher to bring in weights to the categories when used to classify Service Providers

to produce this version of the classifier.
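The two enhancements can be illustrated with a common textbook formulation. This is a sketch under assumptions, not the linked classifier: the class and method names are hypothetical, the term weighting uses an assumed prior of 0.5 with weight 1, and the Fisher score combines per-term probabilities through an inverse chi-square function.

```python
import math
import re
from collections import defaultdict


def tokenize(name):
    """Split a provider name into a set of lower-case terms."""
    return set(re.findall(r"\w+", name.lower()))


def inverse_chi2(chi, degrees):
    """Upper-tail chi-square probability for an even number of degrees."""
    m = chi / 2.0
    term = total = math.exp(-m)
    for i in range(1, degrees // 2):
        term *= m / i
        total += term
    return min(total, 1.0)


class WeightedClassifier:
    """Hypothetical sketch of Bayes-style and Fisher-style scoring."""

    def __init__(self):
        self.term_counts = defaultdict(lambda: defaultdict(int))
        self.category_counts = defaultdict(int)

    def train(self, name, category):
        for term in tokenize(name):
            self.term_counts[term][category] += 1
        self.category_counts[category] += 1

    def term_given_category(self, term, category):
        count = self.term_counts.get(term, {}).get(category, 0)
        return count / self.category_counts[category]

    def category_given_term(self, term, category):
        total = sum(self.term_given_category(term, c) for c in self.category_counts)
        return self.term_given_category(term, category) / total if total else 0.0

    def weighted(self, term, basic, weight=1.0, assumed=0.5):
        # Rarely seen terms are pulled towards the assumed prior.
        seen = sum(self.term_counts.get(term, {}).values())
        return (weight * assumed + seen * basic) / (weight + seen)

    def bayes_score(self, name, category):
        # Category prior multiplied by the weighted P(term | category) values.
        score = self.category_counts[category] / sum(self.category_counts.values())
        for term in tokenize(name):
            score *= self.weighted(term, self.term_given_category(term, category))
        return score

    def fisher_score(self, name, category):
        # Combine the weighted P(category | term) values with Fisher's method.
        terms = tokenize(name)
        log_sum = sum(
            math.log(self.weighted(t, self.category_given_term(t, category)))
            for t in terms
        )
        return inverse_chi2(-2.0 * log_sum, 2 * len(terms))

    def classify(self, name, method="bayes"):
        score = self.bayes_score if method == "bayes" else self.fisher_score
        return max(self.category_counts, key=lambda c: score(name, c))


# Toy training data, for illustration only.
classifier = WeightedClassifier()
classifier.train("institute of chemistry", "chemistry")
classifier.train("institute of organic chemistry", "chemistry")
classifier.train("institut national de recherches agronomiques", "agriculture")
classifier.train("national agronomy institute", "agriculture")
bayes_result = classifier.classify("organic chemistry laboratory", method="bayes")
fisher_result = classifier.classify("organic chemistry laboratory", method="fisher")
```

In this sketch an unseen term such as 'laboratory' contributes the neutral value 0.5 rather than zero, so a single unknown word does not veto a category.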

Here is the doctest should you want to rework the classifier.

Using the enhanced classifier
Here are three sample scripts to apply the enhanced classifier:


 * simple classification at tier one, area
 * Bayesian classification at tier two, level
 * Fisher classification at tier three, focus