= Automatically identifying patterns in our data =

Background
We investigated using machine learning to look for patterns in our gold standard test data. We used the inductive logic programming (ILP) system Aleph. Our gold standard test data was converted into Aleph's input format and loaded into the system.

Using Aleph to identify patterns was, frankly, disappointing.

Results
Many entries - as expected - involved the language variants of university: universidad, universidade, universita, universitaet, universite, universiteit, universitet and university itself. All of these could be replaced with the template universi, which alone covers 49 of the 88 'to include' entries.
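Such a template is simply a substring test applied to each entry. A minimal sketch (in Python here, rather than the PHP script and Aleph tooling the project actually uses):

```python
# Sketch of a substring-template filter: an entry is marked 'to include'
# if the template occurs anywhere in it. The variants are those listed above.
variants = [
    "universidad", "universidade", "universita", "universitaet",
    "universite", "universiteit", "universitet", "university",
]

def matches_template(entry, template="universi"):
    """Return True if the template occurs anywhere in the entry."""
    return template in entry.lower()

# Every language variant of 'university' contains the template 'universi'.
print(all(matches_template(v) for v in variants))  # True
```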

Aleph's next suggestion with the greatest coverage was museum, with seven entries, none of which overlapped with universi.

Then came nacional/national, which also covered seven entries. However, two of these had already been covered by universi (universidad nacional autonoma de mexico and universidad nacional de colombia). In addition, this filter produced a false positive: nib national internet backbone. The filter term marked this entry for inclusion when it should have been excluded.

Then came research, which covered five entries, two of which were already covered by universi.

Then came the one genuinely interesting pattern, in which the word of must precede another of, e.g. institute of marine biology of crete. This filter rule correctly identified five entries to include, with no false positives, though two were already covered by universi.
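The 'of before of' rule can be sketched as a regular expression requiring the word of to occur twice, one occurrence preceding the other (a Python sketch; the example entries are taken from the text):

```python
import re

# 'of before of' filter rule: an entry qualifies when the whole word 'of'
# appears, followed later by a second whole word 'of'.
OF_BEFORE_OF = re.compile(r"\bof\b.*\bof\b")

def of_precedes_of(entry):
    """Return True if the entry contains 'of' followed by another 'of'."""
    return bool(OF_BEFORE_OF.search(entry.lower()))

print(of_precedes_of("institute of marine biology of crete"))  # True
print(of_precedes_of("museum of modern art"))                  # False
```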

Finally there were two locations, queensland and muenchen, each with two entries.

This left 20 entries not identified by any pattern. These entries did not share common terms and so had no pattern to mark them out.

This automatically produced filter set (universi, museum, na[c/t]ional, research, of before of, queensland and muenchen) marked 68 of the 88 'to include' entries for inclusion, and incorrectly marked 1 of the 220 'to exclude' entries for inclusion. Or in the jargon:

 * 68 true positives
 * 1 false positive
 * 20 false negatives
 * 219 true negatives

This gives:

 * precision [tp/(tp+fp)] 0.99
 * indicating a high degree of accuracy in correctly identifying a 'to include' entry
 * recall [tp/(tp+fn)] 0.77
 * but identifying only about three quarters of all valid 'to include' entries

And for completeness:

 * specificity [tn/(tn+fp)] 1.00
 * that's rounding for you ;-) the one false positive is lost among the 219 correctly identified 'to exclude' entries
 * accuracy [(tp+tn)/(tp+tn+fp+fn)] 0.93
 * this suggests the filter list is quite accurate overall, but is misleading for our purposes because it is skewed by the large number (219) of true negative entries; our work will suffer because of the large number (20) of missed entries we want included
 * f-measure [2(precision×recall)/(precision+recall)] 0.87
 * a better overall summary of the usefulness of this filter list, as it indicates there is room for improvement
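For completeness, all of the figures above can be recomputed directly from the four confusion-matrix counts (a Python sketch; the formulas are those shown in the bullets):

```python
# Recompute the evaluation metrics from the confusion-matrix counts above.
tp, fp, fn, tn = 68, 1, 20, 219

precision   = tp / (tp + fp)                   # 68/69   ≈ 0.99
recall      = tp / (tp + fn)                   # 68/88   ≈ 0.77
specificity = tn / (tn + fp)                   # 219/220 ≈ 1.00
accuracy    = (tp + tn) / (tp + tn + fp + fn)  # 287/308 ≈ 0.93
f_measure   = 2 * (precision * recall) / (precision + recall)  # ≈ 0.87

for name, value in [("precision", precision), ("recall", recall),
                    ("specificity", specificity), ("accuracy", accuracy),
                    ("f-measure", f_measure)]:
    print(f"{name}: {value:.2f}")
```

Note that the f-measure can equivalently be computed as 2tp/(2tp+fp+fn) = 136/157.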

Conclusion
In summary, this was not a very useful exercise in automatically suggesting filter terms by finding patterns in the data, possibly because the sample set of 'to include' entries was relatively small. However, the exercise was useful in highlighting a few key terms: universi, museum, nacional, national and research. We will take these forward in our research.

Subsequent reflection
Later work in preparing the gold standard data revealed that 'universi' covered neither 'univerzitet u beogradu' nor 'univerza v ljubljani'.

It is interesting that the South Slav languages use a 'z' for the 's' in university. Using 'univerz' as a template matched the two target entries and did not produce any false positives. Replacing both 'universi' and 'univerz' with the shorter template 'univer' also worked well.

This highlights an issue with using inductive logic programming to suggest terms: both 'univerzitet' and 'univerza' occurred only once in the gold data, so there is no pattern that can find them. They are also too far from 'university' to be matched as variants of that term, and too different from each other to easily form a pattern.

The good news is that the PHP script and the gold data give us a toolset to explore the accuracy when using different terms.