Identify the unique terms

= Identify unique terms =

Our approach uses the terms in Google Analytics' Service Provider field as a filter to select which accesses to Scratchpad we process.

A PHP script was produced to list and count the unique terms in the Service Provider field. (Here is an equivalent script in Python should you want something faster.)

The script was run on the gold standard file with 975 entries and the full training data set with 6,729 entries.
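The counting step can be sketched in Python as follows. This is a minimal illustration, not the script referred to above: the function name `count_terms` and the sample values are hypothetical, and it assumes terms are lower-cased and separated by whitespace only.

```python
from collections import Counter

def count_terms(providers):
    """Count whitespace-separated terms across Service Provider values."""
    counts = Counter()
    for provider in providers:
        # Assumption: lower-case and split on whitespace only, so any
        # punctuation stays attached to the term it touches.
        counts.update(provider.lower().split())
    return counts

# Hypothetical Service Provider values, for illustration only.
sample = [
    "university of example",
    "example university (australia)",
    "example telecom australia",
]
term_counts = count_terms(sample)

# List each unique term with its count, most frequent first.
for term, n in term_counts.most_common():
    print(term, n)
```

Because the split is on whitespace only, '(australia)' and 'australia' are counted as separate terms in this sketch.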

A quick review of the output, for example by sorting it, is already informative. In the gold standard data set of 975 entries:


 * there are 1,547 unique terms, of which 1,155 occur only once in the data; and
 * the most common term is 'university', which occurs 118 times;

…which suggests we can quickly identify many relevant users by searching for 'university', but with so many unique terms we will have to produce a lengthy filter list to capture everything else we consider relevant.
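Summary figures of this kind can be derived from a table of term counts with a few lines of Python. The sketch below is illustrative only: `summarise` is a hypothetical helper and the counts are invented, not taken from our data.

```python
from collections import Counter

def summarise(counts):
    """Return (unique terms, terms seen once, most common term, its count)."""
    unique = len(counts)
    singletons = sum(1 for n in counts.values() if n == 1)
    top_term, top_count = counts.most_common(1)[0]
    return unique, singletons, top_term, top_count

# Invented counts, for illustration only.
counts = Counter({"university": 3, "of": 2, "telecom": 1, "(australia)": 1})
summary = summarise(counts)
print(summary)
```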

A word of caution if you view the output data in Excel: there appear to be two rows with the term '2'. In fact there is one row with the term '2' and one row with the term '02'. Excel treats '02' as a number and displays it as '2', dropping the leading zero. Confusing!

In the output data the term '02' occurs only once. The term '2' occurs 2 times in the gold standard 975 data and 103 times in the full training data.
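One way to spot such cases before opening the file in a spreadsheet is to look for terms that differ as text but would be read as the same number. A small sketch, assuming the terms are available as a list of strings (`numeric_collisions` is a hypothetical helper):

```python
from collections import defaultdict

def numeric_collisions(terms):
    """Group all-digit terms by numeric value; keep groups with several spellings."""
    groups = defaultdict(list)
    for term in terms:
        if term.isdecimal():  # only terms made up entirely of digits
            groups[int(term)].append(term)
    return {value: sorted(group) for value, group in groups.items() if len(group) > 1}

collisions = numeric_collisions(["2", "02", "university", "10", "(australia)"])
print(collisions)  # {2: ['02', '2']}
```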

Also note that this was a quick initial piece of work done over a lunchtime, and it does not allow for punctuation in the terms. Hence '(australia)' is a different term from 'australia'. However, in our filter apply and assess script a search for 'australia' will match both terms.
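The difference between exact terms and a search can be illustrated as below. This assumes the search is a plain substring match, which is consistent with the behaviour described but is not taken from the filter script itself; `matching_terms` is a hypothetical name.

```python
def matching_terms(search, terms):
    """Return the terms that contain the search string (assumed substring match)."""
    return [term for term in terms if search in term]

terms = ["australia", "(australia)", "austria", "university"]
matches = matching_terms("australia", terms)
print(matches)  # ['australia', '(australia)']
```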

We may revisit this work and cluster similar terms to allow for punctuation if we think this will be useful.
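If we do revisit it, one simple option would be to strip leading and trailing punctuation from each term and merge the counts. The sketch below shows the idea only; `merge_punctuation_variants` is a hypothetical helper, the counts are invented, and a term made up entirely of punctuation is kept unchanged.

```python
import string
from collections import Counter

def merge_punctuation_variants(counts):
    """Merge counts of terms that differ only by surrounding punctuation."""
    merged = Counter()
    for term, n in counts.items():
        # Strip ASCII punctuation from both ends; if nothing is left,
        # keep the original term so that it is not lost.
        key = term.strip(string.punctuation) or term
        merged[key] += n
    return merged

# Invented counts, for illustration only.
counts = Counter({"australia": 5, "(australia)": 1, "co.": 2, "co": 3, "-": 1})
merged = merge_punctuation_variants(counts)
print(merged)
```

This handles only punctuation at the start or end of a term; punctuation inside a term, such as a hyphen, is left untouched.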

Input files






Output files