GAmetrics

= Google Analytics metrics =

Background
Please read , which refers to this filter set

In summary, classification takes two steps:
 * 1) Separate the ISPs whose name identifies an organization from the ISPs that are telecom and hosting companies (their names do not tell you which organization is behind them, e.g. “AOL”, “Tiscali”).
 * 2) Use the list of ISPs from identifiable organizations to cluster them into categories (e.g. government, research, NGO).

Initial work
Rough notes on our primary work to classify Service Providers =>

Here are the two filter sets mentioned in the rough notes above, filter-sets and filter-set2, and here is the apply and assess script.

Here is an example of using the script to compare two similar search terms, which helps us refine the filter sets.

Refinement of exclude filters
At the start of 2012, Daphne revisited the set of exclude filter terms by hand marking the gold975 data to produce.

The suggestions in that file were used by David to refine the actual filter sets. The suggested terms were analysed and, where there were overlaps, these were removed. Similarly, each term was tested to see whether it could be used as a template. At each stage the apply and assess script was used to test the filter set. The unique set of terms identified by Daphne and as modified by David is noted in.

The final set of exclude filters was merged into the established set of six include filters to produce this filter set. Running the version of the apply and assess script that was enhanced to use numbers to identify Service Providers to exclude gives these results:

              include  exclude    other
 totals:          302      675       23
 false pos:         7        9      262
 false neg:        91      183        4
 true pos:        211      492       19
 true neg:        691      316      715
 sum:            1000     1000     1000
 precision:      0.97     0.98     0.07
 recall:         0.70     0.73     0.83
 f-measure:      0.82     0.84     0.12

In summary, for those Service Providers that the filters do match, they are very good at correctly assigning them as include or exclude. However, not even three-quarters of the total entries are matched, so more terms need to be added to the filter set.
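The precision, recall and f-measure figures follow directly from the true/false positive and negative counts. A minimal Python sketch, assuming the usual definitions (precision = tp / (tp + fp), recall = tp / (tp + fn), f-measure = their harmonic mean); the function name `prf` is ours and is not taken from the apply and assess script:

```python
def prf(true_pos, false_pos, false_neg):
    """Return (precision, recall, f_measure) for one class."""
    predicted = true_pos + false_pos   # everything the filter assigned to the class
    actual = true_pos + false_neg      # everything hand marked as the class
    precision = true_pos / predicted if predicted else 0.0
    recall = true_pos / actual if actual else 0.0
    if precision + recall == 0:
        return 0.0, 0.0, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure

# Counts copied from the exclude column of the results above.
precision, recall, f_measure = prf(true_pos=492, false_pos=9, false_neg=183)
print(f"{precision:.2f} {recall:.2f} {f_measure:.2f}")  # 0.98 0.73 0.84
```

Note that true negatives do not enter any of the three figures, which is why a class can have many true negatives and still score poorly.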

Second level analysis - Classifying
The second phase of our work is to classify the included Service Providers identified by the initial analysis.

Other analysis
We used the Google Analytics data in other analysis work.

One exercise we undertook was to apply inductive logic programming to the task of identifying patterns in our data, and hence automatically produce a filter list.

A second, very simple exercise was to identify the unique terms available to us in the Service Providers field provided by Google Analytics.

Manual review of the Service Providers to exclude suggested that numbers are a useful filter. This proved to be the case, and a filter that tests for numbers was added to the apply and assess script.
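As an illustration of a numbers filter, the sketch below flags any Service Provider name that contains a digit. This is our assumption about what "test for numbers" means; the actual test in the apply and assess script may differ, and the example names are hypothetical.

```python
import re

# Assumption: "contains a number" means "contains at least one ASCII digit".
HAS_DIGIT = re.compile(r"[0-9]")

def contains_number(service_provider):
    """Return True if the Service Provider name contains a digit."""
    return HAS_DIGIT.search(service_provider) is not None

# Hypothetical names, for illustration only.
print(contains_number("example dsl pool 3"))       # True
print(contains_number("natural history museum"))   # False
```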

Code
Initially a simple Python script was written to apply the original filter sets to downloaded data.

Following the production of gold standard test data, a PHP script was written to process filter lists easily. The script reads two files:


 * gold-2011.csv
 * filter-set.php

where filter-set is supplied as a parameter to the script. The script writes two files:


 * gold-2011-filter-set-applied.csv
 * gold-2011-filter-set-applied.log
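The read-apply-write flow can be sketched in Python as follows. This is not the PHP script itself: the function name, the representation of a filter set as a list of substrings, the absence of a header row and the sample rows are all assumptions made for illustration, and the log output is omitted.

```python
import csv
import io

def apply_filter_set(gold_file, include_terms, out_file):
    """Copy each gold row to out_file with the filter's mark appended."""
    writer = csv.writer(out_file)
    for row in csv.reader(gold_file):
        provider = row[0].lower()
        # Assumption: a term matches when it occurs anywhere in the name.
        matched = any(term in provider for term in include_terms)
        writer.writerow(row + ["i" if matched else "e"])

# Hypothetical rows: Service Provider, Visits, Time on Site, Mark.
gold = io.StringIO(
    "university of example,10,00:01:30,i\n"
    "example telecom broadband,200,00:00:45,e\n"
)
applied = io.StringIO()
apply_filter_set(gold, ["universi"], applied)
print(applied.getvalue())
```

With real files, the two `io.StringIO` objects would be replaced by `open("gold-2011.csv", newline="")` and `open("gold-2011-filter-set-applied.csv", "w", newline="")`.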

Two example filter sets are available here: the original filter set, and the six-term filter set produced after we applied inductive logic programming to the task of identifying patterns in our data.

You can also use the script to aid your analysis of the data by passing it single terms and seeing how applicable they are. For example, the term 'universi' on its own as a filter produces these results:


 * false pos: 0, false neg: 39, true pos: 49, true neg: 220
 * precision: 1, recall: 0.56, f-measure: 0.72

which shows that 56% (49 out of 88) of the Service Providers we marked to include are easily identifiable as universities.
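The four counts for a single term can be obtained by comparing the term's match against the hand mark on each row. A minimal sketch, assuming case-insensitive substring matching; the function name and the sample rows are hypothetical and are not taken from the gold data:

```python
def confusion_counts(rows, term):
    """Count outcomes of using `term` as an include filter.

    rows: iterable of (service_provider, mark) pairs, mark being 'i' or 'e'.
    """
    counts = {"true pos": 0, "false pos": 0, "false neg": 0, "true neg": 0}
    for provider, mark in rows:
        matched = term in provider.lower()
        if matched and mark == "i":
            counts["true pos"] += 1     # matched, and hand marked include
        elif matched:
            counts["false pos"] += 1    # matched, but hand marked exclude
        elif mark == "i":
            counts["false neg"] += 1    # missed an include entry
        else:
            counts["true neg"] += 1     # correctly left alone
    return counts

# Hypothetical rows, for illustration only.
sample_rows = [
    ("university of example", "i"),
    ("example universitat", "i"),
    ("example natural history museum", "i"),
    ("example telecom", "e"),
]
counts = confusion_counts(sample_rows, "universi")
print(counts)
```

The truncated term 'universi' is useful because it also matches spellings in other languages, such as "universitat" in the sample.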

Following the decision to produce output marked as either `include`, `exclude` or `other`, a more sophisticated apply and assess script was written.
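A three-way marking can be sketched as below. The order of the checks (include terms first, then exclude terms, with everything unmatched falling to `other`) is our assumption and may not match the actual script; the terms and names are hypothetical.

```python
def mark_provider(provider, include_terms, exclude_terms):
    """Return 'include', 'exclude' or 'other' for one Service Provider."""
    name = provider.lower()
    if any(term in name for term in include_terms):
        return "include"
    if any(term in name for term in exclude_terms):
        return "exclude"
    # Assumption: anything no filter term matches is reported as 'other'.
    return "other"

# Hypothetical filter terms and names, for illustration only.
include_terms = ["universi", "museum"]
exclude_terms = ["telecom", "broadband"]
for name in ["University of Example", "Example Telecom", "Example Holdings"]:
    print(name, "->", mark_provider(name, include_terms, exclude_terms))
```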

Source data
Original list File:GA_filter_ISPs_v2.txt

Gold standard test data
This is a set of hand marked entries retrieved from Google Analytics. We can use this test data to assess the accuracy of our filter sets.

There are 308 rows. Each row has four columns: Service Provider, Visits, Time on Site and Mark. The first three columns are data from Google Analytics. The fourth column contains the mark we assigned based on whether the Service Provider meets our criteria or not. In summary, Service Providers that are commercial ISPs are marked 'e' because we want to exclude them from further analysis. Other Service Providers are marked 'i', and these we do wish to include in our further analysis to understand Scratchpad usage, e.g. what proportion of access is made by universities.

Of the 308 rows, 88 are marked 'i' and the remaining 220, 'e'.
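A quick way to check a copy of the file against these figures is to validate the column count and tally the marks. The sketch below assumes the file has no header row; the function name and the sample rows are hypothetical.

```python
import csv
import io
from collections import Counter

def count_marks(gold_file):
    """Tally the Mark column, checking every row has four columns."""
    marks = Counter()
    for line_number, row in enumerate(csv.reader(gold_file), start=1):
        if len(row) != 4:
            raise ValueError(
                f"row {line_number}: expected 4 columns, got {len(row)}"
            )
        marks[row[3]] += 1
    return marks

# Hypothetical rows: Service Provider, Visits, Time on Site, Mark.
sample = io.StringIO(
    "university of example,10,00:01:30,i\n"
    "example telecom,200,00:00:45,e\n"
    "example broadband,35,00:02:10,e\n"
)
marks = count_marks(sample)
print(marks["i"], marks["e"])  # 1 2
```

Run against the full gold-2011.csv, the tally should come to 88 'i' and 220 'e' if the file matches the description above.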

The file is in comma-separated value format to ease sharing the data with end-user programs such as Excel, OpenOffice Calc and SPSS, and with programs written in any language (we have used both Python and PHP). The file is editable with any text editor, even Notepad.



Extension to 975 entries
We have now prepared an extended gold standard data set with 975 marked entries, of which 301 entries are marked to include and 674 entries marked to exclude.

Adding 'other' entries
Following further discussion we now have two more extended datasets; each has 'other' entries directly marked as such:
 * an extended gold standard data set [[File:Gold320-2011.csv]] that extends gold-2011.csv and has 320 marked entries, of which 88 entries are marked to include, 220 exclude and 12 other.
 * an extended gold standard data set [[File:Gold1000-2011.csv]] that extends gold975-2011.csv and has 1000 marked entries, of which 303 entries are marked to include, 674 exclude and 23 other.

Further revision
The latest version of the gold1000 data has the following five changes:
 * 'nerc computer services' and  'tu darmstadt hochschulrechenzentrum' are now i
 * 'powys county council', 'florida information resource network' and 'centre de ressources informatique (cri)' are now e

These changes mean that the data now has 1000 marked entries, of which 302 entries are marked include, 675 exclude and 23 other.
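The new totals can be checked with a little arithmetic. The sketch assumes that the two entries now marked 'i' were previously 'e' and that the three entries now marked 'e' were previously 'i'; the totals are consistent with that reading, but the text does not state the previous marks.

```python
# Totals before the revision, as given for the gold1000 data set.
before = {"include": 303, "exclude": 674, "other": 23}

moved_to_include = 2   # assumed to have been marked 'e' before
moved_to_exclude = 3   # assumed to have been marked 'i' before

after = {
    "include": before["include"] + moved_to_include - moved_to_exclude,
    "exclude": before["exclude"] + moved_to_exclude - moved_to_include,
    "other": before["other"],
}
print(after)  # {'include': 302, 'exclude': 675, 'other': 23}
```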

This data was then used to prepare the Gold standard data for the second phase of our work, the classification of included Service Providers.



Further amendments (highlighted in yellow) were made to the classifications to produce:



and its csv equivalent (note how it has been renamed for simplicity):