Talk:PubMan Func Spec Statistics

=Scenarios=
For comments and discussions, see the talk/discussion page.

Basic PubMan item statistics
The public user wants an indication of how a certain article has been received and would like to see the download numbers of a specific preprint when accessing the item page in PubMan.

Interpretations/Analysis
The author wants to understand the visibility of his research. He wants to understand how often his article is accessed via a hit in a Google/Google Scholar search.
 * Google/Google Scholar

The author wants to understand the geographical distribution of research interest.
 * geographical distribution

The author wants to understand the background of users accessing his research and would like to understand the domain they are coming from, such as .com, .edu, .gov
 * domain statistics

The author wants to know whether his publication is accessed by colleagues from the same or neighbouring departments of his institution.
 * institutional statistics

''Most of these requirements need the analysis of IP-addresses to get geographic or location-specific information of the user. It is not directly supported by the eSciDoc coreservice to automatically gather and store these IP-addresses. Thus, this must be realized in the solution. But then we have the problem that more than one solution can access one repository and it can even be accessed without any solution, e.g. by REST-interface. These requests wouldn't be counted and the statistic is distorted'' --Haarlaender 08:58, 27 October 2008 (UTC)

In any case, any provision of statistical data has to have a "disclaimer" how to read and understand the statistics. --Ulla 17:10, 27 October 2008 (UTC)

Visualisations
The author wants to get statistical data and its analyses/interpretations adequately visualised by graphs, timelines, diagrams and geographical maps.

Reports
Based on administrative searches (queries on demand and saved searches), reports (item lists and/or numbers) can be delivered:

Current coverage in the repository
 * The author wants to have an overview on all his records deposited and released


 * The institution wants to have an overview on all records of a specific department deposited and released

Open Access
 * The author wants to have an overview on all his records with at least one OA component
 * The author wants to have an overview on the number of OA components
 * The institution wants to have an overview on the number of OA components per department
 * The OA policy department would like to prepare a statistical report on increase of OA publications in the last 2 years, with details on monthly level, for the MPS. (See more scenarios under OA statistics)
 * The local Press department needs reliable numbers of record entries per department, including the number of fulltexts. They would like to get the numbers in a re-usable format, e.g. XML.

=Requirements=

Basic item/component statistics
Status: implemented (see Use case view item statistics)

Schedule: R3


 * Numbers of retrievals for a specific item from the framework by users (anonymous/all).
 * Visibility: public
 * Numbers of file downloads for a specific item by users (anonymous/all).
 * Visibility: public

Downloads of files with content type “copyright transfer agreement” and “correspondence” are not counted.
 * Numbers of downloads for a specific file by users (anonymous/all)
 * Visibility: public
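
The counting rule above (downloads of files with content type "copyright transfer agreement" or "correspondence" are not counted) could be sketched as follows. This is only an illustration: the event dictionaries and their keys `file_id` and `content_type`, as well as the `file_to_item` mapping, are assumptions and not part of the eSciDoc statistic record.

```python
from collections import Counter

# Content types whose downloads are not counted, per the rule above.
EXCLUDED_CONTENT_TYPES = {"copyright transfer agreement", "correspondence"}


def count_file_downloads(download_events):
    """Return per-file download counts, skipping excluded content types.

    Each event is a dict with the hypothetical keys 'file_id' and
    'content_type'; the real statistic record may use other names.
    """
    counts = Counter()
    for event in download_events:
        if event["content_type"].strip().lower() in EXCLUDED_CONTENT_TYPES:
            continue
        counts[event["file_id"]] += 1
    return counts


def count_item_downloads(download_events, file_to_item):
    """Sum the counted file downloads per item (file_to_item is assumed)."""
    totals = Counter()
    for file_id, n in count_file_downloads(download_events).items():
        totals[file_to_item[file_id]] += n
    return totals
```

The item-level number is derived from the file-level numbers, so both public figures always follow the same exclusion rule.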

Extending the Statistical Service
Status: in design

Schedule: ???

Page Retrievals
    Logging of page retrievals in eSciDoc solutions
    {#pageName}
    {#solutionName or ID}
    {#IPAdress}
    {#YES|NO}
    //Possible to gain info about user behaviour within one session
    {#SessionID}
    //To check where the user came from
    {#referrer}
 * Record for page retrievals (different pages: homepage, editItem, workspace etc.)

Item manipulation
    Logging of item manipulations in eSciDoc solutions
    //Internal item ID
    {#ItemID}
    //Persistent item identifier
    {#PID}
    //Combination of Person and Organization with pattern: (coneId, AffId)
    (Cone:123,escidoc:persistent22),(Cone:456,escidoc:persistent26), (Cone:456,escidoc:persistent25)
    //Possible to gain info about user behaviour within one session
    {#SessionID}
    {#solutionName or ID}
    <parameter name="IP"> {#IPAdress}
    <parameter name="LoginIn"> {#YES|NO}
    {#retrieval|submission|release|withdraw|accept|sendToRework|...}
    {#ItemContext}
    <parameter name="itemhasOAcomponent"> {#YES|NO}
    <parameter name="itemStatus"> {#submitted|released|pending...}
    </statistic-record>
 * Record for item manipulation (different tasks: retrieval, submission, release, withdraw etc.)
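
As a rough illustration of how a solution might fill such a record before sending it to the statistic service, here is a minimal sketch. The element names `statistic-record`, `scope` and `parameter` are taken from the templates on this page. The child element `stringvalue` and the scope id `Item_Record` are assumptions made for the example, and the actual eSciDoc schema may differ.

```python
import xml.etree.ElementTree as ET


def build_statistic_record(scope_id, parameters):
    """Serialize a dict of parameters as a statistic-record XML string.

    'stringvalue' is a hypothetical wrapper element for each value;
    check the eSciDoc statistic-data schema before relying on it.
    """
    record = ET.Element("statistic-record")
    ET.SubElement(record, "scope", {"objid": scope_id})
    for name, value in parameters.items():
        param = ET.SubElement(record, "parameter", {"name": name})
        ET.SubElement(param, "stringvalue").text = str(value)
    return ET.tostring(record, encoding="unicode")


# Example: an anonymous retrieval of a released item with an OA component.
example_xml = build_statistic_record(
    "Item_Record",  # hypothetical scope id
    {
        "IP": "192.0.2.1",
        "LoginIn": "NO",
        "itemhasOAcomponent": "YES",
        "itemStatus": "released",
    },
)
```

Building the record from a plain dictionary keeps the PubMan-side code short: each action only has to collect its values and call one helper.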

On this issue, and as it came up in the PEER project, I would like to check what we mean by ItemID: is the internal ID of the item meant here? I would also like to propose tracking the PID of this item in the statistical record. --Natasa 11:41, 3 December 2008 (UTC)
 * Added PID to record--Friederike 09:20, 9 February 2009 (UTC)

Export
    <?xml version="1.0" encoding="UTF-8"?>
    <scope objid="Export_Record"/>
    <parameter name="Export"> Logging of export information in eSciDoc solutions
    <parameter name="ItemID"> {#ItemID or ContainerID}
    -- A container can also be exported (e.g. a Faces album)
    <parameter name="PID">
    -- Persistent item identifier
    {#PID}
    {#solutionName or ID}
    {#exportType}
    -- NEW: defines what type of export has been performed: CSV, EndNote, BibTeX...
    <parameter name="IP"> {#IPAdress}
    <parameter name="LoginIn"> {#YES|NO}
    {#ItemContext}
    -- This is interesting for Faces
    {#NumberofExportedComponents}
    <parameter name="sessionId">
    -- Possible to gain info about user behaviour within one session
    {#SessionID}
    </statistic-record>
 * Record for export information

Import
    <?xml version="1.0" encoding="UTF-8"?>
    <scope objid="Import_Record"/>
    <parameter name="Import"> Logging of import information in eSciDoc solutions
    {#solutionName or ID}
    <parameter name="IP"> {#IPAdress}
    <parameter name="LoginIn"> {#YES|NO}
    {#ItemContext}
    {#bibtex|arXiv|pmc|endnote}
    <parameter name="sessionId">
    //Possible to gain info about user behaviour within one session
    {#SessionID}
    </statistic-record>
 * Record for import information


 * When we do an import, it would also be good to record the imported ID from the source and the ID acquired from eSciDoc --Natasa 16:36, 9 February 2009 (UTC)

Search
    <?xml version="1.0" encoding="UTF-8"?>
    <scope objid="Search_Record"/>
    <parameter name="Search"> Logging of search information in eSciDoc solutions
    {#solutionName or ID}
    <parameter name="IP"> {#IPAdress}
    <parameter name="LoginIn"> {#YES|NO}
    <parameter name="sessionId">
    //Possible to gain info about user behaviour within one session
    {#SessionID}
    {#cql}
    <parameter name="searchField"> {#(escidoc.title="test"),(escidoc:author="maier")}
    //The single fields with the value of search
    <parameter name="NumberOfHits">
    //Maybe of interest?
    {#NumberOfHits}
    <parameter name="FileSearch">
    //Maybe of interest?
    {#Yes|No}
    </statistic-record>
 * Record for information on search

The search record should be written on the PubMan side (not Search&Export), as otherwise the records may contain useless info (e.g. item retrieval during every researcher portfolio request). When writing records in Search&Export, the solutionName should be set to 'Search&Export'. It has to be clear which info we display on PubMan statistics pages (only PubMan searches and not Search&Export).--Friederike 09:17, 9 February 2009 (UTC)
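
To illustrate how the searchField values and the solutionName rule could be used when building the PubMan statistics pages, here is a minimal sketch. It assumes that search records are already available as dictionaries with the hypothetical keys `solution` and `cql`. The regular expression only handles simple `index="value"` pairs like the ones in the record above and is not a full CQL parser.

```python
import re
from collections import Counter

# Matches simple index="value" pairs such as escidoc.title="test".
CQL_TERM = re.compile(r'([\w.:]+)\s*=\s*"([^"]*)"')


def parse_search_fields(cql):
    """Return (index, value) pairs found in a simple CQL string."""
    return CQL_TERM.findall(cql)


def top_search_terms(records, top_n=10):
    """Count search values, ignoring records written by Search&Export."""
    terms = Counter()
    for record in records:
        if record["solution"] == "Search&Export":
            continue  # only PubMan searches are shown on PubMan pages
        for _index, value in parse_search_fields(record["cql"]):
            terms[value.lower()] += 1
    return terms.most_common(top_n)
```

Storing the complete CQL string and extracting the terms later keeps the record small. The price is that the parsing has to be repeated, or cached, whenever the statistic is retrieved.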

File download

 * In a similar manner to the PubMan item statistics, I would suggest also defining a more granular record for File / Locator statistics (retrieve, export, conversion to file) --Natasa 16:40, 9 February 2009 (UTC)
 * As input from PEER project:
 * statistical record should have file-id, file-pid, item-id, item-pid
 * statistical record should have file-name
 * would suggest that statistical record should have file-mime-type

Discussion

 * Input after VidConf with NIMS --Natasa 16:24, 22 December 2008 (UTC):


 * we need to track session ID (so that some analysis can be made based on user session) - added it to all records--Friederike 09:04, 9 February 2009 (UTC)
 * we need to track search criteria (so that we could have analysis of search terms)
 * nicely done for eDoc, and can be checked with Vlad

affiliations (and all parents) of itemID - why log this? we can retrieve this info dynamically when creating the statistic --Kleinfercher 15:23, 26 November 2008 (UTC)
 * The issue is whether to do it on retrieval of statistics or during the creation of the statistic record (i.e. when logging it in). If we expect to do it on retrieval, this might lead to less data in statistics, but we could not have defined aggregations on OU level in this case, right? Is it possible to define aggregations based on data not in the statistics raw record? On this issue, see also above: Author-statistics, Input from PEER project. --Natasa 13:35, 27 November 2008 (UTC)
 * Direct affiliation is now added. For parent affiliations I would suggest not writing this info into the record, but collecting it during statistics retrieval (we should not overload the record).--Friederike 09:04, 9 February 2009 (UTC)
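
To make the "collect parents during retrieval" option more concrete, the following sketch rolls up counts that were logged per direct affiliation to all parent organizations at query time. The `parent_of` mapping stands in for a lookup against the organizational unit service, and its structure is an assumption made for the example.

```python
from collections import Counter


def rollup_by_organization(view_counts, parent_of):
    """Add each affiliation's views to itself and to all of its parents.

    view_counts: direct affiliation id -> number of views (from records)
    parent_of:   child id -> parent id (hypothetical lookup; roots are
                 simply missing from the mapping)
    """
    totals = Counter()
    for ou, views in view_counts.items():
        visited = set()
        current = ou
        # Walk up the hierarchy; 'visited' guards against cyclic data.
        while current is not None and current not in visited:
            visited.add(current)
            totals[current] += views
            current = parent_of.get(current)
    return totals
```

This keeps the raw record small, but as noted in the discussion above, it means the OU-level numbers cannot be precomputed by an aggregation defined only on the raw record.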


 * I would agree to distinguish between page-retrievals and item-level operations. However, not certain if we would like to distinguish between import/export/item manipulation actions e.g.


 * import is one item manipulation action, and if we bring Import and item creation together (in case of item creation source would be PubMan) we could still know how many items were e.g. fetched from PubMed with OA components
 * export is again, similar manipulation as retrieval. Having this information together with the e.g. itemHasOA component we would know how many OA items have been exported.
 * to bring them all together, maybe we can say something like:

{#NumberofAffectedComponents} {#NumberofAffectedOAComponents}

for each action. Just a proposal to think of. --Natasa 13:35, 27 November 2008 (UTC)

New Requirements

 * Item basis
 * Which item? (objectId & pId)
 * What happened? (view, submit, release, withdraw, export etc)
 * Who is the user? (Which country, was he logged in? Where did he come from? Browser, screen resolution)
 * When did it happen?
 * Visualizations:
 * Item views over time
 * Item views over time, separated by logged-in
 * Item views, separated by viewers’ countries


 * Author basis
 * Which author (coneId)
 * (Which item?)
 * What happened (only views of items interesting here?)
 * Who is the user? (Which country, was he logged in? Where did he come from? Browser, screen resolution)
 * When did it happen?
 * Visualizations:
 * Views of items of a certain author over time
 * Views of items of a certain author over time, separated by login
 * Views of items of a certain author, separated by viewers’ countries


 * Organization basis
 * Which organization (organizationId)
 * (Which item?)
 * What happened (only views of items interesting here?)
 * Who is the user? (Which country, was he logged in? Where did he come from? Browser, screen resolution)
 * When did it happen?
 * Visualizations:
 * Views of items of a certain organization over time
 * Views of items of a certain organization over time, separated by login
 * Views of items of a certain organization, separated by viewers’ countries


 * Internal search basis
 * What was searched for? (Complete CQL query or search terms as simple string)
 * Who searched?
 * How many results?
 * When was searched?
 * Visualizations:
 * Top ten search keywords?


 * External search basis (e.g. users from Google search)


 * Open Questions
 * Is it necessary to log actions like submit, release and withdraw? Isn't this already covered by the eSciDoc event history?
 * Do we also want to have Page and Site statistics in general, not only item action statistics?
 * Should a view action only be counted once per user? (e.g. user is on View Item Page, goes to Item Log page, returns to View Item Page -> view is counted twice)
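
One possible answer to the last open question is to count a view only once per session and item. The sketch below shows this under the assumption that each logged event carries the hypothetical keys `action`, `session_id` and `item_id`. It is meant as a basis for discussion, not as a decision.

```python
def count_unique_views(events):
    """Count item views, at most once per (session, item) pair."""
    seen = set()
    counts = {}
    for event in events:
        if event["action"] != "view":
            continue
        key = (event["session_id"], event["item_id"])
        if key in seen:
            continue  # same user session returned to the same item
        seen.add(key)
        counts[event["item_id"]] = counts.get(event["item_id"], 0) + 1
    return counts
```

With this rule, the example from the question (View Item Page, Item Log page, back to View Item Page) yields one view instead of two.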

Data Model for new requirements
We would need a database schema like this:

This data then has to be aggregated frequently for performance reasons (the database would otherwise get too large).
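
The kind of aggregation meant here could look like the following sketch, which collapses raw records into daily counters per item, action and login state. The record keys `timestamp`, `item_id`, `action` and `logged_in` are assumptions made for the example. In the eSciDoc setup the aggregation would instead be defined for the Statistic Aggregation Handler.

```python
from collections import Counter
from datetime import datetime


def aggregate_daily(raw_records):
    """Collapse raw records into counts per (item, action, day, login)."""
    daily = Counter()
    for rec in raw_records:
        # Keep only the calendar day of the ISO 8601 timestamp.
        day = datetime.fromisoformat(rec["timestamp"]).date().isoformat()
        daily[(rec["item_id"], rec["action"], day, rec["logged_in"])] += 1
    return daily
```

After such a step, the raw records for the aggregated period can be dropped, and "views over time, separated by login" becomes a lookup in the much smaller aggregate.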

Comparison of statistical tools
For the visualization of the statistical data, use of an external tool is necessary. Three different approaches are distinguished here:
 * Simple chart image library (e.g. JFreeChart)
 * The library is used to create chart visualizations from our statistic data stored on eSciDoc coreservice side.
 * Storage of data: Stored in eSciDoc core database, using eSciDoc Statistic Handler. Needs definition of statistic-records that define what data we have to store. (see above).
 * Aggregation: Uses the eSciDoc Statistic Aggregation Handler. Needs definition of aggregations. Limited options here.
 * Interpretation: Has to be implemented completely manually by us. Needs other external tools, e.g. to map IPs to countries, referrers to search keywords etc.
 * Requirements:
 * Including of library.
 * Definition of statistic data.
 * Definition of aggregations.
 * Definition of statistic reports.
 * Adding code to Pubman whenever a statistic record has to be written (Reading out variables, calling statistic handler).
 * (Installation of external interpretation tools)
 * Implementation of data interpretation.
 * Implementation of visualization.
 * Additional Info: JFreeChart documentation is very limited and bad. A developer guide is available for a fee, but I can't say anything about its quality. All in all, this approach can get very complicated when working with advanced statistics and lead to a high effort. --Haarlaender 08:12, 22 June 2009 (UTC)


 * Full web statistic tools (e.g. Piwik, Google Analytics)
 * The tool manages the complete handling of the statistics, including storage, aggregation, interpretation and visualization of the data.
 * Storage of data: In an external database which is managed by the tool.
 * Aggregation: Done by the tool. Maybe customization needed.
 * Interpretation: Done by the tool. Maybe customization needed.
 * Visualization: Done by the tool. Maybe customization needed.
 * Requirements (depending on the tool):
 * External installation of the tool, e.g. setup of a webserver with PHP and MySQL for Piwik. Not necessary for Google Analytics.
 * Adding javascript to pages that call the statistic tool.
 * Additional Info for Piwik: Took a deeper look in the tool. Very nice for standard statistics, but our statistic requirements (storage and interpretation of additional data like author name, organizations) are not covered by the default Piwik. Would require the implementation of Plugins for storage, aggregation and visualization in PHP. Could get complicated, documentation is still not very sophisticated and in a beta state.
 * Additional Info for Google Analytics: Privacy problems with the data, as we always have when using Google Apps. Apart from that, it is the most advanced solution and fits our requirements best with the least effort (allows custom actions, advanced segmentation of data etc.). But, of course, it needs some tests first, like every other solution. A question is also how to distinguish between different PubMan instances. --Haarlaender 08:12, 22 June 2009 (UTC)
 * we asked herrn gerling about using google analytics for wals. the answer was basically that google analytics is pretty much beyond the pale.--Robert 17:04, 24 June 2009 (UTC)


 * Server log analyzer (e.g. AWStats with JAWStats)
 * The tool analyzes the logs of the webserver and builds statistics out of it.
 * Storage: Already done by the webserver.
 * Aggregation: Done by the tool.
 * Interpretation: Done by the tool and extensions.
 * Visualization: Done by the tool.
 * Requirements:
 * Installation of tool on webserver.
 * Customization of tool.
 * Find a way to include statistics in PubMan.
 * Additional Info: NIMS is currently testing these tools in conjunction with PubMan. The biggest problem is that not all data that is required by us is logged in the webserver. NIMS wants to test including data from the eSciDoc statistics in these tools. But I guess that then the implementation of custom aggregation and interpretation is necessary again. --Haarlaender 08:12, 22 June 2009 (UTC)

Implementation with eSciDoc Statistic Management
The databases of the eSciDoc Statistic Manager can only be created/filled/queried using the StatisticManager API over SOAP or REST with XML. The aggregation is also defined via XML. With this system, the usual recommendations regarding good database design, in particular avoiding redundancy, are not applicable. One way would be to use the above schema and aggregate every parameter into the same database table. This might lead to performance issues at a certain point when querying the statistic database. Additionally we would need:
 * A tool or service that maps ip-addresses to countries with a corresponding database. A free solution is e.g. GeoLiteCountry
 * A tool which can parse search keywords out from different referrers. Piwik e.g. has a file where around 600 search engines are distinguished in PHP. Did not find a similar solution for Java.
 * A tool that can detect search robots and crawlers and distinguish them from real users, e.g. using the HTTP user-agent header. No real Java solution found. There is a database of user-agent headers with a web service, but it is private and access will be slow and limited for larger amounts of data, so it is not applicable for us: Agentarius
 * Optional: A tool that can detect browsers and client settings from user-agents. Also no sophisticated Java solution found

In general, it can be said that many tools in this field are available for PHP and also Javascript, but not for Java.
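
To show what these helper tools would have to do, here is a deliberately small sketch of the three interpretation steps: referrer keywords, robot detection and IP-to-country mapping. All data in it is illustrative only. The search-engine table lists just two engines, the robot check is a simple substring heuristic, and the network-to-country table uses documentation address ranges with made-up country codes. A real setup would load a GeoIP database and a maintained list of engines and user agents.

```python
import ipaddress
from urllib.parse import urlparse, parse_qs

# Query parameter that carries the search terms, per engine (tiny sample).
SEARCH_PARAMS = {"google": "q", "yahoo": "p"}

# Substrings that commonly appear in crawler user agents (heuristic only).
BOT_MARKERS = ("bot", "crawler", "spider", "slurp")

# Made-up mapping on documentation ranges; replace with a GeoIP database.
NETWORK_COUNTRIES = [
    (ipaddress.ip_network("192.0.2.0/24"), "DE"),
    (ipaddress.ip_network("198.51.100.0/24"), "JP"),
]


def search_keywords(referrer):
    """Return the search terms from a known engine's referrer, else None."""
    parsed = urlparse(referrer)
    host_parts = (parsed.hostname or "").split(".")
    for engine, param in SEARCH_PARAMS.items():
        if engine in host_parts:
            values = parse_qs(parsed.query).get(param)
            return values[0] if values else None
    return None


def is_robot(user_agent):
    """Guess whether a request came from a crawler."""
    ua = (user_agent or "").lower()
    return any(marker in ua for marker in BOT_MARKERS)


def country_for_ip(ip):
    """Return the country code of the first matching network, else None."""
    address = ipaddress.ip_address(ip)
    for network, country in NETWORK_COUNTRIES:
        if address in network:
            return country
    return None
```

Each function is a pure lookup, so the data behind it (engine list, user-agent markers, GeoIP ranges) can be swapped without touching the code that writes or reads the statistic records.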

=Future developments=


 * Co-authoring
 * The author wants to understand how many of his papers are co-authored.
 * Possible with new item manipulation record--Friederike 09:07, 9 February 2009 (UTC)


 * The author wants to understand how many of his papers are co-authored with members of Institution X.
 * Possible with new item manipulation record--Friederike 09:07, 9 February 2009 (UTC)


 * Author-information
 * Input from PEER project: A need to generate various statistics on items created by authors from e.g. Europe
 * is it possible to assume this by author affiliation in publication? --Natasa 13:41, 27 November 2008 (UTC)


 * Cross-repository
 * The author wants to know the number of items related to his name across the collection/contexts in the repository.
 * Possible with new item manipulation record (Only for controlled names, personIds) --Friederike 09:07, 9 February 2009 (UTC)

''It is not clear to me how this should be technically possible. Would require to retrieve and accumulate statistic data from different repositories. Are these only eSciDoc-repositories or different types of repositories? Can they be accessed by only one application (e.g. PubMan) or by different ones?'' --Haarlaender 08:58, 27 October 2008 (UTC)
 * Harvesting statistics
 * The public user wants to get an overview of the reception of a certain preprint, which is not limited to the local repository. He would like to see statistics on the download of the preprint by summing the download numbers of all copies located in distributed repositories.

''This scenario is related to standardized exchange of statistical data, e.g. by using SUSHI. Addressing this requirement is scheduled for later releases; it is mentioned here so that it does not get lost. In addition, we have to exchange knowledge with Margit Palzenberger, she is quite deep into provision of statistical data by publishers. ''--Ulla 17:10, 27 October 2008 (UTC)


 * Citation metrics/Research evaluation


 * Private statistics
 * Some statistical information might be access restricted to the author himself or administrative staff.