Talk:PubMan OA Statistics

This page collects the OA Team's requirements for statistical data from eDoc and PubMan.

=Requirements for the current eDoc Statistics=
The aim is to measure the Open Access performance of the MPS in relation to the eDoc institutional repository.

Items to be considered:

 * With status "Released"
 * With status "Yearbook"

Metrics
The following dimensions should be derivable from the statistics. The currently implemented statistics are emphasized in bold.
 * Time
 * System (eDoc Created)
 * Time of the eDoc document creation. Can be taken from the document history.
 * System (eDoc Released)
 * --Makarenko 14:10, 31 July 2009 (UTC) and Anja
 * eDoc accepted && authorized
 * Date of the first release; it would be taken from the item history.
 * Problem: the statistics can differ from the old ones; the deviation should be checked.
 * First Published (eDoc Released)
 * First Published (Date of Publication or Date of Approval for Thesis)
 * --Makarenko 16:40, 24 June 2009 (UTC): Should the date of the first item release be taken from the item history?
 * Yes. Definition: the first filled Date Of Publication for an eDoc Released item. Should be taken from the history.
 * --Makarenko 11:52, 1 October 2009 (UTC): Date Of Publication is not a mandatory field in docs, so it probably makes no sense to estimate it in the statistic.
 * Organization
 * MPS
 * --Makarenko 08:48, 16 July 2009 (UTC): Not clear, where to use the indicator
 * Anja: no sense to group, all documents are MPG affiliated. Reject.
 * Institutes
 * Section within MPS
 * --Makarenko 16:32, 24 June 2009 (UTC): The section ID (abbreviations) should be added to the institutions table, the list is needed.
 * Files
 * One_file_attached: At least one file (fulltext) has been attached to the eDoc released version of the document at the time of estimation (month or year). The history of the file is used.
 * One_file_OA_attached: At least one file (fulltext) with PUBLIC access has been attached to the eDoc released version of the document at the time of estimation (month or year). The history of the file is used.
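
The two file indicators above can be sketched as follows. This is only an illustration under assumptions: the `FileEvent` record, its fields, and the idea that each file-history entry carries an attach date, an optional removal date, and a fixed access level are hypothetical and are not taken from the eDoc schema.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional, Tuple


@dataclass
class FileEvent:
    """Hypothetical file-history entry: the file is present in [attached, removed)."""
    attached: date
    removed: Optional[date]  # None means the file is still attached
    public: bool             # True if the access level is PUBLIC (assumed fixed per entry)


def file_flags(history: List[FileEvent], cutoff: date) -> Tuple[bool, bool]:
    """Return (One_file_attached, One_file_OA_attached) as of the cutoff date."""
    present = [
        e for e in history
        if e.attached <= cutoff and (e.removed is None or e.removed > cutoff)
    ]
    one_file_attached = len(present) > 0
    one_file_oa_attached = any(e.public for e in present)
    return one_file_attached, one_file_oa_attached
```

The cutoff would be the last day of the month or year being estimated, so the same function could serve both the monthly and the accumulated yearly report.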

Report format
For the monthly statistics report sent by eDoc, the following files should be generated in CSV format:
 * 1) One file for released items per month
 * 2) One file for released items per calendar year accumulated
 * 3) One file for yearbook items per month
 * 4) One file for yearbook items per calendar year accumulated

Released items per month
Month_for_statistic;MPG;System;Published;MetaData_Only;One_file_attached;One_file_OA_attached
Month_for_statistic;MPI_ID1;MPI_NAME1;System;Published;MetaData_Only;One_file_attached;One_file_OA_attached
....
Month_for_statistic;MPI_IDn;MPI_NAMEn;System;Published;MetaData_Only;One_file_attached;One_file_OA_attached

Released items per calendar year accumulated
Year_for_statistic;MPG;System;Published;MetaData_Only;One_file_attached;One_file_OA_attached
Year_for_statistic;MPI_ID1;MPI_NAME1;System;Published;MetaData_Only;One_file_attached;One_file_OA_attached
....
Year_for_statistic;MPI_IDn;MPI_NAMEn;System;Published;MetaData_Only;One_file_attached;One_file_OA_attached

--Makarenko 08:56, 16 July 2009 (UTC): The header and the rows have a different number of columns here.

Yearbook items per month
Month_for_statistic;MPG ID;MPG Name;Yearbook Year;MetaData_Only;One_file_attached;One_file_OA_attached
Month_for_statistic;MPI_ID1;MPI_NAME1;Yearbook Year;MetaData_Only;One_file_attached;One_file_OA_attached
....
Month_for_statistic;MPI_IDn;MPI_NAMEn;Yearbook Year;MetaData_Only;One_file_attached;One_file_OA_attached


 * --Makarenko 10:19, 13 July 2009 (UTC): This statistic cannot be created: the yearbook release history is not saved.
 * Anja: the workflow for Yearbook release is different in every institute and not really substantial for the OA statistic; can be rejected.

Yearbook items per calendar year accumulated
Year_for_statistic;MPG;Yearbook Year;MetaData_Only;One_file_attached;One_file_OA_attached
Year_for_statistic;MPI_ID1;MPI_NAME1;Yearbook Year;MetaData_Only;One_file_attached;One_file_OA_attached
....
Year_for_statistic;MPI_IDn;MPI_NAMEn;Yearbook Year;MetaData_Only;One_file_attached;One_file_OA_attached

(";" used as separator here, first line is header, the following lines are filled with values)
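
A report in this layout could be written roughly as below. This is a minimal sketch and not the eDoc implementation: the `HEADER` names `MPI_ID` and `MPI_NAME` are assumptions chosen so that the header and the rows have the same number of columns, and the values in the demo row are placeholders.

```python
import csv
import io

# Assumed header: one column per field of a data row (MPI_ID / MPI_NAME are hypothetical names).
HEADER = [
    "Month_for_statistic", "MPI_ID", "MPI_NAME", "System", "Published",
    "MetaData_Only", "One_file_attached", "One_file_OA_attached",
]


def write_report(rows, out):
    """Write the header and all rows to `out`, using ';' as the separator."""
    writer = csv.writer(out, delimiter=";", lineterminator="\n")
    writer.writerow(HEADER)
    for row in rows:
        # Reject rows whose column count does not match the header.
        if len(row) != len(HEADER):
            raise ValueError(f"expected {len(HEADER)} columns, got {len(row)}")
        writer.writerow(row)


if __name__ == "__main__":
    buf = io.StringIO()
    # Placeholder values, for illustration only.
    write_report([["2009-07", "MPI_ID1", "MPI_NAME1", 10, 8, 5, 5, 2]], buf)
    print(buf.getvalue(), end="")
```

Checking the column count on every row would catch the header/row mismatch noted above before a report is sent out.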

Number of item views and fulltext views per month and per year
--Makarenko 11:06, 13 July 2009 (UTC): Questions
 * Columns:
Month_for_statistic;MPG_ID;MPG_NAME;Number_of_views
Month_for_statistic#1;MPG_ID#1;MPG_NAME#1;Number_of_views
...
Month_for_statistic#N;MPG_ID#N;MPG_NAME#N;Number_of_views
 * Are cumulative estimates needed?
 * The statistic can be produced by parsing the Apache access logs, which takes a lot of resources. The following workflow is suggested on a regular basis:
 * On the first run, the complete set of eDoc logs will be processed for both statistics; the results will be stored in an eDoc statistics database table (with a one-month lag).
 * The monthly cronjob will process only the last month and update the table.
 * Yearly statistics will be generated from the table.
 * Views by crawlers should be excluded?
 * Anja: main crawlers should be excluded (Google, Yahoo, MS Bing, MPG), technical tools: nagios.
 * Only successful views (http code 200) will be counted?
 * Anja: 200 and not 200 is sufficient.
 * Should the history views be counted as well?
 * --Makarenko 16:26, 2 October 2009 (UTC) Suggestion:
 * Check edoc release status of the document/document version (i.e. version from the history) at the fetch time
 * Both doc access types:
 * 1) http://edoc.mpg.de/display.epl?mode=doc&id= &col=&grp=
 * 2) http://edoc.mpg.de/
 * Both file access types:
 * 1) http://edoc.mpg.de/get.epl?did=&fid=&&ver=
 * 2) http://edoc.mpg.de//fulltext//
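
The log-parsing workflow discussed above might look roughly like this. It is a sketch under several assumptions: the logs are in Apache combined format; the crawler tokens in `CRAWLER_TOKENS` are a guessed list (no token for the MPG crawler is known here); only the `display.epl` and `get.epl` URL forms are classified, because the patterns of the short URL forms are elided above; and views are split into a "200" and an "other" bucket, which is one possible reading of Anja's reply.

```python
import re
from collections import Counter

# Apache combined log format; the referer and user-agent part is optional.
LOG_RE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<ts>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+)[^"]*" (?P<status>\d{3}) \S+'
    r'(?: "[^"]*" "(?P<agent>[^"]*)")?'
)
MONTHS = {name: i for i, name in enumerate(
    ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
     "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"], start=1)}
# Assumed user-agent substrings for crawlers and monitoring tools.
CRAWLER_TOKENS = ("googlebot", "slurp", "msnbot", "bingbot", "nagios")


def classify(path):
    """Map a request path to 'item', 'fulltext', or None (not counted)."""
    if path.startswith("/display.epl"):
        return "item"
    if path.startswith("/get.epl"):
        return "fulltext"
    return None


def count_views(lines):
    """Count views per (YYYY-MM, kind, status bucket), skipping crawlers."""
    counts = Counter()
    for line in lines:
        m = LOG_RE.match(line)
        if not m:
            continue
        agent = (m.group("agent") or "").lower()
        if any(tok in agent for tok in CRAWLER_TOKENS):
            continue
        kind = classify(m.group("path"))
        if kind is None:
            continue
        # Timestamp looks like 02/Oct/2009:16:26:00 +0000
        _day, mon, rest = m.group("ts").split("/", 2)
        month = f"{rest[:4]}-{MONTHS[mon]:02d}"
        bucket = "200" if m.group("status") == "200" else "other"
        counts[(month, kind, bucket)] += 1
    return counts


def yearly_totals(monthly):
    """Aggregate stored monthly counts into per-year counts."""
    totals = Counter()
    for (month, kind, bucket), n in monthly.items():
        totals[(month[:4], kind, bucket)] += n
    return totals
```

In the suggested workflow, `count_views` would be run once over the complete log bundle and afterwards by the monthly cronjob over the last month only, with the results stored in the statistics table; `yearly_totals` would then work from the stored monthly counts without re-reading the logs.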

--Schmidt 16:17, 11 Jan 2010 (UTC): Questions (Marion Schmidt, MPI für Kognitions- und Neurowissenschaften)
 * I would like to have download statistics for open access full texts on a per-item-level, similar to the implementation in the Open Access journal PLoS One: http://www.plosone.org/article/metrics/info%3Adoi%2F10.1371%2Fjournal.pone.0005915
 * the eDoc server of the Humboldt University has also been using download statistics for some time
 * of course, this could tempt people to manipulate the numbers to a certain degree (I don't think by much), but maybe it's nevertheless possible to at least exclude crawlers or repeated downloads from the same IP address
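
As a rough illustration of the per-item counting suggested here, repeated downloads could be collapsed before counting. The sketch below assumes that crawler requests have already been filtered out and that "repeated" means the same IP address fetching the same item on the same day; both the event tuple layout and that definition are assumptions, not an agreed rule.

```python
from collections import defaultdict


def downloads_per_item(events):
    """Count downloads per item, counting each (IP, day) pair only once.

    `events` is an iterable of (ip, item_id, day) tuples (hypothetical layout).
    """
    seen = defaultdict(set)
    for ip, item_id, day in events:
        seen[item_id].add((ip, day))
    return {item_id: len(pairs) for item_id, pairs in seen.items()}
```

A stricter or looser rule (per hour, per month, per session) would only change what goes into the set.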