ESciDoc Developer Workshop 2008-04-29

Date: 29.04.2008

Location: Karlsruhe, München (Video conference)

Participants MPDL: Natasa Bulatovic, Wilhelm Frank, Robert Forkel

Participants FIZ: Frank Schwichtenberg, Harald Kappus, Matthias Razum

Start time: 14:00 29.04.2008

HTTP Response headers in framework responses
    08:53:39$ curl -v -o test.jpg http://zim02.gwdg.de:8080/ir/item/escidoc:14439/components/component/escidoc:14440/content
    * About to connect to zim02.gwdg.de port 8080 (#0)
    * Trying 134.76.24.61... connected
    * Connected to zim02.gwdg.de (134.76.24.61) port 8080 (#0)
    > GET /ir/item/escidoc:14439/components/component/escidoc:14440/content HTTP/1.1
    > User-Agent: curl/7.16.4 (i486-pc-linux-gnu) libcurl/7.16.4 OpenSSL/0.9.8e zlib/1.2.3.3 libidn/1.0
    > Host: zim02.gwdg.de:8080
    > Accept: */*
    >
    < HTTP/1.1 200 OK
    < Server: Apache-Coyote/1.1
    < X-Powered-By: Servlet 2.4; JBoss-4.0.5.GA (build: CVSTag=Branch_4_0 date=200610162339)/Tomcat-5.5
    < Cache-Control: no-cache
    < Pragma: no-cache
    < file-name: 177_y_f_a_a_original.jpg
    < Content-Type: image/jpg
    < Transfer-Encoding: chunked
    < Date: Wed, 16 Apr 2008 06:53:58 GMT
    <

 * file-name should not be used since it's not a standard HTTP header and leads to undesired behaviour, e.g. in Safari.


 * If a file name is to be suggested along with a response, a Content-Disposition header should be sent. However, since the framework does not really know which client it is talking to, it probably should not say anything about content disposition at all.

Outcome

 * A Content-Disposition header carrying the file name will be sent.
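
As an illustration of this outcome, the following is a minimal sketch of how a Content-Disposition header value carrying a file name could be built. It is not the framework's implementation: the function names and the default disposition type "attachment" are assumptions made for this example (the minutes do not say which disposition type will be used), and the sketch does not guard against control characters in the file name.

```python
from urllib.parse import quote


def _quoted(value: str) -> str:
    # Escape backslashes and double quotes so the value is a valid quoted-string.
    return '"' + value.replace("\\", "\\\\").replace('"', '\\"') + '"'


def content_disposition(filename: str, disposition: str = "attachment") -> str:
    """Build a Content-Disposition header value (hypothetical helper).

    ASCII names are sent as a plain quoted filename parameter. For
    non-ASCII names an ASCII fallback is sent together with a
    percent-encoded UTF-8 filename* parameter.
    """
    try:
        filename.encode("ascii")
    except UnicodeEncodeError:
        # Replace non-ASCII characters with '?' for the fallback parameter.
        fallback = filename.encode("ascii", "replace").decode("ascii")
        encoded = quote(filename, safe="")
        return (disposition + "; filename=" + _quoted(fallback)
                + "; filename*=UTF-8''" + encoded)
    return disposition + "; filename=" + _quoted(filename)


# File name taken from the curl transcript in this section.
print("Content-Disposition: " + content_disposition("177_y_f_a_a_original.jpg"))
```

A caller that wants the browser to render the content in place could pass "inline" as the second argument instead.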

Support for HTTP HEAD requests
Currently the framework does not support HTTP HEAD requests, which may cause problems for the rest of the web infrastructure (caches, proxies, browsers, etc.).

Fetching an item via GET:

    15:11:46$ curl -v -o content.xml http://zim02.gwdg.de:8080/ir/item/escidoc:6431
    * About to connect to zim02.gwdg.de port 8080 (#0)
    * Trying 134.76.24.61... connected
    * Connected to zim02.gwdg.de (134.76.24.61) port 8080 (#0)
    > GET /ir/item/escidoc:6431 HTTP/1.1
    > User-Agent: curl/7.16.4 (i486-pc-linux-gnu) libcurl/7.16.4 OpenSSL/0.9.8e zlib/1.2.3.3 libidn/1.0
    > Host: zim02.gwdg.de:8080
    > Accept: */*
    >
    < HTTP/1.1 200 OK
    < Server: Apache-Coyote/1.1
    < X-Powered-By: Servlet 2.4; JBoss-4.0.5.GA (build: CVSTag=Branch_4_0 date=200610162339)/Tomcat-5.5
    < Cache-Control: no-cache
    < Pragma: no-cache
    < Content-Type: text/xml;charset=UTF-8
    < Content-Length: 6499
    < Date: Wed, 16 Apr 2008 13:12:07 GMT
    <

Requesting HEAD:

    08:38:12$ curl --head http://zim02.gwdg.de:8080/ir/item/escidoc:6431
    HTTP/1.1 404 Not Found
    Server: Apache-Coyote/1.1
    Cache-Control: no-cache
    Pragma: no-cache
    Content-Type: text/xml;charset=UTF-8
    Transfer-Encoding: chunked
    Date: Wed, 16 Apr 2008 06:38:22 GMT

Outcome

 * important for testing
 * complex to achieve at the moment; possible security hole?
 * open question: do we still need to support HEAD requests now?
 * agreed workaround: do not return error code 404; revisit this issue in the future
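
For reference, the behaviour that caches, proxies and browsers expect from HEAD is the same status and headers as GET, without a body. The sketch below demonstrates this with Python's standard library against a throwaway local server. It is not eSciDoc code: the handler, the placeholder body and the helper function are assumptions made for this example; only the item path is taken from the transcripts above.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

BODY = b"<item/>"  # hypothetical placeholder for an item representation


class ItemHandler(BaseHTTPRequestHandler):
    """Toy handler that answers GET and HEAD with identical headers."""

    def _send_headers(self):
        self.send_response(200)
        self.send_header("Content-Type", "text/xml;charset=UTF-8")
        self.send_header("Content-Length", str(len(BODY)))
        self.end_headers()

    def do_GET(self):
        self._send_headers()
        self.wfile.write(BODY)

    def do_HEAD(self):
        # Same status line and headers as GET, but no message body.
        self._send_headers()

    def log_message(self, format, *args):
        pass  # keep the example quiet


def head_and_get(path="/ir/item/escidoc:6431"):
    """Start a local server, send HEAD and GET, return both results."""
    server = HTTPServer(("127.0.0.1", 0), ItemHandler)  # port 0: any free port
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        results = {}
        for method in ("HEAD", "GET"):
            conn = http.client.HTTPConnection("127.0.0.1", server.server_port)
            conn.request(method, path)
            resp = conn.getresponse()
            results[method] = (resp.status,
                               resp.getheader("Content-Length"),
                               resp.read())
            conn.close()
        return results
    finally:
        server.shutdown()
        server.server_close()


if __name__ == "__main__":
    for method, (status, length, body) in head_and_get().items():
        print(method, status, "Content-Length:", length, "body bytes:", len(body))
```

A check of this kind could also serve as a regression test once the 404 workaround is revisited.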

TOC implementation

 * Status of implementation
 * open issues

Outcome

 * The TOC implementation will not be in the next release
 * Content streams in Item: binary content can be put on the item itself rather than on a component
 * Afterwards, a TOC handler will be created to handle special Items (of type TOC)
 * Patent data and many other kinds of data consist only of XML; they have no special metadata for "binary-content" and are simply part of the item; a single content stream is allowed
 * Access rights are the same as the item's access rights
 * Content streams are part of items: an item may have either components or content streams (if content streams are used, only a single one is allowed)
 * FIZ wants to test uploads of 1 000 000 000 objects (patents)
 * Content streams are being added for efficiency reasons
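
The constraint discussed above (either components or a content stream, and at most one content stream) can be sketched as a small validation routine. This is an illustration only: the `Item` class, its fields and `validate_item` are hypothetical names invented for this example and do not reflect the framework's actual object model.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Item:
    """Hypothetical stand-in for an item; not the framework's real model."""
    components: List[str] = field(default_factory=list)
    content_streams: List[str] = field(default_factory=list)


def validate_item(item: Item) -> None:
    """Raise ValueError if the item violates the rule from the minutes."""
    # Components and content streams are mutually exclusive.
    if item.components and item.content_streams:
        raise ValueError("an item may hold components or a content stream, not both")
    # If content streams are used, only a single one is allowed.
    if len(item.content_streams) > 1:
        raise ValueError("only a single content stream is allowed per item")


# A patent-style item with one XML content stream passes the check.
validate_item(Item(content_streams=["patent.xml"]))
```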

Status of Migration Tool

 * There will be a migration tool in tomorrow's release.
 * No further schema changes are needed; the item changes are in the schemas
 * The migration tool will include the changes
 * These would be final for the next several months
 * The goal of the release is to stabilize the schemas (0.99 release candidate)
 * No relations for 1.0?
 * As it seems now, there are no plans to change schemas/interfaces in the upcoming months
 * The release also includes improved filter methods (plus possibly support for Shibboleth and LDAP)

Status of Ingestion Tool

 * ongoing work, at a very early stage
 * investigating how to ingest objects efficiently into Fedora, bypassing eSciDoc (first step)
 * still trying to get numbers and decide on a strategy for efficiently loading large amounts of objects into eSciDoc
 * will take some more time
 * goal: to be able to load several hundred thousand objects per day, depending on item complexity
 * pure FOXML objects: ingest rate of nearly 1 million objects/day
 * approach: analyze the input and create FOXML objects, bypassing eSciDoc business objects/logic
 * triple store, full-text indexing etc.

mra: I created a new Colab page for the Ingest Tool.
 * for another project: 150 000 objects (scientific articles) took 16-20 hours to index
 * 20 000 objects for eDoc (MPDL update: eDoc has many more than 20 000 objects: 100 000 objects)
 * a production environment with several collections is expected in June/July at the latest (no more than 100 000 objects in total)
 * constraint: limited number of developers
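
To put the figures from these minutes into perspective, the following back-of-the-envelope sketch converts them into rates and durations. The function names are invented for this example. The calculation assumes constant throughput, and it mixes numbers from different contexts (pure FOXML ingest, and indexing in another project), so the results are rough orientation values rather than a forecast.

```python
def days_to_ingest(total_objects: float, objects_per_day: float) -> float:
    """Days needed to load total_objects at a constant daily rate."""
    return total_objects / objects_per_day


def objects_per_day(objects: float, hours: float) -> float:
    """Extrapolate a daily rate from a measured run."""
    return objects / hours * 24


# FIZ test target (1 000 000 000 patents) at the pure FOXML rate of
# nearly 1 million objects/day mentioned above.
print(days_to_ingest(1_000_000_000, 1_000_000))  # 1000.0 days

# Indexing figures from the other project: 150 000 objects in 16-20 hours.
print(objects_per_day(150_000, 20))  # 180000.0 objects/day (slow end)
print(objects_per_day(150_000, 16))  # 225000.0 objects/day (fast end)
```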