ESciDoc Committer Meeting 2010-12-14

Date: 14.12.2010 Start time: 14:30 End time: 15:30

Location: Karlsruhe, München phone: +49-89-38602-223 VidCo-ISDN: 08938602595; alternate ip

Participants MPDL: Natasa Bulatovic, Wilhelm Frank

Participants FIZ: Frank Schwichtenberg, Michael Hoppe, Matthias Razum, Harald Kappus

next meetings

  • 2010-12-21 postponed
  • 2011-01-11 ok
  • 2011-01-18 ok

pdf text-extraction of fedoragsearch-indexing

  • The fedoragsearch-indexer uses PDFBox to extract text from PDF documents. The text is sometimes not extracted correctly.
  • Discuss approach


  • PDFBox does not extract text from some PDFs properly
  • for Japanese, XPDF was the best extractor, but it is a command-line tool (cannot be integrated directly into the Java environment)
  • all text extractors have been researched; none of them except XPDF was able to extract Japanese text correctly
  • not certain what can be done
  • if MPDL wants to try XPDF, the new Fedora GSearch patch needs to be used (OK for MPDL; it could work for all documents, not only for those that fail with PDFBox; to be tested)
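
As an illustration of how a command-line extractor can be driven from Java, the following is a minimal sketch that starts XPDF's pdftotext as an external process and reads the text from its standard output. It is not the Fedora GSearch patch discussed above; the class name XpdfTextExtractor, the method names and the timeout parameter are hypothetical, and it assumes a pdftotext binary is installed on the indexing host.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.util.List;
import java.util.concurrent.TimeUnit;

/** Hypothetical helper (not part of GSearch): shells out to XPDF's pdftotext. */
final class XpdfTextExtractor {

    /** Builds the command line; the trailing "-" asks pdftotext to write to stdout. */
    static List<String> buildCommand(String executable, Path pdf) {
        return List.of(executable, "-enc", "UTF-8", pdf.toString(), "-");
    }

    /** Runs pdftotext and returns the extracted text, or throws on failure. */
    static String extract(String executable, Path pdf, long timeoutSeconds)
            throws IOException, InterruptedException {
        Process process = new ProcessBuilder(buildCommand(executable, pdf))
                .redirectError(ProcessBuilder.Redirect.DISCARD) // ignore warnings on stderr
                .start();
        byte[] output;
        try (InputStream in = process.getInputStream()) {
            // Blocks until the tool closes stdout, i.e. normally until it exits.
            output = in.readAllBytes();
        }
        // The timeout only covers the wait after stdout has been drained.
        if (!process.waitFor(timeoutSeconds, TimeUnit.SECONDS)) {
            process.destroyForcibly();
            throw new IOException("pdftotext timed out for " + pdf);
        }
        if (process.exitValue() != 0) {
            throw new IOException("pdftotext failed with exit code " + process.exitValue());
        }
        return new String(output, StandardCharsets.UTF_8);
    }
}
```

A caller could use XpdfTextExtractor.extract("pdftotext", pdfPath, 60) and fall back to PDFBox when an IOException is thrown; whether XPDF should be used for all documents or only as a fallback is the open point noted above.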

postgres connections

  • Since the deployment of core service 1.2, we have experienced quite unusual memory consumption by the postgres connections. Out of ca. 10 connections to the database, one consumes an enormous amount of RAM (4-6 GB) that is never freed. The same behavior occurs on all servers, even those that simply run with default settings and have no midnight fetching of data.
  • Note that this process is not the postmaster process itself, but the escidoc-core connection process to the escidoc-core database. After a while our servers start to consume a lot of swap, and some time later the system stops functioning. The server where this is really critical has 32 GB of memory in total, of which the postgres database itself uses only 1/4.
  • Can you please investigate a bit or let us know if you have some ideas how to monitor this problem?


  • not discussed; will be addressed next year
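
As one possible way to monitor the problem described above, the following minimal sketch reads the resident and swapped memory of a single process from /proc/<pid>/status, so that the growth of one connection process can be logged over time. It assumes a Linux host; the class name BackendMemoryProbe and its methods are hypothetical. The process ID of each connection can be looked up in the pg_stat_activity view. Note that the resident size of a postgres backend also counts the shared-buffer pages it has touched, so the figures are an upper bound rather than private memory.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import java.util.OptionalLong;

/** Hypothetical probe: reads memory figures of one process from Linux /proc. */
final class BackendMemoryProbe {

    /** Returns the value in kB of a field such as "VmRSS" or "VmSwap", if present. */
    static OptionalLong fieldKb(List<String> statusLines, String field) {
        String prefix = field + ":";
        for (String line : statusLines) {
            if (line.startsWith(prefix)) {
                // Lines look like "VmRSS:<tab>  123456 kB"; the first token is the number.
                String[] parts = line.substring(prefix.length()).trim().split("\\s+");
                return OptionalLong.of(Long.parseLong(parts[0]));
            }
        }
        return OptionalLong.empty();
    }

    /** Reads /proc/<pid>/status for the given process ID (Linux only). */
    static List<String> readStatus(long pid) throws IOException {
        Path status = Paths.get("/proc", Long.toString(pid), "status");
        return Files.readAllLines(status);
    }

    /** Formats one log line; -1 marks a field that the kernel did not report. */
    static String report(long pid, List<String> statusLines) {
        OptionalLong rss = fieldKb(statusLines, "VmRSS");
        OptionalLong swap = fieldKb(statusLines, "VmSwap");
        return "pid=" + pid
                + " rss_kb=" + (rss.isPresent() ? rss.getAsLong() : -1)
                + " swap_kb=" + (swap.isPresent() ? swap.getAsLong() : -1);
    }
}
```

Calling BackendMemoryProbe.report(pid, BackendMemoryProbe.readStatus(pid)) at regular intervals for each backend and writing the result to a log would show which connection grows and when.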


  • component: content update for components of released items associated with a PID
    • A component PID should be treated like an "object PID"
    • that is understood; the question is how to provide a valid component "object PID" in each item version
    • is the versioning of component content a choice of the application developers in this case?


  • the content PID will act similarly to the item version PID
    • with every change of the content, the PID will be removed and the application is able to set a new one
  • FIZ may provide some help with clarifications about the current FoXML migration tool (might be appropriate)
  • migration may assign the content PID to the latest version of the content
  • however, release 1.3 is the primary goal; migration shall be considered separately afterwards

On Behalf-of deposit


  • will be addressed in v2.0 (together with the ownership / transfer-of-ownership issue)

core service release timelines

  • PID bug fix


  • probably beta version by end of the year
  • end of January planned for final release (depending on user tests outcome)

scalability/load tests environment

  • status


  • partnering on such tests can be arranged as soon as the environment is defined
  • MPDL may have some free resources for setting up such an environment next year


  • eSciDoc-Colab Page setup
  • installation guides