An Enquiry Concerning The Handle System

From MPDLMediaWiki
Revision as of 15:20, 2 April 2008 by Inga

This piece - while the honest opinion of the author (Robert Forkel) - is also meant to stimulate discussion. Feel free to reply on the discussion page.


Why would I want to use the Handle System?

The Handle FAQ[1] gives the following answers:

"There are many reasons you might want to use the Handle System, but one simple one is that you have information and other resources represented in digital form, sometimes called digital content, that you want users to access via the Internet and you plan to keep that content available over long periods of time. If the content's location is likely to have to change during that time, then you need a resolution system like the Handle System. You would assign your digital content unique identifiers, not just identify the objects by their locations. A location -- a given URL, for example -- is not a persistent identifier if the content moves to another location.

Think about keeping track of a person. People are listed in a telephone directory by name. If you look up a person's name, you will find his address. If he moves across town, his address will change but not his name, so you will get his new address when you look up his name. If he didn't have a name, was known only by his address, and he moved, you'd probably lose track of him. If you tried his old address, he wouldn't be found. He'd have to tell everyone his new address, and hope they kept it. If he had a lot of friends, that might work, but it would take a lot of effort.

Similarly, if your digital content is only known by its location, and that location changes, it will be hard for users to find it. If you give each object a unique name (an identifier), and associate that name with the object's location using the Handle System, you'd only have to update a single record with the new location, rather than notify everyone who might want to find the object, even if you could arrange to do so.

The ability to change locations without changing names also applies to ownership. You can move or sell an object from one owner to another and still use the same identifier, which is very difficult to do using domain names that are required in URLs. Other reasons to use the Handle System are more technical, e.g., secure resolution and/or multiple resolution, for which we recommend you read the technical documentation found on this site."


How does this apply to eSciDoc?

"the content's location is likely to have to change" - It's hard to see how this applies to the escidoc content - short of fearing loss of the domain. For other conceivable scenarios like copying or moving objects in a repository see below.

"You can move or sell an object from one owner to another and still use the same identifier" - This seems to be the reason doi and the handle system are attractive to commercial publishers. It doesn't apply to our use cases, though. In particular in the case of pubman, the whole idea is to claim some sort of ownership of the Max Planck Society for the items.

"multiple resolution" - Do we really want this? And if so, how would we do this? In the DAM-LR project they provide this functionality by publishing web pages linking to the multiple targets. But if one does that, one might just as well link to the alternative versions from the canonical representation.

Another strength of the handle system - being useful for managing non-networked or non-web resources - is also irrelevant in the escidoc context.


Can't HTTP do that?

The NLA Guidelines for persistent identifiers[2] are a good read. In particular one of their general requirements seems relevant: "Commitment to persistence

An organisation must have a commitment to maintain the association of the current location of the resource with the persistent identifier. It is important that a resource with a persistent identifier should not be moved, or removed without updating the location information associated with the identifier."

This is true whatever persistent identification scheme is used. In the eSciDoc@MPG context it is clear that the organisation which must show commitment is the MPG, which will surely be committed to making mpg.de a persistent prefix to build HTTP identifiers from; or as Sir Tim Berners-Lee put it:

"Most URN schemes I have seen look something like an authority ID followed by either a date and a string you choose, or just a string you choose. This looks very like an HTTP URI. In other words, if you think your organization will be capable of creating URNs which will last, then prove it by doing it now and using them for your HTTP URIs. There is nothing about HTTP which makes your URIs unstable. It is your organization. Make a database which maps document URN to current filename, and let the web server use that to actually retrieve files."[3]

Similar views - but more elaborate - are expressed by John Kunze[4].
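The mapping Berners-Lee describes in the quote above can be sketched in a few lines of Python. This is only an illustration, not eSciDoc code: the table `LOCATIONS`, the functions `resolve` and `move`, and all paths are hypothetical names chosen for the example.

```python
# Minimal sketch of "a database which maps document URN to current filename".
# All identifiers and file paths below are hypothetical examples.
LOCATIONS = {
    "/item/1234": "/storage/2007/report.pdf",
}


def resolve(persistent_path, locations=LOCATIONS):
    """Return the current file location for a persistent URI path, or None."""
    return locations.get(persistent_path)


def move(persistent_path, new_location, locations=LOCATIONS):
    """Record that the content moved; the persistent path stays the same."""
    if persistent_path not in locations:
        raise KeyError(f"unknown identifier: {persistent_path}")
    locations[persistent_path] = new_location
```

A web server would call something like `resolve` on each request. When a file moves, only the table entry changes; the published URI does not.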

We also argue that multiple resolution can be handled without resorting to the handle system. In fact, a common method to do this with the handle system - an intermediate HTML page providing links to all resolution targets - is the typical HTTP/HTML way to do this. (X)HTML link or a elements, RDF documents demarcating the boundaries[5] of an object, etc. - all these methods are available to the HTTP alternatives to the handle system, or rather, they are native to these alternatives and merely borrowed by the handle system.
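As a rough sketch of the intermediate-page approach, the following Python function builds an HTML page that lists every resolution target both as a link element in the head and as an a element in the body. The function name `landing_page` and the example URLs are assumptions made for illustration only.

```python
from html import escape


def landing_page(title, targets):
    """Build an HTML page that links to every resolution target.

    `targets` is a list of (label, url) pairs.
    """
    # Machine-readable pointers to the alternative locations.
    head_links = "\n".join(
        f'  <link rel="alternate" title="{escape(label)}" href="{escape(url)}">'
        for label, url in targets
    )
    # Human-readable list of the same locations.
    body_links = "\n".join(
        f'  <li><a href="{escape(url)}">{escape(label)}</a></li>'
        for label, url in targets
    )
    return (
        "<!DOCTYPE html>\n"
        f"<html>\n<head>\n  <title>{escape(title)}</title>\n{head_links}\n</head>\n"
        f"<body>\n<h1>{escape(title)}</h1>\n<ul>\n{body_links}\n</ul>\n</body>\n</html>\n"
    )
```

Served at the canonical URL of an item, such a page offers multiple resolution using nothing beyond HTTP and HTML.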


Current Application of the Handle System

See http://www.handle.net/apps.html


DOI

DOI is clearly the application of the handle system built to exploit the advantage described as "you can move or sell an object from one owner to another and still use the same identifier".


dspace

From DSpace's usage of the handle system[6] we can learn two things:

  1. When being serious about using the handle system, handles must also be used internally in the application.
  2. The goal to have usable - i.e. resolvable, aka actionable - identifiers means HTTP.

While handles may "outlive the Domain Name system and Internet protocols"[6], it is unclear how useful this would be. Without DNS and IP, URLs will be more or less the same as handles: globally unique strings.


Two Fallacies

The ISBN example

We argue that the comparison between ISBN and identifiers for online content is flawed. Why are ISBNs useful? ISBNs are useful because getting a book isn't easy. Physical things can only be in one place at a time, which is why there are many copies of books. This relieves readers from the pain of having to go to a single place in the world to read a book. So ISBNs are not actionable - at least not uniformly - because this is impossible to achieve. But obviously it would be preferable if they were. Online objects, in contrast, have the nice property of having an identifier which is uniformly actionable from every browser - which is basically everywhere digital objects make sense in the first place.


Is actionable bad?

Some object to URLs as identifiers because they are actionable - meaning too tightly coupled to the HTTP GET request. We argue that being actionable is essential for usable identifiers.

First, being resolvable means being actionable. Second, without resolution, identifiers lose the ability to identify, because there is no way to verify the claim; i.e. if there is no "official" way to resolve an ISBN to a book, I could print whatever ISBN I choose in a book.


Escidoc Use Cases

  1. Half the items of an escidoc instance are moved to LTA.
  2. Identifiers should resolve to different URLs depending on where the resolution request comes from.

Assessment:

ad 1. Can't be solved with the handle system in the current escidoc context (see open question 4 below). In addition, since the current plan is to add pluggable persistent identifier providers, the system would need a way to find out which resolution mechanism to use for which identifier and how to handle the response, which complicates the scenario a lot. Note: It is also unclear what "moved to LTA" means. Is the idea really to remove stuff from a running application and put it on tape only? And if so, what should the identifier resolve to?

ad 2. Can't be easily solved with the handle system because, not being based on HTTP, it does not give the resolver any information about the origin of the request (such as the referrer header in HTTP), nor does it offer content negotiation, which would be better yet. Differentiation would have to be based on what is known about the requester (the IP address). It also has to be kept in mind that with the current design the framework will be rather PID agnostic (it just stores them like other metadata), so the resolution will have to be configured by the application. But then it is unclear why the application shouldn't do the resolution right away (without involving a handle resolver).
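To illustrate what resolution inside the application could look like for use case 2, here is a minimal Python sketch that picks a target URL from information an HTTP request carries. Everything in it is assumed for the example: the function `choose_target`, the network range (taken from the address block reserved for documentation) and the example.org URLs. The check on the Accept header is deliberately simplistic and is not a full content negotiation implementation.

```python
import ipaddress

# Hypothetical "internal" network; 192.0.2.0/24 is reserved for documentation.
INTERNAL_NET = ipaddress.ip_network("192.0.2.0/24")


def choose_target(client_ip, headers):
    """Pick a resolution target from the client address and request headers."""
    accept = headers.get("Accept", "")
    # Crude content negotiation: clients asking for RDF get the RDF document.
    if "application/rdf+xml" in accept:
        return "http://repository.example.org/item/1234.rdf"
    # Requests from the internal network are sent to an internal copy.
    if ipaddress.ip_address(client_ip) in INTERNAL_NET:
        return "http://intranet.example.org/item/1234"
    # Everyone else gets the public location.
    return "http://repository.example.org/item/1234"
```

The application would answer the request with an HTTP redirect to the chosen URL, so no separate resolver is involved.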

Note: The use case of keeping track of copies of digital resources is obviously only solvable if the only means of copying are controlled by the escidoc system itself (and even then considerable effort is necessary as can be seen in the DAM-LR project[7]).

The fact of the escidoc framework being agnostic towards PIDs raises the question of what to do with objects like organizational units and users. Which application should be responsible for assigning/managing PIDs for these? How are these anticipated to behave regarding LTA?


Open Questions

If escidoc were to use the handle system, the following questions would have to be answered to satisfaction:

  1. What should handles resolve to? In particular in case the objects actually do move? If only to URLs, why not use HTTP?
  2. What problems do we want to solve with the handle system? Moving stuff between escidoc instances? This case is already answered in the latest fiz doc regarding identifiers, suggesting to make instance ids part of the local escidoc identifiers[8].
  3. What prefixes to use? Just one? Or one per escidoc instance, per collection, per context, ...?
  4. Why/how will handles be used in relation to long-term archiving?
  5. How is the handle system useful, if (as described in the latest paper from fiz) handles are not used for relations (let alone the implicit relations between item and creator or between author and organizational unit)? "Explicitly excludes are the demand for internal content relation with PIDs and PIDs in relation to framework URLs."[8]
    Note that this point makes the use case "part of an instance's content is moved to LTA" impossible, i.e. all the intra-repository relations to moved content will break.

A similar set of questions is listed in the Cork Report[9]: "... some fundamental concepts that must be addressed before implementing any persistent identification system. We must understand what it is that we wish to identify [...] We must know what we want the persistent identifier to resolve to [...]"

Neither of these questions seems to be answered to satisfaction in the escidoc context. Does the identifier for an escidoc publication item identify the escidoc item or the content of the publication which may be attached as component - which will probably get or already have another persistent identifier from a publisher? If e.g. the published article is to be identified, should the handle resolve to the component only or to the containing item?


Advantages

The handle system introduces a layer of abstraction between identifier and locator. However, since locators can also be regarded as identifiers (being the canonical location of an object, i.e. distinguished among all locators) and given the fact that HTTP provides sufficient mechanisms for redirection, there are clearly alternatives to achieve the same advantage.

The benefit of being able to store arbitrary metadata with handles clearly does not make much sense in the case of escidoc, because we would not want to maintain another storage location for data readily available from escidoc. Again, in the case of moving content, the previous URL can also be used to provide all information which could be obtained from the handle system.

The perceived advantage of being independent of DNS seems to be rather a question of faith, or as Norman Walsh put it: "I'd gamble that the organization that maintains DNS names will outlast any new organization created to manage 'newscheme' names"[10]


Disadvantages

  1. Using the handle system introduces two additional single points of failure into the system (three if the handle.net HTTP proxy is used as with dspace): the global handle registry (over which we do not have control) and the local handle resolver (responsible for resolving handles with our prefix). Both of these have to be highly available.
  2. We will still have to keep URLs stable because that's what people will use.
  3. The handle system builds indirection into the system right from the start, whereas with HTTP redirection the problem of moving content can be solved on demand (even conveying some provenance information in the chain of redirections).
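The on-demand redirection mentioned in point 3 can be sketched as a table that grows only when content actually moves. The following Python example is an assumption-laden illustration: the table `REDIRECTS`, the function `follow`, the example.org URLs and the provenance notes are all made up for the purpose.

```python
# Hypothetical redirect table: old URL -> (new URL, provenance note).
# An entry is added only when content actually moves.
REDIRECTS = {
    "http://old.example.org/item/1": (
        "http://mid.example.org/item/1", "moved after server consolidation"),
    "http://mid.example.org/item/1": (
        "http://new.example.org/item/1", "moved during archive migration"),
}


def follow(url, redirects=REDIRECTS, max_hops=10):
    """Follow redirects from `url`; return the final URL and the chain taken."""
    chain = []
    seen = {url}
    while url in redirects:
        if len(chain) >= max_hops:
            raise RuntimeError("too many redirects")
        target, note = redirects[url]
        if target in seen:
            raise ValueError(f"redirect loop at {target}")
        chain.append((url, target, note))
        seen.add(target)
        url = target
    return url, chain
```

A URL that never moved has no entry and resolves to itself with an empty chain, while the chain for moved content doubles as a simple provenance record.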

For a detailed assessment of the "newscheme" issue, see "URNs, Namespaces and Registries"; to quote its abstract:

"This finding addresses the questions "When should URNs or URIs with novel URI schemes be used to name information resources for the Web?" and "Should registries be provided for such identifiers?". The answers given are "Rarely if ever" and "Probably not". Common arguments in favor of such novel naming schemas are examined, and their properties compared with those of the existing http: URI scheme.

Three case studies are then presented, illustrating how the http: URI scheme can be used to achieve many of the stated requirements for new URI schemes."


Alternatives


Conclusion

Making reasonable use of the handle system means real work (configuring, maintaining, running the resolver; enabling the remainder of the system to communicate identifiers using handles rather than URLs; ...). This work would have to be done in addition to keeping the URLs stable anyway (which I consider a prerequisite of doing any web publishing whatsoever). Short of being afraid of losing the domain under which to run an escidoc system, I don't see a valid reason to use the handle system.


"If we take in our hand any scheme of persistent identification let us ask, Does it contain any resolution mechanism? No. Does it contain http? No. Commit it then to the flames: for it can contain nothing but sophistry and illusion."[11] (just a joke to link beginning - i.e. title - and end)


References