CharsetEncoding

Programming on eSciDoc, and on PubMan in particular, touches many aspects of character set encoding, which should be consistent across all layers of the application. The default character set depends on the application; for PubMan, UTF-8 is clearly preferable.

The following components are affected by this.

Java

The Java Development Kit contains classes that are encoding-aware (e.g. FileReader, which before Java 11 always uses the platform default encoding, or InputStreamReader, which accepts an explicit charset), classes that are not (e.g. StringBuffer), and classes that handle encoding incorrectly (e.g. the deprecated StringBufferInputStream). When dealing with string/byte data whose encoding is not yet defined, e.g. when loading file content into a string variable, only encoding-aware classes and methods should be used.
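As an illustration of loading content with an encoding-aware class, here is a minimal sketch. It assumes Java 8 or later (for StandardCharsets and UncheckedIOException) and that the input is known to be UTF-8; the class and method names are made up for this example and are not part of eSciDoc.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, for illustration only.
final class Utf8Reader {

    // Reads the whole stream and decodes its bytes as UTF-8,
    // independent of the platform default encoding.
    static String readUtf8(InputStream in) {
        StringBuilder content = new StringBuilder();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(in, StandardCharsets.UTF_8))) {
            char[] buffer = new char[4096];
            int n;
            while ((n = reader.read(buffer)) != -1) {
                content.append(buffer, 0, n);
            }
        } catch (IOException e) {
            // Rethrown unchecked to keep the sketch short.
            throw new UncheckedIOException(e);
        }
        return content.toString();
    }
}
```

To read a file, pass a java.io.FileInputStream to readUtf8. On older JDKs the charset name "UTF-8" can be passed to InputStreamReader as a string instead.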

Programming tips

NEVER write your own encoding "management"; always use the encoding facilities that Java provides.

   Don't do things like this:
   
       // Encodes the string to UTF-8 bytes ...
       byte[] utf8bytes = "ÄÖÜ".getBytes("UTF8");
       // ... but decodes them again with the platform default encoding.
       String utf8String = new String(utf8bytes);
   
   In the best case (UTF-8 is the system default), this code does nothing. In the worst case (e.g. CP1252 is the system default), you get something like this:
   
       Ã„Ã–Ãœ
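For comparison, a minimal sketch of a round trip that names the charset on both sides, plus a helper that reproduces the broken result on purpose. The class and method names are hypothetical, Java 7 or later is assumed for StandardCharsets, and the string literals use Unicode escapes so that the example does not depend on the encoding of the source file.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, for illustration only.
final class RoundTrip {

    // Encodes and decodes with the same explicit charset, so the text
    // comes back unchanged on every platform.
    static String roundTrip(String text) {
        byte[] utf8Bytes = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8Bytes, StandardCharsets.UTF_8);
    }

    // Reproduces the failure: UTF-8 bytes decoded with another charset.
    static String misdecode(String text, Charset wrongCharset) {
        return new String(text.getBytes(StandardCharsets.UTF_8), wrongCharset);
    }

    public static void main(String[] args) {
        String umlauts = "\u00C4\u00D6\u00DC"; // the three umlauts from the example above
        System.out.println(roundTrip(umlauts).equals(umlauts));
        // Assumes the JDK ships the windows-1252 charset, which is not guaranteed by the spec.
        System.out.println(misdecode(umlauts, Charset.forName("windows-1252")).length());
    }
}
```

Each umlaut becomes two UTF-8 bytes, and a single-byte charset such as CP1252 turns every byte into its own character, which is why three characters become six.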

Be aware of the encoding of your console: your string may be perfectly valid, but the Writer (or PrintStream) that streams it to the console may use a different encoding than the console itself. Classic example: when running a Java program in a DOS box, the console encoding is usually CP437 (ugh).
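The effect can be made visible by looking at the bytes a PrintStream actually emits for a given encoding. The following is a sketch only: the class and method names are hypothetical, and the "UTF-8" passed in main is an assumption about what the console expects, not something Java can verify.

```java
import java.io.ByteArrayOutputStream;
import java.io.OutputStream;
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;

// Hypothetical helper, for illustration only.
final class ConsoleEncoding {

    // Wraps a stream in a PrintStream that encodes text with the named charset.
    static PrintStream printStreamFor(OutputStream target, String encodingName) {
        try {
            return new PrintStream(target, true, encodingName);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unknown encoding: " + encodingName, e);
        }
    }

    // Returns the bytes that would be sent to a console using that encoding.
    static byte[] bytesSentToConsole(String text, String encodingName) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        PrintStream out = printStreamFor(sink, encodingName);
        out.print(text);
        out.flush();
        return sink.toByteArray();
    }

    public static void main(String[] args) {
        // Assumption: the console interprets incoming bytes as UTF-8.
        PrintStream console = printStreamFor(System.out, "UTF-8");
        console.println("\u00C4\u00D6\u00DC");
    }
}
```

The same string yields different byte sequences for different encodings, so the output is only readable when the PrintStream and the console agree.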

Eclipse

In the Eclipse IDE the default encoding can (and must) be set to UTF-8. This can be done via Window | Preferences | General | Workspace | Text file encoding. In addition, Eclipse itself should be started with UTF-8 as its default encoding (I am still investigating this). This can be attempted with an additional startup property in the eclipse.ini file in ECLIPSE_HOME:

Add

   -Dsun.jnu.encoding=UTF-8

at the end of eclipse.ini. (This does not actually work; the property seems to be read-only or reset.)
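To see which encoding-related values are in effect in a given JVM, a small check like the following can be run. This is a sketch with a hypothetical class name; file.encoding and sun.jnu.encoding are implementation-specific system properties and may be absent (printed as null) on some JVMs.

```java
import java.nio.charset.Charset;

// Hypothetical helper, for illustration only.
final class DefaultEncodingCheck {

    // Collects the default charset and the related system properties of the running JVM.
    static String describeDefaults() {
        return "defaultCharset=" + Charset.defaultCharset().name()
                + ", file.encoding=" + System.getProperty("file.encoding")
                + ", sun.jnu.encoding=" + System.getProperty("sun.jnu.encoding");
    }

    public static void main(String[] args) {
        System.out.println(describeDefaults());
    }
}
```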

Furthermore, all data files, config files and source files (.java) should be checked for the right encoding. Any broken characters like Ã„, �� or [] are a hint that the encoding is incorrect somewhere. But even if a file or console output looks correct, that does not mean the underlying string/stream has the right encoding.
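One way to check file content mechanically is a strict decode that reports malformed input instead of silently replacing it. The sketch below uses a hypothetical class name and assumes Java 7 or later for StandardCharsets.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, for illustration only.
final class Utf8Check {

    // Returns true if the bytes form a well-formed UTF-8 sequence.
    static boolean isValidUtf8(byte[] bytes) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            // Malformed input, e.g. a file that was saved as ISO-8859-1 or CP1252.
            return false;
        }
    }
}
```

A passing check only shows that the bytes are well-formed UTF-8. Text that was wrongly decoded and then re-encoded (such as Ã„ stored as UTF-8) still passes, so this complements a visual check rather than replacing it.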

JSF

The JSF 1.1 RI that is currently in use seems to be unaware of the encoding of the client request and therefore falls back to some default encoding (investigating). JSF 1.2 has new features to deal with that. A solution is still missing here. Although the encoding is defined in the JSP source pages, it is declared neither in the XML declaration nor in the HTML headers. At least the HTTP "Content-Type" header is affected (still investigating).

JiBX

JiBX has to be told the encoding before it marshals an object to an XML string. This is already being done.

Axis and the FIZ framework

Axis escapes non-ASCII characters to numeric character references (e.g. "Ã„" to "&#xC3;&#x201E;"). This is legal but needless, because UTF-8 characters are of course allowed in a UTF-8 encoded XML document. When given a reference like "&#x201E;", the framework returns an error saying that a non-XML character "1E" was detected. This is obviously not correct. We have to clarify where this happens.
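That such a reference is legal XML can be checked with the parser bundled in the JDK. The following sketch uses JAXP only and says nothing about how Axis or the FIZ framework process the data internally; the class and method names are hypothetical.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;

// Hypothetical helper, for illustration only.
final class CharRefCheck {

    // Parses a small XML document and returns the text content of its root element.
    static String rootText(String xml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            return builder.parse(new InputSource(new StringReader(xml)))
                    .getDocumentElement().getTextContent();
        } catch (Exception e) {
            // Parser configuration, SAX and I/O errors are wrapped to keep the sketch short.
            throw new IllegalStateException("Could not parse: " + xml, e);
        }
    }

    public static void main(String[] args) {
        // The character reference and the literal character are expected to give the same text.
        String viaReference = rootText("<a>&#x201E;</a>");
        String viaLiteral = rootText("<a>\u201E</a>");
        System.out.println(viaReference.equals(viaLiteral));
    }
}
```

A conforming parser resolves the reference to the single character U+201E, so an error message that mentions only "1E" suggests that the value is being truncated or split somewhere along the way.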

EJB

The RMI serialization/deserialization between a client and an EJB seems to work properly on JBoss 4.0.5.

JBoss

???