CharsetEncoding

Programming on eSciDoc, and on PubMan in particular, touches many aspects of character set encoding, which should be handled consistently across all layers of the application. The default character set depends on the application; for PubMan, UTF-8 is clearly preferable.

The following components are affected by this.

Java
The Java Development Kit contains classes that are encoding-aware (e.g. FileReader), classes that are not (e.g. StringBuffer), and classes that are buggy with respect to encoding (e.g. the deprecated StringBufferInputStream). To deal with string/byte data whose encoding is not yet defined, e.g. when loading file content into a string variable, only encoding-aware classes and methods should be used.
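As an illustration of loading byte data into a string with encoding-aware classes, here is a minimal sketch. The class and method names are hypothetical and not taken from PubMan. It wraps the byte stream in an InputStreamReader with an explicitly named charset, so the result does not depend on the system default.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class ExplicitCharsetReadDemo {

    // Reads the whole stream into a String, decoding with the given charset.
    static String readAll(InputStream in, Charset charset) {
        StringBuilder content = new StringBuilder();
        try (Reader reader = new BufferedReader(new InputStreamReader(in, charset))) {
            char[] buffer = new char[1024];
            int n;
            while ((n = reader.read(buffer)) != -1) {
                content.append(buffer, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return content.toString();
    }

    public static void main(String[] args) {
        String original = "\u00E4\u00F6\u00FC\u00DF"; // a-umlaut, o-umlaut, u-umlaut, sharp s
        byte[] utf8 = original.getBytes(StandardCharsets.UTF_8);

        // Decoding with the charset the bytes were written in restores the text.
        String right = readAll(new ByteArrayInputStream(utf8), StandardCharsets.UTF_8);
        // Decoding with a different charset yields twice as many, wrong characters.
        String wrong = readAll(new ByteArrayInputStream(utf8), StandardCharsets.ISO_8859_1);

        System.out.println("UTF-8 read matches original: " + right.equals(original));
        System.out.println("ISO-8859-1 read length: " + wrong.length());
    }
}
```

The same pattern applies to files: open a FileInputStream and pass it to readAll together with the charset the file was saved in.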

Programming tips
Don't re-encode strings by hand in a way that depends on the system default encoding. In the best case, such code does nothing (when UTF-8 is the system default). In a worse case, e.g. when CP1252 is the system default, you get garbled characters. Worst case: it works although you have a different default encoding! This means that the original string was already invalid, and two faults now cancel each other out.
 * NEVER write your own encoding "management"; always use the facilities Java provides to handle encoding.
 * Be aware of the encoding of your console: in some cases your string may still be valid, but the Writer (or Printer) that streams it to the console uses a different encoding than the console itself. Classic example: when running a Java program in a DOS box, the console encoding is usually CP437 (ugh).
 * If you are not sure whether your string is encoded correctly, compare myString.length() with myString.getBytes(String encoding).length.
 * In test data, always use characters that are typical for different encodings. This lets you identify problems at an early stage. Use for instance: äöüß éèç (ISO-8859-1), „–” „—“ (CP1252), ❤€ (UTF-8).
 * When converting between strings and bytes, specify the "UTF-8" encoding explicitly.
 * One big source of confusion is Java's compilation mechanism: changing the encoding of your .java source files to the value you need does not change the .class files that are already compiled. Make sure all classes are recompiled after switching the encoding.
 * When reading larger amounts of UTF-8 text from a stream, some characters sometimes break. This is usually caused by a byte-oriented reader that buffers a certain number of bytes and then converts each buffer to character data. If a multi-byte UTF-8 character happens to be split across two buffers, it is broken. This can be avoided by using a character-oriented reader, e.g. BufferedReader.
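Several of the pitfalls in the tips above can be reproduced in a few lines. The following is a minimal sketch, not code from PubMan: the class and method names are made up for illustration, and windows-1252 is hard-coded as a stand-in for "the system default" so that the result does not depend on the machine. The hand-made re-encoding shown is only an assumption about what such code typically looks like.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class EncodingTipsDemo {

    // Assumption: simulates a CP1252 system default in a reproducible way.
    static final Charset SIMULATED_DEFAULT = Charset.forName("windows-1252");

    // Anti-pattern: encode with one charset, decode with another.
    static String reencodeByHand(String s) {
        byte[] bytes = s.getBytes(SIMULATED_DEFAULT);
        return new String(bytes, StandardCharsets.UTF_8);
    }

    // Difference between the byte length in the given charset and the char length.
    static int extraBytes(String s, Charset charset) {
        return s.getBytes(charset).length - s.length();
    }

    // Decodes fixed-size byte chunks independently; multi-byte characters
    // that straddle a chunk boundary are destroyed.
    static String decodeChunkwise(byte[] bytes, int chunkSize) {
        StringBuilder sb = new StringBuilder();
        for (int off = 0; off < bytes.length; off += chunkSize) {
            int len = Math.min(chunkSize, bytes.length - off);
            sb.append(new String(bytes, off, len, StandardCharsets.UTF_8));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String valid = "\u00E4\u00F6\u00FC";  // a-umlaut, o-umlaut, u-umlaut
        String broken = "\u00C3\u00A4";       // a-umlaut already mis-decoded once

        // A valid string is damaged by the re-encoding.
        System.out.println("valid survives: " + reencodeByHand(valid).equals(valid));
        // An already broken string is "repaired": two faults cancel out.
        System.out.println("broken repaired: " + reencodeByHand(broken).equals("\u00E4"));

        // Length check: 3 chars, 6 bytes in UTF-8, 3 bytes in ISO-8859-1.
        System.out.println("extra bytes UTF-8: " + extraBytes(valid, StandardCharsets.UTF_8));
        System.out.println("extra bytes ISO-8859-1: " + extraBytes(valid, StandardCharsets.ISO_8859_1));

        // Chunk size 2 happens to align with the characters, chunk size 3 does not.
        byte[] utf8 = valid.getBytes(StandardCharsets.UTF_8);
        System.out.println("chunks of 2 ok: " + decodeChunkwise(utf8, 2).equals(valid));
        System.out.println("chunks of 3 ok: " + decodeChunkwise(utf8, 3).equals(valid));
    }
}
```

The chunk example also shows why such bugs appear only "sometimes": whether a character is split depends on where the buffer boundary happens to fall.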

Eclipse
In the Eclipse IDE, the default encoding can (and must) be set to UTF-8 in the preferences. Additionally, there are file-type-specific settings, where the file encoding has to be set for each kind of file separately. The encoding of newly created files can also be defined in the preferences.

In addition, Eclipse should be started with UTF-8 as its default encoding. This can be done by adding a startup property at the end of the eclipse.ini file in ECLIPSE_HOME.

The same parameter is also useful when launching applications from Eclipse: add it to the arguments of the launch configuration.
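To verify that such a startup setting has taken effect, the running JVM can be asked which default encoding it actually uses. The sketch below assumes that the property in question is the standard JVM system property file.encoding (typically passed as -Dfile.encoding=UTF-8); the class name is hypothetical.

```java
import java.nio.charset.Charset;

class DefaultEncodingCheck {

    // Summarizes what the JVM reports about its default encoding.
    static String describe() {
        return "file.encoding=" + System.getProperty("file.encoding")
                + ", defaultCharset=" + Charset.defaultCharset().name();
    }

    // True if classes that rely on the default charset will use UTF-8.
    static boolean defaultIsUtf8() {
        return "UTF-8".equals(Charset.defaultCharset().name());
    }

    public static void main(String[] args) {
        System.out.println(describe());
        if (!defaultIsUtf8()) {
            System.out.println("Warning: default charset is not UTF-8");
        }
    }
}
```

Running this once from the command line and once from an Eclipse launch configuration shows whether both environments agree.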

Furthermore, all data files, config files and source files (.java) should be checked for the right encoding. Any broken character like Ã„, �� or [] is a hint that the encoding is wrong somewhere. But even if a file or a console output looks correct, this does not mean that the underlying string or stream has the right encoding.

JSF
The JSF 1.1 RI that is currently in use does not seem to be aware of the encoding of the client request and therefore falls back to some default encoding (under investigation). JSF 1.2 has new features to deal with this. A solution is still missing. Although the encoding is defined in the JSP source pages, it is defined neither in the XML declaration nor in the HTML headers. At least the HTTP "Content-Type" header is affected (still under investigation).

JiBX
JiBX has to be told the encoding before marshalling an object to an XML string. This is already being done.

Axis
Axis (sometimes?) escapes non-ASCII characters to numeric character references (e.g. "Ã„" to "&#xC3;&#x201E;"). This is legal but unnecessary, because these characters are of course allowed as they are in a UTF-8 encoded XML document.
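To make the effect concrete, the following sketch reproduces this kind of escaping. It is not Axis code and makes no claim about how Axis implements it; the class and method names are hypothetical. It replaces every non-ASCII character with its hexadecimal numeric character reference and leaves everything else untouched (including markup characters such as & and <, which a real XML serializer would also have to escape).

```java
import java.util.Locale;

class NumericCharRefDemo {

    // Replaces each non-ASCII code point with a reference of the form &#xHEX;
    static String escapeNonAscii(String s) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            if (cp < 128) {
                out.appendCodePoint(cp);
            } else {
                out.append("&#x")
                   .append(Integer.toHexString(cp).toUpperCase(Locale.ROOT))
                   .append(';');
            }
            i += Character.charCount(cp);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // U+00C3 followed by U+201E, the two characters from the example above
        System.out.println(escapeNonAscii("\u00C3\u201E")); // prints &#xC3;&#x201E;
    }
}
```

An XML parser resolves these references back to the original characters, which is why the escaped and unescaped documents are equivalent.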

FIZ Framework
When given a character from the upper Unicode range (e.g. &#x201E;), the framework returns an error saying that a non-XML character "1E" was detected. This is obviously not correct. We have to clarify where this happens. --Michael Franke 18:04, 27 February 2008 (CET) This is fixed.

EJB
The RMI serialization/deserialization between a client and an EJB seems to work properly on JBoss 4.0.5.

JBoss/Tomcat
Moreover, I found out that the log files, and thus the JBoss console in Eclipse, are encoded in the system default encoding even when "file.encoding" and "sun.jnu.encoding" are set to "UTF-8".
 * JBoss still seems to have encoding issues in version 4.2.0 (see http://jira.jboss.org/jira/browse/JBWS-1716). These issues relate to the reading of XML config files and to the encoding of client requests.
 * There is an ugly little feature in Tomcat: GET URLs are always interpreted as being encoded in the system default encoding. This means that a URL parameter containing non-ASCII characters is (on a Windows server using CP1252) always interpreted as CP1252 text. This behaviour can only be changed by editing Tomcat's server.xml file.
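The effect of the GET URL issue described above can be simulated outside the server. The sketch below is only an illustration with hypothetical names: it takes a parameter value that a browser has percent-encoded as UTF-8 and decodes it once as UTF-8 and once as windows-1252, which is assumed here to be what a server with a CP1252 default does.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

class UrlParameterDecodingDemo {

    // Decodes a percent-encoded parameter value using the named charset.
    static String decode(String raw, String charsetName) {
        try {
            return URLDecoder.decode(raw, charsetName);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unknown charset: " + charsetName, e);
        }
    }

    public static void main(String[] args) {
        // a-umlaut, o-umlaut, u-umlaut, percent-encoded as UTF-8 bytes
        String raw = "%C3%A4%C3%B6%C3%BC";

        String asUtf8 = decode(raw, "UTF-8");
        String asCp1252 = decode(raw, "windows-1252");

        System.out.println("UTF-8 decode length: " + asUtf8.length());     // 3 characters
        System.out.println("CP1252 decode length: " + asCp1252.length());  // 6 characters
    }
}
```

In Tomcat itself, the charset used for decoding request URIs is commonly configured through the URIEncoding attribute of the Connector element in server.xml; whether that is the exact change meant here is an assumption.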

Ant
Ant runs in a separate runtime, even when started from Eclipse. By default, it uses the system default encoding, which may lead to warnings like this:

[seu.javac] Compiling 15 source files to V:\development\build\common_logic
[distcomponents] V:\development\common_logic\src\test\common\encoding\EncodingTest.java:108: warning: unmappable character for encoding Cp1252
[distcomponents] String encodingCharacters = "Ã¤Ã¶Ã¼ÃŸ Ã©Ã¨Ã§"; //"Ã¤Ã¶Ã¼ÃŸ Ã©Ã¨Ã§ â€žâ€“â€? â€žâ€”â€œ â‚¬";
[distcomponents] ^
[distcomponents] 1 warning

This can be avoided by setting the property file.encoding=UTF-8 at Window|Preferences|Ant|Runtime|Properties.

Further information

 * Using Charsets and Encodings and Using Reflection To Create Class Instances
 * Converting Non-Unicode Text
 * UTF-8 and Unicode Standards