CharsetEncoding

From MPDLMediaWiki
Revision as of 09:10, 27 October 2009 by Rkiefl (talk | contribs)

Programming on eSciDoc, and on PubMan in particular, touches many aspects of character set encoding, which should be handled consistently on all layers of the application. The default character set depends on the application; for PubMan, UTF-8 is the preferred choice.

The following components are affected by this.

Java

The Java Development Kit contains classes that are encoding-aware (e.g. FileReader), classes that are not (e.g. StringBuffer), and classes that are buggy with respect to encoding (e.g. StringBufferInputStream, deprecated). When dealing with string/byte data whose encoding is not yet defined, e.g. when loading file content into a string variable, only encoding-aware classes and methods should be used.

Programming tips

  • NEVER write your own encoding "management"; always use the facilities that Java provides to handle encoding.
   Don't do things like this:
   
       byte[] utf8bytes = "ÄÖÜ".getBytes("UTF8");
       String utf8String = new String(utf8bytes);
   
   In the best case, this code does nothing (when UTF-8 is the system default). In a worse case, 
   e.g. when CP1252 is the system default, you get something like this:
   
       Ã„Ã–Ãœ
   
   Worst case: it works although you have a different default encoding! This means that the original string 
   was already invalid and the two faults now cancel each other out.
  • Be aware of the encoding of your console: in some cases your string is still valid, but the Writer (or PrintStream) that streams it to the console uses a different encoding than the console itself. Classic example: when running a Java program in a DOS box, the console encoding is usually CP437 (ugh).
  • If you are not sure whether your string is encoded correctly, compare
   myString.length() and myString.getBytes(String encoding).length
   With UTF-8, every non-ASCII character adds at least one extra byte, so the byte count should exceed the character count accordingly.
  • In test data, always use characters that are typical for different encodings. This lets you identify problems at an early stage. Use for instance
   äöüß éèç (ISO-8859-1)
   „–” „—“  (CP1252)
   ❤€       (UTF-8)
  • When constructing a String from a byte[], specify the "UTF-8" encoding explicitly, e.g.
   
   String result = new String(export.getOutput(itemList), "UTF-8");  
   

Note: export.getOutput(itemList) returns byte[]
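The difference between an explicit and an implicit charset can be tried out in isolation. The following is a minimal sketch, assuming Java 7 or later for StandardCharsets (on older JDKs the string name "UTF-8" would be passed instead). The class and method names are made up for illustration, and ISO-8859-1 stands in for CP1252 because it is guaranteed to exist on every JVM.

```java
import java.nio.charset.StandardCharsets;

final class EncodingRoundTrip {

    /** Decodes bytes with an explicitly named charset instead of the platform default. */
    static String decodeUtf8(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    /** Number of bytes the string occupies when encoded as UTF-8. */
    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        // "\u00C4\u00D6\u00DC" is "ÄÖÜ", written with escapes so that the example
        // does not depend on the encoding of this source file.
        String umlauts = "\u00C4\u00D6\u00DC";
        byte[] utf8 = umlauts.getBytes(StandardCharsets.UTF_8);

        System.out.println(umlauts.length());                 // 3 characters
        System.out.println(utf8Length(umlauts));              // 6 bytes in UTF-8
        System.out.println(decodeUtf8(utf8).equals(umlauts)); // true

        // Decoding the same bytes with a single-byte charset yields one
        // (wrong) character per byte.
        String wrong = new String(utf8, StandardCharsets.ISO_8859_1);
        System.out.println(wrong.length());                   // 6 characters
    }
}
```

Comparing lengths rather than printing the strings keeps the check independent of the console encoding.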

  • One big source of confusion is the compilation mechanism of Java: changing the encoding of your .java source files to the value you need does not change class files that were already compiled, because their string literals were decoded with the encoding in effect at compile time. Make sure all classes are recompiled after switching the encoding.
  • When reading larger amounts of UTF-8 data from a stream, some characters occasionally break. This is usually caused by reading fixed-size chunks of bytes from a byte stream and converting each chunk to character data separately. If a multi-byte UTF-8 character happens to be split across two chunks, it is broken. This can be avoided by reading character data through a character-oriented Reader, e.g. a BufferedReader.
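The split-character problem from the last tip can be reproduced with a very small buffer. This is only a sketch under assumptions: the class and method names are hypothetical, StandardCharsets requires Java 7 or later, and the chunk size of 2 is chosen purely to force the split.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;

final class ChunkedDecoding {

    /** Problematic: every chunk of bytes is decoded on its own. */
    static String decodePerChunk(InputStream in, int chunkSize) throws IOException {
        StringBuilder sb = new StringBuilder();
        byte[] buf = new byte[chunkSize];
        int n;
        while ((n = in.read(buf)) != -1) {
            // A multi-byte sequence cut at the chunk border is decoded as garbage.
            sb.append(new String(buf, 0, n, StandardCharsets.UTF_8));
        }
        return sb.toString();
    }

    /** Safe: the reader holds back incomplete byte sequences until the rest arrives. */
    static String decodeWithReader(InputStream in, int chunkSize) throws IOException {
        StringBuilder sb = new StringBuilder();
        try (Reader reader = new InputStreamReader(in, StandardCharsets.UTF_8)) {
            char[] buf = new char[chunkSize];
            int n;
            while ((n = reader.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        String text = "a\u00E4b";                            // "aäb"
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8); // 61 C3 A4 62

        // Chunks of 2 bytes split the two-byte sequence C3 A4.
        System.out.println(decodePerChunk(new ByteArrayInputStream(utf8), 2).equals(text));   // false
        System.out.println(decodeWithReader(new ByteArrayInputStream(utf8), 2).equals(text)); // true
    }
}
```

In real code the Reader would usually be wrapped in a BufferedReader; it is left out here to keep the sketch short.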

Eclipse

In the Eclipse IDE, the default encoding can (and must) be set to UTF-8. This is done under

 Window | Preferences | General | Workspace | Text file encoding

Additionally, there are file-type specific settings under

 Window | Preferences | General | Content Types

The encoding of newly created files can be defined here:

 Window | Preferences | Web and XML | ... files

where you have to set the file encoding for each kind of file separately.

In addition, Eclipse should be started with UTF-8 as its default encoding. This can be done with an additional startup property in the eclipse.ini file in ECLIPSE_HOME:

add

   -Dfile.encoding=UTF-8

at the end of eclipse.ini.

This is also useful when launching applications from Eclipse. Add the above parameter to the arguments:

 Run | Run Configurations | Arguments | VM Arguments
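Whether the startup property actually took effect can be checked from inside the running JVM. The following is a minimal sketch; the class name is hypothetical, and the program only reports what the JVM sees, it does not change anything.

```java
import java.nio.charset.Charset;

final class DefaultEncodingCheck {

    /** True if the JVM's default charset is UTF-8. */
    static boolean defaultIsUtf8() {
        return Charset.defaultCharset().name().equals("UTF-8");
    }

    public static void main(String[] args) {
        // The system property as passed via -Dfile.encoding (or the platform value).
        System.out.println("file.encoding  = " + System.getProperty("file.encoding"));
        // The charset that classes use when no encoding is specified.
        System.out.println("defaultCharset = " + Charset.defaultCharset());
        System.out.println("UTF-8 default? = " + defaultIsUtf8());
    }
}
```

Running it once with and once without the VM argument shows the difference on a machine whose system default is not UTF-8.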

Furthermore, all data files, config files and source files (.java) should be checked for the right encoding. Any broken character like Ã„, �� or [] is a hint that the encoding is wrong at some point. But even if a file or console output looks correct, this does not mean that the underlying string or stream has the right encoding.

JSF

The JSF 1.1 RI that is currently in use does not seem to be aware of the encoding of the client request and therefore falls back to some default value for the encoding (under investigation). JSF 1.2 has new features to deal with this. A solution is still missing here. Although the encoding is defined in the JSP source pages, it is defined neither in the XML declaration nor in the HTML headers. At least the HTTP "Content-Type" header is affected (still under investigation).

JiBX

JiBX has to be told the encoding before marshalling an object to an XML string, e.g.

 // bfact is an IBindingFactory obtained via BindingDirectory.getFactory(...)
 IMarshallingContext mctx = bfact.createMarshallingContext();
 StringWriter sw = new StringWriter();
 mctx.setOutput(sw);
 // the second argument is the document encoding
 mctx.marshalDocument(affiliationVO, "UTF-8", null, sw);

and

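 IBindingFactory bfact = BindingDirectory.getFactory("PubItemVO_PubCollectionVO_input", PubItemResultVO.class);
 IUnmarshallingContext uctx = bfact.createUnmarshallingContext();
 // searchResultItem is already a String, so its bytes were decoded before this point
 StringReader sr = new StringReader(searchResultItem);
 // note: for the Reader variant, the JiBX API appears to treat the second argument as the document name, not the encoding
 pubItemResultVO = (PubItemResultVO)uctx.unmarshalDocument(sr, "UTF-8");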

This is already being done.

Axis

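Axis (at least in some cases) escapes non-ASCII characters to numeric character references (e.g. "Ä" to "&#xC3&#x201E"). Such escaping is legal but needless, because the characters themselves are of course allowed in a UTF-8 encoded XML document. Note, however, that the two references in this example stand for "Ã„" (U+00C3, U+201E), i.e. the UTF-8 bytes of "Ä" read as CP1252, which points to a wrong decoding step before the escaping.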
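What an escaped sequence actually stands for can be checked by parsing it with the XML parser that ships with the JDK. This is only an illustrative sketch and does not involve Axis itself; the class and method names are hypothetical, and the references are written with the terminating semicolon that XML requires.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;

final class CharacterReferenceDemo {

    /** Parses a small XML document and returns the text content of its root element. */
    static String rootText(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)))
                .getDocumentElement()
                .getTextContent();
    }

    public static void main(String[] args) throws Exception {
        // A correct reference and the literal character are equivalent.
        System.out.println(rootText("<a>&#xC4;</a>").equals(rootText("<a>\u00C4</a>"))); // true

        // The pair U+00C3, U+201E is two characters, not the single "\u00C4".
        String escaped = rootText("<a>&#xC3;&#x201E;</a>");
        System.out.println(escaped.length());          // 2
        System.out.println(escaped.equals("\u00C4"));  // false
    }
}
```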

FIZ Framework

When given a character of the upper Unicode range (e.g. &#x201E), the framework returns an error saying "a non-xml character "1E" was detected". This is obviously not correct. We have to clarify where this happens. --Michael Franke 18:04, 27 February 2008 (CET): this is fixed.

EJB

The RMI serialization/deserialization between a client and an EJB seems to work properly on JBoss 4.0.5.

JBoss/Tomcat

The log files, and thus the Eclipse JBoss console, are encoded in the system default encoding even when "file.encoding" and "sun.jnu.encoding" are set to "UTF-8".

  • There is an ugly little feature in Tomcat: GET URLs are always interpreted as system-default encoded. This means a URL parameter in the browser like

   lang=Gr%C3%B6nl%C3%A4ndisch

is (on a Windows server using CP1252) always interpreted as

   lang: GrÃ¶nlÃ¤ndisch

This behaviour can only be changed by editing Tomcat's server.xml file:

   <Connector port="8080" address="${jboss.bind.address}"    
        maxThreads="250" maxHttpHeaderSize="8192"
        emptySessionPath="true" protocol="HTTP/1.1"
        enableLookups="false" redirectPort="8443" acceptCount="100"
        connectionTimeout="20000" disableUploadTimeout="true" URIEncoding="UTF-8"/>

Then the result is

   lang: Grönländisch
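The effect of the decoding charset on such a parameter can be reproduced outside the server with java.net.URLDecoder. This sketch only imitates the decoding step and is not Tomcat's implementation; the class and method names are hypothetical, and ISO-8859-1 is used in place of CP1252 because both map the bytes involved to the same characters.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

final class UrlParameterDecoding {

    /** Decodes a percent-encoded value, interpreting the bytes in the given encoding. */
    static String decode(String raw, String encoding) {
        try {
            return URLDecoder.decode(raw, encoding);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unknown encoding: " + encoding, e);
        }
    }

    public static void main(String[] args) {
        String raw = "Gr%C3%B6nl%C3%A4ndisch";

        // Correct: the percent-encoded bytes are UTF-8.
        String good = decode(raw, "UTF-8");
        // Wrong: each byte becomes its own character.
        String bad = decode(raw, "ISO-8859-1");

        // Escapes are used so that the result does not depend on source or console encoding.
        System.out.println(good.equals("Gr\u00F6nl\u00E4ndisch"));            // true
        System.out.println(bad.equals("Gr\u00C3\u00B6nl\u00C3\u00A4ndisch")); // true
        System.out.println(good.length() + " vs " + bad.length());            // 12 vs 14
    }
}
```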

Ant

Ant runs in a separate runtime, even when started from Eclipse. By default, it uses the default encoding of the system, which may lead to warnings like

   [seu.javac] Compiling 15 source files to V:\development\build\common_logic
   [distcomponents] V:\development\common_logic\src\test\common\encoding\EncodingTest.java:108: warning: unmappable character for encoding Cp1252
   [distcomponents] String encodingCharacters = "äöüß éèç"; //"äöüß éèç „–� „—“ €";
   [distcomponents] ^
   [distcomponents] 1 warning

This can be avoided by setting the property file.encoding=UTF-8 under Window | Preferences | Ant | Runtime | Properties.


Further information