CharsetEncoding

Programming on eSciDoc, namely on PubMan, touches many aspects of character set encoding, which should be handled consistently on all layers of the applications. The default character set depends on the application; in the case of PubMan, UTF-8 is clearly preferable.

The following components are affected by this.

Java

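The Java Development Kit contains classes that are encoding-aware (e.g. FileReader, InputStreamReader), classes that are not (e.g. StringBuffer), and classes that are buggy with respect to encoding (e.g. StringBufferInputStream, deprecated). To deal with string/byte data whose encoding is not yet defined, e.g. when loading file content into a string variable, only encoding-aware classes and methods should be used. Note that FileReader decodes with the platform default encoding unless a charset is passed explicitly (possible only since Java 11), so InputStreamReader with an explicit charset is the safer choice.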

Programming tips

  • NEVER write your own encoding "management"; always use the facilities Java provides to handle encoding.
   Don't do things like this:
   
       byte[] utf8bytes = "ÄÖÜ".getBytes("UTF8");
       String utf8String = new String(utf8bytes);
   
   In the best case, this code does nothing (when UTF-8 is system default). In a worse case, 
   e.g. when CP1252 is system default, you get something like this:
   
       Ã„Ã–Ãœ
   
   Worst case: It works although you have a different default encoding! This means that the original string 
   was already invalid and now two faults make it right.
  • Be aware of the encoding of your console: In some cases, your string may still be valid, but the Writer (or PrintStream) that streams it to the console uses a different encoding than the console itself. Classic example: when running a Java program in a DOS box, the console encoding is usually CP437 (ugh).
  • If you are not sure whether your string is encoded correctly, compare
   myString.length() and myString.getBytes(String encoding).length
  • In test data, always use characters which are typical for different encodings. This lets you identify problems in an early stage. Use for instance
   äöüß éèç (ISO-8859-1)
   „–” „—“  (CP1252)
   ❤€       (UTF-8)
  • When constructing a String from byte[], specify the "UTF-8" encoding explicitly. E.g.
   
   String result = new String(export.getOutput(itemList), "UTF-8");  
   

Note: export.getOutput(itemList) returns byte[]
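
The difference between explicit and default decoding, and the length check mentioned above, can be sketched as follows. This is only an illustration: the class and method names are made up for this example, and CP1252 is chosen to simulate a Windows system default. The string literal uses Unicode escapes so that the sketch does not depend on the encoding of the source file.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of PubMan.
class ExplicitCharsetDemo {

    // Decodes bytes with an explicitly named charset, never the platform default.
    static String decode(byte[] bytes, Charset charset) {
        return new String(bytes, charset);
    }

    public static void main(String[] args) {
        // A-umlaut, O-umlaut, U-umlaut written as escapes.
        String original = "\u00c4\u00d6\u00dc";
        byte[] utf8 = original.getBytes(StandardCharsets.UTF_8);

        String correct = decode(utf8, StandardCharsets.UTF_8);
        // Simulates decoding UTF-8 bytes with a CP1252 system default.
        String garbled = decode(utf8, Charset.forName("windows-1252"));

        System.out.println("chars=" + correct.length() + " utf8Bytes=" + utf8.length);
        System.out.println("garbledChars=" + garbled.length()
                + " equalsOriginal=" + garbled.equals(original));
    }
}
```

If the character count of a string is noticeably higher than the number of characters you expect to see, the bytes were probably decoded with the wrong charset somewhere upstream.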

  • One big source of confusion is the compilation mechanism of Java: changing the encoding of your .java source files does not update already compiled .class files, because their string literals were decoded with the encoding that was in effect at compile time. Make sure all classes are recompiled (with a matching javac -encoding setting) after switching the encoding.
  • When reading larger amounts of UTF-8 characters from a stream, sometimes some characters break: This is usually due to byte-oriented reading (e.g. an InputStream read in fixed-size chunks) where a certain number of bytes is buffered and then turned into character data. If, coincidentally, a multi-byte UTF-8 character is split across two chunks, this character is broken. This can be avoided by using a character-oriented Reader to read character data, e.g. BufferedReader.
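
The split-character problem from the last tip can be reproduced with a small sketch. The class and method names below are invented for illustration, and the tiny chunk size is chosen only to force a split; real code typically hits the same problem with larger buffers at unpredictable positions.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of PubMan.
class SplitCharacterDemo {

    // Fragile: each byte chunk is decoded on its own, so a multi-byte
    // character that straddles a chunk boundary cannot be decoded correctly.
    static String decodePerChunk(InputStream in, int chunkSize) {
        try {
            StringBuilder out = new StringBuilder();
            byte[] chunk = new byte[chunkSize];
            int read;
            while ((read = in.read(chunk)) != -1) {
                out.append(new String(chunk, 0, read, StandardCharsets.UTF_8));
            }
            return out.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Robust: the Reader decodes the byte stream as a whole and hands out
    // complete characters only.
    static String decodeWithReader(InputStream in, int chunkSize) {
        try (Reader reader = new InputStreamReader(in, StandardCharsets.UTF_8)) {
            StringBuilder out = new StringBuilder();
            char[] chunk = new char[chunkSize];
            int read;
            while ((read = reader.read(chunk)) != -1) {
                out.append(chunk, 0, read);
            }
            return out.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

For example, the UTF-8 bytes of "a\u00e4b" are 4 bytes long; with a chunk size of 2 the two bytes of the umlaut end up in different chunks, so decodePerChunk returns a string that differs from the input while decodeWithReader returns it unchanged.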

Eclipse

In the Eclipse IDE the default encoding can (and must) be set to UTF-8. This can be done under

 Window | Preferences | General | Workspace | Text file encoding

Additionally, there are file-type specific settings under

 Window | Preferences | General | Content Types

The encoding of a newly created file, of course, may be defined here:

 Window | Preferences | Web and XML | ... files

where you have to set the file encoding for each kind of file separately.

In addition, Eclipse should be started with UTF-8 as default encoding. This can be done with an additional startup property in the eclipse.ini file in ECLIPSE_HOME:

add

   -Dfile.encoding=UTF-8

at the end of eclipse.ini.

This is also useful when launching applications from eclipse. Add the above parameter to the arguments:

 Run | Run Configurations | Arguments | VM Arguments
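
To check whether the VM argument actually took effect for a launched application, the defaults can be printed at startup. The following is a minimal sketch with a made-up class name; it only reports what the running JVM uses and does not change anything.

```java
import java.nio.charset.Charset;

// Hypothetical helper class, not part of PubMan.
class DefaultEncodingCheck {

    // Summarizes the charset the JVM uses when no encoding is given explicitly.
    static String describeDefaults() {
        return "defaultCharset=" + Charset.defaultCharset().name()
                + ", file.encoding=" + System.getProperty("file.encoding");
    }

    public static void main(String[] args) {
        System.out.println(describeDefaults());
    }
}
```

Run it once with and once without the -Dfile.encoding argument in the launch configuration and compare the output.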

Furthermore, all data files, config files and source files (.java) should be checked for the right encoding. Any broken character like Ã„, �� or [] is a hint that the encoding is not correct at some place. But even if a file or a console output looks correct, it does not mean that the underlying string/stream has the right encoding.

JSF

The JSF 1.1 RI that is currently in use does not seem to be aware of the encoding of the client request and therefore falls back to some default value for the encoding (investigating). JSF 1.2 has new features to deal with that. A solution is still missing here. Although the encoding is defined in the JSP source pages, it is defined neither in the XML declaration nor in the HTML headers. At least the HTTP "Content-Type" header is affected (still investigating).

JiBX

JiBX has to be told the encoding before marshalling an object to an XML string. E.g.

 IMarshallingContext mctx = bfact.createMarshallingContext();
 StringWriter sw = new StringWriter();
 mctx.setOutput(sw);
 mctx.marshalDocument(affiliationVO, "UTF-8", null, sw);

and

 IBindingFactory bfact = BindingDirectory.getFactory("PubItemVO_PubCollectionVO_input", PubItemResultVO.class);
 IUnmarshallingContext uctx = bfact.createUnmarshallingContext();
 StringReader sr = new StringReader(searchResultItem);
 pubItemResultVO = (PubItemResultVO)uctx.unmarshalDocument(sr, "UTF-8");

This is happening already.

Axis

Axis (sometimes???) escapes non-ASCII characters to numeric character references (e.g. "Ä" to "&#196;"). This is a legal but needless operation, because these characters are of course allowed as they are in a UTF-8 encoded XML document.

FIZ Framework

When given a character of the upper Unicode range (e.g. „), the framework returns an error saying "a non-xml character '1E' was detected". This obviously is not correct. We have to clarify where this happens. --Michael Franke 18:04, 27 February 2008 (CET) this is fixed.

EJB

The RMI serialization/deserialization between a client and an EJB seems to work properly on JBoss 4.0.5.

JBoss/Tomcat

  • JBoss still seems to have encoding issues in version 4.2.0 (see http://jira.jboss.org/jira/browse/JBWS-1716). These issues relate to the reading of config XML files and to the encoding of client requests.

Moreover, I found out that the log files, and thus the Eclipse JBoss console, are encoded in the system default even when "file.encoding" and "sun.jnu.encoding" are set to "UTF-8".

  • There is an ugly little feature in Tomcat: GET URLs are always interpreted as system-default encoded. This means a URL parameter in the browser like

   lang=Gr%C3%B6nl%C3%A4ndisch

is (on a Windows server using CP1252) always interpreted as

   lang: GrÃ¶nlÃ¤ndisch

This behaviour can only be changed by editing Tomcat's server.xml file:

   <Connector port="8080" address="${jboss.bind.address}"    
        maxThreads="250" maxHttpHeaderSize="8192"
        emptySessionPath="true" protocol="HTTP/1.1"
        enableLookups="false" redirectPort="8443" acceptCount="100"
        connectionTimeout="20000" disableUploadTimeout="true" URIEncoding="UTF-8"/>

Then the result is

   lang: Grönländisch
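
The effect of the decoding charset on such a parameter can be reproduced outside the server with java.net.URLDecoder. This is only a sketch of what happens during percent-decoding, not Tomcat's actual implementation; the class and method names are made up for this example.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

// Hypothetical demo class, not part of PubMan or Tomcat.
class UrlDecodingDemo {

    // Percent-decodes a query parameter value using the named charset.
    static String decodeParam(String raw, String charsetName) {
        try {
            return URLDecoder.decode(raw, charsetName);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unknown charset: " + charsetName, e);
        }
    }

    public static void main(String[] args) {
        String raw = "Gr%C3%B6nl%C3%A4ndisch";
        // Decoded as UTF-8 the umlauts survive; decoded as CP1252 each
        // two-byte sequence turns into two separate characters.
        System.out.println(decodeParam(raw, "UTF-8"));
        System.out.println(decodeParam(raw, "windows-1252"));
    }
}
```

Keep in mind that the console output itself is subject to the console encoding discussed in the programming tips, so comparing string lengths (12 versus 14 characters here) is more reliable than reading the printed text.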

Ant

Ant runs in a separate runtime, even when started in Eclipse. By default, it uses the default encoding of the system, which may lead to warnings like

   [seu.javac] Compiling 15 source files to V:\development\build\common_logic
   [distcomponents] V:\development\common_logic\src\test\common\encoding\EncodingTest.java:108: warning: unmappable character for encoding Cp1252
   [distcomponents] String encodingCharacters = "äöüß éèç"; //"äöüß éèç „–� „—“ €";
   [distcomponents] ^
   [distcomponents] 1 warning

This can be avoided by setting the property file.encoding=UTF-8 at Window | Preferences | Ant | Runtime | Properties.


Further information