
Character Set Conversion

Java uses Unicode as its native character encoding; however, many Java programs still need to handle text data in other encodings. Java therefore provides a set of classes that convert many standard character encodings to and from Unicode. Java programs that need to deal with non-Unicode text data will typically convert that data into Unicode, process the data as Unicode, then convert the result back to the external character encoding.
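
The convert, process, convert-back cycle can be sketched as follows. This is only an illustration: the converter classes described in this section may not be available in the JDK you are using, so the sketch assumes the java.nio.charset API as a stand-in. The class name RoundTripSketch and the method upperCaseLatin1 are hypothetical names chosen for the example.

```java
import java.nio.charset.StandardCharsets;
import java.util.Locale;

// Hypothetical helper, not part of the converter API described in this section.
final class RoundTripSketch {

    // Converts ISO-8859-1 bytes to Unicode, processes the text as Unicode,
    // then converts the result back to the external encoding.
    static byte[] upperCaseLatin1(byte[] latin1Bytes) {
        String text = new String(latin1Bytes, StandardCharsets.ISO_8859_1);
        String processed = text.toUpperCase(Locale.ROOT);
        return processed.getBytes(StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        byte[] input = { 'c', 'a', 'f', (byte) 0xE9 }; // "caf" + e-acute in ISO-8859-1
        byte[] output = upperCaseLatin1(input);
        System.out.println(output.length + " bytes written");
    }
}
```

The processing step never sees the external encoding; only the first and last lines of the helper depend on it.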

Converter Classes

Two classes provide the interface to the character encoding converters. The class java.io.ByteToCharConverter is a base class for classes that convert characters in an external encoding into Unicode. The class java.io.CharToByteConverter is a base class for classes that convert Unicode characters into an external encoding. Note that the names ByteToChar and CharToByte derive from the fact that the char type can be used to hold only Unicode characters while byte arrays are needed to represent characters in other encodings.

Converter objects are created like this:

    ByteToCharConverter toUnicode
        = ByteToCharConverter.getConverter("ISO-8859-1");
    CharToByteConverter fromUnicode
        = CharToByteConverter.getConverter("CP850");

The static getConverter() methods return an instance of a subclass that performs the appropriate conversion. The name strings used to select the converter must be valid IANA character set names [1]. These are the same names used by the MIME charset parameter.
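
Selecting a conversion by name can be sketched with the java.nio.charset API, used here as an assumed stand-in for getConverter() in case the converter classes are not available in your JDK. The class CharsetLookupSketch and its lookup method are hypothetical.

```java
import java.nio.charset.Charset;

// Hypothetical helper; java.nio.charset stands in for getConverter().
final class CharsetLookupSketch {

    // Returns the charset registered under the given name, or fails with a
    // descriptive error if the name is legal but no such charset is installed.
    static Charset lookup(String name) {
        if (!Charset.isSupported(name)) {
            throw new IllegalArgumentException("No converter for encoding: " + name);
        }
        return Charset.forName(name);
    }

    public static void main(String[] args) {
        Charset latin1 = lookup("ISO-8859-1");
        // Prints the canonical name and whatever aliases this JDK registers.
        System.out.println(latin1.name() + " aliases: " + latin1.aliases());
    }
}
```

Charset names are matched without regard to case, so "iso-8859-1" selects the same charset as "ISO-8859-1".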

Both ByteToCharConverter and CharToByteConverter contain methods that are used to perform the actual conversion. The following simple example shows how a Unicode string could be converted into the Japanese Shift JIS character encoding using the convertAll() method:

    String str = "The quick brown fox";
    byte[] output;
    CharToByteConverter fromUnicode
        = CharToByteConverter.getConverter("SJIS");
    output = fromUnicode.convertAll( str.toCharArray() );

In addition to the convertAll() method, which converts text all at once, the converter classes also contain a convert() method, which can be used to convert large amounts of text incrementally.

Unknown Characters

There is seldom a one-to-one mapping between two character encodings. When a converter encounters a character in the source encoding that cannot be represented in the target encoding, an UnknownCharacterException is thrown. This behavior can be turned off, however, by calling setSubstitutionMode(true). In substitution mode, a converter replaces unknown characters with a substitution character. When converting to Unicode, the default substitution character is \ufffd, the Unicode Replacement Character. When converting to an external character encoding, the default is '?'. A different substitution character can be selected by calling setSubstitutionBytes() (when converting to an external encoding) or setSubstitutionChars() (when converting to Unicode).
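
The difference between the two modes can be sketched as follows. The sketch assumes the java.nio.charset API as a stand-in for the converter classes: CodingErrorAction.REPORT plays the role of the default exception-throwing mode, and CodingErrorAction.REPLACE with replaceWith() plays the role of setSubstitutionMode(true) with setSubstitutionBytes(). The class SubstitutionSketch and its methods are hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper; java.nio.charset stands in for CharToByteConverter.
final class SubstitutionSketch {

    // Strict mode: a character with no ISO-8859-1 mapping raises an exception.
    static byte[] encodeStrict(String text) throws CharacterCodingException {
        CharsetEncoder encoder = StandardCharsets.ISO_8859_1.newEncoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        return toArray(encoder.encode(CharBuffer.wrap(text)));
    }

    // Substitution mode: unmappable characters become the given byte.
    static byte[] encodeWithSubstitution(String text, byte substitute)
            throws CharacterCodingException {
        CharsetEncoder encoder = StandardCharsets.ISO_8859_1.newEncoder()
                .onMalformedInput(CodingErrorAction.REPLACE)
                .onUnmappableCharacter(CodingErrorAction.REPLACE)
                .replaceWith(new byte[] { substitute });
        return toArray(encoder.encode(CharBuffer.wrap(text)));
    }

    // Copies the remaining bytes of the buffer into a fresh array.
    private static byte[] toArray(ByteBuffer buffer) {
        byte[] bytes = new byte[buffer.remaining()];
        buffer.get(bytes);
        return bytes;
    }
}
```

For example, the euro sign (\u20ac) has no ISO-8859-1 mapping, so encodeStrict("a\u20acb") throws, while encodeWithSubstitution("a\u20acb", (byte) '#') yields the bytes for "a#b".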

Stateful Conversions

Many common character encodings are stateful, that is, the interpretation of a byte depends on the bytes before it. To accommodate this, the converter classes retain the state of the conversion at the end of a call to convert(). The retained state can include partially encountered multi-byte character sequences. If convert() is interrupted by an exception, the caller can handle the error then continue the conversion with a subsequent call to convert().

As a result of the stateful nature of the converters, the flush() method must be called after a series of calls to convert(). This signals that a particular conversion is done and allows the converter to emit any final bytes or characters. A call to flush() also resets the converter to its initial state. An explicit reset() call is also available if the caller does not care about finalizing the previous conversion.
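
A minimal sketch of an incremental conversion followed by a final flush is shown below. It assumes the java.nio.charset API as a stand-in for the converter classes, and it differs from them in one respect worth noting: a java.nio decoder leaves the bytes of an incomplete multi-byte sequence in the caller's input buffer rather than retaining them internally, so the caller carries them over with compact(). UTF-8 is used because it is always available; it has no shift states, but a multi-byte sequence split across two chunks raises the same issue. The class ChunkedDecodeSketch is hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper; java.nio.charset stands in for ByteToCharConverter.
final class ChunkedDecodeSketch {

    // Decodes UTF-8 text that arrives in arbitrary chunks.
    static String decodeChunks(byte[][] chunks) {
        int largest = 0;
        for (byte[] chunk : chunks) {
            largest = Math.max(largest, chunk.length);
        }
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPLACE)
                .onUnmappableCharacter(CodingErrorAction.REPLACE);
        // Room for one chunk plus a few carried-over bytes.
        ByteBuffer in = ByteBuffer.allocate(largest + 8);
        CharBuffer out = CharBuffer.allocate(largest + 8);
        StringBuilder text = new StringBuilder();

        for (byte[] chunk : chunks) {
            in.put(chunk);
            in.flip();
            decoder.decode(in, out, false); // false: more input may follow
            in.compact();                   // keep any incomplete trailing sequence
            drain(out, text);
        }
        in.flip();
        decoder.decode(in, out, true);      // true: no more input will arrive
        drain(out, text);
        decoder.flush(out);                 // let the decoder emit any final characters
        drain(out, text);
        return text.toString();
    }

    // Moves the decoded characters into the sink and empties the buffer.
    private static void drain(CharBuffer out, StringBuilder sink) {
        out.flip();
        sink.append(out);
        out.clear();
    }
}
```

If the two bytes of a character such as \u00e9 (0xC3 0xA9 in UTF-8) fall into different chunks, the first byte is carried over and the character is produced once the second chunk arrives.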

Exception Handling

A call to a converter's convert() method can throw three exceptions, all of which are subclasses of CharConversionException. They are described in Table 0-2.

Table 0-2 Character Encoding Conversion Exceptions

Exception                       Meaning
MalformedInputException         A byte or character sequence violates the rules of a multibyte character set or the rules of the surrogate extension mechanism for Unicode.
UnsupportedCharacterException   There is no defined mapping for this character.
ConversionBufferFullException   Not all input characters could be converted because the output buffer was too small.

To recover from an UnsupportedCharacterException or MalformedInputException, getBadInputLength() can be called to determine the length of the bad input and throw it out or fix it. In all cases, the nextCharIndex() and nextByteIndex() methods can be used to determine where conversion should resume in the input and output buffers.
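
The recovery strategy described above (skip the bad input, drain a full output buffer, then resume) can be sketched with the java.nio.charset API, assumed here as a stand-in for the converter classes. In that API the conditions are reported through a CoderResult return value instead of exceptions: CoderResult.length() corresponds to getBadInputLength(), and the buffer positions correspond to nextByteIndex() and nextCharIndex(). The class DecodeRecoverySketch is hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CoderResult;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper; java.nio.charset stands in for ByteToCharConverter.
final class DecodeRecoverySketch {

    // Decodes UTF-8 bytes, discarding any malformed or unmappable input.
    static String decodeSkippingBadInput(byte[] bytes) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        ByteBuffer in = ByteBuffer.wrap(bytes);
        CharBuffer out = CharBuffer.allocate(16); // deliberately small
        StringBuilder text = new StringBuilder();

        while (true) {
            CoderResult result = decoder.decode(in, out, true);
            if (result.isError()) {
                // The input position is at the start of the bad input; skip it.
                in.position(in.position() + result.length());
            } else if (result.isOverflow()) {
                drain(out, text); // output buffer full: empty it and resume
            } else {
                break;            // underflow: all input has been consumed
            }
        }
        drain(out, text);
        decoder.flush(out);
        drain(out, text);
        return text.toString();
    }

    // Moves the decoded characters into the sink and empties the buffer.
    private static void drain(CharBuffer out, StringBuilder sink) {
        out.flip();
        sink.append(out);
        out.clear();
    }
}
```

Discarding bad input silently is only one possible policy; a caller could equally log the offending bytes or substitute a replacement character at that point.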

The following example shows how to use the convert() method to translate between files:

    public static void convertFile( DataInputStream in,
                                    DataOutputStream out,
                                    ByteToCharConverter conv )
        throws java.io.IOException
    {
        // Set up output buffer.
        int outBufLen = 1200;
        char[] outBuf = new char [outBufLen];

        // Set up input buffer; inBufLen is the number of bytes actually read.
        byte[] inBuf = new byte [1200];
        int inBufLen;

        // Use the ByteToCharConverter's substitution mode.
        // Substitute '?' for unknown characters.
        char[] subChars = { '?' };
        conv.setSubstitutionMode(true);
        conv.setSubstitutionChars(subChars);

        // If we don't know where this converter is coming from,
        // it's a good idea to reset it.
        conv.reset();

        // Read an input buffer's worth.
        while ((inBufLen = in.read(inBuf, 0, inBuf.length)) != -1) {

            // Each freshly read buffer is converted from its beginning.
            int inStart = 0;

            // Loop while there's input to convert. This loop only
            // needs to run more than once if MalformedInputException
            // or ConversionBufferFullException occurs.
            while (inStart < inBufLen) {
                try {
                    // Convert it.
                    conv.convert(inBuf, inStart, inBufLen,
                                 outBuf, 0, outBufLen);
                    inStart = conv.nextByteIndex();
                } catch (MalformedInputException e) {
                    // Throw away the bad input. This assumes that
                    // nextByteIndex() reports where the bad input starts.
                    inStart = conv.nextByteIndex()
                              + conv.getBadInputLength();
                } catch (ConversionBufferFullException e) {
                    // Resume where the converter stopped.
                    inStart = conv.nextByteIndex();
                } finally {
                    // Write out whatever was converted.
                    out.writeChars(
                        new String(outBuf, 0, conv.nextCharIndex()));
                }
            }
        }

        // Finish the conversion and write any final characters.
        int flushedOutputLength = 0;
        try {
            flushedOutputLength = conv.flush(outBuf, 0, outBufLen);
        } catch (MalformedInputException e) {
            // The bad input will just be ignored.
        } finally {
            // Write it out!
            out.writeChars(
                new String(outBuf, 0, flushedOutputLength));
        }
    }

This example uses a converter directly and handles buffering explicitly. An easier way to accomplish the same task would be to use the new CharInputStream and CharOutputStream classes. See the Stream I/O section for more information.
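
A stream-based version of the same task might look like the sketch below. The CharInputStream and CharOutputStream classes named above may not exist under those names in the JDK you are using, so the sketch assumes java.io.InputStreamReader and java.io.OutputStreamWriter as stand-ins; they perform the conversion and its buffering internally. The class StreamTranscodeSketch is hypothetical.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.Writer;
import java.nio.charset.Charset;

// Hypothetical helper; the reader and writer hide the converter and its buffers.
final class StreamTranscodeSketch {

    // Copies text from one stream to another, changing its encoding on the way.
    static void transcode(InputStream in, Charset from,
                          OutputStream out, Charset to) throws IOException {
        Reader reader = new InputStreamReader(in, from);   // bytes to Unicode
        Writer writer = new OutputStreamWriter(out, to);   // Unicode to bytes
        char[] buffer = new char[1200];
        int count;
        while ((count = reader.read(buffer)) != -1) {
            writer.write(buffer, 0, count);
        }
        writer.flush(); // push any buffered output through to the stream
    }
}
```

The caller remains responsible for closing both streams; the sketch only flushes the writer so that all converted bytes reach the output.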

Supported Encodings

JDK 1.1 will include ByteToCharConverter and CharToByteConverter subclasses for the following set of character encodings:

Table 0-3 JDK 1.1 Character Encodings [1]

Character encoding    Explanation
8859_1 ISO Latin-1
8859_2 ISO Latin-2
8859_3 ISO Latin-3
8859_4 ISO Latin-4
8859_5 ISO Latin/Cyrillic
8859_6 ISO Latin/Arabic
8859_7 ISO Latin/Greek
8859_8 ISO Latin/Hebrew
8859_9 ISO Latin-5
Big5 Big 5 Traditional Chinese
CNS11643 CNS 11643 Traditional Chinese
Cp1250 Windows Eastern Europe / Latin-2
Cp1251 Windows Cyrillic
Cp1252 Windows Western Europe / Latin-1
Cp1253 Windows Greek
Cp1254 Windows Turkish
Cp1255 Windows Hebrew
Cp1256 Windows Arabic
Cp1257 Windows Baltic
Cp1258 Windows Vietnamese
Cp437 PC Original
Cp737 PC Greek
Cp775 PC Baltic
Cp850 PC Latin-1
Cp852 PC Latin-2
Cp855 PC Cyrillic
Cp857 PC Turkish
Cp860 PC Portuguese
Cp861 PC Icelandic
Cp862 PC Hebrew
Cp863 PC Canadian French
Cp864 PC Arabic
Cp865 PC Nordic
Cp866 PC Russian
Cp869 PC Modern Greek
Cp874 Windows Thai
EUCJIS Japanese EUC
GB2312 GB2312-80 Simplified Chinese
JIS JIS
KSC5601 KSC5601 Korean
MacArabic Macintosh Arabic
MacCentralEurope Macintosh Latin-2
MacCroatian Macintosh Croatian
MacCyrillic Macintosh Cyrillic
MacDingbat Macintosh Dingbat
MacGreek Macintosh Greek
MacHebrew Macintosh Hebrew
MacIceland Macintosh Iceland
MacRoman Macintosh Roman
MacRomania Macintosh Romania
MacSymbol Macintosh Symbol
MacThai Macintosh Thai
MacTurkish Macintosh Turkish
MacUkraine Macintosh Ukraine
SJIS PC and Windows Japanese
UTF8 Standard UTF-8

[1] The table does not currently show the correct IANA names. The naming of the converters will be cleaned up for JDK 1.1 FCS. Also, new converters will be added throughout the alpha and beta period. In particular, converters that handle common EBCDIC-based character sets will be made available.

OS Support For Character Conversions

Some platforms provide, or will soon provide, their own primitives for converting to and from Unicode. These conversions will likely differ from the Java-provided conversions in:

For these reasons, Java provides its own standard mappings but allows platform-specific extensions to fit into the general model.



[1] The alpha version of JDK 1.1 does not correctly use the IANA names. The naming convention for the converters will be cleaned up for the JDK 1.1 beta. Also, the list of supported converters will evolve throughout alpha and beta.

java-intl@java.sun.com
Copyright © 1996 Sun Microsystems, Inc. All rights reserved.