Converter objects are created like this:
ByteToCharConverter toUnicode
    = ByteToCharConverter.getConverter("ISO-8859-1");
CharToByteConverter fromUnicode
    = CharToByteConverter.getConverter("CP850");
The static getConverter() methods return a subclass that will perform the appropriate conversion. The name strings used to select the converter must be valid IANA character set names [1]. These are the same names that are used by the MIME charset parameter.
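The converter classes described here may not be available in the JDK you are using, so the following is only a minimal sketch of the same idea. It assumes `java.nio.charset.Charset` as a stand-in for the converter lookup. It selects two charsets by name, decodes ISO-8859-1 bytes to Unicode, and encodes the result as UTF-8. The class name `CharsetLookupSketch` and the method `transcode` are hypothetical.

```java
import java.nio.charset.Charset;

class CharsetLookupSketch {
    // Decode bytes from one named charset and re-encode them in another.
    // Charset.forName() throws an exception if a name is illegal or unsupported.
    static byte[] transcode(byte[] input, String fromName, String toName) {
        Charset from = Charset.forName(fromName);
        Charset to = Charset.forName(toName);
        String unicode = new String(input, from); // bytes -> Unicode
        return unicode.getBytes(to);              // Unicode -> bytes
    }

    public static void main(String[] args) {
        // "caf" followed by e-acute (0xE9) in ISO-8859-1.
        byte[] latin1 = { 0x63, 0x61, 0x66, (byte) 0xE9 };
        byte[] utf8 = transcode(latin1, "ISO-8859-1", "UTF-8");
        System.out.println(utf8.length); // prints 5: e-acute takes two bytes in UTF-8
    }
}
```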
Both ByteToCharConverter and CharToByteConverter contain methods that are used to perform the actual conversion. The following simple example shows how a Unicode string could be converted into the Japanese Shift JIS character encoding using the convertAll() method:
String str = "The quick brown fox";
byte[] output;
CharToByteConverter fromUnicode
    = CharToByteConverter.getConverter("SJIS");
output = fromUnicode.convertAll( str.toCharArray() );
In addition to the convertAll() method, which converts text all at once, the converter classes also contain a method named convert(), which can be used to convert large amounts of text incrementally.
Because the converters are stateful, the flush() method must be called after a series of calls to convert(). This signals that a particular conversion is done and allows the converter to emit any final bytes or characters. A call to flush() also resets the converter to its initial state. An explicit reset() call is also available if the caller does not care about finalizing the previous conversion.
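The convert-then-flush pattern can be sketched with `java.nio.charset.CharsetDecoder`, which is an assumed modern analogue and not the converter API described here. The sketch feeds a UTF-8 sequence in two chunks, with one character split across the chunk boundary, then signals end of input and flushes. The class name `IncrementalDecodeSketch`, the method `decodeChunks`, and the fixed 64-element buffers are illustrative assumptions.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.StandardCharsets;

class IncrementalDecodeSketch {
    // Decode input that arrives in chunks, then finalize the conversion.
    // Assumes each chunk and the whole output fit in the 64-element buffers.
    static String decodeChunks(byte[][] chunks) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder();
        ByteBuffer in = ByteBuffer.allocate(64);
        CharBuffer out = CharBuffer.allocate(64);
        for (byte[] chunk : chunks) {
            in.put(chunk);
            in.flip();
            decoder.decode(in, out, false); // more input may follow
            in.compact();                   // keep bytes of an incomplete character
        }
        in.flip();
        decoder.decode(in, out, true);      // signal end of input
        decoder.flush(out);                 // emit any final characters
        decoder.reset();                    // ready for an unrelated conversion
        out.flip();
        return out.toString();
    }

    public static void main(String[] args) {
        // 'a', then e-acute (0xC3 0xA9) split across two chunks, then 'b'.
        byte[][] chunks = { { 0x61, (byte) 0xC3 }, { (byte) 0xA9, 0x62 } };
        System.out.println(decodeChunks(chunks).equals("a\u00e9b")); // prints true
    }
}
```

The decoder holds no partial bytes itself in this sketch; the caller carries them over with `compact()`. That is a property of the stand-in API and may differ from how the converters described above keep state.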
To recover from an UnsupportedCharacterException or MalformedInputException, call getBadInputLength() to determine the length of the bad input, then either discard that input or repair it. In all cases, the nextCharIndex() and nextByteIndex() methods can be used to determine where conversion should resume in the input and output buffers.
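The recovery strategy of measuring the bad input and skipping it can be sketched as follows. The sketch assumes `java.nio.charset.CharsetDecoder` as a stand-in, where `CoderResult.length()` plays roughly the role of getBadInputLength() and the input buffer's position plays the role of nextByteIndex(). The class name `SkipBadInputSketch` and the method `decodeSkippingBadInput` are hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CoderResult;
import java.nio.charset.StandardCharsets;

class SkipBadInputSketch {
    // Decode UTF-8 and throw away any byte sequence the decoder rejects.
    static String decodeSkippingBadInput(byte[] bytes) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder();
        ByteBuffer in = ByteBuffer.wrap(bytes);
        // UTF-8 never yields more chars than input bytes, so this cannot overflow.
        CharBuffer out = CharBuffer.allocate(bytes.length);
        while (true) {
            CoderResult result = decoder.decode(in, out, true);
            if (!result.isError()) {
                break; // all input consumed
            }
            // The buffer position marks the start of the bad input;
            // result.length() reports how many bytes to discard.
            in.position(in.position() + result.length());
        }
        decoder.flush(out);
        out.flip();
        return out.toString();
    }

    public static void main(String[] args) {
        // 0xFF can never appear in well-formed UTF-8.
        byte[] bytes = { 0x61, (byte) 0xFF, 0x62 };
        System.out.println(decodeSkippingBadInput(bytes)); // prints ab
    }
}
```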
The following example shows how to use the convert() method to translate between files:
public static void convertFile( DataInputStream in,
                                DataOutputStream out,
                                ByteToCharConverter conv )
    throws java.io.IOException
{
    // Set up output buffer.
    int outBufLen = 1200;
    char[] outBuf = new char [outBufLen];
    // Set up input buffer and index.
    int inBufLen = 1200;
    byte[] inBuf = new byte [inBufLen];
    int inStart = 0;
    // Use the ByteToCharConverter's substitution mode.
    // Substitute '?' for unknown characters.
    char[] subChars = { '?' };
    conv.setSubstitutionMode(true);
    conv.setSubstitutionChars(subChars);
    // If we don't know where this converter is coming from,
    // it's a good idea to reset it.
    conv.reset();
    // Read an input buffer's worth.
    while((inBufLen = in.read(inBuf, 0, inBuf.length)) != -1 ) {
        // Each read refills inBuf from index 0, so restart there.
        // Without this, only the first buffer would be converted.
        inStart = 0;
        // Loop while there's input to convert. This loop only
        // needs to run more than once if MalformedInputException
        // or ConversionBufferFullException occurs.
        while( inStart < inBufLen ) {
            try {
                // convert it.
                conv.convert(inBuf, inStart, inBufLen,
                             outBuf, 0, outBufLen);
                inStart = conv.nextByteIndex();
            } catch (MalformedInputException e) {
                // Throw away bad input.
                inStart += conv.getBadInputLength();
            } catch(ConversionBufferFullException e) {
                inStart = conv.nextByteIndex();
            } finally {
                // write it out!
                out.writeChars(
                    new String(outBuf,0, conv.nextCharIndex()));
            }
        }
    }
    int flushedOutputLength = 0;
    try {
        flushedOutputLength = conv.flush(outBuf, 0, outBufLen);
    } catch (MalformedInputException e) {
        // The bad input will just be ignored.
    } finally {
        // write it out!
        out.writeChars(
            new String(outBuf,0, flushedOutputLength));
    }
}
This example uses a converter directly and handles buffering explicitly. An easier way to accomplish the same task would be to use the new CharInputStream and CharOutputStream classes. See the Stream I/O section for more information.
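A stream-based version of the same file conversion might look like the sketch below. It assumes `java.io.InputStreamReader` and `java.io.OutputStreamWriter` as stand-ins for the character stream classes. Treating them as counterparts is an assumption, and the class name `StreamTranscodeSketch` and the method `transcode` are hypothetical. The streams own the converters, so no explicit flush of converter state or bad-input handling appears in the loop.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class StreamTranscodeSketch {
    // Read bytes in one encoding and write them out in another.
    static void transcode(InputStream in, Charset from,
                          OutputStream out, Charset to) throws IOException {
        Reader reader = new InputStreamReader(in, from);
        Writer writer = new OutputStreamWriter(out, to);
        char[] buf = new char[1200];
        int n;
        while ((n = reader.read(buf, 0, buf.length)) != -1) {
            writer.write(buf, 0, n);
        }
        writer.flush(); // push any buffered bytes to the underlying stream
    }

    public static void main(String[] args) throws IOException {
        // "caf" followed by e-acute (0xE9) in ISO-8859-1.
        byte[] latin1 = { 0x63, 0x61, 0x66, (byte) 0xE9 };
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        transcode(new ByteArrayInputStream(latin1), StandardCharsets.ISO_8859_1,
                  out, StandardCharsets.UTF_8);
        System.out.println(out.size()); // prints 5
    }
}
```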
| Character encoding | Explanation |
|---|---|
| 8859_1 | ISO Latin-1 |
| 8859_2 | ISO Latin-2 |
| 8859_3 | ISO Latin-3 |
| 8859_4 | ISO Latin-4 |
| 8859_5 | ISO Latin/Cyrillic |
| 8859_6 | ISO Latin/Arabic |
| 8859_7 | ISO Latin/Greek |
| 8859_8 | ISO Latin/Hebrew |
| 8859_9 | ISO Latin-5 |
| Big5 | Big 5 Traditional Chinese |
| CNS11643 | CNS 11643 Traditional Chinese |
| Cp1250 | Windows Eastern Europe / Latin-2 |
| Cp1251 | Windows Cyrillic |
| Cp1252 | Windows Western Europe / Latin-1 |
| Cp1253 | Windows Greek |
| Cp1254 | Windows Turkish |
| Cp1255 | Windows Hebrew |
| Cp1256 | Windows Arabic |
| Cp1257 | Windows Baltic |
| Cp1258 | Windows Vietnamese |
| Cp437 | PC Original |
| Cp737 | PC Greek |
| Cp775 | PC Baltic |
| Cp850 | PC Latin-1 |
| Cp852 | PC Latin-2 |
| Cp855 | PC Cyrillic |
| Cp857 | PC Turkish |
| Cp860 | PC Portuguese |
| Cp861 | PC Icelandic |
| Cp862 | PC Hebrew |
| Cp863 | PC Canadian French |
| Cp864 | PC Arabic |
| Cp865 | PC Nordic |
| Cp866 | PC Russian |
| Cp869 | PC Modern Greek |
| Cp874 | Windows Thai |
| EUCJIS | Japanese EUC |
| GB2312 | GB2312-80 Simplified Chinese |
| JIS | JIS |
| KSC5601 | KSC5601 Korean |
| MacArabic | Macintosh Arabic |
| MacCentralEurope | Macintosh Latin-2 |
| MacCroatian | Macintosh Croatian |
| MacCyrillic | Macintosh Cyrillic |
| MacDingbat | Macintosh Dingbat |
| MacGreek | Macintosh Greek |
| MacHebrew | Macintosh Hebrew |
| MacIceland | Macintosh Iceland |
| MacRoman | Macintosh Roman |
| MacRomania | Macintosh Romania |
| MacSymbol | Macintosh Symbol |
| MacThai | Macintosh Thai |
| MacTurkish | Macintosh Turkish |
| MacUkraine | Macintosh Ukraine |
| SJIS | PC and Windows Japanese |
| UTF8 | Standard UTF-8 |
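
Whether a given runtime recognizes the names in this table can be checked at run time. The sketch below assumes `java.nio.charset.Charset` as the lookup mechanism, which is not the getConverter() method described above. It returns the canonical name that a table entry resolves to, or null if the entry is not supported. Which names resolve depends on the JDK and the charsets it ships with. The class name `EncodingNameSketch` and the method `canonicalName` are hypothetical.

```java
import java.nio.charset.Charset;

class EncodingNameSketch {
    // Return the canonical charset name for an encoding name,
    // or null if this runtime does not support it.
    static String canonicalName(String encodingName) {
        try {
            if (!Charset.isSupported(encodingName)) {
                return null;
            }
            return Charset.forName(encodingName).name();
        } catch (IllegalArgumentException e) {
            // The name contains characters that are not legal in a charset name.
            return null;
        }
    }

    public static void main(String[] args) {
        String[] names = { "8859_1", "Cp1252", "SJIS", "UTF8" };
        for (String name : names) {
            System.out.println(name + " -> " + canonicalName(name));
        }
    }
}
```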
java-intl@java.sun.com Copyright © 1996 Sun Microsystems, Inc. All rights reserved.