If you � Unicode, you’ll ����� EBCDIC

One of the projects I’ve been working on over the past few months is the Dynamic Scripting feature pack for CICS. It runs on z/OS, which is an interesting environment for a few reasons – not least because it’s an EBCDIC platform.

Know your charset

A character set defines which byte value is used to represent a given character. So the choice of character set is a concern that should crop up whenever raw bytes are interpreted as character sequences, or character sequences are converted to raw bytes. For example, care should be taken to use the correct character set when writing text to a file, or reading character data from the body of an HTTP response.

Often, it’s tempting to ignore the concern. The vast majority of character sets share a block of invariant code points – that is, the most common Western characters are represented by the same byte values in the most common character sets. For example, the following hexadecimal message could be decoded equivalently using UTF-8, ISO8859-1, Windows-1252 or CP-437:

49206b6e6f7720636861722d66752e

This holds because:

  • The four character encodings (I tend to use the terms “character encoding” and “character set” interchangeably) listed above are ASCII compatible, meaning that they represent characters in the ASCII range with the same set of byte values.
  • The message only contains characters that are in the ASCII range.
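To make the point concrete, here is a minimal Java sketch that decodes the hex message above under each of the four encodings. The class and method names are made up for illustration, and the charset names ("windows-1252", "IBM437") are the ones Java commonly uses; any that a particular JVM does not ship are skipped rather than assumed.

```java
import java.nio.charset.Charset;
import java.util.LinkedHashMap;
import java.util.Map;

final class HexDecodeDemo {

    /** Converts a hex string such as "4920" into the bytes it describes. */
    static byte[] hexToBytes(String hex) {
        byte[] out = new byte[hex.length() / 2];
        for (int i = 0; i < out.length; i++) {
            out[i] = (byte) Integer.parseInt(hex.substring(2 * i, 2 * i + 2), 16);
        }
        return out;
    }

    /** Decodes the same bytes under each named charset that this JVM supports. */
    static Map<String, String> decodeWithEach(byte[] bytes, String... charsetNames) {
        Map<String, String> results = new LinkedHashMap<>();
        for (String name : charsetNames) {
            if (Charset.isSupported(name)) {
                results.put(name, new String(bytes, Charset.forName(name)));
            }
        }
        return results;
    }

    public static void main(String[] args) {
        byte[] message = hexToBytes("49206b6e6f7720636861722d66752e");
        decodeWithEach(message, "UTF-8", "ISO-8859-1", "windows-1252", "IBM437")
                .forEach((name, text) -> System.out.println(name + " -> " + text));
    }
}
```

Every supported charset should print the same sentence, because each byte in the message falls in the ASCII range.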

Incidentally, this tool, which can convert hex strings like the one above to text, neglects to even mention that it uses an ASCII character set to do the job (let alone which ASCII-compatible character set). This is understandable: ASCII is so prevalent, its use sometimes goes without saying.

As long as you stick with ASCII platforms, bugs relating to character sets only tend to rear their carapace within the context of internationalisation. Characters like é are represented by different byte sequences in UTF-8 and ISO8859-1. So, if you ever write é to a file, make sure you remember which character set you use, because you’ll need to know later when you come to read it again.
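As a small illustration of the é case, the following sketch (hypothetical class name, standard JDK calls only) encodes the character under both charsets and then deliberately reads the UTF-8 bytes back as ISO8859-1.

```java
import java.nio.charset.StandardCharsets;

final class EAcuteDemo {

    /** Formats bytes as space-separated upper-case hex, e.g. "C3 A9". */
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // e-acute, written as an escape so the source file's own encoding cannot interfere
        String eAcute = "\u00e9";

        byte[] utf8 = eAcute.getBytes(StandardCharsets.UTF_8);
        byte[] latin1 = eAcute.getBytes(StandardCharsets.ISO_8859_1);
        System.out.println("UTF-8:      " + hex(utf8));    // two bytes
        System.out.println("ISO-8859-1: " + hex(latin1));  // one byte

        // Writing with one charset and reading with another corrupts the text.
        String misread = new String(utf8, StandardCharsets.ISO_8859_1);
        System.out.println("Round trip intact? " + misread.equals(eAcute));
    }
}
```

The mismatch is silent: no exception is thrown, the text simply comes back as two wrong characters instead of one right one.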

If you � Unicode, you’ll �̞̌̐�̧̇̑�̝̠̄�̝̞̉�̛̣̋ EBCDIC

EBCDIC (usually pronounced ebb-sih-dic) is another category of character encodings. Dealing with EBCDIC makes the whole issue of charset choice a little more explicit.

EBCDIC encodings are not ASCII compatible. The most inoffensive-looking text file will turn to garbage if it is saved in ASCII and opened in EBCDIC (or vice versa). Furthermore, any code that converts between bytes and character strings while assuming an ASCII encoding is at risk of misbehaving on an EBCDIC platform. For example, the following Java Class Library methods and constructors all use the platform default encoding, which might always be ASCII-compatible in your test environments, but suddenly becomes EBCDIC when your code is running on a mainframe. So using any of these without passing a Charset is suspicious:

 java.lang.String.getBytes()
 java.lang.String(byte[] bytes)
 java.io.ByteArrayOutputStream.toString()
 java.io.FileReader(String filename)
 java.io.FileReader(File file)
 java.io.FileReader(FileDescriptor fileDescriptor)
 java.io.FileWriter(String filename)
 java.io.FileWriter(File file)
 java.io.FileWriter(FileDescriptor fileDescriptor)
 java.io.InputStreamReader(InputStream input)
 java.io.OutputStreamWriter(OutputStream output)
 java.io.PrintStream(File file)
 java.io.PrintStream(OutputStream output)
 java.io.PrintStream(String string)
 java.io.PrintWriter(File file)
 java.io.PrintWriter(OutputStream output)
 java.io.PrintWriter(String string)
 java.util.Scanner(InputStream input)
 java.util.Formatter(String filename)
 java.util.Formatter(File file)
 java.util.Formatter(OutputStream output)

(We use a custom FindBugs detector to identify invocations of these methods – I’ll be writing about that in the near future.)
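For comparison, here is a hedged sketch of the safer pattern: the same kinds of conversion, but with the Charset passed explicitly. The class and helper names are invented for this example, and UTF-8 is just one reasonable choice; what matters is that the writer and the reader agree, and that neither relies on the platform default.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class ExplicitCharsetDemo {

    /** Encodes text through a Writer that is told which charset to use. */
    static byte[] write(String text, Charset charset) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try (Writer writer = new OutputStreamWriter(sink, charset)) {
            writer.write(text);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sink.toByteArray();
    }

    /** Decodes bytes through a Reader that is told which charset to use. */
    static String read(byte[] bytes, Charset charset) {
        StringBuilder text = new StringBuilder();
        try (Reader reader = new InputStreamReader(new ByteArrayInputStream(bytes), charset)) {
            int c;
            while ((c = reader.read()) != -1) {
                text.append((char) c);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return text.toString();
    }

    public static void main(String[] args) {
        String original = "#!/bin/sh";
        byte[] stored = write(original, StandardCharsets.UTF_8);

        // Nothing here depends on the default, so the result is the same on any platform.
        System.out.println(read(stored, StandardCharsets.UTF_8).equals(original));

        // The suspicious equivalents would be original.getBytes() and new String(stored).
        System.out.println("Platform default here: " + Charset.defaultCharset());
    }
}
```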

At least in the case of Java, I’d like to suggest that if this issue isn’t carefully considered when writing or reviewing a piece of code, then that code will most likely not run bug-free on an EBCDIC platform like z/OS – no matter how portable it is considered to be.

Have an(other) EBCDIC!

Too many flavours

There are well over 50 flavours of EBCDIC.
They are not this tasty.
(Image by Marzk, kindly released under CC by-nc-nd 2.0)

 

Apparently, an excessive degree of choice leads to paralysis, uncertainty and regret. EBCDIC doesn’t help in this matter. Here’s the list of EBCDIC Charsets supported by the Windows IBM JDK 60sr8 (a subset of all EBCDICs):
 IBM-Thai, IBM00924, IBM01140, IBM01141, IBM01142,
 IBM01143, IBM01144, IBM01145, IBM01146, IBM01147,
 IBM01148, IBM01149, IBM037, IBM1026, IBM1047,
 IBM1047_LF, IBM1141_LF, IBM1153, IBM273, IBM277,
 IBM278, IBM280, IBM284, IBM285, IBM297,
 IBM420, IBM424, IBM500, IBM870, IBM871,
 IBM918, IBM924_LF, x-IBM1025, x-IBM1027, x-IBM1097,
 x-IBM1112, x-IBM1122, x-IBM1123, x-IBM1364, x-IBM1371,
 x-IBM1388, x-IBM1399, x-IBM1399A, x-IBM420S, x-IBM833,
 x-IBM836, x-IBM875, x-IBM933, x-IBM935, x-IBM937,
 x-IBM939, x-IBM939A

Here’s the Groovy code used to generate that list.
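The Groovy code itself is not reproduced here. As a rough Java equivalent, the sketch below lists the charsets a JVM offers and keeps the ones that look like EBCDIC. The test for "looks like EBCDIC" is an assumption of this sketch, not necessarily what the original script did: it checks whether the letter a encodes to the single byte 0x81. The class and method names are hypothetical.

```java
import java.nio.charset.Charset;
import java.util.ArrayList;
import java.util.List;

final class ListEbcdicCharsets {

    /** Heuristic (an assumption of this sketch): EBCDIC flavours encode 'a' as the single byte 0x81. */
    static boolean looksLikeEbcdic(Charset cs) {
        if (!cs.canEncode()) {
            return false; // decode-only charsets cannot be probed this way
        }
        try {
            byte[] encoded = "a".getBytes(cs);
            return encoded.length == 1 && (encoded[0] & 0xFF) == 0x81;
        } catch (UnsupportedOperationException e) {
            return false; // be defensive about unusual charset implementations
        }
    }

    /** Canonical names of all installed charsets that pass the heuristic. */
    static List<String> ebcdicCharsetNames() {
        List<String> names = new ArrayList<>();
        for (Charset cs : Charset.availableCharsets().values()) {
            if (looksLikeEbcdic(cs)) {
                names.add(cs.name());
            }
        }
        return names;
    }

    public static void main(String[] args) {
        System.out.println(String.join(", ", ebcdicCharsetNames()));
    }
}
```

The output depends entirely on the JDK in use, so it will not necessarily match the list above.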

The justification for this plethora of encodings is to cater for globalisation. Most EBCDIC character sets are single-byte character sets, so any given EBCDIC variation can encode a maximum of 256 symbols. Given that some diacritical marks or currency symbols might be required in one locale, but can reasonably be omitted in another, administrators can pick the preferred code page for their environment. For example, if you need a “ç” but can do without a “å”, IBM-1147 might be the character set for you. For this reason, a small number of different EBCDIC flavours tend to be associated with each locale. IBM-1047 and IBM-037 are commonly used on English language systems; IBM-297 or IBM-1147 might be used in France; Japanese systems could use one of a handful more, and so on.

Like ASCII, all variations of EBCDIC have a set of code points that don’t change – i.e. some characters are represented by the same byte value, no matter the flavour of EBCDIC. The letter a, for example, is always byte value 0x81.
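A quick way to see this invariance is to encode the letter a under a few flavours and compare with ASCII. This is only a sketch: the class name is made up, the four EBCDIC charsets are an arbitrary sample, and any that the JVM lacks are skipped.

```java
import java.nio.charset.Charset;
import java.util.LinkedHashMap;
import java.util.Map;

final class InvariantLetterDemo {

    /** Returns the unsigned value of the first byte of ch when encoded under the charset. */
    static int firstByte(char ch, Charset charset) {
        return String.valueOf(ch).getBytes(charset)[0] & 0xFF;
    }

    /** Maps each supported charset name to the byte it uses for the given character. */
    static Map<String, Integer> byteFor(char ch, String... charsetNames) {
        Map<String, Integer> result = new LinkedHashMap<>();
        for (String name : charsetNames) {
            if (Charset.isSupported(name)) {
                result.put(name, firstByte(ch, Charset.forName(name)));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // US-ASCII is included as the odd one out; the rest are a sample of EBCDIC flavours.
        byteFor('a', "US-ASCII", "IBM037", "IBM1047", "IBM297", "IBM500")
                .forEach((name, b) -> System.out.println(String.format("%-9s 0x%02X", name, b)));
    }
}
```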

Unfortunately, there are two nasty complications. Firstly, as we’ve established, the byte values of these invariant characters are not the same as the ASCII byte values for the same characters – so even the simplest message is corrupt if you save it as ASCII and read it as EBCDIC. Secondly, and perhaps more shockingly, there are 13 widely-used characters that can vary between EBCDIC flavours.

Here they are:

 ^ ~ ! [ ] { } # | ` $ @ \

This table shows a small part of the story by illustrating how a handful of EBCDICs encode those variant characters. Of course, there are a few dozen flavours that are not represented.
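A comparable table can be generated on any JVM that ships the relevant charsets. The sketch below is an illustration, not a copy of the table referred to above: the class name is hypothetical, the four flavours are an arbitrary sample, and the backslash is included on the assumption that it is the thirteenth variant character.

```java
import java.nio.charset.Charset;
import java.util.LinkedHashMap;
import java.util.Map;

final class VariantCharTable {

    // The characters listed in the post, plus backslash, which this sketch
    // assumes is the thirteenth variant character.
    static final String VARIANTS = "^~![]{}#|`$@\\";

    /** One table row: the hex byte each variant character encodes to under cs. */
    static String row(Charset cs) {
        StringBuilder sb = new StringBuilder();
        for (char ch : VARIANTS.toCharArray()) {
            int b = String.valueOf(ch).getBytes(cs)[0] & 0xFF;
            sb.append(String.format(" %02X", b));
        }
        return sb.toString();
    }

    /** Rows for every named charset this JVM supports, in the order given. */
    static Map<String, String> table(String... charsetNames) {
        Map<String, String> rows = new LinkedHashMap<>();
        for (String name : charsetNames) {
            if (Charset.isSupported(name)) {
                rows.put(name, row(Charset.forName(name)));
            }
        }
        return rows;
    }

    public static void main(String[] args) {
        StringBuilder header = new StringBuilder(String.format("%-9s", ""));
        for (char ch : VARIANTS.toCharArray()) {
            header.append("  ").append(ch);
        }
        System.out.println(header);
        table("IBM037", "IBM1047", "IBM500", "IBM297")
                .forEach((name, row) -> System.out.println(String.format("%-9s", name) + row));
    }
}
```

Columns where the rows disagree are the characters that change meaning when a file moves between flavours.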

If you write code and hadn’t come across EBCDIC variants before, feel free to cringe at this stage. For programmers, those are some pretty useful characters. It’s somewhat disconcerting that they could be misinterpreted by the system, just because a user is running under a different locale. For example, given that characters like #, $ and ! are variant, how can you distribute shell scripts that run in any z/OS environment, regardless of the active flavour of EBCDIC? I’ll discuss this in my next post…

This post was going to be about EBCDIC and self-trepanation, but it turns out EBCDIC alone is disturbing and painful enough.

5 thoughts on “If you � Unicode, you’ll ����� EBCDIC”

  1. Hi Robin,

    [maybe this is a repost; the first try to comment seems not to have worked]

    You wrote:
    > We use a custom FindBugs detector to identify invocations of these methods
    > I’ll be writing about that in the near future.

    It would be wonderful, if you could provide this detector. I think it belongs into the default ruleset of findbugs!

    Kind regards,
    Rüdiger
