Archived Forum Post

Question:

Character Encoding vs. Content-Type Encoding

Feb 20 '13 at 20:34

I am using the Perl library. I've been running into an issue and hope you can shed some light on it.

I've been going back and forth for months with certain Japanese users who are unable to read our email.

When I asked the customer to send me an example of an email (from a different provider) that was readable, I noticed that email had a content type of iso-2022-jp with an encoding of base64.

Using your library, if I send an email with Japanese HTML content, you set the content type to "shift_jis" but still encode the email with quoted-printable.

According to your docs, I thought that if you select shift_jis, you would also select base64 encoding. I'm making an assumption that this is the reason some Japanese users are unable to view our mail.

Attached is an example email that our system is creating. Is this right? Shouldn't it be encoded in base64?

Answer

Character encoding and Content-Transfer-Encoding are two entirely different things.

A character encoding specifies the exact bytes used to represent each character in a language. For example:

Consider this character: É

In the iso-8859-1 character encoding, it is represented by a single byte: 0xC9
In the utf-8 character encoding, it is represented by two bytes: 0xC3 0x89
In the ucs-2 character encoding, it is represented by two bytes: 0x00 0xC9
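
The byte values above can be checked with a short script. This is only an illustrative sketch in Python (not the Perl library from the question). It assumes big-endian byte order for ucs-2 and uses Python's utf-16-be codec, which produces the same bytes as ucs-2 for a character like this one.

```python
# The same character, É (U+00C9), under three character encodings.
ch = "\u00c9"

latin1 = ch.encode("iso-8859-1")
utf8 = ch.encode("utf-8")
# Assumption: Python has no codec named "ucs-2"; for characters in the
# Basic Multilingual Plane, big-endian UTF-16 yields the same two bytes.
ucs2_be = ch.encode("utf-16-be")

print("iso-8859-1:", latin1.hex())
print("utf-8:     ", utf8.hex())
print("ucs-2 (BE):", ucs2_be.hex())
```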

The Content-Transfer-Encoding of a MIME message indicates how the bytes comprising the body are encoded (if encoded at all). For historical reasons, it was important for many mail processors that MIME messages consist of 7bit printable (us-ascii) chars with limited line lengths. Data that does not fit those limits, such as attached images or non-English text in an 8bit character encoding, is most commonly encoded using either base64 or quoted-printable (see http://en.wikipedia.org/wiki/MIME).

The Content-Transfer-Encoding has nothing to do with the character encoding of the text. The character encoding determines the bytes used to represent each char. The Content-Transfer-Encoding determines how MIME body bytes are encoded. The MIME body bytes might happen to be text, image data, or anything else.
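
To make the two layers concrete, here is a minimal sketch in Python using only the standard base64 and quopri modules. The sample text and variable names are made up for illustration, and this is not how the library in question builds messages internally. It shows the character encoding producing bytes first, and either transfer encoding then wrapping those same bytes.

```python
import base64
import quopri

# Hypothetical sample text ("Japanese email"), written with escapes so the
# script itself stays pure ASCII.
text = "\u65e5\u672c\u8a9e\u306e\u30e1\u30fc\u30eb"

# Layer 1: the character encoding turns characters into bytes.
body_bytes = text.encode("shift_jis")

# Layer 2: the transfer encoding turns those bytes into 7bit-safe ASCII.
as_base64 = base64.b64encode(body_bytes)
as_qp = quopri.encodestring(body_bytes)

print("base64:          ", as_base64.decode("ascii"))
print("quoted-printable:", as_qp.decode("ascii"))

# A reader undoes the transfer encoding first, then applies the charset.
from_base64 = base64.b64decode(as_base64)
from_qp = quopri.decodestring(as_qp)
same_bytes = from_base64 == body_bytes and from_qp == body_bytes
same_text = from_qp.decode("shift_jis") == text
print("identical bytes:", same_bytes, "identical text:", same_text)
```

Swapping "shift_jis" for "iso-2022-jp" changes body_bytes but not the logic: the transfer encoding never looks at what the bytes mean.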

It's perfectly valid to choose either quoted-printable or base64 in any situation. Any MIME reader should decode according to the Content-Transfer-Encoding, and the result will be the same. The only reason for choosing one over the other is probably size. Base64 is good for non-text data, or for character data where the vast majority of characters are non-us-ascii. It always grows the data to 4/3 of its original size (about a 33% increase), because every 3 bytes are represented as 4 printable chars. Quoted-printable leaves most printable us-ascii chars as they are but expands each 8bit byte to 3 chars, so it is more efficient for text where a large majority of chars are 7bit us-ascii, such as European languages with accented chars, or perhaps HTML where the tags and markup account for a large portion of the text.
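
The size trade-off can be measured directly. The following is a rough sketch in Python with made-up sample inputs; encoded_sizes is a hypothetical helper written for this example, and exact numbers will vary with the input and with how the encoder wraps lines.

```python
import base64
import quopri


def encoded_sizes(data):
    """Return (base64 length, quoted-printable length) for the given bytes."""
    return len(base64.b64encode(data)), len(quopri.encodestring(data))


# Mostly us-ascii: HTML markup with a couple of accented chars.
html = "<p>Caf\u00e9 cr\u00e8me, s'il vous pla\u00eet</p>".encode("iso-8859-1")

# Almost entirely non-us-ascii: Japanese text in utf-8.
japanese = ("\u65e5\u672c\u8a9e" * 10).encode("utf-8")

html_b64, html_qp = encoded_sizes(html)
jp_b64, jp_qp = encoded_sizes(japanese)

print("html     raw=%d base64=%d qp=%d" % (len(html), html_b64, html_qp))
print("japanese raw=%d base64=%d qp=%d" % (len(japanese), jp_b64, jp_qp))

# Every 3 input bytes become 4 base64 chars.
three_bytes_as_b64 = len(base64.b64encode(b"abc"))
```

With inputs like these, quoted-printable should come out smaller for the HTML sample and base64 smaller for the Japanese sample, which matches the reasoning above.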