Encoding
A character encoding maps each character in a character set to a numeric value that a computer can represent. These numbers can be represented by a single byte or multiple bytes. For example, the ASCII encoding uses 7 bits to represent the Latin alphabet, punctuation, and control characters.
You use Japanese encodings, such as Shift-JIS, EUC-JP, and ISO-2022-JP, to represent Japanese text. These encodings can vary slightly, but they include a common set of approximately 10,000 characters used in Japanese.
The following terms apply to character encodings:
- SBCS: Single-byte character set; a character set encoded in one byte per character, such as ASCII or ISO 8859-1.
- DBCS: Double-byte character set; a method of encoding a character set in no more than 2 bytes, such as Shift-JIS. Many character encoding schemes that are referred to as double-byte, including Shift-JIS, allow mixing of single-byte and double-byte encoded characters. Others, such as UCS-2, use 2 bytes for all characters.
- MBCS: Multiple-byte character set; a character set encoded with a variable number of bytes per character, such as UTF-8.
The following table lists some common character encodings; however, there are many additional character encodings that browsers and web servers support:
| Encoding | Type | Description |
|---|---|---|
| ASCII | SBCS | 7-bit encoding used by English and Indonesian Bahasa languages |
| Latin-1 (ISO 8859-1) | SBCS | 8-bit encoding used for many Western European languages |
| Shift_JIS | DBCS | 16-bit Japanese encoding. Note: Use an underscore character (_), not a hyphen (-) in the name in CFML attributes. |
| EUC-KR | DBCS | 16-bit Korean encoding |
| UCS-2 | DBCS | Two-byte Unicode encoding |
| UTF-8 | MBCS | Multibyte Unicode encoding. ASCII is 7-bit; non-ASCII characters used in European and many Middle Eastern languages are two-byte; and most Asian characters are three-byte |
The World Wide Web Consortium maintains a list of all character encodings supported by the Internet. You can find this information at www.w3.org/International/O-charset.html.
Computers must often convert between character encodings. In particular, the character encodings most commonly used on the Internet are not used by Java or Windows. Character sets used on the Internet are typically single-byte or multiple-byte (including DBCS character sets that allow single-byte characters). These character sets are most efficient for transmitting data, because each character takes up the minimum necessary number of bytes. Currently, Latin characters are most frequently used on the web, and most character encodings used on the web represent those characters in a single byte. Computers, however, process data most efficiently if each character occupies the same number of bytes. Therefore, Windows and Java both use double-byte encoding for internal processing.
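The size trade-off between transmission encodings and a fixed-width internal encoding can be seen by encoding the same characters in several ways. The following is a minimal Python sketch, not ColdFusion code. The codec names are Python's own, and `utf-16-be` stands in for UCS-2 on the assumption that the two are identical for the Basic Multilingual Plane characters used here. The names `SAMPLES`, `CODECS`, and `byte_length` are illustrative.

```python
# Compare how many bytes one character needs under several encodings.
# Assumption: "utf-16-be" stands in for UCS-2 (same bytes for these characters).
SAMPLES = ["A", "é", "日", "한"]
CODECS = ["ascii", "latin-1", "shift_jis", "euc_kr", "utf-16-be", "utf-8"]


def byte_length(char, codec):
    """Return the encoded length in bytes, or None if the codec lacks the character."""
    try:
        return len(char.encode(codec))
    except UnicodeEncodeError:
        return None


# Map each sample character to its byte length under every codec.
table = {
    char: {codec: byte_length(char, codec) for codec in CODECS}
    for char in SAMPLES
}

for char, row in table.items():
    print(char, row)
```

In this sketch, the fixed-width stand-in uses 2 bytes for every sample character, including "A", while the other encodings use 1 byte for "A" and more only for characters that need it. A value of `None` means the encoding cannot represent that character at all.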
The Java Unicode character encoding
ColdFusion uses the Java Unicode Standard for representing character data internally. This standard corresponds to UCS-2 encoding of the Unicode character set (current JVMs use UTF-16, which is identical to UCS-2 for characters in the Basic Multilingual Plane). The Unicode character set can represent many languages, including all major European and Asian character sets. Therefore, ColdFusion can receive, store, process, and present text from all languages supported by Unicode.
The Java Virtual Machine (JVM) that is used to process ColdFusion pages converts the character encoding used on a ColdFusion page or other source of information to UCS-2. The page or data encodings that ColdFusion supports depend on the specific JVM, but include most encodings used on the web. Similarly, the JVM converts its internal UCS-2 representation to the character encoding used to send the response to the client.
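The two conversion steps described above (source encoding to an internal Unicode form, then internal form to the response encoding) can be sketched as follows. This is a Python analogy under stated assumptions, not the JVM's actual implementation: Python's `str` plays the role of the internal representation, and `transcode` is a hypothetical helper name.

```python
def transcode(data: bytes, source_encoding: str, target_encoding: str) -> bytes:
    """Decode bytes into an internal Unicode string, then re-encode for output."""
    text = data.decode(source_encoding)   # source bytes -> internal Unicode text
    return text.encode(target_encoding)   # internal Unicode text -> response bytes


# Stand-in for page content that was saved as Shift_JIS.
shift_jis_bytes = "日本語".encode("shift_jis")

# Convert it to UTF-8, as a server might before sending a response.
utf8_bytes = transcode(shift_jis_bytes, "shift_jis", "utf-8")

print(len(shift_jis_bytes), len(utf8_bytes))
```

The text is the same before and after the conversion; only the byte sequence that carries it changes.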
By default, ColdFusion uses UTF-8 to represent text data sent to a browser. UTF-8 represents the Unicode character set using a variable-length encoding. ASCII characters are sent using a single byte. Most European and Middle Eastern characters are sent as 2 bytes, and Japanese, Korean, and Chinese characters are sent as 3 bytes. One advantage of UTF-8 is that it sends ASCII character set data in a form that is recognized by systems designed to process only single-byte ASCII characters, while it is flexible enough to handle multiple-byte character representations.
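The variable-length behavior and the ASCII compatibility of UTF-8 can be checked directly. The following Python sketch is illustrative only; the sample characters and the names `samples`, `utf8_lengths`, and `ascii_compatible` are assumptions chosen for this example.

```python
# One sample character per script family mentioned above.
samples = {
    "A": "ASCII",
    "é": "European",
    "ש": "Middle Eastern (Hebrew)",
    "日": "Japanese",
    "한": "Korean",
    "中": "Chinese",
}

# Number of bytes UTF-8 uses for each sample character.
utf8_lengths = {char: len(char.encode("utf-8")) for char in samples}

# Pure ASCII text produces the same bytes in UTF-8 and in ASCII.
ascii_compatible = "Hello".encode("utf-8") == "Hello".encode("ascii")

for char, script in samples.items():
    print(f"{script}: {char!r} -> {utf8_lengths[char]} byte(s)")
print("ASCII-compatible:", ascii_compatible)
```

Because the bytes for ASCII text are unchanged, a system that only understands single-byte ASCII can still read the ASCII portions of a UTF-8 stream.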
While the default format of text data returned by ColdFusion is UTF-8, you can have ColdFusion return a page in any character set supported by Java. For example, you can return text using the Japanese language Shift-JIS character set. Similarly, ColdFusion can handle data that is in many different character sets. For more information, see Determining the page encoding of server output in Processing a request in ColdFusion.
Character encoding conversion issues
Because different character encodings support different character sets, you can encounter errors if your application gets text in one encoding and presents it in another encoding. For example, the Windows Latin-1 character encoding, Windows-1252, includes characters with hexadecimal representations in the range 80-9F, while ISO 8859-1 does not include characters in that range. As a result, under the following circumstances, characters in the range 80-9F, such as the euro symbol (€), are not displayed properly:
- A file encoded in Windows-1252 includes characters in the range 80-9F.
- ColdFusion reads the file, specifying the Windows-1252 encoding in the cffile tag.
- ColdFusion displays the file contents, specifying ISO-8859-1 in the cfcontent tag.
Similar issues can arise if you convert between other character encodings; for example, if you read files encoded in the Japanese Windows default encoding and display them using Shift-JIS. To prevent these problems, ensure that the display encoding is the same as the input encoding.
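The Windows-1252 and ISO 8859-1 mismatch can be reproduced outside ColdFusion. The following Python sketch is an analogy only: it does not use the cffile or cfcontent tags, the variable names are illustrative, and how ColdFusion itself handles an unrepresentable character may differ from Python's behavior shown here.

```python
# The euro symbol occupies byte 0x80 in Windows-1252.
euro_cp1252 = "€".encode("cp1252")

# Reading the file with the correct encoding recovers the character.
text = euro_cp1252.decode("cp1252")

# ISO 8859-1 has no euro symbol, so a strict conversion fails.
try:
    text.encode("iso-8859-1")
    representable = True
except UnicodeEncodeError:
    representable = False

# A lenient conversion substitutes a placeholder and loses the character.
lossy = text.encode("iso-8859-1", errors="replace")

# Misreading the original byte as ISO 8859-1 yields a control character instead.
misread = euro_cp1252.decode("iso-8859-1")

print(euro_cp1252, representable, lossy, repr(misread))
```

Keeping the output encoding the same as the input encoding, or using an encoding such as UTF-8 that can represent every character involved, avoids both the failure and the silent substitution.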