What is character encoding?

6 Mar 2018


Words and sentences in text are built from characters. For the English language these characters include 26 upper-case letters, 26 lower-case letters, punctuation marks and other characters such as those you would see on an English keyboard.

In computing, characters are encoded in various encoding formats, generally depending on the language of the document. For example, Hebrew, Japanese, and English text has traditionally been stored using different character encodings, and the correct one must be applied for the text to be legible to readers. If you use anything other than the most basic characters needed for English, people may not be able to read your text unless you say what character encoding you used. A character encoding defines a unique binary code for each different character used in an XML document.
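To see why stating the encoding matters, here is a minimal Python sketch. The word "café" and the choice of Latin-1 as the wrong guess are only illustrative assumptions; the same thing happens with any mismatched pair of encodings.

```python
# The same bytes read with two different encodings.
text = "café"

# Store the text as bytes using UTF-8.
raw = text.encode("utf-8")

# A reader who wrongly assumes Latin-1 gets garbled text ("cafÃ©"),
# because the two bytes of "é" are read as two separate characters.
wrong = raw.decode("latin-1")

# A reader who is told the real encoding recovers the original text.
right = raw.decode("utf-8")
```

The bytes are identical in both cases; only the reader's assumption about the encoding changes.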

It is important to distinguish clearly between the concepts of a character set and a character encoding.

A character set, otherwise known as a repertoire, is the set of characters one might use for a particular purpose: for example, all the characters you would see every day in a document, a book or an email, or those a Russian child will learn at school. In this sense a character set has nothing to do with computers.

Characters are grouped into a character set and are stored in computers using a code. When each character in the set is assigned a particular number, called a code point, the result is called a coded character set. These code points are represented in the computer by one or more bytes. The character encoding is the key that maps code points to bytes in computer memory, and maps the bytes back into code points.
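The chain from character to code point to bytes can be sketched in Python. The helper name describe is hypothetical, and UTF-8 is assumed as the encoding purely for illustration.

```python
def describe(char, encoding="utf-8"):
    """Return (code point, encoded bytes) for a single character."""
    code_point = ord(char)           # character -> code point (a number)
    encoded = char.encode(encoding)  # code point -> bytes, via the encoding
    return code_point, encoded

for ch in ["A", "é", "€"]:
    cp, data = describe(ch)
    print(f"U+{cp:04X} -> {data.hex()}")
# U+0041 -> 41
# U+00E9 -> c3a9
# U+20AC -> e282ac

# Decoding reverses the mapping: bytes -> code point -> character.
round_trip = b"\xe2\x82\xac".decode("utf-8")  # the euro sign again
```

Note that the code point of a character stays the same whichever encoding is chosen; only the bytes differ.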

One such character encoding is the American Standard Code for Information Interchange, abbreviated to ASCII and pronounced "ass-kee". ASCII codes represent text in computers, communications equipment, and other devices that use text. Most modern character-encoding schemes are based on ASCII, though they support many additional characters. ASCII was the most common character encoding on the World Wide Web until December 2007, when it was surpassed by UTF-8, a Unicode encoding that includes ASCII as a subset.
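The subset relationship between ASCII and UTF-8 can be checked directly. This is a small sketch in Python; the sample strings are arbitrary.

```python
sample = "Hello, XML!"

# Text made only of ASCII characters produces identical bytes
# whether it is encoded as ASCII or as UTF-8.
same_bytes = sample.encode("ascii") == sample.encode("utf-8")

# Text with a character outside ASCII cannot be encoded as ASCII at all,
# while UTF-8 handles it by using more than one byte for that character.
try:
    "naïve".encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

utf8_length = len("naïve".encode("utf-8"))  # 6 bytes for 5 characters
```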

XML’s character set is Unicode. Unicode is the standard for digital representation of the characters used in writing all of the world’s languages. Unicode provides a uniform means for storing, searching, and interchanging text in any language. It is used by virtually all modern computers and is the foundation for processing text on the Internet. XML allows the use of any of the Unicode-defined encodings, and any other encodings whose characters also appear in Unicode.
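In practice an XML document names its encoding in the XML declaration on its first line. As a rough illustration, the sketch below uses Python's standard xml.etree.ElementTree parser; the element name word and its content are made up for the example.

```python
import xml.etree.ElementTree as ET

# The same document stored in two different encodings.
# Each copy declares its own encoding in the XML declaration.
latin1_doc = (
    '<?xml version="1.0" encoding="ISO-8859-1"?><word>café</word>'
).encode("iso-8859-1")
utf8_doc = (
    '<?xml version="1.0" encoding="UTF-8"?><word>café</word>'
).encode("utf-8")

# The bytes on disk differ, but the parser reads each declaration
# and recovers the same Unicode text from both documents.
latin1_word = ET.fromstring(latin1_doc).text
utf8_word = ET.fromstring(utf8_doc).text
```

Both variables end up holding the same four-character string, even though the two byte sequences are not equal.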

The encoding forms that can be used with Unicode are called UTF-8, UTF-16, and UTF-32; of these, UTF-8 is the most commonly used. UTF stands for Unicode Transformation Format (also expanded as UCS Transformation Format, where UCS is the Universal Coded Character Set).
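The three encoding forms differ in how many bytes they use per character. A short Python sketch, assuming a hypothetical helper named encoded_sizes and using the little-endian variants so that no byte order mark is counted:

```python
def encoded_sizes(char):
    """Return the number of bytes one character needs in each encoding form."""
    encodings = ("utf-8", "utf-16-le", "utf-32-le")
    return {enc: len(char.encode(enc)) for enc in encodings}

for ch in ["A", "€", "😀"]:
    print(f"U+{ord(ch):04X}", encoded_sizes(ch))
# U+0041 {'utf-8': 1, 'utf-16-le': 2, 'utf-32-le': 4}
# U+20AC {'utf-8': 3, 'utf-16-le': 2, 'utf-32-le': 4}
# U+1F600 {'utf-8': 4, 'utf-16-le': 4, 'utf-32-le': 4}
```

UTF-8 and UTF-16 are variable-width, while UTF-32 always uses four bytes. Plain English text is therefore smallest in UTF-8, which is one reason it is so widely used.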
