Sat Mar 23 2019
What is Unicode, and how does it represent all the world's languages as numbers?
Unicode is a system for the interchange, processing, and display of the written texts of the diverse languages of the modern world. But why is it so important in computing?
Fundamentally, computers just deal with numbers. For a computer to store text that humans can understand, there needs to be a code that maps characters to numbers.
That mapping is called a character encoding, and it has to be shared so that every device can display the same information. A custom character encoding scheme might work brilliantly on your computer, but problems will occur as soon as you send that same text to someone else.
Unicode, formally the Unicode Worldwide Character Standard, takes a different approach to assigning binary codes to text and script characters: it provides a unique number for every character, no matter the platform, device, application or language. Unicode has become the dominant scheme for internal processing and storage of text; although a great deal of text is still stored in legacy encodings, new information processing systems are built almost exclusively on Unicode.
For example, you could say that the letter A becomes the number 13, a=14, 1=33, #=123, and so on.
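Those particular numbers are made up, of course; in practice the assignments are standardized. As a minimal sketch (assuming a Python 3 interpreter, used here purely for illustration), the built-in `ord()` and `chr()` functions expose the numbers that ASCII and Unicode actually settled on:

```python
# ord() gives the number (code point) assigned to a character;
# chr() turns a number back into its character.
print(ord('A'))      # 65  -- the number actually assigned to 'A'
print(ord('a'))      # 97
print(ord('#'))      # 35
print(chr(0x1F600))  # 😀  -- code point U+1F600, GRINNING FACE
```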
Before Unicode was invented, there were hundreds of different systems, called character encodings, for assigning these numbers. Each of these early encodings was limited: none could contain enough characters to cover all the world's languages.
American Standard Code for Information Interchange (ASCII) became the first widespread encoding scheme. However, it defines only 128 characters (extended 8-bit variants push that to 256). This is good for the most common English letters, digits, and punctuation, but it is far too limiting for the rest of the world.
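To see that limitation concretely, here is a small Python sketch: encoding plain English text as ASCII works, but a single accented character is already out of range.

```python
# ASCII covers basic Latin letters, digits and punctuation...
print("Hello!".encode("ascii"))   # b'Hello!'

# ...but anything outside that range simply cannot be encoded.
try:
    "café".encode("ascii")
except UnicodeEncodeError as err:
    print(err)  # 'ascii' codec can't encode character '\xe9' ...
```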
Naturally, the rest of the world wanted an encoding scheme for their characters too, so they began creating their own, and things started to get confusing. Not only were the coding schemes of different lengths, but programs also had to figure out which encoding they were supposed to use.
It became apparent that a new character encoding scheme was needed, which is when the Unicode standard was created. Unicode has been adopted by all modern software providers and now allows data to be transported through many different platforms, devices and applications without corruption.
Currently, the Unicode Standard defines values for over 137,000 characters from roughly 150 modern and historic scripts. There are several character encoding forms, each written as UTF (Unicode Transformation Format) followed by the size, in bits, of its code unit (compared in the code sketch after this list):
- UTF-8: Uses a single byte (8 bits) to encode ASCII/English characters and a sequence of two to four bytes for all other characters. UTF-8 is widely used in email systems and on the internet.
- UTF-16: Uses 2 bytes (16 bits) to encode the most commonly used characters. The remaining characters are represented by a pair of 16-bit code units (a surrogate pair).
- UTF-32: Uses 4 bytes (32 bits) for every character. As the Unicode standard grew, it became apparent that a single 16-bit number is too small to represent every character; UTF-32 can represent every Unicode character as one number.
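A quick Python sketch makes the difference between these forms concrete. It encodes a few characters in each form (using the little-endian variants so a byte-order mark does not inflate the counts) and prints how many bytes each one takes:

```python
# Compare how many bytes the same character needs in each encoding form.
for ch in ("A", "é", "你", "😀"):
    utf8 = ch.encode("utf-8")
    utf16 = ch.encode("utf-16-le")  # little-endian, no byte-order mark
    utf32 = ch.encode("utf-32-le")
    print(f"{ch!r}: UTF-8={len(utf8)}, UTF-16={len(utf16)}, UTF-32={len(utf32)} bytes")

# 'A' : UTF-8=1, UTF-16=2, UTF-32=4
# 'é' : UTF-8=2, UTF-16=2, UTF-32=4
# '你': UTF-8=3, UTF-16=2, UTF-32=4
# '😀': UTF-8=4, UTF-16=4 (a surrogate pair), UTF-32=4
```

Note that the emoji, whose code point does not fit in 16 bits, still takes four bytes in UTF-16 because it is stored as a pair of 16-bit code units, exactly the surrogate-pair mechanism described above.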
Support for Unicode forms the foundation for the representation of languages and symbols in all major operating systems, search engines, browsers, laptops, and smartphones, plus the Internet and World Wide Web.