Let's decipher what is hidden in the strings

Many programmers use encode and decode with strings in hopes of removing the dreaded UnicodeDecodeError. Hopefully, this blog will help you overcome the dread of dealing with strings.

Below I am going to use a Q and A format to really get to the answers to the questions you might have, and which I also had before I started learning about strings.

In Python (2 or 3), strings can be represented either in bytes or in Unicode code points. A byte is a unit of information made up of 8 bits, and bytes are used to store every file on a hard disk. So all of the CSV and JSON files on your computer are built of bytes. We can all agree that we need bytes, but then what about Unicode code points? We will get to them in the next question.

What is Unicode, and what are Unicode code points?

While reading bytes from a file, a reader needs to know what those bytes mean. So if you write a JSON file and send it over to your friend, your friend would need to know how to deal with the bytes in your JSON file.

For the first 20 years or so of computing, upper and lower case English characters, some punctuation marks and digits were enough. These were all encoded into a 128-symbol list called ASCII, numbered 0 to 127. 7 bits of information are enough to encode every ASCII character, so each one fits comfortably in 1 byte. You could tell your friend to decode your JSON file with the ASCII encoding, and voila, she would be able to read what you sent her.

This was cool for the first few decades or so, but slowly we realized that there are far more characters than just the English ones. We tried extending the 128 characters to 256 (via Latin-1, also known as ISO-8859-1) to fully utilize the 8-bit space, but that was not enough. We needed an international standard that we all agreed on to deal with the many thousands of non-English characters.

Unicode is an international standard that maintains a mapping between individual characters and unique numbers. As of May 2019, the most recent version of Unicode is 12.1, which contains over 137k characters covering scripts such as English, Hindi, Chinese and Japanese, as well as emojis. Each of these 137k characters is represented by a Unicode code point. So Unicode code points refer to the actual characters that are displayed. These code points are encoded to bytes, and decoded from bytes back to code points.
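To make the encode and decode round trip concrete, here is a minimal sketch. It assumes Python 3, where `str` holds Unicode code points and `bytes` holds raw bytes. The sample string and the helper `try_ascii` are made up for illustration and are not part of any library.

```python
# Assumes Python 3: str is a sequence of Unicode code points, bytes is raw bytes.
text = "h\u00e9llo"  # "héllo", written with an escape so the source file stays ASCII

# Each character maps to one Unicode code point (an integer).
code_points = [ord(ch) for ch in text]
print(code_points)  # [104, 233, 108, 108, 111]

# Encoding turns code points into bytes. In UTF-8, "é" takes two bytes.
utf8_bytes = text.encode("utf-8")
print(utf8_bytes)  # b'h\xc3\xa9llo'

# Latin-1 stores "é" in a single byte, using the upper half of the 8-bit space.
latin1_bytes = text.encode("latin-1")
print(latin1_bytes)  # b'h\xe9llo'

# Decoding with the same encoding recovers the original code points.
round_trip = utf8_bytes.decode("utf-8")


def try_ascii(data):
    """Hypothetical helper: decode bytes as ASCII, or return None on failure."""
    try:
        return data.decode("ascii")
    except UnicodeDecodeError:
        # Bytes above 127 are not valid ASCII, hence the dreaded error.
        return None


print(try_ascii(b"hello"))     # hello
print(try_ascii(utf8_bytes))   # None
```

The last call shows where `UnicodeDecodeError` typically comes from: the reader guessed an encoding (here ASCII) that does not match the one the writer used.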