Warehouse of knowledge: UCS2 0x82 Encoding

The encoding for the '0x82' format is as follows:

1. The first octet is '0x82'

2. The second octet is the number of UCS2 characters

3. The third and fourth octets contain the full 16-bit Base Pointer for the UCS2 characters (high byte first)

4. The following octets are the coded characters with the following rule:

- If the MSB (most significant bit) is zero,
the remaining 7 bits contain a GSM Default Alphabet character.

- If the MSB is one, the remaining 7 bits are an offset that is added to the Base Pointer;
the result is the code point of the UCS2 character.

Example:
We have 3 UCS2 characters: Sকদ
The code points of the characters are: '0x0053' for "S", '0x0995' for "ক", and '0x09A6' for "দ".
The coding of the Alpha field in this format is: '82 03 09 95 53 80 91'.

How can we get that value?
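The first and second octets are quite clear: '0x82' identifies the format and '0x03' is the number of characters.
For the third and fourth octets, the Base Pointer, we can take the lowest code point among the characters that are not in the GSM Default Alphabet. Every such character must then lie no more than '0x7F' above the Base Pointer, so that its offset fits in 7 bits. Of course, it is better if we have a UCS2 look-up table which indicates the Base Pointer for each specific character set.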

In this example, I set the Base Pointer to '0x0995'.
The fifth octet is the character "S" ('0x0053').

Since it is in the GSM Default Alphabet, where its code is '0x53', the MSB is zero and the octet value is '0x53'.

The sixth octet is the character "ক" ('0x0995').
To encode this character, we calculate its offset from the Base Pointer.
Offset = '0x0995' - '0x0995' = '0x00' = (000 0000)b (7 bits only)
The coded character has its MSB set to 1. Hence the value is (1000 0000)b = '0x80'.

The seventh octet is the character "দ" ('0x09A6').
Offset = '0x09A6' - '0x0995' = '0x11' = (001 0001)b (7 bits only)
The coded character has its MSB set to 1. Hence the value is (1001 0001)b = '0x91'.
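The walk-through above can be reproduced with a short encoder sketch. It rests on assumptions of my own: the names `is_ascii_gsm` and `encode_82` are hypothetical, the Base Pointer is simply the lowest code point that is not coded as default alphabet, and only the part of the GSM Default Alphabet whose code equals the ASCII code (space, digits, basic Latin letters and some punctuation) is recognised. A fuller implementation would need the complete default alphabet table.

```python
def is_ascii_gsm(cp: int) -> bool:
    """True for ASCII characters whose GSM Default Alphabet code equals
    their code point. Simplifying assumption: this is only a subset of
    the default alphabet ('$' and '@', for example, have other codes)."""
    return (0x20 <= cp <= 0x23 or 0x25 <= cp <= 0x3F
            or 0x41 <= cp <= 0x5A or 0x61 <= cp <= 0x7A)


def encode_82(text: str) -> bytes:
    """Encode text in the '0x82' format (hypothetical helper)."""
    cps = [ord(ch) for ch in text]
    if len(cps) > 255 or any(cp > 0xFFFF for cp in cps):
        raise ValueError("too long, or not representable in UCS2")
    # Characters that must be coded as Base Pointer + offset.
    ucs2 = [cp for cp in cps if not is_ascii_gsm(cp)]
    base = min(ucs2) if ucs2 else 0
    if ucs2 and max(ucs2) - base > 0x7F:
        raise ValueError("offsets do not fit in 7 bits for one Base Pointer")
    # Octets 1 to 4: format tag, character count, Base Pointer (high byte first).
    out = bytearray([0x82, len(cps), base >> 8, base & 0xFF])
    for cp in cps:
        if is_ascii_gsm(cp):
            out.append(cp)                   # MSB zero: default alphabet code
        else:
            out.append(0x80 | (cp - base))   # MSB one: 7-bit offset
    return bytes(out)


# The example from this post: "S", U+0995, U+09A6.
print(encode_82("S\u0995\u09a6").hex(" "))  # 82 03 09 95 53 80 91
```

When the offsets do not fit in 7 bits the sketch raises an error instead of guessing; a string like that cannot be coded with a single Base Pointer in this format.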

Sunday, December 22, 2013
