2006/04/17

 

GSM 7-bit encoding and decoding

The GSM 7-bit encoding is a method for compressing data by using the leftover bit when representing the data in 8-bit octets. A simple explanation on how this bit-fiddeling works can be found here:
http://www.dreamfabric.com/sms/hello.html

Information on SMS and the PDU format can be found here:
http://www.dreamfabric.com/sms/


In this blog I will examine this seemlingly simple algorithm a bit closer and point out a few problem areas that can easily lead to early alopecia for implementors.


Examining the "hellohello" example:

The most natural way of solving the problem would be to reverse the the array of values, then chip off 7 bits at the time from the end of the array and working our way to the beginning. Now if we use binary strings on the form "01101001" it is fairly easy, we just use substring and iterate our way to the start of the string. This is ok for a simple algorithm that just needs to solve the problem or for learning. However, this is not very performant, so we'll use primitives instead.

Assume we have two helper arrays like this:

final int[] upperBits = {
  0xFE, // 0 = B7|B6|B5|B4|B3|B2|B1
  0xFC, // 1 = B7|B6|B5|B4|B3|B2
  0xF8, // 2 = B7|B6|B5|B4|B3
  0xF0, // 3 = B7|B6|B5|B4
  0xE0, // 4 = B7|B6|B5
  0xC0, // 5 = B7|B6
  0x80  // 6 = B7
};

final int[] lowerBits = {

  0x01, // 0 =                   B0
  0x03, // 1 =                B1|B0
  0x07, // 2 =             B2|B1|B0
  0x0F, // 3 =          B3|B2|B1|B0
  0x1F, // 4 =       B4|B3|B2|B1|B0
  0x3F, // 5 =    B5|B4|B3|B2|B1|B0
  0x7F  // 6 = B6|B5|B4|B3|B2|B1|B0
};

The upper bits and the lower bits will be referred to as U and L respectively, where L3 means index 3 in lower bits yielding 0x0F.

The encoding of "hellohello" in the GSM 7bit default alphabet is:
E8329BFD4697D9EC37
and in binary:
00110111 11101100 11011001
10010111 01000110 11111101
10011011 00110010 11101000


Now we reverse the order of the values which gives us 37ECD99746FD9B32E8 and we make a note of which bits we need  (from the helper arrays) in order to extract the last 7 bits.
00110111 11101100 11011001
10010111 01000110 11111101
10011011 00110010 11101000

L6


If we continue in this manner we'll get something like this:

00110111111011001101100110010111010001101111110110011011
0011001011101000

L6
U6+L5
U5+L4
U4+L3
U3+L2
U2+L1
U1+L0
U0
L6
U6+L5

So it is quite simple and we could go about and implement an algorithm to decode GSM 7bit. However, it wouldn't be truly optimal as we do two operations, first we reverse the list and then we decode it. In general it is not really a performance issue, as the SMS limit is 140 bytes. But if one takes a closer look at the bit-fiddeling above, then it can easily be decoded without reversing the array. Repeating the process by starting from the beginning of the array yields this:

11101000001100101001101111111101010001101001011111011001
1110110000110111

L6
U6+L5
U5+L4
U4+L3
U3+L2
U2+L1
U1+L0
U0
L6
U6+L5


The bit values that we need for our algorithm is evidently the same, which also is logical, but perhaps not easy to spot at first glance. With this information we can go on and implement our algorithm:

  /**
   * Decodes GSM 7-bit septets in 8-bit octets.
   *
   * @param ud The 8-bit octets representing 7-bit data
   * @param udl The message length, if known
   * @return The converted message
   */
  public static int[] decode7bit(int[] ud, int udl) {
    final int[] output = new int[udl];    

    int b = 6, p = 0;
    for (int i = 0; i < udl; i++) {
      switch (b) {
      case 7: // U0
        output[i] = (ud[p] & upperBits[0]) >> 1;
        break;        

      case 6: // L6
        output[i] = ud[p] & lowerBits[b];
        break;

      default: // The rest
        output[i] = ((ud[p] & lowerBits[b]) << (6 - b))
                  + ((ud[p-1] & upperBits[b+1]) >> (b+2));
        break;
      }

      if (--b == -1) b = 7;
      else p++;
    }

    return output;
  }

Ok, so far so good, but there is still a bit further to go.

Question: Do we really need to know the length of the message? Wouldn't it be much more convenient to calculate it by doing ud.length * 8 / 7?

Answer: We could calculate it, but it would be wrong 1 in 7 times, enough to make us think the algorithm can decode anything until that odd one comes in and spoils the evening. Why? The benefit of this algorithm is that you can store an extra character per 7 characters you use, i.e. you can store eight 7bit characters in seven bytes. Now, if your message is "message", how does your algorithm know that you are not using the last 7 bits with you have gained by using this encoding? Well, you could check if the final character is of value 0, and if so omit it, though then you'd probably have to recreate the array as well. In other words, it gets quite messy when you don't know the actual length of the message. This algorithm could perfectly be used to decode alphanummeric destination and originating addresses of the SMS, but is not quite good enough to fully decode all SMS messages.

What this algorithm is currently missing is fill bits support. The need for fill bits is actually an odd one. It has to do with that the length of the message represents the total number of 7bit segments. When user data header is present, which is a number of 8bit octets, a certain number of fill bits are needed to make the message itself start at a septet boundary. The fill bits will shift all bits in the message between 1 and 6 spaces, more than 6 fill bits does not make much sense and we'll not support that in this algorithm. The fill bits complicates the algorithm further as when using int or byte arrays each character can span across 2-3 values, and it may not be possible to create a simple algorithm using the helper arrays as in the previous example. Perhaps it is possible to generate helper arrays based on the number of fill bits, but it may just as well be more performant to just shift the bits first and then use our first algorithm. Here's my current solution of a GSM 7bit default alphabet decoding:

  /**
   * Decodes GSM 7-bit septets in 8-bit octets.
   *
   * @param ud The 8-bit octets representing 7-bit data
   * @param fillBits The number of fill bits, if any
   * @param udl The message length, if known
   * @return The converted message
   */
  public static int[] decode7bit(int[] ud, int udl,
                                 int fillBits) {
    final int[] upperBits = {
        0xFE, // 0 = B7|B6|B5|B4|B3|B2|B1
        0xFC, // 1 = B7|B6|B5|B4|B3|B2
        0xF8, // 2 = B7|B6|B5|B4|B3
        0xF0, // 3 = B7|B6|B5|B4
        0xE0, // 4 = B7|B6|B5
        0xC0, // 5 = B7|B6
        0x80  // 6 = B7
    };    

    final int[] lowerBits = {
        0x01, // 0 =                   B0
        0x03, // 1 =                B1|B0
        0x07, // 2 =             B2|B1|B0
        0x0F, // 3 =          B3|B2|B1|B0
        0x1F, // 4 =       B4|B3|B2|B1|B0
        0x3F, // 5 =    B5|B4|B3|B2|B1|B0
        0x7F  // 6 = B6|B5|B4|B3|B2|B1|B0
    };    

    final int length = ud.length;
    if (fillBits != 0) {
      final int len = length - 1;
      final int cut = lowerBits[fillBits - 1];
      final int move = 8 - fillBits;
      for (int f = 0; f < len; f++) {
        ud[f] >>= fillBits;
        ud[f] |= (ud[f + 1] & cut) << move;
      }
      ud[len] >>= fillBits;
    }    

    if (udl < 0) udl = length * 8 / 7;
    final int[] output = new int[udl];    

    int b = 6, p = 0;
    for (int i = 0; i < udl; i++) {
      switch (b) {
      case 7: // U0
        output[i] = (ud[p] & upperBits[0]) >> 1;  
        break;        

      case 6: // L6
        output[i] = ud[p] & lowerBits[b];
        break;
        
      default: // The rest
        output[i] = ((ud[p] & lowerBits[b]) << (6 - b))
                  + ((ud[p-1] & upperBits[b+1]) >> (b+2));
        break;
      }

      if (--b == -1) b = 7;
      else p++;
    }

    return output;

  }

What about encoding? Well, having done the decoding it should be a breeze to do the encoding, right? The pattern from the hellohello example should be quite obvious, shifting the bits from the current value rightwards while merging this with the bits from the next value shifted leftwards. The only thing to note here is the gap that appears between the two arrays (ud and output), which increases every 8 characters due to the compression. So at position 160 in the array ud, we will be in the position 140 in the output array, thus the limit of 140 bytes gives 160 7bit letters in the SMS message. Here's the algorithm:

  /**
   * Converts 7bit septets into octets. Make sure the user-data
   * has been converted into the GSM 7-bit default alphabet
   * before using this method.
   *
   * <BR><A HREF="http://www.dreamfabric.com/sms/hello.html">
   *              http://www.dreamfabric.com/sms/hello.html</A&gt
   *
   * @param ud The data to be transformed into 7bit septets
   * @param fillBits Fill bits in range 0 - 6
   * @return The encoded message
   */
  public static int[] encode7bit(int[] ud, int fillBits) {
    final int[] upperBits = {
        0xFE, // 0 = B7|B6|B5|B4|B3|B2|B1
        0xFC, // 1 = B7|B6|B5|B4|B3|B2
        0xF8, // 2 = B7|B6|B5|B4|B3
        0xF0, // 3 = B7|B6|B5|B4
        0xE0, // 4 = B7|B6|B5
        0xC0, // 5 = B7|B6
        0x80  // 6 = B7
    };
    
    final int[] lowerBits = {
        0x01, // 0 =                   B0
        0x03, // 1 =                B1|B0
        0x07, // 2 =             B2|B1|B0
        0x0F, // 3 =          B3|B2|B1|B0
        0x1F, // 4 =       B4|B3|B2|B1|B0
        0x3F, // 5 =    B5|B4|B3|B2|B1|B0
        0x7F  // 6 = B6|B5|B4|B3|B2|B1|B0
    };

    final int udl = ud.length;
    final int ol = (udl * 7) / 8;
    final boolean rest = ol != (double)udl * 7 / 8;
    final boolean fill = fillBits != 0;
    int[] output = new int[ol + (fill ? 1 : 0)

                              + (rest ? 1 : 0)];

    int b = 0, p = 0;
    for (int i = 0; i < ol; i++) {
      output[i] = (ud[p] >> b)

                + ((ud[p + 1] & lowerBits[b]) << 7 - b);
      if (++b == 7) {
        b = 0;
        p++;
      }
      p++;
    }
    if (rest) output[ol] = ud[p] >> b;   

    if (fill) {
      final int cut = upperBits[7 - fillBits];
      final int move = 8 - fillBits;
      p = 0;
      for (int i = 0; i <= ol + 1; i++) {
        b = output[i] & cut;
        output[i] ^= b;
        output[i] <<= fillBits;
        if (p != 0) output[i] = p;
        p = b >> move;
      }
    }

    return output;
  }

A few more things to note here, the output array's length is defined by the number of bytes needed to store all characters plus an eventual rest and eventual fill bits. The rest is what's left over at the end of the array after our algorithm has finished, there will be a rest in 6 out of 7 times. Since the fill bits essentially shifts the whole array up to 6 bits, then the array will need the extra space. Keeping the helper arrays inside the method or static outside is up to the user, having them static is probably more efficient for small encoding/decoding sets such as SMS messages. When it comes to fill bits for encoding I still think it is easiest to do it in two steps as combining the encoding and shifting makes things very complicated. If someone solves this problem and would want to share it with me, I'd be very interested in comparing the complexity of the algorithms and performance.

Comments:
I have a simpler code found at this blog post
 
This comment has been removed by the author.
 
I think Following section of code for decoding GSM7bit message with fill bits has a typing mistake.

Original Code

if (fillBits != 0) {
final int len = length - 1;
final int cut = lowerBits[fillBits - 1];
final int move = 8 - fillBits;
for (int f = 0; f < len; f++) {
ud[f] >>= fillBits;
ud[f] = (ud[f + 1] & cut) << move;
}
ud[len] >>= fillBits;
}

Modified Code

if (fillBits != 0) {
final int len = length - 1;
final int cut = lowerBits[fillBits - 1];
final int move = 8 - fillBits;
for (int f = 0; f < len; f++) {
ud[f] >>= fillBits;
ud[f] | = (ud[f + 1] & cut) << move;
}
ud[len] >>= fillBits;
}

Following line in above code is modified added | to ensure ud[f] dosent loose its original value and is properly or'ed with the bit cut from the next byte in sequence.

ud[f] | = (ud[f + 1] & cut) << move;
 
Downloadable jar files at: Google code
 
I can confirm that Saurabh's findings are correct. I'll update the post.

You may also refer to http://code.google.com/p/attention/source/browse/trunk/Attention/SMS/src/com/googlecode/attention/sms/pdu/Codec.java
 
Input "abcdefg"
Output "abcdefg@"

Hex value for the last bit is wrong '01" it should be '1B', so its represent "g@" .
 
1B represents an escape character and isn't part of the text that was encoded. When dealing with fill bits there is a chance that there are leftover bits, so you need to know how long the original string was when decoding it. Check the udl argument to decode7bit.
 
Post a Comment



<< Home

Copyright © 2006 Stein Gunnar Bakkeby