                      -= D A T A   C O M P R E S S I O N =-
                              Copyright (c) 1998
                                      by
                        Matjaz Trtnik aka maLi/MaLixa

Table of contents
=================
  1. What the hell is this?
  2. Disclaimer
  3. RLE (Run-Length Encoding)
  4. Huffman compression
  5. Final words and greets


1. What the hell is this?
=========================

  This document covers two of the easiest data compression algorithms: RLE
  and Huffman compression. I will try to explain both very quickly so you
  can get the idea behind them. For more information please refer to the
  source code, or go to the FAQSYS page at
  http://www.neutralzone.org/home/faqsys/ and download other docs and source
  code connected with compression. I wrote these two algorithms only for
  my own purposes. RLE is used when decoding PCX or BMP files and it is the
  easiest one I know. Huffman is a bit more complicated but not that
  hard. I had to do it for university and I thought it might be useful for
  someone else too.

2. Disclaimer
=============

    The author takes no responsibility, if something in this document or
    the accompanying classes causes any kind of data loss or damage to
    your hardware.

    You can use this product strictly for *NON* commercial programs.
    If you want to use it for commercial programs please contact author.

    You are not permitted to distribute, sell or use any part of
    this source for your software without special permission of the author.


3. RLE (Run-Length Encoding)
============================

  Ok, let's take a stream of characters:

        AAABBCCCCBBBBEEEEEAAAA

  Each one is represented with 8 bits so the total size of this stream is
  22*8 = 176 bits. Can we make it smaller? Yes, by using the RLE
  compression algorithm. If a character appears more than 2 times in a row,
  we write how many times it appears and then the character itself. Let's
  encode our stream to see what I mean:

        3ABB4C4B5E4A

  Now we have 12*8 = 96 bits, which is better than 176 bits. In our case
  we saved about 45%. But here comes a question. What if there is a stream
  like this:

        543345

  This can be interpreted as:
        444443335555 or 54335555 or ...
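
  The naive scheme above can be sketched in a few lines of Python. This is
  only an illustration, not the code from the accompanying source: the name
  naive_rle is made up here, and counts above 9 would produce multi-digit
  numbers that make the ambiguity even worse.

```python
from itertools import groupby

def naive_rle(text):
    """Write runs longer than 2 as <count><char>, copy shorter runs as-is."""
    out = []
    for char, group in groupby(text):
        run = len(list(group))
        if run > 2:
            out.append(f"{run}{char}")
        else:
            out.append(char * run)
    return "".join(out)

encoded = naive_rle("AAABBCCCCBBBBEEEEEAAAA")
print(encoded, len(encoded) * 8, "bits")
```

  Note that naive_rle("543345") returns its input unchanged, so a decoder
  has no way to tell whether those digits are counters or plain characters.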

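  So we have to "mark" somehow which bytes are counters. We can do this
  quite easily. Let's say that if the upper two bits of a byte are set to 1
  (11xxxxxx) then the byte is a counter, and its lower six bits hold the run
  length. We read the next char and write it to the output stream that many
  times. A plain character whose upper two bits happen to be set has to be
  stored as a run of length 1, otherwise the decoder would take it for a
  counter. Example code: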

  while not end_of_stream {

    counter = read_from_input()

    /* Upper two bits set: we got a compressed packet */
    if ((counter & 0xC0) == 0xC0) {
      character = read_from_input()

      /* Length is at most 63 since the upper two bits are used to mark
         compression; the lower six bits hold the run length */
      length_of_substream = counter & 0x3F
    }
    else {
      /* No compression: the byte is the character itself */
      length_of_substream = 1
      character = counter
    }

    while (length_of_substream-- > 0)
      write_to_output(character)

  }
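
  For those who want to try it, here is a minimal Python sketch of both
  directions of the marker-byte scheme described above. It is not the
  accompanying source code; the names rle_encode, rle_decode, MARK and
  MAX_RUN are assumptions made for this example, and the input is assumed
  to be well formed (a counter is always followed by a character).

```python
MARK = 0xC0      # upper two bits set: this byte is a counter
MAX_RUN = 0x3F   # lower six bits hold the run length (1..63)

def rle_encode(data):
    out = bytearray()
    i = 0
    while i < len(data):
        byte = data[i]
        run = 1
        # Measure the run, but never beyond what six bits can hold
        while i + run < len(data) and data[i + run] == byte and run < MAX_RUN:
            run += 1
        if run > 1 or (byte & MARK) == MARK:
            # Counter packet; also needed for single bytes that look like counters
            out.append(MARK | run)
        out.append(byte)
        i += run
    return bytes(out)

def rle_decode(data):
    out = bytearray()
    i = 0
    while i < len(data):
        counter = data[i]
        i += 1
        if (counter & MARK) == MARK:
            length = counter & MAX_RUN
            character = data[i]
            i += 1
        else:
            length = 1
            character = counter
        out.extend([character] * length)
    return bytes(out)

packed = rle_encode(b"AAABBCCCCBBBBEEEEEAAAA")
print(len(packed), "bytes, round trip ok:",
      rle_decode(packed) == b"AAABBCCCCBBBBEEEEEAAAA")
```

  In this sketch a run of 2 is also written as a packet, which costs the
  same 2 bytes as writing the character twice.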

  Pretty easy, right? If you still have any questions please follow the
  source code or download other docs. I know this "tutor" is bad but I do
  not feel like writing more.
  Of course RLE compression is only good for streams that have many runs
  of the same character. In the case of a stream like this
       ABABABABABABACCD
  the output would be the same size if the compressor is smart, or even
  bigger if it is not well programmed :)
  So how do we compress a stream like this one? We can use Huffman
  compression.



4. Huffman compression
======================

  Huffman compression is a bit more complicated than RLE. It assumes that
  some characters appear in the data stream more frequently than others.
  I will try to explain it through an example. Let's take the stream from
  the RLE section, which did not work that well there:
       ABABABABABABACCD = 16*8 = 128 bits

  First we have to build a frequency table. It is very simple: we just
  count how often each character appears in the stream. In our case it is:
         A = 7
         B = 6
         C = 2
         D = 1

  Pseudo code:
    set_all_elements_of_frequency_table_to_zero
    while not end_of_stream {
      char = read_from_input()
      freq_table[char]++
    }

  For freq_table we can use int freq_table[256]. A byte can take 256
  different values, so 256 entries cover every character that can appear
  in the stream.
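
  The counting step might look like this in Python. It is just a sketch
  of the pseudo code above; build_freq_table is a name chosen for this
  example and the input is assumed to be a bytes object.

```python
def build_freq_table(data):
    """Count how often each byte value (0..255) occurs in data."""
    freq_table = [0] * 256
    for byte in data:
        freq_table[byte] += 1
    return freq_table

freq_table = build_freq_table(b"ABABABABABABACCD")
for char in b"ABCD":
    print(chr(char), "=", freq_table[char])
```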

  Ok, now we have our frequency table. But we still do not know what to do
  with it. We do not even know how Huffman compression works. Here it goes.
  The algorithm replaces each character with a bit string, and the more
  frequent a character is, the shorter its bit string. This is most easily
  done by building a Huffman tree. The purpose of this tree is to give us
  the correct bit string for each character of the input stream. Let's
  first encode our stream and then I will try to explain a bit more.

  A appeared most frequently in our stream so we will use the shortest bit
  string to represent it. Now let's see what our bit strings look like:
        A = 1
        B = 01
        C = 001
        D = 000
  and the encoded stream would look like:
      1011011011011011011001001000 = 28 bits

  That is 28 bits instead of 128, almost 80% saved (not counting the code
  table, which the decoder also needs). Huffman compression is quite good
  for text files and it works fine for most raw images as well. Ok, I have
  convinced you that it is worth using, but we still do not know how to
  build the Huffman tree (well, I know :)
  Take a look at our frequency table:
         7 6 2 1

  We have to sort it from smallest to biggest. Now we have:
         1 2 6 7

  The next step is to replace the first two values with their sum
  (1 + 2 = 3) and sort our frequency table again:
         3 6 7

  Then we add the first two values (3 + 6 = 9) and sort it again:
         7 9

  and once more (7 + 9 = 16) to get the final
         16
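
  The sort-and-add steps above can be reproduced with a short Python
  sketch. The function name merge_steps is made up for this example; it
  only tracks the numbers, not which characters they belong to.

```python
def merge_steps(freqs):
    """Return the sorted frequency list after each sort-and-add step."""
    values = sorted(freqs)
    steps = [list(values)]
    while len(values) > 1:
        merged = values[0] + values[1]          # add the two smallest values
        values = sorted([merged] + values[2:])  # and sort the table again
        steps.append(list(values))
    return steps

for step in merge_steps([7, 6, 2, 1]):
    print(step)
```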

  Now 16 is the root of our tree. Each sum becomes the parent of the two
  values that were added, so our tree looks like this:

                                16
                                /\
                               9  7
                              /\
                             3  6
                            /\
                           1  2

  Ok, all we have to do now is to build our bit strings. We start at the
  root (16) and go down. For example let's make the bit string for 1, which
  is actually D. We start at the root, and each time we go left we put 0 and
  each time we go right we put 1 (or reversed, it does not matter as long as
  the decoder does the same). So we go left(0), left(0), left(0) and get the
  final code 000 for character D. The other codes are:

        A = 1    (1 bit now, 8 bits before)
        B = 01   (2 bits now, 8 bits before)
        C = 001  (3 bits now, 8 bits before)
        D = 000  (3 bits now, 8 bits before)
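
  Here is one possible way to build the tree and read the codes off it in
  Python. It is a sketch, not the accompanying source: it uses the heapq
  module instead of re-sorting the table each time, and huffman_codes is a
  name invented for this example. Because the choice of left and right is
  arbitrary, the exact bits may differ from the table above, but the code
  lengths come out the same.

```python
import heapq
from collections import Counter

def huffman_codes(text):
    """Return a dict mapping each character of text to its bit string."""
    freq = Counter(text)
    # Heap entries are (frequency, tie_breaker, node); a node is either a
    # single character (leaf) or a (left, right) tuple (inner node).
    heap = [(count, i, char)
            for i, (char, count) in enumerate(sorted(freq.items()))]
    heapq.heapify(heap)
    if not heap:
        return {}
    if len(heap) == 1:
        return {heap[0][2]: "0"}
    next_id = len(heap)
    while len(heap) > 1:
        f1, _, n1 = heapq.heappop(heap)   # the two smallest values
        f2, _, n2 = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, next_id, (n1, n2)))
        next_id += 1
    codes = {}
    def walk(node, prefix):
        if isinstance(node, tuple):
            walk(node[0], prefix + "0")   # going left puts 0
            walk(node[1], prefix + "1")   # going right puts 1
        else:
            codes[node] = prefix
    walk(heap[0][2], "")
    return codes

codes = huffman_codes("ABABABABABABACCD")
for char in sorted(codes):
    print(char, "=", codes[char])
```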

  Now all we have to do is to read the input stream, look up each character
  and replace it with its bit string. Pretty easy from here on. I leave it
  to you.
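
  As a starting point, the replacement step could be sketched like this in
  Python, using the code table from this section. The names CODES, encode
  and decode are assumptions for this example, and the bits are kept as a
  string of '0' and '1' characters for readability; a real compressor
  would pack them into bytes.

```python
CODES = {"A": "1", "B": "01", "C": "001", "D": "000"}

def encode(text, codes=CODES):
    """Replace every character with its bit string."""
    return "".join(codes[char] for char in text)

def decode(bits, codes=CODES):
    """Collect bits until they match a code, then emit that character."""
    lookup = {code: char for char, code in codes.items()}
    out = []
    current = ""
    for bit in bits:
        current += bit
        if current in lookup:
            out.append(lookup[current])
            current = ""
    return "".join(out)

bits = encode("ABABABABABABACCD")
print(bits, "=", len(bits), "bits")
print(decode(bits))
```

  Decoding works without any separators because no code is the beginning
  of another code.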

  I know this is a very lousy explanation and that is why I included full
  source code for Huffman encoding/decoding. If you would like a better
  explanation please go to the FAQSYS homepage and download one there.


5. Final words and greets
=========================

  First I have to say sorry for such a crappy tutor but at least I tried. You
  have to note that time is about 1am and I have to wake up tomorrow at 7am
  so I better go to bed now. Naaaah, I have to finish this doc with greets :)
  Ok, greets fly to (in no special order):

                                    .maLa.
      .Adept.Altair.BLACKAXE.bsm.damaq.Ex.Gedge.Kalms.Kombat.mri.mrz_ai.
      .raster.Unreal.frenzy.Gaffer.Jare.blala.Melan.doj.Teran.Alixa.DiC.
       .sqrt(-1).Ravian.Wog.xLs.aXs.CyberEagle.Paso.submissiv.Ecstasy.
       .borzom.HeadSoft.Vastator.Zed.Phred.Ghoul.Eckart.Plasm.lim_dul.
        .Findus.DynaByte.lovex.MidNight.multiplex.Psyq.Sarwaz.Compile.

              and all others I forgot from #coders and #scene.si

  If you want to contact me for any reasons here is my address:

  Matjaz Trtnik
  11, Klemenova
  1260 Ljubljana - Polje
  Slovenia, Europe
  Tel:. ++386 (0)61 482 289

  Email: mtrtnik@bigfoot.com

  Web: http://www2.arnes.si/~ssdmalic/mali/
       (This page has a lot of source code to demo effects and other
        resources or links)


        Sincerely,
                   Matjaz Trtnik aka maLi/MaLixa
