updated 2003-05-24.
Contents:
related pages:
middle Con
the pianofor
a quarter note) really only loses ``noise''.
"It is my ambition to say in ten sentences what others say in a whole book." -- Friedrich Nietzsche
"Short words are best and the old words when short are best of all." -- Winston Churchill
The most valuable talent is that of never using two words when one will do. -- Thomas Jefferson
If you would be pungent, be brief; for it is with words as with sunbeams. The more they are condensed, the deeper they burn. - Robert Southey
"In the multitude of words there wanteth not sin: but he that refraineth his lips is wise." -- Proverbs 10:19
There are several different orthogonal ways to categorize data compression:
Most people have assumed that the decompressor has access to the complete compressed data file from the beginning. However, DAV has been thinking about ways to synchronize a decoder in the middle of a data stream -- for example, immediately after you turn on your TV, you want to be able to start watching a compressed digital video signal almost immediately. This is similar to ``medium-latency'', except it requires the TV to be able to decode the last part of the TV program even if it completely missed out on the first part. Some fractal compression techniques have this property ... I've been calling this ``history-limited decompression''.
introductions to data compression beginner tutorials
a brief introduction designed for ``gifted students in the 8 to 12 age group''. The concepts of ``coding'' and various forms of data representation are very important to data compression. (In particular, Morse code and the ASCII character set). The understanding that some letters are more frequent than others is good for understanding Huffman compression. the understanding that some letter pairs are far more frequent than the letter frequencies alone would suggest is good for understanding LZ77 and related compression algorithms. Has good info on letter-pair frequences (his charts have first letter on left, second letter on top; one chart based on the King James Bible).
(example: http://members.ozemail.com.au/~xenophon/code.html )
More links to data compression, in general (Other pages that, like this one, have lots of links to compression information):
specifically designed to compress 2D images.
The 2 hottest topics in 2D image compression (circa 1999, when DAV spent a lot of time in this area) are "wavelets" and "fractals" (which really have many similarities).
The goal here is to try to improve on computer_graphics_tools.html#png .
see also computer_graphics_tools.html#fractals for fractals in general (as artistic and mathematical objects).
(FIXME: move to http://en.wikipedia.org/wiki/Fractal_Compression )
[FIXME: write fractal image compression with extra ``erode'' and ``dilate'' parameters]
Introductions to Wavelets
Wavelets applied to Image Compression
Color coordinate systems and conversion between them.
See also clustering and color quantization machine_vision.html#cluster .
I'm most interested in simple lossless stuff.
[Here I've translated it into DAV notation, where for integer values x, N,
floor2(x) = floor( x / 2 ) = arithmetic right shift,and
ceil2(x) = ceil( x / 2 ) = floor2(x+1) = -floor2(-x).such that
floor2(x) + ceil2(x) = x, and ceil2(x) - floor2(x) = (0 or 1), so floor2( 2N + ceil2(x) - floor2(x) ) = N floor2( N + x ) + ceil2( N - x ) = x..
FIXME: My notations avoid lots of little "+1" and "-1" floating around ... but perhaps it would be clearer to write in executable C notation, perhaps with #define ceil2(x) ((x+1) >> 1)
describes a reversible integer RGB (8 8 8 bits) to Y Ur Vr (8 9 9 bits) color transform and shows that for many (but not all) images, there is a net gain in spite of the fact that 2 of the components require 9 bits of accuracy instead of the original 8. This transform also allows a grayscale image to be extracted from the compressed data without decompressing the entire image.)
suggests
Specifically designed to compress 1D streams of symbols, such as English text. 2D images are often rearranged into a linear list of pixels (e.g., by scanning one row at a time, or walking a Hilbert path), then compressed with one of these tools.
Some of these (such as Huffman) don't even take advantage of the 1D correlation between adjacent symbols, and just take advantage of "correlations between random pairs", i.e., statistics of the entire file taken as a unordered set. These algorithms would compress any file with identical statistics (for example, by shuffling the order of the items in any arbitrary manner) just as well. [This could be phrased better ...]
Theory:
Source code available at:
[FIXME: Not all my Huffman links are here; some of the 1D_compression links describe several compression ideas including Huffman compression. Perhaps I should copy them here and directly link to their Huffman page].
This implies that:
There is only one standard Huffman code for a particular file, but many other Huffman codes that give just as good compression for that file. For example, if you are trying to compress a file of 100 characters with these frequencies
a: 10 b: 10 c: 20 d: 20 e: 40there are 3 bit lengths ("Kraft vectors") that give the best compression (220 bits). They are
L = (1, 2, 3, 4, 4) L = (1, 3, 3, 3, 3) L = (2, 2, 2, 3, 3).Given a "Kraft vector", one can find the unique canonical Huffman code for a file. For the above optimum Kraft vectors,
a: 0000 b: 0001 c: 001 d: 01 e: 1 a: 000 b: 001 c: 010 d: 011 e: 1 a: 000 b: 001 c: 11 d: 10 e: 01
This last canonical Huffman code is the unique "standard" Huffman code. The "standard" (DAV: I made up this terminology. Is the a better one ?) Kraft vector is generated by preferentially merging sub-trees with smallest depth. For certain 8-bit binary files, the optimum Kraft vector indicates one code needs a ~ 256 bit code. Since decoders can be made simpler if there is a limit on code length, some algorithms (like the one in JPEG ?) generate a non-optimum Kraft vector that has a maximum code length of 16 bits (this restriction gives slightly worse compression) for 8 bit values.
Newsgroups: comp.compression From: (Antaeus Feldspar) Subject: Re: Literature/HTML page on optimal Huffman codes Date: Mon, 3 Nov 1997 20:48:21 GMT In article <345dc108.17828360@news.worldonline.nl>, Thiadmer Riemersma wrote: >Hello everybody, > >Do you know a book, article or HTML page where the creation of *optimal* >Huffman tables is explained. I have read on various occaisions that the >Huffman tree should always be better or as good as Shannon-Fano coding. One >book (with examples of both) said, "Well, Shannon-Fano is a little more >compact than Huffman in this example, but that's because we have not >optimized the Huffman tree." Hence my question. I think your book has played fast and loose with terminology. If we are working with accurate statistics, a prefix code created through Huffman coding is going to be the most compact possible -- when codes are limited to integral numbers of bits, that is. Huffman codes are prefix codes; but not all prefix codes are Huffman codes. >By the way, if you know a compact way to store a Huffman table? Easiest way is to adopt restrictions for your Huffman trees that make only one of all possible trees for a given set of codelengths the legal tree. Then, storing the codelengths is sufficient to reconstruct the entire tree. An example, used in "deflate" compression, is the following set of restrictions: -- Shorter codes go to the left of longer codes. -- Symbols with equal codelengths go in lexicographic order from left to right. Any Huffman tree can be converted into a Huffman tree that meets these restrictions, without changing the lengths of the codes and changing the efficiency of the code -- and as pointed out before, this means that storing just the lengths of the codes is the same as storing the whole tree. >Thanks in advance, >Thiadmer > >* * * >Note: My E-mail address has been altered to avoid spam. -- ! -jc IS feldspar at netcom.com ! ! "'Asa Nisi Masa!' How strange! But what does it mean?" ! ! *** Fight spam! Sign up at http://www.cauce.org/ ! *** !
[ This shuffles everything up one slot every iteration ... perhaps it would run significantly faster if it never shuffled anything up, using double the amount of memory (still O(n) space and O(n log n) time). ]Huffman's famous greedy algorithm solves this problem in $O(n \log n)$ time if $f$ is unsorted; and can be implemented to run in $O(n)$ time if the input array $f$ is sorted. In this paper the space usage of the greedy method is considered. It is shown that if $f$ is sorted then the array $c$ can be calculated in-place, with $c_i$ overwriting $f_i$, in $O(n)$ time by using $O(1)$ additional space. The new implementation leads directly to an $O(n \log n)$ time and $n+O(1)$ words of extra space implementation for the case when $f$ is not sorted. The proposed method is simple to implement and executes quickly.
Re: Dynamic Huffman algorithms
Author:
Matthew D Moss
Email:
moss at null.net
Date:
1998/10/02
Forums:
comp.compression
ralionmaster@geocities.com.remove.after.com wrote:
>
> I have asked similar question before. Could someone
>advise where I could obtain algorithm(s) on dynamic Huffman
>implementations?
Take a look at the October 1998 issue of Dr Dobb's Journal.
In the "Algorithm Alley" column, Steven Pigeon and Yoshua Bengio tackle this
problem. Their variation of the Huffman algorithm is adaptive, not as
memory intensive for sparse symbol sets, and doesn't require you to store
the encoding table.
http://www.ddj.com/
o: 000 a: 001 a: 010 i: 011 i: 100 e: 101 e: 110 e: 110This can map directly to Arithmetic coding (???). Lu (Lu 1997 p. 312) uses a Arithmetic coder where letters are sorted by frequency similar to this. Canonical Huffman coding, in effect, adds duplicate characters and deletes (!) characters such that the total length of this sorted file is a power-of-2:
o: 000 o: 001 // additional `o' a: 010 a: 011 i: 100 i: 101 e: 110 e: 111 // deleted one `e' !
Then, when compressing the original (unsorted) file, the compressed code word is just enough to locate that letter in this (sorted) file: in this case, "a" can be located in 2 bits: "01", because *all* letters at locations that begin with "01" are "a".
Another example: "ebeaabeb" is sorted "aabbbeee". Canonical Huffman, in effect, creates this sorted file:
a: 000 a: 001 b: 010 b: 011 // deleted one `b' ! e: 100 e: 101 e: 110 e: 111 // added 1 `e' !Here, "e" can be indicated by the code word "1", since all locations in this file that start with "1" are "e". "b" is indicated by the code word "01", since all locations that start with "01" are "b".
For some initial data files (for example, aabbeee), Huffman always adds letters, never deletes letters. For those sorts of data files,
compressed_length(a) = ceil(log2(L/f(a)))However, for many data files, Huffman deletes letters; so
ceil(log2(L/f(a))) < compressed_length(a)and we get better compression than we expected for *other* letters (letters different than the one that got deleted). Sometimes this "improvement" is even better than than Arithmetic coding (it appears to violate entropy); i.e., sometimes
compressed_length(a) < log2(L/f(a)) // sometimes. However, *total* entropy is not violated, since the *worse* compression on letters just like the one that was deleted more than make up for it.
If we have hardware that is limited in the size (in bits) of the code words it can handle, then algorithms that generate (occasionally non-optimal) Kraft vectors have the effect of adding lots of copies of the least common letter(s).
Many programs written to *teach* Huffman codes represent them in memory like this:
(length)(bitstring) pairs.
When you encode 8 bit symbols, then the worst-case bitstring is 255 bits. So one way of holding all the codes is an array of 256 of these (length)(bitstring pairs). The (length) can range from 1 bit to 255 bits. (a length of 0 indicates "this symbol never occurs").
However, if you use canonical Huffman codes, the least common symbols all start with lots of zero bits. A better way of representing all the Huffman codes is to truncate all those zero bits:
(length)(remaining bits)By using canonical Huffman codes, the worst-case "remaining bits" bitstring (when compressing 8 bit symbols) is 8 bits. The "remaining bits" bitstring is all zeros for the least-frequently-occuring symbol. The "remaining bits" bitstring has at least one 1 bit for the other 255 symbols.
The decompressor needs to know what the Huffman codes are for each compressed file. (They are different for each compressed file). So a natural place to store this information is in the header of each compressed file. It would be nice if this information could be represented as compactly as possible.
I know I was astonished to learn that this table could be represented (in the header of the compressed file) by an array of 256 bytes
(length). The actual code words (or even the "remaining bits") do *not* need to be stored in the header (if you use canonical Huffman codes).
The compressed file header just stores the *lengths* of the codeword for each symbol, and the decompressor regenerates the code words from that.
Assuming your maximum code length <= 15 bits, (see #huffman_height for details on how to guarantee some particular maximum code length, at some tiny loss in compression ) the simplest possible header for huffman compression is to pack 4 bits per symbol to encode that code length (0 == never occurs in file, 1 = occurs about half the time, 15 == occurs rarely ... or would it be better for "0000" to indicate a length of 16 == occurs even more rarely ?). If you have 256 symbols, then this header requires 256*4 bits = 1024 bits = 128 bytes.
Obviously, if the header requires 128 bytes, then it's impossible to get any useful compression for files of 129 bytes or less.
[FIXME: other ways to reduce size of Huffman header]
[FIXME: doesn't this just repeat above ?]
[data compression#huffman]
From: d.cary at ieee.org
To: d.cary at ieee.org
Subject: Re: Arbitrary Huffman tree and weights distribution (was: huffman code length)
Date: Thu, 15 Jul 1999 21:34:58 GMT
Newsgroups: comp.compression,alt.comp.compression,sci.math
Organization: Deja.com - Share what you know. Learn what you don't.
X-Article-Creation-Date: Thu Jul 15 21:34:58 1999 GMT
X-Http-Proxy: 1.1 x26.deja.com:80 (Squid/1.1.22) for client 38.193.64.47
X-Http-User-Agent: Mozilla/4.0 (compatible; MSIE 5.0; Windows 98; DigExt)
X-MyDeja-Info: XMYDJUIDd_cary
I've been thinking about a different paradigm to "explain" the Huffman
algorithm. This one has no trees. And it creates a easy-to-understand
concept of the "canonical" Huffman code.
Starting with the file you want to compress, count how many times each
symbol occurs (histogram). Then sort the file,
not in alphabetical order, but in order of least-frequently-occurring-
symbol first.
For example, the file "ebeaabeb" is sorted
000 a
001 a
010 b
011 b
100 b
101 e
110 e
111 e
Then we could encode the original file by just transmitting the offset
of each letter in this sorted file. This leads to 3 bits/symbol
encoding. (I think that arithmetic coding is very similar to this idea).
However, notice that all the symbols in this sorted file that start
with "00" all lead to `a'. This means that we could encode all `a's in
the original file with only those 2 bits, leading to 2 bits for a (00),
2 bits for the b (01) and 2 bits for the e (11). Note that the "10"
code is never used, since it only codes for b and e, which are already
covered.
The canonical Huffman algorithm works its magic on this sorted file,
adding and deleting a few letters, creating
000 a
001 a
010 b
011 b // deleted one `b' !
100 e
101 e
110 e
111 e // added 1 `e' !
The length of a Huffman sorted file is always a power of 2.
Each letter occurs exactly some power-of-2
times, letters that occur some smaller power-of-2 times occur first. In
a "canonical" Huffman sorted file, letters that occur the same power-of-
2 times are sorted in alphabetical order. This means that given that
power-of-2 ("the code length") of each symbol, we can easily regenerate
this canonical Huffman sorted file.
As before, we could encode the original file by encoding each character
by its offset in this sorted file. This leads to 3 bits per symbol
At decoding, we take the code lengths of each symbol, and regenerate
the canonical sorted file. Then we decode by looking up bits in the
compressed bit stream as offsets to this file. Note that if the first
codeword you're starting to decode starts with "1", you already know
that must have been a `e' in the original file. So the remaining bits
in the offset don't need to be stored in the file. This means the
canonical Huffman code for this file is 2 bits for a (00), 2 bits for b
(01), and 1 bit for e (1).
This immediately leads to all the standard properties for canonical
Huffman codes, including "the all-zeros code is the longest code", "the
all-ones code is the shortest code", etc.
In a few initial data files (for example, aabbeeee), letters are
already in a power-of-2 frequency distribution, and the total file
length is already a power-of-2. The Huffman "magic" never makes any
changes to these files. For those files,
Lo = Ls
bl(a) = log2(Lo/f(a)), so
f(a) = Lo / 2^(bl(a)).
where
bl(a) = bit length of the symbol `a'
f(a) = how many times 'a' occurs in the original source file
Ls = length of this hypothetical sorted file modified by Huffman,
Lo = length of original file.
Any possible bitlength distribution *could* have been generated by a
file of this type.
To other initial data files (for example, aabeeeb), Huffman only adds
letters (to round their frequencies up to the next power of 2), never
subtracts letters. (Is this right ? Would Huffman ever round up to more
than the next power of 2 ?)
For these files,
Lo < Ls
bl(a) = log2( Ls/f(a) )
bl(b) = floor(log2( Ls/f(b) ))
where
`a' is any letter that was unchanged
`b' is any letter that had extra copies added.
Since
x-1 < floor(x) <= x (Perhaps a tighter bound can be found here)
we can derive
log2( Lo/f(b) ) < bl(b)
Ls / 2^(1+bl(c)) < f(c) < Ls / 2^(bl(c)).
where
`c' is any type `a' or type `b' symbol in the original file.
( I thought I was going to get
Lo / 2^(bl(b)) < f(b) < 2*Lo / 2^(bl(b))
but I think this is incorrect).
Unfortunately, there are lots of tricky cases where Huffman deletes
some duplicate letters (and possibly adds other duplicate letters, in
the general case). For those files, Ls could be bigger or smaller
or the same as Lo. For any letter `d' that had duplicate copies deleted,
floor(log2(Ls/f(d))) < bl(d).
The remaining letters still get
bl(b) = floor(log2( Ls/f(b) ))
Sometimes some of the remaining letters are compressed even better than
Arithmetic coding (it appears to violate entropy); i.e., sometimes
bl(c) < log2( Lo/f(c) ).
However, total entropy is not violated since the worse compression on
letters who had some copies deleted more than make up for
this "improvement".
In particular, for the extremely skew case where Lo/f(d) = 1.00001
(or, in general, where Lo/f(d) < 2 approximately),
floor( log2( Ls/f(d) ) = 0 < bl(d) = 1.
So for files where Huffman deletes some characters, we get
Letters with unchanged frequency
bl(a) = log2( Ls/f(a) )
f(a) = Ls / 2^(bl(a)).
Letters with copies added
bl(b) = floor(log2( Ls/f(b) ))
Ls / 2^(1+bl(b)) < f(b) < Ls / 2^(bl(b)).
Letters with copies deleted
floor(log2(Ls/f(d))) < bl(d)
Ls / ( 2^(bl(d)+1) ) < f(d).
Of course, for every symbol in every kind of file that has Q unique
symbols,
0 < bl(c) < Q
0 < bl(d) < Q.
(except for the special case of Q=1 (Ls = 1) or Q=0 (Ls = 0), where the
compressed data has length zero bits).
Maybe if I could get a better handle on when exactly these copies are
deleted, I could answer the original question.
I suspect that if you look at every symbol that has the same bit length,
only the one closest to the end of the alphabet could possibly have
copies deleted. If that were true, then for all symbols of the same bit
length except for the last,
Ls / 2^(1+bl(c)) < f(c) < Ls / 2^(bl(c)).
where
Ls = sum{i=each unique symbol}( 2^(max - bl(i)) )
max = the longest bit length.
David Cary developed "reverse Huffman coding" to minimize the transmission cost whenever where different symbols have different cost.
(Other people, including Peter Wayner, have independently discovered/used reverse Huffman for other purposes)
This requires a Huffman compression algorithm such that decompressing *any* "compressed" file, then compressing that "plaintext", generates identically the same "compressed" file. See David Scott http://bijective.dogma.net/ for another application that requires this same property.
Say you have Q unique symbols that can be transmitted, each one x with cost c(x). For example, you have a battery-powered device that turns a LED off and on, where each 9 bit symbol always starts with the LED on one bit time, and the total cost is the battery drain which only occurs while the LED is turned on. In other situations, the "cost" may also involve the transmission time.
Theorem: To minimize transmission cost, you want to equalize the cost per bit of information transmitted, for each symbol.
Proof: [FIXME]
We assume that the data to be transmitted has already been compressed into a bitstream, such that no matter what values the previous bits have been, the probability that the next bit is a 1 is always approximately 1/2.
The cost can be any arbitrary function of the symbol x. This would commonly include some combination of
If every one of the unique symbols can be transmitted in the same amount of time, and the "time to transmit" dominates the cost, then the minimum cost occurs when all symbols will be transmitted (approximately) equally likely.
Using Arithmetic coding isn't much better than Huffman at equalizing cost/bit when there are lots of symbols, but it can be far superior when there are only 2 or 3 symbols.
In some cases, it may be a good idea to concatenate "simple" symbols to make a "big" symbol. For example, if you have 3 "simple" symbols to work with, you might get better cost/bit if you concatenate 3 of them to create a symbol set of 27 "big" symbols, and then feed the cost of each of those 27 symbols into the reverse Huffman algorithm. Concatenation also simplifies convoluted transmission constraints such as "when transmitting these -1, 0, +1 values, the average must always be 0".
Simple example: We have a LED transmitter/ photodiode receiver pair that transmit 4 bit symbols to each other. Every symbol required to have a minimum of 1 "on" bit to maintain synchronization between transmitter and receiver, so that leaves 15 valid symbols.
The cost for this application is entirely the battery drain when the LEDs are on, a constant 1 unit for each time period. We neglect the time needed to transmit information and the battery power used by the transmitter and receiver while the LED is off.
| LED | cost | encoded bits |
| 0000 | 0 | (never transmitted) |
| 0001 | 1 | 001 |
| 0010 | 1 | 01 |
| 0011 | 2 | 000001 |
| 0100 | 1 | 10 |
| 0101 | 2 | 000010 |
| 0110 | 2 | 000011 |
| 0111 | 3 | 000000000 |
| 1000 | 1 | 11 |
| 1001 | 2 | 000100 |
| 1010 | 2 | 000101 |
| 1011 | 3 | 00000001 |
| 1100 | 2 | 00011 |
| 1101 | 3 | 00000010 |
| 1110 | 3 | 00000011 |
| 1111 | 4 | 000000001 |
Without reverse Huffman transmission, you might transmit each symbol equally, so each symbol encodes approximately log2(15) =~= 3.9 bits of information (using bit stuffing). at an average cost of (average cost)/(average bits)= = (32/15)/(log2(15)) = 0.546 cost/bit.
Or you might transmit only 3 bits per symbol, using the 8 lowest-power symbols, to get a slightly better rasio of (average cost)/(average bits) = = (12/8)/(3) = 0.500 cost/bit.
But the optimum cost/bit is to use reverse Huffman transmission. Then the probability of each symbol being transmitted becomes [ 64 128 8 128 8 8 1 128 8 8 2 16 2 2 1]/512. The length in bits for each symbol is [ 3 2 6 2 6 6 9 2 6 6 8 5 8 8 9]. So the average(cost/bit) = sum( probability(i).*cost(i)./bit(i) ) = 0.4611 cost/bit.
We can see how well this choice of bitpatterns fits to our goal of a equal cost/bit by calculating cost ./ bit = [0.3333 0.5000 0.3333 0.5000 0.3333 0.3333 0.3333 0.5000 0.3333 0.3333 0.3750 0.4000 0.3750 0.3750 0.4444] = [120 180 120 180 120 120 120 180 120 120 135 144 135 135 160]./360 .
Well, it's a lot closer to a constant cost/bit than the simpler strategies above.
Open questions: Is it possible to have a "reverse adaptive huffman" adapt to changing line conditions ?
[FIXME: not a good explaination of end effects. See bijection ] End effects: The above neglects what happens at the end of a transmission. What if you are nearing the end of transmitting a message, and you find that the remaining bits to be transmitted are "1100" ? Obviously you find that you need to transmit a "1000" pattern (interpreted as "11"); then what could you send that would be interpreted as "00" ? One simple protocol would be to append a virtually infinite string of zeros after the message; this protocol would then transmit a "0111" message (interpreted as "000000000") to transmit the remaining "00" data bits and a bunch of bogus "0" bits. Hopefully the receiver would properly ignore the extra 0 bits. Perhaps if the message to be transmitted ended in zeros, the transmitter could halt transmissions after the last 1 bit was transmitted, and the receiver would somehow fill in the mising 0 bits. (Appending a virtually infinite string of ones after the message could also be interesting). Is it possible for the receiver to truncate extra zeros (or even fill in missing zeros) even if it doesn't know the exact length of the original text ? Maybe if it knows the original text was a exact multiple of 8 bits in length ... no, that wouldn't work, because in the special case where we lacked 1 bit to make up a full byte, and the the final transmitted symbol was "0111", then the 9 "0" bits that decodes into would be ambigous as to whether it was intended as the final "0" bit of the message, or whether the message ended with another full byte of zeros. (Appending a virtually infinite string of ones looks like it would work in this case ... will it always work ?)
from a thread at http://www.deja.com/[ST_rn=md]/threadmsg_md.xp?AN=487814206 and more info at http://www.compressconsult.com/huffman/#maxlength
As you can see from the above, if you have 2^15 symbols (or less) in your data file, you can refer to any one of them using a 15 bit offset from the start of the file. I was incorrect in concluding that therefore the maximum code length would be 15 bits, since Huffman *inserts* lots of symbols (in that mythical sorted file), so the Huffman offset may be much longer than 15 bits.
alternatively, you can loose a *tiny* amount of compression by using a code table that is slightly less than optimal, to force code lengths to a maximum of 15 bits.
Q: What is the maximum height
of a Huffman tree with an alphabet of 8 bit (256 different characters)?
A: The worst case height is 255.
The worst case is, of course, exceedingly improbable in practice.
If you have control of the Huffman code assignment (which you probably
do as an encoder), then you can set an a-priori limit on the code
length. This means that the lowest-probability symbols will have
slightly-less-than-optimal code assignments, but since they occur
infrequently by definition, the loss in compression ratio is tiny.
The resulting code simplifications may be well worth that price.
The JPEG standard, for example, requires that no code be longer than
16 bits. (Since this is a requirement of the standard, both encoders
and decoders are able to take shortcuts that assume this is the
maximum code length.) The standard offers a simple approximate
algorithm for adjusting a true Huffman code tree to meet this maximum
depth.
regards, tom lane
organizer, Independent JPEG Group
but,
From: "DI Michael Schindler" Subject: Re: huffman code length Date: 10 Jun 1999 00:00:00 GMT Organization: Compression Consulting Newsgroups: comp.compression hi Tom, you are correct. Two things you did not mention: JPEG uses a very small alphabet; in the case here a maximum codelength of 8 bit is not practical (we have a 256 symbol alphabet here!) To estimate loss the factor the number of bits for the longest code is greater than then number of bits needed for simple storing symbols is a good measure. The original poster also wanted to know codetable size for a *canonical* code: If you omit the leading zeros/ones (which one depends on wether the shortest code is all zeros or ones) you can store codelength+trailing bits pairs in memory. The number of trailing bits needed is limited by log2(alphabetsize); see my huffman coding webpage http://www.compressconsult.com/huffman/ for details. Michael
From: "DI Michael Schindler" Subject: Re: huffman code length Date: 16 Jul 1999 00:00:00 GMT Organization: Compression Consulting Newsgroups: comp.compression,alt.comp.compression,sci.crypt,sci.math hi! the formula below is right only if you take the longest subtrees in case of equality of weights. If you prefer shorter trees the # of samples raises faster. The reason is that in the case of this worst-case sample distribution you add just about half the number of samples you already have but need an additional bit. Note the codelength is not important; important is the part of the code that is nonconstant for coding and decoding issues. This can be limited to ceil(log2(ALPHABETSIZE)) bits when using a canonical code (which you should for decoding speed reasons). Michael
For the "standard" code (where we prefer shorter trees by merging the least-recently-merged tree when there is equality of weights), then the worst-case (pathological) codelength is still (ALPHABETSIZE-1) bits. The pathological case for 7 symbols (a tree with depth 6) is ((((((1 1) 1) 3) 4) 7) 11)
| length of file (# of samples) | maximum codelength |
| 0..1 | 0 |
| 2 | 1 |
| 3..4 | 2 |
| 5..8 | 3 |
| 9..14 | 4 |
| 15..24 | 5 |
| 25..40 | 6 |
| 41..66 | 7 |
| 67..108 | 8 |
| 109..176 | 9 |
| 177..286 | 10 |
| 287..464 | 11 |
| 465..752 | 12 |
| 753..1218 | 13 |
| 1219..1972 | 14 |
| 1973..3192 | 15 |
| 3193..5166 | 16 |
| 5167..8360 | 17 |
The width of each range is 1 more than the lower end of the previous range.
I already knew about preferring shorter trees (easy to do by preferring
the least-frequently-merged nodes when building the tree during
compression), but somehow I neglected to take that into account when
building that table. Thanks for pointing out that flaw. Using
standard (prefer shorter trees) encoding,
length of file (# of samples)
maximum code length (standard encoding)
0..1 0
2 1
3..4 2
5..8 3
9..14 4
15..24 5
25..40 6
41..66 7
67..108 8
...
1973..3192 15
3193..5166 16
5167..8360 17
The width of each range is the lower end of the previous range.
(Did I get it right this time ?)
Unfortunately, this doesn't seem to make a significant difference (5166
characters vs. 4180 characters).
Using standard (prefer shorter trees) encoding,
the worst-case (pathological) case for 8 unique symbols is a depth of 7,
((((((((1+1)+1)+3)+4)+7)+11)+18) = 46 symbols
which is almost, but not quite, the standard Fibonacci sequence.
Each frequency is the sum of the 2 previous frequencies,
but this sequence starts out "1 1 1 3" rather than "1 1".
But a depth of 7 can occur with only 41 symbols,
albeit with 9 unique symbols:
((((((((1+1)+(1+1))+1)+4)+6)+10)+16) = 41
This is even less like the standard Fibonacci sequence.
Each frequency is the sum of the 2 previous frequencies,
but this sequence starts out
"1 1 1 1 1 4 6" rather than "1 1".
Thanks for telling me that, after the initial zeros, even the worst-
case canonical codes are relatively short. I didn't know that. I can
see that that would definitely simplify Huffman compressors and
decompressors. If I have time later, I will see if that really makes my
software (which already uses canonical codes) go any faster (or slower).
"DI Michael Schindler"
mentioned:
> the formula below is right only if you take the longest subtrees
> in case of equality of weights. If you prefer shorter trees
> the # of samples raises faster.
> The reason is that in the case of this worst-case sample
> distribution you add just about half the number of samples you
> already have but need an additional bit.
>
> Note the codelength is not important; important is the part of
> the code that is nonconstant for coding and decoding issues.
> This can be limited to ceil(log2(ALPHABETSIZE)) bits when using a
> canonical code (which you should for decoding speed reasons).
...
> d_cary at my-deja.com wrote
...
> > the Fibonacci sequence of frequencies
> > gives the maximum possible code length.
> >
> > length of file (# of samples)
> > maximum code length
> > 2 1
> > 3...4 2
> > 5...7 3
> > 8...12 4
> > 13...20 5
> > 21...33 6
> > ...... ...
> > 1597...2583 15
> > 2584...4180 16
> > 4181... 17
obsolete:
Date: Mon, 31 May 1999 01:30:46 GMT
From: d_cary
Subject: Re: Frequency distribution of Huffman encoding tree
Mok-Kong Shen wrote:
...
> My example given for a balanced tree of 4 symbols appears to
> contradict your formula. (Your range is [0.25, 0.5]).
>
> M. K. Shen
You're right. The formula I posted was wrong. I would really like to
know the correct formula (and maybe even a URI or paper reference).
Maybe the correct formula is
2^(-Hi-1) <= fi <= 2^(1-Hi)
where
fi = ni/L = number of times "i" occurs / total length of file
Hi = number of bits assigned by Huffman to the "i" code.
?
No, that's not right either. For the special case of only 2 symbols,
Huffman always assigns 1 bit per symbol, unless the probability of one
of the symbols is identically zero (and the other is identically 1). So
for the case of 2 symbols, one cannot find a tighter bound than
0 < fi < 1.
I'm pretty sure a tighter bound can be found if one has more than 2
symbols with non-zero probability.
So the formula must depend not only on the frequency that a symbol
occurs, but also on how many symbols there are.
Lots of standard test images and standard test text files, plus comparisons of various compression programs.
Maximum Compression's goal is to show the maximum achievable compression ratio for several filetypes (text, executable etc). The best 120+ programs for every filetype are compared in a table indicating compression ratios and ...http://MaximumCompression.com/ best compression ratio on:
Many data compression algorithms (1D and 2D) are designed to work with large amounts of data, and compress them all-or-nothing.
These actually *expand* the data (by a few bytes) if one tries to use them for "short" snippets of text or other data less than a few hundred bytes long.
If one has a large database, those data compression algorithms can give excellent compression, but it takes a very long time to decompress the database and find something in it.
Here I've listed algorithms that are a bit less efficient on "large" blocks of data, but are appropriate for "small" amounts of data (or for random-access indexing into large databases with small amounts of data at each entry). Perhaps you could think of them as 0D.
One might think that low-latency algorithms must all be 0D, operating on small blocks of data, but [elsewhere] there are many "single-pass" and "adaptive" algorithms with low latency (such as LZW compression, adaptive Huffman compression, etc.).
[FIXME: split out into seperate section] I also have some "small" data compression algorithms, designed to be used on slow machines with very little RAM and ROM. There's several applications, and several variants that trade off speed, RAM, and ROM. One semi-popular application is "I want to squeeze all kinds of cool stuff into my video game, but I want it all to fit on one disk / one ROM. Oh, and I want it to boot fast, too.". We want to create a bootable executable on that disk that loads the rest of the info off the disk and decompresses it. In this case, we want to minimize the total (decompression binary + compressed video game data) size -- since smaller size means it takes less time to get it off the disk. On one hand, we want the algorithm very simple (so that the decompression binary is small); on the other hand, we want the algorithm very sophisticated (so that the compressed video game data is small).
Fixed Length Compression:
Bigrams / Digrams
Storage Unit: 8 bits (0-255)
- Use 0-87 for blank, upper case, lower case, digits and 25 special characters
- Use 88-255 for bigrams (master + combining)
- master (8): blank, A, E, I, O, N, T, U
- combining(21): blank, plus everything [every lower case letter] but J, K, Q, X, Y, Z [DAV:... my list of letter frequencies gives etaoins hrdlcumwfgypb vkjxqz -- is there a bigram reason why Y was excluded instead of V? Or why U was included instead of S ?]
- total codes: 88 + 8 * 21 = 88 + 168 = 256
- Pro: Simple, fast, requires little memory.
- Con: Based on a small symbol set
- Con: Maximum compression is 50%.
- Average is lower (33%?).
- Variation: 128 ASCII characters and 128 bigrams.
- Extension: Escape character for ASCII 128-255.
...
Variable-Length Encoding: Huffman Codes
- ...
- Variation: Can be used on words (as opposed to characters).
- English text, with symbols for characters, is approximately 5 bits per character
- English text, with symbols for characters and 800 frequent words, yields 4.8-4.0 bits per character
- ...
-- Copyright © Jamie Callan, Bruce Croft, and/or James Allan http://ciir.cs.umass.edu/cmpsci646/Slides/ir11%20compression.pdf
DAV: hm... perhaps put some of the most-frequent words into those "special characters":
http://ciir.cs.umass.edu/cmpsci646/Slides/ir10%20text%20stats.pdf Top 50 words from 423 short TIME magazine articles (243,836 word occurrences, lowercased, punctuation removed, 1.6 MB) :
Word : Freq --- the 15659 of 7179 to 6287 a 5830 and 5580 in 5245 that 2494 for 2197 was 2147 with 1824 his is he as on by at it from but u (? U-Haul ?) had last be who has not an s (I suppose this is from possessives with apostrophe+s) have were their are one they its all week government when would been out new which up more into only will 488
2 main variants:
What are the different ways of representing integers ?
see fixme - floating point; endian
[FIXME: this leaves out gray codes, PN shift codes ...]
[FIXME: point to different ways of representing real numbers -- IEEE floating point, continued fraction representation, factorial notation, etc.]
a general overview of the different compression algorithms (Huffman, LZ77), and has the best section on "Representing Integers" I've seen on the web, covering Elias Gamma Code, Elias Delta Code, Fibonacci Code, Golomb and Rice Codes, ...
[FIXME: summarize this conversation]
From: CBFalconer Subject: Re: compact storage of integer values Date: 01 Feb 2001 00:00:00 GMT Approved: clc at plethora.net Organization: AT&T Worldnet Reply-To: cbfalconer at worldnet.att.net Newsgroups: comp.lang.c.moderated Harald Kirsch wrote: > > Situation: a HUGE (no HHHUUUGGGEEE) number of mostly small data > elements must be kept in memory. Their size is stored along with > them. 7 bit are sufficient to hold the size most of the > time. Nevertheless larger sizes can be expected. > > I came up with the following scheme, where 7 should be read (CHAR_BIT-1): > > If size fits in 7 bit, store it. > If size fits in n*7 bits: > store (n-1) times 7 bit in a char and set the 8th bit > store the last 7 bit in the last char (without 8th bit set) > > The idea is to use the high-bit of a char to indicate if more bytes > follow. I suggest you first evaluate what the MAX value can be. I suspect you don't need the infinite linking. If MAX is expressible in N bytes, then use 0..255-N+1 as values, and M = N+2 .. 255 to signify that the following M-N bytes hold a value. I think this has less wastage, crams more into the 1 byte representation, etc. -- Chuck F (cbfalconer at my-deja.com) (cbfalconer@XXXXworldnet.att.net) http://www.qwikpages.com/backstreets/cbfalconer (Remove "NOSPAM." from reply address. my-deja works unmodified) mailto:uce@ftc.gov (for spambots to harvest) -- comp.lang.c.moderated - moderation address: clcm at plethora.net [This is in response to a suggestion by Hans-Bernhard Broeker and Andy Isaacson who (independently ?) proposed a specialized version of this general scheme with N = 128 which is simple to decode: hi bit = 0: this byte is a small literal (0...127). hi bit = 1: clear the hi bit, then use the result as the number of 8 bit chunks following. ]
To: "John S. Fine" <johnfine at erols.com>, Pasi Ojala <albert at cs.tut.fi>, Antaeus Feldspar <feldspar at cryogen.com> From: David Cary <d.cary at ieee.org> Subject: "little workspace" decompressors Dear "John S. Fine" <johnfine at erols.com> and Pasi Ojala <albert at cs.tut.fi> and Antaeus Feldspar <feldspar at cryogen.com>, I saw you all in the comp.compression newsgroup. You all seem to be interested in "little workspace" decompressors. I just thought I'd point out your web pages to each other. I am really impressed with the clever scheme Pasi Ojala came up with. How does the Elias Gamma codes (that Pasi Ojala re-invented) compare with the functionally-similar variable-length code John Fine invented ? Both map more common (smaller value) integers to shorter (in bits) prefix codes. (Of course, knowing the actual value frequencies and applying huffman gives one of several optimum encodings. What are the actual value frequencies for typical files ?). For this application, we also want a code that a very small/fast routine can decode (slightly less-than-optimum encoding can trade off for smaller/faster decode routine). >From: "John S. Fine" <johnfine at erols.com> >Take bits in triplets from high order to low order, with the rightmost >bit in each triplet used as a stop bit: >001=0 011=1 ... 111=3 000001=4, 000011=5 ... 010001=8 ... 110111=19 >... 000000001=20 ... 110110111=83 etc. It may not be obvious, but >the above can be decoded in a trivial amount of x86 assembler code >with no tables or workspace. Pasi Ojala [this section is very unfinished ...]Currently, good-quality lossy text compression can only be done by humans. It's one of the few remaining areas where humans are clearly superior to silicon-based computers.
"better compression scheme for text?"
by lowwave (469952) on Monday September 15
It seems possible that one can create better compression scheme for text. If you fix both ends of a word, order the middle letters, then there are sequences appears more frequent than in nature writing. The trick is to unscramble the word based on the context. The scheme probably will introduce some errors.
http://science.slashdot.org/comments.pl?sid=78625&threshold=1&commentsort=0&tid=133&tid=134&tid=186&mode=thread&cid=6970387
(DAV: I think I've seen "sort consonants alphabetically, then vowels alphabetically" somewhere before ...
or perhaps this is just talking about standard alphabetical sequence.)
Plus Magazine http://plus.maths.org/ : "Maths is elegant, interesting, useful."
I've been thinking about "conflation" (``confounded'' ?), mixing together unrelated items, and how to undo this (in hopes that such a transformation could lead to better compression).
From: d_cary < at my-dejanews.com >
Subject: undoing conflation
Date: 28 Oct 1998 00:00:00 GMT
Newsgroups: comp.compression
Assume that I have some pseudo-English text such that each word is a
standard English word and occurs with pretty much the standard frequency for
that word in English text, but each word has absolutely no correlation to the
next ("memoryless").
(There are, of course, strong correlations between the letters inside a word,
as well as common word prefixes, common word suffixes, etc).
The best possible compression I could hope for would be to
use each word as a "symbol" and compress one symbol at a time
with arithmetic coding (marginally better than Huffman coding), right ?
(I assume the file is very big relative to the size of the frequency table).
If I scramble each word with any reversible word-&integer transformation
before compressing,
the compressed file will be exactly the same size
as the one I created without this transformation, right ?
If I just randomly selected clumps of letters
(say, 4 letters at a time), paying no attention to the word boundaries,
compressing one clump (one symbol) at a time with arithmetic (or Huffman)
compression, I think the compressed file would be much worse (right ?)
than the previous 2 files.
If I scramble each clump with any reversible clump-&integer transformation,
creating a "conflated file",
the compressed file will be exactly the same ("much worse") size, right ?
(Is there a better term for this than "conflated file" ?).
Here's the question: Say I have a "conflated file" -- a sequence of (short)
integers with the above characteristics. How do I compress it (lossless) as
small as possible. Rather than immediately hitting it with arithmetic coding,
the best thing to do is the inverse integer->clump transformation, then
compress one word (symbol) at a time with Huffman coding. But I don't know
which clump->integer transformation was used. Now what ? Is there some way I
could un-do (some of) the damage, somehow break the integers up into
variable-length strings (doesn't have to be the original text), re-group them
into "better" multi-byte words ? If I could do that, I could use the
above-mentioned "optimal" arithmetic coding.
If this were possible, one might even be able to apply it to compressing
standard text -- -- ideally, it would recognize that the letter "T" is
conflating the idea of "next letter is capitalized" with the idea of the
letter "t". Once we un-did that conflation, we would have a text file without
any Upper Case letters, but we would have to add a new symbol (I arbitrarily
choose A) to indicate that the next letter