updated 2009-04-29.
Contents:
related pages:
"It is my ambition to say in ten sentences what others say in a whole book." -- Friedrich Nietzsche
"Short words are best and the old words when short are best of all." -- Winston Churchill
"The most valuable talent is that of never using two words when one will do." -- Thomas Jefferson
"If you would be pungent, be brief; for it is with words as with sunbeams. The more they are condensed, the deeper they burn." -- Robert Southey
"In the multitude of words there wanteth not sin: but he that refraineth his lips is wise." -- Proverbs 10:19
There are several different orthogonal ways to categorize data compression:
Most people have assumed that the decompressor has access to the complete compressed data file from the beginning. However, DAV has been thinking about ways to synchronize a decoder in the middle of a data stream -- for example, immediately after you turn on your TV, you want to be able to start watching a compressed digital video signal almost immediately. This is similar to ``medium-latency'', except it requires the TV to be able to decode the last part of the TV program even if it completely missed out on the first part. Some fractal compression techniques have this property ... I've been calling this ``history-limited decompression''.
Introductions to data compression: beginner tutorials
a brief introduction designed for ``gifted students in the 8 to 12 age group''. The concepts of ``coding'' and various forms of data representation (in particular, Morse code and the ASCII character set) are very important to data compression. The understanding that some letters are more frequent than others is good for understanding Huffman compression. The understanding that some letter pairs are far more frequent than the letter frequencies alone would suggest is good for understanding LZ77 and related compression algorithms. Has good info on letter-pair frequencies (his charts have the first letter on the left, the second letter on top; one chart is based on the King James Bible).
(example: http://members.ozemail.com.au/~xenophon/code.html )
More links to data compression, in general (Other pages that, like this one, have lots of links to compression information):
Specifically designed to compress 2D images.
The 2 hottest topics in 2D image compression (circa 1999, when DAV spent a lot of time in this area) are "wavelets" and "fractals" (which really have many similarities).
The goal here is to try to improve on computer_graphics_tools.html#png .
see also computer_graphics_tools.html#fractals for fractals in general (as artistic and mathematical objects).
(FIXME: move to http://en.wikipedia.org/wiki/Fractal_Compression )
[FIXME: write fractal image compression with extra ``erode'' and ``dilate'' parameters]
Introductions to Wavelets
Wavelets applied to Image Compression
Color coordinate systems and conversion between them.
See also clustering and color quantization machine_vision.html#cluster .
I'm most interested in simple lossless stuff.
[Here I've translated it into DAV notation, where for integer values x, N,
floor2(x) = floor( x / 2 ) = arithmetic right shift, and
ceil2(x) = ceil( x / 2 ) = floor2(x+1) = -floor2(-x), such that
floor2(x) + ceil2(x) = x, and ceil2(x) - floor2(x) = (0 or 1), so
floor2( 2N + ceil2(x) - floor2(x) ) = N, and
floor2( N + x ) + ceil2( N - x ) = N.]
FIXME: My notation avoids lots of little "+1" and "-1" floating around ... but perhaps it would be clearer to write in executable C notation, perhaps with #define ceil2(x) ((x+1) >> 1)
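As a sketch of that "executable C notation" idea, here are floor2 and ceil2 written out as functions (using division rather than ">>", since right-shifting a negative int is implementation-defined in C):

```c
#include <assert.h>

/* floor2(x) = floor(x/2), correct for negative x too
 * (plain x/2 in C truncates toward zero). */
static int floor2(int x) { return x >= 0 ? x / 2 : -((1 - x) / 2); }

/* ceil2(x) = ceil(x/2) = floor2(x+1) = -floor2(-x) */
static int ceil2(int x)  { return floor2(x + 1); }
```

The identities above (floor2(x) + ceil2(x) == x, and floor2(2N + ceil2(x) - floor2(x)) == N) can then be checked directly for sample values.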
describes a reversible integer RGB (8 8 8 bits) to Y Ur Vr (8 9 9 bits) color transform and shows that for many (but not all) images, there is a net gain in spite of the fact that 2 of the components require 9 bits of accuracy instead of the original 8. This transform also allows a grayscale image to be extracted from the compressed data without decompressing the entire image.)
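The paper's exact transform isn't reproduced here; as an illustrative sketch with the same shape, here is the well-known reversible RCT (the one used in JPEG 2000): Y fits in 8 bits, and the two difference components Ur, Vr need 9 bits (range -255..255). The function names are my own; the paper's transform may differ in detail.

```c
#include <assert.h>

/* floor(x/4), correct for negative x (plain x/4 truncates toward zero) */
static int floor4(int x) { return x >= 0 ? x / 4 : -((3 - x) / 4); }

/* Reversible integer RGB (8 8 8 bits) -> Y, Ur, Vr (8 9 9 bits) */
void rgb_to_yuv(int r, int g, int b, int *y, int *ur, int *vr)
{
    *ur = r - g;
    *vr = b - g;
    *y  = g + floor4(*ur + *vr);   /* == floor((r + 2g + b) / 4) */
}

/* Exact inverse: recover G first, then R and B from the differences */
void yuv_to_rgb(int y, int ur, int vr, int *r, int *g, int *b)
{
    *g = y - floor4(ur + vr);
    *r = ur + *g;
    *b = vr + *g;
}
```

Note the grayscale-extraction property mentioned above: Y alone is a usable luminance approximation, so a decoder can ignore Ur and Vr entirely.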
suggests
Specifically designed to compress 1D streams of symbols, such as English text. 2D images are often rearranged into a linear list of pixels (e.g., by scanning one row at a time, or walking a Hilbert path), then compressed with one of these tools.
Some of these (such as Huffman) don't even take advantage of the 1D correlation between adjacent symbols; they only use the statistics of the entire file taken as an unordered set. Such an algorithm compresses any file with identical symbol statistics equally well, no matter how the symbols are shuffled.
Others of these (LZ77 and descendants) completely ignore the unordered statistics, and simply copy repeated phrases.
Theory:
Source code available at:
[FIXME: Not all my Huffman links are here; some of the 1D_compression links describe several compression ideas including Huffman compression. Perhaps I should copy them here and directly link to their Huffman page].
This implies that:
There is only one standard Huffman code for a particular file, but many other Huffman codes that give just as good compression for that file. For example, if you are trying to compress a file of 100 characters with these frequencies
a: 10, b: 10, c: 20, d: 20, e: 40
there are 3 bit-length vectors ("Kraft vectors") that give the best compression (220 bits). They are
L = (1, 2, 3, 4, 4); L = (1, 3, 3, 3, 3); L = (2, 2, 2, 3, 3).
Given a "Kraft vector", one can find the unique canonical Huffman code for a file. For the above optimum Kraft vectors,
a: 0000, b: 0001, c: 001, d: 01, e: 1
a: 000, b: 001, c: 010, d: 011, e: 1
a: 000, b: 001, c: 01, d: 10, e: 11
This last canonical Huffman code is the unique "standard" Huffman code. The "standard" (DAV: I made up this terminology. Is there a better one?) Kraft vector is generated by preferentially merging sub-trees with the smallest depth. For certain 8-bit binary files, the optimum Kraft vector assigns one symbol a code roughly 256 bits long. Since decoders can be made simpler if there is a limit on code length, some algorithms (like the one in JPEG?) generate a non-optimum Kraft vector with a maximum code length of 16 bits for 8-bit values (this restriction gives slightly worse compression).
Newsgroups: comp.compression
From: (Antaeus Feldspar)
Subject: Re: Literature/HTML page on optimal Huffman codes
Date: Mon, 3 Nov 1997 20:48:21 GMT

In article <345dc108.17828360@news.worldonline.nl>, Thiadmer Riemersma wrote:
>Do you know a book, article or HTML page where the creation of *optimal*
>Huffman tables is explained. I have read on various occasions that the
>Huffman tree should always be better or as good as Shannon-Fano coding. One
>book (with examples of both) said, "Well, Shannon-Fano is a little more
>compact than Huffman in this example, but that's because we have not
>optimized the Huffman tree." Hence my question.

I think your book has played fast and loose with terminology. If we are working with accurate statistics, a prefix code created through Huffman coding is going to be the most compact possible -- when codes are limited to integral numbers of bits, that is. Huffman codes are prefix codes; but not all prefix codes are Huffman codes.

>By the way, if you know a compact way to store a Huffman table?

Easiest way is to adopt restrictions for your Huffman trees that make only one of all possible trees for a given set of codelengths the legal tree. Then, storing the codelengths is sufficient to reconstruct the entire tree. An example, used in "deflate" compression, is the following set of restrictions:
-- Shorter codes go to the left of longer codes.
-- Symbols with equal codelengths go in lexicographic order from left to right.
Any Huffman tree can be converted into a Huffman tree that meets these restrictions, without changing the lengths of the codes or the efficiency of the code -- and as pointed out before, this means that storing just the lengths of the codes is the same as storing the whole tree.
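The deflate restrictions quoted above lead directly to a small code-assignment routine (a sketch after the algorithm in RFC 1951, section 3.2.2). Note that deflate's convention puts the *shorter* codes first, so the assignment differs from the "standard" convention used elsewhere on this page:

```c
#include <assert.h>

#define MAX_BITS 15

/* Given code lengths len[0..n-1] (0 = symbol unused), assign canonical
 * codes: shorter codes get smaller values, and symbols with equal
 * lengths are numbered in symbol order. */
void canonical_codes(const int *len, int n, unsigned *code)
{
    int bl_count[MAX_BITS + 1] = {0};
    unsigned next_code[MAX_BITS + 1] = {0};
    for (int i = 0; i < n; i++)
        bl_count[len[i]]++;
    bl_count[0] = 0;

    /* smallest code value for each length */
    unsigned c = 0;
    for (int bits = 1; bits <= MAX_BITS; bits++) {
        c = (c + bl_count[bits - 1]) << 1;
        next_code[bits] = c;
    }

    /* hand out codes in symbol order within each length */
    for (int i = 0; i < n; i++)
        if (len[i] != 0)
            code[i] = next_code[len[i]]++;
}
```

For the Kraft vector L = (2, 2, 2, 3, 3) example above, with lengths a=3, b=3, c=2, d=2, e=2, this yields c=00, d=01, e=10, a=110, b=111.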
[This shuffles everything up one slot every iteration ... perhaps it would run significantly faster if it never shuffled anything up, using double the amount of memory (still O(n) space and O(n log n) time).]
Huffman's famous greedy algorithm solves this problem in $O(n \log n)$ time if $f$ is unsorted, and can be implemented to run in $O(n)$ time if the input array $f$ is sorted. In this paper the space usage of the greedy method is considered. It is shown that if $f$ is sorted then the array $c$ can be calculated in-place, with $c_i$ overwriting $f_i$, in $O(n)$ time by using $O(1)$ additional space. The new implementation leads directly to an $O(n \log n)$ time and $n+O(1)$ words of extra space implementation for the case when $f$ is not sorted. The proposed method is simple to implement and executes quickly.
Re: Dynamic Huffman algorithms
Author: Matthew D Moss
Email: moss at null.net
Date: 1998/10/02
Forums: comp.compression

ralionmaster@geocities.com.remove.after.com wrote:
> I have asked similar question before. Could someone
> advise where I could obtain algorithm(s) on dynamic Huffman
> implementations?

Take a look at the October 1998 issue of Dr Dobb's Journal. In the "Algorithm Alley" column, Steven Pigeon and Yoshua Bengio tackle this problem. Their variation of the Huffman algorithm is adaptive, not as memory intensive for sparse symbol sets, and doesn't require you to store the encoding table. http://www.ddj.com/
o: 000
a: 001
a: 010
i: 011
i: 100
e: 101
e: 110
e: 111
This can map directly to Arithmetic coding (???). Lu (Lu 1997 p. 312) uses an Arithmetic coder where letters are sorted by frequency similar to this. Canonical Huffman coding, in effect, adds duplicate characters and deletes (!) characters such that the total length of this sorted file is a power of 2:
o: 000
o: 001 // additional `o'
a: 010
a: 011
i: 100
i: 101
e: 110
e: 111 // deleted one `e' !
Then, when compressing the original (unsorted) file, the compressed code word is just enough to locate that letter in this (sorted) file: in this case, "a" can be located in 2 bits: "01", because *all* letters at locations that begin with "01" are "a".
Another example: "ebeaabeb" is sorted "aabbbeee". Canonical Huffman, in effect, creates this sorted file:
a: 000
a: 001
b: 010
b: 011 // deleted one `b' !
e: 100
e: 101
e: 110
e: 111 // added 1 `e' !
Here, "e" can be indicated by the code word "1", since all locations in this file that start with "1" are "e". "b" is indicated by the code word "01", since all locations that start with "01" are "b".
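This "sorted file" view can be turned directly into a toy decoder: each bit of a codeword halves the range of possible offsets, and decoding stops as soon as the remaining range holds only one letter. A sketch (the padded sorted file and the codes a=00, b=01, e=1 are the ones derived above; the power-of-2 padding is what makes the midpoint split line up with the codeword boundaries):

```c
#include <assert.h>
#include <string.h>

/* The padded sorted file for "ebeaabeb": one `b' deleted, one `e'
 * added, total length a power of 2. */
static const char sorted_file[8] = {'a','a','b','b','e','e','e','e'};

/* Decode one symbol from a string of '0'/'1' characters: consume bits,
 * narrowing the offset range [lo, hi) until every entry in the range
 * holds the same letter.  Advances *pos past the bits consumed. */
char decode_one(const char *bits, int *pos)
{
    int lo = 0, hi = 8;
    for (;;) {
        int all_same = 1;
        for (int i = lo + 1; i < hi; i++)
            if (sorted_file[i] != sorted_file[lo]) { all_same = 0; break; }
        if (all_same)
            return sorted_file[lo];
        int mid = (lo + hi) / 2;
        if (bits[(*pos)++] == '0') hi = mid; else lo = mid;
    }
}
```

Decoding the bit string "1011000001101" (which is "ebeaabeb" encoded with e=1, b=01, a=00) symbol by symbol reproduces the original file in 13 bits instead of 8*3 = 24.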
For some initial data files (for example, aabbeee), Huffman always adds letters, never deletes letters. For those sorts of data files,
compressed_length(a) = ceil(log2(L/f(a))).
However, for many data files, Huffman deletes letters; so
ceil(log2(L/f(a))) < compressed_length(a)
and we get better compression than we expected for *other* letters (letters different from the one that got deleted). Sometimes this "improvement" is even better than Arithmetic coding (it appears to violate entropy); i.e., sometimes
compressed_length(a) < log2(L/f(a)) // sometimes.
However, *total* entropy is not violated, since the *worse* compression on letters like the one that was deleted more than makes up for it.
If we have hardware that is limited in the size (in bits) of the code words it can handle, then algorithms that generate (occasionally non-optimal) Kraft vectors have the effect of adding lots of copies of the least common letter(s).
Many programs written to *teach* Huffman codes represent them in memory like this:
(length)(bitstring) pairs.
When you encode 8-bit symbols, the worst-case bitstring is 255 bits. So one way of holding all the codes is an array of 256 of these (length)(bitstring) pairs. The (length) can range from 1 bit to 255 bits (a length of 0 indicates "this symbol never occurs").
However, if you use canonical Huffman codes, the least common symbols all start with lots of zero bits. A better way of representing all the Huffman codes is to truncate all those zero bits:
(length)(remaining bits)
By using canonical Huffman codes, the worst-case "remaining bits" bitstring (when compressing 8-bit symbols) is 8 bits. The "remaining bits" bitstring is all zeros for the least-frequently-occurring symbol. The "remaining bits" bitstring has at least one 1 bit for the other 255 symbols.
The decompressor needs to know what the Huffman codes are for each compressed file. (They are different for each compressed file). So a natural place to store this information is in the header of each compressed file. It would be nice if this information could be represented as compactly as possible.
I know I was astonished to learn that this table could be represented (in the header of the compressed file) by an array of 256 (length) bytes. The actual code words (or even the "remaining bits") do *not* need to be stored in the header (if you use canonical Huffman codes). The compressed file header just stores the *lengths* of the codeword for each symbol, and the decompressor regenerates the code words from that.
Assuming your maximum code length is <= 15 bits (see #huffman_height for details on how to guarantee a particular maximum code length, at some tiny loss in compression), the simplest possible header for Huffman compression packs 4 bits per symbol to encode that code length (0 == never occurs in file, 1 == occurs about half the time, 15 == occurs rarely ... or would it be better for "0000" to indicate a length of 16 == occurs even more rarely?). If you have 256 symbols, this header requires 256*4 bits = 1024 bits = 128 bytes.
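A minimal sketch of that 128-byte header packing, two 4-bit lengths per byte (function names are mine):

```c
#include <assert.h>

/* Pack 256 code lengths (each 0..15) into a 128-byte header. */
void pack_lengths(const unsigned char len[256], unsigned char hdr[128])
{
    for (int i = 0; i < 128; i++)
        hdr[i] = (unsigned char)((len[2*i] << 4) | (len[2*i + 1] & 0x0f));
}

/* Recover the 256 code lengths from the 128-byte header. */
void unpack_lengths(const unsigned char hdr[128], unsigned char len[256])
{
    for (int i = 0; i < 128; i++) {
        len[2*i]     = hdr[i] >> 4;
        len[2*i + 1] = hdr[i] & 0x0f;
    }
}
```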
Obviously, if the header requires 128 bytes, then it's impossible to get any useful compression for files of 129 bytes or less.
There are ways to compress the Huffman header in a variable-length way. Those ways result in an even shorter header for typical English text files.
[FIXME: other ways to reduce size of Huffman header]
[FIXME: doesn't this just repeat above ?]
[data compression#huffman]
From: d.cary at ieee.org
To: d.cary at ieee.org
Subject: Re: Arbitrary Huffman tree and weights distribution (was: huffman code length)
Date: Thu, 15 Jul 1999 21:34:58 GMT
Newsgroups: comp.compression,alt.comp.compression,sci.math

I've been thinking about a different paradigm to "explain" the Huffman algorithm. This one has no trees. And it creates an easy-to-understand concept of the "canonical" Huffman code.

Starting with the file you want to compress, count how many times each symbol occurs (histogram). Then sort the file, not in alphabetical order, but in order of least-frequently-occurring symbol first. For example, the file "ebeaabeb" is sorted

000 a
001 a
010 b
011 b
100 b
101 e
110 e
111 e

Then we could encode the original file by just transmitting the offset of each letter in this sorted file. This leads to 3 bits/symbol encoding. (I think that arithmetic coding is very similar to this idea.)

However, notice that all the symbols in this sorted file that start with "00" all lead to `a'. This means that we could encode all `a's in the original file with only those 2 bits, leading to 2 bits for a (00), 2 bits for the b (01) and 2 bits for the e (11). Note that the "10" code is never used, since it only codes for b and e, which are already covered.

The canonical Huffman algorithm works its magic on this sorted file, adding and deleting a few letters, creating

000 a
001 a
010 b
011 b // deleted one `b' !
100 e
101 e
110 e
111 e // added 1 `e' !

The length of a Huffman sorted file is always a power of 2. Each letter occurs exactly some power-of-2 times; letters that occur some smaller power-of-2 times occur first.
In a "canonical" Huffman sorted file, letters that occur the same power-of-2 times are sorted in alphabetical order. This means that given the power-of-2 ("the code length") of each symbol, we can easily regenerate this canonical Huffman sorted file.

As before, we could encode the original file by encoding each character by its offset in this sorted file. This leads to 3 bits per symbol. At decoding, we take the code lengths of each symbol, and regenerate the canonical sorted file. Then we decode by looking up bits in the compressed bit stream as offsets to this file.

Note that if the first codeword you're starting to decode starts with "1", you already know that must have been a `e' in the original file. So the remaining bits in the offset don't need to be stored in the file. This means the canonical Huffman code for this file is 2 bits for a (00), 2 bits for b (01), and 1 bit for e (1).

This immediately leads to all the standard properties for canonical Huffman codes, including "the all-zeros code is the longest code", "the all-ones code is the shortest code", etc.

In a few initial data files (for example, aabbeeee), letters are already in a power-of-2 frequency distribution, and the total file length is already a power of 2. The Huffman "magic" never makes any changes to these files. For those files,
Lo = Ls
bl(a) = log2(Lo/f(a)), so
f(a) = Lo / 2^(bl(a)).
where
bl(a) = bit length of the symbol `a'
f(a) = how many times `a' occurs in the original source file
Ls = length of this hypothetical sorted file modified by Huffman,
Lo = length of original file.
Any possible bitlength distribution *could* have been generated by a file of this type.

For other initial data files (for example, aabeeeb), Huffman only adds letters (to round their frequencies up to the next power of 2), never subtracts letters. (Is this right? Would Huffman ever round up to more than the next power of 2?)
For these files,
Lo < Ls
bl(a) = log2( Ls/f(a) )
bl(b) = floor(log2( Ls/f(b) ))
where
`a' is any letter that was unchanged
`b' is any letter that had extra copies added.
Since x-1 < floor(x) <= x (perhaps a tighter bound can be found here) we can derive
log2( Lo/f(b) ) < bl(b)
Ls / 2^(1+bl(c)) < f(c) < Ls / 2^(bl(c)).
where `c' is any type `a' or type `b' symbol in the original file. (I thought I was going to get Lo / 2^(bl(b)) < f(b) < 2*Lo / 2^(bl(b)), but I think this is incorrect.)

Unfortunately, there are lots of tricky cases where Huffman deletes some duplicate letters (and possibly adds other duplicate letters, in the general case). For those files, Ls could be bigger or smaller or the same as Lo. For any letter `d' that had duplicate copies deleted,
floor(log2(Ls/f(d))) < bl(d).
The remaining letters still get
bl(b) = floor(log2( Ls/f(b) )).
Sometimes some of the remaining letters are compressed even better than Arithmetic coding (it appears to violate entropy); i.e., sometimes
bl(c) < log2( Lo/f(c) ).
However, total entropy is not violated, since the worse compression on letters which had some copies deleted more than makes up for this "improvement". In particular, for the extremely skew case where Lo/f(d) = 1.00001 (or, in general, where Lo/f(d) < 2, approximately),
floor( log2( Ls/f(d) ) ) = 0 < bl(d) = 1.

So for files where Huffman deletes some characters, we get

Letters with unchanged frequency:
bl(a) = log2( Ls/f(a) )
f(a) = Ls / 2^(bl(a)).
Letters with copies added:
bl(b) = floor(log2( Ls/f(b) ))
Ls / 2^(1+bl(b)) < f(b) < Ls / 2^(bl(b)).
Letters with copies deleted:
floor(log2(Ls/f(d))) < bl(d)
Ls / ( 2^(bl(d)+1) ) < f(d).

Of course, for every symbol in every kind of file that has Q unique symbols,
0 < bl(c) < Q
0 < bl(d) < Q.
(except for the special case of Q=1 (Ls = 1) or Q=0 (Ls = 0), where the compressed data has length zero bits).

Maybe if I could get a better handle on when exactly these copies are deleted, I could answer the original question.
I suspect that if you look at every symbol that has the same bit length, only the one closest to the end of the alphabet could possibly have copies deleted. If that were true, then for all symbols of the same bit length except for the last, Ls / 2^(1+bl(c)) < f(c) < Ls / 2^(bl(c)). where Ls = sum{i=each unique symbol}( 2^(max - bl(i)) ) max = the longest bit length.
David Cary developed "reverse Huffman coding" to minimize the transmission cost when different symbols have different costs.
(Other people, including Peter Wayner, have independently discovered/used reverse Huffman for other purposes)
This requires a Huffman compression algorithm such that decompressing *any* "compressed" file, then compressing that "plaintext", generates identically the same "compressed" file. See David Scott http://bijective.dogma.net/ for another application that requires this same property.
Say you have Q unique symbols that can be transmitted, each one x with cost c(x). For example, you have a battery-powered device that turns an LED off and on, where each 9 bit symbol always starts with the LED on for one bit time, and the total cost is the battery drain, which only occurs while the LED is turned on. In other situations, the "cost" may also involve the transmission time.
Theorem: To minimize transmission cost, you want to equalize the cost per bit of information transmitted, for each symbol.
Proof: [FIXME]
We assume that the data to be transmitted has already been compressed into a bitstream, such that no matter what values the previous bits have been, the probability that the next bit is a 1 is always approximately 1/2.
The cost can be any arbitrary function of the symbol x. This would commonly include some combination of
If every one of the unique symbols can be transmitted in the same amount of time, and the "time to transmit" dominates the cost, then the minimum cost occurs when all symbols will be transmitted (approximately) equally likely.
Using Arithmetic coding isn't much better than Huffman at equalizing cost/bit when there are lots of symbols, but it can be far superior when there are only 2 or 3 symbols.
In some cases, it may be a good idea to concatenate "simple" symbols to make a "big" symbol. For example, if you have 3 "simple" symbols to work with, you might get better cost/bit if you concatenate 3 of them to create a symbol set of 27 "big" symbols, and then feed the cost of each of those 27 symbols into the reverse Huffman algorithm. Concatenation also simplifies convoluted transmission constraints such as "when transmitting these -1, 0, +1 values, the average must always be 0".
Simple example: We have an LED transmitter / photodiode receiver pair that transmit 4-bit symbols to each other. Every symbol is required to have a minimum of 1 "on" bit to maintain synchronization between transmitter and receiver, so that leaves 15 valid symbols.
The cost for this application is entirely the battery drain when the LEDs are on, a constant 1 unit for each time period. We neglect the time needed to transmit information and the battery power used by the transmitter and receiver while the LED is off.
LED | cost | encoded bits |
0000 | 0 | (never transmitted) |
0001 | 1 | 001 |
0010 | 1 | 01 |
0011 | 2 | 000001 |
0100 | 1 | 10 |
0101 | 2 | 000010 |
0110 | 2 | 000011 |
0111 | 3 | 000000000 |
1000 | 1 | 11 |
1001 | 2 | 000100 |
1010 | 2 | 000101 |
1011 | 3 | 00000001 |
1100 | 2 | 00011 |
1101 | 3 | 00000010 |
1110 | 3 | 00000011 |
1111 | 4 | 000000001 |
Without reverse Huffman transmission, you might transmit each symbol equally often, so each symbol encodes approximately log2(15) =~= 3.9 bits of information (using bit stuffing), at an average cost of (average cost)/(average bits) = (32/15)/(log2(15)) = 0.546 cost/bit.
Or you might transmit only 3 bits per symbol, using the 8 lowest-power symbols, to get a slightly better ratio of (average cost)/(average bits) = (12/8)/(3) = 0.500 cost/bit.
But the optimum cost/bit comes from using reverse Huffman transmission. Then the probability of each symbol being transmitted becomes [64 128 8 128 8 8 1 128 8 8 2 16 2 2 1]/512. The length in bits for each symbol is [3 2 6 2 6 6 9 2 6 6 8 5 8 8 9]. So the average(cost/bit) = sum( probability(i).*cost(i)./bit(i) ) = 0.4611 cost/bit.
We can see how well this choice of bit patterns fits our goal of an equal cost/bit by calculating cost ./ bit = [0.3333 0.5000 0.3333 0.5000 0.3333 0.3333 0.3333 0.5000 0.3333 0.3333 0.3750 0.4000 0.3750 0.3750 0.4444] = [120 180 120 180 120 120 120 180 120 120 135 144 135 135 160]./360 .
Well, it's a lot closer to a constant cost/bit than the simpler strategies above.
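As a sanity check, the 0.4611 figure above can be recomputed mechanically (a sketch; the probability, cost, and bit-length tables are copied from the example above):

```c
#include <assert.h>
#include <math.h>

/* Recompute average(cost/bit) = sum( probability(i) * cost(i) / bit(i) )
 * for the 15-symbol LED example (symbols 0001..1111, table order). */
double reverse_huffman_cost_per_bit(void)
{
    static const double prob[15] = {64,128,8,128,8,8,1,128,8,8,2,16,2,2,1};
    static const double cost[15] = { 1,  1,2,  1,2,2,3,  1,2,2,3, 2,3,3,4};
    static const double bits[15] = { 3,  2,6,  2,6,6,9,  2,6,6,8, 5,8,8,9};
    double total = 0.0, psum = 0.0;
    for (int i = 0; i < 15; i++) {
        /* note each prob[i]/512 is exactly 2^-bits[i], as a Huffman
         * "decompressor" driving the channel requires */
        total += (prob[i] / 512.0) * cost[i] / bits[i];
        psum  += prob[i] / 512.0;
    }
    assert(fabs(psum - 1.0) < 1e-12);   /* probabilities sum to 1 */
    return total;
}
```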
Open question: Is it possible to have a "reverse adaptive Huffman" adapt to changing line conditions?
[FIXME: not a good explanation of end effects. See bijection] End effects: The above neglects what happens at the end of a transmission. What if you are nearing the end of transmitting a message, and you find that the remaining bits to be transmitted are "1100"? Obviously you need to transmit a "1000" pattern (interpreted as "11"); then what could you send that would be interpreted as "00"? One simple protocol would be to append a virtually infinite string of zeros after the message; this protocol would then transmit a "0111" symbol (interpreted as "000000000") to transmit the remaining "00" data bits and a bunch of bogus "0" bits. Hopefully the receiver would properly ignore the extra 0 bits. Perhaps if the message to be transmitted ended in zeros, the transmitter could halt transmissions after the last 1 bit was transmitted, and the receiver would somehow fill in the missing 0 bits. (Appending a virtually infinite string of ones after the message could also be interesting.) Is it possible for the receiver to truncate extra zeros (or even fill in missing zeros) even if it doesn't know the exact length of the original text? Maybe if it knows the original text was an exact multiple of 8 bits in length ... no, that wouldn't work, because in the special case where we lacked 1 bit to make up a full byte, and the final transmitted symbol was "0111", the 9 "0" bits that it decodes into would be ambiguous as to whether they were intended as the final "0" bit of the message, or whether the message ended with another full byte of zeros. (Appending a virtually infinite string of ones looks like it would work in this case ... will it always work?)
from a thread at http://www.deja.com/[ST_rn=md]/threadmsg_md.xp?AN=487814206 and more info at http://www.compressconsult.com/huffman/#maxlength
As you can see from the above, if you have 2^15 symbols (or less) in your data file, you can refer to any one of them using a 15 bit offset from the start of the file. I was incorrect in concluding that therefore the maximum code length would be 15 bits, since Huffman *inserts* lots of symbols (in that mythical sorted file), so the Huffman offset may be much longer than 15 bits.
Alternatively, you can lose a *tiny* amount of compression by using a code table that is slightly less than optimal, to force code lengths to a maximum of 15 bits.
Q: What is the maximum height of a Huffman tree with an alphabet of 8 bits (256 different characters)?

A: The worst-case height is 255. The worst case is, of course, exceedingly improbable in practice. If you have control of the Huffman code assignment (which you probably do as an encoder), then you can set an a-priori limit on the code length. This means that the lowest-probability symbols will have slightly-less-than-optimal code assignments, but since they occur infrequently by definition, the loss in compression ratio is tiny. The resulting code simplifications may be well worth that price. The JPEG standard, for example, requires that no code be longer than 16 bits. (Since this is a requirement of the standard, both encoders and decoders are able to take shortcuts that assume this is the maximum code length.) The standard offers a simple approximate algorithm for adjusting a true Huffman code tree to meet this maximum depth.

regards, tom lane
organizer, Independent JPEG Group
but,
From: "DI Michael Schindler"
Subject: Re: huffman code length
Date: 10 Jun 1999 00:00:00 GMT
Organization: Compression Consulting
Newsgroups: comp.compression

hi Tom, you are correct. Two things you did not mention:

JPEG uses a very small alphabet; in the case here a maximum codelength of 8 bits is not practical (we have a 256 symbol alphabet here!). To estimate the loss, a good measure is the factor by which the number of bits for the longest code exceeds the number of bits needed for simply storing symbols.

The original poster also wanted to know codetable size for a *canonical* code: If you omit the leading zeros/ones (which one depends on whether the shortest code is all zeros or ones) you can store codelength+trailing bits pairs in memory. The number of trailing bits needed is limited by log2(alphabetsize); see my huffman coding webpage http://www.compressconsult.com/huffman/ for details.

Michael
From: "DI Michael Schindler"
Subject: Re: huffman code length
Date: 16 Jul 1999 00:00:00 GMT
Organization: Compression Consulting
Newsgroups: comp.compression,alt.comp.compression,sci.crypt,sci.math

hi! the formula below is right only if you take the longest subtrees in case of equality of weights. If you prefer shorter trees the # of samples rises faster. The reason is that in the case of this worst-case sample distribution you add just about half the number of samples you already have but need an additional bit.

Note the codelength is not important; important is the part of the code that is nonconstant for coding and decoding issues. This can be limited to ceil(log2(ALPHABETSIZE)) bits when using a canonical code (which you should for decoding speed reasons).

Michael
For the "standard" code (where we prefer shorter trees by merging the least-recently-merged tree when there is equality of weights), then the worst-case (pathological) codelength is still (ALPHABETSIZE-1) bits. The pathological case for 7 symbols (a tree with depth 6) is ((((((1 1) 1) 3) 4) 7) 11)
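A quick way to check that pathological example is to build the tree over those weights and measure its depth. A simple O(n^2) sketch (repeatedly merge the two smallest live nodes; for this particular weight sequence the tie-breaking rule doesn't change the resulting depth):

```c
#include <assert.h>

#define NSYM 7

/* Build a Huffman tree over the pathological weights (1 1 1 3 4 7 11)
 * and return the depth of the deepest leaf. */
int pathological_depth(void)
{
    long w[2 * NSYM - 1] = {1, 1, 1, 3, 4, 7, 11};
    int parent[2 * NSYM - 1];
    int live[2 * NSYM - 1] = {0};
    int n = NSYM;
    for (int i = 0; i < NSYM; i++) live[i] = 1;

    for (int m = 0; m < NSYM - 1; m++) {
        /* find the two smallest live nodes a, b */
        int a = -1, b = -1;
        for (int i = 0; i < n; i++) {
            if (!live[i]) continue;
            if (a < 0 || w[i] < w[a]) { b = a; a = i; }
            else if (b < 0 || w[i] < w[b]) { b = i; }
        }
        w[n] = w[a] + w[b];         /* merge into a new internal node */
        live[a] = live[b] = 0;
        parent[a] = parent[b] = n;
        live[n] = 1;
        n++;
    }

    int root = n - 1, deepest = 0;
    for (int i = 0; i < NSYM; i++) {
        int d = 0;
        for (int j = i; j != root; j = parent[j]) d++;
        if (d > deepest) deepest = d;
    }
    return deepest;
}
```

For 7 symbols this yields the claimed worst-case depth of 6 = (number of unique symbols) - 1.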
length of file (# of samples) | maximum codelength |
0..1 | 0 |
2 | 1 |
3..4 | 2 |
5..8 | 3 |
9..14 | 4 |
15..24 | 5 |
25..40 | 6 |
41..66 | 7 |
67..108 | 8 |
109..176 | 9 |
177..286 | 10 |
287..464 | 11 |
465..752 | 12 |
753..1218 | 13 |
1219..1972 | 14 |
1973..3192 | 15 |
3193..5166 | 16 |
5167..8360 | 17 |
The width of each range is 1 more than the lower end of the previous range.
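The width rule implies a simple recurrence for the upper end u(k) of each range: since each range's width is 1 more than the previous lower end, u(k) = u(k-1) + u(k-2) + 2 for k >= 3 (my derivation from the table above, with u(1) = 2, u(2) = 4; it matches every row shown). A sketch that regenerates the right-hand column limits:

```c
#include <assert.h>

/* Largest file length (# of samples) whose worst-case codelength is k,
 * under the "prefer shorter trees" convention, per the table above.
 * Recurrence u(k) = u(k-1) + u(k-2) + 2 is assumed from the table. */
long max_file_length(int k)
{
    long prev = 2, cur = 4;   /* u(1), u(2) */
    if (k <= 0) return 1;
    if (k == 1) return 2;
    for (int i = 3; i <= k; i++) {
        long next = cur + prev + 2;
        prev = cur;
        cur = next;
    }
    return cur;
}
```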
I already knew about preferring shorter trees (easy to do by preferring the least-frequently-merged nodes when building the tree during compression), but somehow I neglected to take that into account when building that table. Thanks for pointing out that flaw. Using standard (prefer shorter trees) encoding, length of file (# of samples) maximum code length (standard encoding) 0..1 0 2 1 3..4 2 5..8 3 9..14 4 15..24 5 25..40 6 41..66 7 67..108 8 ... 1973..3192 15 3193..5166 16 5167..8360 17 The width of each range is the lower end of the previous range. (Did I get it right this time ?) Unfortunately, this doesn't seem to make a significant difference (5166 characters vs. 4180 characters). Using standard (prefer shorter trees) encoding, the worst-case (pathological) case for 8 unique symbols is a depth of 7, ((((((((1+1)+1)+3)+4)+7)+11)+18) = 46 symbols which is almost, but not quite, the standard Fibonacci sequence. Each frequency is the sum of the 2 previous frequencies, but this sequence starts out "1 1 1 3" rather than "1 1". But a depth of 7 can occur with only 41 symbols, albeit with 9 unique symbols: ((((((((1+1)+(1+1))+1)+4)+6)+10)+16) = 41 This is even less like the standard Fibonacci sequence. Each frequency is the sum of the 2 previous frequencies, but this sequence starts out "1 1 1 1 1 4 6" rather than "1 1". Thanks for telling me that, after the initial zeros, even the worst- case canonical codes are relatively short. I didn't know that. I can see that that would definitely simplify Huffman compressors and decompressors. If I have time later, I will see if that really makes my software (which already uses canonical codes) go any faster (or slower). "DI Michael Schindler" mentioned: > the formula below is right only if you take the longest subtrees > in case of equality of weights. If you prefer shorter trees > the # of samples raises faster. 
> The reason is that in the case of this worst-case sample
> distribution you add just about half the number of samples you
> already have but need an additional bit.
>
> Note the codelength is not important; important is the part of
> the code that is nonconstant for coding and decoding issues.
> This can be limited to ceil(log2(ALPHABETSIZE)) bits when using a
> canonical code (which you should for decoding speed reasons).
...
> d_cary at my-deja.com wrote ...
> > the Fibonacci sequence of frequencies
> > gives the maximum possible code length.
> >
> > length of file (# of samples) | maximum code length |
> > 2 | 1 |
> > 3...4 | 2 |
> > 5...7 | 3 |
> > 8...12 | 4 |
> > 13...20 | 5 |
> > 21...33 | 6 |
> > ...... | ... |
> > 1597...2583 | 15 |
> > 2584...4180 | 16 |
> > 4181... | 17 |
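The worst-case depths claimed above are easy to check by machine. Here is a quick Python sketch (my own construction, not from the thread) that builds a Huffman tree with `heapq` and returns the longest code length; the FIFO tie-breaker only roughly approximates "prefer shorter trees", but it reproduces the depth-7 results for both of the 46-sample and 41-sample distributions above.

```python
import heapq
from itertools import count

def max_huffman_depth(freqs):
    """Return the longest Huffman code length for the given symbol frequencies."""
    if len(freqs) < 2:
        return 0
    tie = count()  # FIFO tie-breaker (roughly "prefer shorter trees")
    heap = [(f, next(tie), 0) for f in freqs]  # (weight, tiebreak, deepest leaf)
    heapq.heapify(heap)
    while len(heap) > 1:
        fa, _, da = heapq.heappop(heap)
        fb, _, db = heapq.heappop(heap)
        # merging adds one bit to every code in both subtrees;
        # we only need to track the deepest leaf of the merged subtree
        heapq.heappush(heap, (fa + fb, next(tie), max(da, db) + 1))
    return heap[0][2]
```

For example, `max_huffman_depth([1, 1, 1, 3, 4, 7, 11, 18])` (the 46-sample case) and `max_huffman_depth([1, 1, 1, 1, 1, 4, 6, 10, 16])` (the 41-sample case) both give 7.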
obsolete:
Date: Mon, 31 May 1999 01:30:46 GMT
From: d_cary
Subject: Re: Frequency distribution of Huffman encoding tree

Mok-Kong Shen wrote:
...
> My example given for a balanced tree of 4 symbols appears to
> contradict your formula. (Your range is [0.25, 0.5]).
>
> M. K. Shen

You're right. The formula I posted was wrong. I would really like to know the correct formula (and maybe even a URI or paper reference). Maybe the correct formula is

  2^(-Hi-1) <= fi <= 2^(1-Hi)

where
  fi = ni/L = number of times "i" occurs / total length of file
  Hi = number of bits assigned by Huffman to the "i" code.

? No, that's not right either. For the special case of only 2 symbols, Huffman always assigns 1 bit per symbol, unless the probability of one of the symbols is identically zero (and the other is identically 1). So for the case of 2 symbols, one cannot find a tighter bound than 0 < fi < 1. I'm pretty sure a tighter bound can be found if one has more than 2 symbols with non-zero probability. So the formula must depend not only on the frequency that a symbol occurs, but also on how many symbols there are.
[FIXME: todo: write some code to try out the speculative ideas] [post a summary to the data compression wiki, leaving out the speculative ideas]
2009-01-06:DAV: ideas for short data compression: ... a variant of LZRW: rather than fixed size for all files, perhaps allow compressor to adapt "offset" and "length" field size for the particular file to be compressed. When compressing short files, it's important to be able to compress byte pairs, which repeat far more often than byte triplets. (But I can imagine byte pairs being so rare when compressing large files that it's not worth dealing with them)
LZRW1-A and LZJB both use (in effect)
Contrary to what both Williams and Bonwick claim, 2 byte copies actually *would* save space (albeit only 1 bit each) -- 2 literals take 18 bits, but a 2 byte copy takes 17 bits.
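The arithmetic behind that claim can be spelled out. A tiny sketch, assuming the LZRW1/LZJB-style item sizes (one control flag bit per item, 8-bit literals, 16-bit copy items):

```python
CONTROL_BITS = 1     # each item is preceded by one flag bit: literal or copy?
LITERAL_BITS = 8     # a literal item is one raw byte
COPY_ITEM_BITS = 16  # a copy item is 2 bytes (length field + offset field)

def literal_cost(n_bytes):
    """Bits to emit n_bytes as individual literal items."""
    return n_bytes * (CONTROL_BITS + LITERAL_BITS)

def copy_cost():
    """Bits to emit one copy item, regardless of copy length."""
    return CONTROL_BITS + COPY_ITEM_BITS
```

So `literal_cost(2)` is 18 bits while `copy_cost()` is 17 bits: a 2-byte copy saves exactly 1 bit over 2 literals.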
Perhaps the tradeoff between "length" and "offset" should be made on a file-by-file basis.
Or perhaps go even further, and make "length" and "offset" independently adjustable on a file by file basis. This slows down the decompressor (because the "items" in the compressed file are no longer byte aligned, although we maintain byte alignment in the decompressed data). And it slows down the compressor a lot, because it must search for the optimum "offset" and "length" field sizes.
Does it make sense to somehow make the compressed phrase merely 8 bits, so even a 1 byte copy might take less space than a 1 byte literal (9 bits)?
Rather than the compressor attempting all possible variations of "offset" and "length", is there any way to more quickly calculate the optimum size?
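One cheap alternative to recompressing the whole file for every combination: have the match finder record its candidate (length, offset) pairs once, then evaluate a bit-cost model for each field split. A hypothetical sketch; the 9-bit literal cost (1 control bit + 8 data bits) and the convention that an n-bit field holds values 1..2^n are my assumptions, not from the source:

```python
def compressed_size_bits(phrases, literal_count, len_bits, off_bits):
    """Estimate output size for one (len_bits, off_bits) choice.

    phrases: list of (length, offset) matches the compressor found.
    A match is usable only if it fits both field widths; otherwise
    its bytes fall back to 9-bit literals.
    """
    total = literal_count * 9
    for length, offset in phrases:
        if 0 < length <= (1 << len_bits) and 0 < offset <= (1 << off_bits):
            total += 1 + len_bits + off_bits   # control bit + copy item
        else:
            total += length * 9                # fall back to literals
    return total

def best_field_split(phrases, literal_count, item_bits=16):
    """Brute-force the best way to split a fixed-size copy item."""
    return min(range(2, item_bits - 1),
               key=lambda lb: compressed_size_bits(
                   phrases, literal_count, lb, item_bits - lb))
```

This is only an estimate (matches that fall back to literals might instead be re-found as shorter matches), but it turns the search over field sizes into a cheap scoring loop instead of repeated compression runs.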
... also: perhaps mash in a "universal code" (variable length) for either the offset or the length or both ... ... also: perhaps the complex "escape code" system used in pucrunch in order to reduce the expansion of completely random data ...
For example, using 10 bits for literals and 9 bits for compressed phrases and for the most common single bytes (not the most common bytes in the source text; the most common bytes that would otherwise be encoded as 10-bit literals). (Using 1 or 2 bits in the "control byte", and a single byte for the remainder, would keep the compressed code byte aligned.)
DAV's example for 9 bit compressed phrases:
compressed file:
* special ID code
* other metadata
* # of "common single bytes" used
* table of "most common single bytes"
* # of lengths used
* table of "most common lengths"
* # of bits used in offset field
* zero or more items
* CRC
Each item is either a literal item, a "substitute single byte" item, or a copy item. If the length field is, for example, 3 bits, then
000: copy 1 byte
001: copy 2 bytes
010: copy 3 bytes
011: copy 4 bytes
100: if offset is one of the L special values, use a substitute literal. Otherwise, copy 6 bytes.
101: copy 8 bytes.
11x: x and the following 7 bits are a literal byte.
The "substitute single byte" only makes sense when the offset field is short enough that a "substitute single byte" is shorter than a literal item.
A single byte copy can be shorter than a literal. For example, when the "length" is 3 bits and "offset" is 5 bits, a copy item is 8 bits, but a literal is 9 bits. (I'm not sure this happens often enough to actually save any bits. With short source texts, the overhead of the table may make it not worthwhile. With long source texts, where the overhead of the table is negligible, using those codes for more "copy" options may give a shorter compressed file than using those codes for "substitute bytes".)
To extend the range of those 5 bits of offset, the offset is not a literal offset in bytes, but a context offset -- given the previous byte, the Nth time the previous byte occurred.
Unlike Huffman compression of symbols emitted by a memoryless source, there are several valid ways to encode the same data using these "length" fields, because a long match can be split into copy items in more than one way. For example, say a block 301 bytes long is exactly repeated.
Various ways to optimize "continue copying from where we left off". Such a thing is common in English text. Some common cases include:
If we use special "extra long" lengths, the obvious thing is to make them powers of 2 ... but is that really optimum? Perhaps some other lengths would be better. For example, rather than 2^n (natural binary), use (2^n)-1 instead (Booth encoding). In particular, I think I would like to try the lengths used in Combsort11: 1, 2, 3, 4, 6, 8, 11, ... with a shrink factor of around 1.24 ... 1.5.

fast decoding: look up lengths in table, or use an easily-calculated function:
* f(x) = (1<<x) ("obvious powers of 2"): 1, 2, 4, 8, 16, 32, ...
* f(x) = floor(x*x / 4) (squaring, rather than the exponential I want ... but still better than linear): 0, 0, 1, 2, 4, 6, 9, 12, 16, 20, 25, 30, 36, ...
* f(x) = floor(x*x / 8) (squaring, rather than the exponential I want ... but still better than linear): 0, 0, 0, 1, 2, 3, 4, 6, 8, 10, 12, 15, 18, ...
* f(x) = ( 2 + (1&x) ) << (x>>1) (approximately powers of sqrt 2): 2, 3, 4, 6, 8, 12, 16, 24, 32, ...

If I hit a phrase that isn't exactly the same length as the available lengths, then I break it up into pieces, starting with the next-shortest phrase length and working down. But should I emit those pieces longest-to-shortest or shortest-to-longest? For example, should I break a phrase of length 10 up into 8, 2 or 2, 8? And if we're not doing powers of two, there seem to be several different ways to break them up: a phrase of length 10 could also be broken into 6, 4. Of course, this is all moot if we use a variable-length "universal code" for the length.
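The candidate length tables and the greedy longest-first split can be sketched in a few lines of Python (my own sketch; the greedy split ignores the rare case where no combination lands exactly on the target):

```python
def length_table(f, n):
    """First n values of a copy-length formula, e.g. for a decoder lookup table."""
    return [f(x) for x in range(n)]

powers_of_two   = length_table(lambda x: 1 << x, 6)
quarter_squares = length_table(lambda x: x * x // 4, 13)
eighth_squares  = length_table(lambda x: x * x // 8, 13)
near_sqrt2      = length_table(lambda x: (2 + (1 & x)) << (x >> 1), 9)

def split_phrase(total, lengths):
    """Greedily split a match into available copy lengths, longest first."""
    out = []
    for step in sorted(set(lengths), reverse=True):
        if step <= 0:
            continue
        while total >= step:
            out.append(step)
            total -= step
    return out
```

With powers of two, `split_phrase(10, [1, 2, 4, 8])` gives `[8, 2]` (the longest-first order); emitting shortest-to-longest would just reverse that list.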
Lots of standard test images and standard test text files, plus comparisons of various compression programs.
Maximum Compression's goal is to show the maximum achievable compression ratio for several filetypes (text, executable, etc.). The best 120+ programs for every filetype are compared in a table indicating compression ratios and ...
http://MaximumCompression.com/
best compression ratio on:
Many data compression algorithms (1D and 2D) are designed to work with large amounts of data, and compress them all-or-nothing.
These actually *expand* the data (by a few bytes) if one tries to use them for "short" snippets of text or other data less than a few hundred bytes long.
If one has a large database, those data compression algorithms can give excellent compression, but it takes a very long time to decompress the database and find something in it.
Here I've listed algorithms that are a bit less efficient on "large" blocks of data, but are appropriate for "small" amounts of data (or for random-access indexing into large databases with small amounts of data at each entry). Perhaps you could think of them as 0D.
One might think that low-latency algorithms must all be 0D, operating on small blocks of data, but [elsewhere] there are many "single-pass" and "adaptive" algorithms with low latency (such as LZW compression, adaptive Huffman compression, etc.).
[FIXME: split out into separate section] I also have some "small" data compression algorithms, designed to be used on slow machines with very little RAM and ROM. There are several applications, and several variants that trade off speed, RAM, and ROM. One semi-popular application is "I want to squeeze all kinds of cool stuff into my video game, but I want it all to fit on one disk / one ROM. Oh, and I want it to boot fast, too." We want to create a bootable executable on that disk that loads the rest of the info off the disk and decompresses it. In this case, we want to minimize the total size (decompression binary + compressed video game data) -- since smaller size means it takes less time to get it off the disk. On one hand, we want the algorithm very simple (so that the decompression binary is small); on the other hand, we want the algorithm very sophisticated (so that the compressed video game data is small).
Fixed Length Compression:
Bigrams / Digrams
Storage Unit: 8 bits (0-255)
- Use 0-87 for blank, upper case, lower case, digits and 25 special characters
- Use 88-255 for bigrams (master + combining)
- master (8): blank, A, E, I, O, N, T, U
- combining(21): blank, plus everything [every lower case letter] but J, K, Q, X, Y, Z [DAV:... my list of letter frequencies gives etaoins hrdlcumwfgypb vkjxqz -- is there a bigram reason why Y was excluded instead of V? Or why U was included instead of S ?]
- total codes: 88 + 8 * 21 = 88 + 168 = 256
- Pro: Simple, fast, requires little memory.
- Con: Based on a small symbol set
- Con: Maximum compression is 50%.
- Average is lower (33%?).
- Variation: 128 ASCII characters and 128 bigrams.
- Extension: Escape character for ASCII 128-255.
...
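The fixed-length bigram scheme above is easy to prototype. A Python sketch follows; the source only fixes the set sizes (88 singles + 8*21 bigrams = 256), so the exact 25 special characters and the code numbering (singles 0..87, then 88 + master*21 + combining) are my guesses:

```python
# Hypothetical concrete tables -- only the sizes (88, 8, 21) come from the source.
SINGLES = (" " + "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
               + "abcdefghijklmnopqrstuvwxyz"
               + "0123456789"
               + ".,;:!?'\"()-/$%&*+=<>@#[]_")   # 1+26+26+10+25 = 88 codes 0..87
MASTERS   = " AEIONTU"                            # 8 possible first characters
COMBINING = " abcdefghilmnoprstuvw"               # blank + 20 letters (no j,k,q,x,y,z)

def encode(text):
    """Greedy encoder: emit a bigram code whenever the next two
    characters form a (master, combining) pair, else a single code.
    Characters outside SINGLES would raise ValueError (sketch only)."""
    out, i = [], 0
    while i < len(text):
        a = text[i]
        b = text[i + 1] if i + 1 < len(text) else None
        if b is not None and a in MASTERS and b in COMBINING:
            out.append(88 + MASTERS.index(a) * 21 + COMBINING.index(b))
            i += 2
        else:
            out.append(SINGLES.index(a))
            i += 1
    return bytes(out)

def decode(codes):
    out = []
    for c in codes:
        if c < 88:
            out.append(SINGLES[c])
        else:
            m, k = divmod(c - 88, 21)
            out.append(MASTERS[m] + COMBINING[k])
    return "".join(out)
```

For example, `encode("The cat")` packs "Th" and " c" as bigrams, so 7 characters become 5 bytes. Note that because the masters are blank and uppercase letters, bigrams mostly fire at word boundaries, which is where the savings come from in mixed-case text.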
Variable-Length Encoding: Huffman Codes
- ...
- Variation: Can be used on words (as opposed to characters).
- English text, with symbols for characters, is approximately 5 bits per character
- English text, with symbols for characters and 800 frequent words, yields 4.8-4.0 bits per character
- ...
-- Copyright © Jamie Callan, Bruce Croft, and/or James Allan http://ciir.cs.umass.edu/cmpsci646/Slides/ir11%20compression.pdf
DAV: hm... perhaps put some of the most-frequent words into those "special characters":
http://ciir.cs.umass.edu/cmpsci646/Slides/ir10%20text%20stats.pdf Top 50 words from 423 short TIME magazine articles (243,836 word occurrences, lowercased, punctuation removed, 1.6 MB) :
Word : Freq
---
the 15659
of 7179
to 6287
a 5830
and 5580
in 5245
that 2494
for 2197
was 2147
with 1824
his, is, he, as, on, by, at, it, from, but, u (? U-Haul ?), had, last, be, who, has, not, an, s (I suppose this is from possessives with apostrophe+s), have, were, their, are, one, they, its, all, week, government, when, would, been, out, new, which, up, more, into, only, will 488
2 main variants:
What are the different ways of representing integers ?
see fixme - floating point; endian
[FIXME: this leaves out gray codes, PN shift codes ...]
[FIXME: point to different ways of representing real numbers -- IEEE floating point, continued fraction representation, factorial notation, etc.]
a general overview of the different compression algorithms (Huffman, LZ77), and has the best section on "Representing Integers" I've seen on the web, covering Elias Gamma Code, Elias Delta Code, Fibonacci Code, Golomb and Rice Codes, ...
[FIXME: summarize this conversation]
From: CBFalconer
Subject: Re: compact storage of integer values
Date: 01 Feb 2001 00:00:00 GMT
Approved: clc at plethora.net
Organization: AT&T Worldnet
Reply-To: cbfalconer at worldnet.att.net
Newsgroups: comp.lang.c.moderated

Harald Kirsch wrote:
> Situation: a HUGE (no HHHUUUGGGEEE) number of mostly small data
> elements must be kept in memory. Their size is stored along with
> them. 7 bit are sufficient to hold the size most of the
> time. Nevertheless larger sizes can be expected.
>
> I came up with the following scheme, where 7 should be read (CHAR_BIT-1):
>
> If size fits in 7 bit, store it.
> If size fits in n*7 bits:
>   store (n-1) times 7 bit in a char and set the 8th bit
>   store the last 7 bit in the last char (without 8th bit set)
>
> The idea is to use the high-bit of a char to indicate if more bytes
> follow.

I suggest you first evaluate what the MAX value can be. I suspect you don't need the infinite linking. If MAX is expressible in N bytes, then use 0..255-N+1 as values, and M = N+2 .. 255 to signify that the following M-N bytes hold a value. I think this has less wastage, crams more into the 1 byte representation, etc.

-- 
Chuck F (cbfalconer at my-deja.com) (cbfalconer@XXXXworldnet.att.net)
http://www.qwikpages.com/backstreets/cbfalconer
(Remove "NOSPAM." from reply address. my-deja works unmodified)
mailto:uce@ftc.gov (for spambots to harvest)
-- 
comp.lang.c.moderated - moderation address: clcm at plethora.net

[This is in response to a suggestion by Hans-Bernhard Broeker and Andy Isaacson, who (independently ?) proposed a specialized version of this general scheme with N = 128 which is simple to decode:
hi bit = 0: this byte is a small literal (0...127).
hi bit = 1: clear the hi bit, then use the result as the number of 8 bit chunks following.]
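Kirsch's continuation-bit scheme, as quoted, stores the high-order 7-bit groups first, with the high bit of every byte except the last meaning "more bytes follow". A Python sketch (function names are mine):

```python
def put_size(n):
    """Encode n >= 0 in 7-bit groups, most significant group first;
    the high bit is set on every byte except the last."""
    groups = []
    while True:
        groups.append(n & 0x7F)
        n >>= 7
        if n == 0:
            break
    groups.reverse()  # most significant 7-bit group first, per the scheme
    return bytes([g | 0x80 for g in groups[:-1]] + [groups[-1]])

def get_size(data, pos=0):
    """Decode one size starting at pos; returns (value, next_position)."""
    n = 0
    while True:
        b = data[pos]
        pos += 1
        n = (n << 7) | (b & 0x7F)
        if not (b & 0x80):
            return n, pos
```

So sizes 0..127 cost one byte, 128..16383 cost two, and so on; e.g. `put_size(128)` is the two bytes 0x81 0x00.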
To: "John S. Fine" <johnfine at erols.com>, Pasi Ojala <albert at cs.tut.fi>, Antaeus Feldspar <feldspar at cryogen.com> From: David Cary <d.cary at ieee.org> Subject: "little workspace" decompressors Dear "John S. Fine" <johnfine at erols.com> and Pasi Ojala <albert at cs.tut.fi> and Antaeus Feldspar <feldspar at cryogen.com>, I saw you all in the comp.compression newsgroup. You all seem to be interested in "little workspace" decompressors. I just thought I'd point out your web pages to each other. I am really impressed with the clever scheme Pasi Ojala came up with. How does the Elias Gamma codes (that Pasi Ojala re-invented) compare with the functionally-similar variable-length code John Fine invented ? Both map more common (smaller value) integers to shorter (in bits) prefix codes. (Of course, knowing the actual value frequencies and applying huffman gives one of several optimum encodings. What are the actual value frequencies for typical files ?). For this application, we also want a code that a very small/fast routine can decode (slightly less-than-optimum encoding can trade off for smaller/faster decode routine). >From: "John S. Fine" <johnfine at erols.com> >Take bits in triplets from high order to low order, with the rightmost >bit in each triplet used as a stop bit: >001=0 011=1 ... 111=3 000001=4, 000011=5 ... 010001=8 ... 110111=19 >... 000000001=20 ... 110110111=83 etc. It may not be obvious, but >the above can be decoded in a trivial amount of x86 assembler code >with no tables or workspace. Pasi Ojala < albert at cs.tut.fi > http://www.cs.tut.fi/~albert/Dev/pucrunch/ some interesting thoughts on data compression -- -- optimizing for very low-RAM decompression agents -- minimizing *both* decompression *code* as well as intermediate data used in decompression. ("in-place" decompression; the compressed data includes the decoder program; RAM is smaller than the total compressed data + total decompressed data) Includes source code implementing his ideas. 
>Subject: Comment on these compression ideas, please
>From: "John S. Fine" <johnfine at erols.com>
>Date: 1998/06/27
>Newsgroups: comp.compression
>...
>http://www.erols.com/johnfine/

I'm thinking about building very tiny embedded processors that, instead of communicating status via LEDs or a tiny LCD panel, use the serial port to send full-fledged HTML pages (with pointers to pretty graphics on some other machine) as status; perhaps even feedback forms for control. Since these processors have only a tiny amount of RAM (but slightly larger amounts of ROM), Antaeus Feldspar has a good idea to try to apply a bit of compression.

p.s.: Given a device spewing HTML pages out its serial port, does any of you know the techy details involved in actually displaying them on a commodity PC using some web browser? I *could* use a terminal program, capture-as-text, save, then load into my web browser -- but surely there's a better way.

From: "Ojala Pasi 'Albert'" <albert at cs.tut.fi>
Subject: Re: "little workspace" decompressors
To: d.cary at ieee.org (David Cary)
Date: Fri, 10 Jul 1998 11:53:14 +0300 (EET DST)
Cc: johnfine at erols.com, albert at cs.tut.fi, feldspar at cryogen.com
MIME-Version: 1.0

> How does the Elias Gamma codes (that Pasi Ojala re-invented) compare
> with the functionally-similar variable-length code John Fine invented ?

(Note: do not swallow before biting a bit. I may jump to conclusions here.)

John's code assumes (and benefits from) a more flat distribution. The code can be transformed into:

   n      n+1
  0...0 1 (bb)...(bb)

by rearranging the "stop bits" to the beginning of the code. So, in fact we have a unary code which tells us how many actual bits to read (times 2). (Btw, rearranging the code like this may even make it easier and faster to decode.)

We can compare this to the Elias Gamma Code:

   n      n
  0...0 1 b...b

The code lengths for the first couple of values become

  Gamma:  1 3 3 5 5 5 5 7 7 7 7 7 7 7 7 9 9 9 9 9 9 9 9 ...
  John's: 3 3 3 3 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 9 9 9 ...
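Both length sequences tabulated above are easy to regenerate mechanically. A Python sketch (my own; `john_len` numbers values from 0, as in John's original post, while gamma codes start at 1):

```python
import math

def gamma_len(n):
    """Bit length of the Elias gamma code for n >= 1."""
    return 2 * int(math.log2(n)) + 1

def john_len(v):
    """Bit length of John Fine's code for v >= 0:
    U triplets (2 data bits + 1 stop bit each) hold 4**U values,
    after the 4 + 16 + ... values taken by shorter codes."""
    u, first = 1, 0
    while True:
        if v < first + 4 ** u:
            return 3 * u
        first += 4 ** u
        u += 1
```

Running these for the first couple of values reproduces the "1 3 3 5 5 5 5 7..." and "3 3 3 3 6 6 ... 9" sequences exactly, which makes the crossover points between the two codes easy to see for any assumed value distribution.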
So, the result depends entirely on the value distribution.

> What are the actual value frequencies for typical files ?

_My_ typical frequencies were very skewed, that's why I selected Gamma Code. Your typical files probably differ :-)

Oh, and btw, my literal tag system automatically takes advantage of 7-bit literals and literal locality.

-Pasi
-- 
"We believe that no race can be truly intelligent without laughter."
        -- Delenn to Sheridan in Babylon 5: "A Race Through Dark Places"

Date: Sun, 12 Jul 1998 08:52:08 -0400
From: John Fine <johnfine at erols.com>
Reply-To: johnfine at erols.com
MIME-Version: 1.0
To: "Ojala Pasi 'Albert'" <albert at cs.tut.fi>
CC: David Cary <d.cary at ieee.org>, feldspar at cryogen.com
Subject: Re: "little workspace" decompressors

Ojala Pasi 'Albert' wrote:
> > How does the Elias Gamma codes (that Pasi Ojala re-invented) compare
> > with the functionally-similar variable-length code John Fine invented ?

There are actually a whole range of variable-length codes that you can get by unary coding a number U and having (aU+b) data bits. Note that some of the value is carried in the length bits, because you never use a longer code than is needed for a given value. If I understand correctly, Elias Gamma uses (a=1, b=-1). My code has (a=2, b=0). (Note, I am taking U to be the total number of length bits, not the value encoded by them.)

> John's code assumes (and benefits from) more flat distribution.

Actually I used that flatter one for offset and a code more like Elias Gamma (which I had never heard of) for length.

> by rearranging the "stop bits" to the beginning of the code. So, in fact
> we have a unary code which tells us how many actual bits to read (times 2).
> (Btw, rearranging the code like this may even make it easier and faster
> to decode.)

I am fairly sure that keeping the length bits and content bits mixed results in smaller and simpler code (probably faster as well).
Even at the one to one ratio of Elias Gamma, I think mixing the bits would give simpler code.
-- 
http://www.erols.com/johnfine/
http://www.geocities.com/SiliconValley/Peaks/8600/

Date: Mon, 13 Jul 1998 11:45:56 -0400
From: "John S. Fine" <johnfine at erols.com>
Reply-To: johnfine at erols.com
MIME-Version: 1.0
To: David Cary <d.cary at ieee.org>
CC: "Ojala Pasi 'Albert'" <albert at cs.tut.fi>
Subject: Re: "little workspace" decompressors

David Cary wrote:
> They all unary code a number U >= 0 with U zeros, then a one, followed by
> (aU+b) data bits (where b >= 0, and usually a > 0), for a total length of
> (U(a+1) + 1 + b) bits.

I was using U > 0 including the "stop" bit; but it describes the same thing, so we can use your form.

> This plain flavor of the Elias Gamma (a=1, b=0) code seems easy to decode
> since the code *is* the value of the code; one just needs to know how many
> bits to grab. But I suspect rearranging a la John's method creates a decode
> routine much shorter and faster than the routine needed to decode the plain

I would guess a LITTLE shorter and MAYBE faster.

> On the other hand, I just discovered some codes [*] that are *better* than
> these (aU+b) codes in the sense that, no matter what the distribution
> of integers (as long as it is monotonic), the compressed file will be
> smaller than if we used any of these (aU+b) codes.

I don't think that is possible.

> Perhaps something like this would be close enough to Fibonacci,
> yet still be easy to decode:
> 000 to 101 are the 6 different terminal codes
> (use base 6: multiply accumulated total by 6, add 0..5 to make final result)
> 110 to 111 are the 2 different non-terminal codes.
> (use base 2: multiply accumulated total by 2, then add 0 or 1)
> Since (we assume) smaller numbers are always *more* common than large numbers,
> the 2 codes where "base 6" is worse than John's "stop bit" code
> are less common than the 2 codes where "base 6" is better than John's "stop
> bit" code -- the net effect is using this "base 6" code compresses to fewer
> bits than using the "stop bit" code -- no matter what the actual
> distribution is.

You stopped too early in examining the distribution. You have simply picked a code that does low numbers slightly better and high numbers MUCH worse (you thought it did high numbers SLIGHTLY worse because you stopped too soon).

Compare 3-bit codes with 6 terminal values to 3-bit codes with 4 terminal values. This table shows the total number of values that can be represented in N or fewer bits.

bits | 6T | 4T |
   3 |  6 |   4 |
   6 | 18 |  20 |
   9 | 42 |  84 |
  12 | 90 | 340 |

-- 
http://www.erols.com/johnfine/
http://www.geocities.com/SiliconValley/Peaks/8600/

From: "Ojala Pasi 'Albert'" <albert at cs.tut.fi>
Subject: Re: "little workspace" decompressors
To: johnfine at erols.com
Date: Wed, 15 Jul 1998 13:26:21 +0300 (EET DST)
Cc: albert at cs.tut.fi, d.cary at ieee.org, feldspar at cryogen.com
MIME-Version: 1.0

> I am fairly sure that keeping the length bits and content
> bits mixed results in smaller and simpler code (probably
> faster as well).

Well, of course this depends on the architecture of the chip performing the decode. In my case the most effective way to read bits is one at a time (through Carry), thus they _are_ read one at a time. There is of course the loop overhead, which could be reduced by making two calls to the getbit routine. The decode code would become slightly larger (6 bytes ~ 2%) because I still need the N b routine for the linear code decode.
And after really thinking about it, if and when you can easily handle multiple bits at a time (barrel shifter available), using this stop-bit arrangement does indeed seem faster (fewer jumps per bit).

> I think mixing the bits would give simpler code.

Unary + linear is simpler code. If you mean "faster/simpler to decode", then you are absolutely right. I may even try it myself, although my main concern is the size of the decoder. I would need a definite speed increase (and there are several ways to increase the speed of my decoder, but they need a longer decoder).

-Pasi

To: johnfine at erols.com, "Ojala Pasi 'Albert'" <albert at cs.tut.fi>
From: David Cary <d.cary at ieee.org>
Subject: Re: "little workspace" decompressors

Yes, rearranging the bits makes no difference in how many bits there are (how big the output file will be). I think John is right in saying that, given any particular code, if we scramble up the bits we can make the decoder program shorter and faster. I never considered scrambling the bits before, but now I think it's cool.

I think that all the (aU+b) codes John mentioned have (at least one) "easy-to-decode" scrambled version. They all unary code a number U >= 0 with U zeros, then a one, followed by (aU+b) data bits (where b >= 0, and usually a > 0), for a total length of (U(a+1) + 1 + b) bits. (Ojala Pasi uses the equivalent U ones, then a zero -- I guess that was easier to decode for him.) The Elias gamma code has a = 1, b = 0. Plain 7-bit ASCII uses a=0, b=7. John's code has a=b=2.

Perhaps it would be interesting for the compressor program to select some "optimum" value of a and b for the particular file being compressed. Especially when, as with the Pasi Ojala algorithm, the decompressor program is embedded in the compressed data and can be optimized for that particular value of a and b. Perhaps the compressor would only try a couple of values for a and b that are particularly easy to decode.
For example, we could generalize John's code to all the b=a codes: scramble 1 "continue" bit with (a) bits of data to make fixed-length blocks of (a+1) bits (where a=2 is John's code). (One could brute-force try compressing with a=1, then a=2, then a=3. Perhaps there is an efficient algorithm for the compressor to decide which a is optimal ... it depends heavily on the distribution of values ... we want a code that "looks like" the optimal Huffman code.)

This plain flavor of the Elias Gamma (a=1, b=0) code seems easy to decode since the code *is* the value of the code; one just needs to know how many bits to grab. But I suspect rearranging a la John's method creates a decode routine much shorter and faster than the routine needed to decode the plain version.

     plain      scrambled (1x = last block)
 1   1          1
 2   010        0 10
 3   011        0 11
 4   00100      0 00 10
 5   00101      0 00 11
 6   00110      0 01 10
 7   00111      0 01 11
 8   0001000    0 00 00 10
 9   0001001    0 00 00 11
 A   0001010    0 00 01 10
 B   0001011    0 00 01 11
 C   0001100    0 01 00 10
 D   0001101    0 01 00 11
 E   0001110    0 01 01 10
 F   0001111    0 01 01 11
10   000010000  0 00 00 00 10
11   000010001  0 00 00 00 11

Both codes are, of course, identical in bit length, number of zeros, number of ones, etc. -- the only difference is in the size and speed of the decompression program.

On the other hand, I just discovered some codes [*] that are *better* than these (aU+b) codes in the sense that, no matter what the distribution of integers (as long as it is monotonic), the compressed file will be smaller than if we used any of these (aU+b) codes. Unfortunately, I have not (yet) figured out a small routine to decode these codes. I suppose I could just have a lookup table for the first N codes, then revert back to John's code if it is one of the rare items not in the lookup table.
This may be a win in size (if the extra compaction in the data compensates for the larger program) and speed (if the most common table-lookup is faster than bit-by-bit decoding), but it just doesn't seem elegant.

I've been trying to think of how to take advantage of "edge effects" in small-file compression. As a particular example, after the decompressor has decoded, say, 200 bytes, if it were really intelligent then it would "know" that an offset value in an (offset, length) LZ77 code couldn't possibly be more than 200 bytes; so perhaps we can somehow shorten the length (in bits) of the codes that *are* possible.

More generally, it doesn't seem "right" to me that most compression algorithms get significantly different compressed file sizes on a file vs. the time-reversed image of that file. Huffman is the only algorithm I know of that gets exactly the same compressed file size either way. If I could somehow make the conditions when the decompressor is near the end of the file "more like" the conditions when it is near the start of the file, perhaps I could squeeze out a few more bits of compression. The length of the uncompressed file is usually known to the compressor before compression even begins; the decoder often has a rough estimate (and sometimes the exact value) of the total compressed code length and the total decompressed data length; perhaps it can take advantage of this meta-information...

[*] The Elias delta and the Fibonacci codes. 2 radically different ways of overcoming the inherent inefficiency of the unary number encoding. Elias delta has 3 parts: a unary number encoding U, then 2^N bits encoding V, then 2^V bits encoding the actual desired number. I think I remember reading that Elias delta was already "optimal", in some sense -- there was no advantage to going to 4 parts. (But I don't really understand why that was true.)
Fibonacci codes, rather than having a unary code of 0 bits that ends at the 1st 1 bit, instead have a more compact code that ends the first time it hits 2 consecutive 1 bits. This is similar to the "bit-stuffing" in most long-distance communication protocols (and CD-ROM format and some magnetic disk formats), where the data is not allowed to have N 0 bits in a row (because the receiver would lose clock sync). *Every* time there are (N-1) "0 bits" in a row, the transmitter *always* stuffs a "1 bit". When decoding, every time there are (N-1) "0 bits", always throw away the following "1 bit".

Is there an efficient way of decoding these values (perhaps by scrambling the bits)? Or a code similar to this (better than (aU+b)) that can be decoded with a short program? (perhaps read pairs of bits, and stop when both bits equal 11 -- meanwhile decode using base 3 ?)

I first read about Elias delta and Fibonacci codes at http://www.ics.uci.edu/~dan/pubs/DC-Sec3.html#Sec_3.3

>3.3 Universal Codes and Representations of the Integers ...
>The first Elias code is one which is simple but not optimal. This code, gamma ...
>Elias delta [is optimal] ...
>While the Fibonacci codes are not asymptotically optimal, they compare well to the Elias codes as long as the number of source messages is not too large. Fibonacci codes have the additional attribute of robustness, which manifests itself by the local containment of errors. ...
>a Fibonacci code provides better compression than the Elias code until the size of the source language becomes very large. ...
>We describe only the order 2 Fibonacci code; the extension to higher orders is straightforward.
>
> N   R(N)             F(N)
>
>  1   1               11
>  2   1 0             011
>  3   1 0 0           0011
>  4   1 0 1           1011
>  5   1 0 0 0         00011
>  6   1 0 0 1         10011
>  7   1 0 1 0         01011
>  8   1 0 0 0 0       000011
> ...
> 16   1 0 0 1 0 0     0010011
> ...
> 32   1 0 1 0 1 0 0   00101011
>
>      21 13 8 5 3 2 1
>
>Figure 3.7 -- Fibonacci Representations and Fibonacci Codes.
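The F(N) column of Figure 3.7 can be reproduced mechanically from the Zeckendorf representation (every positive integer is a sum of non-adjacent Fibonacci numbers). A Python sketch of the order-2 encoder, my own construction:

```python
def fib_encode(n):
    """Order-2 Fibonacci code for n >= 1: Zeckendorf digits written
    least-significant Fibonacci number first, then a final 1 bit,
    so every codeword ends in '11' and contains no earlier '11'."""
    fibs = [1, 2]                       # Fibonacci numbers 1, 2, 3, 5, 8, ...
    while fibs[-1] < n:
        fibs.append(fibs[-1] + fibs[-2])
    bits = []
    for f in reversed(fibs):            # greedy: largest Fibonacci first
        if f <= n:
            bits.append("1")
            n -= f
        else:
            bits.append("0")
    while bits and bits[0] == "0":      # drop leading zeros of big-endian form
        bits.pop(0)
    bits.reverse()                      # least significant Fibonacci first
    return "".join(bits) + "1"
```

For example, `fib_encode(4)` is "1011" (4 = 1 + 3) and `fib_encode(32)` is "00101011" (32 = 3 + 8 + 21), matching the table above.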
Perhaps something like this would be close enough to Fibonacci, yet still be easy to decode:
000 to 101 are the 6 different terminal codes
(use base 6: multiply accumulated total by 6, add 0..5 to make final result)
110 to 111 are the 2 different non-terminal codes.
(use base 2: multiply accumulated total by 2, then add 0 or 1)

 base 6       "stop bit"
 000          001
 001          011
 010          101
 011          111
 100          000 001
 101          000 011
 110 000      000 101
 110 001      000 111
 110 010      010 001
 110 011      010 011
 110 100      010 101
 110 101      010 111
 111 000      100 001
 111 001      100 011
 111 010      100 101
 111 011      100 111
 111 100      110 001
 111 101      110 011
 110 110 000  110 101
 110 110 001  110 111
 110 110 010  000 000 001
 110 110 011  000 000 011
 110 110 100  000 000 101
 110 110 101  000 000 111
 110 111 000  000 010 001
 ...

Since (we assume) smaller numbers are always *more* common than large numbers, the 2 codes where "base 6" is worse than John's "stop bit" code are less common than the 2 codes where "base 6" is better than John's "stop bit" code -- the net effect is that using this "base 6" code compresses to fewer bits than using the "stop bit" code -- no matter what the actual distribution is. But this probably increases the complexity of the decoder.

From: John Fine <johnfine at erols.com> ...
> There are actually a whole range of variable-length codes
> that you can get by unary coding a number U and having
> (aU+b) data bits. Note that some of the value is carried
> in the length bits, because you never use a longer code
> than is needed for a given value.
...
> Actually I used that flatter one for offset and a code more
> like Elias Gamma (which I had never heard of) for length.
...
> I am fairly sure that keeping the length bits and content
> bits mixed results in smaller and simpler code (probably
> faster as well). Even at the one to one ratio of Elias
> Gamma, I think mixing the bits would give simpler code.
>
>--
>http://www.erols.com/johnfine/
>http://www.geocities.com/SiliconValley/Peaks/8600/

>From: "Ojala Pasi 'Albert'" <albert at cs.tut.fi>
...
>The code lengths for the first couple of values become
> Gamma:  1 3 3 5 5 5 5 7 7 7 7 7 7 7 7 9 9 9 9 9 9 9 9 ...
> John's: 3 3 3 3 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 9 9 9 ...
>So, the result depends entirely on the value distribution.
...
>-Pasi

(posted to comp.compression; also mailed direct)
-- + David Cary "http://www.ionet.net/~caryd_osu/david" | Future Tech, Unknowns, PCMCIA, digital hologram, <*> O-

To: "John S. Fine" <johnfine at erols.com>, Pasi Ojala <albert at cs.tut.fi>, Antaeus Feldspar <feldspar at cryogen.com>
From: David Cary <d.cary at ieee.org>
Subject: Re: "little workspace" decompressors

Yes, John is quite right -- the 6T code I just made up gets worse very quickly at high numbers. If I were lucky enough to have an input file that needed these large numbers extremely rarely, then my 6T code would do a better job of compression than the 4T code; but if the data has a more even distribution, then the 4T code would be better.

Is this true for the Fibonacci codes as well -- are they always better, as I originally claimed, or is it data-dependent ?

1st hand: It *seems* like Fibonacci would always be better -- if one looks at all the length N unary codes (there is only 1: (N-1) zeroes, followed by a 1), and compares it to all the length N Fibonacci codes (which include the unary code, plus some other codes with lots of zeros with single "1" bits scattered through them, followed by a "11"), it seems there are always more Fibonacci codes of a specific length (which means it can represent more data). (I'm having a little difficulty coming up with a formula predicting exactly how many Fibonacci codes there are of length N.)

2nd hand: On the other hand, one can use the "pigeon-hole", "counting" argument -- since, when decoding, all possible bit sequences mean *something* unique, there is no redundancy.
It would then seem that one cannot guarantee that one code will always get better compression than another. In fact, one can pick any code, then synthesize an artificial data file to make that code look the best by using numbers with a frequency proportional to 1/(#bits needed to represent that number in this code). How do I resolve the contradiction between these 2 hands ?

U   U-bit unary codes   U-bit Fibonacci codes
1   1: 1                0
2   1: 01               1: 11
3   1: 001              1: 011
4   1: 0001             2: 0011, 1011
5   1: 00001            3: 00011, 01011, 10011
6   1: 000001           5: 000011, 001011, 010011, 100011, 101011
u   1: 0...01           f(u) (is there a formula ?)

(neglecting the aU+b data bits that follow the code)

Hm... if there were *always* at least as many Fibonacci codes as unary codes (the column at right always had at least 1, instead of that pesky 0 at the top) then the 1st hand paragraph would be true.

What sorts of distributions would compress better using unary codes ? Which of these codes (Fibonacci, Fibonacci aU+b, unary aU+b, Elias delta) has the best match with the 1/f distribution typical of the distribution of words in natural languages -- and, so I hear, the distribution of patterns in DNA ?

http://www.ics.uci.edu/~dan/pubs/DC-Sec3.html#Sec_3.3 (Figure 3.7, quoted above)

>Date: Mon, 13 Jul 1998 11:45:56 -0400
>From: "John S. Fine" <johnfine at erols.com>
>To: David Cary <d.cary at ieee.org>
>CC: "Ojala Pasi 'Albert'" <albert at cs.tut.fi>
>Subject: Re: "little workspace" decompressors
>
>David Cary wrote: ...
>> On the other hand, I just discovered some codes [*] that are *better* than
>> the these (aU+b) codes in the sense that, no matter what the distribution
>> of integers (as long as it is monotonic), the compressed file will be
>> smaller than if we used any of these (aU+b) codes.
>
> I don't think that is possible.
>
>> Perhaps something like this would be close enough to fibonacci,
>> yet still be easy to decode:
>> 000 to 101 are the 6 different terminal codes
>> (use base 6: multiply accumulated total by 6, add 0..5 to make final result)
>> 110 to 111 are the 2 different non-terminal codes.
>> (use base 2: multiply accumulated total by 2, then add 0 or 1)
>
>> Since (we assume) smaller numbers are always *more* common than large numbers,
>> the 2 codes where "base 6" is worse than John's "stop bit" code
>> are less common than the 2 codes where "base 6" is better than John's "stop
>> bit" code -- the net effect is using this "base 6" code compresses to fewer
>> bits than using the "stop bit" code -- no matter what the actual
>> distribution is.
>
> You stopped too early in examining the distribution. You have simply
>picked a code that does low numbers slightly better and high numbers
>MUCH worse (you thought it did high numbers SLIGHTLY worse because you
>stopped too soon).
>
> Compare 3-bit codes with 6 terminal values to three bit codes with
>4 terminal values. This table shows the total number of values that
>can be represented in N or fewer bits.
>
>bits   6T    4T
> 3      6     4
> 6     18    20
> 9     42    84
>12     90   340
>
>--
>http://www.erols.com/johnfine/
>http://www.geocities.com/SiliconValley/Peaks/8600/

Date: Thu, 30 Jul 1998 10:11:51 -0400
From: "John S. Fine" <johnfine at erols.com>
To: David Cary <d.cary at ieee.org>
Subject: Re: "little workspace" decompressors

David Cary wrote:
> Is this true for the Fibonacci codes as well -- are they always better, as
> I originally claimed, or is it data-dependent ?
bits   Fib    6T     4T
  1      -     -      -
  2      1     -      -
  3      2     6      4
  4      4     -      -
  5      7     -      -
  6     12    18     20
  7     20     -      -
  8     33     -      -
  9     54    42     84
  . . .
 12    232    90    340
 15    986   154   1364

Maybe I misunderstand "better". The values in the table above are cumulative. In 9 or fewer bits, fib can represent 54 different values; 4T can represent 84 different values.

--
http://www.erols.com/johnfine/
http://www.geocities.com/SiliconValley/Peaks/8600/
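The open question above -- a formula for f(u), the number of U-bit Fibonacci codewords -- can be probed by brute force. A sketch in Python (my own illustration, not from the thread); the counts it produces suggest f(u) is simply the (u-1)th Fibonacci number:

```python
def fib_codeword_count(u):
    """Count length-u bit strings whose first occurrence of "11" is at the
    very end -- exactly the valid u-bit Fibonacci codewords."""
    count = 0
    for x in range(2 ** u):
        bits = format(x, "0%db" % u)
        if bits.endswith("11") and "11" not in bits[:-1]:
            count += 1
    return count

# Matches the table above -- 0, 1, 1, 2, 3, 5 codewords for u = 1..6 --
# and continues 8, 13, ... : the Fibonacci sequence itself.
print([fib_codeword_count(u) for u in range(1, 9)])
```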
[this section is very unfinished ...]
Currently, good-quality lossy text compression can only be done by humans. It's one of the few remaining areas where humans are clearly superior to silicon-based computers.
"better compression scheme for text?"
by lowwave (469952) on Monday September 15
It seems possible that one can create better compression scheme for text. If you fix both ends of a word, order the middle letters, then there are sequences appears more frequent than in nature writing. The trick is to unscramble the word based on the context. The scheme probably will introduce some errors.
http://science.slashdot.org/comments.pl?sid=78625&threshold=1&commentsort=0&tid=133&tid=134&tid=186&mode=thread&cid=6970387
(DAV: I think I've seen "sort consonants alphabetically, then vowels alphabetically" somewhere before ...
or perhaps this is just talking about standard alphabetical sequence.)
Plus Magazine http://plus.maths.org/ : "Maths is elegant, interesting, useful."
I've been thinking about "conflation" (``confounded'' ?), mixing together unrelated items, and how to undo this (in hopes that such a transformation could lead to better compression).
From: d_cary < at my-dejanews.com >
Subject: undoing conflation
Date: 28 Oct 1998 00:00:00 GMT
Newsgroups: comp.compression

Assume that I have some pseudo-English text such that each word is a standard English word and occurs with pretty much the standard frequency for that word in English text, but each word has absolutely no correlation to the next ("memoryless"). (There are, of course, strong correlations between the letters inside a word, as well as common word prefixes, common word suffixes, etc.)

The best possible compression I could hope for would be to use each word as a "symbol" and compress one symbol at a time with arithmetic coding (marginally better than Huffman coding), right ? (I assume the file is very big relative to the size of the frequency table.)

If I scramble each word with any reversible word->integer transformation before compressing, the compressed file will be exactly the same size as the one I created without this transformation, right ?

If I just randomly selected clumps of letters (say, 4 letters at a time), paying no attention to the word boundaries, compressing one clump (one symbol) at a time with arithmetic (or Huffman) compression, I think the compressed file would be much worse (right ?) than the previous 2 files. If I scramble each clump with any reversible clump->integer transformation, creating a "conflated file", the compressed file will be exactly the same ("much worse") size, right ? (Is there a better term for this than "conflated file" ?)

Here's the question: Say I have a "conflated file" -- a sequence of (short) integers with the above characteristics. How do I compress it (lossless) as small as possible ? Rather than immediately hitting it with arithmetic coding, the best thing to do is the inverse integer->clump transformation, then compress one word (symbol) at a time with Huffman coding. But I don't know which clump->integer transformation was used. Now what ?
Is there some way I could un-do (some of) the damage, somehow break the integers up into variable-length strings (doesn't have to be the original text), re-group them into "better" multi-byte words ? If I could do that, I could use the above-mentioned "optimal" arithmetic coding.

If this were possible, one might even be able to apply it to compressing standard text -- ideally, it would recognize that the letter "T" is conflating the idea of "next letter is capitalized" with the idea of the letter "t". Once we un-did that conflation, we would have a text file without any upper case letters, but we would have to add a new symbol (I arbitrarily choose A) to indicate that the next letter is a capital.

(conflated) This is a test. If this were ...
(unconflated) Athis is a test. Aif this were ...

Of course, now we have the extra overhead of having to stick extra information in the compressed file telling the uncompressor, once the "unconflated file" has been decompressed, how to re-conflate it so we can recover the original conflated file I was given. But the compressor can now recognize that both "This" and "this" are the same word, and that the "A" symbol is almost always preceded by a space or a newline, allowing it to create a smaller file.

Is it possible to get a net benefit on the Canterbury Corpus ? I hope so :-).

-- David Cary http://www.rdrop.com/~cary/html/data_compression.html
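The capitalization un-conflation described above is easy to sketch (Python, my own illustration; "A" is the arbitrary flag symbol from the post, and the round trip is exact because every literal upper-case letter -- including "A" itself -- gets rewritten):

```python
FLAG = "A"   # the arbitrarily chosen "next letter is a capital" symbol

def unconflate(text):
    """Replace each capital letter X with FLAG followed by lower-case x."""
    out = []
    for ch in text:
        if ch.isupper():
            out.append(FLAG)
            out.append(ch.lower())
        else:
            out.append(ch)
    return "".join(out)

def reconflate(text):
    """Inverse transformation: FLAG capitalizes the following letter."""
    out = []
    flag = False
    for ch in text:
        if ch == FLAG and not flag:
            flag = True
            continue
        out.append(ch.upper() if flag else ch)
        flag = False
    return "".join(out)

print(unconflate("This is a test. If this were"))   # prints "Athis is a test. Aif this were"
```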
Other compression-related ideas:
ECC: error correcting codes
EDAC: error detection and correction
[FIXME: move all this information to "Data Link Error Detection / Correction Methods" http://massmind.org/techref/method/errors.htm ]
CRC: cyclic redundancy check (see spin_dictionary.html#crc for other CRC acronyms )
Data compression is often preparatory to communicating data over some "limited-bandwidth" real-time communications channel or storing it on limited storage media. After squeezing out most of the redundancy, we *deliberately* add more redundancy. All real-time communications channels add enough redundancy (in the form of headers, trailers, and CRC codes) to allow errors to be detected, so the receiver can ask the transmitter to re-send the data. Some communications channels and most storage devices add even more redundancy (Hamming codes for DRAMs, Reed-Solomon codes for CD-ROMs, and other ECCs) to let us detect and correct errors at the receiver or media reader.
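A minimal sketch of that "add redundancy back" step, using the classic Hamming(7,4) single-error-correcting code (Python, for illustration only; real DRAM ECC words are wider, and CDs use Reed-Solomon):

```python
def hamming74_encode(d1, d2, d3, d4):
    """Encode 4 data bits into 7 bits; parity bits sit at positions 1, 2, 4."""
    p1 = d1 ^ d2 ^ d4          # covers 1-based positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4          # covers positions 2, 3, 6, 7
    p3 = d2 ^ d3 ^ d4          # covers positions 4, 5, 6, 7
    return [p1, p2, d1, p3, d2, d3, d4]

def hamming74_correct(codeword):
    """Fix any single flipped bit; the syndrome *is* the 1-based error position."""
    c = list(codeword)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    syndrome = s1 + 2 * s2 + 4 * s3
    if syndrome:               # zero syndrome means no (detectable) error
        c[syndrome - 1] ^= 1
    return c
```

Flip any one of the 7 bits and the decoder recovers the original codeword -- correction costs more redundancy than the detect-and-retransmit CRC approach, which is why the two are used in different situations.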
The simple hardware to do CRCs looks very similar to the simple hardware to do an LFSR (see machine_vision.html#spread_spectrum).
[FIXME: to read]
bit coding
Chip-by-Chip Turbo Coding for DS/SS Systems Chen ZHENG, Takaya YAMAZATO, Masaaki KATAYAMA, Akira OGAWA Vol.E82-A No.12 p.2751: http://search.ieice.or.jp/1999/files/e32026.htm http://search.ieice.or.jp/1999/files/e000a12.htm#e82-a,12,2751
A New Digitized Bit Timing Recovery Scheme Takao, Suzuki, and Shirato. http://search.ieice.or.jp/1999/files/e32020.htm#toshiaki32takao http://search.ieice.or.jp/1999/files/e000b08.htm#e82-b,8,1326 http://search.ieice.or.jp/1999/pdf/e82-b_8_1318.pdf
One item it points to is:
"A painless guide to crc error detection algorithms index v3.00 (9/24/96)" by Ross N. Williams, http://www.repairfaq.org/filipg/LINK/F_crc_v3.html
"A painless guide to crc error detection algorithms" by Ross N. Williams, 19 August 1993 http://www.geocities.com/SiliconValley/Pines/8659/crc.htm
"A painless guide to crc error detection algorithms" by Ross N. Williams 19 August 1993 http://massmind.org/techref/method/math/crcguide.html
From: (William Lewis) Subject: Re: Document CRC Organization: The Seattle Group Date: Tue, 27 Dec 1994 00:50:13 GMT...
Ross N. Williams has written "A Painless Guide to CRC Error Detection Algorithms", available from ftp.adelaide.edu.au:/pub/rocksoft/crc_v3.txt, which I found very useful. It goes from the basic mathematical theory, through how to choose a polynomial, to how to write efficient implementations, with examples in C. Lucid and practical. Check it out. -- William Lewis.
"CRC Generating and Checking" white paper by Thomas Schmidt, Microchip Technology Inc. http://www.microchip.com/download/appnote/crc/00730a.pdf has a diagram of a hardware CRC generator (very similar to a LFSR [FIXME:] ), a "loop driven CRC implementation" (in PIC assembly code), and a "table driven CRC implementation" (in PIC assembly code).
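A minimal "loop driven" CRC sketch (Python, my own illustration -- not from the white paper), using the common CRC-16/CCITT-FALSE parameters (polynomial 0x1021, initial value 0xFFFF); it is the software analogue of the LFSR-style shift-register hardware:

```python
def crc16_ccitt(data, poly=0x1021, init=0xFFFF):
    """Bit-at-a-time CRC-16: shift left, and XOR in the polynomial
    whenever a 1 bit falls off the top of the register."""
    crc = init
    for byte in data:
        crc ^= byte << 8                      # feed the next byte into the high bits
        for _ in range(8):
            if crc & 0x8000:
                crc = ((crc << 1) ^ poly) & 0xFFFF
            else:
                crc = (crc << 1) & 0xFFFF
    return crc

print(hex(crc16_ccitt(b"123456789")))   # the standard 9-byte check string
```

A "table driven" version precomputes the inner 8-bit loop for all 256 byte values, trading 512 bytes of table for roughly 8x fewer operations per byte.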
the codes are powerful enough to totally recover a burst error of greater than 4,000 consecutive bits -- about 2.5 mm on the disc. With full error correction implemented (this is not always the case with every CD player), it is possible to put a piece of 2 mm tape radially on the disc or drill a 2 mm hole in the disc and have no audio degradation. Some test CDs have just this type of defect introduced deliberately.
-- http://repairfaq.ece.drexel.edu/sam/cdfaq.htm
http://repairfaq.ece.drexel.edu/sam/cdfaq.htm [FIXME: where am I collecting "compact disk" information ?]
2001-12-01:DAV: started section. (Thanks to "Jerry Lim Han Sin" who encouraged me to write down my thoughts on program compression)
Lots of people complain about huge, bloated executables. What can we do about it ?
related local files:
Normally I (DAV dav_info.html) utterly despise machine/OS dependent stuff -- because this makes it far too difficult to *improve* my machine. But in this special case, it's cool -- executable files are inherently machine/OS dependent *anyway*, so it doesn't make things any worse to use techniques that also are machine/OS dependent.
Program optimization goals that David knows about: (ways to choose among myriad implementations that all functionally do the same thing):
[FIXME: read the article "Forth code size and programming effort" http://www.complang.tuwien.ac.at/projects/forth.html on executable code size. ]
Ways of compressing programs: [FIXME: is there a better terminology I could use than ``functional'' vs. ``non-functional'' ? ] ( "Metaprogramming and Free Availability of Sources: Two Challenges for Computing Today" article by François-René Rideau http://fare.tunes.org/articles/ll99/index.en.html mentions
)
"The Code Compression Bibliography" (currently maintained by Mario Latendresse) http://www.iro.umontreal.ca/~latendre/codeCompression/ uses the term "code compression" for what I call "program compression" or "executable compression". (Unfortunately, the term "code" conflicts with the jargon used in data compression).
flavors:
This makes clear the tradeoff between the size of the decompression algorithm and the size of the compressed data.
flavors:
there seem to be several of these available. Here's a couple I've seen recently:
By ``functional'', I mean that when you run the new executable, it acts like running the original executable.
By ``lossless'', I mean that with the right decompression tool, you can recover the original executable.
flavors:
use compiler option -Os, which tells the compiler to optimize for size. In some cases, compared to burning a (statically linked) application into ROM, *adding* Linux can actually reduce the total ROM+RAM memory budget (using the CRAMFS can allow you to use a much smaller ROM; also, Linux sets up the MMU and demand paging so that you need much less RAM than you would if the entire (uncompressed) application had to fit into RAM). Specific examples: ARM9 cross-development environment; roughly 2 MB of FLASH.
flavors:
See http://c2.com/cgi/wiki?RefactorMercilessly for more refactoring benefits, and discussions on when it is or is not appropriate to refactor.
See http://c2.com/cgi/wiki?MultiLanguageRefactoring for some speculation on gradually switching to a higher-level language.
Often programming languages have ``features'' that make it easier to compile (functions must be declared before they are called; some languages have ``hints''; some languages use postfix notation ). This makes the compiler much smaller, and makes it easier to justify putting the compiler on the run-time hardware (which has other benefits we won't go into here). Perhaps compressed human-readable source code could be much smaller than compressed executable code, giving a tangible benefit. In other words, perhaps putting a small compiler + compressed human-readable source code is cheaper (takes less ROM) than putting the compiled executable in ROM. Even if it's not clearly cheaper, there are intangible benefits. (intangible benefits of putting human-readable source code and the compiler on the run-time hardware are: it makes things much easier on the people trying to find and fix hardware problems and software bugs; it makes things much easier when people later try to add features ).
``Program size doesn't matter (much) for workstations. But, for embedded control it matters a lot, especially when you're limited to on-CPU chip memory, and the CPU has to cost less than $5-$10. Anecdotal evidence indicates stack computer program size can be smaller than CISC programs by a factor of 2.5 to 8 (and, another factor of 1.5 to 2.5 smaller than RISC, depending whom you want to believe). This comes not just from compact opcodes, but also from reuse of short code segments and implicit argument passing with subroutine calls. Code size comparisons I've seen don't take this into account. (P. Koopman 1989, Stack Computers, pg. 118-121.)''
``Program size measures depend largely on the language being used, the compiler, and programming style, as well as the instruction set of the processor being used.'' -- P. Koopman 1989 http://www-2.cs.cmu.edu/People/koopman/stack_computers/sec6_2.html#621
For more info on stack machines, see computer_architecture.html#stack .
DAV: The P. Koopman quote directly points to 3 ways of decreasing program size:
If you're using gcc or a similar compiler, consider using the
-Os
(optimize for size) compiler option.
[FIXME: move information on subroutine threading, direct threading, and indirect threading to here] [or to http://en.wikipedia.org/wiki/Threaded_code ]
In some cases (when the CALL instruction on this particular architecture is very compact, and ... it's just as compact to handle control structures in-line than it is to CALL subroutines to handle them), subroutine threading can be just as compact as the other kinds of threading.
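A toy sketch of the threading idea (Python stand-in, my own illustration; real threaded code stores machine addresses, so this only shows the control structure): a "compiled word" is just a list of references to routines, and the inner interpreter calls each one in turn.

```python
stack = []

def lit(value):
    """Build a routine that pushes a constant when executed (a "literal")."""
    def run():
        stack.append(value)
    return run

def add():
    stack.append(stack.pop() + stack.pop())

def execute(thread):
    """The "inner interpreter": walk the thread, calling each routine."""
    for word in thread:
        word()

two_plus_three = [lit(2), lit(3), add]   # a "compiled word"
execute(two_plus_three)
print(stack)   # prints [5]
```

Subroutine threading replaces the interpreter loop with a sequence of native CALL instructions; whether that is smaller depends on how compact CALL is on the target architecture, as noted above.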
DAV wonders what a very compact instruction set would look like, and what a good tradeoff in size/speed/power would be.
[FIXME: how to properly split this topic between data compression and computer_architecture.html#considerations ?]
A compact instruction set can either be implemented directly in hardware, or (trading off some run-time speed) emulated/interpreted (for example, the BASIC Stamp and Pcode), or (trading off some start-up time) recompiled before running (for example, JIT Java).
There are 3 main variants that I know of to creating a compact instruction set:
DAV has found this to occur with the same image in a .DWG and longer .DXF files becoming compressed to short .DWG.ZIP and much shorter .DXF.ZIP files.
Anton Ertl has found this to occur with the same document in a .pdf and a longer .ps file becoming compressed to a short .pdf.gz and a much shorter .ps.gz file http://www.complang.tuwien.ac.at/~anton/why-not-pdf.html .
I think this illustrates "premature optimization" -- the misplaced focus on making individual instructions small (cb) interferes with the real goal of making the entire program small (cc). If you're only allowed to look at one instruction at a time (cb), the best you can do is a Huffman-style compression. But if you're able to look at even 2 or 3 instructions at a time (cc), you can often get much better compression -- LZ, LZW, and similar kinds of compression. (But I keep thinking that LZ, LZW compression won't work if the program is already well-factored).
related links:
"Reduced Memory Requirements ... Another advantage this provides, is the ability to generate performance with lower pin-counts or lower-bandwidth busses ... With smaller code size, and variable length instructions and data, the CompactRISC family provides more efficient use of smaller, lower cost, lower bandwidth memories."

How was this "program size" chart made ?
I am especially interested in ``lossy'' program compression (when an embedded system lacks enough RAM to run the full-up version, ... a smaller program that functionally *does* all the same stuff, although it's impossible to recover the exact original file from the compressed version, and it may have slightly worse performance). Example: recognize ``unrolled loops'', then shrink them back down to a single cycle of the loop. Example: "refactoring" .
However, "functional compression" is really ...
Tradeoff between performance and size:
program compression links:
``1x Forth'' by Charles Moore April 13, 1999 http://www.ultratechnology.com/1xforth.htm :
My contention is that every application that I have seen that I didn't code has ten times as much code in it as it needs. ...
Why would anyone want to write ten times as much as they would need to write?
... If it impossible for you to start with a clean piece of paper then you will have to write more code. ...
How big should a program be? For instance, how large should the TCP/IP stack be? I don't know. I couldn't know without sitting down and writing the code for it. But it should not be very big, a kiloword.
...
About a thousand instructions seems about right to me to do about anything. To paraphrase the old legend that any program with a thousand instructions can be written in one less. All programs should be a thousand instructions long.
How do you get there? What is the magic? How can you make applications small? Well you can do several things that are prudent to do in any case and in any language.
... inevitably the problem will change in a way that you didn't anticipate. ... Don't anticipate, solve the problem you've got.
Don't Complexify ...
Ten times code means ten times cost; the cost of writing it, the cost of documenting it, the cost of storing it in memory, the cost of storing it on disk, the cost of compiling it, the cost of loading it, everything you do will be ten times as expensive as it needed to be. ... Ten times the bugs! And ten times the difficulty of doing maintenance on the code ...
This is why we are still running programs which are ten or twenty years old and why people can't afford to update, understand, and rewrite these programs: because they are significantly more complex, ten times more complex than they should be.
... How do you write one times programs?
You factor. You factor, you factor, you factor and you throw away everything that isn't being used, that isn't justified.
... you wrote a hundred words or so that discussed the application and you used those hundred words to write a one line definition to solve the application. It is not easy to find those hundred words, but they exist, they always exist.
Identify those aspects of what you are trying to do and saying we don't need to do that. We don't need checksums on top of checksums. We don't need encryption because we aren't transmitting anything that we don't need. You can eliminate all sorts of things.
...
I wish I knew what to tell you that would lead you to write good Forth. I can demonstrate. I have demonstrated in the past, ad nauseam, applications where I can reduce the amount of code by 90 percent and in some cases 99 percent. It can be done, but on a case by case basis. The general principle still eludes me.
...
When I first started in this business in fifty-seven, computers were used for calculating ... long complex algebraic expressions. ... today. ... I would guess that most computers don't compute, they move bytes around. ...
... I remain adamant that local variables are not only useless, they are harmful.
On the other hand, Anton Ertl has run some experiments and concluded that:
--The code size measurements dispel another popular myth, that of the inherent size advantage of stack architecture code and of the bloat produced by optimizing C compilers. While a comparison of a header-stripping 16-bit Forth with a RISC (about 50% bigger code than CISCs) would give a somewhat different result, the reported size differences of more than an order of magnitude need a different explanation: differences in the functionality of the software and different software engineering practices come to mind.
... Interestingly, less than one byte of machine code is generated per line of C code.
Forth code size and programming effort by Anton Ertl http://www.complang.tuwien.ac.at/forth/performance.html#size --
Compilation of Forth to C by Anton Ertl, Martin Maierhofer http://www.complang.tuwien.ac.at/projects/forth.html
includes the chapter "Tools for Optimizing Java" (profilers, etc). [FIXME: read] DAV is particularly interested in optimizing programs for size (executable compression). [FIXME]

"We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil." - Donald Knuth
...
Optimizing Java for Size
Reducing code size is particularly important for Java applets, since it directly affects the time taken to download an applet.
the code density (amount of instruction storage space required to accomplish a given function) associated with 32-bit fixed-length instructions can require more memory when compared with other processor architectures. To address this concern, IBM has developed CodePack, a unique method to store complete PowerPC instructions in memory in a compressed format. ...
... a silicon-efficient ASIC core that is placed between the processor and the memory controller in an integrated system-on-a-chip design. The core decompresses instructions on-the-fly only as needed by the processor. ...
Typically, use of the CodePack decompression core results in application code that fits in 35 to 40% less space. ...
...
The performance of compressed applications evaluated so far have been within ±10% of the performance of uncompressed versions of the same application. In some benchmarks that were run with slower memory (such as FLASH), the compressed code actually runs faster because compressed instructions are smaller and can be fetched from slow memory in less time.
...
DAV wonders:
Rather than having specialized decompression hardware, could we get almost as good performance by making a purely software decompressor, which runs when the MMU traps an address in code space that hasn't been decompressed yet (or that has been flushed out of the LRU cache of decompressed code) ? This lets software people experiment with heavily custom-tuned compression ideas, possibly a completely different algorithm for every application ... Maybe some modifications to the MMU to make it more fine-grained would be helpful.
Optimizing loop overhead
Before:
for(i=0;i<100;i++) { map[i].visited = 0; }

After:
i=99; do { map[i].visited = 0; i--; } while(i>=0);
Umm... that's nice, and another way to do that is
i=100-1; do { map[i].visited = 0; i--; } while(0 < i); map[0].visited = 0;

This compiles to an even more compact loop than the above on architectures such as the 68000 that have a dbnz -- decrement and branch if not zero -- instruction. [FIXME: move to http://c2.com/cgi/wiki?edit=BetterForLoopConstruct ]
both the assembly language source and the dot com DOS executable file ... The demo program produces a tumbling octahedron in a mere 1024 byte dot com file, while it's computing pi to about 100,000 places.

another amazing gem.
-- http://www.cs.vu.nl/~dick/Summaries/CS/CompilerConstruction-1979.html

The demise of the goto and the advent of structured programming have altered the mix of machine instructions generated by compilers ... To find the new mix, ... many programs were written ... This resulted in a set of machine instructions with estimated frequencies of static use; ... The results confirm Knuth's observation that almost all operations done by a program are very simple. Huffman encoding of the operations based on the frequencies would lead to opcodes of 3 bits, 4 bits, 5 bits, etc., which is unsuitable hardware-wise (and software-wise). To obtain byte-sized opcodes, the smaller opcodes are combined with codes for frequent operands (since also the frequencies of the operands are known). For example, the opcode for PUSH_LOCAL occupies 4 bits, and the same byte can contain the offsets 0 through 11; this leaves another 4 8-bit opcodes to be used for rarer opcodes. Code size is compared to that for the PDP-11 and the Cyber and in both cases the EM-1 code is a factor 3 smaller.
DAV: Yes, but wouldn't it be interesting to Huffman encode non-time-critical stuff anyway? Perhaps even use compression techniques that give better-than-Huffman compression? A "factor of 3" smaller -- very impressive.
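To make the "Huffman encode the opcodes" idea concrete, here is a sketch (Python, my own illustration; the opcode frequency numbers are made up, not measured data) that computes Huffman code lengths from a frequency table:

```python
import heapq

def huffman_lengths(freqs):
    """Given {symbol: frequency}, return {symbol: Huffman code length in bits}."""
    heap = [(f, i, [sym]) for i, (sym, f) in enumerate(sorted(freqs.items()))]
    heapq.heapify(heap)
    length = {sym: 0 for sym in freqs}
    tiebreak = len(heap)
    while len(heap) > 1:
        f1, _, syms1 = heapq.heappop(heap)   # merge the two rarest subtrees;
        f2, _, syms2 = heapq.heappop(heap)   # every symbol in them gets 1 bit longer
        for s in syms1 + syms2:
            length[s] += 1
        heapq.heappush(heap, (f1 + f2, tiebreak, syms1 + syms2))
        tiebreak += 1
    return length

# Hypothetical static opcode frequencies, purely illustrative:
print(huffman_lengths({"PUSH_LOCAL": 40, "PUSH_CONST": 20,
                       "CALL": 15, "ADD": 10, "BRANCH": 10, "RET": 5}))
```

The common opcodes come out with short codes (3-4 bits here) and the rare ones with longer codes -- exactly the awkward non-byte-aligned lengths the EM-1 designers rounded into their 4-bit-opcode-plus-operand bytes.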
DAV: ... perhaps it would also be interesting to speculate that the horizontal and vertical are not completely independent -- i.e., use some *other* formula for combining the probabilities ... One would expect that if, say, "5" were extremely common in a program, *both* the horizontal *and* the vertical probabilities would be high for "5" ... how to "de-convolve" (?) the probabilities to get an accurate prediction for "5" ?
DAV: It seems that LZW and PPM assume that some byte affects another byte a fixed distance away. I wonder if it's possible to model the effect that often, when one word is encountered, a context is set up so that another word will be encountered "soon" -- not necessarily immediately after this word, or at any particular fixed offset from this word. i.e., when a "test" or "compare" instruction occurs, there is very often a conditional branch in the next few instructions ... when "if" is mentioned, "then" is likely to also occur ... ... Some versions of this may be *simpler* than Markov code -- for example, since there are far fewer groups of 4 letters ''in alphabetical order'' than 4 letters, the storage requirements for predicting the next letter given a set of 4 letters of context (not in any particular order) is much less than predicting the next letter given a list of 4 letters in a particular order.
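The last observation -- that order-insensitive contexts need far less storage -- is easy to quantify: sorted 4-letter contexts are multisets, and there are far fewer multisets than ordered 4-tuples. A quick check (Python, my own illustration):

```python
from math import comb

letters = 26
ordered = letters ** 4                  # 4-letter contexts where order matters
unordered = comb(letters + 4 - 1, 4)    # multisets: 4 letters kept in alphabetical order

print(ordered, unordered)   # prints 456976 23751 -- roughly 19x fewer table entries
```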
References these papers:
[FIXME: should I combine all my file format information in one place ? Or does it make sense to divide it into 2 parts:
]
[FIXME: html-ify "file format considerations"; add use the Unicode trick of "FFFE" to detect and correct byte-swap. ]
ELF is a common executable file format. ELF seems superior to a.out and COFF executable formats ... but of course I find plaintext source superior to any executable format.
Details about the ELF format are generally only useful for people who write OSes and who write compilers.
People obsessed with creating tiny executables (``functional program compression'' #program_compression ) may also be interested.
BFD is a package which allows applications to use the same routines to operate on object files whatever the object file format. A new object file format can be supported simply by creating a new BFD back end and adding it to the library.
[FIXME: this is very similar to what I'm doing with YARMAC ... any tips I can pick up here ?] ``The primary quality of this document is reproducibility. Every tiny bit of information should be proved by a working example. Since I don't trust myself, all output files are rebuilt for every release. All sections titled "Output" are real products of source code and shell scripts included in this document. Most numbers and calculations are processed by a Perl script parsing these output files.
... Conversion to HTML is the last step of a Makefile that builds and runs all examples.
sBOX File Format Specification v1.0 by Sean Barrett 2000 http://www.nothings.org/computer/sbox/sbox.html
[FIXME: finish reading] [in C, the 4 signature bytes are "sb0X".] sBOX, a meta-file format for creating file formats whose internal contents are indexed by name. sBOX is a simple and carefully engineered file structure that provides a base layer upon which other file formats can build.
...
The first twenty-four bytes of an sBOX file constitute the sBOX header. The first sixteen bytes are undefined; any set of values in the first sixteen bytes can still indicate a valid sBOX file.
The following four bytes (the seventeenth through twentieth) contain the sBOX signature, and consist of the following decimal values:
115 98 48 88...
the last four bytes of the file ... must contain the sBOX signature:
115 98 48 88...
It's possible to define a writeable sBOX file format in which all modifications occur by appending new data; no data in the file ever need be rewritten.
...
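Based only on the spec excerpt above, a minimal validity check might look like this (a sketch; the full 24-byte header and the index structure have more rules than this):

```python
def looks_like_sbox(data: bytes) -> bool:
    # Per the excerpt: a 24-byte header whose first 16 bytes are
    # undefined, followed by the 4 signature bytes 115 98 48 88
    # ("sb0X"); the file must also *end* with the same signature.
    sig = bytes([115, 98, 48, 88])   # b"sb0X"
    return len(data) >= 24 and data[16:20] == sig and data[-4:] == sig
```

The append-only property mentioned above falls out of this layout: since the index is found relative to the trailing signature, a writer can extend the file and re-append the signature without rewriting earlier bytes.
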
On the other hand, if you like colored glasses, according to the article "Sunglasses Change Color on Command" by Tracy Staedter, people are developing glasses that can be adjusted at any time to almost any color. Those people are Chunye Xu, Minoru Taya, and Chao Ma at the University of Washington.
``LizardTech released the source code of the latest version of the DjVu Reference Library (v3.0) under the GNU General Public License. The software is available at http://lizardtech.com/products/djvu/referencelibrary/DjVuRefLib_3.0.html.
The DjVu Reference Library 3.0 contains the complete source code of decoders, simple versions of the encoder, and several utilities.''
Image data compression From: jrobin at essex.ac.uk () Date: 29 Jun 1995 14:17:21 GMT Documentation of Binary Tree Predictive Coding (BTPC is an image compression method ... suited for coding multimedia images which combine text, graphics and photographs ... lossless and lossy settings) http://monet.uwaterloo.ca/~john/btpc.html evaluation source code at: ftp://monet.uwaterloo.ca/pub/john/btpc.tar.Z BTPC uses a binary pyramid, predictive coding and Huffman coding. John Robinson john at monet.uwaterloo.ca has moved to http://www.engr.mun.ca/~john/
Data compression From: Jonathan Burt Newsgroups: alt.comp.compression Subject: Re: QUANTUM Date: 18 May 1995 12:16:29 +0100 Organization: None ... (Daniel James Williams) writes: > Quantum and a LOT of other compression programs are FTP'able from > garbo.uwasa.fi /pc/arcers > . You have to get the file act-23.zip! It really explains how good > about 30 different compression programs are! (including Quantum). It's version 24 now, and has 45 compression programs, but thanks for the recommendation! :-) Bye Jonathan -- Jonathan Burt == Archive Comparison Table: -> garbo.uwasa.fi:/pc/arcers/act-24.zip -> oak.oakland.edu:/SimTel/msdos/archiver/act-24.zip & other SimTel mirrors -> wuarchive.wustl.edu:/pub/MSDOS_UPLOADS/archivers/act-24.zip
From: John Morris Smith Newsgroups: alt.comp.compression Subject: Re: What dos program has THE BEST compression Date: 19 May 1995 00:01:15 +0100 Organization: ToolMaster
....
If you haven't already read it, read the article A.C.T. May 1995 in this newsgroup (alt.comp.compression) or WWW to http://www.mi.net/act/act.html and browse through Jeff Gilchrist's excellent comparison of many zip/archive programs. --
Newsgroups: comp.sys.mac.apps,comp.sys.mac.comm,comp.compression From: tbrown at minerva.cis.yale.edu (Tom Brown) Subject: ZipIt 1.3.3 public beta testing Organization: Yale University Date: Sun, 21 May 1995 19:31:17 GMT ZIPIT 1.3.3 PUBLIC BETA ZipIt version 1.3.3b1 is now available for public beta testing. Major new features include: o More bug fixes in the segmenting routines. (what, that's not a feature?) o Runs native on the Power Mac o Supports Internet Config for extension mapping and text/binary file information. You can get the most recent beta version from: ftp://ftp.awa.com/pub/softlock/mac/products/zipit/zipit.hqx (This is the fat version; you can also obtain PPC-only or 68K-only versions in the same directory.) If you would like to subscribe to a mailing list that will announce future beta versions, and release versions, as they are made available, please send email to tbrown@dorsai.org with the subject "subscribe zipit-announce". (The subject must be typed exactly as above, all lowercase.) Please send me both positive and negative feedback. I'd like to know that it's working for some people! Thanks. -- Tom Brown tbrown at dorsai.dorsai.org tbrown at minerva.cis.yale.edu
Dr. Dobb's Journal: all source code is available ... online .... anonymous FTP from site ftp.mv.com (192.80.84.1) in the /pub/ddj directory.
ALLEY.ASC Title: ALGORITHM ALLEY Keywords: SEP93 C++ ALGORITHMS DATA COMPRESSION
Published source code presented by Tom Swan in his column where Tom examines data compression techniques. Also see ALLEY.ZIP. http://www.ddj.com/
Subject: Re: Ultrasound image sequences Date: 27 Jan 1999 00:00:00 GMT From: "Michael J. Aramini" Organization: Hewlett-Packard Co. To: boaz cohen Newsgroups: sci.image.processing References: 1 boaz cohen wrote: > I need some ultrasound image sequences (B-mode scans). > Does anyone have some? There are a number of downloadable ultrasound images and cine loops available on Hewlett-Packard's web site. To access them you first need to register (at no cost to you) using http://www.medical.hp.com/DSRsupport/register.html You can then download the image and loop files in (HP) DSR format which is an extended version of TIFF. You can also download a software tool (as an MS-DOS binary) which converts from DSR format to standard TIFF format. -Michael
-- "Document Image Compression and Analysis" by Omid Ebrahimi Kia http://documents.cfar.umd.edu/LAMP/Media/Publications/Papers/okia97/Thesis-HTML.html
Predictive coding, such as the differential pulse code modulation (DPCM) patented by Cutler [22], predicts the next sample based on the previously coded samples and codes the error between the prediction and the actual sample. The main objective of DPCM is to narrow down the range of coded samples. However, images are not stationary in texture, and some predictors may work in certain areas, but not in others. Hence, the concept of adaptive predictors [3, 31, 83] was introduced so that the coder and decoder could choose from a set of predictors at the cost of transmitting the parameters needed for the prediction.
22 C. Cutler. Differential quantization of communication signals. U.S. Patent 2 605 361, 1952.
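A minimal lossless DPCM sketch with the simplest possible predictor ("next sample equals the previous sample" -- real coders, including the adaptive ones cited above, use better predictors):

```python
def dpcm_encode(samples):
    # Transmit only the error between each sample and its prediction.
    # Smooth inputs yield small errors, which entropy-code well.
    errors, prev = [], 0
    for s in samples:
        errors.append(s - prev)   # prediction is simply the previous sample
        prev = s
    return errors

def dpcm_decode(errors):
    samples, prev = [], 0
    for e in errors:
        prev = prev + e           # prediction plus transmitted error
        samples.append(prev)
    return samples
```

Note how the smooth run [10, 12, 13, 13, 11] encodes to the much narrower-range [10, 2, 1, 0, -2].
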
bzip2 ``compresses files using the Burrows-Wheeler block-sorting text compression algorithm, and Huffman coding. Compression is generally considerably better than that achieved by more conventional LZ77/LZ78-based compressors, and approaches the performance of the PPM family of statistical compressors.''
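A naive sketch of the block-sorting transform at the heart of bzip2 (real implementations use suffix sorting instead of materializing every rotation; assumes non-empty input):

```python
def bwt(s: bytes):
    # Sort all rotations of the block; output the last column plus the
    # row index of the original string.  O(n^2 log n) -- for illustration.
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return bytes(r[-1] for r in rotations), rotations.index(s)

def ibwt(last: bytes, idx: int) -> bytes:
    # Naive inverse: repeatedly prepend the last column and re-sort,
    # rebuilding the sorted rotation table one column at a time.
    table = [b""] * len(last)
    for _ in range(len(last)):
        table = sorted(bytes([c]) + row for c, row in zip(last, table))
    return table[idx]
```

The transform itself saves nothing; it groups similar contexts together so that a simple coder (move-to-front plus Huffman, in bzip2's case) compresses the result well.
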
>From: d_cary at my-deja.com >To: d_cary at my-deja.com >Subject: Re: Compressing the Bible for a PDA >Date: Thu, 16 Nov 2000 21:26:45 GMT > >I'm impressed with the "pucrunch" compressor by Pasi Ojala > http://www.cs.tut.fi/~albert/Dev/pucrunch/ > http://www.cs.tut.fi/~albert/Dev/pucrunch/packing.html >. >It decompresses very fast (important on a slow PDA). Almost all other >decompression algorithms I've seen run much slower. > >You might also look at (better compression, much slower) > "Compression: A Key for Next-Generation Text Retrieval Systems" >Nivio Ziviani, Edleno Silva de Moura, Gonzalo Navarro, and Ricardo >Baeza-Yates >in >_Computer_ magazine November 2000 > http://www.computer.org/computer/co2000/rytoc.htm >. > >Their decompressor takes 1, 2, or 3 *whole* bytes of compressed data >and decompresses (using a vocabulary list) into a whole word. This >makes many kinds of searches *much* faster. One can directly search the >compressed text for words or phrases, which turns out to be even faster >than searching uncompressed text. ( You *are* going to let your users >do searches like a concordance, right ?). > >Since it uses whole bytes, it runs faster than other "word at a time" >algorithms that have to shift around individual bits. And it lets you >point to any word in the text with a normal (byte) pointer and start >decompressing immediately from that point. > >The article also has lots of other ideas about compressing indexes and >approximate-match searching. > >In article <8uav48$6h$1@nnrp1.deja.com>, > comphelp at iinet.net.au wrote: >... >> Thanks Benjamin. That makes a lot of sense. Presumably I'd have some >> sort of table of pointers to the beginning to each book (or chapter >> even) so that I could quickly jump straight to that location rather >> than searching sequentially? After all, most PDA's have quite slow >> cpu's.
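The flavor of such a byte-aligned, word-based scheme can be sketched like this (this is *not* the exact code from the Ziviani et al. article -- just a variable-length byte code over word ranks, with hypothetical helper names; their scheme uses length-limited byte codes tuned for searchability):

```python
from collections import Counter

def build_vocab(words):
    # More frequent words get smaller ranks, hence shorter codewords.
    return [w for w, _ in Counter(words).most_common()]

def encode_rank(rank):
    # 7 payload bits per byte; the high bit is set on every byte except
    # the last one of a codeword, so codeword boundaries are
    # self-delimiting and the stream stays whole-byte aligned.
    out = []
    while True:
        out.append(rank & 0x7F)
        rank >>= 7
        if rank == 0:
            break
    return bytes(b | 0x80 for b in out[:-1]) + bytes([out[-1]])

def decode_stream(data, vocab):
    words, rank, shift = [], 0, 0
    for b in data:
        rank |= (b & 0x7F) << shift
        if b & 0x80:
            shift += 7          # continuation byte: keep accumulating
        else:
            words.append(vocab[rank])
            rank, shift = 0, 0
    return words
```

Because codewords are whole bytes, a search can compare the encoded pattern directly against the compressed stream, byte for byte, with no bit shifting.
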
From David Cary 2002-03-09 Dear David Scott, Thanks for putting your page http://bijective.dogma.net/ online. When I first heard about it (in the newsgroup comp.compression) and skimmed it, I thought making a one-to-one mapping was trivial. But now that I've thought about it for a few weeks :-/, I see how very tricky it is to handle all possibilities correctly. I'm pretty sure there are many ways to make a one-to-one mapping of the output of a Huffman compressor (which can have any number of bits) to files (which are restricted to multiples of 8 bits ... unless you're on an early VAX machine which restricted files to an integer number of 80-byte ``records''). After reading your page, I came up with yet another one-to-one mapping. Yes, it has the ``infinite recursion'' problem you mentioned, but there are ways of implementing it such that we never really need to back up. The common case is really very simple; but there are all those annoying exceptions to handle zeros. So I'm calling this ``upthz'' (ugly patch to handle zeros). Do you have a name for the method you described on your web page ? Since you explained things in terms of decompression, I guess I'll do the same. ---- UPTHZ (ugly patch to handle zeros) decompression 1. Since I use canonical Huffman codes, the all-zeros code represents the least frequent symbol. There's always one or more other codes that have the same length, but no other code is all-zeros, and no other code has a longer length. Call the length of the all-zeros code L. If I use N-bit source text symbols, then N <= L < 2^N. (Much software assumes N=8, but I'll just use it as an example in parentheses). Also, any symbol *other* than the all-zeros code is composed of 3 parts: from 0 to L-1 leading zero bits; a central 1 bit; and a tail of 0 to N-1 following bits (can be 0 bits, 1 bits, or any combination). 2. 
The decompressor pulls bits, counting consecutive zeroes, until it either (a) hits the end of the file (only happens once per file), in which case jump to step 6, or (b) pulls a 1 bit (the most common case). If it's *not* the end of the file (i.e., we just pulled a 1 bit): 3. Throw that 1 bit away. The decompressor grabs N-1 more bits to make a binary number B. (If we hit the end of the file here, pretend that there are lots of 0 bits to grab just past the last byte of the file). 4. (special case) If the zero count is L or more (when N=8, L could be up to 255), then print the symbol represented by the all-zeros code, and subtract L from the zero count. Repeat step 4 until the zero count is less than L. 5. Now we have 2 numbers: a zero count (when N=8, the zero count will be from 0 to 254) and a binary number B (when N=8, B will be from 0 to 127). The decompressor runs these through a lookup table, which returns 2 items. The decompressor prints the first item (the symbol represented by that Huffman code). The second item is the true length of the tail of that Huffman code. If that tail has length N-1, we can immediately go to (2) and keep sucking bits. Otherwise we have to put back some (in some cases all) of these N-1 bits we pulled in step (3), then we return to step (2) and allow the decompressor to re-scan them. This continues with step 2 until we hit the end-of-the-file. So far I've just described standard static Huffman decompression (with the table-lookup speedup). As you already know, handling the end-of-file is the tricky part if I want to maintain a bijection. Otherwise (case (a) from step 2), it *is* the end of this file. 6. Find the number of zero *bytes* at the end of the compressed file. This is *not* simply floor( zero count / 8 ), but also includes any trailing zero bits in B from the previously decoded Huffman code. 7. For each of those zero bytes (if there are any), print the least-frequently used symbol. Done ! 
Step 7 is similar to step 4, but instead of subtracting L each time, the decompressor jumps 1 byte ( 8 bits ) each time. ---- Did that explanation make any sense ? No ? Let me try again, this time in terms of compression: ---- UPTHZ (ugly patch to handle zeros) compression 1. Use standard canonical Huffman compression. For each symbol in the source text, look up the Huffman code and immediately write it out to the file. Once you've compressed the entire source text, pad out the last byte with zero bits if necessary. (I keep thinking these are ``extra bits'', redundant somehow, but since this is a bijection, it can't be). 2. Find the *last* byte in the compressed file that is non-zero, and end the file there. If there were any zero bytes that followed that byte, delete them. 3. Count how many consecutive times the least-frequent symbol occurs at the end of the source text. Often, this will be zero, and we're done. Otherwise, append that many all-zero bytes to the end of the compressed file. ---- I think it's possible to keep a couple of counters of how many zero bytes and all-zeros codes we were ``planning'' on writing out. Then I could do steps 1, 2, 3 in a single pass, without indefinite recursion (I think). The trick is to delay writing out the code for the least-frequently-used-symbol until either you hit the end of the source file -- then you know to represent it as 8 zero bits (a zero byte) -- or you hit some other symbol -- then those least-frequently-used symbols are represented as L zero bits. I understand upthz better than your method (not surprising, since I came up with it). Is it really any easier to understand ? It seems so simple that surely someone has thought of this before I thought of it. UPTHZ can also handle Huffman trees with *any* number of leaves -- in particular, it works with 2 leaves (arbitrary binary sequences), and with 128 leaves, and with 256 leaves, and with 1024 leaves. 
It's pretty easy to construct input files such that UPTHZ gives a *longer* output file than the first bijection you published. For example, if the last symbol of the source text is compressed to the code ``1111'', and that symbol gets split over 2 bytes, then UPTHZ ends the file with `` ........ ......11 11000000'', while your first published bijection gives the same file minus that last byte. But how is it possible to have extra, redundant information if this upthz is a real bijection ? -- David Cary http://www.rdrop.com/~cary/html/data_compression.html
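Steps 2 and 3 of the compression side amount to a tiny post-processing pass; a sketch (the caller supplies step 1's zero-padded Huffman output and the trailing-symbol count from step 3):

```python
def upthz_finish(padded: bytes, trailing_lfs_count: int) -> bytes:
    # Step 2: end the file at the last non-zero byte, deleting any zero
    # bytes after it.  Step 3: append one all-zero byte per consecutive
    # least-frequent symbol at the end of the source text.
    return padded.rstrip(b"\x00") + b"\x00" * trailing_lfs_count
```
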
QUALCOMM's image compression algorithm, ABSolute, is the groundbreaking technology employed by Technicolor Digital Cinema installations worldwide. ABSolute technology reduces the amount of digital information needed to represent cinema-quality digital images by as much as 35 to 40 times. These efficient coding rates save storage and transmission cost for digital cinema systems, while reproducing the director's vision with stunning clarity. The ABSolute algorithm divides a digital image into regions, or "blocks," that vary in size from 2x2 to 16x16 pixels, depending on where information resides within the frame. Each block is then transformed to the discrete cosine domain and processed to remove information from the image that will not be visible.
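The per-block transform step can be illustrated with a naive 2-D DCT-II (O(N^4), for illustration only; real codecs use fast factorizations, and ABSolute's exact processing is proprietary):

```python
import math

def dct2(block):
    # Naive 2-D DCT-II of an N x N block.  After the transform, most of
    # the energy of a smooth block sits in a few low-frequency
    # coefficients, which is what makes quantizing the rest invisible.
    N = len(block)
    def alpha(k):
        return math.sqrt(1.0 / N) if k == 0 else math.sqrt(2.0 / N)
    out = [[0.0] * N for _ in range(N)]
    for u in range(N):
        for v in range(N):
            s = 0.0
            for x in range(N):
                for y in range(N):
                    s += (block[x][y]
                          * math.cos((2 * x + 1) * u * math.pi / (2 * N))
                          * math.cos((2 * y + 1) * v * math.pi / (2 * N)))
            out[u][v] = alpha(u) * alpha(v) * s
    return out
```

For a perfectly flat block, every coefficient except the DC term is zero -- the extreme case of that energy compaction.
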
[machine_vision ? computer_graphics ?][wearables ?]... foveating codecs ... an overall reduction in required bandwidth of three to five times... the only way for the codec to work is if we know where people are actually looking. ...
as video comes to phones, it will most likely do so not in the form of larger screens, but through wearable display devices that are much higher resolution. Retinal scan displays are my personal favorites. Microvision is the top company in that business, and what I would like them to do is build a cheap display and find a way to track eye motion at the same time. There is already a laser at work in these devices, so why not use it during the vertical scanning interval to track eye movement? ...
DAV: Interesting. I was thinking about a similar problem. I just set one source as the primary source and compressed it normally, then tried to compress a (very similar) text file using information from the other source ... I hadn't thought of generating 3 codewords. Interesting. In particular, I was thinking about compressing a collection of various translations of the Bible. Say I've already compressed one translation, and the *compressed* file indicates that a particular word in one verse is a repeat of the same word 2 verses ago. When I compress a different translation, and (surprise, surprise) I see that a particular word in the corresponding verse is also a repeat of the same word 2 verses ago ... even though it's a completely different word from the 1st translation ... you would think that I could use the information from that 1st file to make this 2nd file a bit smaller.
We are working in the area of source compression for correlated sources. Standard techniques lead to one of two options. One is to entropy encode the two sources into one codeword. This uses minimal memory resources, but leads to complicated decoding procedures. Alternately, the sources can be separately entropy encoded into two independent codewords. This uses greater memory resources, but leads to simple decoding procedures.
We are interested in a third set-up where the sources are encoded into three codewords. One codeword characterizes the
common information between the two sources, and the other two codewords characterize the marginal refinements needed to reconstruct each source. By differentially weighting the cost of the common information rate versus the marginal rate, we can trace out a region bracketed by the two standard techniques described above.
Uses some very technical language. (See http://www.idiom.com/~zilla/Work/Notes/productionanecdotes.html for some much easier-to-read language ... that gives anecdotes of people who understood that development time cannot be estimated before the fact)
algorithmic complexity or KCS (Kolmogorov-Chaitin-Solomonoff) complexity. ``Algorithmic complexity (AC) defines the complexity of a digital object to be the length of the shortest computer program that produces that object. This definition formalizes an intuitive notion of complexity.''
``KCS noncomputability theorem: there is no algorithm for computing the AC of an arbitrary string. ... there is no algorithm for finding the shortest program with a desired behavior.''
DAV is a little confused by this statement: ``One makes the reasonable assumption that if a statement AC(x) > c can be proved then it should be possible to extract from the proof the particular x that is used.'' Is this really true ? I wish I had an easier-to-read discussion of this.
From: Stephan T. Lavavej (stlAT_CALTECH_NOT_MIT@mit.edu) Subject: Bwtzip, A Linear-Time Burrows-Wheeler Compressor Newsgroups: comp.compression Date: 2002-12-04 07:38:02 PST
I have released a new version of my program bwtzip: http://stl.caltech.edu/bwtzip.html
It is a portable linear-time BWT compressor distributed under the GNU GPL. A comparison of its compression efficiency with respect to ZIP, gzip, and bzip2 is provided on the page above; bwtzip handily beats all of them.
Currently, bwtzip uses the following algorithms: * Suffix Tree Burrows-Wheeler Transformation * Move-to-Front-2 * Wheeler's Zero Length Coding * Arithmetic Coding
bwtzip constructs suffix trees using Ukkonen's online algorithm. It is extremely space-greedy, but I have finally fixed a memory leak that was lurking in the code for months.
Other algorithms are implemented in the source code but not used, including: * Suffix Array BWT * MTF * Switched-RLE and variants thereof * Variants of Wheeler's ZLE * Canonical Huffman Coding
bwtzip does not yet break input into blocks; I would recommend not attempting to compress files that are larger than 1/100 of your main system memory.
Comments are welcome. I would like to know of any algorithms that would improve my compression efficiency (not time performance, for which I care not).
Stephan T. Lavavej
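Of the pieces in that pipeline, Move-to-Front is the simplest to show; a minimal sketch (not bwtzip's actual code):

```python
def mtf_encode(data, alphabet_size=256):
    # Emit each byte's position in a self-organizing list, then move
    # that byte to the front.  Runs of equal bytes -- which the BWT
    # produces in abundance -- become runs of zeros.
    table = list(range(alphabet_size))
    out = []
    for b in data:
        i = table.index(b)
        out.append(i)
        table.insert(0, table.pop(i))
    return out

def mtf_decode(codes, alphabet_size=256):
    table = list(range(alphabet_size))
    out = []
    for i in codes:
        b = table[i]
        out.append(b)
        table.insert(0, table.pop(i))
    return bytes(out)
```

Those zero runs are exactly what the next stage (Wheeler's zero length coding, in bwtzip's case) is designed to squeeze.
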
The MG system is a suite of programs for compressing and indexing text and images. Most of the functionality implemented in the suite is as described in the book _Managing Gigabytes: Compressing and Indexing Documents and Images_. http://freshmeat.net/projects/mg/?highlight=mg&topic_id=849
_Managing Gigabytes: Compressing and Indexing Documents and Images_ book by Ian H. Witten, Alistair Moffat, and Timothy C. Bell http://www.cs.mu.oz.au/mg/
[FIXME: read]
Modular Eigenspaces
http://www-white.media.mit.edu/vismod/demos/facerec/basic.html
independently finds eigeneyes, eigennoses, and eigenmouths: a modular or layered representation of a face, where a coarse (low-resolution) description of the whole head is augmented by additional (higher-resolution) details in terms of salient facial features. ... The modular description is also advantageous for image compression and coding purposes.
From: Igor Grobarcik (igor.grobarcik at centrum.sk) Subject: Re: faster huffmann? Newsgroups: alt.lang.asm, comp.compression Date: 2003-06-30 00:29:47 PST
Hi Zdenek,
For faster Huffman tree building, most algorithms use a data structure called a heap. Another way is Huffman-Ricardo coding. See page http://www.rootshell.be/~dan/hrc.html . These algorithms are designed for building a static Huffman tree. Try adaptive Huffman algorithms. See page http://www.datacompression.info./AdaptiveHuffman.shtml .
Igor
From: Dale King (KingD at tmicha.net) Subject: Re: faster huffmann? Newsgroups: comp.compression Date: 2003-08-21 17:41:19 PST
...
For static Huffman, if you first sort your input frequencies then you don't need a heap and can build the tree in O(n) time (of course, sorting the frequencies may take O(n log n) time). A really good paper on this is:
http://www.cs.mu.oz.au/~alistair/abstracts/mk95:wads.html
This algorithm actually operates in-place with a constant amount of additional memory. You pass it an array of weights, and it overwrites the weights with the lengths of the codewords.
While the paper is not on-line, the source code for it is.
-- Dale King
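The in-place algorithm can be sketched roughly like this (a Python transliteration from my memory of the published Moffat/Katajainen C code, so treat it as illustrative rather than the authors' exact routine; the input weights must already be sorted in increasing order):

```python
def huffman_code_lengths(weights):
    # Returns the code length for each weight, reusing a single array
    # A for weights, parent pointers, depths, and finally lengths.
    A = list(weights)
    n = len(A)
    if n < 2:
        return [0] * n
    # Phase 1: simulate the Huffman merges, overwriting merged entries
    # with parent indices and internal-node weights.
    A[0] += A[1]; root = 0; leaf = 2
    for nxt in range(1, n - 1):
        if leaf >= n or A[root] < A[leaf]:
            A[nxt] = A[root]; A[root] = nxt; root += 1
        else:
            A[nxt] = A[leaf]; leaf += 1
        if leaf >= n or (root < nxt and A[root] < A[leaf]):
            A[nxt] += A[root]; A[root] = nxt; root += 1
        else:
            A[nxt] += A[leaf]; leaf += 1
    # Phase 2: convert parent pointers into internal-node depths.
    A[n - 2] = 0
    for nxt in range(n - 3, -1, -1):
        A[nxt] = A[A[nxt]] + 1
    # Phase 3: convert internal depths into leaf code lengths.
    avbl, used, dpth = 1, 0, 0
    root, nxt = n - 2, n - 1
    while avbl > 0:
        while root >= 0 and A[root] == dpth:
            used += 1; root -= 1
        while avbl > used:
            A[nxt] = dpth; nxt -= 1; avbl -= 1
        avbl, dpth, used = 2 * used, dpth + 1, 0
    return A
```
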
For sequences of asymptotically zero empirical entropy, a modification of the Lempel-Ziv coding rule is offered whose coding cost is at most a finite number of times worse than the optimum. A combinatorial proof is offered for the well-known redundancy estimate of the Lempel-Ziv coding algorithm for sequences having a positive entropy.
Caching and prefetching are important ... for speeding up access time to data on secondary storage. ... Intuitively, in order to compress data effectively, you have to be able to predict future data well, and thus good data compressors should be able to predict well for purposes of prefetching. ... the page fault rate incurred by our prefetching algorithms are optimal in the limit for almost all sequences of page requests.
Original Author: David Cary. This page split off from computer_graphics_tools.html on 1998-07-10. and has backlinks
David Cary feedback.html
Return to index // end http://david.carybros.com/html/data_compression.html /* was http://rdrop.com/~cary/html/data_compression.html */