Hi,
Following is the LZJU90 document Nathaniel referred to; I
posted it once before, but no one was interested then.
It does single-pass conversion of an "arbitrary binary object"
to a mailable (malleable? :-) object. (I say "arbitrary" in
quotes because there is the assumption that the object is
"flat": a single ordered sequence of octets).
It is an LZ77-class algorithm. This has several interesting
characteristics:
1) [most important] LZ77 algorithms are (presently) considered
to be in the public domain, and do not infring upon (e.g.)
the Unisys patent.
2) most of the work is done by the encoder. Since objects are
usually compressed once and uncompressed either once or
many times (e.g. something sent to a mailing list) this
is a Good Thing.
3) many possible encoding algorithms are possible, with the
SAME DECODER. (think about that!) In particular, this
means that the encoder can make whatever time/space/efficiency
tradeoffs it likes, the decoder still works. It also
means that a proprietary (or even patentable) algorithm
might be used for encoding, without preventing anyone
from decoding with public domain software.
4) the decoder runs in _bounded_ space (32KB) and _linear_ time.
The code given in the RFC draft is explicitly placed in
the public domain. It is written for maximum portability and
a certain degree of clarity; production software can be
more efficient. (In particular, the encoder runs in
O(n log n) time, with a high proportionality constant;
methods are known that run nearly in O(n), with a very
small constant. Code for these methods is not in the
public domain at present :-)
Best Regards,
Robert Ullmann
Prime Computer, Inc.
+1 508 620 2800 x1736
Network Working Group R. Jung, R. Ullmann
Request for Comments: DRAFT Prime Computer, Inc.
January 1991
LZJU90: Compressed Encoding for Binary Mail
1. Status of this Memo
This memo describes an encoding [1] for a binary object to be sent in an
Internet mail message. The encoding provides both compression and
representation in a text format that will successfully survive
transmission through the many different mailers and gateways that
comprise the Internet and connected mail networks.
Distribution of this memo is unlimited.
2. Introduction
The encoding first compresses the binary object, using a modified LZ77
algorithm, called LZJU90. It then encodes each 6 bits of the output of
the compression as a text character, using a character set chosen to
survive any translations between codes, such as ASCII to EBCDIC.
The 64 six-bit strings 000000 through 111111 are represented by the
characters "+", "-", "0" to "9", "A" to "Z", and "a" to "z".
The output text begins with a line identifying the encoding. This is for
visual reference only, the Encoding: field in the header identifies the
section to the user program. It also names the object that was encoded,
usually by a file name. The format of this line is:
* LZJU90 <name>
where <name> is optional. For example:
* LZJU90 vmunix
This is followed by the compressed and encoded data, broken into lines
where convenient. It is recommended that lines be broken every 78
characters, to survive mailers than restrict line length. The decoder
must accept lines with 1 to 1000 characters on each line.
After this, there is one final line that gives the number of bytes in
the original data and a CRC of the original data. This should match the
byte count and CRC found during decompression. This line has the format:
* <count> <CRC>
Jung, Ullmann [Page 1]
RFC DRAFT LZJU90: Compressed Encoding for Binary Mail January 1991
where <count> is a decimal number, and CRC is 8 hexadecimal digits. For
example:
* 4128076 5AC2D50E
The count used in the Encoding: field in the message header is the total
number of lines, including the start and end lines that begin with *.
A complete example is given in section 6.
3. Specification of the LZJU90 compression
This data compression specification uses the Lempel-Ziv-Storer-Szymanski
model of mixing pointers and literal characters.
The data compression is defined by the decoding algorithm. Any encoder
that emits symbols which cause the decoder to produce the original input
is defined to be valid.
There are many possible strategies for the maximal-string matching that
the encoder does, section 5 gives the code for one such algorithm.
Regardless of which algorithm is used, and what tradeoffs are made
between compression ratio and execution speed or space, the result can
always be decoded by the simple decoder.
The compressed data consists of a mixture of unencoded literal
characters and copy pointers which point to an earlier occurrence of the
string to be encoded.
Compressed data contains two types of codewords:
LITERAL pass the literal directly to the uncompressed
output
COPY length, offset go back offset characters in the output and copy
length characters forward to the current position.
To distinguish between codewords, the copy length is used. A copy
length of zero indicates that the following codeword is a literal
codeword. A copy length greater than zero indicates that the following
codeword is a copy codeword.
To improve copy length encoding, a threshold value of 2 has been
subtracted from the original copy length for copy codewords, because the
minimum copy length is 3 in this compression scheme.
The maximum offset value is set at 32255. Larger offsets offer
extremely low improvements in compression (less than 1 percent,
typically).
No special encoding is done on the LITERAL characters. However, unary
Jung, Ullmann [Page 2]
RFC DRAFT LZJU90: Compressed Encoding for Binary Mail January 1991
encoding is used for the copy length and copy offset values to improve
compression. A start-step-stop unary code is used.
A (start, step, stop) unary code of the integers is defined as follows:
The Nth codeword has N ones followed by a zero followed by a field of
size START + (N * STEP). If the field width is equal to STOP then the
preceding zero can be omitted. The integers are laid out sequentially
through these codewords. For example, (0, 1, 4) would look like:
Codeword Range
0 0
10x 1-2
110xx 3-6
1110xxx 7-14
1111xxxx 15-30
Below are the actual values used for copy length and copy offset:
The copy length is encoded with a (0, 1, 7) code leading to a maximum
copy length of 256 by including the THRESHOLD value of 2.
Codeword Range
0 0
10x 3-4
110xx 5-8
1110xxx 9-16
11110xxxx 17-32
111110xxxxx 33-64
1111110xxxxxx 65-128
1111111xxxxxxx 129-256
The copy offset is encoded with a (9, 1, 14) code leading to a maximum
copy offset of 32255. Offset 0 is reserved as an end of compressed data
flag.
Codeword Range
0xxxxxxxxx 0-511
10xxxxxxxxxx 512-1535
110xxxxxxxxxxx 1536-3583
1110xxxxxxxxxxxx 3485-7679
11110xxxxxxxxxxxxx 7680-15871
11111xxxxxxxxxxxxxx 15872-32255
Jung, Ullmann [Page 3]
RFC DRAFT LZJU90: Compressed Encoding for Binary Mail January 1991
The 0 has been chosen to signal the start of the field for ease of
encoding.
The stop values are useful in the encoding to prevent out of range
values for the lengths and offsets, as well as shortening some codes by
one bit.
The worst case compression using this scheme is a 1/8 increase in size
of the encoded data. (One zero bit followed by 8 character bits). After
the character encoding, the worst case ratio is 3/2 to the original
data.
The minimum copy length of 3 has been chosen because the worst case copy
length and offset is 3 bits (3) and 19 bits (32255) for a total of 22
bits to encode a 3 character string (24 bits).
Jung, Ullmann [Page 4]
RFC DRAFT LZJU90: Compressed Encoding for Binary Mail January 1991
4. The Decoder
As mentioned previously, the compression is defined by the decoder. Any
encoder that produced output that is correctly decoded is by definition
correct.
The following is an implementation of the decoder, written more for
clarity and as much portability as possible, rather than for maximum
speed.
When optimized for a specific environment, it will run significantly
faster.
/* LZJU 90 Decoding program */
#include <stdio.h>
typedef unsigned char uchar;
typedef unsigned int uint;
#define N 32255
#define THRESHOLD 3
#define STRTP 9
#define STEPP 1
#define STOPP 14
#define STRTL 0
#define STEPL 1
#define STOPL 7
static FILE *in;
static FILE *out;
static int getbuf;
static int getlen;
static long in_count;
static long out_count;
static long crc;
static long crctable[256];
static uchar xxcodes[] =
"+-0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ\
abcdefghijklmnopqrstuvwxyz";
static uchar ddcodes[256];
static uchar text[N];
Jung, Ullmann [Page 5]
RFC DRAFT LZJU90: Compressed Encoding for Binary Mail January 1991
#define CRCPOLY 0xEDB88320
#define CRC_MASK 0xFFFFFFFF
#define UPDATE_CRC(crc, c) \
crc = crctable[((uchar)(crc) ^ (uchar)(c)) & 0xFF] \
^ (crc >> 8)
#define START_RECD "* LZJU90"
void MakeCrctable() /* Initialize CRC-32 table */
{
uint i, j;
long r;
for (i = 0; i <= 255; i++) {
r = i;
for (j = 8; j > 0; j--) {
if (r & 1)
r = (r >> 1) ^ CRCPOLY;
else
r >>= 1;
}
crctable[i] = r;
}
}
int GetXX() /* Get xxcode and translate */
{
int c;
do {
if ((c = fgetc(in)) == EOF)
c = 0;
} while (c == '\n');
in_count++;
return ddcodes[c];
}
int GetBit() /* Get one bit from input buffer */
{
int c;
while (getlen <= 0) {
c = GetXX();
getbuf |= c << (10-getlen);
getlen += 6;
}
c = (getbuf & 0x8000) != 0;
getbuf <<= 1;
getbuf &= 0xFFFF;
getlen--;
return(c);
}
Jung, Ullmann [Page 6]
RFC DRAFT LZJU90: Compressed Encoding for Binary Mail January 1991
int GetBits(len) /* Get len bits */
int len;
{
int c;
while (getlen <= 10) {
c = GetXX();
getbuf |= c << (10-getlen);
getlen += 6;
}
if (getlen < len) {
c = (uint)getbuf >> (16-len);
getbuf = GetXX();
c |= getbuf >> (6+getlen-len);
getbuf <<= (10+len-getlen);
getbuf &= 0xFFFF;
getlen -= len - 6;
}
else {
c = (uint)getbuf >> (16-len);
getbuf <<= len;
getbuf &= 0xFFFF;
getlen -= len;
}
return(c);
}
int DecodePosition() /* Decode offset position pointer */
{
int c;
int width;
int plus;
int pwr;
plus = 0;
pwr = 1 << STRTP;
for (width = STRTP; width < STOPP; width += STEPP) {
c = GetBit();
if (c == 0)
break;
plus += pwr;
pwr <<= 1;
}
if (width != 0)
c = GetBits(width);
c += plus;
return(c);
}
Jung, Ullmann [Page 7]
RFC DRAFT LZJU90: Compressed Encoding for Binary Mail January 1991
int DecodeLength() /* Decode code length */
{
int c;
int width;
int plus;
int pwr;
plus = 0;
pwr = 1 << STRTL;
for (width = STRTL; width < STOPL; width += STEPL) {
c = GetBit();
if (c == 0)
break;
plus += pwr;
pwr <<= 1;
}
if (width != 0)
c = GetBits(width);
c += plus;
return(c);
}
void InitCodes() /* Initialize decode table */
{
int i;
for (i = 0; i < 256; i++) ddcodes[i] = 0;
for (i = 0; i < 64; i++) ddcodes[xxcodes[i]] = i;
return;
}
main(ac, av) /* main program */
int ac;
char **av;
{
int r;
int j, k;
int c;
int pos;
char buf[80];
char name[3];
long num, bytes;
if (ac < 3) {
fprintf(stderr, "usage: judecode in out\n");
exit(1);
}
in = fopen(av[1], "r");
if (!in){
fprintf(stderr, "Can't open %s\n", av[1]);
exit(1);
}
Jung, Ullmann [Page 8]
RFC DRAFT LZJU90: Compressed Encoding for Binary Mail January 1991
out = fopen(av[2], "w");
if (!out) {
fprintf(stderr, "Can't open %s\n", av[2]);
fclose(in);
exit(1);
}
while (1) {
if (fgets(buf, sizeof(buf), in) == NULL) {
fprintf(stderr, "Unexpected EOF\n");
exit(1);
}
if (strncmp(buf, START_RECD, strlen(START_RECD)) == 0)
break;
}
in_count = 0;
out_count = 0;
getbuf = 0;
getlen = 0;
InitCodes();
MakeCrctable();
crc = CRC_MASK;
r = 0;
while (feof(in) == 0) {
c = DecodeLength();
if (c == 0) {
c = GetBits(8);
UPDATE_CRC(crc, c);
out_count++;
text[r] = c;
fputc(c, out);
if (++r >= N)
r = 0;
}
else {
pos = DecodePosition();
if (pos == 0)
break;
pos--;
j = c + THRESHOLD - 1;
pos = r - pos - 1;
if (pos < 0)
pos += N;
Jung, Ullmann [Page 9]
RFC DRAFT LZJU90: Compressed Encoding for Binary Mail January 1991
for (k = 0; k < j; k++) {
c = text[pos];
text[r] = c;
UPDATE_CRC(crc, c);
out_count++;
fputc(c, out);
if (++r >= N)
r = 0;
if (++pos >= N)
pos = 0;
}
}
}
fgetc(in); /* skip newline */
if (fscanf(in, "* %ld %lX", &bytes, &num) != 2) {
fprintf(stderr, "CRC record not found\n");
exit(1);
}
else if (crc != num) {
fprintf(stderr,
"CRC error, expected %lX, found %lX\n",
crc, num);
exit(1);
}
else if (bytes != out_count) {
fprintf(stderr,
"File size error, expected %lu, found %lu\n",
bytes, out_count);
exit(1);
}
else
fprintf(stderr,
"File decoded to %lu bytes correctly\n",
out_count);
fclose(in);
fclose(out);
return;
}
Jung, Ullmann [Page 10]
RFC DRAFT LZJU90: Compressed Encoding for Binary Mail January 1991
5. An example of an Encoder
Many algorithms are possible for the encoder, with different tradeoffs
between speed, size, and complexity. The following is a simple example
program which is fairly efficient; more sophisticated implementations
will run much faster, and in some cases produce somewhat better
compression.
This example also shows that the encoder need not use the entire window
available. Not using the full window costs a small amount of
compression, but can greatly increase the speed of some algorithms.
/* LZJU 90 Encoding program */
#include <stdio.h>
typedef unsigned char uchar;
typedef unsigned int uint;
#define N 8192 /* Size of window buffer */
#define F 256 /* Size of look-ahead buffer */
#define THRESHOLD 3
#define NIL N /* End of tree's node */
#define STRTP 9
#define STEPP 1
#define STOPP 14
#define STRTL 0
#define STEPL 1
#define STOPL 7
#define CHARSLINE 78
static FILE *in;
static FILE *out;
static int putlen;
static int putbuf;
static int char_ct;
static long in_count;
static long out_count;
static long crc;
static long crctable[256];
static uchar xxcodes[] =
"+-0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ\
abcdefghijklmnopqrstuvwxyz";
Jung, Ullmann [Page 11]
RFC DRAFT LZJU90: Compressed Encoding for Binary Mail January 1991
static uchar text[N + F - 1];
static int match_position;
static int match_length;
static int lson[N + 1];
static int rson[N + 257];
static int dad[N + 1];
#define CRCPOLY 0xEDB88320
#define CRC_MASK 0xFFFFFFFF
#define UPDATE_CRC(crc, c) \
crc = crctable[((uchar)(crc) ^ (uchar)(c)) & 0xFF] \
^ (crc >> 8)
void MakeCrctable() /* Initialize CRC-32 table */
{
uint i, j;
long r;
for (i = 0; i <= 255; i++) {
r = i;
for (j = 8; j > 0; j--) {
if (r & 1)
r = (r >> 1) ^ CRCPOLY;
else
r >>= 1;
}
crctable[i] = r;
}
}
void PutXX(c) /* Translate and put xxcode */
int c;
{
c = xxcodes[c & 0x3F];
if (++char_ct > CHARSLINE) {
char_ct = 1;
fputc('\n', out);
}
fputc(c, out);
out_count++;
}
Jung, Ullmann [Page 12]
RFC DRAFT LZJU90: Compressed Encoding for Binary Mail January 1991
void PutBits(c, len) /* Put rightmost "len" bits of "c" */
int c, len;
{
c <<= 16 - len;
c &= 0xFFFF;
putbuf |= (uint) c >> putlen;
c <<= 16 - putlen;
c &= 0xFFFF;
putlen += len;
while (putlen >= 6) {
PutXX(putbuf >> 10);
putlen -= 6;
putbuf <<= 6;
putbuf &= 0xFFFF;
putbuf |= (uint) c >> 10;
c = 0;
}
}
void EncodePosition(ch) /* Encode offset position pointer */
int ch;
{
int width;
int prefix;
int pwr;
pwr = 1 << STRTP;
for (width = STRTP; ch >= pwr; width += STEPP, pwr <<= 1)
ch -= pwr;
if ((prefix = width - STRTP) != 0)
PutBits(0xffff, prefix);
if (width < STOPP)
width++;
else if (width > STOPP)
abort();
PutBits(ch, width);
}
Jung, Ullmann [Page 13]
RFC DRAFT LZJU90: Compressed Encoding for Binary Mail January 1991
void EncodeLength(ch) /* Encode code length */
int ch;
{
int width;
int prefix;
int pwr;
pwr = 1 << STRTL;
for (width = STRTL; ch >= pwr; width += STEPL, pwr <<= 1)
ch -= pwr;
if ((prefix = width - STRTL) != 0)
PutBits(0xffff, prefix);
if (width < STOPL)
width++;
else if (width > STOPL)
abort();
PutBits(ch, width);
}
void InitTree()
{
int i;
for (i = N + 1; i <= N + 256; i++)
rson[i] = NIL;
for (i = 0; i < N; i++)
dad[i] = NIL;
}
Jung, Ullmann [Page 14]
RFC DRAFT LZJU90: Compressed Encoding for Binary Mail January 1991
/* Insert string of length F, text[r..r+F-1], into one of
the trees (text[r]'th tree) and return the longest-match
position and length via the global variables match_position
and match_length. If match_length = F, then remove the
old node in favor of the new one, because the old one will
be deleted sooner. Note r plays double role, as tree node
and position in buffer.
*/
void InsertNode(r)
int r;
{
int i, cmp, p, c;
uchar *key, *keyp, *txtp;
cmp = 1;
key = &text[r];
p = N + 1 + key[0];
rson[r] = lson[r] = NIL;
match_length = 0;
for ( ; ; ) {
if (cmp >= 0) {
if (rson[p] != NIL)
p = rson[p];
else {
rson[p] = r;
dad[r] = p;
return;
}
}
else {
if (lson[p] != NIL)
p = lson[p];
else {
lson[p] = r;
dad[r] = p;
return;
}
}
txtp = &text[p];
keyp = key;
if ((cmp = *++keyp - *++txtp) != 0)
continue;
for (i = 2; i < F; i++)
if ((cmp = *++keyp - *++txtp) != 0)
break;
Jung, Ullmann [Page 15]
RFC DRAFT LZJU90: Compressed Encoding for Binary Mail January 1991
if (i > match_length) {
match_position = ((r - p) & (N - 1));
if ((match_length = i) >= F)
break;
}
else if (i == match_length) {
if ((c = ((r - p) & (N - 1))) < match_position)
match_position = c;
}
}
dad[r] = dad[p];
lson[r] = lson[p];
rson[r] = rson[p];
dad[lson[p]] = r;
dad[rson[p]] = r;
if (rson[dad[p]] == p)
rson[dad[p]] = r;
else
lson[dad[p]] = r;
dad[p] = NIL;
}
Jung, Ullmann [Page 16]
RFC DRAFT LZJU90: Compressed Encoding for Binary Mail January 1991
void DeleteNode(p) /* Delete node p from tree */
int p;
{
int q;
if (dad[p] == NIL)
return;
if (rson[p] == NIL)
q = lson[p];
else if (lson[p] == NIL)
q = rson[p];
else {
q = lson[p];
if (rson[q] != NIL) {
do {
q = rson[q];
} while (rson[q] != NIL);
rson[dad[q]] = lson[q];
dad[lson[q]] = dad[q];
lson[q] = lson[p];
dad[lson[p]] = q;
}
rson[q] = rson[p];
dad[rson[p]] = q;
}
dad[q] = dad[p];
if (rson[dad[p]] == p)
rson[dad[p]] = q;
else
lson[dad[p]] = q;
dad[p] = NIL;
}
main(ac, av) /* main program */
int ac;
char **av;
{
int r, s, i, c;
int last_match_length;
int len;
if (ac < 3) {
fprintf(stderr, "usage: juencode in out\n");
exit(1);
}
in = fopen(av[1], "r");
if (!in) {
fprintf(stderr, "Can't open %s\n", av[1]);
exit(1);
}
Jung, Ullmann [Page 17]
RFC DRAFT LZJU90: Compressed Encoding for Binary Mail January 1991
out = fopen(av[2], "w");
if (!out) {
fprintf(stderr, "Can't open %s\n", av[2]);
fclose(in);
exit(1);
}
char_ct = 0;
in_count = 0;
out_count = 0;
putbuf = 0;
putlen = 0;
MakeCrctable();
crc = CRC_MASK;
fprintf(out, "* LZJU90 %s\n", av[1]);
InitTree();
r = 0;
s = 0;
/* Fill lookahead buffer */
for (len = 0; len < F && (c = fgetc(in)) != EOF; len++) {
UPDATE_CRC(crc, c);
in_count++;
text[s++] = c;
}
while (len > 0) {
InsertNode(r);
if (match_length > len)
match_length = len;
if (match_length < THRESHOLD) {
EncodeLength(0);
PutBits(text[r], 8);
match_length = 1;
}
else {
EncodeLength(match_length - THRESHOLD + 1);
EncodePosition(match_position);
}
Jung, Ullmann [Page 18]
RFC DRAFT LZJU90: Compressed Encoding for Binary Mail January 1991
last_match_length = match_length;
for (i = 0; i < last_match_length &&
(c = fgetc(in)) != EOF; i++) {
UPDATE_CRC(crc, c);
in_count++;
DeleteNode(s);
text[s] = c;
if (s < F - 1)
text[s + N] = c;
s = (s + 1) & (N - 1);
}
while (i++ < last_match_length) {
DeleteNode(s);
s = (s + 1) & (N - 1);
len--;
}
r = (r + last_match_length) & (N - 1);
}
/* end compression indicator */
EncodeLength(1);
EncodePosition(0);
PutBits(0, 7);
fprintf(out, "\n* %lu %08lX\n", in_count, crc);
fprintf(stderr, "Encoded %lu bytes to %lu symbols\n",
in_count, out_count);
fclose(in);
fclose(out);
}
6. LZJU90 example
The following is an example of an LZJU90 compressed object. Using this
as source for the program in section 4 will reveal what it is.
* LZJU90 example
8-mBtWA7WBVZ3dEBtnCNdU2WkE4owW+l4kkaApW+o4Ir0k33Ao4IE4kk
bYtk1XY618NnCQl+OHQ61d+J8FZBVVCVdClZ2-LUI0v+I4EraItasHbG
VVg7c8tdk2lCBtr3U86FZANVCdnAcUCNcAcbCMUCdicx0+u4wEETHcRM
7tZ2-6Btr268-Eh3cUAlmBth2-IUo3As42laIE2Ao4Yq4G-cHHT-wCEU
6tjBtnAci-I++
* 190 081E2601
References
[1] David Robinson, Robert L. Ullmann.
Encoding Header Field for Internet Messages.
RFC 1154, Prime Computer, April, 1990.
Jung, Ullmann [Page 19]
RFC DRAFT LZJU90: Compressed Encoding for Binary Mail January 1991
Author's Address
Robert Jung
2606 Village Road West
Norwood, MA 02062
USA
Phone: +1 617 769 5999
Email: robjung(_at_)world(_dot_)std(_dot_)com
Robert Ullmann 10-30
Prime Computer, Inc.
500 Old Connecticut Path
Framingham, MA 01701
USA
Phone: +1 508 620 2800 x1736
Email: Ariel(_at_)Relay(_dot_)Prime(_dot_)COM
Jung, Ullmann [Page 20]