TL;DR (more information about how the compression works)
[edited - I was originally thinking about this backwards]
Initially I just used the utf-8 encoded stream directly for icons. But it clearly looks like it could be compressed, so I went down a bit of a rabbit hole: I couldn't find any built-in javascript compression options in the js that's loaded on PDFXChange - maybe there is, and if so, let me know.
I figured out that with a RegExp back-reference and also playing with number encoding in the alphabet one can make a simple run-length compression tool in javascript.
Code: Select all
/** Encode and decode utf-8 to run length encoding
based on concept of using String.replace and backreference to find repeating blocks from
article at https://deepbsd.github.io/js,programming/2019/06/22/Run_Length_Encoding_in_Javascript.html
Because utf-8 uses numbers 0-F, had to shift count 16 and keys 26 to make them all letters
*/
if ('undefined' == typeof xutil) xutil={};
// map numbers to letters [g-p]+, keys [q-z]+ and work in 8 char blocks
// alpha number serves as a block delimiter
// try to find repeating blocks and key at end. key is letters [q-z]+ converted back to decimal
xutil.decodeRLx = function(text) {
// see if there are any repeating keys
let [rTen,keys] = text.split(':');
// if keys were compressed, would need to decompress, then split
if (keys) keys = keys.match(/.{8}/g);
// function to return alpha to decimal
let fromAlpha = (val, offset) => val.replace(/./g, digit => parseInt(digit,10+offset)-offset );
// assumes keys exist if chr is in range q-z
// function to find in array or just return same string
let key = (chr) => (/[q-z]/).test(chr) ? keys[ fromAlpha(chr, 26) ] : chr;
// inflate and repeat the parts
rTen = rTen.replace(/([g-p]+)([0-9a-f]+|[q-z]+)/gi, (_, count, chr) => key(chr).repeat( fromAlpha(count, 16)));
return rTen;
};
xutil.encodeRLx = function(text) {
let keys = {};
// convert to an alpha representation so that don't need delimiters
let toAlpha = (val, offset) => val.toString().replace(/./g, chr => (offset+1*chr).toString(10+offset));
// must precede with a [g-p] number otherwise could get confused because keys rq=10 etc
let doEncode = (txt, bytes) => txt.replace( RegExp('(.{'+bytes+'})\\1*','g'), (group, chr) => {
// keep track of the number of times this sequence is referenced
keys[chr] = keys[chr] ? keys[chr]+1 : 1;
return toAlpha( group.length/bytes, 16) + chr;
});
// compress using 8 char blocks
let compr = doEncode( text, 8 );
// replace any keys occuring more than once
// sort the keys to put the most used ones at the front (less likely the frequently repeated key is in double digits)
for (let k in keys) {
let sortKeys = [];
if (keys[k] > 1) {
sortKeys.push([ k, keys[k] ]);
}
}
sortKeys.sort( (a, b)=> b[1]-a[1] );
// make list of tokens
let tokens = sortKeys.map( (k,i) => {
// base 10, but shifted to q-z to find during decode
compr = compr.replace( RegExp('([g-p]+)' + k[0],'g'),(_,count) => count + toAlpha(i, 26) );
return k[0];
});
// add the tokens to the end with : delimiter
// could compress the tokens using bytelength 1 but that adds complication at the other end
if (tokens.length) {
compr += ':'+tokens.join(''); // length of tokens is 8 so don't need delimiter
}
return compr;
};
Limitation
It only works on runs of hexadecimal numbers in the range 0-f (it will also work on 0-9), but any letters past f and the tool will mess things up quite dramatically.
xutil.encodeRLx( utfString )
returns a run-length compressed string
utfString is the utf-8 encoded string to be compressed
xutil.decodeRLx( rlString )
returns an uncompressed utf-8 string
rlString is the compressed string to be uncompressed
This code does it in three steps, almost all powered by regular expressions:
- It replaces occurrences of all 8 character recurring sequences with a count followed by that sequence. The count is a decimal number converted to letters g-p in place of 0-9.
- It makes a list of the 8 bit sequences, and
- for those that occur more than once, it sorts them by most -> least used, and replaces with a pointer to a list of the sequences tagged onto the end. The counter uses letters q-z in place of 0-9
Initially I was doing trial and error looping through decreasing block lengths but generally 8 characters was the best compression, so I just hard coded that in. For the lineart icons I've been using, it seems to compress quite well. It does better with icons that don't have a lot of colors or shading. I've been getting the reduced size between 50% to less than 10%, including the overhead from the minified decode script.