Burrows–Wheeler Transform - Bijective Variant

Bijective Variant

When you do a BWTS or the BIJECTIVE BWT of ^BANANA you get ANNBAA^ you don't need a special character for the end of string which forces one to increase character space by one or force one to have a separate field with a number value for an offset. Either one of these features makes compression harder. When dealing with short files the savings are large percent wise.

In the example above BWT changed ^BANANA| where | could be thought of as EOF but not in BNN^AA|A

Since that is not a valid file you would need to add something like BNN^AA|A| where the last | is true EOF

Thankfully for SMALL FILES BWTS would
work on just the ^BANANA to get ANNBAA^
then output to file would be ANNBAA^| where | is the True EOF

The fact is in the so-called normal BWT the first | was just
a place holder for EOF and not the EOF
So BNN^AA|A| would most likely either be written as
BNN^AA (new symbol) A|
or as
(index field) BNN^AAA|
which is why on VERY SHORT FILES you will have a hard time getting any compression with out using the bijective BWT
On long files a lot depends on how the follow on compression pass is tuned. But in which case using the Bijective BWT or some version and there are many of BWT will likely only differ by a few bytes. Sometime one is shorter sometimes the other.

The bijective transform is done by sorting all rotations of the lyndon words, in comparing two string of unequal length, we compare the infinite periodic repetitions of each of these,in lexicographic order, then taking the last column of the base rotated lyndon word. For example, the text "^BANANA|" is transformed into "ANNBAA^|" through these steps (the red | character indicates the 'EOF' pointer) in original string:
The EOF character is not Needed in the Bijective Transform so it is dropped during the Transform and added back in so that its in its proper place in the file.

First step break into lyndon words, in such a way that the words in the sequence are decreasing using comparison method above.
"^BANANA" becomes (^) (B) (AN) (AN) (A) but with in change result I combine like Lydon words so I use (^) (B) (ANAN) (A)

Bijective Transformation
Input All
Rotations
Sorting All Rows in Alphabetical
Order by their first letters
Taking Last Column
of Rotated Lyndon word
Output
Last Column
^BANANA| ^^^^^^^^ (^) BBBBBBBB (B) ANANANAN... (ANAN) NANANANA... (NANA) ANANANAN... (ANAN) NANANANA... (NANA) AAAAAAAA... (A) AAAAAAAA... (A) ANANANAN... (ANAN) ANANANAN... (ANAN) BBBBBBBB... (B) NANANANA... (NANA) NANANANA... (NANA) ^^^^^^^^... (^) AAAAAAAA... (A) ANANANAN... (ANAN) ANANANAN... (ANAN) BBBBBBBB... (B) NANANANA... (NANA) NANANANA... (NANA) ^^^^^^^^... (^) ANNBAA^|
Inverse Bijective Transform
Input
ANNBAA^
Add 1 Sort 1 Add 2 Sort 2
A N N B A A ^ A A A B N N ^ AA NA NA BB AN AN ^^ AA AN AN BB NA NA ^^
Add 3 Sort 3 Add 4 Sort 4
AAA NAN NAN BBB ANA ANA ^^^ AAA ANA ANA BBB NAN NAN ^^^ AAAA NANA NANA BBBB ANAN ANAN ^^^^ AAAA ANAN ANAN BBBB NANA NANA ^^^^
Output
^BANANA

Notice to get the above you can either view it as 4 cycles
^ = (^)(^)... = ^^^^^...
B = (B)(B)... = BBBB...
ANAN = (ANAN)(ANAN)... = ANANANAN...
A = (A)(A).. = AAAAA..
or 5 cycles WHERE ANAN broken into 2
AN = (AN) (AN) ... = ANANANAN
AN = (AN) (AN) ... = ANANANAN

if a cycle is N character it will be repeated N times take lowest valve of each cycle then sort so highest first so in this case you get either
(^)
(B)
(ANAN)
(A)

or

(^)
(B)
(AN)
(AN)
(A)

to get the ^BANANA

Since any rotation of the input string will lead to the same transformed string, the BWT cannot be inverted without adding an 'EOF' marker to the input or, augmenting the output with information, such as an index, that makes it possible to identify the input string from the class of all of its rotations.

There is a bijective version of the transform, by which the transformed string uniquely identifies the original. In this version, every string has a unique inverse of the same length.

If you check the internet of BWTS you will fine the code to do this bijective BWT. The fastest versions have been created by Yuta Mori and are linear in time and space. The first known version where coded by David A. Scott

The bijective transform is computed by first factoring the input into a non-increasing sequence of Lyndon words; such a factorization exists by the Chen–Fox–Lyndon theorem, and can be found in linear time. Then, the algorithm sorts together all the rotations of all of these words; as in the usual Burrows–Wheeler transform, this produces a sorted sequence of n strings. The transformed string is then obtained by picking the final character of each of these strings in this sorted list.

For example, applying the bijective transform gives:

Input SIX.MIXED.PIXIES.SIFT.SIXTY.PIXIE.DUST.BOXES
Lyndon words SIX.MIXED.PIXIES.SIFT.SIXTY.PIXIE.DUST.BOXES
Output STEYDST.E.IXXIIXXSMPPXS.B..EE..SUSFXDIOIIIIT

The bijective transform includes eight runs of identical characters. These runs are, in order: XX, II, XX, PP, .., EE, .., and IIII. In total, 18 characters take part in these runs.

Read more about this topic:  Burrows–Wheeler Transform

Famous quotes containing the word variant:

    “I am willing to die for my country” is a variant of “I am willing to kill for my country.”
    Mason Cooley (b. 1927)