Pranavi Boyalakuntla and Lycia Tran Autumn 2023 EE 274: Data Compression


Introduction

We optimized the LZ77 implementation on SCL. LZ77 is the basis for some of the most widely used compressors (gzip, zstd), but the current implementation on SCL does not use repcodes and uses Huffman coding for entropy coding. We believe with the addition of a basic repcode and by changing the entropy coding to rANS, tANS, or Arithmetic coding we will see an improved data compression rate. The following variations of the LZ77 compressor will be implemented: original SCL implementation, variation with rANS coding, variation with tANS coding, variation with Arithmetic coding, variation with repcodes, variation with repcodes and rANS coding, variation with repcodes and tANS coding, and variation with repcodes and Arithmetic coding. These variations will be tested on different datasets (text file, random ASCII, and Github user information) to see which variation performs the best for each dataset.

Previous Work/Literature Review

Original LZ77 Implementation

The LZ77 algorithm works by replacing later iterations of a previously existing pattern by encoding information about how far back the pattern exists and how long the pattern was in a table. Unmatched patterns, or “new”, patters are included in the table as is. The specifics of how this is implemented in code as a part of the SCL library are discussed below. The steps to implement the LZ77 algorithm are:

Encoder

This section will detail the variables and functions that make up the original LZ77 encoder and the order in which they are utilized.

This implementation utilizes a custom LZ77Window class provides utilities for a given byte array including retrieving the current window or getting a byte from a current window. This class stores information about the current window like the start and end pointers.

This implementation also utilizes a custom match finder with options for two types: a base match finder, or a hash based match finder. The base match finder, MatchFinderBase, takes a lookahead buffer and a sliding window and finds the best possible match. The lookahead buffer is what we are trying to match with in the window. This function, extend_match, will look at a given candidate to match and will extend the match to the left and right. The position of the match in the lookahead buffer and the sliding window will be returned along with its length.

The hash based match finder, HashBasedMatchFinder, builds off the MatchFinderBase to find the best match using a hash table. This hash table will store all the hash values and the previous occurances of that hash value. The best match is found using the function, find_best_match where all the starting positions in the lookahead buffer are hashed and the best match is found within a sliding window. If this match is less than the minimum length, the loop continues.

The encoder begins by instantiating the first window with a provided window size, or a default of 1000000 bits. In order to encode a file, the file is broken down into blocks which are then encoded using the detailed LZ77 algorithm. The function lz77_parse_and_generate_sequence is used to encode a block. The function loops through the provided block and stores where the current position is throughout. Using the HashBasedMatchFinder a match is found for the provided lookahead buffer. If this match length is non-zero, then the offset is found and the current position is shifted. The bytes that are “done” are added onto the window so that they are included for future matches. If this match is zero, then the current position is shifted, but the bytes are saved as is and added to the window. This function returns the final lz77 table and the saved literals that did not have a pattern match.

Untitled

Decoder

The LZ77 decoder implementation loops through the LZ77 tables and decodes on a per block basis using the function decode_blockwhich then calls the execute_lz77_sequence function which will go through the table to reconstruct the data. Since the table includes information about the literals, match offset, and match length, the table can be used to expands the compressed version. The position in the literals will need to be tracked while the decoding process is being completed because the literals are stored in their own list rather than as a list of lists. The length of the literal is stored however. The decoding is completed by, copying literals, copying over the matching string, and then appending any remaining literals.

Untitled