String compression module

  • What is compression

    Data is often repeated such as abbba which can be represented in a different format effectively saving the number of characters needed to represent this data. A simple approach would be to simply count the repeated characters and represent this numerically. The result could be something like a2ba which would save one character in this string.

    This however is unlikely to occur within a string and better methods for compressing strings are available. Better string compression methods create or contain a dictionary which is used as a lookup table as this allows the data to be represented in one place. This data can then be built upon or represented in a smaller number of characters and in most cases will help reduce the overall string length.

    If you feel up to a challenge you should look at Snack Beak #27

    Cons to compression

    Each string compression method will yield different results and understanding what compression methods would suit your needs is critical. A good string compression method should be fast to encode / decode, yield a good compression ratio across different types of text not just numbers for example and lastly use a small amount of memory though this is often not possible.

    String compression module

    I have created as small string compression module that can be found here. It currently support the following string compression:-

    • Run-Length Strict
    • Run-Length Smart (Avoiding increasing the total string length)
    • Huffman Coding
    • LZW
    • Prediction by partial matching (PPM will be added at a later stage)

    Run-Length Strict
    This compression method simply counts the number of repeated characters and outputs this number followed by the character. As you can see this compression method is not that useful as it adds more characters than it represents in most cases. Additionally numerical values need to be escaped as the conflict with the decoding process.

    ------------------------------------------ RLE Strict -----------------------------------------
    Input 		        ThisssssIsATest122113ThisssssIsATest122113
    Input Length 		42
    Encoded 		T1h1i1s5I1s1A1T1e1s1t1|11|22|12|31T1h1i1s5I1s1A1T1e1s1t1|11|22|12|31
    Encoded Length 		68
    Decoded 		ThisssssIsATest122113ThisssssIsATest122113

    Run-Length Smart
    This compression method build upon the original by choosing if it is beneficial to compress a part of the string. Again this method also needs to escape numerical characters as they conflict with the decoding process. Unlike the previous string compression method will not attempt to compress parts of the string if they are not beneficial in terms of reducing the length of the string.

    ----------------------------------------- RLE Smart -----------------------------------------
    Input 		        ThisssssIsATest122113ThisssssIsATest122113
    Input Length 		42
    Encoded 		This4IsATest|11|21|1|3This4IsATest|11|21|1|3
    Encoded Length 		44
    Decoded 		ThisssssIsATest112113ThisssssIsATest112113

    Huffman Coding
    This compression method is a binary based and thus will not be properly represented in terms of compression ratio unless additional processing of the binary output is performed. Huffman Coding computes a tree based hierarchy and assigning unique binary representations for each node for example e = 1. Characters which are repeated more often are moved higher up the tree resulting in a shorter unique id being assigned which in tern save data.

    ----------------------------------------- Huffman codeing -----------------------------------------
    Input 		        ThisssssIsATest122113ThisssssIsATest122113
    Input Length 		42
    Encoded Length 		205
    Decoded 		ThisssssIsATest122113ThisssssIsATest122113

    This compression method is commonly used to compress large text files as its ability to save data increases with the more possible words created and stored in its internal dictionary as it progresses through encoding. The output dictionary only needs to include all the possible characters that are used in the string as in the decoding process the dictionary is rebuilt allowing for more complex strings. This example string is relatively small so the compression will not help in this case. Repeating the text ThisssssIsATest122113ThisssssIsATest122113 ten times which has a character length of 420
    and will result in a compressed length of 376.

    ----------------------------------------------- LZW -----------------------------------------------
    Input 		        ThisssssIsATest122113ThisssssIsATest122113
    Input Length 		42
    Encoded Length 		100
    Decoded 		ThisssssIsATest122113ThisssssIsATest122113

    Though I have tested this module I am unable to test for all possible scenarios. Please post and bug reports here along with the input string used. Better still provide a solution to the bug and I will include your name in the contributions section.

    As of yet I have not performed any bench marking so the speed of each compression method is unknown. If you wish to help with this please provide the code changes and test data. If performance is increased I will again include your name in the contributions section.

    I hope you enjoy and find this module useful.

  • Can you post a pastebin with the module instead? I peruse the forums a lot without access to ROBLOX Studio and so I can't use in-game models like Modules off the catalog. :( It could help with bench marking too (for example I prefer to use my linux suite for stuff like that because I know the CLI commands a bit better there).

    This is a lovely write-up of all of the different things you implemented. Even if I won't use the module myself, I really enjoyed all of the explanations.

  • @kools I would prefer not to make a paste bin so that users use the most up to date version. You can make a personal pastebin copy?

Log in to reply

Looks like your connection to Scripting Helpers was lost, please wait while we try to reconnect.