What is the most efficient way to store a numeric range? Announcing the arrival of Valued Associate #679: Cesar Manara Planned maintenance scheduled April 17/18, 2019 at 00:00UTC (8:00pm US/Eastern)any document which says exactly what range of numbers are .NET BigIntegers designed for?A good schema to represent integer numbers from 0 to infinity, assuming you have infinite linear binary storage?Where can I find good example of techniques to compact data in-memory?What is the best way to store a table in C++Estimation of space is required to store 275305224 of 5x5 MagicSquares?How to store data that is recorded with different frequency?What is the most efficient way to store this data?How does cuckoo hashing guarantee O(1) lookups in presence of persistent hash collisionsFast, lossless compression of a video streamFast Upload/Retrieval For Blob Data Solutions

How come Sam didn't become Lord of Horn Hill?

Can I cast Passwall to drop an enemy into a 20-foot pit?

Why am I getting the error "non-boolean type specified in a context where a condition is expected" for this request?

Bete Noir -- no dairy

Check which numbers satisfy the condition [A*B*C = A! + B! + C!]

Can a USB port passively 'listen only'?

What to do with chalk when deepwater soloing?

Is the Standard Deduction better than Itemized when both are the same amount?

porting install scripts : can rpm replace apt?

Storing hydrofluoric acid before the invention of plastics

Resolving to minmaj7

Why are there no cargo aircraft with "flying wing" design?

Why do we bend a book to keep it straight?

Seeking colloquialism for “just because”

How to align text above triangle figure

How to run gsettings for another user Ubuntu 18.04.2 LTS

Why didn't this character "real die" when they blew their stack out in Altered Carbon?

Coloring maths inside a tcolorbox

Is it ethical to give a final exam after the professor has quit before teaching the remaining chapters of the course?

Denied boarding although I have proper visa and documentation. To whom should I make a complaint?

What is the logic behind the Maharil's explanation of why we don't say שעשה ניסים on Pesach?

Identifying polygons that intersect with another layer using QGIS?

If a contract sometimes uses the wrong name, is it still valid?

How to Merge Multiple Columns in to Two Columns based on Column 1 Value?



What is the most efficient way to store a numeric range?



Announcing the arrival of Valued Associate #679: Cesar Manara
Planned maintenance scheduled April 17/18, 2019 at 00:00UTC (8:00pm US/Eastern)any document which says exactly what range of numbers are .NET BigIntegers designed for?A good schema to represent integer numbers from 0 to infinity, assuming you have infinite linear binary storage?Where can I find good example of techniques to compact data in-memory?What is the best way to store a table in C++Estimation of space is required to store 275305224 of 5x5 MagicSquares?How to store data that is recorded with different frequency?What is the most efficient way to store this data?How does cuckoo hashing guarantee O(1) lookups in presence of persistent hash collisionsFast, lossless compression of a video streamFast Upload/Retrieval For Blob Data Solutions



.everyoneloves__top-leaderboard:empty,.everyoneloves__mid-leaderboard:empty,.everyoneloves__bot-mid-leaderboard:empty margin-bottom:0;








28















This question is about how many bits are required to store a range. Or put another way, for a given number of bits, what is the maximum range that can be stored and how?



Imagine we want to store a sub-range within the range 0-255.



So for example, 45-74.



We can store the example above as two unsigned bytes, but it strikes me that there must be some redundancy of information there. We know that the second value is larger than the first, so in the case that the first value is large, fewer bits are required for the second value, and in the case that the second value is large, fewer bits are required for the first.



I suspect that any compression technique would yield a marginal result, so it may be a better question to ask "what is the maximum range that could be stored in one byte?". This should be larger than what is achievable by storing the two numbers separately.



Are there any standard algorithms for doing this kind of thing?










share|improve this question
























  • do you also have to store the start of the range?

    – Ewan
    Apr 11 at 8:50











  • @Ewan I don't really follow. In the example above, 45 is the start (the minimum) and 74 is the end (the maximum) and both have to be stored.

    – rghome
    Apr 11 at 8:56






  • 2





    so is the question how much space does a type which can store any range require. or how much space does a type which can store 45-74 require?

    – Ewan
    Apr 11 at 8:58






  • 1





    While thinking about this is certainly good, I sure hope you don't do this in real applications. The reason is that the amount of complexity of real applications is so huge that we have to accept less than 100% optimized code....That is why compilers existed.

    – NoChance
    Apr 11 at 10:54






  • 2





    @rghome, I agree, even the simplest requirement produces hundreds of lines of code. Each is error prone. Personally, I'd pay for hardware than increase the complexity of software.

    – NoChance
    Apr 11 at 13:36

















28















This question is about how many bits are required to store a range. Or put another way, for a given number of bits, what is the maximum range that can be stored and how?



Imagine we want to store a sub-range within the range 0-255.



So for example, 45-74.



We can store the example above as two unsigned bytes, but it strikes me that there must be some redundancy of information there. We know that the second value is larger than the first, so in the case that the first value is large, fewer bits are required for the second value, and in the case that the second value is large, fewer bits are required for the first.



I suspect that any compression technique would yield a marginal result, so it may be a better question to ask "what is the maximum range that could be stored in one byte?". This should be larger than what is achievable by storing the two numbers separately.



Are there any standard algorithms for doing this kind of thing?










share|improve this question
























  • do you also have to store the start of the range?

    – Ewan
    Apr 11 at 8:50











  • @Ewan I don't really follow. In the example above, 45 is the start (the minimum) and 74 is the end (the maximum) and both have to be stored.

    – rghome
    Apr 11 at 8:56






  • 2





    so is the question how much space does a type which can store any range require. or how much space does a type which can store 45-74 require?

    – Ewan
    Apr 11 at 8:58






  • 1





    While thinking about this is certainly good, I sure hope you don't do this in real applications. The reason is that the amount of complexity of real applications is so huge that we have to accept less than 100% optimized code....That is why compilers existed.

    – NoChance
    Apr 11 at 10:54






  • 2





    @rghome, I agree, even the simplest requirement produces hundreds of lines of code. Each is error prone. Personally, I'd pay for hardware than increase the complexity of software.

    – NoChance
    Apr 11 at 13:36













28












28








28


5






This question is about how many bits are required to store a range. Or put another way, for a given number of bits, what is the maximum range that can be stored and how?



Imagine we want to store a sub-range within the range 0-255.



So for example, 45-74.



We can store the example above as two unsigned bytes, but it strikes me that there must be some redundancy of information there. We know that the second value is larger than the first, so in the case that the first value is large, fewer bits are required for the second value, and in the case that the second value is large, fewer bits are required for the first.



I suspect that any compression technique would yield a marginal result, so it may be a better question to ask "what is the maximum range that could be stored in one byte?". This should be larger than what is achievable by storing the two numbers separately.



Are there any standard algorithms for doing this kind of thing?










share|improve this question
















This question is about how many bits are required to store a range. Or put another way, for a given number of bits, what is the maximum range that can be stored and how?



Imagine we want to store a sub-range within the range 0-255.



So for example, 45-74.



We can store the example above as two unsigned bytes, but it strikes me that there must be some redundancy of information there. We know that the second value is larger than the first, so in the case that the first value is large, fewer bits are required for the second value, and in the case that the second value is large, fewer bits are required for the first.



I suspect that any compression technique would yield a marginal result, so it may be a better question to ask "what is the maximum range that could be stored in one byte?". This should be larger than what is achievable by storing the two numbers separately.



Are there any standard algorithms for doing this kind of thing?







data-structures numbers compression






share|improve this question















share|improve this question













share|improve this question




share|improve this question








edited Apr 12 at 13:09







rghome

















asked Apr 11 at 8:42









rghomerghome

463412




463412












  • do you also have to store the start of the range?

    – Ewan
    Apr 11 at 8:50











  • @Ewan I don't really follow. In the example above, 45 is the start (the minimum) and 74 is the end (the maximum) and both have to be stored.

    – rghome
    Apr 11 at 8:56






  • 2





    so is the question how much space does a type which can store any range require. or how much space does a type which can store 45-74 require?

    – Ewan
    Apr 11 at 8:58






  • 1





    While thinking about this is certainly good, I sure hope you don't do this in real applications. The reason is that the amount of complexity of real applications is so huge that we have to accept less than 100% optimized code....That is why compilers existed.

    – NoChance
    Apr 11 at 10:54






  • 2





    @rghome, I agree, even the simplest requirement produces hundreds of lines of code. Each is error prone. Personally, I'd pay for hardware than increase the complexity of software.

    – NoChance
    Apr 11 at 13:36

















  • do you also have to store the start of the range?

    – Ewan
    Apr 11 at 8:50











  • @Ewan I don't really follow. In the example above, 45 is the start (the minimum) and 74 is the end (the maximum) and both have to be stored.

    – rghome
    Apr 11 at 8:56






  • 2





    so is the question how much space does a type which can store any range require. or how much space does a type which can store 45-74 require?

    – Ewan
    Apr 11 at 8:58






  • 1





    While thinking about this is certainly good, I sure hope you don't do this in real applications. The reason is that the amount of complexity of real applications is so huge that we have to accept less than 100% optimized code....That is why compilers existed.

    – NoChance
    Apr 11 at 10:54






  • 2





    @rghome, I agree, even the simplest requirement produces hundreds of lines of code. Each is error prone. Personally, I'd pay for hardware than increase the complexity of software.

    – NoChance
    Apr 11 at 13:36
















do you also have to store the start of the range?

– Ewan
Apr 11 at 8:50





do you also have to store the start of the range?

– Ewan
Apr 11 at 8:50













@Ewan I don't really follow. In the example above, 45 is the start (the minimum) and 74 is the end (the maximum) and both have to be stored.

– rghome
Apr 11 at 8:56





@Ewan I don't really follow. In the example above, 45 is the start (the minimum) and 74 is the end (the maximum) and both have to be stored.

– rghome
Apr 11 at 8:56




2




2





so is the question how much space does a type which can store any range require. or how much space does a type which can store 45-74 require?

– Ewan
Apr 11 at 8:58





so is the question how much space does a type which can store any range require. or how much space does a type which can store 45-74 require?

– Ewan
Apr 11 at 8:58




1




1





While thinking about this is certainly good, I sure hope you don't do this in real applications. The reason is that the amount of complexity of real applications is so huge that we have to accept less than 100% optimized code....That is why compilers existed.

– NoChance
Apr 11 at 10:54





While thinking about this is certainly good, I sure hope you don't do this in real applications. The reason is that the amount of complexity of real applications is so huge that we have to accept less than 100% optimized code....That is why compilers existed.

– NoChance
Apr 11 at 10:54




2




2





@rghome, I agree, even the simplest requirement produces hundreds of lines of code. Each is error prone. Personally, I'd pay for hardware than increase the complexity of software.

– NoChance
Apr 11 at 13:36





@rghome, I agree, even the simplest requirement produces hundreds of lines of code. Each is error prone. Personally, I'd pay for hardware than increase the complexity of software.

– NoChance
Apr 11 at 13:36










5 Answers
5






active

oldest

votes


















56














Just count the number of possible ranges. There are 256 ranges with lower bound 0 (0-0, 0-1, ... 0-254, 0-255), 255 ranges with lower bound 1, ... and finally 1 range with lower bound 255 (255-255). So the total number is (256 + 255 + ... + 1) = 257 * 128 = 32,896. As this is slightly higher than 215 = 32,768, you'll still need at least 16 bits (2 bytes) to store this information.



In general, for numbers from 0 up to n-1, the number of possible ranges is n*(n+1)/2. This is less than 256 if n is 22 or less: n = 22 gives 22*23/2 = 253 possibilities. So one byte suffices for sub-ranges of 0-21.



Another way to look at the problem is the following: storing a pair of integers in the range 0 to n-1 is almost the same as storing a subrange of 0-(n-1) plus a single bit which determines if the first number is lower or higher than the second one. (The difference comes from the case when both integers are equal, but this chance becomes increasingly smaller as n grows larger.) That's why you can only save about a single bit with this technique, and probably the main reason why it is rarely used.






share|improve this answer

























  • Thanks. The number of bits required for n ranges is log(n)/log2. Feeding it all into Wolfram Alpha gave me the following Excel compatible formula for calculating the maximum value for the subrange for a given number of bits: =INT((SQRT(POWER(2, N + 3) + 1) - 1) / 2)

    – rghome
    Apr 11 at 9:51






  • 9





    The TLDR is that you gain about half a bit, so in general it isn't really worth compressing.

    – rghome
    Apr 11 at 9:53











  • Yeah, it tends to one bit for large N but it isn't really worth the hassle.

    – Glorfindel
    Apr 11 at 10:13











  • FYI, the N + 3 in the equation looks odd, but one power of 2 comes from your equation and the other two comes from the 4ac part of the quadratic formula.

    – rghome
    Apr 11 at 13:35






  • 1





    BTW, your counting discounts the empty range, for which all the non-counted combinations stand. So, n * (n + 1) / 2 + 1! A miniscule change.

    – Deduplicator
    Apr 11 at 17:47


















16














For such small number of bits, it is infeasible to save many bits as Glorfindel has pointed out.
However, if the domain you are using has a few more bits, you can achieve significant savings for the average case by encoding ranges with the start value and a delta.



Lets assume the domain is the integers, so 32 bits. With the naive approach, you need 64 bits (start, end) to store a range.



If we switch to an encoding of (start, delta), we can construct the end of the range from that. We know that in the worst case, the start is 0 and the delta has 32 bits.



2^5 is 32, so we encode the length of the delta in five bits (no zero length, always add 1), and the encoding becomes (start, length, delta). In the worst case, this costs 32*2 + 5 bits, so 69 bits. So in the worst case, if all ranges are long, this is worse then the naive encoding.



In the best case, it costs 32 + 5 + 1 = 38 bits.



This means if you have to encode a lot of ranges, and those ranges each only cover small part of your domain, you end up using less space on average using this encoding. It doesn't matter how the starts are distributed, since the start will always take 32 bits, but it does matter how the lengths of the ranges are distributed. If the more small lengths you have, the better the compression, the more ranges you have that cover the whole length of the domain, the worse this encoding will get.



However, if you have lots of ranges grouped around similar start points, (for example because you get values from a sensor), you can achieve even bigger savings. You can apply the same technique to the start value and use a bias to offset the start value.



Lets say you have 10000 ranges. The ranges are grouped around a certain value. You encode the bias with 32 bits.



Using the naive approach, you would need 32*2 * 10 000 = 640 000 bits to store all those ranges.



Encoding the bias takes 32 bits, and encoding each range takes in the best case then 5 + 1 + 5 + 1 = 12 bits, for a total of 120 000 + 32 = 120 032 bits.
In the worst case, you need 5 + 32 + 5 + 32 bits, thus 74 bits, for a total of 740 032 bits.



This means, for 10 000 values on a domain that takes 32 bits to encode, we get



  • 120 032 bits with the smart delta-encoding in the best case

  • 640 000 bits with the naive start, end encoding, always (no best or worst case)

  • 740 032 bits with the smart delta encoding in the worst case

If you take the naive encoding as baseline, that means either savings of up to 81.25% or up to 15.625% more cost.



Depending on how your values are distributed, those savings are significant. Know your business domain! Know what you want to encode.



As an extension, you can also change the bias. If you analyse the data and identify groups of values, you can sort the data into buckets and encode each of those buckets separately, with its own bias. This means you can apply this technique not only to ranges that are grouped around a single start value, but also to ranges that are grouped around multiple values.



If your start points are distributed equally, this encoding doesn't really work that well.



This encoding is obviously extremely bad to index. You can not simply read the x-th value. It can pretty much only be read sequentially. Which is appropriate in some situations, e.g. streaming over the network or bulk storage (e.g. on tape or HDD).



Evaluating the data, grouping it and choosing the correct bias can be substantial work and might require some fine-tuning for optimal results.






share|improve this answer
































    6














    This kind of problem is the subject of Claude Shannon’s seminal paper, A Mathematical Theory of Communication, which introduced the word “bit” and more or less invented data compression.



    The general idea is that the number of bits used to encode a range is inversely proportional to the probability of that range occurring. For example, suppose the range 45-74 appears about 1/4 of the time. You may say that the sequence 00 corresponds to 45-74. To encode the range 45-74, you output “00” and stop there.



    Let’s also suppose that the ranges 99-100 and 140-155 each appear about 1/8 of the time. You can encode each of them with a 3 bit sequence. Any 3 bits will do as long as they don’t start with “00”, which has already been reserved for the range 45-74.



    00: 45-74
    010: 99-100
    101: 140-155


    You can continue in this manner until every possible range has an encoding. The least probable range may need over 100 bits. But that’s okay because it rarely appears.



    There are algorithms to find the optimal coding. I won’t try to explain them here, but you can find more by visiting the link above or searching for “Information Theory”, “Shannon-fano coding”, or “Huffman coding”.



    As others have pointed out, it’s probably better to store the start number and the difference between the start and end number. You should use one coding for the start and another for the difference, as they have different probability distributions (and I’m guessing the latter is more redundant). As polygnome suggested, the best algorithm depends on your domain.






    share|improve this answer




















    • 1





      Yeah the business domain is really important. We actually considered using Huffmann coding for the biases for the start date, but eventually decided against it after running some statistical analysis on real-world data. Simplicity of using the same encoding for bias and delta was more important then adding Huffmann on top, plus you need to send the whole Huffmann tree as well. Its a good idea to keep Huffmann coding in mind though.

      – Polygnome
      Apr 12 at 10:12


















    1














    To expand on the answer from @Glorfindel:



    As n → ∞, (n - 1) → n. Thus, Ω(ranges) → n² / 2 and log(Ω(ranges)) → (2n - 1). Since the naive encoding takes 2n bits, the asymptotic maximal compression only saves 1 bit.






    share|improve this answer






























      1














      There's a similar answer, but to achieve optimal compression, you need:



      1. An optimal entropy encoding method (read up on Arithmetic coding and the essentially equivalent (same compression ratio, a bit faster but also harder to grasp) ANS)

      2. As much information as possible about the distribution of the data. Crucially, this does not just involve "guessing" how often one number may appear, but you can often rule out certain possibilities for sure. For example, you can rule out intervals of negative size, and possibly 0 size, depending on how you define a valid interval. If you have multiple intervals to encode at once, you could sort them e.g. in order of decreasing width, or increasing start/end value, and rule out a whole lot of values (e.g. if you guarantee an order by decreasing width, the previous interval had a width of 100, and the starting value for the next one is 47, you only need to consider the possibilities up to 147 for end values).

      Importantly, number 2 means you want to encode things in such a way that the most informative values (per bit encoded) come first. For example, while I suggested encoding a sorted list "as-is", it would usually be smarter to encode it as a "binary tree" -- i.e. if they're sorted by width, and you have len elements, start by encoding element len/2. Say it had width w. Now you know all elements before it have width somewhere in [0, w], and all elements after it have width somewhere in [w, max val you accept]. Repeat recursively (subdividing each half list again in half, etc) until you've covered len elements (unless it's fixed, you'll want to encode len first so you don't need to bother with ending tokens). If "max val you accept" is really open, it may be smart to first encode the highest value that actually appears in your data, i.e. the last element, and then do the binary partitioning. Again, whatever is most informative per bit first.



      Also, if you're encoding the width of the interval first, and you know the max possible value you're dealing with, obviously you can rule out all starting values that would make it overflow... you get the idea. Transform and order your data in such a way that you can infer as much as possible about the rest of the data as you decode it, and an optimal entropy encoding algorithm will make sure you're not wasting bits encoding info you "already know".






      share|improve this answer










      New contributor




      tohoho is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
      Check out our Code of Conduct.


















        protected by gnat Apr 12 at 11:16



        Thank you for your interest in this question.
        Because it has attracted low-quality or spam answers that had to be removed, posting an answer now requires 10 reputation on this site (the association bonus does not count).



        Would you like to answer one of these unanswered questions instead?














        5 Answers
        5






        active

        oldest

        votes








        5 Answers
        5






        active

        oldest

        votes









        active

        oldest

        votes






        active

        oldest

        votes









        56














        Just count the number of possible ranges. There are 256 ranges with lower bound 0 (0-0, 0-1, ... 0-254, 0-255), 255 ranges with lower bound 1, ... and finally 1 range with lower bound 255 (255-255). So the total number is (256 + 255 + ... + 1) = 257 * 128 = 32,896. As this is slightly higher than 215 = 32,768, you'll still need at least 16 bits (2 bytes) to store this information.



        In general, for numbers from 0 up to n-1, the number of possible ranges is n*(n+1)/2. This is less than 256 if n is 22 or less: n = 22 gives 22*23/2 = 253 possibilities. So one byte suffices for sub-ranges of 0-21.



        Another way to look at the problem is the following: storing a pair of integers in the range 0 to n-1 is almost the same as storing a subrange of 0-(n-1) plus a single bit which determines if the first number is lower or higher than the second one. (The difference comes from the case when both integers are equal, but this chance becomes increasingly smaller as n grows larger.) That's why you can only save about a single bit with this technique, and probably the main reason why it is rarely used.






        share|improve this answer

























        • Thanks. The number of bits required for n ranges is log(n)/log2. Feeding it all into Wolfram Alpha gave me the following Excel compatible formula for calculating the maximum value for the subrange for a given number of bits: =INT((SQRT(POWER(2, N + 3) + 1) - 1) / 2)

          – rghome
          Apr 11 at 9:51






        • 9





          The TLDR is that you gain about half a bit, so in general it isn't really worth compressing.

          – rghome
          Apr 11 at 9:53











        • Yeah, it tends to one bit for large N but it isn't really worth the hassle.

          – Glorfindel
          Apr 11 at 10:13











        • FYI, the N + 3 in the equation looks odd, but one power of 2 comes from your equation and the other two comes from the 4ac part of the quadratic formula.

          – rghome
          Apr 11 at 13:35






        • 1





          BTW, your counting discounts the empty range, for which all the non-counted combinations stand. So, n * (n + 1) / 2 + 1! A miniscule change.

          – Deduplicator
          Apr 11 at 17:47















        56














        Just count the number of possible ranges. There are 256 ranges with lower bound 0 (0-0, 0-1, ... 0-254, 0-255), 255 ranges with lower bound 1, ... and finally 1 range with lower bound 255 (255-255). So the total number is (256 + 255 + ... + 1) = 257 * 128 = 32,896. As this is slightly higher than 215 = 32,768, you'll still need at least 16 bits (2 bytes) to store this information.



        In general, for numbers from 0 up to n-1, the number of possible ranges is n*(n+1)/2. This is less than 256 if n is 22 or less: n = 22 gives 22*23/2 = 253 possibilities. So one byte suffices for sub-ranges of 0-21.



        Another way to look at the problem is the following: storing a pair of integers in the range 0 to n-1 is almost the same as storing a subrange of 0-(n-1) plus a single bit which determines if the first number is lower or higher than the second one. (The difference comes from the case when both integers are equal, but this chance becomes increasingly smaller as n grows larger.) That's why you can only save about a single bit with this technique, and probably the main reason why it is rarely used.






        share|improve this answer

























        • Thanks. The number of bits required for n ranges is log(n)/log2. Feeding it all into Wolfram Alpha gave me the following Excel compatible formula for calculating the maximum value for the subrange for a given number of bits: =INT((SQRT(POWER(2, N + 3) + 1) - 1) / 2)

          – rghome
          Apr 11 at 9:51






        • 9





          The TLDR is that you gain about half a bit, so in general it isn't really worth compressing.

          – rghome
          Apr 11 at 9:53











        • Yeah, it tends to one bit for large N but it isn't really worth the hassle.

          – Glorfindel
          Apr 11 at 10:13











        • FYI, the N + 3 in the equation looks odd, but one power of 2 comes from your equation and the other two comes from the 4ac part of the quadratic formula.

          – rghome
          Apr 11 at 13:35






        • 1





          BTW, your counting discounts the empty range, for which all the non-counted combinations stand. So, n * (n + 1) / 2 + 1! A miniscule change.

          – Deduplicator
          Apr 11 at 17:47













        56












        56








        56







        Just count the number of possible ranges. There are 256 ranges with lower bound 0 (0-0, 0-1, ... 0-254, 0-255), 255 ranges with lower bound 1, ... and finally 1 range with lower bound 255 (255-255). So the total number is (256 + 255 + ... + 1) = 257 * 128 = 32,896. As this is slightly higher than 215 = 32,768, you'll still need at least 16 bits (2 bytes) to store this information.



        In general, for numbers from 0 up to n-1, the number of possible ranges is n*(n+1)/2. This is less than 256 if n is 22 or less: n = 22 gives 22*23/2 = 253 possibilities. So one byte suffices for sub-ranges of 0-21.



        Another way to look at the problem is the following: storing a pair of integers in the range 0 to n-1 is almost the same as storing a subrange of 0-(n-1) plus a single bit which determines if the first number is lower or higher than the second one. (The difference comes from the case when both integers are equal, but this chance becomes increasingly smaller as n grows larger.) That's why you can only save about a single bit with this technique, and probably the main reason why it is rarely used.






        share|improve this answer















        Just count the number of possible ranges. There are 256 ranges with lower bound 0 (0-0, 0-1, ... 0-254, 0-255), 255 ranges with lower bound 1, ... and finally 1 range with lower bound 255 (255-255). So the total number is (256 + 255 + ... + 1) = 257 * 128 = 32,896. As this is slightly higher than 215 = 32,768, you'll still need at least 16 bits (2 bytes) to store this information.



        In general, for numbers from 0 up to n-1, the number of possible ranges is n*(n+1)/2. This is less than 256 if n is 22 or less: n = 22 gives 22*23/2 = 253 possibilities. So one byte suffices for sub-ranges of 0-21.



        Another way to look at the problem is the following: storing a pair of integers in the range 0 to n-1 is almost the same as storing a subrange of 0-(n-1) plus a single bit which determines if the first number is lower or higher than the second one. (The difference comes from the case when both integers are equal, but this chance becomes increasingly smaller as n grows larger.) That's why you can only save about a single bit with this technique, and probably the main reason why it is rarely used.







        share|improve this answer














        share|improve this answer



        share|improve this answer








        edited Apr 11 at 10:44

























        answered Apr 11 at 8:54









        GlorfindelGlorfindel

        2,32541727




        2,32541727












        • Thanks. The number of bits required for n ranges is log(n)/log2. Feeding it all into Wolfram Alpha gave me the following Excel compatible formula for calculating the maximum value for the subrange for a given number of bits: =INT((SQRT(POWER(2, N + 3) + 1) - 1) / 2)

          – rghome
          Apr 11 at 9:51






        • 9





          The TLDR is that you gain about half a bit, so in general it isn't really worth compressing.

          – rghome
          Apr 11 at 9:53











        • Yeah, it tends to one bit for large N but it isn't really worth the hassle.

          – Glorfindel
          Apr 11 at 10:13











        • FYI, the N + 3 in the equation looks odd, but one power of 2 comes from your equation and the other two comes from the 4ac part of the quadratic formula.

          – rghome
          Apr 11 at 13:35






        • 1





          BTW, your counting discounts the empty range, for which all the non-counted combinations stand. So, n * (n + 1) / 2 + 1! A miniscule change.

          – Deduplicator
          Apr 11 at 17:47

















        • Thanks. The number of bits required for n ranges is log(n)/log2. Feeding it all into Wolfram Alpha gave me the following Excel compatible formula for calculating the maximum value for the subrange for a given number of bits: =INT((SQRT(POWER(2, N + 3) + 1) - 1) / 2)

          – rghome
          Apr 11 at 9:51






        • 9





          The TLDR is that you gain about half a bit, so in general it isn't really worth compressing.

          – rghome
          Apr 11 at 9:53











        • Yeah, it tends to one bit for large N but it isn't really worth the hassle.

          – Glorfindel
          Apr 11 at 10:13











        • FYI, the N + 3 in the equation looks odd, but one power of 2 comes from your equation and the other two comes from the 4ac part of the quadratic formula.

          – rghome
          Apr 11 at 13:35






        • 1





          BTW, your counting discounts the empty range, for which all the non-counted combinations stand. So, n * (n + 1) / 2 + 1! A miniscule change.

          – Deduplicator
          Apr 11 at 17:47
















        Thanks. The number of bits required for n ranges is log(n)/log2. Feeding it all into Wolfram Alpha gave me the following Excel compatible formula for calculating the maximum value for the subrange for a given number of bits: =INT((SQRT(POWER(2, N + 3) + 1) - 1) / 2)

        – rghome
        Apr 11 at 9:51





        Thanks. The number of bits required for n ranges is log(n)/log2. Feeding it all into Wolfram Alpha gave me the following Excel compatible formula for calculating the maximum value for the subrange for a given number of bits: =INT((SQRT(POWER(2, N + 3) + 1) - 1) / 2)

        – rghome
        Apr 11 at 9:51




        9




        9





        The TLDR is that you gain about half a bit, so in general it isn't really worth compressing.

        – rghome
        Apr 11 at 9:53





        The TLDR is that you gain about half a bit, so in general it isn't really worth compressing.

        – rghome
        Apr 11 at 9:53













        Yeah, it tends to one bit for large N but it isn't really worth the hassle.

        – Glorfindel
        Apr 11 at 10:13





        Yeah, it tends to one bit for large N but it isn't really worth the hassle.

        – Glorfindel
        Apr 11 at 10:13













        FYI, the N + 3 in the equation looks odd, but one power of 2 comes from your equation and the other two comes from the 4ac part of the quadratic formula.

        – rghome
        Apr 11 at 13:35





        FYI, the N + 3 in the equation looks odd, but one power of 2 comes from your equation and the other two comes from the 4ac part of the quadratic formula.

        – rghome
        Apr 11 at 13:35




        1




        1





        BTW, your counting discounts the empty range, for which all the non-counted combinations stand. So, n * (n + 1) / 2 + 1! A miniscule change.

        – Deduplicator
        Apr 11 at 17:47





        BTW, your counting discounts the empty range, for which all the non-counted combinations stand. So, n * (n + 1) / 2 + 1! A miniscule change.

        – Deduplicator
        Apr 11 at 17:47













        16














        For such small number of bits, it is infeasible to save many bits as Glorfindel has pointed out.
        However, if the domain you are using has a few more bits, you can achieve significant savings for the average case by encoding ranges with the start value and a delta.



        Lets assume the domain is the integers, so 32 bits. With the naive approach, you need 64 bits (start, end) to store a range.



        If we switch to an encoding of (start, delta), we can construct the end of the range from that. We know that in the worst case, the start is 0 and the delta has 32 bits.



        2^5 is 32, so we encode the length of the delta in five bits (no zero length, always add 1), and the encoding becomes (start, length, delta). In the worst case, this costs 32*2 + 5 bits, so 69 bits. So in the worst case, if all ranges are long, this is worse then the naive encoding.



        In the best case, it costs 32 + 5 + 1 = 38 bits.



        This means if you have to encode a lot of ranges, and those ranges each only cover small part of your domain, you end up using less space on average using this encoding. It doesn't matter how the starts are distributed, since the start will always take 32 bits, but it does matter how the lengths of the ranges are distributed. If the more small lengths you have, the better the compression, the more ranges you have that cover the whole length of the domain, the worse this encoding will get.



        However, if you have lots of ranges grouped around similar start points, (for example because you get values from a sensor), you can achieve even bigger savings. You can apply the same technique to the start value and use a bias to offset the start value.



        Lets say you have 10000 ranges. The ranges are grouped around a certain value. You encode the bias with 32 bits.



        Using the naive approach, you would need 32*2 * 10 000 = 640 000 bits to store all those ranges.



        Encoding the bias takes 32 bits, and encoding each range takes in the best case then 5 + 1 + 5 + 1 = 12 bits, for a total of 120 000 + 32 = 120 032 bits.
        In the worst case, you need 5 + 32 + 5 + 32 bits, thus 74 bits, for a total of 740 032 bits.



        This means, for 10 000 values on a domain that takes 32 bits to encode, we get



        • 120 032 bits with the smart delta-encoding in the best case

        • 640 000 bits with the naive start, end encoding, always (no best or worst case)

        • 740 032 bits with the smart delta encoding in the worst case

        If you take the naive encoding as baseline, that means either savings of up to 81.25% or up to 15.625% more cost.



        Depending on how your values are distributed, those savings are significant. Know your business domain! Know what you want to encode.



        As an extension, you can also change the bias. If you analyse the data and identify groups of values, you can sort the data into buckets and encode each of those buckets separately, with its own bias. This means you can apply this technique not only to ranges that are grouped around a single start value, but also to ranges that are grouped around multiple values.



        If your start points are distributed equally, this encoding doesn't really work that well.



        This encoding is obviously extremely bad to index. You can not simply read the x-th value. It can pretty much only be read sequentially. Which is appropriate in some situations, e.g. streaming over the network or bulk storage (e.g. on tape or HDD).



        Evaluating the data, grouping it and choosing the correct bias can be substantial work and might require some fine-tuning for optimal results.






        share|improve this answer





























          16














          For such small number of bits, it is infeasible to save many bits as Glorfindel has pointed out.
          However, if the domain you are using has a few more bits, you can achieve significant savings for the average case by encoding ranges with the start value and a delta.



          Lets assume the domain is the integers, so 32 bits. With the naive approach, you need 64 bits (start, end) to store a range.



          If we switch to an encoding of (start, delta), we can construct the end of the range from that. We know that in the worst case, the start is 0 and the delta has 32 bits.



          2^5 is 32, so we encode the length of the delta in five bits (no zero length, always add 1), and the encoding becomes (start, length, delta). In the worst case, this costs 32*2 + 5 bits, so 69 bits. So in the worst case, if all ranges are long, this is worse then the naive encoding.



          In the best case, it costs 32 + 5 + 1 = 38 bits.



          This means if you have to encode a lot of ranges, and those ranges each only cover small part of your domain, you end up using less space on average using this encoding. It doesn't matter how the starts are distributed, since the start will always take 32 bits, but it does matter how the lengths of the ranges are distributed. If the more small lengths you have, the better the compression, the more ranges you have that cover the whole length of the domain, the worse this encoding will get.



          However, if you have lots of ranges grouped around similar start points, (for example because you get values from a sensor), you can achieve even bigger savings. You can apply the same technique to the start value and use a bias to offset the start value.



          Lets say you have 10000 ranges. The ranges are grouped around a certain value. You encode the bias with 32 bits.



          Using the naive approach, you would need 32*2 * 10 000 = 640 000 bits to store all those ranges.



          Encoding the bias takes 32 bits, and encoding each range takes in the best case then 5 + 1 + 5 + 1 = 12 bits, for a total of 120 000 + 32 = 120 032 bits.
          In the worst case, you need 5 + 32 + 5 + 32 bits, thus 74 bits, for a total of 740 032 bits.



          This means, for 10 000 values on a domain that takes 32 bits to encode, we get



          • 120 032 bits with the smart delta-encoding in the best case

          • 640 000 bits with the naive start, end encoding, always (no best or worst case)

          • 740 032 bits with the smart delta encoding in the worst case

          If you take the naive encoding as baseline, that means either savings of up to 81.25% or up to 15.625% more cost.



          Depending on how your values are distributed, those savings are significant. Know your business domain! Know what you want to encode.



          As an extension, you can also change the bias. If you analyse the data and identify groups of values, you can sort the data into buckets and encode each of those buckets separately, with its own bias. This means you can apply this technique not only to ranges that are grouped around a single start value, but also to ranges that are grouped around multiple values.



          If your start points are distributed equally, this encoding doesn't really work that well.



          This encoding is obviously extremely bad to index. You can not simply read the x-th value. It can pretty much only be read sequentially. Which is appropriate in some situations, e.g. streaming over the network or bulk storage (e.g. on tape or HDD).



          Evaluating the data, grouping it and choosing the correct bias can be substantial work and might require some fine-tuning for optimal results.






          share|improve this answer



























            16












            16








            16







            For such small number of bits, it is infeasible to save many bits as Glorfindel has pointed out.
            However, if the domain you are using has a few more bits, you can achieve significant savings for the average case by encoding ranges with the start value and a delta.



            Lets assume the domain is the integers, so 32 bits. With the naive approach, you need 64 bits (start, end) to store a range.



            If we switch to an encoding of (start, delta), we can construct the end of the range from that. We know that in the worst case, the start is 0 and the delta has 32 bits.



            2^5 is 32, so we encode the length of the delta in five bits (no zero length, always add 1), and the encoding becomes (start, length, delta). In the worst case, this costs 32*2 + 5 bits, so 69 bits. So in the worst case, if all ranges are long, this is worse then the naive encoding.



            In the best case, it costs 32 + 5 + 1 = 38 bits.



            This means if you have to encode a lot of ranges, and those ranges each only cover small part of your domain, you end up using less space on average using this encoding. It doesn't matter how the starts are distributed, since the start will always take 32 bits, but it does matter how the lengths of the ranges are distributed. If the more small lengths you have, the better the compression, the more ranges you have that cover the whole length of the domain, the worse this encoding will get.



            However, if you have lots of ranges grouped around similar start points, (for example because you get values from a sensor), you can achieve even bigger savings. You can apply the same technique to the start value and use a bias to offset the start value.



            Lets say you have 10000 ranges. The ranges are grouped around a certain value. You encode the bias with 32 bits.



            Using the naive approach, you would need 32*2 * 10 000 = 640 000 bits to store all those ranges.



            Encoding the bias takes 32 bits, and encoding each range takes in the best case then 5 + 1 + 5 + 1 = 12 bits, for a total of 120 000 + 32 = 120 032 bits.
            In the worst case, you need 5 + 32 + 5 + 32 bits, thus 74 bits, for a total of 740 032 bits.



            This means, for 10 000 values on a domain that takes 32 bits to encode, we get



            • 120 032 bits with the smart delta-encoding in the best case

            • 640 000 bits with the naive start, end encoding, always (no best or worst case)

            • 740 032 bits with the smart delta encoding in the worst case

            If you take the naive encoding as baseline, that means either savings of up to 81.25% or up to 15.625% more cost.



            Depending on how your values are distributed, those savings are significant. Know your business domain! Know what you want to encode.



            As an extension, you can also change the bias. If you analyse the data and identify groups of values, you can sort the data into buckets and encode each of those buckets separately, with its own bias. This means you can apply this technique not only to ranges that are grouped around a single start value, but also to ranges that are grouped around multiple values.



            If your start points are distributed equally, this encoding doesn't really work that well.



            This encoding is obviously extremely bad to index. You can not simply read the x-th value. It can pretty much only be read sequentially. Which is appropriate in some situations, e.g. streaming over the network or bulk storage (e.g. on tape or HDD).



            Evaluating the data, grouping it and choosing the correct bias can be substantial work and might require some fine-tuning for optimal results.






            share|improve this answer















            For such small number of bits, it is infeasible to save many bits as Glorfindel has pointed out.
            However, if the domain you are using has a few more bits, you can achieve significant savings for the average case by encoding ranges with the start value and a delta.



            Lets assume the domain is the integers, so 32 bits. With the naive approach, you need 64 bits (start, end) to store a range.



            If we switch to an encoding of (start, delta), we can construct the end of the range from that. We know that in the worst case, the start is 0 and the delta has 32 bits.



            2^5 is 32, so we encode the length of the delta in five bits (no zero length, always add 1), and the encoding becomes (start, length, delta). In the worst case, this costs 32*2 + 5 bits, so 69 bits. So in the worst case, if all ranges are long, this is worse then the naive encoding.



            In the best case, it costs 32 + 5 + 1 = 38 bits.



            This means if you have to encode a lot of ranges, and those ranges each only cover small part of your domain, you end up using less space on average using this encoding. It doesn't matter how the starts are distributed, since the start will always take 32 bits, but it does matter how the lengths of the ranges are distributed. If the more small lengths you have, the better the compression, the more ranges you have that cover the whole length of the domain, the worse this encoding will get.



            However, if you have lots of ranges grouped around similar start points, (for example because you get values from a sensor), you can achieve even bigger savings. You can apply the same technique to the start value and use a bias to offset the start value.



            Lets say you have 10000 ranges. The ranges are grouped around a certain value. You encode the bias with 32 bits.



            Using the naive approach, you would need 32*2 * 10 000 = 640 000 bits to store all those ranges.



            Encoding the bias takes 32 bits, and encoding each range takes in the best case then 5 + 1 + 5 + 1 = 12 bits, for a total of 120 000 + 32 = 120 032 bits.
            In the worst case, you need 5 + 32 + 5 + 32 bits, thus 74 bits, for a total of 740 032 bits.



            This means, for 10 000 values on a domain that takes 32 bits to encode, we get



            • 120 032 bits with the smart delta-encoding in the best case

            • 640 000 bits with the naive start, end encoding, always (no best or worst case)

            • 740 032 bits with the smart delta encoding in the worst case

            If you take the naive encoding as baseline, that means either savings of up to 81.25% or up to 15.625% more cost.



            Depending on how your values are distributed, those savings are significant. Know your business domain! Know what you want to encode.



            As an extension, you can also change the bias. If you analyse the data and identify groups of values, you can sort the data into buckets and encode each of those buckets separately, with its own bias. This means you can apply this technique not only to ranges that are grouped around a single start value, but also to ranges that are grouped around multiple values.



            If your start points are distributed equally, this encoding doesn't really work that well.



            This encoding is obviously extremely bad to index. You can not simply read the x-th value. It can pretty much only be read sequentially. Which is appropriate in some situations, e.g. streaming over the network or bulk storage (e.g. on tape or HDD).



            Evaluating the data, grouping it and choosing the correct bias can be substantial work and might require some fine-tuning for optimal results.







            share|improve this answer














            share|improve this answer



            share|improve this answer








            edited Apr 12 at 21:12









            Braiam

            1075




            1075










            answered Apr 11 at 13:56









            PolygnomePolygnome

            1,3161015




            1,3161015





















                6














                This kind of problem is the subject of Claude Shannon’s seminal paper, A Mathematical Theory of Communication, which introduced the word “bit” and more or less invented data compression.



                The general idea is that the number of bits used to encode a range is inversely proportional to the probability of that range occurring. For example, suppose the range 45-74 appears about 1/4 of the time. You may say that the sequence 00 corresponds to 45-74. To encode the range 45-74, you output “00” and stop there.



                Let’s also suppose that the ranges 99-100 and 140-155 each appear about 1/8 of the time. You can encode each of them with a 3 bit sequence. Any 3 bits will do as long as they don’t start with “00”, which has already been reserved for the range 45-74.



                00: 45-74
                010: 99-100
                101: 140-155


                You can continue in this manner until every possible range has an encoding. The least probable range may need over 100 bits. But that’s okay because it rarely appears.



                There are algorithms to find the optimal coding. I won’t try to explain them here, but you can find more by visiting the link above or searching for “Information Theory”, “Shannon-fano coding”, or “Huffman coding”.



                As others have pointed out, it’s probably better to store the start number and the difference between the start and end number. You should use one coding for the start and another for the difference, as they have different probability distributions (and I’m guessing the latter is more redundant). As polygnome suggested, the best algorithm depends on your domain.






                share|improve this answer




















                • 1





                  Yeah the business domain is really important. We actually considered using Huffmann coding for the biases for the start date, but eventually decided against it after running some statistical analysis on real-world data. Simplicity of using the same encoding for bias and delta was more important then adding Huffmann on top, plus you need to send the whole Huffmann tree as well. Its a good idea to keep Huffmann coding in mind though.

                  – Polygnome
                  Apr 12 at 10:12















                6














                This kind of problem is the subject of Claude Shannon’s seminal paper, A Mathematical Theory of Communication, which introduced the word “bit” and more or less invented data compression.



                The general idea is that the number of bits used to encode a range is inversely proportional to the probability of that range occurring. For example, suppose the range 45-74 appears about 1/4 of the time. You may say that the sequence 00 corresponds to 45-74. To encode the range 45-74, you output “00” and stop there.



                Let’s also suppose that the ranges 99-100 and 140-155 each appear about 1/8 of the time. You can encode each of them with a 3 bit sequence. Any 3 bits will do as long as they don’t start with “00”, which has already been reserved for the range 45-74.



                00: 45-74
                010: 99-100
                101: 140-155


                You can continue in this manner until every possible range has an encoding. The least probable range may need over 100 bits. But that’s okay because it rarely appears.



                There are algorithms to find the optimal coding. I won’t try to explain them here, but you can find more by visiting the link above or searching for “Information Theory”, “Shannon-fano coding”, or “Huffman coding”.



                As others have pointed out, it’s probably better to store the start number and the difference between the start and end number. You should use one coding for the start and another for the difference, as they have different probability distributions (and I’m guessing the latter is more redundant). As polygnome suggested, the best algorithm depends on your domain.






                share|improve this answer




















                • 1





                  Yeah the business domain is really important. We actually considered using Huffmann coding for the biases for the start date, but eventually decided against it after running some statistical analysis on real-world data. Simplicity of using the same encoding for bias and delta was more important then adding Huffmann on top, plus you need to send the whole Huffmann tree as well. Its a good idea to keep Huffmann coding in mind though.

                  – Polygnome
                  Apr 12 at 10:12













                6












                6








                6







                This kind of problem is the subject of Claude Shannon’s seminal paper, A Mathematical Theory of Communication, which introduced the word “bit” and more or less invented data compression.



                The general idea is that the number of bits used to encode a range is inversely proportional to the probability of that range occurring. For example, suppose the range 45-74 appears about 1/4 of the time. You may say that the sequence 00 corresponds to 45-74. To encode the range 45-74, you output “00” and stop there.



                Let’s also suppose that the ranges 99-100 and 140-155 each appear about 1/8 of the time. You can encode each of them with a 3 bit sequence. Any 3 bits will do as long as they don’t start with “00”, which has already been reserved for the range 45-74.



                00: 45-74
                010: 99-100
                101: 140-155


                You can continue in this manner until every possible range has an encoding. The least probable range may need over 100 bits. But that’s okay because it rarely appears.



                There are algorithms to find the optimal coding. I won’t try to explain them here, but you can find more by visiting the link above or searching for “Information Theory”, “Shannon-fano coding”, or “Huffman coding”.



                As others have pointed out, it’s probably better to store the start number and the difference between the start and end number. You should use one coding for the start and another for the difference, as they have different probability distributions (and I’m guessing the latter is more redundant). As polygnome suggested, the best algorithm depends on your domain.






                share|improve this answer















                This kind of problem is the subject of Claude Shannon’s seminal paper, A Mathematical Theory of Communication, which introduced the word “bit” and more or less invented data compression.



                The general idea is that the number of bits used to encode a range is inversely proportional to the probability of that range occurring. For example, suppose the range 45-74 appears about 1/4 of the time. You may say that the sequence 00 corresponds to 45-74. To encode the range 45-74, you output “00” and stop there.



                Let’s also suppose that the ranges 99-100 and 140-155 each appear about 1/8 of the time. You can encode each of them with a 3 bit sequence. Any 3 bits will do as long as they don’t start with “00”, which has already been reserved for the range 45-74.



                00: 45-74
                010: 99-100
                101: 140-155


                You can continue in this manner until every possible range has an encoding. The least probable range may need over 100 bits. But that’s okay because it rarely appears.



                There are algorithms to find the optimal coding. I won’t try to explain them here, but you can find more by visiting the link above or searching for “Information Theory”, “Shannon-fano coding”, or “Huffman coding”.



                As others have pointed out, it’s probably better to store the start number and the difference between the start and end number. You should use one coding for the start and another for the difference, as they have different probability distributions (and I’m guessing the latter is more redundant). As polygnome suggested, the best algorithm depends on your domain.







                share|improve this answer














                share|improve this answer



                share|improve this answer








                edited Apr 12 at 10:00

























                answered Apr 12 at 2:43









                Patrick McElhaneyPatrick McElhaney

                347210




                347210







                • 1





                  Yeah the business domain is really important. We actually considered using Huffmann coding for the biases for the start date, but eventually decided against it after running some statistical analysis on real-world data. Simplicity of using the same encoding for bias and delta was more important then adding Huffmann on top, plus you need to send the whole Huffmann tree as well. Its a good idea to keep Huffmann coding in mind though.

                  – Polygnome
                  Apr 12 at 10:12












                • 1





                  Yeah the business domain is really important. We actually considered using Huffmann coding for the biases for the start date, but eventually decided against it after running some statistical analysis on real-world data. Simplicity of using the same encoding for bias and delta was more important then adding Huffmann on top, plus you need to send the whole Huffmann tree as well. Its a good idea to keep Huffmann coding in mind though.

                  – Polygnome
                  Apr 12 at 10:12







                1




                1





                Yeah the business domain is really important. We actually considered using Huffmann coding for the biases for the start date, but eventually decided against it after running some statistical analysis on real-world data. Simplicity of using the same encoding for bias and delta was more important then adding Huffmann on top, plus you need to send the whole Huffmann tree as well. Its a good idea to keep Huffmann coding in mind though.

                – Polygnome
                Apr 12 at 10:12





                Yeah the business domain is really important. We actually considered using Huffmann coding for the biases for the start date, but eventually decided against it after running some statistical analysis on real-world data. Simplicity of using the same encoding for bias and delta was more important then adding Huffmann on top, plus you need to send the whole Huffmann tree as well. Its a good idea to keep Huffmann coding in mind though.

                – Polygnome
                Apr 12 at 10:12











                1














                To expand on the answer from @Glorfindel:



                As n → ∞, (n - 1) → n. Thus, Ω(ranges) → n² / 2 and log(Ω(ranges)) → (2n - 1). Since the naive encoding takes 2n bits, the asymptotic maximal compression only saves 1 bit.






                share|improve this answer



























                  1














                  To expand on the answer from @Glorfindel:



                  As n → ∞, (n - 1) → n. Thus, Ω(ranges) → n² / 2 and log(Ω(ranges)) → (2n - 1). Since the naive encoding takes 2n bits, the asymptotic maximal compression only saves 1 bit.






                  share|improve this answer

























                    1












                    1








                    1







                    To expand on the answer from @Glorfindel:



                    As n → ∞, (n - 1) → n. Thus, Ω(ranges) → n² / 2 and log(Ω(ranges)) → (2n - 1). Since the naive encoding takes 2n bits, the asymptotic maximal compression only saves 1 bit.






                    share|improve this answer













                    To expand on the answer from @Glorfindel:



                    As n → ∞, (n - 1) → n. Thus, Ω(ranges) → n² / 2 and log(Ω(ranges)) → (2n - 1). Since the naive encoding takes 2n bits, the asymptotic maximal compression only saves 1 bit.







                    share|improve this answer












                    share|improve this answer



                    share|improve this answer










                    answered Apr 11 at 17:39









                    Jared GoguenJared Goguen

                    1456




                    1456





















                        1














                        There's a similar answer, but to achieve optimal compression, you need:



                        1. An optimal entropy encoding method (read up on Arithmetic coding and the essentially equivalent (same compression ratio, a bit faster but also harder to grasp) ANS)

                        2. As much information as possible about the distribution of the data. Crucially, this does not just involve "guessing" how often one number may appear, but you can often rule out certain possibilities for sure. For example, you can rule out intervals of negative size, and possibly 0 size, depending on how you define a valid interval. If you have multiple intervals to encode at once, you could sort them e.g. in order of decreasing width, or increasing start/end value, and rule out a whole lot of values (e.g. if you guarantee an order by decreasing width, the previous interval had a width of 100, and the starting value for the next one is 47, you only need to consider the possibilities up to 147 for end values).

                        Importantly, number 2 means you want to encode things in such a way that the most informative values (per bit encoded) come first. For example, while I suggested encoding a sorted list "as-is", it would usually be smarter to encode it as a "binary tree" -- i.e. if they're sorted by width, and you have len elements, start by encoding element len/2. Say it had width w. Now you know all elements before it have width somewhere in [0, w], and all elements after it have width somewhere in [w, max val you accept]. Repeat recursively (subdividing each half list again in half, etc) until you've covered len elements (unless it's fixed, you'll want to encode len first so you don't need to bother with ending tokens). If "max val you accept" is really open, it may be smart to first encode the highest value that actually appears in your data, i.e. the last element, and then do the binary partitioning. Again, whatever is most informative per bit first.



                        Also, if you're encoding the width of the interval first, and you know the max possible value you're dealing with, obviously you can rule out all starting values that would make it overflow... you get the idea. Transform and order your data in such a way that you can infer as much as possible about the rest of the data as you decode it, and an optimal entropy encoding algorithm will make sure you're not wasting bits encoding info you "already know".






                        share|improve this answer










                        New contributor




                        tohoho is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
                        Check out our Code of Conduct.
























                          1














                          There's a similar answer, but to achieve optimal compression, you need:



                          1. An optimal entropy encoding method (read up on Arithmetic coding and the essentially equivalent (same compression ratio, a bit faster but also harder to grasp) ANS)

                          2. As much information as possible about the distribution of the data. Crucially, this does not just involve "guessing" how often one number may appear, but you can often rule out certain possibilities for sure. For example, you can rule out intervals of negative size, and possibly 0 size, depending on how you define a valid interval. If you have multiple intervals to encode at once, you could sort them e.g. in order of decreasing width, or increasing start/end value, and rule out a whole lot of values (e.g. if you guarantee an order by decreasing width, the previous interval had a width of 100, and the starting value for the next one is 47, you only need to consider the possibilities up to 147 for end values).

                          Importantly, number 2 means you want to encode things in such a way that the most informative values (per bit encoded) come first. For example, while I suggested encoding a sorted list "as-is", it would usually be smarter to encode it as a "binary tree" -- i.e. if they're sorted by width, and you have len elements, start by encoding element len/2. Say it had width w. Now you know all elements before it have width somewhere in [0, w], and all elements after it have width somewhere in [w, max val you accept]. Repeat recursively (subdividing each half list again in half, etc) until you've covered len elements (unless it's fixed, you'll want to encode len first so you don't need to bother with ending tokens). If "max val you accept" is really open, it may be smart to first encode the highest value that actually appears in your data, i.e. the last element, and then do the binary partitioning. Again, whatever is most informative per bit first.



                          Also, if you're encoding the width of the interval first, and you know the max possible value you're dealing with, obviously you can rule out all starting values that would make it overflow... you get the idea. Transform and order your data in such a way that you can infer as much as possible about the rest of the data as you decode it, and an optimal entropy encoding algorithm will make sure you're not wasting bits encoding info you "already know".






                          share|improve this answer










                          New contributor




                          tohoho is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
                          Check out our Code of Conduct.






















                            1












                            1








                            1







                            There's a similar answer, but to achieve optimal compression, you need:



                            1. An optimal entropy encoding method (read up on Arithmetic coding and the essentially equivalent (same compression ratio, a bit faster but also harder to grasp) ANS)

                            2. As much information as possible about the distribution of the data. Crucially, this does not just involve "guessing" how often one number may appear, but you can often rule out certain possibilities for sure. For example, you can rule out intervals of negative size, and possibly 0 size, depending on how you define a valid interval. If you have multiple intervals to encode at once, you could sort them e.g. in order of decreasing width, or increasing start/end value, and rule out a whole lot of values (e.g. if you guarantee an order by decreasing width, the previous interval had a width of 100, and the starting value for the next one is 47, you only need to consider the possibilities up to 147 for end values).

                            Importantly, number 2 means you want to encode things in such a way that the most informative values (per bit encoded) come first. For example, while I suggested encoding a sorted list "as-is", it would usually be smarter to encode it as a "binary tree" -- i.e. if they're sorted by width, and you have len elements, start by encoding element len/2. Say it had width w. Now you know all elements before it have width somewhere in [0, w], and all elements after it have width somewhere in [w, max val you accept]. Repeat recursively (subdividing each half list again in half, etc) until you've covered len elements (unless it's fixed, you'll want to encode len first so you don't need to bother with ending tokens). If "max val you accept" is really open, it may be smart to first encode the highest value that actually appears in your data, i.e. the last element, and then do the binary partitioning. Again, whatever is most informative per bit first.



                            Also, if you're encoding the width of the interval first, and you know the max possible value you're dealing with, obviously you can rule out all starting values that would make it overflow... you get the idea. Transform and order your data in such a way that you can infer as much as possible about the rest of the data as you decode it, and an optimal entropy encoding algorithm will make sure you're not wasting bits encoding info you "already know".






                            share|improve this answer










                            New contributor




                            tohoho is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
                            Check out our Code of Conduct.










                            There's a similar answer, but to achieve optimal compression, you need:



                            1. An optimal entropy encoding method (read up on Arithmetic coding and the essentially equivalent (same compression ratio, a bit faster but also harder to grasp) ANS)

                            2. As much information as possible about the distribution of the data. Crucially, this does not just involve "guessing" how often one number may appear, but you can often rule out certain possibilities for sure. For example, you can rule out intervals of negative size, and possibly 0 size, depending on how you define a valid interval. If you have multiple intervals to encode at once, you could sort them e.g. in order of decreasing width, or increasing start/end value, and rule out a whole lot of values (e.g. if you guarantee an order by decreasing width, the previous interval had a width of 100, and the starting value for the next one is 47, you only need to consider the possibilities up to 147 for end values).

                            Importantly, number 2 means you want to encode things in such a way that the most informative values (per bit encoded) come first. For example, while I suggested encoding a sorted list "as-is", it would usually be smarter to encode it as a "binary tree" -- i.e. if they're sorted by width, and you have len elements, start by encoding element len/2. Say it had width w. Now you know all elements before it have width somewhere in [0, w], and all elements after it have width somewhere in [w, max val you accept]. Repeat recursively (subdividing each half list again in half, etc) until you've covered len elements (unless it's fixed, you'll want to encode len first so you don't need to bother with ending tokens). If "max val you accept" is really open, it may be smart to first encode the highest value that actually appears in your data, i.e. the last element, and then do the binary partitioning. Again, whatever is most informative per bit first.



                            Also, if you're encoding the width of the interval first, and you know the max possible value you're dealing with, obviously you can rule out all starting values that would make it overflow... you get the idea. Transform and order your data in such a way that you can infer as much as possible about the rest of the data as you decode it, and an optimal entropy encoding algorithm will make sure you're not wasting bits encoding info you "already know".







                            share|improve this answer










                            New contributor




                            tohoho is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
                            Check out our Code of Conduct.









                            share|improve this answer



                            share|improve this answer








                            edited Apr 12 at 10:54





















                            New contributor




                            tohoho is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
                            Check out our Code of Conduct.









                            answered Apr 12 at 10:45









                            tohohotohoho

                            112




                            112




                            New contributor




                            tohoho is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
                            Check out our Code of Conduct.





                            New contributor





                            tohoho is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
                            Check out our Code of Conduct.






                            tohoho is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
                            Check out our Code of Conduct.















                                protected by gnat Apr 12 at 11:16



                                Thank you for your interest in this question.
                                Because it has attracted low-quality or spam answers that had to be removed, posting an answer now requires 10 reputation on this site (the association bonus does not count).



                                Would you like to answer one of these unanswered questions instead?



                                Popular posts from this blog

                                Wikipedia:Vital articles Мазмуну Biography - Өмүр баян Philosophy and psychology - Философия жана психология Religion - Дин Social sciences - Коомдук илимдер Language and literature - Тил жана адабият Science - Илим Technology - Технология Arts and recreation - Искусство жана эс алуу History and geography - Тарых жана география Навигация менюсу

                                Bruxelas-Capital Índice Historia | Composición | Situación lingüística | Clima | Cidades irmandadas | Notas | Véxase tamén | Menú de navegacióneO uso das linguas en Bruxelas e a situación do neerlandés"Rexión de Bruxelas Capital"o orixinalSitio da rexiónPáxina de Bruselas no sitio da Oficina de Promoción Turística de Valonia e BruxelasMapa Interactivo da Rexión de Bruxelas-CapitaleeWorldCat332144929079854441105155190212ID28008674080552-90000 0001 0666 3698n94104302ID540940339365017018237

                                What should I write in an apology letter, since I have decided not to join a company after accepting an offer letterShould I keep looking after accepting a job offer?What should I do when I've been verbally told I would get an offer letter, but still haven't gotten one after 4 weeks?Do I accept an offer from a company that I am not likely to join?New job hasn't confirmed starting date and I want to give current employer as much notice as possibleHow should I address my manager in my resignation letter?HR delayed background verification, now jobless as resignedNo email communication after accepting a formal written offer. How should I phrase the call?What should I do if after receiving a verbal offer letter I am informed that my written job offer is put on hold due to some internal issues?Should I inform the current employer that I am about to resign within 1-2 weeks since I have signed the offer letter and waiting for visa?What company will do, if I send their offer letter to another company