Extreme file compression

Why can't we compress a 1GB file to 100MB?

It doesn't have to be fast to decompress the data.
An hour of decompression on a good CPU would be acceptable on some occasions.

Just as a proof of concept that extreme compression is possible:

1) video file gets .webm:ed -> 1GB
2) zip compression
3) a compression algorithm that specializes in making compressed zip files smaller
4) ???
5) profit

Attached: 7zip-security-logo-2.png (600x443, 4K)

Other urls found in this thread:

en.wikipedia.org/wiki/Entropy_(information_theory)
en.wikipedia.org/wiki/PAQ#Comparison
cyborg.co/about
en.wikipedia.org/wiki/Jan_Sloot
github.com/philipl/pifs
github.com/philipl/pifs/issues/56
en.wikipedia.org/wiki/Pigeonhole_principle

Why don't you go back to r e ddit

Kys

I have a 1.41GB txt file that 7z compressed down to only 216KB.

you are an idiot

because of LZ77 factorization

20 IQ post

".webm:ed" is a lossy compression. Non-audiovisual information has to be compressed lossless.

But what about .docx into .doc?

Theoretically yes


All files are a sequence of 0s and 1s

Compression finds patterns so that it can represent the same values without writing them all down

for example: 0000000000000000

instead of writing all the 0s, it writes "16 0s at this point", which the decompressor reads and unpacks

same with repeating sequences

another example: if the sequence 010111010101 is found 8 times in the file, it's only saved once, along with how many times to repeat it
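
A toy sketch of those two tricks in Python (just the idea, not how zip actually stores anything):

def rle_encode(bits):
    # "0000000000000000" -> [(16, "0")], i.e. "16 0s at this point"
    runs, i = [], 0
    while i < len(bits):
        j = i
        while j < len(bits) and bits[j] == bits[i]:
            j += 1
        runs.append((j - i, bits[i]))
        i = j
    return runs

def rle_decode(runs):
    return "".join(symbol * count for count, symbol in runs)

assert rle_encode("0" * 16) == [(16, "0")]
assert rle_decode([(16, "0")]) == "0" * 16

# repeated-sequence idea: keep the pattern once, plus how often it repeats
pattern, times = "010111010101", 8
assert pattern * times == rle_decode([(times, pattern)])  # same trick, bigger symbol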

you can have as many rules or patterns as you like, but adding more has diminishing returns: you end up adding rules for very, very specific bit patterns, so the decompressor size blows up

.zip, .rar, etc. have defined rules and patterns for general files, covering the most common and most important patterns for compression

But if decompressor size is not a problem, you can have one with a huge set of patterns.

for instance, making 500MB out of 1GB might take 5 seconds and a 100GB pattern archive

400MB would take 50 seconds and 800GB pattern archive

300MB would take 18 hours and 98TB pattern archive

200MB would take 78 days and 45,309TB pattern archive

making it 100MB would take 49 months and 29,549,295TB archive

these numbers are all made up, but they should give you an idea

What about .doc into .docx decompression?

Ah the classic "pretending to be a retard"

>1) compress a video down to 1GB by applying lossy compression and discarding data in the process
>2) zip it for 0.05% decrease in size
>3) run a program like ECT to max. out the zip compression and get another 0.02% decrease in size
>4) realize that general compression algorithms are wasted effort after a compression algorithm specialized on compressing video footage and that you wasted half a day for 0.07% decrease in size, when using a more efficient video coding format could've actually made a difference
>5) kill yourself

Because bits man, ain't gotta explain shit.

lol retards you can do this already just keep zipping the zip files

i bet a 100gb text file of only one repeated character compresses even smaller lmao

>It doesnt have to be fast to decompress the data.

Hash your file with SHA512.

Now to decompress, just compute data that matches the SHA512 hash.

Doesn't have to be fast, right?
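
A sketch of what that "decompression" looks like, restricted to 2-byte files so it actually finishes (for a 1GB file it never would):

import hashlib
from itertools import product

def decompress(target_digest, length):
    # try every possible file of the given length until the hash matches
    for candidate in product(range(256), repeat=length):
        data = bytes(candidate)
        if hashlib.sha512(data).digest() == target_digest:
            return data
    raise ValueError("no preimage found at this length")

original = b"hi"
assert decompress(hashlib.sha512(original).digest(), 2) == original
# for a 1GB file there are 2**(8 * 2**30) candidates, so this never terminates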

For a 1GB file to compress to 100MB (which is reasonable, in some cases), it needs to have about 900MB of redundant data. That is, it only has 100MB of actual information.

Let's say we had a sequence of letters, such as "xxxxxx". A compression algorithm might take that and output "6x" to indicate 6 consecutive letters "x". Our data had 6 characters, but 5 were redundant: 1 was enough to represent the data itself, plus 1 more character for how much of it there is (the "6"). Compression is about seeing where data repeats or is similar so that redundant things can be removed. If the file has little redundancy, compression algorithms might not reduce the size much.
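
Easy to check with Python's zlib standing in for any general compressor:

import os
import zlib

redundant = b"x" * 1_000_000           # 1 MB of the same byte
random_ish = os.urandom(1_000_000)     # 1 MB from the OS CSPRNG

print(len(zlib.compress(redundant, 9)))   # roughly a kilobyte
print(len(zlib.compress(random_ish, 9)))  # roughly 1 MB, slightly bigger than the input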

Because that's not how information works.
en.wikipedia.org/wiki/Entropy_(information_theory)

Here's my compression algo. Files are zero bits long and contain no data. When the file is decompressed, it extracts to the movie "Despicable Me" in 720p30. Infinite compression ratio. Perfect.

Lossless compression is about the maximization of entropy, and there comes a point where your data becomes seemingly random and you can't compress it any further.
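
A rough way to see that in Python (order-0 entropy estimate, so it only looks at byte frequencies, not patterns):

import math
import zlib
from collections import Counter

def entropy_bits_per_byte(data):
    counts = Counter(data)
    n = len(data)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

text = b"the quick brown fox jumps over the lazy dog " * 10_000
print(entropy_bits_per_byte(text))                 # roughly 4 bits per byte
print(entropy_bits_per_byte(zlib.compress(text)))  # close to 8 bits per byte, looks random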

Sounds pretty lossy though

You can
If you have a file that is 1 GB worth of bits, but you know it's made of groups of 10 consecutive bits in which the bits all have the same value, then you can compress the file by keeping only 1 representative bit for each group. This reduces the file size from 1 GB to 100 MB
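
A sketch of that scheme; note it only works because the 10-identical-bits structure is assumed up front, and that assumption is exactly the redundancy being exploited:

def compress_groups(bits, group=10):
    assert len(bits) % group == 0
    out = []
    for i in range(0, len(bits), group):
        chunk = bits[i:i + group]
        assert chunk == chunk[0] * group, "input must really have this structure"
        out.append(chunk[0])
    return "".join(out)

def decompress_groups(bits, group=10):
    return "".join(b * group for b in bits)

data = "0" * 10 + "1" * 10 + "1" * 10 + "0" * 10
small = compress_groups(data)          # "0110", a tenth of the size
assert decompress_groups(small) == data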

>video file gets .webm:ed -> 1GB
You already lost data that can't be recovered in this step. This is a lossy compression. Of course you can compress stuff a lot if you permanently lose pieces of it.

You're wrong, and an idiot. Theoretically, no. If you try to map bit strings of length n to compressed bit strings of length n/x, then as n/x gets smaller you're trying to map a much larger set of strings (all possible files to be compressed) into a much smaller set (all possible compressed files of, say, 100 MB). Some compressed strings would then have to stand for many different originals, so there's no way to know into which file a compressed one should decompress, i.e. you just have garbage without meaning. The impossibility of extreme general-purpose compression is mathematical in nature; you can't just do it.
>instead of writing all the 0s, it writes "16 0s at this point", which the decompressor reads and unpacks
How do you think you say "16 0s at this point" in a string that is made of 1s and 0s only? You need an encoding for the counts themselves, which is where the diminishing returns come from. At some point, saying "x 0s here" is gonna take up more space than actually just writing the x 0s there.
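
The break-even point is easy to see with a naive run-length coder that spends 8 bits on every count (a made-up format, just to show the effect):

def naive_rle_size_bits(bits, count_width=8):
    # size if every run is stored as an 8-bit count plus the bit value itself
    runs, i = 0, 0
    while i < len(bits):
        j = i
        while j < len(bits) and bits[j] == bits[i]:
            j += 1
        runs += 1
        i = j
    return runs * (count_width + 1)

print(naive_rle_size_bits("0" * 16))   # 9 bits instead of 16: a win
print(naive_rle_size_bits("01" * 8))   # 144 bits instead of 16: a big loss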

Me backing up my registry.

Attached: Untitle111d.png (794x451, 32K)

You can if you make the application much larger.

en.wikipedia.org/wiki/Entropy_(information_theory)

Evolutionary compression sounds too good to be true

you can sometimes do that, even in a lossless way. the algorithms for that are just extremely slow (18h compression time for 1GB).
PAQ is an example of such an algorithm, where the compressed file was 13% of the original size:
en.wikipedia.org/wiki/PAQ#Comparison
keep in mind though that this is a benchmark result and the algorithm used was tuned for that benchmark specifically, so real world performance might not be this good in all cases.

Attached: 2019-08-07-165357_748x334_scrot.png (748x334, 46K)

I can never remember the name of that kid from a few years back that "created" a super efficient compression algorithm that was obvious bullshit. He got quite a bit of news coverage for a month or two.

tar -cf - /dir | gzip | bzip2 | xz | gpg -c > archive.tar.gz.bz2.xz.gpg
ez compression

Nicolas Dupont
cyborg.co/about

That file is already just 1 character repeated. It just illustrates that OP is a retard.

How the hell is the dietpi OS image zip like 80MB but then when I extract the image it's 650 MB??

FUCKING THANK YOU I knew it had cyborg in the name.

Is there a file system that compresses shit on the go?

zfs

Anything with transparent compression. EXT4 doesn't have that unfortunately.

is zfs viable on linux with no raid setup?

yes
though why would you want a compressed fs? it's slow af

To add to the zfs reply, zfs supports multiple compression algorithms. Some compression methods will try to recognize incompressible data. For example, a text file will get compressed and a video file won't since the codec already compressed it really far. Keep in mind that it doesn't actually evaluate this on file level but on block level. It's a pretty smart fs, just be sure to properly educate yourself on it before using it.

Since zfs is incompatible with the GPL, there are a lot of details that come with it. To keep it simple, root on zfs is possible but you really need to educate yourself before doing this. A non-root fs with zfs is pretty easy to get going.

>I have a 1.41GB txt file that 7z compressed down to only 216KB.
there's a 42 KB zip file full of nested zip files that decompresses to 4.5 PB

Attached: 2019-08-07_17-34.png (808x387, 38K)

>nested
Well of course.

Is this the ClamAV dos?

en.wikipedia.org/wiki/Jan_Sloot

> we

Glorious dutch algorithm master race

Why is OP a faget?

that's literally just .zip

guys is it possible for me to compress Doom on to a floppy disk so i can play with my friend? I tried copying it from my desktop but it didn't work

Just use piFS.

Before you ask:

“πfs is a revolutionary new file system that, instead of wasting space storing your data on your hard drive, stores your data in π! You'll never run out of space again - π holds every file that could possibly exist! They said 100% compression was impossible? You're looking at it!”

github.com/philipl/pifs
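
A toy version of the idea (assumes the third-party mpmath package; the payload and digit count are arbitrary). The punchline is that the offset you have to store is, on average, about as many digits as the data it points to:

from mpmath import mp

mp.dps = 10_000                              # work with ~10,000 digits of pi
digits = mp.nstr(mp.pi, 10_000)[2:]          # drop the leading "3."

payload = "420"                              # the "file" we want to store
offset = digits.find(payload)
if offset == -1:
    print("not in the first 10,000 digits, need more digits (i.e. more metadata)")
else:
    # the "compressed file" is just this offset, but writing the offset down
    # takes roughly as many digits as the payload itself, so nothing is saved
    print(payload, "found at digit offset", offset)
    assert digits[offset:offset + len(payload)] == payload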

10/10
thank you

i was thinking.. why don't people make a dictionary store
that way, common file formats can have public algorithms used to compress them, and the file itself only has to refer to them
sure, it'd require some bootstrap if you have a file which depends on algos you don't have (it'd download the algos for example), but it can still make a lot of file transfers, write speeds etc faster
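
Something close to that already exists as preset dictionaries. A sketch with Python's zlib and a made-up JSON skeleton as the shared dictionary (the skeleton and the document are just examples):

import zlib

# pretend this is the publicly distributed dictionary for some file format
shared_dict = b'{"user": "", "id": 0, "tags": [], "active": true}'

def compress_with_dict(data):
    c = zlib.compressobj(9, zdict=shared_dict)
    return c.compress(data) + c.flush()

def decompress_with_dict(blob):
    d = zlib.decompressobj(zdict=shared_dict)
    return d.decompress(blob) + d.flush()

doc = b'{"user": "anon", "id": 1337, "tags": ["g"], "active": true}'
small = compress_with_dict(doc)
plain = zlib.compress(doc, 9)
print(len(small), len(plain))    # the preset dictionary usually wins on small inputs
assert decompress_with_dict(small) == doc

zstd supports the same thing with trained dictionaries, which is basically the "dictionary store" idea, just per format rather than global.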

strongly recommend reading through the issues

algorithms derived from experiments with fractal geometry are already being used in compressing then searching/decompressing/enhancing spy satellite images.
>Every possible image already exists, you just need to know how to find it and zoom in on the fractal.

github.com/philipl/pifs/issues/56

this is honestly the best solution given OP's parameters.
Try using AV1 on videos right now. It doesn't have to be fast :^)

PI miner malwares when?

The issues section for that project is reddit personified.

Use SHA-1 then, it's smaller to store, and faster to "re-compute" on account of it being broken.

no it's a broken hash function so you may get a corrupted file with matching hash.

0% loss if you're compressing the YIFY release of Despicable Me in 720p.

if this is the new generation of software engineers,
then we are fucking doomed.

seriously, just do anything else,
make music, paint, decorate, cook
just stay the hell away from computers science

that's just free data
just continue brute-forcing until yours appears

depends on what you're compressing

the program has no way of knowing if the data is correct or not with a broken hash function retard.

I have a better idea.
Sell a hard drive with every combination of data preloaded onto it.
Then you don't have to extract anything just navigate to the correct index for any given file.

>it's this thread again

Attached: proxy.duckduckgo.com.jpg (474x296, 34K)

Jow Forumsgatekeeping

That's literally what's going on, except everything is in pi and you just have to find it.

Attached: 1548594375031.png (791x660, 128K)

What happens if i decompress it on a disk with drive compression enabled?

EVERY hash function has collisions so the same can happen with any of them.
But who am I kidding, this is a troll thread top to bottom.

Very nice, thank you sir.

Because videos are already compressed, dumbass. The more repeated or similar data there is in a file, the more it'll be compressed. Think of a string like "00000000000000000000" (20 zeroes): what the compression algorithm does is store it as something like "0*20". If the file is something like "123456789abcdefghijk" it won't compress as much. Of course this is extremely simplified, but that's basically how it works.
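
Easy to check with zlib standing in for the zip step: compressed output has little redundancy left, so a second pass gains next to nothing.

import zlib

text = b"some repetitive log line\n" * 100_000

once = zlib.compress(text, 9)
twice = zlib.compress(once, 9)
print(len(text), len(once), len(twice))   # big drop, then essentially no change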

>retard
Oh fuck, you are taking this shit seriously
yet this does not bother you
shows who the real retard is

But that's not at all how it works, you retard.

videos do not use dictionaries.

>1) video file gets .webm:ed -> 1GB
ok, gigantic raw video data compressed to absolute hell, good start, entropy very highly saturated, very little inefficiency, nice!
>2) zip compression
ok, the file is now in zip format, but it didn't change in size, because there's next to no redundancies left to remove, and what was saved was lost in storing zip metadata
>3) a compression algorithm with speciales in making compressed zip file smaller
a compression algorithm specialized in compressing compressed data? this actually makes no sense.
>4) ???
my thoughts exactly

I might be a brainlet here; but couldn't you make a program that orders the bytes in a specific order so when you multiply or divide one byte with a series of other bytes it will give you a correct piece of data?

you mean like.. turn blocks into algebraic equations?

If a complete retard doesn't understand the basics of data compression, then I am not gatekeeping,

I am stating the obvious.

It's the basics of how it works; the implementation depends on the algorithm. Basically, the more redundancy there is in the file, the more it can be compressed.

I never said that they do. But they are already compressed by their own algorithms, and trying to compress them again will achieve next to nothing.

So let's rearrange the input file, zeros at the beginning and ones at the end, so the algo can easily compress them.

the files that need specialized solutions already have them (compressed audio/video), and the rest just uses general compression (zip/lzma/gzip).

what would even benefit from a general dict store?

LOL WUT IZ MATHS??!!??

Attached: thumbs_up_cat.png (720x714, 722K)

>Infinite amount of spurious keys.

>what would even benefit from a general dict store?
nothing, only specialized dicts make sense

>Why cant we compress 1GB file onto 100MB?
we can, if the content of the file has lots of repetition.

So that's the answer.

no, dummy. a general dict with specialized tuning for certain file formats.
but a dict with tunings for mp3/mkv/png/whatever is going to be worthless, because they're already compressed well. only improvements would be changing the actual encoders or inventing a new file format + encoder.

so what would even benefit from "specialized" dicts?

You can't just losslessly compress data forever. You can only compress data with some sort of redundancy in it.
For example, I just took a 1GB file from /dev/urandom
When I count the bytes, it's 1073741824 bytes.

For a non-scientific example: when I put the file through gzip to compress it and then count the bytes, it's now 1074069377 bytes, almost 320 KiB more. No computational power in the world would get that down to 100MB, unless I'm willing to get a different random file from the decompression than I started with.

en.wikipedia.org/wiki/Pigeonhole_principle

>No computational power in the world would get that down to 100MB
only if you're assuming the Linux kernel's urandom is truly random.

Sure. But it's proven that it's not possible for all possible inputs to be shrunk to sizes like that; there must ALWAYS be data that can't be compressed, or that would instead be made even larger by compression:

>The principle can be used to prove that any lossless compression algorithm, provided it makes some inputs smaller (as the name compression suggests), will also make some other inputs larger. Otherwise, the set of all input sequences up to a given length L could be mapped to the (much) smaller set of all sequences of length less than L without collisions (because the compression is lossless), a possibility which the pigeonhole principle excludes.

yeah, i know that. and I got the point you were making. im just shitposting.

with most things we don't care about 99%+ of the possible inputs, just the select few we're interested in. e.g. audio, video, text, whatever.

can urandom even create all possible 1GB files if left running long enough? I'm not sure.

Who even knows. It's cryptographically secure, which should guarantee that, I think: every byte can follow any previous bytes with equal probability.

But generating all possible 1GiB files would probably take so long the universe would restart a googolplexian times before you're done.

>It's cryptographically safe
Debian would like a word

Attached: random4.jpg (508x298, 32K)

This is already possible, but no one is willing to wait an hour just to compress a 1GB file to 100MB. Compression schemes based on fractals or transcendental numbers like pi can get you absurd levels of compression, at the cost of taking forever to compress/decompress

Win10 has built-in compression, if I recall, and it can save hundreds of gigs in some cases.