Extreme file compression

Why can't we compress a 1GB file to 100MB?

It doesn't have to be fast to decompress the data.
An hour of decompression on a good CPU would be acceptable on some occasions.

Just as a proof of concept that extreme compression is possible:

1) video file gets .webm:ed -> 1GB
2) zip compression
3) a compression algorithm that specializes in making compressed zip files smaller
4) ???
5) profit

Attached: 7zip-security-logo-2.png (600x443, 4K)

Other urls found in this thread:

en.wikipedia.org/wiki/Entropy_(information_theory)
en.wikipedia.org/wiki/PAQ#Comparison
cyborg.co/about
en.wikipedia.org/wiki/Jan_Sloot
github.com/philipl/pifs
github.com/philipl/pifs/issues/56
en.wikipedia.org/wiki/Pigeonhole_principle

Why don't you go back to r e ddit

Kys

I have a 1.41GB txt file that 7z compressed down to only 216KB.

you are an idiot

because of LZ77 factorization

20 IQ post

".webm:ed" is a lossy compression. Non-audiovisual information has to be compressed lossless.

But what about .docx into .doc?

Theoretically yes


All files are a sequence of 0s and 1s

Compression finds patterns so that it can represent the same values without writing them all down

for example: 0000000000000000

instead of writing all the 0s, it writes "16 0s at this point", which the decompressor reads and unpacks

same with repeating sequences

another example: if the sequence 010111010101 is found 8 times in the file, it's only saved once, along with how many times to repeat it
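
A toy sketch of those two tricks in Python (just the idea, not how zip actually stores anything):

def rle_encode(bits):
    # "0000000000000000" -> [(16, "0")], i.e. "16 0s at this point"
    runs, i = [], 0
    while i < len(bits):
        j = i
        while j < len(bits) and bits[j] == bits[i]:
            j += 1
        runs.append((j - i, bits[i]))
        i = j
    return runs

def rle_decode(runs):
    return "".join(symbol * count for count, symbol in runs)

assert rle_encode("0" * 16) == [(16, "0")]
assert rle_decode([(16, "0")]) == "0" * 16

# repeated-sequence idea: keep the pattern once, plus how often it repeats
pattern, times = "010111010101", 8
assert pattern * times == rle_decode([(times, pattern)])  # same trick, bigger symbol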

you can have as many rules or patterns as you like, but adding more has diminishing returns: you end up adding rules for very, very specific bit patterns, so the decompressor size blows up

.zip, .rar, etc. have defined rules and patterns for general files, covering the most common and most important patterns for compression

But if decompressor size is not a problem, you can have one with a huge set of patterns.

for instance, making 500MB out of 1GB might take 5 seconds and a 100GB pattern archive

400MB would take 50 seconds and 800GB pattern archive

300MB would take 18 hours and 98TB pattern archive

200MB would take 78 days and 45,309TB pattern archive

making it 100MB would take 49 months and 29,549,295TB archive

these numbers are all made up, but they should give you an idea

What about .doc into .docx decompression?

Ah the classic "pretending to be a retard"

>1) compress a video down to 1GB by applying lossy compression and discarding data in the process
>2) zip it for 0.05% decrease in size
>3) run a program like ECT to max. out the zip compression and get another 0.02% decrease in size
>4) realize that general compression algorithms are wasted effort after a compression algorithm specialized on compressing video footage and that you wasted half a day for 0.07% decrease in size, when using a more efficient video coding format could've actually made a difference
>5) kill yourself

Because bits man, ain't gotta explain shit.

lol retards you can do this already just keep zipping the zip files

i bet a 100gb text file of only one repeated character compresses even smaller lmao

>It doesnt have to be fast to decompress the data.

Hash your file with SHA512.

Now to decompress, just compute data that matches the SHA512 hash.

Doesn't have to be fast, right?
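
A sketch of what that "decompression" looks like, restricted to 2-byte files so it actually finishes (for a 1GB file it never would):

import hashlib
from itertools import product

def decompress(target_digest, length):
    # try every possible file of the given length until the hash matches
    for candidate in product(range(256), repeat=length):
        data = bytes(candidate)
        if hashlib.sha512(data).digest() == target_digest:
            return data
    raise ValueError("no preimage found at this length")

original = b"hi"
assert decompress(hashlib.sha512(original).digest(), 2) == original
# for a 1GB file there are 2**(8 * 2**30) candidates, so this never terminates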

For a 1GB file to compress to 100MB (which is reasonable, in some cases), it needs to have about 900MB of redundant data. That is, it only has 100MB of actual information.

Let's say we had a sequence of letters, such as "xxxxxx". A compression algorithm might take that and output "6x" to indicate 6 consecutive letters "x". Our data had 6 characters, but 5 were redundant: 1 was enough to represent the data itself, plus 1 more character for how much of it there is (the "6"). Compression is about seeing where data repeats or is similar so that redundant things can be removed. If the file has little redundancy, compression algorithms might not reduce the size much.
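
Easy to check with Python's zlib standing in for any general compressor:

import os
import zlib

redundant = b"x" * 1_000_000           # 1 MB of the same byte
random_ish = os.urandom(1_000_000)     # 1 MB from the OS CSPRNG

print(len(zlib.compress(redundant, 9)))   # roughly a kilobyte
print(len(zlib.compress(random_ish, 9)))  # roughly 1 MB, slightly bigger than the input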

Because that's not how information works.
en.wikipedia.org/wiki/Entropy_(information_theory)

Here's my compression algo. Files are zero bits long and contain no data. When the file is decompressed, it extracts to the movie "Despicable Me" in 720p30. Infinite compression ratio. Perfect.

Lossless compression is about the maximization of entropy, and there comes a point where your data becomes seemingly random and you can't compress it any further.
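
A rough way to see that in Python (order-0 entropy estimate, so it only looks at byte frequencies, not patterns):

import math
import zlib
from collections import Counter

def entropy_bits_per_byte(data):
    counts = Counter(data)
    n = len(data)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

text = b"the quick brown fox jumps over the lazy dog " * 10_000
print(entropy_bits_per_byte(text))                 # roughly 4 bits per byte
print(entropy_bits_per_byte(zlib.compress(text)))  # close to 8 bits per byte, looks random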

Sounds pretty lossy though

You can
If you have a file that is 1 GB worth of bits, but you know it's made of groups of 10 consecutive bits in which the bits all have the same value, then you can compress the file by keeping only 1 representative bit for each group. This reduces the file size from 1 GB to 100 MB
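
A sketch of that scheme; note it only works because the 10-identical-bits structure is assumed up front, and that assumption is exactly the redundancy being exploited:

def compress_groups(bits, group=10):
    assert len(bits) % group == 0
    out = []
    for i in range(0, len(bits), group):
        chunk = bits[i:i + group]
        assert chunk == chunk[0] * group, "input must really have this structure"
        out.append(chunk[0])
    return "".join(out)

def decompress_groups(bits, group=10):
    return "".join(b * group for b in bits)

data = "0" * 10 + "1" * 10 + "1" * 10 + "0" * 10
small = compress_groups(data)          # "0110", a tenth of the size
assert decompress_groups(small) == data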

>video file gets .webm:ed -> 1GB
You already lost data that can't be recovered in this step. This is a lossy compression. Of course you can compress stuff a lot if you permanently lose pieces of it.

You're wrong, and an idiot. Theoretically, no. If you try to map bit strings of length n to compressed bit strings of length n/x, then as n/x gets smaller you're trying to map a much larger set of strings (all possible files to be compressed) into a much smaller set (all possible compressed files of, say, 100 MB). Some compressed strings would then have to stand for many different originals, so there's no way to know into which file a compressed one should decompress, i.e. you just have garbage without meaning. The impossibility of extreme general-purpose compression is mathematical in nature; you can't just do it.
>instead of writing all the 0s, it writes "16 0s at this point", which the decompressor reads and unpacks
How do you think you say "16 0s at this point" in a string that is made of 1s and 0s only? You need an encoding for the counts themselves, which is where the diminishing returns come from. At some point, saying "x 0s here" is gonna take up more space than actually just writing the x 0s there.
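
The break-even point is easy to see with a naive run-length coder that spends 8 bits on every count (a made-up format, just to show the effect):

def naive_rle_size_bits(bits, count_width=8):
    # size if every run is stored as an 8-bit count plus the bit value itself
    runs, i = 0, 0
    while i < len(bits):
        j = i
        while j < len(bits) and bits[j] == bits[i]:
            j += 1
        runs += 1
        i = j
    return runs * (count_width + 1)

print(naive_rle_size_bits("0" * 16))   # 9 bits instead of 16: a win
print(naive_rle_size_bits("01" * 8))   # 144 bits instead of 16: a big loss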

Me backing up my registry.

Attached: Untitle111d.png (794x451, 32K)

You can if you make the application much larger.

en.wikipedia.org/wiki/Entropy_(information_theory)

Evolutionary compression sounds too good to be true

you can sometimes do that, even in a lossless way. the algorithms for that are just extremely slow (18h compression time for 1GB).
PAQ is an example of such an algorithm, where the compressed file was 13% of the original size:
en.wikipedia.org/wiki/PAQ#Comparison
keep in mind though that this is a benchmark result and the algorithm used was tuned for that benchmark specifically, so real world performance might not be this good in all cases.

Attached: 2019-08-07-165357_748x334_scrot.png (748x334, 46K)

I can never remember the name of that kid from a few years back that "created" a super efficient compression algorithm that was obvious bullshit. He got quite a bit of news coverage for a month or two.

tar -cf - /dir | gzip | bzip2 | xz | gpg -c > archive.tar.gz.bz2.xz.gpg
ez compression

Nicolas Dupont
cyborg.co/about

That file is already just 1 character repeated. It just illustrates that OP is a retard.

How the hell is the dietpi OS image zip like 80MB but then when I extract the image it's 650 MB??

FUCKING THANK YOU I knew it had cyborg in the name.

Is there a file system that compresses shit on the go?

zfs

Anything with transparent compression. EXT4 doesn't have that unfortunately.

is zfs viable on linux with no raid setup?

yes
though why would you want a compressed fs? it's slow af

To add to the zfs reply, zfs supports multiple compression algorithms. Some compression methods will try to recognize incompressible data. For example, a text file will get compressed and a video file won't since the codec already compressed it really far. Keep in mind that it doesn't actually evaluate this on file level but on block level. It's a pretty smart fs, just be sure to properly educate yourself on it before using it.

Since zfs is incompatible with the GPL, there are a lot of details that come with it. To keep it simple, root on zfs is possible but you really need to educate yourself before doing this. A non-root fs with zfs is pretty easy to get going.

>I have a 1.41GB txt file that 7z compressed down to only 216KB.
there's a 42 KB zip file full of nested zip files that decompresses to 4.5 PB

Attached: 2019-08-07_17-34.png (808x387, 38K)

>nested
Well of course.

Is this the ClamAV dos?

en.wikipedia.org/wiki/Jan_Sloot

> we

Glorious dutch algorithm master race

Why is OP a faget?

that's literally just .zip

guys is it possible for me to compress Doom on to a floppy disk so i can play with my friend? I tried copying it from my desktop but it didn't work

Just use piFS.

Before you ask:

“πfs is a revolutionary new file system that, instead of wasting space storing your data on your hard drive, stores your data in π! You'll never run out of space again - π holds every file that could possibly exist! They said 100% compression was impossible? You're looking at it!”

github.com/philipl/pifs
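
A toy version of the idea (assumes the third-party mpmath package; the payload and digit count are arbitrary). The punchline is that the offset you have to store is, on average, about as many digits as the data it points to:

from mpmath import mp

mp.dps = 10_000                              # work with ~10,000 digits of pi
digits = mp.nstr(mp.pi, 10_000)[2:]          # drop the leading "3."

payload = "420"                              # the "file" we want to store
offset = digits.find(payload)
if offset == -1:
    print("not in the first 10,000 digits, need more digits (i.e. more metadata)")
else:
    # the "compressed file" is just this offset, but writing the offset down
    # takes roughly as many digits as the payload itself, so nothing is saved
    print(payload, "found at digit offset", offset)
    assert digits[offset:offset + len(payload)] == payload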

10/10
thank you

i was thinking.. why don't people make a dictionary store
that way, common file formats can have public algorithms used to compress them, and the file itself only has to refer to them
sure, it'd require some bootstrap if you have a file which depends on algos you don't have (it'd download the algos for example), but it can still make a lot of file transfers, write speeds etc faster
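
Something close to that already exists as preset dictionaries. A sketch with Python's zlib and a made-up JSON skeleton as the shared dictionary (the skeleton and the document are just examples):

import zlib

# pretend this is the publicly distributed dictionary for some file format
shared_dict = b'{"user": "", "id": 0, "tags": [], "active": true}'

def compress_with_dict(data):
    c = zlib.compressobj(9, zdict=shared_dict)
    return c.compress(data) + c.flush()

def decompress_with_dict(blob):
    d = zlib.decompressobj(zdict=shared_dict)
    return d.decompress(blob) + d.flush()

doc = b'{"user": "anon", "id": 1337, "tags": ["g"], "active": true}'
small = compress_with_dict(doc)
plain = zlib.compress(doc, 9)
print(len(small), len(plain))    # the preset dictionary usually wins on small inputs
assert decompress_with_dict(small) == doc

zstd supports the same thing with trained dictionaries, which is basically the "dictionary store" idea, just per format rather than global.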

strongly recommend reading through the issues

algorithms derived from experiments with fractal geometry are already being used in compressing then searching/decompressing/enhancing spy satellite images.
>Every possible image already exists, you just need to know how to find it and zoom in on the fractal.

github.com/philipl/pifs/issues/56

this is honestly the best solution given OP's parameters.
Try using AV1 on videos right now. It doesn't have to be fast :^)

PI miner malwares when?

The issues section for that project is reddit personified.

Use SHA-1 then, it's smaller to store, and faster to "re-compute" on account of it being broken.

no it's a broken hash function so you may get a corrupted file with matching hash.

0% loss if you're compressing the YIFY release of Despicable Me in 720p.

if this is the new generation of software engineers,
then we are fucking doomed.

seriously, just do anything else,
make music, paint, decorate, cook
just stay the hell away from computers science

that's just free data
just continue brute-forcing until yours appears

depends on what you're compressing

the program has no way of knowing if the data is correct or not with a broken hash function retard.

I have a better idea.
Sell a hard drive with every combination of data preloaded onto it.
Then you don't have to extract anything just navigate to the correct index for any given file.

>it's this thread again

Attached: proxy.duckduckgo.com.jpg (474x296, 34K)

Jow Forumsgatekeeping

That's literally what's going on, except everything is in pi and you just have to find it.

Attached: 1548594375031.png (791x660, 128K)

What happens if i decompress it on a disk with drive compression enabled?

EVERY hash function has collisions so the same can happen with any of them.
But who am I kidding, this is a troll thread top to bottom.

Very nice, thank you sir.

Because videos are already compressed, dumbass. The more repeated or similar data there is in a file, the more it'll be compressed. Think of a string like "00000000000000000000" (20 zeroes): what the compression algorithm does is store it as something like "0*20". If the file is something like "123456789abcdefghijk" it won't compress as much. Of course this is extremely simplified, but that's basically how it works.
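
Easy to check with zlib standing in for the zip step: compressed output has little redundancy left, so a second pass gains next to nothing.

import zlib

text = b"some repetitive log line\n" * 100_000

once = zlib.compress(text, 9)
twice = zlib.compress(once, 9)
print(len(text), len(once), len(twice))   # big drop, then essentially no change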

>retard
Oh fuck, you are taking this shit seriously
yet this does not bother you
shows who the real retard is

But that's not at all how it works, you retard.

videos do not use dictionaries.

>1) video file gets .webm:ed -> 1GB
ok, gigantic raw video data compressed to absolute hell, good start, entropy very highly saturated, very little inefficiency, nice!
>2) zip compression
ok, the file is now in zip format, but it didn't change in size, because there's next to no redundancies left to remove, and what was saved was lost in storing zip metadata
>3) a compression algorithm with speciales in making compressed zip file smaller
a compression algorithm specialized in compressing compressed data? this actually makes no sense.
>4) ???
my thoughts exactly

I might be a brainlet here; but couldn't you make a program that orders the bytes in a specific order so when you multiply or divide one byte with a series of other bytes it will give you a correct piece of data?

you mean like.. turn blocks into algebraic equations?

If a complete retard doesn't understand the basics of data compression, then I am not gatekeeping,

I am stating the obvious.

It's the basics of how it works; the implementation depends on the algorithm. Basically, the more redundancy there is in the file, the more it can be compressed.

I never said that they do. But they are already compressed by their own algorithms, and trying to compress them again will achieve next to nothing.

So let's rearrange the input file, zeros at the beginning and ones at the end, so the algo can easily compress them.

the files that need specialized solutions already have them (compressed audio/video), and the rest just uses general compression (zip/lzma/gzip).

what would even benefit from a general dict store?

LOL WUT IZ MATHS??!!??

Attached: thumbs_up_cat.png (720x714, 722K)

>Infinite amount of spurious keys.

>what would even benefit from a general dict store?
nothing, only specialized dicts make sense

>Why cant we compress 1GB file onto 100MB?
we can, if the content of the file has lots of repetition.

So that's the answer.

no, dummy. a general dict with specialized tuning for certain file formats.
but a dict with tunings for mp3/mkv/png/whatever is going to be worthless, because they're already compressed well. only improvements would be changing the actual encoders or inventing a new file format + encoder.

so what would even benefit from "specialized" dicts?

You can't just losslessly compress data forever. You can only compress data with some sort of redundancy in it.
For example, I just took a 1GB file from /dev/urandom
When I count the bytes, it's 1073741824 bytes.

For a non-scientific example: when I put the file through gzip to compress it and then count the bytes, it's now 1074069377 bytes, almost 320 KiB more. No computational power in the world would get that down to 100MB, unless I'm willing to get a different random file from the decompression than I started with.

en.wikipedia.org/wiki/Pigeonhole_principle

>No computational power in the world would get that down to 100MB
only if you're assuming the Linux kernel's urandom is truly random.

Sure. But it's proven that it's not possible for all possible inputs to be shrunk to sizes like that; there must ALWAYS be data that can't be compressed, or that would instead be made even larger by compression:

>The principle can be used to prove that any lossless compression algorithm, provided it makes some inputs smaller (as the name compression suggests), will also make some other inputs larger. Otherwise, the set of all input sequences up to a given length L could be mapped to the (much) smaller set of all sequences of length less than L without collisions (because the compression is lossless), a possibility which the pigeonhole principle excludes.

yeah, i know that. and I got the point you were making. im just shitposting.

with most things we don't care about 99%+ of the possible inputs, just the select few we're interested in. e.g. audio, video, text, whatever.

can urandom even create all possible 1GB files if left running long enough? I'm not sure.

Who even knows. It's cryptographically secure, which should guarantee that, I think: every byte can follow any previous bytes with equal probability.

But generating all possible 1GiB files would probably take so long the universe would restart a googolplexian times before you're done.

>It's cryptographically safe
Debian would like a word

Attached: random4.jpg (508x298, 32K)

This is already possible, but no one is willing to wait an hour just to compress a 1GB file to 100MB. Compression schemes based on fractals or transcendental numbers like pi can get you absurd levels of compression, at the cost of taking forever to compress/decompress

Win10 has built-in compression, if I recall, and it can save hundreds of gigs in some cases.