Need a Duplicate Detection Tool for GNU/Linux

Does anyone know of a tool for GNU/Linux that can detect duplicate directories, comparing the directory contents and not just the names? I have an old backup hard drive that I've slowly added shit to for years, and some of it is backups of stuff that was already on there. The problem is that the duplicate files are mixed in with other files that need to stay, so just removing duplicate files wouldn't work. But if I could find entire duplicate directory trees, I could manually decide whether each one needs to be removed or not. Unfortunately, I can't think of a way to do this with a tool like find, unless I'm just being dumb.

Attached: 59226183068.jpg (1150x950, 787K)

Other urls found in this thread:

pastebin.com/58vB9wMp

examine the files yourself.

Not really feasible. There are hundreds of thousands, if not millions, of files on this drive.

Pay someone to examine the files for you. I don't get what the point of this thread is, you're not going to find any answers better than that.

Seems like an interesting exercise for a short Python script. What do the duplicated directories have in common?

>Does anyone know of a tool for GNU/Linux that can detect duplicate directories
What's not to get? Surely software exists that can detect duplicate directories by recursing into them.

fslint-gui

Attached: 1568977334156.png (807x607, 23K)

Here's the window after it found some duplicate images.

Attached: 1568977566828.png (809x611, 38K)

That looks like it will only find duplicated files, not entire duplicated directories.

Do you care about file names, or only file contents? How big (file count & total size) is the drive? Is reading every file on the drive practical?

Write a script (sketch below):
> scan every file
> get path + filesize + md5
> if filesize and md5 match
> read both, check all bytes
> if true, print the paths of the duplicates
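Untested, and /mnt/backup is just a placeholder path, but roughly:
import filecmp
import hashlib
import os
from collections import defaultdict
def md5_of(path, chunk_size=1 << 20):
    # Hash in chunks so huge files don't blow up memory.
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()
def find_duplicate_files(root):
    # Step 1: group by filesize, the cheapest check.
    by_size = defaultdict(list)
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                by_size[os.path.getsize(path)].append(path)
            except OSError:
                pass  # broken symlink, permission error, etc.
    # Step 2: within each size group, group by md5.
    for paths in by_size.values():
        if len(paths) < 2:
            continue
        by_hash = defaultdict(list)
        for path in paths:
            by_hash[md5_of(path)].append(path)
        # Step 3: byte-for-byte check to rule out hash collisions.
        for candidates in by_hash.values():
            if len(candidates) < 2:
                continue
            first = candidates[0]
            group = [first] + [p for p in candidates[1:] if filecmp.cmp(first, p, shallow=False)]
            if len(group) > 1:
                print(group)
find_duplicate_files("/mnt/backup")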

So there isn't some shell regex trick - at least not one you know of - and you're just spinning shit to cover the idea that it can't be done?

Hash the files, compare hashes and go from there.

There's diff, but that only checks whether 2 files are different.
Running diff on every pair of files would be O(n^2).
Hashing every file instead reduces the amount of data that needs to be compared: only files whose hashes match need a full comparison.

>Do you care about file names, or only file contents?
Contents
>How big (file count & total size) is the drive?
Just checked. 2.7 million files totalling 1.4TiB / 1.8TiB.
>Is reading every file on the drive practical?
Should be. I have a server machine that I can let it run on and do its thing.
You would need to keep track of the directory structure too.
The reason I mention duplicate directories instead of just files is that I know there are decent-sized trees that are exact duplicates of other trees on the drive. They are buried amongst duplicate files that can be removed and duplicate files that need to stay.

>Contents
>Should be.
Cool. I think I have an idea of how to do that. Technically I can't promise zero false positives (because of hash collisions), but other than that it should be pretty simple.
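The rough idea: hash every file, then give each directory a hash built from the sorted names and hashes of everything directly inside it, so two directories only end up with the same hash if their whole trees match. Something like this (sketch only, /mnt/backup is a placeholder):
import hashlib
import os
from collections import defaultdict
def file_md5(path, chunk_size=1 << 20):
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()
def directory_hashes(root):
    # Walk bottom-up so every subdirectory is hashed before its parent.
    dir_hash = {}
    for dirpath, dirnames, filenames in os.walk(root, topdown=False):
        entries = []
        for name in sorted(filenames):
            try:
                entries.append(("f", name, file_md5(os.path.join(dirpath, name))))
            except OSError:
                entries.append(("f", name, "unreadable"))
        for name in sorted(dirnames):
            sub = os.path.join(dirpath, name)
            entries.append(("d", name, dir_hash.get(sub, "missing")))
        dir_hash[dirpath] = hashlib.md5(repr(entries).encode()).hexdigest()
    return dir_hash
def duplicate_trees(root):
    groups = defaultdict(list)
    for path, h in directory_hashes(root).items():
        groups[h].append(path)
    return [paths for paths in groups.values() if len(paths) > 1]
for paths in duplicate_trees("/mnt/backup"):
    print(paths)
Every subdirectory of a duplicated tree will show up as duplicated too, so you'd want to filter the output down to the topmost paths before deciding what to delete.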

Awesome, user.

Yeah, that's why step 2 is:
> get path + filesize + md5
If a whole path is duplicated, you can delete the duplicate files, then search for the directory trees left empty and delete those too.
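Finding the trees that end up empty afterwards is easy enough; sketch below (it only prints, it doesn't delete anything, and /mnt/backup is a placeholder):
import os
def empty_directory_trees(root):
    # Walk bottom-up so a directory's children are classified before it is.
    empty = set()
    for dirpath, dirnames, filenames in os.walk(root, topdown=False):
        children = [os.path.join(dirpath, d) for d in dirnames]
        if not filenames and all(c in empty for c in children):
            empty.add(dirpath)
    # Only report the topmost empty directories, not every nested one.
    for path in sorted(empty):
        if os.path.dirname(path) not in empty:
            print(path)
empty_directory_trees("/mnt/backup")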

And that's why I said
> read both, check all bytes
to make sure it isn't a hash collision.

filesize is the first check, then md5, then all bytes.

Hentai drive out of space, is it?

fdupes

Dude, just hash the files with sha1sum and compare them

DupeGuru

sha1 is slower than md5, and both have known collisions.
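If you want to see which one is actually faster on your hardware, hashlib makes it easy to compare (rough timing sketch, the file path is a placeholder):
import hashlib
import time
def hash_file(path, algorithm, chunk_size=1 << 20):
    # hashlib.new() accepts any algorithm name hashlib knows about.
    h = hashlib.new(algorithm)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()
for algo in ("md5", "sha1", "sha256"):
    start = time.perf_counter()
    hash_file("/mnt/backup/some_big_file", algo)
    print(algo, round(time.perf_counter() - start, 3), "seconds")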

Bad and lazy Python solution:
pastebin.com/58vB9wMp

diff for dirs
fdupes for files
findimagedupes for images

Virtually impossible to cause collisions at random.
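Back-of-the-envelope, treating md5 as a random 128-bit function and using the birthday approximation for ~2.7 million files:
# Birthday bound: probability of any accidental collision among n random
# b-bit hashes is roughly n^2 / 2^(b+1).
n = 2_700_000   # files on the drive
b = 128         # md5 output size in bits
print(n**2 / 2**(b + 1))  # ~1e-26, i.e. effectively never by accident
Deliberately crafted collisions are a different story, but those don't happen by accident on a backup drive.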