Does anyone know of a tool for GNU/Linux that can detect duplicate directories, including the directory contents? I have an old backup hard drive that I've slowly added shit to for years. Some of it is backups of stuff that was already on there. The problem is, there are a bunch of duplicate files mixed in with the others that would need to stay. Just removing duplicate files wouldn't work, but if I could find entire duplicate directory trees, I could just manually decide if they need to be removed or not. Unfortunately, I can't think of a way to do this with a tool like find, unless I'm just being dumb.
Need a Duplicate Detection Tool for GNU/Linux
examine the files yourself.
Not really feasible. There are hundreds of thousands, if not millions, of files on this drive.
Pay someone to examine the files for you. I don't get what the point of this thread is, you're not going to find any answers better than that.
Seems like an interesting exercise for a short Python script. What do the duplicated directories have in common?
>Does anyone know of a tool for GNU/Linux that can detect duplicate directories
What's not to get? Surely software exists that can detect duplicate directories by recursing into them.
fslint-gui
Here's the window after it found some duplicate images.
That looks like it will only find duplicated files, not entire duplicated directories.
Do you care about file names, or only file contents? How big (file count & total size) is the drive? Is reading every file on the drive practical?
write script
> scan every file
> get path + filesize + md5
> if filesize and md5 matches
> read both, check all bytes
> if true, print path of duplicates
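Roughly like this in Python, as a sketch of those exact steps (untested, and the output format is arbitrary; only files that share a size ever get hashed, and only files that share a hash get the byte-for-byte check):

#!/usr/bin/env python3
# sketch of the steps above: size -> md5 -> byte compare
import filecmp
import hashlib
import os
import sys
from collections import defaultdict

def md5_of(path, blocksize=1 << 20):
    # hash in chunks so big files never have to fit in RAM
    h = hashlib.md5()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(blocksize), b""):
            h.update(block)
    return h.hexdigest()

def find_dupes(root):
    # 1. scan every file, bucket by size (cheap, no file reads yet)
    by_size = defaultdict(list)
    for dirpath, _, names in os.walk(root):
        for name in names:
            path = os.path.join(dirpath, name)
            if os.path.isfile(path) and not os.path.islink(path):
                by_size[os.path.getsize(path)].append(path)
    # 2. hash only the files that share a size, bucket by (size, md5)
    by_hash = defaultdict(list)
    for size, paths in by_size.items():
        if len(paths) > 1:
            for path in paths:
                by_hash[(size, md5_of(path))].append(path)
    # 3. byte-compare to rule out collisions, then print the duplicate paths
    for paths in by_hash.values():
        if len(paths) > 1:
            first = paths[0]
            dupes = [p for p in paths[1:] if filecmp.cmp(first, p, shallow=False)]
            if dupes:
                print(first, *dupes, sep="\n  dup: ")

if __name__ == "__main__":
    find_dupes(sys.argv[1] if len(sys.argv) > 1 else ".")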
so there isn't some shell regex trick - at least not one you know of - and you're spinning shit to cover for the fact that it can't be done?
Hash the files, compare hashes and go from there.
there's diff, but that only checks whether 2 files differ.
running diff on every pair of files would be O(n^2)
instead, hashing every file reduces the amount of data that needs to be compared.
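To make the O(n^2) point concrete: one linear pass of hashing drops every file into a bucket keyed by its digest, and only files that land in the same bucket ever need a real comparison. A minimal sketch, assuming Python and md5 (the chunked read is just so a big file never has to fit in memory):

import hashlib
import os
from collections import defaultdict

def digest(path, blocksize=1 << 20):
    h = hashlib.md5()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(blocksize), b""):
            h.update(block)
    return h.hexdigest()

# n hashes + dict lookups instead of n*(n-1)/2 pairwise diffs
buckets = defaultdict(list)
for dirpath, _, names in os.walk("."):
    for name in names:
        path = os.path.join(dirpath, name)
        if os.path.isfile(path):
            buckets[digest(path)].append(path)

# only these buckets need a byte-for-byte check afterwards
for paths in buckets.values():
    if len(paths) > 1:
        print(paths)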
>Do you care about file names, or only file contents?
Contents
>How big (file count & total size) is the drive?
Just checked. 2.7 million files totalling 1.4TiB / 1.8TiB.
>Is reading every file on the drive practical?
Should be. I have a server machine that I can let it run on and do its thing.
You would need to keep track of the directory structure too.
The reason I mention duplicate directories instead of just files is because I know there are decent sized trees that are just duplicates of other identical trees on the drive. They are buried amongst other removable duplicate files and duplicate files that need to stay.
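One way to lift the file-level approach to whole trees (nobody in the thread spelled this out, so treat it as a rough sketch): give every directory a fingerprint built from its children's fingerprints, bottom-up, so two directory trees with identical contents end up with identical fingerprints no matter where they sit on the drive.

import hashlib
import os
from collections import defaultdict

def file_digest(path, blocksize=1 << 20):
    h = hashlib.md5()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(blocksize), b""):
            h.update(block)
    return h.hexdigest()

def tree_digests(root):
    # topdown=False walks leaves first, so child digests exist
    # before the parent needs them
    digests = {}
    for dirpath, dirnames, filenames in os.walk(root, topdown=False):
        h = hashlib.md5()
        for name in sorted(filenames):
            h.update(name.encode())
            h.update(file_digest(os.path.join(dirpath, name)).encode())
        for name in sorted(dirnames):
            child = os.path.join(dirpath, name)
            h.update(name.encode())
            # symlinked dirs aren't descended into, so they may be missing
            h.update(digests.get(child, "unvisited").encode())
        digests[dirpath] = h.hexdigest()
    return digests

if __name__ == "__main__":
    by_digest = defaultdict(list)
    for directory, d in tree_digests(".").items():
        by_digest[d].append(directory)
    for dirs in by_digest.values():
        if len(dirs) > 1:
            print("possible duplicate trees:", *dirs)

File names go into the fingerprint too, so a tree only matches an exact copy of itself; drop the name.encode() lines if renamed-but-identical contents should still count. Subdirectories of two duplicate trees will also be reported as duplicates of each other, which is expected.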
>Contents
>Should be.
Cool. I think I have an idea of how to do that. Technically I can't promise zero false positives (hash collisions), but other than that it should be pretty simple.
Awesome, user.
yeah, that's why step 2 is:
> get path + filesize + md5
if there is a whole duplicate path, you can search for empty directory trees later and delete those.
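Cleaning up afterwards is the easy part; something like this (the path is made up, and rmdir refuses to touch anything non-empty, so it can't eat real data):

import os

def prune_empty_dirs(root):
    # bottom-up, so a parent is only checked after its children
    # have had the chance to be removed
    for dirpath, _, _ in os.walk(root, topdown=False):
        if dirpath != root and not os.listdir(dirpath):
            print("removing empty dir:", dirpath)
            os.rmdir(dirpath)

prune_empty_dirs("/path/to/backup")  # made-up path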
and that's why I said
> read both, check all bytes
to make sure it isn't a hash collision
filesize is the first check, then md5, then all bytes.
Hentai drive out of space is it?
fdupes
Dude, just hash the files with sha1sum and compare them
DupeGuru
sha1 is slower than md5, and both have known collisions.
Bad and lazy Python solution:
pastebin.com
diff for dirs
fdupes for files
findimagedupes for images
Virtually impossible to cause collisions at random.