Sat, 09 Aug 2008
Finding duplicate files
Every once in a while, I realize that one file is duplicated in many places. This happens, for example, when I have imported photos from my camera into a photo management program and also stored a copy of them somewhere else. Sometimes I have downloaded files twice from the web.
Detecting duplicate files is not hard - you just compare the file contents. The problem is that with large files, and a large number of files, it can take a long time if you compare every file to every other file.
Because I needed to do this for a few gigabytes of photos, and everything I found I either didn't trust or ran too slowly, I wrote my own. Once you detect duplicate files, you generally want to either delete all but one, or to "merge them" via hardlinks so that all the files exist, but they share storage space on disk.
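The hardlink version of "merging" is the only mildly tricky part: the duplicate's path should never disappear, even for an instant. The usual trick looks roughly like this in Python (a sketch for illustration only, not a claim about how any particular tool below does it; the .merge-tmp suffix is just something I made up):

    import os

    def merge_via_hardlink(keeper, duplicate):
        # Give the keeper a second name right next to the duplicate...
        tmp = duplicate + '.merge-tmp'
        os.link(keeper, tmp)
        # ...then rename it over the duplicate. rename(2) replaces the target
        # atomically on POSIX filesystems, so the path never goes missing.
        os.rename(tmp, duplicate)

The worst failure mode is a leftover .merge-tmp file, not a missing photo.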
Summary: I had a fairly good approach, but everyone should use rdfind instead of my code.
My approach
You can check out (using Subversion or a web browser) my code at http://svn.asheesh.org/svn/public/code/merge_dups/ .
- Organize all the files grouped by size (since only files of equal size can have equal contents).
- For each size that contains more than one file, calculate a hash (MD5) of all the files.
- If any of the files have the same size and MD5, delete the one with a longer filename.
- Continue to the next file size.
This approach has to stat() every file at least once, but many files don't have to be read at all. For my photos, this was a huge time-saver.
(Why delete the one with a longer filename? Usually that's the one in some obscure directory named "camera-backup" or "recovered-from-some-dying-computer".)
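For the curious, the whole approach fits in a short Python sketch. This is not my actual merge_dups code (get that from the Subversion URL above); the names and details here are only illustrative:

    import hashlib
    import os
    import sys

    def md5_of(path, chunk_size=1 << 20):
        h = hashlib.md5()
        with open(path, 'rb') as f:
            for chunk in iter(lambda: f.read(chunk_size), b''):
                h.update(chunk)
        return h.hexdigest()

    def find_dups(top):
        # Step 1: group every file under 'top' by size.
        by_size = {}
        for dirpath, dirnames, filenames in os.walk(top):
            for name in filenames:
                path = os.path.join(dirpath, name)
                if os.path.islink(path):
                    continue
                by_size.setdefault(os.path.getsize(path), []).append(path)
        # Step 2: only sizes shared by more than one file get hashed at all.
        for size, paths in by_size.items():
            if len(paths) < 2:
                continue
            by_hash = {}
            for path in paths:
                by_hash.setdefault(md5_of(path), []).append(path)
            # Step 3: same size and same MD5 -> keep the shortest name.
            for digest, dups in by_hash.items():
                dups.sort(key=len)
                for loser in dups[1:]:
                    print('duplicate of %s: removing %s' % (dups[0], loser))
                    os.remove(loser)

    if __name__ == '__main__':
        find_dups(sys.argv[1])

The payoff of the size grouping shows up in step 2: a file whose size is unique is never even opened.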
I trust my code. Plus, it is verbose, printing out what it is doing and why. And the entire program with comments, status message print-outs, and vertical spacing easily fits on my screen.
Other implementations
Today, I decided to go through Freshmeat to see if I could retire my code and just rely on someone else's. So I checked out the reasonable contenders from this search.
find_duplicates by Fredrik Hubinette
- Homepage: http://fredrik.hubbe.net/hacks/
- License: GPL v2 (good)
- Efficiency: Good (uses file sizes the way I do)
- Language: Pike (weird, but seems okay)
- Strategy: Check sizes; hash; verify by reading file; merge via hardlinks
- Sanity: High
- Rating: Good
It uses the first few kilobytes of the file as a hash, which is probably more efficient than hashing the whole thing. It is safe: it reads the files in full before marking them as duplicates.
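That two-stage idea - a cheap hash of the first few kilobytes as a filter, then a full read only for files that still match - looks roughly like this in Python (my sketch of the general technique, not a translation of the Pike source):

    import filecmp
    import hashlib

    def prefix_hash(path, nbytes=8192):
        # Cheap pre-filter: hash only the first few kilobytes.
        with open(path, 'rb') as f:
            return hashlib.md5(f.read(nbytes)).hexdigest()

    def really_identical(a, b):
        # Files whose prefix hashes collide still get a full byte-for-byte
        # comparison before being declared duplicates.
        return prefix_hash(a) == prefix_hash(b) and filecmp.cmp(a, b, shallow=False)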
dmerge.cpp by Jonathan H Lundquist
- Homepage: http://www.fluxsmith.com/cgi-bin/twiki/view/Jonathan/DMerge
- License: X11-like (good)
- Efficiency: Good (uses file sizes the way I do)
- Language: C++ (bearable)
- Strategy: Check sizes; hash by calling an external program; verify by calling "cmp"; ...
- Sanity: Low
- Rating: Don't use
I stopped caring when I realized it calls external programs. I doubt it does it in a correct/secure way, so forget it.
duff by Camilla Berglund
- Homepage: http://duff.sourceforge.net/
- License: zlib/libpng (MIT-esque) (good)
- Efficiency: Good
- Language: C (okay)
- Strategy: Check sizes; hash with first few bytes; verify by SHA1 or actual byte-for-byte comparison
- Sanity: High (comes with a man page; very tunable; great web site)
- Rating: Don't use
This looks really good, but it doesn't actually do the merging itself. It relies on a shell script for that, and I don't trust the correctness of the shell script's handling of filenames (due to the whitespace-separated output format of duff itself).
Note to Camilla: If you provided a -z option (like find -print0) to duff, and made sure the shell script respected it, then it would be practically perfect.
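Consuming NUL-terminated output safely is the easy part; here is a Python sketch, assuming a hypothetical -z mode that prints each filename followed by a NUL byte (duff has no such option today, which is exactly my complaint):

    import sys

    # Read the (hypothetical) 'duff -z' output from stdin and split on NUL.
    # Unlike whitespace-separated output, this cannot be fooled by filenames
    # containing spaces or newlines.
    raw = sys.stdin.buffer.read()
    for name in raw.split(b'\0'):
        if name:
            sys.stdout.write('would merge: %r\n' % name)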
fslint 2.14
- Efficiency: Seemed lame
- Rating: Don't use
- Explanation: I tried it, and then I said, "That's it, I'm writing my own."
It was so little fun to use that I don't even want to talk about it. The benchmarks on the rdfind web page confirm this with data.
rdfind by Paul Sundvall (WINNER!)
- Homepage: http://www2.paulsundvall.net/rdfind/rdfind.html
- License: GPL (v2, probably) (good)
- Efficiency: Excellent
- Language: C++ (okay)
- Strategy: Check sizes; check first bytes; calculate SHA1s; delete dups or create symlinks or create hardlinks or print report
- Sanity: High - object-oriented, well-commented, includes man page, includes benchmarks
- Self-importance: High, but seems deserved
- Rating: Excellent, use it
Finding software like this is why I look for software not written by me.
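If you want to try it, an invocation would look roughly like the following; I am quoting the option names from memory, so check rdfind's man page before trusting them:

    rdfind -dryrun true ~/photos ~/camera-backup
    rdfind -makehardlinks true ~/photos ~/camera-backup

As I recall, rdfind ranks files found via earlier command-line arguments as the "originals", so list the directory whose copies you want to keep first.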
Other tools I didn't fully review
- finddup by Heiner Steven <http://www.shelldorado.com/scripts/cmds/finddup>
  - Language: Shell, which probably means it has problems with complicated filenames
- clink by Michael Opdenacker <http://free-electrons.com/community/tools/utils/clink/>
  - Language: Python (yay!)
  - Does not support hard links, only symlinks, which (by the author's own admission) creates permissions problems
- dupfinder by Matthias Böhm <http://doubles.sourceforge.net/>
  - Sanity: Moderate to Low - thinks that not using hash functions makes it "much faster" than other programs
- dupmerge2 by Rolf Freitag (continuation of work from Phil Karn) <http://sourceforge.net/projects/dupmerge/>
  - Sanity: Moderate to Low - bundles a pre-compiled binary, which is just weird
- dupseek by Antonio Bellezza <http://www.beautylabs.net/software/dupseek.html>
  - Focuses on interactive duplicate file removal. Probably good at that; I want correct, unattended operation.
- freedup by William Stearns <http://freedup.org/>
  - Looks fairly good, even though it's written in bash (which freaks me out)
  - Offers an option to strip metadata and compare only file *contents* for MP3, MPEG4, MPC, JPEG, and Ogg (not FLAC, I guess), which is great.
Conclusion
rdfind looks great. Every once in a while, two hours are better spent doing research than re-inventing the wheel. This is one of those times when I was more useful to myself as a secretary than as a programmer.