For quite a while I have been using the following code in a C++ project to quickly compare files for changes:
    CkZipCrc zipCrc;

    // Get CRC of binary file 1
    unsigned long crc1 = zipCrc.FileCrc("binary file 1");

    // Get CRC of binary file 2
    unsigned long crc2 = zipCrc.FileCrc("binary file 2");

    if (crc1 != crc2) ...
I am wondering if this is considered a safe comparison for any type of file. I know opinions vary, and that file sizes and the number of files matter a great deal. It comes down to the likelihood of false positives (two different files producing the same CRC), but should I rewrite the code to be on the safe side?
The above code is used in an automated process that typically compares up to 1,000 files at a time (two directories with the same file names, where some files may have changed and dates/attributes cannot be trusted). The files are typically under 5 MB each, but since there is no upper limit they could be any size, though I encourage my users to keep files small for various reasons.
It is critical that no errors result from an invalid comparison, but speed also matters here. I chose CRC32 because it is fast and well integrated, but the question is whether I should switch to SHA-1 or similar.
In case SHA-1 is the recommended solution, would the following code be the best way to do it, or is there a built-in method to compare two binary files? (Otherwise that would be a nice feature.)
    CkCrypt2 crypt;
    crypt.UnlockComponent("...");
    crypt.put_HashAlgorithm("sha1");
    crypt.put_EncodingMode("hex");

    // Get hash of binary files. Copy the first result into a std::string
    // before the second call, because hashFileENC returns a pointer to an
    // internal buffer that is reused on the next call.
    std::string hash1 = crypt.hashFileENC("binary file 1");
    std::string hash2 = crypt.hashFileENC("binary file 2");

    // Compare SHA1 hash values (string comparison, not pointer comparison)
    if (hash1 != hash2) ...
I think both CRC32 and SHA-1 are adequate and of similar performance. For any single comparison, the chance of a false positive is about 1 in 2^32, i.e. roughly 1 in 4 billion. I'm personally OK with those odds. A SHA-1 hash is 20 bytes as opposed to the 4-byte CRC32 value, so its chance of collision is astronomically lower.
Thanks, that was my understanding as well. Would you consider adding an integrated method that does the comparison and simply returns an integer indicating equality (just to simplify code and avoid returning temporary objects)?
E.g. CkZipCrc::FileCompare(const char *file1, const char *file2) or CkCrypt2::HashFileCompare(const char *file1, const char *file2, const char *hashAlgorithm)
I would compare file sizes first to potentially short-circuit the SHA-1 calls, which will save processing time. If the files are different sizes, then obviously the contents are not equal.
If you are worried about collisions, you could also binary-compare a few random internal chunks of the files - this might save additional time before calculating the SHA-1, but it would need to be benchmarked (if you are dealing with lots of large files, it certainly would help).
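A minimal sketch of the size-check short-circuit, using only the standard library rather than the Chilkat API (the function name `files_differ` is hypothetical, and the fallback here is a full byte-by-byte compare, though a SHA-1 of each file would fit in the same spot):

```cpp
#include <algorithm>
#include <filesystem>
#include <fstream>
#include <string>

// Cheap size check first; only read file contents when the sizes match.
bool files_differ(const std::string &path1, const std::string &path2) {
    namespace fs = std::filesystem;

    // Different sizes mean different contents - no I/O on the data needed.
    if (fs::file_size(path1) != fs::file_size(path2))
        return true;

    // Same size: fall back to a full byte-by-byte comparison.
    std::ifstream f1(path1, std::ios::binary);
    std::ifstream f2(path2, std::ios::binary);
    char b1[8192], b2[8192];
    while (f1 && f2) {
        f1.read(b1, sizeof b1);
        f2.read(b2, sizeof b2);
        if (f1.gcount() != f2.gcount() ||
            !std::equal(b1, b1 + f1.gcount(), b2))
            return true;
    }
    return false;
}
```

For many equal-sized pairs the byte compare can still exit early on the first differing chunk, which is where the suggestion to probe a few random internal chunks would slot in.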
I ran some quick tests on a few hundred files, and the size-check optimization alone reduced the total run time from around 6 seconds to around 200 milliseconds.