Question:
Is it possible to check whether a text file (specifically an XML file) actually contains utf-8 characters? Sometimes we have XML files that indicate "utf-8" in the XML declaration (i.e. the first line of the XML file), but the contents are actually ANSI (1 byte per char) encoded.
Pseudo code:
If (someChilkatLib.Encode(data) == 'ANSI') { someChilkatLib.convertTo(data, 'utf-8') } // the file (data) is now in the correct format!
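The pseudo code covers two separate steps: detecting whether the bytes are really utf-8, and re-encoding them when they are not. For the re-encoding step, one common approach on Windows (outside of Chilkat) is to round-trip through UTF-16 with the Win32 conversion functions. Below is a minimal sketch, assuming the source text really is in the system ANSI code page; the name ansiToUtf8 is just for illustration.

#include <windows.h>
#include <string>

// Re-encode ANSI (system code page) text as utf-8 by round-tripping through UTF-16.
std::string ansiToUtf8(const std::string &ansi)
{
    if (ansi.empty()) return std::string();

    // ANSI -> UTF-16
    int wlen = MultiByteToWideChar(CP_ACP, 0, ansi.c_str(), (int)ansi.size(), NULL, 0);
    std::wstring wide(wlen, L'\0');
    MultiByteToWideChar(CP_ACP, 0, ansi.c_str(), (int)ansi.size(), &wide[0], wlen);

    // UTF-16 -> utf-8
    int ulen = WideCharToMultiByte(CP_UTF8, 0, wide.c_str(), (int)wide.size(), NULL, 0, NULL, NULL);
    std::string utf8(ulen, '\0');
    WideCharToMultiByte(CP_UTF8, 0, wide.c_str(), (int)wide.size(), &utf8[0], ulen, NULL, NULL);
    return utf8;
}

Once the bytes are re-encoded this way, the encoding="utf-8" attribute in the XML declaration and the actual file contents finally agree.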
Here's a sample of the check in C++:
#include <CkByteData.h>
#include <CkString.h>

// This is NOT a perfect solution...
bool isItReallyUtf8(const char *path)
{
    // Load the file with no interpretation of the bytes.
    CkByteData data;
    data.loadFile(path);

    // Does this have the utf-8 preamble (BOM)?  Some utf-8 files may, some may not.
    if (data.getSize() >= 3)
    {
        const unsigned char *p = data.getData();
        if ((*p == 0xEF) && (*(p+1) == 0xBB) && (*(p+2) == 0xBF))
        {
            // Yes, this seems to be utf-8.
            return true;
        }
    }

    // Load the file, telling the CkString object to interpret the bytes as utf-8 encoded chars.
    CkString str1;
    str1.loadFile(path, "utf-8");

    // Round-trip from widechar unicode back to utf-8.
    CkString str2;
    str2.appendU(str1.getUnicode());

    if (!str2.equalsStr(str1))
    {
        return false;  // The file was not utf-8.
    }
    else
    {
        return true;   // The file was utf-8.
    }
}
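Where Chilkat isn't available, the round-trip idea can be approximated by validating the byte sequences directly. Here is a minimal, library-free sketch (the name isValidUtf8Bytes is just for illustration); note that pure ASCII also passes, since ASCII is a subset of utf-8, and the sketch does not reject overlong encodings or surrogate code points.

#include <cstddef>

// Returns true if buf[0..len) consists only of well-formed utf-8 byte sequences.
bool isValidUtf8Bytes(const unsigned char *buf, size_t len)
{
    size_t i = 0;
    while (i < len)
    {
        unsigned char b = buf[i];
        size_t extra;                                // continuation bytes expected after the lead byte
        if (b <= 0x7F)               extra = 0;      // 1-byte (ASCII)
        else if ((b & 0xE0) == 0xC0) extra = 1;      // 2-byte sequence
        else if ((b & 0xF0) == 0xE0) extra = 2;      // 3-byte sequence
        else if ((b & 0xF8) == 0xF0) extra = 3;      // 4-byte sequence
        else return false;                           // invalid lead byte (e.g. a lone ANSI byte >= 0x80)

        if (i + extra >= len)
            return false;                            // multi-byte sequence truncated at end of buffer

        for (size_t k = 1; k <= extra; ++k)
        {
            if ((buf[i + k] & 0xC0) != 0x80)
                return false;                        // expected a continuation byte (10xxxxxx)
        }
        i += extra + 1;
    }
    return true;
}

In the failure case from the question (e.g. Windows-1252 text that declares utf-8), an accented character shows up as a single byte >= 0x80 that is not followed by valid continuation bytes, so a check like this typically returns false.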