Archived Forum Post


Determine if XML File is Valid utf-8?

Feb 27 '14 at 12:33

Is it possible to check whether a text file (specifically an XML file) actually contains utf-8 characters? Sometimes we have XML files that indicate "utf-8" in the XML declaration (i.e. the first line of the XML file), but the contents of the file are actually ANSI (1-byte/char) encoded.

Pseudo code:

if (someChilkatLib.Encode(data) == 'ANSI')
   someChilkatLib.convertTo(data, 'utf-8')
// the file (data) is now in the correct format!


Here's a sample in C++:

#include <CkByteData.h>
#include <CkString.h>

// This is NOT a perfect solution...
bool isItReallyUtf8(const char *path)
{
    // Load the file with no interpretation of bytes.
    CkByteData data;
    if (!data.loadFile(path))
        return false;   // The file could not be read.

    // Does this have the utf-8 preamble (BOM)?  Some utf-8 files may, some may not.
    if (data.getSize() >= 3)
    {
        const unsigned char *p = data.getData();
        if ((*p == 0xEF) && (*(p+1) == 0xBB) && (*(p+2) == 0xBF))
        {
            // Yes, this seems to be utf-8.
            return true;
        }
    }

    // Load the file, telling the CkString object to interpret the bytes as utf-8 encoded chars.
    CkString str1;
    if (!str1.loadFile(path, "utf-8"))
        return false;

    // Round-trip from widechar unicode back to utf-8.
    CkString str2;
    str2.appendU(str1.getUnicode());

    if (!str2.equalsStr(str1))
        return false;   // The file was not utf-8.
    return true;        // The file was utf-8.
}
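
If you would rather not depend on how a library round-trips the bytes, another option is to validate the raw bytes directly against the utf-8 encoding rules. The following is only a minimal sketch, not part of Chilkat: isValidUtf8 and latin1ToUtf8 are hypothetical helper names. The conversion assumes the "ANSI" data is really ISO-8859-1; Windows-1252 differs in the 0x80-0x9F range and would need a lookup table for those bytes. Like the sample above, it is a heuristic: pure ASCII always passes, and an ANSI file can occasionally contain byte pairs that happen to form valid utf-8.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (not a Chilkat API): returns true if the buffer is
// well-formed utf-8 (no overlong forms, no surrogates, nothing above U+10FFFF).
bool isValidUtf8(const unsigned char *p, std::size_t n)
{
    std::size_t i = 0;
    while (i < n) {
        unsigned char b = p[i];
        if (b < 0x80) { ++i; continue; }    // plain ASCII byte

        std::size_t len;        // total bytes in this sequence
        unsigned long cp;       // code point being decoded
        unsigned long minCp;    // smallest code point allowed for this length
        if ((b & 0xE0) == 0xC0)      { len = 2; cp = b & 0x1F; minCp = 0x80; }
        else if ((b & 0xF0) == 0xE0) { len = 3; cp = b & 0x0F; minCp = 0x800; }
        else if ((b & 0xF8) == 0xF0) { len = 4; cp = b & 0x07; minCp = 0x10000; }
        else return false;      // stray continuation byte or invalid lead byte

        if (n - i < len) return false;      // sequence is cut off at end of data
        for (std::size_t k = 1; k < len; ++k) {
            unsigned char c = p[i + k];
            if ((c & 0xC0) != 0x80) return false;   // not a continuation byte
            cp = (cp << 6) | (c & 0x3F);
        }
        if (cp < minCp) return false;                      // overlong encoding
        if (cp >= 0xD800 && cp <= 0xDFFF) return false;    // UTF-16 surrogate
        if (cp > 0x10FFFF) return false;                   // beyond Unicode range
        i += len;
    }
    return true;
}

// Hypothetical helper: converts ISO-8859-1 bytes to utf-8.
// Assumes the input really is ISO-8859-1 (see the note about Windows-1252).
std::string latin1ToUtf8(const std::string &in)
{
    std::string out;
    out.reserve(in.size() * 2);
    for (std::size_t i = 0; i < in.size(); ++i) {
        unsigned char b = static_cast<unsigned char>(in[i]);
        if (b < 0x80) {
            out.push_back(static_cast<char>(b));
        } else {
            // Code points 0x80..0xFF become a two-byte sequence.
            out.push_back(static_cast<char>(0xC0 | (b >> 6)));
            out.push_back(static_cast<char>(0x80 | (b & 0x3F)));
        }
    }
    return out;
}
```

To mirror the pseudo code in the question, you could load the file bytes, call isValidUtf8 on them, and only run latin1ToUtf8 when the check fails.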