Archived Forum Post

Question:

Determine if XML File is Valid utf-8?

Feb 27 '14 at 12:33

Is it possible to check whether a text file (specifically an XML file) actually contains utf-8 characters? Sometimes we have XML files that indicate "utf-8" in the XML declaration (i.e. the first line of the XML file), but the contents of the XML are actually ANSI (1-byte/char) encoded.

Pseudo code:

if (someChilkatLib.Encode(data) == "ANSI")
{
   someChilkatLib.convertTo(data, "utf-8")
}
}
// the file (data) now is in correct format!


Answer

Here's a sample in C++:

#include <CkByteData.h>
#include <CkString.h>

// This is NOT a perfect solution...
bool isItReallyUtf8(const char *path)
{
    // Load the file with no interpretation of bytes.
    CkByteData data;
    if (!data.loadFile(path))
    {
        return false;   // The file could not be read.
    }

    // Does this have the utf-8 preamble (BOM)?  Some utf-8 files may, some may not.
    if (data.getSize() >= 3)
    {
        const unsigned char *p = data.getData();
        if ((*p == 0xEF) && (*(p+1) == 0xBB) && (*(p+2) == 0xBF))
        {
            // Yes, this seems to be utf-8 (the remaining bytes are not checked).
            return true;
        }
    }

    // Load the file, telling the CkString object to interpret the bytes as utf-8 encoded chars.
    CkString str1;
    str1.loadFile(path,"utf-8");

    // Round-trip from widechar unicode back to utf-8.
    CkString str2;
    str2.appendU(str1.getUnicode());

    if (!str2.equalsStr(str1))
    {
        return false;   // The file was not utf-8.
    }
    else
    {
        return true;    // The file was utf-8.
    }
}
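
For comparison, here is a rough sketch of a stricter byte-level check in plain C++ with no Chilkat calls. It is not part of the Chilkat API: the names isValidUtf8 and latin1ToUtf8 are made up for this illustration. It also assumes the "ANSI" data is ISO-8859-1 (Latin-1). If the files are really Windows-1252, bytes 0x80 to 0x9F map to different characters and would need a lookup table instead.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: returns true if the bytes are well-formed UTF-8
// (no overlong encodings, no surrogates, nothing above U+10FFFF).
bool isValidUtf8(const std::string &bytes)
{
    const std::size_t n = bytes.size();
    std::size_t i = 0;
    while (i < n)
    {
        const unsigned char c = static_cast<unsigned char>(bytes[i]);
        std::size_t extra;      // continuation bytes expected after the lead byte
        unsigned long cp;       // code point being decoded
        unsigned long minCp;    // smallest code point allowed for this length

        if (c < 0x80)                { extra = 0; cp = c;        minCp = 0; }
        else if ((c & 0xE0) == 0xC0) { extra = 1; cp = c & 0x1F; minCp = 0x80; }
        else if ((c & 0xF0) == 0xE0) { extra = 2; cp = c & 0x0F; minCp = 0x800; }
        else if ((c & 0xF8) == 0xF0) { extra = 3; cp = c & 0x07; minCp = 0x10000; }
        else return false;      // stray continuation byte or invalid lead byte

        if (extra > n - i - 1) return false;    // sequence is cut off at end of data

        for (std::size_t k = 1; k <= extra; ++k)
        {
            const unsigned char cc = static_cast<unsigned char>(bytes[i + k]);
            if ((cc & 0xC0) != 0x80) return false;  // not a continuation byte
            cp = (cp << 6) | (cc & 0x3F);
        }

        if (cp < minCp) return false;                    // overlong encoding
        if (cp >= 0xD800 && cp <= 0xDFFF) return false;  // UTF-16 surrogate
        if (cp > 0x10FFFF) return false;                 // beyond the Unicode range
        i += extra + 1;
    }
    return true;
}

// Hypothetical helper: converts ISO-8859-1 (Latin-1) bytes to UTF-8.
// Assumption: the input is Latin-1, not Windows-1252.
std::string latin1ToUtf8(const std::string &bytes)
{
    std::string out;
    out.reserve(bytes.size() * 2);
    for (std::size_t i = 0; i < bytes.size(); ++i)
    {
        const unsigned char c = static_cast<unsigned char>(bytes[i]);
        if (c < 0x80)
        {
            out.push_back(static_cast<char>(c));    // ASCII is unchanged
        }
        else
        {
            // Code points 0x80..0xFF become a two-byte sequence.
            out.push_back(static_cast<char>(0xC0 | (c >> 6)));
            out.push_back(static_cast<char>(0x80 | (c & 0x3F)));
        }
    }
    return out;
}
```

With the raw file bytes in a std::string, the pseudo code from the question becomes roughly: if isValidUtf8(data) is false, replace data with latin1ToUtf8(data). Pure ASCII files pass the check, which is fine since ASCII is already valid utf-8. Like the sample above, this is still a heuristic: a 1-byte/char file whose high bytes happen to form valid utf-8 sequences would be misclassified, although that is rare in real text.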