Archived Forum Post
Question:
Is it possible to check whether a text file (specifically an XML file) actually contains utf-8 characters? Sometimes we have XML files that indicate "utf-8" in the XML declaration (i.e. the first line of the XML file), but the contents of the XML are actually ANSI (1-byte/char) encoded.
Pseudo code:
if (someChilkatLib.Encode(data) == 'ANSI')
{
    someChilkatLib.convertTo(data, 'utf-8')
}
// the file (data) is now in the correct format!
Answer:
Here's a sample in C++:
// This is NOT a perfect solution...
#include <CkByteData.h>
#include <CkString.h>

bool isItReallyUtf8(const char *path)
{
    // Load the file with no interpretation of bytes.
    CkByteData data;
    if (!data.loadFile(path))
    {
        return false; // The file could not be read.
    }

    // Does this have the utf-8 preamble (BOM)? Some utf-8 files may, some may not.
    if (data.getSize() >= 3)
    {
        const unsigned char *p = data.getData();
        if ((p[0] == 0xEF) && (p[1] == 0xBB) && (p[2] == 0xBF))
        {
            // Yes, this seems to be utf-8.
            return true;
        }
    }

    // Load the file, telling the CkString object to interpret the bytes as utf-8 encoded chars.
    CkString str1;
    str1.loadFile(path, "utf-8");

    // Round-trip from widechar unicode back to utf-8.
    CkString str2;
    str2.appendU(str1.getUnicode());

    // If the round trip changed the text, the file was not utf-8.
    return str2.equalsStr(str1);
}
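
Since the sample above is described as not a perfect solution, here is a rough sketch of a stricter byte-level check that does not use Chilkat at all. It walks the raw bytes and rejects anything that is not well-formed UTF-8 (stray continuation bytes, truncated or overlong sequences, surrogates, code points above U+10FFFF). The helper names isValidUtf8 and latin1ToUtf8 are hypothetical and not part of any Chilkat API. The conversion step assumes that "ANSI" means ISO-8859-1 (Latin-1); Windows-1252 differs for bytes 0x80 to 0x9F and would need a lookup table instead.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (not a Chilkat function).
// Returns true if every byte sequence in `s` is well-formed UTF-8.
bool isValidUtf8(const std::string &s)
{
    const unsigned char *p = reinterpret_cast<const unsigned char *>(s.data());
    const std::size_t n = s.size();
    std::size_t i = 0;
    while (i < n)
    {
        const unsigned char c = p[i];
        std::size_t len = 0;      // total bytes in this sequence
        unsigned long cp = 0;     // decoded code point
        unsigned long minCp = 0;  // smallest code point allowed for this length

        if (c < 0x80) { i++; continue; }  // plain ASCII byte
        else if ((c & 0xE0) == 0xC0) { len = 2; cp = c & 0x1F; minCp = 0x80; }
        else if ((c & 0xF0) == 0xE0) { len = 3; cp = c & 0x0F; minCp = 0x800; }
        else if ((c & 0xF8) == 0xF0) { len = 4; cp = c & 0x07; minCp = 0x10000; }
        else return false;                // stray continuation or invalid lead byte

        if (len > n - i) return false;    // sequence is cut off at end of data

        for (std::size_t k = 1; k < len; k++)
        {
            if ((p[i + k] & 0xC0) != 0x80) return false;  // not a continuation byte
            cp = (cp << 6) | (p[i + k] & 0x3F);
        }

        if (cp < minCp) return false;                     // overlong encoding
        if (cp >= 0xD800 && cp <= 0xDFFF) return false;   // UTF-16 surrogate
        if (cp > 0x10FFFF) return false;                  // beyond Unicode range
        i += len;
    }
    return true;
}

// Hypothetical helper (not a Chilkat function).
// ASSUMPTION: the input is ISO-8859-1 (Latin-1), where each byte value
// equals its Unicode code point. This is not correct for Windows-1252.
std::string latin1ToUtf8(const std::string &in)
{
    std::string out;
    out.reserve(in.size());
    for (unsigned char c : in)
    {
        if (c < 0x80)
        {
            out.push_back(static_cast<char>(c));
        }
        else
        {
            // Code points 0x80..0xFF always need exactly two bytes in UTF-8.
            out.push_back(static_cast<char>(0xC0 | (c >> 6)));
            out.push_back(static_cast<char>(0x80 | (c & 0x3F)));
        }
    }
    return out;
}
```

With these, the pseudo code in the question would become roughly: if (!isValidUtf8(data)) data = latin1ToUtf8(data);. Note that a file containing only 7-bit ASCII passes the check, which is fine because ASCII is a subset of utf-8. Also, some 1-byte/char text can happen to form valid utf-8 sequences by coincidence, so this is still a heuristic and not a proof.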