Question:
Hello, I'm using the HtmlToText component to convert HTML to plain text. However, the component treats angle brackets used as mathematical symbols as HTML. For example: "before less than < after less than" becomes "before less than". Changing the value of the SuppressLinks property had no effect.
I did turn on verbose logging, but it didn't give me anything I could use:

ToText:
DllDate: Aug 15 2013
ChilkatVersion: 9.4.1.42
Username: IUSR
Architecture: Little Endian; 32-bit
Language: .NET 2.0
VerboseLogging: 1
decodeHtmlEntities: 1
HtmlCodePage: 65001
charset3: utf-8
toXmlTime: Elapsed time: 0 millisec
xmlToText:
recursiveToText:
(leaveContext)
(leaveContext)
toTextTime: Elapsed time: 16 millisec
Success.
(leaveContext)
Any suggestions? Is this a known issue?
Thanks for your help.
Answer:
When parsing HTML, the "<" character is interpreted as the opening character of an HTML tag. Therefore, when an unencoded "<" exists, such as in "before less than < after less than", the HTML parser thinks that the HTML tag is "<afterlessthan...."
As humans, we can look at it and immediately see that the "<" character in that case is not meant to start a tag. Programmatically, however, it is not so easy. There is no reliable way to encounter a "<" and decide NOT to interpret it as the start character of an HTML tag. A literal less-than sign in HTML should be encoded as "&lt;".
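
If the input cannot be fixed at its source, one possible workaround is to pre-process the HTML before handing it to the converter. The sketch below is only a heuristic and is not part of the Chilkat API: the function name escape_bare_less_than is made up for illustration. It assumes that a "<" starts a tag, comment, or declaration only when it is immediately followed by a letter, "/", "!" or "?", and encodes every other "<" as "&lt;". It is written in Python for brevity; the same regular expression should carry over to .NET.

```python
import re

# Heuristic (an assumption, not a Chilkat feature): a "<" is treated as
# markup only when the next character is a letter, "/", "!" or "?".
# Every other "<" is assumed to be a literal less-than sign.
_BARE_LT = re.compile(r"<(?![A-Za-z/!?])")


def escape_bare_less_than(html: str) -> str:
    """Encode literal-looking "<" characters as "&lt;" and leave tags alone."""
    return _BARE_LT.sub("&lt;", html)


sample = "before less than < after less than"
print(escape_bare_less_than(sample))
# prints: before less than &lt; after less than
```

The log above shows decodeHtmlEntities: 1, so the converter would presumably turn "&lt;" back into "<" in the text output, though that has not been verified here. Note the limitation: a "<" directly followed by a letter, as in "a<b", still looks like a tag to this heuristic and is left unchanged.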