Archived Forum Post

Question:

Want to Send "ąćęłńśóźżĄĆĘŁŃŚÓŹŻ" in Socket SendString Function

Sep 20 '16 at 08:15

I have a test string, "ąćęłńśóźżĄĆĘŁŃŚÓŹŻ", and I want to send it with the SendString function. I can send it correctly from C# to PHP, but why am I not able to send it from PHP to PHP? I have set up two machines with the same configuration (one is a clone of the other). It works almost perfectly; almost, because the 'Ł' is lost every time.

What can be wrong? In PHP a string is not an object; might that be it? Do you have any clues about what I should do?


Answer

First one must understand this: http://php.net/manual/en/language.types.string.php#language.types.string.details

Specifically, that

The string in PHP is implemented as an array of bytes and an integer indicating the length of the buffer. It has no information about how those bytes translate to characters, leaving that task to the programmer. There are no limitations on the values the string can be composed of; in particular, bytes with value 0 (“NUL bytes”) are allowed anywhere in the string (however, a few functions, said in this manual not to be “binary safe”, may hand off the strings to libraries that ignore data after a NUL byte.)

This nature of the string type explains why there is no separate “byte” type in PHP – strings take this role. Functions that return no textual data – for instance, arbitrary data read from a network socket – will still return strings.

Given that PHP does not dictate a specific encoding for strings, one might wonder how string literals are encoded. For instance, is the string "á" equivalent to "\xE1" (ISO-8859-1), "\xC3\xA1" (UTF-8, C form), "\x61\xCC\x81" (UTF-8, D form) or any other possible representation? The answer is that string will be encoded in whatever fashion it is encoded in the script file. Thus, if the script is written in ISO-8859-1, the string will be encoded in ISO-8859-1 and so on. ...
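The three byte sequences named in the quoted passage can be reproduced outside PHP. The following is a minimal sketch in Python (chosen here only because its codecs make the bytes easy to inspect; the variable names are illustrative):

```python
import unicodedata

composed = "\u00e1"  # "á" as a single code point (normalization form C)
decomposed = unicodedata.normalize("NFD", composed)  # "a" + combining acute accent

latin1_bytes = composed.encode("iso-8859-1")   # one byte
utf8_nfc_bytes = composed.encode("utf-8")      # two bytes
utf8_nfd_bytes = decomposed.encode("utf-8")    # three bytes

# Same visible character, three different byte strings.
print(latin1_bytes.hex(), utf8_nfc_bytes.hex(), utf8_nfd_bytes.hex())
# e1 c3a1 61cc81
```

A PHP string holding any one of these byte sequences carries no marker saying which of the three it is.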

There are two hurdles that need to be cleared in order to get things right:

(1) Given that in PHP a string is just an array of bytes, when the bytes are passed to Chilkat, Chilkat must know how to interpret the bytes. Are they utf-8 bytes, such that an "á" is represented by "\xC3\xA1", or are they ANSI bytes, where the ANSI character encoding is defined by the locale of the machine? The ANSI encoding is likely iso-8859-2 if the computer is in Poland, and in that case the "á" is represented by "\xE1".

For programming languages where strings are byte arrays, Chilkat provides a "Utf8" property that defaults to false/0. The Utf8 property defines how Chilkat is going to interpret the bytes of passed-in string arguments -- as either ANSI bytes or utf-8 bytes. This must be set correctly.
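To see why the interpretation has to be stated explicitly, here is a small sketch of one byte sequence read two ways. It uses Python's built-in codecs as a stand-in for whatever conversion Chilkat performs internally, which is an assumption; only the byte values themselves are certain.

```python
raw = b"\xc5\x81"  # the two bytes a utf-8 encoded file stores for "Ł"

as_utf8 = raw.decode("utf-8")         # read as utf-8: one character
as_latin2 = raw.decode("iso-8859-2")  # read as a single-byte charset: two characters

print(len(as_utf8), len(as_latin2))

# The reverse mismatch fails differently: the iso-8859-2 byte for "Ł" (0xA3)
# is not a valid utf-8 sequence on its own.
try:
    b"\xa3".decode("utf-8")
    valid_as_utf8 = True
except UnicodeDecodeError:
    valid_as_utf8 = False
print(valid_as_utf8)
```

The bytes never change; only the declared interpretation does, and that declaration is the role the Utf8 property plays for passed-in strings.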

In your particular case, given that only the 'Ł' is lost, it must be that the string is passed in correctly, and the problem occurs in (2) described below.

(2) The 2nd hurdle that must be cleared is that Chilkat must know exactly what bytes to send. Will it be sending the utf-8 representation of the string, the ANSI representation, or something else (perhaps utf-16, utf-32, or some arcane charset that's seldom used)? The way to control this is to set the Socket.StringCharset property. This is likely the problem -- your program passed the string to Chilkat correctly, and now Chilkat must convert it to the actual bytes that are going to be sent over the socket. The StringCharset controls which byte representation is used. If the StringCharset is set to some charset (encoding) where the "Ł" character has no possible byte representation, then it is lost. For example, you cannot send "Ł" if StringCharset = "iso-8859-1" because none of that charset's 256 single-byte values represents "Ł". To send "Ł", StringCharset must be something that includes "Ł", which can be any Unicode encoding (utf-8, utf-16, etc.) or one of the single-byte encodings for the region (iso-8859-2, Windows-1250, etc.)
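Which charsets have a byte representation for a given character can be checked directly. A minimal sketch, again using Python's codecs rather than Chilkat itself; `can_encode` is a hypothetical helper written for this illustration:

```python
TEST = "ąćęłńśóźżĄĆĘŁŃŚÓŹŻ"

def can_encode(text, charset):
    """Return True if every character of text has a byte form in charset."""
    try:
        text.encode(charset)
        return True
    except UnicodeEncodeError:
        return False

for charset in ["iso-8859-1", "iso-8859-2", "cp1250", "utf-8", "utf-16"]:
    print(charset, can_encode("Ł", charset), can_encode(TEST, charset))

# Characters of the test string that iso-8859-1 cannot represent at all.
missing_in_latin1 = [ch for ch in TEST if not can_encode(ch, "iso-8859-1")]
print(missing_in_latin1)
```

Only "ó" and "Ó" from the test string exist in iso-8859-1; the other sixteen characters, "Ł" included, need one of the regional or Unicode encodings.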


Answer

What a great answer, thank you very much.

Before I asked the question I had been testing various combinations of put_StringCharset and put_Utf8 for both the client and the server socket. After reading your response I decided to note every test and its result, just to be sure I hadn't omitted anything.

It might surprise you, but the case where only "Ł" was omitted occurred with put_Utf8(false).

In my case the solution is: put_Utf8(true) and put_StringCharset('utf-8') on both client and server.
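Why the same charset is needed on both ends can be sketched as a simulated round trip. This does not use Chilkat; `simulate` is a hypothetical helper in which Python's encode and decode stand in for the sender's and the receiver's StringCharset conversions.

```python
TEST = "ąćęłńśóźżĄĆĘŁŃŚÓŹŻ"

def simulate(text, sender_charset, receiver_charset):
    """Encode text as the sender would, then decode the bytes as the receiver would."""
    wire_bytes = text.encode(sender_charset)
    # errors="replace" keeps going on undecodable bytes instead of raising.
    return wire_bytes.decode(receiver_charset, errors="replace")

matched = simulate(TEST, "utf-8", "utf-8")
mismatched = simulate(TEST, "utf-8", "iso-8859-2")

print(matched == TEST)     # True
print(mismatched == TEST)  # False
```

With matching charsets the text survives unchanged; with a mismatch the receiver gets the right bytes but reads them as different characters.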


Answer

Thanks! There's one more thing to know, and it may explain what happened in your case. If there is a literal string in your source file (i.e. a literal quoted string such as "ąćęłńśóźżĄĆĘŁŃŚÓŹŻ"), then it makes a difference how that source file is saved. For example, if you are using an IDE or text editor that saves source files in utf-8, then the bytes of those chars are saved in the utf-8 representation. When PHP is interpreting the source file, the string is composed of the bytes found in the source file -- thus the need to set the Chilkat Utf8 property to true/false depends on how the PHP source file was saved. (This applies to the source files for any programming language where strings are simply byte arrays. In other programming languages, it may be that the compiler/interpreter expects the source to be utf-8, and it would always be a mistake to save the source in the ANSI encoding.)
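One possible mechanism for losing exactly "Ł" can be sketched as follows. It rests on assumptions the thread does not confirm: that the source file was saved as utf-8, that the machines' ANSI code page was Windows-1252, and that a byte with no mapping in the ANSI code page is dropped. `chars_with_unmappable_bytes` is a hypothetical helper, and Python's cp1252 codec stands in for the ANSI conversion.

```python
TEST = "ąćęłńśóźżĄĆĘŁŃŚÓŹŻ"

def chars_with_unmappable_bytes(text, ansi_charset):
    """Return the characters whose utf-8 bytes include a byte that
    ansi_charset cannot decode."""
    affected = []
    for ch in text:
        for byte in ch.encode("utf-8"):
            try:
                bytes([byte]).decode(ansi_charset)
            except UnicodeDecodeError:
                affected.append(ch)
                break
    return affected

# Assumption: Windows-1252 as the ANSI code page; it leaves byte 0x81 undefined.
print(chars_with_unmappable_bytes(TEST, "cp1252"))
# ['Ł']
```

Under those assumptions, every other character's utf-8 bytes are misread as some Windows-1252 character yet still pass through byte for byte, while the second byte of "Ł" (0x81) has no Windows-1252 meaning at all. That would match the symptom seen with put_Utf8(false).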