UTF-8 character encoding issues

Posted: Sat 29. Oct 2005, 10:21
by kipara
I want to use PHPWCMS for several websites in different languages. To make things as easy and compatible as possible I want to use UTF-8 as my encoding. So off I go...

I have a local setup of PHP 4.3.3 and MySQL 4.1.14. All standard install. I install PHPWCMS 1.2.5. Choose UTF-8 as character encoding. Easy.

I make a small sample site and put in some text. I use the FCKeditor 2. In order to prevent the editor from writing HTMLentities I have switched those off.

I put in Russian, Thai, Czech and a whole bunch of accented characters and even a ü! :D And (to my surprise) EVERYTHING WORKS. Great job.

So I transfer the whole lot to a staging server. But then problems arise: most characters (80-90%) are displayed correctly but some of them are not, and this happens in all of the different languages. It seems like part of the Unicode 'set' is not properly encoded.

What can that be?

The staging server setup is almost identical to local: PHP 4.3.3 and MySQL 4.1.10. The only difference is that the other server has the PHP mbstring extension switched off. That 'can' be a problem in phpMyAdmin (I'm using 2.6.4-pl1 on both) but PHPWCMS doesn't use this function, no?

As I investigate further I find some strange things:

- When you install PHPWCMS, the SQL tables are created BEFORE you choose character encoding for the site. And no collation prefs are set. Is that the best way? Is the default collation (latin1_swedish_ci) going to work well with UTF-8? I experimented with setting the collation to utf8_bin and utf8_unicode_ci but it made no difference.

- The 'foreign' characters are encoded in the MySQL database itself, where I was expecting to see the actual characters (much easier to work with). Is that correct? On my local setup they are displayed correctly in the HTML produced and inside PHPWCMS admin so it works. But it's not ideal.

Sorry for writing such a long story but this really does my head in. If I can get it to work locally I should be able to have it work on another site. And other than 'mbstring' which I don't think is the problem I don't see what the cause is!

Any help or ideas much appreciated. Thanks!

Posted: Wed 2. Nov 2005, 09:15
by Oliver Georgi
It seems you have dumped your DB. Unicode characters often run into problems when that is done. A working workaround: use phpMyAdmin to make the dump.

And check charset setting for MySQL too.

Oliver

Posted: Wed 2. Nov 2005, 09:41
by kipara
Thanks Oliver,

I use phpMyAdmin to work with MySQL. Nothing else! :-) Charset settings for MySQL are also OK.

UTF-8 is a strange beast indeed. The conversion to character pairs that I described is actually SUPPOSED to happen. So 'é' is transformed into 'Ã©' and so on. That's what you see in the DB.

I also checked the full HTTP source and the server actually sends these pairs out and then the browser converts them back into proper characters. Also correct.
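For anyone following along, the pair effect can be reproduced outside PHPWCMS. This is only a minimal sketch in Python, not anything PHPWCMS or MySQL runs: the function names are made up for illustration, and it assumes the tool displaying the data interprets the stored UTF-8 bytes as ISO-8859-1.

```python
# Illustration only: UTF-8 bytes viewed through a single-byte charset.
# Function names are hypothetical; ISO-8859-1 as the viewing charset
# is an assumption.

def as_latin1_view(text):
    """Encode text as UTF-8, then misread each byte as one Latin-1 character."""
    return text.encode("utf-8").decode("latin-1")

def round_trip(view):
    """Undo the misreading: recover the bytes and decode them as UTF-8."""
    return view.encode("latin-1").decode("utf-8")

for ch in ["é", "ü"]:
    view = as_latin1_view(ch)
    print(ch, "->", repr(view), "->", round_trip(view))
```

As long as every byte survives, the pair converts back to the original character, which is why the browser output still looks right.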

The problem is with CERTAIN double character strings. The second character of the pair does not display properly in the database when I look at it using phpMyAdmin.

You can see what I mean if you have a look at the UTF-8 column in the table at the bottom of this page:

http://czyborra.com/utf/

At, for example, the pair shown for Á ('Ã' plus a second character) you'll see that the second part of the double character code pair is not readable (on both OS X and Windows). That is exactly what happens both locally and on the server. It looks like a character is used that cannot be displayed and which is also not properly transferred to the DB.

As you can see this is only the case for about 5 characters or so. That corresponds to a test I've done. Have a look at (test):

http://www.netherlands-embassy.or.ke/phpwcms/

I typed in every special character for those three languages (using PHPWCMS on that server to do it) and only seven of them don't work! Czech characters are ALL fine except c-háček (č)!
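One possible explanation, not confirmed anywhere in this thread: Windows-1252 leaves five byte values unassigned (0x81, 0x8D, 0x8F, 0x90, 0x9D), and a character whose UTF-8 encoding contains one of them could get damaged if the data passes through that charset somewhere along the way. The Python sketch below only lists which characters contain such a byte. The helper name and the sample characters are chosen for illustration, and whether the server really converts through Windows-1252 is an assumption.

```python
# Byte values that Windows-1252 leaves unassigned.
CP1252_UNDEFINED = {0x81, 0x8D, 0x8F, 0x90, 0x9D}

def risky_bytes(ch):
    """Return the UTF-8 bytes of ch (as hex strings) that cp1252 cannot represent."""
    return [f"{b:02X}" for b in ch.encode("utf-8") if b in CP1252_UNDEFINED]

# Sample characters only; extend the string with whatever fails on your site.
for ch in "éüÁč":
    utf8_hex = ch.encode("utf-8").hex(" ").upper()
    print(ch, utf8_hex, risky_bytes(ch) or "ok")
```

If the characters that fail on the test page are exactly the ones flagged here, that would point at a charset conversion between PHP and MySQL rather than at PHPWCMS itself.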

I had a look at the forum and found two sites in the Czech language that use PHPWCMS. Unfortunately neither of them is using UTF-8 so they wouldn't have had this problem.

I have taken this up with my ISP as well, since it is not really a specific PHPWCMS problem (as I understand it now). But it would be great if anyone here can help. Thanks again.

Posted: Wed 2. Nov 2005, 09:57
by Oliver Georgi
I think you have to use UTF-16 then...