Uniform character encoding

serge · February 2006

I noticed that the language file (which I'm currently translating into Spanish) is encoded in Latin 1, yet the software's output to the browser is in UTF-8. While I understand, respect and actually agree with the idea of using UTF-8 for output, I tend to prefer Latin 1 output simply because that way all of my strings can be written normally (that is, without stuff like &aacute; for "á") and I can rest assured they won't show up as gibberish. So my suggestion is to either change the program's output to Latin 1 or re-encode all the source files as UTF-8. What matters is that there's one uniform character encoding, so that people using languages with characters outside the English alphabet don't have to write an &ntilde;-type code for each and every special character. Thanks for the great work!

Mark · February 2006

I suck at character encoding - I really have no idea what I'm doing with it. It only got to where it's at through people like yourself making suggestions. If anyone else has any thoughts on the subject, please share. I'm obviously willing to change it again if the demand is there.

Minisweeper · February 2006

I assume it's not impossible to make this an application setting, Mark? That way people could change it depending on the language they were running anyway? Surely not *everything* falls into latin1? And not everyone prefers using UTF-8. Or is changing the output encoding more difficult than I'd imagine?

Mark · February 2006

mini: utf-8 is already a configuration option in the Configuration array - check out line 67 of appg/settings.php:

$Configuration['CHARSET'] = 'utf-8';

I think what serge was getting at was that the dictionary file itself was encoded as Latin 1 and should be encoded as utf-8. OR the file should remain encoded as it is and the default CHARSET setting should be Latin 1 as well. Just for consistency.

Am I correct, serge?

Minisweeper · February 2006

Well yeah, I realised he wanted some consistency, I just hadn't realised the charset was an option. Maybe the best way would be to set the charset in the language file, which could then be encoded in whichever encoding was most suitable for the language being displayed?

serge · February 2006

Mark, that's exactly what I was referring to. But it's not consistency for consistency's sake. The thing is, whatever the output encoding of the program, the strings that are fed to the program must be in that same encoding for them to display properly.

Now, minisweeper brings up an interesting issue when he asks "Surely not *everything* falls into latin1?" And surely, he's right. (I don't *know* this, but it makes sense.) In any case, for characters commonly used in Spanish, it doesn't matter whether you use Latin1 or UTF-8, as long as the strings being fed to the program are in the same encoding as the one declared in the HTML/XML headers. The way it's set up right now (v 0.9.2.6, I believe), the XML specifies UTF-8 character encoding, but the Language.php file gives the program strings encoded in Latin1. That works fine with regular, English-language characters (as Latin1 and UTF-8 seem to map these characters in the same way), but completely messes up characters like the umlaut (ü), the acute accent (é) and the tilde (ñ), among many, many others.

I suppose the important thing here is that there should be an administrative option to set the character encoding for the output, and then have the program automatically re-encode (translate) the strings stored in the Language.php file into the encoding specified by the admin. This can be done in real time (have the program translate the strings from their default encoding to the admin-specified encoding on every single page load, NOT recommended), or *quasi* hard-coded (have the program re-write the Language.php file in the new character encoding once a new encoding is selected, recommended). Thanks for the prompt replies!
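
A minimal sketch of the re-encoding idea, not the application's actual code. It assumes the mbstring extension and a PHP version with type declarations (7.0 or later); the function name reencode_dictionary and the sample array are hypothetical, and the non-ASCII bytes are written as escapes so the sketch does not depend on how this file itself is saved.

```php
<?php
// Hypothetical helper: re-encode every string in a dictionary array.
// $from and $to are charset names understood by mbstring.
function reencode_dictionary(array $strings, string $from, string $to): array
{
    $converted = [];
    foreach ($strings as $key => $text) {
        $converted[$key] = mb_convert_encoding($text, $to, $from);
    }
    return $converted;
}

// "Año" saved as Latin-1: the ñ is the single byte 0xF1.
$latin1 = ['Year' => "A\xF1o"];

$utf8 = reencode_dictionary($latin1, 'ISO-8859-1', 'UTF-8');

// In UTF-8 the ñ becomes the two bytes 0xC3 0xB1.
echo bin2hex($utf8['Year']), "\n"; // 41c3b16f
```

Running a conversion like this once and writing the result back to disk would correspond to the "quasi hard-coded" option; calling it on every page load would be the real-time option.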

nick1presta · February 2006

For the default, I would suggest using UTF-8. Also, it should be set directly in the HTTP headers from PHP, before any HTML output.
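
As a rough illustration of sending the charset in the headers, here is a sketch that reuses the CHARSET setting quoted earlier in the thread. The helper name content_type_header is hypothetical, and this is not a copy of how the application does it.

```php
<?php
// Mirrors the setting quoted earlier in the thread.
$Configuration['CHARSET'] = 'utf-8';

// Hypothetical helper: build the Content-Type header line for a charset.
function content_type_header(string $charset): string
{
    return 'Content-Type: text/html; charset=' . $charset;
}

// Headers can only be sent before any body output has started.
if (!headers_sent()) {
    header(content_type_header($Configuration['CHARSET']));
}
```

The header only tells the browser how to interpret the bytes it receives; it does not change the bytes themselves.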

lech · February 2006

I want to say that this could be easily set and stored in the settings.php file, not necessarily the language file. It would be useful in the long run if the end-user wants to switch up the output. We would just have to dig up all the available relevant header language prefixes and store them in a file or array to make them easily accessible, because I doubt many would know what to do there. I say go for it :)

Mark · February 2006

It already is sent in the headers (look at appg/headers.php).

So how do I format the files properly on Windows?

Bergamot · February 2006

Crazy other-people languages and their weird characters...

serge · February 2006

"Crazy other-people languages and their weird characters..." Hahahahaha... Mark, doesn't your text editing program (the one where you write all this software) have a preference for character encoding? As long as all the files in the program are encoded uniformly, I don't think you'll have a problem.

lech · February 2006

I don't think native file formatting would be necessary (or would it?). The headers should do it all: just having the ability to switch those headers (en-US -> eu-FR or whatever) on the fly from within the application settings would make it easier, without mucking about in any templates or the guts of the app. The only file formats that usually need translating are the nix/dos/sun/mac ones, and that's mostly relevant only to the server delivering the files. Most PHP servers don't care as long as they can parse the files one way or another, regardless of file format.

serge · February 2006

Well, I don't know how PHP works, but in my experience with Perl, no matter how much you tell Perl to use UTF-8 (for example), if the strings are saved in a file encoded in anything other than UTF-8, it's gonna misinterpret these special characters. Unless, of course, I tell Perl to translate the strings from one encoding to the other before working with them.
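
The same holds in PHP: declaring a charset does not change the bytes stored in a file. A small sketch, assuming the mbstring extension and PHP 7.0 or later; the helper name ensure_utf8 and its Latin-1 fallback are assumptions made for illustration.

```php
<?php
// Hypothetical helper: return $text as UTF-8, converting only when the bytes
// are not already valid UTF-8. Assumes the fallback source is Latin-1.
function ensure_utf8(string $text, string $fallback = 'ISO-8859-1'): string
{
    if (mb_check_encoding($text, 'UTF-8')) {
        return $text; // already valid UTF-8 (plain ASCII also passes)
    }
    return mb_convert_encoding($text, 'UTF-8', $fallback);
}

$latin1 = "Espa\xF1a";     // ñ as the single Latin-1 byte 0xF1
$utf8   = "Espa\xC3\xB1a"; // ñ as the UTF-8 byte pair 0xC3 0xB1

var_dump(mb_check_encoding($latin1, 'UTF-8')); // bool(false)
var_dump(ensure_utf8($latin1) === $utf8);      // bool(true)
var_dump(ensure_utf8($utf8) === $utf8);        // bool(true)
```

The validity check is only a heuristic: some Latin-1 byte sequences happen to be valid UTF-8 as well, so knowing the real encoding of each file is more reliable than guessing.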

Mark · February 2006

I use Komodo, so I should be able to do something. I'll take a look...

serge · February 2006

Oh, ok. For some reason I assumed you wrote the code from scratch in a text editor. Must be because that's the way I work with Perl. Actually, now that I think about it, using something like Komodo might come in handy for those pesky little typos that wreak havoc on my programs. (I'm an amateur, so these typos are quite frequent.) I see that Komodo has been ported to OS X, so I'll give it a look.

Bergamot · February 2006

Mark fuses directly to the computer core and communicates with it in pure binary code.
