XML-encoding in PHP

judgej · November 2010

I have noticed in the unstable repository that there are many places where URLs are constructed, and they seem to be gaining &amp; entities across the board.

I would just like to say that this is likely to lead you into a whole lot of trouble further down the line, which will include double-encoding and having to decode and re-encode strings all over the place. The format of a URL involves separating GET parameters with ampersands (&s), not &amp; entities. Keep the XML-encoding at the level it belongs - close to where the URL is output in XML format, and not in the non-template code that puts these strings together.

I've seen this happen in other projects, and it stems from a misunderstanding of what a URL is, and where encoding needs to take place for a particular output format (which may *not* even be HTML).

Just a general thing anyway, on the table for you to think about.

cdavid · November 2010

Well, most likely I am the one to be blamed, since I think I pushed those changes. The idea is that VanillaForums is advertised as XHTML compatible (one of the very few forum systems), so we took it and tried it as an XHTML forum ... and it doesn't work by default. So, I decided to patch this.

On the other hand, you might be correct, but it seems that in Vanilla, there is not much distinction between a URL and its representation (the URL contains & , the XML representation of that URL contains & the HTML representation contains &).

Glad you brought this up,

/cd

judgej · November 2010

It was just the low level of some of the stuff that flagged itself up to me. In the end, url_encode($url) is all that is needed right at the end, when the URL is passed into the XHTML output stage. Doing it a bit at a time, as URLs are constructed, is making it easier for bits of the URL to be double-encoded.

Like I say, I've seen it before, and I've seen the bottomless pit of trouble that it can lead to. Getting all of Vanilla XHTML compliant though, that is great. However, do bear in mind XHTML is not the *only* output format.

Sorry - I'm sounding like a bossy developer, but I'm not, and I've got no real say in the end, but I hope my experience can help.

-- Jason

Edit: that "url_encode()" above is wrong - I meant to write xml_encode() (aka htmlspecialchars). The urlencode() or rawurlencode() is something that gets applied individually to each GET parameter name and GET parameter value.
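To illustrate the two layers described in this post, here is a minimal sketch. The base URL and parameter values are made up for the example and nothing here is taken from Vanilla's code; it only shows rawurlencode() applied per parameter and htmlspecialchars() applied once at output.

```php
<?php
// Illustrative values only; none of these names come from Vanilla.
$base   = 'http://example.com/search';
$params = ['q' => 'fish & chips', 'page' => '2'];

// 1. URL layer: encode each parameter name and value on its own,
//    then join the pairs with a plain ampersand.
$pairs = [];
foreach ($params as $name => $value) {
    $pairs[] = rawurlencode($name) . '=' . rawurlencode($value);
}
$url = $base . '?' . implode('&', $pairs);

// 2. Output layer: XML-encode the finished URL once, where it meets the markup.
$href = htmlspecialchars($url, ENT_QUOTES, 'UTF-8');
echo '<a href="' . $href . '">Search</a>', "\n";

// Encoding an already-encoded string is the double-encoding trap:
// every &amp; turns into &amp;amp;
$twice = htmlspecialchars($href, ENT_QUOTES, 'UTF-8');
echo $twice, "\n";
```

Here $url keeps a plain & between parameters, so it can be handed unchanged to a redirect header or any other non-XML consumer, while $href is the only form that belongs inside an attribute.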

cdavid · November 2010

Yes, but as far as I know, XHTML is a superset of HTML, so if you get Vanilla to serve pages as XHTML, they will work as HTML too. Therefore, you can serve pages however you want in your theme, as long as they are a subset of XHTML.

A bug related to this issue that I just found is that &nbsp; is not an XML entity, so, when outputting XHTML + "&nbsp;" you get many ugly errors in the browser. My suggestion would be to replace that with &#160; What do you think?

/cd

judgej · November 2010

&nbsp; is an XHTML entity though, isn't it? Could you give an example of where you are seeing the problems, because I'm not sure about what you mean. Or do you mean spaces in URLs?

The "other formats" I was referring to include RSS, Atom, and simply "data" that gets passed on to external applications and other modules.

-- Jason

cdavid · November 2010

I thought that too, but apparently it's not an XML entity, just an HTML entity, therefore it's not an XHTML entity. Reading about this here: http://techtrouts.com/webkit-entity-nbsp-not-defined-convert-html-entities-to-xml/ and in many other places ...

You are right. Maybe @Todd @Tim @Mark @Lincoln can get more ideas.

/cd

cdavid · November 2010

I see this as the best distinction between a concept (URL) and its representation (the URL in different formats, as XHTML, HTML etc.) http://tapciuc.ro/blog/wp-content/uploads/2009/06/magritte-ceci-nest-pas-un-pipe-_rene-magritte.jpg

/cd

judgej · November 2010

You say it is not a plain XML entity, and therefore not an XHTML entity. That is not right (I think!)- XHTML is a superset of basic XML and includes 252 entities that it supports. That includes the non-breaking space. A full list of supported XHTML entities can be found here:

http://www.elizabethcastro.com/html/extras/entities.html

Note that not all of these are valid in other XML feeds, such as RSS feeds for example. Numeric entities are valid in all flavours of XML though.

Let me do some more digging. That article sounds authoritative, but I'm not convinced it is right at all.

Edit: okay, it looks like something that XHTML5 has brought to the table. XHTML1 and HTML4/5 support the same set of named entities (252 of them) but XHTML5 does not support any. You can create them in XHTML5 on a page-by-page basis, but that causes other problems with older browsers so it is advised you don't.

So the general advice is to use UTF-8 characters where you can, and numeric entities otherwise.

This to me is a page output issue, and perhaps the whole page should be put through a filter when serving up strict XHTML5, just to convert any named entities that get through into numeric entities.
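A rough sketch of what such a filter might look like follows. The function names are hypothetical and this is not part of Vanilla; it assumes the page is UTF-8 and only knows the HTML 4 named entities. The five entities predefined by XML itself (amp, lt, gt, quot, apos) are left alone.

```php
<?php
// Hypothetical helper: code point of a single UTF-8 encoded character.
function utf8_codepoint(string $char): int
{
    $bytes = array_values(unpack('C*', $char));
    $count = count($bytes);
    if ($count === 1) {
        return $bytes[0];
    }
    // Strip the length-marker bits from the lead byte, then append
    // six payload bits from each continuation byte.
    $codepoint = $bytes[0] & (0xFF >> ($count + 1));
    for ($i = 1; $i < $count; $i++) {
        $codepoint = ($codepoint << 6) | ($bytes[$i] & 0x3F);
    }
    return $codepoint;
}

// Hypothetical output filter: rewrite named entities as numeric ones.
function named_to_numeric_entities(string $markup): string
{
    $xmlPredefined = ['amp', 'lt', 'gt', 'quot', 'apos'];

    $result = preg_replace_callback(
        '/&([A-Za-z][A-Za-z0-9]*);/',
        function (array $m) use ($xmlPredefined): string {
            if (in_array($m[1], $xmlPredefined, true)) {
                return $m[0]; // valid in any XML document as-is
            }
            $decoded = html_entity_decode($m[0], ENT_QUOTES | ENT_HTML401, 'UTF-8');
            if ($decoded === $m[0]) {
                return $m[0]; // unknown name: leave it untouched
            }
            return '&#' . utf8_codepoint($decoded) . ';';
        },
        $markup
    );

    return $result ?? $markup; // fall back to the input if the regex fails
}

echo named_to_numeric_entities('<td>&nbsp;</td> Tom &amp; Jerry &copy; 2010'), "\n";
// prints: <td>&#160;</td> Tom &amp; Jerry &#169; 2010
```

Being a plain regex pass, this would also rewrite entity-like text inside script blocks or CDATA sections, so a real filter would need to be more careful about where it is applied.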

Here's a great article that goes over some of the issues (and it does highlight that even if *you* don't think it is un pipe, some other browser probably will):

http://www.ibm.com/developerworks/web/library/x-think45/index.html?ca=dgr-twtrHTML5dth-WD

Learn something new every day :-)

Todd · November 2010

@judgej, can you give me a specific example of where you see the &amp; problem? I must say you do sound like you know what you are talking about and don't sound like a bossy programmer.

Linc · November 2010

Aside: There is no such thing as XHTML5, nor "strict" HTML5.

I'll come back to this over the holiday. Thanks for doing the legwork on checking into this stuff and please keep updating this discussion as you learn more.

//edit: Ditto what Todd said.

cdavid · November 2010

@Todd I assume this https://github.com/vanillaforums/Garden/pull/654

@judgej If such things about &nbsp; are true, then there must be something else going wrong because whenever I post a comment on my XHTML-enabled forum, Firebug starts crying that it is not a defined entity which brings up lots of trouble in jQuery.

If you want to test, try adding the following lines to the head of your theme/_theme_name_/views/default.master.php :


<?php
$mime="application/xhtml+xml";
$charset="utf-8";
header("Content-Type: $mime;charset=$charset");

echo '<?xml version="1.0" encoding="utf-8"?>'; ?>

I'm using Firefox 4 / 3.6 with Firebug (persist mode in console) and I am seeing this issue.

/cd

cdavid · November 2010

@Lincoln enjoy the break!

judgej · November 2010

@Todd There aren't many, which is why I am mentioning it early. The main example is the URL method in Gdn_Request. It looks like bits of URLs are being joined together using what was & and (in unstable) is now &amp;

If the Url() method deals entirely in non-XML encoded URLs, then it would be easy enough to provide another method to encode it, or a UrlEncoded() version that calls the first then XML encodes the URL string before returning it. The encoding only needs to be done once on the whole URL. Knowing that it works like that, anything that feeds into the URL is then able to disregard the output format, knowing that it is handled at the end.
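As a sketch of that idea: the class below is hypothetical and is not Vanilla's Gdn_Request; the constructor argument and method signatures are assumptions made for the example. Url() returns the raw URL, and UrlEncoded() calls it and XML-encodes the result exactly once.

```php
<?php
// Hypothetical class for illustration; this is not Vanilla's Gdn_Request.
class ExampleRequest
{
    private $base;

    public function __construct(string $base)
    {
        $this->base = rtrim($base, '/');
    }

    // Returns the raw URL: parameters joined with a plain "&", no XML encoding.
    public function Url(string $path, array $query = []): string
    {
        $url = $this->base . '/' . ltrim($path, '/');
        if ($query) {
            // The separator is passed explicitly so the arg_separator.output
            // ini setting cannot sneak "&amp;" into the raw URL.
            $url .= '?' . http_build_query($query, '', '&', PHP_QUERY_RFC3986);
        }
        return $url;
    }

    // Builds the raw URL first, then XML-encodes the whole string once.
    public function UrlEncoded(string $path, array $query = []): string
    {
        return htmlspecialchars($this->Url($path, $query), ENT_QUOTES, 'UTF-8');
    }
}

$request = new ExampleRequest('http://example.com/forum');
echo $request->Url('discussions', ['page' => 2, 'sort' => 'date']), "\n";
// prints: http://example.com/forum/discussions?page=2&sort=date
echo $request->UrlEncoded('discussions', ['page' => 2, 'sort' => 'date']), "\n";
// prints: http://example.com/forum/discussions?page=2&amp;sort=date
```

Code that emits redirects, JSON, or other non-XML data would call Url(), while templates producing XHTML, RSS, or Atom would call UrlEncoded().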

Sorry - seems minor really.

@cdavid I think your header is invoking the XHTML5 behaviour in FF. I must admit that I had never even heard of XHTML5 until today - those browsers seem to be way ahead of the published and well-known web standards (and I must be a bit behind). XHTML1 always did support the entities when delivered as an HTML MIME type (just so it works on all browsers), and that was generally the way XHTML1 was used. XHTML5, without a DTD, seems to be a lot more strict - it is pure XML with no DTD, and so no defined lists of named entities, and that, I suspect, is why you are having that error. XHTML1, even when delivered as XML, still had a bunch of DTDs that the browser could call up to determine what named entities it supported.

cdavid · November 2010

@judgej Yes, you are right. I am looking at xhtml1-strict.dtd and some of the lines say:

<!ENTITY % HTMLlat1 PUBLIC
   "-//W3C//ENTITIES Latin 1 for XHTML//EN"
   "xhtml-lat1.ent">
%HTMLlat1;

while in xhtml-lat1.ent (which is loaded) on one of the first lines there is:


<!ENTITY nbsp   "&#160;"> <!-- no-break space = non-breaking space,
                                  U+00A0 ISOnum -->

while my header in default.master.php reads as:

<?php
$mime="application/xhtml+xml";
$charset="utf-8";
header("Content-Type: $mime;charset=$charset");

echo '<?xml version="1.0" encoding="utf-8"?>'; ?>
<!DOCTYPE xhtml PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xmlns:m="http://www.w3.org/1998/Math/MathML" xmlns:svg="http://www.w3.org/2000/svg" xml:lang="en-ca">

The page info says:

Type: application/xhtml+xml
Render Mode:Standards compliance mode

And yes, XHTML is important for us since we are doing MathML with namespaces (as seen above).

/cd

cdavid · November 2010

In case anybody is interested, this commit fixes the problem of nbsp for me https://github.com/cdavid/Garden/commit/bd1021dd09a2b44c78960c9da937b2cba30746de

If anyone is interested in pulling this / thinks that there should be other things added, let me know.

/cd

judgej · November 2010

I'm wondering, after looking at those updates, whether the non-breaking space is even needed in most of those cases. Is it necessary to put a non-breaking space in every span and div that is used just for expanding into content? What about empty table cells - is it really necessary now to fill them with non-breaking spaces, considering that CSS can be used to ensure the cells are still displayed?

Edit: the empty cells issue seems to be an IE-only problem, as earlier versions do not support the "empty-cells" CSS property. There is a nice JS solution here that can be triggered only for IE:

http://stackoverflow.com/questions/57002/css-to-make-an-empty-cells-border-appear

cdavid · November 2010

The "other formats" I was referring to include RSS, Atom, and simply "data" that gets passed on to external applications and other modules.

RSS and Atom are both XML, so I am pretty sure that the &amp; problem is also present there... Regarding the data format, I am not sure that Vanilla is/should be optimized for this. My patch should work for v2.0.14, I haven't migrated to 2.0.15 yet.

< rant >
In any case, for me IE is dead and IE 6 users should receive a trojan or something instead of the required webpage to get them to upgrade.
</rant >

An official position from the core would be highly appreciated.

/cd

Linc · November 2010

This is definitely bookmarked and on my list for when I get a decent block of time to dig into it better.
