XML-encoding in PHP
judgej
I have noticed in the unstable repository that there are many places where URLs are constructed, and they seem to be gaining &amp; entities across the board.
I would just like to say that this is likely to lead you into a whole lot of trouble further down the line, which will include double-encoding and having to decode and re-encode strings all over the place. The format of a URL involves separating GET parameters with ampersands (&s), not &amp; entities. Keep the XML-encoding at the level where it belongs - close to where the URL is output in XML format, and not in the non-template code that puts these strings together.
I've seen this happen in other projects, and it stems from a misunderstanding of what a URL is, and where encoding needs to take place for a particular output format (which may *not* even be HTML).
Just a general thing anyway, on the table for you to think about.
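To make the double-encoding point concrete, here is a minimal sketch. It is not Vanilla's code: the xml_encode() helper and the example URLs are hypothetical, and it only assumes PHP's built-in htmlspecialchars().

```php
<?php
// Hypothetical helper: the single place where XML/HTML encoding happens.
function xml_encode(string $text): string
{
    return htmlspecialchars($text, ENT_QUOTES | ENT_XML1, 'UTF-8');
}

// Problem case: the URL was already given an &amp; in non-template code,
// and the template then encodes it again on output.
$preEncoded    = '/search?q=php' . '&amp;' . 'page=2';
$doubleEncoded = xml_encode($preEncoded); // contains "&amp;amp;"

// Safer: keep the URL plain everywhere and encode once, at output time.
$plain     = '/search?q=php&page=2';
$attribute = xml_encode($plain);      // for an href="..." in XHTML
$redirect  = 'Location: ' . $plain;   // an HTTP header needs the plain form
```

The plain URL stays usable for redirects, feeds and other non-XML outputs, while the XHTML template gets exactly one level of encoding.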
Comments
On the other hand, you might be correct, but it seems that in Vanilla there is not much distinction between a URL and its representation (the URL contains &, the XML representation of that URL contains &amp;, and the HTML representation contains &amp;).
Glad you brought this up,
/cd
Like I say, I've seen it before, and I've seen the bottomless pit of trouble that it can lead to. Getting all of Vanilla XHTML compliant though, that is great. However, do bear in mind XHTML is not the *only* output format.
Sorry - I'm sounding like a bossy developer, but I'm not, and I've got no real say in the end, but I hope my experience can help.
-- Jason
Edit: that "url_encode()" above is wrong - I meant to write xml_encode() (aka htmlspecialchars). The urlencode() or rawurlencode() is something that gets applied individually to each get parameter name and get parameter value.
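As a rough illustration of the two levels of encoding (a sketch only - encode_query() is a hypothetical helper, not something from Vanilla): rawurlencode() is applied to each parameter name and value individually, the pairs are joined with a plain &, and htmlspecialchars() is applied once to the finished string when it goes into XML.

```php
<?php
// Hypothetical helper: URL-encode every name and value separately,
// then join the pairs with a plain ampersand.
function encode_query(array $params): string
{
    $pairs = [];
    foreach ($params as $name => $value) {
        $pairs[] = rawurlencode((string) $name) . '=' . rawurlencode((string) $value);
    }
    return implode('&', $pairs);
}

// Level 1: URL encoding, per parameter.
$query = encode_query(['q' => 'fish & chips', 'tag' => 'a=b']);

// Level 2: XML encoding, once, over the whole string at output time.
$forXml = htmlspecialchars($query, ENT_QUOTES | ENT_XML1, 'UTF-8');
```

Note that the ampersand inside the value becomes %26 at the URL level, so it can never be confused with the separator, whatever the XML encoding does later.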
A bug related to this issue that I just found is that &nbsp; is not an XML entity, so, when outputting XHTML + "
" you get many ugly errors in the browser. My suggestion would be to replace that with 
What do you think?
/cd
The "other formats" I was referring to include RSS, Atom, and simply "data" that gets passed on to external applications and other modules.
-- Jason
You are right. Maybe @Todd @Tim @Mark @Lincoln can get more ideas.
/cd
/cd
http://www.elizabethcastro.com/html/extras/entities.html
Note that not all of these are valid in other XML feeds, such as RSS feeds for example. Numeric entities are valid in all flavours of XML though.
Let me do some more digging. That article sounds authoritative, but I'm not convinced it is right at all.
Edit: okay, it looks like something that XHTML5 has brought to the table. XHTML1 and HTML4/5 support the same set of named entities (252 of them) but XHTML5 does not support any (apart from the five that XML itself predefines: &amp;, &lt;, &gt;, &quot; and &apos;). You can create them in XHTML5 on a page-by-page basis, but that causes other problems with older browsers so it is advised you don't.
So the general advice is to use UTF-8 characters where you can, and numeric entities otherwise.
This to me is a page output issue, and perhaps the whole page should be put through a filter when serving up strict XHTML5, just to convert any named entities that get through into numeric entities.
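A filter like that could look roughly like the sketch below. It is only an assumption of how it might be done: both function names are hypothetical, it relies on html_entity_decode() knowing the HTML 4.01 entity names, and the naive regular expression would also touch entity-like text inside CDATA sections or scripts, so treat it as a starting point.

```php
<?php
// Hypothetical helper: code point of a single UTF-8 encoded character.
function utf8_code_point(string $char): int
{
    $bytes = array_values(unpack('C*', $char));
    $lead  = $bytes[0];
    if ($lead < 0x80) {
        return $lead;              // 1-byte (ASCII)
    } elseif ($lead < 0xE0) {
        $cp = $lead & 0x1F;        // 2-byte sequence
    } elseif ($lead < 0xF0) {
        $cp = $lead & 0x0F;        // 3-byte sequence
    } else {
        $cp = $lead & 0x07;        // 4-byte sequence
    }
    for ($i = 1; $i < count($bytes); $i++) {
        $cp = ($cp << 6) | ($bytes[$i] & 0x3F);
    }
    return $cp;
}

// Hypothetical output filter: rewrite named entities as numeric ones.
function named_to_numeric_entities(string $markup): string
{
    $xmlPredefined = ['amp', 'lt', 'gt', 'quot', 'apos'];
    return preg_replace_callback(
        '/&([A-Za-z][A-Za-z0-9]*);/',
        function (array $m) use ($xmlPredefined): string {
            if (in_array($m[1], $xmlPredefined, true)) {
                return $m[0];      // already valid in any XML
            }
            $char = html_entity_decode($m[0], ENT_QUOTES | ENT_HTML401, 'UTF-8');
            if ($char === $m[0]) {
                return $m[0];      // unknown name: leave it alone
            }
            return '&#' . utf8_code_point($char) . ';';
        },
        $markup
    );
}

$filtered = named_to_numeric_entities('a&nbsp;b &amp; &copy; &bogus;');
```

It could be hooked in with ob_start('named_to_numeric_entities') when the page is served as strict XHTML, though whether that suits Vanilla's rendering pipeline is not something I have checked.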
Here's a great article that goes over some of the issues (and it does highlight that even if *you* don't think it is un pipe, some other browser probably will):
http://www.ibm.com/developerworks/web/library/x-think45/index.html?ca=dgr-twtrHTML5dth-WD
Learn something new every day :-)
I'll come back to this over the holiday. Thanks for doing the legwork on checking into this stuff and please keep updating this discussion as you learn more. //edit: Ditto what Todd said.
@judgej If such things about &nbsp; are true, then there must be something else going wrong, because whenever I post a comment on my XHTML-enabled forum, Firebug starts crying that it is not a defined entity, which brings up lots of trouble in jQuery. If you want to test, try adding the following lines to the head of your theme/_theme_name_/views/default.master.php:
<?php
$mime = "application/xhtml+xml";
$charset = "utf-8";
header("Content-Type: $mime;charset=$charset");
echo '<?xml version="1.0" encoding="utf-8"?>';
?>
I'm using Firefox 4 / 3.6 with Firebug (persist mode in console) and I am seeing this issue.
/cd
If the Url() method deals entirely in non-XML encoded URLs, then it would be easy enough to provide another method to encode it, or a UrlEncoded() version that calls the first then XML encodes the URL string before returning it. The encoding only needs to be done once on the whole URL. Knowing that it works like that, anything that feeds into the URL is then able to disregard the output format, knowing that it is handled at the end.
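Something along these lines, as a sketch: sketch_url() and sketch_url_encoded() are hypothetical stand-ins for Url() and the proposed UrlEncoded(), not the real Vanilla functions, and the paths and parameters are made up.

```php
<?php
// Hypothetical stand-in for a Url()-style helper: returns a plain URL,
// never XML-encoded.
function sketch_url(string $path, array $params = []): string
{
    if ($params === []) {
        return $path;
    }
    // http_build_query() URL-encodes each name and value; '&' is passed
    // explicitly so the arg_separator.output ini setting cannot change it.
    return $path . '?' . http_build_query($params, '', '&', PHP_QUERY_RFC3986);
}

// Hypothetical UrlEncoded()-style wrapper: calls the plain version, then
// XML-encodes the whole string exactly once.
function sketch_url_encoded(string $path, array $params = []): string
{
    return htmlspecialchars(sketch_url($path, $params), ENT_QUOTES | ENT_XML1, 'UTF-8');
}

$plain  = sketch_url('/discussion/42', ['page' => 3, 'q' => 'a&b']);
$forXml = sketch_url_encoded('/discussion/42', ['page' => 3, 'q' => 'a&b']);
```

Templates that emit XHTML, RSS or Atom would call the encoded version; redirects and anything handing the URL to another module would call the plain one.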
Sorry - seems minor really.
@cdavid I think your header is invoking the XHTML5 behaviour in FF. I must admit that I had never even heard of XHTML5 until today - those browsers seem to be way ahead of the published and well-known web standards (and I must be a bit behind).
XHTML1 always did support the entities when delivered with an HTML MIME type (just so it works on all browsers), and that was generally the way XHTML1 was used. XHTML5 seems to be a lot more strict - it is pure XML with no DTD, and so has no defined list of named entities, and that, I suspect, is why you are having that error. XHTML1, even when delivered as XML, still had a bunch of DTDs that the browser could call up to determine what named entities it supported.
<!ENTITY % HTMLlat1 PUBLIC "-//W3C//ENTITIES Latin 1 for XHTML//EN" "xhtml-lat1.ent">
%HTMLlat1;
while in xhtml-lat1.ent (which is loaded) on one of the first lines there is:
<!ENTITY nbsp "&#160;"> <!-- no-break space = non-breaking space, U+00A0 ISOnum -->
while my header in default.master.php reads as:
<?php
$mime = "application/xhtml+xml";
$charset = "utf-8";
header("Content-Type: $mime;charset=$charset");
echo '<?xml version="1.0" encoding="utf-8"?>';
?>
<!DOCTYPE xhtml PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xmlns:m="http://www.w3.org/1998/Math/MathML" xmlns:svg="http://www.w3.org/2000/svg" xml:lang="en-ca">
The page info says:
Type: application/xhtml+xml
Render Mode: Standards compliance mode
And yes, XHTML is important for us since we are doing MathML with namespaces (as seen above).
/cd
If anyone is interested in pulling this / thinks that there should be other things added, let me know.
/cd
Edit: the empty cells seem to be an IE-only problem, as earlier versions do not support the "empty-cells" CSS property. There is a nice JS solution here that can be triggered only for IE:
http://stackoverflow.com/questions/57002/css-to-make-an-empty-cells-border-appear
< rant >
In any case, for me IE is dead and IE 6 users should receive a trojan or something instead of the required webpage to get them to upgrade.
</rant >
An official position from the core would be highly appreciated.
/cd