bbPress and Encoded URLs with Uppercase Hex
I recently discovered that the Google crawler was reporting a redirect error on certain pages in our YLF bbPress forum (bbPress 1.0.2 at time of writing). These URLs had one thing in common: they were percent-encoded to handle special characters. Here is an example:
http://youlookfab.com/welookfab/topic/your-favourite-80%E2%80%99s-music-bands
The “%E2%80%99” in this URL is the percent-encoded form of the curly apostrophe in “80’s” (its three UTF-8 bytes, starting with “%E2”). I investigated a little using Firebug and web-sniffer.net, and confirmed that this URL does indeed cause a 302 redirect. But why?
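To see where these escapes come from, here is a minimal sketch using PHP's built-in rawurlencode() and rawurldecode(). The variable names are only for illustration, and the apostrophe is written as its three UTF-8 bytes so the example does not depend on the file's encoding.

```php
<?php
// The right single quotation mark (U+2019) as its three UTF-8 bytes.
$apostrophe = "\xE2\x80\x99";

// rawurlencode() percent-encodes each byte and emits uppercase hex digits.
$encoded = rawurlencode( $apostrophe );
echo $encoded, "\n"; // %E2%80%99

// The lowercase spelling decodes to exactly the same bytes.
$same = rawurldecode( '%e2%80%99' ) === rawurldecode( $encoded );
echo $same ? "same character\n" : "different\n";
```

So the uppercase and lowercase spellings name the same character; they only differ as strings.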
bb_repermalink() is the problem
The answer is in a bbPress function that runs for most forum pages: bb_repermalink(). (You can find this function in “functions.bb-core.php” in the “/bb-includes” folder of the bbPress distribution.) I couldn’t find any documentation or discussion on this function, but it appears to check the permalink and redirect if it finds an error. It turns out that the “correct” permalink (i.e. the one based on the post slug in the database) encodes special characters with lowercase hex (i.e. “%e2” instead of “%E2”, using our example above). When bb_repermalink() compares the URL we typed into the address bar (with uppercase hex characters) to the “correct” one (with lowercase hex characters), it finds a discrepancy that it treats as an error in the URL, so it redirects to the “correct” URL.
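The mismatch can be illustrated outside bbPress. This is only a sketch of the kind of comparison described above, not the actual bb_repermalink() source; normalize_percent_case() is a hypothetical helper, and the two paths are shortened versions of the example URL. RFC 3986 treats the hex digits in a percent-escape as case-insensitive, which is why comparing after normalization gives a different answer than comparing the raw strings.

```php
<?php
// Hypothetical helper (not part of bbPress): uppercase the hex digits
// of every percent-escape, leaving the rest of the URL untouched.
function normalize_percent_case( $url ) {
    return preg_replace_callback(
        '/%[0-9a-fA-F]{2}/',
        function ( $m ) { return strtoupper( $m[0] ); },
        $url
    );
}

$requested = '/topic/your-favourite-80%E2%80%99s-music-bands'; // as requested
$stored    = '/topic/your-favourite-80%e2%80%99s-music-bands'; // as built from the slug

// A plain string comparison sees two different URLs...
$raw_match = ( $requested === $stored );

// ...but they are equal once the escapes are normalized to one case.
$normalized_match =
    ( normalize_percent_case( $requested ) === normalize_percent_case( $stored ) );

var_dump( $raw_match );        // bool(false)
var_dump( $normalized_match ); // bool(true)
```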
A plugin workaround
Fortunately there are some well-placed hooks in bb_repermalink(), so I was able to create a plugin that detects uppercase hex in the requested URL and adjusts the “correct” permalink to match. The code for this plugin is at the end of this post.
Why the Google crawl error?
One question still remains: why did this 302 redirect cause an error in the Google crawler? I can’t say for sure, but my theory is as follows…
- The crawler converts hex to uppercase before crawling the URL. So even though my sitemap specifies the URL with lowercase hex, Google’s crawler converts this to uppercase.
- Then when the crawler visits this URL, bbPress detects the uppercase discrepancy and issues a 302 redirect to the lowercase URL.
- Google’s crawler takes the new lowercase URL, AGAIN converts the hex to uppercase and then re-crawls the URL.
- This of course leads to an infinite loop as the crawler repeatedly converts to uppercase and bb_repermalink() repeatedly redirects back to the lowercase URL.
The crawler probably detects that it is getting the same URL back repeatedly, and interprets this as an error.
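The theory above can be played out in a small simulation. Everything here is an assumption made for illustration: crawler_normalize() and server_respond() are hypothetical stand-ins for the behaviour I am guessing at, not Google's or bbPress's real code, and the hop limit of 5 is arbitrary.

```php
<?php
// Hypothetical crawler step: uppercase the hex digits in every escape.
function crawler_normalize( $url ) {
    return preg_replace_callback( '/%[0-9a-fA-F]{2}/', function ( $m ) {
        return strtoupper( $m[0] );
    }, $url );
}

// Hypothetical forum: serve the page only on an exact string match,
// otherwise issue a 302 redirect to the stored (lowercase) permalink.
function server_respond( $url, $canonical ) {
    return ( $url === $canonical )
        ? array( 'status' => 200, 'location' => null )
        : array( 'status' => 302, 'location' => $canonical );
}

$canonical = '/topic/your-favourite-80%e2%80%99s-music-bands';
$url       = $canonical; // as listed in the sitemap
$max_hops  = 5;          // arbitrary cap on redirects followed
$hops      = 0;
$status    = null;

while ( $hops < $max_hops ) {
    $response = server_respond( crawler_normalize( $url ), $canonical );
    $status   = $response['status'];
    if ( 200 === $status ) {
        break;
    }
    $url = $response['location']; // follow the redirect and try again
    $hops++;
}

echo ( 200 === $status ) ? "fetched\n" : "gave up after $hops redirects\n";
```

Under these assumptions the request never succeeds: every hop is normalized back to uppercase and redirected again until the cap is reached.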
Again, I can’t be certain that this is what’s happening, but it’s a theory. If you know more than I do about this issue, I would love to hear about it in the comments.
Here is the code for the plugin that will force bbPress to accommodate uppercase hex in encoded URLs without issuing a 302 redirect.
/* Callback: uppercase one URL-encoded byte, e.g. "%e2" becomes "%E2". */
function _permalink_fix_upper( $matches )
{
    return strtoupper( $matches[0] );
}
function _permalink_fix( $permalink, $location )
{
    /* are there any URL encoded hex characters with uppercase in the request URI? */
    if ( preg_match( '#%(?:[0-9a-fA-F][A-F]|[A-F][0-9a-fA-F])#', $_SERVER['REQUEST_URI'] ) )
    {
        /* replace ALL URL encoded hex characters in the permalink with uppercase versions.
           preg_replace_callback() is used because the /e modifier was deprecated in
           PHP 5.5 and removed in PHP 7. */
        $permalink = preg_replace_callback(
            '#%[0-9a-fA-F]{2}#',
            '_permalink_fix_upper',
            $permalink );
    }
    return $permalink;
}
add_filter('bb_repermalink_result', '_permalink_fix', 10, 2);
Posted: December 26th, 2009 under Development.