Can I use non-ASCII characters in URLs?

Our servers let you use filenames that contain non-ASCII characters. For example, you could name a file with an accented letter “e”:

tést.html

If you do this, it may seem like you can then simply access this file using this URL:

http://www.example.com/tést.html

This is not reliable, though.

What's the problem?

Non-ASCII filenames are stored using “Unicode” (in the examples here, encoded as UTF-8 bytes). Unfortunately, Unicode sometimes offers multiple ways to write things that look exactly the same.

For example, the “é” character can be represented as either a normal letter “e” followed by a “combining acute accent” character (U+0065 U+0301), or as a single “Latin small letter e with acute” character (U+00E9). Although they look the same, they have different patterns of bytes and would therefore be treated as two different filenames on the server. You may not even know which you’re using.
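
The difference between the two forms can be seen in a minimal Python sketch that uses only the standard library. The variable names are illustrative and are not part of any server setup described here.

```python
import unicodedata

# Two ways to write the same visible character "é".
composed = "\u00e9"      # LATIN SMALL LETTER E WITH ACUTE (one code point)
decomposed = "e\u0301"   # "e" followed by COMBINING ACUTE ACCENT (two code points)

# They render the same but are different strings with different bytes.
print(composed == decomposed)        # False
print(composed.encode("utf-8"))      # b'\xc3\xa9'
print(decomposed.encode("utf-8"))    # b'e\xcc\x81'

# Unicode normalization converts between the two forms.
print(unicodedata.normalize("NFC", decomposed) == composed)   # True
print(unicodedata.normalize("NFD", composed) == decomposed)   # True
```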

When you request a filename like this in a Web browser, the browser converts it to a “URL encoded” version of one of the two possibilities above, which it would represent as one of:

te%CC%81st.html
t%C3%A9st.html

It then sends that encoded request to the Web server. The server decodes it back into the original pattern of bytes and looks for a matching filename on the disk. If the browser sent the same version that is stored on disk, the request works. If it sent the other version, the names don’t match and you’ll get a “404 not found” error.
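
The encode, decode, and compare steps described above can be sketched in Python. This is only an illustration: the `files_on_disk` set and the `server_lookup` function are hypothetical stand-ins for a real Web server, and the sketch assumes the file was stored in the single-character form.

```python
from urllib.parse import quote, unquote_to_bytes

# Hypothetical server: one file on disk, stored with the composed "é".
files_on_disk = {"t\u00e9st.html".encode("utf-8")}

def server_lookup(encoded_name: str) -> int:
    """Decode a percent-encoded name to bytes and return an HTTP-style status."""
    raw = unquote_to_bytes(encoded_name)
    return 200 if raw in files_on_disk else 404

# What a browser might send for each of the two look-alike names.
composed_url = quote("t\u00e9st.html")      # 't%C3%A9st.html'
decomposed_url = quote("te\u0301st.html")   # 'te%CC%81st.html'

print(composed_url, server_lookup(composed_url))       # t%C3%A9st.html 200
print(decomposed_url, server_lookup(decomposed_url))   # te%CC%81st.html 404
```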

This might seem easy to fix, because you may think you can simply pre-encode your URLs, trying these to see which works:

http://www.example.com/te%CC%81st.html
http://www.example.com/t%C3%A9st.html

... and then using the working version in your links. However, some browsers (notably Safari, but not Chrome or Firefox) will actually change the first already-encoded version to the second version before they send it to the server, so this isn’t completely reliable either.
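
The effect of that rewriting can be modeled in a few lines of Python. This is a sketch of the observable result only, not Safari's actual implementation; the `canonicalize` function name is made up for illustration.

```python
import unicodedata
from urllib.parse import quote, unquote

def canonicalize(encoded_name: str) -> str:
    """Decode, normalize to the composed (NFC) form, and re-encode."""
    decoded = unquote(encoded_name)                    # percent-decode as UTF-8
    composed = unicodedata.normalize("NFC", decoded)   # merge "e" + accent
    return quote(composed)

print(canonicalize("te%CC%81st.html"))   # t%C3%A9st.html
print(canonicalize("t%C3%A9st.html"))    # t%C3%A9st.html
```

Under this model, a link that was pre-encoded in the combining-accent form never reaches the server in that form, so a file stored that way cannot be found.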

And we haven’t even talked about the possibility that something might encode one of these characters using a non-Unicode character set like Windows-1252, which results in yet another byte pattern, “t%E9st.html”. This sometimes happens by accident when you transfer a file using FTP.
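
As a rough illustration, the same name can be percent-encoded under both character sets with Python's standard library. The `name` variable is just an example value.

```python
from urllib.parse import quote

name = "t\u00e9st.html"   # composed "é"

# The same character becomes different bytes under different character sets.
print(quote(name, encoding="utf-8"))    # t%C3%A9st.html
print(quote(name, encoding="cp1252"))   # t%E9st.html
```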

What’s the solution?

The simple way to avoid all of these problems is to stick to basic ASCII characters in filenames and URLs. If you use only letters, numbers, dots, hyphens, and underscores, you’ll never see this problem.

If you must use non-ASCII characters:

  • Avoid transferring them using FTP
  • Avoid the “combining accent” form of a filename, preferring the shorter version instead
  • Test your URLs in several browsers, including Safari
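
If you want to check names before uploading, something like the following Python sketch may help. The `check_filename` function and its three labels are invented for this example; it only reports whether a name is plain ASCII, already in the shorter composed form (Unicode calls this NFC), or neither.

```python
import unicodedata

def check_filename(name: str) -> str:
    """Classify a filename as 'ascii', 'composed', or 'needs-attention'."""
    if name.isascii():
        return "ascii"
    if unicodedata.normalize("NFC", name) == name:
        return "composed"
    # Contains a combining sequence that has a shorter composed equivalent.
    return "needs-attention"

for candidate in ["test.html", "t\u00e9st.html", "te\u0301st.html"]:
    print(repr(candidate), check_filename(candidate))
```

A name reported as "needs-attention" can be converted with `unicodedata.normalize("NFC", name)` before the file is created or renamed.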

How can I see more technical details?

Although this is not for the faint of heart, running this command from the shell will show you the correct URL-encoded name (and thus the byte pattern) of each file in a directory:

ls | perl -pe 's/([\x20\x2c\x5c\x80-\xff])/"%" . uc sprintf "%02x",ord($1)/eg;'

Again, though, this comes with the caveat that the encoded version you see won’t necessarily work in browsers like Safari that “canonicalize” the encoding before sending it to the server.
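
If Perl is not available, a similar listing can be sketched in Python. This is an approximation rather than an exact equivalent: `quote` percent-encodes every character outside letters, digits, and `_.-~/`, which is more than the command above encodes. The `encoded_names` function name is made up for this example.

```python
import os
from urllib.parse import quote

def encoded_names(raw_names):
    """Return (raw bytes, percent-encoded name) pairs, sorted by raw bytes."""
    return [(raw, quote(raw)) for raw in sorted(raw_names)]

if __name__ == "__main__":
    # Passing a bytes path makes os.listdir return the names as raw bytes,
    # so the output reflects the exact byte pattern stored on disk.
    for raw, encoded in encoded_names(os.listdir(b".")):
        print(encoded)
```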