How to Write Bad Tests Regarding HTML Escaping

Previously I have written about testing front-end code. Today I want to share a short tip specifically regarding testing code that needs to escape HTML. I will use JavaScript in my examples, but this also applies to other languages such as Ruby, Python, or Java.

Recently I was reviewing some JavaScript that looked something like this:

function emphasize(str) {
  return `<em>${escapeHTML(str)}</em>`;
}

That’s good code. To write safe HTML that isn't vulnerable to HTML injection (which enables XSS attacks and other exploits), we need to properly encode whatever str is as HTML. (Assume escapeHTML is defined or imported and is in scope. It is a shameful omission of JavaScript that no such native function exists. [1])

Now the code came with a unit test suite that included assertions like this:

expect(emphasize("I <3 NY")).to.equal("<em>I &lt;3 NY</em>");

Do you see anything wrong with that test? If not, you are far from alone. However, this is a fragile test.

Imagine that your test suite passed locally, but your CI (continuous integration) tool failed your build. The error looks something like this:

Failed: Expected "<em>I &#60;3 NY</em>" to equal "<em>I &lt;3 NY</em>"

Well, that’s interesting, isn’t it? Using &#60; instead of &lt; is one of several perfectly valid ways to encode the < symbol in HTML.

This scenario is not hypothetical, even if the exact code examples above are invented. I have seen it happen.

If we take a step back, we can see that we do not care whether < is encoded as &lt; or as a decimal or hexadecimal entity. Certainly we don't care about the case used in the hexadecimal entity (&#x3C; vs &#x3c;).

This confirms that our test is too fragile. We want to confirm that HTML encoding is being applied, but we do not want to assert that one flavor of entity must be used, because it doesn’t actually matter.
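To make the equivalence concrete, here is a minimal sketch that decodes several spellings of the same character. The helper decodeBasicEntities is hypothetical and written only for this illustration; it handles numeric references and a handful of named entities, and it is not a substitute for a real HTML decoder.

```javascript
// Hypothetical helper for illustration only: decodes numeric character
// references plus a few named entities. Not a complete HTML decoder.
const NAMED_ENTITIES = new Map([
  ["lt", "<"],
  ["gt", ">"],
  ["amp", "&"],
  ["quot", '"'],
]);

function decodeBasicEntities(html) {
  return html.replace(/&(#[xX][0-9a-fA-F]+|#[0-9]+|[a-zA-Z]+);/g, (match, body) => {
    if (body[0] === "#") {
      const isHex = body[1] === "x" || body[1] === "X";
      const codePoint = parseInt(body.slice(isHex ? 2 : 1), isHex ? 16 : 10);
      return String.fromCodePoint(codePoint);
    }
    // Named entities this sketch does not know are left untouched.
    return NAMED_ENTITIES.has(body) ? NAMED_ENTITIES.get(body) : match;
  });
}

// Four valid spellings of the same character.
const encodings = ["&lt;", "&#60;", "&#x3C;", "&#x3c;"];
console.log(encodings.every((e) => decodeBasicEntities(e) === "<")); // true
```

All four inputs decode to the same less-than character, which is why a test that pins one spelling is asserting more than we care about.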

With the above history in mind, when I saw the tests above, I suggested that they be rewritten to something like this:

// Note: Somewhat indirect testing to avoid asserting
// a preferred entity style, which would introduce
// fragility and flakiness to these tests.
// See:
expect(emphasize("I <3 NY")).to.not.contain("I <3 NY");
expect(decodeHTML(emphasize("I <3 NY"))).to.equal("I <3 NY");

Or, in English: Assert that this string does not come back unencoded (with plain less-than characters). Also assert that decoding the HTML successfully returns the original string. (Here we use decodeHTML [2] to complete the "round trip.")

(Depending on what decodeHTML does exactly, you may instead need to assert that our encoded HTML string yields an EM element with a text content of I <3 NY. Your needs and mileage may vary, but the general principle to avoid demanding a particular entity style will always apply.)

In this manner, we are able to assert what we care about, without writing extra code to deal with the numerous valid and equivalent HTML encodings of the same string, and without demanding that the universe always perform HTML escaping the same way.

Let me know if this was helpful, if there is an error here, or if you have other feedback.

  1. Please note that escapeHTML and decodeHTML as used in this article are deceptively simple functions that are easy to get very wrong. Luckily, they are also common functions available in many JavaScript libraries. They are short to write yourself, but you shouldn’t, due to the subtleties and pitfalls here. Choose carefully. ThinkingStiff has a good encodeHTML function here.

  2. Use a well-tested and popular library’s decodeHTML utility, or consider Wladimir Palant’s decodeHTML function here.

April 8th, 2019
Alan Hogan (@alanhogan_com)