Myths Web Developers Believe About HTML

Watch a brief video overview of these myths: Four HTML Myths that Developers Believe.

The web platform has evolved a lot over the years. Due to its commitment to backwards-compatibility, it only grows. Mistakes are deprecated but rarely removed and never completely forgotten. Commonly accepted knowledge can be based on outdated information or simple misunderstandings.

Here are a few common misconceptions about HTML I have seen among web developers.

Myth: Adding a page description meta tag is a waste of time

Fact: A lot of software will pull your page description via this meta tag and show it to users.

I’m talking about this: <meta name="description" content="Page description here">.

Origins of the myth: SEO circles long ago noted that search engines stopped using page descriptions in their ranking algorithms due to abuse from SEO hackers.

Why it matters: Google may not change your ranking based on your <meta name="description"> tag, but it may certainly show that description in its search results pages. A good and accurate description will help users decide whether your page is relevant to them or not. Social media networks and chat or messaging apps often show page descriptions in their rich representations of URLs users send to each other. And lastly, as MDN notes, “Several browsers, like Firefox and Opera, use this as the default description of bookmarked pages.”
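To make the "software will pull your page description" point concrete, here is a minimal sketch of how a link-preview tool might read the tag. It uses Python's standard html.parser module; the names DescriptionFinder and page_description are made up for this illustration, and real crawlers and chat apps will have their own, more involved logic (Open Graph tags, fallbacks, and so on).

```python
from html.parser import HTMLParser


class DescriptionFinder(HTMLParser):
    """Remembers the content of the first <meta name="description"> tag seen."""

    def __init__(self):
        super().__init__()
        self.description = None

    def handle_starttag(self, tag, attrs):
        # Only <meta> tags matter, and only until a description has been found.
        if tag != "meta" or self.description is not None:
            return
        attributes = dict(attrs)
        # Attribute values keep their original case, so compare case-insensitively.
        if (attributes.get("name") or "").lower() == "description":
            self.description = attributes.get("content")


def page_description(html):
    """Return the page description from an HTML string, or None if absent."""
    finder = DescriptionFinder()
    finder.feed(html)
    finder.close()
    return finder.description


if __name__ == "__main__":
    page = '<title>Demo</title><meta name="description" content="Page description here">'
    print(page_description(page))
```

A page without the tag simply yields None, which is roughly the situation a preview card is in when it has nothing but the URL to show.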

Myth: HTML5 means we can invent our own element names and attribute names now

Fact: HTML5 specifies how browsers must interpret unknown element and attribute names, but it does not grant authors permission to use tag and attribute names that were not defined in the HTML5 specification. Furthermore, it is a bad idea to do so (although tag names that contain a hyphen should be fine).

I’m talking about this: <bandname>Radiohead</bandname> instead of <span class="bandname">Radiohead</span>.

Origins of the myth: A mix of truth (HTML5 specifying browser behavior for unknown tags), dumb choices in popular frameworks (Angular 1.x using made-up tag and attribute names for pretty much everything), and a never-realized extension to web standards (“web components” including custom elements and attributes, although they require a hyphen).

Google should shoulder a lot of the blame for this myth’s popularity due to their promotion of AngularJS. Angular made cornerstones of a few really dumb ideas; asking users to pollute the DOM with invented tag and attribute names was one of the dumbest. While Angular’s own “directives” were prefixed with ng- (ng-if, for example), their examples and documentation plainly showed and encouraged authors to define directives without hyphens.

Thanks to HTML5 specifying browser behavior in this case, many web developers believed everything was fine and there was nothing to worry about. However, a quick trip to an HTML validator will show that this isn’t true.

The reason is simple: HTML5 specified browser behavior when encountering unknown HTML elements and attributes because HTML is never finished, and out-of-date browsers should do something reasonable and predictable when encountering new features that were added to HTML5 later. If authors were handed blank checks and told to invent and use whatever tag names they wanted, this would be an “own-goal” for the future of HTML, since user-defined and future-spec-defined tags would collide and retroactively break old pages.

I use the example of <details> to show this clearly. I’m sure we can all imagine many things we might want to invent a <details> tag for if we are in the Angular-style habit of inventing tag names for every little part of our web apps’ user interfaces. And nothing would have stopped us from doing so a few years back; the tag was not defined and browsers would have treated it as an unknown element with no special behavior or styling.

However, <details> has in fact been added to HTML5 (long after the initial definition of HTML5), and it specifies changes in browser behavior and appearance. Surely this has caused a number of Angular pages to look stupid now that these tag names have collided. (This is why the “web components” draft specification required authors to namespace their custom elements with a hyphenated prefix. The implication is that no new official HTML elements would ever have a hyphen. Separate lanes, no collisions!)

Notably, HTML5 gives authors the following unrestricted methods to extend HTML: the id attribute, the class attribute, and data-* attributes.
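The hyphen rule for custom element names can be sketched as a small check. This is a simplified approximation and not the specification's full grammar: it only accepts ASCII characters (the real rule also allows many non-ASCII ones), and the function name is_custom_element_name is invented for this example. The reserved names listed are hyphenated names that already belong to SVG and MathML.

```python
import re

# Hyphenated names that already belong to SVG or MathML and so are off limits.
RESERVED_NAMES = {
    "annotation-xml",
    "color-profile",
    "font-face",
    "font-face-src",
    "font-face-uri",
    "font-face-format",
    "font-face-name",
    "missing-glyph",
}

# Simplified, ASCII-only shape: starts with a lowercase letter, contains at
# least one hyphen, and uses no uppercase letters.
NAME_SHAPE = re.compile(r"[a-z][a-z0-9._-]*-[a-z0-9._-]*")


def is_custom_element_name(name):
    """Rough check for whether a tag name stays in the 'custom' lane."""
    return NAME_SHAPE.fullmatch(name) is not None and name not in RESERVED_NAMES


if __name__ == "__main__":
    for candidate in ("band-name", "bandname", "details", "font-face"):
        print(candidate, is_custom_element_name(candidate))
```

Under this check, band-name passes while bandname and details do not, which mirrors the "separate lanes" idea: names without a hyphen are left for future versions of HTML.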

Why it matters: One of the best things about the Web is that ancient web pages still work just fine on modern browsers. It’s basically the most backward-compatible technology ever created. But, as illustrated above, misunderstandings of HTML5 can cause pages to break as new tags are added to HTML5. Just as tragically, such illegally invented tags and attributes can effectively “squat” the name, forcing the standards to pick a less ideal name to avoid breaking backwards compatibility. It’s similar to how the MooTools JavaScript library’s extension of native object prototypes has forced naming compromises in JavaScript itself, such as arrays getting includes() rather than contains().

Myth: Except for self-closing tags, all tags need to be closed

Fact: The following is a perfectly valid fragment of HTML:

<p> Hello!
<p> I’m Cowboy Dave, and I never
    close my paragraph tags!

Try it. Run it in a browser. Paste it into a valid HTML document and run that through a validator. See that green checkmark? It’s all good.

Origins of the myth: XHTML, mostly, and a generally poor understanding of HTML5 parsing. It’s true the above fragment would fail XHTML validation in a heartbeat, because XHTML was a reformulation of HTML as XML, and in XML, you darn well better close all your tags. But HTML is old, and the oldest web pages and the oldest web browsers did not see a need to close a paragraph just to start another one. This is still the case today, and it will be forever, as enshrined in the HTML5 parsing algorithm.

Why it matters: If you are writing the HTML, it probably doesn’t. It’s best for your own sanity (and your peers’) to close all your tags explicitly. But if you are accepting HTML from other sources, these kinds of assumptions may bite you. This brings us to our next myth…
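As a rough illustration of what "accepting HTML from other sources" involves, the sketch below collects paragraph text without assuming that every <p> is closed. It models exactly one rule, that a new <p> ends the previous one; the HTML5 parsing algorithm has many more implied-end-tag rules than this. Python's html.parser is only a tokenizer, not an HTML5 tree builder, and the names ParagraphCollector and paragraphs are made up for the example.

```python
from html.parser import HTMLParser


class ParagraphCollector(HTMLParser):
    """Collects paragraph text, treating a new <p> as closing the previous one."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []
        self._current = None  # text chunks of the open paragraph, if any

    def _close_paragraph(self):
        if self._current is not None:
            # Join the chunks and collapse runs of whitespace.
            self.paragraphs.append(" ".join("".join(self._current).split()))
            self._current = None

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._close_paragraph()  # implied </p>
            self._current = []

    def handle_endtag(self, tag):
        if tag == "p":
            self._close_paragraph()

    def handle_data(self, data):
        if self._current is not None:
            self._current.append(data)

    def close(self):
        super().close()          # flush any buffered text first
        self._close_paragraph()  # end of input also ends the paragraph


def paragraphs(html):
    collector = ParagraphCollector()
    collector.feed(html)
    collector.close()
    return collector.paragraphs


if __name__ == "__main__":
    fragment = "<p> Hello!\n<p> I'm Cowboy Dave, and I never\n    close my paragraph tags!\n"
    print(paragraphs(fragment))
```

Code that instead waits for a matching </p> before recording a paragraph would find nothing at all in Cowboy Dave's fragment, even though browsers see two paragraphs.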

Myth: Regex is a great way to make changes to HTML

Fact: The HTML5 parsing algorithm is really complicated and your regular expressions are incapable of forming a substitute.

Why it matters: Security! And other bugs. Look, it’s just really, really dumb. Someone is going to exploit the difference between how you think HTML works as you write your regexes and how HTML actually works, and they will exploit it to totally pwn you.

I’ll give a silly little example: Browser tools always show attribute values enclosed with double quotes, like <input value="McFly">. I’m sure that a lot of newer web developers assume this is the only way to provide an attribute value in HTML, as a result. But this is not remotely true. This is also 100% valid and equivalent: <input value=McFly>. So is this: <input value='McFly'>.

Now imagine that a developer uses this incorrect assumption about how attributes can be defined to try to strip out JavaScript from HTML with a little regex magic. This developer knows that JavaScript can sneak in not just via <script> tags but via things like the onmouseover and onclick HTML attributes, so the dev writes patterns to find and remove matches of /onmouseover="[^"]*"/. This security measure would be trivially bypassed by using single quotes instead of double quotes!

(I must say at this point that this is not the only problem with writing one’s own HTML sanitizer, be it with regexes or not. My point is general: The smallest deviation from the HTML spec will bite you! Don’t reinvent the wheel. Use popular HTML parsing and sanitization libraries with healthy test suites!)
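The bypass is easy to reproduce. The sketch below applies the double-quote-only pattern from the example to three equivalent spellings of the same attribute, then shows that a tokenizer (Python's standard html.parser) reports the attribute in all three. The names naive_strip, EventHandlerFinder, find_event_handlers, and the steal() payload are invented for this demonstration. To be clear, this detects attributes for illustration only; it is not a sanitizer and should not be used as one.

```python
import re
from html.parser import HTMLParser

# The pattern from the example: it only understands double-quoted values.
NAIVE_PATTERN = re.compile(r'onmouseover="[^"]*"')


def naive_strip(html):
    """Remove onmouseover attributes the way the regex author intended."""
    return NAIVE_PATTERN.sub("", html)


class EventHandlerFinder(HTMLParser):
    """Records every attribute whose name starts with 'on'."""

    def __init__(self):
        super().__init__()
        self.found = []

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases attribute names and strips the quotes.
        for name, value in attrs:
            if name.startswith("on"):
                self.found.append((tag, name, value))


def find_event_handlers(html):
    finder = EventHandlerFinder()
    finder.feed(html)
    finder.close()
    return finder.found


SAMPLES = [
    '<b onmouseover="steal()">hi</b>',  # double quotes: the regex catches it
    "<b onmouseover='steal()'>hi</b>",  # single quotes: the regex misses it
    "<b onmouseover=steal()>hi</b>",    # no quotes: the regex misses it
]

if __name__ == "__main__":
    for sample in SAMPLES:
        print(naive_strip(sample), find_event_handlers(sample))
```

Only the first sample comes out of naive_strip without its handler; the other two pass through untouched, while the tokenizer sees the same onmouseover attribute in every case.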

That’s all for now

I hope this was helpful. Share it with your friends & co-workers to save someone time and grief!

Thank you to Braidy Merkle for reading a draft of this article.


March 11th, 2019
Alan Hogan (@b01dface)