When to use which RegExp function in JavaScript

Although the MDN pages do a good job of explaining exactly what the different RegExp functions do and how they differ, they can be a little confusing if you know what you want to do, but not which function to call.

So here is a breakdown, grouped by what you want to do.
As is so often the case, I made this list mostly for myself, but I think other people may benefit from it too.

You simply want to know if a string contains a certain pattern

RegExp.test(String) returns true if the pattern can be found, false otherwise.

let str = 'The quick brown fox jumps over a lazy dog';
let result = /\w+o\w+/.test(str);
// result will be true

You want to know where in the string a pattern occurs

String.search(RegExp) returns the index, or -1 if not found.

let str = 'The quick brown fox jumps over a lazy dog';
let result = str.search(/\w+o\w+/);
// result will be 10

If there are multiple matches, it will return the index of the first one.

Retrieve the substring matched by the pattern

String.match(RegExp) and RegExp.exec(String) each return an array, the first element of which is the first match.
They return null if not found.

let str = 'The quick brown fox jumps over a lazy dog';
let result = str.match(/\w+o\w+/);
if (result) result = result[0];
// result will be 'brown'
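A minimal sketch of the null check wrapped in a function. The name firstMatch is made up for illustration; it is not a built-in.

```javascript
// Hypothetical helper: returns the first matched substring, or null if there is none.
function firstMatch(str, pattern) {
  const result = str.match(pattern); // null when the pattern does not occur
  return result ? result[0] : null;
}

const str = 'The quick brown fox jumps over a lazy dog';
console.log(firstMatch(str, /\w+o\w+/)); // 'brown'
console.log(firstMatch(str, /\d+/));     // null
console.log(/\w+o\w+/.exec(str)[0]);     // 'brown', exec gives the same first element
```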

Count how many times the pattern occurs in the string

String.match(RegExp) on a regex with the g flag returns an array of matches (or null if not found).

So just take the length (or use 0 if the result is null). In this particular example, match returns ['brown', 'fox', 'dog'], so there are three matches.

let str = 'The quick brown fox jumps over a lazy dog';
let result = str.match(/\w+o\w+/g);
result = result ? result.length : 0;
// result will be 3
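The same idea as a reusable function, as a sketch. countMatches is a hypothetical name, and the function assumes the pattern it receives carries the g flag.

```javascript
// Hypothetical helper: number of times `pattern` occurs in `str`.
// Assumes `pattern` has the g flag; without it, match stops at the first match.
function countMatches(str, pattern) {
  const matches = str.match(pattern);
  return matches ? matches.length : 0;
}

const str = 'The quick brown fox jumps over a lazy dog';
console.log(countMatches(str, /\w+o\w+/g)); // 3
console.log(countMatches(str, /\d+/g));     // 0
```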

Retrieve a list of all substrings matched by the pattern

String.match(RegExp) on a regex with the g flag returns an array of matches (or null if not found).

If there are multiple matches in the string for the pattern, the returned array will contain all of them. In this particular case, the result will be ['brown', 'fox', 'dog'].

let str = 'The quick brown fox jumps over a lazy dog';
let result = str.match(/\w+o\w+/g);
// result will be ['brown', 'fox', 'dog']

Retrieve the match and its capturing groups

String.match(RegExp) without the g flag and RegExp.exec(String) each return an array, the first element of which is the first match and the following elements are the matches for the capturing groups (or the result is null if not found).

let str = 'The quick brown fox jumps over a lazy dog';
let result = str.match(/(\w+)o(\w+)/);
// result will be ['brown', 'br', 'wn'];
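As a small sketch of how to read that array: destructuring pulls the match and the groups apart, and the array also carries an index property with the position of the match. The variable names here are only for illustration.

```javascript
const str = 'The quick brown fox jumps over a lazy dog';

const viaMatch = str.match(/(\w+)o(\w+)/);
const viaExec = /(\w+)o(\w+)/.exec(str);

// First element is the whole match, the rest are the capturing groups.
const [whole, before, after] = viaMatch;
console.log(whole, before, after); // brown br wn

// Both results also record where the match was found.
console.log(viaMatch.index); // 10
console.log(viaExec.index);  // 10
```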

Retrieve all matches, the indexes at which they are found in the string and all their capturing groups

RegExp.exec(String) with the g flag returns an array with the info you want for the first match (or null if not found).
To get to the rest of the matches, you have to call the exec function repeatedly with the same RegExp variable, until it returns null. So this is a tad more work, but not a lot.

let str = 'The quick brown fox jumps over a lazy dog';
let rex = /(\w+)o(\w+)/g;
let allRes = [];
let result;
while ((result = rex.exec(str)) !== null)
  allRes.push('n='+result.shift()+' i='+result.index+' g='+result.join('/'));
// allRes will be ['n=brown i=10 g=br/wn', 'n=fox i=16 g=f/x', 'n=dog i=38 g=d/g']

Note that this requires a RegExp variable, because it needs to remember the location at which it found its last result, which it starts off from on the next go through the loop. A regex literal, like result = /(\w+)o(\w+)/g.exec(str), won’t do; this would reinitialise the regex each time, so it would always return the first match and the loop would never end.
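The remembered location lives in the regex's lastIndex property. The sketch below records it after every exec call. As an aside, it also shows String.matchAll (ES2020), which is not covered above; it works on a copy of the regex and yields the same per-match information, assuming your target browsers support it.

```javascript
const str = 'The quick brown fox jumps over a lazy dog';
const rex = /(\w+)o(\w+)/g;

// For each match: where it starts, and where the next search will resume.
const positions = [];
let result;
while ((result = rex.exec(str)) !== null) {
  positions.push([result.index, rex.lastIndex]);
}
console.log(positions);     // [[10, 15], [16, 19], [38, 41]]
console.log(rex.lastIndex); // 0: exec resets it when it returns null

// The same information through matchAll, which requires the g flag.
const viaMatchAll = [...str.matchAll(/(\w+)o(\w+)/g)]
  .map(m => 'n=' + m[0] + ' i=' + m.index + ' g=' + m.slice(1).join('/'));
console.log(viaMatchAll);
// ['n=brown i=10 g=br/wn', 'n=fox i=16 g=f/x', 'n=dog i=38 g=d/g']
```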

Or, alternatively…

If you don’t want to remember all these different function calls, know that there is one function which has all these features built in: exec! That’s all you need to remember. Make sure to use the g flag.

let str = 'The quick brown fox jumps over a lazy dog';
let rex = /(\w+)o(\w+)/g;
let result = rex.exec(str);
// To test if the pattern occurs, return true here if the result is not null or false otherwise
// For the location of the pattern, return result.index if the result is not null, or else -1
// For the (first) matching substring, return result[0] if the result is not null
// Other results need some more code, like above
let allRes = [];
while (result!=null) {allRes.push(result); result = rex.exec(str);}
// Now to retrieve the number of matches, return allRes.length
// For the matches themselves, return allRes.map(el => el[0])
// etc. You get the idea.
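The comments above can be turned into working code along these lines. This is a sketch, not the only way to do it: execAll is a hypothetical helper, and copying the regex (so the caller's lastIndex is left alone) is a design choice of this example.

```javascript
// Hypothetical helper: collect every exec result for a pattern.
function execAll(str, pattern) {
  // Work on a copy with the g flag added if it is missing.
  const flags = pattern.flags.includes('g') ? pattern.flags : pattern.flags + 'g';
  const rex = new RegExp(pattern.source, flags);
  const all = [];
  let result;
  while ((result = rex.exec(str)) !== null) {
    all.push(result);
    // Step past empty matches, which would otherwise repeat forever.
    if (result[0] === '') rex.lastIndex++;
  }
  return all;
}

const str = 'The quick brown fox jumps over a lazy dog';
const all = execAll(str, /(\w+)o(\w+)/);

const found = all.length > 0;                  // what test would tell you
const firstIndex = found ? all[0].index : -1;  // what search would tell you
const firstMatch = found ? all[0][0] : null;   // the first matching substring
const count = all.length;                      // number of matches
const matches = all.map(el => el[0]);          // all matching substrings

console.log(found, firstIndex, firstMatch, count, matches);
// true 10 'brown' 3 ['brown', 'fox', 'dog']
```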

That’s about it.
I want to close with a heads-up: this mechanism (fetching the next result if you call the function repeatedly while using the g flag) is also used by the test function. So if you use that in a loop for unrelated reasons, you may get unexpected results:

let str = 'The quick brown fox jumps over a lazy dog';
let rex = /(\w+)o(\w+)/g;
for (let i = 1; i<=10; ++i) {
  // do stuff
  if (rex.test(str)) {
    // do other stuff, but only if str contains rex
  }
  // do more stuff
}

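This will behave the way you want the first three times through the loop, but on the fourth pass test has run out of matches: it returns false and starts over from the beginning of the string. So the check fails on every fourth pass, even though str has not changed!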
Solution: don’t use the g flag, or call rex.test(str) once and put it in a variable to use later.
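To see the effect for yourself, here is a small sketch that records what test returns on ten consecutive calls, once with the g flag and once without. The variable names are only for illustration.

```javascript
const str = 'The quick brown fox jumps over a lazy dog';

// With the g flag, test resumes from lastIndex on every call.
const withG = /(\w+)o(\w+)/g;
const outcomes = [];
for (let i = 1; i <= 10; ++i) outcomes.push(withG.test(str));
console.log(outcomes);
// [true, true, true, false, true, true, true, false, true, true]

// Without the g flag, every call starts at the beginning of the string.
const withoutG = /(\w+)o(\w+)/;
const stable = [];
for (let i = 1; i <= 10; ++i) stable.push(withoutG.test(str));
console.log(stable.every(Boolean)); // true
```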


HTML in WordPress

This was supposed to be a blog about how browsers handle ruby annotation, with live examples of different HTML snippets demonstrating how your browser renders them.

Unfortunately, it turns out that the WordPress editor can’t handle esoteric markup like ruby very well; it removes many of the elements, leaving the examples crippled. Shame.

So does anybody know how to insert raw HTML in a WordPress post? I mean, without it being changed? Let me know in the comments!

In the meantime, if you want to know about the situation with ruby and how your browser handles it, you can read the blog post here: http://strictquirks.nl/standards/the-situation-with-ruby-2020.xhtml

Quotes and font names

One popular misunderstanding about specifying fonts in CSS is that you should put quotes around the name of a font when it contains spaces, and you shouldn’t when it doesn’t. And you end up with this:

font-family: "Hoefler Text", Utopia, "Liberation Serif", Times, "Times New Roman", serif;

This usage, however, is flawed.

Not only is this an oversimplification of the rules, but there are many instances in which the opposite is true! So, some clarification is in order.

First, the official rules.

You MUST use quotes if

  1. A font name starts or ends with a space, or it has runs of two or more adjacent spaces in its name, or it contains other whitespace characters such as tabs.
  2. The name contains characters that can confuse the parser into thinking the property ends there, like a comma, a semicolon etc.
  3. The name happens to be a keyword that would otherwise be handled differently, such as inherit.

And you MUST NOT use quotes if

  1. You are specifying one of the generic font-families serif, sans-serif, monospace, cursive or fantasy
  2. You don’t mean a font name, but rather an action such as inherit.

In all other cases, quotes are optional.

So first of all, an important conclusion to draw is that font names like Times New Roman are perfectly safe to use without quotes. No weird characters in the name, no two spaces in a row etc. You can test this for yourself; all browsers handle this name the same whether you include quotes or not; the W3C CSS Validator finds no errors etc.

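So spaces are not a problem, really. You know what’s a problem? Numbers! An unquoted font name is read as a series of identifiers, and an identifier can’t start with a digit. So if you have a font named Courier 10 Pitch, you’ll have to put quotes around it or it won’t show.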

Then there’s the other thing: using one-word font names without quotes. The point is, how will you know that your font name is not a keyword? Like I said above, if you have a font named Inherit, you must put quotes around it or the parser will think you meant to inherit the font from the parent element.

And there are more such keywords. Font names like Default, Caption, Icon, Menu, Message-Box, Status-Bar and so on are all problematic.

And not only are there many such keywords, there are also new keywords invented regularly. For example, the words initial, unset and revert are relatively new. So if you had a font named Initial or UnSet, you could use its name without quotes until a few years ago, but you no longer can!

What can we learn from all of this? Simple: play it safe. Put all font names in quotes. The quotes won’t do any harm, even where they’re not needed, but leaving them out to save two bytes in your source will come back and bite you in the rear later!
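If you build your font-family values in a script, the "quote everything" rule is easy to automate. Below is a sketch in JavaScript; fontFamily is a hypothetical helper, and it assumes you pass the generic fallback separately so that it stays unquoted.

```javascript
// Hypothetical helper: quote every font name, leave the generic family bare.
function fontFamily(fontNames, genericFallback) {
  const quoted = fontNames.map(
    // Escape embedded double quotes and backslashes before wrapping the name.
    name => '"' + name.replace(/["\\]/g, '\\$&') + '"'
  );
  return [...quoted, genericFallback].join(', ');
}

console.log(fontFamily(['Hoefler Text', 'Courier 10 Pitch', 'Inherit'], 'serif'));
// "Hoefler Text", "Courier 10 Pitch", "Inherit", serif
```

Because the names are always quoted, a font that happens to be called Inherit or Initial is treated as a font name rather than as a keyword.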

What does * {margin:0; padding:0} solve?

I’ve always disliked reset stylesheets. They are often overkill: they make the webpage look like plain text, they slow down the loading of the page and overall they simply reset too much.

Now while many other professional developers agree with the above sentiment, there seems to be a consensus that you need some kind of minimum reset stylesheet, that contains at least

*{margin:0; padding:0}

and I can’t help but wonder why. What does this solve? What are the cross-browser compatibility issues that need this for a solution?

This makes paragraphs lose their margins, unless you add those back in. And it isn’t necessary in the first place, because all browsers use the same margin for paragraphs, without exception. You don’t iron out any differences using this. Ditto for headers. And it ruins lists, squishes tables, wrecks forms and inputs, etc. Unless you add the margins and the paddings back in, effectively undoing the damage you did.

So why use it in the first place? What is it good for?

html {font-size:62.5%} is a mistake

If you find yourself using html{font-size:62.5%} in your stylesheet, ask yourself why you are doing it.

You may argue that font sizes of 10px are easier to calculate with than font sizes of 16px. But you’d forget a few things.
Firstly, 62.5% of the user’s preferred font size is not 10px. Well, it might be, if the user’s preferred size is 16px, but then again, it might not be!
If you want the root font size to be 10px, then why don’t you make it 10px? Why don’t you write html{font-size:10px}? Tell me that.
You may say that if you use a percentage, you’re still respecting the user’s preferences: no matter what default font size they have, 1.6rem will still be their original size. But that isn’t always true; it depends on many different factors.
Let’s say you have a line of text like <p style="font-size: 1.6rem"> This is the user's preferred size! </p> and the user has set a preferred font size of 15px. Then this line can come out at the following sizes:

  • 15px if you’re lucky
  • 14px if the browser rounds all sizes to whole pixels, so 62.5% of 15 becomes 9 and 1.6 times 9 becomes 14.
  • 9px if the browser doesn’t support rem. The p’s style attribute will be ignored.
  • 12px if the browser doesn’t support rem and the minimum font size is 12px.
  • 19px if the minimum size is 12px and the browser has corrected the root font size to this minimum size (that is, 1rem is now 12px).

(The last example might sound contrived, but let me tell you that one of the major browsers does indeed treat its sizes that way. Test thoroughly!)
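The arithmetic behind that list can be sketched as follows. This is a simplified model, not how any particular browser works: renderedPx and its option names are made up for this example, and the final result is rounded to whole pixels for display.

```javascript
// Simplified model of <p style="font-size: 1.6rem"> under html {font-size: 62.5%}.
// All option names are hypothetical.
function renderedPx(preferredPx, options = {}) {
  const {
    roundRoot = false,   // browser rounds the root size to whole pixels
    supportsRem = true,  // without rem, the p inherits the root size
    minimumPx = 0,       // the user's minimum font size
    clampRoot = false    // browser raises the root size itself to the minimum
  } = options;

  let root = preferredPx * 0.625;
  if (roundRoot) root = Math.round(root);
  if (clampRoot) root = Math.max(root, minimumPx);

  const computed = supportsRem ? 1.6 * root : root;
  // Assumption: the minimum applies to the final size, which is shown in whole pixels.
  return Math.round(Math.max(computed, minimumPx));
}

console.log(renderedPx(15));                                      // 15
console.log(renderedPx(15, { roundRoot: true }));                 // 14
console.log(renderedPx(15, { supportsRem: false }));              // 9
console.log(renderedPx(15, { supportsRem: false, minimumPx: 12 })); // 12
console.log(renderedPx(15, { minimumPx: 12, clampRoot: true }));  // 19
```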

And what about zooming in and out? If a user has their minimum font size at half the default font size, they can zoom out to 50% before those sizes start interfering. With your setup, they can only zoom out to 80% before running into issues. So test that too!
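Where the 50% and 80% come from, as a quick calculation. It assumes a default size of 16px, a minimum of half that, and text set at 1rem; lowestSafeZoom is a made-up name.

```javascript
// Zoom factor at which text of `sizePx` shrinks to exactly the minimum font size.
function lowestSafeZoom(sizePx, minimumPx) {
  return minimumPx / sizePx;
}

const defaultPx = 16;            // assumed user default
const minimumPx = defaultPx / 2; // minimum at half the default: 8px

console.log(lowestSafeZoom(defaultPx, minimumPx));         // 0.5 (root left alone)
console.log(lowestSafeZoom(defaultPx * 0.625, minimumPx)); // 0.8 (root at 62.5%)
```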

Secondly, even disregarding your flawed assumption about every user having 16px for a default, why do you want to make calculations with the font size? And why do you think it’s easier after html{font-size:62.5%}? If you want some header to be 24px, you can just write 24px. There is no need to change the html font size first and then write 2.4rem.
Also, if you hadn’t changed the html font size, you could have written 1.5rem. Why would 1.5rem be more difficult to work with than 2.4rem? In fact, not changing the html and writing 1.5rem will make it clearer that this is one and a half times the standard size. Much more intuitive.

Speaking of intuitive, you’re also messing with the predefined font sizes xx-small, x-small, small, medium, large, x-large and xx-large. After this treatment, these keywords don’t work as they should any more; small will be larger than 1rem!

Now some people believe that if you use pixels, users will not be able to zoom in and out on their webpages. This isn’t true.

Oh, and I know there is a misconception about using pixels: that pixels are not good, because not all pixels are the same size. Well, let me tell you, not all percentages are the same size either! How about that.

And some people believe that you need to set a font size on the html, for whatever reason. That if you don’t, some things won’t work correctly. So they do html{font-size:100%} and think that they are doing the right thing. I am not sure where this misconception comes from.

There is no such thing as XHTML5

For a while, I believed that XHTML was keeping up with HTML. That you could use all the features of HTML5 in XHTML, by using the XHTML structure and file type, and HTML5 content.

The W3C and the WHATWG also strongly hint that you can write newer HTML5 material as both HTML and XHTML documents. That is, you can serve up your document with either a .html or a .xhtml extension and it will get the same treatment, the same result.

In fact, the HTML5 specification explicitly allows for XHTML-like structures in HTML5, even in HTML mode, such as the slash at the end of any void element; and the W3C proudly proclaims all the benefits of using polyglot markup and how you can serve up exactly the same content as both HTML and XHTML if you just keep to some simple rules – see http://www.w3.org/TR/html-polyglot/

So where do we stand here with this wonderful new HTML5 technology?

Well, the problem is that you can’t really give any XHTML file an HTML5 DOCTYPE and get away with it.
Any XHTML file containing entity references like &eacute; or &nbsp; can no longer be displayed!

So there you have it. If you want to use an XHTML file type, you can’t use an HTML5 DOCTYPE. The HTML5 DOCTYPE is only for HTML files.

Rest in peace, XHTML.