Sunday, January 26, 2025

What is the right tool for email validation?


One of my pet peeves is seeing something like this in a codebase, grabbed from a friendly Stack Overflow answer or maybe even regurgitated by some carbon-spewing AI:

EMAIL_REGEX = (?:[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*|"(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21\x23-\x5b\x5d-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])*")@(?:(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?|\[(?:(?:(2(5[0-5]|[0-4][0-9])|1[0-9][0-9]|[1-9]?[0-9]))\.){3}(?:(2(5[0-5]|[0-4][0-9])|1[0-9][0-9]|[1-9]?[0-9])|[a-z0-9-]*[a-z0-9]:(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21-\x5a\x53-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])+)\])

Don't get me wrong: I love regular expressions when they are the right tool for the job. I can even appreciate a clever one now and then. So what is wrong with this picture?
  1. It is almost certainly incorrect. If anyone has a proof or an exhaustive test suite in a repo somewhere, please prove me wrong.
  2. It is ridiculously big and impossible to understand.
  3. Regular expressions this powerful and expressive can inadvertently introduce catastrophic backtracking, where matching time grows exponentially with input length on certain inputs.
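
To make the third point concrete, here is a minimal Python sketch of the blow-up. The pattern is a textbook pathological case with nested quantifiers, not a piece of the email regex above, and the helper name time_failed_match is made up for this illustration. In a backtracking engine such as CPython's re module, the time to reject the input typically doubles, roughly, with each extra character.

```python
import re
import time

# Textbook pathological pattern: nested quantifiers over the same character.
# Illustrative only; it is NOT taken from the email regex above.
NESTED = re.compile(r"(a+)+$")


def time_failed_match(n: int) -> float:
    """Return the seconds spent failing to match n 'a's followed by '!'."""
    subject = "a" * n + "!"
    start = time.perf_counter()
    result = NESTED.match(subject)
    elapsed = time.perf_counter() - start
    # The trailing '!' makes a match impossible, so the engine has to
    # try every way of splitting the run of 'a's before giving up.
    assert result is None
    return elapsed


if __name__ == "__main__":
    # Keep n small: each step up costs noticeably more than the last.
    for n in (16, 18, 20, 22):
        print(n, f"{time_failed_match(n):.3f}s")
```

An attacker who controls the input only needs to find one such string to tie up a validation endpoint.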

RFC 5322 Section 3.4 defines (as of this writing) the format of an email address. It consists of complex rules and formats going back to the beginning of time (well, not quite all the way back to UUCP email, such as my first email address, ..!ualberta!tim). It also defers validation of the parts to other RFCs and even to implementations. Fortunately, though, it provides enough to build a simple procedural parser that any programmer can understand and adjust if business needs change.

For the parser, I suggest a divide-and-conquer strategy that breaks the address into its constituent parts using the language's built-in string manipulation. These parts can then be further broken down and validated using regular expressions or other language features.
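
Here is a rough sketch of that strategy in Python (not the gem, just an illustration). All names (Address, parse_address, valid_local, valid_domain) are hypothetical. It deliberately covers only a subset of RFC 5322: dot-atom local parts and plain host names. Quoted strings, comments, and domain literals are rejected. Requiring at least two domain labels is a business-rule assumption, and the length limits are the ones I recall from RFC 5321.

```python
import re
from typing import NamedTuple, Optional


class Address(NamedTuple):
    local: str
    domain: str


# "atext" characters from RFC 5322 Section 3.2.3.
ATOM = re.compile(r"[A-Za-z0-9!#$%&'*+/=?^_`{|}~-]+")
# One host name label: alphanumeric at both ends, hyphens inside, max 63 chars.
LABEL = re.compile(r"[A-Za-z0-9](?:[A-Za-z0-9-]{0,61}[A-Za-z0-9])?")


def valid_local(local: str) -> bool:
    # Assumption: dot-atom form only; quoted local parts are rejected.
    if len(local) > 64:  # local-part limit from RFC 5321
        return False
    # An empty atom (leading, trailing, or doubled dot) fails fullmatch.
    return all(ATOM.fullmatch(atom) for atom in local.split("."))


def valid_domain(domain: str) -> bool:
    # Assumption: host names only; domain literals like [192.0.2.1] are rejected.
    if len(domain) > 255:  # domain limit from RFC 5321
        return False
    labels = domain.split(".")
    # Assumption (business rule): require at least two labels.
    return len(labels) >= 2 and all(LABEL.fullmatch(label) for label in labels)


def parse_address(text: str) -> Optional[Address]:
    # Divide: split at the LAST "@", since a domain can never contain one.
    local, at, domain = text.rpartition("@")
    if not at or not local or not domain:
        return None
    # Conquer: validate each part on its own with a small, readable rule.
    if valid_local(local) and valid_domain(domain):
        return Address(local, domain)
    return None
```

With this sketch, parse_address("tim@example.com") returns Address(local='tim', domain='example.com'), while parse_address("a..b@example.com") returns None. Each rule lives in its own small function, so loosening one (say, allowing single-label domains for an intranet) is a one-line change rather than regex surgery.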

I have already built this four or five times in various languages, including Java, Ruby, JavaScript, and Elm, for previous projects. This time I'll finally write a gem for it. Stay tuned; I'll have a GitHub repo ready in a bit.

And if you know of a gem or other open source implementation of RFC-style procedural validation that is still current, please let me know!