13.6.1 Problem
You want to check if an email address is valid.
13.6.2 Solution
This is a popular question and everyone has a different answer,
depending on their definition of valid. If valid means a mailbox belonging to a
legitimate user at an existing hostname, the real answer is that you can't do it
correctly, so don't even bother. However, sometimes a regular expression can
help weed out some simple typos and obvious bogus attempts. That said, our
favorite pattern that doesn't require maintenance is:
/^[^@\s]+@([-a-z0-9]+\.)+[a-z]{2,}$/i
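If the IMAP extension is enabled, you can also use imap_rfc822_parse_adrlist():
$parsed = imap_rfc822_parse_adrlist($email_address, $default_host);
// the function returns an array of objects, one per parsed address
if ('INVALID_ADDRESS' == $parsed[0]->mailbox) {
    // bad address
}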
Ironically, because this function is so RFC-compliant, it may
not give the results you expect.
13.6.3 Discussion
The pattern in the Solution accepts any email address whose
name is any sequence of characters other than @ and whitespace.
After the @, you need at least one domain name consisting of the
letters a-z, the numbers 0-9, and the hyphen, followed by a
period; you can precede it with as many subdomains as you want. Finally, you end with
either a two-letter country code or another top-level domain, such as
.com or .edu.
The solution pattern is handy because it still works if ICANN
adds new top-level domains.
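As a rough sketch of how the Solution pattern might be applied, the snippet below wraps it in a small helper and runs it over a few sample strings. The function name looks_like_email() and the sample addresses are ours, chosen for illustration; they are not part of the recipe itself.

```php
<?php
// Hypothetical helper (the name is an assumption, not part of the recipe):
// returns true when $address matches the liberal pattern from the Solution.
function looks_like_email($address) {
    return 1 === preg_match('/^[^@\s]+@([-a-z0-9]+\.)+[a-z]{2,}$/i', $address);
}

// Sample inputs, made up for illustration.
$samples = array('tim@oreilly.com', 'tim@oreilly', 'not telling', 'tim@example.museum');
foreach ($samples as $sample) {
    printf("%-20s %s\n", $sample, looks_like_email($sample) ? 'accepted' : 'rejected');
}
```

One caveat: in PCRE, $ also matches just before a trailing newline, so you may want to trim() the input first or add the D modifier to the pattern.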
However, it does let a few false positives through. This stricter pattern
explicitly enumerates the current noncountry top-level domains:
/
^                  # anchor at the beginning
[^@\s]+            # name is all characters except @ and whitespace
@                  # the @ divides name and domain
(
  [-a-z0-9]+       # (sub)domains are letters, numbers, and hyphens
  \.               # separated by a period
)+                 # and we can have one or more of them
(
  [a-z]{2}         # TLDs can be a two-letter alphabetical country code
  |com|net         # or one of
  |edu|org         # many
  |gov|mil         # possible
  |int|biz         # three-letter
  |pro             # combinations
  |info|arpa       # or even
  |aero|coop       # a few
  |name            # four-letter ones
  |museum          # plus one that's six letters long!
)
$                  # anchor at the end
/ix                # and everything is case-insensitive
Both patterns are intentionally liberal in what they accept,
because we assume you're only trying to make sure someone doesn't accidentally
leave off their top-level domain or type in something fake such as "not
telling." For instance, there's no domain "-.com", but
"foo@-.com" flies through without a blip. (It wouldn't be hard to
modify the pattern to correct this, but that's left as an exercise for you.) On
the other hand, it is legal to have an address of "Tim
O'Reilly@oreilly.com", and our pattern won't accept this. However, spaces
in email addresses are rare; because a space almost always represents a mistake,
we flag that address as bad.
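One possible take on the exercise mentioned above, offered as a sketch rather than the definitive answer: require each domain label to begin and end with a letter or digit, so hyphens can appear only in the middle. The function name is hypothetical, and the rule used here is our assumption about what "correct" should mean.

```php
<?php
// Hypothetical variant of the Solution pattern. Each (sub)domain label must
// start and end with a letter or digit; hyphens are allowed only in between.
function looks_like_email_strict_labels($address) {
    $pattern = '/^[^@\s]+@([a-z0-9]([-a-z0-9]*[a-z0-9])?\.)+[a-z]{2,}$/i';
    return 1 === preg_match($pattern, $address);
}

var_dump(looks_like_email_strict_labels('foo@-.com'));       // label is only a hyphen
var_dump(looks_like_email_strict_labels('foo@my-host.com')); // hyphen inside a label
```

The name part before the @ is left exactly as liberal as before; only the domain labels are tightened.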
The canonical definition of what's a valid address is
documented in RFC 822 (since superseded by RFC 2822 and RFC 5322); however, writing code to handle all cases isn't a pretty
task. Here's one example of what you need to consider: people are allowed to
embed comments inside addresses! Comments are set inside parentheses, so it's
valid to write:
Tim (is the man @ computer books) @ oreilly.com
That's equivalent to "tim@oreilly.com". (So, again,
the pattern fails on that address.)
Alternatively, the IMAP extension has an RFC 822-compliant address parser. This
parser correctly navigates through whitespace, comments, and other oddities, but
it allows obvious mistakes because it assumes that addresses without hostnames
are local:
$email = 'stephen(his account)@ example(his host)';
$parsed = imap_rfc822_parse_adrlist($email, '');
print_r($parsed);

Array
(
    [0] => stdClass Object
        (
            [mailbox] => stephen
            [host] => example
            [personal] => his host
        )

)
Reassembling the mailbox and host, you get
"stephen@example", which probably isn't what you want. The empty string
you must pass in as the second argument defeats your ability to check for valid
hostnames.
Some people like behind-the-scenes
processing, such as DNS lookups, to check if the address is valid. This doesn't
make much sense because that technique won't always work, and you may end up
rejecting perfectly valid people from your site through no fault of their own.
(Also, it's unlikely a mail administrator would fix his mail handling just to
work around one web site's email validation scheme.)
Another consideration when validating email addresses is that
it doesn't take too much work for a user to enter a completely legal and working
address that isn't his. For instance, one of the authors used to have a bad
habit of entering "billg@microsoft.com" when signing up for Microsoft's
web sites because "Hey! Maybe Bill doesn't know about that new version of
Internet Explorer?"
If the primary concern is to avoid typos, make people enter
their address twice, and compare the two. If they match, it's probably correct.
Also, filter out popular bogus addresses, such as
"president@whitehouse.gov" and the previously mentioned
"billg@microsoft.com". (This does have the downside of not letting The
President of the United States of America or Bill Gates sign up for your site.)
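The two suggestions above (enter the address twice, and filter popular bogus addresses) could be combined roughly as follows. This is a minimal sketch: the function name and its return values are hypothetical, and the list contains only the two addresses mentioned in the text.

```php
<?php
// Hypothetical helper: the function name and return strings are assumptions.
// The blocklist holds only the two addresses mentioned in the text.
function check_signup_address($first, $second) {
    $bogus = array('president@whitehouse.gov', 'billg@microsoft.com');

    $first  = trim($first);
    $second = trim($second);

    // Compare case-insensitively: a difference in capitalization alone is
    // treated here as the same address, which is a simplifying assumption.
    if (0 !== strcasecmp($first, $second)) {
        return 'mismatch';
    }
    if (in_array(strtolower($first), $bogus, true)) {
        return 'bogus';
    }
    return 'ok';
}

print check_signup_address('tim@oreilly.com', 'tim@oreily.com') . "\n";
```

Returning a short status string instead of a boolean lets the calling page show a different message for a typo than for a blocked address.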
However, if you need to ensure people actually have access to
the email address they provide, one technique is to send a message to their
address and require them to either reply to the message or go to a page on your
site and type in a special code printed in the body of the message to confirm
their sign-up. If you do choose the special code route, we suggest that you
don't generate a random string of letters, such as HSD5nbADl8. Since it
looks like garbage, it's hard to retype it correctly. Instead, use a word list
and create code words such as television4coatrack. While, on occasion,
it's possible to divine hidden meanings in these combos, you can cut the error
rate and your support costs.
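A code-word generator along these lines might look like the sketch below. The function name and the tiny word list are placeholders of ours, and random_int() assumes PHP 7 or later.

```php
<?php
// Hypothetical helper: builds a code such as "television4coatrack" from two
// words and a digit. $words must be a plain list indexed from 0.
function make_confirmation_code(array $words) {
    $first  = $words[random_int(0, count($words) - 1)];
    $second = $words[random_int(0, count($words) - 1)];
    return $first . random_int(0, 9) . $second;
}

// Stand-in word list for illustration; a real one would be much larger.
$words = array('television', 'coatrack', 'window', 'pencil', 'garden', 'bridge');
print make_confirmation_code($words) . "\n";
```

With a list this short there are only a few hundred possible codes, so a longer word list is needed before the codes become hard to guess.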