As a data scientist or developer, you'll often find yourself needing to extract specific pieces of information from large blobs of text. One common example is phone numbers. Whether you're cleaning up user-entered data, scraping contact information from web pages, or parsing resumes, the ability to quickly and accurately pull out phone numbers is an invaluable skill to have.
While you could write custom parsing logic for each format, a more efficient and flexible approach is to use regular expressions, or regex for short. In this guide, we'll walk through what you need to know to match and extract phone numbers from text with regex.
What is Regex and Why Use It for Extracting Phone Numbers?
In a nutshell, regular expressions provide a concise and highly flexible way to define search patterns and match, locate, and extract text that fits those patterns. Regex has its own special syntax for defining these search patterns using characters, operators, and constructs that each have special meaning.
At first glance it may look quite complex and cryptic, but the core concepts are fairly straightforward once you understand what each piece does. The payoff is that many text processing tasks reduce to defining the appropriate pattern to match what you're looking for.
When it comes to phone numbers, defining a regex pattern is by far the most robust way to handle the multitude of formats that you might encounter, including:
- Different delimiter characters like hyphens, periods, spaces, or parentheses
- Country codes
- Area codes
- Number groupings and lengths
- Extensions
Trying to anticipate and parse each different format separately would be reinventing the wheel each time and lead to error-prone and fragile code. With regex, you can define a single pattern that covers all the bases. Let's dive in and see how!
Regex Basics
Before we get into constructing patterns for phone numbers specifically, let's do a quick overview of regex fundamentals. A regex pattern consists of a sequence of characters, where each character is either interpreted literally (as the character itself) or as a special metacharacter with a specific meaning.
The most common metacharacters to know include:
- . (dot): Matches any single character except newline
- \d: Any single digit character (0-9)
- \w: Any word character (alphanumeric + underscore)
- \s: Any whitespace character (space, tab, newline)
- [abc]: Matches any single character in the brackets
- [^abc]: Matches any single character NOT in the brackets
- a|b: Matches either a or b
- (): Capturing group; also groups part of a pattern so that quantifiers and | apply to it as a unit
Quantifiers are used to indicate the number of times the preceding character or group should occur:
- *: Match 0 or more times
- +: Match 1 or more times
- ?: Match 0 or 1 time
- {n}: Match exactly n times
- {n,}: Match at least n times
- {n,m}: Match between n and m times
Anchors are used to specify the position of the match:
- ^: Start of a line/string
- $: End of a line/string
- \b: Word boundary
Putting this all together, here are a few simple examples:
- \d{3}: Matches any 3 digits
- [aeiou]: Matches any vowel
- \w+@\w+\.\w+: Matches a basic email address format (the dot is escaped so it matches a literal period)
- ^(https?://)?www\.\w+\.\w+$: Matches URLs starting with www., optionally preceded by http:// or https://
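The building blocks above can be tried out directly with Python's re module. The following is a minimal sketch: the sample strings are made up for illustration, and the email and URL patterns are the deliberately simple ones from the list (with literal dots escaped), not full validators.

```python
import re

# \d{3} finds runs of exactly three digits, scanning left to right.
three_digits = re.findall(r"\d{3}", "Room 101, ext 2045")

# [aeiou] matches one vowel at a time.
vowels = re.findall(r"[aeiou]", "regex")

# An unescaped dot matches any character; \. matches a literal period.
simple_email = re.compile(r"\w+@\w+\.\w+")
email_hit = simple_email.search("Contact: jane@example.com today")

# ^ and $ anchor the match to the whole string; ? makes the scheme optional.
url = re.compile(r"^(https?://)?www\.\w+\.\w+$")

print(three_digits)                                  # ['101', '204']
print(vowels)                                        # ['e', 'e']
print(email_hit.group())                             # jane@example.com
print(bool(url.match("https://www.example.com")))    # True
print(bool(url.match("see www.example.com")))        # False
```

Note how \d{3} stops after three digits, so "2045" yields "204" and leaves the trailing "5" unmatched.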
This just scratches the surface, but hopefully gives you a sense of the expressive power that regex provides. Now let's see how we can apply it to handling phone numbers.
Building Regex Patterns for Phone Numbers
At a basic level, a phone number is simply a string of digits, typically grouped in some way and potentially separated by spaces, dashes, periods, or other punctuation. A common format in the US is 3 digits (area code), followed by 3 digits (exchange), followed by 4 digits (line number), optionally wrapped in parentheses and separated by a hyphen or space, like:
(212) 555-1234
With our regex knowledge, we can construct a pattern to match this as:
\(\d{3}\)\s?\d{3}[-.]?\d{4}
Here's how this breaks down:
- \(: Match a literal opening parenthesis. The backslash is needed to escape the special meaning of parentheses in regex.
- \d{3}: Match a group of exactly 3 digits
- \): Match a literal closing parenthesis
- \s?: Match an optional whitespace character
- \d{3}: Another group of 3 digits
- [-.]: Match either a hyphen or period (inside brackets, both are taken literally)
- ?: Make the separator optional
- \d{4}: Final group of 4 digits
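To check a breakdown like this, it helps to run the pattern against a few strings. Below is a small sketch using re.fullmatch, which only succeeds if the entire string fits the pattern. The sample inputs and the names basic, samples, and results are illustrative assumptions.

```python
import re

# The basic pattern: parenthesized area code, optional space,
# three digits, optional hyphen or period, four digits.
basic = re.compile(r"\(\d{3}\)\s?\d{3}[-.]?\d{4}")

samples = [
    "(212) 555-1234",   # canonical form
    "(212)555.1234",    # no space, period separator
    "(212) 5551234",    # no separator at all
    "212-555-1234",     # no parentheses: not covered by this pattern
]

# fullmatch returns a match object or None; bool() turns that into True/False.
results = {s: bool(basic.fullmatch(s)) for s in samples}
for s, ok in results.items():
    print(f"{s!r:20} -> {ok}")
```

The first three samples match and the last one does not, because this version requires the parentheses around the area code.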
We can expand this to cover numbers with or without the area code parentheses:
(\(\d{3}\)\s?|\d{3}[-.]?)\d{3}[-.]?\d{4}
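The alternation (|) is what lets one pattern accept either style of area code. Here is a hedged sketch of how that behaves in Python. It uses a non-capturing group, (?: … ), in place of the plain parentheses, which does not change what is matched. The accepted and rejected lists are made-up test inputs.

```python
import re

# Either "(ddd)" plus optional space, or "ddd" plus optional hyphen/period,
# followed by the remaining seven digits.
with_or_without_parens = re.compile(
    r"(?:\(\d{3}\)\s?|\d{3}[-.]?)\d{3}[-.]?\d{4}"
)

accepted = ["(212) 555-1234", "212-555-1234", "212.555.1234", "2125551234"]
rejected = [
    "(212 555-1234",   # unbalanced parenthesis
    "212 555 1234",    # spaces are not in the separator class [-.]
    "555-1234",        # only seven digits
]

for s in accepted + rejected:
    print(f"{s!r:20} -> {bool(with_or_without_parens.fullmatch(s))}")
```

Space-separated numbers are rejected here because the separator class only lists hyphen and period. Adding \s to the class would accept them.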
And using a similar approach, we can define a more comprehensive pattern that covers other common formats and edge cases:
^(?:\+?1\s*[-./]?)?\(?[2-9]\d{2}\)?[-.\s]?\d{3}[-.\s]?\d{4}(?:\s*(?:#|x\.?|ext\.?)\s*\d+)?$
This may look intimidating, but here's what each piece does:
- ^: Start of string anchor
- (?: … ): Non-capturing group used for organization
- \+?: Optional + sign for country code
- 1\s*[-./]?: US country code with optional whitespace and separator; the enclosing (?: … )? makes the whole prefix optional
- \(?: Optional opening parenthesis for area code
- [2-9]\d{2}: Area code starting with 2-9 followed by any 2 digits
- \)?: Optional closing parenthesis
- [-.\s]?: Optional separator (hyphen, period, or whitespace)
- \d{3}[-.\s]?: Group of 3 digits (the exchange) followed by an optional separator
- \d{4}: Final 4 digits
- (?: … )?: Optional non-capturing group for extension
- \s*(?:#|x\.?|ext\.?)\s*: Extension prefix (#, x, or ext, with an optional trailing period)
- \d+: Extension number
- $: End of string anchor
This handles the common North American phone number formats you're likely to encounter, including:
- (212) 555-1234
- 212.555.1234
- 212-555-1234
- +1-212-555-1234
- 1 (212) 555-1234
- 212-555-1234 x1234
- (212)5551234 #1234
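A pattern this long is easier to maintain when written with Python's re.VERBOSE flag, which ignores whitespace in the pattern and allows inline comments. The sketch below is one way to write a pattern for the formats listed above, assuming North American style numbers. The names PHONE, listed_formats, and not_phone_numbers are illustrative, and the rejected inputs are made up.

```python
import re

# In verbose mode a literal "#" must be escaped, otherwise it starts a comment.
PHONE = re.compile(
    r"""
    ^
    (?:\+?1\s*[-./]?)?                 # optional country code: +1, 1-, 1 then space
    \(?[2-9]\d{2}\)?                   # area code, optionally in parentheses
    [-.\s]?                            # optional separator
    \d{3}                              # exchange
    [-.\s]?                            # optional separator
    \d{4}                              # line number
    (?:\s*(?:\#|x\.?|ext\.?)\s*\d+)?   # optional extension
    $
    """,
    re.VERBOSE,
)

listed_formats = [
    "(212) 555-1234",
    "212.555.1234",
    "212-555-1234",
    "+1-212-555-1234",
    "1 (212) 555-1234",
    "212-555-1234 x1234",
    "(212)5551234 #1234",
]
not_phone_numbers = ["012-555-1234", "555-1234", "212-555-12345"]

for s in listed_formats + not_phone_numbers:
    print(f"{s!r:24} -> {bool(PHONE.match(s))}")
```

Keeping a list of expected matches and non-matches next to the pattern makes it cheap to re-check whenever the pattern changes.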
Now all that's left is to actually use this regex pattern in code to extract the matched phone numbers from text.
Matching and Extracting Phone Numbers with Regex in Code
Most modern programming languages provide built-in support for regular expressions, making it easy to match and extract text based on patterns. Let's see some quick examples of how we can extract phone numbers using our regex pattern above in a few popular languages. Because we are now searching inside longer text rather than validating a whole string, the examples drop the ^ and $ anchors.
Python
In Python, we can use the built-in re module to work with regular expressions:
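import re
text = "Call me at 212-555-1234 or (415) 555-6789 x1234 if urgent."
# No ^/$ anchors and no capturing groups, so findall returns each full match
pattern = re.compile(r'(?:\+?1\s*[-./]?)?\(?[2-9]\d{2}\)?[-.\s]?\d{3}[-.\s]?\d{4}(?:\s*(?:#|x\.?|ext\.?)\s*\d+)?')
matches = pattern.findall(text)
print(matches)
# Output: ['212-555-1234', '(415) 555-6789 x1234']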
The re.compile function compiles the regex pattern string into a regex object. Note the r prefix on the string to indicate a raw string literal and avoid having to escape backslashes.
The findall method finds all matches of the pattern in the input text and returns them as a list of strings. We could also use the sub method to find and replace phone numbers, or the split method to split the text on phone number matches.
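The sub and split methods mentioned above, along with finditer for positions, can be sketched as follows. The replacement text "[phone]" and the variable names are illustrative assumptions. The pattern is an unanchored variant written without capturing groups, so that split returns only the text between matches.

```python
import re

PHONE = re.compile(
    r"(?:\+?1\s*[-./]?)?\(?[2-9]\d{2}\)?[-.\s]?\d{3}[-.\s]?\d{4}"
    r"(?:\s*(?:#|x\.?|ext\.?)\s*\d+)?"
)

text = "Call me at 212-555-1234 or (415) 555-6789 x1234 if urgent."

# sub: replace every match, e.g. to redact numbers before logging.
redacted = PHONE.sub("[phone]", text)

# split: break the text apart at each match.
pieces = PHONE.split(text)

# finditer: iterate over match objects to get positions as well as text.
spans = [(m.group(), m.start()) for m in PHONE.finditer(text)]

print(redacted)  # Call me at [phone] or [phone] if urgent.
print(pieces)    # ['Call me at ', ' or ', ' if urgent.']
print(spans)     # [('212-555-1234', 11), ('(415) 555-6789 x1234', 27)]
```

If the pattern contained capturing groups, split would also include the captured text in its result, which is one reason to prefer (?: … ) for grouping here.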
JavaScript
In JavaScript, regular expressions are supported directly as part of the language using the RegExp object and String methods:
let text = "Call me at 212-555-1234 or (415) 555-6789 x1234 if urgent.";
let pattern = /(?:\+?1\s*[-.\/]?)?\(?[2-9]\d{2}\)?[-.\s]?\d{3}[-.\s]?\d{4}(?:\s*(?:#|x\.?|ext\.?)\s*\d+)?/g;
let matches = text.match(pattern);
console.log(matches);
// Output: ['212-555-1234', '(415) 555-6789 x1234']
Here the regex pattern is defined as a RegExp literal by wrapping it in forward slashes. The g flag at the end indicates a global search to find all matches.
Calling the match method on the input text with the pattern returns an array of all matched substrings (or null if there are none). We could also use the pattern's test method to simply check for a match, the string's replace method to find and replace matches, or split to split the text on matches.
Grep
For shell scripting and text processing using CLI tools, grep is the go-to utility for searching files or input text using regular expressions. We can use it to extract phone numbers like:
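echo "Call me at 212-555-1234 or (415) 555-6789 x1234 if urgent." | grep -Eo '(\+?1[[:space:]]*[-./]?)?\(?[2-9][0-9]{2}\)?[-.[:space:]]?[0-9]{3}[-.[:space:]]?[0-9]{4}([[:space:]]*(#|x\.?|ext\.?)[[:space:]]*[0-9]+)?'
The -E flag indicates extended regular expression syntax and -o specifies to print only the matched parts of a matching line. Note that POSIX extended syntax has no \d shorthand or (?: … ) non-capturing groups, so the pattern is rewritten here with [0-9], [[:space:]], and plain parentheses; GNU grep also offers -P for Perl-compatible syntax. We could also use sed, awk, or other Unix utilities in a similar manner.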
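Where grep is not available, the same line-by-line extraction can be approximated in Python. This is a rough sketch, not a drop-in replacement for grep: the function name grep_o is hypothetical, and io.StringIO stands in for a real file or sys.stdin with made-up sample lines.

```python
import io
import re

PHONE = re.compile(
    r"(?:\+?1\s*[-./]?)?\(?[2-9]\d{2}\)?[-.\s]?\d{3}[-.\s]?\d{4}"
    r"(?:\s*(?:#|x\.?|ext\.?)\s*\d+)?"
)

def grep_o(lines, pattern=PHONE):
    """Yield each match on its own, similar to what `grep -o` prints."""
    for line in lines:
        for match in pattern.finditer(line):
            yield match.group()

# io.StringIO stands in for an open file or sys.stdin here.
sample = io.StringIO(
    "Call me at 212-555-1234 or (415) 555-6789 x1234 if urgent.\n"
    "No numbers on this line.\n"
    "Fax: 646.555.0199\n"
)

found = list(grep_o(sample))
print("\n".join(found))
```

Because grep_o is a generator that reads one line at a time, it can process large files without loading them fully into memory.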
Tools and Resources for Learning and Debugging Regex
Regular expressions have a reputation for being notoriously difficult to read and debug. Regex is said to be "write-only", meaning even the author of a regex often has trouble later understanding what it does! While this is certainly a risk, there are many great tools available to help you construct, test, and debug your patterns with ease.
Some of the most popular online regex testers and playgrounds are:
- Regex101: https://regex101.com/
- Debuggex: https://www.debuggex.com/
- RegExr: https://regexr.com/
- Rubular: https://rubular.com/ (Ruby-focused)
- RegEx Pal: https://www.regexpal.com/
These all provide real-time visualization, explanation, and testing of regex patterns against example input strings. You can also save and share patterns or explore pre-made community patterns.
For offline, desktop regex testing and debugging, some great options are:
- RegexBuddy: https://www.regexbuddy.com/ (Windows)
- RegViz: https://github.com/wakatime/regviz (macOS)
- Patterns: https://krillapps.com/patterns/ (macOS)
- Regex App: https://github.com/luong-komorebi/Regex-Resources (Windows, macOS, Linux)
As for actually learning the ins and outs of regex syntax and best practices, you can't go wrong with:
- Regular-Expressions.info: https://www.regular-expressions.info/
- RexEgg: https://www.rexegg.com/
- Regex Tutorial: https://medium.com/factory-mind/regex-tutorial-a-simple-cheatsheet-by-examples-649dc1c3f285
- Regex Learn: https://regexlearn.com/
- Regex Crossword: https://regexcrossword.com/
The best way to get better at regex is simply using it and practicing! Take every opportunity to reach for regex to solve your string processing needs.
Conclusion
In this comprehensive guide, we covered the power of using regular expressions to match and extract phone numbers from unstructured text data. We looked at the fundamental components of regex syntax, how to construct robust patterns to handle different phone number formats, and examples of using regex to extract numbers in various programming languages.
Some key takeaways to remember:
- Regex provides a flexible and concise way to express patterns to match text
- Regex syntax can seem arcane but is learnable and supported by many tools
- Constructing a regex pattern is all about identifying the literal and variable pieces of the text you want to match
- Regex is incredibly handy for data wrangling and cleaning tasks like extracting phone numbers
- Regex is a valuable skill to have in your developer toolkit that is widely applicable
I encourage you to start using regex in your own projects and leverage the multitude of online tools and resources to continue honing your text-wrangling skills! While it does take some practice to master, I hope this guide has given you the foundation and motivation to make regular expressions a regular part of your workflow.