I’ll admit it: I used to be a regexaphobe. When I was new to analytics, I remember someone sending me a snippet of regular expressions (AKA regex) to solve a goal setup conundrum I was working through. It looked like a foreign language to me. I was fascinated by it but repelled at the same time. #itscomplicatedk?
Sadly, my fear of regex kept me from doing more powerful analysis. I tried everything to avoid it and would copy and paste code from articles I’d saved whenever I had to create a custom filter. But eventually I hit a wall I couldn’t scale unless I conquered this beast. So, like Yukon Cornelius, Rudolph, and Hermey, I set out on a quest to learn it.
Why This Post
As nerdy as regex is, I’m writing this post because it will broaden your capacity as a marketer to do more sophisticated analysis in tools like Google Analytics, Google Docs, Google Spreadsheets, Tableau, Screaming Frog, SQL, etc. Basically any tool that uses filters. I will be creating videos to demonstrate practical tasks you might need to carry out in each of these tools with the aid of regex. It’s gonna be dope.
So I’m going to hit on the main ones you’ll need, while explaining the geek speak in simple terms. I will even subject myself to public scorn by sharing the goofy mnemonic devices I used early on to remember a few of them I just couldn’t seem to get down.
I’ll break down the regex characters you’ll use most in the order I go through them in my video. When creating the video, I used regexr.com to test my regex. (There are quite a few regex testers on the market.) A couple of times regexr.com was a little buggy, so if it’s not matching and you’re sure your regex is on point, try refreshing the page.
I’ll also include the lists I used in my demos so that you can follow along, if you’re so inclined.
Video
Regex Lineup
Pipe (|)
What It Does
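The pipe character means “or.” Put it between two or more options, and your regex will match any one of them. In the video (00:34 min mark), I use a list of countries, which you can download here, and use regex to filter it down to just EU countries. The regex I used: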
Austria|Belgium|Bulgaria|Croatia|Republic of Cyprus|Czech Republic|Denmark|Estonia|Finland|France|Germany|Greece|Hungary|Ireland|Italy|Latvia|Lithuania|Luxembourg|Malta|Netherlands|Poland|Portugal|Romania|Slovakia|Slovenia|Spain|Sweden
Caveat
One thing you need to be careful with when using the pipe character in a long list like this is, if you tack a pipe character onto the end of your list, you will select everything. You’re basically saying, “or whatever.”
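If you want to see the trailing-pipe gotcha for yourself, here’s a minimal sketch in Python using its built-in re module. Python isn’t part of the video (it’s just an assumption for illustration), and the short country list is made up to keep things readable.

```python
import re

# A tiny made-up sample; the downloadable list is much longer.
countries = ["Austria", "Brazil", "France", "Japan", "Spain"]

# The pipe means "or": match Austria OR France OR Spain.
eu_pattern = re.compile(r"Austria|France|Spain")
eu_matches = [c for c in countries if eu_pattern.search(c)]

# A trailing pipe adds an empty option, and "nothing" is found in every string.
oops_pattern = re.compile(r"Austria|France|Spain|")
oops_matches = [c for c in countries if oops_pattern.search(c)]

print(eu_matches)    # ['Austria', 'France', 'Spain']
print(oops_matches)  # all five countries
```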
Dot (.)
What It Does
The . metacharacter is a wildcard character. It means match any one character. It can be a number, letter, or special character (even a white space). By itself, it’s not that amazing, but with the help of its frequent companions, the asterisk (*) and plus (+) characters, it’s pretty bad to the bone.
Follow Along
In the video (4:00 min mark), I use a list of kindergarten words to play with the dot character. Feel free to play along:
cat
cap
cot
bat
cut
dab
but
Caveat
If you’re a marketer, you’ll be using dots quite a bit as themselves, so you’ll need to escape them (i.e., drop a backslash in front of each one). That said, most of the time, if you’re matching a list of URLs, your regex will most likely work even if you forget to escape the dot because how many other characters are you going to see before your top level domain (e.g., .com, .edu, .gov)?
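Here’s a quick sketch of the dot as a wildcard versus the dot as itself, written in Python’s re module as an illustration. The word list is the one above; the hostnames are hypothetical.

```python
import re

words = ["cat", "cap", "cot", "bat", "cut", "dab", "but"]

# c.t = "c", then any one character, then "t".
c_dot_t = [w for w in words if re.fullmatch(r"c.t", w)]
print(c_dot_t)  # ['cat', 'cot', 'cut']

# Made-up hostnames to show what an unescaped dot lets through.
hosts = ["mysite.com", "mysite-com", "mysitexcom"]
loose = [h for h in hosts if re.fullmatch(r"mysite.com", h)]
strict = [h for h in hosts if re.fullmatch(r"mysite\.com", h)]
print(loose)   # all three hostnames
print(strict)  # ['mysite.com']
```

In real URL lists the loose version rarely bites, which is the point of the caveat, but the escaped version says exactly what you mean.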
Asterisk (*)
What It Does
The asterisk says to match 0 or more of the character that comes right before it. In other words, it looks at the character before it (most often the . character) and says that character may not be there at all, may be there once, or may repeat an unlimited number of times.
Follow Along
In the video (4:00 min mark), I address the asterisk and plus characters in quick succession after the dot character. See the Dot section for the list of words I used in my demo. 👆
Caveat
The .* combo meal is expensive. I treat it as an option of last resort. I demonstrate in the video how I’ll most commonly use it within some pretty tight parameters when I walk through how to capture the misspellings of Britney Spears (34:21 min mark).
Heads-Up
I wasn’t supposed to cover the * character when I did but went off script. Then I forgot I did that and introduced it [again] at the 7:07 min mark. #50firstdates
At least I don’t have to worry about you forgetting what the asterisk character does. 🤦♀️
Plus Sign (+)
What It Does
The + means one or more of the previous character. So it’s a lot like the asterisk, except it requires that at least one character matches. IOW, the previous character is mandatory. I use this all through the video tutorial.
Follow Along
In the video (4:00 min mark), I address the plus and asterisk characters in quick succession after the dot character. See the Dot section for the list of words I used in my demo. 👆
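The difference between the asterisk and the plus is easiest to see side by side. This is a small sketch in Python’s re module with made-up strings, not the demo from the video.

```python
import re

samples = ["ct", "cat", "caat", "caaat"]  # made-up strings

# ca*t: zero or more "a" characters between c and t.
star = [s for s in samples if re.fullmatch(r"ca*t", s)]
# ca+t: one or more "a" characters, so "ct" drops out.
plus = [s for s in samples if re.fullmatch(r"ca+t", s)]

print(star)  # ['ct', 'cat', 'caat', 'caaat']
print(plus)  # ['cat', 'caat', 'caaat']

# Same idea with the dot: c.* is happy with a lone "c", c.+ is not.
lone_c_star = bool(re.fullmatch(r"c.*", "c"))  # True
lone_c_plus = bool(re.fullmatch(r"c.+", "c"))  # False
```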
Square Brackets ([ ])
What It Does
This means match any one of the characters between the brackets. So, c[aou]p would match cap, cop, and cup. But you can only pick one; that’s the key to the brackets. You can throw in a dash to indicate a range of characters to choose from. For example, [0-5] would mean you could pick any one digit between 0 and 5, and [x-z] will match x, y, z.
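As a rough illustration of the “pick one” rule, here’s a short Python sketch using the re module. The test words and digits are made up.

```python
import re

words = ["cap", "cop", "cup", "cep", "caop"]  # made-up test words

# Exactly one of a, o, or u between c and p.
one_vowel = [w for w in words if re.fullmatch(r"c[aou]p", w)]
print(one_vowel)  # ['cap', 'cop', 'cup']

# A dash inside the brackets indicates a range.
digits = ["0", "3", "5", "6", "9"]
zero_to_five = [d for d in digits if re.fullmatch(r"[0-5]", d)]
print(zero_to_five)  # ['0', '3', '5']
```

Note that caop misses the cut: the brackets stand for one character, not one or more.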
Follow Along
In the video (6:18 min mark), I use the list of words under the Dot section to demo square brackets. But we’ll use them several times throughout the video. For single characters, they accomplish the same thing as the pipe character (c[aou]p does the same job as cap|cop|cup), but I find brackets easier to read than a bunch of pipe characters.
Caveat
You don’t need to escape regex characters when they’re inside square brackets. You won’t blow anything up if you do, but the backslashes aren’t necessary. Imagine playing a high-stakes game of tag on the playground. Square brackets are base for regex characters, like *, ., +, and ?. So you’ll get no judgment from me for escaping them, but I can’t protect you from that pedantic developer on your team who’s already tired of marketers poking around in their code.
The one exception is the caret. As the first character inside the brackets, ^ means “exclude” (that’s the [^ ] combo), so if you want to indicate the literal ^ character, as opposed to the regex character, you could drop a \ in front of it. (If you’re new to regex, I promise this will all make sense by the time you get to the end of this post.) Alternatively, if you have multiple characters in your brackets, you could position it after another character. The hyphen needs similar care: between two characters it indicates a range, so park it first or last. So if you wanted to match the caret character along with the hyphen and asterisk in your regex, you could write [\^*-] or [-*^]. (Is it just me or does that first expression look a little flirty?)
Backslash (\)
What It Does
This character escapes the character that follows it. In plain English, that simply means that it says treat the character that follows it as a regular ol’ character and NOT a regex character. These non-regexy characters are literally called literals. 😂😶😏
So if I write out index\.aspx\?query=funky\+boots, I’m saying treat the . , ?, and + signs as characters and don’t interpret them as regex.
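To make that concrete, here’s a small Python sketch (the re module is my assumption for illustration; the junk string is invented). It checks the escaped pattern against the real URI, then shows the kind of string the unescaped version will happily accept.

```python
import re

uri = "index.aspx?query=funky+boots"

escaped = r"index\.aspx\?query=funky\+boots"
unescaped = r"index.aspx?query=funky+boots"

# Escaped: the ., ?, and + are treated as plain characters.
escaped_hits_uri = bool(re.fullmatch(escaped, uri))

# Unescaped: . is a wildcard, ? makes the x optional, + repeats the y.
junk = "indexZaspquery=funkyyyboots"  # invented junk string
unescaped_hits_junk = bool(re.fullmatch(unescaped, junk))

# re.escape adds the backslashes for you.
auto_escaped_hits_uri = bool(re.fullmatch(re.escape(uri), uri))

print(escaped_hits_uri, unescaped_hits_junk, auto_escaped_hits_uri)
```

Python’s re.escape is handy when you’re pasting in a long literal string and don’t want to hunt down every special character by hand.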
Follow Along
In the video (6:18 min mark), I go through the list of most common regex characters that you’ll need to escape. You can find that list here. And here is the list I worked from:
$45.18
3892.8467
$35479.27
$39,756.18
$1284
76390
Caveat
You may play Russian roulette with your regex and not escape your regex characters. A forgotten dot will usually work out. With the example of the URI above, though, it wouldn’t: the unescaped ? makes the x optional and the + turns into “one or more y,” so the pattern would sail right past the real URI. Either way, you’re going to have 🍳 on your face if someone drills into your dashboard and finds junk. To wit, I was once building out Tableau dashboards for a client, and their Google rep had been sending them filtered data to drop in their reports. When I audited their data using a treemap, one wrong character caused two brackets of their keywords to be distorted by millions. (They used a * when they should have used a +. In this client’s case, that was an actual expensive mistake. 🤭)
Digit (\d)
What It Does
The digit metacharacter is very self-explanatory. It matches any one digit from 0 to 9.
Follow Along
In the video (9:37 min mark) I use the list under Backslash, due north, to demonstrate this handy regex character.
Caveat
Regex characters are case sensitive. If you capitalize the ‘d’ (i.e., \D) it is negated, meaning it will match any character that’s NOT on the VIP list (ergo, letters, symbols, etc.).
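Here’s a hedged little sketch of \d and its evil twin \D in Python’s re module, run against the list from the Backslash section.

```python
import re

amounts = ["$45.18", "3892.8467", "$35479.27", "$39,756.18", "$1284", "76390"]

# \d+ across the whole entry: digits and nothing else.
digits_only = [a for a in amounts if re.fullmatch(r"\d+", a)]
print(digits_only)  # ['76390']

# \D anywhere in the entry: at least one character that is NOT a digit.
has_non_digit = [a for a in amounts if re.search(r"\D", a)]
print(has_non_digit)  # everything except '76390'
```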
Question Mark (?)
What It Does
Technically, this character means 0 or 1 of the character before, but I like to think of it as the previous character being optional. Maybe it’s there, maybe it’s not—who knows, really? Hence the ?. See how easy this is when you’re not learning from a textbook printed on recycled paper in Times New Roman with pics of Macs from the 80s? Or reading my post from 2013 that was technically correct but not elegant. (Like, at all.)
Follow Along
In the video (10:55 min mark) you can keep rocking the list above to practice.
Caveat
You can make multiple characters optional using the ? character; you just have to wrap them up in a little burrito made of parentheses, e.g., (sir )?paul mccartney. (Tuck the space inside the burrito, or the space stays mandatory even when the sir is missing.)
I don’t want to dash anyone’s faith in the future of humanity, but IRL your regex would probably look closer to:
(sir )?paul mcc?[ck]artn(ey|y|ie)
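If you’d like to test the optional burrito, here’s a sketch in Python’s re module. The list of spellings is invented, and I’ve put the space inside the parentheses so the name still matches when there’s no “sir” in front of it; treat that as one possible way to write it.

```python
import re

# Invented spellings for illustration.
names = ["paul mccartney", "sir paul mccartney", "paul mckartney",
         "sir paul mccartnie", "paul mccartny", "ringo starr"]

# (sir )? makes the whole group, trailing space included, optional.
pattern = re.compile(r"(sir )?paul mcc?[ck]artn(ey|y|ie)")
matches = [n for n in names if pattern.fullmatch(n)]

print(matches)  # every entry except 'ringo starr'
```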
Curly Braces ({ })
What It Does
Curly braces indicate how many times you may want a character repeated. They immediately follow the character (or characters wrapped in parentheses) and either contain a single number or two numbers separated by a comma. Let’s say you want to scoop up all US zip codes out of a column where the address is in one cell. (Annoying, amirite?) Because a basic zip code in the US is five digits, you’d write it as [\d]{5}, or [0-9]{5} if you want to look like a neophyte. (Kidding. Sorta.)
You could also express a range with curly braces by using the convention {minimum,maximum}. (Skip the space after the comma; most regex engines won’t read it as a range if you include one.) For example, let’s say you have a list of product IDs that start with three lower case letters followed by a hyphen and then three-to-five digits. You could pattern match it with this:
[\w]{3}-[\d]{3,5}
If the \w was pulling in characters you didn’t want, you could cinch it down by only including what you need: [a-z] or [a-zA-Z].
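Here’s a minimal sketch of that product ID pattern in Python’s re module, with made-up IDs, showing what \w lets in compared with [a-z].

```python
import re

# Made-up product IDs.
ids = ["abc-123", "xyz-12345", "ab-123", "abc-12", "abc-123456", "AB_-1234"]

# \w{3}: any three word characters (letters, digits, or underscore).
loose = [i for i in ids if re.fullmatch(r"\w{3}-\d{3,5}", i)]
# [a-z]{3}: exactly three lowercase letters.
strict = [i for i in ids if re.fullmatch(r"[a-z]{3}-\d{3,5}", i)]

print(loose)   # ['abc-123', 'xyz-12345', 'AB_-1234']
print(strict)  # ['abc-123', 'xyz-12345']
```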
Follow Along
In the video (14:09 min mark) I use this list below to identify phone numbers:
325-678-3892
89-2784-09
578-487-89921
(202) 893-2749
98-36489032
813-234-9569
Caveat
A mistake I sometimes see in Google Analytics accounts is someone will separate the min and max numbers with a hyphen. It’s an honest mistake. After all, we use hyphens for ranges in square brackets. But someone probably lost a bet somewhere, and it was decided that the curly braces should use a comma. And this, boys and girls, is why programming is hard.
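The phone number list and the comma-versus-hyphen mix-up can both be sketched in a few lines of Python. The phone pattern below is one reasonable way to write it, not necessarily the one from the video, and the hyphen behavior shown is how Python’s re module handles it; other tools may react differently.

```python
import re

numbers = ["325-678-3892", "89-2784-09", "578-487-89921",
           "(202) 893-2749", "98-36489032", "813-234-9569"]

# Three digits, hyphen, three digits, hyphen, four digits.
phone = re.compile(r"\d{3}-\d{3}-\d{4}")
valid = [n for n in numbers if phone.fullmatch(n)]
print(valid)  # ['325-678-3892', '813-234-9569']

# Comma: a real range (two to four digits).
comma_range = bool(re.fullmatch(r"\d{2,4}", "123"))
# Hyphen: not a range. Python reads {2-4} as literal text to match.
hyphen_range = bool(re.fullmatch(r"\d{2-4}", "123"))
print(comma_range, hyphen_range)  # True False
```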
Caret (^)
What It Does
The caret character just indicates the beginning of a line, meaning your selection has to begin with whatever you put after it. I use this all the time when pattern matching URLs and URIs (a URL that got separated from the hostname/subdomain). I’m in the process right now of building out a series of campaign-specific dashboards for a client with different universes of URLs. I’m using regex to pattern match URLs and URIs and then marrying up their Google Analytics, Search Console, Moz, and Screaming Frog data. This wouldn’t be possible without regex.
I also use the ^ regex character when making sets in Tableau. This is helpful in grouping keywords from Google Ads, Search Console, site search, names, etc.
Follow Along
In the video (15:50 min mark), I use the list below to identify social media profiles.
@AnnieCushing
This is just test
@ me!
@mashable
annie@annielytics.com
@old_skool
more random text
@annie-cushing
Caveat
If you see a caret inside square brackets, it takes on an entirely different role. I’ll cover that in the “Square Brackets + Caret” section 👇.
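Here’s a quick Python sketch (re module, my choice for illustration) that runs the list above with and without the caret so you can see what the anchor buys you.

```python
import re

lines = ["@AnnieCushing", "This is just test", "@ me!", "@mashable",
         "annie@annielytics.com", "@old_skool", "more random text",
         "@annie-cushing"]

# ^@ : the line has to START with @.
starts_with_at = [l for l in lines if re.search(r"^@", l)]
# @ alone: the @ can be anywhere, so the email address sneaks in.
contains_at = [l for l in lines if re.search(r"@", l)]

print(starts_with_at)
print(contains_at)
```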
Word (\w)
What It Does
I didn’t include this regex metacharacter in my original post, but now—after almost 10 years of experience with regex—I use it all the time. It includes any one character that’s a letter (upper- or lowercase), number, or underscore. It’s a more efficient alternative to typing out [a-zA-Z0-9_]. Oddly enough, it doesn’t include a hyphen.
Similar to the \d metacharacter, if you capitalize it, you’ll throw your net out and catch anything that’s not a word character (e.g., a symbol).
Follow Along
In the video (at the 15:50 min mark) I introduce the \w character along with the caret. You can use that same list.
Caveat
If you don’t want numbers or the underscore included in your filter, you’ll need to just indicate letters. And if your pattern could include lowercase and uppercase letters, you’ll need to specify that, e.g., [a-zA-Z].
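To see how \w compares with a letters-only set, here’s a sketch in Python’s re module using the social media list from the Caret section. The three patterns are illustrative variations, not the exact ones from the video.

```python
import re

lines = ["@AnnieCushing", "This is just test", "@ me!", "@mashable",
         "annie@annielytics.com", "@old_skool", "more random text",
         "@annie-cushing"]

# @ followed by word characters only (letters, digits, underscore).
word_handles = [l for l in lines if re.fullmatch(r"@\w+", l)]
# Letters only: the underscore in @old_skool no longer qualifies.
letter_handles = [l for l in lines if re.fullmatch(r"@[a-zA-Z]+", l)]
# \w doesn't include the hyphen, so add it to the set if you need it.
hyphen_ok = [l for l in lines if re.fullmatch(r"@[\w-]+", l)]

print(word_handles)    # ['@AnnieCushing', '@mashable', '@old_skool']
print(letter_handles)  # ['@AnnieCushing', '@mashable']
print(hyphen_ok)       # adds '@annie-cushing'
```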
Parentheses ( )
What They Do
Parentheses are used to form groups — just like you learned in algebra. When you write more sophisticated regex, you’ll rely pretty heavily on parentheses. For one client’s site, I wanted to create a bucket for all the URLs that were generated when someone searched for a property on their site. I save snippets like this in Evernote and tag these snippets with ‘regex’ so it’s fun sometimes to look back on my old code. We tested it thoroughly before creating the rewrite filter (where I rewrote them all to a single URL since these pages all did the same thing). And it worked. But it’s a hot mess:
(^/index.html?pclass.*)|(/index.html?action=search.*)|(/index.php?cur_page=.*)|(/index.html?searchtext.*)|(/realty/index.html?pclass.*)
Here’s how I’d write it now:
^(/realty)?/index\.(html|php)\?(pclass|action=search|cur_page|searchtext)
With that regex salad cleaned up, here’s another use: parentheses are especially helpful when identifying words that are frequently truncated, like months. So if you wrote Sep(tember)? it would match Sep or September. Or if you want to let go and let God, [sS]ep(tember)? would additionally match sep and september. But now I’m just showing off. Sorry.
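Here’s a hedged sketch that runs the cleaned-up pattern and the month example through Python’s re module. The URIs are invented stand-ins (the client’s real URLs aren’t shown in this post), so treat the query values as hypothetical.

```python
import re

cleaned = re.compile(
    r"^(/realty)?/index\.(html|php)\?(pclass|action=search|cur_page|searchtext)"
)

# Invented URIs for illustration.
uris = ["/index.html?pclass=condo", "/realty/index.html?pclass=condo",
        "/index.php?cur_page=2", "/index.html?action=search&beds=3",
        "/index.html", "/blog/index.html?pclass=condo"]
search_pages = [u for u in uris if cleaned.search(u)]
print(search_pages)  # the first four URIs

# Optional group for truncated words.
months = ["Sep", "September", "sep", "Sept"]
sep_matches = [m for m in months if re.fullmatch(r"[sS]ep(tember)?", m)]
print(sep_matches)  # ['Sep', 'September', 'sep']
```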
Follow Along
In the video (at the 19:20 min mark) I introduce the parentheses. You can use the list below to follow along:
facebook.com
search.yahoo.com
huffpo.com
search.ask.com
pinterest.com
search.aol.com
search.xfinity.com
Caveat
In Google Analytics, you don’t need to tack a (.*) to the end of your patterns to catch string characters in the caboose. The report filter treats regex as a contains filter on ‘roids. But some tools explicitly require the wildcard characters to account for string characters you haven’t included in your regex. So user beware.
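The “contains” versus “whole string” difference shows up in Python too, which makes it a decent stand-in for this caveat: re.search behaves like a contains filter, while re.fullmatch insists the pattern account for every character. The pattern below is just an example using the list above.

```python
import re

hosts = ["facebook.com", "search.yahoo.com", "huffpo.com", "search.ask.com",
         "pinterest.com", "search.aol.com", "search.xfinity.com"]

pattern = r"search\.(yahoo|aol)"

# Contains-style matching: no trailing wildcard needed.
contains = [h for h in hosts if re.search(pattern, h)]
# Whole-string matching: the leftover ".com" makes every match fail...
whole = [h for h in hosts if re.fullmatch(pattern, h)]
# ...unless you account for the caboose with .*
whole_with_wildcard = [h for h in hosts if re.fullmatch(pattern + r".*", h)]

print(contains)             # ['search.yahoo.com', 'search.aol.com']
print(whole)                # []
print(whole_with_wildcard)  # ['search.yahoo.com', 'search.aol.com']
```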
Dollar Sign ($)
What It Does
The dollar sign character means that your string must end at that point. For example, health insurance$ matches cheap health insurance but not health insurance rates. Or you could attach a $ to the end of a URL to prevent that URL with any query strings from being included in your match. Or at the end of a directory to analyze only traffic to your category pages and not their child pages. (I demonstrate the latter in the video.)
I really look forward to demonstrating how you can use regex to search and replace. It’s tricky but very empowering once you learn the essentials because most tools that support regex support this ability. You will use the $ a lot when you power up to replacing with regex.
Follow Along
In the video (at the 22:08 min mark) I introduce the dollar sign. You can use the list below to follow along:
/blog/google-docs/how-to-import-one-spreadsheet-into-another-in-google-drive-video/
/blog/
/guides/definitive-guide-campaign-tagging-google-analytics/
/services/
/comprehensive-self-guided-site-audit-checklist/
/resources/
/blog/analytics/referral-exclusion-list-google-analytics-explained/
/blog/excel-tips/formatting-dates-in-excel/
/about/
/services/analytics-audits/
/about-me/
Caveat
Just because a $ means the end of a line, it doesn’t necessarily mean the end of your regex. For example, you could have an expression that looks like ^Los Angeles$|^New York$|^Chicago$. (This would filter a report down to just the three largest cities in the US.)
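Here’s a short Python sketch of the dollar sign at work, using the re module. The URIs come from the list above; the keyword and city lists are made up around the examples in this section.

```python
import re

terms = ["cheap health insurance", "health insurance rates", "health insurance"]
ends_with = [t for t in terms if re.search(r"health insurance$", t)]
print(ends_with)  # ['cheap health insurance', 'health insurance']

uris = ["/blog/google-docs/how-to-import-one-spreadsheet-into-another-in-google-drive-video/",
        "/blog/", "/services/",
        "/blog/analytics/referral-exclusion-list-google-analytics-explained/",
        "/blog/excel-tips/formatting-dates-in-excel/", "/about/"]

# Without $: the blog directory AND its child pages.
blog_all = [u for u in uris if re.search(r"^/blog/", u)]
# With $: just the directory page itself.
blog_only = [u for u in uris if re.search(r"^/blog/$", u)]
print(len(blog_all), blog_only)  # 4 ['/blog/']

# Each option carries its own ^ and $.
cities = ["Los Angeles", "East Los Angeles", "New York", "New York Mills", "Chicago"]
big_three = [c for c in cities
             if re.search(r"^Los Angeles$|^New York$|^Chicago$", c)]
print(big_three)  # ['Los Angeles', 'New York', 'Chicago']
```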
Utterly Ridiculous Mnemonic Device (That Works)
I came up with this when I first started learning hierogl– regex. But you have to promise not to laugh.
Promise? 🤨
Okay, I thought of how you lead someone with a carrot (I know it’s a different spelling—work with me 🙄) by putting it out in front and how at the end of the day it’s all about the money. So the ^ goes in front in a regex expression and the $ at the end.
Yeah, yeah, go ahead and laugh (promise breaker). But I guarantee you’ll remember next time.
Square Brackets + Caret ( [^ ])
What It Does
If you toss a caret into your square brackets (as the first character), it will exclude whatever else is in the square brackets. So b[^a]t will match bit, bet, bot, and but but not bat. As with the square brackets sans the caret, you don’t separate these characters in any way. Just shove them into the elevator together.
Follow Along
In the video (24:51 min mark), I use the list below to identify phone numbers:
325-678-3892
89-2784-09
578-487-89921
(202) 893-2749
98-36489032
813-234-9569
Caveat
As I wrote above, in the Square Brackets section, you need to be careful if you want to exclude the literal caret character. You’ll either need to escape it or make sure it doesn’t directly follow the opening ^ that does the excluding. The hyphen needs similar care: between two characters it indicates a range, so park it first or last. So if you wanted to exclude the caret character along with the hyphen and asterisk in your regex, you could write [^\^*-] or [^-*^].
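A tiny sketch of the excluding caret in Python’s re module, using the b[^a]t example with a couple of extra made-up entries to show the edges.

```python
import re

words = ["bat", "bit", "bet", "bot", "but", "bt", "b9t"]  # last two are made up

# b, then any ONE character that isn't "a", then t.
not_a = [w for w in words if re.fullmatch(r"b[^a]t", w)]
print(not_a)  # ['bit', 'bet', 'bot', 'but', 'b9t']
```

Two things worth noticing: “bt” fails because the brackets still demand one character, and “b9t” passes because “not a” includes digits and symbols, not just other letters.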
Whitespace (\s)
What It Does
The whitespace metacharacter matches any one whitespace character. I use it most commonly to match an actual space, but it will also match the tab (\t), new line (\n), and carriage return (\r). (It also matches the form feed and vertical tab, but I’ve never had to use those options as an analyst.)
Follow Along
In the video (29:26 min mark), I use the same phone number list above.
Caveat
If you only need to match a space between words, you can just drop a space into your regex. Watch out for those Boomers and their double spaces between sentences though. (Oh HEY, Boomers! 😘)
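Here’s a hedged sketch of \s in Python’s re module. The phone pattern is one way to catch the parenthesized number in the list; the sentences are invented to show the double-space problem.

```python
import re

numbers = ["325-678-3892", "89-2784-09", "578-487-89921",
           "(202) 893-2749", "98-36489032", "813-234-9569"]

# Escaped parentheses around the area code, then one whitespace character.
paren_style = [n for n in numbers if re.fullmatch(r"\(\d{3}\)\s\d{3}-\d{4}", n)]
print(paren_style)  # ['(202) 893-2749']

# Invented sentences: single space, double space, and a tab.
sentences = ["End. Start", "End.  Start", "End.\tStart"]
literal_space = [s for s in sentences if re.fullmatch(r"End\. Start", s)]
flexible = [s for s in sentences if re.fullmatch(r"End\.\s+Start", s)]
print(len(literal_space), len(flexible))  # 1 3
```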
Testing Your Regex
The best part of Google Analytics is every report comes with a line-item filter. And that filter is sensitive to regex. Previously, you would need to select Matching RegExp for it to recognize it; now you can just enter your regex into the filter, and you’re good to go.
So if I’m writing regex to capture a group of pages to concatenate in a segment to analyze, I’ll fire up a content report and paste my regex into the filter. If all of my pages are present and accounted for, I’m golden. It’s a real time saver.
That said, if you’re brand new to regex and want to test your code, I highly recommend using a regex helper like regexr.com (what I used in my tutorial) or regex101.com.
More Practice
The rest of the video tutorial is an opportunity to practice your regex with more lists. I’ll drop them below:
Britney Spears Practice
34:21 min mark
Britney Spears
Brittany Speers
Britanni Spers
brittany spears
Britany Spears
Britani Speres
Brittny Spears
britanni speers
brtany spears
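If you get stuck, here’s one possible pattern for this list, sketched in Python’s re module. It’s my own attempt for illustration and not necessarily the solution from the video; the last two names are made-up decoys to check that it doesn’t match too much.

```python
import re

names = ["Britney Spears", "Brittany Speers", "Britanni Spers",
         "brittany spears", "Britany Spears", "Britani Speres",
         "Brittny Spears", "britanni speers", "brtany spears",
         "Brad Spears", "Britney Houston"]  # last two are decoys

# Optional i, one or two t's, optional a, one or more n's, three endings,
# then a loosely matched last name that starts with Sp and ends in s.
pattern = re.compile(r"[Bb]ri?tt?a?n+(ey|y|i) [Ss]p\w+s")
matched = [n for n in names if pattern.fullmatch(n)]

print(len(matched))  # 9
```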
Identify URIs with Query Parameters
38:21 min mark
You’ll want to either drop a group of URLs with query parameters into regexr.com or open your All Pages report (Behavior > Site Content).
Filter for Site Search Terms with Three Words
41:23 min mark
You’ll want to either drop a group of multi-word terms into regexr.com or open your Behavior > Site Search > Search Terms. (Alternatively, you could pull these from any keyword tool, like Search Console, Ahrefs, etc.)
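As a starting point, here’s one way to express “exactly three words” in Python’s re module. The search terms are made up, and the pattern assumes words contain only letters, digits, or underscores (so a hyphenated term wouldn’t count as a word here).

```python
import re

# Made-up search terms.
terms = ["regex", "google analytics", "google analytics filters",
         "how to use regex", "site audit checklist"]

# word, whitespace, word, whitespace, word, and nothing else.
three_words = [t for t in terms if re.fullmatch(r"\w+\s\w+\s\w+", t)]
print(three_words)  # ['google analytics filters', 'site audit checklist']
```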
Staging Subdomains
43:08 min mark
www.mydomain.com
staging.mydomain.com
blog.mydomain.com
production.mydomain.com
store.mydomain.com
login.mydomain.com
Extract Zip Codes
44:14 min mark
1367 Misty Ridge Ct Hampton, GA 30228-8456
6489 M 40 Lawton, MI 49065
3360 Woods Ln Callahan, FL 32011
378 Country Side Ln #UNT 2 Albany MN 56307
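Here’s a minimal sketch for pulling the zip codes out of these addresses with Python’s re module. It assumes the zip code is always the last thing in the cell, which holds for this list but may not for yours.

```python
import re

addresses = ["1367 Misty Ridge Ct Hampton, GA 30228-8456",
             "6489 M 40 Lawton, MI 49065",
             "3360 Woods Ln Callahan, FL 32011",
             "378 Country Side Ln #UNT 2 Albany MN 56307"]

# Five digits, an optional hyphen plus four digits, then the end of the string.
zip_pattern = re.compile(r"\d{5}(-\d{4})?$")
zips = [zip_pattern.search(a).group(0) for a in addresses]

print(zips)  # ['30228-8456', '49065', '32011', '56307']
```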
Y No Regex in Excel?
A common frustration I had for a long time was that I couldn’t use regex in Excel. I could in Word but not Excel. Go figure. You can use a plugin like the SeoTools plugin or do all your regex in Google Docs and bring it back into Excel or (my personal fave) use advanced filters in Excel. They actually give you more options than regex and are easier to master.
Sam says
Fantastic, I’ve been meaning to learn this for sometime but was always put off by these mega guides not aimed at marketers. I just wanna perform some regex on Google Analytics & Screaming Frog. So this is great 🙂
Annie Cushing says
This is exactly why I did this guide. I’ll be doing videos dedicated to each of those tools, so make sure you’ve subscribed and click the bell for notifications, if you want to follow along. I really look forward to demystifying the extract feature that regex offers in all of these tools.