31 May How to Get Started with Regular Expressions in Tableau
In this tutorial, we’ll look at how we can use Regular Expressions in Tableau. In our last tutorial, we discussed how we can use Regular Expressions in Python. Just like we learnt earlier, Regular Expression is used for extracting data element from a string of data.
The common application of Regular Expression is website and mobile apps. The use of Regex here is basically for input validation. Many applications can validate if what a user enters matches the back end’s required structure. An example is our email address, which the system checks to see if we typed it correctly.
Now, let’s look at the Metacharacters classes which build up our knowledge foundation in Regex.
Regular Expression Metacharacters Classes
The Metacharacters are classes useful in defining rules to find a specific pattern in a string. These classes do have a special meaning in the REGEX Engine.
Few examples of the Metacharacters classes are listed below.
Now that we’ve discussed the Metacharacters classes, this gives us an overview of how to interpret or define our Regex expressions.
In Tableau, we have four different REGEXP functions, and they are;
- REGEXP_EXTRACT (string, pattern)
- REGEXP_EXTRACT_NTH (string, pattern, index)
- REGEXP_MATCH (string, pattern)
- REGEXP_REPLACE (string, pattern, replacement)
Let’s discuss how each can be used and their various examples.
REGEXP_EXTRACT (string, pattern): The REGEXP_EXTRACT is used for searching a particular pattern within a string or substring, and then extract the data element which is our output.
Example:
We have students’ descriptions with their names, subjects, and ages in the table below. We’ll demonstrate how we can use REGEXP_EXTRACT to extract each student’s first and last name.
Extract Last Name
To extract their last name, we will use the code below. In this code, we observed that the pattern (\w+) is a rule from the metacharacters we explained earlier. This is why understanding the metacharacters helps a lot in writing our regex.
1 REGEXP_EXTRACT([Text], '(\w+)')
The function of the pattern (\w+) here tells us the parenthesis acts as our capture group, and the backslash describes the starting point to start with.
Now the “w” is matching on a word character and alphanumeric values. The function of the “+ “ symbol means to extract all characters once and stop when it finds a next character which isn’t a word. This is similar to what we have after each last name is a comma.
As seen below, we should have our desired text extracted with this code.
Extract First Name
Let’s extract our first name . We’ll also define the right pattern we need to extract our data correctly. We have the code we’ll use for this below.
1 REGEXP_EXTRACT([Text], ',\s+(\w+)')
In this code, we observed it’s quite different from what we used to extract the last name. The reason is where the first name is located. Look at this first string, “Cooper, Elizabeth is a student of Account, Age: 15”.
Elizabeth, the first name is located after a comma and a space. This constraint has to be dealt with to extract our data correctly. So with our pattern (R EGEXP_EXTRACT([Text], ‘,\s+(\w+)')
, we first have the “,” whereby it is to search for the comma. Also, the “\s+” is for searching for space, and the plus symbol is used for any double space. Then we use the (\w+) to extract our word.
REGEXP_EXTRACT_NTH (string, pattern, index)
This is used to search for a particular pattern within a string or substring, starting at the nth position in the string, and extracting the data element.
Example:
String= ‘abc 123’
To extract 123 out, we use the pattern ‘([a-z]+)\s+(\d+)’ and also specify the item index position
REGEXP_EXTRACT_NTH('abc 123', '([a-z]+)\s+(\d+)', 2) = '123'
REGEXP_MATCH (string, pattern):
The REGEXP_MATCH is used to look for a particular pattern within a string or substring and returns TRUE if we have an exact match.
Example:
In the table below, we want to search if the Student’s Details contains an Email Address.
We will specify our pattern as seen below to check for all email address in the data.
1 REGEXP_MATCH 2 ( 3 [Student Details], 4 '([A-Za-Z0-9 ._% + -] + @ [A-Za-Z0-9 .-] + \. [Az] {2,4})' 5)
We should have this result table;
REGEXP_REPLACE (string, pattern, replacement)
This function is used for searching for a particular pattern within a string or substring and replacing that data element with a different data element.
Example:
In this table below we want to replace all this email address with a space.
To handle this, we’ll use the code below, which has same pattern we used earlier to search for email addresses. Whenever it finds the email address in the text, it replaces it with a space.
1 ( 2 [Messy Data], 3 '([A-Za-Z0-9 ._% + -] + @ [A-Za-Z0-9 .-] + \. [Az] {2,4})', 4 '' 5 )
We should have our output as seen below;
Hope you enjoyed this post. Thanks for reading.
No Comments