What ChatGPT Has Taught Me About Regexes
ChatGPT is so helpful. When I don’t know how to do something I usually ask ChatGPT for a solution before trying to think of one myself. Especially with anything about Unicode or XML. Because have you read the XML specification? It’s like a novel. No one’s got time for that.
Unicode is hard and most of the time ChatGPT is very helpful. Not always, but usually it is. And one thing ChatGPT has been teaching me recently is regular expressions. I know what you’re thinking: what is there to know about regular expressions? Well:
1. Named Groups
Regular expressions don’t just match text. They can also pull out the portions that matched, using capture groups. Named groups are capture groups with a label.
They allow you to assign a name to a specific portion of a match. By doing so, you can easily reference and manipulate these specific parts of the match in your code. To create a named group, you use the syntax (?<name>pattern).
For example, suppose you want to extract the date from a string in the format of YYYY-MM-DD. You can use a regular expression with named groups to capture the year, month, and day as separate groups. Here’s an example regular expression:
RegExp(r'^(?<year>\d{4})-(?<month>\d{2})-(?<day>\d{2})$')
In this regular expression, the ^ and $ anchors make sure the whole string is a date, and each \d{n} matches exactly n digits.
With this regular expression, we can easily extract the year, month, and day using the named groups like this:
final match = RegExp(r'^(?<year>\d{4})-(?<month>\d{2})-(?<day>\d{2})$')
    .firstMatch('2023-03-15');
// firstMatch returns null when the input does not match, so match! throws on bad input.
final year = int.parse(match!.namedGroup('year')!);
final month = int.parse(match.namedGroup('month')!);
final day = int.parse(match.namedGroup('day')!);
print('$year-$month-$day'); // Output: 2023-3-15
This is used in my language learning app Litany. I ask GPT-3.5 to give me synonyms in the form of ‘Synonyms: a, b, c’ but sometimes it gives them to me in the form:
Synonyms - a - b - c
OK, whatever. Just use a regex to convert it. Actually, I’m not using the named part of the named groups. I’m just using groups. It’s the same idea though. You just omit the ?<name> section and get it using .group(1).
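Here is a rough sketch of what that conversion could look like with a plain numbered group. This is not the actual code from Litany: the helper name normalizeSynonyms and the exact patterns are assumptions made up for illustration, and it assumes the unwanted form always starts with the word Synonyms followed by a hyphen.

```dart
/// Hypothetical helper: rewrites 'Synonyms - a - b - c' as 'Synonyms: a, b, c'.
/// Input that does not look like the hyphen form is returned trimmed, unchanged.
String normalizeSynonyms(String reply) {
  final trimmed = reply.trim();
  // Group 1 captures everything after the 'Synonyms' label and the first hyphen.
  // dotAll lets '.' cross line breaks in case the list is spread over lines.
  final listForm = RegExp(r'^Synonyms\s*-\s*(.+)$', dotAll: true);
  final match = listForm.firstMatch(trimmed);
  if (match == null) return trimmed;
  // Split on hyphens that have whitespace on both sides, so a hyphenated
  // word like 'well-known' stays in one piece.
  final items = match
      .group(1)!
      .split(RegExp(r'\s+-\s+'))
      .map((s) => s.trim())
      .where((s) => s.isNotEmpty);
  final joined = items.join(', ');
  return 'Synonyms: $joined';
}
```

Calling normalizeSynonyms('Synonyms - a - b - c') gives back 'Synonyms: a, b, c', and a reply that is already in the comma form passes through untouched.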
2. Unicode Properties
Also in Litany I have a list of punctuation characters so I can strip them out of the text. But apparently regexes have a feature that does this for you, called Unicode properties.
Unicode properties are a set of special character classes in regular expressions that match characters based on their Unicode properties. Using Unicode properties, you can match characters based on their script, general category, or specific properties like whether they are a letter, digit, or whitespace character.
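To use Unicode properties in a regular expression, you use the syntax \p{prop} or \P{prop}, where prop is the name of the Unicode property. The lowercase \p matches characters that have the property and the uppercase \P matches characters that don’t. Also you have to set unicode to true. That’s how it works in Dart, which is what Flutter uses; JavaScript has an equivalent u flag.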
I use this to fix a duplicate hyphen issue in my hyphenation package.
Specifically I use this:
final regex = RegExp(r'\p{P}', unicode: true);
There are tons of these available. There’s one for separators, upper case, lower case, math symbols, currency symbols, and even different scripts like Greek, Lao, and Tagalog. There’s a lot you can play with.
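To make that a bit more concrete, here is a small sketch of two ways these classes might be used. The helper names stripPunctuation and containsGreek are made up for this example and are not from Litany or the hyphenation package. It assumes a Dart version whose RegExp supports Unicode property escapes when unicode is true.

```dart
/// Hypothetical helper: removes every character in the Unicode
/// general category "Punctuation" from the text.
String stripPunctuation(String text) {
  // \p{P} only works as a property escape when unicode is true.
  final punctuation = RegExp(r'\p{P}', unicode: true);
  return text.replaceAll(punctuation, '');
}

/// Hypothetical helper: true if the text contains at least one
/// character from the Greek script.
bool containsGreek(String text) {
  final greek = RegExp(r'\p{Script=Greek}', unicode: true);
  return greek.hasMatch(text);
}
```

For example, stripPunctuation('Hello, world!') returns 'Hello world', and containsGreek('αβγ') is true while containsGreek('abc') is false. Symbols such as currency signs belong to a different category, so \p{P} on its own will not touch them.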
Final Thoughts
There are some other tricks ChatGPT showed me. Like ‘look ahead’ and ‘look behind’, non-capturing groups, and new lines. But these are pretty niche. I don’t see myself using them.
The two tricks above are useful though. I’m already using them. And maybe you can too.
THE ABOVE INFORMATION IS FOR ENTERTAINMENT PURPOSES ONLY. THE AUTHOR OF THIS BLOG POST ASSUMES NO LIABILITY FOR ANY DAMAGES ARISING FROM THE USE OF REGULAR EXPRESSIONS INCLUDING, BUT NOT LIMITED TO, INCORRECT OR MISSING RESULTS, DATA LOSS, DATA CORRUPTION, OR THERMONUCLEAR WAR. USE OF THE TECHNIQUES IN THIS BLOG POST IS ENTIRELY AT YOUR OWN RISK.
ChatGPT contributed to this post.
