Regex: Match A Specific Pattern, Exclude If Match Is In A Specific Context
I am a beginner with regex and would like to ask how this problem can be solved with it. At the moment I am trying to preprocess German text. German has a few specific characters in i
Solution 1:
You may use
import re
import pandas as pd
dct = {'ae' : 'ä', 'Ae' : 'Ä', 'oe' : 'ö', 'Oe' : 'Ö', 'ue' : 'ü', 'Ue' : 'Ü'}
df = pd.DataFrame({"text": ["Uebergang", "euer"]})
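# regex=True is needed: pandas 2.0+ defaults to regex=False and rejects a callable replacement in that mode
df['text'].str.replace(r'[AaÄäEe]ue|([aouAOU]e)', lambda x: dct[x.group(1)] if x.group(1) else x.group(), regex=True)
# => 0    Übergang
#    1        euer
#    Name: text, dtype: object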
The [AaÄäEe]ue|([aouAOU]e) pattern matches:
- [AaÄäEe]ue - A, a, Ä, ä, E or e followed by the ue substring
- | - or
- ([aouAOU]e) - Group 1: a, o, u, A, O or U and then e
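To see which alternative fires for a given word, the pattern can be inspected with the standard re module. This is only a sketch; the helper name describe_matches is made up for illustration and is not part of the answer above.

```python
import re

# Same pattern as in the answer above.
PATTERN = re.compile(r'[AaÄäEe]ue|([aouAOU]e)')

def describe_matches(word):
    """Hypothetical helper: list (matched text, Group 1) for every match in word."""
    return [(m.group(), m.group(1)) for m in PATTERN.finditer(word)]

# Group 1 is set only when the second alternative matched.
print(describe_matches("Uebergang"))  # [('Ue', 'Ue')]
print(describe_matches("euer"))       # [('eue', None)]
print(describe_matches("Mauer"))      # [('aue', None)]
```

When the first alternative matches (as in euer and Mauer), Group 1 stays empty, which is what lets the replacement step tell the two cases apart.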
The lambda x: dct[x.group(1)] if x.group(1) else x.group() expression is called for every match. If Group 1 matched, dct[x.group(1)] returns the replacement string. Otherwise the first alternative matched, and the whole match is put back unchanged.
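The same replacement logic can be written without pandas, using re.sub with a named function instead of a lambda. This is a minimal sketch; the names restore_umlaut and to_umlauts are assumptions chosen for illustration.

```python
import re

dct = {'ae': 'ä', 'Ae': 'Ä', 'oe': 'ö', 'Oe': 'Ö', 'ue': 'ü', 'Ue': 'Ü'}
PATTERN = re.compile(r'[AaÄäEe]ue|([aouAOU]e)')

def restore_umlaut(match):
    """Hypothetical callback: map Group 1 to its umlaut, else keep the match."""
    digraph = match.group(1)
    if digraph:
        return dct[digraph]   # second alternative matched: replace it
    return match.group()      # first alternative matched: put the text back

def to_umlauts(text):
    """Hypothetical wrapper around re.sub with the callback above."""
    return PATTERN.sub(restore_umlaut, text)

print(to_umlauts("Uebergang"))  # Übergang
print(to_umlauts("euer"))       # euer
print(to_umlauts("schoen"))     # schön
```

A named function behaves exactly like the lambda, but it is easier to test on single words before applying it to a whole column.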
Solution 2:
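This should do the trick for lowercase ue:
df["text"] = df["text"].str.replace(r"(^|[^AaÄäEe])ue", r"\1ü", regex=True)
Inside a character class, a leading '^' means "not", so [^AaÄäEe] matches any single character other than the ones listed. That character is part of the match, so it is captured in Group 1 and written back with \1. The ^ outside the brackets matches the start of the string, so a leading ue is replaced as well.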
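The same exclusion can also be expressed with a negative lookbehind, which checks the preceding character without making it part of the match. The following is a sketch with the standard re module, not the answer's own code; the function name replace_ue is hypothetical.

```python
import re

# (?<![AaÄäEe]) succeeds only if the character before "ue" is not in the class.
# It also succeeds at the start of the string, where nothing precedes "ue".
UE_NOT_AFTER_VOWEL = re.compile(r'(?<![AaÄäEe])ue')

def replace_ue(text):
    """Hypothetical helper: replace lowercase 'ue' unless it follows a, ä or e."""
    return UE_NOT_AFTER_VOWEL.sub('ü', text)

print(replace_ue("fuer"))   # für
print(replace_ue("euer"))   # euer
print(replace_ue("ueber"))  # über
```

Because the lookbehind consumes nothing, no capture group or backreference is needed in the replacement string.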