Remove All Replicas Of A String More Than X Characters Long (regex?)
Solution 1:
regexps are not the right tool for that task. They are based on the theory of context free languages, and they can't match if a string contains duplicates and remove the duplicates. You may find a course on automata and regexps interesting to read on the topic.
I think Josay's suggestion can be efficient and smart, but I think I got a more simple and pythonic solution, though it has its limits. You can split your string into a list of lines, and pass it through a set()
:
>>>s = """I would like this...text to be......reduced...I would like this...text to be......reduced""">>>print"\n".join(set(s.splitlines()))
I would like this
text to be
reduced
>>>
The only thing with that solution is that you will loose the original order of the lines (the example being a pretty counter example). Also, if you have the same line in two different contexts, you will end up having only one line.
- To fix the first problem, you may have to then iterate over your original string a second time to put that set back in order, or simply use an ordered set.
- If you got any symbol separating each slide, it would help you merge only the duplicates, fixing the second problem of that solution.
Otherwise a more sophisticated algorithm would be needed, so you can take into account proximity and context. For that a suffix tree could be a good idea, and there are python libraries for that (cf that SO answer).
edit:
using your algorithm I could make it work, by adding support of multiline and adding spaces and endlines to your text matching:
>>> re.match(r"([\w \n]+)\n\1", string, re.MULTILINE).groups()
('I would like this\ntext to be\n\nreduced',)
Though, afaict the \1
notation is not a regular regular expression syntax in the matching part, but an extension. But it's getting late here, and I may as well be totally wrong. Maybe shall I reread those courses? :-)
I guess that the regexp engine's pushdown automata is able to push matches, because it is only a long multiline string that it can pop to match. Though I'd expect it to have side effects...
Post a Comment for "Remove All Replicas Of A String More Than X Characters Long (regex?)"