Remove All Replicas Of A String More Than X Characters Long (regex?)

I'm not certain that regex is the best approach for this, but it seems to be fairly well suited. Essentially I'm currently parsing some pdfs using pdfminer, and the drawback is tha

Solution 1:

Regexps, in the formal sense, are not the right tool for this task. They are based on the theory of regular languages, which cannot express "the same arbitrary substring appears twice", so a pure regular expression cannot detect duplicates, let alone remove them. You may find a course on automata and regexps interesting to read on the topic.

I think Josay's suggestion can be efficient and smart, but I think I have a simpler and more Pythonic solution, though it has its limits. You can split your string into a list of lines and pass it through set():

>>> s = """I would like this
... text to be
...
... reduced
... I would like this
... text to be
...
... reduced"""
>>> print("\n".join(set(s.splitlines())))
I would like this

text to be
reduced
>>>

The main drawback of this solution is that you lose the original order of the lines (set iteration order is arbitrary, so the order of the output above is not guaranteed). Also, if the same line appears in two different contexts, only one copy of it will survive.

  • To fix the first problem, you can iterate over your original string a second time to put the set's contents back in order, or simply use an ordered set.
  • If you have any symbol separating the slides, it would help you merge only the true duplicates, which fixes the second problem.
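The two fixes above can be sketched as follows. This is only a minimal illustration: the function names are made up for the example, and using the form feed character "\f" as the slide separator is an assumption about how your extracted text is delimited, so adjust it to whatever symbol you actually have.

```python
def dedupe_lines(text):
    """Drop repeated lines, keeping the first occurrence of each, in order."""
    # dict.fromkeys keeps insertion order (guaranteed since Python 3.7),
    # so it behaves like a simple ordered set.
    return "\n".join(dict.fromkeys(text.splitlines()))


def dedupe_slides(text, separator="\f"):
    """Drop slides that exactly repeat an earlier slide.

    The separator is hypothetical: "\f" is only an assumed slide delimiter.
    Lines repeated inside a single slide are left alone.
    """
    slides = text.split(separator)
    return separator.join(dict.fromkeys(slides))


s = ("I would like this\ntext to be\n\nreduced\n"
     "I would like this\ntext to be\n\nreduced")
print(dedupe_lines(s))
```

Working on whole slides rather than single lines means a line that legitimately appears in two different slides is kept in both, as long as the slides themselves differ.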

Otherwise a more sophisticated algorithm would be needed, one that takes proximity and context into account. For that, a suffix tree could be a good idea, and there are Python libraries for that (cf. that SO answer).

edit:

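Using your algorithm I could make it work, by letting the captured group span several lines, i.e. by adding spaces and newlines to the character class (note that re.MULTILINE only changes how ^ and $ match, so it is not actually needed here):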

>>> re.match(r"([\w \n]+)\n\1", string, re.MULTILINE).groups()
('I would like this\ntext to be\n\nreduced',)

Though, afaict, the \1 backreference used inside the pattern is not part of regular expressions in the formal sense, but an extension offered by the engine. But it's getting late here, and I may well be totally wrong. Maybe I should reread those courses? :-)

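My guess is that the engine can afford this because the captured group is just one long multiline string that it stores and then compares against the text that follows. Python's re is a backtracking matcher rather than a pure finite automaton, which is what makes backreferences possible. I'd expect that to come at a cost, though: backtracking over long inputs can get slow.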
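To actually remove the repeated block rather than just match it, the same backreference idea can be used with re.sub. The sketch below is one possible way to do it, not a definitive solution: the function name and the min_len parameter (the "X characters" from the question) are made up for the example, and it only handles copies that directly follow the original, separated by a single newline.

```python
import re


def remove_adjacent_repeats(text, min_len=10):
    """Collapse a block of at least min_len characters that is immediately
    followed (after a newline) by one or more exact copies of itself."""
    # (.{N,})     captures a block of at least N characters; re.DOTALL lets
    #             "." match newlines so the block can span several lines.
    # (?:\n\1)+   matches one or more copies of that block, each preceded
    #             by a newline.
    pattern = re.compile(r"(.{%d,})(?:\n\1)+" % min_len, re.DOTALL)
    return pattern.sub(r"\1", text)


s = ("I would like this\ntext to be\n\nreduced\n"
     "I would like this\ntext to be\n\nreduced")
print(remove_adjacent_repeats(s))
```

Because the engine has to backtrack to find where the repeated block ends, this can be slow on very long inputs, so for large documents a line-based or slide-based approach is probably the safer choice.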
