HTML: Fix quote pair RegEx matching for all quote types #6661

Th-Underscore · 2025-01-13T12:53:08Z

Checklist:

I have read the Contributing guidelines.

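The current quote matching only works for the first tuple, ('"', '"'), because of how Match.group numbers capturing groups: each quote pair in the combined pattern gets its own set of group indices ([1, 2, 3], [4, 5, 6], etc.), so a match on any later pair does not populate groups 1 to 3. Additionally, the literal opening and closing quotes \u201C and \u201D are not detected.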
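To illustrate the group numbering described above, here is a minimal sketch. It is not the project's code: the two-pair pattern and the sample string are made up for the example. It only shows that, in an alternation, the groups of the alternative that did not match come back as None.

```python
import re

# Hypothetical two-pair pattern (not the project's actual pattern).
# Groups 1-3 belong to the straight-quote pair, groups 4-6 to the curly pair.
pattern = re.compile('(")(.*?)(")|(\u201C)(.*?)(\u201D)', re.DOTALL)

m = pattern.search('\u201Chello\u201D')

# Only the second alternative matched, so groups 1-3 are None and
# the text is found in groups 4-6.
first_set = (m.group(1), m.group(2), m.group(3))
second_set = (m.group(4), m.group(5), m.group(6))

print(first_set)   # (None, None, None)
print(second_set)  # ('“', 'hello', '”')
```

Code that always reads groups 1 to 3 would therefore only ever see text for the first pair.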

oobabooga · 2025-01-13T14:47:36Z

Thanks, this is indeed an issue. The problem with your solution is that it doesn't enforce matching pairs, so " can match with ’, which is less robust.

What do you think of this?

import re

def replace_quotes(text):
    # Define a list of quote pairs (opening and closing), using HTML entities
    quote_pairs = [
        ('&quot;', '&quot;'),
        ('&ldquo;', '&rdquo;'),
        ('&lsquo;', '&rsquo;'),
        ('&laquo;', '&raquo;'),
        ('&bdquo;', '&ldquo;'),
        ('&#8220;', '&#8221;'),
        ('&#x201C;', '&#x201D;'),
        ('\u201C', '\u201D'),
    ]

    # Build one alternation with three groups per quote pair (open, content, close);
    # the re.DOTALL flag passed to re.sub below lets .*? span newlines
    pattern = '|'.join(f'({re.escape(open_q)})(.*?)({re.escape(close_q)})' for open_q, close_q in quote_pairs)

    # Replace matched patterns with <q> tags, keeping original quotes
    def replacer(m):
        # Find the first non-None group set
        for i in range(1, len(m.groups()), 3):  # Step through each sub-pattern's groups
            if m.group(i):  # If this sub-pattern matched
                return f'<q>{m.group(i)}{m.group(i + 1)}{m.group(i + 2)}</q>'

        return m.group(0)  # Fallback (shouldn't happen)

    replaced_text = re.sub(pattern, replacer, text, flags=re.DOTALL)
    return replaced_text
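A quick way to check the behaviour of the function above is to run it on a few strings. The sketch below repeats the function in condensed form with only a subset of the quote pairs so that it runs on its own. The sample inputs and the QUOTE_PAIRS name are made up for illustration.

```python
import re

# Subset of the quote pairs from the snippet above, to keep the example short.
QUOTE_PAIRS = [
    ('&quot;', '&quot;'),
    ('&ldquo;', '&rdquo;'),
    ('\u201C', '\u201D'),
]


def replace_quotes(text, quote_pairs=QUOTE_PAIRS):
    # One alternative per pair, three capturing groups per alternative.
    pattern = '|'.join(
        f'({re.escape(o)})(.*?)({re.escape(c)})' for o, c in quote_pairs
    )

    def replacer(m):
        # Step through the group triples and use the one that matched.
        for i in range(1, len(m.groups()), 3):
            if m.group(i):
                return f'<q>{m.group(i)}{m.group(i + 1)}{m.group(i + 2)}</q>'
        return m.group(0)

    return re.sub(pattern, replacer, text, flags=re.DOTALL)


# A matching pair is wrapped in <q> tags.
matched = replace_quotes('She said &ldquo;hi&rdquo;.')
# re.DOTALL lets a quotation span a newline.
multiline = replace_quotes('\u201Cline one\nline two\u201D')
# An opening quote with a closing quote from a different pair is left alone.
mismatched = replace_quotes('&quot;hi&rdquo;')

print(matched)
print(mismatched)
```

The last case is the pair enforcement mentioned above: an opening quote is only wrapped when the closing quote from the same tuple follows it.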

Th-Underscore · 2025-01-13T14:50:57Z

The solution I originally had in mind for that was:

    for open_q, close_q in quote_pairs:
        # Create a regex pattern that matches each of the quote pairs, including newlines
        pattern = f'({re.escape(open_q)})(.*?)({re.escape(close_q)})'
        # Replace matched patterns with <q> tags, keeping original quotes
        text = re.sub(pattern, lambda m: f'<q>{m.group(1)}{m.group(2)}{m.group(3)}</q>', text, flags=re.DOTALL)
    return text

Yours takes better advantage of re and is more pythonic, so I think I'd go with yours.
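For comparison, here is a sketch that puts the per-pair loop and the single combined pattern side by side. The function names, the two-pair list, and the nested sample input are assumptions made for this example. It shows one observable difference on that input and is not meant as a full evaluation of either approach.

```python
import re

# Hypothetical two-pair list used only for this comparison.
PAIRS = [('&quot;', '&quot;'), ('&ldquo;', '&rdquo;')]


def replace_quotes_loop(text, quote_pairs=PAIRS):
    # One re.sub pass per quote pair, as in the loop above.
    for open_q, close_q in quote_pairs:
        pattern = f'({re.escape(open_q)})(.*?)({re.escape(close_q)})'
        text = re.sub(pattern, lambda m: f'<q>{m.group(0)}</q>', text, flags=re.DOTALL)
    return text


def replace_quotes_single(text, quote_pairs=PAIRS):
    # One pass over the text with all pairs joined into a single alternation.
    pattern = '|'.join(
        f'({re.escape(o)})(.*?)({re.escape(c)})' for o, c in quote_pairs
    )
    return re.sub(pattern, lambda m: f'<q>{m.group(0)}</q>', text, flags=re.DOTALL)


nested = '&quot;a &ldquo;b&rdquo; c&quot;'

# The loop revisits the text for every pair, so the inner quotation is wrapped too.
loop_result = replace_quotes_loop(nested)
# The single pass consumes the outer quotation and never revisits its contents.
single_result = replace_quotes_single(nested)

print(loop_result)
print(single_result)
```

On this input the loop produces nested <q> tags, while the single pass wraps only the outer quotation. Which result is preferable depends on how the renderer styles <q> elements.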

Th-Underscore · 2025-01-13T15:32:33Z

Whoops, Git mistake. Not sure what happened. I'll try to resolve.

oobabooga · 2025-01-13T21:01:39Z

Looks good now. Thanks a lot for this fix, @Th-Underscore -- if you notice anything else weird with the markdown renderer, feel free to send new PRs.

Th-Underscore closed this Jan 13, 2025

Th-Underscore force-pushed the dev branch from 4d8a694 to c85e5e5 Compare January 13, 2025 15:20

HTML: Fix quote pair RegEx matching for all quote types

d1da45a

Th-Underscore reopened this Jan 13, 2025

oobabooga merged commit 53b838d into oobabooga:dev Jan 13, 2025

jfmherokiller pushed a commit to jfmherokiller/text-generation-webui that referenced this pull request Jan 15, 2025

HTML: Fix quote pair RegEx matching for all quote types (oobabooga#6661)

f887e35

Th-Underscore deleted the dev branch January 26, 2025 03:07
