HTML: Fix quote pair RegEx matching for all quote types #6661

Merged: 1 commit, Jan 13, 2025

Conversation

Th-Underscore
Contributor

The current quote matching only works for the first tuple, ('"', '"'), because of how Match.group numbers the capture groups of the combined pattern: each tuple gets its own set of groups ([1, 2, 3], [4, 5, 6], etc.). Additionally, the literal opening and closing quotes \u201C and \u201D are not detected.
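
To illustrate the group numbering described above, here is a minimal sketch. The two quote pairs and the sample string are made up for illustration and are not taken from the PR diff.

```python
import re

# Two alternated sub-patterns, each with three capture groups.
# (Illustrative pairs only, not the project's full list.)
pairs = [('"', '"'), ('«', '»')]
pattern = '|'.join(
    f'({re.escape(o)})(.*?)({re.escape(c)})' for o, c in pairs
)

m = re.search(pattern, 'He said «bonjour».')

# Groups 1-3 belong to the first alternative, which did not match here.
first_alternative = (m.group(1), m.group(2), m.group(3))
# Groups 4-6 belong to the second alternative, which did match.
second_alternative = (m.group(4), m.group(5), m.group(6))

print(first_alternative)   # (None, None, None)
print(second_alternative)  # ('«', 'bonjour', '»')
```

A replacement callback that only reads groups 1 to 3 would therefore see None for every pair except the first.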

@oobabooga
Owner

Thanks, this is indeed an issue. The problem with your solution is that it doesn't enforce matching pairs, so " can match with ’, which is less robust.

What do you think of this?

import re

def replace_quotes(text):
    # Define a list of quote pairs (opening and closing), using HTML entities
    quote_pairs = [
        ('&quot;', '&quot;'),
        ('&ldquo;', '&rdquo;'),
        ('&lsquo;', '&rsquo;'),
        ('&laquo;', '&raquo;'),
        ('&bdquo;', '&ldquo;'),
        ('&lsquo;', '&rsquo;'),
        ('&#8220;', '&#8221;'),
        ('&#x201C;', '&#x201D;'),
        ('\u201C', '\u201D'),
    ]

    # Create a regex pattern that matches any of the quote pairs, including newlines
    pattern = '|'.join(f'({re.escape(open_q)})(.*?)({re.escape(close_q)})' for open_q, close_q in quote_pairs)

    # Replace matched patterns with <q> tags, keeping original quotes
    def replacer(m):
        # Find the first non-None group set
        for i in range(1, len(m.groups()), 3):  # Step through each sub-pattern's groups
            if m.group(i):  # If this sub-pattern matched
                return f'<q>{m.group(i)}{m.group(i + 1)}{m.group(i + 2)}</q>'

        return m.group(0)  # Fallback (shouldn't happen)

    replaced_text = re.sub(pattern, replacer, text, flags=re.DOTALL)
    return replaced_text

@Th-Underscore
Contributor Author

The solution I originally had in mind for that was:

    for open_q, close_q in quote_pairs:
        # Create a regex pattern that matches each of the quote pairs, including newlines
        pattern = f'({re.escape(open_q)})(.*?)({re.escape(close_q)})'
        # Replace matched patterns with <q> tags, keeping original quotes
        text = re.sub(pattern, lambda m: f'<q>{m.group(1)}{m.group(2)}{m.group(3)}</q>', text, flags=re.DOTALL)
    return text

Yours takes better advantage of re and is more pythonic, so I think I'd go with yours.
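
One behavioural difference between the two variants is worth noting. The sketch below uses a trimmed, illustrative two-pair list and hypothetical function names (`combined`, `per_pair`). The single alternation consumes the text in one left-to-right pass, so quotes nested inside an already matched pair are not wrapped again. The per-pair loop rescans its own output and can produce nested `<q>` tags.

```python
import re

# Trimmed, illustrative list; the real list in the thread is longer.
PAIRS = [('"', '"'), ('«', '»')]

def combined(text):
    # One alternation over all pairs, applied in a single pass.
    pattern = '|'.join(
        f'({re.escape(o)})(.*?)({re.escape(c)})' for o, c in PAIRS
    )
    return re.sub(pattern, lambda m: f'<q>{m.group(0)}</q>', text, flags=re.DOTALL)

def per_pair(text):
    # One substitution pass per pair; later passes see earlier <q> output.
    for o, c in PAIRS:
        pattern = f'({re.escape(o)})(.*?)({re.escape(c)})'
        text = re.sub(pattern, lambda m: f'<q>{m.group(0)}</q>', text, flags=re.DOTALL)
    return text

sample = '«a "b" c»'
print(combined(sample))  # <q>«a "b" c»</q>
print(per_pair(sample))  # <q>«a <q>"b"</q> c»</q>
```

Which result is preferable depends on how nested quotes should be styled; the thread does not discuss that case.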

@Th-Underscore
Contributor Author

Whoops, Git mistake. Not sure what happened. I'll try to resolve.

@Th-Underscore Th-Underscore reopened this Jan 13, 2025
@oobabooga
Owner

Looks good now. Thanks a lot for this fix, @Th-Underscore -- if you notice anything else weird with the markdown renderer, feel free to send new PRs.

@oobabooga oobabooga merged commit 53b838d into oobabooga:dev Jan 13, 2025
jfmherokiller pushed a commit to jfmherokiller/text-generation-webui that referenced this pull request Jan 15, 2025
@Th-Underscore Th-Underscore deleted the dev branch January 26, 2025 03:07
2 participants