Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support dedenting long-strings with Windows EOLs #1534

Merged
merged 1 commit into from
Dec 20, 2024

Conversation

pyrmont
Copy link
Contributor

@pyrmont pyrmont commented Dec 16, 2024

This PR fixes the dedenting behaviour of long-strings that use Windows EOLs (i.e. \r\n). It fixes #1502.

Background

Janet supports dedenting long-strings (i.e. strings delimited by one or more ` characters). This is particularly useful for docstrings where the docstring will usually be indented to preserve visual hierarchy.

As reported by @CFiggers in #1502, this practice ‘breaks’ with files that use \r\n as the EOL sequence. This is because Janet’s dedenting algorithm assumes the EOL sequence will be \n:

janet/src/core/parse.c

Lines 361 to 407 in 4daecc9

if (state->flags & PFLAG_LONGSTRING) {
/* Post process to remove leading whitespace */
JanetParseState top = p->states[p->statecount - 1];
int32_t indent_col = (int32_t) top.column - 1;
uint8_t *r = bufstart, *end = r + buflen;
/* Check if there are any characters before the start column -
* if so, do not reindent. */
int reindent = 1;
while (reindent && (r < end)) {
if (*r++ == '\n') {
for (int32_t j = 0; (r < end) && (*r != '\n') && (j < indent_col); j++, r++) {
if (*r != ' ') {
reindent = 0;
break;
}
}
}
}
/* Now reindent if able to, otherwise just drop leading newline. */
if (!reindent) {
if (buflen > 0 && bufstart[0] == '\n') {
buflen--;
bufstart++;
}
} else {
uint8_t *w = bufstart;
r = bufstart;
while (r < end) {
if (*r == '\n') {
if (r == bufstart) {
/* Skip leading newline */
r++;
} else {
*w++ = *r++;
}
for (int32_t j = 0; (r < end) && (*r != '\n') && (j < indent_col); j++, r++);
} else {
*w++ = *r++;
}
}
buflen = (int32_t)(w - bufstart);
}
/* Check for trailing newline character so we can remove it */
if (buflen > 0 && bufstart[buflen - 1] == '\n') {
buflen--;
}
}

By default, Git will use \r\n when cloning repositories on Windows even if the files in the repository used only \n. Git’s behaviour can be modified but requires affirmative action from the user.

Implementation

The dedenting algorithm in the src/core/parse.c file is modified to support \r\n as an EOL sequence. In essence, this is done by including a test after the for-loop that checks indentation. If the for-loop ended because it hit a \r character, reindent is set back to 1 (i.e. true) if the next character is a \n so that the while-loop continues.

The indenting tests that are in the test suite are expanded to ensure that long-strings that use \r\n as an EOL sequence work as intended.

Notes

As noted by @sogaiu in #1302, different approaches could be taken to normalising EOL sequences. This PR does not do this. If the long-string contained Windows-style EOL sequences, these are preserved in the dedented output.

@pyrmont
Copy link
Contributor Author

pyrmont commented Dec 16, 2024

I should have added that I was prompted to look into this again after it was reported in pyrmont/documentarian#13 that a template I define using a long-string was producing unexpected output on Windows.

There are workarounds to this particular issue but, from my perspective, it would be preferable to allow Windows users to enjoy the same dedenting behaviour that is available on other platforms.

@bakpakin
Copy link
Member

bakpakin commented Dec 18, 2024

I understand the frustration here, but the correct solution is to make sure your source control is not configured to mangle source files.

By default, Git will use \r\n when cloning repositories on Windows even if the files in the repository used only \n. Git’s behaviour can be modified but requires affirmative action from the user.

This is a far bigger problem. Even if the dedenting algorithm is updated in this particular case, you are still actually emitting string constants on Windows and Non-Windows platform, as we don't discard newlines, just move the whitespace around them. Fixing dedenting here is like hot gluing the bumper back on after you get into a wreck so you can drive home.

Languages like C that basically treat all whitespace the same get around this by simply not caring, but even then, it is a problem. Let's say as part of our build process we take a hash of our source code (we do this in Janet) and embed it in the binary - with the conversion behavior, we get different hashes on Windows and everything else!

You can add a .gitattributes to git repositories to prevent this behavior without users doing anything, or just manually strip /r from generated strings yourself.

EDIT:

Adding

*.janet text eol=lf

to .gitattributes should prevent this

@strangepete
Copy link
Contributor

strangepete commented Dec 18, 2024

I would like to suggest that .gitattributes should be added to jpm project templates, as well as all of the official Janet modules (I'll volunteer to help) as this will likely continue to be a surprise to new Windows users.

I am curious what the use case is for needing a binary embedded hash to match one running in a different OS?

Edit: as I was thinking about it, the first time I encountered long string indenting, I actually just assumed that was the default as the documentation does say "all bytes will be parsed literally until the ending delimiter is found" so this made sense that the indents would translate

Edit Edit: Humorously, I saw a rant on reddit from seven years ago suggesting that Git is last bastion of CRLF issues on Windows

@pyrmont
Copy link
Contributor Author

pyrmont commented Dec 18, 2024

@bakpakin wrote:

I understand the frustration here, but the correct solution is to make sure your source control is not configured to mangle source files.

The circumstances in which the issue becomes visible to non-Windows users is because of Git's default behaviour. I agree that if this were the fundamental issue, the correct solution would be to address Git's behaviour.

But this isn't the fundamental issue. If a Windows user creates their own project in Janet, never clones any source files and simply writes an indented long-string, they'll encounter this problem. Janet dedenting long-strings is one of those lovely quality of life decisions you've made that I sincerely believe all users should be able to enjoy regardless of the EOL sequence their platform uses.

Janet's Windows support is absolutely first class. I've moved away from Windows since growing up with it as a teenager but I remember trying to use scripting languages like Ruby and PHP and being bitterly disappointed how frequently things simply didn't work. I would have loved Janet back then. I think if we can maintain that support at minimal impact to the runtime's source we should.

@bakpakin
Copy link
Member

I guess the only reason not to make this change would be change existing code on windows, but frankly it seems like an innocuous change in almost all cases. And there could also be users who reject \n for newlines and insist on \r\n everywhere (perhaps they are working with HTTP and cgi).

For completeness, perhaps we should also recognize \r newlines.

@bakpakin bakpakin merged commit b2d2690 into janet-lang:master Dec 20, 2024
13 checks passed
@pyrmont pyrmont deleted the bugfix.windows-longstrings branch December 20, 2024 01:11
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Multi-line string docstrings seem broken on Windows
3 participants