Block quote marker choice uniformity #466
Require that the same block quote marker be used to avoid ambiguity in parsing strategy (compatible with the algorithm described in commonmark#460 (comment))
The line:
has the intent of making the following equivalent (both take interpretation of using marker type ((
> I would like to visit a castle in north scotland, next year.
>
> But my home is my castle
and
> I would like to visit a castle in north scotland, next year.
>•
> But my home is my castle
It could probably be rephrased |
very sorry typo
Can you add some negative examples that implementations can test against? Since this would be tightening up current rules, they will probably need to be modified. |
@ScottAbbey how does that look? |
I think I don't really understand this new rule yet, but does it mean that (in the worst case) the whole block quote has to be parsed before it can be decided if the marker is If yes, this would be a new limitation that hasn't been there before, and I'd have to rewrite my block parser. |
Yes, that's correct.
Seeing as this is a new limitation (and may require rewrites in a lot of places) we could opt for a similar approach that is opportunistic instead of minimal overall marker (difference highlighted here). Opportunistic being that instead of a uniform marker, we always continue using the previous marker (i.e. we use
Personally I'd prefer the consistency of the current proposed approach (minimal overall marker) so that visual indent always matches the interpreted indent. Opportunistic would solve some ambiguity that we currently experience though, so it would at least be a step in the right direction. (I suppose the decision may come down to whether it should be easy to parse, or easy to write).
Whichever approach is chosen, it should give a method of parsing that is consistent across all implementations (at present, a parser could conform to the spec by randomly choosing which marker to interpret as being prepended on each line of the same block quote). |
@aidantwoods Thanks for the answer! This would indeed solve the parsing problem, but it would make for quite confusing and hard to explain behavior. For example:
In this case, Which leads me to my alternative suggestion: We should declare The "quick reference" (http://commonmark.org/help/) already implies that there should be a space, and the tutorial (http://commonmark.org/help/tutorial/08-blockquotes.html) mentions an "optional" space. We could just drop the "optional" and everything would be fine. In the actual spec, we could specify that What about that? IMHO this would have some similarity to how indentation is handled inside of fenced code blocks (assuming the fence itself is indented by N = 1 space):
BTW, I'm a fan of the "readers perspective" (cf. #460 (comment)). IMHO It's much easier to understand if the spec says "remove |
This wouldn't affect my personal usage too much (I always use Also note that both minimal overall marker, and opportunistic marker approaches are tightening of the spec (so are valid already within the current spec), whereas dropping |
@aidantwoods I'm not talking about dropping Multiple nested block quotes wouldn't be any different. The "official" usage would be I don't think that any of the examples in the spec have to be changed, but probably the wording should be changed as to why those examples are allowed. And the text could contain a warning about possible problems when not using a space after What I'm suggesting doesn't require any change in the reference implementations, it's just a different "interpretation" of the current behavior. |
IMO if If we want to keep it, we should address the problems in the current spec. Btw, no examples in the spec need be changed with the proposed implementation (since it is contained in a subset of possible interpretations of the current spec). ;) |
I agree, but this might be the lesser evil. And it might actually happen very rarely (but I have no data to support that).
But this would break backwards compatibility. And as long as there are no lists (with ) inside the block quote, it's perfectly fine to be used (but I would still not recommend it in tutorials).
I agree, but I consider none of the discussed alternatives "grey areas". |
I think the only way to resolve this may be to try and gauge some consensus from the community. Couple points though:
The spec is too permissive at present (it does not define a unique mapping), so we necessarily break BC with something by fixing it. It is an unavoidable consequence. I think my order of preference for fix is:
I think that dropping the short marker is perhaps a reasonable solution, but I worry that allowing implementations to decide what to do goes against the aim of the spec (that you have guarantees that the same markdown will look the same when going through different compliant parsers). Spec should take a hard line on things when possible IMO |
Statistics on a large corpus of markdown files from GitHub: about 0.8% use the Is this a fourth option?
This option avoids having to scan the whole block quote -- which I really want to avoid. It is a kind of mix of 1 and 2, but I think it would give good results for all real-world cases. |
I like this idea, especially so since it addresses uniformity while not breaking the i.e. (with
has a singular interpretation with this new rule, but the writer may have started with
and prepended
and prepended This new rule assumes the latter case, and the only way to explain why is to say how the marker on the first line is interpreted (reader's perspective). This could be as simple as saying something like: "if there is ambiguity as to whether |
Good point. How about this.
A type (a) block quote marker is a
A type (b) block quote marker is a
If a sequence of lines Ls forms a sequence of blocks Bs, then the result of prepending type (a) block quote markers to each line in Ls forms a block quote with content Bs.
If a sequence of lines Ls forms a sequence of blocks Bs, then the result of prepending type (b) block quote markers to each line in Ls forms a block quote with content Bs.
If we have
then there's no way to prepend type (b) block quote markers to each line, since a |
Actually, we probably need something more like this:
A type (a) block quote marker is a > followed by a space.
A type (b) block quote marker is a > not followed by a space or a newline.
A type (c) block quote marker is a > followed by a newline.
If a sequence of lines Ls forms a sequence of blocks Bs, then the result of prepending type (a) or (c) block quote markers to each line in Ls forms a block quote with content Bs.
If a sequence of lines Ls forms a sequence of blocks Bs, then the result of prepending type (b) or (c) block quote markers to each line in Ls forms a block quote with content Bs.
The revision allows for:
|
To be explicit, this proposal will parse
as a block quote with contents |
This proposal sounds reasonable. My only question, what about blank lines? For example, I've used quite a lot
|
Blank lines are covered. According to this proposal, you can use either (type a and c) or (type b and c) markers. Type c markers are the ones followed by a newline. |
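A minimal sketch of the type (a)/(b)/(c) proposal, in Python. The names `marker_type` and `is_uniform` are hypothetical and not taken from any implementation. The sketch assumes every line of the candidate block quote starts at column 0, and it does not handle leading indentation, tabs, or lazy continuation lines.

```python
from typing import List, Optional


def marker_type(line: str) -> Optional[str]:
    """Classify the block quote marker at the start of a line.

    "a": ">" followed by a space
    "c": ">" followed by the end of the line
    "b": ">" followed by anything else
    None: the line does not start with ">"
    """
    if not line.startswith(">"):
        return None
    rest = line[1:].rstrip("\n")
    if rest == "":
        return "c"
    if rest.startswith(" "):
        return "a"
    return "b"


def is_uniform(lines: List[str]) -> bool:
    """True if all markers are of type (a) or (c), or all of type (b) or (c)."""
    types = {marker_type(line) for line in lines}
    if None in types:
        return False
    # Mixing (a) and (b) in one block quote is what the proposal rules out.
    return not {"a", "b"} <= types
```

Under this reading, a bare `>` line is compatible with either style, which is why blank quoted lines do not force a choice.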
Hm. But really, I don't think the spec is all too clear about why this is a block quote with content I am getting tempted to scrap the whole descriptive spec idea, and describe a parsing algorithm (similar to the HTML5 spec). It should be possible to describe the general block parsing strategy, in terms of a set of parameters (block start matcher, block continuation matcher, can it interrupt a paragraph?, does it allow laziness, etc.). Descriptions of individual block-level elements could then fill in these parameters. |
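As a rough illustration of the parameterised description mentioned above, the following Python sketch records the listed parameters in a data structure. `BlockRule` and `strip_quote_marker` are hypothetical names, and the matcher ignores the up-to-three-space indentation that a real block quote marker may have.

```python
from dataclasses import dataclass
from typing import Callable, Optional

# A matcher returns the rest of the line if it matches, otherwise None.
Matcher = Callable[[str], Optional[str]]


@dataclass
class BlockRule:
    name: str
    start: Matcher                  # block start matcher
    cont: Matcher                   # block continuation matcher
    can_interrupt_paragraph: bool
    allows_laziness: bool


def strip_quote_marker(line: str) -> Optional[str]:
    """Remove ">" and one optional following space (current behaviour)."""
    if line.startswith("> "):
        return line[2:]
    if line.startswith(">"):
        return line[1:]
    return None


BLOCK_QUOTE = BlockRule(
    name="block_quote",
    start=strip_quote_marker,
    cont=strip_quote_marker,
    can_interrupt_paragraph=True,
    allows_laziness=True,
)
```

The open question in this thread would then live entirely inside the two matchers: they decide how many characters are consumed on each line.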
@jgm Hooray! It's nice to hear that you finally are thinking about dropping this "writer's perspective" thing with its overly complicated and hard-to-understand rules. I'm very much in favor of switching to a spec that describes some kind of "abstract" parser. Implementers of actual parsers don't have to do exactly the same steps, but they should do something that behaves "as if" those steps would have been done (similar to the "as if" rule in C++: https://en.cppreference.com/w/cpp/language/as_if). |
So the rule for a block-quote parser would be to remove one or two characters from the beginning of each line and yield the rest of the line for further block processing. This issue discusses whether that should indeed be exactly one or rather two characters or some mixture of both.
I'll try to summarize the above suggestions, starting with the numbering used by @aidantwoods in #466 (comment).
option 1: find the minimal marker
disadvantage: The block cannot be parsed line-by-line anymore.
option 2: only long marker
A A special rule would have to be added to handle a
disadvantage: This would break backwards compatibility with Gruber's Markdown. Some old documents would suddenly have a few stray literal
Question: should this still work for nested block-quote markers without spaces?
Should this still be allowed? (I tend towards yes)
option 3: discourage short marker
This is basically the same as the current state, except the documentation is changed to strongly suggest using a space.
option 4: either long or short marker
Remove the space, unless none is available.
disadvantage: breaks lists if a list item is written without a space.
option 5: change marker adaptively
Remove the space from each line until a line has no space. Afterwards don't remove spaces.
disadvantage: Behavior changes within one block quote, which is quite confusing.
option 6: only short marker
This wasn't really proposed above, but I think it would also be an option. One could still use spaces, but those would count as additional indentation in the nested blocks.
disadvantage: Some indented code blocks would suddenly get a leading space where there wasn't one before.
I would still use spaces in all tutorials and examples. I think spaces look much better, but that doesn't mean that the block-quote parser has to remove them.
I hope I didn't forget some option that was mentioned above. Feel free to add further options and comments! |
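To compare the marker-stripping options side by side, here is a small Python sketch. All function names are hypothetical. It assumes every input line starts with `>` at column 0 and has no trailing newline, and it treats a bare `>` line as neutral (it neither forces nor forbids the long marker), which is an assumption the summary above does not spell out.

```python
from typing import List


def strip_minimal(lines: List[str]) -> List[str]:
    """Option 1: use "> " only if every non-bare line has the space."""
    use_long = all(l == ">" or l.startswith("> ") for l in lines)
    return [l[2:] if use_long and l.startswith("> ") else l[1:] for l in lines]


def strip_per_line(lines: List[str]) -> List[str]:
    """Option 4: on each line, remove the space if one is available."""
    return [l[2:] if l.startswith("> ") else l[1:] for l in lines]


def strip_adaptive(lines: List[str]) -> List[str]:
    """Option 5: remove the space until a line lacks it, then stop."""
    out = []
    use_long = True
    for l in lines:
        if use_long and l != ">" and not l.startswith("> "):
            use_long = False  # from here on only ">" is removed
        out.append(l[2:] if use_long and l.startswith("> ") else l[1:])
    return out


def strip_short(lines: List[str]) -> List[str]:
    """Option 6: always remove just the ">"."""
    return [l[1:] for l in lines]


mixed = ["> a", ">b", ">  c"]
print(strip_minimal(mixed))   # [' a', 'b', '  c']
print(strip_per_line(mixed))  # ['a', 'b', ' c']
print(strip_adaptive(mixed))  # ['a', 'b', '  c']
print(strip_short(mixed))     # [' a', 'b', '  c']
```

Note that `strip_minimal` has to look at all lines before it can emit the first one, which is the line-by-line parsing concern raised for option 1.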
Matthias Geier writes:
@jgm Hooray! It's nice to hear that you finally are thinking about dropping this "writer's perspective" thing with its overly complicated and hard-to-understand rules. I'm very much in favor of switching to a spec that describes some kind of "abstract" parser. Implementers of actual parsers don't have to do exactly the same steps, but they should do something that behaves "as if" those steps would have been done (similar to the "as if" rule in C++: https://en.cppreference.com/w/cpp/language/as_if).
It's a mistake to think that the complexity would
vanish if the spec were rewritten in this way.
Anyway, I don't think I have the energy or time for
such a project now, so trying to patch up what we
have may be the best approach.
|
Which situation is rarer? No spaces are used (this breaks in option 2):
List item without space (this breaks in option 2, 3 and 4):
Indented code block with 5 spaces (this grows an unwanted leading space in option 6):
Spaces followed by no spaces followed again by spaces (this behaves strangely in option 5 and has unwanted leading spaces in option 6)
Are there other situations I've forgotten? Leaving option 1 aside, I have the feeling that option 5 leads to the least breakage, but option 6 is easier to understand. Options 2, 3 and 4 lead to far more breakage. |
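The indented-code case can be made concrete with a short Python sketch. The helper names are hypothetical, and `indented_code_text` is a deliberate simplification: it only checks for four leading spaces and ignores everything else that decides whether a line is really an indented code block.

```python
from typing import Optional


def strip_current(line: str) -> str:
    """Current behaviour: drop ">" and one following space if present."""
    return line[2:] if line.startswith("> ") else line[1:]


def strip_option6(line: str) -> str:
    """Option 6: drop only the ">"."""
    return line[1:]


def indented_code_text(content: str) -> Optional[str]:
    """Return the code text if the content has 4+ leading spaces, else None."""
    if content.startswith("    "):
        return content[4:]
    return None


five = ">" + " " * 5 + "code"  # ">" followed by five spaces
four = ">" + " " * 4 + "text"  # ">" followed by four spaces

print(repr(indented_code_text(strip_current(five))))  # 'code'
print(repr(indented_code_text(strip_option6(five))))  # ' code'
print(repr(indented_code_text(strip_current(four))))  # None
print(repr(indented_code_text(strip_option6(four))))  # 'text'
```

With five spaces, option 6 keeps the block a code block but adds a leading space to its text. With four spaces, this simplified check would classify the line differently under the two rules.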
Are you able to craft a query on the large corpus of markdown files to see what effect "option 6: only short marker" would have? This should mostly affect situations where someone has started their line with the |
Option 6 is interesting. I guess you're right, it would only affect cases where the block quote encloses indented content: indented code blocks and lists. And it shouldn't greatly affect lists, because it would change the indentation of every line evenly. I tried modifying the cmark parser so it doesn't consume the space after
I'll look at the corpus with this in mind. |
In one small part of the corpus (4901 files):
I tried a few others; this seems fairly typical. So to estimate, with option 6,
|
@jgm Thanks for checking this out! I would be really interested in those 2 cases that have The test failure above is actually a bug (and not a problem with option 6). If only the BTW, I actually forgot an option in the list. It was suggested by @jgm in #466 (comment), right below the original 3 options. option 7: first line decides This sounds indeed quite practical, but I don't know how this is supposed to work with nested
Is the space in the third line supposed to be removed? What about:
Is the last line supposed to have literal |
It currently gets parsed as a regular paragraph in a block quote. The block quote marker consumes the first space, leaving three -- and a paragraph can have a three-space indent.
That's right. As for option 4:
Thinking about this in terms of the current spec, and not the "remove markers" idea (which is foreign to it), this would be parsed as a blockquote containing a paragraph "text" followed by a paragraph containing ">> what now?". The uniformity requirement would prohibit parsing the whole thing as one block quote, and the consecutiveness rule would forbid having two distinct block quotes without an intervening blank line. Admittedly this is something you might see in the wild, and this proposed interpretation is not going to be the expected one. So, perhaps that's a point in favor of the simple option 6 -- just letting the |
Here's the main thing that keeps me from embracing option 6. If you start with
and put it in a block quote with nice spacing:
Then you should, in my opinion, get a block quote with the same content as the original. |
^ totally agree with that - adding the blockquote with space should not change how the text inside the blockquote looks when rendered (inside a blockquote).
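The round-trip expectation described here can be written as a small check. This is a Python sketch with hypothetical helper names, assuming lines without trailing newlines and a quoting style that puts "> " before non-empty lines and a bare ">" before empty ones.

```python
from typing import List


def quote(lines: List[str]) -> List[str]:
    """Wrap lines in a block quote "with nice spacing"."""
    return ["> " + l if l else ">" for l in lines]


def unquote_current(lines: List[str]) -> List[str]:
    """Current behaviour: remove ">" plus one optional space."""
    return [l[2:] if l.startswith("> ") else l[1:] for l in lines]


def unquote_option6(lines: List[str]) -> List[str]:
    """Option 6: remove only the ">"."""
    return [l[1:] for l in lines]


original = ["- item", "", "    code"]

print(unquote_current(quote(original)) == original)  # True
print(unquote_option6(quote(original)))  # [' - item', '', '     code']
```

Under option 6 every non-empty line comes back with one extra leading space, so content that depends on exact indentation, such as the indented code line, no longer round-trips.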
Just ran into this recently with a set of files that got converted from RedCarpet to CommonMark. We changed some of these where we ran across it, but since it didn't break the final rendering, not all of them were changed. Not saying they shouldn't be eventually, just many still exist as |
Require that the same block quote marker be used to avoid ambiguity in parsing strategy (compatible with the algorithm described here)
For what it's worth, this is one of the possible interpretations of the rule being overwritten – so it's not strictly new, just more specific ;)