The previous two pages introduced the lexer and parser but have not yet discussed how to handle whitespace and comments.
In some grammars whitespace and comments would not be considered significant, and so we might be tempted not to generate any tokens for them. However, in the running example we may want to make spaces significant in the future. For example, we might want to implement function application expressed by juxtaposition, as in Haskell and Idris, like this: f x.
So, on this page, we will add a token for whitespace and comments. We will then consider two ways to process these tokens:
- Filter all whitespace tokens from the lexer results before passing to the parser.
- Process the whitespace tokens in the parser.
To start with, we can use the same token for both whitespace and comments. Here it is called Comment and it is added to the ExpressionToken data type like this:
public export
data ExpressionToken = Number Integer
                     | Operator String
                     | OParen
                     | CParen
                     | Comment String
                     | EndInput
This is added to the TokenMap like this:
||| from https://github.com/edwinb/Idris2/blob/master/src/Parser/Lexer.idr
comment : Lexer
comment = is '-' <+> is '-' <+> many (isNot '\n')

expressionTokens : TokenMap ExpressionToken
expressionTokens =
    [(digits, \x => Number (toInt' x)),
     (operator, \x => Operator x),
     (is '(', \x => OParen),
     (is ')', \x => CParen),
     (spaces, Comment),
     (comment, Comment)]
As you can see, comment is defined like a single-line Idris comment: it starts with -- and continues for the remainder of the line.
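As a quick check we can look at the tokens the lexer produces. The helper below is not part of the running example; it is just a hypothetical sketch, assuming the definitions above, that keeps only the token values and discards the position information:

||| Hypothetical helper: lex a string and keep just the token values,
||| discarding the line/column information in each TokenData.
tokensOf : String -> List ExpressionToken
tokensOf s = map tok (fst (lex expressionTokens s))

For an input such as "1 + 2" we would expect something like [Number 1, Comment " ", Operator "+", Comment " ", Number 2], although printing this at the REPL also needs a Show implementation for ExpressionToken.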
We don't need to define spaces because it is already defined in https://github.com/idris-lang/Idris-dev/blob/master/libs/contrib/Text/Lexer.idr like this:
||| Recognise a single whitespace character
||| /\\s/
space : Lexer
space = pred isSpace

||| Recognise one or more whitespace characters
||| /\\s+/
spaces : Lexer
spaces = some space
If these whitespace tokens are not significant, that is, if they can appear anywhere and they are optional, then we can filter them out before the parser gets them, like this:
processWhitespace : (List (TokenData ExpressionToken), Int, Int, String)
                 -> (List (TokenData ExpressionToken), Int, Int, String)
processWhitespace (x, l, c, s) = (filter notComment x, l, c, s) where
  notComment : TokenData ExpressionToken -> Bool
  notComment t = case tok t of
                   Comment _ => False
                   _ => True
calc : String -> Either (ParseError (TokenData ExpressionToken))
                        (Integer, List (TokenData ExpressionToken))
calc s = parse expr (fst (processWhitespace (lex expressionTokens s)))
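With this in place, whitespace is simply discarded before parsing, so an expression should give the same result with or without spaces. A hypothetical check, assuming the expr rule and the arithmetic helpers from the previous page:

-- Both calls should behave the same, because the Comment tokens are
-- filtered out before parse is called:
-- calc "1+2"    -- expected: Right (3, []) or similar
-- calc "1 + 2"  -- expected: Right (3, []) or similar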
If we sometimes require whitespace to be significant then we can't filter these tokens out as above. In this case the Comment tokens are sent to the parser, which now needs to be able to handle them.
||| Matches a single Comment token; the result (0) is just a dummy
||| value, intended to be discarded with <* or *>.
commentSpace : Rule Integer
commentSpace = terminal (\x => case tok x of
                           Comment s => Just 0
                           _ => Nothing)
So far we don't have any syntax that requires spaces to be significant, so we need to define the grammar so that it will parse with, or without, spaces. This needs to be done in a systematic way; here I have defined the grammar so that there is an optional space to the right of every atom or operator. First, add versions of intLiteral, openParen, closeParen and op that allow optional spaces/comments to the right of them:
intLiteralC : Rule Integer
intLiteralC = (intLiteral <* commentSpace) <|> intLiteral

openParenC : Rule Integer
openParenC = (openParen <* commentSpace) <|> openParen

closeParenC : Rule Integer
closeParenC = (closeParen <* commentSpace) <|> closeParen

||| like op but followed by optional comment or space
opC : String -> Rule Integer
opC s = ((op s) <* commentSpace) <|> (op s)
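The same pattern appears in each of these definitions, so it could be factored out. The helper below is a hypothetical sketch, not part of the original code, assuming the Rule and commentSpace definitions above:

||| Hypothetical helper: wrap any rule so that it also accepts an
||| optional Comment token immediately to its right.
withComment : Rule a -> Rule a
withComment p = (p <* commentSpace) <|> p

With this, opC s could be written as withComment (op s), and similarly for the other C variants.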
Then just use these functions instead of the original functions:
expr : Rule Integer

factor : Rule Integer
factor = intLiteralC <|> do
           openParenC
           r <- expr
           closeParenC
           pure r

term : Rule Integer
term = map multInt factor <*> ((opC "*") *> factor)
       <|> factor

expr = map addInt term <*> ((opC "+") *> term)
       <|> map subInt term <*> ((opC "-") *> term)
       <|> term
calc : String -> Either (ParseError (TokenData ExpressionToken))
                        (Integer, List (TokenData ExpressionToken))
calc s = parse expr (fst (lex expressionTokens s))
lft : (ParseError (TokenData ExpressionToken)) -> IO ()
lft (Error s lst) = putStrLn ("error:" ++ s)

rht : (Integer, List (TokenData ExpressionToken)) -> IO ()
rht i = putStrLn ("right " ++ (show i))

main : IO ()
main = do
  putStr "alg>"
  x <- getLine
  either lft rht (calc x) -- eliminator for Either
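Because every atom and operator now allows an optional Comment token to its right, the same expression should parse with or without spaces, even though the Comment tokens now reach the parser. A hypothetical check, assuming the rules above:

-- Both should succeed: any Comment token following a number, parenthesis
-- or operator is consumed by the C variants of the rules.
-- calc "1+2"    -- expected: Right (3, []) or similar
-- calc "1 + 2"  -- expected: Right (3, []) or similar
-- Note that a leading space is not handled, because the grammar only
-- allows comments to the right of an atom or operator.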
Many languages, such as Python, Haskell and Idris, use indentation to delimit the block structure of the language. We can see how Idris2 does it here: https://github.com/edwinb/Idris2/blob/master/src/Parser/Support.idr
export
IndentInfo : Type
IndentInfo = Int

export
init : IndentInfo
init = 0
So far, the parser will return a successful result even if the full input is not consumed. To ensure that the top level syntax is fully matched, we add an EndInput token to indicate the last token. EndInput is added to the other tokens like this:
data ExpressionToken = Number Integer
                     | Operator String
                     | OParen
                     | CParen
                     | Comment String
                     | EndInput
A rule to consume this token is added:
eoi : Rule Integer
eoi = terminal (\x => case tok x of
                  EndInput => Just 0
                  _ => Nothing)
Instead of using expr at the top level of the syntax, we can now define exprFull as shown here. This makes sure that the parse only succeeds once EndInput has been consumed:
exprFull : Rule Integer
exprFull = expr <* eoi
The following code makes sure EndInput is added to the end of the token list:
processWhitespace : (List (TokenData ExpressionToken), Int, Int, String)
                 -> (List (TokenData ExpressionToken), Int, Int, String)
processWhitespace (x, l, c, s) = ((filter notComment x) ++
                                  [MkToken l c EndInput], l, c, s) where
  notComment : TokenData ExpressionToken -> Bool
  notComment t = case tok t of
                   Comment _ => False
                   _ => True
All we have to do now is use exprFull instead of expr:
calc : String -> Either (ParseError (TokenData ExpressionToken))
                        (Integer, List (TokenData ExpressionToken))
calc s = parse exprFull (fst (processWhitespace (lex expressionTokens s)))
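With exprFull, a parse only succeeds once the EndInput token has been consumed, so input with trailing tokens is now rejected. A hypothetical check, assuming the definitions above (the exact error message depends on the rules involved):

-- calc "1 + 2"    -- expected: Right, the whole input including EndInput is consumed
-- calc "1 + 2 3"  -- expected: Left, the trailing Number 3 prevents eoi from matching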