Crossandra

Crossandra is a fast and simple tokenization library for Python that operates on enums and regular expressions and offers a decent amount of configuration.

Installation

Crossandra is available on PyPI and can be installed with pip, or any other Python package manager:

$ pip install crossandra

(Some systems may require you to use pip3, python -m pip, or py -m pip instead.)

License

Crossandra is licensed under the MIT License.

Reference

`Crossandra`

Crossandra(
    token_source: type[Enum] = Empty,
    *,
    ignore_whitespace: bool = False,
    ignored_characters: str = "",
    suppress_unknown: bool = False,
    rules: list[Rule | RuleGroup] | None = None
)

token_source: an enum containing all possible tokens (defaults to an empty enum)
ignore_whitespace: whether spaces, tabs, newlines etc. should be ignored
ignored_characters: characters to skip during tokenization
suppress_unknown: whether tokenization should continue past unknown tokens instead of raising an error
rules: a list of additional rules to use

The enum takes priority over the rule list.
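
To make the interaction between these options concrete, here is a minimal, stdlib-only sketch. It is not Crossandra's implementation: `mini_tokenize`, the `Op` enum, and the `(pattern, converter)` tuples standing in for rules are hypothetical, and trying longer enum values first is an assumption.

```python
import re
from enum import Enum
from typing import Any, Callable, Sequence


class Op(Enum):
    ADD = "+"
    INC = "++"


def mini_tokenize(
    source: str,
    token_source: type,
    *,
    ignore_whitespace: bool = False,
    ignored_characters: str = "",
    suppress_unknown: bool = False,
    rules: Sequence[tuple] = (),
) -> list:
    """Hypothetical re-implementation, for illustration only."""
    # Assumption: longer enum values are tried first, so "++" wins over "+".
    members = sorted(token_source, key=lambda m: len(m.value), reverse=True)
    compiled = [(re.compile(pattern), conv) for pattern, conv in rules]
    tokens: list = []
    i = 0
    while i < len(source):
        char = source[i]
        if char in ignored_characters or (ignore_whitespace and char.isspace()):
            i += 1
            continue
        # The enum takes priority over the rule list.
        member = next((m for m in members if source.startswith(m.value, i)), None)
        if member is not None:
            tokens.append(member)
            i += len(member.value)
            continue
        for pattern, converter in compiled:
            match = pattern.match(source, i)
            if match and match.end() > i:
                tokens.append(converter(match.group()))
                i = match.end()
                break
        else:
            # Neither the enum nor any rule matched at this position.
            if not suppress_unknown:
                raise ValueError(f"unknown token {char!r} at position {i}")
            i += 1  # skip the unknown character and keep going
    return tokens


result = mini_tokenize(
    "1 ++ 2 + ?",
    Op,
    ignore_whitespace=True,
    suppress_unknown=True,
    rules=[(r"\d+", int)],
)
print(result)  # [1, <Op.INC: '++'>, 2, <Op.ADD: '+'>]
```

Without `suppress_unknown=True`, the trailing `?` in this sketch raises a `ValueError` instead of being skipped.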

When all tokens are of length 1 and there are no additional rules, Crossandra will use a simpler tokenization method (the so-called Fast Mode).

Example: Tokenizing noisy Brainfuck code (tested on MacBook Air M1 (256/16) with pure Python wheels)

# Setup
from random import choices
from string import punctuation

program = "".join(choices(punctuation, k=...))

k	Default	Fast Mode	Speedup
10	0.00004s	0.00002s	2x
100	0.00016s	0.00003s	5.3x
1000	0.0015s	0.00013s	11.5x
10000	0.014s	0.0009s	15.6x
100000	0.29s	0.009s	32.2x
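
The README does not describe how Fast Mode works internally. As a rough intuition for why single-character tokens allow a simpler method, the sketch below compares a general regex-based path with a plain per-character dictionary lookup. Both functions are hypothetical stand-ins rather than Crossandra's code, and both skip unknown characters (as with suppress_unknown=True).

```python
import re
from enum import Enum


class Brainfuck(Enum):
    ADD = "+"
    SUB = "-"
    LEFT = "<"
    RIGHT = ">"
    READ = ","
    WRITE = "."
    BEGIN_LOOP = "["
    END_LOOP = "]"


def tokenize_general(source: str, enum_cls: type) -> list:
    # General path: one regex alternation, longest value first.
    values = sorted((m.value for m in enum_cls), key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(v) for v in values))
    return [enum_cls(m.group()) for m in pattern.finditer(source)]


def tokenize_fast(source: str, enum_cls: type) -> list:
    # Simplified path: valid only when every token is exactly one character,
    # so each character can be looked up directly.
    table = {m.value: m for m in enum_cls}
    return [table[c] for c in source if c in table]


program = "cat program: ,[.,]"
general = tokenize_general(program, Brainfuck)
fast = tokenize_fast(program, Brainfuck)
print(fast == general)  # True
```

To compare the two paths on your own machine, wrap each call in `timeit.timeit` with a noisy program like the one built in the setup above.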

`Rule`

Rule[T](
    pattern: str,
    converter: Callable[[str], T] | bool = True,
    flags: RegexFlag | int = 0
)

Used for defining custom rules. pattern is the regex pattern to match, and flags optionally supplies regex flags for it.
When converter is a callable, it is called with the matched substring and its return value becomes the token.
When converter is True, the matched substring itself becomes the token.
When converter is False, the matched substring is left out of the token list.
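
The three converter modes can be sketched with the standard `re` module alone. `apply_converter` and `collect` below are hypothetical helpers written for this illustration; they are not part of Crossandra.

```python
import re
from typing import Any, Callable, Union

_SKIP = object()  # sentinel meaning "leave this match out of the token list"


def apply_converter(matched: str, converter: Union[Callable[[str], Any], bool] = True) -> Any:
    if converter is True:
        return matched  # keep the matched substring unchanged
    if converter is False:
        return _SKIP  # drop the match
    return converter(matched)  # a callable transforms the substring


def collect(pattern: str, text: str, converter: Any = True, flags: int = 0) -> list:
    tokens = []
    for match in re.finditer(pattern, text, flags):
        value = apply_converter(match.group(), converter)
        if value is not _SKIP:
            tokens.append(value)
    return tokens


text = "a1 B22 c333"
print(collect(r"\d+", text))         # ['1', '22', '333']
print(collect(r"\d+", text, int))    # [1, 22, 333]
print(collect(r"\d+", text, False))  # []
print(collect(r"[a-z]", text, str.upper, re.IGNORECASE))  # ['A', 'B', 'C']
```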

`RuleGroup`

RuleGroup(rules: tuple[Rule[Any], ...])

Used for storing multiple Rules in one object. Can be constructed by ORing two or more Rules with the | operator.
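
One way such OR-based construction can be implemented is by overloading `__or__`. The classes below (`MiniRule`, `MiniRuleGroup`) are hypothetical and only illustrate the idea; whether Crossandra flattens nested groups the same way is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MiniRule:
    pattern: str

    def __or__(self, other):
        if isinstance(other, MiniRule):
            return MiniRuleGroup((self, other))
        if isinstance(other, MiniRuleGroup):
            return MiniRuleGroup((self, *other.rules))
        return NotImplemented


@dataclass(frozen=True)
class MiniRuleGroup:
    rules: tuple

    def __or__(self, other):
        if isinstance(other, MiniRule):
            return MiniRuleGroup((*self.rules, other))
        if isinstance(other, MiniRuleGroup):
            # Assumption: groups are flattened rather than nested.
            return MiniRuleGroup((*self.rules, *other.rules))
        return NotImplemented


digits = MiniRule(r"\d+")
word = MiniRule(r"[A-Za-z]+")
dash = MiniRule(r"-")

group = digits | word | dash
print(len(group.rules))  # 3
```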

`common`

The common submodule is a collection of commonly used patterns.

Rules:

CHAR (e.g. 'h')
LETTER (e.g. m)
WORD (e.g. ball)
SINGLE_QUOTED_STRING (e.g. 'nice fish')
DOUBLE_QUOTED_STRING (e.g. "hello there")
C_NAME (e.g. crossandra_rocks)
NEWLINE (\n; \r\n is converted to \n before tokenization)
DIGIT (e.g. 7)
HEXDIGIT (e.g. c)
DECIMAL (e.g. 3.14)
INT (e.g. 2137)
SIGNED_INT (e.g. -1)
FLOAT (e.g. 1e3)
SIGNED_FLOAT (e.g. +4.3)

RuleGroups:

STRING (SINGLE_QUOTED_STRING | DOUBLE_QUOTED_STRING)
NUMBER (INT | FLOAT)
SIGNED_NUMBER (SIGNED_INT | SIGNED_FLOAT)
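
The README does not list the regexes behind these names. The sketch below uses hypothetical stand-in patterns (they ignore escaped quotes, for example) to show two documented behaviors: a group such as STRING matches whatever any of its member rules match, and \r\n is converted to \n before tokenization.

```python
import re

# Hypothetical patterns for illustration; the real ones in crossandra.common
# may differ.
SINGLE_QUOTED_STRING = r"'[^'\n]*'"
DOUBLE_QUOTED_STRING = r'"[^"\n]*"'
NEWLINE = r"\n"

# A group behaves like "any of its member rules".
STRING = f"(?:{SINGLE_QUOTED_STRING})|(?:{DOUBLE_QUOTED_STRING})"


def find_strings_and_newlines(text: str) -> list:
    # Mirror the documented behavior: \r\n is converted to \n first.
    normalized = text.replace("\r\n", "\n")
    return re.findall(f"{STRING}|{NEWLINE}", normalized)


sample = "say 'nice fish'\r\nthen \"hello there\""
print(find_strings_and_newlines(sample))
# ["'nice fish'", '\n', '"hello there"']
```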

Examples

from enum import Enum
from crossandra import Crossandra

class Brainfuck(Enum):
    ADD = "+"
    SUB = "-"
    LEFT = "<"
    RIGHT = ">"
    READ = ","
    WRITE = "."
    BEGIN_LOOP = "["
    END_LOOP = "]"

bf = Crossandra(Brainfuck, suppress_unknown=True)
print(*bf.tokenize("cat program: ,[.,]"), sep="\n")
# Brainfuck.READ
# Brainfuck.BEGIN_LOOP
# Brainfuck.WRITE
# Brainfuck.READ
# Brainfuck.END_LOOP

from crossandra import Crossandra, Rule, common

def hex2rgb(hex_color: str) -> tuple[int, int, int]:
    r, g, b = (int(hex_color[i:i+2], 16) for i in range(1, 6, 2))
    return r, g, b

t = Crossandra(
    ignore_whitespace=True,
    rules=[
        Rule(r"#[0-9a-fA-F]{6}", hex2rgb),  # hex2rgb expects exactly six hex digits
        common.WORD
    ]
)

text = "My favorite color is #facade"
print(t.tokenize(text))
# ['My', 'favorite', 'color', 'is', (250, 202, 222)]

# Supporting Samarium's numbers and arithmetic operators
from enum import Enum
from crossandra import Crossandra, Rule

def sm_int(string: str) -> int:
    return int(string.replace("/", "1").replace("\\", "0"), 2)

class Op(Enum):
    ADD = "+"
    SUB = "-"
    MUL = "++"
    DIV = "--"
    POW = "+++"
    MOD = "---"

sm = Crossandra(
    Op,
    ignore_whitespace=True,
    rules=[Rule(r"(?:\\|/)+", sm_int)]
)

print(*sm.tokenize(r"//\ ++ /\\/ --- /\/\/ - ///"))
# 6 Op.MUL 9 Op.MOD 21 Op.SUB 7
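
The numbers in the expected output follow from reading `/` as binary 1 and `\` as binary 0. The stdlib-only check below reuses `sm_int` from the example and prints the intermediate bit strings; the `literals` list is just the four number literals from the tokenized input, written with escaped backslashes.

```python
def sm_int(string: str) -> int:
    # "/" is a binary 1 and "\" is a binary 0.
    return int(string.replace("/", "1").replace("\\", "0"), 2)


# The four literals from the input: //\  /\\/  /\/\/  ///
literals = ["//\\", "/\\\\/", "/\\/\\/", "///"]

for literal in literals:
    bits = literal.replace("/", "1").replace("\\", "0")
    print(literal, "->", bits, "->", sm_int(literal))
# //\ -> 110 -> 6
# /\\/ -> 1001 -> 9
# /\/\/ -> 10101 -> 21
# /// -> 111 -> 7
```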
