This repo contains work in progress on a new compiler for Stan, written in OCaml.
To be able to build the project, make sure you have GNU make installed.
If you do not have OCaml and Opam installed already, run scripts/install_ocaml.sh
to set up your OCaml environment.
To install the required OCaml libraries, run scripts/install_dev_deps.sh
.
To build stanc.exe
, run make
. The binary will be built in _build/default
To run tests, run dune runtest
and use dune promote
to accept changes.
To run e.g. only the integration tests, run dune runtest test/integration
.
There are some git hooks in scripts/hooks
; install with bash scripts/hooks/install_hooks.sh
.
To auto-format the OCaml code (sadly, this does not work for the two ocamllex
and menhir files), run dune build @fmt
or make format
.
To accept the changes proposed by ocamlformat, run dune promote
.
Run ./_build/default/stanc.exe
on individual .stan file to compile it. Use -?
to get command line options.
Use dune build @update_messages
to see if your additions to the parser have added any new error message possibilities, and dune promote
to accept them.
Having tried both native Windows development and development through Ubuntu on WSL, the Ubuntu on WSL route seems vastly smoother and it is what we recommend as a default. It's only downside seems to be that it builds Ubuntu, rather than Windows binaries. If Windows binaries are preferred, OCaml for Windows can be used.
For working on this project, we recommend using either VSCode or Emacs/Spacemacs as an editor, due to their good OCaml support through Merlin: syntax highlighting, auto-completion, type inference, automatic case splitting, and more. For people who prefer a GUI and have not memorized all Emacs keystrokes, VSCode might have the less steep learning curve.
Install instructions for VSCode can be found here.
For Windows users: please note that we advise to follow the Linux install instructions through WSL.
Seeing that VSCode is a GUI application, you will need to install an XServer and add export DISPLAY=:0.0
to ~/.bashrc
.
We recommend Mobaxterm.
In case you are using a high-res display, it may be worth overriding the high DPI setting of Mobaxterm (right click Mobaxterm binary > properties > Compatibility > Change high DPI settings > Override high DPI scaling behaviour > Application) and adding export GDK_SCALE=3
or export GDK_SCALE=2
to ~/.bashrc
.
We also advise setting "window.titleBarStyle": "native"
in VSCode under settings to be able to have proper control over the window.
Once in VSCode (on any platform), simply install the OCaml extension and you should be ready to go.
- A lexer
- A LR(1) parser (without any shift/reduce conflicts), constructing an AST
- A typed and untyped AST
- Command line interface to mirror that of stanc2 with additional debugging flags for writing out lexing and parsing operations and resulting (decorated or undecorated) AST as s-expression in case of a successful parse / semantic check
- Ported all function signatures from Stan Math
- A well-tested semantic/type checker with informative semantic error messages
- Lexical position printed in syntactic and semantic error messages
- Tests for all models in
stan/src/test/test-models/good
(including the pretty printing functionality) andstan/src/test/test-models/bad
- 100% coverage of parse errors with informative custom syntax errors implemented using Menhir's Incremental API
- Added hundreds of extra bad Stan models to test errors (all the models in
stan/src/example-bad/new
) to obtain 100% coverage of all possible parse errors - A pretty printer for Stan models
- A preprocessor for C-style #include macros with correct mapping of error locations
- Builds for portable Linux, Mac and Windows binaries
- Work in progress on intermediate representations and code generation
- Decide on final tree representation used for AST and IRs, some inspirational ideas:
- polymorphic variants
- currently using Neel's "two-level types" pattern
- End-to-end model test framework
- Could show generated C++ code matches stanc2 or that the same results are achieved at runtime
- Unit or expect tests at a decent granularity
- Code review
- Write code generation phase
- Continuous integration and deployment for Windows, Linux, and Mac static binaries
- Traditional compiler optimisations
- loop unrolling
- constant-folding
- inlining
- common subexpression elimination
- dead code elimination
- loop invariant code motion
- Stan or Math specific optimisations:
- automatic vectorization (and parallelization?)
- algebraic simplification
- algebraic derivatives
- more efficient data and parameters error checking
- Exciting new language features:
- User-defined gradients for user-defined functions
- Stan-in-stan - define much of the math library as Stan functions with some additional Stan language functionality like user-defined gradients
extern
support for linking against functions defined in other C-compatible languages (FFI)- take in std:vector and do automatic conversion for factor variables, create indicator arrays automatically
- "@quiet" annotation to not spit out certain parameters or data
- GPU matrix annotation to indicate the data structure should be manipulated only on the GPU
- closures
- type inference
- higher order functions
- submodels / structs / records / ?
- some safe support for possibly inefficient discrete parameters
- custom transforms? (like lower, upper); composable transforms?
- statically deriving graphical model/conditional independence properties of model
install_tensorflow()
style installers for R and Python that install a C++ toolchain in the user's home directory. We will need this to install the newstanc
binary.- Work needed to compile the math library ahead of time!
- Multiple phases, each with human-readable intermediate representations for easy debugging and optimization design.
- Optimizing - takes advantage of info known at the Stan language level. Minimize information we must teach users for them to write performant code.
- Holistic- bring as much of the code as possible into the MIR for whole-program optimization.
- Research platform- enable a new class of optimizations based on probability theory.
- Modular - architect & build in a way that makes it easy to outsource things like symbolic differentiation to external libraries and to use parts of the compiler as the basis for other tools built around the Stan language.
- Simplicity first - When making a choice between correct simplicity and a perceived performance benefit, we want to make the choice for simplicity unless we can show significant (> 5%) benchmark improvements to compile times or run times. Premature optimization is the root of all evil.
- Parse Stan language into AST that represents the syntax quite closely and aides in development of pretty-printers and linters
- Typecheck & add type information
- De-sugar into Middle Intermediate Representation
- Analyze & optimize MIR -> MIR (will be many passes)
- Interpret MIR, emit C++ (or LLVM IR, or Tensorflow)
- Data and parameters are never modified
- Conditionally independent code-motion
target+=
is commutative- error checks are idempotent
- Pattern rewrites;
exp(x) - 1
->exp1m(x)
- In most Stan models, almost everything is immutable: variables are initialized when they are declared and never changed again. We should exploit this. We can consider implementing optimizations that only work properly on the commutative sublanguage which does not have non-commutative side effects, as most programs can be written in that language.
- We should be careful with continue, break and early return statements as they are non-commutative effects as well. I guess you wouldn't need most of the time, but some models do use them.
- Move code to
transformed data
if possible, if not, then try to move it togenerated quantities
(c.f. SlicStan).
- The AST should have different variant types for each different type of syntax, and thus follow closely. Think about how a pretty-printer would want to deal with an AST (thanks @jimtla!)
- The AST should keep track of debug information (line number, etc) in each node itself, rather than in some external data structures keyed off nodes. This is so that when we run an optimization pass, we will be forced to design how our AST operations affect line numbers as well as the semantics, and at the end of the day we can always point a user to a specific place in their Stan code.
- We should also keep track of the string representation of numeric literals so we can make sure not to convert accidentally and lose precision.
- It would be nice to have different types for side-effect free code. We might need to analyze for print statements, or possibly ignore them as they are moved around.
- We would prefer to keep track of flow dependencies via MIR CFG pointers to other MIR nodes or symbols rather than via SSA or other renaming schemes.
- C++ is a pain to write optimization and type-checking passes in; adding a language feature touches 40+ files
- No one has wanted to work on the compiler (probably because of C++ + Spirit Qi)
- Distribution is a pain (targets C++ and requires C++ toolchain at runtime)
- Compilation takes a long time.
- Difficult for possible contributors to jump in - people tend to compile TO Stan, rewrite a Stan parser in another language, or trick the compiler into emitting the AST as text so they can read it in somewhere else.
- R and Python interfaces are buggy, hard to install, and time-consuming to maintain
1 and 2) Switch implementation languages to something more expressive and fun 3 and 4) Try to switch to a single binary distribution that ends up either interpreting or linking against something that emits native code. 5) Split up the compiler into many phases with human-readable intermediate representations between the phases 6) Focus on CmdStan as the correct unit of Stan / reference implementation, and jazz it up with some logging I/O.
- Make it easier for users to share code (modularity and encapsulation are important here)
- Make it easier for users to compose models together
- Force users to learn as little as possible to get numerical stability and performance (looking at you,
transformed data
) - Capture arbitrary metadata about AST nodes or variables:
- @silent do not save these values
- @prior tag on AST for automatic SBC, PPC
- @opencl on matrix types to send to GPU
- @hierarchical_params for GMO et al
- ??? @broadcast
- ??? @genquant
- ??? constraints (
lower=0
,corr_matrix
) ???
- User-defined derivatives
- tuples or structs
- missing data
- automated vectorization
extern
for FFI w/ gradients