As usual, we try not to invent anything big or new, but instead focus on composing and rationalizing existing software and protocols:
- Many good implementations of POSIX file systems (Linux ext4, ZFS, etc.)
- git, a distributed version control system
  - in particular, the packfile format
  - the ssh send/receive pattern
- Static WWW file servers like Apache and nginx
- tar files, gzip files
- Building CI containers faster with wedges
  - native deps: re2c, bloaty, uftrace, ...
  - Python deps, e.g. MyPy
  - R deps, e.g. dplyr
  - a wedge source is a `.treeptr` tarball
  - a wedge derived value is a `.treeptr` file
- CI serving of `.wwz` files. We need fast random access.
- Running benchmarks on multiple machines
  - an `oils-for-unix` tarball from EVERY commit, sync'd to different CI tasks
- Comparisons across distros, OSes, and hardware
  - building the same packages on Debian, Ubuntu, Alpine, and FreeBSD
  - x86 / x86-64 / ARM
- Web `.log` files can be `.treeptr` files

You can `git pull` and `git push` without paying for these large objects, e.g. container images.
To start, trees use regular compression with `gzip`. Later, Silo will introspect trees and take hints for differential compression.
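To make the starting point concrete, here is a minimal Python sketch of "regular compression with gzip" plus content addressing: each blob is gzipped and named by the same checksum that `git hash-object` computes. The `silo_add` helper and its exact layout are assumptions for illustration, not a real tool.

```python
import gzip, hashlib, os, tempfile

def git_blob_sha1(data):
    # Same checksum 'git hash-object' computes: SHA-1 over 'blob <len>\0' + content
    header = b'blob %d\x00' % len(data)
    return hashlib.sha1(header + data).hexdigest()

def silo_add(root, data):
    # Hypothetical helper: store a gzipped blob under objects/<xx>/<rest>.gz
    sha = git_blob_sha1(data)
    d = os.path.join(root, 'objects', sha[:2])
    os.makedirs(d, exist_ok=True)
    with open(os.path.join(d, sha[2:] + '.gz'), 'wb') as f:
        f.write(gzip.compress(data))
    return sha

root = tempfile.mkdtemp()
print(silo_add(root, b'hello\n'))  # ce013625030ba8dba906f756967f9e9ca394464a
```

The printed checksum matches `echo hello | git hash-object --stdin`, so a Silo's object names stay interoperable with git tooling.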
Related:
- git annex
- git LFS
```
https://oilshell.org/
  deps.silo/
    objects/        # everything is a blob at first
      00/           # checksums calculated with git hash-object
        123456.gz   # may be a .tar file, but silo doesn't know
    pack/           # like git, it can have deltas, and be repacked
      foo.pack
      foo.idx
    derived/        # DERIVED trees, e.g. different deltas,
                    # different compression, SquashFS, ...
```

```
silo verify  # blobs should have valid checksums
```
Existing tools:

```
rsync        # back up the entire thing
rclone       # ditto, but works with cloud storage
ssh rm "$@"  # a list of vrefs to delete can be calculated by 'medo reachable'
scp          # create a new silo from a 'medo reachable' manifest
du --si -s   # total size of the Silo
```
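A `silo verify` pass only needs to recompute each blob's checksum and compare it to the path the blob is stored under. A sketch, assuming the gzipped `objects/<xx>/<rest>.gz` layout described above:

```python
import gzip, hashlib, os, tempfile

def silo_verify(root):
    # Recompute each blob's checksum and compare it to the path it's stored under
    bad = []
    obj_dir = os.path.join(root, 'objects')
    for prefix in sorted(os.listdir(obj_dir)):
        for name in sorted(os.listdir(os.path.join(obj_dir, prefix))):
            path = os.path.join(obj_dir, prefix, name)
            data = gzip.decompress(open(path, 'rb').read())
            expected = prefix + name[:-len('.gz')]
            header = b'blob %d\x00' % len(data)
            if hashlib.sha1(header + data).hexdigest() != expected:
                bad.append(path)
    return bad

# Demo: one well-formed blob verifies cleanly
root = tempfile.mkdtemp()
d = os.path.join(root, 'objects', 'ce')
os.makedirs(d)
with open(os.path.join(d, '013625030ba8dba906f756967f9e9ca394464a.gz'), 'wb') as f:
    f.write(gzip.compress(b'hello\n'))
print(silo_verify(root))  # [] -- no corrupt blobs
```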
To start, this will untar and uncompress blobs from a Silo. We can also:

- Materialize a git tree, e.g. in a packfile
- Mount a git tree directly with FUSE. I think the pack `.idx` does binary
  search, which makes this possible.
  - TODO: write a prototype with pygit2, wrapping libgit2
    - FUSE bindings seem in question
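Why binary search makes a FUSE mount feasible: a pack `.idx` keeps object IDs sorted, so any object can be resolved in O(log n) without scanning the pack. A toy in-memory stand-in (the real format also has a 256-entry fanout table that narrows the search range first):

```python
import bisect

# Sorted (object id, pack offset) pairs, like the sha table in a pack .idx
entries = [
    ('03400f57a8475d0cc696557833088d718adb2493', 4096),
    ('343af37bf39d45b147bda8a85e8712b0292ddfea', 8192),
    ('6385fd579efef14978900830e5fd74bbac907011', 12),
]
shas = [sha for sha, _ in entries]

def lookup(sha):
    # Binary search over the sorted id column
    i = bisect.bisect_left(shas, sha)
    if i < len(shas) and shas[i] == sha:
        return entries[i][1]
    return None

print(lookup('6385fd579efef14978900830e5fd74bbac907011'))  # 12
```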
```
~/git/oilshell/oil/
  deps/                      # the structure of the 3 .medo dirs is arbitrary;
                             # they're generally mounted in different places,
                             # and used by different tools
    source.medo/             # Relocatable data
      SILO.json              # Can point to multiple Silos
      Python-3.10.4.treeptr  # with checksum and provenance (original URL)

    derived.medo/            # Derived values, some are wedges with absolute paths
      SILO.json              # Can point to multiple Silos
      debian/
        bullseye/
          Python-3.10.4.treeptr
      ubuntu/
        20.04/
          Python-3.10.4.treeptr  # derived data has provenance: base layer,
                                 # mounts of input / code, env / shell command
        22.04/
          Python-3.10.4.treeptr

    opaque.medo/             # Opaque values that can use more provenance
      SILO.json
      images/                # 'docker save' format. Make sure it can be imported.
        debian/
          bullseye/
            slim.treeptr
      layers/
        debian/
          bullseye/
            mypy-deps.treeptr  # packages needed to build it
```
```
# Get files to build. This does uncompress/untar.
medo expand deps/source.medo/Python-3.10.4.treeptr _tmp/source/

# Or sync files that are already built. If they already exist, verify
# checksums.
medo expand deps/derived.medo/debian/bullseye/ /wedge/oilshell.org/deps

# Combine SILO.json and the JSON in the .treeptr
medo url-for deps/source.medo/Python-3.10.4.treeptr

# Verify checksums.
medo verify deps.medo/ /wedge/oilshell.org/deps

# Make a tarball and .treeptr that you can scp/rsync
medo add /wedge/oilshell.org/bash-4.4/ deps.medo/ubuntu/18.04/bash-4.4.treeptr

medo reachable deps.medo/  # first step of garbage collection

medo mount                 # much later: FUSE mount
```
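A sketch of what `medo url-for` might do, assuming hypothetical JSON shapes for SILO.json and the `.treeptr` file (neither format is pinned down here, and the checksum shown is just the git blob hash of `hello\n`):

```python
# Hypothetical file shapes -- assumptions for illustration only
silo_json = {'url': 'https://oilshell.org/deps.silo'}  # SILO.json
treeptr = {                                            # Python-3.10.4.treeptr
    'checksum': 'ce013625030ba8dba906f756967f9e9ca394464a',
    'provenance': {'url': 'https://www.python.org/ftp/python/3.10.4/'},
}

def url_for(silo, ptr):
    # Combine the Silo base URL with the object checksum from the .treeptr
    c = ptr['checksum']
    return '%s/objects/%s/%s.gz' % (silo['url'], c[:2], c[2:])

print(url_for(silo_json, treeptr))
# https://oilshell.org/deps.silo/objects/ce/013625030ba8dba906f756967f9e9ca394464a.gz
```

Because the `.treeptr` only names a checksum, pointing it at a different Silo is just a change to SILO.json; the data itself is relocatable.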
A package exports one or more binaries, and is a treeptr value:

- metadata is stored in a `.medo` directory
- data is stored in a Silo

The package typically lives in a subdirectory of `/wedge`. This is due to
`configure --prefix=/wedge/...`.
What can you do with it?

- A wedge can be mounted, e.g. `--mount type=bind,...`
- It can be copied into an image: `COPY ...`
  - for quick deployment to cloud services, like Github Actions or fly.io
- It has provenance, like other treeptr values. The provenance is either:
- the original URL, for source data
- the code, data, and environment used to build it
Related:
- GNU Stow (symlinks)
- GoboLinux
- Distri (exchange dirs with FUSE)
- Nix/Bazel: a wedge is a "purely functional" value
- Docker: wedges are meant to be created in containers, and mounted in containers
```
/wedge/                      # an absolute path, for configure --prefix=/wedge/...
  oils-for-unix.org/         # scoped to a domain
    pkg/                     # arbitrary structure, for dev dependencies
      Python-3.10.4.treeptr  # metadata
      Python-3.10.4/
        python               # executable, which needs a 'python3' symlink
```
Text:

- JSON for `.treeptr`, MEDO.json, SILO.json
- lockfile / "world" / manifest - what does this look like?

Data:

- git
  - blob
  - tree, for FS metadata
  - no commit objects!
  - packfile, for multiple objects
- Archiving: `.tar`
  - OCI layers use `.tar`
- Compression: `.gz`, `bzip2`, etc.
- Encryption (well, LUKS does the whole system)
It's a wrapper like `ninja_lib.py`. Importantly, everything you build should be versioned, immutable, and cached, so it doesn't use timestamps!

Distributed builds, too? Multiple workers can pull and publish intermediate values to the same Silo.
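"Doesn't use timestamps" can be made concrete: derive a task's cache key from the checksums of its inputs plus the command, so rebuilds are triggered only by content changes. A minimal sketch; the key scheme and `build` helper are assumptions, not knot's actual design:

```python
import hashlib

def task_key(input_checksums, command):
    # Key on content checksums + the command, never on mtimes
    h = hashlib.sha256()
    for name in sorted(input_checksums):
        h.update(('%s=%s\n' % (name, input_checksums[name])).encode())
    h.update(command.encode())
    return h.hexdigest()

cache = {}  # stands in for intermediate values published to a Silo

def build(input_checksums, command, run):
    key = task_key(input_checksums, command)
    if key not in cache:
        cache[key] = run()  # only runs on a cache miss
    return cache[key]
```

Touching a file without changing its content leaves the key (and thus the cache entry) unchanged, which is exactly what timestamp-based builds get wrong.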
Key ideas:
- the knot worker pulls tasks and is pointed at source.medo and derived.medo directories.
- All of this metadata is in git. The git repo is sync'd on worker
initialization, and continually updated.
- TODO: if 2 workers grab the same task, it should be OK. One of their git commits will fail?
- The worker does a lazy 'medo sync'
- The worker keeps a local cache of the Silo, according to the parts of the
Medo it needs
- It can give HINTS for differential compression, saying "I have Python-3.10.4, send me delta for Python-3.10.5"
- If all metadata is local, it can be even smarter
(Name: it's geometry like "wedge", and hopefully cuts a "Gordian knot.")
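The hint mechanism could be as simple as comparing the versions a worker already has against the versions a task needs, assuming names and versions are visible in local metadata. `delta_hints` is a hypothetical helper, not part of any existing tool:

```python
def delta_hints(local, needed):
    # local / needed: package name -> version (worker's cache / task's deps)
    hints, whole = {}, []
    for name, version in needed.items():
        if local.get(name) == version:
            continue  # already cached; nothing to fetch
        if name in local:
            # "I have Python-3.10.4, send me a delta for Python-3.10.5"
            hints[name] = (local[name], version)
        else:
            whole.append(name)  # no base version; fetch the whole object
    return hints, whole

print(delta_hints({'Python': '3.10.4'}, {'Python': '3.10.5', 'bash': '4.4'}))
# ({'Python': ('3.10.4', '3.10.5')}, ['bash'])
```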
- shrub vs. blob?
  - a shrub is a subtree, unlike a git tree object, which is like an inode
  - is all of the metadata, like paths and sizes, stored client-side? Then the
    client can give repacking hints for differential compression, rather than
    the server doing anything smart.
  - `medo explode`? You change the reference client-side
  - or `silo explode`? It can redirect from blob to shrub
- TODO: look at git tree format, and whether an entire subtree/shrub of
metadata can be stored client-side. We want ONLY trees, and blobs should be
DANGLING.
- Use pack format, or maybe a text format.
```
~/git/oilshell/oil$ git cat-file -p master^{tree}
040000 tree 37689433372bc7f1db7109fe1749bff351cba5b0    .builds
040000 tree 5d6b8fdbeb144b771e10841b7286df42bfce4c52    .circleci
100644 blob 6385fd579efef14978900830e5fd74bbac907011    .cirrus.yml
100644 blob 343af37bf39d45b147bda8a85e8712b0292ddfea    .clang-format
040000 tree 03400f57a8475d0cc696557833088d718adb2493    .github
```
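The listing above is git's pretty-printed form; the raw tree object body is just repeated `<mode> <name>\0` followed by 20 binary SHA-1 bytes, which is easy to parse client-side:

```python
import binascii

def parse_git_tree(body):
    # Each entry: '<mode> <name>\x00' followed by a 20-byte binary SHA-1
    entries = []
    i = 0
    while i < len(body):
        nul = body.index(b'\x00', i)
        mode, name = body[i:nul].split(b' ', 1)
        sha = binascii.hexlify(body[nul + 1:nul + 21]).decode()
        entries.append((mode.decode(), name.decode(), sha))
        i = nul + 21
    return entries

# One entry, built by hand from the listing above
body = (b'100644 .cirrus.yml\x00'
        + bytes.fromhex('6385fd579efef14978900830e5fd74bbac907011'))
print(parse_git_tree(body))
# [('100644', '.cirrus.yml', '6385fd579efef14978900830e5fd74bbac907011')]
```

This is what makes a trees-only, client-side copy of the metadata plausible: tree objects nest by SHA, so a whole subtree can be walked without ever fetching the (dangling) blobs.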
- Analog for low level: `runc`, `crun`
- Analog for high level: `docker run`, `podman run`
- The equivalent of inotify() on a silo / medo
  - could be a REST API on `https://app.oilshell.org/soil.medo/events/` for
    tarballs - it tells you what Silo to fetch from
- Source browser for https://www.oilshell.org/deps.silo
- "Distributed OS without RPCs". We use the paradigms of state synchronization, dependency graphs (partial orders), and probably low-level "events".
- Silo is the data plane; Medo is the control plane
- Hay config files will also be a control plane
- Silo is a mechanism; Medo is for policy
- `/wedge` is a middleground between Docker and Nix/Bazel
- Nix / Bazel are purely functional, but require rewriting upstream build
  systems in their own language (to fully make use of them)
  - Concretely: I don't want to rewrite the R build system for the tidyverse.
    I want to use the Debian packaging that already works, and that core R
    developers maintain.
- `/wedge` is purely functional in the sense that wedges are literally values.
  But like Docker, you can use shell commands that mutate layers to create
  them. You can run entire language package managers and build systems via
  shell.
- Wedges compose with, and compose better than, Docker layers.