Releases: ml-explore/mlx

v0.18.0

27 Sep 21:10
b1e2b53

Highlights

  • Speed improvements:
    • Up to 2x faster I/O (see benchmarks)
    • Faster transposed copies, unary, and binary ops
  • Transposed convolutions (see the sketch after this list)
  • Improvements to mx.distributed (send/recv/average_gradients)
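
The transposed convolutions ship as both an op and a layer (see the NN section below). A minimal sketch, assuming the channels-last layout MLX uses for its other convolutions and a weight layout mirroring mx.conv1d; shapes are illustrative:

```python
import mlx.core as mx
import mlx.nn as nn

# MLX convolutions are channels-last: input is (batch, length, channels_in).
# The weight layout below assumes the same (out_channels, kernel, in_channels)
# convention as mx.conv1d.
x = mx.random.normal(shape=(4, 16, 8))   # (N, L, C_in)
w = mx.random.normal(shape=(12, 3, 8))   # (C_out, K, C_in)
y = mx.conv_transpose1d(x, w, stride=2, padding=1)
print(y.shape)  # length is upsampled by roughly the stride

# Layer form, assuming the constructor mirrors nn.Conv1d.
layer = nn.ConvTranspose1d(in_channels=8, out_channels=12, kernel_size=3, stride=2)
print(layer(x).shape)
```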

Core

  • New features (a few are sketched at the end of this section):

    • mx.conv_transpose{1,2,3}d
    • Allow mx.take to work with an integer index
    • Add std as a method on mx.array
    • mx.put_along_axis
    • mx.cross_product
    • int() and float() work on scalar mx.array
    • Add optional headers to mx.fast.metal_kernel
    • mx.distributed.send and mx.distributed.recv
    • mx.linalg.pinv
  • Performance

    • Up to 2x faster I/O
    • Much faster CPU convolutions
    • Faster general n-dimensional copies, unary, and binary ops for both CPU and GPU
    • Put reduction ops in default stream with async for faster comms
    • Overhead reductions in mx.fast.metal_kernel
    • Improve donation heuristics to reduce memory use
  • Misc

    • Support Xcode 16.0
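
A few of the new features above in one place. A minimal sketch, assuming mx.put_along_axis mirrors numpy.put_along_axis but, like other MLX ops, returns a new array:

```python
import mlx.core as mx

a = mx.arange(12).reshape(3, 4).astype(mx.float32)

# put_along_axis: write values at per-row column indices.
idx = mx.array([[0], [2], [3]])
vals = mx.array([[-1.0], [-1.0], [-1.0]])
b = mx.put_along_axis(a, idx, vals, axis=1)

# mx.take now accepts a plain integer index.
row = mx.take(a, 1, axis=0)

# std as a method, and int()/float() on scalar (0-d) arrays.
print(float(a.std()), int(mx.array(7)))
```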

NN

  • Faster RNN layers
  • nn.ConvTranspose{1,2,3}d
  • mlx.nn.average_gradients, a data-parallel helper for distributed training (sketched below)
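
A minimal data-parallel step with the new helper. A sketch only: the toy model, loss, and data are illustrative, and it assumes the default distributed group (launch with mpirun for an actual multi-process run):

```python
import mlx.core as mx
import mlx.nn as nn
import mlx.optimizers as optim

mx.distributed.init()  # trivial single-process group unless launched via MPI

model = nn.Linear(8, 2)                  # toy model, purely illustrative
optimizer = optim.SGD(learning_rate=0.1)

def loss_fn(model, x, y):
    return nn.losses.mse_loss(model(x), y)

# Each rank computes gradients on its own shard of the data ...
x = mx.random.normal(shape=(32, 8))
y = mx.random.normal(shape=(32, 2))
loss, grads = nn.value_and_grad(model, loss_fn)(model, x, y)

# ... then averages them across the group before the optimizer update.
grads = nn.average_gradients(grads)
optimizer.update(model, grads)
mx.eval(model.parameters(), optimizer.state)
```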

Bug Fixes

  • Fix boolean all reduce bug
  • Fix extension metal library finding
  • Fix ternary for large arrays
  • Make eval just wait if all arrays are scheduled
  • Fix CPU softmax by removing redundant coefficient in neon_fast_exp
  • Fix JIT reductions
  • Fix overflow in quantize/dequantize
  • Fix compile with byte-sized constants
  • Fix copy in the sort primitive
  • Fix reduce edge case
  • Fix slice data size
  • Throw for certain cases of non-captured inputs in compile
  • Fix copying scalars by adding fill_gpu
  • Fix bug when a module attribute is set, reset, then set again
  • Ensure io/comm streams are active before eval
  • Fix mx.clip
  • Override the class name in the repr so mx.array is not confused with array.array
  • Avoid using find_library to make install truly portable
  • Remove fmt dependencies from MLX install
  • Fix for partition VJP
  • Avoid command buffer timeout for IO on large arrays

v0.17.3

13 Sep 00:17
d0c5884

🚀

v0.17.1

24 Aug 17:19
8081df7

๐Ÿ›

v0.17.0

23 Aug 18:48
684e11c

Highlights

  • mx.einsum (PR)
  • Big speedups in reductions (benchmarks)
  • 2x faster model loading (PR)
  • mx.fast.metal_kernel for custom GPU kernels (docs; see the sketch after this list)
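
A minimal custom kernel in the documented mx.fast.metal_kernel style. A sketch only: the call signature shown follows the current docs, and details of this API (e.g. the optional header argument) shifted slightly between this release and v0.18.0:

```python
import mlx.core as mx

# The source is just the kernel body; inputs and outputs are declared by
# name, and the template parameter T is bound at call time.
source = """
    uint elem = thread_position_in_grid.x;
    T tmp = inp[elem];
    out[elem] = metal::exp(tmp);
"""

kernel = mx.fast.metal_kernel(
    name="myexp",
    input_names=["inp"],
    output_names=["out"],
    source=source,
)

a = mx.random.normal(shape=(4096,))
outputs = kernel(
    inputs=[a],
    template=[("T", mx.float32)],
    grid=(a.size, 1, 1),
    threadgroup=(256, 1, 1),
    output_shapes=[a.shape],
    output_dtypes=[a.dtype],
)
print(mx.allclose(outputs[0], mx.exp(a)))
```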

Core

  • Faster program exits
  • Laplace sampling
  • mx.nan_to_num (sketched after this list, along with a few other additions)
  • nn.tanh gelu approximation
  • Fused GPU quantization ops
  • Faster group norm
  • bf16 winograd conv
  • vmap support for mx.scatter
  • mx.pad "edge" padding
  • More numerically stable mx.var
  • mx.linalg.cholesky_inv/mx.linalg.tri_inv
  • mx.isfinite
  • Complex mx.sign now mirrors NumPy 2.0 behaviour
  • More flexible mx.fast.rope
  • Update to nanobind 2.1
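
A short sketch of a few of the additions above, plus the mx.einsum highlight; values are chosen purely for illustration:

```python
import mlx.core as mx

a = mx.array([1.0, float("inf"), float("nan"), -2.0])

# Replace non-finite values, numpy-style, and test for finiteness.
print(mx.nan_to_num(a, nan=0.0, posinf=1e9, neginf=-1e9))
print(mx.isfinite(a))  # [True, False, False, True]

# The new "edge" mode for mx.pad repeats the border values.
b = mx.array([[1, 2], [3, 4]])
print(mx.pad(b, 1, mode="edge"))

# The mx.einsum highlight from this release.
x, y = mx.ones((2, 3)), mx.ones((3, 4))
print(mx.einsum("ij,jk->ik", x, y).shape)  # (2, 4)
```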

Bug Fixes

  • gguf zero initialization
  • expm1f overflow handling
  • bfloat16 hadamard
  • large arrays for various ops
  • rope fix
  • bf16 array creation
  • preserve dtype in nn.Dropout
  • nn.TransformerEncoder with norm_first=False
  • excess copies from contiguity bug

v0.16.3

12 Aug 23:14
1086dc4

🚀

v0.16.2

09 Aug 00:30
9231617

🚀🚀

v0.16.1

25 Jul 18:45
e9e5385

🚀

v0.16.0

11 Jul 18:44
d0da742

Highlights

  • @mx.custom_function for custom vjp/jvp/vmap transforms (see the sketch after this list)
  • Up to 2x faster Metal GEMV and fast masked GEMV
  • Fast hadamard_transform
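
A minimal custom VJP in the decorator style from the MLX docs; the function and its gradient are toy examples, and the (primals, cotangents, outputs) signature follows the documented pattern:

```python
import mlx.core as mx

@mx.custom_function
def scaled_sin(x, scale):
    return mx.sin(x) * scale

@scaled_sin.vjp
def scaled_sin_vjp(primals, cotangent, output):
    # Return one cotangent per primal; the primal output is also available
    # if the backward pass wants to reuse it.
    x, scale = primals
    return cotangent * mx.cos(x) * scale, (cotangent * mx.sin(x)).sum()

x = mx.array([0.0, 1.0, 2.0])
g = mx.grad(lambda x: scaled_sin(x, mx.array(2.0)).sum())(x)
print(g)  # 2 * cos(x)

# The fast Hadamard transform highlight (last axis a power of two).
h = mx.hadamard_transform(mx.random.normal(shape=(4, 16)))
```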

Core

  • Metal 3.2 support
  • Reduced CPU binary size
  • Added quantized GPU ops to JIT
  • Faster GPU compilation
  • Added grads for bitwise ops + indexing

Bug Fixes

  • 1D scatter bug
  • Strided sort bug
  • Reshape copy bug
  • Seg fault in mx.compile
  • Donation condition in compilation
  • Compilation with Accelerate on iOS

v0.15.2

27 Jun 18:21
d6383a1

🚀

v0.15.1

14 Jun 21:13
af9079c

🚀