[pull] main from openxla:main #5

Open · wants to merge 2,615 commits into main

Conversation

@pull pull[bot] commented Dec 8, 2023

See Commits and Changes for more details.


Created by pull[bot]

Can you help keep this open source service alive? 💖 Please sponsor : )

@pull pull bot added the ⤵️ pull label Dec 8, 2023
sogartar and others added 29 commits January 2, 2025 17:43
…#19582)

This fixes the error

```
ALREADY_EXISTS; HIP driver error 'hipErrorPeerAccessAlreadyEnabled' (704): peer access is already enabled; creating device 'hip'
```

This should not be treated as an error.
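For illustration, the fix amounts to filtering this status when enabling peer access. A minimal C++ sketch, assuming the raw HIP runtime API rather than IREE's actual HAL code:

```cpp
#include <hip/hip_runtime.h>

// Hypothetical helper: enable peer access, but treat "already enabled"
// as success since the desired state already holds.
hipError_t EnablePeerAccessIfNeeded(int peer_device) {
  hipError_t err = hipDeviceEnablePeerAccess(peer_device, /*flags=*/0);
  if (err == hipErrorPeerAccessAlreadyEnabled) {
    (void)hipGetLastError();  // Clear the sticky error state.
    return hipSuccess;
  }
  return err;
}
```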

Signed-off-by: Boian Petkantchin <[email protected]>
Still contains the revert of

llvm/llvm-project@169c32e

Signed-off-by: MaheshRavishankar <[email protected]>
This PR adds the unit attribute
`iree_codegen.tuning_spec_with_default_entrypoint` to indicate that a
tuning spec (typically the default one, though a user-provided spec can
work in the same manner) must contain exactly one named sequence operation
marked with `__kernel_config`. It also adds the corresponding verification
in the `verifyOperationAttribute` function.

This PR is relevant to a task in
#19214: add [a discardable attr
verifier](https://mlir.llvm.org/docs/DefiningDialects/#discardable-attribute-verification)
for `iree_codegen.tuning_spec_entrypoint` entry points.

Context:
Jakub proposed two approaches for verifying the default tuning
specification:
1. Implement a dedicated pass for verification.
2. Add a new attribute and update the verifyOperationAttribute function
accordingly.

After careful consideration, we agreed on the second approach to avoid
introducing an additional pass, ensuring a simple implementation.
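As a rough illustration of the second approach, a check of this kind could look like the following sketch (hypothetical and simplified; the real `verifyOperationAttribute` logic differs in detail):

```cpp
#include "mlir/Dialect/Transform/IR/TransformOps.h"
#include "mlir/IR/BuiltinOps.h"

// Hypothetical sketch: a module carrying
// `iree_codegen.tuning_spec_with_default_entrypoint` must contain exactly
// one transform.named_sequence op named `__kernel_config`.
static mlir::LogicalResult
verifyTuningSpecWithDefaultEntrypoint(mlir::ModuleOp module) {
  int numKernelConfigs = 0;
  module.walk([&](mlir::transform::NamedSequenceOp op) {
    if (op.getSymName() == "__kernel_config")
      ++numKernelConfigs;
  });
  return mlir::success(numKernelConfigs == 1);
}
```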

---------

Signed-off-by: Bangtian Liu <[email protected]>
This PR updates third-party/benchmark in IREE to address the use of
the RDCYCLE instruction on RISC-V. Starting from Linux 6.6 [1], RDCYCLE
is a privileged instruction and cannot be accessed directly from user
space. To ensure compatibility, this update switches to RDTIME, which,
while less accurate, has the advantage of being synchronized between
CPUs (and thus monotonic) and of constant frequency.

[1]
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=cc4c07c89aada16229084eeb93895c95b7eabaa3
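For illustration, reading the time counter from user space on riscv64 looks like the following sketch (the actual change lives in the benchmark library's timer code):

```cpp
#include <cstdint>

// rdtime reads the user-visible time CSR, which remains accessible from
// user space on Linux >= 6.6, unlike rdcycle.
static inline uint64_t read_time_counter() {
  uint64_t t;
  asm volatile("rdtime %0" : "=r"(t));
  return t;
}
```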

Signed-off-by: Phoebe Chen <[email protected]>
If we want to build the PJRT CPU plugin, we'll run something like
```
pip install --no-deps -v ./integrations/pjrt/python_packages/iree_cpu_plugin/
```

It works well on the first run. But if we make some changes and want to
run it a second time, errors appear: CMake can no longer find the ninja
from the first run, because it was in a temporary build environment that
was removed after the first build finished.

We can remove the build dir to solve this problem, but that causes a
full rebuild and is quite annoying : )

Since the IREE compiler doesn't have this issue, I checked its build
script and found that it's solved via the function
`maybe_nuke_cmake_cache` in its
[setup.py](https://github.com/iree-org/iree/blob/76a7b893e4c62d52eae2c165bdb23952a8589689/compiler/setup.py#L177).
So I copied it into the setup.py of the PJRT plugin with some modifications:
- I think the PJRT plugin doesn't rely on the CPython API (although it
builds a shared library), so we don't need to pin the Python version;
- the build dir should be passed via a parameter, since we have plugins
for different platforms (cpu/cuda/rocm/...).

I also used this chance to add `cmake` to the build dependencies, in
case some users don't have CMake installed on their system.

ci-exactly: build_packages, test_pjrt

Signed-off-by: PragmaTwice <[email protected]>
…lt (#19590)

This PR is a follow-up to
llvm/llvm-project#117340.

It disables `lowerPadLikeWithInsertSlice` and
`lowerUnpadLikeWithExtractSlice` so that `tensor.insert_slice` or
`tensor.extract_slice` ops won't appear when the high dimensions are unit
dimensions.

---------

Signed-off-by: jerryyin <[email protected]>
This adds implementations for "getIterationDomainTileFromOperandTile"
and "getTiledImplementationFromOperandTile" to linalg_ext.scatter. This
allows fusing scatters with producer loops during tiling. The
implementation of these methods is trivial because the iteration domain
is already defined in terms of the input operands, so we can just invoke
the tiling implementation.
See prior updates: #16028
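A hedged sketch of the forwarding idea (method names follow MLIR's `TilingInterface`; the exact signatures in the LinalgExt implementation may differ):

```cpp
#include "mlir/Interfaces/TilingInterface.h"

using namespace mlir;

// Since scatter's iteration domain is defined directly by the operand
// shapes, an operand tile can be translated 1:1 into an iteration-domain
// tile and handed to the regular tiled lowering.
FailureOr<TilingResult> tiledImplementationFromOperandTileSketch(
    TilingInterface op, OpBuilder &b, unsigned operandNumber,
    ArrayRef<OpFoldResult> offsets, ArrayRef<OpFoldResult> sizes) {
  SmallVector<OpFoldResult> iterOffsets, iterSizes;
  if (failed(op.getIterationDomainTileFromOperandTile(
          b, operandNumber, offsets, sizes, iterOffsets, iterSizes)))
    return failure();
  return op.getTiledImplementation(b, iterOffsets, iterSizes);
}
```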

> Happy New Year 🎉
> 
> Yes, this is a bit silly. We still like to intentionally update the
copyright year in this one location so the website appears fresh.
Still carries revert of
llvm/llvm-project@169c32e

Signed-off-by: MaheshRavishankar <[email protected]>
The error message should be `cuda` instead of `rocm`.
In `flow.call` op, there are two custom `OpBuilder` declarations:

https://github.com/iree-org/iree/blob/76a7b893e4c62d52eae2c165bdb23952a8589689/compiler/src/iree/compiler/Dialect/Flow/IR/FlowOps.td#L983-L996

And the second one is defined in `FlowOps.cpp`:

https://github.com/iree-org/iree/blob/76a7b893e4c62d52eae2c165bdb23952a8589689/compiler/src/iree/compiler/Dialect/Flow/IR/FlowOps.cpp#L1579-L1583

However, the function definition of the first one is missing. If we try
to use it, we'll get a linker error like "undefined symbol" in the build
phase.

So in this PR I add a definition for the first `OpBuilder` (inline in
the TableGen file instead of `FlowOps.cpp`, since it's simple).

---------

Signed-off-by: PragmaTwice <[email protected]>
…19460)

The loop here iterates over the arguments of a dead operation. This
sometimes works if the operation happens to use the same memory for its
iter arguments, but it relies on undefined behavior. This patch
restarts the check each time a new loop is created.

No tests for this one, because it sometimes works, depending on how the
memory allocator allocates the operation.

---------

Signed-off-by: Groverkss <[email protected]>
Signed-off-by: MaheshRavishankar <[email protected]>
Co-authored-by: MaheshRavishankar <[email protected]>
These changes are needed to be able to propagate reshapes and fold unit
dimensions. This essentially changes `scatter` to be more closely in
line with
[tf.tensor_scatter_nd_update](https://www.tensorflow.org/api_docs/python/tf/tensor_scatter_nd_update)
except with a `dimension_map` (side note: the linked tensorflow docs
have a really good explanation of the op).

This also removes support for non-contiguous scatters, because the slice
must be right-justified (along the innermost dimensions of `updates` and
`original`) to prevent ambiguity around how to index `original` and how
to scatter `updates`.

#### Overview:
- Update verifier to handle multiple batch dimensions. Restrict
`dimension_map` to allow indexing only of the outermost
  dimensions, ensuring slices are inserted contiguously.
- Fix `TilingInterfaceImpl` to support multiple "batch" dimensions
  and add test cases to `convert_to_loops.mlir` and `tiling.mlir`
- Fix `ScatterOp` description to align with verifier
- Add new test cases for `ScatterOp` and remove a few that are no longer
supported.

---------

Signed-off-by: Ian Wood <[email protected]>
This change uses the result types as a part of the hash when grouping
ops. This vastly improves the performance of this pass when there are
several similar objects that consist of ops with the same names but
differ in the number/type of results. However, this may increase the
overhead of hashing when bucketing isn't effective.

Although this is a sample size of one, I found that for 405b tp8 the
number of buckets went from 35 to 140. This brought the time of this
pass down from a few minutes to several seconds.
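A minimal sketch of the idea (illustrative, not the pass's actual code): fold the result types into the per-op hash so that ops sharing a name but differing in result arity or types land in different buckets.

```cpp
#include "llvm/ADT/Hashing.h"
#include "mlir/IR/Operation.h"

// Hash an op by its name plus its result types for bucketing.
llvm::hash_code hashOpForBucketing(mlir::Operation *op) {
  llvm::hash_code hash = llvm::hash_value(op->getName().getStringRef());
  for (mlir::Type type : op->getResultTypes())
    hash = llvm::hash_combine(hash, mlir::hash_value(type));
  return hash;
}
```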

Signed-off-by: Ian Wood <[email protected]>
Adds nanobind reverts on top of #19600 to allow the macOS build to pass
(see #19591).
…ute (#19603)

This PR generalizes the cases in which the linking pass can be skipped
based on the presence of the default entry point attribute.

---------

Signed-off-by: Bangtian Liu <[email protected]>
#19113 uncovered some problems with
the logic in this pass.

Fixes two problems:
1. If a consumer cannot be collapsed, producers can only collapse
dimensions not touched by the consumer
2. When updating which consumer loops can be collapsed, the
reassociation of the producer must be taken into account, since it's
possible they are not all contiguous (e.g. a transpose on an input).
This is the same logic as in `updateFromConsumer`.

---------

Signed-off-by: Ian Wood <[email protected]>
If the status is an error status that we passed in, it will
be passed back to us. It is incorrect to join it with itself.
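A minimal sketch of the hazard, assuming a hypothetical callee `do_work` that may return the very status object it was given:

```cpp
#include "iree/base/api.h"

iree_status_t do_work(iree_status_t status);  // Hypothetical callee.

iree_status_t process(iree_status_t status) {
  iree_status_t new_status = do_work(status);
  // Only join when the returned status is a distinct object; joining a
  // status with itself is invalid.
  if (new_status != status) {
    status = iree_status_join(status, new_status);
  }
  return status;
}
```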

Signed-off-by: Andrew Woloszyn <[email protected]>
This change adds patterns to drop the unit dims of an
`iree_linalg_ext.scatter`'s `%updates` tensor. It only drops the leading
unit dimensions from the portion of `updates` that represents the
indexed dimensions.


See the main issue #19091

---------

Signed-off-by: Ian Wood <[email protected]>
With llvm/llvm-project@10ef20f, support for `MLIR_LINK_MLIR_DYLIB` was
introduced. With `LLVM_LINK_LLVM_DYLIB` set to `ON` at
https://github.com/iree-org/iree/blob/cdf24b9be0354f06879ba08db85ff8a5dbe49b14/build_tools/llvm/llvm_config.cmake#L30,
this setting is propagated to `MLIR_LINK_MLIR_DYLIB`. This breaks the BYO
LLVM workflow (see #19549); hence, it is set to `OFF`.
Fixes #17344.

After nod-ai/SHARK-TestSuite#418, there are only
two tests running in that test suite, both of which are XFAIL'd due to
programs needing to be regenerated.
Carries 4 reverts.

Related to Nanobind issues:

- llvm/llvm-project@5cd4274
- llvm/llvm-project@08e2c15
- llvm/llvm-project@b56d1ec

Related to RISC-V compilation:

- llvm/llvm-project@169c32e

---------

Signed-off-by: MaheshRavishankar <[email protected]>
Signed-off-by: MaheshRavishankar <[email protected]>
This avoids the need for string manipulation at runtime and is what
the HSA API expects.
We don't support custom debug sinks in the Runtime Python bindings, in
particular the ability to register a custom callback when tracing
tensors.

This change makes it possible to create a HAL module with a Python
function as a callback. This implementation does not handle the case of
referencing, directly or indirectly, the HAL module, VM context, or VM
instance in the callback function object. In such a scenario the
circular reference will not be collected by the garbage collector and
will leak. No check is done to guard against this. It is possible to
traverse the Python object structure to detect a reference to VM
objects, but it would require more effort.

This change also adds a callback to the debug sink in the IREE native
runtime API that signals when the runtime is done using the debug sink.
We need this because the Python objects corresponding to native runtime
objects are ephemeral and cannot be used to hold the reference to the
debug sink.

---------

Signed-off-by: Boian Petkantchin <[email protected]>
COMPILER_TARGET_BACKEND is something we should deprecate in the future.
This was incorrectly assuming that ordinals are always allowed (they
aren't) and that there are exactly as many physical devices with ordinals
as there are enumerable logical devices.
dependabot bot and others added 30 commits February 4, 2025 15:04
In #19902 we added reporting of
errors in `LLVMCPUTargetCLOptions::getTargetOptions`, which allows
reporting things like an unknown CPU before it causes assertion
failures in LLVM. But we mistakenly also reported the warning
about the implicit CPU fallback there, which is a false positive in this
case, as it triggers on default targets that we may not actually use.

Signed-off-by: Benoit Jacob <[email protected]>
#17593

While reproducing this, I was caught by an error with the following
`unpack.mlir`:
```mlir
func.func @unpack(%arg0: tensor<1x5x2x64xf32>) -> tensor<2x320xf32> {
  %0 = tensor.empty() : tensor<2x320xf32>
  %unpack = tensor.unpack %arg0 outer_dims_perm = [0, 1] inner_dims_pos = [0, 1] inner_tiles = [2, 64] into %0 : tensor<1x5x2x64xf32> -> tensor<2x320xf32>
  return %unpack : tensor<2x320xf32>
}
```
 
Script to reproduce:
```
iree-opt --mlir-print-ir-before-all --mlir-pretty-debuginfo \
--pass-pipeline="builtin.module(func.func(iree-codegen-generic-vectorization{enable-vector-masking=true}))" \
--split-input-file  unpack.mlir
```


Compilation error workaround.

This impacts the ability to horizontally fuse the matmuls that feed into
the `Q-K-V` transpose. The improvements seen with the change might have
been due to a reduction in copy overheads, which are no longer an issue.

Signed-off-by: MaheshRavishankar <[email protected]>
The TableGen had some strange auto-generated polymorphism with implicit
parsing of certain fields. None of it provided any benefit, so it is
simplified down to just the MMA enum. This also replaces the enum
attribute with an enum parameter, removing the extra `.getValue()`
indirection when accessing the enum.
There are a ton of bit-extend ops getting hoisted that simply
convert the weights from f16 to f32 (these ops are fairly small, so they
don't trigger the max size increase threshold, i.e. 1024 elements).
Instead, we want these ops to be fused with their consumers to prevent
materializing the high bit-width tensors.
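A hypothetical predicate illustrating the kind of op this targets (illustrative only, not the actual pass logic):

```cpp
#include "mlir/IR/BuiltinTypes.h"
#include "mlir/IR/Operation.h"

using namespace mlir;

// An elementwise op that only widens the element bit-width (e.g. f16 ->
// f32) should stay fused with its consumers rather than being hoisted.
static bool isBitExtendLike(Operation *op) {
  if (op->getNumOperands() != 1 || op->getNumResults() != 1)
    return false;
  auto inType = dyn_cast<ShapedType>(op->getOperand(0).getType());
  auto outType = dyn_cast<ShapedType>(op->getResult(0).getType());
  if (!inType || !outType)
    return false;
  return outType.getElementTypeBitWidth() > inType.getElementTypeBitWidth();
}
```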

---------

Signed-off-by: Ian Wood <[email protected]>
* Add warning about ongoing TOSA changes and recommend installing old
versions per #19777.
* Refresh sample code to download from Kaggle instead of a deleted GCS
bucket, making more progress on
#18518. I couldn't find an
equivalent posenet i8 model, so I used a float32 version that expects
different dimensions.
This is based on the materialize-encoding-into-nop pass, with additional
patterns to handle load and store ops.
For now, we materialize a load of a padded input with an extract slice,
and a store as an insert slice into a larger tensor. These get folded and
become partial loads/stores at the end, but we can change this later.

Enable this by default in the LLVMGPU pass pipeline as it's a superset
of the existing nop encoding materialization pass.

---------

Signed-off-by: Jakub Kuderski <[email protected]>
These types are not available in NumPy so no interoperability is
provided for them.

---------

Signed-off-by: Boian Petkantchin <[email protected]>
)

Progress on #18174, updating some
stale documentation.

> [!NOTE]
> Demo here:
https://scotttodd.github.io/iree/guides/deployment-configurations/cpu/

Changes included:

* Switch examples to use ONNX instead of TensorFlow given that users are
trying to use TensorFlow and failing:
#19852
* Add more documentation for CPU targets and features for
#18561
* Standardize some formatting across CPU/CUDA/ROCm/Vulkan pages
* Adjust some parts of the ONNX guide now that support is more mature
…19726)

The revision adds support for the rest of the AffinityOps that have the
TensorPhase trait, i.e., the TensorCloneOp, TensorSliceOp, TensorFillOp,
and TensorUpdateOp ops. It is tricky to handle encodings for transfer
ops, so only the encoding in the fill op is updated. If other operations
have tensor encodings, it returns a failure for now.

There are two stream tensor ops that do not implement the
AffinityOpInterface, so they are not supported in this revision: the
stream.tensor.load and stream.tensor.store ops. We should be able
to track the resource affinity for these two ops, but it requires
additional analysis, so they are out of scope for this revision.

The revision also adds the missing documentation to the
`addLayoutsToTensorPhaseOps` method.

---------

Signed-off-by: hanhanW <[email protected]>
Revert commits:

-
llvm/llvm-project@8c1dbac

The author is working on a fix, and it is not ready yet.

---------

Signed-off-by: hanhanW <[email protected]>
This skips tiling large fills for the same reasoning as in #19887
We had previously cherry-picked
llvm/llvm-project@73f11ac
in #19939.

Now we're integrating up to that commit, so it's no longer a
cherry-pick.

Reverting llvm/llvm-project#125789 because it
breaks TorchToTosa, in torch-mlir. We will need to wait for this to be
resolved in torch-mlir, then simultaneously bump torch-mlir and drop the
revert.

Cherry-pick a Bazel fix:
llvm/llvm-project@4df287a

---------

Signed-off-by: Benoit Jacob <[email protected]>
…umers (#19804)

This PR adds new logic in ConfigUtils.cpp to analyze a dispatch and
determine required multiples of workgroup tile sizes for the root
operation. This affects dispatches that contain either tensor.pack or
tensor.unpack ops, since the pack and unpack ops require the workgroup
tile sizes to be a multiple of their inner_tiles in order for them to be
fused into the workgroup scf.forall loop. The following example of a gpu
set_encoding dispatch illustrates the new constraint imposed by this PR:

```mlir
%in = flow.dispatch.tensor.load ... -> tensor<256x64xi8>
%pack = tensor.pack %in ... inner_tiles = [128, 64] ... tensor<256x64xi8> -> tensor<2x1x128x64xi8>
%expanded = tensor.expand_shape %pack [[0], [1], [2, 3, 4], [5, 6, 7]]
    : tensor<2x1x128x64xi8> into tensor<2x1x4x8x4x2x4x8xi8>
// linalg.transpose is the root op. The workgroup tile sizes must contain an
// even multiple of the tensor.pack inner_tiles.
%transposed = linalg.transpose
    ins(%expanded : tensor<2x1x4x8x4x2x4x8xi8>)
    outs(%empty : tensor<2x1x8x4x4x4x2x8xi8>)
    permutation = [0, 1, 3, 6, 2, 4, 5, 7]
flow.dispatch.tensor.store %transposed
```

Since the linalg.transpose is the root op, it needs to be aware of its
producer chain when selecting tile sizes. With this PR, the lowering
config selection logic will walk producers until it hits an unsupported
operation or a block argument, and find the LCM of any pack or unpack
tiles along the dimensions of their inner_tiles. In the above example,
this would look like the following:

1. Walk producer chain up to the producer of `tensor.pack`, and stop at
the `flow.dispatch.tensor.load`. The initial workgroup tile size
multiples will be `[1, 1]` (i.e., no constraint for unsupported ops).
2. The workgroup tile sizes will be propagated through the
`tensor.pack`, which updates the workgroup tile size multiples to `[1,
1, 128, 64]`.
3. Then, it will propagate through the `tensor.expand_shape`, which will
expand the workgroup size multiples if possible. In this case, they are
expanded to `[1, 1, 4, 8, 4, 2, 4, 8]`.
4. Now walk the consumer chain to find the multiples for the workgroup
tile slice of the root op result. In this case, the propagation simply
stops at the `flow.dispatch.tensor.store`, and the multiples are `[1, 1,
1, ...]`.
5. Now the root op has the required workgroup tile size multiples for
the operand and result slices, and the multiples for the iteration space
of the op are computed based on the indexing maps of the operation, by
taking the LCM along each dimension of that dimension's multiples from
all operands and results. In this case the final workgroup tile size
multiples would become `[1, 1, 8, 4, 4, 4, 2, 8]`.
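The combining step in item 5 boils down to a per-dimension LCM. A minimal sketch with illustrative names (not the actual ConfigUtils.cpp code):

```cpp
#include <cstdint>
#include <numeric>
#include <vector>

// For each iteration dimension, take the LCM of the required multiples
// contributed by every operand/result slice of the root op.
std::vector<int64_t> combineTileSizeMultiples(
    const std::vector<std::vector<int64_t>> &perOperandMultiples,
    size_t numLoops) {
  std::vector<int64_t> combined(numLoops, 1);
  for (const auto &multiples : perOperandMultiples)
    for (size_t dim = 0; dim < multiples.size() && dim < numLoops; ++dim)
      combined[dim] = std::lcm(combined[dim], multiples[dim]);
  return combined;
}
```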

---------

Signed-off-by: Max Dawkins <[email protected]>
Fixes a bug where the `transform.iree.match.cast_compatible_dag_from_root`
op failed to match when there are repeated operands.

---------

Signed-off-by: Max Dawkins <[email protected]>
This test is flaky on CI. I can't reproduce the issue locally and I'm
not sure why the file would not be found or would have errors being
opened. Maybe due to too much ctest parallelism?

Sample logs:

*
https://github.com/iree-org/iree/actions/runs/12926084982/job/36048365781#step:10:187
*
https://github.com/iree-org/iree/actions/runs/13154155457/job/36707383211#step:10:157

```
  34/1546 Test   #12: iree/tools/test/iree-dump-parameters.txt.test ....................................................................***Failed    3.02 sec
-- Testing: 1 tests, 1 workers --
FAIL: IREE :: test/iree-dump-parameters.txt (1 of 1)
******************** TEST 'IREE :: test/iree-dump-parameters.txt' FAILED ********************
Exit Code: 2

Command Output (stderr):
--
RUN: at line 1: (iree-dump-parameters    --parameters=a=C:/home/runner/_work/iree/iree/tools/test/parameters_a.safetensors    --parameters=b=C:/home/runner/_work/iree/iree/tools/test/parameters_b.safetensors) |   FileCheck C:/home/runner/_work/iree/iree/tools/test/iree-dump-parameters.txt
+ iree-dump-parameters --parameters=a=C:/home/runner/_work/iree/iree/tools/test/parameters_a.safetensors --parameters=b=C:/home/runner/_work/iree/iree/tools/test/parameters_b.safetensors
+ FileCheck C:/home/runner/_work/iree/iree/tools/test/iree-dump-parameters.txt
C:\home\runner\_work\iree\iree\runtime\src\iree\io\file_handle.c:223: UNKNOWN; failed to open file 'C:/home/runner/_work/iree/iree/tools/test/parameters_a.safetensors'; stack:
  0x00007ff6326c6754 iree-dump-parameters <iree_io_file_handle_platform_open+0x1a4> (C:\home\runner\_work\iree\iree\runtime\src\iree\io\file_handle.c:221)
  0x00007ff6326c6283 iree-dump-parameters <iree_io_file_handle_create_or_open+0x83> (C:\home\runner\_work\iree\iree\runtime\src\iree\io\file_handle.c:367)
  0x00007ff6326c6528 iree-dump-parameters <iree_io_file_handle_open+0x78> (C:\home\runner\_work\iree\iree\runtime\src\iree\io\file_handle.c:419)
  0x00007ff6326ac3de iree-dump-parameters <iree_io_open_parameter_file+0x13e> (C:\home\runner\_work\iree\iree\runtime\src\iree\tooling\parameter_util.c:93)
  0x00007ff6326ac224 iree-dump-parameters <iree_io_append_parameter_file_to_index+0x64> (C:\home\runner\_work\iree\iree\runtime\src\iree\tooling\parameter_util.c:130)
  0x00007ff6326ac732 iree-dump-parameters <iree_tooling_build_parameter_indices_from_flags+0xd2> (C:\home\runner\_work\iree\iree\runtime\src\iree\tooling\parameter_util.c:166)
  0x00007ff6326a377a iree-dump-parameters <main+0xca> (C:\home\runner\_work\iree\iree\tools\iree-dump-parameters-main.c:138)
  0x00007ff6326d4f88 iree-dump-parameters <__scrt_common_main_seh+0x10c> (D:\a\_work\1\s\src\vctools\crt\vcstartup\src\startup\exe_common.inl:288)
  0x00007ff954d34cb0 ??? <BaseThreadInitThunk+0x10>
  0x00007ff95a21edcb ??? <RtlUserThreadStart+0x2b>

FileCheck error: '<stdin>' is empty.
FileCheck command line:  C:\mnt\azure\b\092750\llvm-project\bin\FileCheck.exe C:/home/runner/_work/iree/iree/tools/test/iree-dump-parameters.txt

--

********************
********************
Failed Tests (1):
  IREE :: test/iree-dump-parameters.txt
```

skip-ci: not tested by presubmit
These backends default to disabled, so enable them on this CI config.
The backends are already tested in other jobs too.

I hoped this might help spot issues like
#19875, but it doesn't seem that
way.

| | Before | After |
| -- | -- | -- |
| Logs | [logs here](https://github.com/iree-org/iree/actions/runs/13118302608/job/36597989009) | [logs here](https://github.com/iree-org/iree/actions/runs/13118149869/job/36597451528?pr=19883) |
| Number of build targets | 8442 | 8715 |
| Number of `iree-test-deps` targets | 1067 | 1336 |
| Number of tests | 1538 | 1552 |


ci-exactly: linux_x64_clang
Carrying the existing revert of
llvm/llvm-project#125789 because it breaks
TorchToTosa, in torch-mlir. We will need to wait for this to be resolved
in torch-mlir, then simultaneously bump torch-mlir and drop the revert.

Signed-off-by: Benoit Jacob <[email protected]>
In order to rewrite subspans to buffer descriptors, we might need to be
able to fold offsets into the buffer descriptors. This means that we
need to be able to replace an offset with a different one (specifically
0), because the offset will be applied to the base pointer during buffer
casts. If the offset is dynamic, we can always `memref.cast` the
dynamicness of the offset back in, but we can't replace a static offset
with a different static offset. Therefore, never create buffers that
have a static non-zero offset during bufferization.
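In type terms, the rule can be sketched as follows (an illustrative helper, not the pass's actual code): rewrite a strided layout's static offset to a dynamic one so later rewrites can substitute a different value through a `memref.cast`.

```cpp
#include "mlir/IR/BuiltinAttributes.h"
#include "mlir/IR/BuiltinTypes.h"

using namespace mlir;

// Replace a memref's static layout offset with a dynamic one so the
// offset can later be changed (e.g. folded to 0) via memref.cast.
MemRefType makeOffsetDynamic(MemRefType type) {
  auto strided = dyn_cast<StridedLayoutAttr>(type.getLayout());
  if (!strided)
    return type;
  auto newLayout = StridedLayoutAttr::get(
      type.getContext(), /*offset=*/ShapedType::kDynamic,
      strided.getStrides());
  return MemRefType::get(type.getShape(), type.getElementType(), newLayout,
                         type.getMemorySpace());
}
```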
Integrate at llvm/llvm-project@001ba42f

Carrying the existing revert of
llvm/llvm-project#125789 because it breaks
TorchToTosa, in torch-mlir. We will need to wait for this to be resolved
in torch-mlir, then simultaneously bump torch-mlir and drop the revert.

Signed-off-by: Benoit Jacob <[email protected]>
…9923)

This is in preparation for the modified way of generating horizontally
fused GEMMs. This PR adds kernel configuration for these GEMM ops to
allow them to go down the vector distribute pipeline.

---------

Signed-off-by: MaheshRavishankar <[email protected]>
This patch removes spurious CAPI dependencies from non-CAPI libraries.

CAPI libraries should never be added to non-CAPI libs, as they end up
causing `multiple definition` linking errors.

Signed-off-by: fabian <[email protected]>