P2P send recv test gives errors #8074

Open
ajayvohra2005 opened this issue Sep 25, 2024 · 5 comments

Comments

@ajayvohra2005

ajayvohra2005 commented Sep 25, 2024

🐛 Bug

Trying to test a simple xm.send and xm.recv gives an error.

To Reproduce

Steps to reproduce the behavior:

  1. Run the test code below:
import torch
import torch_xla.core.xla_model as xm
import torch_xla.runtime as xr
import torch_xla.distributed.xla_backend as xb  # registers the "xla" backend
import torch.distributed


def test_p2p():
    torch.distributed.init_process_group(backend="xla", init_method="xla://")

    rank = torch.distributed.get_rank()
    device = xm.xla_device()
    tensor = torch.arange(2, dtype=torch.float32, device=device) + 1 + 2 * rank

    # Ranks 0-3 send to ranks 4-7, respectively.
    next_map = {0: 4, 1: 5, 2: 6, 3: 7}
    prev_map = {4: 0, 5: 1, 6: 2, 7: 3}

    torch.distributed.barrier()

    if rank < 4:
        print(f"send at rank: {rank}, to:{next_map[rank]}, tensor: {tensor}")
        xm.send(tensor, next_map[rank])

    torch.distributed.barrier()

    recv_buffer = torch.zeros_like(tensor)
    if rank >= 4:
        print(f"recv at rank: {rank}, from:{prev_map[rank]} ... ")
        xm.recv(recv_buffer, prev_map[rank])
        print(f"recv at rank: {rank}, from:{prev_map[rank]}, recv_buffer: {recv_buffer}")

    torch.distributed.barrier()

    torch.distributed.destroy_process_group()


if __name__ == "__main__":
    test_p2p()

Expected behavior

The test code should run without errors.

Log output showing the error

W0930 17:26:04.156000 140373368411968 torch/distributed/run.py:757] 
W0930 17:26:04.156000 140373368411968 torch/distributed/run.py:757] *****************************************
W0930 17:26:04.156000 140373368411968 torch/distributed/run.py:757] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
W0930 17:26:04.156000 140373368411968 torch/distributed/run.py:757] *****************************************
WARNING: All log messages before absl::InitializeLog() is called are written to STDERR
I0000 00:00:1727717166.852975  184724 coordination_service.cc:365] Initializing CoordinationService
I0000 00:00:1727717166.857299  186366 coordination_service.cc:579] /job:jax_worker/replica:0/task:0 has connected to coordination service. Incarnation: 1331949604963884975
I0000 00:00:1727717166.857361  186366 coordination_service.cc:541] Waiting for 7/8 tasks to connect.
I0000 00:00:1727717166.857385  186366 coordination_service.cc:544] Example stragglers:
/job:jax_worker/replica:0/task:2
/job:jax_worker/replica:0/task:4
/job:jax_worker/replica:0/task:1
I0000 00:00:1727717166.857775  186366 coordination_service.cc:579] /job:jax_worker/replica:0/task:2 has connected to coordination service. Incarnation: 12293608533655025016
I0000 00:00:1727717166.857797  186366 coordination_service.cc:541] Waiting for 6/8 tasks to connect.
I0000 00:00:1727717166.857801  186366 coordination_service.cc:544] Example stragglers:
/job:jax_worker/replica:0/task:4
/job:jax_worker/replica:0/task:1
/job:jax_worker/replica:0/task:5
I0000 00:00:1727717166.858749  186366 coordination_service.cc:579] /job:jax_worker/replica:0/task:3 has connected to coordination service. Incarnation: 6693673612313623099
I0000 00:00:1727717166.858794  186366 coordination_service.cc:541] Waiting for 5/8 tasks to connect.
I0000 00:00:1727717166.858811  186366 coordination_service.cc:544] Example stragglers:
/job:jax_worker/replica:0/task:4
/job:jax_worker/replica:0/task:1
/job:jax_worker/replica:0/task:5
WARNING: All log messages before absl::InitializeLog() is called are written to STDERR
I0000 00:00:1727717166.859230  184726 coordination_service_agent.cc:303] Coordination agent has successfully connected.
I0000 00:00:1727717166.859488  184724 coordination_service_agent.cc:303] Coordination agent has successfully connected.
I0000 00:00:1727717166.860198  186366 coordination_service.cc:579] /job:jax_worker/replica:0/task:5 has connected to coordination service. Incarnation: 14604579324477984403
I0000 00:00:1727717166.860218  186366 coordination_service.cc:541] Waiting for 4/8 tasks to connect.
I0000 00:00:1727717166.860223  186366 coordination_service.cc:544] Example stragglers:
/job:jax_worker/replica:0/task:4
/job:jax_worker/replica:0/task:1
/job:jax_worker/replica:0/task:6
WARNING: All log messages before absl::InitializeLog() is called are written to STDERR
I0000 00:00:1727717166.860551  184727 coordination_service_agent.cc:303] Coordination agent has successfully connected.
WARNING: All log messages before absl::InitializeLog() is called are written to STDERR
I0000 00:00:1727717166.861981  184729 coordination_service_agent.cc:303] Coordination agent has successfully connected.
I0000 00:00:1727717166.874591  186366 coordination_service.cc:579] /job:jax_worker/replica:0/task:6 has connected to coordination service. Incarnation: 1852783379941685202
I0000 00:00:1727717166.874617  186366 coordination_service.cc:541] Waiting for 3/8 tasks to connect.
I0000 00:00:1727717166.874622  186366 coordination_service.cc:544] Example stragglers:
/job:jax_worker/replica:0/task:4
/job:jax_worker/replica:0/task:1
/job:jax_worker/replica:0/task:7
WARNING: All log messages before absl::InitializeLog() is called are written to STDERR
I0000 00:00:1727717166.876173  186366 coordination_service.cc:579] /job:jax_worker/replica:0/task:4 has connected to coordination service. Incarnation: 3757167233025618777
I0000 00:00:1727717166.876114  184730 coordination_service_agent.cc:303] Coordination agent has successfully connected.
I0000 00:00:1727717166.876193  186366 coordination_service.cc:541] Waiting for 2/8 tasks to connect.
I0000 00:00:1727717166.876203  186366 coordination_service.cc:544] Example stragglers:
/job:jax_worker/replica:0/task:1
/job:jax_worker/replica:0/task:7
WARNING: All log messages before absl::InitializeLog() is called are written to STDERR
I0000 00:00:1727717166.878104  184728 coordination_service_agent.cc:303] Coordination agent has successfully connected.
I0000 00:00:1727717167.853755  186366 coordination_service.cc:579] /job:jax_worker/replica:0/task:1 has connected to coordination service. Incarnation: 2093214038750281383
I0000 00:00:1727717167.853780  186366 coordination_service.cc:541] Waiting for 1/8 tasks to connect.
I0000 00:00:1727717167.853784  186366 coordination_service.cc:544] Example stragglers:
/job:jax_worker/replica:0/task:7
WARNING: All log messages before absl::InitializeLog() is called are written to STDERR
I0000 00:00:1727717167.853903  184725 coordination_service_agent.cc:303] Coordination agent has successfully connected.
I0000 00:00:1727717167.854359  186366 coordination_service.cc:579] /job:jax_worker/replica:0/task:7 has connected to coordination service. Incarnation: 17786123015033867819
I0000 00:00:1727717167.854374  186366 coordination_service.cc:541] Waiting for 0/8 tasks to connect.
WARNING: All log messages before absl::InitializeLog() is called are written to STDERR
I0000 00:00:1727717167.854489  184731 coordination_service_agent.cc:303] Coordination agent has successfully connected.
I0000 00:00:1727717169.854387  186856 cuda_executor.cc:1032] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
I0000 00:00:1727717169.873099  184727 service.cc:145] XLA service 0x56207f623d10 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
I0000 00:00:1727717169.873143  184727 service.cc:153]   StreamExecutor device (0): NVIDIA A10G, Compute Capability 8.6
I0000 00:00:1727717169.875642  184727 se_gpu_pjrt_client.cc:853] Using BFC allocator.
I0000 00:00:1727717169.878063  184727 gpu_helpers.cc:107] XLA backend allocating 17787371520 bytes on device 3 for BFCAllocator.
I0000 00:00:1727717169.878535  184727 gpu_helpers.cc:147] XLA backend will use up to 5929123840 bytes on device 3 for CollectiveBFCAllocator.
I0000 00:00:1727717169.878706  184727 cuda_executor.cc:1032] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
I0000 00:00:1727717169.907579  186883 cuda_executor.cc:1032] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
I0000 00:00:1727717169.912426  184724 service.cc:145] XLA service 0x55f9dc3d8670 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
I0000 00:00:1727717169.912561  184724 service.cc:153]   StreamExecutor device (0): NVIDIA A10G, Compute Capability 8.6
I0000 00:00:1727717169.914960  184724 se_gpu_pjrt_client.cc:853] Using BFC allocator.
I0000 00:00:1727717169.915052  184724 gpu_helpers.cc:107] XLA backend allocating 17787371520 bytes on device 0 for BFCAllocator.
I0000 00:00:1727717169.915148  184724 gpu_helpers.cc:147] XLA backend will use up to 5929123840 bytes on device 0 for CollectiveBFCAllocator.
I0000 00:00:1727717169.915343  184724 cuda_executor.cc:1032] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
I0000 00:00:1727717169.934043  186902 cuda_executor.cc:1032] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
I0000 00:00:1727717169.940388  184726 service.cc:145] XLA service 0x56185428cba0 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
I0000 00:00:1727717169.940470  184726 service.cc:153]   StreamExecutor device (0): NVIDIA A10G, Compute Capability 8.6
I0000 00:00:1727717169.941640  184726 se_gpu_pjrt_client.cc:853] Using BFC allocator.
I0000 00:00:1727717169.941711  184726 gpu_helpers.cc:107] XLA backend allocating 17787371520 bytes on device 2 for BFCAllocator.
I0000 00:00:1727717169.942866  184726 gpu_helpers.cc:147] XLA backend will use up to 5929123840 bytes on device 2 for CollectiveBFCAllocator.
I0000 00:00:1727717169.943044  184726 cuda_executor.cc:1032] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
I0000 00:00:1727717169.943217  186910 cuda_executor.cc:1032] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
I0000 00:00:1727717169.946296  186865 cuda_executor.cc:1032] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
I0000 00:00:1727717169.946400  186892 cuda_executor.cc:1032] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
I0000 00:00:1727717169.946771  184730 service.cc:145] XLA service 0x55f5ebc2e010 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
I0000 00:00:1727717169.946812  184730 service.cc:153]   StreamExecutor device (0): NVIDIA A10G, Compute Capability 8.6
I0000 00:00:1727717169.949711  184730 se_gpu_pjrt_client.cc:853] Using BFC allocator.
I0000 00:00:1727717169.949764  184725 service.cc:145] XLA service 0x55c22b5706d0 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
I0000 00:00:1727717169.949802  184725 service.cc:153]   StreamExecutor device (0): NVIDIA A10G, Compute Capability 8.6
I0000 00:00:1727717169.952002  184730 gpu_helpers.cc:107] XLA backend allocating 17787371520 bytes on device 6 for BFCAllocator.
I0000 00:00:1727717169.953677  184730 gpu_helpers.cc:147] XLA backend will use up to 5929123840 bytes on device 6 for CollectiveBFCAllocator.
I0000 00:00:1727717169.953861  184730 cuda_executor.cc:1032] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
I0000 00:00:1727717169.954342  184725 se_gpu_pjrt_client.cc:853] Using BFC allocator.
I0000 00:00:1727717169.954509  184728 service.cc:145] XLA service 0x55e96db73d80 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
I0000 00:00:1727717169.954565  184728 service.cc:153]   StreamExecutor device (0): NVIDIA A10G, Compute Capability 8.6
I0000 00:00:1727717169.955257  184725 gpu_helpers.cc:107] XLA backend allocating 17787371520 bytes on device 1 for BFCAllocator.
I0000 00:00:1727717169.955619  184725 gpu_helpers.cc:147] XLA backend will use up to 5929123840 bytes on device 1 for CollectiveBFCAllocator.
I0000 00:00:1727717169.955781  184725 cuda_executor.cc:1032] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
I0000 00:00:1727717169.956115  184728 se_gpu_pjrt_client.cc:853] Using BFC allocator.
I0000 00:00:1727717169.956204  184728 gpu_helpers.cc:107] XLA backend allocating 17787371520 bytes on device 4 for BFCAllocator.
I0000 00:00:1727717169.956527  184728 gpu_helpers.cc:147] XLA backend will use up to 5929123840 bytes on device 4 for CollectiveBFCAllocator.
I0000 00:00:1727717169.956749  184728 cuda_executor.cc:1032] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
I0000 00:00:1727717169.965518  186874 cuda_executor.cc:1032] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
I0000 00:00:1727717169.965584  186919 cuda_executor.cc:1032] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
I0000 00:00:1727717169.967922  184731 service.cc:145] XLA service 0x5582b5978080 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
I0000 00:00:1727717169.967972  184731 service.cc:153]   StreamExecutor device (0): NVIDIA A10G, Compute Capability 8.6
I0000 00:00:1727717169.967999  184729 service.cc:145] XLA service 0x55a094abf3a0 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
I0000 00:00:1727717169.968059  184729 service.cc:153]   StreamExecutor device (0): NVIDIA A10G, Compute Capability 8.6
I0000 00:00:1727717169.968577  184731 se_gpu_pjrt_client.cc:853] Using BFC allocator.
I0000 00:00:1727717169.968634  184731 gpu_helpers.cc:107] XLA backend allocating 17787371520 bytes on device 7 for BFCAllocator.
I0000 00:00:1727717169.968678  184731 gpu_helpers.cc:147] XLA backend will use up to 5929123840 bytes on device 7 for CollectiveBFCAllocator.
I0000 00:00:1727717169.968749  184729 se_gpu_pjrt_client.cc:853] Using BFC allocator.
I0000 00:00:1727717169.968807  184729 gpu_helpers.cc:107] XLA backend allocating 17787371520 bytes on device 5 for BFCAllocator.
I0000 00:00:1727717169.968848  184731 cuda_executor.cc:1032] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
I0000 00:00:1727717169.968856  184729 gpu_helpers.cc:147] XLA backend will use up to 5929123840 bytes on device 5 for CollectiveBFCAllocator.
I0000 00:00:1727717169.969072  184729 cuda_executor.cc:1032] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
recv at rank: 7, from:3 ... 
recv at rank: 6, from:2 ... 
recv at rank: 4, from:0 ... 
recv at rank: 5, from:1 ... 
F0000 00:00:1727717170.470668  184731 shape.h:207] Check failed: has_layout() element_type: TUPLE tuple_shapes { element_type: F32 dimensions: 2 layout { minor_to_major: 0 tail_padding_alignment_in_elements: 1 } is_dynamic_dimension: false } tuple_shapes { element_type: U32 layout { tail_padding_alignment_in_elements: 1 } } tuple_shapes { element_type: TOKEN }
F0000 00:00:1727717170.483437  184730 shape.h:207] Check failed: has_layout() element_type: TUPLE tuple_shapes { element_type: F32 dimensions: 2 layout { minor_to_major: 0 tail_padding_alignment_in_elements: 1 } is_dynamic_dimension: false } tuple_shapes { element_type: U32 layout { tail_padding_alignment_in_elements: 1 } } tuple_shapes { element_type: TOKEN }
F0000 00:00:1727717170.485669  184728 shape.h:207] Check failed: has_layout() element_type: TUPLE tuple_shapes { element_type: F32 dimensions: 2 layout { minor_to_major: 0 tail_padding_alignment_in_elements: 1 } is_dynamic_dimension: false } tuple_shapes { element_type: U32 layout { tail_padding_alignment_in_elements: 1 } } tuple_shapes { element_type: TOKEN }
F0000 00:00:1727717170.488110  184729 shape.h:207] Check failed: has_layout() element_type: TUPLE tuple_shapes { element_type: F32 dimensions: 2 layout { minor_to_major: 0 tail_padding_alignment_in_elements: 1 } is_dynamic_dimension: false } tuple_shapes { element_type: U32 layout { tail_padding_alignment_in_elements: 1 } } tuple_shapes { element_type: TOKEN }
send at rank: 0, to:4, tensor: tensor([1., 2.], device='xla:0')
send at rank: 3, to:7, tensor: tensor([7., 8.], device='xla:0')
send at rank: 2, to:6, tensor: tensor([5., 6.], device='xla:0')
send at rank: 1, to:5, tensor: tensor([3., 4.], device='xla:0')
I0000 00:00:1727717171.099255  184724 coordination_service_agent.cc:472] Coordination agent has initiated Shutdown().
I0000 00:00:1727717171.180823  184727 coordination_service_agent.cc:472] Coordination agent has initiated Shutdown().
I0000 00:00:1727717171.395648  184725 coordination_service_agent.cc:472] Coordination agent has initiated Shutdown().
I0000 00:00:1727717171.431319  184726 coordination_service_agent.cc:472] Coordination agent has initiated Shutdown().
*** Check failure stack trace: ***
    @     0x7f37825c8159  absl::lts_20230802::log_internal::LogMessageFatal::~LogMessageFatal()
    @     0x7f37781a358b  xla::Shape::layout()
    @     0x7f37789eb14a  xla::gpu::IrEmitterUnnested::EmitRecvThunk()
    @     0x7f37789f6900  xla::gpu::IrEmitterUnnested::EmitHloInstruction()
    @     0x7f37789f9a80  xla::gpu::IrEmitterUnnested::EmitHloComputation()
    @     0x7f37787ddb6e  xla::gpu::CompileModuleToLlvmIr()
    @     0x7f37787c0756  xla::gpu::GpuCompiler::CompileToBackendResult()
    @     0x7f37787c30f5  xla::gpu::GpuCompiler::RunBackend()
    @     0x7f3778783839  xla::Service::BuildExecutable()
    @     0x7f3778771533  xla::LocalService::CompileExecutables()
    @     0x7f37787636b7  xla::LocalClient::Compile()
    @     0x7f377873f44b  xla::PjRtStreamExecutorClient::Compile()
    @     0x7f377871226f  xla::StreamExecutorGpuClient::Compile()
    @     0x7f3778561177  torch_xla::runtime::PjRtComputationClient::Compile()
    @     0x7f3778357623  torch_xla::XLAGraphExecutor::Compile()
    @     0x7f377835953f  torch_xla::XLAGraphExecutor::SyncTensorsGraphInternal()
    @     0x7f3778359b41  torch_xla::XLAGraphExecutor::SyncTensorsGraph()
    @     0x7f37781fc98a  torch_xla::XLATensor::ApplyPendingGraph()
    @     0x7f37782008ed  torch_xla::XLATensor::GetXlaData()
    @     0x7f3778200a5d  torch_xla::XLATensor::ToTensor()
    @     0x7f3778144288  torch_xla::XLANativeFunctions::_to_copy()
    @     0x7f3778392ab5  c10::impl::wrap_kernel_functor_unboxed_<>::call()
    @     0x7f37983d3a65  at::_ops::_to_copy::redispatch()
*** Check failure stack trace: ***
    @     0x7fe2a6c36159  absl::lts_20230802::log_internal::LogMessageFatal::~LogMessageFatal()
    @     0x7fe29c81158b  xla::Shape::layout()
    @     0x7fe29d05914a  xla::gpu::IrEmitterUnnested::EmitRecvThunk()
    @     0x7fe29d064900  xla::gpu::IrEmitterUnnested::EmitHloInstruction()
    @     0x7fe29d067a80  xla::gpu::IrEmitterUnnested::EmitHloComputation()
    @     0x7fe29ce4bb6e  xla::gpu::CompileModuleToLlvmIr()
    @     0x7fe29ce2e756  xla::gpu::GpuCompiler::CompileToBackendResult()
    @     0x7fe29ce310f5  xla::gpu::GpuCompiler::RunBackend()
    @     0x7fe29cdf1839  xla::Service::BuildExecutable()
    @     0x7fe29cddf533  xla::LocalService::CompileExecutables()
    @     0x7fe29cdd16b7  xla::LocalClient::Compile()
    @     0x7fe29cdad44b  xla::PjRtStreamExecutorClient::Compile()
    @     0x7fe29cd8026f  xla::StreamExecutorGpuClient::Compile()
    @     0x7fe29cbcf177  torch_xla::runtime::PjRtComputationClient::Compile()
    @     0x7fe29c9c5623  torch_xla::XLAGraphExecutor::Compile()
    @     0x7fe29c9c753f  torch_xla::XLAGraphExecutor::SyncTensorsGraphInternal()
    @     0x7fe29c9c7b41  torch_xla::XLAGraphExecutor::SyncTensorsGraph()
    @     0x7fe29c86a98a  torch_xla::XLATensor::ApplyPendingGraph()
    @     0x7fe29c86e8ed  torch_xla::XLATensor::GetXlaData()
    @     0x7fe29c86ea5d  torch_xla::XLATensor::ToTensor()
    @     0x7fe29c7b2288  torch_xla::XLANativeFunctions::_to_copy()
    @     0x7fe29ca00ab5  c10::impl::wrap_kernel_functor_unboxed_<>::call()
    @     0x7fe2bca41a65  at::_ops::_to_copy::redispatch()
*** Check failure stack trace: ***
    @     0x7f4879690159  absl::lts_20230802::log_internal::LogMessageFatal::~LogMessageFatal()
    @     0x7f486f26b58b  xla::Shape::layout()
    @     0x7f486fab314a  xla::gpu::IrEmitterUnnested::EmitRecvThunk()
    @     0x7f486fabe900  xla::gpu::IrEmitterUnnested::EmitHloInstruction()
    @     0x7f486fac1a80  xla::gpu::IrEmitterUnnested::EmitHloComputation()
    @     0x7f486f8a5b6e  xla::gpu::CompileModuleToLlvmIr()
    @     0x7f486f888756  xla::gpu::GpuCompiler::CompileToBackendResult()
    @     0x7f486f88b0f5  xla::gpu::GpuCompiler::RunBackend()
    @     0x7f486f84b839  xla::Service::BuildExecutable()
    @     0x7f486f839533  xla::LocalService::CompileExecutables()
    @     0x7f486f82b6b7  xla::LocalClient::Compile()
    @     0x7f486f80744b  xla::PjRtStreamExecutorClient::Compile()
    @     0x7f486f7da26f  xla::StreamExecutorGpuClient::Compile()
    @     0x7f486f629177  torch_xla::runtime::PjRtComputationClient::Compile()
    @     0x7f486f41f623  torch_xla::XLAGraphExecutor::Compile()
    @     0x7f486f42153f  torch_xla::XLAGraphExecutor::SyncTensorsGraphInternal()
    @     0x7f486f421b41  torch_xla::XLAGraphExecutor::SyncTensorsGraph()
    @     0x7f486f2c498a  torch_xla::XLATensor::ApplyPendingGraph()
    @     0x7f486f2c88ed  torch_xla::XLATensor::GetXlaData()
    @     0x7f486f2c8a5d  torch_xla::XLATensor::ToTensor()
    @     0x7f486f20c288  torch_xla::XLANativeFunctions::_to_copy()
    @     0x7f486f45aab5  c10::impl::wrap_kernel_functor_unboxed_<>::call()
    @     0x7f488f49ba65  at::_ops::_to_copy::redispatch()
*** Check failure stack trace: ***
    @     0x7fd05d889159  absl::lts_20230802::log_internal::LogMessageFatal::~LogMessageFatal()
    @     0x7fd05346458b  xla::Shape::layout()
    @     0x7fd053cac14a  xla::gpu::IrEmitterUnnested::EmitRecvThunk()
    @     0x7fd053cb7900  xla::gpu::IrEmitterUnnested::EmitHloInstruction()
    @     0x7fd053cbaa80  xla::gpu::IrEmitterUnnested::EmitHloComputation()
    @     0x7fd053a9eb6e  xla::gpu::CompileModuleToLlvmIr()
    @     0x7fd053a81756  xla::gpu::GpuCompiler::CompileToBackendResult()
    @     0x7fd053a840f5  xla::gpu::GpuCompiler::RunBackend()
    @     0x7fd053a44839  xla::Service::BuildExecutable()
    @     0x7fd053a32533  xla::LocalService::CompileExecutables()
    @     0x7fd053a246b7  xla::LocalClient::Compile()
    @     0x7fd053a0044b  xla::PjRtStreamExecutorClient::Compile()
    @     0x7fd0539d326f  xla::StreamExecutorGpuClient::Compile()
    @     0x7fd053822177  torch_xla::runtime::PjRtComputationClient::Compile()
    @     0x7fd053618623  torch_xla::XLAGraphExecutor::Compile()
    @     0x7fd05361a53f  torch_xla::XLAGraphExecutor::SyncTensorsGraphInternal()
    @     0x7fd05361ab41  torch_xla::XLAGraphExecutor::SyncTensorsGraph()
    @     0x7fd0534bd98a  torch_xla::XLATensor::ApplyPendingGraph()
    @     0x7fd0534c18ed  torch_xla::XLATensor::GetXlaData()
    @     0x7fd0534c1a5d  torch_xla::XLATensor::ToTensor()
    @     0x7fd053405288  torch_xla::XLANativeFunctions::_to_copy()
    @     0x7fd053653ab5  c10::impl::wrap_kernel_functor_unboxed_<>::call()
    @     0x7fd073694a65  at::_ops::_to_copy::redispatch()

Environment

Docker image

us-central1-docker.pkg.dev/tpu-pytorch-releases/docker/xla:r2.3.0_3.10_cuda_12.1

Nvidia GPUs

nvidia-smi
Wed Sep 25 22:47:32 2024       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 560.35.03              Driver Version: 560.35.03      CUDA Version: 12.6     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA A10G                    Off |   00000000:00:16.0 Off |                    0 |
|  0%   31C    P0             61W /  300W |    1356MiB /  23028MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA A10G                    Off |   00000000:00:17.0 Off |                    0 |
|  0%   27C    P8             15W /  300W |     120MiB /  23028MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   2  NVIDIA A10G                    Off |   00000000:00:18.0 Off |                    0 |
|  0%   27C    P8             16W /  300W |     120MiB /  23028MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   3  NVIDIA A10G                    Off |   00000000:00:19.0 Off |                    0 |
|  0%   26C    P8             15W /  300W |     120MiB /  23028MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   4  NVIDIA A10G                    Off |   00000000:00:1A.0 Off |                    0 |
|  0%   26C    P8             16W /  300W |     120MiB /  23028MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   5  NVIDIA A10G                    Off |   00000000:00:1B.0 Off |                    0 |
|  0%   26C    P8             16W /  300W |     120MiB /  23028MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   6  NVIDIA A10G                    Off |   00000000:00:1C.0 Off |                    0 |
|  0%   26C    P8             15W /  300W |     120MiB /  23028MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   7  NVIDIA A10G                    Off |   00000000:00:1D.0 Off |                    0 |
|  0%   26C    P8             16W /  300W |     120MiB /  23028MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                                                         
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
+-----------------------------------------------------------------------------------------+

OS

6.2.0-1017-aws #17~22.04.1-Ubuntu SMP Fri Nov 17 21:07:13 UTC 2023 x86_64 GNU/Linux
@JackCaoG
Collaborator

On the TPU side we never really tested send and recv. @jeffhataws I wonder if you guys are actively using send and recv?

@ajayvohra2005
Author

Just tested with us-central1-docker.pkg.dev/tpu-pytorch-releases/docker/xla:r2.4.0_3.10_cuda_12.1 and I get the same error as with r2.3.0.

@ajayvohra2005
Author

ajayvohra2005 commented Sep 28, 2024

@JackCaoG I am curious: if you never use send and recv, how do you make a pipeline-parallel forward pass work? Is there a simple workaround that receives input from the previous PP rank, runs the model forward step on the current rank, and sends the output to the next PP rank without using send and recv? I can see how to simulate a send followed by a recv using all_gather if there is no intervening call, but not if there is an intervening call.
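
For reference, a minimal sketch of emulating a single send/recv pair with all_gather, along the lines described above. This is not part of the original report: the helper name emulate_send_recv is illustrative, it assumes the "xla" process group from the test code above is already initialized, and every rank pays the cost of a full all_gather.

import torch
import torch.distributed


def emulate_send_recv(tensor, src_rank, dst_rank):
    # Emulate a point-to-point transfer with all_gather: only the source rank
    # contributes real data, and only the destination rank keeps the slice
    # that came from the source. All other ranks contribute and discard zeros.
    rank = torch.distributed.get_rank()
    world_size = torch.distributed.get_world_size()

    contribution = tensor if rank == src_rank else torch.zeros_like(tensor)
    gathered = [torch.zeros_like(tensor) for _ in range(world_size)]
    torch.distributed.all_gather(gathered, contribution)

    return gathered[src_rank] if rank == dst_rank else None

As noted above, this only covers the case with no intervening call between the send and the recv.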

@JackCaoG
Collaborator

We (the Google team) have been mostly focused on TPU, and since the inter-host connection (ICI in the TPU case) is fast, we were able to get away with FSDP + TP. We don't have much experience with PP.

@ajayvohra2005
Author

This modified code

import torch
import torch_xla.core.xla_model as xm
import torch_xla.runtime as xr
import torch_xla.distributed.xla_backend as xb
import torch.distributed


def test_p2p():
    torch.distributed.init_process_group(backend="xla", init_method="xla://")       

    rank = torch.distributed.get_rank()
    print(f"node rank: {rank}")
    device = xm.xla_device()
    tensor = torch.arange(2, dtype=torch.float32, device=device) + 1 + 2 * rank
    recv_buffer = torch.zeros_like(tensor).to(device=device)
    
    next_map = { 0:4, 1:5, 2:6, 3:7}
    prev_map = { 4:0, 5:1, 6:2, 7:3}
    
    if rank < 4:
        print(f"send tensor: {tensor}")
        xm.send(tensor, next_map[rank])
    
        print(f"send at rank: {rank}, to:{next_map[rank]} ")

    if rank >= 4:
        print(f"Before rank: {rank} recv_buffer: {recv_buffer}")
        xm.recv(recv_buffer, prev_map[rank])
        #print(f"recv at rank: {rank}, from:{prev_map[rank]} {recv_buffer} ")

    torch.distributed.destroy_process_group()
    
if __name__ == "__main__":
    test_p2p()

works without errors using

us-central1-docker.pkg.dev/tpu-pytorch-releases/docker/xla:r2.5.0_3.10_cuda_12.4

However if I uncomment

#print(f"recv at rank: {rank}, from:{prev_map[rank]} {recv_buffer} ")

I see the following errors:

[rank7]: Traceback (most recent call last):
[rank7]:   File "/tmp/test.py", line 34, in <module>
[rank7]:     test_p2p()
[rank7]:   File "/tmp/test.py", line 29, in test_p2p
[rank7]:     print(f"recv at rank: {rank}, from:{prev_map[rank]} {recv_buffer} ")
[rank7]:   File "/usr/local/lib/python3.10/site-packages/torch/_tensor.py", line 1053, in __format__
[rank7]:     return object.__format__(self, format_spec)
[rank7]:   File "/usr/local/lib/python3.10/site-packages/torch/_tensor.py", line 523, in __repr__
[rank7]:     return torch._tensor_str._str(self, tensor_contents=tensor_contents)
[rank7]:   File "/usr/local/lib/python3.10/site-packages/torch/_tensor_str.py", line 708, in _str
[rank7]:     return _str_intern(self, tensor_contents=tensor_contents)
[rank7]:   File "/usr/local/lib/python3.10/site-packages/torch/_tensor_str.py", line 440, in _str_intern
[rank7]:     self = self.to("cpu")
[rank7]: RuntimeError: CHECK failed: it != end(): key not found: _xla_send_recv_source_target_pairs
[rank5]: Traceback (most recent call last):
[rank5]:   File "/tmp/test.py", line 34, in <module>
[rank5]:     test_p2p()
[rank5]:   File "/tmp/test.py", line 29, in test_p2p
[rank5]:     print(f"recv at rank: {rank}, from:{prev_map[rank]} {recv_buffer} ")
[rank5]:   File "/usr/local/lib/python3.10/site-packages/torch/_tensor.py", line 1053, in __format__
[rank5]:     return object.__format__(self, format_spec)
[rank5]:   File "/usr/local/lib/python3.10/site-packages/torch/_tensor.py", line 523, in __repr__
[rank5]:     return torch._tensor_str._str(self, tensor_contents=tensor_contents)
[rank5]:   File "/usr/local/lib/python3.10/site-packages/torch/_tensor_str.py", line 708, in _str
[rank5]:     return _str_intern(self, tensor_contents=tensor_contents)
[rank5]:   File "/usr/local/lib/python3.10/site-packages/torch/_tensor_str.py", line 440, in _str_intern
[rank5]:     self = self.to("cpu")
[rank5]: RuntimeError: CHECK failed: it != end(): key not found: _xla_send_recv_source_target_pairs
[libprotobuf FATAL external/com_google_protobuf/src/google/protobuf/map.h:1293] CHECK failed: it != end(): key not found: _xla_send_recv_source_target_pairs
[libprotobuf FATAL external/com_google_protobuf/src/google/protobuf/map.h:1293] CHECK failed: it != end(): key not found: _xla_send_recv_source_target_pairs
[rank6]: Traceback (most recent call last):
[rank6]:   File "/tmp/test.py", line 34, in <module>
[rank6]:     test_p2p()
[rank6]:   File "/tmp/test.py", line 29, in test_p2p
[rank6]:     print(f"recv at rank: {rank}, from:{prev_map[rank]} {recv_buffer} ")
[rank6]:   File "/usr/local/lib/python3.10/site-packages/torch/_tensor.py", line 1053, in __format__
[rank6]:     return object.__format__(self, format_spec)
[rank6]:   File "/usr/local/lib/python3.10/site-packages/torch/_tensor.py", line 523, in __repr__
[rank6]:     return torch._tensor_str._str(self, tensor_contents=tensor_contents)
[rank6]:   File "/usr/local/lib/python3.10/site-packages/torch/_tensor_str.py", line 708, in _str
[rank6]:     return _str_intern(self, tensor_contents=tensor_contents)
[rank6]:   File "/usr/local/lib/python3.10/site-packages/torch/_tensor_str.py", line 440, in _str_intern
[rank6]:     self = self.to("cpu")
[rank6]: RuntimeError: CHECK failed: it != end(): key not found: _xla_send_recv_source_target_pairs
[rank4]: Traceback (most recent call last):
[rank4]:   File "/tmp/test.py", line 34, in <module>
[rank4]:     test_p2p()
[rank4]:   File "/tmp/test.py", line 29, in test_p2p
[rank4]:     print(f"recv at rank: {rank}, from:{prev_map[rank]} {recv_buffer} ")
[rank4]:   File "/usr/local/lib/python3.10/site-packages/torch/_tensor.py", line 1053, in __format__
[rank4]:     return object.__format__(self, format_spec)
[rank4]:   File "/usr/local/lib/python3.10/site-packages/torch/_tensor.py", line 523, in __repr__
[rank4]:     return torch._tensor_str._str(self, tensor_contents=tensor_contents)
[rank4]:   File "/usr/local/lib/python3.10/site-packages/torch/_tensor_str.py", line 708, in _str
[rank4]:     return _str_intern(self, tensor_contents=tensor_contents)
[rank4]:   File "/usr/local/lib/python3.10/site-packages/torch/_tensor_str.py", line 440, in _str_intern
[rank4]:     self = self.to("cpu")
[rank4]: RuntimeError: CHECK failed: it != end(): key not found: _xla_send_recv_source_target_pairs
I0000 00:00:1733249742.963479   69043 coordination_service_agent.cc:602] Coordination agent has initiated Shutdown().
I0000 00:00:1733249742.977747   69047 coordination_service_agent.cc:602] Coordination agent has initiated Shutdown().
I0000 00:00:1733249743.012451   69044 coordination_service_agent.cc:602] Coordination agent has initiated Shutdown().
I0000 00:00:1733249743.078437   69048 coordination_service_agent.cc:602] Coordination agent has initiated Shutdown().
I0000 00:00:1733249743.270190   69042 coordination_service_agent.cc:602] Coordination agent has initiated Shutdown().
I0000 00:00:1733249743.301796   69045 coordination_service_agent.cc:602] Coordination agent has initiated Shutdown().
I0000 00:00:1733249743.338541   69046 coordination_service_agent.cc:602] Coordination agent has initiated Shutdown().
I0000 00:00:1733249743.443428   69049 coordination_service_agent.cc:602] Coordination agent has initiated Shutdown().
I0000 00:00:1733249743.445948   69043 coordination_service_agent.cc:621] Coordination agent has successfully shut down.
I0000 00:00:1733249743.445979   69047 coordination_service_agent.cc:621] Coordination agent has successfully shut down.
I0000 00:00:1733249743.445976   69044 coordination_service_agent.cc:621] Coordination agent has successfully shut down.
I0000 00:00:1733249743.446025   69048 coordination_service_agent.cc:621] Coordination agent has successfully shut down.
I0000 00:00:1733249743.446049   69042 coordination_service_agent.cc:621] Coordination agent has successfully shut down.
I0000 00:00:1733249743.446062   69045 coordination_service_agent.cc:621] Coordination agent has successfully shut down.
I0000 00:00:1733249743.446069   69049 coordination_service_agent.cc:621] Coordination agent has successfully shut down.
I0000 00:00:1733249743.446102   69046 coordination_service_agent.cc:621] Coordination agent has successfully shut down.
W1203 18:15:44.388000 68914 site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 69047 closing signal SIGTERM
W1203 18:15:44.388000 68914 site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 69048 closing signal SIGTERM
W1203 18:15:44.389000 68914 site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 69049 closing signal SIGTERM
E1203 18:15:44.618000 68914 site-packages/torch/distributed/elastic/multiprocessing/api.py:869] failed (exitcode: 1) local_rank: 4 (pid: 69046) of binary: /usr/local/bin/python3.10
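
For what it's worth, the traceback shows that printing recv_buffer goes through Tensor.__repr__, which copies the XLA tensor to the CPU (self.to("cpu") in the trace) and forces the pending lazy graph containing the recv to compile and run; that is where the _xla_send_recv_source_target_pairs lookup fails. Below is a minimal sketch of a variant to try, assuming xm.mark_step() to execute the graph before the host copy; whether this actually avoids the failure has not been verified here.

import torch
import torch.distributed
import torch_xla.core.xla_model as xm
import torch_xla.distributed.xla_backend as xb  # registers the "xla" backend


def test_p2p_print():
    torch.distributed.init_process_group(backend="xla", init_method="xla://")
    rank = torch.distributed.get_rank()
    device = xm.xla_device()
    tensor = torch.arange(2, dtype=torch.float32, device=device) + 1 + 2 * rank
    recv_buffer = torch.zeros_like(tensor)

    next_map = {0: 4, 1: 5, 2: 6, 3: 7}
    prev_map = {4: 0, 5: 1, 6: 2, 7: 3}

    if rank < 4:
        xm.send(tensor, next_map[rank])
    else:
        xm.recv(recv_buffer, prev_map[rank])

    # Execute the pending lazy graph (including the send/recv) before the
    # device-to-host copy that printing triggers.
    xm.mark_step()

    if rank >= 4:
        print(f"recv at rank: {rank}, from: {prev_map[rank]}, {recv_buffer.cpu()}")

    torch.distributed.destroy_process_group()


if __name__ == "__main__":
    test_p2p_print()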
