Tags: Yutaro-Sanada/pytorch
(torch/elastic) add fqdn hostname to error printout (pytorch#66182) (pytorch#66662)

Summary:
Pull Request resolved: pytorch#66182
Closes pytorch#63174

Does a few things:
1. Adds the hostname to the error report.
2. Moves the "Root Cause" section to the end (since the logs are typically being "tailed", we want the root cause to appear at the end).
3. Moves redundant error-info logging to debug level.
4. Caps the border at 60 characters and left-justifies the header.

NOTE: you HAVE TO annotate your main function with torch.distributed.elastic.multiprocessing.errors.record, otherwise no traceback is printed (this is because Python exception propagation does NOT work out of the box for IPC, hence the extra record annotation).

Test Plan: Sample output:

```
============================================================
run_script_path FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2021-10-05_17:37:22
  host      : devvm4955.prn0.facebook.com
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 3296201)
  error_file: /home/kiuk/tmp/elastic/none_3_lsytqe/attempt_0/0/error.json
  traceback : Traceback (most recent call last):
    File "/tmp/jetter.xr3_x6qq/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 372, in wrapper
      return f(*args, **kwargs)
    File "main.py", line 28, in main
      raise RuntimeError(args.throws)
  RuntimeError: foobar
============================================================
```

Reviewed By: cbalioglu, aivanou
Differential Revision: D31416492
fbshipit-source-id: 0aeaf6e634e23ce0ea7f6a03b12c8a9ac57246e9
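As the NOTE above says, the traceback only appears in the report if the entrypoint is wrapped with the `record` decorator. A minimal sketch (the failing body is illustrative, mirroring the sample output, and is not taken from the PR):

```python
from torch.distributed.elastic.multiprocessing.errors import record


# @record writes the exception info of a failing worker to the error file
# that torchelastic reads back when it assembles the failure report above.
@record
def main():
    raise RuntimeError("foobar")  # illustrative failure


if __name__ == "__main__":
    main()
```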
Fix cosine similarity dim checks (pytorch#66214)

* Fix cosine similarity dimensionality check
* Fix shapes in the doc
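A hedged illustration of the `dim` argument that the check validates (shapes below are chosen for demonstration and are not taken from the PR):

```python
import torch
import torch.nn.functional as F

# Two batches of 128-dimensional vectors; similarity is computed along dim=1.
x1 = torch.randn(4, 128)
x2 = torch.randn(4, 128)
print(F.cosine_similarity(x1, x2, dim=1).shape)  # torch.Size([4])

# dim must index into the input shape; an out-of-range value (e.g. dim=2 for
# 2-D inputs) raises an error from the dimensionality check.
try:
    F.cosine_similarity(x1, x2, dim=2)
except Exception as e:
    print(type(e).__name__, e)
```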
[1.10] Remove torch.vmap (pytorch#65496)

torch.vmap is a prototype feature and should not be in the stable binary. This PR:
- Removes the torch.vmap API
- Removes the documentation entry for torch.vmap
- Changes the vmap tests to use an internal API instead of torch.vmap

Test Plan:
- Tested locally (test_torch, test_autograd, test_type_hints, test_vmap), but also wait for CI.
Fix builder pinning (pytorch#64971)

`git checkout release/1.9` should be run after the working directory is changed to `BUILDER_ROOT`.
(torch.distributed) Add torch.distributed.is_torchelastic_launched() util method + make init_method=tcp:// compatible with torchelastic (pytorch#63910) (pytorch#64826)

Summary:
Pull Request resolved: pytorch#63910

Addresses the current issue that `init_method=tcp://` is not compatible with `torch.distributed.run` and `torch.distributed.launch`. When running with a training script that initializes the process group with `init_method=tcp://localhost:$port`, as in:

```
$ python -u -m torch.distributed.run --max_restarts 0 --nproc_per_node 1 --nnodes 1 --master_addr $(hostname) --master_port 6000 ~/tmp/test.py
```

an `Address in use` error is raised, since the training script tries to create a TCPStore on port 6000, which is already taken because the elastic agent is already running a TCPStore on that port. For details see pytorch#63874.

This change does a couple of things:
1. Adds an `is_torchelastic_launched()` check function that users can use in their training scripts to see whether the script was launched via torchelastic.
2. Updates the `torch.distributed` docs page to include the new `is_torchelastic_launched()` function.
3. Makes `init_method=tcp://` torchelastic-compatible by modifying `_tcp_rendezvous_handler` in `torch.distributed.rendezvous` (this is NOT the elastic rendezvous; it is the old rendezvous module, which is slotted for deprecation in future releases) to check `is_torchelastic_launched()` AND `torchelastic_use_agent_store()`, and if so, only create TCPStore clients (no daemons, not even for rank 0).
4. Adds a bunch of unit tests to cover the different code paths.

NOTE: the issue mentions that we should fail fast with an assertion on `init_method != env://` when `is_torchelastic_launched()` is `True`. There are three registered init_methods in PyTorch: env://, tcp://, and file://. Since this diff makes tcp:// compatible with torchelastic, and file:// has been validated as compatible with torchelastic, there is no need to add assertions. I did update the docs to point out that env:// is the RECOMMENDED init_method. We should probably deprecate the other init_methods in the future, but that is out of scope for this issue.

Test Plan: Unit tests.

Reviewed By: cbalioglu
Differential Revision: D30529984
fbshipit-source-id: 267aea6d4dad73eb14a2680ac921f210ff547cc5
Co-authored-by: Kiuk Chung <[email protected]>
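A minimal, hedged sketch of how the new check might be used in a training script (the backend, host, port, rank, and world size below are illustrative assumptions, not values from the PR):

```python
import torch.distributed as dist


def setup_process_group():
    if dist.is_torchelastic_launched():
        # Launched via torch.distributed.run / torchelastic: the launcher has
        # already set MASTER_ADDR, MASTER_PORT, RANK, and WORLD_SIZE, so the
        # RECOMMENDED env:// init_method can be used directly.
        dist.init_process_group(backend="gloo", init_method="env://")
    else:
        # Launched manually: fall back to an explicit TCP rendezvous address.
        dist.init_process_group(
            backend="gloo",
            init_method="tcp://localhost:29500",
            rank=0,
            world_size=1,
        )
```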
Wrap cub in its own namespace (pytorch#55292) (pytorch#61605)

Summary:
Tentative fix for pytorch#55027. Wraps the cub import in its own namespace so that static variables used by cub and thrust don't conflict if they end up in different libraries when torch is built with BUILD_SPLIT_CUDA. cub variables end up in their own namespace, thrust variables are unwrapped, so they don't clash. This also allows extensions to use cub without wrapping it (thrust will still be problematic). The solution to allowing extensions to use thrust is to stop using thrust in pytorch completely.

Now importing cub and importing thrust cannot coexist, so I had to move nonzero to its own file and remove its reliance on thrust functions. Nonzero now uses cub only. Also, we cannot selectively import just some of the cub headers; we are forced to import `cub/cub.cuh`, which is not great.

Caffe2 ops using cub are not touched (there are too many), so mixing caffe2 and torch will (can) still result in the same bug. We are moving towards disabling c2 ops, so I think this is fine.

Still, even with that, the compiler (correctly) warns about redefinition of `CUB_NS_PREFIX`, because including `ATen/ATen.h` transitively includes `thrust/complex.h`, which in turn includes the original (empty) definition of `CUB_NS_PREFIX`. We can probably just ignore this warning. Here's an example warning:

```
In file included from /data/users/ngimel/pytorch/aten/src/ATen/native/cuda/Nonzero.cu:9:
/data/users/ngimel/pytorch/aten/src/ATen/cuda/CubUtils.cuh:4: warning: "CUB_NS_PREFIX" redefined
 #define CUB_NS_PREFIX namespace at{ namespace native{
In file included from /home/ngimel/local/cuda/include/thrust/system/cuda/config.h:76,
                 from /home/ngimel/local/cuda/include/thrust/system/cuda/detail/execution_policy.h:33,
                 from /home/ngimel/local/cuda/include/thrust/iterator/detail/device_system_tag.h:23,
                 from /home/ngimel/local/cuda/include/thrust/iterator/iterator_traits.h:111,
                 from /home/ngimel/local/cuda/include/thrust/detail/type_traits/pointer_traits.h:23,
                 from /home/ngimel/local/cuda/include/thrust/type_traits/is_contiguous_iterator.h:27,
                 from /home/ngimel/local/cuda/include/thrust/type_traits/is_trivially_relocatable.h:19,
                 from /home/ngimel/local/cuda/include/thrust/detail/complex/complex.inl:20,
                 from /home/ngimel/local/cuda/include/thrust/complex.h:1031,
                 from /data/users/ngimel/pytorch/c10/util/complex.h:9,
                 from /data/users/ngimel/pytorch/c10/core/ScalarType.h:4,
                 from /data/users/ngimel/pytorch/c10/core/Scalar.h:10,
                 from /data/users/ngimel/pytorch/build/aten/src/ATen/core/TensorBody.h:8,
                 from /data/users/ngimel/pytorch/aten/src/ATen/Tensor.h:3,
                 from /data/users/ngimel/pytorch/aten/src/ATen/Context.h:4,
                 from /data/users/ngimel/pytorch/aten/src/ATen/ATen.h:9,
                 from /data/users/ngimel/pytorch/aten/src/ATen/native/cuda/Nonzero.cu:1:
/home/ngimel/local/cuda/include/cub/util_namespace.cuh:43: note: this is the location of the previous definition
 #define CUB_NS_PREFIX
```

We will need a lint rule to prevent people from including `cub/cub.cuh`, because that will lead to pytorch#55027 reappearing again for some sequence of operations (and will lead to errors with cub code in extensions). Also, for this to work reliably we'll need to make sure that everything calling cub ends up in only one of libtorch_cuda_cu or libtorch_cuda_cpp, otherwise even the namespace won't help (there would still be the same symbols in two libraries).

Upd: libtorch_cuda_cpp and libtorch_cuda_cu still contain the same symbols, which means there exists a sequence of operations that will cause the cache bug to reappear, so this is not a solution on its own; we need to adjust the file lists for BUILD_SPLIT_CUDA:

```
(pytorch) [ngimel@ ~/local/pytorch/build/lib] nm libtorch_cuda_cu.so | grep PerDeviceAttributeCache | c++filt
000000000c6bf808 u guard variable for at::native::cub::GetPerDeviceAttributeCache<at::native::cub::PtxVersionCacheTag>()::cache
000000000c600830 u guard variable for cub::GetPerDeviceAttributeCache<cub::PtxVersionCacheTag>()::cache
00000000018625e0 t at::native::cub::PerDeviceAttributeCache::DevicePayload at::native::cub::PerDeviceAttributeCache::operator()<at::native::cub::PtxVersion(int&)::{lambda(int&)#1}>(at::native::cub::PtxVersion(int&)::{lambda(int&)#1}&&, int)
00000000009ce630 t cub::PerDeviceAttributeCache::DevicePayload cub::PerDeviceAttributeCache::operator()<cub::PtxVersion(int&)::{lambda(int&)#1}>(cub::PtxVersion(int&)::{lambda(int&)#1}&&, int)
000000000c6bf820 u at::native::cub::GetPerDeviceAttributeCache<at::native::cub::PtxVersionCacheTag>()::cache
000000000c600840 u cub::GetPerDeviceAttributeCache<cub::PtxVersionCacheTag>()::cache
(pytorch) [ngimel@ ~/local/pytorch/build/lib] nm libtorch_cuda_cpp.so | grep PerDeviceAttributeCache | c++filt
0000000000ad2d98 u guard variable for at::native::cub::GetPerDeviceAttributeCache<at::native::cub::PtxVersionCacheTag>()::cache
0000000000ad2da0 u at::native::cub::GetPerDeviceAttributeCache<at::native::cub::PtxVersionCacheTag>()::cache
```

Upd2: Moved TensorFactories.cu to the torch_cuda_cu sources (see the change to caffe2/CMakeLists.txt), so now cub-related symbols are only in libtorch_cuda_cu. We'd need a test for that; any suggestions on how best to test it?

cc zasdfgbnm malfet

Pull Request resolved: pytorch#55292
Reviewed By: anjali411
Differential Revision: D27576442
Pulled By: ngimel
fbshipit-source-id: 1ef29503a342bb214794d34a42a47052092a66c1
Co-authored-by: Natalia Gimelshein <[email protected]>
[docs] Add torch.package documentation for beta release (pytorch#59886)

**Summary**
This commit adds documentation for the `torch.package` module to accompany its beta release in 1.9.

**Test Plan**
Continuous integration.
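A minimal sketch of the kind of workflow the new `torch.package` docs cover; the archive path, package names, and resource names below are illustrative placeholders, and only built-in Python objects are packaged to keep the example free of dependency-handling decisions:

```python
from torch.package import PackageExporter, PackageImporter

config = {"lr": 0.1, "layers": [64, 64]}

# Write a self-contained archive containing a pickled object and a text resource.
with PackageExporter("demo_package.pt") as exporter:
    exporter.save_pickle("config", "config.pkl", config)
    exporter.save_text("notes", "readme.txt", "packaged with torch.package")

# Load the resources back from the archive.
importer = PackageImporter("demo_package.pt")
print(importer.load_pickle("config", "config.pkl"))
print(importer.load_text("notes", "readme.txt"))
```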