
Does your method only work on CUDA 10? #23

Open
kids0cn opened this issue Oct 27, 2019 · 2 comments

Comments

@kids0cn

kids0cn commented Oct 27, 2019

Hi there,
I tried to reproduce your code on an NVIDIA V100 with CUDA 9, but it does not work.

Experiment dir : /home/limingnie/logsearch-note_of_this_run-20191027-142718
10/27 02:27:18 PM args = Namespace(add_layers=['0', '6', '12'], add_width=['0'], arch_learning_rate=0.0006, arch_weight_decay=0.001, batch_size=64, cifar100=False, cutout=False, cutout_length=16, drop_path_prob=0.3, dropout_rate=['0.1', '0.4', '0.7'], epochs=25, grad_clip=5, init_channels=16, layers=5, learning_rate=0.025, learning_rate_min=0.0, momentum=0.9, note='note_of_this_run', report_freq=50, save='/home/limingnie/logsearch-note_of_this_run-20191027-142718', seed=2, tmp_data_dir='/home/limingnie/cifar-10-batches-py', train_portion=0.5, weight_decay=0.0003, workers=2)
Files already downloaded and verified
10/27 02:27:33 PM param size = 1.276058MB
/home/limingnie/anaconda3/envs/pdarts_tx/lib/python3.7/site-packages/torch/optim/lr_scheduler.py:100: UserWarning: Detected call of `lr_scheduler.step()` before `optimizer.step()`. In PyTorch 1.1.0 and later, you should call them in the opposite order: `optimizer.step()` before `lr_scheduler.step()`.  Failure to do this will result in PyTorch skipping the first value of the learning rate schedule.See more details at https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate
  "https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate", UserWarning)
10/27 02:27:33 PM Epoch: 0 lr: 2.490143e-02
Traceback (most recent call last):
  File "train_search.py", line 465, in <module>
    main() 
  File "train_search.py", line 155, in main
    train_acc, train_obj = train(train_queue, valid_queue, model, network_params, criterion, optimizer, optimizer_a, lr, train_arch=False)
  File "train_search.py", line 292, in train
    logits = model(input)
  File "/home/limingnie/anaconda3/envs/pdarts_tx/lib/python3.7/site-packages/torch/nn/modules/module.py", line 541, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/limingnie/anaconda3/envs/pdarts_tx/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 148, in forward
    inputs, kwargs = self.scatter(inputs, kwargs, self.device_ids)
  File "/home/limingnie/anaconda3/envs/pdarts_tx/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 159, in scatter
    return scatter_kwargs(inputs, kwargs, device_ids, dim=self.dim)
  File "/home/limingnie/anaconda3/envs/pdarts_tx/lib/python3.7/site-packages/torch/nn/parallel/scatter_gather.py", line 36, in scatter_kwargs
    inputs = scatter(inputs, target_gpus, dim) if inputs else []
  File "/home/limingnie/anaconda3/envs/pdarts_tx/lib/python3.7/site-packages/torch/nn/parallel/scatter_gather.py", line 28, in scatter
    res = scatter_map(inputs)
  File "/home/limingnie/anaconda3/envs/pdarts_tx/lib/python3.7/site-packages/torch/nn/parallel/scatter_gather.py", line 15, in scatter_map
    return list(zip(*map(scatter_map, obj)))
  File "/home/limingnie/anaconda3/envs/pdarts_tx/lib/python3.7/site-packages/torch/nn/parallel/scatter_gather.py", line 13, in scatter_map
    return Scatter.apply(target_gpus, None, dim, obj)
  File "/home/limingnie/anaconda3/envs/pdarts_tx/lib/python3.7/site-packages/torch/nn/parallel/_functions.py", line 89, in forward
    outputs = comm.scatter(input, target_gpus, chunk_sizes, ctx.dim, streams)
  File "/home/limingnie/anaconda3/envs/pdarts_tx/lib/python3.7/site-packages/torch/cuda/comm.py", line 147, in scatter
    return tuple(torch._C._scatter(tensor, devices, chunk_sizes, dim, streams))
RuntimeError: CUDA error: out of memory (malloc at /pytorch/c10/cuda/CUDACachingAllocator.cpp:241)
frame #0: c10::Error::Error(c10::SourceLocation, std::string const&) + 0x33 (0x2b36b153c813 in /home/limingnie/anaconda3/envs/pdarts_tx/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0x1cb50 (0x2b36af129b50 in /home/limingnie/anaconda3/envs/pdarts_tx/lib/python3.7/site-packages/torch/lib/libc10_cuda.so)
frame #2: <unknown function> + 0x1de6e (0x2b36af12ae6e in /home/limingnie/anaconda3/envs/pdarts_tx/lib/python3.7/site-packages/torch/lib/libc10_cuda.so)
frame #3: at::native::empty_cuda(c10::ArrayRef<long>, c10::TensorOptions const&, c10::optional<c10::MemoryFormat>) + 0x279 (0x2b36f19a1eb9 in /home/limingnie/anaconda3/envs/pdarts_tx/lib/python3.7/site-packages/torch/lib/libtorch.so)
frame #4: <unknown function> + 0x41c27c8 (0x2b36f03ae7c8 in /home/limingnie/anaconda3/envs/pdarts_tx/lib/python3.7/site-packages/torch/lib/libtorch.so)
frame #5: <unknown function> + 0x3c7beb8 (0x2b36efe67eb8 in /home/limingnie/anaconda3/envs/pdarts_tx/lib/python3.7/site-packages/torch/lib/libtorch.so)
frame #6: <unknown function> + 0x1bd17b1 (0x2b36eddbd7b1 in /home/limingnie/anaconda3/envs/pdarts_tx/lib/python3.7/site-packages/torch/lib/libtorch.so)
frame #7: at::native::to(at::Tensor const&, c10::TensorOptions const&, bool, bool) + 0x272 (0x2b36eddbe152 in /home/limingnie/anaconda3/envs/pdarts_tx/lib/python3.7/site-packages/torch/lib/libtorch.so)
frame #8: <unknown function> + 0x1efadd0 (0x2b36ee0e6dd0 in /home/limingnie/anaconda3/envs/pdarts_tx/lib/python3.7/site-packages/torch/lib/libtorch.so)
frame #9: <unknown function> + 0x3a8db03 (0x2b36efc79b03 in /home/limingnie/anaconda3/envs/pdarts_tx/lib/python3.7/site-packages/torch/lib/libtorch.so)
frame #10: torch::cuda::scatter(at::Tensor const&, c10::ArrayRef<long>, c10::optional<std::vector<long, std::allocator<long> > > const&, long, c10::optional<std::vector<c10::optional<c10::cuda::CUDAStream>, std::allocator<c10::optional<c10::cuda::CUDAStream> > > > const&) + 0x4db (0x2b36f07a438b in /home/limingnie/anaconda3/envs/pdarts_tx/lib/python3.7/site-packages/torch/lib/libtorch.so)
frame #11: <unknown function> + 0x7846a3 (0x2b36ebbd06a3 in /home/limingnie/anaconda3/envs/pdarts_tx/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #12: <unknown function> + 0x20ffc4 (0x2b36eb65bfc4 in /home/limingnie/anaconda3/envs/pdarts_tx/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
<omitting python frames>
frame #20: THPFunction_apply(_object*, _object*) + 0x936 (0x2b36eb8ecb86 in /home/limingnie/anaconda3/envs/pdarts_tx/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
@chenxin061
Owner

Our test environment was CUDA 10, Python 3.6, and PyTorch 0.4 and 1.0.
However, one of my colleagues tested the code with CUDA 9 and it worked well.
Since the error in your log is an out-of-memory (OOM) error, I suggest you check the hyper-parameters first, e.g., the batch size.
Also, this code has not yet been tested with Python 3.7.
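
If it helps, here is a minimal diagnostic sketch (generic PyTorch, not part of this repository) for checking whether the V100s on your node already have memory taken by other jobs, and for pinning the run to a single GPU so that DataParallel's scatter does not touch a busy device:

import os

# Restrict the process to a single GPU *before* torch initializes CUDA;
# this is a generic workaround, not something the P-DARTS code requires.
os.environ.setdefault("CUDA_VISIBLE_DEVICES", "0")

import torch

# Print each visible device and its total memory so that an OOM caused by
# other processes already occupying the GPUs is easy to spot.
for i in range(torch.cuda.device_count()):
    props = torch.cuda.get_device_properties(i)
    print(f"cuda:{i}: {props.name}, {props.total_memory / 1024 ** 3:.1f} GB total")

The other straightforward thing to try is lowering the batch_size shown in your log (64).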

@Jeffrey-JDong

Hi, I have the same question. I tried to reproduce the code with CUDA 9, Python 3.6, and PyTorch 0.4, but it still runs out of memory.
