From the error message:

RuntimeError: CUDA out of memory. Tried to allocate 33.84 GiB (GPU 0; 79.35 GiB total capacity; 36.51 GiB already allocated; 32.48 GiB free; 44.82 GiB reserved in total by PyTorch)
Most likely, what's already in memory is:
- 36.51 GiB allocated, most probably the model loaded onto GPU RAM
- 44.82 GiB reserved, which should be the 36.51 GiB allocated plus PyTorch's overhead

and what you need is:
- 33.84 GiB for the evaluation batch
- but only 32.48 GiB is free
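The arithmetic above can be sanity-checked in a few lines (the figures are taken straight from the error message):

```python
# Figures from the RuntimeError above, in GiB.
total_capacity = 79.35
allocated = 36.51      # model weights etc. already on the GPU
reserved = 44.82       # allocated memory plus PyTorch caching-allocator overhead
free = 32.48
requested = 33.84      # what the evaluation batch tried to allocate

# The request exceeds the free memory, hence the OOM.
assert requested > free

# The reserved pool is the allocated memory plus roughly 8 GiB of overhead.
overhead = round(reserved - allocated, 2)
print(overhead)  # 8.31
```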
So I guess there are a few options. You can try reducing per_device_eval_batch_size, going from 7 all the way down to 1, and see if that works, e.g.
training_args = TrainingArguments(
    output_dir='./results',
    num_train_epochs=10,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=1,
    ...)
If that doesn't work, maybe it's the default accumulation; see https://huggingface.co/docs/transformers/main_classes/trainer#transformers.TrainingArguments.eval_accumulation_steps
You can try:
training_args = TrainingArguments(
    output_dir='./results',
    num_train_epochs=10,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=1,
    eval_accumulation_steps=1,
    ...)
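To see why eval_accumulation_steps helps: without it, the Trainer keeps all prediction tensors on the GPU until evaluation ends; with eval_accumulation_steps=1 it moves them to CPU memory after every step. Here is a conceptual pure-Python sketch of that idea (not the Trainer's actual implementation):

```python
def run_eval(batches, eval_accumulation_steps=1):
    """Conceptual sketch: offload accumulated predictions from the
    'GPU' buffer to the 'CPU' buffer every eval_accumulation_steps."""
    gpu_buffer, cpu_buffer = [], []
    peak_gpu_items = 0
    for step, preds in enumerate(batches, start=1):
        gpu_buffer.append(preds)              # predictions land on the GPU first
        peak_gpu_items = max(peak_gpu_items, len(gpu_buffer))
        if step % eval_accumulation_steps == 0:
            cpu_buffer.extend(gpu_buffer)     # offload to host memory
            gpu_buffer.clear()
    cpu_buffer.extend(gpu_buffer)             # flush any remainder
    return cpu_buffer, peak_gpu_items

preds, peak = run_eval([[1], [2], [3], [4]], eval_accumulation_steps=1)
assert preds == [[1], [2], [3], [4]] and peak == 1  # only one batch held at a time
```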
Sometimes this is also the reason why predictions are not generated by default. I'm not sure why it happens, but I think there is some overhead when evaluation runs with just model.eval() or with torch.no_grad() while predict_with_generate is set to False. That's just my guess, though; see https://discuss.huggingface.co/t/cuda-out-of-memory-only-during-validation-not-training/18378
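For reference, evaluating under torch.no_grad() stops autograd from building a graph and allocating extra state, which is part of why eval memory can differ from what you expect. A minimal CPU-only sketch (the Linear layer is just a stand-in for your real model):

```python
import torch

model = torch.nn.Linear(4, 2)  # stand-in for your real model
model.eval()                   # switch off dropout / batch-norm updates

x = torch.randn(3, 4)
with torch.no_grad():          # no autograd graph is built, saving memory
    out = model(x)

assert out.requires_grad is False  # nothing kept around for backward
assert out.shape == (3, 2)
```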
If that's the case, you can try (note that predict_with_generate is a Seq2SeqTrainingArguments argument, not a plain TrainingArguments one):
training_args = Seq2SeqTrainingArguments(
    output_dir='./results',
    num_train_epochs=10,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=1,
    eval_accumulation_steps=1,
    predict_with_generate=True,
    ...)
Or you could try auto_find_batch_size (this needs the accelerate package installed), i.e.
training_args = Seq2SeqTrainingArguments(
    output_dir='./results',
    num_train_epochs=10,
    predict_with_generate=True,
    auto_find_batch_size=True,
    ...)
There are also a couple of memory tricks:
# At the imports part of your code.
# See https://pytorch.org/docs/stable/generated/torch.cuda.set_per_process_memory_fraction.html
import torch
torch.cuda.set_per_process_memory_fraction(0.9)
If it still doesn't work, try the algorithmic tricks from https://huggingface.co/docs/transformers/main/en/perf_train_gpu_one
training_args = Seq2SeqTrainingArguments(
    output_dir='./results',
    num_train_epochs=10,
    fp16=True,
    optim="adafactor",
    gradient_checkpointing=True,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=1,
    eval_accumulation_steps=1,
    predict_with_generate=True,
    ...)
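For intuition on why fp16 helps, here's a back-of-envelope estimate of weight memory alone (the parameter count is a hypothetical example; real usage also depends on activations, optimizer state, and batch size):

```python
# Hypothetical ~3B-parameter model; weights only, ignoring activations.
num_params = 3e9

gib = 2**30
fp32_weights = num_params * 4 / gib   # 4 bytes per fp32 parameter
fp16_weights = num_params * 2 / gib   # 2 bytes per fp16 parameter

# fp16 halves the weight footprint.
print(round(fp32_weights, 1), round(fp16_weights, 1))  # 11.2 5.6
```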