Deep Learning - GPU memory limitation, How to overcome it?
Shakeratto
2018. 3. 26. 22:36
Limited GPU Memory
- A GPU usually has far less device memory than the host has system memory
- The latest high-end GPUs (such as the NVIDIA Tesla P100)
- 12–16 GB device memory
- Host system memory
- 256 GB
- The trend in deep learning models is toward "deeper and wider" architectures
- RNNs in particular need a lot of memory, because the activations of every timestep must be kept for the backward pass (see the back-of-envelope sketch below)
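To see why RNNs are memory-hungry, here is a back-of-envelope calculation in Python. All the numbers (batch size, sequence length, hidden size, layer count) are illustrative assumptions, not measurements:

```python
# Rough activation-memory estimate for an RNN trained with BPTT:
# the hidden state of every timestep must be kept for the backward
# pass, so activation memory grows linearly with sequence length.
batch, seq_len, hidden, layers = 64, 1000, 2048, 4
bytes_per_float = 4  # fp32

activation_bytes = batch * seq_len * hidden * layers * bytes_per_float
print(f"{activation_bytes / 2**30:.2f} GiB of activations")  # ~1.95 GiB
```

Doubling the sequence length or the hidden size doubles this figure, which is how a single model can outgrow a 12-16 GB GPU while fitting comfortably in 256 GB of host memory.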
1. First Solution: Distributed Deep Learning
Source: M. Cho et al., "PowerAI DDL", 2017
- PowerAI DDL provides a unified infrastructure for distributed Deep Learning over multiple GPUs for a single node, multiple nodes and a cloud environment
- PowerAI DDL leverages an innovative multi-ring communication algorithm that balances communication latency against communication overhead (a toy all-reduce illustrating the ring pattern is sketched after this list)
- The PowerAI DDL library provides functionality for high-performance distributed Deep Learning that can be employed in multiple frameworks
- Currently there are PowerAI DDL-enabled versions of:
- Caffe
- TensorFlow
- Torch
- However, DDL requires multiple servers
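To make the ring idea concrete, here is a toy single-process simulation of ring all-reduce in NumPy. It is a sketch of the general communication pattern that DDL-style libraries build on, not PowerAI DDL's actual multi-ring algorithm:

```python
import numpy as np

def ring_allreduce(grads):
    """Simulate ring all-reduce: every worker ends with sum(grads)."""
    P = len(grads)
    # Each worker splits its local gradient into P chunks.
    chunks = [np.array_split(g.astype(float), P) for g in grads]

    # Phase 1 (reduce-scatter): in step s, worker r sends chunk
    # (r - s) mod P to worker (r + 1) mod P, which accumulates it.
    # After P-1 steps, worker r holds the fully reduced chunk (r+1) mod P.
    for s in range(P - 1):
        for r in range(P):
            c = (r - s) % P
            chunks[(r + 1) % P][c] += chunks[r][c]

    # Phase 2 (all-gather): in step s, worker r forwards its complete
    # chunk (r + 1 - s) mod P to worker (r + 1) mod P.
    for s in range(P - 1):
        for r in range(P):
            c = (r + 1 - s) % P
            chunks[(r + 1) % P][c][:] = chunks[r][c]

    return [np.concatenate(ch) for ch in chunks]

grads = [np.random.default_rng(i).standard_normal(10) for i in range(4)]
result = ring_allreduce(grads)
assert all(np.allclose(r, sum(grads)) for r in result)
```

Each worker transmits only 2*(P-1)/P of the gradient per all-reduce regardless of the worker count P, which is why ring-based schemes scale well in bandwidth; latency, however, grows with P, and that is the trade-off multi-ring variants target.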
2. Second Solution: Unified Memory
- Currently, only Caffe supports 'Unified Memory' (the CUDA mechanism behind it is sketched after these commands)
- /opt/DL/caffe-ibm/bin/caffe time --model=/opt/DL/caffe-ibm/models/bvlc_googlenet/deploy.prototxt -gpu=0 -iterations=1
- result: out of memory
- /opt/DL/caffe-ibm/bin/caffe time --model=/opt/DL/caffe-ibm/models/bvlc_googlenet/deploy.prototxt -gpu=0 -lms 8192 -iterations=1
- result: 1477.19 ms (lms: Large Model Support); the model that ran out of memory above now completes
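Large Model Support builds on CUDA Unified (managed) Memory: a single allocation is addressable from both CPU and GPU, and the driver pages data between host and device on demand, so allocations can exceed physical GPU memory. A minimal illustration using Numba's CUDA bindings (the array size and kernel are illustrative assumptions; this shows the general CUDA mechanism, not Caffe's implementation):

```python
import numpy as np
from numba import cuda

@cuda.jit
def scale(x, factor):
    """Multiply every element of x by factor on the GPU."""
    i = cuda.grid(1)
    if i < x.size:
        x[i] *= factor

n = 1 << 20
# Managed (unified) memory: one allocation visible to both CPU and GPU;
# the CUDA driver migrates pages between host and device on demand.
x = cuda.managed_array(n, dtype=np.float32)
x[:] = 1.0                                            # written on the host

threads = 256
scale[(n + threads - 1) // threads, threads](x, 2.0)  # used on the device
cuda.synchronize()
print(x[:4])                          # read back on the host: [2. 2. 2. 2.]
```

The price is paging traffic over PCIe (or NVLink), which is why LMS runs are typically slower than runs that fit entirely in device memory.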
3. Third Solution: Swap Out/In of Atomic Operations
Source: C. Meng et al., "Training Deeper Models by GPU Memory Optimization on TensorFlow", NIPS 2017
- Feature maps are transferred (swapped out) to CPU memory during the forward pass and transferred back (swapped in) to GPU memory when the backward pass needs them (a sketch follows at the end of this section)
- Table 1(a) of the paper: the maximum batch size can be increased by up to 4x
- Larger batches can in turn give better results
- Table 1(b) of the paper: the conventional approach can only train up to a ResNet-200 model before hitting "OOM" (out of memory), but swap out/in enables training ResNet-1001 and ResNet-2000 without OOM
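The swap-out/swap-in idea can be sketched in a few lines of PyTorch with a custom autograd function. This is an illustrative reimplementation of the concept, not the paper's TensorFlow code; `SwappedSquare` and all its details are hypothetical:

```python
import torch

class SwappedSquare(torch.autograd.Function):
    """y = x*x, but the saved activation lives in host memory."""

    @staticmethod
    def forward(ctx, x):
        y = x * x
        # Swap out: keep only a CPU copy of the activation that the
        # backward pass will need, freeing GPU memory in the meantime.
        ctx.save_for_backward(x.detach().to("cpu"))
        return y

    @staticmethod
    def backward(ctx, grad_y):
        (x_cpu,) = ctx.saved_tensors
        # Swap in: move the activation back to the GPU only at the
        # moment the gradient d(x*x)/dx = 2x actually needs it.
        x = x_cpu.to(grad_y.device)
        return 2 * x * grad_y

x = torch.randn(4, device="cuda", requires_grad=True)
SwappedSquare.apply(x).sum().backward()
print(x.grad)  # equals 2*x, computed from the swapped-in activation
```

A real implementation such as the paper's overlaps these transfers with computation on separate CUDA streams and decides which tensors to swap from the dataflow graph; the sketch only shows the round trip itself.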