
Deep Learning - GPU Memory Limitations and How to Overcome Them

Shakeratto 2018. 3. 26. 22:36

Limited GPU Memory

  • A GPU usually has less device memory than the host has system memory
    • A recent high-end GPU (such as the NVIDIA Tesla P100)
      • 12–16 GB device memory
    • Host system memory
      • 256 GB
  • The trend in deep learning models is toward “deeper and wider” architectures
    • RNNs in particular need a lot of memory

1. First Solution: Distributed Deep Learning
Source: M. Cho et al., "PowerAI DDL", 2017
  • PowerAI DDL provides a unified infrastructure for distributed Deep Learning over multiple GPUs, covering a single node, multiple nodes, and a cloud environment
  • PowerAI DDL leverages an innovative multi-ring communication algorithm that balances communication latency with communication overhead
  • The PowerAI DDL library provides functionality for high-performance distributed Deep Learning that can be employed in multiple frameworks
  • PowerAI DDL-enabled versions are currently available for:
    • Caffe
    • TensorFlow
    • Torch
  • However, DDL requires multiple servers
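
To make the idea of ring-based gradient exchange more concrete, here is a minimal sketch of a plain single-ring all-reduce, simulated in pure Python. It is not the PowerAI DDL multi-ring algorithm and uses no DDL API; the function name `ring_allreduce` and the list-based "workers" are assumptions made purely for illustration. Each worker sends only one chunk of its gradient per step, so no worker ever has to receive the whole gradient from every peer at once.

```python
def ring_allreduce(worker_grads):
    """Sum equal-length gradient lists across workers over a simulated ring.

    Hypothetical helper for illustration only; it is not part of PowerAI DDL.
    """
    n = len(worker_grads)
    length = len(worker_grads[0])
    assert length % n == 0, "gradient length must be divisible by worker count"
    chunk = length // n

    # Each worker keeps its own copy of the gradient, split into n chunks.
    data = [[list(g[i * chunk:(i + 1) * chunk]) for i in range(n)]
            for g in worker_grads]

    # Phase 1 (reduce-scatter): after n-1 steps, worker r holds the fully
    # summed chunk (r + 1) % n.
    for step in range(n - 1):
        # Snapshot payloads first so all sends in a step are simultaneous.
        sends = [(rank, (rank - step) % n) for rank in range(n)]
        payloads = [list(data[rank][idx]) for rank, idx in sends]
        for (rank, idx), payload in zip(sends, payloads):
            dst = (rank + 1) % n
            data[dst][idx] = [a + b for a, b in zip(data[dst][idx], payload)]

    # Phase 2 (all-gather): pass the finished chunks around the ring.
    for step in range(n - 1):
        sends = [(rank, (rank + 1 - step) % n) for rank in range(n)]
        payloads = [list(data[rank][idx]) for rank, idx in sends]
        for (rank, idx), payload in zip(sends, payloads):
            dst = (rank + 1) % n
            data[dst][idx] = payload

    # Flatten the chunks back into one gradient list per worker.
    return [[x for c in worker for x in c] for worker in data]


grads = [
    [1, 2, 3, 4, 5, 6],
    [10, 20, 30, 40, 50, 60],
    [100, 200, 300, 400, 500, 600],
]
reduced = ring_allreduce(grads)
print(reduced[0])
```

Every worker ends up with the same summed gradient, which is what data-parallel training needs before each weight update.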

2. Second Solution: Unified Memory

  • Currently, only Caffe supports ‘Unified Memory’
    • /opt/DL/caffe-ibm/bin/caffe time --model=/opt/DL/caffe-ibm/models/bvlc_googlenet/deploy.prototxt -gpu=0 -iterations=1
      • result: out of memory
    • /opt/DL/caffe-ibm/bin/caffe time --model=/opt/DL/caffe-ibm/models/bvlc_googlenet/deploy.prototxt -gpu=0 -lms 8192 -iterations=1
      • result: 1477.19 ms (lms: Large Model Support)

3. Third Solution: Swap Out/In Atomic Operations

    Source: C. Meng et al., "Training Deeper Models by GPU Memory Optimization on TensorFlow", NIPS 2017
    • Tensors are transferred (swapped out) to CPU memory while they are not in use, then transferred back (swapped in) to GPU memory when they are needed again
    • Table 1(a) of the paper: the maximum batch size can be increased by up to 4 times
      • This gives better results
    • Table 1(b) of the paper: the conventional approach can train only up to a ResNet-200 model (“OOM” means “Out of Memory”), but swap out/in makes it possible to train ResNet-1001 and ResNet-2000 without OOM
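
The effect of swapping on peak device memory can be illustrated with a toy simulation. This is a sketch under simplifying assumptions, not the paper's TensorFlow implementation: only activations are counted (weights and gradients are ignored), every size is in arbitrary units, and the names `simulate_step` and `keep_on_gpu` are hypothetical.

```python
def simulate_step(activation_sizes, swap=False, keep_on_gpu=1):
    """Return (peak simulated GPU usage, number of host<->device transfers).

    Toy model: each layer produces one activation that must be available
    again during the backward pass. Hypothetical helper for illustration.
    """
    on_gpu = {}    # layer index -> size, resident in device memory
    on_host = {}   # layer index -> size, swapped out to host memory
    peak = 0
    transfers = 0

    # Forward pass: each layer allocates its activation on the GPU.
    for i, size in enumerate(activation_sizes):
        on_gpu[i] = size
        peak = max(peak, sum(on_gpu.values()))
        if swap:
            # Swap out the oldest activations, keeping only the most recent.
            for j in sorted(on_gpu):
                if len(on_gpu) <= keep_on_gpu:
                    break
                on_host[j] = on_gpu.pop(j)
                transfers += 1

    # Backward pass: visit layers in reverse, swapping activations back in.
    for i in reversed(range(len(activation_sizes))):
        if i in on_host:
            on_gpu[i] = on_host.pop(i)
            transfers += 1
        peak = max(peak, sum(on_gpu.values()))
        del on_gpu[i]  # the activation is no longer needed after its gradient

    return peak, transfers


sizes = [4, 4, 4, 4]
print(simulate_step(sizes, swap=False))  # (16, 0)
print(simulate_step(sizes, swap=True))   # (8, 6)
```

In this toy model, swapping halves the peak device memory at the cost of six extra transfers. That is the trade-off the technique makes: lower peak memory in exchange for host-device traffic.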