Shakerato
Deep Learning - GPU Memory Limitation: How to Overcome It?
Limited GPU Memory
- A GPU usually has less device memory than the host has system memory
- The latest high-end GPUs (such as the NVIDIA P100): 12–16 GB of device memory
- Host system memory: 256 GB
- The trend in deep learning models is toward “deeper and wider” architectures
- RNNs in particular need a lot of memory
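To make the gap concrete, here is a rough back-of-the-envelope sketch in Python. The cost model (weights, gradients, one optimizer state copy, plus per-sample activations) and every number in it are illustrative assumptions, not measurements of any real model or framework.

```python
def training_memory_gib(num_params, activations_per_sample, batch_size,
                        bytes_per_value=4, optimizer_states=1):
    """Rough training footprint in GiB under a simplified, assumed cost model."""
    # Weights + gradients + optimizer state copies all scale with parameter count.
    param_values = num_params * (2 + optimizer_states)
    # Activations kept for the backward pass scale with the batch size.
    activation_values = activations_per_sample * batch_size
    return (param_values + activation_values) * bytes_per_value / 2**30


def max_batch_size(memory_gib, num_params, activations_per_sample, **kwargs):
    """Largest batch size whose estimate fits in memory_gib (0 if none fits)."""
    batch = 0
    while training_memory_gib(num_params, activations_per_sample,
                              batch + 1, **kwargs) <= memory_gib:
        batch += 1
    return batch


# Hypothetical model: 64Mi parameters, 32Mi activation values per sample.
HYPOTHETICAL_PARAMS = 2**26
HYPOTHETICAL_ACTIVATIONS = 2**25

gpu_batch = max_batch_size(16, HYPOTHETICAL_PARAMS, HYPOTHETICAL_ACTIVATIONS)
host_batch = max_batch_size(256, HYPOTHETICAL_PARAMS, HYPOTHETICAL_ACTIVATIONS)
print(gpu_batch, host_batch)
```

Under these assumptions the activations dominate, so the batch size that fits grows almost linearly with the available memory. That is why the solutions below either spread the work across devices or borrow host memory.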
1. First Solution: Distributed Deep Learning
Source: M. Cho et al., "PowerAI DDL", 2017
- PowerAI DDL provides a unified infrastructure for distributed Deep Learning over multiple GPUs for a single node, multiple nodes and a cloud environment
- PowerAI DDL leverages an innovative multi-ring communication algorithm that balances communication latency with the communication overhead
- The PowerAI DDL library provides functionality for high-performance distributed Deep Learning that can be employed in multiple frameworks
- Currently there are PowerAI DDL enabled versions of:
- Caffe
- TensorFlow
- Torch
- However, DDL needs multiple servers
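The multi-ring algorithm itself is not described here, so the following is only a sketch of the generic single-ring all-reduce that ring-based gradient exchange builds on. It is not PowerAI DDL's implementation or API. Workers are simulated as plain Python lists, and the function name `ring_allreduce` is made up for this example.

```python
def ring_allreduce(worker_grads):
    """Simulate a single-ring all-reduce (sum) over equally sized gradient lists."""
    n = len(worker_grads)
    length = len(worker_grads[0])
    # Split each vector into n contiguous chunks, one per ring position.
    bounds = [(k * length // n, (k + 1) * length // n) for k in range(n)]
    data = [list(g) for g in worker_grads]

    # Phase 1, reduce-scatter: after n-1 steps worker i holds the full sum
    # of chunk (i + 1) % n.
    for step in range(n - 1):
        # Snapshot outgoing messages so all sends in a step are simultaneous.
        msgs = []
        for i in range(n):
            chunk = (i - step) % n
            lo, hi = bounds[chunk]
            msgs.append((chunk, data[i][lo:hi]))
        for i, (chunk, payload) in enumerate(msgs):
            dst = (i + 1) % n  # each worker only talks to its ring neighbour
            lo, _ = bounds[chunk]
            for j, value in enumerate(payload):
                data[dst][lo + j] += value

    # Phase 2, all-gather: circulate the finished chunks around the ring.
    for step in range(n - 1):
        msgs = []
        for i in range(n):
            chunk = (i + 1 - step) % n
            lo, hi = bounds[chunk]
            msgs.append((chunk, data[i][lo:hi]))
        for i, (chunk, payload) in enumerate(msgs):
            dst = (i + 1) % n
            lo, hi = bounds[chunk]
            data[dst][lo:hi] = payload
    return data


grads = [[1, 2, 3, 4, 5], [10, 20, 30, 40, 50], [100, 200, 300, 400, 500]]
reduced = ring_allreduce(grads)
print(reduced[0])
```

Each worker sends only one chunk per step to a single neighbour, so the per-worker traffic stays roughly constant as workers are added. Balancing that bandwidth cost against the latency of many small steps is the trade-off the bullets above refer to.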
2. Second Solution: Unified Memory
- Currently, only Caffe supports ‘Unified Memory’
- /opt/DL/caffe-ibm/bin/caffe time --model=/opt/DL/caffe-ibm/models/bvlc_googlenet/deploy.prototxt -gpu=0 -iterations=1
- result: out of memory
- /opt/DL/caffe-ibm/bin/caffe time --model=/opt/DL/caffe-ibm/models/bvlc_googlenet/deploy.prototxt -gpu=0 -lms 8192 -iterations=1
- result: 1477.19ms (lms: Large Model Support)
3. Third Solution: Swap Out/In Atomic operations
Source: C. Meng et al., "Training Deeper Models by GPU Memory Optimization on TensorFlow", NIPS 2017
- Tensors are swapped out to CPU memory and transferred back to GPU memory when they are needed again
- According to Table 1(a) of the paper, the maximum batch size can be increased by up to 4 times
- This leads to better results
- According to Table 1(b) of the paper, the conventional approach can only train a ResNet-200 model (“OOM” means “Out of Memory”), but 'swap out/in' makes it possible to train ResNet-1001 and ResNet-2000 without OOM
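The effect of swapping on peak device memory can be illustrated with a toy simulation. This is not the paper's implementation: the eviction policy (keep only the most recent activations on the device), the function name `peak_device_memory`, and the layer sizes are all assumptions made for illustration.

```python
def peak_device_memory(activation_sizes, resident_limit=None):
    """Simulate one forward and backward pass over a chain of layers.

    Returns (peak device memory, total amount transferred), both in the same
    arbitrary units as activation_sizes. With resident_limit=None nothing is
    swapped and every activation stays on the device until the backward pass.
    """
    on_device = {}  # layer index -> activation size resident on the GPU
    on_host = {}    # layer index -> activation size swapped out to the CPU
    peak = 0
    transferred = 0

    # Forward pass: each layer produces an activation needed again in backward.
    for i, size in enumerate(activation_sizes):
        on_device[i] = size
        peak = max(peak, sum(on_device.values()))
        if resident_limit is not None:
            # Swap out the oldest activations beyond the resident window.
            while len(on_device) > resident_limit:
                oldest = min(on_device)
                on_host[oldest] = on_device.pop(oldest)
                transferred += on_host[oldest]

    # Backward pass: visit layers in reverse, swapping activations back in.
    for i in reversed(range(len(activation_sizes))):
        if i in on_host:
            on_device[i] = on_host.pop(i)
            transferred += on_device[i]
            peak = max(peak, sum(on_device.values()))
        del on_device[i]  # the activation is freed once its gradient is computed
    return peak, transferred


layers = [4] * 10  # hypothetical: ten layers, 4 units of activation memory each
print(peak_device_memory(layers))
print(peak_device_memory(layers, resident_limit=2))
```

In this toy model the peak drops from the sum of all activations to a small window, at the price of extra host-device traffic. A real system would also need to overlap those transfers with computation to keep the slowdown acceptable.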