Shakerato



YouTube 8M dataset download

Shakeratto 2018. 6. 1. 17:02

* What is the YouTube 8M dataset?

   https://research.google.com/youtube8m/download.html


* How to download the dataset?

* Download the dataset on Linux, not Windows,

  because some of its file names differ only by letter case.

  (The default Windows file system is case-insensitive, so 'Ab.txt' and 'aB.txt' cannot both exist in the same folder.)
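
To see why this matters, the following is a minimal sketch that groups file names differing only by letter case. The function name `find_case_collisions` and the sample names are made up for illustration; they are not part of the download script.

```python
from collections import defaultdict


def find_case_collisions(names):
    """Group file names that differ only by letter case."""
    groups = defaultdict(list)
    for name in names:
        # Names that share a lowercase form would overwrite each other
        # on a case-insensitive file system.
        groups[name.lower()].append(name)
    # Keep only the groups that contain more than one distinct spelling.
    return [sorted(group) for group in groups.values() if len(set(group)) > 1]


print(find_case_collisions(["Ab.txt", "aB.txt", "c.txt"]))
# prints [['Ab.txt', 'aB.txt']]
```

On Linux both files in each reported group are kept; on a case-insensitive file system the second one would replace the first.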


1. Download the 'download_fix.py' file from this URL:

   https://storage.googleapis.com/data.yt8m.org/download_fix.py


2. Modify the code to download the dataset (video-level features)

   This step is optional. You can download the dataset directly with this command:

    'curl data.yt8m.org/download.py | partition=2/video/train mirror=us python'


    I modified the code because the download is sometimes interrupted by network latency.


 2.1. Set os.environ['partition'] to '2/video/train', '2/video/validate', or '2/video/test'

 2.2. Set os.environ['mirror'] to 'us', 'eu', or 'asia'

 2.3. Comment out all the code that checks whether the 'os.environ' values (partition, mirror, shard) are set

 2.4. Comment out the code that checks whether 'curl' is installed

 2.5. Replace the download code with a urllib-based function like the one at the link below. This step is important.

      (https://github.com/entymos/crawling/blob/master/python3_file_download)

  2.5.1. os.system('curl %s > %s' % (plan_url, plan_filename))

         -> fileDownload(plan_url, plan_filename)

  2.5.2. os.system('curl %s > %s' % (download_url, f))

         -> fileDownload(download_url, f)
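
For reference, here is one way a `fileDownload(url, filename)` helper could look. This is a hypothetical sketch, not the code from the linked repository. The `retries` and `wait_seconds` parameters and the '.part' temporary file are assumptions added to cope with interrupted downloads.

```python
import http.client
import os
import shutil
import time
import urllib.request


def fileDownload(url, filename, retries=3, wait_seconds=5):
    """Download url to filename, retrying if the transfer fails.

    retries and wait_seconds are illustrative defaults, not values
    taken from the original script.
    """
    partial = filename + ".part"  # temporary name while downloading
    for attempt in range(1, retries + 1):
        try:
            with urllib.request.urlopen(url, timeout=60) as response, \
                    open(partial, "wb") as out:
                # Stream the body to disk instead of loading it into memory.
                shutil.copyfileobj(response, out)
            # Rename only after a complete transfer, so a failed attempt
            # never leaves a truncated file under the final name.
            os.replace(partial, filename)
            return
        except (OSError, http.client.HTTPException) as error:
            print("attempt %d/%d failed: %s" % (attempt, retries, error))
            if attempt == retries:
                raise
            time.sleep(wait_seconds)
```

Unlike curl, `urllib.request.urlopen` needs a full URL including the scheme. If the script builds a URL without one, prepend 'http://' before passing it to the function.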


3. Run the code once for each partition (train, validate, test), changing the value from step 2.1 each time.
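
As an alternative to editing the file before each run, the three runs could be driven from a small wrapper. This sketch assumes the script still reads 'partition' and 'mirror' from the environment, as the curl command above does, rather than having them hard-coded. The names `build_env` and `download_all` are hypothetical.

```python
import os
import subprocess
import sys

# The three video-level partitions listed in step 2.1.
PARTITIONS = ["2/video/train", "2/video/validate", "2/video/test"]


def build_env(partition, mirror="us"):
    """Return a copy of the current environment with the download settings."""
    env = dict(os.environ)
    env["partition"] = partition
    env["mirror"] = mirror
    return env


def download_all(script="download_fix.py", mirror="us"):
    """Run the download script once per partition, stopping on the first failure."""
    for partition in PARTITIONS:
        print("downloading", partition)
        subprocess.run(
            [sys.executable, script],
            env=build_env(partition, mirror),
            check=True,  # raise CalledProcessError if the script exits non-zero
        )
```

Calling `download_all()` from the folder that contains 'download_fix.py' would then fetch train, validate, and test in turn.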
