Ideas/Questions for future code

Brainstorming here :P


  • Main database for crawled content
  • YouTube is mostly non-relational (except channels ↔ videos)
  • Users can change video metadata (title, etc.), so multiple crawls per video need to be supported
    • In one document as array? Like {videoid: xxx, crawls: []…
      • Pros: Easy history query
      • Cons: (Title) indices might be harder to maintain
    • Or as separate documents? Like {videoid: xxx, crawldate: …
      • Pros: Race conditions less likely
      • Cons: Duplicates more likely?
    • Avoiding duplicates?
      • If the user hasn't changed video metadata, crawling it again is a waste of disk space
      • Rescan score: Should a video be rescanned?
        • Viral videos should be crawled more often
        • New videos shouldn't be instantly crawled again
        • Very old videos are unlikely to change
        • Maybe focus on views per week
        • Machine learning?
      • Hashing data from crawls to detect changes?
        • Invalidates old data on API upgrade
        • Could be used as an index tho
  • Live data
    • like views/comments/subscribers per day
    • vs more persistent data: title/description/video formats
    • Are they worth crawling?
  • Additional data
    • Like subtitles and annotations
    • Need separate crawls
    • Not as important as main data
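
Rough sketch of the "hashing data from crawls" idea, to see how it could avoid storing identical crawls. The field names (title, description, formats, views) and both function names are made up for illustration; the real crawl document layout isn't decided yet. Only the persistent fields are hashed, so changing live data like views doesn't count as a change.

```python
import hashlib
import json

# Hypothetical field names: the notes only mention title, description and formats.
PERSISTENT_FIELDS = ("title", "description", "formats")


def crawl_fingerprint(crawl: dict) -> str:
    """Return a SHA-256 hex digest over the persistent fields of one crawl."""
    stable = {field: crawl.get(field) for field in PERSISTENT_FIELDS}
    # sort_keys and fixed separators give the same bytes for the same data.
    encoded = json.dumps(stable, sort_keys=True, separators=(",", ":"), ensure_ascii=False)
    return hashlib.sha256(encoded.encode("utf-8")).hexdigest()


def needs_new_document(previous: dict, current: dict) -> bool:
    """True if the persistent metadata changed between two crawls."""
    return crawl_fingerprint(previous) != crawl_fingerprint(current)


if __name__ == "__main__":
    old = {"title": "a", "description": "d", "formats": ["720p"], "views": 10}
    print(needs_new_document(old, dict(old, views=99)))   # only live data changed
    print(needs_new_document(old, dict(old, title="b")))  # title changed
```

As noted above, any change to the hashed field set (for example after an API upgrade) changes every fingerprint, so old hashes would have to be recomputed or versioned.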

Types of bot

  • Discover bots
    • Find and push new video IDs to the queue
    • Monitor channels for new content
    • Discover new videos
  • Maintainer bots
    • Occasionally look at the database and push backups/freezes to drive
    • Decide which old video IDs to re-add to the queue
  • Worker bots
    • Get jobs from the Redis queue and crawl YT
    • Remove processed entries from the queue
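
A possible starting point for how a maintainer bot could decide which old IDs to re-add, based on the rescan score ideas above (views per week, video age, time since the last crawl). Everything here is an assumption: the function name, the inputs, and especially the weights and cut-offs, which are invented and would need tuning against real data.

```python
import math


def rescan_score(views_last_week: int,
                 days_since_upload: float,
                 days_since_last_crawl: float) -> float:
    """Higher score = more worth re-adding to the queue. All constants are guesses."""
    # Don't instantly crawl again: anything crawled within the last day scores zero.
    if days_since_last_crawl < 1.0:
        return 0.0
    # Viral videos score higher; log scale so huge view counts don't dominate.
    popularity = math.log10(1 + views_last_week)
    # Very old videos are unlikely to change, so the score decays with age.
    age_decay = 1.0 / (1.0 + days_since_upload / 365.0)
    # The longer since the last crawl, the more likely something changed (capped at 30 days).
    staleness = min(days_since_last_crawl / 30.0, 1.0)
    return popularity * age_decay * staleness


if __name__ == "__main__":
    # (views last week, days since upload, days since last crawl)
    candidates = {
        "viral": (1_000_000, 10, 7),
        "quiet": (100, 10, 7),
        "ancient": (1_000, 3650, 7),
    }
    ranked = sorted(candidates, key=lambda vid: rescan_score(*candidates[vid]), reverse=True)
    print(ranked)
```

A maintainer bot could push only the IDs above some threshold, or the top N per run.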

Redis queue

  • A Redis queue lists video IDs that have been discovered but not yet crawled
  • Discover bots push IDs when they find new ones
    • Implement queue priority?
  • Maintainer bots push IDs if they likely need rescans
  • States of queued items
    1. Queued: Processing required (no worker bot picked them up yet)
    2. Assigned: A worker has claimed the ID and is processing it (hidden from other workers). If the worker doesn't mark the ID as done in time, it is set back to Queued
    3. Done: Worker submitted crawl to the database (can be deleted from the queue)
  • Single point of failure
    • Potentially needs a ton of RAM
      • 800 M IDs at 100 bytes per entry = 80 GB
    • Shuts down entire crawl system on failure
    • Persistence: A crash can lose all discovered IDs
  • Alternative implementations
    • SQLite in-memory?
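
To make the Queued / Assigned / Done states above concrete, here is a small in-memory model in plain Python. It does not talk to Redis; the class name, method names and the 300 second default timeout are all made up for illustration. The point is only the state transitions: a claimed ID is hidden from other workers and falls back to Queued if it isn't marked done before its deadline.

```python
import time
from typing import Dict, List, Optional


class CrawlQueue:
    """In-memory stand-in for the queue; not a Redis client."""

    def __init__(self, claim_timeout: float = 300.0) -> None:
        self.claim_timeout = claim_timeout   # seconds a worker has to finish
        self.queued: List[str] = []          # state 1: Queued
        self.assigned: Dict[str, float] = {} # state 2: Assigned, ID -> deadline

    def push(self, video_id: str) -> None:
        """Add an ID unless it is already queued or being processed."""
        if video_id not in self.queued and video_id not in self.assigned:
            self.queued.append(video_id)

    def requeue_expired(self, now: float) -> None:
        """Move IDs whose deadline has passed back to Queued."""
        expired = [vid for vid, deadline in self.assigned.items() if deadline <= now]
        for vid in expired:
            del self.assigned[vid]
            self.queued.append(vid)

    def claim(self, now: Optional[float] = None) -> Optional[str]:
        """Hand the oldest queued ID to a worker, or None if nothing is queued."""
        now = time.monotonic() if now is None else now
        self.requeue_expired(now)
        if not self.queued:
            return None
        video_id = self.queued.pop(0)
        self.assigned[video_id] = now + self.claim_timeout
        return video_id

    def done(self, video_id: str) -> bool:
        """State 3: Done. Drop the ID; False if it was not assigned (e.g. timed out)."""
        return self.assigned.pop(video_id, None) is not None


if __name__ == "__main__":
    queue = CrawlQueue(claim_timeout=10)
    queue.push("id-1")
    queue.push("id-2")
    first = queue.claim(now=0)
    print(first, queue.done(first))
```

Passing `now` explicitly is only there so the timeout can be tried out without waiting. In Redis itself the same states could probably be kept in a list for Queued plus a sorted set scored by deadline for Assigned, but that mapping is an untested assumption.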