# Ideas/Questions for future code

> Brainstorming here :P

### MongoDB

* Main database for crawled content
* YouTube is mostly non-relational
  (except channels ↔ videos)
* Users can change videos (title, etc.):
  Support for multiple crawls needed
  * In one document as array?
    Like `{videoid: xxx, crawls: []…`
    * Pros: Easy history query
    * Cons: (Title) indices might be harder to maintain
  * Or as separate documents?
    Like `{videoid: xxx, crawldate: …`
    * Pros: Race conditions less likely
    * Cons: Duplicates more likely?
* Avoiding duplicates?
  * If the user hasn't changed video metadata,
    crawling it again is a waste of disk space
  * Rescan score: Should a video be rescanned?
    * Viral videos should be crawled more often
    * New videos shouldn't be instantly crawled again
    * Very old videos are unlikely to change
    * Maybe focus on views per week
    * Machine learning?
  * Hashing data from crawls to detect changes?
    * Invalidates old data on API upgrade
    * Could be used as an index tho
* Live data
  * Like views/comments/subscribers per day
  * vs more persistent data: title/description/video formats
  * Are they worth crawling?
* Additional data
  * Like subtitles and annotations
  * Need separate crawls
  * Not as important as main data
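
A rough sketch of the "separate documents" layout combined with the hashing and rescan-score ideas above. Everything here is an assumption made for illustration: a plain list stands in for the MongoDB collection, the field names in `PERSISTENT_FIELDS` are guesses, and `rescan_score` is one possible hand-written heuristic, not a tuned formula.

```python
import hashlib
import json
import math
from datetime import datetime, timezone

# Assumed names of the "persistent" fields; live data such as views is left out on purpose.
PERSISTENT_FIELDS = ("title", "description", "formats")


def metadata_hash(crawl: dict) -> str:
    """Hash only the persistent fields, so a changed view count is not a metadata change.

    If the crawled field set changes (for example after an API upgrade),
    old hashes no longer compare against new ones.
    """
    stable = {key: crawl.get(key) for key in PERSISTENT_FIELDS}
    payload = json.dumps(stable, sort_keys=True, ensure_ascii=False)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def record_crawl(history: list, videoid: str, crawl: dict, crawldate: datetime) -> bool:
    """Store one crawl as its own document unless the metadata is unchanged.

    `history` is a list of crawl documents in insertion order.
    Returns True if a new document was appended.
    """
    digest = metadata_hash(crawl)
    latest = next((doc for doc in reversed(history) if doc["videoid"] == videoid), None)
    if latest is not None and latest["hash"] == digest:
        return False  # same metadata as the last crawl: skip to save disk space
    history.append({"videoid": videoid, "crawldate": crawldate, "hash": digest, **crawl})
    return True


def rescan_score(views_last_week: int, age_days: float, days_since_crawl: float) -> float:
    """Hypothetical heuristic: a higher score means the video should be rescanned sooner."""
    if days_since_crawl < 1:
        return 0.0  # crawled very recently: do not crawl again right away
    popularity = math.log10(1 + views_last_week)          # viral videos score higher
    staleness = days_since_crawl / (1 + age_days / 365)   # very old videos score lower
    return popularity * staleness


history = []
now = datetime(2020, 1, 1, tzinfo=timezone.utc)
record_crawl(history, "xxx", {"title": "First title", "views": 10}, now)
record_crawl(history, "xxx", {"title": "First title", "views": 500}, now)   # skipped
record_crawl(history, "xxx", {"title": "Edited title", "views": 500}, now)  # stored
```

Storing the hash next to each crawl is also what would let it double as an index, as noted above.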
### Types of bot

* __Discover bots__
  * Find and push new video IDs to the queue
  * Monitor channels for new content
  * Discover new videos
* __Maintainer bots__
  * Occasionally look at the database and
    push backups/freezes to drive
  * Decide which old video IDs to re-add to the queue
* __Worker bots__
  * Get jobs from the Redis queue and crawl YT
  * Remove processed entries from the queue
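
To make the division of labour concrete, here is a minimal single-process sketch of the three roles. It assumes a `deque` in place of the Redis queue and a dict in place of the database; the function names and the `crawl` / `needs_rescan` callbacks are made up for this example.

```python
from collections import deque


def discover_bot(found_ids, queue, database):
    """Push video IDs that are neither stored nor already queued. Returns how many were pushed."""
    pushed = 0
    for videoid in found_ids:
        if videoid not in database and videoid not in queue:
            queue.append(videoid)
            pushed += 1
    return pushed


def worker_bot(queue, database, crawl):
    """Crawl the oldest queued ID, store the result, then remove the entry from the queue.

    `crawl` is a caller-supplied function (videoid -> dict); no real YouTube access happens here.
    Returns the processed ID, or None if the queue is empty.
    """
    if not queue:
        return None
    videoid = queue[0]
    database.setdefault(videoid, []).append(crawl(videoid))
    queue.popleft()  # only removed once the crawl has been stored
    return videoid


def maintainer_bot(queue, database, needs_rescan):
    """Re-add stored IDs selected by the `needs_rescan(videoid, crawls)` predicate."""
    requeued = [v for v in database if needs_rescan(v, database[v]) and v not in queue]
    queue.extend(requeued)
    return requeued


queue, database = deque(), {}
discover_bot(["a", "b", "a"], queue, database)
worker_bot(queue, database, lambda videoid: {"title": videoid.upper()})
maintainer_bot(queue, database, lambda videoid, crawls: len(crawls) < 2)
```

This only works with one worker at a time; with several workers an ID has to be claimed before it is crawled.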
### Redis queue

* A Redis queue lists video IDs that have been
  discovered, but not crawled
* Discover bots push IDs if they find new ones
* Implement queue priority?
* Maintainer bots push IDs if they likely need rescans
* States of queued items
  1. _Queued:_ Processing required
     (no worker bot picked them up yet)
  2. _Assigned:_ Worker claimed ID and processes it.
     If the worker doesn't mark the ID as done in time
     it gets tagged back as _Queued_ again
     (should be hidden from other workers)
  3. _Done:_ Worker submitted crawl to the database
     (can be deleted from the queue)
* Single point of failure
  * Potentially needs a ton of RAM
    * 800 M IDs at 100 bytes per entry = 80 GB
  * Shuts down entire crawl system on failure
  * Persistence: A crash can lose all discovered IDs
* Alternative implementations
  * SQLite in-memory?
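
The three states can be tried out without Redis. The sketch below is an in-memory stand-in, assuming a per-ID lease deadline; `LeaseQueue`, its method names and the 300 second default are invented for this example. `queue_ram_bytes` just redoes the RAM estimate from the list.

```python
import time
from enum import Enum


class State(Enum):
    QUEUED = "queued"
    ASSIGNED = "assigned"
    DONE = "done"


class LeaseQueue:
    """In-memory stand-in for the queue; not persistent, so a crash loses every ID."""

    def __init__(self, lease_seconds=300.0, clock=time.monotonic):
        self.lease_seconds = lease_seconds
        self.clock = clock
        self.items = {}  # videoid -> (State, lease deadline or None)

    def push(self, videoid):
        """Add an ID as Queued; IDs that are already tracked are left untouched."""
        self.items.setdefault(videoid, (State.QUEUED, None))

    def requeue_expired(self):
        """Tag Assigned IDs whose lease ran out back as Queued."""
        now = self.clock()
        for videoid, (state, deadline) in list(self.items.items()):
            if state is State.ASSIGNED and deadline <= now:
                self.items[videoid] = (State.QUEUED, None)

    def claim(self):
        """Give the first Queued ID to a worker; Assigned IDs are hidden from other claims."""
        self.requeue_expired()
        for videoid, (state, _) in self.items.items():
            if state is State.QUEUED:
                self.items[videoid] = (State.ASSIGNED, self.clock() + self.lease_seconds)
                return videoid
        return None

    def done(self, videoid):
        """Mark an Assigned ID as Done. Returns False if the ID is not currently Assigned."""
        state, _ = self.items.get(videoid, (None, None))
        if state is not State.ASSIGNED:
            return False
        self.items[videoid] = (State.DONE, None)
        return True

    def purge_done(self):
        """Delete Done entries from the queue."""
        self.items = {v: s for v, s in self.items.items() if s[0] is not State.DONE}


def queue_ram_bytes(id_count, bytes_per_entry=100):
    """Back-of-the-envelope RAM estimate: entries times bytes per entry."""
    return id_count * bytes_per_entry


print(queue_ram_bytes(800_000_000) / 10**9, "GB")
```

Because there are no worker IDs here, a worker whose lease expired cannot be told apart from the worker that re-claimed the same ID; a real implementation would probably need to track who holds each lease.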