|
|
|
|
# Ideas/Questions for future code
|
|
|
|
|
|
|
|
|
|
> Brainstorming here :P
|
|
|
|
|
|
|
|
|
|
### MongoDB
|
|
|
|
|
|
|
|
|
|
* Main database for crawled content
|
|
|
|
|
* YouTube is mostly non-relational
|
|
|
|
|
(except channels ↔ videos)
|
|
|
|
|
* Users can change videos (title, etc.):
|
|
|
|
|
Support for multiple crawls needed
|
|
|
|
|
* In one document as array?
|
|
|
|
|
Like `{videoid: xxx, crawls: []…`
|
|
|
|
|
* Pros: Easy history query
|
|
|
|
|
* Cons: (Title) indices might be harder to maintain
|
|
|
|
|
* Or as separate documents?
|
|
|
|
|
Like `{videoid: xxx, crawldate: …`
|
|
|
|
|
* Pros: Race conditions less likely
|
|
|
|
|
* Cons: Duplicates more likely?
|
|
|
|
|
* Avoiding duplicates?
|
|
|
|
|
* If the user hasn't changed video metadata,
|
|
|
|
|
crawling it again is a waste of disk space
|
|
|
|
|
* Rescan score: Should a video be rescanned?
|
|
|
|
|
* Viral videos should be crawled more often
|
|
|
|
|
* New videos shouldn't be instantly crawled again
|
|
|
|
|
* Very old videos are unlikely to change
|
|
|
|
|
* Maybe focus on views per week
|
|
|
|
|
* Machine learning?
|
|
|
|
|
* Hashing data from crawls to detect changes?
|
|
|
|
|
* Invalidates old data on API upgrade
|
|
|
|
|
* Could be used as an index tho
|
|
|
|
|
* Live data
|
|
|
|
|
* like views/comments/subscribers per day
|
|
|
|
|
* vs more persistent data: Title/Description/video Formats
|
|
|
|
|
* Are they worth crawling
|
|
|
|
|
* Additional data
|
|
|
|
|
* Like subtitles and annotations
|
|
|
|
|
* Need separate crawls
|
|
|
|
|
* Not as important as main data
|
|
|
|
|
|
|
|
|
|
### Types of bot
|
|
|
|
|
|
|
|
|
|
* __Discover bots__
|
|
|
|
|
* Find and push new video IDs to the queue
|
|
|
|
|
* Monitor channels for new content
|
|
|
|
|
* Discover new videos
|
|
|
|
|
* __Maintainer bots__
|
|
|
|
|
* Occasionally look at the database and
|
|
|
|
|
push backups/freezes to drive
|
|
|
|
|
* Decide which old video IDs to re-add to the queue
|
|
|
|
|
* __Worker bots__
|
|
|
|
|
* Get jobs from the Redis queue and crawl YT
|
|
|
|
|
* Remove processed entries from the queue
|
|
|
|
|
|
|
|
|
|
### Redis queue
|
|
|
|
|
|
|
|
|
|
* A redis queue lists video IDs that have been
|
|
|
|
|
discovered, but not crawled
|
|
|
|
|
* Discover bots bots push IDs if they find new ones
|
|
|
|
|
* Implement queue priority?
|
|
|
|
|
* Maintainer bots push IDs if they likely need rescans
|
|
|
|
|
* States of queued items
|
|
|
|
|
1. _Queued:_ Processing required
|
|
|
|
|
(no worker bot picked them up yet)
|
|
|
|
|
2. _Assigned:_ Worker claimed ID and processes it.
|
|
|
|
|
If the worker doesn't mark the ID as done in time
|
|
|
|
|
it gets tagged back as _Queued_ again
|
|
|
|
|
(should be hidden from other workers)
|
|
|
|
|
3. _Done:_ Worker submitted crawl to the database
|
|
|
|
|
(can be deleted from the queue)
|
|
|
|
|
* Single point of failure
|
|
|
|
|
* Potentially needs ton of RAM
|
|
|
|
|
* 800 M IDs at 100 bytes per entry = 80 GB
|
|
|
|
|
* Shuts down entire crawl system on failure
|
|
|
|
|
* Persistence: A crash can use all discovered IDs
|
|
|
|
|
* Alternative implementations
|
|
|
|
|
* SQLite in-memory?
|