|
|
@@ -0,0 +1,77 @@ |
|
|
|
# Ideas/Questions for future code |
|
|
|
|
|
|
|
> Brainstorming here :P |
|
|
|
|
|
|
|
### MongoDB |
|
|
|
|
|
|
|
* Main database for crawled content |
|
|
|
* YouTube is mostly non-relational |
|
|
|
(except channels ↔ videos) |
|
|
|
* Users can change videos (title, etc.): |
|
|
|
Support for multiple crawls needed |
|
|
|
* In one document as array? |
|
|
|
Like `{videoid: xxx, crawls: []…` |
|
|
|
* Pros: Easy history query |
|
|
|
* Cons: (Title) indices might be harder to maintain |
|
|
|
* Or as separate documents? |
|
|
|
Like `{videoid: xxx, crawldate: …` |
|
|
|
* Pros: Race conditions less likely |
|
|
|
* Cons: Duplicates more likely? |
|
|
|
* Avoiding duplicates? |
|
|
|
* If the user hasn't changed video metadata, |
|
|
|
crawling it again is a waste of disk space |
|
|
|
* Rescan score: Should a video be rescanned? |
|
|
|
* Viral videos should be crawled more often |
|
|
|
* New videos shouldn't be instantly crawled again |
|
|
|
* Very old videos are unlikely to change |
|
|
|
* Maybe focus on views per week |
|
|
|
* Machine learning? |
|
|
|
* Hashing data from crawls to detect changes? |
|
|
|
* Invalidates old data on API upgrade |
|
|
|
* Could be used as an index tho |
|
|
|
* Live data |
|
|
|
* like views/comments/subscribers per day |
|
|
|
* vs more persistent data: Title/Description/video Formats |
|
|
|
* Are they worth crawling? |
|
|
|
* Additional data |
|
|
|
* Like subtitles and annotations |
|
|
|
* Need separate crawls |
|
|
|
* Not as important as main data |
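
To make the "one document as array" vs. "separate documents" question above a bit more concrete, here is a rough sketch using plain Python dicts as stand-ins for MongoDB documents. Only `videoid`, `crawls` and `crawldate` come from the notes; the `title` field, the helper names and the sample dates are made up for illustration.

```python
from datetime import date


def add_crawl_embedded(doc, crawl):
    """Layout A: one document per video, each crawl appended to an array."""
    doc.setdefault("crawls", []).append(crawl)
    return doc


def make_crawl_document(videoid, crawl):
    """Layout B: one document per crawl, keyed by video ID and crawl date."""
    return {"videoid": videoid, **crawl}


def history_embedded(doc):
    """History query for layout A: everything is already in one document."""
    return [c["title"] for c in doc["crawls"]]


def history_separate(docs, videoid):
    """History query for layout B: filter by video ID, then sort by date."""
    matching = [d for d in docs if d["videoid"] == videoid]
    return [d["title"] for d in sorted(matching, key=lambda d: d["crawldate"])]


# Made-up sample crawls of the same video before and after a title edit.
crawls = [
    {"crawldate": date(2020, 1, 1), "title": "First title"},
    {"crawldate": date(2020, 2, 1), "title": "Edited title"},
]

video = {"videoid": "xxx"}
for crawl in crawls:
    add_crawl_embedded(video, crawl)

collection = [make_crawl_document("xxx", crawl) for crawl in crawls]
```

Layout A answers the history query with a single lookup, while layout B needs a filter and a sort but never rewrites an existing document.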
|
|
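
One possible shape for the rescan score idea from the list above, as a hand-written heuristic rather than machine learning. Everything here is an assumption: the function name, the parameters, the thresholds and the formula itself are placeholders that would need tuning against real crawl data.

```python
import math


def rescan_score(video_age_days, days_since_last_crawl, views_last_week,
                 min_gap_days=1.0, old_age_days=365.0):
    """Return a score where higher means "more worth rescanning".

    All thresholds are hypothetical placeholders.
    """
    # Videos that were crawled very recently shouldn't be crawled again yet.
    if days_since_last_crawl < min_gap_days:
        return 0.0
    # Viral videos (many views per week) get a higher score; the logarithm
    # keeps a handful of huge videos from dominating the queue.
    popularity = math.log10(1 + views_last_week)
    # Very old videos are unlikely to change, so the score shrinks with age.
    age_penalty = 1.0 / (1.0 + video_age_days / old_age_days)
    return popularity * age_penalty
```

A maintainer bot could sort candidate IDs by this score and only re-add the top ones to the queue.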
|
|
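
The hashing idea from the list above could look roughly like this. It is a sketch assuming the crawl result is a dict and that only `title` and `description` count as a "change"; the field selection and function names are hypothetical. Hashing a fixed subset of fields instead of the raw API response is one way to soften the "invalidates old data on API upgrade" problem, since new or renamed fields outside the subset don't alter the hash.

```python
import hashlib
import json


def crawl_fingerprint(metadata, fields=("title", "description")):
    """Hash only the persistent fields so live counters don't affect it."""
    subset = {name: metadata.get(name) for name in fields}
    # sort_keys makes the serialization independent of dict ordering.
    payload = json.dumps(subset, sort_keys=True, ensure_ascii=False)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def has_changed(previous_fingerprint, metadata):
    """True if the persistent metadata differs from the stored fingerprint."""
    return crawl_fingerprint(metadata) != previous_fingerprint
```

Storing the fingerprint next to each crawl would let a worker skip the write when nothing changed, and the same value could serve as an index key.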
|
|
|
### Types of bot |
|
|
|
|
|
|
|
* __Discover bots__ |
|
|
|
* Find and push new video IDs to the queue |
|
|
|
* Monitor channels for new content |
|
|
|
* Discover new videos |
|
|
|
* __Maintainer bots__ |
|
|
|
* Occasionally look at the database and |
|
|
|
push backups/freezes to drive |
|
|
|
* Decide which old video IDs to re-add to the queue |
|
|
|
* __Worker bots__ |
|
|
|
* Get jobs from the Redis queue and crawl YT |
|
|
|
* Remove processed entries from the queue |
|
|
|
|
|
|
|
### Redis queue |
|
|
|
|
|
|
|
* A Redis queue lists video IDs that have been |
|
|
|
discovered, but not crawled |
|
|
|
* Discover bots push IDs if they find new ones |
|
|
|
* Implement queue priority? |
|
|
|
* Maintainer bots push IDs if they likely need rescans |
|
|
|
* States of queued items |
|
|
|
1. _Queued:_ Processing required |
|
|
|
(no worker bot picked them up yet) |
|
|
|
2. _Assigned:_ A worker claimed the ID and is processing it |
|
|
|
(hidden from other workers while assigned). |
|
|
|
If the worker doesn't mark the ID as done in time, |
|
|
|
it gets tagged as _Queued_ again |
|
|
|
3. _Done:_ Worker submitted crawl to the database |
|
|
|
(can be deleted from the queue) |
|
|
|
* Single point of failure |
|
|
|
* Potentially needs a ton of RAM |
|
|
|
* 800 M IDs at 100 bytes per entry = 80 GB |
|
|
|
* Shuts down entire crawl system on failure |
|
|
|
* Persistence: A crash can lose all discovered IDs |
|
|
|
* Alternative implementations |
|
|
|
* SQLite in-memory? |
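
As a quick feasibility check for the SQLite alternative, the three queue states and the "tag back as _Queued_ on timeout" rule from the list above can be sketched with Python's built-in `sqlite3` module. The table layout, function names, priority column and the 300 second lease are all assumptions. Note that an in-memory SQLite database belongs to a single connection, so this only models the state machine and not multi-worker access.

```python
import sqlite3

QUEUED, ASSIGNED, DONE = "queued", "assigned", "done"


def open_queue():
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE queue ("
        " video_id TEXT PRIMARY KEY,"
        " state TEXT NOT NULL,"
        " priority INTEGER NOT NULL DEFAULT 0,"
        " deadline REAL)"
    )
    return db


def push(db, video_id, priority=0):
    # INSERT OR IGNORE keeps an already known ID from being queued twice.
    db.execute(
        "INSERT OR IGNORE INTO queue (video_id, state, priority)"
        " VALUES (?, ?, ?)",
        (video_id, QUEUED, priority),
    )


def claim(db, now, lease_seconds=300.0):
    # Assigned IDs whose lease ran out go back to Queued.
    db.execute(
        "UPDATE queue SET state = ?, deadline = NULL"
        " WHERE state = ? AND deadline < ?",
        (QUEUED, ASSIGNED, now),
    )
    # Highest priority first, then oldest entry first.
    row = db.execute(
        "SELECT video_id FROM queue WHERE state = ?"
        " ORDER BY priority DESC, rowid LIMIT 1",
        (QUEUED,),
    ).fetchone()
    if row is None:
        return None
    # Marking the ID as Assigned hides it from later claim() calls.
    db.execute(
        "UPDATE queue SET state = ?, deadline = ? WHERE video_id = ?",
        (ASSIGNED, now + lease_seconds, row[0]),
    )
    return row[0]


def mark_done(db, video_id):
    db.execute(
        "UPDATE queue SET state = ?, deadline = NULL"
        " WHERE video_id = ? AND state = ?",
        (DONE, video_id, ASSIGNED),
    )


def purge_done(db):
    # Done entries can be deleted; returns how many rows were removed.
    return db.execute("DELETE FROM queue WHERE state = ?", (DONE,)).rowcount
```

Passing `now` in explicitly (instead of calling `time.time()` inside) keeps the lease logic easy to test; a real worker would pass the current time.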