
Future ideas

terorie · 1 year ago
2 files changed, 91 additions and 6 deletions

+14 −6

@@ -1,6 +1,6 @@
# WIP: yt-mango 💾

-> YT metadata extractor inspired by [`youtube-ma` by _CorentinB_][1]
+> YT metadata extractor inspired by [`youtube-ma` by _CorentinB_][youtube-ma]

##### Build

@@ -10,11 +10,19 @@ If you don't have a Go toolchain, grab an executable from the Releases tab

##### Project structure

-- _/controller_: Manages workers (sends tasks, gets results, …)
 - _/common_: Commonly used HTTP code
-- _/data_: Data structures
-- _/db_: MongoDB connection
+- _/data_: Data definitions
+- _/api_: Abstract API definitions
+- _/apiclassic_: HTML API implementation (parsing using [goquery][goquery])
+- _/apijson_: JSON API implementation (parsing using [fastjson][fastjson])
+- _/net_: HTTP utilities (async HTTP implementation)
+
+- _/pretty_: (not yet used) Terminal color utilities
+- _/controller_: (not yet implemented) worker management
+- _/db_: (not yet implemented) MongoDB connection
+- _???_: (not yet implemented) Redis queue
-- _/classic_: Extractor calling the HTML `/watch` API
-- _/watchapi_: Extractor calling the JSON `/watch` API


+77 −0

@@ -0,0 +1,77 @@
# Ideas/Questions for future code

> Brainstorming here :P

### MongoDB

* Main database for crawled content
* YouTube is mostly non-relational
(except channels ↔ videos)
* Users can change videos (title, etc.):
Support for multiple crawls needed
* In one document as array?
Like `{videoid: xxx, crawls: []…`
* Pros: Easy history query
* Cons: (Title) indices might be harder to maintain
* Or as separate documents?
Like `{videoid: xxx, crawldate: …`
* Pros: Race conditions less likely
* Cons: Duplicates more likely?
* Avoiding duplicates?
* If the user hasn't changed video metadata,
crawling it again is a waste of disk space
* Rescan score: Should a video be rescanned?
* Viral videos should be crawled more often
* New videos shouldn't be instantly crawled again
* Very old videos are unlikely to change
* Maybe focus on views per week
* Machine learning?
* Hashing data from crawls to detect changes?
* Invalidates old data on API upgrade
* Could be used as an index, though
* Live data
* like views/comments/subscribers per day
* vs more persistent data: Title/Description/video Formats
* Are they worth crawling?
* Additional data
* Like subtitles and annotations
* Need separate crawls
* Not as important as main data
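The two document layouts sketched above (crawls embedded as an array vs. one document per crawl) could look like this as Go structs. The field names and `bson` tags are illustrative assumptions, not an existing schema:

```go
package main

import (
	"fmt"
	"time"
)

// Crawl is one snapshot of a video's metadata at crawl time.
type Crawl struct {
	Date  time.Time `bson:"crawldate"`
	Title string    `bson:"title"`
	Views int64     `bson:"views"`
}

// Variant A: one document per video, with all crawls embedded.
// Pro: the full history comes back in a single query.
// Con: indices on fields inside the array (e.g. title) are harder
// to maintain, and concurrent crawlers must append atomically.
type VideoEmbedded struct {
	VideoID string  `bson:"videoid"`
	Crawls  []Crawl `bson:"crawls"`
}

// Variant B: one document per crawl.
// Pro: inserts never race each other.
// Con: duplicate snapshots are more likely, and history queries
// span many documents.
type VideoCrawlDoc struct {
	VideoID string    `bson:"videoid"`
	Date    time.Time `bson:"crawldate"`
	Title   string    `bson:"title"`
	Views   int64     `bson:"views"`
}

func main() {
	v := VideoEmbedded{
		VideoID: "dQw4w9WgXcQ",
		Crawls:  []Crawl{{Date: time.Now(), Title: "Example", Views: 42}},
	}
	fmt.Println(v.VideoID, len(v.Crawls))
}
```

Variant A makes the "easy history query" pro concrete: one `FindOne` on `videoid` returns everything; variant B would need a find over `{videoid: …}` sorted by `crawldate`.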
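The rescan-score bullets could be turned into a number roughly like this. The weights and thresholds are invented for illustration; a real version might be tuned or learned:

```go
package main

import (
	"fmt"
	"time"
)

// rescanScore ranks how urgently a video should be re-crawled,
// following the notes above: viral videos score high, freshly
// crawled videos score zero, very old videos are discounted.
func rescanScore(viewsPerWeek float64, age, sinceLastCrawl time.Duration) float64 {
	if sinceLastCrawl < 24*time.Hour {
		return 0 // new crawls shouldn't instantly repeat
	}
	score := viewsPerWeek / 1000 // focus on views per week
	if age > 5*365*24*time.Hour {
		score *= 0.1 // very old videos are unlikely to change
	}
	return score
}

func main() {
	viral := rescanScore(500000, 30*24*time.Hour, 48*time.Hour)
	fresh := rescanScore(500000, 30*24*time.Hour, time.Hour)
	fmt.Println(viral > fresh) // fresh crawls are deprioritized
}
```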

### Types of bot

* __Discover bots__
* Find and push new video IDs to the queue
* Monitor channels for new content
* Discover new videos
* __Maintainer bots__
* Occasionally look at the database and
push backups/freezes to drive
* Decide which old video IDs to re-add to the queue
* __Worker bots__
* Get jobs from the Redis queue and crawl YT
* Remove processed entries from the queue
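All three bot types would talk to the queue through the same small contract. A minimal sketch, assuming a hypothetical `Queue` interface and an in-memory stand-in for Redis:

```go
package main

import "fmt"

// Queue is the shared contract the three bot types speak
// (hypothetical interface; the real queue would be Redis-backed).
type Queue interface {
	Push(videoID string)   // discover/maintainer bots add work
	Claim() (string, bool) // worker bots take work
	Done(videoID string)   // worker bots mark work finished
}

// memQueue is a toy in-memory stand-in, just to show the roles.
type memQueue struct{ ids []string }

func (q *memQueue) Push(id string) { q.ids = append(q.ids, id) }

func (q *memQueue) Claim() (string, bool) {
	if len(q.ids) == 0 {
		return "", false
	}
	id := q.ids[0]
	q.ids = q.ids[1:]
	return id, true
}

// Done would remove the processed entry in a real queue.
func (q *memQueue) Done(id string) {}

func main() {
	var q Queue = &memQueue{}
	q.Push("abc")                // discover bot finds a new video ID
	if id, ok := q.Claim(); ok { // worker bot picks it up
		fmt.Println("crawling", id)
		q.Done(id) // worker marks the entry as processed
	}
}
```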

### Redis queue

* A Redis queue lists video IDs that have been
discovered, but not yet crawled
* Discover bots push IDs when they find new ones
* Implement queue priority?
* Maintainer bots push IDs if they likely need rescans
* States of queued items
1. _Queued:_ Processing required
(no worker bot picked them up yet)
2. _Assigned:_ Worker claimed ID and processes it.
If the worker doesn't mark the ID as done in time
it gets tagged back as _Queued_ again
(should be hidden from other workers)
3. _Done:_ Worker submitted crawl to the database
(can be deleted from the queue)
* Single point of failure
* Potentially needs a ton of RAM
* 800 M IDs at 100 bytes per entry = 80 GB
* Shuts down entire crawl system on failure
* Persistence: A crash can lose all discovered IDs
* Alternative implementations
* SQLite in-memory?