Future ideas

terorie 9 months ago
parent commit 4453dd3370

2 changed files with 91 additions and 6 deletions:
1. README.md (+14 -6)
2. ideas.md (+77 -0)

README.md (+14 -6)

@@ -1,6 +1,6 @@
 # WIP: yt-mango 💾
 
-> YT metadata extractor inspired by [`youtube-ma` by _CorentinB_][1]
+> YT metadata extractor inspired by [`youtube-ma` by _CorentinB_][youtube-ma]
 
 ##### Build
 
@@ -10,11 +10,19 @@ If you don't have a Go toolchain, grab an executable from the Releases tab
 
 ##### Project structure
 
-- _/controller_: Manages workers (sends tasks, gets results, …)
-- _/common_: Commonly used HTTP code
-- _/data_: Data structures
-- _/db_: MongoDB connection
+- _/data_: Data definitions
+- _/api_: Abstract API definitions
+    - _/apiclassic_: HTML API implementation (parsing using [goquery][goquery])
+    - _/apijson_: JSON API implementation (parsing using [fastjson][fastjson])
+- _/net_: HTTP utilities (async HTTP implementation)
+
+- _/pretty_: (not yet used) Terminal color utilities
+- _/controller_: (not yet implemented) worker management
+    - _/db_: (not yet implemented) MongoDB connection
+    - _???_: (not yet implemented) Redis queue
 - _/classic_: Extractor calling the HTML `/watch` API
 - _/watchapi_: Extractor calling the JSON `/watch` API
 
- [1]: https://github.com/CorentinB/youtube-ma
+ [youtube-ma]: https://github.com/CorentinB/youtube-ma
+ [goquery]: https://github.com/PuerkitoBio/goquery
+ [fastjson]: https://github.com/valyala/fastjson
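
As an aside, a minimal sketch of what the _/apiclassic_ path could look like, assuming goquery is used to scrape the HTML `/watch` page. The URL, selector, and program shape are illustrative guesses, not the repository's actual code:

```go
// Sketch of the /apiclassic idea: fetch the HTML /watch page and
// scrape a field with goquery. Video ID and selector are illustrative.
package main

import (
	"fmt"
	"log"
	"net/http"

	"github.com/PuerkitoBio/goquery"
)

func main() {
	resp, err := http.Get("https://www.youtube.com/watch?v=dQw4w9WgXcQ")
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()

	doc, err := goquery.NewDocumentFromReader(resp.Body)
	if err != nil {
		log.Fatal(err)
	}
	// The page <title> carries the video title plus a " - YouTube" suffix.
	fmt.Println(doc.Find("title").Text())
}
```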

ideas.md (+77 -0)

@@ -0,0 +1,77 @@
+# Ideas/Questions for future code
+
+> Brainstorming here :P
+
+### MongoDB
+
+* Main database for crawled content
+* YouTube is mostly non-relational
+  (except channels ↔ videos)
+* Users can change videos (title, etc.):
+  Support for multiple crawls needed (two layouts, sketched below)
+    * In one document as array?
+      Like `{videoid: xxx, crawls: []…`
+        * Pros: Easy history query
+        * Cons: (Title) indices might be harder to maintain
+    * Or as separate documents?
+      Like `{videoid: xxx, crawldate: …`
+        * Pros: Race conditions less likely
+        * Cons: Duplicates more likely?
+    * Avoiding duplicates?
+        * If the user hasn't changed video metadata,
+          crawling it again is a waste of disk space
+        * Rescan score: Should a video be rescanned? (sketched below)
+            * Viral videos should be crawled more often
+            * New videos shouldn't be instantly crawled again
+            * Very old videos are unlikely to change
+            * Maybe focus on views per week
+            * Machine learning?
+        * Hashing data from crawls to detect changes?
+            * Invalidates old data on API upgrade
+            * Could be used as an index, though
+* Live data
+    * like views/comments/subscribers per day
+    * vs more persistent data: title/description/video formats
+    * Are they worth crawling?
+* Additional data
+    * Like subtitles and annotations
+    * Need separate crawls
+    * Not as important as main data
+
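To make the two document layouts above concrete, here is a hedged sketch in Go, assuming mongo-go-driver style bson tags; every type and field name is illustrative, not a settled schema:

```go
// Sketch of the two MongoDB layouts discussed above. Names illustrative.
package schema

import "time"

// Crawl holds the fields captured by a single scrape.
type Crawl struct {
	CrawlDate   time.Time `bson:"crawldate"`
	Title       string    `bson:"title"`
	Description string    `bson:"description"`
	Views       uint64    `bson:"views"`
}

// Option 1: one document per video, crawl history embedded as an array.
// Pro: the full history comes back in one query.
// Con: indices over mutable fields like the title are harder to maintain.
type VideoDoc struct {
	VideoID string  `bson:"videoid"`
	Crawls  []Crawl `bson:"crawls"`
}

// Option 2: one document per crawl, keyed by (videoid, crawldate).
// Pro: concurrent crawlers insert new documents without racing on one.
// Con: duplicate crawls are more likely without a dedup step.
type CrawlDoc struct {
	VideoID string `bson:"videoid"`
	Crawl   `bson:",inline"`
}
```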
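And a rough sketch of the rescan-score and change-hash ideas; the views-per-week heuristic and the choice of hashed fields are assumptions for illustration only:

```go
// Sketch of the dedup ideas above: a naive rescan score and a content
// hash over the user-editable fields. Both are illustrative guesses.
package main

import (
	"crypto/sha256"
	"fmt"
	"time"
)

// rescanScore is a naive views-per-week heuristic: viral videos score
// high, brand-new videos are held back, and old stale videos decay.
func rescanScore(views uint64, uploaded time.Time) float64 {
	weeks := time.Since(uploaded).Hours() / (24 * 7)
	if weeks < 1 {
		return 0 // too new: don't instantly crawl again
	}
	return float64(views) / weeks
}

// crawlHash fingerprints the editable fields so an unchanged video can
// be skipped. Caveat from above: the hash is tied to the current field
// set, so an API upgrade invalidates old hashes.
func crawlHash(title, description string) [32]byte {
	return sha256.Sum256([]byte(title + "\x00" + description))
}

func main() {
	fmt.Println(rescanScore(1000000, time.Now().AddDate(-1, 0, 0)))
	fmt.Printf("%x\n", crawlHash("some title", "some description"))
}
```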
+### Types of bot
+
+* __Discover bots__
+    * Find and push new video IDs to the queue
+    * Monitor channels for new content
+    * Discover new videos
+* __Maintainer bots__
+    * Occasionally look at the database and
+      push backups/freezes to drive
+    * Decide which old video IDs to re-add to the queue
+* __Worker bots__
+    * Get jobs from the Redis queue and crawl YT
+    * Remove processed entries from the queue
+
+### Redis queue
+
+* A Redis queue lists video IDs that have been
+  discovered, but not crawled
+* Discover bots push IDs if they find new ones
+    * Implement queue priority?
+* Maintainer bots push IDs if they likely need rescans
+* States of queued items (see the worker sketch below)
+    1. _Queued:_ Processing required
+       (no worker bot picked them up yet)
+    2. _Assigned:_ Worker claimed the ID and is processing it.
+       If the worker doesn't mark the ID as done in time,
+       it gets tagged as _Queued_ again
+       (should be hidden from other workers)
+    3. _Done:_ Worker submitted the crawl to the database
+       (can be deleted from the queue)
+* Single point of failure
+    * Potentially needs a ton of RAM
+        * 800 M IDs at 100 bytes per entry = 80 GB
+    * Shuts down the entire crawl system on failure
+    * Persistence: A crash can lose all discovered IDs
+* Alternative implementations
+    * SQLite in-memory?
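
A hedged sketch of the worker-side cycle for these three states, assuming the go-redis client and illustrative key names ("queue" for Queued, "assigned" for Assigned). It uses Redis's reliable-queue pattern: BRPOPLPUSH to claim an ID atomically, LREM to complete it.

```go
// Sketch of a worker bot's queue cycle. Key names and crawlVideo are
// illustrative assumptions, not the project's actual design.
package main

import (
	"context"
	"log"
	"time"

	"github.com/go-redis/redis/v8"
)

// crawlVideo stands in for the actual YT crawl plus MongoDB insert.
func crawlVideo(id string) error { return nil }

func main() {
	ctx := context.Background()
	rdb := redis.NewClient(&redis.Options{Addr: "localhost:6379"})

	for {
		// Queued -> Assigned: atomically move an ID to a claim list,
		// hiding it from other workers while it is processed.
		id, err := rdb.BRPopLPush(ctx, "queue", "assigned", 30*time.Second).Result()
		if err == redis.Nil {
			continue // queue empty, keep waiting
		} else if err != nil {
			log.Fatal(err)
		}

		if err := crawlVideo(id); err != nil {
			// Leave the ID in "assigned"; a maintainer bot would
			// eventually tag stale entries back as Queued.
			log.Printf("crawl %s: %v", id, err)
			continue
		}

		// Assigned -> Done: the crawl is in the database, drop the claim.
		rdb.LRem(ctx, "assigned", 1, id)
	}
}
```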
