[Feedback/Request] Scraping is way too slow. 25K elements need more than 3 days just to scrape. #555
Comments
Of course, this doesn't necessarily have to be a precise estimate, just a rough guide (surveys, personal experience as a reference, etc.).
This is actually entirely expected and has nothing really to do with tubesync. I would strongly suggest you drop the worker count back to 1. YouTube will aggressively rate limit and throttle your IP if you attempt to run it any faster. Yes, if you add a channel or channels with 25k items it will take some time, potentially a couple of days, to get up to date and index the metadata. Once the initial sync is done, though, it will only sync new content, which is fast, so it's a one-off issue. You should have no issues with large channels once the initial sync and index is done. There is no way to improve this performance; the throttling is at YouTube's end. From anecdotal reports, the throttling also seems to be triggered by scraping metadata too quickly. If you have any magical solutions, feel free to suggest them. Potentially there could be an estimate for task completion, but it would be quite complicated to implement and so rough as to be not that helpful.
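For a rough sense of the timescales involved, here is a back-of-envelope calculation using only the numbers reported in this issue (25k items taking about 3 days); the per-item rate is an observation from this thread, not a guaranteed figure, and it assumes throttling stays roughly constant:

```python
# Rough estimate of initial metadata sync time, using the figures reported
# in this issue (25,000 items taking about 3 days).
items = 25_000
observed_seconds = 3 * 24 * 60 * 60           # ~3 days reported by the OP
seconds_per_item = observed_seconds / items   # ~10.4 s per metadata fetch

def estimated_days(item_count: int, per_item: float = seconds_per_item) -> float:
    """Very rough completion estimate; assumes the throttled rate stays constant."""
    return item_count * per_item / 86_400

print(f"~{seconds_per_item:.1f} s per item")
print(f"6k-video channel: ~{estimated_days(6_000):.1f} days")
```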
I would be more willing to wait for complete metadata as long as downloads take priority. I'd rather see the most recent videos that can be downloaded start than have all of the downloads wait for the complete metadata to be stored in the database. Perhaps an easy fix is to index until it finds a video that it can download, then reschedule the index scan after that download finishes. The next round would stop after two downloads are found, then four, etc. Is there a compelling reason the metadata has to take priority over downloads that are ready?
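A minimal sketch of the escalating-stop idea proposed above; `index_next_media()` and `schedule_download()` are hypothetical helpers used only for illustration, and this is not how tubesync currently schedules its tasks:

```python
# Sketch of the proposed interleaving: index until N downloadable items are
# found, hand them to the downloader, then resume indexing with N doubled.
# index_next_media() and schedule_download() are hypothetical helpers.

def index_with_escalating_stops(source, start_batch: int = 1):
    batch = start_batch
    while True:
        found = []
        for media in index_next_media(source):   # yields newly indexed items
            if media.can_download():
                found.append(media)
                if len(found) >= batch:
                    break
        for media in found:
            schedule_download(media)              # downloads get priority
        if len(found) < batch:                    # source fully indexed
            break
        batch *= 2                                # 1, 2, 4, 8, ...
```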
The metadata is required to determine whether a media item is to be, and can be, downloaded, so I just scheduled it first; that's about it. Significantly fancier methods of scheduling are of course possible. There's no guarantee that the metadata is going to be indexed in a logical time order, so this may not function as you expect even if implemented. Also, with the current tasks system this would probably be a bit rough, with some arbitrary escalating priority number to schedule tasks in a specific order rather than group priorities. The main reason, though, is that no one has previously asked for it or looked into implementing it, I would suspect.
I was thinking of something a lot simpler. How about changing the indexing task to stop itself after a certain number of successful additions?
The indexing task is just asking YouTube for the full list of media IDs on the channel, so there isn't really a point where it could stop itself partway through.
I just want to provide some feedback after some changes I made to the source code.
For almost two weeks now: no throttling, nothing! I do have a YT-Premium account, but no cookies are passed along. Yes, it may throttle me at any time, but it's a lot faster than with only one worker. It got through over 15k tasks within 4 days and started downloading content on day 5. Edit:
Thanks for the update. Workers were originally set to 4 or 8 or somewhere around there; however, many people (including me) experienced issues that were difficult to resolve (like having to change public IPs). 8 may work for you, just make sure you're using a VPN or some other public endpoint you can trivially cycle, because you'll probably get throttled at some point. You'll likely just notice that downloads still work but become extremely slow at some point. And yes, with a large media library, using Postgres will provide a significant boost.
A few minor cleanups to the metadata JSON will allow savings of about 90%. I started by using this one:
Ideally, the application code would parse the JSON and create a new structure that keeps only the information that it will use later. Switching to PostgreSQL will help with concurrency issues you may have encountered more than even the write-ahead-log mode for SQLite would.
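As a sketch of that kind of cleanup: which fields tubesync actually needs later is an assumption here (the key names below are common yt-dlp metadata fields, chosen for illustration), but dropping the per-format and per-fragment lists is where most of the size saving tends to come from:

```python
import json

# Keep only the fields the application is likely to use later; the exact
# set of keys tubesync needs is an assumption, adjust as required.
KEEP = {
    "id", "title", "description", "upload_date", "duration",
    "uploader", "channel_id", "thumbnail", "ext", "filesize_approx",
}

def shrink_metadata(raw_json: str) -> str:
    """Return a slimmed-down copy of a yt-dlp metadata JSON blob."""
    data = json.loads(raw_json)
    slim = {k: v for k, v in data.items() if k in KEEP}
    return json.dumps(slim, separators=(",", ":"))
```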
Interesting! I can't change my IP since I pay for a static IP address. And I don't see any throttling yet in the logs, but I will keep you guys updated if that changes. The workers are set to 4 in the source code :-) Edit: There's only one issue left: for a channel with 6k videos, 7 days for rescanning the channel is not enough. It stopped downloading at approximately 25% downloaded and restarted indexing the channel. That's a bummer.
Indexing the channel isn't that time consuming, or shouldn't be; even with 6k videos it shouldn't take that long. All the "media indexing" really does is list all the media item IDs (YouTube video IDs in this case) and check if there are any new, previously unseen IDs. This won't take more than a few minutes, maybe up to an hour for the largest of channels. Unless you're being throttled, that is; one of the symptoms of being throttled is that indexing and metadata collection are just extremely slow.
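For reference, a flat playlist extraction along these lines is roughly what such an index pass amounts to. This is a sketch using yt-dlp directly with an example channel URL, not tubesync's actual code:

```python
import yt_dlp

# List the video IDs of a channel without fetching full per-video metadata.
opts = {
    "extract_flat": True,    # only list entries, no per-video metadata
    "skip_download": True,
    "quiet": True,
}
with yt_dlp.YoutubeDL(opts) as ydl:
    info = ydl.extract_info(
        "https://www.youtube.com/@SomeChannel/videos",  # example URL
        download=False,
    )
    ids = [entry["id"] for entry in info.get("entries", [])]
    print(f"{len(ids)} media IDs indexed")
```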
Hey :)
Scraping takes an extremely long time. I have a sync running here with over 25k elements to process, but this took 3 days and only metadata was loaded into the DB. This is simply too slow and not usable. I have set this up on an HPE ProLiant server, as I initially thought this was a problem with my NAS where I first used this project.
But apparently this is a general problem that needs to be addressed.
The number of workers has been increased to 4. The speed still remains the same.
I understand the limitation for downloading, but not for scraping the information and cover images.
A clarification or idea for improvement would be cool :)