[Feedback/Request] Scraping is way too slow. 25K elements need more than 3 days just to scrape. #555

Open
0n1cOn3 opened this issue Nov 14, 2024 · 11 comments

Comments

0n1cOn3 commented Nov 14, 2024

Hey :)

Scraping takes an extremely long time. I have a session running here with over 25k elements to work through, but after 3 days only the metadata had been loaded into the DB. This is simply too slow and not usable. I have set this up on an HPE ProLiant server, as I initially thought this was a problem with my NAS, where I first used this project.

But apparently this is a general problem that needs to be addressed.
I have increased the number of workers to 4, but the speed remains the same.
I understand the limitation for downloading, but not for scraping the information and cover images.

A clarification or idea for improvement would be cool :)

0n1cOn3 (Author) commented Nov 14, 2024

  • Would it be possible to display an estimate of how long the indexing would take?
  • The same with the download?

Of course, this would not have to be an exact estimate; a rough figure (based on surveys, personal experience as a reference, etc.) would do.

meeb (Owner) commented Nov 15, 2024

This is actually entirely expected and nothing really to do with tubesync. I would strongly suggest you drop the worker count back to 1. YouTube will aggressively rate limit and throttle your IP if you attempt to run it any faster. Yes, if you add a channel or channels with 25k items it'll take some time, potentially a couple of days, to get up to date and index the metadata. Once the initial sync is done, though, it will only sync new content, which is fast, so it's a one-off issue. You should have no issues with large channels once the initial sync and index are done.

There is no way to improve this performance; the throttling is on YouTube's side. The throttling does, from anecdotal reports anyway, seem to be triggered by scraping metadata too quickly as well. If you have any magical solutions, feel free to suggest them.

Potentially there could be an estimate for task completion, but it would be quite complicated to implement and so rough as to likely not be that helpful.
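
As a rough illustration of why it would stay rough: the obvious back-of-the-envelope estimate is just remaining items divided by recent throughput, and throttling makes that rate anything but constant. The figures below are made up, not from tubesync:

# Made-up figures: a naive ETA is remaining work divided by recent throughput.
# Throttling makes the rate vary wildly, which is why such an estimate stays rough.
tasks_completed_last_hour = 300
tasks_remaining = 25_000 - 9_000
print(f"~{tasks_remaining / tasks_completed_last_hour:.0f} hours at the current rate")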

tcely (Contributor) commented Nov 22, 2024

I would be more willing to wait for complete metadata as long as downloads take priority. I'd rather see the most recent videos that can be downloaded start downloading than have all of the downloads waiting for the complete metadata to be stored in the database.

Perhaps an easy fix is to index until it finds a video that it can download, then reschedule the index scan after that download finishes. The next round would stop after two downloads are found, then four, etc.
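
A toy sketch of that schedule (the helper functions and data are hypothetical, not tubesync code):

def interleaved_sync(listing, can_download, download):
    """Index until `batch` downloadable items are found, download them,
    then resume indexing with the batch size doubled: 1, 2, 4, 8, ..."""
    batch, position = 1, 0
    while position < len(listing):
        found = []
        while position < len(listing) and len(found) < batch:
            item = listing[position]
            position += 1
            if can_download(item):
                found.append(item)
        for item in found:              # downloads run before indexing resumes
            download(item)
        batch *= 2

# Example run with fake data:
interleaved_sync(
    listing=[f"vid{i}" for i in range(10)],
    can_download=lambda v: True,
    download=lambda v: print("downloading", v),
)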

Is there a compelling reason the metadata has to take priority over downloads that are ready?

meeb (Owner) commented Nov 22, 2024

The metadata is required to determine whether a media item can and should be downloaded, so I just scheduled it first; that's about it. Significantly fancier methods of scheduling are of course possible. There's no guarantee that the metadata is going to be indexed in a logical time order, so this may not function as you expect even if implemented. Also, with the current tasks system this would probably be a bit rough, needing some arbitrary escalating priority number to schedule tasks in a specific order rather than group priorities.

The main reason, though, is that no one has previously asked for it or looked into implementing it, I suspect.

tcely (Contributor) commented Nov 22, 2024

Also, with the current tasks system this would probably be a bit rough, needing some arbitrary escalating priority number to schedule tasks in a specific order rather than group priorities.

I was thinking of something a lot simpler. How about changing the indexing task to stop itself after a certain number of successful additions?

meeb (Owner) commented Nov 22, 2024

The indexing task just asks yt-dlp to return a massive dict of all media on a channel or playlist; the control isn't that fine-grained. If you have a playlist as a source rather than a channel, stopping the indexing partway through the playlist would be very confusing. Channels are generally indexable by time, newest first, but it's not guaranteed.
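
Roughly what that looks like through yt-dlp's Python API (an illustration only, not the exact call tubesync makes; the channel URL is just an example):

import yt_dlp

# Flat extraction returns one big info dict with an "entries" list for the whole
# channel/playlist; there is no supported way to stop partway through the listing.
opts = {"extract_flat": True, "skip_download": True, "quiet": True}
with yt_dlp.YoutubeDL(opts) as ydl:
    info = ydl.extract_info("https://www.youtube.com/@SomeChannel/videos",
                            download=False)

entries = list(info.get("entries") or [])
print(len(entries), "media IDs returned in a single response")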

0n1cOn3 (Author) commented Dec 7, 2024

I just want to provide some feedback after some changes I made to the source code.

  • Workers are set to 8, not 1.
  • Parallel downloads increased to 3 at once.
  • A few other things I don't remember off the top of my head.

For almost two weeks now: no throttling, nothing! I do have a YT Premium account, but no cookies are supplied to it. Yes, it may throttle me at any time, but it got a lot faster than with only one worker. It got through over 15k tasks within 4 days and started the content download on day 5.

Edit:
I also switched from SQLite to PostgreSQL. It seems faster, too, with this much "content".

meeb (Owner) commented Dec 9, 2024

Thanks for the update. Workers were originally set to 4 or 8 or somewhere around there; however, many people (including me) experienced issues that were difficult to resolve (like having to change public IPs). 8 may work for you; just make sure you're using a VPN or some other public endpoint you can trivially cycle, because you'll probably get throttled at some point. You'll likely just notice that downloads still work but become extremely slow.

And yes, with a large media library, using Postgres will provide a significant boost.

tcely (Contributor) commented Dec 9, 2024

A few minor cleanups to the metadata JSON will allow savings of about 90% and make db.sqlite3 much nicer to work with. I will probably create a trigger that uses json_remove during INSERT at some point, rather than the cleanup.sql file I'm running periodically now.

I started by using this one UPDATE; for anyone who wants to try it:

UPDATE OR ROLLBACK "sync_media" SET metadata = json_remove(metadata, '$."automatic_captions"');
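
A rough sketch of that trigger idea, reusing the same json_remove call; the table and column names assume the stock tubesync SQLite schema, so check them before running this:

import sqlite3

# Hypothetical AFTER INSERT trigger: strip the bulky "automatic_captions" object
# from metadata as each new row is written, instead of cleaning up periodically.
con = sqlite3.connect("db.sqlite3")
con.executescript("""
CREATE TRIGGER IF NOT EXISTS sync_media_trim_metadata
AFTER INSERT ON "sync_media"
FOR EACH ROW
BEGIN
    UPDATE "sync_media"
    SET metadata = json_remove(NEW.metadata, '$."automatic_captions"')
    WHERE rowid = NEW.rowid;
END;
""")
con.commit()
con.close()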

Ideally, the application code would parse the JSON and create a new structure that keeps only the information that it will use later.
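
As an illustration, something along these lines; the field whitelist here is a guess, not the set tubesync actually reads:

import json

# Keep only a whitelist of fields from the full yt-dlp metadata dict before it
# is stored; the KEEP set below is illustrative, not tubesync's real field list.
KEEP = {"id", "title", "upload_date", "duration", "uploader", "thumbnail", "formats"}

def slim_metadata(raw_json: str) -> str:
    full = json.loads(raw_json)
    return json.dumps({key: value for key, value in full.items() if key in KEEP})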

Switching to PostgreSQL will help with any concurrency issues you may have encountered, more than even the write-ahead-log mode for SQLite would.
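
For anyone staying on SQLite, write-ahead logging is a one-off pragma and persists in the database file:

import sqlite3

# Enable write-ahead logging on an existing database; the pragma returns the
# journal mode now in effect, e.g. ('wal',).
con = sqlite3.connect("db.sqlite3")
print(con.execute("PRAGMA journal_mode=WAL;").fetchone())
con.close()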

0n1cOn3 (Author) commented Dec 9, 2024

Thanks for the update. Workers were originally set to 4 or 8 or somewhere around there; however, many people (including me) experienced issues that were difficult to resolve (like having to change public IPs). 8 may work for you; just make sure you're using a VPN or some other public endpoint you can trivially cycle, because you'll probably get throttled at some point. You'll likely just notice that downloads still work but become extremely slow.

And yes, with a large media library, using Postgres will provide a significant boost.

Interesting!

I can't change my IP since I pay for a static IP address.

And I don't see any throttling in the logs yet, but I will keep you guys updated if that changes at any time.

The workers are set to 4 in the source code :-)

Edit:

There's only one issue left:

For a channel with 6k videos, a 7-day rescan interval is not enough. It stopped downloading with approximately 25% of the downloads done and restarted indexing the channel. That's a bummer.
Could it be arranged so that indexing only restarts once all tasks for the particular channel have finished?

meeb (Owner) commented Dec 10, 2024

Indexing the channel isn't that time consuming, or shouldn't be; even with 6k videos it shouldn't take that long. All the "media indexing" really does is list all the media item IDs (YouTube video IDs in this case) and check if there are any new, previously unseen IDs. This won't take more than a few minutes, maybe up to an hour for the largest of channels. Unless you're being throttled, that is; one of the symptoms of being throttled is that indexing and metadata collection become extremely slow.
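
Conceptually, that check is just a set difference over video IDs (illustrative values only):

# IDs already stored in the database versus IDs returned by the channel listing;
# only the previously unseen ones get new metadata and download tasks.
known_ids = {"id_aaa", "id_bbb"}
listed_ids = ["id_aaa", "id_bbb", "id_new"]
new_ids = [video_id for video_id in listed_ids if video_id not in known_ids]
print(new_ids)   # -> ['id_new']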
