Incorrect Number for Items Remaining when Recovering Cache #25
I have tested this several times. If I start a query with no "before" parameter and then use the printed timestamp as the "before" value to resume, the first resume attempt does NOT load the initial run's requests: it reports that no cached requests were loaded and starts over from scratch. If I re-run the query again with the same "before" value, it does load the cache from that first resume attempt, but since the items-remaining count has reset, the scraping effectively restarts as well. In short, I cannot find a way to recover an interrupted scrape.
Hi @YukunYangNPF, I investigated this issue, and there appears to be a logging problem when the cache is loaded. When I ran your example code, I found that the cache was in fact being loaded, but it was not being reported as loaded. I have added a static method in …
@mattpodolak
def func(): func()
Can you tell me how to resolve this? (I dump the cache data later with separate code; I just want to finish collecting the pickle (gzip) files first.)
Hi @gauravkhadgi, thanks for reporting this. What exception is being thrown when you run this code? Also, could you open a new issue, as this doesn't seem related to the parent issue?
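As a side note on the cache-dumping workflow mentioned above, here is a minimal sketch of reading gzip-compressed pickle cache files back with the standard library only. The "./cache" folder, the "*.pickle.gz" naming, and the assumption that each file unpickles to a list of comment dicts are guesses about the cache layout, not confirmed PMAW internals.

```python
import glob
import gzip
import pickle

# Load gzip-compressed pickle cache files back into memory.
# NOTE: the "./cache" path and "*.pickle.gz" pattern are assumptions;
# adjust them to wherever your cache files were actually written.
comments = []
for path in sorted(glob.glob("./cache/*.pickle.gz")):
    with gzip.open(path, "rb") as f:
        batch = pickle.load(f)  # assumed: a list of comment dicts per file
    comments.extend(batch)

print(f"loaded {len(comments)} cached items")
```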
I originally posted this issue on Reddit; I'm transferring it here as a formal issue report.
My initial request:
from pmaw import PushshiftAPI

api = PushshiftAPI()
posts = api.search_comments(subreddit='de', limit=None, mem_safe=True, safe_exit=True)
Here is the metadata I obtained from this initial request, including the timestamp that was printed:
Setting before to 1631651549
Response cache key: 186e2bb94155846df2c5d321b768b8cb
10380641 result(s) available in Pushshift
The last file checkpoint before this scraping was interrupted was:
Now I want to resume the process, so I plugged the timestamp I got from the initial request in as the "before" value.
However, it seems to start scraping the same query over again from the beginning, while the items-remaining count picks up from what was left of the previously interrupted run.
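For reference, a minimal sketch of the resume attempt described above, assuming the pmaw PushshiftAPI client; the subreddit and timestamp are taken from the report and log output above, and whether the cached requests are actually picked up is exactly what this issue is about.

```python
from pmaw import PushshiftAPI

api = PushshiftAPI()

# Re-run the same query, passing the timestamp printed by the interrupted
# run as "before", in the hope that it resumes from the cached requests.
posts = api.search_comments(
    subreddit='de',
    limit=None,
    mem_safe=True,
    safe_exit=True,
    before=1631651549,
)
```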