
Incorrect Number for Items Remaining when Recovering Cache #25

Closed
YukunYangNPF opened this issue Sep 15, 2021 · 4 comments

Comments

@YukunYangNPF

I've posted this issue on Reddit; I'm transferring it here as a formal issue report.

My initial request:
posts = api.search_comments(subreddit='de', limit=None, mem_safe=True, safe_exit=True)

Here is the metadata I obtained from this initial request, including the timestamp:

Setting before to 1631651549

Response cache key: 186e2bb94155846df2c5d321b768b8cb

10380641 result(s) available in Pushshift

The last file checkpoint before this scraping was interrupted was:

File Checkpoint 167:: Caching 15802 Responses

Checkpoint:: Success Rate: 75.43% - Requests: 33400 - Batches: 3340 - Items Remaining: 8009761

Checkpoint:: Success Rate: 75.48% - Requests: 33500 - Batches: 3350 - Items Remaining: 8001173

Now I want to resume the process, so I plugged in the timestamp I got from the initial requests as the "before" value.

api = pmaw.PushshiftAPI()
posts = api.search_comments(subreddit="de", limit=None, mem_safe=True, safe_exit=True, before=1631651549)

However, it seems to start a new scrape of the same query rather than resuming; the Items Remaining count does not reflect what was actually left from the previously interrupted requests.

Loaded Cache:: Responses: 2399576 - Pending Requests: 140 - Items Remaining: 10367680

Not all PushShift shards are active. Query results may be incomplete.
@YukunYangNPF (Author)

I tested this several times.

If I initiate with no before parameter and then use the printed timestamp as the before value to resume, the first attempt to resume will NOT load the initial requests: it reports that no cached requests were loaded and starts all over. When I re-run the query again with the same before value, it loads the cache from that first resume attempt, but since the Items Remaining count has started over, the scraping also restarts.

So basically, I cannot find a way to recover an interrupted scrape.

@mattpodolak (Owner)

Hi @YukunYangNPF, I investigated this, and there appears to be a logging issue when the cache is loaded. When I ran your example code, I found that the cache was being loaded but not being reported as loaded.

I have added a static method in v2.1.0 that allows you to load responses directly from the cache to work with, and I have also modified the logging so that it is clear when the cache is loaded.
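For anyone who lands here, a minimal sketch of loading cached responses with the new v2.1.0 static method. The method name load_cache, its signature, and iterating over the returned object are assumptions on my part (check the pmaw README for the exact API); the cache key is the one printed in the log above.

from pmaw import PushshiftAPI

# Assumed API: a static load_cache(cache_key, cache_dir) on PushshiftAPI that
# returns the responses cached under that key; the real name/signature may differ.
cache_key = '186e2bb94155846df2c5d321b768b8cb'  # "Response cache key" from the log above
cached = PushshiftAPI.load_cache(cache_key, cache_dir='./cache')

# Work with the already-collected comments without re-running the query.
comments = [c for c in cached]
print(f'{len(comments)} cached comments loaded')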

mattpodolak mentioned this issue Oct 1, 2021
@gauravkhadgi commented Oct 6, 2021

@mattpodolak
import datetime as dt
from pmaw import PushshiftAPI

def submissions(subreddit, after=[2020, 1, 1], before=[2021, 1, 1], limit=None, num_workers=20, file_checkpoint=8):
    # convert [year, month, day] lists to epoch timestamps
    after = int(dt.datetime(after[0], after[1], after[2]).timestamp())
    before = int(dt.datetime(before[0], before[1], before[2]).timestamp())

    api = PushshiftAPI(num_workers=num_workers, file_checkpoint=file_checkpoint)

    cache_dir = 'drive/My Drive/scraped data/cache/' + subreddit + '/'

    api.search_submissions(subreddit=subreddit, after=after, before=before, safe_exit=True, limit=limit, sort='desc', cache_dir=cache_dir, mem_safe=True)

def func():
    try:
        submissions('science')
    except:
        print('\n FUCK ERRORS \n')
        func()

func()

Can you tell me how to resolve this? After a few "FUCK ERRORS", it just starts again from the beginning.

(I dump the cache data later using a different script; I just finish collecting the pickle (gzip) files first.)

@mattpodolak (Owner) commented Oct 7, 2021

Hi @gauravkhadgi, thanks for reporting this. What exception is being thrown when you run this code? Also, can you open a new issue, as this doesn't seem related to the parent issue?
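If it helps narrow this down, a minimal sketch of the retry wrapper from the comment above, rewritten to print the full traceback instead of a generic message so the failing exception can be identified (plain Python, nothing pmaw-specific assumed):

import traceback

def func():
    # retry in a loop instead of recursing, and print the full traceback so
    # the underlying exception can be identified and reported
    while True:
        try:
            submissions('science')  # the submissions() helper defined in the comment above
            break
        except Exception:
            traceback.print_exc()

func()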
