Closed
Labels: bug (Something isn't working)
Description
Describe the bug
Attempting to download the IMDB dataset gives the following error:
tar: Error opening archive: Unrecognized archive format
An IMDB.tgz file is created with the following content:
<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<html><head>
<title>404 Not Found</title>
</head><body>
<h1>Not Found</h1>
<p>The requested URL was not found on this server.</p>
</body></html>
It seems the dataset has been removed or is unavailable.
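The failure above happens because an HTML 404 page is saved under a .tgz name and then handed to tar. A minimal sketch of how a script could detect this before extracting, by checking for the gzip magic bytes (0x1f 0x8b) at the start of the file. The helper name is_gzip_archive is hypothetical and is not part of bench.sh:

```shell
#!/usr/bin/env bash

# Hypothetical helper (an assumption, not taken from bench.sh):
# returns 0 if the file starts with the gzip magic bytes 0x1f 0x8b,
# and non-zero otherwise (including when the file cannot be read).
is_gzip_archive() {
    local file="$1"
    local magic
    # Read the first two bytes and print them as hex, stripping all whitespace.
    magic="$(head -c 2 "$file" 2>/dev/null | od -An -tx1 | tr -d '[:space:]')"
    [ "$magic" = "1f8b" ]
}
```

It could be called right after the download step, for example `is_gzip_archive "${imdb_temp_gz}" || { echo "download is not a gzip archive" >&2; exit 1; }`, so the user sees a clear message instead of the tar error. If the download is done with curl, passing `--fail` would additionally make curl exit with an error on an HTTP 404 rather than saving the error page.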
To Reproduce
Run benchmarks/bench.sh data imdb
Expected behavior
It should download the dataset, extract the CSV files, and convert them to Parquet.
Additional context
The relevant part of bench.sh:
datafusion/benchmarks/bench.sh
Lines 458 to 463 in 6cfd1cf
# Downloads the csv.gz files IMDB datasets from Peter Boncz's homepage(one of the JOB paper authors)
# http://homepages.cwi.nl/~boncz/job/imdb.tgz
data_imdb() {
    local imdb_dir="${DATA_DIR}/imdb"
    local imdb_temp_gz="${imdb_dir}/imdb.tgz"
    local imdb_url="https://homepages.cwi.nl/~boncz/job/imdb.tgz"