Downloading IMDB dataset for benchmarks gives 404 Not Found #13896

### Describe the bug

Attempting to download the IMDB dataset gives the following error:

```
tar: Error opening archive: Unrecognized archive format
```

An `IMDB.tgz` is created with the following content:

```html
<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<html><head>
<title>404 Not Found</title>
</head><body>
<h1>Not Found</h1>
<p>The requested URL was not found on this server.</p>
</body></html>
```

The server returned a 404 page instead of the archive, so the dataset appears to have been removed or is otherwise unavailable.

### To Reproduce

Run `benchmarks/bench.sh data imdb`

### Expected behavior

It should download the dataset, extract the CSV files, and convert them to Parquet.

### Additional context

The relevant part of `bench.sh`:
https://github.com/apache/datafusion/blob/6cfd1cf1e030ccfe3b17621cc51fdcefcceae018/benchmarks/bench.sh#L458-L463

	# Downloads the csv.gz files IMDB datasets from Peter Boncz's homepage(one of the JOB paper authors)
	# http://homepages.cwi.nl/~boncz/job/imdb.tgz
	data_imdb() {
	local imdb_dir="${DATA_DIR}/imdb"
	local imdb_temp_gz="${imdb_dir}/imdb.tgz"
	local imdb_url="https://homepages.cwi.nl/~boncz/job/imdb.tgz"
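
One way to surface this failure earlier would be to check that the downloaded file really is a gzip archive before handing it to `tar`. The following is a minimal sketch, not the project's fix: `is_gzip_archive` is a hypothetical helper that does not exist in `bench.sh`, and it only inspects the two gzip magic bytes (`0x1f 0x8b`) at the start of the file.

```bash
#!/usr/bin/env bash

# Hypothetical helper (not part of bench.sh): succeeds only if the file
# starts with the gzip magic bytes 0x1f 0x8b. An HTML error page such as
# the 404 body shown above starts with "<!" and is therefore rejected.
is_gzip_archive() {
    local file="$1"
    local magic
    # Read the first two bytes and render them as lowercase hex, e.g. "1f8b".
    magic="$(head -c 2 "$file" 2>/dev/null | od -An -tx1 | tr -d ' \n')"
    [ "$magic" = "1f8b" ]
}

# Possible use right after the download step (assumption: the archive
# path is held in imdb_temp_gz, as in the snippet above):
#   is_gzip_archive "${imdb_temp_gz}" || {
#       echo "Download failed: ${imdb_temp_gz} is not a gzip archive" >&2
#       exit 1
#   }
```

With a check like this, the script could stop with a clear message instead of the `tar` "Unrecognized archive format" error. If the download step uses `curl`, its `--fail` option makes it exit with a non-zero status on HTTP errors such as 404, which could catch the problem even before a file is inspected.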
