Query to get last update of a dataset can be quite slow #263

Kjir · 2023-03-21T12:22:08Z

The query to get the last update of a dataset can be quite slow, especially once the number of time series in the dataset grow above 90'000.

I wrote an alternative query that seems to perform slightly better, but there probably needs to be an additional index somewhere to make it really go fast.

The existing query is:

-- Get the last time the dataset was updated
--
-- Returns the created_at timestamp of the series in the given dataset
-- that was most recently updated.
CREATE OR REPLACE FUNCTION timeseries.dataset_get_last_update(p_dataset TEXT)
RETURNS TABLE(name TEXT, updated TIMESTAMPTZ)
AS $$
BEGIN
  CREATE TEMPORARY TABLE tmp_ts_read_keys
  ON COMMIT DROP
  AS (
    SELECT ts_key
    FROM timeseries.catalog AS cat
    WHERE set_id = p_dataset
  );

  RETURN QUERY
  SELECT p_dataset AS name, max(ud.updated) AS updated
  FROM timeseries.ts_get_last_update() AS ud;
END;
$$ LANGUAGE PLPGSQL
SECURITY DEFINER
SET search_path = timeseries, pg_temp;

-- Get the last time time series were updated
--
-- Returns the created_at of the given time series
CREATE OR REPLACE FUNCTION timeseries.ts_get_last_update()
RETURNS TABLE(ts_key TEXT, updated TIMESTAMPTZ)
AS $$
BEGIN
  RETURN QUERY
  SELECT
  DISTINCT ON(mn.ts_key)
  mn.ts_key, mn.created_at AS updated
  FROM timeseries.timeseries_main AS mn
  JOIN tmp_ts_read_keys AS rd
  USING(ts_key)
  ORDER BY mn.ts_key, mn.created_at DESC;
END;
$$ LANGUAGE PLPGSQL
SECURITY DEFINER
SET search_path = timeseries, pg_temp;

The query I wrote is the following:

SELECT MAX(updated) updated
  FROM (
    SELECT
       ts_key,
      (
         SELECT MAX(created_at) updated
         FROM timeseries.timeseries_main tm
         WHERE tm.ts_key = catalog.ts_key
       )
    FROM timeseries.catalog
    WHERE set_id = '<set_id>'
  ) AS s;

The timings with the ch.fso.cah.inv dataset, which contains 69'366 time series, is as follows:

The difference in time between the first query and the following ones is probably due to Postgres caching the table in memory after the first read, but the time when used in actual code would probably be closer to the first run.

The timings with the updated query are as follows:

Some data

Set	N. series	Time old	Time new
ch.kof.ivua.ng08	120'742	First Run: ~270s , Cached: ~270s	FR: ~49s, C: ~6s
ch.kof.ivus.ng08	97'595	FR: ~245-250s, C: ~245-250s	FR: ~13s, C: ~4s
ch.fso.cah.inv	69'366	FR: ~4.5s, C: ~2.9s	FR: ~2s, C: ~2s
ch.fso.hesta	4'212	FR: ~1.6s, C: ~1.6s	FR: ~1.2s, C: ~1.2s
ch.fso.unemp	84	FR: ~1.2s, C: ~1.1s	FR: ~1.2s, C: ~1.1s

The text was updated successfully, but these errors were encountered:

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Query to get last update of a dataset can be quite slow #263

Query to get last update of a dataset can be quite slow #263

Kjir commented Mar 21, 2023

Query to get last update of a dataset can be quite slow #263

Query to get last update of a dataset can be quite slow #263

Comments

Kjir commented Mar 21, 2023

Some data