Query to get last update of a dataset can be quite slow #263

Open
Kjir opened this issue Mar 21, 2023 · 0 comments

The query to get the last update of a dataset can be quite slow, especially once the number of time series in the dataset grows above 90'000.

I wrote an alternative query that seems to perform somewhat better, but an additional index is probably needed somewhere to make it really fast.

The existing query is:

-- Get the last time the dataset was updated
--
-- Returns the created_at timestamp of the series in the given dataset
-- that was most recently updated.
CREATE OR REPLACE FUNCTION timeseries.dataset_get_last_update(p_dataset TEXT)
RETURNS TABLE(name TEXT, updated TIMESTAMPTZ)
AS $$
BEGIN
  CREATE TEMPORARY TABLE tmp_ts_read_keys
  ON COMMIT DROP
  AS (
    SELECT ts_key
    FROM timeseries.catalog AS cat
    WHERE set_id = p_dataset
  );

  RETURN QUERY
  SELECT p_dataset AS name, max(ud.updated) AS updated
  FROM timeseries.ts_get_last_update() AS ud;
END;
$$ LANGUAGE PLPGSQL
SECURITY DEFINER
SET search_path = timeseries, pg_temp;

-- Get the last time time series were updated
--
-- Returns the created_at of the given time series
CREATE OR REPLACE FUNCTION timeseries.ts_get_last_update()
RETURNS TABLE(ts_key TEXT, updated TIMESTAMPTZ)
AS $$
BEGIN
  RETURN QUERY
  SELECT
  DISTINCT ON(mn.ts_key)
  mn.ts_key, mn.created_at AS updated
  FROM timeseries.timeseries_main AS mn
  JOIN tmp_ts_read_keys AS rd
  USING(ts_key)
  ORDER BY mn.ts_key, mn.created_at DESC;
END;
$$ LANGUAGE PLPGSQL
SECURITY DEFINER
SET search_path = timeseries, pg_temp;

The query I wrote is the following:

SELECT MAX(updated) AS updated
FROM (
  SELECT
    ts_key,
    (
      SELECT MAX(created_at) AS updated
      FROM timeseries.timeseries_main tm
      WHERE tm.ts_key = catalog.ts_key
    ) AS updated
  FROM timeseries.catalog
  WHERE set_id = '<set_id>'
) AS s;
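
To show how this query might slot into the existing API, here is a rough sketch that keeps the signature of `timeseries.dataset_get_last_update` but drops the temporary table. It assumes nothing else relies on `tmp_ts_read_keys` being populated by this function, and that `created_at` is a `TIMESTAMPTZ` (as the existing return type suggests). I have not tested it against the real schema.

```sql
-- Sketch only: same name and return shape as the existing function,
-- but the lookup is done in a single statement without tmp_ts_read_keys.
CREATE OR REPLACE FUNCTION timeseries.dataset_get_last_update(p_dataset TEXT)
RETURNS TABLE(name TEXT, updated TIMESTAMPTZ)
AS $$
  SELECT p_dataset AS name, max(s.updated) AS updated
  FROM (
    SELECT (
      -- Latest created_at for one series of the dataset
      SELECT max(tm.created_at)
      FROM timeseries.timeseries_main AS tm
      WHERE tm.ts_key = cat.ts_key
    ) AS updated
    FROM timeseries.catalog AS cat
    WHERE cat.set_id = p_dataset
  ) AS s;
$$ LANGUAGE SQL
STABLE
SECURITY DEFINER
SET search_path = timeseries, pg_temp;
```

Like the current version, this returns a single row with a NULL `updated` when the dataset has no series.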

The timings of the existing query with the ch.fso.cah.inv dataset, which contains 69'366 time series, are as follows:

[image: timings of the existing query]

The difference in time between the first run and the following ones is probably due to Postgres caching the table in memory after the first read. When used in actual code, the time would probably be closer to that of the first run.
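
One way to check the caching assumption would be to look at buffer statistics. A minimal sketch, using the rewritten query and the ch.fso.cah.inv dataset as an example (note that `ANALYZE` actually executes the statement):

```sql
-- In the "Buffers:" lines of the output, "shared hit" counts blocks found in
-- Postgres' shared buffers, while "read" counts blocks that had to be
-- requested from the operating system (which may still serve them from its
-- own cache rather than from disk).
EXPLAIN (ANALYZE, BUFFERS)
SELECT MAX(updated) AS updated
FROM (
  SELECT
    (
      SELECT MAX(created_at)
      FROM timeseries.timeseries_main tm
      WHERE tm.ts_key = catalog.ts_key
    ) AS updated
  FROM timeseries.catalog
  WHERE set_id = 'ch.fso.cah.inv'
) AS s;
```

If the first run shows mostly `read` and later runs mostly `shared hit`, that would support the caching explanation.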

The timings with the updated query are as follows:

[image: timings of the updated query]

Some data

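| Set | N. series | Time old | Time new |
| --- | --- | --- | --- |
| ch.kof.ivua.ng08 | 120'742 | First run: ~270s, cached: ~270s | First run: ~49s, cached: ~6s |
| ch.kof.ivus.ng08 | 97'595 | First run: ~245-250s, cached: ~245-250s | First run: ~13s, cached: ~4s |
| ch.fso.cah.inv | 69'366 | First run: ~4.5s, cached: ~2.9s | First run: ~2s, cached: ~2s |
| ch.fso.hesta | 4'212 | First run: ~1.6s, cached: ~1.6s | First run: ~1.2s, cached: ~1.2s |
| ch.fso.unemp | 84 | First run: ~1.2s, cached: ~1.1s | First run: ~1.2s, cached: ~1.1s |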
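
As for the additional index mentioned at the top, one candidate might be a composite index on `timeseries_main`. This is only a guess: I have not checked which indexes already exist on the table, and the index name below is made up.

```sql
-- Hypothetical index (name invented for illustration).
-- With (ts_key, created_at) indexed together, the per-series
-- MAX(created_at) lookup in the rewritten query can be answered by reading
-- a single index entry per ts_key instead of scanning all rows for that key.
CREATE INDEX IF NOT EXISTS timeseries_main_ts_key_created_at_idx
  ON timeseries.timeseries_main (ts_key, created_at DESC);
```

The `DESC` is optional, since Postgres can scan a B-tree index backwards. Whether the planner actually uses the index would need to be confirmed with `EXPLAIN`.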