Skip to content

Latest commit

 

History

History
 
 

large-imputer

large-imputer

This folder includes our latest imputer on large-scale spatiotemporal traffic data.

Highlights

  • Linear unitary transform.
  • Temporal variation using quadratic time series autoregression.
  • Large-scale spatiotemporal imputation problem.

Data Sets

In this repository, we have adapted some publicly available data sets into our experiments. Thse data are summarized as follows,

  • PeMS-4W data set

    • This data set contains freeway traffic speed collected from 11160 traffic measurement sensors over 4 weeks (the first 4 weeks in the year of 2018) with a 5-min time resolution (288 time intervals per day) in California, USA. It can be arranged in a matrix of size 11160 x 8064 or a tensor of size 11160 x 288 x 28 according to the spatial and temporal dimensions. Note that this data set contains about 90 million observations.
  • PeMS-8W data set

    • This data set contains freeway traffic speed collected from 11160 traffic measurement sensors over 8 weeks (the first 8 weeks in the year of 2018) with a 5-min time resolution (288 time intervals per day) in California, USA. It can be arranged in a matrix of size 11160 x 16128 or a tensor of size 11160 x 288 x 56 according to the spatial and temporal dimensions. Note that this data set contains about 180 million observations.

In particular, if you are interested in large-scale traffic data, we recommend PeMS-4W/8W/12W and UTD19. For PeMS data, you can download the data from Zenodo and place them at the folder of datasets (data path example: ../datasets/California-data-set/pems-4w.csv). Then you can use Pandas to open data:

import pandas as pd

data = pd.read_csv('../datasets/California-data-set/pems-4w.csv', header = None)

For model evaluation, we mask certain entries of the "observed" data as missing values and then perform imputation for these "missing" values.

  • London-1M data set

    • This is London movement speed data set that created by Uber movement project. This data set includes the average speed on a given road segment for each hour of each day over a whole month (April 2019). In this data set, there are about 220,000 road segments. Note that this data sets only includes the hours or a road segment with at least 5 unique trips in that hour. There are up to 73.09% missing values and most missing values occur during the night. We choose the subset of this raw data set and build a time series matrix of size 35912 x 720 (or a tensor of size 35912 x 24 x 30) in which each time series has at least 70% observations.
  • Guangzhou-2M data set

    • This traffic speed data set was collected from 214 road segments over two months (61 days from August 1 to September 30, 2016) with a 10-min resolution (144 time intervals per day) in Guangzhou, China. It can be arranged in a matrix of size 214 x 8784 or a tensor of size 214 x 144 x 61.

Python implementation

Our Publication

  • Xinyu Chen, Yixian Chen, Nicolas Saunier, Lijun Sun (2021). Scalable low-rank tensor learning for spatiotemporal traffic data imputation. Transportation Research Part C: Emerging Technologies. [preprint] [DOI] [data] [Python code]

    This folder is for the above paper, please cite this paper if it help your research.