In this GitHub repo, we provide several datasets that can be used for dynamic sequence time-series problems. All datasets have been preprocessed and stored as .npy files. The data range from 2022/08 to 2022/10.
Data background: The ECW is a real-world dataset utilized in our study One for All: Unified Workload Prediction for Dynamic Multi-tenant Edge Cloud Platforms. The whole dataset encompasses diverse load logs (e.g., bandwidth for ECW, CPU, storage) and abundant cross-domain static data from a leading edge cloud service company that provides mature idle computing resource integration and edge cloud resource provisioning.
Note: The uploaded data has been min-max normalized for privacy reasons.
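As a reminder of what min-max normalization does, here is a minimal sketch. The README does not state whether scaling was applied per server, per feature, or globally, so the per-series scaling below is an assumption, and the input values are made up. Without the original minimum and maximum, the raw scale cannot be recovered from the normalized data.

```python
import numpy as np

def min_max_normalize(series: np.ndarray, eps: float = 1e-12) -> np.ndarray:
    """Scale a 1-D series to [0, 1] using its own minimum and maximum."""
    lo, hi = series.min(), series.max()
    # eps guards against division by zero for a constant series.
    return (series - lo) / max(hi - lo, eps)

# Made-up bandwidth values, for illustration only.
raw = np.array([120.0, 80.0, 200.0, 160.0])
scaled = min_max_normalize(raw)
print(scaled)  # the smallest value maps to 0 and the largest to 1
```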
Dataset list (updating)
- ECW-08: A matrix of shape $797 \times 720 \times 16$ representing the upload bandwidth workload variation of 797 edge servers over 720 hours (24*30) in 2022/08. 16 is the feature dimension of the data, containing 4-dimensional dynamic features and 12-dimensional static content features from the cross domain.
- ECW-09: A matrix of shape $1022 \times 720 \times 16$ representing the upload bandwidth workload variation of 1022 edge servers over 720 hours (24*30) in 2022/09.
- ECW-min: Workload logs of upload bandwidth at 5-minute granularity, which capture finer-grained load changes than ECW.
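A minimal sketch of loading one of these arrays and checking its shape is shown below. It writes a synthetic array with the documented ECW-08 shape to a temporary file, because the actual file names in the repository are not listed here. The name `ecw_08_demo.npy` is hypothetical, and the code assumes the 4 dynamic features come before the 12 static ones on the last axis.

```python
import os
import tempfile

import numpy as np

# Synthetic stand-in with the documented ECW-08 shape: (servers, hours, features).
rng = np.random.default_rng(0)
fake_ecw08 = rng.random((797, 720, 16), dtype=np.float32)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "ecw_08_demo.npy")  # hypothetical file name
    np.save(path, fake_ecw08)
    data = np.load(path)

n_servers, n_hours, n_features = data.shape
print(n_servers, n_hours, n_features)

# Assumed layout: 4 dynamic features first, then 12 static features.
dynamic = data[:, :, :4]
static = data[:, :, 4:]
```

With a real file, only the `np.load(path)` call is needed; the shape check is a cheap way to confirm that the expected dataset was loaded.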
The following datasets are recommended for testing.
- ECW-App-Switch-sample: The workload series in which application switching occurred during 2022/08/25-2022/08/30. The exact time of an application switch can be easily detected by plotting the curve, as each switch is accompanied by a sudden change in the load pattern. If you wish to test the model's capability to handle abrupt changes in time-series patterns (extreme concept drift), use this dataset.
- ECW-New-App: The workload series of applications that never appeared in ECW-08. If you wish to test the model's generalizability when faced with unknown time-series patterns, use this dataset.
- ECW-New-Infras.: The workload series running on infrastructure that has never been present in the ECW.
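Besides plotting the curve, a switch point in an ECW-App-Switch-sample series can be located programmatically. The sketch below is one simple heuristic, not the method used in the paper: it picks the time step with the largest difference between the mean of the following window and the mean of the preceding window. The series is synthetic (six days of hourly data with a level shift at hour 72), and `largest_mean_shift` is a hypothetical helper name. A mean-shift heuristic only catches changes in load level; a switch that changes only the shape of the daily pattern would need a different statistic.

```python
import numpy as np

def largest_mean_shift(series: np.ndarray, window: int = 24) -> int:
    """Return the index t with the largest gap between the mean of
    series[t:t+window] and the mean of series[t-window:t]."""
    csum = np.concatenate(([0.0], np.cumsum(series)))
    t = np.arange(window, len(series) - window + 1)
    before = (csum[t] - csum[t - window]) / window
    after = (csum[t + window] - csum[t]) / window
    return int(t[np.argmax(np.abs(after - before))])

# Synthetic six-day hourly series: a daily cycle with a level change at hour 72.
hours = np.arange(144)
daily = 0.1 * np.sin(2 * np.pi * hours / 24)
series = np.where(hours < 72, 0.2, 0.7) + daily

switch = largest_mean_shift(series, window=24)
print(switch)
```

Using a window equal to the daily period (24 hours) averages out the daily cycle, so the comparison reflects the level change rather than the time of day.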
If you use this dataset, please cite the work DynEformer @ KDD2023 [paper], [code]. The plain-text citation and BibTeX entry are:
Shaoyuan Huang, Zheng Wang, Heng Zhang, Xiaofei Wang, Cheng Zhang, and Wenyu Wang. 2023. One for All: Unified Workload Prediction for Dynamic Multi-tenant Edge Cloud Platforms. In Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD ’23), August 6–10, 2023, Long Beach, CA, USA. ACM, New York, NY, USA, 10 pages. https://doi.org/10.1145/3580305.3599453
@inproceedings{10.1145/3580305.3599453,
author = {Huang, Shaoyuan and Wang, Zheng and Zhang, Heng and Wang, Xiaofei and Zhang, Cheng and Wang, Wenyu},
title = {One for All: Unified Workload Prediction for Dynamic Multi-Tenant Edge Cloud Platforms},
year = {2023},
isbn = {9798400701030},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
url = {https://doi.org/10.1145/3580305.3599453},
doi = {10.1145/3580305.3599453},
booktitle = {Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining},
pages = {788–797},
numpages = {10},
location = {Long Beach, CA, USA},
series = {KDD '23}
}
The cross-domain data comprise 12 dimensions of static server hardware attributes, including maximum bandwidth, number of CPUs, location, and other infrastructure characteristics. This data is collected when edge servers join the Multi-tenant Edge Cloud Platforms (MT-ECP) or during subsequent updates. Incorporating cross-domain data is intended to further enhance the model's robustness, as noted in previous research (Lim et al. [a]). Specifically, for workload prediction in MT-ECP, server features such as hardware attributes and geographical location significantly influence workload variations.
[a] Bryan Lim, et al. "Temporal Fusion Transformers for interpretable multi-horizon time series forecasting". International Journal of Forecasting, 2021
ECW, ECW-App-Switch, ECW-New-App, and ECW-New-Infras cover all three types of behavior of dynamic MT-ECP, as shown in the following figure. All datasets are shaped as N (number of edge servers) * T (series length, hourly granularity) * F (number of feature dimensions, always 16).
Specifically, each edge server's workload incorporates daily short-term periodic patterns and weekly long-term periodic patterns (where the load on the same day of the following week resembles the current day's load), in addition to certain trend patterns and irregular fluctuations. Due to the heterogeneity in server attributes and application patterns, these patterns vary across different load series. Compared to existing time-series prediction datasets (such as ETT and Azure), ECW exhibits more complex patterns and higher-frequency dynamics (application switch disruptions and new entities), significantly elevating the generalization requirements for models.
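One quick way to check for the daily and weekly periodicity described above is to compute the autocorrelation of a series at lags of 24 and 168 hours. The sketch below runs on a synthetic 720-hour series built from a daily cycle, a weekly cycle, and noise; the amplitudes and noise level are made up, and `autocorr` is a hypothetical helper. On a real ECW series, the `bw_upload` values of a single server would take the place of `series`.

```python
import numpy as np

def autocorr(series: np.ndarray, lag: int) -> float:
    """Pearson correlation between the series and itself shifted by `lag` steps."""
    a, b = series[:-lag], series[lag:]
    return float(np.corrcoef(a, b)[0, 1])

hours = np.arange(720)
rng = np.random.default_rng(0)
series = (
    np.sin(2 * np.pi * hours / 24)            # daily cycle
    + 0.5 * np.sin(2 * np.pi * hours / 168)   # weekly cycle
    + 0.1 * rng.standard_normal(720)          # irregular fluctuation
)

for lag in (12, 24, 168):
    print(lag, round(autocorr(series, lag), 3))
```

High correlation at lags 24 and 168, and low or negative correlation at an off-period lag such as 12, is consistent with the daily and weekly patterns.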
We use the .npy file format to save the data. A demo of an edge server workload from the ECW data is illustrated in Figure 2. The first line (16 columns) is the horizontal header and includes "bw_upload", "hour", "day", "week", "province", "bandwidth_type", "nat_type", "isp", "billing_rule", "upbandwidth", "upbandwidth_base", "cpu_num", "memory_size", "disk_size", "test_sat", and "loss_sat". The detailed meaning of each column name is shown in Table 1.
Field | bw_upload | hour | day | week | province | bandwidth_type | nat_type | isp | billing_rule | upbandwidth | upbandwidth_base | cpu_num | memory_size | disk_size | test_sat | loss_sat |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Description | upload bandwidth workload (target) | record time (dynamic feature) | record time (dynamic feature) | record time (dynamic feature) | edge server location | quality assessment of the network | NAT type | ISP | types of billing | total server bandwidth | available server bandwidth | number of CPUs | memory size | disk size | network pressure test quality | packet loss quality |
Table 1. Description of each column.
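The column names in Table 1 can be mapped to positions on the feature axis so that fields are selected by name rather than by index. The sketch below assumes that the feature axis of each array follows the header order listed above, and that static attributes are constant over time so that the first time step is representative. The array is a small synthetic stand-in, and `FIELDS` and `IDX` are hypothetical names.

```python
import numpy as np

# Column order as listed in Table 1 (assumed to match the last array axis).
FIELDS = [
    "bw_upload", "hour", "day", "week",
    "province", "bandwidth_type", "nat_type", "isp", "billing_rule",
    "upbandwidth", "upbandwidth_base", "cpu_num", "memory_size",
    "disk_size", "test_sat", "loss_sat",
]
IDX = {name: i for i, name in enumerate(FIELDS)}

# Synthetic stand-in: 5 servers, 48 hours, 16 features.
rng = np.random.default_rng(0)
data = rng.random((5, 48, len(FIELDS)))

target = data[:, :, IDX["bw_upload"]]                                  # (N, T)
time_feats = data[:, :, [IDX[k] for k in ("hour", "day", "week")]]     # (N, T, 3)
static_feats = data[:, 0, IDX["province"]:]                            # (N, 12)

print(target.shape, time_feats.shape, static_feats.shape)
```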