hsy23/ECWDataset

Edge Cloud Servers Workload Dataset (ECWDataset)

In this GitHub repo, we provide several datasets that can be used for dynamic sequence time-series problems. All datasets have been preprocessed and are stored as .npy files. The data ranges from 2022/08 to 2022/10.

Data background: The ECW is a real-world dataset utilized in our study One for All: Unified Workload Prediction for Dynamic Multi-tenant Edge Cloud Platforms. The whole dataset encompasses diverse load logs (e.g., bandwidth for ECW, CPU, storage) and abundant cross-domain static data from a leading edge cloud service company that provides mature idle computing resource integration and edge cloud resource provisioning.

Note: For privacy reasons, the uploaded data has been min-max normalized.
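For readers unfamiliar with the term, here is a minimal sketch of min-max normalization. The README does not state whether the scaling was applied per server, per feature, or globally, nor the exact target range, so the `[0, 1]` range and the helper name `min_max_normalize` are assumptions made for illustration only.

```python
import numpy as np


def min_max_normalize(x: np.ndarray) -> np.ndarray:
    """Scale values linearly so the minimum maps to 0 and the maximum to 1."""
    x = np.asarray(x, dtype=float)
    lo, hi = x.min(), x.max()
    if hi == lo:
        # A constant series carries no range information; map it to zeros.
        return np.zeros_like(x)
    return (x - lo) / (hi - lo)


# Made-up bandwidth readings, not taken from the dataset
raw = np.array([120.0, 300.0, 480.0, 210.0])
scaled = min_max_normalize(raw)
```

Because the original minimum and maximum are not published, the normalized values cannot be mapped back to absolute bandwidth, which is presumably the point of the privacy measure.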

Dataset list (updating)

  • ECW-08: A matrix shaped as 797 $\times$ 720 $\times$ 16 represents the upload bandwidth workload variation of 797 edge servers over 720 hours (24*30) in 2022/08. 16 is the feature dimension of the data, containing 4-dimensional dynamic features and 12-dimensional static content features from the cross domain.

  • ECW-09: A matrix shaped as 1022 $\times$ 720 $\times$ 16 represents the upload bandwidth workload variation of 1022 edge servers over 720 hours (24*30) in 2022/09.

  • ECW-min: Upload bandwidth workload logs at 5-minute granularity, which capture finer-grained load changes than the hourly ECW data.
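The shapes listed above can be checked after loading a file with NumPy. The sketch below writes a random stand-in array with the ECW-08 shape to a temporary file and loads it back; the file name `ECW_08.npy` is hypothetical, and the split into 4 leading dynamic features and 12 trailing static features is an assumption about the feature order.

```python
import os
import tempfile

import numpy as np

# Stand-in for the real file: random values with the ECW-08 shape.
# "ECW_08.npy" is a hypothetical name; use the actual file name from the repo.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "ECW_08.npy")
    stand_in = np.random.default_rng(0).random((797, 720, 16), dtype=np.float32)
    np.save(path, stand_in)
    data = np.load(path)  # fully loaded into memory before the file is removed

num_servers, num_hours, num_features = data.shape

# Assumption: the 4 dynamic features come first, followed by the 12 static ones.
dynamic = data[:, :, :4]
# Static attributes are assumed constant over time, so the first hour suffices.
static = data[:, 0, 4:]
```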

The following datasets are recommended for testing.

  • ECW-App-Switch-sample: The workload series where application switching occurred during 2022/08/25-2022/08/30. The exact time of application switches can be easily detected by plotting the curve, as it is accompanied by a sudden change in the load change pattern. If you wish to test the model's capability to handle abrupt changes in time-series patterns (extreme concept-drift), use this dataset.

  • ECW-New-App: The workload series of the applications that never appeared in ECW-08. If you wish to test the model's generalizability when faced with unknown time-series patterns, use this dataset.

  • ECW-New-Infras.: The workload series running on infrastructure that has never been present in the ECW.
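As a complement to plotting, an application switch in ECW-App-Switch-sample could be located programmatically. The sketch below is a simple mean-shift heuristic on a synthetic series, not the method used in the paper; the function name `largest_level_shift` and the 24-hour window are assumptions. Real switches may change the shape of the pattern rather than only its level, in which case this heuristic would miss them.

```python
import numpy as np


def largest_level_shift(series: np.ndarray, window: int = 24) -> int:
    """Return the index where the mean of the next `window` points differs
    most from the mean of the previous `window` points."""
    series = np.asarray(series, dtype=float)
    best_idx, best_gap = window, -1.0
    for t in range(window, len(series) - window + 1):
        before = series[t - window:t].mean()
        after = series[t:t + window].mean()
        gap = abs(after - before)
        if gap > best_gap:
            best_idx, best_gap = t, gap
    return best_idx


# Synthetic hourly series: a daily cycle whose level jumps at hour 72
hours = np.arange(144)
synthetic = 0.2 + 0.05 * np.sin(2 * np.pi * hours / 24)
synthetic[72:] += 0.5

switch_hour = largest_level_shift(synthetic)
print(switch_hour)  # 72
```

Using a window equal to one day averages out the daily cycle, so only the level change remains in the comparison.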

If you use this dataset, please cite the work DynEformer @ KDD2023 [paper], [code] (a BibTeX entry follows the reference):

Shaoyuan Huang, Zheng Wang, Heng Zhang, Xiaofei Wang, Cheng Zhang, and Wenyu Wang. 2023. One for All: Unified Workload Prediction for Dynamic Multi-tenant Edge Cloud Platforms. In Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD ’23), August 6–10, 2023, Long Beach, CA, USA. ACM, New York, NY, USA, 10 pages. https://doi.org/10.1145/3580305.3599453

@inproceedings{10.1145/3580305.3599453,
author = {Huang, Shaoyuan and Wang, Zheng and Zhang, Heng and Wang, Xiaofei and Zhang, Cheng and Wang, Wenyu},
title = {One for All: Unified Workload Prediction for Dynamic Multi-Tenant Edge Cloud Platforms},
year = {2023},
isbn = {9798400701030},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
url = {https://doi.org/10.1145/3580305.3599453},
doi = {10.1145/3580305.3599453},
booktitle = {Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining},
pages = {788--797},
numpages = {10},
location = {Long Beach, CA, USA},
series = {KDD '23}
}

Why is cross-domain static content involved in ECW?

The cross-domain data comprises 12 dimensions of static server hardware attributes, including maximum bandwidth, number of CPUs, location, and other infrastructure characteristics. This data is collected when edge servers join the Multi-tenant Edge Cloud Platforms (MT-ECP) or during subsequent updates. Incorporating cross-domain data aims to further enhance model robustness, as noted in previous research (Lim et al. [a]). Specifically, for workload prediction in MT-ECP, server features such as hardware attributes and geographical location significantly influence workload variations.

[a] Bryan Lim, et al. "Temporal Fusion Transformers for interpretable multi-horizon time series forecasting". International Journal of Forecasting, 2021

ECW and its derivative data:

ECW, ECW-App-Switch, ECW-New-App, and ECW-New-Infras. cover all three types of dynamic MT-ECP behavior, as shown in the following figure. All datasets are shaped as N (number of edge servers) $\times$ T (series length, hourly granularity) $\times$ F (number of feature dimensions, always 16).

Figure 1. Workloads under dynamic MT-ECP behaviors.
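One common way to turn an N $\times$ T $\times$ F array into supervised forecasting samples is a sliding window. The sketch below is a generic illustration, not the preprocessing used in the paper; the helper name `make_windows`, the window lengths, and the choice of column 0 as the target are assumptions.

```python
import numpy as np


def make_windows(data: np.ndarray, input_len: int, horizon: int, target_col: int = 0):
    """Slice an (N, T, F) array into (input, target) pairs.

    Returns x with shape (N * W, input_len, F) and y with shape (N * W, horizon),
    where W = T - input_len - horizon + 1 windows per server.
    """
    n, t, f = data.shape
    inputs, targets = [], []
    for start in range(t - input_len - horizon + 1):
        end = start + input_len
        inputs.append(data[:, start:end, :])
        targets.append(data[:, end:end + horizon, target_col])
    x = np.stack(inputs, axis=1).reshape(-1, input_len, f)
    y = np.stack(targets, axis=1).reshape(-1, horizon)
    return x, y


# Tiny stand-in array: 2 servers, 10 hours, 16 features
demo = np.arange(2 * 10 * 16, dtype=float).reshape(2, 10, 16)
x, y = make_windows(demo, input_len=4, horizon=2)
```

With these settings each server yields 5 windows, so `x` has shape `(10, 4, 16)` and `y` has shape `(10, 2)`.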

Specifically, each edge server's workload incorporates daily short-term periodic patterns and weekly long-term periodic patterns (where the load on the same day of the following week resembles the current day's load), in addition to certain trend patterns and irregular fluctuations. Due to the heterogeneity in server attributes and application patterns, these patterns vary across different load series. Compared to existing time-series prediction datasets (such as ETT and Azure), ECW exhibits more complex patterns and higher-frequency dynamics (application switch disruptions and new entities), significantly elevating the generalization requirements for models.
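The daily and weekly periodicity described above can be probed with the autocorrelation at lags of 24 and 168 hours. The sketch below uses a synthetic series rather than real ECW data, and the helper `autocorr` is a hypothetical name; on a real server series, high values at these lags would be consistent with the described patterns.

```python
import numpy as np


def autocorr(series: np.ndarray, lag: int) -> float:
    """Sample autocorrelation of a 1-D series at a positive lag."""
    series = np.asarray(series, dtype=float)
    centered = series - series.mean()
    denom = np.dot(centered, centered)
    if denom == 0:
        return 0.0
    return float(np.dot(centered[:-lag], centered[lag:]) / denom)


# Synthetic 720-hour series with a daily (24 h) and a weaker weekly (168 h) cycle
hours = np.arange(720)
synthetic = np.sin(2 * np.pi * hours / 24) + 0.5 * np.sin(2 * np.pi * hours / 168)

daily = autocorr(synthetic, 24)      # strongly positive for a daily cycle
weekly = autocorr(synthetic, 168)    # positive when weeks resemble each other
half_day = autocorr(synthetic, 12)   # negative: half a day is out of phase
```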

We use the .npy file format to save the data. An edge server workload demo from the ECW data is illustrated in Figure 2. The first line (16 columns) is the horizontal header and includes 'bw_upload', 'hour', 'day', 'week', 'province', 'bandwidth_type', 'nat_type', 'isp', 'billing_rule', 'upbandwidth', 'upbandwidth_base', 'cpu_num', 'memory_size', 'disk_size', 'test_sat', and 'loss_sat'. The meaning of each column name is given in Table 1.

Figure 2. Edge server workload demo.

| Field | Description |
| --- | --- |
| bw_upload | upload bandwidth workload (target) |
| hour | record time (dynamic features) |
| day | record time (dynamic features) |
| week | record time (dynamic features) |
| province | edge server location |
| bandwidth_type | quality assessment of the network |
| nat_type | nat type |
| isp | isp |
| billing_rule | types of billing |
| upbandwidth | total server bandwidth |
| upbandwidth_base | available server bandwidth |
| cpu_num | cpu_num |
| memory_size | memory_size |
| disk_size | disk_size |
| test_sat | network pressure test quality |
| loss_sat | packet loss quality |

Table 1. Description of each column.
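Since a .npy array carries no column names, it can help to keep the header from Table 1 alongside the data. The sketch below assumes the order of the last axis matches the header order listed above, which the README implies but does not state explicitly; `COL_INDEX` and `select` are hypothetical helper names, and the array is random stand-in data.

```python
import numpy as np

# Column names in the order given by the README header
COLUMNS = [
    "bw_upload", "hour", "day", "week",
    "province", "bandwidth_type", "nat_type", "isp", "billing_rule",
    "upbandwidth", "upbandwidth_base", "cpu_num", "memory_size",
    "disk_size", "test_sat", "loss_sat",
]
COL_INDEX = {name: i for i, name in enumerate(COLUMNS)}


def select(data: np.ndarray, names: list) -> np.ndarray:
    """Pick named feature columns from the last axis of an (N, T, F) array."""
    return data[..., [COL_INDEX[name] for name in names]]


# Random stand-in: 3 servers, 48 hours, 16 features
demo = np.random.default_rng(0).random((3, 48, len(COLUMNS)))

target = select(demo, ["bw_upload"])[..., 0]                      # shape (3, 48)
hardware = select(demo, ["cpu_num", "memory_size", "disk_size"])  # shape (3, 48, 3)
```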
