This benchmark data set contains example values for 112 semantic data types (e.g., phone, email, address, isbn, upc), collected from the public web. The benchmark was compiled to evaluate the precision and recall of type-detection algorithms when given a small number of positive examples, as described in the AutoType paper [1]. Details of the data set can also be found in the paper.
We hope this data set can facilitate research on detecting semantic data types in tabular data and serve as a common benchmark for future work in this area.
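As a rough illustration of the precision/recall evaluation mentioned above, the sketch below scores a boolean type detector against labeled example values. It is not the evaluation code from the paper: the function names, the toy email detector, and the in-memory example values are all hypothetical, and no assumption is made about how the benchmark files are laid out.

```python
import re


def precision_recall(detector, positives, negatives):
    """Score a boolean detector on labeled example values.

    positives: values that belong to the target type.
    negatives: values that do not belong to the target type.
    """
    tp = sum(1 for v in positives if detector(v))   # correctly accepted
    fp = sum(1 for v in negatives if detector(v))   # wrongly accepted
    fn = len(positives) - tp                        # wrongly rejected
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


def looks_like_email(value):
    """Hypothetical toy detector: a deliberately simple email pattern."""
    return re.fullmatch(r"[^@\s]+@[^@\s]+\.[^@\s]+", value) is not None


# Made-up values for illustration only; these are not taken from the benchmark.
positives = ["alice@example.com", "bob@example.org", "carol@localhost"]
negatives = ["555-0100", "978-3-16-148410-0", "123 Main St"]

precision, recall = precision_recall(looks_like_email, positives, negatives)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

On these toy values the detector accepts no negatives but misses the address without a dot in its domain, so recall drops while precision stays perfect. Evaluating a real detector would replace the two lists with values loaded from the benchmark for one of the 112 types.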
This data set is released under the Computational Use of Data Agreement v1.0.
[1] Auto-Type: Synthesizing Type-Detection Logic for Rich Semantic Data Types using Open-source Code. Cong Yan and Yeye He. In SIGMOD 2018. https://dl.acm.org/doi/10.1145/3183713.3196888