This repository collects publicly available conversational datasets in CSV format for data analysis, such as process mining.
Publicly available datasets with conversation transcripts annotated with dialog (speech) acts:
The Dialog State Tracking Challenge (DSTC) series provides several datasets of annotated information-seeking dialog transcripts for the travel and restaurant domains. Some of them are freely available. These datasets were created to evaluate and compare the performance of dialog state trackers, that is, systems that interpret the user's actions. They also include ontologies describing the domain, each consisting of attributes (slots) with a set of possible values per attribute. The transcripts are annotated with dialog acts, user goals, methods, attributes, and timestamps, as well as user feedback.
- DSTC1: the domain is route information for buses in Pittsburgh. Codebook. License: MSR-LA.
- DSTC2: labeled human-computer dialogs in the restaurant information domain, in JSON format. The domain of the dataset is described by an ontology object, also distributed in JSON. Phoenix grammar. The dialog-act notation closely matches that used in DSTC1.
- The Switchboard Dialog Act Corpus (SwDA) extends the Switchboard-1 Telephone Speech Corpus, Release 2 with turn/utterance-level dialog-act tags. The dataset contains transcripts of telephone conversations annotated with 43 dialog-act tags, part-of-speech tags, lemmas and parse trees. Description. Codebook. License: GNU GPL v2.0.
- The Spoken Conversational Search (SCS) Data Set provides conversational transcripts collected for pre-defined search tasks performed in a conversational speech-only setting. The transcripts are annotated with timestamps, the corresponding search queries, and dialog acts for each of the roles. Codebook.
- The Open Data Exploration dataset for the conversational browsing task contains 26 transcripts annotated with dialog acts and entity spans. Codebook. License: MIT.
Format: CSV for importing into ProM, with one message/dialog act per row.
Basic columns:
- case ID - conversation identifier
- resource - actor role of the conversation participant
- activity name - dialog (speech) act
Optional columns:
- start time, stop time - timestamps reflect ordering of messages along the time axis
- message count - counts the number of messages exchanged within a conversation
- message - transcript of the utterance
- query - information need describing the task (instruction) that participants are solving
- turn count - counts the pairs of messages exchanged within a conversation
- slots - message attributes from the domain ontology
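The row layout described above can be sketched in Python as follows. This is only an illustration under stated assumptions: the exact header spellings, the sample utterances, and the dialog-act labels are invented here and are not taken from any of the datasets, and the turn counter assumes that a turn is a pair of consecutive messages.

```python
import csv
import io

# Hypothetical conversations: (case ID, resource, activity name, message).
# Utterances and dialog-act labels are invented for illustration only.
SAMPLE = [
    ("conv-1", "user", "request", "Which buses go downtown?"),
    ("conv-1", "system", "inform", "Several routes go downtown."),
    ("conv-1", "user", "thank", "Thanks."),
    ("conv-2", "user", "request", "Find a cheap restaurant."),
    ("conv-2", "system", "offer", "How about a pizza place?"),
]

# Assumed header names: three basic columns plus three of the optional ones.
FIELDS = ["case ID", "resource", "activity name",
          "message count", "turn count", "message"]


def to_event_log(messages):
    """Return one row per message with running counters per conversation."""
    counters = {}
    rows = []
    for case_id, resource, act, text in messages:
        counters[case_id] = counters.get(case_id, 0) + 1
        n = counters[case_id]
        rows.append({
            "case ID": case_id,
            "resource": resource,
            "activity name": act,
            "message count": n,
            # Assumption: messages 1-2 form turn 1, messages 3-4 turn 2, etc.
            "turn count": (n + 1) // 2,
            "message": text,
        })
    return rows


def write_csv(rows, handle):
    """Write the rows as CSV with a header line."""
    writer = csv.DictWriter(handle, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)


rows = to_event_log(SAMPLE)
buffer = io.StringIO()
write_csv(rows, buffer)
print(buffer.getvalue())
```

Writing to a file instead of an in-memory buffer only requires passing a handle opened with `open(path, "w", newline="")` to `write_csv`.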
SCS:
- Query.complexity - one of three levels indicating the task complexity type (remember, understand, or analyze)
- Notes - comments, for example that a particular search was stopped by the user or the researcher, or extra notes on the participant's actions during the search session
SwDA:
- length - duration of the conversation in seconds
- caller_dialect_area - geographic identifier for the cluster of resources, one of {MIXED, NEW ENGLAND, NORTH MIDLAND, NORTHERN, NYC, SOUTH MIDLAND, SOUTHERN, UNK, WESTERN}
Annotation was conducted by 2 annotators. Inter-annotator agreement:
- Annotation schema: Krippendorff's alpha 0.997
- Dialogue success evaluation: Krippendorff's alpha 0.726
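For readers unfamiliar with the agreement measure quoted above, here is a minimal sketch of Krippendorff's alpha for the simplest case: two annotators, nominal labels, and no missing values. It is not the script used to obtain the reported figures; the function name and the example labels are hypothetical.

```python
from collections import Counter
from itertools import permutations


def krippendorff_alpha_nominal(labels_a, labels_b):
    """Krippendorff's alpha for two annotators, nominal labels, no missing values."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both annotators must label the same units")

    # Coincidence matrix: each unit contributes both ordered pairs (a, b) and (b, a).
    coincidences = Counter()
    for a, b in zip(labels_a, labels_b):
        coincidences[(a, b)] += 1
        coincidences[(b, a)] += 1

    # Marginal totals n_c; their sum n is the number of pairable values.
    totals = Counter()
    for (c, _k), count in coincidences.items():
        totals[c] += count
    n = sum(totals.values())

    # Off-diagonal mass: observed vs. expected-by-chance disagreement.
    observed = sum(count for (c, k), count in coincidences.items() if c != k)
    expected = sum(totals[c] * totals[k] for c, k in permutations(totals, 2))
    if expected == 0:
        raise ValueError("alpha is undefined when only one label is used")

    # alpha = 1 - D_o / D_e, with D_o = observed / n
    # and D_e = expected / (n * (n - 1)).
    return 1.0 - (n - 1) * observed / expected


# Hypothetical dialog-act labels from two annotators for four utterances.
annotator_1 = ["inform", "inform", "request", "request"]
annotator_2 = ["inform", "inform", "request", "inform"]
alpha = krippendorff_alpha_nominal(annotator_1, annotator_2)
print(round(alpha, 3))  # 0.533
```

An alpha of 1 means perfect agreement and 0 means agreement at chance level, so a single disagreement in four units already lowers the score considerably.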
- Stefan Sitter and Adelheit Stein. 1992. Modeling the illocutionary aspects of information-seeking dialogues. Information Processing & Management, 28(2):165–180.
- Johanne R. Trippas, Damiano Spina, Lawrence Cavedon, and Mark Sanderson. 2017. How Do People Interact in Conversational Speech-Only Search Tasks: A Preliminary Analysis. The ACM SIGIR Conference on Human Information Interaction and Retrieval (CHIIR), Oslo, Norway.