Know Thy Strengths: Comprehensive Dialogue State Tracking Diagnostics (EMNLP 2022 Findings)


Dialogue State Tracking (DST) models are brittle, but comparisons beyond joint goal accuracy have been sparse and uncoordinated

DST overview

A key skill🔑 for task-oriented dialogue (TOD) models is DST: identifying what the user wants, i.e., the dialogue state or belief state, in the form of slot key-value pairs that can be passed to APIs to make the requests needed to fulfill a task.
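To make the slot key-value format concrete, here is a minimal sketch of a dialogue state as a Python dictionary, together with joint goal accuracy (the fraction of turns whose predicted state matches the gold state exactly). The slot names and values are made up for illustration, and `joint_goal_accuracy` is a hypothetical helper, not a function from the CheckDST toolkit.

```python
from typing import Dict, List

# A dialogue state maps slot names to values (illustrative slot names only).
DialogueState = Dict[str, str]


def joint_goal_accuracy(
    predictions: List[DialogueState], references: List[DialogueState]
) -> float:
    """Fraction of turns where the predicted state equals the gold state exactly."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    # A turn counts as correct only if every slot and value matches.
    correct = sum(pred == gold for pred, gold in zip(predictions, references))
    return correct / len(references)


# Two turns of a made-up restaurant booking dialogue.
gold_states = [
    {"restaurant-food": "italian", "restaurant-area": "centre"},
    {"restaurant-food": "italian", "restaurant-area": "centre", "restaurant-people": "2"},
]
predicted_states = [
    {"restaurant-food": "italian", "restaurant-area": "centre"},
    # One wrong slot value makes the whole turn count as incorrect.
    {"restaurant-food": "italian", "restaurant-area": "north", "restaurant-people": "2"},
]

jga = joint_goal_accuracy(predicted_states, gold_states)
print(f"Joint goal accuracy: {jga:.2f}")
```

Because a single wrong slot invalidates the whole turn, joint goal accuracy is a strict metric, but it says nothing about why a prediction failed.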

DST models are typically evaluated on an in-distribution test set, so models that perform well on DST benchmarks can still be brittle to distribution shift😱. As a result, benchmark results are often a poor guide to a model’s performance at deployment😥.


This is not news! Previous work has recognized this issue and gone beyond comparing slot prediction accuracy, but these efforts have been sparse and uncoordinated, making it difficult to compare DST models holistically.

CheckDST facilitates a comprehensive and consolidated comparison of DST performance

CheckDST overview

So we put together CheckDST, a consolidation of robustness metrics and analytical tools that quantify prediction consistency under perturbations, performance on challenging cases that contain coreferences, and problematic behaviors such as hallucination.
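As a rough illustration of what "prediction consistency under perturbations" could look like in code, the sketch below computes a conditional accuracy: among examples the model gets right on the original input, the fraction it still gets right after the input is perturbed (for example, paraphrased). This is one plausible formulation under our own assumptions, not necessarily the exact metric definition used in CheckDST, and `conditional_consistency` is a hypothetical helper name.

```python
from typing import Dict, List, Optional

DialogueState = Dict[str, str]


def conditional_consistency(
    gold_original: List[DialogueState],
    pred_original: List[DialogueState],
    gold_perturbed: List[DialogueState],
    pred_perturbed: List[DialogueState],
) -> Optional[float]:
    """Among examples predicted correctly on the original input, return the
    fraction also predicted correctly on the perturbed input.

    Returns None when no original prediction is correct, since the
    conditional accuracy is undefined in that case.
    """
    lengths = {len(gold_original), len(pred_original),
               len(gold_perturbed), len(pred_perturbed)}
    if len(lengths) != 1:
        raise ValueError("all four lists must have the same length")

    correct_on_original = 0
    still_correct = 0
    for g_o, p_o, g_p, p_p in zip(
        gold_original, pred_original, gold_perturbed, pred_perturbed
    ):
        if p_o == g_o:
            correct_on_original += 1
            if p_p == g_p:
                still_correct += 1

    if correct_on_original == 0:
        return None
    return still_correct / correct_on_original


# Made-up example: three turns, each with an original and a paraphrased input.
# A paraphrase keeps the gold state unchanged.
gold = [{"hotel-area": "east"}, {"hotel-stars": "4"}, {"hotel-parking": "yes"}]
pred_on_original = [{"hotel-area": "east"}, {"hotel-stars": "4"}, {"hotel-parking": "no"}]
pred_on_paraphrase = [{"hotel-area": "east"}, {"hotel-stars": "3"}, {"hotel-parking": "yes"}]

score = conditional_consistency(gold, pred_on_original, gold, pred_on_paraphrase)
print(f"Consistency under paraphrase: {score:.2f}")
```

Conditioning on the original prediction being correct separates robustness from raw accuracy: a model with a high benchmark score can still score low here if small input changes flip its predictions.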

With CheckDST, we can easily get a comprehensive and fine-grained picture of a model's performance. Comparing two major classes of DST models with CheckDST reveals that their DST benchmark results do not correlate with their robustness and that each class has clear strengths and weaknesses.

results overview

CheckDST can help develop more robust DST models

We also show how the weaknesses exposed by CheckDST can serve as a guide for developing robust DST models. We achieve holistic improvements with PrefineDST, which prefinetunes generation models on multiple tasks that can mitigate these weaknesses.

So for anyone working on DST, we encourage using CheckDST to run an extensive comparison, establish strengths and weaknesses, and use the findings as a guide for developing more robust models. For the full set of findings and the juicy details, please check out our paper!

PrefineDST overview


I want to use CheckDST!

The CheckDST toolkit can be accessed through our GitHub repository.


If our work inspires you, please cite us:
    @article{cho2021checkdst,
      title={Know Thy Strengths: Comprehensive Dialogue State Tracking Diagnostics},
      author={Cho, Hyundong and Sankar, Chinnadhurai and Lin, Christopher and Sadagopan, Kaushik Ram and Shayandeh, Shahin and Celikyilmaz, Asli and May, Jonathan and Beirami, Ahmad},
      journal={arXiv preprint arXiv:2112.08321},
      year={2021}
    }


This work was done by a group of researchers at USC ISI, Meta AI, and Google AI. It was carried out while Justin and Ahmad were at Meta, where Justin was an intern.


If you have any questions about the CheckDST toolkit, feel free to email us at "hd.justin at gmail dot com" or raise an issue on the GitHub repository.