Know Thy Strengths: Comprehensive Dialogue State Tracking Diagnostics (EMNLP2022 Findings)
Dialogue State Tracking (DST) models are brittle, but comparisons beyond joint goal accuracy have been sparse and uncoordinated
A key skill🔑 for task-oriented dialogue (TOD) models is DST: identifying what the user wants, i.e., the dialogue state or belief state, expressed as slot key-value pairs that can be passed on to APIs to make the requests needed to fulfill a task.
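To make the idea of a dialogue state concrete, here is a minimal sketch of slot key-value pairs being accumulated over turns and turned into an API query. The slot names (`restaurant-food`, etc.) and the helpers `update_state` and `to_api_query` are hypothetical and chosen for illustration; they are not part of CheckDST or any specific dataset loader.

```python
from typing import Dict

# A dialogue state is modeled here as a flat mapping from "domain-slot" to value.
DialogueState = Dict[str, str]


def update_state(state: DialogueState, turn_slots: DialogueState) -> DialogueState:
    """Return a new state in which slots from the latest turn override earlier values."""
    merged = dict(state)
    merged.update(turn_slots)
    return merged


def to_api_query(state: DialogueState, domain: str) -> Dict[str, str]:
    """Keep only the slots of one domain and strip the domain prefix."""
    prefix = domain + "-"
    return {k[len(prefix):]: v for k, v in state.items() if k.startswith(prefix)}


# Turn 1 (hypothetical): "I'd like Italian food in the centre."
state: DialogueState = {}
state = update_state(state, {"restaurant-food": "italian", "restaurant-area": "centre"})

# Turn 2 (hypothetical): "Something cheap, please."
state = update_state(state, {"restaurant-pricerange": "cheap"})

# The accumulated state can then be handed to a (hypothetical) restaurant search API.
query = to_api_query(state, "restaurant")
print(query)
```

A DST model's job is to predict the `state` mapping after each turn; everything downstream depends on it being right.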
DST models are evaluated on an in-distribution test set, so models that perform well on DST benchmarks can still be brittle to distribution shift😱. As a result, benchmark results are often a poor guide to a model’s performance at deployment😥.
This is not news! Previous work has recognized this issue and gone beyond comparing slot prediction accuracy, but these efforts have been sparse and uncoordinated, making it difficult to compare DST models holistically.
CheckDST facilitates a comprehensive and consolidated comparison of DST performance
So we put together CheckDST, a consolidation of robustness metrics and analytical tools that quantify prediction consistency under perturbations, performance on challenging cases that contain coreferences, and problematic behaviors such as hallucination.
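As a rough illustration of what "prediction consistency under perturbations" can mean, the sketch below computes joint goal accuracy (exact match of the full predicted state) and a conditional consistency score: among examples the model gets right on the original input, the fraction it also gets right on the perturbed input. The function names and toy data are assumptions made for this example and are not taken from the CheckDST codebase; please see the paper and repo for the actual metric definitions.

```python
from typing import Dict, List

State = Dict[str, str]


def joint_goal_accuracy(preds: List[State], golds: List[State]) -> float:
    """Fraction of examples whose predicted state matches the gold state exactly."""
    if not golds:
        return 0.0
    return sum(p == g for p, g in zip(preds, golds)) / len(golds)


def conditional_consistency(
    orig_preds: List[State],
    orig_golds: List[State],
    pert_preds: List[State],
    pert_golds: List[State],
) -> float:
    """Among examples correct on the original input, fraction also correct when perturbed."""
    kept = [
        (pp, pg)
        for op, og, pp, pg in zip(orig_preds, orig_golds, pert_preds, pert_golds)
        if op == og
    ]
    if not kept:
        return 0.0
    return sum(pp == pg for pp, pg in kept) / len(kept)


# Toy data (hypothetical): three examples and their perturbed counterparts.
orig_golds = [{"hotel-area": "north"}, {"taxi-destination": "airport"}, {"train-day": "monday"}]
orig_preds = [{"hotel-area": "north"}, {"taxi-destination": "airport"}, {"train-day": "tuesday"}]
pert_golds = orig_golds  # assume the perturbation (e.g. a paraphrase) leaves the gold state unchanged
pert_preds = [{"hotel-area": "north"}, {"taxi-destination": "station"}, {"train-day": "monday"}]

jga = joint_goal_accuracy(orig_preds, orig_golds)
consistency = conditional_consistency(orig_preds, orig_golds, pert_preds, pert_golds)
print(f"JGA on original inputs: {jga:.2f}")
print(f"Conditional consistency: {consistency:.2f}")
```

Conditioning on the original prediction being correct separates robustness from raw accuracy: a model can score well on the benchmark and still lose many of those correct predictions once the input is perturbed.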
With CheckDST, we can easily get a comprehensive and fine-grained picture of a model's performance. Comparing two major classes of DST models with CheckDST reveals that their DST benchmark results do not correlate with their robustness and that each class has clear strengths and weaknesses.
CheckDST can help develop more robust DST models
We also show how the weaknesses exposed by CheckDST can serve as a guide for developing robust DST models. We achieve holistic improvements with PrefineDST, which prefinetunes generation models on multiple tasks that can mitigate these weaknesses.
So for anyone working on DST, we encourage using CheckDST to run an extensive comparison, establish each model's strengths and weaknesses, and use the findings as a guide for developing more robust models. For the full set of findings and the juicy details, please check out our paper!
I want to use CheckDST!
Paper
@article{cho2021checkdst,
  title={Know Thy Strengths: Comprehensive Dialogue State Tracking Diagnostics},
  author={Cho, Hyundong and Sankar, Chinnadhurai and Lin, Christopher and Sadagopan, Kaushik Ram and Shayandeh, Shahin and Celikyilmaz, Asli and May, Jonathan and Beirami, Ahmad},
  journal={arXiv preprint arXiv:2112.08321},
  year={2021}
}