We start with the criteria proposed by Kang et al. (2024). The human evaluation is designed to align with the ultimate goal of emotional support conversations (ESC), namely the seeker's satisfaction. To achieve this goal, the supporter’s behavior is evaluated according to the following criteria:
- Acceptance: Whether the seeker can accept the response without discomfort.
- Effectiveness: Whether the response helps shift negative emotions or attitudes toward a positive direction.
- Sensitivity: Whether the response takes into account the seeker’s overall emotional state.
- Alignment: Whether the response aligns with the predicted strategy.
To enable a more fine-grained assessment of generation quality, we further introduce the following dimensions:
- Fluency: The overall fluency and linguistic quality of the response.
- Emotion: The emotional intensity expressed in the response and its influence on the seeker.
- Interesting: Whether the response can arouse the seeker’s interest or curiosity through vivid or engaging expressions.
Intern evaluators are asked to rate model outputs across multiple aspects, including Fluency, Emotion, Interesting, and Satisfaction, where Satisfaction encompasses Acceptance, Effectiveness, Sensitivity, and overall satisfaction.
Throughout the evaluation process, we strictly adhere to international regulations and ethical standards, ensuring compliance with established guidelines regarding participant involvement and data integrity. All evaluators independently assess each sample according to pre-defined criteria, maintaining objectivity, consistency, and reliability.
Detailed manual scoring criteria are listed below. For each dimension, the five levels are ordered from the lowest score to the highest.
Fluency
- Highly incoherent; extremely difficult to understand.
- Significant incoherence; only fragments are meaningful.
- Some incoherence, but the general meaning is still conveyed.
- Mostly fluent with minor errors or awkwardness.
- Perfectly fluent, clear, and error-free.
Emotion
- Emotionally inappropriate or chaotic.
- Obvious emotional flaws or exaggeration.
- Average emotional expression with limited depth.
- Good emotional expression with appropriate intensity.
- Excellent, nuanced, and highly appropriate emotional expression.
Acceptance
- Strong emotional resistance is triggered.
- High likelihood of emotional resistance.
- Possible emotional resistance.
- Rare emotional resistance.
- No emotional resistance.
Effectiveness
- Worsens the seeker’s emotional distress.
- Potentially increases emotional stress.
- Fails to change emotional intensity.
- Partially effective but overly complex or unclear.
- Highly effective in soothing emotions and providing support.
Sensitivity
- Incorrect assessment of the seeker’s state.
- Rash judgment without sufficient exploration.
- One-sided understanding of the seeker’s state.
- Partial understanding of the seeker’s situation.
- Accurate and well-tailored understanding.
Alignment
- Completely contradicts the predicted strategy.
- Slight deviation from the predicted strategy.
- Ambiguous alignment.
- Largely aligned with minor ambiguities.
- Fully consistent with the predicted strategy.
Satisfaction
- Extremely disappointing and unhelpful.
- Poor and incomplete response.
- Adequate but unremarkable.
- Clear and helpful with useful details.
- Excellent, comprehensive, and insightful.
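As an illustration of how ratings collected under this rubric could be combined, the following is a minimal sketch, not the procedure used in this work. It assumes that the five levels of each dimension map to integer scores from 1 (lowest) to 5 (highest) and that the independent evaluators' scores for a sample are averaged per dimension. Neither the numeric range nor the averaging step is stated in the text above, and the names `DIMENSIONS`, `validate_rating`, and `aggregate` are hypothetical. Only the seven dimensions with a detailed rubric are included.

```python
# Dimensions that have a detailed rubric in the criteria above.
DIMENSIONS = (
    "Fluency", "Emotion", "Acceptance", "Effectiveness",
    "Sensitivity", "Alignment", "Satisfaction",
)

# Assumption: the five rubric levels correspond to scores 1 (lowest) to 5 (highest).
MIN_SCORE, MAX_SCORE = 1, 5


def validate_rating(rating: dict[str, int]) -> None:
    """Raise ValueError if a dimension is missing or a score is out of range."""
    for dim in DIMENSIONS:
        if dim not in rating:
            raise ValueError(f"missing dimension: {dim}")
        if not MIN_SCORE <= rating[dim] <= MAX_SCORE:
            raise ValueError(f"{dim} score out of range: {rating[dim]}")


def aggregate(ratings: list[dict[str, int]]) -> dict[str, float]:
    """Average each dimension over the independent ratings of one sample.

    Assumption: a plain arithmetic mean; the source does not specify
    how evaluator scores are combined.
    """
    if not ratings:
        raise ValueError("no ratings provided")
    for rating in ratings:
        validate_rating(rating)
    return {
        dim: sum(rating[dim] for rating in ratings) / len(ratings)
        for dim in DIMENSIONS
    }


if __name__ == "__main__":
    # Two hypothetical evaluators rating the same sample.
    evaluator_a = {dim: 4 for dim in DIMENSIONS}
    evaluator_b = {dim: 5 for dim in DIMENSIONS}
    for dim, score in aggregate([evaluator_a, evaluator_b]).items():
        print(f"{dim}: {score:.2f}")
```

Validating each rating before averaging keeps an out-of-range or incomplete form from silently skewing a per-dimension mean.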