How might you characterize the conversational style of a digital assistant like Siri? No matter your impression, it stands to reason that striking the wrong tone could dissuade users from engaging with it in the future.
Perhaps that’s why, in a paper (“Mirroring to Build Trust in Digital Assistants”) accepted to the Interspeech 2019 conference in Graz, Austria, researchers at Apple investigated a conversational assistant that considered users’ preferred tones and mannerisms in its responses. They found that people’s opinions of the assistant’s likability and trustworthiness improved when it mirrored their degree of chattiness, and that the features necessary to perform the mirroring could be extracted from those people’s speech patterns.
“Long-term reliance on digital assistants requires a sense of trust in the assistant and its abilities. Therefore, strategies for building and maintaining this trust are required, especially as digital assistants become more advanced and operate in more aspects of people’s lives,” wrote the paper’s coauthors. “We hypothesize that an effective method for enhancing trust in digital assistants is for the assistant to mirror the conversational style of a user’s query, specifically the degree of ‘chattiness.’ [We] loosely define chattiness to be the degree to which a query is concise (high information density) versus talkative (low information density).”
The team recruited 20 participants and had them complete a questionnaire designed to assess overall chattiness level and personality. Those selected for the study filled out a pre-study survey describing how they used digital assistants, including the frequency of their usage and the types of questions they typically asked them. Next, in front of a wall-mounted TV displaying instructions orchestrated by a human experimenter, they were told to make verbal requests to set timers and reminders, get directions and the weather report, search the web, and more.
Participants then listened to responses to their questions that were designed to be perceived as explicitly chatty or non-chatty, and classified each response’s quality as “good,” “off-topic,” “wrong information,” “too impolite,” or “too casual.” (One response to a question about the weather was “It’s supposed to be 74 degrees and clear, so don’t bother bringing a sweater or jacket,” while another was “74 and clear.”) They next took part in another round of experimenter-guided questions and answers in front of the TV, but this time they were rated on their chattiness while their speech and facial expressions were captured by a microphone, camera, and depth sensor.
The first survey’s results showed that the majority of participants (70%) preferred chattier responses to terser ones. And perhaps unsurprisingly, people who identified as chatty (60%) preferred the chatty interactions, while those who identified as non-chatty (40%) preferred the non-chatty interactions.
With that data in hand, the researchers built multi-speaker and speaker-independent classifiers capable of classifying verbal commands as chatty or non-chatty, and of determining whether a chatty or non-chatty response would be preferred. Both were based solely on audio features (95 acoustic features in total), with labels extracted from the earlier survey responses.
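To make the idea of a speaker-independent classifier concrete, here is a minimal sketch in Python. It is not the researchers' method: the article does not say which model they used, so a simple nearest-centroid classifier stands in, and the data is randomly generated rather than real audio. The function names, the number of speakers, and the label encoding are all hypothetical. The only detail taken from the article is the count of 95 features. The point it illustrates is the evaluation scheme: each speaker is held out in turn, so the classifier is always tested on a voice it has never seen.

```python
import random

NUM_FEATURES = 95  # the article reports 95 acoustic features


def make_synthetic_data(num_speakers=6, utterances_per_speaker=10, seed=0):
    """Return made-up (features, label, speaker_id) tuples; not real audio data."""
    rng = random.Random(seed)
    data = []
    for speaker in range(num_speakers):
        label = speaker % 2  # hypothetical encoding: 1 = chatty, 0 = non-chatty
        center = 1.0 if label == 1 else -1.0
        for _ in range(utterances_per_speaker):
            features = [rng.gauss(center, 1.0) for _ in range(NUM_FEATURES)]
            data.append((features, label, speaker))
    return data


def centroid(rows):
    """Per-feature mean of a list of feature vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def predict(features, centroids):
    """Pick the label whose centroid is closest to the feature vector."""
    return min(centroids, key=lambda label: squared_distance(features, centroids[label]))


def leave_one_speaker_out_accuracy(data):
    """Train on all speakers but one, test on the held-out speaker, and repeat."""
    speakers = sorted({s for _, _, s in data})
    correct = 0
    for held_out in speakers:
        train = [(f, y) for f, y, s in data if s != held_out]
        test = [(f, y) for f, y, s in data if s == held_out]
        centroids = {
            label: centroid([f for f, y in train if y == label])
            for label in {y for _, y in train}
        }
        correct += sum(predict(f, centroids) == y for f, y in test)
    return correct / len(data)


if __name__ == "__main__":
    data = make_synthetic_data()
    accuracy = leave_one_speaker_out_accuracy(data)
    print(f"speaker-independent accuracy: {accuracy:.2f}")
```

A multi-speaker setup would instead mix utterances from every speaker into both the training and test sets, which is an easier task because the classifier has already heard each voice.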
The team reports that the classifiers performed well and were able to generalize to new speakers without rejiggering, which they say is a promising sign that a person’s degree of chattiness can be detected reliably. They leave to future work detecting ranges of chattiness, expanding the participant pool, and folding in video and depth data to measure the positivity (or negativity) of reactions to the responses.
“We have shown that user opinion of the likability and trustworthiness of a digital assistant improves when the assistant mirrors the degree of chattiness of the user, and that the information necessary to accomplish this mirroring can be extracted from user speech … People are able to engender trust and camaraderie through behavioral mirroring, where conversational partners mirror one another’s interaction style as they negotiate to an agreed-upon model of the world,” wrote the researchers. “Anecdotal evidence from comments in the post-study debrief suggest that participants prefer the assistant in the mirroring conditions. [We] conclude that chattiness preferences differ across individuals and across task domains, but mirroring user chattiness increases feelings of likability and trustworthiness in digital assistants.”
The work could lay the groundwork for an improved Siri, of whose limitations Apple is well aware. Progress was made in June, which saw the debut of a neural text-to-speech model that delivers a more natural-sounding voice without the use of samples. And in a recent research paper on the preprint server arXiv.org, a team of Apple scientists described an approach for selecting training data for Siri’s domain classifier that led to a substantial error reduction with only a small percentage of examples.