What are the main services?
The Chat Room uses Tonality AI's capabilities for Alexa integration and text generation, together with a Large Language Model (LLM) to generate the content. For any given session, we use Meta's Llama, Anthropic's Claude, or OpenAI's GPT model. For Alexa devices with screens, we may also fetch a Wikipedia image of the celebrity when one is available.
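As a rough illustration of how that per-session model choice and the image lookup might be wired up, here is a minimal Python sketch; the provider list, the per-session seeding, and the use of Wikipedia's REST summary endpoint are assumptions for illustration rather than a description of our production code.

```python
import random
from typing import Optional

import requests

# Illustrative provider list; the source only says one of Llama, Claude, or GPT
# backs any given session.
PROVIDERS = ["llama", "claude", "gpt"]


def pick_provider(session_id: str) -> str:
    """Choose one LLM provider and stick with it for the whole Alexa session."""
    rng = random.Random(session_id)  # deterministic per session id (assumed strategy)
    return rng.choice(PROVIDERS)


def wikipedia_thumbnail(celebrity: str) -> Optional[str]:
    """Fetch a celebrity image URL from Wikipedia's REST summary endpoint, if one exists."""
    title = celebrity.replace(" ", "_")
    resp = requests.get(f"https://en.wikipedia.org/api/rest_v1/page/summary/{title}", timeout=2)
    if resp.ok:
        return resp.json().get("thumbnail", {}).get("source")
    return None
```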
How are you using the LLM?
We try to make the chat as convincing and as entertaining as possible, not just by delivering realistic answer content (which most LLMs are generally good at, up to a point), but also by providing those answers in a tone of voice appropriate to the celebrity. This is crucial to the perceived authenticity of the simulation, and means crafting a rigorous 'system prompt' that primes the LLM to reply in the right language and register.
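For a flavour of what we mean, here is an illustrative sketch of such a priming prompt; the exact wording is invented, not our production prompt.

```python
def build_system_prompt(celebrity: str) -> str:
    """Prime the LLM to answer in character; the wording below is illustrative only."""
    return (
        f"You are {celebrity}. Answer the user's questions in the first person, "
        f"using vocabulary, catchphrases and sentence rhythms typical of {celebrity}. "
        "Keep each reply short enough to be read aloud comfortably, "
        "and keep the content family-friendly."
    )
```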
Part of the trade-off with a model such as GPT is that its answers are generally 'safe' in terms of content. This is a huge plus: it is crucial for a positive family user experience. But it can also take the edge off a character's tone of voice (try getting even Gordon Ramsay to swear!) and can sound formulaic over multiple responses. Improving our prompts to address this, while keeping the content family-appropriate, is an area of active investment.
To keep the conversation coherent over multiple turns, we retain the most recent turns of the conversation until the user ends the session or switches chat partner. This means maintaining a window of conversation state and supplying it to the LLM as part of the prompt set.
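A minimal sketch of that sliding window, assuming an illustrative limit of six retained turns:

```python
from collections import deque

MAX_TURNS = 6  # assumed window size; the real limit depends on cost and latency budgets


class ConversationWindow:
    """Keep only the most recent turns and emit them as chat messages for the LLM."""

    def __init__(self, system_prompt: str):
        self.system_prompt = system_prompt
        self.turns = deque(maxlen=MAX_TURNS)  # oldest turns fall off automatically

    def add(self, role: str, text: str) -> None:
        self.turns.append({"role": role, "content": text})

    def as_messages(self) -> list:
        # System prompt first, then the retained dialogue state.
        return [{"role": "system", "content": self.system_prompt}, *self.turns]
```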
These functions of the prompt set lead to another trade-off: quality of the conversation versus cost and latency of the API. Better prompts typically use more words, but this increases both the monetary cost and the latency of each LLM API call. Response latency is particularly important for us, because Alexa only gives us 8 seconds to respond to the user (if we don't respond in time, the skill ends abruptly for the user).
So our prompts have to encode the 'priming' instructions and the dialogue state leanly, and reply length must be similarly constrained to mitigate cost at scale and the risk of timeouts.
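In practice that also means putting a hard deadline on the LLM call, well inside Alexa's 8-second budget. The sketch below illustrates the idea; the five-second deadline, the fallback line, and the call_llm callable are assumptions, not our actual values.

```python
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout

LLM_DEADLINE_S = 5.0  # leave headroom inside Alexa's 8-second budget (assumed value)
FALLBACK_REPLY = "Sorry, I lost my train of thought there. Ask me that again?"


def answer_within_budget(call_llm, messages):
    """Run the LLM call with a hard deadline so the skill never misses Alexa's timeout."""
    pool = ThreadPoolExecutor(max_workers=1)
    future = pool.submit(call_llm, messages)
    try:
        return future.result(timeout=LLM_DEADLINE_S)
    except FutureTimeout:
        return FALLBACK_REPLY
    finally:
        pool.shutdown(wait=False)  # don't block on the slow call; return the fallback now
```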
Our current system can only imitate a celebrity at the level of words and phrasing. We do not produce audio with the celebrity's actual vocal characteristics (although we do try to select a 'least wrong' Text-to-Speech engine based on the gender and nationality of the celebrity). But we are excited by emerging voice-cloning solutions (custom speech synthesis engines trained on audio from a particular speaker) that may enable this capability.
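One plausible way to make that 'least wrong' selection is to map coarse celebrity attributes onto the Amazon Polly voices that Alexa exposes through SSML; the mapping table below is illustrative, not our real one.

```python
# Map coarse celebrity attributes to Amazon Polly voices that Alexa can render via SSML.
# The table is illustrative; a real mapping would cover more accents and voices.
POLLY_VOICES = {
    ("male", "british"): "Brian",
    ("female", "british"): "Amy",
    ("male", "american"): "Matthew",
    ("female", "american"): "Joanna",
}


def pick_voice(gender: str, nationality: str) -> str:
    """Return the closest available voice, falling back to a default."""
    return POLLY_VOICES.get((gender.lower(), nationality.lower()), "Matthew")


def as_ssml(reply: str, voice: str) -> str:
    """Wrap the reply so Alexa speaks it with the chosen voice."""
    return f'<speak><voice name="{voice}">{reply}</voice></speak>'
```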
What about the Alexa part?
For recognizing what the user is asking, we built 'intents' and 'slots' in the Alexa Skills Kit, plus handlers to convert them into LLM calls and/or user responses. The intents and slots specify the words and phrases people use when talking to Alexa, so in our case they cover the most likely people-entities to be requested and questions to be asked (the Alexa system requires these to be specified, since it will only pass speech to the skill that matches specific intents and slots). This process was non-trivial, since no existing dataset was available, though it's fair to say that the LLMs themselves are not unhelpful in generating a first cut at entity and chat data!
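For a flavour of what an intent and slot look like, here is an illustrative fragment expressed as a Python dict; the intent name, slot and sample utterances are invented for the example.

```python
# Illustrative fragment of an interaction model intent, expressed as a Python dict.
# The intent name, slot and sample utterances are invented; the real model is larger.
ASK_CELEBRITY_INTENT = {
    "name": "AskCelebrityIntent",
    "slots": [{"name": "topic", "type": "AMAZON.SearchQuery"}],
    "samples": [
        "what do you think about {topic}",
        "tell me about {topic}",
        "how do you feel about {topic}",
    ],
}
```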
The voice user experience is designed to allow simple switching of chat partners, and our responses try to 'prime' the user to ask questions (rather than give statements, which, as a set of inputs, is too diverse for our intents to be able to recognize). Interestingly, this is not dissimilar in concept to the manner in which we prime the other side of the conversation with the LLM.
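Putting the pieces together, a request handler for such an intent might look roughly like the sketch below (using the ask-sdk-core library for Python). Here generate_reply is a stand-in for the LLM layer described above, and the closing question is the nudge that primes the user to keep asking questions.

```python
from ask_sdk_core.dispatch_components import AbstractRequestHandler
from ask_sdk_core.handler_input import HandlerInput
from ask_sdk_core.utils import is_intent_name
from ask_sdk_model import Response


def generate_reply(topic: str) -> str:
    # Stand-in for the LLM call (system prompt + conversation window) described above.
    return f"Ah, {topic}. Where do I even start?"


class AskCelebrityIntentHandler(AbstractRequestHandler):
    """Sketch of turning a matched intent into an LLM call and a question-priming reply."""

    def can_handle(self, handler_input: HandlerInput) -> bool:
        return is_intent_name("AskCelebrityIntent")(handler_input)

    def handle(self, handler_input: HandlerInput) -> Response:
        slots = handler_input.request_envelope.request.intent.slots or {}
        topic = (slots.get("topic").value if slots.get("topic") else None) or "yourself"
        reply = generate_reply(topic)
        # End with a nudge so the user asks another question rather than making a
        # free-form statement that the interaction model can't match.
        return (handler_input.response_builder
                .speak(f"{reply} What else would you like to ask me?")
                .ask("Go on, ask me another question.")
                .set_should_end_session(False)
                .response)
```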