Speaker A
Can I build the AI assistant featured in a sci-fi movie using today's technology? This video is a compressed version of a process that took three weeks in total. Eleven years ago, I watched a movie that deeply affected me: Samantha, the AI in the movie Her, fascinated me. Since then, I've watched the movie a few times and always thought something like her would enter our lives soon. Of course, even now, this is incredibly difficult to achieve, because the AI assistant in that movie both possessed emotions and could connect with human emotions. How could I possibly build something like that?
Let me briefly talk about the movie without giving away any spoilers. Her is a film that explores the effects of artificial intelligence on us from an emotional and romantic perspective. It is set in the near future: after a computer update, the main character, Theodore, gets a human-like AI assistant. This assistant is so advanced that she possesses human emotions, and after long conversations with Theodore, the two form a deep bond.
So here is my plan: taking Samantha as a baseline, I'm going to try to build the absolute best version I can using today's technology. My goal here is to close the gap between humans and AI just a little bit more, and to end up with the best possible voice assistant. So yes, let's get started.
First, I took a look at existing voice assistants and chatbots to get some ideas for the assistant I wanted to build. I searched for an AI similar to Samantha and scoured the internet to see if anything like this had been done before. Yes, there were very basic chatbots, and of course we also have Siri and the like. I expanded my search a bit and looked up "AI girlfriend". Naturally, quite a few options popped up. I found a site called Replika. I analyzed the dialogues between users and the AI there to see whether they were engaging, and honestly, it felt very robotic. Obviously, my goal here isn't to build a girlfriend. I'm just analyzing what's out there.
My goal is to build an assistant: not a romantic partner, but an assistant that can analyze emotions and craft its responses accordingly. When I first set out on this path, I had almost no idea how I was going to do it, and I was completely unaware of the arduous process I was about to enter. Still, I got to work immediately.
Right now, I have two main paths ahead of me: first, creating the voice of the AI that will be our assistant, and second, building its core personality. Let's start with the voice. Since I'm sticking to the movie here, I started by recreating the voice of the character Samantha. I'll be using this voice, originally performed by our dear Scarlett Johansson, for the sake of science.
Now I need a TTS model. But what exactly is TTS? It stands for text-to-speech, meaning converting text into spoken words. The plan is this: we need to vocalize the responses coming from our AI. Of course, I had to do this in the best way possible, as close to the movie as I could.
I have two options for TTS. The first is ElevenLabs, the company that is currently the absolute best in the world at AI voice synthesis. But this path is paid, has latency issues, and doesn't give me much control. The second option is a voice synthesizer that runs locally on my computer's GPU and offers far more control. There are actually some really great voice cloning libraries and models that can run locally. However, most of them don't quite fit my purpose, and the reason is latency. Why is latency so important to me? Because if I'm going to have a voice assistant, I need to keep the voice synthesis delay as short as possible.
So I started researching again. I tried a few different models, but with most of them, either the latency was too high or I just didn't like the voice synthesis. [TTS sample plays; unintelligible] Then I found the XTTS v2 library, which would be the most suitable for me.
The latency was very low, it could clone voices, and it synthesized speech at a pretty decent level. First, to see what it could do after installing it on my computer, I used Samantha as the reference voice and had it read a text. [Turkish TTS sample plays; unintelligible] This wasn't exactly like Samantha. It sounded more like Auntie Pakiza, who's been chain-smoking for 30 years.
Unfortunately, since Turkish was added to the XTTS model later, it had a lower character limit. It couldn't voice five to ten sentences in one go, and even if it could, the latency was very high. I had to get around this limit; otherwise, whenever the responses from the API were long, the system would crash. Both to solve this problem and to run my voice experiments more smoothly, I wrote a small Python script that kept the XTTS model running continuously. While the program was running in the terminal, it would take the reference voice as a baseline and vocalize whatever text was entered.
Here is how I solved the issue of not being able to vocalize long texts. Since we don't know exactly how many words or sentences our assistant will speak, I set it up to split the incoming text sentence by sentence and vocalize each piece. Every time it sees a period, it cuts the sentence there and moves on to the rest. I did this sequentially so that when a response comes in, it is vocalized immediately, without waiting for the whole text. It takes the first sentence of the incoming response and vocalizes it very quickly; while we are listening to it, it synthesizes the next sentence and appends it right behind the audio we are currently hearing. So from the outside, we get a highly responsive voiceover that starts speaking immediately. Everything happens in the background without us even noticing. Give it a book and it could read it out loud until morning. [Turkish TTS sample plays; unintelligible] I solved this in a very practical way.
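The sentence-by-sentence approach described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the actual script from the video: `split_sentences`, `speak`, and the `max_chars` value are hypothetical names and settings, and `synthesize` and `play` are placeholder callables standing in for whatever TTS model and audio output are actually used. The sketch also splits on `!` and `?` in addition to periods, and hard-wraps any sentence that is still too long.

```python
import queue
import re
import threading


def split_sentences(text, max_chars=200):
    """Split text after sentence-ending punctuation, then wrap any
    sentence longer than max_chars at the last space before the limit."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    chunks = []
    for sentence in sentences:
        while len(sentence) > max_chars:
            cut = sentence.rfind(" ", 0, max_chars)
            if cut <= 0:  # no space found: cut at the hard limit
                cut = max_chars
            chunks.append(sentence[:cut].strip())
            sentence = sentence[cut:].strip()
        if sentence:
            chunks.append(sentence)
    return chunks


def speak(text, synthesize, play, max_chars=200):
    """Synthesize the next chunk in a worker thread while the current one plays.

    synthesize(chunk) -> audio and play(audio) are supplied by the caller
    (assumed interfaces, not a real library API)."""
    audio_queue = queue.Queue(maxsize=2)  # small buffer: stay just ahead of playback
    done = object()  # sentinel marking the end of the stream

    def producer():
        try:
            for chunk in split_sentences(text, max_chars):
                audio_queue.put(synthesize(chunk))
        finally:
            audio_queue.put(done)  # always unblock the consumer

    worker = threading.Thread(target=producer, daemon=True)
    worker.start()
    while True:
        audio = audio_queue.get()
        if audio is done:
            break
        play(audio)  # blocks while audio plays; the worker keeps synthesizing
    worker.join()
```

With this shape, playback can begin as soon as the first sentence is ready, and the queue keeps the pieces in order, which matches the "append it right behind the audio we are currently listening to" behavior described above.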
Now it was time to work on Samantha's voice. I took audio clips from a few different parts of the movie and fed them in as the reference voice. But none of the results satisfied me. It sounded like it was reading from a book rather than actually speaking. [TTS sample plays; unintelligible] The reason is that the datasets these models are trained on consist entirely of audiobook narrations. I even found the dataset XTTS was trained on: "The examination and testimony of the experts." Yes, as you can see, it's an audiobook narration. Of course, these kinds of emotionless voices are recorded that way on purpose, to achieve more stable results. It's perfectly normal for it to be like this, but it wasn't what I wanted. Ours needed to be different from ordinary voice cloning.
So I decided to train the XTTS model. I was going to fine-tune it. What is fine-tuning? It means making fine adjustments: we take an AI model as a base, teach it a little more on top of what it already knows, and tell it, "Look, you can do this better like this." For this, I created a dataset from movie scenes I found on YouTube. I completely removed Theodore and left only Samantha's voice. I also left in the breathing pauses and speech stumbles so that the model we are training could imitate them.
I'll use Google Colab for training. Colab is a cloud-based platform where you can train these kinds of models. If your graphics card isn't powerful enough, it runs stably and lets you get the job done very quickly. It also lets you run code you wrote directly there, without downloading libraries and whatnot to your computer. Here we are performing our operations using the muscle power of Uncle Google, and it's quite practical.
I trained the model on the dataset I created for six epochs and looked at the results. [TTS sample plays; unintelligible] Yes, it had come together a bit more, but it wasn't exactly what I wanted. I trained it in a few different ways.
The results weren't bad, but since I wasn't going to settle for this, I bought the Pro membership on Google Colab, and then I went through the entire movie and extr