I see a demand for some kind of guide for changing in-game voices through AI, so I decided to describe the process I used in creating my mods.
Beforehand, I want to apologize to English-speaking users if you find some mistakes in this article, as English isn't my native language.
Intro
First of all, there are two fundamentally different approaches to using AI here: speech-to-speech (STS) and text-to-speech (TTS). With the first, the AI takes an existing recorded voice and alters it with a voice model, usually trained on some other voice. With the second, the AI takes some text and tries to voice it using a voice model. In this article I'll cover both approaches as applied to my Female Vasco Voice mods, which come in several voice types created using both of these techniques.
Also, there are many different AI TTS and STS tools you can use. I'll cover only the few I used in my mods; progress doesn't stop, and sooner or later these will become obsolete.
Tools mentioned in this guide:
- Retrieval-based-Voice-Conversion-WebUI (RVC): repository
- Bethesda Archive Extractor: link
- Wwise Audio Unpacker: repository
- Wwise launcher: link
- PowerToys (PowerRename): link
- xTranslator: link
- ElevenLabs: link, Python API
- OpenAI Whisper: repository
- vgmstream: link
- Audacity: link
- My Python scripts: link
- XML to JSON dialogue script: link
Method 1. STS
TL;DR:
1. Extract .wem files with BAE
2. Transform them with RVC
3. Convert everything to .wem and put it into \Data
This method is the easiest one and lets you partially keep many traits of the original voice, such as intonation and accent. In a sense, it melds two voices together: the original one and the one from the voice model.
Prerequisites:
- Register on the Audiokinetic site and download Audiokinetic Launcher.
- Start Audiokinetic Launcher (named Wwise Launcher after installation) and install Wwise on the Wwise tab; you can uncheck all plugins.
- Install RVC; I recommend downloading and extracting a complete package from the releases page.
- Download RVC models from Hugging Face, the AI Hub Discord channel (find it in Google) or any other place. You can also train your own, but that's for another guide.
- Place the .pth voice model in the RVC/weights folder and the .index file in the RVC/logs folder.
- Install PowerToys; we'll need the PowerRename tool from it.
Steps:
1. Unpack the necessary audio files from the "sound/voice/starfield.esm" folders inside the "Starfield - Voices01.ba2", "Starfield - Voices02.ba2" and "Starfield - VoicesPatch.ba2" archives; you can use Bethesda Archive Extractor for this. For example, Vasco's voice lines are located in the "sound\voice\starfield.esm\robotmodelavasco" directory, and Sarah's in "sound\voice\starfield.esm\npcfsarahmorgan". All game audio is stored in .wem format.
2. Convert the unpacked .wem files to .ogg using Wwise Audio Unpacker (put all .wem files in the "Game Files" folder and launch "WEM to OGG.bat"; the resulting files will be placed into the "Result" folder).
3. Launch RVC-WebUI, select the voice model in the "Inferencing voice" list on the "Model Inference" tab, then scroll to the bottom and adjust the settings:
• Transpose: leave at 0 for most cases, or change the number to change the pitch. I don't recommend going beyond the -6 to 6 range; it's better to pick a different voice model.
• Specify output folder: self-explanatory.
• Extraction algorithm: I recommend using rmvpe
• Median filtering: default one is ok, but test it for yourself
• Index path: select the .index file that was bundled with the voice model
• Search feature ratio: for Vasco I set this to 0 and for human characters to 0.75, but you can find your own sweet spot.
• Resample: 0
• Volume envelope scaling: higher values help to mask oddities when altering quiet voice, whispering, sighs, breathing etc. Set this to 0.25 if the original voice lines have few of those, or to 0.75 otherwise.
• Protect voiceless consonants: set to 0
• Audio folder to be processed: the folder with your .ogg files from the previous step
• Export file format: wav
Then click the Convert button at the bottom of the page and wait. The speed will depend on your GPU.
4. Convert the .wav files to .wem format. This is somewhat complicated to set up the first time, so I'll break it down:
• Launch Wwise through Wwise Launcher, create a new project with any name and untick everything in "Import Factory Assets".
• Go to Project > Project Settings > Source Settings and set "Default Conversion Settings" to "Vorbis Quality High".
That's unrelated to this guide, but please note that although all dialogue files seem to be Vorbis, some sound files must be encoded as PCM.
The rule is simple: keep the same encoding as the original .wem file had.
• Go to Project > Import Audio files > Add Folders, add the folder with your .wav files from step 3 and click Import. If the Import Conflict Manager pops up, go to "%USERPROFILE%\Documents\WwiseProjects\your-project-name\Originals\SFX", delete everything inside it, then repeat this step.
• Go to Project > Convert All Audio Files and click "Convert". The resulting files will be in "%USERPROFILE%\Documents\WwiseProjects\your-project-name\.cache\Windows\SFX\name-of-import-folder"
5. Go to the output folder in Explorer, right-click on empty space and launch PowerRename, then rename all files so their names are identical to the original ones, e.g. "0047cc70.wem"
6. Place all the files in the same directory structure as the original ones in .ba2 archives, for example "Starfield\Data\sound\voice\starfield.esm\<your-modded-npc-name>".
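If you'd rather script steps 5 and 6 than rename files by hand, here is a minimal Python sketch. It is not part of my scripts, and it rests on an assumption: that a converted file either already has the original name or carries an extra "_something" suffix after it. The function name place_converted_files and the suffix pattern are hypothetical, so check what your Wwise output actually looks like first.

```python
from pathlib import Path
import shutil


def _matches(converted_stem, original_stem):
    # Assumption: converted files are named "<original>" or "<original>_<suffix>".
    return converted_stem == original_stem or converted_stem.startswith(original_stem + "_")


def place_converted_files(converted_dir, original_names, target_dir):
    """Copy converted .wem files into target_dir under their original names.

    Returns (placed, missing): names that were copied and names for which
    no converted file was found.
    """
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    converted = sorted(Path(converted_dir).glob("*.wem"))
    placed, missing = [], []
    for name in original_names:
        stem = Path(name).stem.lower()
        match = next((f for f in converted if _matches(f.stem.lower(), stem)), None)
        if match is None:
            missing.append(name)
            continue
        # Copy instead of move, so the Wwise cache stays untouched.
        shutil.copy2(match, target / name)
        placed.append(name)
    return placed, missing
```

Call it with the Wwise cache folder, the list of original file names and the final mod folder; anything returned in the second list still needs to be converted or renamed manually.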
That's all, you've just created your own AI voice mod. Congratulations!
Method 2. TTS
TL;DR:
1. Extract dialogues with xTranslator
2. Voice them with the AI tool of your choice
3. Find missing voice files and transcribe vanilla versions with OpenAI Whisper
4. Repeat step 2
5. Convert everything to .wem and put it into \Data
This is a trickier one. The advantage of this approach is also its disadvantage: you'll get a completely new voice with unique intonations, speaking speed, accent and everything else. In some cases, like my mod for Vasco, the benefits may outweigh the drawbacks for some, as many people didn't like original Vasco's retrofuturism-esque way of speaking. But at the same time we lose all of the original voice actor's performance, which an AI can quite rarely outperform, at least for now.
Also, the process isn't as straightforward as in the STS method, as you will see, so I'll have to skip some minor details. To perform this method, you may need some basic coding knowledge.
So, let's start:
- Launch xTranslator and go to Options > Dictionaries and languages, then set Source Language: en, Destination Language: en
- Go to File > Load Esp/Esm and open Starfield.esm
- Click NPC/Fuz Map.
- Go to File > Export Translation > XML files, check Export Fuz Data and Everything and click OK
Now you have an XML file with all (not really) dialogues in the game and the corresponding .wem files. Next I suggest you transform it to JSON with a structure similar to this:
{
"audio.wem": "Blah-blah-blah.",
"someotheraudio.wem": "Whatever."
}
You can do this with the Python script kindly provided by ice9000; he also included already converted JSON files for some NPCs. My own ready-to-use JSON with Vasco's dialogues is available here (the transcription by Whisper described later is included).
The next step is to voice the dialogues with the TTS tool of your choice. I recommend using ElevenLabs as it produces the best results at the moment, even though it requires a paid subscription if you want to generate more than 10000 characters. Some free alternatives which can run locally on your PC are Tortoise, Bark and Silero. Edge-tts is also not bad; it uses Microsoft's online TTS services, but I don't know what its free usage limits are.
If you're sticking with ElevenLabs, let's continue.
- Install Python from the official site
- Install the ElevenLabs Python API by executing pip install elevenlabs in a terminal.
Here you may use my Python scripts. If you do, open ElevenLabs.py with Notepad and fill in your API key and the name of the voice model you're planning to use, and optionally adjust the generation settings.
- If you're happy with your JSON dictionary and script, run it with python ElevenLabs.py and wait; by default my script puts the generated .wav files in the "elevanlabs-output" folder. If you get connection problems, try increasing the wait time between requests at the very last line of the script.
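To show the general shape of such a generation loop, here is a minimal sketch. It is not my ElevenLabs.py: the actual TTS request is deliberately left as a placeholder function synthesize(text) that you would implement with whatever client you use, and the names generate_all, load_lines, wait_seconds and retries are all hypothetical. It also assumes that connection problems surface as Python's built-in ConnectionError, which may not be true for a real client library.

```python
import json
import time
from pathlib import Path


def load_lines(json_path):
    """Load the {"file.wem": "text"} dictionary described above."""
    return json.loads(Path(json_path).read_text(encoding="utf-8"))


def generate_all(lines, synthesize, out_dir, wait_seconds=1.0, retries=3, sleep=time.sleep):
    """Voice every entry and write <name>.wav files into out_dir.

    synthesize(text) -> bytes is a placeholder for the real TTS call.
    Files that already exist are skipped, so an interrupted run can resume.
    Returns the file names that still failed after all retries.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    failed = []
    for wem_name, text in lines.items():
        wav_path = out / (Path(wem_name).stem + ".wav")
        if wav_path.exists():
            continue
        for attempt in range(1, retries + 1):
            try:
                wav_path.write_bytes(synthesize(text))
                break
            except ConnectionError:
                # Wait a little longer after each failed attempt.
                sleep(wait_seconds * attempt)
        else:
            # The inner loop never hit "break": every attempt failed.
            failed.append(wem_name)
        # Pause between requests to stay under rate limits.
        sleep(wait_seconds)
    return failed
```

Skipping existing files matters in practice: with a couple of thousand lines, you don't want to pay for the same characters twice after a dropped connection.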
TA-DAH! All voice lines are generated! Or are they?
While making the Female Vasco Voice 2 (TTS) mod I noticed that xTranslator exported 933 lines for Vasco, while in reality there are 2052 of them in the game files... Until Creation Kit comes out I don't know how to check whether all of them are really used in the game, but at least a bunch of them, the ones responsible for Vasco calling you by your character's name, are absent from xTranslator's export. So here I had to use OpenAI Whisper to transcribe the remaining .wem files. Of course, the ones which contain only "Captain Sexy" (sexy.wem) or "Captain Boobies" (boobies.wem, real filenames, by the way!) don't need this: just create a JSON with the filenames, put "Captain" before the names and you're good. You can get a list of all files in a folder by running the dir /b /a-d command in a terminal.
Anyway, you'll have to convert the .wem files to .ogg as described in the previous section, and then go on.
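Finding the files that xTranslator missed can be scripted too. Below is a minimal sketch under my own assumptions: find_untranscribed and captain_lines are hypothetical helper names, and the name entries are assumed to be single words that only need a capital first letter.

```python
from pathlib import Path


def find_untranscribed(voice_dir, dialogue):
    """Return .wem file names found in voice_dir that are not keys of the JSON dictionary."""
    known = {name.lower() for name in dialogue}
    return sorted(
        p.name for p in Path(voice_dir).glob("*.wem") if p.name.lower() not in known
    )


def captain_lines(file_names):
    """Build entries like {"sexy.wem": "Captain Sexy"} from bare file names."""
    # Assumption: the file stem is the spoken name, written in lowercase.
    return {name: "Captain " + Path(name).stem.capitalize() for name in file_names}
```

Run find_untranscribed on the extracted NPC folder with your loaded JSON, pick out the name-only files for captain_lines, and send the rest to Whisper.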
- Install OpenAI Whisper by executing pip install openai-whisper in a terminal
- Put the files that need to be transcribed into the "whisper-input" folder next to Whisper.py
- Run the transcription by executing python Whisper.py in a terminal; that will generate the whisper_transcription.json file
- Repeat the previous steps to generate voices, but don't forget to adjust ElevenLabs.py to use whisper_transcription.json
When you've finally generated all the voice files, just convert them to .wem as described in the first method, and that's pretty much it. Hope you like the result you got after these 6 hours of struggling.
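For reference, the core of a transcription script can be sketched like this. It is not a copy of my Whisper.py, just an illustration: transcribe_folder, whisper_transcriber and save_transcription are hypothetical names, and the "base" model size is an arbitrary default. The only Whisper calls used are whisper.load_model and model.transcribe, whose result is a dictionary with a "text" key.

```python
import json
from pathlib import Path


def transcribe_folder(input_dir, transcribe, pattern="*.ogg"):
    """Return {"name.wem": "text"} for every audio file matching pattern.

    transcribe(path) -> str can be any function; see whisper_transcriber below.
    """
    result = {}
    for audio in sorted(Path(input_dir).glob(pattern)):
        result[audio.stem + ".wem"] = transcribe(str(audio)).strip()
    return result


def whisper_transcriber(model_name="base"):
    """Wrap openai-whisper as a simple transcribe(path) -> text function."""
    import whisper  # imported lazily: requires pip install openai-whisper

    model = whisper.load_model(model_name)
    return lambda path: model.transcribe(path)["text"]


def save_transcription(mapping, out_path="whisper_transcription.json"):
    """Write the dictionary in the same JSON shape used for TTS generation."""
    Path(out_path).write_text(
        json.dumps(mapping, indent=2, ensure_ascii=False), encoding="utf-8"
    )
```

Typical use would be save_transcription(transcribe_folder("whisper-input", whisper_transcriber())). Whisper does make mistakes on short or robotic lines, so skim the resulting JSON before generating voices from it.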
Extra
After transforming/generating voices with AI and before converting them to .wem, you might want to do some post-processing; I recommend using the free Audacity or Adobe Audition for this. A robotic voice can be achieved with some echo/chorus/flanger, and muffling by EQing down the upper and lower frequencies. This isn't an audio processing guide, so seek help on YouTube.
Protip: you can use the vgmstream plugin for foobar2000 to listen to .wem files without converting them.
Also, vgmstream-cli is a good alternative for converting from .wem files, as it can convert both Vorbis and PCM data. You can find my wrapper Python script for it here.
Don't forget to ask voice actors for permission to use their voice! Well, technically it isn't entirely "their" voice, and of course there's a grey area in copyright law regarding AI, but it would be good manners to do so. Also, Nexus policy is that your mod will be deleted if the owner of the voice files a complaint, so keep that in mind.
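A wrapper around vgmstream-cli can be very small. The sketch below is not my linked script, only an illustration of the idea: it assumes vgmstream-cli is on your PATH and uses its -o option to set the output file; vgmstream_command and convert_folder are hypothetical names.

```python
import subprocess
from pathlib import Path


def vgmstream_command(wem_path, out_dir, exe="vgmstream-cli"):
    """Build the command line that decodes one .wem file to .wav."""
    wav_path = Path(out_dir) / (Path(wem_path).stem + ".wav")
    return [exe, "-o", str(wav_path), str(wem_path)]


def convert_folder(in_dir, out_dir, exe="vgmstream-cli", run=subprocess.run):
    """Decode every .wem in in_dir to .wav in out_dir; returns the converted names."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    converted = []
    for wem in sorted(Path(in_dir).glob("*.wem")):
        # check=True raises an error if vgmstream-cli reports a failure.
        run(vgmstream_command(wem, out_dir, exe), check=True)
        converted.append(wem.name)
    return converted
```

The run parameter exists only so the loop can be tried out without the real executable; in normal use you'd just call convert_folder("input", "output").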
Big thanks to Nojioh for his assistance with extracting dialogues, check his Vasco Japanese Female Voice mod.
Also, check my profile for my own AI voice mods done with the described techniques.
And of course, go make some awesome mod!
—
Please link to this article if it helped you make your mod. Not necessary, but highly appreciated.
—
If you want to support my work you can send me a tip on boosty.to
25 comments
My Starfield playthrough now has the following cast :D
Sarah - Jennifer English (Shadowheart from BG3)
Barret - Morgan Freeman
Andreja - Ariana Grande
Sam Coe - Brad Pitt
Mateo - Jack Black
Walter - Troy Baker (Joel Miller from Last of Us)
VASCO - Steve Carell
What a time to be alive lol
That would be my biggest and only need for this? To AI voice the text responses in dialogue
I've created an alternative one on how to train your model using XVATrainer.
What do you think?
help with this point, it doesn’t work, I receive a file but there is no data on the NPC so that I can run it through the script, post a video on how to do it, I’m already getting hysterical with this, please help, I can’t sleep because of this
edit: I forgot to mention that it's worth pointing out you will need a free license to use Wwise, otherwise you need to pay for it. Without a license you can only convert 200 files, and all the main characters have thousands of lines that need to be converted. You can get a free license by applying for one on their site.