Cyberpunk 2077

File information

Created by

Dan Ruta

Uploaded by

DanRuta

About this mod

xVASynth is an AI tool for generating high-quality voice acting lines using voices from video games. The app supports hundreds of voices, across dozens of games, and provides pitch, duration, and energy control at per-letter granularity.

Download the main app from the Skyrim page.  Main app now also on Steam!

You can now train your own voices: xVATrainer
List of voices available for xVASynth, from both myself and the community: Google doc link
You can submit models at the following link, if you train them with xVATrainer: Google forms link


Quick intro


xVASynth is an AI based app for creating new voice lines using neural speech synthesis. The app loads models individually trained on character voice data from games. The app gives users control over details such as pitch and durations of individual letters to provide control over emotion and emphasis. To see it in action, watch these short intro/tutorial videos, narrated by various supported voices:

 
 
Supported games

Discord: https://discord.gg/nv7c6E2TzV
Patreon: https://www.patreon.com/xvasynth
Twitter: @dan_ruta


Preface: The tool does not re-distribute any game assets, nor does it interact with them in any way. Game assets are used only during voice training as a reference, to guide the algorithm to a point where it can create voices that sound similar enough to the examples. Think about it as an automated digital impersonator. Regardless, avoid using the tool in an offensive/explicit manner. Where you can, make it obvious in descriptions that the voice samples are generated and are not from real human voice actors. Any issues you cause with this are on you.


Introduction

xVASynth (or [CP]VASynth, for [Cyberpunk] voices) is an AI app that generates voice acting lines using specific voices from video games. It can do text-to-speech (TTS) from text input, or speech-to-speech (S2S) from audio input. The app uses FastPitch [1,2] models, which give users artistic control over pitch, duration, and energy values for every letter in the audio. They also allow generating audio with explicitly defined pronunciation via ARPAbet [3] notation.



The use of neural speech synthesis leads to natural-sounding voices, which is very difficult to achieve with more traditional methods involving concatenations of existing data. It also means new vocabulary can be generated, outside of what the voice actors have already read out.

Speech to speech

The app can also do speech-to-speech, rather than text-to-speech. In this mode, you can provide a reference dialogue line, and have the app try to infer all the pitch/energy/duration values from the audio, for each text character. You can provide the exact text transcript of the reference audio in the input textarea, or you can leave it blank to have the app try to infer the text also. You can provide a reference audio line by recording with your microphone (by clicking the icon), or you can drag+drop an audio file onto the icon. You must first select an INPUT voice model, which must sound as similar as possible to the reference audio, and it must be a v2 model.  


ARPAbet pronunciation

You can specify exact pronunciation for words by using ARPAbet notation between { } brackets in the input, or by managing words in your own (or other people's) dictionaries. Included is CMUdict with 135k words with American-English pronunciations.
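As a rough illustration of the { } notation described above, the sketch below swaps chosen words in a line for ARPAbet blocks before the text is pasted into the app. The helper name `with_arpabet` and the phoneme string for "choom" are assumptions made up for this example, not part of xVASynth or of any shipped dictionary.

```python
def with_arpabet(text, pronunciations):
    """Replace whole words in `text` with {ARPAbet} blocks.

    `pronunciations` maps a lowercase word to an ARPAbet phoneme string.
    Both the helper and the mapping are hypothetical, for illustration only.
    """
    out = []
    for token in text.split(" "):
        # Split off trailing punctuation so "choom." still matches "choom".
        core = token.rstrip(".,!?;:")
        tail = token[len(core):]
        phonemes = pronunciations.get(core.lower())
        out.append("{" + phonemes + "}" + tail if phonemes else token)
    return " ".join(out)


if __name__ == "__main__":
    # "CH UW1 M" is an assumed pronunciation, written with standard ARPAbet symbols.
    print(with_arpabet("Hello, choom.", {"choom": "CH UW1 M"}))
```

For words you use often, adding them to a dictionary in the app is likely the more convenient route than rewriting each line.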



Other 3rd party dictionaries you can install into the app include:

xVADict community project - Elder Scrolls edition: https://www.nexusmods.com/skyrimspecialedition/mods/56778
xVADict is a community project to create ARPAbet pronunciation dictionaries, for use in xVASynth. This page contains the dictionary for the unique words found across all Elder Scrolls games.

xVADict - Alphabet Pronunciation: https://www.nexusmods.com/skyrimspecialedition/mods/57439
Adds the English alphabet pronunciation to xVASynth.

Batch Mode

For larger projects, where you need to synthesize a large number of lines, you can use the Batch synthesis mode instead. You can use either a .txt file or a .csv file to batch generate hundreds or even thousands of lines in one go, with parallelization. The pitch/duration/energy editor is sometimes needed to get a line sounding just right, but often it is not, and this is a good way to get an initial pass on lines. Using the GPU is highly recommended for this, as it lets you generate many more lines in parallel (limited by VRAM). You should also check the various settings, such as multi-threading, to get the best possible speed out of this for your system.
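If your lines live in a script or spreadsheet, a small helper can tidy them before batch synthesis. The sketch below assumes the .txt input is simply one line of dialogue per row; that layout, the function names, and the output file name are assumptions for this example, so check the batch panel in the app for the exact format it expects.

```python
from pathlib import Path


def prepare_batch_lines(lines):
    """Collapse whitespace, drop empty lines, and remove exact duplicates (order kept)."""
    seen = set()
    cleaned = []
    for line in lines:
        line = " ".join(line.split())  # also removes stray newlines inside a line
        if line and line not in seen:
            seen.add(line)
            cleaned.append(line)
    return cleaned


def write_batch_txt(lines, out_path):
    """Write the cleaned lines to a UTF-8 text file, one per row (assumed layout)."""
    cleaned = prepare_batch_lines(lines)
    Path(out_path).write_text("\n".join(cleaned) + "\n", encoding="utf-8")
    return cleaned


if __name__ == "__main__":
    # "batch_lines.txt" is a hypothetical file name.
    write_batch_txt(["Wake up, samurai.", "", "  We have a city to burn.  "], "batch_lines.txt")
```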




3D Voice embeddings visualizer

The 3D voice embeddings visualizer is an interactive panel where you can explore in 3D all the voices in the app, as seen by an AI representation learning model, projected down to 3D. There are no axes, and this serves purely as a visualization, to enable voice discovery. You can colour the points by game or gender, and you can enable/disable specific games/voices. You can load a voice by clicking it and then the "Load" button, if it's installed.




Third party plugins


The app supports third-party plugins for either/both the JavaScript front-end (UI) and Python back-end (AI) parts of the app. Plugins are a great way to customise the app to your liking, or to add new functionality to it that would be too niche or too game-specific to add to the base app for everyone. Some example plugins are listed here (let me know if you make anything, and I will add it here):

Voiced Player - xVASynth Fuz Ro Bork plugin: https://www.nexusmods.com/skyrimspecialedition/mods/62944
A plugin to connect xVASynth up to Fuz Ro Bork, enabling xVASynth voices to be used in the Fuz Ro Bork mod.

.lip and .fuz plugin for xVASynth v2: https://www.nexusmods.com/skyrimspecialedition/mods/55605
A plugin to create .lip and (optionally) .fuz files automatically from audio lines generated with xVASynth, in either normal mode or batch mode, with or without multi-threading. DOES NOT NEED THE CK. Works for Skyrim, Fallout 4, Fallout 3, and Fallout New Vegas.

xVASynth plugin - Romanian Language: https://www.nexusmods.com/skyrimspecialedition/mods/50878
A demo plugin for v1.4.0+ of xVASynth, where third party plugins are now supported. This plugin changes the app front-end, swapping the UI language to Romanian. Full developer reference: https://github.com/DanRuta/xVA-Synth/wiki/Plugins





If you are a developer and are interested in developing a plugin, check out the documentation here: https://github.com/DanRuta/xVA-Synth/wiki/Plugins


Nexus API integration


xVASynth has Nexusmods API integration to display what voices are available for updates/download, from any of the nexus pages listed in the "Manage Repos" sub-menu. If you have Nexus Premium, you can also download or batch download voices straight from within the app, and have them installed automatically. 






App installation

You may need to install Microsoft Visual C++ Redistributable if you don't already have it. To install the app, download it and extract it anywhere you'd like (it does not need to be in any game directory). You can optionally download the WaveGlow models (and place the files in ./resources/app/models), if you'd like more options for the vocoder used, but the bespoke HiFi-GAN vocoders included with each voice are almost always the highest quality vocoders, and by far the quickest. Launch the app by double-clicking the xVASynth.exe file. If you have any issues, try running it as admin, but be mindful that Electron on Windows has some issues with drag+drop events when running as Admin.

Important: Make sure you click "Allow" if Windows asks you for permission to run the Python server. I use a local HTTP server to enable communication between the Python code (for the AI models) and the JavaScript code (for the Electron front-end). If there are any issues, check the server.log/app.log files (located next to xVASynth.exe) - there should be an error at the end which I'll need to see for helping with issues.
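Since the useful error is usually at the end of the log, a short script can pull out just the last lines to paste into a support request. This is a minimal sketch that only reads the two log files named above; the function name `tail_log` and the assumption that you run it from the folder containing xVASynth.exe are mine.

```python
from collections import deque
from pathlib import Path


def tail_log(path, n=20):
    """Return the last `n` lines of a log file, or [] if the file does not exist."""
    path = Path(path)
    if not path.is_file():
        return []
    with path.open(encoding="utf-8", errors="replace") as f:
        # deque with maxlen keeps only the final n lines while reading.
        return [line.rstrip("\n") for line in deque(f, maxlen=n)]


if __name__ == "__main__":
    install_dir = Path(".")  # assumption: run from the folder containing xVASynth.exe
    for name in ("server.log", "app.log"):
        print(f"--- {name} ---")
        print("\n".join(tail_log(install_dir / name)))
```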


Voice installation

The recommended way to install voices is through the Nexus API integration. However, if you don't have Nexus Premium membership, or you'd prefer manual installation, you need to download the individual .zip files from the game-specific Nexus pages (such as this one) and extract the voice files into the app directory, at this location: <.exe location>/resources/app/models/<game>, where <game> is the game ID. The voice .zip files already contain the required directory structure, so all you need to do is drag+drop the extracted "resources" folder from the .zip files into the folder where the xVASynth.exe file is (replacing files if prompted).

To confirm, when installing voices, you should see 4 files (a .json, a .pt, a .hg.pt, and a .wav file) all named as the voice you're downloading, in <your xVASynth install directory>/resources/app/models/<game>/   (where <game> is cyberpunk, for models on this page).
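The four-file check above can be automated. The sketch below scans a game's model folder and reports any voice that is missing one of the .json, .pt, .hg.pt, or .wav files; the function name is hypothetical, and the path in the usage section assumes the default install layout described above.

```python
from pathlib import Path

REQUIRED_SUFFIXES = (".json", ".pt", ".hg.pt", ".wav")


def missing_voice_files(game_dir):
    """Map each voice name found in `game_dir` to the required suffixes it lacks."""
    found_by_voice = {}
    for f in Path(game_dir).iterdir():
        if not f.is_file():
            continue
        # Try the longest suffix first so "x.hg.pt" is not read as voice "x.hg" plus ".pt".
        for suffix in sorted(REQUIRED_SUFFIXES, key=len, reverse=True):
            if f.name.endswith(suffix):
                voice = f.name[: -len(suffix)]
                found_by_voice.setdefault(voice, set()).add(suffix)
                break
    return {
        voice: [s for s in REQUIRED_SUFFIXES if s not in found]
        for voice, found in found_by_voice.items()
        if len(found) < len(REQUIRED_SUFFIXES)
    }


if __name__ == "__main__":
    # Assumption: run from the xVASynth install directory, default model location.
    game_dir = Path("resources/app/models/cyberpunk")
    if game_dir.is_dir():
        for voice, missing in missing_voice_files(game_dir).items():
            print(f"{voice}: missing {', '.join(missing)}")
```

An empty result means every voice found in the folder has all four files.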

Important: If you move the app files to a different directory, you MUST update the model paths in the settings, because these folder paths get initialized with the full path (starting from the drive letter) - basically, just make sure the app is looking in the new place where your models are, rather than the old folder. The app also allows you to set a different folder to store your voice models in, rather than nested in your app installation directory. The easier thing to do long-term would be to find somewhere not in your app installation folder to store your models, and set the app file paths to point there.


The voices

For Cyberpunk, the voices trained so far are as follows ("Track" the mod for updates):
  • 🌮 🗲 V (Male)
  •   🗲 V (Female)
  •   🗲 Johnny
  • 🌮 🗲 Judy
  • 🌮 🗲 Panam
  • 🌮 🗲 Claire
  • 🌮 🗲 Alt
  • 🌮 🗲 Delamain
  • 🌮 🗲 Takemura
  • 🌮 🗲 Misty 
  • 🌮 🗲 Placide
  • 🌮 🗲 Jackie
  • 🌮 🗲 Elizabeth
  • 🌮 🗲 Sebastian
  • 🌮 🗲 Kerry
  • 🌮 🗲 Rhino
  •   🗲 River
  •   🗲 Rogue
  •  ☢ 🗲 Evelyn
  •  ☢ 🗲 Haru
  •  ☢ 🗲 Dakota
  •  ☢ 🗲 Gillean
  •  ☢ 🗲 Wakako
  •  ☢ 🗲 Hanako Arasaka
  •  ☢ 🗲 Stanley
  •  ☢ 🗲 Lizzy Wizzy
  •  ☢ 🗲 Meredith Stout
  • [New] ☢ 🗲 Maiko
  • [New] ☢ 🗲 Rachel

Where green text colour represents good quality, yellow means ok quality, and red currently quite bad (will need a good deal of playing with the input to get something good). There are several types of models and variants of models supported by the app, so I will use emojis to try to clearly label what type of model each voice is:

  • 🌮 - This means the data for the voice is pre-trained using Tacotron2 [6], and the sentence structure/composition quality will be high
  • 🗲 - This means the voice comes with a bespoke HiFi [4] vocoder model, meaning the audio quality will be high
  • ☢ - This means the voice model is FastPitch1.1, enabling energy control, speech-to-speech, and ARPAbet pronunciation. Tacotron2 isn't needed for this. (rad icon for RAD-TTS, the built-in alignment mechanism replacing Tacotron2)

Note: To start with, most voice models will be v1.0 FastPitch, but they will eventually all be re-trained with the better v2.0 models with all the new features. I have over 425 voices to get through, so it may take a while.

You can optionally install WaveGlow [5] models from here, for extra vocoder options, but these are much slower, and almost always not as good as HiFi-GAN.

Tips

The most important thing to keep in mind is to make sure to play around with the editor, to get the best quality from the generated lines. If some words/letters sound bad, try changing the pitch/duration/energy values. Tinny artefacts can normally be fixed by slightly shortening the durations of offending letters. If you absolutely can't get it to say it well, and ARPAbet pronunciation doesn't help, try re-wording the line.

Check out the community guide here, where anyone can add their tips/advice for how to get the best quality out of the tool: https://github.com/DanRuta/xvasynth-community-guide
You can also access this from the info (i) menu in the app.

Future Plans

Development time is currently focused on the research+development for v3 models (and the app v3). You can help with, or check, what this will hopefully include on my Patreon page. Now that xVATrainer is out, I will be reducing the rate of training new voices somewhat, in favour of having more resources to use for the research. But I will still be continuing with community polls for new voices every so often, and going through existing voices to upgrade them to newer model versions.





Support

The best support is using the tool, making something cool with it, and letting me know about it! Or spreading the word to anyone that may get some use/fun out of this. Join the Discord server, and let me know if you have any ideas/suggestions, show off something you made, or you just want to chat about all this: https://discord.gg/nv7c6E2TzV

Special thanks:

Caden Black, Thuggysmurf, D0lphin, radbeetle, Max Loef, Cecell, Solstice_, Anshela Asre, Tara_C, flyingvelociraptor, My Best Friend Is A Squid, Bungle Paws, neci, eldayualien, Retlaw83, Trixie, TomahawkJackson, Netherworks, Imogen, CHASE MCKELVY, Leif, ionite, Joshua Jones, CookieGalaxy, Rachel Wiles, TCG, Hellath, sadfer, Jaktt, David Keith vun Kannon, Danielle, EnculerDeTaMere, bourbonicRecluse, GOLOVATRIS, Bob, finalfrog, Buck Crutchfield, Yael van Dok, Vossler, Mikkel Jensen, yic17, AgitoRivers, John Detwiler, Alexandra Whitton, Tako-kun, Caro Tuts, beccatoria, Hammerhead96 ., PConD, Blythe, cramonty, Hazel Louise Steele, Lulzar, Vahzah Vulom, Ryan W, Laura Almeida, Wyntilda, Gorim, Krazon, squirecrow, crash blue, GrumpyBen, Adrilz, Katsuki, Calvin, hairahcaz, FeralByrd, Comical, dog, Althecow, SomeOtherWeirdo, Optimist Vamscenes, 𝖆𝖈𝖊𝖗𝖇𝖎𝖈𝖔𝖓, David, Hawkbar, Katherine Fishwick, John S., Idiotenschnitzel, Michael Gill, Jacob Garbe, NerfViking, Jacob Porter, Hapax, stormalize, Golem, Luckystroker, Tempuc, CAW CAW, Veks, stljeffbb, Zoenna, CDante, GodWaffle, Jarrett Barclay, Hound740, Jack in the Hinter, Royce, HunterAP, pimphat, PTC001, Hector Medima, CinnaMewRoll, Grant Spielbusch, Sean Lyons, Charles Hufnagel, Kirill Akimov, Mister Lyosea, Anthony Crane, Sh1tMagnet

All the amazing donors, anonymous or otherwise.
Adrian Łańcucki for FastPitch and the helpful discussions on GitHub.
All the amazing researchers behind the many tools and models I've used in creating this.


References
     [1] FastPitch - https://arxiv.org/abs/2006.06873
     [2] FastPitch 1.1 - https://arxiv.org/pdf/2108.10447.pdf
     [3] CMUDict - http://www.speech.cs.cmu.edu/cgi-bin/cmudict
     [4] HiFi GAN - https://arxiv.org/abs/2010.05646
     [5] WaveGlow - https://arxiv.org/abs/1811.00002
     [6] Tacotron2 - https://arxiv.org/abs/1712.05884

Changelog:

Changelog now moved to the changelog panel.