topchetoeu/rvc-rt

Clone repo: git clone https://git.topcheto.eu/topchetoeu/rvc-rt.git

A usable, cross-platform GUI RVC real-time inference app

Example

I got tired of the infamous gui_v1, so I bumped the version.

Don't be mistaken: this is a complete rewrite using PyQt6, with some improvements, so that you can use different devices for input and output (helpful when you want to use virtual cables or anything more complex on Linux).

My current setup (on Linux) routes the input from my microphone to a virtual source, which I then use for whatever I do (basically like a VST plugin).

How to set up

  1. Install Python 3.11 (yes, you need this exact version, because this ancient piece of shit uses fairseq)
  2. Install dependencies from requirements.txt
  3. Download all assets using the download.py script
  4. Run gui_v2.py

I recommend using a venv.
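
Since the pinned interpreter version is the step most likely to go wrong, a small guard like the one below can catch it before anything else fails. This is only a sketch: `is_supported_python` and `REQUIRED` are hypothetical names, not something that ships in this repo.

```python
import sys

REQUIRED = (3, 11)  # the major.minor pair the setup steps call for


def is_supported_python(version_info=sys.version_info) -> bool:
    """Return True only when the major.minor pair matches REQUIRED exactly."""
    return tuple(version_info[:2]) == REQUIRED


# Check the interpreter that is running this snippet.
print("Python version OK" if is_supported_python() else "Python 3.11 is required")
```

Running this with the interpreter inside your venv tells you whether the venv was created from the right Python.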

How to use

The interface is quite simple and I'd say barebones. You might've noticed the glaring absence of some of gui_v1's features. I've removed them because I don't need them and don't feel like reimplementing them. If you want them, implement them yourself.

The first two fields specify the model weights and the index file. Both are needed for the model to run. Input/output are pretty self-explanatory.

I'm not aware of the pros and cons of the different pitch extraction algorithms, but rmvpe and crepe seem to be the better ones by far. Pitch sets how far the pitch is shifted, in semitones, which is useful when "gender-bending" (in my experience, male to female is exactly 12 semitones, and male to anime female voice is 20-ish semitones).
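
To get a feel for what those semitone values mean, here is the standard equal-temperament conversion from semitones to a frequency multiplier. The helper name `semitones_to_ratio` is made up for this example; how the app applies the shift internally is not shown here.

```python
def semitones_to_ratio(semitones: float) -> float:
    """Frequency multiplier for a shift of `semitones` in 12-tone equal temperament."""
    return 2.0 ** (semitones / 12.0)


for shift in (12, 20, -12):
    print(f"{shift:+d} semitones -> x{semitones_to_ratio(shift):.3f}")
```

So 12 semitones is exactly one octave (double the frequency), and 20 semitones is a bit more than triple.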

Index rate, and for that matter, indices, don't have much audible effect, although indices do tend to improve output quality somewhat. Keep the index rate around 0.5, or weird stuff happens.

Now come the most important settings, which determine the quality and latency of inference. Block time determines the size of the chunks in which audio is read from the input. Bigger chunks result in more consistent audio, but add directly to latency (keep this under 0.25 seconds if you want any reasonable latency).

Crossfade time is how much time is added to block time for fading. Since the AI doesn't produce sample-perfect audio every time, there is some crackling if the model's chunks are output naively. This is why a small slice is allocated for fading the previously inferred chunk into the current one. NOTE: This adds to the latency TWICE: picking a crossfade of 0.15 adds 0.3 seconds to the latency. This is because the crossfade is first added to the read block size and then sliced out of the end of the input block, hence the two-fold effect.
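
The fading idea can be sketched in a few lines. This assumes a plain linear fade over the overlapping slice; the window the app actually uses may be different, and `crossfade` is a hypothetical helper written for this example.

```python
def crossfade(prev_tail: list[float], cur_head: list[float]) -> list[float]:
    """Linearly fade from prev_tail into cur_head, sample by sample."""
    if len(prev_tail) != len(cur_head):
        raise ValueError("both slices must have the same length")
    n = len(prev_tail)
    out = []
    for i, (old, new) in enumerate(zip(prev_tail, cur_head)):
        w = (i + 1) / (n + 1)  # weight of the new chunk, rising across the slice
        out.append(old * (1.0 - w) + new * w)
    return out


# Two chunks that disagree on the overlap: the old one sits at 1.0, the new one at 0.0.
print(crossfade([1.0, 1.0, 1.0], [0.0, 0.0, 0.0]))  # [0.75, 0.5, 0.25]
```

Instead of jumping from 1.0 straight to 0.0 (which you would hear as a click), the output slides between the two.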

Extra time determines how much additional "history" is kept in order for the model to infer better. This won't affect latency, but will increase inference times. Keep this around 2-3 seconds for reasonable inference times.
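
Putting the three timing settings together, the buffering part of the latency can be estimated like this. It is a back-of-the-envelope sketch based only on the description above: inference time comes on top, the 48000 Hz sample rate is just an example value, and both function names are invented for illustration.

```python
def buffer_latency(block_time: float, crossfade_time: float) -> float:
    """Latency from buffering alone, in seconds. Crossfade counts twice; extra time not at all."""
    return block_time + 2.0 * crossfade_time


def seconds_to_samples(seconds: float, sample_rate: int) -> int:
    """Convert a duration to a whole number of samples."""
    return int(round(seconds * sample_rate))


block, fade, extra = 0.25, 0.15, 2.5  # seconds
rate = 48000  # assumed sample rate, for illustration only

print(f"buffering latency: {buffer_latency(block, fade):.2f} s")
print(f"block: {seconds_to_samples(block, rate)} samples")
print(f"history fed to the model: {seconds_to_samples(extra, rate)} samples")
```

With a block time of 0.25 and a crossfade of 0.15 you are already at 0.55 seconds before the model has done any work, which is why the crossfade is worth keeping small.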

Finally, ratio determines how the raw (dry) and inferred (wet) signals are mixed. This lets you listen to both signals at the same time, or turn off the AI temporarily.
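
A dry/wet mix is usually just a weighted sum of the two signals. The sketch below assumes that 0.0 means fully dry and 1.0 means fully wet; which end of the slider is which in the app is an assumption here, and `mix` is a hypothetical helper.

```python
def mix(dry: list[float], wet: list[float], ratio: float) -> list[float]:
    """Blend dry and wet; ratio=0.0 is all dry, ratio=1.0 is all wet (assumed direction)."""
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("ratio must be between 0 and 1")
    return [d * (1.0 - ratio) + w * ratio for d, w in zip(dry, wet)]


dry = [1.0, 1.0]
wet = [0.0, 0.0]
print(mix(dry, wet, 0.0))   # only the raw microphone signal
print(mix(dry, wet, 1.0))   # only the inferred signal
print(mix(dry, wet, 0.25))  # mostly dry, a bit of wet
```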

Of course, clicking start starts the inference, and stop stops it.

Quirks

Despite my best efforts, this has some annoying bugs.

  • When you start the inference, your audio will lag for a while. This is because it takes some time for the input reader, inference threads and output writer to stabilize.
  • Sometimes, on Linux, with the JACK backend, you will get bogus errors for "invalid sample rates". This is BS, just restart your JACK/PipeWire server.
  • Changing the timing parameters while inferring is a big no-no. In theory it should work, but 9 times out of 10 the app will just deadlock or break in the most gruesome ways. Do not do it; there is no reason to anyway.
  • If you plug/unplug any audio devices, you need to refresh the devices list.
  • Some shit will be spewed out to stdout, ignore it lol
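
For context on the first quirk, this is roughly what a reader → inference → writer hand-off looks like when built from threads and queues. It is a generic sketch and not the app's actual code: every name in it is made up, the "blocks" are plain numbers, and `infer` stands in for the model.

```python
import queue
import threading

STOP = object()  # sentinel that tells each stage to shut down


def stage(work, inbox, outbox):
    """Pull items from inbox, apply work, push results to outbox until STOP arrives."""
    while True:
        item = inbox.get()
        if item is STOP:
            outbox.put(STOP)
            return
        outbox.put(work(item))


def run_pipeline(blocks, infer):
    """Feed blocks through reader -> inference -> writer and return what was 'played'."""
    to_infer = queue.Queue(maxsize=4)
    to_write = queue.Queue(maxsize=4)
    played = []

    def reader():
        for block in blocks:
            to_infer.put(block)  # blocks when the inference stage falls behind
        to_infer.put(STOP)

    def writer():
        while True:
            item = to_write.get()
            if item is STOP:
                return
            played.append(item)

    threads = [
        threading.Thread(target=reader),
        threading.Thread(target=stage, args=(infer, to_infer, to_write)),
        threading.Thread(target=writer),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return played


print(run_pipeline([1, 2, 3], lambda block: block * 10))  # [10, 20, 30]
```

In a setup like this, the queues between the stages take a moment to reach a steady fill level after startup, and resizing the blocks while the stages are mid-flight is exactly the kind of thing that tends to end in a deadlock.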

If you catch other bugs, please tell me. In general usage, after shit stabilizes, the app is rock solid (as long as )