Researchers at Microsoft and Zhejiang University say they have developed an AI system called DeepSinger. The system is trained on data from music websites and can generate singing voices in several languages.
A preprint published on arXiv describes the new approach, in which a specially designed component captures singers’ sound quality (timbre) from disorganized, noisy singing data.
Like OpenAI’s Jukebox, DeepSinger has commercial implications. Music artists often return for pick-up sessions to correct mistakes or to make changes and additions after a recording.
AI-assisted voice synthesis could therefore save time and money for the singers’ employers. However, it could also put singers out of work.
A more troubling side of this technology is the creation of fake vocals that make it appear as though musicians sang lyrics they never did.
Recently, Jay-Z, through his label Roc Nation, filed copyright notices against videos that used AI to make him appear to rap Billy Joel’s “We Didn’t Start the Fire.”
The researchers report that singing voices are more complicated than normal speaking voices with respect to rhythms and patterns. Synthesizing them is therefore demanding, because it requires information for controlling duration and pitch.
Additionally, songs used for training must be manually analyzed at the lyrics and audio level, and few singing training data sets are available to the public.
DeepSinger reportedly addresses these challenges with a pipeline made up of several data-processing steps. The system first crawls music websites for songs performed by top singers in several languages.
It then uses a music separation tool called Spleeter to extract the singing voices from the accompaniment before segmenting the audio into sentences. DeepSinger further extracts the singing duration of each phoneme (the unit of sound that distinguishes one word from another) in the lyrics.
After the lyrics and singing voices have been filtered according to a model-generated confidence score, the system relies on its specially designed component to handle imperfect or distorted training data.
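The duration-extraction and filtering steps described above can be illustrated with a minimal Python sketch. It is not the researchers' code: the `AlignedSentence` structure, the function names, the 0-to-1 confidence scale, and the 0.8 threshold are all assumptions made for illustration, and the crawling and vocal-separation steps are assumed to have already run.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class AlignedSentence:
    """One sentence-level segment after separation and alignment (hypothetical structure)."""
    lyrics: str
    # (phoneme, start_sec, end_sec) triples, as some alignment model might produce them
    phoneme_spans: List[Tuple[str, float, float]]
    # Model-generated alignment confidence; a 0..1 scale is assumed here
    confidence: float


def phoneme_durations(sentence: AlignedSentence) -> List[Tuple[str, float]]:
    """Turn phoneme boundaries into per-phoneme singing durations in seconds."""
    return [(p, round(end - start, 3)) for p, start, end in sentence.phoneme_spans]


def filter_by_confidence(sentences: List[AlignedSentence],
                         threshold: float = 0.8) -> List[AlignedSentence]:
    """Keep only sentences whose confidence meets the (assumed) threshold."""
    return [s for s in sentences if s.confidence >= threshold]


# Toy data standing in for mined and aligned singing segments
segments = [
    AlignedSentence("hello", [("h", 0.00, 0.12), ("e", 0.12, 0.50)], confidence=0.93),
    AlignedSentence("world", [("w", 0.00, 0.10), ("o", 0.10, 0.90)], confidence=0.41),
]

kept = filter_by_confidence(segments)
durations = [phoneme_durations(s) for s in kept]
```

In this toy run only the first segment survives the filter, so the low-confidence alignment never reaches training.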
The researchers report that DeepSinger can synthesize singing voices from lyrics, duration, pitch information, and reference audio, and that the results are high quality in terms of pitch accuracy and voice naturalness.
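To make the role of the duration and pitch inputs more concrete, the following Python sketch expands a short note sequence into per-frame targets, the kind of time-aligned input a synthesis model could consume. The `expand_to_frames` helper, the 10-millisecond frame size, and the note values are illustrative assumptions, not details taken from the researchers' system.

```python
from typing import List, Tuple


def expand_to_frames(notes: List[Tuple[str, int, float]],
                     frame_ms: int = 10) -> List[Tuple[str, float]]:
    """Repeat each (phoneme, duration_ms, pitch_hz) note once per frame.

    The frame size is an assumed value; each note yields at least one frame.
    """
    frames: List[Tuple[str, float]] = []
    for phoneme, duration_ms, pitch_hz in notes:
        n_frames = max(1, round(duration_ms / frame_ms))
        frames.extend([(phoneme, pitch_hz)] * n_frames)
    return frames


# Hypothetical input: the syllable "la" sung on A3 (220 Hz)
notes = [("l", 50, 220.0), ("a", 200, 220.0)]
frames = expand_to_frames(notes)
```

Here the consonant occupies 5 frames and the vowel 20, which shows how a longer duration directly stretches the sung vowel while the pitch value is carried along on every frame.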
The Microsoft researchers plan to use more sophisticated AI-based technologies to improve the quality of the voices DeepSinger generates.