Because the words are drawn from a database of recorded speech fragments,
the voice is very difficult to modify, so adding intonation or emphasis
is nearly impossible. This is why robotic voices often sound monotonous
and decidedly different from human ones.
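The limitation described above can be sketched in a few lines. This is a toy illustration, not any real TTS system: the fragment names and sample values are invented, and each "waveform" is just a short list of numbers standing in for recorded audio.

```python
# Minimal sketch of concatenative synthesis: pre-recorded fragments are
# simply stitched together. All names and values here are hypothetical.

FRAGMENT_DB = {
    "hello": [0.1, 0.3, 0.2],    # stand-in for a recorded waveform
    "world": [0.4, 0.0, -0.2],
}

def synthesize(words):
    """Concatenate stored fragments. Prosody was fixed when the
    fragments were recorded, so intonation and emphasis cannot be
    adjusted for a new sentence."""
    samples = []
    for w in words:
        samples.extend(FRAGMENT_DB[w])
    return samples

print(synthesize(["hello", "world"]))
```

Because the output is only ever a concatenation of fixed recordings, there is no parameter the system could vary to make "hello" sound like a question rather than a statement.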
During training, the DeepMind team gave WaveNet real waveforms recorded
from human speakers to learn from. Using a type of AI called a neural
network, the program learns from these recordings, much as a human
brain does.
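The core training idea is to learn to predict the next waveform sample from the samples that came before it. WaveNet itself is a deep network of dilated causal convolutions over quantized audio; the sketch below replaces all of that with a two-weight linear predictor trained on a synthetic sine wave, purely to illustrate the "learn from real waveforms" loop.

```python
import math

# Drastically simplified stand-in for WaveNet's training objective:
# predict the next sample of a waveform from the previous ones.
# The "recording" here is a synthetic sine wave, not real speech.

wave = [math.sin(0.2 * t) for t in range(200)]

w = [0.0, 0.0]   # weights on the two previous samples
lr = 0.01
for _ in range(500):                       # gradient-descent epochs
    for t in range(2, len(wave)):
        pred = w[0] * wave[t - 1] + w[1] * wave[t - 2]
        err = pred - wave[t]               # prediction error
        w[0] -= lr * err * wave[t - 1]     # squared-error gradient step
        w[1] -= lr * err * wave[t - 2]

# After training, the model predicts the next unseen sample closely.
next_pred = w[0] * wave[-1] + w[1] * wave[-2]
print(next_pred, math.sin(0.2 * 200))
```

At generation time such a model runs autoregressively, feeding each predicted sample back in as input for the next one, which is also how WaveNet produces audio one sample at a time.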