In addition to the speaker vectors, we also store on the phone the “Hey Siri” portion of their corresponding utterance waveforms. When improved transforms are deployed via an over-the-air update, each user profile can then be rebuilt using the stored audio.
That is the most Apple-like approach to continuous improvement I can think of. More interesting, though, is this bit later on:
The network is trained using the speech vector as an input and the corresponding 1-hot vector for each speaker as a target.
To date, “personalized Hey Siri” has meant the system is trained to recognize only one voice. That quote, though, suggests they’re working on multiple-user support, which, with the HomePod, they really should be.
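The training setup in that quote — speech vector in, 1-hot speaker vector out — is a plain multi-class classification problem. Here's a toy sketch of that idea with a single softmax layer; the dimensions, the fake enrollment data, and the model itself are all assumptions for illustration, since Apple hasn't published the actual network:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: 3 enrolled speakers, 16-dim "speech vectors"
# (the real system's dimensions and architecture are not public).
n_speakers, dim, per_speaker = 3, 16, 50

# Fake enrollment data: each speaker's vectors cluster around a centroid.
centroids = rng.normal(size=(n_speakers, dim))
X = np.vstack([c + 0.1 * rng.normal(size=(per_speaker, dim))
               for c in centroids])
labels = np.repeat(np.arange(n_speakers), per_speaker)

# 1-hot targets: one column per enrolled speaker.
Y = np.eye(n_speakers)[labels]

# Single softmax layer trained by gradient descent -- a stand-in for
# whatever network Apple actually uses.
W = np.zeros((dim, n_speakers))
b = np.zeros(n_speakers)
for _ in range(200):
    logits = X @ W + b
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    P = np.exp(logits)
    P /= P.sum(axis=1, keepdims=True)             # softmax probabilities
    grad = P - Y                                  # cross-entropy gradient
    W -= 0.1 * X.T @ grad / len(X)
    b -= 0.1 * grad.mean(axis=0)

pred = np.argmax(X @ W + b, axis=1)
print("training accuracy:", (pred == labels).mean())
```

The nice property of this formulation is exactly the one the commentary points at: adding a second user is just adding another column to the 1-hot target, rather than retraining a fundamentally different system.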