In addition to the speaker vectors, we also store on the phone the “Hey Siri” portion of their corresponding utterance waveforms. When improved transforms are deployed via an over-the-air update, each user profile can then be rebuilt using the stored audio.
That is the most Apple-like approach to continuous improvement I can think of. More interesting, though, is this bit later on:
The network is trained using the speech vector as an input and the corresponding 1-hot vector for each speaker as a target.
To date, “personalized Hey Siri” has meant the system is trained to recognize only one voice. That quote, though, suggests they’re working on multiple-user support, which, with the HomePod, they really should be.
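The training setup in that quote — speech vector in, 1-hot speaker vector out — is a plain multi-class classification problem. Here's a toy sketch of that idea with a single softmax layer; the dimensions, the fake enrollment data, and the model itself are all assumptions for illustration, since Apple hasn't published the actual network:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: 3 enrolled speakers, 16-dim "speech vectors"
# (the real system's dimensions and architecture are not public).
n_speakers, dim, per_speaker = 3, 16, 50

# Fake enrollment data: each speaker's vectors cluster around a centroid.
centroids = rng.normal(size=(n_speakers, dim))
X = np.vstack([c + 0.1 * rng.normal(size=(per_speaker, dim))
               for c in centroids])
labels = np.repeat(np.arange(n_speakers), per_speaker)

# 1-hot targets: one column per enrolled speaker.
Y = np.eye(n_speakers)[labels]

# Single softmax layer trained by gradient descent -- a stand-in for
# whatever network Apple actually uses.
W = np.zeros((dim, n_speakers))
b = np.zeros(n_speakers)
for _ in range(200):
    logits = X @ W + b
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    P = np.exp(logits)
    P /= P.sum(axis=1, keepdims=True)             # softmax probabilities
    grad = P - Y                                  # cross-entropy gradient
    W -= 0.1 * X.T @ grad / len(X)
    b -= 0.1 * grad.mean(axis=0)

pred = np.argmax(X @ W + b, axis=1)
print("training accuracy:", (pred == labels).mean())
```

The nice property of this formulation is exactly the one the commentary points at: adding a second user is just adding another column to the 1-hot target, rather than retraining a fundamentally different system.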