A short while ago, Apple launched a journal on machine learning; the general consensus on why they did it is that AI researchers want their work to be public, although as some have pointed out, the articles don’t have a byline. Still, getting the work out at all, even if unattributed, is an improvement over their normal secrecy.
They’ve recently published a few new articles, and I figured I’d grab some interesting tidbits to share.
In one, they talked about their use of deep neural networks to power the speech recognition used by Siri; in expanding to new languages, they’ve been able to decrease training time by transferring over the trained networks from existing language recognition systems to new languages.1 Probably my favorite part, though, is this throwaway line:
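The article doesn’t go into implementation detail, but the transfer idea itself is simple enough to sketch: keep the hidden layers of a network trained on the source language, and re-initialize only the language-specific output layer before training on the new language. Here’s a toy version in Python – the shapes, sizes, and function names are all mine, not Apple’s:

```python
import numpy as np

def init_net(layer_sizes, rng):
    """Random weight matrices for a toy feedforward acoustic model."""
    return [rng.standard_normal((m, n)) * 0.1
            for m, n in zip(layer_sizes[:-1], layer_sizes[1:])]

def transfer(source_net, n_target_outputs, rng):
    """Reuse every hidden layer from the source-language network;
    only the final, language-specific output layer is re-initialized."""
    hidden = [w.copy() for w in source_net[:-1]]
    n_in = source_net[-1].shape[0]
    new_output = rng.standard_normal((n_in, n_target_outputs)) * 0.1
    return hidden + [new_output]

rng = np.random.default_rng(0)
# 40 acoustic features -> two hidden layers -> 50 source-language phone classes
source = init_net([40, 128, 128, 50], rng)
# Say the new language has 42 phone classes; the hidden layers carry over as-is
# (in practice you'd then fine-tune the whole thing on target-language data).
target = transfer(source, 42, rng)
```

The payoff, per the article, is that the carried-over hidden layers already know how to turn audio into useful features, so training on the new language converges faster than starting from scratch.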
While we wondered about the role of the linguistic relationship between the source language and the target language, we were unable to draw conclusions.
I’d love to see an entire paper exploring that; hopefully that’ll show up eventually. You can read the full article here.
Another discusses the reverse – the use of machine learning for audio synthesis, specifically the voices of Siri. Google has done something similar,2 but as Apple mentions, doing it that way is pretty computationally expensive, and they can’t exactly roll out a version of Siri that burns through 2% of your iPhone’s battery every time it has to talk. So, rather than generate the entirety of the audio on-device, the Apple team went with a hybrid approach – traditional speech synthesis, based on playing back chunks of audio recordings, but using machine learning techniques to better select which chunks to play based on how good they’ll sound when stitched together. The end of the article includes a table of audio samples comparing the Siri voices in iOS 9, 10, and 11; it’s a cool little example to play with.
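Unit selection itself is an old technique, and the dynamic-programming core of it fits in a few lines. Here’s a toy sketch in Python – the “learned” costs are just numbers handed in, standing in for whatever Apple’s deep network actually predicts about prosody and smoothness:

```python
# Toy unit selection: for each target sound there are several recorded
# candidate units. A (hypothetical) learned model has already scored how
# well each unit fits its slot (target cost) and how smoothly each pair
# of consecutive units concatenates (join cost); a Viterbi search then
# picks the cheapest path through the candidate lattice.

def select_units(target_costs, join_cost):
    """target_costs: list of lists, one score per candidate per position.
    join_cost(i, a, b): cost of following candidate a at position i with
    candidate b at position i+1. Returns the cheapest candidate path."""
    n = len(target_costs)
    best = list(target_costs[0])   # cheapest total cost ending at each candidate
    back = [[None] * len(c) for c in target_costs]
    for i in range(1, n):
        new_best = []
        for b, tc in enumerate(target_costs[i]):
            costs = [best[a] + join_cost(i - 1, a, b) for a in range(len(best))]
            a = min(range(len(costs)), key=costs.__getitem__)
            new_best.append(costs[a] + tc)
            back[i][b] = a
        best = new_best
    # Trace the cheapest path backwards.
    b = min(range(len(best)), key=best.__getitem__)
    path = [b]
    for i in range(n - 1, 0, -1):
        b = back[i][b]
        path.append(b)
    return path[::-1]

# Three slots, two candidate units each; switching candidates costs 0.3.
targets = [[1.0, 0.2], [0.5, 0.5], [0.1, 0.9]]
path = select_units(targets, lambda i, a, b: 0.0 if a == b else 0.3)
```

The interesting part in Apple’s version isn’t the search – it’s that the costs come from a network trained to predict how natural the stitched result will sound, which is where the “hybrid” in hybrid synthesis lives.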
The last of the three new articles discusses the method by which Siri (or the dictation system) knows to change “twenty seventeen” into “2017,” and the various other differences between spoken and written forms of languages. It’s an interesting look under the hood of some of iOS’ technology, but mostly it just made me wonder about the labelling system that powers the ‘tap a date in a text message to create a calendar event’ type stuff – that part, specifically, is fairly easy pattern recognition, but the system also does a remarkable job of tagging artist names3 and other things. The names of musical groups are a harder problem, but the one whose workings I really wonder about is map lookups – I noticed recently that the names of local restaurants were being linked to their Maps info sheets, and that has to involve some kind of on-device search, because I doubt Apple has a master list of every restaurant in the world loaded onto every iOS device.
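The “twenty seventeen” → “2017” case is a fun one to toy with. Here’s a crude, year-only sketch built on a hand-written word table – my own simplification for illustration, nothing like whatever Siri actually does:

```python
# Toy inverse text normalization: convert the spoken form of a year,
# read as two two-digit groups ("twenty seventeen", "nineteen eighty
# four"), into its written form. The word tables and year-only scope
# are hypothetical simplifications.

UNITS = {w: i for i, w in enumerate(
    "zero one two three four five six seven eight nine ten eleven twelve "
    "thirteen fourteen fifteen sixteen seventeen eighteen nineteen".split())}
TENS = {w: 10 * i for i, w in enumerate(
    "twenty thirty forty fifty sixty seventy eighty ninety".split(), start=2)}

def two_digit(words):
    """Parse one or two number words into an integer 0-99, else None."""
    if len(words) == 1:
        return UNITS.get(words[0], TENS.get(words[0]))
    if len(words) == 2 and words[0] in TENS and words[1] in UNITS:
        return TENS[words[0]] + UNITS[words[1]]
    return None

def spoken_year(text):
    """'twenty seventeen' -> '2017': try every split into two groups."""
    words = text.lower().split()
    for split in range(1, len(words)):
        hi, lo = two_digit(words[:split]), two_digit(words[split:])
        if hi is not None and lo is not None and 10 <= hi <= 99:
            return f"{hi:02d}{lo:02d}"
    return None
```

Even this tiny version hints at why the general problem is hard: the same words are ambiguous between a year, a quantity, and a phone number fragment, and the right written form depends on context the words alone don’t carry.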
As a whole, it’s very cool to see Apple publishing some of their internal research, especially considering that all three of these were about technologies they’re actually using.
- The part in question was specific to narrowband audio – what you get via Bluetooth rather than from the device’s onboard microphones – but as they mention, it’s harder to get sample data for Bluetooth microphones than for iPhone microphones. ↩
- Entertainingly, the Google post is much better designed than the Apple one; Apple’s is good-looking for a scientific journal article, but Google’s includes some nice animated demonstrations of what they’re talking about that make it more accessible to the general public. ↩
- Which it opens, oh-so-helpfully, in Apple Music, rather than iTunes these days. ↩