Knowledge Bridge

Global Intelligence for the Digital Transition

Jeremy Wagstaff / August 7, 2018

Talking Heads: Speech recognition tools could help ease the newsroom’s great bottleneck

The bane of any reporter’s life is returning from an interview and then having to transcribe the recording of it. Text reporters can get away with some shorthand and a few notes, as they probably only need a quote or two. But radio and TV journalists, and those seeking to squeeze a little more out of an interview, are stuck with the painstaking process of going over the recording and either correcting their notes or typing out the transcript afresh. It’s not fun.

Technology has been promising to fix this for a while. Products like Nuance’s Dragon NaturallySpeaking have been chipping away at speech recognition since the late 1990s, but they required training the software to be familiar with your voice, didn’t work well with other people’s voices and were, at least for me, a little too error-prone to be genuinely useful.

But there are now options.

I’ve been testing a couple — Trint (trint.com) and Descript (descript.com) — which do an excellent job of automatically turning an interview recording into a transcript you can work with. And they’re relatively cheap: expect to pay about $15 for an hour’s worth of audio. It’ll take about five minutes for the transcript to be ready, and then both provide pretty good editors (Descript in app form, Trint in a web app) where you can tidy up the text and fix errors. The underlying audio is mapped to the text, so editing text and moving through the audio is painless. Keystrokes allow you to switch quickly between listening and editing. Descript even lets you edit the audio, so you could prepare an interview for broadcast or podcast.

I would say that on the whole you save yourself a couple of hours per hour of audio. For a journalist this means you can have the semblance of a transcript to work from within minutes of the interview finishing. If you’re under time pressure that’s a serious time saver.

There are several other apps offering something similar: Otter, from AISense, is in essence a voice recorder that automatically transcribes whatever is being recorded, in real time. Temi and Scribie are also worth checking out.

So how does this work? And why now? Well, as with a lot of tech advances, it has to do with algorithms, cloud computing and data. The algorithm comes first, because that is the part that says ‘this sound is someone saying “hello”, so type “hello”.’ In the early days, before cloud computing came along, that algorithm had to be very efficient, because it had to run on a personal computer or a mobile device.
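To make that concrete, here is a minimal sketch in Python using the open-source SpeechRecognition library, which hands a recording to a generic cloud recognizer. This is not how Trint or Descript work internally, and the filename is made up; it just shows the basic shape of ‘send audio, get text back’:

    import speech_recognition as sr

    recognizer = sr.Recognizer()

    # Load a local recording ("interview.wav" is a placeholder name).
    with sr.AudioFile("interview.wav") as source:
        audio = recognizer.record(source)

    # The audio is shipped off to a cloud service, which runs the
    # recognition algorithm server-side and returns plain text.
    try:
        print(recognizer.recognize_google(audio))
    except sr.UnknownValueError:
        print("The service could not make out any speech.")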

Cloud computing helped change that: the companies trying to crack speech recognition were no longer constrained by the user’s hardware, and could throw as much computing power at the problem as they wanted. But that doesn’t mean computers are doing all the work. The algorithms still need something to work from: examples they can learn from. So a lot of the advances have come from a hybrid approach, where humans do some of the work and train the computer algorithms to get better.

And now, at least in the case of the services I have played with, the job has been handed over to algorithms entirely. (And with each correction we make in their apps, they learn a little bit more.) These example-driven algorithms have replaced the old classical ones, which had to be precisely engineered by hand. The new ones teach themselves: you give them a pile of data and tell them, ‘this is how people have transcribed it, now go away and figure out how to do that.’
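To see what ‘example-driven’ means in miniature, here is a deliberately toy sketch. Real systems learn statistical models over acoustic features rather than looking up whole sounds, but the principle is the same: tally how humans transcribed each sound in the training data, then pick the most probable word when transcribing new audio:

    from collections import Counter, defaultdict

    # Invented training data: (sound, how a human transcribed it).
    examples = [
        ("h-eh-l-ow", "hello"), ("h-eh-l-ow", "hello"),
        ("h-eh-l-ow", "halo"),  ("w-er-l-d", "world"),
    ]

    # 'Training': count how each sound was transcribed by humans.
    counts = defaultdict(Counter)
    for sound, word in examples:
        counts[sound][word] += 1

    def transcribe(sounds):
        # 'Transcription': choose the most frequently seen word for
        # each sound, i.e. the one with the highest probability.
        return " ".join(counts[s].most_common(1)[0][0] for s in sounds)

    print(transcribe(["h-eh-l-ow", "w-er-l-d"]))  # prints: hello world

Nobody programmed a rule that the first sound means ‘hello’; the mapping fell out of the examples. Which is also why bad or unrepresentative examples produce bad transcripts.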

This means I have to add a few caveats. This kind of machine transcription is not trying to perfectly transcribe each utterance. It is applying what it has learned from previous transcripts, so if those transcripts aren’t great, the results won’t be great. This can be good: Trint, for example, leaves out a lot of the verbal tics we use in speech (the ers, ahs and ums) because human transcribers naturally do the same. But it can also mistranscribe whole sentences that read sensibly but bear no relation to what the speaker said. So whereas with a conventional transcript you might scan for the odd misheard word or misspelling, here you need to keep an eye out for entirely incorrect phrases. This could be fatal if you end up using a quote in a story!

There’s a bigger caveat too: accents can easily throw these services off. Trint can cope with most European languages, but in one case it could not handle someone speaking English with a Middle Eastern accent, despite their grammar and syntax being excellent. Likewise, when I used Trint’s option of selecting an Australian accent for a transcription (the other options being North American and British), the Australian interviewee appeared to be talking about crocodiles, tucker, barbies, tinnies and other Australiana, when in reality he talked about nothing of the sort. The training data must have been full of such terms, so the model assigned higher probabilities to those words than to the ones he actually said.
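That accent option is typically just a language-variant code passed to the service. A hedged sketch, again using the SpeechRecognition library rather than Trint’s own menu-driven interface, and again with a placeholder filename:

    import speech_recognition as sr

    recognizer = sr.Recognizer()
    with sr.AudioFile("interview.wav") as source:
        audio = recognizer.record(source)

    # "en-AU" asks the service to use a model trained on Australian
    # speech; it will weight Australian pronunciations (and, apparently,
    # Australian vocabulary) more heavily than "en-GB" or "en-US" would.
    print(recognizer.recognize_google(audio, language="en-AU"))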

This means that I would not be confident recommending any of these services for situations where non-European languages are being spoken, or where English is spoken with a less familiar accent. This is largely because of a lack of freely available training data. Academics are working to fix at least some of these problems: I’ve seen recent papers addressing indigenous South African languages, as well as speech in which speakers switch between languages, such as Frisian-Dutch.

Give these apps a chance if you haven’t already. Behind them lies a big step into a future where computers can more readily understand what we say, and where what we say can easily be transcribed and stored. That has both exciting and scary implications. But for journalism it helps ease a significant bottleneck in getting what people are saying into our stories.

