The Visual Microphone: Passive Recovery Of Sound From Video

When sound hits an object, it causes small vibrations of the object’s surface. This is why standard microphones work, but it’s also the principle behind this video, which explores using video as a source for resynthesizing sound.

The video, via Abe Davis’s research, demonstrates how high-speed video can capture the micro-vibrations of filmed objects. Those visual vibrations can be used to recover, at least partially, the sound that produced them. This lets everyday objects (a glass of water, a potted plant, a box of tissues, or a bag of chips) be used as ‘visual microphones’.
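
To get a feel for the principle, here is a toy sketch, not the researchers' actual method. It assumes a made-up scene: one row of pixels containing a soft edge that a 440 Hz tone shifts by a hundredth of a pixel, filmed at a hypothetical 2,200 fps. Taking the mean brightness of each frame as one audio sample is a deliberately crude stand-in for real motion analysis, and every name and number below is an illustrative assumption.

```python
import math

def render_row(edge_pos, width=32, softness=1.5):
    """Render one row of pixels with a soft bright-to-dark edge at edge_pos."""
    return [1.0 / (1.0 + math.exp((x - edge_pos) / softness)) for x in range(width)]

def simulate_frames(tone_hz, fps, n_frames, amplitude_px=0.01):
    """Hypothetical scene: the edge moves a fraction of a pixel with the sound."""
    frames, truth = [], []
    for n in range(n_frames):
        displacement = amplitude_px * math.sin(2 * math.pi * tone_hz * n / fps)
        frames.append(render_row(16.0 + displacement))
        truth.append(displacement)
    return frames, truth

def recover_signal(frames):
    """One sample per frame: mean brightness with the constant offset removed."""
    means = [sum(row) / len(row) for row in frames]
    baseline = sum(means) / len(means)
    return [m - baseline for m in means]

def correlation(a, b):
    """Pearson correlation between two equal-length sequences."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    num = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    den = math.sqrt(sum((x - mean_a) ** 2 for x in a) *
                    sum((y - mean_b) ** 2 for y in b))
    return num / den

# Assumed values for illustration only: a 440 Hz tone filmed at 2,200 fps.
frames, truth = simulate_frames(tone_hz=440.0, fps=2200.0, n_frames=220)
recovered = recover_signal(frames)
print("correlation with the original tone:", correlation(recovered, truth))
```

In this noise-free toy the brightness signal tracks the tone almost perfectly. Real footage adds sensor noise, which is presumably part of why the fidelity varies so much from object to object.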

The video demonstrates capturing sounds from high-speed footage of a variety of objects with different properties, and explores some of the factors that affect our ability to visually recover sound. The video also explores how to use regular consumer cameras to recover audio from standard frame-rate videos.
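
As a rough guide to why frame rate matters: if each frame contributes only a single audio sample, the recording cannot represent frequencies above half the frame rate, and higher tones fold down to lower ones. The sketch below works through that arithmetic. The helper names and the 2,000 fps figure are illustrative assumptions, not values taken from the video.

```python
def nyquist_hz(fps):
    """Highest frequency representable when each frame is one sample."""
    return fps / 2.0

def aliased_hz(tone_hz, fps):
    """Apparent frequency of a pure tone after sampling once per frame."""
    return abs(tone_hz - fps * round(tone_hz / fps))

# Frame rates chosen for illustration; the high-speed figure is hypothetical.
for label, fps in [("24 fps film", 24),
                   ("30 fps video", 30),
                   ("60 fps consumer camera", 60),
                   ("2,000 fps high-speed camera", 2000)]:
    print(f"{label}: up to about {nyquist_hz(fps):g} Hz; "
          f"a 440 Hz tone would appear at {aliased_hz(440, fps):g} Hz")
```

This simple model only covers the one-sample-per-frame case, so it should be read as a baseline rather than a hard limit on what clever processing can extract.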

Additional information on this technique can be found at the project’s web page.


22 thoughts on “The Visual Microphone: Passive Recovery Of Sound From Video”

  1. Just watch out! – The plants are listening…
    No snacks in the White House Situation Room allowed.
    Happy days 😀

  2. Now with drones & all the snack bags lying around the place …. heck, even with tupperware I’ve got no privacy. Sorry plants, I have to move you away from the window.

    With the standard frame rate video, we’ve got a nice bit-crusher thing going on.

  3. What’s the big deal. When I was a kid, I used a speaker connected to my cheap cassette recorder to listen in on my brother’s room. Transducers can work both ways

    1. Yes, a speaker and a microphone are the same thing.

      What we didn’t know is that a 60fps video of a nearby object was also a microphone. Being able to extract sound from standard definition silent video footage means there is no such thing as a secure room when a direct line of sight is available. It’s very cool and a big deal, and being able to reproduce it with consumer products is the definition of amazing. You don’t need an FBI van and an audio surveillance laser to read the vibrations on window glass.

      I never understood why people pop up in comments for EVERY ARTICLE to declare how they are bored or not impressed with something. It’s like a psychological compulsion to make yourself seem smarter than people who are impressed – which makes zero sense, since you’re anonymous, and you don’t seem to understand the video.

    2. The big deal is that without entering a building, home or office, without running wires, without even hacking into someone’s devices– a drone or external camera can capture high speed footage of some object visible in the window and be able to listen in on conversations.

      This might also have some investigative or historical applications.

      There was a related technology that took video with seemingly motionless subjects, and it basically exaggerated the tiny movements that were present– it showed babies breathing, peoples’ faces flushing with each heartbeat, buildings swaying, etc. It was really cool.

      1. I didn’t even think about going through historical / archived video and listening in on conversations! That’s super cool

      2. The room itself is a visible object with vibrations – it is a concern when such tech is refined – a helicopter/drone able to listen in on vibrations of everything within a long line of sight – mobile device, room, curtains/blinds. And then it becomes affordable for all!

  4. This requires very high speed photography – I believe the existing spy tech involves bouncing a laser off a window and measuring its displacement, which probably gives much higher quality results. Meanwhile, secret services around the world are busy buying typewriters again!

  5. I don’t think a typewriter would be secure from this technology. I suspect that every key of a typewriter has a distinctive sound, essentially making a typewriter a multi-sampler. Once you know what each key sounds like, you can decode a message just by listening to the typewriter. In fact, based on frequency analysis of content, if you have a large enough sample, you can probably decode an audio recording of a typewriter without knowing in advance the sound of each letter.


    you can get much clearer audio if you use an Opto-Acoustic Laser Microphone pointed at any window.
    this has been done since the 70’s

    it works like this ( googled)
    by transmitting an invisible IR-beam to the window of the target room. The window pane is slightly vibrating in accordance to the sound waves emanating from speech. The beam is reflected from the window pane according to the law of optics, ie. the angle of incidence is equal to the angle of reflection. The receiver picks up the reflected beam that is modulated by the window pane vibrations, and the optical signals are automatically converted into electronic signals. The picked up speech is now filtered, amplified and recorded.

    1. In your attempts to act clever, both you and FSK1138 seem to have missed the word ‘passive’ in the headline.

      This does not require shining lasers or infrared lights – it’s passive, meaning that it works using only the light that is naturally reflected off of objects.

      1. In your attempts to act clever ….you seem to have forgotten how a camera works

        the passive method described above uses an IR beam to focus

        the sound is recovered passively from the recording
        the capture is not passive

      2. you’re right. though i wasn’t trying to act clever at all. just wanted to add this method to the discussion as infrared beams are invisible and thus the result of both methods are kind of similar.

  7. This reminds me of a science fiction story (by Rudy Rucker, I believe), where somebody discovered/extracted the imprints of the voice of an alien on a millennium-old vase made of clay (like the grooves of a record), and the voice caused some kind of fracture in the time/space continuum (Rudy Rucker has weird ideas). Wouldn’t it be awesome to be able to extract hidden audio from old movie footage?

    1. I don’t think 24fps film or 30fps video is fast enough. The camera for most of the video is super high speed. The example of the consumer camera in the video is recording at 60fps, roughly double that of video, and focused clearly on a single object, and is perfectly still, and sounds much lower fidelity than the professional high speed camera.
      So you probably can’t get much more than white noise from an old silent film. But maybe with clever filtering, we might get some low frequency content from very clean 24fps film, but it probably won’t be recognizable speech.
      Maybe if someone hit a bell in closeup…

      1. Exactly. Given the very small excursion of objects and grainy video– the amp resolution would be 1-bit at best. And given the frame-rate, the sampling rate would be 25 Hz at best.

        BUT, if the analysis was so in-depth as to read vibrational information from every possible pixel in the frame, (– i.e., many streams) it might be able to do more interpolation, but it is still unlikely to get above the noise floor.

  8. This technique was used in an episode of the British-French thriller series “The Tunnel” (which is a version of the Danish-Swedish series “The Bridge”) – in it, the French police use internet video of a hostage with a vibrating ventilation louvre in the background – they recover infrasound from the video of the louvre to determine that there are train sounds audible at the hostage location, and then by deducing the types of trains and examining their timing, they narrow the search area to just two warehouses where the hostage might be held.
