Researchers at MIT have created a machine-learning technique that accurately captures and predicts a scene’s underlying acoustics from a small number of sound recordings. In this image, a red dot marks the sound emitter; yellow regions indicate louder sound than blue ones.
Imagine the reverberation of a pipe organ’s booming chords throughout the vast sanctuary of a gigantic stone cathedral. The sound that cathedral-goers experience depends on various elements, including the placement of the organ, the listener’s position, the presence of columns, pews, or other obstructions between them, the material of the walls, the location of windows and doors, etc. Sound can assist a person in visualising their world.
Now, researchers at MIT and the MIT-IBM Watson AI Lab are investigating how spatial acoustic information can help robots build a picture of their environments. They created a machine-learning model that can capture how any sound in a room propagates through the space, allowing the model to simulate what a listener would hear at different locations. By accurately modelling a scene’s acoustics, the system can learn the 3D geometry of a room from sound recordings. The researchers can use the acoustic information their approach captures to construct accurate visual representations of a room, much as humans use sound to estimate the properties of their physical surroundings. Beyond its potential uses in virtual and augmented reality, this technique could help artificial intelligence agents develop a richer understanding of their surroundings.
Sound and vision
A type of machine-learning model known as an implicit neural representation has been used in computer vision research to build smooth, continuous reconstructions of 3D scenes from photographs. These models employ neural networks composed of layers of interconnected nodes, or neurons, that process data to accomplish a task. The MIT researchers used the same kind of model to capture how sound moves continuously through a scene. However, they found that vision models benefit from a property known as photometric consistency that does not carry over to sound: the same object looks much the same when viewed from two different positions. Change one’s location, however, and the sound one hears may be quite different due to obstructions, distance, and other factors. This makes audio prediction far more challenging.
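The idea behind an implicit neural representation can be sketched with a small amount of code: a neural network maps a coordinate to a field value, so the scene can be queried at any continuous position rather than only at fixed grid points. The sketch below is a minimal, untrained illustration (all names and sizes are hypothetical, not from the researchers’ system); it uses the common trick of sinusoidal positional encoding so the network can fit fine detail.

```python
import numpy as np

def positional_encoding(xy, num_freqs=4):
    """Lift low-dimensional coordinates into a higher-dimensional
    sinusoidal encoding, a common ingredient of implicit neural
    representations that helps the network fit high-frequency detail."""
    feats = [xy]
    for k in range(num_freqs):
        feats.append(np.sin((2 ** k) * np.pi * xy))
        feats.append(np.cos((2 ** k) * np.pi * xy))
    return np.concatenate(feats, axis=-1)

class ImplicitField:
    """A tiny coordinate-to-value MLP: query any (x, y) point and get
    back a continuous field value (e.g. colour, density, or loudness)."""
    def __init__(self, in_dim, hidden=32, out_dim=1, seed=0):
        rng = np.random.default_rng(seed)
        self.w1 = rng.normal(0, 0.5, (in_dim, hidden))
        self.b1 = np.zeros(hidden)
        self.w2 = rng.normal(0, 0.5, (hidden, out_dim))
        self.b2 = np.zeros(out_dim)

    def __call__(self, xy):
        h = np.tanh(positional_encoding(xy) @ self.w1 + self.b1)
        return h @ self.w2 + self.b2

# Query the field at a batch of 2-D points; because the representation
# is continuous, any coordinate (not just grid points) is valid.
points = np.array([[0.1, 0.2], [0.5, 0.9], [0.33, 0.7]])
field = ImplicitField(in_dim=2 + 2 * 2 * 4)  # raw coords + sin/cos pairs
values = field(points)
print(values.shape)  # (3, 1)
```

Training such a model means fitting the network weights so that queried values match observations; the point here is only the interface, a smooth function over continuous coordinates.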
The researchers resolved this issue by including two acoustic properties in their model:
- the reciprocal nature of sound and
- the influence of local geometric elements.
Sound is reciprocal, which means that if the source of a sound and the listener swap positions, what the listener perceives does not change. Local factors, such as a barrier between the listener and the sound source, can also strongly affect what one hears. To fold these two properties into their model, which they call a neural acoustic field (NAF), the researchers augment the neural network with a grid that captures objects and architectural features in the scene, such as walls and doors. The model then randomly samples points from that grid to learn the features at specific locations.
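The two properties above can be illustrated in a toy sketch, under clearly stated assumptions: the grid, feature sizes, and network are all hypothetical stand-ins, and reciprocity is enforced here by simply averaging the network’s prediction over both orderings of emitter and listener, which is one simple way (not necessarily the researchers’ way) to guarantee a symmetric output.

```python
import numpy as np

rng = np.random.default_rng(0)

# A learnable grid of local geometric features (hypothetical 8x8 grid,
# 4 features per cell) standing in for walls, doors, and furniture.
GRID = rng.normal(0, 1, (8, 8, 4))

def grid_features(pos):
    """Look up the local feature vector at a 2-D position in [0, 1)^2
    (nearest-cell lookup; a stand-in for sampling points on the grid)."""
    i = int(pos[0] * GRID.shape[0])
    j = int(pos[1] * GRID.shape[1])
    return GRID[i, j]

# Toy network weights: input is emitter pos (2) + listener pos (2)
# + local features at each (4 + 4) = 12 values.
W = rng.normal(0, 0.1, (12, 1))

def raw_response(emitter, listener):
    """Un-symmetrised prediction from positions plus local grid features."""
    x = np.concatenate([emitter, listener,
                        grid_features(emitter), grid_features(listener)])
    return float(np.tanh(x) @ W)

def naf_response(emitter, listener):
    """Enforce acoustic reciprocity by averaging over both orderings,
    so swapping emitter and listener cannot change the output."""
    return 0.5 * (raw_response(emitter, listener) +
                  raw_response(listener, emitter))

a, b = np.array([0.2, 0.3]), np.array([0.8, 0.6])
print(np.isclose(naf_response(a, b), naf_response(b, a)))  # True
```

The grid lookup is what lets the model react to local geometry: two listener positions separated by a wall land in different grid cells and therefore receive different feature vectors.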
Researchers can feed the NAF visual information about a setting and a few spectrograms that show how a piece of audio sounds when the emitter and listener are positioned at particular locations in a room. The model then predicts how that audio would sound if the listener moved to any other location in the scene. Finally, the NAF outputs an impulse response, which captures how a sound should change as it propagates through the scene. The researchers then apply this impulse response to different sounds to hear how they should change as a person walks around a room.
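Applying an impulse response to a sound is a convolution: each sample of the dry signal is replaced by a scaled copy of the room’s response, so echoes and reverberation emerge automatically. The example below uses a hypothetical toy impulse response (a direct path plus two delayed, quieter reflections), not one produced by the researchers’ model.

```python
import numpy as np

def apply_room(dry_signal, impulse_response):
    """Render how a dry sound would be heard at a listener position by
    convolving it with the impulse response predicted for that spot."""
    return np.convolve(dry_signal, impulse_response)

# A dry unit click, and a toy impulse response: a direct path followed
# by two quieter, delayed reflections off the walls.
dry = np.zeros(100)
dry[0] = 1.0                              # unit click
ir = np.zeros(50)
ir[0], ir[20], ir[40] = 1.0, 0.5, 0.25    # direct path + two echoes

wet = apply_room(dry, ir)
# The click now arrives three times, attenuated at each reflection.
print(wet[0], wet[20], wet[40])  # 1.0 0.5 0.25
```

Because convolution is linear, the same predicted impulse response can be reused for any source audio, which is what lets one rendering of the room’s acoustics serve many different sounds.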
Furthermore, when compared with alternative methods for modelling acoustic information, the researchers’ technique consistently produced more accurate sound models. And because it learnt local geometric information, their model generalised to new locations in a scene far better than other methods. They also found that transferring the acoustic information their model learns to a computer vision model can yield a more accurate visual reconstruction of the scene.