Researchers at the University of Texas at Austin have used generative artificial intelligence to transform street audio recordings into street-view images that are surprisingly close to reality.
Published in Computers, Environment and Urban Systems, their study describes the training of a sound-to-image AI model on audio collected across a variety of urban and rural landscapes. Once trained, the model can generate novel images from sound alone.
Their study shows that acoustic environments carry enough information about a place's visual character to reconstruct it relatively accurately. In other words, the scientists can convert soundscapes into visual representations of the places where they were recorded.
To achieve this, they gathered audio and video from YouTube footage of city streets and country roads in North America, Asia and Europe. This enabled the team to build pairs of 10-second audio clips and still images of the corresponding locations to train their model, roughly as sketched below.
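As an illustration only, here is a minimal Python sketch of how such audio-image training pairs might be assembled from downloaded footage, assuming ffmpeg is installed. The file names, clip boundaries and audio settings are hypothetical and do not reflect the authors' actual pipeline.

```python
# Hypothetical sketch: pair a 10-second audio clip with a still frame
# taken from the middle of the same window, using ffmpeg.
import subprocess
from pathlib import Path

def make_pair(video: Path, start: float, out_dir: Path, clip_len: float = 10.0):
    """Cut a clip_len-second audio clip starting at `start` seconds,
    and grab the frame at the clip's midpoint as the paired image."""
    out_dir.mkdir(parents=True, exist_ok=True)
    stem = f"{video.stem}_{int(start):06d}"
    audio = out_dir / f"{stem}.wav"
    image = out_dir / f"{stem}.jpg"
    # Extract mono 16 kHz audio for the clip window.
    subprocess.run(
        ["ffmpeg", "-y", "-ss", str(start), "-t", str(clip_len),
         "-i", str(video), "-vn", "-ac", "1", "-ar", "16000", str(audio)],
        check=True,
    )
    # Grab a single frame from the middle of the same window.
    subprocess.run(
        ["ffmpeg", "-y", "-ss", str(start + clip_len / 2),
         "-i", str(video), "-frames:v", "1", str(image)],
        check=True,
    )
    return audio, image

# Example: slice one downloaded video into consecutive 10-second pairs.
for t in range(0, 60, 10):
    make_pair(Path("street_scene.mp4"), float(t), Path("pairs"))
```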
The trained model could then create high-resolution images from an audio clip alone. These generated images were compared with the real images corresponding to the same sounds, and assessed on the proportions of greenery, buildings and sky they contained. Initial results showed a fairly strong correlation between the real and AI-generated images in the proportions of sky and greenery; a comparison of this kind might be computed as in the sketch below.
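The paper's exact evaluation code is not given here; the following is a hypothetical Python sketch of comparing real and generated images by pixel shares of each class. It assumes each image has already been run through a semantic-segmentation model that outputs an integer label per pixel, and the label IDs are illustrative.

```python
# Hypothetical sketch: correlate the fraction of pixels labelled sky,
# greenery, or building between real and AI-generated image pairs.
import numpy as np

LABELS = {"sky": 0, "greenery": 1, "building": 2}  # illustrative label IDs

def class_fractions(mask: np.ndarray) -> dict:
    """Return the share of pixels carrying each label of interest."""
    total = mask.size
    return {name: float((mask == lab).sum()) / total
            for name, lab in LABELS.items()}

def proportion_correlation(real_masks, generated_masks, name: str) -> float:
    """Pearson correlation of one class's pixel share across image pairs."""
    real = [class_fractions(m)[name] for m in real_masks]
    gen = [class_fractions(m)[name] for m in generated_masks]
    return float(np.corrcoef(real, gen)[0, 1])

# Example with random stand-in masks (real use would load model outputs).
rng = np.random.default_rng(0)
real = [rng.integers(0, 3, (128, 128)) for _ in range(20)]
gen = [rng.integers(0, 3, (128, 128)) for _ in range(20)]
for name in LABELS:
    print(name, round(proportion_correlation(real, gen, name), 3))
```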
The results were somewhat less impressive for buildings. Notably, the model could even tell whether the sounds were recorded during the day or at night, drawing on cues such as traffic noise in cities or insect sounds in the countryside.
The aim of this study is to examine the potential of AI to capture the features that give cities their distinctive identity and, more generally, to study how humans interact with their environment. – AFP Relaxnews