Where Was This Photo Taken? AI Knows Instantly

Imagine that you are playing a new, slightly modified version of the game GeoGuessr. Before you is a picture of an average American house, perhaps two stories with a front garden in a cul-de-sac and an American flag flying proudly out front. But there’s nothing particularly special about this house, nothing that tells you what state it’s in or where its owners are from.

You have two tools at your disposal: your mind, and 44,416 low-resolution, overhead-perspective images of random places across the United States, each with its associated location data. Can you match the house to an aerial photo and pinpoint it correctly?

Of course I couldn’t, but a new machine learning model probably could. The software, created by researchers at the China University of Petroleum (East China), searches a database of remote sensing images with associated location information to match a street-level image (of a house, a commercial building, or anything else that can be photographed from the road) to an aerial image in the database. While other systems can do the same thing, this one is pocket-sized by comparison and extremely accurate.

At its best (when given an image with a 180-degree field of view), it succeeds up to 97 percent of the time in the first stage of narrowing down a location, which is better than or within two percentage points of every other model available for comparison. Even under less-than-ideal conditions, it outperforms many competitors. When it pins down a specific location, it is correct 82 percent of the time, within three percentage points of the other models.

What sets this model apart is its speed and memory savings. It’s at least twice as fast as its counterparts and uses less than a third of the memory they need, according to the researchers. That combination makes it valuable for applications in navigation systems and the defense industry.

“We train the AI to ignore superficial differences in perspective and focus on extracting the same ‘key features’ from both viewpoints, turning them into a simple common language,” says Peng Ren, who develops machine learning and signal processing algorithms at the China University of Petroleum (East China).

The program is based on a method called deep cross-view hashing. Instead of trying to compare every pixel of a street-level image to every single image in a giant database, this method relies on hashing: converting a piece of data (in this case, a street-level or aerial image) into a short, nearly unique string of numbers.
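To make the idea concrete, here is a minimal sketch of hashing-based matching in Python. It is not the authors’ model: the fixed random projection stands in for a learned deep network, and the images are synthetic arrays. What it shows is the core trick of reducing each image to a short binary code and comparing codes by counting differing bits (the Hamming distance).

```python
import numpy as np

# One shared projection plays the role of the learned encoder: both image
# types must pass through the same transformation for codes to be comparable.
rng = np.random.default_rng(0)
N_BITS, IMG_PIXELS = 64, 32 * 32
PROJ = rng.standard_normal((N_BITS, IMG_PIXELS))

def toy_hash(image):
    """Map a flattened image to a 64-bit binary code (0s and 1s)."""
    return (PROJ @ image.ravel() > 0).astype(np.uint8)

def hamming(a, b):
    """Count differing bits; a small count suggests a likely match."""
    return int(np.count_nonzero(a != b))

street = rng.random(IMG_PIXELS)                       # stand-in street image
aerial_same = street + 0.05 * rng.random(IMG_PIXELS)  # similar scene
aerial_other = rng.random(IMG_PIXELS)                 # unrelated scene

print(hamming(toy_hash(street), toy_hash(aerial_same)))   # few bits differ
print(hamming(toy_hash(street), toy_hash(aerial_other)))  # ~half the bits differ
```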

To do this, the China University of Petroleum research group uses a type of deep learning model called a vision transformer, which splits images into small patches and finds patterns among them. The model might spot, within an image, what it was trained to identify as a tall building, a circular fountain, or a roundabout, and then encode those findings into strings of numbers. ChatGPT is based on a similar architecture, but finds patterns in text rather than images. (The “T” in “GPT” stands for “transformer.”)
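The sketch below illustrates that pipeline in PyTorch. All layer sizes are chosen for illustration rather than taken from the paper: a convolution slices the image into patch tokens, a transformer encoder finds patterns among them, and a final layer binarizes a pooled feature into the “string of numbers” that serves as the hash code.

```python
import torch
import torch.nn as nn

class ToyViTHasher(nn.Module):
    """Illustrative only: patchify an image, mix the patches with a
    transformer, then binarize a pooled feature into a hash code."""
    def __init__(self, patch=16, dim=256, n_bits=64):
        super().__init__()
        self.patchify = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)
        self.to_bits = nn.Linear(dim, n_bits)

    def forward(self, x):                         # x: (B, 3, 224, 224)
        p = self.patchify(x)                      # (B, dim, 14, 14) patch grid
        tokens = p.flatten(2).transpose(1, 2)     # (B, 196, dim) patch tokens
        feats = self.encoder(tokens).mean(dim=1)  # patterns across patches
        return torch.sign(self.to_bits(feats))    # +/-1 bits as the code

codes = ToyViTHasher()(torch.randn(2, 3, 224, 224))  # two images -> two codes
print(codes.shape)  # torch.Size([2, 64])
```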

Hongdong Li, who studies computer vision at the Australian National University, says the number representing each image is like a fingerprint. The code captures distinctive features of each photo, which allows the geolocation process to quickly narrow down potential matches.

In the new system, the code associated with a given ground-level image is compared to the codes of all aerial images in the database (for testing, the team used satellite images of the United States and Australia), yielding the five closest candidate aerial matches. The coordinates of those candidates are then averaged, using a technique that gives more weight to locations that cluster together, to reduce the influence of outliers, and an estimated location is returned for the street-level image.
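A simple version of that retrieval-and-averaging step might look like the following. The outlier down-weighting here uses a median-distance heuristic of my own choosing; the paper’s exact weighting scheme may differ.

```python
import numpy as np

def locate(query_code, db_codes, db_latlon, k=5):
    """Return a weighted average of the k nearest candidates' coordinates,
    down-weighting candidates that sit far from the group."""
    dists = np.count_nonzero(db_codes != query_code, axis=1)  # Hamming to all
    top = np.argsort(dists)[:k]                               # five closest
    cands = db_latlon[top]                                    # (k, 2) lat/lon
    center = np.median(cands, axis=0)                         # robust center
    spread = np.linalg.norm(cands - center, axis=1)
    w = 1.0 / (spread + 1e-6)                                 # nearer = heavier
    return (w[:, None] * cands).sum(axis=0) / w.sum()

rng = np.random.default_rng(1)
db_codes = rng.integers(0, 2, size=(44_416, 64), dtype=np.uint8)
db_latlon = np.column_stack([rng.uniform(25, 49, 44_416),     # rough US lat
                             rng.uniform(-124, -67, 44_416)]) # rough US lon
print(locate(db_codes[123], db_codes, db_latlon))  # estimated (lat, lon)
```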

The new geolocation method was published last month in IEEE Transactions on Geoscience and Remote Sensing.

Fast and memory efficient

“Although this research does not represent a completely new paradigm, it represents a clear advance in the field,” Li says. And because versions of this problem have been solved before, some experts, such as computer scientist Nathan Jacobs of Washington University in St. Louis, are unenthusiastic. “I don’t think this paper is particularly groundbreaking,” he says.

But Li disagrees with Jacobs: He believes the approach is innovative in its use of hashing to make finding matching images faster and more memory-efficient than traditional techniques. It uses just 35 megabytes, while the next-smallest model examined by Ren’s team required 104 megabytes, about three times as much space.

The researchers also claim their method is at least twice as fast as the next-fastest one. When matching street-level images to the U.S. aerial dataset, the runner-up took about 0.005 seconds per match; the petroleum group’s system found a location in about 0.0013 seconds, almost four times faster.
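Part of why hashing is so fast and compact is that binary codes pack into machine words: the codes for an entire test database fit in well under a megabyte, and each comparison is one XOR plus a bit count. (The researchers’ 35-megabyte figure presumably covers the model itself, not just the stored codes.) A rough illustration:

```python
import numpy as np

rng = np.random.default_rng(2)
bits = rng.integers(0, 2, size=(44_416, 64), dtype=np.uint8)   # one code per image
packed = np.packbits(bits, axis=1).view(np.uint64).ravel()     # one word per code
print(f"{packed.nbytes / 1024:.0f} KB for the whole database") # ~347 KB

query = packed[0]
distances = np.bitwise_count(packed ^ query)  # XOR + popcount; needs NumPy >= 2.0
print(distances.argmin())                     # 0: the query matches itself
```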

“As a result, our method is more efficient than traditional image geolocation techniques,” says Ren. Li finds these claims credible: hashing “is a well-established path to speed and compactness, and the reported results are consistent with theoretical predictions,” he says.

Although these efficiencies look promising, more work is needed to ensure the method succeeds at a large scale, Li says. The group did not fully account for real-world challenges, such as seasonal variation or clouds obscuring an image, which could affect the robustness of geolocation matching. This limitation could be overcome by supplying images from more widely distributed locations, Ren says.

Still, experts say the longer-term applications (beyond a souped-up game of GeoGuessr) are worth considering now.

There are some lighthearted uses for efficient photo geolocation, such as automatically geotagging old family photos, Jacobs says. On the more serious side, navigation systems could also make use of geolocation methods like this one: if a self-driving car’s GPS goes down, another way to quickly and accurately find its location could be valuable, Jacobs says. Li adds that the technique could play a role in emergency response within the next five years.

There may also be applications in defense systems. The 2011 Finder project from the Office of the Director of National Intelligence aimed to help intelligence analysts extract as much information as possible from images lacking metadata, using reference data from sources including overhead imagery, a goal that could be served by models similar to this new geolocation method.

Jacobs puts the defense application in context: If a government agency obtains a photo of a terrorist training camp with no metadata, how can its location be determined quickly and efficiently? Deep cross-view hashing might be of some help.
