For years now I have used Structure from Motion software, most prominently Microsoft's Photosynth, to find the spatial relationships between a group of photos taken in the same environment.
A brief overview of the generic Structure from Motion workflow, for the purposes of this discussion, is as follows:
- Run a filter on each input photo to detect the sorts of points of interest which are likely to be recognizable from nearby, but certainly different, perspectives, and create a descriptor for each feature in a given image. (These features will be found at various sizes and orientations in the image and are expected to be invariant to the resolution at which they are observed, lighting conditions, rotation, etc.)
- Since the computer does not know which images are likely to match each other, pairwise matching is used. This is to say, every image will be matched to every other image. (More technically, the list of feature descriptors for every image will be compared and matched to the list of feature descriptors for all other images.) (If we imagine that each feature discovered and described in a given image is a uniquely colored or numbered thumbtack, you can think of this second step as essentially the process of trying to find any given thumbtack amongst the thumbtacks in all other photos.)
- Scene reconstruction or 3D reconstruction is the third step. This is the actual structure from motion problem, whereas the previous steps are simply a way of automating the identification of the same points in multiple photos. Every feature which is believed to have been identified in a chosen minimum number of photos (usually 3) will be represented by a point in 3D space, and every photo will get added to the 3D scene one at a time, prioritized by how many features are believed to be shared between images.
The positions of the points in 3D space and the camera positions will be refined with each added photo to satisfy the majority consensus. Essentially, as you place points in 3D space and project them back onto the flat image plane in each camera frustum (representing each photo), you test how far away, in 2D space on the image plane, your proposed 3D point lands from where you know you observed it in Step 1. The computer then adjusts the camera position for that photo, the 3D points, or both until it minimizes the energy/distance between the feature's observed position and the resulting position of the proposed solution.
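That refinement step can be illustrated with a toy calculation. The sketch below uses a deliberately simplified pinhole camera sitting at the origin with no rotation; the function names and point values are mine, made up purely for illustration, and are not taken from any real SfM implementation:

```python
# A toy sketch of the reprojection error that Step 3 tries to minimize.
# Assumption for illustration: a pinhole camera at the origin, looking
# down +z, with no rotation.
def project(x, y, z, focal_length=1.0):
    # Perspective divide: map a 3D point onto the 2D image plane.
    return (focal_length * x / z, focal_length * y / z)

observed = (0.5, 0.25)           # where the feature was observed in Step 1
px, py = project(1.0, 0.5, 2.0)  # where the proposed 3D point reprojects to
error = ((px - observed[0]) ** 2 + (py - observed[1]) ** 2) ** 0.5
print(error)  # 0.0 here, because this toy point reprojects exactly onto its observation
```

The solver sums this distance over every observation of every feature and nudges the cameras and the 3D points until the total shrinks no further.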
That is probably much too long of an introduction, but in any case what this blog post is primarily about is the second step: pairwise matching.
For years, I simply assumed that a 2D graph of the image comparisons necessary to be performed would be flat.
That is to say, I thought that for a batch of 200 photos, each photo would be compared to the other 199 (which is technically true), so I expected that if you plotted the photos along the x axis and the number of comparisons each one performs on the y axis, you would see a flat line of 199 across the board.
What I had not given a great deal of thought to until recently was that the total number of photo comparisons to be performed is not nearly the square of the number of input photos, but rather about half of that.
What I am attempting to say is that rather than the formula describing the total number of photo comparisons looking like this:
N * (N-1) = T
where N is the number of input photos and T is the total photo comparisons
instead, it ought to be half of that number, like so:
N * (N-1) / 2 = T
The reason is that, for any given pair of images, you only wish to compare their lists of features once. In other words, once you have compared Photo 25 to Photo 80, you will get absolutely no new data from comparing Photo 80 to Photo 25. You already know which features they are likely (or not) to share.
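As a quick sanity check of that formula, a few lines of Python (the function name is mine, not from any SfM package):

```python
def total_comparisons(n):
    # Each unordered pair of photos (i, j) with i < j is compared exactly once.
    return n * (n - 1) // 2

print(total_comparisons(200))  # 19900 unique pairs, not 200 * 199 = 39800
```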
So if we think back to the graph that I was imagining earlier, the comparisons which we want our program to perform will look more like an isosceles right triangle.
There are two ways to go about this.
We could begin with the brunt of the workload up front and compare our first image to every other image.
This would mean that the second image would be compared to every photo except the first photo (since they’ve already been compared).
Likewise, the third image would be compared to every image except photos 1 and 2.
In our 200 photo example, this means that with this method, by the time you reach Photo 200, it will have already been compared to every other image.
Working the other way around, you could begin by comparing each image to every preceding image. In other words:
Photo 1 does not get compared with anything,
Photo 2 gets compared only with Photo 1,
Photo 3 gets compared with Photo 2 and Photo 1,
Photo 4 gets compared with Photos 3, 2, and 1,
and so forth until Photo 200 will be compared with all 199 preceding photos.
Whichever end we begin at, we can still see that within the square graph of image combinations, our matches to be performed will begin in one corner and finish in the diagonally opposite corner with every cell on one side of that diagonal line destined to be filled in as that intersection of images is matched.
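The two orderings can be sketched as nested loops. The names below are my own, purely illustrative; the point is that both methods enumerate exactly the same set of pairs, and only the order of the work differs:

```python
def pairs_front_loaded(n):
    # Method 1: photo i is compared to every photo that comes after it.
    return [(i, j) for i in range(n) for j in range(i + 1, n)]

def pairs_preceding(n):
    # Method 2: photo j is compared to every photo that precedes it.
    return [(i, j) for j in range(n) for i in range(j)]

assert set(pairs_front_loaded(200)) == set(pairs_preceding(200))
assert len(pairs_preceding(200)) == 200 * 199 // 2  # 19900 pairs either way
```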
The second method may have a small benefit that if you wish to add more images later, you can re-use your code by telling all new images to compare themselves to every preceding photo, rather than needing to know the total number of images at the beginning.
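That reuse is easy to see in code. This sketch (again, my own hypothetical naming) computes only the pairs that a late addition of photos introduces:

```python
def new_pairs(existing_count, new_count):
    # Each newly added photo is compared with every photo that precedes it,
    # including the other new photos; nothing already matched is redone.
    total = existing_count + new_count
    return [(i, j) for j in range(existing_count, total) for i in range(j)]

# e.g. adding 50 photos to a fully matched 200-photo set:
extra = new_pairs(200, 50)
assert len(extra) == (250 * 249 // 2) - (200 * 199 // 2)  # 11225 new comparisons
```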
I had seen an alternative to Photosynth called VisualSfM for years, but was not able to get it running on my computers until recently. It uses GPGPU hardware acceleration to significantly speed up parts of the 3 steps I mentioned at the beginning of the post.
While I find it to have excellent speed at detecting and describing features (Step 1) and at 3D reconstruction (Step 3), its pairwise matching unfortunately appears to take significantly longer than the same images take in Photosynth, and I have found myself waiting weeks for it to finish.
I am partially to blame for that duration, simply because I am matching many more images than most people would attempt to fit into a single reconstruction.
It is also important to understand that Photosynth will create a lower resolution proxy for each image to look for features in. Lower resolution versions of the photos mean fewer image features per photo which in turn means less time necessary to compare the feature descriptors from one image to the feature descriptors of others.
I will admit that I am currently computing matches for the full resolution photos from my current project, which I would expect to be slower than Photosynth even if it were using the exact same matching code, simply because of the resolution at which features are being detected, per the previous paragraph. However, I also tried running a very low resolution copy of the entire image set in both Photosynth and VisualSfM and found VisualSfM to still be significantly slower than Photosynth at this one step. That low resolution copy was lower resolution than what Photosynth downsizes to, so it ought to have been a completely fair test. Not only that, but I was running Photosynth on a much less powerful machine, so my hat goes off to Drew Steedly and David Nister at Microsoft for the speed they achieved on pairwise matching.
The point of all of this preamble is that, as I wait the many hours necessary to compute the matches between the full resolution copies of these photos, it would be possible to dump in all 6,000+ photos and just walk away, letting VisualSfM run for over two weeks. For my own encouragement, though, I have broken the job up into smaller pieces, just so that I can see those smaller pieces completed and feel as though progress is actually happening.
My photos for the project reside in 8 folders.
I have already compared the photos in each of the folders to themselves, but when you think about each folder’s photos still needing to match all the photos outside their folder, it becomes clear that this is just the beginning.
At the time of this writing, I have successfully compared all photos in folders 1 through 5 to each other, which is good progress, but (deceptively) not yet 50% of the total comparisons to be performed. This is because the later folders of images will be compared to more images than any of the images thus far because VisualSfM uses the ‘compare each image with all preceding images’ method.
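To see why five folders out of eight falls short of 50%, assume for the sake of arithmetic that all eight folders were the same size (they are not; the folder size below is a hypothetical chosen to land near the 6,000+ photo total):

```python
def pairs(n):
    return n * (n - 1) // 2

photos_per_folder = 750  # hypothetical: 8 folders x 750 = the 6,000+ photo ballpark
done = pairs(5 * photos_per_folder)   # comparisons within folders 1 through 5
total = pairs(8 * photos_per_folder)  # comparisons across all 8 folders
print(f"{done / total:.0%}")  # roughly 39%, despite 5 of 8 folders being finished
```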
My current project is the house where my youngest brother and his wife were married earlier this year and is primarily composed of the back and front yards of that house. I took still photos before the wedding with the aim of building the structure of the exterior of the house where the wedding was taking place so that I could collect the photos and videos from other guests and match them all in 3D.
The first three folders of images, taken by me on the Wednesday afternoon and evening before the Saturday afternoon wedding, are almost exclusively taken in the back yard where the wedding ceremony was to be held.
Folders 4, 5, 6, and 7 were all taken by me early on Saturday morning, before everyone had arrived to finish decorating before the ceremony. Folders 4 and 5 are mostly taken in the back yard in a much closer state to the wedding than Wednesday had been, and Folders 6 and 7 are very nearly exclusively front yard shots.
Folders 3 and 5 each feature one trip around the house, one clockwise and the other counter-clockwise, to bridge the two environments.
Folder 8 is where I am collecting photos taken by other wedding guests, which are divided into Back Yard, Front Yard, and Kitchen/Dining Room/Living Room.
After finishing the matches of everything in the first 5 folders and letting VisualSfM begin chewing on folder 6 for a while, I stopped that for a little while and decided to compare all of the interior and back yard shots in Folder 8 to Folders 1 through 5 (since they are all primarily Back Yard shots).
I am nearing completion of that little side step (which also means I have a significant portion of Folder 8's matches nearly finished).
I have also already compared the images in folders 6 and 7 to each other, as well as the Front Yard shots in Folder 8.
What I have left to match are Folders 6, 7, and 8 to everything that came before them.
At least… that is what would happen if this was Photosynth.
Part of me knows that most Back Yard shots will not match any Front Yard shots or vice versa.
Neither are Interior shots likely to match Front Yard shots. (However, the door to the back yard, which opens from the Dining Room, is a large glass door, and that wall also has several windows which you can see through from both sides, so matching Interior shots to Back Yard shots is worthwhile, even if a bit of a long shot in some cases.)
The exception to this is that because the back yard has a steep hill at the back of the yard (which I climbed and took photos from along the back fence), you can see over the roof of the house and across the street to the fronts of the houses on the other side, and those same houses are definitely visible in my Front Yard photos. Likewise, the trees growing on the back hill behind the house are visible in the front yard photos. I would hate to miss out on those matches.
In this entire process, I have been wishing that I were able to visually represent what progress I have made on the matching process.
I would be satisfied with that simple square graph with half the cells colored to represent matches to be completed and then within that selection, coloring cells a second color to represent comparisons which have already been computed.
The key feature for me is being able to specify ranges of image comparisons which may have been completed out of sequence and have them be correctly colored, without my needing to make time-consuming manual selections to apply color to.
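A minimal sketch of what I have in mind, with hypothetical names of my own and an ASCII grid standing in for the colored cells:

```python
def progress_grid(n, done_ranges):
    # done_ranges: inclusive ((i_lo, i_hi), (j_lo, j_hi)) blocks of completed
    # comparisons, which may have been finished out of sequence.
    grid = [[" "] * n for _ in range(n)]
    for j in range(n):
        for i in range(j):
            grid[i][j] = "."  # a comparison still to be performed
    for (i_lo, i_hi), (j_lo, j_hi) in done_ranges:
        for j in range(j_lo, j_hi + 1):
            for i in range(i_lo, min(i_hi, j - 1) + 1):
                grid[i][j] = "#"  # a comparison already computed
    return "\n".join("".join(row) for row in grid)

# e.g. a 10-photo batch where photos 0 through 4 have all been matched to each other:
print(progress_grid(10, [((0, 4), (0, 4))]))
```

Marking an out-of-sequence block of finished work is then just one more entry in `done_ranges`, with no manual cell selection required.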