The quality of a 3D capture depends heavily on:
Number of photos
Timing
Surface characteristics
Consistency
Lighting
Proper exposure
Image sharpness
Helper gadgets
Number of photos
123D Catch recommends 50 to 70 photos. More is generally better, but more photos also means rapidly escalating processing times. The photos need significant overlap because the 3D data is derived from trigonometric calculations on the relative changes in position of matching point sets across different photos.
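The trigonometry at the heart of this is triangulation: once the software knows where two cameras were, it can intersect the viewing rays toward a matched point to locate that point in 3D. Here is a minimal sketch in Python/NumPy; the camera positions and the point are made-up numbers, and real software also has to estimate the camera poses, which this skips:

```python
import numpy as np

def triangulate(origins, directions):
    """Least-squares intersection point of several 3D rays.

    Each ray is origin + t * direction; we solve
    sum_i (I - d_i d_i^T)(p - o_i) = 0 for the point p.
    """
    A = np.zeros((3, 3))
    b = np.zeros(3)
    for o, d in zip(origins, directions):
        d = d / np.linalg.norm(d)
        M = np.eye(3) - np.outer(d, d)  # projects onto the plane normal to d
        A += M
        b += M @ o
    return np.linalg.solve(A, b)

# Hypothetical setup: one surface point seen from two camera positions.
point = np.array([1.0, 2.0, 5.0])
cams = [np.array([0.0, 0.0, 0.0]), np.array([3.0, 0.0, 0.0])]
rays = [point - c for c in cams]  # viewing directions toward the point
est = triangulate(cams, rays)     # recovers the point from the two rays
```

With noise-free rays this recovers the point exactly; with real, noisy matches, more overlapping photos means more rays per point and a better least-squares answer, which is why overlap matters so much.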
Timing
Timing comes into play in several ways. Will the object change shape while you are taking 70 photos? Think of the quality you would get while taking 70 photos of a hyperactive terrier chasing a rabbit. There are cameras capable of 1 million frames per second, but you also have to move the camera between shots. An array of 70 cameras triggered simultaneously could capture the data, though that is an expensive option.
Lighting
Lighting should be flat and almost shadowless. You want the 3D virtual model either to compose a render or to use in an animation. If the light source in your model is doing one thing and the shadows from the light in your photo source are doing something else, it will distort or destroy the 3D illusion. Also, if some of your control points are lost in deep shadow or blown out by a bright highlight, you will lose the data needed for an accurate model. You don't want shadows in your photo sets; the shadows will come back in a good virtual model from that model's own virtual light source.
Surfaces
Anything with specular reflections can be a problem. Let's say you photograph a glass building with trees reflected in the glass. For one, the program will think there are trees inside the building. Even worse, the reflection of the trees will move from shot to shot, and some surfaces will be totally mangled. There are workarounds, but they have their own problems. You think the guy with the shiny mirror-like Ferrari will let you cover his car in dulling spray or talcum powder? Flat mirrors are not so bad: you can tape paper to the surface and add the mirror finish back into the virtual model.
Image sharpness
Artsy depth of field is not desirable while doing data acquisition. Natural control points will be hard to find if they are fuzzy, and artificial control points may not be immediately recognizable to the computer. The same problem exists with motion blur. There are always reasons for not using a tripod, but the number of usable data sets extracted from tripod-mounted cameras will be higher than from handheld sets. Also, any professional photographer knows that you can generate hurricane-force winds simply by bringing a light, easily carried tripod to a job.
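One way to cull soft shots before feeding a set to the software is a simple focus metric. The sketch below (pure NumPy on synthetic images; nothing 123D Catch does for you, just a pre-filtering idea) scores sharpness as the variance of a discrete Laplacian, which drops when edges are blurred:

```python
import numpy as np

def sharpness(img):
    """Variance of a 4-neighbor discrete Laplacian; higher = sharper edges."""
    lap = (np.roll(img, 1, 0) + np.roll(img, -1, 0)
           + np.roll(img, 1, 1) + np.roll(img, -1, 1) - 4.0 * img)
    return lap.var()

# Synthetic check: a crisp step edge versus the same edge after blurring.
crisp = np.zeros((64, 64))
crisp[:, 32:] = 1.0
blurred = crisp.copy()
for _ in range(10):  # repeated 4-neighbor averaging acts as a blur
    blurred = 0.25 * (np.roll(blurred, 1, 1) + np.roll(blurred, -1, 1)
                      + np.roll(blurred, 1, 0) + np.roll(blurred, -1, 0))
score_crisp, score_blurred = sharpness(crisp), sharpness(blurred)
```

Shooting tethered or reviewing on a laptop, you could score each frame as it comes in and reshoot anything that falls well below the set's median instead of discovering the blur after hours of processing.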
Helper gadgets
I had a chance to get a deal on some aerial photo sets to capture the topography of some property I owned. It's all covered in waist-high grass, so finding matching control points would have been a nightmare. After I cut 0.8 acres of grass with a weed whacker, I will lay out emergency rescue panels at high and low points, set up in differing configurations and colors. Hopefully, I can get the county fair pilots to orbit the property with me in the passenger seat. They have a 240 lb. passenger weight limit; I weigh 235, and the camera is another pound or so.

For small objects you can run colored tape through a hole punch and use the colored sticky dots to differentiate control points. Some practitioners of this art set up a back wall as if it were the back corner of a bounding box and put control points on that wall. When the computer is able to reconstruct the bounding box (a simple cube), it has a very good reference for positioning all points within it. You could also set up a dozen laser pointers to highlight key points on a subject and use those spots of colored light as control points.
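A bonus of that known bounding box is absolute scale: photogrammetry alone recovers shape only up to an arbitrary scale factor. If one cube edge has a known real-world length, measuring that edge in the reconstructed model gives the conversion factor for everything else. A sketch with made-up coordinates (the corner positions here are hypothetical model-space values):

```python
import numpy as np

# Hypothetical: two reconstructed corners of the back-wall bounding cube,
# known in the real world to be exactly 1 metre apart along one edge.
known_edge_m = 1.0
corner = np.array([0.12, 0.40, 2.10])   # cube corner, in model units
along_edge = np.array([0.46, 0.40, 2.10])  # corner + one edge, model units

model_edge = np.linalg.norm(along_edge - corner)
scale = known_edge_m / model_edge  # multiply any model distance by this
```

Any distance you then measure in the model, multiplied by `scale`, comes out in metres, which is what turns a pretty mesh into a usable survey of the property.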
I have to do some other thing right now but will be back to edit this post and add to it.