As mentioned, the problem is very hard and is often also referred to as multi-view object reconstruction. It is usually approached by solving the stereo-view reconstruction problem for each pair of consecutive images.
Stereo reconstruction requires pairs of images whose views overlap substantially, so that many of the same physical points are visible in both. You need to find corresponding points so that you can then use triangulation to find the 3D co-ordinates of the points.
Stereo reconstruction is usually done by first calibrating your camera setup so you can rectify your images using the theory of epipolar geometry. This simplifies finding corresponding points as well as the final triangulation calculations.
If you have:
you can calculate the fundamental and essential matrices using only matrix theory and use these to rectify your images. This requires some theory about co-ordinate projections with homogeneous co-ordinates and also knowledge of the pinhole camera model and camera matrix.
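As a rough illustration of the matrix theory involved, here is a minimal NumPy sketch. It assumes you already know the intrinsic matrices and the relative pose of the second camera, with the convention x_cam2 = R x_cam1 + t. The function names and the numbers in the example setup are made up for illustration.

```python
import numpy as np

def skew(t):
    """Return the 3x3 cross-product matrix [t]_x, so skew(t) @ v equals np.cross(t, v)."""
    tx, ty, tz = t
    return np.array([[0.0, -tz, ty],
                     [tz, 0.0, -tx],
                     [-ty, tx, 0.0]])

def essential_and_fundamental(K1, K2, R, t):
    """Compute E = [t]_x R and F = K2^-T E K1^-1 for x_cam2 = R @ x_cam1 + t."""
    E = skew(t) @ R
    F = np.linalg.inv(K2).T @ E @ np.linalg.inv(K1)
    return E, F

# Hypothetical setup: two identical cameras, the second shifted sideways.
K = np.array([[800.0, 0.0, 320.0],
              [0.0, 800.0, 240.0],
              [0.0, 0.0, 1.0]])
R = np.eye(3)
t = np.array([-0.1, 0.0, 0.0])
E, F = essential_and_fundamental(K, K, R, t)

# Project one 3D point into both views (homogeneous pixel coordinates)
# and evaluate the epipolar constraint x2^T F x1, which should be ~0.
X = np.array([0.2, -0.1, 2.0])
x1 = K @ X
x2 = K @ (R @ X + t)
residual = float(x2 @ F @ x1)
```

The residual being numerically zero is just the epipolar constraint: a point in one image must lie on the corresponding epipolar line in the other.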
If you want a method that doesn't need the camera parameters and works for unknown camera set-ups you should probably look into methods for uncalibrated stereo reconstruction.
Finding corresponding points is the tricky part that requires you to look for points of the same brightness or colour, or to use texture patterns or some other features to identify the same points in pairs of images. Techniques for this either work locally by looking for a best match in a small region around each point, or globally by considering the image as a whole.
If you already have the fundamental matrix, it will allow you to rectify the images such that corresponding points in two images will be constrained to a line (in theory). This helps you to use faster local techniques.
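To give an idea of what a local technique looks like once the images are rectified, here is a small sketch of block matching along a single image row using the sum of squared differences. It is only one of many possible matching costs, and the function name, window size and synthetic test images are my own assumptions.

```python
import numpy as np

def match_along_row(left, right, row, col, half=2, max_disp=16):
    """Return the disparity d that minimises the SSD between the patch around
    (row, col) in `left` and the patch around (row, col - d) in `right`."""
    patch = left[row - half:row + half + 1, col - half:col + half + 1]
    best_d, best_cost = 0, np.inf
    for d in range(max_disp + 1):
        c = col - d
        if c - half < 0:  # candidate window would leave the image
            break
        cand = right[row - half:row + half + 1, c - half:c + half + 1]
        cost = float(np.sum((patch - cand) ** 2))
        if cost < best_cost:
            best_d, best_cost = d, cost
    return best_d

# Synthetic rectified pair: the right image is the left one shifted by 5 pixels.
rng = np.random.default_rng(0)
left = rng.random((40, 60))
true_disp = 5
right = np.roll(left, -true_disp, axis=1)

d = match_along_row(left, right, row=20, col=30)
```

Because the search only runs along one row, the cost per pixel is linear in the disparity range. Real images with repetitive or low-texture regions will produce ambiguous matches, which is where the global methods come in.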
There is currently still no ideal technique to solve the correspondence problem, but possible approaches could fall in these categories:
For specific implementations you can use Google Scholar to search through the current literature. Here is one highly cited paper comparing various techniques:
A Taxonomy and Evaluation of Dense Two-Frame Stereo Correspondence Algorithms.
Once you have the corresponding points, you can then use epipolar geometry theory for the triangulation calculations to find the 3D co-ordinates of the points.
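For a single pair of corresponding points, the triangulation step can be sketched with the standard linear (DLT) method. This assumes you have the two 3x4 projection matrices; the camera values and helper names below are invented for the example.

```python
import numpy as np

def triangulate(P1, P2, x1, x2):
    """Linear (DLT) triangulation of one 3D point from two 3x4 projection
    matrices and the matching pixel coordinates x1 and x2."""
    A = np.array([
        x1[0] * P1[2] - P1[0],
        x1[1] * P1[2] - P1[1],
        x2[0] * P2[2] - P2[0],
        x2[1] * P2[2] - P2[1],
    ])
    # The homogeneous solution is the right singular vector belonging to
    # the smallest singular value of A.
    _, _, Vt = np.linalg.svd(A)
    Xh = Vt[-1]
    return Xh[:3] / Xh[3]

def project(P, X):
    """Project a 3D point to pixel coordinates with projection matrix P."""
    xh = P @ np.append(X, 1.0)
    return xh[:2] / xh[2]

# Hypothetical cameras: same intrinsics, second camera shifted sideways.
K = np.array([[800.0, 0.0, 320.0],
              [0.0, 800.0, 240.0],
              [0.0, 0.0, 1.0]])
P1 = K @ np.hstack([np.eye(3), np.zeros((3, 1))])
P2 = K @ np.hstack([np.eye(3), np.array([[-0.1], [0.0], [0.0]])])

X_true = np.array([0.2, -0.1, 2.0])
X_est = triangulate(P1, P2, project(P1, X_true), project(P2, X_true))
```

With noise-free correspondences the point is recovered up to floating-point error. With real, noisy matches the linear estimate is usually only a starting point for a refinement step.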
This whole stereo reconstruction would then be repeated for each pair of consecutive images (implying that you need an order to the images or at least knowledge of which images have many overlapping points). For each pair you would calculate a different fundamental matrix.
Of course, due to noise or inaccuracies at each of these steps you might want to consider how to solve the problem in a more global manner. For instance, if you have a series of images that are taken around an object and form a loop, this provides extra constraints that can be used to improve the accuracy of earlier steps using something like bundle adjustment.
As you can see, both stereo and multi-view reconstruction are far from solved problems and are still actively researched. The less you want to do in an automated manner the more well-defined the problem becomes, but even in these cases quite a bit of theory is required to get started.
If it's within the constraints of what you want to do, I would recommend considering dedicated hardware sensors (such as the Xbox's Kinect) instead of only using normal cameras. These sensors use structured light, time-of-flight or some other range imaging technique to generate a depth image, which they can also combine with colour data from their own cameras. They practically solve the single-view reconstruction problem for you and often include libraries and tools for stitching/combining multiple views.
My knowledge is actually quite thin on most of the theory, so the best I can do is to further provide you with some references that are hopefully useful (in order of relevance):
I'm not sure how helpful all of this is, but hopefully it includes enough useful terminology and references to find further resources.
Peter's MATLAB code should be very helpful to you, I think:
Peter has posted a number of fundamental matrix solutions. The original algorithms are described in the Hartley and Zisserman book.
Also, while you are at it, don't forget to check out the fundamental matrix song:
A fine composition, in my honest opinion!
That seems like a massive undertaking: model recognition is not an easy task. I recommend looking at OpenCV (which has some standard algorithms you can use as a starting point) and then looking at a good computer vision book (e.g., Richard Szeliski's book or Hartley and Zisserman).
But you are going to run into a host of practical problems. Consider that systems like Vuforia provide camera calibration data for most Android devices, and it's hard to do computer vision without it. Then, of course, there's efficiently managing the whole pipeline which (again) companies like Qualcomm and Metaio invest huge amounts of $$ in.
I think stereoCalibrate is the way to go if you are interested in the depth map and in aligning the two images (and I think this is an important issue, even though I don't know what you're trying to do and even if you already have a depth map from the Kinect).
But if I understand correctly, you also want to find the position of the cameras in the world. You can do that by having the same known geometry in both views. This is normally achieved with a chessboard pattern lying on the floor, seen by both (fixed-position) cameras.
Once you have the 3D points of a known geometry and the corresponding 2D points projected onto the image plane, you can independently find the 3D position of each camera relative to the world, taking the world origin to be at one corner of the chessboard.
In this way what you're going to achieve is something like this image:
To find the 3D position of the camera relative to the chessboard, you can use cv::solvePnP to find the extrinsic matrix for each camera independently. There are some issues with the direction of the camera (the ray pointing from the camera to the world origin), and you have to handle them (again, independently for each camera) if you want to visualise the cameras (for example in OpenGL). This takes some matrix algebra and angle handling too.
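The matrix algebra for getting the camera position out of the extrinsics is short. Here is a sketch assuming the usual convention x_cam = R X_world + t (the one solvePnP uses) and a camera that looks along its own +Z axis. The rotation matrix is built by hand rather than taken from solvePnP, and the function name is made up.

```python
import numpy as np

def camera_pose_in_world(R, t):
    """Given extrinsics with x_cam = R @ X_world + t, return the camera
    centre and the viewing direction, both in world coordinates."""
    centre = -R.T @ t                          # the point that maps to the camera origin
    forward = R.T @ np.array([0.0, 0.0, 1.0])  # camera +Z axis expressed in the world frame
    return centre, forward

# Hypothetical camera: 2 units along world Z, rotated 180 degrees about Y
# so that it looks back at the world origin (the chessboard corner).
R = np.diag([-1.0, 1.0, -1.0])
C_true = np.array([0.0, 0.0, 2.0])
t = -R @ C_true

centre, forward = camera_pose_in_world(R, t)
```

Note that the translation vector t is not the camera position itself. It is the world origin expressed in camera coordinates, which is a common source of confusion when handing the pose to a renderer.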
For a detailed description of the math, I can refer you to the famous Multiple View Geometry.
See also my previous answer on augmented reality and integration between OpenCV and OpenGL (i.e. how to use the extrinsic matrix and the T and R matrices that can be decomposed from it, which represent the position and orientation of the camera in the world).
Just out of curiosity: why are you using a normal camera PLUS a Kinect? The Kinect gives you the depth map that we are trying to achieve with two stereo cameras. I don't understand exactly what data an additional normal camera can give you beyond what a calibrated Kinect, with good use of the extrinsic matrix, already gives you.
PS: the image is taken from this nice OpenCV introductory blog, but I think that post is not very relevant to your question, because it is about the intrinsic matrix and distortion parameters, which it seems you already have. Just to clarify.
EDIT: as for the units of the extrinsic data, they are normally measured in the same unit as the 3D points of the chessboard. So if you identify the corner points of a chessboard square in 3D as P(0,0), P(1,0), P(1,1), P(0,1) and use them with solvePnP, the translation of the camera will be measured in units of "chessboard square edge length". If the edge is 1 meter long, the unit of measure will be meters. For the rotations, the angles are normally in radians, but it depends on how you extract them with cv::Rodrigues and how you get the three yaw-pitch-roll angles from the rotation matrix.
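To make the point about angles concrete, here is a NumPy-only sketch of the conversion that cv::Rodrigues performs (rotation vector to rotation matrix), followed by one possible yaw-pitch-roll extraction. The Z-Y-X convention, R = Rz(yaw) Ry(pitch) Rx(roll), is my assumption; other conventions give different angles for the same matrix, and the function names are made up.

```python
import numpy as np

def rodrigues_to_matrix(rvec):
    """Convert a rotation vector (axis * angle, angle in radians) to a 3x3
    rotation matrix using Rodrigues' formula."""
    theta = np.linalg.norm(rvec)
    if theta < 1e-12:
        return np.eye(3)
    kx, ky, kz = rvec / theta
    K = np.array([[0.0, -kz, ky],
                  [kz, 0.0, -kx],
                  [-ky, kx, 0.0]])
    return np.eye(3) + np.sin(theta) * K + (1.0 - np.cos(theta)) * (K @ K)

def yaw_pitch_roll(R):
    """Extract Z-Y-X Euler angles in radians, assuming
    R = Rz(yaw) @ Ry(pitch) @ Rx(roll)."""
    yaw = np.arctan2(R[1, 0], R[0, 0])
    pitch = np.arctan2(-R[2, 0], np.hypot(R[2, 1], R[2, 2]))
    roll = np.arctan2(R[2, 1], R[2, 2])
    return yaw, pitch, roll

# Hypothetical rotation vector: a quarter turn about the Z axis.
rvec = np.array([0.0, 0.0, np.pi / 2])
R = rodrigues_to_matrix(rvec)
yaw, pitch, roll = yaw_pitch_roll(R)
angle_deg = np.degrees(yaw)  # convert only for display
```

The length of the rotation vector is the rotation angle in radians, so everything stays in radians until you explicitly convert it. This extraction becomes ill-conditioned when the pitch is close to ±90 degrees.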