Peter's matlab code would be much helpful to you I think :
Peter has posted a number of fundamental matrix solutions. The original algorithms were mentioned in the zisserman book
Also, while you are at it don't forget to see the fundamental matrix song :
one fine composition in my honest opinion!
That seems like a massive undertaking: model recognition is not an easy task. I recommend looking at OpenCV (which has some standard algorithms you can use as a starting point) and then looking at a good computer vision book (e.g., Richard Szeliski's book or Hartley and Zisserman).
But you are going to run into a host of practical problems. Consider that systems like Vuforia provide camera calibration data for most Android devices, and it's hard to do computer vision without it. Then, of course, there's efficiently managing the whole pipeline which (again) companies like Qualcomm and Metaio invest huge amounts of $$ in.
I think that stereoCalibrate is the way to work if you are interested in the depth map and in aligning the 2 images (and I think this is an important issue even if I don't know what you're trying to do and even if you're already have a depth map from the kinect).
But, If I understand it correctly what you need you also want to find the position of the cameras in the world. You can do that by having the same known geometry in both view. This is normally achieved via a chessboard pattern that is lying in the floor, send by both (fixed position) cameras.
Once you have a known geometry 3d points and the correspective 2d points projected in the image plane you can find independently the 3d position of the camera relative to the 3d world considering the world starting in one edge of the chessboard.
In this way what you're going to achieve is something like this image:
To find the 3d position of the camera relative to the chessboards you can use the cv::solvePnP to find the extrinsic matrix for each camera independently. The are some issues about the direction of the camera (the ray pointing from the camera to the origin world) and you have to handle them (the same: independently for each camera) if you want to visualise them (like in OpenGL). Some matrix algebra and angle handling too.
For a detailed description of the math I can address you to the famous Multiple View Geometry.
See also my previous answer on augmented reality and integration between OpenCV and OpenGL (i.e. hot to use the extrinsic matrix and T and R matrixes that can be decomposed from it and that represent position and orientation of the camera in the world).
Just for curiosity: why are you using a normal camera PLUS a kinect? The kinect gives you the depth map that we are try to achieve with 2 stereo camera. I don't understand exactly what kind of data an additional normal camera can give you more then a calibrated kinect with good use of the extrinsic matrix already gives you.
PS the image is taken from this nice OpenCV introductory blog but I think that post is not much relevant to your question because that post is about intrisinc matrix and distortion parameters that seems you already have. Just to clarify.
EDIT: when you're talking about units of the extrinsic data you are normally measure them in the same unit of the 3D points of the chessboard are, so if you identify a squared chessboard edge points in 3D with P(0,0) P(1,0) P(1,1) P(0,1) and use them with solvePnP the translation of the camera will be measured in the unit of "chessboard edge size". If it is 1 meter long, the unit of measure will be meters. For the rotations, the unit are normally angles in radians, but it depends how you are extracting them with the cv::Rodrigues and how you're getting the 3 angles yawn-pitch-roll from a rotation matrix.