So what people often do in this case is first find points in the images that match then compute the best transformation matrix with least squares. The point matching is not particularly simple and often times you just use human input for this task, you have to do it all the time for calibrating cameras. Anyway, if you want to fully automate this process you can use feature extraction techniques to find matching points, there are volumes of research papers written on this topic and any standard computer vision text will have a chapter on this. Once you have N matching points, solving for the least squares transformation matrix is pretty straightforward and, again, can be found in any computer vision text, so I'll assume you got that covered.
If you don't want to find point correspondences you could directly optimize the rotation and translation using steepest descent, trouble is this is non-convex so there are no guarantees you will find the correct transformation. You could do random restarts or simulated annealing or any other global optimization tricks on top of this, that would most likely work. I can't find any references to this problem, but it's basically a digital image stabilization algorithm I had to implement it when I took computer vision but that was many years ago, here are the relevant slides though, look at "stabilization revisited". Yes, I know those slides are terrible, I didn't make them :) However, the method for determining the gradient is quite an elegant one, since finite difference is clearly intractable.
Edit: I finally found the paper that went over how to do this here, it's a really great paper and it explains the Lucas-Kanade algorithm very nicely. Also, this site has a whole lot of material and source code on image alignment that will probably be useful.