Elements of Computer Vision: Multiple View Geometry.

§ 4. Two-View Geometry

The two-view geometry is the relative geometry of two different perspective views of the same 3-D scene (see Figure 4). It is usually referred to as epipolar geometry.

The two perspective views may be acquired simultaneously, for example in a stereo rig, or sequentially, for example by a moving camera. From the geometric viewpoint, the two situations are equivalent, provided that the scene does not change between successive snapshots.

We consider 3-D scene points that are visible in both views simultaneously. This is not true in the case of occlusions, i.e., points visible in only one camera. Any unoccluded 3-D scene point M = (x, y, z, 1)^T is projected to the left and right view as m = (u, v, 1)^T and m_r = (u_r, v_r, 1)^T, respectively (see Figure 4).

Image points m and mr are called corresponding points (or conjugate points) as they represent projections of the same 3-D scene point M.

The knowledge of image correspondences enables scene reconstruction from images. The concept of correspondence is a cornerstone of multiple-view vision. In these notes we assume known correspondences, and explore their use in geometric algorithms. Techniques for computing dense correspondences are surveyed in Scharstein and Szeliski (2002) and Brown et al. (2003).

Figure 4. Two perspective views of the same 3-D scene with two conjugate points singled out.

We will refer to the camera projection matrix of the left view as P and to that of the right view as P_r. The 3-D point M is then imaged according to Equation (20) in the left view and Equation (21) in the right view:

ζ m = P M (20)
ζ_r m_r = P_r M. (21)
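In code, Eqs. (20) and (21) amount to a matrix-vector product followed by division by the projective depth ζ. A minimal numpy sketch (all numeric values below are hypothetical):

```python
import numpy as np

# Hypothetical left camera: intrinsics K, extrinsics [R | t]
K = np.array([[800.0, 0.0, 320.0],
              [0.0, 800.0, 240.0],
              [0.0, 0.0, 1.0]])
R = np.eye(3)
t = np.array([[0.1], [0.0], [0.0]])
P = K @ np.hstack([R, t])              # 3x4 camera projection matrix

M = np.array([0.5, -0.2, 4.0, 1.0])    # 3-D point, homogeneous coordinates
zeta_m = P @ M                         # Eq. (20): zeta * m = P M
zeta = zeta_m[2]                       # projective depth
m = zeta_m / zeta                      # image point m = (u, v, 1)^T
```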

Geometrically, the position of the image point m in the left image plane I can be found by drawing the optical ray through the left camera projection centre C and the scene point M. The ray intersects the left image plane I at m.

Similarly, the optical ray connecting Cr and M intersects the right image plane Ir at mr.

The relationship between image points m and mr is given by the epipolar geometry, described in Section 4.1.

§ 4.1. Epipolar Geometry

The epipolar geometry describes the geometric relationship between two perspective views of the same 3-D scene.

The key finding, discussed below, is that corresponding image points must lie on particular image lines, which can be computed without information on the calibration of the cameras.

This implies that, given a point in one image, one can search the corresponding point in the other along a line and not in a 2-D region, a significant reduction in complexity.

Figure 5. The epipolar geometry and epipolar constraint.

Any 3-D point M and the camera projection centres C and Cr define a plane that is called epipolar plane.

The projections of the point M, image points m and mr, also lie in the epipolar plane since they lie on the rays connecting the corresponding camera projection centre and point M.

The conjugate epipolar lines, l and lr, are the intersections of the epipolar plane with the image planes. The line connecting the camera projection centres C,Cr is called the baseline.

The baseline intersects each image plane in a point called epipole.

By construction, the left epipole e is the image of the right camera projection centre Cr in the left image plane. Similarly, the right epipole er is the image of the left camera projection centre C in the right image plane.

All epipolar lines in the left image go through e and all epipolar lines in the right image go through er.

§ 4.1.1. The epipolar constraint.

An epipolar plane is completely defined by the camera projection centres and one image point.

Therefore, given a point m, one can determine the epipolar line in the right image on which the corresponding point, mr, must lie.

The equation of the epipolar line can be derived from the equation describing the optical ray. As mentioned before, the right epipolar line corresponding to m geometrically represents the projection (Eq. (8)) of the optical ray through m (Eq. (14)) onto the right image plane:

\zeta_r m_r = P_r M = \underbrace{P_r \begin{bmatrix} -P_{1:3}^{-1} P_4 \\ 1 \end{bmatrix}}_{e_r} + \zeta \, P_r \begin{bmatrix} P_{1:3}^{-1} m \\ 0 \end{bmatrix} (22)

If we now simplify the above equation we obtain the description of the right epipolar line:

\zeta_r m_r = e_r + \zeta \, \underbrace{P_{r,1:3} P_{1:3}^{-1} m}_{m'_\infty} (23)

This is the equation of a line through the right epipole e_r and the point m'_∞, which represents the projection onto the right image plane of the point at infinity of the optical ray of m.

The equation for the left epipolar line is obtained in a similar way.

Figure 6. Left and right images with epipolar lines.

§ 4.2. Triangulation (or intersection)

Given the camera matrices P and P_r, let m and m_r be two corresponding points satisfying the epipolar constraint; then m_r lies on the epipolar line associated with m, and so the two rays back-projected from the image points m and m_r lie in a common epipolar plane. Since they lie in the same plane, they intersect at some point. This point is the reconstructed 3-D scene point M.

Figure 7. Triangulation, general case

§ 4.2.1. The linear-eigen method

Triangulation, like other problems in this tutorial, can be cast as a null-space problem. Let us consider m = (u, v, 1)^T, the projection of the 3-D point M according to the perspective projection matrix P. Writing p_i^T for the i-th row of P, the general projection equation (8) yields:

(p_1 - u p_3)^T M = 0
(p_2 - v p_3)^T M = 0 (24)

and then, in matrix form:

\begin{bmatrix} (p_1 - u p_3)^T \\ (p_2 - v p_3)^T \end{bmatrix} M = 0_{2×1} (25)

Hence, one point gives two homogeneous equations. Let us now consider m_r = (u_r, v_r, 1)^T, the corresponding point of m in the second image, and let P_r (with rows p_{r,i}^T) be the second perspective projection matrix. Both being projections of the same 3-D point M, the equations provided by m and m_r can be stacked, thereby obtaining a homogeneous linear system of four equations in four unknowns (including the last component of M):

\begin{bmatrix} (p_1 - u p_3)^T \\ (p_2 - v p_3)^T \\ (p_{r,1} - u_r p_{r,3})^T \\ (p_{r,2} - v_r p_{r,3})^T \end{bmatrix} M = 0_{4×1} (26)

The solution is the null-space of the 4×4 coefficient matrix, which must then have rank three, otherwise only the trivial solution M = 0 would be possible. In the presence of noise this rank condition cannot be fulfilled exactly, so a least-squares solution is sought, typically via SVD, as in calibration with DLT. Hartley and Sturm (1997) call this method "linear-eigen".

This method generalizes to the case of N>2 cameras: each one gives two equations and one ends up with 2N equations in four unknowns.

Triangulation is addressed in more detail in (Beardsley et al., 1997; Hartley and Sturm, 1997; Hartley and Zisserman, 2003).
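As a concrete illustration, the linear-eigen method of Eq. (26) can be sketched in a few lines of numpy; the camera matrices and the 3-D point below are hypothetical and serve only as a synthetic check:

```python
import numpy as np

def triangulate_linear_eigen(P1, P2, m1, m2):
    """Linear-eigen triangulation: solve the stacked system (26) via SVD.

    P1, P2: 3x4 projection matrices; m1, m2: (u, v) image coordinates.
    Returns the 3-D point in homogeneous coordinates, normalized.
    """
    u1, v1 = m1
    u2, v2 = m2
    # Two rows per view: (p1 - u p3)^T M = 0 and (p2 - v p3)^T M = 0
    A = np.vstack([P1[0] - u1 * P1[2],
                   P1[1] - v1 * P1[2],
                   P2[0] - u2 * P2[2],
                   P2[1] - v2 * P2[2]])
    # Least-squares null vector: right singular vector of the smallest
    # singular value, as in calibration with DLT
    _, _, Vt = np.linalg.svd(A)
    M = Vt[-1]
    return M / M[3]

# Synthetic check with two hypothetical cameras and a known point
P1 = np.hstack([np.eye(3), np.zeros((3, 1))])
P2 = np.hstack([np.eye(3), np.array([[-1.0], [0.0], [0.0]])])
M_true = np.array([0.3, -0.4, 5.0, 1.0])
x1 = P1 @ M_true
x2 = P2 @ M_true
M_est = triangulate_linear_eigen(P1, P2, x1[:2] / x1[2], x2[:2] / x2[2])
```

With exact correspondences the null vector is recovered exactly; with noisy data the same code returns the least-squares estimate.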

§ 4.3. Levels of description and ambiguity in reconstruction

The epipolar geometry can be described analytically in several ways, depending on the amount of the a priori knowledge about the stereo system. We can identify three general cases.

  1. If both intrinsic and extrinsic parameters are known, we can describe the epipolar geometry in terms of the projection matrices (Equation (23)).

  2. If only the intrinsic parameters are known, we work in normalized camera coordinates and the epipolar geometry is described by the essential matrix.

  3. If neither intrinsic nor extrinsic parameters are known the epipolar geometry is described by the fundamental matrix.

Likewise, what can be reconstructed (by triangulation) depends on what is known about the scene and the stereo system. We can identify three cases.

  1. If both the intrinsic and extrinsic parameters are known, we can solve the reconstruction problem unambiguously.

  2. If only the intrinsic parameters are known, we can estimate the extrinsic parameters and solve the reconstruction problem up to an unknown scale factor (plus a rigid transformation that corresponds to the arbitrariness in fixing the world reference frame). In other words, R can be estimated completely, and t up to a scale factor.

  3. If neither intrinsic nor extrinsic parameters are known, i.e., the only information available is pixel correspondences, we can still solve the reconstruction problem, but only up to an unknown, global projective transformation of the world. This ambiguity may be reduced if additional information is supplied on the cameras or the scene (see Sec. 6).

§ 4.4. The calibrated case

Suppose that a set of image correspondences (m_i, m_{r,i}) is given. It is assumed that these correspondences come from a set of 3-D points M_i, which are unknown.

The intrinsic parameters are known, i.e. the cameras are calibrated, but the position and orientation of the cameras are unknown.

The situation discussed previously, in which both the intrinsic and the extrinsic parameters are known, will be referred to as fully calibrated for the sake of clarity.

We will see that the epipolar geometry is described by the essential matrix and that, starting from the essential matrix, only a reconstruction up to a similarity transformation (rigid+uniform scale) can be achieved. Such a reconstruction is referred to as “Euclidean”.

§ 4.4.1. The Essential Matrix E

As the intrinsic parameters are known, we can switch to normalized camera coordinates: m ← K^{-1} m (please note that this change of notation will hold throughout this section).

Consider a pair of cameras P and Pr. Without loss of generality, we can fix the world reference frame onto the first camera, hence:

P = [I | 0]   and   P_r = [R | t]. (27)

With this choice, the unknown extrinsic parameters have been made explicit.

If we substitute these two particular instances of the camera projection matrices in Equation (23), we get

ζ_r m_r = t + ζ R m; (28)

in other words, the point mr lies on the line through the points t and Rm. In homogeneous coordinates, this can be written as follows:

m_r^T (t × R m) = 0, (29)

as the homogeneous line through two points is expressed as their cross product, and a dot product of a point and a line is zero if the point lies on the line.

The cross product of two vectors can be written as a product of a skew-symmetric matrix and a vector. Equation (29) can therefore be equivalently written as

m_r^T [t]_× R m = 0, (30)

where [t]_× is the skew-symmetric matrix associated with the vector t. Let us define the essential matrix E:

E ≜ [t]_× R. (31)

In summary, the relationship between the corresponding image points m and mr in normalized camera coordinates is the bilinear form:

mrTEm=0. (32)

E encodes only information on the rigid displacement between cameras. It has five degrees of freedom: a 3-D rotation and a 3-D translation direction.

E is singular, since det([t]_×) = 0, and it is a homogeneous quantity.
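These properties are easy to verify numerically. A sketch with a hypothetical displacement (R, t), checking the bilinear constraint (32), the singularity of E, and its singular-value structure:

```python
import numpy as np

def skew(t):
    """Skew-symmetric matrix [t]_x such that [t]_x v = t x v."""
    return np.array([[0.0, -t[2], t[1]],
                     [t[2], 0.0, -t[0]],
                     [-t[1], t[0], 0.0]])

# Hypothetical rigid displacement between the two cameras
th = 0.1
R = np.array([[np.cos(th), -np.sin(th), 0.0],
              [np.sin(th), np.cos(th), 0.0],
              [0.0, 0.0, 1.0]])
t = np.array([1.0, 0.2, 0.0])

E = skew(t) @ R                          # Eq. (31): E = [t]_x R
s = np.linalg.svd(E, compute_uv=False)   # two equal singular values, one zero

# A 3-D point and its projections in normalized camera coordinates
M = np.array([0.5, -0.2, 4.0, 1.0])
m = np.hstack([np.eye(3), np.zeros((3, 1))]) @ M    # left view, P = [I|0]
m_r = np.hstack([R, t[:, None]]) @ M                # right view, P_r = [R|t]
m, m_r = m / m[2], m_r / m_r[2]
```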

§ 4.4.2. Reconstruction up to a Similarity

If a sufficient number of point correspondences (m_i, m_{r,i}) is given, we can use Equation (32) to compute the unknown matrix E (see Sec. 4.5.2).

The reconstruction is achieved starting from the essential matrix, which contains – entangled – the unknown extrinsic parameters.

Unlike the fundamental matrix, the only property of which is to have rank two, the essential matrix is characterised by the following theorem (Huang and Faugeras,1989).

Theorem 4.1.

A real 3×3 matrix E can be factorised as product of a nonzero skew-symmetric matrix and a rotation matrix if and only if E has two identical singular values and a zero singular value.

Proof. Let E = S R, where R is a rotation matrix and S is skew-symmetric. Let S = [t]_× with ‖t‖ = 1. Then

E E^T = S R R^T S^T = S S^T = I - t t^T.

Let U be an orthogonal matrix such that U t = (0, 0, 1)^T. Then

U E E^T U^T = U (I - t t^T) U^T = I - (U t)(U t)^T = I - \begin{bmatrix} 0 \\ 0 \\ 1 \end{bmatrix} (0, 0, 1) = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 0 \end{bmatrix}.

The elements of this diagonal matrix are the eigenvalues of E E^T, i.e., the squared singular values of E, which are therefore 1, 1 and 0. This demonstrates one implication.

Let us now give a constructive proof of the converse. Let E = U D V^T be the SVD of E, with D = diag(1, 1, 0) (with no loss of generality, since E is defined up to a scale factor) and U and V orthogonal. The key observation is that

D = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 0 \end{bmatrix} = \begin{bmatrix} 0 & -1 & 0 \\ 1 & 0 & 0 \\ 0 & 0 & 0 \end{bmatrix} \begin{bmatrix} 0 & 1 & 0 \\ -1 & 0 & 0 \\ 0 & 0 & 1 \end{bmatrix} ≜ S' R'

where S' is skew-symmetric and R' is a rotation. Hence

E = U D V^T = U S' R' V^T = (U S' U^T)(U R' V^T) = \underbrace{\det(UV^T)\,(U S' U^T)}_{S} \; \underbrace{\det(UV^T)\,(U R' V^T)}_{R},

where the factor det(UV^T) = ±1 appears twice, so the product is unchanged, and it makes R a proper rotation.

Q.E.D.

This factorization is not unique. Because of the homogeneity of E, we can change its sign, either by changing the sign of S' or by taking the transpose of R' (because S' R'^T = -D). In total, we have four possible factorizations, given by:

S = ±U S' U^T (33)
R = det(UV^T) U R' V^T   or   R = det(UV^T) U R'^T V^T. (34)

The choice between the four displacements is determined by the requirement that the 3-D points must lie in front of both cameras, i.e., their depth must be positive.

The rotation R and translation t are then used to instantiate a camera pair as in Equation (27), and this camera pair is subsequently used to reconstruct the structure of the scene by triangulation.
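The four candidate factorizations, and the recovery of (R, t) from E, can be sketched as follows. This is an assumed implementation of the constructive proof; the ground-truth displacement below is hypothetical and is used only to check that it appears among the candidates:

```python
import numpy as np

def skew(v):
    return np.array([[0.0, -v[2], v[1]],
                     [v[2], 0.0, -v[0]],
                     [-v[1], v[0], 0.0]])

def factorize_essential(E):
    """Return the four candidate (R, t) pairs of Eqs. (33)-(34).

    Assumes E is a valid essential matrix (singular values 1, 1, 0 up
    to scale); translation is recovered up to scale and sign.
    """
    U, _, Vt = np.linalg.svd(E)
    Rz = np.array([[0.0, 1.0, 0.0],      # the rotation R' in D = S' R'
                   [-1.0, 0.0, 0.0],
                   [0.0, 0.0, 1.0]])
    d = np.sign(np.linalg.det(U @ Vt))   # det(U V^T) = +/- 1
    R1 = d * U @ Rz @ Vt
    R2 = d * U @ Rz.T @ Vt
    t = U[:, 2]                          # left null vector of E: +/- t direction
    return [(R1, t), (R1, -t), (R2, t), (R2, -t)]

# Hypothetical ground truth; in practice the correct candidate is
# selected by triangulating a point and requiring positive depth in
# both cameras (the cheirality test described above).
R_true = np.array([[0.0, -1.0, 0.0],
                   [1.0, 0.0, 0.0],
                   [0.0, 0.0, 1.0]])
t_true = np.array([1.0, 0.0, 0.0])
candidates = factorize_essential(skew(t_true) @ R_true)
```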

The rigid displacement ambiguity arises from the arbitrary choice of the world reference frame, whereas the scale ambiguity derives from the fact that t can be scaled arbitrarily in Equation (31) and one would get the same essential matrix (E is defined up to a scale factor).

Therefore translation can be recovered from E only up to an unknown scale factor which is inherited by the reconstruction.

This is also known as depth-speed ambiguity (in a context where points are moving and the camera is stationary): a large motion of a distant point and a small motion of a nearby point produce the same motion in the image.

§ 4.5. The weakly calibrated case

Suppose that a set of image correspondences (m_i, m_{r,i}) is given. It is assumed that these correspondences come from a set of 3-D points M_i, which are unknown.

Likewise, the position, orientation and calibration of the cameras are not known.

This situation is usually referred to as weak calibration, and we will see that the epipolar geometry is described by the fundamental matrix and the scene may be reconstructed up to a projective ambiguity.

§ 4.5.1. The Fundamental Matrix F

The fundamental matrix can be derived in a similar way to the essential matrix. Intrinsic parameters are assumed unknown; we therefore write a more general version of Equation (27):

P = K[I | 0]   and   P_r = K_r[R | t]. (35)

Inserting these two projection matrices into Equation (23), we get

ζ_r m_r = e_r + ζ K_r R K^{-1} m,   with e_r = K_r t, (36)

which states that point mr lies on the line through er and KrRK-1m. As in the case of the essential matrix, this can be written in homogeneous coordinates as:

m_r^T [e_r]_× K_r R K^{-1} m = 0. (37)

The matrix

F = [e_r]_× K_r R K^{-1} (38)

is the fundamental matrix F, giving the relationship between the corresponding image points in pixel coordinates.

Therefore, the bilinear form that links corresponding points writes:

mrTFm=0. (39)

F is the algebraic representation of the epipolar geometry in the least information case. It is a 3×3, rank-two homogeneous matrix. It has only seven degrees of freedom since it is defined up to a scale and its determinant is zero. Notice that F is completely defined by pixel correspondences only (the intrinsic parameters are not needed).

For any point m in the left image, the corresponding epipolar line lr in the right image can be expressed as

lr=Fm. (40)

Similarly, the epipolar line l in the left image for the point mr in the right image can be expressed as

l=FTmr. (41)

The left epipole e is the right null-vector of the fundamental matrix and the right epipole is the left null-vector of the fundamental matrix:

F e = 0 (42)
e_r^T F = 0 (43)

One can see from the derivation that the essential and fundamental matrices are related through the camera calibration matrices K and Kr:

F = K_r^{-T} E K^{-1}. (44)

Consider a camera pair. Using the fact that if F maps points in the left image to epipolar lines in the right image, then FT maps points in the right image to epipolar lines in the left image, Equation (36) gives:

ζ_r F^T m_r = ζ (e × m). (45)

This is another way of writing the epipolar constraint: the epipolar line of m_r, namely F^T m_r, is the line containing its corresponding point m and the epipole e in the left image.

§ 4.5.2. Estimating F: the eight-point algorithm

If a number of point correspondences (m_i, m_{r,i}) is given, we can use Equation (39) to compute the unknown matrix F.

We need to convert Equation (39) from its bilinear form to a form that matches the null-space problem. To this end we use again the vec operator, as in the DLT algorithm:

m_r^T F m = 0  ⟺  vec(m_r^T F m) = 0  ⟺  (m^T ⊗ m_r^T) vec(F) = 0.

Each point correspondence gives rise to one linear equation in the unknown entries of F. From a set of n point correspondences, we obtain an n×9 coefficient matrix A by stacking up one equation for each correspondence.

In general A will have rank 8 and the solution is the 1-dimensional right null-space of A.

The fundamental matrix F is computed by solving the resulting linear system of equations, for n ≥ 8.

If the data are not exact and more than 8 points are used, the rank of A will be 9 and a least-squares solution is sought.

The least-squares solution for vec(F) is the singular vector corresponding to the smallest singular value of A.

This method does not explicitly enforce F to be singular, so the singularity constraint must be imposed a posteriori.

Replace F by F′ such that det F′ = 0, by forcing the smallest singular value of F to zero.

It can be shown that F′ is the closest singular matrix to F in Frobenius norm.

Geometrically, the singularity constraint ensures that the epipolar lines meet in a common epipole.
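A sketch of the eight-point algorithm with the a-posteriori rank-2 enforcement, on synthetic data. The cameras and points below are hypothetical, and no coordinate normalization is applied, which a practical implementation should add (Hartley's preconditioning):

```python
import numpy as np

def eight_point(ml, mr):
    """Estimate F from n >= 8 correspondences (no normalization).

    ml, mr: (n, 2) arrays of corresponding points, left and right.
    """
    # One row per correspondence: (m^T kron m_r^T) vec(F) = 0
    A = np.vstack([np.kron(np.array([u, v, 1.0]),
                           np.array([ur, vr, 1.0]))
                   for (u, v), (ur, vr) in zip(ml, mr)])
    _, _, Vt = np.linalg.svd(A)
    F = Vt[-1].reshape(3, 3, order="F")   # vec(F) is column-major
    # Enforce singularity a posteriori: zeroing the smallest singular
    # value gives the closest rank-2 matrix in Frobenius norm.
    U, s, Vt = np.linalg.svd(F)
    s[2] = 0.0
    return U @ np.diag(s) @ Vt

# Synthetic demo: hypothetical cameras observing random 3-D points
rng = np.random.default_rng(0)
th = 0.1
R = np.array([[np.cos(th), -np.sin(th), 0.0],
              [np.sin(th), np.cos(th), 0.0],
              [0.0, 0.0, 1.0]])
P1 = np.hstack([np.eye(3), np.zeros((3, 1))])
P2 = np.hstack([R, np.array([[-1.0], [0.2], [0.1]])])
M = np.vstack([rng.uniform(-1, 1, (2, 20)), rng.uniform(4, 8, (1, 20)),
               np.ones((1, 20))])
x1 = P1 @ M; x1 = (x1[:2] / x1[2]).T
x2 = P2 @ M; x2 = (x2[:2] / x2[2]).T
F = eight_point(x1, x2)
resid = max(abs(np.r_[x2[i], 1.0] @ F @ np.r_[x1[i], 1.0]) for i in range(20))
```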

§ 4.5.3. Reconstruction up to a Projective Transformation

The reconstruction task is to find the camera matrices P and Pr, as well as the 3-D points Mi such that

m_i ≃ P M_i   and   m_{r,i} ≃ P_r M_i,   ∀ i (46)

If T is any 4×4 invertible matrix, representing a collineation of P3, then replacing points Mi by TMi and matrices P and Pr by PT-1 and PrT-1 does not change the image points mi. This shows that, if nothing is known but the image points, the structure Mi and the cameras can be determined only up to a projective transformation.

The procedure for reconstruction follows the previous one. Given the weak calibration assumption, the fundamental matrix can be computed (using the algorithm described in Section 4.5.1), and from a (non-unique) factorization of F of the form

F = [e_r]_× A (47)

two camera matrices P and Pr:

P = [I | 0]   and   P_r = [A | e_r], (48)

can be created in such a way that they yield the fundamental matrix F, as can be easily verified. The position in space of the points Mi is then obtained by triangulation.

The matrix A in the factorization of F can be set to A = -[e_r]_× F; this is called the epipolar projection matrix (Luong and Viéville, 1996).

Unlike the essential matrix, F does not admit a unique factorization, whence the projective ambiguity follows.

Indeed, for any A satisfying Equation (47), A + e_r x^T also satisfies Equation (47) for any vector x.

More generally, any homography induced by a plane can be taken as the matrix A (see Sec. 4.6).
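A sketch instantiating the projective camera pair of Eq. (48) from a given F, using the epipolar projection matrix A = -[e_r]_× F; the sample fundamental matrix below is hypothetical, constructed only so that it has rank two:

```python
import numpy as np

def skew(v):
    return np.array([[0.0, -v[2], v[1]],
                     [v[2], 0.0, -v[0]],
                     [-v[1], v[0], 0.0]])

def cameras_from_F(F):
    """Projective camera pair (Eq. 48) with A = -[e_r]_x F."""
    U, _, _ = np.linalg.svd(F)
    e_r = U[:, 2]                        # left null vector: e_r^T F = 0
    A = -skew(e_r) @ F                   # epipolar projection matrix
    P = np.hstack([np.eye(3), np.zeros((3, 1))])
    Pr = np.hstack([A, e_r[:, None]])
    return P, Pr

# Hypothetical rank-2 fundamental matrix for the demo
F_true = skew(np.array([1.0, 0.2, 0.3])) @ np.diag([1.0, 2.0, 3.0])
P, Pr = cameras_from_F(F_true)
# The pair must yield F back: [e_r]_x A equals F (here exactly, since
# e_r is a unit vector and e_r^T F = 0)
F_back = skew(Pr[:, 3]) @ Pr[:, :3]
```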

§ 4.6. Planes and collineations

When observing a plane, we obtain an interesting specialization of the epipolar geometry of two views.

First, let us establish that the map between a world plane and its perspective image is a collineation of P2. The easiest way to see it is to choose the world coordinate system such that the plane has equation z=0.

Expanding the projection equation gives:

ζ \begin{bmatrix} u \\ v \\ 1 \end{bmatrix} = P \begin{bmatrix} x \\ y \\ 0 \\ 1 \end{bmatrix} = [P_1 | P_2 | P_4] \begin{bmatrix} x \\ y \\ 1 \end{bmatrix}, (49)

where P_i denotes the i-th column of P.

Points are mapped from the world plane to the image plane with a 3×3 (non-singular) matrix, which represents a collineation of P2.

Figure 8. Left: The map between a world plane Π and a perspective image is a collineation. Right: The plane Π induces a collineation between two views.

Next, we prove that: images of points on a plane are related to corresponding image points in a second view by a collineation (or homography) of P2.

We have one collineation from Π to the left image plane, and another collineation from Π to the right image plane. By composing the inverse of the first with the second, we define a collineation from the image plane of the left camera to the image plane of the right camera.

The plane Π induces a collineation HΠ between the views, which transfers points from one view to the other:

m_r ≃ H_Π m   if   M ∈ Π, (50)

where HΠ is a 3×3 non-singular matrix.

Even though a collineation of P2 depends upon eight parameters, there is no contradiction with the fact that a plane depends upon three parameters. Indeed, the collineation induced by a plane must be compatible with the epipolar geometry, i.e.:

(H_Π m)^T F m = 0 (51)

for all points m. This implies that the matrix H_Π^T F is antisymmetric:

H_Π^T F + F^T H_Π = 0 (52)

and this imposes six homogeneous constraints on HΠ.

A collineation H that satisfies Eq. (52) is said to be compatible with F.

A collineation H is compatible with F if and only if

F ≃ [e_r]_× H. (53)

From the latter, provided that Π does not contain C_r, it follows that:

H_Π e ≃ e_r. (54)

§ 4.6.1. Homography induced by a plane

If the 3-D point M lies on a plane Π with equation n^T M = d, Eq. (36) can be specialized, obtaining (after some algebra):

(ζ_r/ζ) m_r = K_r (R + t n^T / d) K^{-1} m. (55)

Therefore, the collineation induced by Π is given by:

H_Π = K_r (R + t n^T / d) K^{-1} (56)

This is a three-parameter family of collineations, parametrized by n/d.
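A numerical sketch of Eq. (56): a point on the plane, projected in the two views, is transferred from the left image to the right by H_Π. All the numeric values below are hypothetical:

```python
import numpy as np

K = np.array([[500.0, 0.0, 320.0],
              [0.0, 500.0, 240.0],
              [0.0, 0.0, 1.0]])
Kr = K.copy()
th = 0.2
R = np.array([[np.cos(th), 0.0, np.sin(th)],
              [0.0, 1.0, 0.0],
              [-np.sin(th), 0.0, np.cos(th)]])
t = np.array([0.5, 0.1, 0.0])
n, d = np.array([0.0, 0.0, 1.0]), 5.0        # the plane z = 5

H_pi = Kr @ (R + np.outer(t, n) / d) @ np.linalg.inv(K)   # Eq. (56)

# A point on the plane (n^T M = d), imaged by both cameras
M = np.array([0.3, -0.2, 5.0, 1.0])
P = K @ np.hstack([np.eye(3), np.zeros((3, 1))])
Pr = Kr @ np.hstack([R, t[:, None]])
m = P @ M; m = m / m[2]
mr = Pr @ M; mr = mr / mr[2]
mr_t = H_pi @ m; mr_t = mr_t / mr_t[2]       # transfer via H_pi, Eq. (50)
```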

§ 4.6.2. Infinite homography

The infinite homography H_∞ is the collineation induced by the plane at infinity; it maps vanishing points to vanishing points (a vanishing point is where all the lines that share the same direction meet).

It can be derived by letting d → ∞ in (55), thereby obtaining:

H_∞ = K_r R K^{-1} (57)

The infinity homography does not depend on the translation between views.

In other words, the vanishing points are fixed under camera translation.
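A sketch checking that H_∞ = K_r R K^{-1} (Eq. 57) maps a vanishing point of the left view to the corresponding one of the right view, independently of the translation. The intrinsics, rotation and direction are hypothetical:

```python
import numpy as np

K = np.array([[500.0, 0.0, 320.0],
              [0.0, 500.0, 240.0],
              [0.0, 0.0, 1.0]])
Kr = K.copy()
th = 0.3
R = np.array([[np.cos(th), -np.sin(th), 0.0],
              [np.sin(th), np.cos(th), 0.0],
              [0.0, 0.0, 1.0]])
H_inf = Kr @ R @ np.linalg.inv(K)        # Eq. (57): no dependence on t

# The vanishing point of direction dvec is the image of the point at
# infinity (dvec, 0): it is K dvec in the left view and Kr R dvec in
# the right view, whatever the translation between the views.
dvec = np.array([1.0, 2.0, 3.0])
v_left = K @ dvec
v_right = Kr @ R @ dvec
v_mapped = H_inf @ v_left
```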

§ 4.6.3. Plane induced parallax

In general, when points are not on the plane, the homography induced by a plane generates a virtual parallax. This gives rise to an alternative representation of the epipolar geometry and scene structure (Shashua and Navab,1996).

First, let us rewrite Eq. (36), which links two general conjugate points, as:

(ζ_r/ζ) m_r = H_∞ m + (1/ζ) e_r, (58)

The mapping from one point to its conjugate can be seen as composed of a transfer with the infinity homography, H_∞ m, plus a parallax correction term, (1/ζ) e_r.

Note that if t = 0, then the parallax vanishes. Thus H_∞ not only relates points at infinity when the camera describes a general motion, but also relates image points of any depth when the camera rotates about its centre.

We want to generalize this equation to any plane. To this end we substitute

H_∞ = H_Π - K_r (t n^T / d) K^{-1} (59)

into Eq. (58), obtaining

(ζ_r/ζ) m_r = H_Π m + γ e_r (60)

with γ = a / (d ζ), where a is the distance of M from the plane Π.

When M is on the 3-D plane Π, then m_r ≃ H_Π m. Otherwise there is a residual displacement, called parallax, which is proportional to γ and oriented along the epipolar line.

The magnitude of the parallax depends only on the left view and the plane; it does not depend on the parameters of the right view.

From Eq. (60) we derive m_r^T [e_r]_× H_Π m = 0, hence

F ≃ [e_r]_× H_Π. (61)
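A sketch of the parallax decomposition (60) in normalized coordinates (K = K_r = I): a point off the reference plane leaves a residual after the H_Π warp, and m_r, H_Π m and e_r are collinear. The plane, motion and point are hypothetical:

```python
import numpy as np

def skew(v):
    return np.array([[0.0, -v[2], v[1]],
                     [v[2], 0.0, -v[0]],
                     [-v[1], v[0], 0.0]])

R = np.eye(3)
t = np.array([1.0, 0.0, 0.0])
n, d = np.array([0.0, 0.0, 1.0]), 5.0      # reference plane z = 5
H_pi = R + np.outer(t, n) / d              # Eq. (56) with K = Kr = I
e_r = t                                    # e_r = Kr t

M = np.array([0.3, -0.2, 7.0, 1.0])        # a point OFF the plane (z = 7)
P = np.hstack([np.eye(3), np.zeros((3, 1))])
Pr = np.hstack([R, t[:, None]])
m = P @ M; m = m / m[2]
mr = Pr @ M; mr = mr / mr[2]
warped = H_pi @ m; warped = warped / warped[2]
residual = mr - warped                     # the parallax displacement

# m_r, H_pi m and e_r are collinear (Eq. 60), and F ~ [e_r]_x H_pi (Eq. 61)
collin = np.linalg.det(np.column_stack([mr, warped, e_r]))
F = skew(e_r) @ H_pi
```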

Figure 9. Plane induced parallax.

Figure 10. Visualizing parallax. The rightmost image is the superposition of the warped left image and the right image. The reference plane coincides exactly; however, points off the plane (such as the statue) do not coincide.

§ 4.6.4. Estimating H

A number of point correspondences (m_i, m_{r,i}) is given, and we are required to find a homography matrix H such that

m_{r,i} ≃ H m_i   for all i (62)

The equation (we drop the index i for simplicity) can be rewritten in terms of the cross product as

m_r × H m = 0 (63)

As we did before, we exploit the properties of the Kronecker product and the vec operator to transform this into a null-space problem and then derive a linear solution:

m_r × H m = 0  ⟺  [m_r]_× H m = 0  ⟺  vec([m_r]_× H m) = 0  ⟺  (m^T ⊗ [m_r]_×) vec(H) = 0.

These are three equations in nine unknowns.

The rank of the matrix m^T ⊗ [m_r]_× is two, because it is the Kronecker product of a rank-1 matrix and a rank-2 matrix. Hence, only two of the three equations are independent.

From a set of n point correspondences, we obtain a 2n×9 coefficient matrix A by stacking up the two independent equations for each correspondence.

In general A will have rank 8 and the solution is the 1-dimensional right null-space of A.

The matrix H is computed by solving the resulting linear system of equations, for n ≥ 4.

If the data are not exact, and more than 4 points are used, the rank of A will be 9 and a least-squares solution is sought.

The least-squares solution for vec(H) is the singular vector corresponding to the smallest singular value of A.

This is another incarnation of the DLT algorithm.
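The DLT steps above can be sketched as follows; the ground-truth homography is hypothetical and is used only to generate exact data for the check:

```python
import numpy as np

def skew(v):
    return np.array([[0.0, -v[2], v[1]],
                     [v[2], 0.0, -v[0]],
                     [-v[1], v[0], 0.0]])

def dlt_homography(ml, mr):
    """DLT estimate of H such that mr_i ~ H ml_i, from n >= 4 points.

    ml, mr: (n, 2) arrays. No coordinate normalization is applied,
    which a practical implementation should add.
    """
    # Each correspondence contributes a 3x9 block (m^T kron [m_r]_x);
    # only two of its three rows are independent, but stacking all
    # three is harmless for the SVD solution.
    A = np.vstack([np.kron(np.array([u, v, 1.0]),
                           skew(np.array([ur, vr, 1.0])))
                   for (u, v), (ur, vr) in zip(ml, mr)])
    _, _, Vt = np.linalg.svd(A)
    return Vt[-1].reshape(3, 3, order="F")    # vec(H) is column-major

# Synthetic demo with a hypothetical ground-truth homography
H_true = np.array([[1.1, 0.02, 5.0],
                   [-0.01, 0.95, -3.0],
                   [1e-4, 2e-4, 1.0]])
pts = np.array([[0.0, 0.0], [100.0, 0.0], [0.0, 100.0],
                [100.0, 100.0], [37.0, 62.0]])
proj = H_true @ np.vstack([pts.T, np.ones(len(pts))])
proj = (proj[:2] / proj[2]).T
H_est = dlt_homography(pts, proj)
H_est = H_est / H_est[2, 2]                   # fix the arbitrary scale
```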