If x and v belong to a group, xv/x is the conjugate of v by x. Conjugate this by y and it's the same as conjugating v by yx. And if u is the conjugate of v by x then v is the conjugate of u by 1/x. Conjugates clump together into conjugacy classes. So far it's just group theory; but this chapter explores the conjugacy classes of matrices, with a couple of twists.
We don't usually leave the group when looking for conjugators. However, when conjugating a matrix, it is sometimes helpful to extend the base field. If M is a matrix of real numbers, X might be a matrix of complex numbers, whence XM/X is the conjugate of M. It is perhaps easiest to embed our matrices in an algebraically closed field from the start.
Everything in a group is invertible, but not all matrices are invertible. Even if M is singular, it has conjugates, and a conjugacy class. (If you want to be technical, we are looking for conjugacy classes in a monoid.) The only requirement is that the conjugator be invertible. Well, that makes sense; you have to divide by the conjugator on the other side.
Each conjugacy class has a preferred matrix, a preferred representative, which is called the jordan canonical form (named after Camille Jordan). The jordan form is much easier to work with. For example, compute M^n by switching to the jordan form J. Conjugation is an isomorphism that respects multiplication, so there is no trouble here. J^n is easy to calculate, because J is diagonal, or nearly so. Conjugate back, and find M^n.
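As a quick numerical sketch of this (Python and numpy assumed, not part of the text), here is M^5 computed by conjugating to a diagonal form; the matrix M below is a made-up diagonalizable example.

```python
import numpy as np

# Hypothetical example: compute M^5 through the diagonal form.
# np.linalg.eig plays the role of finding the conjugator.
M = np.array([[2.0, 1.0],
              [1.0, 2.0]])
w, V = np.linalg.eig(M)          # M V = V diag(w), so V conjugates M to diagonal
J = np.diag(w)                   # diagonal, the easy form to raise to a power
M5 = V @ np.linalg.matrix_power(J, 5) @ np.linalg.inv(V)
assert np.allclose(M5, np.linalg.matrix_power(M, 5))
```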
Conjugacy class of M:
M ⇔ XM/X ⇔ YM/Y ⇔ … ⇔ ZM/Z = J (the jordan form of M )
Conjugate matrices are also called similar matrices.
Verify that the determinant does not change with conjugation. We are multiplying and dividing by the determinant of the conjugator. Thus similar matrices have the same determinant.
Let the rank of M be k. Thus the rows of M span a space of dimension k. Multiply by X on the left and each row in XM is a linear combination of rows of M. The rows span a subspace of the original space. However, X is invertible, and the action is reversible, so each row of M is spanned by the rows of XM, and the space is the same.
On the other side, multiplication by X inverse is a linear map with no kernel, since X is nonsingular. The dimension of the domain equals the dimension of the range. Thus the rank of M is the rank of M/X. Put this all together and similar matrices have the same rank.
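A numerical spot check of these two claims, with numpy assumed; M is a made-up singular matrix of rank 2, and X a random invertible conjugator.

```python
import numpy as np

M = np.array([[1.0, 2.0, 3.0],
              [2.0, 4.0, 6.0],   # twice the first row, so M is singular, rank 2
              [0.0, 1.0, 1.0]])
rng = np.random.default_rng(0)
X = rng.standard_normal((3, 3))  # almost surely invertible
C = X @ M @ np.linalg.inv(X)     # the conjugate XM/X
assert abs(np.linalg.det(C) - np.linalg.det(M)) < 1e-9   # same determinant (both 0)
assert np.linalg.matrix_rank(C) == np.linalg.matrix_rank(M) == 2   # same rank
```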
Two matrices are unitarily similar if one is the conjugate of the other by a unitary (orthonormal) matrix. The product of unitary matrices is unitary, as is the inverse, so matrices clump together into well defined equivalence classes, using unitary matrices as conjugators. These are refinements of the "similar" classes described above.
So why is this chapter on linear algebra here, in the middle of groups? Because the next chapter will explore invertible matrices, as a group, and jordan forms can simplify the analysis.
An eigen vector is a vector that is scaled by a linear transformation, but not moved. Think of an eigen vector as an arrow whose direction is not changed. It may stretch, or shrink, as space is transformed, but it continues to point in the same direction. Most arrows will move, as illustrated by a spinning planet, but some vectors will continue to point in the same direction, such as the north pole.
The scaling factor of an eigen vector is called its eigen value. An eigen value only makes sense in the context of an eigen vector, i.e. the arrow whose length is being changed.
In the plane, a rigid rotation of 90° has no eigen vectors, because all vectors move. However, the reflection y = −y has the x and y axes as eigen vectors. In this function, x is scaled by 1 and y by −1, the eigen values corresponding to the two eigen vectors. All other vectors move in the plane.
The y axis, in the above example, is subtle. The direction of the vector has been reversed, yet we still call it an eigen vector, because it lives in the same line as the original vector. It has been scaled by −1, pointing in the opposite direction. An eigen vector stretches, or shrinks, or reverses course, or squashes down to 0. The key is that the output vector is a constant (possibly negative) times the input vector.
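For a concrete version of the reflection example (numpy assumed; row vectors multiply the matrix on the right, as in this chapter):

```python
import numpy as np

M = np.array([[1.0, 0.0],
              [0.0, -1.0]])                 # the reflection: x kept, y negated
assert np.allclose(np.array([1.0, 0.0]) @ M, [1.0, 0.0])    # eigen value 1
assert np.allclose(np.array([0.0, 1.0]) @ M, [0.0, -1.0])   # eigen value -1
```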
These concepts are valid over a division ring, as well as a field. Multiply by K on the left to build the K vector space, and apply the transformation, as a matrix, on the right. However, the following method for deriving eigen values and vectors is based on the determinant, and requires a field.
Given a matrix M implementing a linear transformation, let the vector x represent an eigen vector and let l be the eigen value. Solve x*M = lx.
Rewrite lx as x times l times the identity matrix and subtract it from both sides. The right side drops to 0, and the left side is x*M − x*l*identity. Pull x out of both factors and write x*Q = 0, where Q is M with l subtracted from the main diagonal. The eigen vector x lies in the kernel of the map implemented by Q. The entire kernel is known as the eigen space, and of course it depends on the value of l, a different eigen space for each eigen value l.
The eigen space is nontrivial iff the determinant of Q is 0. Expand the determinant, giving an n degree polynomial in l. This is called the characteristic polynomial of the matrix. The roots of this polynomial are the eigen values. There are at most n eigen values.
Substitute each root in turn and find the kernel of Q. This is the set of vectors x such that x*Q = 0.
In summary, a somewhat straightforward algorithm extracts the eigen values, by solving an n degree polynomial, then derives the eigen space for each eigen value.
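The algorithm can be sketched in a few lines of Python with numpy (assumed, not part of the text); np.poly expands the characteristic polynomial and np.roots extracts the eigen values.

```python
import numpy as np

M = np.array([[0.0, 1.0],
              [1.0, 0.0]])       # a made-up symmetric example
coeffs = np.poly(M)              # characteristic polynomial, here s^2 - 1
roots = np.roots(coeffs)         # the eigen values: 1 and -1
assert sorted(np.real(roots).round().astype(int).tolist()) == [-1, 1]
for l in roots:
    Q = M - l * np.eye(2)        # M with l subtracted from the main diagonal
    assert abs(np.linalg.det(Q)) < 1e-9   # singular, so the eigen space is nontrivial
```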
Some eigen values will produce multiple eigen vectors, i.e. an eigen space with more than one dimension. The identity matrix, for instance, has an eigen value of 1, and an n dimensional eigen space to go with it. In contrast, an eigen value may have multiplicity > 1, yet there is only one eigen vector. This is illustrated by the 2×2 matrix with ones on the top and lower right, and zero in the lower left. This tilts the x axis counterclockwise and leaves the y axis alone. The eigen value 1 has multiplicity 2, yet the only eigen vector is [0,1], namely the y axis.
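Here is that defective example numerically (numpy assumed). numpy's eig works with column vectors, M v = l v, so we hand it M transpose to match the row convention x*M = lx used here.

```python
import numpy as np

M = np.array([[1.0, 1.0],
              [0.0, 1.0]])       # ones on top and lower right, zero lower left
w, V = np.linalg.eig(M.T)        # transpose: numpy uses columns, we use rows
assert np.allclose(w, [1.0, 1.0])                 # eigen value 1, multiplicity 2
assert np.allclose(np.abs(V[:, 0]), [0.0, 1.0])   # the eigen vector is the y axis
assert abs(np.linalg.det(V)) < 1e-8    # both columns parallel: one eigen vector only
```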
Let two eigen vectors have the same eigen value. Specifically, let a linear map multiply the vectors v and w by the scaling factor l. By linearity, 3v+4w is also scaled by l. In fact every linear combination of v and w is scaled by l. When a set of vectors has a common eigen value, the entire space spanned by those vectors is an eigen space for the given eigen value.
This is not surprising, since the eigen vectors associated with l are precisely the kernel of the transformation defined by the matrix M with l subtracted from the main diagonal. This kernel is a vector space, and so is the eigen space of l.
Select a basis b for the eigen space of l. The vectors in b are eigen vectors, with eigen value l, and every eigen vector with eigen value l is spanned by b. Conversely, an eigen vector with some other eigen value lies outside of b.
Different eigen values always lead to independent eigen spaces.
Suppose we have the shortest counterexample. Thus c1x1 + c2x2 + … + ckxk = 0. Here x1 through xk are the eigen vectors, and c1 through ck are the coefficients that prove the vectors form a dependent set. Furthermore, the vectors represent at least two different eigen values. Thus k is at least 2.
Let the first 7 vectors share a common eigen value l. If these vectors are dependent then one of them can be expressed as a linear combination of the other 6. Make this substitution and find a shorter list of dependent eigen vectors that do not all share the same eigen value. Now the first 6 have eigen value l, and the rest have some other eigen value. Remember, we selected the shortest list, so this is a contradiction. Therefore the eigen vectors associated with any given eigen value are independent.
Scale all the coefficients c1 through ck by a common factor s. This does not change the fact that the sum of cixi is still zero. However, other than this scaling factor, we will prove there are no other coefficients that carry these eigen vectors to 0.
If there are two independent sets of coefficients that lead to 0, scale them so the first coefficients in each set are equal, then subtract. This gives a shorter linear combination of eigen vectors that yields 0. If there is but one vector left standing, then cjxj = 0, which is impossible. These dependent eigen vectors cannot share one common eigen value, that has already been ruled out; thus multiple eigen values are represented. This is a shorter list of dependent eigen vectors with multiple eigen values, which is a contradiction. Therefore, if a set of coefficients carries our eigen vectors to 0, it must be a scale multiple of c1 c2 c3 … ck.
Now take the sum of cixi and multiply by M on the right. In other words, apply the linear transformation. The image of 0 ought to be 0. Yet each coefficient is effectively multiplied by the eigen value for its eigen vector, and not all eigen values are equal. In particular, not all eigen values are 0. The coefficients are not scaled equally. The new linear combination of eigen vectors is not a scale multiple of the original, and is not zero across the board. It represents a new way to combine eigen vectors to get 0. If there were two eigen values before, and one of them was zero, there is but one eigen value now. However, this means the vectors associated with that one eigen value are dependent, and we already ruled that out. Therefore we still have two or more eigen values represented. This cannot be a shorter list, so all eigen vectors are still present. In other words, all our original eigen values were nonzero. Hence a different linear combination of our eigen vectors yields 0, and that is impossible. Therefore the eigen spaces produced by different eigen values are linearly independent.
Here is a simple application of eigen vectors.
A rigid rotation in 3 space always has an axis of rotation.
Let M implement the rotation. The determinant of M, with l subtracted from its main diagonal, gives a cubic polynomial in l, and every cubic has at least one real root. Since lengths are preserved by a rotation, l is ±1. If l is −1 we have a reflection. So l = 1, and the space rotates through some angle θ about the eigen vector. That's why every planet, every star, has an axis of rotation.
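A numerical sketch (numpy assumed): compose two rotations, then recover the axis of the combined rotation as the eigen vector with eigen value 1.

```python
import numpy as np

c, s = np.cos(0.7), np.sin(0.7)
Rz = np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])   # rotate about z
Rx = np.array([[1.0, 0.0, 0.0], [0.0, c, -s], [0.0, s, c]])   # rotate about x
R = Rz @ Rx                      # orthogonal with det 1: a rotation about some axis
w, V = np.linalg.eig(R)
k = np.argmin(np.abs(w - 1.0))   # locate the real eigen value 1
axis = np.real(V[:, k])          # the axis of rotation
assert np.allclose(R @ axis, axis)   # the axis does not move
```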
Let P be a nonsingular matrix that proves A is similar to B. In other words, PA/P = B.
Let v be an eigen vector of A, such that v*A = sv for some scaling factor s. Take the preimage of v under P. In other words, w*P = v. Now w*P*A = sv, and when we apply P inverse, we get sw. Thus w is an eigen vector for B, with the same eigen value s.
By symmetry, an eigen vector w of the matrix B becomes an eigen vector v of the matrix A, with the same eigen value s.
If A and B are similar, eigen values and eigen spaces correspond.
Use P or P inverse to map one eigen space to the other.
This makes sense if you think geometrically. If A has an eigen vector v, the linear transformation implemented by P is going to move some vector w onto v. Once v is scaled by A, the inverse of P will pull it back to a scaled version of w. Thus B has the same eigen vectors as A, they're just moved around a bit. 
(figure: an eigen vector of B carried by P onto the corresponding eigen vector of A)
All this shows the set of eigen values is the same, but doesn't verify multiplicity.
You can show equal characteristic polynomials, and hence the same eigen values, by looking at B − s*I, where s is a variable that represents the eigen value, and I is the identity matrix. This is the same as P*A/P − s*I, or P*A/P − P*s*I/P, or P*(A − s*I)/P. Take determinants, remembering that det(P) and det(1/P) cancel, and the determinant of B with s subtracted from the main diagonal is the same as the determinant of A with s subtracted from the main diagonal. Both matrices produce the same characteristic polynomial, and both matrices have the same eigen values, multiplicities included.
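A numerical spot check (numpy assumed): A and its conjugate share a characteristic polynomial. A and P below are made-up examples; P has determinant 2, so it is invertible.

```python
import numpy as np

A = np.array([[2.0, 1.0, 0.0],
              [0.0, 3.0, 0.0],
              [1.0, 0.0, 5.0]])      # eigen values 2, 3, 5
P = np.array([[1.0, 1.0, 0.0],
              [0.0, 1.0, 1.0],
              [1.0, 0.0, 1.0]])      # det(P) = 2, so P is invertible
B = P @ A @ np.linalg.inv(P)
assert np.allclose(np.poly(A), np.poly(B))   # same characteristic polynomial
```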
You've seen the term "diagonalizable" before. Gaussian elimination can diagonalize a matrix M, using elementary row operations, iff M is nonsingular. This section defines diagonalizable in a new way. M is diagonalizable iff it has n independent eigen vectors. This is an unrelated concept; some matrices are diagonalizable here, and not there, and some are diagonalizable there, and not here. Sorry for the confusion.
A matrix M is diagonalizable if it is similar to a diagonal matrix. Write this as PM/P = D, where P is nonsingular and D is a diagonal matrix.
The eigen vectors of D are the coordinate vectors, each scaled by the corresponding entry on the main diagonal of D. There are n independent eigen vectors. Pull this back to M and find n independent eigen vectors.
Conversely, let M have n independent eigen vectors. Let P map the unit basis onto these n independent vectors. In other words, the rows of P are the eigen vectors of M. Apply M, which scales these vectors, then apply P inverse, which pulls the vectors back to the coordinate axes. The composite is a matrix that scales the coordinate vectors, and that has to be a diagonal matrix. The diagonal entries are the eigen values. Therefore a matrix is diagonalizable iff it has n independent eigen vectors.
Note that some of these eigen values could be 0, squashing the corresponding eigen vectors down to a point. The corresponding diagonal matrix has zeros on its main diagonal.
When M is diagonalizable, the diagonal equivalent D is the jordan form of M. As you might imagine, D is easier to work with. If M does not have n independent eigen vectors, there is more work to do.
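The recipe above, sketched with numpy (assumed): the rows of P are the eigen vectors of M, and P M P^-1 comes out diagonal.

```python
import numpy as np

M = np.array([[2.0, 1.0],
              [1.0, 2.0]])       # made-up diagonalizable example
w, V = np.linalg.eig(M.T)        # columns of V are row-convention eigen vectors
P = V.T                          # so the rows of P are the eigen vectors of M
D = P @ M @ np.linalg.inv(P)
assert np.allclose(D, np.diag(w))    # diagonal, with the eigen values on it
```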
Schur's theorem states that any matrix is unitarily similar to a lower triangular matrix.
Remember that a unitary (orthogonal) matrix implements a rigid rotation in space. So the phrase "unitarily similar" is just like turning your head. You're looking at the same linear transform from another direction. From this vantage point, the first coordinate is scaled, the second coordinate is a linear combination of the first two coordinates, and so on, building a lower triangular matrix. Of course you could turn your head the other way and the transform becomes upper triangular. In fact that's how most textbooks present the theorem. But jordan forms are traditionally lower triangular, with ones on the subdiagonal, so I'm going to build a lower triangular matrix for consistency.
This theorem only works over an algebraically closed field, such as the complex numbers. To illustrate, consider a linear map in R^2 that rotates the plane 90 degrees. There is no real eigen vector, and there won't be after you conjugate by a real matrix. Yet a lower triangular matrix always has at least one eigen vector. Forget unitary; this matrix isn't similar to a lower triangular matrix at all, unless you bring in the complex numbers.
With that example behind us, let's assume the field includes the entries of M and all the eigen values of M.
Proceed by induction on the number of dimensions. A 1×1 matrix is automatically a lower triangular matrix, so consider an n×n matrix.
Find an eigen value s and an eigen vector x. Thus xM = sx.
Assume there is a well defined dot product, as is the case in the reals or the complex numbers. Use the Gram Schmidt process to build B, an orthogonal basis with x as its top row. Scale each row so that its length is 1, whence B becomes unitary. If you can't build a unitary basis, then let B be any basis and move on.
Wait a minute: why can't we always build a unitary basis?
Well this is abstract algebra, not real analysis, so the base field might be the integers mod p. A vector like [1,3,5] has length 0 mod 7, and cannot be scaled to a length of 1. So if unitary cannot be achieved, then let B be any basis and move on. M will be similar to a lower triangular matrix, but not unitarily similar.
Since x is eigen, the top row of BM is sx. Let C be the inverse of B. Consider B*M*C. To find the top row, multiply the top row of BM (which is sx) by the columns of C. The result is s in the upper left, and zeros across. Thus BMC begins to look like a lower triangular matrix.
By similarity, BMC has the same eigen values as M. One of them is s, in the upper left; the others are found in the submatrix below. This can be seen by expanding the determinant along the top row. The characteristic polynomial becomes (s − l) times the characteristic polynomial of the submatrix below, where l is the variable.
Ignore the first row and column of BMC and consider the remaining n1 dimensional matrix. Find P and Q, where Q is P inverse, to make this submatrix similar to a lower triangular matrix. This can always be done by induction. Now add an extra row of zeros above and to the left of P, and do the same for Q. Place a 1 in the upper left of the augmented versions of P and Q. Note that P*Q is still the identity matrix, and P and Q are unitary (if we are building unitary matrices).
Consider PBMCQ. BMC has s in the upper left and zeros across. Multiply by Q on the right, and the top row and left column remain the same, while the submatrix is multiplied by the submatrix of Q. Multiply by P on the left. The upper left entry is still s, and the rest of the first column is, well, who knows. The top row is still zeros, except for s in the upper left. Finally the submatrix, with P on the left and Q on the right, becomes lower triangular. Therefore the entire product, PBMCQ, is lower triangular. The result is similar to M via the invertible matrix PB, which is unitary (if P and B are unitary). That completes the proof.
In the complex numbers, u.v is defined as the sum of ui times the conjugate of vi. Dot product is no longer commutative, but that isn't a showstopper. In fact, v.u is the conjugate of u.v. With this in mind, u.u is its own conjugate, and is real. Furthermore, u.u is nonzero as long as u is nonzero. Also, u.v is 0 iff v.u is 0. Thus an orthonormal matrix is well defined over C. Gram Schmidt, and Schur's theorem, generalize to complex numbers, so that M is unitarily similar to a lower triangular matrix. An excursion into complex analysis is somewhat beyond the scope of this book, but it is worth at least a paragraph. If you are interested, go back up to the top of this section and run through the proof again; P can indeed be made unitary over C.
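A small check of these dot product facts (numpy assumed; u and v are made-up complex vectors):

```python
import numpy as np

u = np.array([1 + 2j, 3 - 1j])
v = np.array([2 - 1j, 1j])
uv = np.sum(u * np.conj(v))      # u.v = sum of u_i times the conjugate of v_i
vu = np.sum(v * np.conj(u))
assert np.isclose(vu, np.conj(uv))          # v.u is the conjugate of u.v
uu = np.sum(u * np.conj(u))
assert uu.imag == 0 and uu.real > 0         # u.u is real, and nonzero for nonzero u
```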
The trace of a square matrix is the sum of its diagonal elements.
Let A and B be n×n matrices and consider the upper left entry of A*B. This is the sum of A1,j×Bj,1, as j runs from 1 to n. The ith entry in the diagonal is the sum of Ai,j×Bj,i. To find the trace of A*B, take the sum as i runs from 1 to n. This produces the sum of Ai,j×Bj,i, for all i and j between 1 and n. Yet this is symmetric with respect to A and B. Therefore the trace of A*B is the same as the trace of B*A.
If Q is a nonsingular matrix, consider the trace of QM/Q. This is the same as the trace of M/Q times Q, or M. The trace of a matrix is equal to the trace of any similar matrix. In other words, trace is basis invariant.
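Both facts are easy to spot check numerically (numpy assumed; A, B, and the conjugator Q are made-up examples):

```python
import numpy as np

A = np.array([[1.0, 2.0], [3.0, 4.0]])
B = np.array([[0.0, 1.0], [5.0, 2.0]])
Q = np.array([[2.0, 1.0], [1.0, 1.0]])    # det = 1, so Q is invertible
assert np.isclose(np.trace(A @ B), np.trace(B @ A))          # trace(AB) = trace(BA)
assert np.isclose(np.trace(Q @ A @ np.linalg.inv(Q)), np.trace(A))   # basis invariant
```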
Given a matrix M, let T be the lower triangular matrix that is similar to M. You may need to expand the underlying field to do this, e.g. extend the reals up to the complex numbers. Once this is done, T can always be derived by Schur's theorem.
Recall that the determinant of T is the product of the diagonal elements. Subtract the variable s from these elements and multiply the binomials together to get the characteristic polynomial of T. The roots of this polynomial are, of course, the elements on the main diagonal. Thus the diagonal contains the eigen values, including multiplicities.
The trace of T is the sum of the diagonal elements, which happens to be the sum of the eigen values. Since eigen values are basis invariant, M has the same eigen values. M has the same trace as T, and the same eigen values, hence the trace is the sum of the eigen values.
The norm of a matrix is the product of its eigen values, including multiplicities. Again, the eigen values are basis invariant, so switch to a similar, lower triangular matrix T. The norm is now the product of the diagonal elements of T. This is also the determinant of T, which is the same as the determinant of M. So the norm is the determinant, or the product of the eigen values.
Since norm and trace can be computed from the entries of M, they belong to the base field F, even if some of the eigen values do not belong to F. If M is real, and an eigen value is 3+4i, another eigen value is 3−4i, so that their sum, and product, is once again real.
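The 3+4i example can be realized concretely (numpy assumed): the matrix below has eigen values 3+4i and 3−4i, while its trace and determinant stay real.

```python
import numpy as np

M = np.array([[3.0, 4.0],
              [-4.0, 3.0]])
w = np.linalg.eigvals(M)                    # 3+4i and 3-4i
assert np.isclose(w.sum().real, np.trace(M))          # trace = sum = 6
assert np.isclose(np.prod(w).real, np.linalg.det(M))  # norm = det = product = 25
```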
A block diagonal matrix has blocks along the main diagonal, and zeros elsewhere.
To be more precise, disjoint intervals define the blocks. If [2,5] establishes a block, then Mi,j could be nonzero when i is between 2 and 5, and j is between 2 and 5. Everything outside the specified blocks is zero. Here is a simple 4×4 block matrix, with blocks [1,2] and [3,4].
7  9  0  0 
4  2  0  0 
0  0  8  6 
0  0  1  1 
You may have noticed that every matrix is block diagonal, consisting of one block running from 1 to n. Well that's not very interesting, so let's say a block diagonal matrix has to have at least two blocks.
The trace of M is the sum of the trace of the blocks. That's pretty clear; just add up the elements along the main diagonal.
The determinant of M is the product of the determinants of the individual blocks. Build a block diagonal matrix P, and its inverse block diagonal matrix Q, so that PMQ makes each block lower triangular. This makes the whole matrix lower triangular. Once transformed, the eigen values run down the main diagonal. These are the same eigen values of M, and the blocks of M. The product of these eigen values is the product of the eigen values grouped together by block. Therefore det(M) is the product of the determinants of each block.
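Here is the 4×4 example above, checked numerically (numpy assumed): trace and determinant come from the blocks.

```python
import numpy as np

A = np.array([[7.0, 9.0], [4.0, 2.0]])    # upper left block, [1,2]
B = np.array([[8.0, 6.0], [1.0, 1.0]])    # lower right block, [3,4]
M = np.zeros((4, 4))
M[:2, :2] = A
M[2:, 2:] = B
assert np.trace(M) == np.trace(A) + np.trace(B)
assert np.isclose(np.linalg.det(M), np.linalg.det(A) * np.linalg.det(B))
```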
Modify M all you like below the blocks, making M lower block triangular. That does not change the trace or norm. PMQ still makes each block lower triangular, as it did before, with the same eigen values. Thus the eigen values of a lower block triangular matrix are the eigen values of the individual blocks.
The span of two subspaces U and V is the smallest subspace containing both. This is the set of vectors x+y such that x ∈ U and y ∈ V. Verify that this is indeed a subspace, and it must be included in any subspace containing U and V, hence it is the span of U and V.
If U and V are disjoint (except for 0), then the span is called the sum, or the direct sum, or the direct product. Every vector in the span has a unique representation as x+y, where x comes from U and y comes from V. If this fails then x1+y1 = x2+y2, hence x1−x2 = y2−y1, and U and V intersect after all. We can also claim U and V are linearly independent. If x+y = 0 then x = −y, so U contains −y, V contains −y, and U and V intersect. To complete the circle, assume U and V are linearly independent. If they are not disjoint, if they share a common vector x, then write x + (−x) = 0, which contradicts independence.
In summary, the span is a direct sum (i.e. every vector in the span has a unique representation), iff U and V are disjoint, iff U and V are linearly independent spaces.
Some people write the direct sum using the plus operator, as in U+V. Others write it as U*V. I prefer the * notation, because the span really is a cross product of all possible vectors in both sets. Granted, the vectors are added together, as part of an abelian group, so + makes sense too; but it's still a cross product, and I lean towards U*V.
If U and V are independent, a basis for U and a basis for V combine to form a basis for U*V.
A subspace is invariant under a transformation if that subspace is mapped into itself. Don't assume the subspace is fixed by this transformation; that isn't necessarily the case. If a transformation rotates space about the z axis, the z axis is fixed, and the xy plane is invariant.
Given a linear transformation and an eigen value w, the eigen vectors with eigen value w form an invariant vector space. In fact each vector in that space remains in position; it is simply scaled by the eigen value w.
Zero is always invariant.
Show that the span, and the intersection, of two invariant subspaces are again invariant subspaces.
Subspaces are complementary if they are invariant and their direct sum produces the original space. Note that spaces are complementary with respect to a given transformation.
If a basis B consists of a basis for U and a basis for V, and U and V are complementary, so that U maps into U and V maps into V, then the transformation, relative to B, is a block diagonal matrix, where the upper left block operates on U, and the lower right block operates on V. 

A transformation is nilpotent, with index p, if p is minimal, and applying the transformation p times squashes the entire vector space down to 0. If the transformation is implemented as a matrix M, M^p = 0.
If a nilpotent function M has index p, some vector z satisfies z*M^(p−1) ≠ 0. This is a consequence of p being minimal.
Build a set of vectors z, zM, zM^2, zM^3, and so on up to zM^(p−1). These vectors are independent, and span a space of dimension p.
Suppose a linear combination of these vectors produces 0. Multiply through by M^(p−1) on the right. Every term drops to 0 except the first term, cz. Yet this must still equal 0, hence the first coefficient is 0. Remove that term from the combination, and multiply through by M^(p−2). This leaves only the second term, and its coefficient must be 0. Continue this process until all coefficients are 0. Thus the p vectors are linearly independent.
Since M moves each of these vectors onto another one, the space spanned by these vectors is invariant under M.
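A sketch of the chain z, zM, zM^2 for a nilpotent M of index 3 (numpy assumed; row vectors multiply M on the right, as in this chapter):

```python
import numpy as np

M = np.array([[0.0, 0.0, 0.0],
              [1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0]])            # ones on the subdiagonal: index 3
z = np.array([0.0, 0.0, 1.0])
chain = np.array([z, z @ M, z @ M @ M])    # z, zM, zM^2
assert np.linalg.matrix_rank(chain) == 3   # independent: they span 3 dimensions
assert np.allclose(z @ M @ M @ M, 0)       # zM^3 = 0, as the index p = 3 demands
```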
Set M = 0 for the trivial case, a nilpotent map with index 1. Since zM = 0, any nonzero z will serve, spanning a space of dimension 1.
The next trick is to find a complementary subspace. With M and z as described earlier, the repeated images of z span a space of dimension p. This space always has a complementary subspace relative to M, of dimension n−p. The proof is rather technical, so hang on to your hat.
Let W be a vector space of dimension n, and let M be a nilpotent transformation on W with index p. Select a vector z such that zM^(p−1) is nonzero. Recall that zM^i, as i runs from 0 to p−1, forms a set of p independent vectors.
Let the subspace Ni be spanned by zM^j, for all j ≥ i. This is a descending chain of spaces, where N0 has dimension p, Ni has dimension p−i, and Np = 0. The N spaces will depend on the selection of z. That's ok; pick any z that builds a chain of length p, and on we go. Note that z is in our vector space; there is no need to extend the base field.
Let Ti be W*M^i. This is another descending chain of spaces, with T0 = W, and Tp = 0. This chain does not depend on the selection of z.
Show that Ti contains Ni. It contains (z)M^i, and (zM)M^i, and (zM^2)M^i, and so on.
The goal is to build Vi, such that Vi and Ni are disjoint subspaces, and the direct sum Vi*Ni = Ti, and Vi is invariant under M. (We already know Ni is invariant.) Here is an illustration with p = 5.
N space  Dimension  Complementary  T space 

N0  5  V0  T0 = W 
N1  4  V1  T1 
N2  3  V2  T2 
N3  2  V3  T3 
N4  1  V4  T4 
N5 = 0  0  V5 = 0  T5 = 0 
When i = p, the bottom row, Vp exists, and is equal to 0. Since M has index p, Mp maps all of W into 0, thus Tp = 0. With Np and Vp = 0, Np*Vp = Tp. Of course Np and Vp are M invariant. Therefore Vp satisfies our criteria.
Take one step backwards, setting i = p−1. Now Ni is the one dimensional space spanned by y, where y = zM^i. Build a basis for Ti, starting with y, then let Vi be the space spanned by the basis vectors of Ti beyond y. Clearly Vi and y are linearly independent, and together they span Ti. Also, Vi and y are invariant under M; in fact they are both squashed down to 0. Thus Vi meets our criteria.
If p = 1 we are done. For p > 1, take the inductive step, working from p−1 back to 0. For notational convenience I will step back from 3 to 2. By induction there is a subspace V3 corresponding to N3. Let K3 be the subspace of T2 whose image under M lies in V3; that is, K3M ⊆ V3. Verify that K3 is indeed a subspace.
If x is in V3 then it is in T3, which is the image of T2 under M, hence the map is onto. As spaces, K3M = V3.
Since V3 is invariant, applying M to V3 remains inside V3. Thus V3 meets the definition of K3, and V3 lies in K3. In other words, K3 is an extension of V3, and M squashes K3 onto V3.
Applying M to K3 yields V3, which is in K3, hence K3 is M invariant.
Next show that K3 and N2 span T2. Let x be any vector in T2. Let y = xM. Thus y is in T3. By induction, y is a unique sum of two vectors, one from N3 and one from V3. Call these vectors q and r respectively. Remember that q is spanned by zM^3, zM^4, zM^5, etc. Pull out a common factor of M, and q = sM, where s is spanned by zM^2, zM^3, zM^4, etc. Thus q = sM, where s is in N2. Write xM = sM+r, or r = (x−s)M, and x−s is in K3. An arbitrary x is spanned by N2 and K3.
Next, show N2∩K3 is contained in N3. Intersect N2*M and K3*M, which is the same as N3∩V3, or 0. Our intersection lives in N2, and when multiplied by M, the result is 0. Elements in N2 that are one step away from 0 are in fact in Np−1, i.e. a scale multiple of zM^(p−1). Vectors of this form certainly belong to N3, hence N2∩K3 lies in N3.
Consider N2∩K3∩V3. With the first intersection in N3, and N3 and V3 disjoint, the result is 0. Since V3 is contained in K3, N2∩V3 = 0. We built V3 to be disjoint from N3, and it is, but it is also disjoint from N2. N2 gets bigger, as T2 gets bigger, but N2 does not collide with V3.
Let S be the direct sum of N2∩K3 and V3. With N2 and V3 disjoint, the direct sum is well defined. Clearly S contains V3.
Elements of S are produced by members of V3 and K3, hence S is contained in K3. In other words, V3 ⊆S ⊆K3.
If we have a basis for S, extend it to include all of K3.
Let the additional vectors span a space called K2.
Thus the direct sum of S and K2 gives K3.
Furthermore, K2 and V3 are disjoint, with V3 entirely in S.
Finally we are ready to build V2. Let V2 be the direct sum of K2 and V3. 

Both K2 and V3 lie in K3, hence V2 is in K3. Thus V2*M lies in V3, which is part of V2, hence V2 is M invariant.
Consider N2∩V2. Suppose x lies in both, and write x as q+r, where q is in V3 and r is in K2. Remember that N2 and V3 are linearly independent, hence no part of x lies in V3. Therefore x lies in K2, which is a subset of K3. With x in K3, and N2, x is in their intersection, which was used to build S. Now S and K2 are disjoint, so x = 0. Therefore N2 and V2 are disjoint.
Finally we have the last criterion, N2 and V2 span T2. Remember that V2 is the direct sum of K2 and V3. We need to show N2, K2, and V3 span T2. We know that N2 and K3 span T2 (shown earlier). Find a representation for an arbitrary vector x as p+q+r, where p is in N2, q is in S, and r is in K2. Review the definition of S. Part of q comes from N2, and the other part comes from V3. The former is obviously in our span. Since V3 lies in V2, the latter is also in our span. There is no trouble with p and r, hence x is spanned. Conversely, note that N2, V3, and K2 are all in T2. The span of N2 and V2 is precisely T2. The space V2 satisfies our criteria, and that completes the inductive step.
March all the way up to T0, which equals W. The space N0 has a complementary space V0, such that W is their direct sum, and both subspaces are M invariant.
Build a new basis for M, starting with N0 (i.e. z and its ongoing images), and then V0. The new matrix is block diagonal, with N0 in the upper left and V0 in the lower right. In fact we know what the upper block looks like. The transformation moves each basis vector onto the next one, hence it has ones just below the main diagonal, and zeros elsewhere. Let z be the bottom basis vector in this block, [0,0,0,… 1]. Multiply by M and shift left, to [0,0,…,1,0]. Repeat this, up to the highest basis vector [1,0,0,…], which then falls off the edge and becomes 0, as you would expect from a nilpotent transformation.
With these blocks established, restrict M to V0, the lower right block. This is still a nilpotent transformation. Perform the very same procedure, giving two new complementary subspaces. This splits the lower right block into two blocks. Thus our matrix now has three blocks. The first is unchanged, and the second looks like the first, with ones on the subdiagonal and zeros elsewhere. The third block, in the lower right, represents yet another subspace with a nilpotent transformation. Apply the process again, and again, until the entire matrix consists of these blocks.
After a change of basis, the new, similar matrix becomes block diagonal, where each block has ones on its subdiagonal and zeros elsewhere. Thus every nilpotent matrix is similar to a matrix that is zero everywhere, except for scattered ones along the subdiagonal. The intermittent zeros on the subdiagonal delimit the blocks. For instance, consider a 6×6 matrix with two blocks, spanning rows 1 through 2 and rows 3 through 6. The first block comes from N0, and the second from V0, which has then been transformed into its own subdiagonal representation. Notice that the 0 in position 3,2 separates the two blocks.
0  0  0  0  0  0 
1  0  0  0  0  0 
0  0  0  0  0  0 
0  0  1  0  0  0 
0  0  0  1  0  0 
0  0  0  0  1  0 
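As a quick sanity check, here is a sketch (my own, not part of the text) that builds the 6×6 matrix above and confirms it is nilpotent, with the largest block setting the power at which it dies.

```python
import numpy as np

# Two blocks of sizes 2 and 4: ones on the subdiagonal, zeros elsewhere.
M = np.zeros((6, 6), dtype=int)
for i in [1, 3, 4, 5]:        # subdiagonal entries M[i][i-1], skipping i = 2
    M[i, i - 1] = 1

# The zero at 1-indexed position (3,2) separates the two blocks.
assert M[2, 1] == 0

# The largest block has size 4, so M^4 = 0 while M^3 is still nonzero.
assert np.any(np.linalg.matrix_power(M, 3) != 0)
assert np.all(np.linalg.matrix_power(M, 4) == 0)
```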
Let M be a nilpotent transformation on n space. If the basis is chosen properly, the basis vectors can be partitioned into sets, so that, within each set, M moves one vector onto the next, onto the next, and so on, until the last vector is squashed down to 0. A set of size one is a lone vector that is mapped to 0. Each set becomes a block in the block diagonal matrix. If all sets are lone vectors, all vectors are mapped to 0, and M is the zero matrix.
The index of M, as a nilpotent transformation, is the size of the largest set, or block. Thus the maximum possible index of a nilpotent transformation is n, the dimension of the space. This happens when all n vectors participate in the march to zero, and all entries in the subdiagonal of M are 1. Verify that the powers of M shift the ones down and to the left, until Mn-1 has 1 in the lower left corner, and Mn is zero.
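You can verify the claim numerically; here is a small sketch (my own example) with a full subdiagonal of ones, so the index equals n.

```python
import numpy as np

n = 5
M = np.eye(n, k=-1, dtype=int)          # ones on the subdiagonal

# Powers shift the ones down and to the left; M^(n-1) is a lone 1
# in the lower left corner, and M^n is zero.
P = np.linalg.matrix_power(M, n - 1)
expected = np.zeros((n, n), dtype=int)
expected[n - 1, 0] = 1
assert np.array_equal(P, expected)
assert np.all(np.linalg.matrix_power(M, n) == 0)
```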
A jordan block jr(l) is an r×r matrix with l down the main diagonal and 1's on the subdiagonal. If r = 1 then there is no subdiagonal; that's ok.
A matrix is in jordan canonical form if it is block diagonal, and each block is a jordan block.
A diagonal matrix is in jordan form; each block has size 1.
A diagonalizable matrix is similar to a diagonal matrix (by definition), and its diagonal equivalent is in jordan form.
Every nilpotent matrix M is similar to a matrix in jordan form. This was proved in the previous section. The jordan blocks all have zeros on the main diagonal, e.g. jr(0). Thus the matrix has zeros down its main diagonal. Since the matrix is lower triangular, these diagonal entries are also the eigen values. This makes sense, for a single nonzero eigen value would prevent M from being nilpotent; the corresponding eigen vector persists, scaled again and again by the eigen value.
Here is a sample matrix in jordan form, with eigen values 0, 2, 3, 5, and 7.
0  0  0  0  0  0  0  0  0  0  0 
1  0  0  0  0  0  0  0  0  0  0 
0  0  2  0  0  0  0  0  0  0  0 
0  0  1  2  0  0  0  0  0  0  0 
0  0  0  1  2  0  0  0  0  0  0 
0  0  0  0  0  3  0  0  0  0  0 
0  0  0  0  0  1  3  0  0  0  0 
0  0  0  0  0  0  0  3  0  0  0 
0  0  0  0  0  0  0  0  5  0  0 
0  0  0  0  0  0  0  0  0  7  0 
0  0  0  0  0  0  0  0  0  1  7 
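A small helper (my own sketch) can assemble a jordan matrix from its blocks; here it rebuilds the 11×11 sample above from j2(0), j3(2), j2(3), j1(3), j1(5), and j2(7).

```python
import numpy as np

def jordan_block(l, r):
    # l down the main diagonal, ones on the subdiagonal; r = 1 is fine.
    return l * np.eye(r, dtype=int) + np.eye(r, k=-1, dtype=int)

def block_diagonal(blocks):
    n = sum(b.shape[0] for b in blocks)
    J = np.zeros((n, n), dtype=int)
    pos = 0
    for b in blocks:
        r = b.shape[0]
        J[pos:pos + r, pos:pos + r] = b
        pos += r
    return J

J = block_diagonal([jordan_block(0, 2), jordan_block(2, 3),
                    jordan_block(3, 2), jordan_block(3, 1),
                    jordan_block(5, 1), jordan_block(7, 2)])
assert J.shape == (11, 11)
assert J[1, 0] == 1 and J[2, 1] == 0     # zero between blocks
assert J[2, 2] == 2 and J[10, 10] == 7
```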
Let the elements of M, and the eigen values of M, generate a field F. A conjugating matrix in F now turns M into its jordan form equivalent. The proof is just a few steps away.
Let M be a singular n×n matrix that is not nilpotent, and let W be all of n space. Let W*M represent the image of W under M, which is really the span of the rows of M. Since M is singular, this is a proper subspace of W.
Consider the image of M2, which is the same as mapping W through M twice. If this produces a smaller subspace, then repeat the process. Eventually we run out of dimensions, so for some j, W*Mj+1 = W*Mj. We don't have to worry about an image of 0, because M is not a nilpotent transformation. Thus Mj produces a proper subspace strictly between 0 and W.
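The shrinking chain of images is easy to watch numerically. Here is a sketch (my own example, not from the text) for a singular, non-nilpotent matrix: the rank drops, then stabilizes, and never reaches 0.

```python
import numpy as np

# A nilpotent 2x2 block plus an invertible 1x1 block.
M = np.array([[0, 0, 0],
              [1, 0, 0],
              [0, 0, 2]])

# Rank of M^j = dimension of the image W*M^j.
ranks = [np.linalg.matrix_rank(np.linalg.matrix_power(M, j))
         for j in range(1, 5)]
# The image shrinks, then stops shrinking at j = 2, never hitting 0.
assert ranks == [2, 1, 1, 1]
```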
Let K be the kernel of Mj, and let R be the range.
Another application of M, beyond Mj, does not change the range. Thus R is M invariant.
Since the range does not change with another application of M, the kernel does not change either: if K is the kernel of Mj, K is also the kernel of Mj+1. Thus K*M is a space that, when mapped through Mj, becomes 0. Hence K*M lies in the kernel of Mj, and K is M invariant.
Suppose K and R intersect in x. Find the preimage y such that x = yMj. Since x is also in the kernel, yM2j = 0. This means y is in the kernel of M2j, which is also the kernel of Mj. In other words, y is in K, and that means x = 0. Therefore R and K are disjoint.
Being kernel and range, the dimension of R plus the dimension of K is n. Since K and R are disjoint, they span a space of dimension n. This is a subspace of W, which also has dimension n; thus the direct sum of K and R yields W.
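Here is a numeric check of the direct-sum claim (my own sketch, using the row-vector convention x*M from the text): stacking a basis of K on a basis of R gives an invertible matrix, so together they fill W and meet only in 0.

```python
import numpy as np

M = np.array([[0, 0, 0],
              [1, 0, 0],
              [0, 0, 2]])
j = 2
Mj = np.linalg.matrix_power(M, j)

r = np.linalg.matrix_rank(Mj)          # dim R
k = M.shape[0] - r                     # dim K, by rank plus nullity
assert r + k == 3

R_basis = Mj[[2], :]                   # the one independent row of M^j
K_basis = np.array([[1, 0, 0],
                    [0, 1, 0]])        # rows killed by M^j
assert np.all(K_basis @ Mj == 0)

full = np.vstack([K_basis, R_basis])   # K basis stacked on R basis
assert np.linalg.matrix_rank(full) == 3
```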
We have produced two proper complementary spaces inside W, relative to the transformation M. When M is restricted to K the result is a nilpotent transformation with index j. When M is restricted to R the map is onto, and invertible.
After a change of basis, M is block diagonal, with the nilpotent kernel in the upper left and the invertible range in the lower right. Convert K into jordan form, then concentrate on R.
This applies when some, but not all, of the eigen values of M are 0. Convert M to lower triangular, and some, but not all, of the diagonal is 0. If the diagonal were all zero then M2 would push the nonzero entries down and to the left, M3 would push them down further, and Mn would be 0, thus nilpotent. If the diagonal were entirely nonzero then M would be nonsingular. So this procedure pulls out the space with eigen values of 0, and turns it into jordan form, leaving a complementary space wherein M is invertible.
Now for the finale. Each equivalence class of similar matrices is represented by a unique matrix in jordan canonical form.
There are two parts to this proof. Given an arbitrary matrix M, find a similar matrix in jordan form that can act as its representative. Then, show that different jordan matrices represent different classes. Thus each jordan matrix is a faithful representative of its class.
Let M have an eigen value l. If l = 0 then separate out the corresponding eigen space and turn it into jordan form, as described above. Once this is done, concentrate on the lower right block, where all the eigen values are nonzero. Relabel this block as M, and proceed.
Let l be any eigen value and let T be M with l subtracted from the main diagonal. Write this as T = M-l, or M = T+l.
The determinant of T is zero, and T is a singular matrix. Use the previous theorem to find complementary subspaces K and R, such that K is nilpotent and R is nonsingular. (This change of basis may require an extension of the base field F by the eigen value l.)
We know that K is T invariant, that is, T maps K into itself. But K is also invariant when scaled by the constant l. Put these together and K is invariant under M. In the same way, R is M invariant. Turn M into a block diagonal matrix, where the upper left block operates on K and the lower right block operates on R. Once again set T = M-l.
Since the upper left block of T is nilpotent, turn it into jordan form. Write PT/P = J, where J has jordan blocks in the upper left. Add l to T and note that l commutes with P. (In fact l commutes with everything.) Thus PM/P = J+l. The jordan blocks are now marked with the eigen value l that created them. Each block has l down the diagonal and ones on the subdiagonal.
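The step "add l back after conjugating" relies on l commuting with the conjugator. A quick sketch (my own random example) confirms P(T+l)/P = PT/P + l.

```python
import numpy as np

rng = np.random.default_rng(0)
T = rng.integers(-3, 4, size=(4, 4)).astype(float)   # stand-in for T
P = np.array([[1., 1, 0, 0],
              [0., 1, 0, 0],
              [0., 0, 1, 2],
              [0., 0, 0, 1]])                        # an invertible conjugator
l = 5.0
I = np.eye(4)

# The scalar matrix l*I commutes with everything, so conjugating T+l
# is the same as conjugating T and then adding l.
lhs = P @ (T + l * I) @ np.linalg.inv(P)
rhs = P @ T @ np.linalg.inv(P) + l * I
assert np.allclose(lhs, rhs)
```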
If R is nontrivial then there are other eigen values to consider. Step through them one at a time, creating jordan blocks for each eigen value. Finally M is transformed into a series of jordan blocks, a matrix in jordan form.
Next show that different jordan matrices belong to different classes, and are not similar to each other.
Note that a change of basis (using permutation matrices) can permute the blocks in a block diagonal matrix. The simple jordan blocks establish the jordan form, without regard to order. If you like, you can sort the blocks by eigen value, and then by size, and place these blocks along the main diagonal from upper left to lower right. This produces a well defined canonical matrix. For our purposes, we will say that each equivalence class is represented by an unordered set of jordan blocks whose dimensions add up to n.
How can we prove two jordan matrices are not similar? Start by finding properties of a matrix that do not change under similarity. One property that is invariant is the set of eigen values, including multiplicity. Thus, if two jordan forms are similar, the jordan blocks have the same eigen values, including multiplicity.
With this in mind, it is sufficient to focus on a particular eigen value l. There may be several ways to produce 7 instances of l. We could have two jordan blocks, size 3 and 4, or 7 jordan blocks of size 1. Let's see why these aren't similar.
Consider two jordan matrices, each with l down the main diagonal, that are similar by assumption. If two matrices are similar then they are similar after you subtract l from the main diagonal. Now the two matrices are nilpotent. The index p, wherein Mp = 0, does not change with similarity, thus both matrices have the same index p, which is the size of the largest jordan block. Both matrices have blocks jp(0).
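Index invariance can be checked directly; here is a sketch (my own example) conjugating a single block j4(0) and confirming the conjugate dies at the same power.

```python
import numpy as np

def nilpotent_index(M):
    # Smallest p with M^p = 0 (within floating point tolerance).
    P = np.eye(M.shape[0])
    for p in range(1, M.shape[0] + 1):
        P = P @ M
        if np.allclose(P, 0):
            return p
    return None   # not nilpotent

J = np.eye(4, k=-1)                    # single jordan block j4(0), index 4
X = np.array([[1., 2, 0, 0],
              [0., 1, 0, 0],
              [0., 0, 1, 0],
              [0., 3, 0, 1]])          # an invertible conjugator
S = X @ J @ np.linalg.inv(X)           # similar to J

assert nilpotent_index(J) == 4
assert nilpotent_index(S) == 4
```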
How many blocks of size p? Find z, as described earlier, so that the images of z span a space of dimension p, before the last image of z drops to 0. Do this on either side. This consumes one jordan block of size p from each matrix. If there is another jordan block of size p then there is another z, outside the span of the last eigen space, with index p. Build a new eigen space using this vector, then look for another independent vector of index p, and so on. The number of p dimensional eigen spaces that can be built in this manner determines the number of jordan blocks of size p.
Once these spaces are established, move on to a smaller index q. Find z such that the images of z span a space of dimension q, and z is independent of the eigen spaces of dimension p that have gone before. This establishes a jordan block of size q. Continue this process until all the jordan blocks are determined, and all via properties of the nilpotent transformation that do not change under similarity. Therefore, if two jordan matrices are similar, they must contain the same jordan blocks, of the same sizes. That completes the proof.
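The same bookkeeping can be phrased through ranks. It is a standard fact (sketched here in my own code, not the chain construction above) that for a nilpotent M, the number of jordan blocks of size at least k is rank(Mk-1) - rank(Mk); since rank is invariant under similarity, the block sizes are determined by the class.

```python
import numpy as np

def block_sizes(M):
    # Recover jordan block sizes of a nilpotent M from ranks of its powers.
    n = M.shape[0]
    ranks = [np.linalg.matrix_rank(np.linalg.matrix_power(M, k))
             for k in range(n + 1)]                 # ranks[0] = n
    sizes = []
    for k in range(1, n + 1):
        at_least_k = ranks[k - 1] - ranks[k]        # blocks of size >= k
        at_least_k1 = ranks[k] - ranks[k + 1] if k < n else 0
        sizes += [k] * (at_least_k - at_least_k1)   # blocks of size exactly k
    return sorted(sizes, reverse=True)

# The two ways of producing 7 instances of an eigen value, from the text:
M = np.eye(7, k=-1)
M[3, 2] = 0                                         # blocks of sizes 3 and 4
assert block_sizes(M) == [4, 3]
assert block_sizes(np.zeros((7, 7))) == [1, 1, 1, 1, 1, 1, 1]
```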