Having studied sets every September from 6th grade through 12th, I thought I knew what a set was. It's like a bag with stuff in it, right? Then I went to graduate school and learned otherwise.
Actually there are no bags; there are only rules for containment. The even numbers form a set because they conform to a rule, i.e. divisible by 2. You can picture them in a bag, or not, as you wish. However, the bag metaphor gets in the way at times, because there are situations where a set can contain itself, or two sets can contain each other.
→  
← 
What sort of items can be clumped together into sets? Lions? Tigers? Bears? Surely the integers. Well not really. The only thing a set can contain is other sets. We ascribe semantic meaning to these sets, but they are just sets.
It is possible to start with the empty set and build the integers, and all of mathematics, from the ground up. The empty set is denoted ∅, and it has no members. Think of the empty set as 0, the set containing the empty set as 1, the set containing 0 and 1 as 2, the set containing 0 1 and 2 as 3, and so on. Thus n is the set containing all the numbersets that came before. So you see, we don't really need any objects, or bags, at all. The only thing that is real is the empty set and the rules for containment.
Here's how the number 4 might look as a set.
(∅ (∅) (∅(∅)) (∅(∅)(∅(∅))))
If two sets contain the same elements they are the same set. This reenforces the notion that the rule is the set. If two sets both contain the even integers, they are the same set. Such a set starts out (∅, 2, 4, 6, …), where the integers are really the aforementioned sets.
Write x ∈ s when x belongs to s. We call x an element, i.e. an element of s, but it's really just another set, because that's all we have is sets, and sets of sets, etc.
Many books have been written about sets; I'm only going to go over the basics, the things you need to know to build groups and rings and the like. Taking this shortcut means you may, at the end of this chapter, still be unsure what a set is. Join the club! But 99.9% of the time, a set is a slice of the rationals, or the reals, or polynomials, or rotations in space, or other things that are well understood.
One thing it is not is "all sets". There is no set of all sets. I may start out by saying "A monoid is a set with a + operator and an identity", and you might think, "Great, all the sets form a monoid under union." It looks good at first. (A∪B)∪C = A∪(B∪C), so union is associative. And the empty set is the identity. S ∪ ∅ = S. But there is no set of all sets. For technical reasons, every structure I define is going to be a set, and there is no set of all sets, so the entire universe of sets is never going to be a field, or a group, etc.
The lack of a universal set can be proven, but other aspects of set theory are axiomatic. For instance, infinite sets are assumed, simply waved into existence. They don't have to be there. You can build a perfectly good world of mathematics that consists only of finite sets. But then there is no structure like the integers, and proofs based on induction (even though they may refer to finite sets) don't really work. So infinite sets are assumed.
Another axiom is that a set does not contain itself. This is arbitrary too. You can build a consistent world of mathematics where sets do at times contain themselves. There is no inherent contradiction here. However, selfreferencing sets don't seem to be necessary for the vast majority of math, or engineering, or science, so I'm leaving them out.
Another axiom is the continuum hypothesis, abbreviated CH. One can prove, by diagonalization, that there are more real numbers than rationals. This seems strange at first, since both sets are infinite, but it can be made rigorous, and will be demonstrated below. Are there any sets that are larger than the rationals, yet smaller than the reals? It's up to you. If you want it to be true than declare it so; otherwise say it is false. The continuum hypothesis says there are no such intermediate sets, and that's ok with me.
The most interesting axiom is the axiom of choice. This is usually invoked as zorn's lemma, which is equivalent to the axiom of choice. One is true iff the other is true. Zorn's lemma is described below, under partially ordered sets. Many theorems rely on zorn's lemma, theorems that I want to be true, so I am saying yes to the axiom of choice.
This is ZFC set theory, where ZF stands for Zermelo Fraenkel, who developed and refined the theory, and C stands for choice. Almost every math book is based on ZF set theory, and most include choice as well, even if they don't say so explicitly.
The direct product of sets, possibly an infinite list of sets, is the cross product of all these sets, i.e. all combinations thereof. If S contains a b and c, and T contains 1 2 and 3, then the direct product of S and T, sometimes called S cross T, consists of all 9 pairs: a1 a2 a3 b1 b2 b3 c1 c2 c3. The direct product of 3 sets generates triples, and so on.
The direct product of an infinite list of sets creates infinite strings. (This requires the axiom of choice.) If each set Si in the list happens to be the real numbers, then the direct product is a sequence of real numbers.
S1  *  S2  *  S3  *  S4  *  S5  *  … 
The direct sum is again a cross product, but infinite strings are highly constrained. Almost all the elements in the string have to be some designated member of the set, usually zero.
7, 5, ½, 3, 2, π, 0, 4, 6, 0, 0, 0, 0, 0, 0, 0, 0, …
When the list of sets is finite, direct product and direct sum are the same thing.
The 7adic numbers are a direct product  infinite sequences of integers mod 7. Each set in the underlying list is Z/7. The direct product contains the direct sum, where the entries trailing off to the right are all zero. These are the strings that correspond to the nonnegative integers as we know them. Write the integer in base 7, backwards, to build the finite string; then fill in with zeros after that.
A relation is a subset of the cross product of two or more sets. For example, consider the "ownership" relation, where S = people and T = pets. The relation contains Dick,Spot, since Dick owns Spot. It does not contain Jane,Spot, but it does contain Jane,Puff.
Dick → Spot Jane → Puff Tim → Lassy Fred → Spot (this function is not invertible) Clinton → Sox Nixon → Checkers
A relation is considered a function if no member of S is repeated. The "ownership" relation above is a function, because nobody owns more than one pet. Put in a person, and the function cranks out the pet.
When a function is derived from S cross T, S is called the domain, and T is called the range. If x is a member of S, f(x) is the corresponding member of T, as determine by the function f. We sometimes say f maps S into T.
If f covers all of T, every pet in the world for example, then f is surjective, or f maps S onto T.
If everything in T appears only once, then f is injective, or f embeds S into T. The function can be reversed, going from pet back to owner. In our example, f is not invertible, since two people have joint custody of Spot.
A function is bijective if it is injective and surjective. Every person owns exactly one pet, and every pet is owned by somebody. There is a 1 to 1 correspondence.
A function can map multiple sets into another set. The operator times maps R cross R onto R. Technically this relation consists of triples from the direct product R cross R cross R. The domain is the plane and the range is the line.
When a relation is derived from a set crossed with itself, several important properties become meaningful. Let R be a subset of S cross S, with a, b, and c drawn from S.
R is reflexive if for every a, R contains a,a.
R is symmetric if for every a and b, R contains a,b iff R contains b,a.
R is transitive if for every a, b, and c, R contains a,c whenever R contains both a,b and b,c.
A relation possessing all three properties is called an equivalence relation. Such a relation partitions the set S into disjoint subsets called equivalence classes. When R is an equivalence relation, a simpler function q carries the same information. Here q is an equivalence class function, where q(a) determines the subset of S that contains a. In other words, q(a) is the clump of things that a belongs to.
A simple example in Z cross Z includes a,b whenever ab is a multiple of 5. Verify the three properties above, then find the equivalence classes. The first equivalence class is represented by 0. It includes 5, 10, 15, and so on, since each is a multiple of 5 away from 0. The other equivalence classes are the other 4 integers mod 5. In general, a homomorphism partitions the domain into equivalence classes, shifted copies of the kernel, and these subsets, also called cosets, can sometimes be manipulated as individual entities. In the above example, the number 1 represents the coset of integers that are 1 mod 5.
Equivalence classes need not be the same size, and they need not correspond to one another. Let two real numbers be in a relation if they have a nonzero ratio. This is an equivalence relation, and the two equivalence classes are 0 and all the nonzero numbers.
Another example of equivalence classes are associates within a ring. In the integers, the classes are 5 and 5, 7 and 7, 19 and 19, and so on. All equivalence classes are the same size, according to the units of the ring, except for 0, which stands alone.
While the above properties clump elements together, other properties separate elements and put them in order.
R is irreflexive if for every a, R does not contain a,a. This is of course the opposite of reflexive. There is one relation that is both reflexive and irreflexive … the empty relation on the empty set.
R is antisymmetric if a,b and b,a implies a = b. This is where we start using the ≤ notation, for if a ≤ b and b ≤ a, then a must equal b. However, this notation can be confusing at first, because everybody is use to numbers, and given any two numbers, one is always less than or equal to the other. But two elements of an antisymmetric relation need not be comparable. In fact, R could be the empty set, no ordered pairs at all, and the relation is still antisymmetric. We only require that any two elements not be ≤ each other, unless they are the same.
If you know the relation is irreflexive, you can define antisymmetric as a,b implies not b,a. In this case we use the < operator. That is, a < b means we cannot have b < a. Again, all pairs of elements need not be comparable.
The relation R is a partial ordering on the set S, or S is a partially ordered set via R, or S is a poset, if R is transitive and antisymmetric. Once again we use the ≤ operator, or the < operator if R is also irreflexive. These are called weak and strong partial orderings respectively. Applying the weak notation, transitivity is written: a ≤ b and b ≤ c implies a ≤ c. The notation is helping us along, because arithmetic ≤ is certainly transitive. But unlike the arithmetic ≤ operator, some elements might not be comparable. Here is an example. Set a < b and c, and b and c < d, which means a < d, yet b and c are not comparable. Neither is less than the other. This is illustrated by a square with a in the lower left corner and d in the upper right. Arrows, or directed arcs, indicate the < or ≤ relation. Two arrows run from a to b and c, and two more arrows run from b and c up to d. Another arrow is implied from a to d by transitivity, but b and c are not comparable. 

An upper bound of a set T within a larger poset S is an element that is ≥ every member of T. The least upper bound (lub) is an upper bound that is ≤ every upper bound. Since two elements ≤ each other are the same, the lub is unique (if it exists). Lower bounds are defined similarly. These definitions are inspired by the integers, when a ≤ b means a is a factor of b. Verify that this is a partial ordering. The greatest lower bound of a and b becomes the greatest common divisor, and the least upper bound is the least common multiple. 

A minimal element in S means no element is smaller. The smallest or minimum element is less than every other element in S. Thus minimum implies minimal, and is unique relative to S. Maximal and maximum are defined similarly.
A lattice is a partial ordering with a lub (least upper bound) and a glb (greatest lower bound) for every pair of points. Given x and y, the join is the lub and the meet is the glb. If x ≤ y means x is a subset of y, the union of x and y is the join, and the intersection is the meet. This is a subset lattice. If the set is the integers, and ≤ means a factor of, then the lcm is the join and the gcd is the meet. The positive integers form a lattice, not just a poset.
Don't confuse this with a regular lattice in nspace. That is a repeating grid of points corresponding to the integer multiples of n vectors. A checkerboard in the plane for instance, or a pattern of parallelograms if the checkerboard is pushed over. That's a different kind of lattice altogether. Mathematicians sometimes use the same word for different things.
A complete lattice has a lub and a glb for every set of elements. The integers are not a complete lattice, because all the numbers taken together don't have an lcm. They do have a global gcd however, namely 1. The numbers 1, 2, 3, 5, 6, 10, 15, and 30 are complete, with 30 at the top and 1 at the bottom.
A linear ordering, or total ordering, is a partial ordering where all pairs of elements are comparable. They can all be placed in a line. The real numbers are linearly ordered, where ≤ has its usual meaning.
Every linear ordering is a lattice, since the join is the larger element and the meet is the smaller.
A chain is a subset of a poset that is linearly ordered. Usually the chain is declared ascending or descending, if it runs up forever or down forever.
Let S be a partially ordered set, and assume every chain of elements, i.e. every linearly ordered subset of S, has an upper bound. Zorn's lemma says there exists a maximal element m in S, such that nothing is larger than m. This does not mean m is maximum, greater than everything in S, only that m is maximal, with nothing above it. As mentioned earlier, this is assumed to be true as an axiom.
If every descending chain has a lower bound, then zorn's lemma says there exists a minimal m with nothing below it.
If set S has more members than set T (both finite), there is no injective function F mapping S into T.
Suppose a counterexample and delete some member of S and its image in T. This gives an injective function on two smaller sets. Keep going down until T has one element and S has more, whence the injective function cannot exist.
The name "pigeonhole Principle" was coined when this counting argument was applied to pigeons as they occupied a set of protected havens (pigeonholes). When there is an excess of birds, an injective function mapping pigeons to pigeonholes is not possible. Some must remain outside, or share a hole.
Here is a brainteaser; see if you can solve it. Select 9 different points in R3, having integer coordinates. Prove that one of the 36 segments determined by these 9 points contains another grid point with integer coordinates.
A parity argument combined with the pigeonhole principle does the trick. An integer point in R3 possesses 1 of 8 parity combinations; each coordinate is even or odd. Since 9 points were selected, two have the same parity combination by the pigeonhole principle. Their average, or midpoint,is another point with integer coordinates.
As a complement to the pigeonhole principle, f cannot be onto if T is bigger than S. Again, sets must be finite for this to work. Systematically remove x from S, and f(x) from T if nothing else maps to f(x). Eventually one element in S must map to many things in T, which means f cannot map onto all of t.
Here is an important variation. Let S and T be finite sets of the same size. A map from S to T is injective iff it is surjective. Equivalently, f is not injective iff f is not surjective. Start by assuming f is not injective. If both x and y in S map to z in T, delete x, and the smaller set S cannot map onto all of the larger set T. Thus f is not surjective. Conversely, if f does not cover everything in T then delete those items that are not in the range of f and apply the pigeonhole principle, whence f is not injective.
This is often used when a function f maps S into itself. By the above, f is injective iff it is surjective, and then f becomes a permutation on S.
An ordinal is a number, like 6 or 7 or 21, or even infinity. The numbers, before you get to infinity, are finite ordinals.
A cardinal is an indication of size, like 6 or 7 or 21, or even infinity. This looks the same, and it is for finite sets, but it's different for infinite sets.
Two sets are the same size, having the same cardinality, if there is a bijection between them. Their elements can be put in 1 to 1 correspondence. If S has 7 elements and T has 21, there is no way to equate the elements; the sets are not the same size.
Now move on to infinity. Let ω be the set containing 0 and all the positive integers. This is an ordinal, it is the smallest "number" that is bigger than all the finite ordinals. It is also a cardinal, measuring a flavor of infinity. This is the smallest infinity, the first infinity.
ω+1 is an ordinal, the next ordinal after ω. In technical terms, it is the set containing all the finite ordinals and ω. It's a perfectly good set, but it is not a well defined cardinal. That is because ω+1 is just as big as ω. Let each number n correspond to n+1. Then let ω correspond to 0. This is a bijection between ω and ω+1. The two sets are in correspondence, and are the same size. So ω+1 is no bigger than ω. There are, in fact, flavors of infinity that are bigger than ω, but they aren't found by adding or multiplying ω with the integers.
A set is countable if it is finite, or it can be placed in 1 to 1 correspondence with ω. A set is countably infinite if it has size ω, like the positive integers.
The nonnegative even numbers are countable; map n to n/2. Thus an infinite set can be just as large as a proper subset of itself.
The integers are countable. Map n to 2n for n ≥ 0, and map n to 2n1 for n < 0.
The positive numbers are countable; map n to n+1.
Any set that can be listed in order, or enumerated, is countable. For instance, any subset of the positive integers is countable. Take the smallest number in the set and place it first on the list. Take the next smallest and place it second on the list. Continue this process until the entire set has been "counted", i.e. mapped onto the positive integers. Thus the prime numbers are countable.
Ordered pairs of positive integers are countable. List them this way:
1/1 2/1 1/2 3/1 2/2 1/3 4/1 3/2 2/3 1/4 5/1 4/2 3/3 2/4 1/5 …
Since reduced fractions are a subset of ordered pairs of integers, the rational numbers are countable. This is counterintuitive, since they are densely packed in the number line. There are infinitely many rationals packed into the tiniest of intervals, yet there are just as many rationals as integers.
We showed that the cross product (i.e. ordered pairs) of countable sets is countable. Use this fact again and again to show that the ntuples of integers are countable. The integer points, or even the rational points in n space are countable.
In fact all finite ordered tuples of the integers, of all lengths, are countable. Here is a recipe for listing all possible tuples in order. The tuples of length n can be listed in order; this was described above. So start with the first tuple of length 1, then the first tuple of length 2 followed by the second tuple of length 1, then the first tuple of length 3 and the second tuple of length 2 and the third tuple of length 1, then the first tuple of length 4 and the second tuple of length 3 and the third tuple of length 2 and the fourth tuple of length 1, and so on.
As a corollary, the finite sets of integers are countable, as these are all represented, perhaps many times over, by various ordered tuples. The set (1,2,3) appears six times when order is significant. Since the ordered tuples over count the unordered subsets, and the tuples are countable, the finite subsets are also countable.
The integer polynomials are precisely the finite ordered tuples of integers. Well almost; the tuples that have leading zeros can be thrown away. Anyways, since the tuples are countable, the polynomials are countable. Of course the coefficients can be drawn from any countable set, so the polynomials over the rationals are countable.
The polynomials over x and y are the polynomials in y, whose coefficients are polynomials in x. The polynomials in x are countable, and since these act as coefficients, the polynomials in x and y are countable. This extends to polynomials in x y z, and so on.
There are lots of countable sets; are there any uncountable sets? Yes, as shown by diagonalization. Consider all the subsets of the integers (not just the finite subsets). If these sets are countable then the correspondence builds a list of all possible subsets in order. Build a new subset S as follows. Let n be in S iff n is not in the nth subset on the list. Now S cannot appear anywhere on the list. If S is in position n, then S contains n iff it doesn't. Every possible correspondence fails, because it misses some set S. A generalization of this important theorem will be given later on.
Consider the reals between 0 and 1, represented as decimal numbers. Suppose they can be arranged in a list. Build a new decimal as follows. Make the nth digit anything other than the nth digit of the nth number on the list. Also avoid the digits 0 and 9. The constructed real number cannot appear on the list. Since we're avoiding 0 and 9, an equivalent form of the same number cannot appear on the list either, e.g. 0.720000… = 0.719999…
This is called diagonalization because we are looking at the first digit of the first number, the second digit of the second number, and so on. List the decimal numbers down the page, with decimal digits flowing to the right. The nth digit in the nth number is a diagonal line running down and to the right.
The padic numbers are uncountable, using the very same proof.
63105…
52210…
44326…
66101…
21045…
The size of the reals is called c, for continuum, and c is a larger cardinal than ω. In fact I am assuming that c is next in line, the next biggest cardinal. This is the continuum hypothesis.
For every set, there is another set that is bigger. Cantor stunned the world with this simple, elegant proof. Let S be any set and let T be the power set of S. That means T includes every subset of S. If S has 1 2 and 3, then T has (), (1), (2), (3), (12), (13), (23), (123). If S is finite of size n, then T has size 2n.
Is there such a set? There is, but it's another axiom. It's not something you can prove; we just assume it is true.
Clearly S maps into T. Every x in S maps to the set (x) in T. But there is no bijection mapping S onto T.
Suppose f is such a bijection and build a set W as follows. For every x in S, x is in W iff x is not in f(x). By assumption, f is surjective, and f(x) = w for some x. Thus x ∈ W iff x ∉ W. Therefore the correspondence cannot exist. T is strictly larger than S.
Keep taking the power set, building S1, S2, S3, etc, and the flavors of infinity go on forever. When you think you have a handle on this, take the union of all these infinite sets, which is larger than any of them individually, and use this as a base to build another ascending chain of infinite sets, each larger than the last. Then take the union of that chain, and the union of all such chains, and so on, and so on. Trying to grasp this is like trying to comprehend the distance to the nearest galaxy. It is wonderful and troubling at the same time. Cantor's work was widely rejected, even ridiculed, sometimes on theological grounds; yet today it is an essential foundation for modern mathematics. Take a moment to read about Georg Cantor and his work.
If two sets embed in each other, are they the same size? They should be, at an intuitive level, they both fit inside each other, but we need a bijection to prove it.
If the first set S is finite, and f maps S into T, and g maps T into S, then fg embeds S into itself. As shown above, such an embedding is a bijection, both injective and surjective. If f is not surjective then its image, a proper subset of T, maps onto S by g. Remember that T embeds into S by G, and now a proper subset of T still covers all of S by g. That is impossible, hence f is surjective, as well as injective, and that proves S and T are the same size.
We really want size, or cardinality, to be a partial ordering on sets. Every set trivially maps (one to one) into itself, and if S maps into T maps into U then S maps into U. Sizes keep increasing. The relation is reflexive and transitive, we almost have a partial ordering on cardinality, but we're missing one thing: S ≤ T and T ≤ S → S = T. (Here = means the same size, not necessarily the same set.) We dealt with the finite case above, so assume both sets are infinite.
Let f map S into T and let g map T into S. These are injective functions. For any x in either set, trace the ancestry of x, going backwards through f inverse and g inverse, until you reach a parentless element that is not in the range of f or g. Let S1 be the subset of S whose members have parentless ancestors in S. Let S2 be the subset of S whose members have parentless ancestors in T. Let S3 be the subset of S whose members have infinite ancestry.
Let h = f on S1 and S3, and g inverse on S2.
Clearly h is reversible when restricted to
S1∪S3, or S2.
We need to show h is reversible across all of S.
In other words, we don't want h(x) to equal h(y),
when x comes from
S1∪S3
and y comes from S2.
Suppose f(x) = g inverse of y. Thus y = g(f(x)). If x is in S3 (infinite ancestry) then so is y. Yet y is in S2. Thus x is in S1. This means x has a parentless ancestor in S, and the same holds for y, so y should be in S1. Therefore h is injective. 

Next show that h maps S onto all of T. Select any y in T and first consider the case when, for some x in S, we have f(x) = y. If x is in S1 or S3 we are done, so x has a parentless ancestor in T. Let g(y) = z, and z has a parentless ancestor in T. Thus z is in S2, and h(z) = y. That covers y in the image of f.
Now consider the last case, when y is not in the image of f. Once again z=g(y) is in S2, and h maps z to y. Therefore h maps onto all of T, and both sets have the same cardinality. That completes antisymmetry, so that "size", as measured by cardinality, makes sense.
This provides another method for proving cardinal equality. Mapping two sets into each other is often easier than finding a perfect correspondence. For instance, the integers map into the rationals in the obvious way, and rationals map into integers by sending the fraction a/b (lowest terms) to 2a×3b. (Send a/b to 2a3b, and send 0 to 0.) That's it. The integers and the rationals are the same size.
Are S and T the same size if a function takes S onto all of T, and another function carries T onto S? If S and T are finite then the answer is yes. The composition fg maps S onto S and is a permutation, whence both f and g must be injective.
In general the answer is yes, assuming the axiom of choice. Choice is equivalent to the well ordering principle, which states the elements in any set can be arranged in order. (The connection between choice, zorn's lemma, and the well ordering principle is beyond the scope of this book.)
Assume a function g maps T onto S. There is a "least" element in any subset of T. Let f′(x) (x drawn from S) be the least y in T such that g(y) = x. This makes f′ injective from S into T. A function from T onto S turns around and becomes an embedding of S into T. If S also maps onto T, then build g′ injective from T into S, and apply the Schroder Bernstein theorem. That completes the bijection.
The heart of this proof is the transformation of a surjective function in one direction to an injective function in the other direction. In the same way, an injective function in one direction implies a surjective function in the other direction, again assuming the well ordering principle. If y in T is in the image of f then let g(y) be the preimage of y under f. Map the rest of T to the least element of S. Thus g maps T onto S. (This doesn't work if S is the empty set, but that is a bit pathological.)
Surjective and injective are opposite sides of the same coin. Surjective implements ≥, while injective implements ≤. Two sets that map onto each other, or embed in each other, are the same size.
Let c and d be two cardinals, such that d is at least as big as c. Since we know how to do finite math, let d (and perhaps c) be some flavor of infinity.
What is c + d? By definition, it is the size of the disjoint union of c and d. This is consistent with finite math. But how large is the union when d is infinite? Certainly d embeds in this union, so it is at least d. Embed c and d back into d as follows. Let 0 in d map to 0, and let 0 in c map to 1. Let 1 in d map to 2, and let 1 in c map to 3. Let n (finite) in d map to 2n, and let n in c map to 2n+1. For any limit ordinal, such as θ, let θ+n in d map to θ+2n, and if c contains θ, let θ+n in c map to θ+2n+1. Thus c + d embeds in d. Since both sets embed in each other, c + d = d. The larger set wins.
Let c*d be the size of the cross product, everything in c cross everything in d. This is consistent with finite math. If c is the empty set you get no cross product, and 0, as expected. Otherwise d embeds in the cross product by crossing it with anything in c. It is possible to map c cross d back into d, whence the sets are the same size, and c*d = d. The larger set wins. The details are beyond the scope of this book. If you are interested in this and other related topics, then you want to read Set Theory, by Ken Kunen.
The union of all the finite cross products of d has size d. This is d + d2 + d3 + …, which is d + d + d + …, which is d*ω, which is d.
If c is at least 2, then cd, the direct product of d copies of the set c, is greater than d. This is shown by another diagonal argument.
All this is cardinal math, and it actually comes in to play once in a while, e.g. to prove that certain structures have a certain size, even if that size is infinite, and that the size is a particular flavor of infinity, and cannot ambiguously be two different infinities at once.
If f is some formula on n, and f(0) is true, and f(n) implies f(n+1), then f is true for all n. This is induction, and it proves the formula for all n in one go. But it doesn't prove f for infinite sets.
Transfinite induction makes one more assumption. If f is true for an ascending chain of sets, it is true for their union. Apply this to ω. Since f is true for n = 0 1 2 3 and so on, f is true for the union of all these sets, and that union happens to be ω. After that, f is true for ω+1, and ω+2, and so on. Then f is true for the union of all these sets, which is ω+ω, or 2*ω. Continue along this path, and f is true for 3*ω, 4*ω, 5*ω, and so on. Take the union, and f is true for ω*ω. After that f is true for ω*ω+1. This continues through all the ordinals.
Some of the structures in this book are larger than the integers, perhaps uncountably large, and transfinite induction is used to make assertions about these structures. It is a form of induction that goes beyond finite sets.
Cosmologists have given us a picture of reality that never ceases to amaze. Start with a sea of hydrogen, and just a touch of helium, distributed a bit unevenly, and wait a few billion years, and you get stars, galaxies, brown dwarfs, red giants, white dwarfs, neutron stars, black holes, supernovae, heavier elements, more stars, planets, life, animal life, and finally intelligent life. As Carl Sagan summarized, "We are a way for the Cosmos to know itself." And all this from hydrogen! I view set theory in the same way. This entire book, and almost every math book past present and future that you will ever read, comes from the set membership relation and a few axioms. It's not quite as amazing as life from hydrogen, but almost.
This is an excursion into continuous functions on real variables, but it is also an important introduction to abstract topology and continuous functions on abstract sets. You can skip ahead to the next chapter if you like, but if you do, you'll probably want to return to this section later. It's quite lovely.
Before we explore another wonderful construct from the mind of Cantor, let's look at 2ω. This is an infinite direct product, wherein each component is 0 or 1  in other words, infinite binary strings. These strings can be arranged in a linear order. If s and t are two such strings, march along their digits until they disagree, perhaps in position n. At that point the string with 0 in position n is less than the string with 1 in position n. Of course if the strings agree forever then they are equal. This is called a lexicographical order.
Any two strings s and t are comparable; they are equal, or one is less than the other. If s is less than t, then t is not less than s (antisymmetric). Finally, s < t < u implies s < u (transitive). Therefore this is a linear ordering.
Case #1
01101000 <
 Transitivity 
Case #2
01101000 <

A linear ordering always implies a linear topology. Consider the real numbers, which are also linearly ordered. The open interval strictly between 3 and 7, written (3,7), is an open set, and the union of open intervals is also an open set. In contrast, the closed interval from 3 to 7, which includes the endpoints 3 and 7, and is written [3,7], is a closed set, and the intersection of closed intervals is also a closed set.
As it turns out, the same topology can be applied to 2ω. For instance, all the sequences strictly between 010000… and 011000… form an open interval. The sequences greater than s, or less than s, also form an open interval (think of the other point as infinity), and the entire set is open, as is the empty set. The union of any collection of open intervals is an open set.
The complement of an open set is a closed set. This includes the closed interval, (which is the complement of the union of the ray above and the ray below), and any finite collection of closed intervals.
A function , such as f(x) = y, is continuous if the preimage of every open set is an open set. Actually it is sufficient to show the preimage of every open interval is an open set. Then take the union of these intervals in the range to show the preimage of an open set is an open set. For example, if f(x) = x2, then the preimage of the open interval (4,9) is the union of the two open intervals (2,3) and (3,2). This holds for every open interval in y. Even the open interval (2,1) has an empty preimage, which is technically an open set. Therefore f(x) = x2 is a continuous function.
You may have seen another definition of continuity in algebra or calculus, involving δ and ε. The two definitions, δ ε and preimages of open sets, are equivalent. Let's show that each implies the other.
If f is continuous by open sets, and f(p) = q, the interval, or disk, or ball (depending on dimension) of radius ε around q is an open set, and pulls back to an open set containing p, which includes some open ball of radius δ around p. A neighborhood of radius δ about p maps into our neighborhood of radius ε about q.
Conversely, assume the δ ε definition, and let q be any point in a designated open set R in the range of f. Let p be any point such that f(p) = q. A ball of radius ε is centered at q and lies in R, and has some ball of radius δ about p that maps entirely into the ball of radius ε. Take the union of all these balls, for each point p in the preimage of q, and then for each q in R, and the preimage of R is open.
The two definitions agree, and you can use which ever one is convenient. However, the open set definition is more general, and applies to many different structures in abstract algebra where distances such as δ and ε might not make any sense. Therefore continuity is usually defined in terms of open sets.
The composition of continuous functions is continuous. Let f and g be continuous, and let R be an open set in the range of fg. The preimage of R under g is open, and the preimage of this set under f is open, thus the preimage of R under fg is open.
Let b() be a function from 2ω into the closed interval [0,1]. I call it b() for the binary representation. Put a decimal point in front of the sequence and read the number in base 2. Technically it's called a radix point, since we are not in decimal. The first digit is one half, the second digit is one forth, the third digit is one eighth, and so on. Each sequence becomes a real number from 0 to 1. The sequence 000000… maps to 0, and the sequence 111111… maps to 1/2 + 1/4 + 1/8 + 1/16 + …, which sums to 1. The minimum sequence maps to 0 and the maximum sequence maps to 1.
The function b() is not injective. For instance, 001111… is the same real number as 010000…, namely ¼. The function is however surjective. Every real number has a binary representation. Thus b maps 2ω onto [0,1], covering the entire interval.
Note that b respects order. If s ≤ t then b(s) ≤ b(t). As a consequence of this, b is continuous. Start with an open interval (u,v) in [0,1]. Represent u in binary, using the larger of the two representations if u has two representations. Represent v in binary, using the smaller of the two representations if v has two representations. The strings between u and v are precisely the strings that map to real numbers between u and v. The preimage of an open interval is an open set, and b is continuous.
Now we are ready for the Cantor set. Start with the unit interval [0,1] and remove the open set (1/3,2/3). We have taken out the middle third, leaving two pieces on either side. From each of these, remove the middle third. In other words, remove the open intervals (1/9,2/9) and (7/9,8/9). This leaves 4 closed segments of length 1/9. From each of these, remove the middle third. Thus the first segment loses (1/27,2/27), and so on. That leaves 8 segments of length 1/27. Remove the middle third of each of these, and repeat this process forever. The result is the Cantor set, which I will call C. The beginning of this construction is shown below.
Notice that the section of C from 0 to 1/3, when magnified, looks exactly like C. This is an early example of a fractal, although fractal geometry was not well understood at the time. Cantor simply thought it was a beautiful set, and it is.
If x is in the complement of C, then it was removed at some point, as part of an open interval. The complement is a union of open intervals, and is an open set, hence C is closed.
There is a canonical map, call it m(), from 2ω onto C. If s is a sequence, and s1 = 0, select the interval [0,1/3]. If s1 = 1, select the interval [2/3,1]. If s2 = 0, select the first third of the interval previously selected, and if s2 = 1, select the last third of the interval previously selected. This continues forever, building a chain of descending closed intervals whose lengths approach 0. This chain converges to a point; call it x. Thus m(s) = x. Since x does not lie in the middle third of any segment, it has never been deleted, and x belongs to C. We have a map from 2ω into C.
Different sequences will diverge at some point, and live in disjoint intervals thereafter. Thus m is injective.
Finally, let x be any point in the cantor set C. At each step, x lies in the first or last third of the prior interval. If it were in the middle third it would not lie in C. Thus, at each step, sn can be set to 0 or 1. The resulting sequence converges to some y in C, and if y is not equal to x, then sn goes down the wrong path at some point, moving towards y instead of x. This contradicts the construction of s, hence s maps to x. The map is onto, and a bijection.
The sequences in 2ω are uncountable by diagonalization, hence the points of C are uncountable.
Show that m respects order. That is, s < t implies m(s) < m(t). Also, m(s) < m(t) implies s < t. Open intervals correspond, and open sets correspond. You might think m is an isomorphism, but we don't really have operators like + and *. When a bijection preserves open sets in both directions it is called a homeomorphism. The two topological spaces are the same. One is merely a relabeling of the other  and once relabeled, the open and closed sets are the same. We can use C or 2ω interchangeably, whichever is convenient.
Oddly enough, the topological product of C cross C is homeomorphic to C. Let's build a bijection p() from C into C*C.
Given a sequence of zeros and ones, extract the bits in the odd positions and call that x, then extract the bits in the even positions and call that y. The ordered pair x,y becomes a point in C*C.
Show that p() is injective and surjective.
Prove the continuity of p by looking at the preimage of an open set in C*C. Such a set is, by definition, the union of open rectangles, where each rectangle is the cross product of two open intervals. So it is enough to show the preimage of each such rectangle is open.
A rectangle is the intersection of four open regions: everything to the right of the left edge, everything to the left of the right edge, everything below the top edge, and everything above the bottom edge. It is enough to show the preimage of each such region is open.
Let's look at the region above the line at t; the other regions are handled similarly. Let u,b be a point in this region. Let s be the preimage of u,v. The odd bits of s create u and the even bits of s create v. Let n be the first position where v and t differ, thus vn = 1 and tn = 0. Truncate s at 2n bits, the first n bits of u and the first n bits of v, Then fill out the rest of s with zeros. Let z be based on s, but fill out the rest of z with ones. The interval of interest is now [s,z]. Wait a minute! That's a closed interval. Yes it is, but s has a predecessor and z has a successor. Any sequence that ends in all zeros has a predecessor that ends in all ones, with nothing in between these two sequences, and any sequence that ends in all ones has a successor that ends in all zeros, with nothing in between. So our closed interval is also an open interval (s,z+). The interval [s,z] maps to a region constrained to match the first n bits of u and the first n bits of v. This contains u,v and lies entirely above t. An open interval maps into our region above t, and contains u,v. Take the union over each u,v above t, and the preimage is an open set.
Verify the same for a lower region, a right region, and a left region. Now the preimage of each open rectangle is an open set, the preimage of an open set is open, and p() is continuous.
The reverse map, from C*C back to C, is also continuous. You can build a proof much like the above, or you can call upon some theorems from topology. Since C is a closed subspace of [0,1] it is compact and hausdorff. Thus C*C is also compact and hausdorff. The continuous map p becomes bicontinuous. Well however you get there, p is continuous in both directions, and is a homeomorphism. The spaces are indistinguishable from one another.
I said earlier that it was odd that C and C*C should be the same, but it makes sense if you remember that C is a fractal, and the left half of C is exactly the same as C. With that in mind, yes, C cross C probably looks just like C.
Write C = C*C = C*(C*C) = C*C*C. Thus C is homeomorphic to C3, or Cn, the finite direct product of n copies of C.
By definition, a curve in the plane such as y = x2 is infinitely thin. We draw it with ink, or pixels, but that is an approximation. How then can a curve, infinitely thin, sweep around and around to fill an entire region in the plane? It can, and such a curve is called a space filling curve. Your pencil never leaves the paper (continuous), yet the tip wiggles about in an infinite number of infinitely small vibrations. To say your hand is shaky is an understatement; it is downright fuzzy, almost like quantum mechanics.
First, what is a curve? A curve is a continuous function from the line or the line segment [0,1] into the plane, or into euclidean space, or into any space for that matter. The aforementioned parabola is a curve in the plane, with x(t) = t, and y(t) = t2. Another curve, a spiral, has x(t) = t*cos(t) and y(t) = t*sin(t), for t ≥ 0.
A closed curve is a continuous function from the unit circle into space. This is a curve that starts and ends at the same point, just as the circle starts and ends at the same point. An ellipse is a closed curve in the plane, as is a square.
The following space filling curve is courtesy of Peano. Let C be the cantor set, as described above. Remember that C maps continuously onto the closed interval [0,1] using the "binary representation" function b() described in the previous section. Apply this map in two coordinates, and the topological product C*C maps continuously onto the unit square. Then, apply the fact that C*C is homeomorphic to C, and build a continuous map from C onto the unit square.
Of course C is not the same as the closed interval [0,1], but the map can be extended. Let x be a point in [0,1] that is not in C. It has been excised as part of an open interval. The endpoints of this interval, call them a and b, map to two points in the unit square. Map all points in between, including x, into the unit square linearly. Thus, if x is halfway between a and b, then f(x) is the midpoint of f(a) and f(b). The result is a continuous map from the line segment [0,1] onto the unit square.
Since Cn is homeomorphic to C, a similar result holds in n dimensions. A space filling curve can cover the unit hypercube in n space.
In the above, 0 maps to 0,0 and 1 maps to 1,1. But a space filling loop (closed curve) is also possible. Map the top have of the circle onto the unit square as described above, then map the bottom half onto a line segment from 1,1 back to 0,0. The curve starts and ends at the origin  hence it is a loop, or a closed curve.
For topological reasons, a loop in the plane cannot extend out to infinity. It is bounded in the plane. Thus there is always an outside region, beyond the loop. The jordan curve theorem says there is also an inside, and a simple closed curve, like a square or an ellipse, cuts the plane into two pieces, the inside and the outside. This sounds simple and intuitive, almost to easy to prove, yet it is a very complicated theorem with some important stipulations. One of the conditions says the curve must be smooth, or at least peacewise smooth. In contrast, our space filling curve vibrates wildly at the microscopic level. It is capable of covering a region completely, thus having no inside.