Information Theory in a Nutshell
Introduction to Information Theory
The fundamental problem of communication is that of reproducing at one point either exactly or approximately a message selected at another point.
— Claude Shannon, 1948
이 글 하나를 통해서 섀넌의 노이지 채널 코딩 정리까지 증명해보자. 개이득.
With this post, we’ll cover the Shannon’s Noisy Channel Coding Theorem. Profit!
— B-Boy SeiOk, 2018
How to measure information content?
Claims:
1.The Shannon information content of an outcome
\[h(x=a_i) = \log_2 \frac{1}{P(x=a_i)}\]is a sensible measure of information content.
2. The entropy
\[H(X) = \sum_x P(x) \log_2 \frac{1}{P(x)}\]is a sensible measure of expected information content.
3. Source coding Theorem
$N$ outcomes from a source $X$ can be compressed into roughly $NH(X)$ bits.
We all know that, but this is a bit simple statement of the source coding theorem, so let’s look at it more deeply with the following example:
The Bent Coin Lotery
A coin with $p_1 = 0.1$ will be tossed $N = 1000$ times. The outcome is $\mathbf{x} = x_1 x_2 \ldots x_N.$
e.g., $\mathbf{x}$ = 000001001000100…00010
You can buy any of the $2^N$ possible tickets for £1 each, before the coin-tossing.
If you own ticket $\mathbf{x}$, you win £1,000,000,000.
Q. To have 99% chance of winning, at lowest possible cost, which tickets would you buy?
-
And how many tickets is that?
Express your answer in the form $2^{(\cdots)}$.
The top 15 strings are samples from $X^{100}$, where $p_1$ = 0.1 and $p_0$ = 0.9. The bottom two are the most and least probable strings in this ensemble. The final column shows the log-probabilities of the random strings, which may be compared with the entropy $H(X^{100})$ = 46.9 bits
Schematic diagram showing all strings in the ensemble $X^N$ ranked by their probability, and the typical set $T_{N\beta}$.
Shannon’s source coding theorem. (Formal)
Shannon’s source coding theorem Let $X$ be an ensemble with entropy $H(X)=H$ bits. Given $\epsilon>0$ and $0<\delta<1$, there exists a positive integer $N_0$ such that for $N > N_0$,
\[\left | \frac{1}{N}H_\delta(X^N) - H \right | < \epsilon\]Typical set
Let us define typicality for an arbitrary ensemble $X$ with alphabet $\mathcal{A}_X$. Our definition of a typical string will involve the string’s probability. A long string of $N$ symbols will usually contain about $p_1 N$ occurences of the first symbol, $p_2 N$ occureneces of the second, etc. Hence the probability of this string is roughly
\[P(\mathbf{x})_{\text{typ}} = P(x_1)P(x_2)P(x_3)\ldots P(x_N)\simeq {p_1}^{(p_1 N)}{p_2}^{(p_2 N)}\ldots{p_I}^{(p_I N)}\]so that the information content of a typical string is
\[\log_2 \frac{1}{P(\mathbf{x})} \simeq N \sum_i p_i \log_2 \frac{1}{p_i} = NH\]so the random variable $\log_2 \frac{1}{P(\mathbf{x})}$, which is the information content of $\mathbf{x}$, is very likely to be close in value to $NH$.
We define the typical elements of $\mathcal{A}_X^N$ to be those elements that have probabilty close to $2^{-NH}$. (Note that the typical set, unlike the smallest sufficient subset, does not include the most probable elements of $\mathcal{A}_X^N$, but these elements contribute negligible probability.)
We introduce a parameter $\beta$ that defines how close the probability has to be to $2^-NH$ for an element to be ‘typical’. We call the set of typical elements the typical set, $T_{N\beta}$:
\[T_{N\beta} \equiv \left \{ \mathbf{x} \in \mathcal{A}_X^N \colon \left | \frac{1}{N} \log_2 \frac{1}{P(\mathbf{x})} - H \right | < \beta\right \}\]It can be shown that whatever value of $\beta$ we choose, the typical set contains almost all the probability as $N$ increases.
This result is sometimes called the asymptotic equipartition principle.
By the way, It’s called “equipartition” because the strings with information content $NH$ have similar probabilities. See the bent coin distribution for the typical set. See?
Symbol Codes
By Symbol Codes I mean the variable-length symbol codes, which encode one source symbol at a time.
Symbol code summary
Kraft Inequality
-
unique decodeability
— requires \(\sum_i 2^{-l_i}\leq 1\) Kraft inequality
[if equality holds, the code is ‘complete’]
For any lengths satisfying the Kraft inequality, there exists a prefix code with those lengths.
-
prefix codes
and binary trees
-
whenever a code achieved $L=H$, it had
code lengths $l_i^*$ equal to the information contents
\[l_i^* = \log \frac{1}{p_i}\] -
optimal symbol codes
The optimal symbol code’s expected length $L$ satisfies
\[H(X) \leq L \lt H(X) + 1\] -
Huffman algorithm
proof of Kraft inequality. Define $S = \sum_i 2^{-l_i}$. Consider the quantity
\[S^N = \left [ \sum_i 2^{-l_i} \right ]^N = \sum_{i_1=1}^I \sum_{i_2=1}^I \cdots \sum_{i_N=1}^I 2^{-(l_{i_1} + l_{i_2}+\cdots l_{i_N})}.\]the quantity in the exponent, $(l_{i_1} + l_{i_2} + \cdots + l_{i_N})$, is the length of the encoding of the string $\mathbf{x} = a_{i_1} a_{i_2} \ldots a_{i_N}$. For every string $\mathbf{x}$ of length $N$, there is one term in the above sum. Introduce an array $A_l$ that counts how many strings $\mathbf{x}$ have encoded length $l$. Then defining $l_{\min} = \min_i l_i$ and $l_{\max} = \max_i l_i$:
\[S^N = \sum_{l=Nl_{\min}}^{Nl_{\max}} 2^{-l} A_l.\]Now assume $C$ is uniquely decodeable, so that for all $\mathbf{x} \neq \mathbf{y}$, $c^{+}(\mathbf{x}) \neq c^{+}(\mathbf{y})$. Concentrate on the $\mathbf{x}$ that have encoded length $l$. There are a total of $2^l$ distinct bit strings of length $l$, so it must be the case that $A_l \leq 2^l$, So
\[S^N = \sum_{l=Nl_{\min}}^{Nl_{\max}} 2^{-l} A_l \leq \sum_{l=Nl_{\min}}^{Nl_{\max}} 1 \leq Nl_{\max} .\]Thus $ S^N \leq l_{\max} N $ for all $N$. Now if $S$ were greater than 1, then as $N$ increases, $S^N$ would be an exponentially growing function, and for large enough $N$, an exponential always exceeds a polynomial such as $l_{\max} N$. But out result ($S^N \leq l_{\max} N$) is true for any $N$. Therefore $S \leq 1$.
Shannon’s Source Coding Theorem Proof
\[L(C,X) = \sum_i p_i l_i\]define ideal length $l_i^* = \log_2 \frac{1}{p_i} = h_i$
imagine someone has picked $l_i \leftrightarrow q_i$
$q_i \equiv \frac{2^{-l_i}}{z}$ where z = $\sum_i 2^{-l_i}$
Note $z \leq 1$ for any UD code, $z = 1$ for complete code
\[\begin{aligned} L(C,X) &= \sum_i p_i l_i \\ &= \sum_i p_i \left [ \log_2 \frac{1}{q_i} - \log_2 z \right ] \\ &=\sum_i p_i \log_2 \frac{1}{p_i} + \sum_i p_i \log_2 \frac{p_i}{q_i} - \log_2 z \\ &= H(X) + \underbrace{D_{KL}(p \| q)}_\text{Kullback-Leibler Divergence} -\log_2 z \\ \end{aligned}\]where $\log_2 z \geq 0$ if code is UD, with equality only if the code is complete.
where Gibb’s inequality holds for
\[D_{KL}(p\| q) = \sum p \log \frac{p}{q} \geq 0\]with equality only if $q=p$
Therefore…
\[L(C,X) \geq H(X)\]euality iff $l_i = \log_2 \frac{1}{p_i}$ & complete code.
Optimal symbol codes are provieded by Huffman algorithm
summary build binary tree, starting from the (furthest) leaves
Q: If we encode symbols from the ensemble
\[\begin{aligned} \mathcal{A}_X &= \{a,b,c,d\} \\ \mathcal{P}_X &= \{1/2, 1/4, 1/8, 1/8\} \end{aligned}\]using the symbol code
\[\mathcal{C} = \{0, 10, 110, 111\}\]what is the probability ‘$p_1$’
that a bit pluckced at random from the encoded stream is a 1?
\[\begin{aligned} \text{A} \qquad& p_1 < \frac{1}{2} \\ \text{B} \qquad& p_1 = \frac{1}{2} \\ \text{C} \qquad& p_1 > \frac{1}{2} \end{aligned}\] \[\begin{aligned} p_1 &= \frac{\sum_i p_i n_i}{\sum_i p_i l_i} \\ & = \frac{\frac{1}{2} \times 0 + \frac{1}{4} \times 1 + \frac{1}{8} \times 2 + \frac{1}{8} \times 3}{\frac{1}{2} \times 1 + \frac{1}{4} \times 2 + \frac{1}{8} \times 3 + \frac{1}{8} \times 3} \\ & = \frac{1}{2} \end{aligned}\]In the sneaky information theory way, it’s completedly compressed, it’s closed.
Stream Codes
I’ll put more materials here when I have the time…
Noisy Channel Coding
Joint ensemble (Example)
\[x\\y \begin{array}{c | c c c c | c} & 1 & 2 & 3 & 4 & P(y) \\\hline 1& \frac{1}{8} & \frac{1}{16} & \frac{1}{32} & \frac{1}{32} & \frac{1}{4} \\ 2& \frac{1}{16} & \frac{1}{8} & \frac{1}{32} & \frac{1}{32} & \frac{1}{4} \\ 3& \frac{1}{16} & \frac{1}{16} & \frac{1}{16} & \frac{1}{16} & \frac{1}{4} \\ 4& \frac{1}{4} & 0 & 0 & 0 & \frac{1}{4} \\\hline P(x) & \frac{1}{2} & \frac{1}{4} & \frac{1}{8} & \frac{1}{8} & \end{array}\]with the marginal probabilities \(P(X) = \frac{1}{2}, \frac{1}{4}, \frac{1}{8}, \frac{1}{8}\) \(P(Y) = \frac{1}{4}, \frac{1}{4}, \frac{1}{4}, \frac{1}{4}\)
calculating the entropies for each variable gives:
\[\begin{aligned} H(X) & = \frac{7}{4} \text{bits} \\ H(Y) & = 2 \text{bits} \\ H(X,Y) &= \sum_{(x,y)} P(x,y) \log_2 \frac{1}{P(x,y)} \\ &= \frac{27}{8} \text{bits} \end{aligned}\]Conditional probability
\[P(x \mid y) = \frac{P(x,y)}{P(y)}\] \[x\\y \begin{array}{c | c c c c} & 1 & 2 & 3 & 4 \\\hline 1& \frac{1}{2} & \frac{1}{4} & \frac{1}{8} & \frac{1}{8} \\ 2& \frac{1}{4} & \frac{1}{2} & \frac{1}{8} & \frac{1}{8} \\ 3& \frac{1}{4} & \frac{1}{4} & \frac{1}{4} & \frac{1}{4} \\ 4& 1 & 0 & 0 & 0 \end{array}\] \[\left. \begin{aligned} H(X \mid y=1) &= \frac{7}{4} \text{bits} \\ H(X \mid y=2) &= \frac{7}{4} \text{bits} \\ H(X \mid y=3) &= 2 \text{bits} \\ H(X \mid y=4) &= 0\text{bits} \end{aligned} \right \} \begin{aligned}[t] H(X \mid Y) &= \sum_y P(y)H(X \mid y)\\ &= \frac{11}{8} \text{bits} \end{aligned}\]Generally, the following holds:
\[H(X|Y) \leq H(X)\]as you can see from the above example.
Mutual Information
\[\begin{aligned} I(X ; Y ) & = H(X) - H(X \mid Y) \\ & = H(Y) - H(Y \mid X) \end{aligned}\]One may also evaluate $ H(Y \mid X) = 13/8\text{bits}$. The mutual information for the above distribution is
\[I(X;Y) = H(X) - H(X \mid Y) = 3/8 \text{bits}\]You can calculate the mutual information for various input distribution.
for $f=0.15$:
The left is the mutual information $I(X;Y)$ for a binary symmetric channel, and the right is the mutual information $I(X;Y)$ for a Z channel.
You can also show that the mutual information is always bigger than 0 from the following fact:
\[I(X;Y) = D_{KL}(P(x,y) \| P(x)P(y))\]because…
\[\begin{aligned} I(X;Y) &= H(X) - H(X|Y) \\ &= \sum_x P(x) \log \frac{1}{P(x)} - \sum_{xy} P(x,y) \log \frac{1}{P(x \mid y)} \\ &= \sum_{xy}P(x,y) \log \frac{P(x \mid y)}{P(x)} \\ &= \sum_{xy}P(x,y) \log \frac{P(x,y)}{P(x)P(y)}. \end{aligned}\]And thus the Gibb’s inequality holds.
Communication over noisy channels
Channel $Q$ defines conditional probabilities
\[P(y \mid x)\]If we choose an input distribution $P(x)$, we have a joint distribution
\[P(x,y) = P(x)P(y \mid x)\]for which we can compute the mutual information
\[\begin{aligned} I(X ; Y ) & = H(X) - H(X \mid Y) \\ & = H(Y) - H(Y \mid X) \end{aligned}\]The first channel in the diagram shown above is a binary symmetric channel with
\[Q = \begin{bmatrix} 1-f & f \\ f & 1-f \end{bmatrix}\]The second one is called a noisy typewriter.
There is also Binary erasure channel.
Information measures for noisy channels
let’s talk about binary symmetric channel, and let’s the the flip probability is $f = 0.1$.
let’s say the inputs are kind of like the bent coin case, with:
\[P(x=0) = 0.9 \\ P(x=1) = 0.1\]observe y = 1, given this output, what’s the probability that the x = 1?
\[\begin{aligned}\text{A}\qquad & 0.1 \\ \text{B}\qquad & 0.5 \\ \text{C}\qquad & 0.9 \\ \text{Z}\qquad & \text{I don't know}\end{aligned}\] \[\begin{aligned} P(x=1 | y=1) &= \frac{P(y=1 | x=1)P(x=1)}{P(y=1)}\\ &= \frac{P(y=1 | x=1)P(x=1)}{\sum_{x'} P(y=1 | x')P(x')}\\ &= \frac{0.9 \times 0.1}{\underbrace{0.9\times0.1 + 0.1 \times 0.9}_{P(y=1) = 0.18}}\\ &= \frac{1}{2} \end{aligned}\]how about what is the probability of getting $x=1$ when the recieved signal is 0?
\[\begin{aligned} P(x=1|y=0) &= \frac{P(y=0|x=1)P(x=1)}{P(y=0)} \\ &= \frac{0.1 \times 0.1}{0.01 + 0.81} \\ &= \frac{1}{82} \end{aligned}\]Now we’ve done inference… Let’s talk about information measures… How much information does the output convey about the input?
And that’s called mutual information.
\[\begin{aligned} I(X ; Y ) & = H(X) - H(X \mid Y) \\ & = H(Y) - H(Y \mid X) \end{aligned}\]Generally, the bottom approach is more easier:
\[\begin{aligned} H(Y) - H(Y \mid X) &= H_2(0.18) - H_2(f)\\ &= 0.68-0.47\\ &= 0.21\text{bits} \end{aligned}\]where $H_2(x) = x \log_2 \frac{1}{x} + (1-x) \log_2 \frac{1}{1-x}$
It is a bit more of a pain to calculate the first term:
\[\begin{aligned} &H(X) - \underbrace{H(X|Y) }_{\sum_y P(y) H(X |y)} \\ =& 0.47 - [\underbrace{0.18 \times H_2(\frac{1}{2})}_{(y=1)} + \underbrace{0.82 \times H_2(\frac{1}{82})}_{(y=0)}] \\ =& 0.47 - 0.26 \\ =& 0.21 \text{bits} \end{aligned}\]Definition The Capacity of a channel
is the maximum, over all input distributions $P(x)$, of the mutual information:
\[C(Q) \equiv \max_{\mathcal{P}_X} I(X;Y)\]The distribution $\mathcal{P}_X^*$ that achieves the maximum is called the optimal input distribution.
Ternary Confusion Channel
Let’s talk about Ternery Confusion Channels
Assume input distribution $\mathcal{P}_X = \left \{ \frac{1}{3},\frac{1}{3},\frac{1}{3} \right \}$,
- If $ y = M$, what’s $x$?
- What is the mutual information $I(X;Y)$? Again, the mutual information is given as:
Actually it’s A & B, they are just two different ways to compute the mutual information.
Then what is the optimal input channel then?
\[\mathcal{P}_X = \left \{ \frac{1}{2},0,\frac{1}{2} \right \}\]It’s just like throwing out the Black/White card in the 3-card situation.
Shannon’s noisy channel coding theorem:
Reliable (virtually error-free) communication is possible at rates up to $C$
Now, with the example of noisy typewriter:
\[\begin{aligned} &\ \vdots&\\ P(y=\mathbf{F}|x=\mathbf{G})\quad&=&\quad 1/3;\\ P(y=\mathbf{G}|x=\mathbf{G})\quad&=&\quad 1/3;\\ P(y=\mathbf{H}|x=\mathbf{G})\quad&=&\quad 1/3;\\ &\ \vdots& \end{aligned}\]- What is the optimal input distribution?
- WHat is the capacity?
And we choose the uniform distribution to maximize the entropy $H(Y)$
\[P_X^* = \left \{ \frac{1}{27},\frac{1}{27},\frac{1}{27},\ldots,\frac{1}{27} \right \}\]will give you the answer
\[\text{Capacity} = \log 27 - \log 3 = \log 9\]Also, the following is also optimal:
\[P_X^* = \left \{ 0,\frac{1}{9},0,0,\frac{1}{9},0,\ldots \right \}\]So the reliable communication is possible at 9 bits.
Which is in alignment with Shannon’s noisy channel coding theorem.
In other words, there is a non-confusable subset of inputs for the noisy typewriter.
intuitive preview of proof
Extended channels
To prove the theorem for any given channel, we consider the extended channel corresponding to $N$ uses of the channel. The extended channel has $|\mathcal{A}_X|^N$ possible inputs and $|\mathcal{A}_Y|^N$ possible outputs.
Above is the extended channel from a binary symmetric channel with transition probability 0.15.
And the above is the extended channels obtained from a Z channel with transition probability 0.15.
The intuition is, that if $N$ is sufficiently large, it looks rather like the noisy typewriter.
For some more general view, let’s look at the problem from the typical set point of view:
Now, the formal statement of the Noisy-Channel Coding Theorem:
The theorem
1) For every discrete memoryless channel, the channel capacity
\[C = \max_{\mathcal{P}_X} I(X;Y)\]has the following property. For any $\epsilon > 0$ and $R < C$, for large enough $N$, there exists a code of length $N$ and rate $\geq R$ and a decoding algorithm, such that the maximal probability of block error is $<\epsilon$.
2) If a probability of bit error $p_b$ is acceptable, rates up to $R(p_b)$ are achievable, where
\[R(p_b) = \frac{C}{1-H_2(p_b)}\]3) For any $p_b$, rates greater than $R(p_b)$ is not achievable.
Jointly-typical sequences
This is the fomalization of the intuitive preview.
We will define codewords $\mathbf{x}^{(s)}$ as coming from an ensemble $X^N$, and consider the random selection of one codeword and a corresponding channel output $\mathbf{y}$, thus defining a joint ensemble $(XY)^N$. We will use a typical-set decoder, which decodes a received signal $\mathbf{y}$ as $s$ if $\mathbf{x}^{(s)}$ and $\mathbf{y}$ are jointly typical, a term to be defined shortly.
The proof will then center on determining the probabilities (a) that the true input codeword is not jointly typical with the output sequence; and (b) that a false input codeword is jointly typical with the output. We will show that, for large $N$, both probabilities go to zero as long as there are fewer than $2^{NC}$ codewords, and the ensemble $X$ is the optimal input distribution.
Joint typicality. A pair of sequences $\mathbf{x}$,$\mathbf{y}$ of length $N$ are defined to be jointly typical (to tolerance $beta$) with respect to the distribution $P(x,y)$ if
\[\mathbf{x} \text{ is typical of } P(\mathbf{x}) \text{, i.e., } \left \vert \frac{1}{N} \log \frac{1}{P(\mathbf{x})} - H(X)\right \vert < \beta,\] \[\mathbf{y} \text{ is typical of } P(\mathbf{y}) \text{, i.e., } \left \vert \frac{1}{N} \log \frac{1}{P(\mathbf{y})} - H(Y)\right \vert < \beta,\] \[\text{and } \mathbf{x},\mathbf{y} \text{ is typical of } P(\mathbf{x},\mathbf{y}) \text{, i.e., } \left \vert \frac{1}{N} \log \frac{1}{P(\mathbf{x},\mathbf{y})} - H(X,Y)\right \vert < \beta,\]The jointly-typical set $J_{N\beta}$ is the set of all jointly-typical sequence pairs of length $N$.
Example. Here is a jointly-typical pair of length $N = 100$ for the ensemble $P(x,y)$ in which $P(x)$ has $(p_0, p_1) = (0.9, 0.1)$ and $P(y \mid x)$ corresponds to a binary symmetric channel with noise level 0.2.
Notice that $\mathbf{x}$ has 10 $1$s and so is typical of the probability $P(\mathbf{x})$ (at any tolerance $\beta$); and $\mathbf{y}$ has 26 $1$s, so it is typical of $P(\mathbf{y})$ (because $P(y=1)$ = 0.26); and $\mathbf{x}$ and $\mathbf{y}$ differ in 20 bits, which is the typical number of flips for this channel.
Joint typicality theorem. Let $\mathbf{x}$, $\mathbf{y}$ be drawn from the ensemble $(XY)^N$ defined by
\[P(\mathbf{x}, \mathbf{y}) = \prod_{n=1}^N P(x_n, y_n).\]Then
1) the probability that $\mathbf{x}$, $\mathbf{y}$ are jointly typical (to tolerance $\beta$) tends to 1 as $N \rightarrow \infty$;
2) the number of jointly-typical sequences $| J_{N\beta}|$ is close to $2^{NH(X,Y)}$. To be precise,
\[| J_{N\beta} | \leq 2^{N(H(X,Y) + \beta)};\]3) if $\mathbf{x}’ \sim X^N$ and $\mathbf{y}’ \sim Y_N$, i.e., $\mathbf{x}’$ and $\mathbf{y}’$ are independent samples with the same marginal distribution as $P(\mathbf{x},\mathbf{y})$, then the probability that $(\mathbf{x}’,\mathbf{y}’)$ lands in the jointly-typical set is about $2^{-NI(X;Y)}$. To be precise,
\[P((\mathbf{x}',\mathbf{y}') \in J_{N\beta}) \leq 2^{-N(I(X;Y)-3\beta)}.\]Proof. The proof of parts 1 and 2 by the law of large numbers. For part 2, let the pair $x, y$ play the role of $x$ in the source coding theorem, replacing $P(x)$ there by the probability distribution $P(x,y)$.
For the third part,
\[\begin{aligned} P((\mathbf{x}',\mathbf{y}') \in J_{N\beta}) &= \sum_{(\mathbf{x},\mathbf{y}) \in J_{N\beta}} P(\mathbf{x})P(\mathbf{y})\\ &\leq |J_{N\beta}| 2^{-N(H(X)-\beta)} 2^{-N(H(Y)-\beta)} \\ &\leq 2^{N(H(X,Y)+\beta)-N(H(X)+H(Y)-2\beta)}\\ &=2^{-N(I(X;Y)-3\beta)}. \end{aligned}\]Two independent typical vectors are jointly typical with probability
\[P((\mathbf{x}',\mathbf{y}') \in J_{N\beta}) \simeq 2^{-N(I(X;Y))}.\]because the total number of independent typical pairs is the area of the dashed rectangle, $2^{NH(X)}2^{NH(Y)}$, and the number of jointly-typical pairs is roughly $2^{NH(X,Y)}$, so the probability of hitting a jointly-typical pair is roughly
\[2^{NH(X,Y)} / 2^{NH(X)+NH(Y)} = 2^{-NI(X;Y)}.\]Proof of the noisy-channel coding theorem
Analogy
Imagine that we wish to prove that there is a baby in a class of one hundred babies who weighs less than 10 kg. Individual babies are difficult to catch and weigh. Shannon’s method of solving the task is to scoop up all the babies and weigh them all at once on a big weighing machine. If we find that their average weight is smaller than 10 kg, there must exist at least one baby who weighs less than 10 kg – indeed there must be many! Shannon’s method isn’t guaranteed to reveal the existence of an underweight child, since it relies on there being a tiny number of elephants in the class. But if we use his method and get a total weight smaller than 1000 kg then our task is solved.
From skinny children to fantastic codes
We wish to show that there exists a code and a decoder having small probability of error. Evaluating the probability of error of any particular coding and decoding system is not easy. Shannon’s innovation was this: instead of constructing a good coding and decoding system and evaluating its error probability, Shannon calculated the average probability of block error of all codes, and proved that this average is small. There must then exist individual codes that have small probability of block error.
Random coding and typical-set decoding
Consider the following encoding-decoding sytem, whose rate is $R’$.
1) We fix $P(x)$ and generate the $S = 2^{NR’}$ codewords of a $(N, NR’)$ = $(N, K)$ code $\mathcal{C}$ at random according to
\[P(\mathbf{x}) = \prod_{n=1}^N P(x_n).\]A random code is shown schematically in the figure above.
2) The code is known to both sender and receiver.
3) A message $s$ is chosen from \(\{1,2,\ldots,2^{NR'}\}\), and $\mathbf{x}^{(s)}$ is transmitted. The received signal is $\mathbf{y}$, with
\[P(\mathbf{y} \mid \mathbf{x}^{(s)}) = \prod_{n=1}^N P(y_n \mid x_n^{(s)}).\]4) The signal is decoded by typical-set decoding.
Typical-set decoding. Decode $\mathbf{y}$ as $\hat s$ if $(\mathbf{x}^{(\hat s)}, \mathbf{y})$ are jointly typical and there is no other $s’$ such that $(\mathbf{x}^{(s’)}, \mathbf{y})$ are jointly typical; otherwise declare a failure ($\hat s = 0$).
This is not the optimal decoding algorithm, but it will be good enough, and easier to analyze. The typical-set decoder is illustrated in the (b) figure above.
5) A decoding error occurs if $\hat s \neq s$.
There are three probabilities of error that we can distinguish. First, there is the probability of block error for a particular code $\mathcal{C}$, that is,
\[p_B(\mathcal{C}) \equiv P(\hat s \neq s \mid \mathcal{C}).\]This is a difficult quantity to evaluate for any given code.
Second, there is the average over all codes of this block error probability,
\[\langle p_B \rangle \equiv \sum_\mathcal{C} P(\hat s \neq s \mid \mathcal{C})P(\mathcal{C}).\]Fortunately, this quantitiy is much easier to evaluate than the first quantity \(P(\hat s \neq s \mid \mathcal{C})\).
Third, the maximal block error probability of a code $\mathcal{C}$,
\[p_{BM}(\mathcal{C}) \equiv \max_s P(\hat s \neq s \mid s, \mathcal{C}),\]is the quantitiy we are most interested in: we wish to show that there exists a code $\mathcal{C}$ with the required rate whose maximal block error probability is small.
We will get to this result by first finding the average block error probability, $\langle p_B \rangle$. Once we have shown that this can be made smaller than a desired number, we immediately deduce that there must exist at least one code $\mathcal{C}$ whose block error probability is also less than this small number. Finally, we show that this code, whose block error probability is satisfactorily small but whose maximal block error probability is unknown (and could conceivably be enormous), can be modified to make a code of slightly smaller rate whose maximal block error probablity is also guaranteed to be small. We modify the code by throwing away the worst 50% of its codewords.
We therefore now embark on finding the average probability of block error.
There are two sources of error when we use typical set decoding. Either (a) the output $\mathbf{y}$ is not jointly typical with the transmitted codeword $\mathbf{x}^{(s)}$, or (b) there is some other codeword in $\mathcal{C}$ that is jointly typical with $\mathbf{y}$.
By the symmetry of the code construction, the average probability of error averaged over all codes does not depend on the selected value of $s$; we can assume without the loss of generality that $s=1$.
(a) The probability that the input $\mathbf{x}^{(1)}$ and the output $\mathbf{y}$ are not jointly typical vanishes, by the joint typicality theorem’s first part. We give a name $\delta$ to the upper bound on this probability, satisfying $\delta \rightarrow 0$ as $N \rightarrow \infty$; for any desired $\delta$, we can find a blocklength $N(\delta)$ such that the $P((\mathbf{x}^{(1)},\mathbf{y}) \notin J_{N\beta}) \leq \delta$.
(b) The probablity that $\mathbf{x}^{(s’)}$ and $\mathbf{y}$ are jointly typical, for a given $s’ \neq 1$ is $\leq 2^{-N(I(X;Y)-3\beta)}$, by part 3. And there are $(2^{NR’}-1)$ rival values of $s’$ to worry about.
Thus the average probability of error $\langle p_B \rangle$ satisfies:
\[\begin{aligned} \langle p_B \rangle & \leq \delta + \sum_{s'=2}^{2^{NR'}} s^{-N(I(X;Y)-3\beta)} \\ &\leq \delta + 2^{-N(I(X;Y)-R'-3\beta)}. \end{aligned}\]The average probablity of error can be made $<2\delta$ by increasing $N$ if
\[R' < I(X;Y) - 3\beta.\]We are almost there. We make three modifications:
1) We choose $P(x)$ in the proof to be the optimal input distribution of the channel. Then the condition $R’ < I(X;Y) - 3\beta$ becomes $R’ < C - 3\beta$.
2) since the average probability of error over all codes is $<2\delta$, there must exist a code with mean probability of block error $p_B(\mathcal{C}) < 2\delta$.
3) To show that not only the average but also the maximal probability of error $p_{BM}$, can be made small, we modify this code by throwing away the worst half of the codewords — the ones most likely to produce errors. Those that reemain must all have conditional probability of error less than $4\delta$. We use these remaining codewords to define a new code. This new code has $2^{NR’} - 1$ codewords, i.e., we have reduced the rate from $R’$ to $R’-1/N$ (a negligible reduction, if $N$ is large), and achieved $p_{BM} < 4 \delta$. This trick is called expurgation. The resulting code may not be the best code of its rate and length, but it is still good enough to prove the noisy-channel coding theorem, which is what we are trying to do here.
In conclusion, we can ‘construct’ a code of rate $R’ - 1/N$, where $R’ < C-3\beta$, with maximal probability of error $< 4\delta$. We obtain the theorem as stated by setting $R’ = (R+C)/2, \delta = \epsilon/4, \beta < (C-R’)/3$, and $N$ sufficiently large for the remaining conditions to hold. The theorem’s first part is thus solved.
QED.
I’ll continue on the latter part later.