Entropy of information source


iVenky

In the paper "A Mathematical Theory of Communication" by Shannon, I couldn't understand a theorem (Theorem 4) under the section "The Entropy of an Information Source":

Consider again the sequences of length N and let them be arranged in order of decreasing probability. We define n(q) to be the number we must take from this set starting with the most probable one in order to accumulate a total probability q for those taken.

Theorem 4:
\[
\lim_{N \to \infty} \frac{\log n(q)}{N} = H
\]


when q does not equal 0 or 1.
We may interpret log n(q) as the number of bits required to specify the sequence when we consider only the most probable sequences with a total probability q.
Here 'H' is the entropy.


Can you explain to me how we get this theorem?
What is this n(q)? I couldn't understand the sentence I quoted above. How is log n(q) the number of bits required to specify the sequence?

Here's his paper:
**broken link removed**

It's under the section "The Entropy of an Information Source" on pages 13 and 14. He has written the proof in the appendix, but I couldn't understand that either.


Thanks a lot.

Dude,

this is proved using the AEP (asymptotic equipartition property) theorem; in fact, it is the weak AEP.

What it means is that as your sequence becomes infinitely long, you need at least nH bits to encode it.
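
To make that concrete, here is a minimal sketch (my own numbers, not from the paper) for a biased binary i.i.d. source, comparing the naive cost of one bit per symbol with the roughly nH bits the AEP promises:

```python
import math

# Hypothetical example: binary i.i.d. source with P(0) = 0.9, P(1) = 0.1
p = 0.9
H = -(p * math.log2(p) + (1 - p) * math.log2(1 - p))  # entropy, ~0.469 bits/symbol

n = 1000                   # sequence length
print(n * math.log2(2))    # naive encoding: 1 bit per symbol -> 1000 bits
print(n * H)               # AEP bound: ~469 bits suffice for a long typical sequence
```

So an AEP-style code spends roughly H bits per symbol instead of a full bit, which is where the compression comes from.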
 

Thanks. What is the 'n' in those 'nH' bits? Is it the number of sequences? Also, I don't understand the "most probable sequences" he talks about in the paper. Does the result apply only to those? What exactly is n(q)?

Thanks once again.
 

Say the sequence you have is made up of n random variables, x1, x2, ..., xn.
All of these variables have a probability distribution, and here we make them independent and identically distributed, which means they all share the same distribution p(x).
Now, notice that you need n*log(x) bits to represent a sequence,

but as Shannon stated and proved, you only need n*H(x) bits to encode it.

That is how you get the data compressed. Note that H(x) is smaller than or equal to log(x).

n(q) counts the sequences you actually encode; that is to say, they are a subset of all your original permutation sequences.

In fact, these sequences are called typical sequences. The typical sequences are the ones with the "most probable" property.

Hope this helps somehow.
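
As a rough illustration of how small the typical set is (my own construction, with made-up numbers), one can group length-n binary sequences by their number of ones, since all sequences with the same count share one probability:

```python
import math

# Assumed example: biased binary i.i.d. source with P(1) = 0.1, block length n
n, p1, eps = 20, 0.1, 0.2
H = -(p1 * math.log2(p1) + (1 - p1) * math.log2(1 - p1))   # ~0.469 bits/symbol

typ_count, typ_prob = 0, 0.0
for k in range(n + 1):
    prob = p1 ** k * (1 - p1) ** (n - k)      # probability of any one sequence with k ones
    if abs(-math.log2(prob) / n - H) < eps:   # weak-typicality test
        typ_count += math.comb(n, k)          # C(n, k) such sequences
        typ_prob += math.comb(n, k) * prob

print(2 ** n)      # all sequences of length n
print(typ_count)   # typical sequences: a small fraction of that
print(typ_prob)    # ...yet they already carry most of the probability mass
```

With these toy numbers the typical sequences number about 1,350 out of roughly a million, yet they carry on the order of 75% of the total probability; both figures sharpen as n grows.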



Now I understand it somewhat. Thanks a lot for your reply. I still have a few doubts.

You said n*log(x). Here 'x' actually indicates the total number of possibilities x can take, right? I mean, the size of the sample space of x?

Also, you are saying that n(q) indicates the number of sequences with the highest probability, and according to the paper they are the ones whose probabilities add up to 'q', right?

From the result it looks like log n(q) = NH, so where are the rest of the less probable sequences that don't come under n(q)?

Thanks a lot
 

x belongs to a set, e.g. x : {1, 2, 3}.

For lack of a better way to type it here, I wrote log(x), but what I mean is log(alphabet of x) = log(3) in the example above.

Here p(x) = 1/3 if x = 1,
     1/3 if x = 2,
     1/3 if x = 3.
Note that you can assign any other probabilities, but the uniform one above gives you the maximum H(x) = log(3).

So that log(3) is greater than or equal to H(x); this is what I meant regarding your first question.

Then, the number of typical sequences is n(q), and you have log(n(q)) approaching n*H(x); the n(q) sequences are the most probable ones out of all the permutations.

Hope this is better.
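
A quick numerical check of that point (my own sketch, using base-2 logs):

```python
import math

def H(dist):
    """Shannon entropy (in bits) of a probability distribution given as a list."""
    return -sum(p * math.log2(p) for p in dist if p > 0)

# Uniform distribution over {1, 2, 3}: entropy hits the maximum log2(3)
print(H([1/3, 1/3, 1/3]), math.log2(3))   # both ~1.585 bits

# Any skewed distribution over the same alphabet has lower entropy
print(H([0.8, 0.1, 0.1]))                 # ~0.922 bits < log2(3)
```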
 
Yeah, that's exactly my doubt. If you take n(q) to be the most probable sequences out of all the permutations, what you are saying is that N*H(x) bits are enough to represent those most probable sequences alone. But you haven't represented the sequences (say, the least probable ones) that don't fall under the most probable set. If you include them as well, the number of bits would increase, right?

Thanks a lot for your help.

- - - Updated - - -

I figured that out!

Actually, he proved in his paper that (I can understand this derivation; it's in his paper):

\[
\lim_{N \to \infty} \frac{\log\left(p^{-1}\right)}{N} = H
\]

where p is the probability of a given sequence of length N. It looks like, as N → ∞, all of these (typical) sequences have essentially the same probability p. This can be easily understood if you follow that derivation.

Now, if you define n(q) as the number of sequences you must take until the sum of their probabilities equals q, then we have

n(q) * p ≈ q

(we have assumed that, as N → ∞, each of these sequences has the same probability p; if you look at the derivation from his paper that I mentioned, you will get this).

We know that q is a fixed constant between 0 and 1,

so clearly

n(q) ≈ q / p = q * p^(-1)

When you substitute this into the expression above, the constant q drops out in the limit and you get the result!
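
Writing that substitution out explicitly (my own step, same notation):

\[
\lim_{N \to \infty} \frac{\log n(q)}{N}
= \lim_{N \to \infty} \frac{\log q + \log\left(p^{-1}\right)}{N}
= \lim_{N \to \infty} \frac{\log\left(p^{-1}\right)}{N}
= H
\]

since log q is a finite constant for 0 < q < 1 and vanishes after dividing by N. (At q = 1 the count would have to include all of the improbable sequences as well, which is one reason the theorem excludes the endpoints.)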

What I understand from this is that when you have an ergodic process (a Markov-chain kind of source), then as N → ∞ some sequences essentially never occur, so you need not worry about them. This also means that the number of bits required to represent a sequence is not log(total sample space) but rather just NH.
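
Here is a small numerical check of Theorem 4 along those lines (my own construction, for a biased binary i.i.d. source; with P(1) = 0.1, sequences with fewer ones are more probable, so "decreasing probability" order means k = 0, 1, 2, ... ones):

```python
import math

p1, q = 0.1, 0.5   # assumed source P(1) = 0.1 and target probability mass q
H = -(p1 * math.log2(p1) + (1 - p1) * math.log2(1 - p1))

for N in (50, 200, 800):
    n_q, acc = 0, 0.0
    for k in range(N + 1):                         # most probable sequences first
        prob_each = p1 ** k * (1 - p1) ** (N - k)  # probability of one sequence with k ones
        count = math.comb(N, k)                    # how many such sequences exist
        if acc + count * prob_each >= q:
            n_q += math.ceil((q - acc) / prob_each)  # take only as many as needed
            break
        acc += count * prob_each
        n_q += count
    print(N, math.log2(n_q) / N, H)   # log(n(q))/N creeps toward H as N grows
```

Changing q (to 0.1 or 0.9, say) shifts log(n(q))/N only slightly at finite N and not at all in the limit, which is exactly Shannon's point.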

Thanks a lot for your help.
 