Transcript for:
Exploring Information Theory and Shannon's Insights

I was hoping to design this talk in a way that would enable both those who had no familiarity before today with the work of Shannon to appreciate some of the questions and solutions that he came up with, as well as those who are in the community to learn something new, and hopefully I can keep both of those audiences interested. My own personal interest in information theory is not only in the solutions that we come up with but also in the beautiful questions that the field asks, and these are the mysteries that this talk title alludes to. But before I get to those mysteries, let me just tell you a little bit about the background, to set the stage for this story. Shannon, of course, was born in 1916 (we're celebrating what would have been his hundredth birthday this year) in Gaylord, Michigan, so a native son of Michigan. And this picture, of course, is not him at birth; that was the youngest picture I managed to find on the web. But he was born in 1916, and these are pictures of his parents. Just to give you some sense of what the world was like at that time: in 1916 telephony was relatively new, so this photo, taken in 1892, is of Alexander Graham Bell; he's placing the first phone call from New York City to Chicago, and you can see that that was a noteworthy, newsworthy event at the time, about 24 years before Shannon's birth. But of course, while the telephone was relatively new, it was catching on very quickly, so this picture was taken just a little bit before Shannon's birth in a place called Pratt, Kansas. I looked up the population; right now it's pretty small, and I don't know what it was in 1912, but at the time the telephone had caught on enough to cause this kind of disaster in the sky. If you're wondering why we no longer have streets that look like this, it's not because these wires, or their modern equivalent, the cables that carry phone-line conversations, are now gone; it's just that we buried them underground. So if you could look underground and see
all those wires, it would look like a mess, I think even worse than this. But in any case, this is 1911, just before Shannon's birth, and at the time, then like now, communication was a noisy business. So when you picked up a phone line you couldn't count on being heard very well, or on your call being perfectly reliable, and in the 1940s, when Shannon started thinking about communication, the question that he was interested in was this question of reliability: is it possible to communicate reliably given that we are communicating in an inherently noisy world? As a simple way to think about that question, consider this model drawn here, where you imagine that you're going to communicate by transmitting some bits, either zeros or ones, and that what's received at the other end of the communication line, perhaps the phone line, is also zeros or ones, but not always the bit that you've sent. So you might send a zero and have it be received as a zero most of the time, but occasionally, when you transmit a zero, it would be received as a one, and vice versa: ones usually make it through as ones, but occasionally they're corrupted by noise and received as zeros. And if you imagine that the proportion of the time that your bits get corrupted is some fixed value p, you can then ask the question: is there any hope of communicating reliably? So say p is 0.1; ten percent of the time your bits get flipped. Is there any hope that you can communicate in a way so that, despite the fact that ten percent of your bits get flipped, you can still have very high probability, probability approaching one, that the message that somebody understands at the other end is what you were intending to send? Now, one way to increase your reliability is to add redundancy to what you transmit. It's really the only thing that you can do: somehow send extra evidence about what you're transmitting, so that even when some of your bits get flipped, you can figure out at the other end what the transmitter
was trying to transmit. And one way to add redundancy is simply to repeat yourself. So if you're interested in sending a sequence of zeros and ones that somehow encode your message, you might choose to transmit not just 0 but 0 0 0, three copies of the bit you're trying to send, or 1 1 1 when you're trying to send a one, in the hopes that this extra repetition will help the receiver at the other end figure out what it was that you were trying to communicate. The received signal then will be some noisy version of whatever was transmitted, and the decoder will look at whatever is received and try and make its best guess as to what you were trying to say. In some cases it will get it right, and in other cases it will get it wrong, and your goal is to get through as much rate as possible, as reliably as possible. You'll notice that since we're only sending one bit with every three transmissions, we're sending at a rate of one third of a bit per transmission on average, so each transmission carries about one third of a bit. Now, you'll notice that this approach won't be perfectly reliable. If you want to get more reliable, you can repeat yourself more times, and the challenge here is that while you're making your code more and more reliable, the rate at which you're transmitting information is decaying rather precipitously. So if we repeat ourselves nine times and use sort of a majority rule to decide what was the bit that was trying to be transmitted, we will in fact have a more reliable strategy than if we repeated ourselves only three times, but our rate will also be significantly smaller: instead of sending an average of one third of a bit per transmission, we're now sending only 1/9. And Shannon's fundamental question was this one: can we increase reliability without forcing the rate to zero? So does making yourself more and more reliable mean that you really have to send less and less information, or is there any hope of becoming truly reliable, as close to 100% of the time
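[Editor's aside, not from the talk: the repetition scheme and majority-vote decoding just described are easy to simulate. The sketch below, my own illustration, estimates the residual error probability of 3x and 9x repetition over a bit-flip channel with p = 0.1.]

```python
import random

def bsc(bits, p, rng):
    """Pass bits through a binary symmetric channel: flip each independently with probability p."""
    return [b ^ (rng.random() < p) for b in bits]

def repetition_error_rate(n_copies, p, trials=20000, seed=0):
    """Estimate the probability that majority decoding of an n_copies
    repetition code recovers the wrong bit."""
    rng = random.Random(seed)
    errors = 0
    for _ in range(trials):
        received = bsc([0] * n_copies, p, rng)   # send bit 0, repeated n_copies times
        if sum(received) > n_copies // 2:        # majority vote decodes a 1: an error
            errors += 1
    return errors / trials

p = 0.1
print(repetition_error_rate(3, p))   # near 0.028 = 3*p^2*(1-p) + p^3, analytically
print(repetition_error_rate(9, p))   # much smaller, but the rate drops to 1/9
```

The simulation shows the trade-off the talk describes: more repetitions buy reliability, but each extra copy dilutes the rate.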
being correct as you'd like? Is there any hope of being truly reliable without making your rate go to zero? And remarkably, what Shannon showed is that for each channel there's a range of rates that's achievable with arbitrary reliability. So for each different channel model that you might consider, and the channel model in this case was this picture here, there's some range of rates that can be achieved with as high reliability as you want, as close to 100% accuracy in your decoding as you want. And he called the maximal such achievable rate the capacity of the channel, and in fact gave a complete characterization of that capacity as a function of the relationship between the output and the input of the channel. So in the case of our previous model, that would have been these probabilities that we assigned for a bit getting flipped or not getting flipped in transmitting through this channel. So Shannon solved the capacity of point-to-point memoryless channels; again, the capacity is the number of bits per channel use that a link can reliably deliver, the maximal number of such bits that a link can reliably deliver. Now, the model that I just described to you gives you a world that looks something like this: we have a single transmitter sending information to a single receiver. If we imagine that Shannon was thinking about the old-fashioned telephone line, you might draw that picture, or something like this: we have one transmitter of information, in this case one telephone, sending information to one receiver of information. Notice there are only two communicators in this world, and that information goes only one way in this picture. But of course, the world in which Shannon lived and the world in which we live today did not have just a single transmitter and a single receiver; that world looks more like this. There were many, many communicators even in those early days, and many, many receivers, and we'd like to understand what
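[Editor's aside, not from the talk: for the bit-flip channel of the earlier example, Shannon's capacity has the well-known closed form C = 1 - H2(p), where H2 is the binary entropy function. The snippet below, my own addition, evaluates it.]

```python
from math import log2

def binary_entropy(p):
    """H2(p) in bits; H2(0) = H2(1) = 0 by the usual convention."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * log2(p) - (1 - p) * log2(1 - p)

def bsc_capacity(p):
    """Shannon capacity of a binary symmetric channel with crossover probability p."""
    return 1 - binary_entropy(p)

print(round(bsc_capacity(0.1), 3))   # 0.531 bits per channel use
```

So even with ten percent of bits flipped, reliable communication is possible at any rate below about 0.531 bits per channel use, far better than the vanishing rates of ever-longer repetition codes.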
happens in that world. And so I'll take you from the questions of Shannon's time to some of the questions that inspire our work today. So one question you might ask yourself is: okay, Shannon characterized the behavior of a channel when that channel sat in isolation. Is Shannon's channel capacity relevant to the capacity of a network? That is, if I build a network out of many, many communication channels, and I understand what the capacity of each of those individual communication channels is, will that be enough to help me understand the capacity of my larger network? Now, to understand what I mean by the capacity of a larger network, consider this example of a network. Here I am trying to represent a collection of noisy channels, such as the channels that Shannon studied, and imagine that we define in advance a collection of source nodes and a collection of receiver nodes. So, for example, information might be coming in from source 1 and intended to reach receiver 1, while simultaneously information is coming in from source 2 and trying to reach receiver 2, and so on. And imagine that this network, while I've drawn it to be rather simple, is arbitrarily complex; so you may have many, many transmitters, many, many receivers, every node may be both a transmitter and a receiver, and so on, sharing this complicated network. We'll say that a rate vector is achievable in this network if you can simultaneously deliver information at the rates described in that vector. So if, for example, I tell you that some rates R1, R2, and R3 are achievable, I'm saying that we can simultaneously send R1 bits per channel use from source 1 to receiver 1, while sending R2 bits per channel use from source 2 to receiver 2, and R3 bits per channel use from source 3 to receiver 3. All of these transmitters and receivers are sharing this environment, and so if you want to increase one pair's rate, you may have to decrease some other pair's rate so that they can share this resource and accomplish
their goals simultaneously. And if you look at all combinations of rate vectors that you can achieve, we'll call that collection of rate vectors the capacity region; it's all triples, in this example, of achievable rates. If you had a million different sources and receivers sharing the network at the same time, it would be a million-dimensional vector describing at what rates each transmitter and receiver could simultaneously send to each other through these networks, and again it would be not just a single point but a collection of points, capturing the trade-off that occurs if one transmitter tries to increase his rate: somebody else may have to decrease his rate in order to obtain our goal of reliable communication, that is, communication that occurs with a probability of error that can be made arbitrarily small. Now, to go back to the previous question, what we'd like to understand is whether the capacity of some channel somewhere in the middle of this network, the capacity of some individual channel, describes that channel's behavior from the perspective of a large network. That is, is it enough, if I'm interested in finding the capacity of this entire network, to know the capacity of each individual channel, or is that characterization somehow insufficient to capture the behavior of this channel in this larger environment, this environment outside of the system in which Shannon studied it? Now, you'll notice that the answer to that question is not immediately obvious. The challenge here is that when Shannon defined capacity, he required that we communicate reliably from the transmitter to the receiver across that particular channel, and if this channel is somewhere in a larger network, it may or may not be necessary for you to communicate reliably across this link. What you may instead be sending is some noisy information that will, way down the line, be decoded reliably, but you don't necessarily have to decode in the
middle of this network. So we'd like to understand: does the capacity capture the essence or not? Maybe the capacity only tells us what happens when you want reliable communication across that link in isolation. In answering that question, we were able to show, in fact, that Shannon's characterization of a point-to-point channel fully captures what's needed to know the capacity of the larger network. Now, I won't go into the details of the proof of the result shown here, but I will comment that the proof uses much the same kind of argument that Professor El Gamal mentioned in his talk earlier today. In particular, we demonstrate that the capacity of each individual link is the critical characterization by doing the following thought experiment. Imagine that you had a code that could operate on this network when you had a noisy channel in this location; we will use that code to design another code for the network that replaces that noisy channel by a lossless link of the same capacity. If you can show that any rate that was achievable when this was a noisy link remains achievable when this is a lossless link, achievable in the usual asymptotic sense, then you can show that any rate that was in the capacity region of the original network is in the capacity region of this network as well. And likewise, if you can do the opposite argument, if you can show that any code that can run on the network that has a lossless link in this location can run on the same network when that lossless link is replaced by any noisy channel of the same capacity, then you will show that any rate that's in the capacity region of this network is also in the capacity region of the network with the noisy channel. And by doing this argument in both ways, we can demonstrate that the capacity regions of these two networks, neither of which, by the way, we know how to solve, must be identical to each other, because each of these capacity regions
contains the other. Notice that this kind of argument, or some kind of argument that doesn't rely on solving capacity for each network, is critical to proving a result of this type. In other words, we can't hope to show that these capacities are the same by considering every possible network with every possible channel that you could plug in; instead we have to make an argument that relies not on solving these capacities but on nonetheless demonstrating that these capacities are equal to each other, because we have no hope of solving even many individual examples, much less all possible examples, to get a result of this type. Now, this result, as I've described it to you, is for the case where you have a network made up of wireline channels, channels with a single transmitter and a single receiver, such as the kinds of channels that Shannon considered in his original work. This result, it turns out, also extends to the case where you have wireless components as well; think of your wireless phone, your cell phone, communicating with a cell phone tower, and so on. The impact of a single wireline link in that larger network, it turns out, is completely characterized by the capacity of that channel. In other words, if we take out every noisy wireline channel and replace it by a noiseless link of the same capacity, the capacity of the network as a whole remains unchanged. So this is one mystery that we might consider. A second question that you might ask yourself is: okay, now I know that the capacity of my network can be characterized by replacing each channel by a lossless link of the same capacity and finding the capacity of that new network, but what is the impact of that edge? I mean, how do I determine what the capacity of a single channel will tell me about the capacity of the network as a whole? To make this question more concrete, consider the following example. Suppose that you have some network; imagine that this is the
Internet. Let's start out with just the wireline portion of that network; we'll add in wireless later on in the talk. And imagine that you are looking at some particular wire or cable somewhere in this network, there's some wire in your office, or, you know, you're gonna pull up some wire somewhere that you find in this network, and you're considering removing it from the network as a way of studying how much impact this edge has on the capacity of the network as a whole. So we'll take the original network and we'll call that network N plus Delta; it's some network N including this particular edge of some capacity that I'm here going to call Delta. And we'd like to compare the capacity of that network to the capacity of another network that's identical except that that single edge has been removed. Understanding how these two capacities relate to each other gives you one mechanism for demonstrating the impact that this capacity-Delta edge has on the capacity of the network as a whole. You can ask yourself the question: well, if I rip out the wrong wire, that is, I pull out some wire in my office and it happens to be a really powerful wire that I disconnect, how much damage can I do to the capacity of the network as a whole by removing that single edge? It's a way of measuring how big of an impact that single edge has on the capacity of the network as a whole. To make the question concrete, you might hypothesize that the following property should be true. If I tell you that some rate triple R1, R2, R3 was achievable in my original network that included that Delta-capacity edge, we'll call that network again N plus Delta, and its capacity region the capacity of N plus Delta, does that imply that R1 minus Delta, R2 minus Delta, and R3 minus Delta is in the capacity region of the network that I get when I remove that edge? That is, is it possible that any message is damaged by more than Delta when I remove
a single edge of capacity Delta? Now, why might you believe that such a property would be true? Well, one way to reason about it would be to look at each message individually and ask what the impact on that message would be, if it were the only message in the network, when I remove this Delta-capacity edge. So imagine, for example, that I'm looking only at source 1, and imagine for a minute that it's the only source in this network, and ask yourself the question: how much does the rate I can achieve decrease if I remove this capacity-Delta edge? You'll quickly come to the realization that removing this Delta-capacity edge is gonna hurt you by at most Delta. That is, if R1 was achievable before, then R1 minus Delta will be achievable now, because at most you're sending Delta down that edge; you take away that resource, you have to back off a little bit, but only by the amount that was travelling across the edge. Of course, the same property is true for the second message and the third message as well. Now, in the case where you have all three messages simultaneously, you would expect things, in the worst case, to be equal to this case. That is, you don't really expect everybody to be damaged by Delta, because you don't expect everybody to be simultaneously getting the full advantage of that edge, but in that most extreme case you would expect that, okay, maybe everybody somehow magically is getting the full Delta out of that edge, but it should never be worse than that. So as a reasonable hypothesis you might consider the hypothesis that R1, R2, R3 in the capacity region of the original network implies that this triple, reduced by Delta in each coordinate, is in the capacity region of the network that you get when you remove that Delta-capacity edge. It's a question you can ask, a mystery that you might consider. Now, remarkably, while we have worked very hard on this question, surprisingly little is known about its solution. So this question remains an unsolved question in information theory. Let me tell you what we do know, and a
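[Editor's aside: stated compactly, in notation added for this transcript (C(.) for the capacity region, x+ for max(x, 0)), the hypothesis under discussion is:]

```latex
% Edge-removal hypothesis, in added notation: N is the network without the
% edge, N + \Delta the same network with the capacity-\Delta edge included.
(R_1, R_2, R_3) \in \mathcal{C}(N + \Delta)
\;\overset{?}{\Longrightarrow}\;
\bigl((R_1 - \Delta)^{+},\, (R_2 - \Delta)^{+},\, (R_3 - \Delta)^{+}\bigr) \in \mathcal{C}(N)
```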
little bit about the mystery that remains. So first of all, we've been able to show that in some collection of special examples the answer to this question is yes. That is, if you restrict in some way your network, or what kind of code you can use, in some cases you can demonstrate that removing an edge of capacity Delta hurts everybody by at most Delta each. But in fact we haven't been able to prove that this property always holds, nor have we been able to find a counterexample to demonstrate that this property sometimes fails. It's an open research question that we haven't been able to find the answer to, despite working very hard, sometimes on this side and sometimes on the other, trying to either prove or disprove that this property holds in general. We have gone back and looked at existing outer bounds on the capacity region, and for every outer bound that we've been able to find in the literature, we've been able to show that this property is satisfied by that existing outer bound. That by no means proves that the property is always true, but maybe it gives a little bit of evidence in one direction or the other, depending upon how tight you believe these existing outer bounds ought to be. Additionally, we've been able to show that, while we can't solve this question, we can relate it to many other questions in the literature that we also are unable to solve. That is, there are a variety of other open mysteries in the literature that turn out to be solvable by solving this question, not because they're obviously related to each other; at first glance they don't seem to be related, but in fact you can demonstrate a relationship between these questions such that if you find out that the answer to my prior question is yes, then you will also solve these other questions, and likewise the reverse: if the answer is no, that will solve these other questions and give you the opposite outcome for those other outstanding questions. And I won't go through those questions in the
interest of time, but I'd be happy to tell anyone about them if you're interested afterwards. Now, in order to give you a bit of intuition about how these results are demonstrated, let me walk you through at least a little bit of how the proofs work in the cases where we're able to demonstrate that the answer to this question is yes, that is, that removing a single edge of capacity Delta never hurts you by more than Delta. The arguments go roughly like this. Imagine that you have some code, or really a sequence of codes, for communicating across the network that has this edge of capacity Delta. We'd like to design a sequence of codes for the network that doesn't have this edge that reduces every rate by at most Delta, so we have a Delta reduction in every dimension in the worst case. So how are you going to use a code for this network to design a code for that network? What we're going to do is we're only going to transmit the combinations of inputs that result in the most common value that traverses this edge. So there's some value that traverses this edge more frequently than any other value; we will only send inputs that result in that most commonly sent value being transmitted across this edge. If you can send only the inputs that result in that most popular value across this edge, then you can run the same code without the existence of that edge, just assuming, at what would have been the output of the missing edge, that that most popular value was the value that you received. It turns out that in cases where you can coordinate like that, and guarantee that you're only sending the values that would have resulted in the most common message across the missing edge, you will reduce each rate by at most Delta, so you get that desired Delta reduction, at most, in each dimension. Unfortunately, this strategy
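[Editor's aside: a back-of-the-envelope version of why restricting to the most popular edge value costs at most Delta in rate, my reconstruction rather than the speaker's exact proof: over n channel uses the edge carries at most n Delta bits, so it takes at most 2^(n Delta) distinct values, and by the pigeonhole principle some single value is induced by at least]

```latex
\frac{2^{nR}}{2^{n\Delta}} \;=\; 2^{\,n(R-\Delta)}
```

[of the 2^(nR) codewords; keeping only those codewords yields a code of rate at least R minus Delta that never needs the edge.]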
doesn't always work, because we can't always find a way for independent sources sending messages across this network to coordinate and only send those combinations of messages that would have resulted in this most popular value. So in cases where we can coordinate in that way, we have proofs; in cases where we can't, for a few of them we have proofs, but for the most part we don't. In most of the cases where we have arguments, that's how the argument goes, at least at a high level. Now, while the question is unanswered in the case of noiseless links, that doesn't stop us from looking at the same question in the case where we have noisy links as well. And when we started looking at that question, we thought that we would end up with an answer much the same as the answer that we got previously; that is, in some special cases we would be able to demonstrate that removing an edge of capacity Delta hurts everybody by at most Delta, and in general we wouldn't be able to prove whether that was always true or sometimes false. So when we started to look at this question, the first thing we did was to look in the literature to see what prior examples had to say about the question. The question itself was new, but there were examples that could be interpreted through this question and give us some insight into what happens, at least in the networks that have been studied so far. So here are two examples from the literature where people have studied networks that include wireless components. This is what's called a multiple access channel; it's a model, for example, for you and some other person in your neighborhood talking to the same cellphone tower and interfering with each other in that communication. And in this case there's feedback that's been added from the receiver to one of the transmitters, and people have studied what the capacity region looks like when you allow yourself some very small Delta of feedback: how does the capacity region change from what had
happened when that feedback was absent? Likewise, people have looked at what are called conferencing encoders in the same model. So now you have two transmitters, and instead of the receiver communicating with one of the transmitters, you can imagine the transmitters communicating with each other, each one sending the other Delta information, and you can ask how the capacity region for this example differs from the previously solved capacity region when this communication was not allowed. And in both of those examples, and in fact all of the prior examples we were able to find in the literature, the answer to the question that I previously posed was yes. That is, if you tell me that R1, R2, and R3 are achievable when you include this extra edge of capacity Delta, in all of those previous examples that always implied that R1 minus Delta, R2 minus Delta, R3 minus Delta is in the capacity region of the network you would get by removing that Delta-capacity edge; so removing an edge of capacity Delta hurts you by at most Delta. Interestingly, though, in this case, when we tried to find a counterexample, what we demonstrated was that the answer to this question can sometimes be no. That is, we were able to construct examples where adding a single edge of capacity Delta can increase each value, or removing an edge of capacity Delta can decrease each value, by more than Delta. So knowing that R1, R2, and R3 are in the capacity region for N plus Delta does not imply that R1 minus Delta, R2 minus Delta, and R3 minus Delta is in the capacity region for the same network when you remove that single Delta-capacity edge. Now, you might wonder in this case: okay, maybe a Delta-capacity edge can have more than a Delta-capacity impact, but, you know, maybe it's not much more. I mean, if you told me that removing this capacity-Delta edge hurts me by at
most 1.001 times Delta, maybe I won't be so impressed. How big can the gap actually be? And so we tried to construct examples to make this gap as large as possible, and in doing so demonstrated that the following property is true: not only does R1, R2, R3 in the capacity region of the original network not imply that R1 minus Delta, R2 minus Delta, R3 minus Delta is in the capacity region without that edge, but in fact, if you substitute any polynomial function of Delta that is zero at Delta equals zero, the property is still false; that is, the answer to this question is still no. Which means if you go into your office and remove a single wire somewhere from the Internet, you may be changing the capacity of that network by an amount that grows almost exponentially in the capacity of the edge that you removed, if you happen to be removing exactly the wrong edge. What do I mean by almost exponentially? In this case I mean that the difference can be more than any polynomial: no matter how big your polynomial is, no matter how quickly it grows in Delta, I can construct an example that will suffer a bigger loss by the removal of an edge of capacity Delta than whatever function you've given me. Now, how do those examples work? They work by constructing a channel specifically for this goal, and in the interest of time I probably shouldn't step you through how this example works. Roughly how it works, at least at a very high level, is that you imagine that you construct a channel where some combinations of inputs are good, they go successfully through this channel with no problem, and other combinations of inputs are bad, and a little bit of communication between these transmitters through this Delta-capacity edge allows you to transmit only on the good pairs, whereas without that cooperation, without that working together of these two transmitters, you have no way of avoiding the bad pairs. And if you construct these examples very carefully, you can demonstrate that the capacity
difference can be very, very large compared to Delta. That is, identifying the good locations requires only a little bit of shared information between the two transmitters, whereas avoiding the bad locations is impossible when that amount of communication goes all the way to zero. Now, of course, we've demonstrated that the impact of the removal of a single edge of capacity Delta can be almost exponential in Delta, but this is for an artificial example, an example constructed for the point of demonstrating that such a thing is possible, not an example that in any way claims to capture something that really happens in nature. And so we'd like to understand whether real channels exhibit this property also. That is, is a large loss for the removal of a Delta-capacity edge something that only happens when you artificially construct a crazy channel for the purpose of demonstrating that that's possible, or is it something that can happen in real channels? And what we've been able to demonstrate in this case is that if cooperation helps at all, then a little cooperation helps a lot. By which I mean the following: give me any channel for which control of all of the inputs by a single node, we'll call that complete cooperation, would allow you to communicate much more than you can communicate if the transmitters had to work independently, any channel where complete cooperation helps even a little bit, and we can demonstrate that for any such channel the benefit that you get as a function of the Delta of that Delta-capacity edge grows in a way that has slope infinity at Delta equals 0. That is, for very, very small allowed communication between the transmitters of your network, you get a much bigger benefit, an infinite-slope benefit, as a function of Delta. So adding a little bit of cooperation between the transmitters in your network gives you a much bigger benefit in terms of the capacity of your network as a whole. Now, to take this to its extreme point,
you might ask yourself: OK, if the benefit of cooperation has an infinite slope at Delta equals zero, when the edge allowing that cooperation has capacity zero, does zero cooperation, rate-zero cooperation itself, ever help? This may sound like a nonsensical question: what does it even mean to communicate at rate zero? But rate-zero communication doesn't imply that no bits whatsoever are being sent. You'll remember that we described the rate of communication across a single link as the number of bits that you send per channel use. And so if I'm going to send some number of bits across a channel that grows sublinearly in the number of channel uses, say log n bits in n channel uses, we would describe that as rate-zero communication, because as n grows very, very large, log n bits over your n channel uses converges to zero bits per channel use on average. And so we can ask the question: will that ever increase the capacity of your network? In this case also we've been able to show that the capacity of your network can be increased by a sublinear number of bits per transmission, at least in the case where you require perfectly reliable communication, that is, at least in the case where you require the decoder not only to achieve a probability of error that goes to zero, but instead to achieve a probability of error precisely equal to zero, which in some network models is possible at nonzero rates. In fact, we've even been able to design an example where just one bit of communication, not one bit per channel use but one bit ever, can change the capacity of your network. So even a very, very small amount of rate can make a non-trivial difference in your capacity region; more specifically, there's a discontinuity in your capacity region as a function of that single edge. So removing an edge of capacity zero can in fact change the capacity of your network as a whole, at least under some definitions of network capacity. So to summarize: Shannon started a communication
revolution by characterizing the capacity of a single channel in isolation. But Shannon's work shouldn't be viewed as the last word on this question; instead, you can think of it as the first word. It helped us understand something about a single channel, and much more remains to be understood about that single channel when it operates in a larger network. In the wireline case, it's unknown whether the benefit of a single edge can ever exceed the capacity of that edge. So we still don't know how much damage you'll do to a wireline network if you go into your office and rip out a single wire; we still don't actually understand how big the impact of that can be in the worst case, but current outer bounds, as well as some special examples, suggest that it cannot exceed the edge's capacity. And if you can solve this question, you will end up solving many other open questions in information theory, so it's a productive question to consider: if you solve this one, you'll get all sorts of other solutions for free, because of equivalence results showing that a solution to this problem would also solve many other open questions in the literature. In contrast to the case of wireline networks, you could consider the case with wireless connections and again ask the question: what is the benefit of a single edge? In this case, the benefit of a single edge can far exceed the capacity of that individual edge in isolation. In fact, the gap between the network capacity with that edge and without that edge can be large; the capacity of the network as a function of that edge can have a slope that's infinite with respect to that edge's capacity, and the benefit can be discontinuous as a function of that single edge's capacity. So many mysteries remain in our understanding of a single edge, even the channel that Shannon studied, and we look forward to working on these questions and hopefully someday finding their solutions. Thank you.

The idea was that he would wheel this thing around, and it would eliminate most of the functions that surveyors performed. He tried really hard to make this invention a commercial success, and fell short.
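As a small aside on the rate-zero notion discussed in the talk, the following minimal numeric sketch (the function name is my own, not from the talk) illustrates why sending log2(n) bits over n channel uses counts as "rate zero": the total number of bits sent grows without bound, yet the average number of bits per channel use converges to zero.

```python
import math

def average_rate(bits_sent: float, channel_uses: int) -> float:
    """Rate in bits per channel use: total bits divided by channel uses."""
    return bits_sent / channel_uses

# Sending log2(n) bits over n channel uses: the bit count grows
# without bound, but the average rate shrinks toward zero.
for n in (10, 1_000, 1_000_000):
    bits = math.log2(n)
    print(f"n = {n:>9}: {bits:6.2f} bits total, "
          f"{average_rate(bits, n):.8f} bits per channel use")
```

This is only a restatement of the definition, of course; the surprising result described in the talk is that such vanishing-rate (or even one-bit) cooperation can still change the capacity of a network under a zero-error reliability requirement.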