Hello. In this segment, I'm going to introduce the task of information retrieval, including, in particular, what's now the dominant form, web search. The task of information retrieval can maybe be defined as follows.
Our goal is finding material, which is usually documents, of an unstructured nature, usually text, that satisfies an information need, that is, what the person is looking for information on, from within collections, usually stored on computers.
So, there are lots of mentions there of prototypical cases, and other kinds of information retrieval do exist. So there are things like music information retrieval, where there are sounds, not text documents. But effectively, what we're going to talk about here is the case where all of those "usually" clauses hold. There are many scenarios for information retrieval.
The one that people think of first these days is almost invariably web search. But there are many others, so searching your email, searching the contents of your laptop computer, finding stuff in some company's knowledge base, or doing legal information retrieval to find relevant cases for a legal context or something like that. It's always been the case that a large percentage of human knowledge and information is stored in the form of human language documents.
And yet, there's also for a long time been something of a paradox there. So this is just a kind of not-quite-real graph to give a flavor of things; don't believe that the numbers on the left-hand side are exact. But what we're showing is that in the mid-1990s, it was already the case that if you looked at the volume of data, there was some data that was in structured forms. By that I mean things like relational databases and spreadsheets.
But there was already vastly more data in companies, organizations, and around people's homes that was in unstructured form, in the form of human language text. However, despite that, in the mid-1990s, structured data management and retrieval was a developed field, and there were already large database companies, whereas in the field of unstructured data management, there was very little. There were a few teeny little companies that did various kinds of corporate document retrieval,
things like that. That situation just completely changed around the turn of the millennium. So if we look today, the situation is like this.
So the data volumes have gotten larger on both sides, but in particular they've gotten larger on the unstructured side, the massive outpouring of blogs, tweets, forums, and all those other places that now store massive amounts of information. But there's also then been a turnaround on the corporate side. So now we have huge companies that are addressing the problems of unstructured information retrieval, such as the major web search giants.
So let's start and just say a little bit about the basic framework of doing information retrieval. So we start off by assuming that we've got a collection of documents over which we're going to do retrieval. And at the moment we're going to assume that it's just a static collection. Later on we'll deal with when documents get added to and deleted from that collection, and how we can go out and find them in something like a web search scenario. Then our goal is to retrieve documents that are relevant to the user's information need and help the user complete a task.
Let me go through that once more, a little bit more slowly, with this picture. So at the start, what we have is that the user has some task that they want to perform. Let's take a concrete example. Suppose what I want to do is get rid of the mice that are in my garage, and I'm the kind of person that doesn't really want to kill them with poison.
So that's my user task that I want to achieve. And so to achieve that task, I feel that I need more information. And so this is the information need. I want to know about getting rid of mice without killing them.
This information need is the thing with respect to which we assess an information retrieval system. But we can't just take someone's information need and stick it straight into a computer. So what we have to do to be able to stick it into a computer is to translate the information need into something that goes into the search box. And so that's what's happened here. We now have a query.
And so here's my attempt at a query: how trap mice alive. So I've taken this information need and made an attempt to realize it as a particular query.
And so then it's that query that goes into the search engine, which then interrogates our document collection and leads to some results coming back. And that may be the end of it, but sometimes, if I'm not satisfied with how my information retrieval session is working, I might take the evidence I got from there and go back and come up with a better query.
So I might decide that alive is a bad word and put in something like without killing and see if that works any better. Okay. What can go wrong here?
Well, there are a couple of stages of interpretation here. So first of all, there was my initial task, and I made some decisions as to what kind of information need I had. It could be that I got something wrong there, so I could have misconceptions. So maybe in getting rid of mice, the most important issue is not whether I kill them, but whether I do it humanely.
But we're not going to deal with that issue so much. What we're more interested in here is the translation between the information need and the query. And there are lots of ways in which that can go wrong in this formulation of the query. So I might choose the wrong words to express the query.
I might make use of query search operators like inverted commas, which might have a good or a bad effect on how the query actually works. So there are various choices in my query formulation which aren't altering my information need whatsoever. That distinction is important when we talk about how to evaluate information retrieval systems. That's a topic we'll say more about later, but let me just give you a first rough sense of it, because we'll need it for everything that we do.
So whenever we make a query to an information retrieval system, we're going to get some documents back, and we want to know whether the results are good. And the very basic way in which we're going to see whether they are good is as follows. We're going to think of two measures which are complementary, one of which is precision.
So precision is the fraction of the documents retrieved by the system that are relevant to the user's information need. So that's whether, when you look at the results, it seems like one document in ten is relevant to the information need you have, or whether seven documents in ten are. So precision is assessing whether the mix of stuff you're getting back has a lot of bad results in it. The complementary measure is the one of recall.
Recall is measuring how much of the good information that's in the document collection the system is succeeding in finding for you: the fraction of relevant docs in the collection that are retrieved. We'll give more precise definitions and measurements for information retrieval later. But the one thing I want to express right now is that, for these measures to make sense, they have to be evaluated with respect to the user's information need.
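To make those two definitions concrete, here is a minimal sketch in Python of computing precision and recall for a single query. The document IDs and the relevance judgments here are made up purely for illustration; in practice, the judgments come from assessing documents against the user's information need, not against the query itself.

```python
# A minimal sketch of computing precision and recall for one query,
# using hypothetical document IDs and hypothetical relevance judgments.

# Hypothetical IDs of the documents the system returned for the query
retrieved = {1, 4, 7, 9, 12}

# Hypothetical IDs of all documents in the collection that are actually
# relevant to the information need (trapping mice without killing them)
relevant = {4, 7, 12, 15, 20, 22}

# Relevant documents that were actually retrieved
relevant_retrieved = retrieved & relevant

# Precision: fraction of retrieved documents that are relevant
precision = len(relevant_retrieved) / len(retrieved)

# Recall: fraction of relevant documents in the collection that were retrieved
recall = len(relevant_retrieved) / len(relevant)

print(f"precision = {precision:.2f}")  # 3 of 5 retrieved are relevant -> 0.60
print(f"recall    = {recall:.2f}")     # 3 of 6 relevant were found    -> 0.50
```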
For certain kinds of queries, such as the ones we'll look at in the following segments, the documents that get returned are deterministic in terms of what query is submitted. So if we were just going to evaluate what gets returned with respect to the user's query, then we'd necessarily say that the precision is 100%. But that's not what we do.
What we do is think about the user's information need, and the precision of the results is assessed relative to that. So in particular, if we just look back for a moment, if there's been a misformulation of the query by the user, or they just didn't come up with a very good query, that will be seen as lowering the precision of the returned results. Okay. This is only the very beginning of our look at information retrieval, but I hope you've already got a sense of how we think of the task of information retrieval and roughly how we can see whether a search engine is doing a good or a bad job on it.