Last update:Friday 16th of December 2011
| Welcome to the online introduction to artificial intelligence. | ▶ 00:00 |
| My name is Sebastian Thrun. >>I'm Peter Norvig. | ▶ 00:04 |
| We are teaching this class at Stanford, | ▶ 00:07 |
| and now we are teaching it online for the entire world. | ▶ 00:09 |
| We are really excited about this. | ▶ 00:11 |
| It's great to have you all here. | ▶ 00:13 |
| It's exciting to have such a record-breaking number of people. | ▶ 00:14 |
| We think we can deliver a good introduction to artificial intelligence. | ▶ 00:18 |
| We hope you'll stick with it. | ▶ 00:22 |
| It's going to be a lot of work, | ▶ 00:24 |
| but we think it's going to be very rewarding. | ▶ 00:25 |
| The way that it is going to be organized is that | ▶ 00:27 |
| every week there is going to be new videos and with these videos, quizes. | ▶ 00:29 |
| With these quizzes, you can test your knowledge about AI. | ▶ 00:32 |
| We also post for the advanced version of this class, homework assignments and exams | ▶ 00:35 |
| on which you'll be quizzed. | ▶ 00:38 |
| We're going to grade those to give you a final score to see | ▶ 00:40 |
| if you can actually master artificial intelligence the same way | ▶ 00:44 |
| any good student at Stanford would do it. | ▶ 00:47 |
| If you do that, then at the end of the class, we'll sign a letter of accomplishment, | ▶ 00:49 |
| and let you know that you've achieved this and what your rank in the class was. | ▶ 00:54 |
| So I hope you have fun. Watch us on videotape. | ▶ 00:58 |
| We will teach you AI. | ▶ 01:02 |
| Participate in the discussion forum. | ▶ 01:04 |
| Ask your questions, and help others answer questions. | ▶ 01:06 |
| I hope we have a fantastic time ahead of us in the next 10 weeks. | ▶ 01:09 |
| Welcome to the class. We'll see you online. | ▶ 01:12 |
| Welcome to the first unit of Online Introduction to Artificial Intelligence. | ▶ 00:00 |
| I will be teaching you the very, very basics today. | ▶ 00:05 |
| This is Unit 1 of Artificial Intelligence. | ▶ 00:09 |
| Welcome. | ▶ 00:14 |
| The purpose of this class is twofold: | ▶ 00:16 |
| Number 1, to teach you the very basics of artificial intelligence | ▶ 00:20 |
| so you'll be able to talk to people in the field | ▶ 00:25 |
| and understand the basic tools of the trade; | ▶ 00:29 |
| and also, very importantly, to excite you about the field. | ▶ 00:32 |
| I have been in the field of artificial intelligence for about 20 years, | ▶ 00:37 |
| and it's been truly rewarding. | ▶ 00:42 |
| So I want you to participate in the beauty and the excitement of AI | ▶ 00:44 |
| so you can become a professional who gets the same reward | ▶ 00:48 |
| and excitement out of this field as I do. | ▶ 00:52 |
| The basic structure of this class involves videos | ▶ 00:55 |
| in which Peter or I will teach you something new, | ▶ 01:00 |
| then also quizzes, which we will ask you about your ability to answer AI questions, | ▶ 01:03 |
| and finally, answer videos in which we tell you what the right answer would have been | ▶ 01:11 |
| for the quiz that you might have falsely or incorrectly answered before. | ▶ 01:17 |
| This will all be reiterated, and every so often you get a homework assignment, | ▶ 01:22 |
| also in the form of quizzes but without the answers. | ▶ 01:28 |
| And then we also have video exams. | ▶ 01:34 |
| If you check our website, there's requirements | ▶ 01:37 |
| on how you have to do assignments and exams. | ▶ 01:39 |
| Please go to ai-class.org in this class. | ▶ 01:43 |
| An AI program is called wetware, a formula, or an intelligent agent. | ▶ 01:48 |
| Pick the one that fits best. | ▶ 01:58 |
| [Thrun] The correct answer is intelligent agent. | ▶ 00:00 |
| Let's talk about intelligent agents. | ▶ 00:04 |
| Here is my intelligent agent, | ▶ 00:07 |
| and it gets to interact with an environment. | ▶ 00:11 |
| The agent can perceive the state of the environment | ▶ 00:17 |
| through its sensors, | ▶ 00:22 |
| and it can affect its state through its actuators. | ▶ 00:25 |
| The big question of artificial intelligence is the function that maps sensors to actuators. | ▶ 00:29 |
| That is called the control policy for the agent. | ▶ 00:37 |
| So all of this class will deal with how does an agent make decisions | ▶ 00:41 |
| that it can carry out with its actuators based on past sensor data. | ▶ 00:48 |
| Those decisions take place many, many times, | ▶ 00:54 |
| and the loop of environment feedback to sensors, agent decision, | ▶ 00:58 |
| actuator interaction with the environment and so on is called perception action cycle. | ▶ 01:03 |
| So here is my very first quiz for you. | ▶ 01:12 |
| Artificial intelligence, AI, has successfully been used in finance, | ▶ 01:15 |
| robotics, games, medicine, and the Web. | ▶ 01:21 |
| Check any or all of those that apply. | ▶ 01:26 |
| And if none of them applies, check the box down here that says none of them. | ▶ 01:28 |
| So the correct answer is all of those-- | ▶ 00:00 |
| finance, robotics, games, medicine, the Web, and many more applications. | ▶ 00:03 |
| So let me talk about them in some detail. | ▶ 00:08 |
| There is a huge number of applications of artificial intelligence in finance, | ▶ 00:10 |
| very often in the shape of making trading decisions-- | ▶ 00:15 |
| in which case, the agent is called a trading agent. | ▶ 00:18 |
| And the environment might be things like the stock market or the bond market | ▶ 00:21 |
| or the commodities market. | ▶ 00:27 |
| And our trading agent can sense the course of certain things, | ▶ 00:29 |
| like the stock or bonds or commodities. | ▶ 00:33 |
| It can also read the news online and follow certain events. | ▶ 00:35 |
| And its decisions are usually things like buy or sell decisions--trades. | ▶ 00:40 |
| There's a huge history of artificial intelligence finding methods to look at data over time | ▶ 00:48 |
| and make predictions as to how courses develop over time-- | ▶ 00:55 |
| and then put in trades behind those. | ▶ 00:58 |
| And very frequently, people using artificial intelligence trading agents | ▶ 01:01 |
| have made a good amount of money with superior trading decisions. | ▶ 01:06 |
| There's also a long history of AI in Robotics. | ▶ 01:10 |
| Here is my depiction of a robot. | ▶ 01:14 |
| Of course, there are many different types of robots | ▶ 01:17 |
| and they all interact with their environments through their sensors, | ▶ 01:20 |
| which include things like cameras, microphones, tactile sensor or touch. | ▶ 01:24 |
| And the way they impact their environments is to move motors around, | ▶ 01:33 |
| in particular, their wheels, their legs, their arms, their grippers. | ▶ 01:38 |
| They can also say things to people using voice. | ▶ 01:43 |
| Now there's a huge history of using artificial intelligence in robotics. | ▶ 01:46 |
| Pretty much, every robot that does something interesting today uses AI. | ▶ 01:50 |
| In fact, often AI has been studied together with robotics, as one discipline. | ▶ 01:54 |
| But because robots are somewhat special in that they use physical actuators | ▶ 01:58 |
| and deal with physical environments, they are a little bit different from | ▶ 02:03 |
| just artificial intelligence, as a whole. | ▶ 02:06 |
| When the Web came out, the early Web crawlers were called robots | ▶ 02:08 |
| and to block a robot from accessing your website, to the present day, | ▶ 02:15 |
| there's a file called robot.txt, that allows you to deny any Web crawler | ▶ 02:20 |
| to access and retrieve that information from your website. | ▶ 02:24 |
| So historically, robotics played a huge role in artificial intelligence | ▶ 02:28 |
| and a good chunk of this class will be focusing on robotics. | ▶ 02:32 |
| AI has a huge history in games-- | ▶ 02:36 |
| to make games smarter or feel more natural. | ▶ 02:39 |
| There are 2 ways in which AI has been used in games, as a game agent. | ▶ 02:43 |
| One is to play against you, as a human user. | ▶ 02:47 |
| So for example, if you play the game of Chess, | ▶ 02:50 |
| then you are the environment to the game agent. | ▶ 02:54 |
| The game agent gets to observe your moves, and it generates its own moves | ▶ 02:57 |
| with the purpose of defeating you in Chess. | ▶ 03:03 |
| So most adversarial games, where you play against an opponent | ▶ 03:07 |
| and the opponent is a computer program, | ▶ 03:10 |
| the game agent is built to play against you--against your own interests--and make you lose. | ▶ 03:13 |
| And of course, your objective is to win. | ▶ 03:20 |
| That's an AI games-type situation. | ▶ 03:22 |
| The second thing is that games agents in AI | ▶ 03:25 |
| also are used to make games feel more natural. | ▶ 03:29 |
| So very often games have characters inside, and these characters act in some way. | ▶ 03:32 |
| And it's important for you, as the player, to feel that these characters are believable. | ▶ 03:36 |
| There's an entire sub-field of artificial intelligence to use AI | ▶ 03:42 |
| to make characters in a game more believable--look smarter, so to speak-- | ▶ 03:45 |
| so that you, as a player, think you're playing a better game. | ▶ 03:51 |
| Artificial intelligence has a long history in medicine as well. | ▶ 03:55 |
| The classic example is that of a diagnostic agent. | ▶ 04:00 |
| So here you are--and you might be sick, and you go to your doctor. | ▶ 04:04 |
| And your doctor wishes to understand | ▶ 04:09 |
| what the reason for your symptoms and your sickness is. | ▶ 04:11 |
| The diagnostic agent will observe you through various measurements-- | ▶ 04:17 |
| for example, blood pressure and heart signals, and so on-- | ▶ 04:21 |
| and it'll come up with the hypothesis as to what you might be suffering from. | ▶ 04:25 |
| But rather than intervene directly, in most cases the diagnostic of your disease | ▶ 04:29 |
| is communicated to the doctor, who then takes on the intervention. | ▶ 04:34 |
| This is called a diagnostic agent. | ▶ 04:38 |
| There are many other versions of AI in medicine. | ▶ 04:40 |
| AI is used in intensive care to understand whether there are situations | ▶ 04:43 |
| that need immediate attention. | ▶ 04:48 |
| It's been used for life-long medicine to monitor signs over long periods of time. | ▶ 04:50 |
| And as medicine becomes more personal, the role of artificial intelligence | ▶ 04:54 |
| will definitely increase. | ▶ 04:58 |
| We already mentioned AI on the Web. | ▶ 05:01 |
| The most generic version of AI is to crawl the Web and understand the Web, | ▶ 05:05 |
| and assist you in answering questions. | ▶ 05:09 |
| So when you have this search box over here | ▶ 05:12 |
| and it says "Search" on the left, | ▶ 05:15 |
| and "I'm Feeling Lucky" on the right, | ▶ 05:18 |
| and you type in the words, | ▶ 05:20 |
| what AI does for you is it understands what words you typed in | ▶ 05:21 |
| and finds the most relevant pages. | ▶ 05:28 |
| That is really co-artificial intelligence. | ▶ 05:30 |
| It's used by a number of companies, such as Microsoft and Google | ▶ 05:32 |
| and Amazon, Yahoo, and many others. | ▶ 05:36 |
| And the way this works is that there's a crawling agent that can go | ▶ 05:39 |
| to the World Wide Web and retrieve pages, through just a computer program. | ▶ 05:43 |
| It then sorts these pages into a big database inside the crawler | ▶ 05:51 |
| and also analyzes developments of each page to any possible query. | ▶ 05:56 |
| When you then come and issue a query, | ▶ 06:01 |
| the AI system is able to give you a response-- | ▶ 06:04 |
| for example, a collection of 10 best Web links. | ▶ 06:08 |
| In short, every time you try to write a piece of software, | ▶ 06:12 |
| that makes your computer software smart | ▶ 06:15 |
| likely you will need artificial intelligence. | ▶ 06:18 |
| And in this class, Peter and I will teach you | ▶ 06:20 |
| many of the basic tricks of the trade | ▶ 06:23 |
| to make your software really smart. | ▶ 06:25 |
| It will be good to introduce some basic terminology | ▶ 00:00 |
| that is commonly used in artificial intelligence to distinguish different types of problems. | ▶ 00:04 |
| The very first word I will teach you is fully versus partially observable. | ▶ 00:09 |
| An environment is called fully observable if what your agent can sense | ▶ 00:16 |
| at any point in time is completely sufficient to make the optimal decision. | ▶ 00:19 |
| So, for example, in many card games, | ▶ 00:26 |
| when all the cards are on the table, the momentary site of all those cards | ▶ 00:29 |
| is really sufficient to make the optimal choice. | ▶ 00:36 |
| That is in contrast to some other environments where you need memory | ▶ 00:40 |
| on the side of the agent to make the best possible decision. | ▶ 00:46 |
| For example, in the game of poker, the cards aren't openly on the table, | ▶ 00:50 |
| and memorizing past moves will help you make a better decision. | ▶ 00:55 |
| To fully understand the difference, consider the interaction of an agent | ▶ 01:00 |
| with the environment to its sensors and its actuators, | ▶ 01:04 |
| and this interaction takes place over many cycles, | ▶ 01:08 |
| often called the perception-action cycle. | ▶ 01:11 |
| For many environments, it's convenient to assume | ▶ 01:16 |
| that the environment has some sort of internal state. | ▶ 01:19 |
| For example, in a card game where the cards are not openly on the table, | ▶ 01:22 |
| the state might pertain to the cards in your hand. | ▶ 01:28 |
| An environment is fully observable if the sensors can always see | ▶ 01:33 |
| the entire state of the environment. | ▶ 01:37 |
| It's partially observable if the sensors can only see a fraction of the state, | ▶ 01:41 |
| yet memorizing past measurements gives us additional information of the state | ▶ 01:46 |
| that is not readily observable right now. | ▶ 01:52 |
| So any game, for example, where past moves have information about | ▶ 01:55 |
| what might be in a person's hand, those games are partially observable, | ▶ 02:01 |
| and they require different treatment. | ▶ 02:06 |
| Very often agents that deal with partially observable environments | ▶ 02:08 |
| need to acquire internal memory to understand what | ▶ 02:12 |
| the state of the environment is, and we'll talk extensively | ▶ 02:15 |
| when we talk about hidden Markov models about how this structure | ▶ 02:18 |
| has such internal memory. | ▶ 02:21 |
| A second terminology for environments pertains to whether the environment | ▶ 02:23 |
| is deterministic or stochastic. | ▶ 02:26 |
| Deterministic environment is one where your agent's actions | ▶ 02:29 |
| uniquely determine the outcome. | ▶ 02:35 |
| So, for example, in chess, there's really no randomness when you move a piece. | ▶ 02:37 |
| The effect of moving a piece is completely predetermined, | ▶ 02:42 |
| and no matter where I'm going to move the same piece, the outcome is the same. | ▶ 02:46 |
| That we call deterministic. | ▶ 02:50 |
| Games with dice, for example, like backgammon, are stochastic. | ▶ 02:52 |
| While you can still deterministically move your pieces, | ▶ 02:56 |
| the outcome of an action also involves throwing of the dice, | ▶ 03:00 |
| and you can't predict those. | ▶ 03:03 |
| There's a certain amount of randomness involved for the outcome of dice, | ▶ 03:05 |
| and therefore, we call this stochastic. | ▶ 03:08 |
| Let me talk about discrete versus continuous. | ▶ 03:10 |
| A discrete environment is one where you have finitely many action choices, | ▶ 03:14 |
| and finitely many things you can sense. | ▶ 03:18 |
| So, for example, in chess, again, there's finitely many board positions, | ▶ 03:21 |
| and finitely many things you can do. | ▶ 03:25 |
| That is different from a continuous environment | ▶ 03:28 |
| where the space of possible actions or things you could sense may be infinite. | ▶ 03:30 |
| So, for example, if you throw darts, there's infinitely many ways to angle the darts | ▶ 03:35 |
| and to accelerate them. | ▶ 03:41 |
| Finally, we distinguish benign versus adversarial environments. | ▶ 03:43 |
| In benign environments, the environment might be random. | ▶ 03:49 |
| It might be stochastic, but it has no objective on its own | ▶ 03:53 |
| that would contradict the own objective. | ▶ 03:57 |
| So, for example, weather is benign. | ▶ 03:59 |
| It might be random. It might affect the outcome of your actions. | ▶ 04:02 |
| But it isn't really out there to get you. | ▶ 04:06 |
| Contrast this with adversarial environments, such as many games, like chess, | ▶ 04:08 |
| where your opponent is really out there to get you. | ▶ 04:14 |
| It turns out it's much harder to find good actions in adversarial environments | ▶ 04:16 |
| where the opponent actively observes you and counteracts what you're trying to achieve | ▶ 04:21 |
| relative to benign environment, where the environment might merely be stochastic | ▶ 04:26 |
| but isn't really interested in making your life worse. | ▶ 04:30 |
| So, let's see to what extent these expressions make sense to you | ▶ 04:35 |
| by going to our next quiz. | ▶ 04:38 |
| So here are the 4 concepts again: partially observable versus fully, | ▶ 04:40 |
| stochastic versus deterministic, continuous versus discrete, | ▶ 04:45 |
| adversarial versus benign. | ▶ 04:50 |
| And let me ask you about the game of checkers. | ▶ 04:52 |
| Check one or all of those attributes that apply. | ▶ 04:56 |
| So, if you think checkers is partially observable, check this one. | ▶ 05:00 |
| Otherwise, just don't check it. | ▶ 05:03 |
| If you think it's stochastic, check this one, | ▶ 05:05 |
| continuous, check this one, adversarial, check this one. | ▶ 05:07 |
| If you don't know about checkers, you can check the Web and Google it | ▶ 05:11 |
| to find a little more information about checkers. | ▶ 05:15 |
| So, checkers is an interesting game. | ▶ 00:00 |
| Here's the typical board of the game of checkers. | ▶ 00:04 |
| Your pieces might look like this, | ▶ 00:08 |
| and your opponent's pieces might look like this. | ▶ 00:11 |
| And apart from some very cryptic rules in checkers, | ▶ 00:16 |
| which I won't really discuss here, the board basically tells you | ▶ 00:19 |
| everything there is to know about checkers, so it's clearly fully observable. | ▶ 00:23 |
| It is deterministic because your move and your opponent's move | ▶ 00:28 |
| very clearly affect the state of the board in ways that have | ▶ 00:33 |
| absolutely no stochasticity. | ▶ 00:36 |
| It is also discrete because there's finitely many action choices | ▶ 00:39 |
| and finitely many board positions, | ▶ 00:45 |
| and obviously, it is adversarial, since your opponent is out to get you. | ▶ 00:47 |
| [Male narrator] The game of poker--is this partially observable, stochastic, | ▶ 00:00 |
| continuous, or adversarial? | ▶ 00:06 |
| Please check any or all of those that apply. | ▶ 00:09 |
| [Male narrator] I would argue poker is partially observable | ▶ 00:00 |
| because it can't be seen what is in your opponent's hands. | ▶ 00:03 |
| It is stochastic because you're being dealt cards that are kind of coming at random. | ▶ 00:08 |
| It is not continuous; it's just finding many cards | ▶ 00:13 |
| and finding many actions you can do, even though you might argue | ▶ 00:16 |
| that there's a huge number of different monies you can bet. | ▶ 00:20 |
| It's still finite, and it is clearly adversarial. | ▶ 00:24 |
| If you've ever played poker before, you know how brutal it can be. | ▶ 00:27 |
| [Male narrator] --a favorite, a robotic car. | ▶ 00:00 |
| I wish to know whether it is partially observable, | ▶ 00:04 |
| stochastic, continuous, or adversarial. | ▶ 00:06 |
| That is, is the problem of driving robotically-- | ▶ 00:11 |
| say, in a city--subject to any of those 4 categories? | ▶ 00:16 |
| Please check any or all that might apply. | ▶ 00:20 |
| Well, the robotic car clearly deals with a partially observable environment | ▶ 00:00 |
| if you just look at momentary sensing input, you can't even tell how fast other cars are going. | ▶ 00:04 |
| So, you need to memorize something. | ▶ 00:10 |
| It is stochastic because it's inherently unpredictable | ▶ 00:12 |
| what's going to happen next with other cars. | ▶ 00:15 |
| It is continuous. | ▶ 00:17 |
| There's the infinitely many ways to set your steering | ▶ 00:20 |
| or push your gas pedal or your brake, | ▶ 00:23 |
| and, well, you can argue with adversarial or not. | ▶ 00:26 |
| Depending on where you live, it might be highly adversarial. | ▶ 00:29 |
| Where I live, it isn't. | ▶ 00:31 |
| I'm going to briefly talk of AI as something else, | ▶ 00:00 |
| which is AI is the technique of uncertainty management in computer software. | ▶ 00:03 |
| Put differently, AI is the discipline that you apply when you want to know what to do | ▶ 00:10 |
| when you don't know what to do. | ▶ 00:17 |
| Now, there's many reasons why there might be uncertainty in a computer program. | ▶ 00:22 |
| There could be a sensor limit. | ▶ 00:27 |
| That is, your sensors are unable to tell me | ▶ 00:29 |
| what exactly is the case outside the AI system. | ▶ 00:33 |
| There could be adversaries who act in a way that makes it hard for you | ▶ 00:37 |
| to understand what is the case. | ▶ 00:41 |
| There could be stochastic environments. | ▶ 00:44 |
| Every time you roll the dice in a dice game, | ▶ 00:48 |
| the stochasticity of the dice will make it impossible for you | ▶ 00:51 |
| to be absolutely certain of what's the situation. | ▶ 00:55 |
| There could be laziness. | ▶ 00:57 |
| So perhaps you can actually compute what the situation is, | ▶ 01:00 |
| but your computer program is just too lazy to do it. | ▶ 01:04 |
| And here's my favorite: ignorance, plain ignorance. | ▶ 01:07 |
| Many people are just ignorant of what's going on. | ▶ 01:11 |
| They could know it, but they just don't care. | ▶ 01:14 |
| All of these things are cause for uncertainty. | ▶ 01:17 |
| AI is the discipline that deals with uncertainty and manages it in decision making. | ▶ 01:21 |
| Now we've had an introduction to AI. | ▶ 00:00 |
| We've heard about some of the properties of environments, | ▶ 00:03 |
| and we've seen some possible architecture for agents. | ▶ 00:06 |
| I'd like next to show you some examples of AI in practice. | ▶ 00:10 |
| And Sebastian and I have some experience personally in things we have done | ▶ 00:13 |
| at Google, at NASA, and at Stanford. | ▶ 00:18 |
| And I want to tell you a little bit about some of those. | ▶ 00:21 |
| One of the best successes of AI technology at Google | ▶ 00:25 |
| has been the machine translation system. | ▶ 00:28 |
| Here we see an example of an article in Italian automatically translated into English. | ▶ 00:31 |
| Now, these systems are built for 50 different languages, | ▶ 00:37 |
| and we can translate from any of the languages into any of the other languages. | ▶ 00:41 |
| So, that's over 2,500 different systems, and we've done this all | ▶ 00:46 |
| using machine learning techniques, using AI techniques, | ▶ 00:51 |
| rather than trying to build them by hand. | ▶ 00:55 |
| And the way it works is that we go out and collect examples of text | ▶ 00:58 |
| that's a line between the 2 languages. | ▶ 01:03 |
| So we find, say, a newspaper that publishes 2 editions, | ▶ 01:06 |
| an Italian edition and an English edition, and now we have examples of translations. | ▶ 01:11 |
| And if anybody ever asked us for exactly the translation of this one particular article, | ▶ 01:16 |
| then we could just look it up and say "We already know that." | ▶ 01:22 |
| But of course, we aren't often going to be asked that. | ▶ 01:25 |
| Rather, we're going to be asked parts of this. | ▶ 01:27 |
| Here are some words that we've seen before, and we have to figure out | ▶ 01:30 |
| which words in this article correspond to which words in the translation article. | ▶ 01:34 |
| And when we do that by examining many, many millions of words of text | ▶ 01:40 |
| in the 2 languages and making the correspondence, | ▶ 01:45 |
| and then we can put that all together. | ▶ 01:49 |
| And then when we see a new example of text that we haven't seen before, | ▶ 01:51 |
| we can just look up what we've seen in the past for that correspondence. | ▶ 01:54 |
| So, the task is really two parts. | ▶ 01:58 |
| Off-line, before we see an example of text we want to translate, | ▶ 02:01 |
| we first build our translation model. | ▶ 02:05 |
| We do that by examining all of the different examples | ▶ 02:07 |
| and figuring out which part aligns to which. | ▶ 02:10 |
| Now, when we're given a text to translate, we use that model, | ▶ 02:14 |
| and we go through and find the most probable translation. | ▶ 02:18 |
| So, what does it look like? | ▶ 02:22 |
| Well, let's look at it in some example text. | ▶ 02:24 |
| And rather than look at news articles, I'm going to look at something simpler. | ▶ 02:26 |
| I'm going to switch from Italian to Chinese. | ▶ 02:29 |
| Here's a bilingual text. | ▶ 02:35 |
| Now, for a large-scale machine translation, examples are found on the Web. | ▶ 02:37 |
| This example was found in a Chinese restaurant by Adam Lopez. | ▶ 02:41 |
| Now, it's given, for a text of this form, | ▶ 02:46 |
| that a line in Chinese corresponds to a line in English, | ▶ 02:49 |
| and that's true for each of the individual lines. | ▶ 02:55 |
| But to learn from this text, what we really want to discover | ▶ 02:59 |
| is what individual words in Chinese correspond to individual words | ▶ 03:02 |
| or small phrases in English. | ▶ 03:07 |
| I've started that process by highlighting the word "wonton" in English. | ▶ 03:09 |
| It appears 3 times throughout the text. | ▶ 03:16 |
| Now, in each of those lines, there's a character that appears, | ▶ 03:18 |
| and that's the only place in the Chinese text where that character appears. | ▶ 03:23 |
| So, that seems like it's a high probability that this character in Chinese | ▶ 03:27 |
| corresponds to the word "wonton" in English. | ▶ 03:33 |
| Let's see if we can go farther. | ▶ 03:36 |
| My question for you is what word or what character or characters in Chinese | ▶ 03:38 |
| correspond to the word "chicken" in English? | ▶ 03:44 |
| And here we see "chicken" appears in these locations. | ▶ 03:47 |
| Click on the character or characters in Chinese that corresponds to "chicken." | ▶ 03:54 |
| The answer is that chicken appears here, | ▶ 00:01 |
| here, here, and here. | ▶ 00:04 |
| Now, I don't know for sure, 100%, that that is the character for chicken in Chinese, | ▶ 00:10 |
| but I do know that there is a good correspondence. | ▶ 00:14 |
| Every place the word chicken appears in English, | ▶ 00:17 |
| this character appears in Chinese and no other place. | ▶ 00:20 |
| Let's go 1 step farther. | ▶ 00:24 |
| Let's see if we can work out a phrase in Chinese | ▶ 00:27 |
| and see if it corresponds to a phrase in English. | ▶ 00:30 |
| Here's the phrase corn cream. | ▶ 00:33 |
| Click on the characters in Chinese that correspond to corn cream. | ▶ 00:38 |
| The answer is: these 2 characters here | ▶ 00:00 |
| appear only in these 2 locations | ▶ 00:04 |
| corresponding to the words corn cream | ▶ 00:07 |
| which appear only in these locations in the English text. | ▶ 00:10 |
| Again, we're not 100% sure that's the right answer, | ▶ 00:13 |
| but it looks like a strong correlation. | ▶ 00:17 |
| Now, 1 more question. | ▶ 00:20 |
| Tell me what character or characters in Chinese | ▶ 00:22 |
| correspond to the English word soup. | ▶ 00:26 |
| The answer is that soup occurs in most of these phrases | ▶ 00:00 |
| but not 100% of them. | ▶ 00:09 |
| It's missing in this phrase. | ▶ 00:11 |
| Equivalently, on the Chinese side | ▶ 00:14 |
| we see this character occurs | ▶ 00:17 |
| in most of the phrases, | ▶ 00:20 |
| but it's missing here. | ▶ 00:23 |
| So we see that the correspondence doesn't have to be 100% | ▶ 00:27 |
| to tell us that there is still a good chance of a correlation. | ▶ 00:31 |
| When we're learning to do machine translation | ▶ 00:34 |
| we use these kinds of alignments to learn probability tables | ▶ 00:37 |
| of what is the probability of one phrase in one language | ▶ 00:41 |
| corresponding to the phrase in another language. | ▶ 00:45 |
| So congratulations, you just finished unit 1. | ▶ 00:00 |
| You just finished unit 1 of this class, | ▶ 00:03 |
| where I told you about key applications | ▶ 00:07 |
| of artificial intelligence, | ▶ 00:10 |
| I told you about the definition of an intelligent agent, | ▶ 00:13 |
| I gave you 4 key attributes of intelligent agents | ▶ 00:18 |
| (partial observability, stochasticity, continuous spaces, and adversarial natures), | ▶ 00:24 |
| I discussed sources and management of uncertainty, | ▶ 00:31 |
| and I briefly mentioned the mathematical concept of rationality. | ▶ 00:34 |
| Obviously, I only touched any of these issues superficially, | ▶ 00:40 |
| but as this class goes on you're going to dive into any of those | ▶ 00:45 |
| and learn much more about | ▶ 00:49 |
| what it takes to make a truly intelligent AI system. | ▶ 00:51 |
| Thank you. | ▶ 00:55 |
| [PROBLEM SOLVING] | ▶ 00:00 |
| In this unit we're going to talk about problem solving. | ▶ 00:01 |
| The theory and technology of building agents | ▶ 00:04 |
| that can plan ahead to solve problems. | ▶ 00:06 |
| In particular, we're talking about problem solving | ▶ 00:10 |
| where the complexity of the problem comes from the idea that there are many states. | ▶ 00:13 |
| As in this problem here. | ▶ 00:17 |
| A navigation problem where there are many choices to start with. | ▶ 00:19 |
| And the complexity comes from picking the right choice now and picking the right choice at the | ▶ 00:24 |
| next intersection and the intersection after that. | ▶ 00:29 |
| Streaming together a sequence of actions. | ▶ 00:32 |
| This is in contrast to the type of complexity shown in this picture, | ▶ 00:35 |
| where the complexity comes from the partial observability | ▶ 00:39 |
| that we can't see through the fog where the possible paths are. | ▶ 00:43 |
| We can't see the results of our actions | ▶ 00:46 |
| and even the actions themselves are not known. | ▶ 00:48 |
| This type of complexity will be covered in a later unit. | ▶ 00:51 |
| Here's an example of a problem. | ▶ 00:56 |
| This is a route-finding problem where we're given a start city, | ▶ 00:58 |
| in this case, Arad, and a destination, Bucharest, the capital of Romania, | ▶ 01:03 |
| from which this is a corner of the map. | ▶ 01:09 |
| And the problem then is to find a route from Arad to Bucharest. | ▶ 01:11 |
| The actions that the agent can execute when driving | ▶ 01:16 |
| from one city to the next along one of the roads shown on the map. | ▶ 01:20 |
| The question is, is there a solution that the agent can come up with | ▶ 01:23 |
| given the knowledge shown here to the problem of driving from Arad to Bucharest? | ▶ 01:28 |
| And the answer is no. | ▶ 00:00 |
| There is no solution that the agent can come up with | ▶ 00:03 |
| because Bucharest doesn't appear on the map, | ▶ 00:06 |
| and so the agent doesn't know any actions that can arrive there. | ▶ 00:08 |
| So let's give the agent a better chance. | ▶ 00:12 |
| Now we've given the agent the full map of Romania. | ▶ 00:19 |
| To start, he's in Arad, and the destination--or goal--is in Bucharest. | ▶ 00:23 |
| And the agent is given the problem of coming up with a sequence of actions | ▶ 00:30 |
| that will arrive at the destination. | ▶ 00:35 |
| Now, is it possible for the agent to solve this problem? | ▶ 00:37 |
| And the answer is yes. | ▶ 00:43 |
| There are many routes or steps or sequences of actions that will arrive at the destination. | ▶ 00:45 |
| Here is one of them: | ▶ 00:50 |
| Starting out in Arad, taking this step first, then this one, then this one, | ▶ 00:53 |
| then this one, and then this one to arrive at the destination. | ▶ 01:00 |
| So that would count as a solution to the problem. | ▶ 01:05 |
| So sequence of actions, chained together, that are guaranteed to get us to the goal. | ▶ 01:08 |
| [DEFINITION OF A PROBLEM] | ▶ 01:12 |
| Now let's formally define what a problem looks like. | ▶ 01:14 |
| A problem can be broken down into a number of components. | ▶ 01:17 |
| First, the initial state that the agent starts out with. | ▶ 01:21 |
| In our route finding problem, the initial state was the agent being in the city of Arad. | ▶ 01:25 |
| Next, a function--Actions--that takes a state as input and returns | ▶ 01:32 |
| a set of possible actions that the agent can execute when the agent is in this state. | ▶ 01:41 |
| [ACTIONS (s) {a,a2,a3...}] | ▶ 01:47 |
| In some problems, the agent will have the same actions available in all states | ▶ 01:50 |
| and in other problems, he'll have different actions dependent on the state. | ▶ 01:54 |
| In the route finding problem, the actions are dependent on the state. | ▶ 01:58 |
| When we're in one city, we can take the routes to the neighboring cities-- | ▶ 02:02 |
| but we can't go to any other cities. | ▶ 02:06 |
| Next we have a function called Result, which takes, as input, a state and an action | ▶ 02:09 |
| and delivers, as its output, a new state. | ▶ 02:20 |
| So, for example, if the agent is in the city of Arad, and takes--that would be the state-- | ▶ 02:24 |
| and takes the action of driving along Route E-671 towards Timisoara, | ▶ 02:33 |
| then the result of applying that action in that state would be the new state-- | ▶ 02:40 |
| where the agent is in the city of Timisoara. | ▶ 02:45 |
| Next, we need a function called Goal Test, | ▶ 02:51 |
| which takes a state and returns a Boolean value-- | ▶ 02:58 |
| true or false--telling us if this state is a goal or not. | ▶ 03:04 |
| In a route-finding problem, the only goal would be being in the destination city-- | ▶ 03:09 |
| the city of Bucharest--and all the other states would return false for the Goal Test. | ▶ 03:14 |
| And finally, we need one more thing which is a Path Cost function-- | ▶ 03:19 |
| which takes a path, a sequence of state/action transitions, | ▶ 03:28 |
| and returns a number, which is the cost of that path. | ▶ 03:40 |
| Now, for most of the problems we'll deal with, we'll make the Path Cost function be additive | ▶ 03:44 |
| so that the cost of the path is just the sum of the costs of the individual steps. | ▶ 03:50 |
| And so we'll implement this Path Cost function, in terms of a Step Cost function. | ▶ 03:56 |
| The Step Cost function takes a state, an action, and the resulting state from that action | ▶ 04:04 |
| and returns a number--n--which is the cost of that action. | ▶ 04:14 |
| In the route finding example, the cost might be the number of miles traveled | ▶ 04:18 |
| or maybe the number of minutes it takes to get to that destination. | ▶ 04:24 |
| Now letβs see how the definition of a problem | ▶ 00:00 |
| maps onto the route finding, the domain. | ▶ 00:06 |
| First, the initial state was given. | ▶ 00:10 |
| Letβs say we start off in Arad, | ▶ 00:12 |
| and the goal test, | ▶ 00:15 |
| letβs say that the state of being in Bucharest | ▶ 00:17 |
| is the only state that counts as a goal, | ▶ 00:22 |
| and all the other states are not goals. | ▶ 00:24 |
| Now the set of all of the states here | ▶ 00:26 |
| is known as the state space, | ▶ 00:29 |
| and we navigate the state space by applying actions. | ▶ 00:31 |
| The actions are specific to each city, | ▶ 00:35 |
| so when we are in Arad, there are three possible actions, | ▶ 00:39 |
| to follow this road, this one, or this one. | ▶ 00:42 |
| And as we follow them, we build paths | ▶ 00:46 |
| or sequences of actions. | ▶ 00:49 |
| So just being in Arad is the path of length zero, | ▶ 00:51 |
| and now we could start exploring the space | ▶ 00:55 |
| and add in this path of length one, | ▶ 00:58 |
| this path of length one, | ▶ 01:01 |
| and this path of length one. | ▶ 01:03 |
| We could add in another path here of length two | ▶ 01:06 |
| and another path here of length two. | ▶ 01:11 |
| Here is another path of length two. | ▶ 01:14 |
| Here is a path of length three. | ▶ 01:17 |
| Another path of length two, and so on. | ▶ 01:21 |
| Now at ever point, | ▶ 01:26 |
| we want to separate the state out into three parts. | ▶ 01:28 |
| First, the ends of the pathsβ | ▶ 01:34 |
| The farthest paths that have been explored, | ▶ 01:37 |
| we call the frontier. | ▶ 01:40 |
| And so the frontier in this case | ▶ 01:42 |
| consists of these states | ▶ 01:46 |
| that are the farthest out we have explored. | ▶ 01:51 |
| And then to the left of that in this diagram, | ▶ 01:55 |
| we have the explored part of the state. | ▶ 01:59 |
| And then off to the rigtht, | ▶ 02:02 |
| we have the unexplored. | ▶ 02:04 |
| So letβs write down those three components. | ▶ 02:06 |
| We have the frontier. | ▶ 02:09 |
| We have the unexplored region, | ▶ 02:15 |
| and we have the explored region. | ▶ 02:20 |
| One more thing, | ▶ 02:25 |
| in this diagram we have labeled the step cost | ▶ 02:27 |
| of each action along the route. | ▶ 02:30 |
| So the step cost of going between Neamt to Iasi | ▶ 02:33 |
| would be 87 corresponding to a distance of 87 kilometers, | ▶ 02:37 |
| and the path cost is just the sum of the step costs. | ▶ 02:42 |
| So the cost of the path | ▶ 02:46 |
| of going from Arad to Oradea | ▶ 02:48 |
| would be 71 plus 75. | ▶ 02:50 |
| [Narrator] Now let's define a function for solving problems. | ▶ 00:00 |
| It's called Tree Search because it superimposes | ▶ 00:04 |
| a search tree over the state space. | ▶ 00:07 |
| Here's how it works: It starts off by | ▶ 00:10 |
| initializing the frontier to be the path | ▶ 00:12 |
| consisting of only the initial states, | ▶ 00:14 |
| and then it goes into a loop | ▶ 00:16 |
| in which it first checks to see | ▶ 00:18 |
| do we still have anything left in the frontier? | ▶ 00:21 |
| If not we fail, there can be no solution. | ▶ 00:23 |
| If we do have something, then we make a choice. | ▶ 00:25 |
| Tree Search is really a family of functions | ▶ 00:28 |
| not a single algorithm which | ▶ 00:31 |
| depends on how we make that choice, | ▶ 00:33 |
| and we'll see some of the options later. | ▶ 00:35 |
| If we go ahead and make a choice of one of | ▶ 00:38 |
| the paths on the frontier and remove that | ▶ 00:41 |
| path from the frontier, we find the state | ▶ 00:43 |
| which is at the end of the path, and if that | ▶ 00:45 |
| state's a go then we're done. | ▶ 00:47 |
| We found a path to the goal; otherwise, | ▶ 00:49 |
| we do what's called expanding that path. | ▶ 00:51 |
| We look at all the actions from that state, | ▶ 00:54 |
| and we add to the path the actions | ▶ 00:57 |
| and the result of that state; so we get | ▶ 01:00 |
| a new path that has the old path, the action | ▶ 01:03 |
| and the result of that action, and we | ▶ 01:06 |
| stick all of those paths back onto the frontier. | ▶ 01:09 |
| Now Tree Search represents a whole family | ▶ 01:17 |
| of algorithms, and where you get the family | ▶ 01:19 |
| resemblance is that they're all looking | ▶ 01:22 |
| at the frontier, copying items off and | ▶ 01:24 |
| and looking to see if their goal tests, | ▶ 01:26 |
| but where you get the difference is right here, | ▶ 01:29 |
| in the choice of how you're going to expand | ▶ 01:31 |
| the next item on the frontier, which | ▶ 01:34 |
| path do we look at first, and we'll go through | ▶ 01:36 |
| different sets of algorithms that make | ▶ 01:39 |
| different choices for which path to look at first. | ▶ 01:42 |
| The first algorithm I want to consider | ▶ 01:47 |
| is called Breadth-First Search. | ▶ 01:49 |
| Now it could be called shortest-first search | ▶ 01:51 |
| because what it does is always choose | ▶ 01:54 |
| of the frontier one of the paths that hadn't been | ▶ 01:56 |
| considered yet that's the shortest possible. | ▶ 01:59 |
| So how does it work? | ▶ 02:02 |
| Well we start off with the path of | ▶ 02:04 |
| length 0, starting in the start state, and | ▶ 02:06 |
| that's the only path in the frontier so | ▶ 02:10 |
| it's the shortest one so we pick it, | ▶ 02:13 |
| and then we expand it, and we add in | ▶ 02:15 |
| all the paths that result from | ▶ 02:17 |
| applying all the possible actions. | ▶ 02:20 |
| So now we've removed | ▶ 02:22 |
| this path from the frontier, | ▶ 02:25 |
| but we've added in 3 new paths. | ▶ 02:28 |
| This one, | ▶ 02:31 |
| this one, and this one. | ▶ 02:33 |
| Now we're in a position where | ▶ 02:37 |
| we have 3 paths on the frontier, and | ▶ 02:39 |
| we have to pick the shortest one. | ▶ 02:42 |
| Now in this case all 3 paths | ▶ 02:45 |
| have the same length, length 1, so we | ▶ 02:47 |
| break the tie at random or using some | ▶ 02:50 |
| other technique, and let's suppose that | ▶ 02:52 |
| in this case we choose this path | ▶ 02:56 |
| from Arad to Sibiu. | ▶ 02:58 |
| Now the question I want you to answer | ▶ 03:00 |
| is once we remove that from the frontier, | ▶ 03:03 |
| what paths are we going to add next? | ▶ 03:09 |
| So show me by checking off the cities | ▶ 03:11 |
| that ends the paths, which paths | ▶ 03:14 |
| are going to be added to the frontier? | ▶ 03:16 |
| [Male narrator] The answer is that in Sibiu, the action function gives us 4 actions | ▶ 00:00 |
| corresponding to traveling along these 4 roads, | ▶ 00:06 |
| so we have to add in paths for each of those actions. | ▶ 00:09 |
| One of those paths goes here, | ▶ 00:15 |
| the other path continues from Arad and goes out here. | ▶ 00:17 |
| The third path continues out here | ▶ 00:21 |
| and then the fourth path goes from here--from Arad to Sibiu | ▶ 00:25 |
| and then backtracks back to Arad. | ▶ 00:31 |
| Now, it may seem silly and redundant to have a path that starts in Arad, | ▶ 00:36 |
| goes to Sibiu and returns to Arad. | ▶ 00:41 |
| How can that help us get to our destination in Bucharest? | ▶ 00:44 |
| But we can see if we're dealing with a tree search, | ▶ 00:49 |
| why it's natural to have this type of formulation | ▶ 00:52 |
| and why the tree search doesn't even notice that it's backtracked. | ▶ 00:56 |
| What the tree search does is superimpose on top of the state space | ▶ 01:00 |
| a tree of searches, and the tree looks like this. | ▶ 01:05 |
| We start off in state A, and in state A, there were 3 actions, | ▶ 01:09 |
| so we gave those paths going to Z, S, and T. | ▶ 01:15 |
| And from S, there were 4 actions, so that gave us paths going from O, F, R, and A, | ▶ 01:21 |
| and then the tree would continue on from here. | ▶ 01:34 |
| We'd take one of the next items | ▶ 01:37 |
| and we'd move it and continue on, but notice that we returned to the A state | ▶ 01:40 |
| in the state space, but in the tree, | ▶ 01:48 |
| it's just another item in the tree. | ▶ 01:51 |
| Now, here's another representation of the search space | ▶ 01:55 |
| and what's happening is as we start to explore the state, | ▶ 01:57 |
| we keep track of the frontier, which is the set of states that are at the end of the paths | ▶ 02:01 |
| that we haven't explored yet, and behind that frontier | ▶ 02:09 |
| is the set of explored states, and ahead of the frontier is the unexplored states. | ▶ 02:13 |
| Now the reason we keep track of the explored states | ▶ 02:19 |
| is that when we want to expand and we find a duplicate-- | ▶ 02:22 |
| so say when we expand from here, if we pointed back to state T, | ▶ 02:27 |
| if we hadn't kept track of that, we would have to add in a new state for T down here. | ▶ 02:33 |
| But because we've already seen it and we know that this is actually a regressive step | ▶ 02:42 |
| into the already explored state, now, because we kept track of that, | ▶ 02:47 |
| we don't need it anymore. | ▶ 02:51 |
| Now we see how to modify the Tree Search Function | ▶ 00:00 |
| to make it be a Graph Search Function | ▶ 00:04 |
| to avoid those repeated paths. | ▶ 00:06 |
| What we do, is we start off and initialize a set | ▶ 00:09 |
| called the explored set of states that we have already explored. | ▶ 00:13 |
| Then, when we consider a new path, | ▶ 00:16 |
| we add the new state to the set of already explored states, | ▶ 00:19 |
| and then when we are expanding the path | ▶ 00:23 |
| and adding in new states to the end of it, | ▶ 00:26 |
| we donβt add that in if we have already seen that new state | ▶ 00:29 |
| in either the frontier or the explored. | ▶ 00:33 |
| Now back to Breadth First Search. | ▶ 00:37 |
| Letβs assume we are using the Graph Search | ▶ 00:39 |
| so that we have eliminated the duplicate paths. | ▶ 00:41 |
| Arad is crossed off the list. | ▶ 00:44 |
| The path that goes from Arad to Sibiu | ▶ 00:47 |
| and back to Arad is removed, | ▶ 00:49 |
| and we are left with these one, two, three, | ▶ 00:51 |
| four, five possible paths. | ▶ 00:53 |
| Given these 5 paths, | ▶ 00:57 |
| show me which ones are candidates to be expanded next | ▶ 00:59 |
| by the Breadth First Search Algorithm. | ▶ 01:02 |
| [Male narrator] And the answer is that Breadth - First Search always considers | ▶ 00:00 |
| the shortest paths first, and in this case, there's 2 paths of length 1, | ▶ 00:03 |
| and 1, the paths from Arad to Zerind and Arad to Timisoara, | ▶ 00:08 |
| so those would be the 2 paths that would be considered. | ▶ 00:12 |
| Now, let's suppose that the tie is broken in some way | ▶ 00:15 |
| and we chose this path from Arad to Zerind. | ▶ 00:18 |
| Now, we want to expand that node. | ▶ 00:22 |
| We remove it from the frontier and put it in the explored list | ▶ 00:25 |
| and now we say, "What paths are we going to add?" | ▶ 00:31 |
| So check off the ends of the paths the cities that we're going to add. | ▶ 00:35 |
| [Male narrator] In this case, there's nothing to add | ▶ 00:00 |
| because of the 2 neighbors, 1 is in the explored list and 1 is in the frontier, | ▶ 00:03 |
| and if we're using graph search, then we won't add either of those. | ▶ 00:09 |
| [Male narrator] So we move on, we look for another shortest path. | ▶ 00:00 |
| There's one path left of length 1, so we look at that path, we expand it, | ▶ 00:04 |
| add in this path, put that one on the explored list, | ▶ 00:11 |
| and now we've got 3 paths of length 2. | ▶ 00:16 |
| We choose 1 of them, and let's say we choose this one. | ▶ 00:20 |
| Now, my question is show me which states we add to the path | ▶ 00:23 |
| and tell me whether we're going to terminate the algorithm at this point | ▶ 00:30 |
| because we've reached the goal or whether we're going to continue. | ▶ 00:35 |
| [Male narrator] The answer is that we add 1 more path, the path to Bucharest. | ▶ 00:00 |
| We don't add the path going back because it's in the explored list, | ▶ 00:08 |
| but we don't terminate it yet. | ▶ 00:11 |
| True, we have added a path that ends in Bucharest, | ▶ 00:13 |
| but the goal test isn't applied when we add a path to the frontier. | ▶ 00:16 |
| Rather, it's applied when we remove that path from the frontier, | ▶ 00:22 |
| and we haven't done that yet. | ▶ 00:26 |
| [Male narrator] Now, why doesn't the general tree search or graph search algorithm stop | ▶ 00:00 |
| when it adds a goal node to the frontier? | ▶ 00:06 |
| The reason is because it might not be the best path to the goal. | ▶ 00:09 |
| Now, here we found a path of length 2 | ▶ 00:13 |
| and we added a path of length 3 that reached the goal. | ▶ 00:16 |
| The general graph search or tree search doesn't know | ▶ 00:21 |
| that there might be some other path that we could expand | ▶ 00:24 |
| that would have a distance of say, 2-1/2, | ▶ 00:27 |
| but there's an optimization that could be made. | ▶ 00:30 |
| If we know we're doing Breadth - First Search | ▶ 00:33 |
| and we know there's no possibility of a path of length 2-1/2. | ▶ 00:35 |
| Then we can change algorithm so that it checks states | ▶ 00:40 |
| as soon as they're added to the frontier | ▶ 00:44 |
| rather than waiting until they're expanded | ▶ 00:46 |
| and in that case, we can write a specific Breadth - First Search routine | ▶ 00:49 |
| that terminates early and gives us a result as soon as we add a goal state to the frontier. | ▶ 00:53 |
| Breadth - First Search will find this path | ▶ 01:01 |
| that ends up in Bucharest, and if we're looking for the shortest path | ▶ 01:04 |
| in terms of number of steps, | ▶ 01:08 |
| Breadth - First Search is guaranteed to find it, | ▶ 01:10 |
| But if we're looking for the shortest path in terms of total cost | ▶ 01:12 |
| by adding up the step costs, then it turns out | ▶ 01:17 |
| that this path is shorter than the path found by Breadth - First Search. | ▶ 01:21 |
| So let's look at how we could find that path. | ▶ 01:26 |
| An algorithm that has traditionally been called uniform-cost search | ▶ 00:00 |
| but could be called cheapest-first search, | ▶ 00:05 |
| is guaranteed to find the path with the cheapest total cost. | ▶ 00:08 |
| Let's see how it works. | ▶ 00:11 |
| We start out as before in the start state. | ▶ 00:14 |
| And we pop that empty path off. | ▶ 00:19 |
| Move it from the frontier to explored, | ▶ 00:24 |
| and then add in the paths out of that state. | ▶ 00:28 |
| As before, there will be 3 of those paths. | ▶ 00:33 |
| And now, which path are we going to pick next | ▶ 00:39 |
| in order to expand according to the rules of cheapest first? | ▶ 00:43 |
| Cheapest first says that we pick the path with | ▶ 00:00 |
| the lowest total cost. | ▶ 00:04 |
| And that would be this path. | ▶ 00:06 |
| It has a cost of 75 compared to the cost of 118 and 140 | ▶ 00:07 |
| for the other paths. | ▶ 00:13 |
| So we get here. We take that path off the frontier, | ▶ 00:14 |
| put it on the explored list, add in its neighbors. | ▶ 00:19 |
| Not going back to Arad, | ▶ 00:23 |
| but adding in this new path. | ▶ 00:26 |
| Summing up the total cost of that path, | ▶ 00:30 |
| 71 + 75 is 146 for this path. | ▶ 00:33 |
| And now the question is, | ▶ 00:40 |
| which path gets expanded next? | ▶ 00:41 |
| Of the 3 paths on the frontier, we have ones | ▶ 00:00 |
| with a cost of 146, 140, and 118. | ▶ 00:05 |
| And that's the cheapest, so this one gets expanded. | ▶ 00:10 |
| We take it off the frontier, move it to explored, | ▶ 00:13 |
| add in its successors. In this case it's only 1. | ▶ 00:16 |
| And that has a path total of 229. | ▶ 00:21 |
| Which path do we expand next? | ▶ 00:29 |
| Well, we've got 146, 140, and 229 | ▶ 00:30 |
| So 140 is the lowest. | ▶ 00:33 |
| Take it off the frontier. Put it on explored. | ▶ 00:38 |
| Add in this path | ▶ 00:41 |
| for a total cost of 220. | ▶ 00:44 |
| And this path for a total cost of 239. | ▶ 00:48 |
| And now the question is, which path do we expand next? | ▶ 00:53 |
| The answer is this one, 146. | ▶ 00:00 |
| Put it on explored. | ▶ 00:04 |
| But there's nothing to add because | ▶ 00:07 |
| both of its neighbors have already been explored. | ▶ 00:12 |
| Which path do we look at next? | ▶ 00:13 |
| The answer is this one. Two-twenty is less than 229 or 239. | ▶ 00:00 |
| Take it off the frontier. Put it on explored. | ▶ 00:05 |
| Add in 2 more paths and sum them up. | ▶ 00:09 |
| So, 220 plus 146 is 366. | ▶ 00:15 |
| And 220 plus 97 is 317. | ▶ 00:21 |
| Okay, and now, notice that we're closing in on Bucharest. | ▶ 00:29 |
| We've got 2 neighbors almost there, but neither of them is their turn yet. | ▶ 00:32 |
| Instead, the cheapest path is this one over here, | ▶ 00:38 |
| so move it to the explored list. | ▶ 00:43 |
| Add 70 to the path cost so far, | ▶ 00:45 |
| and we get 299. | ▶ 00:50 |
| Now the cheapest node is 239 here, | ▶ 00:57 |
| so we expand, finally, into Bucharest at a cost of 460. | ▶ 01:01 |
| And now the question is are we done? Can we terminate the algorithm? | ▶ 01:09 |
| [Male] And the answer is no, we're not done yet. | ▶ 00:00 |
| We've put Bucharest, the gold state, onto the frontier, | ▶ 00:03 |
| but we haven't popped it off the frontier yet. | ▶ 00:07 |
| And the reason is because we've got to look around and see if there's a better path | ▶ 00:09 |
| that can reach it, Bucharest. | ▶ 00:13 |
| And so, let's continue. | ▶ 00:15 |
| Look at everything on the frontier. | ▶ 00:18 |
| Here's the cheapest one over here. | ▶ 00:20 |
| Expand that. | ▶ 00:23 |
| Now, what's the cheapest next one? | ▶ 00:26 |
| Well, over here. | ▶ 00:30 |
| Oops, forgot to take this one off the list. | ▶ 00:33 |
| So now, we're at 317 plus 101 gives us another path into Bucharest, | ▶ 00:36 |
| and this is a better path. | ▶ 00:44 |
| This is 418, gives us another route in. | ▶ 00:46 |
| But we have to keep going. | ▶ 00:54 |
| The best path on the frontier is 366, | ▶ 00:59 |
| so pop that off, and that would give us 2 more routes into here, | ▶ 01:06 |
| and eventually we pop off all of these. | ▶ 01:14 |
| And then we get to the point where 418 was the best path on the frontier. | ▶ 01:18 |
| We pop that off, and then we recognize that we'd reach the goal, | ▶ 01:24 |
| and the reason that uniform cost finds the optimal path, the cheapest cost, | ▶ 01:29 |
| is because it's guaranteed that it will first pop off this cheapest path, | ▶ 01:35 |
| the 418, before it gets to the more expensive path, like the 460. | ▶ 01:40 |
| So, we've looked at 2 search algorithms. | ▶ 00:00 |
| One, breadth-first search, in which we always expand first | ▶ 00:03 |
| the shallowest paths, the shortest paths. | ▶ 00:08 |
| Second, cheapest-first search, in which we always expand first the path | ▶ 00:12 |
| with the lowest total cost. | ▶ 00:17 |
| And I'm going to take this opportunity to introduce a third algorithm, depth-first search, | ▶ 00:20 |
| which is in a way the opposite of breadth-first search. | ▶ 00:25 |
| In depth-first search, we always expand first the longest path, | ▶ 00:28 |
| the path with the most lengths in it. | ▶ 00:33 |
| Now, what I want to ask you to do is for each of these nodes in each of the trees, | ▶ 00:36 |
| tell us in what order they're expanded, | ▶ 00:42 |
| first, second, third, fourth, fifth and so on by putting a number into the box. | ▶ 00:44 |
| And if there are ties, put that number in and resolve the ties in left to right order. | ▶ 00:49 |
| Then I want you to ask one more question or answer one more question | ▶ 00:58 |
| which is are these searches optimal? | ▶ 01:03 |
| That is, are they guaranteed to find the best solution? | ▶ 01:06 |
| And for breadth-first search, optimal would mean finding the shortest path. | ▶ 01:11 |
| If you think it's guaranteed to find the shortest path, check here. | ▶ 01:16 |
| For cheapest first, it would mean finding the path with the lowest total path cost. | ▶ 01:21 |
| Check here if you think it's guaranteed to do that. | ▶ 01:26 |
| And we'll allow the assumption that all costs have to be positive. | ▶ 01:30 |
| And in depth first, cheapest or optimal would mean, again, | ▶ 01:34 |
| as in breadth first, finding the shortest possible path in terms of number of lengths. | ▶ 01:41 |
| Check here if you think depth first will always find that. | ▶ 01:46 |
| Here are the answers. | ▶ 00:00 |
| Breadth-first search, as the name implies, expands nodes in this order. | ▶ 00:04 |
| One, 2, 3, 4, 5, 6, 7. | ▶ 00:10 |
| So, it's going across a stripe at a time, breadth first. | ▶ 00:17 |
| Is it optimal? | ▶ 00:23 |
| Well, it's always expanding in the shortest paths first, | ▶ 00:25 |
| and so wherever the goal is hiding, it's going to find it by examining | ▶ 00:28 |
| no longer paths, so in fact, it is optimal. | ▶ 00:34 |
| Cheapest first, first we expand the path of length zero, | ▶ 00:38 |
| then the path of length 2. | ▶ 00:45 |
| Now there's a path of length 4, path of length 5, | ▶ 00:47 |
| path of length 6, a path of length 7, and finally, a path of length 8. | ▶ 00:53 |
| And as we've seen, it's guaranteed to find the cheapest path of all, | ▶ 01:02 |
| assuming that all the individual step costs are not negative. | ▶ 01:08 |
| Depth-first search tries to go as deep as it can first, | ▶ 01:14 |
| so it goes 1, 2, 3, then backs up, 4, | ▶ 01:17 |
| then backs up, 5, 6, 7. | ▶ 01:24 |
| And you can see that it doesn't necessarily find the shortest path of all. | ▶ 01:29 |
| Let's say that there were goals in position 5 and in position 3. | ▶ 01:34 |
| It would find the longer path to position 3 and find the goal there | ▶ 01:39 |
| and would not find the goal in position 5. | ▶ 01:43 |
| So, it is not optimal. | ▶ 01:46 |
| Given the non-optimality of depth-first search, | ▶ 00:00 |
| why would anybody choose to use it? | ▶ 00:04 |
| Well, the answer has to do with the storage requirements. | ▶ 00:07 |
| Here I've illustrated a state space | ▶ 00:10 |
| consisting of a very large or even infinite binary tree. | ▶ 00:13 |
| As we go to levels 1, 2, 3, down to level n, | ▶ 00:18 |
| the tree gets larger and larger. | ▶ 00:22 |
| Now, let's consider the frontier for each of these search algorithms. | ▶ 00:24 |
| For breadth-first search, we know a frontier looks like that, | ▶ 00:29 |
| and so when we get down to level n, we'll require a storage space of | ▶ 00:35 |
| 2 to the n of pass in a breadth-first search. | ▶ 00:40 |
| For cheapest first, the frontier is going to be more complicated. | ▶ 00:45 |
| It's going to sort of work out this contour of cost, | ▶ 00:49 |
| but it's going to have a similar total number of nodes. | ▶ 00:53 |
| But for depth-first search, as we go down the tree, we start going down this branch, | ▶ 00:57 |
| and then we back up, but at any point, our frontier is only going to have n nodes | ▶ 01:03 |
| rather than 2 to the n nodes, so that's a substantial savings for depth-first search. | ▶ 01:08 |
| Now, of course, if we're also keeping track of the explored set, | ▶ 01:14 |
| then we don't get that much savings. | ▶ 01:19 |
| But without the explored set, depth-first search has a huge advantage | ▶ 01:21 |
| in terms of space saved. | ▶ 01:25 |
| One more property of the algorithms to consider | ▶ 01:27 |
| is the property of completeness, meaning if there is a goal somewhere, | ▶ 01:30 |
| will the algorithm find it? | ▶ 01:35 |
| So, let's move from very large trees to infinite trees, | ▶ 01:37 |
| and let's say that there's some goal hidden somewhere deep down in that tree. | ▶ 01:41 |
| And the question is, are each of these algorithms complete? | ▶ 01:47 |
| That is, are they guaranteed to find a path to the goal? | ▶ 01:51 |
| Mark off the check boxes for the algorithms that you believe are complete in this sense. | ▶ 01:55 |
| The answer is that breadth-first search is complete, | ▶ 00:00 |
| so even if the tree is infinite, if the goal is placed at any finite level, | ▶ 00:04 |
| eventually, we're going to march down and find that goal. | ▶ 00:10 |
| Same with cheapest first. | ▶ 00:16 |
| No matter where the goal is, if it has a finite cost, | ▶ 00:18 |
| eventually, we're going to go down and find it. | ▶ 00:21 |
| But not so for depth-first search. | ▶ 00:25 |
| If there's an infinite path, depth-first search will keep following that, | ▶ 00:28 |
| so it will keep going down and down and down along this path | ▶ 00:33 |
| and never get to the path that the goal consists of | ▶ 00:37 |
| and never get to the path on which the goal sits. | ▶ 00:42 |
| So, depth-first search is not complete. | ▶ 00:46 |
| Let's try to understand a little better how uniform cost search works. | ▶ 00:00 |
| We start at a start state, | ▶ 00:05 |
| and then we start expanding out from there looking at different paths, | ▶ 00:08 |
| and what we end of doing is expanding in terms of contours like on a topological map, | ▶ 00:13 |
| where first we span out to a certain distance, then to a farther distance, | ▶ 00:21 |
| and then to a farther distance. | ▶ 00:28 |
| Now at some point we meet up with a goal. Let's say the goal is here. | ▶ 00:31 |
| Now we found a path from the start to the goal. | ▶ 00:35 |
| But notice that the search really wasn't directed at any way towards the goal. | ▶ 00:42 |
| It was expanding out everywhere in the space and depending on where the goal is, | ▶ 00:46 |
| we should expect to have to explore half the space, on average, before we find the goal. | ▶ 00:52 |
| If the space is small, that can be fine, | ▶ 00:57 |
| but when spaces are large, that won't get us to the goal fast enough. | ▶ 01:00 |
| Unfortunately, there is really nothing we can do, with what we know, to do better than that, | ▶ 01:05 |
| and so if we want to improve, if we want to be able to find the goal faster, | ▶ 01:10 |
| we're going to have to add more knowledge. | ▶ 01:15 |
| The type of knowledge that is proven most useful in search is an estimate of the distance | ▶ 01:21 |
| from the start state to the goal. | ▶ 01:27 |
| So let's say we're dealing with a route-finding problem, | ▶ 01:32 |
| and we can move in any direction--up or down, right or left-- | ▶ 01:36 |
| and we'll take as our estimate, the straight line distance between a state and a goal, | ▶ 01:43 |
| and we'll try to use that estimate to find our way to the goal fastest. | ▶ 01:50 |
| Now an algorithm called greedy best-first search does exactly that. | ▶ 01:55 |
| It expands first the path that's closest to the goal according to the estimate. | ▶ 02:04 |
| So what do the contours look like in this approach? | ▶ 02:09 |
| Well, we start here, and then we look at all the neighboring states, | ▶ 02:13 |
| and the ones that appear to be closest to the goal we would expand first. | ▶ 02:17 |
| So we'd start expanding like this and like this and like this and like this | ▶ 02:21 |
| and that would lead us directly to the goal. | ▶ 02:30 |
| So now instead of exploring whole circles that go out everywhere with a certain space, | ▶ 02:33 |
| our search is directed towards the goal. | ▶ 02:38 |
| In this case it gets us immediately towards the goal, but that won't always be the case | ▶ 02:41 |
| if there are obstacles along the way. | ▶ 02:46 |
| Consider this search space. We have a start state and a goal, | ▶ 02:50 |
| and there's an impassable barrier. | ▶ 02:54 |
| Now greedy best-first search will start expanding out as before, | ▶ 02:57 |
| trying to get towards the goal, | ▶ 03:02 |
| and when it reaches the barrier, what will it do next? | ▶ 03:08 |
| Well, it will try to increase along a path that's getting closer and closer to the goal. | ▶ 03:11 |
| So it won't consider going back this way which is farther from the goal. | ▶ 03:15 |
| Rather it will continue expanding out along these lines | ▶ 03:20 |
| which always get closer and closer to the goal, | ▶ 03:24 |
| and eventually it will find its way towards the goal. | ▶ 03:28 |
| So it does find a path, and it does it by expanding a small number of nodes, | ▶ 03:31 |
| but it's willing to accept a path which is longer than other paths. | ▶ 03:36 |
| Now if we explored in the other direction, we could have found a much simpler path, | ▶ 03:42 |
| a much shorter path, by just popping over the barrier, and then going directly to the goal. | ▶ 03:47 |
| but greedy best-first search wouldn't have done that because | ▶ 03:54 |
| that would have involved getting to this point, which is this distance to the goal, | ▶ 03:56 |
| and then considering states which were farther from the goal. | ▶ 04:01 |
| What we would really like is an algorithm that combines the best parts | ▶ 04:08 |
| of greedy search which explores a small number of nodes in many cases | ▶ 04:11 |
| and uniform cost search which is guaranteed to find a shortest path. | ▶ 04:17 |
| We'll show how to do that next using an algorithm called the A-star algorithm. | ▶ 04:22 |
| [Male narrator] A* Search works by always expanding the path | ▶ 00:00 |
| that has a minimum value of the function f | ▶ 00:03 |
| which is defined as a sum of the g + h components. | ▶ 00:07 |
| Now, the function g of a path | ▶ 00:12 |
| is just the path cost, | ▶ 00:16 |
| and the function h of a path | ▶ 00:19 |
| is equal to the h value of the state, | ▶ 00:23 |
| which is the final state of the path, | ▶ 00:27 |
| which is equal to the estimated distance to the goal. | ▶ 00:30 |
| Here's an example of how A* works. | ▶ 00:36 |
| Suppose we found this path through the state's base to a state x | ▶ 00:39 |
| and we're trying to give a measure to the value of this path. | ▶ 00:44 |
| The measure f is a sum of g, the path cost so far, | ▶ 00:48 |
| and h, which is the estimated distance that the path will take | ▶ 00:55 |
| to complete its path to the goal. | ▶ 01:02 |
| Now, minimizing g helps us keep the path short | ▶ 01:04 |
| and minimizing h helps us keep focused on finding the goal | ▶ 01:08 |
| and the result is a search strategy that is the best possible | ▶ 01:13 |
| in the sense that it finds the shortest length path | ▶ 01:17 |
| while expanding the minimum number of paths possible. | ▶ 01:20 |
| It could be called "best estimated total path cost first," | ▶ 01:24 |
| but the name A* is traditional. | ▶ 01:28 |
| Now let's go back to Romania and apply the A* algorithm | ▶ 01:32 |
| and we're going to use a heuristic, which is a straight line distance | ▶ 01:36 |
| between a state and the goal. | ▶ 01:40 |
| The goal, again, is Bucharest, | ▶ 01:42 |
| and so the distance from Bucharest to Bucharest is, of course, 0. | ▶ 01:44 |
| And for all the other states, I've written in red | ▶ 01:47 |
| the straight line distance. | ▶ 01:51 |
| For example, straight across like that. | ▶ 01:53 |
| Now, I should say that all the roads here I've drawn as straight lines, | ▶ 01:55 |
| but actually, roads are going to be curved to some degree, | ▶ 01:59 |
| so the actual distance along the roads is going to be longer | ▶ 02:03 |
| than the straight line distance. | ▶ 02:06 |
| Now, we start out as usual--we'll start in Arad as a start state-- | ▶ 02:09 |
| and we'll expand out Arad and so we'll add 3 paths | ▶ 02:13 |
| and the evaluation function, f, will be the sum of the path length, | ▶ 02:21 |
| which is given in black, and the estimated distance, | ▶ 02:26 |
| which is given in red. | ▶ 02:29 |
| And so the path length from this path | ▶ 02:32 |
| will be 140+253 or 393; | ▶ 02:37 |
| for this path, 75+374, or 449; | ▶ 02:45 |
| and for this path, 118+329, or 447. | ▶ 02:55 |
| And now, the question is out of all the paths that are on the frontier, | ▶ 03:05 |
| which path would we expand next under the A* algorithm? | ▶ 03:09 |
| The answer is that we select this path first--the one from Arad to Sibiu-- | ▶ 00:00 |
| because it has the smallest value--393--of the sum f=g+h. | ▶ 00:05 |
| Let's go ahead and expand this node now. | ▶ 00:00 |
| So we're going to add 3 paths. | ▶ 00:03 |
| This one has a path cost of 291 | ▶ 00:06 |
| and an estimated distance to the goal of 380, | ▶ 00:10 |
| for a total of 671. | ▶ 00:14 |
| This one has a path cost of 239 | ▶ 00:18 |
| and an estimated distance of 176, for a total of 415. | ▶ 00:21 |
| And the final one is 220+193=413. | ▶ 00:27 |
| And now the question is which state to we expand next? | ▶ 00:33 |
| The answer is we expand this path next | ▶ 00:00 |
| because its total, 413, | ▶ 00:03 |
| is less than all the other ones on the front tier-- | ▶ 00:06 |
| although only slightly less than the 415 for this path. | ▶ 00:09 |
| So we expand this node, | ▶ 00:00 |
| giving us 2 more paths-- | ▶ 00:03 |
| this one with an f-value of 417, | ▶ 00:06 |
| and this one with an f-value of 526. | ▶ 00:10 |
| The question again--which path are we going to expand next? | ▶ 00:16 |
| And the answer is that we expand this path, Fagaras, next, | ▶ 00:00 |
| because its f-total, 415, | ▶ 00:05 |
| is less than all the other paths in the front tier. | ▶ 00:08 |
| Now we expand Fagaras | ▶ 00:01 |
| and we get a path that reaches the goal | ▶ 00:04 |
| and it has a path length of 450 and an estimated distance of 0 | ▶ 00:07 |
| for a total f value of 450, | ▶ 00:11 |
| and now the question is: What do we do next? | ▶ 00:14 |
| Click here if you think we're at the end of the algorithm | ▶ 00:17 |
| and we don't need to expand next | ▶ 00:22 |
| or click on the node that you think we will expand next. | ▶ 00:24 |
| The answer is that we're not done yet, | ▶ 00:00 |
| because the algorithm works by doing the goal test, | ▶ 00:03 |
| when we take a path off the front tier, | ▶ 00:06 |
| not when we put a path on the front tier. | ▶ 00:08 |
| Instead, we just continue in the normal way and choose the node | ▶ 00:11 |
| on the front tier which has the lowest value. | ▶ 00:15 |
| That would be this one--the path through Pitesti, with a total of 417. | ▶ 00:18 |
| So let's expand the node at Pitesti. | ▶ 00:01 |
| We have to go down this direction, up, | ▶ 00:04 |
| then we reach a path we've seen before, | ▶ 00:08 |
| and we go in this direction. | ▶ 00:11 |
| Now we reach Bucharest, which is the goal, | ▶ 00:13 |
| and the h value is going to be 0 | ▶ 00:16 |
| because we're at the goal, and the g value works out to 418. | ▶ 00:19 |
| Again, we don't stop here just because we put a path onto the front tier, | ▶ 00:24 |
| we put it there, we don't apply the goal test next, | ▶ 00:31 |
| but, now we go back to the front tier, | ▶ 00:35 |
| and it turns out that this 418 is the lowest-cost path on the front tier. | ▶ 00:38 |
| So now we pull it off, do the goal test, | ▶ 00:43 |
| and now we found our path to the goal, | ▶ 00:45 |
| and it is, in fact, the shortest possible path. | ▶ 00:49 |
| In this case, A-star was able to find the lowest-cost path. | ▶ 00:55 |
| Now the question that you'll have to think about, | ▶ 00:59 |
| because we haven't explained it yet, | ▶ 01:02 |
| is whether A-star will always do this. | ▶ 01:04 |
| Answer yes if you think A-star will always find the shortest cost path, | ▶ 01:06 |
| or answer no if you think it depends on the particular problem given, | ▶ 01:12 |
| or answer no if you think it depends on the particular heuristic estimate function, h. | ▶ 01:17 |
| The answer is that it depends on the h function. | ▶ 00:02 |
| A-star will find the lowest-cost path | ▶ 00:06 |
| if the h function for a state is less than the true cost | ▶ 00:09 |
| of the path to the goal through that state. | ▶ 00:16 |
| In other words, we want the h to never overestimate the distance to the goal. | ▶ 00:20 |
| We also say that h is optimistic. | ▶ 00:26 |
| Another way of stating that | ▶ 00:31 |
| is that h is admissible, | ▶ 00:34 |
| meaning is it admissible to use it to find the lowest-cost path. | ▶ 00:37 |
| Think of all of these of being the same way | ▶ 00:41 |
| of stating the conditions under which A-star finds the lowest-cost path. | ▶ 00:45 |
| Here we give you an intuition as to why | ▶ 00:01 |
| an optimistic heuristic function, h, finds the lowest-cost path. | ▶ 00:03 |
| When A-star ends, it returns a path, p, with estimated cost, c. | ▶ 00:08 |
| It turns out that c is also the actual cost, | ▶ 00:15 |
| because at the goal the h component is 0, | ▶ 00:20 |
| and so the path cost is the total cost as estimated by the function. | ▶ 00:23 |
| Now, all the paths on the front tier | ▶ 00:28 |
| have an estimated cost that's greater than c, | ▶ 00:31 |
| and we know that because the front tier is explored in cheapest-first order. | ▶ 00:35 |
| If h is optimistic, then the estimated cost | ▶ 00:40 |
| is less than the true cost, | ▶ 00:44 |
| so the path p must have a cost that's less than the true cost | ▶ 00:47 |
| of any of the paths on the front tier. | ▶ 00:51 |
| Any paths that go beyond the front tier | ▶ 00:54 |
| must have a cost that's greater than that | ▶ 00:57 |
| because we agree that the step cost is always 0 or more. | ▶ 00:59 |
| So that means that this path, p, must be the minimal cost path. | ▶ 01:04 |
| Now, this argument, I should say, only goes through | ▶ 01:09 |
| as is for tree search. | ▶ 01:13 |
| For graph search the argument is slightly more complicated, | ▶ 01:16 |
| but the general intuitions hold the same. | ▶ 01:19 |
| So far we've looked at the state space of cities in Romania-- | ▶ 00:01 |
| a 2-dimensional, physical space. | ▶ 00:05 |
| But the technology for problem solving through search | ▶ 00:07 |
| can deal with many types of state spaces, | ▶ 00:10 |
| dealing with abstract properties, not just x-y position in a plane. | ▶ 00:12 |
| Here I introduce another state space--the vacuum world. | ▶ 00:17 |
| It's a very simple world in which there are only 2 positions | ▶ 00:21 |
| as opposed to the many positions in the Romania state space. | ▶ 00:25 |
| But there are additional properties to deal with as well. | ▶ 00:30 |
| The robot vacuum cleaner can be in either of the 2 conditions, | ▶ 00:33 |
| but as well as that each of the positions | ▶ 00:36 |
| can either have dirt in it or not have dirt in it. | ▶ 00:40 |
| Now the question is to represent this as a state space | ▶ 00:43 |
| how many states do we need? | ▶ 00:47 |
| The number of states can fill in this box here. | ▶ 00:51 |
| And the answer is there are 8 states. | ▶ 00:01 |
| There are 2 physical states that the robot vacuum cleaner can be in-- | ▶ 00:04 |
| either in state A or in state B. | ▶ 00:10 |
| But in addition to that, there are states about how the world is | ▶ 00:12 |
| as well as where the robot is in the world. | ▶ 00:17 |
| So state A can be dirty or not. | ▶ 00:19 |
| That's 2 possibilities. | ▶ 00:24 |
| And B can be dirty or not. | ▶ 00:26 |
| That's 2 more possibilities. | ▶ 00:28 |
| We multiply those together. We get 8 possible states. | ▶ 00:31 |
| Here is a diagram of the state space for the vacuum world. | ▶ 00:01 |
| Note that there are 8 states, and we have the actions connecting the states | ▶ 00:05 |
| just as we did in the Romania problem. | ▶ 00:09 |
| Now let's look at a path through this state. | ▶ 00:12 |
| Let's say we start out in this position, | ▶ 00:15 |
| and then we apply the action of moving right. | ▶ 00:19 |
| Then we end up in a position where the state of the world looks the same, | ▶ 00:23 |
| except the robot has moved from position 'A' to position 'B'. | ▶ 00:27 |
| Now if we turn on the sucking action, | ▶ 00:32 |
| then we end up in a state where the robot is in the same position | ▶ 00:37 |
| but that position is no longer dirty. | ▶ 00:42 |
| Let's take this very simple vacuum world | ▶ 00:47 |
| and make a slightly more complicated one. | ▶ 00:50 |
| First, we'll say that the robot has a power switch, | ▶ 00:53 |
| which can be in one of three conditions: on, off, or sleep. | ▶ 00:56 |
| Next, we'll say that the robot has a dirt-sensing camera, | ▶ 01:04 |
| and that camera can either be on or off. | ▶ 01:09 |
| Third, this is the deluxe model of robot | ▶ 01:13 |
| in which the brushes that clean up the dust | ▶ 01:16 |
| can be set at 1 of 5 different heights | ▶ 01:19 |
| to be appropriate for whatever level of carpeting you have. | ▶ 01:22 |
| Finally, rather that just having the 2 positions, | ▶ 01:27 |
| we'll extend that out and have 10 positions. | ▶ 01:30 |
| Now the question is how many states are in this state space? | ▶ 01:37 |
| The answer is that the number of states is the cross product | ▶ 00:01 |
| of the numbers of all the variables, since they're each independent, | ▶ 00:05 |
| and any combination can occur. | ▶ 00:08 |
| For the power we have 3 possible positions. | ▶ 00:10 |
| The camera has 2. | ▶ 00:14 |
| The brush height has 5. | ▶ 00:18 |
| The dirt has 2 for each of the 10 positions. | ▶ 00:23 |
| That's 2^10 or 1024. | ▶ 00:28 |
| Then the robot's position can be any of those 10 positions as well. | ▶ 00:33 |
| That works out to 307,200 states in the state space. | ▶ 00:39 |
| Notice how a fairly trivial problem-- | ▶ 00:44 |
| we're only modeling a few variables and only 10 positions-- | ▶ 00:46 |
| works out to a large number of state spaces. | ▶ 00:50 |
| That's why we need efficient algorithms for searching through states spaces. | ▶ 00:52 |
| I want to introduce one more problem that can be solved with search techniques. | ▶ 00:01 |
| This is a sliding blocks puzzle, called a 15 puzzle. | ▶ 00:05 |
| You may have seen something like this. | ▶ 00:08 |
| So there are a bunch of little squares or blocks or tiles | ▶ 00:10 |
| and you can slide them around. | ▶ 00:14 |
| and the goal is to get into a certain configuration. | ▶ 00:19 |
| So we'll say that this is the goal state, where the numbers 1-15 are in order | ▶ 00:21 |
| left to right, top to bottom. | ▶ 00:27 |
| The starting state would be some state where all the positions are messed up. | ▶ 00:29 |
| Now the question is: Can we come up with a good heuristic for this? | ▶ 00:34 |
| Let's examine that as a way of thinking about where heuristics come from. | ▶ 00:38 |
| The first heuristic we're going to consider | ▶ 00:42 |
| we'll call h1, and that is equal to the number of misplaced blocks. | ▶ 00:46 |
| So here 10 and 11 are misplaced because they should be there and there, respectively, | ▶ 00:54 |
| 12 is in the right place, 13 is in the right place, | ▶ 00:59 |
| and 14 and 15 are misplaced. | ▶ 01:02 |
| That's a total of 4 misplaced blocks. | ▶ 01:04 |
| The 2nd heuristic, h2, is equal to | ▶ 01:07 |
| the sum of the distances that each block would have to move to get to the right position. | ▶ 01:13 |
| For this position, 10 would have to move 1 space to get to the right position, | ▶ 01:19 |
| 11 would have to move 1, so that's a total of 2 so far, | ▶ 01:26 |
| 13 is in the right place, | ▶ 01:30 |
| 14 is 1 displaced, | ▶ 01:31 |
| and 15 is 1 displaced, | ▶ 01:33 |
| so that would also be a total of 4. | ▶ 01:35 |
| Now, the question is: Which, if any, of these heuristics are admissible? | ▶ 01:38 |
| Check the boxes next to the heuristics that you think | ▶ 01:44 |
| are admissible. | ▶ 01:47 |
| H1 is admissible, because every tile that's in the wrong position | ▶ 00:02 |
| must be moved at least once to get into the right position. | ▶ 00:07 |
| So h1 never overestimates. | ▶ 00:10 |
| How about h2? | ▶ 00:13 |
| H2 is also admissible, because every tile in the wrong position | ▶ 00:15 |
| can be moved closer to the correct position no faster than 1 space per move. | ▶ 00:20 |
| Therefore, both are admissible. | ▶ 00:26 |
| But notice that h2 is always greater than or equal to h1. | ▶ 00:28 |
| That means that, with the exception of breaking ties, | ▶ 00:33 |
| an A* search using h2 will always expand | ▶ 00:35 |
| fewer paths than one using h1 | ▶ 00:39 |
| Now, we're trying to build an artificial intelligence | ▶ 00:01 |
| that can solve problems like this all on its own. | ▶ 00:04 |
| You can see that the search algorithms do a great job | ▶ 00:08 |
| of finding solutions to problems like this. | ▶ 00:12 |
| But, you might complain that in order for the search algorithms to work, | ▶ 00:15 |
| we had to provide it with a heurstic function. | ▶ 00:19 |
| A heurstic function came from the outside. | ▶ 00:22 |
| You might think that coming up with a good heurstic function is really where all the intelligence is. | ▶ 00:25 |
| So, a problem solver that uses an heurstic function given to it | ▶ 00:30 |
| really isn't intelligent at all. | ▶ 00:34 |
| So let's think about where the intelligence could come from | ▶ 00:36 |
| and can we automatically come up with good heurstic functions. | ▶ 00:39 |
| I'm going to sketch a description of | ▶ 00:45 |
| a program that can automatically come up with good heurstics | ▶ 00:47 |
| given a description of a problem. | ▶ 00:50 |
| Suppose this program is given a description of the sliding blocks puzzle | ▶ 00:52 |
| where we say that a block can move from square A to square B | ▶ 00:57 |
| if A is adjacent to B and B is blank. | ▶ 01:02 |
| Now, imagine that we try to loosen this restriction. | ▶ 01:06 |
| We cross out "B is blank," | ▶ 01:10 |
| and then we get the rule | ▶ 01:14 |
| "a block can move from A to B if A is adjacent to B," | ▶ 01:16 |
| and that's equal to our heurstic h2 | ▶ 01:20 |
| because a block can move anywhere to an adjacent state. | ▶ 01:23 |
| Now, we could also cross out the other part of the rule, | ▶ 01:27 |
| and we now get "a block can move from any square A | ▶ 01:31 |
| to any square B regardless of any condition. | ▶ 01:36 |
| That gives us heurstic h1. | ▶ 01:40 |
| So we see that both of our heurstics can be derived | ▶ 01:43 |
| from a simple mechanical manipulation | ▶ 01:48 |
| of the formal description of the problem. | ▶ 01:50 |
| Once we've generated automatically these candidate heuristics, | ▶ 01:53 |
| another way to come up with a good heurstic is to say | ▶ 01:58 |
| that a new heurstic, h, | ▶ 02:02 |
| is equal to the maximum of h1 and h2, | ▶ 02:04 |
| and that's guaranteed to be admissible as long as | ▶ 02:10 |
| h1 and h2 are admissible | ▶ 02:13 |
| because it still never overestimates, | ▶ 02:16 |
| and it's guaranteed to be better because its getting closer to the true value. | ▶ 02:18 |
| The only problem with combining multiple heuristics like this | ▶ 02:22 |
| is that there is some cause to compute the heuristic | ▶ 02:27 |
| and it could take longer to compute | ▶ 02:29 |
| even if we end up expanding pure paths. | ▶ 02:31 |
| Crossing out parts of the rules like this | ▶ 02:35 |
| is called "generating a relaxed problem." | ▶ 02:38 |
| What we've done is we've taken the original problem, | ▶ 02:41 |
| where it's hard to move squares around, | ▶ 02:44 |
| and made it easier by relaxing one of the constraints. | ▶ 02:46 |
| You can see that as adding new links in the state space, | ▶ 02:49 |
| so if we have a state space in which there are only particular links, | ▶ 02:54 |
| by relaxing the problem it's as if we are adding new operators | ▶ 02:59 |
| that traverse the state in new ways. | ▶ 03:05 |
| So adding new operators only makes the problem easier, | ▶ 03:07 |
| and thus never overestimates, and thus is admissible. | ▶ 03:11 |
| We've seen what search can do for problem solving. | ▶ 00:00 |
| It can find the lowest-cost path to a goal, | ▶ 00:03 |
| and it can do that in a way in which we never generate more paths than we have to. | ▶ 00:06 |
| We can find the optimal number of paths to generate, | ▶ 00:12 |
| and we can do that with a heuristic function that we generate on our own | ▶ 00:15 |
| by relaxing the existing problem definition. | ▶ 00:19 |
| But let's be clear on what search can't do. | ▶ 00:22 |
| All the solutions that we have found consist of a fixed sequence of actions. | ▶ 00:25 |
| In other words, the agent Hirin Arad, thinks, comes up with a plan that it wants to execute | ▶ 00:31 |
| and then essentially closes his eyes and starts driving, | ▶ 00:38 |
| never considering along the way if something has gone wrong. | ▶ 00:42 |
| That works fine for this type of problem, | ▶ 00:46 |
| but it only works when we satisfy the following conditions. | ▶ 00:49 |
| [Problem solving works when:] | ▶ 00:53 |
| Problem-solving technology works when the following set of conditions is true: | ▶ 00:55 |
| First, the domain must be fully observable. | ▶ 00:59 |
| In other words, we must be able to see what initial state we start out with. | ▶ 01:03 |
| Second, the domain must be known. | ▶ 01:08 |
| That is, we have to know the set of available actions to us. | ▶ 01:12 |
| Third, the domain must be discrete. | ▶ 01:16 |
| There must be a finite number of actions to chose from. | ▶ 01:20 |
| Fourth, the domain must be deterministic. | ▶ 01:24 |
| We have to know the result of taking an action. | ▶ 01:28 |
| Finally, the domain must be static. | ▶ 01:32 |
| There must be nothing else in the world that can change the world except our own actions. | ▶ 01:36 |
| If all these conditions are true, then we can search for a plan | ▶ 01:41 |
| which solves the problem and is guaranteed to work. | ▶ 01:44 |
| In later units, we will see what to do if any of these conditions fail to hold. | ▶ 01:47 |
| Our description of the algorithm has talked about paths in the state space. | ▶ 00:01 |
| I want to say a little bit now about how to implement that in terms of a computer algorithm. | ▶ 00:08 |
| We talk about paths, but we want to implement that in some ways. | ▶ 00:15 |
| In the implementation we talk about nodes. | ▶ 00:19 |
| A node is a data structure, and it has four fields. | ▶ 00:22 |
| The state field indicates the state at the end of the path. | ▶ 00:27 |
| The action was the action it took to get there. | ▶ 00:35 |
| The cost is the total cost, | ▶ 00:40 |
| and the parent is a pointer to another node. | ▶ 00:45 |
| In this case, the node that has state "S", | ▶ 00:50 |
| and it will have a parent which points to the node that has state "A", | ▶ 00:56 |
| and that will have a parent pointer that's null. | ▶ 01:06 |
| So we have a linked list of nodes representing the path. | ▶ 01:10 |
| We'll use the word "path" for the abstract idea, | ▶ 01:15 |
| and the word "node" for the representation in the computer memory. | ▶ 01:18 |
| But otherwise, you can think of those two terms as being synonyms, | ▶ 01:22 |
| because they're in a one-to-one correspondence. | ▶ 01:26 |
| Now there are two main data structures that deal with nodes. | ▶ 01:31 |
| We have the "frontier" and we have the "explored" list. | ▶ 01:35 |
| Let's talk about how to implement them. | ▶ 01:41 |
| In the frontier the operations we have to deal with | ▶ 01:44 |
| are removing the best item from the frontier and adding in new ones. | ▶ 01:48 |
| And that suggests we should implement it as a priority queue, | ▶ 01:52 |
| which knows how to keep track of the best items in proper order. | ▶ 01:55 |
| But we also need to have an additional operation | ▶ 01:59 |
| of a membership test as a new item in the frontier. | ▶ 02:03 |
| And that suggests representing it as a set, | ▶ 02:07 |
| which can be built from a hash table or a tree. | ▶ 02:10 |
| So the most efficient implementations of search actually have both representations. | ▶ 02:14 |
| The explored set, on the other hand, is easier. | ▶ 02:20 |
| All we have to do there is be able to add new members and check for membership. | ▶ 02:23 |
| So we represent that as a single set, | ▶ 02:28 |
| which again can be done with either a hash table or tree. | ▶ 02:31 |
| Congratulations. | ▶ 00:00 |
| You just made assignment 1. | ▶ 00:02 |
| This is homework assignment #1. | ▶ 00:00 |
| This is a question about peg solitaire. | ▶ 00:01 |
| In peg solitaire, a single player faces | ▶ 00:04 |
| the following kind of board. | ▶ 00:08 |
| Initially, all pieces are occupied except for the center piece. | ▶ 00:13 |
| You can find more information on peg solitare at the following URL. | ▶ 00:22 |
| [http://en.wikipedia.org/wiki/peg_solitaire] | ▶ 00:26 |
| I wish to know whether this game is partially observable, | ▶ 00:36 |
| Please say yes or no. | ▶ 00:40 |
| I wish to know whether it is stochastic. | ▶ 00:43 |
| Please say yes if it is and no if it's deterministic. | ▶ 00:46 |
| Let me know if it's continuous, yes or no, | ▶ 00:50 |
| and let me know if it's adversarial, yes or no. | ▶ 00:55 |
| >>Peg Solitaire is not partially observable because you can see the board at all times. | ▶ 00:00 |
| It is not stochastic because you just make all the moves, | ▶ 00:06 |
| and they have very different mystic effects. | ▶ 00:09 |
| It is not continuous. It's just finding many choices of actions | ▶ 00:11 |
| and finding many board positions, so therefore, it is not continuous. | ▶ 00:15 |
| and it's not adversarial because there is no adversaries--just you playing. | ▶ 00:18 |
| I am going to ask you about the problem to learn about a loaded coin. | ▶ 00:01 |
| A loaded coin is a coin, | ▶ 00:05 |
| that if you flip it, | ▶ 00:07 |
| might have a non 0.5 chance | ▶ 00:09 |
| of coming up heads or tails. | ▶ 00:13 |
| Fair coins always come up 50% heads or tails. | ▶ 00:16 |
| Loaded coins might come up, for example, | ▶ 00:20 |
| 0.9 chance heads and 0.1 chance tails. | ▶ 00:23 |
| Your task will be to understand, | ▶ 00:27 |
| from coin flips, | ▶ 00:30 |
| whether a coin is loaded, | ▶ 00:31 |
| and if so, at what probability. | ▶ 00:33 |
| I don't want you to solve the problem, | ▶ 00:35 |
| but I want you to answer the following questions: | ▶ 00:37 |
| Is it partially observable? | ▶ 00:40 |
| Yes or no. | ▶ 00:42 |
| Is it stochastic? | ▶ 00:44 |
| Yes or no. | ▶ 00:46 |
| Is it continuous? [Yes or no.] | ▶ 00:48 |
| And finally, is it adversarial? | ▶ 00:51 |
| Yes or no. | ▶ 00:53 |
| [Thrun] So the loaded coin example is clearly partially observable, | ▶ 00:00 |
| and the reason is it is actually used for the memory | ▶ 00:06 |
| if you flip it more than 1 time so you can learn more about what the actual probability is. | ▶ 00:09 |
| Therefore, looking at the most recent coin flip is insufficient to make your choice. | ▶ 00:14 |
| It is stochastic because you flip a coin. | ▶ 00:20 |
| It is not continuous because there's only 1 action--a flip--and 2 outcomes. | ▶ 00:25 |
| And it isn't really adversarial because while you do your learning task | ▶ 00:31 |
| no adversary interferes. | ▶ 00:36 |
| Let's talk about the problem of finding a path through a maze. | ▶ 00:00 |
| Let me draw you a maze. | ▶ 00:05 |
| Suppose you wish to find the path from the start to your goal. | ▶ 00:10 |
| I don't want to you to solve this problem. | ▶ 00:15 |
| Rather I want you to tell me whether it's partially observable. | ▶ 00:19 |
| Yes or no. | ▶ 00:23 |
| It is stochastic? | ▶ 00:25 |
| Yes or no. | ▶ 00:27 |
| Is it continuous? | ▶ 00:29 |
| Yes or no. | ▶ 00:31 |
| [Thrun] The path through the maze is clearly not partially observable | ▶ 00:00 |
| because you can see the maze entirely at all times. | ▶ 00:03 |
| It is not stochastic. There is no randomness involved. | ▶ 00:06 |
| It isn't really continuous. | ▶ 00:10 |
| There's typically just finitely many choices--go left or right. | ▶ 00:12 |
| And it isn't adversarial because there's no real adversary involved. | ▶ 00:15 |
| This is a search question. | ▶ 00:00 |
| Suppose we are given the following search tree. | ▶ 00:02 |
| We are searching from the top, the start node, | ▶ 00:05 |
| to the goal, which is over here. | ▶ 00:08 |
| Assume we expand from left to right. | ▶ 00:12 |
| Tell me how many nodes are expanded | ▶ 00:17 |
| if we expand from left to right, | ▶ 00:20 |
| counting the start node and the goal node in your answer. | ▶ 00:23 |
| And give me the same answer for Depth First Search. | ▶ 00:27 |
| Now, let's assume you're going to search from right to left. | ▶ 00:32 |
| How many nodes would we now expand in Breadth First Search, | ▶ 00:35 |
| and how many do we expand in Depth First Search? | ▶ 00:39 |
| [Thrun] Breadth first from left to right is 6-- | ▶ 00:00 |
| 1, 2, 3, 4, 5, 6. | ▶ 00:03 |
| Depth first from left to right is 4--1, 2, 3, 4. | ▶ 00:07 |
| Breadth first searched from right to left is 9-- | ▶ 00:15 |
| 1, 2, 3, 4, 5, 6, 7, 8, 9. | ▶ 00:19 |
| And depth first from right to left is 9-- | ▶ 00:25 |
| 1, 2, 3, 4, 5, 6, 7, 8, 9. | ▶ 00:28 |
| Another search problem-- | ▶ 00:00 |
| Consider the following search tree, | ▶ 00:03 |
| where this is the start node. | ▶ 00:08 |
| Now, assume we search from left to right. | ▶ 00:12 |
| I would like you to tell me the number of nodes expanded from Breadth-First Search | ▶ 00:15 |
| and Depth-First Search. | ▶ 00:19 |
| Please do count the start and the goal node, | ▶ 00:22 |
| and please give me the same numbers for Right-to-Left Search, | ▶ 00:25 |
| for Breadth-First, and Depth-First. | ▶ 00:28 |
| [Thrun] The correct answer for breadth first left to right is 13-- | ▶ 00:00 |
| 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13. | ▶ 00:05 |
| And for depth first it is 10-- | ▶ 00:13 |
| 1, 2, 3, 4, 5, 6, 7, 8, 9, and 10. | ▶ 00:17 |
| For right to left search, the right answer for breadth first is 11-- | ▶ 00:28 |
| 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11. | ▶ 00:32 |
| And for depth first the right answer is 7-- | ▶ 00:38 |
| 1, 2, 3, 4, 5, 6, 7. | ▶ 00:42 |
| This is another search problem. | ▶ 00:00 |
| Let's assume we have a search graph. | ▶ 00:04 |
| It isn't quite a tree but looks like this. | ▶ 00:07 |
| Obviously in the structure we can reach nodes through multiple paths. | ▶ 00:13 |
| So let's assume that our search never expands the same node twice. | ▶ 00:18 |
| Let's also assume this start node is on top. We search down. | ▶ 00:22 |
| And this over here is our goal node. | ▶ 00:27 |
| So left-to-right search, tell me how many nodes | ▶ 00:30 |
| breadth first would expand--do count the start and goal node in the final answer. | ▶ 00:35 |
| Give me the same result for a depth-first search. | ▶ 00:43 |
| Again counting the start and the goal node in your answer. | ▶ 00:48 |
| And again give me your answer for breadth-first | ▶ 00:51 |
| and for depth-first in the right-to-left search paradigm. | ▶ 00:54 |
| [Thrun] The right answer over here is 10 for breadth first from left to right-- | ▶ 00:00 |
| 1, 2, 3, 4, 5, 6, 7, 8, 9, 10. | ▶ 00:05 |
| Depth first is 16, or all nodes-- | ▶ 00:11 |
| 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16. | ▶ 00:15 |
| And notice how I never expanded a node twice. | ▶ 00:30 |
| Correct answer for breadth first right to left is 7-- | ▶ 00:34 |
| 1, 2, 3, 4, 5, 6, 7. | ▶ 00:38 |
| And the correct answer for depth first from right to left is 4--1, 2, 3, and 4. | ▶ 00:43 |
| Let's talk about a star search. | ▶ 00:00 |
| Let's assume we have the following grid. | ▶ 00:03 |
| The start state is right here. | ▶ 00:08 |
| And the goal state is right here. | ▶ 00:13 |
| And just for convenience, I will give each here a little number. | ▶ 00:16 |
| A. B. C. D. | ▶ 00:22 |
| Let me draw a heuristic function. | ▶ 00:26 |
| Please take a look for a moment | ▶ 00:30 |
| and tell me whether this heuristic function is admissable. | ▶ 00:32 |
| Check here if yes and here if no. | ▶ 00:38 |
| Which one is the first node a star would expand? | ▶ 00:41 |
| B1 or A2? | ▶ 00:46 |
| What's the second node to expand? | ▶ 00:51 |
| B1, C1, A2, A3, or B2? | ▶ 00:56 |
| And finally, what is the third node to expand? | ▶ 01:06 |
| D1, C2, B3, or A4? | ▶ 01:10 |
| [Thrun] Clearly this is an admissable heuristic because the distance to the goal | ▶ 00:00 |
| is strictly underestimated. | ▶ 00:05 |
| From here it would take 1 step, | ▶ 00:07 |
| from here it will take 1, 2 steps, so the answer is yes. | ▶ 00:09 |
| Now, to understand A*, let me also draw the g function | ▶ 00:15 |
| for development part of this table. | ▶ 00:22 |
| Clearly g is 0 over here. | ▶ 00:24 |
| To understand which node to expand, this one or this one, | ▶ 00:27 |
| let's project the g function, which is 1, | ▶ 00:31 |
| and we will see that 3 plus 1 is smaller than 4 plus 1; | ▶ 00:34 |
| therefore, this is the second node to expand, which is b1. | ▶ 00:40 |
| Now let me for the next step explain the g function from this guy here, 2 and 2. | ▶ 00:47 |
| So 2 plus 2 is 4 versus 3 plus 2 is 5, so we expand this node next, which is c1. | ▶ 00:55 |
| And finally, the g function from here would go 3 and 3. | ▶ 01:08 |
| 3 plus 1 is better than 3 plus 2, so we would expand d1 next. | ▶ 01:14 |
| And notice how in the sum of g and h, | ▶ 01:24 |
| this node over here, which has a total of 4, is better than any other node that is unexpanded. | ▶ 01:29 |
| So in particular, 4 plus 1 is 5, and 3 plus 2 is 5 as well, | ▶ 01:35 |
| and 2 plus 3 is 5 as well, so this is the next one to expand. | ▶ 01:40 |
| So the next units will be concerned with probabilities | ▶ 00:00 |
| and particularly with structured probabilities using Bayes networks. | ▶ 00:03 |
| This is some of the most involved material in this class. | ▶ 00:08 |
| And since this is a Stanford level class, | ▶ 00:12 |
| you will find out that some of the quizzes are actually really hard. | ▶ 00:14 |
| So as you go through the material, I hope the hardness of the quizzes won't discourage you; | ▶ 00:18 |
| it'll really entice you to take a piece of paper and a pen and work them out. | ▶ 00:23 |
| Let me give you a flavor of a Bayes network using an example. | ▶ 00:30 |
| Suppose you find in the morning that your car won't start. | ▶ 00:35 |
| Well, there's many causes why your car might not start. | ▶ 00:39 |
| One is that your battery is flat. | ▶ 00:43 |
| Even for a flat battery there is multiple causes. | ▶ 00:46 |
| One, it's just plain dead, | ▶ 00:50 |
| and one is that the battery is okay but it's not charging. | ▶ 00:52 |
| The reason why a battery might not charge is that the alternator might be broken | ▶ 00:55 |
| or the fan belt might be broken. | ▶ 01:01 |
| If you look at this influence diagram, also called a Bayes network, | ▶ 01:03 |
| you'll find there's many different ways to explain that the car won't start. | ▶ 01:07 |
| And a natural question you might have is, "Can we diagnose the problem?" | ▶ 01:12 |
| One diagnostic tool is a battery meter, | ▶ 01:17 |
| which may increase or decrease your belief that the battery may cause your car failure. | ▶ 01:20 |
| You might also know your battery age. | ▶ 01:26 |
| Older batteries tend to go dead more often. | ▶ 01:29 |
| And there's many other ways to look at reasons why the car might not start. | ▶ 01:31 |
| You might inspect the lights, the oil light, the gas gauge. | ▶ 01:37 |
| You might even dip into the engine to see what the oil level is with a dipstick. | ▶ 01:43 |
| All of those relate to alternative reasons why the car might not be starting, | ▶ 01:48 |
| like no oil, no gas, the fuel line might be blocked, or the starter may be broken. | ▶ 01:52 |
| And all of these can influence your measurements, | ▶ 01:59 |
| like the oil light or the gas gauge, in different ways. | ▶ 02:04 |
| For example, the battery flat would have an effect on the lights. | ▶ 02:07 |
| It might have an effect on the oil light and on the gas gauge, | ▶ 02:12 |
| but it won't really affect the oil you measure with the dipstick. | ▶ 02:16 |
| That is affected by the actual oil level, which also affects the oil light. | ▶ 02:20 |
| Gas will affect the gas gauge, and of course without gas the car doesn't start. | ▶ 02:26 |
| So this is a complicated structure that really describes one way to understand | ▶ 02:32 |
| how a car doesn't start. | ▶ 02:39 |
| A car is a complex system. | ▶ 02:41 |
| It has lots of variables you can't really measure immediately, | ▶ 02:43 |
| and it has sensors which allow you to understand a little bit about the state of the car. | ▶ 02:46 |
| What the Bayes network does, | ▶ 02:52 |
| it really assists you in reasoning from observable variables, like the car won't start | ▶ 02:54 |
| and the value of the dipstick, to hidden causes, like is the fan belt broken | ▶ 03:01 |
| or is the battery dead. | ▶ 03:06 |
| What you have here is a Bayes network. | ▶ 03:09 |
| A Bayes network is composed of nodes. | ▶ 03:13 |
| These nodes correspond to events that you might or might not know | ▶ 03:15 |
| that are typically called random variables. | ▶ 03:21 |
| These nodes are linked by arcs, and the arcs suggest that a child of an arc | ▶ 03:24 |
| is influenced by its parent but not in a deterministic way. | ▶ 03:31 |
| It might be influenced in a probabilistic way, which means an older battery, for example, | ▶ 03:35 |
| has a higher chance of causing the battery to be dead, | ▶ 03:41 |
| but it's not clear that every old battery is dead. | ▶ 03:45 |
| There is a total of 16 variables in this Bayes network. | ▶ 03:48 |
| What the graph structure and associated probabilities specify | ▶ 03:53 |
| is a huge probability distribution in the space of all of these 16 variables. | ▶ 03:59 |
| If they are all binary, which we'll assume throughout this unit, | ▶ 04:06 |
| they can take 2 to the 16th different values, which is a lot. | ▶ 04:10 |
| The Bayes network, as we find out, is a complex representation | ▶ 04:15 |
| of a distribution over this very, very large joint probability distribution of all of these variables. | ▶ 04:18 |
| Further, once we specify the Bayes network, | ▶ 04:26 |
| we can observe, for example, the car won't start. | ▶ 04:29 |
| We can observe things like the oil light and the lights and the battery meter | ▶ 04:33 |
| and then compute probabilities of the hypothesis, like the alternator is broken | ▶ 04:37 |
| or the fan belt is broken or the battery is dead. | ▶ 04:41 |
| So in this class we're going to talk about how to construct this Bayes network, | ▶ 04:45 |
| what the semantics are, and how to reason in this Bayes network | ▶ 04:50 |
| to find out about variables we can't observe, like whether the fan belt is broken or not. | ▶ 04:56 |
| That's an overview. | ▶ 05:02 |
| Throughout this unit I am going to assume that every event is discrete-- | ▶ 05:04 |
| in fact, it's binary. | ▶ 05:08 |
| We'll start with some consideration of basic probability, | ▶ 05:10 |
| we'll work our way into some simple Bayes networks, | ▶ 05:14 |
| we'll talk about concepts like conditional independence | ▶ 05:19 |
| and then define Bayes networks more generally, | ▶ 05:23 |
| move into concepts like D-separation and start doing parameter counts. | ▶ 05:26 |
| Later on, Peter will tell you about inference in Bayes networks. | ▶ 05:32 |
| So we won't do this in this class. | ▶ 05:36 |
| I can't overemphasize how important this class is. | ▶ 05:38 |
| Bayes networks are used extensively in almost all fields of smart computer system, | ▶ 05:43 |
| in diagnostics, for prediction, for machine learning, and fields like finance, | ▶ 05:49 |
| inside Google, in robotics. | ▶ 05:57 |
| Bayes networks are also the building blocks of more advanced AI techniques | ▶ 06:00 |
| such as particle filters, hidden Markov models, MDPs and POMDPs, | ▶ 06:05 |
| Kalman filters, and many others. | ▶ 06:12 |
| These are words that don't sound familiar quite yet, | ▶ 06:14 |
| but as you go through the class, I can promise you you will get to know what they mean. | ▶ 06:18 |
| So let's start now at the very, very basics. | ▶ 06:22 |
| [Thrun] So let's talk about probabilities. | ▶ 00:00 |
| Probabilities are the cornerstone of artificial intelligence. | ▶ 00:02 |
| They are used to express uncertainty, | ▶ 00:05 |
| and the management of uncertainty is really key to many, many things in AI | ▶ 00:08 |
| such as machine learning and Bayes network inference | ▶ 00:12 |
| and filtering and robotics and computer vision and so on. | ▶ 00:16 |
| So I'm going to start with some very basic questions, | ▶ 00:21 |
| and we're going to work our way up from there. | ▶ 00:24 |
| Here is a coin. | ▶ 00:26 |
| The coin can come up heads or tails, and my question is the following: | ▶ 00:28 |
| Suppose the probability for heads is 0.5. | ▶ 00:32 |
| What's the probability for it coming up tails? | ▶ 00:38 |
| [Thrun] So the right answer is a half, or 0.5, | ▶ 00:00 |
| and the reason is the coin can only come up heads or tails. | ▶ 00:03 |
| We know that it has to be either one. | ▶ 00:07 |
| Therefore, the total probability of both coming up is 1. | ▶ 00:10 |
| So if half of the probability is assigned to heads, then the other half is assigned to tail. | ▶ 00:14 |
| [Thrun] Let me ask my next quiz. | ▶ 00:00 |
| Suppose the probability of heads is a quarter, 0.25. | ▶ 00:02 |
| What's the probability of tail? | ▶ 00:06 |
| [Thrun] And the answer is 3/4. | ▶ 00:00 |
| It's a loaded coin, and the reason is, well, | ▶ 00:02 |
| each of them come up with a certain probability. | ▶ 00:05 |
| The total of those is 1. The quarter is claimed by heads. | ▶ 00:08 |
| Therefore, 3/4 remain for tail, which is the answer over here. | ▶ 00:12 |
| [Thrun] Here's another quiz. | ▶ 00:00 |
| What's the probability that the coin comes up heads, heads, heads, three times in a row, | ▶ 00:02 |
| assuming that each one of those has a probability of a half | ▶ 00:08 |
| and that these coin flips are independent? | ▶ 00:12 |
| [Thrun] And the answer is 0.125. | ▶ 00:00 |
| Each head has a probability of a half. | ▶ 00:04 |
| We can multiply those probabilities because they are independent events, | ▶ 00:06 |
| and that gives us 1 over 8 or 0.125. | ▶ 00:10 |
| [Thrun] Now let's flip the coin 4 times, and let's call Xi the result of the i-th coin flip. | ▶ 00:00 |
| So each Xi is going to be drawn from heads or tail. | ▶ 00:11 |
| What's the probability that all 4 of those flips give us the same result, | ▶ 00:16 |
| no matter what it is, assuming that each one of those has identically | ▶ 00:22 |
| an equally distributed probability of coming up heads of the half? | ▶ 00:26 |
| [Thrun] And the answer is, well, there's 2 ways that we can achieve this. | ▶ 00:00 |
| One is the all heads and one is all tails. | ▶ 00:04 |
| You already know that 4 times heads is 1/16, | ▶ 00:06 |
| and we know that 4 times tail is also 1/16. | ▶ 00:10 |
| These are completely independent events. | ▶ 00:13 |
| The probability of either one occurring is 1/16 plus 1/16, which is 1/8, which is 0.125. | ▶ 00:15 |
| [Thrun] So here's another one. | ▶ 00:00 |
| What's the probability that within the set of X1, X2, X3, and X4 | ▶ 00:02 |
| there are at least three heads? | ▶ 00:07 |
| [Thrun] And the solution is let's look at different sequences | ▶ 00:00 |
| in which head occurs at least 3 times. | ▶ 00:03 |
| It could be head, head, head, head, in which it comes 4 times. | ▶ 00:06 |
| It could be head, head, head, tail and so on, all the way to tail, head, head, head. | ▶ 00:10 |
| There's 1, 2, 3, 4, 5 of those outcomes. | ▶ 00:16 |
| Each of them has a 16th for probability, so it's 5 times a 16th, which is 0.3125. | ▶ 00:19 |
| [Thrun] So we just learned a number of things. | ▶ 00:00 |
| One is about complementary probability. | ▶ 00:02 |
| If an event has a certain probability, p, | ▶ 00:05 |
| the complementary event has the probability 1-p. | ▶ 00:08 |
| We also learned about independence. | ▶ 00:13 |
| If 2 random variables, X and Y, are independent, | ▶ 00:15 |
| which you're going to write like this, | ▶ 00:19 |
| that means the probability of the joint that any 2 variables can assume | ▶ 00:21 |
| is the product of the marginals. | ▶ 00:26 |
| So rather than asking the question, "What is the probability | ▶ 00:30 |
| "for any combination that these 2 coins or maybe 5 coins could have taken?" | ▶ 00:34 |
| we can now look at the probability of each coin individually, | ▶ 00:40 |
| look at its probability and just multiply them up. | ▶ 00:42 |
| [Thrun] So let me ask you about dependence. | ▶ 00:00 |
| Suppose we flip 2 coins. | ▶ 00:03 |
| Our first coin is a fair coin, and we're going to denote the outcome by X1. | ▶ 00:05 |
| So the chance of X1 coming up heads is half. | ▶ 00:12 |
| But now we branch into picking a coin based on the first outcome. | ▶ 00:15 |
| So if the first outcome was heads, | ▶ 00:20 |
| you pick a coin whose probability of coming up heads is going to be 0.9. | ▶ 00:23 |
| The way I word this is by conditional probability, | ▶ 00:28 |
| probability of the second coin flip coming up heads | ▶ 00:32 |
| provided that or given that X1, the first coin flip, was heads, is 0.9. | ▶ 00:35 |
| The first coin flip might also come up tails, | ▶ 00:41 |
| in which case I pick a very different coin. | ▶ 00:44 |
| In this case I pick a coin which with 0.8 probability will once again give me tails, | ▶ 00:47 |
| conditioned on the first coin flip coming up tails. | ▶ 00:54 |
| So my question for you is, | ▶ 00:57 |
| what's the probability of the second coin flip coming up heads? | ▶ 00:59 |
| [Thrun] The answer is 0.55. | ▶ 00:00 |
| The way to compute this is by the theorem of total probability. | ▶ 00:04 |
| Probability of X2 equals heads. | ▶ 00:08 |
| There's 2 ways I can get to this outcome. | ▶ 00:12 |
| One is via this path over here, and one is via this path over here. | ▶ 00:15 |
| Let me just write both of them down. | ▶ 00:18 |
| So first of all, it could be the probability of X2 equals heads | ▶ 00:20 |
| given that and I will assume X1 was head already. | ▶ 00:26 |
| Now I have to add the complementary event. | ▶ 00:30 |
| Suppose X1 came up tails. | ▶ 00:32 |
| Then I can ask the question, what is the probability that X2 comes up heads regardless, | ▶ 00:35 |
| even though X1 was tails? | ▶ 00:40 |
| Plugging in the numbers gives us the following. | ▶ 00:42 |
| This one over here is 0.9 times a half. | ▶ 00:44 |
| The probability of tails is 0.8, | ▶ 00:49 |
| thereby my head probability becomes 1 minus 0.8, which is 0.2. | ▶ 00:51 |
| Adding all of this together gives me 0.45 plus 0.1, | ▶ 00:58 |
| which is exactly 0.55. | ▶ 01:03 |
| So, we actually just learned some interesting lessons. | ▶ 00:00 |
| The probability of any random variable Y can be written as | ▶ 00:02 |
| probability of Y given that some other random variable X assumes value i | ▶ 00:08 |
| times probability of X equals i, | ▶ 00:13 |
| sums over all possible outcomes i for the (inaudible) variable X. | ▶ 00:17 |
| This is called total probability. | ▶ 00:22 |
| The second thing we learned has to do with negation of probabilities. | ▶ 00:24 |
| We found that probability of not X given Y is 1 minus probability of X given Y. | ▶ 00:27 |
| Now, you might be tempted to say "What about the probability of X given not Y?" | ▶ 00:37 |
| "Is this the same as 1 minus probability of X given Y?" | ▶ 00:43 |
| And the answer is absolutely no. | ▶ 00:51 |
| That's not the case. | ▶ 00:54 |
| If you condition on something that has a certain probability value, | ▶ 00:56 |
| you can take the event you're looking at and negate this, | ▶ 01:00 |
| but you can never negate your conditional variable | ▶ 01:03 |
| and assume these values add up to 1. | ▶ 01:05 |
| We assume there is sometimes sunny days and sometimes rainy days, | ▶ 00:00 |
| and on day 1, which we're going to call D1, | ▶ 00:06 |
| the probability of sunny is 0.9. | ▶ 00:09 |
| And then let's assume that a sunny day follows a sunny day with 0.8 chance, | ▶ 00:13 |
| and a rainy day follows a sunny day with--well-- | ▶ 00:20 |
| Well, the correct answer is 0.2, which is a negation of this event over here. | ▶ 00:00 |
| A sunny day follows a rainy day with 0.6 chance, | ▶ 00:00 |
| and a rainy day follows a rainy day-- | ▶ 00:06 |
| please give me your number. | ▶ 00:11 |
| 0.4 | ▶ 00:00 |
| So, what are the chances that D2 is sunny? | ▶ 00:00 |
| Suppose the same dynamics apply from D2 to D3, | ▶ 00:03 |
| so just replace D3 over here with D2s over there. | ▶ 00:06 |
| That means the transition probabilities from one day to the next remain the same. | ▶ 00:10 |
| Tell me, what's the probability that D3 is sunny? | ▶ 00:14 |
| So, the correct answer over here is 0.78, | ▶ 00:00 |
| and over here it's 0.756. | ▶ 00:04 |
| To get there, let's complete this one first. | ▶ 00:10 |
| The probability of D2 = sunny. | ▶ 00:13 |
| Well, we know there's a 0.9 chance it's sunny on D1, | ▶ 00:16 |
| and then if it is sunny, we know it stays sunny with a 0.8 chance. | ▶ 00:21 |
| So, we multiply these 2 things together, and we get 0.72. | ▶ 00:25 |
| We know there's a 0.1 chance of it being rainy on day 1, which is the complement, | ▶ 00:29 |
| but if it's rainy, we know it switches to sunny with 0.6 chance, | ▶ 00:33 |
| so you multiply these 2 things, and you get 0.06. | ▶ 00:37 |
| Adding those two up equals 0.78. | ▶ 00:41 |
| Now, for the next day, we know our prior for sunny is 0.78. | ▶ 00:46 |
| If it is sunny, it stays sunny with 0.8 probability. | ▶ 00:51 |
| Multiplying these 2 things gives us 0.624. | ▶ 00:55 |
| We know it's rainy with 0.2 chance, which is the complement of 0.78, | ▶ 01:01 |
| but a 0.6 chance if it was (inaudible) sunny. | ▶ 01:07 |
| But if you multiply those, 0.132. | ▶ 01:10 |
| Adding those 2 things up gives us 0.756. | ▶ 01:14 |
| So, to some extents, it's tedious to compute these values, | ▶ 01:19 |
| but they can be perfectly computed, as shown here. | ▶ 01:23 |
| Next example is a cancer example. | ▶ 00:00 |
| Suppose there's a specific type of cancer which exists for 1% of the population. | ▶ 00:05 |
| I'm going to write this as follows. | ▶ 00:11 |
| You can probably tell me now what the probability of not having this cancer is. | ▶ 00:13 |
| And yes, the answer is 0.99. | ▶ 00:00 |
| Let's assume there's a test for this cancer, | ▶ 00:04 |
| which gives us probabilistically an answer whether we have this cancer or not. | ▶ 00:07 |
| So, let's say the probability of a test being positive, as indicated by this + sign, | ▶ 00:12 |
| given that we have cancer, is 0.9. | ▶ 00:18 |
| The probability of the test coming out negative if we have the cancer is--you name it. | ▶ 00:22 |
| 0.1, which is the difference between 1 and 0.9. | ▶ 00:00 |
| Let's assume the probability of the test coming out positive | ▶ 00:06 |
| given that we don't have this cancer is 0.2. | ▶ 00:11 |
| In other words, the probability of the test correctly saying | ▶ 00:15 |
| we don't have the cancer if we're cancer free is 0.8. | ▶ 00:19 |
| Now, ultimately, I'd like to know what's the probability | ▶ 00:24 |
| they have this cancer given they just received a single, positive test? | ▶ 00:28 |
| Before I do this, please help me filling out some other probabilities | ▶ 00:35 |
| that are actually important. | ▶ 00:39 |
| Specifically, the joint probabilities. | ▶ 00:41 |
| The probability of a positive test and having cancer. | ▶ 00:45 |
| The probability of a negative test and having cancer, | ▶ 00:51 |
| and this is not conditional anymore. | ▶ 00:53 |
| It's now a joint probability. | ▶ 00:55 |
| So, please give me those 4 values over here. | ▶ 00:57 |
| And here the correct answer is 0.009, | ▶ 00:00 |
| which is the product of your prior, 0.01, times the conditional, 0.9. | ▶ 00:05 |
| Over here we get 0.001, the probability of our prior cancer times 0.1. | ▶ 00:12 |
| Over here we get 0.198, | ▶ 00:21 |
| the probability of not having cancer is 0.99 | ▶ 00:26 |
| times still getting a positive reading, which is 0.2. | ▶ 00:29 |
| And finally, we get 0.792, | ▶ 00:32 |
| which is the probability of this guy over here, and this guy over here. | ▶ 00:37 |
| Now, our next quiz, I want you to fill in the probability of | ▶ 00:00 |
| the cancer given that we just received a positive test. | ▶ 00:04 |
| And the correct answer is 0.043. | ▶ 00:00 |
| So, even though I received a positive test, | ▶ 00:06 |
| my probability of having cancer is just 4.3%, | ▶ 00:09 |
| which is not very much given that the test itself is quite sensitive. | ▶ 00:14 |
| It really gives me a 0.8 chance of getting a negative result if I don't have cancer. | ▶ 00:18 |
| It gives me a 0.9 chance of detecting cancer given that I have cancer. | ▶ 00:26 |
| Now, what comes (inaudible) small? | ▶ 00:32 |
| Well, let's just put all the cases together. | ▶ 00:35 |
| You already know that we received a positive test. | ▶ 00:38 |
| Therefore, this entry over here, and this entry over here are relevant. | ▶ 00:41 |
| Now, the chance of having a positive test and having cancer is 0.009. | ▶ 00:47 |
| Well, I might--when I receive a positive test--have cancer or not cancer, | ▶ 00:56 |
| so we will just normalize by these 2 possible causes for the positive test, | ▶ 01:01 |
| which is 0.009 + 0.198. | ▶ 01:06 |
| We know both these 2 things together gets 0.009 over 0.207, | ▶ 01:11 |
| which is approximately 0.043. | ▶ 01:20 |
| Now, the interesting thing in this equation is that the chances | ▶ 01:23 |
| of having seen a positive test result in the absence of cancers | ▶ 01:28 |
| are still much, much higher than the chance of seeing a positive result | ▶ 01:32 |
| in the presence of cancer, and that's because our prior for cancer | ▶ 01:35 |
| is so small in the population that it's just very unlikely to have cancer. | ▶ 01:39 |
| So, the additional information of a positive test | ▶ 01:44 |
| only erased my posterior probability to 0.043. | ▶ 01:47 |
| So, we've just learned about what's probably the most important | ▶ 00:00 |
| piece of math for this class in statistics called Bayes Rule. | ▶ 00:03 |
| It was invented by Reverend Thomas Bayes, who was a British mathematician | ▶ 00:09 |
| and a Presbyterian minister in the 18th century. | ▶ 00:15 |
| Bayes Rule is usually stated as follows: P of A given B where B is the evidence | ▶ 00:18 |
| and A is the variable we care about is P of B given A times P of A over P of B. | ▶ 00:27 |
| This expression is called the likelihood. | ▶ 00:36 |
| This is called the prior, and this is called marginal likelihood. | ▶ 00:40 |
| The expression over here is called the posterior. | ▶ 00:46 |
| The interesting thing here is the way the probabilities are reworded. | ▶ 00:50 |
| Say we have evidence B. | ▶ 00:55 |
| We know about B, but we really care about the variable A. | ▶ 00:57 |
| So, for example, B is a test result. | ▶ 01:01 |
| We don't care about the test result as much as we care about the fact | ▶ 01:03 |
| whether we have cancer or not. | ▶ 01:06 |
| This diagnostic reasoning--which is from evidence to its causes-- | ▶ 01:08 |
| is turned upside down by Bayes Rule into a causal reasoning, | ▶ 01:16 |
| which is given--hypothetically, if we knew the cause, | ▶ 01:22 |
| what would be the probability of the evidence we just observed. | ▶ 01:27 |
| But to correct for this inversion, we have to multiply | ▶ 01:31 |
| by the prior of the cause to be the case in the first place, | ▶ 01:36 |
| in this case, having cancer or not, | ▶ 01:40 |
| and divide it by the probability of the evidence, P(B), | ▶ 01:42 |
| which often is expanded using the theorem of total probability as follows. | ▶ 01:47 |
| The probability of B is a sum over all probabilities of B | ▶ 01:52 |
| conditional on A, lower caps a, times the probability of A equals lower caps a. | ▶ 01:58 |
| This is total probability as we already encountered it. | ▶ 02:04 |
| So, let's apply this to the cancer case | ▶ 02:08 |
| and say we really care about whether you have cancer, | ▶ 02:10 |
| which is our cause, conditioned on the evidence | ▶ 02:13 |
| that is the result of this hidden cause, in this case, a positive test result. | ▶ 02:17 |
| Let's just plug in the numbers. | ▶ 02:23 |
| Our likelihood is the probability of seeing a positive test result | ▶ 02:25 |
| given that you have cancer multiplied by the prior probability | ▶ 02:30 |
| of having cancer over the probability of the positive test result, | ▶ 02:33 |
| and that is--according to the tables we looked at before-- | ▶ 02:38 |
| 0.9 times a prior of 0.01 over-- | ▶ 02:43 |
| now we're going to expand this right over here according to total probability | ▶ 02:50 |
| which gives us 0.9 times 0.01. | ▶ 02:55 |
| That's the probability of + given that we do have cancer. | ▶ 03:01 |
| So, the probability of + given that we don't have cancer is 0.2, | ▶ 03:06 |
| but the prior here is 0.99. | ▶ 03:11 |
| So, if we plug in the numbers we know about, we get 0.009 | ▶ 03:15 |
| over 0.009 + 0.198. | ▶ 03:20 |
| That is approximately 0.0434, which is the number we saw before. | ▶ 03:27 |
| So, if you want to draw Bayes rule graphically, | ▶ 00:00 |
| we have a situation where we have an internal variable A, | ▶ 00:03 |
| like whether I'm going to die of cancer, but we can't sense A. | ▶ 00:08 |
| Instead, we have a second variable, called B, | ▶ 00:13 |
| which is our test, and B is observable, but A isn't. | ▶ 00:16 |
| This is a classical example of a Bayes network. | ▶ 00:21 |
| The Bayes network is composed of 2 variables, A and B. | ▶ 00:26 |
| We know the prior probability for A, | ▶ 00:30 |
| and we know the conditional. | ▶ 00:33 |
| A causes B--whether or not we have cancer, | ▶ 00:35 |
| causes the test result to be positive or not, | ▶ 00:38 |
| although there was some randomness involved. | ▶ 00:41 |
| So, we know what the probability of B given the different values for A, | ▶ 00:44 |
| and what we care about in this specific instance is called diagnostic reasoning, | ▶ 00:49 |
| which is the inverse of the causal reasoning, | ▶ 00:54 |
| the probability of A given B or similarly, probability of A given not B. | ▶ 00:58 |
| This is our very first Bayes network, and the graphical representation | ▶ 01:06 |
| of drawing 2 variables, A and B, connected with an arc | ▶ 01:11 |
| that goes from A to B is the graphical representation of a distribution | ▶ 01:15 |
| of 2 variables that are specified in the structure over here, | ▶ 01:22 |
| which has a prior probability and has a conditional probability as shown over here. | ▶ 01:26 |
| Now, I do have a quick quiz for you. | ▶ 01:31 |
| How many parameters does it take to specify | ▶ 01:34 |
| the entire joint probability within A and B, or differently, the entire Bayes network? | ▶ 01:37 |
| I'm not looking for structural parameters that relate to the graph over here. | ▶ 01:43 |
| I'm just looking for the numerical parameters of the underlying probabilities. | ▶ 01:48 |
| And the answer is 3. | ▶ 00:00 |
| It takes 1 parameter to specify P of A from which we can derive P of not A. | ▶ 00:02 |
| It takes 2 parameters to specify P of B given A and P given not A, | ▶ 00:09 |
| from which we can derive P not B given A and P of not B given not A. | ▶ 00:15 |
| So, it's a total of 3 parameters for this Bayes network. | ▶ 00:21 |
| So, we just encountered our very first Bayes network | ▶ 00:00 |
| and did a number of interesting calculations. | ▶ 00:03 |
| Let's now talk about Bayes Rule and look into more complex Bayes networks. | ▶ 00:06 |
| I will look at Bayes Rule again and make an observation | ▶ 00:10 |
| that is really non-trivial. | ▶ 00:13 |
| Here is Bayes Rule, and in practice, what we find is | ▶ 00:15 |
| this term here is relatively easy to compute. | ▶ 00:20 |
| It's just a product, whereas this term is really hard to compute. | ▶ 00:23 |
| However, this term over here does not depend on what we assume for variable A. | ▶ 00:28 |
| It's just the function of B. | ▶ 00:33 |
| So, suppose for a moment we also care about the complementary event of not A | ▶ 00:35 |
| given B, for which Bayes Rule unfolds as follows. | ▶ 00:40 |
| Then we find that the normalizer, P(B), is identical, | ▶ 00:43 |
| whether we assume A on the left side or not A on the left side. | ▶ 00:47 |
| We also know from prior work that P(A) given B plus | ▶ 00:51 |
| P of not A given B must be one because these are 2 complementary events. | ▶ 00:57 |
| That allows us to compute Bayes Rule very differently | ▶ 01:03 |
| by basically ignoring the normalizer, so here's how it goes. | ▶ 01:06 |
| We compute P(A) given B--and I want to call this prime, | ▶ 01:11 |
| because it's not a real probability--to be just P(B) given A times P(A), | ▶ 01:16 |
| which is the normalizer, so the denominator of the expression over here. | ▶ 01:23 |
| We do the same thing with not A. | ▶ 01:28 |
| So, in both cases, we compute the posterior probability non-normalized | ▶ 01:31 |
| by omitting the normalizer B. | ▶ 01:36 |
| And then we can recover the original probabilities by normalizing | ▶ 01:38 |
| based on those values over here, so the probability of A given B, | ▶ 01:43 |
| the actual probability, is a normalizer, eta, | ▶ 01:48 |
| times this non-normalized form over here. | ▶ 01:52 |
| The same is true for the negation of A over here. | ▶ 01:55 |
| And eta is just the normalizer that results by adding these 2 values over here together | ▶ 01:59 |
| as shown over here, and dividing them for one. | ▶ 02:06 |
| So, take a look at this for a moment. | ▶ 02:10 |
| What we've done is we deferred the calculation of the normalizer over here | ▶ 02:13 |
| by computing pseudo probabilities that are non-normalized. | ▶ 02:18 |
| This made the calculation much easier, and when we were done with everything, | ▶ 02:22 |
| we just folded it back into the normalizer based on the resulting | ▶ 02:26 |
| pseudo probabilities and got the correct answer. | ▶ 02:29 |
| The reason why I gave you all this is because I want you to apply it now | ▶ 00:00 |
| to a slightly more complicated problem, which is the 2-test cancer example. | ▶ 00:03 |
| In this example, we again might have our unobservable cancer C, | ▶ 00:08 |
| but now we're running 2 tests, test 1 and test 2. | ▶ 00:14 |
| As before, the prior probability of cancer is 0.01. | ▶ 00:18 |
| The probability of receiving a positive test result for either test is 0.9. | ▶ 00:24 |
| The probability of getting a negative result given they're cancer free is 0.8. | ▶ 00:30 |
| And from those, we were able to compute all the other probabilities, | ▶ 00:36 |
| and we're just going to write them down over here. | ▶ 00:40 |
| So, take a moment to just verify those. | ▶ 00:43 |
| Now, let's assume both of my tests come back positive, | ▶ 00:46 |
| so T1 = + and T2 = +. | ▶ 00:50 |
| What's the probability of cancer now written in short form probability of | ▶ 00:56 |
| C given ++? | ▶ 01:00 |
| I want you to tell me what that is, and this is a non-trivial question. | ▶ 01:03 |
| So, the correct answer is 0.1698 approximately, | ▶ 00:00 |
| and to compute this, I used the trick I've shown you before. | ▶ 00:10 |
| Let me write down the running count for cancer and for not cancer | ▶ 00:15 |
| as I integrate the various multiplications in Bayes Rule. | ▶ 00:24 |
| My prior for cancer was 0.01 and for non-cancer was 0.99. | ▶ 00:28 |
| Then I get my first +, and the probability of a + given they have cancer is 0.9, | ▶ 00:37 |
| and the same for non-cancer is 0.2. | ▶ 00:43 |
| So, according to the non-normalized Bayes Rule, | ▶ 00:48 |
| I now multiply these 2 things together to get my non-normalized probability | ▶ 00:52 |
| of having cancer given the plus. | ▶ 00:58 |
| Since multiplication is commutative, | ▶ 01:00 |
| I can do the same thing again with my 2nd test result, 0.9 and 0.2, | ▶ 01:03 |
| and I multiply all of these 3 things together to get my non-normalized probability | ▶ 01:09 |
| P prime to be the following: 0.0081, if you multiply those things together, | ▶ 01:14 |
| and 0.0396 if you multiply these facts together. | ▶ 01:21 |
| And these are not a probability. | ▶ 01:28 |
| If we add those for the 2 complementary of cancer/non-cancer, | ▶ 01:30 |
| I get 0.0477. | ▶ 01:34 |
| However, if I now divide, that is, I normalize | ▶ 01:38 |
| those non-normalized probabilities over here by this factor over here, | ▶ 01:42 |
| I actually get the correct posterior probability P of cancer given ++. | ▶ 01:47 |
| And they look as follows: | ▶ 01:52 |
| approximately 0.1698 and approximately 0.8301. | ▶ 01:54 |
| Calculate for me the probability of cancer | ▶ 00:00 |
| given that I received one positive and one negative test result. | ▶ 00:03 |
| Please write your number into this box. | ▶ 00:08 |
| We apply the same trick as before | ▶ 00:00 |
| where we use the exact same prior of 0.01. | ▶ 00:03 |
| Our first + gives us the following factors: 0.9 and 0.2. | ▶ 00:07 |
| And our minus gives us the probability 0.1 for a negative first test result given that we have cancer, | ▶ 00:13 |
| and a 0.8 for the inverse of a negative result of not having cancer. | ▶ 00:20 |
| We multiply those together. | ▶ 00:26 |
| We get our non-normalized probability. | ▶ 00:28 |
| And if we now normalize by the sum of those two things | ▶ 00:30 |
| to turn this back into a probability, we get 0.009 | ▶ 00:35 |
| over the sum of those two things over here, and this is 0.0056 | ▶ 00:41 |
| for the chance of having cancer and 0.9943 for the chance of being cancer free. | ▶ 00:50 |
| And this adds up approximately to 1, and therefore, is a probability distribution. | ▶ 00:59 |
| I want to use a few words of terminology. | ▶ 00:00 |
| This, again, is a Bayes network, of which the hidden variable C | ▶ 00:03 |
| causes the still stochastic test outcomes T1 and T2. | ▶ 00:08 |
| And what is really important is that we assume not just | ▶ 00:16 |
| that T1 and T2 are identically distributed. | ▶ 00:19 |
| We use the same 0.9 for test 1 as we use for test 2, | ▶ 00:22 |
| but we also assume that they are conditionally independent. | ▶ 00:27 |
| We assumed that if God told us whether we actually had cancer or not, | ▶ 00:31 |
| if we knew with absolute certainty the value of the variable C, | ▶ 00:37 |
| that knowing anything about T1 would not help us make a statement about T2. | ▶ 00:41 |
| Put differently, we assumed that the probability of T2 given C and T1 | ▶ 00:48 |
| is the same as the probability of T2 given C. | ▶ 00:55 |
| This is called conditional independence, which is given the value of the cancer variable C. | ▶ 01:00 |
| If you knew this for a fact, then T2 would be independent of T1. | ▶ 01:08 |
| It's conditionally independent because the independence only holds true | ▶ 01:17 |
| if we actually know C, and it comes out of this diagram over here. | ▶ 01:21 |
| If we look at this diagram, if you knew the variable C over here, | ▶ 01:26 |
| then C separately causes T1 and T2. | ▶ 01:32 |
| So, as a result, if you know C, whatever counted over here | ▶ 01:39 |
| is kind of cut off causally from what happens over here. | ▶ 01:46 |
| That causes these 2 variables to be conditionally independent. | ▶ 01:48 |
| So, conditional independence is a really big thing in Bayes networks. | ▶ 01:52 |
| Here's a Bayes network where A causes B and C, | ▶ 01:58 |
| and for a Bayes network of this structure, we know that given A, | ▶ 02:02 |
| B and C are independent. | ▶ 02:08 |
| It's written as B conditionally independent of C given A. | ▶ 02:11 |
| So, here's a question. | ▶ 02:16 |
| Suppose we have conditional independence between B and C given A. | ▶ 02:18 |
| Would that imply--and there's my question--that B and C are independent? | ▶ 02:21 |
| So, suppose we don't know A. | ▶ 02:28 |
| We don't know whether we have cancer, for example. | ▶ 02:30 |
| What that means is that the test results individually are still independent of each other | ▶ 02:33 |
| even if we don't know about the cancer situation. | ▶ 02:38 |
| Please answer yes or no. | ▶ 02:42 |
| And the correct answer is No | ▶ 00:00 |
| Intuitively, getting a positive test result about cancer | ▶ 00:03 |
| gives us information about whether you have cancer or not. | ▶ 00:08 |
| So if you get a positive test result | ▶ 00:13 |
| you're going to raise the probability of having cancer | ▶ 00:15 |
| relative to the prior probability. | ▶ 00:18 |
| With that increased probability we will predict | ▶ 00:20 |
| that another test will with a higher likelihood | ▶ 00:24 |
| give us a positive response than if we hadn't taken the previous test. | ▶ 00:27 |
| That's really important to understand | ▶ 00:33 |
| So that we understand it let me make you calculate those probabilities | ▶ 00:36 |
| Let me draw the cancer example again with two tests. | ▶ 00:00 |
| Here's my cancer variable | ▶ 00:05 |
| and then there's two conditionally independent tests T1 and T2. | ▶ 00:07 |
| And as before let me assume that the prior probability of cancer is 0.01 | ▶ 00:13 |
| What I want you to compute for me is the probability of the second test | ▶ 00:19 |
| to be positive if we know that the first test was positive. | ▶ 00:26 |
| So write this into the following box. | ▶ 00:33 |
| So, for this one, we want to apply total probability. | ▶ 00:00 |
| This thing over here is the same as probability of test 2 to be positive, | ▶ 00:04 |
| which I'm going to abbreviate with a +2 over here, | ▶ 00:10 |
| conditioned on test 1 being positive and me having cancer | ▶ 00:14 |
| times the probability of me having cancer given test 1 was positive plus | ▶ 00:19 |
| the probability of test 2 being positive conditioned on test 1 being positive | ▶ 00:25 |
| and me not having cancer times the probability of me not having cancer | ▶ 00:31 |
| given that test 1 is positive. | ▶ 00:36 |
| That's the same as the theorem of total probability, | ▶ 00:38 |
| but now everything is conditioned on +1. | ▶ 00:42 |
| Take a moment to verify this. | ▶ 00:46 |
| Now, here I can plug in the numbers. | ▶ 00:48 |
| You already calculated this one before, which is approximately 0.043, | ▶ 00:50 |
| and this one over here is 1 minus that, which is 0.957 approximately. | ▶ 00:57 |
| And this term over here now exploits conditional independence, | ▶ 01:05 |
| which is given that I know C, knowledge of the first test | ▶ 01:09 |
| gives me no more information about the second test. | ▶ 01:14 |
| It only gives me information if C was unknown, as was the case over here. | ▶ 01:17 |
| So, I can rewrite this thing over here as follows: | ▶ 01:21 |
| P of +2 given that I have cancer. | ▶ 01:24 |
| I can drop the +1, and the same is true over here. | ▶ 01:27 |
| This is exploiting my conditional independence. | ▶ 01:31 |
| I knew that P of +1 or +2 conditioned on C | ▶ 01:34 |
| is the same as P of +2 conditioned on C and +1. | ▶ 01:41 |
| I can now read those off my table over here, | ▶ 01:47 |
| which is 0.9 times 0.043 plus 0.2, | ▶ 01:50 |
| which is 1 minus 0.8 over here times 0.957, | ▶ 01:58 |
| which gives me approximately 0.2301. | ▶ 02:03 |
| So, that says if my first test comes in positive, | ▶ 02:09 |
| I expect my second test to be positive with probably 0.2301. | ▶ 02:14 |
| That's an increased probability to the default probability, | ▶ 02:21 |
| which we calculated before, which is the probability of any test, | ▶ 02:24 |
| test 2 come in as positive before was the normalizer of Bayes rule which was 0.207. | ▶ 02:29 |
| So, my first test has a 20% chance of coming in positive. | ▶ 02:38 |
| My second test, after seeing a positive test, | ▶ 02:43 |
| has now an increased probability of about 23% of coming in positive. | ▶ 02:47 |
| So, now we've learned about independence, | ▶ 00:00 |
| and the corresponding Bayes network has 2 nodes. | ▶ 00:02 |
| They're just not connected at all. | ▶ 00:04 |
| And we learned about conditional independence, | ▶ 00:07 |
| in which case we have a Bayes network that looks like this. | ▶ 00:09 |
| Now I would like to know whether absolute independence | ▶ 00:12 |
| implies conditional independence. | ▶ 00:16 |
| True or false? | ▶ 00:18 |
| And I'd also like to know whether conditional independence implies absolute independence. | ▶ 00:20 |
| Again, true or false? | ▶ 00:25 |
| And the answer is both of them are false. | ▶ 00:00 |
| We already saw that conditional independence, as shown over here, | ▶ 00:03 |
| doesn't give us absolute independence. | ▶ 00:07 |
| So, for example, this is test #1 and test #2. | ▶ 00:09 |
| You might or might not have cancer. | ▶ 00:13 |
| Our first test gives us information about whether you have cancer or not. | ▶ 00:15 |
| As a result, we've changed our prior probability | ▶ 00:18 |
| for the second test to come in positive. | ▶ 00:21 |
| That means that conditional independence does not imply absolute independence, | ▶ 00:24 |
| which means this assumption here falls, | ▶ 00:30 |
| and it also turns out that if you have absolute independence, | ▶ 00:32 |
| things might not be conditionally independent for reasons that I can't quite explain so far, | ▶ 00:37 |
| but that we will learn about next. | ▶ 00:43 |
| [Thrun] For my next example, I will study a different type of a Bayes network. | ▶ 00:00 |
| Before, we've seen networks of the following type, | ▶ 00:04 |
| where a single hidden cause caused 2 different measurements. | ▶ 00:08 |
| I now want to study a network that looks just like the opposite. | ▶ 00:13 |
| We have 2 independent hidden causes, | ▶ 00:17 |
| but they get confounded within a single observational variable. | ▶ 00:20 |
| I would like to use the example of happiness. | ▶ 00:26 |
| Suppose I can be happy or unhappy. | ▶ 00:29 |
| What makes me happy is when the weather is sunny or if I get a raise in my job, | ▶ 00:33 |
| which means I make more money. | ▶ 00:41 |
| So let's call this sunny, let's call this a raise, and call this happiness. | ▶ 00:43 |
| Perhaps the probability of it being sunny is 0.7, | ▶ 00:47 |
| probability of a raise is 0.01. | ▶ 00:53 |
| And I will tell you that the probability of being happy is governed as follows. | ▶ 00:58 |
| The probability of being happy given that both of these things occur-- | ▶ 01:05 |
| I got a raise and it is sunny--is 1. | ▶ 01:09 |
| The probability of being happy given that it is not sunny and I still got a raise is 0.9. | ▶ 01:13 |
| The probability of being happy given that it's sunny but I didn't get a raise is 0.7. | ▶ 01:20 |
| And the probability of being happy given that it is neither sunny nor did I get a raise is 0.1. | ▶ 01:27 |
| This is a perfectly fine specification of a probability distribution | ▶ 01:35 |
| where 2 causes affect the variable down here, the happiness. | ▶ 01:39 |
| So I'd like you to calculate for me the following questions. | ▶ 01:46 |
| Probability of a raise given that it is sunny, according to this model. | ▶ 01:50 |
| Please enter your answer over here. | ▶ 01:57 |
| [Thrun] The answer is surprisingly simple. | ▶ 00:00 |
| It is 0.01. | ▶ 00:03 |
| How do I know this so fast? | ▶ 00:05 |
| Well, if you look at this Bayes network, | ▶ 00:08 |
| both the sunniness and the question whether I got a raise impact my happiness. | ▶ 00:12 |
| But since I don't know anything about the happiness, | ▶ 00:21 |
| there is no way that just the weather might implicate or impact whether I get a raise or not. | ▶ 00:24 |
| In fact, it might be independently sunny, and I might independently get a raise at work. | ▶ 00:32 |
| There is no mechanism of which these 2 things would co-occur. | ▶ 00:39 |
| Therefore, the probability of a raise given that it's sunny | ▶ 00:46 |
| is just the same as the probability of a raise given any weather, which is 0.01. | ▶ 00:49 |
| [Thrun] Let me talk about a really interesting special instance of Bayes net reasoning | ▶ 00:00 |
| which is called explaining away. | ▶ 00:07 |
| And I'll first give you the intuitive answer, | ▶ 00:10 |
| then I'll wish you to compute probabilities for me that manifest the explain away effect | ▶ 00:14 |
| in a Bayes network of this type. | ▶ 00:19 |
| Explaining away means that if we know that we are happy, | ▶ 00:22 |
| then sunny weather can explain away the cause of happiness. | ▶ 00:27 |
| If I then also know that it's sunny, it becomes less likely that I received a raise. | ▶ 00:34 |
| Let me put this differently. | ▶ 00:41 |
| Suppose I'm a happy guy on a specific day | ▶ 00:43 |
| and my wife asks me, "Sebastian, why are you so happy?" | ▶ 00:45 |
| "Is it sunny, or did you get a raise?" | ▶ 00:49 |
| If she then looks outside and sees it is sunny, | ▶ 00:52 |
| then she might explain to herself, | ▶ 00:55 |
| "Well, Sebastian is happy because it is sunny." | ▶ 00:57 |
| "That makes it effectively less likely that he got a raise | ▶ 01:00 |
| "because I could already explain his happiness by it being sunny." | ▶ 01:05 |
| If she looks outside and it is rainy, | ▶ 01:10 |
| that makes it more likely I got a raise, | ▶ 01:13 |
| because the weather can't really explain my happiness. | ▶ 01:16 |
| In other words, if we see a certain effect that could be caused by multiple causes, | ▶ 01:20 |
| seeing one of those causes can explain away any other potential cause | ▶ 01:27 |
| of this effect over here. | ▶ 01:33 |
| So let me put this in numbers and ask you the challenging question of | ▶ 01:36 |
| what's the probability of a raise given that I'm happy and it's sunny? | ▶ 01:43 |
| [Thrun] The answer is approximately 0.0142, | ▶ 00:00 |
| and it is an exercise in expanding this term using Bayes' rule, | ▶ 00:07 |
| using total probability, which I'll just do for you. | ▶ 00:11 |
| Using Bayes' rule, you can transform this into P of H given R comma S | ▶ 00:16 |
| times P of R given S over P of H given S. | ▶ 00:24 |
| We observe the conditional independence of R and S | ▶ 00:34 |
| to simplify this to just P of R, | ▶ 00:37 |
| and the denominator is expanded by folding in R and not R, | ▶ 00:40 |
| P of H given R comma S | ▶ 00:46 |
| times P of R plus P of H given not R and S | ▶ 00:49 |
| times P of not R, which is total probability. | ▶ 00:54 |
| We can now read off the numbers from the tables over here, | ▶ 00:58 |
| which gives us 1 times 0.01 divided by this expression | ▶ 01:01 |
| that is the same as the expression over here, so 0.01 plus this thing over here, | ▶ 01:10 |
| which you can find over here to be 0.7, times this guy over here, | ▶ 01:17 |
| which is 1 minus the value over here, 0.99, | ▶ 01:23 |
| which gives us approximately 0.0142. | ▶ 01:27 |
| [Thrun] Now, to understand the explain away effect, | ▶ 00:00 |
| you have to compare this to the probability of a raise given that we're just happy | ▶ 00:04 |
| and we don't know anything about the weather. | ▶ 00:11 |
| So let's do that exercise next. | ▶ 00:14 |
| So my next quiz is, what's the probability of a raise given that all I know is that I'm happy | ▶ 00:16 |
| and I don't know about the weather? | ▶ 00:24 |
| This happens to be once again a pretty complicated question, so take your time. | ▶ 00:26 |
| [Thrun] So this is a difficult question. | ▶ 00:00 |
| Let me compute an auxiliary variable, which is P of happiness. | ▶ 00:02 |
| That one is expanded by looking at the different conditions that can make us happy. | ▶ 00:12 |
| P of happiness given S and R | ▶ 00:19 |
| times P of S and R, which is of course the product of those 2 | ▶ 00:24 |
| because they are independent, | ▶ 00:29 |
| plus P of happiness given not S R, probability of not as R | ▶ 00:31 |
| plus P of H given S and not R | ▶ 00:39 |
| times the probability of P of S and not R plus the last case, | ▶ 00:43 |
| P of H given not S and not R. | ▶ 00:48 |
| So this just looks at the happiness under all 4 combinations of the variables | ▶ 00:52 |
| that can lead to happiness. | ▶ 00:56 |
| And you can plug those straight in. | ▶ 00:58 |
| This one over here is 1, and this one over here is the product of S and R, | ▶ 01:00 |
| which is 0.7 times 0.01. | ▶ 01:05 |
| And as you plug all of those in, | ▶ 01:10 |
| you get as a result 0.5245. | ▶ 01:14 |
| That's P of H. | ▶ 01:21 |
| Just take some time and do the math by going through these different cases | ▶ 01:24 |
| using total probability, and you get this result. | ▶ 01:28 |
| Armed with this number, the rest now becomes easy, | ▶ 01:32 |
| which is we can use Bayes' rule to turn this around. | ▶ 01:38 |
| P of H given R times P of R over P of H. | ▶ 01:43 |
| P of R we know from over here, the probability of a raise is 0.01. | ▶ 01:49 |
| So the only thing we need to compute now is P of H given R. | ▶ 01:54 |
| And again, we apply total probability. | ▶ 01:57 |
| Let me just do this over here. | ▶ 01:59 |
| We can factor P of H given R as P of H given R and S, sunny, | ▶ 02:02 |
| times probability of sunny plus P of H given R and not sunny | ▶ 02:09 |
| times the probability of not sunny. | ▶ 02:14 |
| And if you plug in the numbers with this, you get 1 times 0.7 | ▶ 02:16 |
| plus 0.9 times 0.3. | ▶ 02:21 |
| That happens to be 0.97. | ▶ 02:25 |
| So if we now plug this all back into this equation over here, | ▶ 02:30 |
| we get 0.97 times 0.01 over 0.5245. | ▶ 02:33 |
| This gives us approximately as the correct answer 0.0185. | ▶ 02:45 |
| [Thrun] And if you got this right, I will be deeply impressed | ▶ 00:00 |
| about the fact you got this right. | ▶ 00:04 |
| But the interesting thing now to observe is if we happen to know it's sunny | ▶ 00:07 |
| and I'm happy, then the probability of a raise is 14%, 0.014. | ▶ 00:13 |
| If I don't know about the weather and I'm happy, | ▶ 00:21 |
| then the probability of a raise goes up to about 18.5%. | ▶ 00:26 |
| Why is that? | ▶ 00:30 |
| Well, it's the explaining away effect. | ▶ 00:32 |
| My happiness is well explained by the fact that it's sunny. | ▶ 00:35 |
| So if someone observes me to be happy and asks the question, | ▶ 00:40 |
| "Is this because Sebastian got a raise at work?" | ▶ 00:43 |
| well, if you know it's sunny and this is a fairly good explanation for me being happy, | ▶ 00:46 |
| you don't have to assume I got a raise. | ▶ 00:53 |
| If you don't know about the weather, then obviously the chances are higher | ▶ 00:55 |
| that the raise caused my happiness, | ▶ 01:01 |
| and therefore this number goes up from 0.014 to 0.018. | ▶ 01:03 |
| Let me ask you one final question in this next quiz, | ▶ 01:10 |
| which is the probability of the raise given that I look happy and it's not sunny. | ▶ 01:14 |
| This is the most extreme case for making a raise likely | ▶ 01:23 |
| because I am a happy guy, and it's definitely not caused by the weather. | ▶ 01:27 |
| So it could be just random, or it could be caused by the raise. | ▶ 01:33 |
| So please calculate this number for me and enter it into this box. | ▶ 01:37 |
| [Thrun] Well, the answer follows the exact same scheme as before, | ▶ 00:00 |
| with S being replaced by not S. | ▶ 00:04 |
| So this should be an easier question for you to answer. | ▶ 00:08 |
| P of R given H and not S can be inverted by Bayes' rule to be as follows. | ▶ 00:11 |
| Once we apply Bayes' rule, as indicated over here where we swapped H to the left side | ▶ 00:20 |
| and R to the right side, you can observe that this value over here | ▶ 00:24 |
| can be readily found in the table. | ▶ 00:29 |
| It's actually the 0.9 over there. | ▶ 00:32 |
| This value over here, the raise is independent of the weather | ▶ 00:35 |
| by virtue of our Bayes network, so it's just 0.01. | ▶ 00:41 |
| And as before, we apply total probability to the expression over here, | ▶ 00:45 |
| and we obtain off this quotient over here that these 2 expressions are the same. | ▶ 00:52 |
| P of H given not S, not R is the value over here, | ▶ 00:58 |
| and the 0.99 is the complement of probability of R taken from over here, | ▶ 01:03 |
| and that ends up to be 0.0833. | ▶ 01:08 |
| This would have been the right answer. | ▶ 01:16 |
| [Thrun] It's really interesting to compare this to the situation over here. | ▶ 00:00 |
| In both cases I'm happy, as shown over here, | ▶ 00:04 |
| and I ask the same question, which is whether I got a raise at work, as R over here. | ▶ 00:08 |
| But in one case I observe that the weather is sunny; in the other one it isn't. | ▶ 00:15 |
| And look what it does to my probability of having received a raise. | ▶ 00:21 |
| The sunniness perfectly well explains my happiness, | ▶ 00:25 |
| and my probability of having received a raise ends up to be a mere 1.4%, or 0.014. | ▶ 00:30 |
| However, if my wife observes it to be non-sunny, then it is much more likely | ▶ 00:41 |
| that the cause of my happiness is related to a raise at work, | ▶ 00:47 |
| and now the probability is 8.3%, which is significantly higher than the 1.4% before. | ▶ 00:51 |
| This is a Bayes network of which S and R are independent | ▶ 00:58 |
| but H adds a dependence between S and R. | ▶ 01:04 |
| Let me talk about this in a little bit more detail on the next paper. | ▶ 01:10 |
| So here is our Bayes network again. | ▶ 01:16 |
| In our previous exercises, we computed for this network | ▶ 01:18 |
| that the probability of a raise of R given any of these variables shown here was as follows. | ▶ 01:22 |
| The really interesting thing is that in the absence of information about H, | ▶ 01:29 |
| which is the middle case over here, | ▶ 01:34 |
| the probability of R is unaffected by knowledge of S-- | ▶ 01:37 |
| that is, R and S are independent. | ▶ 01:41 |
| This is the same as probability of R, | ▶ 01:46 |
| and R and S are independent. | ▶ 01:49 |
| However, if I know something about the variable H, | ▶ 01:56 |
| then S and R become dependent-- | ▶ 02:02 |
| that is, knowing about my happiness over here renders S and R dependent. | ▶ 02:06 |
| This is not the same as probability of just R given H. | ▶ 02:15 |
| Obviously, it isn't because if I now vary S from S to not S, | ▶ 02:23 |
| it affects my probability for the variable R. | ▶ 02:28 |
| That is a really unusual situation | ▶ 02:33 |
| where we have R and S are independent | ▶ 02:36 |
| but given the variable H, R and S are not independent anymore. | ▶ 02:40 |
| So knowledge of H makes 2 variables that previously were independent non-independent. | ▶ 02:50 |
| Offered differently, 2 variables that are independent may not be in certain cases | ▶ 02:58 |
| conditionally independent. | ▶ 03:06 |
| Independence does not imply conditional independence. | ▶ 03:08 |
| [Thrun] So we're now ready to define Bayes networks in a more general way. | ▶ 00:00 |
| Bayes networks define probability distributions over graphs or random variables. | ▶ 00:05 |
| Here is an example graph of 5 variables, | ▶ 00:10 |
| and this Bayes network defines the distribution over those 5 random variables. | ▶ 00:14 |
| Instead of enumerating all possibilities of combinations of these 5 random variables, | ▶ 00:19 |
| the Bayes network is defined by probability distributions | ▶ 00:24 |
| that are inherent to each individual node. | ▶ 00:28 |
| For node A and B, we just have a distribution P of A and P of B | ▶ 00:32 |
| because A and B have no incoming arcs. | ▶ 00:38 |
| C is a conditional distribution conditioned on A and B. | ▶ 00:42 |
| D and E are conditioned on C. | ▶ 00:47 |
| The joint probability represented by a Bayes network | ▶ 00:52 |
| is the product of various Bayes network probabilities | ▶ 00:56 |
| that are defined over individual nodes | ▶ 01:00 |
| where each node's probability is only conditioned on the incoming arcs. | ▶ 01:03 |
| So A has no incoming arc; therefore, we just want it P of A. | ▶ 01:08 |
| C has 2 incoming arcs, so we define the probability of C conditioned on A and B. | ▶ 01:12 |
| And D and E have 1 incoming arc that's shown over here. | ▶ 01:18 |
| The definition of this joint distribution by using the following factors | ▶ 01:22 |
| has one really big advantage. | ▶ 01:27 |
| Whereas the joint distribution over any 5 variables requires 2 to the 5 minus 1, | ▶ 01:30 |
| which is 31 probability values, | ▶ 01:40 |
| the Bayes network over here only requires 10 such values. | ▶ 01:43 |
| P of A is one value, for which we can derive P of not A. | ▶ 01:48 |
| Same for P of B. | ▶ 01:53 |
| P of C given A B is derived by a distribution over C | ▶ 01:55 |
| conditioned on any combination of A and B, of which there are 4 of A and B as binary. | ▶ 02:02 |
| P of D given C is 2 parameters for P of D given C and P of D given not C. | ▶ 02:07 |
| And the same is true for P of E given C. | ▶ 02:15 |
| So if you add those up, you get 10 parameters in total. | ▶ 02:18 |
| So the compactness of the Bayes network | ▶ 02:21 |
| leads to a representation that scales significantly better to large networks | ▶ 02:25 |
| than the common natorial approach which goes through all combinations of variable values. | ▶ 02:31 |
| That is a key advantage of Bayes networks, | ▶ 02:36 |
| and that is the reason why Bayes networks are being used so extensively | ▶ 02:39 |
| for all kinds of problems. | ▶ 02:43 |
| So here is a quiz. | ▶ 02:45 |
| How many probability values are required to specify this Bayes network? | ▶ 02:47 |
| Please put your answer in the following box. | ▶ 02:51 |
| [Thrun] And the answer is 13. | ▶ 00:00 |
| One over here, 2 over here, and 4 over here. | ▶ 00:03 |
| Simplifiably speaking, any variable that has K inputs requires 2 to the K such variables. | ▶ 00:06 |
| So in total we have 1, 9, 13. | ▶ 00:15 |
| [Thrun] Here's another quiz. | ▶ 00:00 |
| How many parameters do we need to specify the joint distribution | ▶ 00:02 |
| for this Bayes network over here | ▶ 00:06 |
| where A, B, and C point into D, D points into E, F, and G | ▶ 00:09 |
| and C also points into G? | ▶ 00:13 |
| Please write your answer into this box. | ▶ 00:15 |
| [Thrun] And the answer is 19. | ▶ 00:00 |
| So 1 here, 1 here, 1 here, 2 here, 2 here, 2 arcs point into G, which makes for 4, | ▶ 00:02 |
| and 3 arcs point into D. Two to the 3 is 8. | ▶ 00:09 |
| So we get 1, 2, 3, 8, 2, 2, 4. If you add those up, it's 19. | ▶ 00:13 |
| [Thrun] And here is our car network which we discussed at the very beginning of this unit. | ▶ 00:00 |
| How many parameters do we need to specify this network? | ▶ 00:06 |
| Remember, there are 16 total variables, | ▶ 00:11 |
| and the naive joint over the 16 will be 2 to the 16th minus 1, which is 65,535. | ▶ 00:15 |
| Please write your answer into this box over here. | ▶ 00:25 |
| [Thrun] To answer this question, let us add up these numbers. | ▶ 00:00 |
| Battery age is 1, 1, 1. | ▶ 00:04 |
| This has 1 incoming arc, so it's 2. | ▶ 00:08 |
| Two incoming arcs makes 4. | ▶ 00:10 |
| One incoming arc is 2, 2 equals 4. | ▶ 00:13 |
| Four incoming arcs makes 16. | ▶ 00:17 |
| If we add all the right numbers, we get 47. | ▶ 00:21 |
| [Thrun] So it takes 47 numerical probabilities to specify the joint | ▶ 00:00 |
| compared to 65,000 if you didn't have the graph-like structure. | ▶ 00:05 |
| I think this example really illustrates the advantage | ▶ 00:11 |
| of compact Bayes network representations over unstructured joint representations. | ▶ 00:14 |
| [Thrun] The next concept I'd like to teach you is called D-separation. | ▶ 00:00 |
| And let me start the discussion of this concept by a quiz. | ▶ 00:04 |
| We have here a Bayes network, | ▶ 00:09 |
| and I'm going to ask you a conditional independence question. | ▶ 00:11 |
| Is C independent of A? | ▶ 00:16 |
| Please tell me yes or no. | ▶ 00:20 |
| Is C independent of A given B? | ▶ 00:22 |
| Is C independent of D? | ▶ 00:27 |
| Is C independent of D given A? | ▶ 00:30 |
| And is E independent of C given D? | ▶ 00:32 |
| [Thrun] So C is not independent of A. | ▶ 00:00 |
| In fact, A influences C by virtue of B. | ▶ 00:04 |
| But if you know B, then A becomes independent of C, | ▶ 00:09 |
| which means the only determinate into C is B. | ▶ 00:13 |
| If you know B for sure, then knowledge of A won't really tell you anything about C. | ▶ 00:17 |
| C is also not independent of D, just the same way C is not independent of A. | ▶ 00:22 |
| If I learn something about D, I can infer more about C. | ▶ 00:27 |
| But if I do know A, then it's hard to imagine how knowledge of D would help me with C | ▶ 00:31 |
| because I can't learn anything more about A than knowing A already. | ▶ 00:38 |
| Therefore, given A, C and D are independent. | ▶ 00:42 |
| The same is true for E and C. | ▶ 00:45 |
| If we know D, then E and C become independent. | ▶ 00:48 |
| [Thrun] In this specific example, the rule that we could apply is very, very simple. | ▶ 00:00 |
| Any 2 variables are independent if they're not linked by just unknown variables. | ▶ 00:04 |
| So for example, if we know B, then everything downstream of B | ▶ 00:10 |
| becomes independent of anything upstream of B. | ▶ 00:14 |
| E is now independent of C, conditioned on B. | ▶ 00:18 |
| However, knowledge of B does not render A and E independent. | ▶ 00:22 |
| In this graph over here, A and B connect to C and C connects to D and to E. | ▶ 00:26 |
| So let me ask you, is A independent of E, | ▶ 00:33 |
| A independent of E given B, | ▶ 00:37 |
| A independent of E given C, | ▶ 00:39 |
| A independent of B, | ▶ 00:41 |
| and A independent of B given C? | ▶ 00:43 |
| [Thrun] And the answer for this one is really interesting. | ▶ 00:00 |
| A is clearly not independent of E because through C we can see an influence of A to E. | ▶ 00:03 |
| Given B, that doesn't change. | ▶ 00:08 |
| A still influences C, despite the fact we know B. | ▶ 00:11 |
| However, if we know C, the influence is cut off. | ▶ 00:15 |
| There is no way A can influence E if we know C. | ▶ 00:18 |
| A is clearly independent of B. | ▶ 00:22 |
| They are different entry variables. They have no incoming arcs. | ▶ 00:25 |
| But here is the caveat. | ▶ 00:29 |
| Given C, A and B become dependent. | ▶ 00:32 |
| So whereas initially A and B were independent, | ▶ 00:35 |
| if you give C, they become dependent. | ▶ 00:38 |
| And the reason why they become dependent we've studied before. | ▶ 00:41 |
| This is the explain away effect. | ▶ 00:44 |
| If you know, for example, C to be true, | ▶ 00:48 |
| then knowledge of A will substantially affect what we believe about B. | ▶ 00:51 |
| If there's 2 joint causes for C and we happen to know A is true, | ▶ 00:57 |
| we will discredit cause B. | ▶ 01:02 |
| If we happen to know A is false, we will increase our belief for the cause B. | ▶ 01:04 |
| That was an effect we studied extensively in the happiness example I gave you before. | ▶ 01:09 |
| The interesting thing here is we are facing a situation | ▶ 01:15 |
| where knowledge of variable C renders previously independent variables dependent. | ▶ 01:19 |
| [Thrun] This leads me to the general study of conditional independence in Bayes networks, | ▶ 00:00 |
| often called D-separation or reachability. | ▶ 00:06 |
| D-separation is best studied by so-called active triplets and inactive triplets | ▶ 00:10 |
| where active triplets render variables dependent | ▶ 00:17 |
| and inactive triplets render them independent. | ▶ 00:20 |
| Any chain of 3 variables like this makes the initial and final variable dependent | ▶ 00:23 |
| if all variables are unknown. | ▶ 00:30 |
| However, if the center variable is known-- | ▶ 00:32 |
| that is, it's behind the conditioning bar-- | ▶ 00:35 |
| then this variable and this variable become independent. | ▶ 00:38 |
| So if we have a structure like this and it's quote-unquote cut off | ▶ 00:42 |
| by a known variable in the middle, that separates or deseparates | ▶ 00:47 |
| the left variable from the right variable, and they become independent. | ▶ 00:53 |
| Similarly, any structure like this renders the left variable and the right variable dependent | ▶ 00:57 |
| unless the center variable is known, | ▶ 01:04 |
| in which case the left and right variable become independent. | ▶ 01:08 |
| Another active triplet now requires knowledge of a variable. | ▶ 01:12 |
| This is the explain away case. | ▶ 01:16 |
| If this variable is known for a Bayes network that converges into a single variable, | ▶ 01:19 |
| then this variable and this variable over here become dependent. | ▶ 01:25 |
| Contrast this with a case where all variables are unknown. | ▶ 01:29 |
| A situation like this means that this variable on the left or on the right are actually independent. | ▶ 01:33 |
| In a single final example, we also get dependence if we have the following situation: | ▶ 01:40 |
| a direct successor of a conversion variable is known. | ▶ 01:48 |
| So it is sufficient if a successor of this variable is known. | ▶ 01:52 |
| The variable itself does not have to be known, | ▶ 01:57 |
| and the reason is if you know this guy over here, | ▶ 01:59 |
| we get knowledge about this guy over here. | ▶ 02:02 |
| And by virtue of that, the case over here essentially applies. | ▶ 02:05 |
| If you look at those rules, | ▶ 02:09 |
| those rules allow you to determine for any Bayes network | ▶ 02:11 |
| whether variables are dependent or not dependent given the evidence you have. | ▶ 02:15 |
| If you color the nodes dark for which you do have evidence, | ▶ 02:20 |
| then you can use these rules to understand whether any 2 variables | ▶ 02:25 |
| are conditionally independent or not. | ▶ 02:29 |
| So let me ask you for this relatively complicated Bayes network the following questions. | ▶ 02:31 |
| Is F independent of A? | ▶ 02:37 |
| Is F independent of A given D? | ▶ 02:41 |
| Is F independent of A given G? | ▶ 02:45 |
| And is F independent of A given H? | ▶ 02:49 |
| Please mark your answers as you see fit. | ▶ 02:51 |
| [Thrun] And the answer is yes, F is independent of A. | ▶ 00:00 |
| What we find for our rules of D-separation is that F is dependent on D | ▶ 00:04 |
| and A is dependent on D. | ▶ 00:08 |
| But if you don't know D, you can't govern any dependence between A and F at all. | ▶ 00:11 |
| If you do know D, then F and A become dependent. | ▶ 00:16 |
| And the reason is B and E are dependent given D, | ▶ 00:20 |
| and we can transform this back into dependence of A and F | ▶ 00:25 |
| because B and A are dependent and E and F are dependent. | ▶ 00:29 |
| There is an active path between A and F which goes across here and here | ▶ 00:33 |
| because D is known. | ▶ 00:38 |
| If we know G, the same thing is true because G gives us knowledge about D, | ▶ 00:40 |
| and D can be applied back to this path over here. | ▶ 00:44 |
| However, if you know H, that's not the case. | ▶ 00:47 |
| So H might tell us something about G, | ▶ 00:49 |
| but it doesn't tell us anything about D, | ▶ 00:51 |
| and therefore, we have no reason to close the path between A and F. | ▶ 00:53 |
| The path between A and F is still passive, even though we have knowledge of H. | ▶ 00:59 |
| [Thrun] So congratulations. You learned a lot about Bayes networks. | ▶ 00:00 |
| You learned about the graph structure of Bayes networks, | ▶ 00:03 |
| you understood how this is a compact representation, | ▶ 00:06 |
| you learned about conditional independence, | ▶ 00:10 |
| and we talked a little bit about application of Bayes network | ▶ 00:12 |
| to interesting reasoning problems. | ▶ 00:15 |
| But by all means this was a mostly theoretical unit of this class, | ▶ 00:18 |
| and in future classes we will talk more about applications. | ▶ 00:23 |
| The instrument of Bayes networks is really essential to a number of problems. | ▶ 00:27 |
| It really characterizes the sparse dependence that exists in many readable problems | ▶ 00:31 |
| like in robotics and computer vision and filtering and diagnostics and so on. | ▶ 00:36 |
| I really hope you enjoyed this class, | ▶ 00:41 |
| and I really hope you understood in depth how Bayes networks work. | ▶ 00:43 |
| [Probabilistic Interference] | ▶ 00:00 |
| [Male] Welcome back. In the previous unit, we went over the basics | ▶ 00:02 |
| of probability theory and saw how | ▶ 00:05 |
| a Bayes network could concisely represent a joint probability distribution, | ▶ 00:12 |
| including the representation of independence between the variables. | ▶ 00:17 |
| In this unit, we will see how to do probabilistic inference. | ▶ 00:24 |
| That is, how to answer probability questions using Bayes nets. | ▶ 00:31 |
| Let's put up a simple Bayes net. | ▶ 00:36 |
| We'll use the familiar example of the earthquake | ▶ 00:40 |
| where we can have a burglary or an earthquake | ▶ 00:45 |
| setting off an alarm, and if the alarm goes off, | ▶ 00:50 |
| either John or Mary might call. | ▶ 00:53 |
| Now, what kinds of questions can we ask to do inference about? | ▶ 00:58 |
| The simplest type of question is the same question we ask | ▶ 01:02 |
| with an ordinary subroutine or function in a programming language. | ▶ 01:05 |
| Namely, given some inputs, what are the outputs? | ▶ 01:08 |
| So, in this case, we could say given the inputs of B and E, | ▶ 01:12 |
| what are the outputs, J and M? | ▶ 01:18 |
| Rather than call them input and output variables, | ▶ 01:22 |
| in probabilistic inference, we'll call them evidence and query variables. | ▶ 01:26 |
| That is, the variables that we know the values of are the evidence, | ▶ 01:36 |
| and the ones that we want to find out the values of are the query variables. | ▶ 01:39 |
| Anything that is neither evidence nor query is known as a hidden variable. | ▶ 01:44 |
| That is, we won't tell you what its value is. | ▶ 01:52 |
| We won't figure out what its value is and report it, | ▶ 01:55 |
| but we'll have to compute with it internally. | ▶ 01:58 |
| And now furthermore, in probabilistic inference, | ▶ 02:01 |
| the output is not a single number for each of the query variables, | ▶ 02:05 |
| but rather, it's a probability distribution. | ▶ 02:10 |
| So, the answer is going to be a complete, joint probability distribution | ▶ 02:13 |
| over the query variables. | ▶ 02:17 |
| We call this the posterior distribution, given the evidence, | ▶ 02:19 |
| and we can write it like this. | ▶ 02:23 |
| It's the probability distribution of one or more query variables | ▶ 02:26 |
| given the values of the evidence variables. | ▶ 02:34 |
| And there can be zero or more evidence variables, | ▶ 02:39 |
| and each of them are given an exact value. | ▶ 02:42 |
| And that's the computation we want to come up with. | ▶ 02:47 |
| There's another question we can ask. | ▶ 02:53 |
| Which is the most likely explanation? | ▶ 02:56 |
| That is, out of all the possible values for all the query variables, | ▶ 02:58 |
| which combination of values has the highest probability? | ▶ 03:03 |
| We write the formula like this, asking which Q values | ▶ 03:08 |
| are maxable given the evidence values. | ▶ 03:12 |
| Now, in an ordinary programming language, each function goes only one way. | ▶ 03:16 |
| It has input variables, does some computation, | ▶ 03:22 |
| and comes up with a result variable or result variables. | ▶ 03:26 |
| One great thing about Bayes nets is that we're not restricted | ▶ 03:31 |
| to going only in one direction. | ▶ 03:34 |
| We could go in the causal direction, giving as evidence | ▶ 03:36 |
| the route nodes of the tree and asking as query values the nodes at the bottom. | ▶ 03:41 |
| Or, we could reverse that causal flow. | ▶ 03:47 |
| For example, we could have J and M be the evidence variables | ▶ 03:50 |
| and B and E be the query variables, | ▶ 03:55 |
| or we could have any other combination. | ▶ 03:58 |
| For example, we could have M be the evidence variable | ▶ 04:01 |
| and J and B be the query variables. | ▶ 04:05 |
| Here's a question for you. | ▶ 04:11 |
| Imagine the situation where Mary has called to report that the alarm is going off, | ▶ 04:13 |
| and we want to know whether or not there has been a burglary. | ▶ 04:18 |
| For each of the nodes, click on the circle to tell us | ▶ 04:22 |
| if the node is an evidence node, a hidden node, | ▶ 04:27 |
| or a query node. | ▶ 04:32 |
| The answer is that Mary calling is the evidence node. | ▶ 00:00 |
| The burglary is the query node, | ▶ 00:04 |
| and all the others are hidden variables in this case. | ▶ 00:07 |
| Now we're going to talk about how to do inference on Bayes net. | ▶ 00:00 |
| We'll start with our familiar network, and we'll talk about a method | ▶ 00:04 |
| called enumeration, | ▶ 00:08 |
| which goes through all the possibilities, adds them up, | ▶ 00:12 |
| and comes up with an answer. | ▶ 00:15 |
| So, what we do is start by stating the problem. | ▶ 00:17 |
| We're going to ask the question of what is the probability | ▶ 00:24 |
| that the burglar alarm occurred given that John called and Mary called? | ▶ 00:27 |
| We'll use the definition of conditional probability to answer this. | ▶ 00:34 |
| So, this query is equal to the joint probability distribution | ▶ 00:39 |
| of all 3 variables divided by the conditionalized variables. | ▶ 00:47 |
| Now, note I'm using a notation here where instead of writing out the probability | ▶ 00:55 |
| of some variable equals true, I'm just using the notation plus | ▶ 01:01 |
| and then the variable name in lower case, | ▶ 01:05 |
| and if I wanted the negation, I would use negation sign. | ▶ 01:08 |
| Notice there's a different notation where instead of writing out | ▶ 01:13 |
| the plus and negation signs, we just use the variable name itself, P(e), | ▶ 01:17 |
| to indicate E is true. | ▶ 01:22 |
| That notation works well, but it can get confusing between | ▶ 01:25 |
| does P(e) mean E is true, or does it mean E is a variable? | ▶ 01:29 |
| And so we're going to stick to the notation where we explicitly have | ▶ 01:34 |
| the pluses and negation signs. | ▶ 01:37 |
| To do inference by enumeration, we first take a conditional probability | ▶ 01:41 |
| and rewrite it as unconditional probabilities. | ▶ 01:45 |
| Now we enumerate all the atomic probabilities and calculate the sum of products. | ▶ 01:49 |
| Let's look at just the complex term on the numerator first. | ▶ 01:56 |
| The procedure for figuring out the denominator would be similar, and we'll skip that. | ▶ 02:00 |
| So, the probability of these 3 terms together | ▶ 02:05 |
| can be determined by enumerating all possible values of the hidden variables. | ▶ 02:12 |
| In this case, there are 2, E and A, | ▶ 02:17 |
| so we'll sum over those variables for all values of E and for all values of A. | ▶ 02:22 |
| In this case, they're boolean, so there's only 2 values of each. | ▶ 02:29 |
| We ask what's the probability of this unconditional term? | ▶ 02:34 |
| And that we get by summing out over all possibilities, | ▶ 02:41 |
| E and A being true or false. | ▶ 02:44 |
| Now, to get the values of these atomic events, | ▶ 02:49 |
| we'll have to rewrite this equation in a form that corresponds | ▶ 02:52 |
| to the conditional probability tables that we have associated with the Bayes net. | ▶ 02:55 |
| So, we'll take this whole expression and rewrite it. | ▶ 03:00 |
| It's still a sum over the hidden variables E and A, | ▶ 03:04 |
| but now I'll rewrite this expression in terms of the parents | ▶ 03:08 |
| of each of the nodes in the network. | ▶ 03:12 |
| So, that gives us the product of these 5 terms, | ▶ 03:15 |
| which we then have to sum over all values of E and A. | ▶ 03:21 |
| If we call this product f(e,a), | ▶ 03:24 |
| then the whole answer is the sum of F for all values of E and A, | ▶ 03:31 |
| so as the sum of 4 terms where each of the terms is a product of 5 numbers. | ▶ 03:43 |
| Where do we get the numbers to fill in this equation? | ▶ 03:51 |
| From the conditional probability tables from our model, | ▶ 03:54 |
| so let's put the equation back up, and we'll ask you for the case | ▶ 03:58 |
| where both E and A are positive | ▶ 04:03 |
| to look up in the conditional probability tables and fill in the numbers | ▶ 04:09 |
| for each of these 5 terms, and then multiply them together and fill in the product. | ▶ 04:14 |
| We get the answer by reading numbers off the conditional probability tables, | ▶ 00:00 |
| so probability of B being positive is 0.001. | ▶ 00:04 |
| Of E being positive, because we're dealing with the positive case now | ▶ 00:11 |
| for the variable E, is 0.002. | ▶ 00:16 |
| The probability of A being positive, because we're dealing with that case, | ▶ 00:22 |
| given that B is positive and the case for an E is positive, | ▶ 00:26 |
| that we can read off here as 0.95. | ▶ 00:30 |
| The probability that J is positive given that A is positive is 0.9. | ▶ 00:37 |
| And finally, the probability that M is positive given that A is positive | ▶ 00:44 |
| we read off here as 0.7. | ▶ 00:50 |
| We multiple all those together, it's going to be a small number | ▶ 00:54 |
| because we've got the .001 and the .002 here. | ▶ 00:57 |
| Can't quite fit it in the box, but it works out to .000001197. | ▶ 01:00 |
| That seems like a really small number, but remember, | ▶ 01:12 |
| we have to normalize by the P(+j,+m) term, | ▶ 01:14 |
| and this is only 1 of the 4 possibilities. | ▶ 01:19 |
| We have to enumerate over all 4 possibilities for E and A, | ▶ 01:22 |
| and in the end, it works out that the probability of the burglar alarm being true | ▶ 01:26 |
| given that John and Mary calls, is 0.284. | ▶ 01:32 |
| And we get that number because intuitively, | ▶ 01:38 |
| it seems that the alarm is fairly reliable. | ▶ 01:42 |
| John and Mary calling are very reliable, | ▶ 01:44 |
| but the prior probability of burglary is low. | ▶ 01:47 |
| And those 2 terms combine together to give us the 0.284 value | ▶ 01:49 |
| when we sum up each of the 4 terms of these products. | ▶ 01:54 |
| [Norvig] We've seen how to do enumeration to solve the inference problem | ▶ 00:00 |
| on belief networks. | ▶ 00:04 |
| For a simple network like the alarm network, that's all we need to know. | ▶ 00:06 |
| There's only 5 variables, so even if all 5 of them were hidden, | ▶ 00:10 |
| there would only be 32 rows in the table to sum up. | ▶ 00:14 |
| From a theoretical point of view, we're done. | ▶ 00:20 |
| But from a practical point of view, other networks could give us trouble. | ▶ 00:22 |
| Consider this network, which is one for determining insurance for car owners. | ▶ 00:26 |
| There are 27 different variables. | ▶ 00:35 |
| If each of the variables were boolean, that would give us over 100 million rows to sum out. | ▶ 00:38 |
| But in fact, some of the variables are non-boolean, | ▶ 00:44 |
| they have multiple values, and it turns out that representing this entire network | ▶ 00:46 |
| and doing enumeration we'd have to sum over a quadrillion rows. | ▶ 00:52 |
| That's just not practical, so we're going to have to come up with methods | ▶ 00:57 |
| that are faster than enumerating everything. | ▶ 01:01 |
| The first technique we can use to get a speed-up in doing inference on Bayes nets | ▶ 01:04 |
| is to pull out terms from the enumeration. | ▶ 01:09 |
| For example, here the probability of b is going to be the same for all values of E and a. | ▶ 01:13 |
| So we can take that term and move it out of the summation, | ▶ 01:20 |
| and now we have a little bit less work to do. | ▶ 01:26 |
| We can multiply by that term once rather than having it in each row of the table. | ▶ 01:28 |
| We can also move this term, the P of e, to the left of the summation over a, | ▶ 01:33 |
| because it doesn't depend on a. | ▶ 01:40 |
| By doing this, we're doing less work. | ▶ 01:43 |
| The inner loop of the summation now has only 3 terms rather than 5 terms. | ▶ 01:45 |
| So we've reduced the cost of doing each row of the table. | ▶ 01:50 |
| But we still have the same number of rows in the table, | ▶ 01:53 |
| so we're going to have to do better than that. | ▶ 01:57 |
| The next technique for efficient inference is to maximize independence of variables. | ▶ 02:00 |
| The structure of a Bayes net determines how efficient it is to do inference on it. | ▶ 02:08 |
| For example, a network that's a linear string of variables, | ▶ 02:12 |
| X1 through Xn, can have inference done in time proportional to the number n, | ▶ 02:17 |
| whereas a network that's a complete network | ▶ 02:27 |
| where every node points to every other node and so on could take time 2 to the n | ▶ 02:31 |
| if all n variables are boolean variables. | ▶ 02:40 |
| In the alarm network we saw previously, we took care | ▶ 02:45 |
| to make sure that we had all the independence relations represented | ▶ 02:50 |
| in the structure of the network. | ▶ 02:54 |
| But if we put the nodes together in a different order, | ▶ 02:57 |
| we would end up with a different structure. | ▶ 03:00 |
| Let's start by ordering the node John calls first | ▶ 03:03 |
| and then adding in the node Mary calls. | ▶ 03:09 |
| The question is, given just these 2 nodes and looking at the node for Mary calls, | ▶ 03:13 |
| is that node dependent or independent of the node for John calls? | ▶ 03:19 |
| [Norvig] The answer is that the node for Mary calls in this network | ▶ 00:01 |
| is dependent on John calls. | ▶ 00:05 |
| In the previous network, they were independent given that we knew that the alarm had occurred. | ▶ 00:08 |
| But here we don't know that the alarm had occurred, | ▶ 00:13 |
| and so the nodes are dependent | ▶ 00:16 |
| because having information about one will affect the information about the other. | ▶ 00:18 |
| [Norvig] Now we'll continue and we'll add the node A for alarm to the network. | ▶ 00:00 |
| And what I want you to do is click on all the other variables | ▶ 00:05 |
| that A is dependent on in this network. | ▶ 00:09 |
| [Norvig] The answer is that alarm is dependent on both John and Mary. | ▶ 00:01 |
| And so we can draw both nodes in, both arrows in. | ▶ 00:05 |
| Intuitively that makes sense because if John calls, | ▶ 00:09 |
| then it's more likely that the alarm has occurred, | ▶ 00:14 |
| likely as if Mary calls, and if both called, it's really likely. | ▶ 00:16 |
| So you can figure out the answer by intuitive reasoning, | ▶ 00:20 |
| or you can figure it out by going to the conditional probability tables | ▶ 00:23 |
| and seeing according to the definition of conditional probability | ▶ 00:27 |
| whether the numbers work out. | ▶ 00:31 |
| [Norvig] Now we'll continue and we'll add the node B for burglary | ▶ 00:01 |
| and ask again, click on all the variables that B is dependent on. | ▶ 00:05 |
| [Norvig] The answer is that B is dependent only on A. | ▶ 00:00 |
| In other words, B is independent of J and M given A. | ▶ 00:04 |
| [Norvig] And finally, we'll add the last node, E, | ▶ 00:00 |
| and ask you to click on all the nodes that E is dependent on. | ▶ 00:04 |
| [Norvig] And the answer is that E is dependent on A. | ▶ 00:00 |
| That much is fairly obvious. | ▶ 00:04 |
| But it's also dependent on B. | ▶ 00:06 |
| Now, why is that? | ▶ 00:08 |
| E is dependent on A because if the earthquake did occur, | ▶ 00:10 |
| then it's more likely that the alarm would go off. | ▶ 00:13 |
| On the other hand, E is also dependent on B | ▶ 00:16 |
| because if a burglary occurred, then that would explain why the alarm is going off, | ▶ 00:19 |
| and it would mean that the earthquake is less likely. | ▶ 00:23 |
| [Norvig] The moral is that Bayes nets tend to be the most compact | ▶ 00:00 |
| and thus the easier to do inference on when they're written in the causal direction-- | ▶ 00:04 |
| that is, when the networks flow from causes to effects. | ▶ 00:12 |
| Let's return to this equation, which we use to show how to do inference by enumeration. | ▶ 00:00 |
| In this equation, we join up the whole joint distribution | ▶ 00:06 |
| before we sum out over the hidden variables. | ▶ 00:10 |
| That's slow, because we end up repeating a lot of work. | ▶ 00:15 |
| Now we're going to show a new technique called variable elimination, | ▶ 00:18 |
| which in many networks operates much faster. | ▶ 00:25 |
| It's still a difficult computation, an NP-hard computation, | ▶ 00:27 |
| to do inference over Bayes nets in general. | ▶ 00:30 |
| Variable elimination works faster than inference by enumeration | ▶ 00:34 |
| in most practical cases. | ▶ 00:38 |
| It requires an algebra for manipulating factors, | ▶ 00:41 |
| which are just names for multidimensional arrays | ▶ 00:45 |
| that come out of these probabilistic terms. | ▶ 00:48 |
| We'll use another example to show how variable elimination works. | ▶ 00:53 |
| We'll start off with a network that has 3 boolean variables. | ▶ 00:57 |
| R indicates whether or not it's raining. | ▶ 01:00 |
| T indicates whether or not there's traffic, | ▶ 01:04 |
| and T is dependent on whether it's raining. | ▶ 01:12 |
| And finally, L indicates whether or not I'll be late for my next appointment, | ▶ 01:15 |
| and that depends on whether or not there's traffic. | ▶ 01:19 |
| Now we'll put up the conditional probability tables for each of these 3 variables. | ▶ 01:22 |
| And then we can use inference to figure out the answer to questions like | ▶ 01:29 |
| am I going to be late? | ▶ 01:35 |
| And we know by definition that we could do that through enumeration | ▶ 01:38 |
| by going through all the possible values for R and T | ▶ 01:42 |
| and summing up the product of these 3 nodes. | ▶ 01:47 |
| Now, in a simple network like this, straight enumeration would work fine, | ▶ 01:54 |
| but in a more complex network, what variable elimination does is give us a way | ▶ 01:59 |
| to combine together parts of the network into smaller parts | ▶ 02:03 |
| and then enumerate over those smaller parts and then continue combining. | ▶ 02:09 |
| So, we start with a big network. | ▶ 02:13 |
| We eliminate some of the variables. | ▶ 02:15 |
| We compute by marginalizing out, and then we have a smaller network to deal with, | ▶ 02:17 |
| and we'll show you how those 2 steps work. | ▶ 02:24 |
| The first operation in variable elimination is called joining factors. | ▶ 02:28 |
| A factor, again, is one of these tables. | ▶ 02:35 |
| It's a multidimensional matrix, and what we do is choose 2 of the factors, | ▶ 02:39 |
| 2 or more of the factors. | ▶ 02:43 |
| In this case, we'll choose these 2, and we'll combine them together | ▶ 02:45 |
| to form a new factor which represents | ▶ 02:49 |
| the joint probability of all the variables in that factor. | ▶ 02:52 |
| In this case, R and T. | ▶ 02:56 |
| Now we'll draw out that table. | ▶ 03:00 |
| In each case, we just look up in the corresponding table, | ▶ 03:03 |
| figure out the numbers, and multiply them together. | ▶ 03:06 |
| For example, in this row we have a +r and a +t, | ▶ 03:08 |
| so the +r is 0.1, and the entry for +r and +t is 0.8, | ▶ 03:13 |
| so multiply them together and you get 0.08. | ▶ 03:19 |
| Go all the way down. For example, in the last row we have a -r and a -t. | ▶ 03:22 |
| -r is 0.9. The entry for -r and -t is also 0.9. | ▶ 03:28 |
| Multiply those together and you get 0.81. | ▶ 03:34 |
| So, what have we done? | ▶ 03:40 |
| We used the operation of joining factors on these 2 factors, | ▶ 03:42 |
| getting us a new factor which is part of the existing network. | ▶ 03:45 |
| Now we want to apply a second operation called elimination, | ▶ 03:50 |
| also called summing out or marginalization, to take this table and reduce it. | ▶ 03:56 |
| Right now, the tables we have look like this. | ▶ 04:02 |
| We could sum out or marginalize over the variable R | ▶ 04:06 |
| to give us a table that just operates on T. | ▶ 04:10 |
| So, the question is to fill in this table for P(T)-- | ▶ 04:14 |
| there will be 2 entries in this table, the +t entry, formed by summing out | ▶ 04:20 |
| all the entries here for all values of r for which t is positive, | ▶ 04:23 |
| and the -t entry, formed the same way, by looking in this table | ▶ 04:28 |
| and summing up all the rows over all values of r where t is negative. | ▶ 04:32 |
| Put your answers in these boxes. | ▶ 04:37 |
| The answer is that for +t we look up the 2 possible values for r, | ▶ 00:00 |
| and we get 0.08 or 0.09. | ▶ 00:05 |
| Sum those up, get 0.17, | ▶ 00:09 |
| and then we look at the 2 possible values of R for -t, | ▶ 00:13 |
| and we get 0.02 and 0.81. | ▶ 00:18 |
| Add those up, and we get 0.83. | ▶ 00:22 |
| So, we took our network with RT and L. We summed out over R. | ▶ 00:00 |
| That gives us a new network with T and L | ▶ 00:04 |
| with these conditional probability tables. | ▶ 00:09 |
| And now we want to do a join over T and L | ▶ 00:13 |
| and give us a new table with the joint probability of P(T, L). | ▶ 00:17 |
| And that table is going to look like this. | ▶ 00:25 |
| The answer, again, for joining variables is determined by pointwise multiplication, | ▶ 00:00 |
| so we have 0.17 times 0.3 is 0.051, | ▶ 00:05 |
| +t and +l, 0.17 times 0.7 is 0.119. | ▶ 00:12 |
| Then we go to the minuses. | ▶ 00:21 |
| Minus 0.83 times 0.1 is 0.083. | ▶ 00:23 |
| And finally, 0.83 times 0.9 is 0.747. | ▶ 00:31 |
| Now we're down to a network with a single node, T, L, | ▶ 00:00 |
| with this joint probability table, and the only operation we have left to do | ▶ 00:06 |
| is to sum out to give us a node with just L in it. | ▶ 00:12 |
| So, the question is to compute P(L) for both values of L, | ▶ 00:17 |
| +l and -l. | ▶ 00:26 |
| The answer is that the +l values, | ▶ 00:00 |
| 0.051 plus 0.083 equals 0.134. | ▶ 00:03 |
| And the negative values, 0.119 plus 0.747 | ▶ 00:11 |
| equals 0.886. | ▶ 00:15 |
| No subtitles... | ▶ 00:00 |
| So, that's how variable elimination works. | ▶ 00:00 |
| It's a continued process of joining together factors | ▶ 00:03 |
| to form a larger factor and then eliminating variables by summing out. | ▶ 00:06 |
| If we make a good choice of the order in which we apply these operations, | ▶ 00:11 |
| then variable elimination can be much more efficient | ▶ 00:15 |
| than just doing the whole enumeration. | ▶ 00:18 |
| Now I want to talk about approximate inference | ▶ 00:00 |
| by means of sampling. | ▶ 00:07 |
| What do I mean by that? | ▶ 00:12 |
| Say we want to deal with a joint probability distribution, | ▶ 00:14 |
| say the distribution of heads and tails over these 2 coins. | ▶ 00:17 |
| We can build a table and then start counting by sampling. | ▶ 00:24 |
| Here we have our first sample. | ▶ 00:30 |
| We flip the coins and the one-cent piece came up heads, | ▶ 00:32 |
| and the five-cent piece came up tails, | ▶ 00:35 |
| so we would mark down one count. | ▶ 00:39 |
| Then we'd toss them again. | ▶ 00:42 |
| This time the five cents is heads, and the one cent is tails, | ▶ 00:45 |
| so we put down a count there, and we'd repeat that process | ▶ 00:50 |
| and keep repeating it until we got enough counts that we could estimate | ▶ 01:00 |
| the joint probability distribution by looking at the counts. | ▶ 01:06 |
| Now, if we do a small number of samples, the counts might not be very accurate. | ▶ 01:11 |
| There may be some random variation that causes them not to converge | ▶ 01:15 |
| to their true values, but as we add more counts, | ▶ 01:19 |
| the counts--as we add more samples, | ▶ 01:23 |
| the counts we get will come closer to the true distribution. | ▶ 01:25 |
| Thus, sampling has an advantage over inference in that we know a procedure | ▶ 01:29 |
| for coming up with at least an approximate value for the joint probability distribution, | ▶ 01:35 |
| as opposed to exact inference, where the computation may be very complex. | ▶ 01:42 |
| There's another advantage to sampling, which is if we don't know | ▶ 01:50 |
| what the conditional probability tables are, as we did in our other models, | ▶ 01:53 |
| if we don't know these numeric values, but we can simulate the process, | ▶ 01:59 |
| we can still proceed with sampling, whereas we couldn't with exact inference. | ▶ 02:04 |
| Here's a new network that we'll use to investigate | ▶ 00:00 |
| how sampling can be used to do inference. | ▶ 00:05 |
| In this network, we have 4 variables. They're all boolean. | ▶ 00:10 |
| Cloudy tells us if it's cloudy or not outside, | ▶ 00:14 |
| and that can have an effect on whether the sprinklers are turned on, | ▶ 00:17 |
| and whether it's raining. | ▶ 00:21 |
| And those 2 variables in turn have an effect on whether the grass gets wet. | ▶ 00:23 |
| Now, to do inference over this network using sampling, | ▶ 00:28 |
| we start off with a variable where all the parents are defined. | ▶ 00:34 |
| In this case, there's only one such variable, Cloudy. | ▶ 00:38 |
| And it's conditional probability table tells us that the probability is 50% for Cloudy, | ▶ 00:42 |
| 50% for not Cloudy, and so we sample from that. | ▶ 00:48 |
| We generate a random number, and let's say it comes up with positive for Cloudy. | ▶ 00:52 |
| Now that variable is defined, we can choose another variable. | ▶ 00:59 |
| In this case, let's choose Sprinkler, and we look at the rows in the table | ▶ 01:02 |
| for which Cloudy, the parent, is positive, and we see we should sample | ▶ 01:08 |
| with probability 10% to +s and 90% a -s. | ▶ 01:13 |
| And so let's say we do that sampling with a random number generator, | ▶ 01:19 |
| and it comes up negative for Sprinkler. | ▶ 01:23 |
| Now let's jump over here. Look at the Rain variable. | ▶ 01:26 |
| Again, the parent, Cloudy, is positive, | ▶ 01:29 |
| so we're looking at this part of the table. | ▶ 01:34 |
| We get a 0.8 probability for Rain being positive, | ▶ 01:38 |
| and a 0.2 probability for Rain being negative. | ▶ 01:41 |
| Let's say we sample that randomly, and it comes up Rain is positive. | ▶ 01:44 |
| And now we're ready to sample the final variable, | ▶ 01:51 |
| and what I want you to do is tell me which of the rows | ▶ 01:54 |
| of this table should we be considering and tell me what's more likely. | ▶ 02:01 |
| Is it more likely that we have a +w or a -w? | ▶ 02:07 |
| The answer to the question is that we look at the parents. | ▶ 00:00 |
| We find that the Sprinkler variable is negative, | ▶ 00:03 |
| so we're looking at this part of the table. | ▶ 00:06 |
| And the Rain variable is positive, so we're looking at this part. | ▶ 00:09 |
| So, it would be these 2 rows that we would consider, | ▶ 00:14 |
| and thus, we'd find there's a 0.9 probability for w, the grass being wet, | ▶ 00:18 |
| and only 0.1 for it being negative, | ▶ 00:25 |
| so the positive is more likely. | ▶ 00:28 |
| And once we've done that, then we generated a complete sample, | ▶ 00:31 |
| and we can write down the sample here. | ▶ 00:34 |
| We had +c, -s, +r. | ▶ 00:37 |
| And assuming we got a probability of 0.9 came out in favor of the +w, | ▶ 00:43 |
| that would be the end of the sample. | ▶ 00:51 |
| Then we could throw all this information out and start over again | ▶ 00:54 |
| by having another 50/50 choice for cloudy and then working our way through the network. | ▶ 00:59 |
| Now, the probability of sampling a particular variable, | ▶ 00:00 |
| choosing a +w or a -w, depends on the values of the parents. | ▶ 00:04 |
| But those are chosen according to the conditional probability tables, | ▶ 00:10 |
| so in the limit, the count of each sampled variable | ▶ 00:14 |
| will approach the true probability. | ▶ 00:18 |
| That is, with an infinite number of samples, this procedure computes the true | ▶ 00:20 |
| joint probability distribution. | ▶ 00:24 |
| We say that the sampling method is consistent. | ▶ 00:27 |
| We can use this kind of sampling to compute the complete joint probability distribution, | ▶ 00:33 |
| or we can use it to compute a value for an individual variable. | ▶ 00:38 |
| But what if we wanted to compute a conditional probability? | ▶ 00:43 |
| Say we wanted to compute the probability of wet grass | ▶ 00:47 |
| given that it's not cloudy. | ▶ 00:53 |
| To do that, the sample that we generated here wouldn't be helpful at all | ▶ 00:58 |
| because it has to do with being cloudy, not with being not cloudy. | ▶ 01:03 |
| So, we would cross this sample off the list. | ▶ 01:08 |
| We would say that we reject the sample, and this technique is called rejection sampling. | ▶ 01:11 |
| We go through ignoring any samples that don't match | ▶ 01:17 |
| the conditional probabilities that we're interested in | ▶ 01:21 |
| and keeping samples that do, say the sample -c, +s, +r, -w. | ▶ 01:24 |
| We would just continue going through generating samples, | ▶ 01:34 |
| crossing off the ones that don't match, keeping the ones that do. | ▶ 01:37 |
| And this procedure would also be consistent. | ▶ 01:41 |
| We call this procedure rejection sampling. | ▶ 01:46 |
| But there's a problem with rejection sampling. | ▶ 00:00 |
| If the evidence is unlikely, you end up rejecting a lot of the samples. | ▶ 00:03 |
| Let's go back to the alarm network where we had variables for burglary and for an alarm | ▶ 00:08 |
| and say when arrested, in computing the probability of a burglary, | ▶ 00:16 |
| given that the alarm goes off. | ▶ 00:22 |
| The problem is that burglaries are very infrequent, | ▶ 00:25 |
| so most of the samples we would get would end up being-- | ▶ 00:28 |
| we start with generating a B, and we get a -b and then a -a. | ▶ 00:32 |
| We go back and say does this match? | ▶ 00:39 |
| No, we have to reject this sample, | ▶ 00:43 |
| so we generate another sample, and we get another -b, -a. | ▶ 00:45 |
| We reject that. We get another -b, -a. | ▶ 00:50 |
| And we keep rejecting, and eventually we get a +b, | ▶ 00:54 |
| but we'd end up spending a lot of time rejecting samples. | ▶ 01:00 |
| So, we're going to introduce a new method called likelihood weighting | ▶ 01:04 |
| that generates samples so that we can keep every one. | ▶ 01:13 |
| With likelihood weighting, we fix the evidence variables. | ▶ 01:17 |
| That is, we say that A will always be positive, | ▶ 01:20 |
| and then we sample the rest of the variables, | ▶ 01:25 |
| so then we get samples that we want. | ▶ 01:28 |
| We would get a list like -b, +a, | ▶ 01:31 |
| -b, +a, | ▶ 01:37 |
| +b, +a. | ▶ 01:40 |
| We get to keep every sample, but we have a problem. | ▶ 01:42 |
| The resulting set of samples is inconsistent. | ▶ 01:46 |
| We can fix that, however, by assigning a probability | ▶ 01:52 |
| to each sample and weighing them correctly. | ▶ 01:56 |
| In likelihood weighting, we're going to be collecting samples just like before, | ▶ 00:00 |
| but we're going to add a probabilistic weight to each sample. | ▶ 00:05 |
| Now, let's say we want to compute the probability of rain | ▶ 00:11 |
| given that the sprinklers are on, and the grass is wet. | ▶ 00:17 |
| We start as before. | ▶ 00:22 |
| We make a choice for Cloudy, and let's say that, again, | ▶ 00:24 |
| we choose Cloudy being positive. | ▶ 00:30 |
| Now we want to make a choice for Sprinkler, | ▶ 00:33 |
| but we're constrained to always choose Sprinkler being positive, | ▶ 00:37 |
| so we'll make that choice. | ▶ 00:41 |
| And we know we were dealing with Cloudy being positive, | ▶ 00:44 |
| so we're in this row, and we're forced to make the choice of Sprinkler being positive, | ▶ 00:50 |
| and that has a probability of only 0.1, so we'll put that 0.1 into the weight. | ▶ 00:56 |
| Next, we'll look at the Rain variable, | ▶ 01:05 |
| and here we're not constrained in any way, so we make a choice | ▶ 01:09 |
| according to the probability tables with Cloudy being positive. | ▶ 01:13 |
| And let's say that we choose the more popular choice, and Rain gets the positive value. | ▶ 01:19 |
| Now, we look at Wet Grass. | ▶ 01:27 |
| We're constrained to choose positive, and we know that the parents | ▶ 01:30 |
| are also positive, so we're dealing with this row here. | ▶ 01:35 |
| Since it's a constrained choice, we're going to add in or multiply in an additional weight, | ▶ 01:41 |
| and I want you to tell me what that weight should be. | ▶ 01:47 |
| The answer is we're looking for the probability | ▶ 00:00 |
| of having a +w given a +s and a +r, | ▶ 00:04 |
| so that's in this row, so it's 0.99. | ▶ 00:09 |
| So, we take our old weight and multiply it by 0.99, | ▶ 00:16 |
| gives us a final weight of 0.099 | ▶ 00:22 |
| for a sample of +c, +s, +r and +w. | ▶ 00:28 |
| When we include the weights, | ▶ 00:00 |
| counting this sample that was forced to have a +s and a +w | ▶ 00:03 |
| with a weight of 0.099, instead of counting it as a full one sample, | ▶ 00:08 |
| we find that likelihood weighting is also consistent. | ▶ 00:14 |
| Likelihood weighting is a great technique, | ▶ 00:00 |
| but it doesn't solve all our problems. | ▶ 00:03 |
| Suppose we wanted to compute the probability of C given +s and +r. | ▶ 00:05 |
| In other words, we're constraining Sprinkler and Rain to always be positive. | ▶ 00:14 |
| Since we use the evidence when we generate a node that has that evidence as parents, | ▶ 00:21 |
| the Wet Grass node will always get good values based on that evidence. | ▶ 00:27 |
| But the Cloudy node won't, and so it will be generating values at random | ▶ 00:31 |
| without looking at these values, and most of the time, or some of the time, | ▶ 00:39 |
| it will be generating values that don't go well with the evidence. | ▶ 00:44 |
| Now, we won't have to reject them like we do in rejection sampling, | ▶ 00:48 |
| but they'll have a low probability associated with them. | ▶ 00:51 |
| A technique called Gibbs sampling, | ▶ 00:00 |
| named after the physicist Josiah Gibbs, | ▶ 00:07 |
| takes all the evidence into account and not just the upstream evidence. | ▶ 00:10 |
| It uses a method called Markov Chain Monte Carlo, or MCMC. | ▶ 00:14 |
| The idea is that we resample just one variable at a time | ▶ 00:26 |
| conditioned on all the others. | ▶ 00:31 |
| That is, we have a set of variables, | ▶ 00:33 |
| and we initialize them to random variables, keeping the evidence values fixed. | ▶ 00:37 |
| Maybe we have values like this, | ▶ 00:44 |
| and that constitutes one sample, and now, at each iteration through the loop, | ▶ 00:48 |
| we select just one non-evidence variable and resample it | ▶ 00:54 |
| based on all the other variables. | ▶ 01:01 |
| And that will give us another sample, and repeat that again. | ▶ 01:04 |
| Choose another variable. | ▶ 01:11 |
| Resample that variable and repeat. | ▶ 01:15 |
| We end up walking around in this space of assignments of variables randomly. | ▶ 01:21 |
| Now, in rejection and likelihood sampling, | ▶ 01:27 |
| each sample was independent of the other samples. | ▶ 01:30 |
| In MCMC, that's not true. | ▶ 01:34 |
| The samples are dependent on each other, and in fact, | ▶ 01:37 |
| adjacent samples are very similar. | ▶ 01:40 |
| They only vary or differ in one place. | ▶ 01:42 |
| However, the technique is still consistent. We won't show the proof for that. | ▶ 01:46 |
| Now, just one more thing. | ▶ 00:00 |
| I can't help but describe what is probably the most famous probability problem of all. | ▶ 00:02 |
| It's called the Monty Hall Problem after the game show host. | ▶ 00:07 |
| And the idea is that you're on a game show, and there's 3 doors: | ▶ 00:11 |
| door #1, door #2, and door #3. | ▶ 00:15 |
| And behind each door is a prize, and you know that one of the doors | ▶ 00:20 |
| contains an expensive sports car, which you would find desirable, | ▶ 00:26 |
| and the other 2 doors contain a goat, which you would find less desirable. | ▶ 00:29 |
| Now, say you're given a choice, and let's say you choose door #1. | ▶ 00:35 |
| But according to the conventions of the game, the host, Monty Hall, | ▶ 00:42 |
| will now open one of the doors, knowing that the door that he opens | ▶ 00:47 |
| contains a goat, and he shows you door #3. | ▶ 00:52 |
| And he now gives you the opportunity to stick with your choice | ▶ 00:57 |
| or to switch to the other door. | ▶ 01:02 |
| What I want you to tell me is, what is your probability of winning | ▶ 01:05 |
| if you stick to door #1, and what is the probability of winning | ▶ 01:10 |
| if you switched to door #2? | ▶ 01:15 |
| The answer is that you have a 1/3 chance of winning if you stick with door #1 | ▶ 00:00 |
| and a 2/3 chance if you switch to door #2. | ▶ 00:08 |
| How do we explain that, and why isn't it 50/50? | ▶ 00:12 |
| Well, it's true that there's 2 possibilities, | ▶ 00:16 |
| but we've learned from probability that just because there are 2 options | ▶ 00:18 |
| doesn't mean that both options are equally likely. | ▶ 00:22 |
| It's easier to explain why the first door has a 1/3 probability | ▶ 00:26 |
| because when you started, the car could be in any one of 3 places. | ▶ 00:30 |
| You chose one of them. That probability was 1/3. | ▶ 00:34 |
| And that probability hasn't been changed by the revealing of one of the other doors. | ▶ 00:37 |
| Why is door #2 two-thirds? | ▶ 00:43 |
| Well, one way to explain it is that the probability has to sum to 1, | ▶ 00:45 |
| and if 1/3 is here, the 2/3 has to be here. | ▶ 00:49 |
| But why doesn't the same argument that you use for 1 hold for 2? | ▶ 00:53 |
| Why can't we say the probability of 2 holding the car | ▶ 00:58 |
| was 1/3 before this door was revealed? | ▶ 01:03 |
| Why has that changed 2 and has not changed 1? | ▶ 01:07 |
| And the reason is because we've learned something about door #2. | ▶ 01:11 |
| We've learned that it wasn't the door that was flipped over by the host, | ▶ 01:14 |
| and so that additional information has updated the probability, | ▶ 01:18 |
| whereas we haven't learned anything additional about door #1 | ▶ 01:22 |
| because it was never an option that the host might switch door #1. | ▶ 01:26 |
| And in fact, in this case, if we reveal the door, | ▶ 01:30 |
| we find that's where the car actually is. | ▶ 01:37 |
| So you see, learning probability may end up winning you something. | ▶ 01:40 |
| Now, as a final epilogue, I have here a copy of a letter written by Monty Hall himself | ▶ 00:00 |
| in 1990 to Professor Lawrence Denenberg of Harvard | ▶ 00:07 |
| who, with Harry Lewis, wrote a statistics book | ▶ 00:10 |
| in which they used the Monty Hall Problem as an example, | ▶ 00:14 |
| and they wrote to Monty asking him for permission to use his name. | ▶ 00:18 |
| Monty kindly granted the permission, but in his letter, | ▶ 00:23 |
| he writes, "As I see it, it wouldn't make any difference after the player | ▶ 00:26 |
| has selected Door A, and having been shown Door C-- | ▶ 00:31 |
| why should he then attempt to switch to Door B? | ▶ 00:34 |
| So, we see Monty Hall himself did not understand the Monty Hall Problem. | ▶ 00:38 |
| [Thrun] Given the following Bayes network with P of A equal to 0.5, | ▶ 00:00 |
| P of B given the A equals 0.2, | ▶ 00:06 |
| and P of B given not A 0.8, | ▶ 00:08 |
| calculate the following probability. | ▶ 00:12 |
| [Thrun] Consider a network of the following type: | ▶ 00:00 |
| a variable, A, that is binary connects to three variables, X1, X2, and X3, | ▶ 00:03 |
| that are also binary. | ▶ 00:10 |
| The probability of A is 0.5, and for all variable XI we have the probability of XI given A is 0.2, | ▶ 00:12 |
| and the probability of XI given not A equals 0.6. | ▶ 00:24 |
| I would like to know from you the probability of A | ▶ 00:29 |
| given that we observed X1, X2, and not X3. | ▶ 00:31 |
| Notice that these variables over here are conditionally independent given A. | ▶ 00:37 |
| [Thrun] Let us consider the same network again. | ▶ 00:00 |
| I would like to know the probability of X3 given that I observed X1. | ▶ 00:03 |
| [Thrun] In this next homework assignment I will be drawing you a Bayes network | ▶ 00:00 |
| and will ask you some conditional independence questions. | ▶ 00:04 |
| Is B conditionally independent of C? And say yes or no. | ▶ 00:09 |
| Is B conditionally independent of C given D? And say yes or no. | ▶ 00:14 |
| Is B conditionally independent of C given A? And say yes or no. | ▶ 00:19 |
| And is B conditionally independent given A and D? And say yes or no. | ▶ 00:24 |
| [Thrun] Consider the following network. | ▶ 00:00 |
| I would like to know whether the following statements are true or false. | ▶ 00:02 |
| C is conditionally independent of E given A. | ▶ 00:08 |
| B is conditionally independent of D given C and E. | ▶ 00:12 |
| A is conditionally independent of C given E. | ▶ 00:18 |
| And A is conditionally independent of C given B. | ▶ 00:21 |
| Please check yes or no for each of these questions. | ▶ 00:25 |
| [Thrun] In my final question I'll look at the exact same network as before, | ▶ 00:00 |
| but I would like to know the minimum number of numerical parameters | ▶ 00:04 |
| such as the values to define probabilities and conditional probabilities | ▶ 00:08 |
| that are necessary to specify the joint distribution of all 5 variables. | ▶ 00:13 |
| [Thrun] The answer is 0.2, | ▶ 00:00 |
| and this follows directly from Bayes' rule. | ▶ 00:03 |
| In this formula, we can read off the first 2 values straight from the table over here, | ▶ 00:07 |
| and we expand the denominator by total probability. | ▶ 00:11 |
| Observing that this is exactly the same expression as up here, | ▶ 00:15 |
| we get 0.1 divided by 0.1 plus this expression over here can be copied from over here, | ▶ 00:19 |
| and P of not A is directly obtained up here. | ▶ 00:27 |
| Hence we get 0.5 over here, and as a result we get 0.2. | ▶ 00:30 |
| [Thrun] For this question we will be exploring a little trick | ▶ 00:00 |
| about non-normalized probability. | ▶ 00:03 |
| We will observe that P of A given X1, X2 and not X3, | ▶ 00:05 |
| the expression on the left can be resolved by Bayes' rule into this expression over here. | ▶ 00:11 |
| We will take X3 to the left and replace it by A, | ▶ 00:16 |
| both conditioned on the variables X1 and X2. | ▶ 00:20 |
| Then we have PA given X1, X2 divided by P not X3, X1, X2. | ▶ 00:23 |
| Next we employ 2 things. | ▶ 00:29 |
| One is the denominator does not depend on A, | ▶ 00:31 |
| so whether I put an A or not A has no bearing on any calculation here, | ▶ 00:34 |
| which means I can defer its calculation until later, and it will turn out to be important. | ▶ 00:39 |
| So I'm going to be proportional to just the stuff over here. | ▶ 00:44 |
| And second, I export my conditional independence | ▶ 00:49 |
| whereby I can omit X1 and X2 from the probability of not X3 conditioned on A. | ▶ 00:52 |
| These variables are conditionally independent. | ▶ 00:58 |
| This gives me the following recursion | ▶ 01:02 |
| where I now removed the third variable from the estimation problem | ▶ 01:05 |
| and just retained the first 2 relative to my initial expression. | ▶ 01:10 |
| If I keep expanding this, I get the following solution. | ▶ 01:14 |
| P of not X3 given A, P X2 given A, P X1 given A times P of A. | ▶ 01:19 |
| You might take a minute to just verify this, | ▶ 01:27 |
| but this is exploiting the conditional independence | ▶ 01:30 |
| very much as in the first step I showed you over here. | ▶ 01:32 |
| This step lacks the normalizer, | ▶ 01:35 |
| so let me work on the normalizer by expressing the opposite probability, | ▶ 01:38 |
| P of not A given the same events, X1, X2, and not X3, | ▶ 01:44 |
| which resolves to P of not X3 given not A, | ▶ 01:50 |
| P of X2 given not A, P of X1 given not A, | ▶ 01:54 |
| and P of not A. | ▶ 02:00 |
| I can now plug in the values from above. | ▶ 02:02 |
| So the first term gives me 0.8 times 0.2 times 0.2 times 0.5. | ▶ 02:04 |
| In the second term I get 0.4 times 0.6 times 0.6 times 0.5, | ▶ 02:15 |
| which resolves to 0.016 and 0.072. | ▶ 02:24 |
| This is clearly not a probability because we left out the normalizer. | ▶ 02:31 |
| But as we know, the normalizer does not depend on whether I put A or not A in here. | ▶ 02:36 |
| As a result, it will be the same for both of these expressions, | ▶ 02:40 |
| and I can obtain it by just adding these non-normalized probabilities | ▶ 02:44 |
| and then subsequently divide these non-normalized probabilities accordingly. | ▶ 02:47 |
| So let me just do this. | ▶ 02:52 |
| We get for the desired probability over here 0.1818 | ▶ 02:55 |
| and for the inverse probability over here 0.8182. | ▶ 03:01 |
| Our desired answer therefore is 0.1818. | ▶ 03:08 |
| This was not an easy question. | ▶ 03:14 |
| [Thrun] The answer is a little bit involved. | ▶ 00:00 |
| We use total probability to re-express this by bringing in A. | ▶ 00:03 |
| P of X3 given X1 is the sum of P of X3 given X1 and A | ▶ 00:08 |
| times P of A given X1 plus the A complement, which is X3, conditional X1 and not A | ▶ 00:15 |
| times P of not A given X1. | ▶ 00:22 |
| That is just total probability. | ▶ 00:24 |
| Next we utilized conditional independence by which we can simplify this expression | ▶ 00:26 |
| to drop X1 in the conditional variables | ▶ 00:30 |
| and we transform this expression by Bayes' rule again. | ▶ 00:33 |
| The same applies to the right side with not A replacing A. | ▶ 00:36 |
| All of those expressions over here can be found | ▶ 00:41 |
| either in the table up there or just by their complements, | ▶ 00:45 |
| with the exception of P of X1. | ▶ 00:49 |
| But P of X1 can again be just obtained by total probability, | ▶ 00:52 |
| which resolves to 0.2 times 0.5 plus 0.6 times 0.5, | ▶ 00:58 |
| which gives me 0.4. | ▶ 01:11 |
| We are now in a position to calculate the last term over here, which goes as follows. | ▶ 01:13 |
| This expression is 0.2 times 0.2 times 0.5 over 0.4 plus 0.6 times 0.6 times 0.5 over 0.4, | ▶ 01:19 |
| which gives us as a final result 0.5. | ▶ 01:36 |
| [Thrun] And the answer is as follows. | ▶ 00:00 |
| No, no, yes, and no. | ▶ 00:02 |
| B and C in the absence of any other information are dependent through A, | ▶ 00:06 |
| which is if you learn something about B, you can infer something about A, | ▶ 00:11 |
| and then we'll know more about C. | ▶ 00:17 |
| If you know D, that doesn't change a thing. | ▶ 00:20 |
| You can just take D out of the pool. | ▶ 00:22 |
| If you know A, B and C become conditionally independent. | ▶ 00:24 |
| This dependence goes away, and ignorance of D doesn't render B and C dependent. | ▶ 00:29 |
| However, if we add D back to the mix, | ▶ 00:36 |
| then knowledge of D will render B and C dependent by way of the explaining away effect. | ▶ 00:39 |
| [Thrun] So the correct answer is tricky in this case. | ▶ 00:00 |
| It is no, no, no, and yes. | ▶ 00:03 |
| The first one is straightforward. | ▶ 00:07 |
| C and E are conditionally independent based on D, | ▶ 00:09 |
| and knowledge of A doesn't change anything. | ▶ 00:13 |
| B and D are conditionally independent through A, | ▶ 00:15 |
| and knowledge of C or E doesn't change that. | ▶ 00:20 |
| A and C is interesting. | ▶ 00:23 |
| A and C is independent. But if you know D, they become dependent. | ▶ 00:25 |
| It turns out if you know E, you can know something about D, | ▶ 00:29 |
| and as a result, A and C become dependent through the explain away effect. | ▶ 00:32 |
| That doesn't apply if you know B. | ▶ 00:37 |
| Even though B tells you something about E, | ▶ 00:39 |
| it tells you nothing about D because B and D are independent. | ▶ 00:42 |
| Therefore, knowing B tells you nothing about D, | ▶ 00:46 |
| and the explain away effect does not occur between A and C. | ▶ 00:49 |
| The answer here is yes. | ▶ 00:52 |
| [Thrun] The correct answer is 16. | ▶ 00:00 |
| The probability of A and C require 1 parameter each. | ▶ 00:03 |
| The complement of not A and not C follows by 1 minus that parameter. | ▶ 00:06 |
| This guy over here requires 2 parameters. | ▶ 00:12 |
| You need to know the probability of B given A and B given not A. | ▶ 00:15 |
| The complements can be obtained easily. | ▶ 00:18 |
| The probability of D is conditioned on 2 variables which can take 4 possible values. | ▶ 00:20 |
| Hence the number is 4. | ▶ 00:24 |
| And E is conditioned on 3 variables, so it can take a total of 8 different values, | ▶ 00:26 |
| 2 to the 3rd, which is 8. | ▶ 00:30 |
| If you add 8 plus 4 plus 2 plus 1 plus 1, you get 16. | ▶ 00:32 |
| Welcome to the machine learning unit. | ▶ 00:00 |
| Machine learning is a fascinating area. | ▶ 00:03 |
| The world has become immeasurably data-rich. | ▶ 00:06 |
| The world wide web has come up over the last decade. | ▶ 00:09 |
| The human genome is being sequenced. | ▶ 00:12 |
| Vast chemical databases, pharmaceutical databases, | ▶ 00:15 |
| and financial databases are now available | ▶ 00:19 |
| on a scale unthinkable even 5 years ago. | ▶ 00:22 |
| To make sense out of the data, | ▶ 00:26 |
| to extract information from the data, | ▶ 00:28 |
| machine learning is the discipline to go. | ▶ 00:30 |
| Machine learning is an important subfeed of artificial intelligence, | ▶ 00:33 |
| it's my personal favorite next to robotics | ▶ 00:37 |
| because I believe it has a huge impact on society | ▶ 00:40 |
| and is absolutely necessary as we move forward. | ▶ 00:43 |
| So in this class, I teach you some of the very basics of | ▶ 00:47 |
| machine learning, and in our next unit | ▶ 00:50 |
| Peter will tell you some more about machine learning. | ▶ 00:52 |
| We'll talk about supervised learning, which is one side of machine learning, | ▶ 00:56 |
| and Peter will tell you about unsupervised learning, | ▶ 01:00 |
| which is a different style. | ▶ 01:02 |
| Later in this class we will also encounter reinforcement learning, | ▶ 01:05 |
| which is yet another set of machine learning. | ▶ 01:07 |
| Anyhow, let's just dive in. | ▶ 01:10 |
| Welcome to the first class on machine learning. | ▶ 00:00 |
| So far we talked a lot about Bayes Networks. | ▶ 00:03 |
| And the way we talked about them | ▶ 00:07 |
| is all about reasoning within Bayes Networks | ▶ 00:10 |
| that are known. | ▶ 00:14 |
| Machine learning addresses the problem | ▶ 00:15 |
| of how to find those networks | ▶ 00:17 |
| or other models | ▶ 00:19 |
| based on data. | ▶ 00:20 |
| Learning models from data | ▶ 00:22 |
| is a major, major area of artificial intelligence | ▶ 00:25 |
| and it's perhaps the one | ▶ 00:29 |
| that had the most commercial success. | ▶ 00:31 |
| In many commercial applications | ▶ 00:33 |
| the models themselves are fitted | ▶ 00:37 |
| based on data. | ▶ 00:39 |
| For example, Google | ▶ 00:40 |
| uses data to understand | ▶ 00:42 |
| how to respond to each search query. | ▶ 00:44 |
| Amazon uses data | ▶ 00:46 |
| to understand how to place products on their website. | ▶ 00:49 |
| And these machine learning techniques | ▶ 00:52 |
| are the enabling techniques that make that possible. | ▶ 00:53 |
| So this class | ▶ 00:56 |
| which is about supervised learning | ▶ 00:57 |
| will go through some very basic methods | ▶ 00:59 |
| for learning models from data | ▶ 01:02 |
| in particular, specific types of Bayes Networks. | ▶ 01:04 |
| We will complement this | ▶ 01:06 |
| with a class on unsupervised learning | ▶ 01:08 |
| that will be taught next | ▶ 01:10 |
| after this class. | ▶ 01:14 |
| Let me start off with a quiz. | ▶ 01:15 |
| The quiz is: What companies are famous | ▶ 01:18 |
| for machine learning using data? | ▶ 01:20 |
| Google for mining the web. | ▶ 01:24 |
| Netflix for mining what people | ▶ 01:29 |
| would like to rent on DVDs. | ▶ 01:31 |
| Which is DVD recommendations. | ▶ 01:36 |
| Amazon.com for product placement. | ▶ 01:40 |
| Check any or all | ▶ 01:45 |
| and if none of those apply | ▶ 01:47 |
| check down here. | ▶ 01:49 |
| And, not surprisingly, the answer is | ▶ 00:00 |
| all of those companies and many, many, many more | ▶ 00:03 |
| use massive machine learning for making decisions | ▶ 00:06 |
| that are really essential to the businesses. | ▶ 00:09 |
| Google mines the web and uses machine learning for translation, | ▶ 00:12 |
| as we've seen in the introductory level. Netflix has used | ▶ 00:15 |
| machine learning extensively for understanding what type of DVD to recommend to you next. | ▶ 00:18 |
| Amazon composes its entire product pages using | ▶ 00:22 |
| machine learning by understanding how customers | ▶ 00:25 |
| respond to different compositions and placements of their products, | ▶ 00:28 |
| and many, many other examples exist. | ▶ 00:31 |
| I would argue that in Silicon Valley, | ▶ 00:35 |
| at least half the companies dealing with customers and online products | ▶ 00:37 |
| do extensively use machine learning, | ▶ 00:41 |
| so it makes machine learning a really exciting discipline. | ▶ 00:43 |
| In my own research, I've extensively used machine learning for robotics. | ▶ 00:00 |
| What you see here is a robot my students and I built at Stanford | ▶ 00:05 |
| called Stanley, and it won the DARPA Grand Challenge. | ▶ 00:08 |
| It's a self-driving car that drives without any human assistance whatsoever, | ▶ 00:12 |
| and this vehicle extensively uses machine learning. | ▶ 00:16 |
| The robot is equipped with a laser system | ▶ 00:22 |
| I will talk more about lasers in my robotics class, | ▶ 00:25 |
| but here you can see how the robot is able to build | ▶ 00:28 |
| 3-D models of the terrain ahead. | ▶ 00:31 |
| These are almost like video game models that allow it to make | ▶ 00:34 |
| assessments where to drive and where not to drive. | ▶ 00:37 |
| Essentially, it's trying to drive on flat ground. | ▶ 00:39 |
| The problem with these lasers is that they don't see very far. | ▶ 00:43 |
| They see about 25 meters out, so to drive really fast | ▶ 00:46 |
| the robot has to see further. | ▶ 00:50 |
| This is where machine learning comes into play. | ▶ 00:53 |
| What you see here is camera images delivered by the robot | ▶ 00:56 |
| superimposed with laser data that doesn't see very far, | ▶ 00:58 |
| but the laser is good enough to extract samples | ▶ 01:01 |
| of driveable road surface that can then be machine learned | ▶ 01:04 |
| and extrapolated into the entire camera image. | ▶ 01:08 |
| That enables the robot to use the camera | ▶ 01:10 |
| to see driveable terrain all the way to the horizon | ▶ 01:13 |
| up to like 200 meters out, enough to drive really, really fast. | ▶ 01:16 |
| This ability to adapt its vision by driving its own training examples using lasers | ▶ 01:22 |
| but seeing out 200 meters or more | ▶ 01:27 |
| was a key factor in winning the race. | ▶ 01:30 |
| Machine learning is a very large field | ▶ 00:00 |
| with many different methods | ▶ 00:03 |
| and many different applications. | ▶ 00:04 |
| I will now define some of the very basic terminology | ▶ 00:06 |
| that is being used to distinguish | ▶ 00:10 |
| different machine learning methods. | ▶ 00:12 |
| Let's start with the what. | ▶ 00:13 |
| What is being learned? | ▶ 00:17 |
| You can learn parameters | ▶ 00:19 |
| like the probabilities of a Bayes Network. | ▶ 00:23 |
| You can learn structure | ▶ 00:26 |
| like the arc structure of a Bayes Network. | ▶ 00:27 |
| And you might even discover hidden concepts. | ▶ 00:31 |
| For example | ▶ 00:34 |
| you might find that certain training example | ▶ 00:35 |
| form a hidden group. | ▶ 00:37 |
| For example Netflix | ▶ 00:39 |
| you might find that there's different types of customers | ▶ 00:41 |
| some that care about classic movies | ▶ 00:43 |
| some of them care about modern movies | ▶ 00:45 |
| and those might form hidden concepts | ▶ 00:47 |
| whose discovery can really help you | ▶ 00:49 |
| make better sense of the data. | ▶ 00:51 |
| Next is what from? | ▶ 00:53 |
| Every machine learning method | ▶ 00:57 |
| is driven by some sort of target information | ▶ 01:00 |
| that you care about. | ▶ 01:02 |
| In supervised learning | ▶ 01:03 |
| which is the subject of today's class | ▶ 01:06 |
| we're given specific target labels | ▶ 01:08 |
| and I give you examples just in a second. | ▶ 01:10 |
| We also talk about unsupervised learning | ▶ 01:13 |
| where target labels are missing | ▶ 01:15 |
| and we use replacement principles | ▶ 01:19 |
| to find, for example | ▶ 01:21 |
| hidden concepts. | ▶ 01:22 |
| Later there will be a class in reinforcement learning | ▶ 01:24 |
| when an agent learns from feedback with the physical environment | ▶ 01:27 |
| by interacting and trying actions | ▶ 01:32 |
| and receiving some sort of evaluation | ▶ 01:34 |
| from the environment | ▶ 01:37 |
| like "Well done" or "That works." | ▶ 01:37 |
| Again, we will talk about those in detail later. | ▶ 01:41 |
| There's different things you could try to do | ▶ 01:43 |
| with machine learning technique. | ▶ 01:46 |
| You might care about prediction. | ▶ 01:48 |
| For example you might want to care about what's going to happen with the future | ▶ 01:49 |
| in the stockmarket for example. | ▶ 01:53 |
| You might care to diagnose something | ▶ 01:55 |
| which is you get data and you wish to explain it | ▶ 01:57 |
| and you use machine learning for that. | ▶ 01:59 |
| Sometimes your objective is to summarize something. | ▶ 02:01 |
| For example if you read a long article | ▶ 02:04 |
| your machine learning method might aim to | ▶ 02:07 |
| produce a short article that summarizes the long article. | ▶ 02:09 |
| And there's many, many, many more different things. | ▶ 02:12 |
| You can talk about the how to learn. | ▶ 02:14 |
| We use the word passive | ▶ 02:16 |
| if your learning agent is just an observer | ▶ 02:19 |
| and has no impact on the data itself. | ▶ 02:23 |
| Otherwise, you call it active. | ▶ 02:24 |
| Sometimes learning occurs online | ▶ 02:26 |
| which means while the data is being generated | ▶ 02:30 |
| and some of it is offline | ▶ 02:32 |
| which means learning occurs | ▶ 02:35 |
| after the data has been generated. | ▶ 02:37 |
| There's different types of outputs | ▶ 02:39 |
| of a machine learning algorithm. | ▶ 02:42 |
| Today we'll talk about classification | ▶ 02:44 |
| versus regression. | ▶ 02:47 |
| In classification the output is binary | ▶ 02:50 |
| or a fixed number of classes | ▶ 02:53 |
| for example something is either a chair or not. | ▶ 02:55 |
| Regression is continuous. | ▶ 02:57 |
| The temperature might be 66.5 degrees | ▶ 02:59 |
| in our prediction. | ▶ 03:01 |
| And there's tons of internal details | ▶ 03:03 |
| we will talk about. | ▶ 03:05 |
| Just to name one. | ▶ 03:07 |
| We will distinguish generative | ▶ 03:09 |
| from discriminative. | ▶ 03:12 |
| Generative seeks to model the data | ▶ 03:14 |
| as generally as possible | ▶ 03:16 |
| versus discriminative methods | ▶ 03:18 |
| seek to distinguish data | ▶ 03:20 |
| and this might sound like a superficial distinction | ▶ 03:21 |
| but it has enormous ramification | ▶ 03:24 |
| on the learning algorithm. | ▶ 03:26 |
| Now to tell you the truth | ▶ 03:27 |
| it took me many years | ▶ 03:29 |
| to fully learn all these words here | ▶ 03:30 |
| and I don't expect you to pick them all up | ▶ 03:33 |
| in one class | ▶ 03:36 |
| but you should as well know that they exist. | ▶ 03:37 |
| And as they come up | ▶ 03:39 |
| I'll emphasize them | ▶ 03:41 |
| so you can resort any learning method | ▶ 03:42 |
| I tell you back into the specific taxonomy over here. | ▶ 03:44 |
| The vast amount of work in the field | ▶ 00:00 |
| falls into the area of supervised learning. | ▶ 00:02 |
| In supervised learning | ▶ 00:06 |
| you're given for each training example | ▶ 00:08 |
| a feature vector | ▶ 00:10 |
| and a target label named Y. | ▶ 00:13 |
| For example, for a credit rating agency | ▶ 00:16 |
| X1, X2, X3 might be a feature | ▶ 00:20 |
| such as is the person employed? | ▶ 00:23 |
| What is the salary of the person? | ▶ 00:25 |
| Has the person previously defaulted on a credit card? | ▶ 00:27 |
| And so on. | ▶ 00:30 |
| And Y is a predictor | ▶ 00:32 |
| whether the person is to default | ▶ 00:34 |
| on the credit or not. | ▶ 00:36 |
| Now machine learning | ▶ 00:38 |
| is to be carried out on past data | ▶ 00:40 |
| where the credit rating agency | ▶ 00:42 |
| might have collected features just like these | ▶ 00:44 |
| and actual occurances of default or not. | ▶ 00:46 |
| What it wishes to produce | ▶ 00:49 |
| is a function that allows us | ▶ 00:51 |
| to predict future customers. | ▶ 00:53 |
| So the new person comes in | ▶ 00:55 |
| with a different feature vector. | ▶ 00:56 |
| Can we predict as good as possible | ▶ 00:58 |
| the functional relationship | ▶ 01:00 |
| between these features X1 to Xn all the way to Y? | ▶ 01:02 |
| You can apply the exact same example | ▶ 01:05 |
| in image recognition | ▶ 01:08 |
| where X might be pixels of images | ▶ 01:09 |
| or it might be features of things found in images | ▶ 01:11 |
| and Y might be a label that says | ▶ 01:14 |
| whether a certain object is contained | ▶ 01:16 |
| in an image or not. | ▶ 01:17 |
| Now in supervised learning | ▶ 01:19 |
| you're given many such examples. | ▶ 01:20 |
| X21 to X2n | ▶ 01:25 |
| leads to Y2 | ▶ 01:28 |
| all way the index m. | ▶ 01:32 |
| This is called your data. | ▶ 01:35 |
| If we call each input vector Xm | ▶ 01:38 |
| and we wish to find out the function | ▶ 01:43 |
| given any Xm or any future vector X | ▶ 01:44 |
| produces as close as possible | ▶ 01:50 |
| my target signal Y. | ▶ 01:53 |
| Now this isn't always possible | ▶ 01:55 |
| and sometimes it's acceptable | ▶ 01:57 |
| in fact preferable | ▶ 01:59 |
| to tolerate a certain amount of error | ▶ 02:00 |
| in your training data. | ▶ 02:03 |
| But the subject of machine learning | ▶ 02:05 |
| is to identify this function over here. | ▶ 02:07 |
| And once you identify it | ▶ 02:10 |
| you can use it for future Xs | ▶ 02:11 |
| that weren't part of the training set | ▶ 02:13 |
| to produce a prediction | ▶ 02:16 |
| that hopefully is really, really good. | ▶ 02:19 |
| So let me ask you a question. | ▶ 02:21 |
| And this is a question | ▶ 02:24 |
| for which I haven't given you the answer | ▶ 02:27 |
| but I'd like to appeal to your intuition. | ▶ 02:28 |
| Here's one data set | ▶ 02:31 |
| where the X is one dimensionally plotted horizontally | ▶ 02:34 |
| and the Y is vertically | ▶ 02:37 |
| and suppose there looks like this. | ▶ 02:39 |
| Suppose my machine learning algorithm | ▶ 02:44 |
| gives me 2 hypotheses. | ▶ 02:45 |
| One is this function over here | ▶ 02:47 |
| which is a linear function | ▶ 02:51 |
| and one is this function over here. | ▶ 02:52 |
| I'd like to know which of the functions | ▶ 02:53 |
| you find preferable | ▶ 02:57 |
| as an explanation for the data. | ▶ 02:59 |
| Is it function A? | ▶ 03:01 |
| Or function B? | ▶ 03:02 |
| Check here for A | ▶ 03:06 |
| here for B | ▶ 03:08 |
| and here for neither. | ▶ 03:09 |
| And I hope you guessed function A. | ▶ 00:00 |
| Even though both perfectly describe the data | ▶ 00:04 |
| B is much more complex than A. | ▶ 00:08 |
| In fact, outside the data | ▶ 00:10 |
| B seems to go to a minus infinity much faster | ▶ 00:12 |
| than these data points | ▶ 00:16 |
| and to plus infinity much faster | ▶ 00:17 |
| with these data points over here. | ▶ 00:19 |
| And in between | ▶ 00:21 |
| we have wide oscillations | ▶ 00:22 |
| that don't correspond to any data. | ▶ 00:23 |
| So I would argue | ▶ 00:25 |
| A is preferable. | ▶ 00:27 |
| The reason why I asked this question | ▶ 00:31 |
| is because of something called Occam's Razor. | ▶ 00:32 |
| Occam can be spelled in many different ways. | ▶ 00:35 |
| And what Occam says is that | ▶ 00:38 |
| everything else being equal | ▶ 00:41 |
| chose the less complex hypothesis. | ▶ 00:43 |
| Now in practice | ▶ 00:46 |
| there's actually a trade-off | ▶ 00:48 |
| between a really good data fit | ▶ 00:50 |
| and low complexity. | ▶ 00:53 |
| Let me illustrate this to you | ▶ 00:55 |
| by a hypothetical example. | ▶ 00:58 |
| Consider the following graph | ▶ 00:59 |
| where the horizontal axis graphs | ▶ 01:02 |
| complexity of the solution. | ▶ 01:04 |
| For example, if you use polynomials | ▶ 01:07 |
| this might be a high-degree polynomial over here | ▶ 01:10 |
| and maybe a linear function over here | ▶ 01:12 |
| which is a low-degree polynomial | ▶ 01:14 |
| your training data error | ▶ 01:16 |
| tends to go like this. | ▶ 01:19 |
| The more complex the hypothesis you allow | ▶ 01:22 |
| the more you can just fit your data. | ▶ 01:25 |
| However, in reality | ▶ 01:29 |
| your generalization error on unknown data | ▶ 01:31 |
| tends to go like this. | ▶ 01:33 |
| It is the sum of the training data error | ▶ 01:37 |
| and another function | ▶ 01:40 |
| which is called the overfitting error. | ▶ 01:42 |
| Not surprisingly | ▶ 01:46 |
| the best complexity is obtained | ▶ 01:47 |
| where the generalization error is minimum. | ▶ 01:49 |
| There are methods | ▶ 01:52 |
| to calculate the overfitting error. | ▶ 01:53 |
| They go into a statistical field | ▶ 01:55 |
| under the name Bayes variance methods. | ▶ 01:57 |
| However, in practice | ▶ 02:01 |
| you're often just given the training data error. | ▶ 02:02 |
| You find if you don't find the model | ▶ 02:04 |
| that minimizes the training data error | ▶ 02:08 |
| but instead pushes back the complexity | ▶ 02:11 |
| your algorithm tends to perform better | ▶ 02:14 |
| and that is something we will study a little bit | ▶ 02:17 |
| in this class. | ▶ 02:20 |
| However, this slide is really important | ▶ 02:22 |
| for anybody doing machine learning in practice. | ▶ 02:26 |
| If you deal with data | ▶ 02:29 |
| and you have ways to fit your data | ▶ 02:31 |
| be aware that overfitting | ▶ 02:33 |
| is a major source of poor performance | ▶ 02:36 |
| of a machine learning algorithm. | ▶ 02:39 |
| And I give you examples in just one second. | ▶ 02:41 |
| So a really important example | ▶ 00:00 |
| of machine learning is SPAM detection. | ▶ 00:02 |
| We all get way too much email | ▶ 00:04 |
| and a good number of those are SPAM. | ▶ 00:06 |
| Here are 3 examples of email. | ▶ 00:08 |
| Dear Sir: First I must solicit your confidence | ▶ 00:12 |
| in this transaction, this is by virtue of its nature | ▶ 00:14 |
| being utterly confidential and top secret... | ▶ 00:16 |
| This is likely SPAM. | ▶ 00:19 |
| Here's another one. | ▶ 00:22 |
| In upper caps. | ▶ 00:23 |
| 99 MILLION EMAIL ADDRESSES FOR ONLY $99 | ▶ 00:25 |
| This is very likely SPAM. | ▶ 00:28 |
| And here's another one. | ▶ 00:31 |
| Oh, I know it's blatantly OT | ▶ 00:33 |
| but I'm beginning to go insane. | ▶ 00:35 |
| Had an old Dell Dimension XPS sitting in the corner | ▶ 00:37 |
| and decided to put it to use. | ▶ 00:40 |
| And so on and so on. | ▶ 00:41 |
| Now this is likely not SPAM. | ▶ 00:42 |
| How can a computer program | ▶ 00:45 |
| distinguish between SPAM and not SPAM? | ▶ 00:47 |
| Let's use this as an example | ▶ 00:49 |
| to talk about machine learning for discrimination | ▶ 00:51 |
| using Bayes Networks. | ▶ 00:55 |
| In SPAM detection | ▶ 00:59 |
| we get an email | ▶ 01:01 |
| and we wish to categorize it | ▶ 01:03 |
| either as SPAM | ▶ 01:05 |
| in which case we don't even show as to the where | ▶ 01:07 |
| or what we call HAM | ▶ 01:10 |
| which is the technical word for | ▶ 01:12 |
| an email worth passing on to the person being emailed. | ▶ 01:15 |
| So the function over here | ▶ 01:19 |
| is the function we're trying to learn. | ▶ 01:21 |
| Most SPAM filters use human input. | ▶ 01:23 |
| When you go through email | ▶ 01:26 |
| you have a button called IS SPAM | ▶ 01:28 |
| which allows you as a user to flag SPAM | ▶ 01:32 |
| and occasionally you will say an email is SPAM. | ▶ 01:34 |
| If you look at this | ▶ 01:37 |
| you have a typical supervised machine learning situation | ▶ 01:40 |
| where the input is an email | ▶ 01:43 |
| and the output is whether you flag it as SPAM | ▶ 01:45 |
| or if we don't flag it | ▶ 01:47 |
| we just think it's HAM. | ▶ 01:49 |
| Now to make this amenable to | ▶ 01:52 |
| a machine learning algorithm | ▶ 01:54 |
| we have to talk about how to represent emails. | ▶ 01:55 |
| They're all using different words and different characters | ▶ 01:57 |
| and they might have different graphics included. | ▶ 02:00 |
| Let's pick a representation that's easy to process. | ▶ 02:02 |
| And this representation is often called | ▶ 02:06 |
| Bag of Words. | ▶ 02:09 |
| Bag of Words is a representation | ▶ 02:10 |
| of a document | ▶ 02:14 |
| that just counts the frequency | ▶ 02:15 |
| of words. | ▶ 02:17 |
| If an email were to say Hello | ▶ 02:18 |
| I will say Hello. | ▶ 02:22 |
| The Bag of Words representation | ▶ 02:24 |
| is the following. | ▶ 02:26 |
| 2-1-1-1 | ▶ 02:27 |
| for the dictionary | ▶ 02:31 |
| that contains the 4 words | ▶ 02:33 |
| Hello I will say. | ▶ 02:36 |
| Now look at the subtlety here. | ▶ 02:38 |
| Rather than representing each individual word | ▶ 02:41 |
| we have a count of each word | ▶ 02:43 |
| and the count is oblivious | ▶ 02:46 |
| to the order in which the words were stated. | ▶ 02:49 |
| A Bag of Words representation | ▶ 02:52 |
| relative to a fixed dictionary | ▶ 02:55 |
| represents the counts of each word | ▶ 02:57 |
| relative to the words in the dictionary. | ▶ 03:01 |
| If you were to use a different dictionary | ▶ 03:03 |
| like hello and good-bye | ▶ 03:06 |
| our counts would be | ▶ 03:08 |
| 2 and 0. | ▶ 03:10 |
| However, in most cases | ▶ 03:13 |
| you make sure that all the words found | ▶ 03:14 |
| in messages | ▶ 03:17 |
| are actually included in the dictionary. | ▶ 03:18 |
| So the dictionary might be very, very large. | ▶ 03:19 |
| Let me make up an unofficial example | ▶ 03:22 |
| of a few SPAM and a few HAM messages. | ▶ 03:25 |
| Offer is secret. | ▶ 03:30 |
| Click secret link. | ▶ 03:32 |
| Secret sports link. | ▶ 03:35 |
| Obviously those are contrived | ▶ 03:37 |
| and I tried to retain the recovery | ▶ 03:40 |
| to a small number of words | ▶ 03:42 |
| to make this example workable. | ▶ 03:44 |
| In practice we need thousands | ▶ 03:46 |
| of such messages | ▶ 03:47 |
| to get good information. | ▶ 03:48 |
| Play sports today. | ▶ 03:50 |
| Went play sports. | ▶ 03:52 |
| Secret sports event. | ▶ 03:54 |
| Sport is today. | ▶ 03:56 |
| Sport costs money. | ▶ 03:59 |
| My first quiz is | ▶ 04:02 |
| What is the size of the vocabulary | ▶ 04:06 |
| that contains all words in these messages? | ▶ 04:08 |
| Please enter the value in this box over here. | ▶ 04:12 |
| Well let's count. | ▶ 00:00 |
| Offer is secret click. | ▶ 00:02 |
| Secret occurs over here already | ▶ 00:08 |
| so we don't have to count it twice. | ▶ 00:10 |
| Link, sports, play, today, went, event | ▶ 00:12 |
| costs money. | ▶ 00:18 |
| So the answer is | ▶ 00:20 |
| 12. | ▶ 00:22 |
| There's 12 different words | ▶ 00:24 |
| contained in these 8 messages. | ▶ 00:26 |
| [Narrator] Another quiz. | ▶ 00:00 |
| What is the probability that a random message | ▶ 00:03 |
| that arrives to fall into the spam bucket? | ▶ 00:06 |
| Assuming that those messages | ▶ 00:09 |
| are all drawn at random. | ▶ 00:11 |
| [writing on page] | ▶ 00:13 |
| [Narrator] And the answer is: | ▶ 00:00 |
| there's 8 different messages | ▶ 00:02 |
| of which 3 are spam. | ▶ 00:04 |
| So the maximum likelihood estimate | ▶ 00:06 |
| is 3/8. | ▶ 00:09 |
| [writing on paper] | ▶ 00:11 |
| So, let's look at this a little bit more formally and talk about maximum likelihood. | ▶ 00:00 |
| Obviously, we're observing 8 messages: spam, spam, spam, and 5 times ham. | ▶ 00:03 |
| And what we care about is what's our prior probability of spam | ▶ 00:12 |
| that maximizes the likelihood of this data? | ▶ 00:17 |
| So, let's assume we're going to assign a value of pi to this, | ▶ 00:20 |
| and we wish to find the pi that maximizes the likelihood of this data over here, | ▶ 00:24 |
| assuming that each email is drawn independently | ▶ 00:29 |
| according to an identical distribution. | ▶ 00:33 |
| The probability of the p(yi) data item is then pi if yi = spam, | ▶ 00:37 |
| and 1 - pi if yi = ham. | ▶ 00:48 |
| If we rewrite the data as 1, 1, 1, 0, 0, 0, 0, 0, | ▶ 00:53 |
| we can write p(yi) as follows: pi to the yi times (1 - pi) to the 1 - yi. | ▶ 00:59 |
| It's not that easy to see that this is equivalent, | ▶ 01:13 |
| but say yi = 1. | ▶ 01:16 |
| Then this term will fall out. | ▶ 01:19 |
| It's not proficient by 1 because the exponent is zero, and we get pi as over here. | ▶ 01:22 |
| If yi = 0, then this term falls out, and this one here becomes 1 - pi as over here. | ▶ 01:28 |
| Now assuming independence, we get for the entire data set | ▶ 01:36 |
| that the joint probability of all data items is the product | ▶ 01:44 |
| of the individual data items over here, | ▶ 01:49 |
| which can now be written as follows: | ▶ 01:52 |
| pi to the count of instances where yi = 1 times | ▶ 01:56 |
| 1 - pi to the count of the instances where yi = 0. | ▶ 02:03 |
| And we know in our example, this count over here is 3, | ▶ 02:09 |
| and this count over here is 5, so we get pi to the 3rd times 1 - pi to the 5th. | ▶ 02:13 |
| We now wish to find the pi that maximizes this expression over here. | ▶ 02:22 |
| We can also maximize the logarithm of this expression, | ▶ 02:28 |
| which is 3 times log pi + 5 times log (1 - pi) | ▶ 02:33 |
| Optimizing the log is the same as optimizing p because the log is monotonic to p. | ▶ 02:42 |
| The maximum of this function is attained with a derivative of 0, | ▶ 02:50 |
| so let's compute with a derivative and set it to 0. | ▶ 02:54 |
| This is the derivative, 3 over pi - 5 over 1 - pi. | ▶ 03:00 |
| We now bring this expression to the right side, | ▶ 03:05 |
| multiply the denominators up, and sort all the expressions containing pi to the left, | ▶ 03:09 |
| which gives us pi = 3/8, exactly the number we were at before. | ▶ 03:18 |
| We just derived mathematically that the data likelihood maximizing number | ▶ 03:26 |
| for the probability is indeed the empirical count, | ▶ 03:33 |
| which means when we looked at this quiz before | ▶ 03:37 |
| and we said a maximum likelihood for the prior probability of spam is 3/8, | ▶ 03:41 |
| by simply counting 3 over 8 emails were spam, | ▶ 03:49 |
| we actually followed proper mathematical principles | ▶ 03:54 |
| to do maximum likelihood estimation. | ▶ 03:57 |
| Now, you might not fully have gotten the derivation of this, | ▶ 03:59 |
| and I recommend you to watch it again, but it's not that important | ▶ 04:03 |
| for the progress in this class. | ▶ 04:07 |
| So, here's another quiz. | ▶ 04:09 |
| I'd like the maximum likelihood, or ML solutions, | ▶ 04:11 |
| for the following probabilities. | ▶ 04:17 |
| The probability that the word "secret" comes up, | ▶ 04:19 |
| assuming that we already know a message is spam, | ▶ 04:21 |
| and the probability that the same word "secret" comes up | ▶ 04:25 |
| if we happen to know the message is not spam, it's ham. | ▶ 04:28 |
| And just as before | ▶ 00:00 |
| we count the word secret | ▶ 00:02 |
| in SPAM and in HAM | ▶ 00:04 |
| as I've underlined here. | ▶ 00:06 |
| Three out of 9 words in SPAM | ▶ 00:07 |
| are the word secret | ▶ 00:11 |
| so we have a third over here | ▶ 00:12 |
| or 0.333 | ▶ 00:14 |
| and only 1 out of all the 15 words in HAM | ▶ 00:18 |
| are secret | ▶ 00:21 |
| so you get a fifteenth | ▶ 00:22 |
| or 0.0667. | ▶ 00:23 |
| By now, you might have recognized what we're really building up is a Bayes network | ▶ 00:00 |
| where the parameters of the Bayes networks are estimated using supervised learning | ▶ 00:06 |
| by a maximum likelihood estimator based on training data. | ▶ 00:10 |
| The Bayes network has at its root an unobservable variable called spam, | ▶ 00:15 |
| which is binary, and it has as many children as there are words in a message, | ▶ 00:20 |
| where each word has an identical conditional distribution | ▶ 00:28 |
| of the word occurrence given the class spam or not spam. | ▶ 00:33 |
| If you write on our dictionary over here, | ▶ 00:39 |
| you might remember the dictionary had 12 different words, | ▶ 00:42 |
| so here is 5 of the 12, offer, is, secret, click and sports. | ▶ 00:48 |
| Then for the spam class, we found the probability of secret given spam is 1/3, | ▶ 00:52 |
| and we also found that the probability of secret given ham is 1/15, | ▶ 00:59 |
| so here's a quiz. | ▶ 01:05 |
| Assuming a vocabulary size of 12, or put differently, | ▶ 01:07 |
| the dictionary has 12 words, how many parameters | ▶ 01:12 |
| do we need to specify this Bayes network? | ▶ 01:16 |
| And the correct answer is 23. | ▶ 00:00 |
| We need 1 parameter for the prior p (spam), | ▶ 00:03 |
| and then we have 2 dictionary distributions of any word, | ▶ 00:07 |
| i given spam, and the same for ham. | ▶ 00:12 |
| Now, there's 12 words in a dictionary, | ▶ 00:16 |
| but this distribution only needs 11 parameters, | ▶ 00:18 |
| so 12 can be figured out because they have to add up to 1. | ▶ 00:20 |
| And the same is true over here, so if you add all these together, | ▶ 00:24 |
| we get 23. | ▶ 00:27 |
| So, here's a quiz. | ▶ 00:00 |
| Let's assume we fit all the 23 parameters of the Bayes network | ▶ 00:02 |
| as explained using maximum likelihood. | ▶ 00:06 |
| Let's now do classification and see what class and message it ends up with. | ▶ 00:09 |
| Let me start with a very simple message, and it contains a single word | ▶ 00:14 |
| just to make it a little bit simpler. | ▶ 00:18 |
| What's the probability that we classify this one word message as spam? | ▶ 00:21 |
| And the answer is 0.1667 or 3/18. | ▶ 00:00 |
| How do I get there? Well, let's apply Bayes rule. | ▶ 00:07 |
| This form is easily transformed into this expression over here, | ▶ 00:13 |
| the probability of the message given spam times the prior probability of spam | ▶ 00:19 |
| over the normalizer over here. | ▶ 00:25 |
| Now, we know that the word "sports" occurs 1 in our 9 words of spam, | ▶ 00:29 |
| and our prior probability for spam is 3/8, | ▶ 00:34 |
| which gives us this expression over here. | ▶ 00:38 |
| We now have to add the same probabilities for the class ham. | ▶ 00:40 |
| "Sports" occurs 5 times out of 15 in the ham class, | ▶ 00:45 |
| and the prior probability for ham is 5/8, | ▶ 00:51 |
| which gives us 3/72 divided by 18/72, which is 3/18 or 1/6. | ▶ 00:55 |
| This gets to a more complicated quiz. | ▶ 00:00 |
| Say the message now contains 3 words. | ▶ 00:03 |
| "Secret is secret," not a particularly meaningful email, | ▶ 00:06 |
| but the frequent occurrence of "secret" seems to suggest it might be spam. | ▶ 00:10 |
| What's the probability you're going to judge this to be spam? | ▶ 00:16 |
| And the answer is surprisingly high. It's 25/26, or 0.9615. | ▶ 00:00 |
| To see if we apply Bayes rule, which multiples the prior for spam-ness | ▶ 00:10 |
| with the conditional probability of each word given spam. | ▶ 00:16 |
| "Secret" carries 1/3, "is" 1/9, and "secret" 1/3 again. | ▶ 00:19 |
| We normalize this by the same expression plus the probability for | ▶ 00:26 |
| the non-spam case. | ▶ 00:32 |
| 5/8 is a prior. | ▶ 00:36 |
| "Secret" is 1/15. | ▶ 00:38 |
| "Is" is 1/15, | ▶ 00:42 |
| and "secret" again. | ▶ 00:45 |
| This resolves to 1/216 over this expression plus 1/5400, | ▶ 00:48 |
| and when you work it all out is 25/26. | ▶ 00:57 |
| The final quiz, let's assume our message is "Today is secret." | ▶ 00:00 |
| And again, it might look like spam because the word "secret" occurs. | ▶ 00:08 |
| I'd like you to compute for me the probability of spam given this message. | ▶ 00:12 |
| And surprisingly, the probability for this message to be spam is 0. | ▶ 00:00 |
| It's not 0.001. It's flat 0. | ▶ 00:07 |
| In other words, it's impossible, according to our model, | ▶ 00:11 |
| that this text could be a spam message. | ▶ 00:14 |
| Why is this? | ▶ 00:17 |
| When we apply the same rule as before, we get the prior for spam which is 3/8. | ▶ 00:19 |
| And we multiple the conditional for each word into this. | ▶ 00:24 |
| For "secret," we know it to be 1/3. | ▶ 00:28 |
| For "is," to be 1/9, but for today, it's 0. | ▶ 00:31 |
| It's 0 because the maximum of the estimate for the probability of "today" in spam is 0. | ▶ 00:39 |
| "Today" just never occurred in a spam message so far. | ▶ 00:45 |
| Now, this 0 is troublesome because as we compute the outcome-- | ▶ 00:49 |
| and I'm plugging in all the numbers as before-- | ▶ 00:55 |
| none of the words matter anymore, just the 0 matters. | ▶ 01:00 |
| So, we get 0 over something which is plain 0. | ▶ 01:03 |
| Are we overfitting? You bet. | ▶ 01:10 |
| We are clearly overfitting. | ▶ 01:13 |
| It can't be that a single word determines the entire outcome of our analysis. | ▶ 01:15 |
| The reason is that our model, to assign a probability of 0 for the word "today" | ▶ 01:21 |
| to be in the class of spam is just too aggressive. | ▶ 01:26 |
| Let's change this. | ▶ 01:29 |
| One technique to deal with the overfitting problem is called Laplace smoothing. | ▶ 01:34 |
| In maximum likelihood estimation, we assign towards our probability | ▶ 01:39 |
| the quotient of the count of this specific event over all events in our data set. | ▶ 01:45 |
| For example, for the prior probability, we found that 3/8 messages are spam. | ▶ 01:51 |
| Therefore, our maximum likelihood estimate | ▶ 01:57 |
| for the prior probability of spam was 3/8. | ▶ 02:00 |
| In Laplace Smoothing, we use a different estimate. | ▶ 02:05 |
| We add the value k to the count | ▶ 02:10 |
| and normalize as if we added k to every single class | ▶ 02:15 |
| that we've tried to estimate something over. | ▶ 02:20 |
| This is equivalent to assuming we have a couple of fake training examples | ▶ 02:23 |
| where we add k to each observation count. | ▶ 02:28 |
| Now, if k equals 0, we get our maximum likelihood estimator. | ▶ 02:32 |
| But if k is larger than 0 and n is finite, we get different answers. | ▶ 02:36 |
| Let's say k equals 1, | ▶ 02:41 |
| and let's assume we get one message, | ▶ 02:47 |
| and that message was spam, so we're going to write it one message, one spam. | ▶ 02:51 |
| What is p (spam) for the Laplace smoothing of k + 1? | ▶ 02:56 |
| Let's do the same with 10 messages, and we get 6 spam. | ▶ 03:03 |
| And 100 messages, of which 60 are spam. | ▶ 03:09 |
| Please enter your numbers into the boxes over here. | ▶ 03:16 |
| The answer here is 2/3 or 0.667 and is computed as follows. | ▶ 00:00 |
| We have 1 message with 1 as spam, but we're going to add k =1. | ▶ 00:10 |
| We're going to add k = 2 over here because there's 2 different classes. | ▶ 00:16 |
| K = 1 times 2 = 2, which gives us 2/3. | ▶ 00:22 |
| The answer over here is 7/12. | ▶ 00:28 |
| Again, we have 6/10 but we add 2 down here and 1 over here, so you get 7/12. | ▶ 00:32 |
| And correspondingly, we get 61/102 is 60 + 1 over 100 +2. | ▶ 00:41 |
| If we look at the numbers over here, we get 0.5833 | ▶ 00:49 |
| and 0.5986. | ▶ 00:56 |
| Interestingly, the maximum likelihood on the last 2 cases over here | ▶ 00:59 |
| will give us .6, but we only get a value that's closer to .5, | ▶ 01:03 |
| which is the effect of our smoothing prior for the Laplacian smoothing. | ▶ 01:09 |
| Let's use the Laplacian smoother with K=1 | ▶ 00:00 |
| to calculate the few interesting probabilities-- | ▶ 00:05 |
| P of SPAM, P of HAM, | ▶ 00:09 |
| and then the probability of the words "today", | ▶ 00:12 |
| given that it's in the SPAM class or the HAM class. | ▶ 00:15 |
| And you might assume that our recovery size | ▶ 00:19 |
| is about 12 different words here. | ▶ 00:22 |
| This one is easy to calculate for SPAM and HAM. | ▶ 00:00 |
| For SPAM, it's 2/5, | ▶ 00:03 |
| and the reason is, we had previously | ▶ 00:05 |
| 3 out of 8 messages assigned to SPAM. | ▶ 00:08 |
| But thanks to the Laplacian smoother, we add 1 over here. | ▶ 00:12 |
| And there are 2 classes, so we add 2 times 1 over here, | ▶ 00:15 |
| which gives us 4/10, which is 2/5. | ▶ 00:19 |
| Similarly to get 3/5 over here. | ▶ 00:22 |
| Now the tricky part comes up over here. | ▶ 00:26 |
| Before, we had 0 occurances of the word "today" in the SPAM class, | ▶ 00:29 |
| and we had 9 data points. | ▶ 00:33 |
| But now we are going to add 1 for Laplacian smoother, | ▶ 00:35 |
| and down here, we are going to add 12. | ▶ 00:38 |
| And the reason that we add 12 is because | ▶ 00:40 |
| there's 12 different words in our dictionary | ▶ 00:42 |
| Hence, for each word in the dictonary, we are going to add 1. | ▶ 00:44 |
| So we have a total of 12, which gives us the 12 over here. | ▶ 00:47 |
| That makes 1/21. | ▶ 00:50 |
| In the HAM class, we had 2 occurrences | ▶ 00:53 |
| of the word "today"--over here and over here. | ▶ 00:56 |
| We add 1, normalize by 15, | ▶ 00:59 |
| plus 12 for the dictionary size, | ▶ 01:04 |
| which is 3/27 or 1/9. | ▶ 01:07 |
| This was not an easy question. | ▶ 01:14 |
| We come now to the final quiz here, | ▶ 00:00 |
| which is--I would like to compute the probability | ▶ 00:03 |
| that the message "today is secret" | ▶ 00:05 |
| falls into the SPAM box with | ▶ 00:08 |
| Laplacian smoother using K=1. | ▶ 00:10 |
| Please just enter your number over here. | ▶ 00:13 |
| This is a non-trivia question. | ▶ 00:16 |
| It might take you a while to calculate this. | ▶ 00:18 |
| In the approximate probabilities--0.4858. | ▶ 00:00 |
| How did we get this? | ▶ 00:06 |
| Well, the prior probability for SPAM | ▶ 00:08 |
| under the Laplacian smoothing is 2/5. | ▶ 00:12 |
| "Today" doesn't occur, but we have already calculated this to be 1/21. | ▶ 00:15 |
| "Is" occurs once, so we get 2 over here over 21. | ▶ 00:22 |
| "Secret" occurs 3 times, so we get a 4 over here over 21, | ▶ 00:26 |
| and we normalize this by the same expression over here. | ▶ 00:32 |
| Plus the prior for HAM, which is 3/5, | ▶ 00:37 |
| we have 2 occurrences of "today", plus 1, equals 3/27. | ▶ 00:42 |
| "Is" occurs once--2/27. | ▶ 00:47 |
| And "secret" occurs once--again 2/27. | ▶ 00:50 |
| When you work this all out, you get this number over here. | ▶ 00:54 |
| So we learned quite a bit. | ▶ 00:00 |
| We learned about Naive Bayes | ▶ 00:02 |
| as our first supervised learning methods. | ▶ 00:04 |
| The setup was that we had | ▶ 00:06 |
| features of documents or trading examples and labels. | ▶ 00:08 |
| In this case, SPAM or not SPAM. | ▶ 00:14 |
| And from those pieces, | ▶ 00:17 |
| we made a generative model for the SPAM class | ▶ 00:19 |
| and the non-SPAM class | ▶ 00:23 |
| that described the condition of probability | ▶ 00:25 |
| of each individual feature. | ▶ 00:28 |
| We then used first maximum likelihood | ▶ 00:30 |
| and then a Laplacian smoother | ▶ 00:33 |
| to fit those primers over here. | ▶ 00:36 |
| And then using Bayes rule, | ▶ 00:38 |
| we could take any training examples over here | ▶ 00:41 |
| and figure out what the class probability was over here. | ▶ 00:44 |
| This is called a generative model | ▶ 00:48 |
| in that the condition of probabilities all aim to maximize | ▶ 00:51 |
| the probability of individual features as if those | ▶ 00:55 |
| describe the physical world. | ▶ 01:00 |
| We also used what is called a bag of words model, | ▶ 01:02 |
| in which our representation of each email | ▶ 01:06 |
| was such that we just counted the occurrences of words, | ▶ 01:09 |
| irrespective of their order. | ▶ 01:12 |
| Now this is a very powerful method for fighting SPAM. | ▶ 01:15 |
| Unfortunately, it is not powerful enough. | ▶ 01:19 |
| It turns out spammers know about Naive Bayes, | ▶ 01:21 |
| and they've long learned to come up with messages | ▶ 01:24 |
| that are fooling your SPAM filter if it uses Naive Bayes. | ▶ 01:27 |
| So companies like Google and others | ▶ 01:31 |
| have become much more involved | ▶ 01:33 |
| in methods for SPAM filtering. | ▶ 01:35 |
| Now I can give you some more examples how to filter SPAM, | ▶ 01:38 |
| but all of those quite easily fit with the same Naive Bayes model. | ▶ 01:42 |
| [Narrator] So here features that you might consider when you write | ▶ 00:00 |
| in an advance spam filter. | ▶ 00:03 |
| For example, | ▶ 00:05 |
| does the email come from | ▶ 00:07 |
| a known spamming IP or computer? | ▶ 00:09 |
| Have you emailed this person before? | ▶ 00:12 |
| In which case it is less likely to be spam. | ▶ 00:16 |
| Here's a powerful one: | ▶ 00:19 |
| have 1000 other people | ▶ 00:22 |
| recently received the same message? | ▶ 00:25 |
| Is the email header consistent? | ▶ 00:29 |
| So example if the from field says your bank | ▶ 00:32 |
| is the IP address really your bank? | ▶ 00:35 |
| Surprisingly is the email all caps? | ▶ 00:38 |
| Strangely many spammers believe if you write | ▶ 00:42 |
| things in all caps you'll pay more attention to it. | ▶ 00:44 |
| Do the inline URLs point to those pages | ▶ 00:48 |
| where they say they're pointing to? | ▶ 00:51 |
| Are you addressed by your correct name? | ▶ 00:54 |
| Now these are some features, | ▶ 00:56 |
| I'm sure you can think of more. | ▶ 00:58 |
| You can toss them easily into the | ▶ 01:00 |
| naive base model and get better classification. | ▶ 01:02 |
| In fact model spam filters keep learning | ▶ 01:05 |
| as people flag emails as spam, and | ▶ 01:08 |
| of course spammers keep learning as well | ▶ 01:10 |
| and trying to fool modern spam filters. | ▶ 01:13 |
| Who's going to win? | ▶ 01:16 |
| Well so far the spam filters are clearly winning. | ▶ 01:18 |
| Most of my spam I never see, but who knows | ▶ 01:21 |
| what's going to happen with the future? | ▶ 01:23 |
| It's a really fascinating machine learning problem. | ▶ 01:25 |
| [Narrator] Naive Bayes can also be applied to | ▶ 00:00 |
| the problem of hand written digits recognition. | ▶ 00:02 |
| This is a sample of hand-written digits taken | ▶ 00:05 |
| from a U.S. postal data set | ▶ 00:09 |
| where hand written zip codes on letters are | ▶ 00:12 |
| being scanned and automatically classified. | ▶ 00:17 |
| The machine-learning problem here is | ▶ 00:21 |
| taken a symbol just like this. | ▶ 00:23 |
| What is the corresponding number? | ▶ 00:28 |
| Here it's obviously 0. | ▶ 00:30 |
| Here it's obviously 1. | ▶ 00:32 |
| Here it's obviously 2, 1. | ▶ 00:34 |
| For the one down here, | ▶ 00:36 |
| it's a little bit harder to tell. | ▶ 00:38 |
| Now when you apply Naive Bayes, | ▶ 00:41 |
| the input vector | ▶ 00:44 |
| could be the pixel values | ▶ 00:46 |
| of each individual pixel so we have | ▶ 00:48 |
| a 16 x 16 input resolution. | ▶ 00:50 |
| You would get 256 different values | ▶ 00:54 |
| corresponding to the brightness of each pixel. | ▶ 00:59 |
| Now obviously given sufficiently made | ▶ 01:02 |
| training example, you might hope | ▶ 01:05 |
| to recognize digits, | ▶ 01:07 |
| but one of the deficiencies of this approach is | ▶ 01:09 |
| it is not particularly shifted range. | ▶ 01:12 |
| So for example a pattern like this | ▶ 01:15 |
| will look fundamentally different | ▶ 01:19 |
| from a pattern like this. | ▶ 01:21 |
| Even though the pattern on the right is obtained | ▶ 01:24 |
| by shifting the pattern on the left | ▶ 01:27 |
| by 1 to the right. | ▶ 01:29 |
| There's many different solutions, but a common one could be | ▶ 01:31 |
| to use smoothing in a different way from | ▶ 01:34 |
| the way we discussed it before. | ▶ 01:36 |
| Instead of just counting 1 pixel value's count, | ▶ 01:38 |
| you could mix it with counts of the | ▶ 01:40 |
| neighboring pixel values so if | ▶ 01:42 |
| all pixels are slightly shifted, | ▶ 01:44 |
| we get about the same statistics | ▶ 01:46 |
| as the pixel itself. | ▶ 01:48 |
| Such a method is called input smoothing. | ▶ 01:50 |
| You can what's technically called convolve | ▶ 01:52 |
| the input vector equals pixel value variable, and | ▶ 01:55 |
| you might get better results than if you | ▶ 01:57 |
| do Naive Bayes on the raw pixels. | ▶ 02:00 |
| Now to tell you the truth for | ▶ 02:02 |
| digit recognition of this type, | ▶ 02:04 |
| Naive Bayes is not a good choice. | ▶ 02:06 |
| The conditional independence assumption | ▶ 02:08 |
| of each pixel, given the class, | ▶ 02:10 |
| is too strong an assumption in this case, | ▶ 02:12 |
| but it's fun to talk about image recognition | ▶ 02:14 |
| in the context of Naive Bayes regardless. | ▶ 02:17 |
| So, let me step back a step and talk a bit about | ▶ 00:00 |
| overfitting prevention in machine learning | ▶ 00:04 |
| because it's such an important topic. | ▶ 00:07 |
| We talked about Occam's Razor, | ▶ 00:09 |
| which in a generalized way suggests there is | ▶ 00:12 |
| a tradeoff between how well we can fit the data | ▶ 00:16 |
| and how smooth our learning algorithm is. | ▶ 00:22 |
| In our class in smoothing, we already found 1 way | ▶ 00:28 |
| to let Occam's Razor play, which is by | ▶ 00:32 |
| selecting the value K to make our statistical counts smoother. | ▶ 00:34 |
| I alluded to a similar way in the image recognition domain | ▶ 00:40 |
| where we smoothed the image so the neighboring pixels count similar. | ▶ 00:44 |
| This all raises the question of how to choose the smoothing parameter. | ▶ 00:49 |
| So, in particular, in Laplacian smoothing, how to choose the K. | ▶ 00:53 |
| There is a method called cross-validation | ▶ 00:58 |
| which can help you find an answer. | ▶ 01:02 |
| This method assumes there is plenty of training examples, but | ▶ 01:05 |
| to tell you the truth, in spam filtering there is more than you'd ever want. | ▶ 01:09 |
| Take your training data | ▶ 01:14 |
| and divide it into 3 buckets. | ▶ 01:17 |
| Train, cross-validate, and test. | ▶ 01:19 |
| Typical ratios will be 80% goes into train, | ▶ 01:24 |
| 10% into cross-validate, | ▶ 01:27 |
| and 10% into test. | ▶ 01:30 |
| You use the train to find all your parameters. | ▶ 01:33 |
| For example, the probabilities of a base network. | ▶ 01:37 |
| You use your cross-validation set | ▶ 01:40 |
| to find the optimal K, and the way you do this is | ▶ 01:43 |
| you train for different values of K, | ▶ 01:46 |
| you observe how well the training model performs on the CV data, | ▶ 01:49 |
| not touching the test data, | ▶ 01:55 |
| and then you maximize over all the Ks to get the best performance | ▶ 01:58 |
| on the cross-validation set. | ▶ 02:01 |
| You iterate this many times until you find the best K. | ▶ 02:03 |
| When you're done with the best K, | ▶ 02:06 |
| you train again, and then finally | ▶ 02:09 |
| only one you touch the test data | ▶ 02:12 |
| to verify the performance, | ▶ 02:15 |
| and this is the performance you report. | ▶ 02:17 |
| It's really important in cross-validation | ▶ 02:20 |
| split apart a cross-validation set that's different from the test set. | ▶ 02:23 |
| If you were to use the test set to find the optimal K, | ▶ 02:28 |
| then your test set becomes an effective part of your training routine, | ▶ 02:31 |
| and you might overfit your test data, | ▶ 02:35 |
| and you wouldn't even know. | ▶ 02:38 |
| By keeping the test data separate from the beginning, | ▶ 02:40 |
| and train on the training data, you use | ▶ 02:43 |
| the cross-validation data to find how good your train data is doing, | ▶ 02:46 |
| and the unknown parameters of K to fine-tune the K. | ▶ 02:49 |
| Finally, only once you use the test data | ▶ 02:53 |
| do you get a fair answer to the question, | ▶ 02:56 |
| "How well will your model perform on future data?" | ▶ 02:59 |
| So, pretty much everybody in machine learning | ▶ 03:02 |
| uses this model. | ▶ 03:05 |
| You can redo the split between training and the cross-validation part, | ▶ 03:08 |
| people often use the word 10-fold cross-validation | ▶ 03:12 |
| where they do 10 different forwardings | ▶ 03:15 |
| and run the model 10 times to find the optimal K | ▶ 03:17 |
| or smoothing parameter. | ▶ 03:20 |
| No matter which way you do it, find the optimal smoothing parameter | ▶ 03:22 |
| and then use a test set exactly once to verify in a report. | ▶ 03:25 |
| Let me back up a step further, | ▶ 00:00 |
| and let's look at supervised learning more generally. | ▶ 00:03 |
| Our example so far was one of classification. | ▶ 00:06 |
| The characteristic of classifcation is | ▶ 00:09 |
| that the target labels or the target class is discrete. | ▶ 00:12 |
| In our case it was actually binary. | ▶ 00:16 |
| In many problems, we try to predict a continuous quantity. | ▶ 00:18 |
| For example, in the interval 0 to 1 or perhaps a real number. | ▶ 00:23 |
| Those machine learning problems are called regression problems. | ▶ 00:29 |
| Regression problems are fundamentally different from classification problems. | ▶ 00:33 |
| For example, our base network doesn't afford us an answer | ▶ 00:37 |
| to a problem where the target value could be at 0,1. | ▶ 00:42 |
| A regression problem, for example, would be one to | ▶ 00:45 |
| predict the weather tomorrow. | ▶ 00:48 |
| Temperature is a continuous value. Our base number would not be able | ▶ 00:50 |
| to predict the temperature, it only can predict discrete classes. | ▶ 00:53 |
| A regression algorithm is able to give us a continuous prediction | ▶ 00:58 |
| about the temperature tomorrow. | ▶ 01:01 |
| So let's look at the regression next. | ▶ 01:04 |
| So here's my first quiz for you on regression. | ▶ 01:07 |
| This scatter plot shows for Berkeley California for a period of time | ▶ 01:10 |
| the data for each house that was sold. | ▶ 01:18 |
| Each dot is a sold house. | ▶ 01:21 |
| It graphs the size of the house in square feet | ▶ 01:24 |
| to the sales price in thousands of dollars. | ▶ 01:27 |
| As you can see, roughly speaking, | ▶ 01:32 |
| as the size of the house goes up, | ▶ 01:34 |
| so does the sales price. | ▶ 01:37 |
| I wonder, for a house of about 2500 square feet, | ▶ 01:40 |
| what is the approximate sales price you would assume | ▶ 01:45 |
| based just on the scatter plot data? | ▶ 01:49 |
| Is it 400k, 600k, 800k, or 1000k? | ▶ 01:52 |
| My answer is, there seems to be a roughly linear relationship, | ▶ 00:00 |
| maybe not quite linear, between the house size and the price. | ▶ 00:05 |
| So we look at a linear graph that best describes the data-- | ▶ 00:11 |
| you get this dashed line over here. | ▶ 00:15 |
| And for the dashed line, if you walk up the 2500 square feet, | ▶ 00:18 |
| you end up with roughly 800K. | ▶ 00:22 |
| So this would have been the best answer. | ▶ 00:24 |
| Now obviously you can answer this question without understanding anything about regression. | ▶ 00:00 |
| But what you find is this is different from classification as before. | ▶ 00:05 |
| This is not a binary concept anymore of like expensive and cheap. | ▶ 00:10 |
| It really is a relationship between two variables. | ▶ 00:13 |
| One you care about--the house price, and one that you can observe, | ▶ 00:17 |
| which is the house size in square feet. | ▶ 00:20 |
| And your goal is to fit a curve that best explains the data. | ▶ 00:23 |
| Once again, we have a case where we can play Occam's razor. | ▶ 00:28 |
| There clearly is a data fit that is not linear that might be better, | ▶ 00:31 |
| like this one over here. | ▶ 00:35 |
| And when you go to hide the linear curves, | ▶ 00:37 |
| you might even be inclined to draw a curve like this. | ▶ 00:40 |
| Now of course the curve I'm drawing right now is likely an overfit. | ▶ 00:44 |
| And you don't want to postulate that this is the general relationship | ▶ 00:49 |
| between the size of a house and the sales price. | ▶ 00:54 |
| So even though my black curve might describe the data better, | ▶ 00:57 |
| the blue curve or the dashed linear curve over here might be a better explanation overture of Occam's razor. | ▶ 01:01 |
| So let's look a little bit deeper into what we call regression. | ▶ 01:08 |
| As in all regression problems, our data will be comprised of | ▶ 01:15 |
| input vectors of length in that map to another continuous value. | ▶ 01:19 |
| And we might be given a total of M data points. | ▶ 01:25 |
| This is from the classification case, except this time the Ys are continuous. | ▶ 01:30 |
| Once again, we're looking for function f that maps our vector x into y. | ▶ 01:36 |
| In linear regression, the function has a particular form which is W1 times X plus W0. | ▶ 01:44 |
| In this case X is one dimensional which is N = 1. | ▶ 01:54 |
| Or in the high-dimensional space, we might just write W times X plus W0, | ▶ 01:59 |
| where W is a vector and X is a vector. | ▶ 02:07 |
| And this is the inner product of these 2 vectors over here. | ▶ 02:12 |
| Let's for now just consider the one-dimensional case. | ▶ 02:16 |
| In this quiz, I've given you a linear regression form with 2 unknown parameters, W1 and W0. | ▶ 02:20 |
| I've given you a data set. | ▶ 02:27 |
| And this data set happens to be fittable by a linear regression model without any residual error. | ▶ 02:30 |
| Without any math, can you look at this and find out to me what the 2 parameters, W0 and W1 are? | ▶ 02:36 |
| This is a suprisingly challenging question. | ▶ 00:00 |
| If you look at these numbers from 3 to 6. | ▶ 00:03 |
| When we increase X by 3, Y decreases by 3, | ▶ 00:07 |
| which suggests W1 is -1. | ▶ 00:14 |
| Now let's see if this holds. | ▶ 00:18 |
| If we increase X by 3, it decreases Y by 3. | ▶ 00:20 |
| If we increase X by 1, we decrease Y by 1. | ▶ 00:24 |
| If we increase X by 2, we decrease Y by 2. | ▶ 00:28 |
| So this number seems to be an exact fit. | ▶ 00:32 |
| Next we have to get the constant W0 right. | ▶ 00:36 |
| For X = 3, we get -3 as an expression over here, | ▶ 00:41 |
| because we know W1 = -1. | ▶ 00:48 |
| So if this has to equal zero in the end, then W0 has to be 3. | ▶ 00:50 |
| Let's do a quick check. | ▶ 00:57 |
| -3 plus 3 is 0. | ▶ 00:59 |
| -6 plus 3 is -3. | ▶ 01:02 |
| And if we plug in any of the numbers, you find those are correct. | ▶ 01:05 |
| Now this is the case of an exact data set. | ▶ 01:09 |
| It gets much more challenging if the data set cannot be fit with a linear function. | ▶ 01:12 |
| To define linear regression, | ▶ 00:00 |
| we need to understand what we are trying to minimize. | ▶ 00:02 |
| The word is called here, are loss function | ▶ 00:05 |
| and the loss function is the amount of residual error we obtain | ▶ 00:08 |
| after fitting the linear function as good as possible. | ▶ 00:12 |
| The residual error is the sum of all training examples, | ▶ 00:16 |
| J of YJ, which is the target label, | ▶ 00:20 |
| minus our prediction, which is W1 XJ minus W0 to the square. | ▶ 00:25 |
| This is the quadratic error between our target tables | ▶ 00:34 |
| and what our best hypothesis can produce. | ▶ 00:37 |
| The minimizing of loss | ▶ 00:41 |
| is used for linear regression of a new regression problem, | ▶ 00:43 |
| and you can write it as follows: | ▶ 00:46 |
| Our solution to the regression problem W* | ▶ 00:50 |
| is the arg min of the loss over all possible vectors W. | ▶ 00:52 |
| The problem of minimizing quadratic loss for linear functions can be solved in closed form. | ▶ 00:00 |
| When I reduce, I will do this for the one-dimensional case on paper. | ▶ 00:07 |
| I will also give you the solution for the case where your input space is multidimensional, | ▶ 00:12 |
| which is often called "multivariant regression." | ▶ 00:17 |
| We seek to minimize a sum of a quadratic expression | ▶ 00:22 |
| where the target labels are subtracted with the output of our linear regression model | ▶ 00:26 |
| parameterized by w1 and w2. | ▶ 00:33 |
| The summation here is overall training examples, | ▶ 00:36 |
| and I leave the index of the summation out if not necessary. | ▶ 00:40 |
| The minimum of this is obtained where the derivative of this function equals zero. | ▶ 00:45 |
| Let's call this function "L." | ▶ 00:50 |
| For the partial derivative with respect to w0, we get this expression over here, | ▶ 00:53 |
| which we have to set to zero. | ▶ 00:59 |
| We can easily get rid of the -2 and transform this as follows: | ▶ 01:02 |
| Here M is the number of training examples. | ▶ 01:11 |
| This expression over here gives us w0 as a function of w1, | ▶ 01:17 |
| but we don't know w1. Let's do the same trick for w1 | ▶ 01:21 |
| and set this to zero as well, | ▶ 01:28 |
| which gets us the expression over here. | ▶ 01:32 |
| We can now plug in the w0 over here into this expression over here | ▶ 01:38 |
| and obtain this expression over here, | ▶ 01:44 |
| which looks really involved but is relatively straightforward. | ▶ 01:47 |
| With a few steps of further calculation, which I'll spare you for now, | ▶ 01:52 |
| we get for w1 the following important formula: | ▶ 01:56 |
| This is the final quotient for w1, | ▶ 02:02 |
| where we take the number of training examples times of the sum of all xy | ▶ 02:05 |
| minus the sum of x times the sum of y divided by this expression over here. | ▶ 02:10 |
| Once we've computed w1, | ▶ 02:16 |
| we can go back to our original articulation of w0 over here | ▶ 02:19 |
| and plug w1 into w0 and obtain w0. | ▶ 02:23 |
| These are the two important formulas we can also find in the textbook. | ▶ 02:30 |
| I'd like to go back and use those formulas to calculate these two coefficients over here. | ▶ 02:39 |
| You get 4 times the sum of x and the sum of y, which is -32 | ▶ 02:45 |
| minus the product of the sum of x, which is 18, and the sum of y, which is -6, | ▶ 02:56 |
| divided by the sum of x squared, which is 86, times 4, minus the sum of x squared, | ▶ 03:05 |
| which is 18 times 18, which is 324. | ▶ 03:16 |
| If you work this all out, it becomes -1, which is w1. | ▶ 03:20 |
| W0 is now obtained by completing the quarter times sum of all y, | ▶ 03:25 |
| which is -6, minus -1/4 times sum of all x. | ▶ 03:31 |
| If you plug this all in, you get 3, as over here. Our formula is actually correct. | ▶ 03:39 |
| Here is another quiz for linear regression. We have the follow data: | ▶ 03:46 |
| Here is the data plotted graphically. | ▶ 03:51 |
| I wonder what the best regression is. | ▶ 03:53 |
| Give me w0 and w1. Apply the formulas I just gave you. | ▶ 03:56 |
| And the answer is W0 = 0.5, and W1 = 0.9. | ▶ 00:00 |
| If I were to draw a line, it would go about like this. | ▶ 00:09 |
| It doesn't really hit the two points at the end. | ▶ 00:14 |
| If you were thinking of something like this, you were wrong. | ▶ 00:19 |
| If you draw a curve like this, your quadratic error becomes 2. | ▶ 00:24 |
| One over here, and one over here. | ▶ 00:28 |
| The quadratic error is smaller for the line that goes in between those points. | ▶ 00:30 |
| This is easily seen by computing as shown in the previous slide. | ▶ 00:35 |
| W1 equals (4 x 118 - 20 x 20) / (4 x 120 - 400) which is 0.9. | ▶ 00:41 |
| This is merely plugging in those numbers into the formulas I gave you. | ▶ 00:55 |
| W0 then becomes ΒΌ x 20. | ▶ 01:00 |
| Now we plug in W1-- 0.9 / 4 x 20 equals 0.5. | ▶ 01:05 |
| This is an example of linear regression, | ▶ 01:12 |
| in which case there is a residual error, | ▶ 01:16 |
| and the best-fitting curve is the one that minimizes | ▶ 01:18 |
| the total of the residual vertical error in this graph over here. | ▶ 01:22 |
| So linear regression works well | ▶ 00:00 |
| if the data is approximately linear, | ▶ 00:03 |
| but there are many examples when linear regression performs poorly. | ▶ 00:05 |
| Here's one where we have a | ▶ 00:09 |
| curve that is really nonlinear. | ▶ 00:12 |
| This is an interesting one where we seem to have a linear relationship | ▶ 00:15 |
| that is flatter than the linear regression indicates, | ▶ 00:18 |
| but there is one outlier. | ▶ 00:21 |
| Because if you are minimizing quadratic error, | ▶ 00:23 |
| outliers penalize you over-proportionately. | ▶ 00:26 |
| So outliers are particularly bad for linear regression. | ▶ 00:30 |
| And here is a case, | ▶ 00:34 |
| where the data clearly suggests | ▶ 00:35 |
| a very different phenomena for linear. | ▶ 00:37 |
| We have only two ?? variables even being used, | ▶ 00:40 |
| and this one has a strong frequency | ▶ 00:42 |
| and a strong vertical spread. | ▶ 00:45 |
| Clearly a linear regression model | ▶ 00:47 |
| is a very poor one to explain | ▶ 00:49 |
| this data over here. | ▶ 00:51 |
| Another problem with linear regression | ▶ 00:53 |
| is that as you go to infinity in the X space, | ▶ 00:55 |
| your Ys also become infinite. | ▶ 00:59 |
| In some problems that isn't a plausible model. | ▶ 01:02 |
| For example, if you wish to predict the weather | ▶ 01:05 |
| anytime into the future, | ▶ 01:08 |
| it's implausible to assume the further the prediction goes out, | ▶ 01:10 |
| the hotter or the cooler it becomes. | ▶ 01:13 |
| For such situations there is a | ▶ 01:15 |
| model called logistic regression, | ▶ 01:17 |
| which uses a slightly more complicated | ▶ 01:20 |
| model than linear regression, | ▶ 01:22 |
| which goes as follows:. | ▶ 01:24 |
| Let F of XP, or linear function, | ▶ 01:25 |
| and the output of logistic regression | ▶ 01:30 |
| is obtained by the following function: | ▶ 01:32 |
| One over one plus exponential of minus F of X. | ▶ 01:34 |
| So here's a quick quiz for you. | ▶ 01:40 |
| What is the range in which Z might fall | ▶ 01:43 |
| given this function over here, | ▶ 01:48 |
| and ??? the linear function of F or X over here. | ▶ 01:49 |
| Is it zero, one? | ▶ 01:53 |
| Is it minus one, one? | ▶ 01:56 |
| Is it minus one, zero? | ▶ 01:59 |
| Minus two, two? | ▶ 02:02 |
| Or none of the above? | ▶ 02:04 |
| The answer is zero, one. | ▶ 00:00 |
| If this expression over here, | ▶ 00:02 |
| F of X, | ▶ 00:05 |
| grows to positive infinity, | ▶ 00:07 |
| then Z becomes one. | ▶ 00:09 |
| And the reason is | ▶ 00:14 |
| as this term over here becomes very large, | ▶ 00:16 |
| E to the minus of that term approaches zero; | ▶ 00:19 |
| one over one equals one. | ▶ 00:22 |
| If F of X goes to minus infinity, | ▶ 00:25 |
| then Z goes to zero. | ▶ 00:30 |
| And the reason is, | ▶ 00:33 |
| if this expression over here goes to minus infinity, | ▶ 00:34 |
| E to the infinity becomes very large; | ▶ 00:38 |
| one over something very large becomes zero. | ▶ 00:41 |
| When we plot the logistic function it looks like this: | ▶ 00:44 |
| So it's approximately linear | ▶ 00:49 |
| around F of X equals zero, | ▶ 00:51 |
| but it levels off to zero and one | ▶ 00:54 |
| as we go to the extremes. | ▶ 00:58 |
| Another problem with linear regression has to do with the regularization | ▶ 00:00 |
| or complexity control. | ▶ 00:04 |
| Just like before, we sometimes wish to have | ▶ 00:06 |
| a less complex model. | ▶ 00:08 |
| So in regularization, the loss function is either the sum | ▶ 00:10 |
| of the loss of a data function and a complexity control term, | ▶ 00:15 |
| which is often called the loss of the parameters. | ▶ 00:21 |
| The loss of the data is simply curvatic loss, as we discussed before. | ▶ 00:24 |
| The loss of parameters might just be a function that penalizes | ▶ 00:29 |
| the parameters to become large | ▶ 00:35 |
| up to some known P, where P is usually either 1 or 2. | ▶ 00:37 |
| If you draw this graphically, | ▶ 00:43 |
| in a parameter space comprised of 2 parameters, | ▶ 00:46 |
| your curvatic term for minimizing the data error | ▶ 00:49 |
| might look like this, where the minimum sits over here. | ▶ 00:53 |
| Your term for regularization might pull these parameters toward 0. | ▶ 00:57 |
| It pulls it toward 0, along the circle if you use curvatic error, | ▶ 01:02 |
| and it does it in a diamond-shaped way. | ▶ 01:09 |
| For L1 regularization--either one works well. | ▶ 01:14 |
| L1 has the advantage in that parameters tend to get really sparse. | ▶ 01:20 |
| If you look at this diagram, there is a tradeoff between W-0 and W-1. | ▶ 01:24 |
| In the L1 case, that allows one of them to be driven to 0. | ▶ 01:30 |
| In the L2 case, parameters tend not to be as sparse. | ▶ 01:33 |
| So L1 is often preferred. | ▶ 01:37 |
| This all raises the question, | ▶ 00:00 |
| how to minimize more complicated loss functions | ▶ 00:03 |
| than the one we discussed so far. | ▶ 00:06 |
| Are there closed-form solutions of the type we found for linear regression? | ▶ 00:09 |
| Or do we have to resort to iterative methods? | ▶ 00:14 |
| The general answer is, unfortunantly, we have to resort to iterative methods. | ▶ 00:17 |
| Even though there are special cases in which corresponding solutions may exist, | ▶ 00:23 |
| in general, our loss functions now become complicated enough | ▶ 00:28 |
| that all we can do is iterate. | ▶ 00:32 |
| Here is a prototypical loss function | ▶ 00:35 |
| and the method for interation will be called gradient descent. | ▶ 00:40 |
| In gradient descent, you start with an initial guess, | ▶ 00:44 |
| W-0, where 0 is your iteration number, | ▶ 00:48 |
| and then you up with it iteratively. | ▶ 00:53 |
| Your i plus 1st parameter guess will be obtained by taking your i-th guess | ▶ 00:55 |
| and subtracting from it the gradient of your loss function, | ▶ 01:04 |
| and that guess multiplied by a small learning weight alpha, | ▶ 01:10 |
| where alpha is often as small as 0.01. | ▶ 01:15 |
| I have a couple of questions for you. | ▶ 01:19 |
| Consider the following 3 points. | ▶ 01:21 |
| We call them A, B, C. | ▶ 01:25 |
| I wish to know, for points A, B, and C, | ▶ 01:27 |
| Is the gradient at this point positive, about zero, or negative? | ▶ 01:34 |
| For each of those, check exactly one of those cases. | ▶ 01:40 |
| In case A, the gradient is negative. | ▶ 00:00 |
| If you move to the right in the X space, | ▶ 00:03 |
| then your loss decreases. | ▶ 00:06 |
| In B, it's about zero. | ▶ 00:09 |
| In C, it's pointing up; it's positive. | ▶ 00:12 |
| So if you apply the rule over here, | ▶ 00:15 |
| if you were to start at A as your W-zero, | ▶ 00:18 |
| then your gradient is negative. | ▶ 00:21 |
| Therefore, you would add something to the value of W. | ▶ 00:23 |
| You move to the right, and your loss has decreased. | ▶ 00:26 |
| You do this until you find yourself | ▶ 00:29 |
| with what's called a local minimum, where B resides. | ▶ 00:31 |
| In this instance over here, gradient descent starting at A | ▶ 00:34 |
| would not get you to the global minimum, | ▶ 00:37 |
| which sits over here because there's a bump in between. | ▶ 00:39 |
| Gradient methods are known to be subject to local minimum. | ▶ 00:42 |
| I have another gradient quiz. | ▶ 00:00 |
| Consider the following quadratic arrow function. | ▶ 00:03 |
| We are considering the gradient in 3 different places. | ▶ 00:06 |
| a. b. and c. | ▶ 00:09 |
| And they ask you which gradient is the largest. | ▶ 00:13 |
| a, b, or c or are they all equal? | ▶ 00:17 |
| In which case, you would want to check the last box over here | ▶ 00:23 |
| And the answer is C. | ▶ 00:00 |
| The derivative of a quadratic function is a linear function. | ▶ 00:04 |
| Which would look about like this. | ▶ 00:08 |
| And as we go outside, our gradient becomes larger and larger. | ▶ 00:11 |
| This over here is much steeper than this curve over here. | ▶ 00:15 |
| [Thrun] Here is a final gradient descent quiz. | ▶ 00:00 |
| Suppose we have a loss function like this | ▶ 00:04 |
| and our gradient descent starts over here. | ▶ 00:08 |
| Will it likely reach the global minimum? | ▶ 00:12 |
| Yes or no. | ▶ 00:15 |
| Please check one of those boxes. | ▶ 00:17 |
| [Thrun] And the answer is yes, | ▶ 00:00 |
| although, technically speaking, to reach the absolute global minimum | ▶ 00:02 |
| we need the learning rates to become smaller and smaller over time. | ▶ 00:06 |
| If they stay constant, there is a chance this thing might bounce around | ▶ 00:11 |
| between 2 points in the end and never reach the global minimum. | ▶ 00:15 |
| But assuming that we implement gradient descent correctly, | ▶ 00:18 |
| we will finally reach the global minimum. | ▶ 00:22 |
| That's not the case if you start over here, where we can get stuck over here | ▶ 00:24 |
| and settle for the minimum over here, which is a local minimum | ▶ 00:29 |
| and not the best solution to our optimization problem. | ▶ 00:32 |
| So one of the important points to take away from this is | ▶ 00:35 |
| gradient descent is universally applicable to more complicated problems-- | ▶ 00:38 |
| problems that don't have a plausible solution. | ▶ 00:43 |
| But you have to check whether there is many local minima, | ▶ 00:46 |
| and if so, you have to worry about this. | ▶ 00:49 |
| Any optimization book can tell you tricks how to overcome this. | ▶ 00:51 |
| I won't go into any more depth here in this class. | ▶ 00:55 |
| [Thrun] It's interesting to see how to minimize a loss function using gradient descent. | ▶ 00:00 |
| In our linear case, we have L equals sum over the correct labels | ▶ 00:05 |
| minus our linear function to the square, | ▶ 00:12 |
| which we seek to minimize. | ▶ 00:16 |
| We already know that this has a closed form solution, | ▶ 00:18 |
| but just for the fun of it, let's look at gradient descent. | ▶ 00:21 |
| The gradient of L with respect to W1 is minus 2, sum of all J | ▶ 00:25 |
| of the difference as before but without the square times Xj. | ▶ 00:33 |
| The gradient with respect to W0 is very similar. | ▶ 00:39 |
| So in gradient descent we start with W1 0 and W0 0 | ▶ 00:43 |
| where the upper cap 0 corresponds to the iteration index of gradient descent. | ▶ 00:49 |
| And then we iterate. | ▶ 00:55 |
| In the M iteration we get our new estimate by using the old estimate | ▶ 00:57 |
| minus a learning rate of this gradient over here | ▶ 01:06 |
| taking the position of the old estimate W1, M minus 1. | ▶ 01:10 |
| Similarly, for W0 we get this expression over here. | ▶ 01:15 |
| And these expressions look nasty, | ▶ 01:20 |
| but what it really means is we subtract an expression like this | ▶ 01:24 |
| every time we do gradient descent from W1 | ▶ 01:28 |
| and an expression like this every time we do gradient descent from W0, | ▶ 01:31 |
| which is easy to implement, and that implements gradient descent. | ▶ 01:36 |
| Now, there are many different ways to apply linear functions in machine learning. | ▶ 00:00 |
| We so far have studied linear functions for regression, | ▶ 00:08 |
| but linear functions are also used for classification, | ▶ 00:12 |
| and specifically for an algorithm called the perceptron algorithm. | ▶ 00:16 |
| This algorithm happens to be a very early model of a neuron, | ▶ 00:21 |
| as in the neurons we have in our brains, | ▶ 00:27 |
| and was invented in the 1940s. | ▶ 00:30 |
| Suppose we give a data set of positive samples and negative samples. | ▶ 00:33 |
| A linear separator is a linear equation that separates positive from negative examples. | ▶ 00:41 |
| Obviously, not all sets possess a linear separator, but some do. | ▶ 00:49 |
| For those we can define the algorithm of the perceptron and it actually converges. | ▶ 00:55 |
| To define a linear separator, let's start with our linear equation as before-- | ▶ 01:02 |
| w1x + w0 in cases where x is higher dimensional this might actually be a vector--never mind. | ▶ 01:07 |
| If this is larger or equal to zero, then we call our classification 1. | ▶ 01:18 |
| Otherwise, we call it zero. | ▶ 01:26 |
| Here's our linear separation classification function | ▶ 01:30 |
| where this is our common linear function. | ▶ 01:35 |
| Now, as I said, perceptron only converges if the data is linearly separable, | ▶ 01:39 |
| and then it converges to a linear separation of the data, | ▶ 01:45 |
| which is quite amazing. | ▶ 01:49 |
| Perceptron is an iterative algorithm that is not dissimilar from grade descent. | ▶ 01:52 |
| In fact, the update rule echoes that of grade descent, and here's how it goes. | ▶ 01:56 |
| We start with a random guess for w1 and w0, | ▶ 02:03 |
| which may correspond to a random separation line, | ▶ 02:09 |
| but usually is inaccurate. | ▶ 02:13 |
| Then the mth weight-i is obtained by using the old weight plus some learning rate alpha | ▶ 02:17 |
| times the difference between the desired target label | ▶ 02:29 |
| and the target label produced by our function at the point m-1. | ▶ 02:33 |
| Now, this is an online learning rule, which is we don't process all the data in batch. | ▶ 02:39 |
| We process one data at a time, and we might go through the data many, many times-- | ▶ 02:45 |
| hence the j over here-- | ▶ 02:50 |
| but every time we do this, we apply this rule over here. | ▶ 02:52 |
| What this rule gives us is a method to adapt our weights in proportion to the error. | ▶ 02:55 |
| If the prediction of our function f equals our target label, | ▶ 03:03 |
| and the error is zero, then no update occurs. | ▶ 03:07 |
| If there is a difference, however, we update in a way so as to minimize the error. | ▶ 03:11 |
| Alpha is a small learning weight. | ▶ 03:18 |
| Once again, perceptron converges to a correct linear separator | ▶ 03:22 |
| if such linear separator exists. | ▶ 03:28 |
| Now, the case of linear separation has recently received a lot of attention in machine learning. | ▶ 03:31 |
| If you look at the picture over here, you'll find there are many different linear separators. | ▶ 03:36 |
| There is one over here. There is one over here. There is one over here. | ▶ 03:42 |
| One of the questions that has recently been researched extensively is which one to prefer. | ▶ 03:47 |
| Is it a, b, or c? | ▶ 03:53 |
| Even though you probably have never seen this literature, | ▶ 03:57 |
| I will just ask your intuition in this following quiz. | ▶ 04:01 |
| Which linear separator would you prefer if you look at these three different linear separators-- | ▶ 04:05 |
| a, b, c, or none of them? | ▶ 04:10 |
| [Narrator] And intuitively I would argue it's B, | ▶ 00:00 |
| and the reason why is | ▶ 00:04 |
| C comes really close to examples. | ▶ 00:06 |
| So if these examples are noisy, | ▶ 00:09 |
| it's quite likely that | ▶ 00:12 |
| by being so close to these examples | ▶ 00:14 |
| that future examples cross the line. | ▶ 00:17 |
| Similarly A comes close to examples. | ▶ 00:20 |
| B is the one that stays really far away | ▶ 00:23 |
| from any example. | ▶ 00:26 |
| So there's this entire region over here | ▶ 00:28 |
| where there's no example anywhere near B. | ▶ 00:31 |
| This region is often called the margin. | ▶ 00:34 |
| The margin of the linear separator | ▶ 00:37 |
| is the distance of the separator | ▶ 00:40 |
| to the closest training example. | ▶ 00:43 |
| The margin is a really important concept | ▶ 00:45 |
| in machine learning. | ▶ 00:47 |
| There is an entire class of maximum margin | ▶ 00:49 |
| learning algorithms, | ▶ 00:51 |
| and the 2 most popular are | ▶ 00:53 |
| support vector machines and boosting. | ▶ 00:56 |
| If you are familiar with machine learning, | ▶ 01:00 |
| you've come across these terms. | ▶ 01:02 |
| These are very frequently used these days | ▶ 01:04 |
| in actual discrimination learning tasks. | ▶ 01:07 |
| I will not go into any details because it would go | ▶ 01:10 |
| way beyond the scope of this introduction | ▶ 01:12 |
| to artificial intelligence class, but let's see | ▶ 01:16 |
| a few abstract words specifically about | ▶ 01:18 |
| support vector machines or SVMs. | ▶ 01:21 |
| As I said before a support vector machine | ▶ 01:25 |
| derives a linear separator, and it takes | ▶ 01:30 |
| the one that actually maximizes the margin | ▶ 01:34 |
| as shown over here. | ▶ 01:39 |
| By doing so it attains additional robost-ness | ▶ 01:42 |
| over perceptron which only picks | ▶ 01:44 |
| a linear separator without | ▶ 01:46 |
| consideration of the margin. | ▶ 01:48 |
| Now the problem of finding the | ▶ 01:51 |
| margin maximizing linear separator | ▶ 01:53 |
| can be solved by a quadratic program | ▶ 01:55 |
| which is an integer method for finding the best | ▶ 01:59 |
| linear separator that maximizes the margin. | ▶ 02:03 |
| One of the nice things that support | ▶ 02:06 |
| vector machines do in practice is | ▶ 02:08 |
| they use linear techniques to solve | ▶ 02:12 |