Last update: Friday 16th of December 2011

## (1) Unit 0

### (01:15) 1 Introduction

Welcome to the online introduction to artificial intelligence. ▶ 00:00 My name is Sebastian Thrun. >>I'm Peter Norvig. ▶ 00:04 We are teaching this class at Stanford, ▶ 00:07 and now we are teaching it online for the entire world. ▶ 00:09 We are really excited about this. ▶ 00:11 It's great to have you all here. ▶ 00:13 It's exciting to have such a record-breaking number of people. ▶ 00:14 We think we can deliver a good introduction to artificial intelligence. ▶ 00:18 We hope you'll stick with it. ▶ 00:22 It's going to be a lot of work, ▶ 00:24 but we think it's going to be very rewarding. ▶ 00:25 The way that it is going to be organized is that ▶ 00:27 every week there are going to be new videos and, with these videos, quizzes. ▶ 00:29 With these quizzes, you can test your knowledge about AI. ▶ 00:32 We also post, for the advanced version of this class, homework assignments and exams ▶ 00:35 on which you'll be quizzed. ▶ 00:38 We're going to grade those to give you a final score to see ▶ 00:40 if you can actually master artificial intelligence the same way ▶ 00:44 any good student at Stanford would do it. ▶ 00:47 If you do that, then at the end of the class, we'll sign a letter of accomplishment ▶ 00:49 and let you know that you've achieved this and what your rank in the class was. ▶ 00:54 So I hope you have fun. Watch us on videotape. ▶ 00:58 We will teach you AI. ▶ 01:02 Participate in the discussion forum. ▶ 01:04 Ask your questions, and help others answer questions. ▶ 01:06 I hope we have a fantastic time ahead of us in the next 10 weeks. ▶ 01:09 Welcome to the class. We'll see you online. ▶ 01:12

## (15) Unit 1

### (02:00) 1 Introduction

Welcome to the first unit of Online Introduction to Artificial Intelligence. ▶ 00:00 I will be teaching you the very, very basics today. ▶ 00:05 This is Unit 1 of Artificial Intelligence. ▶ 00:09 Welcome. ▶ 00:14 The purpose of this class is twofold: ▶ 00:16 Number 1, to teach you the very basics of artificial intelligence ▶ 00:20 so you'll be able to talk to people in the field ▶ 00:25 and understand the basic tools of the trade; ▶ 00:29 and also, very importantly, to excite you about the field. ▶ 00:32 I have been in the field of artificial intelligence for about 20 years, ▶ 00:37 and it's been truly rewarding. ▶ 00:42 So I want you to participate in the beauty and the excitement of AI ▶ 00:44 so you can become a professional who gets the same reward ▶ 00:48 and excitement out of this field as I do. ▶ 00:52 The basic structure of this class involves videos ▶ 00:55 in which Peter or I will teach you something new, ▶ 01:00 then also quizzes, in which we will ask you about your ability to answer AI questions, ▶ 01:03 and finally, answer videos in which we tell you what the right answer would have been ▶ 01:11 for the quiz that you might have falsely or incorrectly answered before. ▶ 01:17 This will all be reiterated, and every so often you get a homework assignment, ▶ 01:22 also in the form of quizzes but without the answers. ▶ 01:28 And then we also have video exams. ▶ 01:34 If you check our website, there are requirements ▶ 01:37 on how you have to do assignments and exams. ▶ 01:39 Please go to ai-class.org for this class. ▶ 01:43 An AI program is called wetware, a formula, or an intelligent agent. ▶ 01:48 Pick the one that fits best. ▶ 01:58

### (01:34) 2 Intelligent Agents

[Thrun] The correct answer is intelligent agent. ▶ 00:00 Let's talk about intelligent agents. ▶ 00:04 Here is my intelligent agent, ▶ 00:07 and it gets to interact with an environment. ▶ 00:11 The agent can perceive the state of the environment ▶ 00:17 through its sensors, ▶ 00:22 and it can affect its state through its actuators. ▶ 00:25 The big question of artificial intelligence is the function that maps sensors to actuators. ▶ 00:29 That is called the control policy for the agent. ▶ 00:37 So all of this class will deal with how an agent makes decisions ▶ 00:41 that it can carry out with its actuators based on past sensor data. ▶ 00:48 Those decisions take place many, many times, ▶ 00:54 and the loop of environment feedback to sensors, agent decision, ▶ 00:58 actuator interaction with the environment, and so on is called the perception-action cycle. ▶ 01:03 So here is my very first quiz for you. ▶ 01:12 Artificial intelligence, AI, has successfully been used in finance, ▶ 01:15 robotics, games, medicine, and the Web. ▶ 01:21 Check any or all of those that apply. ▶ 01:26 And if none of them applies, check the box down here that says none of them. ▶ 01:28

### (06:28) 3 Applications of AI
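The perception-action cycle described above can be sketched as a short loop. This is a minimal illustration, not code from the class; the toy counter environment and all names here are my own assumptions.

```python
# Sketch of the perception-action cycle (illustrative names, not from the
# lecture): the control policy maps the percept history to an action, the
# actuators change the environment, and the loop repeats.

def run(agent_policy, environment, steps=10):
    """Run the perception-action cycle for a fixed number of steps."""
    percept_history = []
    for _ in range(steps):
        percept = environment.sense()           # sensors observe the state
        percept_history.append(percept)
        action = agent_policy(percept_history)  # policy: percepts -> action
        environment.apply(action)               # actuators affect the state

class CounterEnvironment:
    """Toy environment whose state is a number the agent can increment."""
    def __init__(self):
        self.state = 0
    def sense(self):
        return self.state
    def apply(self, action):
        self.state += action

env = CounterEnvironment()
run(lambda history: 1, env, steps=5)  # a policy that always increments
```

Each pass through the loop is one cycle: sense, decide, act. The sketch passes the whole percept history to the policy because, as later units discuss, partially observable environments force the agent to remember past measurements.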

So the correct answer is all of those-- ▶ 00:00 finance, robotics, games, medicine, the Web, and many more applications. ▶ 00:03 So let me talk about them in some detail. ▶ 00:08 There is a huge number of applications of artificial intelligence in finance, ▶ 00:10 very often in the shape of making trading decisions-- ▶ 00:15 in which case, the agent is called a trading agent. ▶ 00:18 And the environment might be things like the stock market or the bond market ▶ 00:21 or the commodities market. ▶ 00:27 And our trading agent can sense the prices of certain things, ▶ 00:29 like stocks or bonds or commodities. ▶ 00:33 It can also read the news online and follow certain events. ▶ 00:35 And its decisions are usually things like buy or sell decisions--trades. ▶ 00:40 There's a huge history of artificial intelligence finding methods to look at data over time ▶ 00:48 and make predictions as to how prices develop over time-- ▶ 00:55 and then put in trades behind those. ▶ 00:58 And very frequently, people using artificial intelligence trading agents ▶ 01:01 have made a good amount of money with superior trading decisions. ▶ 01:06 There's also a long history of AI in robotics. ▶ 01:10 Here is my depiction of a robot. ▶ 01:14 Of course, there are many different types of robots, ▶ 01:17 and they all interact with their environments through their sensors, ▶ 01:20 which include things like cameras, microphones, and tactile sensors for touch. ▶ 01:24 And the way they impact their environments is to move motors around-- ▶ 01:33 in particular, their wheels, their legs, their arms, their grippers. ▶ 01:38 They can also say things to people using voice. ▶ 01:43 Now there's a huge history of using artificial intelligence in robotics. ▶ 01:46 Pretty much every robot that does something interesting today uses AI. ▶ 01:50 In fact, often AI has been studied together with robotics, as one discipline.
▶ 01:54 But because robots are somewhat special in that they use physical actuators ▶ 01:58 and deal with physical environments, they are a little bit different from ▶ 02:03 just artificial intelligence, as a whole. ▶ 02:06 When the Web came out, the early Web crawlers were called robots, ▶ 02:08 and to block a robot from accessing your website, to the present day, ▶ 02:15 there's a file called robots.txt that allows you to deny any Web crawler ▶ 02:20 access to retrieve information from your website. ▶ 02:24 So historically, robotics played a huge role in artificial intelligence, ▶ 02:28 and a good chunk of this class will be focusing on robotics. ▶ 02:32 AI has a huge history in games-- ▶ 02:36 to make games smarter or feel more natural. ▶ 02:39 There are 2 ways in which AI has been used in games, as a game agent. ▶ 02:43 One is to play against you, as a human user. ▶ 02:47 So for example, if you play the game of Chess, ▶ 02:50 then you are the environment to the game agent. ▶ 02:54 The game agent gets to observe your moves, and it generates its own moves ▶ 02:57 with the purpose of defeating you in Chess. ▶ 03:03 So in most adversarial games, where you play against an opponent ▶ 03:07 and the opponent is a computer program, ▶ 03:10 the game agent is built to play against you--against your own interests--and make you lose. ▶ 03:13 And of course, your objective is to win. ▶ 03:20 That's an AI games-type situation. ▶ 03:22 The second thing is that game agents in AI ▶ 03:25 are also used to make games feel more natural. ▶ 03:29 So very often games have characters inside, and these characters act in some way. ▶ 03:32 And it's important for you, as the player, to feel that these characters are believable.
▶ 03:36 There's an entire sub-field of artificial intelligence to use AI ▶ 03:42 to make characters in a game more believable--look smarter, so to speak-- ▶ 03:45 so that you, as a player, think you're playing a better game. ▶ 03:51 Artificial intelligence has a long history in medicine as well. ▶ 03:55 The classic example is that of a diagnostic agent. ▶ 04:00 So here you are--and you might be sick, and you go to your doctor. ▶ 04:04 And your doctor wishes to understand ▶ 04:09 what the reason for your symptoms and your sickness is. ▶ 04:11 The diagnostic agent will observe you through various measurements-- ▶ 04:17 for example, blood pressure and heart signals, and so on-- ▶ 04:21 and it'll come up with a hypothesis as to what you might be suffering from. ▶ 04:25 But rather than intervene directly, in most cases the diagnosis of your disease ▶ 04:29 is communicated to the doctor, who then takes on the intervention. ▶ 04:34 This is called a diagnostic agent. ▶ 04:38 There are many other versions of AI in medicine. ▶ 04:40 AI is used in intensive care to understand whether there are situations ▶ 04:43 that need immediate attention. ▶ 04:48 It's been used for life-long medicine to monitor signs over long periods of time. ▶ 04:50 And as medicine becomes more personal, the role of artificial intelligence ▶ 04:54 will definitely increase. ▶ 04:58 We already mentioned AI on the Web. ▶ 05:01 The most generic version of AI is to crawl the Web and understand the Web, ▶ 05:05 and assist you in answering questions. ▶ 05:09 So when you have this search box over here ▶ 05:12 and it says "Search" on the left, ▶ 05:15 and "I'm Feeling Lucky" on the right, ▶ 05:18 and you type in the words, ▶ 05:20 what AI does for you is it understands what words you typed in ▶ 05:21 and finds the most relevant pages. ▶ 05:28 That is really core artificial intelligence.
▶ 05:30 It's used by a number of companies, such as Microsoft and Google ▶ 05:32 and Amazon, Yahoo, and many others. ▶ 05:36 And the way this works is that there's a crawling agent that can go ▶ 05:39 to the World Wide Web and retrieve pages, through just a computer program. ▶ 05:43 It then sorts these pages into a big database inside the crawler ▶ 05:51 and also analyzes the relevance of each page to any possible query. ▶ 05:56 When you then come and issue a query, ▶ 06:01 the AI system is able to give you a response-- ▶ 06:04 for example, a collection of the 10 best Web links. ▶ 06:08 In short, every time you try to write a piece of software ▶ 06:12 that makes your computer software smart, ▶ 06:15 likely you will need artificial intelligence. ▶ 06:18 And in this class, Peter and I will teach you ▶ 06:20 many of the basic tricks of the trade ▶ 06:23 to make your software really smart. ▶ 06:25

### (05:17) 4 Terminology

It will be good to introduce some basic terminology ▶ 00:00 that is commonly used in artificial intelligence to distinguish different types of problems. ▶ 00:04 The very first term I will teach you is fully versus partially observable. ▶ 00:09 An environment is called fully observable if what your agent can sense ▶ 00:16 at any point in time is completely sufficient to make the optimal decision. ▶ 00:19 So, for example, in many card games, ▶ 00:26 when all the cards are on the table, the momentary sight of all those cards ▶ 00:29 is really sufficient to make the optimal choice. ▶ 00:36 That is in contrast to some other environments where you need memory ▶ 00:40 on the side of the agent to make the best possible decision. ▶ 00:46 For example, in the game of poker, the cards aren't openly on the table, ▶ 00:50 and memorizing past moves will help you make a better decision. ▶ 00:55 To fully understand the difference, consider the interaction of an agent ▶ 01:00 with the environment through its sensors and its actuators, ▶ 01:04 and this interaction takes place over many cycles, ▶ 01:08 often called the perception-action cycle. ▶ 01:11 For many environments, it's convenient to assume ▶ 01:16 that the environment has some sort of internal state. ▶ 01:19 For example, in a card game where the cards are not openly on the table, ▶ 01:22 the state might pertain to the cards in your hand. ▶ 01:28 An environment is fully observable if the sensors can always see ▶ 01:33 the entire state of the environment. ▶ 01:37 It's partially observable if the sensors can only see a fraction of the state, ▶ 01:41 yet memorizing past measurements gives us additional information about the state ▶ 01:46 that is not readily observable right now. ▶ 01:52 So any game, for example, where past moves have information about ▶ 01:55 what might be in a person's hand--those games are partially observable, ▶ 02:01 and they require different treatment.
▶ 02:06 Very often agents that deal with partially observable environments ▶ 02:08 need to acquire internal memory to understand what ▶ 02:12 the state of the environment is, and we'll talk extensively, ▶ 02:15 when we talk about hidden Markov models, about how to structure ▶ 02:18 such internal memory. ▶ 02:21 A second terminology for environments pertains to whether the environment ▶ 02:23 is deterministic or stochastic. ▶ 02:26 A deterministic environment is one where your agent's actions ▶ 02:29 uniquely determine the outcome. ▶ 02:35 So, for example, in chess, there's really no randomness when you move a piece. ▶ 02:37 The effect of moving a piece is completely predetermined, ▶ 02:42 and no matter where I'm going to move the same piece, the outcome is the same. ▶ 02:46 That we call deterministic. ▶ 02:50 Games with dice, for example, like backgammon, are stochastic. ▶ 02:52 While you can still deterministically move your pieces, ▶ 02:56 the outcome of an action also involves the throwing of the dice, ▶ 03:00 and you can't predict those. ▶ 03:03 There's a certain amount of randomness involved in the outcome of dice, ▶ 03:05 and therefore, we call this stochastic. ▶ 03:08 Let me talk about discrete versus continuous. ▶ 03:10 A discrete environment is one where you have finitely many action choices ▶ 03:14 and finitely many things you can sense. ▶ 03:18 So, for example, in chess, again, there are finitely many board positions ▶ 03:21 and finitely many things you can do. ▶ 03:25 That is different from a continuous environment, ▶ 03:28 where the space of possible actions or things you could sense may be infinite. ▶ 03:30 So, for example, if you throw darts, there are infinitely many ways to angle the darts ▶ 03:35 and to accelerate them. ▶ 03:41 Finally, we distinguish benign versus adversarial environments. ▶ 03:43 In benign environments, the environment might be random.
▶ 03:49 It might be stochastic, but it has no objective of its own ▶ 03:53 that would contradict your own objective. ▶ 03:57 So, for example, weather is benign. ▶ 03:59 It might be random. It might affect the outcome of your actions. ▶ 04:02 But it isn't really out there to get you. ▶ 04:06 Contrast this with adversarial environments, such as many games, like chess, ▶ 04:08 where your opponent is really out there to get you. ▶ 04:14 It turns out it's much harder to find good actions in adversarial environments, ▶ 04:16 where the opponent actively observes you and counteracts what you're trying to achieve, ▶ 04:21 relative to a benign environment, where the environment might merely be stochastic ▶ 04:26 but isn't really interested in making your life worse. ▶ 04:30 So, let's see to what extent these expressions make sense to you ▶ 04:35 by going to our next quiz. ▶ 04:38 So here are the 4 concepts again: partially observable versus fully observable, ▶ 04:40 stochastic versus deterministic, continuous versus discrete, ▶ 04:45 adversarial versus benign. ▶ 04:50 And let me ask you about the game of checkers. ▶ 04:52 Check one or all of those attributes that apply. ▶ 04:56 So, if you think checkers is partially observable, check this one. ▶ 05:00 Otherwise, just don't check it. ▶ 05:03 If you think it's stochastic, check this one; ▶ 05:05 continuous, check this one; adversarial, check this one. ▶ 05:07 If you don't know about checkers, you can check the Web and Google it ▶ 05:11 to find a little more information about checkers. ▶ 05:15

### (00:52) 5 Checkers Answer

So, checkers is an interesting game. ▶ 00:00 Here's the typical board of the game of checkers. ▶ 00:04 Your pieces might look like this, ▶ 00:08 and your opponent's pieces might look like this. ▶ 00:11 And apart from some very cryptic rules in checkers, ▶ 00:16 which I won't really discuss here, the board basically tells you ▶ 00:19 everything there is to know about checkers, so it's clearly fully observable. ▶ 00:23 It is deterministic because your move and your opponent's move ▶ 00:28 very clearly affect the state of the board in ways that have ▶ 00:33 absolutely no stochasticity. ▶ 00:36 It is also discrete because there's finitely many action choices ▶ 00:39 and finitely many board positions, ▶ 00:45 and obviously, it is adversarial, since your opponent is out to get you. ▶ 00:47

### (00:12) 6 Poker

[Male narrator] The game of poker--is this partially observable, stochastic, ▶ 00:00 continuous, or adversarial? ▶ 00:06 Please check any or all of those that apply. ▶ 00:09

### (00:30) 7 Poker Answer

[Male narrator] I would argue poker is partially observable ▶ 00:00 because you can't see what is in your opponent's hands. ▶ 00:03 It is stochastic because you're being dealt cards that are kind of coming at random. ▶ 00:08 It is not continuous; there are just finitely many cards ▶ 00:13 and finitely many actions you can take, even though you might argue ▶ 00:16 that there's a huge number of different amounts you can bet. ▶ 00:20 It's still finite, and it is clearly adversarial. ▶ 00:24 If you've ever played poker before, you know how brutal it can be. ▶ 00:27

### (00:22) 8 Robotic Car

[Male narrator] --a favorite, a robotic car. ▶ 00:00 I wish to know whether it is partially observable, ▶ 00:04 stochastic, continuous, or adversarial. ▶ 00:06 That is, is the problem of driving robotically-- ▶ 00:11 say, in a city--subject to any of those 4 categories? ▶ 00:16 Please check any or all that might apply. ▶ 00:20

### (00:33) 9 Robotic Car Answer

Well, the robotic car clearly deals with a partially observable environment: ▶ 00:00 if you just look at momentary sensing input, you can't even tell how fast other cars are going. ▶ 00:04 So, you need to memorize something. ▶ 00:10 It is stochastic because it's inherently unpredictable ▶ 00:12 what's going to happen next with other cars. ▶ 00:15 It is continuous. ▶ 00:17 There are infinitely many ways to set your steering ▶ 00:20 or push your gas pedal or your brake, ▶ 00:23 and, well, you can argue whether it's adversarial or not. ▶ 00:26 Depending on where you live, it might be highly adversarial. ▶ 00:29 Where I live, it isn't. ▶ 00:31

### (01:28) 10 AI and Uncertainty
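The three quiz answers can be collected into a small lookup table. This is a sketch of my own; the attribute keys are shorthand for the four concepts from the lecture, and the robotic car's adversarial entry follows the lecturer's "where I live, it isn't" verdict.

```python
# Quiz answers from the lecture as a table: checkers, poker, and the
# robotic car, classified by the four environment attributes.
environments = {
    "checkers":    {"partially_observable": False, "stochastic": False,
                    "continuous": False, "adversarial": True},
    "poker":       {"partially_observable": True,  "stochastic": True,
                    "continuous": False, "adversarial": True},
    # Adversarial is debatable for driving; False follows the lecture's call.
    "robotic_car": {"partially_observable": True,  "stochastic": True,
                    "continuous": True,  "adversarial": False},
}
```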

I'm going to briefly talk of AI as something else, ▶ 00:00 which is that AI is the technique of uncertainty management in computer software. ▶ 00:03 Put differently, AI is the discipline that you apply when you want to know what to do ▶ 00:10 when you don't know what to do. ▶ 00:17 Now, there are many reasons why there might be uncertainty in a computer program. ▶ 00:22 There could be a sensor limit. ▶ 00:27 That is, your sensors are unable to tell you ▶ 00:29 what exactly is the case outside the AI system. ▶ 00:33 There could be adversaries who act in a way that makes it hard for you ▶ 00:37 to understand what is the case. ▶ 00:41 There could be stochastic environments. ▶ 00:44 Every time you roll the dice in a dice game, ▶ 00:48 the stochasticity of the dice will make it impossible for you ▶ 00:51 to be absolutely certain of what's the situation. ▶ 00:55 There could be laziness. ▶ 00:57 So perhaps you can actually compute what the situation is, ▶ 01:00 but your computer program is just too lazy to do it. ▶ 01:04 And here's my favorite: ignorance, plain ignorance. ▶ 01:07 Many people are just ignorant of what's going on. ▶ 01:11 They could know it, but they just don't care. ▶ 01:14 All of these things are causes for uncertainty. ▶ 01:17 AI is the discipline that deals with uncertainty and manages it in decision making. ▶ 01:21

### (04:00) 11 Examples of AI in Practice

Now we've had an introduction to AI. ▶ 00:00 We've heard about some of the properties of environments, ▶ 00:03 and we've seen some possible architectures for agents. ▶ 00:06 I'd like next to show you some examples of AI in practice. ▶ 00:10 And Sebastian and I have some personal experience in things we have done ▶ 00:13 at Google, at NASA, and at Stanford. ▶ 00:18 And I want to tell you a little bit about some of those. ▶ 00:21 One of the best successes of AI technology at Google ▶ 00:25 has been the machine translation system. ▶ 00:28 Here we see an example of an article in Italian automatically translated into English. ▶ 00:31 Now, these systems are built for 50 different languages, ▶ 00:37 and we can translate from any of the languages into any of the other languages. ▶ 00:41 So, that's over 2,500 different systems, and we've done this all ▶ 00:46 using machine learning techniques, using AI techniques, ▶ 00:51 rather than trying to build them by hand. ▶ 00:55 And the way it works is that we go out and collect examples of text ▶ 00:58 that's aligned between the 2 languages. ▶ 01:03 So we find, say, a newspaper that publishes 2 editions, ▶ 01:06 an Italian edition and an English edition, and now we have examples of translations. ▶ 01:11 And if anybody ever asked us for exactly the translation of this one particular article, ▶ 01:16 then we could just look it up and say "We already know that." ▶ 01:22 But of course, we aren't often going to be asked that. ▶ 01:25 Rather, we're going to be asked parts of this. ▶ 01:27 Here are some words that we've seen before, and we have to figure out ▶ 01:30 which words in this article correspond to which words in the translation article. ▶ 01:34 And we do that by examining many, many millions of words of text ▶ 01:40 in the 2 languages and making the correspondence, ▶ 01:45 and then we can put that all together.
▶ 01:49 And then when we see a new example of text that we haven't seen before, ▶ 01:51 we can just look up what we've seen in the past for that correspondence. ▶ 01:54 So, the task is really two parts. ▶ 01:58 Off-line, before we see an example of text we want to translate, ▶ 02:01 we first build our translation model. ▶ 02:05 We do that by examining all of the different examples ▶ 02:07 and figuring out which part aligns to which. ▶ 02:10 Then, when we're given a text to translate, we use that model, ▶ 02:14 and we go through and find the most probable translation. ▶ 02:18 So, what does it look like? ▶ 02:22 Well, let's look at some example text. ▶ 02:24 And rather than look at news articles, I'm going to look at something simpler. ▶ 02:26 I'm going to switch from Italian to Chinese. ▶ 02:29 Here's a bilingual text. ▶ 02:35 Now, for large-scale machine translation, examples are found on the Web. ▶ 02:37 This example was found in a Chinese restaurant by Adam Lopez. ▶ 02:41 Now, it's given, for a text of this form, ▶ 02:46 that a line in Chinese corresponds to a line in English, ▶ 02:49 and that's true for each of the individual lines. ▶ 02:55 But to learn from this text, what we really want to discover ▶ 02:59 is which individual words in Chinese correspond to individual words ▶ 03:02 or small phrases in English. ▶ 03:07 I've started that process by highlighting the word "wonton" in English. ▶ 03:09 It appears 3 times throughout the text. ▶ 03:16 Now, in each of those lines, there's a character that appears, ▶ 03:18 and that's the only place in the Chinese text where that character appears. ▶ 03:23 So, it seems like there's a high probability that this character in Chinese ▶ 03:27 corresponds to the word "wonton" in English. ▶ 03:33 Let's see if we can go farther. ▶ 03:36 My question for you is: what word or what character or characters in Chinese ▶ 03:38 correspond to the word "chicken" in English?
▶ 03:44 And here we see "chicken" appears in these locations. ▶ 03:47 Click on the character or characters in Chinese that correspond to "chicken." ▶ 03:54

### (00:44) 12 Chinese Translation Answer

The answer is that chicken appears here, ▶ 00:01 here, here, and here. ▶ 00:04 Now, I don't know for sure, 100%, that that is the character for chicken in Chinese, ▶ 00:10 but I do know that there is a good correspondence. ▶ 00:14 Every place the word chicken appears in English, ▶ 00:17 this character appears in Chinese and no other place. ▶ 00:20 Let's go 1 step farther. ▶ 00:24 Let's see if we can work out a phrase in Chinese ▶ 00:27 and see if it corresponds to a phrase in English. ▶ 00:30 Here's the phrase "corn cream." ▶ 00:33 Click on the characters in Chinese that correspond to "corn cream." ▶ 00:38

### (00:29) 13 Chinese Translation Answer 2

The answer is: these 2 characters here ▶ 00:00 appear only in these 2 locations, ▶ 00:04 corresponding to the words "corn cream," ▶ 00:07 which appear only in these locations in the English text. ▶ 00:10 Again, we're not 100% sure that's the right answer, ▶ 00:13 but it looks like a strong correlation. ▶ 00:17 Now, 1 more question. ▶ 00:20 Tell me what character or characters in Chinese ▶ 00:22 correspond to the English word "soup." ▶ 00:26

### (00:48) 14 Chinese Translation Answer 3

The answer is that soup occurs in most of these phrases ▶ 00:00 but not 100% of them. ▶ 00:09 It's missing in this phrase. ▶ 00:11 Equivalently, on the Chinese side, ▶ 00:14 we see this character occurs ▶ 00:17 in most of the phrases, ▶ 00:20 but it's missing here. ▶ 00:23 So we see that the correspondence doesn't have to be 100% ▶ 00:27 to tell us that there is still a good chance of a correlation. ▶ 00:31 When we're learning to do machine translation, ▶ 00:34 we use these kinds of alignments to learn probability tables ▶ 00:37 of what is the probability of one phrase in one language ▶ 00:41 corresponding to a phrase in another language. ▶ 00:45

### (00:56) 15 Congratulations
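The matching done by hand in the last few quizzes can be mimicked with simple co-occurrence counting. Here is a toy sketch: the three-line "menu" is invented for illustration (the tokens W, S, C are placeholders for Chinese characters), and the Jaccard scoring rule is my own choice, not the lecture's model.

```python
def best_alignment(parallel_lines, word):
    """Return the target token whose set of occurrence lines best matches
    the lines where `word` occurs (Jaccard similarity), mimicking the
    manual word-to-character matching done in the quizzes."""
    src_lines = {i for i, (src, _) in enumerate(parallel_lines)
                 if word in src.split()}
    candidates = {tok for _, tgt in parallel_lines for tok in tgt.split()}
    def jaccard(tok):
        tgt_lines = {i for i, (_, tgt) in enumerate(parallel_lines)
                     if tok in tgt.split()}
        return len(src_lines & tgt_lines) / len(src_lines | tgt_lines)
    return max(candidates, key=jaccard)

corpus = [  # invented placeholder "menu": English left, target tokens right
    ("wonton soup",   "W S"),
    ("chicken soup",  "C S"),
    ("chicken wonton", "C W"),
]
best_alignment(corpus, "wonton")  # -> "W" (co-occurs on exactly the same lines)
```

Real systems score such alignments probabilistically over millions of lines, as the lecture notes, but the principle is the same: a word and a token that keep appearing on the same lines are likely translations of each other, even when the correspondence is not 100%.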

So congratulations, you just finished unit 1. ▶ 00:00 You just finished unit 1 of this class, ▶ 00:03 where I told you about key applications ▶ 00:07 of artificial intelligence, ▶ 00:10 I told you about the definition of an intelligent agent, ▶ 00:13 I gave you 4 key attributes of intelligent agents ▶ 00:18 (partial observability, stochasticity, continuous spaces, and adversarial natures), ▶ 00:24 I discussed sources and management of uncertainty, ▶ 00:31 and I briefly mentioned the mathematical concept of rationality. ▶ 00:34 Obviously, I only touched on these issues superficially, ▶ 00:40 but as this class goes on, you're going to dive into each of those ▶ 00:45 and learn much more about ▶ 00:49 what it takes to make a truly intelligent AI system. ▶ 00:51 Thank you. ▶ 00:55

## (42) Unit 2

### (01:34) Topic 1, Introduction

[PROBLEM SOLVING] ▶ 00:00 In this unit we're going to talk about problem solving-- ▶ 00:01 the theory and technology of building agents ▶ 00:04 that can plan ahead to solve problems. ▶ 00:06 In particular, we're talking about problem solving ▶ 00:10 where the complexity of the problem comes from the idea that there are many states, ▶ 00:13 as in this problem here: ▶ 00:17 a navigation problem where there are many choices to start with. ▶ 00:19 And the complexity comes from picking the right choice now and picking the right choice at the ▶ 00:24 next intersection and the intersection after that-- ▶ 00:29 stringing together a sequence of actions. ▶ 00:32 This is in contrast to the type of complexity shown in this picture, ▶ 00:35 where the complexity comes from the partial observability-- ▶ 00:39 that we can't see through the fog where the possible paths are. ▶ 00:43 We can't see the results of our actions, ▶ 00:46 and even the actions themselves are not known. ▶ 00:48 This type of complexity will be covered in a later unit. ▶ 00:51 Here's an example of a problem. ▶ 00:56 This is a route-finding problem where we're given a start city, ▶ 00:58 in this case, Arad, and a destination, Bucharest, the capital of Romania, ▶ 01:03 of which this is a corner of the map. ▶ 01:09 And the problem then is to find a route from Arad to Bucharest. ▶ 01:11 The actions that the agent can execute are driving ▶ 01:16 from one city to the next along one of the roads shown on the map. ▶ 01:20 The question is, is there a solution that the agent can come up with, ▶ 01:23 given the knowledge shown here, to the problem of driving from Arad to Bucharest? ▶ 01:28

### (04:29) Topic 2, Route Finding Question

And the answer is no. ▶ 00:00 There is no solution that the agent can come up with, ▶ 00:03 because Bucharest doesn't appear on the map, ▶ 00:06 and so the agent doesn't know any actions that can arrive there. ▶ 00:08 So let's give the agent a better chance. ▶ 00:12 Now we've given the agent the full map of Romania. ▶ 00:19 To start, he's in Arad, and the destination--or goal--is Bucharest. ▶ 00:23 And the agent is given the problem of coming up with a sequence of actions ▶ 00:30 that will arrive at the destination. ▶ 00:35 Now, is it possible for the agent to solve this problem? ▶ 00:37 And the answer is yes. ▶ 00:43 There are many routes or steps or sequences of actions that will arrive at the destination. ▶ 00:45 Here is one of them: ▶ 00:50 starting out in Arad, taking this step first, then this one, then this one, ▶ 00:53 then this one, and then this one to arrive at the destination. ▶ 01:00 So that would count as a solution to the problem: ▶ 01:05 a sequence of actions, chained together, that is guaranteed to get us to the goal. ▶ 01:08 [DEFINITION OF A PROBLEM] ▶ 01:12 Now let's formally define what a problem looks like. ▶ 01:14 A problem can be broken down into a number of components. ▶ 01:17 First, the initial state that the agent starts out with. ▶ 01:21 In our route-finding problem, the initial state was the agent being in the city of Arad. ▶ 01:25 Next, a function--Actions--that takes a state as input and returns ▶ 01:32 a set of possible actions that the agent can execute when the agent is in this state. ▶ 01:41 [ACTIONS(s) → {a1, a2, a3, ...}] ▶ 01:47 In some problems, the agent will have the same actions available in all states, ▶ 01:50 and in other problems, he'll have different actions dependent on the state. ▶ 01:54 In the route-finding problem, the actions are dependent on the state.
▶ 01:58 When we're in one city, we can take the routes to the neighboring cities-- ▶ 02:02 but we can't go to any other cities. ▶ 02:06 Next we have a function called Result, which takes, as input, a state and an action ▶ 02:09 and delivers, as its output, a new state. ▶ 02:20 So, for example, if the agent is in the city of Arad--that would be the state-- ▶ 02:24 and takes the action of driving along Route E-671 towards Timisoara, ▶ 02:33 then the result of applying that action in that state would be the new state, ▶ 02:40 where the agent is in the city of Timisoara. ▶ 02:45 Next, we need a function called Goal Test, ▶ 02:51 which takes a state and returns a Boolean value-- ▶ 02:58 true or false--telling us if this state is a goal or not. ▶ 03:04 In a route-finding problem, the only goal would be being in the destination city-- ▶ 03:09 the city of Bucharest--and all the other states would return false for the Goal Test. ▶ 03:14 And finally, we need one more thing, which is a Path Cost function-- ▶ 03:19 which takes a path, a sequence of state/action transitions, ▶ 03:28 and returns a number, which is the cost of that path. ▶ 03:40 Now, for most of the problems we'll deal with, we'll make the Path Cost function be additive, ▶ 03:44 so that the cost of the path is just the sum of the costs of the individual steps. ▶ 03:50 And so we'll implement this Path Cost function in terms of a Step Cost function. ▶ 03:56 The Step Cost function takes a state, an action, and the resulting state from that action ▶ 04:04 and returns a number--n--which is the cost of that action. ▶ 04:14 In the route-finding example, the cost might be the number of miles traveled ▶ 04:18 or maybe the number of minutes it takes to get to that destination. ▶ 04:24

### (02:55) Topic 3, Route Finding
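The problem components just defined (initial state, Actions, Result, Goal Test, and Path Cost built from Step Cost) can be written down directly. A sketch in Python for a small fragment of the Romania map; only four roads are included, with the kilometer distances from the lecture's map, and the representation (an action is simply the destination city) is a simplifying assumption of mine.

```python
# Step costs between neighboring cities (kilometers, from the map fragment).
ROADS = {
    ("Arad", "Zerind"): 75,
    ("Zerind", "Oradea"): 71,
    ("Arad", "Sibiu"): 140,
    ("Arad", "Timisoara"): 118,
}

INITIAL_STATE = "Arad"

def actions(state):
    """ACTIONS(s): the roads leaving the current city."""
    return ([b for (a, b) in ROADS if a == state] +
            [a for (a, b) in ROADS if b == state])

def result(state, action):
    """RESULT(s, a): driving to a neighboring city yields that city
    (in this simplified sketch, an action names its destination)."""
    return action

def goal_test(state):
    """GOAL-TEST(s): true only in the destination city, Bucharest."""
    return state == "Bucharest"

def step_cost(state, action, new_state):
    """STEP-COST(s, a, s'): road distance between the two cities."""
    return ROADS.get((state, new_state), ROADS.get((new_state, state)))

def path_cost(path):
    """PATH-COST: additive sum of step costs along a path of states."""
    return sum(step_cost(a, b, b) for a, b in zip(path, path[1:]))

path_cost(["Arad", "Zerind", "Oradea"])  # -> 146, i.e. 75 + 71
```

With this formulation, a search algorithm never needs to know anything about driving or maps: it only ever calls these five pieces, which is exactly why the formal definition is useful.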

Now let's see how the definition of a problem maps onto the route-finding domain. First, the initial state was given. Let's say we start off in Arad, and for the goal test, let's say that the state of being in Bucharest is the only state that counts as a goal, and all the other states are not goals. Now the set of all of the states here is known as the state space, and we navigate the state space by applying actions. The actions are specific to each city, so when we are in Arad, there are three possible actions: to follow this road, this one, or this one. And as we follow them, we build paths, or sequences of actions. So just being in Arad is the path of length zero, and now we could start exploring the space and add in this path of length one, this path of length one, and this path of length one. We could add in another path here of length two and another path here of length two. Here is another path of length two. Here is a path of length three. Another path of length two, and so on. Now at every point, we want to separate the states out into three parts. First, the ends of the paths--the farthest paths that have been explored--we call the frontier. And so the frontier in this case consists of these states that are the farthest out we have explored. And then to the left of that in this diagram, we have the explored part of the state space. And then off to the right, we have the unexplored. So let's write down those three components. We have the frontier, we have the unexplored region, and we have the explored region. One more thing: in this diagram we have labeled the step cost of each action along the route. So the step cost of going from Neamt to Iasi would be 87, corresponding to a distance of 87 kilometers, and the path cost is just the sum of the step costs. So the cost of the path of going from Arad to Oradea would be 71 plus 75.

### (03:19) Topic 4, Tree Search

[Narrator] Now let's define a function for solving problems. It's called Tree Search because it superimposes a search tree over the state space. Here's how it works: it starts off by initializing the frontier to be the path consisting of only the initial state, and then it goes into a loop in which it first checks to see: do we still have anything left in the frontier? If not, we fail; there can be no solution. If we do have something, then we make a choice. Tree Search is really a family of functions, not a single algorithm, which depends on how we make that choice, and we'll see some of the options later. We go ahead and make a choice of one of the paths on the frontier and remove that path from the frontier. We find the state which is at the end of the path, and if that state's a goal, then we're done--we've found a path to the goal. Otherwise, we do what's called expanding that path: we look at all the actions from that state, and we add to the path each action and its result, so we get a new path that has the old path, the action, and the result of that action, and we stick all of those paths back onto the frontier. Now Tree Search represents a whole family of algorithms, and where you get the family resemblance is that they're all looking at the frontier, copying items off, and checking to see if they pass the goal test; but where you get the difference is right here, in the choice of how you're going to expand the next item on the frontier--which path do we look at first--and we'll go through different sets of algorithms that make different choices for which path to look at first.
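The Tree Search loop just described can be sketched in a few lines of Python. The names and the small letter-labeled graph in the usage below are illustrative; the `choose` argument is the family-defining step, and picking index 0 from the frontier (oldest path first) turns it into breadth-first search:

```python
# A minimal sketch of the Tree Search family: repeatedly choose a path
# off the frontier, goal-test its end state, and otherwise expand it.
def tree_search(initial, actions, result, goal_test, choose):
    frontier = [[initial]]                     # one path: just the initial state
    while frontier:                            # empty frontier means failure
        path = frontier.pop(choose(frontier))  # the choice that varies by algorithm
        state = path[-1]
        if goal_test(state):                   # goal test on removal from frontier
            return path
        for action in actions(state):          # expand the path
            frontier.append(path + [result(state, action)])
    return None                                # no solution
```

On a letter-labeled Romania fragment (A for Arad, S for Sibiu, F for Fagaras, B for Bucharest, and so on), `tree_search('A', graph.__getitem__, lambda s, a: a, lambda s: s == 'B', lambda f: 0)` runs as breadth-first search; note that, exactly as the lecture says, nothing stops it from building backtracking paths like Arad, Sibiu, Arad.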
The first algorithm I want to consider is called Breadth-First Search. Now, it could be called shortest-first search, because what it does is always choose from the frontier one of the paths that hasn't been considered yet that's the shortest possible. So how does it work? Well, we start off with the path of length 0, starting in the start state, and that's the only path in the frontier, so it's the shortest one, so we pick it, and then we expand it, and we add in all the paths that result from applying all the possible actions. So now we've removed this path from the frontier, but we've added in 3 new paths: this one, this one, and this one. Now we're in a position where we have 3 paths on the frontier, and we have to pick the shortest one. Now, in this case all 3 paths have the same length, length 1, so we break the tie at random or using some other technique, and let's suppose that in this case we choose this path from Arad to Sibiu. Now, the question I want you to answer is: once we remove that from the frontier, what paths are we going to add next? So show me, by checking off the cities that end the paths, which paths are going to be added to the frontier.

### (02:54) Topic 5, Tree Search Answer

[Male narrator] The answer is that in Sibiu, the Actions function gives us 4 actions, corresponding to traveling along these 4 roads, so we have to add in paths for each of those actions. One of those paths goes here; the other path continues from Arad and goes out here. The third path continues out here, and then the fourth path goes from here--from Arad to Sibiu--and then backtracks to Arad. Now, it may seem silly and redundant to have a path that starts in Arad, goes to Sibiu, and returns to Arad. How can that help us get to our destination in Bucharest? But we can see, if we're dealing with a tree search, why it's natural to have this type of formulation and why the tree search doesn't even notice that it's backtracked. What the tree search does is superimpose on top of the state space a tree of searches, and the tree looks like this. We start off in state A, and in state A, there were 3 actions, so that gave us paths going to Z, S, and T. And from S, there were 4 actions, so that gave us paths going to O, F, R, and A, and then the tree would continue on from here. We'd take one of the next items, and we'd expand it and continue on; but notice that we returned to the A state in the state space, while in the tree, it's just another item in the tree. Now, here's another representation of the search space, and what's happening is, as we start to explore the state space, we keep track of the frontier, which is the set of states that are at the end of the paths that we haven't explored yet; behind that frontier is the set of explored states, and ahead of the frontier is the set of unexplored states. Now, the reason we keep track of the explored states is that when we want to expand and we find a duplicate--say, when we expand from here, if we pointed back to state T--if we hadn't kept track of that, we would have to add in a new state for T down here. But because we've already seen it, and we know that this is actually a regressive step into the already explored states, we don't need to add it.

### (01:05) Topic 6, Graph Search

Now we see how to modify the Tree Search function to make it a Graph Search function, to avoid those repeated paths. What we do is start off by initializing a set--the explored set--of states that we have already explored. Then, when we consider a new path, we add the new state to the set of already explored states, and when we are expanding the path and adding in new states to the end of it, we don't add a new state if we have already seen it in either the frontier or the explored set. Now back to Breadth-First Search. Let's assume we are using Graph Search, so that we have eliminated the duplicate paths. Arad is crossed off the list. The path that goes from Arad to Sibiu and back to Arad is removed, and we are left with these one, two, three, four, five possible paths. Given these 5 paths, show me which ones are candidates to be expanded next by the Breadth-First Search algorithm.

### (00:42) Topic 7, Graph Search Answer
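The Graph Search modification just described--keep an explored set and never add a state already in the frontier or in the explored set--can be sketched as follows, here with a FIFO frontier so the result is breadth-first graph search. Names and the test graph are illustrative:

```python
from collections import deque

# Breadth-first Graph Search: like Tree Search, but duplicate states are
# never added to the frontier, so backtracking paths disappear.
def breadth_first_graph_search(initial, neighbors, goal_test):
    frontier = deque([[initial]])              # paths, shortest first (FIFO)
    frontier_states = {initial}                # states currently on the frontier
    explored = set()                           # states already expanded
    while frontier:
        path = frontier.popleft()
        state = path[-1]
        frontier_states.discard(state)
        if goal_test(state):                   # goal test on removal
            return path
        explored.add(state)
        for nxt in neighbors(state):
            # Skip states seen in either the explored set or the frontier.
            if nxt not in explored and nxt not in frontier_states:
                frontier.append(path + [nxt])
                frontier_states.add(nxt)
    return None
```

On the same letter-labeled fragment as before, the Arad-Sibiu-Arad path from the quiz is never generated, because Arad is already in the explored set when Sibiu is expanded.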

[Male narrator] And the answer is that Breadth-First Search always considers the shortest paths first, and in this case, there are 2 paths of length 1: the paths from Arad to Zerind and from Arad to Timisoara, so those would be the 2 paths that would be considered. Now, let's suppose that the tie is broken in some way and we chose this path from Arad to Zerind. Now we want to expand that node. We remove it from the frontier, put it in the explored list, and now we say, "What paths are we going to add?" So check off the ends of the paths--the cities that we're going to add.

### (00:13) Topic 8, Graph Search Answer

[Male narrator] In this case, there's nothing to add, because of the 2 neighbors, 1 is in the explored list and 1 is in the frontier, and if we're using graph search, then we won't add either of those.

### (00:38) Topic 9, More Graph Search

[Male narrator] So we move on; we look for another shortest path. There's one path left of length 1, so we look at that path, we expand it, add in this path, put that one on the explored list, and now we've got 3 paths of length 2. We choose 1 of them, and let's say we choose this one. Now, my question is: show me which states we add to the path, and tell me whether we're going to terminate the algorithm at this point because we've reached the goal, or whether we're going to continue.

### (00:29) Topic 10, Graph Search Answer

[Male narrator] The answer is that we add 1 more path, the path to Bucharest. We don't add the path going back, because it's in the explored list, but we don't terminate yet. True, we have added a path that ends in Bucharest, but the goal test isn't applied when we add a path to the frontier. Rather, it's applied when we remove that path from the frontier, and we haven't done that yet.

### (01:30) Topic 11, Graph Search Termination

[Male narrator] Now, why doesn't the general tree search or graph search algorithm stop when it adds a goal node to the frontier? The reason is that it might not be the best path to the goal. Now, here we found a path of length 2, and we added a path of length 3 that reached the goal. The general graph search or tree search doesn't know that there might be some other path that we could expand that would have a distance of, say, 2.5. But there's an optimization that can be made: if we know we're doing Breadth-First Search, then we know there's no possibility of a path of length 2.5, and we can change the algorithm so that it checks states as soon as they're added to the frontier rather than waiting until they're expanded. In that case, we can write a specific Breadth-First Search routine that terminates early and gives us a result as soon as we add a goal state to the frontier. Breadth-First Search will find this path that ends up in Bucharest, and if we're looking for the shortest path in terms of number of steps, Breadth-First Search is guaranteed to find it. But if we're looking for the shortest path in terms of total cost, by adding up the step costs, then it turns out that this path is shorter than the path found by Breadth-First Search. So let's look at how we could find that path.

### (00:51) Topic 12, Uniform Cost Search
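The early-termination optimization just described for breadth-first search--applying the goal test when a state is generated rather than when it is expanded--can be sketched like this. It is safe only because breadth-first search can never later find a shorter path in terms of number of steps; names and the test graph are illustrative:

```python
from collections import deque

# Breadth-first search with the early goal test: terminate as soon as a
# goal state is generated, since no shorter (fewer-step) path can exist.
def breadth_first_early_test(initial, neighbors, goal_test):
    if goal_test(initial):
        return [initial]
    frontier = deque([[initial]])
    explored = {initial}                       # states already generated
    while frontier:
        path = frontier.popleft()
        for nxt in neighbors(path[-1]):
            if nxt not in explored:
                if goal_test(nxt):             # test on generation, not expansion
                    return path + [nxt]
                explored.add(nxt)
                frontier.append(path + [nxt])
    return None
```

This is valid for breadth-first search only; for cheapest-first search, as the lecture stresses, the goal test must wait until the path is removed from the frontier.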

An algorithm that has traditionally been called uniform-cost search, but could be called cheapest-first search, is guaranteed to find the path with the cheapest total cost. Let's see how it works. We start out as before in the start state, and we pop that path of length zero off the frontier, move it from the frontier to explored, and then add in the paths out of that state. As before, there will be 3 of those paths. And now, which path are we going to pick next in order to expand, according to the rules of cheapest first?

### (00:44) Topic 13, Uniform Cost Search

Cheapest first says that we pick the path with the lowest total cost, and that would be this path. It has a cost of 75, compared to costs of 118 and 140 for the other paths. So we get here. We take that path off the frontier, put it on the explored list, and add in its neighbors--not going back to Arad, but adding in this new path. Summing up the total cost of that path: 71 + 75 is 146 for this path. And now the question is, which path gets expanded next?

### (00:56) Topic 14, Uniform Cost Search

Of the 3 paths on the frontier, we have ones with costs of 146, 140, and 118. And 118 is the cheapest, so this one gets expanded. We take it off the frontier, move it to explored, and add in its successors--in this case there's only 1, and that has a path total of 229. Which path do we expand next? Well, we've got 146, 140, and 229, so 140 is the lowest. Take it off the frontier, put it on explored, add in this path for a total cost of 220, and this path for a total cost of 239. And now the question is, which path do we expand next?

### (00:15) Topic 15, Uniform Cost Search

The answer is this one, 146. Put it on explored, but there's nothing to add, because both of its neighbors have already been explored. Which path do we look at next?

### (01:13) Topic 16, Uniform Cost Termination

The answer is this one: 220 is less than 229 or 239. Take it off the frontier, put it on explored, add in 2 more paths, and sum them up. So, 220 plus 146 is 366, and 220 plus 97 is 317. Okay, and now notice that we're closing in on Bucharest. We've got 2 neighbors almost there, but it isn't either of their turns yet. Instead, the cheapest path is this one over here, so move it to the explored list, add 70 to the path cost so far, and we get 299. Now the cheapest node is 239 here, so we expand, finally, into Bucharest at a cost of 460. And now the question is: are we done? Can we terminate the algorithm?

### (01:46) Topic 17, Uniform Cost Termination Answer

[Male] And the answer is no, we're not done yet. We've put Bucharest, the goal state, onto the frontier, but we haven't popped it off the frontier yet. And the reason is that we've got to look around and see if there's a better path that can reach it, Bucharest. And so let's continue. Look at everything on the frontier; here's the cheapest one, over here. Expand that. Now, what's the cheapest next one? Well, over here. Oops--forgot to take this one off the list. So now, 317 plus 101 gives us another path into Bucharest, and this is a better path: this is 418, which gives us another route in. But we have to keep going. The best path on the frontier is 366, so pop that off, and that would give us 2 more routes into here, and eventually we pop off all of these. And then we get to the point where 418 is the best path on the frontier. We pop that off, and then we recognize that we've reached the goal. And the reason that uniform cost finds the optimal path, the cheapest cost, is that it's guaranteed to first pop off the cheapest path, the 418, before it gets to the more expensive path, like the 460.

### (01:50) Topic 18, Depth First Search
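The uniform-cost walkthrough above can be sketched with a priority queue ordered by total path cost. The function names are illustrative; the road distances are taken from the textbook's Romania map as an assumption (one or two of the lecture's on-screen figures differ slightly), and the goal test is applied only when a path is popped off the frontier, which is what guarantees the cheapest route:

```python
import heapq

# Uniform-cost (cheapest-first) search: always pop the frontier path
# with the lowest total cost; goal-test on removal, not on insertion.
def uniform_cost_search(start, goal, roads):
    frontier = [(0, [start])]                  # priority queue of (cost, path)
    explored = set()
    while frontier:
        cost, path = heapq.heappop(frontier)   # cheapest path first
        state = path[-1]
        if state == goal:
            return cost, path
        if state in explored:                  # stale duplicate entry: skip
            continue
        explored.add(state)
        for nxt, step in roads.get(state, {}).items():
            if nxt not in explored:
                heapq.heappush(frontier, (cost + step, path + [nxt]))
    return None

ROADS = {
    'Arad': {'Zerind': 75, 'Sibiu': 140, 'Timisoara': 118},
    'Zerind': {'Arad': 75, 'Oradea': 71},
    'Oradea': {'Zerind': 71, 'Sibiu': 151},
    'Sibiu': {'Arad': 140, 'Oradea': 151, 'Fagaras': 99, 'Rimnicu Vilcea': 80},
    'Timisoara': {'Arad': 118, 'Lugoj': 111},
    'Lugoj': {'Timisoara': 111, 'Mehadia': 70},
    'Mehadia': {'Lugoj': 70, 'Drobeta': 75},
    'Drobeta': {'Mehadia': 75, 'Craiova': 120},
    'Rimnicu Vilcea': {'Sibiu': 80, 'Pitesti': 97, 'Craiova': 146},
    'Fagaras': {'Sibiu': 99, 'Bucharest': 211},
    'Pitesti': {'Rimnicu Vilcea': 97, 'Bucharest': 101, 'Craiova': 138},
    'Craiova': {'Rimnicu Vilcea': 146, 'Pitesti': 138, 'Drobeta': 120},
    'Bucharest': {'Fagaras': 211, 'Pitesti': 101},
}
```

Running this from Arad reproduces the lecture's result: the 418-cost route through Sibiu, Rimnicu Vilcea, and Pitesti is popped off before any more expensive route into Bucharest.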

So, we've looked at 2 search algorithms. One, breadth-first search, in which we always expand first the shallowest paths, the shortest paths. Second, cheapest-first search, in which we always expand first the path with the lowest total cost. And I'm going to take this opportunity to introduce a third algorithm, depth-first search, which is in a way the opposite of breadth-first search. In depth-first search, we always expand first the longest path, the path with the most steps in it. Now, what I want you to do is, for each of the nodes in each of the trees, tell us in what order they're expanded--first, second, third, fourth, fifth, and so on--by putting a number into the box. And if there are ties, resolve them in left-to-right order. Then I want you to answer one more question, which is: are these searches optimal? That is, are they guaranteed to find the best solution? For breadth-first search, optimal would mean finding the shortest path; if you think it's guaranteed to find the shortest path, check here. For cheapest first, it would mean finding the path with the lowest total path cost; check here if you think it's guaranteed to do that. And we'll allow the assumption that all costs have to be positive. And for depth first, optimal would mean, again, as in breadth first, finding the shortest possible path in terms of number of steps; check here if you think depth first will always find that.

### (01:49) Topic 19, Search Optimality Answer

Here are the answers. Breadth-first search, as the name implies, expands nodes in this order: 1, 2, 3, 4, 5, 6, 7. So it's going across a stripe at a time--breadth first. Is it optimal? Well, it's always expanding the shortest paths first, and so wherever the goal is hiding, it's going to find it without examining any longer paths first, so, in fact, it is optimal. Cheapest first: first we expand the path of cost zero, then the path of cost 2. Now there's a path of cost 4, a path of cost 5, a path of cost 6, a path of cost 7, and finally, a path of cost 8. And as we've seen, it's guaranteed to find the cheapest path of all, assuming that all the individual step costs are not negative. Depth-first search tries to go as deep as it can first, so it goes 1, 2, 3, then backs up, 4, then backs up, 5, 6, 7. And you can see that it doesn't necessarily find the shortest path of all. Let's say that there were goals in position 5 and in position 3. It would find the longer path to position 3, find the goal there, and would not find the goal in position 5. So it is not optimal.

### (02:02) Topic 20, Storage Requirements, Completeness
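Depth-first search differs from breadth-first only in treating the frontier as a stack instead of a queue. The sketch below, with an illustrative graph, also demonstrates the non-optimality just discussed: it returns a three-step route to the goal even though a two-step route exists:

```python
# Depth-first graph search: pop the most recently added (deepest) path.
# The frontier is a plain Python list used as a stack.
def depth_first_search(initial, neighbors, goal_test):
    frontier = [[initial]]                     # LIFO: deepest path on top
    explored = set()
    while frontier:
        path = frontier.pop()                  # deepest path first
        state = path[-1]
        if goal_test(state):
            return path
        if state in explored:
            continue
        explored.add(state)
        for nxt in neighbors(state):
            if nxt not in explored:
                frontier.append(path + [nxt])
    return None
```

In the test graph, the goal G is two steps away via B, but depth-first search dives down the C branch first and returns the longer A, C, D, G route.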

Given the non-optimality of depth-first search, why would anybody choose to use it? Well, the answer has to do with the storage requirements. Here I've illustrated a state space consisting of a very large or even infinite binary tree. As we go to levels 1, 2, 3, down to level n, the tree gets larger and larger. Now, let's consider the frontier for each of these search algorithms. For breadth-first search, we know the frontier looks like that, and so when we get down to level n, we'll require storage space for 2 to the n paths in breadth-first search. For cheapest first, the frontier is going to be more complicated: it's going to sort of work out this contour of cost, but it's going to have a similar total number of nodes. But for depth-first search, as we go down the tree, we start going down this branch, and then we back up; at any point, our frontier is only going to have n nodes rather than 2 to the n nodes, so that's a substantial savings for depth-first search. Now, of course, if we're also keeping track of the explored set, then we don't get that much savings. But without the explored set, depth-first search has a huge advantage in terms of space saved. One more property of the algorithms to consider is the property of completeness, meaning: if there is a goal somewhere, will the algorithm find it? So, let's move from very large trees to infinite trees, and let's say that there's some goal hidden somewhere deep down in that tree. And the question is, are each of these algorithms complete? That is, are they guaranteed to find a path to the goal? Mark off the check boxes for the algorithms that you believe are complete in this sense.

### (00:49) Topic 21, Completeness Answer

The answer is that breadth-first search is complete: even if the tree is infinite, if the goal is placed at any finite level, eventually we're going to march down and find that goal. Same with cheapest first: no matter where the goal is, if it has a finite cost, eventually we're going to go down and find it. But not so for depth-first search. If there's an infinite path, depth-first search will keep following it, so it will keep going down and down along that path and never get to the path on which the goal sits. So depth-first search is not complete.

### (04:28) Topic 22, More on Uniform Cost Search

Let's try to understand a little better how uniform cost search works. We start at a start state, and then we start expanding out from there, looking at different paths, and what we end up doing is expanding in terms of contours, like on a topographic map, where first we fan out to a certain distance, then to a farther distance, and then to a farther distance. Now at some point we meet up with a goal; let's say the goal is here. Now we've found a path from the start to the goal. But notice that the search really wasn't directed in any way towards the goal: it was expanding out everywhere in the space, and depending on where the goal is, we should expect to have to explore half the space, on average, before we find the goal. If the space is small, that can be fine, but when spaces are large, that won't get us to the goal fast enough. Unfortunately, there is really nothing we can do, with what we know, to do better than that, and so if we want to improve--if we want to be able to find the goal faster--we're going to have to add more knowledge. The type of knowledge that has proven most useful in search is an estimate of the distance from a state to the goal. So let's say we're dealing with a route-finding problem, and we can move in any direction--up or down, right or left--and we'll take as our estimate the straight-line distance between a state and the goal, and we'll try to use that estimate to find our way to the goal fastest. Now, an algorithm called greedy best-first search does exactly that: it expands first the path that's closest to the goal according to the estimate. So what do the contours look like in this approach? Well, we start here, and then we look at all the neighboring states, and the ones that appear to be closest to the goal we expand first. So we'd start expanding like this and like this and like this and like this, and that would lead us directly to the goal. So now, instead of exploring whole circles that go out everywhere in the space, our search is directed towards the goal. In this case it gets us immediately to the goal, but that won't always be the case if there are obstacles along the way. Consider this search space: we have a start state and a goal, and there's an impassable barrier. Now, greedy best-first search will start expanding out as before, trying to get towards the goal, and when it reaches the barrier, what will it do next? Well, it will try to continue along a path that's getting closer and closer to the goal, so it won't consider going back this way, which is farther from the goal. Rather, it will continue expanding out along these lines, which always get closer and closer to the goal, and eventually it will find its way towards the goal. So it does find a path, and it does it by expanding a small number of nodes, but it's willing to accept a path which is longer than other paths. Now, if we had explored in the other direction, we could have found a much shorter path, by just popping over the barrier and then going directly to the goal, but greedy best-first search wouldn't have done that, because that would have involved getting to this point, which is this distance from the goal, and then considering states which were farther from the goal.
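Greedy best-first search as described above can be sketched by ordering the frontier on the heuristic h alone, the estimated distance to the goal. The names, the small open grid in the usage below, and the Manhattan-distance estimate are all illustrative stand-ins for the lecture's continuous space:

```python
import heapq

# Greedy best-first search: always expand the frontier path whose end
# state has the smallest heuristic estimate h (distance to the goal).
def greedy_best_first(start, goal, neighbors, h):
    frontier = [(h(start), [start])]           # priority queue ordered by h only
    explored = set()
    while frontier:
        _, path = heapq.heappop(frontier)      # path whose end looks closest
        state = path[-1]
        if state == goal:
            return path
        if state in explored:
            continue
        explored.add(state)
        for nxt in neighbors(state):
            if nxt not in explored:
                heapq.heappush(frontier, (h(nxt), path + [nxt]))
    return None
```

On an open 4x4 grid with no barrier, the search marches straight toward the goal, expanding very few nodes; with a barrier in the way it would still find a path, but, as the lecture notes, not necessarily a short one.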
What we would really like is an algorithm that combines the best parts of greedy search, which explores a small number of nodes in many cases, and uniform cost search, which is guaranteed to find a shortest path. We'll show how to do that next, using an algorithm called the A* algorithm.

### (03:14) Topic 23, A-Star Search

[Male narrator] A* Search works by always expanding the path that has a minimum value of the function f, which is defined as the sum of the g and h components. Now, the function g of a path is just the path cost, and the function h of a path is equal to the h value of the state--the final state of the path--which is equal to the estimated distance to the goal. Here's an example of how A* works. Suppose we found this path through the state space to a state x, and we're trying to give a measure to the value of this path. The measure f is the sum of g, the path cost so far, and h, the estimated distance the path still needs to travel to complete its route to the goal. Now, minimizing g helps us keep the path short, and minimizing h helps us keep focused on finding the goal, and the result is a search strategy that is the best possible, in the sense that it finds the shortest length path while expanding the minimum number of paths possible. It could be called "best estimated total path cost first," but the name A* is traditional. Now let's go back to Romania and apply the A* algorithm, and we're going to use a heuristic which is the straight-line distance between a state and the goal. The goal, again, is Bucharest, and so the distance from Bucharest to Bucharest is, of course, 0. And for all the other states, I've written in red the straight-line distance--for example, straight across like that. Now, I should say that all the roads here I've drawn as straight lines, but actually roads are going to be curved to some degree, so the actual distance along the roads is going to be longer than the straight-line distance.
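The f = g + h evaluation just defined can be sketched by reusing the uniform-cost machinery with the priority changed from g to g + h. The function names are illustrative; the straight-line h values and road distances are the textbook map's, which is an assumption (they match the figures quoted in the lecture: 393, 413, 415, 417, 418):

```python
import heapq

# A* search: always expand the frontier path minimizing f = g + h,
# where g is the path cost so far and h the straight-line estimate
# from the path's end state to Bucharest. Goal test on removal.
def a_star(start, goal, roads, h):
    frontier = [(h[start], 0, [start])]        # (f, g, path)
    explored = set()
    while frontier:
        f, g, path = heapq.heappop(frontier)   # minimum f = g + h first
        state = path[-1]
        if state == goal:
            return g, path
        if state in explored:
            continue
        explored.add(state)
        for nxt, step in roads[state].items():
            if nxt not in explored:
                heapq.heappush(frontier,
                               (g + step + h[nxt], g + step, path + [nxt]))
    return None

# Straight-line distances to Bucharest (shown in red on the lecture's map).
H = {'Arad': 366, 'Zerind': 374, 'Oradea': 380, 'Sibiu': 253,
     'Timisoara': 329, 'Lugoj': 244, 'Fagaras': 176, 'Rimnicu Vilcea': 193,
     'Pitesti': 100, 'Craiova': 160, 'Bucharest': 0}
ROADS = {
    'Arad': {'Zerind': 75, 'Sibiu': 140, 'Timisoara': 118},
    'Zerind': {'Arad': 75, 'Oradea': 71},
    'Oradea': {'Zerind': 71, 'Sibiu': 151},
    'Sibiu': {'Arad': 140, 'Oradea': 151, 'Fagaras': 99, 'Rimnicu Vilcea': 80},
    'Timisoara': {'Arad': 118, 'Lugoj': 111},
    'Lugoj': {'Timisoara': 111},
    'Fagaras': {'Sibiu': 99, 'Bucharest': 211},
    'Rimnicu Vilcea': {'Sibiu': 80, 'Pitesti': 97, 'Craiova': 146},
    'Pitesti': {'Rimnicu Vilcea': 97, 'Bucharest': 101, 'Craiova': 138},
    'Craiova': {'Rimnicu Vilcea': 146, 'Pitesti': 138},
    'Bucharest': {'Fagaras': 211, 'Pitesti': 101},
}
```

Running this from Arad expands Sibiu (f=393), Rimnicu Vilcea (413), Fagaras (415), and Pitesti (417) in exactly the order the quizzes below walk through, before popping Bucharest at g = 418.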
Now, we start out as usual--we'll start in Arad as the start state--and we'll expand out from Arad, so we'll add 3 paths, and the evaluation function, f, will be the sum of the path length, which is given in black, and the estimated distance, which is given in red. And so the f value for this path will be 140+253, or 393; for this path, 75+374, or 449; and for this path, 118+329, or 447. And now the question is: out of all the paths that are on the frontier, which path would we expand next under the A* algorithm?

### (00:14) Topic 23, A-Star Search ANSWER

The answer is that we select this path first--the one from Arad to Sibiu--because it has the smallest value, 393, of the sum f = g + h.

### (00:39) Topic 24, A-Star Second Question

Let's go ahead and expand this node now. So we're going to add 3 paths. This one has a path cost of 291 and an estimated distance to the goal of 380, for a total of 671. This one has a path cost of 239 and an estimated distance of 176, for a total of 415. And the final one is 220+193=413. And now the question is: which path do we expand next?

### (00:12) Topic 24, A-Star Second Question ANSWER

The answer is we expand this path next, because its total, 413, is less than all the other ones on the frontier--although only slightly less than the 415 for this path.

### (00:20) Topic 25, A-Star Third Question

So we expand this node, giving us 2 more paths--this one with an f value of 417, and this one with an f value of 526. The question again: which path are we going to expand next?

### (00:11) Topic 25, A-Star Third Question ANSWER

And the answer is that we expand this path, to Fagaras, next, because its f total, 415, is less than all the other paths on the frontier.

### (00:26) Topic 26, A-Star Fourth Question

Now we expand Fagaras â–¶ 00:01 and we get a path that reaches the goal â–¶ 00:04 and it has a path length of 450 and an estimated distance of 0 â–¶ 00:07 for a total f value of 450, â–¶ 00:11 and now the question is: What do we do next? â–¶ 00:14 Click here if you think we're at the end of the algorithm â–¶ 00:17 and we don't need to expand next â–¶ 00:22 or click on the node that you think we will expand next. â–¶ 00:24 ### (00:23) Topic 26, A-Star Fourth Question ANSWER

The answer is that we're not done yet, â–¶ 00:00 because the algorithm works by doing the goal test â–¶ 00:03 when we take a path off the frontier, â–¶ 00:06 not when we put a path on the frontier. â–¶ 00:08 Instead, we just continue in the normal way and choose the node â–¶ 00:11 on the frontier which has the lowest value. â–¶ 00:15 That would be this one--the path through Pitesti, with a total of 417. â–¶ 00:18 ### (01:24) Topic 27, A-Star Fifth Question
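The goal-test-on-removal rule is easiest to see in code. Below is a minimal A* tree-search sketch in Python over the fragment of the Romania map used in this walkthrough. The edges are directed toward Bucharest for brevity, and the h values are the straight-line estimates quoted in the lecture (Arad's own value of 366 is from the textbook and never affects the result):

```python
import heapq

def astar(start, goal, graph, h):
    """Minimal A* tree search. Note that the goal test happens when a
    path is REMOVED from the frontier, not when it is added."""
    frontier = [(h[start], 0, start, [start])]   # (f, g, state, path)
    while frontier:
        f, g, state, path = heapq.heappop(frontier)
        if state == goal:                        # goal test on removal
            return path, g
        for nxt, step in graph.get(state, []):
            heapq.heappush(frontier, (g + step + h[nxt], g + step,
                                      nxt, path + [nxt]))
    return None, None

# Fragment of the Romania map, directed toward Bucharest for brevity.
graph = {"Arad": [("Zerind", 75), ("Sibiu", 140), ("Timisoara", 118)],
         "Sibiu": [("Fagaras", 99), ("Rimnicu Vilcea", 80)],
         "Rimnicu Vilcea": [("Pitesti", 97)],
         "Fagaras": [("Bucharest", 211)],
         "Pitesti": [("Bucharest", 101)]}
h = {"Arad": 366, "Zerind": 374, "Timisoara": 329, "Sibiu": 253,
     "Fagaras": 176, "Rimnicu Vilcea": 193, "Pitesti": 100, "Bucharest": 0}

path, cost = astar("Arad", "Bucharest", graph, h)
print(path, cost)  # the path through Pitesti, with total cost 418
```

The path through Fagaras reaches Bucharest first (f = 450) but is never popped, because the cheaper path through Pitesti (f = 418) comes off the frontier before it, exactly as in the walkthrough.
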

So let's expand the node at Pitesti. â–¶ 00:01 We have to go down this direction, up, â–¶ 00:04 then we reach a path we've seen before, â–¶ 00:08 and we go in this direction. â–¶ 00:11 Now we reach Bucharest, which is the goal, â–¶ 00:13 and the h value is going to be 0 â–¶ 00:16 because we're at the goal, and the g value works out to 418. â–¶ 00:19 Again, we don't stop here just because we put a path onto the frontier; â–¶ 00:24 when we put it there, we don't apply the goal test yet. â–¶ 00:31 Instead, we go back to the frontier, â–¶ 00:35 and it turns out that this 418 is the lowest-cost path on the frontier. â–¶ 00:38 So now we pull it off, do the goal test, â–¶ 00:43 and now we've found our path to the goal, â–¶ 00:45 and it is, in fact, the shortest possible path. â–¶ 00:49 In this case, A-star was able to find the lowest-cost path. â–¶ 00:55 Now the question that you'll have to think about, â–¶ 00:59 because we haven't explained it yet, â–¶ 01:02 is whether A-star will always do this. â–¶ 01:04 Answer yes if you think A-star will always find the shortest-cost path, â–¶ 01:06 or answer no if you think it depends on the particular problem given, â–¶ 01:12 or answer no if you think it depends on the particular heuristic estimate function, h. â–¶ 01:17 ### (00:49) Topic 27, A-Star Fifth Question ANSWER Mandatory

The answer is that it depends on the h function. â–¶ 00:02 A-star will find the lowest-cost path â–¶ 00:06 if the h function for a state is less than the true cost â–¶ 00:09 of the path to the goal through that state. â–¶ 00:16 In other words, we want h to never overestimate the distance to the goal. â–¶ 00:20 We also say that h is optimistic. â–¶ 00:26 Another way of stating that â–¶ 00:31 is that h is admissible, â–¶ 00:34 meaning it is admissible to use it to find the lowest-cost path. â–¶ 00:37 Think of all of these as being the same way â–¶ 00:41 of stating the conditions under which A-star finds the lowest-cost path. â–¶ 00:45 ### (01:22) Topic 28, Optimistic Heuristics
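The admissibility condition is just a pointwise comparison of h against the true remaining cost for every state, which a short sketch makes concrete (the numbers below are hypothetical, chosen only for illustration):

```python
def is_admissible(h, true_cost):
    """h is admissible iff it never overestimates the true cost to the goal."""
    return all(h[s] <= true_cost[s] for s in h)

# Hypothetical values for illustration only:
h         = {"A": 5, "B": 2, "goal": 0}
true_cost = {"A": 6, "B": 2, "goal": 0}
print(is_admissible(h, true_cost))              # True
print(is_admissible({**h, "A": 7}, true_cost))  # False -- overestimates at A
```
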

Here we give you an intuition as to why â–¶ 00:01 an optimistic heuristic function, h, finds the lowest-cost path. â–¶ 00:03 When A-star ends, it returns a path, p, with estimated cost, c. â–¶ 00:08 It turns out that c is also the actual cost, â–¶ 00:15 because at the goal the h component is 0, â–¶ 00:20 and so the path cost is the total cost as estimated by the function. â–¶ 00:23 Now, all the paths on the frontier â–¶ 00:28 have an estimated cost that's greater than c, â–¶ 00:31 and we know that because the frontier is explored in cheapest-first order. â–¶ 00:35 If h is optimistic, then the estimated cost â–¶ 00:40 is less than the true cost, â–¶ 00:44 so the path p must have a cost that's less than the true cost â–¶ 00:47 of any of the paths on the frontier. â–¶ 00:51 Any paths that go beyond the frontier â–¶ 00:54 must have a cost that's greater than that â–¶ 00:57 because we agreed that the step cost is always 0 or more. â–¶ 00:59 So that means that this path, p, must be the minimal-cost path. â–¶ 01:04 Now, this argument, I should say, only goes through â–¶ 01:09 as is for tree search. â–¶ 01:13 For graph search the argument is slightly more complicated, â–¶ 01:16 but the general intuitions hold the same. â–¶ 01:19 ### (00:59) Topic 29, State Spaces

So far we've looked at the state space of cities in Romania-- â–¶ 00:01 a 2-dimensional, physical space. â–¶ 00:05 But the technology for problem solving through search â–¶ 00:07 can deal with many types of state spaces, â–¶ 00:10 dealing with abstract properties, not just x-y position in a plane. â–¶ 00:12 Here I introduce another state space--the vacuum world. â–¶ 00:17 It's a very simple world in which there are only 2 positions â–¶ 00:21 as opposed to the many positions in the Romania state space. â–¶ 00:25 But there are additional properties to deal with as well. â–¶ 00:30 The robot vacuum cleaner can be in either of the 2 positions, â–¶ 00:33 but as well as that each of the positions â–¶ 00:36 can either have dirt in it or not have dirt in it. â–¶ 00:40 Now the question is, to represent this as a state space, â–¶ 00:43 how many states do we need? â–¶ 00:47 You can fill in the number of states in this box here. â–¶ 00:51 ### (00:35) Topic 29, State Spaces ANSWER

And the answer is there are 8 states. â–¶ 00:01 There are 2 physical states that the robot vacuum cleaner can be in-- â–¶ 00:04 either in state A or in state B. â–¶ 00:10 But in addition to that, there are states about how the world is â–¶ 00:12 as well as where the robot is in the world. â–¶ 00:17 So state A can be dirty or not. â–¶ 00:19 That's 2 possibilities. â–¶ 00:24 And B can be dirty or not. â–¶ 00:26 That's 2 more possibilities. â–¶ 00:28 We multiply those together. We get 8 possible states. â–¶ 00:31 ### (01:44) Topic 30, State Space Diagram and More Complexity

Here is a diagram of the state space for the vacuum world. â–¶ 00:01 Note that there are 8 states, and we have the actions connecting the states â–¶ 00:05 just as we did in the Romania problem. â–¶ 00:09 Now let's look at a path through this state space. â–¶ 00:12 Let's say we start out in this position, â–¶ 00:15 and then we apply the action of moving right. â–¶ 00:19 Then we end up in a position where the state of the world looks the same, â–¶ 00:23 except the robot has moved from position 'A' to position 'B'. â–¶ 00:27 Now if we turn on the sucking action, â–¶ 00:32 then we end up in a state where the robot is in the same position â–¶ 00:37 but that position is no longer dirty. â–¶ 00:42 Let's take this very simple vacuum world â–¶ 00:47 and make a slightly more complicated one. â–¶ 00:50 First, we'll say that the robot has a power switch, â–¶ 00:53 which can be in one of three conditions: on, off, or sleep. â–¶ 00:56 Next, we'll say that the robot has a dirt-sensing camera, â–¶ 01:04 and that camera can either be on or off. â–¶ 01:09 Third, this is the deluxe model of robot â–¶ 01:13 in which the brushes that clean up the dust â–¶ 01:16 can be set at 1 of 5 different heights â–¶ 01:19 to be appropriate for whatever level of carpeting you have. â–¶ 01:22 Finally, rather than just having the 2 positions, â–¶ 01:27 we'll extend that out and have 10 positions. â–¶ 01:30 Now the question is how many states are in this state space? â–¶ 01:37 ### (00:57) Topic 30, State Space Diagram and More Complexity ANSWER

The answer is that the number of states is the cross product â–¶ 00:01 of the numbers of all the variables, since they're each independent, â–¶ 00:05 and any combination can occur. â–¶ 00:08 For the power we have 3 possible positions. â–¶ 00:10 The camera has 2. â–¶ 00:14 The brush height has 5. â–¶ 00:18 The dirt has 2 for each of the 10 positions. â–¶ 00:23 That's 2^10 or 1024. â–¶ 00:28 Then the robot's position can be any of those 10 positions as well. â–¶ 00:33 That works out to 307,200 states in the state space. â–¶ 00:39 Notice how a fairly trivial problem-- â–¶ 00:44 we're only modeling a few variables and only 10 positions-- â–¶ 00:46 works out to a large number of states. â–¶ 00:50 That's why we need efficient algorithms for searching through state spaces. â–¶ 00:52 ### (01:49) Topic 31, Sliding Blocks Puzzle
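Before moving on to the puzzle, both vacuum-world state counts are a straightforward product of independent choices, quick to verify in Python:

```python
# Simple vacuum world: 2 robot positions x dirt-or-not in each of 2 squares.
simple = 2 * 2 * 2
print(simple)          # 8 states

# Deluxe model from the lecture:
power        = 3       # on, off, sleep
camera       = 2       # on or off
brush_height = 5
positions    = 10
dirt         = 2 ** positions   # each of the 10 positions dirty or clean
deluxe = power * camera * brush_height * dirt * positions
print(deluxe)          # 307200 states
```
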

I want to introduce one more problem that can be solved with search techniques. â–¶ 00:01 This is a sliding blocks puzzle, called a 15 puzzle. â–¶ 00:05 You may have seen something like this. â–¶ 00:08 So there are a bunch of little squares or blocks or tiles â–¶ 00:10 and you can slide them around, â–¶ 00:14 and the goal is to get into a certain configuration. â–¶ 00:19 So we'll say that this is the goal state, where the numbers 1-15 are in order â–¶ 00:21 left to right, top to bottom. â–¶ 00:27 The starting state would be some state where all the positions are messed up. â–¶ 00:29 Now the question is: Can we come up with a good heuristic for this? â–¶ 00:34 Let's examine that as a way of thinking about where heuristics come from. â–¶ 00:38 The first heuristic we're going to consider â–¶ 00:42 we'll call h1, and that is equal to the number of misplaced blocks. â–¶ 00:46 So here 10 and 11 are misplaced because they should be there and there, respectively, â–¶ 00:54 12 is in the right place, 13 is in the right place, â–¶ 00:59 and 14 and 15 are misplaced. â–¶ 01:02 That's a total of 4 misplaced blocks. â–¶ 01:04 The 2nd heuristic, h2, is equal to â–¶ 01:07 the sum of the distances that each block would have to move to get to the right position. â–¶ 01:13 For this position, 10 would have to move 1 space to get to the right position, â–¶ 01:19 11 would have to move 1, so that's a total of 2 so far, â–¶ 01:26 13 is in the right place, â–¶ 01:30 14 is 1 displaced, â–¶ 01:31 and 15 is 1 displaced, â–¶ 01:33 so that would also be a total of 4. â–¶ 01:35 Now, the question is: Which, if any, of these heuristics are admissible? â–¶ 01:38 Check the boxes next to the heuristics that you think â–¶ 01:44 are admissible. â–¶ 01:47 ### (00:42) Topic 31, Sliding Blocks Puzzle ANSWER

H1 is admissible, because every tile that's in the wrong position â–¶ 00:02 must be moved at least once to get into the right position. â–¶ 00:07 So h1 never overestimates. â–¶ 00:10 How about h2? â–¶ 00:13 H2 is also admissible, because every tile in the wrong position â–¶ 00:15 can be moved closer to the correct position no faster than 1 space per move. â–¶ 00:20 Therefore, both are admissible. â–¶ 00:26 But notice that h2 is always greater than or equal to h1. â–¶ 00:28 That means that, with the exception of breaking ties, â–¶ 00:33 an A* search using h2 will always expand â–¶ 00:35 fewer paths than one using h1. â–¶ 00:39 ### (03:16) Topic 32, Where is the Intelligence
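Both heuristics from the sliding-blocks question are only a few lines of Python each. This sketch assumes the blank sits in the bottom-right corner of the goal and writes it as 0; the example state is the one from the lecture, with 10/11 and 14/15 swapped:

```python
GOAL = list(range(1, 16)) + [0]          # 1..15 in order, 0 marks the blank

def h1(state):
    """Number of misplaced tiles (the blank is not counted)."""
    return sum(1 for s, g in zip(state, GOAL) if s != 0 and s != g)

def h2(state):
    """Sum of the distances each tile would have to move to its goal square."""
    total = 0
    for i, tile in enumerate(state):
        if tile:
            g = GOAL.index(tile)
            total += abs(i // 4 - g // 4) + abs(i % 4 - g % 4)
    return total

# Lecture example: 10/11 swapped and 14/15 swapped.
state = [1, 2, 3, 4, 5, 6, 7, 8, 9, 11, 10, 12, 13, 15, 14, 0]
print(h1(state), h2(state))   # 4 4 -- and h2 >= h1 holds in general
```
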

Now, we're trying to build an artificial intelligence â–¶ 00:01 that can solve problems like this all on its own. â–¶ 00:04 You can see that the search algorithms do a great job â–¶ 00:08 of finding solutions to problems like this. â–¶ 00:12 But, you might complain that in order for the search algorithms to work, â–¶ 00:15 we had to provide them with a heuristic function. â–¶ 00:19 The heuristic function came from the outside. â–¶ 00:22 You might think that coming up with a good heuristic function is really where all the intelligence is. â–¶ 00:25 So, a problem solver that uses a heuristic function given to it â–¶ 00:30 really isn't intelligent at all. â–¶ 00:34 So let's think about where the intelligence could come from â–¶ 00:36 and whether we can automatically come up with good heuristic functions. â–¶ 00:39 I'm going to sketch a description of â–¶ 00:45 a program that can automatically come up with good heuristics â–¶ 00:47 given a description of a problem. â–¶ 00:50 Suppose this program is given a description of the sliding blocks puzzle â–¶ 00:52 where we say that a block can move from square A to square B â–¶ 00:57 if A is adjacent to B and B is blank. â–¶ 01:02 Now, imagine that we try to loosen this restriction. â–¶ 01:06 We cross out "B is blank," â–¶ 01:10 and then we get the rule â–¶ 01:14 "a block can move from A to B if A is adjacent to B," â–¶ 01:16 and that's equal to our heuristic h2 â–¶ 01:20 because a block can move anywhere to an adjacent state. â–¶ 01:23 Now, we could also cross out the other part of the rule, â–¶ 01:27 and we now get "a block can move from any square A â–¶ 01:31 to any square B regardless of any condition." â–¶ 01:36 That gives us heuristic h1. â–¶ 01:40 So we see that both of our heuristics can be derived â–¶ 01:43 from a simple mechanical manipulation â–¶ 01:48 of the formal description of the problem.
â–¶ 01:50 Once we've generated automatically these candidate heuristics, â–¶ 01:53 another way to come up with a good heuristic is to say â–¶ 01:58 that a new heuristic, h, â–¶ 02:02 is equal to the maximum of h1 and h2, â–¶ 02:04 and that's guaranteed to be admissible as long as â–¶ 02:10 h1 and h2 are admissible â–¶ 02:13 because it still never overestimates, â–¶ 02:16 and it's guaranteed to be better because it's getting closer to the true value. â–¶ 02:18 The only problem with combining multiple heuristics like this â–¶ 02:22 is that there is some cost to computing the heuristic, â–¶ 02:27 and it could take longer to compute, â–¶ 02:29 even though we end up expanding fewer paths. â–¶ 02:31 Crossing out parts of the rules like this â–¶ 02:35 is called "generating a relaxed problem." â–¶ 02:38 What we've done is we've taken the original problem, â–¶ 02:41 where it's hard to move squares around, â–¶ 02:44 and made it easier by relaxing one of the constraints. â–¶ 02:46 You can see that as adding new links in the state space, â–¶ 02:49 so if we have a state space in which there are only particular links, â–¶ 02:54 by relaxing the problem it's as if we are adding new operators â–¶ 02:59 that traverse the state space in new ways. â–¶ 03:05 So adding new operators only makes the problem easier, â–¶ 03:07 and thus never overestimates, and thus is admissible. â–¶ 03:11 ### (01:52) Topic 33, What Can't Search Do
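Before leaving heuristics: the maximum combination just described is a one-liner. The wrapper below is a generic sketch (the function name and the toy heuristics are mine, for illustration):

```python
def max_heuristic(*hs):
    """The pointwise max of admissible heuristics is still admissible
    (it still never overestimates) and dominates each individual h."""
    return lambda state: max(h(state) for h in hs)

# Toy illustration with two made-up heuristics on integer "states":
h = max_heuristic(lambda s: s, lambda s: 2 * s - 3)
print(h(1), h(5))   # 1 7 -- whichever estimate is larger wins pointwise
```
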

We've seen what search can do for problem solving. â–¶ 00:00 It can find the lowest-cost path to a goal, â–¶ 00:03 and it can do that in a way in which we never generate more paths than we have to. â–¶ 00:06 We can find the optimal number of paths to generate, â–¶ 00:12 and we can do that with a heuristic function that we generate on our own â–¶ 00:15 by relaxing the existing problem definition. â–¶ 00:19 But let's be clear on what search can't do. â–¶ 00:22 All the solutions that we have found consist of a fixed sequence of actions. â–¶ 00:25 In other words, the agent, here in Arad, thinks, comes up with a plan that it wants to execute, â–¶ 00:31 and then essentially closes its eyes and starts driving, â–¶ 00:38 never considering along the way if something has gone wrong. â–¶ 00:42 That works fine for this type of problem, â–¶ 00:46 but it only works when we satisfy the following conditions. â–¶ 00:49 [Problem solving works when:] â–¶ 00:53 Problem-solving technology works when the following set of conditions is true: â–¶ 00:55 First, the domain must be fully observable. â–¶ 00:59 In other words, we must be able to see what initial state we start out with. â–¶ 01:03 Second, the domain must be known. â–¶ 01:08 That is, we have to know the set of actions available to us. â–¶ 01:12 Third, the domain must be discrete. â–¶ 01:16 There must be a finite number of actions to choose from. â–¶ 01:20 Fourth, the domain must be deterministic. â–¶ 01:24 We have to know the result of taking an action. â–¶ 01:28 Finally, the domain must be static. â–¶ 01:32 There must be nothing else in the world that can change the world except our own actions. â–¶ 01:36 If all these conditions are true, then we can search for a plan â–¶ 01:41 which solves the problem and is guaranteed to work. â–¶ 01:44 In later units, we will see what to do if any of these conditions fail to hold. â–¶ 01:47 ### (02:35) Topic 34, Note on Implementation

Our description of the algorithm has talked about paths in the state space. â–¶ 00:01 I want to say a little bit now about how to implement that in terms of a computer algorithm. â–¶ 00:08 We talk about paths, but we want to implement that in some way. â–¶ 00:15 In the implementation we talk about nodes. â–¶ 00:19 A node is a data structure, and it has four fields. â–¶ 00:22 The state field indicates the state at the end of the path. â–¶ 00:27 The action was the action it took to get there. â–¶ 00:35 The cost is the total cost, â–¶ 00:40 and the parent is a pointer to another node. â–¶ 00:45 In this case, the node that has state "S" â–¶ 00:50 will have a parent which points to the node that has state "A", â–¶ 00:56 and that will have a parent pointer that's null. â–¶ 01:06 So we have a linked list of nodes representing the path. â–¶ 01:10 We'll use the word "path" for the abstract idea, â–¶ 01:15 and the word "node" for the representation in the computer memory. â–¶ 01:18 But otherwise, you can think of those two terms as being synonyms, â–¶ 01:22 because they're in a one-to-one correspondence. â–¶ 01:26 Now there are two main data structures that deal with nodes. â–¶ 01:31 We have the "frontier" and we have the "explored" list. â–¶ 01:35 Let's talk about how to implement them. â–¶ 01:41 In the frontier the operations we have to deal with â–¶ 01:44 are removing the best item from the frontier and adding in new ones. â–¶ 01:48 And that suggests we should implement it as a priority queue, â–¶ 01:52 which knows how to keep track of the best items in proper order. â–¶ 01:55 But we also need to have an additional operation â–¶ 01:59 of a membership test, to check whether a new item is in the frontier. â–¶ 02:03 And that suggests representing it as a set, â–¶ 02:07 which can be built from a hash table or a tree. â–¶ 02:10 So the most efficient implementations of search actually have both representations. â–¶ 02:14 The explored set, on the other hand, is easier.
â–¶ 02:20 All we have to do there is be able to add new members and check for membership. â–¶ 02:23 So we represent that as a single set, â–¶ 02:28 which again can be done with either a hash table or tree. â–¶ 02:31
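A minimal Python sketch of this node representation and the dual frontier bookkeeping (the class and function names are mine; the lecture only names the four fields and the two data structures):

```python
import heapq
import itertools

class Node:
    """One element of the linked list: state, action, total cost, parent."""
    def __init__(self, state, action, cost, parent=None):
        self.state, self.action, self.cost, self.parent = \
            state, action, cost, parent

def states_along_path(node):
    """Follow parent pointers back to the start, then reverse."""
    out = []
    while node is not None:
        out.append(node.state)
        node = node.parent
    return out[::-1]

# The frontier is kept in BOTH representations, as suggested above:
heap = []                 # priority queue: remove the best item cheaply
members = set()           # hash set: cheap membership test
tie = itertools.count()   # tie-breaker so Node objects are never compared

def frontier_add(node):
    heapq.heappush(heap, (node.cost, next(tie), node))
    members.add(node.state)

a = Node("A", None, 0)              # parent pointer is null
s = Node("S", "go", 1, parent=a)    # node with state "S" points back to "A"
frontier_add(s)
print(states_along_path(s), "S" in members)  # ['A', 'S'] True
```
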

## (16) Homework 1

### (00:05) Congratulations!

Congratulations. â–¶ 00:00 You just made it to assignment 1. â–¶ 00:02 ### (00:05) Introduction

This is homework assignment #1. â–¶ 00:00 ### (01:00) Question 1, Peg Solitaire

This is a question about peg solitaire. â–¶ 00:01 In peg solitaire, a single player faces â–¶ 00:04 the following kind of board. â–¶ 00:08 Initially, all positions are occupied except for the center one. â–¶ 00:13 You can find more information on peg solitaire at the following URL. â–¶ 00:22 [http://en.wikipedia.org/wiki/peg_solitaire] â–¶ 00:26 I wish to know whether this game is partially observable. â–¶ 00:36 Please say yes or no. â–¶ 00:40 I wish to know whether it is stochastic. â–¶ 00:43 Please say yes if it is and no if it's deterministic. â–¶ 00:46 Let me know if it's continuous, yes or no, â–¶ 00:50 and let me know if it's adversarial, yes or no. â–¶ 00:55 ### (00:22) Question 1, Peg Solitaire ANSWER

>>Peg Solitaire is not partially observable because you can see the board at all times. â–¶ 00:00 It is not stochastic because you make all the moves, â–¶ 00:06 and they have deterministic effects. â–¶ 00:09 It is not continuous: there are just finitely many choices of actions â–¶ 00:11 and finitely many board positions. â–¶ 00:15 And it's not adversarial because there are no adversaries--just you playing. â–¶ 00:18 ### (00:54) Question 2, Loaded Coin

I am going to ask you about the problem to learn about a loaded coin. â–¶ 00:01 A loaded coin is a coin, â–¶ 00:05 that if you flip it, â–¶ 00:07 might have a non 0.5 chance â–¶ 00:09 of coming up heads or tails. â–¶ 00:13 Fair coins always come up 50% heads or tails. â–¶ 00:16 Loaded coins might come up, for example, â–¶ 00:20 0.9 chance heads and 0.1 chance tails. â–¶ 00:23 Your task will be to understand, â–¶ 00:27 from coin flips, â–¶ 00:30 whether a coin is loaded, â–¶ 00:31 and if so, at what probability. â–¶ 00:33 I don't want you to solve the problem, â–¶ 00:35 but I want you to answer the following questions: â–¶ 00:37 Is it partially observable? â–¶ 00:40 Yes or no. â–¶ 00:42 Is it stochastic? â–¶ 00:44 Yes or no. â–¶ 00:46 Is it continuous? [Yes or no.] â–¶ 00:48 And finally, is it adversarial? â–¶ 00:51 Yes or no. â–¶ 00:53 ### (00:38) Question 2, Loaded Coin ANSWER

[Thrun] So the loaded coin example is clearly partially observable, â–¶ 00:00 and the reason is that memory actually matters: â–¶ 00:06 if you flip it more than 1 time, you can learn more about what the actual probability is. â–¶ 00:09 Therefore, looking at the most recent coin flip is insufficient to make your choice. â–¶ 00:14 It is stochastic because you flip a coin. â–¶ 00:20 It is not continuous because there's only 1 action--a flip--and 2 outcomes. â–¶ 00:25 And it isn't really adversarial because while you do your learning task â–¶ 00:31 no adversary interferes. â–¶ 00:36 ### (00:32) Question 3, Path Through Maze
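As an aside, the learning task from the loaded-coin question, estimating the bias from repeated flips, is easy to simulate (the function name is mine, and the run is seeded so it is repeatable):

```python
import random

def estimate_heads_prob(p_heads, n_flips, seed=0):
    """Flip a (possibly loaded) coin n_flips times and estimate P(heads)
    from the observed frequency."""
    rng = random.Random(seed)
    heads = sum(rng.random() < p_heads for _ in range(n_flips))
    return heads / n_flips

print(estimate_heads_prob(0.9, 10_000))  # close to 0.9 for this loaded coin
print(estimate_heads_prob(0.5, 10_000))  # close to 0.5 for a fair coin
```

More flips give a tighter estimate, which is exactly why memory of past flips matters in this problem.
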

Let's talk about the problem of finding a path through a maze. â–¶ 00:00 Let me draw you a maze. â–¶ 00:05 Suppose you wish to find the path from the start to your goal. â–¶ 00:10 I don't want you to solve this problem. â–¶ 00:15 Rather I want you to tell me whether it's partially observable. â–¶ 00:19 Yes or no. â–¶ 00:23 Is it stochastic? â–¶ 00:25 Yes or no. â–¶ 00:27 Is it continuous? â–¶ 00:29 Yes or no. â–¶ 00:31 ### (00:18) Question 3, Path Through Maze ANSWER

[Thrun] The path through the maze is clearly not partially observable â–¶ 00:00 because you can see the maze entirely at all times. â–¶ 00:03 It is not stochastic. There is no randomness involved. â–¶ 00:06 It isn't really continuous. â–¶ 00:10 There's typically just finitely many choices--go left or right. â–¶ 00:12 And it isn't adversarial because there's no real adversary involved. â–¶ 00:15 ### (00:43) Question 4, Search Tree

This is a search question. â–¶ 00:00 Suppose we are given the following search tree. â–¶ 00:02 We are searching from the top, the start node, â–¶ 00:05 to the goal, which is over here. â–¶ 00:08 Assume we expand from left to right. â–¶ 00:12 Tell me how many nodes are expanded in Breadth First Search â–¶ 00:17 if we expand from left to right, â–¶ 00:20 counting the start node and the goal node in your answer. â–¶ 00:23 And give me the same answer for Depth First Search. â–¶ 00:27 Now, let's assume you're going to search from right to left. â–¶ 00:32 How many nodes would we now expand in Breadth First Search, â–¶ 00:35 and how many do we expand in Depth First Search? â–¶ 00:39 ### (00:38) Question 4, Search Tree ANSWER

[Thrun] Breadth first from left to right is 6-- â–¶ 00:00 1, 2, 3, 4, 5, 6. â–¶ 00:03 Depth first from left to right is 4--1, 2, 3, 4. â–¶ 00:07 Breadth first searched from right to left is 9-- â–¶ 00:15 1, 2, 3, 4, 5, 6, 7, 8, 9. â–¶ 00:19 And depth first from right to left is 9-- â–¶ 00:25 1, 2, 3, 4, 5, 6, 7, 8, 9. â–¶ 00:28 ### (00:31) Question 5, Another Search Tree

Another search problem-- â–¶ 00:00 Consider the following search tree, â–¶ 00:03 where this is the start node. â–¶ 00:08 Now, assume we search from left to right. â–¶ 00:12 I would like you to tell me the number of nodes expanded from Breadth-First Search â–¶ 00:15 and Depth-First Search. â–¶ 00:19 Please do count the start and the goal node, â–¶ 00:22 and please give me the same numbers for Right-to-Left Search, â–¶ 00:25 for Breadth-First, and Depth-First. â–¶ 00:28 ### (00:48) Question 5, Another Search Tree ANSWER

[Thrun] The correct answer for breadth first left to right is 13-- â–¶ 00:00 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13. â–¶ 00:05 And for depth first it is 10-- â–¶ 00:13 1, 2, 3, 4, 5, 6, 7, 8, 9, and 10. â–¶ 00:17 For right to left search, the right answer for breadth first is 11-- â–¶ 00:28 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11. â–¶ 00:32 And for depth first the right answer is 7-- â–¶ 00:38 1, 2, 3, 4, 5, 6, 7. â–¶ 00:42 ### (01:00) Question 6, Search Network

This is another search problem. â–¶ 00:00 Let's assume we have a search graph. â–¶ 00:04 It isn't quite a tree but looks like this. â–¶ 00:07 Obviously in this structure we can reach nodes through multiple paths. â–¶ 00:13 So let's assume that our search never expands the same node twice. â–¶ 00:18 Let's also assume this start node is on top. We search down. â–¶ 00:22 And this over here is our goal node. â–¶ 00:27 So for left-to-right search, tell me how many nodes â–¶ 00:30 breadth first would expand--do count the start and goal node in the final answer. â–¶ 00:35 Give me the same result for a depth-first search, â–¶ 00:43 again counting the start and the goal node in your answer. â–¶ 00:48 And again give me your answer for breadth-first â–¶ 00:51 and for depth-first in the right-to-left search paradigm. â–¶ 00:54 ### (00:49) Question 6, Search Network ANSWER

[Thrun] The right answer over here is 10 for breadth first from left to right-- â–¶ 00:00 1, 2, 3, 4, 5, 6, 7, 8, 9, 10. â–¶ 00:05 Depth first is 16, or all nodes-- â–¶ 00:11 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16. â–¶ 00:15 And notice how I never expanded a node twice. â–¶ 00:30 Correct answer for breadth first right to left is 7-- â–¶ 00:34 1, 2, 3, 4, 5, 6, 7. â–¶ 00:38 And the correct answer for depth first from right to left is 4--1, 2, 3, and 4. â–¶ 00:43 ### (01:16) Question 7, A* Search

Let's talk about A-star search. â–¶ 00:00 Let's assume we have the following grid. â–¶ 00:03 The start state is right here. â–¶ 00:08 And the goal state is right here. â–¶ 00:13 And just for convenience, I will give each row here a little letter: â–¶ 00:16 A, B, C, D. â–¶ 00:22 Let me draw a heuristic function. â–¶ 00:26 Please take a look for a moment â–¶ 00:30 and tell me whether this heuristic function is admissible. â–¶ 00:32 Check here if yes and here if no. â–¶ 00:38 Which one is the first node A-star would expand? â–¶ 00:41 B1 or A2? â–¶ 00:46 What's the second node to expand? â–¶ 00:51 B1, C1, A2, A3, or B2? â–¶ 00:56 And finally, what is the third node to expand? â–¶ 01:06 D1, C2, B3, or A4? â–¶ 01:10 ### (01:44) Question 7, A* Search ANSWER

[Thrun] Clearly this is an admissible heuristic because the distance to the goal â–¶ 00:00 is strictly underestimated. â–¶ 00:05 From here it would take 1 step, â–¶ 00:07 from here it would take 1, 2 steps, so the answer is yes. â–¶ 00:09 Now, to understand A*, let me also draw the g function â–¶ 00:15 for the relevant part of this table. â–¶ 00:22 Clearly g is 0 over here. â–¶ 00:24 To understand which node to expand, this one or this one, â–¶ 00:27 let's project the g function, which is 1, â–¶ 00:31 and we will see that 3 plus 1 is smaller than 4 plus 1; â–¶ 00:34 therefore, this is the second node to expand, which is B1. â–¶ 00:40 Now let me for the next step extend the g function from this node here, 2 and 2. â–¶ 00:47 So 2 plus 2 is 4 versus 3 plus 2 is 5, so we expand this node next, which is C1. â–¶ 00:55 And finally, the g function from here would go 3 and 3. â–¶ 01:08 3 plus 1 is better than 3 plus 2, so we would expand D1 next. â–¶ 01:14 And notice how in the sum of g and h, â–¶ 01:24 this node over here, which has a total of 4, is better than any other node that is unexpanded. â–¶ 01:29 So in particular, 4 plus 1 is 5, and 3 plus 2 is 5 as well, â–¶ 01:35 and 2 plus 3 is 5 as well, so this is the next one to expand. â–¶ 01:40

## (64) Unit 3

### (06:26) 1 Introduction

So the next units will be concerned with probabilities â–¶ 00:00 and particularly with structured probabilities using Bayes networks. â–¶ 00:03 This is some of the most involved material in this class. â–¶ 00:08 And since this is a Stanford-level class, â–¶ 00:12 you will find out that some of the quizzes are actually really hard. â–¶ 00:14 So as you go through the material, I hope the hardness of the quizzes won't discourage you; â–¶ 00:18 it'll really entice you to take a piece of paper and a pen and work them out. â–¶ 00:23 Let me give you a flavor of a Bayes network using an example. â–¶ 00:30 Suppose you find in the morning that your car won't start. â–¶ 00:35 Well, there are many causes why your car might not start. â–¶ 00:39 One is that your battery is flat. â–¶ 00:43 Even for a flat battery there are multiple causes. â–¶ 00:46 One, it's just plain dead, â–¶ 00:50 and one is that the battery is okay but it's not charging. â–¶ 00:52 The reason why a battery might not charge is that the alternator might be broken â–¶ 00:55 or the fan belt might be broken. â–¶ 01:01 If you look at this influence diagram, also called a Bayes network, â–¶ 01:03 you'll find there are many different ways to explain that the car won't start. â–¶ 01:07 And a natural question you might have is, "Can we diagnose the problem?" â–¶ 01:12 One diagnostic tool is a battery meter, â–¶ 01:17 which may increase or decrease your belief that the battery may cause your car failure. â–¶ 01:20 You might also know your battery age. â–¶ 01:26 Older batteries tend to go dead more often. â–¶ 01:29 And there are many other ways to look at reasons why the car might not start. â–¶ 01:31 You might inspect the lights, the oil light, the gas gauge. â–¶ 01:37 You might even dip into the engine to see what the oil level is with a dipstick. â–¶ 01:43 All of those relate to alternative reasons why the car might not be starting, â–¶ 01:48 like no oil, no gas, the fuel line might be blocked, or the starter may be broken.
â–¶ 01:52 And all of these can influence your measurements, â–¶ 01:59 like the oil light or the gas gauge, in different ways. â–¶ 02:04 For example, the battery flat would have an effect on the lights. â–¶ 02:07 It might have an effect on the oil light and on the gas gauge, â–¶ 02:12 but it won't really affect the oil you measure with the dipstick. â–¶ 02:16 That is affected by the actual oil level, which also affects the oil light. â–¶ 02:20 Gas will affect the gas gauge, and of course without gas the car doesn't start. â–¶ 02:26 So this is a complicated structure that really describes one way to understand â–¶ 02:32 how a car doesn't start. â–¶ 02:39 A car is a complex system. â–¶ 02:41 It has lots of variables you can't really measure immediately, â–¶ 02:43 and it has sensors which allow you to understand a little bit about the state of the car. â–¶ 02:46 What the Bayes network does, â–¶ 02:52 it really assists you in reasoning from observable variables, like the car won't start â–¶ 02:54 and the value of the dipstick, to hidden causes, like is the fan belt broken â–¶ 03:01 or is the battery dead. â–¶ 03:06 What you have here is a Bayes network. â–¶ 03:09 A Bayes network is composed of nodes. â–¶ 03:13 These nodes correspond to events that you might or might not know â–¶ 03:15 that are typically called random variables. â–¶ 03:21 These nodes are linked by arcs, and the arcs suggest that a child of an arc â–¶ 03:24 is influenced by its parent but not in a deterministic way. â–¶ 03:31 It might be influenced in a probabilistic way, which means an older battery, for example, â–¶ 03:35 has a higher chance of causing the battery to be dead, â–¶ 03:41 but it's not clear that every old battery is dead. â–¶ 03:45 There is a total of 16 variables in this Bayes network. â–¶ 03:48 What the graph structure and associated probabilities specify â–¶ 03:53 is a huge probability distribution in the space of all of these 16 variables. 
â–¶ 03:59 If they are all binary, which we'll assume throughout this unit, â–¶ 04:06 they can take 2 to the 16th different values, which is a lot. â–¶ 04:10 The Bayes network, as we find out, is a compact representation â–¶ 04:15 of a distribution over this very, very large joint probability distribution of all of these variables. â–¶ 04:18 Further, once we specify the Bayes network, â–¶ 04:26 we can observe, for example, the car won't start. â–¶ 04:29 We can observe things like the oil light and the lights and the battery meter â–¶ 04:33 and then compute probabilities of the hypothesis, like the alternator is broken â–¶ 04:37 or the fan belt is broken or the battery is dead. â–¶ 04:41 So in this class we're going to talk about how to construct this Bayes network, â–¶ 04:45 what the semantics are, and how to reason in this Bayes network â–¶ 04:50 to find out about variables we can't observe, like whether the fan belt is broken or not. â–¶ 04:56 That's an overview. â–¶ 05:02 Throughout this unit I am going to assume that every event is discrete-- â–¶ 05:04 in fact, it's binary. â–¶ 05:08 We'll start with some consideration of basic probability, â–¶ 05:10 we'll work our way into some simple Bayes networks, â–¶ 05:14 we'll talk about concepts like conditional independence â–¶ 05:19 and then define Bayes networks more generally, â–¶ 05:23 move into concepts like D-separation and start doing parameter counts. â–¶ 05:26 Later on, Peter will tell you about inference in Bayes networks. â–¶ 05:32 So we won't do this in this unit. â–¶ 05:36 I can't overemphasize how important this class is. â–¶ 05:38 Bayes networks are used extensively in almost all fields of smart computer systems, â–¶ 05:43 in diagnostics, for prediction, for machine learning, and fields like finance, â–¶ 05:49 inside Google, in robotics.
â–¶ 05:57 Bayes networks are also the building blocks of more advanced AI techniques â–¶ 06:00 such as particle filters, hidden Markov models, MDPs and POMDPs, â–¶ 06:05 Kalman filters, and many others. â–¶ 06:12 These are words that don't sound familiar quite yet, â–¶ 06:14 but as you go through the class, I can promise you you will get to know what they mean. â–¶ 06:18 So let's start now at the very, very basics. â–¶ 06:22 ### (00:40) 2 Probabilities

[Thrun] So let's talk about probabilities. â–¶ 00:00 Probabilities are the cornerstone of artificial intelligence. â–¶ 00:02 They are used to express uncertainty, â–¶ 00:05 and the management of uncertainty is really key to many, many things in AI â–¶ 00:08 such as machine learning and Bayes network inference â–¶ 00:12 and filtering and robotics and computer vision and so on. â–¶ 00:16 So I'm going to start with some very basic questions, â–¶ 00:21 and we're going to work our way up from there. â–¶ 00:24 Here is a coin. â–¶ 00:26 The coin can come up heads or tails, and my question is the following: â–¶ 00:28 Suppose the probability for heads is 0.5. â–¶ 00:32 What's the probability for it coming up tails? â–¶ 00:38 ### (00:19) 2a Answer

[Thrun] So the right answer is a half, or 0.5, â–¶ 00:00 and the reason is the coin can only come up heads or tails. â–¶ 00:03 We know that it has to be either one. â–¶ 00:07 Therefore, the total probability of both coming up is 1. â–¶ 00:10 So if half of the probability is assigned to heads, then the other half is assigned to tail. â–¶ 00:14 ### (00:08) 2b Question

[Thrun] Let me ask my next quiz. â–¶ 00:00 Suppose the probability of heads is a quarter, 0.25. â–¶ 00:02 What's the probability of tail? â–¶ 00:06 ### (00:17) 2c Answer

[Thrun] And the answer is 3/4. â–¶ 00:00 It's a loaded coin, and the reason is, well, â–¶ 00:02 each side comes up with a certain probability. â–¶ 00:05 The total of those is 1. The quarter is claimed by heads. â–¶ 00:08 Therefore, 3/4 remain for tail, which is the answer over here. â–¶ 00:12 ### (00:14) 2d Question

[Thrun] Here's another quiz. â–¶ 00:00 What's the probability that the coin comes up heads, heads, heads, three times in a row, â–¶ 00:02 assuming that each one of those has a probability of a half â–¶ 00:08 and that these coin flips are independent? â–¶ 00:12 ### (00:14) 2e Answer

[Thrun] And the answer is 0.125. â–¶ 00:00 Each head has a probability of a half. â–¶ 00:04 We can multiply those probabilities because they are independent events, â–¶ 00:06 and that gives us 1 over 8 or 0.125. â–¶ 00:10 ### (00:32) 2f Question

[Thrun] Now let's flip the coin 4 times, and let's call Xi the result of the i-th coin flip. â–¶ 00:00 So each Xi is going to be drawn from heads or tails. â–¶ 00:11 What's the probability that all 4 of those flips give us the same result, â–¶ 00:16 no matter what it is, assuming that each one of those is identically distributed, â–¶ 00:22 with a probability of coming up heads of one half? â–¶ 00:26 ### (00:23) 2g Answer

[Thrun] And the answer is, well, there's 2 ways that we can achieve this. â–¶ 00:00 One is the all heads and one is all tails. â–¶ 00:04 You already know that 4 times heads is 1/16, â–¶ 00:06 and we know that 4 times tails is also 1/16. â–¶ 00:10 These are mutually exclusive events. â–¶ 00:13 The probability of either one occurring is 1/16 plus 1/16, which is 1/8, which is 0.125. â–¶ 00:15 ### (00:10) 2h Question

[Thrun] So here's another one. â–¶ 00:00 What's the probability that within the set of X1, X2, X3, and X4 â–¶ 00:02 there are at least three heads? â–¶ 00:07 ### (00:28) 2i Answer

[Thrun] And the solution is let's look at different sequences â–¶ 00:00 in which head occurs at least 3 times. â–¶ 00:03 It could be head, head, head, head, in which it comes 4 times. â–¶ 00:06 It could be head, head, head, tail and so on, all the way to tail, head, head, head. â–¶ 00:10 There's 1, 2, 3, 4, 5 of those outcomes. â–¶ 00:16 Each of them has a 16th for probability, so it's 5 times a 16th, which is 0.3125. â–¶ 00:19 ### (00:45) 2j Summary
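The coin-flip answers above (all four flips the same, and at least three heads) can be checked by brute-force enumeration. A short Python sketch, not part of the lecture; the variable names are mine:

```python
from itertools import product

# All 2^4 equally likely outcomes of four fair coin flips.
outcomes = list(product("HT", repeat=4))
p_each = 1 / len(outcomes)  # 1/16 per outcome for a fair coin

# P(all four flips show the same face) = 2/16 = 0.125
p_all_same = sum(p_each for o in outcomes if len(set(o)) == 1)

# P(at least three heads) = 5/16 = 0.3125
p_three_heads = sum(p_each for o in outcomes if o.count("H") >= 3)

print(p_all_same, p_three_heads)
```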

[Thrun] So we just learned a number of things. â–¶ 00:00 One is about complementary probability. â–¶ 00:02 If an event has a certain probability, p, â–¶ 00:05 the complementary event has the probability 1-p. â–¶ 00:08 We also learned about independence. â–¶ 00:13 If 2 random variables, X and Y, are independent, â–¶ 00:15 which we're going to write like this, â–¶ 00:19 that means the probability of any joint value the 2 variables can assume â–¶ 00:21 is the product of the marginals. â–¶ 00:26 So rather than asking the question, "What is the probability â–¶ 00:30 "for any combination that these 2 coins or maybe 5 coins could have taken?" â–¶ 00:34 we can now look at the probability of each coin individually, â–¶ 00:40 look at its probability and just multiply them up. â–¶ 00:42 ### (01:04) 3 Dependence

[Thrun] So let me ask you about dependence. â–¶ 00:00 Suppose we flip 2 coins. â–¶ 00:03 Our first coin is a fair coin, and we're going to denote the outcome by X1. â–¶ 00:05 So the chance of X1 coming up heads is half. â–¶ 00:12 But now we branch into picking a coin based on the first outcome. â–¶ 00:15 So if the first outcome was heads, â–¶ 00:20 you pick a coin whose probability of coming up heads is going to be 0.9. â–¶ 00:23 The way I word this is by conditional probability, â–¶ 00:28 probability of the second coin flip coming up heads â–¶ 00:32 provided that or given that X1, the first coin flip, was heads, is 0.9. â–¶ 00:35 The first coin flip might also come up tails, â–¶ 00:41 in which case I pick a very different coin. â–¶ 00:44 In this case I pick a coin which with 0.8 probability will once again give me tails, â–¶ 00:47 conditioned on the first coin flip coming up tails. â–¶ 00:54 So my question for you is, â–¶ 00:57 what's the probability of the second coin flip coming up heads? â–¶ 00:59 ### (01:06) 3a Answer

[Thrun] The answer is 0.55. â–¶ 00:00 The way to compute this is by the theorem of total probability. â–¶ 00:04 Probability of X2 equals heads. â–¶ 00:08 There's 2 ways I can get to this outcome. â–¶ 00:12 One is via this path over here, and one is via this path over here. â–¶ 00:15 Let me just write both of them down. â–¶ 00:18 So first, there is the probability of X2 equals heads â–¶ 00:20 given that X1 was heads, times the probability that X1 was heads. â–¶ 00:26 Now I have to add the complementary event. â–¶ 00:30 Suppose X1 came up tails. â–¶ 00:32 Then I can ask the question, what is the probability that X2 comes up heads â–¶ 00:35 even though X1 was tails, times the probability that X1 was tails? â–¶ 00:40 Plugging in the numbers gives us the following. â–¶ 00:42 This one over here is 0.9 times a half. â–¶ 00:44 In the second case, the probability of tails following tails is 0.8, â–¶ 00:49 thereby my heads probability becomes 1 minus 0.8, which is 0.2, again times a half. â–¶ 00:51 Adding all of this together gives me 0.45 plus 0.1, â–¶ 00:58 which is exactly 0.55. â–¶ 01:03 ### (01:07) 4 What We Learned
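The 0.55 computed above is just the theorem of total probability applied to the two branches. A minimal sketch (my own variable names):

```python
# Total probability for the dependent-coin quiz:
# P(X2=H) = P(X2=H | X1=H) P(X1=H) + P(X2=H | X1=T) P(X1=T)
p_x1_h = 0.5
p_h_given_h = 0.9      # second coin loaded toward heads
p_h_given_t = 1 - 0.8  # = 0.2, since P(X2=T | X1=T) = 0.8

p_x2_h = p_h_given_h * p_x1_h + p_h_given_t * (1 - p_x1_h)
print(p_x2_h)  # 0.45 + 0.1 = 0.55
```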

So, we actually just learned some interesting lessons. â–¶ 00:00 The probability of any random variable Y can be written as â–¶ 00:02 the probability of Y given that some other random variable X assumes value i â–¶ 00:08 times the probability of X equals i, â–¶ 00:13 summed over all possible outcomes i for the (inaudible) variable X. â–¶ 00:17 This is called total probability. â–¶ 00:22 The second thing we learned has to do with negation of probabilities. â–¶ 00:24 We found that the probability of not X given Y is 1 minus the probability of X given Y. â–¶ 00:27 Now, you might be tempted to say, "What about the probability of X given not Y?" â–¶ 00:37 "Is this the same as 1 minus the probability of X given Y?" â–¶ 00:43 And the answer is absolutely no. â–¶ 00:51 That's not the case. â–¶ 00:54 If you condition on something that has a certain probability value, â–¶ 00:56 you can take the event you're looking at and negate this, â–¶ 01:00 but you can never negate your conditional variable â–¶ 01:03 and assume these values add up to 1. â–¶ 01:05 ### (00:25) 5 Weather Quiz
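Written as formulas, the two lessons above are:

```latex
% Theorem of total probability:
P(Y) = \sum_{i} P(Y \mid X = i)\, P(X = i)

% Negation of the event under the same condition:
P(\lnot X \mid Y) = 1 - P(X \mid Y)
% but in general:
P(X \mid \lnot Y) \neq 1 - P(X \mid Y)
```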

We assume there are sometimes sunny days and sometimes rainy days, â–¶ 00:00 and on day 1, which we're going to call D1, â–¶ 00:06 the probability of sunny is 0.9. â–¶ 00:09 And then let's assume that a sunny day follows a sunny day with 0.8 chance, â–¶ 00:13 and a rainy day follows a sunny day with--well-- â–¶ 00:20 ### (00:05) 5a Answer

Well, the correct answer is 0.2, which is a negation of this event over here. â–¶ 00:00 ### (00:13) 5b Question

A sunny day follows a rainy day with 0.6 chance, â–¶ 00:00 and a rainy day follows a rainy day-- â–¶ 00:06 please give me your number. â–¶ 00:11 ### (00:03) 5c Answer

0.4 â–¶ 00:00 ### (00:18) 5d Question

So, what are the chances that D2 is sunny? â–¶ 00:00 Suppose the same dynamics apply from D2 to D3, â–¶ 00:03 so just replace the D1s and D2s over here with D2s and D3s. â–¶ 00:06 That means the transition probabilities from one day to the next remain the same. â–¶ 00:10 Tell me, what's the probability that D3 is sunny? â–¶ 00:14 ### (01:25) 5e Answer

So, the correct answer over here is 0.78, â–¶ 00:00 and over here it's 0.756. â–¶ 00:04 To get there, let's complete this one first. â–¶ 00:10 The probability of D2 = sunny. â–¶ 00:13 Well, we know there's a 0.9 chance it's sunny on D1, â–¶ 00:16 and then if it is sunny, we know it stays sunny with a 0.8 chance. â–¶ 00:21 So, we multiply these 2 things together, and we get 0.72. â–¶ 00:25 We know there's a 0.1 chance of it being rainy on day 1, which is the complement, â–¶ 00:29 but if it's rainy, we know it switches to sunny with 0.6 chance, â–¶ 00:33 so you multiply these 2 things, and you get 0.06. â–¶ 00:37 Adding those two up equals 0.78. â–¶ 00:41 Now, for the next day, we know our prior for sunny is 0.78. â–¶ 00:46 If it is sunny, it stays sunny with 0.8 probability. â–¶ 00:51 Multiplying these 2 things gives us 0.624. â–¶ 00:55 We know it's rainy with 0.22 chance, which is the complement of 0.78, â–¶ 01:01 but there's a 0.6 chance it switches to sunny. â–¶ 01:07 If you multiply those, you get 0.132. â–¶ 01:10 Adding those 2 things up gives us 0.756. â–¶ 01:14 So, to some extent, it's tedious to compute these values, â–¶ 01:19 but they can be perfectly computed, as shown here. â–¶ 01:23 ### (00:19) 6 Cancer Quiz
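The tedious weather recursion above is easy to automate: each day's sunny probability is the previous day's, pushed through the two transitions. A sketch (names are mine):

```python
# Sunny/rainy chain: P(sunny tomorrow | sunny today) = 0.8,
# P(sunny tomorrow | rainy today) = 0.6, and P(D1 = sunny) = 0.9.
p_ss, p_sr = 0.8, 0.6
p_sunny = 0.9  # day 1

for day in (2, 3):
    # Total probability over today's two weather states.
    p_sunny = p_ss * p_sunny + p_sr * (1 - p_sunny)
    print(f"P(D{day} = sunny) = {p_sunny:.4f}")  # 0.7800, then 0.7560
```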

Next example is a cancer example. â–¶ 00:00 Suppose there's a specific type of cancer which exists for 1% of the population. â–¶ 00:05 I'm going to write this as follows. â–¶ 00:11 You can probably tell me now what the probability of not having this cancer is. â–¶ 00:13 ### (00:28) 6a Answer and Cancer Test

And yes, the answer is 0.99. â–¶ 00:00 Let's assume there's a test for this cancer, â–¶ 00:04 which gives us probabilistically an answer whether we have this cancer or not. â–¶ 00:07 So, let's say the probability of a test being positive, as indicated by this + sign, â–¶ 00:12 given that we have cancer, is 0.9. â–¶ 00:18 The probability of the test coming out negative if we have the cancer is--you name it. â–¶ 00:22 ### (01:01) 6b Answer

0.1, which is the difference between 1 and 0.9. â–¶ 00:00 Let's assume the probability of the test coming out positive â–¶ 00:06 given that we don't have this cancer is 0.2. â–¶ 00:11 In other words, the probability of the test correctly saying â–¶ 00:15 we don't have the cancer if we're cancer free is 0.8. â–¶ 00:19 Now, ultimately, I'd like to know what's the probability â–¶ 00:24 of having this cancer given that I just received a single positive test. â–¶ 00:28 Before I do this, please help me fill out some other probabilities â–¶ 00:35 that are actually important. â–¶ 00:39 Specifically, the joint probabilities: â–¶ 00:41 the probability of a positive test and having cancer, â–¶ 00:45 the probability of a negative test and having cancer, and likewise without cancer-- â–¶ 00:51 and these are not conditional anymore. â–¶ 00:53 They're now joint probabilities. â–¶ 00:55 So, please give me those 4 values over here. â–¶ 00:57 ### (00:40) 6c Answer

And here the correct answer is 0.009, â–¶ 00:00 which is the product of your prior, 0.01, times the conditional, 0.9. â–¶ 00:05 Over here we get 0.001, the probability of our prior cancer times 0.1. â–¶ 00:12 Over here we get 0.198, â–¶ 00:21 the probability of not having cancer is 0.99 â–¶ 00:26 times still getting a positive reading, which is 0.2. â–¶ 00:29 And finally, we get 0.792, â–¶ 00:32 which is the probability of this guy over here, and this guy over here. â–¶ 00:37 ### (00:07) 6d Question

Now, for our next quiz, I want you to fill in the probability of â–¶ 00:00 cancer given that we just received a positive test. â–¶ 00:04 ### (01:52) 6e Answer

And the correct answer is 0.043. â–¶ 00:00 So, even though I received a positive test, â–¶ 00:06 my probability of having cancer is just 4.3%, â–¶ 00:09 which is not very much given that the test itself is quite sensitive. â–¶ 00:14 It really gives me a 0.8 chance of getting a negative result if I don't have cancer. â–¶ 00:18 It gives me a 0.9 chance of detecting cancer given that I have cancer. â–¶ 00:26 Now, why is this number so small? â–¶ 00:32 Well, let's just put all the cases together. â–¶ 00:35 You already know that we received a positive test. â–¶ 00:38 Therefore, this entry over here and this entry over here are relevant. â–¶ 00:41 Now, the chance of having a positive test and having cancer is 0.009. â–¶ 00:47 When I receive a positive test, I might have cancer or not have cancer, â–¶ 00:56 so we just normalize by these 2 possible causes for the positive test, â–¶ 01:01 which is 0.009 + 0.198. â–¶ 01:06 Putting these 2 things together gives 0.009 over 0.207, â–¶ 01:11 which is approximately 0.043. â–¶ 01:20 Now, the interesting thing in this equation is that the chances â–¶ 01:23 of having seen a positive test result in the absence of cancer â–¶ 01:28 are still much, much higher than the chance of seeing a positive result â–¶ 01:32 in the presence of cancer, and that's because our prior for cancer â–¶ 01:35 is so small in the population that it's just very unlikely to have cancer. â–¶ 01:39 So, the additional information of a positive test â–¶ 01:44 only raised my posterior probability to 0.043. â–¶ 01:47 ### (03:34) 7 Bayes Rule
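The whole cancer calculation above, from the joint table to the 0.043 posterior, fits in a few lines. A sketch using the quiz's numbers (variable names are mine):

```python
# Cancer test: prior P(C) = 0.01, P(+|C) = 0.9, P(+|not C) = 0.2.
p_c = 0.01
p_pos_c, p_pos_not_c = 0.9, 0.2

# The four joint-table entries:
p_pos_and_c = p_pos_c * p_c                      # 0.009
p_neg_and_c = (1 - p_pos_c) * p_c                # 0.001
p_pos_and_not_c = p_pos_not_c * (1 - p_c)        # 0.198
p_neg_and_not_c = (1 - p_pos_not_c) * (1 - p_c)  # 0.792

# Posterior: normalize over the two ways a positive test can occur.
p_c_given_pos = p_pos_and_c / (p_pos_and_c + p_pos_and_not_c)
print(round(p_c_given_pos, 4))  # 0.009 / 0.207, approximately 0.0435
```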

So, we've just learned about what's probably the most important â–¶ 00:00 piece of math for this class in statistics, called Bayes Rule. â–¶ 00:03 It was invented by Reverend Thomas Bayes, who was a British mathematician â–¶ 00:09 and a Presbyterian minister in the 18th century. â–¶ 00:15 Bayes Rule is usually stated as follows: P of A given B--where B is the evidence â–¶ 00:18 and A is the variable we care about--is P of B given A times P of A over P of B. â–¶ 00:27 This expression is called the likelihood. â–¶ 00:36 This is called the prior, and this is called the marginal likelihood. â–¶ 00:40 The expression over here is called the posterior. â–¶ 00:46 The interesting thing here is the way the probabilities are inverted. â–¶ 00:50 Say we have evidence B. â–¶ 00:55 We know about B, but we really care about the variable A. â–¶ 00:57 So, for example, B is a test result. â–¶ 01:01 We don't care about the test result as much as we care about the fact â–¶ 01:03 whether we have cancer or not. â–¶ 01:06 This diagnostic reasoning--which is from evidence to its causes-- â–¶ 01:08 is turned upside down by Bayes Rule into causal reasoning: â–¶ 01:16 hypothetically, if we knew the cause, â–¶ 01:22 what would be the probability of the evidence we just observed? â–¶ 01:27 But to correct for this inversion, we have to multiply â–¶ 01:31 by the prior of the cause being the case in the first place, â–¶ 01:36 in this case, having cancer or not, â–¶ 01:40 and divide it by the probability of the evidence, P(B), â–¶ 01:42 which often is expanded using the theorem of total probability as follows. â–¶ 01:47 The probability of B is a sum over all probabilities of B â–¶ 01:52 conditioned on A = a (lowercase a), times the probability of A = a. â–¶ 01:58 This is total probability as we already encountered it. 
â–¶ 02:04 So, let's apply this to the cancer case â–¶ 02:08 and say we really care about whether you have cancer, â–¶ 02:10 which is our cause, conditioned on the evidence â–¶ 02:13 that is the result of this hidden cause, in this case, a positive test result. â–¶ 02:17 Let's just plug in the numbers. â–¶ 02:23 Our likelihood is the probability of seeing a positive test result â–¶ 02:25 given that you have cancer multiplied by the prior probability â–¶ 02:30 of having cancer over the probability of the positive test result, â–¶ 02:33 and that is--according to the tables we looked at before-- â–¶ 02:38 0.9 times a prior of 0.01 over-- â–¶ 02:43 now we're going to expand this right over here according to total probability â–¶ 02:50 which gives us 0.9 times 0.01. â–¶ 02:55 That's the probability of + given that we do have cancer. â–¶ 03:01 So, the probability of + given that we don't have cancer is 0.2, â–¶ 03:06 but the prior here is 0.99. â–¶ 03:11 So, if we plug in the numbers we know about, we get 0.009 â–¶ 03:15 over 0.009 + 0.198. â–¶ 03:20 That is approximately 0.0434, which is the number we saw before. â–¶ 03:27 ### (01:52) 7a Bayes Rule Graphically

So, if you want to draw Bayes Rule graphically, â–¶ 00:00 we have a situation where we have an internal variable A, â–¶ 00:03 like whether I'm going to die of cancer, but we can't sense A. â–¶ 00:08 Instead, we have a second variable, called B, â–¶ 00:13 which is our test, and B is observable, but A isn't. â–¶ 00:16 This is a classical example of a Bayes network. â–¶ 00:21 The Bayes network is composed of 2 variables, A and B. â–¶ 00:26 We know the prior probability for A, â–¶ 00:30 and we know the conditional. â–¶ 00:33 A causes B--whether or not we have cancer â–¶ 00:35 causes the test result to be positive or not, â–¶ 00:38 although there is some randomness involved. â–¶ 00:41 So, we know the probability of B given the different values of A, â–¶ 00:44 and what we care about in this specific instance is called diagnostic reasoning, â–¶ 00:49 which is the inverse of the causal reasoning: â–¶ 00:54 the probability of A given B or, similarly, the probability of A given not B. â–¶ 00:58 This is our very first Bayes network, and the graphical representation-- â–¶ 01:06 drawing 2 variables, A and B, connected with an arc â–¶ 01:11 that goes from A to B--is the graphical representation of a distribution â–¶ 01:15 over 2 variables that is specified by the structure over here, â–¶ 01:22 which has a prior probability and a conditional probability, as shown over here. â–¶ 01:26 Now, I do have a quick quiz for you. â–¶ 01:31 How many parameters does it take to specify â–¶ 01:34 the entire joint probability within A and B, or differently, the entire Bayes network? â–¶ 01:37 I'm not looking for structural parameters that relate to the graph over here. â–¶ 01:43 I'm just looking for the numerical parameters of the underlying probabilities. â–¶ 01:48 ### (00:24) 7b Answer

And the answer is 3. â–¶ 00:00 It takes 1 parameter to specify P of A, from which we can derive P of not A. â–¶ 00:02 It takes 2 parameters to specify P of B given A and P of B given not A, â–¶ 00:09 from which we can derive P of not B given A and P of not B given not A. â–¶ 00:15 So, it's a total of 3 parameters for this Bayes network. â–¶ 00:21 ### (02:32) 8 More Complex Bayes Networks
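The parameter count above generalizes: over binary variables, each node needs one number per joint assignment of its parents, i.e. 2^k values for k parents (the complements come for free). A hypothetical helper, not from the lecture, sketching that rule:

```python
# Free parameters of a Bayes network over binary variables:
# one probability per parent assignment, i.e. 2^(number of parents)
# values per node; complements are derived, so they cost nothing.
def bayes_net_parameters(parent_counts):
    return sum(2 ** k for k in parent_counts)

# The A -> B network: A has 0 parents (1 value, P(A)),
# B has 1 parent (2 values, P(B|A) and P(B|not A)).
print(bayes_net_parameters([0, 1]))  # 3
```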

So, we just encountered our very first Bayes network â–¶ 00:00 and did a number of interesting calculations. â–¶ 00:03 Let's now talk about Bayes Rule and look into more complex Bayes networks. â–¶ 00:06 I will look at Bayes Rule again and make an observation â–¶ 00:10 that is really non-trivial. â–¶ 00:13 Here is Bayes Rule, and in practice, what we find is that â–¶ 00:15 this term here is relatively easy to compute. â–¶ 00:20 It's just a product, whereas this term is really hard to compute. â–¶ 00:23 However, this term over here does not depend on what we assume for variable A. â–¶ 00:28 It's just a function of B. â–¶ 00:33 So, suppose for a moment we also care about the complementary event of not A â–¶ 00:35 given B, for which Bayes Rule unfolds as follows. â–¶ 00:40 Then we find that the normalizer, P(B), is identical, â–¶ 00:43 whether we assume A on the left side or not A on the left side. â–¶ 00:47 We also know from prior work that P of A given B plus â–¶ 00:51 P of not A given B must be 1 because these are 2 complementary events. â–¶ 00:57 That allows us to compute Bayes Rule very differently â–¶ 01:03 by basically ignoring the normalizer, so here's how it goes. â–¶ 01:06 We compute P of A given B--and I want to call this prime, â–¶ 01:11 because it's not a real probability--to be just P of B given A times P of A, â–¶ 01:16 which is the numerator of the expression over here, without the normalizer in the denominator. â–¶ 01:23 We do the same thing with not A. â–¶ 01:28 So, in both cases, we compute the posterior probability non-normalized â–¶ 01:31 by omitting the normalizer P(B). â–¶ 01:36 And then we can recover the original probabilities by normalizing â–¶ 01:38 based on those values over here, so the probability of A given B, â–¶ 01:43 the actual probability, is a normalizer, eta, â–¶ 01:48 times this non-normalized form over here. â–¶ 01:52 The same is true for the negation of A over here. 
â–¶ 01:55 And eta is just the normalizer that results from adding these 2 values over here together, â–¶ 01:59 as shown over here, and taking 1 over the sum. â–¶ 02:06 So, take a look at this for a moment. â–¶ 02:10 What we've done is we deferred the calculation of the normalizer over here â–¶ 02:13 by computing pseudo-probabilities that are non-normalized. â–¶ 02:18 This made the calculation much easier, and when we were done with everything, â–¶ 02:22 we just folded it back into the normalizer, based on the resulting â–¶ 02:26 pseudo-probabilities, and got the correct answer. â–¶ 02:29 ### (01:08) 8a Two Test Cancer Example
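Before applying it, here is the trick just described as a short sketch: compute likelihood-times-prior pseudo-probabilities for every hypothesis, then rescale so they sum to 1 (using the single-test numbers from before; names are mine):

```python
# Bayes Rule without the normalizer: pseudo-probabilities
# P'(A|B) = P(B|A) P(A), one per hypothesis, rescaled to sum to 1.
def normalize(pseudo):
    eta = 1 / sum(pseudo)  # the normalizer eta
    return [eta * p for p in pseudo]

# Single-test cancer example: hypotheses (cancer, no cancer).
posterior = normalize([0.9 * 0.01, 0.2 * 0.99])  # likelihood * prior
print(posterior)  # approximately [0.0435, 0.9565]
```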

The reason why I gave you all this is because I want you to apply it now â–¶ 00:00 to a slightly more complicated problem, which is the 2-test cancer example. â–¶ 00:03 In this example, we again have our unobservable cancer C, â–¶ 00:08 but now we're running 2 tests, test 1 and test 2. â–¶ 00:14 As before, the prior probability of cancer is 0.01. â–¶ 00:18 The probability of receiving a positive result for either test, given that you have cancer, is 0.9. â–¶ 00:24 The probability of getting a negative result, given that you're cancer free, is 0.8. â–¶ 00:30 And from those, we are able to compute all the other probabilities, â–¶ 00:36 and we're just going to write them down over here. â–¶ 00:40 So, take a moment to just verify those. â–¶ 00:43 Now, let's assume both of my tests come back positive, â–¶ 00:46 so T1 = + and T2 = +. â–¶ 00:50 What's the probability of cancer, now written in short form as the probability of â–¶ 00:56 C given ++? â–¶ 01:00 I want you to tell me what that is, and this is a non-trivial question. â–¶ 01:03 ### (02:00) 8b Answer

So, the correct answer is approximately 0.1698, â–¶ 00:00 and to compute this, I used the trick I showed you before. â–¶ 00:10 Let me write down the running count for cancer and for not cancer â–¶ 00:15 as I integrate the various multiplications in Bayes Rule. â–¶ 00:24 My prior for cancer was 0.01 and for non-cancer was 0.99. â–¶ 00:28 Then I get my first +, and the probability of a + given that I have cancer is 0.9, â–¶ 00:37 and the same for non-cancer is 0.2. â–¶ 00:43 So, according to the non-normalized Bayes Rule, â–¶ 00:48 I now multiply these 2 things together to get my non-normalized probability â–¶ 00:52 of having cancer given the plus. â–¶ 00:58 Since multiplication is commutative, â–¶ 01:00 I can do the same thing again with my 2nd test result, 0.9 and 0.2, â–¶ 01:03 and I multiply all of these 3 things together to get my non-normalized probability â–¶ 01:09 P prime to be the following: 0.0081 if you multiply those things together, â–¶ 01:14 and 0.0396 if you multiply these factors together. â–¶ 01:21 And these are not yet probabilities. â–¶ 01:28 If we add those for the 2 complementary events of cancer/non-cancer, â–¶ 01:30 we get 0.0477. â–¶ 01:34 However, if I now divide, that is, I normalize â–¶ 01:38 those non-normalized probabilities over here by this factor over here, â–¶ 01:42 I actually get the correct posterior probability P of cancer given ++. â–¶ 01:47 And they look as follows: â–¶ 01:52 approximately 0.1698 and approximately 0.8301. â–¶ 01:54 ### (00:10) 8c Question
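The same running-count computation as above, in sketch form; the factors 0.9 and 0.2 each appear twice, once per positive test (names are mine):

```python
# Two conditionally independent positive tests: multiply in one
# likelihood factor per test, normalize only at the end.
prior = {"cancer": 0.01, "no cancer": 0.99}
p_pos = {"cancer": 0.9, "no cancer": 0.2}  # P(+ | hypothesis)

pseudo = {h: prior[h] * p_pos[h] * p_pos[h] for h in prior}
total = sum(pseudo.values())               # 0.0081 + 0.0396 = 0.0477
posterior = {h: pseudo[h] / total for h in pseudo}
print(round(posterior["cancer"], 4))  # approximately 0.1698
```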

Calculate for me the probability of cancer â–¶ 00:00 given that I received one positive and one negative test result. â–¶ 00:03 Please write your number into this box. â–¶ 00:08 ### (01:03) 8d Answer

We apply the same trick as before, â–¶ 00:00 where we use the exact same prior of 0.01. â–¶ 00:03 Our first + gives us the following factors: 0.9 and 0.2. â–¶ 00:07 And our minus gives us the probability 0.1 for a negative test result given that we have cancer, â–¶ 00:13 and 0.8 for a negative result given that we don't have cancer. â–¶ 00:20 We multiply those together. â–¶ 00:26 We get our non-normalized probabilities. â–¶ 00:28 And if we now normalize by the sum of those two things â–¶ 00:30 to turn this back into a probability, we get 0.0009 â–¶ 00:35 over the sum of those two things over here, and this is 0.0056 â–¶ 00:41 for the chance of having cancer and 0.9943 for the chance of being cancer free. â–¶ 00:50 And this adds up approximately to 1 and, therefore, is a probability distribution. â–¶ 00:59 ### (02:45) 9 Conditional Independence
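A quick check of this answer; note the non-normalized cancer entry is 0.0009 = 0.01 × 0.9 × 0.1 (variable names are mine):

```python
# One positive and one negative test: factors are P(+|h) and P(-|h).
prior = {"cancer": 0.01, "no cancer": 0.99}
p_pos = {"cancer": 0.9, "no cancer": 0.2}

pseudo = {h: prior[h] * p_pos[h] * (1 - p_pos[h]) for h in prior}
# cancer: 0.01 * 0.9 * 0.1 = 0.0009;  no cancer: 0.99 * 0.2 * 0.8 = 0.1584
total = sum(pseudo.values())
p_c_given_pos_neg = pseudo["cancer"] / total
print(round(p_c_given_pos_neg, 4))  # approximately 0.0056
```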

I want to use a few words of terminology. â–¶ 00:00 This, again, is a Bayes network, in which the hidden variable C â–¶ 00:03 causes the still stochastic test outcomes T1 and T2. â–¶ 00:08 And what is really important is that we assume not just â–¶ 00:16 that T1 and T2 are identically distributed. â–¶ 00:19 We use the same 0.9 for test 1 as we use for test 2, â–¶ 00:22 but we also assume that they are conditionally independent. â–¶ 00:27 We assumed that if God told us whether we actually had cancer or not, â–¶ 00:31 if we knew with absolute certainty the value of the variable C, â–¶ 00:37 then knowing anything about T1 would not help us make a statement about T2. â–¶ 00:41 Put differently, we assumed that the probability of T2 given C and T1 â–¶ 00:48 is the same as the probability of T2 given C. â–¶ 00:55 This is called conditional independence: it holds given the value of the cancer variable C. â–¶ 01:00 If you knew this for a fact, then T2 would be independent of T1. â–¶ 01:08 It's conditionally independent because the independence only holds true â–¶ 01:17 if we actually know C, and it comes out of this diagram over here. â–¶ 01:21 If we look at this diagram, if you knew the variable C over here, â–¶ 01:26 then C separately causes T1 and T2. â–¶ 01:32 So, as a result, if you know C, whatever happens over here â–¶ 01:39 is kind of cut off causally from what happens over here. â–¶ 01:46 That causes these 2 variables to be conditionally independent. â–¶ 01:48 So, conditional independence is a really big thing in Bayes networks. â–¶ 01:52 Here's a Bayes network where A causes B and C, â–¶ 01:58 and for a Bayes network of this structure, we know that given A, â–¶ 02:02 B and C are independent. â–¶ 02:08 It's written as B conditionally independent of C given A. â–¶ 02:11 So, here's a question. â–¶ 02:16 Suppose we have conditional independence between B and C given A. â–¶ 02:18 Would that imply--and there's my question--that B and C are independent? 
â–¶ 02:21 So, suppose we don't know A. â–¶ 02:28 We don't know whether we have cancer, for example. â–¶ 02:30 Would the test results individually still be independent of each other â–¶ 02:33 even if we don't know about the cancer situation? â–¶ 02:38 Please answer yes or no. â–¶ 02:42 ### (00:41) 9a Answer

And the correct answer is no. â–¶ 00:00 Intuitively, getting a positive test result about cancer â–¶ 00:03 gives us information about whether you have cancer or not. â–¶ 00:08 So if you get a positive test result, â–¶ 00:13 you're going to raise the probability of having cancer â–¶ 00:15 relative to the prior probability. â–¶ 00:18 With that increased probability, we will predict â–¶ 00:20 that another test will, with a higher likelihood, â–¶ 00:24 give us a positive response than if we hadn't taken the previous test. â–¶ 00:27 That's really important to understand. â–¶ 00:33 To make sure we understand it, let me have you calculate those probabilities. â–¶ 00:36 ### (00:35) 9b Question

Let me draw the cancer example again with two tests. â–¶ 00:00 Here's my cancer variable â–¶ 00:05 and then there's two conditionally independent tests T1 and T2. â–¶ 00:07 And as before let me assume that the prior probability of cancer is 0.01 â–¶ 00:13 What I want you to compute for me is the probability of the second test â–¶ 00:19 to be positive if we know that the first test was positive. â–¶ 00:26 So write this into the following box. â–¶ 00:33 ### (02:52) 9c Answer

So, for this one, we want to apply total probability. â–¶ 00:00 This thing over here is the same as the probability of test 2 being positive, â–¶ 00:04 which I'm going to abbreviate with a +2 over here, â–¶ 00:10 conditioned on test 1 being positive and me having cancer, â–¶ 00:14 times the probability of me having cancer given test 1 was positive, plus â–¶ 00:19 the probability of test 2 being positive conditioned on test 1 being positive â–¶ 00:25 and me not having cancer, times the probability of me not having cancer â–¶ 00:31 given that test 1 is positive. â–¶ 00:36 That's the same as the theorem of total probability, â–¶ 00:38 but now everything is conditioned on +1. â–¶ 00:42 Take a moment to verify this. â–¶ 00:46 Now, here I can plug in the numbers. â–¶ 00:48 You already calculated this one before, which is approximately 0.043, â–¶ 00:50 and this one over here is 1 minus that, which is approximately 0.957. â–¶ 00:57 And this term over here now exploits conditional independence, â–¶ 01:05 which is: given that I know C, knowledge of the first test â–¶ 01:09 gives me no more information about the second test. â–¶ 01:14 It only gives me information if C was unknown, as was the case over here. â–¶ 01:17 So, I can rewrite this thing over here as follows: â–¶ 01:21 P of +2 given that I have cancer. â–¶ 01:24 I can drop the +1, and the same is true over here. â–¶ 01:27 This is exploiting my conditional independence. â–¶ 01:31 I know that P of +2 conditioned on C â–¶ 01:34 is the same as P of +2 conditioned on C and +1. â–¶ 01:41 I can now read those off my table over here, â–¶ 01:47 which is 0.9 times 0.043 plus 0.2, â–¶ 01:50 which is 1 minus 0.8, over here times 0.957, â–¶ 01:58 which gives me approximately 0.2301. â–¶ 02:03 So, that says if my first test comes in positive, â–¶ 02:09 I expect my second test to be positive with probability 0.2301. 
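As a check of the 0.2301 above: using the unrounded posterior instead of the rounded 0.043 gives approximately 0.2304, the same number up to rounding. A sketch (names are mine):

```python
# P(+2 | +1) via total probability, everything conditioned on +1,
# using conditional independence: P(+2 | C, +1) = P(+2 | C).
p_c_given_pos1 = 0.009 / 0.207  # unrounded single-test posterior
p_pos2_given_pos1 = (0.9 * p_c_given_pos1            # P(+2|C) P(C|+1)
                     + 0.2 * (1 - p_c_given_pos1))   # P(+2|not C) P(not C|+1)
print(round(p_pos2_given_pos1, 4))  # approximately 0.2304
```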
â–¶ 02:14 That's an increase over the default probability, â–¶ 02:21 which we calculated before: the probability of any test â–¶ 02:24 coming in positive, which was the normalizer of Bayes Rule, 0.207. â–¶ 02:29 So, my first test has about a 20% chance of coming in positive. â–¶ 02:38 My second test, after seeing a positive first test, â–¶ 02:43 now has an increased probability of about 23% of coming in positive. â–¶ 02:47 ### (00:27) 9d Absolute vs Conditional Independence

So, now we've learned about independence, â–¶ 00:00 and the corresponding Bayes network has 2 nodes. â–¶ 00:02 They're just not connected at all. â–¶ 00:04 And we learned about conditional independence, â–¶ 00:07 in which case we have a Bayes network that looks like this. â–¶ 00:09 Now I would like to know whether absolute independence â–¶ 00:12 implies conditional independence. â–¶ 00:16 True or false? â–¶ 00:18 And I'd also like to know whether conditional independence implies absolute independence. â–¶ 00:20 Again, true or false? â–¶ 00:25 ### (00:45) 9e Answer

And the answer is both of them are false. â–¶ 00:00 We already saw that conditional independence, as shown over here, â–¶ 00:03 doesn't give us absolute independence. â–¶ 00:07 So, for example, this is test #1 and this is test #2. â–¶ 00:09 You might or might not have cancer. â–¶ 00:13 Our first test gives us information about whether you have cancer or not. â–¶ 00:15 As a result, we've changed our prior probability â–¶ 00:18 for the second test to come in positive. â–¶ 00:21 That means that conditional independence does not imply absolute independence, â–¶ 00:24 which means this assumption here fails, â–¶ 00:30 and it also turns out that if you have absolute independence, â–¶ 00:32 things might not be conditionally independent, for reasons that I can't quite explain so far â–¶ 00:37 but that we will learn about next. â–¶ 00:43 ### (01:59) 10 Different Type of Bayes Network

[Thrun] For my next example, I will study a different type of a Bayes network. ▶ 00:00 Before, we've seen networks of the following type, ▶ 00:04 where a single hidden cause caused 2 different measurements. ▶ 00:08 I now want to study a network that looks just like the opposite. ▶ 00:13 We have 2 independent hidden causes, ▶ 00:17 but they get confounded within a single observational variable. ▶ 00:20 I would like to use the example of happiness. ▶ 00:26 Suppose I can be happy or unhappy. ▶ 00:29 What makes me happy is when the weather is sunny or if I get a raise in my job, ▶ 00:33 which means I make more money. ▶ 00:41 So let's call this sunny, let's call this a raise, and call this happiness. ▶ 00:43 Perhaps the probability of it being sunny is 0.7, ▶ 00:47 and the probability of a raise is 0.01. ▶ 00:53 And I will tell you that the probability of being happy is governed as follows. ▶ 00:58 The probability of being happy given that both of these things occur-- ▶ 01:05 I got a raise and it is sunny--is 1. ▶ 01:09 The probability of being happy given that it is not sunny and I still got a raise is 0.9. ▶ 01:13 The probability of being happy given that it's sunny but I didn't get a raise is 0.7. ▶ 01:20 And the probability of being happy given that it is neither sunny nor did I get a raise is 0.1. ▶ 01:27 This is a perfectly fine specification of a probability distribution ▶ 01:35 where 2 causes affect the variable down here, the happiness. ▶ 01:39 So I'd like you to calculate for me the following question: ▶ 01:46 the probability of a raise given that it is sunny, according to this model. ▶ 01:50 Please enter your answer over here. ▶ 01:57 ### (00:55) 10a Answer

[Thrun] The answer is surprisingly simple. ▶ 00:00 It is 0.01. ▶ 00:03 How do I know this so fast? ▶ 00:05 Well, if you look at this Bayes network, ▶ 00:08 both the sunniness and the question whether I got a raise impact my happiness. ▶ 00:12 But since I don't know anything about the happiness, ▶ 00:21 there is no way that just the weather might impact whether I get a raise or not. ▶ 00:24 In fact, it might be independently sunny, and I might independently get a raise at work. ▶ 00:32 There is no mechanism by which these 2 things would co-occur. ▶ 00:39 Therefore, the probability of a raise given that it's sunny ▶ 00:46 is just the same as the probability of a raise given any weather, which is 0.01. ▶ 00:49 ### (01:51) 11 Explaining Away

[Thrun] Let me talk about a really interesting special instance of Bayes net reasoning ▶ 00:00 which is called explaining away. ▶ 00:07 And I'll first give you the intuitive answer, ▶ 00:10 then I'll ask you to compute probabilities for me that manifest the explaining away effect ▶ 00:14 in a Bayes network of this type. ▶ 00:19 Explaining away means that if we know that we are happy, ▶ 00:22 then sunny weather can explain away the cause of happiness. ▶ 00:27 If I then also know that it's sunny, it becomes less likely that I received a raise. ▶ 00:34 Let me put this differently. ▶ 00:41 Suppose I'm a happy guy on a specific day ▶ 00:43 and my wife asks me, "Sebastian, why are you so happy?" ▶ 00:45 "Is it sunny, or did you get a raise?" ▶ 00:49 If she then looks outside and sees it is sunny, ▶ 00:52 then she might explain to herself, ▶ 00:55 "Well, Sebastian is happy because it is sunny." ▶ 00:57 "That makes it effectively less likely that he got a raise ▶ 01:00 "because I could already explain his happiness by it being sunny." ▶ 01:05 If she looks outside and it is rainy, ▶ 01:10 that makes it more likely I got a raise, ▶ 01:13 because the weather can't really explain my happiness. ▶ 01:16 In other words, if we see a certain effect that could be caused by multiple causes, ▶ 01:20 seeing one of those causes can explain away any other potential cause ▶ 01:27 of this effect over here. ▶ 01:33 So let me put this in numbers and ask you the challenging question of ▶ 01:36 what's the probability of a raise given that I'm happy and it's sunny? ▶ 01:43 ### (01:32) 11a Answer

[Thrun] The answer is approximately 0.0142, ▶ 00:00 and it is an exercise in expanding this term using Bayes' rule, ▶ 00:07 using total probability, which I'll just do for you. ▶ 00:11 Using Bayes' rule, you can transform this into P of H given R comma S ▶ 00:16 times P of R given S over P of H given S. ▶ 00:24 We observe the conditional independence of R and S ▶ 00:34 to simplify this to just P of R, ▶ 00:37 and the denominator is expanded by folding in R and not R, ▶ 00:40 P of H given R comma S ▶ 00:46 times P of R plus P of H given not R and S ▶ 00:49 times P of not R, which is total probability. ▶ 00:54 We can now read off the numbers from the tables over here, ▶ 00:58 which gives us 1 times 0.01 divided by this expression ▶ 01:01 that is the same as the expression over here, so 0.01 plus this thing over here, ▶ 01:10 which you can find over here to be 0.7, times this guy over here, ▶ 01:17 which is 1 minus the value over here, 0.99, ▶ 01:23 which gives us approximately 0.0142. ▶ 01:27 ### (01:32) 11b Question

[Thrun] Now, to understand the explaining away effect, ▶ 00:00 you have to compare this to the probability of a raise given that we're just happy ▶ 00:04 and we don't know anything about the weather. ▶ 00:11 So let's do that exercise next. ▶ 00:14 So my next quiz is, what's the probability of a raise given that all I know is that I'm happy ▶ 00:16 and I don't know about the weather? ▶ 00:24 This happens to be once again a pretty complicated question, so take your time. ▶ 00:26 ### (02:53) 11c Answer

[Thrun] So this is a difficult question. ▶ 00:00 Let me compute an auxiliary variable, which is P of happiness. ▶ 00:02 That one is expanded by looking at the different conditions that can make us happy. ▶ 00:12 P of happiness given S and R ▶ 00:19 times P of S and R, which is of course the product of those 2 ▶ 00:24 because they are independent, ▶ 00:29 plus P of happiness given not S and R times the probability of not S and R, ▶ 00:31 plus P of H given S and not R ▶ 00:39 times the probability of S and not R, plus the last case, ▶ 00:43 P of H given not S and not R times the probability of not S and not R. ▶ 00:48 So this just looks at the happiness under all 4 combinations of the variables ▶ 00:52 that can lead to happiness. ▶ 00:56 And you can plug those straight in. ▶ 00:58 This one over here is 1, and this one over here is the product of P of S and P of R, ▶ 01:00 which is 0.7 times 0.01. ▶ 01:05 And as you plug all of those in, ▶ 01:10 you get as a result 0.5245. ▶ 01:14 That's P of H. ▶ 01:21 Just take some time and do the math by going through these different cases ▶ 01:24 using total probability, and you get this result. ▶ 01:28 Armed with this number, the rest now becomes easy, ▶ 01:32 which is we can use Bayes' rule to turn this around: ▶ 01:38 P of H given R times P of R over P of H. ▶ 01:43 P of R we know from over here; the probability of a raise is 0.01. ▶ 01:49 So the only thing we need to compute now is P of H given R. ▶ 01:54 And again, we apply total probability. ▶ 01:57 Let me just do this over here. ▶ 01:59 We can factor P of H given R as P of H given R and S, sunny, ▶ 02:02 times the probability of sunny plus P of H given R and not sunny ▶ 02:09 times the probability of not sunny. ▶ 02:14 And if you plug in the numbers with this, you get 1 times 0.7 ▶ 02:16 plus 0.9 times 0.3. ▶ 02:21 That happens to be 0.97. ▶ 02:25 So if we now plug this all back into this equation over here, ▶ 02:30 we get 0.97 times 0.01 over 0.5245.
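The total-probability arithmetic described here can be verified with a short script. A sketch using the network's numbers (P(S) = 0.7, P(R) = 0.01, and the four P(H|S,R) table entries); the variable names are mine:

```python
p_s, p_r = 0.7, 0.01
p_h = {  # P(H = true | S, R), keyed by (sunny, raise)
    (True, True): 1.0,
    (False, True): 0.9,
    (True, False): 0.7,
    (False, False): 0.1,
}

# Total probability: sum P(H|S,R) * P(S) * P(R) over all four cases
p_happy = sum(
    p_h[(s, r)] * (p_s if s else 1 - p_s) * (p_r if r else 1 - p_r)
    for s in (True, False) for r in (True, False)
)
print(round(p_happy, 4))  # 0.5245

# P(H|R) by total probability over S, then Bayes' rule for P(R|H)
p_h_given_r = p_h[(True, True)] * p_s + p_h[(False, True)] * (1 - p_s)  # 0.97
p_r_given_h = p_h_given_r * p_r / p_happy
print(round(p_r_given_h, 4))  # 0.0185
```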
▶ 02:33 This gives us approximately 0.0185 as the correct answer. ▶ 02:45 ### (01:42) 11d Question

[Thrun] And if you got this right, I will be deeply impressed ▶ 00:00 by the fact that you got this right. ▶ 00:04 But the interesting thing now to observe is if we happen to know it's sunny ▶ 00:07 and I'm happy, then the probability of a raise is 1.4%, 0.014. ▶ 00:13 If I don't know about the weather and I'm happy, ▶ 00:21 then the probability of a raise goes up to about 1.85%. ▶ 00:26 Why is that? ▶ 00:30 Well, it's the explaining away effect. ▶ 00:32 My happiness is well explained by the fact that it's sunny. ▶ 00:35 So if someone observes me to be happy and asks the question, ▶ 00:40 "Is this because Sebastian got a raise at work?" ▶ 00:43 well, if you know it's sunny and this is a fairly good explanation for me being happy, ▶ 00:46 you don't have to assume I got a raise. ▶ 00:53 If you don't know about the weather, then obviously the chances are higher ▶ 00:55 that the raise caused my happiness, ▶ 01:01 and therefore this number goes up from 0.014 to 0.018. ▶ 01:03 Let me ask you one final question in this next quiz, ▶ 01:10 which is the probability of the raise given that I look happy and it's not sunny. ▶ 01:14 This is the most extreme case for making a raise likely ▶ 01:23 because I am a happy guy, and it's definitely not caused by the weather. ▶ 01:27 So it could be just random, or it could be caused by the raise. ▶ 01:33 So please calculate this number for me and enter it into this box. ▶ 01:37 ### (01:18) 11e Answer

[Thrun] Well, the answer follows the exact same scheme as before, ▶ 00:00 with S being replaced by not S. ▶ 00:04 So this should be an easier question for you to answer. ▶ 00:08 P of R given H and not S can be inverted by Bayes' rule to be as follows. ▶ 00:11 Once we apply Bayes' rule, as indicated over here where we swapped H to the left side ▶ 00:20 and R to the right side, you can observe that this value over here ▶ 00:24 can be readily found in the table. ▶ 00:29 It's actually the 0.9 over there. ▶ 00:32 For this value over here, the raise is independent of the weather ▶ 00:35 by virtue of our Bayes network, so it's just 0.01. ▶ 00:41 And as before, we apply total probability to the expression over here, ▶ 00:45 and we observe in this quotient over here that these 2 expressions are the same. ▶ 00:52 P of H given not S, not R is the value over here, ▶ 00:58 and the 0.99 is the complement of the probability of R taken from over here, ▶ 01:03 and that ends up being 0.0833. ▶ 01:08 This would have been the right answer. ▶ 01:16 ### (03:13) 11f Conclusion

[Thrun] It's really interesting to compare this to the situation over here. ▶ 00:00 In both cases I'm happy, as shown over here, ▶ 00:04 and I ask the same question, which is whether I got a raise at work, as R over here. ▶ 00:08 But in one case I observe that the weather is sunny; in the other one it isn't. ▶ 00:15 And look what it does to my probability of having received a raise. ▶ 00:21 The sunniness perfectly well explains my happiness, ▶ 00:25 and my probability of having received a raise ends up being a mere 1.4%, or 0.014. ▶ 00:30 However, if my wife observes it to be non-sunny, then it is much more likely ▶ 00:41 that the cause of my happiness is related to a raise at work, ▶ 00:47 and now the probability is 8.3%, which is significantly higher than the 1.4% before. ▶ 00:51 This is a Bayes network in which S and R are independent ▶ 00:58 but H adds a dependence between S and R. ▶ 01:04 Let me talk about this in a little bit more detail on the next paper. ▶ 01:10 So here is our Bayes network again. ▶ 01:16 In our previous exercises, we computed for this network ▶ 01:18 that the probability of a raise R given any of these variables shown here was as follows. ▶ 01:22 The really interesting thing is that in the absence of information about H, ▶ 01:29 which is the middle case over here, ▶ 01:34 the probability of R is unaffected by knowledge of S-- ▶ 01:37 that is, R and S are independent. ▶ 01:41 This is the same as the probability of R, ▶ 01:46 and R and S are independent. ▶ 01:49 However, if I know something about the variable H, ▶ 01:56 then S and R become dependent-- ▶ 02:02 that is, knowing about my happiness over here renders S and R dependent. ▶ 02:06 This is not the same as the probability of just R given H. ▶ 02:15 Obviously, it isn't because if I now vary S from S to not S, ▶ 02:23 it affects my probability for the variable R.
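The three cases compared in this conclusion can be reproduced by enumerating the joint distribution of the sunny/raise/happiness network. A sketch with the unit's numbers; the helper-function names are mine:

```python
from itertools import product

p_s, p_r = 0.7, 0.01
p_h = {(True, True): 1.0, (False, True): 0.9,
       (True, False): 0.7, (False, False): 0.1}  # P(H=true | S, R)

def joint(s, r, h):
    """P(S=s, R=r, H=h) for the sunny/raise/happiness network."""
    ph = p_h[(s, r)] if h else 1 - p_h[(s, r)]
    return (p_s if s else 1 - p_s) * (p_r if r else 1 - p_r) * ph

def p_raise(evidence):
    """P(R=true | evidence); evidence fixes H (true here) and optionally S."""
    num = den = 0.0
    for s, r in product((True, False), repeat=2):
        if "S" in evidence and s != evidence["S"]:
            continue
        p = joint(s, r, True)  # H is observed true in every query here
        den += p
        if r:
            num += p
    return num / den

print(round(p_raise({"H": True, "S": True}), 4))   # 0.0142 - sunny explains H away
print(round(p_raise({"H": True}), 4))              # 0.0185 - weather unknown
print(round(p_raise({"H": True, "S": False}), 4))  # 0.0833 - raise must explain H
```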
▶ 02:28 That is a really unusual situation ▶ 02:33 where we have R and S independent ▶ 02:36 but given the variable H, R and S are not independent anymore. ▶ 02:40 So knowledge of H makes 2 variables that previously were independent non-independent. ▶ 02:50 Put differently, 2 variables that are independent may not, in certain cases, be ▶ 02:58 conditionally independent. ▶ 03:06 Independence does not imply conditional independence. ▶ 03:08 ### (02:53) 12 General Bayes Networks

[Thrun] So we're now ready to define Bayes networks in a more general way. ▶ 00:00 Bayes networks define probability distributions over graphs of random variables. ▶ 00:05 Here is an example graph of 5 variables, ▶ 00:10 and this Bayes network defines the distribution over those 5 random variables. ▶ 00:14 Instead of enumerating all possible combinations of these 5 random variables, ▶ 00:19 the Bayes network is defined by probability distributions ▶ 00:24 that are inherent to each individual node. ▶ 00:28 For nodes A and B, we just have a distribution P of A and P of B ▶ 00:32 because A and B have no incoming arcs. ▶ 00:38 C is a conditional distribution conditioned on A and B. ▶ 00:42 D and E are conditioned on C. ▶ 00:47 The joint probability represented by a Bayes network ▶ 00:52 is the product of various Bayes network probabilities ▶ 00:56 that are defined over individual nodes ▶ 01:00 where each node's probability is only conditioned on the incoming arcs. ▶ 01:03 So A has no incoming arc; therefore, we just need P of A. ▶ 01:08 C has 2 incoming arcs, so we define the probability of C conditioned on A and B. ▶ 01:12 And D and E each have 1 incoming arc, as shown over here. ▶ 01:18 The definition of this joint distribution by using the following factors ▶ 01:22 has one really big advantage. ▶ 01:27 Whereas the joint distribution over any 5 variables requires 2 to the 5th minus 1, ▶ 01:30 which is 31 probability values, ▶ 01:40 the Bayes network over here only requires 10 such values. ▶ 01:43 P of A is one value, from which we can derive P of not A. ▶ 01:48 Same for P of B. ▶ 01:53 P of C given A, B is given by a distribution over C ▶ 01:55 conditioned on any combination of A and B, of which there are 4, as A and B are binary. ▶ 02:02 P of D given C is 2 parameters, for P of D given C and P of D given not C. ▶ 02:07 And the same is true for P of E given C.
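The parameter count described above can be sketched as a one-line function: each boolean node with k parents needs 2^k values (the function name and dictionary encoding are mine):

```python
def num_parameters(parents):
    """Probability values needed for a Bayes net of boolean variables:
    2**k per node with k parents."""
    return sum(2 ** len(ps) for ps in parents.values())

# The A,B -> C -> D,E network from this section
net1 = {"A": [], "B": [], "C": ["A", "B"], "D": ["C"], "E": ["C"]}
print(num_parameters(net1))  # 10 (vs 2**5 - 1 = 31 for the full joint)

# The later quiz network: A,B,C -> D; D -> E,F; C,D -> G
net2 = {"A": [], "B": [], "C": [], "D": ["A", "B", "C"],
        "E": ["D"], "F": ["D"], "G": ["C", "D"]}
print(num_parameters(net2))  # 19
```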
▶ 02:15 So if you add those up, you get 10 parameters in total. ▶ 02:18 So the compactness of the Bayes network ▶ 02:21 leads to a representation that scales significantly better to large networks ▶ 02:25 than the combinatorial approach, which goes through all combinations of variable values. ▶ 02:31 That is a key advantage of Bayes networks, ▶ 02:36 and that is the reason why Bayes networks are being used so extensively ▶ 02:39 for all kinds of problems. ▶ 02:43 So here is a quiz. ▶ 02:45 How many probability values are required to specify this Bayes network? ▶ 02:47 Please put your answer in the following box. ▶ 02:51 ### (00:19) 12a Answer

[Thrun] And the answer is 13. ▶ 00:00 One over here, 2 over here, and 4 over here. ▶ 00:03 Simply speaking, any variable that has K inputs requires 2 to the K such values. ▶ 00:06 So in total we have 1, 9, 13. ▶ 00:15 ### (00:17) 12b Question

[Thrun] Here's another quiz. ▶ 00:00 How many parameters do we need to specify the joint distribution ▶ 00:02 for this Bayes network over here ▶ 00:06 where A, B, and C point into D, D points into E, F, and G, ▶ 00:09 and C also points into G? ▶ 00:13 Please write your answer into this box. ▶ 00:15 ### (00:16) 12c Answer

[Thrun] And the answer is 19. ▶ 00:00 So 1 here, 1 here, 1 here, 2 here, 2 here; 2 arcs point into G, which makes for 4, ▶ 00:02 and 3 arcs point into D. Two to the 3 is 8. ▶ 00:09 So we get 1 + 1 + 1 + 8 + 2 + 2 + 4. If you add those up, it's 19. ▶ 00:13 ### (00:28) 12d Question

[Thrun] And here is our car network which we discussed at the very beginning of this unit. ▶ 00:00 How many parameters do we need to specify this network? ▶ 00:06 Remember, there are 16 total variables, ▶ 00:11 and the naive joint over the 16 will be 2 to the 16th minus 1, which is 65,535. ▶ 00:15 Please write your answer into this box over here. ▶ 00:25 ### (00:24) 12e Answer

[Thrun] To answer this question, let us add up these numbers. ▶ 00:00 Battery age is 1, and these are 1 and 1. ▶ 00:04 This has 1 incoming arc, so it's 2. ▶ 00:08 Two incoming arcs makes 4. ▶ 00:10 One incoming arc each gives 2 and 2, which equals 4. ▶ 00:13 Four incoming arcs makes 16. ▶ 00:17 If we add up all the right numbers, we get 47. ▶ 00:21 ### (00:20) 12f Value of the Network

[Thrun] So it takes 47 numerical probabilities to specify the joint ▶ 00:00 compared to 65,535 if you didn't have the graph-like structure. ▶ 00:05 I think this example really illustrates the advantage ▶ 00:11 of compact Bayes network representations over unstructured joint representations. ▶ 00:14 ### (00:35) 13 D-Separation

[Thrun] The next concept I'd like to teach you is called D-separation. ▶ 00:00 And let me start the discussion of this concept by a quiz. ▶ 00:04 We have here a Bayes network, ▶ 00:09 and I'm going to ask you a conditional independence question. ▶ 00:11 Is C independent of A? ▶ 00:16 Please tell me yes or no. ▶ 00:20 Is C independent of A given B? ▶ 00:22 Is C independent of D? ▶ 00:27 Is C independent of D given A? ▶ 00:30 And is E independent of C given D? ▶ 00:32 ### (00:52) 13a Answer

[Thrun] So C is not independent of A. ▶ 00:00 In fact, A influences C by virtue of B. ▶ 00:04 But if you know B, then A becomes independent of C, ▶ 00:09 which means the only determinant of C is B. ▶ 00:13 If you know B for sure, then knowledge of A won't really tell you anything about C. ▶ 00:17 C is also not independent of D, just the same way C is not independent of A. ▶ 00:22 If I learn something about D, I can infer more about C. ▶ 00:27 But if I do know A, then it's hard to imagine how knowledge of D would help me with C ▶ 00:31 because I can't learn anything more about A than I already know. ▶ 00:38 Therefore, given A, C and D are independent. ▶ 00:42 The same is true for E and C. ▶ 00:45 If we know D, then E and C become independent. ▶ 00:48 ### (00:45) 13b D-Separation Example

[Thrun] In this specific example, the rule that we could apply is very, very simple. ▶ 00:00 Any 2 variables are independent if they're not linked by a path of only unknown variables. ▶ 00:04 So for example, if we know B, then everything downstream of B ▶ 00:10 becomes independent of anything upstream of B. ▶ 00:14 E is now independent of C, conditioned on B. ▶ 00:18 However, knowledge of B does not render A and E independent. ▶ 00:22 In this graph over here, A and B connect to C and C connects to D and to E. ▶ 00:26 So let me ask you, is A independent of E, ▶ 00:33 A independent of E given B, ▶ 00:37 A independent of E given C, ▶ 00:39 A independent of B, ▶ 00:41 and A independent of B given C? ▶ 00:43 ### (01:26) 13c Answer

[Thrun] And the answer for this one is really interesting. ▶ 00:00 A is clearly not independent of E because through C we can see an influence of A on E. ▶ 00:03 Given B, that doesn't change. ▶ 00:08 A still influences C, despite the fact we know B. ▶ 00:11 However, if we know C, the influence is cut off. ▶ 00:15 There is no way A can influence E if we know C. ▶ 00:18 A is clearly independent of B. ▶ 00:22 They are different entry variables. They have no incoming arcs. ▶ 00:25 But here is the caveat. ▶ 00:29 Given C, A and B become dependent. ▶ 00:32 So whereas initially A and B were independent, ▶ 00:35 given C, they become dependent. ▶ 00:38 And the reason why they become dependent we've studied before. ▶ 00:41 This is the explaining away effect. ▶ 00:44 If you know, for example, C to be true, ▶ 00:48 then knowledge of A will substantially affect what we believe about B. ▶ 00:51 If there are 2 joint causes for C and we happen to know A is true, ▶ 00:57 we will discredit cause B. ▶ 01:02 If we happen to know A is false, we will increase our belief for the cause B. ▶ 01:04 That was an effect we studied extensively in the happiness example I gave you before. ▶ 01:09 The interesting thing here is we are facing a situation ▶ 01:15 where knowledge of variable C renders previously independent variables dependent. ▶ 01:19 ### (02:54) 13d D-Separation General Definition

[Thrun] This leads me to the general study of conditional independence in Bayes networks, ▶ 00:00 often called D-separation or reachability. ▶ 00:06 D-separation is best studied by so-called active triplets and inactive triplets ▶ 00:10 where active triplets render variables dependent ▶ 00:17 and inactive triplets render them independent. ▶ 00:20 Any chain of 3 variables like this makes the initial and final variable dependent ▶ 00:23 if all variables are unknown. ▶ 00:30 However, if the center variable is known-- ▶ 00:32 that is, it's behind the conditioning bar-- ▶ 00:35 then this variable and this variable become independent. ▶ 00:38 So if we have a structure like this and it's quote-unquote cut off ▶ 00:42 by a known variable in the middle, that separates, or d-separates, ▶ 00:47 the left variable from the right variable, and they become independent. ▶ 00:53 Similarly, any structure like this renders the left variable and the right variable dependent ▶ 00:57 unless the center variable is known, ▶ 01:04 in which case the left and right variable become independent. ▶ 01:08 Another active triplet now requires knowledge of a variable. ▶ 01:12 This is the explaining away case. ▶ 01:16 If this variable is known for a Bayes network that converges into a single variable, ▶ 01:19 then this variable and this variable over here become dependent. ▶ 01:25 Contrast this with a case where all variables are unknown. ▶ 01:29 A situation like this means that the variables on the left and on the right are actually independent. ▶ 01:33 In one final example, we also get dependence if we have the following situation: ▶ 01:40 a direct successor of a convergent variable is known. ▶ 01:48 So it is sufficient if a successor of this variable is known. ▶ 01:52 The variable itself does not have to be known, ▶ 01:57 and the reason is if you know this guy over here, ▶ 01:59 we get knowledge about this guy over here.
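The triplet rules above can be turned into a small d-separation checker that tests every undirected path between two variables. A sketch; the quiz network is not drawn in this transcript, so the edge list below is an assumption consistent with the answer discussion (A→B→D←E←F, with D→G←H, so that knowing H says something about G but nothing about D):

```python
def d_separated(edges, x, y, given):
    """True iff x and y are conditionally independent given `given`."""
    children, parents = {}, {}
    for p, c in edges:
        children.setdefault(p, set()).add(c)
        parents.setdefault(c, set()).add(p)
    given = set(given)

    def descendants(node):
        out, stack = set(), [node]
        while stack:
            for c in children.get(stack.pop(), set()):
                if c not in out:
                    out.add(c)
                    stack.append(c)
        return out

    def active(a, b, c):
        # Chain a->b->c, chain a<-b<-c, fork a<-b->c: active iff b is unknown.
        chain_or_fork = (
            (b in children.get(a, set()) and c in children.get(b, set()))
            or (a in children.get(b, set()) and b in children.get(c, set()))
            or (a in children.get(b, set()) and c in children.get(b, set())))
        if chain_or_fork:
            return b not in given
        # Collider a->b<-c: active iff b or one of its descendants is known.
        return b in given or bool(descendants(b) & given)

    def paths(cur, seen):  # all simple undirected paths from cur to y
        if cur == y:
            yield [cur]
            return
        for nxt in children.get(cur, set()) | parents.get(cur, set()):
            if nxt not in seen:
                for rest in paths(nxt, seen | {nxt}):
                    yield [cur] + rest

    for path in paths(x, {x}):
        if len(path) == 2:  # a direct edge always makes them dependent
            return False
        if all(active(*path[i:i + 3]) for i in range(len(path) - 2)):
            return False    # found an active path
    return True

edges = [("A", "B"), ("B", "D"), ("E", "D"), ("F", "E"),
         ("D", "G"), ("H", "G")]
print(d_separated(edges, "F", "A", set()))    # True  - collider D blocks the path
print(d_separated(edges, "F", "A", {"D"}))    # False - collider D is known
print(d_separated(edges, "F", "A", {"G"}))    # False - a descendant of D is known
print(d_separated(edges, "F", "A", {"H"}))    # True  - H is not a descendant of D
```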
▶ 02:02 And by virtue of that, the case over here essentially applies. ▶ 02:05 If you look at those rules, ▶ 02:09 they allow you to determine for any Bayes network ▶ 02:11 whether variables are dependent or not dependent given the evidence you have. ▶ 02:15 If you color the nodes dark for which you do have evidence, ▶ 02:20 then you can use these rules to understand whether any 2 variables ▶ 02:25 are conditionally independent or not. ▶ 02:29 So let me ask you for this relatively complicated Bayes network the following questions. ▶ 02:31 Is F independent of A? ▶ 02:37 Is F independent of A given D? ▶ 02:41 Is F independent of A given G? ▶ 02:45 And is F independent of A given H? ▶ 02:49 Please mark your answers as you see fit. ▶ 02:51 ### (01:03) 13e Answer

[Thrun] And the answer is yes, F is independent of A. ▶ 00:00 What we find from our rules of D-separation is that F is dependent on D ▶ 00:04 and A is dependent on D. ▶ 00:08 But if you don't know D, you can't infer any dependence between A and F at all. ▶ 00:11 If you do know D, then F and A become dependent. ▶ 00:16 And the reason is B and E are dependent given D, ▶ 00:20 and we can transform this back into dependence of A and F ▶ 00:25 because B and A are dependent and E and F are dependent. ▶ 00:29 There is an active path between A and F which goes across here and here ▶ 00:33 because D is known. ▶ 00:38 If we know G, the same thing is true because G gives us knowledge about D, ▶ 00:40 and D can be applied back to this path over here. ▶ 00:44 However, if you know H, that's not the case. ▶ 00:47 So H might tell us something about G, ▶ 00:49 but it doesn't tell us anything about D, ▶ 00:51 and therefore, we have no reason to activate the path between A and F. ▶ 00:53 The path between A and F is still inactive, even though we have knowledge of H. ▶ 00:59 ### (00:50) 14 Congratulations

[Thrun] So congratulations. You learned a lot about Bayes networks. ▶ 00:00 You learned about the graph structure of Bayes networks, ▶ 00:03 you understood how this is a compact representation, ▶ 00:06 you learned about conditional independence, ▶ 00:10 and we talked a little bit about applications of Bayes networks ▶ 00:12 to interesting reasoning problems. ▶ 00:15 But by all means this was a mostly theoretical unit of this class, ▶ 00:18 and in future classes we will talk more about applications. ▶ 00:23 The instrument of Bayes networks is really essential to a number of problems. ▶ 00:27 It really characterizes the sparse dependence that exists in many real-world problems ▶ 00:31 like in robotics and computer vision and filtering and diagnostics and so on. ▶ 00:36 I really hope you enjoyed this class, ▶ 00:41 and I really hope you understood in depth how Bayes networks work. ▶ 00:43

## (34) Unit 4

### (04:38) 1 Probabilistic Inference

[Probabilistic Inference] ▶ 00:00 [Male] Welcome back. In the previous unit, we went over the basics ▶ 00:02 of probability theory and saw how ▶ 00:05 a Bayes network could concisely represent a joint probability distribution, ▶ 00:12 including the representation of independence between the variables. ▶ 00:17 In this unit, we will see how to do probabilistic inference. ▶ 00:24 That is, how to answer probability questions using Bayes nets. ▶ 00:31 Let's put up a simple Bayes net. ▶ 00:36 We'll use the familiar example of the earthquake ▶ 00:40 where we can have a burglary or an earthquake ▶ 00:45 setting off an alarm, and if the alarm goes off, ▶ 00:50 either John or Mary might call. ▶ 00:53 Now, what kinds of questions can we ask to do inference about? ▶ 00:58 The simplest type of question is the same question we ask ▶ 01:02 with an ordinary subroutine or function in a programming language. ▶ 01:05 Namely, given some inputs, what are the outputs? ▶ 01:08 So, in this case, we could say given the inputs of B and E, ▶ 01:12 what are the outputs, J and M? ▶ 01:18 Rather than call them input and output variables, ▶ 01:22 in probabilistic inference, we'll call them evidence and query variables. ▶ 01:26 That is, the variables that we know the values of are the evidence, ▶ 01:36 and the ones that we want to find out the values of are the query variables. ▶ 01:39 Anything that is neither evidence nor query is known as a hidden variable. ▶ 01:44 That is, we won't tell you what its value is. ▶ 01:52 We won't figure out what its value is and report it, ▶ 01:55 but we'll have to compute with it internally. ▶ 01:58 And now furthermore, in probabilistic inference, ▶ 02:01 the output is not a single number for each of the query variables, ▶ 02:05 but rather, it's a probability distribution. ▶ 02:10 So, the answer is going to be a complete, joint probability distribution ▶ 02:13 over the query variables.
▶ 02:17 We call this the posterior distribution, given the evidence, ▶ 02:19 and we can write it like this. ▶ 02:23 It's the probability distribution of one or more query variables ▶ 02:26 given the values of the evidence variables. ▶ 02:34 And there can be zero or more evidence variables, ▶ 02:39 and each of them is given an exact value. ▶ 02:42 And that's the computation we want to come up with. ▶ 02:47 There's another question we can ask. ▶ 02:53 Which is the most likely explanation? ▶ 02:56 That is, out of all the possible values for all the query variables, ▶ 02:58 which combination of values has the highest probability? ▶ 03:03 We write the formula like this, asking which Q values ▶ 03:08 maximize the probability given the evidence values. ▶ 03:12 Now, in an ordinary programming language, each function goes only one way. ▶ 03:16 It has input variables, does some computation, ▶ 03:22 and comes up with a result variable or result variables. ▶ 03:26 One great thing about Bayes nets is that we're not restricted ▶ 03:31 to going only in one direction. ▶ 03:34 We could go in the causal direction, giving as evidence ▶ 03:36 the root nodes of the tree and asking as query values the nodes at the bottom. ▶ 03:41 Or, we could reverse that causal flow. ▶ 03:47 For example, we could have J and M be the evidence variables ▶ 03:50 and B and E be the query variables, ▶ 03:55 or we could have any other combination. ▶ 03:58 For example, we could have M be the evidence variable ▶ 04:01 and J and B be the query variables. ▶ 04:05 Here's a question for you. ▶ 04:11 Imagine the situation where Mary has called to report that the alarm is going off, ▶ 04:13 and we want to know whether or not there has been a burglary. ▶ 04:18 For each of the nodes, click on the circle to tell us ▶ 04:22 if the node is an evidence node, a hidden node, ▶ 04:27 or a query node. ▶ 04:32 ### (00:11) 1a Answer

The answer is that Mary calling is the evidence node. ▶ 00:00 The burglary is the query node, ▶ 00:04 and all the others are hidden variables in this case. ▶ 00:07 ### (04:24) 2 Enumeration

Now we're going to talk about how to do inference on a Bayes net. ▶ 00:00 We'll start with our familiar network, and we'll talk about a method ▶ 00:04 called enumeration, ▶ 00:08 which goes through all the possibilities, adds them up, ▶ 00:12 and comes up with an answer. ▶ 00:15 So, what we do is start by stating the problem. ▶ 00:17 We're going to ask the question of what is the probability ▶ 00:24 that a burglary occurred given that John called and Mary called? ▶ 00:27 We'll use the definition of conditional probability to answer this. ▶ 00:34 So, this query is equal to the joint probability distribution ▶ 00:39 of all 3 variables divided by the probability of the conditioned variables. ▶ 00:47 Now, note I'm using a notation here where instead of writing out the probability ▶ 00:55 of some variable equals true, I'm just using the notation plus ▶ 01:01 and then the variable name in lower case, ▶ 01:05 and if I wanted the negation, I would use a negation sign. ▶ 01:08 Notice there's a different notation where instead of writing out ▶ 01:13 the plus and negation signs, we just use the variable name itself, P(e), ▶ 01:17 to indicate E is true. ▶ 01:22 That notation works well, but it can get confusing: ▶ 01:25 does P(e) mean E is true, or does it mean E is a variable? ▶ 01:29 And so we're going to stick to the notation where we explicitly have ▶ 01:34 the pluses and negation signs. ▶ 01:37 To do inference by enumeration, we first take a conditional probability ▶ 01:41 and rewrite it as unconditional probabilities. ▶ 01:45 Now we enumerate all the atomic probabilities and calculate the sum of products. ▶ 01:49 Let's look at just the complex term in the numerator first. ▶ 01:56 The procedure for figuring out the denominator would be similar, and we'll skip that. ▶ 02:00 So, the probability of these 3 terms together ▶ 02:05 can be determined by enumerating all possible values of the hidden variables.
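The full enumeration described here can be sketched in a few lines. Note that only the 0.001, 0.002, 0.95, 0.9, and 0.7 CPT entries appear in this unit's quiz; the remaining entries below (0.94, 0.29, 0.05, 0.01, and so on) are the standard values used with this alarm example, so treat them as an assumption:

```python
from itertools import product

# CPTs for the burglary/earthquake alarm network.
P_b = 0.001                      # P(+b)
P_e = 0.002                      # P(+e)
P_a = {(True, True): 0.95, (True, False): 0.94,
       (False, True): 0.29, (False, False): 0.001}  # P(+a | B, E)
P_j = {True: 0.9, False: 0.05}   # P(+j | A)
P_m = {True: 0.7, False: 0.01}   # P(+m | A)

def joint(b, e, a, j, m):
    """Product of the five CPT factors for one atomic event."""
    def f(p, val):
        return p if val else 1 - p
    return (f(P_b, b) * f(P_e, e) * f(P_a[(b, e)], a)
            * f(P_j[a], j) * f(P_m[a], m))

# P(+b | +j, +m) by summing out the hidden variables E and A
num = sum(joint(True, e, a, True, True)
          for e, a in product((True, False), repeat=2))
den = num + sum(joint(False, e, a, True, True)
                for e, a in product((True, False), repeat=2))
print(joint(True, True, True, True, True))  # 1.197e-06, the all-positive term
print(round(num / den, 3))                  # 0.284
```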
▶ 02:12 In this case, there are 2, E and A, ▶ 02:17 so we'll sum over those variables for all values of E and for all values of A. ▶ 02:22 In this case, they're boolean, so there are only 2 values of each. ▶ 02:29 We ask what's the probability of this unconditional term? ▶ 02:34 And that we get by summing out over all possibilities, ▶ 02:41 E and A being true or false. ▶ 02:44 Now, to get the values of these atomic events, ▶ 02:49 we'll have to rewrite this equation in a form that corresponds ▶ 02:52 to the conditional probability tables that we have associated with the Bayes net. ▶ 02:55 So, we'll take this whole expression and rewrite it. ▶ 03:00 It's still a sum over the hidden variables E and A, ▶ 03:04 but now I'll rewrite this expression in terms of the parents ▶ 03:08 of each of the nodes in the network. ▶ 03:12 So, that gives us the product of these 5 terms, ▶ 03:15 which we then have to sum over all values of E and A. ▶ 03:21 If we call this product f(e,a), ▶ 03:24 then the whole answer is the sum of f(e,a) for all values of E and A, ▶ 03:31 that is, the sum of 4 terms where each of the terms is a product of 5 numbers. ▶ 03:43 Where do we get the numbers to fill in this equation? ▶ 03:51 From the conditional probability tables from our model, ▶ 03:54 so let's put the equation back up, and we'll ask you for the case ▶ 03:58 where both E and A are positive ▶ 04:03 to look up in the conditional probability tables and fill in the numbers ▶ 04:09 for each of these 5 terms, and then multiply them together and fill in the product. ▶ 04:14 ### (01:59) 2a Answer

We get the answer by reading numbers off the conditional probability tables, ▶ 00:00 so the probability of B being positive is 0.001. ▶ 00:04 The probability of E being positive, because we're dealing with the positive case now ▶ 00:11 for the variable E, is 0.002. ▶ 00:16 The probability of A being positive, because we're dealing with that case, ▶ 00:22 given that B is positive and E is positive, ▶ 00:26 we can read off here as 0.95. ▶ 00:30 The probability that J is positive given that A is positive is 0.9. ▶ 00:37 And finally, the probability that M is positive given that A is positive ▶ 00:44 we read off here as 0.7. ▶ 00:50 We multiply all those together; it's going to be a small number ▶ 00:54 because we've got the 0.001 and the 0.002 here. ▶ 00:57 It can't quite fit in the box, but it works out to 0.000001197. ▶ 01:00 That seems like a really small number, but remember, ▶ 01:12 we have to normalize by the P(+j,+m) term, ▶ 01:14 and this is only 1 of the 4 possibilities. ▶ 01:19 We have to enumerate over all 4 possibilities for E and A, ▶ 01:22 and in the end, it works out that the probability of burglary being true ▶ 01:26 given that John and Mary call is 0.284. ▶ 01:32 And we get that number because intuitively, ▶ 01:38 it seems that the alarm is fairly reliable. ▶ 01:42 John and Mary calling are very reliable, ▶ 01:44 but the prior probability of burglary is low. ▶ 01:47 And those 2 terms combine together to give us the 0.284 value ▶ 01:49 when we sum up each of the 4 terms of these products. ▶ 01:54 ### (03:27) 3 Speeding up Enumeration

[Norvig] We've seen how to do enumeration to solve the inference problem â–¶ 00:00 on belief networks. â–¶ 00:04 For a simple network like the alarm network, that's all we need to know. â–¶ 00:06 There's only 5 variables, so even if all 5 of them were hidden, â–¶ 00:10 there would only be 32 rows in the table to sum up. â–¶ 00:14 From a theoretical point of view, we're done. â–¶ 00:20 But from a practical point of view, other networks could give us trouble. â–¶ 00:22 Consider this network, which is one for determining insurance for car owners. â–¶ 00:26 There are 27 different variables. â–¶ 00:35 If each of the variables were boolean, that would give us over 100 million rows to sum out. â–¶ 00:38 But in fact, some of the variables are non-boolean, â–¶ 00:44 they have multiple values, and it turns out that representing this entire network â–¶ 00:46 and doing enumeration we'd have to sum over a quadrillion rows. â–¶ 00:52 That's just not practical, so we're going to have to come up with methods â–¶ 00:57 that are faster than enumerating everything. â–¶ 01:01 The first technique we can use to get a speed-up in doing inference on Bayes nets â–¶ 01:04 is to pull out terms from the enumeration. â–¶ 01:09 For example, here the probability of b is going to be the same for all values of E and a. â–¶ 01:13 So we can take that term and move it out of the summation, â–¶ 01:20 and now we have a little bit less work to do. â–¶ 01:26 We can multiply by that term once rather than having it in each row of the table. â–¶ 01:28 We can also move this term, the P of e, to the left of the summation over a, â–¶ 01:33 because it doesn't depend on a. â–¶ 01:40 By doing this, we're doing less work. â–¶ 01:43 The inner loop of the summation now has only 3 terms rather than 5 terms. â–¶ 01:45 So we've reduced the cost of doing each row of the table. â–¶ 01:50 But we still have the same number of rows in the table, â–¶ 01:53 so we're going to have to do better than that. 
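As a concrete sketch of this, here is a short Python version of the alarm-network query, with P(b) pulled out of the summation as just described. The CPT entries read out in the video are P(+b)=0.001, P(+e)=0.002, P(+a|+b,+e)=0.95, P(+j|+a)=0.9, and P(+m|+a)=0.7; the remaining entries below are the standard textbook values for this network and are an assumption of this sketch.

```python
from itertools import product

# CPTs for the alarm network. Entries not read out in the video are
# assumed to be the standard textbook values for this network.
P_B = 0.001
P_E = 0.002
P_A = {(True, True): 0.95, (True, False): 0.94,
       (False, True): 0.29, (False, False): 0.001}   # P(+a | B, E)
P_J = {True: 0.90, False: 0.05}                      # P(+j | A)
P_M = {True: 0.70, False: 0.01}                      # P(+m | A)

def p(prob_true, value):
    """P(var = value) given the probability that var is True."""
    return prob_true if value else 1.0 - prob_true

def joint_with_evidence(b):
    """Sum the product of the five CPT factors over the hidden E and A,
    with evidence +j, +m fixed. P(b) doesn't depend on e or a, so it is
    pulled out of the summation."""
    return p(P_B, b) * sum(
        p(P_E, e) * p(P_A[(b, e)], a) * p(P_J[a], True) * p(P_M[a], True)
        for e, a in product([True, False], repeat=2))

num_true = joint_with_evidence(True)
num_false = joint_with_evidence(False)
print(round(num_true / (num_true + num_false), 3))   # -> 0.284
```

Normalizing by the sum of both cases stands in for dividing by P(+j,+m), which is why the denominator never has to be computed separately.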
â–¶ 01:57 The next technique for efficient inference is to maximize independence of variables. â–¶ 02:00 The structure of a Bayes net determines how efficient it is to do inference on it. â–¶ 02:08 For example, a network that's a linear string of variables, â–¶ 02:12 X1 through Xn, can have inference done in time proportional to the number n, â–¶ 02:17 whereas a network that's a complete network â–¶ 02:27 where every node points to every other node and so on could take time 2 to the n â–¶ 02:31 if all n variables are boolean variables. â–¶ 02:40 In the alarm network we saw previously, we took care â–¶ 02:45 to make sure that we had all the independence relations represented â–¶ 02:50 in the structure of the network. â–¶ 02:54 But if we put the nodes together in a different order, â–¶ 02:57 we would end up with a different structure. â–¶ 03:00 Let's start by ordering the node John calls first â–¶ 03:03 and then adding in the node Mary calls. â–¶ 03:09 The question is, given just these 2 nodes and looking at the node for Mary calls, â–¶ 03:13 is that node dependent or independent of the node for John calls? â–¶ 03:19 ### (00:24) 3a Answer

[Norvig] The answer is that the node for Mary calls in this network â–¶ 00:01 is dependent on John calls. â–¶ 00:05 In the previous network, they were independent given that we knew that the alarm had occurred. â–¶ 00:08 But here we don't know that the alarm had occurred, â–¶ 00:13 and so the nodes are dependent â–¶ 00:16 because having information about one will affect the information about the other. â–¶ 00:18 ### (00:13) 3b Second Question

[Norvig] Now we'll continue and we'll add the node A for alarm to the network. â–¶ 00:00 And what I want you to do is click on all the other variables â–¶ 00:05 that A is dependent on in this network. â–¶ 00:09 ### (00:33) 3c Second Answer

[Norvig] The answer is that alarm is dependent on both John and Mary. ▶ 00:01 And so we can draw both nodes in, both arrows in. ▶ 00:05 Intuitively that makes sense because if John calls, ▶ 00:09 then it's more likely that the alarm has occurred, ▶ 00:14 likewise if Mary calls, and if both called, it's really likely. ▶ 00:16 So you can figure out the answer by intuitive reasoning, ▶ 00:20 or you can figure it out by going to the conditional probability tables ▶ 00:23 and seeing according to the definition of conditional probability ▶ 00:27 whether the numbers work out. ▶ 00:31 ### (00:11) 3d Third Question

[Norvig] Now we'll continue and we'll add the node B for burglary â–¶ 00:01 and ask again, click on all the variables that B is dependent on. â–¶ 00:05 ### (00:10) 3e Third Answer

[Norvig] The answer is that B is dependent only on A. â–¶ 00:00 In other words, B is independent of J and M given A. â–¶ 00:04 ### (00:07) 3f Fourth Question

[Norvig] And finally, we'll add the last node, E, â–¶ 00:00 and ask you to click on all the nodes that E is dependent on. â–¶ 00:04 ### (00:26) 3g Fourth Answer

[Norvig] And the answer is that E is dependent on A. â–¶ 00:00 That much is fairly obvious. â–¶ 00:04 But it's also dependent on B. â–¶ 00:06 Now, why is that? â–¶ 00:08 E is dependent on A because if the earthquake did occur, â–¶ 00:10 then it's more likely that the alarm would go off. â–¶ 00:13 On the other hand, E is also dependent on B â–¶ 00:16 because if a burglary occurred, then that would explain why the alarm is going off, â–¶ 00:19 and it would mean that the earthquake is less likely. â–¶ 00:23 ### (00:18) 3h Causal Direction

[Norvig] The moral is that Bayes nets tend to be the most compact ▶ 00:00 and thus the easiest to do inference on when they're written in the causal direction-- ▶ 00:04 that is, when the networks flow from causes to effects. ▶ 00:12 ### (04:40) 4 Variable Elimination

Let's return to this equation, which we use to show how to do inference by enumeration. â–¶ 00:00 In this equation, we join up the whole joint distribution â–¶ 00:06 before we sum out over the hidden variables. â–¶ 00:10 That's slow, because we end up repeating a lot of work. â–¶ 00:15 Now we're going to show a new technique called variable elimination, â–¶ 00:18 which in many networks operates much faster. â–¶ 00:25 It's still a difficult computation, an NP-hard computation, â–¶ 00:27 to do inference over Bayes nets in general. â–¶ 00:30 Variable elimination works faster than inference by enumeration â–¶ 00:34 in most practical cases. â–¶ 00:38 It requires an algebra for manipulating factors, â–¶ 00:41 which are just names for multidimensional arrays â–¶ 00:45 that come out of these probabilistic terms. â–¶ 00:48 We'll use another example to show how variable elimination works. â–¶ 00:53 We'll start off with a network that has 3 boolean variables. â–¶ 00:57 R indicates whether or not it's raining. â–¶ 01:00 T indicates whether or not there's traffic, â–¶ 01:04 and T is dependent on whether it's raining. â–¶ 01:12 And finally, L indicates whether or not I'll be late for my next appointment, â–¶ 01:15 and that depends on whether or not there's traffic. â–¶ 01:19 Now we'll put up the conditional probability tables for each of these 3 variables. â–¶ 01:22 And then we can use inference to figure out the answer to questions like â–¶ 01:29 am I going to be late? â–¶ 01:35 And we know by definition that we could do that through enumeration â–¶ 01:38 by going through all the possible values for R and T â–¶ 01:42 and summing up the product of these 3 nodes. 
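That enumeration can be sketched in a few lines of Python, using the numbers from the conditional probability tables shown in the lecture:

```python
from itertools import product

# CPTs for the rain/traffic/late network, with the numbers from the
# lecture's tables.
P_R = 0.1                      # P(+r)
P_T = {True: 0.8, False: 0.1}  # P(+t | R)
P_L = {True: 0.3, False: 0.1}  # P(+l | T)

def p(prob_true, value):
    """P(var = value) given the probability that var is True."""
    return prob_true if value else 1.0 - prob_true

# P(+l) by plain enumeration: sum the product of the three CPT factors
# over all values of the hidden variables R and T.
p_late = sum(p(P_R, r) * p(P_T[r], t) * p(P_L[t], True)
             for r, t in product([True, False], repeat=2))
print(round(p_late, 3))  # -> 0.134
```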
â–¶ 01:47 Now, in a simple network like this, straight enumeration would work fine, â–¶ 01:54 but in a more complex network, what variable elimination does is give us a way â–¶ 01:59 to combine together parts of the network into smaller parts â–¶ 02:03 and then enumerate over those smaller parts and then continue combining. â–¶ 02:09 So, we start with a big network. â–¶ 02:13 We eliminate some of the variables. â–¶ 02:15 We compute by marginalizing out, and then we have a smaller network to deal with, â–¶ 02:17 and we'll show you how those 2 steps work. â–¶ 02:24 The first operation in variable elimination is called joining factors. â–¶ 02:28 A factor, again, is one of these tables. â–¶ 02:35 It's a multidimensional matrix, and what we do is choose 2 of the factors, â–¶ 02:39 2 or more of the factors. â–¶ 02:43 In this case, we'll choose these 2, and we'll combine them together â–¶ 02:45 to form a new factor which represents â–¶ 02:49 the joint probability of all the variables in that factor. â–¶ 02:52 In this case, R and T. â–¶ 02:56 Now we'll draw out that table. â–¶ 03:00 In each case, we just look up in the corresponding table, â–¶ 03:03 figure out the numbers, and multiply them together. â–¶ 03:06 For example, in this row we have a +r and a +t, â–¶ 03:08 so the +r is 0.1, and the entry for +r and +t is 0.8, â–¶ 03:13 so multiply them together and you get 0.08. â–¶ 03:19 Go all the way down. For example, in the last row we have a -r and a -t. â–¶ 03:22 -r is 0.9. The entry for -r and -t is also 0.9. â–¶ 03:28 Multiply those together and you get 0.81. â–¶ 03:34 So, what have we done? â–¶ 03:40 We used the operation of joining factors on these 2 factors, â–¶ 03:42 getting us a new factor which is part of the existing network. â–¶ 03:45 Now we want to apply a second operation called elimination, â–¶ 03:50 also called summing out or marginalization, to take this table and reduce it. â–¶ 03:56 Right now, the tables we have look like this. 
â–¶ 04:02 We could sum out or marginalize over the variable R â–¶ 04:06 to give us a table that just operates on T. â–¶ 04:10 So, the question is to fill in this table for P(T)-- â–¶ 04:14 there will be 2 entries in this table, the +t entry, formed by summing out â–¶ 04:20 all the entries here for all values of r for which t is positive, â–¶ 04:23 and the -t entry, formed the same way, by looking in this table â–¶ 04:28 and summing up all the rows over all values of r where t is negative. â–¶ 04:32 Put your answers in these boxes. â–¶ 04:37 ### (00:27) 4a Answer

The answer is that for +t we look up the 2 possible values for r, â–¶ 00:00 and we get 0.08 or 0.09. â–¶ 00:05 Sum those up, get 0.17, â–¶ 00:09 and then we look at the 2 possible values of R for -t, â–¶ 00:13 and we get 0.02 and 0.81. â–¶ 00:18 Add those up, and we get 0.83. â–¶ 00:22 ### (00:28) 4b More Variable Elimination

So, we took our network with RT and L. We summed out over R. â–¶ 00:00 That gives us a new network with T and L â–¶ 00:04 with these conditional probability tables. â–¶ 00:09 And now we want to do a join over T and L â–¶ 00:13 and give us a new table with the joint probability of P(T, L). â–¶ 00:17 And that table is going to look like this. â–¶ 00:25 ### (00:38) 4c Answer

The answer, again, for joining variables is determined by pointwise multiplication, ▶ 00:00 so for +t and +l we have 0.17 times 0.3, which is 0.051, ▶ 00:05 and for +t and -l, 0.17 times 0.7, which is 0.119. ▶ 00:12 Then we go to the minuses. ▶ 00:21 For -t and +l, 0.83 times 0.1 is 0.083. ▶ 00:23 And finally, for -t and -l, 0.83 times 0.9 is 0.747. ▶ 00:31 ### (00:30) 4d Even More Variable Elimination

Now we're down to a network with a single node containing T and L, ▶ 00:00 with this joint probability table, and the only operation we have left to do ▶ 00:06 is to sum out to give us a node with just L in it. ▶ 00:12 So, the question is to compute P(L) for both values of L, ▶ 00:17 +l and -l. ▶ 00:26 ### (00:20) 4e Answer

The answer is that for the +l values, ▶ 00:00 0.051 plus 0.083 equals 0.134. ▶ 00:03 And for the negative values, 0.119 plus 0.747 ▶ 00:11 equals 0.866. ▶ 00:15 ### (00:21) 4f Summary
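The join and summing-out steps of this walkthrough can be sketched in a few lines of Python, using the lecture's table values:

```python
# Variable elimination on the rain/traffic/late network:
# join factors by pointwise multiplication, then eliminate a variable
# by summing it out.
P_R = {True: 0.1, False: 0.9}
P_T_given_R = {(True, True): 0.8, (True, False): 0.2,
               (False, True): 0.1, (False, False): 0.9}   # (r, t) -> P(t|r)
P_L_given_T = {(True, True): 0.3, (True, False): 0.7,
               (False, True): 0.1, (False, False): 0.9}   # (t, l) -> P(l|t)

# Join P(R) and P(T|R) into a factor f(R, T) = P(R, T).
f_RT = {(r, t): P_R[r] * P_T_given_R[(r, t)]
        for r in (True, False) for t in (True, False)}

# Eliminate R: sum it out to get P(T).
f_T = {t: sum(f_RT[(r, t)] for r in (True, False)) for t in (True, False)}

# Join with P(L|T), then eliminate T to get P(L).
f_TL = {(t, l): f_T[t] * P_L_given_T[(t, l)]
        for t in (True, False) for l in (True, False)}
f_L = {l: sum(f_TL[(t, l)] for t in (True, False)) for l in (True, False)}

print(round(f_L[True], 3), round(f_L[False], 3))  # -> 0.134 0.866
```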

So, that's how variable elimination works. â–¶ 00:00 It's a continued process of joining together factors â–¶ 00:03 to form a larger factor and then eliminating variables by summing out. â–¶ 00:06 If we make a good choice of the order in which we apply these operations, â–¶ 00:11 then variable elimination can be much more efficient â–¶ 00:15 than just doing the whole enumeration. â–¶ 00:18 ### (02:08) 5 Approximate Inference Sampling

Now I want to talk about approximate inference ▶ 00:00 by means of sampling. ▶ 00:07 What do I mean by that? ▶ 00:12 Say we want to deal with a joint probability distribution, ▶ 00:14 say the distribution of heads and tails over these 2 coins. ▶ 00:17 We can build a table and then start counting by sampling. ▶ 00:24 Here we have our first sample. ▶ 00:30 We flip the coins and the one-cent piece came up heads, ▶ 00:32 and the five-cent piece came up tails, ▶ 00:35 so we would mark down one count. ▶ 00:39 Then we'd toss them again. ▶ 00:42 This time the five cents is heads, and the one cent is tails, ▶ 00:45 so we put down a count there, and we'd repeat that process ▶ 00:50 and keep repeating it until we got enough counts that we could estimate ▶ 01:00 the joint probability distribution by looking at the counts. ▶ 01:06 Now, if we do a small number of samples, the counts might not be very accurate. ▶ 01:11 There may be some random variation that causes them not to converge ▶ 01:15 to their true values, but as we add more samples, ▶ 01:23 the counts we get will come closer to the true distribution. ▶ 01:25 Thus, sampling has an advantage over exact inference in that we know a procedure ▶ 01:29 for coming up with at least an approximate value for the joint probability distribution, ▶ 01:35 as opposed to exact inference, where the computation may be very complex. ▶ 01:42 There's another advantage to sampling, which is if we don't know ▶ 01:50 what the conditional probability tables are, as we did in our other models, ▶ 01:53 if we don't know these numeric values, but we can simulate the process, ▶ 01:59 we can still proceed with sampling, whereas we couldn't with exact inference. ▶ 02:04 ### (02:15) 6 Sampling Example

Here's a new network that we'll use to investigate ▶ 00:00 how sampling can be used to do inference. ▶ 00:05 In this network, we have 4 variables. They're all boolean. ▶ 00:10 Cloudy tells us if it's cloudy or not outside, ▶ 00:14 and that can have an effect on whether the sprinklers are turned on, ▶ 00:17 and whether it's raining. ▶ 00:21 And those 2 variables in turn have an effect on whether the grass gets wet. ▶ 00:23 Now, to do inference over this network using sampling, ▶ 00:28 we start off with a variable where all the parents are defined. ▶ 00:34 In this case, there's only one such variable, Cloudy. ▶ 00:38 And its conditional probability table tells us that the probability is 50% for Cloudy, ▶ 00:42 50% for not Cloudy, and so we sample from that. ▶ 00:48 We generate a random number, and let's say it comes up positive for Cloudy. ▶ 00:52 Now that this variable is defined, we can choose another variable. ▶ 00:59 In this case, let's choose Sprinkler, and we look at the rows in the table ▶ 01:02 for which Cloudy, the parent, is positive, and we see we should sample ▶ 01:08 with probability 10% for +s and 90% for -s. ▶ 01:13 And so let's say we do that sampling with a random number generator, ▶ 01:19 and it comes up negative for Sprinkler. ▶ 01:23 Now let's jump over here. Look at the Rain variable. ▶ 01:26 Again, the parent, Cloudy, is positive, ▶ 01:29 so we're looking at this part of the table. ▶ 01:34 We get a 0.8 probability for Rain being positive, ▶ 01:38 and a 0.2 probability for Rain being negative. ▶ 01:41 Let's say we sample that randomly, and it comes up Rain is positive. ▶ 01:44 And now we're ready to sample the final variable, ▶ 01:51 and what I want you to do is tell me which of the rows ▶ 01:54 of this table we should be considering and tell me what's more likely. ▶ 02:01 Is it more likely that we have a +w or a -w? ▶ 02:07 ### (01:05) 6a Sampling Example

The answer to the question is that we look at the parents. ▶ 00:00 We find that the Sprinkler variable is negative, ▶ 00:03 so we're looking at this part of the table. ▶ 00:06 And the Rain variable is positive, so we're looking at this part. ▶ 00:09 So, it would be these 2 rows that we would consider, ▶ 00:14 and thus, we'd find there's a 0.9 probability for +w, the grass being wet, ▶ 00:18 and only 0.1 for it being negative, ▶ 00:25 so the positive is more likely. ▶ 00:28 And once we've done that, then we've generated a complete sample, ▶ 00:31 and we can write down the sample here. ▶ 00:34 We had +c, -s, +r. ▶ 00:37 And assuming the 0.9-probability outcome came out in favor of +w, ▶ 00:43 that would be the end of the sample. ▶ 00:51 Then we could throw all this information out and start over again ▶ 00:54 by having another 50/50 choice for Cloudy and then working our way through the network. ▶ 00:59 ### (01:51) 6b More Sampling
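The sampling procedure walked through above can be sketched in Python. The CPT entries spoken in the video are P(+c)=0.5, P(+s|+c)=0.1, P(+r|+c)=0.8, and P(+w|-s,+r)=0.9; the remaining entries below are the usual textbook values for this network and are an assumption of this sketch.

```python
import random

# CPTs for the cloudy/sprinkler/rain/wet-grass network. Entries not
# spoken in the video are assumed textbook values.
P_C = 0.5
P_S = {True: 0.10, False: 0.50}   # P(+s | C)
P_R = {True: 0.80, False: 0.20}   # P(+r | C)
P_W = {(True, True): 0.99, (True, False): 0.90,
       (False, True): 0.90, (False, False): 0.00}  # P(+w | S, R)

def sample_once(rng):
    """Sample every variable in topological order, parents first."""
    c = rng.random() < P_C
    s = rng.random() < P_S[c]
    r = rng.random() < P_R[c]
    w = rng.random() < P_W[(s, r)]
    return c, s, r, w

rng = random.Random(0)
n = 100_000
samples = [sample_once(rng) for _ in range(n)]
# With enough samples the counts approach the true probabilities
# (the method is consistent); here we estimate P(+w).
est = sum(w for _, _, _, w in samples) / n
print(round(est, 2))   # close to the exact value of about 0.647
```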

Now, the probability of sampling a particular variable, â–¶ 00:00 choosing a +w or a -w, depends on the values of the parents. â–¶ 00:04 But those are chosen according to the conditional probability tables, â–¶ 00:10 so in the limit, the count of each sampled variable â–¶ 00:14 will approach the true probability. â–¶ 00:18 That is, with an infinite number of samples, this procedure computes the true â–¶ 00:20 joint probability distribution. â–¶ 00:24 We say that the sampling method is consistent. â–¶ 00:27 We can use this kind of sampling to compute the complete joint probability distribution, â–¶ 00:33 or we can use it to compute a value for an individual variable. â–¶ 00:38 But what if we wanted to compute a conditional probability? â–¶ 00:43 Say we wanted to compute the probability of wet grass â–¶ 00:47 given that it's not cloudy. â–¶ 00:53 To do that, the sample that we generated here wouldn't be helpful at all â–¶ 00:58 because it has to do with being cloudy, not with being not cloudy. â–¶ 01:03 So, we would cross this sample off the list. â–¶ 01:08 We would say that we reject the sample, and this technique is called rejection sampling. â–¶ 01:11 We go through ignoring any samples that don't match â–¶ 01:17 the conditional probabilities that we're interested in â–¶ 01:21 and keeping samples that do, say the sample -c, +s, +r, -w. â–¶ 01:24 We would just continue going through generating samples, â–¶ 01:34 crossing off the ones that don't match, keeping the ones that do. â–¶ 01:37 And this procedure would also be consistent. â–¶ 01:41 We call this procedure rejection sampling. â–¶ 01:46 ### (01:59) 7 Rejection Sampling
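A sketch of rejection sampling for the query P(+w | -c) described above, under the same assumed table values (entries not spoken in the video are the usual textbook ones):

```python
import random

# CPTs for the cloudy/sprinkler/rain/wet-grass network. Entries not
# spoken in the video are assumed textbook values.
P_C = 0.5
P_S = {True: 0.10, False: 0.50}   # P(+s | C)
P_R = {True: 0.80, False: 0.20}   # P(+r | C)
P_W = {(True, True): 0.99, (True, False): 0.90,
       (False, True): 0.90, (False, False): 0.00}  # P(+w | S, R)

def rejection_sample_wet_given_not_cloudy(n, rng):
    """Estimate P(+w | -c): generate full samples, throw away the ones
    that don't match the evidence, and count over the rest."""
    kept = []
    for _ in range(n):
        c = rng.random() < P_C
        s = rng.random() < P_S[c]
        r = rng.random() < P_R[c]
        w = rng.random() < P_W[(s, r)]
        if c:          # evidence is -c, so reject every +c sample
            continue
        kept.append(w)
    return sum(kept) / len(kept)

rng = random.Random(1)
est = rejection_sample_wet_given_not_cloudy(100_000, rng)
print(round(est, 2))   # close to the exact value of 0.549
```

Note that about half the samples are rejected here; with rarer evidence, almost all of them would be, which is exactly the problem the next section describes.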

But there's a problem with rejection sampling. ▶ 00:00 If the evidence is unlikely, you end up rejecting a lot of the samples. ▶ 00:03 Let's go back to the alarm network where we had variables for burglary and for an alarm ▶ 00:08 and say we're interested in computing the probability of a burglary, ▶ 00:16 given that the alarm goes off. ▶ 00:22 The problem is that burglaries are very infrequent, ▶ 00:25 so most of the samples we would get would end up being-- ▶ 00:28 we start with generating a B, and we get a -b and then a -a. ▶ 00:32 We go back and say does this match? ▶ 00:39 No, we have to reject this sample, ▶ 00:43 so we generate another sample, and we get another -b, -a. ▶ 00:45 We reject that. We get another -b, -a. ▶ 00:50 And we keep rejecting, and eventually we get a +b, ▶ 00:54 but we'd end up spending a lot of time rejecting samples. ▶ 01:00 So, we're going to introduce a new method called likelihood weighting ▶ 01:04 that generates samples so that we can keep every one. ▶ 01:13 With likelihood weighting, we fix the evidence variables. ▶ 01:17 That is, we say that A will always be positive, ▶ 01:20 and then we sample the rest of the variables, ▶ 01:25 so then we get samples that we want. ▶ 01:28 We would get a list like -b, +a, ▶ 01:31 -b, +a, ▶ 01:37 +b, +a. ▶ 01:40 We get to keep every sample, but we have a problem. ▶ 01:42 The resulting set of samples is inconsistent. ▶ 01:46 We can fix that, however, by assigning a probability ▶ 01:52 to each sample and weighting them correctly. ▶ 01:56 ### (01:55) 8 Likelihood Weighting

In likelihood weighting, we're going to be collecting samples just like before, â–¶ 00:00 but we're going to add a probabilistic weight to each sample. â–¶ 00:05 Now, let's say we want to compute the probability of rain â–¶ 00:11 given that the sprinklers are on, and the grass is wet. â–¶ 00:17 We start as before. â–¶ 00:22 We make a choice for Cloudy, and let's say that, again, â–¶ 00:24 we choose Cloudy being positive. â–¶ 00:30 Now we want to make a choice for Sprinkler, â–¶ 00:33 but we're constrained to always choose Sprinkler being positive, â–¶ 00:37 so we'll make that choice. â–¶ 00:41 And we know we were dealing with Cloudy being positive, â–¶ 00:44 so we're in this row, and we're forced to make the choice of Sprinkler being positive, â–¶ 00:50 and that has a probability of only 0.1, so we'll put that 0.1 into the weight. â–¶ 00:56 Next, we'll look at the Rain variable, â–¶ 01:05 and here we're not constrained in any way, so we make a choice â–¶ 01:09 according to the probability tables with Cloudy being positive. â–¶ 01:13 And let's say that we choose the more popular choice, and Rain gets the positive value. â–¶ 01:19 Now, we look at Wet Grass. â–¶ 01:27 We're constrained to choose positive, and we know that the parents â–¶ 01:30 are also positive, so we're dealing with this row here. â–¶ 01:35 Since it's a constrained choice, we're going to add in or multiply in an additional weight, â–¶ 01:41 and I want you to tell me what that weight should be. â–¶ 01:47 ### (00:37) 8a Answer

The answer is we're looking for the probability â–¶ 00:00 of having a +w given a +s and a +r, â–¶ 00:04 so that's in this row, so it's 0.99. â–¶ 00:09 So, we take our old weight and multiply it by 0.99, â–¶ 00:16 gives us a final weight of 0.099 â–¶ 00:22 for a sample of +c, +s, +r and +w. â–¶ 00:28 ### (00:20) 8b Likelihood Weighting is Consistent

When we include the weights, â–¶ 00:00 counting this sample that was forced to have a +s and a +w â–¶ 00:03 with a weight of 0.099, instead of counting it as a full one sample, â–¶ 00:08 we find that likelihood weighting is also consistent. â–¶ 00:14 ### (00:56) 8c Likelihood Weighting Problems
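The weighting scheme above can be sketched in Python for the query P(+r | +s, +w): evidence variables are fixed rather than sampled, and each one multiplies its probability into the sample's weight. Table entries not given in the video are assumed textbook values.

```python
import random

# CPTs for the cloudy/sprinkler/rain/wet-grass network. Entries not
# spoken in the video are assumed textbook values.
P_C = 0.5
P_S = {True: 0.10, False: 0.50}   # P(+s | C)
P_R = {True: 0.80, False: 0.20}   # P(+r | C)
P_W = {(True, True): 0.99, (True, False): 0.90,
       (False, True): 0.90, (False, False): 0.00}  # P(+w | S, R)

def weighted_sample(rng):
    """One likelihood-weighted sample with evidence S=+s, W=+w fixed."""
    weight = 1.0
    c = rng.random() < P_C      # non-evidence: sampled normally
    weight *= P_S[c]            # S is evidence (+s): weigh, don't sample
    r = rng.random() < P_R[c]   # non-evidence: sampled normally
    weight *= P_W[(True, r)]    # W is evidence (+w); parents are +s and r
    return r, weight

rng = random.Random(2)
num = den = 0.0
for _ in range(100_000):
    r, weight = weighted_sample(rng)
    den += weight
    if r:
        num += weight
print(round(num / den, 2))   # close to the exact value of about 0.320
```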

Likelihood weighting is a great technique, ▶ 00:00 but it doesn't solve all our problems. ▶ 00:03 Suppose we wanted to compute the probability of C given +s and +r. ▶ 00:05 In other words, we're constraining Sprinkler and Rain to always be positive. ▶ 00:14 Since we use the evidence when we generate a node that has that evidence as parents, ▶ 00:21 the Wet Grass node will always get good values based on that evidence. ▶ 00:27 But the Cloudy node won't, and so it will be generating values at random ▶ 00:31 without looking at these values, and some of the time ▶ 00:39 it will be generating values that don't go well with the evidence. ▶ 00:44 Now, we won't have to reject them like we do in rejection sampling, ▶ 00:48 but they'll have a low probability associated with them. ▶ 00:51 ### (01:50) 9 Gibbs Sampling

A technique called Gibbs sampling, ▶ 00:00 named after the physicist Josiah Gibbs, ▶ 00:07 takes all the evidence into account and not just the upstream evidence. ▶ 00:10 It uses a method called Markov Chain Monte Carlo, or MCMC. ▶ 00:14 The idea is that we resample just one variable at a time ▶ 00:26 conditioned on all the others. ▶ 00:31 That is, we have a set of variables, ▶ 00:33 and we initialize them to random values, keeping the evidence values fixed. ▶ 00:37 Maybe we have values like this, ▶ 00:44 and that constitutes one sample, and now, at each iteration through the loop, ▶ 00:48 we select just one non-evidence variable and resample it ▶ 00:54 based on all the other variables. ▶ 01:01 And that will give us another sample, and repeat that again. ▶ 01:04 Choose another variable. ▶ 01:11 Resample that variable and repeat. ▶ 01:15 We end up walking around in this space of assignments of variables randomly. ▶ 01:21 Now, in rejection sampling and likelihood weighting, ▶ 01:27 each sample was independent of the other samples. ▶ 01:30 In MCMC, that's not true. ▶ 01:34 The samples are dependent on each other, and in fact, ▶ 01:37 adjacent samples are very similar. ▶ 01:40 They only vary or differ in one place. ▶ 01:42 However, the technique is still consistent. We won't show the proof for that. ▶ 01:46 ### (01:19) 10 Monty Hall Problem

Now, just one more thing. â–¶ 00:00 I can't help but describe what is probably the most famous probability problem of all. â–¶ 00:02 It's called the Monty Hall Problem after the game show host. â–¶ 00:07 And the idea is that you're on a game show, and there's 3 doors: â–¶ 00:11 door #1, door #2, and door #3. â–¶ 00:15 And behind each door is a prize, and you know that one of the doors â–¶ 00:20 contains an expensive sports car, which you would find desirable, â–¶ 00:26 and the other 2 doors contain a goat, which you would find less desirable. â–¶ 00:29 Now, say you're given a choice, and let's say you choose door #1. â–¶ 00:35 But according to the conventions of the game, the host, Monty Hall, â–¶ 00:42 will now open one of the doors, knowing that the door that he opens â–¶ 00:47 contains a goat, and he shows you door #3. â–¶ 00:52 And he now gives you the opportunity to stick with your choice â–¶ 00:57 or to switch to the other door. â–¶ 01:02 What I want you to tell me is, what is your probability of winning â–¶ 01:05 if you stick to door #1, and what is the probability of winning â–¶ 01:10 if you switched to door #2? â–¶ 01:15 ### (01:45) 10a Answer

The answer is that you have a 1/3 chance of winning if you stick with door #1 â–¶ 00:00 and a 2/3 chance if you switch to door #2. â–¶ 00:08 How do we explain that, and why isn't it 50/50? â–¶ 00:12 Well, it's true that there's 2 possibilities, â–¶ 00:16 but we've learned from probability that just because there are 2 options â–¶ 00:18 doesn't mean that both options are equally likely. â–¶ 00:22 It's easier to explain why the first door has a 1/3 probability â–¶ 00:26 because when you started, the car could be in any one of 3 places. â–¶ 00:30 You chose one of them. That probability was 1/3. â–¶ 00:34 And that probability hasn't been changed by the revealing of one of the other doors. â–¶ 00:37 Why is door #2 two-thirds? â–¶ 00:43 Well, one way to explain it is that the probability has to sum to 1, â–¶ 00:45 and if 1/3 is here, the 2/3 has to be here. â–¶ 00:49 But why doesn't the same argument that you use for 1 hold for 2? â–¶ 00:53 Why can't we say the probability of 2 holding the car â–¶ 00:58 was 1/3 before this door was revealed? â–¶ 01:03 Why has that changed 2 and has not changed 1? â–¶ 01:07 And the reason is because we've learned something about door #2. â–¶ 01:11 We've learned that it wasn't the door that was flipped over by the host, â–¶ 01:14 and so that additional information has updated the probability, â–¶ 01:18 whereas we haven't learned anything additional about door #1 â–¶ 01:22 because it was never an option that the host might switch door #1. â–¶ 01:26 And in fact, in this case, if we reveal the door, â–¶ 01:30 we find that's where the car actually is. â–¶ 01:37 So you see, learning probability may end up winning you something. â–¶ 01:40 ### (00:44) 10b Monty Hall Letter

Now, as a final epilogue, I have here a copy of a letter written by Monty Hall himself ▶ 00:00 in 1990 to Professor Lawrence Denenberg of Harvard ▶ 00:07 who, with Harry Lewis, wrote a statistics book ▶ 00:10 in which they used the Monty Hall Problem as an example, ▶ 00:14 and they wrote to Monty asking him for permission to use his name. ▶ 00:18 Monty kindly granted the permission, but in his letter, ▶ 00:23 he writes, "As I see it, it wouldn't make any difference after the player ▶ 00:26 has selected Door A, and having been shown Door C-- ▶ 00:31 why should he then attempt to switch to Door B?" ▶ 00:34 So, we see Monty Hall himself did not understand the Monty Hall Problem. ▶ 00:38
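The 1/3 versus 2/3 result is also easy to check by simulation. This is a minimal sketch of the game, under the convention stated above that the host always opens a goat door that isn't the player's pick:

```python
import random

# Simulate the Monty Hall game: sticking wins about 1/3 of the time,
# switching wins about 2/3.
def play(switch, rng):
    doors = [0, 1, 2]
    car = rng.choice(doors)
    pick = rng.choice(doors)
    # The host opens a goat door that is neither the pick nor the car.
    opened = rng.choice([d for d in doors if d != pick and d != car])
    if switch:
        pick = next(d for d in doors if d != pick and d != opened)
    return pick == car

rng = random.Random(3)
n = 100_000
p_stick = sum(play(False, rng) for _ in range(n)) / n
p_switch = sum(play(True, rng) for _ in range(n)) / n
print(p_stick, p_switch)   # near 1/3 and 2/3 respectively
```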

## (12) Homework 2

### (00:16) 1 Bayes Rule

[Thrun] Given the following Bayes network with P of A equal to 0.5, ▶ 00:00 P of B given A equal to 0.2, ▶ 00:06 and P of B given not A equal to 0.8, ▶ 00:08 calculate the following probability. ▶ 00:12 ### (00:42) 2 Simple Bayes Net

[Thrun] Consider a network of the following type: ▶ 00:00 a variable, A, that is binary connects to three variables, X1, X2, and X3, ▶ 00:03 that are also binary. ▶ 00:10 The probability of A is 0.5, and for each variable Xi we have the probability of Xi given A equal to 0.2 ▶ 00:12 and the probability of Xi given not A equal to 0.6. ▶ 00:24 I would like to know from you the probability of A ▶ 00:29 given that we observed X1, X2, and not X3. ▶ 00:31 Notice that these variables over here are conditionally independent given A. ▶ 00:37 ### (00:10) 3 Simple Bayes Net 2

[Thrun] Let us consider the same network again. â–¶ 00:00 I would like to know the probability of X3 given that I observed X1. â–¶ 00:03 ### (00:29) 4 Conditional Independence

[Thrun] In this next homework assignment I will be drawing you a Bayes network ▶ 00:00 and will ask you some conditional independence questions. ▶ 00:04 Is B conditionally independent of C? And say yes or no. ▶ 00:09 Is B conditionally independent of C given D? And say yes or no. ▶ 00:14 Is B conditionally independent of C given A? And say yes or no. ▶ 00:19 And is B conditionally independent of C given A and D? And say yes or no. ▶ 00:24 ### (00:28) 5 Conditional Independence 2

[Thrun] Consider the following network. â–¶ 00:00 I would like to know whether the following statements are true or false. â–¶ 00:02 C is conditionally independent of E given A. â–¶ 00:08 B is conditionally independent of D given C and E. â–¶ 00:12 A is conditionally independent of C given E. â–¶ 00:18 And A is conditionally independent of C given B. â–¶ 00:21 Please check yes or no for each of these questions. â–¶ 00:25 ### (00:17) 6 Parameter Count

[Thrun] In my final question I'll look at the exact same network as before, â–¶ 00:00 but I would like to know the minimum number of numerical parameters â–¶ 00:04 such as the values to define probabilities and conditional probabilities â–¶ 00:08 that are necessary to specify the joint distribution of all 5 variables. â–¶ 00:13 ### (00:36) 1 ANSWER

[Thrun] The answer is 0.2, â–¶ 00:00 and this follows directly from Bayes' rule. â–¶ 00:03 In this formula, we can read off the first 2 values straight from the table over here, â–¶ 00:07 and we expand the denominator by total probability. â–¶ 00:11 Observing that this is exactly the same expression as up here, â–¶ 00:15 we get 0.1 divided by 0.1 plus this expression over here can be copied from over here, â–¶ 00:19 and P of not A is directly obtained up here. â–¶ 00:27 Hence we get 0.5 over here, and as a result we get 0.2. â–¶ 00:30 ### (03:16) 2 ANSWER

[Thrun] For this question we will be exploring a little trick â–¶ 00:00 about non-normalized probability. â–¶ 00:03 We will observe that P of A given X1, X2 and not X3, â–¶ 00:05 the expression on the left, can be resolved by Bayes' rule into this expression over here. â–¶ 00:11 We will take X3 to the left and replace it by A, â–¶ 00:16 both conditioned on the variables X1 and X2. â–¶ 00:20 Then we have P of not X3 given A, X1, X2, times P of A given X1, X2, divided by P of not X3 given X1, X2. â–¶ 00:23 Next we employ 2 things. â–¶ 00:29 One is the denominator does not depend on A, â–¶ 00:31 so whether I put an A or not A has no bearing on any calculation here, â–¶ 00:34 which means I can defer its calculation until later, and it will turn out to be important. â–¶ 00:39 So I'm going to be proportional to just the stuff over here. â–¶ 00:44 And second, I exploit my conditional independence â–¶ 00:49 whereby I can omit X1 and X2 from the probability of not X3 conditioned on A. â–¶ 00:52 These variables are conditionally independent. â–¶ 00:58 This gives me the following recursion â–¶ 01:02 where I now removed the third variable from the estimation problem â–¶ 01:05 and just retained the first 2 relative to my initial expression. â–¶ 01:10 If I keep expanding this, I get the following solution. â–¶ 01:14 P of not X3 given A, P X2 given A, P X1 given A times P of A. â–¶ 01:19 You might take a minute to just verify this, â–¶ 01:27 but this is exploiting the conditional independence â–¶ 01:30 very much as in the first step I showed you over here. â–¶ 01:32 This step lacks the normalizer, â–¶ 01:35 so let me work on the normalizer by expressing the opposite probability, â–¶ 01:38 P of not A given the same events, X1, X2, and not X3, â–¶ 01:44 which resolves to P of not X3 given not A, â–¶ 01:50 P of X2 given not A, P of X1 given not A, â–¶ 01:54 and P of not A. â–¶ 02:00 I can now plug in the values from above. â–¶ 02:02 So the first term gives me 0.8 times 0.2 times 0.2 times 0.5.
â–¶ 02:04 In the second term I get 0.4 times 0.6 times 0.6 times 0.5, â–¶ 02:15 which resolves to 0.016 and 0.072. â–¶ 02:24 This is clearly not a probability because we left out the normalizer. â–¶ 02:31 But as we know, the normalizer does not depend on whether I put A or not A in here. â–¶ 02:36 As a result, it will be the same for both of these expressions, â–¶ 02:40 and I can obtain it by just adding these non-normalized probabilities â–¶ 02:44 and then subsequently divide these non-normalized probabilities accordingly. â–¶ 02:47 So let me just do this. â–¶ 02:52 We get for the desired probability over here 0.1818 â–¶ 02:55 and for the inverse probability over here 0.8182. â–¶ 03:01 Our desired answer therefore is 0.1818. â–¶ 03:08 This was not an easy question. â–¶ 03:14 ### (01:41) 3 ANSWER
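The non-normalized trick from the preceding answer can be replayed in a few lines of Python (variable names are mine; the probabilities are the ones given in the question):

```python
# Unnormalized posterior trick for P(A | X1, X2, not X3).
p_a = 0.5     # P(A)
p_x_a = 0.2   # P(Xi | A), the same for every Xi
p_x_na = 0.6  # P(Xi | not A)

# Products of the conditionally independent factors, normalizer deferred:
unnorm_a = (1 - p_x_a) * p_x_a * p_x_a * p_a            # P(~X3|A) P(X2|A) P(X1|A) P(A)
unnorm_na = (1 - p_x_na) * p_x_na * p_x_na * (1 - p_a)  # same terms with "not A"

# The deferred normalizer is just the sum of the two unnormalized terms:
posterior = unnorm_a / (unnorm_a + unnorm_na)
print(round(unnorm_a, 3), round(unnorm_na, 3))  # 0.016 0.072
print(round(posterior, 4))                      # 0.1818
```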

[Thrun] The answer is a little bit involved. â–¶ 00:00 We use total probability to re-express this by bringing in A. â–¶ 00:03 P of X3 given X1 is the sum of P of X3 given X1 and A â–¶ 00:08 times P of A given X1, plus the complement with not A, which is P of X3 given X1 and not A, â–¶ 00:15 times P of not A given X1. â–¶ 00:22 That is just total probability. â–¶ 00:24 Next we utilize conditional independence, by which we can simplify this expression â–¶ 00:26 to drop X1 in the conditional variables, â–¶ 00:30 and we transform this expression by Bayes' rule again. â–¶ 00:33 The same applies to the right side with not A replacing A. â–¶ 00:36 All of those expressions over here can be found â–¶ 00:41 either in the table up there or just by their complements, â–¶ 00:45 with the exception of P of X1. â–¶ 00:49 But P of X1 can again be just obtained by total probability, â–¶ 00:52 which resolves to 0.2 times 0.5 plus 0.6 times 0.5, â–¶ 00:58 which gives me 0.4. â–¶ 01:11 We are now in a position to calculate the last term over here, which goes as follows. â–¶ 01:13 This expression is 0.2 times 0.2 times 0.5 over 0.4 plus 0.6 times 0.6 times 0.5 over 0.4, â–¶ 01:19 which gives us as a final result 0.5. â–¶ 01:36 ### (00:46) 4 ANSWER
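The total-probability computation of P of X3 given X1 can be checked the same way (a sketch with my own variable names, using the table values from the question):

```python
# P(X3 | X1) via total probability over A, with Bayes' rule for P(A | X1).
p_a, p_x_a, p_x_na = 0.5, 0.2, 0.6

p_x1 = p_x_a * p_a + p_x_na * (1 - p_a)  # total probability: 0.4
p_a_x1 = p_x_a * p_a / p_x1              # Bayes' rule: 0.25
p_x3_x1 = p_x_a * p_a_x1 + p_x_na * (1 - p_a_x1)
print(round(p_x3_x1, 4))  # 0.5
```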

[Thrun] And the answer is as follows. â–¶ 00:00 No, no, yes, and no. â–¶ 00:02 B and C in the absence of any other information are dependent through A, â–¶ 00:06 which is if you learn something about B, you can infer something about A, â–¶ 00:11 and then we'll know more about C. â–¶ 00:17 If you know D, that doesn't change a thing. â–¶ 00:20 You can just take D out of the pool. â–¶ 00:22 If you know A, B and C become conditionally independent. â–¶ 00:24 This dependence goes away, and ignorance of D doesn't render B and C dependent. â–¶ 00:29 However, if we add D back to the mix, â–¶ 00:36 then knowledge of D will render B and C dependent by way of the explaining away effect. â–¶ 00:39 ### (00:54) 5 ANSWER

[Thrun] So the correct answer is tricky in this case. â–¶ 00:00 It is no, no, no, and yes. â–¶ 00:03 The first one is straightforward. â–¶ 00:07 C and E are conditionally independent based on D, â–¶ 00:09 and knowledge of A doesn't change anything. â–¶ 00:13 B and D are conditionally independent through A, â–¶ 00:15 and knowledge of C or E doesn't change that. â–¶ 00:20 A and C is an interesting case. â–¶ 00:23 A and C are independent. But if you know D, they become dependent. â–¶ 00:25 It turns out if you know E, you can know something about D, â–¶ 00:29 and as a result, A and C become dependent through the explaining away effect. â–¶ 00:32 That doesn't apply if you know B. â–¶ 00:37 Even though B tells you something about E, â–¶ 00:39 it tells you nothing about D because B and D are independent. â–¶ 00:42 Therefore, knowing B tells you nothing about D, â–¶ 00:46 and the explaining away effect does not occur between A and C. â–¶ 00:49 The answer here is yes. â–¶ 00:52 ### (00:37) 6 ANSWER

[Thrun] The correct answer is 16. â–¶ 00:00 The probabilities of A and C require 1 parameter each. â–¶ 00:03 The complements of not A and not C follow by 1 minus that parameter. â–¶ 00:06 This guy over here requires 2 parameters. â–¶ 00:12 You need to know the probability of B given A and of B given not A. â–¶ 00:15 The complements can be obtained easily. â–¶ 00:18 The probability of D is conditioned on 2 variables, which can take 4 possible value combinations. â–¶ 00:20 Hence the number is 4. â–¶ 00:24 And E is conditioned on 3 variables, whose values can take a total of 8 different combinations, â–¶ 00:26 2 to the 3rd, which is 8. â–¶ 00:30 If you add 8 plus 4 plus 2 plus 1 plus 1, you get 16. â–¶ 00:32
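The counting rule generalizes: a binary node with k binary parents needs 2 to the k free parameters. A minimal sketch, using only the parent counts stated in the answer (the drawing itself is not in the transcript, so the exact edges are left out):

```python
# Free parameters of a Bayes net over binary variables:
# a node with k parents needs one number per parent configuration, i.e. 2**k.
num_parents = {"A": 0, "C": 0, "B": 1, "D": 2, "E": 3}  # counts from the answer

total = sum(2 ** k for k in num_parents.values())
print(total)  # 16
```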

## (55) Unit 5

### (01:11) 1 Introduction

Welcome to the machine learning unit. â–¶ 00:00 Machine learning is a fascinating area. â–¶ 00:03 The world has become immeasurably data-rich. â–¶ 00:06 The world wide web has come up over the last decade. â–¶ 00:09 The human genome is being sequenced. â–¶ 00:12 Vast chemical databases, pharmaceutical databases, â–¶ 00:15 and financial databases are now available â–¶ 00:19 on a scale unthinkable even 5 years ago. â–¶ 00:22 To make sense out of the data, â–¶ 00:26 to extract information from the data, â–¶ 00:28 machine learning is the discipline to turn to. â–¶ 00:30 Machine learning is an important subfield of artificial intelligence; â–¶ 00:33 it's my personal favorite next to robotics â–¶ 00:37 because I believe it has a huge impact on society â–¶ 00:40 and is absolutely necessary as we move forward. â–¶ 00:43 So in this class, I teach you some of the very basics of â–¶ 00:47 machine learning, and in our next unit â–¶ 00:50 Peter will tell you some more about machine learning. â–¶ 00:52 We'll talk about supervised learning, which is one side of machine learning, â–¶ 00:56 and Peter will tell you about unsupervised learning, â–¶ 01:00 which is a different style. â–¶ 01:02 Later in this class we will also encounter reinforcement learning, â–¶ 01:05 which is yet another style of machine learning. â–¶ 01:07 Anyhow, let's just dive in. â–¶ 01:10 ### (01:53) 2 What is Machine Learning

Welcome to the first class on machine learning. â–¶ 00:00 So far we talked a lot about Bayes Networks. â–¶ 00:03 And the way we talked about them â–¶ 00:07 is all about reasoning within Bayes Networks â–¶ 00:10 that are known. â–¶ 00:14 Machine learning addresses the problem â–¶ 00:15 of how to find those networks â–¶ 00:17 or other models â–¶ 00:19 based on data. â–¶ 00:20 Learning models from data â–¶ 00:22 is a major, major area of artificial intelligence â–¶ 00:25 and it's perhaps the one â–¶ 00:29 that had the most commercial success. â–¶ 00:31 In many commercial applications â–¶ 00:33 the models themselves are fitted â–¶ 00:37 based on data. â–¶ 00:39 For example, Google â–¶ 00:40 uses data to understand â–¶ 00:42 how to respond to each search query. â–¶ 00:44 Amazon uses data â–¶ 00:46 to understand how to place products on their website. â–¶ 00:49 And these machine learning techniques â–¶ 00:52 are the enabling techniques that make that possible. â–¶ 00:53 So this class â–¶ 00:56 which is about supervised learning â–¶ 00:57 will go through some very basic methods â–¶ 00:59 for learning models from data â–¶ 01:02 in particular, specific types of Bayes Networks. â–¶ 01:04 We will complement this â–¶ 01:06 with a class on unsupervised learning â–¶ 01:08 that will be taught next â–¶ 01:10 after this class. â–¶ 01:14 Let me start off with a quiz. â–¶ 01:15 The quiz is: What companies are famous â–¶ 01:18 for machine learning using data? â–¶ 01:20 Google for mining the web. â–¶ 01:24 Netflix for mining what people â–¶ 01:29 would like to rent on DVDs. â–¶ 01:31 Which is DVD recommendations. â–¶ 01:36 Amazon.com for product placement. â–¶ 01:40 Check any or all â–¶ 01:45 and if none of those apply â–¶ 01:47 check down here. â–¶ 01:49 ### (00:47) 3 Answer

And, not surprisingly, the answer is â–¶ 00:00 all of those companies and many, many, many more â–¶ 00:03 use massive machine learning for making decisions â–¶ 00:06 that are really essential to the businesses. â–¶ 00:09 Google mines the web and uses machine learning for translation, â–¶ 00:12 as we've seen in the introductory level. Netflix has used â–¶ 00:15 machine learning extensively for understanding what type of DVD to recommend to you next. â–¶ 00:18 Amazon composes its entire product pages using â–¶ 00:22 machine learning by understanding how customers â–¶ 00:25 respond to different compositions and placements of their products, â–¶ 00:28 and many, many other examples exist. â–¶ 00:31 I would argue that in Silicon Valley, â–¶ 00:35 at least half the companies dealing with customers and online products â–¶ 00:37 do extensively use machine learning, â–¶ 00:41 so it makes machine learning a really exciting discipline. â–¶ 00:43 ### (01:33) 4 Stanley DARPA Grand Challenge

In my own research, I've extensively used machine learning for robotics. â–¶ 00:00 What you see here is a robot my students and I built at Stanford â–¶ 00:05 called Stanley, and it won the DARPA Grand Challenge. â–¶ 00:08 It's a self-driving car that drives without any human assistance whatsoever, â–¶ 00:12 and this vehicle extensively uses machine learning. â–¶ 00:16 The robot is equipped with a laser system. â–¶ 00:22 I will talk more about lasers in my robotics class, â–¶ 00:25 but here you can see how the robot is able to build â–¶ 00:28 3-D models of the terrain ahead. â–¶ 00:31 These are almost like video game models that allow it to make â–¶ 00:34 assessments where to drive and where not to drive. â–¶ 00:37 Essentially, it's trying to drive on flat ground. â–¶ 00:39 The problem with these lasers is that they don't see very far. â–¶ 00:43 They see about 25 meters out, so to drive really fast â–¶ 00:46 the robot has to see further. â–¶ 00:50 This is where machine learning comes into play. â–¶ 00:53 What you see here are camera images delivered by the robot â–¶ 00:56 superimposed with laser data that doesn't see very far, â–¶ 00:58 but the laser is good enough to extract samples â–¶ 01:01 of driveable road surface that can then be machine learned â–¶ 01:04 and extrapolated into the entire camera image. â–¶ 01:08 That enables the robot to use the camera â–¶ 01:10 to see driveable terrain all the way to the horizon â–¶ 01:13 up to like 200 meters out, enough to drive really, really fast. â–¶ 01:16 This ability to adapt its vision by deriving its own training examples using lasers â–¶ 01:22 but seeing out 200 meters or more â–¶ 01:27 was a key factor in winning the race. â–¶ 01:30 ### (03:46) 5 Taxonomy

Machine learning is a very large field â–¶ 00:00 with many different methods â–¶ 00:03 and many different applications. â–¶ 00:04 I will now define some of the very basic terminology â–¶ 00:06 that is being used to distinguish â–¶ 00:10 different machine learning methods. â–¶ 00:12 Let's start with the what. â–¶ 00:13 What is being learned? â–¶ 00:17 You can learn parameters â–¶ 00:19 like the probabilities of a Bayes Network. â–¶ 00:23 You can learn structure â–¶ 00:26 like the arc structure of a Bayes Network. â–¶ 00:27 And you might even discover hidden concepts. â–¶ 00:31 For example â–¶ 00:34 you might find that certain training examples â–¶ 00:35 form a hidden group. â–¶ 00:37 For example at Netflix â–¶ 00:39 you might find that there's different types of customers â–¶ 00:41 some that care about classic movies â–¶ 00:43 some that care about modern movies â–¶ 00:45 and those might form hidden concepts â–¶ 00:47 whose discovery can really help you â–¶ 00:49 make better sense of the data. â–¶ 00:51 Next is the what from. â–¶ 00:53 Every machine learning method â–¶ 00:57 is driven by some sort of target information â–¶ 01:00 that you care about. â–¶ 01:02 In supervised learning â–¶ 01:03 which is the subject of today's class â–¶ 01:06 we're given specific target labels â–¶ 01:08 and I give you examples just in a second. â–¶ 01:10 We also talk about unsupervised learning â–¶ 01:13 where target labels are missing â–¶ 01:15 and we use replacement principles â–¶ 01:19 to find, for example â–¶ 01:21 hidden concepts. â–¶ 01:22 Later there will be a class on reinforcement learning â–¶ 01:24 where an agent learns from feedback from the physical environment â–¶ 01:27 by interacting and trying actions â–¶ 01:32 and receiving some sort of evaluation â–¶ 01:34 from the environment â–¶ 01:37 like "Well done" or "That works." â–¶ 01:37 Again, we will talk about those in detail later. â–¶ 01:41 There's different things you could try to do â–¶ 01:43 with a machine learning technique.
â–¶ 01:46 You might care about prediction. â–¶ 01:48 For example you might care about what's going to happen in the future â–¶ 01:49 in the stock market for example. â–¶ 01:53 You might care to diagnose something â–¶ 01:55 which is you get data and you wish to explain it â–¶ 01:57 and you use machine learning for that. â–¶ 01:59 Sometimes your objective is to summarize something. â–¶ 02:01 For example if you read a long article â–¶ 02:04 your machine learning method might aim to â–¶ 02:07 produce a short article that summarizes the long article. â–¶ 02:09 And there's many, many, many more different things. â–¶ 02:12 You can talk about the how of learning. â–¶ 02:14 We use the word passive â–¶ 02:16 if your learning agent is just an observer â–¶ 02:19 and has no impact on the data itself. â–¶ 02:23 Otherwise, you call it active. â–¶ 02:24 Sometimes learning occurs online â–¶ 02:26 which means while the data is being generated â–¶ 02:30 and some of it is offline â–¶ 02:32 which means learning occurs â–¶ 02:35 after the data has been generated. â–¶ 02:37 There's different types of outputs â–¶ 02:39 of a machine learning algorithm. â–¶ 02:42 Today we'll talk about classification â–¶ 02:44 versus regression. â–¶ 02:47 In classification the output is binary â–¶ 02:50 or a fixed number of classes â–¶ 02:53 for example something is either a chair or not. â–¶ 02:55 Regression is continuous. â–¶ 02:57 The temperature might be 66.5 degrees â–¶ 02:59 in our prediction. â–¶ 03:01 And there's tons of internal details â–¶ 03:03 we will talk about. â–¶ 03:05 Just to name one. â–¶ 03:07 We will distinguish generative â–¶ 03:09 from discriminative. â–¶ 03:12 Generative seeks to model the data â–¶ 03:14 as generally as possible â–¶ 03:16 versus discriminative methods â–¶ 03:18 which seek to distinguish data â–¶ 03:20 and this might sound like a superficial distinction â–¶ 03:21 but it has enormous ramifications â–¶ 03:24 on the learning algorithm.
â–¶ 03:26 Now to tell you the truth â–¶ 03:27 it took me many years â–¶ 03:29 to fully learn all these words here â–¶ 03:30 and I don't expect you to pick them all up â–¶ 03:33 in one class â–¶ 03:36 but you should know that they exist. â–¶ 03:37 And as they come up â–¶ 03:39 I'll emphasize them â–¶ 03:41 so you can sort any learning method â–¶ 03:42 I tell you back into the specific taxonomy over here. â–¶ 03:44 ### (03:12) 6 Supervised Learning

The vast amount of work in the field â–¶ 00:00 falls into the area of supervised learning. â–¶ 00:02 In supervised learning â–¶ 00:06 you're given for each training example â–¶ 00:08 a feature vector â–¶ 00:10 and a target label named Y. â–¶ 00:13 For example, for a credit rating agency â–¶ 00:16 X1, X2, X3 might be features â–¶ 00:20 such as is the person employed? â–¶ 00:23 What is the salary of the person? â–¶ 00:25 Has the person previously defaulted on a credit card? â–¶ 00:27 And so on. â–¶ 00:30 And Y is a predictor â–¶ 00:32 whether the person is to default â–¶ 00:34 on the credit or not. â–¶ 00:36 Now machine learning â–¶ 00:38 is to be carried out on past data â–¶ 00:40 where the credit rating agency â–¶ 00:42 might have collected features just like these â–¶ 00:44 and actual occurrences of default or not. â–¶ 00:46 What it wishes to produce â–¶ 00:49 is a function that allows us â–¶ 00:51 to predict future customers. â–¶ 00:53 So the new person comes in â–¶ 00:55 with a different feature vector. â–¶ 00:56 Can we predict as well as possible â–¶ 00:58 the functional relationship â–¶ 01:00 between these features X1 to Xn all the way to Y? â–¶ 01:02 You can apply the exact same example â–¶ 01:05 in image recognition â–¶ 01:08 where X might be pixels of images â–¶ 01:09 or it might be features of things found in images â–¶ 01:11 and Y might be a label that says â–¶ 01:14 whether a certain object is contained â–¶ 01:16 in an image or not. â–¶ 01:17 Now in supervised learning â–¶ 01:19 you're given many such examples. â–¶ 01:20 X21 to X2n â–¶ 01:25 leads to Y2 â–¶ 01:28 all the way to index m. â–¶ 01:32 This is called your data. â–¶ 01:35 We call each input vector Xm â–¶ 01:38 and we wish to find out the function which â–¶ 01:43 given any Xm or any future vector X â–¶ 01:44 produces as close as possible â–¶ 01:50 my target signal Y.
â–¶ 01:53 Now this isn't always possible â–¶ 01:55 and sometimes it's acceptable â–¶ 01:57 in fact preferable â–¶ 01:59 to tolerate a certain amount of error â–¶ 02:00 in your training data. â–¶ 02:03 But the subject of machine learning â–¶ 02:05 is to identify this function over here. â–¶ 02:07 And once you identify it â–¶ 02:10 you can use it for future Xs â–¶ 02:11 that weren't part of the training set â–¶ 02:13 to produce a prediction â–¶ 02:16 that hopefully is really, really good. â–¶ 02:19 So let me ask you a question. â–¶ 02:21 And this is a question â–¶ 02:24 for which I haven't given you the answer â–¶ 02:27 but I'd like to appeal to your intuition. â–¶ 02:28 Here's one data set â–¶ 02:31 where X is plotted horizontally in one dimension â–¶ 02:34 and Y is plotted vertically â–¶ 02:37 and suppose the data looks like this. â–¶ 02:39 Suppose my machine learning algorithm â–¶ 02:44 gives me 2 hypotheses. â–¶ 02:45 One is this function over here â–¶ 02:47 which is a linear function â–¶ 02:51 and one is this function over here. â–¶ 02:52 I'd like to know which of the functions â–¶ 02:53 you find preferable â–¶ 02:57 as an explanation for the data. â–¶ 02:59 Is it function A? â–¶ 03:01 Or function B? â–¶ 03:02 Check here for A â–¶ 03:06 here for B â–¶ 03:08 and here for neither. â–¶ 03:09 ### (02:43) 7 Occam's Razor

And I hope you guessed function A. â–¶ 00:00 Even though both perfectly describe the data â–¶ 00:04 B is much more complex than A. â–¶ 00:08 In fact, outside the data â–¶ 00:10 B seems to go to minus infinity much faster â–¶ 00:12 than these data points â–¶ 00:16 and to plus infinity much faster â–¶ 00:17 with these data points over here. â–¶ 00:19 And in between â–¶ 00:21 we have wide oscillations â–¶ 00:22 that don't correspond to any data. â–¶ 00:23 So I would argue â–¶ 00:25 A is preferable. â–¶ 00:27 The reason why I asked this question â–¶ 00:31 is because of something called Occam's Razor. â–¶ 00:32 Occam can be spelled in many different ways. â–¶ 00:35 And what Occam says is that â–¶ 00:38 everything else being equal â–¶ 00:41 choose the less complex hypothesis. â–¶ 00:43 Now in practice â–¶ 00:46 there's actually a trade-off â–¶ 00:48 between a really good data fit â–¶ 00:50 and low complexity. â–¶ 00:53 Let me illustrate this to you â–¶ 00:55 by a hypothetical example. â–¶ 00:58 Consider the following graph â–¶ 00:59 where the horizontal axis graphs â–¶ 01:02 complexity of the solution. â–¶ 01:04 For example, if you use polynomials â–¶ 01:07 this might be a high-degree polynomial over here â–¶ 01:10 and maybe a linear function over here â–¶ 01:12 which is a low-degree polynomial â–¶ 01:14 your training data error â–¶ 01:16 tends to go like this. â–¶ 01:19 The more complex the hypothesis you allow â–¶ 01:22 the more you can just fit your data. â–¶ 01:25 However, in reality â–¶ 01:29 your generalization error on unknown data â–¶ 01:31 tends to go like this. â–¶ 01:33 It is the sum of the training data error â–¶ 01:37 and another function â–¶ 01:40 which is called the overfitting error. â–¶ 01:42 Not surprisingly â–¶ 01:46 the best complexity is obtained â–¶ 01:47 where the generalization error is minimum. â–¶ 01:49 There are methods â–¶ 01:52 to calculate the overfitting error.
â–¶ 01:53 They go into a statistical field â–¶ 01:55 under the name bias-variance methods. â–¶ 01:57 However, in practice â–¶ 02:01 you're often just given the training data error. â–¶ 02:02 You'll find that if you don't pick the model â–¶ 02:04 that minimizes the training data error â–¶ 02:08 but instead push back the complexity â–¶ 02:11 your algorithm tends to perform better â–¶ 02:14 and that is something we will study a little bit â–¶ 02:17 in this class. â–¶ 02:20 However, this slide is really important â–¶ 02:22 for anybody doing machine learning in practice. â–¶ 02:26 If you deal with data â–¶ 02:29 and you have ways to fit your data â–¶ 02:31 be aware that overfitting â–¶ 02:33 is a major source of poor performance â–¶ 02:36 of a machine learning algorithm. â–¶ 02:39 And I give you examples in just one second. â–¶ 02:41 ### (04:15) 8 SPAM Detection
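A small invented illustration of the fit-versus-complexity trade-off (not from the lecture): five nearly linear points, fit once with a least-squares line and once with the degree-4 polynomial that interpolates them exactly. The polynomial has zero training error but behaves wildly off the training points.

```python
# Overfitting in miniature: exact interpolation versus a simple line.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [0.0, 1.0, 2.0, 3.0, 5.0]  # last point is off the y = x trend

def lagrange(x):
    """Evaluate the unique degree-4 polynomial through all five points."""
    total = 0.0
    for i, (xi, yi) in enumerate(zip(xs, ys)):
        term = yi
        for j, xj in enumerate(xs):
            if j != i:
                term *= (x - xj) / (xi - xj)
        total += term
    return total

# Least-squares line y = a*x + b, standard closed form.
n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n
a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
b = my - a * mx

train_err_poly = max(abs(lagrange(x) - y) for x, y in zip(xs, ys))
train_err_line = max(abs(a * x + b - y) for x, y in zip(xs, ys))
print(train_err_poly < 1e-6, train_err_line > 0.1)     # True True
print(round(lagrange(6.0), 2), round(a * 6.0 + b, 2))  # 21.0 7.0
```

Just beyond the data, at x = 6, the interpolating polynomial has already shot up to 21 while the line predicts 7: the hypothesis with the perfect training fit generalizes worst.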

So a really important example â–¶ 00:00 of machine learning is SPAM detection. â–¶ 00:02 We all get way too much email â–¶ 00:04 and a good number of those are SPAM. â–¶ 00:06 Here are 3 examples of email. â–¶ 00:08 Dear Sir: First I must solicit your confidence â–¶ 00:12 in this transaction, this is by virtue of its nature â–¶ 00:14 being utterly confidential and top secret... â–¶ 00:16 This is likely SPAM. â–¶ 00:19 Here's another one. â–¶ 00:22 In upper caps. â–¶ 00:23 99 MILLION EMAIL ADDRESSES FOR ONLY $99 â–¶ 00:25 This is very likely SPAM. â–¶ 00:28 And here's another one. â–¶ 00:31 Oh, I know it's blatantly OT â–¶ 00:33 but I'm beginning to go insane. â–¶ 00:35 Had an old Dell Dimension XPS sitting in the corner â–¶ 00:37 and decided to put it to use. â–¶ 00:40 And so on and so on. â–¶ 00:41 Now this is likely not SPAM. â–¶ 00:42 How can a computer program â–¶ 00:45 distinguish between SPAM and not SPAM? â–¶ 00:47 Let's use this as an example â–¶ 00:49 to talk about machine learning for discrimination â–¶ 00:51 using Bayes Networks. â–¶ 00:55 In SPAM detection â–¶ 00:59 we get an email â–¶ 01:01 and we wish to categorize it â–¶ 01:03 either as SPAM â–¶ 01:05 in which case we don't even show it to the reader â–¶ 01:07 or what we call HAM â–¶ 01:10 which is the technical word for â–¶ 01:12 an email worth passing on to the person being emailed. â–¶ 01:15 So the function over here â–¶ 01:19 is the function we're trying to learn. â–¶ 01:21 Most SPAM filters use human input. â–¶ 01:23 When you go through email â–¶ 01:26 you have a button called IS SPAM â–¶ 01:28 which allows you as a user to flag SPAM â–¶ 01:32 and occasionally you will say an email is SPAM. â–¶ 01:34 If you look at this â–¶ 01:37 you have a typical supervised machine learning situation â–¶ 01:40 where the input is an email â–¶ 01:43 and the output is whether you flag it as SPAM â–¶ 01:45 or if we don't flag it â–¶ 01:47 we just think it's HAM.
â–¶ 01:49 Now to make this amenable to â–¶ 01:52 a machine learning algorithm â–¶ 01:54 we have to talk about how to represent emails. â–¶ 01:55 They're all using different words and different characters â–¶ 01:57 and they might have different graphics included. â–¶ 02:00 Let's pick a representation that's easy to process. â–¶ 02:02 And this representation is often called â–¶ 02:06 Bag of Words. â–¶ 02:09 Bag of Words is a representation â–¶ 02:10 of a document â–¶ 02:14 that just counts the frequency â–¶ 02:15 of words. â–¶ 02:17 If an email were to say Hello â–¶ 02:18 I will say Hello. â–¶ 02:22 The Bag of Words representation â–¶ 02:24 is the following. â–¶ 02:26 2-1-1-1 â–¶ 02:27 for the dictionary â–¶ 02:31 that contains the 4 words â–¶ 02:33 Hello, I, will, say. â–¶ 02:36 Now look at the subtlety here. â–¶ 02:38 Rather than representing each individual word â–¶ 02:41 we have a count of each word â–¶ 02:43 and the count is oblivious â–¶ 02:46 to the order in which the words were stated. â–¶ 02:49 A Bag of Words representation â–¶ 02:52 relative to a fixed dictionary â–¶ 02:55 represents the counts of each word â–¶ 02:57 relative to the words in the dictionary. â–¶ 03:01 If you were to use a different dictionary â–¶ 03:03 like hello and good-bye â–¶ 03:06 our counts would be â–¶ 03:08 2 and 0. â–¶ 03:10 However, in most cases â–¶ 03:13 you make sure that all the words found â–¶ 03:14 in messages â–¶ 03:17 are actually included in the dictionary. â–¶ 03:18 So the dictionary might be very, very large. â–¶ 03:19 Let me make up an artificial example â–¶ 03:22 of a few SPAM and a few HAM messages. â–¶ 03:25 Offer is secret. â–¶ 03:30 Click secret link. â–¶ 03:32 Secret sports link. â–¶ 03:35 Obviously those are contrived â–¶ 03:37 and I tried to restrict the vocabulary â–¶ 03:40 to a small number of words â–¶ 03:42 to make this example workable. â–¶ 03:44 In practice we need thousands â–¶ 03:46 of such messages â–¶ 03:47 to get good information.
â–¶ 03:48 Play sports today. â–¶ 03:50 Went play sports. â–¶ 03:52 Secret sports event. â–¶ 03:54 Sports is today. â–¶ 03:56 Sports costs money. â–¶ 03:59 My first quiz is â–¶ 04:02 What is the size of the vocabulary â–¶ 04:06 that contains all words in these messages? â–¶ 04:08 Please enter the value in this box over here. â–¶ 04:12 ### (00:28) 9 Answer

Well let's count. â–¶ 00:00 Offer is secret click. â–¶ 00:02 Secret occurs over here already â–¶ 00:08 so we don't have to count it twice. â–¶ 00:10 Link, sports, play, today, went, event â–¶ 00:12 costs money. â–¶ 00:18 So the answer is â–¶ 00:20 12. â–¶ 00:22 There's 12 different words â–¶ 00:24 contained in these 8 messages. â–¶ 00:26 ### (00:16) 10 Question
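The count of 12 can be reproduced mechanically; a small sketch in Python (writing the last two ham messages with the word "sports", which is how the answer counts them):

```python
# Vocabulary size across all eight messages (distinct word types).
spam = ["offer is secret", "click secret link", "secret sports link"]
ham = ["play sports today", "went play sports", "secret sports event",
       "sports is today", "sports costs money"]

vocabulary = set()
for message in spam + ham:
    vocabulary.update(message.lower().split())
print(len(vocabulary))  # 12
```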

[Narrator] Another quiz. â–¶ 00:00 What is the probability that a random message â–¶ 00:03 that arrives will fall into the spam bucket? â–¶ 00:06 Assuming that those messages â–¶ 00:09 are all drawn at random. â–¶ 00:11 [writing on page] â–¶ 00:13 ### (00:16) 11 Answer

[Narrator] And the answer is: â–¶ 00:00 there's 8 different messages â–¶ 00:02 of which 3 are spam. â–¶ 00:04 So the maximum likelihood estimate â–¶ 00:06 is 3/8. â–¶ 00:09 [writing on paper] â–¶ 00:11 ### (04:31) 12 Maximum Likelihood

So, let's look at this a little bit more formally and talk about maximum likelihood. â–¶ 00:00 Obviously, we're observing 8 messages: spam, spam, spam, and 5 times ham. â–¶ 00:03 And what we care about is what's our prior probability of spam â–¶ 00:12 that maximizes the likelihood of this data? â–¶ 00:17 So, let's assume we're going to assign a value of pi to this, â–¶ 00:20 and we wish to find the pi that maximizes the likelihood of this data over here, â–¶ 00:24 assuming that each email is drawn independently â–¶ 00:29 according to an identical distribution. â–¶ 00:33 The probability p(yi) of each data item is then pi if yi = spam, â–¶ 00:37 and 1 - pi if yi = ham. â–¶ 00:48 If we rewrite the data as 1, 1, 1, 0, 0, 0, 0, 0, â–¶ 00:53 we can write p(yi) as follows: pi to the yi times (1 - pi) to the 1 - yi. â–¶ 00:59 It's not that hard to see that this is equivalent: â–¶ 01:13 say yi = 1. â–¶ 01:16 Then this term will fall out. â–¶ 01:19 It just multiplies in as 1 because the exponent is zero, and we get pi, as over here. â–¶ 01:22 If yi = 0, then this term falls out, and this one here becomes 1 - pi, as over here. â–¶ 01:28 Now assuming independence, we get for the entire data set â–¶ 01:36 that the joint probability of all data items is the product â–¶ 01:44 of the individual data items over here, â–¶ 01:49 which can now be written as follows: â–¶ 01:52 pi to the count of instances where yi = 1 times â–¶ 01:56 1 - pi to the count of the instances where yi = 0. â–¶ 02:03 And we know in our example, this count over here is 3, â–¶ 02:09 and this count over here is 5, so we get pi to the 3rd times 1 - pi to the 5th. â–¶ 02:13 We now wish to find the pi that maximizes this expression over here. â–¶ 02:22 We can also maximize the logarithm of this expression, â–¶ 02:28 which is 3 times log pi + 5 times log (1 - pi). â–¶ 02:33 Optimizing the log is the same as optimizing the expression itself because the log is monotonic.
â–¶ 02:42 The maximum of this function is attained with a derivative of 0, â–¶ 02:50 so let's compute the derivative and set it to 0. â–¶ 02:54 This is the derivative, 3 over pi - 5 over 1 - pi. â–¶ 03:00 We now bring this expression to the right side, â–¶ 03:05 multiply the denominators up, and sort all the expressions containing pi to the left, â–¶ 03:09 which gives us pi = 3/8, exactly the number we were at before. â–¶ 03:18 We just derived mathematically that the data likelihood maximizing number â–¶ 03:26 for the probability is indeed the empirical count, â–¶ 03:33 which means when we looked at this quiz before â–¶ 03:37 and we said a maximum likelihood for the prior probability of spam is 3/8, â–¶ 03:41 by simply counting 3 out of 8 emails were spam, â–¶ 03:49 we actually followed proper mathematical principles â–¶ 03:54 to do maximum likelihood estimation. â–¶ 03:57 Now, you might not fully have gotten the derivation of this, â–¶ 03:59 and I recommend you watch it again, but it's not that important â–¶ 04:03 for the progress in this class. â–¶ 04:07 So, here's another quiz. â–¶ 04:09 I'd like the maximum likelihood, or ML, solutions â–¶ 04:11 for the following probabilities. â–¶ 04:17 The probability that the word "secret" comes up, â–¶ 04:19 assuming that we already know a message is spam, â–¶ 04:21 and the probability that the same word "secret" comes up â–¶ 04:25 if we happen to know the message is not spam, it's ham. â–¶ 04:28 ### (00:25) 13 Answer
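The derivation of pi = 3/8 can be sanity-checked numerically: scan the likelihood pi cubed times (1 - pi) to the 5th over a fine grid and confirm where the peak sits (a brute-force sketch, not part of the lecture):

```python
# Maximum likelihood check: L(pi) = pi^3 * (1 - pi)^5 peaks at pi = 3/8.
def likelihood(pi):
    return pi ** 3 * (1 - pi) ** 5

candidates = [i / 1000 for i in range(1001)]
best_pi = max(candidates, key=likelihood)
print(best_pi)  # 0.375
```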

And just as before â–¶ 00:00 we count the word secret â–¶ 00:02 in SPAM and in HAM â–¶ 00:04 as I've underlined here. â–¶ 00:06 Three out of 9 words in SPAM â–¶ 00:07 are the word secret â–¶ 00:11 so we have a third over here â–¶ 00:12 or 0.333 â–¶ 00:14 and only 1 out of all the 15 words in HAM â–¶ 00:18 are secret â–¶ 00:21 so you get a fifteenth â–¶ 00:22 or 0.0667. â–¶ 00:23 ### (01:19) 14 Relationship to Bayes Networks
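Both counts from the preceding answer can be reproduced directly from the eight messages (again writing the last two ham messages with "sports", matching the totals of 9 spam words and 15 ham words):

```python
# Maximum likelihood word probabilities by plain counting.
spam = ["offer is secret", "click secret link", "secret sports link"]
ham = ["play sports today", "went play sports", "secret sports event",
       "sports is today", "sports costs money"]

spam_words = " ".join(spam).lower().split()  # 9 word tokens
ham_words = " ".join(ham).lower().split()    # 15 word tokens

p_secret_spam = spam_words.count("secret") / len(spam_words)
p_secret_ham = ham_words.count("secret") / len(ham_words)
print(round(p_secret_spam, 4), round(p_secret_ham, 4))  # 0.3333 0.0667
```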

By now, you might have recognized that what we're really building up is a Bayes network, â–¶ 00:00 where the parameters of the Bayes network are estimated using supervised learning â–¶ 00:06 by a maximum likelihood estimator based on training data. â–¶ 00:10 The Bayes network has at its root an unobservable variable called spam, â–¶ 00:15 which is binary, and it has as many children as there are words in a message, â–¶ 00:20 where each word has an identical conditional distribution â–¶ 00:28 of the word occurrence given the class, spam or not spam. â–¶ 00:33 If you write out our dictionary over here, â–¶ 00:39 you might remember the dictionary had 12 different words, â–¶ 00:42 so here are 5 of the 12: offer, is, secret, click, and sports. â–¶ 00:48 Then for the spam class, we found the probability of "secret" given spam is 1/3, â–¶ 00:52 and we also found that the probability of "secret" given ham is 1/15, â–¶ 00:59 so here's a quiz. â–¶ 01:05 Assuming a vocabulary size of 12, or put differently, â–¶ 01:07 the dictionary has 12 words, how many parameters â–¶ 01:12 do we need to specify this Bayes network? â–¶ 01:16 ### (00:29) 15 Answer

And the correct answer is 23. â–¶ 00:00 We need 1 parameter for the prior p(spam), â–¶ 00:03 and then we have 2 distributions over dictionary words: any word â–¶ 00:07 i given spam, and the same for ham. â–¶ 00:12 Now, there are 12 words in the dictionary, â–¶ 00:16 but each distribution only needs 11 parameters, â–¶ 00:18 since the 12th can be figured out because they have to add up to 1. â–¶ 00:20 And the same is true over here, so if you add all these together, â–¶ 00:24 we get 23. â–¶ 00:27 ### (00:26) 16 Question
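The parameter-counting argument above generalizes: one distribution over the classes plus one word distribution per class, each losing one free parameter because probabilities sum to 1. A minimal sketch:

```python
def naive_bayes_param_count(vocab_size, n_classes=2):
    # One distribution over classes (n_classes - 1 free parameters, here 1),
    # plus, per class, a distribution over the vocabulary words
    # (vocab_size - 1 free parameters each, since they must sum to 1).
    return (n_classes - 1) + n_classes * (vocab_size - 1)

print(naive_bayes_param_count(12))  # 23
```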

So, here's a quiz. â–¶ 00:00 Let's assume we fit all the 23 parameters of the Bayes network â–¶ 00:02 as explained, using maximum likelihood. â–¶ 00:06 Let's now do classification and see what class a message ends up in. â–¶ 00:09 Let me start with a very simple message, and it contains a single word, â–¶ 00:14 just to make it a little bit simpler. â–¶ 00:18 What's the probability that we classify this one-word message as spam? â–¶ 00:21 ### (01:02) 17 Answer

And the answer is 0.1667, or 3/18. â–¶ 00:00 How do I get there? Well, let's apply Bayes rule. â–¶ 00:07 This form is easily transformed into this expression over here: â–¶ 00:13 the probability of the message given spam times the prior probability of spam â–¶ 00:19 over the normalizer over here. â–¶ 00:25 Now, we know that the word "sports" occurs once in our 9 words of spam, â–¶ 00:29 and our prior probability for spam is 3/8, â–¶ 00:34 which gives us this expression over here. â–¶ 00:38 We now have to add the same probabilities for the class ham. â–¶ 00:40 "Sports" occurs 5 times out of 15 in the ham class, â–¶ 00:45 and the prior probability for ham is 5/8, â–¶ 00:51 which gives us 3/72 divided by 18/72, which is 3/18, or 1/6. â–¶ 00:55 ### (00:21) 18 Question
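The single-word Bayes rule computation just walked through can be sketched directly from the counts quoted in the lecture:

```python
from fractions import Fraction

# Counts quoted in the lecture: "sports" is 1 of the 9 spam words
# and 5 of the 15 ham words; the prior p(spam) is 3/8.
p_spam = Fraction(3, 8)
p_word_spam = Fraction(1, 9)   # p("sports" | spam)
p_word_ham = Fraction(5, 15)   # p("sports" | ham)

# Bayes rule: p(spam | word) = p(word | spam) p(spam) / normalizer.
num = p_word_spam * p_spam
den = num + p_word_ham * (1 - p_spam)
print(num / den)  # 1/6
```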

This gets us to a more complicated quiz. â–¶ 00:00 Say the message now contains 3 words: â–¶ 00:03 "Secret is secret," not a particularly meaningful email, â–¶ 00:06 but the frequent occurrence of "secret" seems to suggest it might be spam. â–¶ 00:10 What's the probability you're going to judge this to be spam? â–¶ 00:16 ### (01:03) 19 Answer

And the answer is surprisingly high. It's 25/26, or 0.9615. â–¶ 00:00 To see this, we apply Bayes rule, which multiplies the prior for spam-ness â–¶ 00:10 with the conditional probability of each word given spam. â–¶ 00:16 "Secret" carries 1/3, "is" 1/9, and "secret" 1/3 again. â–¶ 00:19 We normalize this by the same expression plus the probability for â–¶ 00:26 the non-spam case. â–¶ 00:32 5/8 is the prior. â–¶ 00:36 "Secret" is 1/15. â–¶ 00:38 "Is" is 1/15, â–¶ 00:42 and "secret" again. â–¶ 00:45 This resolves to 1/216 over this expression plus 1/5400, â–¶ 00:48 and when you work it all out, it is 25/26. â–¶ 00:57 ### (00:21) 20 Question
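The multi-word Naive Bayes computation above generalizes to any message: multiply each class prior into that class's conditional for every word, then normalize. A sketch using the conditionals quoted in the lecture:

```python
from fractions import Fraction

p_spam = Fraction(3, 8)
# ML conditionals quoted in the lecture, per word and class.
cond = {"spam": {"secret": Fraction(1, 3), "is": Fraction(1, 9)},
        "ham":  {"secret": Fraction(1, 15), "is": Fraction(1, 15)}}

def p_spam_given(words):
    """Naive Bayes: multiply each word's conditional into the class prior."""
    s = p_spam       # running spam numerator
    h = 1 - p_spam   # running ham numerator
    for w in words:
        s *= cond["spam"][w]
        h *= cond["ham"][w]
    return s / (s + h)

print(p_spam_given("secret is secret".split()))  # 25/26
```

The spam numerator works out to 1/216 and the ham numerator to 1/5400, matching the fractions in the answer.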

The final quiz: let's assume our message is "Today is secret." â–¶ 00:00 And again, it might look like spam because the word "secret" occurs. â–¶ 00:08 I'd like you to compute for me the probability of spam given this message. â–¶ 00:12 ### (03:19) 21 Answer and Laplace Smoothing

And surprisingly, the probability for this message to be spam is 0. â–¶ 00:00 It's not 0.001. It's flat 0. â–¶ 00:07 In other words, it's impossible, according to our model, â–¶ 00:11 that this text could be a spam message. â–¶ 00:14 Why is this? â–¶ 00:17 When we apply the same rule as before, we get the prior for spam, which is 3/8. â–¶ 00:19 And we multiply the conditional for each word into this. â–¶ 00:24 For "secret," we know it to be 1/3. â–¶ 00:28 For "is," it's 1/9, but for "today," it's 0. â–¶ 00:31 It's 0 because the maximum likelihood estimate for the probability of "today" in spam is 0. â–¶ 00:39 "Today" just never occurred in a spam message so far. â–¶ 00:45 Now, this 0 is troublesome because as we compute the outcome-- â–¶ 00:49 and I'm plugging in all the numbers as before-- â–¶ 00:55 none of the words matter anymore; just the 0 matters. â–¶ 01:00 So, we get 0 over something, which is plain 0. â–¶ 01:03 Are we overfitting? You bet. â–¶ 01:10 We are clearly overfitting. â–¶ 01:13 It can't be that a single word determines the entire outcome of our analysis. â–¶ 01:15 The reason is that for our model to assign a probability of 0 for the word "today" â–¶ 01:21 in the class of spam is just too aggressive. â–¶ 01:26 Let's change this. â–¶ 01:29 One technique to deal with the overfitting problem is called Laplace smoothing. â–¶ 01:34 In maximum likelihood estimation, we assign as our probability â–¶ 01:39 the quotient of the count of this specific event over all events in our data set. â–¶ 01:45 For example, for the prior probability, we found that 3/8 messages are spam. â–¶ 01:51 Therefore, our maximum likelihood estimate â–¶ 01:57 for the prior probability of spam was 3/8. â–¶ 02:00 In Laplace smoothing, we use a different estimate. â–¶ 02:05 We add the value k to the count â–¶ 02:10 and normalize as if we added k to every single class â–¶ 02:15 that we've tried to estimate something over. 
â–¶ 02:20 This is equivalent to assuming we have a couple of fake training examples, â–¶ 02:23 where we add k to each observation count. â–¶ 02:28 Now, if k equals 0, we get our maximum likelihood estimator. â–¶ 02:32 But if k is larger than 0 and n is finite, we get different answers. â–¶ 02:36 Let's say k equals 1, â–¶ 02:41 and let's assume we get one message, â–¶ 02:47 and that message was spam, so we're going to write it as: one message, one spam. â–¶ 02:51 What is p(spam) under Laplace smoothing with k = 1? â–¶ 02:56 Let's do the same with 10 messages, of which 6 are spam. â–¶ 03:03 And 100 messages, of which 60 are spam. â–¶ 03:09 Please enter your numbers into the boxes over here. â–¶ 03:16 ### (01:14) 22 Answer
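The Laplace estimate just defined, (count + k) / (total + k · number of classes), can be sketched as a one-line function; its outputs for the three cases in this quiz follow directly:

```python
from fractions import Fraction

def laplace(count, total, k=1, n_classes=2):
    """Laplace-smoothed estimate: add k to the count, and k per class below."""
    return Fraction(count + k, total + k * n_classes)

print(laplace(1, 1))     # 1 spam of 1 message   -> 2/3
print(laplace(6, 10))    # 6 spam of 10 messages -> 7/12
print(laplace(60, 100))  # 60 spam of 100        -> 61/102
```

Note how the estimates approach the raw count ratio as the data grows, which is the intended behavior of the smoother.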

The answer here is 2/3, or 0.667, and is computed as follows. â–¶ 00:00 We have 1 spam in 1 message, but we're going to add k = 1 up here. â–¶ 00:10 We're going to add k times 2 down here, because there are 2 different classes: â–¶ 00:16 k = 1 times 2 classes = 2, which gives us 2/3. â–¶ 00:22 The answer over here is 7/12. â–¶ 00:28 Again, we have 6/10, but we add 2 down here and 1 over here, so we get 7/12. â–¶ 00:32 And correspondingly, we get 61/102, which is (60 + 1) over (100 + 2). â–¶ 00:41 If we look at the numbers over here, we get 0.5833 â–¶ 00:49 and 0.5986. â–¶ 00:56 Interestingly, maximum likelihood in the last 2 cases over here â–¶ 00:59 would give us 0.6, but here we get values that are closer to 0.5, â–¶ 01:03 which is the effect of the smoothing prior in Laplacian smoothing. â–¶ 01:09 ### (00:25) 23 Question

Let's use the Laplacian smoother with K=1 â–¶ 00:00 to calculate a few interesting probabilities-- â–¶ 00:05 P of SPAM, P of HAM, â–¶ 00:09 and then the probability of the word "today", â–¶ 00:12 given that it's in the SPAM class or the HAM class. â–¶ 00:15 And you might assume that our vocabulary size â–¶ 00:19 is 12 different words here. â–¶ 00:22 ### (01:17) 24 Answer

This one is easy to calculate for SPAM and HAM. â–¶ 00:00 For SPAM, it's 2/5, â–¶ 00:03 and the reason is, we had previously â–¶ 00:05 3 out of 8 messages assigned to SPAM. â–¶ 00:08 But thanks to the Laplacian smoother, we add 1 over here. â–¶ 00:12 And there are 2 classes, so we add 2 times 1 over here, â–¶ 00:15 which gives us 4/10, which is 2/5. â–¶ 00:19 Similarly, we get 3/5 over here. â–¶ 00:22 Now the tricky part comes up over here. â–¶ 00:26 Before, we had 0 occurrences of the word "today" in the SPAM class, â–¶ 00:29 and we had 9 data points. â–¶ 00:33 But now we are going to add 1 for the Laplacian smoother, â–¶ 00:35 and down here, we are going to add 12. â–¶ 00:38 And the reason we add 12 is because â–¶ 00:40 there are 12 different words in our dictionary. â–¶ 00:42 Hence, for each word in the dictionary, we are going to add 1. â–¶ 00:44 So we have a total of 12, which gives us the 12 over here. â–¶ 00:47 That makes 1/21. â–¶ 00:50 In the HAM class, we had 2 occurrences â–¶ 00:53 of the word "today"--over here and over here. â–¶ 00:56 We add 1, normalize by 15, â–¶ 00:59 plus 12 for the dictionary size, â–¶ 01:04 which is 3/27, or 1/9. â–¶ 01:07 This was not an easy question. â–¶ 01:14 ### (00:21) 25 Question
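The key point in the preceding answer is that the Laplace denominator grows by k times the number of possible outcomes: 2 for the class prior, but 12 (the dictionary size) for the word conditionals. A sketch:

```python
from fractions import Fraction

def laplace(count, total, k, n_outcomes):
    """Add k to the count; add k per possible outcome to the total."""
    return Fraction(count + k, total + k * n_outcomes)

# Class priors: 2 possible classes.
print(laplace(3, 8, 1, 2))    # p(SPAM) = 4/10 = 2/5
print(laplace(5, 8, 1, 2))    # p(HAM)  = 6/10 = 3/5
# Word conditionals: the dictionary has 12 words, so add 12 below.
print(laplace(0, 9, 1, 12))   # p("today" | SPAM) = 1/21
print(laplace(2, 15, 1, 12))  # p("today" | HAM)  = 3/27 = 1/9
```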

We come now to the final quiz here, â–¶ 00:00 in which I would like you to compute the probability â–¶ 00:03 that the message "today is secret" â–¶ 00:05 falls into the SPAM box, with â–¶ 00:08 the Laplacian smoother using K=1. â–¶ 00:10 Please just enter your number over here. â–¶ 00:13 This is a non-trivial question. â–¶ 00:16 It might take you a while to calculate this. â–¶ 00:18 ### (00:58) 26 Answer

The approximate probability is 0.4858. â–¶ 00:00 How did we get this? â–¶ 00:06 Well, the prior probability for SPAM â–¶ 00:08 under the Laplacian smoothing is 2/5. â–¶ 00:12 "Today" doesn't occur, but we have already calculated this to be 1/21. â–¶ 00:15 "Is" occurs once, so we get a 2 over here, over 21. â–¶ 00:22 "Secret" occurs 3 times, so we get a 4 over here, over 21, â–¶ 00:26 and we normalize this by the same expression over here, â–¶ 00:32 plus the prior for HAM, which is 3/5: â–¶ 00:37 we have 2 occurrences of "today", plus 1, equals 3/27. â–¶ 00:42 "Is" occurs once--2/27. â–¶ 00:47 And "secret" occurs once--again 2/27. â–¶ 00:50 When you work this all out, you get this number over here. â–¶ 00:54 ### (01:47) 27 Summary Naive Bayes
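As a check on the preceding computation, here is the full Laplace-smoothed Naive Bayes classification of "today is secret" as a sketch, using the word counts quoted above (spam has 9 words total, ham has 15, vocabulary size 12):

```python
from fractions import Fraction

def laplace(count, total, k=1, n=12):
    """Laplace smoothing with vocabulary size n = 12."""
    return Fraction(count + k, total + k * n)

# Laplace-smoothed priors (k=1, 2 classes): 2/5 spam, 3/5 ham.
p_spam, p_ham = Fraction(2, 5), Fraction(3, 5)
# Raw word counts per class quoted in the lecture.
spam_counts = {"today": 0, "is": 1, "secret": 3}
ham_counts = {"today": 2, "is": 1, "secret": 1}

s, h = p_spam, p_ham
for w in "today is secret".split():
    s *= laplace(spam_counts[w], 9)   # 9 words of spam
    h *= laplace(ham_counts[w], 15)   # 15 words of ham
print(float(s / (s + h)))  # ~0.4858
```

Note that the smoothed model no longer assigns probability 0, even though "today" never appeared in spam.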

So we learned quite a bit. â–¶ 00:00 We learned about Naive Bayes â–¶ 00:02 as our first supervised learning method. â–¶ 00:04 The setup was that we had â–¶ 00:06 features of documents, or training examples, and labels-- â–¶ 00:08 in this case, SPAM or not SPAM. â–¶ 00:14 And from those pieces, â–¶ 00:17 we made a generative model for the SPAM class â–¶ 00:19 and the non-SPAM class â–¶ 00:23 that described the conditional probability â–¶ 00:25 of each individual feature. â–¶ 00:28 We then used first maximum likelihood â–¶ 00:30 and then a Laplacian smoother â–¶ 00:33 to fit those parameters over here. â–¶ 00:36 And then using Bayes rule, â–¶ 00:38 we could take any training example over here â–¶ 00:41 and figure out what the class probability was over here. â–¶ 00:44 This is called a generative model, â–¶ 00:48 in that the conditional probabilities all aim to maximize â–¶ 00:51 the probability of individual features as if those â–¶ 00:55 describe the physical world. â–¶ 01:00 We also used what is called a bag-of-words model, â–¶ 01:02 in which our representation of each email â–¶ 01:06 was such that we just counted the occurrences of words, â–¶ 01:09 irrespective of their order. â–¶ 01:12 Now this is a very powerful method for fighting SPAM. â–¶ 01:15 Unfortunately, it is not powerful enough. â–¶ 01:19 It turns out spammers know about Naive Bayes, â–¶ 01:21 and they've long learned to come up with messages â–¶ 01:24 that fool your SPAM filter if it uses Naive Bayes. â–¶ 01:27 So companies like Google and others â–¶ 01:31 have become much more involved â–¶ 01:33 in methods for SPAM filtering. â–¶ 01:35 Now I can give you some more examples of how to filter SPAM, â–¶ 01:38 but all of those quite easily fit with the same Naive Bayes model. â–¶ 01:42 ### (01:27) 28 Advanced SPAM Filtering

So here are features that you might consider when you write â–¶ 00:00 an advanced spam filter. â–¶ 00:03 For example, â–¶ 00:05 does the email come from â–¶ 00:07 a known spamming IP or computer? â–¶ 00:09 Have you emailed this person before? â–¶ 00:12 In which case it is less likely to be spam. â–¶ 00:16 Here's a powerful one: â–¶ 00:19 have 1000 other people â–¶ 00:22 recently received the same message? â–¶ 00:25 Is the email header consistent? â–¶ 00:29 For example, if the from field says your bank, â–¶ 00:32 is the IP address really your bank's? â–¶ 00:35 Surprisingly: is the email all in caps? â–¶ 00:38 Strangely, many spammers believe if you write â–¶ 00:42 things in all caps, you'll pay more attention to them. â–¶ 00:44 Do the inline URLs point to the pages â–¶ 00:48 they say they're pointing to? â–¶ 00:51 Are you addressed by your correct name? â–¶ 00:54 Now these are some features; â–¶ 00:56 I'm sure you can think of more. â–¶ 00:58 You can toss them easily into the â–¶ 01:00 Naive Bayes model and get better classification. â–¶ 01:02 In fact, modern spam filters keep learning â–¶ 01:05 as people flag emails as spam, and â–¶ 01:08 of course spammers keep learning as well, â–¶ 01:10 trying to fool modern spam filters. â–¶ 01:13 Who's going to win? â–¶ 01:16 Well, so far the spam filters are clearly winning. â–¶ 01:18 Most of my spam I never see, but who knows â–¶ 01:21 what's going to happen in the future? â–¶ 01:23 It's a really fascinating machine learning problem. â–¶ 01:25 ### (02:21) 29 Digit Recognition

Naive Bayes can also be applied to â–¶ 00:00 the problem of handwritten digit recognition. â–¶ 00:02 This is a sample of handwritten digits taken â–¶ 00:05 from a U.S. postal data set, â–¶ 00:09 where handwritten zip codes on letters are â–¶ 00:12 being scanned and automatically classified. â–¶ 00:17 The machine learning problem here is: â–¶ 00:21 taking a symbol just like this, â–¶ 00:23 what is the corresponding number? â–¶ 00:28 Here it's obviously 0. â–¶ 00:30 Here it's obviously 1. â–¶ 00:32 Here it's obviously 2, then 1. â–¶ 00:34 For the one down here, â–¶ 00:36 it's a little bit harder to tell. â–¶ 00:38 Now when you apply Naive Bayes, â–¶ 00:41 the input vector â–¶ 00:44 could be the pixel values â–¶ 00:46 of each individual pixel, so with â–¶ 00:48 a 16 x 16 input resolution â–¶ 00:50 you would get 256 different values â–¶ 00:54 corresponding to the brightness of each pixel. â–¶ 00:59 Now obviously, given sufficiently many â–¶ 01:02 training examples, you might hope â–¶ 01:05 to recognize digits, â–¶ 01:07 but one of the deficiencies of this approach is that â–¶ 01:09 it is not particularly shift-invariant. â–¶ 01:12 So for example, a pattern like this â–¶ 01:15 will look fundamentally different â–¶ 01:19 from a pattern like this, â–¶ 01:21 even though the pattern on the right is obtained â–¶ 01:24 by shifting the pattern on the left â–¶ 01:27 by 1 pixel to the right. â–¶ 01:29 There are many different solutions, but a common one could be â–¶ 01:31 to use smoothing in a different way from â–¶ 01:34 the way we discussed it before. â–¶ 01:36 Instead of just counting 1 pixel value's count, â–¶ 01:38 you could mix it with counts of the â–¶ 01:40 neighboring pixel values, so if â–¶ 01:42 all pixels are slightly shifted, â–¶ 01:44 we get about the same statistics â–¶ 01:46 as for the pixel itself. â–¶ 01:48 Such a method is called input smoothing. 
â–¶ 01:50 You can, in what's technically called convolution, â–¶ 01:52 smooth the input vector of pixel values, and â–¶ 01:55 you might get better results than if you â–¶ 01:57 do Naive Bayes on the raw pixels. â–¶ 02:00 Now, to tell you the truth, for â–¶ 02:02 digit recognition of this type, â–¶ 02:04 Naive Bayes is not a good choice. â–¶ 02:06 The conditional independence assumption â–¶ 02:08 of each pixel, given the class, â–¶ 02:10 is too strong an assumption in this case, â–¶ 02:12 but it's fun to talk about image recognition â–¶ 02:14 in the context of Naive Bayes regardless. â–¶ 02:17 ### (03:30) 30 Overfitting Prevention
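The input-smoothing idea just described can be sketched in one dimension. The three-tap kernel below is an illustrative assumption, not from the lecture; the point is only that after convolution, a pattern and its one-pixel shift share mass, whereas the raw vectors do not:

```python
def smooth(pixels, kernel=(0.25, 0.5, 0.25)):
    """Convolve a 1-D pixel vector with a small smoothing kernel,
    so each value mixes with its neighbors (edges padded with zeros)."""
    padded = [0.0] + list(pixels) + [0.0]
    return [sum(k * padded[i + j] for j, k in enumerate(kernel))
            for i in range(len(pixels))]

# A spike pattern and the same pattern shifted by one pixel:
a = [0, 0, 1, 0, 0, 0]
b = [0, 0, 0, 1, 0, 0]
print(smooth(a))  # [0.0, 0.25, 0.5, 0.25, 0.0, 0.0]
print(smooth(b))  # [0.0, 0.0, 0.25, 0.5, 0.25, 0.0]
```

The smoothed versions overlap at two positions, so slightly shifted inputs yield similar per-pixel statistics.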

So, let me step back and talk a bit about â–¶ 00:00 overfitting prevention in machine learning, â–¶ 00:04 because it's such an important topic. â–¶ 00:07 We talked about Occam's Razor, â–¶ 00:09 which in a generalized way suggests there is â–¶ 00:12 a tradeoff between how well we can fit the data â–¶ 00:16 and how smooth our learning algorithm is. â–¶ 00:22 In our discussion of smoothing, we already found 1 way â–¶ 00:28 to let Occam's Razor play, which is by â–¶ 00:32 selecting the value K to make our statistical counts smoother. â–¶ 00:34 I alluded to a similar way in the image recognition domain, â–¶ 00:40 where we smoothed the image so the neighboring pixels count similarly. â–¶ 00:44 This all raises the question of how to choose the smoothing parameter. â–¶ 00:49 So, in particular, in Laplacian smoothing, how do we choose the K? â–¶ 00:53 There is a method called cross-validation â–¶ 00:58 which can help you find an answer. â–¶ 01:02 This method assumes there are plenty of training examples, but â–¶ 01:05 to tell you the truth, in spam filtering there are more than you'd ever want. â–¶ 01:09 Take your training data â–¶ 01:14 and divide it into 3 buckets: â–¶ 01:17 train, cross-validate, and test. â–¶ 01:19 Typical ratios are that 80% goes into train, â–¶ 01:24 10% into cross-validate, â–¶ 01:27 and 10% into test. â–¶ 01:30 You use the train set to find all your parameters-- â–¶ 01:33 for example, the probabilities of a Bayes network. â–¶ 01:37 You use your cross-validation set â–¶ 01:40 to find the optimal K, and the way you do this is: â–¶ 01:43 you train for different values of K, â–¶ 01:46 you observe how well the trained model performs on the CV data, â–¶ 01:49 not touching the test data, â–¶ 01:55 and then you maximize over all the Ks to get the best performance â–¶ 01:58 on the cross-validation set. â–¶ 02:01 You iterate this many times until you find the best K. 
â–¶ 02:03 When you're done with the best K, â–¶ 02:06 you train again, and then finally, â–¶ 02:09 only once do you touch the test data, â–¶ 02:12 to verify the performance, â–¶ 02:15 and this is the performance you report. â–¶ 02:17 It's really important in cross-validation â–¶ 02:20 to split apart a cross-validation set that's different from the test set. â–¶ 02:23 If you were to use the test set to find the optimal K, â–¶ 02:28 then your test set would become an effective part of your training routine, â–¶ 02:31 and you might overfit your test data â–¶ 02:35 and wouldn't even know. â–¶ 02:38 By keeping the test data separate from the beginning â–¶ 02:40 and training on the training data, you use â–¶ 02:43 the cross-validation data to find out how well your trained model is doing â–¶ 02:46 and to fine-tune the unknown parameter K. â–¶ 02:49 Finally, only once you use the test data â–¶ 02:53 do you get a fair answer to the question, â–¶ 02:56 "How well will your model perform on future data?" â–¶ 02:59 So, pretty much everybody in machine learning â–¶ 03:02 uses this model. â–¶ 03:05 You can redo the split between the training and the cross-validation parts; â–¶ 03:08 people often use the term 10-fold cross-validation, â–¶ 03:12 where they do 10 different foldings â–¶ 03:15 and run the model 10 times to find the optimal K â–¶ 03:17 or smoothing parameter. â–¶ 03:20 No matter which way you do it, find the optimal smoothing parameter, â–¶ 03:22 and then use the test set exactly once to verify and report. â–¶ 03:25 ### (02:00) 31 Classification vs Regression
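The 80/10/10 recipe just described can be sketched as a small loop. Here `train_fn` and `score_fn` are hypothetical placeholders for your learner and its evaluation metric (they are not from the lecture); the shape of the procedure is the point:

```python
import random

def pick_k(data, ks, train_fn, score_fn):
    """Sketch of the cross-validation recipe: the CV slice picks the
    smoothing parameter K; the test slice is touched exactly once."""
    data = list(data)
    random.shuffle(data)
    n = len(data)
    train = data[: int(0.8 * n)]                 # 80% train
    cv = data[int(0.8 * n): int(0.9 * n)]        # 10% cross-validate
    test = data[int(0.9 * n):]                   # 10% test
    # Train once per candidate K and score each model on the CV slice only.
    best_k = max(ks, key=lambda k: score_fn(train_fn(train, k), cv))
    final_model = train_fn(train + cv, best_k)   # retrain with the chosen K
    return best_k, score_fn(final_model, test)   # report test performance
```

A 10-fold variant would repeat the train/CV split ten times and average the CV scores before picking K.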

Let me back up a step further, â–¶ 00:00 and let's look at supervised learning more generally. â–¶ 00:03 Our example so far was one of classification. â–¶ 00:06 The characteristic of classification is â–¶ 00:09 that the target labels, or the target class, are discrete. â–¶ 00:12 In our case it was actually binary. â–¶ 00:16 In many problems, we try to predict a continuous quantity-- â–¶ 00:18 for example, in the interval 0 to 1, or perhaps any real number. â–¶ 00:23 Those machine learning problems are called regression problems. â–¶ 00:29 Regression problems are fundamentally different from classification problems. â–¶ 00:33 For example, our Bayes network doesn't afford us an answer â–¶ 00:37 to a problem where the target value could be anywhere in the interval 0 to 1. â–¶ 00:42 A regression problem, for example, would be one to â–¶ 00:45 predict the weather tomorrow. â–¶ 00:48 Temperature is a continuous value. Our Bayes network would not be able â–¶ 00:50 to predict the temperature; it can only predict discrete classes. â–¶ 00:53 A regression algorithm is able to give us a continuous prediction â–¶ 00:58 about the temperature tomorrow. â–¶ 01:01 So let's look at regression next. â–¶ 01:04 Here's my first quiz for you on regression. â–¶ 01:07 This scatter plot shows, for Berkeley, California, over a period of time, â–¶ 01:10 the data for each house that was sold. â–¶ 01:18 Each dot is a sold house. â–¶ 01:21 It graphs the size of the house in square feet â–¶ 01:24 against the sales price in thousands of dollars. â–¶ 01:27 As you can see, roughly speaking, â–¶ 01:32 as the size of the house goes up, â–¶ 01:34 so does the sales price. â–¶ 01:37 I wonder, for a house of about 2500 square feet, â–¶ 01:40 what is the approximate sales price you would assume, â–¶ 01:45 based just on the scatter plot data? â–¶ 01:49 Is it 400K, 600K, 800K, or 1000K? â–¶ 01:52 ### (00:26) 32 Answer

My answer is that there seems to be a roughly linear relationship, â–¶ 00:00 maybe not quite linear, between the house size and the price. â–¶ 00:05 So if we look at a linear graph that best describes the data, â–¶ 00:11 we get this dashed line over here. â–¶ 00:15 And for the dashed line, if you walk up to 2500 square feet, â–¶ 00:18 you end up with roughly 800K. â–¶ 00:22 So this would have been the best answer. â–¶ 00:24 ### (02:46) 33 Linear Regression

Now obviously you can answer this question without understanding anything about regression. â–¶ 00:00 But what you find is this is different from classification as before. â–¶ 00:05 This is not a binary concept anymore, like expensive and cheap. â–¶ 00:10 It really is a relationship between two variables: â–¶ 00:13 one you care about--the house price--and one that you can observe, â–¶ 00:17 which is the house size in square feet. â–¶ 00:20 And your goal is to fit a curve that best explains the data. â–¶ 00:23 Once again, we have a case where we can play Occam's razor. â–¶ 00:28 There clearly is a data fit that is not linear that might be better, â–¶ 00:31 like this one over here. â–¶ 00:35 And if you go beyond linear curves, â–¶ 00:37 you might even be inclined to draw a curve like this. â–¶ 00:40 Now of course the curve I'm drawing right now is likely an overfit. â–¶ 00:44 And you don't want to postulate that this is the general relationship â–¶ 00:49 between the size of a house and the sales price. â–¶ 00:54 So even though my black curve might describe the data better, â–¶ 00:57 the blue curve or the dashed linear curve over here might be a better explanation, by virtue of Occam's razor. â–¶ 01:01 So let's look a little bit deeper into what we call regression. â–¶ 01:08 As in all regression problems, our data will be comprised of â–¶ 01:15 input vectors of length n that map to another continuous value. â–¶ 01:19 And we might be given a total of M data points. â–¶ 01:25 This is like the classification case, except this time the Ys are continuous. â–¶ 01:30 Once again, we're looking for a function f that maps our vector x into y. â–¶ 01:36 In linear regression, the function has a particular form, which is W1 times X plus W0. â–¶ 01:44 In this case X is one-dimensional, that is, N = 1. â–¶ 01:54 Or in a high-dimensional space, we might just write W times X plus W0, â–¶ 01:59 where W is a vector and X is a vector. 
â–¶ 02:07 And this is the inner product of these 2 vectors over here. â–¶ 02:12 Let's for now just consider the one-dimensional case. â–¶ 02:16 In this quiz, I've given you a linear regression form with 2 unknown parameters, W1 and W0. â–¶ 02:20 I've given you a data set, â–¶ 02:27 and this data set happens to be fittable by a linear regression model without any residual error. â–¶ 02:30 Without any math, can you look at this and tell me what the 2 parameters, W0 and W1, are? â–¶ 02:36 ### (01:17) 34 Answer

This is a surprisingly challenging question. â–¶ 00:00 If you look at these numbers, with X going from 3 to 6, â–¶ 00:03 when we increase X by 3, Y decreases by 3, â–¶ 00:07 which suggests W1 is -1. â–¶ 00:14 Now let's see if this holds. â–¶ 00:18 If we increase X by 3, it decreases Y by 3. â–¶ 00:20 If we increase X by 1, we decrease Y by 1. â–¶ 00:24 If we increase X by 2, we decrease Y by 2. â–¶ 00:28 So this number seems to be an exact fit. â–¶ 00:32 Next we have to get the constant W0 right. â–¶ 00:36 For X = 3, we get -3 for the expression over here, â–¶ 00:41 because we know W1 = -1. â–¶ 00:48 So if this has to equal zero in the end, then W0 has to be 3. â–¶ 00:50 Let's do a quick check: â–¶ 00:57 -3 plus 3 is 0. â–¶ 00:59 -6 plus 3 is -3. â–¶ 01:02 And if we plug in any of the other numbers, we find those are correct too. â–¶ 01:05 Now this is the case of an exact data set. â–¶ 01:09 It gets much more challenging if the data set cannot be fit with a linear function. â–¶ 01:12 ### (01:00) 35 More Linear Regression

To define linear regression, â–¶ 00:00 we need to understand what we are trying to minimize. â–¶ 00:02 What we use here is called a loss function, â–¶ 00:05 and the loss function is the amount of residual error we obtain â–¶ 00:08 after fitting the linear function as well as possible. â–¶ 00:12 The residual error is the sum, over all training examples â–¶ 00:16 j, of yj, which is the target label, â–¶ 00:20 minus our prediction--that is, yj minus W1 xj minus W0--squared. â–¶ 00:25 This is the quadratic error between our target labels â–¶ 00:34 and what our best hypothesis can produce. â–¶ 00:37 Minimizing this loss â–¶ 00:41 is how we define the solution to a linear regression problem, â–¶ 00:43 and you can write it as follows: â–¶ 00:46 our solution to the regression problem, W*, â–¶ 00:50 is the arg min of the loss over all possible vectors W. â–¶ 00:52 ### (04:04) 36 Quadratic Loss

The problem of minimizing quadratic loss for linear functions can be solved in closed form. â–¶ 00:00 I will derive this for the one-dimensional case on paper. â–¶ 00:07 I will also give you the solution for the case where your input space is multidimensional, â–¶ 00:12 which is often called "multivariate regression." â–¶ 00:17 We seek to minimize a sum of quadratic expressions â–¶ 00:22 where the target labels are subtracted from the output of our linear regression model, â–¶ 00:26 parameterized by w1 and w0. â–¶ 00:33 The summation here is over all training examples, â–¶ 00:36 and I leave the index of the summation out when not necessary. â–¶ 00:40 The minimum of this is obtained where the derivative of this function equals zero. â–¶ 00:45 Let's call this function "L." â–¶ 00:50 For the partial derivative with respect to w0, we get this expression over here, â–¶ 00:53 which we have to set to zero. â–¶ 00:59 We can easily get rid of the -2 and transform this as follows. â–¶ 01:02 Here M is the number of training examples. â–¶ 01:11 This expression over here gives us w0 as a function of w1, â–¶ 01:17 but we don't know w1. Let's do the same trick for w1 â–¶ 01:21 and set this to zero as well, â–¶ 01:28 which gets us the expression over here. â–¶ 01:32 We can now plug the w0 over here into this expression over here â–¶ 01:38 and obtain this expression over here, â–¶ 01:44 which looks really involved but is relatively straightforward. â–¶ 01:47 With a few steps of further calculation, which I'll spare you for now, â–¶ 01:52 we get for w1 the following important formula. â–¶ 01:56 This is the final quotient for w1, â–¶ 02:02 where we take the number of training examples times the sum of all x times y, â–¶ 02:05 minus the sum of x times the sum of y, divided by this expression over here. â–¶ 02:10 Once we've computed w1, â–¶ 02:16 we can go back to our original expression for w0 over here â–¶ 02:19 and plug w1 in to obtain w0. 
â–¶ 02:23 These are the two important formulas, which you can also find in the textbook. â–¶ 02:30 I'd like to go back and use those formulas to calculate these two coefficients over here. â–¶ 02:39 You get 4 times the sum of all x times y, which is -32, â–¶ 02:45 minus the product of the sum of x, which is 18, and the sum of y, which is -6, â–¶ 02:56 divided by 4 times the sum of x squared, which is 86, minus the square of the sum of x, â–¶ 03:05 which is 18 times 18, or 324. â–¶ 03:16 If you work this all out, it becomes -1, which is w1. â–¶ 03:20 W0 is now obtained by computing a quarter times the sum of all y, â–¶ 03:25 which is -6, minus -1/4 times the sum of all x. â–¶ 03:31 If you plug this all in, you get 3, as over here. Our formula is actually correct. â–¶ 03:39 Here is another quiz for linear regression. We have the following data. â–¶ 03:46 Here is the data plotted graphically. â–¶ 03:51 I wonder what the best regression is. â–¶ 03:53 Give me w0 and w1. Apply the formulas I just gave you. â–¶ 03:56 ### (01:27) 37 Answer
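The closed-form formulas just applied, w1 = (M Σxy − Σx Σy) / (M Σx² − (Σx)²) and w0 = (1/M) Σy − (w1/M) Σx, can be written as a short function and checked on the exactly fittable data set from the earlier quiz (X = 3, 6, 4, 5; Y = 0, -3, -1, -2):

```python
def linreg(xs, ys):
    """Closed-form 1-D linear regression (the w1 and w0 formulas above)."""
    m = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    w1 = (m * sxy - sx * sy) / (m * sxx - sx * sx)
    w0 = sy / m - w1 * sx / m
    return w0, w1

print(linreg([3, 6, 4, 5], [0, -3, -1, -2]))  # (3.0, -1.0)
```

The sums match the lecture's arithmetic: Σx = 18, Σy = -6, Σxy = -32, Σx² = 86.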

And the answer is W0 = 0.5 and W1 = 0.9. â–¶ 00:00 If I were to draw a line, it would go about like this. â–¶ 00:09 It doesn't really hit the two points at the end. â–¶ 00:14 If you were thinking of something like this, you were wrong. â–¶ 00:19 If you draw a curve like this, your quadratic error becomes 2: â–¶ 00:24 one over here, and one over here. â–¶ 00:28 The quadratic error is smaller for the line that goes in between those points. â–¶ 00:30 This is easily seen by computing as shown on the previous slide. â–¶ 00:35 W1 equals (4 x 118 - 20 x 20) / (4 x 120 - 400), which is 0.9. â–¶ 00:41 This is merely plugging those numbers into the formulas I gave you. â–¶ 00:55 W0 then becomes 1/4 x 20 â–¶ 01:00 minus W1--that is, 0.9--over 4, times 20, which equals 0.5. â–¶ 01:05 This is an example of linear regression â–¶ 01:12 in which there is a residual error, â–¶ 01:16 and the best-fitting curve is the one that minimizes â–¶ 01:18 the total of the residual vertical error in this graph over here. â–¶ 01:22 ### (02:10) 38 Problems with Linear Regression
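The arithmetic in the answer above can be verified directly from the aggregate sums it quotes (M = 4, Σx = 20, Σy = 20, Σxy = 118, Σx² = 120):

```python
# Aggregate sums quoted in the answer for this quiz's data set.
m, sx, sy, sxy, sxx = 4, 20, 20, 118, 120

w1 = (m * sxy - sx * sy) / (m * sxx - sx * sx)  # (472 - 400) / (480 - 400)
w0 = sy / m - w1 * sx / m                       # 5 - 0.9 * 5
print(w1, w0)  # 0.9 0.5
```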

So linear regression works well â–¶ 00:00 if the data is approximately linear, â–¶ 00:03 but there are many examples where linear regression performs poorly. â–¶ 00:05 Here's one where we have a â–¶ 00:09 curve that is really nonlinear. â–¶ 00:12 This is an interesting one, where we seem to have a linear relationship â–¶ 00:15 that is flatter than the linear regression indicates, â–¶ 00:18 but there is one outlier. â–¶ 00:21 Because if you are minimizing quadratic error, â–¶ 00:23 outliers penalize you over-proportionately. â–¶ 00:26 So outliers are particularly bad for linear regression. â–¶ 00:30 And here is a case â–¶ 00:34 where the data clearly suggests â–¶ 00:35 a very different phenomenon than a linear one. â–¶ 00:37 We have only two ?? variables being used, â–¶ 00:40 and this one has a strong frequency â–¶ 00:42 and a strong vertical spread. â–¶ 00:45 Clearly a linear regression model â–¶ 00:47 is a very poor one to explain â–¶ 00:49 this data over here. â–¶ 00:51 Another problem with linear regression â–¶ 00:53 is that as you go to infinity in the X space, â–¶ 00:55 your Ys also become infinite. â–¶ 00:59 In some problems that isn't a plausible model. â–¶ 01:02 For example, if you wish to predict the weather â–¶ 01:05 anytime into the future, â–¶ 01:08 it's implausible to assume the further the prediction goes out, â–¶ 01:10 the hotter or the cooler it becomes. â–¶ 01:13 For such situations there is a â–¶ 01:15 model called logistic regression, â–¶ 01:17 which uses a slightly more complicated â–¶ 01:20 model than linear regression, â–¶ 01:22 which goes as follows. â–¶ 01:24 Let F of X be our linear function, â–¶ 01:25 and the output Z of logistic regression â–¶ 01:30 is obtained by the following function: â–¶ 01:32 one over one plus the exponential of minus F of X. â–¶ 01:34 So here's a quick quiz for you. â–¶ 01:40 What is the range in which Z might fall, â–¶ 01:43 given this function over here, â–¶ 01:48 where F of X is the linear function over here? 
▶ 01:49 Is it zero, one? ▶ 01:53 Is it minus one, one? ▶ 01:56 Is it minus one, zero? ▶ 01:59 Minus two, two? ▶ 02:02 Or none of the above? ▶ 02:04 ### (01:00) 39 Answer

The answer is zero, one. ▶ 00:00 If this expression over here, ▶ 00:02 F of X, ▶ 00:05 grows to positive infinity, ▶ 00:07 then Z becomes one. ▶ 00:09 And the reason is, ▶ 00:14 as this term over here becomes very large, ▶ 00:16 E to the minus of that term approaches zero; ▶ 00:19 one over one equals one. ▶ 00:22 If F of X goes to minus infinity, ▶ 00:25 then Z goes to zero. ▶ 00:30 And the reason is, ▶ 00:33 if this expression over here goes to minus infinity, ▶ 00:34 E to the minus of that term becomes very large; ▶ 00:38 one over something very large becomes zero. ▶ 00:41 When we plot the logistic function it looks like this: ▶ 00:44 it's approximately linear ▶ 00:49 around F of X equals zero, ▶ 00:51 but it levels off to zero and one ▶ 00:54 as we go to the extremes. ▶ 00:58 ### (01:39) 40 Linear Regression and Complexity Control
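The squashing behavior described in the answer above is easy to verify numerically. This is a minimal sketch (the function name is my own, not from the class):

```python
import math

def logistic(f):
    """Squash a linear score f(x) into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-f))

# Approximately linear around f = 0, saturating at the extremes:
print(logistic(0.0))     # 0.5
print(logistic(100.0))   # approaches 1 as f -> +infinity
print(logistic(-100.0))  # approaches 0 as f -> -infinity
```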

Another problem with linear regression has to do with regularization, ▶ 00:00 or complexity control. ▶ 00:04 Just like before, we sometimes wish to have ▶ 00:06 a less complex model. ▶ 00:08 So in regularization, the loss function is the sum ▶ 00:10 of the loss over the data and a complexity control term, ▶ 00:15 which is often called the loss of the parameters. ▶ 00:21 The loss over the data is simply the quadratic loss, as we discussed before. ▶ 00:24 The loss of the parameters might just be a function that penalizes ▶ 00:29 the parameters for becoming large, ▶ 00:35 up to some norm P, where P is usually either 1 or 2. ▶ 00:37 If you draw this graphically, ▶ 00:43 in a parameter space comprised of 2 parameters, ▶ 00:46 your quadratic term for minimizing the data error ▶ 00:49 might look like this, where the minimum sits over here. ▶ 00:53 Your term for regularization might pull these parameters toward 0. ▶ 00:57 It pulls them toward 0 along a circle if you use quadratic (L2) error, ▶ 01:02 and it does so in a diamond-shaped way ▶ 01:09 for L1 regularization. Either one works well. ▶ 01:14 L1 has the advantage that parameters tend to get really sparse. ▶ 01:20 If you look at this diagram, there is a tradeoff between W0 and W1. ▶ 01:24 In the L1 case, that allows one of them to be driven to 0. ▶ 01:30 In the L2 case, parameters tend not to be as sparse, ▶ 01:33 so L1 is often preferred. ▶ 01:37 ### (01:46) 41 Minimizing Complicated Loss Functions
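The regularized loss from the previous section can be written down directly. This is an illustrative sketch; the function name, the penalty weight `lam`, and the toy data are assumptions, not taken from the lecture:

```python
def regularized_loss(w, data, p, lam=1.0):
    """Quadratic data loss plus an Lp penalty on the parameters.

    w    -- (w0, w1) for the line y = w1*x + w0
    data -- list of (x, y) pairs
    p    -- 1 for L1 (diamond contours, sparse weights), 2 for L2 (circles)
    lam  -- strength of the complexity-control term (assumed name)
    """
    w0, w1 = w
    loss_data = sum((y - (w1 * x + w0)) ** 2 for x, y in data)
    loss_params = lam * sum(abs(wi) ** p for wi in w)
    return loss_data + loss_params

# Same weights and data, different penalties:
print(regularized_loss((0.5, 0.5), [(0, 1), (1, 1)], p=1))  # 1.25
print(regularized_loss((0.5, 0.5), [(0, 1), (1, 1)], p=2))  # 0.75
```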

This all raises the question, ▶ 00:00 how do we minimize more complicated loss functions ▶ 00:03 than the ones we discussed so far? ▶ 00:06 Are there closed-form solutions of the type we found for linear regression? ▶ 00:09 Or do we have to resort to iterative methods? ▶ 00:14 The general answer is, unfortunately, we have to resort to iterative methods. ▶ 00:17 Even though there are special cases in which closed-form solutions may exist, ▶ 00:23 in general, our loss functions now become complicated enough ▶ 00:28 that all we can do is iterate. ▶ 00:32 Here is a prototypical loss function, ▶ 00:35 and the method for iteration is called gradient descent. ▶ 00:40 In gradient descent, you start with an initial guess, ▶ 00:44 W-0, where 0 is your iteration number, ▶ 00:48 and then you update it iteratively. ▶ 00:53 Your i plus 1st parameter guess is obtained by taking your i-th guess ▶ 00:55 and subtracting from it the gradient of your loss function ▶ 01:04 at that guess, multiplied by a small learning rate alpha, ▶ 01:10 where alpha is often as small as 0.01. ▶ 01:15 I have a couple of questions for you. ▶ 01:19 Consider the following 3 points. ▶ 01:21 We call them A, B, C. ▶ 01:25 I wish to know, for points A, B, and C, ▶ 01:27 is the gradient at this point positive, about zero, or negative? ▶ 01:34 For each of those, check exactly one of those cases. ▶ 01:40 ### (00:47) 42 Answer

In case A, the gradient is negative. ▶ 00:00 If you move to the right in the X space, ▶ 00:03 then your loss decreases. ▶ 00:06 In B, it's about zero. ▶ 00:09 In C, it's pointing up; it's positive. ▶ 00:12 So if you apply the rule over here, ▶ 00:15 if you were to start at A as your W-zero, ▶ 00:18 then your gradient is negative. ▶ 00:21 Therefore, you would add something to the value of W. ▶ 00:23 You move to the right, and your loss has decreased. ▶ 00:26 You do this until you find yourself ▶ 00:29 at what's called a local minimum, where B resides. ▶ 00:31 In this instance over here, gradient descent starting at A ▶ 00:34 would not get you to the global minimum, ▶ 00:37 which sits over here, because there's a bump in between. ▶ 00:39 Gradient methods are known to be subject to local minima. ▶ 00:42 ### (00:28) 43 Question
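The update rule from section 41, w ← w − α · ∇L(w), can be sketched in a few lines. The loss function here is a hypothetical one-dimensional example with a single minimum, chosen only to show the mechanics:

```python
def gradient_descent(grad, w_start, alpha=0.01, iterations=1000):
    """Repeatedly apply w <- w - alpha * grad(w), as in the rule above."""
    w = w_start
    for _ in range(iterations):
        w = w - alpha * grad(w)
    return w

# Example loss L(w) = (w - 3)^2 has gradient 2*(w - 3) and its minimum at w = 3.
w_star = gradient_descent(lambda w: 2.0 * (w - 3.0), w_start=0.0)
print(round(w_star, 3))  # 3.0
```

With a bumpy loss the same loop would settle in whichever local minimum the starting point leads to, which is exactly the caveat discussed in the answer.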

I have another gradient quiz. ▶ 00:00 Consider the following quadratic error function. ▶ 00:03 We are considering the gradient in 3 different places: ▶ 00:06 A, B, and C. ▶ 00:09 And I ask you, which gradient is the largest: ▶ 00:13 A, B, or C, or are they all equal? ▶ 00:17 In that case, you would want to check the last box over here. ▶ 00:23 ### (00:20) 44 Answer

And the answer is C. ▶ 00:00 The derivative of a quadratic function is a linear function, ▶ 00:04 which would look about like this. ▶ 00:08 And as we go outside, our gradient becomes larger and larger. ▶ 00:11 This over here is much steeper than this curve over here. ▶ 00:15 ### (00:21) 45 Question

[Thrun] Here is a final gradient descent quiz. ▶ 00:00 Suppose we have a loss function like this ▶ 00:04 and our gradient descent starts over here. ▶ 00:08 Will it likely reach the global minimum? ▶ 00:12 Yes or no? ▶ 00:15 Please check one of those boxes. ▶ 00:17 ### (01:00) 46 Answer

[Thrun] And the answer is yes, ▶ 00:00 although, technically speaking, to reach the absolute global minimum ▶ 00:02 we need the learning rates to become smaller and smaller over time. ▶ 00:06 If they stay constant, there is a chance this thing might bounce around ▶ 00:11 between 2 points in the end and never reach the global minimum. ▶ 00:15 But assuming that we implement gradient descent correctly, ▶ 00:18 we will finally reach the global minimum. ▶ 00:22 That's not the case if you start over here, where we can get stuck over here ▶ 00:24 and settle for the minimum over here, which is a local minimum ▶ 00:29 and not the best solution to our optimization problem. ▶ 00:32 So one of the important points to take away from this is that ▶ 00:35 gradient descent is universally applicable to more complicated problems-- ▶ 00:38 problems that don't have a closed-form solution. ▶ 00:43 But you have to check whether there are many local minima, ▶ 00:46 and if so, you have to worry about them. ▶ 00:49 Any optimization book can tell you tricks for overcoming this. ▶ 00:51 I won't go into any more depth here in this class. ▶ 00:55 ### (01:41) 47 Gradient Descent Implementation

[Thrun] It's interesting to see how to minimize a loss function using gradient descent. ▶ 00:00 In our linear case, we have L equals the sum over the correct labels ▶ 00:05 minus our linear function, squared, ▶ 00:12 which we seek to minimize. ▶ 00:16 We already know that this has a closed-form solution, ▶ 00:18 but just for the fun of it, let's look at gradient descent. ▶ 00:21 The gradient of L with respect to W1 is minus 2 times the sum over all J ▶ 00:25 of the difference as before, but without the square, times Xj. ▶ 00:33 The gradient with respect to W0 is very similar. ▶ 00:39 So in gradient descent we start with W1-0 and W0-0, ▶ 00:43 where the superscript 0 corresponds to the iteration index of gradient descent. ▶ 00:49 And then we iterate. ▶ 00:55 In the m-th iteration we get our new estimate by taking the old estimate ▶ 00:57 minus a learning rate times this gradient over here, ▶ 01:06 evaluated at the position of the old estimate W1, m minus 1. ▶ 01:10 Similarly, for W0 we get this expression over here. ▶ 01:15 And these expressions look nasty, ▶ 01:20 but what it really means is we subtract an expression like this ▶ 01:24 every time we do gradient descent from W1 ▶ 01:28 and an expression like this every time we do gradient descent from W0, ▶ 01:31 which is easy to implement, and that implements gradient descent. ▶ 01:36 ### (04:15) 48 Perceptron
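The two update expressions from the previous section translate directly into code. This is a sketch, with the constant factor of 2 in the gradient folded into the learning rate, and with data points that are an assumption chosen to match the sums from the earlier quiz (so the result should agree with the closed-form answer W0 = 0.5, W1 = 0.9):

```python
def fit_linear(xs, ys, alpha=0.01, iterations=10000):
    """Gradient descent for y ~ w1*x + w0 under quadratic loss.

    Update rule: w <- w + alpha * sum_j (y_j - f(x_j)) * x_j  (factor 2 folded
    into alpha), following the expressions in the lecture.
    """
    w0, w1 = 0.0, 0.0
    for _ in range(iterations):
        # residuals y_j - (w1*x_j + w0), i.e. the difference without the square
        r = [y - (w1 * x + w0) for x, y in zip(xs, ys)]
        w1 += alpha * sum(rj * xj for rj, xj in zip(r, xs))  # -alpha * dL/dw1
        w0 += alpha * sum(r)                                 # -alpha * dL/dw0
    return w0, w1

print(fit_linear([2, 4, 6, 8], [2, 5, 5, 8]))  # close to (0.5, 0.9)
```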

Now, there are many different ways to apply linear functions in machine learning. ▶ 00:00 We have so far studied linear functions for regression, ▶ 00:08 but linear functions are also used for classification, ▶ 00:12 and specifically for an algorithm called the perceptron algorithm. ▶ 00:16 This algorithm happens to be a very early model of a neuron, ▶ 00:21 as in the neurons we have in our brains, ▶ 00:27 and was invented in the 1940s. ▶ 00:30 Suppose we are given a data set of positive samples and negative samples. ▶ 00:33 A linear separator is a linear equation that separates positive from negative examples. ▶ 00:41 Obviously, not all data sets possess a linear separator, but some do. ▶ 00:49 For those we can define the perceptron algorithm, and it actually converges. ▶ 00:55 To define a linear separator, let's start with our linear equation as before: ▶ 01:02 w1x + w0. In cases where x is higher dimensional, this might actually be a vector--never mind. ▶ 01:07 If this is larger than or equal to zero, then we call our classification 1. ▶ 01:18 Otherwise, we call it zero. ▶ 01:26 Here's our linear separation classification function, ▶ 01:30 where this is our common linear function. ▶ 01:35 Now, as I said, the perceptron only converges if the data is linearly separable, ▶ 01:39 and then it converges to a linear separation of the data, ▶ 01:45 which is quite amazing. ▶ 01:49 The perceptron is an iterative algorithm that is not dissimilar from gradient descent. ▶ 01:52 In fact, the update rule echoes that of gradient descent, and here's how it goes. ▶ 01:56 We start with a random guess for w1 and w0, ▶ 02:03 which may correspond to a random separation line ▶ 02:09 but usually is inaccurate. ▶ 02:13 Then the m-th weight wi is obtained by using the old weight plus some learning rate alpha ▶ 02:17 times the difference between the desired target label ▶ 02:29 and the target label produced by our function at the point m-1.
▶ 02:33 Now, this is an online learning rule, which means we don't process all the data in batch. ▶ 02:39 We process one data point at a time, and we might go through the data many, many times-- ▶ 02:45 hence the j over here-- ▶ 02:50 but every time we do this, we apply this rule over here. ▶ 02:52 What this rule gives us is a method to adapt our weights in proportion to the error. ▶ 02:55 If the prediction of our function f equals our target label, ▶ 03:03 then the error is zero, and no update occurs. ▶ 03:07 If there is a difference, however, we update in a way so as to minimize the error. ▶ 03:11 Alpha is a small learning rate. ▶ 03:18 Once again, the perceptron converges to a correct linear separator ▶ 03:22 if such a linear separator exists. ▶ 03:28 Now, the case of linear separation has recently received a lot of attention in machine learning. ▶ 03:31 If you look at the picture over here, you'll find there are many different linear separators. ▶ 03:36 There is one over here. There is one over here. There is one over here. ▶ 03:42 One of the questions that has recently been researched extensively is which one to prefer. ▶ 03:47 Is it A, B, or C? ▶ 03:53 Even though you probably have never seen this literature, ▶ 03:57 I will just ask your intuition in the following quiz. ▶ 04:01 Which linear separator would you prefer if you look at these three different linear separators: ▶ 04:05 A, B, C, or none of them? ▶ 04:10 ### (05:38) 49 Answer and SVMs
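The perceptron rule described above can be sketched for one-dimensional inputs with weights (w0, w1). The function name, learning rate, and toy data set are illustrative assumptions, not from the lecture:

```python
def perceptron_train(data, alpha=0.1, epochs=100):
    """Online perceptron: w_i <- w_i + alpha * (y - f(x)) * x_i,
    processing one (x, y) example at a time, for 1-D inputs."""
    w0, w1 = 0.0, 0.0
    for _ in range(epochs):
        for x, y in data:                       # online: one data point at a time
            f = 1 if w1 * x + w0 >= 0 else 0    # current classification
            err = y - f                         # zero error -> no update
            w1 += alpha * err * x
            w0 += alpha * err                   # bias weight sees a constant input of 1
    return w0, w1

# Linearly separable 1-D data: label 1 for the larger x values.
data = [(0, 0), (1, 0), (3, 1), (4, 1)]
w0, w1 = perceptron_train(data)
assert all((1 if w1 * x + w0 >= 0 else 0) == y for x, y in data)
```

Because the data is linearly separable, the loop converges to weights that classify every training example correctly, as the lecture promises.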

[Narrator] And intuitively I would argue it's B, ▶ 00:00 and the reason why is ▶ 00:04 C comes really close to examples. ▶ 00:06 So if these examples are noisy, ▶ 00:09 it's quite likely that, ▶ 00:12 by being so close to these examples, ▶ 00:14 future examples cross the line. ▶ 00:17 Similarly, A comes close to examples. ▶ 00:20 B is the one that stays really far away ▶ 00:23 from any example. ▶ 00:26 So there's this entire region over here ▶ 00:28 where there's no example anywhere near B. ▶ 00:31 This region is often called the margin. ▶ 00:34 The margin of the linear separator ▶ 00:37 is the distance of the separator ▶ 00:40 to the closest training example. ▶ 00:43 The margin is a really important concept ▶ 00:45 in machine learning. ▶ 00:47 There is an entire class of maximum margin ▶ 00:49 learning algorithms, ▶ 00:51 and the 2 most popular are ▶ 00:53 support vector machines and boosting. ▶ 00:56 If you are familiar with machine learning, ▶ 01:00 you've come across these terms. ▶ 01:02 These are very frequently used these days ▶ 01:04 in actual discrimination learning tasks. ▶ 01:07 I will not go into any details because it would go ▶ 01:10 way beyond the scope of this introduction ▶ 01:12 to artificial intelligence class, but let's say ▶ 01:16 a few abstract words specifically about ▶ 01:18 support vector machines, or SVMs. ▶ 01:21 As I said before, a support vector machine ▶ 01:25 derives a linear separator, and it takes ▶ 01:30 the one that actually maximizes the margin, ▶ 01:34 as shown over here. ▶ 01:39 By doing so it attains additional robustness ▶ 01:42 over the perceptron, which only picks ▶ 01:44 a linear separator without ▶ 01:46 consideration of the margin.
▶ 01:48 Now, the problem of finding the ▶ 01:51 margin-maximizing linear separator ▶ 01:53 can be solved by a quadratic program, ▶ 01:55 a standard optimization method for finding the best ▶ 01:59 linear separator that maximizes the margin. ▶ 02:03 One of the nice things that support ▶ 02:06 vector machines do in practice is ▶ 02:08 they use linear techniques to solve ▶ 02:12 nonlinear separation problems, ▶ 02:16 and I'm just going to give you a glimpse of ▶ 02:19 what's happening without going into any detail. ▶ 02:22 Suppose the data looks as follows: ▶ 02:25 we have a positive class ▶ 02:28 which is near the origin of a coordinate system ▶ 02:31 and a negative class that surrounds the positive class. ▶ 02:33 Clearly these 2 classes ▶ 02:37 are not linearly separable, ▶ 02:39 because there's no line I can draw that ▶ 02:41 separates the negative examples from the positive examples. ▶ 02:43 An idea that underlies SVMs, ▶ 02:47 and that has come to be known as ▶ 02:49 the kernel trick, ▶ 02:51 is to augment the feature set with new features. ▶ 02:53 Suppose this is X1, and this is X2, ▶ 02:56 and normally X1 and X2 ▶ 02:58 would be the input features. ▶ 03:00 In this example, you might derive ▶ 03:03 a 3rd one. ▶ 03:05 Let me pick a 3rd one. ▶ 03:07 Suppose X3 equals the square root of ▶ 03:09 X1 squared + X2 squared. ▶ 03:13 In other words, X3 is the distance ▶ 03:18 of any data point from the center ▶ 03:22 of the coordinate system. ▶ 03:25 Then things do become linearly separable, ▶ 03:27 so that just along the 3rd dimension ▶ 03:31 all the positive examples end up ▶ 03:33 being close to the origin ▶ 03:36 and all the negative examples ▶ 03:39 are further away, and a line ▶ 03:41 orthogonal to the 3rd input feature ▶ 03:43 solves the separation problem.
▶ 03:46 Mapped back into the space over here, that separator is actually a circle--the set of all ▶ 03:52 points with the same value of X3, that is, equidistant ▶ 03:55 from the origin. ▶ 04:00 Now, this trick could be done in any linear learning algorithm, ▶ 04:02 and it's really an amazing trick. ▶ 04:06 You can take any nonlinear problem, add ▶ 04:08 features of this type or any other type, ▶ 04:10 and use linear techniques ▶ 04:13 and get better solutions. ▶ 04:15 This is a very deep machine learning insight, ▶ 04:17 that you can extend your feature space ▶ 04:19 in this way, and there are numerous ▶ 04:21 papers written about this. ▶ 04:23 In SVMs, the extension of the feature space is mathematically done by ▶ 04:25 what's called a kernel. ▶ 04:31 I can't really tell you about this in this class, ▶ 04:33 but it makes it possible to derive ▶ 04:36 very large new feature spaces, including ▶ 04:38 infinitely dimensional new feature spaces. ▶ 04:41 These methods are very powerful. ▶ 04:44 It turns out you never ▶ 04:46 really compute all those features. ▶ 04:48 They are implicitly represented by ▶ 04:50 so-called kernels, and if you care about this, ▶ 04:52 I recommend you dive ▶ 04:55 deeper into the literature ▶ 04:57 on support vector machines. ▶ 04:59 This is meant to just give you ▶ 05:01 an overview of the essence of ▶ 05:03 what support vector machines are all about. ▶ 05:05 So in summary, ▶ 05:08 we learned about linear methods, ▶ 05:10 using them for regression ▶ 05:12 and also for classification. ▶ 05:15 We learned about exact solutions ▶ 05:17 versus iterative solutions. ▶ 05:19 We talked about smoothing, ▶ 05:23 and we even talked about ▶ 05:25 using linear methods for nonlinear problems.
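The feature-augmentation idea from the SVM discussion can be demonstrated concretely. This sketch uses made-up points (not from the lecture): a positive class near the origin, a negative class on a surrounding ring, and the extra feature X3 = sqrt(X1² + X2²), along which a simple threshold separates the classes:

```python
import math

# Illustrative data: positives near the origin, negatives surrounding them.
# Not linearly separable in the original (x1, x2) space.
positives = [(0.1, 0.2), (-0.3, 0.1), (0.2, -0.2)]
negatives = [(2.0, 0.1), (-1.8, 0.9), (0.3, -2.1)]

def x3(p):
    """Derived feature: distance from the origin, sqrt(x1^2 + x2^2)."""
    return math.sqrt(p[0] ** 2 + p[1] ** 2)

# Along the new dimension alone, a threshold (a separator orthogonal to x3)
# splits the classes; mapped back to (x1, x2), it is a circle of radius 1.
threshold = 1.0
assert all(x3(p) < threshold for p in positives)
assert all(x3(p) > threshold for p in negatives)
```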