Last update:Friday 16th of December 2011

## (1) Unit 0w

### (01:15) 1 Introduction

Welcome to the online introduction to artificial intelligence. ▶ 00:00 My name is Sebastian Thrun. >>I'm Peter Norvig. ▶ 00:04 We are teaching this class at Stanford, ▶ 00:07 and now we are teaching it online for the entire world. ▶ 00:09 We are really excited about this. ▶ 00:11 It's great to have you all here. ▶ 00:13 It's exciting to have such a record-breaking number of people. ▶ 00:14 We think we can deliver a good introduction to artificial intelligence. ▶ 00:18 We hope you'll stick with it. ▶ 00:22 It's going to be a lot of work, ▶ 00:24 but we think it's going to be very rewarding. ▶ 00:25 The way that it is going to be organized is that ▶ 00:27 every week there is going to be new videos and with these videos, quizes. ▶ 00:29 With these quizzes, you can test your knowledge about AI. ▶ 00:32 We also post for the advanced version of this class, homework assignments and exams ▶ 00:35 on which you'll be quizzed. ▶ 00:38 We're going to grade those to give you a final score to see ▶ 00:40 if you can actually master artificial intelligence the same way ▶ 00:44 any good student at Stanford would do it. ▶ 00:47 If you do that, then at the end of the class, we'll sign a letter of accomplishment, ▶ 00:49 and let you know that you've achieved this and what your rank in the class was. ▶ 00:54 So I hope you have fun. Watch us on videotape. ▶ 00:58 We will teach you AI. ▶ 01:02 Participate in the discussion forum. ▶ 01:04 Ask your questions, and help others answer questions. ▶ 01:06 I hope we have a fantastic time ahead of us in the next 10 weeks. ▶ 01:09 Welcome to the class. We'll see you online. ▶ 01:12

## (15) Unit 1w

### (02:00) 1 Introduction

Welcome to the first unit of Online Introduction to Artificial Intelligence. ▶ 00:00 I will be teaching you the very, very basics today. ▶ 00:05 This is Unit 1 of Artificial Intelligence. ▶ 00:09 Welcome. ▶ 00:14 The purpose of this class is twofold: ▶ 00:16 Number 1, to teach you the very basics of artificial intelligence ▶ 00:20 so you'll be able to talk to people in the field ▶ 00:25 and understand the basic tools of the trade; ▶ 00:29 and also, very importantly, to excite you about the field. ▶ 00:32 I have been in the field of artificial intelligence for about 20 years, ▶ 00:37 and it's been truly rewarding. ▶ 00:42 So I want you to participate in the beauty and the excitement of AI ▶ 00:44 so you can become a professional who gets the same reward ▶ 00:48 and excitement out of this field as I do. ▶ 00:52 The basic structure of this class involves videos ▶ 00:55 in which Peter or I will teach you something new, ▶ 01:00 then also quizzes, which we will ask you about your ability to answer AI questions, ▶ 01:03 and finally, answer videos in which we tell you what the right answer would have been ▶ 01:11 for the quiz that you might have falsely or incorrectly answered before. ▶ 01:17 This will all be reiterated, and every so often you get a homework assignment, ▶ 01:22 also in the form of quizzes but without the answers. ▶ 01:28 And then we also have video exams. ▶ 01:34 If you check our website, there's requirements ▶ 01:37 on how you have to do assignments and exams. ▶ 01:39 Please go to ai-class.org in this class. ▶ 01:43 An AI program is called wetware, a formula, or an intelligent agent. ▶ 01:48 Pick the one that fits best. ▶ 01:58 ### (01:34) 2 Intelligent Agents

[Thrun] The correct answer is intelligent agent. ▶ 00:00 Let's talk about intelligent agents. ▶ 00:04 Here is my intelligent agent, ▶ 00:07 and it gets to interact with an environment. ▶ 00:11 The agent can perceive the state of the environment ▶ 00:17 through its sensors, ▶ 00:22 and it can affect its state through its actuators. ▶ 00:25 The big question of artificial intelligence is the function that maps sensors to actuators. ▶ 00:29 That is called the control policy for the agent. ▶ 00:37 So all of this class will deal with how does an agent make decisions ▶ 00:41 that it can carry out with its actuators based on past sensor data. ▶ 00:48 Those decisions take place many, many times, ▶ 00:54 and the loop of environment feedback to sensors, agent decision, ▶ 00:58 actuator interaction with the environment and so on is called perception action cycle. ▶ 01:03 So here is my very first quiz for you. ▶ 01:12 Artificial intelligence, AI, has successfully been used in finance, ▶ 01:15 robotics, games, medicine, and the Web. ▶ 01:21 Check any or all of those that apply. ▶ 01:26 And if none of them applies, check the box down here that says none of them. ▶ 01:28 ### (06:28) 3 Applications of AI

So the correct answer is all of those-- ▶ 00:00 finance, robotics, games, medicine, the Web, and many more applications. ▶ 00:03 So let me talk about them in some detail. ▶ 00:08 There is a huge number of applications of artificial intelligence in finance, ▶ 00:10 very often in the shape of making trading decisions-- ▶ 00:15 in which case, the agent is called a trading agent. ▶ 00:18 And the environment might be things like the stock market or the bond market ▶ 00:21 or the commodities market. ▶ 00:27 And our trading agent can sense the course of certain things, ▶ 00:29 like the stock or bonds or commodities. ▶ 00:33 It can also read the news online and follow certain events. ▶ 00:35 And its decisions are usually things like buy or sell decisions--trades. ▶ 00:40 There's a huge history of artificial intelligence finding methods to look at data over time ▶ 00:48 and make predictions as to how courses develop over time-- ▶ 00:55 and then put in trades behind those. ▶ 00:58 And very frequently, people using artificial intelligence trading agents ▶ 01:01 have made a good amount of money with superior trading decisions. ▶ 01:06 There's also a long history of AI in Robotics. ▶ 01:10 Here is my depiction of a robot. ▶ 01:14 Of course, there are many different types of robots ▶ 01:17 and they all interact with their environments through their sensors, ▶ 01:20 which include things like cameras, microphones, tactile sensor or touch. ▶ 01:24 And the way they impact their environments is to move motors around, ▶ 01:33 in particular, their wheels, their legs, their arms, their grippers. ▶ 01:38 They can also say things to people using voice. ▶ 01:43 Now there's a huge history of using artificial intelligence in robotics. ▶ 01:46 Pretty much, every robot that does something interesting today uses AI. ▶ 01:50 In fact, often AI has been studied together with robotics, as one discipline. ▶ 01:54 But because robots are somewhat special in that they use physical actuators ▶ 01:58 and deal with physical environments, they are a little bit different from ▶ 02:03 just artificial intelligence, as a whole. ▶ 02:06 When the Web came out, the early Web crawlers were called robots ▶ 02:08 and to block a robot from accessing your website, to the present day, ▶ 02:15 there's a file called robot.txt, that allows you to deny any Web crawler ▶ 02:20 to access and retrieve that information from your website. ▶ 02:24 So historically, robotics played a huge role in artificial intelligence ▶ 02:28 and a good chunk of this class will be focusing on robotics. ▶ 02:32 AI has a huge history in games-- ▶ 02:36 to make games smarter or feel more natural. ▶ 02:39 There are 2 ways in which AI has been used in games, as a game agent. ▶ 02:43 One is to play against you, as a human user. ▶ 02:47 So for example, if you play the game of Chess, ▶ 02:50 then you are the environment to the game agent. ▶ 02:54 The game agent gets to observe your moves, and it generates its own moves ▶ 02:57 with the purpose of defeating you in Chess. ▶ 03:03 So most adversarial games, where you play against an opponent ▶ 03:07 and the opponent is a computer program, ▶ 03:10 the game agent is built to play against you--against your own interests--and make you lose. ▶ 03:13 And of course, your objective is to win. ▶ 03:20 That's an AI games-type situation. ▶ 03:22 The second thing is that games agents in AI ▶ 03:25 also are used to make games feel more natural. ▶ 03:29 So very often games have characters inside, and these characters act in some way. ▶ 03:32 And it's important for you, as the player, to feel that these characters are believable. ▶ 03:36 There's an entire sub-field of artificial intelligence to use AI ▶ 03:42 to make characters in a game more believable--look smarter, so to speak-- ▶ 03:45 so that you, as a player, think you're playing a better game. ▶ 03:51 Artificial intelligence has a long history in medicine as well. ▶ 03:55 The classic example is that of a diagnostic agent. ▶ 04:00 So here you are--and you might be sick, and you go to your doctor. ▶ 04:04 And your doctor wishes to understand ▶ 04:09 what the reason for your symptoms and your sickness is. ▶ 04:11 The diagnostic agent will observe you through various measurements-- ▶ 04:17 for example, blood pressure and heart signals, and so on-- ▶ 04:21 and it'll come up with the hypothesis as to what you might be suffering from. ▶ 04:25 But rather than intervene directly, in most cases the diagnostic of your disease ▶ 04:29 is communicated to the doctor, who then takes on the intervention. ▶ 04:34 This is called a diagnostic agent. ▶ 04:38 There are many other versions of AI in medicine. ▶ 04:40 AI is used in intensive care to understand whether there are situations ▶ 04:43 that need immediate attention. ▶ 04:48 It's been used for life-long medicine to monitor signs over long periods of time. ▶ 04:50 And as medicine becomes more personal, the role of artificial intelligence ▶ 04:54 will definitely increase. ▶ 04:58 We already mentioned AI on the Web. ▶ 05:01 The most generic version of AI is to crawl the Web and understand the Web, ▶ 05:05 and assist you in answering questions. ▶ 05:09 So when you have this search box over here ▶ 05:12 and it says "Search" on the left, ▶ 05:15 and "I'm Feeling Lucky" on the right, ▶ 05:18 and you type in the words, ▶ 05:20 what AI does for you is it understands what words you typed in ▶ 05:21 and finds the most relevant pages. ▶ 05:28 That is really co-artificial intelligence. ▶ 05:30 It's used by a number of companies, such as Microsoft and Google ▶ 05:32 and Amazon, Yahoo, and many others. ▶ 05:36 And the way this works is that there's a crawling agent that can go ▶ 05:39 to the World Wide Web and retrieve pages, through just a computer program. ▶ 05:43 It then sorts these pages into a big database inside the crawler ▶ 05:51 and also analyzes developments of each page to any possible query. ▶ 05:56 When you then come and issue a query, ▶ 06:01 the AI system is able to give you a response-- ▶ 06:04 for example, a collection of 10 best Web links. ▶ 06:08 In short, every time you try to write a piece of software, ▶ 06:12 that makes your computer software smart ▶ 06:15 likely you will need artificial intelligence. ▶ 06:18 And in this class, Peter and I will teach you ▶ 06:20 many of the basic tricks of the trade ▶ 06:23 to make your software really smart. ▶ 06:25 ### (05:17) 4 Terminology

It will be good to introduce some basic terminology ▶ 00:00 that is commonly used in artificial intelligence to distinguish different types of problems. ▶ 00:04 The very first word I will teach you is fully versus partially observable. ▶ 00:09 An environment is called fully observable if what your agent can sense ▶ 00:16 at any point in time is completely sufficient to make the optimal decision. ▶ 00:19 So, for example, in many card games, ▶ 00:26 when all the cards are on the table, the momentary site of all those cards ▶ 00:29 is really sufficient to make the optimal choice. ▶ 00:36 That is in contrast to some other environments where you need memory ▶ 00:40 on the side of the agent to make the best possible decision. ▶ 00:46 For example, in the game of poker, the cards aren't openly on the table, ▶ 00:50 and memorizing past moves will help you make a better decision. ▶ 00:55 To fully understand the difference, consider the interaction of an agent ▶ 01:00 with the environment to its sensors and its actuators, ▶ 01:04 and this interaction takes place over many cycles, ▶ 01:08 often called the perception-action cycle. ▶ 01:11 For many environments, it's convenient to assume ▶ 01:16 that the environment has some sort of internal state. ▶ 01:19 For example, in a card game where the cards are not openly on the table, ▶ 01:22 the state might pertain to the cards in your hand. ▶ 01:28 An environment is fully observable if the sensors can always see ▶ 01:33 the entire state of the environment. ▶ 01:37 It's partially observable if the sensors can only see a fraction of the state, ▶ 01:41 yet memorizing past measurements gives us additional information of the state ▶ 01:46 that is not readily observable right now. ▶ 01:52 So any game, for example, where past moves have information about ▶ 01:55 what might be in a person's hand, those games are partially observable, ▶ 02:01 and they require different treatment. ▶ 02:06 Very often agents that deal with partially observable environments ▶ 02:08 need to acquire internal memory to understand what ▶ 02:12 the state of the environment is, and we'll talk extensively ▶ 02:15 when we talk about hidden Markov models about how this structure ▶ 02:18 has such internal memory. ▶ 02:21 A second terminology for environments pertains to whether the environment ▶ 02:23 is deterministic or stochastic. ▶ 02:26 Deterministic environment is one where your agent's actions ▶ 02:29 uniquely determine the outcome. ▶ 02:35 So, for example, in chess, there's really no randomness when you move a piece. ▶ 02:37 The effect of moving a piece is completely predetermined, ▶ 02:42 and no matter where I'm going to move the same piece, the outcome is the same. ▶ 02:46 That we call deterministic. ▶ 02:50 Games with dice, for example, like backgammon, are stochastic. ▶ 02:52 While you can still deterministically move your pieces, ▶ 02:56 the outcome of an action also involves throwing of the dice, ▶ 03:00 and you can't predict those. ▶ 03:03 There's a certain amount of randomness involved for the outcome of dice, ▶ 03:05 and therefore, we call this stochastic. ▶ 03:08 Let me talk about discrete versus continuous. ▶ 03:10 A discrete environment is one where you have finitely many action choices, ▶ 03:14 and finitely many things you can sense. ▶ 03:18 So, for example, in chess, again, there's finitely many board positions, ▶ 03:21 and finitely many things you can do. ▶ 03:25 That is different from a continuous environment ▶ 03:28 where the space of possible actions or things you could sense may be infinite. ▶ 03:30 So, for example, if you throw darts, there's infinitely many ways to angle the darts ▶ 03:35 and to accelerate them. ▶ 03:41 Finally, we distinguish benign versus adversarial environments. ▶ 03:43 In benign environments, the environment might be random. ▶ 03:49 It might be stochastic, but it has no objective on its own ▶ 03:53 that would contradict the own objective. ▶ 03:57 So, for example, weather is benign. ▶ 03:59 It might be random. It might affect the outcome of your actions. ▶ 04:02 But it isn't really out there to get you. ▶ 04:06 Contrast this with adversarial environments, such as many games, like chess, ▶ 04:08 where your opponent is really out there to get you. ▶ 04:14 It turns out it's much harder to find good actions in adversarial environments ▶ 04:16 where the opponent actively observes you and counteracts what you're trying to achieve ▶ 04:21 relative to benign environment, where the environment might merely be stochastic ▶ 04:26 but isn't really interested in making your life worse. ▶ 04:30 So, let's see to what extent these expressions make sense to you ▶ 04:35 by going to our next quiz. ▶ 04:38 So here are the 4 concepts again: partially observable versus fully, ▶ 04:40 stochastic versus deterministic, continuous versus discrete, ▶ 04:45 adversarial versus benign. ▶ 04:50 And let me ask you about the game of checkers. ▶ 04:52 Check one or all of those attributes that apply. ▶ 04:56 So, if you think checkers is partially observable, check this one. ▶ 05:00 Otherwise, just don't check it. ▶ 05:03 If you think it's stochastic, check this one, ▶ 05:05 continuous, check this one, adversarial, check this one. ▶ 05:07 If you don't know about checkers, you can check the Web and Google it ▶ 05:11 to find a little more information about checkers. ▶ 05:15 ### (00:52) 5 Checkers Answer

So, checkers is an interesting game. ▶ 00:00 Here's the typical board of the game of checkers. ▶ 00:04 Your pieces might look like this, ▶ 00:08 and your opponent's pieces might look like this. ▶ 00:11 And apart from some very cryptic rules in checkers, ▶ 00:16 which I won't really discuss here, the board basically tells you ▶ 00:19 everything there is to know about checkers, so it's clearly fully observable. ▶ 00:23 It is deterministic because your move and your opponent's move ▶ 00:28 very clearly affect the state of the board in ways that have ▶ 00:33 absolutely no stochasticity. ▶ 00:36 It is also discrete because there's finitely many action choices ▶ 00:39 and finitely many board positions, ▶ 00:45 and obviously, it is adversarial, since your opponent is out to get you. ▶ 00:47 ### (00:12) 6 Poker

[Male narrator] The game of poker--is this partially observable, stochastic, ▶ 00:00 continuous, or adversarial? ▶ 00:06 Please check any or all of those that apply. ▶ 00:09 ### (00:30) 7 Poker Answer

[Male narrator] I would argue poker is partially observable ▶ 00:00 because it can't be seen what is in your opponent's hands. ▶ 00:03 It is stochastic because you're being dealt cards that are kind of coming at random. ▶ 00:08 It is not continuous; it's just finding many cards ▶ 00:13 and finding many actions you can do, even though you might argue ▶ 00:16 that there's a huge number of different monies you can bet. ▶ 00:20 It's still finite, and it is clearly adversarial. ▶ 00:24 If you've ever played poker before, you know how brutal it can be. ▶ 00:27 ### (00:22) 8 Robotic Car

[Male narrator] --a favorite, a robotic car. ▶ 00:00 I wish to know whether it is partially observable, ▶ 00:04 stochastic, continuous, or adversarial. ▶ 00:06 That is, is the problem of driving robotically-- ▶ 00:11 say, in a city--subject to any of those 4 categories? ▶ 00:16 Please check any or all that might apply. ▶ 00:20 ### (00:33) 9 Robotic Car Answer

Well, the robotic car clearly deals with a partially observable environment ▶ 00:00 if you just look at momentary sensing input, you can't even tell how fast other cars are going. ▶ 00:04 So, you need to memorize something. ▶ 00:10 It is stochastic because it's inherently unpredictable ▶ 00:12 what's going to happen next with other cars. ▶ 00:15 It is continuous. ▶ 00:17 There's the infinitely many ways to set your steering ▶ 00:20 or push your gas pedal or your brake, ▶ 00:23 and, well, you can argue with adversarial or not. ▶ 00:26 Depending on where you live, it might be highly adversarial. ▶ 00:29 Where I live, it isn't. ▶ 00:31 ### (01:28) 10 AI and Uncertainty

I'm going to briefly talk of AI as something else, ▶ 00:00 which is AI is the technique of uncertainty management in computer software. ▶ 00:03 Put differently, AI is the discipline that you apply when you want to know what to do ▶ 00:10 when you don't know what to do. ▶ 00:17 Now, there's many reasons why there might be uncertainty in a computer program. ▶ 00:22 There could be a sensor limit. ▶ 00:27 That is, your sensors are unable to tell me ▶ 00:29 what exactly is the case outside the AI system. ▶ 00:33 There could be adversaries who act in a way that makes it hard for you ▶ 00:37 to understand what is the case. ▶ 00:41 There could be stochastic environments. ▶ 00:44 Every time you roll the dice in a dice game, ▶ 00:48 the stochasticity of the dice will make it impossible for you ▶ 00:51 to be absolutely certain of what's the situation. ▶ 00:55 There could be laziness. ▶ 00:57 So perhaps you can actually compute what the situation is, ▶ 01:00 but your computer program is just too lazy to do it. ▶ 01:04 And here's my favorite: ignorance, plain ignorance. ▶ 01:07 Many people are just ignorant of what's going on. ▶ 01:11 They could know it, but they just don't care. ▶ 01:14 All of these things are cause for uncertainty. ▶ 01:17 AI is the discipline that deals with uncertainty and manages it in decision making. ▶ 01:21 ### (04:00) 11 Examples of AI in Practice

Now we've had an introduction to AI. ▶ 00:00 We've heard about some of the properties of environments, ▶ 00:03 and we've seen some possible architecture for agents. ▶ 00:06 I'd like next to show you some examples of AI in practice. ▶ 00:10 And Sebastian and I have some experience personally in things we have done ▶ 00:13 at Google, at NASA, and at Stanford. ▶ 00:18 And I want to tell you a little bit about some of those. ▶ 00:21 One of the best successes of AI technology at Google ▶ 00:25 has been the machine translation system. ▶ 00:28 Here we see an example of an article in Italian automatically translated into English. ▶ 00:31 Now, these systems are built for 50 different languages, ▶ 00:37 and we can translate from any of the languages into any of the other languages. ▶ 00:41 So, that's over 2,500 different systems, and we've done this all ▶ 00:46 using machine learning techniques, using AI techniques, ▶ 00:51 rather than trying to build them by hand. ▶ 00:55 And the way it works is that we go out and collect examples of text ▶ 00:58 that's a line between the 2 languages. ▶ 01:03 So we find, say, a newspaper that publishes 2 editions, ▶ 01:06 an Italian edition and an English edition, and now we have examples of translations. ▶ 01:11 And if anybody ever asked us for exactly the translation of this one particular article, ▶ 01:16 then we could just look it up and say "We already know that." ▶ 01:22 But of course, we aren't often going to be asked that. ▶ 01:25 Rather, we're going to be asked parts of this. ▶ 01:27 Here are some words that we've seen before, and we have to figure out ▶ 01:30 which words in this article correspond to which words in the translation article. ▶ 01:34 And when we do that by examining many, many millions of words of text ▶ 01:40 in the 2 languages and making the correspondence, ▶ 01:45 and then we can put that all together. ▶ 01:49 And then when we see a new example of text that we haven't seen before, ▶ 01:51 we can just look up what we've seen in the past for that correspondence. ▶ 01:54 So, the task is really two parts. ▶ 01:58 Off-line, before we see an example of text we want to translate, ▶ 02:01 we first build our translation model. ▶ 02:05 We do that by examining all of the different examples ▶ 02:07 and figuring out which part aligns to which. ▶ 02:10 Now, when we're given a text to translate, we use that model, ▶ 02:14 and we go through and find the most probable translation. ▶ 02:18 So, what does it look like? ▶ 02:22 Well, let's look at it in some example text. ▶ 02:24 And rather than look at news articles, I'm going to look at something simpler. ▶ 02:26 I'm going to switch from Italian to Chinese. ▶ 02:29 Here's a bilingual text. ▶ 02:35 Now, for a large-scale machine translation, examples are found on the Web. ▶ 02:37 This example was found in a Chinese restaurant by Adam Lopez. ▶ 02:41 Now, it's given, for a text of this form, ▶ 02:46 that a line in Chinese corresponds to a line in English, ▶ 02:49 and that's true for each of the individual lines. ▶ 02:55 But to learn from this text, what we really want to discover ▶ 02:59 is what individual words in Chinese correspond to individual words ▶ 03:02 or small phrases in English. ▶ 03:07 I've started that process by highlighting the word "wonton" in English. ▶ 03:09 It appears 3 times throughout the text. ▶ 03:16 Now, in each of those lines, there's a character that appears, ▶ 03:18 and that's the only place in the Chinese text where that character appears. ▶ 03:23 So, that seems like it's a high probability that this character in Chinese ▶ 03:27 corresponds to the word "wonton" in English. ▶ 03:33 Let's see if we can go farther. ▶ 03:36 My question for you is what word or what character or characters in Chinese ▶ 03:38 correspond to the word "chicken" in English? ▶ 03:44 And here we see "chicken" appears in these locations. ▶ 03:47 Click on the character or characters in Chinese that corresponds to "chicken." ▶ 03:54 ### (00:44) 12 Chinese Translation Answer

The answer is that chicken appears here, ▶ 00:01 here, here, and here. ▶ 00:04 Now, I don't know for sure, 100%, that that is the character for chicken in Chinese, ▶ 00:10 but I do know that there is a good correspondence. ▶ 00:14 Every place the word chicken appears in English, ▶ 00:17 this character appears in Chinese and no other place. ▶ 00:20 Let's go 1 step farther. ▶ 00:24 Let's see if we can work out a phrase in Chinese ▶ 00:27 and see if it corresponds to a phrase in English. ▶ 00:30 Here's the phrase corn cream. ▶ 00:33 Click on the characters in Chinese that correspond to corn cream. ▶ 00:38 ### (00:29) 13 Chinese Translation Answer 2

The answer is: these 2 characters here ▶ 00:00 appear only in these 2 locations ▶ 00:04 corresponding to the words corn cream ▶ 00:07 which appear only in these locations in the English text. ▶ 00:10 Again, we're not 100% sure that's the right answer, ▶ 00:13 but it looks like a strong correlation. ▶ 00:17 Now, 1 more question. ▶ 00:20 Tell me what character or characters in Chinese ▶ 00:22 correspond to the English word soup. ▶ 00:26 ### (00:48) 14 Chinese Translation Answer 3

The answer is that soup occurs in most of these phrases ▶ 00:00 but not 100% of them. ▶ 00:09 It's missing in this phrase. ▶ 00:11 Equivalently, on the Chinese side ▶ 00:14 we see this character occurs ▶ 00:17 in most of the phrases, ▶ 00:20 but it's missing here. ▶ 00:23 So we see that the correspondence doesn't have to be 100% ▶ 00:27 to tell us that there is still a good chance of a correlation. ▶ 00:31 When we're learning to do machine translation ▶ 00:34 we use these kinds of alignments to learn probability tables ▶ 00:37 of what is the probability of one phrase in one language ▶ 00:41 corresponding to the phrase in another language. ▶ 00:45 ### (00:56) 15 Congratulations

So congratulations, you just finished unit 1. ▶ 00:00 You just finished unit 1 of this class, ▶ 00:03 where I told you about key applications ▶ 00:07 of artificial intelligence, ▶ 00:10 I told you about the definition of an intelligent agent, ▶ 00:13 I gave you 4 key attributes of intelligent agents ▶ 00:18 (partial observability, stochasticity, continuous spaces, and adversarial natures), ▶ 00:24 I discussed sources and management of uncertainty, ▶ 00:31 and I briefly mentioned the mathematical concept of rationality. ▶ 00:34 Obviously, I only touched any of these issues superficially, ▶ 00:40 but as this class goes on you're going to dive into any of those ▶ 00:45 and learn much more about ▶ 00:49 what it takes to make a truly intelligent AI system. ▶ 00:51 Thank you. ▶ 00:55

## (42) Unit 2

### (01:34) Topic 1, Introduction

[PROBLEM SOLVING] ▶ 00:00 In this unit we're going to talk about problem solving. ▶ 00:01 The theory and technology of building agents ▶ 00:04 that can plan ahead to solve problems. ▶ 00:06 In particular, we're talking about problem solving ▶ 00:10 where the complexity of the problem comes from the idea that there are many states. ▶ 00:13 As in this problem here. ▶ 00:17 A navigation problem where there are many choices to start with. ▶ 00:19 And the complexity comes from picking the right choice now and picking the right choice at the ▶ 00:24 next intersection and the intersection after that. ▶ 00:29 Streaming together a sequence of actions. ▶ 00:32 This is in contrast to the type of complexity shown in this picture, ▶ 00:35 where the complexity comes from the partial observability ▶ 00:39 that we can't see through the fog where the possible paths are. ▶ 00:43 We can't see the results of our actions ▶ 00:46 and even the actions themselves are not known. ▶ 00:48 This type of complexity will be covered in a later unit. ▶ 00:51 Here's an example of a problem. ▶ 00:56 This is a route-finding problem where we're given a start city, ▶ 00:58 in this case, Arad, and a destination, Bucharest, the capital of Romania, ▶ 01:03 from which this is a corner of the map. ▶ 01:09 And the problem then is to find a route from Arad to Bucharest. ▶ 01:11 The actions that the agent can execute when driving ▶ 01:16 from one city to the next along one of the roads shown on the map. ▶ 01:20 The question is, is there a solution that the agent can come up with ▶ 01:23 given the knowledge shown here to the problem of driving from Arad to Bucharest? ▶ 01:28 ### (04:29) Topic 2, Route Finding Question

And the answer is no. ▶ 00:00 There is no solution that the agent can come up with ▶ 00:03 because Bucharest doesn't appear on the map, ▶ 00:06 and so the agent doesn't know any actions that can arrive there. ▶ 00:08 So let's give the agent a better chance. ▶ 00:12 Now we've given the agent the full map of Romania. ▶ 00:19 To start, he's in Arad, and the destination--or goal--is in Bucharest. ▶ 00:23 And the agent is given the problem of coming up with a sequence of actions ▶ 00:30 that will arrive at the destination. ▶ 00:35 Now, is it possible for the agent to solve this problem? ▶ 00:37 And the answer is yes. ▶ 00:43 There are many routes or steps or sequences of actions that will arrive at the destination. ▶ 00:45 Here is one of them: ▶ 00:50 Starting out in Arad, taking this step first, then this one, then this one, ▶ 00:53 then this one, and then this one to arrive at the destination. ▶ 01:00 So that would count as a solution to the problem. ▶ 01:05 So sequence of actions, chained together, that are guaranteed to get us to the goal. ▶ 01:08 [DEFINITION OF A PROBLEM] ▶ 01:12 Now let's formally define what a problem looks like. ▶ 01:14 A problem can be broken down into a number of components. ▶ 01:17 First, the initial state that the agent starts out with. ▶ 01:21 In our route finding problem, the initial state was the agent being in the city of Arad. ▶ 01:25 Next, a function--Actions--that takes a state as input and returns ▶ 01:32 a set of possible actions that the agent can execute when the agent is in this state. ▶ 01:41 [ACTIONS (s) {a,a2,a3...}] ▶ 01:47 In some problems, the agent will have the same actions available in all states ▶ 01:50 and in other problems, he'll have different actions dependent on the state. ▶ 01:54 In the route finding problem, the actions are dependent on the state. ▶ 01:58 When we're in one city, we can take the routes to the neighboring cities-- ▶ 02:02 but we can't go to any other cities. ▶ 02:06 Next we have a function called Result, which takes, as input, a state and an action ▶ 02:09 and delivers, as its output, a new state. ▶ 02:20 So, for example, if the agent is in the city of Arad, and takes--that would be the state-- ▶ 02:24 and takes the action of driving along Route E-671 towards Timisoara, ▶ 02:33 then the result of applying that action in that state would be the new state-- ▶ 02:40 where the agent is in the city of Timisoara. ▶ 02:45 Next, we need a function called Goal Test, ▶ 02:51 which takes a state and returns a Boolean value-- ▶ 02:58 true or false--telling us if this state is a goal or not. ▶ 03:04 In a route-finding problem, the only goal would be being in the destination city-- ▶ 03:09 the city of Bucharest--and all the other states would return false for the Goal Test. ▶ 03:14 And finally, we need one more thing which is a Path Cost function-- ▶ 03:19 which takes a path, a sequence of state/action transitions, ▶ 03:28 and returns a number, which is the cost of that path. ▶ 03:40 Now, for most of the problems we'll deal with, we'll make the Path Cost function be additive ▶ 03:44 so that the cost of the path is just the sum of the costs of the individual steps. ▶ 03:50 And so we'll implement this Path Cost function, in terms of a Step Cost function. ▶ 03:56 The Step Cost function takes a state, an action, and the resulting state from that action ▶ 04:04 and returns a number--n--which is the cost of that action. ▶ 04:14 In the route finding example, the cost might be the number of miles traveled ▶ 04:18 or maybe the number of minutes it takes to get to that destination. ▶ 04:24 ### (02:55) Topic 3, Route Finding

Now let’s see how the definition of a problem ▶ 00:00 maps onto the route finding, the domain. ▶ 00:06 First, the initial state was given. ▶ 00:10 Let’s say we start off in Arad, ▶ 00:12 and the goal test, ▶ 00:15 let’s say that the state of being in Bucharest ▶ 00:17 is the only state that counts as a goal, ▶ 00:22 and all the other states are not goals. ▶ 00:24 Now the set of all of the states here ▶ 00:26 is known as the state space, ▶ 00:29 and we navigate the state space by applying actions. ▶ 00:31 The actions are specific to each city, ▶ 00:35 so when we are in Arad, there are three possible actions, ▶ 00:39 to follow this road, this one, or this one. ▶ 00:42 And as we follow them, we build paths ▶ 00:46 or sequences of actions. ▶ 00:49 So just being in Arad is the path of length zero, ▶ 00:51 and now we could start exploring the space ▶ 00:55 and add in this path of length one, ▶ 00:58 this path of length one, ▶ 01:01 and this path of length one. ▶ 01:03 We could add in another path here of length two ▶ 01:06 and another path here of length two. ▶ 01:11 Here is another path of length two. ▶ 01:14 Here is a path of length three. ▶ 01:17 Another path of length two, and so on. ▶ 01:21 Now at ever point, ▶ 01:26 we want to separate the state out into three parts. ▶ 01:28 First, the ends of the paths— ▶ 01:34 The farthest paths that have been explored, ▶ 01:37 we call the frontier. ▶ 01:40 And so the frontier in this case ▶ 01:42 consists of these states ▶ 01:46 that are the farthest out we have explored. ▶ 01:51 And then to the left of that in this diagram, ▶ 01:55 we have the explored part of the state. ▶ 01:59 And then off to the rigtht, ▶ 02:02 we have the unexplored. ▶ 02:04 So let’s write down those three components. ▶ 02:06 We have the frontier. ▶ 02:09 We have the unexplored region, ▶ 02:15 and we have the explored region. ▶ 02:20 One more thing, ▶ 02:25 in this diagram we have labeled the step cost ▶ 02:27 of each action along the route. ▶ 02:30 So the step cost of going between Neamt to Iasi ▶ 02:33 would be 87 corresponding to a distance of 87 kilometers, ▶ 02:37 and the path cost is just the sum of the step costs. ▶ 02:42 So the cost of the path ▶ 02:46 of going from Arad to Oradea ▶ 02:48 would be 71 plus 75. ▶ 02:50 ### (03:19) Topic 4, Tree Search

[Narrator] Now let's define a function for solving problems. ▶ 00:00 It's called Tree Search because it superimposes ▶ 00:04 a search tree over the state space. ▶ 00:07 Here's how it works: It starts off by ▶ 00:10 initializing the frontier to be the path ▶ 00:12 consisting of only the initial states, ▶ 00:14 and then it goes into a loop ▶ 00:16 in which it first checks to see ▶ 00:18 do we still have anything left in the frontier? ▶ 00:21 If not we fail, there can be no solution. ▶ 00:23 If we do have something, then we make a choice. ▶ 00:25 Tree Search is really a family of functions ▶ 00:28 not a single algorithm which ▶ 00:31 depends on how we make that choice, ▶ 00:33 and we'll see some of the options later. ▶ 00:35 If we go ahead and make a choice of one of ▶ 00:38 the paths on the frontier and remove that ▶ 00:41 path from the frontier, we find the state ▶ 00:43 which is at the end of the path, and if that ▶ 00:45 state's a go then we're done. ▶ 00:47 We found a path to the goal; otherwise, ▶ 00:49 we do what's called expanding that path. ▶ 00:51 We look at all the actions from that state, ▶ 00:54 and we add to the path the actions ▶ 00:57 and the result of that state; so we get ▶ 01:00 a new path that has the old path, the action ▶ 01:03 and the result of that action, and we ▶ 01:06 stick all of those paths back onto the frontier. ▶ 01:09 Now Tree Search represents a whole family ▶ 01:17 of algorithms, and where you get the family ▶ 01:19 resemblance is that they're all looking ▶ 01:22 at the frontier, copying items off and ▶ 01:24 and looking to see if their goal tests, ▶ 01:26 but where you get the difference is right here, ▶ 01:29 in the choice of how you're going to expand ▶ 01:31 the next item on the frontier, which ▶ 01:34 path do we look at first, and we'll go through ▶ 01:36 different sets of algorithms that make ▶ 01:39 different choices for which path to look at first. ▶ 01:42 The first algorithm I want to consider ▶ 01:47 is called Breadth-First Search. ▶ 01:49 Now it could be called shortest-first search ▶ 01:51 because what it does is always choose ▶ 01:54 of the frontier one of the paths that hadn't been ▶ 01:56 considered yet that's the shortest possible. ▶ 01:59 So how does it work? ▶ 02:02 Well we start off with the path of ▶ 02:04 length 0, starting in the start state, and ▶ 02:06 that's the only path in the frontier so ▶ 02:10 it's the shortest one so we pick it, ▶ 02:13 and then we expand it, and we add in ▶ 02:15 all the paths that result from ▶ 02:17 applying all the possible actions. ▶ 02:20 So now we've removed ▶ 02:22 this path from the frontier, ▶ 02:25 but we've added in 3 new paths. ▶ 02:28 This one, ▶ 02:31 this one, and this one. ▶ 02:33 Now we're in a position where ▶ 02:37 we have 3 paths on the frontier, and ▶ 02:39 we have to pick the shortest one. ▶ 02:42 Now in this case all 3 paths ▶ 02:45 have the same length, length 1, so we ▶ 02:47 break the tie at random or using some ▶ 02:50 other technique, and let's suppose that ▶ 02:52 in this case we choose this path ▶ 02:56 from Arad to Sibiu. ▶ 02:58 Now the question I want you to answer ▶ 03:00 is once we remove that from the frontier, ▶ 03:03 what paths are we going to add next? ▶ 03:09 So show me by checking off the cities ▶ 03:11 that ends the paths, which paths ▶ 03:14 are going to be added to the frontier? ▶ 03:16 ### (02:54) Topic 5, Tree Search Answer

[Male narrator] The answer is that in Sibiu, the action function gives us 4 actions ▶ 00:00 corresponding to traveling along these 4 roads, ▶ 00:06 so we have to add in paths for each of those actions. ▶ 00:09 One of those paths goes here, ▶ 00:15 the other path continues from Arad and goes out here. ▶ 00:17 The third path continues out here ▶ 00:21 and then the fourth path goes from here--from Arad to Sibiu ▶ 00:25 and then backtracks back to Arad. ▶ 00:31 Now, it may seem silly and redundant to have a path that starts in Arad, ▶ 00:36 goes to Sibiu and returns to Arad. ▶ 00:41 How can that help us get to our destination in Bucharest? ▶ 00:44 But we can see if we're dealing with a tree search, ▶ 00:49 why it's natural to have this type of formulation ▶ 00:52 and why the tree search doesn't even notice that it's backtracked. ▶ 00:56 What the tree search does is superimpose on top of the state space ▶ 01:00 a tree of searches, and the tree looks like this. ▶ 01:05 We start off in state A, and in state A, there were 3 actions, ▶ 01:09 so we gave those paths going to Z, S, and T. ▶ 01:15 And from S, there were 4 actions, so that gave us paths going from O, F, R, and A, ▶ 01:21 and then the tree would continue on from here. ▶ 01:34 We'd take one of the next items ▶ 01:37 and we'd move it and continue on, but notice that we returned to the A state ▶ 01:40 in the state space, but in the tree, ▶ 01:48 it's just another item in the tree. ▶ 01:51 Now, here's another representation of the search space ▶ 01:55 and what's happening is as we start to explore the state, ▶ 01:57 we keep track of the frontier, which is the set of states that are at the end of the paths ▶ 02:01 that we haven't explored yet, and behind that frontier ▶ 02:09 is the set of explored states, and ahead of the frontier is the unexplored states. ▶ 02:13 Now the reason we keep track of the explored states ▶ 02:19 is that when we want to expand and we find a duplicate-- ▶ 02:22 so say when we expand from here, if we pointed back to state T, ▶ 02:27 if we hadn't kept track of that, we would have to add in a new state for T down here. ▶ 02:33 But because we've already seen it and we know that this is actually a regressive step ▶ 02:42 into the already explored state, now, because we kept track of that, ▶ 02:47 we don't need it anymore. ▶ 02:51 ### (01:05) Topic 6, Graph Search

Now we see how to modify the Tree Search Function ▶ 00:00 to make it be a Graph Search Function ▶ 00:04 to avoid those repeated paths. ▶ 00:06 What we do, is we start off and initialize a set ▶ 00:09 called the explored set of states that we have already explored. ▶ 00:13 Then, when we consider a new path, ▶ 00:16 we add the new state to the set of already explored states, ▶ 00:19 and then when we are expanding the path ▶ 00:23 and adding in new states to the end of it, ▶ 00:26 we don’t add that in if we have already seen that new state ▶ 00:29 in either the frontier or the explored. ▶ 00:33 Now back to Breadth First Search. ▶ 00:37 Let’s assume we are using the Graph Search ▶ 00:39 so that we have eliminated the duplicate paths. ▶ 00:41 Arad is crossed off the list. ▶ 00:44 The path that goes from Arad to Sibiu ▶ 00:47 and back to Arad is removed, ▶ 00:49 and we are left with these one, two, three, ▶ 00:51 four, five possible paths. ▶ 00:53 Given these 5 paths, ▶ 00:57 show me which ones are candidates to be expanded next ▶ 00:59 by the Breadth First Search Algorithm. ▶ 01:02 ### (00:42) Topic 7, Graph Search Answer

[Male narrator] And the answer is that Breadth - First Search always considers ▶ 00:00 the shortest paths first, and in this case, there's 2 paths of length 1, ▶ 00:03 and 1, the paths from Arad to Zerind and Arad to Timisoara, ▶ 00:08 so those would be the 2 paths that would be considered. ▶ 00:12 Now, let's suppose that the tie is broken in some way ▶ 00:15 and we chose this path from Arad to Zerind. ▶ 00:18 Now, we want to expand that node. ▶ 00:22 We remove it from the frontier and put it in the explored list ▶ 00:25 and now we say, "What paths are we going to add?" ▶ 00:31 So check off the ends of the paths the cities that we're going to add. ▶ 00:35 ### (00:13) Topic 8, Graph Search Answer

[Male narrator] In this case, there's nothing to add ▶ 00:00 because of the 2 neighbors, 1 is in the explored list and 1 is in the frontier, ▶ 00:03 and if we're using graph search, then we won't add either of those. ▶ 00:09 ### (00:38) Topic 9, More Graph Search

[Male narrator] So we move on, we look for another shortest path. ▶ 00:00 There's one path left of length 1, so we look at that path, we expand it, ▶ 00:04 add in this path, put that one on the explored list, ▶ 00:11 and now we've got 3 paths of length 2. ▶ 00:16 We choose 1 of them, and let's say we choose this one. ▶ 00:20 Now, my question is show me which states we add to the path ▶ 00:23 and tell me whether we're going to terminate the algorithm at this point ▶ 00:30 because we've reached the goal or whether we're going to continue. ▶ 00:35 ### (00:29) Topic 10, Graph Search Answer

[Male narrator] The answer is that we add 1 more path, the path to Bucharest. ▶ 00:00 We don't add the path going back because it's in the explored list, ▶ 00:08 but we don't terminate it yet. ▶ 00:11 True, we have added a path that ends in Bucharest, ▶ 00:13 but the goal test isn't applied when we add a path to the frontier. ▶ 00:16 Rather, it's applied when we remove that path from the frontier, ▶ 00:22 and we haven't done that yet. ▶ 00:26 ### (01:30) Topic 11, Graph Search Termination

[Male narrator] Now, why doesn't the general tree search or graph search algorithm stop ▶ 00:00 when it adds a goal node to the frontier? ▶ 00:06 The reason is because it might not be the best path to the goal. ▶ 00:09 Now, here we found a path of length 2 ▶ 00:13 and we added a path of length 3 that reached the goal. ▶ 00:16 The general graph search or tree search doesn't know ▶ 00:21 that there might be some other path that we could expand ▶ 00:24 that would have a distance of say, 2-1/2, ▶ 00:27 but there's an optimization that could be made. ▶ 00:30 If we know we're doing Breadth - First Search ▶ 00:33 and we know there's no possibility of a path of length 2-1/2. ▶ 00:35 Then we can change algorithm so that it checks states ▶ 00:40 as soon as they're added to the frontier ▶ 00:44 rather than waiting until they're expanded ▶ 00:46 and in that case, we can write a specific Breadth - First Search routine ▶ 00:49 that terminates early and gives us a result as soon as we add a goal state to the frontier. ▶ 00:53 Breadth - First Search will find this path ▶ 01:01 that ends up in Bucharest, and if we're looking for the shortest path ▶ 01:04 in terms of number of steps, ▶ 01:08 Breadth - First Search is guaranteed to find it, ▶ 01:10 But if we're looking for the shortest path in terms of total cost ▶ 01:12 by adding up the step costs, then it turns out ▶ 01:17 that this path is shorter than the path found by Breadth - First Search. ▶ 01:21 So let's look at how we could find that path. ▶ 01:26 ### (00:51) Topic 12, Uniform Cost Search

An algorithm that has traditionally been called uniform-cost search ▶ 00:00 but could be called cheapest-first search, ▶ 00:05 is guaranteed to find the path with the cheapest total cost. ▶ 00:08 Let's see how it works. ▶ 00:11 We start out as before in the start state. ▶ 00:14 And we pop that empty path off. ▶ 00:19 Move it from the frontier to explored, ▶ 00:24 and then add in the paths out of that state. ▶ 00:28 As before, there will be 3 of those paths. ▶ 00:33 And now, which path are we going to pick next ▶ 00:39 in order to expand according to the rules of cheapest first? ▶ 00:43 ### (00:44) Topic 13, Uniform Cost Search

Cheapest first says that we pick the path with ▶ 00:00 the lowest total cost. ▶ 00:04 And that would be this path. ▶ 00:06 It has a cost of 75 compared to the cost of 118 and 140 ▶ 00:07 for the other paths. ▶ 00:13 So we get here. We take that path off the frontier, ▶ 00:14 put it on the explored list, add in its neighbors. ▶ 00:19 Not going back to Arad, ▶ 00:23 but adding in this new path. ▶ 00:26 Summing up the total cost of that path, ▶ 00:30 71 + 75 is 146 for this path. ▶ 00:33 And now the question is, ▶ 00:40 which path gets expanded next? ▶ 00:41 ### (00:56) Topic 14, Uniform Cost Search

Of the 3 paths on the frontier, we have ones ▶ 00:00 with a cost of 146, 140, and 118. ▶ 00:05 And that's the cheapest, so this one gets expanded. ▶ 00:10 We take it off the frontier, move it to explored, ▶ 00:13 add in its successors. In this case it's only 1. ▶ 00:16 And that has a path total of 229. ▶ 00:21 Which path do we expand next? ▶ 00:29 Well, we've got 146, 140, and 229 ▶ 00:30 So 140 is the lowest. ▶ 00:33 Take it off the frontier. Put it on explored. ▶ 00:38 Add in this path ▶ 00:41 for a total cost of 220. ▶ 00:44 And this path for a total cost of 239. ▶ 00:48 And now the question is, which path do we expand next? ▶ 00:53 ### (00:15) Topic 15, Uniform Cost Search

The answer is this one, 146. ▶ 00:00 Put it on explored. ▶ 00:04 But there's nothing to add because ▶ 00:07 both of its neighbors have already been explored. ▶ 00:12 Which path do we look at next? ▶ 00:13 ### (01:13) Topic 16, Uniform Cost Termination

The answer is this one. Two-twenty is less than 229 or 239. ▶ 00:00 Take it off the frontier. Put it on explored. ▶ 00:05 Add in 2 more paths and sum them up. ▶ 00:09 So, 220 plus 146 is 366. ▶ 00:15 And 220 plus 97 is 317. ▶ 00:21 Okay, and now, notice that we're closing in on Bucharest. ▶ 00:29 We've got 2 neighbors almost there, but neither of them is their turn yet. ▶ 00:32 Instead, the cheapest path is this one over here, ▶ 00:38 so move it to the explored list. ▶ 00:43 Add 70 to the path cost so far, ▶ 00:45 and we get 299. ▶ 00:50 Now the cheapest node is 239 here, ▶ 00:57 so we expand, finally, into Bucharest at a cost of 460. ▶ 01:01 And now the question is are we done? Can we terminate the algorithm? ▶ 01:09 ### (01:46) Topic 17, Uniform Cost Termination Answer

[Male] And the answer is no, we're not done yet. ▶ 00:00 We've put Bucharest, the gold state, onto the frontier, ▶ 00:03 but we haven't popped it off the frontier yet. ▶ 00:07 And the reason is because we've got to look around and see if there's a better path ▶ 00:09 that can reach it, Bucharest. ▶ 00:13 And so, let's continue. ▶ 00:15 Look at everything on the frontier. ▶ 00:18 Here's the cheapest one over here. ▶ 00:20 Expand that. ▶ 00:23 Now, what's the cheapest next one? ▶ 00:26 Well, over here. ▶ 00:30 Oops, forgot to take this one off the list. ▶ 00:33 So now, we're at 317 plus 101 gives us another path into Bucharest, ▶ 00:36 and this is a better path. ▶ 00:44 This is 418, gives us another route in. ▶ 00:46 But we have to keep going. ▶ 00:54 The best path on the frontier is 366, ▶ 00:59 so pop that off, and that would give us 2 more routes into here, ▶ 01:06 and eventually we pop off all of these. ▶ 01:14 And then we get to the point where 418 was the best path on the frontier. ▶ 01:18 We pop that off, and then we recognize that we'd reach the goal, ▶ 01:24 and the reason that uniform cost finds the optimal path, the cheapest cost, ▶ 01:29 is because it's guaranteed that it will first pop off this cheapest path, ▶ 01:35 the 418, before it gets to the more expensive path, like the 460. ▶ 01:40 ### (01:50) Topic 18, Depth First Search

So, we've looked at 2 search algorithms. ▶ 00:00 One, breadth-first search, in which we always expand first ▶ 00:03 the shallowest paths, the shortest paths. ▶ 00:08 Second, cheapest-first search, in which we always expand first the path ▶ 00:12 with the lowest total cost. ▶ 00:17 And I'm going to take this opportunity to introduce a third algorithm, depth-first search, ▶ 00:20 which is in a way the opposite of breadth-first search. ▶ 00:25 In depth-first search, we always expand first the longest path, ▶ 00:28 the path with the most lengths in it. ▶ 00:33 Now, what I want to ask you to do is for each of these nodes in each of the trees, ▶ 00:36 tell us in what order they're expanded, ▶ 00:42 first, second, third, fourth, fifth and so on by putting a number into the box. ▶ 00:44 And if there are ties, put that number in and resolve the ties in left to right order. ▶ 00:49 Then I want you to ask one more question or answer one more question ▶ 00:58 which is are these searches optimal? ▶ 01:03 That is, are they guaranteed to find the best solution? ▶ 01:06 And for breadth-first search, optimal would mean finding the shortest path. ▶ 01:11 If you think it's guaranteed to find the shortest path, check here. ▶ 01:16 For cheapest first, it would mean finding the path with the lowest total path cost. ▶ 01:21 Check here if you think it's guaranteed to do that. ▶ 01:26 And we'll allow the assumption that all costs have to be positive. ▶ 01:30 And in depth first, cheapest or optimal would mean, again, ▶ 01:34 as in breadth first, finding the shortest possible path in terms of number of lengths. ▶ 01:41 Check here if you think depth first will always find that. ▶ 01:46 ### (01:49) Topic 19, Search Optimality Answer

Here are the answers. ▶ 00:00 Breadth-first search, as the name implies, expands nodes in this order. ▶ 00:04 One, 2, 3, 4, 5, 6, 7. ▶ 00:10 So, it's going across a stripe at a time, breadth first. ▶ 00:17 Is it optimal? ▶ 00:23 Well, it's always expanding in the shortest paths first, ▶ 00:25 and so wherever the goal is hiding, it's going to find it by examining ▶ 00:28 no longer paths, so in fact, it is optimal. ▶ 00:34 Cheapest first, first we expand the path of length zero, ▶ 00:38 then the path of length 2. ▶ 00:45 Now there's a path of length 4, path of length 5, ▶ 00:47 path of length 6, a path of length 7, and finally, a path of length 8. ▶ 00:53 And as we've seen, it's guaranteed to find the cheapest path of all, ▶ 01:02 assuming that all the individual step costs are not negative. ▶ 01:08 Depth-first search tries to go as deep as it can first, ▶ 01:14 so it goes 1, 2, 3, then backs up, 4, ▶ 01:17 then backs up, 5, 6, 7. ▶ 01:24 And you can see that it doesn't necessarily find the shortest path of all. ▶ 01:29 Let's say that there were goals in position 5 and in position 3. ▶ 01:34 It would find the longer path to position 3 and find the goal there ▶ 01:39 and would not find the goal in position 5. ▶ 01:43 So, it is not optimal. ▶ 01:46 ### (02:02) Topic 20, Storage Requirements, Completeness

Given the non-optimality of depth-first search, ▶ 00:00 why would anybody choose to use it? ▶ 00:04 Well, the answer has to do with the storage requirements. ▶ 00:07 Here I've illustrated a state space ▶ 00:10 consisting of a very large or even infinite binary tree. ▶ 00:13 As we go to levels 1, 2, 3, down to level n, ▶ 00:18 the tree gets larger and larger. ▶ 00:22 Now, let's consider the frontier for each of these search algorithms. ▶ 00:24 For breadth-first search, we know a frontier looks like that, ▶ 00:29 and so when we get down to level n, we'll require a storage space of ▶ 00:35 2 to the n of pass in a breadth-first search. ▶ 00:40 For cheapest first, the frontier is going to be more complicated. ▶ 00:45 It's going to sort of work out this contour of cost, ▶ 00:49 but it's going to have a similar total number of nodes. ▶ 00:53 But for depth-first search, as we go down the tree, we start going down this branch, ▶ 00:57 and then we back up, but at any point, our frontier is only going to have n nodes ▶ 01:03 rather than 2 to the n nodes, so that's a substantial savings for depth-first search. ▶ 01:08 Now, of course, if we're also keeping track of the explored set, ▶ 01:14 then we don't get that much savings. ▶ 01:19 But without the explored set, depth-first search has a huge advantage ▶ 01:21 in terms of space saved. ▶ 01:25 One more property of the algorithms to consider ▶ 01:27 is the property of completeness, meaning if there is a goal somewhere, ▶ 01:30 will the algorithm find it? ▶ 01:35 So, let's move from very large trees to infinite trees, ▶ 01:37 and let's say that there's some goal hidden somewhere deep down in that tree. ▶ 01:41 And the question is, are each of these algorithms complete? ▶ 01:47 That is, are they guaranteed to find a path to the goal? ▶ 01:51 Mark off the check boxes for the algorithms that you believe are complete in this sense. ▶ 01:55 ### (00:49) Topic 21, Completeness Answer

The answer is that breadth-first search is complete, ▶ 00:00 so even if the tree is infinite, if the goal is placed at any finite level, ▶ 00:04 eventually, we're going to march down and find that goal. ▶ 00:10 Same with cheapest first. ▶ 00:16 No matter where the goal is, if it has a finite cost, ▶ 00:18 eventually, we're going to go down and find it. ▶ 00:21 But not so for depth-first search. ▶ 00:25 If there's an infinite path, depth-first search will keep following that, ▶ 00:28 so it will keep going down and down and down along this path ▶ 00:33 and never get to the path that the goal consists of ▶ 00:37 and never get to the path on which the goal sits. ▶ 00:42 So, depth-first search is not complete. ▶ 00:46 ### (04:28) Topic 22, More on Uniform Cost Search

Let's try to understand a little better how uniform cost search works. ▶ 00:00 We start at a start state, ▶ 00:05 and then we start expanding out from there looking at different paths, ▶ 00:08 and what we end of doing is expanding in terms of contours like on a topological map, ▶ 00:13 where first we span out to a certain distance, then to a farther distance, ▶ 00:21 and then to a farther distance. ▶ 00:28 Now at some point we meet up with a goal. Let's say the goal is here. ▶ 00:31 Now we found a path from the start to the goal. ▶ 00:35 But notice that the search really wasn't directed at any way towards the goal. ▶ 00:42 It was expanding out everywhere in the space and depending on where the goal is, ▶ 00:46 we should expect to have to explore half the space, on average, before we find the goal. ▶ 00:52 If the space is small, that can be fine, ▶ 00:57 but when spaces are large, that won't get us to the goal fast enough. ▶ 01:00 Unfortunately, there is really nothing we can do, with what we know, to do better than that, ▶ 01:05 and so if we want to improve, if we want to be able to find the goal faster, ▶ 01:10 we're going to have to add more knowledge. ▶ 01:15 The type of knowledge that is proven most useful in search is an estimate of the distance ▶ 01:21 from the start state to the goal. ▶ 01:27 So let's say we're dealing with a route-finding problem, ▶ 01:32 and we can move in any direction--up or down, right or left-- ▶ 01:36 and we'll take as our estimate, the straight line distance between a state and a goal, ▶ 01:43 and we'll try to use that estimate to find our way to the goal fastest. ▶ 01:50 Now an algorithm called greedy best-first search does exactly that. ▶ 01:55 It expands first the path that's closest to the goal according to the estimate. ▶ 02:04 So what do the contours look like in this approach? ▶ 02:09 Well, we start here, and then we look at all the neighboring states, ▶ 02:13 and the ones that appear to be closest to the goal we would expand first. ▶ 02:17 So we'd start expanding like this and like this and like this and like this ▶ 02:21 and that would lead us directly to the goal. ▶ 02:30 So now instead of exploring whole circles that go out everywhere with a certain space, ▶ 02:33 our search is directed towards the goal. ▶ 02:38 In this case it gets us immediately towards the goal, but that won't always be the case ▶ 02:41 if there are obstacles along the way. ▶ 02:46 Consider this search space. We have a start state and a goal, ▶ 02:50 and there's an impassable barrier. ▶ 02:54 Now greedy best-first search will start expanding out as before, ▶ 02:57 trying to get towards the goal, ▶ 03:02 and when it reaches the barrier, what will it do next? ▶ 03:08 Well, it will try to increase along a path that's getting closer and closer to the goal. ▶ 03:11 So it won't consider going back this way which is farther from the goal. ▶ 03:15 Rather it will continue expanding out along these lines ▶ 03:20 which always get closer and closer to the goal, ▶ 03:24 and eventually it will find its way towards the goal. ▶ 03:28 So it does find a path, and it does it by expanding a small number of nodes, ▶ 03:31 but it's willing to accept a path which is longer than other paths. ▶ 03:36 Now if we explored in the other direction, we could have found a much simpler path, ▶ 03:42 a much shorter path, by just popping over the barrier, and then going directly to the goal. ▶ 03:47 but greedy best-first search wouldn't have done that because ▶ 03:54 that would have involved getting to this point, which is this distance to the goal, ▶ 03:56 and then considering states which were farther from the goal. ▶ 04:01 What we would really like is an algorithm that combines the best parts ▶ 04:08 of greedy search which explores a small number of nodes in many cases ▶ 04:11 and uniform cost search which is guaranteed to find a shortest path. ▶ 04:17 We'll show how to do that next using an algorithm called the A-star algorithm. ▶ 04:22 ### (03:14) Topic 23, A-Star Search

[Male narrator] A* Search works by always expanding the path ▶ 00:00 that has a minimum value of the function f ▶ 00:03 which is defined as a sum of the g + h components. ▶ 00:07 Now, the function g of a path ▶ 00:12 is just the path cost, ▶ 00:16 and the function h of a path ▶ 00:19 is equal to the h value of the state, ▶ 00:23 which is the final state of the path, ▶ 00:27 which is equal to the estimated distance to the goal. ▶ 00:30 Here's an example of how A* works. ▶ 00:36 Suppose we found this path through the state's base to a state x ▶ 00:39 and we're trying to give a measure to the value of this path. ▶ 00:44 The measure f is a sum of g, the path cost so far, ▶ 00:48 and h, which is the estimated distance that the path will take ▶ 00:55 to complete its path to the goal. ▶ 01:02 Now, minimizing g helps us keep the path short ▶ 01:04 and minimizing h helps us keep focused on finding the goal ▶ 01:08 and the result is a search strategy that is the best possible ▶ 01:13 in the sense that it finds the shortest length path ▶ 01:17 while expanding the minimum number of paths possible. ▶ 01:20 It could be called "best estimated total path cost first," ▶ 01:24 but the name A* is traditional. ▶ 01:28 Now let's go back to Romania and apply the A* algorithm ▶ 01:32 and we're going to use a heuristic, which is a straight line distance ▶ 01:36 between a state and the goal. ▶ 01:40 The goal, again, is Bucharest, ▶ 01:42 and so the distance from Bucharest to Bucharest is, of course, 0. ▶ 01:44 And for all the other states, I've written in red ▶ 01:47 the straight line distance. ▶ 01:51 For example, straight across like that. ▶ 01:53 Now, I should say that all the roads here I've drawn as straight lines, ▶ 01:55 but actually, roads are going to be curved to some degree, ▶ 01:59 so the actual distance along the roads is going to be longer ▶ 02:03 than the straight line distance. ▶ 02:06 Now, we start out as usual--we'll start in Arad as a start state-- ▶ 02:09 and we'll expand out Arad and so we'll add 3 paths ▶ 02:13 and the evaluation function, f, will be the sum of the path length, ▶ 02:21 which is given in black, and the estimated distance, ▶ 02:26 which is given in red. ▶ 02:29 And so the path length from this path ▶ 02:32 will be 140+253 or 393; ▶ 02:37 for this path, 75+374, or 449; ▶ 02:45 and for this path, 118+329, or 447. ▶ 02:55 And now, the question is out of all the paths that are on the frontier, ▶ 03:05 which path would we expand next under the A* algorithm? ▶ 03:09 ### (00:14) Topic 23, A-Star Search ANSWER

The answer is that we select this path first--the one from Arad to Sibiu-- ▶ 00:00 because it has the smallest value--393--of the sum f=g+h. ▶ 00:05 ### (00:39) Topic 24, A-Star Second Question

Let's go ahead and expand this node now. ▶ 00:00 So we're going to add 3 paths. ▶ 00:03 This one has a path cost of 291 ▶ 00:06 and an estimated distance to the goal of 380, ▶ 00:10 for a total of 671. ▶ 00:14 This one has a path cost of 239 ▶ 00:18 and an estimated distance of 176, for a total of 415. ▶ 00:21 And the final one is 220+193=413. ▶ 00:27 And now the question is which state to we expand next? ▶ 00:33 ### (00:12) Topic 24, A-Star Second Question ANSWER

The answer is we expand this path next ▶ 00:00 because its total, 413, ▶ 00:03 is less than all the other ones on the front tier-- ▶ 00:06 although only slightly less than the 415 for this path. ▶ 00:09 ### (00:20) Topic 25, A-Star Third Question

So we expand this node, ▶ 00:00 giving us 2 more paths-- ▶ 00:03 this one with an f-value of 417, ▶ 00:06 and this one with an f-value of 526. ▶ 00:10 The question again--which path are we going to expand next? ▶ 00:16 ### (00:11) Topic 25, A-Star Third Question ANSWER

And the answer is that we expand this path, Fagaras, next, ▶ 00:00 because its f-total, 415, ▶ 00:05 is less than all the other paths in the front tier. ▶ 00:08 ### (00:26) Topic 26, A-Star Fourth Question

Now we expand Fagaras ▶ 00:01 and we get a path that reaches the goal ▶ 00:04 and it has a path length of 450 and an estimated distance of 0 ▶ 00:07 for a total f value of 450, ▶ 00:11 and now the question is: What do we do next? ▶ 00:14 Click here if you think we're at the end of the algorithm ▶ 00:17 and we don't need to expand next ▶ 00:22 or click on the node that you think we will expand next. ▶ 00:24 ### (00:23) Topic 26, A-Star Fourth Question ANSWER

The answer is that we're not done yet, ▶ 00:00 because the algorithm works by doing the goal test, ▶ 00:03 when we take a path off the front tier, ▶ 00:06 not when we put a path on the front tier. ▶ 00:08 Instead, we just continue in the normal way and choose the node ▶ 00:11 on the front tier which has the lowest value. ▶ 00:15 That would be this one--the path through Pitesti, with a total of 417. ▶ 00:18 ### (01:24) Topic 27, A-Star Fifth Question

So let's expand the node at Pitesti. ▶ 00:01 We have to go down this direction, up, ▶ 00:04 then we reach a path we've seen before, ▶ 00:08 and we go in this direction. ▶ 00:11 Now we reach Bucharest, which is the goal, ▶ 00:13 and the h value is going to be 0 ▶ 00:16 because we're at the goal, and the g value works out to 418. ▶ 00:19 Again, we don't stop here just because we put a path onto the front tier, ▶ 00:24 we put it there, we don't apply the goal test next, ▶ 00:31 but, now we go back to the front tier, ▶ 00:35 and it turns out that this 418 is the lowest-cost path on the front tier. ▶ 00:38 So now we pull it off, do the goal test, ▶ 00:43 and now we found our path to the goal, ▶ 00:45 and it is, in fact, the shortest possible path. ▶ 00:49 In this case, A-star was able to find the lowest-cost path. ▶ 00:55 Now the question that you'll have to think about, ▶ 00:59 because we haven't explained it yet, ▶ 01:02 is whether A-star will always do this. ▶ 01:04 Answer yes if you think A-star will always find the shortest cost path, ▶ 01:06 or answer no if you think it depends on the particular problem given, ▶ 01:12 or answer no if you think it depends on the particular heuristic estimate function, h. ▶ 01:17 ### (00:49) Topic 27, A-Star Fifth Question ANSWER Mandatory

The answer is that it depends on the h function. ▶ 00:02 A-star will find the lowest-cost path ▶ 00:06 if the h function for a state is less than the true cost ▶ 00:09 of the path to the goal through that state. ▶ 00:16 In other words, we want the h to never overestimate the distance to the goal. ▶ 00:20 We also say that h is optimistic. ▶ 00:26 Another way of stating that ▶ 00:31 is that h is admissible, ▶ 00:34 meaning is it admissible to use it to find the lowest-cost path. ▶ 00:37 Think of all of these of being the same way ▶ 00:41 of stating the conditions under which A-star finds the lowest-cost path. ▶ 00:45 ### (01:22) Topic 28, Optimistic Heuristics

Here we give you an intuition as to why ▶ 00:01 an optimistic heuristic function, h, finds the lowest-cost path. ▶ 00:03 When A-star ends, it returns a path, p, with estimated cost, c. ▶ 00:08 It turns out that c is also the actual cost, ▶ 00:15 because at the goal the h component is 0, ▶ 00:20 and so the path cost is the total cost as estimated by the function. ▶ 00:23 Now, all the paths on the front tier ▶ 00:28 have an estimated cost that's greater than c, ▶ 00:31 and we know that because the front tier is explored in cheapest-first order. ▶ 00:35 If h is optimistic, then the estimated cost ▶ 00:40 is less than the true cost, ▶ 00:44 so the path p must have a cost that's less than the true cost ▶ 00:47 of any of the paths on the front tier. ▶ 00:51 Any paths that go beyond the front tier ▶ 00:54 must have a cost that's greater than that ▶ 00:57 because we agree that the step cost is always 0 or more. ▶ 00:59 So that means that this path, p, must be the minimal cost path. ▶ 01:04 Now, this argument, I should say, only goes through ▶ 01:09 as is for tree search. ▶ 01:13 For graph search the argument is slightly more complicated, ▶ 01:16 but the general intuitions hold the same. ▶ 01:19 ### (00:59) Topic 29, State Spaces

So far we've looked at the state space of cities in Romania-- ▶ 00:01 a 2-dimensional, physical space. ▶ 00:05 But the technology for problem solving through search ▶ 00:07 can deal with many types of state spaces, ▶ 00:10 dealing with abstract properties, not just x-y position in a plane. ▶ 00:12 Here I introduce another state space--the vacuum world. ▶ 00:17 It's a very simple world in which there are only 2 positions ▶ 00:21 as opposed to the many positions in the Romania state space. ▶ 00:25 But there are additional properties to deal with as well. ▶ 00:30 The robot vacuum cleaner can be in either of the 2 conditions, ▶ 00:33 but as well as that each of the positions ▶ 00:36 can either have dirt in it or not have dirt in it. ▶ 00:40 Now the question is to represent this as a state space ▶ 00:43 how many states do we need? ▶ 00:47 The number of states can fill in this box here. ▶ 00:51 ### (00:35) Topic 29, State Spaces ANSWER

And the answer is there are 8 states. ▶ 00:01 There are 2 physical states that the robot vacuum cleaner can be in-- ▶ 00:04 either in state A or in state B. ▶ 00:10 But in addition to that, there are states about how the world is ▶ 00:12 as well as where the robot is in the world. ▶ 00:17 So state A can be dirty or not. ▶ 00:19 That's 2 possibilities. ▶ 00:24 And B can be dirty or not. ▶ 00:26 That's 2 more possibilities. ▶ 00:28 We multiply those together. We get 8 possible states. ▶ 00:31 ### (01:44) Topic 30, State Space Diagram and More Complexity

Here is a diagram of the state space for the vacuum world. ▶ 00:01 Note that there are 8 states, and we have the actions connecting the states ▶ 00:05 just as we did in the Romania problem. ▶ 00:09 Now let's look at a path through this state. ▶ 00:12 Let's say we start out in this position, ▶ 00:15 and then we apply the action of moving right. ▶ 00:19 Then we end up in a position where the state of the world looks the same, ▶ 00:23 except the robot has moved from position 'A' to position 'B'. ▶ 00:27 Now if we turn on the sucking action, ▶ 00:32 then we end up in a state where the robot is in the same position ▶ 00:37 but that position is no longer dirty. ▶ 00:42 Let's take this very simple vacuum world ▶ 00:47 and make a slightly more complicated one. ▶ 00:50 First, we'll say that the robot has a power switch, ▶ 00:53 which can be in one of three conditions: on, off, or sleep. ▶ 00:56 Next, we'll say that the robot has a dirt-sensing camera, ▶ 01:04 and that camera can either be on or off. ▶ 01:09 Third, this is the deluxe model of robot ▶ 01:13 in which the brushes that clean up the dust ▶ 01:16 can be set at 1 of 5 different heights ▶ 01:19 to be appropriate for whatever level of carpeting you have. ▶ 01:22 Finally, rather that just having the 2 positions, ▶ 01:27 we'll extend that out and have 10 positions. ▶ 01:30 Now the question is how many states are in this state space? ▶ 01:37 ### (00:57) Topic 30, State Space Diagram and More Complexity ANSWER

The answer is that the number of states is the cross product ▶ 00:01 of the numbers of all the variables, since they're each independent, ▶ 00:05 and any combination can occur. ▶ 00:08 For the power we have 3 possible positions. ▶ 00:10 The camera has 2. ▶ 00:14 The brush height has 5. ▶ 00:18 The dirt has 2 for each of the 10 positions. ▶ 00:23 That's 2^10 or 1024. ▶ 00:28 Then the robot's position can be any of those 10 positions as well. ▶ 00:33 That works out to 307,200 states in the state space. ▶ 00:39 Notice how a fairly trivial problem-- ▶ 00:44 we're only modeling a few variables and only 10 positions-- ▶ 00:46 works out to a large number of state spaces. ▶ 00:50 That's why we need efficient algorithms for searching through states spaces. ▶ 00:52 ### (01:49) Topic 31, Sliding Blocks Puzzle

I want to introduce one more problem that can be solved with search techniques. ▶ 00:01 This is a sliding blocks puzzle, called a 15 puzzle. ▶ 00:05 You may have seen something like this. ▶ 00:08 So there are a bunch of little squares or blocks or tiles ▶ 00:10 and you can slide them around. ▶ 00:14 and the goal is to get into a certain configuration. ▶ 00:19 So we'll say that this is the goal state, where the numbers 1-15 are in order ▶ 00:21 left to right, top to bottom. ▶ 00:27 The starting state would be some state where all the positions are messed up. ▶ 00:29 Now the question is: Can we come up with a good heuristic for this? ▶ 00:34 Let's examine that as a way of thinking about where heuristics come from. ▶ 00:38 The first heuristic we're going to consider ▶ 00:42 we'll call h1, and that is equal to the number of misplaced blocks. ▶ 00:46 So here 10 and 11 are misplaced because they should be there and there, respectively, ▶ 00:54 12 is in the right place, 13 is in the right place, ▶ 00:59 and 14 and 15 are misplaced. ▶ 01:02 That's a total of 4 misplaced blocks. ▶ 01:04 The 2nd heuristic, h2, is equal to ▶ 01:07 the sum of the distances that each block would have to move to get to the right position. ▶ 01:13 For this position, 10 would have to move 1 space to get to the right position, ▶ 01:19 11 would have to move 1, so that's a total of 2 so far, ▶ 01:26 13 is in the right place, ▶ 01:30 14 is 1 displaced, ▶ 01:31 and 15 is 1 displaced, ▶ 01:33 so that would also be a total of 4. ▶ 01:35 Now, the question is: Which, if any, of these heuristics are admissible? ▶ 01:38 Check the boxes next to the heuristics that you think ▶ 01:44 are admissible. ▶ 01:47 ### (00:42) Topic 31, Sliding Blocks Puzzle ANSWER

H1 is admissible, because every tile that's in the wrong position ▶ 00:02 must be moved at least once to get into the right position. ▶ 00:07 So h1 never overestimates. ▶ 00:10 How about h2? ▶ 00:13 H2 is also admissible, because every tile in the wrong position ▶ 00:15 can be moved closer to the correct position no faster than 1 space per move. ▶ 00:20 Therefore, both are admissible. ▶ 00:26 But notice that h2 is always greater than or equal to h1. ▶ 00:28 That means that, with the exception of breaking ties, ▶ 00:33 an A* search using h2 will always expand ▶ 00:35 fewer paths than one using h1 ▶ 00:39 ### (03:16) Topic 32, Where is the Intelligence

Now, we're trying to build an artificial intelligence ▶ 00:01 that can solve problems like this all on its own. ▶ 00:04 You can see that the search algorithms do a great job ▶ 00:08 of finding solutions to problems like this. ▶ 00:12 But, you might complain that in order for the search algorithms to work, ▶ 00:15 we had to provide it with a heurstic function. ▶ 00:19 A heurstic function came from the outside. ▶ 00:22 You might think that coming up with a good heurstic function is really where all the intelligence is. ▶ 00:25 So, a problem solver that uses an heurstic function given to it ▶ 00:30 really isn't intelligent at all. ▶ 00:34 So let's think about where the intelligence could come from ▶ 00:36 and can we automatically come up with good heurstic functions. ▶ 00:39 I'm going to sketch a description of ▶ 00:45 a program that can automatically come up with good heurstics ▶ 00:47 given a description of a problem. ▶ 00:50 Suppose this program is given a description of the sliding blocks puzzle ▶ 00:52 where we say that a block can move from square A to square B ▶ 00:57 if A is adjacent to B and B is blank. ▶ 01:02 Now, imagine that we try to loosen this restriction. ▶ 01:06 We cross out "B is blank," ▶ 01:10 and then we get the rule ▶ 01:14 "a block can move from A to B if A is adjacent to B," ▶ 01:16 and that's equal to our heurstic h2 ▶ 01:20 because a block can move anywhere to an adjacent state. ▶ 01:23 Now, we could also cross out the other part of the rule, ▶ 01:27 and we now get "a block can move from any square A ▶ 01:31 to any square B regardless of any condition. ▶ 01:36 That gives us heurstic h1. ▶ 01:40 So we see that both of our heurstics can be derived ▶ 01:43 from a simple mechanical manipulation ▶ 01:48 of the formal description of the problem. ▶ 01:50 Once we've generated automatically these candidate heuristics, ▶ 01:53 another way to come up with a good heurstic is to say ▶ 01:58 that a new heurstic, h, ▶ 02:02 is equal to the maximum of h1 and h2, ▶ 02:04 and that's guaranteed to be admissible as long as ▶ 02:10 h1 and h2 are admissible ▶ 02:13 because it still never overestimates, ▶ 02:16 and it's guaranteed to be better because its getting closer to the true value. ▶ 02:18 The only problem with combining multiple heuristics like this ▶ 02:22 is that there is some cause to compute the heuristic ▶ 02:27 and it could take longer to compute ▶ 02:29 even if we end up expanding pure paths. ▶ 02:31 Crossing out parts of the rules like this ▶ 02:35 is called "generating a relaxed problem." ▶ 02:38 What we've done is we've taken the original problem, ▶ 02:41 where it's hard to move squares around, ▶ 02:44 and made it easier by relaxing one of the constraints. ▶ 02:46 You can see that as adding new links in the state space, ▶ 02:49 so if we have a state space in which there are only particular links, ▶ 02:54 by relaxing the problem it's as if we are adding new operators ▶ 02:59 that traverse the state in new ways. ▶ 03:05 So adding new operators only makes the problem easier, ▶ 03:07 and thus never overestimates, and thus is admissible. ▶ 03:11 ### (01:52) Topic 33, What Can't Search Do

We've seen what search can do for problem solving. ▶ 00:00 It can find the lowest-cost path to a goal, ▶ 00:03 and it can do that in a way in which we never generate more paths than we have to. ▶ 00:06 We can find the optimal number of paths to generate, ▶ 00:12 and we can do that with a heuristic function that we generate on our own ▶ 00:15 by relaxing the existing problem definition. ▶ 00:19 But let's be clear on what search can't do. ▶ 00:22 All the solutions that we have found consist of a fixed sequence of actions. ▶ 00:25 In other words, the agent Hirin Arad, thinks, comes up with a plan that it wants to execute ▶ 00:31 and then essentially closes his eyes and starts driving, ▶ 00:38 never considering along the way if something has gone wrong. ▶ 00:42 That works fine for this type of problem, ▶ 00:46 but it only works when we satisfy the following conditions. ▶ 00:49 [Problem solving works when:] ▶ 00:53 Problem-solving technology works when the following set of conditions is true: ▶ 00:55 First, the domain must be fully observable. ▶ 00:59 In other words, we must be able to see what initial state we start out with. ▶ 01:03 Second, the domain must be known. ▶ 01:08 That is, we have to know the set of available actions to us. ▶ 01:12 Third, the domain must be discrete. ▶ 01:16 There must be a finite number of actions to chose from. ▶ 01:20 Fourth, the domain must be deterministic. ▶ 01:24 We have to know the result of taking an action. ▶ 01:28 Finally, the domain must be static. ▶ 01:32 There must be nothing else in the world that can change the world except our own actions. ▶ 01:36 If all these conditions are true, then we can search for a plan ▶ 01:41 which solves the problem and is guaranteed to work. ▶ 01:44 In later units, we will see what to do if any of these conditions fail to hold. ▶ 01:47 ### (02:35) Topic 34, Note on Implementation

Our description of the algorithm has talked about paths in the state space. ▶ 00:01 I want to say a little bit now about how to implement that in terms of a computer algorithm. ▶ 00:08 We talk about paths, but we want to implement that in some ways. ▶ 00:15 In the implementation we talk about nodes. ▶ 00:19 A node is a data structure, and it has four fields. ▶ 00:22 The state field indicates the state at the end of the path. ▶ 00:27 The action was the action it took to get there. ▶ 00:35 The cost is the total cost, ▶ 00:40 and the parent is a pointer to another node. ▶ 00:45 In this case, the node that has state "S", ▶ 00:50 and it will have a parent which points to the node that has state "A", ▶ 00:56 and that will have a parent pointer that's null. ▶ 01:06 So we have a linked list of nodes representing the path. ▶ 01:10 We'll use the word "path" for the abstract idea, ▶ 01:15 and the word "node" for the representation in the computer memory. ▶ 01:18 But otherwise, you can think of those two terms as being synonyms, ▶ 01:22 because they're in a one-to-one correspondence. ▶ 01:26 Now there are two main data structures that deal with nodes. ▶ 01:31 We have the "frontier" and we have the "explored" list. ▶ 01:35 Let's talk about how to implement them. ▶ 01:41 In the frontier the operations we have to deal with ▶ 01:44 are removing the best item from the frontier and adding in new ones. ▶ 01:48 And that suggests we should implement it as a priority queue, ▶ 01:52 which knows how to keep track of the best items in proper order. ▶ 01:55 But we also need to have an additional operation ▶ 01:59 of a membership test as a new item in the frontier. ▶ 02:03 And that suggests representing it as a set, ▶ 02:07 which can be built from a hash table or a tree. ▶ 02:10 So the most efficient implementations of search actually have both representations. ▶ 02:14 The explored set, on the other hand, is easier. ▶ 02:20 All we have to do there is be able to add new members and check for membership. ▶ 02:23 So we represent that as a single set, ▶ 02:28 which again can be done with either a hash table or tree. ▶ 02:31

## (16) Homework 1

### (00:05) Congratulations!

Congratulations. ▶ 00:00 You just made assignment 1. ▶ 00:02 ### (00:05) Introduction

This is homework assignment #1. ▶ 00:00 ### (01:00) Question 1, Peg Solitaire

This is a question about peg solitaire. ▶ 00:01 In peg solitaire, a single player faces ▶ 00:04 the following kind of board. ▶ 00:08 Initially, all pieces are occupied except for the center piece. ▶ 00:13 You can find more information on peg solitare at the following URL. ▶ 00:22 [http://en.wikipedia.org/wiki/peg_solitaire] ▶ 00:26 I wish to know whether this game is partially observable, ▶ 00:36 Please say yes or no. ▶ 00:40 I wish to know whether it is stochastic. ▶ 00:43 Please say yes if it is and no if it's deterministic. ▶ 00:46 Let me know if it's continuous, yes or no, ▶ 00:50 and let me know if it's adversarial, yes or no. ▶ 00:55 ### (00:22) Question 1, Peg Solitaire ANSWER

>>Peg Solitaire is not partially observable because you can see the board at all times. ▶ 00:00 It is not stochastic because you just make all the moves, ▶ 00:06 and they have very different mystic effects. ▶ 00:09 It is not continuous. It's just finding many choices of actions ▶ 00:11 and finding many board positions, so therefore, it is not continuous. ▶ 00:15 and it's not adversarial because there is no adversaries--just you playing. ▶ 00:18 ### (00:54) Question 2, Loaded Coin

I am going to ask you about the problem to learn about a loaded coin. ▶ 00:01 A loaded coin is a coin, ▶ 00:05 that if you flip it, ▶ 00:07 might have a non 0.5 chance ▶ 00:09 of coming up heads or tails. ▶ 00:13 Fair coins always come up 50% heads or tails. ▶ 00:16 Loaded coins might come up, for example, ▶ 00:20 0.9 chance heads and 0.1 chance tails. ▶ 00:23 Your task will be to understand, ▶ 00:27 from coin flips, ▶ 00:30 whether a coin is loaded, ▶ 00:31 and if so, at what probability. ▶ 00:33 I don't want you to solve the problem, ▶ 00:35 but I want you to answer the following questions: ▶ 00:37 Is it partially observable? ▶ 00:40 Yes or no. ▶ 00:42 Is it stochastic? ▶ 00:44 Yes or no. ▶ 00:46 Is it continuous? [Yes or no.] ▶ 00:48 And finally, is it adversarial? ▶ 00:51 Yes or no. ▶ 00:53 ### (00:38) Question 2, Loaded Coin ANSWER

[Thrun] So the loaded coin example is clearly partially observable, ▶ 00:00 and the reason is it is actually used for the memory ▶ 00:06 if you flip it more than 1 time so you can learn more about what the actual probability is. ▶ 00:09 Therefore, looking at the most recent coin flip is insufficient to make your choice. ▶ 00:14 It is stochastic because you flip a coin. ▶ 00:20 It is not continuous because there's only 1 action--a flip--and 2 outcomes. ▶ 00:25 And it isn't really adversarial because while you do your learning task ▶ 00:31 no adversary interferes. ▶ 00:36 ### (00:32) Question 3, Path Through Maze

Let's talk about the problem of finding a path through a maze. ▶ 00:00 Let me draw you a maze. ▶ 00:05 Suppose you wish to find the path from the start to your goal. ▶ 00:10 I don't want to you to solve this problem. ▶ 00:15 Rather I want you to tell me whether it's partially observable. ▶ 00:19 Yes or no. ▶ 00:23 It is stochastic? ▶ 00:25 Yes or no. ▶ 00:27 Is it continuous? ▶ 00:29 Yes or no. ▶ 00:31 ### (00:18) Question 3, Path Through Maze ANSWER

[Thrun] The path through the maze is clearly not partially observable ▶ 00:00 because you can see the maze entirely at all times. ▶ 00:03 It is not stochastic. There is no randomness involved. ▶ 00:06 It isn't really continuous. ▶ 00:10 There's typically just finitely many choices--go left or right. ▶ 00:12 And it isn't adversarial because there's no real adversary involved. ▶ 00:15 ### (00:43) Question 4, Search Tree

This is a search question. ▶ 00:00 Suppose we are given the following search tree. ▶ 00:02 We are searching from the top, the start node, ▶ 00:05 to the goal, which is over here. ▶ 00:08 Assume we expand from left to right. ▶ 00:12 Tell me how many nodes are expanded ▶ 00:17 if we expand from left to right, ▶ 00:20 counting the start node and the goal node in your answer. ▶ 00:23 And give me the same answer for Depth First Search. ▶ 00:27 Now, let's assume you're going to search from right to left. ▶ 00:32 How many nodes would we now expand in Breadth First Search, ▶ 00:35 and how many do we expand in Depth First Search? ▶ 00:39 ### (00:38) Question 4, Search Tree ANSWER

[Thrun] Breadth first from left to right is 6-- ▶ 00:00 1, 2, 3, 4, 5, 6. ▶ 00:03 Depth first from left to right is 4--1, 2, 3, 4. ▶ 00:07 Breadth first searched from right to left is 9-- ▶ 00:15 1, 2, 3, 4, 5, 6, 7, 8, 9. ▶ 00:19 And depth first from right to left is 9-- ▶ 00:25 1, 2, 3, 4, 5, 6, 7, 8, 9. ▶ 00:28 ### (00:31) Question 5, Another Search Tree

Another search problem-- ▶ 00:00 Consider the following search tree, ▶ 00:03 where this is the start node. ▶ 00:08 Now, assume we search from left to right. ▶ 00:12 I would like you to tell me the number of nodes expanded from Breadth-First Search ▶ 00:15 and Depth-First Search. ▶ 00:19 Please do count the start and the goal node, ▶ 00:22 and please give me the same numbers for Right-to-Left Search, ▶ 00:25 for Breadth-First, and Depth-First. ▶ 00:28 ### (00:48) Question 5, Another Search Tree ANSWER

[Thrun] The correct answer for breadth first left to right is 13-- ▶ 00:00 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13. ▶ 00:05 And for depth first it is 10-- ▶ 00:13 1, 2, 3, 4, 5, 6, 7, 8, 9, and 10. ▶ 00:17 For right to left search, the right answer for breadth first is 11-- ▶ 00:28 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11. ▶ 00:32 And for depth first the right answer is 7-- ▶ 00:38 1, 2, 3, 4, 5, 6, 7. ▶ 00:42 ### (01:00) Question 6, Search Network

This is another search problem. ▶ 00:00 Let's assume we have a search graph. ▶ 00:04 It isn't quite a tree but looks like this. ▶ 00:07 Obviously in the structure we can reach nodes through multiple paths. ▶ 00:13 So let's assume that our search never expands the same node twice. ▶ 00:18 Let's also assume this start node is on top. We search down. ▶ 00:22 And this over here is our goal node. ▶ 00:27 So left-to-right search, tell me how many nodes ▶ 00:30 breadth first would expand--do count the start and goal node in the final answer. ▶ 00:35 Give me the same result for a depth-first search. ▶ 00:43 Again counting the start and the goal node in your answer. ▶ 00:48 And again give me your answer for breadth-first ▶ 00:51 and for depth-first in the right-to-left search paradigm. ▶ 00:54 ### (00:49) Question 6, Search Network ANSWER

[Thrun] The right answer over here is 10 for breadth first from left to right-- ▶ 00:00 1, 2, 3, 4, 5, 6, 7, 8, 9, 10. ▶ 00:05 Depth first is 16, or all nodes-- ▶ 00:11 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16. ▶ 00:15 And notice how I never expanded a node twice. ▶ 00:30 Correct answer for breadth first right to left is 7-- ▶ 00:34 1, 2, 3, 4, 5, 6, 7. ▶ 00:38 And the correct answer for depth first from right to left is 4--1, 2, 3, and 4. ▶ 00:43 ### (01:16) Question 7, A* Search

Let's talk about a star search. ▶ 00:00 Let's assume we have the following grid. ▶ 00:03 The start state is right here. ▶ 00:08 And the goal state is right here. ▶ 00:13 And just for convenience, I will give each here a little number. ▶ 00:16 A. B. C. D. ▶ 00:22 Let me draw a heuristic function. ▶ 00:26 Please take a look for a moment ▶ 00:30 and tell me whether this heuristic function is admissable. ▶ 00:32 Check here if yes and here if no. ▶ 00:38 Which one is the first node a star would expand? ▶ 00:41 B1 or A2? ▶ 00:46 What's the second node to expand? ▶ 00:51 B1, C1, A2, A3, or B2? ▶ 00:56 And finally, what is the third node to expand? ▶ 01:06 D1, C2, B3, or A4? ▶ 01:10 ### (01:44) Question 7, A* Search ANSWER

[Thrun] Clearly this is an admissable heuristic because the distance to the goal ▶ 00:00 is strictly underestimated. ▶ 00:05 From here it would take 1 step, ▶ 00:07 from here it will take 1, 2 steps, so the answer is yes. ▶ 00:09 Now, to understand A*, let me also draw the g function ▶ 00:15 for development part of this table. ▶ 00:22 Clearly g is 0 over here. ▶ 00:24 To understand which node to expand, this one or this one, ▶ 00:27 let's project the g function, which is 1, ▶ 00:31 and we will see that 3 plus 1 is smaller than 4 plus 1; ▶ 00:34 therefore, this is the second node to expand, which is b1. ▶ 00:40 Now let me for the next step explain the g function from this guy here, 2 and 2. ▶ 00:47 So 2 plus 2 is 4 versus 3 plus 2 is 5, so we expand this node next, which is c1. ▶ 00:55 And finally, the g function from here would go 3 and 3. ▶ 01:08 3 plus 1 is better than 3 plus 2, so we would expand d1 next. ▶ 01:14 And notice how in the sum of g and h, ▶ 01:24 this node over here, which has a total of 4, is better than any other node that is unexpanded. ▶ 01:29 So in particular, 4 plus 1 is 5, and 3 plus 2 is 5 as well, ▶ 01:35 and 2 plus 3 is 5 as well, so this is the next one to expand. ▶ 01:40

## (64) Unit 3

### (06:26) 1 Introduction

So the next units will be concerned with probabilities ▶ 00:00 and particularly with structured probabilities using Bayes networks. ▶ 00:03 This is some of the most involved material in this class. ▶ 00:08 And since this is a Stanford level class, ▶ 00:12 you will find out that some of the quizzes are actually really hard. ▶ 00:14 So as you go through the material, I hope the hardness of the quizzes won't discourage you; ▶ 00:18 it'll really entice you to take a piece of paper and a pen and work them out. ▶ 00:23 Let me give you a flavor of a Bayes network using an example. ▶ 00:30 Suppose you find in the morning that your car won't start. ▶ 00:35 Well, there's many causes why your car might not start. ▶ 00:39 One is that your battery is flat. ▶ 00:43 Even for a flat battery there is multiple causes. ▶ 00:46 One, it's just plain dead, ▶ 00:50 and one is that the battery is okay but it's not charging. ▶ 00:52 The reason why a battery might not charge is that the alternator might be broken ▶ 00:55 or the fan belt might be broken. ▶ 01:01 If you look at this influence diagram, also called a Bayes network, ▶ 01:03 you'll find there's many different ways to explain that the car won't start. ▶ 01:07 And a natural question you might have is, "Can we diagnose the problem?" ▶ 01:12 One diagnostic tool is a battery meter, ▶ 01:17 which may increase or decrease your belief that the battery may cause your car failure. ▶ 01:20 You might also know your battery age. ▶ 01:26 Older batteries tend to go dead more often. ▶ 01:29 And there's many other ways to look at reasons why the car might not start. ▶ 01:31 You might inspect the lights, the oil light, the gas gauge. ▶ 01:37 You might even dip into the engine to see what the oil level is with a dipstick. ▶ 01:43 All of those relate to alternative reasons why the car might not be starting, ▶ 01:48 like no oil, no gas, the fuel line might be blocked, or the starter may be broken. ▶ 01:52 And all of these can influence your measurements, ▶ 01:59 like the oil light or the gas gauge, in different ways. ▶ 02:04 For example, the battery flat would have an effect on the lights. ▶ 02:07 It might have an effect on the oil light and on the gas gauge, ▶ 02:12 but it won't really affect the oil you measure with the dipstick. ▶ 02:16 That is affected by the actual oil level, which also affects the oil light. ▶ 02:20 Gas will affect the gas gauge, and of course without gas the car doesn't start. ▶ 02:26 So this is a complicated structure that really describes one way to understand ▶ 02:32 how a car doesn't start. ▶ 02:39 A car is a complex system. ▶ 02:41 It has lots of variables you can't really measure immediately, ▶ 02:43 and it has sensors which allow you to understand a little bit about the state of the car. ▶ 02:46 What the Bayes network does, ▶ 02:52 it really assists you in reasoning from observable variables, like the car won't start ▶ 02:54 and the value of the dipstick, to hidden causes, like is the fan belt broken ▶ 03:01 or is the battery dead. ▶ 03:06 What you have here is a Bayes network. ▶ 03:09 A Bayes network is composed of nodes. ▶ 03:13 These nodes correspond to events that you might or might not know ▶ 03:15 that are typically called random variables. ▶ 03:21 These nodes are linked by arcs, and the arcs suggest that a child of an arc ▶ 03:24 is influenced by its parent but not in a deterministic way. ▶ 03:31 It might be influenced in a probabilistic way, which means an older battery, for example, ▶ 03:35 has a higher chance of causing the battery to be dead, ▶ 03:41 but it's not clear that every old battery is dead. ▶ 03:45 There is a total of 16 variables in this Bayes network. ▶ 03:48 What the graph structure and associated probabilities specify ▶ 03:53 is a huge probability distribution in the space of all of these 16 variables. ▶ 03:59 If they are all binary, which we'll assume throughout this unit, ▶ 04:06 they can take 2 to the 16th different values, which is a lot. ▶ 04:10 The Bayes network, as we find out, is a complex representation ▶ 04:15 of a distribution over this very, very large joint probability distribution of all of these variables. ▶ 04:18 Further, once we specify the Bayes network, ▶ 04:26 we can observe, for example, the car won't start. ▶ 04:29 We can observe things like the oil light and the lights and the battery meter ▶ 04:33 and then compute probabilities of the hypothesis, like the alternator is broken ▶ 04:37 or the fan belt is broken or the battery is dead. ▶ 04:41 So in this class we're going to talk about how to construct this Bayes network, ▶ 04:45 what the semantics are, and how to reason in this Bayes network ▶ 04:50 to find out about variables we can't observe, like whether the fan belt is broken or not. ▶ 04:56 That's an overview. ▶ 05:02 Throughout this unit I am going to assume that every event is discrete-- ▶ 05:04 in fact, it's binary. ▶ 05:08 We'll start with some consideration of basic probability, ▶ 05:10 we'll work our way into some simple Bayes networks, ▶ 05:14 we'll talk about concepts like conditional independence ▶ 05:19 and then define Bayes networks more generally, ▶ 05:23 move into concepts like D-separation and start doing parameter counts. ▶ 05:26 Later on, Peter will tell you about inference in Bayes networks. ▶ 05:32 So we won't do this in this class. ▶ 05:36 I can't overemphasize how important this class is. ▶ 05:38 Bayes networks are used extensively in almost all fields of smart computer system, ▶ 05:43 in diagnostics, for prediction, for machine learning, and fields like finance, ▶ 05:49 inside Google, in robotics. ▶ 05:57 Bayes networks are also the building blocks of more advanced AI techniques ▶ 06:00 such as particle filters, hidden Markov models, MDPs and POMDPs, ▶ 06:05 Kalman filters, and many others. ▶ 06:12 These are words that don't sound familiar quite yet, ▶ 06:14 but as you go through the class, I can promise you you will get to know what they mean. ▶ 06:18 So let's start now at the very, very basics. ▶ 06:22 ### (00:40) 2 Probabilities

[Thrun] So let's talk about probabilities. ▶ 00:00 Probabilities are the cornerstone of artificial intelligence. ▶ 00:02 They are used to express uncertainty, ▶ 00:05 and the management of uncertainty is really key to many, many things in AI ▶ 00:08 such as machine learning and Bayes network inference ▶ 00:12 and filtering and robotics and computer vision and so on. ▶ 00:16 So I'm going to start with some very basic questions, ▶ 00:21 and we're going to work our way up from there. ▶ 00:24 Here is a coin. ▶ 00:26 The coin can come up heads or tails, and my question is the following: ▶ 00:28 Suppose the probability for heads is 0.5. ▶ 00:32 What's the probability for it coming up tails? ▶ 00:38 ### (00:19) 2a Answer

[Thrun] So the right answer is a half, or 0.5, ▶ 00:00 and the reason is the coin can only come up heads or tails. ▶ 00:03 We know that it has to be either one. ▶ 00:07 Therefore, the total probability of both coming up is 1. ▶ 00:10 So if half of the probability is assigned to heads, then the other half is assigned to tail. ▶ 00:14 ### (00:08) 2b Question

[Thrun] Let me ask my next quiz. ▶ 00:00 Suppose the probability of heads is a quarter, 0.25. ▶ 00:02 What's the probability of tail? ▶ 00:06 ### (00:17) 2c Answer

[Thrun] And the answer is 3/4. ▶ 00:00 It's a loaded coin, and the reason is, well, ▶ 00:02 each of them come up with a certain probability. ▶ 00:05 The total of those is 1. The quarter is claimed by heads. ▶ 00:08 Therefore, 3/4 remain for tail, which is the answer over here. ▶ 00:12 ### (00:14) 2d Question

[Thrun] Here's another quiz. ▶ 00:00 What's the probability that the coin comes up heads, heads, heads, three times in a row, ▶ 00:02 assuming that each one of those has a probability of a half ▶ 00:08 and that these coin flips are independent? ▶ 00:12 ### (00:14) 2e Answer

[Thrun] And the answer is 0.125. ▶ 00:00 Each head has a probability of a half. ▶ 00:04 We can multiply those probabilities because they are independent events, ▶ 00:06 and that gives us 1 over 8 or 0.125. ▶ 00:10 ### (00:32) 2f Question

[Thrun] Now let's flip the coin 4 times, and let's call Xi the result of the i-th coin flip. ▶ 00:00 So each Xi is going to be drawn from heads or tail. ▶ 00:11 What's the probability that all 4 of those flips give us the same result, ▶ 00:16 no matter what it is, assuming that each one of those has identically ▶ 00:22 an equally distributed probability of coming up heads of the half? ▶ 00:26 ### (00:23) 2g Answer

[Thrun] And the answer is, well, there's 2 ways that we can achieve this. ▶ 00:00 One is the all heads and one is all tails. ▶ 00:04 You already know that 4 times heads is 1/16, ▶ 00:06 and we know that 4 times tail is also 1/16. ▶ 00:10 These are completely independent events. ▶ 00:13 The probability of either one occurring is 1/16 plus 1/16, which is 1/8, which is 0.125. ▶ 00:15 ### (00:10) 2h Question

[Thrun] So here's another one. ▶ 00:00 What's the probability that within the set of X1, X2, X3, and X4 ▶ 00:02 there are at least three heads? ▶ 00:07 ### (00:28) 2i Answer

[Thrun] And the solution is let's look at different sequences ▶ 00:00 in which head occurs at least 3 times. ▶ 00:03 It could be head, head, head, head, in which it comes 4 times. ▶ 00:06 It could be head, head, head, tail and so on, all the way to tail, head, head, head. ▶ 00:10 There's 1, 2, 3, 4, 5 of those outcomes. ▶ 00:16 Each of them has a 16th for probability, so it's 5 times a 16th, which is 0.3125. ▶ 00:19 ### (00:45) 2j Summary

[Thrun] So we just learned a number of things. ▶ 00:00 One is about complementary probability. ▶ 00:02 If an event has a certain probability, p, ▶ 00:05 the complementary event has the probability 1-p. ▶ 00:08 We also learned about independence. ▶ 00:13 If 2 random variables, X and Y, are independent, ▶ 00:15 which you're going to write like this, ▶ 00:19 that means the probability of the joint that any 2 variables can assume ▶ 00:21 is the product of the marginals. ▶ 00:26 So rather than asking the question, "What is the probability ▶ 00:30 "for any combination that these 2 coins or maybe 5 coins could have taken?" ▶ 00:34 we can now look at the probability of each coin individually, ▶ 00:40 look at its probability and just multiply them up. ▶ 00:42 ### (01:04) 3 Dependence

[Thrun] So let me ask you about dependence. ▶ 00:00 Suppose we flip 2 coins. ▶ 00:03 Our first coin is a fair coin, and we're going to denote the outcome by X1. ▶ 00:05 So the chance of X1 coming up heads is half. ▶ 00:12 But now we branch into picking a coin based on the first outcome. ▶ 00:15 So if the first outcome was heads, ▶ 00:20 you pick a coin whose probability of coming up heads is going to be 0.9. ▶ 00:23 The way I word this is by conditional probability, ▶ 00:28 probability of the second coin flip coming up heads ▶ 00:32 provided that or given that X1, the first coin flip, was heads, is 0.9. ▶ 00:35 The first coin flip might also come up tails, ▶ 00:41 in which case I pick a very different coin. ▶ 00:44 In this case I pick a coin which with 0.8 probability will once again give me tails, ▶ 00:47 conditioned on the first coin flip coming up tails. ▶ 00:54 So my question for you is, ▶ 00:57 what's the probability of the second coin flip coming up heads? ▶ 00:59 ### (01:06) 3a Answer

[Thrun] The answer is 0.55. ▶ 00:00 The way to compute this is by the theorem of total probability. ▶ 00:04 Probability of X2 equals heads. ▶ 00:08 There's 2 ways I can get to this outcome. ▶ 00:12 One is via this path over here, and one is via this path over here. ▶ 00:15 Let me just write both of them down. ▶ 00:18 So first of all, it could be the probability of X2 equals heads ▶ 00:20 given that and I will assume X1 was head already. ▶ 00:26 Now I have to add the complementary event. ▶ 00:30 Suppose X1 came up tails. ▶ 00:32 Then I can ask the question, what is the probability that X2 comes up heads regardless, ▶ 00:35 even though X1 was tails? ▶ 00:40 Plugging in the numbers gives us the following. ▶ 00:42 This one over here is 0.9 times a half. ▶ 00:44 The probability of tails is 0.8, ▶ 00:49 thereby my head probability becomes 1 minus 0.8, which is 0.2. ▶ 00:51 Adding all of this together gives me 0.45 plus 0.1, ▶ 00:58 which is exactly 0.55. ▶ 01:03 ### (01:07) 4 What We Learned

So, we actually just learned some interesting lessons. ▶ 00:00 The probability of any random variable Y can be written as ▶ 00:02 probability of Y given that some other random variable X assumes value i ▶ 00:08 times probability of X equals i, ▶ 00:13 sums over all possible outcomes i for the (inaudible) variable X. ▶ 00:17 This is called total probability. ▶ 00:22 The second thing we learned has to do with negation of probabilities. ▶ 00:24 We found that probability of not X given Y is 1 minus probability of X given Y. ▶ 00:27 Now, you might be tempted to say "What about the probability of X given not Y?" ▶ 00:37 "Is this the same as 1 minus probability of X given Y?" ▶ 00:43 And the answer is absolutely no. ▶ 00:51 That's not the case. ▶ 00:54 If you condition on something that has a certain probability value, ▶ 00:56 you can take the event you're looking at and negate this, ▶ 01:00 but you can never negate your conditional variable ▶ 01:03 and assume these values add up to 1. ▶ 01:05 ### (00:25) 5 Weather Quiz

We assume there is sometimes sunny days and sometimes rainy days, ▶ 00:00 and on day 1, which we're going to call D1, ▶ 00:06 the probability of sunny is 0.9. ▶ 00:09 And then let's assume that a sunny day follows a sunny day with 0.8 chance, ▶ 00:13 and a rainy day follows a sunny day with--well-- ▶ 00:20 ### (00:05) 5a Answer

Well, the correct answer is 0.2, which is a negation of this event over here. ▶ 00:00 ### (00:13) 5b Question

A sunny day follows a rainy day with 0.6 chance, ▶ 00:00 and a rainy day follows a rainy day-- ▶ 00:06 please give me your number. ▶ 00:11 ### (00:03) 5c Answer

0.4 ▶ 00:00 ### (00:18) 5d Question

So, what are the chances that D2 is sunny? ▶ 00:00 Suppose the same dynamics apply from D2 to D3, ▶ 00:03 so just replace D3 over here with D2s over there. ▶ 00:06 That means the transition probabilities from one day to the next remain the same. ▶ 00:10 Tell me, what's the probability that D3 is sunny? ▶ 00:14 ### (01:25) 5e Answer

So, the correct answer over here is 0.78, ▶ 00:00 and over here it's 0.756. ▶ 00:04 To get there, let's complete this one first. ▶ 00:10 The probability of D2 = sunny. ▶ 00:13 Well, we know there's a 0.9 chance it's sunny on D1, ▶ 00:16 and then if it is sunny, we know it stays sunny with a 0.8 chance. ▶ 00:21 So, we multiply these 2 things together, and we get 0.72. ▶ 00:25 We know there's a 0.1 chance of it being rainy on day 1, which is the complement, ▶ 00:29 but if it's rainy, we know it switches to sunny with 0.6 chance, ▶ 00:33 so you multiply these 2 things, and you get 0.06. ▶ 00:37 Adding those two up equals 0.78. ▶ 00:41 Now, for the next day, we know our prior for sunny is 0.78. ▶ 00:46 If it is sunny, it stays sunny with 0.8 probability. ▶ 00:51 Multiplying these 2 things gives us 0.624. ▶ 00:55 We know it's rainy with 0.2 chance, which is the complement of 0.78, ▶ 01:01 but a 0.6 chance if it was (inaudible) sunny. ▶ 01:07 But if you multiply those, 0.132. ▶ 01:10 Adding those 2 things up gives us 0.756. ▶ 01:14 So, to some extents, it's tedious to compute these values, ▶ 01:19 but they can be perfectly computed, as shown here. ▶ 01:23 ### (00:19) 6 Cancer Quiz

Next example is a cancer example. ▶ 00:00 Suppose there's a specific type of cancer which exists for 1% of the population. ▶ 00:05 I'm going to write this as follows. ▶ 00:11 You can probably tell me now what the probability of not having this cancer is. ▶ 00:13 ### (00:28) 6a Answer and Cancer Test

And yes, the answer is 0.99. ▶ 00:00 Let's assume there's a test for this cancer, ▶ 00:04 which gives us probabilistically an answer whether we have this cancer or not. ▶ 00:07 So, let's say the probability of a test being positive, as indicated by this + sign, ▶ 00:12 given that we have cancer, is 0.9. ▶ 00:18 The probability of the test coming out negative if we have the cancer is--you name it. ▶ 00:22 ### (01:01) 6b Answer

0.1, which is the difference between 1 and 0.9. ▶ 00:00 Let's assume the probability of the test coming out positive ▶ 00:06 given that we don't have this cancer is 0.2. ▶ 00:11 In other words, the probability of the test correctly saying ▶ 00:15 we don't have the cancer if we're cancer free is 0.8. ▶ 00:19 Now, ultimately, I'd like to know what's the probability ▶ 00:24 they have this cancer given they just received a single, positive test? ▶ 00:28 Before I do this, please help me filling out some other probabilities ▶ 00:35 that are actually important. ▶ 00:39 Specifically, the joint probabilities. ▶ 00:41 The probability of a positive test and having cancer. ▶ 00:45 The probability of a negative test and having cancer, ▶ 00:51 and this is not conditional anymore. ▶ 00:53 It's now a joint probability. ▶ 00:55 So, please give me those 4 values over here. ▶ 00:57 ### (00:40) 6c Answer

And here the correct answer is 0.009, ▶ 00:00 which is the product of your prior, 0.01, times the conditional, 0.9. ▶ 00:05 Over here we get 0.001, the probability of our prior cancer times 0.1. ▶ 00:12 Over here we get 0.198, ▶ 00:21 the probability of not having cancer is 0.99 ▶ 00:26 times still getting a positive reading, which is 0.2. ▶ 00:29 And finally, we get 0.792, ▶ 00:32 which is the probability of this guy over here, and this guy over here. ▶ 00:37 ### (00:07) 6d Question

Now, our next quiz, I want you to fill in the probability of ▶ 00:00 the cancer given that we just received a positive test. ▶ 00:04 ### (01:52) 6e Answer

And the correct answer is 0.043. ▶ 00:00 So, even though I received a positive test, ▶ 00:06 my probability of having cancer is just 4.3%, ▶ 00:09 which is not very much given that the test itself is quite sensitive. ▶ 00:14 It really gives me a 0.8 chance of getting a negative result if I don't have cancer. ▶ 00:18 It gives me a 0.9 chance of detecting cancer given that I have cancer. ▶ 00:26 Now, what comes (inaudible) small? ▶ 00:32 Well, let's just put all the cases together. ▶ 00:35 You already know that we received a positive test. ▶ 00:38 Therefore, this entry over here, and this entry over here are relevant. ▶ 00:41 Now, the chance of having a positive test and having cancer is 0.009. ▶ 00:47 Well, I might--when I receive a positive test--have cancer or not cancer, ▶ 00:56 so we will just normalize by these 2 possible causes for the positive test, ▶ 01:01 which is 0.009 + 0.198. ▶ 01:06 We know both these 2 things together gets 0.009 over 0.207, ▶ 01:11 which is approximately 0.043. ▶ 01:20 Now, the interesting thing in this equation is that the chances ▶ 01:23 of having seen a positive test result in the absence of cancers ▶ 01:28 are still much, much higher than the chance of seeing a positive result ▶ 01:32 in the presence of cancer, and that's because our prior for cancer ▶ 01:35 is so small in the population that it's just very unlikely to have cancer. ▶ 01:39 So, the additional information of a positive test ▶ 01:44 only erased my posterior probability to 0.043. ▶ 01:47 ### (03:34) 7 Bayes Rule

So, we've just learned about what's probably the most important ▶ 00:00 piece of math for this class in statistics called Bayes Rule. ▶ 00:03 It was invented by Reverend Thomas Bayes, who was a British mathematician ▶ 00:09 and a Presbyterian minister in the 18th century. ▶ 00:15 Bayes Rule is usually stated as follows: P of A given B where B is the evidence ▶ 00:18 and A is the variable we care about is P of B given A times P of A over P of B. ▶ 00:27 This expression is called the likelihood. ▶ 00:36 This is called the prior, and this is called marginal likelihood. ▶ 00:40 The expression over here is called the posterior. ▶ 00:46 The interesting thing here is the way the probabilities are reworded. ▶ 00:50 Say we have evidence B. ▶ 00:55 We know about B, but we really care about the variable A. ▶ 00:57 So, for example, B is a test result. ▶ 01:01 We don't care about the test result as much as we care about the fact ▶ 01:03 whether we have cancer or not. ▶ 01:06 This diagnostic reasoning--which is from evidence to its causes-- ▶ 01:08 is turned upside down by Bayes Rule into a causal reasoning, ▶ 01:16 which is given--hypothetically, if we knew the cause, ▶ 01:22 what would be the probability of the evidence we just observed. ▶ 01:27 But to correct for this inversion, we have to multiply ▶ 01:31 by the prior of the cause to be the case in the first place, ▶ 01:36 in this case, having cancer or not, ▶ 01:40 and divide it by the probability of the evidence, P(B), ▶ 01:42 which often is expanded using the theorem of total probability as follows. ▶ 01:47 The probability of B is a sum over all probabilities of B ▶ 01:52 conditional on A, lower caps a, times the probability of A equals lower caps a. ▶ 01:58 This is total probability as we already encountered it. ▶ 02:04 So, let's apply this to the cancer case ▶ 02:08 and say we really care about whether you have cancer, ▶ 02:10 which is our cause, conditioned on the evidence ▶ 02:13 that is the result of this hidden cause, in this case, a positive test result. ▶ 02:17 Let's just plug in the numbers. ▶ 02:23 Our likelihood is the probability of seeing a positive test result ▶ 02:25 given that you have cancer multiplied by the prior probability ▶ 02:30 of having cancer over the probability of the positive test result, ▶ 02:33 and that is--according to the tables we looked at before-- ▶ 02:38 0.9 times a prior of 0.01 over-- ▶ 02:43 now we're going to expand this right over here according to total probability ▶ 02:50 which gives us 0.9 times 0.01. ▶ 02:55 That's the probability of + given that we do have cancer. ▶ 03:01 So, the probability of + given that we don't have cancer is 0.2, ▶ 03:06 but the prior here is 0.99. ▶ 03:11 So, if we plug in the numbers we know about, we get 0.009 ▶ 03:15 over 0.009 + 0.198. ▶ 03:20 That is approximately 0.0434, which is the number we saw before. ▶ 03:27 ### (01:52) 7a Bayes Rule Graphically

So, if you want to draw Bayes rule graphically, ▶ 00:00 we have a situation where we have an internal variable A, ▶ 00:03 like whether I'm going to die of cancer, but we can't sense A. ▶ 00:08 Instead, we have a second variable, called B, ▶ 00:13 which is our test, and B is observable, but A isn't. ▶ 00:16 This is a classical example of a Bayes network. ▶ 00:21 The Bayes network is composed of 2 variables, A and B. ▶ 00:26 We know the prior probability for A, ▶ 00:30 and we know the conditional. ▶ 00:33 A causes B--whether or not we have cancer, ▶ 00:35 causes the test result to be positive or not, ▶ 00:38 although there was some randomness involved. ▶ 00:41 So, we know what the probability of B given the different values for A, ▶ 00:44 and what we care about in this specific instance is called diagnostic reasoning, ▶ 00:49 which is the inverse of the causal reasoning, ▶ 00:54 the probability of A given B or similarly, probability of A given not B. ▶ 00:58 This is our very first Bayes network, and the graphical representation ▶ 01:06 of drawing 2 variables, A and B, connected with an arc ▶ 01:11 that goes from A to B is the graphical representation of a distribution ▶ 01:15 of 2 variables that are specified in the structure over here, ▶ 01:22 which has a prior probability and has a conditional probability as shown over here. ▶ 01:26 Now, I do have a quick quiz for you. ▶ 01:31 How many parameters does it take to specify ▶ 01:34 the entire joint probability within A and B, or differently, the entire Bayes network? ▶ 01:37 I'm not looking for structural parameters that relate to the graph over here. ▶ 01:43 I'm just looking for the numerical parameters of the underlying probabilities. ▶ 01:48 ### (00:24) 7b Answer

And the answer is 3. ▶ 00:00 It takes 1 parameter to specify P of A from which we can derive P of not A. ▶ 00:02 It takes 2 parameters to specify P of B given A and P given not A, ▶ 00:09 from which we can derive P not B given A and P of not B given not A. ▶ 00:15 So, it's a total of 3 parameters for this Bayes network. ▶ 00:21 ### (02:32) 8 More Complex Bayes Networks

So, we just encountered our very first Bayes network ▶ 00:00 and did a number of interesting calculations. ▶ 00:03 Let's now talk about Bayes Rule and look into more complex Bayes networks. ▶ 00:06 I will look at Bayes Rule again and make an observation ▶ 00:10 that is really non-trivial. ▶ 00:13 Here is Bayes Rule, and in practice, what we find is ▶ 00:15 this term here is relatively easy to compute. ▶ 00:20 It's just a product, whereas this term is really hard to compute. ▶ 00:23 However, this term over here does not depend on what we assume for variable A. ▶ 00:28 It's just the function of B. ▶ 00:33 So, suppose for a moment we also care about the complementary event of not A ▶ 00:35 given B, for which Bayes Rule unfolds as follows. ▶ 00:40 Then we find that the normalizer, P(B), is identical, ▶ 00:43 whether we assume A on the left side or not A on the left side. ▶ 00:47 We also know from prior work that P(A) given B plus ▶ 00:51 P of not A given B must be one because these are 2 complementary events. ▶ 00:57 That allows us to compute Bayes Rule very differently ▶ 01:03 by basically ignoring the normalizer, so here's how it goes. ▶ 01:06 We compute P(A) given B--and I want to call this prime, ▶ 01:11 because it's not a real probability--to be just P(B) given A times P(A), ▶ 01:16 which is the normalizer, so the denominator of the expression over here. ▶ 01:23 We do the same thing with not A. ▶ 01:28 So, in both cases, we compute the posterior probability non-normalized ▶ 01:31 by omitting the normalizer B. ▶ 01:36 And then we can recover the original probabilities by normalizing ▶ 01:38 based on those values over here, so the probability of A given B, ▶ 01:43 the actual probability, is a normalizer, eta, ▶ 01:48 times this non-normalized form over here. ▶ 01:52 The same is true for the negation of A over here. ▶ 01:55 And eta is just the normalizer that results by adding these 2 values over here together ▶ 01:59 as shown over here, and dividing them for one. ▶ 02:06 So, take a look at this for a moment. ▶ 02:10 What we've done is we deferred the calculation of the normalizer over here ▶ 02:13 by computing pseudo probabilities that are non-normalized. ▶ 02:18 This made the calculation much easier, and when we were done with everything, ▶ 02:22 we just folded it back into the normalizer based on the resulting ▶ 02:26 pseudo probabilities and got the correct answer. ▶ 02:29 ### (01:08) 8a Two Test Cancer Example

The reason why I gave you all this is because I want you to apply it now ▶ 00:00 to a slightly more complicated problem, which is the 2-test cancer example. ▶ 00:03 In this example, we again might have our unobservable cancer C, ▶ 00:08 but now we're running 2 tests, test 1 and test 2. ▶ 00:14 As before, the prior probability of cancer is 0.01. ▶ 00:18 The probability of receiving a positive test result for either test is 0.9. ▶ 00:24 The probability of getting a negative result given they're cancer free is 0.8. ▶ 00:30 And from those, we were able to compute all the other probabilities, ▶ 00:36 and we're just going to write them down over here. ▶ 00:40 So, take a moment to just verify those. ▶ 00:43 Now, let's assume both of my tests come back positive, ▶ 00:46 so T1 = + and T2 = +. ▶ 00:50 What's the probability of cancer now written in short form probability of ▶ 00:56 C given ++? ▶ 01:00 I want you to tell me what that is, and this is a non-trivial question. ▶ 01:03 ### (02:00) 8b Answer

So, the correct answer is 0.1698 approximately, ▶ 00:00 and to compute this, I used the trick I've shown you before. ▶ 00:10 Let me write down the running count for cancer and for not cancer ▶ 00:15 as I integrate the various multiplications in Bayes Rule. ▶ 00:24 My prior for cancer was 0.01 and for non-cancer was 0.99. ▶ 00:28 Then I get my first +, and the probability of a + given they have cancer is 0.9, ▶ 00:37 and the same for non-cancer is 0.2. ▶ 00:43 So, according to the non-normalized Bayes Rule, ▶ 00:48 I now multiply these 2 things together to get my non-normalized probability ▶ 00:52 of having cancer given the plus. ▶ 00:58 Since multiplication is commutative, ▶ 01:00 I can do the same thing again with my 2nd test result, 0.9 and 0.2, ▶ 01:03 and I multiply all of these 3 things together to get my non-normalized probability ▶ 01:09 P prime to be the following: 0.0081, if you multiply those things together, ▶ 01:14 and 0.0396 if you multiply these facts together. ▶ 01:21 And these are not a probability. ▶ 01:28 If we add those for the 2 complementary of cancer/non-cancer, ▶ 01:30 I get 0.0477. ▶ 01:34 However, if I now divide, that is, I normalize ▶ 01:38 those non-normalized probabilities over here by this factor over here, ▶ 01:42 I actually get the correct posterior probability P of cancer given ++. ▶ 01:47 And they look as follows: ▶ 01:52 approximately 0.1698 and approximately 0.8301. ▶ 01:54 ### (00:10) 8c Question

Calculate for me the probability of cancer ▶ 00:00 given that I received one positive and one negative test result. ▶ 00:03 Please write your number into this box. ▶ 00:08 ### (01:03) 8d Answer

We apply the same trick as before ▶ 00:00 where we use the exact same prior of 0.01. ▶ 00:03 Our first + gives us the following factors: 0.9 and 0.2. ▶ 00:07 And our minus gives us the probability 0.1 for a negative first test result given that we have cancer, ▶ 00:13 and a 0.8 for the inverse of a negative result of not having cancer. ▶ 00:20 We multiply those together. ▶ 00:26 We get our non-normalized probability. ▶ 00:28 And if we now normalize by the sum of those two things ▶ 00:30 to turn this back into a probability, we get 0.009 ▶ 00:35 over the sum of those two things over here, and this is 0.0056 ▶ 00:41 for the chance of having cancer and 0.9943 for the chance of being cancer free. ▶ 00:50 And this adds up approximately to 1, and therefore, is a probability distribution. ▶ 00:59 ### (02:45) 9 Conditional Independence

I want to use a few words of terminology. ▶ 00:00 This, again, is a Bayes network, of which the hidden variable C ▶ 00:03 causes the still stochastic test outcomes T1 and T2. ▶ 00:08 And what is really important is that we assume not just ▶ 00:16 that T1 and T2 are identically distributed. ▶ 00:19 We use the same 0.9 for test 1 as we use for test 2, ▶ 00:22 but we also assume that they are conditionally independent. ▶ 00:27 We assumed that if God told us whether we actually had cancer or not, ▶ 00:31 if we knew with absolute certainty the value of the variable C, ▶ 00:37 that knowing anything about T1 would not help us make a statement about T2. ▶ 00:41 Put differently, we assumed that the probability of T2 given C and T1 ▶ 00:48 is the same as the probability of T2 given C. ▶ 00:55 This is called conditional independence, which is given the value of the cancer variable C. ▶ 01:00 If you knew this for a fact, then T2 would be independent of T1. ▶ 01:08 It's conditionally independent because the independence only holds true ▶ 01:17 if we actually know C, and it comes out of this diagram over here. ▶ 01:21 If we look at this diagram, if you knew the variable C over here, ▶ 01:26 then C separately causes T1 and T2. ▶ 01:32 So, as a result, if you know C, whatever counted over here ▶ 01:39 is kind of cut off causally from what happens over here. ▶ 01:46 That causes these 2 variables to be conditionally independent. ▶ 01:48 So, conditional independence is a really big thing in Bayes networks. ▶ 01:52 Here's a Bayes network where A causes B and C, ▶ 01:58 and for a Bayes network of this structure, we know that given A, ▶ 02:02 B and C are independent. ▶ 02:08 It's written as B conditionally independent of C given A. ▶ 02:11 So, here's a question. ▶ 02:16 Suppose we have conditional independence between B and C given A. ▶ 02:18 Would that imply--and there's my question--that B and C are independent? ▶ 02:21 So, suppose we don't know A. ▶ 02:28 We don't know whether we have cancer, for example. ▶ 02:30 What that means is that the test results individually are still independent of each other ▶ 02:33 even if we don't know about the cancer situation. ▶ 02:38 Please answer yes or no. ▶ 02:42 ### (00:41) 9a Answer

And the correct answer is No ▶ 00:00 Intuitively, getting a positive test result about cancer ▶ 00:03 gives us information about whether you have cancer or not. ▶ 00:08 So if you get a positive test result ▶ 00:13 you're going to raise the probability of having cancer ▶ 00:15 relative to the prior probability. ▶ 00:18 With that increased probability we will predict ▶ 00:20 that another test will with a higher likelihood ▶ 00:24 give us a positive response than if we hadn't taken the previous test. ▶ 00:27 That's really important to understand ▶ 00:33 So that we understand it let me make you calculate those probabilities ▶ 00:36 ### (00:35) 9b Question

Let me draw the cancer example again with two tests. ▶ 00:00 Here's my cancer variable ▶ 00:05 and then there's two conditionally independent tests T1 and T2. ▶ 00:07 And as before let me assume that the prior probability of cancer is 0.01 ▶ 00:13 What I want you to compute for me is the probability of the second test ▶ 00:19 to be positive if we know that the first test was positive. ▶ 00:26 So write this into the following box. ▶ 00:33 ### (02:52) 9c Answer

So, for this one, we want to apply total probability. ▶ 00:00 This thing over here is the same as probability of test 2 to be positive, ▶ 00:04 which I'm going to abbreviate with a +2 over here, ▶ 00:10 conditioned on test 1 being positive and me having cancer ▶ 00:14 times the probability of me having cancer given test 1 was positive plus ▶ 00:19 the probability of test 2 being positive conditioned on test 1 being positive ▶ 00:25 and me not having cancer times the probability of me not having cancer ▶ 00:31 given that test 1 is positive. ▶ 00:36 That's the same as the theorem of total probability, ▶ 00:38 but now everything is conditioned on +1. ▶ 00:42 Take a moment to verify this. ▶ 00:46 Now, here I can plug in the numbers. ▶ 00:48 You already calculated this one before, which is approximately 0.043, ▶ 00:50 and this one over here is 1 minus that, which is 0.957 approximately. ▶ 00:57 And this term over here now exploits conditional independence, ▶ 01:05 which is given that I know C, knowledge of the first test ▶ 01:09 gives me no more information about the second test. ▶ 01:14 It only gives me information if C was unknown, as was the case over here. ▶ 01:17 So, I can rewrite this thing over here as follows: ▶ 01:21 P of +2 given that I have cancer. ▶ 01:24 I can drop the +1, and the same is true over here. ▶ 01:27 This is exploiting my conditional independence. ▶ 01:31 I knew that P of +1 or +2 conditioned on C ▶ 01:34 is the same as P of +2 conditioned on C and +1. ▶ 01:41 I can now read those off my table over here, ▶ 01:47 which is 0.9 times 0.043 plus 0.2, ▶ 01:50 which is 1 minus 0.8 over here times 0.957, ▶ 01:58 which gives me approximately 0.2301. ▶ 02:03 So, that says if my first test comes in positive, ▶ 02:09 I expect my second test to be positive with probably 0.2301. ▶ 02:14 That's an increased probability to the default probability, ▶ 02:21 which we calculated before, which is the probability of any test, ▶ 02:24 test 2 come in as positive before was the normalizer of Bayes rule which was 0.207. ▶ 02:29 So, my first test has a 20% chance of coming in positive. ▶ 02:38 My second test, after seeing a positive test, ▶ 02:43 has now an increased probability of about 23% of coming in positive. ▶ 02:47 ### (00:27) 9d Absolute vs Conditional Independence

So, now we've learned about independence, ▶ 00:00 and the corresponding Bayes network has 2 nodes. ▶ 00:02 They're just not connected at all. ▶ 00:04 And we learned about conditional independence, ▶ 00:07 in which case we have a Bayes network that looks like this. ▶ 00:09 Now I would like to know whether absolute independence ▶ 00:12 implies conditional independence. ▶ 00:16 True or false? ▶ 00:18 And I'd also like to know whether conditional independence implies absolute independence. ▶ 00:20 Again, true or false? ▶ 00:25 ### (00:45) 9e Answer

And the answer is both of them are false. ▶ 00:00 We already saw that conditional independence, as shown over here, ▶ 00:03 doesn't give us absolute independence. ▶ 00:07 So, for example, this is test #1 and test #2. ▶ 00:09 You might or might not have cancer. ▶ 00:13 Our first test gives us information about whether you have cancer or not. ▶ 00:15 As a result, we've changed our prior probability ▶ 00:18 for the second test to come in positive. ▶ 00:21 That means that conditional independence does not imply absolute independence, ▶ 00:24 which means this assumption here falls, ▶ 00:30 and it also turns out that if you have absolute independence, ▶ 00:32 things might not be conditionally independent for reasons that I can't quite explain so far, ▶ 00:37 but that we will learn about next. ▶ 00:43 ### (01:59) 10 Different Type of Bayes Network

[Thrun] For my next example, I will study a different type of a Bayes network. ▶ 00:00 Before, we've seen networks of the following type, ▶ 00:04 where a single hidden cause caused 2 different measurements. ▶ 00:08 I now want to study a network that looks just like the opposite. ▶ 00:13 We have 2 independent hidden causes, ▶ 00:17 but they get confounded within a single observational variable. ▶ 00:20 I would like to use the example of happiness. ▶ 00:26 Suppose I can be happy or unhappy. ▶ 00:29 What makes me happy is when the weather is sunny or if I get a raise in my job, ▶ 00:33 which means I make more money. ▶ 00:41 So let's call this sunny, let's call this a raise, and call this happiness. ▶ 00:43 Perhaps the probability of it being sunny is 0.7, ▶ 00:47 probability of a raise is 0.01. ▶ 00:53 And I will tell you that the probability of being happy is governed as follows. ▶ 00:58 The probability of being happy given that both of these things occur-- ▶ 01:05 I got a raise and it is sunny--is 1. ▶ 01:09 The probability of being happy given that it is not sunny and I still got a raise is 0.9. ▶ 01:13 The probability of being happy given that it's sunny but I didn't get a raise is 0.7. ▶ 01:20 And the probability of being happy given that it is neither sunny nor did I get a raise is 0.1. ▶ 01:27 This is a perfectly fine specification of a probability distribution ▶ 01:35 where 2 causes affect the variable down here, the happiness. ▶ 01:39 So I'd like you to calculate for me the following questions. ▶ 01:46 Probability of a raise given that it is sunny, according to this model. ▶ 01:50 Please enter your answer over here. ▶ 01:57 ### (00:55) 10a Answer

[Thrun] The answer is surprisingly simple. ▶ 00:00 It is 0.01. ▶ 00:03 How do I know this so fast? ▶ 00:05 Well, if you look at this Bayes network, ▶ 00:08 both the sunniness and the question whether I got a raise impact my happiness. ▶ 00:12 But since I don't know anything about the happiness, ▶ 00:21 there is no way that just the weather might implicate or impact whether I get a raise or not. ▶ 00:24 In fact, it might be independently sunny, and I might independently get a raise at work. ▶ 00:32 There is no mechanism of which these 2 things would co-occur. ▶ 00:39 Therefore, the probability of a raise given that it's sunny ▶ 00:46 is just the same as the probability of a raise given any weather, which is 0.01. ▶ 00:49 ### (01:51) 11 Explaining Away

[Thrun] Let me talk about a really interesting special instance of Bayes net reasoning ▶ 00:00 which is called explaining away. ▶ 00:07 And I'll first give you the intuitive answer, ▶ 00:10 then I'll wish you to compute probabilities for me that manifest the explain away effect ▶ 00:14 in a Bayes network of this type. ▶ 00:19 Explaining away means that if we know that we are happy, ▶ 00:22 then sunny weather can explain away the cause of happiness. ▶ 00:27 If I then also know that it's sunny, it becomes less likely that I received a raise. ▶ 00:34 Let me put this differently. ▶ 00:41 Suppose I'm a happy guy on a specific day ▶ 00:43 and my wife asks me, "Sebastian, why are you so happy?" ▶ 00:45 "Is it sunny, or did you get a raise?" ▶ 00:49 If she then looks outside and sees it is sunny, ▶ 00:52 then she might explain to herself, ▶ 00:55 "Well, Sebastian is happy because it is sunny." ▶ 00:57 "That makes it effectively less likely that he got a raise ▶ 01:00 "because I could already explain his happiness by it being sunny." ▶ 01:05 If she looks outside and it is rainy, ▶ 01:10 that makes it more likely I got a raise, ▶ 01:13 because the weather can't really explain my happiness. ▶ 01:16 In other words, if we see a certain effect that could be caused by multiple causes, ▶ 01:20 seeing one of those causes can explain away any other potential cause ▶ 01:27 of this effect over here. ▶ 01:33 So let me put this in numbers and ask you the challenging question of ▶ 01:36 what's the probability of a raise given that I'm happy and it's sunny? ▶ 01:43 ### (01:32) 11a Answer

[Thrun] The answer is approximately 0.0142, ▶ 00:00 and it is an exercise in expanding this term using Bayes' rule, ▶ 00:07 using total probability, which I'll just do for you. ▶ 00:11 Using Bayes' rule, you can transform this into P of H given R comma S ▶ 00:16 times P of R given S over P of H given S. ▶ 00:24 We observe the conditional independence of R and S ▶ 00:34 to simplify this to just P of R, ▶ 00:37 and the denominator is expanded by folding in R and not R, ▶ 00:40 P of H given R comma S ▶ 00:46 times P of R plus P of H given not R and S ▶ 00:49 times P of not R, which is total probability. ▶ 00:54 We can now read off the numbers from the tables over here, ▶ 00:58 which gives us 1 times 0.01 divided by this expression ▶ 01:01 that is the same as the expression over here, so 0.01 plus this thing over here, ▶ 01:10 which you can find over here to be 0.7, times this guy over here, ▶ 01:17 which is 1 minus the value over here, 0.99, ▶ 01:23 which gives us approximately 0.0142. ▶ 01:27 ### (00:31) 11b Question

[Thrun] Now, to understand the explain away effect, ▶ 00:00 you have to compare this to the probability of a raise given that we're just happy ▶ 00:04 and we don't know anything about the weather. ▶ 00:11 So let's do that exercise next. ▶ 00:14 So my next quiz is, what's the probability of a raise given that all I know is that I'm happy ▶ 00:16 and I don't know about the weather? ▶ 00:24 This happens to be once again a pretty complicated question, so take your time. ▶ 00:26 ### (02:53) 11c Answer

[Thrun] So this is a difficult question. ▶ 00:00 Let me compute an auxiliary variable, which is P of happiness. ▶ 00:02 That one is expanded by looking at the different conditions that can make us happy. ▶ 00:12 P of happiness given S and R ▶ 00:19 times P of S and R, which is of course the product of those 2 ▶ 00:24 because they are independent, ▶ 00:29 plus P of happiness given not S R, probability of not as R ▶ 00:31 plus P of H given S and not R ▶ 00:39 times the probability of P of S and not R plus the last case, ▶ 00:43 P of H given not S and not R. ▶ 00:48 So this just looks at the happiness under all 4 combinations of the variables ▶ 00:52 that can lead to happiness. ▶ 00:56 And you can plug those straight in. ▶ 00:58 This one over here is 1, and this one over here is the product of S and R, ▶ 01:00 which is 0.7 times 0.01. ▶ 01:05 And as you plug all of those in, ▶ 01:10 you get as a result 0.5245. ▶ 01:14 That's P of H. ▶ 01:21 Just take some time and do the math by going through these different cases ▶ 01:24 using total probability, and you get this result. ▶ 01:28 Armed with this number, the rest now becomes easy, ▶ 01:32 which is we can use Bayes' rule to turn this around. ▶ 01:38 P of H given R times P of R over P of H. ▶ 01:43 P of R we know from over here, the probability of a raise is 0.01. ▶ 01:49 So the only thing we need to compute now is P of H given R. ▶ 01:54 And again, we apply total probability. ▶ 01:57 Let me just do this over here. ▶ 01:59 We can factor P of H given R as P of H given R and S, sunny, ▶ 02:02 times probability of sunny plus P of H given R and not sunny ▶ 02:09 times the probability of not sunny. ▶ 02:14 And if you plug in the numbers with this, you get 1 times 0.7 ▶ 02:16 plus 0.9 times 0.3. ▶ 02:21 That happens to be 0.97. ▶ 02:25 So if we now plug this all back into this equation over here, ▶ 02:30 we get 0.97 times 0.01 over 0.5245. ▶ 02:33 This gives us approximately as the correct answer 0.0185. ▶ 02:45 ### (01:42) 11d Question

[Thrun] And if you got this right, I will be deeply impressed ▶ 00:00 about the fact you got this right. ▶ 00:04 But the interesting thing now to observe is if we happen to know it's sunny ▶ 00:07 and I'm happy, then the probability of a raise is 14%, 0.014. ▶ 00:13 If I don't know about the weather and I'm happy, ▶ 00:21 then the probability of a raise goes up to about 18.5%. ▶ 00:26 Why is that? ▶ 00:30 Well, it's the explaining away effect. ▶ 00:32 My happiness is well explained by the fact that it's sunny. ▶ 00:35 So if someone observes me to be happy and asks the question, ▶ 00:40 "Is this because Sebastian got a raise at work?" ▶ 00:43 well, if you know it's sunny and this is a fairly good explanation for me being happy, ▶ 00:46 you don't have to assume I got a raise. ▶ 00:53 If you don't know about the weather, then obviously the chances are higher ▶ 00:55 that the raise caused my happiness, ▶ 01:01 and therefore this number goes up from 0.014 to 0.018. ▶ 01:03 Let me ask you one final question in this next quiz, ▶ 01:10 which is the probability of the raise given that I look happy and it's not sunny. ▶ 01:14 This is the most extreme case for making a raise likely ▶ 01:23 because I am a happy guy, and it's definitely not caused by the weather. ▶ 01:27 So it could be just random, or it could be caused by the raise. ▶ 01:33 So please calculate this number for me and enter it into this box. ▶ 01:37 ### (01:18) 11e Answer

[Thrun] Well, the answer follows the exact same scheme as before, ▶ 00:00 with S being replaced by not S. ▶ 00:04 So this should be an easier question for you to answer. ▶ 00:08 P of R given H and not S can be inverted by Bayes' rule to be as follows. ▶ 00:11 Once we apply Bayes' rule, as indicated over here where we swapped H to the left side ▶ 00:20 and R to the right side, you can observe that this value over here ▶ 00:24 can be readily found in the table. ▶ 00:29 It's actually the 0.9 over there. ▶ 00:32 This value over here, the raise is independent of the weather ▶ 00:35 by virtue of our Bayes network, so it's just 0.01. ▶ 00:41 And as before, we apply total probability to the expression over here, ▶ 00:45 and we obtain off this quotient over here that these 2 expressions are the same. ▶ 00:52 P of H given not S, not R is the value over here, ▶ 00:58 and the 0.99 is the complement of probability of R taken from over here, ▶ 01:03 and that ends up to be 0.0833. ▶ 01:08 This would have been the right answer. ▶ 01:16 ### (03:13) 11f Conclusion

[Thrun] It's really interesting to compare this to the situation over here. ▶ 00:00 In both cases I'm happy, as shown over here, ▶ 00:04 and I ask the same question, which is whether I got a raise at work, as R over here. ▶ 00:08 But in one case I observe that the weather is sunny; in the other one it isn't. ▶ 00:15 And look what it does to my probability of having received a raise. ▶ 00:21 The sunniness perfectly well explains my happiness, ▶ 00:25 and my probability of having received a raise ends up to be a mere 1.4%, or 0.014. ▶ 00:30 However, if my wife observes it to be non-sunny, then it is much more likely ▶ 00:41 that the cause of my happiness is related to a raise at work, ▶ 00:47 and now the probability is 8.3%, which is significantly higher than the 1.4% before. ▶ 00:51 This is a Bayes network of which S and R are independent ▶ 00:58 but H adds a dependence between S and R. ▶ 01:04 Let me talk about this in a little bit more detail on the next paper. ▶ 01:10 So here is our Bayes network again. ▶ 01:16 In our previous exercises, we computed for this network ▶ 01:18 that the probability of a raise of R given any of these variables shown here was as follows. ▶ 01:22 The really interesting thing is that in the absence of information about H, ▶ 01:29 which is the middle case over here, ▶ 01:34 the probability of R is unaffected by knowledge of S-- ▶ 01:37 that is, R and S are independent. ▶ 01:41 This is the same as probability of R, ▶ 01:46 and R and S are independent. ▶ 01:49 However, if I know something about the variable H, ▶ 01:56 then S and R become dependent-- ▶ 02:02 that is, knowing about my happiness over here renders S and R dependent. ▶ 02:06 This is not the same as probability of just R given H. ▶ 02:15 Obviously, it isn't because if I now vary S from S to not S, ▶ 02:23 it affects my probability for the variable R. ▶ 02:28 That is a really unusual situation ▶ 02:33 where we have R and S are independent ▶ 02:36 but given the variable H, R and S are not independent anymore. ▶ 02:40 So knowledge of H makes 2 variables that previously were independent non-independent. ▶ 02:50 Offered differently, 2 variables that are independent may not be in certain cases ▶ 02:58 conditionally independent. ▶ 03:06 Independence does not imply conditional independence. ▶ 03:08 ### (02:53) 12 General Bayes Networks

[Thrun] So we're now ready to define Bayes networks in a more general way. ▶ 00:00 Bayes networks define probability distributions over graphs or random variables. ▶ 00:05 Here is an example graph of 5 variables, ▶ 00:10 and this Bayes network defines the distribution over those 5 random variables. ▶ 00:14 Instead of enumerating all possibilities of combinations of these 5 random variables, ▶ 00:19 the Bayes network is defined by probability distributions ▶ 00:24 that are inherent to each individual node. ▶ 00:28 For node A and B, we just have a distribution P of A and P of B ▶ 00:32 because A and B have no incoming arcs. ▶ 00:38 C is a conditional distribution conditioned on A and B. ▶ 00:42 D and E are conditioned on C. ▶ 00:47 The joint probability represented by a Bayes network ▶ 00:52 is the product of various Bayes network probabilities ▶ 00:56 that are defined over individual nodes ▶ 01:00 where each node's probability is only conditioned on the incoming arcs. ▶ 01:03 So A has no incoming arc; therefore, we just want it P of A. ▶ 01:08 C has 2 incoming arcs, so we define the probability of C conditioned on A and B. ▶ 01:12 And D and E have 1 incoming arc that's shown over here. ▶ 01:18 The definition of this joint distribution by using the following factors ▶ 01:22 has one really big advantage. ▶ 01:27 Whereas the joint distribution over any 5 variables requires 2 to the 5 minus 1, ▶ 01:30 which is 31 probability values, ▶ 01:40 the Bayes network over here only requires 10 such values. ▶ 01:43 P of A is one value, for which we can derive P of not A. ▶ 01:48 Same for P of B. ▶ 01:53 P of C given A B is derived by a distribution over C ▶ 01:55 conditioned on any combination of A and B, of which there are 4 of A and B as binary. ▶ 02:02 P of D given C is 2 parameters for P of D given C and P of D given not C. ▶ 02:07 And the same is true for P of E given C. ▶ 02:15 So if you add those up, you get 10 parameters in total. ▶ 02:18 So the compactness of the Bayes network ▶ 02:21 leads to a representation that scales significantly better to large networks ▶ 02:25 than the common natorial approach which goes through all combinations of variable values. ▶ 02:31 That is a key advantage of Bayes networks, ▶ 02:36 and that is the reason why Bayes networks are being used so extensively ▶ 02:39 for all kinds of problems. ▶ 02:43 So here is a quiz. ▶ 02:45 How many probability values are required to specify this Bayes network? ▶ 02:47 Please put your answer in the following box. ▶ 02:51 ### (00:19) 12a Answer

[Thrun] And the answer is 13. ▶ 00:00 One over here, 2 over here, and 4 over here. ▶ 00:03 Simplifiably speaking, any variable that has K inputs requires 2 to the K such variables. ▶ 00:06 So in total we have 1, 9, 13. ▶ 00:15 ### (00:17) 12b Question

[Thrun] Here's another quiz. ▶ 00:00 How many parameters do we need to specify the joint distribution ▶ 00:02 for this Bayes network over here ▶ 00:06 where A, B, and C point into D, D points into E, F, and G ▶ 00:09 and C also points into G? ▶ 00:13 Please write your answer into this box. ▶ 00:15 ### (00:16) 12c Answer

[Thrun] And the answer is 19. ▶ 00:00 So 1 here, 1 here, 1 here, 2 here, 2 here, 2 arcs point into G, which makes for 4, ▶ 00:02 and 3 arcs point into D. Two to the 3 is 8. ▶ 00:09 So we get 1, 2, 3, 8, 2, 2, 4. If you add those up, it's 19. ▶ 00:13 ### (00:28) 12d Question

[Thrun] And here is our car network which we discussed at the very beginning of this unit. ▶ 00:00 How many parameters do we need to specify this network? ▶ 00:06 Remember, there are 16 total variables, ▶ 00:11 and the naive joint over the 16 will be 2 to the 16th minus 1, which is 65,535. ▶ 00:15 Please write your answer into this box over here. ▶ 00:25 ### (00:24) 12e Answer

[Thrun] To answer this question, let us add up these numbers. ▶ 00:00 Battery age is 1, 1, 1. ▶ 00:04 This has 1 incoming arc, so it's 2. ▶ 00:08 Two incoming arcs makes 4. ▶ 00:10 One incoming arc is 2, 2 equals 4. ▶ 00:13 Four incoming arcs makes 16. ▶ 00:17 If we add all the right numbers, we get 47. ▶ 00:21 ### (00:20) 12f Value of the Network

[Thrun] So it takes 47 numerical probabilities to specify the joint ▶ 00:00 compared to 65,000 if you didn't have the graph-like structure. ▶ 00:05 I think this example really illustrates the advantage ▶ 00:11 of compact Bayes network representations over unstructured joint representations. ▶ 00:14 ### (00:35) 13 D-Separation

[Thrun] The next concept I'd like to teach you is called D-separation. ▶ 00:00 And let me start the discussion of this concept by a quiz. ▶ 00:04 We have here a Bayes network, ▶ 00:09 and I'm going to ask you a conditional independence question. ▶ 00:11 Is C independent of A? ▶ 00:16 Please tell me yes or no. ▶ 00:20 Is C independent of A given B? ▶ 00:22 Is C independent of D? ▶ 00:27 Is C independent of D given A? ▶ 00:30 And is E independent of C given D? ▶ 00:32 ### (00:52) 13a Answer

[Thrun] So C is not independent of A. ▶ 00:00 In fact, A influences C by virtue of B. ▶ 00:04 But if you know B, then A becomes independent of C, ▶ 00:09 which means the only determinate into C is B. ▶ 00:13 If you know B for sure, then knowledge of A won't really tell you anything about C. ▶ 00:17 C is also not independent of D, just the same way C is not independent of A. ▶ 00:22 If I learn something about D, I can infer more about C. ▶ 00:27 But if I do know A, then it's hard to imagine how knowledge of D would help me with C ▶ 00:31 because I can't learn anything more about A than knowing A already. ▶ 00:38 Therefore, given A, C and D are independent. ▶ 00:42 The same is true for E and C. ▶ 00:45 If we know D, then E and C become independent. ▶ 00:48 ### (00:45) 13b D-Separation Example

[Thrun] In this specific example, the rule that we could apply is very, very simple. ▶ 00:00 Any 2 variables are independent if they're not linked by just unknown variables. ▶ 00:04 So for example, if we know B, then everything downstream of B ▶ 00:10 becomes independent of anything upstream of B. ▶ 00:14 E is now independent of C, conditioned on B. ▶ 00:18 However, knowledge of B does not render A and E independent. ▶ 00:22 In this graph over here, A and B connect to C and C connects to D and to E. ▶ 00:26 So let me ask you, is A independent of E, ▶ 00:33 A independent of E given B, ▶ 00:37 A independent of E given C, ▶ 00:39 A independent of B, ▶ 00:41 and A independent of B given C? ▶ 00:43 ### (01:26) 13c Answer

[Thrun] And the answer for this one is really interesting. ▶ 00:00 A is clearly not independent of E because through C we can see an influence of A to E. ▶ 00:03 Given B, that doesn't change. ▶ 00:08 A still influences C, despite the fact we know B. ▶ 00:11 However, if we know C, the influence is cut off. ▶ 00:15 There is no way A can influence E if we know C. ▶ 00:18 A is clearly independent of B. ▶ 00:22 They are different entry variables. They have no incoming arcs. ▶ 00:25 But here is the caveat. ▶ 00:29 Given C, A and B become dependent. ▶ 00:32 So whereas initially A and B were independent, ▶ 00:35 if you give C, they become dependent. ▶ 00:38 And the reason why they become dependent we've studied before. ▶ 00:41 This is the explain away effect. ▶ 00:44 If you know, for example, C to be true, ▶ 00:48 then knowledge of A will substantially affect what we believe about B. ▶ 00:51 If there's 2 joint causes for C and we happen to know A is true, ▶ 00:57 we will discredit cause B. ▶ 01:02 If we happen to know A is false, we will increase our belief for the cause B. ▶ 01:04 That was an effect we studied extensively in the happiness example I gave you before. ▶ 01:09 The interesting thing here is we are facing a situation ▶ 01:15 where knowledge of variable C renders previously independent variables dependent. ▶ 01:19 ### (02:54) 13d D-Separation General Definition

[Thrun] This leads me to the general study of conditional independence in Bayes networks, ▶ 00:00 often called D-separation or reachability. ▶ 00:06 D-separation is best studied by so-called active triplets and inactive triplets ▶ 00:10 where active triplets render variables dependent ▶ 00:17 and inactive triplets render them independent. ▶ 00:20 Any chain of 3 variables like this makes the initial and final variable dependent ▶ 00:23 if all variables are unknown. ▶ 00:30 However, if the center variable is known-- ▶ 00:32 that is, it's behind the conditioning bar-- ▶ 00:35 then this variable and this variable become independent. ▶ 00:38 So if we have a structure like this and it's quote-unquote cut off ▶ 00:42 by a known variable in the middle, that separates or deseparates ▶ 00:47 the left variable from the right variable, and they become independent. ▶ 00:53 Similarly, any structure like this renders the left variable and the right variable dependent ▶ 00:57 unless the center variable is known, ▶ 01:04 in which case the left and right variable become independent. ▶ 01:08 Another active triplet now requires knowledge of a variable. ▶ 01:12 This is the explain away case. ▶ 01:16 If this variable is known for a Bayes network that converges into a single variable, ▶ 01:19 then this variable and this variable over here become dependent. ▶ 01:25 Contrast this with a case where all variables are unknown. ▶ 01:29 A situation like this means that this variable on the left or on the right are actually independent. ▶ 01:33 In a single final example, we also get dependence if we have the following situation: ▶ 01:40 a direct successor of a conversion variable is known. ▶ 01:48 So it is sufficient if a successor of this variable is known. ▶ 01:52 The variable itself does not have to be known, ▶ 01:57 and the reason is if you know this guy over here, ▶ 01:59 we get knowledge about this guy over here. ▶ 02:02 And by virtue of that, the case over here essentially applies. ▶ 02:05 If you look at those rules, ▶ 02:09 those rules allow you to determine for any Bayes network ▶ 02:11 whether variables are dependent or not dependent given the evidence you have. ▶ 02:15 If you color the nodes dark for which you do have evidence, ▶ 02:20 then you can use these rules to understand whether any 2 variables ▶ 02:25 are conditionally independent or not. ▶ 02:29 So let me ask you for this relatively complicated Bayes network the following questions. ▶ 02:31 Is F independent of A? ▶ 02:37 Is F independent of A given D? ▶ 02:41 Is F independent of A given G? ▶ 02:45 And is F independent of A given H? ▶ 02:49 Please mark your answers as you see fit. ▶ 02:51 ### (01:03) 13e Answer

[Thrun] And the answer is yes, F is independent of A. ▶ 00:00 What we find for our rules of D-separation is that F is dependent on D ▶ 00:04 and A is dependent on D. ▶ 00:08 But if you don't know D, you can't govern any dependence between A and F at all. ▶ 00:11 If you do know D, then F and A become dependent. ▶ 00:16 And the reason is B and E are dependent given D, ▶ 00:20 and we can transform this back into dependence of A and F ▶ 00:25 because B and A are dependent and E and F are dependent. ▶ 00:29 There is an active path between A and F which goes across here and here ▶ 00:33 because D is known. ▶ 00:38 If we know G, the same thing is true because G gives us knowledge about D, ▶ 00:40 and D can be applied back to this path over here. ▶ 00:44 However, if you know H, that's not the case. ▶ 00:47 So H might tell us something about G, ▶ 00:49 but it doesn't tell us anything about D, ▶ 00:51 and therefore, we have no reason to close the path between A and F. ▶ 00:53 The path between A and F is still passive, even though we have knowledge of H. ▶ 00:59 ### (00:50) 14 Congratulations

[Thrun] So congratulations. You learned a lot about Bayes networks. ▶ 00:00 You learned about the graph structure of Bayes networks, ▶ 00:03 you understood how this is a compact representation, ▶ 00:06 you learned about conditional independence, ▶ 00:10 and we talked a little bit about application of Bayes network ▶ 00:12 to interesting reasoning problems. ▶ 00:15 But by all means this was a mostly theoretical unit of this class, ▶ 00:18 and in future classes we will talk more about applications. ▶ 00:23 The instrument of Bayes networks is really essential to a number of problems. ▶ 00:27 It really characterizes the sparse dependence that exists in many readable problems ▶ 00:31 like in robotics and computer vision and filtering and diagnostics and so on. ▶ 00:36 I really hope you enjoyed this class, ▶ 00:41 and I really hope you understood in depth how Bayes networks work. ▶ 00:43

## (34) Unit 4

### (04:38) 1 Probabilistic Inference

[Probabilistic Interference] ▶ 00:00 [Male] Welcome back. In the previous unit, we went over the basics ▶ 00:02 of probability theory and saw how ▶ 00:05 a Bayes network could concisely represent a joint probability distribution, ▶ 00:12 including the representation of independence between the variables. ▶ 00:17 In this unit, we will see how to do probabilistic inference. ▶ 00:24 That is, how to answer probability questions using Bayes nets. ▶ 00:31 Let's put up a simple Bayes net. ▶ 00:36 We'll use the familiar example of the earthquake ▶ 00:40 where we can have a burglary or an earthquake ▶ 00:45 setting off an alarm, and if the alarm goes off, ▶ 00:50 either John or Mary might call. ▶ 00:53 Now, what kinds of questions can we ask to do inference about? ▶ 00:58 The simplest type of question is the same question we ask ▶ 01:02 with an ordinary subroutine or function in a programming language. ▶ 01:05 Namely, given some inputs, what are the outputs? ▶ 01:08 So, in this case, we could say given the inputs of B and E, ▶ 01:12 what are the outputs, J and M? ▶ 01:18 Rather than call them input and output variables, ▶ 01:22 in probabilistic inference, we'll call them evidence and query variables. ▶ 01:26 That is, the variables that we know the values of are the evidence, ▶ 01:36 and the ones that we want to find out the values of are the query variables. ▶ 01:39 Anything that is neither evidence nor query is known as a hidden variable. ▶ 01:44 That is, we won't tell you what its value is. ▶ 01:52 We won't figure out what its value is and report it, ▶ 01:55 but we'll have to compute with it internally. ▶ 01:58 And now furthermore, in probabilistic inference, ▶ 02:01 the output is not a single number for each of the query variables, ▶ 02:05 but rather, it's a probability distribution. ▶ 02:10 So, the answer is going to be a complete, joint probability distribution ▶ 02:13 over the query variables. ▶ 02:17 We call this the posterior distribution, given the evidence, ▶ 02:19 and we can write it like this. ▶ 02:23 It's the probability distribution of one or more query variables ▶ 02:26 given the values of the evidence variables. ▶ 02:34 And there can be zero or more evidence variables, ▶ 02:39 and each of them are given an exact value. ▶ 02:42 And that's the computation we want to come up with. ▶ 02:47 There's another question we can ask. ▶ 02:53 Which is the most likely explanation? ▶ 02:56 That is, out of all the possible values for all the query variables, ▶ 02:58 which combination of values has the highest probability? ▶ 03:03 We write the formula like this, asking which Q values ▶ 03:08 are maxable given the evidence values. ▶ 03:12 Now, in an ordinary programming language, each function goes only one way. ▶ 03:16 It has input variables, does some computation, ▶ 03:22 and comes up with a result variable or result variables. ▶ 03:26 One great thing about Bayes nets is that we're not restricted ▶ 03:31 to going only in one direction. ▶ 03:34 We could go in the causal direction, giving as evidence ▶ 03:36 the route nodes of the tree and asking as query values the nodes at the bottom. ▶ 03:41 Or, we could reverse that causal flow. ▶ 03:47 For example, we could have J and M be the evidence variables ▶ 03:50 and B and E be the query variables, ▶ 03:55 or we could have any other combination. ▶ 03:58 For example, we could have M be the evidence variable ▶ 04:01 and J and B be the query variables. ▶ 04:05 Here's a question for you. ▶ 04:11 Imagine the situation where Mary has called to report that the alarm is going off, ▶ 04:13 and we want to know whether or not there has been a burglary. ▶ 04:18 For each of the nodes, click on the circle to tell us ▶ 04:22 if the node is an evidence node, a hidden node, ▶ 04:27 or a query node. ▶ 04:32 ### (00:11) 1a Answer

The answer is that Mary calling is the evidence node. ▶ 00:00 The burglary is the query node, ▶ 00:04 and all the others are hidden variables in this case. ▶ 00:07 ### (04:24) 2 Enumeration

Now we're going to talk about how to do inference on Bayes net. ▶ 00:00 We'll start with our familiar network, and we'll talk about a method ▶ 00:04 called enumeration, ▶ 00:08 which goes through all the possibilities, adds them up, ▶ 00:12 and comes up with an answer. ▶ 00:15 So, what we do is start by stating the problem. ▶ 00:17 We're going to ask the question of what is the probability ▶ 00:24 that the burglar alarm occurred given that John called and Mary called? ▶ 00:27 We'll use the definition of conditional probability to answer this. ▶ 00:34 So, this query is equal to the joint probability distribution ▶ 00:39 of all 3 variables divided by the conditionalized variables. ▶ 00:47 Now, note I'm using a notation here where instead of writing out the probability ▶ 00:55 of some variable equals true, I'm just using the notation plus ▶ 01:01 and then the variable name in lower case, ▶ 01:05 and if I wanted the negation, I would use negation sign. ▶ 01:08 Notice there's a different notation where instead of writing out ▶ 01:13 the plus and negation signs, we just use the variable name itself, P(e), ▶ 01:17 to indicate E is true. ▶ 01:22 That notation works well, but it can get confusing between ▶ 01:25 does P(e) mean E is true, or does it mean E is a variable? ▶ 01:29 And so we're going to stick to the notation where we explicitly have ▶ 01:34 the pluses and negation signs. ▶ 01:37 To do inference by enumeration, we first take a conditional probability ▶ 01:41 and rewrite it as unconditional probabilities. ▶ 01:45 Now we enumerate all the atomic probabilities and calculate the sum of products. ▶ 01:49 Let's look at just the complex term on the numerator first. ▶ 01:56 The procedure for figuring out the denominator would be similar, and we'll skip that. ▶ 02:00 So, the probability of these 3 terms together ▶ 02:05 can be determined by enumerating all possible values of the hidden variables. ▶ 02:12 In this case, there are 2, E and A, ▶ 02:17 so we'll sum over those variables for all values of E and for all values of A. ▶ 02:22 In this case, they're boolean, so there's only 2 values of each. ▶ 02:29 We ask what's the probability of this unconditional term? ▶ 02:34 And that we get by summing out over all possibilities, ▶ 02:41 E and A being true or false. ▶ 02:44 Now, to get the values of these atomic events, ▶ 02:49 we'll have to rewrite this equation in a form that corresponds ▶ 02:52 to the conditional probability tables that we have associated with the Bayes net. ▶ 02:55 So, we'll take this whole expression and rewrite it. ▶ 03:00 It's still a sum over the hidden variables E and A, ▶ 03:04 but now I'll rewrite this expression in terms of the parents ▶ 03:08 of each of the nodes in the network. ▶ 03:12 So, that gives us the product of these 5 terms, ▶ 03:15 which we then have to sum over all values of E and A. ▶ 03:21 If we call this product f(e,a), ▶ 03:24 then the whole answer is the sum of F for all values of E and A, ▶ 03:31 so as the sum of 4 terms where each of the terms is a product of 5 numbers. ▶ 03:43 Where do we get the numbers to fill in this equation? ▶ 03:51 From the conditional probability tables from our model, ▶ 03:54 so let's put the equation back up, and we'll ask you for the case ▶ 03:58 where both E and A are positive ▶ 04:03 to look up in the conditional probability tables and fill in the numbers ▶ 04:09 for each of these 5 terms, and then multiply them together and fill in the product. ▶ 04:14 ### (01:59) 2a Answer

We get the answer by reading numbers off the conditional probability tables, ▶ 00:00 so probability of B being positive is 0.001. ▶ 00:04 Of E being positive, because we're dealing with the positive case now ▶ 00:11 for the variable E, is 0.002. ▶ 00:16 The probability of A being positive, because we're dealing with that case, ▶ 00:22 given that B is positive and the case for an E is positive, ▶ 00:26 that we can read off here as 0.95. ▶ 00:30 The probability that J is positive given that A is positive is 0.9. ▶ 00:37 And finally, the probability that M is positive given that A is positive ▶ 00:44 we read off here as 0.7. ▶ 00:50 We multiple all those together, it's going to be a small number ▶ 00:54 because we've got the .001 and the .002 here. ▶ 00:57 Can't quite fit it in the box, but it works out to .000001197. ▶ 01:00 That seems like a really small number, but remember, ▶ 01:12 we have to normalize by the P(+j,+m) term, ▶ 01:14 and this is only 1 of the 4 possibilities. ▶ 01:19 We have to enumerate over all 4 possibilities for E and A, ▶ 01:22 and in the end, it works out that the probability of the burglar alarm being true ▶ 01:26 given that John and Mary calls, is 0.284. ▶ 01:32 And we get that number because intuitively, ▶ 01:38 it seems that the alarm is fairly reliable. ▶ 01:42 John and Mary calling are very reliable, ▶ 01:44 but the prior probability of burglary is low. ▶ 01:47 And those 2 terms combine together to give us the 0.284 value ▶ 01:49 when we sum up each of the 4 terms of these products. ▶ 01:54 ### (03:27) 3 Speeding up Enumeration

[Norvig] We've seen how to do enumeration to solve the inference problem ▶ 00:00 on belief networks. ▶ 00:04 For a simple network like the alarm network, that's all we need to know. ▶ 00:06 There's only 5 variables, so even if all 5 of them were hidden, ▶ 00:10 there would only be 32 rows in the table to sum up. ▶ 00:14 From a theoretical point of view, we're done. ▶ 00:20 But from a practical point of view, other networks could give us trouble. ▶ 00:22 Consider this network, which is one for determining insurance for car owners. ▶ 00:26 There are 27 different variables. ▶ 00:35 If each of the variables were boolean, that would give us over 100 million rows to sum out. ▶ 00:38 But in fact, some of the variables are non-boolean, ▶ 00:44 they have multiple values, and it turns out that representing this entire network ▶ 00:46 and doing enumeration we'd have to sum over a quadrillion rows. ▶ 00:52 That's just not practical, so we're going to have to come up with methods ▶ 00:57 that are faster than enumerating everything. ▶ 01:01 The first technique we can use to get a speed-up in doing inference on Bayes nets ▶ 01:04 is to pull out terms from the enumeration. ▶ 01:09 For example, here the probability of b is going to be the same for all values of E and a. ▶ 01:13 So we can take that term and move it out of the summation, ▶ 01:20 and now we have a little bit less work to do. ▶ 01:26 We can multiply by that term once rather than having it in each row of the table. ▶ 01:28 We can also move this term, the P of e, to the left of the summation over a, ▶ 01:33 because it doesn't depend on a. ▶ 01:40 By doing this, we're doing less work. ▶ 01:43 The inner loop of the summation now has only 3 terms rather than 5 terms. ▶ 01:45 So we've reduced the cost of doing each row of the table. ▶ 01:50 But we still have the same number of rows in the table, ▶ 01:53 so we're going to have to do better than that. ▶ 01:57 The next technique for efficient inference is to maximize independence of variables. ▶ 02:00 The structure of a Bayes net determines how efficient it is to do inference on it. ▶ 02:08 For example, a network that's a linear string of variables, ▶ 02:12 X1 through Xn, can have inference done in time proportional to the number n, ▶ 02:17 whereas a network that's a complete network ▶ 02:27 where every node points to every other node and so on could take time 2 to the n ▶ 02:31 if all n variables are boolean variables. ▶ 02:40 In the alarm network we saw previously, we took care ▶ 02:45 to make sure that we had all the independence relations represented ▶ 02:50 in the structure of the network. ▶ 02:54 But if we put the nodes together in a different order, ▶ 02:57 we would end up with a different structure. ▶ 03:00 Let's start by ordering the node John calls first ▶ 03:03 and then adding in the node Mary calls. ▶ 03:09 The question is, given just these 2 nodes and looking at the node for Mary calls, ▶ 03:13 is that node dependent or independent of the node for John calls? ▶ 03:19 ### (00:24) 3a Answer

[Norvig] The answer is that the node for Mary calls in this network ▶ 00:01 is dependent on John calls. ▶ 00:05 In the previous network, they were independent given that we knew that the alarm had occurred. ▶ 00:08 But here we don't know that the alarm had occurred, ▶ 00:13 and so the nodes are dependent ▶ 00:16 because having information about one will affect the information about the other. ▶ 00:18 ### (00:13) 3b Second Question

[Norvig] Now we'll continue and we'll add the node A for alarm to the network. ▶ 00:00 And what I want you to do is click on all the other variables ▶ 00:05 that A is dependent on in this network. ▶ 00:09 ### (00:33) 3c Second Answer

[Norvig] The answer is that alarm is dependent on both John and Mary. ▶ 00:01 And so we can draw both nodes in, both arrows in. ▶ 00:05 Intuitively that makes sense because if John calls, ▶ 00:09 then it's more likely that the alarm has occurred, ▶ 00:14 likely as if Mary calls, and if both called, it's really likely. ▶ 00:16 So you can figure out the answer by intuitive reasoning, ▶ 00:20 or you can figure it out by going to the conditional probability tables ▶ 00:23 and seeing according to the definition of conditional probability ▶ 00:27 whether the numbers work out. ▶ 00:31 ### (00:11) 3d Third Question

[Norvig] Now we'll continue and we'll add the node B for burglary ▶ 00:01 and ask again, click on all the variables that B is dependent on. ▶ 00:05 ### (00:10) 3e Third Answer

[Norvig] The answer is that B is dependent only on A. ▶ 00:00 In other words, B is independent of J and M given A. ▶ 00:04 ### (00:07) 3f Fourth Question

[Norvig] And finally, we'll add the last node, E, ▶ 00:00 and ask you to click on all the nodes that E is dependent on. ▶ 00:04 ### (00:26) 3g Fourth Answer

[Norvig] And the answer is that E is dependent on A. ▶ 00:00 That much is fairly obvious. ▶ 00:04 But it's also dependent on B. ▶ 00:06 Now, why is that? ▶ 00:08 E is dependent on A because if the earthquake did occur, ▶ 00:10 then it's more likely that the alarm would go off. ▶ 00:13 On the other hand, E is also dependent on B ▶ 00:16 because if a burglary occurred, then that would explain why the alarm is going off, ▶ 00:19 and it would mean that the earthquake is less likely. ▶ 00:23 ### (00:18) 3h Causal Direction

[Norvig] The moral is that Bayes nets tend to be the most compact ▶ 00:00 and thus the easier to do inference on when they're written in the causal direction-- ▶ 00:04 that is, when the networks flow from causes to effects. ▶ 00:12 ### (04:40) 4 Variable Elimination

Let's return to this equation, which we use to show how to do inference by enumeration. ▶ 00:00 In this equation, we join up the whole joint distribution ▶ 00:06 before we sum out over the hidden variables. ▶ 00:10 That's slow, because we end up repeating a lot of work. ▶ 00:15 Now we're going to show a new technique called variable elimination, ▶ 00:18 which in many networks operates much faster. ▶ 00:25 It's still a difficult computation, an NP-hard computation, ▶ 00:27 to do inference over Bayes nets in general. ▶ 00:30 Variable elimination works faster than inference by enumeration ▶ 00:34 in most practical cases. ▶ 00:38 It requires an algebra for manipulating factors, ▶ 00:41 which are just names for multidimensional arrays ▶ 00:45 that come out of these probabilistic terms. ▶ 00:48 We'll use another example to show how variable elimination works. ▶ 00:53 We'll start off with a network that has 3 boolean variables. ▶ 00:57 R indicates whether or not it's raining. ▶ 01:00 T indicates whether or not there's traffic, ▶ 01:04 and T is dependent on whether it's raining. ▶ 01:12 And finally, L indicates whether or not I'll be late for my next appointment, ▶ 01:15 and that depends on whether or not there's traffic. ▶ 01:19 Now we'll put up the conditional probability tables for each of these 3 variables. ▶ 01:22 And then we can use inference to figure out the answer to questions like ▶ 01:29 am I going to be late? ▶ 01:35 And we know by definition that we could do that through enumeration ▶ 01:38 by going through all the possible values for R and T ▶ 01:42 and summing up the product of these 3 nodes. ▶ 01:47 Now, in a simple network like this, straight enumeration would work fine, ▶ 01:54 but in a more complex network, what variable elimination does is give us a way ▶ 01:59 to combine together parts of the network into smaller parts ▶ 02:03 and then enumerate over those smaller parts and then continue combining. ▶ 02:09 So, we start with a big network. ▶ 02:13 We eliminate some of the variables. ▶ 02:15 We compute by marginalizing out, and then we have a smaller network to deal with, ▶ 02:17 and we'll show you how those 2 steps work. ▶ 02:24 The first operation in variable elimination is called joining factors. ▶ 02:28 A factor, again, is one of these tables. ▶ 02:35 It's a multidimensional matrix, and what we do is choose 2 of the factors, ▶ 02:39 2 or more of the factors. ▶ 02:43 In this case, we'll choose these 2, and we'll combine them together ▶ 02:45 to form a new factor which represents ▶ 02:49 the joint probability of all the variables in that factor. ▶ 02:52 In this case, R and T. ▶ 02:56 Now we'll draw out that table. ▶ 03:00 In each case, we just look up in the corresponding table, ▶ 03:03 figure out the numbers, and multiply them together. ▶ 03:06 For example, in this row we have a +r and a +t, ▶ 03:08 so the +r is 0.1, and the entry for +r and +t is 0.8, ▶ 03:13 so multiply them together and you get 0.08. ▶ 03:19 Go all the way down. For example, in the last row we have a -r and a -t. ▶ 03:22 -r is 0.9. The entry for -r and -t is also 0.9. ▶ 03:28 Multiply those together and you get 0.81. ▶ 03:34 So, what have we done? ▶ 03:40 We used the operation of joining factors on these 2 factors, ▶ 03:42 getting us a new factor which is part of the existing network. ▶ 03:45 Now we want to apply a second operation called elimination, ▶ 03:50 also called summing out or marginalization, to take this table and reduce it. ▶ 03:56 Right now, the tables we have look like this. ▶ 04:02 We could sum out or marginalize over the variable R ▶ 04:06 to give us a table that just operates on T. ▶ 04:10 So, the question is to fill in this table for P(T)-- ▶ 04:14 there will be 2 entries in this table, the +t entry, formed by summing out ▶ 04:20 all the entries here for all values of r for which t is positive, ▶ 04:23 and the -t entry, formed the same way, by looking in this table ▶ 04:28 and summing up all the rows over all values of r where t is negative. ▶ 04:32 Put your answers in these boxes. ▶ 04:37 ### (00:27) 4a Answer

The answer is that for +t we look up the 2 possible values for r, ▶ 00:00 and we get 0.08 or 0.09. ▶ 00:05 Sum those up, get 0.17, ▶ 00:09 and then we look at the 2 possible values of R for -t, ▶ 00:13 and we get 0.02 and 0.81. ▶ 00:18 Add those up, and we get 0.83. ▶ 00:22 ### (00:28) 4b More Variable Elimination

So, we took our network with RT and L. We summed out over R. ▶ 00:00 That gives us a new network with T and L ▶ 00:04 with these conditional probability tables. ▶ 00:09 And now we want to do a join over T and L ▶ 00:13 and give us a new table with the joint probability of P(T, L). ▶ 00:17 And that table is going to look like this. ▶ 00:25 ### (00:38) 4c Answer

The answer, again, for joining variables is determined by pointwise multiplication, ▶ 00:00 so we have 0.17 times 0.3 is 0.051, ▶ 00:05 +t and +l, 0.17 times 0.7 is 0.119. ▶ 00:12 Then we go to the minuses. ▶ 00:21 Minus 0.83 times 0.1 is 0.083. ▶ 00:23 And finally, 0.83 times 0.9 is 0.747. ▶ 00:31 ### (00:30) 4d Even More Variable Elimination

Now we're down to a network with a single node, T, L, ▶ 00:00 with this joint probability table, and the only operation we have left to do ▶ 00:06 is to sum out to give us a node with just L in it. ▶ 00:12 So, the question is to compute P(L) for both values of L, ▶ 00:17 +l and -l. ▶ 00:26 ### (00:20) 4e Answer

The answer is that the +l values, ▶ 00:00 0.051 plus 0.083 equals 0.134. ▶ 00:03 And the negative values, 0.119 plus 0.747 ▶ 00:11 equals 0.886. ▶ 00:15 ### ((??:??)) 4f Summary

No subtitles... ▶ 00:00 ### (00:21) 4f Summary

So, that's how variable elimination works. ▶ 00:00 It's a continued process of joining together factors ▶ 00:03 to form a larger factor and then eliminating variables by summing out. ▶ 00:06 If we make a good choice of the order in which we apply these operations, ▶ 00:11 then variable elimination can be much more efficient ▶ 00:15 than just doing the whole enumeration. ▶ 00:18 ### (02:08) 5 Approximate Inference Sampling

Now I want to talk about approximate inference ▶ 00:00 by means of sampling. ▶ 00:07 What do I mean by that? ▶ 00:12 Say we want to deal with a joint probability distribution, ▶ 00:14 say the distribution of heads and tails over these 2 coins. ▶ 00:17 We can build a table and then start counting by sampling. ▶ 00:24 Here we have our first sample. ▶ 00:30 We flip the coins and the one-cent piece came up heads, ▶ 00:32 and the five-cent piece came up tails, ▶ 00:35 so we would mark down one count. ▶ 00:39 Then we'd toss them again. ▶ 00:42 This time the five cents is heads, and the one cent is tails, ▶ 00:45 so we put down a count there, and we'd repeat that process ▶ 00:50 and keep repeating it until we got enough counts that we could estimate ▶ 01:00 the joint probability distribution by looking at the counts. ▶ 01:06 Now, if we do a small number of samples, the counts might not be very accurate. ▶ 01:11 There may be some random variation that causes them not to converge ▶ 01:15 to their true values, but as we add more counts, ▶ 01:19 the counts--as we add more samples, ▶ 01:23 the counts we get will come closer to the true distribution. ▶ 01:25 Thus, sampling has an advantage over inference in that we know a procedure ▶ 01:29 for coming up with at least an approximate value for the joint probability distribution, ▶ 01:35 as opposed to exact inference, where the computation may be very complex. ▶ 01:42 There's another advantage to sampling, which is if we don't know ▶ 01:50 what the conditional probability tables are, as we did in our other models, ▶ 01:53 if we don't know these numeric values, but we can simulate the process, ▶ 01:59 we can still proceed with sampling, whereas we couldn't with exact inference. ▶ 02:04 ### (02:15) 6 Sampling Example

Here's a new network that we'll use to investigate ▶ 00:00 how sampling can be used to do inference. ▶ 00:05 In this network, we have 4 variables. They're all boolean. ▶ 00:10 Cloudy tells us if it's cloudy or not outside, ▶ 00:14 and that can have an effect on whether the sprinklers are turned on, ▶ 00:17 and whether it's raining. ▶ 00:21 And those 2 variables in turn have an effect on whether the grass gets wet. ▶ 00:23 Now, to do inference over this network using sampling, ▶ 00:28 we start off with a variable where all the parents are defined. ▶ 00:34 In this case, there's only one such variable, Cloudy. ▶ 00:38 And it's conditional probability table tells us that the probability is 50% for Cloudy, ▶ 00:42 50% for not Cloudy, and so we sample from that. ▶ 00:48 We generate a random number, and let's say it comes up with positive for Cloudy. ▶ 00:52 Now that variable is defined, we can choose another variable. ▶ 00:59 In this case, let's choose Sprinkler, and we look at the rows in the table ▶ 01:02 for which Cloudy, the parent, is positive, and we see we should sample ▶ 01:08 with probability 10% to +s and 90% a -s. ▶ 01:13 And so let's say we do that sampling with a random number generator, ▶ 01:19 and it comes up negative for Sprinkler. ▶ 01:23 Now let's jump over here. Look at the Rain variable. ▶ 01:26 Again, the parent, Cloudy, is positive, ▶ 01:29 so we're looking at this part of the table. ▶ 01:34 We get a 0.8 probability for Rain being positive, ▶ 01:38 and a 0.2 probability for Rain being negative. ▶ 01:41 Let's say we sample that randomly, and it comes up Rain is positive. ▶ 01:44 And now we're ready to sample the final variable, ▶ 01:51 and what I want you to do is tell me which of the rows ▶ 01:54 of this table should we be considering and tell me what's more likely. ▶ 02:01 Is it more likely that we have a +w or a -w? ▶ 02:07 ### (01:05) 6a Sampling Example

The answer to the question is that we look at the parents. ▶ 00:00 We find that the Sprinkler variable is negative, ▶ 00:03 so we're looking at this part of the table. ▶ 00:06 And the Rain variable is positive, so we're looking at this part. ▶ 00:09 So, it would be these 2 rows that we would consider, ▶ 00:14 and thus, we'd find there's a 0.9 probability for w, the grass being wet, ▶ 00:18 and only 0.1 for it being negative, ▶ 00:25 so the positive is more likely. ▶ 00:28 And once we've done that, then we generated a complete sample, ▶ 00:31 and we can write down the sample here. ▶ 00:34 We had +c, -s, +r. ▶ 00:37 And assuming we got a probability of 0.9 came out in favor of the +w, ▶ 00:43 that would be the end of the sample. ▶ 00:51 Then we could throw all this information out and start over again ▶ 00:54 by having another 50/50 choice for cloudy and then working our way through the network. ▶ 00:59 ### (01:51) 6b More Sampling

Now, the probability of sampling a particular variable, ▶ 00:00 choosing a +w or a -w, depends on the values of the parents. ▶ 00:04 But those are chosen according to the conditional probability tables, ▶ 00:10 so in the limit, the count of each sampled variable ▶ 00:14 will approach the true probability. ▶ 00:18 That is, with an infinite number of samples, this procedure computes the true ▶ 00:20 joint probability distribution. ▶ 00:24 We say that the sampling method is consistent. ▶ 00:27 We can use this kind of sampling to compute the complete joint probability distribution, ▶ 00:33 or we can use it to compute a value for an individual variable. ▶ 00:38 But what if we wanted to compute a conditional probability? ▶ 00:43 Say we wanted to compute the probability of wet grass ▶ 00:47 given that it's not cloudy. ▶ 00:53 To do that, the sample that we generated here wouldn't be helpful at all ▶ 00:58 because it has to do with being cloudy, not with being not cloudy. ▶ 01:03 So, we would cross this sample off the list. ▶ 01:08 We would say that we reject the sample, and this technique is called rejection sampling. ▶ 01:11 We go through ignoring any samples that don't match ▶ 01:17 the conditional probabilities that we're interested in ▶ 01:21 and keeping samples that do, say the sample -c, +s, +r, -w. ▶ 01:24 We would just continue going through generating samples, ▶ 01:34 crossing off the ones that don't match, keeping the ones that do. ▶ 01:37 And this procedure would also be consistent. ▶ 01:41 We call this procedure rejection sampling. ▶ 01:46 ### (01:59) 7 Rejection Sampling

But there's a problem with rejection sampling. ▶ 00:00 If the evidence is unlikely, you end up rejecting a lot of the samples. ▶ 00:03 Let's go back to the alarm network where we had variables for burglary and for an alarm ▶ 00:08 and say when arrested, in computing the probability of a burglary, ▶ 00:16 given that the alarm goes off. ▶ 00:22 The problem is that burglaries are very infrequent, ▶ 00:25 so most of the samples we would get would end up being-- ▶ 00:28 we start with generating a B, and we get a -b and then a -a. ▶ 00:32 We go back and say does this match? ▶ 00:39 No, we have to reject this sample, ▶ 00:43 so we generate another sample, and we get another -b, -a. ▶ 00:45 We reject that. We get another -b, -a. ▶ 00:50 And we keep rejecting, and eventually we get a +b, ▶ 00:54 but we'd end up spending a lot of time rejecting samples. ▶ 01:00 So, we're going to introduce a new method called likelihood weighting ▶ 01:04 that generates samples so that we can keep every one. ▶ 01:13 With likelihood weighting, we fix the evidence variables. ▶ 01:17 That is, we say that A will always be positive, ▶ 01:20 and then we sample the rest of the variables, ▶ 01:25 so then we get samples that we want. ▶ 01:28 We would get a list like -b, +a, ▶ 01:31 -b, +a, ▶ 01:37 +b, +a. ▶ 01:40 We get to keep every sample, but we have a problem. ▶ 01:42 The resulting set of samples is inconsistent. ▶ 01:46 We can fix that, however, by assigning a probability ▶ 01:52 to each sample and weighing them correctly. ▶ 01:56 ### (01:55) 8 Likelihood Weighting

In likelihood weighting, we're going to be collecting samples just like before, ▶ 00:00 but we're going to add a probabilistic weight to each sample. ▶ 00:05 Now, let's say we want to compute the probability of rain ▶ 00:11 given that the sprinklers are on, and the grass is wet. ▶ 00:17 We start as before. ▶ 00:22 We make a choice for Cloudy, and let's say that, again, ▶ 00:24 we choose Cloudy being positive. ▶ 00:30 Now we want to make a choice for Sprinkler, ▶ 00:33 but we're constrained to always choose Sprinkler being positive, ▶ 00:37 so we'll make that choice. ▶ 00:41 And we know we were dealing with Cloudy being positive, ▶ 00:44 so we're in this row, and we're forced to make the choice of Sprinkler being positive, ▶ 00:50 and that has a probability of only 0.1, so we'll put that 0.1 into the weight. ▶ 00:56 Next, we'll look at the Rain variable, ▶ 01:05 and here we're not constrained in any way, so we make a choice ▶ 01:09 according to the probability tables with Cloudy being positive. ▶ 01:13 And let's say that we choose the more popular choice, and Rain gets the positive value. ▶ 01:19 Now, we look at Wet Grass. ▶ 01:27 We're constrained to choose positive, and we know that the parents ▶ 01:30 are also positive, so we're dealing with this row here. ▶ 01:35 Since it's a constrained choice, we're going to add in or multiply in an additional weight, ▶ 01:41 and I want you to tell me what that weight should be. ▶ 01:47 ### (00:37) 8a Answer

The answer is we're looking for the probability ▶ 00:00 of having a +w given a +s and a +r, ▶ 00:04 so that's in this row, so it's 0.99. ▶ 00:09 So, we take our old weight and multiply it by 0.99, ▶ 00:16 gives us a final weight of 0.099 ▶ 00:22 for a sample of +c, +s, +r and +w. ▶ 00:28 ### (00:20) 8b Likelihood Weighting is Consistent

When we include the weights, ▶ 00:00 counting this sample that was forced to have a +s and a +w ▶ 00:03 with a weight of 0.099, instead of counting it as a full one sample, ▶ 00:08 we find that likelihood weighting is also consistent. ▶ 00:14 ### (00:56) 8c Likelihood Weighting Problems

Likelihood weighting is a great technique, ▶ 00:00 but it doesn't solve all our problems. ▶ 00:03 Suppose we wanted to compute the probability of C given +s and +r. ▶ 00:05 In other words, we're constraining Sprinkler and Rain to always be positive. ▶ 00:14 Since we use the evidence when we generate a node that has that evidence as parents, ▶ 00:21 the Wet Grass node will always get good values based on that evidence. ▶ 00:27 But the Cloudy node won't, and so it will be generating values at random ▶ 00:31 without looking at these values, and most of the time, or some of the time, ▶ 00:39 it will be generating values that don't go well with the evidence. ▶ 00:44 Now, we won't have to reject them like we do in rejection sampling, ▶ 00:48 but they'll have a low probability associated with them. ▶ 00:51 ### (01:50) 9 Gibbs Sampling

A technique called Gibbs sampling, ▶ 00:00 named after the physicist Josiah Gibbs, ▶ 00:07 takes all the evidence into account and not just the upstream evidence. ▶ 00:10 It uses a method called Markov Chain Monte Carlo, or MCMC. ▶ 00:14 The idea is that we resample just one variable at a time ▶ 00:26 conditioned on all the others. ▶ 00:31 That is, we have a set of variables, ▶ 00:33 and we initialize them to random variables, keeping the evidence values fixed. ▶ 00:37 Maybe we have values like this, ▶ 00:44 and that constitutes one sample, and now, at each iteration through the loop, ▶ 00:48 we select just one non-evidence variable and resample it ▶ 00:54 based on all the other variables. ▶ 01:01 And that will give us another sample, and repeat that again. ▶ 01:04 Choose another variable. ▶ 01:11 Resample that variable and repeat. ▶ 01:15 We end up walking around in this space of assignments of variables randomly. ▶ 01:21 Now, in rejection and likelihood sampling, ▶ 01:27 each sample was independent of the other samples. ▶ 01:30 In MCMC, that's not true. ▶ 01:34 The samples are dependent on each other, and in fact, ▶ 01:37 adjacent samples are very similar. ▶ 01:40 They only vary or differ in one place. ▶ 01:42 However, the technique is still consistent. We won't show the proof for that. ▶ 01:46 ### (01:19) 10 Monty Hall Problem

Now, just one more thing. ▶ 00:00 I can't help but describe what is probably the most famous probability problem of all. ▶ 00:02 It's called the Monty Hall Problem after the game show host. ▶ 00:07 And the idea is that you're on a game show, and there's 3 doors: ▶ 00:11 door #1, door #2, and door #3. ▶ 00:15 And behind each door is a prize, and you know that one of the doors ▶ 00:20 contains an expensive sports car, which you would find desirable, ▶ 00:26 and the other 2 doors contain a goat, which you would find less desirable. ▶ 00:29 Now, say you're given a choice, and let's say you choose door #1. ▶ 00:35 But according to the conventions of the game, the host, Monty Hall, ▶ 00:42 will now open one of the doors, knowing that the door that he opens ▶ 00:47 contains a goat, and he shows you door #3. ▶ 00:52 And he now gives you the opportunity to stick with your choice ▶ 00:57 or to switch to the other door. ▶ 01:02 What I want you to tell me is, what is your probability of winning ▶ 01:05 if you stick to door #1, and what is the probability of winning ▶ 01:10 if you switched to door #2? ▶ 01:15 ### (01:45) 10a Answer

The answer is that you have a 1/3 chance of winning if you stick with door #1 ▶ 00:00 and a 2/3 chance if you switch to door #2. ▶ 00:08 How do we explain that, and why isn't it 50/50? ▶ 00:12 Well, it's true that there's 2 possibilities, ▶ 00:16 but we've learned from probability that just because there are 2 options ▶ 00:18 doesn't mean that both options are equally likely. ▶ 00:22 It's easier to explain why the first door has a 1/3 probability ▶ 00:26 because when you started, the car could be in any one of 3 places. ▶ 00:30 You chose one of them. That probability was 1/3. ▶ 00:34 And that probability hasn't been changed by the revealing of one of the other doors. ▶ 00:37 Why is door #2 two-thirds? ▶ 00:43 Well, one way to explain it is that the probability has to sum to 1, ▶ 00:45 and if 1/3 is here, the 2/3 has to be here. ▶ 00:49 But why doesn't the same argument that you use for 1 hold for 2? ▶ 00:53 Why can't we say the probability of 2 holding the car ▶ 00:58 was 1/3 before this door was revealed? ▶ 01:03 Why has that changed 2 and has not changed 1? ▶ 01:07 And the reason is because we've learned something about door #2. ▶ 01:11 We've learned that it wasn't the door that was flipped over by the host, ▶ 01:14 and so that additional information has updated the probability, ▶ 01:18 whereas we haven't learned anything additional about door #1 ▶ 01:22 because it was never an option that the host might switch door #1. ▶ 01:26 And in fact, in this case, if we reveal the door, ▶ 01:30 we find that's where the car actually is. ▶ 01:37 So you see, learning probability may end up winning you something. ▶ 01:40 ### (00:44) 10b Monty Hall Letter

Now, as a final epilogue, I have here a copy of a letter written by Monty Hall himself ▶ 00:00 in 1990 to Professor Lawrence Denenberg of Harvard ▶ 00:07 who, with Harry Lewis, wrote a statistics book ▶ 00:10 in which they used the Monty Hall Problem as an example, ▶ 00:14 and they wrote to Monty asking him for permission to use his name. ▶ 00:18 Monty kindly granted the permission, but in his letter, ▶ 00:23 he writes, "As I see it, it wouldn't make any difference after the player ▶ 00:26 has selected Door A, and having been shown Door C-- ▶ 00:31 why should he then attempt to switch to Door B? ▶ 00:34 So, we see Monty Hall himself did not understand the Monty Hall Problem. ▶ 00:38

## (12) Homework 2

### (00:16) 1 Bayes Rule

[Thrun] Given the following Bayes network with P of A equal to 0.5, ▶ 00:00 P of B given the A equals 0.2, ▶ 00:06 and P of B given not A 0.8, ▶ 00:08 calculate the following probability. ▶ 00:12 ### (00:42) 2 Simple Bayes Net

[Thrun] Consider a network of the following type: ▶ 00:00 a variable, A, that is binary connects to three variables, X1, X2, and X3, ▶ 00:03 that are also binary. ▶ 00:10 The probability of A is 0.5, and for all variable XI we have the probability of XI given A is 0.2, ▶ 00:12 and the probability of XI given not A equals 0.6. ▶ 00:24 I would like to know from you the probability of A ▶ 00:29 given that we observed X1, X2, and not X3. ▶ 00:31 Notice that these variables over here are conditionally independent given A. ▶ 00:37 ### (00:10) 3 Simple Bayes Net 2

[Thrun] Let us consider the same network again. ▶ 00:00 I would like to know the probability of X3 given that I observed X1. ▶ 00:03 ### (00:29) 4 Conditional Independence

[Thrun] In this next homework assignment I will be drawing you a Bayes network ▶ 00:00 and will ask you some conditional independence questions. ▶ 00:04 Is B conditionally independent of C? And say yes or no. ▶ 00:09 Is B conditionally independent of C given D? And say yes or no. ▶ 00:14 Is B conditionally independent of C given A? And say yes or no. ▶ 00:19 And is B conditionally independent given A and D? And say yes or no. ▶ 00:24 ### (00:28) 5 Conditional Indepedence 2

[Thrun] Consider the following network. ▶ 00:00 I would like to know whether the following statements are true or false. ▶ 00:02 C is conditionally independent of E given A. ▶ 00:08 B is conditionally independent of D given C and E. ▶ 00:12 A is conditionally independent of C given E. ▶ 00:18 And A is conditionally independent of C given B. ▶ 00:21 Please check yes or no for each of these questions. ▶ 00:25 ### (00:17) 6 Parameter Count

[Thrun] In my final question I'll look at the exact same network as before, ▶ 00:00 but I would like to know the minimum number of numerical parameters ▶ 00:04 such as the values to define probabilities and conditional probabilities ▶ 00:08 that are necessary to specify the joint distribution of all 5 variables. ▶ 00:13 ### (00:36) 1 ANSWER

[Thrun] The answer is 0.2, ▶ 00:00 and this follows directly from Bayes' rule. ▶ 00:03 In this formula, we can read off the first 2 values straight from the table over here, ▶ 00:07 and we expand the denominator by total probability. ▶ 00:11 Observing that this is exactly the same expression as up here, ▶ 00:15 we get 0.1 divided by 0.1 plus this expression over here can be copied from over here, ▶ 00:19 and P of not A is directly obtained up here. ▶ 00:27 Hence we get 0.5 over here, and as a result we get 0.2. ▶ 00:30 ### (03:16) 2 ANSWER

[Thrun] For this question we will be exploring a little trick ▶ 00:00 about non-normalized probability. ▶ 00:03 We will observe that P of A given X1, X2 and not X3, ▶ 00:05 the expression on the left can be resolved by Bayes' rule into this expression over here. ▶ 00:11 We will take X3 to the left and replace it by A, ▶ 00:16 both conditioned on the variables X1 and X2. ▶ 00:20 Then we have PA given X1, X2 divided by P not X3, X1, X2. ▶ 00:23 Next we employ 2 things. ▶ 00:29 One is the denominator does not depend on A, ▶ 00:31 so whether I put an A or not A has no bearing on any calculation here, ▶ 00:34 which means I can defer its calculation until later, and it will turn out to be important. ▶ 00:39 So I'm going to be proportional to just the stuff over here. ▶ 00:44 And second, I export my conditional independence ▶ 00:49 whereby I can omit X1 and X2 from the probability of not X3 conditioned on A. ▶ 00:52 These variables are conditionally independent. ▶ 00:58 This gives me the following recursion ▶ 01:02 where I now removed the third variable from the estimation problem ▶ 01:05 and just retained the first 2 relative to my initial expression. ▶ 01:10 If I keep expanding this, I get the following solution. ▶ 01:14 P of not X3 given A, P X2 given A, P X1 given A times P of A. ▶ 01:19 You might take a minute to just verify this, ▶ 01:27 but this is exploiting the conditional independence ▶ 01:30 very much as in the first step I showed you over here. ▶ 01:32 This step lacks the normalizer, ▶ 01:35 so let me work on the normalizer by expressing the opposite probability, ▶ 01:38 P of not A given the same events, X1, X2, and not X3, ▶ 01:44 which resolves to P of not X3 given not A, ▶ 01:50 P of X2 given not A, P of X1 given not A, ▶ 01:54 and P of not A. ▶ 02:00 I can now plug in the values from above. ▶ 02:02 So the first term gives me 0.8 times 0.2 times 0.2 times 0.5. ▶ 02:04 In the second term I get 0.4 times 0.6 times 0.6 times 0.5, ▶ 02:15 which resolves to 0.016 and 0.072. ▶ 02:24 This is clearly not a probability because we left out the normalizer. ▶ 02:31 But as we know, the normalizer does not depend on whether I put A or not A in here. ▶ 02:36 As a result, it will be the same for both of these expressions, ▶ 02:40 and I can obtain it by just adding these non-normalized probabilities ▶ 02:44 and then subsequently divide these non-normalized probabilities accordingly. ▶ 02:47 So let me just do this. ▶ 02:52 We get for the desired probability over here 0.1818 ▶ 02:55 and for the inverse probability over here 0.8182. ▶ 03:01 Our desired answer therefore is 0.1818. ▶ 03:08 This was not an easy question. ▶ 03:14 ### (01:41) 3 ANSWER

[Thrun] The answer is a little bit involved. ▶ 00:00 We use total probability to re-express this by bringing in A. ▶ 00:03 P of X3 given X1 is the sum of P of X3 given X1 and A ▶ 00:08 times P of A given X1 plus the A complement, which is X3, conditional X1 and not A ▶ 00:15 times P of not A given X1. ▶ 00:22 That is just total probability. ▶ 00:24 Next we utilized conditional independence by which we can simplify this expression ▶ 00:26 to drop X1 in the conditional variables ▶ 00:30 and we transform this expression by Bayes' rule again. ▶ 00:33 The same applies to the right side with not A replacing A. ▶ 00:36 All of those expressions over here can be found ▶ 00:41 either in the table up there or just by their complements, ▶ 00:45 with the exception of P of X1. ▶ 00:49 But P of X1 can again be just obtained by total probability, ▶ 00:52 which resolves to 0.2 times 0.5 plus 0.6 times 0.5, ▶ 00:58 which gives me 0.4. ▶ 01:11 We are now in a position to calculate the last term over here, which goes as follows. ▶ 01:13 This expression is 0.2 times 0.2 times 0.5 over 0.4 plus 0.6 times 0.6 times 0.5 over 0.4, ▶ 01:19 which gives us as a final result 0.5. ▶ 01:36 ### (00:46) 4 ANSWER

[Thrun] And the answer is as follows. ▶ 00:00 No, no, yes, and no. ▶ 00:02 B and C in the absence of any other information are dependent through A, ▶ 00:06 which is if you learn something about B, you can infer something about A, ▶ 00:11 and then we'll know more about C. ▶ 00:17 If you know D, that doesn't change a thing. ▶ 00:20 You can just take D out of the pool. ▶ 00:22 If you know A, B and C become conditionally independent. ▶ 00:24 This dependence goes away, and ignorance of D doesn't render B and C dependent. ▶ 00:29 However, if we add D back to the mix, ▶ 00:36 then knowledge of D will render B and C dependent by way of the explaining away effect. ▶ 00:39 ### (00:54) 5 ANSWER

[Thrun] So the correct answer is tricky in this case. ▶ 00:00 It is no, no, no, and yes. ▶ 00:03 The first one is straightforward. ▶ 00:07 C and E are conditionally independent based on D, ▶ 00:09 and knowledge of A doesn't change anything. ▶ 00:13 B and D are conditionally independent through A, ▶ 00:15 and knowledge of C or E doesn't change that. ▶ 00:20 A and C is interesting. ▶ 00:23 A and C is independent. But if you know D, they become dependent. ▶ 00:25 It turns out if you know E, you can know something about D, ▶ 00:29 and as a result, A and C become dependent through the explain away effect. ▶ 00:32 That doesn't apply if you know B. ▶ 00:37 Even though B tells you something about E, ▶ 00:39 it tells you nothing about D because B and D are independent. ▶ 00:42 Therefore, knowing B tells you nothing about D, ▶ 00:46 and the explain away effect does not occur between A and C. ▶ 00:49 The answer here is yes. ▶ 00:52 ### (00:37) 6 ANSWER

[Thrun] The correct answer is 16. ▶ 00:00 The probability of A and C require 1 parameter each. ▶ 00:03 The complement of not A and not C follows by 1 minus that parameter. ▶ 00:06 This guy over here requires 2 parameters. ▶ 00:12 You need to know the probability of B given A and B given not A. ▶ 00:15 The complements can be obtained easily. ▶ 00:18 The probability of D is conditioned on 2 variables which can take 4 possible values. ▶ 00:20 Hence the number is 4. ▶ 00:24 And E is conditioned on 3 variables, so it can take a total of 8 different values, ▶ 00:26 2 to the 3rd, which is 8. ▶ 00:30 If you add 8 plus 4 plus 2 plus 1 plus 1, you get 16. ▶ 00:32

## (55) Unit 5

### (01:11) 1 Introduction

Welcome to the machine learning unit. ▶ 00:00 Machine learning is a fascinating area. ▶ 00:03 The world has become immeasurably data-rich. ▶ 00:06 The world wide web has come up over the last decade. ▶ 00:09 The human genome is being sequenced. ▶ 00:12 Vast chemical databases, pharmaceutical databases, ▶ 00:15 and financial databases are now available ▶ 00:19 on a scale unthinkable even 5 years ago. ▶ 00:22 To make sense out of the data, ▶ 00:26 to extract information from the data, ▶ 00:28 machine learning is the discipline to go. ▶ 00:30 Machine learning is an important subfeed of artificial intelligence, ▶ 00:33 it's my personal favorite next to robotics ▶ 00:37 because I believe it has a huge impact on society ▶ 00:40 and is absolutely necessary as we move forward. ▶ 00:43 So in this class, I teach you some of the very basics of ▶ 00:47 machine learning, and in our next unit ▶ 00:50 Peter will tell you some more about machine learning. ▶ 00:52 We'll talk about supervised learning, which is one side of machine learning, ▶ 00:56 and Peter will tell you about unsupervised learning, ▶ 01:00 which is a different style. ▶ 01:02 Later in this class we will also encounter reinforcement learning, ▶ 01:05 which is yet another set of machine learning. ▶ 01:07 Anyhow, let's just dive in. ▶ 01:10 ### (01:53) 2 What is Machine Learning

Welcome to the first class on machine learning. ▶ 00:00 So far we talked a lot about Bayes Networks. ▶ 00:03 And the way we talked about them ▶ 00:07 is all about reasoning within Bayes Networks ▶ 00:10 that are known. ▶ 00:14 Machine learning addresses the problem ▶ 00:15 of how to find those networks ▶ 00:17 or other models ▶ 00:19 based on data. ▶ 00:20 Learning models from data ▶ 00:22 is a major, major area of artificial intelligence ▶ 00:25 and it's perhaps the one ▶ 00:29 that had the most commercial success. ▶ 00:31 In many commercial applications ▶ 00:33 the models themselves are fitted ▶ 00:37 based on data. ▶ 00:39 For example, Google ▶ 00:40 uses data to understand ▶ 00:42 how to respond to each search query. ▶ 00:44 Amazon uses data ▶ 00:46 to understand how to place products on their website. ▶ 00:49 And these machine learning techniques ▶ 00:52 are the enabling techniques that make that possible. ▶ 00:53 So this class ▶ 00:56 which is about supervised learning ▶ 00:57 will go through some very basic methods ▶ 00:59 for learning models from data ▶ 01:02 in particular, specific types of Bayes Networks. ▶ 01:04 We will complement this ▶ 01:06 with a class on unsupervised learning ▶ 01:08 that will be taught next ▶ 01:10 after this class. ▶ 01:14 Let me start off with a quiz. ▶ 01:15 The quiz is: What companies are famous ▶ 01:18 for machine learning using data? ▶ 01:20 Google for mining the web. ▶ 01:24 Netflix for mining what people ▶ 01:29 would like to rent on DVDs. ▶ 01:31 Which is DVD recommendations. ▶ 01:36 Amazon.com for product placement. ▶ 01:40 Check any or all ▶ 01:45 and if none of those apply ▶ 01:47 check down here. ▶ 01:49 ### (00:47) 3 Answer

And, not surprisingly, the answer is ▶ 00:00 all of those companies and many, many, many more ▶ 00:03 use massive machine learning for making decisions ▶ 00:06 that are really essential to the businesses. ▶ 00:09 Google mines the web and uses machine learning for translation, ▶ 00:12 as we've seen in the introductory level. Netflix has used ▶ 00:15 machine learning extensively for understanding what type of DVD to recommend to you next. ▶ 00:18 Amazon composes its entire product pages using ▶ 00:22 machine learning by understanding how customers ▶ 00:25 respond to different compositions and placements of their products, ▶ 00:28 and many, many other examples exist. ▶ 00:31 I would argue that in Silicon Valley, ▶ 00:35 at least half the companies dealing with customers and online products ▶ 00:37 do extensively use machine learning, ▶ 00:41 so it makes machine learning a really exciting discipline. ▶ 00:43 ### (01:33) 4 Stanley DARPA Grand Challenge

In my own research, I've extensively used machine learning for robotics. ▶ 00:00 What you see here is a robot my students and I built at Stanford ▶ 00:05 called Stanley, and it won the DARPA Grand Challenge. ▶ 00:08 It's a self-driving car that drives without any human assistance whatsoever, ▶ 00:12 and this vehicle extensively uses machine learning. ▶ 00:16 The robot is equipped with a laser system ▶ 00:22 I will talk more about lasers in my robotics class, ▶ 00:25 but here you can see how the robot is able to build ▶ 00:28 3-D models of the terrain ahead. ▶ 00:31 These are almost like video game models that allow it to make ▶ 00:34 assessments where to drive and where not to drive. ▶ 00:37 Essentially, it's trying to drive on flat ground. ▶ 00:39 The problem with these lasers is that they don't see very far. ▶ 00:43 They see about 25 meters out, so to drive really fast ▶ 00:46 the robot has to see further. ▶ 00:50 This is where machine learning comes into play. ▶ 00:53 What you see here is camera images delivered by the robot ▶ 00:56 superimposed with laser data that doesn't see very far, ▶ 00:58 but the laser is good enough to extract samples ▶ 01:01 of driveable road surface that can then be machine learned ▶ 01:04 and extrapolated into the entire camera image. ▶ 01:08 That enables the robot to use the camera ▶ 01:10 to see driveable terrain all the way to the horizon ▶ 01:13 up to like 200 meters out, enough to drive really, really fast. ▶ 01:16 This ability to adapt its vision by driving its own training examples using lasers ▶ 01:22 but seeing out 200 meters or more ▶ 01:27 was a key factor in winning the race. ▶ 01:30 ### (03:46) 5 Taxonomy

Machine learning is a very large field ▶ 00:00 with many different methods ▶ 00:03 and many different applications. ▶ 00:04 I will now define some of the very basic terminology ▶ 00:06 that is being used to distinguish ▶ 00:10 different machine learning methods. ▶ 00:12 Let's start with the what. ▶ 00:13 What is being learned? ▶ 00:17 You can learn parameters ▶ 00:19 like the probabilities of a Bayes Network. ▶ 00:23 You can learn structure ▶ 00:26 like the arc structure of a Bayes Network. ▶ 00:27 And you might even discover hidden concepts. ▶ 00:31 For example ▶ 00:34 you might find that certain training example ▶ 00:35 form a hidden group. ▶ 00:37 For example Netflix ▶ 00:39 you might find that there's different types of customers ▶ 00:41 some that care about classic movies ▶ 00:43 some of them care about modern movies ▶ 00:45 and those might form hidden concepts ▶ 00:47 whose discovery can really help you ▶ 00:49 make better sense of the data. ▶ 00:51 Next is what from? ▶ 00:53 Every machine learning method ▶ 00:57 is driven by some sort of target information ▶ 01:00 that you care about. ▶ 01:02 In supervised learning ▶ 01:03 which is the subject of today's class ▶ 01:06 we're given specific target labels ▶ 01:08 and I give you examples just in a second. ▶ 01:10 We also talk about unsupervised learning ▶ 01:13 where target labels are missing ▶ 01:15 and we use replacement principles ▶ 01:19 to find, for example ▶ 01:21 hidden concepts. ▶ 01:22 Later there will be a class in reinforcement learning ▶ 01:24 when an agent learns from feedback with the physical environment ▶ 01:27 by interacting and trying actions ▶ 01:32 and receiving some sort of evaluation ▶ 01:34 from the environment ▶ 01:37 like "Well done" or "That works." ▶ 01:37 Again, we will talk about those in detail later. ▶ 01:41 There's different things you could try to do ▶ 01:43 with machine learning technique. ▶ 01:46 You might care about prediction. ▶ 01:48 For example you might want to care about what's going to happen with the future ▶ 01:49 in the stockmarket for example. ▶ 01:53 You might care to diagnose something ▶ 01:55 which is you get data and you wish to explain it ▶ 01:57 and you use machine learning for that. ▶ 01:59 Sometimes your objective is to summarize something. ▶ 02:01 For example if you read a long article ▶ 02:04 your machine learning method might aim to ▶ 02:07 produce a short article that summarizes the long article. ▶ 02:09 And there's many, many, many more different things. ▶ 02:12 You can talk about the how to learn. ▶ 02:14 We use the word passive ▶ 02:16 if your learning agent is just an observer ▶ 02:19 and has no impact on the data itself. ▶ 02:23 Otherwise, you call it active. ▶ 02:24 Sometimes learning occurs online ▶ 02:26 which means while the data is being generated ▶ 02:30 and some of it is offline ▶ 02:32 which means learning occurs ▶ 02:35 after the data has been generated. ▶ 02:37 There's different types of outputs ▶ 02:39 of a machine learning algorithm. ▶ 02:42 Today we'll talk about classification ▶ 02:44 versus regression. ▶ 02:47 In classification the output is binary ▶ 02:50 or a fixed number of classes ▶ 02:53 for example something is either a chair or not. ▶ 02:55 Regression is continuous. ▶ 02:57 The temperature might be 66.5 degrees ▶ 02:59 in our prediction. ▶ 03:01 And there's tons of internal details ▶ 03:03 we will talk about. ▶ 03:05 Just to name one. ▶ 03:07 We will distinguish generative ▶ 03:09 from discriminative. ▶ 03:12 Generative seeks to model the data ▶ 03:14 as generally as possible ▶ 03:16 versus discriminative methods ▶ 03:18 seek to distinguish data ▶ 03:20 and this might sound like a superficial distinction ▶ 03:21 but it has enormous ramification ▶ 03:24 on the learning algorithm. ▶ 03:26 Now to tell you the truth ▶ 03:27 it took me many years ▶ 03:29 to fully learn all these words here ▶ 03:30 and I don't expect you to pick them all up ▶ 03:33 in one class ▶ 03:36 but you should as well know that they exist. ▶ 03:37 And as they come up ▶ 03:39 I'll emphasize them ▶ 03:41 so you can resort any learning method ▶ 03:42 I tell you back into the specific taxonomy over here. ▶ 03:44 ### (03:12) 6 Supervised Learning

The vast amount of work in the field ▶ 00:00 falls into the area of supervised learning. ▶ 00:02 In supervised learning ▶ 00:06 you're given for each training example ▶ 00:08 a feature vector ▶ 00:10 and a target label named Y. ▶ 00:13 For example, for a credit rating agency ▶ 00:16 X1, X2, X3 might be a feature ▶ 00:20 such as is the person employed? ▶ 00:23 What is the salary of the person? ▶ 00:25 Has the person previously defaulted on a credit card? ▶ 00:27 And so on. ▶ 00:30 And Y is a predictor ▶ 00:32 whether the person is to default ▶ 00:34 on the credit or not. ▶ 00:36 Now machine learning ▶ 00:38 is to be carried out on past data ▶ 00:40 where the credit rating agency ▶ 00:42 might have collected features just like these ▶ 00:44 and actual occurances of default or not. ▶ 00:46 What it wishes to produce ▶ 00:49 is a function that allows us ▶ 00:51 to predict future customers. ▶ 00:53 So the new person comes in ▶ 00:55 with a different feature vector. ▶ 00:56 Can we predict as good as possible ▶ 00:58 the functional relationship ▶ 01:00 between these features X1 to Xn all the way to Y? ▶ 01:02 You can apply the exact same example ▶ 01:05 in image recognition ▶ 01:08 where X might be pixels of images ▶ 01:09 or it might be features of things found in images ▶ 01:11 and Y might be a label that says ▶ 01:14 whether a certain object is contained ▶ 01:16 in an image or not. ▶ 01:17 Now in supervised learning ▶ 01:19 you're given many such examples. ▶ 01:20 X21 to X2n ▶ 01:25 leads to Y2 ▶ 01:28 all way the index m. ▶ 01:32 This is called your data. ▶ 01:35 If we call each input vector Xm ▶ 01:38 and we wish to find out the function ▶ 01:43 given any Xm or any future vector X ▶ 01:44 produces as close as possible ▶ 01:50 my target signal Y. ▶ 01:53 Now this isn't always possible ▶ 01:55 and sometimes it's acceptable ▶ 01:57 in fact preferable ▶ 01:59 to tolerate a certain amount of error ▶ 02:00 in your training data. ▶ 02:03 But the subject of machine learning ▶ 02:05 is to identify this function over here. ▶ 02:07 And once you identify it ▶ 02:10 you can use it for future Xs ▶ 02:11 that weren't part of the training set ▶ 02:13 to produce a prediction ▶ 02:16 that hopefully is really, really good. ▶ 02:19 So let me ask you a question. ▶ 02:21 And this is a question ▶ 02:24 for which I haven't given you the answer ▶ 02:27 but I'd like to appeal to your intuition. ▶ 02:28 Here's one data set ▶ 02:31 where the X is one dimensionally plotted horizontally ▶ 02:34 and the Y is vertically ▶ 02:37 and suppose there looks like this. ▶ 02:39 Suppose my machine learning algorithm ▶ 02:44 gives me 2 hypotheses. ▶ 02:45 One is this function over here ▶ 02:47 which is a linear function ▶ 02:51 and one is this function over here. ▶ 02:52 I'd like to know which of the functions ▶ 02:53 you find preferable ▶ 02:57 as an explanation for the data. ▶ 02:59 Is it function A? ▶ 03:01 Or function B? ▶ 03:02 Check here for A ▶ 03:06 here for B ▶ 03:08 and here for neither. ▶ 03:09 ### (02:43) 7 Occam's Razor

And I hope you guessed function A. ▶ 00:00 Even though both perfectly describe the data ▶ 00:04 B is much more complex than A. ▶ 00:08 In fact, outside the data ▶ 00:10 B seems to go to a minus infinity much faster ▶ 00:12 than these data points ▶ 00:16 and to plus infinity much faster ▶ 00:17 with these data points over here. ▶ 00:19 And in between ▶ 00:21 we have wide oscillations ▶ 00:22 that don't correspond to any data. ▶ 00:23 So I would argue ▶ 00:25 A is preferable. ▶ 00:27 The reason why I asked this question ▶ 00:31 is because of something called Occam's Razor. ▶ 00:32 Occam can be spelled in many different ways. ▶ 00:35 And what Occam says is that ▶ 00:38 everything else being equal ▶ 00:41 chose the less complex hypothesis. ▶ 00:43 Now in practice ▶ 00:46 there's actually a trade-off ▶ 00:48 between a really good data fit ▶ 00:50 and low complexity. ▶ 00:53 Let me illustrate this to you ▶ 00:55 by a hypothetical example. ▶ 00:58 Consider the following graph ▶ 00:59 where the horizontal axis graphs ▶ 01:02 complexity of the solution. ▶ 01:04 For example, if you use polynomials ▶ 01:07 this might be a high-degree polynomial over here ▶ 01:10 and maybe a linear function over here ▶ 01:12 which is a low-degree polynomial ▶ 01:14 your training data error ▶ 01:16 tends to go like this. ▶ 01:19 The more complex the hypothesis you allow ▶ 01:22 the more you can just fit your data. ▶ 01:25 However, in reality ▶ 01:29 your generalization error on unknown data ▶ 01:31 tends to go like this. ▶ 01:33 It is the sum of the training data error ▶ 01:37 and another function ▶ 01:40 which is called the overfitting error. ▶ 01:42 Not surprisingly ▶ 01:46 the best complexity is obtained ▶ 01:47 where the generalization error is minimum. ▶ 01:49 There are methods ▶ 01:52 to calculate the overfitting error. ▶ 01:53 They go into a statistical field ▶ 01:55 under the name Bayes variance methods. ▶ 01:57 However, in practice ▶ 02:01 you're often just given the training data error. ▶ 02:02 You find if you don't find the model ▶ 02:04 that minimizes the training data error ▶ 02:08 but instead pushes back the complexity ▶ 02:11 your algorithm tends to perform better ▶ 02:14 and that is something we will study a little bit ▶ 02:17 in this class. ▶ 02:20 However, this slide is really important ▶ 02:22 for anybody doing machine learning in practice. ▶ 02:26 If you deal with data ▶ 02:29 and you have ways to fit your data ▶ 02:31 be aware that overfitting ▶ 02:33 is a major source of poor performance ▶ 02:36 of a machine learning algorithm. ▶ 02:39 And I give you examples in just one second. ▶ 02:41 ### (04:15) 8 SPAM Detection

So a really important example ▶ 00:00 of machine learning is SPAM detection. ▶ 00:02 We all get way too much email ▶ 00:04 and a good number of those are SPAM. ▶ 00:06 Here are 3 examples of email. ▶ 00:08 Dear Sir: First I must solicit your confidence ▶ 00:12 in this transaction, this is by virtue of its nature ▶ 00:14 being utterly confidential and top secret... ▶ 00:16 This is likely SPAM. ▶ 00:19 Here's another one. ▶ 00:22 In upper caps. ▶ 00:23 99 MILLION EMAIL ADDRESSES FOR ONLY $99 ▶ 00:25 This is very likely SPAM. ▶ 00:28 And here's another one. ▶ 00:31 Oh, I know it's blatantly OT ▶ 00:33 but I'm beginning to go insane. ▶ 00:35 Had an old Dell Dimension XPS sitting in the corner ▶ 00:37 and decided to put it to use. ▶ 00:40 And so on and so on. ▶ 00:41 Now this is likely not SPAM. ▶ 00:42 How can a computer program ▶ 00:45 distinguish between SPAM and not SPAM? ▶ 00:47 Let's use this as an example ▶ 00:49 to talk about machine learning for discrimination ▶ 00:51 using Bayes Networks. ▶ 00:55 In SPAM detection ▶ 00:59 we get an email ▶ 01:01 and we wish to categorize it ▶ 01:03 either as SPAM ▶ 01:05 in which case we don't even show as to the where ▶ 01:07 or what we call HAM ▶ 01:10 which is the technical word for ▶ 01:12 an email worth passing on to the person being emailed. ▶ 01:15 So the function over here ▶ 01:19 is the function we're trying to learn. ▶ 01:21 Most SPAM filters use human input. ▶ 01:23 When you go through email ▶ 01:26 you have a button called IS SPAM ▶ 01:28 which allows you as a user to flag SPAM ▶ 01:32 and occasionally you will say an email is SPAM. ▶ 01:34 If you look at this ▶ 01:37 you have a typical supervised machine learning situation ▶ 01:40 where the input is an email ▶ 01:43 and the output is whether you flag it as SPAM ▶ 01:45 or if we don't flag it ▶ 01:47 we just think it's HAM. ▶ 01:49 Now to make this amenable to ▶ 01:52 a machine learning algorithm ▶ 01:54 we have to talk about how to represent emails. ▶ 01:55 They're all using different words and different characters ▶ 01:57 and they might have different graphics included. ▶ 02:00 Let's pick a representation that's easy to process. ▶ 02:02 And this representation is often called ▶ 02:06 Bag of Words. ▶ 02:09 Bag of Words is a representation ▶ 02:10 of a document ▶ 02:14 that just counts the frequency ▶ 02:15 of words. ▶ 02:17 If an email were to say Hello ▶ 02:18 I will say Hello. ▶ 02:22 The Bag of Words representation ▶ 02:24 is the following. ▶ 02:26 2-1-1-1 ▶ 02:27 for the dictionary ▶ 02:31 that contains the 4 words ▶ 02:33 Hello I will say. ▶ 02:36 Now look at the subtlety here. ▶ 02:38 Rather than representing each individual word ▶ 02:41 we have a count of each word ▶ 02:43 and the count is oblivious ▶ 02:46 to the order in which the words were stated. ▶ 02:49 A Bag of Words representation ▶ 02:52 relative to a fixed dictionary ▶ 02:55 represents the counts of each word ▶ 02:57 relative to the words in the dictionary. ▶ 03:01 If you were to use a different dictionary ▶ 03:03 like hello and good-bye ▶ 03:06 our counts would be ▶ 03:08 2 and 0. ▶ 03:10 However, in most cases ▶ 03:13 you make sure that all the words found ▶ 03:14 in messages ▶ 03:17 are actually included in the dictionary. ▶ 03:18 So the dictionary might be very, very large. ▶ 03:19 Let me make up an unofficial example ▶ 03:22 of a few SPAM and a few HAM messages. ▶ 03:25 Offer is secret. ▶ 03:30 Click secret link. ▶ 03:32 Secret sports link. ▶ 03:35 Obviously those are contrived ▶ 03:37 and I tried to retain the recovery ▶ 03:40 to a small number of words ▶ 03:42 to make this example workable. ▶ 03:44 In practice we need thousands ▶ 03:46 of such messages ▶ 03:47 to get good information. ▶ 03:48 Play sports today. ▶ 03:50 Went play sports. ▶ 03:52 Secret sports event. ▶ 03:54 Sport is today. ▶ 03:56 Sport costs money. ▶ 03:59 My first quiz is ▶ 04:02 What is the size of the vocabulary ▶ 04:06 that contains all words in these messages? ▶ 04:08 Please enter the value in this box over here. ▶ 04:12 ### (00:28) 9 Answer

Well let's count. ▶ 00:00 Offer is secret click. ▶ 00:02 Secret occurs over here already ▶ 00:08 so we don't have to count it twice. ▶ 00:10 Link, sports, play, today, went, event ▶ 00:12 costs money. ▶ 00:18 So the answer is ▶ 00:20 12. ▶ 00:22 There's 12 different words ▶ 00:24 contained in these 8 messages. ▶ 00:26 ### (00:16) 10 Question

[Narrator] Another quiz. ▶ 00:00 What is the probability that a random message ▶ 00:03 that arrives to fall into the spam bucket? ▶ 00:06 Assuming that those messages ▶ 00:09 are all drawn at random. ▶ 00:11 [writing on page] ▶ 00:13 ### (00:16) 11 Answer

[Narrator] And the answer is: ▶ 00:00 there's 8 different messages ▶ 00:02 of which 3 are spam. ▶ 00:04 So the maximum likelihood estimate ▶ 00:06 is 3/8. ▶ 00:09 [writing on paper] ▶ 00:11 ### (04:31) 12 Maximum Likelihood

So, let's look at this a little bit more formally and talk about maximum likelihood. ▶ 00:00 Obviously, we're observing 8 messages: spam, spam, spam, and 5 times ham. ▶ 00:03 And what we care about is what's our prior probability of spam ▶ 00:12 that maximizes the likelihood of this data? ▶ 00:17 So, let's assume we're going to assign a value of pi to this, ▶ 00:20 and we wish to find the pi that maximizes the likelihood of this data over here, ▶ 00:24 assuming that each email is drawn independently ▶ 00:29 according to an identical distribution. ▶ 00:33 The probability of the p(yi) data item is then pi if yi = spam, ▶ 00:37 and 1 - pi if yi = ham. ▶ 00:48 If we rewrite the data as 1, 1, 1, 0, 0, 0, 0, 0, ▶ 00:53 we can write p(yi) as follows: pi to the yi times (1 - pi) to the 1 - yi. ▶ 00:59 It's not that easy to see that this is equivalent, ▶ 01:13 but say yi = 1. ▶ 01:16 Then this term will fall out. ▶ 01:19 It's not proficient by 1 because the exponent is zero, and we get pi as over here. ▶ 01:22 If yi = 0, then this term falls out, and this one here becomes 1 - pi as over here. ▶ 01:28 Now assuming independence, we get for the entire data set ▶ 01:36 that the joint probability of all data items is the product ▶ 01:44 of the individual data items over here, ▶ 01:49 which can now be written as follows: ▶ 01:52 pi to the count of instances where yi = 1 times ▶ 01:56 1 - pi to the count of the instances where yi = 0. ▶ 02:03 And we know in our example, this count over here is 3, ▶ 02:09 and this count over here is 5, so we get pi to the 3rd times 1 - pi to the 5th. ▶ 02:13 We now wish to find the pi that maximizes this expression over here. ▶ 02:22 We can also maximize the logarithm of this expression, ▶ 02:28 which is 3 times log pi + 5 times log (1 - pi) ▶ 02:33 Optimizing the log is the same as optimizing p because the log is monotonic to p. ▶ 02:42 The maximum of this function is attained with a derivative of 0, ▶ 02:50 so let's compute with a derivative and set it to 0. ▶ 02:54 This is the derivative, 3 over pi - 5 over 1 - pi. ▶ 03:00 We now bring this expression to the right side, ▶ 03:05 multiply the denominators up, and sort all the expressions containing pi to the left, ▶ 03:09 which gives us pi = 3/8, exactly the number we were at before. ▶ 03:18 We just derived mathematically that the data likelihood maximizing number ▶ 03:26 for the probability is indeed the empirical count, ▶ 03:33 which means when we looked at this quiz before ▶ 03:37 and we said a maximum likelihood for the prior probability of spam is 3/8, ▶ 03:41 by simply counting 3 over 8 emails were spam, ▶ 03:49 we actually followed proper mathematical principles ▶ 03:54 to do maximum likelihood estimation. ▶ 03:57 Now, you might not fully have gotten the derivation of this, ▶ 03:59 and I recommend you to watch it again, but it's not that important ▶ 04:03 for the progress in this class. ▶ 04:07 So, here's another quiz. ▶ 04:09 I'd like the maximum likelihood, or ML solutions, ▶ 04:11 for the following probabilities. ▶ 04:17 The probability that the word "secret" comes up, ▶ 04:19 assuming that we already know a message is spam, ▶ 04:21 and the probability that the same word "secret" comes up ▶ 04:25 if we happen to know the message is not spam, it's ham. ▶ 04:28 ### (00:25) 13 Answer

And just as before ▶ 00:00 we count the word secret ▶ 00:02 in SPAM and in HAM ▶ 00:04 as I've underlined here. ▶ 00:06 Three out of 9 words in SPAM ▶ 00:07 are the word secret ▶ 00:11 so we have a third over here ▶ 00:12 or 0.333 ▶ 00:14 and only 1 out of all the 15 words in HAM ▶ 00:18 are secret ▶ 00:21 so you get a fifteenth ▶ 00:22 or 0.0667. ▶ 00:23 ### (01:19) 14 Relationship to Bayes Networks

By now, you might have recognized what we're really building up is a Bayes network ▶ 00:00 where the parameters of the Bayes networks are estimated using supervised learning ▶ 00:06 by a maximum likelihood estimator based on training data. ▶ 00:10 The Bayes network has at its root an unobservable variable called spam, ▶ 00:15 which is binary, and it has as many children as there are words in a message, ▶ 00:20 where each word has an identical conditional distribution ▶ 00:28 of the word occurrence given the class spam or not spam. ▶ 00:33 If you write on our dictionary over here, ▶ 00:39 you might remember the dictionary had 12 different words, ▶ 00:42 so here is 5 of the 12, offer, is, secret, click and sports. ▶ 00:48 Then for the spam class, we found the probability of secret given spam is 1/3, ▶ 00:52 and we also found that the probability of secret given ham is 1/15, ▶ 00:59 so here's a quiz. ▶ 01:05 Assuming a vocabulary size of 12, or put differently, ▶ 01:07 the dictionary has 12 words, how many parameters ▶ 01:12 do we need to specify this Bayes network? ▶ 01:16 ### (00:29) 15 Answer

And the correct answer is 23. ▶ 00:00 We need 1 parameter for the prior p (spam), ▶ 00:03 and then we have 2 dictionary distributions of any word, ▶ 00:07 i given spam, and the same for ham. ▶ 00:12 Now, there's 12 words in a dictionary, ▶ 00:16 but this distribution only needs 11 parameters, ▶ 00:18 so 12 can be figured out because they have to add up to 1. ▶ 00:20 And the same is true over here, so if you add all these together, ▶ 00:24 we get 23. ▶ 00:27 ### (00:26) 16 Question

So, here's a quiz. ▶ 00:00 Let's assume we fit all the 23 parameters of the Bayes network ▶ 00:02 as explained using maximum likelihood. ▶ 00:06 Let's now do classification and see what class and message it ends up with. ▶ 00:09 Let me start with a very simple message, and it contains a single word ▶ 00:14 just to make it a little bit simpler. ▶ 00:18 What's the probability that we classify this one word message as spam? ▶ 00:21 ### (01:02) 17 Answer

And the answer is 0.1667 or 3/18. ▶ 00:00 How do I get there? Well, let's apply Bayes rule. ▶ 00:07 This form is easily transformed into this expression over here, ▶ 00:13 the probability of the message given spam times the prior probability of spam ▶ 00:19 over the normalizer over here. ▶ 00:25 Now, we know that the word "sports" occurs 1 in our 9 words of spam, ▶ 00:29 and our prior probability for spam is 3/8, ▶ 00:34 which gives us this expression over here. ▶ 00:38 We now have to add the same probabilities for the class ham. ▶ 00:40 "Sports" occurs 5 times out of 15 in the ham class, ▶ 00:45 and the prior probability for ham is 5/8, ▶ 00:51 which gives us 3/72 divided by 18/72, which is 3/18 or 1/6. ▶ 00:55 ### (00:21) 18 Question

This gets to a more complicated quiz. ▶ 00:00 Say the message now contains 3 words. ▶ 00:03 "Secret is secret," not a particularly meaningful email, ▶ 00:06 but the frequent occurrence of "secret" seems to suggest it might be spam. ▶ 00:10 What's the probability you're going to judge this to be spam? ▶ 00:16 ### (01:03) 19 Answer

And the answer is surprisingly high. It's 25/26, or 0.9615. ▶ 00:00 To see if we apply Bayes rule, which multiples the prior for spam-ness ▶ 00:10 with the conditional probability of each word given spam. ▶ 00:16 "Secret" carries 1/3, "is" 1/9, and "secret" 1/3 again. ▶ 00:19 We normalize this by the same expression plus the probability for ▶ 00:26 the non-spam case. ▶ 00:32 5/8 is a prior. ▶ 00:36 "Secret" is 1/15. ▶ 00:38 "Is" is 1/15, ▶ 00:42 and "secret" again. ▶ 00:45 This resolves to 1/216 over this expression plus 1/5400, ▶ 00:48 and when you work it all out is 25/26. ▶ 00:57 ### (00:21) 20 Question

The final quiz, let's assume our message is "Today is secret." ▶ 00:00 And again, it might look like spam because the word "secret" occurs. ▶ 00:08 I'd like you to compute for me the probability of spam given this message. ▶ 00:12 ### (03:19) 21 Answer and Laplace Smoothing

And surprisingly, the probability for this message to be spam is 0. ▶ 00:00 It's not 0.001. It's flat 0. ▶ 00:07 In other words, it's impossible, according to our model, ▶ 00:11 that this text could be a spam message. ▶ 00:14 Why is this? ▶ 00:17 When we apply the same rule as before, we get the prior for spam which is 3/8. ▶ 00:19 And we multiple the conditional for each word into this. ▶ 00:24 For "secret," we know it to be 1/3. ▶ 00:28 For "is," to be 1/9, but for today, it's 0. ▶ 00:31 It's 0 because the maximum of the estimate for the probability of "today" in spam is 0. ▶ 00:39 "Today" just never occurred in a spam message so far. ▶ 00:45 Now, this 0 is troublesome because as we compute the outcome-- ▶ 00:49 and I'm plugging in all the numbers as before-- ▶ 00:55 none of the words matter anymore, just the 0 matters. ▶ 01:00 So, we get 0 over something which is plain 0. ▶ 01:03 Are we overfitting? You bet. ▶ 01:10 We are clearly overfitting. ▶ 01:13 It can't be that a single word determines the entire outcome of our analysis. ▶ 01:15 The reason is that our model, to assign a probability of 0 for the word "today" ▶ 01:21 to be in the class of spam is just too aggressive. ▶ 01:26 Let's change this. ▶ 01:29 One technique to deal with the overfitting problem is called Laplace smoothing. ▶ 01:34 In maximum likelihood estimation, we assign towards our probability ▶ 01:39 the quotient of the count of this specific event over all events in our data set. ▶ 01:45 For example, for the prior probability, we found that 3/8 messages are spam. ▶ 01:51 Therefore, our maximum likelihood estimate ▶ 01:57 for the prior probability of spam was 3/8. ▶ 02:00 In Laplace Smoothing, we use a different estimate. ▶ 02:05 We add the value k to the count ▶ 02:10 and normalize as if we added k to every single class ▶ 02:15 that we've tried to estimate something over. ▶ 02:20 This is equivalent to assuming we have a couple of fake training examples ▶ 02:23 where we add k to each observation count. ▶ 02:28 Now, if k equals 0, we get our maximum likelihood estimator. ▶ 02:32 But if k is larger than 0 and n is finite, we get different answers. ▶ 02:36 Let's say k equals 1, ▶ 02:41 and let's assume we get one message, ▶ 02:47 and that message was spam, so we're going to write it one message, one spam. ▶ 02:51 What is p (spam) for the Laplace smoothing of k + 1? ▶ 02:56 Let's do the same with 10 messages, and we get 6 spam. ▶ 03:03 And 100 messages, of which 60 are spam. ▶ 03:09 Please enter your numbers into the boxes over here. ▶ 03:16 ### (01:14) 22 Answer

The answer here is 2/3 or 0.667 and is computed as follows. ▶ 00:00 We have 1 message with 1 as spam, but we're going to add k =1. ▶ 00:10 We're going to add k = 2 over here because there's 2 different classes. ▶ 00:16 K = 1 times 2 = 2, which gives us 2/3. ▶ 00:22 The answer over here is 7/12. ▶ 00:28 Again, we have 6/10 but we add 2 down here and 1 over here, so you get 7/12. ▶ 00:32 And correspondingly, we get 61/102 is 60 + 1 over 100 +2. ▶ 00:41 If we look at the numbers over here, we get 0.5833 ▶ 00:49 and 0.5986. ▶ 00:56 Interestingly, the maximum likelihood on the last 2 cases over here ▶ 00:59 will give us .6, but we only get a value that's closer to .5, ▶ 01:03 which is the effect of our smoothing prior for the Laplacian smoothing. ▶ 01:09 ### (00:25) 23 Question

Let's use the Laplacian smoother with K=1 ▶ 00:00 to calculate the few interesting probabilities-- ▶ 00:05 P of SPAM, P of HAM, ▶ 00:09 and then the probability of the words "today", ▶ 00:12 given that it's in the SPAM class or the HAM class. ▶ 00:15 And you might assume that our recovery size ▶ 00:19 is about 12 different words here. ▶ 00:22 ### (01:17) 24 Answer

This one is easy to calculate for SPAM and HAM. ▶ 00:00 For SPAM, it's 2/5, ▶ 00:03 and the reason is, we had previously ▶ 00:05 3 out of 8 messages assigned to SPAM. ▶ 00:08 But thanks to the Laplacian smoother, we add 1 over here. ▶ 00:12 And there are 2 classes, so we add 2 times 1 over here, ▶ 00:15 which gives us 4/10, which is 2/5. ▶ 00:19 Similarly to get 3/5 over here. ▶ 00:22 Now the tricky part comes up over here. ▶ 00:26 Before, we had 0 occurances of the word "today" in the SPAM class, ▶ 00:29 and we had 9 data points. ▶ 00:33 But now we are going to add 1 for Laplacian smoother, ▶ 00:35 and down here, we are going to add 12. ▶ 00:38 And the reason that we add 12 is because ▶ 00:40 there's 12 different words in our dictionary ▶ 00:42 Hence, for each word in the dictonary, we are going to add 1. ▶ 00:44 So we have a total of 12, which gives us the 12 over here. ▶ 00:47 That makes 1/21. ▶ 00:50 In the HAM class, we had 2 occurrences ▶ 00:53 of the word "today"--over here and over here. ▶ 00:56 We add 1, normalize by 15, ▶ 00:59 plus 12 for the dictionary size, ▶ 01:04 which is 3/27 or 1/9. ▶ 01:07 This was not an easy question. ▶ 01:14 ### (00:21) 25 Question

We come now to the final quiz here, ▶ 00:00 which is--I would like to compute the probability ▶ 00:03 that the message "today is secret" ▶ 00:05 falls into the SPAM box with ▶ 00:08 Laplacian smoother using K=1. ▶ 00:10 Please just enter your number over here. ▶ 00:13 This is a non-trivia question. ▶ 00:16 It might take you a while to calculate this. ▶ 00:18 ### (00:58) 26 Answer

In the approximate probabilities--0.4858. ▶ 00:00 How did we get this? ▶ 00:06 Well, the prior probability for SPAM ▶ 00:08 under the Laplacian smoothing is 2/5. ▶ 00:12 "Today" doesn't occur, but we have already calculated this to be 1/21. ▶ 00:15 "Is" occurs once, so we get 2 over here over 21. ▶ 00:22 "Secret" occurs 3 times, so we get a 4 over here over 21, ▶ 00:26 and we normalize this by the same expression over here. ▶ 00:32 Plus the prior for HAM, which is 3/5, ▶ 00:37 we have 2 occurrences of "today", plus 1, equals 3/27. ▶ 00:42 "Is" occurs once--2/27. ▶ 00:47 And "secret" occurs once--again 2/27. ▶ 00:50 When you work this all out, you get this number over here. ▶ 00:54 ### (01:47) 27 Summary Naive Bayes

So we learned quite a bit. ▶ 00:00 We learned about Naive Bayes ▶ 00:02 as our first supervised learning methods. ▶ 00:04 The setup was that we had ▶ 00:06 features of documents or trading examples and labels. ▶ 00:08 In this case, SPAM or not SPAM. ▶ 00:14 And from those pieces, ▶ 00:17 we made a generative model for the SPAM class ▶ 00:19 and the non-SPAM class ▶ 00:23 that described the condition of probability ▶ 00:25 of each individual feature. ▶ 00:28 We then used first maximum likelihood ▶ 00:30 and then a Laplacian smoother ▶ 00:33 to fit those primers over here. ▶ 00:36 And then using Bayes rule, ▶ 00:38 we could take any training examples over here ▶ 00:41 and figure out what the class probability was over here. ▶ 00:44 This is called a generative model ▶ 00:48 in that the condition of probabilities all aim to maximize ▶ 00:51 the probability of individual features as if those ▶ 00:55 describe the physical world. ▶ 01:00 We also used what is called a bag of words model, ▶ 01:02 in which our representation of each email ▶ 01:06 was such that we just counted the occurrences of words, ▶ 01:09 irrespective of their order. ▶ 01:12 Now this is a very powerful method for fighting SPAM. ▶ 01:15 Unfortunately, it is not powerful enough. ▶ 01:19 It turns out spammers know about Naive Bayes, ▶ 01:21 and they've long learned to come up with messages ▶ 01:24 that are fooling your SPAM filter if it uses Naive Bayes. ▶ 01:27 So companies like Google and others ▶ 01:31 have become much more involved ▶ 01:33 in methods for SPAM filtering. ▶ 01:35 Now I can give you some more examples how to filter SPAM, ▶ 01:38 but all of those quite easily fit with the same Naive Bayes model. ▶ 01:42 ### (01:27) 28 Advanced SPAM Filtering

[Narrator] So here features that you might consider when you write ▶ 00:00 in an advance spam filter. ▶ 00:03 For example, ▶ 00:05 does the email come from ▶ 00:07 a known spamming IP or computer? ▶ 00:09 Have you emailed this person before? ▶ 00:12 In which case it is less likely to be spam. ▶ 00:16 Here's a powerful one: ▶ 00:19 have 1000 other people ▶ 00:22 recently received the same message? ▶ 00:25 Is the email header consistent? ▶ 00:29 So example if the from field says your bank ▶ 00:32 is the IP address really your bank? ▶ 00:35 Surprisingly is the email all caps? ▶ 00:38 Strangely many spammers believe if you write ▶ 00:42 things in all caps you'll pay more attention to it. ▶ 00:44 Do the inline URLs point to those pages ▶ 00:48 where they say they're pointing to? ▶ 00:51 Are you addressed by your correct name? ▶ 00:54 Now these are some features, ▶ 00:56 I'm sure you can think of more. ▶ 00:58 You can toss them easily into the ▶ 01:00 naive base model and get better classification. ▶ 01:02 In fact model spam filters keep learning ▶ 01:05 as people flag emails as spam, and ▶ 01:08 of course spammers keep learning as well ▶ 01:10 and trying to fool modern spam filters. ▶ 01:13 Who's going to win? ▶ 01:16 Well so far the spam filters are clearly winning. ▶ 01:18 Most of my spam I never see, but who knows ▶ 01:21 what's going to happen with the future? ▶ 01:23 It's a really fascinating machine learning problem. ▶ 01:25 ### (02:21) 29 Digit Recognition

[Narrator] Naive Bayes can also be applied to ▶ 00:00 the problem of hand written digits recognition. ▶ 00:02 This is a sample of hand-written digits taken ▶ 00:05 from a U.S. postal data set ▶ 00:09 where hand written zip codes on letters are ▶ 00:12 being scanned and automatically classified. ▶ 00:17 The machine-learning problem here is ▶ 00:21 taken a symbol just like this. ▶ 00:23 What is the corresponding number? ▶ 00:28 Here it's obviously 0. ▶ 00:30 Here it's obviously 1. ▶ 00:32 Here it's obviously 2, 1. ▶ 00:34 For the one down here, ▶ 00:36 it's a little bit harder to tell. ▶ 00:38 Now when you apply Naive Bayes, ▶ 00:41 the input vector ▶ 00:44 could be the pixel values ▶ 00:46 of each individual pixel so we have ▶ 00:48 a 16 x 16 input resolution. ▶ 00:50 You would get 256 different values ▶ 00:54 corresponding to the brightness of each pixel. ▶ 00:59 Now obviously given sufficiently made ▶ 01:02 training example, you might hope ▶ 01:05 to recognize digits, ▶ 01:07 but one of the deficiencies of this approach is ▶ 01:09 it is not particularly shifted range. ▶ 01:12 So for example a pattern like this ▶ 01:15 will look fundamentally different ▶ 01:19 from a pattern like this. ▶ 01:21 Even though the pattern on the right is obtained ▶ 01:24 by shifting the pattern on the left ▶ 01:27 by 1 to the right. ▶ 01:29 There's many different solutions, but a common one could be ▶ 01:31 to use smoothing in a different way from ▶ 01:34 the way we discussed it before. ▶ 01:36 Instead of just counting 1 pixel value's count, ▶ 01:38 you could mix it with counts of the ▶ 01:40 neighboring pixel values so if ▶ 01:42 all pixels are slightly shifted, ▶ 01:44 we get about the same statistics ▶ 01:46 as the pixel itself. ▶ 01:48 Such a method is called input smoothing. ▶ 01:50 You can what's technically called convolve ▶ 01:52 the input vector equals pixel value variable, and ▶ 01:55 you might get better results than if you ▶ 01:57 do Naive Bayes on the raw pixels. ▶ 02:00 Now to tell you the truth for ▶ 02:02 digit recognition of this type, ▶ 02:04 Naive Bayes is not a good choice. ▶ 02:06 The conditional independence assumption ▶ 02:08 of each pixel, given the class, ▶ 02:10 is too strong an assumption in this case, ▶ 02:12 but it's fun to talk about image recognition ▶ 02:14 in the context of Naive Bayes regardless. ▶ 02:17 ### (03:30) 30 Overfitting Prevention

So, let me step back a step and talk a bit about ▶ 00:00 overfitting prevention in machine learning ▶ 00:04 because it's such an important topic. ▶ 00:07 We talked about Occam's Razor, ▶ 00:09 which in a generalized way suggests there is ▶ 00:12 a tradeoff between how well we can fit the data ▶ 00:16 and how smooth our learning algorithm is. ▶ 00:22 In our class in smoothing, we already found 1 way ▶ 00:28 to let Occam's Razor play, which is by ▶ 00:32 selecting the value K to make our statistical counts smoother. ▶ 00:34 I alluded to a similar way in the image recognition domain ▶ 00:40 where we smoothed the image so the neighboring pixels count similar. ▶ 00:44 This all raises the question of how to choose the smoothing parameter. ▶ 00:49 So, in particular, in Laplacian smoothing, how to choose the K. ▶ 00:53 There is a method called cross-validation ▶ 00:58 which can help you find an answer. ▶ 01:02 This method assumes there is plenty of training examples, but ▶ 01:05 to tell you the truth, in spam filtering there is more than you'd ever want. ▶ 01:09 Take your training data ▶ 01:14 and divide it into 3 buckets. ▶ 01:17 Train, cross-validate, and test. ▶ 01:19 Typical ratios will be 80% goes into train, ▶ 01:24 10% into cross-validate, ▶ 01:27 and 10% into test. ▶ 01:30 You use the train to find all your parameters. ▶ 01:33 For example, the probabilities of a base network. ▶ 01:37 You use your cross-validation set ▶ 01:40 to find the optimal K, and the way you do this is ▶ 01:43 you train for different values of K, ▶ 01:46 you observe how well the training model performs on the CV data, ▶ 01:49 not touching the test data, ▶ 01:55 and then you maximize over all the Ks to get the best performance ▶ 01:58 on the cross-validation set. ▶ 02:01 You iterate this many times until you find the best K. ▶ 02:03 When you're done with the best K, ▶ 02:06 you train again, and then finally ▶ 02:09 only one you touch the test data ▶ 02:12 to verify the performance, ▶ 02:15 and this is the performance you report. ▶ 02:17 It's really important in cross-validation ▶ 02:20 split apart a cross-validation set that's different from the test set. ▶ 02:23 If you were to use the test set to find the optimal K, ▶ 02:28 then your test set becomes an effective part of your training routine, ▶ 02:31 and you might overfit your test data, ▶ 02:35 and you wouldn't even know. ▶ 02:38 By keeping the test data separate from the beginning, ▶ 02:40 and train on the training data, you use ▶ 02:43 the cross-validation data to find how good your train data is doing, ▶ 02:46 and the unknown parameters of K to fine-tune the K. ▶ 02:49 Finally, only once you use the test data ▶ 02:53 do you get a fair answer to the question, ▶ 02:56 "How well will your model perform on future data?" ▶ 02:59 So, pretty much everybody in machine learning ▶ 03:02 uses this model. ▶ 03:05 You can redo the split between training and the cross-validation part, ▶ 03:08 people often use the word 10-fold cross-validation ▶ 03:12 where they do 10 different forwardings ▶ 03:15 and run the model 10 times to find the optimal K ▶ 03:17 or smoothing parameter. ▶ 03:20 No matter which way you do it, find the optimal smoothing parameter ▶ 03:22 and then use a test set exactly once to verify in a report. ▶ 03:25 ### (02:00) 31 Classification vs Regression

Let me back up a step further, ▶ 00:00 and let's look at supervised learning more generally. ▶ 00:03 Our example so far was one of classification. ▶ 00:06 The characteristic of classifcation is ▶ 00:09 that the target labels or the target class is discrete. ▶ 00:12 In our case it was actually binary. ▶ 00:16 In many problems, we try to predict a continuous quantity. ▶ 00:18 For example, in the interval 0 to 1 or perhaps a real number. ▶ 00:23 Those machine learning problems are called regression problems. ▶ 00:29 Regression problems are fundamentally different from classification problems. ▶ 00:33 For example, our base network doesn't afford us an answer ▶ 00:37 to a problem where the target value could be at 0,1. ▶ 00:42 A regression problem, for example, would be one to ▶ 00:45 predict the weather tomorrow. ▶ 00:48 Temperature is a continuous value. Our base number would not be able ▶ 00:50 to predict the temperature, it only can predict discrete classes. ▶ 00:53 A regression algorithm is able to give us a continuous prediction ▶ 00:58 about the temperature tomorrow. ▶ 01:01 So let's look at the regression next. ▶ 01:04 So here's my first quiz for you on regression. ▶ 01:07 This scatter plot shows for Berkeley California for a period of time ▶ 01:10 the data for each house that was sold. ▶ 01:18 Each dot is a sold house. ▶ 01:21 It graphs the size of the house in square feet ▶ 01:24 to the sales price in thousands of dollars. ▶ 01:27 As you can see, roughly speaking, ▶ 01:32 as the size of the house goes up, ▶ 01:34 so does the sales price. ▶ 01:37 I wonder, for a house of about 2500 square feet, ▶ 01:40 what is the approximate sales price you would assume ▶ 01:45 based just on the scatter plot data? ▶ 01:49 Is it 400k, 600k, 800k, or 1000k? ▶ 01:52 ### (00:26) 32 Answer

My answer is, there seems to be a roughly linear relationship, ▶ 00:00 maybe not quite linear, between the house size and the price. ▶ 00:05 So we look at a linear graph that best describes the data-- ▶ 00:11 you get this dashed line over here. ▶ 00:15 And for the dashed line, if you walk up the 2500 square feet, ▶ 00:18 you end up with roughly 800K. ▶ 00:22 So this would have been the best answer. ▶ 00:24 ### (02:46) 33 Linear Regression

Now obviously you can answer this question without understanding anything about regression. ▶ 00:00 But what you find is this is different from classification as before. ▶ 00:05 This is not a binary concept anymore of like expensive and cheap. ▶ 00:10 It really is a relationship between two variables. ▶ 00:13 One you care about--the house price, and one that you can observe, ▶ 00:17 which is the house size in square feet. ▶ 00:20 And your goal is to fit a curve that best explains the data. ▶ 00:23 Once again, we have a case where we can play Occam's razor. ▶ 00:28 There clearly is a data fit that is not linear that might be better, ▶ 00:31 like this one over here. ▶ 00:35 And when you go to hide the linear curves, ▶ 00:37 you might even be inclined to draw a curve like this. ▶ 00:40 Now of course the curve I'm drawing right now is likely an overfit. ▶ 00:44 And you don't want to postulate that this is the general relationship ▶ 00:49 between the size of a house and the sales price. ▶ 00:54 So even though my black curve might describe the data better, ▶ 00:57 the blue curve or the dashed linear curve over here might be a better explanation overture of Occam's razor. ▶ 01:01 So let's look a little bit deeper into what we call regression. ▶ 01:08 As in all regression problems, our data will be comprised of ▶ 01:15 input vectors of length in that map to another continuous value. ▶ 01:19 And we might be given a total of M data points. ▶ 01:25 This is from the classification case, except this time the Ys are continuous. ▶ 01:30 Once again, we're looking for function f that maps our vector x into y. ▶ 01:36 In linear regression, the function has a particular form which is W1 times X plus W0. ▶ 01:44 In this case X is one dimensional which is N = 1. ▶ 01:54 Or in the high-dimensional space, we might just write W times X plus W0, ▶ 01:59 where W is a vector and X is a vector. ▶ 02:07 And this is the inner product of these 2 vectors over here. ▶ 02:12 Let's for now just consider the one-dimensional case. ▶ 02:16 In this quiz, I've given you a linear regression form with 2 unknown parameters, W1 and W0. ▶ 02:20 I've given you a data set. ▶ 02:27 And this data set happens to be fittable by a linear regression model without any residual error. ▶ 02:30 Without any math, can you look at this and find out to me what the 2 parameters, W0 and W1 are? ▶ 02:36 ### (01:17) 34 Answer

This is a suprisingly challenging question. ▶ 00:00 If you look at these numbers from 3 to 6. ▶ 00:03 When we increase X by 3, Y decreases by 3, ▶ 00:07 which suggests W1 is -1. ▶ 00:14 Now let's see if this holds. ▶ 00:18 If we increase X by 3, it decreases Y by 3. ▶ 00:20 If we increase X by 1, we decrease Y by 1. ▶ 00:24 If we increase X by 2, we decrease Y by 2. ▶ 00:28 So this number seems to be an exact fit. ▶ 00:32 Next we have to get the constant W0 right. ▶ 00:36 For X = 3, we get -3 as an expression over here, ▶ 00:41 because we know W1 = -1. ▶ 00:48 So if this has to equal zero in the end, then W0 has to be 3. ▶ 00:50 Let's do a quick check. ▶ 00:57 -3 plus 3 is 0. ▶ 00:59 -6 plus 3 is -3. ▶ 01:02 And if we plug in any of the numbers, you find those are correct. ▶ 01:05 Now this is the case of an exact data set. ▶ 01:09 It gets much more challenging if the data set cannot be fit with a linear function. ▶ 01:12 ### (01:00) 35 More Linear Regression

To define linear regression, ▶ 00:00 we need to understand what we are trying to minimize. ▶ 00:02 The word is called here, are loss function ▶ 00:05 and the loss function is the amount of residual error we obtain ▶ 00:08 after fitting the linear function as good as possible. ▶ 00:12 The residual error is the sum of all training examples, ▶ 00:16 J of YJ, which is the target label, ▶ 00:20 minus our prediction, which is W1 XJ minus W0 to the square. ▶ 00:25 This is the quadratic error between our target tables ▶ 00:34 and what our best hypothesis can produce. ▶ 00:37 The minimizing of loss ▶ 00:41 is used for linear regression of a new regression problem, ▶ 00:43 and you can write it as follows: ▶ 00:46 Our solution to the regression problem W* ▶ 00:50 is the arg min of the loss over all possible vectors W. ▶ 00:52 ### (04:04) 36 Quadratic Loss

The problem of minimizing quadratic loss for linear functions can be solved in closed form. ▶ 00:00 When I reduce, I will do this for the one-dimensional case on paper. ▶ 00:07 I will also give you the solution for the case where your input space is multidimensional, ▶ 00:12 which is often called "multivariant regression." ▶ 00:17 We seek to minimize a sum of a quadratic expression ▶ 00:22 where the target labels are subtracted with the output of our linear regression model ▶ 00:26 parameterized by w1 and w2. ▶ 00:33 The summation here is overall training examples, ▶ 00:36 and I leave the index of the summation out if not necessary. ▶ 00:40 The minimum of this is obtained where the derivative of this function equals zero. ▶ 00:45 Let's call this function "L." ▶ 00:50 For the partial derivative with respect to w0, we get this expression over here, ▶ 00:53 which we have to set to zero. ▶ 00:59 We can easily get rid of the -2 and transform this as follows: ▶ 01:02 Here M is the number of training examples. ▶ 01:11 This expression over here gives us w0 as a function of w1, ▶ 01:17 but we don't know w1. Let's do the same trick for w1 ▶ 01:21 and set this to zero as well, ▶ 01:28 which gets us the expression over here. ▶ 01:32 We can now plug in the w0 over here into this expression over here ▶ 01:38 and obtain this expression over here, ▶ 01:44 which looks really involved but is relatively straightforward. ▶ 01:47 With a few steps of further calculation, which I'll spare you for now, ▶ 01:52 we get for w1 the following important formula: ▶ 01:56 This is the final quotient for w1, ▶ 02:02 where we take the number of training examples times of the sum of all xy ▶ 02:05 minus the sum of x times the sum of y divided by this expression over here. ▶ 02:10 Once we've computed w1, ▶ 02:16 we can go back to our original articulation of w0 over here ▶ 02:19 and plug w1 into w0 and obtain w0. ▶ 02:23 These are the two important formulas we can also find in the textbook. ▶ 02:30 I'd like to go back and use those formulas to calculate these two coefficients over here. ▶ 02:39 You get 4 times the sum of x and the sum of y, which is -32 ▶ 02:45 minus the product of the sum of x, which is 18, and the sum of y, which is -6, ▶ 02:56 divided by the sum of x squared, which is 86, times 4, minus the sum of x squared, ▶ 03:05 which is 18 times 18, which is 324. ▶ 03:16 If you work this all out, it becomes -1, which is w1. ▶ 03:20 W0 is now obtained by completing the quarter times sum of all y, ▶ 03:25 which is -6, minus -1/4 times sum of all x. ▶ 03:31 If you plug this all in, you get 3, as over here. Our formula is actually correct. ▶ 03:39 Here is another quiz for linear regression. We have the follow data: ▶ 03:46 Here is the data plotted graphically. ▶ 03:51 I wonder what the best regression is. ▶ 03:53 Give me w0 and w1. Apply the formulas I just gave you. ▶ 03:56 ### (01:27) 37 Answer

And the answer is W0 = 0.5, and W1 = 0.9. ▶ 00:00 If I were to draw a line, it would go about like this. ▶ 00:09 It doesn't really hit the two points at the end. ▶ 00:14 If you were thinking of something like this, you were wrong. ▶ 00:19 If you draw a curve like this, your quadratic error becomes 2. ▶ 00:24 One over here, and one over here. ▶ 00:28 The quadratic error is smaller for the line that goes in between those points. ▶ 00:30 This is easily seen by computing as shown in the previous slide. ▶ 00:35 W1 equals (4 x 118 - 20 x 20) / (4 x 120 - 400) which is 0.9. ▶ 00:41 This is merely plugging in those numbers into the formulas I gave you. ▶ 00:55 W0 then becomes ¼ x 20. ▶ 01:00 Now we plug in W1-- 0.9 / 4 x 20 equals 0.5. ▶ 01:05 This is an example of linear regression, ▶ 01:12 in which case there is a residual error, ▶ 01:16 and the best-fitting curve is the one that minimizes ▶ 01:18 the total of the residual vertical error in this graph over here. ▶ 01:22 ### (02:10) 38 Problems with Linear Regression

So linear regression works well ▶ 00:00 if the data is approximately linear, ▶ 00:03 but there are many examples when linear regression performs poorly. ▶ 00:05 Here's one where we have a ▶ 00:09 curve that is really nonlinear. ▶ 00:12 This is an interesting one where we seem to have a linear relationship ▶ 00:15 that is flatter than the linear regression indicates, ▶ 00:18 but there is one outlier. ▶ 00:21 Because if you are minimizing quadratic error, ▶ 00:23 outliers penalize you over-proportionately. ▶ 00:26 So outliers are particularly bad for linear regression. ▶ 00:30 And here is a case, ▶ 00:34 where the data clearly suggests ▶ 00:35 a very different phenomena for linear. ▶ 00:37 We have only two ?? variables even being used, ▶ 00:40 and this one has a strong frequency ▶ 00:42 and a strong vertical spread. ▶ 00:45 Clearly a linear regression model ▶ 00:47 is a very poor one to explain ▶ 00:49 this data over here. ▶ 00:51 Another problem with linear regression ▶ 00:53 is that as you go to infinity in the X space, ▶ 00:55 your Ys also become infinite. ▶ 00:59 In some problems that isn't a plausible model. ▶ 01:02 For example, if you wish to predict the weather ▶ 01:05 anytime into the future, ▶ 01:08 it's implausible to assume the further the prediction goes out, ▶ 01:10 the hotter or the cooler it becomes. ▶ 01:13 For such situations there is a ▶ 01:15 model called logistic regression, ▶ 01:17 which uses a slightly more complicated ▶ 01:20 model than linear regression, ▶ 01:22 which goes as follows:. ▶ 01:24 Let F of XP, or linear function, ▶ 01:25 and the output of logistic regression ▶ 01:30 is obtained by the following function: ▶ 01:32 One over one plus exponential of minus F of X. ▶ 01:34 So here's a quick quiz for you. ▶ 01:40 What is the range in which Z might fall ▶ 01:43 given this function over here, ▶ 01:48 and ??? the linear function of F or X over here. ▶ 01:49 Is it zero, one? ▶ 01:53 Is it minus one, one? ▶ 01:56 Is it minus one, zero? ▶ 01:59 Minus two, two? ▶ 02:02 Or none of the above? ▶ 02:04 ### (01:00) 39 Answer

The answer is zero, one. ▶ 00:00 If this expression over here, ▶ 00:02 F of X, ▶ 00:05 grows to positive infinity, ▶ 00:07 then Z becomes one. ▶ 00:09 And the reason is ▶ 00:14 as this term over here becomes very large, ▶ 00:16 E to the minus of that term approaches zero; ▶ 00:19 one over one equals one. ▶ 00:22 If F of X goes to minus infinity, ▶ 00:25 then Z goes to zero. ▶ 00:30 And the reason is, ▶ 00:33 if this expression over here goes to minus infinity, ▶ 00:34 E to the infinity becomes very large; ▶ 00:38 one over something very large becomes zero. ▶ 00:41 When we plot the logistic function it looks like this: ▶ 00:44 So it's approximately linear ▶ 00:49 around F of X equals zero, ▶ 00:51 but it levels off to zero and one ▶ 00:54 as we go to the extremes. ▶ 00:58 ### (01:39) 40 Linear Regression and Complexity Control

Another problem with linear regression has to do with the regularization ▶ 00:00 or complexity control. ▶ 00:04 Just like before, we sometimes wish to have ▶ 00:06 a less complex model. ▶ 00:08 So in regularization, the loss function is either the sum ▶ 00:10 of the loss of a data function and a complexity control term, ▶ 00:15 which is often called the loss of the parameters. ▶ 00:21 The loss of the data is simply curvatic loss, as we discussed before. ▶ 00:24 The loss of parameters might just be a function that penalizes ▶ 00:29 the parameters to become large ▶ 00:35 up to some known P, where P is usually either 1 or 2. ▶ 00:37 If you draw this graphically, ▶ 00:43 in a parameter space comprised of 2 parameters, ▶ 00:46 your curvatic term for minimizing the data error ▶ 00:49 might look like this, where the minimum sits over here. ▶ 00:53 Your term for regularization might pull these parameters toward 0. ▶ 00:57 It pulls it toward 0, along the circle if you use curvatic error, ▶ 01:02 and it does it in a diamond-shaped way. ▶ 01:09 For L1 regularization--either one works well. ▶ 01:14 L1 has the advantage in that parameters tend to get really sparse. ▶ 01:20 If you look at this diagram, there is a tradeoff between W-0 and W-1. ▶ 01:24 In the L1 case, that allows one of them to be driven to 0. ▶ 01:30 In the L2 case, parameters tend not to be as sparse. ▶ 01:33 So L1 is often preferred. ▶ 01:37 ### (01:46) 41 Minimizing Complicated Loss Functions

This all raises the question, ▶ 00:00 how to minimize more complicated loss functions ▶ 00:03 than the one we discussed so far. ▶ 00:06 Are there closed-form solutions of the type we found for linear regression? ▶ 00:09 Or do we have to resort to iterative methods? ▶ 00:14 The general answer is, unfortunantly, we have to resort to iterative methods. ▶ 00:17 Even though there are special cases in which corresponding solutions may exist, ▶ 00:23 in general, our loss functions now become complicated enough ▶ 00:28 that all we can do is iterate. ▶ 00:32 Here is a prototypical loss function ▶ 00:35 and the method for interation will be called gradient descent. ▶ 00:40 In gradient descent, you start with an initial guess, ▶ 00:44 W-0, where 0 is your iteration number, ▶ 00:48 and then you up with it iteratively. ▶ 00:53 Your i plus 1st parameter guess will be obtained by taking your i-th guess ▶ 00:55 and subtracting from it the gradient of your loss function, ▶ 01:04 and that guess multiplied by a small learning weight alpha, ▶ 01:10 where alpha is often as small as 0.01. ▶ 01:15 I have a couple of questions for you. ▶ 01:19 Consider the following 3 points. ▶ 01:21 We call them A, B, C. ▶ 01:25 I wish to know, for points A, B, and C, ▶ 01:27 Is the gradient at this point positive, about zero, or negative? ▶ 01:34 For each of those, check exactly one of those cases. ▶ 01:40 ### (00:47) 42 Answer

In case A, the gradient is negative. ▶ 00:00 If you move to the right in the X space, ▶ 00:03 then your loss decreases. ▶ 00:06 In B, it's about zero. ▶ 00:09 In C, it's pointing up; it's positive. ▶ 00:12 So if you apply the rule over here, ▶ 00:15 if you were to start at A as your W-zero, ▶ 00:18 then your gradient is negative. ▶ 00:21 Therefore, you would add something to the value of W. ▶ 00:23 You move to the right, and your loss has decreased. ▶ 00:26 You do this until you find yourself ▶ 00:29 with what's called a local minimum, where B resides. ▶ 00:31 In this instance over here, gradient descent starting at A ▶ 00:34 would not get you to the global minimum, ▶ 00:37 which sits over here because there's a bump in between. ▶ 00:39 Gradient methods are known to be subject to local minimum. ▶ 00:42 ### (00:28) 43 Question

I have another gradient quiz. ▶ 00:00 Consider the following quadratic arrow function. ▶ 00:03 We are considering the gradient in 3 different places. ▶ 00:06 a. b. and c. ▶ 00:09 And they ask you which gradient is the largest. ▶ 00:13 a, b, or c or are they all equal? ▶ 00:17 In which case, you would want to check the last box over here ▶ 00:23 ### (00:20) 44 Answer

And the answer is C. ▶ 00:00 The derivative of a quadratic function is a linear function. ▶ 00:04 Which would look about like this. ▶ 00:08 And as we go outside, our gradient becomes larger and larger. ▶ 00:11 This over here is much steeper than this curve over here. ▶ 00:15 ### (00:21) 45 Question

[Thrun] Here is a final gradient descent quiz. ▶ 00:00 Suppose we have a loss function like this ▶ 00:04 and our gradient descent starts over here. ▶ 00:08 Will it likely reach the global minimum? ▶ 00:12 Yes or no. ▶ 00:15 Please check one of those boxes. ▶ 00:17 ### (01:00) 46 Answer

[Thrun] And the answer is yes, ▶ 00:00 although, technically speaking, to reach the absolute global minimum ▶ 00:02 we need the learning rates to become smaller and smaller over time. ▶ 00:06 If they stay constant, there is a chance this thing might bounce around ▶ 00:11 between 2 points in the end and never reach the global minimum. ▶ 00:15 But assuming that we implement gradient descent correctly, ▶ 00:18 we will finally reach the global minimum. ▶ 00:22 That's not the case if you start over here, where we can get stuck over here ▶ 00:24 and settle for the minimum over here, which is a local minimum ▶ 00:29 and not the best solution to our optimization problem. ▶ 00:32 So one of the important points to take away from this is ▶ 00:35 gradient descent is universally applicable to more complicated problems-- ▶ 00:38 problems that don't have a plausible solution. ▶ 00:43 But you have to check whether there is many local minima, ▶ 00:46 and if so, you have to worry about this. ▶ 00:49 Any optimization book can tell you tricks how to overcome this. ▶ 00:51 I won't go into any more depth here in this class. ▶ 00:55 ### (01:41) 47 Gradient Descent Implementation

[Thrun] It's interesting to see how to minimize a loss function using gradient descent. ▶ 00:00 In our linear case, we have L equals sum over the correct labels ▶ 00:05 minus our linear function to the square, ▶ 00:12 which we seek to minimize. ▶ 00:16 We already know that this has a closed form solution, ▶ 00:18 but just for the fun of it, let's look at gradient descent. ▶ 00:21 The gradient of L with respect to W1 is minus 2, sum of all J ▶ 00:25 of the difference as before but without the square times Xj. ▶ 00:33 The gradient with respect to W0 is very similar. ▶ 00:39 So in gradient descent we start with W1 0 and W0 0 ▶ 00:43 where the upper cap 0 corresponds to the iteration index of gradient descent. ▶ 00:49 And then we iterate. ▶ 00:55 In the M iteration we get our new estimate by using the old estimate ▶ 00:57 minus a learning rate of this gradient over here ▶ 01:06 taking the position of the old estimate W1, M minus 1. ▶ 01:10 Similarly, for W0 we get this expression over here. ▶ 01:15 And these expressions look nasty, ▶ 01:20 but what it really means is we subtract an expression like this ▶ 01:24 every time we do gradient descent from W1 ▶ 01:28 and an expression like this every time we do gradient descent from W0, ▶ 01:31 which is easy to implement, and that implements gradient descent. ▶ 01:36 ### (04:15) 48 Perceptron

Now, there are many different ways to apply linear functions in machine learning. ▶ 00:00 We so far have studied linear functions for regression, ▶ 00:08 but linear functions are also used for classification, ▶ 00:12 and specifically for an algorithm called the perceptron algorithm. ▶ 00:16 This algorithm happens to be a very early model of a neuron, ▶ 00:21 as in the neurons we have in our brains, ▶ 00:27 and was invented in the 1940s. ▶ 00:30 Suppose we give a data set of positive samples and negative samples. ▶ 00:33 A linear separator is a linear equation that separates positive from negative examples. ▶ 00:41 Obviously, not all sets possess a linear separator, but some do. ▶ 00:49 For those we can define the algorithm of the perceptron and it actually converges. ▶ 00:55 To define a linear separator, let's start with our linear equation as before-- ▶ 01:02 w1x + w0 in cases where x is higher dimensional this might actually be a vector--never mind. ▶ 01:07 If this is larger or equal to zero, then we call our classification 1. ▶ 01:18 Otherwise, we call it zero. ▶ 01:26 Here's our linear separation classification function ▶ 01:30 where this is our common linear function. ▶ 01:35 Now, as I said, perceptron only converges if the data is linearly separable, ▶ 01:39 and then it converges to a linear separation of the data, ▶ 01:45 which is quite amazing. ▶ 01:49 Perceptron is an iterative algorithm that is not dissimilar from grade descent. ▶ 01:52 In fact, the update rule echoes that of grade descent, and here's how it goes. ▶ 01:56 We start with a random guess for w1 and w0, ▶ 02:03 which may correspond to a random separation line, ▶ 02:09 but usually is inaccurate. ▶ 02:13 Then the mth weight-i is obtained by using the old weight plus some learning rate alpha ▶ 02:17 times the difference between the desired target label ▶ 02:29 and the target label produced by our function at the point m-1. ▶ 02:33 Now, this is an online learning rule, which is we don't process all the data in batch. ▶ 02:39 We process one data at a time, and we might go through the data many, many times-- ▶ 02:45 hence the j over here-- ▶ 02:50 but every time we do this, we apply this rule over here. ▶ 02:52 What this rule gives us is a method to adapt our weights in proportion to the error. ▶ 02:55 If the prediction of our function f equals our target label, ▶ 03:03 and the error is zero, then no update occurs. ▶ 03:07 If there is a difference, however, we update in a way so as to minimize the error. ▶ 03:11 Alpha is a small learning weight. ▶ 03:18 Once again, perceptron converges to a correct linear separator ▶ 03:22 if such linear separator exists. ▶ 03:28 Now, the case of linear separation has recently received a lot of attention in machine learning. ▶ 03:31 If you look at the picture over here, you'll find there are many different linear separators. ▶ 03:36 There is one over here. There is one over here. There is one over here. ▶ 03:42 One of the questions that has recently been researched extensively is which one to prefer. ▶ 03:47 Is it a, b, or c? ▶ 03:53 Even though you probably have never seen this literature, ▶ 03:57 I will just ask your intuition in this following quiz. ▶ 04:01 Which linear separator would you prefer if you look at these three different linear separators-- ▶ 04:05 a, b, c, or none of them? ▶ 04:10 ### (05:38) 49 Answer and SVMs

[Narrator] And intuitively I would argue it's B, ▶ 00:00 and the reason why is ▶ 00:04 C comes really close to examples. ▶ 00:06 So if these examples are noisy, ▶ 00:09 it's quite likely that ▶ 00:12 by being so close to these examples ▶ 00:14 that future examples cross the line. ▶ 00:17 Similarly A comes close to examples. ▶ 00:20 B is the one that stays really far away ▶ 00:23 from any example. ▶ 00:26 So there's this entire region over here ▶ 00:28 where there's no example anywhere near B. ▶ 00:31 This region is often called the margin. ▶ 00:34 The margin of the linear separator ▶ 00:37 is the distance of the separator ▶ 00:40 to the closest training example. ▶ 00:43 The margin is a really important concept ▶ 00:45 in machine learning. ▶ 00:47 There is an entire class of maximum margin ▶ 00:49 learning algorithms, ▶ 00:51 and the 2 most popular are ▶ 00:53 support vector machines and boosting. ▶ 00:56 If you are familiar with machine learning, ▶ 01:00 you've come across these terms. ▶ 01:02 These are very frequently used these days ▶ 01:04 in actual discrimination learning tasks. ▶ 01:07 I will not go into any details because it would go ▶ 01:10 way beyond the scope of this introduction ▶ 01:12 to artificial intelligence class, but let's see ▶ 01:16 a few abstract words specifically about ▶ 01:18 support vector machines or SVMs. ▶ 01:21 As I said before a support vector machine ▶ 01:25 derives a linear separator, and it takes ▶ 01:30 the one that actually maximizes the margin ▶ 01:34 as shown over here. ▶ 01:39 By doing so it attains additional robost-ness ▶ 01:42 over perceptron which only picks ▶ 01:44 a linear separator without ▶ 01:46 consideration of the margin. ▶ 01:48 Now the problem of finding the ▶ 01:51 margin maximizing linear separator ▶ 01:53 can be solved by a quadratic program ▶ 01:55 which is an integer method for finding the best ▶ 01:59 linear separator that maximizes the margin. ▶ 02:03 One of the nice things that support ▶ 02:06 vector machines do in practice is ▶ 02:08 they use linear techniques to solve ▶ 02:12 nonlinear separation problems, ▶ 02:16 and I'm just going to give you a glimpse of ▶ 02:19 what's happening without going into any detail. ▶ 02:22 Suppose the data looks as follows: ▶ 02:25 we have a positive class ▶ 02:28 which is near the origin of a coordinate system ▶ 02:31 and a negative class that surrounds the positive class. ▶ 02:33 Clearly these 2 classes ▶ 02:37 are not linearly separable ▶ 02:39 because there's no line I can draw that ▶ 02:41 separates the negative examples from the positive examples. ▶ 02:43 An idea that underlies SVMs, ▶ 02:47 that will ultimately be known as ▶ 02:49 the kernel trick, ▶ 02:51 is to augment the feature set by new features. ▶ 02:53 Suppose this is X1, and this is X2, ▶ 02:56 and normally X1 and X2 ▶ 02:58 will be the input features. ▶ 03:00 In this example, you might derive ▶ 03:03 a 3rd one. ▶ 03:05 Let me pick a 3rd one ▶ 03:07 Suppose X3 equals the square root of ▶ 03:09 X1 square + X2 square. ▶ 03:13 In other words X3 is the distance ▶ 03:18 of any data point from the center ▶ 03:22 of the coordinate system. ▶ 03:25 Then things do become linearly separable ▶ 03:27 so that just along the 3rd dimension ▶ 03:31 all the positive examples end up ▶ 03:33 to be close to the origin, ▶ 03:36 and all the negative examples ▶ 03:39 are further away, and the line is ▶ 03:41 orthogonal to the 3rd input feature ▶ 03:43 solves the separation problem. ▶ 03:46 Map back into the space over here ▶ 03:49 is actually a circle which is a set of all ▶ 03:52 values of X3 that are equidistant ▶ 03:55 to the center of the origin. ▶ 04:00 Now this trick could be done in any linear learning algorithm, ▶ 04:02 and it's really an amazing trick. ▶ 04:06 You can take any nonlinear problem, add ▶ 04:08 features of this type or any other type, ▶ 04:10 and use linear techniques ▶ 04:13 and get better solutions. ▶ 04:15 This is a very deep machine learning insight ▶ 04:17 that you can extend your feature space ▶ 04:19 in this way, and there's numerous ▶ 04:21 papers written about this. ▶ 04:23 In SVMs, the extension of the feature space is mathematically done by ▶ 04:25 what's called a kernel. ▶ 04:31 I can't really tell you about this in this class, ▶ 04:33 but it makes it possible to write ▶ 04:36 very large new feature spaces including ▶ 04:38 infinitely dimensional new feature spaces. ▶ 04:41 These messages are very powerful. ▶ 04:44 It turns out you never ▶ 04:46 really compute all those features. ▶ 04:48 They are implicitly represented by ▶ 04:50 so called kernels, and if you care about this, ▶ 04:52 I recommend you to dive ▶ 04:55 deeper into the literature ▶ 04:57 of support vector machines. ▶ 04:59 This is meant to just give you ▶ 05:01 an overview of the essence of ▶ 05:03 what support vector machines are all about. ▶ 05:05 So in summary, ▶ 05:08 linear methods we learned about ▶ 05:10 using them for regression ▶ 05:12 and also classification. ▶ 05:15 We learned about exact solutions ▶ 05:17 versus iterative solutions. ▶ 05:19 We talked about smoothing, ▶ 05:23 and we even talked about ▶ 05:25 using linear methods for nonlinear problems. ▶ 05:27 So we covered quite a bit of ground. ▶ 05:30 This is a really significant cross section ▶ 05:33 of machine learning. ▶ 05:35 ### (02:01) 50 k Nearest Neighbors

As the final method in this unit, I'd like now to talk about k-nearest neighbors. ▶ 00:00 And the distinguishing factor of k-nearest neighbors ▶ 00:06 is that it is a nonparametric machine learning method. ▶ 00:09 So far we've talked about parametric methods. ▶ 00:13 Parametric methods have parameters, like probabilities or weights, ▶ 00:16 and the number of parameters is constant. ▶ 00:21 Or to put it differently, the number of parameters is independent of the training set size. ▶ 00:25 So for example in the Naive Bayes, if we bring up more data, ▶ 00:29 the number of condition probabilities will stay the same. ▶ 00:34 Well, that wasn't technically always the case. ▶ 00:37 Our vocabulary might increase and as such the number of parameters. ▶ 00:41