Welcome to the online introduction to artificial intelligence. | ▶ 00:00 |
My name is Sebastian Thrun. >>I'm Peter Norvig. | ▶ 00:04 |
We are teaching this class at Stanford, | ▶ 00:07 |
and now we are teaching it online for the entire world. | ▶ 00:09 |
We are really excited about this. | ▶ 00:11 |
It's great to have you all here. | ▶ 00:13 |
It's exciting to have such a record-breaking number of people. | ▶ 00:14 |
We think we can deliver a good introduction to artificial intelligence. | ▶ 00:18 |
We hope you'll stick with it. | ▶ 00:22 |
It's going to be a lot of work, | ▶ 00:24 |
but we think it's going to be very rewarding. | ▶ 00:25 |
The way it is going to be organized is that | ▶ 00:27 |
every week there are going to be new videos, and with these videos, quizzes. | ▶ 00:29 |
With these quizzes, you can test your knowledge about AI. | ▶ 00:32 |
For the advanced version of this class, we also post homework assignments and exams | ▶ 00:35 |
on which you'll be tested. | ▶ 00:38 |
We're going to grade those to give you a final score to see | ▶ 00:40 |
if you can actually master artificial intelligence the same way | ▶ 00:44 |
any good student at Stanford would. | ▶ 00:47 |
If you do that, then at the end of the class, we'll sign a letter of accomplishment, | ▶ 00:49 |
and let you know that you've achieved this and what your rank in the class was. | ▶ 00:54 |
So I hope you have fun. Watch us on videotape. | ▶ 00:58 |
We will teach you AI. | ▶ 01:02 |
Participate in the discussion forum. | ▶ 01:04 |
Ask your questions, and help others answer questions. | ▶ 01:06 |
I hope we have a fantastic time ahead of us in the next 10 weeks. | ▶ 01:09 |
Welcome to the class. We'll see you online. | ▶ 01:12 |
Welcome to the first unit of the Online Introduction to Artificial Intelligence. | ▶ 00:00 |
I will be teaching you the very, very basics today. | ▶ 00:05 |
This is Unit 1 of Artificial Intelligence. | ▶ 00:09 |
Welcome. | ▶ 00:14 |
The purpose of this class is twofold: | ▶ 00:16 |
Number 1, to teach you the very basics of artificial intelligence | ▶ 00:20 |
so you'll be able to talk to people in the field | ▶ 00:25 |
and understand the basic tools of the trade; | ▶ 00:29 |
and also, very importantly, to excite you about the field. | ▶ 00:32 |
I have been in the field of artificial intelligence for about 20 years, | ▶ 00:37 |
and it's been truly rewarding. | ▶ 00:42 |
So I want you to participate in the beauty and the excitement of AI | ▶ 00:44 |
so you can become a professional who gets the same reward | ▶ 00:48 |
and excitement out of this field as I do. | ▶ 00:52 |
The basic structure of this class involves videos | ▶ 00:55 |
in which Peter or I will teach you something new, | ▶ 01:00 |
then also quizzes, in which we will test your ability to answer AI questions, | ▶ 01:03 |
and finally, answer videos in which we tell you what the right answer would have been | ▶ 01:11 |
for the quiz that you might have answered incorrectly before. | ▶ 01:17 |
This will all be reiterated, and every so often you get a homework assignment, | ▶ 01:22 |
also in the form of quizzes but without the answers. | ▶ 01:28 |
And then we also have video exams. | ▶ 01:34 |
If you check our website, there are requirements | ▶ 01:37 |
on how you have to do assignments and exams. | ▶ 01:39 |
Please go to ai-class.org for the details of this class. | ▶ 01:43 |
So here is a question: is an AI program called wetware, a formula, or an intelligent agent? | ▶ 01:48 |
Pick the one that fits best. | ▶ 01:58 |
[Thrun] The correct answer is intelligent agent. | ▶ 00:00 |
Let's talk about intelligent agents. | ▶ 00:04 |
Here is my intelligent agent, | ▶ 00:07 |
and it gets to interact with an environment. | ▶ 00:11 |
The agent can perceive the state of the environment | ▶ 00:17 |
through its sensors, | ▶ 00:22 |
and it can affect the state of the environment through its actuators. | ▶ 00:25 |
The big question of artificial intelligence is the function that maps sensors to actuators. | ▶ 00:29 |
That is called the control policy for the agent. | ▶ 00:37 |
So all of this class will deal with how an agent makes decisions | ▶ 00:41 |
that it can carry out with its actuators based on past sensor data. | ▶ 00:48 |
Those decisions take place many, many times, | ▶ 00:54 |
and the loop of environment feedback to sensors, agent decision, | ▶ 00:58 |
actuator interaction with the environment, and so on is called the perception-action cycle. | ▶ 01:03 |
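To make this concrete, here is a minimal Python sketch of the perception-action cycle; the names here (the environment object, sense, act, and policy) are hypothetical illustrations, not part of any library:

```python
# A minimal sketch of the perception-action cycle. Environment, sense(),
# act(), and policy() are hypothetical names used purely for illustration.

def run_agent(environment, policy, steps=100):
    """Run the perception-action cycle for a fixed number of steps."""
    history = []                           # past sensor data the agent may use
    for _ in range(steps):
        percept = environment.sense()      # sensors perceive the environment state
        history.append(percept)
        action = policy(percept, history)  # control policy: sensor data -> action
        environment.act(action)            # actuators affect the environment
```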
So here is my very first quiz for you. | ▶ 01:12 |
Artificial intelligence, AI, has successfully been used in finance, | ▶ 01:15 |
robotics, games, medicine, and the Web. | ▶ 01:21 |
Check any or all of those that apply. | ▶ 01:26 |
And if none of them applies, check the box down here that says none of them. | ▶ 01:28 |
So the correct answer is all of those-- | ▶ 00:00 |
finance, robotics, games, medicine, the Web, and many more applications. | ▶ 00:03 |
So let me talk about them in some detail. | ▶ 00:08 |
There is a huge number of applications of artificial intelligence in finance, | ▶ 00:10 |
very often in the shape of making trading decisions-- | ▶ 00:15 |
in which case, the agent is called a trading agent. | ▶ 00:18 |
And the environment might be things like the stock market or the bond market | ▶ 00:21 |
or the commodities market. | ▶ 00:27 |
And our trading agent can sense the price of certain things, | ▶ 00:29 |
like stocks or bonds or commodities. | ▶ 00:33 |
It can also read the news online and follow certain events. | ▶ 00:35 |
And its decisions are usually things like buy or sell decisions--trades. | ▶ 00:40 |
There's a huge history of artificial intelligence finding methods to look at data over time | ▶ 00:48 |
and make predictions as to how prices develop over time-- | ▶ 00:55 |
and then put in trades behind those. | ▶ 00:58 |
And very frequently, people using artificial intelligence trading agents | ▶ 01:01 |
have made a good amount of money with superior trading decisions. | ▶ 01:06 |
There's also a long history of AI in Robotics. | ▶ 01:10 |
Here is my depiction of a robot. | ▶ 01:14 |
Of course, there are many different types of robots | ▶ 01:17 |
and they all interact with their environments through their sensors, | ▶ 01:20 |
which include things like cameras, microphones, and tactile sensors for touch. | ▶ 01:24 |
And the way they impact their environments is to move motors around, | ▶ 01:33 |
in particular, their wheels, their legs, their arms, their grippers. | ▶ 01:38 |
They can also say things to people using voice. | ▶ 01:43 |
Now there's a huge history of using artificial intelligence in robotics. | ▶ 01:46 |
Pretty much, every robot that does something interesting today uses AI. | ▶ 01:50 |
In fact, often AI has been studied together with robotics, as one discipline. | ▶ 01:54 |
But because robots are somewhat special in that they use physical actuators | ▶ 01:58 |
and deal with physical environments, they are a little bit different from | ▶ 02:03 |
just artificial intelligence, as a whole. | ▶ 02:06 |
When the Web came out, the early Web crawlers were called robots | ▶ 02:08 |
and to block a robot from accessing your website, to the present day, | ▶ 02:15 |
there's a file called robots.txt that allows you to deny any Web crawler | ▶ 02:20 |
access to the information on your website. | ▶ 02:24 |
So historically, robotics played a huge role in artificial intelligence | ▶ 02:28 |
and a good chunk of this class will be focusing on robotics. | ▶ 02:32 |
AI has a huge history in games-- | ▶ 02:36 |
to make games smarter or feel more natural. | ▶ 02:39 |
There are 2 ways in which AI has been used in games, as a game agent. | ▶ 02:43 |
One is to play against you, as a human user. | ▶ 02:47 |
So for example, if you play the game of Chess, | ▶ 02:50 |
then you are the environment to the game agent. | ▶ 02:54 |
The game agent gets to observe your moves, and it generates its own moves | ▶ 02:57 |
with the purpose of defeating you in Chess. | ▶ 03:03 |
So most adversarial games, where you play against an opponent | ▶ 03:07 |
and the opponent is a computer program, | ▶ 03:10 |
the game agent is built to play against you--against your own interests--and make you lose. | ▶ 03:13 |
And of course, your objective is to win. | ▶ 03:20 |
That's an AI games-type situation. | ▶ 03:22 |
The second thing is that game agents in AI | ▶ 03:25 |
also are used to make games feel more natural. | ▶ 03:29 |
So very often games have characters inside, and these characters act in some way. | ▶ 03:32 |
And it's important for you, as the player, to feel that these characters are believable. | ▶ 03:36 |
There's an entire sub-field of artificial intelligence to use AI | ▶ 03:42 |
to make characters in a game more believable--look smarter, so to speak-- | ▶ 03:45 |
so that you, as a player, think you're playing a better game. | ▶ 03:51 |
Artificial intelligence has a long history in medicine as well. | ▶ 03:55 |
The classic example is that of a diagnostic agent. | ▶ 04:00 |
So here you are--and you might be sick, and you go to your doctor. | ▶ 04:04 |
And your doctor wishes to understand | ▶ 04:09 |
what the reason for your symptoms and your sickness is. | ▶ 04:11 |
The diagnostic agent will observe you through various measurements-- | ▶ 04:17 |
for example, blood pressure and heart signals, and so on-- | ▶ 04:21 |
and it'll come up with a hypothesis as to what you might be suffering from. | ▶ 04:25 |
But rather than intervene directly, in most cases the diagnosis of your disease | ▶ 04:29 |
is communicated to the doctor, who then carries out the intervention. | ▶ 04:34 |
This is called a diagnostic agent. | ▶ 04:38 |
There are many other versions of AI in medicine. | ▶ 04:40 |
AI is used in intensive care to understand whether there are situations | ▶ 04:43 |
that need immediate attention. | ▶ 04:48 |
It's been used in lifelong medicine to monitor vital signs over long periods of time. | ▶ 04:50 |
And as medicine becomes more personal, the role of artificial intelligence | ▶ 04:54 |
will definitely increase. | ▶ 04:58 |
We already mentioned AI on the Web. | ▶ 05:01 |
The most generic version of AI is to crawl the Web and understand the Web, | ▶ 05:05 |
and assist you in answering questions. | ▶ 05:09 |
So when you have this search box over here | ▶ 05:12 |
and it says "Search" on the left, | ▶ 05:15 |
and "I'm Feeling Lucky" on the right, | ▶ 05:18 |
and you type in the words, | ▶ 05:20 |
what AI does for you is it understands what words you typed in | ▶ 05:21 |
and finds the most relevant pages. | ▶ 05:28 |
That is really core artificial intelligence. | ▶ 05:30 |
It's used by a number of companies, such as Microsoft and Google | ▶ 05:32 |
and Amazon, Yahoo, and many others. | ▶ 05:36 |
And the way this works is that there's a crawling agent that can go | ▶ 05:39 |
to the World Wide Web and retrieve pages, through just a computer program. | ▶ 05:43 |
It then sorts these pages into a big database inside the crawler | ▶ 05:51 |
and also analyzes the relevance of each page to any possible query. | ▶ 05:56 |
When you then come and issue a query, | ▶ 06:01 |
the AI system is able to give you a response-- | ▶ 06:04 |
for example, a collection of the 10 best Web links. | ▶ 06:08 |
In short, every time you try to write a piece of software | ▶ 06:12 |
that makes your computer smart, | ▶ 06:15 |
you will likely need artificial intelligence. | ▶ 06:18 |
And in this class, Peter and I will teach you | ▶ 06:20 |
many of the basic tricks of the trade | ▶ 06:23 |
to make your software really smart. | ▶ 06:25 |
It will be good to introduce some basic terminology | ▶ 00:00 |
that is commonly used in artificial intelligence to distinguish different types of problems. | ▶ 00:04 |
The very first distinction I will teach you is fully versus partially observable. | ▶ 00:09 |
An environment is called fully observable if what your agent can sense | ▶ 00:16 |
at any point in time is completely sufficient to make the optimal decision. | ▶ 00:19 |
So, for example, in many card games, | ▶ 00:26 |
when all the cards are on the table, the momentary sight of all those cards | ▶ 00:29 |
is really sufficient to make the optimal choice. | ▶ 00:36 |
That is in contrast to some other environments where you need memory | ▶ 00:40 |
on the side of the agent to make the best possible decision. | ▶ 00:46 |
For example, in the game of poker, the cards aren't openly on the table, | ▶ 00:50 |
and memorizing past moves will help you make a better decision. | ▶ 00:55 |
To fully understand the difference, consider the interaction of an agent | ▶ 01:00 |
with the environment through its sensors and its actuators, | ▶ 01:04 |
and this interaction takes place over many cycles, | ▶ 01:08 |
often called the perception-action cycle. | ▶ 01:11 |
For many environments, it's convenient to assume | ▶ 01:16 |
that the environment has some sort of internal state. | ▶ 01:19 |
For example, in a card game where the cards are not openly on the table, | ▶ 01:22 |
the state might pertain to the cards in your hand. | ▶ 01:28 |
An environment is fully observable if the sensors can always see | ▶ 01:33 |
the entire state of the environment. | ▶ 01:37 |
It's partially observable if the sensors can only see a fraction of the state, | ▶ 01:41 |
yet memorizing past measurements gives us additional information about the state | ▶ 01:46 |
that is not readily observable right now. | ▶ 01:52 |
So any game, for example, where past moves carry information about | ▶ 01:55 |
what might be in a person's hand, those games are partially observable, | ▶ 02:01 |
and they require different treatment. | ▶ 02:06 |
Very often agents that deal with partially observable environments | ▶ 02:08 |
need to acquire internal memory to understand what | ▶ 02:12 |
the state of the environment is, and we'll talk extensively | ▶ 02:15 |
about how to build such internal memory when we | ▶ 02:18 |
discuss hidden Markov models. | ▶ 02:21 |
A second terminology for environments pertains to whether the environment | ▶ 02:23 |
is deterministic or stochastic. | ▶ 02:26 |
A deterministic environment is one where your agent's actions | ▶ 02:29 |
uniquely determine the outcome. | ▶ 02:35 |
So, for example, in chess, there's really no randomness when you move a piece. | ▶ 02:37 |
The effect of moving a piece is completely predetermined, | ▶ 02:42 |
and whenever I move the same piece in the same way, the outcome is the same. | ▶ 02:46 |
That we call deterministic. | ▶ 02:50 |
Games with dice, for example, like backgammon, are stochastic. | ▶ 02:52 |
While you can still deterministically move your pieces, | ▶ 02:56 |
the outcome of an action also involves the throw of the dice, | ▶ 03:00 |
and you can't predict that. | ▶ 03:03 |
There's a certain amount of randomness involved in the outcome of the dice, | ▶ 03:05 |
and therefore, we call this stochastic. | ▶ 03:08 |
Let me talk about discrete versus continuous. | ▶ 03:10 |
A discrete environment is one where you have finitely many action choices, | ▶ 03:14 |
and finitely many things you can sense. | ▶ 03:18 |
So, for example, in chess, again, there's finitely many board positions, | ▶ 03:21 |
and finitely many things you can do. | ▶ 03:25 |
That is different from a continuous environment | ▶ 03:28 |
where the space of possible actions or things you could sense may be infinite. | ▶ 03:30 |
So, for example, if you throw darts, there's infinitely many ways to angle the darts | ▶ 03:35 |
and to accelerate them. | ▶ 03:41 |
Finally, we distinguish benign versus adversarial environments. | ▶ 03:43 |
In benign environments, the environment might be random. | ▶ 03:49 |
It might be stochastic, but it has no objective of its own | ▶ 03:53 |
that would contradict your own objective. | ▶ 03:57 |
So, for example, weather is benign. | ▶ 03:59 |
It might be random. It might affect the outcome of your actions. | ▶ 04:02 |
But it isn't really out there to get you. | ▶ 04:06 |
Contrast this with adversarial environments, such as many games, like chess, | ▶ 04:08 |
where your opponent is really out there to get you. | ▶ 04:14 |
It turns out it's much harder to find good actions in adversarial environments | ▶ 04:16 |
where the opponent actively observes you and counteracts what you're trying to achieve | ▶ 04:21 |
relative to a benign environment, where the environment might merely be stochastic | ▶ 04:26 |
but isn't really interested in making your life worse. | ▶ 04:30 |
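As a rough aid for the quizzes that follow, one way you might record these four attributes in code is a small data structure; this is just an illustrative sketch, with chess filled in as discussed in this unit:

```python
from dataclasses import dataclass

@dataclass
class EnvironmentProperties:
    """The four attributes used in this unit to classify an environment."""
    fully_observable: bool  # False means partially observable
    deterministic: bool     # False means stochastic
    discrete: bool          # False means continuous
    benign: bool            # False means adversarial

# Chess, as described above: no randomness, finitely many moves and
# positions, and an opponent who is out to get you.
chess = EnvironmentProperties(fully_observable=True, deterministic=True,
                              discrete=True, benign=False)
```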
So, let's see to what extent these expressions make sense to you | ▶ 04:35 |
by going to our next quiz. | ▶ 04:38 |
So here are the 4 concepts again: partially observable versus fully, | ▶ 04:40 |
stochastic versus deterministic, continuous versus discrete, | ▶ 04:45 |
adversarial versus benign. | ▶ 04:50 |
And let me ask you about the game of checkers. | ▶ 04:52 |
Check one or all of those attributes that apply. | ▶ 04:56 |
So, if you think checkers is partially observable, check this one. | ▶ 05:00 |
Otherwise, just don't check it. | ▶ 05:03 |
If you think it's stochastic, check this one, | ▶ 05:05 |
continuous, check this one, adversarial, check this one. | ▶ 05:07 |
If you don't know about checkers, you can check the Web and Google it | ▶ 05:11 |
to find a little more information about checkers. | ▶ 05:15 |
So, checkers is an interesting game. | ▶ 00:00 |
Here's the typical board of the game of checkers. | ▶ 00:04 |
Your pieces might look like this, | ▶ 00:08 |
and your opponent's pieces might look like this. | ▶ 00:11 |
And apart from some very cryptic rules in checkers, | ▶ 00:16 |
which I won't really discuss here, the board basically tells you | ▶ 00:19 |
everything there is to know about checkers, so it's clearly fully observable. | ▶ 00:23 |
It is deterministic because your move and your opponent's move | ▶ 00:28 |
very clearly affect the state of the board in ways that have | ▶ 00:33 |
absolutely no stochasticity. | ▶ 00:36 |
It is also discrete because there's finitely many action choices | ▶ 00:39 |
and finitely many board positions, | ▶ 00:45 |
and obviously, it is adversarial, since your opponent is out to get you. | ▶ 00:47 |
[Male narrator] The game of poker--is this partially observable, stochastic, | ▶ 00:00 |
continuous, or adversarial? | ▶ 00:06 |
Please check any or all of those that apply. | ▶ 00:09 |
[Male narrator] I would argue poker is partially observable | ▶ 00:00 |
because you can't see what is in your opponent's hand. | ▶ 00:03 |
It is stochastic because you're being dealt cards that are kind of coming at random. | ▶ 00:08 |
It is not continuous; there are just finitely many cards | ▶ 00:13 |
and finitely many actions you can take, even though you might argue | ▶ 00:16 |
that there's a huge number of different amounts of money you can bet. | ▶ 00:20 |
It's still finite, and it is clearly adversarial. | ▶ 00:24 |
If you've ever played poker before, you know how brutal it can be. | ▶ 00:27 |
[Male narrator] Here is a favorite, a robotic car. | ▶ 00:00 |
I wish to know whether it is partially observable, | ▶ 00:04 |
stochastic, continuous, or adversarial. | ▶ 00:06 |
That is, is the problem of driving robotically-- | ▶ 00:11 |
say, in a city--subject to any of those 4 categories? | ▶ 00:16 |
Please check any or all that might apply. | ▶ 00:20 |
Well, the robotic car clearly deals with a partially observable environment: | ▶ 00:00 |
if you just look at the momentary sensor input, you can't even tell how fast other cars are going. | ▶ 00:04 |
So, you need to memorize something. | ▶ 00:10 |
It is stochastic because it's inherently unpredictable | ▶ 00:12 |
what's going to happen next with other cars. | ▶ 00:15 |
It is continuous. | ▶ 00:17 |
There are infinitely many ways to set your steering | ▶ 00:20 |
or push your gas pedal or your brake, | ▶ 00:23 |
and, well, you can argue whether it's adversarial or not. | ▶ 00:26 |
Depending on where you live, it might be highly adversarial. | ▶ 00:29 |
Where I live, it isn't. | ▶ 00:31 |
I'm going to briefly talk about AI as something else, | ▶ 00:00 |
which is AI as the technique of uncertainty management in computer software. | ▶ 00:03 |
Put differently, AI is the discipline that you apply when you want to know what to do | ▶ 00:10 |
when you don't know what to do. | ▶ 00:17 |
Now, there's many reasons why there might be uncertainty in a computer program. | ▶ 00:22 |
There could be a sensor limit. | ▶ 00:27 |
That is, your sensors are unable to tell you | ▶ 00:29 |
what exactly is the case outside the AI system. | ▶ 00:33 |
There could be adversaries who act in a way that makes it hard for you | ▶ 00:37 |
to understand what is the case. | ▶ 00:41 |
There could be stochastic environments. | ▶ 00:44 |
Every time you roll the dice in a dice game, | ▶ 00:48 |
the stochasticity of the dice will make it impossible for you | ▶ 00:51 |
to be absolutely certain of what's the situation. | ▶ 00:55 |
There could be laziness. | ▶ 00:57 |
So perhaps you can actually compute what the situation is, | ▶ 01:00 |
but your computer program is just too lazy to do it. | ▶ 01:04 |
And here's my favorite: ignorance, plain ignorance. | ▶ 01:07 |
Many people are just ignorant of what's going on. | ▶ 01:11 |
They could know it, but they just don't care. | ▶ 01:14 |
All of these things are cause for uncertainty. | ▶ 01:17 |
AI is the discipline that deals with uncertainty and manages it in decision making. | ▶ 01:21 |
Now we've had an introduction to AI. | ▶ 00:00 |
We've heard about some of the properties of environments, | ▶ 00:03 |
and we've seen some possible architecture for agents. | ▶ 00:06 |
I'd like next to show you some examples of AI in practice. | ▶ 00:10 |
And Sebastian and I have some experience personally in things we have done | ▶ 00:13 |
at Google, at NASA, and at Stanford. | ▶ 00:18 |
And I want to tell you a little bit about some of those. | ▶ 00:21 |
One of the best successes of AI technology at Google | ▶ 00:25 |
has been the machine translation system. | ▶ 00:28 |
Here we see an example of an article in Italian automatically translated into English. | ▶ 00:31 |
Now, these systems are built for 50 different languages, | ▶ 00:37 |
and we can translate from any of the languages into any of the other languages. | ▶ 00:41 |
So, that's over 2,500 different systems, and we've done this all | ▶ 00:46 |
using machine learning techniques, using AI techniques, | ▶ 00:51 |
rather than trying to build them by hand. | ▶ 00:55 |
And the way it works is that we go out and collect examples of text | ▶ 00:58 |
that are aligned between the 2 languages. | ▶ 01:03 |
So we find, say, a newspaper that publishes 2 editions, | ▶ 01:06 |
an Italian edition and an English edition, and now we have examples of translations. | ▶ 01:11 |
And if anybody ever asked us for exactly the translation of this one particular article, | ▶ 01:16 |
then we could just look it up and say "We already know that." | ▶ 01:22 |
But of course, we aren't often going to be asked that. | ▶ 01:25 |
Rather, we're going to be asked parts of this. | ▶ 01:27 |
Here are some words that we've seen before, and we have to figure out | ▶ 01:30 |
which words in this article correspond to which words in the translation article. | ▶ 01:34 |
And we do that by examining many, many millions of words of text | ▶ 01:40 |
in the 2 languages and making the correspondences, | ▶ 01:45 |
and then we can put that all together. | ▶ 01:49 |
And then when we see a new example of text that we haven't seen before, | ▶ 01:51 |
we can just look up what we've seen in the past for that correspondence. | ▶ 01:54 |
So, the task is really two parts. | ▶ 01:58 |
Off-line, before we see an example of text we want to translate, | ▶ 02:01 |
we first build our translation model. | ▶ 02:05 |
We do that by examining all of the different examples | ▶ 02:07 |
and figuring out which part aligns to which. | ▶ 02:10 |
Now, when we're given a text to translate, we use that model, | ▶ 02:14 |
and we go through and find the most probable translation. | ▶ 02:18 |
So, what does it look like? | ▶ 02:22 |
Well, let's look at it in some example text. | ▶ 02:24 |
And rather than look at news articles, I'm going to look at something simpler. | ▶ 02:26 |
I'm going to switch from Italian to Chinese. | ▶ 02:29 |
Here's a bilingual text. | ▶ 02:35 |
Now, for a large-scale machine translation, examples are found on the Web. | ▶ 02:37 |
This example was found in a Chinese restaurant by Adam Lopez. | ▶ 02:41 |
Now, it's given, for a text of this form, | ▶ 02:46 |
that a line in Chinese corresponds to a line in English, | ▶ 02:49 |
and that's true for each of the individual lines. | ▶ 02:55 |
But to learn from this text, what we really want to discover | ▶ 02:59 |
is what individual words in Chinese correspond to individual words | ▶ 03:02 |
or small phrases in English. | ▶ 03:07 |
I've started that process by highlighting the word "wonton" in English. | ▶ 03:09 |
It appears 3 times throughout the text. | ▶ 03:16 |
Now, in each of those lines, there's a character that appears, | ▶ 03:18 |
and that's the only place in the Chinese text where that character appears. | ▶ 03:23 |
So, that seems like it's a high probability that this character in Chinese | ▶ 03:27 |
corresponds to the word "wonton" in English. | ▶ 03:33 |
Let's see if we can go farther. | ▶ 03:36 |
My question for you is what word or what character or characters in Chinese | ▶ 03:38 |
correspond to the word "chicken" in English? | ▶ 03:44 |
And here we see "chicken" appears in these locations. | ▶ 03:47 |
Click on the character or characters in Chinese that corresponds to "chicken." | ▶ 03:54 |
The answer is that chicken appears here, | ▶ 00:01 |
here, here, and here. | ▶ 00:04 |
Now, I don't know for sure, 100%, that that is the character for chicken in Chinese, | ▶ 00:10 |
but I do know that there is a good correspondence. | ▶ 00:14 |
Every place the word chicken appears in English, | ▶ 00:17 |
this character appears in Chinese and no other place. | ▶ 00:20 |
Let's go 1 step farther. | ▶ 00:24 |
Let's see if we can work out a phrase in Chinese | ▶ 00:27 |
and see if it corresponds to a phrase in English. | ▶ 00:30 |
Here's the phrase corn cream. | ▶ 00:33 |
Click on the characters in Chinese that correspond to corn cream. | ▶ 00:38 |
The answer is: these 2 characters here | ▶ 00:00 |
appear only in these 2 locations | ▶ 00:04 |
corresponding to the words corn cream | ▶ 00:07 |
which appear only in these locations in the English text. | ▶ 00:10 |
Again, we're not 100% sure that's the right answer, | ▶ 00:13 |
but it looks like a strong correlation. | ▶ 00:17 |
Now, 1 more question. | ▶ 00:20 |
Tell me what character or characters in Chinese | ▶ 00:22 |
correspond to the English word soup. | ▶ 00:26 |
The answer is that soup occurs in most of these phrases | ▶ 00:00 |
but not 100% of them. | ▶ 00:09 |
It's missing in this phrase. | ▶ 00:11 |
Equivalently, on the Chinese side | ▶ 00:14 |
we see this character occurs | ▶ 00:17 |
in most of the phrases, | ▶ 00:20 |
but it's missing here. | ▶ 00:23 |
So we see that the correspondence doesn't have to be 100% | ▶ 00:27 |
to tell us that there is still a good chance of a correlation. | ▶ 00:31 |
When we're learning to do machine translation | ▶ 00:34 |
we use these kinds of alignments to learn probability tables | ▶ 00:37 |
of what is the probability of one phrase in one language | ▶ 00:41 |
corresponding to the phrase in another language. | ▶ 00:45 |
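As a toy illustration of that counting idea, here is a small Python sketch; the bilingual data and the characters in it are illustrative stand-ins, not the actual menu, and a real system would estimate full probability tables over many millions of words:

```python
from collections import Counter

# Toy bilingual text: each pair is (English words, Chinese characters)
# for one menu line, loosely modeled on the example above.
bitext = [
    ({"wonton", "soup"},  {"云", "吞", "汤"}),
    ({"chicken", "soup"}, {"鸡", "汤"}),
    ({"wonton"},          {"云", "吞"}),
]

def cooccurrence(word, bitext):
    """Count how often each Chinese character appears in lines whose
    English side contains `word`; high counts suggest a correspondence."""
    counts = Counter()
    for english, chinese in bitext:
        if word in english:
            counts.update(chinese)
    return counts

# "云" and "吞" appear in every line containing "wonton",
# so they come out with the highest counts, suggesting an alignment.
print(cooccurrence("wonton", bitext).most_common())
```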
So congratulations--you just finished unit 1 of this class, | ▶ 00:00 |
where I told you about key applications | ▶ 00:07 |
of artificial intelligence, | ▶ 00:10 |
I told you about the definition of an intelligent agent, | ▶ 00:13 |
I gave you 4 key attributes of intelligent agents | ▶ 00:18 |
(partial observability, stochasticity, continuous spaces, and adversarial natures), | ▶ 00:24 |
I discussed sources and management of uncertainty, | ▶ 00:31 |
and I briefly mentioned the mathematical concept of rationality. | ▶ 00:34 |
Obviously, I only touched each of these issues superficially, | ▶ 00:40 |
but as this class goes on you're going to dive into each of those | ▶ 00:45 |
and learn much more about | ▶ 00:49 |
what it takes to make a truly intelligent AI system. | ▶ 00:51 |
Thank you. | ▶ 00:55 |
[PROBLEM SOLVING] | ▶ 00:00 |
In this unit we're going to talk about problem solving-- | ▶ 00:01 |
The theory and technology of building agents | ▶ 00:04 |
that can plan ahead to solve problems. | ▶ 00:06 |
In particular, we're talking about problem solving | ▶ 00:10 |
where the complexity of the problem comes from the idea that there are many states. | ▶ 00:13 |
As in this problem here. | ▶ 00:17 |
A navigation problem where there are many choices to start with. | ▶ 00:19 |
And the complexity comes from picking the right choice now and picking the right choice at the | ▶ 00:24 |
next intersection and the intersection after that. | ▶ 00:29 |
Stringing together a sequence of actions. | ▶ 00:32 |
This is in contrast to the type of complexity shown in this picture, | ▶ 00:35 |
where the complexity comes from the partial observability | ▶ 00:39 |
that we can't see through the fog where the possible paths are. | ▶ 00:43 |
We can't see the results of our actions | ▶ 00:46 |
and even the actions themselves are not known. | ▶ 00:48 |
This type of complexity will be covered in a later unit. | ▶ 00:51 |
Here's an example of a problem. | ▶ 00:56 |
This is a route-finding problem where we're given a start city, | ▶ 00:58 |
in this case, Arad, and a destination, Bucharest, the capital of Romania, | ▶ 01:03 |
of which this is a corner of the map. | ▶ 01:09 |
And the problem then is to find a route from Arad to Bucharest. | ▶ 01:11 |
The actions that the agent can execute are driving | ▶ 01:16 |
from one city to the next along one of the roads shown on the map. | ▶ 01:20 |
The question is, is there a solution that the agent can come up with | ▶ 01:23 |
given the knowledge shown here to the problem of driving from Arad to Bucharest? | ▶ 01:28 |
And the answer is no. | ▶ 00:00 |
There is no solution that the agent can come up with | ▶ 00:03 |
because Bucharest doesn't appear on the map, | ▶ 00:06 |
and so the agent doesn't know any actions that can arrive there. | ▶ 00:08 |
So let's give the agent a better chance. | ▶ 00:12 |
Now we've given the agent the full map of Romania. | ▶ 00:19 |
To start, he's in Arad, and the destination--or goal--is in Bucharest. | ▶ 00:23 |
And the agent is given the problem of coming up with a sequence of actions | ▶ 00:30 |
that will arrive at the destination. | ▶ 00:35 |
Now, is it possible for the agent to solve this problem? | ▶ 00:37 |
And the answer is yes. | ▶ 00:43 |
There are many routes or steps or sequences of actions that will arrive at the destination. | ▶ 00:45 |
Here is one of them: | ▶ 00:50 |
Starting out in Arad, taking this step first, then this one, then this one, | ▶ 00:53 |
then this one, and then this one to arrive at the destination. | ▶ 01:00 |
So that would count as a solution to the problem. | ▶ 01:05 |
So a solution is a sequence of actions, chained together, that is guaranteed to get us to the goal. | ▶ 01:08 |
[DEFINITION OF A PROBLEM] | ▶ 01:12 |
Now let's formally define what a problem looks like. | ▶ 01:14 |
A problem can be broken down into a number of components. | ▶ 01:17 |
First, the initial state that the agent starts out with. | ▶ 01:21 |
In our route finding problem, the initial state was the agent being in the city of Arad. | ▶ 01:25 |
Next, a function--Actions--that takes a state as input and returns | ▶ 01:32 |
a set of possible actions that the agent can execute when the agent is in this state. | ▶ 01:41 |
[ACTIONS(s) → {a1, a2, a3, ...}] | ▶ 01:47 |
In some problems, the agent will have the same actions available in all states | ▶ 01:50 |
and in other problems, he'll have different actions dependent on the state. | ▶ 01:54 |
In the route finding problem, the actions are dependent on the state. | ▶ 01:58 |
When we're in one city, we can take the routes to the neighboring cities-- | ▶ 02:02 |
but we can't go to any other cities. | ▶ 02:06 |
Next we have a function called Result, which takes, as input, a state and an action | ▶ 02:09 |
and delivers, as its output, a new state. | ▶ 02:20 |
So, for example, if the agent is in the city of Arad, and takes--that would be the state-- | ▶ 02:24 |
and takes the action of driving along Route E-671 towards Timisoara, | ▶ 02:33 |
then the result of applying that action in that state would be the new state-- | ▶ 02:40 |
where the agent is in the city of Timisoara. | ▶ 02:45 |
Next, we need a function called Goal Test, | ▶ 02:51 |
which takes a state and returns a Boolean value-- | ▶ 02:58 |
true or false--telling us if this state is a goal or not. | ▶ 03:04 |
In a route-finding problem, the only goal would be being in the destination city-- | ▶ 03:09 |
the city of Bucharest--and all the other states would return false for the Goal Test. | ▶ 03:14 |
And finally, we need one more thing which is a Path Cost function-- | ▶ 03:19 |
which takes a path, a sequence of state/action transitions, | ▶ 03:28 |
and returns a number, which is the cost of that path. | ▶ 03:40 |
Now, for most of the problems we'll deal with, we'll make the Path Cost function be additive | ▶ 03:44 |
so that the cost of the path is just the sum of the costs of the individual steps. | ▶ 03:50 |
And so we'll implement this Path Cost function, in terms of a Step Cost function. | ▶ 03:56 |
The Step Cost function takes a state, an action, and the resulting state from that action | ▶ 04:04 |
and returns a number--n--which is the cost of that action. | ▶ 04:14 |
In the route finding example, the cost might be the number of miles traveled | ▶ 04:18 |
or maybe the number of minutes it takes to get to that destination. | ▶ 04:24 |
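Putting these components together, here is a minimal sketch of the problem definition as a Python interface; the class and method names simply mirror the components above and are otherwise hypothetical:

```python
class Problem:
    """A problem definition with the components described above:
    initial state, Actions, Result, Goal Test, and Step Cost."""

    def __init__(self, initial_state):
        self.initial_state = initial_state

    def actions(self, state):
        """Return the set {a1, a2, a3, ...} of actions available in state."""
        raise NotImplementedError

    def result(self, state, action):
        """Return the new state reached by doing action in state."""
        raise NotImplementedError

    def goal_test(self, state):
        """Return True if state is a goal, False otherwise."""
        raise NotImplementedError

    def step_cost(self, state, action, result_state):
        """Return the number n, the cost of taking action in state."""
        raise NotImplementedError
```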
Now let’s see how the definition of a problem | ▶ 00:00 |
maps onto the route-finding domain. | ▶ 00:06 |
First, the initial state was given. | ▶ 00:10 |
Let’s say we start off in Arad, | ▶ 00:12 |
and the goal test, | ▶ 00:15 |
let’s say that the state of being in Bucharest | ▶ 00:17 |
is the only state that counts as a goal, | ▶ 00:22 |
and all the other states are not goals. | ▶ 00:24 |
Now the set of all of the states here | ▶ 00:26 |
is known as the state space, | ▶ 00:29 |
and we navigate the state space by applying actions. | ▶ 00:31 |
The actions are specific to each city, | ▶ 00:35 |
so when we are in Arad, there are three possible actions, | ▶ 00:39 |
to follow this road, this one, or this one. | ▶ 00:42 |
And as we follow them, we build paths | ▶ 00:46 |
or sequences of actions. | ▶ 00:49 |
So just being in Arad is the path of length zero, | ▶ 00:51 |
and now we could start exploring the space | ▶ 00:55 |
and add in this path of length one, | ▶ 00:58 |
this path of length one, | ▶ 01:01 |
and this path of length one. | ▶ 01:03 |
We could add in another path here of length two | ▶ 01:06 |
and another path here of length two. | ▶ 01:11 |
Here is another path of length two. | ▶ 01:14 |
Here is a path of length three. | ▶ 01:17 |
Another path of length two, and so on. | ▶ 01:21 |
Now at every point, | ▶ 01:26 |
we want to separate the state space out into three parts. | ▶ 01:28 |
First, the ends of the paths— | ▶ 01:34 |
The farthest paths that have been explored, | ▶ 01:37 |
we call the frontier. | ▶ 01:40 |
And so the frontier in this case | ▶ 01:42 |
consists of these states | ▶ 01:46 |
that are the farthest out we have explored. | ▶ 01:51 |
And then to the left of that in this diagram, | ▶ 01:55 |
we have the explored part of the state space. | ▶ 01:59 |
And then off to the right, | ▶ 02:02 |
we have the unexplored. | ▶ 02:04 |
So let’s write down those three components. | ▶ 02:06 |
We have the frontier. | ▶ 02:09 |
We have the unexplored region, | ▶ 02:15 |
and we have the explored region. | ▶ 02:20 |
One more thing, | ▶ 02:25 |
in this diagram we have labeled the step cost | ▶ 02:27 |
of each action along the route. | ▶ 02:30 |
So the step cost of going from Neamt to Iasi | ▶ 02:33 |
would be 87 corresponding to a distance of 87 kilometers, | ▶ 02:37 |
and the path cost is just the sum of the step costs. | ▶ 02:42 |
So the cost of the path | ▶ 02:46 |
of going from Arad to Oradea | ▶ 02:48 |
would be 71 plus 75. | ▶ 02:50 |
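In code, the additive path cost is then just a sum over step costs; a one-line sketch building on the hypothetical Problem interface above:

```python
def path_cost(problem, path):
    """Additive path cost: the sum of step costs along the path, where the
    path is a sequence of (state, action, result_state) transitions.
    For example, Arad -> Zerind -> Oradea costs 71 + 75 = 146, as above."""
    return sum(problem.step_cost(s, a, s2) for (s, a, s2) in path)
```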
[Narrator] Now let's define a function for solving problems. | ▶ 00:00 |
It's called Tree Search because it superimposes | ▶ 00:04 |
a search tree over the state space. | ▶ 00:07 |
Here's how it works: It starts off by | ▶ 00:10 |
initializing the frontier to be the path | ▶ 00:12 |
consisting of only the initial state, | ▶ 00:14 |
and then it goes into a loop | ▶ 00:16 |
in which it first checks to see | ▶ 00:18 |
do we still have anything left in the frontier? | ▶ 00:21 |
If not, we fail--there can be no solution. | ▶ 00:23 |
If we do have something, then we make a choice. | ▶ 00:25 |
Tree Search is really a family of functions | ▶ 00:28 |
not a single algorithm which | ▶ 00:31 |
depends on how we make that choice, | ▶ 00:33 |
and we'll see some of the options later. | ▶ 00:35 |
If we go ahead and make a choice of one of | ▶ 00:38 |
the paths on the frontier and remove that | ▶ 00:41 |
path from the frontier, we find the state | ▶ 00:43 |
which is at the end of the path, and if that | ▶ 00:45 |
state's a goal, then we're done. | ▶ 00:47 |
We found a path to the goal; otherwise, | ▶ 00:49 |
we do what's called expanding that path. | ▶ 00:51 |
We look at all the actions from that state, | ▶ 00:54 |
and for each action we add to the path the action | ▶ 00:57 |
and its resulting state; so we get | ▶ 01:00 |
a new path that has the old path, the action, | ▶ 01:03 |
and the result of that action, and we | ▶ 01:06 |
stick all of those paths back onto the frontier. | ▶ 01:09 |
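Read as code, the loop just described might look like the following minimal sketch (using the hypothetical Problem interface from before); the choose function is deliberately left as a parameter because, as discussed next, that one choice is what distinguishes the members of the family:

```python
def tree_search(problem, choose):
    """Generic tree search. choose picks a path and removes it from the
    frontier; that single decision defines the particular algorithm."""
    frontier = [[problem.initial_state]]  # paths, each stored as a list of states
    while frontier:                       # anything left in the frontier?
        path = choose(frontier)           # pick a path (and remove it)
        state = path[-1]                  # the state at the end of the path
        if problem.goal_test(state):
            return path                   # found a path to the goal
        for action in problem.actions(state):   # expand the path
            frontier.append(path + [problem.result(state, action)])
    return None                           # frontier empty: failure
```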
Now Tree Search represents a whole family | ▶ 01:17 |
of algorithms, and where you get the family | ▶ 01:19 |
resemblance is that they're all looking | ▶ 01:22 |
at the frontier, copying items off and | ▶ 01:24 |
checking to see if they satisfy the goal test, | ▶ 01:26 |
but where you get the difference is right here, | ▶ 01:29 |
in the choice of how you're going to expand | ▶ 01:31 |
the next item on the frontier, which | ▶ 01:34 |
path do we look at first, and we'll go through | ▶ 01:36 |
different sets of algorithms that make | ▶ 01:39 |
different choices for which path to look at first. | ▶ 01:42 |
The first algorithm I want to consider | ▶ 01:47 |
is called Breadth-First Search. | ▶ 01:49 |
Now it could be called shortest-first search | ▶ 01:51 |
because what it does is always choose | ▶ 01:54 |
from the frontier one of the paths that hasn't been | ▶ 01:56 |
considered yet that's the shortest possible. | ▶ 01:59 |
So how does it work? | ▶ 02:02 |
Well we start off with the path of | ▶ 02:04 |
length 0, starting in the start state, and | ▶ 02:06 |
that's the only path in the frontier so | ▶ 02:10 |
it's the shortest one so we pick it, | ▶ 02:13 |
and then we expand it, and we add in | ▶ 02:15 |
all the paths that result from | ▶ 02:17 |
applying all the possible actions. | ▶ 02:20 |
So now we've removed | ▶ 02:22 |
this path from the frontier, | ▶ 02:25 |
but we've added in 3 new paths. | ▶ 02:28 |
This one, | ▶ 02:31 |
this one, and this one. | ▶ 02:33 |
Now we're in a position where | ▶ 02:37 |
we have 3 paths on the frontier, and | ▶ 02:39 |
we have to pick the shortest one. | ▶ 02:42 |
Now in this case all 3 paths | ▶ 02:45 |
have the same length, length 1, so we | ▶ 02:47 |
break the tie at random or using some | ▶ 02:50 |
other technique, and let's suppose that | ▶ 02:52 |
in this case we choose this path | ▶ 02:56 |
from Arad to Sibiu. | ▶ 02:58 |
Now the question I want you to answer | ▶ 03:00 |
is once we remove that from the frontier, | ▶ 03:03 |
what paths are we going to add next? | ▶ 03:09 |
So show me by checking off the cities | ▶ 03:11 |
that end the paths, which paths | ▶ 03:14 |
are going to be added to the frontier? | ▶ 03:16 |
[Male narrator] The answer is that in Sibiu, the action function gives us 4 actions | ▶ 00:00 |
corresponding to traveling along these 4 roads, | ▶ 00:06 |
so we have to add in paths for each of those actions. | ▶ 00:09 |
One of those paths goes here, | ▶ 00:15 |
the other path continues from Arad and goes out here. | ▶ 00:17 |
The third path continues out here | ▶ 00:21 |
and then the fourth path goes from here--from Arad to Sibiu | ▶ 00:25 |
and then backtracks back to Arad. | ▶ 00:31 |
Now, it may seem silly and redundant to have a path that starts in Arad, | ▶ 00:36 |
goes to Sibiu and returns to Arad. | ▶ 00:41 |
How can that help us get to our destination in Bucharest? | ▶ 00:44 |
But we can see if we're dealing with a tree search, | ▶ 00:49 |
why it's natural to have this type of formulation | ▶ 00:52 |
and why the tree search doesn't even notice that it's backtracked. | ▶ 00:56 |
What the tree search does is superimpose on top of the state space | ▶ 01:00 |
a tree of searches, and the tree looks like this. | ▶ 01:05 |
We start off in state A, and in state A, there were 3 actions, | ▶ 01:09 |
so that gave us paths going to Z, S, and T. | ▶ 01:15 |
And from S, there were 4 actions, so that gave us paths going to O, F, R, and A, | ▶ 01:21 |
and then the tree would continue on from here. | ▶ 01:34 |
We'd take one of the next items | ▶ 01:37 |
and we'd expand it and continue on, but notice that we returned to the A state | ▶ 01:40 |
in the state space, but in the tree, | ▶ 01:48 |
it's just another item in the tree. | ▶ 01:51 |
Now, here's another representation of the search space | ▶ 01:55 |
and what's happening is as we start to explore the state space, | ▶ 01:57 |
we keep track of the frontier, which is the set of states that are at the end of the paths | ▶ 02:01 |
that we haven't explored yet, and behind that frontier | ▶ 02:09 |
is the set of explored states, and ahead of the frontier is the unexplored states. | ▶ 02:13 |
Now the reason we keep track of the explored states | ▶ 02:19 |
is that when we want to expand and we find a duplicate-- | ▶ 02:22 |
so say when we expand from here, if we pointed back to state T, | ▶ 02:27 |
if we hadn't kept track of that, we would have to add in a new state for T down here. | ▶ 02:33 |
But because we've already seen it and we know that this is actually a regressive step | ▶ 02:42 |
into the already explored state, now, because we kept track of that, | ▶ 02:47 |
we don't need it anymore. | ▶ 02:51 |
Now we see how to modify the Tree Search Function | ▶ 00:00 |
to make it be a Graph Search Function | ▶ 00:04 |
to avoid those repeated paths. | ▶ 00:06 |
What we do, is we start off and initialize a set | ▶ 00:09 |
called the explored set of states that we have already explored. | ▶ 00:13 |
Then, when we consider a new path, | ▶ 00:16 |
we add the new state to the set of already explored states, | ▶ 00:19 |
and then when we are expanding the path | ▶ 00:23 |
and adding in new states to the end of it, | ▶ 00:26 |
we don’t add that in if we have already seen that new state | ▶ 00:29 |
in either the frontier or the explored. | ▶ 00:33 |
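The same sketch with the graph-search modification might look like this; as before, choose stands in for the frontier discipline:

```python
def graph_search(problem, choose):
    """Tree search plus an explored set, so repeated states are skipped."""
    frontier = [[problem.initial_state]]
    explored = set()                      # states we have already explored
    while frontier:
        path = choose(frontier)           # pick a path (and remove it)
        state = path[-1]
        if problem.goal_test(state):
            return path
        explored.add(state)               # add the state to the explored set
        for action in problem.actions(state):
            new_state = problem.result(state, action)
            # Don't add a state already seen in the frontier or explored set.
            if new_state in explored or any(p[-1] == new_state for p in frontier):
                continue
            frontier.append(path + [new_state])
    return None
```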
Now back to Breadth-First Search. | ▶ 00:37 |
Let’s assume we are using the Graph Search | ▶ 00:39 |
so that we have eliminated the duplicate paths. | ▶ 00:41 |
Arad is crossed off the list. | ▶ 00:44 |
The path that goes from Arad to Sibiu | ▶ 00:47 |
and back to Arad is removed, | ▶ 00:49 |
and we are left with these one, two, three, | ▶ 00:51 |
four, five possible paths. | ▶ 00:53 |
Given these 5 paths, | ▶ 00:57 |
show me which ones are candidates to be expanded next | ▶ 00:59 |
by the Breadth-First Search algorithm. | ▶ 01:02 |
[Male narrator] And the answer is that Breadth-First Search always considers | ▶ 00:00 |
the shortest paths first, and in this case, there are 2 paths of length 1-- | ▶ 00:03 |
the paths from Arad to Zerind and from Arad to Timisoara-- | ▶ 00:08 |
so those would be the 2 paths that would be considered. | ▶ 00:12 |
Now, let's suppose that the tie is broken in some way | ▶ 00:15 |
and we chose this path from Arad to Zerind. | ▶ 00:18 |
Now, we want to expand that node. | ▶ 00:22 |
We remove it from the frontier and put it in the explored list | ▶ 00:25 |
and now we say, "What paths are we going to add?" | ▶ 00:31 |
So check off the ends of the paths the cities that we're going to add. | ▶ 00:35 |
[Male narrator] In this case, there's nothing to add | ▶ 00:00 |
because of the 2 neighbors, 1 is in the explored list and 1 is in the frontier, | ▶ 00:03 |
and if we're using graph search, then we won't add either of those. | ▶ 00:09 |
[Male narrator] So we move on, we look for another shortest path. | ▶ 00:00 |
There's one path left of length 1, so we look at that path, we expand it, | ▶ 00:04 |
add in this path, put that one on the explored list, | ▶ 00:11 |
and now we've got 3 paths of length 2. | ▶ 00:16 |
We choose 1 of them, and let's say we choose this one. | ▶ 00:20 |
Now, my question is show me which states we add to the path | ▶ 00:23 |
and tell me whether we're going to terminate the algorithm at this point | ▶ 00:30 |
because we've reached the goal or whether we're going to continue. | ▶ 00:35 |
[Male narrator] The answer is that we add 1 more path, the path to Bucharest. | ▶ 00:00 |
We don't add the path going back because it's in the explored list, | ▶ 00:08 |
but we don't terminate it yet. | ▶ 00:11 |
True, we have added a path that ends in Bucharest, | ▶ 00:13 |
but the goal test isn't applied when we add a path to the frontier. | ▶ 00:16 |
Rather, it's applied when we remove that path from the frontier, | ▶ 00:22 |
and we haven't done that yet. | ▶ 00:26 |
[Male narrator] Now, why doesn't the general tree search or graph search algorithm stop | ▶ 00:00 |
when it adds a goal node to the frontier? | ▶ 00:06 |
The reason is because it might not be the best path to the goal. | ▶ 00:09 |
Now, here we found a path of length 2 | ▶ 00:13 |
and we added a path of length 3 that reached the goal. | ▶ 00:16 |
The general graph search or tree search doesn't know | ▶ 00:21 |
that there might be some other path that we could expand | ▶ 00:24 |
that would have a distance of say, 2-1/2, | ▶ 00:27 |
but there's an optimization that could be made. | ▶ 00:30 |
If we know we're doing Breadth-First Search | ▶ 00:33 |
and we know there's no possibility of a path of length 2-1/2, | ▶ 00:35 |
then we can change the algorithm so that it checks states | ▶ 00:40 |
as soon as they're added to the frontier | ▶ 00:44 |
rather than waiting until they're expanded | ▶ 00:46 |
and in that case, we can write a specific Breadth-First Search routine | ▶ 00:49 |
that terminates early and gives us a result as soon as we add a goal state to the frontier. | ▶ 00:53 |
Breadth-First Search will find this path | ▶ 01:01 |
that ends up in Bucharest, and if we're looking for the shortest path | ▶ 01:04 |
in terms of number of steps, | ▶ 01:08 |
Breadth-First Search is guaranteed to find it, | ▶ 01:10 |
But if we're looking for the shortest path in terms of total cost | ▶ 01:12 |
by adding up the step costs, then it turns out | ▶ 01:17 |
that this path is shorter than the path found by Breadth-First Search. | ▶ 01:21 |
So let's look at how we could find that path. | ▶ 01:26 |
An algorithm that has traditionally been called uniform-cost search | ▶ 00:00 |
but could be called cheapest-first search, | ▶ 00:05 |
is guaranteed to find the path with the cheapest total cost. | ▶ 00:08 |
Let's see how it works. | ▶ 00:11 |
We start out as before in the start state. | ▶ 00:14 |
And we pop that initial path off. | ▶ 00:19 |
Move it from the frontier to explored, | ▶ 00:24 |
and then add in the paths out of that state. | ▶ 00:28 |
As before, there will be 3 of those paths. | ▶ 00:33 |
And now, which path are we going to pick next | ▶ 00:39 |
in order to expand according to the rules of cheapest first? | ▶ 00:43 |
Cheapest first says that we pick the path with | ▶ 00:00 |
the lowest total cost. | ▶ 00:04 |
And that would be this path. | ▶ 00:06 |
It has a cost of 75 compared to the cost of 118 and 140 | ▶ 00:07 |
for the other paths. | ▶ 00:13 |
So we get here. We take that path off the frontier, | ▶ 00:14 |
put it on the explored list, add in its neighbors. | ▶ 00:19 |
Not going back to Arad, | ▶ 00:23 |
but adding in this new path. | ▶ 00:26 |
Summing up the total cost of that path, | ▶ 00:30 |
71 + 75 is 146 for this path. | ▶ 00:33 |
And now the question is, | ▶ 00:40 |
which path gets expanded next? | ▶ 00:41 |
Of the 3 paths on the frontier, we have ones | ▶ 00:00 |
with a cost of 146, 140, and 118. | ▶ 00:05 |
And that's the cheapest, so this one gets expanded. | ▶ 00:10 |
We take it off the frontier, move it to explored, | ▶ 00:13 |
add in its successors. In this case it's only 1. | ▶ 00:16 |
And that has a path total of 229. | ▶ 00:21 |
Which path do we expand next? | ▶ 00:29 |
Well, we've got 146, 140, and 229, | ▶ 00:30 |
so 140 is the lowest. | ▶ 00:33 |
Take it off the frontier. Put it on explored. | ▶ 00:38 |
Add in this path | ▶ 00:41 |
for a total cost of 220. | ▶ 00:44 |
And this path for a total cost of 239. | ▶ 00:48 |
And now the question is, which path do we expand next? | ▶ 00:53 |
The answer is this one, 146. | ▶ 00:00 |
Put it on explored. | ▶ 00:04 |
But there's nothing to add because | ▶ 00:07 |
both of its neighbors have already been explored. | ▶ 00:12 |
Which path do we look at next? | ▶ 00:13 |
The answer is this one. Two-twenty is less than 229 or 239. | ▶ 00:00 |
Take it off the frontier. Put it on explored. | ▶ 00:05 |
Add in 2 more paths and sum them up. | ▶ 00:09 |
So, 220 plus 146 is 366. | ▶ 00:15 |
And 220 plus 97 is 317. | ▶ 00:21 |
Okay, and now, notice that we're closing in on Bucharest. | ▶ 00:29 |
We've got 2 neighbors almost there, but it's not their turn yet. | ▶ 00:32 |
Instead, the cheapest path is this one over here, | ▶ 00:38 |
so move it to the explored list. | ▶ 00:43 |
Add 70 to the path cost so far, | ▶ 00:45 |
and we get 299. | ▶ 00:50 |
Now the cheapest node is 239 here, | ▶ 00:57 |
so we expand, finally, into Bucharest at a cost of 460. | ▶ 01:01 |
And now the question is are we done? Can we terminate the algorithm? | ▶ 01:09 |
[Male] And the answer is no, we're not done yet. | ▶ 00:00 |
We've put Bucharest, the goal state, onto the frontier, | ▶ 00:03 |
but we haven't popped it off the frontier yet. | ▶ 00:07 |
And the reason is because we've got to look around and see if there's a better path | ▶ 00:09 |
that can reach Bucharest. | ▶ 00:13 |
And so, let's continue. | ▶ 00:15 |
Look at everything on the frontier. | ▶ 00:18 |
Here's the cheapest one over here. | ▶ 00:20 |
Expand that. | ▶ 00:23 |
Now, what's the cheapest next one? | ▶ 00:26 |
Well, over here. | ▶ 00:30 |
Oops, forgot to take this one off the list. | ▶ 00:33 |
So now, 317 plus 101 gives us another path into Bucharest, | ▶ 00:36 |
and this is a better path. | ▶ 00:44 |
This one is 418, which gives us a better route in. | ▶ 00:46 |
But we have to keep going. | ▶ 00:54 |
The best path on the frontier is 366, | ▶ 00:59 |
so pop that off, and that would give us 2 more routes into here, | ▶ 01:06 |
and eventually we pop off all of these. | ▶ 01:14 |
And then we get to the point where 418 was the best path on the frontier. | ▶ 01:18 |
We pop that off, and then we recognize that we'd reach the goal, | ▶ 01:24 |
and the reason that uniform cost finds the optimal path, the cheapest cost, | ▶ 01:29 |
is because it's guaranteed that it will first pop off this cheapest path, | ▶ 01:35 |
the 418, before it gets to the more expensive path, like the 460. | ▶ 01:40 |
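A compact sketch of uniform-cost search, using a priority queue so the cheapest path is always popped first; note that the goal test happens on removal from the frontier, not on insertion, exactly as in the walkthrough above:

```python
import heapq
import itertools

def uniform_cost_search(problem):
    """Cheapest-first search: always expand the frontier path with the
    lowest total path cost, and test for the goal only on removal."""
    counter = itertools.count()           # tie-breaker so paths never get compared
    frontier = [(0, next(counter), [problem.initial_state])]
    explored = set()
    while frontier:
        cost, _, path = heapq.heappop(frontier)   # cheapest total cost first
        state = path[-1]
        if problem.goal_test(state):      # goal test on removal from frontier
            return path
        if state in explored:
            continue
        explored.add(state)
        for action in problem.actions(state):
            new_state = problem.result(state, action)
            if new_state not in explored:
                step = problem.step_cost(state, action, new_state)
                heapq.heappush(frontier, (cost + step, next(counter),
                                          path + [new_state]))
    return None
```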
So, we've looked at 2 search algorithms. | ▶ 00:00 |
One, breadth-first search, in which we always expand first | ▶ 00:03 |
the shallowest paths, the shortest paths. | ▶ 00:08 |
Second, cheapest-first search, in which we always expand first the path | ▶ 00:12 |
with the lowest total cost. | ▶ 00:17 |
And I'm going to take this opportunity to introduce a third algorithm, depth-first search, | ▶ 00:20 |
which is in a way the opposite of breadth-first search. | ▶ 00:25 |
In depth-first search, we always expand first the longest path, | ▶ 00:28 |
the path with the most steps in it. | ▶ 00:33 |
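One way to see the relationship among the three algorithms: they can share the generic search skeleton sketched earlier and differ only in which path is removed from the frontier first. A sketch of the frontier disciplines:

```python
# All three searches can share the generic skeleton sketched earlier;
# only the frontier discipline (which path choose removes first) differs.

def choose_breadth_first(frontier):
    """Breadth-first: FIFO, so the shallowest (oldest) path is expanded first."""
    return frontier.pop(0)

def choose_depth_first(frontier):
    """Depth-first: LIFO, so the deepest (newest) path is expanded first."""
    return frontier.pop()

# Cheapest-first instead removes the path with the lowest total path cost,
# as in the uniform-cost sketch above.
```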
Now, what I want to ask you to do is for each of these nodes in each of the trees, | ▶ 00:36 |
tell us in what order they're expanded, | ▶ 00:42 |
first, second, third, fourth, fifth and so on by putting a number into the box. | ▶ 00:44 |
And if there are ties, put that number in and resolve the ties in left to right order. | ▶ 00:49 |
Then I want you to answer one more question, | ▶ 00:58 |
which is: are these searches optimal? | ▶ 01:03 |
That is, are they guaranteed to find the best solution? | ▶ 01:06 |
And for breadth-first search, optimal would mean finding the shortest path. | ▶ 01:11 |
If you think it's guaranteed to find the shortest path, check here. | ▶ 01:16 |
For cheapest first, it would mean finding the path with the lowest total path cost. | ▶ 01:21 |
Check here if you think it's guaranteed to do that. | ▶ 01:26 |
And we'll allow the assumption that all costs have to be positive. | ▶ 01:30 |
And in depth first, cheapest or optimal would mean, again, | ▶ 01:34 |
as in breadth first, finding the shortest possible path in terms of number of steps. | ▶ 01:41 |
Check here if you think depth first will always find that. | ▶ 01:46 |
Here are the answers. | ▶ 00:00 |
Breadth-first search, as the name implies, expands nodes in this order. | ▶ 00:04 |
1, 2, 3, 4, 5, 6, 7. | ▶ 00:10 |
So, it's going across a stripe at a time, breadth first. | ▶ 00:17 |
Is it optimal? | ▶ 00:23 |
Well, it's always expanding the shortest paths first, | ▶ 00:25 |
and so wherever the goal is hiding, it's going to find it without examining | ▶ 00:28 |
any longer paths, so in fact, it is optimal. | ▶ 00:34 |
Cheapest first, first we expand the path of length zero, | ▶ 00:38 |
then the path of length 2. | ▶ 00:45 |
Now there's a path of length 4, path of length 5, | ▶ 00:47 |
path of length 6, a path of length 7, and finally, a path of length 8. | ▶ 00:53 |
And as we've seen, it's guaranteed to find the cheapest path of all, | ▶ 01:02 |
assuming that all the individual step costs are not negative. | ▶ 01:08 |
Depth-first search tries to go as deep as it can first, | ▶ 01:14 |
so it goes 1, 2, 3, then backs up, 4, | ▶ 01:17 |
then backs up, 5, 6, 7. | ▶ 01:24 |
And you can see that it doesn't necessarily find the shortest path of all. | ▶ 01:29 |
Let's say that there were goals in position 5 and in position 3. | ▶ 01:34 |
It would find the longer path to position 3 and find the goal there | ▶ 01:39 |
and would not find the goal in position 5. | ▶ 01:43 |
So, it is not optimal. | ▶ 01:46 |
Given the non-optimality of depth-first search, | ▶ 00:00 |
why would anybody choose to use it? | ▶ 00:04 |
Well, the answer has to do with the storage requirements. | ▶ 00:07 |
Here I've illustrated a state space | ▶ 00:10 |
consisting of a very large or even infinite binary tree. | ▶ 00:13 |
As we go to levels 1, 2, 3, down to level n, | ▶ 00:18 |
the tree gets larger and larger. | ▶ 00:22 |
Now, let's consider the frontier for each of these search algorithms. | ▶ 00:24 |
For breadth-first search, we know a frontier looks like that, | ▶ 00:29 |
and so when we get down to level n, we'll require storage space for | ▶ 00:35 |
2 to the n paths in a breadth-first search. | ▶ 00:40 |
For cheapest first, the frontier is going to be more complicated. | ▶ 00:45 |
It's going to sort of work out this contour of cost, | ▶ 00:49 |
but it's going to have a similar total number of nodes. | ▶ 00:53 |
But for depth-first search, as we go down the tree, we start going down this branch, | ▶ 00:57 |
and then we back up, but at any point, our frontier is only going to have n nodes | ▶ 01:03 |
rather than 2 to the n nodes, so that's a substantial savings for depth-first search. | ▶ 01:08 |
Now, of course, if we're also keeping track of the explored set, | ▶ 01:14 |
then we don't get that much savings. | ▶ 01:19 |
But without the explored set, depth-first search has a huge advantage | ▶ 01:21 |
in terms of space saved. | ▶ 01:25 |
One more property of the algorithms to consider | ▶ 01:27 |
is the property of completeness, meaning if there is a goal somewhere, | ▶ 01:30 |
will the algorithm find it? | ▶ 01:35 |
So, let's move from very large trees to infinite trees, | ▶ 01:37 |
and let's say that there's some goal hidden somewhere deep down in that tree. | ▶ 01:41 |
And the question is, are each of these algorithms complete? | ▶ 01:47 |
That is, are they guaranteed to find a path to the goal? | ▶ 01:51 |
Mark off the check boxes for the algorithms that you believe are complete in this sense. | ▶ 01:55 |
The answer is that breadth-first search is complete, | ▶ 00:00 |
so even if the tree is infinite, if the goal is placed at any finite level, | ▶ 00:04 |
eventually, we're going to march down and find that goal. | ▶ 00:10 |
Same with cheapest first. | ▶ 00:16 |
No matter where the goal is, if it has a finite cost, | ▶ 00:18 |
eventually, we're going to go down and find it. | ▶ 00:21 |
But not so for depth-first search. | ▶ 00:25 |
If there's an infinite path, depth-first search will keep following that, | ▶ 00:28 |
so it will keep going down and down and down along this path | ▶ 00:33 |
and never get to the path on which the goal sits. | ▶ 00:37 |
So, depth-first search is not complete. | ▶ 00:46 |
Let's try to understand a little better how uniform cost search works. | ▶ 00:00 |
We start at a start state, | ▶ 00:05 |
and then we start expanding out from there looking at different paths, | ▶ 00:08 |
and what we end up doing is expanding in terms of contours, like on a topographic map, | ▶ 00:13 |
where first we span out to a certain distance, then to a farther distance, | ▶ 00:21 |
and then to a farther distance. | ▶ 00:28 |
Now at some point we meet up with a goal. Let's say the goal is here. | ▶ 00:31 |
Now we found a path from the start to the goal. | ▶ 00:35 |
But notice that the search really wasn't directed in any way towards the goal. | ▶ 00:42 |
It was expanding out everywhere in the space and depending on where the goal is, | ▶ 00:46 |
we should expect to have to explore half the space, on average, before we find the goal. | ▶ 00:52 |
If the space is small, that can be fine, | ▶ 00:57 |
but when spaces are large, that won't get us to the goal fast enough. | ▶ 01:00 |
Unfortunately, there is really nothing we can do, with what we know, to do better than that, | ▶ 01:05 |
and so if we want to improve, if we want to be able to find the goal faster, | ▶ 01:10 |
we're going to have to add more knowledge. | ▶ 01:15 |
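To make the uniform cost procedure concrete, here is a minimal Python sketch, assuming the state space is given as a dictionary mapping each state to a dictionary of neighbors and step costs. The names (`graph`, `uniform_cost_search`) and the representation are illustrative assumptions, not code from the class.

```python
import heapq

def uniform_cost_search(graph, start, goal):
    """Expand the cheapest path first. graph[state] maps neighbors to step costs."""
    frontier = [(0, start, [start])]          # priority queue ordered by path cost g
    explored = set()
    while frontier:
        g, state, path = heapq.heappop(frontier)
        if state == goal:                     # goal test when a path leaves the frontier
            return path, g
        if state in explored:
            continue
        explored.add(state)
        for neighbor, cost in graph[state].items():
            if neighbor not in explored:
                heapq.heappush(frontier, (g + cost, neighbor, path + [neighbor]))
    return None                               # no path: the goal is unreachable
```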
The type of knowledge that has proven most useful in search is an estimate of the distance | ▶ 01:21 |
from a state to the goal. | ▶ 01:27 |
So let's say we're dealing with a route-finding problem, | ▶ 01:32 |
and we can move in any direction--up or down, right or left-- | ▶ 01:36 |
and we'll take as our estimate the straight-line distance between a state and the goal, | ▶ 01:43 |
and we'll try to use that estimate to find our way to the goal fastest. | ▶ 01:50 |
Now an algorithm called greedy best-first search does exactly that. | ▶ 01:55 |
It expands first the path that's closest to the goal according to the estimate. | ▶ 02:04 |
So what do the contours look like in this approach? | ▶ 02:09 |
Well, we start here, and then we look at all the neighboring states, | ▶ 02:13 |
and the ones that appear to be closest to the goal we would expand first. | ▶ 02:17 |
So we'd start expanding like this and like this and like this and like this | ▶ 02:21 |
and that would lead us directly to the goal. | ▶ 02:30 |
So now instead of exploring whole circles that go out everywhere in the search space, | ▶ 02:33 |
our search is directed towards the goal. | ▶ 02:38 |
In this case it gets us immediately towards the goal, but that won't always be the case | ▶ 02:41 |
if there are obstacles along the way. | ▶ 02:46 |
Consider this search space. We have a start state and a goal, | ▶ 02:50 |
and there's an impassable barrier. | ▶ 02:54 |
Now greedy best-first search will start expanding out as before, | ▶ 02:57 |
trying to get towards the goal, | ▶ 03:02 |
and when it reaches the barrier, what will it do next? | ▶ 03:08 |
Well, it will try to continue along a path that's getting closer and closer to the goal. | ▶ 03:11 |
So it won't consider going back this way which is farther from the goal. | ▶ 03:15 |
Rather it will continue expanding out along these lines | ▶ 03:20 |
which always get closer and closer to the goal, | ▶ 03:24 |
and eventually it will find its way towards the goal. | ▶ 03:28 |
So it does find a path, and it does it by expanding a small number of nodes, | ▶ 03:31 |
but it's willing to accept a path which is longer than other paths. | ▶ 03:36 |
Now if we explored in the other direction, we could have found a much simpler path, | ▶ 03:42 |
a much shorter path, by just popping over the barrier, and then going directly to the goal. | ▶ 03:47 |
But greedy best-first search wouldn't have done that, because | ▶ 03:54 |
that would have involved getting to this point, which is this distance to the goal, | ▶ 03:56 |
and then considering states which were farther from the goal. | ▶ 04:01 |
What we would really like is an algorithm that combines the best parts | ▶ 04:08 |
of greedy search which explores a small number of nodes in many cases | ▶ 04:11 |
and uniform cost search which is guaranteed to find a shortest path. | ▶ 04:17 |
We'll show how to do that next using an algorithm called the A-star algorithm. | ▶ 04:22 |
[Male narrator] A* Search works by always expanding the path | ▶ 00:00 |
that has a minimum value of the function f | ▶ 00:03 |
which is defined as the sum f = g + h. | ▶ 00:07 |
Now, the function g of a path | ▶ 00:12 |
is just the path cost, | ▶ 00:16 |
and the function h of a path | ▶ 00:19 |
is equal to the h value of the state, | ▶ 00:23 |
which is the final state of the path, | ▶ 00:27 |
which is equal to the estimated distance to the goal. | ▶ 00:30 |
Here's an example of how A* works. | ▶ 00:36 |
Suppose we found this path through the state space to a state x, | ▶ 00:39 |
and we're trying to give a measure to the value of this path. | ▶ 00:44 |
The measure f is a sum of g, the path cost so far, | ▶ 00:48 |
and h, which is the estimated distance that remains | ▶ 00:55 |
to complete the path to the goal. | ▶ 01:02 |
Now, minimizing g helps us keep the path short | ▶ 01:04 |
and minimizing h helps us keep focused on finding the goal | ▶ 01:08 |
and the result is a search strategy that is the best possible | ▶ 01:13 |
in the sense that it finds the shortest length path | ▶ 01:17 |
while expanding the minimum number of paths possible. | ▶ 01:20 |
It could be called "best estimated total path cost first," | ▶ 01:24 |
but the name A* is traditional. | ▶ 01:28 |
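As a minimal sketch, under the same assumptions about the graph representation as the uniform-cost sketch above, A* is the same loop with the priority changed from g to f = g + h; ordering by h alone instead would give greedy best-first search, and by g alone, uniform cost search.

```python
import heapq

def astar(graph, h, start, goal):
    """Expand the path with minimum f = g + h. h maps a state to its estimate."""
    frontier = [(h(start), 0, start, [start])]     # entries are (f, g, state, path)
    explored = set()
    while frontier:
        f, g, state, path = heapq.heappop(frontier)
        if state == goal:                          # goal test on removal, not insertion
            return path, g
        if state in explored:
            continue
        explored.add(state)
        for neighbor, cost in graph[state].items():
            if neighbor not in explored:
                g2 = g + cost
                heapq.heappush(frontier, (g2 + h(neighbor), g2, neighbor, path + [neighbor]))
    return None
```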
Now let's go back to Romania and apply the A* algorithm | ▶ 01:32 |
and we're going to use a heuristic, which is a straight line distance | ▶ 01:36 |
between a state and the goal. | ▶ 01:40 |
The goal, again, is Bucharest, | ▶ 01:42 |
and so the distance from Bucharest to Bucharest is, of course, 0. | ▶ 01:44 |
And for all the other states, I've written in red | ▶ 01:47 |
the straight line distance. | ▶ 01:51 |
For example, straight across like that. | ▶ 01:53 |
Now, I should say that all the roads here I've drawn as straight lines, | ▶ 01:55 |
but actually, roads are going to be curved to some degree, | ▶ 01:59 |
so the actual distance along the roads is going to be longer | ▶ 02:03 |
than the straight line distance. | ▶ 02:06 |
Now, we start out as usual--we'll start in Arad as a start state-- | ▶ 02:09 |
and we'll expand out Arad and so we'll add 3 paths | ▶ 02:13 |
and the evaluation function, f, will be the sum of the path length, | ▶ 02:21 |
which is given in black, and the estimated distance, | ▶ 02:26 |
which is given in red. | ▶ 02:29 |
And so the path length from this path | ▶ 02:32 |
will be 140+253 or 393; | ▶ 02:37 |
for this path, 75+374, or 449; | ▶ 02:45 |
and for this path, 118+329, or 447. | ▶ 02:55 |
And now, the question is out of all the paths that are on the frontier, | ▶ 03:05 |
which path would we expand next under the A* algorithm? | ▶ 03:09 |
The answer is that we select this path first--the one from Arad to Sibiu-- | ▶ 00:00 |
because it has the smallest value--393--of the sum f=g+h. | ▶ 00:05 |
Let's go ahead and expand this node now. | ▶ 00:00 |
So we're going to add 3 paths. | ▶ 00:03 |
This one has a path cost of 291 | ▶ 00:06 |
and an estimated distance to the goal of 380, | ▶ 00:10 |
for a total of 671. | ▶ 00:14 |
This one has a path cost of 239 | ▶ 00:18 |
and an estimated distance of 176, for a total of 415. | ▶ 00:21 |
And the final one is 220+193=413. | ▶ 00:27 |
And now the question is which state do we expand next? | ▶ 00:33 |
The answer is we expand this path next | ▶ 00:00 |
because its total, 413, | ▶ 00:03 |
is less than all the other ones on the frontier-- | ▶ 00:06 |
although only slightly less than the 415 for this path. | ▶ 00:09 |
So we expand this node, | ▶ 00:00 |
giving us 2 more paths-- | ▶ 00:03 |
this one with an f-value of 417, | ▶ 00:06 |
and this one with an f-value of 526. | ▶ 00:10 |
The question again--which path are we going to expand next? | ▶ 00:16 |
And the answer is that we expand this path, Fagaras, next, | ▶ 00:00 |
because its f-total, 415, | ▶ 00:05 |
is less than all the other paths on the frontier. | ▶ 00:08 |
Now we expand Fagaras | ▶ 00:01 |
and we get a path that reaches the goal | ▶ 00:04 |
and it has a path length of 450 and an estimated distance of 0 | ▶ 00:07 |
for a total f value of 450, | ▶ 00:11 |
and now the question is: What do we do next? | ▶ 00:14 |
Click here if you think we're at the end of the algorithm | ▶ 00:17 |
and we don't need to expand next | ▶ 00:22 |
or click on the node that you think we will expand next. | ▶ 00:24 |
The answer is that we're not done yet, | ▶ 00:00 |
because the algorithm works by doing the goal test, | ▶ 00:03 |
when we take a path off the frontier, | ▶ 00:06 |
not when we put a path on the frontier. | ▶ 00:08 |
Instead, we just continue in the normal way and choose the node | ▶ 00:11 |
on the frontier which has the lowest value. | ▶ 00:15 |
That would be this one--the path through Pitesti, with a total of 417. | ▶ 00:18 |
So let's expand the node at Pitesti. | ▶ 00:01 |
We have to go down this direction, up, | ▶ 00:04 |
then we reach a path we've seen before, | ▶ 00:08 |
and we go in this direction. | ▶ 00:11 |
Now we reach Bucharest, which is the goal, | ▶ 00:13 |
and the h value is going to be 0 | ▶ 00:16 |
because we're at the goal, and the g value works out to 418. | ▶ 00:19 |
Again, we don't stop here just because we put a path onto the frontier; | ▶ 00:24 |
when we put it there, we don't apply the goal test yet. | ▶ 00:31 |
Instead, we now go back to the frontier, | ▶ 00:35 |
and it turns out that this 418 is the lowest-cost path on the frontier. | ▶ 00:38 |
So now we pull it off, do the goal test, | ▶ 00:43 |
and now we found our path to the goal, | ▶ 00:45 |
and it is, in fact, the shortest possible path. | ▶ 00:49 |
In this case, A-star was able to find the lowest-cost path. | ▶ 00:55 |
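As a sanity check on the arithmetic in this walkthrough, the fragment of the Romania map used here can be fed to the `astar` sketch from earlier. The road distances (black numbers) and straight-line estimates (red numbers) below are the values read off in the lecture; the diacritic-free city names are an assumption of this sketch.

```python
# Road distances (black numbers) for the fragment explored above.
romania = {
    "Arad":          {"Zerind": 75, "Sibiu": 140, "Timisoara": 118},
    "Sibiu":         {"Arad": 140, "Oradea": 151, "Fagaras": 99, "RimnicuVilcea": 80},
    "RimnicuVilcea": {"Sibiu": 80, "Pitesti": 97, "Craiova": 146},
    "Fagaras":       {"Sibiu": 99, "Bucharest": 211},
    "Pitesti":       {"RimnicuVilcea": 97, "Craiova": 138, "Bucharest": 101},
    "Zerind": {}, "Timisoara": {}, "Oradea": {}, "Craiova": {}, "Bucharest": {},
}
# Straight-line distances to Bucharest (red numbers).
h_sld = {"Arad": 366, "Sibiu": 253, "Zerind": 374, "Timisoara": 329, "Oradea": 380,
         "Fagaras": 176, "RimnicuVilcea": 193, "Pitesti": 100, "Craiova": 160,
         "Bucharest": 0}

print(astar(romania, lambda s: h_sld[s], "Arad", "Bucharest"))
# (['Arad', 'Sibiu', 'RimnicuVilcea', 'Pitesti', 'Bucharest'], 418)
```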
Now the question that you'll have to think about, | ▶ 00:59 |
because we haven't explained it yet, | ▶ 01:02 |
is whether A-star will always do this. | ▶ 01:04 |
Answer yes if you think A-star will always find the shortest cost path, | ▶ 01:06 |
or answer no if you think it depends on the particular problem given, | ▶ 01:12 |
or answer no if you think it depends on the particular heuristic estimate function, h. | ▶ 01:17 |
The answer is that it depends on the h function. | ▶ 00:02 |
A-star will find the lowest-cost path | ▶ 00:06 |
if the h function for a state is less than or equal to the true cost | ▶ 00:09 |
of the path to the goal through that state. | ▶ 00:16 |
In other words, we want the h to never overestimate the distance to the goal. | ▶ 00:20 |
We also say that h is optimistic. | ▶ 00:26 |
Another way of stating that | ▶ 00:31 |
is that h is admissible, | ▶ 00:34 |
meaning it is admissible to use it to find the lowest-cost path. | ▶ 00:37 |
Think of all of these as being the same way | ▶ 00:41 |
of stating the conditions under which A-star finds the lowest-cost path. | ▶ 00:45 |
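In symbols, writing h*(s) for the true cost of the cheapest path from state s to the goal, all three conditions say the same thing:

$$h(s) \le h^*(s) \quad \text{for every state } s.$$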
Here we give you an intuition as to why | ▶ 00:01 |
an optimistic heuristic function, h, finds the lowest-cost path. | ▶ 00:03 |
When A-star ends, it returns a path, p, with estimated cost, c. | ▶ 00:08 |
It turns out that c is also the actual cost, | ▶ 00:15 |
because at the goal the h component is 0, | ▶ 00:20 |
and so the path cost is the total cost as estimated by the function. | ▶ 00:23 |
Now, all the paths on the frontier | ▶ 00:28 |
have an estimated cost that's greater than c, | ▶ 00:31 |
and we know that because the frontier is explored in cheapest-first order. | ▶ 00:35 |
If h is optimistic, then the estimated cost | ▶ 00:40 |
is less than the true cost, | ▶ 00:44 |
so the path p must have a cost that's less than the true cost | ▶ 00:47 |
of any of the paths on the frontier. | ▶ 00:51 |
Any paths that go beyond the frontier | ▶ 00:54 |
must have a cost that's greater than that | ▶ 00:57 |
because we agree that the step cost is always 0 or more. | ▶ 00:59 |
So that means that this path, p, must be the minimal cost path. | ▶ 01:04 |
Now, this argument, I should say, only goes through | ▶ 01:09 |
as is for tree search. | ▶ 01:13 |
For graph search the argument is slightly more complicated, | ▶ 01:16 |
but the general intuitions hold the same. | ▶ 01:19 |
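For tree search the argument compresses into one chain of inequalities. When A* returns a path p with cost c, h at the goal is 0, and for any path p' still on the frontier:

$$c = g(p) = f(p) \le f(p') = g(p') + h(p') \le g(p') + h^*(p'),$$

where the right-hand side is the true cost of the best possible completion of p'. Since step costs are 0 or more, no frontier path can lead to a cheaper solution.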
So far we've looked at the state space of cities in Romania-- | ▶ 00:01 |
a 2-dimensional, physical space. | ▶ 00:05 |
But the technology for problem solving through search | ▶ 00:07 |
can deal with many types of state spaces, | ▶ 00:10 |
dealing with abstract properties, not just x-y position in a plane. | ▶ 00:12 |
Here I introduce another state space--the vacuum world. | ▶ 00:17 |
It's a very simple world in which there are only 2 positions | ▶ 00:21 |
as opposed to the many positions in the Romania state space. | ▶ 00:25 |
But there are additional properties to deal with as well. | ▶ 00:30 |
The robot vacuum cleaner can be in either of the 2 positions, | ▶ 00:33 |
but as well as that each of the positions | ▶ 00:36 |
can either have dirt in it or not have dirt in it. | ▶ 00:40 |
Now the question is to represent this as a state space | ▶ 00:43 |
how many states do we need? | ▶ 00:47 |
You can fill in the number of states in this box here. | ▶ 00:51 |
And the answer is there are 8 states. | ▶ 00:01 |
There are 2 physical states that the robot vacuum cleaner can be in-- | ▶ 00:04 |
either in state A or in state B. | ▶ 00:10 |
But in addition to that, there are states about how the world is | ▶ 00:12 |
as well as where the robot is in the world. | ▶ 00:17 |
So state A can be dirty or not. | ▶ 00:19 |
That's 2 possibilities. | ▶ 00:24 |
And B can be dirty or not. | ▶ 00:26 |
That's 2 more possibilities. | ▶ 00:28 |
We multiply those together. We get 8 possible states. | ▶ 00:31 |
Here is a diagram of the state space for the vacuum world. | ▶ 00:01 |
Note that there are 8 states, and we have the actions connecting the states | ▶ 00:05 |
just as we did in the Romania problem. | ▶ 00:09 |
Now let's look at a path through this state space. | ▶ 00:12 |
Let's say we start out in this position, | ▶ 00:15 |
and then we apply the action of moving right. | ▶ 00:19 |
Then we end up in a position where the state of the world looks the same, | ▶ 00:23 |
except the robot has moved from position 'A' to position 'B'. | ▶ 00:27 |
Now if we turn on the sucking action, | ▶ 00:32 |
then we end up in a state where the robot is in the same position | ▶ 00:37 |
but that position is no longer dirty. | ▶ 00:42 |
Let's take this very simple vacuum world | ▶ 00:47 |
and make a slightly more complicated one. | ▶ 00:50 |
First, we'll say that the robot has a power switch, | ▶ 00:53 |
which can be in one of three conditions: on, off, or sleep. | ▶ 00:56 |
Next, we'll say that the robot has a dirt-sensing camera, | ▶ 01:04 |
and that camera can either be on or off. | ▶ 01:09 |
Third, this is the deluxe model of robot | ▶ 01:13 |
in which the brushes that clean up the dust | ▶ 01:16 |
can be set at 1 of 5 different heights | ▶ 01:19 |
to be appropriate for whatever level of carpeting you have. | ▶ 01:22 |
Finally, rather than just having the 2 positions, | ▶ 01:27 |
we'll extend that out and have 10 positions. | ▶ 01:30 |
Now the question is how many states are in this state space? | ▶ 01:37 |
The answer is that the number of states is the cross product | ▶ 00:01 |
of the numbers of all the variables, since they're each independent, | ▶ 00:05 |
and any combination can occur. | ▶ 00:08 |
For the power we have 3 possible positions. | ▶ 00:10 |
The camera has 2. | ▶ 00:14 |
The brush height has 5. | ▶ 00:18 |
The dirt has 2 for each of the 10 positions. | ▶ 00:23 |
That's 2^10 or 1024. | ▶ 00:28 |
Then the robot's position can be any of those 10 positions as well. | ▶ 00:33 |
That works out to 307,200 states in the state space. | ▶ 00:39 |
Notice how a fairly trivial problem-- | ▶ 00:44 |
we're only modeling a few variables and only 10 positions-- | ▶ 00:46 |
works out to a large number of states. | ▶ 00:50 |
That's why we need efficient algorithms for searching through state spaces. | ▶ 00:52 |
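The count itself is a quick product, as in this throwaway Python check (the variable names are just labels for the quantities above):

```python
power, camera, brush, positions = 3, 2, 5, 10
dirt_configs = 2 ** positions                 # each of the 10 positions dirty or clean
print(power * camera * brush * dirt_configs * positions)   # 307200

# The earlier two-position world: 2 robot positions x 2^2 dirt configurations = 8.
print(2 * 2 ** 2)                             # 8
```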
I want to introduce one more problem that can be solved with search techniques. | ▶ 00:01 |
This is a sliding blocks puzzle, called a 15 puzzle. | ▶ 00:05 |
You may have seen something like this. | ▶ 00:08 |
So there are a bunch of little squares or blocks or tiles | ▶ 00:10 |
and you can slide them around, | ▶ 00:14 |
and the goal is to get them into a certain configuration. | ▶ 00:19 |
So we'll say that this is the goal state, where the numbers 1-15 are in order | ▶ 00:21 |
left to right, top to bottom. | ▶ 00:27 |
The starting state would be some state where all the positions are messed up. | ▶ 00:29 |
Now the question is: Can we come up with a good heuristic for this? | ▶ 00:34 |
Let's examine that as a way of thinking about where heuristics come from. | ▶ 00:38 |
The first heuristic we're going to consider | ▶ 00:42 |
we'll call h1, and that is equal to the number of misplaced blocks. | ▶ 00:46 |
So here 10 and 11 are misplaced because they should be there and there, respectively, | ▶ 00:54 |
12 is in the right place, 13 is in the right place, | ▶ 00:59 |
and 14 and 15 are misplaced. | ▶ 01:02 |
That's a total of 4 misplaced blocks. | ▶ 01:04 |
The 2nd heuristic, h2, is equal to | ▶ 01:07 |
the sum of the distances that each block would have to move to get to the right position. | ▶ 01:13 |
For this position, 10 would have to move 1 space to get to the right position, | ▶ 01:19 |
11 would have to move 1, so that's a total of 2 so far, | ▶ 01:26 |
13 is in the right place, | ▶ 01:30 |
14 is 1 displaced, | ▶ 01:31 |
and 15 is 1 displaced, | ▶ 01:33 |
so that would also be a total of 4. | ▶ 01:35 |
Now, the question is: Which, if any, of these heuristics are admissible? | ▶ 01:38 |
Check the boxes next to the heuristics that you think | ▶ 01:44 |
are admissible. | ▶ 01:47 |
H1 is admissible, because every tile that's in the wrong position | ▶ 00:02 |
must be moved at least once to get into the right position. | ▶ 00:07 |
So h1 never overestimates. | ▶ 00:10 |
How about h2? | ▶ 00:13 |
H2 is also admissible, because every tile in the wrong position | ▶ 00:15 |
can be moved closer to the correct position no faster than 1 space per move. | ▶ 00:20 |
Therefore, both are admissible. | ▶ 00:26 |
But notice that h2 is always greater than or equal to h1. | ▶ 00:28 |
That means that, with the exception of breaking ties, | ▶ 00:33 |
an A* search using h2 will always expand | ▶ 00:35 |
fewer paths than one using h1. | ▶ 00:39 |
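Here is a minimal Python sketch of the two heuristics, under the assumption that a board is a tuple of 16 numbers read left to right, top to bottom, with 0 standing for the blank (a representation chosen here purely for illustration). The last function anticipates the combination by maximum that comes up a bit later in this unit.

```python
def h1(state, goal):
    """Number of misplaced tiles; the blank (0) is not counted."""
    return sum(1 for s, g in zip(state, goal) if s != g and s != 0)

def h2(state, goal):
    """Sum of Manhattan distances of each tile from its goal square on a 4x4 board."""
    where = {tile: divmod(i, 4) for i, tile in enumerate(state)}
    total = 0
    for i, tile in enumerate(goal):
        if tile == 0:
            continue
        goal_row, goal_col = divmod(i, 4)
        row, col = where[tile]
        total += abs(row - goal_row) + abs(col - goal_col)
    return total

def h_combined(state, goal):
    """The max of admissible heuristics is admissible and at least as informed."""
    return max(h1(state, goal), h2(state, goal))
```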
Now, we're trying to build an artificial intelligence | ▶ 00:01 |
that can solve problems like this all on its own. | ▶ 00:04 |
You can see that the search algorithms do a great job | ▶ 00:08 |
of finding solutions to problems like this. | ▶ 00:12 |
But, you might complain that in order for the search algorithms to work, | ▶ 00:15 |
we had to provide it with a heuristic function. | ▶ 00:19 |
The heuristic function came from the outside. | ▶ 00:22 |
You might think that coming up with a good heuristic function is really where all the intelligence is. | ▶ 00:25 |
So, a problem solver that uses a heuristic function given to it | ▶ 00:30 |
really isn't intelligent at all. | ▶ 00:34 |
So let's think about where the intelligence could come from | ▶ 00:36 |
and whether we can automatically come up with good heuristic functions. | ▶ 00:39 |
I'm going to sketch a description of | ▶ 00:45 |
a program that can automatically come up with good heuristics | ▶ 00:47 |
given a description of a problem. | ▶ 00:50 |
Suppose this program is given a description of the sliding blocks puzzle | ▶ 00:52 |
where we say that a block can move from square A to square B | ▶ 00:57 |
if A is adjacent to B and B is blank. | ▶ 01:02 |
Now, imagine that we try to loosen this restriction. | ▶ 01:06 |
We cross out "B is blank," | ▶ 01:10 |
and then we get the rule | ▶ 01:14 |
"a block can move from A to B if A is adjacent to B," | ▶ 01:16 |
and that's equal to our heuristic h2 | ▶ 01:20 |
because a block can move anywhere to an adjacent state. | ▶ 01:23 |
Now, we could also cross out the other part of the rule, | ▶ 01:27 |
and we now get "a block can move from any square A | ▶ 01:31 |
to any square B," regardless of any condition. | ▶ 01:36 |
That gives us heuristic h1. | ▶ 01:40 |
So we see that both of our heuristics can be derived | ▶ 01:43 |
from a simple mechanical manipulation | ▶ 01:48 |
of the formal description of the problem. | ▶ 01:50 |
Once we've generated automatically these candidate heuristics, | ▶ 01:53 |
another way to come up with a good heuristic is to say | ▶ 01:58 |
that a new heuristic, h, | ▶ 02:02 |
is equal to the maximum of h1 and h2, | ▶ 02:04 |
and that's guaranteed to be admissible as long as | ▶ 02:10 |
h1 and h2 are admissible | ▶ 02:13 |
because it still never overestimates, | ▶ 02:16 |
and it's guaranteed to be better because it's getting closer to the true value. | ▶ 02:18 |
The only problem with combining multiple heuristics like this | ▶ 02:22 |
is that there is some cost to compute the heuristics, | ▶ 02:27 |
and so it could take longer to compute, | ▶ 02:29 |
even if we end up expanding fewer paths. | ▶ 02:31 |
Crossing out parts of the rules like this | ▶ 02:35 |
is called "generating a relaxed problem." | ▶ 02:38 |
What we've done is we've taken the original problem, | ▶ 02:41 |
where it's hard to move squares around, | ▶ 02:44 |
and made it easier by relaxing one of the constraints. | ▶ 02:46 |
You can see that as adding new links in the state space, | ▶ 02:49 |
so if we have a state space in which there are only particular links, | ▶ 02:54 |
by relaxing the problem it's as if we are adding new operators | ▶ 02:59 |
that traverse the state space in new ways. | ▶ 03:05 |
So adding new operators only makes the problem easier, | ▶ 03:07 |
and thus never overestimates, and thus is admissible. | ▶ 03:11 |
We've seen what search can do for problem solving. | ▶ 00:00 |
It can find the lowest-cost path to a goal, | ▶ 00:03 |
and it can do that in a way in which we never generate more paths than we have to. | ▶ 00:06 |
We can find the optimal number of paths to generate, | ▶ 00:12 |
and we can do that with a heuristic function that we generate on our own | ▶ 00:15 |
by relaxing the existing problem definition. | ▶ 00:19 |
But let's be clear on what search can't do. | ▶ 00:22 |
All the solutions that we have found consist of a fixed sequence of actions. | ▶ 00:25 |
In other words, the agent, here in Arad, thinks, comes up with a plan that it wants to execute | ▶ 00:31 |
and then essentially closes its eyes and starts driving, | ▶ 00:38 |
never considering along the way if something has gone wrong. | ▶ 00:42 |
That works fine for this type of problem, | ▶ 00:46 |
but it only works when we satisfy the following conditions. | ▶ 00:49 |
[Problem solving works when:] | ▶ 00:53 |
Problem-solving technology works when the following set of conditions is true: | ▶ 00:55 |
First, the domain must be fully observable. | ▶ 00:59 |
In other words, we must be able to see what initial state we start out with. | ▶ 01:03 |
Second, the domain must be known. | ▶ 01:08 |
That is, we have to know the set of actions available to us. | ▶ 01:12 |
Third, the domain must be discrete. | ▶ 01:16 |
There must be a finite number of actions to choose from. | ▶ 01:20 |
Fourth, the domain must be deterministic. | ▶ 01:24 |
We have to know the result of taking an action. | ▶ 01:28 |
Finally, the domain must be static. | ▶ 01:32 |
There must be nothing else in the world that can change the world except our own actions. | ▶ 01:36 |
If all these conditions are true, then we can search for a plan | ▶ 01:41 |
which solves the problem and is guaranteed to work. | ▶ 01:44 |
In later units, we will see what to do if any of these conditions fail to hold. | ▶ 01:47 |
Our description of the algorithm has talked about paths in the state space. | ▶ 00:01 |
I want to say a little bit now about how to implement that in terms of a computer algorithm. | ▶ 00:08 |
We talk about paths, but we need to implement that in some way. | ▶ 00:15 |
In the implementation we talk about nodes. | ▶ 00:19 |
A node is a data structure, and it has four fields. | ▶ 00:22 |
The state field indicates the state at the end of the path. | ▶ 00:27 |
The action was the action it took to get there. | ▶ 00:35 |
The cost is the total cost, | ▶ 00:40 |
and the parent is a pointer to another node. | ▶ 00:45 |
In this case, the node that has state "S", | ▶ 00:50 |
and it will have a parent which points to the node that has state "A", | ▶ 00:56 |
and that will have a parent pointer that's null. | ▶ 01:06 |
So we have a linked list of nodes representing the path. | ▶ 01:10 |
We'll use the word "path" for the abstract idea, | ▶ 01:15 |
and the word "node" for the representation in the computer memory. | ▶ 01:18 |
But otherwise, you can think of those two terms as being synonyms, | ▶ 01:22 |
because they're in a one-to-one correspondence. | ▶ 01:26 |
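A minimal Python rendering of this node structure might look as follows; the field names mirror the four fields just described, and everything else is an illustrative assumption.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Node:
    state: object                    # the state at the end of the path
    action: Optional[str]            # the action taken to reach this state
    cost: float                      # the total path cost g
    parent: Optional["Node"]         # the previous node; None for the start node

def states_along_path(node):
    """Follow parent pointers back to the start to recover the path of states."""
    states = []
    while node is not None:
        states.append(node.state)
        node = node.parent
    return list(reversed(states))
```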
Now there are two main data structures that deal with nodes. | ▶ 01:31 |
We have the "frontier" and we have the "explored" list. | ▶ 01:35 |
Let's talk about how to implement them. | ▶ 01:41 |
In the frontier the operations we have to deal with | ▶ 01:44 |
are removing the best item from the frontier and adding in new ones. | ▶ 01:48 |
And that suggests we should implement it as a priority queue, | ▶ 01:52 |
which knows how to keep track of the best items in proper order. | ▶ 01:55 |
But we also need to have an additional operation | ▶ 01:59 |
of a membership test, to ask whether a new item is in the frontier. | ▶ 02:03 |
And that suggests representing it as a set, | ▶ 02:07 |
which can be built from a hash table or a tree. | ▶ 02:10 |
So the most efficient implementations of search actually have both representations. | ▶ 02:14 |
The explored set, on the other hand, is easier. | ▶ 02:20 |
All we have to do there is be able to add new members and check for membership. | ▶ 02:23 |
So we represent that as a single set, | ▶ 02:28 |
which again can be done with either a hash table or tree. | ▶ 02:31 |
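One possible sketch of the double representation of the frontier combines a priority queue with a set for membership tests; it ignores details like updating the priority of a state already present, so it is only an outline of the idea.

```python
import heapq

class Frontier:
    def __init__(self):
        self.heap = []               # (cost, state) pairs, ordered by cost
        self.members = set()         # mirror of the states, for O(1) membership tests

    def add(self, cost, state):
        heapq.heappush(self.heap, (cost, state))
        self.members.add(state)

    def pop_best(self):
        cost, state = heapq.heappop(self.heap)
        self.members.discard(state)
        return cost, state

    def __contains__(self, state):   # enables: if s in frontier
        return state in self.members
```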
Congratulations. | ▶ 00:00 |
You've just made it to assignment 1. | ▶ 00:02 |
This is homework assignment #1. | ▶ 00:00 |
This is a question about peg solitaire. | ▶ 00:01 |
In peg solitaire, a single player faces | ▶ 00:04 |
the following kind of board. | ▶ 00:08 |
Initially, all positions are occupied by pegs except for the center position. | ▶ 00:13 |
You can find more information on peg solitaire at the following URL. | ▶ 00:22 |
[http://en.wikipedia.org/wiki/peg_solitaire] | ▶ 00:26 |
I wish to know whether this game is partially observable. | ▶ 00:36 |
Please say yes or no. | ▶ 00:40 |
I wish to know whether it is stochastic. | ▶ 00:43 |
Please say yes if it is and no if it's deterministic. | ▶ 00:46 |
Let me know if it's continuous, yes or no, | ▶ 00:50 |
and let me know if it's adversarial, yes or no. | ▶ 00:55 |
>>Peg Solitaire is not partially observable because you can see the board at all times. | ▶ 00:00 |
It is not stochastic because you yourself make all the moves, | ▶ 00:06 |
and they have deterministic effects. | ▶ 00:09 |
It is not continuous. There are just finitely many choices of actions | ▶ 00:11 |
and finitely many board positions, so therefore, it is not continuous. | ▶ 00:15 |
And it's not adversarial because there are no adversaries--just you playing. | ▶ 00:18 |
I am going to ask you about the problem of learning about a loaded coin. | ▶ 00:01 |
A loaded coin is a coin | ▶ 00:05 |
that, if you flip it, | ▶ 00:07 |
might have a non-0.5 chance | ▶ 00:09 |
of coming up heads or tails. | ▶ 00:13 |
Fair coins always come up 50% heads or tails. | ▶ 00:16 |
Loaded coins might come up, for example, | ▶ 00:20 |
0.9 chance heads and 0.1 chance tails. | ▶ 00:23 |
Your task will be to understand, | ▶ 00:27 |
from coin flips, | ▶ 00:30 |
whether a coin is loaded, | ▶ 00:31 |
and if so, at what probability. | ▶ 00:33 |
I don't want you to solve the problem, | ▶ 00:35 |
but I want you to answer the following questions: | ▶ 00:37 |
Is it partially observable? | ▶ 00:40 |
Yes or no. | ▶ 00:42 |
Is it stochastic? | ▶ 00:44 |
Yes or no. | ▶ 00:46 |
Is it continuous? [Yes or no.] | ▶ 00:48 |
And finally, is it adversarial? | ▶ 00:51 |
Yes or no. | ▶ 00:53 |
[Thrun] So the loaded coin example is clearly partially observable, | ▶ 00:00 |
and the reason is that there is actually use for memory: | ▶ 00:06 |
if you flip it more than 1 time, you can learn more about what the actual probability is. | ▶ 00:09 |
Therefore, looking at the most recent coin flip is insufficient to make your choice. | ▶ 00:14 |
It is stochastic because you flip a coin. | ▶ 00:20 |
It is not continuous because there's only 1 action--a flip--and 2 outcomes. | ▶ 00:25 |
And it isn't really adversarial because while you do your learning task | ▶ 00:31 |
no adversary interferes. | ▶ 00:36 |
Let's talk about the problem of finding a path through a maze. | ▶ 00:00 |
Let me draw you a maze. | ▶ 00:05 |
Suppose you wish to find the path from the start to your goal. | ▶ 00:10 |
I don't want you to solve this problem. | ▶ 00:15 |
Rather I want you to tell me whether it's partially observable. | ▶ 00:19 |
Yes or no. | ▶ 00:23 |
Is it stochastic? | ▶ 00:25 |
Yes or no. | ▶ 00:27 |
Is it continuous? | ▶ 00:29 |
Yes or no. | ▶ 00:31 |
[Thrun] The path through the maze is clearly not partially observable | ▶ 00:00 |
because you can see the maze entirely at all times. | ▶ 00:03 |
It is not stochastic. There is no randomness involved. | ▶ 00:06 |
It isn't really continuous. | ▶ 00:10 |
There's typically just finitely many choices--go left or right. | ▶ 00:12 |
And it isn't adversarial because there's no real adversary involved. | ▶ 00:15 |
This is a search question. | ▶ 00:00 |
Suppose we are given the following search tree. | ▶ 00:02 |
We are searching from the top, the start node, | ▶ 00:05 |
to the goal, which is over here. | ▶ 00:08 |
Assume we expand from left to right. | ▶ 00:12 |
Tell me how many nodes are expanded | ▶ 00:17 |
if we expand from left to right, | ▶ 00:20 |
counting the start node and the goal node in your answer. | ▶ 00:23 |
And give me the same answer for Depth First Search. | ▶ 00:27 |
Now, let's assume you're going to search from right to left. | ▶ 00:32 |
How many nodes would we now expand in Breadth First Search, | ▶ 00:35 |
and how many do we expand in Depth First Search? | ▶ 00:39 |
[Thrun] Breadth first from left to right is 6-- | ▶ 00:00 |
1, 2, 3, 4, 5, 6. | ▶ 00:03 |
Depth first from left to right is 4--1, 2, 3, 4. | ▶ 00:07 |
Breadth first searched from right to left is 9-- | ▶ 00:15 |
1, 2, 3, 4, 5, 6, 7, 8, 9. | ▶ 00:19 |
And depth first from right to left is 9-- | ▶ 00:25 |
1, 2, 3, 4, 5, 6, 7, 8, 9. | ▶ 00:28 |
Another search problem-- | ▶ 00:00 |
Consider the following search tree, | ▶ 00:03 |
where this is the start node. | ▶ 00:08 |
Now, assume we search from left to right. | ▶ 00:12 |
I would like you to tell me the number of nodes expanded from Breadth-First Search | ▶ 00:15 |
and Depth-First Search. | ▶ 00:19 |
Please do count the start and the goal node, | ▶ 00:22 |
and please give me the same numbers for Right-to-Left Search, | ▶ 00:25 |
for Breadth-First, and Depth-First. | ▶ 00:28 |
[Thrun] The correct answer for breadth first left to right is 13-- | ▶ 00:00 |
1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13. | ▶ 00:05 |
And for depth first it is 10-- | ▶ 00:13 |
1, 2, 3, 4, 5, 6, 7, 8, 9, and 10. | ▶ 00:17 |
For right to left search, the right answer for breadth first is 11-- | ▶ 00:28 |
1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11. | ▶ 00:32 |
And for depth first the right answer is 7-- | ▶ 00:38 |
1, 2, 3, 4, 5, 6, 7. | ▶ 00:42 |
This is another search problem. | ▶ 00:00 |
Let's assume we have a search graph. | ▶ 00:04 |
It isn't quite a tree but looks like this. | ▶ 00:07 |
Obviously in the structure we can reach nodes through multiple paths. | ▶ 00:13 |
So let's assume that our search never expands the same node twice. | ▶ 00:18 |
Let's also assume this start node is on top. We search down. | ▶ 00:22 |
And this over here is our goal node. | ▶ 00:27 |
So left-to-right search, tell me how many nodes | ▶ 00:30 |
breadth first would expand--do count the start and goal node in the final answer. | ▶ 00:35 |
Give me the same result for a depth-first search. | ▶ 00:43 |
Again counting the start and the goal node in your answer. | ▶ 00:48 |
And again give me your answer for breadth-first | ▶ 00:51 |
and for depth-first in the right-to-left search paradigm. | ▶ 00:54 |
[Thrun] The right answer over here is 10 for breadth first from left to right-- | ▶ 00:00 |
1, 2, 3, 4, 5, 6, 7, 8, 9, 10. | ▶ 00:05 |
Depth first is 16, or all nodes-- | ▶ 00:11 |
1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16. | ▶ 00:15 |
And notice how I never expanded a node twice. | ▶ 00:30 |
Correct answer for breadth first right to left is 7-- | ▶ 00:34 |
1, 2, 3, 4, 5, 6, 7. | ▶ 00:38 |
And the correct answer for depth first from right to left is 4--1, 2, 3, and 4. | ▶ 00:43 |
Let's talk about A* search. | ▶ 00:00 |
Let's assume we have the following grid. | ▶ 00:03 |
The start state is right here. | ▶ 00:08 |
And the goal state is right here. | ▶ 00:13 |
And just for convenience, I will give each cell here a little label. | ▶ 00:16 |
A. B. C. D. | ▶ 00:22 |
Let me draw a heuristic function. | ▶ 00:26 |
Please take a look for a moment | ▶ 00:30 |
and tell me whether this heuristic function is admissible. | ▶ 00:32 |
Check here if yes and here if no. | ▶ 00:38 |
Which one is the first node A* would expand? | ▶ 00:41 |
B1 or A2? | ▶ 00:46 |
What's the second node to expand? | ▶ 00:51 |
B1, C1, A2, A3, or B2? | ▶ 00:56 |
And finally, what is the third node to expand? | ▶ 01:06 |
D1, C2, B3, or A4? | ▶ 01:10 |
[Thrun] Clearly this is an admissible heuristic because the distance to the goal | ▶ 00:00 |
is strictly underestimated. | ▶ 00:05 |
From here it would take 1 step, | ▶ 00:07 |
from here it will take 1, 2 steps, so the answer is yes. | ▶ 00:09 |
Now, to understand A*, let me also draw the g function | ▶ 00:15 |
for the relevant part of this table. | ▶ 00:22 |
Clearly g is 0 over here. | ▶ 00:24 |
To understand which node to expand, this one or this one, | ▶ 00:27 |
let's project the g function, which is 1, | ▶ 00:31 |
and we will see that 3 plus 1 is smaller than 4 plus 1; | ▶ 00:34 |
therefore, this is the second node to expand, which is B1. | ▶ 00:40 |
Now let me, for the next step, extend the g function from this guy here, 2 and 2. | ▶ 00:47 |
So 2 plus 2 is 4 versus 3 plus 2 is 5, so we expand this node next, which is C1. | ▶ 00:55 |
And finally, the g function from here would go 3 and 3. | ▶ 01:08 |
3 plus 1 is better than 3 plus 2, so we would expand D1 next. | ▶ 01:14 |
And notice how in the sum of g and h, | ▶ 01:24 |
this node over here, which has a total of 4, is better than any other node that is unexpanded. | ▶ 01:29 |
So in particular, 4 plus 1 is 5, and 3 plus 2 is 5 as well, | ▶ 01:35 |
and 2 plus 3 is 5 as well, so this is the next one to expand. | ▶ 01:40 |
So the next units will be concerned with probabilities | ▶ 00:00 |
and particularly with structured probabilities using Bayes networks. | ▶ 00:03 |
This is some of the most involved material in this class. | ▶ 00:08 |
And since this is a Stanford level class, | ▶ 00:12 |
you will find out that some of the quizzes are actually really hard. | ▶ 00:14 |
So as you go through the material, I hope the hardness of the quizzes won't discourage you; | ▶ 00:18 |
it'll really entice you to take a piece of paper and a pen and work them out. | ▶ 00:23 |
Let me give you a flavor of a Bayes network using an example. | ▶ 00:30 |
Suppose you find in the morning that your car won't start. | ▶ 00:35 |
Well, there's many causes why your car might not start. | ▶ 00:39 |
One is that your battery is flat. | ▶ 00:43 |
Even for a flat battery there are multiple causes. | ▶ 00:46 |
One, it's just plain dead, | ▶ 00:50 |
and one is that the battery is okay but it's not charging. | ▶ 00:52 |
The reason why a battery might not charge is that the alternator might be broken | ▶ 00:55 |
or the fan belt might be broken. | ▶ 01:01 |
If you look at this influence diagram, also called a Bayes network, | ▶ 01:03 |
you'll find there's many different ways to explain that the car won't start. | ▶ 01:07 |
And a natural question you might have is, "Can we diagnose the problem?" | ▶ 01:12 |
One diagnostic tool is a battery meter, | ▶ 01:17 |
which may increase or decrease your belief that the battery is the cause of your car failure. | ▶ 01:20 |
You might also know your battery age. | ▶ 01:26 |
Older batteries tend to go dead more often. | ▶ 01:29 |
And there's many other ways to look at reasons why the car might not start. | ▶ 01:31 |
You might inspect the lights, the oil light, the gas gauge. | ▶ 01:37 |
You might even dip into the engine to see what the oil level is with a dipstick. | ▶ 01:43 |
All of those relate to alternative reasons why the car might not be starting, | ▶ 01:48 |
like no oil, no gas, the fuel line might be blocked, or the starter may be broken. | ▶ 01:52 |
And all of these can influence your measurements, | ▶ 01:59 |
like the oil light or the gas gauge, in different ways. | ▶ 02:04 |
For example, the battery flat would have an effect on the lights. | ▶ 02:07 |
It might have an effect on the oil light and on the gas gauge, | ▶ 02:12 |
but it won't really affect the oil you measure with the dipstick. | ▶ 02:16 |
That is affected by the actual oil level, which also affects the oil light. | ▶ 02:20 |
Gas will affect the gas gauge, and of course without gas the car doesn't start. | ▶ 02:26 |
So this is a complicated structure that really describes one way to understand | ▶ 02:32 |
how a car doesn't start. | ▶ 02:39 |
A car is a complex system. | ▶ 02:41 |
It has lots of variables you can't really measure immediately, | ▶ 02:43 |
and it has sensors which allow you to understand a little bit about the state of the car. | ▶ 02:46 |
What the Bayes network does, | ▶ 02:52 |
it really assists you in reasoning from observable variables, like the car won't start | ▶ 02:54 |
and the value of the dipstick, to hidden causes, like is the fan belt broken | ▶ 03:01 |
or is the battery dead. | ▶ 03:06 |
What you have here is a Bayes network. | ▶ 03:09 |
A Bayes network is composed of nodes. | ▶ 03:13 |
These nodes correspond to events that you might or might not know | ▶ 03:15 |
that are typically called random variables. | ▶ 03:21 |
These nodes are linked by arcs, and the arcs suggest that a child of an arc | ▶ 03:24 |
is influenced by its parent but not in a deterministic way. | ▶ 03:31 |
It might be influenced in a probabilistic way, which means an older battery, for example, | ▶ 03:35 |
has a higher chance of causing the battery to be dead, | ▶ 03:41 |
but it's not clear that every old battery is dead. | ▶ 03:45 |
There is a total of 16 variables in this Bayes network. | ▶ 03:48 |
What the graph structure and associated probabilities specify | ▶ 03:53 |
is a huge probability distribution in the space of all of these 16 variables. | ▶ 03:59 |
If they are all binary, which we'll assume throughout this unit, | ▶ 04:06 |
they can take 2 to the 16th different values, which is a lot. | ▶ 04:10 |
The Bayes network, as we find out, is a complex representation | ▶ 04:15 |
of a distribution over this very, very large joint probability distribution of all of these variables. | ▶ 04:18 |
Further, once we specify the Bayes network, | ▶ 04:26 |
we can observe, for example, the car won't start. | ▶ 04:29 |
We can observe things like the oil light and the lights and the battery meter | ▶ 04:33 |
and then compute probabilities of the hypothesis, like the alternator is broken | ▶ 04:37 |
or the fan belt is broken or the battery is dead. | ▶ 04:41 |
So in this class we're going to talk about how to construct this Bayes network, | ▶ 04:45 |
what the semantics are, and how to reason in this Bayes network | ▶ 04:50 |
to find out about variables we can't observe, like whether the fan belt is broken or not. | ▶ 04:56 |
That's an overview. | ▶ 05:02 |
Throughout this unit I am going to assume that every event is discrete-- | ▶ 05:04 |
in fact, it's binary. | ▶ 05:08 |
We'll start with some consideration of basic probability, | ▶ 05:10 |
we'll work our way into some simple Bayes networks, | ▶ 05:14 |
we'll talk about concepts like conditional independence | ▶ 05:19 |
and then define Bayes networks more generally, | ▶ 05:23 |
move into concepts like D-separation and start doing parameter counts. | ▶ 05:26 |
Later on, Peter will tell you about inference in Bayes networks. | ▶ 05:32 |
So we won't do this in this unit. | ▶ 05:36 |
I can't overemphasize how important this class is. | ▶ 05:38 |
Bayes networks are used extensively in almost all fields of smart computer systems, | ▶ 05:43 |
in diagnostics, for prediction, for machine learning, and fields like finance, | ▶ 05:49 |
inside Google, in robotics. | ▶ 05:57 |
Bayes networks are also the building blocks of more advanced AI techniques | ▶ 06:00 |
such as particle filters, hidden Markov models, MDPs and POMDPs, | ▶ 06:05 |
Kalman filters, and many others. | ▶ 06:12 |
These are words that don't sound familiar quite yet, | ▶ 06:14 |
but as you go through the class, I can promise you you will get to know what they mean. | ▶ 06:18 |
So let's start now at the very, very basics. | ▶ 06:22 |
[Thrun] So let's talk about probabilities. | ▶ 00:00 |
Probabilities are the cornerstone of artificial intelligence. | ▶ 00:02 |
They are used to express uncertainty, | ▶ 00:05 |
and the management of uncertainty is really key to many, many things in AI | ▶ 00:08 |
such as machine learning and Bayes network inference | ▶ 00:12 |
and filtering and robotics and computer vision and so on. | ▶ 00:16 |
So I'm going to start with some very basic questions, | ▶ 00:21 |
and we're going to work our way up from there. | ▶ 00:24 |
Here is a coin. | ▶ 00:26 |
The coin can come up heads or tails, and my question is the following: | ▶ 00:28 |
Suppose the probability for heads is 0.5. | ▶ 00:32 |
What's the probability for it coming up tails? | ▶ 00:38 |
[Thrun] So the right answer is a half, or 0.5, | ▶ 00:00 |
and the reason is the coin can only come up heads or tails. | ▶ 00:03 |
We know that it has to be either one. | ▶ 00:07 |
Therefore, the total probability of both coming up is 1. | ▶ 00:10 |
So if half of the probability is assigned to heads, then the other half is assigned to tail. | ▶ 00:14 |
[Thrun] Let me ask my next quiz. | ▶ 00:00 |
Suppose the probability of heads is a quarter, 0.25. | ▶ 00:02 |
What's the probability of tail? | ▶ 00:06 |
[Thrun] And the answer is 3/4. | ▶ 00:00 |
It's a loaded coin, and the reason is, well, | ▶ 00:02 |
each of them come up with a certain probability. | ▶ 00:05 |
The total of those is 1. The quarter is claimed by heads. | ▶ 00:08 |
Therefore, 3/4 remain for tail, which is the answer over here. | ▶ 00:12 |
[Thrun] Here's another quiz. | ▶ 00:00 |
What's the probability that the coin comes up heads, heads, heads, three times in a row, | ▶ 00:02 |
assuming that each one of those has a probability of a half | ▶ 00:08 |
and that these coin flips are independent? | ▶ 00:12 |
[Thrun] And the answer is 0.125. | ▶ 00:00 |
Each head has a probability of a half. | ▶ 00:04 |
We can multiply those probabilities because they are independent events, | ▶ 00:06 |
and that gives us 1 over 8 or 0.125. | ▶ 00:10 |
[Thrun] Now let's flip the coin 4 times, and let's call Xi the result of the i-th coin flip. | ▶ 00:00 |
So each Xi is going to be drawn from heads or tail. | ▶ 00:11 |
What's the probability that all 4 of those flips give us the same result, | ▶ 00:16 |
no matter what it is, assuming that each one of those is identically | ▶ 00:22 |
and independently distributed, with a probability of coming up heads of one half? | ▶ 00:26 |
[Thrun] And the answer is, well, there's 2 ways that we can achieve this. | ▶ 00:00 |
One is the all heads and one is all tails. | ▶ 00:04 |
You already know that 4 times heads is 1/16, | ▶ 00:06 |
and we know that 4 times tail is also 1/16. | ▶ 00:10 |
These are mutually exclusive events. | ▶ 00:13 |
The probability of either one occurring is 1/16 plus 1/16, which is 1/8, which is 0.125. | ▶ 00:15 |
[Thrun] So here's another one. | ▶ 00:00 |
What's the probability that within the set of X1, X2, X3, and X4 | ▶ 00:02 |
there are at least three heads? | ▶ 00:07 |
[Thrun] And the solution is let's look at different sequences | ▶ 00:00 |
in which head occurs at least 3 times. | ▶ 00:03 |
It could be head, head, head, head, in which it comes 4 times. | ▶ 00:06 |
It could be head, head, head, tail and so on, all the way to tail, head, head, head. | ▶ 00:10 |
There's 1, 2, 3, 4, 5 of those outcomes. | ▶ 00:16 |
Each of them has a probability of 1/16, so it's 5 times 1/16, which is 0.3125. | ▶ 00:19 |
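As a formula, the count of 5 favorable sequences out of 16 equally likely ones is:

$$P(\text{at least 3 heads}) = \binom{4}{3}\Big(\frac{1}{2}\Big)^4 + \binom{4}{4}\Big(\frac{1}{2}\Big)^4 = \frac{5}{16} = 0.3125.$$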
[Thrun] So we just learned a number of things. | ▶ 00:00 |
One is about complementary probability. | ▶ 00:02 |
If an event has a certain probability, p, | ▶ 00:05 |
the complementary event has the probability 1-p. | ▶ 00:08 |
We also learned about independence. | ▶ 00:13 |
If 2 random variables, X and Y, are independent, | ▶ 00:15 |
which we're going to write like this, | ▶ 00:19 |
that means the joint probability of any values the 2 variables can assume | ▶ 00:21 |
is the product of the marginals. | ▶ 00:26 |
So rather than asking the question, "What is the probability | ▶ 00:30 |
"for any combination that these 2 coins or maybe 5 coins could have taken?" | ▶ 00:34 |
we can now look at the probability of each coin individually, | ▶ 00:40 |
look at its probability and just multiply them up. | ▶ 00:42 |
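Written out, the independence statement is:

$$X \perp Y \quad\Longleftrightarrow\quad P(X = x,\, Y = y) = P(X = x)\,P(Y = y) \ \text{for all values } x, y.$$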
[Thrun] So let me ask you about dependence. | ▶ 00:00 |
Suppose we flip 2 coins. | ▶ 00:03 |
Our first coin is a fair coin, and we're going to denote the outcome by X1. | ▶ 00:05 |
So the chance of X1 coming up heads is half. | ▶ 00:12 |
But now we branch into picking a coin based on the first outcome. | ▶ 00:15 |
So if the first outcome was heads, | ▶ 00:20 |
you pick a coin whose probability of coming up heads is going to be 0.9. | ▶ 00:23 |
The way I word this is by conditional probability, | ▶ 00:28 |
probability of the second coin flip coming up heads | ▶ 00:32 |
provided that or given that X1, the first coin flip, was heads, is 0.9. | ▶ 00:35 |
The first coin flip might also come up tails, | ▶ 00:41 |
in which case I pick a very different coin. | ▶ 00:44 |
In this case I pick a coin which with 0.8 probability will once again give me tails, | ▶ 00:47 |
conditioned on the first coin flip coming up tails. | ▶ 00:54 |
So my question for you is, | ▶ 00:57 |
what's the probability of the second coin flip coming up heads? | ▶ 00:59 |
[Thrun] The answer is 0.55. | ▶ 00:00 |
The way to compute this is by the theorem of total probability. | ▶ 00:04 |
Probability of X2 equals heads. | ▶ 00:08 |
There's 2 ways I can get to this outcome. | ▶ 00:12 |
One is via this path over here, and one is via this path over here. | ▶ 00:15 |
Let me just write both of them down. | ▶ 00:18 |
So first of all, there's the probability of X2 equals heads | ▶ 00:20 |
given that X1 was heads, times the probability that X1 was heads. | ▶ 00:26 |
Now I have to add the complementary event. | ▶ 00:30 |
Suppose X1 came up tails. | ▶ 00:32 |
Then I can ask the question, what is the probability that X2 comes up heads regardless, | ▶ 00:35 |
even though X1 was tails? | ▶ 00:40 |
Plugging in the numbers gives us the following. | ▶ 00:42 |
This one over here is 0.9 times a half. | ▶ 00:44 |
The probability of tails following tails is 0.8, | ▶ 00:49 |
so my heads probability becomes 1 minus 0.8, which is 0.2, times a half. | ▶ 00:51 |
Adding all of this together gives me 0.45 plus 0.1, | ▶ 00:58 |
which is exactly 0.55. | ▶ 01:03 |
So, we actually just learned some interesting lessons. | ▶ 00:00 |
The probability of any random variable Y can be written as | ▶ 00:02 |
probability of Y given that some other random variable X assumes value i | ▶ 00:08 |
times probability of X equals i, | ▶ 00:13 |
summed over all possible outcomes i for the other variable, X. | ▶ 00:17 |
This is called total probability. | ▶ 00:22 |
The second thing we learned has to do with negation of probabilities. | ▶ 00:24 |
We found that probability of not X given Y is 1 minus probability of X given Y. | ▶ 00:27 |
Now, you might be tempted to say "What about the probability of X given not Y?" | ▶ 00:37 |
"Is this the same as 1 minus probability of X given Y?" | ▶ 00:43 |
And the answer is absolutely no. | ▶ 00:51 |
That's not the case. | ▶ 00:54 |
If you condition on something that has a certain probability value, | ▶ 00:56 |
you can take the event you're looking at and negate this, | ▶ 01:00 |
but you can never negate your conditional variable | ▶ 01:03 |
and assume these values add up to 1. | ▶ 01:05 |
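The two lessons in formula form:

$$P(Y) = \sum_i P(Y \mid X = i)\,P(X = i) \qquad \text{(total probability)}$$

$$P(\neg X \mid Y) = 1 - P(X \mid Y), \qquad \text{but in general } P(X \mid \neg Y) \ne 1 - P(X \mid Y).$$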
We assume there is sometimes sunny days and sometimes rainy days, | ▶ 00:00 |
and on day 1, which we're going to call D1, | ▶ 00:06 |
the probability of sunny is 0.9. | ▶ 00:09 |
And then let's assume that a sunny day follows a sunny day with 0.8 chance, | ▶ 00:13 |
and a rainy day follows a sunny day with--well-- | ▶ 00:20 |
Well, the correct answer is 0.2, which is a negation of this event over here. | ▶ 00:00 |
A sunny day follows a rainy day with 0.6 chance, | ▶ 00:00 |
and a rainy day follows a rainy day-- | ▶ 00:06 |
please give me your number. | ▶ 00:11 |
0.4 | ▶ 00:00 |
So, what are the chances that D2 is sunny? | ▶ 00:00 |
Suppose the same dynamics apply from D2 to D3, | ▶ 00:03 |
so just replace the D2s over here with D3s over there. | ▶ 00:06 |
That means the transition probabilities from one day to the next remain the same. | ▶ 00:10 |
Tell me, what's the probability that D3 is sunny? | ▶ 00:14 |
So, the correct answer over here is 0.78, | ▶ 00:00 |
and over here it's 0.756. | ▶ 00:04 |
To get there, let's complete this one first. | ▶ 00:10 |
The probability of D2 = sunny. | ▶ 00:13 |
Well, we know there's a 0.9 chance it's sunny on D1, | ▶ 00:16 |
and then if it is sunny, we know it stays sunny with a 0.8 chance. | ▶ 00:21 |
So, we multiply these 2 things together, and we get 0.72. | ▶ 00:25 |
We know there's a 0.1 chance of it being rainy on day 1, which is the complement, | ▶ 00:29 |
but if it's rainy, we know it switches to sunny with 0.6 chance, | ▶ 00:33 |
so you multiply these 2 things, and you get 0.06. | ▶ 00:37 |
Adding those two up equals 0.78. | ▶ 00:41 |
Now, for the next day, we know our prior for sunny is 0.78. | ▶ 00:46 |
If it is sunny, it stays sunny with 0.8 probability. | ▶ 00:51 |
Multiplying these 2 things gives us 0.624. | ▶ 00:55 |
We know it's rainy with 0.22 chance, which is the complement of 0.78, | ▶ 01:01 |
and it switches to sunny with a 0.6 chance if it was rainy. | ▶ 01:07 |
If you multiply those, you get 0.132. | ▶ 01:10 |
Adding those 2 things up gives us 0.756. | ▶ 01:14 |
So, to some extent, it's tedious to compute these values, | ▶ 01:19 |
but they can be perfectly computed, as shown here. | ▶ 01:23 |
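The same iteration is a two-line loop in Python; the 0.8 stay-sunny and 0.6 rain-to-sun probabilities are the ones given above.

```python
p_sunny = 0.9                                  # P(sunny) on day 1
for day in (2, 3):
    p_sunny = 0.8 * p_sunny + 0.6 * (1 - p_sunny)
    print(day, round(p_sunny, 3))              # day 2: 0.78, day 3: 0.756
```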
Next example is a cancer example. | ▶ 00:00 |
Suppose there's a specific type of cancer which exists for 1% of the population. | ▶ 00:05 |
I'm going to write this as follows. | ▶ 00:11 |
You can probably tell me now what the probability of not having this cancer is. | ▶ 00:13 |
And yes, the answer is 0.99. | ▶ 00:00 |
Let's assume there's a test for this cancer, | ▶ 00:04 |
which gives us probabilistically an answer whether we have this cancer or not. | ▶ 00:07 |
So, let's say the probability of a test being positive, as indicated by this + sign, | ▶ 00:12 |
given that we have cancer, is 0.9. | ▶ 00:18 |
The probability of the test coming out negative if we have the cancer is--you name it. | ▶ 00:22 |
0.1, which is the difference between 1 and 0.9. | ▶ 00:00 |
Let's assume the probability of the test coming out positive | ▶ 00:06 |
given that we don't have this cancer is 0.2. | ▶ 00:11 |
In other words, the probability of the test correctly saying | ▶ 00:15 |
we don't have the cancer if we're cancer free is 0.8. | ▶ 00:19 |
Now, ultimately, I'd like to know what's the probability | ▶ 00:24 |
of having this cancer given that I just received a single positive test. | ▶ 00:28 |
Before I do this, please help me fill out some other probabilities | ▶ 00:35 |
that are actually important. | ▶ 00:39 |
Specifically, the joint probabilities. | ▶ 00:41 |
The probability of a positive test and having cancer. | ▶ 00:45 |
The probability of a negative test and having cancer, | ▶ 00:51 |
and this is not conditional anymore. | ▶ 00:53 |
It's now a joint probability. | ▶ 00:55 |
So, please give me those 4 values over here. | ▶ 00:57 |
And here the correct answer is 0.009, | ▶ 00:00 |
which is the product of your prior, 0.01, times the conditional, 0.9. | ▶ 00:05 |
Over here we get 0.001, the probability of our prior cancer times 0.1. | ▶ 00:12 |
Over here we get 0.198, | ▶ 00:21 |
the probability of not having cancer is 0.99 | ▶ 00:26 |
times still getting a positive reading, which is 0.2. | ▶ 00:29 |
And finally, we get 0.792, | ▶ 00:32 |
which is the probability of this guy over here, and this guy over here. | ▶ 00:37 |
Now, our next quiz, I want you to fill in the probability of | ▶ 00:00 |
the cancer given that we just received a positive test. | ▶ 00:04 |
And the correct answer is 0.043. | ▶ 00:00 |
So, even though I received a positive test, | ▶ 00:06 |
my probability of having cancer is just 4.3%, | ▶ 00:09 |
which is not very much given that the test itself is quite sensitive. | ▶ 00:14 |
It really gives me a 0.8 chance of getting a negative result if I don't have cancer. | ▶ 00:18 |
It gives me a 0.9 chance of detecting cancer given that I have cancer. | ▶ 00:26 |
Now, why does this come out so small? | ▶ 00:32 |
Well, let's just put all the cases together. | ▶ 00:35 |
You already know that we received a positive test. | ▶ 00:38 |
Therefore, this entry over here, and this entry over here are relevant. | ▶ 00:41 |
Now, the chance of having a positive test and having cancer is 0.009. | ▶ 00:47 |
Well, I might--when I receive a positive test--have cancer or not cancer, | ▶ 00:56 |
so we will just normalize by these 2 possible causes for the positive test, | ▶ 01:01 |
which is 0.009 + 0.198. | ▶ 01:06 |
Putting these 2 things together gives 0.009 over 0.207, | ▶ 01:11 |
which is approximately 0.043. | ▶ 01:20 |
Now, the interesting thing in this equation is that the chances | ▶ 01:23 |
of having seen a positive test result in the absence of cancer | ▶ 01:28 |
are still much, much higher than the chance of seeing a positive result | ▶ 01:32 |
in the presence of cancer, and that's because our prior for cancer | ▶ 01:35 |
is so small in the population that it's just very unlikely to have cancer. | ▶ 01:39 |
So, the additional information of a positive test | ▶ 01:44 |
only raised my posterior probability to 0.043. | ▶ 01:47 |
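The normalization just performed is easy to reproduce in a few lines of Python, using the numbers from the tables above:

```python
p_cancer = 0.01             # prior
p_pos_given_cancer = 0.9    # P(+ | cancer)
p_pos_given_free = 0.2      # P(+ | no cancer)

joint_pos_and_cancer = p_pos_given_cancer * p_cancer        # 0.009
joint_pos_and_free = p_pos_given_free * (1 - p_cancer)      # 0.198
posterior = joint_pos_and_cancer / (joint_pos_and_cancer + joint_pos_and_free)
print(round(posterior, 3))                                  # 0.043
```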
So, we've just learned about what's probably the most important | ▶ 00:00 |
piece of math for this class in statistics called Bayes Rule. | ▶ 00:03 |
It was invented by Reverend Thomas Bayes, who was a British mathematician | ▶ 00:09 |
and a Presbyterian minister in the 18th century. | ▶ 00:15 |
Bayes Rule is usually stated as follows: P of A given B where B is the evidence | ▶ 00:18 |
and A is the variable we care about is P of B given A times P of A over P of B. | ▶ 00:27 |
This expression, P of B given A, is called the likelihood. | ▶ 00:36 |
P of A is called the prior, and P of B is called the marginal likelihood. | ▶ 00:40 |
The expression over here, P of A given B, is called the posterior. | ▶ 00:46 |
The interesting thing here is the way the probabilities are reworded. | ▶ 00:50 |
Say we have evidence B. | ▶ 00:55 |
We know about B, but we really care about the variable A. | ▶ 00:57 |
So, for example, B is a test result. | ▶ 01:01 |
We don't care about the test result as much as we care about the fact | ▶ 01:03 |
whether we have cancer or not. | ▶ 01:06 |
This diagnostic reasoning--which is from evidence to its causes-- | ▶ 01:08 |
is turned upside down by Bayes Rule into a causal reasoning, | ▶ 01:16 |
which is given--hypothetically, if we knew the cause, | ▶ 01:22 |
what would be the probability of the evidence we just observed. | ▶ 01:27 |
But to correct for this inversion, we have to multiply | ▶ 01:31 |
by the prior of the cause to be the case in the first place, | ▶ 01:36 |
in this case, having cancer or not, | ▶ 01:40 |
and divide it by the probability of the evidence, P(B), | ▶ 01:42 |
which often is expanded using the theorem of total probability as follows. | ▶ 01:47 |
The probability of B is the sum, over all values lowercase a, of the probability of B | ▶ 01:52 |
conditional on A equals a, times the probability of A equals a. | ▶ 01:58 |
This is total probability as we already encountered it. | ▶ 02:04 |
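For reference, here is the rule just stated, written out as formulas; this is purely a restatement of the spoken math above:

```latex
P(A \mid B) \;=\; \frac{P(B \mid A)\, P(A)}{P(B)},
\qquad\text{where}\qquad
P(B) \;=\; \sum_{a} P(B \mid A = a)\, P(A = a).
```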
So, let's apply this to the cancer case | ▶ 02:08 |
and say we really care about whether you have cancer, | ▶ 02:10 |
which is our cause, conditioned on the evidence | ▶ 02:13 |
that is the result of this hidden cause, in this case, a positive test result. | ▶ 02:17 |
Let's just plug in the numbers. | ▶ 02:23 |
Our likelihood is the probability of seeing a positive test result | ▶ 02:25 |
given that you have cancer multiplied by the prior probability | ▶ 02:30 |
of having cancer over the probability of the positive test result, | ▶ 02:33 |
and that is--according to the tables we looked at before-- | ▶ 02:38 |
0.9 times a prior of 0.01 over-- | ▶ 02:43 |
now we're going to expand this right over here according to total probability | ▶ 02:50 |
which gives us 0.9 times 0.01-- | ▶ 02:55 |
that's the probability of + given that we do have cancer-- | ▶ 03:01 |
plus the probability of + given that we don't have cancer, which is 0.2, | ▶ 03:06 |
times the prior of not having cancer, which is 0.99. | ▶ 03:11 |
So, if we plug in the numbers we know about, we get 0.009 | ▶ 03:15 |
over 0.009 + 0.198. | ▶ 03:20 |
That is approximately 0.0434, which is the number we saw before. | ▶ 03:27 |
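As a quick check, this plug-in calculation can be reproduced in a few lines of Python, using only the numbers from the table above:

```python
# Bayes Rule for the cancer test, with the denominator expanded
# by total probability. All numbers come from the lecture's table.
p_c = 0.01           # prior P(cancer)
p_pos_c = 0.9        # P(+ | cancer)
p_pos_not_c = 0.2    # P(+ | not cancer), i.e. 1 - 0.8

p_pos = p_pos_c * p_c + p_pos_not_c * (1 - p_c)  # total probability: 0.207
print(p_pos_c * p_c / p_pos)                     # ~0.0434, as above
```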
So, if you want to draw Bayes rule graphically, | ▶ 00:00 |
we have a situation where we have an internal variable A, | ▶ 00:03 |
like whether I'm going to die of cancer, but we can't sense A. | ▶ 00:08 |
Instead, we have a second variable, called B, | ▶ 00:13 |
which is our test, and B is observable, but A isn't. | ▶ 00:16 |
This is a classical example of a Bayes network. | ▶ 00:21 |
The Bayes network is composed of 2 variables, A and B. | ▶ 00:26 |
We know the prior probability for A, | ▶ 00:30 |
and we know the conditional. | ▶ 00:33 |
A causes B--whether or not we have cancer, | ▶ 00:35 |
causes the test result to be positive or not, | ▶ 00:38 |
although there was some randomness involved. | ▶ 00:41 |
So, we know the probability of B given the different values for A, | ▶ 00:44 |
and what we care about in this specific instance is called diagnostic reasoning, | ▶ 00:49 |
which is the inverse of the causal reasoning, | ▶ 00:54 |
the probability of A given B or similarly, probability of A given not B. | ▶ 00:58 |
This is our very first Bayes network, and the graphical representation | ▶ 01:06 |
of drawing 2 variables, A and B, connected with an arc | ▶ 01:11 |
that goes from A to B is the graphical representation of a distribution | ▶ 01:15 |
of 2 variables that are specified in the structure over here, | ▶ 01:22 |
which has a prior probability and has a conditional probability as shown over here. | ▶ 01:26 |
Now, I do have a quick quiz for you. | ▶ 01:31 |
How many parameters does it take to specify | ▶ 01:34 |
the entire joint probability of A and B, or differently, the entire Bayes network? | ▶ 01:37 |
I'm not looking for structural parameters that relate to the graph over here. | ▶ 01:43 |
I'm just looking for the numerical parameters of the underlying probabilities. | ▶ 01:48 |
And the answer is 3. | ▶ 00:00 |
It takes 1 parameter to specify P of A from which we can derive P of not A. | ▶ 00:02 |
It takes 2 parameters to specify P of B given A and P of B given not A, | ▶ 00:09 |
from which we can derive P of not B given A and P of not B given not A. | ▶ 00:15 |
So, it's a total of 3 parameters for this Bayes network. | ▶ 00:21 |
So, we just encountered our very first Bayes network | ▶ 00:00 |
and did a number of interesting calculations. | ▶ 00:03 |
Let's now talk about Bayes Rule and look into more complex Bayes networks. | ▶ 00:06 |
I will look at Bayes Rule again and make an observation | ▶ 00:10 |
that is really non-trivial. | ▶ 00:13 |
Here is Bayes Rule, and in practice, what we find is | ▶ 00:15 |
this term here is relatively easy to compute. | ▶ 00:20 |
It's just a product, whereas this term is really hard to compute. | ▶ 00:23 |
However, this term over here does not depend on what we assume for variable A. | ▶ 00:28 |
It's just the function of B. | ▶ 00:33 |
So, suppose for a moment we also care about the complementary event of not A | ▶ 00:35 |
given B, for which Bayes Rule unfolds as follows. | ▶ 00:40 |
Then we find that the normalizer, P(B), is identical, | ▶ 00:43 |
whether we assume A on the left side or not A on the left side. | ▶ 00:47 |
We also know from prior work that P of A given B plus | ▶ 00:51 |
P of not A given B must be one because these are 2 complementary events. | ▶ 00:57 |
That allows us to compute Bayes Rule very differently | ▶ 01:03 |
by basically ignoring the normalizer, so here's how it goes. | ▶ 01:06 |
We compute P of A given B--and I want to call this P prime, | ▶ 01:11 |
because it's not a real probability--to be just P of B given A times P of A, | ▶ 01:16 |
omitting the normalizer, which is the denominator of the expression over here. | ▶ 01:23 |
We do the same thing with not A. | ▶ 01:28 |
So, in both cases, we compute the posterior probability non-normalized | ▶ 01:31 |
by omitting the normalizer B. | ▶ 01:36 |
And then we can recover the original probabilities by normalizing | ▶ 01:38 |
based on those values over here, so the probability of A given B, | ▶ 01:43 |
the actual probability, is a normalizer, eta, | ▶ 01:48 |
times this non-normalized form over here. | ▶ 01:52 |
The same is true for the negation of A over here. | ▶ 01:55 |
And eta is just the normalizer that results by adding these 2 values over here together, | ▶ 01:59 |
as shown over here, and taking 1 over that sum. | ▶ 02:06 |
So, take a look at this for a moment. | ▶ 02:10 |
What we've done is we deferred the calculation of the normalizer over here | ▶ 02:13 |
by computing pseudo probabilities that are non-normalized. | ▶ 02:18 |
This made the calculation much easier, and when we were done with everything, | ▶ 02:22 |
we just folded it back into the normalizer based on the resulting | ▶ 02:26 |
pseudo probabilities and got the correct answer. | ▶ 02:29 |
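Here is a minimal sketch of that deferred-normalization trick in Python; `normalize` is a helper name introduced just for this illustration:

```python
# Deferred normalization: compute pseudo probabilities
# P'(A|B) = P(B|A) P(A) for each hypothesis, then recover the real
# posteriors with the normalizer eta = 1 / (sum of pseudo values).
def normalize(pseudo):
    eta = 1.0 / sum(pseudo.values())
    return {hypothesis: eta * p for hypothesis, p in pseudo.items()}

# The one-test cancer example from before:
pseudo = {"cancer": 0.9 * 0.01, "not cancer": 0.2 * 0.99}
print(normalize(pseudo))   # {'cancer': ~0.0435, 'not cancer': ~0.9565}
```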
The reason why I gave you all this is because I want you to apply it now | ▶ 00:00 |
to a slightly more complicated problem, which is the 2-test cancer example. | ▶ 00:03 |
In this example, we again might have our unobservable cancer C, | ▶ 00:08 |
but now we're running 2 tests, test 1 and test 2. | ▶ 00:14 |
As before, the prior probability of cancer is 0.01. | ▶ 00:18 |
The probability of receiving a positive test result for either test, given that you have cancer, is 0.9. | ▶ 00:24 |
The probability of getting a negative result given that you're cancer free is 0.8. | ▶ 00:30 |
And from those, we were able to compute all the other probabilities, | ▶ 00:36 |
and we're just going to write them down over here. | ▶ 00:40 |
So, take a moment to just verify those. | ▶ 00:43 |
Now, let's assume both of my tests come back positive, | ▶ 00:46 |
so T1 = + and T2 = +. | ▶ 00:50 |
What's the probability of cancer now written in short form probability of | ▶ 00:56 |
C given ++? | ▶ 01:00 |
I want you to tell me what that is, and this is a non-trivial question. | ▶ 01:03 |
So, the correct answer is 0.1698 approximately, | ▶ 00:00 |
and to compute this, I used the trick I've shown you before. | ▶ 00:10 |
Let me write down the running count for cancer and for not cancer | ▶ 00:15 |
as I integrate the various multiplications in Bayes Rule. | ▶ 00:24 |
My prior for cancer was 0.01 and for non-cancer was 0.99. | ▶ 00:28 |
Then I get my first +, and the probability of a + given they have cancer is 0.9, | ▶ 00:37 |
and the same for non-cancer is 0.2. | ▶ 00:43 |
So, according to the non-normalized Bayes Rule, | ▶ 00:48 |
I now multiply these 2 things together to get my non-normalized probability | ▶ 00:52 |
of having cancer given the plus. | ▶ 00:58 |
Since multiplication is commutative, | ▶ 01:00 |
I can do the same thing again with my 2nd test result, 0.9 and 0.2, | ▶ 01:03 |
and I multiply all of these 3 things together to get my non-normalized probability | ▶ 01:09 |
P prime to be the following: 0.0081, if you multiply those things together, | ▶ 01:14 |
and 0.0396 if you multiply these factors together. | ▶ 01:21 |
And these are not probabilities yet. | ▶ 01:28 |
If we add those up for the 2 complementary cases of cancer and non-cancer, | ▶ 01:30 |
I get 0.0477. | ▶ 01:34 |
However, if I now divide, that is, I normalize | ▶ 01:38 |
those non-normalized probabilities over here by this factor over here, | ▶ 01:42 |
I actually get the correct posterior probability P of cancer given ++. | ▶ 01:47 |
And they look as follows: | ▶ 01:52 |
approximately 0.1698 and approximately 0.8301. | ▶ 01:54 |
Calculate for me the probability of cancer | ▶ 00:00 |
given that I received one positive and one negative test result. | ▶ 00:03 |
Please write your number into this box. | ▶ 00:08 |
We apply the same trick as before | ▶ 00:00 |
where we use the exact same prior of 0.01. | ▶ 00:03 |
Our first + gives us the following factors: 0.9 and 0.2. | ▶ 00:07 |
And our minus gives us the probability 0.1 for a negative test result given that we have cancer, | ▶ 00:13 |
and a 0.8 for a negative result given that we don't have cancer. | ▶ 00:20 |
We multiply those together. | ▶ 00:26 |
We get our non-normalized probability. | ▶ 00:28 |
And if we now normalize by the sum of those two things | ▶ 00:30 |
to turn this back into a probability, we get 0.0009 | ▶ 00:35 |
over the sum of those two things over here, and this is 0.0056 | ▶ 00:41 |
for the chance of having cancer and 0.9943 for the chance of being cancer free. | ▶ 00:50 |
And this adds up approximately to 1, and therefore, is a probability distribution. | ▶ 00:59 |
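The same running-count calculation can be sketched as a small function; `posterior` and its argument names are illustrative, not part of the lecture:

```python
# Start with the prior, multiply in one likelihood factor per test
# result, and normalize at the very end (the deferred-normalization
# trick from the previous videos).
def posterior(prior, factors):
    p_c, p_not_c = prior, 1.0 - prior
    for f_c, f_not_c in factors:      # (P(result | C), P(result | not C))
        p_c *= f_c
        p_not_c *= f_not_c
    total = p_c + p_not_c             # the normalizer
    return p_c / total, p_not_c / total

print(posterior(0.01, [(0.9, 0.2), (0.9, 0.2)]))  # ++ : ~0.1698, ~0.8302
print(posterior(0.01, [(0.9, 0.2), (0.1, 0.8)]))  # +- : ~0.0056, ~0.9944
```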
I want to use a few words of terminology. | ▶ 00:00 |
This, again, is a Bayes network, of which the hidden variable C | ▶ 00:03 |
causes the still stochastic test outcomes T1 and T2. | ▶ 00:08 |
And what is really important is that we assume not just | ▶ 00:16 |
that T1 and T2 are identically distributed. | ▶ 00:19 |
We use the same 0.9 for test 1 as we use for test 2, | ▶ 00:22 |
but we also assume that they are conditionally independent. | ▶ 00:27 |
We assumed that if God told us whether we actually had cancer or not, | ▶ 00:31 |
if we knew with absolute certainty the value of the variable C, | ▶ 00:37 |
that knowing anything about T1 would not help us make a statement about T2. | ▶ 00:41 |
Put differently, we assumed that the probability of T2 given C and T1 | ▶ 00:48 |
is the same as the probability of T2 given C. | ▶ 00:55 |
This is called conditional independence, conditioned on the value of the cancer variable C. | ▶ 01:00 |
If you knew this for a fact, then T2 would be independent of T1. | ▶ 01:08 |
It's conditionally independent because the independence only holds true | ▶ 01:17 |
if we actually know C, and it comes out of this diagram over here. | ▶ 01:21 |
If we look at this diagram, if you knew the variable C over here, | ▶ 01:26 |
then C separately causes T1 and T2. | ▶ 01:32 |
So, as a result, if you know C, whatever happens over here | ▶ 01:39 |
is causally cut off from what happens over here. | ▶ 01:46 |
That causes these 2 variables to be conditionally independent. | ▶ 01:48 |
So, conditional independence is a really big thing in Bayes networks. | ▶ 01:52 |
Here's a Bayes network where A causes B and C, | ▶ 01:58 |
and for a Bayes network of this structure, we know that given A, | ▶ 02:02 |
B and C are independent. | ▶ 02:08 |
It's written as B conditionally independent of C given A. | ▶ 02:11 |
So, here's a question. | ▶ 02:16 |
Suppose we have conditional independence between B and C given A. | ▶ 02:18 |
Would that imply--and there's my question--that B and C are independent? | ▶ 02:21 |
So, suppose we don't know A. | ▶ 02:28 |
We don't know whether we have cancer, for example. | ▶ 02:30 |
The question is whether the test results individually are still independent of each other | ▶ 02:33 |
even if we don't know about the cancer situation. | ▶ 02:38 |
Please answer yes or no. | ▶ 02:42 |
And the correct answer is no. | ▶ 00:00 |
Intuitively, getting a positive test result about cancer | ▶ 00:03 |
gives us information about whether you have cancer or not. | ▶ 00:08 |
So if you get a positive test result | ▶ 00:13 |
you're going to raise the probability of having cancer | ▶ 00:15 |
relative to the prior probability. | ▶ 00:18 |
With that increased probability we will predict | ▶ 00:20 |
that another test will with a higher likelihood | ▶ 00:24 |
give us a positive response than if we hadn't taken the previous test. | ▶ 00:27 |
That's really important to understand. | ▶ 00:33 |
To make sure we understand it, let me have you calculate those probabilities. | ▶ 00:36 |
Let me draw the cancer example again with two tests. | ▶ 00:00 |
Here's my cancer variable | ▶ 00:05 |
and then there's two conditionally independent tests T1 and T2. | ▶ 00:07 |
And as before let me assume that the prior probability of cancer is 0.01 | ▶ 00:13 |
What I want you to compute for me is the probability of the second test | ▶ 00:19 |
to be positive if we know that the first test was positive. | ▶ 00:26 |
So write this into the following box. | ▶ 00:33 |
So, for this one, we want to apply total probability. | ▶ 00:00 |
This thing over here is the same as probability of test 2 to be positive, | ▶ 00:04 |
which I'm going to abbreviate with a +2 over here, | ▶ 00:10 |
conditioned on test 1 being positive and me having cancer | ▶ 00:14 |
times the probability of me having cancer given test 1 was positive plus | ▶ 00:19 |
the probability of test 2 being positive conditioned on test 1 being positive | ▶ 00:25 |
and me not having cancer times the probability of me not having cancer | ▶ 00:31 |
given that test 1 is positive. | ▶ 00:36 |
That's the same as the theorem of total probability, | ▶ 00:38 |
but now everything is conditioned on +1. | ▶ 00:42 |
Take a moment to verify this. | ▶ 00:46 |
Now, here I can plug in the numbers. | ▶ 00:48 |
You already calculated this one before, which is approximately 0.043, | ▶ 00:50 |
and this one over here is 1 minus that, which is 0.957 approximately. | ▶ 00:57 |
And this term over here now exploits conditional independence, | ▶ 01:05 |
which is given that I know C, knowledge of the first test | ▶ 01:09 |
gives me no more information about the second test. | ▶ 01:14 |
It only gives me information if C was unknown, as was the case over here. | ▶ 01:17 |
So, I can rewrite this thing over here as follows: | ▶ 01:21 |
P of +2 given that I have cancer. | ▶ 01:24 |
I can drop the +1, and the same is true over here. | ▶ 01:27 |
This is exploiting my conditional independence. | ▶ 01:31 |
I knew that P of +2 conditioned on C | ▶ 01:34 |
is the same as P of +2 conditioned on C and +1. | ▶ 01:41 |
I can now read those off my table over here, | ▶ 01:47 |
which is 0.9 times 0.043 plus 0.2, | ▶ 01:50 |
which is 1 minus 0.8 over here times 0.957, | ▶ 01:58 |
which gives me approximately 0.2301. | ▶ 02:03 |
So, that says if my first test comes in positive, | ▶ 02:09 |
I expect my second test to be positive with probability 0.2301. | ▶ 02:14 |
That's an increase over the default probability, | ▶ 02:21 |
which we calculated before: the probability of any single test, like test 2, | ▶ 02:24 |
coming in positive was the normalizer of Bayes Rule, which was 0.207. | ▶ 02:29 |
So, my first test has a 20% chance of coming in positive. | ▶ 02:38 |
My second test, after seeing a positive test, | ▶ 02:43 |
has now an increased probability of about 23% of coming in positive. | ▶ 02:47 |
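This prediction step is short enough to verify directly; here is a sketch using the exact, unrounded posterior:

```python
# P(+2 | +1) by total probability, conditioning on C and using the
# conditional independence of the two tests given C.
p_c_pos1 = 0.009 / 0.207                             # P(C | +1), ~0.0435
p_pos2_pos1 = 0.9 * p_c_pos1 + 0.2 * (1 - p_c_pos1)
print(p_pos2_pos1)   # ~0.2304 (0.2301 above, with the rounded 0.043)
```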
So, now we've learned about independence, | ▶ 00:00 |
and the corresponding Bayes network has 2 nodes. | ▶ 00:02 |
They're just not connected at all. | ▶ 00:04 |
And we learned about conditional independence, | ▶ 00:07 |
in which case we have a Bayes network that looks like this. | ▶ 00:09 |
Now I would like to know whether absolute independence | ▶ 00:12 |
implies conditional independence. | ▶ 00:16 |
True or false? | ▶ 00:18 |
And I'd also like to know whether conditional independence implies absolute independence. | ▶ 00:20 |
Again, true or false? | ▶ 00:25 |
And the answer is both of them are false. | ▶ 00:00 |
We already saw that conditional independence, as shown over here, | ▶ 00:03 |
doesn't give us absolute independence. | ▶ 00:07 |
So, for example, this is test #1 and test #2. | ▶ 00:09 |
You might or might not have cancer. | ▶ 00:13 |
Our first test gives us information about whether you have cancer or not. | ▶ 00:15 |
As a result, we've changed our prior probability | ▶ 00:18 |
for the second test to come in positive. | ▶ 00:21 |
That means that conditional independence does not imply absolute independence, | ▶ 00:24 |
which means this assumption here is false, | ▶ 00:30 |
and it also turns out that if you have absolute independence, | ▶ 00:32 |
things might not be conditionally independent for reasons that I can't quite explain so far, | ▶ 00:37 |
but that we will learn about next. | ▶ 00:43 |
[Thrun] For my next example, I will study a different type of a Bayes network. | ▶ 00:00 |
Before, we've seen networks of the following type, | ▶ 00:04 |
where a single hidden cause caused 2 different measurements. | ▶ 00:08 |
I now want to study a network that looks just like the opposite. | ▶ 00:13 |
We have 2 independent hidden causes, | ▶ 00:17 |
but they get confounded within a single observational variable. | ▶ 00:20 |
I would like to use the example of happiness. | ▶ 00:26 |
Suppose I can be happy or unhappy. | ▶ 00:29 |
What makes me happy is when the weather is sunny or if I get a raise in my job, | ▶ 00:33 |
which means I make more money. | ▶ 00:41 |
So let's call this sunny, let's call this a raise, and call this happiness. | ▶ 00:43 |
Perhaps the probability of it being sunny is 0.7, | ▶ 00:47 |
probability of a raise is 0.01. | ▶ 00:53 |
And I will tell you that the probability of being happy is governed as follows. | ▶ 00:58 |
The probability of being happy given that both of these things occur-- | ▶ 01:05 |
I got a raise and it is sunny--is 1. | ▶ 01:09 |
The probability of being happy given that it is not sunny and I still got a raise is 0.9. | ▶ 01:13 |
The probability of being happy given that it's sunny but I didn't get a raise is 0.7. | ▶ 01:20 |
And the probability of being happy given that it is neither sunny nor did I get a raise is 0.1. | ▶ 01:27 |
This is a perfectly fine specification of a probability distribution | ▶ 01:35 |
where 2 causes affect the variable down here, the happiness. | ▶ 01:39 |
So I'd like you to calculate for me the following questions. | ▶ 01:46 |
Probability of a raise given that it is sunny, according to this model. | ▶ 01:50 |
Please enter your answer over here. | ▶ 01:57 |
[Thrun] The answer is surprisingly simple. | ▶ 00:00 |
It is 0.01. | ▶ 00:03 |
How do I know this so fast? | ▶ 00:05 |
Well, if you look at this Bayes network, | ▶ 00:08 |
both the sunniness and the question whether I got a raise impact my happiness. | ▶ 00:12 |
But since I don't know anything about the happiness, | ▶ 00:21 |
there is no way that just the weather might impact whether I get a raise or not. | ▶ 00:24 |
In fact, it might be independently sunny, and I might independently get a raise at work. | ▶ 00:32 |
There is no mechanism of which these 2 things would co-occur. | ▶ 00:39 |
Therefore, the probability of a raise given that it's sunny | ▶ 00:46 |
is just the same as the probability of a raise given any weather, which is 0.01. | ▶ 00:49 |
[Thrun] Let me talk about a really interesting special instance of Bayes net reasoning | ▶ 00:00 |
which is called explaining away. | ▶ 00:07 |
And I'll first give you the intuitive answer, | ▶ 00:10 |
then I'll wish you to compute probabilities for me that manifest the explain away effect | ▶ 00:14 |
in a Bayes network of this type. | ▶ 00:19 |
Explaining away means that if we know that we are happy, | ▶ 00:22 |
then sunny weather can explain away the cause of happiness. | ▶ 00:27 |
If I then also know that it's sunny, it becomes less likely that I received a raise. | ▶ 00:34 |
Let me put this differently. | ▶ 00:41 |
Suppose I'm a happy guy on a specific day | ▶ 00:43 |
and my wife asks me, "Sebastian, why are you so happy?" | ▶ 00:45 |
"Is it sunny, or did you get a raise?" | ▶ 00:49 |
If she then looks outside and sees it is sunny, | ▶ 00:52 |
then she might explain to herself, | ▶ 00:55 |
"Well, Sebastian is happy because it is sunny." | ▶ 00:57 |
"That makes it effectively less likely that he got a raise | ▶ 01:00 |
"because I could already explain his happiness by it being sunny." | ▶ 01:05 |
If she looks outside and it is rainy, | ▶ 01:10 |
that makes it more likely I got a raise, | ▶ 01:13 |
because the weather can't really explain my happiness. | ▶ 01:16 |
In other words, if we see a certain effect that could be caused by multiple causes, | ▶ 01:20 |
seeing one of those causes can explain away any other potential cause | ▶ 01:27 |
of this effect over here. | ▶ 01:33 |
So let me put this in numbers and ask you the challenging question of | ▶ 01:36 |
what's the probability of a raise given that I'm happy and it's sunny? | ▶ 01:43 |
[Thrun] The answer is approximately 0.0142, | ▶ 00:00 |
and it is an exercise in expanding this term using Bayes' rule, | ▶ 00:07 |
using total probability, which I'll just do for you. | ▶ 00:11 |
Using Bayes' rule, you can transform this into P of H given R comma S | ▶ 00:16 |
times P of R given S over P of H given S. | ▶ 00:24 |
We observe the conditional independence of R and S | ▶ 00:34 |
to simplify this to just P of R, | ▶ 00:37 |
and the denominator is expanded by folding in R and not R, | ▶ 00:40 |
P of H given R comma S | ▶ 00:46 |
times P of R plus P of H given not R and S | ▶ 00:49 |
times P of not R, which is total probability. | ▶ 00:54 |
We can now read off the numbers from the tables over here, | ▶ 00:58 |
which gives us 1 times 0.01 divided by this expression | ▶ 01:01 |
that is the same as the expression over here, so 0.01 plus this thing over here, | ▶ 01:10 |
which you can find over here to be 0.7, times this guy over here, | ▶ 01:17 |
which is 1 minus the value over here, 0.99, | ▶ 01:23 |
which gives us approximately 0.0142. | ▶ 01:27 |
[Thrun] Now, to understand the explain away effect, | ▶ 00:00 |
you have to compare this to the probability of a raise given that we're just happy | ▶ 00:04 |
and we don't know anything about the weather. | ▶ 00:11 |
So let's do that exercise next. | ▶ 00:14 |
So my next quiz is, what's the probability of a raise given that all I know is that I'm happy | ▶ 00:16 |
and I don't know about the weather? | ▶ 00:24 |
This happens to be once again a pretty complicated question, so take your time. | ▶ 00:26 |
[Thrun] So this is a difficult question. | ▶ 00:00 |
Let me compute an auxiliary variable, which is P of happiness. | ▶ 00:02 |
That one is expanded by looking at the different conditions that can make us happy. | ▶ 00:12 |
P of happiness given S and R | ▶ 00:19 |
times P of S and R, which is of course the product of those 2 | ▶ 00:24 |
because they are independent, | ▶ 00:29 |
plus P of happiness given not S and R, times the probability of not S and R, | ▶ 00:31 |
plus P of H given S and not R | ▶ 00:39 |
times the probability of S and not R, plus the last case, | ▶ 00:43 |
P of H given not S and not R times the probability of not S and not R. | ▶ 00:48 |
So this just looks at the happiness under all 4 combinations of the variables | ▶ 00:52 |
that can lead to happiness. | ▶ 00:56 |
And you can plug those straight in. | ▶ 00:58 |
This one over here is 1, and this one over here is the product of P of S and P of R, | ▶ 01:00 |
which is 0.7 times 0.01. | ▶ 01:05 |
And as you plug all of those in, | ▶ 01:10 |
you get as a result 0.5245. | ▶ 01:14 |
That's P of H. | ▶ 01:21 |
Just take some time and do the math by going through these different cases | ▶ 01:24 |
using total probability, and you get this result. | ▶ 01:28 |
Armed with this number, the rest now becomes easy, | ▶ 01:32 |
which is we can use Bayes' rule to turn this around. | ▶ 01:38 |
P of H given R times P of R over P of H. | ▶ 01:43 |
P of R we know from over here, the probability of a raise is 0.01. | ▶ 01:49 |
So the only thing we need to compute now is P of H given R. | ▶ 01:54 |
And again, we apply total probability. | ▶ 01:57 |
Let me just do this over here. | ▶ 01:59 |
We can factor P of H given R as P of H given R and S, sunny, | ▶ 02:02 |
times probability of sunny plus P of H given R and not sunny | ▶ 02:09 |
times the probability of not sunny. | ▶ 02:14 |
And if you plug in the numbers with this, you get 1 times 0.7 | ▶ 02:16 |
plus 0.9 times 0.3. | ▶ 02:21 |
That happens to be 0.97. | ▶ 02:25 |
So if we now plug this all back into this equation over here, | ▶ 02:30 |
we get 0.97 times 0.01 over 0.5245. | ▶ 02:33 |
This gives us approximately as the correct answer 0.0185. | ▶ 02:45 |
[Thrun] And if you got this right, I will be deeply impressed | ▶ 00:00 |
about the fact you got this right. | ▶ 00:04 |
But the interesting thing now to observe is if we happen to know it's sunny | ▶ 00:07 |
and I'm happy, then the probability of a raise is 1.4%, or 0.014. | ▶ 00:13 |
If I don't know about the weather and I'm happy, | ▶ 00:21 |
then the probability of a raise goes up to about 1.85%. | ▶ 00:26 |
Why is that? | ▶ 00:30 |
Well, it's the explaining away effect. | ▶ 00:32 |
My happiness is well explained by the fact that it's sunny. | ▶ 00:35 |
So if someone observes me to be happy and asks the question, | ▶ 00:40 |
"Is this because Sebastian got a raise at work?" | ▶ 00:43 |
well, if you know it's sunny and this is a fairly good explanation for me being happy, | ▶ 00:46 |
you don't have to assume I got a raise. | ▶ 00:53 |
If you don't know about the weather, then obviously the chances are higher | ▶ 00:55 |
that the raise caused my happiness, | ▶ 01:01 |
and therefore this number goes up from 0.014 to 0.018. | ▶ 01:03 |
Let me ask you one final question in this next quiz, | ▶ 01:10 |
which is the probability of the raise given that I look happy and it's not sunny. | ▶ 01:14 |
This is the most extreme case for making a raise likely | ▶ 01:23 |
because I am a happy guy, and it's definitely not caused by the weather. | ▶ 01:27 |
So it could be just random, or it could be caused by the raise. | ▶ 01:33 |
So please calculate this number for me and enter it into this box. | ▶ 01:37 |
[Thrun] Well, the answer follows the exact same scheme as before, | ▶ 00:00 |
with S being replaced by not S. | ▶ 00:04 |
So this should be an easier question for you to answer. | ▶ 00:08 |
P of R given H and not S can be inverted by Bayes' rule to be as follows. | ▶ 00:11 |
Once we apply Bayes' rule, as indicated over here where we swapped H to the left side | ▶ 00:20 |
and R to the right side, you can observe that this value over here | ▶ 00:24 |
can be readily found in the table. | ▶ 00:29 |
It's actually the 0.9 over there. | ▶ 00:32 |
This value over here, the raise is independent of the weather | ▶ 00:35 |
by virtue of our Bayes network, so it's just 0.01. | ▶ 00:41 |
And as before, we apply total probability to the expression over here, | ▶ 00:45 |
and we observe in this quotient that the numerator and the first term of the denominator are the same. | ▶ 00:52 |
P of H given not S, not R is the value over here, | ▶ 00:58 |
and the 0.99 is the complement of probability of R taken from over here, | ▶ 01:03 |
and that ends up being 0.0833. | ▶ 01:08 |
This would have been the right answer. | ▶ 01:16 |
[Thrun] It's really interesting to compare this to the situation over here. | ▶ 00:00 |
In both cases I'm happy, as shown over here, | ▶ 00:04 |
and I ask the same question, which is whether I got a raise at work, as R over here. | ▶ 00:08 |
But in one case I observe that the weather is sunny; in the other one it isn't. | ▶ 00:15 |
And look what it does to my probability of having received a raise. | ▶ 00:21 |
The sunniness perfectly well explains my happiness, | ▶ 00:25 |
and my probability of having received a raise ends up being a mere 1.4%, or 0.014. | ▶ 00:30 |
However, if my wife observes it to be non-sunny, then it is much more likely | ▶ 00:41 |
that the cause of my happiness is related to a raise at work, | ▶ 00:47 |
and now the probability is 8.3%, which is significantly higher than the 1.4% before. | ▶ 00:51 |
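All three of these posteriors can be checked by brute-force enumeration of the joint over S, R, and H; the helper `prob` below is a name made up just for this sketch:

```python
# Enumerate the joint P(S, R, H) from the happiness tables above and
# compute conditional probabilities by summing matching entries.
p_s, p_r = 0.7, 0.01
p_h = {(True, True): 1.0, (False, True): 0.9,    # P(H | S, R)
       (True, False): 0.7, (False, False): 0.1}

def prob(s=None, r=None, h=None):
    total = 0.0
    for S in (True, False):
        for R in (True, False):
            for H in (True, False):
                if s is not None and S != s: continue
                if r is not None and R != r: continue
                if h is not None and H != h: continue
                p = (p_s if S else 1 - p_s) * (p_r if R else 1 - p_r)
                p *= p_h[(S, R)] if H else 1 - p_h[(S, R)]
                total += p
    return total

print(prob(r=True, h=True, s=True) / prob(h=True, s=True))    # ~0.0142
print(prob(r=True, h=True) / prob(h=True))                    # ~0.0185
print(prob(r=True, h=True, s=False) / prob(h=True, s=False))  # ~0.0833
```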
This is a Bayes network of which S and R are independent | ▶ 00:58 |
but H adds a dependence between S and R. | ▶ 01:04 |
Let me talk about this in a little bit more detail on the next paper. | ▶ 01:10 |
So here is our Bayes network again. | ▶ 01:16 |
In our previous exercises, we computed for this network | ▶ 01:18 |
that the probability of a raise of R given any of these variables shown here was as follows. | ▶ 01:22 |
The really interesting thing is that in the absence of information about H, | ▶ 01:29 |
which is the middle case over here, | ▶ 01:34 |
the probability of R is unaffected by knowledge of S-- | ▶ 01:37 |
that is, R and S are independent. | ▶ 01:41 |
P of R given S is the same as the probability of R; | ▶ 01:46 |
R and S are independent. | ▶ 01:49 |
However, if I know something about the variable H, | ▶ 01:56 |
then S and R become dependent-- | ▶ 02:02 |
that is, knowing about my happiness over here renders S and R dependent. | ▶ 02:06 |
This is not the same as probability of just R given H. | ▶ 02:15 |
Obviously, it isn't because if I now vary S from S to not S, | ▶ 02:23 |
it affects my probability for the variable R. | ▶ 02:28 |
That is a really unusual situation | ▶ 02:33 |
where we have R and S are independent | ▶ 02:36 |
but given the variable H, R and S are not independent anymore. | ▶ 02:40 |
So knowledge of H makes 2 variables that previously were independent non-independent. | ▶ 02:50 |
Put differently, 2 variables that are independent may, in certain cases, | ▶ 02:58 |
not be conditionally independent. | ▶ 03:06 |
Independence does not imply conditional independence. | ▶ 03:08 |
[Thrun] So we're now ready to define Bayes networks in a more general way. | ▶ 00:00 |
Bayes networks define probability distributions over graphs or random variables. | ▶ 00:05 |
Here is an example graph of 5 variables, | ▶ 00:10 |
and this Bayes network defines the distribution over those 5 random variables. | ▶ 00:14 |
Instead of enumerating all possibilities of combinations of these 5 random variables, | ▶ 00:19 |
the Bayes network is defined by probability distributions | ▶ 00:24 |
that are inherent to each individual node. | ▶ 00:28 |
For node A and B, we just have a distribution P of A and P of B | ▶ 00:32 |
because A and B have no incoming arcs. | ▶ 00:38 |
C is a conditional distribution conditioned on A and B. | ▶ 00:42 |
D and E are conditioned on C. | ▶ 00:47 |
The joint probability represented by a Bayes network | ▶ 00:52 |
is the product of various Bayes network probabilities | ▶ 00:56 |
that are defined over individual nodes | ▶ 01:00 |
where each node's probability is only conditioned on the incoming arcs. | ▶ 01:03 |
So A has no incoming arc; therefore, we just write P of A. | ▶ 01:08 |
C has 2 incoming arcs, so we define the probability of C conditioned on A and B. | ▶ 01:12 |
And D and E have 1 incoming arc that's shown over here. | ▶ 01:18 |
The definition of this joint distribution by using the following factors | ▶ 01:22 |
has one really big advantage. | ▶ 01:27 |
Whereas the joint distribution over any 5 variables requires 2 to the 5 minus 1, | ▶ 01:30 |
which is 31 probability values, | ▶ 01:40 |
the Bayes network over here only requires 10 such values. | ▶ 01:43 |
P of A is one value, for which we can derive P of not A. | ▶ 01:48 |
Same for P of B. | ▶ 01:53 |
P of C given A and B is given by a distribution over C | ▶ 01:55 |
conditioned on any combination of A and B, of which there are 4, since A and B are binary. | ▶ 02:02 |
P of D given C is 2 parameters for P of D given C and P of D given not C. | ▶ 02:07 |
And the same is true for P of E given C. | ▶ 02:15 |
So if you add those up, you get 10 parameters in total. | ▶ 02:18 |
So the compactness of the Bayes network | ▶ 02:21 |
leads to a representation that scales significantly better to large networks | ▶ 02:25 |
than the combinatorial approach, which goes through all combinations of variable values. | ▶ 02:31 |
That is a key advantage of Bayes networks, | ▶ 02:36 |
and that is the reason why Bayes networks are being used so extensively | ▶ 02:39 |
for all kinds of problems. | ▶ 02:43 |
So here is a quiz. | ▶ 02:45 |
How many probability values are required to specify this Bayes network? | ▶ 02:47 |
Please put your answer in the following box. | ▶ 02:51 |
[Thrun] And the answer is 13. | ▶ 00:00 |
One over here, 2 over here, and 4 over here. | ▶ 00:03 |
Simply speaking, any variable that has K inputs requires 2 to the K such parameters. | ▶ 00:06 |
So in total we have 1, 9, 13. | ▶ 00:15 |
[Thrun] Here's another quiz. | ▶ 00:00 |
How many parameters do we need to specify the joint distribution | ▶ 00:02 |
for this Bayes network over here | ▶ 00:06 |
where A, B, and C point into D, D points into E, F, and G | ▶ 00:09 |
and C also points into G? | ▶ 00:13 |
Please write your answer into this box. | ▶ 00:15 |
[Thrun] And the answer is 19. | ▶ 00:00 |
So 1 here, 1 here, 1 here, 2 here, 2 here, 2 arcs point into G, which makes for 4, | ▶ 00:02 |
and 3 arcs point into D. Two to the 3 is 8. | ▶ 00:09 |
So we get 1 + 1 + 1 + 8 + 2 + 2 + 4. If you add those up, it's 19. | ▶ 00:13 |
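The counting rule generalizes to a one-line function; `num_parameters` is a name introduced just for this sketch:

```python
# Each node with k parents needs 2**k numbers: one P(node = true | ...)
# per assignment of its parents (the complements come for free).
def num_parameters(parents):
    return sum(2 ** len(p) for p in parents.values())

# The 5-node network A, B -> C; C -> D, E from before:
print(num_parameters({"A": [], "B": [], "C": ["A", "B"],
                      "D": ["C"], "E": ["C"]}))                    # 10

# This quiz's network: A, B, C -> D; D -> E, F; D, C -> G:
print(num_parameters({"A": [], "B": [], "C": [], "D": ["A", "B", "C"],
                      "E": ["D"], "F": ["D"], "G": ["D", "C"]}))   # 19
```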
[Thrun] And here is our car network which we discussed at the very beginning of this unit. | ▶ 00:00 |
How many parameters do we need to specify this network? | ▶ 00:06 |
Remember, there are 16 total variables, | ▶ 00:11 |
and the naive joint over the 16 will be 2 to the 16th minus 1, which is 65,535. | ▶ 00:15 |
Please write your answer into this box over here. | ▶ 00:25 |
[Thrun] To answer this question, let us add up these numbers. | ▶ 00:00 |
Battery age is 1, 1, 1. | ▶ 00:04 |
This has 1 incoming arc, so it's 2. | ▶ 00:08 |
Two incoming arcs makes 4. | ▶ 00:10 |
One incoming arc is 2, 2 equals 4. | ▶ 00:13 |
Four incoming arcs makes 16. | ▶ 00:17 |
If we add up all these numbers, we get 47. | ▶ 00:21 |
[Thrun] So it takes 47 numerical probabilities to specify the joint | ▶ 00:00 |
compared to 65,535 if you didn't have the graph-like structure. | ▶ 00:05 |
I think this example really illustrates the advantage | ▶ 00:11 |
of compact Bayes network representations over unstructured joint representations. | ▶ 00:14 |
[Thrun] The next concept I'd like to teach you is called D-separation. | ▶ 00:00 |
And let me start the discussion of this concept by a quiz. | ▶ 00:04 |
We have here a Bayes network, | ▶ 00:09 |
and I'm going to ask you a conditional independence question. | ▶ 00:11 |
Is C independent of A? | ▶ 00:16 |
Please tell me yes or no. | ▶ 00:20 |
Is C independent of A given B? | ▶ 00:22 |
Is C independent of D? | ▶ 00:27 |
Is C independent of D given A? | ▶ 00:30 |
And is E independent of C given D? | ▶ 00:32 |
[Thrun] So C is not independent of A. | ▶ 00:00 |
In fact, A influences C by virtue of B. | ▶ 00:04 |
But if you know B, then A becomes independent of C, | ▶ 00:09 |
which means the only determinate into C is B. | ▶ 00:13 |
If you know B for sure, then knowledge of A won't really tell you anything about C. | ▶ 00:17 |
C is also not independent of D, just the same way C is not independent of A. | ▶ 00:22 |
If I learn something about D, I can infer more about C. | ▶ 00:27 |
But if I do know A, then it's hard to imagine how knowledge of D would help me with C | ▶ 00:31 |
because I can't learn anything more about A than knowing A already. | ▶ 00:38 |
Therefore, given A, C and D are independent. | ▶ 00:42 |
The same is true for E and C. | ▶ 00:45 |
If we know D, then E and C become independent. | ▶ 00:48 |
[Thrun] In this specific example, the rule that we could apply is very, very simple. | ▶ 00:00 |
Any 2 variables are independent if they're not linked by a chain of just unknown variables. | ▶ 00:04 |
So for example, if we know B, then everything downstream of B | ▶ 00:10 |
becomes independent of anything upstream of B. | ▶ 00:14 |
E is now independent of C, conditioned on B. | ▶ 00:18 |
However, knowledge of B does not render A and E independent. | ▶ 00:22 |
In this graph over here, A and B connect to C and C connects to D and to E. | ▶ 00:26 |
So let me ask you, is A independent of E, | ▶ 00:33 |
A independent of E given B, | ▶ 00:37 |
A independent of E given C, | ▶ 00:39 |
A independent of B, | ▶ 00:41 |
and A independent of B given C? | ▶ 00:43 |
[Thrun] And the answer for this one is really interesting. | ▶ 00:00 |
A is clearly not independent of E because through C we can see an influence of A to E. | ▶ 00:03 |
Given B, that doesn't change. | ▶ 00:08 |
A still influences C, despite the fact we know B. | ▶ 00:11 |
However, if we know C, the influence is cut off. | ▶ 00:15 |
There is no way A can influence E if we know C. | ▶ 00:18 |
A is clearly independent of B. | ▶ 00:22 |
They are different entry variables. They have no incoming arcs. | ▶ 00:25 |
But here is the caveat. | ▶ 00:29 |
Given C, A and B become dependent. | ▶ 00:32 |
So whereas initially A and B were independent, | ▶ 00:35 |
if you give C, they become dependent. | ▶ 00:38 |
And the reason why they become dependent we've studied before. | ▶ 00:41 |
This is the explain away effect. | ▶ 00:44 |
If you know, for example, C to be true, | ▶ 00:48 |
then knowledge of A will substantially affect what we believe about B. | ▶ 00:51 |
If there's 2 joint causes for C and we happen to know A is true, | ▶ 00:57 |
we will discredit cause B. | ▶ 01:02 |
If we happen to know A is false, we will increase our belief for the cause B. | ▶ 01:04 |
That was an effect we studied extensively in the happiness example I gave you before. | ▶ 01:09 |
The interesting thing here is we are facing a situation | ▶ 01:15 |
where knowledge of variable C renders previously independent variables dependent. | ▶ 01:19 |
[Thrun] This leads me to the general study of conditional independence in Bayes networks, | ▶ 00:00 |
often called D-separation or reachability. | ▶ 00:06 |
D-separation is best studied by so-called active triplets and inactive triplets | ▶ 00:10 |
where active triplets render variables dependent | ▶ 00:17 |
and inactive triplets render them independent. | ▶ 00:20 |
Any chain of 3 variables like this makes the initial and final variable dependent | ▶ 00:23 |
if all variables are unknown. | ▶ 00:30 |
However, if the center variable is known-- | ▶ 00:32 |
that is, it's behind the conditioning bar-- | ▶ 00:35 |
then this variable and this variable become independent. | ▶ 00:38 |
So if we have a structure like this and it's quote-unquote cut off | ▶ 00:42 |
by a known variable in the middle, that separates, or d-separates, | ▶ 00:47 |
the left variable from the right variable, and they become independent. | ▶ 00:53 |
Similarly, any structure like this renders the left variable and the right variable dependent | ▶ 00:57 |
unless the center variable is known, | ▶ 01:04 |
in which case the left and right variable become independent. | ▶ 01:08 |
Another active triplet now requires knowledge of a variable. | ▶ 01:12 |
This is the explain away case. | ▶ 01:16 |
If this variable is known for a Bayes network that converges into a single variable, | ▶ 01:19 |
then this variable and this variable over here become dependent. | ▶ 01:25 |
Contrast this with a case where all variables are unknown. | ▶ 01:29 |
A situation like this means that the variables on the left and on the right are actually independent. | ▶ 01:33 |
In a single final example, we also get dependence if we have the following situation: | ▶ 01:40 |
a direct successor of a conversion variable is known. | ▶ 01:48 |
So it is sufficient if a successor of this variable is known. | ▶ 01:52 |
The variable itself does not have to be known, | ▶ 01:57 |
and the reason is if you know this guy over here, | ▶ 01:59 |
we get knowledge about this guy over here. | ▶ 02:02 |
And by virtue of that, the case over here essentially applies. | ▶ 02:05 |
If you look at those rules, | ▶ 02:09 |
those rules allow you to determine for any Bayes network | ▶ 02:11 |
whether variables are dependent or not dependent given the evidence you have. | ▶ 02:15 |
If you color the nodes dark for which you do have evidence, | ▶ 02:20 |
then you can use these rules to understand whether any 2 variables | ▶ 02:25 |
are conditionally independent or not. | ▶ 02:29 |
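These rules can be turned into code. The sketch below is one standard way to do it, a reachability check often called the Bayes-ball algorithm; it is an illustration under that formulation, not the lecture's own implementation:

```python
# X and Y are conditionally independent given the evidence set z exactly
# when no "ball" started at X can reach Y under the triplet rules above.
def d_separated(x, y, z, parents):
    children = {n: [] for n in parents}
    for n, ps in parents.items():
        for p in ps:
            children[p].append(n)
    frontier = [(x, "from_child")]        # direction the ball arrived from
    visited = set()
    while frontier:
        node, came = frontier.pop()
        if (node, came) in visited:
            continue
        visited.add((node, came))
        if node == y:
            return False                  # reachable: dependent
        if came == "from_child":          # ball moving up the graph
            if node not in z:             # chain and fork are active
                frontier += [(p, "from_child") for p in parents[node]]
                frontier += [(c, "from_parent") for c in children[node]]
        else:                             # ball arrived from a parent
            if node in z:                 # known collider: explain-away opens
                frontier += [(p, "from_child") for p in parents[node]]
            else:                         # unknown: passes straight on down
                frontier += [(c, "from_parent") for c in children[node]]
    return True                           # unreachable: independent

# The network from the previous quiz: A, B -> C; C -> D, E.
net = {"A": [], "B": [], "C": ["A", "B"], "D": ["C"], "E": ["C"]}
print(d_separated("A", "E", set(), net))   # False: dependent
print(d_separated("A", "E", {"C"}, net))   # True: knowing C cuts the path
print(d_separated("A", "B", set(), net))   # True: independent causes
print(d_separated("A", "B", {"C"}, net))   # False: explaining away
```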
So let me ask you for this relatively complicated Bayes network the following questions. | ▶ 02:31 |
Is F independent of A? | ▶ 02:37 |
Is F independent of A given D? | ▶ 02:41 |
Is F independent of A given G? | ▶ 02:45 |
And is F independent of A given H? | ▶ 02:49 |
Please mark your answers as you see fit. | ▶ 02:51 |
[Thrun] And the answer is yes, F is independent of A. | ▶ 00:00 |
What we find for our rules of D-separation is that F is dependent on D | ▶ 00:04 |
and A is dependent on D. | ▶ 00:08 |
But if you don't know D, you can't get any dependence between A and F at all. | ▶ 00:11 |
If you do know D, then F and A become dependent. | ▶ 00:16 |
And the reason is B and E are dependent given D, | ▶ 00:20 |
and we can transform this back into dependence of A and F | ▶ 00:25 |
because B and A are dependent and E and F are dependent. | ▶ 00:29 |
There is an active path between A and F which goes across here and here | ▶ 00:33 |
because D is known. | ▶ 00:38 |
If we know G, the same thing is true because G gives us knowledge about D, | ▶ 00:40 |
and D can be applied back to this path over here. | ▶ 00:44 |
However, if you know H, that's not the case. | ▶ 00:47 |
So H might tell us something about G, | ▶ 00:49 |
but it doesn't tell us anything about D, | ▶ 00:51 |
and therefore, we have no reason to close the path between A and F. | ▶ 00:53 |
The path between A and F is still passive, even though we have knowledge of H. | ▶ 00:59 |
[Thrun] So congratulations. You learned a lot about Bayes networks. | ▶ 00:00 |
You learned about the graph structure of Bayes networks, | ▶ 00:03 |
you understood how this is a compact representation, | ▶ 00:06 |
you learned about conditional independence, | ▶ 00:10 |
and we talked a little bit about application of Bayes network | ▶ 00:12 |
to interesting reasoning problems. | ▶ 00:15 |
But by all means this was a mostly theoretical unit of this class, | ▶ 00:18 |
and in future classes we will talk more about applications. | ▶ 00:23 |
The instrument of Bayes networks is really essential to a number of problems. | ▶ 00:27 |
It really characterizes the sparse dependence that exists in many real-world problems | ▶ 00:31 |
like in robotics and computer vision and filtering and diagnostics and so on. | ▶ 00:36 |
I really hope you enjoyed this class, | ▶ 00:41 |
and I really hope you understood in depth how Bayes networks work. | ▶ 00:43 |
[Probabilistic Inference] | ▶ 00:00 |
[Norvig] Welcome back. In the previous unit, we went over the basics | ▶ 00:02 |
of probability theory and saw how | ▶ 00:05 |
a Bayes network could concisely represent a joint probability distribution, | ▶ 00:12 |
including the representation of independence between the variables. | ▶ 00:17 |
In this unit, we will see how to do probabilistic inference. | ▶ 00:24 |
That is, how to answer probability questions using Bayes nets. | ▶ 00:31 |
Let's put up a simple Bayes net. | ▶ 00:36 |
We'll use the familiar example of the earthquake | ▶ 00:40 |
where we can have a burglary or an earthquake | ▶ 00:45 |
setting off an alarm, and if the alarm goes off, | ▶ 00:50 |
either John or Mary might call. | ▶ 00:53 |
Now, what kinds of questions can we ask to do inference about? | ▶ 00:58 |
The simplest type of question is the same question we ask | ▶ 01:02 |
with an ordinary subroutine or function in a programming language. | ▶ 01:05 |
Namely, given some inputs, what are the outputs? | ▶ 01:08 |
So, in this case, we could say given the inputs of B and E, | ▶ 01:12 |
what are the outputs, J and M? | ▶ 01:18 |
Rather than call them input and output variables, | ▶ 01:22 |
in probabilistic inference, we'll call them evidence and query variables. | ▶ 01:26 |
That is, the variables that we know the values of are the evidence, | ▶ 01:36 |
and the ones that we want to find out the values of are the query variables. | ▶ 01:39 |
Anything that is neither evidence nor query is known as a hidden variable. | ▶ 01:44 |
That is, we won't tell you what its value is. | ▶ 01:52 |
We won't figure out what its value is and report it, | ▶ 01:55 |
but we'll have to compute with it internally. | ▶ 01:58 |
And now furthermore, in probabilistic inference, | ▶ 02:01 |
the output is not a single number for each of the query variables, | ▶ 02:05 |
but rather, it's a probability distribution. | ▶ 02:10 |
So, the answer is going to be a complete, joint probability distribution | ▶ 02:13 |
over the query variables. | ▶ 02:17 |
We call this the posterior distribution, given the evidence, | ▶ 02:19 |
and we can write it like this. | ▶ 02:23 |
It's the probability distribution of one or more query variables | ▶ 02:26 |
given the values of the evidence variables. | ▶ 02:34 |
And there can be zero or more evidence variables, | ▶ 02:39 |
and each of them are given an exact value. | ▶ 02:42 |
And that's the computation we want to come up with. | ▶ 02:47 |
There's another question we can ask. | ▶ 02:53 |
Which is the most likely explanation? | ▶ 02:56 |
That is, out of all the possible values for all the query variables, | ▶ 02:58 |
which combination of values has the highest probability? | ▶ 03:03 |
We write the formula like this, asking which Q values | ▶ 03:08 |
maximize the probability given the evidence values. | ▶ 03:12 |
Now, in an ordinary programming language, each function goes only one way. | ▶ 03:16 |
It has input variables, does some computation, | ▶ 03:22 |
and comes up with a result variable or result variables. | ▶ 03:26 |
One great thing about Bayes nets is that we're not restricted | ▶ 03:31 |
to going only in one direction. | ▶ 03:34 |
We could go in the causal direction, giving as evidence | ▶ 03:36 |
the root nodes of the tree and asking as query values the nodes at the bottom. | ▶ 03:41 |
Or, we could reverse that causal flow. | ▶ 03:47 |
For example, we could have J and M be the evidence variables | ▶ 03:50 |
and B and E be the query variables, | ▶ 03:55 |
or we could have any other combination. | ▶ 03:58 |
For example, we could have M be the evidence variable | ▶ 04:01 |
and J and B be the query variables. | ▶ 04:05 |
Here's a question for you. | ▶ 04:11 |
Imagine the situation where Mary has called to report that the alarm is going off, | ▶ 04:13 |
and we want to know whether or not there has been a burglary. | ▶ 04:18 |
For each of the nodes, click on the circle to tell us | ▶ 04:22 |
if the node is an evidence node, a hidden node, | ▶ 04:27 |
or a query node. | ▶ 04:32 |
The answer is that Mary calling is the evidence node. | ▶ 00:00 |
The burglary is the query node, | ▶ 00:04 |
and all the others are hidden variables in this case. | ▶ 00:07 |
Now we're going to talk about how to do inference on Bayes net. | ▶ 00:00 |
We'll start with our familiar network, and we'll talk about a method | ▶ 00:04 |
called enumeration, | ▶ 00:08 |
which goes through all the possibilities, adds them up, | ▶ 00:12 |
and comes up with an answer. | ▶ 00:15 |
So, what we do is start by stating the problem. | ▶ 00:17 |
We're going to ask the question of what is the probability | ▶ 00:24 |
that the burglar alarm occurred given that John called and Mary called? | ▶ 00:27 |
We'll use the definition of conditional probability to answer this. | ▶ 00:34 |
So, this query is equal to the joint probability distribution | ▶ 00:39 |
of all 3 variables divided by the probability of the conditioned-on variables. | ▶ 00:47 |
Now, note I'm using a notation here where instead of writing out the probability | ▶ 00:55 |
of some variable equals true, I'm just using the notation plus | ▶ 01:01 |
and then the variable name in lower case, | ▶ 01:05 |
and if I wanted the negation, I would use negation sign. | ▶ 01:08 |
Notice there's a different notation where instead of writing out | ▶ 01:13 |
the plus and negation signs, we just use the variable name itself, P(e), | ▶ 01:17 |
to indicate E is true. | ▶ 01:22 |
That notation works well, but it can get confusing between | ▶ 01:25 |
does P(e) mean E is true, or does it mean E is a variable? | ▶ 01:29 |
And so we're going to stick to the notation where we explicitly have | ▶ 01:34 |
the pluses and negation signs. | ▶ 01:37 |
To do inference by enumeration, we first take a conditional probability | ▶ 01:41 |
and rewrite it as unconditional probabilities. | ▶ 01:45 |
Now we enumerate all the atomic probabilities and calculate the sum of products. | ▶ 01:49 |
Let's look at just the complex term on the numerator first. | ▶ 01:56 |
The procedure for figuring out the denominator would be similar, and we'll skip that. | ▶ 02:00 |
So, the probability of these 3 terms together | ▶ 02:05 |
can be determined by enumerating all possible values of the hidden variables. | ▶ 02:12 |
In this case, there are 2, E and A, | ▶ 02:17 |
so we'll sum over those variables for all values of E and for all values of A. | ▶ 02:22 |
In this case, they're boolean, so there's only 2 values of each. | ▶ 02:29 |
We ask what's the probability of this unconditional term? | ▶ 02:34 |
And that we get by summing out over all possibilities, | ▶ 02:41 |
E and A being true or false. | ▶ 02:44 |
Now, to get the values of these atomic events, | ▶ 02:49 |
we'll have to rewrite this equation in a form that corresponds | ▶ 02:52 |
to the conditional probability tables that we have associated with the Bayes net. | ▶ 02:55 |
So, we'll take this whole expression and rewrite it. | ▶ 03:00 |
It's still a sum over the hidden variables E and A, | ▶ 03:04 |
but now I'll rewrite this expression in terms of the parents | ▶ 03:08 |
of each of the nodes in the network. | ▶ 03:12 |
So, that gives us the product of these 5 terms, | ▶ 03:15 |
which we then have to sum over all values of E and A. | ▶ 03:21 |
If we call this product f(e,a), | ▶ 03:24 |
then the whole answer is the sum of f for all values of E and A, | ▶ 03:31 |
so it's the sum of 4 terms where each of the terms is a product of 5 numbers. | ▶ 03:43 |
Where do we get the numbers to fill in this equation? | ▶ 03:51 |
From the conditional probability tables from our model, | ▶ 03:54 |
so let's put the equation back up, and we'll ask you for the case | ▶ 03:58 |
where both E and A are positive | ▶ 04:03 |
to look up in the conditional probability tables and fill in the numbers | ▶ 04:09 |
for each of these 5 terms, and then multiply them together and fill in the product. | ▶ 04:14 |
We get the answer by reading numbers off the conditional probability tables, | ▶ 00:00 |
so probability of B being positive is 0.001. | ▶ 00:04 |
Of E being positive, because we're dealing with the positive case now | ▶ 00:11 |
for the variable E, is 0.002. | ▶ 00:16 |
The probability of A being positive, because we're dealing with that case, | ▶ 00:22 |
given that B is positive and the case for an E is positive, | ▶ 00:26 |
that we can read off here as 0.95. | ▶ 00:30 |
The probability that J is positive given that A is positive is 0.9. | ▶ 00:37 |
And finally, the probability that M is positive given that A is positive | ▶ 00:44 |
we read off here as 0.7. | ▶ 00:50 |
We multiply all those together; it's going to be a small number | ▶ 00:54 |
because we've got the .001 and the .002 here. | ▶ 00:57 |
Can't quite fit it in the box, but it works out to .000001197. | ▶ 01:00 |
That seems like a really small number, but remember, | ▶ 01:12 |
we have to normalize by the P(+j,+m) term, | ▶ 01:14 |
and this is only 1 of the 4 possibilities. | ▶ 01:19 |
We have to enumerate over all 4 possibilities for E and A, | ▶ 01:22 |
and in the end, it works out that the probability of the burglar alarm being true | ▶ 01:26 |
given that John and Mary call, is 0.284. | ▶ 01:32 |
And we get that number because intuitively, | ▶ 01:38 |
it seems that the alarm is fairly reliable. | ▶ 01:42 |
John and Mary calling are very reliable, | ▶ 01:44 |
but the prior probability of burglary is low. | ▶ 01:47 |
And those 2 terms combine together to give us the 0.284 value | ▶ 01:49 |
when we sum up each of the 4 terms of these products. | ▶ 01:54 |
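The full enumeration is easy to replicate. One caveat: only five of the table entries (0.001, 0.002, 0.95, 0.9, 0.7) appear in this transcript, so the remaining entries in the sketch below are the standard values used with this example network; with them, the sum indeed comes out near 0.284:

```python
from itertools import product

P_b, P_e = 0.001, 0.002
P_a = {(True, True): 0.95, (True, False): 0.94,     # P(+a | B, E); the
       (False, True): 0.29, (False, False): 0.001}  # last 3 are assumed
P_j = {True: 0.9, False: 0.05}    # P(+j | A); the 0.05 is assumed
P_m = {True: 0.7, False: 0.01}    # P(+m | A); the 0.01 is assumed

def joint(b, e, a, j, m):
    p = (P_b if b else 1 - P_b) * (P_e if e else 1 - P_e)
    p *= P_a[(b, e)] if a else 1 - P_a[(b, e)]
    p *= P_j[a] if j else 1 - P_j[a]
    p *= P_m[a] if m else 1 - P_m[a]
    return p

# P(+b | +j, +m): sum out the hidden variables E and A, then normalize.
num = sum(joint(True, e, a, True, True)
          for e, a in product((True, False), repeat=2))
den = sum(joint(b, e, a, True, True)
          for b, e, a in product((True, False), repeat=3))
print(num / den)   # ~0.284
```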
[Norvig] We've seen how to do enumeration to solve the inference problem | ▶ 00:00 |
on belief networks. | ▶ 00:04 |
For a simple network like the alarm network, that's all we need to know. | ▶ 00:06 |
There's only 5 variables, so even if all 5 of them were hidden, | ▶ 00:10 |
there would only be 32 rows in the table to sum up. | ▶ 00:14 |
From a theoretical point of view, we're done. | ▶ 00:20 |
But from a practical point of view, other networks could give us trouble. | ▶ 00:22 |
Consider this network, which is one for determining insurance for car owners. | ▶ 00:26 |
There are 27 different variables. | ▶ 00:35 |
If each of the variables were boolean, that would give us over 100 million rows to sum out. | ▶ 00:38 |
But in fact, some of the variables are non-boolean, | ▶ 00:44 |
they have multiple values, and it turns out that representing this entire network | ▶ 00:46 |
and doing enumeration we'd have to sum over a quadrillion rows. | ▶ 00:52 |
That's just not practical, so we're going to have to come up with methods | ▶ 00:57 |
that are faster than enumerating everything. | ▶ 01:01 |
The first technique we can use to get a speed-up in doing inference on Bayes nets | ▶ 01:04 |
is to pull out terms from the enumeration. | ▶ 01:09 |
For example, here the probability of b is going to be the same for all values of E and a. | ▶ 01:13 |
So we can take that term and move it out of the summation, | ▶ 01:20 |
and now we have a little bit less work to do. | ▶ 01:26 |
We can multiply by that term once rather than having it in each row of the table. | ▶ 01:28 |
We can also move this term, the P of e, to the left of the summation over a, | ▶ 01:33 |
because it doesn't depend on a. | ▶ 01:40 |
By doing this, we're doing less work. | ▶ 01:43 |
The inner loop of the summation now has only 3 terms rather than 5 terms. | ▶ 01:45 |
So we've reduced the cost of doing each row of the table. | ▶ 01:50 |
But we still have the same number of rows in the table, | ▶ 01:53 |
so we're going to have to do better than that. | ▶ 01:57 |
The next technique for efficient inference is to maximize independence of variables. | ▶ 02:00 |
The structure of a Bayes net determines how efficient it is to do inference on it. | ▶ 02:08 |
For example, a network that's a linear string of variables, | ▶ 02:12 |
X1 through Xn, can have inference done in time proportional to the number n, | ▶ 02:17 |
whereas a network that's a complete network | ▶ 02:27 |
where every node points to every other node and so on could take time 2 to the n | ▶ 02:31 |
if all n variables are boolean variables. | ▶ 02:40 |
In the alarm network we saw previously, we took care | ▶ 02:45 |
to make sure that we had all the independence relations represented | ▶ 02:50 |
in the structure of the network. | ▶ 02:54 |
But if we put the nodes together in a different order, | ▶ 02:57 |
we would end up with a different structure. | ▶ 03:00 |
Let's start by ordering the node John calls first | ▶ 03:03 |
and then adding in the node Mary calls. | ▶ 03:09 |
The question is, given just these 2 nodes and looking at the node for Mary calls, | ▶ 03:13 |
is that node dependent or independent of the node for John calls? | ▶ 03:19 |
[Norvig] The answer is that the node for Mary calls in this network | ▶ 00:01 |
is dependent on John calls. | ▶ 00:05 |
In the previous network, they were independent given that we knew that the alarm had occurred. | ▶ 00:08 |
But here we don't know that the alarm had occurred, | ▶ 00:13 |
and so the nodes are dependent | ▶ 00:16 |
because having information about one will affect the information about the other. | ▶ 00:18 |
[Norvig] Now we'll continue and we'll add the node A for alarm to the network. | ▶ 00:00 |
And what I want you to do is click on all the other variables | ▶ 00:05 |
that A is dependent on in this network. | ▶ 00:09 |
[Norvig] The answer is that alarm is dependent on both John and Mary. | ▶ 00:01 |
And so we can draw both arrows in. | ▶ 00:05 |
Intuitively that makes sense because if John calls, | ▶ 00:09 |
then it's more likely that the alarm has occurred, | ▶ 00:14 |
and similarly if Mary calls; and if both call, it's really likely. | ▶ 00:16 |
So you can figure out the answer by intuitive reasoning, | ▶ 00:20 |
or you can figure it out by going to the conditional probability tables | ▶ 00:23 |
and seeing according to the definition of conditional probability | ▶ 00:27 |
whether the numbers work out. | ▶ 00:31 |
[Norvig] Now we'll continue and we'll add the node B for burglary | ▶ 00:01 |
and ask again, click on all the variables that B is dependent on. | ▶ 00:05 |
[Norvig] The answer is that B is dependent only on A. | ▶ 00:00 |
In other words, B is independent of J and M given A. | ▶ 00:04 |
[Norvig] And finally, we'll add the last node, E, | ▶ 00:00 |
and ask you to click on all the nodes that E is dependent on. | ▶ 00:04 |
[Norvig] And the answer is that E is dependent on A. | ▶ 00:00 |
That much is fairly obvious. | ▶ 00:04 |
But it's also dependent on B. | ▶ 00:06 |
Now, why is that? | ▶ 00:08 |
E is dependent on A because if the earthquake did occur, | ▶ 00:10 |
then it's more likely that the alarm would go off. | ▶ 00:13 |
On the other hand, E is also dependent on B | ▶ 00:16 |
because if a burglary occurred, then that would explain why the alarm is going off, | ▶ 00:19 |
and it would mean that the earthquake is less likely. | ▶ 00:23 |
[Norvig] The moral is that Bayes nets tend to be the most compact | ▶ 00:00 |
and thus the easiest to do inference on when they're written in the causal direction-- | ▶ 00:04 |
that is, when the networks flow from causes to effects. | ▶ 00:12 |
Let's return to this equation, which we use to show how to do inference by enumeration. | ▶ 00:00 |
In this equation, we join up the whole joint distribution | ▶ 00:06 |
before we sum out over the hidden variables. | ▶ 00:10 |
That's slow, because we end up repeating a lot of work. | ▶ 00:15 |
Now we're going to show a new technique called variable elimination, | ▶ 00:18 |
which in many networks operates much faster. | ▶ 00:25 |
It's still a difficult computation, an NP-hard computation, | ▶ 00:27 |
to do inference over Bayes nets in general. | ▶ 00:30 |
Variable elimination works faster than inference by enumeration | ▶ 00:34 |
in most practical cases. | ▶ 00:38 |
It requires an algebra for manipulating factors, | ▶ 00:41 |
which are just names for multidimensional arrays | ▶ 00:45 |
that come out of these probabilistic terms. | ▶ 00:48 |
We'll use another example to show how variable elimination works. | ▶ 00:53 |
We'll start off with a network that has 3 boolean variables. | ▶ 00:57 |
R indicates whether or not it's raining. | ▶ 01:00 |
T indicates whether or not there's traffic, | ▶ 01:04 |
and T is dependent on whether it's raining. | ▶ 01:12 |
And finally, L indicates whether or not I'll be late for my next appointment, | ▶ 01:15 |
and that depends on whether or not there's traffic. | ▶ 01:19 |
Now we'll put up the conditional probability tables for each of these 3 variables. | ▶ 01:22 |
And then we can use inference to figure out the answer to questions like | ▶ 01:29 |
am I going to be late? | ▶ 01:35 |
And we know by definition that we could do that through enumeration | ▶ 01:38 |
by going through all the possible values for R and T | ▶ 01:42 |
and summing up the product of these 3 nodes. | ▶ 01:47 |
Now, in a simple network like this, straight enumeration would work fine, | ▶ 01:54 |
but in a more complex network, what variable elimination does is give us a way | ▶ 01:59 |
to combine together parts of the network into smaller parts | ▶ 02:03 |
and then enumerate over those smaller parts and then continue combining. | ▶ 02:09 |
So, we start with a big network. | ▶ 02:13 |
We eliminate some of the variables. | ▶ 02:15 |
We compute by marginalizing out, and then we have a smaller network to deal with, | ▶ 02:17 |
and we'll show you how those 2 steps work. | ▶ 02:24 |
The first operation in variable elimination is called joining factors. | ▶ 02:28 |
A factor, again, is one of these tables. | ▶ 02:35 |
It's a multidimensional matrix, and what we do is choose 2 of the factors, | ▶ 02:39 |
2 or more of the factors. | ▶ 02:43 |
In this case, we'll choose these 2, and we'll combine them together | ▶ 02:45 |
to form a new factor which represents | ▶ 02:49 |
the joint probability of all the variables in that factor. | ▶ 02:52 |
In this case, R and T. | ▶ 02:56 |
Now we'll draw out that table. | ▶ 03:00 |
In each case, we just look up in the corresponding table, | ▶ 03:03 |
figure out the numbers, and multiply them together. | ▶ 03:06 |
For example, in this row we have a +r and a +t, | ▶ 03:08 |
so the +r is 0.1, and the entry for +r and +t is 0.8, | ▶ 03:13 |
so multiply them together and you get 0.08. | ▶ 03:19 |
Go all the way down. For example, in the last row we have a -r and a -t. | ▶ 03:22 |
-r is 0.9. The entry for -r and -t is also 0.9. | ▶ 03:28 |
Multiply those together and you get 0.81. | ▶ 03:34 |
So, what have we done? | ▶ 03:40 |
We used the operation of joining factors on these 2 factors, | ▶ 03:42 |
getting us a new factor which is part of the existing network. | ▶ 03:45 |
Now we want to apply a second operation called elimination, | ▶ 03:50 |
also called summing out or marginalization, to take this table and reduce it. | ▶ 03:56 |
Right now, the tables we have look like this. | ▶ 04:02 |
We could sum out or marginalize over the variable R | ▶ 04:06 |
to give us a table that just operates on T. | ▶ 04:10 |
So, the question is to fill in this table for P(T)-- | ▶ 04:14 |
there will be 2 entries in this table, the +t entry, formed by summing out | ▶ 04:20 |
all the entries here for all values of r for which t is positive, | ▶ 04:23 |
and the -t entry, formed the same way, by looking in this table | ▶ 04:28 |
and summing up all the rows over all values of r where t is negative. | ▶ 04:32 |
Put your answers in these boxes. | ▶ 04:37 |
The answer is that for +t we look up the 2 possible values for r, | ▶ 00:00 |
and we get 0.08 and 0.09. | ▶ 00:05 |
Sum those up, get 0.17, | ▶ 00:09 |
and then we look at the 2 possible values of R for -t, | ▶ 00:13 |
and we get 0.02 and 0.81. | ▶ 00:18 |
Add those up, and we get 0.83. | ▶ 00:22 |
So, we took our network with RT and L. We summed out over R. | ▶ 00:00 |
That gives us a new network with T and L | ▶ 00:04 |
with these conditional probability tables. | ▶ 00:09 |
And now we want to do a join over T and L | ▶ 00:13 |
and give us a new table with the joint probability of P(T, L). | ▶ 00:17 |
And that table is going to look like this. | ▶ 00:25 |
The answer, again, for joining variables is determined by pointwise multiplication, | ▶ 00:00 |
so for +t and +l we have 0.17 times 0.3, which is 0.051, | ▶ 00:05 |
and for +t and -l, 0.17 times 0.7 is 0.119. | ▶ 00:12 |
Then we go to the minuses. | ▶ 00:21 |
For -t and +l, 0.83 times 0.1 is 0.083. | ▶ 00:23 |
And finally, 0.83 times 0.9 is 0.747. | ▶ 00:31 |
Now we're down to a network with a single node over T and L, | ▶ 00:00 |
with this joint probability table, and the only operation we have left to do | ▶ 00:06 |
is to sum out to give us a node with just L in it. | ▶ 00:12 |
So, the question is to compute P(L) for both values of L, | ▶ 00:17 |
+l and -l. | ▶ 00:26 |
The answer is that for the +l values, | ▶ 00:00 |
0.051 plus 0.083 equals 0.134. | ▶ 00:03 |
And for the negative values, 0.119 plus 0.747 | ▶ 00:11 |
equals 0.866. | ▶ 00:15 |
So, that's how variable elimination works. | ▶ 00:00 |
It's a continued process of joining together factors | ▶ 00:03 |
to form a larger factor and then eliminating variables by summing out. | ▶ 00:06 |
If we make a good choice of the order in which we apply these operations, | ▶ 00:11 |
then variable elimination can be much more efficient | ▶ 00:15 |
than just doing the whole enumeration. | ▶ 00:18 |
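For reference, here is a small Python sketch of those two operations, join and sum out, applied to the rain, traffic, late network with the numbers from the example above.

```python
# Variable elimination on R -> T -> L, using the lecture's numbers.

P_R = {'+r': 0.1, '-r': 0.9}
P_T_given_R = {('+r', '+t'): 0.8, ('+r', '-t'): 0.2,
               ('-r', '+t'): 0.1, ('-r', '-t'): 0.9}
P_L_given_T = {('+t', '+l'): 0.3, ('+t', '-l'): 0.7,
               ('-t', '+l'): 0.1, ('-t', '-l'): 0.9}

# Join P(R) and P(T | R) into a factor P(R, T) by pointwise multiplication.
P_RT = {(r, t): P_R[r] * P_T_given_R[(r, t)]
        for r in ('+r', '-r') for t in ('+t', '-t')}

# Eliminate R: sum it out to get P(T) = {'+t': 0.17, '-t': 0.83}.
P_T = {t: sum(P_RT[(r, t)] for r in ('+r', '-r')) for t in ('+t', '-t')}

# Join P(T) and P(L | T), then eliminate T to get P(L).
P_TL = {(t, l): P_T[t] * P_L_given_T[(t, l)]
        for t in ('+t', '-t') for l in ('+l', '-l')}
P_L = {l: sum(P_TL[(t, l)] for t in ('+t', '-t')) for l in ('+l', '-l')}

print(P_L)   # {'+l': 0.134, '-l': 0.866}, up to floating-point rounding
```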
Now I want to talk about approximate inference | ▶ 00:00 |
by means of sampling. | ▶ 00:07 |
What do I mean by that? | ▶ 00:12 |
Say we want to deal with a joint probability distribution, | ▶ 00:14 |
say the distribution of heads and tails over these 2 coins. | ▶ 00:17 |
We can build a table and then start counting by sampling. | ▶ 00:24 |
Here we have our first sample. | ▶ 00:30 |
We flip the coins and the one-cent piece came up heads, | ▶ 00:32 |
and the five-cent piece came up tails, | ▶ 00:35 |
so we would mark down one count. | ▶ 00:39 |
Then we'd toss them again. | ▶ 00:42 |
This time the five cents is heads, and the one cent is tails, | ▶ 00:45 |
so we put down a count there, and we'd repeat that process | ▶ 00:50 |
and keep repeating it until we got enough counts that we could estimate | ▶ 01:00 |
the joint probability distribution by looking at the counts. | ▶ 01:06 |
Now, if we do a small number of samples, the counts might not be very accurate. | ▶ 01:11 |
There may be some random variation that causes them not to converge | ▶ 01:15 |
to their true values, but as we add more samples, | ▶ 00:19 |
the counts we get will come closer to the true distribution. | ▶ 00:25 |
Thus, sampling has an advantage over inference in that we know a procedure | ▶ 01:29 |
for coming up with at least an approximate value for the joint probability distribution, | ▶ 01:35 |
as opposed to exact inference, where the computation may be very complex. | ▶ 01:42 |
There's another advantage to sampling, which is if we don't know | ▶ 01:50 |
what the conditional probability tables are, as we did in our other models, | ▶ 01:53 |
if we don't know these numeric values, but we can simulate the process, | ▶ 01:59 |
we can still proceed with sampling, whereas we couldn't with exact inference. | ▶ 02:04 |
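As a minimal sketch, the two-coin counting procedure looks like this in Python, here assuming both coins are fair:

```python
# Estimate a joint distribution by sampling: flip two coins repeatedly
# and count how often each (one-cent, five-cent) outcome occurs.
import random
from collections import Counter

def flip(p_heads=0.5):
    return 'H' if random.random() < p_heads else 'T'

counts = Counter()
n = 100_000
for _ in range(n):
    counts[(flip(), flip())] += 1    # one sample of the two coins

for outcome, c in sorted(counts.items()):
    print(outcome, c / n)            # approaches the true joint distribution
```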
Here's a new network that we'll use to investigate | ▶ 00:00 |
how sampling can be used to do inference. | ▶ 00:05 |
In this network, we have 4 variables. They're all boolean. | ▶ 00:10 |
Cloudy tells us if it's cloudy or not outside, | ▶ 00:14 |
and that can have an effect on whether the sprinklers are turned on, | ▶ 00:17 |
and whether it's raining. | ▶ 00:21 |
And those 2 variables in turn have an effect on whether the grass gets wet. | ▶ 00:23 |
Now, to do inference over this network using sampling, | ▶ 00:28 |
we start off with a variable where all the parents are defined. | ▶ 00:34 |
In this case, there's only one such variable, Cloudy. | ▶ 00:38 |
And its conditional probability table tells us that the probability is 50% for Cloudy, | ▶ 00:42 |
50% for not Cloudy, and so we sample from that. | ▶ 00:48 |
We generate a random number, and let's say it comes up with positive for Cloudy. | ▶ 00:52 |
Now that variable is defined, we can choose another variable. | ▶ 00:59 |
In this case, let's choose Sprinkler, and we look at the rows in the table | ▶ 01:02 |
for which Cloudy, the parent, is positive, and we see we should sample | ▶ 01:08 |
with probability 10% for +s and 90% for -s. | ▶ 01:13 |
And so let's say we do that sampling with a random number generator, | ▶ 01:19 |
and it comes up negative for Sprinkler. | ▶ 01:23 |
Now let's jump over here. Look at the Rain variable. | ▶ 01:26 |
Again, the parent, Cloudy, is positive, | ▶ 01:29 |
so we're looking at this part of the table. | ▶ 01:34 |
We get a 0.8 probability for Rain being positive, | ▶ 01:38 |
and a 0.2 probability for Rain being negative. | ▶ 01:41 |
Let's say we sample that randomly, and it comes up Rain is positive. | ▶ 01:44 |
And now we're ready to sample the final variable, | ▶ 01:51 |
and what I want you to do is tell me which of the rows | ▶ 01:54 |
of this table should we be considering and tell me what's more likely. | ▶ 02:01 |
Is it more likely that we have a +w or a -w? | ▶ 02:07 |
The answer to the question is that we look at the parents. | ▶ 00:00 |
We find that the Sprinkler variable is negative, | ▶ 00:03 |
so we're looking at this part of the table. | ▶ 00:06 |
And the Rain variable is positive, so we're looking at this part. | ▶ 00:09 |
So, it would be these 2 rows that we would consider, | ▶ 00:14 |
and thus, we'd find there's a 0.9 probability for w, the grass being wet, | ▶ 00:18 |
and only 0.1 for it being negative, | ▶ 00:25 |
so the positive is more likely. | ▶ 00:28 |
And once we've done that, we have generated a complete sample, | ▶ 00:31 |
and we can write down the sample here. | ▶ 00:34 |
We had +c, -s, +r. | ▶ 00:37 |
And assuming the 0.9-probability outcome came out in favor of +w, | ▶ 00:43 |
that would be the end of the sample. | ▶ 00:51 |
Then we could throw all this information out and start over again | ▶ 00:54 |
by having another 50/50 choice for cloudy and then working our way through the network. | ▶ 00:59 |
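Here is a short Python sketch of that sampling procedure. The lecture gives only some of the table entries (P(+c) = 0.5, P(+s | +c) = 0.1, P(+r | +c) = 0.8, P(+w | -s, +r) = 0.9, P(+w | +s, +r) = 0.99); the remaining rows below are assumed values for illustration.

```python
# Draw one complete sample by walking the network from parents to children.
import random

def bernoulli(p):
    return random.random() < p

P_C = 0.5
P_S = {True: 0.1, False: 0.5}                      # P(+s | C); False row assumed
P_R = {True: 0.8, False: 0.2}                      # P(+r | C); False row assumed
P_W = {(True, True): 0.99, (False, True): 0.90,    # P(+w | S, R)
       (True, False): 0.90, (False, False): 0.01}  # these two rows assumed

def prior_sample():
    c = bernoulli(P_C)             # sample the root first
    s = bernoulli(P_S[c])          # then each child, given its sampled parents
    r = bernoulli(P_R[c])
    w = bernoulli(P_W[(s, r)])
    return c, s, r, w

print(prior_sample())              # e.g. (True, False, True, True) = +c, -s, +r, +w
```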
Now, the probability of sampling a particular variable, | ▶ 00:00 |
choosing a +w or a -w, depends on the values of the parents. | ▶ 00:04 |
But those are chosen according to the conditional probability tables, | ▶ 00:10 |
so in the limit, the count of each sampled variable | ▶ 00:14 |
will approach the true probability. | ▶ 00:18 |
That is, with an infinite number of samples, this procedure computes the true | ▶ 00:20 |
joint probability distribution. | ▶ 00:24 |
We say that the sampling method is consistent. | ▶ 00:27 |
We can use this kind of sampling to compute the complete joint probability distribution, | ▶ 00:33 |
or we can use it to compute a value for an individual variable. | ▶ 00:38 |
But what if we wanted to compute a conditional probability? | ▶ 00:43 |
Say we wanted to compute the probability of wet grass | ▶ 00:47 |
given that it's not cloudy. | ▶ 00:53 |
To do that, the sample that we generated here wouldn't be helpful at all | ▶ 00:58 |
because it has to do with being cloudy, not with being not cloudy. | ▶ 01:03 |
So, we would cross this sample off the list. | ▶ 01:08 |
We would say that we reject the sample, and this technique is called rejection sampling. | ▶ 01:11 |
We go through ignoring any samples that don't match | ▶ 01:17 |
the conditional probabilities that we're interested in | ▶ 01:21 |
and keeping samples that do, say the sample -c, +s, +r, -w. | ▶ 01:24 |
We would just continue going through generating samples, | ▶ 01:34 |
crossing off the ones that don't match, keeping the ones that do. | ▶ 01:37 |
And this procedure would also be consistent. | ▶ 01:41 |
We call this procedure rejection sampling. | ▶ 01:46 |
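A sketch of rejection sampling for P(+w | -c), reusing prior_sample() and the partly assumed tables from the previous sketch:

```python
# Rejection sampling: draw prior samples and throw away every one
# that contradicts the evidence (here, the evidence is -c).

def rejection_sample(n=100_000):
    kept, wet = 0, 0
    for _ in range(n):
        c, s, r, w = prior_sample()
        if c:                      # evidence is -c, so reject samples with +c
            continue
        kept += 1
        wet += w
    return wet / kept              # estimate of P(+w | -c)

print(rejection_sample())
```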
But there's a problem with rejection sampling. | ▶ 00:00 |
If the evidence is unlikely, you end up rejecting a lot of the samples. | ▶ 00:03 |
Let's go back to the alarm network where we had variables for burglary and for an alarm | ▶ 00:08 |
and say we're interested in computing the probability of a burglary, | ▶ 00:16 |
given that the alarm goes off. | ▶ 00:22 |
The problem is that burglaries are very infrequent, | ▶ 00:25 |
so most of the samples we would get would end up being-- | ▶ 00:28 |
we start with generating a B, and we get a -b and then a -a. | ▶ 00:32 |
We go back and say does this match? | ▶ 00:39 |
No, we have to reject this sample, | ▶ 00:43 |
so we generate another sample, and we get another -b, -a. | ▶ 00:45 |
We reject that. We get another -b, -a. | ▶ 00:50 |
And we keep rejecting, and eventually we get a +b, | ▶ 00:54 |
but we'd end up spending a lot of time rejecting samples. | ▶ 01:00 |
So, we're going to introduce a new method called likelihood weighting | ▶ 01:04 |
that generates samples so that we can keep every one. | ▶ 01:13 |
With likelihood weighting, we fix the evidence variables. | ▶ 01:17 |
That is, we say that A will always be positive, | ▶ 01:20 |
and then we sample the rest of the variables, | ▶ 01:25 |
so then we get samples that we want. | ▶ 01:28 |
We would get a list like -b, +a, | ▶ 01:31 |
-b, +a, | ▶ 01:37 |
+b, +a. | ▶ 01:40 |
We get to keep every sample, but we have a problem. | ▶ 01:42 |
The resulting set of samples is inconsistent. | ▶ 01:46 |
We can fix that, however, by assigning a probability | ▶ 01:52 |
to each sample and weighting them correctly. | ▶ 01:56 |
In likelihood weighting, we're going to be collecting samples just like before, | ▶ 00:00 |
but we're going to add a probabilistic weight to each sample. | ▶ 00:05 |
Now, let's say we want to compute the probability of rain | ▶ 00:11 |
given that the sprinklers are on, and the grass is wet. | ▶ 00:17 |
We start as before. | ▶ 00:22 |
We make a choice for Cloudy, and let's say that, again, | ▶ 00:24 |
we choose Cloudy being positive. | ▶ 00:30 |
Now we want to make a choice for Sprinkler, | ▶ 00:33 |
but we're constrained to always choose Sprinkler being positive, | ▶ 00:37 |
so we'll make that choice. | ▶ 00:41 |
And we know we were dealing with Cloudy being positive, | ▶ 00:44 |
so we're in this row, and we're forced to make the choice of Sprinkler being positive, | ▶ 00:50 |
and that has a probability of only 0.1, so we'll put that 0.1 into the weight. | ▶ 00:56 |
Next, we'll look at the Rain variable, | ▶ 01:05 |
and here we're not constrained in any way, so we make a choice | ▶ 01:09 |
according to the probability tables with Cloudy being positive. | ▶ 01:13 |
And let's say that we choose the more popular choice, and Rain gets the positive value. | ▶ 01:19 |
Now, we look at Wet Grass. | ▶ 01:27 |
We're constrained to choose positive, and we know that the parents | ▶ 01:30 |
are also positive, so we're dealing with this row here. | ▶ 01:35 |
Since it's a constrained choice, we're going to add in or multiply in an additional weight, | ▶ 01:41 |
and I want you to tell me what that weight should be. | ▶ 01:47 |
The answer is we're looking for the probability | ▶ 00:00 |
of having a +w given a +s and a +r, | ▶ 00:04 |
so that's in this row, so it's 0.99. | ▶ 00:09 |
So, we take our old weight and multiply it by 0.99, | ▶ 00:16 |
gives us a final weight of 0.099 | ▶ 00:22 |
for a sample of +c, +s, +r and +w. | ▶ 00:28 |
When we include the weights, | ▶ 00:00 |
counting this sample that was forced to have a +s and a +w | ▶ 00:03 |
with a weight of 0.099, instead of counting it as one full sample, | ▶ 00:08 |
we find that likelihood weighting is also consistent. | ▶ 00:14 |
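Here is a sketch of likelihood weighting for the query P(+r | +s, +w), again reusing the partly assumed tables from the sketches above:

```python
# Likelihood weighting: evidence variables are fixed rather than sampled,
# and each sample is weighted by the probability of those forced choices.

def likelihood_weighting(n=100_000):
    weighted_rain, weighted_total = 0.0, 0.0
    for _ in range(n):
        weight = 1.0
        c = bernoulli(P_C)          # non-evidence: sampled freely
        weight *= P_S[c]            # Sprinkler forced to +s: multiply in P(+s | c)
        r = bernoulli(P_R[c])       # non-evidence: sampled freely
        weight *= P_W[(True, r)]    # WetGrass forced to +w: multiply in P(+w | +s, r)
        weighted_total += weight
        if r:
            weighted_rain += weight
    return weighted_rain / weighted_total   # estimate of P(+r | +s, +w)

print(likelihood_weighting())
```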
Likelihood weighting is a great technique, | ▶ 00:00 |
but it doesn't solve all our problems. | ▶ 00:03 |
Suppose we wanted to compute the probability of C given +s and +r. | ▶ 00:05 |
In other words, we're constraining Sprinkler and Rain to always be positive. | ▶ 00:14 |
Since we use the evidence when we generate a node that has that evidence as parents, | ▶ 00:21 |
the Wet Grass node will always get good values based on that evidence. | ▶ 00:27 |
But the Cloudy node won't, and so it will be generating values at random | ▶ 00:31 |
without looking at these values, and some of the time | ▶ 00:39 |
it will be generating values that don't go well with the evidence. | ▶ 00:44 |
Now, we won't have to reject them like we do in rejection sampling, | ▶ 00:48 |
but they'll have a low probability associated with them. | ▶ 00:51 |
A technique called Gibbs sampling, | ▶ 00:00 |
named after the physicist Josiah Gibbs, | ▶ 00:07 |
takes all the evidence into account and not just the upstream evidence. | ▶ 00:10 |
It uses a method called Markov Chain Monte Carlo, or MCMC. | ▶ 00:14 |
The idea is that we resample just one variable at a time | ▶ 00:26 |
conditioned on all the others. | ▶ 00:31 |
That is, we have a set of variables, | ▶ 00:33 |
and we initialize them to random values, keeping the evidence values fixed. | ▶ 00:37 |
Maybe we have values like this, | ▶ 00:44 |
and that constitutes one sample, and now, at each iteration through the loop, | ▶ 00:48 |
we select just one non-evidence variable and resample it | ▶ 00:54 |
based on all the other variables. | ▶ 01:01 |
And that will give us another sample, and repeat that again. | ▶ 01:04 |
Choose another variable. | ▶ 01:11 |
Resample that variable and repeat. | ▶ 01:15 |
We end up walking around in this space of assignments of variables randomly. | ▶ 01:21 |
Now, in rejection sampling and likelihood weighting, | ▶ 01:27 |
each sample was independent of the other samples. | ▶ 01:30 |
In MCMC, that's not true. | ▶ 01:34 |
The samples are dependent on each other, and in fact, | ▶ 01:37 |
adjacent samples are very similar. | ▶ 01:40 |
They only vary or differ in one place. | ▶ 01:42 |
However, the technique is still consistent. We won't show the proof for that. | ▶ 01:46 |
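For completeness, here is a rough Gibbs-sampling sketch for P(+c | +s, +r) on the sprinkler network, with the same partly assumed tables as above. A real implementation would also discard an initial burn-in period.

```python
# Gibbs sampling: hold the evidence (+s, +r) fixed and repeatedly resample
# each non-evidence variable conditioned on its Markov blanket.

def gibbs(n=100_000):
    s, r = True, True                 # evidence: never resampled
    c = bernoulli(0.5)                # arbitrary initialization
    w = bernoulli(0.5)
    count_c = 0
    for _ in range(n):
        # C's Markov blanket is its children S and R, so
        # P(C | s, r, w) is proportional to P(C) * P(s | C) * P(r | C).
        # The evidence is positive, so we use the +s and +r rows.
        score_t = P_C * P_S[True] * P_R[True]
        score_f = (1 - P_C) * P_S[False] * P_R[False]
        c = bernoulli(score_t / (score_t + score_f))
        # W's Markov blanket is just its parents S and R.
        w = bernoulli(P_W[(s, r)])
        count_c += c
    return count_c / n                # estimate of P(+c | +s, +r)

print(gibbs())
```

In this tiny network the conditional for C happens not to depend on W, so the chain mixes immediately; in larger networks each resampling step uses the current values of the variable's neighbors, and adjacent samples differ in only one place, just as described above.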
Now, just one more thing. | ▶ 00:00 |
I can't help but describe what is probably the most famous probability problem of all. | ▶ 00:02 |
It's called the Monty Hall Problem after the game show host. | ▶ 00:07 |
And the idea is that you're on a game show, and there's 3 doors: | ▶ 00:11 |
door #1, door #2, and door #3. | ▶ 00:15 |
And behind each door is a prize, and you know that one of the doors | ▶ 00:20 |
contains an expensive sports car, which you would find desirable, | ▶ 00:26 |
and each of the other 2 doors contains a goat, which you would find less desirable. | ▶ 00:29 |
Now, say you're given a choice, and let's say you choose door #1. | ▶ 00:35 |
But according to the conventions of the game, the host, Monty Hall, | ▶ 00:42 |
will now open one of the doors, knowing that the door that he opens | ▶ 00:47 |
contains a goat, and he shows you door #3. | ▶ 00:52 |
And he now gives you the opportunity to stick with your choice | ▶ 00:57 |
or to switch to the other door. | ▶ 01:02 |
What I want you to tell me is, what is your probability of winning | ▶ 01:05 |
if you stick to door #1, and what is the probability of winning | ▶ 01:10 |
if you switched to door #2? | ▶ 01:15 |
The answer is that you have a 1/3 chance of winning if you stick with door #1 | ▶ 00:00 |
and a 2/3 chance if you switch to door #2. | ▶ 00:08 |
How do we explain that, and why isn't it 50/50? | ▶ 00:12 |
Well, it's true that there's 2 possibilities, | ▶ 00:16 |
but we've learned from probability that just because there are 2 options | ▶ 00:18 |
doesn't mean that both options are equally likely. | ▶ 00:22 |
It's easier to explain why the first door has a 1/3 probability | ▶ 00:26 |
because when you started, the car could be in any one of 3 places. | ▶ 00:30 |
You chose one of them. That probability was 1/3. | ▶ 00:34 |
And that probability hasn't been changed by the revealing of one of the other doors. | ▶ 00:37 |
Why is door #2 two-thirds? | ▶ 00:43 |
Well, one way to explain it is that the probability has to sum to 1, | ▶ 00:45 |
and if 1/3 is here, the 2/3 has to be here. | ▶ 00:49 |
But why doesn't the same argument that you used for door #1 hold for door #2? | ▶ 00:53 |
Why can't we say the probability of 2 holding the car | ▶ 00:58 |
was 1/3 before this door was revealed? | ▶ 01:03 |
Why has that changed for door #2 but not for door #1? | ▶ 01:07 |
And the reason is because we've learned something about door #2. | ▶ 01:11 |
We've learned that it wasn't the door that was opened by the host, | ▶ 01:14 |
and so that additional information has updated the probability, | ▶ 01:18 |
whereas we haven't learned anything additional about door #1 | ▶ 01:22 |
because it was never an option that the host might open door #1. | ▶ 01:26 |
And in fact, in this case, if we reveal the door, | ▶ 01:30 |
we find that's where the car actually is. | ▶ 01:37 |
So you see, learning probability may end up winning you something. | ▶ 01:40 |
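If you would rather convince yourself empirically, here is a quick Monty Hall simulation in Python:

```python
# Simulate the Monty Hall game: sticking wins about 1/3 of the time,
# switching about 2/3.
import random

def play(switch, trials=100_000):
    wins = 0
    for _ in range(trials):
        car = random.randrange(3)
        pick = random.randrange(3)
        # The host opens a door that is neither your pick nor the car
        # (which goat door he opens doesn't affect the stick/switch rates).
        opened = next(d for d in range(3) if d != pick and d != car)
        if switch:
            pick = next(d for d in range(3) if d != pick and d != opened)
        wins += (pick == car)
    return wins / trials

print("stick: ", play(switch=False))   # ~0.333
print("switch:", play(switch=True))    # ~0.667
```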
Now, as a final epilogue, I have here a copy of a letter written by Monty Hall himself | ▶ 00:00 |
in 1990 to Professor Lawrence Denenberg of Harvard | ▶ 00:07 |
who, with Harry Lewis, wrote a statistics book | ▶ 00:10 |
in which they used the Monty Hall Problem as an example, | ▶ 00:14 |
and they wrote to Monty asking him for permission to use his name. | ▶ 00:18 |
Monty kindly granted the permission, but in his letter, | ▶ 00:23 |
he writes, "As I see it, it wouldn't make any difference after the player | ▶ 00:26 |
has selected Door A, and having been shown Door C-- | ▶ 00:31 |
why should he then attempt to switch to Door B?" | ▶ 00:34 |
So, we see Monty Hall himself did not understand the Monty Hall Problem. | ▶ 00:38 |
[Thrun] Given the following Bayes network with P of A equal to 0.5, | ▶ 00:00 |
P of B given A equal to 0.2, | ▶ 00:06 |
and P of B given not A equal to 0.8, | ▶ 00:08 |
calculate the following probability. | ▶ 00:12 |
[Thrun] Consider a network of the following type: | ▶ 00:00 |
a variable, A, that is binary connects to three variables, X1, X2, and X3, | ▶ 00:03 |
that are also binary. | ▶ 00:10 |
The probability of A is 0.5, and for all variables Xi we have the probability of Xi given A is 0.2, | ▶ 00:12 |
and the probability of Xi given not A equals 0.6. | ▶ 00:24 |
I would like to know from you the probability of A | ▶ 00:29 |
given that we observed X1, X2, and not X3. | ▶ 00:31 |
Notice that these variables over here are conditionally independent given A. | ▶ 00:37 |
[Thrun] Let us consider the same network again. | ▶ 00:00 |
I would like to know the probability of X3 given that I observed X1. | ▶ 00:03 |
[Thrun] In this next homework assignment I will be drawing you a Bayes network | ▶ 00:00 |
and will ask you some conditional independence questions. | ▶ 00:04 |
Is B conditionally independent of C? And say yes or no. | ▶ 00:09 |
Is B conditionally independent of C given D? And say yes or no. | ▶ 00:14 |
Is B conditionally independent of C given A? And say yes or no. | ▶ 00:19 |
And is B conditionally independent of C given A and D? And say yes or no. | ▶ 00:24 |
[Thrun] Consider the following network. | ▶ 00:00 |
I would like to know whether the following statements are true or false. | ▶ 00:02 |
C is conditionally independent of E given A. | ▶ 00:08 |
B is conditionally independent of D given C and E. | ▶ 00:12 |
A is conditionally independent of C given E. | ▶ 00:18 |
And A is conditionally independent of C given B. | ▶ 00:21 |
Please check yes or no for each of these questions. | ▶ 00:25 |
[Thrun] In my final question I'll look at the exact same network as before, | ▶ 00:00 |
but I would like to know the minimum number of numerical parameters | ▶ 00:04 |
such as the values to define probabilities and conditional probabilities | ▶ 00:08 |
that are necessary to specify the joint distribution of all 5 variables. | ▶ 00:13 |
[Thrun] The answer is 0.2, | ▶ 00:00 |
and this follows directly from Bayes' rule. | ▶ 00:03 |
In this formula, we can read off the first 2 values straight from the table over here, | ▶ 00:07 |
and we expand the denominator by total probability. | ▶ 00:11 |
Observing that this is exactly the same expression as up here, | ▶ 00:15 |
we get 0.1 divided by 0.1 plus this expression over here, which can be copied from over here, | ▶ 00:19 |
and P of not A is directly obtained up here. | ▶ 00:27 |
Hence we get 0.5 over here, and as a result we get 0.2. | ▶ 00:30 |
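Written out, the calculation being described is:

$$P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B \mid A)\,P(A) + P(B \mid \lnot A)\,P(\lnot A)} = \frac{0.2 \cdot 0.5}{0.2 \cdot 0.5 + 0.8 \cdot 0.5} = \frac{0.1}{0.5} = 0.2$$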
[Thrun] For this question we will be exploring a little trick | ▶ 00:00 |
about non-normalized probability. | ▶ 00:03 |
We will observe that P of A given X1, X2 and not X3, | ▶ 00:05 |
the expression on the left can be resolved by Bayes' rule into this expression over here. | ▶ 00:11 |
We will take X3 to the left and replace it by A, | ▶ 00:16 |
both conditioned on the variables X1 and X2. | ▶ 00:20 |
Then we have P of not X3 given A, X1, and X2, times P of A given X1 and X2, divided by P of not X3 given X1 and X2. | ▶ 00:23 |
Next we employ 2 things. | ▶ 00:29 |
One is the denominator does not depend on A, | ▶ 00:31 |
so whether I put an A or not A has no bearing on any calculation here, | ▶ 00:34 |
which means I can defer its calculation until later, and it will turn out to be important. | ▶ 00:39 |
So I'm going to be proportional to just the stuff over here. | ▶ 00:44 |
And second, I exploit my conditional independence | ▶ 00:49 |
whereby I can omit X1 and X2 from the probability of not X3 conditioned on A. | ▶ 00:52 |
These variables are conditionally independent. | ▶ 00:58 |
This gives me the following recursion | ▶ 01:02 |
where I now removed the third variable from the estimation problem | ▶ 01:05 |
and just retained the first 2 relative to my initial expression. | ▶ 01:10 |
If I keep expanding this, I get the following solution. | ▶ 01:14 |
P of not X3 given A, times P of X2 given A, times P of X1 given A, times P of A. | ▶ 01:19 |
You might take a minute to just verify this, | ▶ 01:27 |
but this is exploiting the conditional independence | ▶ 01:30 |
very much as in the first step I showed you over here. | ▶ 01:32 |
This step lacks the normalizer, | ▶ 01:35 |
so let me work on the normalizer by expressing the opposite probability, | ▶ 01:38 |
P of not A given the same events, X1, X2, and not X3, | ▶ 01:44 |
which resolves to P of not X3 given not A, | ▶ 01:50 |
P of X2 given not A, P of X1 given not A, | ▶ 01:54 |
and P of not A. | ▶ 02:00 |
I can now plug in the values from above. | ▶ 02:02 |
So the first term gives me 0.8 times 0.2 times 0.2 times 0.5. | ▶ 02:04 |
In the second term I get 0.4 times 0.6 times 0.6 times 0.5, | ▶ 02:15 |
which resolves to 0.016 and 0.072. | ▶ 02:24 |
This is clearly not a probability because we left out the normalizer. | ▶ 02:31 |
But as we know, the normalizer does not depend on whether I put A or not A in here. | ▶ 02:36 |
As a result, it will be the same for both of these expressions, | ▶ 02:40 |
and I can obtain it by just adding these non-normalized probabilities | ▶ 02:44 |
and then subsequently divide these non-normalized probabilities accordingly. | ▶ 02:47 |
So let me just do this. | ▶ 02:52 |
We get for the desired probability over here 0.1818 | ▶ 02:55 |
and for the inverse probability over here 0.8182. | ▶ 03:01 |
Our desired answer therefore is 0.1818. | ▶ 03:08 |
This was not an easy question. | ▶ 03:14 |
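Written out, the non-normalized computation and its normalization are:

$$\bar{P}(A \mid x_1, x_2, \lnot x_3) = P(\lnot x_3 \mid A)\,P(x_2 \mid A)\,P(x_1 \mid A)\,P(A) = 0.8 \cdot 0.2 \cdot 0.2 \cdot 0.5 = 0.016$$

$$\bar{P}(\lnot A \mid x_1, x_2, \lnot x_3) = 0.4 \cdot 0.6 \cdot 0.6 \cdot 0.5 = 0.072$$

$$P(A \mid x_1, x_2, \lnot x_3) = \frac{0.016}{0.016 + 0.072} \approx 0.1818$$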
[Thrun] The answer is a little bit involved. | ▶ 00:00 |
We use total probability to re-express this by bringing in A. | ▶ 00:03 |
P of X3 given X1 is the sum of P of X3 given X1 and A | ▶ 00:08 |
times P of A given X1, plus the complement term, P of X3 given X1 and not A, | ▶ 00:15 |
times P of not A given X1. | ▶ 00:22 |
That is just total probability. | ▶ 00:24 |
Next we utilized conditional independence by which we can simplify this expression | ▶ 00:26 |
to drop X1 in the conditional variables | ▶ 00:30 |
and we transform this expression by Bayes' rule again. | ▶ 00:33 |
The same applies to the right side with not A replacing A. | ▶ 00:36 |
All of those expressions over here can be found | ▶ 00:41 |
either in the table up there or just by their complements, | ▶ 00:45 |
with the exception of P of X1. | ▶ 00:49 |
But P of X1 can again be just obtained by total probability, | ▶ 00:52 |
which resolves to 0.2 times 0.5 plus 0.6 times 0.5, | ▶ 00:58 |
which gives me 0.4. | ▶ 01:11 |
We are now in a position to calculate the last term over here, which goes as follows. | ▶ 01:13 |
This expression is 0.2 times 0.2 times 0.5 over 0.4 plus 0.6 times 0.6 times 0.5 over 0.4, | ▶ 01:19 |
which gives us as a final result 0.5. | ▶ 01:36 |
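Written out, the whole calculation is:

$$P(x_3 \mid x_1) = P(x_3 \mid a)\,\frac{P(x_1 \mid a)\,P(a)}{P(x_1)} + P(x_3 \mid \lnot a)\,\frac{P(x_1 \mid \lnot a)\,P(\lnot a)}{P(x_1)} = 0.2 \cdot \frac{0.1}{0.4} + 0.6 \cdot \frac{0.3}{0.4} = 0.05 + 0.45 = 0.5$$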
[Thrun] And the answer is as follows. | ▶ 00:00 |
No, no, yes, and no. | ▶ 00:02 |
B and C in the absence of any other information are dependent through A, | ▶ 00:06 |
which is if you learn something about B, you can infer something about A, | ▶ 00:11 |
and then we'll know more about C. | ▶ 00:17 |
If you know D, that doesn't change a thing. | ▶ 00:20 |
You can just take D out of the pool. | ▶ 00:22 |
If you know A, B and C become conditionally independent. | ▶ 00:24 |
This dependence goes away, and ignorance of D doesn't render B and C dependent. | ▶ 00:29 |
However, if we add D back to the mix, | ▶ 00:36 |
then knowledge of D will render B and C dependent by way of the explaining away effect. | ▶ 00:39 |
[Thrun] So the correct answer is tricky in this case. | ▶ 00:00 |
It is no, no, no, and yes. | ▶ 00:03 |
The first one is straightforward. | ▶ 00:07 |
C and E are conditionally independent based on D, | ▶ 00:09 |
and knowledge of A doesn't change anything. | ▶ 00:13 |
B and D are conditionally independent through A, | ▶ 00:15 |
and knowledge of C or E doesn't change that. | ▶ 00:20 |
The case of A and C is interesting. | ▶ 00:23 |
A and C are independent, but if you know D, they become dependent. | ▶ 00:25 |
It turns out if you know E, you can know something about D, | ▶ 00:29 |
and as a result, A and C become dependent through the explaining away effect. | ▶ 00:32 |
That doesn't apply if you know B. | ▶ 00:37 |
Even though B tells you something about E, | ▶ 00:39 |
it tells you nothing about D because B and D are independent. | ▶ 00:42 |
Therefore, knowing B tells you nothing about D, | ▶ 00:46 |
and the explaining away effect does not occur between A and C. | ▶ 00:49 |
The answer here is yes. | ▶ 00:52 |
[Thrun] The correct answer is 16. | ▶ 00:00 |
The probability of A and C require 1 parameter each. | ▶ 00:03 |
The complement of not A and not C follows by 1 minus that parameter. | ▶ 00:06 |
This guy over here requires 2 parameters. | ▶ 00:12 |
You need to know the probability of B given A and B given not A. | ▶ 00:15 |
The complements can be obtained easily. | ▶ 00:18 |
The probability of D is conditioned on 2 variables which can take 4 possible values. | ▶ 00:20 |
Hence the number is 4. | ▶ 00:24 |
And E is conditioned on 3 variables, so its parents can take a total of 8 different values, | ▶ 00:26 |
2 to the 3rd, which is 8. | ▶ 00:30 |
If you add 8 plus 4 plus 2 plus 1 plus 1, you get 16. | ▶ 00:32 |
Welcome to the machine learning unit. | ▶ 00:00 |
Machine learning is a fascinating area. | ▶ 00:03 |
The world has become immeasurably data-rich. | ▶ 00:06 |
The world wide web has grown up over the last decade. | ▶ 00:09 |
The human genome is being sequenced. | ▶ 00:12 |
Vast chemical databases, pharmaceutical databases, | ▶ 00:15 |
and financial databases are now available | ▶ 00:19 |
on a scale unthinkable even 5 years ago. | ▶ 00:22 |
To make sense out of the data, | ▶ 00:26 |
to extract information from the data, | ▶ 00:28 |
machine learning is the discipline to go to. | ▶ 00:30 |
Machine learning is an important subfield of artificial intelligence; | ▶ 00:33 |
it's my personal favorite next to robotics | ▶ 00:37 |
because I believe it has a huge impact on society | ▶ 00:40 |
and is absolutely necessary as we move forward. | ▶ 00:43 |
So in this class, I teach you some of the very basics of | ▶ 00:47 |
machine learning, and in our next unit | ▶ 00:50 |
Peter will tell you some more about machine learning. | ▶ 00:52 |
We'll talk about supervised learning, which is one side of machine learning, | ▶ 00:56 |
and Peter will tell you about unsupervised learning, | ▶ 01:00 |
which is a different style. | ▶ 01:02 |
Later in this class we will also encounter reinforcement learning, | ▶ 01:05 |
which is yet another style of machine learning. | ▶ 01:07 |
Anyhow, let's just dive in. | ▶ 01:10 |
Welcome to the first class on machine learning. | ▶ 00:00 |
So far we talked a lot about Bayes Networks. | ▶ 00:03 |
And the way we talked about them | ▶ 00:07 |
is all about reasoning within Bayes Networks | ▶ 00:10 |
that are known. | ▶ 00:14 |
Machine learning addresses the problem | ▶ 00:15 |
of how to find those networks | ▶ 00:17 |
or other models | ▶ 00:19 |
based on data. | ▶ 00:20 |
Learning models from data | ▶ 00:22 |
is a major, major area of artificial intelligence | ▶ 00:25 |
and it's perhaps the one | ▶ 00:29 |
that had the most commercial success. | ▶ 00:31 |
In many commercial applications | ▶ 00:33 |
the models themselves are fitted | ▶ 00:37 |
based on data. | ▶ 00:39 |
For example, Google | ▶ 00:40 |
uses data to understand | ▶ 00:42 |
how to respond to each search query. | ▶ 00:44 |
Amazon uses data | ▶ 00:46 |
to understand how to place products on their website. | ▶ 00:49 |
And these machine learning techniques | ▶ 00:52 |
are the enabling techniques that make that possible. | ▶ 00:53 |
So this class | ▶ 00:56 |
which is about supervised learning | ▶ 00:57 |
will go through some very basic methods | ▶ 00:59 |
for learning models from data | ▶ 01:02 |
in particular, specific types of Bayes Networks. | ▶ 01:04 |
We will complement this | ▶ 01:06 |
with a class on unsupervised learning | ▶ 01:08 |
that will be taught next | ▶ 01:10 |
after this class. | ▶ 01:14 |
Let me start off with a quiz. | ▶ 01:15 |
The quiz is: What companies are famous | ▶ 01:18 |
for machine learning using data? | ▶ 01:20 |
Google for mining the web. | ▶ 01:24 |
Netflix for mining what people | ▶ 01:29 |
would like to rent on DVDs. | ▶ 01:31 |
That is, DVD recommendations. | ▶ 01:36 |
Amazon.com for product placement. | ▶ 01:40 |
Check any or all | ▶ 01:45 |
and if none of those apply | ▶ 01:47 |
check down here. | ▶ 01:49 |
And, not surprisingly, the answer is | ▶ 00:00 |
all of those companies and many, many, many more | ▶ 00:03 |
use massive machine learning for making decisions | ▶ 00:06 |
that are really essential to the businesses. | ▶ 00:09 |
Google mines the web and uses machine learning for translation, | ▶ 00:12 |
as we've seen in the introductory level. Netflix has used | ▶ 00:15 |
machine learning extensively for understanding what type of DVD to recommend to you next. | ▶ 00:18 |
Amazon composes its entire product pages using | ▶ 00:22 |
machine learning by understanding how customers | ▶ 00:25 |
respond to different compositions and placements of their products, | ▶ 00:28 |
and many, many other examples exist. | ▶ 00:31 |
I would argue that in Silicon Valley, | ▶ 00:35 |
at least half the companies dealing with customers and online products | ▶ 00:37 |
do extensively use machine learning, | ▶ 00:41 |
so it makes machine learning a really exciting discipline. | ▶ 00:43 |
In my own research, I've extensively used machine learning for robotics. | ▶ 00:00 |
What you see here is a robot my students and I built at Stanford | ▶ 00:05 |
called Stanley, and it won the DARPA Grand Challenge. | ▶ 00:08 |
It's a self-driving car that drives without any human assistance whatsoever, | ▶ 00:12 |
and this vehicle extensively uses machine learning. | ▶ 00:16 |
The robot is equipped with a laser system. | ▶ 00:22 |
I will talk more about lasers in my robotics class, | ▶ 00:25 |
but here you can see how the robot is able to build | ▶ 00:28 |
3-D models of the terrain ahead. | ▶ 00:31 |
These are almost like video game models that allow it to make | ▶ 00:34 |
assessments where to drive and where not to drive. | ▶ 00:37 |
Essentially, it's trying to drive on flat ground. | ▶ 00:39 |
The problem with these lasers is that they don't see very far. | ▶ 00:43 |
They see about 25 meters out, so to drive really fast | ▶ 00:46 |
the robot has to see further. | ▶ 00:50 |
This is where machine learning comes into play. | ▶ 00:53 |
What you see here is camera images delivered by the robot | ▶ 00:56 |
superimposed with laser data that doesn't see very far, | ▶ 00:58 |
but the laser is good enough to extract samples | ▶ 01:01 |
of driveable road surface that can then be machine learned | ▶ 01:04 |
and extrapolated into the entire camera image. | ▶ 01:08 |
That enables the robot to use the camera | ▶ 01:10 |
to see driveable terrain all the way to the horizon | ▶ 01:13 |
up to like 200 meters out, enough to drive really, really fast. | ▶ 01:16 |
This ability to adapt its vision, by deriving its own training examples using lasers | ▶ 01:22 |
but seeing out 200 meters or more | ▶ 01:27 |
was a key factor in winning the race. | ▶ 01:30 |
Machine learning is a very large field | ▶ 00:00 |
with many different methods | ▶ 00:03 |
and many different applications. | ▶ 00:04 |
I will now define some of the very basic terminology | ▶ 00:06 |
that is being used to distinguish | ▶ 00:10 |
different machine learning methods. | ▶ 00:12 |
Let's start with the what. | ▶ 00:13 |
What is being learned? | ▶ 00:17 |
You can learn parameters | ▶ 00:19 |
like the probabilities of a Bayes Network. | ▶ 00:23 |
You can learn structure | ▶ 00:26 |
like the arc structure of a Bayes Network. | ▶ 00:27 |
And you might even discover hidden concepts. | ▶ 00:31 |
For example | ▶ 00:34 |
you might find that certain training examples | ▶ 00:35 |
form a hidden group. | ▶ 00:37 |
For example, with Netflix | ▶ 00:39 |
you might find that there's different types of customers | ▶ 00:41 |
some that care about classic movies | ▶ 00:43 |
some of them care about modern movies | ▶ 00:45 |
and those might form hidden concepts | ▶ 00:47 |
whose discovery can really help you | ▶ 00:49 |
make better sense of the data. | ▶ 00:51 |
Next is what from? | ▶ 00:53 |
Every machine learning method | ▶ 00:57 |
is driven by some sort of target information | ▶ 01:00 |
that you care about. | ▶ 01:02 |
In supervised learning | ▶ 01:03 |
which is the subject of today's class | ▶ 01:06 |
we're given specific target labels | ▶ 01:08 |
and I give you examples just in a second. | ▶ 01:10 |
We also talk about unsupervised learning | ▶ 01:13 |
where target labels are missing | ▶ 01:15 |
and we use replacement principles | ▶ 01:19 |
to find, for example | ▶ 01:21 |
hidden concepts. | ▶ 01:22 |
Later there will be a class in reinforcement learning | ▶ 01:24 |
when an agent learns from feedback with the physical environment | ▶ 01:27 |
by interacting and trying actions | ▶ 01:32 |
and receiving some sort of evaluation | ▶ 01:34 |
from the environment | ▶ 01:37 |
like "Well done" or "That works." | ▶ 01:37 |
Again, we will talk about those in detail later. | ▶ 01:41 |
There's different things you could try to do | ▶ 01:43 |
with a machine learning technique. | ▶ 01:46 |
You might care about prediction. | ▶ 01:48 |
For example, you might care about what's going to happen in the future | ▶ 01:49 |
in the stock market. | ▶ 01:53 |
You might care to diagnose something | ▶ 01:55 |
which is you get data and you wish to explain it | ▶ 01:57 |
and you use machine learning for that. | ▶ 01:59 |
Sometimes your objective is to summarize something. | ▶ 02:01 |
For example if you read a long article | ▶ 02:04 |
your machine learning method might aim to | ▶ 02:07 |
produce a short article that summarizes the long article. | ▶ 02:09 |
And there's many, many, many more different things. | ▶ 02:12 |
You can also talk about the how of learning. | ▶ 02:14 |
We use the word passive | ▶ 02:16 |
if your learning agent is just an observer | ▶ 02:19 |
and has no impact on the data itself. | ▶ 02:23 |
Otherwise, you call it active. | ▶ 02:24 |
Sometimes learning occurs online | ▶ 02:26 |
which means while the data is being generated | ▶ 02:30 |
and some of it is offline | ▶ 02:32 |
which means learning occurs | ▶ 02:35 |
after the data has been generated. | ▶ 02:37 |
There's different types of outputs | ▶ 02:39 |
of a machine learning algorithm. | ▶ 02:42 |
Today we'll talk about classification | ▶ 02:44 |
versus regression. | ▶ 02:47 |
In classification the output is binary | ▶ 02:50 |
or a fixed number of classes | ▶ 02:53 |
for example something is either a chair or not. | ▶ 02:55 |
Regression is continuous. | ▶ 02:57 |
The temperature might be 66.5 degrees | ▶ 02:59 |
in our prediction. | ▶ 03:01 |
And there's tons of internal details | ▶ 03:03 |
we will talk about. | ▶ 03:05 |
Just to name one. | ▶ 03:07 |
We will distinguish generative | ▶ 03:09 |
from discriminative. | ▶ 03:12 |
Generative seeks to model the data | ▶ 03:14 |
as generally as possible | ▶ 03:16 |
versus discriminative methods | ▶ 03:18 |
seek to distinguish data | ▶ 03:20 |
and this might sound like a superficial distinction | ▶ 03:21 |
but it has enormous ramifications | ▶ 03:24 |
on the learning algorithm. | ▶ 03:26 |
Now to tell you the truth | ▶ 03:27 |
it took me many years | ▶ 03:29 |
to fully learn all these words here | ▶ 03:30 |
and I don't expect you to pick them all up | ▶ 03:33 |
in one class | ▶ 03:36 |
but you should at least know that they exist. | ▶ 03:37 |
And as they come up | ▶ 03:39 |
I'll emphasize them | ▶ 03:41 |
so you can sort any learning method | ▶ 03:42 |
I tell you about back into the specific taxonomy over here. | ▶ 03:44 |
The vast amount of work in the field | ▶ 00:00 |
falls into the area of supervised learning. | ▶ 00:02 |
In supervised learning | ▶ 00:06 |
you're given for each training example | ▶ 00:08 |
a feature vector | ▶ 00:10 |
and a target label named Y. | ▶ 00:13 |
For example, for a credit rating agency | ▶ 00:16 |
X1, X2, X3 might be features | ▶ 00:20 |
such as is the person employed? | ▶ 00:23 |
What is the salary of the person? | ▶ 00:25 |
Has the person previously defaulted on a credit card? | ▶ 00:27 |
And so on. | ▶ 00:30 |
And Y is a prediction of | ▶ 00:32 |
whether the person is going to default | ▶ 00:34 |
on the credit or not. | ▶ 00:36 |
Now machine learning | ▶ 00:38 |
is to be carried out on past data | ▶ 00:40 |
where the credit rating agency | ▶ 00:42 |
might have collected features just like these | ▶ 00:44 |
and actual occurrences of default or not. | ▶ 00:46 |
What it wishes to produce | ▶ 00:49 |
is a function that allows us | ▶ 00:51 |
to predict future customers. | ▶ 00:53 |
So the new person comes in | ▶ 00:55 |
with a different feature vector. | ▶ 00:56 |
Can we predict as well as possible | ▶ 00:58 |
the functional relationship | ▶ 01:00 |
between these features X1 to Xn all the way to Y? | ▶ 01:02 |
You can apply the exact same example | ▶ 01:05 |
in image recognition | ▶ 01:08 |
where X might be pixels of images | ▶ 01:09 |
or it might be features of things found in images | ▶ 01:11 |
and Y might be a label that says | ▶ 01:14 |
whether a certain object is contained | ▶ 01:16 |
in an image or not. | ▶ 01:17 |
Now in supervised learning | ▶ 01:19 |
you're given many such examples. | ▶ 01:20 |
X21 to X2n | ▶ 01:25 |
leads to Y2, | ▶ 01:28 |
and so on, all the way to index m. | ▶ 01:32 |
This is called your data. | ▶ 01:35 |
We call each input vector Xm, | ▶ 01:38 |
and we wish to find the function that, | ▶ 01:43 |
given any Xm or any future vector X, | ▶ 01:44 |
produces a value as close as possible | ▶ 01:50 |
to my target signal Y. | ▶ 01:53 |
Now this isn't always possible | ▶ 01:55 |
and sometimes it's acceptable | ▶ 01:57 |
in fact preferable | ▶ 01:59 |
to tolerate a certain amount of error | ▶ 02:00 |
in your training data. | ▶ 02:03 |
But the subject of machine learning | ▶ 02:05 |
is to identify this function over here. | ▶ 02:07 |
And once you identify it | ▶ 02:10 |
you can use it for future Xs | ▶ 02:11 |
that weren't part of the training set | ▶ 02:13 |
to produce a prediction | ▶ 02:16 |
that hopefully is really, really good. | ▶ 02:19 |
So let me ask you a question. | ▶ 02:21 |
And this is a question | ▶ 02:24 |
for which I haven't given you the answer | ▶ 02:27 |
but I'd like to appeal to your intuition. | ▶ 02:28 |
Here's one data set | ▶ 02:31 |
where the X is one dimensionally plotted horizontally | ▶ 02:34 |
and the Y is vertically | ▶ 02:37 |
and suppose the data looks like this. | ▶ 02:39 |
Suppose my machine learning algorithm | ▶ 02:44 |
gives me 2 hypotheses. | ▶ 02:45 |
One is this function over here | ▶ 02:47 |
which is a linear function | ▶ 02:51 |
and one is this function over here. | ▶ 02:52 |
I'd like to know which of the functions | ▶ 02:53 |
you find preferable | ▶ 02:57 |
as an explanation for the data. | ▶ 02:59 |
Is it function A? | ▶ 03:01 |
Or function B? | ▶ 03:02 |
Check here for A | ▶ 03:06 |
here for B | ▶ 03:08 |
and here for neither. | ▶ 03:09 |
And I hope you guessed function A. | ▶ 00:00 |
Even though both perfectly describe the data | ▶ 00:04 |
B is much more complex than A. | ▶ 00:08 |
In fact, outside the data | ▶ 00:10 |
B seems to go to minus infinity much faster | ▶ 00:12 |
than these data points | ▶ 00:16 |
and to plus infinity much faster | ▶ 00:17 |
with these data points over here. | ▶ 00:19 |
And in between | ▶ 00:21 |
we have wide oscillations | ▶ 00:22 |
that don't correspond to any data. | ▶ 00:23 |
So I would argue | ▶ 00:25 |
A is preferable. | ▶ 00:27 |
The reason why I asked this question | ▶ 00:31 |
is because of something called Occam's Razor. | ▶ 00:32 |
Occam can be spelled in many different ways. | ▶ 00:35 |
And what Occam says is that | ▶ 00:38 |
everything else being equal | ▶ 00:41 |
choose the less complex hypothesis. | ▶ 00:43 |
Now in practice | ▶ 00:46 |
there's actually a trade-off | ▶ 00:48 |
between a really good data fit | ▶ 00:50 |
and low complexity. | ▶ 00:53 |
Let me illustrate this to you | ▶ 00:55 |
by a hypothetical example. | ▶ 00:58 |
Consider the following graph | ▶ 00:59 |
where the horizontal axis graphs | ▶ 01:02 |
complexity of the solution. | ▶ 01:04 |
For example, if you use polynomials | ▶ 01:07 |
this might be a high-degree polynomial over here | ▶ 01:10 |
and maybe a linear function over here | ▶ 01:12 |
which is a low-degree polynomial | ▶ 01:14 |
your training data error | ▶ 01:16 |
tends to go like this. | ▶ 01:19 |
The more complex the hypothesis you allow | ▶ 01:22 |
the more you can just fit your data. | ▶ 01:25 |
However, in reality | ▶ 01:29 |
your generalization error on unknown data | ▶ 01:31 |
tends to go like this. | ▶ 01:33 |
It is the sum of the training data error | ▶ 01:37 |
and another function | ▶ 01:40 |
which is called the overfitting error. | ▶ 01:42 |
Not surprisingly | ▶ 01:46 |
the best complexity is obtained | ▶ 01:47 |
where the generalization error is minimum. | ▶ 01:49 |
There are methods | ▶ 01:52 |
to calculate the overfitting error. | ▶ 01:53 |
They go into a statistical field | ▶ 01:55 |
under the name bias-variance methods. | ▶ 01:57 |
However, in practice | ▶ 02:01 |
you're often just given the training data error. | ▶ 02:02 |
You'll find that if you don't pick the model | ▶ 02:04 |
that minimizes the training data error | ▶ 02:08 |
but instead push back the complexity, | ▶ 02:11 |
your algorithm tends to perform better | ▶ 02:14 |
and that is something we will study a little bit | ▶ 02:17 |
in this class. | ▶ 02:20 |
However, this slide is really important | ▶ 02:22 |
for anybody doing machine learning in practice. | ▶ 02:26 |
If you deal with data | ▶ 02:29 |
and you have ways to fit your data | ▶ 02:31 |
be aware that overfitting | ▶ 02:33 |
is a major source of poor performance | ▶ 02:36 |
of a machine learning algorithm. | ▶ 02:39 |
And I give you examples in just one second. | ▶ 02:41 |
So a really important example | ▶ 00:00 |
of machine learning is SPAM detection. | ▶ 00:02 |
We all get way too much email | ▶ 00:04 |
and a good number of those are SPAM. | ▶ 00:06 |
Here are 3 examples of email. | ▶ 00:08 |
Dear Sir: First I must solicit your confidence | ▶ 00:12 |
in this transaction, this is by virtue of its nature | ▶ 00:14 |
being utterly confidential and top secret... | ▶ 00:16 |
This is likely SPAM. | ▶ 00:19 |
Here's another one. | ▶ 00:22 |
In upper caps. | ▶ 00:23 |
99 MILLION EMAIL ADDRESSES FOR ONLY $99 | ▶ 00:25 |
This is very likely SPAM. | ▶ 00:28 |
And here's another one. | ▶ 00:31 |
Oh, I know it's blatantly OT | ▶ 00:33 |
but I'm beginning to go insane. | ▶ 00:35 |
Had an old Dell Dimension XPS sitting in the corner | ▶ 00:37 |
and decided to put it to use. | ▶ 00:40 |
And so on and so on. | ▶ 00:41 |
Now this is likely not SPAM. | ▶ 00:42 |
How can a computer program | ▶ 00:45 |
distinguish between SPAM and not SPAM? | ▶ 00:47 |
Let's use this as an example | ▶ 00:49 |
to talk about machine learning for discrimination | ▶ 00:51 |
using Bayes Networks. | ▶ 00:55 |
In SPAM detection | ▶ 00:59 |
we get an email | ▶ 01:01 |
and we wish to categorize it | ▶ 01:03 |
either as SPAM | ▶ 01:05 |
in which case we don't even show it to the user, | ▶ 01:07 |
or what we call HAM | ▶ 01:10 |
which is the technical word for | ▶ 01:12 |
an email worth passing on to the person being emailed. | ▶ 01:15 |
So the function over here | ▶ 01:19 |
is the function we're trying to learn. | ▶ 01:21 |
Most SPAM filters use human input. | ▶ 01:23 |
When you go through email | ▶ 01:26 |
you have a button called IS SPAM | ▶ 01:28 |
which allows you as a user to flag SPAM | ▶ 01:32 |
and occasionally you will say an email is SPAM. | ▶ 01:34 |
If you look at this | ▶ 01:37 |
you have a typical supervised machine learning situation | ▶ 01:40 |
where the input is an email | ▶ 01:43 |
and the output is whether you flag it as SPAM | ▶ 01:45 |
or if we don't flag it | ▶ 01:47 |
we just think it's HAM. | ▶ 01:49 |
Now to make this amenable to | ▶ 01:52 |
a machine learning algorithm | ▶ 01:54 |
we have to talk about how to represent emails. | ▶ 01:55 |
They're all using different words and different characters | ▶ 01:57 |
and they might have different graphics included. | ▶ 02:00 |
Let's pick a representation that's easy to process. | ▶ 02:02 |
And this representation is often called | ▶ 02:06 |
Bag of Words. | ▶ 02:09 |
Bag of Words is a representation | ▶ 02:10 |
of a document | ▶ 02:14 |
that just counts the frequency | ▶ 02:15 |
of words. | ▶ 02:17 |
If an email were to say | ▶ 02:18 |
"Hello I will say Hello," | ▶ 02:22 |
The Bag of Words representation | ▶ 02:24 |
is the following. | ▶ 02:26 |
2-1-1-1 | ▶ 02:27 |
for the dictionary | ▶ 02:31 |
that contains the 4 words | ▶ 02:33 |
Hello I will say. | ▶ 02:36 |
Now look at the subtlety here. | ▶ 02:38 |
Rather than representing each individual word | ▶ 02:41 |
we have a count of each word | ▶ 02:43 |
and the count is oblivious | ▶ 02:46 |
to the order in which the words were stated. | ▶ 02:49 |
A Bag of Words representation | ▶ 02:52 |
relative to a fixed dictionary | ▶ 02:55 |
represents the counts of each word | ▶ 02:57 |
relative to the words in the dictionary. | ▶ 03:01 |
If you were to use a different dictionary | ▶ 03:03 |
like hello and good-bye | ▶ 03:06 |
our counts would be | ▶ 03:08 |
2 and 0. | ▶ 03:10 |
However, in most cases | ▶ 03:13 |
you make sure that all the words found | ▶ 03:14 |
in messages | ▶ 03:17 |
are actually included in the dictionary. | ▶ 03:18 |
So the dictionary might be very, very large. | ▶ 03:19 |
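Here is a minimal Python sketch of the bag-of-words representation just described, reproducing the counts from the example:

```python
# Bag of words: count each word's frequency relative to a fixed
# dictionary, ignoring the order in which the words appear.
from collections import Counter

def bag_of_words(text, dictionary):
    counts = Counter(text.lower().split())
    return [counts[word] for word in dictionary]

print(bag_of_words("Hello I will say Hello", ["hello", "i", "will", "say"]))
# -> [2, 1, 1, 1]
print(bag_of_words("Hello I will say Hello", ["hello", "good-bye"]))
# -> [2, 0]
```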
Let me make up an unofficial example | ▶ 03:22 |
of a few SPAM and a few HAM messages. | ▶ 03:25 |
Offer is secret. | ▶ 03:30 |
Click secret link. | ▶ 03:32 |
Secret sports link. | ▶ 03:35 |
Obviously those are contrived | ▶ 03:37 |
and I tried to restrict the vocabulary | ▶ 03:40 |
to a small number of words | ▶ 03:42 |
to make this example workable. | ▶ 03:44 |
In practice we need thousands | ▶ 03:46 |
of such messages | ▶ 03:47 |
to get good information. | ▶ 03:48 |
Play sports today. | ▶ 03:50 |
Went play sports. | ▶ 03:52 |
Secret sports event. | ▶ 03:54 |
Sport is today. | ▶ 03:56 |
Sport costs money. | ▶ 03:59 |
My first quiz is | ▶ 04:02 |
What is the size of the vocabulary | ▶ 04:06 |
that contains all words in these messages? | ▶ 04:08 |
Please enter the value in this box over here. | ▶ 04:12 |
Well let's count. | ▶ 00:00 |
Offer, is, secret, click. | ▶ 00:02 |
Secret occurs over here already | ▶ 00:08 |
so we don't have to count it twice. | ▶ 00:10 |
Link, sports, play, today, went, event, | ▶ 00:12 |
costs, money. | ▶ 00:18 |
So the answer is | ▶ 00:20 |
12. | ▶ 00:22 |
There's 12 different words | ▶ 00:24 |
contained in these 8 messages. | ▶ 00:26 |
[Narrator] Another quiz. | ▶ 00:00 |
What is the probability that a random message | ▶ 00:03 |
that arrives will fall into the spam bucket? | ▶ 00:06 |
Assuming that those messages | ▶ 00:09 |
are all drawn at random. | ▶ 00:11 |
[writing on page] | ▶ 00:13 |
[Narrator] And the answer is: | ▶ 00:00 |
there's 8 different messages | ▶ 00:02 |
of which 3 are spam. | ▶ 00:04 |
So the maximum likelihood estimate | ▶ 00:06 |
is 3/8. | ▶ 00:09 |
[writing on paper] | ▶ 00:11 |
So, let's look at this a little bit more formally and talk about maximum likelihood. | ▶ 00:00 |
Obviously, we're observing 8 messages: spam, spam, spam, and 5 times ham. | ▶ 00:03 |
And what we care about is what's our prior probability of spam | ▶ 00:12 |
that maximizes the likelihood of this data? | ▶ 00:17 |
So, let's assume we're going to assign a value of pi to this, | ▶ 00:20 |
and we wish to find the pi that maximizes the likelihood of this data over here, | ▶ 00:24 |
assuming that each email is drawn independently | ▶ 00:29 |
according to an identical distribution. | ▶ 00:33 |
The probability p(yi) of data item i is then pi if yi = spam, | ▶ 00:37 |
and 1 - pi if yi = ham. | ▶ 00:48 |
If we rewrite the data as 1, 1, 1, 0, 0, 0, 0, 0, | ▶ 00:53 |
we can write p(yi) as follows: pi to the yi times (1 - pi) to the 1 - yi. | ▶ 00:59 |
It's not that easy to see that this is equivalent, | ▶ 01:13 |
but say yi = 1. | ▶ 01:16 |
Then this term will fall out. | ▶ 01:19 |
It becomes 1 because the exponent is zero, and we get pi, as over here. | ▶ 01:22 |
If yi = 0, then this term falls out, and this one here becomes 1 - pi as over here. | ▶ 01:28 |
Now assuming independence, we get for the entire data set | ▶ 01:36 |
that the joint probability of all data items is the product | ▶ 01:44 |
of the individual data items over here, | ▶ 01:49 |
which can now be written as follows: | ▶ 01:52 |
pi to the count of instances where yi = 1 times | ▶ 01:56 |
1 - pi to the count of the instances where yi = 0. | ▶ 02:03 |
And we know in our example, this count over here is 3, | ▶ 02:09 |
and this count over here is 5, so we get pi to the 3rd times 1 - pi to the 5th. | ▶ 02:13 |
We now wish to find the pi that maximizes this expression over here. | ▶ 02:22 |
We can also maximize the logarithm of this expression, | ▶ 02:28 |
which is 3 times log pi + 5 times log (1 - pi) | ▶ 02:33 |
Optimizing the log is the same as optimizing p because the log is monotonic in p. | ▶ 02:42 |
The maximum of this function is attained where the derivative is 0, | ▶ 02:50 |
so let's compute the derivative and set it to 0. | ▶ 02:54 |
This is the derivative: 3 over pi minus 5 over (1 - pi). | ▶ 03:00 |
We now bring this expression to the right side, | ▶ 03:05 |
multiply the denominators up, and sort all the expressions containing pi to the left, | ▶ 03:09 |
which gives us pi = 3/8, exactly the number we were at before. | ▶ 03:18 |
We just derived mathematically that the data likelihood maximizing number | ▶ 03:26 |
for the probability is indeed the empirical count, | ▶ 03:33 |
which means when we looked at this quiz before | ▶ 03:37 |
and we said a maximum likelihood for the prior probability of spam is 3/8, | ▶ 03:41 |
by simply counting that 3 out of 8 emails were spam, | ▶ 03:49 |
we actually followed proper mathematical principles | ▶ 03:54 |
to do maximum likelihood estimation. | ▶ 03:57 |
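For reference, the derivation just walked through can be written out compactly, with the labels encoded as y_i = 1 for spam and y_i = 0 for ham:

    $$p(y_i) = \pi^{y_i} (1 - \pi)^{1 - y_i}$$
    $$p(\text{data}) = \prod_{i=1}^{8} p(y_i) = \pi^{3} (1 - \pi)^{5}$$
    $$\log p(\text{data}) = 3 \log \pi + 5 \log (1 - \pi)$$
    $$\frac{3}{\pi} - \frac{5}{1 - \pi} = 0 \quad\Rightarrow\quad 3(1 - \pi) = 5\pi \quad\Rightarrow\quad \pi = \frac{3}{8}$$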
Now, you might not fully have gotten the derivation of this, | ▶ 03:59 |
and I recommend you watch it again, but it's not that important | ▶ 04:03 |
for the progress in this class. | ▶ 04:07 |
So, here's another quiz. | ▶ 04:09 |
I'd like the maximum likelihood, or ML, solutions | ▶ 04:11 |
for the following probabilities: | ▶ 04:17 |
The probability that the word "secret" comes up, | ▶ 04:19 |
assuming that we already know a message is spam, | ▶ 04:21 |
and the probability that the same word "secret" comes up | ▶ 04:25 |
if we happen to know the message is not spam, it's ham. | ▶ 04:28 |
And just as before | ▶ 00:00 |
we count the word secret | ▶ 00:02 |
in SPAM and in HAM | ▶ 00:04 |
as I've underlined here. | ▶ 00:06 |
Three out of 9 words in SPAM | ▶ 00:07 |
are the word secret | ▶ 00:11 |
so we have a third over here | ▶ 00:12 |
or 0.333 | ▶ 00:14 |
and only 1 out of all the 15 words in HAM | ▶ 00:18 |
is the word secret, | ▶ 00:21 |
so you get a fifteenth | ▶ 00:22 |
or 0.0667. | ▶ 00:23 |
By now, you might have recognized what we're really building up is a Bayes network | ▶ 00:00 |
where the parameters of the Bayes networks are estimated using supervised learning | ▶ 00:06 |
by a maximum likelihood estimator based on training data. | ▶ 00:10 |
The Bayes network has at its root an unobservable variable called spam, | ▶ 00:15 |
which is binary, and it has as many children as there are words in a message, | ▶ 00:20 |
where each word has an identical conditional distribution | ▶ 00:28 |
of the word occurrence given the class spam or not spam. | ▶ 00:33 |
If you look at our dictionary over here, | ▶ 00:39 |
you might remember the dictionary had 12 different words, | ▶ 00:42 |
so here are 5 of the 12: offer, is, secret, click, and sports. | ▶ 00:48 |
Then for the spam class, we found the probability of secret given spam is 1/3, | ▶ 00:52 |
and we also found that the probability of secret given ham is 1/15, | ▶ 00:59 |
so here's a quiz. | ▶ 01:05 |
Assuming a vocabulary size of 12, or put differently, | ▶ 01:07 |
the dictionary has 12 words, how many parameters | ▶ 01:12 |
do we need to specify this Bayes network? | ▶ 01:16 |
And the correct answer is 23. | ▶ 00:00 |
We need 1 parameter for the prior p (spam), | ▶ 00:03 |
and then we have 2 dictionary distributions: the probability of any word | ▶ 00:07 |
i given spam, and the same for ham. | ▶ 00:12 |
Now, there are 12 words in the dictionary, | ▶ 00:16 |
but each distribution only needs 11 parameters, | ▶ 00:18 |
since the 12th can be figured out because they have to add up to 1. | ▶ 00:20 |
And the same is true over here, so if you add all these together, | ▶ 00:24 |
we get 23. | ▶ 00:27 |
So, here's a quiz. | ▶ 00:00 |
Let's assume we fit all the 23 parameters of the Bayes network | ▶ 00:02 |
as explained using maximum likelihood. | ▶ 00:06 |
Let's now do classification and see what class a message ends up in. | ▶ 00:09 |
Let me start with a very simple message, and it contains a single word | ▶ 00:14 |
just to make it a little bit simpler. | ▶ 00:18 |
What's the probability that we classify this one-word message, "sports", as spam? | ▶ 00:21 |
And the answer is 0.1667 or 3/18. | ▶ 00:00 |
How do I get there? Well, let's apply Bayes rule. | ▶ 00:07 |
This form is easily transformed into this expression over here, | ▶ 00:13 |
the probability of the message given spam times the prior probability of spam | ▶ 00:19 |
over the normalizer over here. | ▶ 00:25 |
Now, we know that the word "sports" occurs once in our 9 words of spam, | ▶ 00:29 |
and our prior probability for spam is 3/8, | ▶ 00:34 |
which gives us this expression over here. | ▶ 00:38 |
We now have to add the same probabilities for the class ham. | ▶ 00:40 |
"Sports" occurs 5 times out of 15 in the ham class, | ▶ 00:45 |
and the prior probability for ham is 5/8, | ▶ 00:51 |
which gives us 3/72 divided by 18/72, which is 3/18 or 1/6. | ▶ 00:55 |
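Written out as one equation, the computation just described is:

    $$P(\text{spam} \mid \text{"sports"}) = \frac{P(\text{"sports"} \mid \text{spam})\, P(\text{spam})}{P(\text{"sports"})} = \frac{\frac{1}{9} \cdot \frac{3}{8}}{\frac{1}{9} \cdot \frac{3}{8} + \frac{5}{15} \cdot \frac{5}{8}} = \frac{3/72}{18/72} = \frac{1}{6}$$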
This gets to a more complicated quiz. | ▶ 00:00 |
Say the message now contains 3 words. | ▶ 00:03 |
"Secret is secret," not a particularly meaningful email, | ▶ 00:06 |
but the frequent occurrence of "secret" seems to suggest it might be spam. | ▶ 00:10 |
What's the probability you're going to judge this to be spam? | ▶ 00:16 |
And the answer is surprisingly high. It's 25/26, or 0.9615. | ▶ 00:00 |
To see this, we apply Bayes rule, which multiplies the prior for spam-ness | ▶ 00:10 |
with the conditional probability of each word given spam. | ▶ 00:16 |
"Secret" carries 1/3, "is" 1/9, and "secret" 1/3 again. | ▶ 00:19 |
We normalize this by the same expression plus the probability for | ▶ 00:26 |
the non-spam case. | ▶ 00:32 |
5/8 is a prior. | ▶ 00:36 |
"Secret" is 1/15. | ▶ 00:38 |
"Is" is 1/15, | ▶ 00:42 |
and "secret" again. | ▶ 00:45 |
This resolves to 1/216 over this expression plus 1/5400, | ▶ 00:48 |
and when you work it all out is 25/26. | ▶ 00:57 |
The final quiz, let's assume our message is "Today is secret." | ▶ 00:00 |
And again, it might look like spam because the word "secret" occurs. | ▶ 00:08 |
I'd like you to compute for me the probability of spam given this message. | ▶ 00:12 |
And surprisingly, the probability for this message to be spam is 0. | ▶ 00:00 |
It's not 0.001. It's flat 0. | ▶ 00:07 |
In other words, it's impossible, according to our model, | ▶ 00:11 |
that this text could be a spam message. | ▶ 00:14 |
Why is this? | ▶ 00:17 |
When we apply the same rule as before, we get the prior for spam which is 3/8. | ▶ 00:19 |
And we multiply the conditional for each word into this. | ▶ 00:24 |
For "secret," we know it to be 1/3. | ▶ 00:28 |
For "is," to be 1/9, but for today, it's 0. | ▶ 00:31 |
It's 0 because the maximum of the estimate for the probability of "today" in spam is 0. | ▶ 00:39 |
"Today" just never occurred in a spam message so far. | ▶ 00:45 |
Now, this 0 is troublesome because as we compute the outcome-- | ▶ 00:49 |
and I'm plugging in all the numbers as before-- | ▶ 00:55 |
none of the words matter anymore, just the 0 matters. | ▶ 01:00 |
So, we get 0 over something which is plain 0. | ▶ 01:03 |
Are we overfitting? You bet. | ▶ 01:10 |
We are clearly overfitting. | ▶ 01:13 |
It can't be that a single word determines the entire outcome of our analysis. | ▶ 01:15 |
The reason is that for our model to assign a probability of 0 for the word "today" | ▶ 01:21 |
ever occurring in the class of spam is just too aggressive. | ▶ 01:26 |
Let's change this. | ▶ 01:29 |
One technique to deal with the overfitting problem is called Laplace smoothing. | ▶ 01:34 |
In maximum likelihood estimation, we assign to our probability | ▶ 01:39 |
the quotient of the count of this specific event over the count of all events in our data set. | ▶ 01:45 |
For example, for the prior probability, we found that 3/8 messages are spam. | ▶ 01:51 |
Therefore, our maximum likelihood estimate | ▶ 01:57 |
for the prior probability of spam was 3/8. | ▶ 02:00 |
In Laplace Smoothing, we use a different estimate. | ▶ 02:05 |
We add the value k to the count | ▶ 02:10 |
and normalize as if we added k to every single class | ▶ 02:15 |
that we've tried to estimate something over. | ▶ 02:20 |
This is equivalent to assuming we have a couple of fake training examples | ▶ 02:23 |
where we add k to each observation count. | ▶ 02:28 |
Now, if k equals 0, we get our maximum likelihood estimator. | ▶ 02:32 |
But if k is larger than 0 and n is finite, we get different answers. | ▶ 02:36 |
Let's say k equals 1, | ▶ 02:41 |
and let's assume we get one message, | ▶ 02:47 |
and that message was spam, so we're going to write it as: one message, one spam. | ▶ 02:51 |
What is p(spam) under Laplace smoothing with k = 1? | ▶ 02:56 |
Let's do the same with 10 messages, and we get 6 spam. | ▶ 03:03 |
And 100 messages, of which 60 are spam. | ▶ 03:09 |
Please enter your numbers into the boxes over here. | ▶ 03:16 |
The answer here is 2/3 or 0.667 and is computed as follows. | ▶ 00:00 |
We have 1 message, of which 1 is spam, and we're going to add k = 1 to the count. | ▶ 00:10 |
Down here we add k for each of the 2 different classes, | ▶ 00:16 |
so k = 1 times 2 = 2, which gives us 2/3. | ▶ 00:22 |
The answer over here is 7/12. | ▶ 00:28 |
Again, we have 6/10 but we add 2 down here and 1 over here, so you get 7/12. | ▶ 00:32 |
And correspondingly, we get 61/102, which is (60 + 1) over (100 + 2). | ▶ 00:41 |
If we look at the numbers over here, we get 0.5833 | ▶ 00:49 |
and 0.5980. | ▶ 00:56 |
Interestingly, maximum likelihood on the last 2 cases over here | ▶ 00:59 |
would give us .6, but here we get values that are pulled closer to .5, | ▶ 01:03 |
which is the effect of the smoothing prior in Laplacian smoothing. | ▶ 01:09 |
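In symbols, the Laplace-smoothed prior used in this quiz is the following, where N is the number of messages and the factor 2 reflects the two classes, spam and ham:

    $$P(\text{spam}) = \frac{\text{count}(\text{spam}) + k}{N + 2k} \qquad \frac{1+1}{1+2} = \frac{2}{3}, \quad \frac{6+1}{10+2} = \frac{7}{12}, \quad \frac{60+1}{100+2} = \frac{61}{102}$$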
Let's use the Laplacian smoother with K=1 | ▶ 00:00 |
to calculate the few interesting probabilities-- | ▶ 00:05 |
P of SPAM, P of HAM, | ▶ 00:09 |
and then the probability of the words "today", | ▶ 00:12 |
given that it's in the SPAM class or the HAM class. | ▶ 00:15 |
And you may assume that our vocabulary size | ▶ 00:19 |
is 12 different words here. | ▶ 00:22 |
This one is easy to calculate for SPAM and HAM. | ▶ 00:00 |
For SPAM, it's 2/5, | ▶ 00:03 |
and the reason is, we had previously | ▶ 00:05 |
3 out of 8 messages assigned to SPAM. | ▶ 00:08 |
But thanks to the Laplacian smoother, we add 1 over here. | ▶ 00:12 |
And there are 2 classes, so we add 2 times 1 over here, | ▶ 00:15 |
which gives us 4/10, which is 2/5. | ▶ 00:19 |
Similarly, we get 3/5 over here. | ▶ 00:22 |
Now the tricky part comes up over here. | ▶ 00:26 |
Before, we had 0 occurrences of the word "today" in the SPAM class, | ▶ 00:29 |
and we had 9 data points. | ▶ 00:33 |
But now we are going to add 1 for Laplacian smoother, | ▶ 00:35 |
and down here, we are going to add 12. | ▶ 00:38 |
And the reason that we add 12 is because | ▶ 00:40 |
there are 12 different words in our dictionary. | ▶ 00:42 |
Hence, for each word in the dictionary, we are going to add 1. | ▶ 00:44 |
So we have a total of 12, which gives us the 12 over here. | ▶ 00:47 |
That makes 1/21. | ▶ 00:50 |
In the HAM class, we had 2 occurrences | ▶ 00:53 |
of the word "today"--over here and over here. | ▶ 00:56 |
We add 1, normalize by 15, | ▶ 00:59 |
plus 12 for the dictionary size, | ▶ 01:04 |
which is 3/27 or 1/9. | ▶ 01:07 |
This was not an easy question. | ▶ 01:14 |
We come now to the final quiz here, | ▶ 00:00 |
which is--I would like to compute the probability | ▶ 00:03 |
that the message "today is secret" | ▶ 00:05 |
falls into the SPAM box with | ▶ 00:08 |
Laplacian smoother using K=1. | ▶ 00:10 |
Please just enter your number over here. | ▶ 00:13 |
This is a non-trivial question. | ▶ 00:16 |
It might take you a while to calculate this. | ▶ 00:18 |
The approximate probability is 0.4858. | ▶ 00:00 |
How did we get this? | ▶ 00:06 |
Well, the prior probability for SPAM | ▶ 00:08 |
under the Laplacian smoothing is 2/5. | ▶ 00:12 |
"Today" doesn't occur, but we have already calculated this to be 1/21. | ▶ 00:15 |
"Is" occurs once, so we get 2 over here over 21. | ▶ 00:22 |
"Secret" occurs 3 times, so we get a 4 over here over 21, | ▶ 00:26 |
and we normalize this by the same expression over here. | ▶ 00:32 |
Plus the prior for HAM, which is 3/5, | ▶ 00:37 |
we have 2 occurrences of "today", plus 1, equals 3/27. | ▶ 00:42 |
"Is" occurs once--2/27. | ▶ 00:47 |
And "secret" occurs once--again 2/27. | ▶ 00:50 |
When you work this all out, you get this number over here. | ▶ 00:54 |
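To tie the whole calculation together, here is a minimal Python sketch of the Laplace-smoothed Naive Bayes classifier from this unit; the messages and k = 1 match the lecture's example, the function names are mine, and "sport" is again merged into "sports" to match the 12-word dictionary.

    # Laplace-smoothed Naive Bayes, following the lecture's example.
    spam = ["offer is secret", "click secret link", "secret sports link"]
    ham = ["play sports today", "went play sports", "secret sports event",
           "sports is today", "sports costs money"]
    k = 1

    def word_counts(messages):
        counts = {}
        for m in messages:
            for w in m.split():
                counts[w] = counts.get(w, 0) + 1
        return counts

    spam_counts, ham_counts = word_counts(spam), word_counts(ham)
    vocab = set(spam_counts) | set(ham_counts)                # 12 words
    n_spam = sum(spam_counts.values())                        # 9 spam words
    n_ham = sum(ham_counts.values())                          # 15 ham words

    def p_word(word, counts, total):
        # Add k to the count; add k once per dictionary word below.
        return (counts.get(word, 0) + k) / (total + k * len(vocab))

    def p_spam_given(message):
        # Smoothed priors: (3 + 1)/(8 + 2) = 2/5 spam, 3/5 ham.
        score_spam = (len(spam) + k) / (len(spam) + len(ham) + 2 * k)
        score_ham = 1 - score_spam
        for w in message.split():
            score_spam *= p_word(w, spam_counts, n_spam)
            score_ham *= p_word(w, ham_counts, n_ham)
        return score_spam / (score_spam + score_ham)

    print(round(p_spam_given("today is secret"), 4))          # 0.4858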
So we learned quite a bit. | ▶ 00:00 |
We learned about Naive Bayes | ▶ 00:02 |
as our first supervised learning methods. | ▶ 00:04 |
The setup was that we had | ▶ 00:06 |
features of documents, or training examples, and labels. | ▶ 00:08 |
In this case, SPAM or not SPAM. | ▶ 00:14 |
And from those pieces, | ▶ 00:17 |
we made a generative model for the SPAM class | ▶ 00:19 |
and the non-SPAM class | ▶ 00:23 |
that describes the conditional probability | ▶ 00:25 |
of each individual feature. | ▶ 00:28 |
We then used first maximum likelihood | ▶ 00:30 |
and then a Laplacian smoother | ▶ 00:33 |
to fit those parameters over here. | ▶ 00:36 |
And then using Bayes rule, | ▶ 00:38 |
we could take any example over here | ▶ 00:41 |
and figure out what its class probability was over here. | ▶ 00:44 |
This is called a generative model | ▶ 00:48 |
in that the conditional probabilities all aim to maximize | ▶ 00:51 |
the probability of individual features as if those | ▶ 00:55 |
described the physical world. | ▶ 01:00 |
We also used what is called a bag of words model, | ▶ 01:02 |
in which our representation of each email | ▶ 01:06 |
was such that we just counted the occurrences of words, | ▶ 01:09 |
irrespective of their order. | ▶ 01:12 |
Now this is a very powerful method for fighting SPAM. | ▶ 01:15 |
Unfortunately, it is not powerful enough. | ▶ 01:19 |
It turns out spammers know about Naive Bayes, | ▶ 01:21 |
and they've long learned to come up with messages | ▶ 01:24 |
that are fooling your SPAM filter if it uses Naive Bayes. | ▶ 01:27 |
So companies like Google and others | ▶ 01:31 |
have become much more involved | ▶ 01:33 |
in methods for SPAM filtering. | ▶ 01:35 |
Now I can give you some more examples of how to filter SPAM, | ▶ 01:38 |
but all of those fit quite easily into the same Naive Bayes model. | ▶ 01:42 |
[Narrator] So here are features that you might consider when you write | ▶ 00:00 |
an advanced spam filter. | ▶ 00:03 |
For example, | ▶ 00:05 |
does the email come from | ▶ 00:07 |
a known spamming IP or computer? | ▶ 00:09 |
Have you emailed this person before? | ▶ 00:12 |
In which case it is less likely to be spam. | ▶ 00:16 |
Here's a powerful one: | ▶ 00:19 |
have 1000 other people | ▶ 00:22 |
recently received the same message? | ▶ 00:25 |
Is the email header consistent? | ▶ 00:29 |
For example, if the "from" field says it's your bank, | ▶ 00:32 |
is the IP address really your bank's? | ▶ 00:35 |
Surprisingly: is the email in all caps? | ▶ 00:38 |
Strangely, many spammers believe that if they write | ▶ 00:42 |
things in all caps, you'll pay more attention to them. | ▶ 00:44 |
Do the inline URLs point to the pages | ▶ 00:48 |
they say they're pointing to? | ▶ 00:51 |
Are you addressed by your correct name? | ▶ 00:54 |
Now these are some features, | ▶ 00:56 |
I'm sure you can think of more. | ▶ 00:58 |
You can toss them easily into the | ▶ 01:00 |
Naive Bayes model and get better classification. | ▶ 01:02 |
In fact, modern spam filters keep learning | ▶ 01:05 |
as people flag emails as spam, and | ▶ 01:08 |
of course spammers keep learning as well | ▶ 01:10 |
and trying to fool modern spam filters. | ▶ 01:13 |
Who's going to win? | ▶ 01:16 |
Well so far the spam filters are clearly winning. | ▶ 01:18 |
Most of my spam I never see, but who knows | ▶ 01:21 |
what's going to happen in the future? | ▶ 01:23 |
It's a really fascinating machine learning problem. | ▶ 01:25 |
[Narrator] Naive Bayes can also be applied to | ▶ 00:00 |
the problem of handwritten digit recognition. | ▶ 00:02 |
This is a sample of handwritten digits taken | ▶ 00:05 |
from a U.S. postal data set | ▶ 00:09 |
where handwritten zip codes on letters are | ▶ 00:12 |
being scanned and automatically classified. | ▶ 00:17 |
The machine-learning problem here is: | ▶ 00:21 |
taking a symbol just like this, | ▶ 00:23 |
What is the corresponding number? | ▶ 00:28 |
Here it's obviously 0. | ▶ 00:30 |
Here it's obviously 1. | ▶ 00:32 |
Here it's obviously 2, 1. | ▶ 00:34 |
For the one down here, | ▶ 00:36 |
it's a little bit harder to tell. | ▶ 00:38 |
Now when you apply Naive Bayes, | ▶ 00:41 |
the input vector | ▶ 00:44 |
could be the pixel values | ▶ 00:46 |
of each individual pixel so we have | ▶ 00:48 |
a 16 x 16 input resolution. | ▶ 00:50 |
You would get 256 different values | ▶ 00:54 |
corresponding to the brightness of each pixel. | ▶ 00:59 |
Now obviously, given sufficiently many | ▶ 01:02 |
training examples, you might hope | ▶ 01:05 |
to recognize digits, | ▶ 01:07 |
but one of the deficiencies of this approach is | ▶ 01:09 |
that it is not particularly shift-invariant. | ▶ 01:12 |
So for example a pattern like this | ▶ 01:15 |
will look fundamentally different | ▶ 01:19 |
from a pattern like this. | ▶ 01:21 |
Even though the pattern on the right is obtained | ▶ 01:24 |
by shifting the pattern on the left | ▶ 01:27 |
by 1 to the right. | ▶ 01:29 |
There are many different solutions, but a common one could be | ▶ 01:31 |
to use smoothing in a different way from | ▶ 01:34 |
the way we discussed it before. | ▶ 01:36 |
Instead of just counting 1 pixel's value, | ▶ 01:38 |
you could mix it with the counts of the | ▶ 01:40 |
neighboring pixel values, so if | ▶ 01:42 |
all pixels are slightly shifted, | ▶ 01:44 |
we get about the same statistics | ▶ 01:46 |
as for the pixel itself. | ▶ 01:48 |
Such a method is called input smoothing. | ▶ 01:50 |
You can, as it's technically called, convolve | ▶ 01:52 |
the input vector of pixel values with a smoothing kernel, and | ▶ 01:55 |
you might get better results than if you | ▶ 01:57 |
do Naive Bayes on the raw pixels. | ▶ 02:00 |
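As a toy illustration of such input smoothing, one could average each pixel with its immediate neighbors before counting; the 1-D row of pixels and the 3-pixel window below are my own simplifications.

    # Smooth a row of pixel brightness values by averaging each pixel
    # with its neighbors, so slightly shifted patterns look similar.
    def smooth(pixels):
        smoothed = []
        for i in range(len(pixels)):
            window = pixels[max(0, i - 1):i + 2]
            smoothed.append(sum(window) / len(window))
        return smoothed

    print(smooth([0, 0, 255, 0, 0]))  # [0.0, 85.0, 85.0, 85.0, 0.0]
    print(smooth([0, 0, 0, 255, 0]))  # [0.0, 0.0, 85.0, 85.0, 127.5]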
Now to tell you the truth for | ▶ 02:02 |
digit recognition of this type, | ▶ 02:04 |
Naive Bayes is not a good choice. | ▶ 02:06 |
The conditional independence assumption | ▶ 02:08 |
of each pixel, given the class, | ▶ 02:10 |
is too strong an assumption in this case, | ▶ 02:12 |
but it's fun to talk about image recognition | ▶ 02:14 |
in the context of Naive Bayes regardless. | ▶ 02:17 |
So, let me take a step back and talk a bit about | ▶ 00:00 |
overfitting prevention in machine learning | ▶ 00:04 |
because it's such an important topic. | ▶ 00:07 |
We talked about Occam's Razor, | ▶ 00:09 |
which in a generalized way suggests there is | ▶ 00:12 |
a tradeoff between how well we can fit the data | ▶ 00:16 |
and how smooth our learning algorithm is. | ▶ 00:22 |
In our Laplacian smoothing, we already found 1 way | ▶ 00:28 |
to let Occam's Razor play, which is by | ▶ 00:32 |
selecting the value K to make our statistical counts smoother. | ▶ 00:34 |
I alluded to a similar way in the image recognition domain | ▶ 00:40 |
where we smoothed the image so that neighboring pixels count similarly. | ▶ 00:44 |
This all raises the question of how to choose the smoothing parameter. | ▶ 00:49 |
So, in particular, in Laplacian smoothing, how to choose the K. | ▶ 00:53 |
There is a method called cross-validation | ▶ 00:58 |
which can help you find an answer. | ▶ 01:02 |
This method assumes there is plenty of training examples, but | ▶ 01:05 |
to tell you the truth, in spam filtering there is more than you'd ever want. | ▶ 01:09 |
Take your training data | ▶ 01:14 |
and divide it into 3 buckets. | ▶ 01:17 |
Train, cross-validate, and test. | ▶ 01:19 |
Typical ratios will be 80% goes into train, | ▶ 01:24 |
10% into cross-validate, | ▶ 01:27 |
and 10% into test. | ▶ 01:30 |
You use the training set to find all your parameters-- | ▶ 01:33 |
for example, the probabilities of a Bayes network. | ▶ 01:37 |
You use your cross-validation set | ▶ 01:40 |
to find the optimal K, and the way you do this is | ▶ 01:43 |
you train for different values of K, | ▶ 01:46 |
you observe how well the trained model performs on the cross-validation data, | ▶ 01:49 |
not touching the test data, | ▶ 01:55 |
and then you maximize over all the Ks to get the best performance | ▶ 01:58 |
on the cross-validation set. | ▶ 02:01 |
You iterate this many times until you find the best K. | ▶ 02:03 |
When you're done with the best K, | ▶ 02:06 |
you train again, and then finally | ▶ 02:09 |
only once do you touch the test data | ▶ 02:12 |
to verify the performance, | ▶ 02:15 |
and this is the performance you report. | ▶ 02:17 |
It's really important in cross-validation | ▶ 02:20 |
to split apart a cross-validation set that's different from the test set. | ▶ 02:23 |
If you were to use the test set to find the optimal K, | ▶ 02:28 |
then your test set becomes an effective part of your training routine, | ▶ 02:31 |
and you might overfit your test data, | ▶ 02:35 |
and you wouldn't even know. | ▶ 02:38 |
By keeping the test data separate from the beginning, | ▶ 02:40 |
and training on the training data, you use | ▶ 02:43 |
the cross-validation data to see how well your trained model is doing | ▶ 02:46 |
and to fine-tune the unknown smoothing parameter K. | ▶ 02:49 |
Finally, only once you use the test data | ▶ 02:53 |
do you get a fair answer to the question, | ▶ 02:56 |
"How well will your model perform on future data?" | ▶ 02:59 |
So, pretty much everybody in machine learning | ▶ 03:02 |
uses this model. | ▶ 03:05 |
You can redo the split between the training and the cross-validation part; | ▶ 03:08 |
people often use the term 10-fold cross-validation, | ▶ 03:12 |
where they do 10 different foldings | ▶ 03:15 |
and run the model 10 times to find the optimal K | ▶ 03:17 |
or smoothing parameter. | ▶ 03:20 |
No matter which way you do it, find the optimal smoothing parameter | ▶ 03:22 |
and then use the test set exactly once to verify it and report that result. | ▶ 03:25 |
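Here is a sketch of that train/cross-validate/test recipe in Python; the deliberately trivial model just fits a Laplace-smoothed spam prior, but any learner with a fit-and-score interface would slot in the same way.

    from math import log

    def fit(train, k):
        # Laplace-smoothed prior p(spam) from binary labels (1 = spam).
        return (sum(train) + k) / (len(train) + 2 * k)

    def log_likelihood(p, labels):
        return sum(log(p) if y == 1 else log(1 - p) for y in labels)

    def choose_k(labels, candidate_ks):
        n = len(labels)
        train = labels[:int(0.8 * n)]              # 80% train
        cv = labels[int(0.8 * n):int(0.9 * n)]     # 10% cross-validate
        test = labels[int(0.9 * n):]               # 10% test
        # Score each candidate k on the CV set only, never on test.
        best_k = max(candidate_ks,
                     key=lambda k: log_likelihood(fit(train, k), cv))
        # Touch the test set exactly once, and report that performance.
        return best_k, log_likelihood(fit(train, best_k), test)

    labels = [1, 0, 0, 1, 0, 0, 0, 1, 0, 1] * 10   # toy spam/ham labels
    print(choose_k(labels, candidate_ks=[0.1, 1, 10]))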
Let me back up a step further, | ▶ 00:00 |
and let's look at supervised learning more generally. | ▶ 00:03 |
Our example so far was one of classification. | ▶ 00:06 |
The characteristic of classification is | ▶ 00:09 |
that the target labels or the target class is discrete. | ▶ 00:12 |
In our case it was actually binary. | ▶ 00:16 |
In many problems, we try to predict a continuous quantity. | ▶ 00:18 |
For example, in the interval 0 to 1 or perhaps a real number. | ▶ 00:23 |
Those machine learning problems are called regression problems. | ▶ 00:29 |
Regression problems are fundamentally different from classification problems. | ▶ 00:33 |
For example, our Bayes network doesn't afford us an answer | ▶ 00:37 |
to a problem where the target value could be anywhere between 0 and 1. | ▶ 00:42 |
A regression problem, for example, would be one to | ▶ 00:45 |
predict the weather tomorrow. | ▶ 00:48 |
Temperature is a continuous value. Our Bayes network would not be able | ▶ 00:50 |
to predict the temperature; it can only predict discrete classes. | ▶ 00:53 |
A regression algorithm is able to give us a continuous prediction | ▶ 00:58 |
about the temperature tomorrow. | ▶ 01:01 |
So let's look at the regression next. | ▶ 01:04 |
So here's my first quiz for you on regression. | ▶ 01:07 |
This scatter plot shows, for Berkeley, California, for a period of time, | ▶ 01:10 |
the data for each house that was sold. | ▶ 01:18 |
Each dot is a sold house. | ▶ 01:21 |
It graphs the size of the house in square feet | ▶ 01:24 |
to the sales price in thousands of dollars. | ▶ 01:27 |
As you can see, roughly speaking, | ▶ 01:32 |
as the size of the house goes up, | ▶ 01:34 |
so does the sales price. | ▶ 01:37 |
I wonder, for a house of about 2500 square feet, | ▶ 01:40 |
what is the approximate sales price you would assume | ▶ 01:45 |
based just on the scatter plot data? | ▶ 01:49 |
Is it 400k, 600k, 800k, or 1000k? | ▶ 01:52 |
My answer is, there seems to be a roughly linear relationship, | ▶ 00:00 |
maybe not quite linear, between the house size and the price. | ▶ 00:05 |
So we look at a linear graph that best describes the data-- | ▶ 00:11 |
you get this dashed line over here. | ▶ 00:15 |
And for the dashed line, if you walk up the 2500 square feet, | ▶ 00:18 |
you end up with roughly 800K. | ▶ 00:22 |
So this would have been the best answer. | ▶ 00:24 |
Now obviously you can answer this question without understanding anything about regression. | ▶ 00:00 |
But what you find is this is different from classification as before. | ▶ 00:05 |
This is not a binary concept anymore of like expensive and cheap. | ▶ 00:10 |
It really is a relationship between two variables. | ▶ 00:13 |
One you care about--the house price, and one that you can observe, | ▶ 00:17 |
which is the house size in square feet. | ▶ 00:20 |
And your goal is to fit a curve that best explains the data. | ▶ 00:23 |
Once again, we have a case where we can play Occam's razor. | ▶ 00:28 |
There clearly is a data fit that is not linear that might be better, | ▶ 00:31 |
like this one over here. | ▶ 00:35 |
And when you go beyond linear curves, | ▶ 00:37 |
you might even be inclined to draw a curve like this. | ▶ 00:40 |
Now of course the curve I'm drawing right now is likely an overfit. | ▶ 00:44 |
And you don't want to postulate that this is the general relationship | ▶ 00:49 |
between the size of a house and the sales price. | ▶ 00:54 |
So even though my black curve might describe the data better, | ▶ 00:57 |
the blue curve or the dashed linear curve over here might be a better explanation by virtue of Occam's razor. | ▶ 01:01 |
So let's look a little bit deeper into what we call regression. | ▶ 01:08 |
As in all regression problems, our data will be comprised of | ▶ 00:15 |
input vectors of length n that map to another continuous value. | ▶ 00:19 |
And we might be given a total of M data points. | ▶ 01:25 |
This is as in the classification case, except this time the Ys are continuous. | ▶ 00:30 |
Once again, we're looking for function f that maps our vector x into y. | ▶ 01:36 |
In linear regression, the function has a particular form which is W1 times X plus W0. | ▶ 01:44 |
In this case X is one-dimensional, which means N = 1. | ▶ 01:54 |
Or in the high-dimensional space, we might just write W times X plus W0, | ▶ 01:59 |
where W is a vector and X is a vector. | ▶ 02:07 |
And this is the inner product of these 2 vectors over here. | ▶ 02:12 |
Let's for now just consider the one-dimensional case. | ▶ 02:16 |
In this quiz, I've given you a linear regression form with 2 unknown parameters, W1 and W0. | ▶ 02:20 |
I've given you a data set. | ▶ 02:27 |
And this data set happens to be fittable by a linear regression model without any residual error. | ▶ 02:30 |
Without any math, can you look at this and tell me what the 2 parameters, W0 and W1, are? | ▶ 02:36 |
This is a surprisingly challenging question. | ▶ 00:00 |
If you look at these numbers from 3 to 6: | ▶ 00:03 |
When we increase X by 3, Y decreases by 3, | ▶ 00:07 |
which suggests W1 is -1. | ▶ 00:14 |
Now let's see if this holds. | ▶ 00:18 |
If we increase X by 3, it decreases Y by 3. | ▶ 00:20 |
If we increase X by 1, we decrease Y by 1. | ▶ 00:24 |
If we increase X by 2, we decrease Y by 2. | ▶ 00:28 |
So this number seems to be an exact fit. | ▶ 00:32 |
Next we have to get the constant W0 right. | ▶ 00:36 |
For X = 3, we get -3 as an expression over here, | ▶ 00:41 |
because we know W1 = -1. | ▶ 00:48 |
So if this has to equal zero in the end, then W0 has to be 3. | ▶ 00:50 |
Let's do a quick check. | ▶ 00:57 |
-3 plus 3 is 0. | ▶ 00:59 |
-6 plus 3 is -3. | ▶ 01:02 |
And if we plug in any of the numbers, you find those are correct. | ▶ 01:05 |
Now this is the case of an exact data set. | ▶ 01:09 |
It gets much more challenging if the data set cannot be fit with a linear function. | ▶ 01:12 |
To define linear regression, | ▶ 00:00 |
we need to understand what we are trying to minimize. | ▶ 00:02 |
This is called the loss function, | ▶ 00:05 |
and the loss function is the amount of residual error we obtain | ▶ 00:08 |
after fitting the linear function as well as possible. | ▶ 00:12 |
The residual error is the sum over all training examples | ▶ 00:16 |
j of yj, which is the target label, | ▶ 00:20 |
minus our prediction, w1 xj + w0, all squared. | ▶ 00:25 |
This is the quadratic error between our target labels | ▶ 00:34 |
and what our best hypothesis can produce. | ▶ 00:37 |
Minimizing this loss is how we solve | ▶ 00:41 |
the linear regression problem, | ▶ 00:43 |
and you can write it as follows: | ▶ 00:46 |
Our solution to the regression problem W* | ▶ 00:50 |
is the arg min of the loss over all possible vectors W. | ▶ 00:52 |
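In symbols, the loss and the resulting optimization problem read:

    $$\text{Loss}(w) = \sum_j \left( y_j - w_1 x_j - w_0 \right)^2, \qquad w^* = \arg\min_w \text{Loss}(w)$$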
The problem of minimizing quadratic loss for linear functions can be solved in closed form. | ▶ 00:00 |
I will do this for the one-dimensional case on paper. | ▶ 00:07 |
I will also give you the solution for the case where your input space is multidimensional, | ▶ 00:12 |
which is often called "multivariate regression." | ▶ 00:17 |
We seek to minimize a sum of a quadratic expression | ▶ 00:22 |
where the target labels are subtracted with the output of our linear regression model | ▶ 00:26 |
parameterized by w1 and w0. | ▶ 00:33 |
The summation here is overall training examples, | ▶ 00:36 |
and I leave the index of the summation out if not necessary. | ▶ 00:40 |
The minimum of this is obtained where the derivative of this function equals zero. | ▶ 00:45 |
Let's call this function "L." | ▶ 00:50 |
For the partial derivative with respect to w0, we get this expression over here, | ▶ 00:53 |
which we have to set to zero. | ▶ 00:59 |
We can easily get rid of the -2 and transform this as follows: | ▶ 01:02 |
Here M is the number of training examples. | ▶ 01:11 |
This expression over here gives us w0 as a function of w1, | ▶ 01:17 |
but we don't know w1. Let's do the same trick for w1 | ▶ 01:21 |
and set this to zero as well, | ▶ 01:28 |
which gets us the expression over here. | ▶ 01:32 |
We can now plug in the w0 over here into this expression over here | ▶ 01:38 |
and obtain this expression over here, | ▶ 01:44 |
which looks really involved but is relatively straightforward. | ▶ 01:47 |
With a few steps of further calculation, which I'll spare you for now, | ▶ 01:52 |
we get for w1 the following important formula: | ▶ 01:56 |
This is the final quotient for w1, | ▶ 02:02 |
where we take the number of training examples times the sum of all xy | ▶ 02:05 |
minus the sum of x times the sum of y divided by this expression over here. | ▶ 02:10 |
Once we've computed w1, | ▶ 02:16 |
we can go back to our original expression for w0 over here | ▶ 02:19 |
and plug w1 in to obtain w0. | ▶ 02:23 |
These are the two important formulas we can also find in the textbook. | ▶ 02:30 |
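Written out for M training examples, those two formulas are:

    $$w_1 = \frac{M \sum_j x_j y_j - \left(\sum_j x_j\right)\left(\sum_j y_j\right)}{M \sum_j x_j^2 - \left(\sum_j x_j\right)^2}, \qquad w_0 = \frac{1}{M} \sum_j y_j - \frac{w_1}{M} \sum_j x_j$$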
I'd like to go back and use those formulas to calculate these two coefficients over here. | ▶ 02:39 |
You get 4 times the sum of all xy, where that sum is -32, | ▶ 02:45 |
minus the product of the sum of x, which is 18, and the sum of y, which is -6, | ▶ 02:56 |
divided by 4 times the sum of x squared, which is 86, minus the square of the sum of x, | ▶ 03:05 |
which is 18 times 18, or 324. | ▶ 03:16 |
If you work this all out, it becomes -1, which is w1. | ▶ 03:20 |
W0 is now obtained by computing a quarter times the sum of all y, | ▶ 03:25 |
which is -6, minus w1 = -1 times a quarter times the sum of all x, which is 18. | ▶ 03:31 |
If you plug this all in, you get 3, as over here. Our formula is actually correct. | ▶ 03:39 |
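As a quick check of these formulas in code, here is a minimal Python sketch; the four data points (x from 3 to 6, lying exactly on y = 3 - x) are reconstructed from the quiz above.

    # Closed-form 1-D linear regression, following the formulas above.
    def fit_line(xs, ys):
        m = len(xs)
        sx, sy = sum(xs), sum(ys)
        sxy = sum(x * y for x, y in zip(xs, ys))
        sxx = sum(x * x for x in xs)
        w1 = (m * sxy - sx * sy) / (m * sxx - sx * sx)
        w0 = sy / m - w1 * sx / m
        return w1, w0

    print(fit_line([3, 4, 5, 6], [0, -1, -2, -3]))  # (-1.0, 3.0)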
Here is another quiz for linear regression. We have the follow data: | ▶ 03:46 |
Here is the data plotted graphically. | ▶ 03:51 |
I wonder what the best regression is. | ▶ 03:53 |
Give me w0 and w1. Apply the formulas I just gave you. | ▶ 03:56 |
And the answer is W0 = 0.5, and W1 = 0.9. | ▶ 00:00 |
If I were to draw a line, it would go about like this. | ▶ 00:09 |
It doesn't really hit the two points at the end. | ▶ 00:14 |
If you were thinking of something like this, you were wrong. | ▶ 00:19 |
If you draw a curve like this, your quadratic error becomes 2. | ▶ 00:24 |
One over here, and one over here. | ▶ 00:28 |
The quadratic error is smaller for the line that goes in between those points. | ▶ 00:30 |
This is easily seen by computing as shown in the previous slide. | ▶ 00:35 |
W1 equals (4 x 118 - 20 x 20) / (4 x 120 - 400) which is 0.9. | ▶ 00:41 |
This is merely plugging in those numbers into the formulas I gave you. | ▶ 00:55 |
W0 then becomes ¼ x 20, minus W1, which is 0.9, times ¼ x 20. | ▶ 01:00 |
That is 5 minus 4.5, which equals 0.5. | ▶ 01:05 |
This is an example of linear regression, | ▶ 01:12 |
in which case there is a residual error, | ▶ 01:16 |
and the best-fitting curve is the one that minimizes | ▶ 01:18 |
the total of the residual vertical error in this graph over here. | ▶ 01:22 |
So linear regression works well | ▶ 00:00 |
if the data is approximately linear, | ▶ 00:03 |
but there are many examples when linear regression performs poorly. | ▶ 00:05 |
Here's one where we have a | ▶ 00:09 |
curve that is really nonlinear. | ▶ 00:12 |
This is an interesting one where we seem to have a linear relationship | ▶ 00:15 |
that is flatter than the linear regression indicates, | ▶ 00:18 |
but there is one outlier. | ▶ 00:21 |
Because if you are minimizing quadratic error, | ▶ 00:23 |
outliers penalize you over-proportionately. | ▶ 00:26 |
So outliers are particularly bad for linear regression. | ▶ 00:30 |
And here is a case | ▶ 00:34 |
where the data clearly suggests | ▶ 00:35 |
a phenomenon that is very different from linear. | ▶ 00:37 |
We have only two variables being used, | ▶ 00:40 |
and this one has a strong frequency | ▶ 00:42 |
and a strong vertical spread. | ▶ 00:45 |
Clearly a linear regression model | ▶ 00:47 |
is a very poor one to explain | ▶ 00:49 |
this data over here. | ▶ 00:51 |
Another problem with linear regression | ▶ 00:53 |
is that as you go to infinity in the X space, | ▶ 00:55 |
your Ys also become infinite. | ▶ 00:59 |
In some problems that isn't a plausible model. | ▶ 01:02 |
For example, if you wish to predict the weather | ▶ 01:05 |
anytime into the future, | ▶ 01:08 |
it's implausible to assume the further the prediction goes out, | ▶ 01:10 |
the hotter or the cooler it becomes. | ▶ 01:13 |
For such situations there is a | ▶ 01:15 |
model called logistic regression, | ▶ 01:17 |
which uses a slightly more complicated | ▶ 01:20 |
model than linear regression, | ▶ 01:22 |
which goes as follows: | ▶ 01:24 |
Let F of X be our linear function, | ▶ 01:25 |
and the output of logistic regression | ▶ 01:30 |
is obtained by the following function: | ▶ 01:32 |
One over one plus exponential of minus F of X. | ▶ 01:34 |
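In symbols, with f(x) = w1 x + w0, the logistic output is:

    $$z = \frac{1}{1 + e^{-f(x)}}$$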
So here's a quick quiz for you. | ▶ 01:40 |
What is the range in which Z might fall, | ▶ 01:43 |
given this function over here, | ▶ 01:48 |
where F of X is the linear function over here? | ▶ 01:49 |
Is it zero, one? | ▶ 01:53 |
Is it minus one, one? | ▶ 01:56 |
Is it minus one, zero? | ▶ 01:59 |
Minus two, two? | ▶ 02:02 |
Or none of the above? | ▶ 02:04 |
The answer is zero, one. | ▶ 00:00 |
If this expression over here, | ▶ 00:02 |
F of X, | ▶ 00:05 |
grows to positive infinity, | ▶ 00:07 |
then Z becomes one. | ▶ 00:09 |
And the reason is | ▶ 00:14 |
as this term over here becomes very large, | ▶ 00:16 |
E to the minus of that term approaches zero; | ▶ 00:19 |
one over one equals one. | ▶ 00:22 |
If F of X goes to minus infinity, | ▶ 00:25 |
then Z goes to zero. | ▶ 00:30 |
And the reason is, | ▶ 00:33 |
if this expression over here goes to minus infinity, | ▶ 00:34 |
E to the minus of it becomes very large; | ▶ 00:38 |
one over something very large becomes zero. | ▶ 00:41 |
When we plot the logistic function it looks like this: | ▶ 00:44 |
So it's approximately linear | ▶ 00:49 |
around F of X equals zero, | ▶ 00:51 |
but it levels off to zero and one | ▶ 00:54 |
as we go to the extremes. | ▶ 00:58 |
Another problem with linear regression has to do with the regularization | ▶ 00:00 |
or complexity control. | ▶ 00:04 |
Just like before, we sometimes wish to have | ▶ 00:06 |
a less complex model. | ▶ 00:08 |
So in regularization, the loss function is the sum | ▶ 00:10 |
of the loss over the data and a complexity control term, | ▶ 00:15 |
which is often called the loss of the parameters. | ▶ 00:21 |
The loss of the data is simply the quadratic loss, as we discussed before. | ▶ 00:24 |
The loss of the parameters might just be a function that penalizes | ▶ 00:29 |
the parameters for becoming large, | ▶ 00:35 |
raised to some power P, where P is usually either 1 or 2. | ▶ 00:37 |
If you draw this graphically, | ▶ 00:43 |
in a parameter space comprised of 2 parameters, | ▶ 00:46 |
your quadratic term for minimizing the data error | ▶ 00:49 |
might look like this, where the minimum sits over here. | ▶ 00:53 |
Your term for regularization might pull these parameters toward 0. | ▶ 00:57 |
It pulls them toward 0 along circles if you use quadratic (L2) error, | ▶ 01:02 |
and it does it in a diamond-shaped way | ▶ 01:09 |
for L1 regularization--either one works well. | ▶ 01:14 |
L1 has the advantage in that parameters tend to get really sparse. | ▶ 01:20 |
If you look at this diagram, there is a tradeoff between W-0 and W-1. | ▶ 01:24 |
In the L1 case, that allows one of them to be driven to 0. | ▶ 01:30 |
In the L2 case, parameters tend not to be as sparse. | ▶ 01:33 |
So L1 is often preferred. | ▶ 01:37 |
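In symbols, the regularized loss is the sum of the two terms; the trade-off weight lambda below is my own explicit notation for balancing them:

    $$\text{Loss}(w) = \sum_j \left( y_j - w_1 x_j - w_0 \right)^2 + \lambda \sum_i |w_i|^p, \qquad p \in \{1, 2\}$$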
This all raises the question, | ▶ 00:00 |
how to minimize more complicated loss functions | ▶ 00:03 |
than the one we discussed so far. | ▶ 00:06 |
Are there closed-form solutions of the type we found for linear regression? | ▶ 00:09 |
Or do we have to resort to iterative methods? | ▶ 00:14 |
The general answer is, unfortunately, we have to resort to iterative methods. | ▶ 00:17 |
Even though there are special cases in which closed-form solutions may exist, | ▶ 00:23 |
in general, our loss functions now become complicated enough | ▶ 00:28 |
that all we can do is iterate. | ▶ 00:32 |
Here is a prototypical loss function, | ▶ 00:35 |
and the method for iteration is called gradient descent. | ▶ 00:40 |
In gradient descent, you start with an initial guess, | ▶ 00:44 |
W-0, where 0 is your iteration number, | ▶ 00:48 |
and then you update it iteratively. | ▶ 00:53 |
Your i plus 1st parameter guess will be obtained by taking your i-th guess | ▶ 00:55 |
and subtracting from it the gradient of your loss function | ▶ 01:04 |
at that guess, multiplied by a small learning rate alpha, | ▶ 01:10 |
where alpha is often as small as 0.01. | ▶ 01:15 |
I have a couple of questions for you. | ▶ 01:19 |
Consider the following 3 points. | ▶ 01:21 |
We call them A, B, C. | ▶ 01:25 |
I wish to know, for points A, B, and C, | ▶ 01:27 |
Is the gradient at this point positive, about zero, or negative? | ▶ 01:34 |
For each of those, check exactly one of those cases. | ▶ 01:40 |
In case A, the gradient is negative. | ▶ 00:00 |
If you move to the right in the X space, | ▶ 00:03 |
then your loss decreases. | ▶ 00:06 |
In B, it's about zero. | ▶ 00:09 |
In C, it's pointing up; it's positive. | ▶ 00:12 |
So if you apply the rule over here, | ▶ 00:15 |
if you were to start at A as your W-zero, | ▶ 00:18 |
then your gradient is negative. | ▶ 00:21 |
Therefore, you would add something to the value of W. | ▶ 00:23 |
You move to the right, and your loss has decreased. | ▶ 00:26 |
You do this until you find yourself | ▶ 00:29 |
with what's called a local minimum, where B resides. | ▶ 00:31 |
In this instance over here, gradient descent starting at A | ▶ 00:34 |
would not get you to the global minimum, | ▶ 00:37 |
which sits over here because there's a bump in between. | ▶ 00:39 |
Gradient methods are known to be subject to local minimum. | ▶ 00:42 |
I have another gradient quiz. | ▶ 00:00 |
Consider the following quadratic error function. | ▶ 00:03 |
We are considering the gradient in 3 different places. | ▶ 00:06 |
a. b. and c. | ▶ 00:09 |
And I ask you which gradient is the largest: | ▶ 00:13 |
a, b, or c or are they all equal? | ▶ 00:17 |
In which case, you would want to check the last box over here | ▶ 00:23 |
And the answer is C. | ▶ 00:00 |
The derivative of a quadratic function is a linear function. | ▶ 00:04 |
Which would look about like this. | ▶ 00:08 |
And as we go outside, our gradient becomes larger and larger. | ▶ 00:11 |
This over here is much steeper than this curve over here. | ▶ 00:15 |
[Thrun] Here is a final gradient descent quiz. | ▶ 00:00 |
Suppose we have a loss function like this | ▶ 00:04 |
and our gradient descent starts over here. | ▶ 00:08 |
Will it likely reach the global minimum? | ▶ 00:12 |
Yes or no. | ▶ 00:15 |
Please check one of those boxes. | ▶ 00:17 |
[Thrun] And the answer is yes, | ▶ 00:00 |
although, technically speaking, to reach the absolute global minimum | ▶ 00:02 |
we need the learning rates to become smaller and smaller over time. | ▶ 00:06 |
If they stay constant, there is a chance this thing might bounce around | ▶ 00:11 |
between 2 points in the end and never reach the global minimum. | ▶ 00:15 |
But assuming that we implement gradient descent correctly, | ▶ 00:18 |
we will finally reach the global minimum. | ▶ 00:22 |
That's not the case if you start over here, where we can get stuck over here | ▶ 00:24 |
and settle for the minimum over here, which is a local minimum | ▶ 00:29 |
and not the best solution to our optimization problem. | ▶ 00:32 |
So one of the important points to take away from this is | ▶ 00:35 |
gradient descent is universally applicable to more complicated problems-- | ▶ 00:38 |
problems that don't have a closed-form solution. | ▶ 00:43 |
But you have to check whether there are many local minima, | ▶ 00:46 |
and if so, you have to worry about this. | ▶ 00:49 |
Any optimization book can tell you tricks for overcoming this. | ▶ 00:51 |
I won't go into any more depth here in this class. | ▶ 00:55 |
[Thrun] It's interesting to see how to minimize a loss function using gradient descent. | ▶ 00:00 |
In our linear case, we have L equals sum over the correct labels | ▶ 00:05 |
minus our linear function to the square, | ▶ 00:12 |
which we seek to minimize. | ▶ 00:16 |
We already know that this has a closed form solution, | ▶ 00:18 |
but just for the fun of it, let's look at gradient descent. | ▶ 00:21 |
The gradient of L with respect to W1 is minus 2 times the sum over all J | ▶ 00:25 |
of the difference as before, but without the square, times Xj. | ▶ 00:33 |
The gradient with respect to W0 is very similar. | ▶ 00:39 |
So in gradient descent we start with W1 0 and W0 0 | ▶ 00:43 |
where the upper cap 0 corresponds to the iteration index of gradient descent. | ▶ 00:49 |
And then we iterate. | ▶ 00:55 |
In the m-th iteration we get our new estimate by using the old estimate | ▶ 00:57 |
minus a learning rate times this gradient over here, | ▶ 01:06 |
taken at the position of the old estimate W1, m minus 1. | ▶ 01:10 |
Similarly, for W0 we get this expression over here. | ▶ 01:15 |
And these expressions look nasty, | ▶ 01:20 |
but what it really means is we subtract an expression like this | ▶ 01:24 |
every time we do gradient descent from W1 | ▶ 01:28 |
and an expression like this every time we do gradient descent from W0, | ▶ 01:31 |
which is easy to implement, and that implements gradient descent. | ▶ 01:36 |
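Here is a minimal Python sketch of exactly this update loop; the data set reuses the exact-fit quiz points from earlier, and the learning rate alpha = 0.01 and the iteration count are illustrative choices.

    # Gradient descent for 1-D linear regression with quadratic loss.
    def gradient_descent(xs, ys, alpha=0.01, iterations=5000):
        w1, w0 = 0.0, 0.0                  # initial guess
        for _ in range(iterations):
            # Gradients of L = sum (y - w1*x - w0)^2, as derived above.
            g1 = -2 * sum((y - w1 * x - w0) * x for x, y in zip(xs, ys))
            g0 = -2 * sum((y - w1 * x - w0) for x, y in zip(xs, ys))
            w1, w0 = w1 - alpha * g1, w0 - alpha * g0
        return w1, w0

    print(gradient_descent([3, 4, 5, 6], [0, -1, -2, -3]))  # near (-1.0, 3.0)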
Now, there are many different ways to apply linear functions in machine learning. | ▶ 00:00 |
We so far have studied linear functions for regression, | ▶ 00:08 |
but linear functions are also used for classification, | ▶ 00:12 |
and specifically for an algorithm called the perceptron algorithm. | ▶ 00:16 |
This algorithm happens to be a very early model of a neuron, | ▶ 00:21 |
as in the neurons we have in our brains, | ▶ 00:27 |
and was invented in the 1940s. | ▶ 00:30 |
Suppose we give a data set of positive samples and negative samples. | ▶ 00:33 |
A linear separator is a linear equation that separates positive from negative examples. | ▶ 00:41 |
Obviously, not all sets possess a linear separator, but some do. | ▶ 00:49 |
For those we can define the algorithm of the perceptron and it actually converges. | ▶ 00:55 |
To define a linear separator, let's start with our linear equation as before-- | ▶ 01:02 |
w1x + w0; in cases where x is higher dimensional, w1 might actually be a vector--never mind. | ▶ 01:07 |
If this is larger or equal to zero, then we call our classification 1. | ▶ 01:18 |
Otherwise, we call it zero. | ▶ 01:26 |
Here's our linear separation classification function | ▶ 01:30 |
where this is our common linear function. | ▶ 01:35 |
Now, as I said, perceptron only converges if the data is linearly separable, | ▶ 01:39 |
and then it converges to a linear separation of the data, | ▶ 01:45 |
which is quite amazing. | ▶ 01:49 |
Perceptron is an iterative algorithm that is not dissimilar from gradient descent. | ▶ 01:52 |
In fact, the update rule echoes that of gradient descent, and here's how it goes. | ▶ 01:56 |
We start with a random guess for w1 and w0, | ▶ 02:03 |
which may correspond to a random separation line, | ▶ 02:09 |
but usually is inaccurate. | ▶ 02:13 |
Then the m-th estimate of weight i is obtained by using the old weight plus some learning rate alpha | ▶ 02:17 |
times the difference between the desired target label | ▶ 02:29 |
and the target label produced by our function at the point m-1. | ▶ 02:33 |
Now, this is an online learning rule, which means we don't process all the data in batch. | ▶ 02:39 |
We process one data point at a time, and we might go through the data many, many times-- | ▶ 02:45 |
hence the j over here-- | ▶ 02:50 |
but every time we do this, we apply this rule over here. | ▶ 02:52 |
What this rule gives us is a method to adapt our weights in proportion to the error. | ▶ 02:55 |
If the prediction of our function f equals our target label, | ▶ 03:03 |
and the error is zero, then no update occurs. | ▶ 03:07 |
If there is a difference, however, we update in a way so as to minimize the error. | ▶ 03:11 |
Alpha is a small learning rate. | ▶ 03:18 |
Once again, perceptron converges to a correct linear separator | ▶ 03:22 |
if such a linear separator exists. | ▶ 03:28 |
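A minimal sketch of this online rule in Python follows; the toy 1-D data and the learning rate are made up, and, as in the standard perceptron rule, the error term for the weight w1 is multiplied by the input x.

    # Perceptron learning for a 1-D linearly separable toy problem.
    def classify(x, w1, w0):
        return 1 if w1 * x + w0 >= 0 else 0   # linear separation function

    def perceptron(data, alpha=0.1, epochs=100):
        w1, w0 = 0.0, 0.0                     # start from a poor guess
        for _ in range(epochs):               # many passes over the data
            for x, y in data:                 # one example at a time
                error = y - classify(x, w1, w0)   # zero error, no update
                w1 += alpha * error * x
                w0 += alpha * error
        return w1, w0

    data = [(1, 0), (2, 0), (4, 1), (5, 1)]   # separable around x = 3
    w1, w0 = perceptron(data)
    print([classify(x, w1, w0) for x, _ in data])  # [0, 0, 1, 1]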
Now, the case of linear separation has recently received a lot of attention in machine learning. | ▶ 03:31 |
If you look at the picture over here, you'll find there are many different linear separators. | ▶ 03:36 |
There is one over here. There is one over here. There is one over here. | ▶ 03:42 |
One of the questions that has recently been researched extensively is which one to prefer. | ▶ 03:47 |
Is it a, b, or c? | ▶ 03:53 |
Even though you probably have never seen this literature, | ▶ 03:57 |
I will just ask your intuition in this following quiz. | ▶ 04:01 |
Which linear separator would you prefer if you look at these three different linear separators-- | ▶ 04:05 |
a, b, c, or none of them? | ▶ 04:10 |
[Narrator] And intuitively I would argue it's B, | ▶ 00:00 |
and the reason why is | ▶ 00:04 |
C comes really close to examples. | ▶ 00:06 |
So if these examples are noisy, | ▶ 00:09 |
it's quite likely that, | ▶ 00:12 |
because the line is so close to these examples, | ▶ 00:14 |
future examples will cross the line. | ▶ 00:17 |
Similarly A comes close to examples. | ▶ 00:20 |
B is the one that stays really far away | ▶ 00:23 |
from any example. | ▶ 00:26 |
So there's this entire region over here | ▶ 00:28 |
where there's no example anywhere near B. | ▶ 00:31 |
This region is often called the margin. | ▶ 00:34 |
The margin of the linear separator | ▶ 00:37 |
is the distance of the separator | ▶ 00:40 |
to the closest training example. | ▶ 00:43 |
The margin is a really important concept | ▶ 00:45 |
in machine learning. | ▶ 00:47 |
There is an entire class of maximum margin | ▶ 00:49 |
learning algorithms, | ▶ 00:51 |
and the 2 most popular are | ▶ 00:53 |
support vector machines and boosting. | ▶ 00:56 |
If you are familiar with machine learning, | ▶ 01:00 |
you've come across these terms. | ▶ 01:02 |
These are very frequently used these days | ▶ 01:04 |
in actual discrimination learning tasks. | ▶ 01:07 |
I will not go into any details because it would go | ▶ 01:10 |
way beyond the scope of this introduction | ▶ 01:12 |
to artificial intelligence class, but let's see | ▶ 01:16 |
a few abstract words specifically about | ▶ 01:18 |
support vector machines or SVMs. | ▶ 01:21 |
As I said before a support vector machine | ▶ 01:25 |
derives a linear separator, and it takes | ▶ 01:30 |
the one that actually maximizes the margin | ▶ 01:34 |
as shown over here. | ▶ 01:39 |
By doing so it attains additional robustness | ▶ 01:42 |
over perceptron which only picks | ▶ 01:44 |
a linear separator without | ▶ 01:46 |
consideration of the margin. | ▶ 01:48 |
Now the problem of finding the | ▶ 01:51 |
margin maximizing linear separator | ▶ 01:53 |
can be solved by a quadratic program | ▶ 01:55 |
which is a numerical method for finding the best | ▶ 01:59 |
linear separator that maximizes the margin. | ▶ 02:03 |
One of the nice things that support | ▶ 02:06 |
vector machines do in practice is | ▶ 02:08 |
they use linear techniques to solve | ▶ 02:12 |
nonlinear separation problems, | ▶ 02:16 |
and I'm just going to give you a glimpse of | ▶ 02:19 |
what's happening without going into any detail. | ▶ 02:22 |
Suppose the data looks as follows: | ▶ 02:25 |
we have a positive class | ▶ 02:28 |
which is near the origin of a coordinate system | ▶ 02:31 |
and a negative class that surrounds the positive class. | ▶ 02:33 |
Clearly these 2 classes | ▶ 02:37 |
are not linearly separable | ▶ 02:39 |
because there's no line I can draw that | ▶ 02:41 |
separates the negative examples from the positive examples. | ▶ 02:43 |
An idea that underlies SVMs, | ▶ 02:47 |
that will ultimately be known as | ▶ 02:49 |
the kernel trick, | ▶ 02:51 |
is to augment the feature set by new features. | ▶ 02:53 |
Suppose this is X1, and this is X2, | ▶ 02:56 |
and normally X1 and X2 | ▶ 02:58 |
will be the input features. | ▶ 03:00 |
In this example, you might derive | ▶ 03:03 |
a 3rd one. | ▶ 03:05 |
Let me pick a 3rd one: | ▶ 03:07 |
suppose X3 equals the square root of | ▶ 03:09 |
X1 squared plus X2 squared. | ▶ 03:13 |
In other words X3 is the distance | ▶ 03:18 |
of any data point from the center | ▶ 03:22 |
of the coordinate system. | ▶ 03:25 |
Then things do become linearly separable | ▶ 03:27 |
so that just along the 3rd dimension | ▶ 03:31 |
all the positive examples end up | ▶ 03:33 |
being close to the origin, | ▶ 03:36 |
and all the negative examples | ▶ 03:39 |
are further away, and a separator that is | ▶ 03:41 |
orthogonal to the 3rd input feature | ▶ 03:43 |
solves the separation problem. | ▶ 03:46 |
Mapped back into the space over here, | ▶ 03:49 |
this separator is actually a circle, which is the set of all | ▶ 03:52 |
points that are equidistant | ▶ 03:55 |
to the center of the origin. | ▶ 04:00 |
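A tiny Python sketch of this feature augmentation follows; the data points and the threshold on x3 are made up for illustration.

    from math import sqrt

    # Augment 2-D points with x3 = distance from the origin; the inner
    # (positive) and surrounding (negative) classes then separate by a
    # simple threshold on x3 alone.
    def augment(x1, x2):
        return (x1, x2, sqrt(x1 ** 2 + x2 ** 2))

    inner = [(0.1, 0.2), (-0.3, 0.1), (0.2, -0.2)]   # positive class
    outer = [(2.0, 0.1), (-1.5, 1.5), (0.3, -2.2)]   # negative class

    threshold = 1.0                                  # separator on x3
    for point in inner + outer:
        x1, x2, x3 = augment(*point)
        print(point, "positive" if x3 < threshold else "negative")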
Now this trick could be done in any linear learning algorithm, | ▶ 04:02 |
and it's really an amazing trick. | ▶ 04:06 |
You can take any nonlinear problem, add | ▶ 04:08 |
features of this type or any other type, | ▶ 04:10 |
and use linear techniques | ▶ 04:13 |
and get better solutions. | ▶ 04:15 |
This is a very deep machine learning insight | ▶ 04:17 |
that you can extend your feature space | ▶ 04:19 |
in this way, and there's numerous | ▶ 04:21 |
papers written about this. | ▶ 04:23 |
In SVMs, the extension of the feature space is mathematically done by | ▶ 04:25 |
what's called a kernel. | ▶ 04:31 |
I can't really tell you about this in this class, | ▶ 04:33 |
but it makes it possible to write | ▶ 04:36 |
very large new feature spaces including | ▶ 04:38 |
infinitely dimensional new feature spaces. | ▶ 04:41 |
These methods are very powerful. | ▶ 04:44 |
It turns out you never | ▶ 04:46 |
really compute all those features. | ▶ 04:48 |
They are implicitly represented by | ▶ 04:50 |
so called kernels, and if you care about this, | ▶ 04:52 |
I recommend you dive | ▶ 04:55 |
deeper into the literature | ▶ 04:57 |
of support vector machines. | ▶ 04:59 |
This is meant to just give you | ▶ 05:01 |
an overview of the essence of | ▶ 05:03 |
what support vector machines are all about. | ▶ 05:05 |
So in summary, | ▶ 05:08 |
we learned about linear methods, | ▶ 05:10 |
using them for regression | ▶ 05:12 |
and also for classification. | ▶ 05:15 |
We learned about exact solutions | ▶ 05:17 |
versus iterative solutions. | ▶ 05:19 |
We talked about smoothing, | ▶ 05:23 |
and we even talked about | ▶ 05:25 |
using linear methods for nonlinear problems. | ▶ 05:27 |
So we covered quite a bit of ground. | ▶ 05:30 |
This is a really significant cross section | ▶ 05:33 |
of machine learning. | ▶ 05:35 |
As the final method in this unit, I'd like now to talk about k-nearest neighbors. | ▶ 00:00 |
And the distinguishing factor of k-nearest neighbors | ▶ 00:06 |
is that it is a nonparametric machine learning method. | ▶ 00:09 |
So far we've talked about parametric methods. | ▶ 00:13 |
Parametric methods have parameters, like probabilities or weights, | ▶ 00:16 |
and the number of parameters is constant. | ▶ 00:21 |
Or to put it differently, the number of parameters is independent of the training set size. | ▶ 00:25 |
So for example, in Naive Bayes, if we bring in more data, | ▶ 00:29 |
the number of conditional probabilities will stay the same. | ▶ 00:34 |
Well, that wasn't technically always the case. | ▶ 00:37 |
Our vocabulary might increase, and with it the number of parameters. | ▶ 00:41 |
But for any fixed dictionary, the number of parameters is truly independent of the training set size. | ▶ 00:46 |
The same was true, for example, in our regression cases | ▶ 00:53 |