Welcome to the online introduction to artificial intelligence. | ▶ 00:00 |
My name is Sebastian Thrun. >>I'm Peter Norvig. | ▶ 00:04 |
We are teaching this class at Stanford, | ▶ 00:07 |
and now we are teaching it online for the entire world. | ▶ 00:09 |
We are really excited about this. | ▶ 00:11 |
It's great to have you all here. | ▶ 00:13 |
It's exciting to have such a record-breaking number of people. | ▶ 00:14 |
We think we can deliver a good introduction to artificial intelligence. | ▶ 00:18 |
We hope you'll stick with it. | ▶ 00:22 |
It's going to be a lot of work, | ▶ 00:24 |
but we think it's going to be very rewarding. | ▶ 00:25 |
The way it is going to be organized is that | ▶ 00:27 |
every week there are going to be new videos, and with these videos, quizzes. | ▶ 00:29 |
With these quizzes, you can test your knowledge about AI. | ▶ 00:32 |
For the advanced version of this class, we also post homework assignments and exams | ▶ 00:35 |
on which you'll be tested. | ▶ 00:38 |
We're going to grade those to give you a final score to see | ▶ 00:40 |
if you can actually master artificial intelligence the same way | ▶ 00:44 |
any good student at Stanford would. | ▶ 00:47 |
If you do that, then at the end of the class, we'll sign a letter of accomplishment, | ▶ 00:49 |
and let you know that you've achieved this and what your rank in the class was. | ▶ 00:54 |
So I hope you have fun. Watch us on videotape. | ▶ 00:58 |
We will teach you AI. | ▶ 01:02 |
Participate in the discussion forum. | ▶ 01:04 |
Ask your questions, and help others answer questions. | ▶ 01:06 |
I hope we have a fantastic time ahead of us in the next 10 weeks. | ▶ 01:09 |
Welcome to the class. We'll see you online. | ▶ 01:12 |
Welcome to the first unit of the Online Introduction to Artificial Intelligence. | ▶ 00:00 |
I will be teaching you the very, very basics today. | ▶ 00:05 |
This is Unit 1 of Artificial Intelligence. | ▶ 00:09 |
Welcome. | ▶ 00:14 |
The purpose of this class is twofold: | ▶ 00:16 |
Number 1, to teach you the very basics of artificial intelligence | ▶ 00:20 |
so you'll be able to talk to people in the field | ▶ 00:25 |
and understand the basic tools of the trade; | ▶ 00:29 |
and also, very importantly, to excite you about the field. | ▶ 00:32 |
I have been in the field of artificial intelligence for about 20 years, | ▶ 00:37 |
and it's been truly rewarding. | ▶ 00:42 |
So I want you to participate in the beauty and the excitement of AI | ▶ 00:44 |
so you can become a professional who gets the same reward | ▶ 00:48 |
and excitement out of this field as I do. | ▶ 00:52 |
The basic structure of this class involves videos | ▶ 00:55 |
in which Peter or I will teach you something new, | ▶ 01:00 |
then also quizzes, in which we will test your ability to answer AI questions, | ▶ 01:03 |
and finally, answer videos in which we tell you what the right answer would have been | ▶ 01:11 |
for the quiz that you might have answered incorrectly before. | ▶ 01:17 |
This will all be reiterated, and every so often you get a homework assignment, | ▶ 01:22 |
also in the form of quizzes but without the answers. | ▶ 01:28 |
And then we also have video exams. | ▶ 01:34 |
If you check our website, there are requirements | ▶ 01:37 |
on how you have to do assignments and exams. | ▶ 01:39 |
Please go to ai-class.org for the details of this class. | ▶ 01:43 |
So here is a question: is an AI program called wetware, a formula, or an intelligent agent? | ▶ 01:48 |
Pick the one that fits best. | ▶ 01:58 |
[Thrun] The correct answer is intelligent agent. | ▶ 00:00 |
Let's talk about intelligent agents. | ▶ 00:04 |
Here is my intelligent agent, | ▶ 00:07 |
and it gets to interact with an environment. | ▶ 00:11 |
The agent can perceive the state of the environment | ▶ 00:17 |
through its sensors, | ▶ 00:22 |
and it can affect the state of the environment through its actuators. | ▶ 00:25 |
The big question of artificial intelligence is the function that maps sensors to actuators. | ▶ 00:29 |
That is called the control policy for the agent. | ▶ 00:37 |
So all of this class will deal with how an agent makes decisions | ▶ 00:41 |
that it can carry out with its actuators based on past sensor data. | ▶ 00:48 |
Those decisions take place many, many times, | ▶ 00:54 |
and the loop of environment feedback to sensors, agent decision, | ▶ 00:58 |
actuator interaction with the environment, and so on is called the perception-action cycle. | ▶ 01:03 |
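To make this concrete, here is a minimal Python sketch of the perception-action cycle; the names here (the environment object, sense, act, and policy) are hypothetical illustrations, not part of any library:

```python
# A minimal sketch of the perception-action cycle. Environment, sense(),
# act(), and policy() are hypothetical names used purely for illustration.

def run_agent(environment, policy, steps=100):
    """Run the perception-action cycle for a fixed number of steps."""
    history = []                           # past sensor data the agent may use
    for _ in range(steps):
        percept = environment.sense()      # sensors perceive the environment state
        history.append(percept)
        action = policy(percept, history)  # control policy: sensor data -> action
        environment.act(action)            # actuators affect the environment
```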
So here is my very first quiz for you. | ▶ 01:12 |
Artificial intelligence, AI, has successfully been used in finance, | ▶ 01:15 |
robotics, games, medicine, and the Web. | ▶ 01:21 |
Check any or all of those that apply. | ▶ 01:26 |
And if none of them applies, check the box down here that says none of them. | ▶ 01:28 |
So the correct answer is all of those-- | ▶ 00:00 |
finance, robotics, games, medicine, the Web, and many more applications. | ▶ 00:03 |
So let me talk about them in some detail. | ▶ 00:08 |
There is a huge number of applications of artificial intelligence in finance, | ▶ 00:10 |
very often in the shape of making trading decisions-- | ▶ 00:15 |
in which case, the agent is called a trading agent. | ▶ 00:18 |
And the environment might be things like the stock market or the bond market | ▶ 00:21 |
or the commodities market. | ▶ 00:27 |
And our trading agent can sense the price of certain things, | ▶ 00:29 |
like stocks or bonds or commodities. | ▶ 00:33 |
It can also read the news online and follow certain events. | ▶ 00:35 |
And its decisions are usually things like buy or sell decisions--trades. | ▶ 00:40 |
There's a huge history of artificial intelligence finding methods to look at data over time | ▶ 00:48 |
and make predictions as to how prices develop over time-- | ▶ 00:55 |
and then put in trades behind those. | ▶ 00:58 |
And very frequently, people using artificial intelligence trading agents | ▶ 01:01 |
have made a good amount of money with superior trading decisions. | ▶ 01:06 |
There's also a long history of AI in Robotics. | ▶ 01:10 |
Here is my depiction of a robot. | ▶ 01:14 |
Of course, there are many different types of robots | ▶ 01:17 |
and they all interact with their environments through their sensors, | ▶ 01:20 |
which include things like cameras, microphones, and tactile sensors for touch. | ▶ 01:24 |
And the way they impact their environments is to move motors around, | ▶ 01:33 |
in particular, their wheels, their legs, their arms, their grippers. | ▶ 01:38 |
They can also say things to people using voice. | ▶ 01:43 |
Now there's a huge history of using artificial intelligence in robotics. | ▶ 01:46 |
Pretty much, every robot that does something interesting today uses AI. | ▶ 01:50 |
In fact, often AI has been studied together with robotics, as one discipline. | ▶ 01:54 |
But because robots are somewhat special in that they use physical actuators | ▶ 01:58 |
and deal with physical environments, they are a little bit different from | ▶ 02:03 |
just artificial intelligence, as a whole. | ▶ 02:06 |
When the Web came out, the early Web crawlers were called robots | ▶ 02:08 |
and to block a robot from accessing your website, to the present day, | ▶ 02:15 |
there's a file called robots.txt that allows you to deny any Web crawler | ▶ 02:20 |
access to the information on your website. | ▶ 02:24 |
So historically, robotics played a huge role in artificial intelligence | ▶ 02:28 |
and a good chunk of this class will be focusing on robotics. | ▶ 02:32 |
AI has a huge history in games-- | ▶ 02:36 |
to make games smarter or feel more natural. | ▶ 02:39 |
There are 2 ways in which AI has been used in games, as a game agent. | ▶ 02:43 |
One is to play against you, as a human user. | ▶ 02:47 |
So for example, if you play the game of Chess, | ▶ 02:50 |
then you are the environment to the game agent. | ▶ 02:54 |
The game agent gets to observe your moves, and it generates its own moves | ▶ 02:57 |
with the purpose of defeating you in Chess. | ▶ 03:03 |
So most adversarial games, where you play against an opponent | ▶ 03:07 |
and the opponent is a computer program, | ▶ 03:10 |
the game agent is built to play against you--against your own interests--and make you lose. | ▶ 03:13 |
And of course, your objective is to win. | ▶ 03:20 |
That's an AI games-type situation. | ▶ 03:22 |
The second thing is that game agents in AI | ▶ 03:25 |
also are used to make games feel more natural. | ▶ 03:29 |
So very often games have characters inside, and these characters act in some way. | ▶ 03:32 |
And it's important for you, as the player, to feel that these characters are believable. | ▶ 03:36 |
There's an entire sub-field of artificial intelligence to use AI | ▶ 03:42 |
to make characters in a game more believable--look smarter, so to speak-- | ▶ 03:45 |
so that you, as a player, think you're playing a better game. | ▶ 03:51 |
Artificial intelligence has a long history in medicine as well. | ▶ 03:55 |
The classic example is that of a diagnostic agent. | ▶ 04:00 |
So here you are--and you might be sick, and you go to your doctor. | ▶ 04:04 |
And your doctor wishes to understand | ▶ 04:09 |
what the reason for your symptoms and your sickness is. | ▶ 04:11 |
The diagnostic agent will observe you through various measurements-- | ▶ 04:17 |
for example, blood pressure and heart signals, and so on-- | ▶ 04:21 |
and it'll come up with a hypothesis as to what you might be suffering from. | ▶ 04:25 |
But rather than intervene directly, in most cases the diagnosis of your disease | ▶ 04:29 |
is communicated to the doctor, who then carries out the intervention. | ▶ 04:34 |
This is called a diagnostic agent. | ▶ 04:38 |
There are many other versions of AI in medicine. | ▶ 04:40 |
AI is used in intensive care to understand whether there are situations | ▶ 04:43 |
that need immediate attention. | ▶ 04:48 |
It's been used in lifelong medicine to monitor vital signs over long periods of time. | ▶ 04:50 |
And as medicine becomes more personal, the role of artificial intelligence | ▶ 04:54 |
will definitely increase. | ▶ 04:58 |
We already mentioned AI on the Web. | ▶ 05:01 |
The most generic version of AI is to crawl the Web and understand the Web, | ▶ 05:05 |
and assist you in answering questions. | ▶ 05:09 |
So when you have this search box over here | ▶ 05:12 |
and it says "Search" on the left, | ▶ 05:15 |
and "I'm Feeling Lucky" on the right, | ▶ 05:18 |
and you type in the words, | ▶ 05:20 |
what AI does for you is it understands what words you typed in | ▶ 05:21 |
and finds the most relevant pages. | ▶ 05:28 |
That is really core artificial intelligence. | ▶ 05:30 |
It's used by a number of companies, such as Microsoft and Google | ▶ 05:32 |
and Amazon, Yahoo, and many others. | ▶ 05:36 |
And the way this works is that there's a crawling agent that can go | ▶ 05:39 |
to the World Wide Web and retrieve pages, through just a computer program. | ▶ 05:43 |
It then sorts these pages into a big database inside the crawler | ▶ 05:51 |
and also analyzes the relevance of each page to any possible query. | ▶ 05:56 |
When you then come and issue a query, | ▶ 06:01 |
the AI system is able to give you a response-- | ▶ 06:04 |
for example, a collection of the 10 best Web links. | ▶ 06:08 |
In short, every time you try to write a piece of software | ▶ 06:12 |
that makes your computer smart, | ▶ 06:15 |
you will likely need artificial intelligence. | ▶ 06:18 |
And in this class, Peter and I will teach you | ▶ 06:20 |
many of the basic tricks of the trade | ▶ 06:23 |
to make your software really smart. | ▶ 06:25 |
It will be good to introduce some basic terminology | ▶ 00:00 |
that is commonly used in artificial intelligence to distinguish different types of problems. | ▶ 00:04 |
The very first distinction I will teach you is fully versus partially observable. | ▶ 00:09 |
An environment is called fully observable if what your agent can sense | ▶ 00:16 |
at any point in time is completely sufficient to make the optimal decision. | ▶ 00:19 |
So, for example, in many card games, | ▶ 00:26 |
when all the cards are on the table, the momentary sight of all those cards | ▶ 00:29 |
is really sufficient to make the optimal choice. | ▶ 00:36 |
That is in contrast to some other environments where you need memory | ▶ 00:40 |
on the side of the agent to make the best possible decision. | ▶ 00:46 |
For example, in the game of poker, the cards aren't openly on the table, | ▶ 00:50 |
and memorizing past moves will help you make a better decision. | ▶ 00:55 |
To fully understand the difference, consider the interaction of an agent | ▶ 01:00 |
with the environment through its sensors and its actuators, | ▶ 01:04 |
and this interaction takes place over many cycles, | ▶ 01:08 |
often called the perception-action cycle. | ▶ 01:11 |
For many environments, it's convenient to assume | ▶ 01:16 |
that the environment has some sort of internal state. | ▶ 01:19 |
For example, in a card game where the cards are not openly on the table, | ▶ 01:22 |
the state might pertain to the cards in your hand. | ▶ 01:28 |
An environment is fully observable if the sensors can always see | ▶ 01:33 |
the entire state of the environment. | ▶ 01:37 |
It's partially observable if the sensors can only see a fraction of the state, | ▶ 01:41 |
yet memorizing past measurements gives us additional information about the state | ▶ 01:46 |
that is not readily observable right now. | ▶ 01:52 |
So any game, for example, where past moves carry information about | ▶ 01:55 |
what might be in a person's hand, those games are partially observable, | ▶ 02:01 |
and they require different treatment. | ▶ 02:06 |
Very often agents that deal with partially observable environments | ▶ 02:08 |
need to acquire internal memory to understand what | ▶ 02:12 |
the state of the environment is, and we'll talk extensively | ▶ 02:15 |
about how to build such internal memory when we | ▶ 02:18 |
discuss hidden Markov models. | ▶ 02:21 |
A second terminology for environments pertains to whether the environment | ▶ 02:23 |
is deterministic or stochastic. | ▶ 02:26 |
A deterministic environment is one where your agent's actions | ▶ 02:29 |
uniquely determine the outcome. | ▶ 02:35 |
So, for example, in chess, there's really no randomness when you move a piece. | ▶ 02:37 |
The effect of moving a piece is completely predetermined, | ▶ 02:42 |
and whenever I move the same piece in the same way, the outcome is the same. | ▶ 02:46 |
That we call deterministic. | ▶ 02:50 |
Games with dice, for example, like backgammon, are stochastic. | ▶ 02:52 |
While you can still deterministically move your pieces, | ▶ 02:56 |
the outcome of an action also involves the throw of the dice, | ▶ 03:00 |
and you can't predict that. | ▶ 03:03 |
There's a certain amount of randomness involved in the outcome of the dice, | ▶ 03:05 |
and therefore, we call this stochastic. | ▶ 03:08 |
Let me talk about discrete versus continuous. | ▶ 03:10 |
A discrete environment is one where you have finitely many action choices, | ▶ 03:14 |
and finitely many things you can sense. | ▶ 03:18 |
So, for example, in chess, again, there's finitely many board positions, | ▶ 03:21 |
and finitely many things you can do. | ▶ 03:25 |
That is different from a continuous environment | ▶ 03:28 |
where the space of possible actions or things you could sense may be infinite. | ▶ 03:30 |
So, for example, if you throw darts, there's infinitely many ways to angle the darts | ▶ 03:35 |
and to accelerate them. | ▶ 03:41 |
Finally, we distinguish benign versus adversarial environments. | ▶ 03:43 |
In benign environments, the environment might be random. | ▶ 03:49 |
It might be stochastic, but it has no objective of its own | ▶ 03:53 |
that would contradict your own objective. | ▶ 03:57 |
So, for example, weather is benign. | ▶ 03:59 |
It might be random. It might affect the outcome of your actions. | ▶ 04:02 |
But it isn't really out there to get you. | ▶ 04:06 |
Contrast this with adversarial environments, such as many games, like chess, | ▶ 04:08 |
where your opponent is really out there to get you. | ▶ 04:14 |
It turns out it's much harder to find good actions in adversarial environments | ▶ 04:16 |
where the opponent actively observes you and counteracts what you're trying to achieve | ▶ 04:21 |
relative to a benign environment, where the environment might merely be stochastic | ▶ 04:26 |
but isn't really interested in making your life worse. | ▶ 04:30 |
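As a rough aid for the quizzes that follow, one way you might record these four attributes in code is a small data structure; this is just an illustrative sketch, with chess filled in as discussed in this unit:

```python
from dataclasses import dataclass

@dataclass
class EnvironmentProperties:
    """The four attributes used in this unit to classify an environment."""
    fully_observable: bool  # False means partially observable
    deterministic: bool     # False means stochastic
    discrete: bool          # False means continuous
    benign: bool            # False means adversarial

# Chess, as described above: no randomness, finitely many moves and
# positions, and an opponent who is out to get you.
chess = EnvironmentProperties(fully_observable=True, deterministic=True,
                              discrete=True, benign=False)
```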
So, let's see to what extent these expressions make sense to you | ▶ 04:35 |
by going to our next quiz. | ▶ 04:38 |
So here are the 4 concepts again: partially observable versus fully, | ▶ 04:40 |
stochastic versus deterministic, continuous versus discrete, | ▶ 04:45 |
adversarial versus benign. | ▶ 04:50 |
And let me ask you about the game of checkers. | ▶ 04:52 |
Check one or all of those attributes that apply. | ▶ 04:56 |
So, if you think checkers is partially observable, check this one. | ▶ 05:00 |
Otherwise, just don't check it. | ▶ 05:03 |
If you think it's stochastic, check this one, | ▶ 05:05 |
continuous, check this one, adversarial, check this one. | ▶ 05:07 |
If you don't know about checkers, you can check the Web and Google it | ▶ 05:11 |
to find a little more information about checkers. | ▶ 05:15 |
So, checkers is an interesting game. | ▶ 00:00 |
Here's the typical board of the game of checkers. | ▶ 00:04 |
Your pieces might look like this, | ▶ 00:08 |
and your opponent's pieces might look like this. | ▶ 00:11 |
And apart from some very cryptic rules in checkers, | ▶ 00:16 |
which I won't really discuss here, the board basically tells you | ▶ 00:19 |
everything there is to know about checkers, so it's clearly fully observable. | ▶ 00:23 |
It is deterministic because your move and your opponent's move | ▶ 00:28 |
very clearly affect the state of the board in ways that have | ▶ 00:33 |
absolutely no stochasticity. | ▶ 00:36 |
It is also discrete because there's finitely many action choices | ▶ 00:39 |
and finitely many board positions, | ▶ 00:45 |
and obviously, it is adversarial, since your opponent is out to get you. | ▶ 00:47 |
[Male narrator] The game of poker--is this partially observable, stochastic, | ▶ 00:00 |
continuous, or adversarial? | ▶ 00:06 |
Please check any or all of those that apply. | ▶ 00:09 |
[Male narrator] I would argue poker is partially observable | ▶ 00:00 |
because you can't see what is in your opponent's hand. | ▶ 00:03 |
It is stochastic because you're being dealt cards that are kind of coming at random. | ▶ 00:08 |
It is not continuous; there are just finitely many cards | ▶ 00:13 |
and finitely many actions you can take, even though you might argue | ▶ 00:16 |
that there's a huge number of different amounts of money you can bet. | ▶ 00:20 |
It's still finite, and it is clearly adversarial. | ▶ 00:24 |
If you've ever played poker before, you know how brutal it can be. | ▶ 00:27 |
[Male narrator] Here is a favorite, a robotic car. | ▶ 00:00 |
I wish to know whether it is partially observable, | ▶ 00:04 |
stochastic, continuous, or adversarial. | ▶ 00:06 |
That is, is the problem of driving robotically-- | ▶ 00:11 |
say, in a city--subject to any of those 4 categories? | ▶ 00:16 |
Please check any or all that might apply. | ▶ 00:20 |
Well, the robotic car clearly deals with a partially observable environment: | ▶ 00:00 |
if you just look at the momentary sensor input, you can't even tell how fast other cars are going. | ▶ 00:04 |
So, you need to memorize something. | ▶ 00:10 |
It is stochastic because it's inherently unpredictable | ▶ 00:12 |
what's going to happen next with other cars. | ▶ 00:15 |
It is continuous. | ▶ 00:17 |
There are infinitely many ways to set your steering | ▶ 00:20 |
or push your gas pedal or your brake, | ▶ 00:23 |
and, well, you can argue whether it's adversarial or not. | ▶ 00:26 |
Depending on where you live, it might be highly adversarial. | ▶ 00:29 |
Where I live, it isn't. | ▶ 00:31 |
I'm going to briefly talk about AI as something else, | ▶ 00:00 |
which is AI as the technique of uncertainty management in computer software. | ▶ 00:03 |
Put differently, AI is the discipline that you apply when you want to know what to do | ▶ 00:10 |
when you don't know what to do. | ▶ 00:17 |
Now, there's many reasons why there might be uncertainty in a computer program. | ▶ 00:22 |
There could be a sensor limit. | ▶ 00:27 |
That is, your sensors are unable to tell you | ▶ 00:29 |
what exactly is the case outside the AI system. | ▶ 00:33 |
There could be adversaries who act in a way that makes it hard for you | ▶ 00:37 |
to understand what is the case. | ▶ 00:41 |
There could be stochastic environments. | ▶ 00:44 |
Every time you roll the dice in a dice game, | ▶ 00:48 |
the stochasticity of the dice will make it impossible for you | ▶ 00:51 |
to be absolutely certain of what's the situation. | ▶ 00:55 |
There could be laziness. | ▶ 00:57 |
So perhaps you can actually compute what the situation is, | ▶ 01:00 |
but your computer program is just too lazy to do it. | ▶ 01:04 |
And here's my favorite: ignorance, plain ignorance. | ▶ 01:07 |
Many people are just ignorant of what's going on. | ▶ 01:11 |
They could know it, but they just don't care. | ▶ 01:14 |
All of these things are cause for uncertainty. | ▶ 01:17 |
AI is the discipline that deals with uncertainty and manages it in decision making. | ▶ 01:21 |
Now we've had an introduction to AI. | ▶ 00:00 |
We've heard about some of the properties of environments, | ▶ 00:03 |
and we've seen some possible architecture for agents. | ▶ 00:06 |
I'd like next to show you some examples of AI in practice. | ▶ 00:10 |
And Sebastian and I have some experience personally in things we have done | ▶ 00:13 |
at Google, at NASA, and at Stanford. | ▶ 00:18 |
And I want to tell you a little bit about some of those. | ▶ 00:21 |
One of the best successes of AI technology at Google | ▶ 00:25 |
has been the machine translation system. | ▶ 00:28 |
Here we see an example of an article in Italian automatically translated into English. | ▶ 00:31 |
Now, these systems are built for 50 different languages, | ▶ 00:37 |
and we can translate from any of the languages into any of the other languages. | ▶ 00:41 |
So, that's over 2,500 different systems, and we've done this all | ▶ 00:46 |
using machine learning techniques, using AI techniques, | ▶ 00:51 |
rather than trying to build them by hand. | ▶ 00:55 |
And the way it works is that we go out and collect examples of text | ▶ 00:58 |
that are aligned between the 2 languages. | ▶ 01:03 |
So we find, say, a newspaper that publishes 2 editions, | ▶ 01:06 |
an Italian edition and an English edition, and now we have examples of translations. | ▶ 01:11 |
And if anybody ever asked us for exactly the translation of this one particular article, | ▶ 01:16 |
then we could just look it up and say "We already know that." | ▶ 01:22 |
But of course, we aren't often going to be asked that. | ▶ 01:25 |
Rather, we're going to be asked parts of this. | ▶ 01:27 |
Here are some words that we've seen before, and we have to figure out | ▶ 01:30 |
which words in this article correspond to which words in the translation article. | ▶ 01:34 |
And we do that by examining many, many millions of words of text | ▶ 01:40 |
in the 2 languages and making the correspondences, | ▶ 01:45 |
and then we can put that all together. | ▶ 01:49 |
And then when we see a new example of text that we haven't seen before, | ▶ 01:51 |
we can just look up what we've seen in the past for that correspondence. | ▶ 01:54 |
So, the task is really two parts. | ▶ 01:58 |
Off-line, before we see an example of text we want to translate, | ▶ 02:01 |
we first build our translation model. | ▶ 02:05 |
We do that by examining all of the different examples | ▶ 02:07 |
and figuring out which part aligns to which. | ▶ 02:10 |
Now, when we're given a text to translate, we use that model, | ▶ 02:14 |
and we go through and find the most probable translation. | ▶ 02:18 |
So, what does it look like? | ▶ 02:22 |
Well, let's look at it in some example text. | ▶ 02:24 |
And rather than look at news articles, I'm going to look at something simpler. | ▶ 02:26 |
I'm going to switch from Italian to Chinese. | ▶ 02:29 |
Here's a bilingual text. | ▶ 02:35 |
Now, for a large-scale machine translation, examples are found on the Web. | ▶ 02:37 |
This example was found in a Chinese restaurant by Adam Lopez. | ▶ 02:41 |
Now, it's given, for a text of this form, | ▶ 02:46 |
that a line in Chinese corresponds to a line in English, | ▶ 02:49 |
and that's true for each of the individual lines. | ▶ 02:55 |
But to learn from this text, what we really want to discover | ▶ 02:59 |
is what individual words in Chinese correspond to individual words | ▶ 03:02 |
or small phrases in English. | ▶ 03:07 |
I've started that process by highlighting the word "wonton" in English. | ▶ 03:09 |
It appears 3 times throughout the text. | ▶ 03:16 |
Now, in each of those lines, there's a character that appears, | ▶ 03:18 |
and that's the only place in the Chinese text where that character appears. | ▶ 03:23 |
So, that seems like it's a high probability that this character in Chinese | ▶ 03:27 |
corresponds to the word "wonton" in English. | ▶ 03:33 |
Let's see if we can go farther. | ▶ 03:36 |
My question for you is what word or what character or characters in Chinese | ▶ 03:38 |
correspond to the word "chicken" in English? | ▶ 03:44 |
And here we see "chicken" appears in these locations. | ▶ 03:47 |
Click on the character or characters in Chinese that corresponds to "chicken." | ▶ 03:54 |
The answer is that chicken appears here, | ▶ 00:01 |
here, here, and here. | ▶ 00:04 |
Now, I don't know for sure, 100%, that that is the character for chicken in Chinese, | ▶ 00:10 |
but I do know that there is a good correspondence. | ▶ 00:14 |
Every place the word chicken appears in English, | ▶ 00:17 |
this character appears in Chinese and no other place. | ▶ 00:20 |
Let's go 1 step farther. | ▶ 00:24 |
Let's see if we can work out a phrase in Chinese | ▶ 00:27 |
and see if it corresponds to a phrase in English. | ▶ 00:30 |
Here's the phrase corn cream. | ▶ 00:33 |
Click on the characters in Chinese that correspond to corn cream. | ▶ 00:38 |
The answer is: these 2 characters here | ▶ 00:00 |
appear only in these 2 locations | ▶ 00:04 |
corresponding to the words corn cream | ▶ 00:07 |
which appear only in these locations in the English text. | ▶ 00:10 |
Again, we're not 100% sure that's the right answer, | ▶ 00:13 |
but it looks like a strong correlation. | ▶ 00:17 |
Now, 1 more question. | ▶ 00:20 |
Tell me what character or characters in Chinese | ▶ 00:22 |
correspond to the English word soup. | ▶ 00:26 |
The answer is that soup occurs in most of these phrases | ▶ 00:00 |
but not 100% of them. | ▶ 00:09 |
It's missing in this phrase. | ▶ 00:11 |
Equivalently, on the Chinese side | ▶ 00:14 |
we see this character occurs | ▶ 00:17 |
in most of the phrases, | ▶ 00:20 |
but it's missing here. | ▶ 00:23 |
So we see that the correspondence doesn't have to be 100% | ▶ 00:27 |
to tell us that there is still a good chance of a correlation. | ▶ 00:31 |
When we're learning to do machine translation | ▶ 00:34 |
we use these kinds of alignments to learn probability tables | ▶ 00:37 |
of what is the probability of one phrase in one language | ▶ 00:41 |
corresponding to the phrase in another language. | ▶ 00:45 |
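As a toy illustration of that counting idea, here is a small Python sketch; the bilingual data and the characters in it are illustrative stand-ins, not the actual menu, and a real system would estimate full probability tables over many millions of words:

```python
from collections import Counter

# Toy bilingual text: each pair is (English words, Chinese characters)
# for one menu line, loosely modeled on the example above.
bitext = [
    ({"wonton", "soup"},  {"云", "吞", "汤"}),
    ({"chicken", "soup"}, {"鸡", "汤"}),
    ({"wonton"},          {"云", "吞"}),
]

def cooccurrence(word, bitext):
    """Count how often each Chinese character appears in lines whose
    English side contains `word`; high counts suggest a correspondence."""
    counts = Counter()
    for english, chinese in bitext:
        if word in english:
            counts.update(chinese)
    return counts

# "云" and "吞" appear in every line containing "wonton",
# so they come out with the highest counts, suggesting an alignment.
print(cooccurrence("wonton", bitext).most_common())
```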
So congratulations--you just finished unit 1 of this class, | ▶ 00:00 |
where I told you about key applications | ▶ 00:07 |
of artificial intelligence, | ▶ 00:10 |
I told you about the definition of an intelligent agent, | ▶ 00:13 |
I gave you 4 key attributes of intelligent agents | ▶ 00:18 |
(partial observability, stochasticity, continuous spaces, and adversarial natures), | ▶ 00:24 |
I discussed sources and management of uncertainty, | ▶ 00:31 |
and I briefly mentioned the mathematical concept of rationality. | ▶ 00:34 |
Obviously, I only touched each of these issues superficially, | ▶ 00:40 |
but as this class goes on you're going to dive into each of those | ▶ 00:45 |
and learn much more about | ▶ 00:49 |
what it takes to make a truly intelligent AI system. | ▶ 00:51 |
Thank you. | ▶ 00:55 |
[PROBLEM SOLVING] | ▶ 00:00 |
In this unit we're going to talk about problem solving-- | ▶ 00:01 |
The theory and technology of building agents | ▶ 00:04 |
that can plan ahead to solve problems. | ▶ 00:06 |
In particular, we're talking about problem solving | ▶ 00:10 |
where the complexity of the problem comes from the idea that there are many states. | ▶ 00:13 |
As in this problem here. | ▶ 00:17 |
A navigation problem where there are many choices to start with. | ▶ 00:19 |
And the complexity comes from picking the right choice now and picking the right choice at the | ▶ 00:24 |
next intersection and the intersection after that. | ▶ 00:29 |
Stringing together a sequence of actions. | ▶ 00:32 |
This is in contrast to the type of complexity shown in this picture, | ▶ 00:35 |
where the complexity comes from the partial observability | ▶ 00:39 |
that we can't see through the fog where the possible paths are. | ▶ 00:43 |
We can't see the results of our actions | ▶ 00:46 |
and even the actions themselves are not known. | ▶ 00:48 |
This type of complexity will be covered in a later unit. | ▶ 00:51 |
Here's an example of a problem. | ▶ 00:56 |
This is a route-finding problem where we're given a start city, | ▶ 00:58 |
in this case, Arad, and a destination, Bucharest, the capital of Romania, | ▶ 01:03 |
of which this is a corner of the map. | ▶ 01:09 |
And the problem then is to find a route from Arad to Bucharest. | ▶ 01:11 |
The actions that the agent can execute are driving | ▶ 01:16 |
from one city to the next along one of the roads shown on the map. | ▶ 01:20 |
The question is, is there a solution that the agent can come up with | ▶ 01:23 |
given the knowledge shown here to the problem of driving from Arad to Bucharest? | ▶ 01:28 |
And the answer is no. | ▶ 00:00 |
There is no solution that the agent can come up with | ▶ 00:03 |
because Bucharest doesn't appear on the map, | ▶ 00:06 |
and so the agent doesn't know any actions that can arrive there. | ▶ 00:08 |
So let's give the agent a better chance. | ▶ 00:12 |
Now we've given the agent the full map of Romania. | ▶ 00:19 |
To start, he's in Arad, and the destination--or goal--is in Bucharest. | ▶ 00:23 |
And the agent is given the problem of coming up with a sequence of actions | ▶ 00:30 |
that will arrive at the destination. | ▶ 00:35 |
Now, is it possible for the agent to solve this problem? | ▶ 00:37 |
And the answer is yes. | ▶ 00:43 |
There are many routes or steps or sequences of actions that will arrive at the destination. | ▶ 00:45 |
Here is one of them: | ▶ 00:50 |
Starting out in Arad, taking this step first, then this one, then this one, | ▶ 00:53 |
then this one, and then this one to arrive at the destination. | ▶ 01:00 |
So that would count as a solution to the problem. | ▶ 01:05 |
So a solution is a sequence of actions, chained together, that is guaranteed to get us to the goal. | ▶ 01:08 |
[DEFINITION OF A PROBLEM] | ▶ 01:12 |
Now let's formally define what a problem looks like. | ▶ 01:14 |
A problem can be broken down into a number of components. | ▶ 01:17 |
First, the initial state that the agent starts out with. | ▶ 01:21 |
In our route finding problem, the initial state was the agent being in the city of Arad. | ▶ 01:25 |
Next, a function--Actions--that takes a state as input and returns | ▶ 01:32 |
a set of possible actions that the agent can execute when the agent is in this state. | ▶ 01:41 |
[ACTIONS(s) → {a1, a2, a3, ...}] | ▶ 01:47 |
In some problems, the agent will have the same actions available in all states | ▶ 01:50 |
and in other problems, he'll have different actions dependent on the state. | ▶ 01:54 |
In the route finding problem, the actions are dependent on the state. | ▶ 01:58 |
When we're in one city, we can take the routes to the neighboring cities-- | ▶ 02:02 |
but we can't go to any other cities. | ▶ 02:06 |
Next we have a function called Result, which takes, as input, a state and an action | ▶ 02:09 |
and delivers, as its output, a new state. | ▶ 02:20 |
So, for example, if the agent is in the city of Arad, and takes--that would be the state-- | ▶ 02:24 |
and takes the action of driving along Route E-671 towards Timisoara, | ▶ 02:33 |
then the result of applying that action in that state would be the new state-- | ▶ 02:40 |
where the agent is in the city of Timisoara. | ▶ 02:45 |
Next, we need a function called Goal Test, | ▶ 02:51 |
which takes a state and returns a Boolean value-- | ▶ 02:58 |
true or false--telling us if this state is a goal or not. | ▶ 03:04 |
In a route-finding problem, the only goal would be being in the destination city-- | ▶ 03:09 |
the city of Bucharest--and all the other states would return false for the Goal Test. | ▶ 03:14 |
And finally, we need one more thing which is a Path Cost function-- | ▶ 03:19 |
which takes a path, a sequence of state/action transitions, | ▶ 03:28 |
and returns a number, which is the cost of that path. | ▶ 03:40 |
Now, for most of the problems we'll deal with, we'll make the Path Cost function be additive | ▶ 03:44 |
so that the cost of the path is just the sum of the costs of the individual steps. | ▶ 03:50 |
And so we'll implement this Path Cost function, in terms of a Step Cost function. | ▶ 03:56 |
The Step Cost function takes a state, an action, and the resulting state from that action | ▶ 04:04 |
and returns a number--n--which is the cost of that action. | ▶ 04:14 |
In the route finding example, the cost might be the number of miles traveled | ▶ 04:18 |
or maybe the number of minutes it takes to get to that destination. | ▶ 04:24 |
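Putting these components together, here is a minimal sketch of the problem definition as a Python interface; the class and method names simply mirror the components above and are otherwise hypothetical:

```python
class Problem:
    """A problem definition with the components described above:
    initial state, Actions, Result, Goal Test, and Step Cost."""

    def __init__(self, initial_state):
        self.initial_state = initial_state

    def actions(self, state):
        """Return the set {a1, a2, a3, ...} of actions available in state."""
        raise NotImplementedError

    def result(self, state, action):
        """Return the new state reached by doing action in state."""
        raise NotImplementedError

    def goal_test(self, state):
        """Return True if state is a goal, False otherwise."""
        raise NotImplementedError

    def step_cost(self, state, action, result_state):
        """Return the number n, the cost of taking action in state."""
        raise NotImplementedError
```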
Now let’s see how the definition of a problem | ▶ 00:00 |
maps onto the route-finding domain. | ▶ 00:06 |
First, the initial state was given. | ▶ 00:10 |
Let’s say we start off in Arad, | ▶ 00:12 |
and the goal test, | ▶ 00:15 |
let’s say that the state of being in Bucharest | ▶ 00:17 |
is the only state that counts as a goal, | ▶ 00:22 |
and all the other states are not goals. | ▶ 00:24 |
Now the set of all of the states here | ▶ 00:26 |
is known as the state space, | ▶ 00:29 |
and we navigate the state space by applying actions. | ▶ 00:31 |
The actions are specific to each city, | ▶ 00:35 |
so when we are in Arad, there are three possible actions, | ▶ 00:39 |
to follow this road, this one, or this one. | ▶ 00:42 |
And as we follow them, we build paths | ▶ 00:46 |
or sequences of actions. | ▶ 00:49 |
So just being in Arad is the path of length zero, | ▶ 00:51 |
and now we could start exploring the space | ▶ 00:55 |
and add in this path of length one, | ▶ 00:58 |
this path of length one, | ▶ 01:01 |
and this path of length one. | ▶ 01:03 |
We could add in another path here of length two | ▶ 01:06 |
and another path here of length two. | ▶ 01:11 |
Here is another path of length two. | ▶ 01:14 |
Here is a path of length three. | ▶ 01:17 |
Another path of length two, and so on. | ▶ 01:21 |
Now at every point, | ▶ 01:26 |
we want to separate the state space out into three parts. | ▶ 01:28 |
First, the ends of the paths— | ▶ 01:34 |
The farthest paths that have been explored, | ▶ 01:37 |
we call the frontier. | ▶ 01:40 |
And so the frontier in this case | ▶ 01:42 |
consists of these states | ▶ 01:46 |
that are the farthest out we have explored. | ▶ 01:51 |
And then to the left of that in this diagram, | ▶ 01:55 |
we have the explored part of the state space. | ▶ 01:59 |
And then off to the right, | ▶ 02:02 |
we have the unexplored. | ▶ 02:04 |
So let’s write down those three components. | ▶ 02:06 |
We have the frontier. | ▶ 02:09 |
We have the unexplored region, | ▶ 02:15 |
and we have the explored region. | ▶ 02:20 |
One more thing, | ▶ 02:25 |
in this diagram we have labeled the step cost | ▶ 02:27 |
of each action along the route. | ▶ 02:30 |
So the step cost of going from Neamt to Iasi | ▶ 02:33 |
would be 87 corresponding to a distance of 87 kilometers, | ▶ 02:37 |
and the path cost is just the sum of the step costs. | ▶ 02:42 |
So the cost of the path | ▶ 02:46 |
of going from Arad to Oradea | ▶ 02:48 |
would be 71 plus 75. | ▶ 02:50 |
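In code, the additive path cost is then just a sum over step costs; a one-line sketch building on the hypothetical Problem interface above:

```python
def path_cost(problem, path):
    """Additive path cost: the sum of step costs along the path, where the
    path is a sequence of (state, action, result_state) transitions.
    For example, Arad -> Zerind -> Oradea costs 71 + 75 = 146, as above."""
    return sum(problem.step_cost(s, a, s2) for (s, a, s2) in path)
```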
[Narrator] Now let's define a function for solving problems. | ▶ 00:00 |
It's called Tree Search because it superimposes | ▶ 00:04 |
a search tree over the state space. | ▶ 00:07 |
Here's how it works: It starts off by | ▶ 00:10 |
initializing the frontier to be the path | ▶ 00:12 |
consisting of only the initial state, | ▶ 00:14 |
and then it goes into a loop | ▶ 00:16 |
in which it first checks to see | ▶ 00:18 |
do we still have anything left in the frontier? | ▶ 00:21 |
If not, we fail--there can be no solution. | ▶ 00:23 |
If we do have something, then we make a choice. | ▶ 00:25 |
Tree Search is really a family of functions | ▶ 00:28 |
not a single algorithm which | ▶ 00:31 |
depends on how we make that choice, | ▶ 00:33 |
and we'll see some of the options later. | ▶ 00:35 |
If we go ahead and make a choice of one of | ▶ 00:38 |
the paths on the frontier and remove that | ▶ 00:41 |
path from the frontier, we find the state | ▶ 00:43 |
which is at the end of the path, and if that | ▶ 00:45 |
state's a goal, then we're done. | ▶ 00:47 |
We found a path to the goal; otherwise, | ▶ 00:49 |
we do what's called expanding that path. | ▶ 00:51 |
We look at all the actions from that state, | ▶ 00:54 |
and for each action we add to the path the action | ▶ 00:57 |
and its resulting state; so we get | ▶ 01:00 |
a new path that has the old path, the action, | ▶ 01:03 |
and the result of that action, and we | ▶ 01:06 |
stick all of those paths back onto the frontier. | ▶ 01:09 |
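Read as code, the loop just described might look like the following minimal sketch (using the hypothetical Problem interface from before); the choose function is deliberately left as a parameter because, as discussed next, that one choice is what distinguishes the members of the family:

```python
def tree_search(problem, choose):
    """Generic tree search. choose picks a path and removes it from the
    frontier; that single decision defines the particular algorithm."""
    frontier = [[problem.initial_state]]  # paths, each stored as a list of states
    while frontier:                       # anything left in the frontier?
        path = choose(frontier)           # pick a path (and remove it)
        state = path[-1]                  # the state at the end of the path
        if problem.goal_test(state):
            return path                   # found a path to the goal
        for action in problem.actions(state):   # expand the path
            frontier.append(path + [problem.result(state, action)])
    return None                           # frontier empty: failure
```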
Now Tree Search represents a whole family | ▶ 01:17 |
of algorithms, and where you get the family | ▶ 01:19 |
resemblance is that they're all looking | ▶ 01:22 |
at the frontier, copying items off and | ▶ 01:24 |
checking to see if they satisfy the goal test, | ▶ 01:26 |
but where you get the difference is right here, | ▶ 01:29 |
in the choice of how you're going to expand | ▶ 01:31 |
the next item on the frontier, which | ▶ 01:34 |
path do we look at first, and we'll go through | ▶ 01:36 |
different sets of algorithms that make | ▶ 01:39 |
different choices for which path to look at first. | ▶ 01:42 |
The first algorithm I want to consider | ▶ 01:47 |
is called Breadth-First Search. | ▶ 01:49 |
Now it could be called shortest-first search | ▶ 01:51 |
because what it does is always choose | ▶ 01:54 |
from the frontier one of the paths that hasn't been | ▶ 01:56 |
considered yet that's the shortest possible. | ▶ 01:59 |
So how does it work? | ▶ 02:02 |
Well we start off with the path of | ▶ 02:04 |
length 0, starting in the start state, and | ▶ 02:06 |
that's the only path in the frontier so | ▶ 02:10 |
it's the shortest one so we pick it, | ▶ 02:13 |
and then we expand it, and we add in | ▶ 02:15 |
all the paths that result from | ▶ 02:17 |
applying all the possible actions. | ▶ 02:20 |
So now we've removed | ▶ 02:22 |
this path from the frontier, | ▶ 02:25 |
but we've added in 3 new paths. | ▶ 02:28 |
This one, | ▶ 02:31 |
this one, and this one. | ▶ 02:33 |
Now we're in a position where | ▶ 02:37 |
we have 3 paths on the frontier, and | ▶ 02:39 |
we have to pick the shortest one. | ▶ 02:42 |
Now in this case all 3 paths | ▶ 02:45 |
have the same length, length 1, so we | ▶ 02:47 |
break the tie at random or using some | ▶ 02:50 |
other technique, and let's suppose that | ▶ 02:52 |
in this case we choose this path | ▶ 02:56 |
from Arad to Sibiu. | ▶ 02:58 |
Now the question I want you to answer | ▶ 03:00 |
is once we remove that from the frontier, | ▶ 03:03 |
what paths are we going to add next? | ▶ 03:09 |
So show me by checking off the cities | ▶ 03:11 |
that end the paths, which paths | ▶ 03:14 |
are going to be added to the frontier? | ▶ 03:16 |
[Male narrator] The answer is that in Sibiu, the action function gives us 4 actions | ▶ 00:00 |
corresponding to traveling along these 4 roads, | ▶ 00:06 |
so we have to add in paths for each of those actions. | ▶ 00:09 |
One of those paths goes here, | ▶ 00:15 |
the other path continues from Arad and goes out here. | ▶ 00:17 |
The third path continues out here | ▶ 00:21 |
and then the fourth path goes from here--from Arad to Sibiu | ▶ 00:25 |
and then backtracks back to Arad. | ▶ 00:31 |
Now, it may seem silly and redundant to have a path that starts in Arad, | ▶ 00:36 |
goes to Sibiu and returns to Arad. | ▶ 00:41 |
How can that help us get to our destination in Bucharest? | ▶ 00:44 |
But we can see if we're dealing with a tree search, | ▶ 00:49 |
why it's natural to have this type of formulation | ▶ 00:52 |
and why the tree search doesn't even notice that it's backtracked. | ▶ 00:56 |
What the tree search does is superimpose on top of the state space | ▶ 01:00 |
a tree of searches, and the tree looks like this. | ▶ 01:05 |
We start off in state A, and in state A, there were 3 actions, | ▶ 01:09 |
so that gave us paths going to Z, S, and T. | ▶ 01:15 |
And from S, there were 4 actions, so that gave us paths going to O, F, R, and A, | ▶ 01:21 |
and then the tree would continue on from here. | ▶ 01:34 |
We'd take one of the next items | ▶ 01:37 |
and we'd expand it and continue on, but notice that we returned to the A state | ▶ 01:40 |
in the state space, but in the tree, | ▶ 01:48 |
it's just another item in the tree. | ▶ 01:51 |
Now, here's another representation of the search space | ▶ 01:55 |
and what's happening is as we start to explore the state space, | ▶ 01:57 |
we keep track of the frontier, which is the set of states that are at the end of the paths | ▶ 02:01 |
that we haven't explored yet, and behind that frontier | ▶ 02:09 |
is the set of explored states, and ahead of the frontier is the unexplored states. | ▶ 02:13 |
Now the reason we keep track of the explored states | ▶ 02:19 |
is that when we want to expand and we find a duplicate-- | ▶ 02:22 |
so say when we expand from here, if we pointed back to state T, | ▶ 02:27 |
if we hadn't kept track of that, we would have to add in a new state for T down here. | ▶ 02:33 |
But because we've already seen it and we know that this is actually a regressive step | ▶ 02:42 |
into the already explored state, now, because we kept track of that, | ▶ 02:47 |
we don't need it anymore. | ▶ 02:51 |
Now we see how to modify the Tree Search Function | ▶ 00:00 |
to make it be a Graph Search Function | ▶ 00:04 |
to avoid those repeated paths. | ▶ 00:06 |
What we do, is we start off and initialize a set | ▶ 00:09 |
called the explored set of states that we have already explored. | ▶ 00:13 |
Then, when we consider a new path, | ▶ 00:16 |
we add the new state to the set of already explored states, | ▶ 00:19 |
and then when we are expanding the path | ▶ 00:23 |
and adding in new states to the end of it, | ▶ 00:26 |
we don’t add that in if we have already seen that new state | ▶ 00:29 |
in either the frontier or the explored. | ▶ 00:33 |
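The same sketch with the graph-search modification might look like this; as before, choose stands in for the frontier discipline:

```python
def graph_search(problem, choose):
    """Tree search plus an explored set, so repeated states are skipped."""
    frontier = [[problem.initial_state]]
    explored = set()                      # states we have already explored
    while frontier:
        path = choose(frontier)           # pick a path (and remove it)
        state = path[-1]
        if problem.goal_test(state):
            return path
        explored.add(state)               # add the state to the explored set
        for action in problem.actions(state):
            new_state = problem.result(state, action)
            # Don't add a state already seen in the frontier or explored set.
            if new_state in explored or any(p[-1] == new_state for p in frontier):
                continue
            frontier.append(path + [new_state])
    return None
```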
Now back to Breadth-First Search. | ▶ 00:37 |
Let’s assume we are using the Graph Search | ▶ 00:39 |
so that we have eliminated the duplicate paths. | ▶ 00:41 |
Arad is crossed off the list. | ▶ 00:44 |
The path that goes from Arad to Sibiu | ▶ 00:47 |
and back to Arad is removed, | ▶ 00:49 |
and we are left with these one, two, three, | ▶ 00:51 |
four, five possible paths. | ▶ 00:53 |
Given these 5 paths, | ▶ 00:57 |
show me which ones are candidates to be expanded next | ▶ 00:59 |
by the Breadth-First Search algorithm. | ▶ 01:02 |
[Male narrator] And the answer is that Breadth-First Search always considers | ▶ 00:00 |
the shortest paths first, and in this case, there are 2 paths of length 1-- | ▶ 00:03 |
the paths from Arad to Zerind and from Arad to Timisoara-- | ▶ 00:08 |
so those would be the 2 paths that would be considered. | ▶ 00:12 |
Now, let's suppose that the tie is broken in some way | ▶ 00:15 |
and we chose this path from Arad to Zerind. | ▶ 00:18 |
Now, we want to expand that node. | ▶ 00:22 |
We remove it from the frontier and put it in the explored list | ▶ 00:25 |
and now we say, "What paths are we going to add?" | ▶ 00:31 |
So check off the ends of the paths the cities that we're going to add. | ▶ 00:35 |
[Male narrator] In this case, there's nothing to add | ▶ 00:00 |
because of the 2 neighbors, 1 is in the explored list and 1 is in the frontier, | ▶ 00:03 |
and if we're using graph search, then we won't add either of those. | ▶ 00:09 |
[Male narrator] So we move on, we look for another shortest path. | ▶ 00:00 |
There's one path left of length 1, so we look at that path, we expand it, | ▶ 00:04 |
add in this path, put that one on the explored list, | ▶ 00:11 |
and now we've got 3 paths of length 2. | ▶ 00:16 |
We choose 1 of them, and let's say we choose this one. | ▶ 00:20 |
Now, my question is show me which states we add to the path | ▶ 00:23 |
and tell me whether we're going to terminate the algorithm at this point | ▶ 00:30 |
because we've reached the goal or whether we're going to continue. | ▶ 00:35 |
[Male narrator] The answer is that we add 1 more path, the path to Bucharest. | ▶ 00:00 |
We don't add the path going back because it's in the explored list, | ▶ 00:08 |
but we don't terminate it yet. | ▶ 00:11 |
True, we have added a path that ends in Bucharest, | ▶ 00:13 |
but the goal test isn't applied when we add a path to the frontier. | ▶ 00:16 |
Rather, it's applied when we remove that path from the frontier, | ▶ 00:22 |
and we haven't done that yet. | ▶ 00:26 |
[Male narrator] Now, why doesn't the general tree search or graph search algorithm stop | ▶ 00:00 |
when it adds a goal node to the frontier? | ▶ 00:06 |
The reason is because it might not be the best path to the goal. | ▶ 00:09 |
Now, here we found a path of length 2 | ▶ 00:13 |
and we added a path of length 3 that reached the goal. | ▶ 00:16 |
The general graph search or tree search doesn't know | ▶ 00:21 |
that there might be some other path that we could expand | ▶ 00:24 |
that would have a distance of say, 2-1/2, | ▶ 00:27 |
but there's an optimization that could be made. | ▶ 00:30 |
If we know we're doing Breadth-First Search | ▶ 00:33 |
and we know there's no possibility of a path of length 2-1/2, | ▶ 00:35 |
then we can change the algorithm so that it checks states | ▶ 00:40 |
as soon as they're added to the frontier | ▶ 00:44 |
rather than waiting until they're expanded | ▶ 00:46 |
and in that case, we can write a specific Breadth-First Search routine | ▶ 00:49 |
that terminates early and gives us a result as soon as we add a goal state to the frontier. | ▶ 00:53 |
Breadth-First Search will find this path | ▶ 01:01 |
that ends up in Bucharest, and if we're looking for the shortest path | ▶ 01:04 |
in terms of number of steps, | ▶ 01:08 |
Breadth-First Search is guaranteed to find it, | ▶ 01:10 |
But if we're looking for the shortest path in terms of total cost | ▶ 01:12 |
by adding up the step costs, then it turns out | ▶ 01:17 |
that this path is shorter than the path found by Breadth-First Search. | ▶ 01:21 |
So let's look at how we could find that path. | ▶ 01:26 |
An algorithm that has traditionally been called uniform-cost search | ▶ 00:00 |
but could be called cheapest-first search, | ▶ 00:05 |
is guaranteed to find the path with the cheapest total cost. | ▶ 00:08 |
Let's see how it works. | ▶ 00:11 |
We start out as before in the start state. | ▶ 00:14 |
And we pop that initial path off. | ▶ 00:19 |
Move it from the frontier to explored, | ▶ 00:24 |
and then add in the paths out of that state. | ▶ 00:28 |
As before, there will be 3 of those paths. | ▶ 00:33 |
And now, which path are we going to pick next | ▶ 00:39 |
in order to expand according to the rules of cheapest first? | ▶ 00:43 |
Cheapest first says that we pick the path with | ▶ 00:00 |
the lowest total cost. | ▶ 00:04 |
And that would be this path. | ▶ 00:06 |
It has a cost of 75 compared to the cost of 118 and 140 | ▶ 00:07 |
for the other paths. | ▶ 00:13 |
So we get here. We take that path off the frontier, | ▶ 00:14 |
put it on the explored list, add in its neighbors. | ▶ 00:19 |
Not going back to Arad, | ▶ 00:23 |
but adding in this new path. | ▶ 00:26 |
Summing up the total cost of that path, | ▶ 00:30 |
71 + 75 is 146 for this path. | ▶ 00:33 |
And now the question is, | ▶ 00:40 |
which path gets expanded next? | ▶ 00:41 |
Of the 3 paths on the frontier, we have ones | ▶ 00:00 |
with a cost of 146, 140, and 118. | ▶ 00:05 |
And that's the cheapest, so this one gets expanded. | ▶ 00:10 |
We take it off the frontier, move it to explored, | ▶ 00:13 |
add in its successors. In this case it's only 1. | ▶ 00:16 |
And that has a path total of 229. | ▶ 00:21 |
Which path do we expand next? | ▶ 00:29 |
Well, we've got 146, 140, and 229, | ▶ 00:30 |
so 140 is the lowest. | ▶ 00:33 |
Take it off the frontier. Put it on explored. | ▶ 00:38 |
Add in this path | ▶ 00:41 |
for a total cost of 220. | ▶ 00:44 |
And this path for a total cost of 239. | ▶ 00:48 |
And now the question is, which path do we expand next? | ▶ 00:53 |
The answer is this one, 146. | ▶ 00:00 |
Put it on explored. | ▶ 00:04 |
But there's nothing to add because | ▶ 00:07 |
both of its neighbors have already been explored. | ▶ 00:12 |
Which path do we look at next? | ▶ 00:13 |
The answer is this one. Two-twenty is less than 229 or 239. | ▶ 00:00 |
Take it off the frontier. Put it on explored. | ▶ 00:05 |
Add in 2 more paths and sum them up. | ▶ 00:09 |
So, 220 plus 146 is 366. | ▶ 00:15 |
And 220 plus 97 is 317. | ▶ 00:21 |
Okay, and now, notice that we're closing in on Bucharest. | ▶ 00:29 |
We've got 2 neighbors almost there, but it's not their turn yet. | ▶ 00:32 |
Instead, the cheapest path is this one over here, | ▶ 00:38 |
so move it to the explored list. | ▶ 00:43 |
Add 70 to the path cost so far, | ▶ 00:45 |
and we get 299. | ▶ 00:50 |
Now the cheapest node is 239 here, | ▶ 00:57 |
so we expand, finally, into Bucharest at a cost of 460. | ▶ 01:01 |
And now the question is are we done? Can we terminate the algorithm? | ▶ 01:09 |
[Male] And the answer is no, we're not done yet. | ▶ 00:00 |
We've put Bucharest, the goal state, onto the frontier, | ▶ 00:03 |
but we haven't popped it off the frontier yet. | ▶ 00:07 |
And the reason is because we've got to look around and see if there's a better path | ▶ 00:09 |
that can reach Bucharest. | ▶ 00:13 |
And so, let's continue. | ▶ 00:15 |
Look at everything on the frontier. | ▶ 00:18 |
Here's the cheapest one over here. | ▶ 00:20 |
Expand that. | ▶ 00:23 |
Now, what's the cheapest next one? | ▶ 00:26 |
Well, over here. | ▶ 00:30 |
Oops, forgot to take this one off the list. | ▶ 00:33 |
So now, 317 plus 101 gives us another path into Bucharest, | ▶ 00:36 |
and this is a better path. | ▶ 00:44 |
This one is 418, which gives us a better route in. | ▶ 00:46 |
But we have to keep going. | ▶ 00:54 |
The best path on the frontier is 366, | ▶ 00:59 |
so pop that off, and that would give us 2 more routes into here, | ▶ 01:06 |
and eventually we pop off all of these. | ▶ 01:14 |
And then we get to the point where 418 was the best path on the frontier. | ▶ 01:18 |
We pop that off, and then we recognize that we'd reach the goal, | ▶ 01:24 |
and the reason that uniform cost finds the optimal path, the cheapest cost, | ▶ 01:29 |
is because it's guaranteed that it will first pop off this cheapest path, | ▶ 01:35 |
the 418, before it gets to the more expensive path, like the 460. | ▶ 01:40 |
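A compact sketch of uniform-cost search, using a priority queue so the cheapest path is always popped first; note that the goal test happens on removal from the frontier, not on insertion, exactly as in the walkthrough above:

```python
import heapq
import itertools

def uniform_cost_search(problem):
    """Cheapest-first search: always expand the frontier path with the
    lowest total path cost, and test for the goal only on removal."""
    counter = itertools.count()           # tie-breaker so paths never get compared
    frontier = [(0, next(counter), [problem.initial_state])]
    explored = set()
    while frontier:
        cost, _, path = heapq.heappop(frontier)   # cheapest total cost first
        state = path[-1]
        if problem.goal_test(state):      # goal test on removal from frontier
            return path
        if state in explored:
            continue
        explored.add(state)
        for action in problem.actions(state):
            new_state = problem.result(state, action)
            if new_state not in explored:
                step = problem.step_cost(state, action, new_state)
                heapq.heappush(frontier, (cost + step, next(counter),
                                          path + [new_state]))
    return None
```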
So, we've looked at 2 search algorithms. | ▶ 00:00 |
One, breadth-first search, in which we always expand first | ▶ 00:03 |
the shallowest paths, the shortest paths. | ▶ 00:08 |
Second, cheapest-first search, in which we always expand first the path | ▶ 00:12 |
with the lowest total cost. | ▶ 00:17 |
And I'm going to take this opportunity to introduce a third algorithm, depth-first search, | ▶ 00:20 |
which is in a way the opposite of breadth-first search. | ▶ 00:25 |
In depth-first search, we always expand first the longest path, | ▶ 00:28 |
the path with the most steps in it. | ▶ 00:33 |
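One way to see the relationship among the three algorithms: they can share the generic search skeleton sketched earlier and differ only in which path is removed from the frontier first. A sketch of the frontier disciplines:

```python
# All three searches can share the generic skeleton sketched earlier;
# only the frontier discipline (which path choose removes first) differs.

def choose_breadth_first(frontier):
    """Breadth-first: FIFO, so the shallowest (oldest) path is expanded first."""
    return frontier.pop(0)

def choose_depth_first(frontier):
    """Depth-first: LIFO, so the deepest (newest) path is expanded first."""
    return frontier.pop()

# Cheapest-first instead removes the path with the lowest total path cost,
# as in the uniform-cost sketch above.
```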
Now, what I want to ask you to do is for each of these nodes in each of the trees, | ▶ 00:36 |
tell us in what order they're expanded, | ▶ 00:42 |
first, second, third, fourth, fifth and so on by putting a number into the box. | ▶ 00:44 |
And if there are ties, put that number in and resolve the ties in left to right order. | ▶ 00:49 |
Then I want you to answer one more question, | ▶ 00:58 |
which is: are these searches optimal? | ▶ 01:03 |
That is, are they guaranteed to find the best solution? | ▶ 01:06 |
And for breadth-first search, optimal would mean finding the shortest path. | ▶ 01:11 |
If you think it's guaranteed to find the shortest path, check here. | ▶ 01:16 |
For cheapest first, it would mean finding the path with the lowest total path cost. | ▶ 01:21 |
Check here if you think it's guaranteed to do that. | ▶ 01:26 |
And we'll allow the assumption that all costs have to be positive. | ▶ 01:30 |
And in depth first, cheapest or optimal would mean, again, | ▶ 01:34 |
as in breadth first, finding the shortest possible path in terms of number of steps. | ▶ 01:41 |
Check here if you think depth first will always find that. | ▶ 01:46 |
Here are the answers. | ▶ 00:00 |
Breadth-first search, as the name implies, expands nodes in this order. | ▶ 00:04 |
1, 2, 3, 4, 5, 6, 7. | ▶ 00:10 |
So, it's going across a stripe at a time, breadth first. | ▶ 00:17 |
Is it optimal? | ▶ 00:23 |
Well, it's always expanding the shortest paths first, | ▶ 00:25 |
and so wherever the goal is hiding, it's going to find it without examining | ▶ 00:28 |
any longer paths, so in fact, it is optimal. | ▶ 00:34 |
Cheapest first, first we expand the path of length zero, | ▶ 00:38 |
then the path of length 2. | ▶ 00:45 |
Now there's a path of length 4, path of length 5, | ▶ 00:47 |
path of length 6, a path of length 7, and finally, a path of length 8. | ▶ 00:53 |
And as we've seen, it's guaranteed to find the cheapest path of all, | ▶ 01:02 |
assuming that all the individual step costs are not negative. | ▶ 01:08 |
Depth-first search tries to go as deep as it can first, | ▶ 01:14 |
so it goes 1, 2, 3, then backs up, 4, | ▶ 01:17 |
then backs up, 5, 6, 7. | ▶ 01:24 |
And you can see that it doesn't necessarily find the shortest path of all. | ▶ 01:29 |
Let's say that there were goals in position 5 and in position 3. | ▶ 01:34 |
It would find the longer path to position 3 and find the goal there | ▶ 01:39 |
and would not find the goal in position 5. | ▶ 01:43 |
So, it is not optimal. | ▶ 01:46 |
Given the non-optimality of depth-first search, | ▶ 00:00 |
why would anybody choose to use it? | ▶ 00:04 |
Well, the answer has to do with the storage requirements. | ▶ 00:07 |
Here I've illustrated a state space | ▶ 00:10 |
consisting of a very large or even infinite binary tree. | ▶ 00:13 |
As we go to levels 1, 2, 3, down to level n, | ▶ 00:18 |
the tree gets larger and larger. | ▶ 00:22 |
Now, let's consider the frontier for each of these search algorithms. | ▶ 00:24 |
For breadth-first search, we know a frontier looks like that, | ▶ 00:29 |
and so when we get down to level n, we'll require storage space for | ▶ 00:35 |
2 to the n paths in a breadth-first search. | ▶ 00:40 |
For cheapest first, the frontier is going to be more complicated. | ▶ 00:45 |
It's going to sort of work out this contour of cost, | ▶ 00:49 |
but it's going to have a similar total number of nodes. | ▶ 00:53 |
But for depth-first search, as we go down the tree, we start going down this branch, | ▶ 00:57 |
and then we back up, but at any point, our frontier is only going to have n nodes | ▶ 01:03 |
rather than 2 to the n nodes, so that's a substantial savings for depth-first search. | ▶ 01:08 |
Now, of course, if we're also keeping track of the explored set, | ▶ 01:14 |
then we don't get that much savings. | ▶ 01:19 |
But without the explored set, depth-first search has a huge advantage | ▶ 01:21 |
in terms of space saved. | ▶ 01:25 |
One more property of the algorithms to consider | ▶ 01:27 |
is the property of completeness, meaning if there is a goal somewhere, | ▶ 01:30 |
will the algorithm find it? | ▶ 01:35 |
So, let's move from very large trees to infinite trees, | ▶ 01:37 |
and let's say that there's some goal hidden somewhere deep down in that tree. | ▶ 01:41 |
And the question is, are each of these algorithms complete? | ▶ 01:47 |
That is, are they guaranteed to find a path to the goal? | ▶ 01:51 |
Mark off the check boxes for the algorithms that you believe are complete in this sense. | ▶ 01:55 |
The answer is that breadth-first search is complete, | ▶ 00:00 |
so even if the tree is infinite, if the goal is placed at any finite level, | ▶ 00:04 |
eventually, we're going to march down and find that goal. | ▶ 00:10 |
Same with cheapest first. | ▶ 00:16 |
No matter where the goal is, if it has a finite cost, | ▶ 00:18 |
eventually, we're going to go down and find it. | ▶ 00:21 |
But not so for depth-first search. | ▶ 00:25 |
If there's an infinite path, depth-first search will keep following that, | ▶ 00:28 |
so it will keep going down and down and down along this path | ▶ 00:33 |
and never get to the path on which the goal sits. | ▶ 00:37 |
So, depth-first search is not complete. | ▶ 00:46 |
Let's try to understand a little better how uniform cost search works. | ▶ 00:00 |
We start at a start state, | ▶ 00:05 |
and then we start expanding out from there looking at different paths, | ▶ 00:08 |
and what we end up doing is expanding in terms of contours, like on a topographic map, | ▶ 00:13 |
where first we span out to a certain distance, then to a farther distance, | ▶ 00:21 |
and then to a farther distance. | ▶ 00:28 |
Now at some point we meet up with a goal. Let's say the goal is here. | ▶ 00:31 |
Now we found a path from the start to the goal. | ▶ 00:35 |
But notice that the search really wasn't directed in any way towards the goal. | ▶ 00:42 |
It was expanding out everywhere in the space and depending on where the goal is, | ▶ 00:46 |
we should expect to have to explore half the space, on average, before we find the goal. | ▶ 00:52 |
If the space is small, that can be fine, | ▶ 00:57 |
but when spaces are large, that won't get us to the goal fast enough. | ▶ 01:00 |
Unfortunately, there is really nothing we can do, with what we know, to do better than that, | ▶ 01:05 |
and so if we want to improve, if we want to be able to find the goal faster, | ▶ 01:10 |
we're going to have to add more knowledge. | ▶ 01:15 |
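To make the uniform cost procedure concrete, here is a minimal Python sketch, assuming the state space is given as a dictionary mapping each state to a dictionary of neighbors and step costs. The names (`graph`, `uniform_cost_search`) and the representation are illustrative assumptions, not code from the class.

```python
import heapq

def uniform_cost_search(graph, start, goal):
    """Expand the cheapest path first. graph[state] maps neighbors to step costs."""
    frontier = [(0, start, [start])]          # priority queue ordered by path cost g
    explored = set()
    while frontier:
        g, state, path = heapq.heappop(frontier)
        if state == goal:                     # goal test when a path leaves the frontier
            return path, g
        if state in explored:
            continue
        explored.add(state)
        for neighbor, cost in graph[state].items():
            if neighbor not in explored:
                heapq.heappush(frontier, (g + cost, neighbor, path + [neighbor]))
    return None                               # no path: the goal is unreachable
```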
The type of knowledge that has proven most useful in search is an estimate of the distance | ▶ 01:21 |
from a state to the goal. | ▶ 01:27 |
So let's say we're dealing with a route-finding problem, | ▶ 01:32 |
and we can move in any direction--up or down, right or left-- | ▶ 01:36 |
and we'll take as our estimate the straight-line distance between a state and the goal, | ▶ 01:43 |
and we'll try to use that estimate to find our way to the goal fastest. | ▶ 01:50 |
Now an algorithm called greedy best-first search does exactly that. | ▶ 01:55 |
It expands first the path that's closest to the goal according to the estimate. | ▶ 02:04 |
So what do the contours look like in this approach? | ▶ 02:09 |
Well, we start here, and then we look at all the neighboring states, | ▶ 02:13 |
and the ones that appear to be closest to the goal we would expand first. | ▶ 02:17 |
So we'd start expanding like this and like this and like this and like this | ▶ 02:21 |
and that would lead us directly to the goal. | ▶ 02:30 |
So now instead of exploring whole circles that go out everywhere in the search space, | ▶ 02:33 |
our search is directed towards the goal. | ▶ 02:38 |
In this case it gets us immediately towards the goal, but that won't always be the case | ▶ 02:41 |
if there are obstacles along the way. | ▶ 02:46 |
Consider this search space. We have a start state and a goal, | ▶ 02:50 |
and there's an impassable barrier. | ▶ 02:54 |
Now greedy best-first search will start expanding out as before, | ▶ 02:57 |
trying to get towards the goal, | ▶ 03:02 |
and when it reaches the barrier, what will it do next? | ▶ 03:08 |
Well, it will try to continue along a path that's getting closer and closer to the goal. | ▶ 03:11 |
So it won't consider going back this way which is farther from the goal. | ▶ 03:15 |
Rather it will continue expanding out along these lines | ▶ 03:20 |
which always get closer and closer to the goal, | ▶ 03:24 |
and eventually it will find its way towards the goal. | ▶ 03:28 |
So it does find a path, and it does it by expanding a small number of nodes, | ▶ 03:31 |
but it's willing to accept a path which is longer than other paths. | ▶ 03:36 |
Now if we explored in the other direction, we could have found a much simpler path, | ▶ 03:42 |
a much shorter path, by just popping over the barrier, and then going directly to the goal. | ▶ 03:47 |
But greedy best-first search wouldn't have done that, because | ▶ 03:54 |
that would have involved getting to this point, which is this distance to the goal, | ▶ 03:56 |
and then considering states which were farther from the goal. | ▶ 04:01 |
What we would really like is an algorithm that combines the best parts | ▶ 04:08 |
of greedy search which explores a small number of nodes in many cases | ▶ 04:11 |
and uniform cost search which is guaranteed to find a shortest path. | ▶ 04:17 |
We'll show how to do that next using an algorithm called the A-star algorithm. | ▶ 04:22 |
[Male narrator] A* Search works by always expanding the path | ▶ 00:00 |
that has a minimum value of the function f | ▶ 00:03 |
which is defined as the sum f = g + h. | ▶ 00:07 |
Now, the function g of a path | ▶ 00:12 |
is just the path cost, | ▶ 00:16 |
and the function h of a path | ▶ 00:19 |
is equal to the h value of the state, | ▶ 00:23 |
which is the final state of the path, | ▶ 00:27 |
which is equal to the estimated distance to the goal. | ▶ 00:30 |
Here's an example of how A* works. | ▶ 00:36 |
Suppose we found this path through the state space to a state x, | ▶ 00:39 |
and we're trying to give a measure to the value of this path. | ▶ 00:44 |
The measure f is a sum of g, the path cost so far, | ▶ 00:48 |
and h, which is the estimated distance that remains | ▶ 00:55 |
to complete the path to the goal. | ▶ 01:02 |
Now, minimizing g helps us keep the path short | ▶ 01:04 |
and minimizing h helps us keep focused on finding the goal | ▶ 01:08 |
and the result is a search strategy that is the best possible | ▶ 01:13 |
in the sense that it finds the shortest length path | ▶ 01:17 |
while expanding the minimum number of paths possible. | ▶ 01:20 |
It could be called "best estimated total path cost first," | ▶ 01:24 |
but the name A* is traditional. | ▶ 01:28 |
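As a minimal sketch, under the same assumptions about the graph representation as the uniform-cost sketch above, A* is the same loop with the priority changed from g to f = g + h; ordering by h alone instead would give greedy best-first search, and by g alone, uniform cost search.

```python
import heapq

def astar(graph, h, start, goal):
    """Expand the path with minimum f = g + h. h maps a state to its estimate."""
    frontier = [(h(start), 0, start, [start])]     # entries are (f, g, state, path)
    explored = set()
    while frontier:
        f, g, state, path = heapq.heappop(frontier)
        if state == goal:                          # goal test on removal, not insertion
            return path, g
        if state in explored:
            continue
        explored.add(state)
        for neighbor, cost in graph[state].items():
            if neighbor not in explored:
                g2 = g + cost
                heapq.heappush(frontier, (g2 + h(neighbor), g2, neighbor, path + [neighbor]))
    return None
```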
Now let's go back to Romania and apply the A* algorithm | ▶ 01:32 |
and we're going to use a heuristic, which is a straight line distance | ▶ 01:36 |
between a state and the goal. | ▶ 01:40 |
The goal, again, is Bucharest, | ▶ 01:42 |
and so the distance from Bucharest to Bucharest is, of course, 0. | ▶ 01:44 |
And for all the other states, I've written in red | ▶ 01:47 |
the straight line distance. | ▶ 01:51 |
For example, straight across like that. | ▶ 01:53 |
Now, I should say that all the roads here I've drawn as straight lines, | ▶ 01:55 |
but actually, roads are going to be curved to some degree, | ▶ 01:59 |
so the actual distance along the roads is going to be longer | ▶ 02:03 |
than the straight line distance. | ▶ 02:06 |
Now, we start out as usual--we'll start in Arad as a start state-- | ▶ 02:09 |
and we'll expand out Arad and so we'll add 3 paths | ▶ 02:13 |
and the evaluation function, f, will be the sum of the path length, | ▶ 02:21 |
which is given in black, and the estimated distance, | ▶ 02:26 |
which is given in red. | ▶ 02:29 |
And so the path length from this path | ▶ 02:32 |
will be 140+253 or 393; | ▶ 02:37 |
for this path, 75+374, or 449; | ▶ 02:45 |
and for this path, 118+329, or 447. | ▶ 02:55 |
And now, the question is out of all the paths that are on the frontier, | ▶ 03:05 |
which path would we expand next under the A* algorithm? | ▶ 03:09 |
The answer is that we select this path first--the one from Arad to Sibiu-- | ▶ 00:00 |
because it has the smallest value--393--of the sum f=g+h. | ▶ 00:05 |
Let's go ahead and expand this node now. | ▶ 00:00 |
So we're going to add 3 paths. | ▶ 00:03 |
This one has a path cost of 291 | ▶ 00:06 |
and an estimated distance to the goal of 380, | ▶ 00:10 |
for a total of 671. | ▶ 00:14 |
This one has a path cost of 239 | ▶ 00:18 |
and an estimated distance of 176, for a total of 415. | ▶ 00:21 |
And the final one is 220+193=413. | ▶ 00:27 |
And now the question is which state do we expand next? | ▶ 00:33 |
The answer is we expand this path next | ▶ 00:00 |
because its total, 413, | ▶ 00:03 |
is less than all the other ones on the frontier-- | ▶ 00:06 |
although only slightly less than the 415 for this path. | ▶ 00:09 |
So we expand this node, | ▶ 00:00 |
giving us 2 more paths-- | ▶ 00:03 |
this one with an f-value of 417, | ▶ 00:06 |
and this one with an f-value of 526. | ▶ 00:10 |
The question again--which path are we going to expand next? | ▶ 00:16 |
And the answer is that we expand this path, Fagaras, next, | ▶ 00:00 |
because its f-total, 415, | ▶ 00:05 |
is less than all the other paths on the frontier. | ▶ 00:08 |
Now we expand Fagaras | ▶ 00:01 |
and we get a path that reaches the goal | ▶ 00:04 |
and it has a path length of 450 and an estimated distance of 0 | ▶ 00:07 |
for a total f value of 450, | ▶ 00:11 |
and now the question is: What do we do next? | ▶ 00:14 |
Click here if you think we're at the end of the algorithm | ▶ 00:17 |
and we don't need to expand next | ▶ 00:22 |
or click on the node that you think we will expand next. | ▶ 00:24 |
The answer is that we're not done yet, | ▶ 00:00 |
because the algorithm works by doing the goal test, | ▶ 00:03 |
when we take a path off the frontier, | ▶ 00:06 |
not when we put a path on the frontier. | ▶ 00:08 |
Instead, we just continue in the normal way and choose the node | ▶ 00:11 |
on the frontier which has the lowest value. | ▶ 00:15 |
That would be this one--the path through Pitesti, with a total of 417. | ▶ 00:18 |
So let's expand the node at Pitesti. | ▶ 00:01 |
We have to go down this direction, up, | ▶ 00:04 |
then we reach a path we've seen before, | ▶ 00:08 |
and we go in this direction. | ▶ 00:11 |
Now we reach Bucharest, which is the goal, | ▶ 00:13 |
and the h value is going to be 0 | ▶ 00:16 |
because we're at the goal, and the g value works out to 418. | ▶ 00:19 |
Again, we don't stop here just because we put a path onto the frontier; | ▶ 00:24 |
when we put it there, we don't apply the goal test yet. | ▶ 00:31 |
Instead, we now go back to the frontier, | ▶ 00:35 |
and it turns out that this 418 is the lowest-cost path on the frontier. | ▶ 00:38 |
So now we pull it off, do the goal test, | ▶ 00:43 |
and now we found our path to the goal, | ▶ 00:45 |
and it is, in fact, the shortest possible path. | ▶ 00:49 |
In this case, A-star was able to find the lowest-cost path. | ▶ 00:55 |
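As a sanity check on the arithmetic in this walkthrough, the fragment of the Romania map used here can be fed to the `astar` sketch from earlier. The road distances (black numbers) and straight-line estimates (red numbers) below are the values read off in the lecture; the diacritic-free city names are an assumption of this sketch.

```python
# Road distances (black numbers) for the fragment explored above.
romania = {
    "Arad":          {"Zerind": 75, "Sibiu": 140, "Timisoara": 118},
    "Sibiu":         {"Arad": 140, "Oradea": 151, "Fagaras": 99, "RimnicuVilcea": 80},
    "RimnicuVilcea": {"Sibiu": 80, "Pitesti": 97, "Craiova": 146},
    "Fagaras":       {"Sibiu": 99, "Bucharest": 211},
    "Pitesti":       {"RimnicuVilcea": 97, "Craiova": 138, "Bucharest": 101},
    "Zerind": {}, "Timisoara": {}, "Oradea": {}, "Craiova": {}, "Bucharest": {},
}
# Straight-line distances to Bucharest (red numbers).
h_sld = {"Arad": 366, "Sibiu": 253, "Zerind": 374, "Timisoara": 329, "Oradea": 380,
         "Fagaras": 176, "RimnicuVilcea": 193, "Pitesti": 100, "Craiova": 160,
         "Bucharest": 0}

print(astar(romania, lambda s: h_sld[s], "Arad", "Bucharest"))
# (['Arad', 'Sibiu', 'RimnicuVilcea', 'Pitesti', 'Bucharest'], 418)
```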
Now the question that you'll have to think about, | ▶ 00:59 |
because we haven't explained it yet, | ▶ 01:02 |
is whether A-star will always do this. | ▶ 01:04 |
Answer yes if you think A-star will always find the shortest cost path, | ▶ 01:06 |
or answer no if you think it depends on the particular problem given, | ▶ 01:12 |
or answer no if you think it depends on the particular heuristic estimate function, h. | ▶ 01:17 |
The answer is that it depends on the h function. | ▶ 00:02 |
A-star will find the lowest-cost path | ▶ 00:06 |
if the h function for a state is less than or equal to the true cost | ▶ 00:09 |
of the path to the goal through that state. | ▶ 00:16 |
In other words, we want the h to never overestimate the distance to the goal. | ▶ 00:20 |
We also say that h is optimistic. | ▶ 00:26 |
Another way of stating that | ▶ 00:31 |
is that h is admissible, | ▶ 00:34 |
meaning it is admissible to use it to find the lowest-cost path. | ▶ 00:37 |
Think of all of these as being the same way | ▶ 00:41 |
of stating the conditions under which A-star finds the lowest-cost path. | ▶ 00:45 |
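In symbols, writing h*(s) for the true cost of the cheapest path from state s to the goal, all three conditions say the same thing:

$$h(s) \le h^*(s) \quad \text{for every state } s.$$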
Here we give you an intuition as to why | ▶ 00:01 |
an optimistic heuristic function, h, finds the lowest-cost path. | ▶ 00:03 |
When A-star ends, it returns a path, p, with estimated cost, c. | ▶ 00:08 |
It turns out that c is also the actual cost, | ▶ 00:15 |
because at the goal the h component is 0, | ▶ 00:20 |
and so the path cost is the total cost as estimated by the function. | ▶ 00:23 |
Now, all the paths on the frontier | ▶ 00:28 |
have an estimated cost that's greater than c, | ▶ 00:31 |
and we know that because the frontier is explored in cheapest-first order. | ▶ 00:35 |
If h is optimistic, then the estimated cost | ▶ 00:40 |
is less than the true cost, | ▶ 00:44 |
so the path p must have a cost that's less than the true cost | ▶ 00:47 |
of any of the paths on the frontier. | ▶ 00:51 |
Any paths that go beyond the frontier | ▶ 00:54 |
must have a cost that's greater than that | ▶ 00:57 |
because we agree that the step cost is always 0 or more. | ▶ 00:59 |
So that means that this path, p, must be the minimal cost path. | ▶ 01:04 |
Now, this argument, I should say, only goes through | ▶ 01:09 |
as is for tree search. | ▶ 01:13 |
For graph search the argument is slightly more complicated, | ▶ 01:16 |
but the general intuitions hold the same. | ▶ 01:19 |
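For tree search the argument compresses into one chain of inequalities. When A* returns a path p with cost c, h at the goal is 0, and for any path p' still on the frontier:

$$c = g(p) = f(p) \le f(p') = g(p') + h(p') \le g(p') + h^*(p'),$$

where the right-hand side is the true cost of the best possible completion of p'. Since step costs are 0 or more, no frontier path can lead to a cheaper solution.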
So far we've looked at the state space of cities in Romania-- | ▶ 00:01 |
a 2-dimensional, physical space. | ▶ 00:05 |
But the technology for problem solving through search | ▶ 00:07 |
can deal with many types of state spaces, | ▶ 00:10 |
dealing with abstract properties, not just x-y position in a plane. | ▶ 00:12 |
Here I introduce another state space--the vacuum world. | ▶ 00:17 |
It's a very simple world in which there are only 2 positions | ▶ 00:21 |
as opposed to the many positions in the Romania state space. | ▶ 00:25 |
But there are additional properties to deal with as well. | ▶ 00:30 |
The robot vacuum cleaner can be in either of the 2 positions, | ▶ 00:33 |
but as well as that each of the positions | ▶ 00:36 |
can either have dirt in it or not have dirt in it. | ▶ 00:40 |
Now the question is to represent this as a state space | ▶ 00:43 |
how many states do we need? | ▶ 00:47 |
You can fill in the number of states in this box here. | ▶ 00:51 |
And the answer is there are 8 states. | ▶ 00:01 |
There are 2 physical states that the robot vacuum cleaner can be in-- | ▶ 00:04 |
either in state A or in state B. | ▶ 00:10 |
But in addition to that, there are states about how the world is | ▶ 00:12 |
as well as where the robot is in the world. | ▶ 00:17 |
So state A can be dirty or not. | ▶ 00:19 |
That's 2 possibilities. | ▶ 00:24 |
And B can be dirty or not. | ▶ 00:26 |
That's 2 more possibilities. | ▶ 00:28 |
We multiply those together. We get 8 possible states. | ▶ 00:31 |
Here is a diagram of the state space for the vacuum world. | ▶ 00:01 |
Note that there are 8 states, and we have the actions connecting the states | ▶ 00:05 |
just as we did in the Romania problem. | ▶ 00:09 |
Now let's look at a path through this state space. | ▶ 00:12 |
Let's say we start out in this position, | ▶ 00:15 |
and then we apply the action of moving right. | ▶ 00:19 |
Then we end up in a position where the state of the world looks the same, | ▶ 00:23 |
except the robot has moved from position 'A' to position 'B'. | ▶ 00:27 |
Now if we turn on the sucking action, | ▶ 00:32 |
then we end up in a state where the robot is in the same position | ▶ 00:37 |
but that position is no longer dirty. | ▶ 00:42 |
Let's take this very simple vacuum world | ▶ 00:47 |
and make a slightly more complicated one. | ▶ 00:50 |
First, we'll say that the robot has a power switch, | ▶ 00:53 |
which can be in one of three conditions: on, off, or sleep. | ▶ 00:56 |
Next, we'll say that the robot has a dirt-sensing camera, | ▶ 01:04 |
and that camera can either be on or off. | ▶ 01:09 |
Third, this is the deluxe model of robot | ▶ 01:13 |
in which the brushes that clean up the dust | ▶ 01:16 |
can be set at 1 of 5 different heights | ▶ 01:19 |
to be appropriate for whatever level of carpeting you have. | ▶ 01:22 |
Finally, rather than just having the 2 positions, | ▶ 01:27 |
we'll extend that out and have 10 positions. | ▶ 01:30 |
Now the question is how many states are in this state space? | ▶ 01:37 |
The answer is that the number of states is the cross product | ▶ 00:01 |
of the numbers of all the variables, since they're each independent, | ▶ 00:05 |
and any combination can occur. | ▶ 00:08 |
For the power we have 3 possible positions. | ▶ 00:10 |
The camera has 2. | ▶ 00:14 |
The brush height has 5. | ▶ 00:18 |
The dirt has 2 for each of the 10 positions. | ▶ 00:23 |
That's 2^10 or 1024. | ▶ 00:28 |
Then the robot's position can be any of those 10 positions as well. | ▶ 00:33 |
That works out to 307,200 states in the state space. | ▶ 00:39 |
Notice how a fairly trivial problem-- | ▶ 00:44 |
we're only modeling a few variables and only 10 positions-- | ▶ 00:46 |
works out to a large number of states. | ▶ 00:50 |
That's why we need efficient algorithms for searching through state spaces. | ▶ 00:52 |
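The count itself is a quick product, as in this throwaway Python check (the variable names are just labels for the quantities above):

```python
power, camera, brush, positions = 3, 2, 5, 10
dirt_configs = 2 ** positions                 # each of the 10 positions dirty or clean
print(power * camera * brush * dirt_configs * positions)   # 307200

# The earlier two-position world: 2 robot positions x 2^2 dirt configurations = 8.
print(2 * 2 ** 2)                             # 8
```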
I want to introduce one more problem that can be solved with search techniques. | ▶ 00:01 |
This is a sliding blocks puzzle, called a 15 puzzle. | ▶ 00:05 |
You may have seen something like this. | ▶ 00:08 |
So there are a bunch of little squares or blocks or tiles | ▶ 00:10 |
and you can slide them around, | ▶ 00:14 |
and the goal is to get them into a certain configuration. | ▶ 00:19 |
So we'll say that this is the goal state, where the numbers 1-15 are in order | ▶ 00:21 |
left to right, top to bottom. | ▶ 00:27 |
The starting state would be some state where all the positions are messed up. | ▶ 00:29 |
Now the question is: Can we come up with a good heuristic for this? | ▶ 00:34 |
Let's examine that as a way of thinking about where heuristics come from. | ▶ 00:38 |
The first heuristic we're going to consider | ▶ 00:42 |
we'll call h1, and that is equal to the number of misplaced blocks. | ▶ 00:46 |
So here 10 and 11 are misplaced because they should be there and there, respectively, | ▶ 00:54 |
12 is in the right place, 13 is in the right place, | ▶ 00:59 |
and 14 and 15 are misplaced. | ▶ 01:02 |
That's a total of 4 misplaced blocks. | ▶ 01:04 |
The 2nd heuristic, h2, is equal to | ▶ 01:07 |
the sum of the distances that each block would have to move to get to the right position. | ▶ 01:13 |
For this position, 10 would have to move 1 space to get to the right position, | ▶ 01:19 |
11 would have to move 1, so that's a total of 2 so far, | ▶ 01:26 |
13 is in the right place, | ▶ 01:30 |
14 is 1 displaced, | ▶ 01:31 |
and 15 is 1 displaced, | ▶ 01:33 |
so that would also be a total of 4. | ▶ 01:35 |
Now, the question is: Which, if any, of these heuristics are admissible? | ▶ 01:38 |
Check the boxes next to the heuristics that you think | ▶ 01:44 |
are admissible. | ▶ 01:47 |
H1 is admissible, because every tile that's in the wrong position | ▶ 00:02 |
must be moved at least once to get into the right position. | ▶ 00:07 |
So h1 never overestimates. | ▶ 00:10 |
How about h2? | ▶ 00:13 |
H2 is also admissible, because every tile in the wrong position | ▶ 00:15 |
can be moved closer to the correct position no faster than 1 space per move. | ▶ 00:20 |
Therefore, both are admissible. | ▶ 00:26 |
But notice that h2 is always greater than or equal to h1. | ▶ 00:28 |
That means that, with the exception of breaking ties, | ▶ 00:33 |
an A* search using h2 will always expand | ▶ 00:35 |
fewer paths than one using h1. | ▶ 00:39 |
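Here is a minimal Python sketch of the two heuristics, under the assumption that a board is a tuple of 16 numbers read left to right, top to bottom, with 0 standing for the blank (a representation chosen here purely for illustration). The last function anticipates the combination by maximum that comes up a bit later in this unit.

```python
def h1(state, goal):
    """Number of misplaced tiles; the blank (0) is not counted."""
    return sum(1 for s, g in zip(state, goal) if s != g and s != 0)

def h2(state, goal):
    """Sum of Manhattan distances of each tile from its goal square on a 4x4 board."""
    where = {tile: divmod(i, 4) for i, tile in enumerate(state)}
    total = 0
    for i, tile in enumerate(goal):
        if tile == 0:
            continue
        goal_row, goal_col = divmod(i, 4)
        row, col = where[tile]
        total += abs(row - goal_row) + abs(col - goal_col)
    return total

def h_combined(state, goal):
    """The max of admissible heuristics is admissible and at least as informed."""
    return max(h1(state, goal), h2(state, goal))
```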
Now, we're trying to build an artificial intelligence | ▶ 00:01 |
that can solve problems like this all on its own. | ▶ 00:04 |
You can see that the search algorithms do a great job | ▶ 00:08 |
of finding solutions to problems like this. | ▶ 00:12 |
But, you might complain that in order for the search algorithms to work, | ▶ 00:15 |
we had to provide it with a heuristic function. | ▶ 00:19 |
The heuristic function came from the outside. | ▶ 00:22 |
You might think that coming up with a good heuristic function is really where all the intelligence is. | ▶ 00:25 |
So, a problem solver that uses a heuristic function given to it | ▶ 00:30 |
really isn't intelligent at all. | ▶ 00:34 |
So let's think about where the intelligence could come from | ▶ 00:36 |
and whether we can automatically come up with good heuristic functions. | ▶ 00:39 |
I'm going to sketch a description of | ▶ 00:45 |
a program that can automatically come up with good heuristics | ▶ 00:47 |
given a description of a problem. | ▶ 00:50 |
Suppose this program is given a description of the sliding blocks puzzle | ▶ 00:52 |
where we say that a block can move from square A to square B | ▶ 00:57 |
if A is adjacent to B and B is blank. | ▶ 01:02 |
Now, imagine that we try to loosen this restriction. | ▶ 01:06 |
We cross out "B is blank," | ▶ 01:10 |
and then we get the rule | ▶ 01:14 |
"a block can move from A to B if A is adjacent to B," | ▶ 01:16 |
and that's equal to our heuristic h2 | ▶ 01:20 |
because a block can move anywhere to an adjacent state. | ▶ 01:23 |
Now, we could also cross out the other part of the rule, | ▶ 01:27 |
and we now get "a block can move from any square A | ▶ 01:31 |
to any square B," regardless of any condition. | ▶ 01:36 |
That gives us heuristic h1. | ▶ 01:40 |
So we see that both of our heuristics can be derived | ▶ 01:43 |
from a simple mechanical manipulation | ▶ 01:48 |
of the formal description of the problem. | ▶ 01:50 |
Once we've generated automatically these candidate heuristics, | ▶ 01:53 |
another way to come up with a good heuristic is to say | ▶ 01:58 |
that a new heuristic, h, | ▶ 02:02 |
is equal to the maximum of h1 and h2, | ▶ 02:04 |
and that's guaranteed to be admissible as long as | ▶ 02:10 |
h1 and h2 are admissible | ▶ 02:13 |
because it still never overestimates, | ▶ 02:16 |
and it's guaranteed to be better because it's getting closer to the true value. | ▶ 02:18 |
The only problem with combining multiple heuristics like this | ▶ 02:22 |
is that there is some cost to compute the heuristics, | ▶ 02:27 |
and so it could take longer to compute, | ▶ 02:29 |
even if we end up expanding fewer paths. | ▶ 02:31 |
Crossing out parts of the rules like this | ▶ 02:35 |
is called "generating a relaxed problem." | ▶ 02:38 |
What we've done is we've taken the original problem, | ▶ 02:41 |
where it's hard to move squares around, | ▶ 02:44 |
and made it easier by relaxing one of the constraints. | ▶ 02:46 |
You can see that as adding new links in the state space, | ▶ 02:49 |
so if we have a state space in which there are only particular links, | ▶ 02:54 |
by relaxing the problem it's as if we are adding new operators | ▶ 02:59 |
that traverse the state space in new ways. | ▶ 03:05 |
So adding new operators only makes the problem easier, | ▶ 03:07 |
and thus never overestimates, and thus is admissible. | ▶ 03:11 |
We've seen what search can do for problem solving. | ▶ 00:00 |
It can find the lowest-cost path to a goal, | ▶ 00:03 |
and it can do that in a way in which we never generate more paths than we have to. | ▶ 00:06 |
We can find the optimal number of paths to generate, | ▶ 00:12 |
and we can do that with a heuristic function that we generate on our own | ▶ 00:15 |
by relaxing the existing problem definition. | ▶ 00:19 |
But let's be clear on what search can't do. | ▶ 00:22 |
All the solutions that we have found consist of a fixed sequence of actions. | ▶ 00:25 |
In other words, the agent, here in Arad, thinks, comes up with a plan that it wants to execute | ▶ 00:31 |
and then essentially closes its eyes and starts driving, | ▶ 00:38 |
never considering along the way if something has gone wrong. | ▶ 00:42 |
That works fine for this type of problem, | ▶ 00:46 |
but it only works when we satisfy the following conditions. | ▶ 00:49 |
[Problem solving works when:] | ▶ 00:53 |
Problem-solving technology works when the following set of conditions is true: | ▶ 00:55 |
First, the domain must be fully observable. | ▶ 00:59 |
In other words, we must be able to see what initial state we start out with. | ▶ 01:03 |
Second, the domain must be known. | ▶ 01:08 |
That is, we have to know the set of actions available to us. | ▶ 01:12 |
Third, the domain must be discrete. | ▶ 01:16 |
There must be a finite number of actions to choose from. | ▶ 01:20 |
Fourth, the domain must be deterministic. | ▶ 01:24 |
We have to know the result of taking an action. | ▶ 01:28 |
Finally, the domain must be static. | ▶ 01:32 |
There must be nothing else in the world that can change the world except our own actions. | ▶ 01:36 |
If all these conditions are true, then we can search for a plan | ▶ 01:41 |
which solves the problem and is guaranteed to work. | ▶ 01:44 |
In later units, we will see what to do if any of these conditions fail to hold. | ▶ 01:47 |
Our description of the algorithm has talked about paths in the state space. | ▶ 00:01 |
I want to say a little bit now about how to implement that in terms of a computer algorithm. | ▶ 00:08 |
We talk about paths, but we need to implement that in some way. | ▶ 00:15 |
In the implementation we talk about nodes. | ▶ 00:19 |
A node is a data structure, and it has four fields. | ▶ 00:22 |
The state field indicates the state at the end of the path. | ▶ 00:27 |
The action was the action it took to get there. | ▶ 00:35 |
The cost is the total cost, | ▶ 00:40 |
and the parent is a pointer to another node. | ▶ 00:45 |
In this case, the node that has state "S", | ▶ 00:50 |
and it will have a parent which points to the node that has state "A", | ▶ 00:56 |
and that will have a parent pointer that's null. | ▶ 01:06 |
So we have a linked list of nodes representing the path. | ▶ 01:10 |
We'll use the word "path" for the abstract idea, | ▶ 01:15 |
and the word "node" for the representation in the computer memory. | ▶ 01:18 |
But otherwise, you can think of those two terms as being synonyms, | ▶ 01:22 |
because they're in a one-to-one correspondence. | ▶ 01:26 |
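A minimal Python rendering of this node structure might look as follows; the field names mirror the four fields just described, and everything else is an illustrative assumption.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Node:
    state: object                    # the state at the end of the path
    action: Optional[str]            # the action taken to reach this state
    cost: float                      # the total path cost g
    parent: Optional["Node"]         # the previous node; None for the start node

def states_along_path(node):
    """Follow parent pointers back to the start to recover the path of states."""
    states = []
    while node is not None:
        states.append(node.state)
        node = node.parent
    return list(reversed(states))
```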
Now there are two main data structures that deal with nodes. | ▶ 01:31 |
We have the "frontier" and we have the "explored" list. | ▶ 01:35 |
Let's talk about how to implement them. | ▶ 01:41 |
In the frontier the operations we have to deal with | ▶ 01:44 |
are removing the best item from the frontier and adding in new ones. | ▶ 01:48 |
And that suggests we should implement it as a priority queue, | ▶ 01:52 |
which knows how to keep track of the best items in proper order. | ▶ 01:55 |
But we also need to have an additional operation | ▶ 01:59 |
of a membership test, to ask whether a new item is in the frontier. | ▶ 02:03 |
And that suggests representing it as a set, | ▶ 02:07 |
which can be built from a hash table or a tree. | ▶ 02:10 |
So the most efficient implementations of search actually have both representations. | ▶ 02:14 |
The explored set, on the other hand, is easier. | ▶ 02:20 |
All we have to do there is be able to add new members and check for membership. | ▶ 02:23 |
So we represent that as a single set, | ▶ 02:28 |
which again can be done with either a hash table or tree. | ▶ 02:31 |
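One possible sketch of the double representation of the frontier combines a priority queue with a set for membership tests; it ignores details like updating the priority of a state already present, so it is only an outline of the idea.

```python
import heapq

class Frontier:
    def __init__(self):
        self.heap = []               # (cost, state) pairs, ordered by cost
        self.members = set()         # mirror of the states, for O(1) membership tests

    def add(self, cost, state):
        heapq.heappush(self.heap, (cost, state))
        self.members.add(state)

    def pop_best(self):
        cost, state = heapq.heappop(self.heap)
        self.members.discard(state)
        return cost, state

    def __contains__(self, state):   # enables: if s in frontier
        return state in self.members
```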
Congratulations. | ▶ 00:00 |
You've just made it to assignment 1. | ▶ 00:02 |
This is homework assignment #1. | ▶ 00:00 |
This is a question about peg solitaire. | ▶ 00:01 |
In peg solitaire, a single player faces | ▶ 00:04 |
the following kind of board. | ▶ 00:08 |
Initially, all positions are occupied by pegs except for the center position. | ▶ 00:13 |
You can find more information on peg solitaire at the following URL. | ▶ 00:22 |
[http://en.wikipedia.org/wiki/peg_solitaire] | ▶ 00:26 |
I wish to know whether this game is partially observable. | ▶ 00:36 |
Please say yes or no. | ▶ 00:40 |
I wish to know whether it is stochastic. | ▶ 00:43 |
Please say yes if it is and no if it's deterministic. | ▶ 00:46 |
Let me know if it's continuous, yes or no, | ▶ 00:50 |
and let me know if it's adversarial, yes or no. | ▶ 00:55 |
>>Peg Solitaire is not partially observable because you can see the board at all times. | ▶ 00:00 |
It is not stochastic because you yourself make all the moves, | ▶ 00:06 |
and they have deterministic effects. | ▶ 00:09 |
It is not continuous. There are just finitely many choices of actions | ▶ 00:11 |
and finitely many board positions, so therefore, it is not continuous. | ▶ 00:15 |
And it's not adversarial because there are no adversaries--just you playing. | ▶ 00:18 |
I am going to ask you about the problem of learning about a loaded coin. | ▶ 00:01 |
A loaded coin is a coin | ▶ 00:05 |
that, if you flip it, | ▶ 00:07 |
might have a non-0.5 chance | ▶ 00:09 |
of coming up heads or tails. | ▶ 00:13 |
Fair coins always come up 50% heads or tails. | ▶ 00:16 |
Loaded coins might come up, for example, | ▶ 00:20 |
0.9 chance heads and 0.1 chance tails. | ▶ 00:23 |
Your task will be to understand, | ▶ 00:27 |
from coin flips, | ▶ 00:30 |
whether a coin is loaded, | ▶ 00:31 |
and if so, at what probability. | ▶ 00:33 |
I don't want you to solve the problem, | ▶ 00:35 |
but I want you to answer the following questions: | ▶ 00:37 |
Is it partially observable? | ▶ 00:40 |
Yes or no. | ▶ 00:42 |
Is it stochastic? | ▶ 00:44 |
Yes or no. | ▶ 00:46 |
Is it continuous? [Yes or no.] | ▶ 00:48 |
And finally, is it adversarial? | ▶ 00:51 |
Yes or no. | ▶ 00:53 |
[Thrun] So the loaded coin example is clearly partially observable, | ▶ 00:00 |
and the reason is that there is actually use for memory: | ▶ 00:06 |
if you flip it more than 1 time, you can learn more about what the actual probability is. | ▶ 00:09 |
Therefore, looking at the most recent coin flip is insufficient to make your choice. | ▶ 00:14 |
It is stochastic because you flip a coin. | ▶ 00:20 |
It is not continuous because there's only 1 action--a flip--and 2 outcomes. | ▶ 00:25 |
And it isn't really adversarial because while you do your learning task | ▶ 00:31 |
no adversary interferes. | ▶ 00:36 |
Let's talk about the problem of finding a path through a maze. | ▶ 00:00 |
Let me draw you a maze. | ▶ 00:05 |
Suppose you wish to find the path from the start to your goal. | ▶ 00:10 |
I don't want you to solve this problem. | ▶ 00:15 |
Rather I want you to tell me whether it's partially observable. | ▶ 00:19 |
Yes or no. | ▶ 00:23 |
Is it stochastic? | ▶ 00:25 |
Yes or no. | ▶ 00:27 |
Is it continuous? | ▶ 00:29 |
Yes or no. | ▶ 00:31 |
[Thrun] The path through the maze is clearly not partially observable | ▶ 00:00 |
because you can see the maze entirely at all times. | ▶ 00:03 |
It is not stochastic. There is no randomness involved. | ▶ 00:06 |
It isn't really continuous. | ▶ 00:10 |
There's typically just finitely many choices--go left or right. | ▶ 00:12 |
And it isn't adversarial because there's no real adversary involved. | ▶ 00:15 |
This is a search question. | ▶ 00:00 |
Suppose we are given the following search tree. | ▶ 00:02 |
We are searching from the top, the start node, | ▶ 00:05 |
to the goal, which is over here. | ▶ 00:08 |
Assume we expand from left to right. | ▶ 00:12 |
Tell me how many nodes are expanded | ▶ 00:17 |
if we expand from left to right, | ▶ 00:20 |
counting the start node and the goal node in your answer. | ▶ 00:23 |
And give me the same answer for Depth First Search. | ▶ 00:27 |
Now, let's assume you're going to search from right to left. | ▶ 00:32 |
How many nodes would we now expand in Breadth First Search, | ▶ 00:35 |
and how many do we expand in Depth First Search? | ▶ 00:39 |
[Thrun] Breadth first from left to right is 6-- | ▶ 00:00 |
1, 2, 3, 4, 5, 6. | ▶ 00:03 |
Depth first from left to right is 4--1, 2, 3, 4. | ▶ 00:07 |
Breadth first searched from right to left is 9-- | ▶ 00:15 |
1, 2, 3, 4, 5, 6, 7, 8, 9. | ▶ 00:19 |
And depth first from right to left is 9-- | ▶ 00:25 |
1, 2, 3, 4, 5, 6, 7, 8, 9. | ▶ 00:28 |
Another search problem-- | ▶ 00:00 |
Consider the following search tree, | ▶ 00:03 |
where this is the start node. | ▶ 00:08 |
Now, assume we search from left to right. | ▶ 00:12 |
I would like you to tell me the number of nodes expanded from Breadth-First Search | ▶ 00:15 |
and Depth-First Search. | ▶ 00:19 |
Please do count the start and the goal node, | ▶ 00:22 |
and please give me the same numbers for Right-to-Left Search, | ▶ 00:25 |
for Breadth-First, and Depth-First. | ▶ 00:28 |
[Thrun] The correct answer for breadth first left to right is 13-- | ▶ 00:00 |
1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13. | ▶ 00:05 |
And for depth first it is 10-- | ▶ 00:13 |
1, 2, 3, 4, 5, 6, 7, 8, 9, and 10. | ▶ 00:17 |
For right to left search, the right answer for breadth first is 11-- | ▶ 00:28 |
1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11. | ▶ 00:32 |
And for depth first the right answer is 7-- | ▶ 00:38 |
1, 2, 3, 4, 5, 6, 7. | ▶ 00:42 |
This is another search problem. | ▶ 00:00 |
Let's assume we have a search graph. | ▶ 00:04 |
It isn't quite a tree but looks like this. | ▶ 00:07 |
Obviously in the structure we can reach nodes through multiple paths. | ▶ 00:13 |
So let's assume that our search never expands the same node twice. | ▶ 00:18 |
Let's also assume this start node is on top. We search down. | ▶ 00:22 |
And this over here is our goal node. | ▶ 00:27 |
So left-to-right search, tell me how many nodes | ▶ 00:30 |
breadth first would expand--do count the start and goal node in the final answer. | ▶ 00:35 |
Give me the same result for a depth-first search. | ▶ 00:43 |
Again counting the start and the goal node in your answer. | ▶ 00:48 |
And again give me your answer for breadth-first | ▶ 00:51 |
and for depth-first in the right-to-left search paradigm. | ▶ 00:54 |
[Thrun] The right answer over here is 10 for breadth first from left to right-- | ▶ 00:00 |
1, 2, 3, 4, 5, 6, 7, 8, 9, 10. | ▶ 00:05 |
Depth first is 16, or all nodes-- | ▶ 00:11 |
1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16. | ▶ 00:15 |
And notice how I never expanded a node twice. | ▶ 00:30 |
Correct answer for breadth first right to left is 7-- | ▶ 00:34 |
1, 2, 3, 4, 5, 6, 7. | ▶ 00:38 |
And the correct answer for depth first from right to left is 4--1, 2, 3, and 4. | ▶ 00:43 |
Let's talk about A* search. | ▶ 00:00 |
Let's assume we have the following grid. | ▶ 00:03 |
The start state is right here. | ▶ 00:08 |
And the goal state is right here. | ▶ 00:13 |
And just for convenience, I will give each cell here a little label. | ▶ 00:16 |
A. B. C. D. | ▶ 00:22 |
Let me draw a heuristic function. | ▶ 00:26 |
Please take a look for a moment | ▶ 00:30 |
and tell me whether this heuristic function is admissible. | ▶ 00:32 |
Check here if yes and here if no. | ▶ 00:38 |
Which one is the first node A* would expand? | ▶ 00:41 |
B1 or A2? | ▶ 00:46 |
What's the second node to expand? | ▶ 00:51 |
B1, C1, A2, A3, or B2? | ▶ 00:56 |
And finally, what is the third node to expand? | ▶ 01:06 |
D1, C2, B3, or A4? | ▶ 01:10 |
[Thrun] Clearly this is an admissible heuristic because the distance to the goal | ▶ 00:00 |
is strictly underestimated. | ▶ 00:05 |
From here it would take 1 step, | ▶ 00:07 |
from here it will take 1, 2 steps, so the answer is yes. | ▶ 00:09 |
Now, to understand A*, let me also draw the g function | ▶ 00:15 |
for the relevant part of this table. | ▶ 00:22 |
Clearly g is 0 over here. | ▶ 00:24 |
To understand which node to expand, this one or this one, | ▶ 00:27 |
let's project the g function, which is 1, | ▶ 00:31 |
and we will see that 3 plus 1 is smaller than 4 plus 1; | ▶ 00:34 |
therefore, this is the second node to expand, which is B1. | ▶ 00:40 |
Now let me, for the next step, extend the g function from this guy here, 2 and 2. | ▶ 00:47 |
So 2 plus 2 is 4 versus 3 plus 2 is 5, so we expand this node next, which is C1. | ▶ 00:55 |
And finally, the g function from here would go 3 and 3. | ▶ 01:08 |
3 plus 1 is better than 3 plus 2, so we would expand D1 next. | ▶ 01:14 |
And notice how in the sum of g and h, | ▶ 01:24 |
this node over here, which has a total of 4, is better than any other node that is unexpanded. | ▶ 01:29 |
So in particular, 4 plus 1 is 5, and 3 plus 2 is 5 as well, | ▶ 01:35 |
and 2 plus 3 is 5 as well, so this is the next one to expand. | ▶ 01:40 |
So the next units will be concerned with probabilities | ▶ 00:00 |
and particularly with structured probabilities using Bayes networks. | ▶ 00:03 |
This is some of the most involved material in this class. | ▶ 00:08 |
And since this is a Stanford level class, | ▶ 00:12 |
you will find out that some of the quizzes are actually really hard. | ▶ 00:14 |
So as you go through the material, I hope the hardness of the quizzes won't discourage you; | ▶ 00:18 |
it'll really entice you to take a piece of paper and a pen and work them out. | ▶ 00:23 |
Let me give you a flavor of a Bayes network using an example. | ▶ 00:30 |
Suppose you find in the morning that your car won't start. | ▶ 00:35 |
Well, there's many causes why your car might not start. | ▶ 00:39 |
One is that your battery is flat. | ▶ 00:43 |
Even for a flat battery there are multiple causes. | ▶ 00:46 |
One, it's just plain dead, | ▶ 00:50 |
and one is that the battery is okay but it's not charging. | ▶ 00:52 |
The reason why a battery might not charge is that the alternator might be broken | ▶ 00:55 |
or the fan belt might be broken. | ▶ 01:01 |
If you look at this influence diagram, also called a Bayes network, | ▶ 01:03 |
you'll find there's many different ways to explain that the car won't start. | ▶ 01:07 |
And a natural question you might have is, "Can we diagnose the problem?" | ▶ 01:12 |
One diagnostic tool is a battery meter, | ▶ 01:17 |
which may increase or decrease your belief that the battery is the cause of your car failure. | ▶ 01:20 |
You might also know your battery age. | ▶ 01:26 |
Older batteries tend to go dead more often. | ▶ 01:29 |
And there's many other ways to look at reasons why the car might not start. | ▶ 01:31 |
You might inspect the lights, the oil light, the gas gauge. | ▶ 01:37 |
You might even dip into the engine to see what the oil level is with a dipstick. | ▶ 01:43 |
All of those relate to alternative reasons why the car might not be starting, | ▶ 01:48 |
like no oil, no gas, the fuel line might be blocked, or the starter may be broken. | ▶ 01:52 |
And all of these can influence your measurements, | ▶ 01:59 |
like the oil light or the gas gauge, in different ways. | ▶ 02:04 |
For example, the battery flat would have an effect on the lights. | ▶ 02:07 |
It might have an effect on the oil light and on the gas gauge, | ▶ 02:12 |
but it won't really affect the oil you measure with the dipstick. | ▶ 02:16 |
That is affected by the actual oil level, which also affects the oil light. | ▶ 02:20 |
Gas will affect the gas gauge, and of course without gas the car doesn't start. | ▶ 02:26 |
So this is a complicated structure that really describes one way to understand | ▶ 02:32 |
how a car doesn't start. | ▶ 02:39 |
A car is a complex system. | ▶ 02:41 |
It has lots of variables you can't really measure immediately, | ▶ 02:43 |
and it has sensors which allow you to understand a little bit about the state of the car. | ▶ 02:46 |
What the Bayes network does, | ▶ 02:52 |
it really assists you in reasoning from observable variables, like the car won't start | ▶ 02:54 |
and the value of the dipstick, to hidden causes, like is the fan belt broken | ▶ 03:01 |
or is the battery dead. | ▶ 03:06 |
What you have here is a Bayes network. | ▶ 03:09 |
A Bayes network is composed of nodes. | ▶ 03:13 |
These nodes correspond to events that you might or might not know | ▶ 03:15 |
that are typically called random variables. | ▶ 03:21 |
These nodes are linked by arcs, and the arcs suggest that a child of an arc | ▶ 03:24 |
is influenced by its parent but not in a deterministic way. | ▶ 03:31 |
It might be influenced in a probabilistic way, which means an older battery, for example, | ▶ 03:35 |
has a higher chance of causing the battery to be dead, | ▶ 03:41 |
but it's not clear that every old battery is dead. | ▶ 03:45 |
There is a total of 16 variables in this Bayes network. | ▶ 03:48 |
What the graph structure and associated probabilities specify | ▶ 03:53 |
is a huge probability distribution in the space of all of these 16 variables. | ▶ 03:59 |
If they are all binary, which we'll assume throughout this unit, | ▶ 04:06 |
they can take 2 to the 16th different values, which is a lot. | ▶ 04:10 |
The Bayes network, as we find out, is a complex representation | ▶ 04:15 |
of a distribution over this very, very large joint probability distribution of all of these variables. | ▶ 04:18 |
Further, once we specify the Bayes network, | ▶ 04:26 |
we can observe, for example, the car won't start. | ▶ 04:29 |
We can observe things like the oil light and the lights and the battery meter | ▶ 04:33 |
and then compute probabilities of the hypothesis, like the alternator is broken | ▶ 04:37 |
or the fan belt is broken or the battery is dead. | ▶ 04:41 |
So in this class we're going to talk about how to construct this Bayes network, | ▶ 04:45 |
what the semantics are, and how to reason in this Bayes network | ▶ 04:50 |
to find out about variables we can't observe, like whether the fan belt is broken or not. | ▶ 04:56 |
That's an overview. | ▶ 05:02 |
Throughout this unit I am going to assume that every event is discrete-- | ▶ 05:04 |
in fact, it's binary. | ▶ 05:08 |
We'll start with some consideration of basic probability, | ▶ 05:10 |
we'll work our way into some simple Bayes networks, | ▶ 05:14 |
we'll talk about concepts like conditional independence | ▶ 05:19 |
and then define Bayes networks more generally, | ▶ 05:23 |
move into concepts like D-separation and start doing parameter counts. | ▶ 05:26 |
Later on, Peter will tell you about inference in Bayes networks. | ▶ 05:32 |
So we won't do this in this unit. | ▶ 05:36 |
I can't overemphasize how important this class is. | ▶ 05:38 |
Bayes networks are used extensively in almost all fields of smart computer systems, | ▶ 05:43 |
in diagnostics, for prediction, for machine learning, and fields like finance, | ▶ 05:49 |
inside Google, in robotics. | ▶ 05:57 |
Bayes networks are also the building blocks of more advanced AI techniques | ▶ 06:00 |
such as particle filters, hidden Markov models, MDPs and POMDPs, | ▶ 06:05 |
Kalman filters, and many others. | ▶ 06:12 |
These are words that don't sound familiar quite yet, | ▶ 06:14 |
but as you go through the class, I can promise you you will get to know what they mean. | ▶ 06:18 |
So let's start now at the very, very basics. | ▶ 06:22 |
[Thrun] So let's talk about probabilities. | ▶ 00:00 |
Probabilities are the cornerstone of artificial intelligence. | ▶ 00:02 |
They are used to express uncertainty, | ▶ 00:05 |
and the management of uncertainty is really key to many, many things in AI | ▶ 00:08 |
such as machine learning and Bayes network inference | ▶ 00:12 |
and filtering and robotics and computer vision and so on. | ▶ 00:16 |
So I'm going to start with some very basic questions, | ▶ 00:21 |
and we're going to work our way up from there. | ▶ 00:24 |
Here is a coin. | ▶ 00:26 |
The coin can come up heads or tails, and my question is the following: | ▶ 00:28 |
Suppose the probability for heads is 0.5. | ▶ 00:32 |
What's the probability for it coming up tails? | ▶ 00:38 |
[Thrun] So the right answer is a half, or 0.5, | ▶ 00:00 |
and the reason is the coin can only come up heads or tails. | ▶ 00:03 |
We know that it has to be either one. | ▶ 00:07 |
Therefore, the total probability of both coming up is 1. | ▶ 00:10 |
So if half of the probability is assigned to heads, then the other half is assigned to tail. | ▶ 00:14 |
[Thrun] Let me ask my next quiz. | ▶ 00:00 |
Suppose the probability of heads is a quarter, 0.25. | ▶ 00:02 |
What's the probability of tail? | ▶ 00:06 |
[Thrun] And the answer is 3/4. | ▶ 00:00 |
It's a loaded coin, and the reason is, well, | ▶ 00:02 |
each of them come up with a certain probability. | ▶ 00:05 |
The total of those is 1. The quarter is claimed by heads. | ▶ 00:08 |
Therefore, 3/4 remain for tail, which is the answer over here. | ▶ 00:12 |
[Thrun] Here's another quiz. | ▶ 00:00 |
What's the probability that the coin comes up heads, heads, heads, three times in a row, | ▶ 00:02 |
assuming that each one of those has a probability of a half | ▶ 00:08 |
and that these coin flips are independent? | ▶ 00:12 |
[Thrun] And the answer is 0.125. | ▶ 00:00 |
Each head has a probability of a half. | ▶ 00:04 |
We can multiply those probabilities because they are independent events, | ▶ 00:06 |
and that gives us 1 over 8 or 0.125. | ▶ 00:10 |
[Thrun] Now let's flip the coin 4 times, and let's call Xi the result of the i-th coin flip. | ▶ 00:00 |
So each Xi is going to be drawn from heads or tail. | ▶ 00:11 |
What's the probability that all 4 of those flips give us the same result, | ▶ 00:16 |
no matter what it is, assuming that each one of those is identically | ▶ 00:22 |
and independently distributed, with a probability of coming up heads of one half? | ▶ 00:26 |
[Thrun] And the answer is, well, there's 2 ways that we can achieve this. | ▶ 00:00 |
One is the all heads and one is all tails. | ▶ 00:04 |
You already know that 4 times heads is 1/16, | ▶ 00:06 |
and we know that 4 times tail is also 1/16. | ▶ 00:10 |
These are mutually exclusive events. | ▶ 00:13 |
The probability of either one occurring is 1/16 plus 1/16, which is 1/8, which is 0.125. | ▶ 00:15 |
[Thrun] So here's another one. | ▶ 00:00 |
What's the probability that within the set of X1, X2, X3, and X4 | ▶ 00:02 |
there are at least three heads? | ▶ 00:07 |
[Thrun] And the solution is let's look at different sequences | ▶ 00:00 |
in which head occurs at least 3 times. | ▶ 00:03 |
It could be head, head, head, head, in which it comes 4 times. | ▶ 00:06 |
It could be head, head, head, tail and so on, all the way to tail, head, head, head. | ▶ 00:10 |
There's 1, 2, 3, 4, 5 of those outcomes. | ▶ 00:16 |
Each of them has a probability of 1/16, so it's 5 times 1/16, which is 0.3125. | ▶ 00:19 |
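As a formula, the count of 5 favorable sequences out of 16 equally likely ones is:

$$P(\text{at least 3 heads}) = \binom{4}{3}\Big(\frac{1}{2}\Big)^4 + \binom{4}{4}\Big(\frac{1}{2}\Big)^4 = \frac{5}{16} = 0.3125.$$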
[Thrun] So we just learned a number of things. | ▶ 00:00 |
One is about complementary probability. | ▶ 00:02 |
If an event has a certain probability, p, | ▶ 00:05 |
the complementary event has the probability 1-p. | ▶ 00:08 |
We also learned about independence. | ▶ 00:13 |
If 2 random variables, X and Y, are independent, | ▶ 00:15 |
which we're going to write like this, | ▶ 00:19 |
that means the joint probability of any values the 2 variables can assume | ▶ 00:21 |
is the product of the marginals. | ▶ 00:26 |
So rather than asking the question, "What is the probability | ▶ 00:30 |
"for any combination that these 2 coins or maybe 5 coins could have taken?" | ▶ 00:34 |
we can now look at the probability of each coin individually, | ▶ 00:40 |
look at its probability and just multiply them up. | ▶ 00:42 |
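Written out, the independence statement is:

$$X \perp Y \quad\Longleftrightarrow\quad P(X = x,\, Y = y) = P(X = x)\,P(Y = y) \ \text{for all values } x, y.$$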
[Thrun] So let me ask you about dependence. | ▶ 00:00 |
Suppose we flip 2 coins. | ▶ 00:03 |
Our first coin is a fair coin, and we're going to denote the outcome by X1. | ▶ 00:05 |
So the chance of X1 coming up heads is half. | ▶ 00:12 |
But now we branch into picking a coin based on the first outcome. | ▶ 00:15 |
So if the first outcome was heads, | ▶ 00:20 |
you pick a coin whose probability of coming up heads is going to be 0.9. | ▶ 00:23 |
The way I word this is by conditional probability, | ▶ 00:28 |
probability of the second coin flip coming up heads | ▶ 00:32 |
provided that or given that X1, the first coin flip, was heads, is 0.9. | ▶ 00:35 |
The first coin flip might also come up tails, | ▶ 00:41 |
in which case I pick a very different coin. | ▶ 00:44 |
In this case I pick a coin which with 0.8 probability will once again give me tails, | ▶ 00:47 |
conditioned on the first coin flip coming up tails. | ▶ 00:54 |
So my question for you is, | ▶ 00:57 |
what's the probability of the second coin flip coming up heads? | ▶ 00:59 |
[Thrun] The answer is 0.55. | ▶ 00:00 |
The way to compute this is by the theorem of total probability. | ▶ 00:04 |
Probability of X2 equals heads. | ▶ 00:08 |
There's 2 ways I can get to this outcome. | ▶ 00:12 |
One is via this path over here, and one is via this path over here. | ▶ 00:15 |
Let me just write both of them down. | ▶ 00:18 |
So first of all, there's the probability of X2 equals heads | ▶ 00:20 |
given that X1 was heads, times the probability that X1 was heads. | ▶ 00:26 |
Now I have to add the complementary event. | ▶ 00:30 |
Suppose X1 came up tails. | ▶ 00:32 |
Then I can ask the question, what is the probability that X2 comes up heads regardless, | ▶ 00:35 |
even though X1 was tails? | ▶ 00:40 |
Plugging in the numbers gives us the following. | ▶ 00:42 |
This one over here is 0.9 times a half. | ▶ 00:44 |
The probability of tails following tails is 0.8, | ▶ 00:49 |
so my heads probability becomes 1 minus 0.8, which is 0.2, times a half. | ▶ 00:51 |
Adding all of this together gives me 0.45 plus 0.1, | ▶ 00:58 |
which is exactly 0.55. | ▶ 01:03 |
So, we actually just learned some interesting lessons. | ▶ 00:00 |
The probability of any random variable Y can be written as | ▶ 00:02 |
probability of Y given that some other random variable X assumes value i | ▶ 00:08 |
times probability of X equals i, | ▶ 00:13 |
summed over all possible outcomes i for the other variable, X. | ▶ 00:17 |
This is called total probability. | ▶ 00:22 |
The second thing we learned has to do with negation of probabilities. | ▶ 00:24 |
We found that probability of not X given Y is 1 minus probability of X given Y. | ▶ 00:27 |
Now, you might be tempted to say "What about the probability of X given not Y?" | ▶ 00:37 |
"Is this the same as 1 minus probability of X given Y?" | ▶ 00:43 |
And the answer is absolutely no. | ▶ 00:51 |
That's not the case. | ▶ 00:54 |
If you condition on something that has a certain probability value, | ▶ 00:56 |
you can take the event you're looking at and negate this, | ▶ 01:00 |
but you can never negate your conditional variable | ▶ 01:03 |
and assume these values add up to 1. | ▶ 01:05 |
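The two lessons in formula form:

$$P(Y) = \sum_i P(Y \mid X = i)\,P(X = i) \qquad \text{(total probability)}$$

$$P(\neg X \mid Y) = 1 - P(X \mid Y), \qquad \text{but in general } P(X \mid \neg Y) \ne 1 - P(X \mid Y).$$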
We assume there is sometimes sunny days and sometimes rainy days, | ▶ 00:00 |
and on day 1, which we're going to call D1, | ▶ 00:06 |
the probability of sunny is 0.9. | ▶ 00:09 |
And then let's assume that a sunny day follows a sunny day with 0.8 chance, | ▶ 00:13 |
and a rainy day follows a sunny day with--well-- | ▶ 00:20 |
Well, the correct answer is 0.2, which is a negation of this event over here. | ▶ 00:00 |
A sunny day follows a rainy day with 0.6 chance, | ▶ 00:00 |
and a rainy day follows a rainy day-- | ▶ 00:06 |
please give me your number. | ▶ 00:11 |
0.4 | ▶ 00:00 |
So, what are the chances that D2 is sunny? | ▶ 00:00 |
Suppose the same dynamics apply from D2 to D3, | ▶ 00:03 |
so just replace the D2s over here with D3s over there. | ▶ 00:06 |
That means the transition probabilities from one day to the next remain the same. | ▶ 00:10 |
Tell me, what's the probability that D3 is sunny? | ▶ 00:14 |
So, the correct answer over here is 0.78, | ▶ 00:00 |
and over here it's 0.756. | ▶ 00:04 |
To get there, let's complete this one first. | ▶ 00:10 |
The probability of D2 = sunny. | ▶ 00:13 |
Well, we know there's a 0.9 chance it's sunny on D1, | ▶ 00:16 |
and then if it is sunny, we know it stays sunny with a 0.8 chance. | ▶ 00:21 |
So, we multiply these 2 things together, and we get 0.72. | ▶ 00:25 |
We know there's a 0.1 chance of it being rainy on day 1, which is the complement, | ▶ 00:29 |
but if it's rainy, we know it switches to sunny with 0.6 chance, | ▶ 00:33 |
so you multiply these 2 things, and you get 0.06. | ▶ 00:37 |
Adding those two up equals 0.78. | ▶ 00:41 |
Now, for the next day, we know our prior for sunny is 0.78. | ▶ 00:46 |
If it is sunny, it stays sunny with 0.8 probability. | ▶ 00:51 |
Multiplying these 2 things gives us 0.624. | ▶ 00:55 |
We know it's rainy with 0.22 chance, which is the complement of 0.78, | ▶ 01:01 |
and it switches to sunny with a 0.6 chance if it was rainy. | ▶ 01:07 |
If you multiply those, you get 0.132. | ▶ 01:10 |
Adding those 2 things up gives us 0.756. | ▶ 01:14 |
So, to some extent, it's tedious to compute these values, | ▶ 01:19 |
but they can be perfectly computed, as shown here. | ▶ 01:23 |
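The same iteration is a two-line loop in Python; the 0.8 stay-sunny and 0.6 rain-to-sun probabilities are the ones given above.

```python
p_sunny = 0.9                                  # P(sunny) on day 1
for day in (2, 3):
    p_sunny = 0.8 * p_sunny + 0.6 * (1 - p_sunny)
    print(day, round(p_sunny, 3))              # day 2: 0.78, day 3: 0.756
```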
Next example is a cancer example. | ▶ 00:00 |
Suppose there's a specific type of cancer which exists for 1% of the population. | ▶ 00:05 |
I'm going to write this as follows. | ▶ 00:11 |
You can probably tell me now what the probability of not having this cancer is. | ▶ 00:13 |
And yes, the answer is 0.99. | ▶ 00:00 |
Let's assume there's a test for this cancer, | ▶ 00:04 |
which gives us probabilistically an answer whether we have this cancer or not. | ▶ 00:07 |
So, let's say the probability of a test being positive, as indicated by this + sign, | ▶ 00:12 |
given that we have cancer, is 0.9. | ▶ 00:18 |
The probability of the test coming out negative if we have the cancer is--you name it. | ▶ 00:22 |
0.1, which is the difference between 1 and 0.9. | ▶ 00:00 |
Let's assume the probability of the test coming out positive | ▶ 00:06 |
given that we don't have this cancer is 0.2. | ▶ 00:11 |
In other words, the probability of the test correctly saying | ▶ 00:15 |
we don't have the cancer if we're cancer free is 0.8. | ▶ 00:19 |
Now, ultimately, I'd like to know what's the probability | ▶ 00:24 |
of having this cancer given that I just received a single positive test. | ▶ 00:28 |
Before I do this, please help me fill out some other probabilities | ▶ 00:35 |
that are actually important. | ▶ 00:39 |
Specifically, the joint probabilities. | ▶ 00:41 |
The probability of a positive test and having cancer. | ▶ 00:45 |
The probability of a negative test and having cancer, | ▶ 00:51 |
and this is not conditional anymore. | ▶ 00:53 |
It's now a joint probability. | ▶ 00:55 |
So, please give me those 4 values over here. | ▶ 00:57 |
And here the correct answer is 0.009, | ▶ 00:00 |
which is the product of your prior, 0.01, times the conditional, 0.9. | ▶ 00:05 |
Over here we get 0.001, the probability of our prior cancer times 0.1. | ▶ 00:12 |
Over here we get 0.198, | ▶ 00:21 |
the probability of not having cancer is 0.99 | ▶ 00:26 |
times still getting a positive reading, which is 0.2. | ▶ 00:29 |
And finally, we get 0.792, | ▶ 00:32 |
which is the probability of this guy over here, and this guy over here. | ▶ 00:37 |
Now, our next quiz, I want you to fill in the probability of | ▶ 00:00 |
the cancer given that we just received a positive test. | ▶ 00:04 |
And the correct answer is 0.043. | ▶ 00:00 |
So, even though I received a positive test, | ▶ 00:06 |
my probability of having cancer is just 4.3%, | ▶ 00:09 |
which is not very much given that the test itself is quite sensitive. | ▶ 00:14 |
It really gives me a 0.8 chance of getting a negative result if I don't have cancer. | ▶ 00:18 |
It gives me a 0.9 chance of detecting cancer given that I have cancer. | ▶ 00:26 |
Now, why does this come out so small? | ▶ 00:32 |
Well, let's just put all the cases together. | ▶ 00:35 |
You already know that we received a positive test. | ▶ 00:38 |
Therefore, this entry over here, and this entry over here are relevant. | ▶ 00:41 |
Now, the chance of having a positive test and having cancer is 0.009. | ▶ 00:47 |
Well, I might--when I receive a positive test--have cancer or not cancer, | ▶ 00:56 |
so we will just normalize by these 2 possible causes for the positive test, | ▶ 01:01 |
which is 0.009 + 0.198. | ▶ 01:06 |
Putting these 2 things together gives 0.009 over 0.207, | ▶ 01:11 |
which is approximately 0.043. | ▶ 01:20 |
Now, the interesting thing in this equation is that the chances | ▶ 01:23 |
of having seen a positive test result in the absence of cancer | ▶ 01:28 |
are still much, much higher than the chance of seeing a positive result | ▶ 01:32 |
in the presence of cancer, and that's because our prior for cancer | ▶ 01:35 |
is so small in the population that it's just very unlikely to have cancer. | ▶ 01:39 |
So, the additional information of a positive test | ▶ 01:44 |
only raised my posterior probability to 0.043. | ▶ 01:47 |
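The normalization just performed is easy to reproduce in a few lines of Python, using the numbers from the tables above:

```python
p_cancer = 0.01             # prior
p_pos_given_cancer = 0.9    # P(+ | cancer)
p_pos_given_free = 0.2      # P(+ | no cancer)

joint_pos_and_cancer = p_pos_given_cancer * p_cancer        # 0.009
joint_pos_and_free = p_pos_given_free * (1 - p_cancer)      # 0.198
posterior = joint_pos_and_cancer / (joint_pos_and_cancer + joint_pos_and_free)
print(round(posterior, 3))                                  # 0.043
```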
So, we've just learned about what's probably the most important | ▶ 00:00 |
piece of math for this class in statistics called Bayes Rule. | ▶ 00:03 |
It was invented by Reverend Thomas Bayes, who was a British mathematician | ▶ 00:09 |
and a Presbyterian minister in the 18th century. | ▶ 00:15 |
Bayes Rule is usually stated as follows: P of A given B where B is the evidence | ▶ 00:18 |
and A is the variable we care about is P of B given A times P of A over P of B. | ▶ 00:27 |
This expression, P of B given A, is called the likelihood. | ▶ 00:36 |
P of A is called the prior, and P of B is called the marginal likelihood. | ▶ 00:40 |
The expression over here, P of A given B, is called the posterior. | ▶ 00:46 |
The interesting thing here is the way the probabilities are reworded. | ▶ 00:50 |
Say we have evidence B. | ▶ 00:55 |
We know about B, but we really care about the variable A. | ▶ 00:57 |
So, for example, B is a test result. | ▶ 01:01 |
We don't care about the test result as much as we care about the fact | ▶ 01:03 |
whether we have cancer or not. | ▶ 01:06 |
This diagnostic reasoning--which is from evidence to its causes-- | ▶ 01:08 |
is turned upside down by Bayes Rule into a causal reasoning, | ▶ 01:16 |
which is given--hypothetically, if we knew the cause, | ▶ 01:22 |
what would be the probability of the evidence we just observed. | ▶ 01:27 |
But to correct for this inversion, we have to multiply | ▶ 01:31 |
by the prior of the cause to be the case in the first place, | ▶ 01:36 |
in this case, having cancer or not, | ▶ 01:40 |
and divide it by the probability of the evidence, P(B), | ▶ 01:42 |
which often is expanded using the theorem of total probability as follows. | ▶ 01:47 |
The probability of B is the sum, over all values lowercase a, of the probability of B | ▶ 01:52 |
conditional on A equals a, times the probability of A equals a. | ▶ 01:58 |
This is total probability as we already encountered it. | ▶ 02:04 |
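For reference, here is the rule just stated, written out as formulas; this is purely a restatement of the spoken math above:

```latex
P(A \mid B) \;=\; \frac{P(B \mid A)\, P(A)}{P(B)},
\qquad\text{where}\qquad
P(B) \;=\; \sum_{a} P(B \mid A = a)\, P(A = a).
```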
So, let's apply this to the cancer case | ▶ 02:08 |
and say we really care about whether you have cancer, | ▶ 02:10 |
which is our cause, conditioned on the evidence | ▶ 02:13 |
that is the result of this hidden cause, in this case, a positive test result. | ▶ 02:17 |
Let's just plug in the numbers. | ▶ 02:23 |
Our likelihood is the probability of seeing a positive test result | ▶ 02:25 |
given that you have cancer multiplied by the prior probability | ▶ 02:30 |
of having cancer over the probability of the positive test result, | ▶ 02:33 |
and that is--according to the tables we looked at before-- | ▶ 02:38 |
0.9 times a prior of 0.01 over-- | ▶ 02:43 |
now we're going to expand this right over here according to total probability | ▶ 02:50 |
which gives us 0.9 times 0.01-- | ▶ 02:55 |
that's the probability of + given that we do have cancer-- | ▶ 03:01 |
plus the probability of + given that we don't have cancer, which is 0.2, | ▶ 03:06 |
times the prior of not having cancer, which is 0.99. | ▶ 03:11 |
So, if we plug in the numbers we know about, we get 0.009 | ▶ 03:15 |
over 0.009 + 0.198. | ▶ 03:20 |
That is approximately 0.0434, which is the number we saw before. | ▶ 03:27 |
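As a quick check, this plug-in calculation can be reproduced in a few lines of Python, using only the numbers from the table above:

```python
# Bayes Rule for the cancer test, with the denominator expanded
# by total probability. All numbers come from the lecture's table.
p_c = 0.01           # prior P(cancer)
p_pos_c = 0.9        # P(+ | cancer)
p_pos_not_c = 0.2    # P(+ | not cancer), i.e. 1 - 0.8

p_pos = p_pos_c * p_c + p_pos_not_c * (1 - p_c)  # total probability: 0.207
print(p_pos_c * p_c / p_pos)                     # ~0.0434, as above
```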
So, if you want to draw Bayes rule graphically, | ▶ 00:00 |
we have a situation where we have an internal variable A, | ▶ 00:03 |
like whether I'm going to die of cancer, but we can't sense A. | ▶ 00:08 |
Instead, we have a second variable, called B, | ▶ 00:13 |
which is our test, and B is observable, but A isn't. | ▶ 00:16 |
This is a classical example of a Bayes network. | ▶ 00:21 |
The Bayes network is composed of 2 variables, A and B. | ▶ 00:26 |
We know the prior probability for A, | ▶ 00:30 |
and we know the conditional. | ▶ 00:33 |
A causes B--whether or not we have cancer, | ▶ 00:35 |
causes the test result to be positive or not, | ▶ 00:38 |
although there was some randomness involved. | ▶ 00:41 |
So, we know the probability of B given the different values for A, | ▶ 00:44 |
and what we care about in this specific instance is called diagnostic reasoning, | ▶ 00:49 |
which is the inverse of the causal reasoning, | ▶ 00:54 |
the probability of A given B or similarly, probability of A given not B. | ▶ 00:58 |
This is our very first Bayes network, and the graphical representation | ▶ 01:06 |
of drawing 2 variables, A and B, connected with an arc | ▶ 01:11 |
that goes from A to B is the graphical representation of a distribution | ▶ 01:15 |
of 2 variables that are specified in the structure over here, | ▶ 01:22 |
which has a prior probability and has a conditional probability as shown over here. | ▶ 01:26 |
Now, I do have a quick quiz for you. | ▶ 01:31 |
How many parameters does it take to specify | ▶ 01:34 |
the entire joint probability of A and B, or differently, the entire Bayes network? | ▶ 01:37 |
I'm not looking for structural parameters that relate to the graph over here. | ▶ 01:43 |
I'm just looking for the numerical parameters of the underlying probabilities. | ▶ 01:48 |
And the answer is 3. | ▶ 00:00 |
It takes 1 parameter to specify P of A from which we can derive P of not A. | ▶ 00:02 |
It takes 2 parameters to specify P of B given A and P of B given not A, | ▶ 00:09 |
from which we can derive P of not B given A and P of not B given not A. | ▶ 00:15 |
So, it's a total of 3 parameters for this Bayes network. | ▶ 00:21 |
So, we just encountered our very first Bayes network | ▶ 00:00 |
and did a number of interesting calculations. | ▶ 00:03 |
Let's now talk about Bayes Rule and look into more complex Bayes networks. | ▶ 00:06 |
I will look at Bayes Rule again and make an observation | ▶ 00:10 |
that is really non-trivial. | ▶ 00:13 |
Here is Bayes Rule, and in practice, what we find is | ▶ 00:15 |
this term here is relatively easy to compute. | ▶ 00:20 |
It's just a product, whereas this term is really hard to compute. | ▶ 00:23 |
However, this term over here does not depend on what we assume for variable A. | ▶ 00:28 |
It's just the function of B. | ▶ 00:33 |
So, suppose for a moment we also care about the complementary event of not A | ▶ 00:35 |
given B, for which Bayes Rule unfolds as follows. | ▶ 00:40 |
Then we find that the normalizer, P(B), is identical, | ▶ 00:43 |
whether we assume A on the left side or not A on the left side. | ▶ 00:47 |
We also know from prior work that P of A given B plus | ▶ 00:51 |
P of not A given B must be one because these are 2 complementary events. | ▶ 00:57 |
That allows us to compute Bayes Rule very differently | ▶ 01:03 |
by basically ignoring the normalizer, so here's how it goes. | ▶ 01:06 |
We compute P of A given B--and I want to call this P prime, | ▶ 01:11 |
because it's not a real probability--to be just P of B given A times P of A, | ▶ 01:16 |
omitting the normalizer, which is the denominator of the expression over here. | ▶ 01:23 |
We do the same thing with not A. | ▶ 01:28 |
So, in both cases, we compute the posterior probability non-normalized | ▶ 01:31 |
by omitting the normalizer B. | ▶ 01:36 |
And then we can recover the original probabilities by normalizing | ▶ 01:38 |
based on those values over here, so the probability of A given B, | ▶ 01:43 |
the actual probability, is a normalizer, eta, | ▶ 01:48 |
times this non-normalized form over here. | ▶ 01:52 |
The same is true for the negation of A over here. | ▶ 01:55 |
And eta is just the normalizer that results by adding these 2 values over here together, | ▶ 01:59 |
as shown over here, and taking 1 over that sum. | ▶ 02:06 |
So, take a look at this for a moment. | ▶ 02:10 |
What we've done is we deferred the calculation of the normalizer over here | ▶ 02:13 |
by computing pseudo probabilities that are non-normalized. | ▶ 02:18 |
This made the calculation much easier, and when we were done with everything, | ▶ 02:22 |
we just folded it back into the normalizer based on the resulting | ▶ 02:26 |
pseudo probabilities and got the correct answer. | ▶ 02:29 |
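Here is a minimal sketch of that deferred-normalization trick in Python; `normalize` is a helper name introduced just for this illustration:

```python
# Deferred normalization: compute pseudo probabilities
# P'(A|B) = P(B|A) P(A) for each hypothesis, then recover the real
# posteriors with the normalizer eta = 1 / (sum of pseudo values).
def normalize(pseudo):
    eta = 1.0 / sum(pseudo.values())
    return {hypothesis: eta * p for hypothesis, p in pseudo.items()}

# The one-test cancer example from before:
pseudo = {"cancer": 0.9 * 0.01, "not cancer": 0.2 * 0.99}
print(normalize(pseudo))   # {'cancer': ~0.0435, 'not cancer': ~0.9565}
```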
The reason why I gave you all this is because I want you to apply it now | ▶ 00:00 |
to a slightly more complicated problem, which is the 2-test cancer example. | ▶ 00:03 |
In this example, we again might have our unobservable cancer C, | ▶ 00:08 |
but now we're running 2 tests, test 1 and test 2. | ▶ 00:14 |
As before, the prior probability of cancer is 0.01. | ▶ 00:18 |
The probability of receiving a positive test result for either test, given that you have cancer, is 0.9. | ▶ 00:24 |
The probability of getting a negative result given that you're cancer free is 0.8. | ▶ 00:30 |
And from those, we were able to compute all the other probabilities, | ▶ 00:36 |
and we're just going to write them down over here. | ▶ 00:40 |
So, take a moment to just verify those. | ▶ 00:43 |
Now, let's assume both of my tests come back positive, | ▶ 00:46 |
so T1 = + and T2 = +. | ▶ 00:50 |
What's the probability of cancer now written in short form probability of | ▶ 00:56 |
C given ++? | ▶ 01:00 |
I want you to tell me what that is, and this is a non-trivial question. | ▶ 01:03 |
So, the correct answer is 0.1698 approximately, | ▶ 00:00 |
and to compute this, I used the trick I've shown you before. | ▶ 00:10 |
Let me write down the running count for cancer and for not cancer | ▶ 00:15 |
as I integrate the various multiplications in Bayes Rule. | ▶ 00:24 |
My prior for cancer was 0.01 and for non-cancer was 0.99. | ▶ 00:28 |
Then I get my first +, and the probability of a + given they have cancer is 0.9, | ▶ 00:37 |
and the same for non-cancer is 0.2. | ▶ 00:43 |
So, according to the non-normalized Bayes Rule, | ▶ 00:48 |
I now multiply these 2 things together to get my non-normalized probability | ▶ 00:52 |
of having cancer given the plus. | ▶ 00:58 |
Since multiplication is commutative, | ▶ 01:00 |
I can do the same thing again with my 2nd test result, 0.9 and 0.2, | ▶ 01:03 |
and I multiply all of these 3 things together to get my non-normalized probability | ▶ 01:09 |
P prime to be the following: 0.0081, if you multiply those things together, | ▶ 01:14 |
and 0.0396 if you multiply these factors together. | ▶ 01:21 |
And these are not probabilities yet. | ▶ 01:28 |
If we add those up for the 2 complementary cases of cancer and non-cancer, | ▶ 01:30 |
I get 0.0477. | ▶ 01:34 |
However, if I now divide, that is, I normalize | ▶ 01:38 |
those non-normalized probabilities over here by this factor over here, | ▶ 01:42 |
I actually get the correct posterior probability P of cancer given ++. | ▶ 01:47 |
And they look as follows: | ▶ 01:52 |
approximately 0.1698 and approximately 0.8301. | ▶ 01:54 |
Calculate for me the probability of cancer | ▶ 00:00 |
given that I received one positive and one negative test result. | ▶ 00:03 |
Please write your number into this box. | ▶ 00:08 |
We apply the same trick as before | ▶ 00:00 |
where we use the exact same prior of 0.01. | ▶ 00:03 |
Our first + gives us the following factors: 0.9 and 0.2. | ▶ 00:07 |
And our minus gives us the probability 0.1 for a negative test result given that we have cancer, | ▶ 00:13 |
and a 0.8 for a negative result given that we don't have cancer. | ▶ 00:20 |
We multiply those together. | ▶ 00:26 |
We get our non-normalized probability. | ▶ 00:28 |
And if we now normalize by the sum of those two things | ▶ 00:30 |
to turn this back into a probability, we get 0.0009 | ▶ 00:35 |
over the sum of those two things over here, and this is 0.0056 | ▶ 00:41 |
for the chance of having cancer and 0.9943 for the chance of being cancer free. | ▶ 00:50 |
And this adds up approximately to 1, and therefore, is a probability distribution. | ▶ 00:59 |
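The same running-count calculation can be sketched as a small function; `posterior` and its argument names are illustrative, not part of the lecture:

```python
# Start with the prior, multiply in one likelihood factor per test
# result, and normalize at the very end (the deferred-normalization
# trick from the previous videos).
def posterior(prior, factors):
    p_c, p_not_c = prior, 1.0 - prior
    for f_c, f_not_c in factors:      # (P(result | C), P(result | not C))
        p_c *= f_c
        p_not_c *= f_not_c
    total = p_c + p_not_c             # the normalizer
    return p_c / total, p_not_c / total

print(posterior(0.01, [(0.9, 0.2), (0.9, 0.2)]))  # ++ : ~0.1698, ~0.8302
print(posterior(0.01, [(0.9, 0.2), (0.1, 0.8)]))  # +- : ~0.0056, ~0.9944
```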
I want to use a few words of terminology. | ▶ 00:00 |
This, again, is a Bayes network, of which the hidden variable C | ▶ 00:03 |
causes the still stochastic test outcomes T1 and T2. | ▶ 00:08 |
And what is really important is that we assume not just | ▶ 00:16 |
that T1 and T2 are identically distributed. | ▶ 00:19 |
We use the same 0.9 for test 1 as we use for test 2, | ▶ 00:22 |
but we also assume that they are conditionally independent. | ▶ 00:27 |
We assumed that if God told us whether we actually had cancer or not, | ▶ 00:31 |
if we knew with absolute certainty the value of the variable C, | ▶ 00:37 |
that knowing anything about T1 would not help us make a statement about T2. | ▶ 00:41 |
Put differently, we assumed that the probability of T2 given C and T1 | ▶ 00:48 |
is the same as the probability of T2 given C. | ▶ 00:55 |
This is called conditional independence, conditioned on the value of the cancer variable C. | ▶ 01:00 |
If you knew this for a fact, then T2 would be independent of T1. | ▶ 01:08 |
It's conditionally independent because the independence only holds true | ▶ 01:17 |
if we actually know C, and it comes out of this diagram over here. | ▶ 01:21 |
If we look at this diagram, if you knew the variable C over here, | ▶ 01:26 |
then C separately causes T1 and T2. | ▶ 01:32 |
So, as a result, if you know C, whatever happens over here | ▶ 01:39 |
is causally cut off from what happens over here. | ▶ 01:46 |
That causes these 2 variables to be conditionally independent. | ▶ 01:48 |
So, conditional independence is a really big thing in Bayes networks. | ▶ 01:52 |
Here's a Bayes network where A causes B and C, | ▶ 01:58 |
and for a Bayes network of this structure, we know that given A, | ▶ 02:02 |
B and C are independent. | ▶ 02:08 |
It's written as B conditionally independent of C given A. | ▶ 02:11 |
So, here's a question. | ▶ 02:16 |
Suppose we have conditional independence between B and C given A. | ▶ 02:18 |
Would that imply--and there's my question--that B and C are independent? | ▶ 02:21 |
So, suppose we don't know A. | ▶ 02:28 |
We don't know whether we have cancer, for example. | ▶ 02:30 |
The question is whether the test results individually are still independent of each other | ▶ 02:33 |
even if we don't know about the cancer situation. | ▶ 02:38 |
Please answer yes or no. | ▶ 02:42 |
And the correct answer is no. | ▶ 00:00 |
Intuitively, getting a positive test result about cancer | ▶ 00:03 |
gives us information about whether you have cancer or not. | ▶ 00:08 |
So if you get a positive test result | ▶ 00:13 |
you're going to raise the probability of having cancer | ▶ 00:15 |
relative to the prior probability. | ▶ 00:18 |
With that increased probability we will predict | ▶ 00:20 |
that another test will with a higher likelihood | ▶ 00:24 |
give us a positive response than if we hadn't taken the previous test. | ▶ 00:27 |
That's really important to understand. | ▶ 00:33 |
To make sure we understand it, let me have you calculate those probabilities. | ▶ 00:36 |
Let me draw the cancer example again with two tests. | ▶ 00:00 |
Here's my cancer variable | ▶ 00:05 |
and then there's two conditionally independent tests T1 and T2. | ▶ 00:07 |
And as before let me assume that the prior probability of cancer is 0.01 | ▶ 00:13 |
What I want you to compute for me is the probability of the second test | ▶ 00:19 |
to be positive if we know that the first test was positive. | ▶ 00:26 |
So write this into the following box. | ▶ 00:33 |
So, for this one, we want to apply total probability. | ▶ 00:00 |
This thing over here is the same as probability of test 2 to be positive, | ▶ 00:04 |
which I'm going to abbreviate with a +2 over here, | ▶ 00:10 |
conditioned on test 1 being positive and me having cancer | ▶ 00:14 |
times the probability of me having cancer given test 1 was positive plus | ▶ 00:19 |
the probability of test 2 being positive conditioned on test 1 being positive | ▶ 00:25 |
and me not having cancer times the probability of me not having cancer | ▶ 00:31 |
given that test 1 is positive. | ▶ 00:36 |
That's the same as the theorem of total probability, | ▶ 00:38 |
but now everything is conditioned on +1. | ▶ 00:42 |
Take a moment to verify this. | ▶ 00:46 |
Now, here I can plug in the numbers. | ▶ 00:48 |
You already calculated this one before, which is approximately 0.043, | ▶ 00:50 |
and this one over here is 1 minus that, which is 0.957 approximately. | ▶ 00:57 |
And this term over here now exploits conditional independence, | ▶ 01:05 |
which is given that I know C, knowledge of the first test | ▶ 01:09 |
gives me no more information about the second test. | ▶ 01:14 |
It only gives me information if C was unknown, as was the case over here. | ▶ 01:17 |
So, I can rewrite this thing over here as follows: | ▶ 01:21 |
P of +2 given that I have cancer. | ▶ 01:24 |
I can drop the +1, and the same is true over here. | ▶ 01:27 |
This is exploiting my conditional independence. | ▶ 01:31 |
I knew that P of +2 conditioned on C | ▶ 01:34 |
is the same as P of +2 conditioned on C and +1. | ▶ 01:41 |
I can now read those off my table over here, | ▶ 01:47 |
which is 0.9 times 0.043 plus 0.2, | ▶ 01:50 |
which is 1 minus 0.8 over here times 0.957, | ▶ 01:58 |
which gives me approximately 0.2301. | ▶ 02:03 |
So, that says if my first test comes in positive, | ▶ 02:09 |
I expect my second test to be positive with probability 0.2301. | ▶ 02:14 |
That's an increase over the default probability, | ▶ 02:21 |
which we calculated before: the probability of any single test, like test 2, | ▶ 02:24 |
coming in positive was the normalizer of Bayes Rule, which was 0.207. | ▶ 02:29 |
So, my first test has a 20% chance of coming in positive. | ▶ 02:38 |
My second test, after seeing a positive test, | ▶ 02:43 |
has now an increased probability of about 23% of coming in positive. | ▶ 02:47 |
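This prediction step is short enough to verify directly; here is a sketch using the exact, unrounded posterior:

```python
# P(+2 | +1) by total probability, conditioning on C and using the
# conditional independence of the two tests given C.
p_c_pos1 = 0.009 / 0.207                             # P(C | +1), ~0.0435
p_pos2_pos1 = 0.9 * p_c_pos1 + 0.2 * (1 - p_c_pos1)
print(p_pos2_pos1)   # ~0.2304 (0.2301 above, with the rounded 0.043)
```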
So, now we've learned about independence, | ▶ 00:00 |
and the corresponding Bayes network has 2 nodes. | ▶ 00:02 |
They're just not connected at all. | ▶ 00:04 |
And we learned about conditional independence, | ▶ 00:07 |
in which case we have a Bayes network that looks like this. | ▶ 00:09 |
Now I would like to know whether absolute independence | ▶ 00:12 |
implies conditional independence. | ▶ 00:16 |
True or false? | ▶ 00:18 |
And I'd also like to know whether conditional independence implies absolute independence. | ▶ 00:20 |
Again, true or false? | ▶ 00:25 |
And the answer is both of them are false. | ▶ 00:00 |
We already saw that conditional independence, as shown over here, | ▶ 00:03 |
doesn't give us absolute independence. | ▶ 00:07 |
So, for example, this is test #1 and test #2. | ▶ 00:09 |
You might or might not have cancer. | ▶ 00:13 |
Our first test gives us information about whether you have cancer or not. | ▶ 00:15 |
As a result, we've changed our prior probability | ▶ 00:18 |
for the second test to come in positive. | ▶ 00:21 |
That means that conditional independence does not imply absolute independence, | ▶ 00:24 |
which means this assumption here is false, | ▶ 00:30 |
and it also turns out that if you have absolute independence, | ▶ 00:32 |
things might not be conditionally independent for reasons that I can't quite explain so far, | ▶ 00:37 |
but that we will learn about next. | ▶ 00:43 |
[Thrun] For my next example, I will study a different type of a Bayes network. | ▶ 00:00 |
Before, we've seen networks of the following type, | ▶ 00:04 |
where a single hidden cause caused 2 different measurements. | ▶ 00:08 |
I now want to study a network that looks just like the opposite. | ▶ 00:13 |
We have 2 independent hidden causes, | ▶ 00:17 |
but they get confounded within a single observational variable. | ▶ 00:20 |
I would like to use the example of happiness. | ▶ 00:26 |
Suppose I can be happy or unhappy. | ▶ 00:29 |
What makes me happy is when the weather is sunny or if I get a raise in my job, | ▶ 00:33 |
which means I make more money. | ▶ 00:41 |
So let's call this sunny, let's call this a raise, and call this happiness. | ▶ 00:43 |
Perhaps the probability of it being sunny is 0.7, | ▶ 00:47 |
probability of a raise is 0.01. | ▶ 00:53 |
And I will tell you that the probability of being happy is governed as follows. | ▶ 00:58 |
The probability of being happy given that both of these things occur-- | ▶ 01:05 |
I got a raise and it is sunny--is 1. | ▶ 01:09 |
The probability of being happy given that it is not sunny and I still got a raise is 0.9. | ▶ 01:13 |
The probability of being happy given that it's sunny but I didn't get a raise is 0.7. | ▶ 01:20 |
And the probability of being happy given that it is neither sunny nor did I get a raise is 0.1. | ▶ 01:27 |
This is a perfectly fine specification of a probability distribution | ▶ 01:35 |
where 2 causes affect the variable down here, the happiness. | ▶ 01:39 |
So I'd like you to calculate for me the following questions. | ▶ 01:46 |
Probability of a raise given that it is sunny, according to this model. | ▶ 01:50 |
Please enter your answer over here. | ▶ 01:57 |
[Thrun] The answer is surprisingly simple. | ▶ 00:00 |
It is 0.01. | ▶ 00:03 |
How do I know this so fast? | ▶ 00:05 |
Well, if you look at this Bayes network, | ▶ 00:08 |
both the sunniness and the question whether I got a raise impact my happiness. | ▶ 00:12 |
But since I don't know anything about the happiness, | ▶ 00:21 |
there is no way that just the weather might impact whether I get a raise or not. | ▶ 00:24 |
In fact, it might be independently sunny, and I might independently get a raise at work. | ▶ 00:32 |
There is no mechanism of which these 2 things would co-occur. | ▶ 00:39 |
Therefore, the probability of a raise given that it's sunny | ▶ 00:46 |
is just the same as the probability of a raise given any weather, which is 0.01. | ▶ 00:49 |
[Thrun] Let me talk about a really interesting special instance of Bayes net reasoning | ▶ 00:00 |
which is called explaining away. | ▶ 00:07 |
And I'll first give you the intuitive answer, | ▶ 00:10 |
then I'll wish you to compute probabilities for me that manifest the explain away effect | ▶ 00:14 |
in a Bayes network of this type. | ▶ 00:19 |
Explaining away means that if we know that we are happy, | ▶ 00:22 |
then sunny weather can explain away the cause of happiness. | ▶ 00:27 |
If I then also know that it's sunny, it becomes less likely that I received a raise. | ▶ 00:34 |
Let me put this differently. | ▶ 00:41 |
Suppose I'm a happy guy on a specific day | ▶ 00:43 |
and my wife asks me, "Sebastian, why are you so happy?" | ▶ 00:45 |
"Is it sunny, or did you get a raise?" | ▶ 00:49 |
If she then looks outside and sees it is sunny, | ▶ 00:52 |
then she might explain to herself, | ▶ 00:55 |
"Well, Sebastian is happy because it is sunny." | ▶ 00:57 |
"That makes it effectively less likely that he got a raise | ▶ 01:00 |
"because I could already explain his happiness by it being sunny." | ▶ 01:05 |
If she looks outside and it is rainy, | ▶ 01:10 |
that makes it more likely I got a raise, | ▶ 01:13 |
because the weather can't really explain my happiness. | ▶ 01:16 |
In other words, if we see a certain effect that could be caused by multiple causes, | ▶ 01:20 |
seeing one of those causes can explain away any other potential cause | ▶ 01:27 |
of this effect over here. | ▶ 01:33 |
So let me put this in numbers and ask you the challenging question of | ▶ 01:36 |
what's the probability of a raise given that I'm happy and it's sunny? | ▶ 01:43 |
[Thrun] The answer is approximately 0.0142, | ▶ 00:00 |
and it is an exercise in expanding this term using Bayes' rule, | ▶ 00:07 |
using total probability, which I'll just do for you. | ▶ 00:11 |
Using Bayes' rule, you can transform this into P of H given R comma S | ▶ 00:16 |
times P of R given S over P of H given S. | ▶ 00:24 |
We observe the conditional independence of R and S | ▶ 00:34 |
to simplify this to just P of R, | ▶ 00:37 |
and the denominator is expanded by folding in R and not R, | ▶ 00:40 |
P of H given R comma S | ▶ 00:46 |
times P of R plus P of H given not R and S | ▶ 00:49 |
times P of not R, which is total probability. | ▶ 00:54 |
We can now read off the numbers from the tables over here, | ▶ 00:58 |
which gives us 1 times 0.01 divided by this expression | ▶ 01:01 |
that is the same as the expression over here, so 0.01 plus this thing over here, | ▶ 01:10 |
which you can find over here to be 0.7, times this guy over here, | ▶ 01:17 |
which is 1 minus the value over here, 0.99, | ▶ 01:23 |
which gives us approximately 0.0142. | ▶ 01:27 |
[Thrun] Now, to understand the explain away effect, | ▶ 00:00 |
you have to compare this to the probability of a raise given that we're just happy | ▶ 00:04 |
and we don't know anything about the weather. | ▶ 00:11 |
So let's do that exercise next. | ▶ 00:14 |
So my next quiz is, what's the probability of a raise given that all I know is that I'm happy | ▶ 00:16 |
and I don't know about the weather? | ▶ 00:24 |
This happens to be once again a pretty complicated question, so take your time. | ▶ 00:26 |
[Thrun] So this is a difficult question. | ▶ 00:00 |
Let me compute an auxiliary variable, which is P of happiness. | ▶ 00:02 |
That one is expanded by looking at the different conditions that can make us happy. | ▶ 00:12 |
P of happiness given S and R | ▶ 00:19 |
times P of S and R, which is of course the product of those 2 | ▶ 00:24 |
because they are independent, | ▶ 00:29 |
plus P of happiness given not S and R, times the probability of not S and R, | ▶ 00:31 |
plus P of H given S and not R | ▶ 00:39 |
times the probability of S and not R, plus the last case, | ▶ 00:43 |
P of H given not S and not R times the probability of not S and not R. | ▶ 00:48 |
So this just looks at the happiness under all 4 combinations of the variables | ▶ 00:52 |
that can lead to happiness. | ▶ 00:56 |
And you can plug those straight in. | ▶ 00:58 |
This one over here is 1, and this one over here is the product of P of S and P of R, | ▶ 01:00 |
which is 0.7 times 0.01. | ▶ 01:05 |
And as you plug all of those in, | ▶ 01:10 |
you get as a result 0.5245. | ▶ 01:14 |
That's P of H. | ▶ 01:21 |
Just take some time and do the math by going through these different cases | ▶ 01:24 |
using total probability, and you get this result. | ▶ 01:28 |
Armed with this number, the rest now becomes easy, | ▶ 01:32 |
which is we can use Bayes' rule to turn this around. | ▶ 01:38 |
P of H given R times P of R over P of H. | ▶ 01:43 |
P of R we know from over here, the probability of a raise is 0.01. | ▶ 01:49 |
So the only thing we need to compute now is P of H given R. | ▶ 01:54 |
And again, we apply total probability. | ▶ 01:57 |
Let me just do this over here. | ▶ 01:59 |
We can factor P of H given R as P of H given R and S, sunny, | ▶ 02:02 |
times probability of sunny plus P of H given R and not sunny | ▶ 02:09 |
times the probability of not sunny. | ▶ 02:14 |
And if you plug in the numbers with this, you get 1 times 0.7 | ▶ 02:16 |
plus 0.9 times 0.3. | ▶ 02:21 |
That happens to be 0.97. | ▶ 02:25 |
So if we now plug this all back into this equation over here, | ▶ 02:30 |
we get 0.97 times 0.01 over 0.5245. | ▶ 02:33 |
This gives us approximately as the correct answer 0.0185. | ▶ 02:45 |
[Thrun] And if you got this right, I will be deeply impressed | ▶ 00:00 |
about the fact you got this right. | ▶ 00:04 |
But the interesting thing now to observe is if we happen to know it's sunny | ▶ 00:07 |
and I'm happy, then the probability of a raise is 1.4%, or 0.014. | ▶ 00:13 |
If I don't know about the weather and I'm happy, | ▶ 00:21 |
then the probability of a raise goes up to about 1.85%. | ▶ 00:26 |
Why is that? | ▶ 00:30 |
Well, it's the explaining away effect. | ▶ 00:32 |
My happiness is well explained by the fact that it's sunny. | ▶ 00:35 |
So if someone observes me to be happy and asks the question, | ▶ 00:40 |
"Is this because Sebastian got a raise at work?" | ▶ 00:43 |
well, if you know it's sunny and this is a fairly good explanation for me being happy, | ▶ 00:46 |
you don't have to assume I got a raise. | ▶ 00:53 |
If you don't know about the weather, then obviously the chances are higher | ▶ 00:55 |
that the raise caused my happiness, | ▶ 01:01 |
and therefore this number goes up from 0.014 to 0.018. | ▶ 01:03 |
Let me ask you one final question in this next quiz, | ▶ 01:10 |
which is the probability of the raise given that I look happy and it's not sunny. | ▶ 01:14 |
This is the most extreme case for making a raise likely | ▶ 01:23 |
because I am a happy guy, and it's definitely not caused by the weather. | ▶ 01:27 |
So it could be just random, or it could be caused by the raise. | ▶ 01:33 |
So please calculate this number for me and enter it into this box. | ▶ 01:37 |
[Thrun] Well, the answer follows the exact same scheme as before, | ▶ 00:00 |
with S being replaced by not S. | ▶ 00:04 |
So this should be an easier question for you to answer. | ▶ 00:08 |
P of R given H and not S can be inverted by Bayes' rule to be as follows. | ▶ 00:11 |
Once we apply Bayes' rule, as indicated over here where we swapped H to the left side | ▶ 00:20 |
and R to the right side, you can observe that this value over here | ▶ 00:24 |
can be readily found in the table. | ▶ 00:29 |
It's actually the 0.9 over there. | ▶ 00:32 |
This value over here, the raise is independent of the weather | ▶ 00:35 |
by virtue of our Bayes network, so it's just 0.01. | ▶ 00:41 |
And as before, we apply total probability to the expression over here, | ▶ 00:45 |
and we observe in this quotient that the numerator and the first term of the denominator are the same. | ▶ 00:52 |
P of H given not S, not R is the value over here, | ▶ 00:58 |
and the 0.99 is the complement of probability of R taken from over here, | ▶ 01:03 |
and that ends up being 0.0833. | ▶ 01:08 |
This would have been the right answer. | ▶ 01:16 |
[Thrun] It's really interesting to compare this to the situation over here. | ▶ 00:00 |
In both cases I'm happy, as shown over here, | ▶ 00:04 |
and I ask the same question, which is whether I got a raise at work, as R over here. | ▶ 00:08 |
But in one case I observe that the weather is sunny; in the other one it isn't. | ▶ 00:15 |
And look what it does to my probability of having received a raise. | ▶ 00:21 |
The sunniness perfectly well explains my happiness, | ▶ 00:25 |
and my probability of having received a raise ends up being a mere 1.4%, or 0.014. | ▶ 00:30 |
However, if my wife observes it to be non-sunny, then it is much more likely | ▶ 00:41 |
that the cause of my happiness is related to a raise at work, | ▶ 00:47 |
and now the probability is 8.3%, which is significantly higher than the 1.4% before. | ▶ 00:51 |
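All three of these posteriors can be checked by brute-force enumeration of the joint over S, R, and H; the helper `prob` below is a name made up just for this sketch:

```python
# Enumerate the joint P(S, R, H) from the happiness tables above and
# compute conditional probabilities by summing matching entries.
p_s, p_r = 0.7, 0.01
p_h = {(True, True): 1.0, (False, True): 0.9,    # P(H | S, R)
       (True, False): 0.7, (False, False): 0.1}

def prob(s=None, r=None, h=None):
    total = 0.0
    for S in (True, False):
        for R in (True, False):
            for H in (True, False):
                if s is not None and S != s: continue
                if r is not None and R != r: continue
                if h is not None and H != h: continue
                p = (p_s if S else 1 - p_s) * (p_r if R else 1 - p_r)
                p *= p_h[(S, R)] if H else 1 - p_h[(S, R)]
                total += p
    return total

print(prob(r=True, h=True, s=True) / prob(h=True, s=True))    # ~0.0142
print(prob(r=True, h=True) / prob(h=True))                    # ~0.0185
print(prob(r=True, h=True, s=False) / prob(h=True, s=False))  # ~0.0833
```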
This is a Bayes network of which S and R are independent | ▶ 00:58 |
but H adds a dependence between S and R. | ▶ 01:04 |
Let me talk about this in a little bit more detail on the next paper. | ▶ 01:10 |
So here is our Bayes network again. | ▶ 01:16 |
In our previous exercises, we computed for this network | ▶ 01:18 |
that the probability of a raise of R given any of these variables shown here was as follows. | ▶ 01:22 |
The really interesting thing is that in the absence of information about H, | ▶ 01:29 |
which is the middle case over here, | ▶ 01:34 |
the probability of R is unaffected by knowledge of S-- | ▶ 01:37 |
that is, R and S are independent. | ▶ 01:41 |
P of R given S is the same as the probability of R; | ▶ 01:46 |
R and S are independent. | ▶ 01:49 |
However, if I know something about the variable H, | ▶ 01:56 |
then S and R become dependent-- | ▶ 02:02 |
that is, knowing about my happiness over here renders S and R dependent. | ▶ 02:06 |
This is not the same as probability of just R given H. | ▶ 02:15 |
Obviously, it isn't because if I now vary S from S to not S, | ▶ 02:23 |
it affects my probability for the variable R. | ▶ 02:28 |
That is a really unusual situation | ▶ 02:33 |
where we have R and S are independent | ▶ 02:36 |
but given the variable H, R and S are not independent anymore. | ▶ 02:40 |
So knowledge of H makes 2 variables that previously were independent non-independent. | ▶ 02:50 |
Put differently, 2 variables that are independent may, in certain cases, | ▶ 02:58 |
not be conditionally independent. | ▶ 03:06 |
Independence does not imply conditional independence. | ▶ 03:08 |
[Thrun] So we're now ready to define Bayes networks in a more general way. | ▶ 00:00 |
Bayes networks define probability distributions over graphs or random variables. | ▶ 00:05 |
Here is an example graph of 5 variables, | ▶ 00:10 |
and this Bayes network defines the distribution over those 5 random variables. | ▶ 00:14 |
Instead of enumerating all possibilities of combinations of these 5 random variables, | ▶ 00:19 |
the Bayes network is defined by probability distributions | ▶ 00:24 |
that are inherent to each individual node. | ▶ 00:28 |
For node A and B, we just have a distribution P of A and P of B | ▶ 00:32 |
because A and B have no incoming arcs. | ▶ 00:38 |
C is a conditional distribution conditioned on A and B. | ▶ 00:42 |
D and E are conditioned on C. | ▶ 00:47 |
The joint probability represented by a Bayes network | ▶ 00:52 |
is the product of various Bayes network probabilities | ▶ 00:56 |
that are defined over individual nodes | ▶ 01:00 |
where each node's probability is only conditioned on the incoming arcs. | ▶ 01:03 |
So A has no incoming arc; therefore, we just write P of A. | ▶ 01:08 |
C has 2 incoming arcs, so we define the probability of C conditioned on A and B. | ▶ 01:12 |
And D and E have 1 incoming arc that's shown over here. | ▶ 01:18 |
The definition of this joint distribution by using the following factors | ▶ 01:22 |
has one really big advantage. | ▶ 01:27 |
Whereas the joint distribution over any 5 variables requires 2 to the 5 minus 1, | ▶ 01:30 |
which is 31 probability values, | ▶ 01:40 |
the Bayes network over here only requires 10 such values. | ▶ 01:43 |
P of A is one value, for which we can derive P of not A. | ▶ 01:48 |
Same for P of B. | ▶ 01:53 |
P of C given A and B is given by a distribution over C | ▶ 01:55 |
conditioned on any combination of A and B, of which there are 4, since A and B are binary. | ▶ 02:02 |
P of D given C is 2 parameters for P of D given C and P of D given not C. | ▶ 02:07 |
And the same is true for P of E given C. | ▶ 02:15 |
So if you add those up, you get 10 parameters in total. | ▶ 02:18 |
So the compactness of the Bayes network | ▶ 02:21 |
leads to a representation that scales significantly better to large networks | ▶ 02:25 |
than the combinatorial approach, which goes through all combinations of variable values. | ▶ 02:31 |
That is a key advantage of Bayes networks, | ▶ 02:36 |
and that is the reason why Bayes networks are being used so extensively | ▶ 02:39 |
for all kinds of problems. | ▶ 02:43 |
So here is a quiz. | ▶ 02:45 |
How many probability values are required to specify this Bayes network? | ▶ 02:47 |
Please put your answer in the following box. | ▶ 02:51 |
[Thrun] And the answer is 13. | ▶ 00:00 |
One over here, 2 over here, and 4 over here. | ▶ 00:03 |
Simply speaking, any variable that has K inputs requires 2 to the K such parameters. | ▶ 00:06 |
So in total we have 1, 9, 13. | ▶ 00:15 |
[Thrun] Here's another quiz. | ▶ 00:00 |
How many parameters do we need to specify the joint distribution | ▶ 00:02 |
for this Bayes network over here | ▶ 00:06 |
where A, B, and C point into D, D points into E, F, and G | ▶ 00:09 |
and C also points into G? | ▶ 00:13 |
Please write your answer into this box. | ▶ 00:15 |
[Thrun] And the answer is 19. | ▶ 00:00 |
So 1 here, 1 here, 1 here, 2 here, 2 here, 2 arcs point into G, which makes for 4, | ▶ 00:02 |
and 3 arcs point into D. Two to the 3 is 8. | ▶ 00:09 |
So we get 1 + 1 + 1 + 8 + 2 + 2 + 4. If you add those up, it's 19. | ▶ 00:13 |
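The counting rule generalizes to a one-line function; `num_parameters` is a name introduced just for this sketch:

```python
# Each node with k parents needs 2**k numbers: one P(node = true | ...)
# per assignment of its parents (the complements come for free).
def num_parameters(parents):
    return sum(2 ** len(p) for p in parents.values())

# The 5-node network A, B -> C; C -> D, E from before:
print(num_parameters({"A": [], "B": [], "C": ["A", "B"],
                      "D": ["C"], "E": ["C"]}))                    # 10

# This quiz's network: A, B, C -> D; D -> E, F; D, C -> G:
print(num_parameters({"A": [], "B": [], "C": [], "D": ["A", "B", "C"],
                      "E": ["D"], "F": ["D"], "G": ["D", "C"]}))   # 19
```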
[Thrun] And here is our car network which we discussed at the very beginning of this unit. | ▶ 00:00 |
How many parameters do we need to specify this network? | ▶ 00:06 |
Remember, there are 16 total variables, | ▶ 00:11 |
and the naive joint over the 16 will be 2 to the 16th minus 1, which is 65,535. | ▶ 00:15 |
Please write your answer into this box over here. | ▶ 00:25 |
[Thrun] To answer this question, let us add up these numbers. | ▶ 00:00 |
Battery age is 1, 1, 1. | ▶ 00:04 |
This has 1 incoming arc, so it's 2. | ▶ 00:08 |
Two incoming arcs makes 4. | ▶ 00:10 |
One incoming arc is 2, 2 equals 4. | ▶ 00:13 |
Four incoming arcs makes 16. | ▶ 00:17 |
If we add up all these numbers, we get 47. | ▶ 00:21 |
[Thrun] So it takes 47 numerical probabilities to specify the joint | ▶ 00:00 |
compared to 65,535 if you didn't have the graph-like structure. | ▶ 00:05 |
I think this example really illustrates the advantage | ▶ 00:11 |
of compact Bayes network representations over unstructured joint representations. | ▶ 00:14 |
[Thrun] The next concept I'd like to teach you is called D-separation. | ▶ 00:00 |
And let me start the discussion of this concept by a quiz. | ▶ 00:04 |
We have here a Bayes network, | ▶ 00:09 |
and I'm going to ask you a conditional independence question. | ▶ 00:11 |
Is C independent of A? | ▶ 00:16 |
Please tell me yes or no. | ▶ 00:20 |
Is C independent of A given B? | ▶ 00:22 |
Is C independent of D? | ▶ 00:27 |
Is C independent of D given A? | ▶ 00:30 |
And is E independent of C given D? | ▶ 00:32 |
[Thrun] So C is not independent of A. | ▶ 00:00 |
In fact, A influences C by virtue of B. | ▶ 00:04 |
But if you know B, then A becomes independent of C, | ▶ 00:09 |
which means the only determinate into C is B. | ▶ 00:13 |
If you know B for sure, then knowledge of A won't really tell you anything about C. | ▶ 00:17 |
C is also not independent of D, just the same way C is not independent of A. | ▶ 00:22 |
If I learn something about D, I can infer more about C. | ▶ 00:27 |
But if I do know A, then it's hard to imagine how knowledge of D would help me with C | ▶ 00:31 |
because I can't learn anything more about A than knowing A already. | ▶ 00:38 |
Therefore, given A, C and D are independent. | ▶ 00:42 |
The same is true for E and C. | ▶ 00:45 |
If we know D, then E and C become independent. | ▶ 00:48 |
[Thrun] In this specific example, the rule that we could apply is very, very simple. | ▶ 00:00 |
Any 2 variables are independent if they're not linked by a chain of just unknown variables. | ▶ 00:04 |
So for example, if we know B, then everything downstream of B | ▶ 00:10 |
becomes independent of anything upstream of B. | ▶ 00:14 |
E is now independent of C, conditioned on B. | ▶ 00:18 |
However, knowledge of B does not render A and E independent. | ▶ 00:22 |
In this graph over here, A and B connect to C and C connects to D and to E. | ▶ 00:26 |
So let me ask you, is A independent of E, | ▶ 00:33 |
A independent of E given B, | ▶ 00:37 |
A independent of E given C, | ▶ 00:39 |
A independent of B, | ▶ 00:41 |
and A independent of B given C? | ▶ 00:43 |
[Thrun] And the answer for this one is really interesting. | ▶ 00:00 |
A is clearly not independent of E because through C we can see an influence of A to E. | ▶ 00:03 |
Given B, that doesn't change. | ▶ 00:08 |
A still influences C, despite the fact we know B. | ▶ 00:11 |
However, if we know C, the influence is cut off. | ▶ 00:15 |
There is no way A can influence E if we know C. | ▶ 00:18 |
A is clearly independent of B. | ▶ 00:22 |
They are different entry variables. They have no incoming arcs. | ▶ 00:25 |
But here is the caveat. | ▶ 00:29 |
Given C, A and B become dependent. | ▶ 00:32 |
So whereas initially A and B were independent, | ▶ 00:35 |
if you give C, they become dependent. | ▶ 00:38 |
And the reason why they become dependent we've studied before. | ▶ 00:41 |
This is the explain away effect. | ▶ 00:44 |
If you know, for example, C to be true, | ▶ 00:48 |
then knowledge of A will substantially affect what we believe about B. | ▶ 00:51 |
If there's 2 joint causes for C and we happen to know A is true, | ▶ 00:57 |
we will discredit cause B. | ▶ 01:02 |
If we happen to know A is false, we will increase our belief for the cause B. | ▶ 01:04 |
That was an effect we studied extensively in the happiness example I gave you before. | ▶ 01:09 |
The interesting thing here is we are facing a situation | ▶ 01:15 |
where knowledge of variable C renders previously independent variables dependent. | ▶ 01:19 |
[Thrun] This leads me to the general study of conditional independence in Bayes networks, | ▶ 00:00 |
often called D-separation or reachability. | ▶ 00:06 |
D-separation is best studied by so-called active triplets and inactive triplets | ▶ 00:10 |
where active triplets render variables dependent | ▶ 00:17 |
and inactive triplets render them independent. | ▶ 00:20 |
Any chain of 3 variables like this makes the initial and final variable dependent | ▶ 00:23 |
if all variables are unknown. | ▶ 00:30 |
However, if the center variable is known-- | ▶ 00:32 |
that is, it's behind the conditioning bar-- | ▶ 00:35 |
then this variable and this variable become independent. | ▶ 00:38 |
So if we have a structure like this and it's quote-unquote cut off | ▶ 00:42 |
by a known variable in the middle, that separates, or d-separates, | ▶ 00:47 |
the left variable from the right variable, and they become independent. | ▶ 00:53 |
Similarly, any structure like this renders the left variable and the right variable dependent | ▶ 00:57 |
unless the center variable is known, | ▶ 01:04 |
in which case the left and right variable become independent. | ▶ 01:08 |
Another active triplet now requires knowledge of a variable. | ▶ 01:12 |
This is the explain away case. | ▶ 01:16 |
If this variable is known for a Bayes network that converges into a single variable, | ▶ 01:19 |
then this variable and this variable over here become dependent. | ▶ 01:25 |
Contrast this with a case where all variables are unknown. | ▶ 01:29 |
A situation like this means that the variables on the left and on the right are actually independent. | ▶ 01:33 |
In a single final example, we also get dependence if we have the following situation: | ▶ 01:40 |
a direct successor of a conversion variable is known. | ▶ 01:48 |
So it is sufficient if a successor of this variable is known. | ▶ 01:52 |
The variable itself does not have to be known, | ▶ 01:57 |
and the reason is if you know this guy over here, | ▶ 01:59 |
we get knowledge about this guy over here. | ▶ 02:02 |
And by virtue of that, the case over here essentially applies. | ▶ 02:05 |
If you look at those rules, | ▶ 02:09 |
those rules allow you to determine for any Bayes network | ▶ 02:11 |
whether variables are dependent or not dependent given the evidence you have. | ▶ 02:15 |
If you color the nodes dark for which you do have evidence, | ▶ 02:20 |
then you can use these rules to understand whether any 2 variables | ▶ 02:25 |
are conditionally independent or not. | ▶ 02:29 |
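These rules can be turned into code. The sketch below is one standard way to do it, a reachability check often called the Bayes-ball algorithm; it is an illustration under that formulation, not the lecture's own implementation:

```python
# X and Y are conditionally independent given the evidence set z exactly
# when no "ball" started at X can reach Y under the triplet rules above.
def d_separated(x, y, z, parents):
    children = {n: [] for n in parents}
    for n, ps in parents.items():
        for p in ps:
            children[p].append(n)
    frontier = [(x, "from_child")]        # direction the ball arrived from
    visited = set()
    while frontier:
        node, came = frontier.pop()
        if (node, came) in visited:
            continue
        visited.add((node, came))
        if node == y:
            return False                  # reachable: dependent
        if came == "from_child":          # ball moving up the graph
            if node not in z:             # chain and fork are active
                frontier += [(p, "from_child") for p in parents[node]]
                frontier += [(c, "from_parent") for c in children[node]]
        else:                             # ball arrived from a parent
            if node in z:                 # known collider: explain-away opens
                frontier += [(p, "from_child") for p in parents[node]]
            else:                         # unknown: passes straight on down
                frontier += [(c, "from_parent") for c in children[node]]
    return True                           # unreachable: independent

# The network from the previous quiz: A, B -> C; C -> D, E.
net = {"A": [], "B": [], "C": ["A", "B"], "D": ["C"], "E": ["C"]}
print(d_separated("A", "E", set(), net))   # False: dependent
print(d_separated("A", "E", {"C"}, net))   # True: knowing C cuts the path
print(d_separated("A", "B", set(), net))   # True: independent causes
print(d_separated("A", "B", {"C"}, net))   # False: explaining away
```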
So let me ask you for this relatively complicated Bayes network the following questions. | ▶ 02:31 |
Is F independent of A? | ▶ 02:37 |
Is F independent of A given D? | ▶ 02:41 |
Is F independent of A given G? | ▶ 02:45 |
And is F independent of A given H? | ▶ 02:49 |
Please mark your answers as you see fit. | ▶ 02:51 |
[Thrun] And the answer is yes, F is independent of A. | ▶ 00:00 |
What we find for our rules of D-separation is that F is dependent on D | ▶ 00:04 |
and A is dependent on D. | ▶ 00:08 |
But if you don't know D, you can't get any dependence between A and F at all. | ▶ 00:11 |
If you do know D, then F and A become dependent. | ▶ 00:16 |
And the reason is B and E are dependent given D, | ▶ 00:20 |
and we can transform this back into dependence of A and F | ▶ 00:25 |
because B and A are dependent and E and F are dependent. | ▶ 00:29 |
There is an active path between A and F which goes across here and here | ▶ 00:33 |
because D is known. | ▶ 00:38 |
If we know G, the same thing is true because G gives us knowledge about D, | ▶ 00:40 |
and D can be applied back to this path over here. | ▶ 00:44 |
However, if you know H, that's not the case. | ▶ 00:47 |
So H might tell us something about G, | ▶ 00:49 |
but it doesn't tell us anything about D, | ▶ 00:51 |
and therefore, we have no reason to close the path between A and F. | ▶ 00:53 |
The path between A and F is still passive, even though we have knowledge of H. | ▶ 00:59 |
[Thrun] So congratulations. You learned a lot about Bayes networks. | ▶ 00:00 |
You learned about the graph structure of Bayes networks, | ▶ 00:03 |
you understood how this is a compact representation, | ▶ 00:06 |
you learned about conditional independence, | ▶ 00:10 |
and we talked a little bit about application of Bayes network | ▶ 00:12 |
to interesting reasoning problems. | ▶ 00:15 |
But by all means this was a mostly theoretical unit of this class, | ▶ 00:18 |
and in future classes we will talk more about applications. | ▶ 00:23 |
The instrument of Bayes networks is really essential to a number of problems. | ▶ 00:27 |
It really characterizes the sparse dependence that exists in many real-world problems | ▶ 00:31 |
like in robotics and computer vision and filtering and diagnostics and so on. | ▶ 00:36 |
I really hope you enjoyed this class, | ▶ 00:41 |
and I really hope you understood in depth how Bayes networks work. | ▶ 00:43 |
[Probabilistic Inference] | ▶ 00:00 |
[Norvig] Welcome back. In the previous unit, we went over the basics | ▶ 00:02 |
of probability theory and saw how | ▶ 00:05 |
a Bayes network could concisely represent a joint probability distribution, | ▶ 00:12 |
including the representation of independence between the variables. | ▶ 00:17 |
In this unit, we will see how to do probabilistic inference. | ▶ 00:24 |
That is, how to answer probability questions using Bayes nets. | ▶ 00:31 |
Let's put up a simple Bayes net. | ▶ 00:36 |
We'll use the familiar example of the earthquake | ▶ 00:40 |
where we can have a burglary or an earthquake | ▶ 00:45 |
setting off an alarm, and if the alarm goes off, | ▶ 00:50 |
either John or Mary might call. | ▶ 00:53 |
Now, what kinds of questions can we ask to do inference about? | ▶ 00:58 |
The simplest type of question is the same question we ask | ▶ 01:02 |
with an ordinary subroutine or function in a programming language. | ▶ 01:05 |
Namely, given some inputs, what are the outputs? | ▶ 01:08 |
So, in this case, we could say given the inputs of B and E, | ▶ 01:12 |
what are the outputs, J and M? | ▶ 01:18 |
Rather than call them input and output variables, | ▶ 01:22 |
in probabilistic inference, we'll call them evidence and query variables. | ▶ 01:26 |
That is, the variables that we know the values of are the evidence, | ▶ 01:36 |
and the ones that we want to find out the values of are the query variables. | ▶ 01:39 |
Anything that is neither evidence nor query is known as a hidden variable. | ▶ 01:44 |
That is, we won't tell you what its value is. | ▶ 01:52 |
We won't figure out what its value is and report it, | ▶ 01:55 |
but we'll have to compute with it internally. | ▶ 01:58 |
And now furthermore, in probabilistic inference, | ▶ 02:01 |
the output is not a single number for each of the query variables, | ▶ 02:05 |
but rather, it's a probability distribution. | ▶ 02:10 |
So, the answer is going to be a complete, joint probability distribution | ▶ 02:13 |
over the query variables. | ▶ 02:17 |
We call this the posterior distribution, given the evidence, | ▶ 02:19 |
and we can write it like this. | ▶ 02:23 |
It's the probability distribution of one or more query variables | ▶ 02:26 |
given the values of the evidence variables. | ▶ 02:34 |
And there can be zero or more evidence variables, | ▶ 02:39 |
and each of them are given an exact value. | ▶ 02:42 |
And that's the computation we want to come up with. | ▶ 02:47 |
There's another question we can ask. | ▶ 02:53 |
Which is the most likely explanation? | ▶ 02:56 |
That is, out of all the possible values for all the query variables, | ▶ 02:58 |
which combination of values has the highest probability? | ▶ 03:03 |
We write the formula like this, asking which Q values | ▶ 03:08 |
maximize the probability given the evidence values. | ▶ 03:12 |
Now, in an ordinary programming language, each function goes only one way. | ▶ 03:16 |
It has input variables, does some computation, | ▶ 03:22 |
and comes up with a result variable or result variables. | ▶ 03:26 |
One great thing about Bayes nets is that we're not restricted | ▶ 03:31 |
to going only in one direction. | ▶ 03:34 |
We could go in the causal direction, giving as evidence | ▶ 03:36 |
the root nodes of the tree and asking as query values the nodes at the bottom. | ▶ 03:41 |
Or, we could reverse that causal flow. | ▶ 03:47 |
For example, we could have J and M be the evidence variables | ▶ 03:50 |
and B and E be the query variables, | ▶ 03:55 |
or we could have any other combination. | ▶ 03:58 |
For example, we could have M be the evidence variable | ▶ 04:01 |
and J and B be the query variables. | ▶ 04:05 |
Here's a question for you. | ▶ 04:11 |
Imagine the situation where Mary has called to report that the alarm is going off, | ▶ 04:13 |
and we want to know whether or not there has been a burglary. | ▶ 04:18 |
For each of the nodes, click on the circle to tell us | ▶ 04:22 |
if the node is an evidence node, a hidden node, | ▶ 04:27 |
or a query node. | ▶ 04:32 |
The answer is that Mary calling is the evidence node. | ▶ 00:00 |
The burglary is the query node, | ▶ 00:04 |
and all the others are hidden variables in this case. | ▶ 00:07 |
Now we're going to talk about how to do inference on Bayes net. | ▶ 00:00 |
We'll start with our familiar network, and we'll talk about a method | ▶ 00:04 |
called enumeration, | ▶ 00:08 |
which goes through all the possibilities, adds them up, | ▶ 00:12 |
and comes up with an answer. | ▶ 00:15 |
So, what we do is start by stating the problem. | ▶ 00:17 |
We're going to ask the question of what is the probability | ▶ 00:24 |
that the burglar alarm occurred given that John called and Mary called? | ▶ 00:27 |
We'll use the definition of conditional probability to answer this. | ▶ 00:34 |
So, this query is equal to the joint probability distribution | ▶ 00:39 |
of all 3 variables divided by the probability of the conditioned-on variables. | ▶ 00:47 |
Now, note I'm using a notation here where instead of writing out the probability | ▶ 00:55 |
of some variable equals true, I'm just using the notation plus | ▶ 01:01 |
and then the variable name in lower case, | ▶ 01:05 |
and if I wanted the negation, I would use negation sign. | ▶ 01:08 |
Notice there's a different notation where instead of writing out | ▶ 01:13 |
the plus and negation signs, we just use the variable name itself, P(e), | ▶ 01:17 |
to indicate E is true. | ▶ 01:22 |
That notation works well, but it can get confusing between | ▶ 01:25 |
does P(e) mean E is true, or does it mean E is a variable? | ▶ 01:29 |
And so we're going to stick to the notation where we explicitly have | ▶ 01:34 |
the pluses and negation signs. | ▶ 01:37 |
To do inference by enumeration, we first take a conditional probability | ▶ 01:41 |
and rewrite it as unconditional probabilities. | ▶ 01:45 |
Now we enumerate all the atomic probabilities and calculate the sum of products. | ▶ 01:49 |
Let's look at just the complex term on the numerator first. | ▶ 01:56 |
The procedure for figuring out the denominator would be similar, and we'll skip that. | ▶ 02:00 |
So, the probability of these 3 terms together | ▶ 02:05 |
can be determined by enumerating all possible values of the hidden variables. | ▶ 02:12 |
In this case, there are 2, E and A, | ▶ 02:17 |
so we'll sum over those variables for all values of E and for all values of A. | ▶ 02:22 |
In this case, they're boolean, so there's only 2 values of each. | ▶ 02:29 |
We ask what's the probability of this unconditional term? | ▶ 02:34 |
And that we get by summing out over all possibilities, | ▶ 02:41 |
E and A being true or false. | ▶ 02:44 |
Now, to get the values of these atomic events, | ▶ 02:49 |
we'll have to rewrite this equation in a form that corresponds | ▶ 02:52 |
to the conditional probability tables that we have associated with the Bayes net. | ▶ 02:55 |
So, we'll take this whole expression and rewrite it. | ▶ 03:00 |
It's still a sum over the hidden variables E and A, | ▶ 03:04 |
but now I'll rewrite this expression in terms of the parents | ▶ 03:08 |
of each of the nodes in the network. | ▶ 03:12 |
So, that gives us the product of these 5 terms, | ▶ 03:15 |
which we then have to sum over all values of E and A. | ▶ 03:21 |
If we call this product f(e,a), | ▶ 03:24 |
then the whole answer is the sum of f for all values of E and A, | ▶ 03:31 |
so it's the sum of 4 terms where each of the terms is a product of 5 numbers. | ▶ 03:43 |
Where do we get the numbers to fill in this equation? | ▶ 03:51 |
From the conditional probability tables from our model, | ▶ 03:54 |
so let's put the equation back up, and we'll ask you for the case | ▶ 03:58 |
where both E and A are positive | ▶ 04:03 |
to look up in the conditional probability tables and fill in the numbers | ▶ 04:09 |
for each of these 5 terms, and then multiply them together and fill in the product. | ▶ 04:14 |
We get the answer by reading numbers off the conditional probability tables, | ▶ 00:00 |
so probability of B being positive is 0.001. | ▶ 00:04 |
Of E being positive, because we're dealing with the positive case now | ▶ 00:11 |
for the variable E, is 0.002. | ▶ 00:16 |
The probability of A being positive, because we're dealing with that case, | ▶ 00:22 |
given that B is positive and the case for an E is positive, | ▶ 00:26 |
that we can read off here as 0.95. | ▶ 00:30 |
The probability that J is positive given that A is positive is 0.9. | ▶ 00:37 |
And finally, the probability that M is positive given that A is positive | ▶ 00:44 |
we read off here as 0.7. | ▶ 00:50 |
We multiply all those together; it's going to be a small number | ▶ 00:54 |
because we've got the .001 and the .002 here. | ▶ 00:57 |
Can't quite fit it in the box, but it works out to .000001197. | ▶ 01:00 |
That seems like a really small number, but remember, | ▶ 01:12 |
we have to normalize by the P(+j,+m) term, | ▶ 01:14 |
and this is only 1 of the 4 possibilities. | ▶ 01:19 |
We have to enumerate over all 4 possibilities for E and A, | ▶ 01:22 |
and in the end, it works out that the probability of the burglar alarm being true | ▶ 01:26 |
given that John and Mary call, is 0.284. | ▶ 01:32 |
And we get that number because intuitively, | ▶ 01:38 |
it seems that the alarm is fairly reliable. | ▶ 01:42 |
John and Mary calling are very reliable, | ▶ 01:44 |
but the prior probability of burglary is low. | ▶ 01:47 |
And those 2 terms combine together to give us the 0.284 value | ▶ 01:49 |
when we sum up each of the 4 terms of these products. | ▶ 01:54 |
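The full enumeration is easy to replicate. One caveat: only five of the table entries (0.001, 0.002, 0.95, 0.9, 0.7) appear in this transcript, so the remaining entries in the sketch below are the standard values used with this example network; with them, the sum indeed comes out near 0.284:

```python
from itertools import product

P_b, P_e = 0.001, 0.002
P_a = {(True, True): 0.95, (True, False): 0.94,     # P(+a | B, E); the
       (False, True): 0.29, (False, False): 0.001}  # last 3 are assumed
P_j = {True: 0.9, False: 0.05}    # P(+j | A); the 0.05 is assumed
P_m = {True: 0.7, False: 0.01}    # P(+m | A); the 0.01 is assumed

def joint(b, e, a, j, m):
    p = (P_b if b else 1 - P_b) * (P_e if e else 1 - P_e)
    p *= P_a[(b, e)] if a else 1 - P_a[(b, e)]
    p *= P_j[a] if j else 1 - P_j[a]
    p *= P_m[a] if m else 1 - P_m[a]
    return p

# P(+b | +j, +m): sum out the hidden variables E and A, then normalize.
num = sum(joint(True, e, a, True, True)
          for e, a in product((True, False), repeat=2))
den = sum(joint(b, e, a, True, True)
          for b, e, a in product((True, False), repeat=3))
print(num / den)   # ~0.284
```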
[Norvig] We've seen how to do enumeration to solve the inference problem | ▶ 00:00 |
on belief networks. | ▶ 00:04 |
For a simple network like the alarm network, that's all we need to know. | ▶ 00:06 |
There's only 5 variables, so even if all 5 of them were hidden, | ▶ 00:10 |
there would only be 32 rows in the table to sum up. | ▶ 00:14 |
From a theoretical point of view, we're done. | ▶ 00:20 |
But from a practical point of view, other networks could give us trouble. | ▶ 00:22 |
Consider this network, which is one for determining insurance for car owners. | ▶ 00:26 |
There are 27 different variables. | ▶ 00:35 |
If each of the variables were boolean, that would give us over 100 million rows to sum out. | ▶ 00:38 |
But in fact, some of the variables are non-boolean, | ▶ 00:44 |
they have multiple values, and it turns out that representing this entire network | ▶ 00:46 |
and doing enumeration we'd have to sum over a quadrillion rows. | ▶ 00:52 |
That's just not practical, so we're going to have to come up with methods | ▶ 00:57 |
that are faster than enumerating everything. | ▶ 01:01 |
The first technique we can use to get a speed-up in doing inference on Bayes nets | ▶ 01:04 |
is to pull out terms from the enumeration. | ▶ 01:09 |
For example, here the probability of b is going to be the same for all values of E and a. | ▶ 01:13 |
So we can take that term and move it out of the summation, | ▶ 01:20 |
and now we have a little bit less work to do. | ▶ 01:26 |
We can multiply by that term once rather than having it in each row of the table. | ▶ 01:28 |
We can also move this term, the P of e, to the left of the summation over a, | ▶ 01:33 |
because it doesn't depend on a. | ▶ 01:40 |
By doing this, we're doing less work. | ▶ 01:43 |
The inner loop of the summation now has only 3 terms rather than 5 terms. | ▶ 01:45 |
So we've reduced the cost of doing each row of the table. | ▶ 01:50 |
But we still have the same number of rows in the table, | ▶ 01:53 |
so we're going to have to do better than that. | ▶ 01:57 |
The next technique for efficient inference is to maximize independence of variables. | ▶ 02:00 |
The structure of a Bayes net determines how efficient it is to do inference on it. | ▶ 02:08 |
For example, a network that's a linear string of variables, | ▶ 02:12 |
X1 through Xn, can have inference done in time proportional to the number n, | ▶ 02:17 |
whereas a network that's a complete network | ▶ 02:27 |
where every node points to every other node and so on could take time 2 to the n | ▶ 02:31 |
if all n variables are boolean variables. | ▶ 02:40 |
In the alarm network we saw previously, we took care | ▶ 02:45 |
to make sure that we had all the independence relations represented | ▶ 02:50 |
in the structure of the network. | ▶ 02:54 |
But if we put the nodes together in a different order, | ▶ 02:57 |
we would end up with a different structure. | ▶ 03:00 |
Let's start by ordering the node John calls first | ▶ 03:03 |
and then adding in the node Mary calls. | ▶ 03:09 |
The question is, given just these 2 nodes and looking at the node for Mary calls, | ▶ 03:13 |
is that node dependent or independent of the node for John calls? | ▶ 03:19 |
[Norvig] The answer is that the node for Mary calls in this network | ▶ 00:01 |
is dependent on John calls. | ▶ 00:05 |
In the previous network, they were independent given that we knew that the alarm had occurred. | ▶ 00:08 |
But here we don't know that the alarm had occurred, | ▶ 00:13 |
and so the nodes are dependent | ▶ 00:16 |
because having information about one will affect the information about the other. | ▶ 00:18 |
[Norvig] Now we'll continue and we'll add the node A for alarm to the network. | ▶ 00:00 |
And what I want you to do is click on all the other variables | ▶ 00:05 |
that A is dependent on in this network. | ▶ 00:09 |
[Norvig] The answer is that alarm is dependent on both John and Mary. | ▶ 00:01 |
And so we can draw both arrows in. | ▶ 00:05 |
Intuitively that makes sense because if John calls, | ▶ 00:09 |
then it's more likely that the alarm has occurred, | ▶ 00:14 |
and similarly if Mary calls; and if both call, it's really likely. | ▶ 00:16 |
So you can figure out the answer by intuitive reasoning, | ▶ 00:20 |
or you can figure it out by going to the conditional probability tables | ▶ 00:23 |
and seeing according to the definition of conditional probability | ▶ 00:27 |
whether the numbers work out. | ▶ 00:31 |
[Norvig] Now we'll continue and we'll add the node B for burglary | ▶ 00:01 |
and ask again, click on all the variables that B is dependent on. | ▶ 00:05 |
[Norvig] The answer is that B is dependent only on A. | ▶ 00:00 |
In other words, B is independent of J and M given A. | ▶ 00:04 |
[Norvig] And finally, we'll add the last node, E, | ▶ 00:00 |
and ask you to click on all the nodes that E is dependent on. | ▶ 00:04 |
[Norvig] And the answer is that E is dependent on A. | ▶ 00:00 |
That much is fairly obvious. | ▶ 00:04 |
But it's also dependent on B. | ▶ 00:06 |
Now, why is that? | ▶ 00:08 |
E is dependent on A because if the earthquake did occur, | ▶ 00:10 |
then it's more likely that the alarm would go off. | ▶ 00:13 |
On the other hand, E is also dependent on B | ▶ 00:16 |
because if a burglary occurred, then that would explain why the alarm is going off, | ▶ 00:19 |
and it would mean that the earthquake is less likely. | ▶ 00:23 |
[Norvig] The moral is that Bayes nets tend to be the most compact | ▶ 00:00 |
and thus the easiest to do inference on when they're written in the causal direction-- | ▶ 00:04 |
that is, when the networks flow from causes to effects. | ▶ 00:12 |
Let's return to this equation, which we use to show how to do inference by enumeration. | ▶ 00:00 |
In this equation, we join up the whole joint distribution | ▶ 00:06 |
before we sum out over the hidden variables. | ▶ 00:10 |
That's slow, because we end up repeating a lot of work. | ▶ 00:15 |
Now we're going to show a new technique called variable elimination, | ▶ 00:18 |
which in many networks operates much faster. | ▶ 00:25 |
It's still a difficult computation, an NP-hard computation, | ▶ 00:27 |
to do inference over Bayes nets in general. | ▶ 00:30 |
Variable elimination works faster than inference by enumeration | ▶ 00:34 |
in most practical cases. | ▶ 00:38 |
It requires an algebra for manipulating factors, | ▶ 00:41 |
which are just names for multidimensional arrays | ▶ 00:45 |
that come out of these probabilistic terms. | ▶ 00:48 |
We'll use another example to show how variable elimination works. | ▶ 00:53 |
We'll start off with a network that has 3 boolean variables. | ▶ 00:57 |
R indicates whether or not it's raining. | ▶ 01:00 |
T indicates whether or not there's traffic, | ▶ 01:04 |
and T is dependent on whether it's raining. | ▶ 01:12 |
And finally, L indicates whether or not I'll be late for my next appointment, | ▶ 01:15 |
and that depends on whether or not there's traffic. | ▶ 01:19 |
Now we'll put up the conditional probability tables for each of these 3 variables. | ▶ 01:22 |
And then we can use inference to figure out the answer to questions like | ▶ 01:29 |
am I going to be late? | ▶ 01:35 |
And we know by definition that we could do that through enumeration | ▶ 01:38 |
by going through all the possible values for R and T | ▶ 01:42 |
and summing up the product of these 3 nodes. | ▶ 01:47 |
Now, in a simple network like this, straight enumeration would work fine, | ▶ 01:54 |
but in a more complex network, what variable elimination does is give us a way | ▶ 01:59 |
to combine together parts of the network into smaller parts | ▶ 02:03 |
and then enumerate over those smaller parts and then continue combining. | ▶ 02:09 |
So, we start with a big network. | ▶ 02:13 |
We eliminate some of the variables. | ▶ 02:15 |
We compute by marginalizing out, and then we have a smaller network to deal with, | ▶ 02:17 |
and we'll show you how those 2 steps work. | ▶ 02:24 |
The first operation in variable elimination is called joining factors. | ▶ 02:28 |
A factor, again, is one of these tables. | ▶ 02:35 |
It's a multidimensional matrix, and what we do is choose 2 of the factors, | ▶ 02:39 |
2 or more of the factors. | ▶ 02:43 |
In this case, we'll choose these 2, and we'll combine them together | ▶ 02:45 |
to form a new factor which represents | ▶ 02:49 |
the joint probability of all the variables in that factor. | ▶ 02:52 |
In this case, R and T. | ▶ 02:56 |
Now we'll draw out that table. | ▶ 03:00 |
In each case, we just look up in the corresponding table, | ▶ 03:03 |
figure out the numbers, and multiply them together. | ▶ 03:06 |
For example, in this row we have a +r and a +t, | ▶ 03:08 |
so the +r is 0.1, and the entry for +r and +t is 0.8, | ▶ 03:13 |
so multiply them together and you get 0.08. | ▶ 03:19 |
Go all the way down. For example, in the last row we have a -r and a -t. | ▶ 03:22 |
-r is 0.9. The entry for -r and -t is also 0.9. | ▶ 03:28 |
Multiply those together and you get 0.81. | ▶ 03:34 |
So, what have we done? | ▶ 03:40 |
We used the operation of joining factors on these 2 factors, | ▶ 03:42 |
getting us a new factor which is part of the existing network. | ▶ 03:45 |
Now we want to apply a second operation called elimination, | ▶ 03:50 |
also called summing out or marginalization, to take this table and reduce it. | ▶ 03:56 |
Right now, the tables we have look like this. | ▶ 04:02 |
We could sum out or marginalize over the variable R | ▶ 04:06 |
to give us a table that just operates on T. | ▶ 04:10 |
So, the question is to fill in this table for P(T)-- | ▶ 04:14 |
there will be 2 entries in this table, the +t entry, formed by summing out | ▶ 04:20 |
all the entries here for all values of r for which t is positive, | ▶ 04:23 |
and the -t entry, formed the same way, by looking in this table | ▶ 04:28 |
and summing up all the rows over all values of r where t is negative. | ▶ 04:32 |
Put your answers in these boxes. | ▶ 04:37 |
The answer is that for +t we look up the 2 possible values for r, | ▶ 00:00 |
and we get 0.08 and 0.09. | ▶ 00:05 |
Sum those up, get 0.17, | ▶ 00:09 |
and then we look at the 2 possible values of R for -t, | ▶ 00:13 |
and we get 0.02 and 0.81. | ▶ 00:18 |
Add those up, and we get 0.83. | ▶ 00:22 |
So, we took our network with RT and L. We summed out over R. | ▶ 00:00 |
That gives us a new network with T and L | ▶ 00:04 |
with these conditional probability tables. | ▶ 00:09 |
And now we want to do a join over T and L | ▶ 00:13 |
and give us a new table with the joint probability of P(T, L). | ▶ 00:17 |
And that table is going to look like this. | ▶ 00:25 |
The answer, again, for joining variables is determined by pointwise multiplication, | ▶ 00:00 |
so for +t and +l we have 0.17 times 0.3, which is 0.051, | ▶ 00:05 |
and for +t and -l, 0.17 times 0.7 is 0.119. | ▶ 00:12 |
Then we go to the minuses. | ▶ 00:21 |
For -t and +l, 0.83 times 0.1 is 0.083. | ▶ 00:23 |
And finally, 0.83 times 0.9 is 0.747. | ▶ 00:31 |
Now we're down to a network with a single node over T and L, | ▶ 00:00 |
with this joint probability table, and the only operation we have left to do | ▶ 00:06 |
is to sum out to give us a node with just L in it. | ▶ 00:12 |
So, the question is to compute P(L) for both values of L, | ▶ 00:17 |
+l and -l. | ▶ 00:26 |
The answer is that for the +l values, | ▶ 00:00 |
0.051 plus 0.083 equals 0.134. | ▶ 00:03 |
And for the negative values, 0.119 plus 0.747 | ▶ 00:11 |
equals 0.866. | ▶ 00:15 |
So, that's how variable elimination works. | ▶ 00:00 |
It's a continued process of joining together factors | ▶ 00:03 |
to form a larger factor and then eliminating variables by summing out. | ▶ 00:06 |
If we make a good choice of the order in which we apply these operations, | ▶ 00:11 |
then variable elimination can be much more efficient | ▶ 00:15 |
than just doing the whole enumeration. | ▶ 00:18 |
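For reference, here is a small Python sketch of those two operations, join and sum out, applied to the rain, traffic, late network with the numbers from the example above.

```python
# Variable elimination on R -> T -> L, using the lecture's numbers.

P_R = {'+r': 0.1, '-r': 0.9}
P_T_given_R = {('+r', '+t'): 0.8, ('+r', '-t'): 0.2,
               ('-r', '+t'): 0.1, ('-r', '-t'): 0.9}
P_L_given_T = {('+t', '+l'): 0.3, ('+t', '-l'): 0.7,
               ('-t', '+l'): 0.1, ('-t', '-l'): 0.9}

# Join P(R) and P(T | R) into a factor P(R, T) by pointwise multiplication.
P_RT = {(r, t): P_R[r] * P_T_given_R[(r, t)]
        for r in ('+r', '-r') for t in ('+t', '-t')}

# Eliminate R: sum it out to get P(T) = {'+t': 0.17, '-t': 0.83}.
P_T = {t: sum(P_RT[(r, t)] for r in ('+r', '-r')) for t in ('+t', '-t')}

# Join P(T) and P(L | T), then eliminate T to get P(L).
P_TL = {(t, l): P_T[t] * P_L_given_T[(t, l)]
        for t in ('+t', '-t') for l in ('+l', '-l')}
P_L = {l: sum(P_TL[(t, l)] for t in ('+t', '-t')) for l in ('+l', '-l')}

print(P_L)   # {'+l': 0.134, '-l': 0.866}, up to floating-point rounding
```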
Now I want to talk about approximate inference | ▶ 00:00 |
by means of sampling. | ▶ 00:07 |
What do I mean by that? | ▶ 00:12 |
Say we want to deal with a joint probability distribution, | ▶ 00:14 |
say the distribution of heads and tails over these 2 coins. | ▶ 00:17 |
We can build a table and then start counting by sampling. | ▶ 00:24 |
Here we have our first sample. | ▶ 00:30 |
We flip the coins and the one-cent piece came up heads, | ▶ 00:32 |
and the five-cent piece came up tails, | ▶ 00:35 |
so we would mark down one count. | ▶ 00:39 |
Then we'd toss them again. | ▶ 00:42 |
This time the five cents is heads, and the one cent is tails, | ▶ 00:45 |
so we put down a count there, and we'd repeat that process | ▶ 00:50 |
and keep repeating it until we got enough counts that we could estimate | ▶ 01:00 |
the joint probability distribution by looking at the counts. | ▶ 01:06 |
Now, if we do a small number of samples, the counts might not be very accurate. | ▶ 01:11 |
There may be some random variation that causes them not to converge | ▶ 01:15 |
to their true values, but as we add more samples, | ▶ 00:19 |
the counts we get will come closer to the true distribution. | ▶ 00:25 |
Thus, sampling has an advantage over inference in that we know a procedure | ▶ 01:29 |
for coming up with at least an approximate value for the joint probability distribution, | ▶ 01:35 |
as opposed to exact inference, where the computation may be very complex. | ▶ 01:42 |
There's another advantage to sampling, which is if we don't know | ▶ 01:50 |
what the conditional probability tables are, as we did in our other models, | ▶ 01:53 |
if we don't know these numeric values, but we can simulate the process, | ▶ 01:59 |
we can still proceed with sampling, whereas we couldn't with exact inference. | ▶ 02:04 |
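As a minimal sketch, the two-coin counting procedure looks like this in Python, here assuming both coins are fair:

```python
# Estimate a joint distribution by sampling: flip two coins repeatedly
# and count how often each (one-cent, five-cent) outcome occurs.
import random
from collections import Counter

def flip(p_heads=0.5):
    return 'H' if random.random() < p_heads else 'T'

counts = Counter()
n = 100_000
for _ in range(n):
    counts[(flip(), flip())] += 1    # one sample of the two coins

for outcome, c in sorted(counts.items()):
    print(outcome, c / n)            # approaches the true joint distribution
```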
Here's a new network that we'll use to investigate | ▶ 00:00 |
how sampling can be used to do inference. | ▶ 00:05 |
In this network, we have 4 variables. They're all boolean. | ▶ 00:10 |
Cloudy tells us if it's cloudy or not outside, | ▶ 00:14 |
and that can have an effect on whether the sprinklers are turned on, | ▶ 00:17 |
and whether it's raining. | ▶ 00:21 |
And those 2 variables in turn have an effect on whether the grass gets wet. | ▶ 00:23 |
Now, to do inference over this network using sampling, | ▶ 00:28 |
we start off with a variable where all the parents are defined. | ▶ 00:34 |
In this case, there's only one such variable, Cloudy. | ▶ 00:38 |
And its conditional probability table tells us that the probability is 50% for Cloudy, | ▶ 00:42 |
50% for not Cloudy, and so we sample from that. | ▶ 00:48 |
We generate a random number, and let's say it comes up with positive for Cloudy. | ▶ 00:52 |
Now that variable is defined, we can choose another variable. | ▶ 00:59 |
In this case, let's choose Sprinkler, and we look at the rows in the table | ▶ 01:02 |
for which Cloudy, the parent, is positive, and we see we should sample | ▶ 01:08 |
with probability 10% for +s and 90% for -s. | ▶ 01:13 |
And so let's say we do that sampling with a random number generator, | ▶ 01:19 |
and it comes up negative for Sprinkler. | ▶ 01:23 |
Now let's jump over here. Look at the Rain variable. | ▶ 01:26 |
Again, the parent, Cloudy, is positive, | ▶ 01:29 |
so we're looking at this part of the table. | ▶ 01:34 |
We get a 0.8 probability for Rain being positive, | ▶ 01:38 |
and a 0.2 probability for Rain being negative. | ▶ 01:41 |
Let's say we sample that randomly, and it comes up Rain is positive. | ▶ 01:44 |
And now we're ready to sample the final variable, | ▶ 01:51 |
and what I want you to do is tell me which of the rows | ▶ 01:54 |
of this table should we be considering and tell me what's more likely. | ▶ 02:01 |
Is it more likely that we have a +w or a -w? | ▶ 02:07 |
The answer to the question is that we look at the parents. | ▶ 00:00 |
We find that the Sprinkler variable is negative, | ▶ 00:03 |
so we're looking at this part of the table. | ▶ 00:06 |
And the Rain variable is positive, so we're looking at this part. | ▶ 00:09 |
So, it would be these 2 rows that we would consider, | ▶ 00:14 |
and thus, we'd find there's a 0.9 probability for w, the grass being wet, | ▶ 00:18 |
and only 0.1 for it being negative, | ▶ 00:25 |
so the positive is more likely. | ▶ 00:28 |
And once we've done that, we have generated a complete sample, | ▶ 00:31 |
and we can write down the sample here. | ▶ 00:34 |
We had +c, -s, +r. | ▶ 00:37 |
And assuming the 0.9-probability outcome came out in favor of +w, | ▶ 00:43 |
that would be the end of the sample. | ▶ 00:51 |
Then we could throw all this information out and start over again | ▶ 00:54 |
by having another 50/50 choice for cloudy and then working our way through the network. | ▶ 00:59 |
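Here is a short Python sketch of that sampling procedure. The lecture gives only some of the table entries (P(+c) = 0.5, P(+s | +c) = 0.1, P(+r | +c) = 0.8, P(+w | -s, +r) = 0.9, P(+w | +s, +r) = 0.99); the remaining rows below are assumed values for illustration.

```python
# Draw one complete sample by walking the network from parents to children.
import random

def bernoulli(p):
    return random.random() < p

P_C = 0.5
P_S = {True: 0.1, False: 0.5}                      # P(+s | C); False row assumed
P_R = {True: 0.8, False: 0.2}                      # P(+r | C); False row assumed
P_W = {(True, True): 0.99, (False, True): 0.90,    # P(+w | S, R)
       (True, False): 0.90, (False, False): 0.01}  # these two rows assumed

def prior_sample():
    c = bernoulli(P_C)             # sample the root first
    s = bernoulli(P_S[c])          # then each child, given its sampled parents
    r = bernoulli(P_R[c])
    w = bernoulli(P_W[(s, r)])
    return c, s, r, w

print(prior_sample())              # e.g. (True, False, True, True) = +c, -s, +r, +w
```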
Now, the probability of sampling a particular variable, | ▶ 00:00 |
choosing a +w or a -w, depends on the values of the parents. | ▶ 00:04 |
But those are chosen according to the conditional probability tables, | ▶ 00:10 |
so in the limit, the count of each sampled variable | ▶ 00:14 |
will approach the true probability. | ▶ 00:18 |
That is, with an infinite number of samples, this procedure computes the true | ▶ 00:20 |
joint probability distribution. | ▶ 00:24 |
We say that the sampling method is consistent. | ▶ 00:27 |
We can use this kind of sampling to compute the complete joint probability distribution, | ▶ 00:33 |
or we can use it to compute a value for an individual variable. | ▶ 00:38 |
But what if we wanted to compute a conditional probability? | ▶ 00:43 |
Say we wanted to compute the probability of wet grass | ▶ 00:47 |
given that it's not cloudy. | ▶ 00:53 |
To do that, the sample that we generated here wouldn't be helpful at all | ▶ 00:58 |
because it has to do with being cloudy, not with being not cloudy. | ▶ 01:03 |
So, we would cross this sample off the list. | ▶ 01:08 |
We would say that we reject the sample, and this technique is called rejection sampling. | ▶ 01:11 |
We go through ignoring any samples that don't match | ▶ 01:17 |
the conditional probabilities that we're interested in | ▶ 01:21 |
and keeping samples that do, say the sample -c, +s, +r, -w. | ▶ 01:24 |
We would just continue going through generating samples, | ▶ 01:34 |
crossing off the ones that don't match, keeping the ones that do. | ▶ 01:37 |
And this procedure would also be consistent. | ▶ 01:41 |
We call this procedure rejection sampling. | ▶ 01:46 |
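A sketch of rejection sampling for P(+w | -c), reusing prior_sample() and the partly assumed tables from the previous sketch:

```python
# Rejection sampling: draw prior samples and throw away every one
# that contradicts the evidence (here, the evidence is -c).

def rejection_sample(n=100_000):
    kept, wet = 0, 0
    for _ in range(n):
        c, s, r, w = prior_sample()
        if c:                      # evidence is -c, so reject samples with +c
            continue
        kept += 1
        wet += w
    return wet / kept              # estimate of P(+w | -c)

print(rejection_sample())
```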
But there's a problem with rejection sampling. | ▶ 00:00 |
If the evidence is unlikely, you end up rejecting a lot of the samples. | ▶ 00:03 |
Let's go back to the alarm network where we had variables for burglary and for an alarm | ▶ 00:08 |
and say we're interested in computing the probability of a burglary, | ▶ 00:16 |
given that the alarm goes off. | ▶ 00:22 |
The problem is that burglaries are very infrequent, | ▶ 00:25 |
so most of the samples we would get would end up being-- | ▶ 00:28 |
we start with generating a B, and we get a -b and then a -a. | ▶ 00:32 |
We go back and say does this match? | ▶ 00:39 |
No, we have to reject this sample, | ▶ 00:43 |
so we generate another sample, and we get another -b, -a. | ▶ 00:45 |
We reject that. We get another -b, -a. | ▶ 00:50 |
And we keep rejecting, and eventually we get a +b, | ▶ 00:54 |
but we'd end up spending a lot of time rejecting samples. | ▶ 01:00 |
So, we're going to introduce a new method called likelihood weighting | ▶ 01:04 |
that generates samples so that we can keep every one. | ▶ 01:13 |
With likelihood weighting, we fix the evidence variables. | ▶ 01:17 |
That is, we say that A will always be positive, | ▶ 01:20 |
and then we sample the rest of the variables, | ▶ 01:25 |
so then we get samples that we want. | ▶ 01:28 |
We would get a list like -b, +a, | ▶ 01:31 |
-b, +a, | ▶ 01:37 |
+b, +a. | ▶ 01:40 |
We get to keep every sample, but we have a problem. | ▶ 01:42 |
The resulting set of samples is inconsistent. | ▶ 01:46 |
We can fix that, however, by assigning a probability | ▶ 01:52 |
to each sample and weighting them correctly. | ▶ 01:56 |
In likelihood weighting, we're going to be collecting samples just like before, | ▶ 00:00 |
but we're going to add a probabilistic weight to each sample. | ▶ 00:05 |
Now, let's say we want to compute the probability of rain | ▶ 00:11 |
given that the sprinklers are on, and the grass is wet. | ▶ 00:17 |
We start as before. | ▶ 00:22 |
We make a choice for Cloudy, and let's say that, again, | ▶ 00:24 |
we choose Cloudy being positive. | ▶ 00:30 |
Now we want to make a choice for Sprinkler, | ▶ 00:33 |
but we're constrained to always choose Sprinkler being positive, | ▶ 00:37 |
so we'll make that choice. | ▶ 00:41 |
And we know we were dealing with Cloudy being positive, | ▶ 00:44 |
so we're in this row, and we're forced to make the choice of Sprinkler being positive, | ▶ 00:50 |
and that has a probability of only 0.1, so we'll put that 0.1 into the weight. | ▶ 00:56 |
Next, we'll look at the Rain variable, | ▶ 01:05 |
and here we're not constrained in any way, so we make a choice | ▶ 01:09 |
according to the probability tables with Cloudy being positive. | ▶ 01:13 |
And let's say that we choose the more popular choice, and Rain gets the positive value. | ▶ 01:19 |
Now, we look at Wet Grass. | ▶ 01:27 |
We're constrained to choose positive, and we know that the parents | ▶ 01:30 |
are also positive, so we're dealing with this row here. | ▶ 01:35 |
Since it's a constrained choice, we're going to add in or multiply in an additional weight, | ▶ 01:41 |
and I want you to tell me what that weight should be. | ▶ 01:47 |
The answer is we're looking for the probability | ▶ 00:00 |
of having a +w given a +s and a +r, | ▶ 00:04 |
so that's in this row, so it's 0.99. | ▶ 00:09 |
So, we take our old weight and multiply it by 0.99, | ▶ 00:16 |
gives us a final weight of 0.099 | ▶ 00:22 |
for a sample of +c, +s, +r and +w. | ▶ 00:28 |
When we include the weights, | ▶ 00:00 |
counting this sample that was forced to have a +s and a +w | ▶ 00:03 |
with a weight of 0.099, instead of counting it as one full sample, | ▶ 00:08 |
we find that likelihood weighting is also consistent. | ▶ 00:14 |
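Here is a sketch of likelihood weighting for the query P(+r | +s, +w), again reusing the partly assumed tables from the sketches above:

```python
# Likelihood weighting: evidence variables are fixed rather than sampled,
# and each sample is weighted by the probability of those forced choices.

def likelihood_weighting(n=100_000):
    weighted_rain, weighted_total = 0.0, 0.0
    for _ in range(n):
        weight = 1.0
        c = bernoulli(P_C)          # non-evidence: sampled freely
        weight *= P_S[c]            # Sprinkler forced to +s: multiply in P(+s | c)
        r = bernoulli(P_R[c])       # non-evidence: sampled freely
        weight *= P_W[(True, r)]    # WetGrass forced to +w: multiply in P(+w | +s, r)
        weighted_total += weight
        if r:
            weighted_rain += weight
    return weighted_rain / weighted_total   # estimate of P(+r | +s, +w)

print(likelihood_weighting())
```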
Likelihood weighting is a great technique, | ▶ 00:00 |
but it doesn't solve all our problems. | ▶ 00:03 |
Suppose we wanted to compute the probability of C given +s and +r. | ▶ 00:05 |
In other words, we're constraining Sprinkler and Rain to always be positive. | ▶ 00:14 |
Since we use the evidence when we generate a node that has that evidence as parents, | ▶ 00:21 |
the Wet Grass node will always get good values based on that evidence. | ▶ 00:27 |
But the Cloudy node won't, and so it will be generating values at random | ▶ 00:31 |
without looking at these values, and some of the time | ▶ 00:39 |
it will be generating values that don't go well with the evidence. | ▶ 00:44 |
Now, we won't have to reject them like we do in rejection sampling, | ▶ 00:48 |
but they'll have a low probability associated with them. | ▶ 00:51 |
A technique called Gibbs sampling, | ▶ 00:00 |
named after the physicist Josiah Gibbs, | ▶ 00:07 |
takes all the evidence into account and not just the upstream evidence. | ▶ 00:10 |
It uses a method called Markov Chain Monte Carlo, or MCMC. | ▶ 00:14 |
The idea is that we resample just one variable at a time | ▶ 00:26 |
conditioned on all the others. | ▶ 00:31 |
That is, we have a set of variables, | ▶ 00:33 |
and we initialize them to random values, keeping the evidence values fixed. | ▶ 00:37 |
Maybe we have values like this, | ▶ 00:44 |
and that constitutes one sample, and now, at each iteration through the loop, | ▶ 00:48 |
we select just one non-evidence variable and resample it | ▶ 00:54 |
based on all the other variables. | ▶ 01:01 |
And that will give us another sample, and repeat that again. | ▶ 01:04 |
Choose another variable. | ▶ 01:11 |
Resample that variable and repeat. | ▶ 01:15 |
We end up walking around in this space of assignments of variables randomly. | ▶ 01:21 |
Now, in rejection sampling and likelihood weighting, | ▶ 01:27 |
each sample was independent of the other samples. | ▶ 01:30 |
In MCMC, that's not true. | ▶ 01:34 |
The samples are dependent on each other, and in fact, | ▶ 01:37 |
adjacent samples are very similar. | ▶ 01:40 |
They only vary or differ in one place. | ▶ 01:42 |
However, the technique is still consistent. We won't show the proof for that. | ▶ 01:46 |
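For completeness, here is a rough Gibbs-sampling sketch for P(+c | +s, +r) on the sprinkler network, with the same partly assumed tables as above. A real implementation would also discard an initial burn-in period.

```python
# Gibbs sampling: hold the evidence (+s, +r) fixed and repeatedly resample
# each non-evidence variable conditioned on its Markov blanket.

def gibbs(n=100_000):
    s, r = True, True                 # evidence: never resampled
    c = bernoulli(0.5)                # arbitrary initialization
    w = bernoulli(0.5)
    count_c = 0
    for _ in range(n):
        # C's Markov blanket is its children S and R, so
        # P(C | s, r, w) is proportional to P(C) * P(s | C) * P(r | C).
        # The evidence is positive, so we use the +s and +r rows.
        score_t = P_C * P_S[True] * P_R[True]
        score_f = (1 - P_C) * P_S[False] * P_R[False]
        c = bernoulli(score_t / (score_t + score_f))
        # W's Markov blanket is just its parents S and R.
        w = bernoulli(P_W[(s, r)])
        count_c += c
    return count_c / n                # estimate of P(+c | +s, +r)

print(gibbs())
```

In this tiny network the conditional for C happens not to depend on W, so the chain mixes immediately; in larger networks each resampling step uses the current values of the variable's neighbors, and adjacent samples differ in only one place, just as described above.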
Now, just one more thing. | ▶ 00:00 |
I can't help but describe what is probably the most famous probability problem of all. | ▶ 00:02 |
It's called the Monty Hall Problem after the game show host. | ▶ 00:07 |
And the idea is that you're on a game show, and there's 3 doors: | ▶ 00:11 |
door #1, door #2, and door #3. | ▶ 00:15 |
And behind each door is a prize, and you know that one of the doors | ▶ 00:20 |
contains an expensive sports car, which you would find desirable, | ▶ 00:26 |
and each of the other 2 doors contains a goat, which you would find less desirable. | ▶ 00:29 |
Now, say you're given a choice, and let's say you choose door #1. | ▶ 00:35 |
But according to the conventions of the game, the host, Monty Hall, | ▶ 00:42 |
will now open one of the doors, knowing that the door that he opens | ▶ 00:47 |
contains a goat, and he shows you door #3. | ▶ 00:52 |
And he now gives you the opportunity to stick with your choice | ▶ 00:57 |
or to switch to the other door. | ▶ 01:02 |
What I want you to tell me is, what is your probability of winning | ▶ 01:05 |
if you stick to door #1, and what is the probability of winning | ▶ 01:10 |
if you switched to door #2? | ▶ 01:15 |
The answer is that you have a 1/3 chance of winning if you stick with door #1 | ▶ 00:00 |
and a 2/3 chance if you switch to door #2. | ▶ 00:08 |
How do we explain that, and why isn't it 50/50? | ▶ 00:12 |
Well, it's true that there's 2 possibilities, | ▶ 00:16 |
but we've learned from probability that just because there are 2 options | ▶ 00:18 |
doesn't mean that both options are equally likely. | ▶ 00:22 |
It's easier to explain why the first door has a 1/3 probability | ▶ 00:26 |
because when you started, the car could be in any one of 3 places. | ▶ 00:30 |
You chose one of them. That probability was 1/3. | ▶ 00:34 |
And that probability hasn't been changed by the revealing of one of the other doors. | ▶ 00:37 |
Why is door #2 two-thirds? | ▶ 00:43 |
Well, one way to explain it is that the probability has to sum to 1, | ▶ 00:45 |
and if 1/3 is here, the 2/3 has to be here. | ▶ 00:49 |
But why doesn't the same argument that you used for door #1 hold for door #2? | ▶ 00:53 |
Why can't we say the probability of 2 holding the car | ▶ 00:58 |
was 1/3 before this door was revealed? | ▶ 01:03 |
Why has that changed for door #2 but not for door #1? | ▶ 01:07 |
And the reason is because we've learned something about door #2. | ▶ 01:11 |
We've learned that it wasn't the door that was opened by the host, | ▶ 01:14 |
and so that additional information has updated the probability, | ▶ 01:18 |
whereas we haven't learned anything additional about door #1 | ▶ 01:22 |
because it was never an option that the host might open door #1. | ▶ 01:26 |
And in fact, in this case, if we reveal the door, | ▶ 01:30 |
we find that's where the car actually is. | ▶ 01:37 |
So you see, learning probability may end up winning you something. | ▶ 01:40 |
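If you would rather convince yourself empirically, here is a quick Monty Hall simulation in Python:

```python
# Simulate the Monty Hall game: sticking wins about 1/3 of the time,
# switching about 2/3.
import random

def play(switch, trials=100_000):
    wins = 0
    for _ in range(trials):
        car = random.randrange(3)
        pick = random.randrange(3)
        # The host opens a door that is neither your pick nor the car
        # (which goat door he opens doesn't affect the stick/switch rates).
        opened = next(d for d in range(3) if d != pick and d != car)
        if switch:
            pick = next(d for d in range(3) if d != pick and d != opened)
        wins += (pick == car)
    return wins / trials

print("stick: ", play(switch=False))   # ~0.333
print("switch:", play(switch=True))    # ~0.667
```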
Now, as a final epilogue, I have here a copy of a letter written by Monty Hall himself | ▶ 00:00 |
in 1990 to Professor Lawrence Denenberg of Harvard | ▶ 00:07 |
who, with Harry Lewis, wrote a statistics book | ▶ 00:10 |
in which they used the Monty Hall Problem as an example, | ▶ 00:14 |
and they wrote to Monty asking him for permission to use his name. | ▶ 00:18 |
Monty kindly granted the permission, but in his letter, | ▶ 00:23 |
he writes, "As I see it, it wouldn't make any difference after the player | ▶ 00:26 |
has selected Door A, and having been shown Door C-- | ▶ 00:31 |
why should he then attempt to switch to Door B?" | ▶ 00:34 |
So, we see Monty Hall himself did not understand the Monty Hall Problem. | ▶ 00:38 |
[Thrun] Given the following Bayes network with P of A equal to 0.5, | ▶ 00:00 |
P of B given A equal to 0.2, | ▶ 00:06 |
and P of B given not A equal to 0.8, | ▶ 00:08 |
calculate the following probability. | ▶ 00:12 |
[Thrun] Consider a network of the following type: | ▶ 00:00 |
a variable, A, that is binary connects to three variables, X1, X2, and X3, | ▶ 00:03 |
that are also binary. | ▶ 00:10 |
The probability of A is 0.5, and for all variables Xi we have the probability of Xi given A is 0.2, | ▶ 00:12 |
and the probability of Xi given not A equals 0.6. | ▶ 00:24 |
I would like to know from you the probability of A | ▶ 00:29 |
given that we observed X1, X2, and not X3. | ▶ 00:31 |
Notice that these variables over here are conditionally independent given A. | ▶ 00:37 |
[Thrun] Let us consider the same network again. | ▶ 00:00 |
I would like to know the probability of X3 given that I observed X1. | ▶ 00:03 |
[Thrun] In this next homework assignment I will be drawing you a Bayes network | ▶ 00:00 |
and will ask you some conditional independence questions. | ▶ 00:04 |
Is B conditionally independent of C? And say yes or no. | ▶ 00:09 |
Is B conditionally independent of C given D? And say yes or no. | ▶ 00:14 |
Is B conditionally independent of C given A? And say yes or no. | ▶ 00:19 |
And is B conditionally independent of C given A and D? And say yes or no. | ▶ 00:24 |
[Thrun] Consider the following network. | ▶ 00:00 |
I would like to know whether the following statements are true or false. | ▶ 00:02 |
C is conditionally independent of E given A. | ▶ 00:08 |
B is conditionally independent of D given C and E. | ▶ 00:12 |
A is conditionally independent of C given E. | ▶ 00:18 |
And A is conditionally independent of C given B. | ▶ 00:21 |
Please check yes or no for each of these questions. | ▶ 00:25 |
[Thrun] In my final question I'll look at the exact same network as before, | ▶ 00:00 |
but I would like to know the minimum number of numerical parameters | ▶ 00:04 |
such as the values to define probabilities and conditional probabilities | ▶ 00:08 |
that are necessary to specify the joint distribution of all 5 variables. | ▶ 00:13 |
[Thrun] The answer is 0.2, | ▶ 00:00 |
and this follows directly from Bayes' rule. | ▶ 00:03 |
In this formula, we can read off the first 2 values straight from the table over here, | ▶ 00:07 |
and we expand the denominator by total probability. | ▶ 00:11 |
Observing that this is exactly the same expression as up here, | ▶ 00:15 |
we get 0.1 divided by 0.1 plus this expression over here, which can be copied from over here, | ▶ 00:19 |
and P of not A is directly obtained up here. | ▶ 00:27 |
Hence we get 0.5 over here, and as a result we get 0.2. | ▶ 00:30 |
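Written out, the calculation being described is:

$$P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B \mid A)\,P(A) + P(B \mid \lnot A)\,P(\lnot A)} = \frac{0.2 \cdot 0.5}{0.2 \cdot 0.5 + 0.8 \cdot 0.5} = \frac{0.1}{0.5} = 0.2$$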
[Thrun] For this question we will be exploring a little trick | ▶ 00:00 |
about non-normalized probability. | ▶ 00:03 |
We will observe that P of A given X1, X2 and not X3, | ▶ 00:05 |
the expression on the left can be resolved by Bayes' rule into this expression over here. | ▶ 00:11 |
We will take X3 to the left and replace it by A, | ▶ 00:16 |
both conditioned on the variables X1 and X2. | ▶ 00:20 |
Then we have P of not X3 given A, X1, and X2, times P of A given X1 and X2, divided by P of not X3 given X1 and X2. | ▶ 00:23 |
Next we employ 2 things. | ▶ 00:29 |
One is the denominator does not depend on A, | ▶ 00:31 |
so whether I put an A or not A has no bearing on any calculation here, | ▶ 00:34 |
which means I can defer its calculation until later, and it will turn out to be important. | ▶ 00:39 |
So I'm going to be proportional to just the stuff over here. | ▶ 00:44 |
And second, I exploit my conditional independence | ▶ 00:49 |
whereby I can omit X1 and X2 from the probability of not X3 conditioned on A. | ▶ 00:52 |
These variables are conditionally independent. | ▶ 00:58 |
This gives me the following recursion | ▶ 01:02 |
where I now removed the third variable from the estimation problem | ▶ 01:05 |
and just retained the first 2 relative to my initial expression. | ▶ 01:10 |
If I keep expanding this, I get the following solution. | ▶ 01:14 |
P of not X3 given A, times P of X2 given A, times P of X1 given A, times P of A. | ▶ 01:19 |
You might take a minute to just verify this, | ▶ 01:27 |
but this is exploiting the conditional independence | ▶ 01:30 |
very much as in the first step I showed you over here. | ▶ 01:32 |
This step lacks the normalizer, | ▶ 01:35 |
so let me work on the normalizer by expressing the opposite probability, | ▶ 01:38 |
P of not A given the same events, X1, X2, and not X3, | ▶ 01:44 |
which resolves to P of not X3 given not A, | ▶ 01:50 |
P of X2 given not A, P of X1 given not A, | ▶ 01:54 |
and P of not A. | ▶ 02:00 |
I can now plug in the values from above. | ▶ 02:02 |
So the first term gives me 0.8 times 0.2 times 0.2 times 0.5. | ▶ 02:04 |
In the second term I get 0.4 times 0.6 times 0.6 times 0.5, | ▶ 02:15 |
which resolves to 0.016 and 0.072. | ▶ 02:24 |
This is clearly not a probability because we left out the normalizer. | ▶ 02:31 |
But as we know, the normalizer does not depend on whether I put A or not A in here. | ▶ 02:36 |
As a result, it will be the same for both of these expressions, | ▶ 02:40 |
and I can obtain it by just adding these non-normalized probabilities | ▶ 02:44 |
and then subsequently divide these non-normalized probabilities accordingly. | ▶ 02:47 |
So let me just do this. | ▶ 02:52 |
We get for the desired probability over here 0.1818 | ▶ 02:55 |
and for the inverse probability over here 0.8182. | ▶ 03:01 |
Our desired answer therefore is 0.1818. | ▶ 03:08 |
This was not an easy question. | ▶ 03:14 |
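Written out, the non-normalized computation and its normalization are:

$$\bar{P}(A \mid x_1, x_2, \lnot x_3) = P(\lnot x_3 \mid A)\,P(x_2 \mid A)\,P(x_1 \mid A)\,P(A) = 0.8 \cdot 0.2 \cdot 0.2 \cdot 0.5 = 0.016$$

$$\bar{P}(\lnot A \mid x_1, x_2, \lnot x_3) = 0.4 \cdot 0.6 \cdot 0.6 \cdot 0.5 = 0.072$$

$$P(A \mid x_1, x_2, \lnot x_3) = \frac{0.016}{0.016 + 0.072} \approx 0.1818$$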
[Thrun] The answer is a little bit involved. | ▶ 00:00 |
We use total probability to re-express this by bringing in A. | ▶ 00:03 |
P of X3 given X1 is the sum of P of X3 given X1 and A | ▶ 00:08 |
times P of A given X1, plus the complement term, P of X3 given X1 and not A, | ▶ 00:15 |
times P of not A given X1. | ▶ 00:22 |
That is just total probability. | ▶ 00:24 |
Next we utilized conditional independence by which we can simplify this expression | ▶ 00:26 |
to drop X1 in the conditional variables | ▶ 00:30 |
and we transform this expression by Bayes' rule again. | ▶ 00:33 |
The same applies to the right side with not A replacing A. | ▶ 00:36 |
All of those expressions over here can be found | ▶ 00:41 |
either in the table up there or just by their complements, | ▶ 00:45 |
with the exception of P of X1. | ▶ 00:49 |
But P of X1 can again be just obtained by total probability, | ▶ 00:52 |
which resolves to 0.2 times 0.5 plus 0.6 times 0.5, | ▶ 00:58 |
which gives me 0.4. | ▶ 01:11 |
We are now in a position to calculate the last term over here, which goes as follows. | ▶ 01:13 |
This expression is 0.2 times 0.2 times 0.5 over 0.4 plus 0.6 times 0.6 times 0.5 over 0.4, | ▶ 01:19 |
which gives us as a final result 0.5. | ▶ 01:36 |
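Written out, the whole calculation is:

$$P(x_3 \mid x_1) = P(x_3 \mid a)\,\frac{P(x_1 \mid a)\,P(a)}{P(x_1)} + P(x_3 \mid \lnot a)\,\frac{P(x_1 \mid \lnot a)\,P(\lnot a)}{P(x_1)} = 0.2 \cdot \frac{0.1}{0.4} + 0.6 \cdot \frac{0.3}{0.4} = 0.05 + 0.45 = 0.5$$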
[Thrun] And the answer is as follows. | ▶ 00:00 |
No, no, yes, and no. | ▶ 00:02 |
B and C in the absence of any other information are dependent through A, | ▶ 00:06 |
which is if you learn something about B, you can infer something about A, | ▶ 00:11 |
and then we'll know more about C. | ▶ 00:17 |
If you know D, that doesn't change a thing. | ▶ 00:20 |
You can just take D out of the pool. | ▶ 00:22 |
If you know A, B and C become conditionally independent. | ▶ 00:24 |
This dependence goes away, and ignorance of D doesn't render B and C dependent. | ▶ 00:29 |
However, if we add D back to the mix, | ▶ 00:36 |
then knowledge of D will render B and C dependent by way of the explaining away effect. | ▶ 00:39 |
[Thrun] So the correct answer is tricky in this case. | ▶ 00:00 |
It is no, no, no, and yes. | ▶ 00:03 |
The first one is straightforward. | ▶ 00:07 |
C and E are conditionally independent based on D, | ▶ 00:09 |
and knowledge of A doesn't change anything. | ▶ 00:13 |
B and D are conditionally independent through A, | ▶ 00:15 |
and knowledge of C or E doesn't change that. | ▶ 00:20 |
The case of A and C is interesting. | ▶ 00:23 |
A and C are independent, but if you know D, they become dependent. | ▶ 00:25 |
It turns out if you know E, you can know something about D, | ▶ 00:29 |
and as a result, A and C become dependent through the explaining away effect. | ▶ 00:32 |
That doesn't apply if you know B. | ▶ 00:37 |
Even though B tells you something about E, | ▶ 00:39 |
it tells you nothing about D because B and D are independent. | ▶ 00:42 |
Therefore, knowing B tells you nothing about D, | ▶ 00:46 |
and the explaining away effect does not occur between A and C. | ▶ 00:49 |
The answer here is yes. | ▶ 00:52 |
[Thrun] The correct answer is 16. | ▶ 00:00 |
The probability of A and C require 1 parameter each. | ▶ 00:03 |
The complement of not A and not C follows by 1 minus that parameter. | ▶ 00:06 |
This guy over here requires 2 parameters. | ▶ 00:12 |
You need to know the probability of B given A and B given not A. | ▶ 00:15 |
The complements can be obtained easily. | ▶ 00:18 |
The probability of D is conditioned on 2 variables which can take 4 possible values. | ▶ 00:20 |
Hence the number is 4. | ▶ 00:24 |
And E is conditioned on 3 variables, so its parents can take a total of 8 different values, | ▶ 00:26 |
2 to the 3rd, which is 8. | ▶ 00:30 |
If you add 8 plus 4 plus 2 plus 1 plus 1, you get 16. | ▶ 00:32 |
Welcome to the machine learning unit. | ▶ 00:00 |
Machine learning is a fascinating area. | ▶ 00:03 |
The world has become immeasurably data-rich. | ▶ 00:06 |
The world wide web has grown up over the last decade. | ▶ 00:09 |
The human genome is being sequenced. | ▶ 00:12 |
Vast chemical databases, pharmaceutical databases, | ▶ 00:15 |
and financial databases are now available | ▶ 00:19 |
on a scale unthinkable even 5 years ago. | ▶ 00:22 |
To make sense out of the data, | ▶ 00:26 |
to extract information from the data, | ▶ 00:28 |
machine learning is the discipline to go to. | ▶ 00:30 |
Machine learning is an important subfield of artificial intelligence; | ▶ 00:33 |
it's my personal favorite next to robotics | ▶ 00:37 |
because I believe it has a huge impact on society | ▶ 00:40 |
and is absolutely necessary as we move forward. | ▶ 00:43 |
So in this class, I teach you some of the very basics of | ▶ 00:47 |
machine learning, and in our next unit | ▶ 00:50 |
Peter will tell you some more about machine learning. | ▶ 00:52 |
We'll talk about supervised learning, which is one side of machine learning, | ▶ 00:56 |
and Peter will tell you about unsupervised learning, | ▶ 01:00 |
which is a different style. | ▶ 01:02 |
Later in this class we will also encounter reinforcement learning, | ▶ 01:05 |
which is yet another style of machine learning. | ▶ 01:07 |
Anyhow, let's just dive in. | ▶ 01:10 |
Welcome to the first class on machine learning. | ▶ 00:00 |
So far we talked a lot about Bayes Networks. | ▶ 00:03 |
And the way we talked about them | ▶ 00:07 |
is all about reasoning within Bayes Networks | ▶ 00:10 |
that are known. | ▶ 00:14 |
Machine learning addresses the problem | ▶ 00:15 |
of how to find those networks | ▶ 00:17 |
or other models | ▶ 00:19 |
based on data. | ▶ 00:20 |
Learning models from data | ▶ 00:22 |
is a major, major area of artificial intelligence | ▶ 00:25 |
and it's perhaps the one | ▶ 00:29 |
that had the most commercial success. | ▶ 00:31 |
In many commercial applications | ▶ 00:33 |
the models themselves are fitted | ▶ 00:37 |
based on data. | ▶ 00:39 |
For example, Google | ▶ 00:40 |
uses data to understand | ▶ 00:42 |
how to respond to each search query. | ▶ 00:44 |
Amazon uses data | ▶ 00:46 |
to understand how to place products on their website. | ▶ 00:49 |
And these machine learning techniques | ▶ 00:52 |
are the enabling techniques that make that possible. | ▶ 00:53 |
So this class | ▶ 00:56 |
which is about supervised learning | ▶ 00:57 |
will go through some very basic methods | ▶ 00:59 |
for learning models from data | ▶ 01:02 |
in particular, specific types of Bayes Networks. | ▶ 01:04 |
We will complement this | ▶ 01:06 |
with a class on unsupervised learning | ▶ 01:08 |
that will be taught next | ▶ 01:10 |
after this class. | ▶ 01:14 |
Let me start off with a quiz. | ▶ 01:15 |
The quiz is: What companies are famous | ▶ 01:18 |
for machine learning using data? | ▶ 01:20 |
Google for mining the web. | ▶ 01:24 |
Netflix for mining what people | ▶ 01:29 |
would like to rent on DVDs. | ▶ 01:31 |
That is, DVD recommendations. | ▶ 01:36 |
Amazon.com for product placement. | ▶ 01:40 |
Check any or all | ▶ 01:45 |
and if none of those apply | ▶ 01:47 |
check down here. | ▶ 01:49 |
And, not surprisingly, the answer is | ▶ 00:00 |
all of those companies and many, many, many more | ▶ 00:03 |
use massive machine learning for making decisions | ▶ 00:06 |
that are really essential to the businesses. | ▶ 00:09 |
Google mines the web and uses machine learning for translation, | ▶ 00:12 |
as we've seen in the introductory level. Netflix has used | ▶ 00:15 |
machine learning extensively for understanding what type of DVD to recommend to you next. | ▶ 00:18 |
Amazon composes its entire product pages using | ▶ 00:22 |
machine learning by understanding how customers | ▶ 00:25 |
respond to different compositions and placements of their products, | ▶ 00:28 |
and many, many other examples exist. | ▶ 00:31 |
I would argue that in Silicon Valley, | ▶ 00:35 |
at least half the companies dealing with customers and online products | ▶ 00:37 |
do extensively use machine learning, | ▶ 00:41 |
so it makes machine learning a really exciting discipline. | ▶ 00:43 |
In my own research, I've extensively used machine learning for robotics. | ▶ 00:00 |
What you see here is a robot my students and I built at Stanford | ▶ 00:05 |
called Stanley, and it won the DARPA Grand Challenge. | ▶ 00:08 |
It's a self-driving car that drives without any human assistance whatsoever, | ▶ 00:12 |
and this vehicle extensively uses machine learning. | ▶ 00:16 |
The robot is equipped with a laser system. | ▶ 00:22 |
I will talk more about lasers in my robotics class, | ▶ 00:25 |
but here you can see how the robot is able to build | ▶ 00:28 |
3-D models of the terrain ahead. | ▶ 00:31 |
These are almost like video game models that allow it to make | ▶ 00:34 |
assessments where to drive and where not to drive. | ▶ 00:37 |
Essentially, it's trying to drive on flat ground. | ▶ 00:39 |
The problem with these lasers is that they don't see very far. | ▶ 00:43 |
They see about 25 meters out, so to drive really fast | ▶ 00:46 |
the robot has to see further. | ▶ 00:50 |
This is where machine learning comes into play. | ▶ 00:53 |
What you see here is camera images delivered by the robot | ▶ 00:56 |
superimposed with laser data that doesn't see very far, | ▶ 00:58 |
but the laser is good enough to extract samples | ▶ 01:01 |
of driveable road surface that can then be machine learned | ▶ 01:04 |
and extrapolated into the entire camera image. | ▶ 01:08 |
That enables the robot to use the camera | ▶ 01:10 |
to see driveable terrain all the way to the horizon | ▶ 01:13 |
up to like 200 meters out, enough to drive really, really fast. | ▶ 01:16 |
This ability to adapt its vision, by deriving its own training examples using lasers | ▶ 01:22 |
but seeing out 200 meters or more | ▶ 01:27 |
was a key factor in winning the race. | ▶ 01:30 |
Machine learning is a very large field | ▶ 00:00 |
with many different methods | ▶ 00:03 |
and many different applications. | ▶ 00:04 |
I will now define some of the very basic terminology | ▶ 00:06 |
that is being used to distinguish | ▶ 00:10 |
different machine learning methods. | ▶ 00:12 |
Let's start with the what. | ▶ 00:13 |
What is being learned? | ▶ 00:17 |
You can learn parameters | ▶ 00:19 |
like the probabilities of a Bayes Network. | ▶ 00:23 |
You can learn structure | ▶ 00:26 |
like the arc structure of a Bayes Network. | ▶ 00:27 |
And you might even discover hidden concepts. | ▶ 00:31 |
For example | ▶ 00:34 |
you might find that certain training examples | ▶ 00:35 |
form a hidden group. | ▶ 00:37 |
For example, with Netflix | ▶ 00:39 |
you might find that there's different types of customers | ▶ 00:41 |
some that care about classic movies | ▶ 00:43 |
some of them care about modern movies | ▶ 00:45 |
and those might form hidden concepts | ▶ 00:47 |
whose discovery can really help you | ▶ 00:49 |
make better sense of the data. | ▶ 00:51 |
Next is what from? | ▶ 00:53 |
Every machine learning method | ▶ 00:57 |
is driven by some sort of target information | ▶ 01:00 |
that you care about. | ▶ 01:02 |
In supervised learning | ▶ 01:03 |
which is the subject of today's class | ▶ 01:06 |
we're given specific target labels | ▶ 01:08 |
and I give you examples just in a second. | ▶ 01:10 |
We also talk about unsupervised learning | ▶ 01:13 |
where target labels are missing | ▶ 01:15 |
and we use replacement principles | ▶ 01:19 |
to find, for example | ▶ 01:21 |
hidden concepts. | ▶ 01:22 |
Later there will be a class in reinforcement learning | ▶ 01:24 |
when an agent learns from feedback with the physical environment | ▶ 01:27 |
by interacting and trying actions | ▶ 01:32 |
and receiving some sort of evaluation | ▶ 01:34 |
from the environment | ▶ 01:37 |
like "Well done" or "That works." | ▶ 01:37 |
Again, we will talk about those in detail later. | ▶ 01:41 |
There's different things you could try to do | ▶ 01:43 |
with a machine learning technique. | ▶ 01:46 |
You might care about prediction. | ▶ 01:48 |
For example, you might care about what's going to happen in the future | ▶ 01:49 |
in the stock market. | ▶ 01:53 |
You might care to diagnose something | ▶ 01:55 |
which is you get data and you wish to explain it | ▶ 01:57 |
and you use machine learning for that. | ▶ 01:59 |
Sometimes your objective is to summarize something. | ▶ 02:01 |
For example if you read a long article | ▶ 02:04 |
your machine learning method might aim to | ▶ 02:07 |
produce a short article that summarizes the long article. | ▶ 02:09 |
And there's many, many, many more different things. | ▶ 02:12 |
You can also talk about the how of learning. | ▶ 02:14 |
We use the word passive | ▶ 02:16 |
if your learning agent is just an observer | ▶ 02:19 |
and has no impact on the data itself. | ▶ 02:23 |
Otherwise, you call it active. | ▶ 02:24 |
Sometimes learning occurs online | ▶ 02:26 |
which means while the data is being generated | ▶ 02:30 |
and some of it is offline | ▶ 02:32 |
which means learning occurs | ▶ 02:35 |
after the data has been generated. | ▶ 02:37 |
There's different types of outputs | ▶ 02:39 |
of a machine learning algorithm. | ▶ 02:42 |
Today we'll talk about classification | ▶ 02:44 |
versus regression. | ▶ 02:47 |
In classification the output is binary | ▶ 02:50 |
or a fixed number of classes | ▶ 02:53 |
for example something is either a chair or not. | ▶ 02:55 |
Regression is continuous. | ▶ 02:57 |
The temperature might be 66.5 degrees | ▶ 02:59 |
in our prediction. | ▶ 03:01 |
And there's tons of internal details | ▶ 03:03 |
we will talk about. | ▶ 03:05 |
Just to name one. | ▶ 03:07 |
We will distinguish generative | ▶ 03:09 |
from discriminative. | ▶ 03:12 |
Generative seeks to model the data | ▶ 03:14 |
as generally as possible | ▶ 03:16 |
versus discriminative methods | ▶ 03:18 |
seek to distinguish data | ▶ 03:20 |
and this might sound like a superficial distinction | ▶ 03:21 |
but it has enormous ramifications | ▶ 03:24 |
on the learning algorithm. | ▶ 03:26 |
Now to tell you the truth | ▶ 03:27 |
it took me many years | ▶ 03:29 |
to fully learn all these words here | ▶ 03:30 |
and I don't expect you to pick them all up | ▶ 03:33 |
in one class | ▶ 03:36 |
but you should at least know that they exist. | ▶ 03:37 |
And as they come up | ▶ 03:39 |
I'll emphasize them | ▶ 03:41 |
so you can sort any learning method | ▶ 03:42 |
I tell you about back into the specific taxonomy over here. | ▶ 03:44 |
The vast amount of work in the field | ▶ 00:00 |
falls into the area of supervised learning. | ▶ 00:02 |
In supervised learning | ▶ 00:06 |
you're given for each training example | ▶ 00:08 |
a feature vector | ▶ 00:10 |
and a target label named Y. | ▶ 00:13 |
For example, for a credit rating agency | ▶ 00:16 |
X1, X2, X3 might be features | ▶ 00:20 |
such as is the person employed? | ▶ 00:23 |
What is the salary of the person? | ▶ 00:25 |
Has the person previously defaulted on a credit card? | ▶ 00:27 |
And so on. | ▶ 00:30 |
And Y is a prediction of | ▶ 00:32 |
whether the person is going to default | ▶ 00:34 |
on the credit or not. | ▶ 00:36 |
Now machine learning | ▶ 00:38 |
is to be carried out on past data | ▶ 00:40 |
where the credit rating agency | ▶ 00:42 |
might have collected features just like these | ▶ 00:44 |
and actual occurrences of default or not. | ▶ 00:46 |
What it wishes to produce | ▶ 00:49 |
is a function that allows us | ▶ 00:51 |
to predict future customers. | ▶ 00:53 |
So the new person comes in | ▶ 00:55 |
with a different feature vector. | ▶ 00:56 |
Can we predict as well as possible | ▶ 00:58 |
the functional relationship | ▶ 01:00 |
between these features X1 to Xn all the way to Y? | ▶ 01:02 |
You can apply the exact same example | ▶ 01:05 |
in image recognition | ▶ 01:08 |
where X might be pixels of images | ▶ 01:09 |
or it might be features of things found in images | ▶ 01:11 |
and Y might be a label that says | ▶ 01:14 |
whether a certain object is contained | ▶ 01:16 |
in an image or not. | ▶ 01:17 |
Now in supervised learning | ▶ 01:19 |
you're given many such examples. | ▶ 01:20 |
X21 to X2n | ▶ 01:25 |
leads to Y2, | ▶ 01:28 |
and so on, all the way to index m. | ▶ 01:32 |
This is called your data. | ▶ 01:35 |
We call each input vector Xm, | ▶ 01:38 |
and we wish to find the function that, | ▶ 01:43 |
given any Xm or any future vector X, | ▶ 01:44 |
produces a value as close as possible | ▶ 01:50 |
to my target signal Y. | ▶ 01:53 |
Now this isn't always possible | ▶ 01:55 |
and sometimes it's acceptable | ▶ 01:57 |
in fact preferable | ▶ 01:59 |
to tolerate a certain amount of error | ▶ 02:00 |
in your training data. | ▶ 02:03 |
But the subject of machine learning | ▶ 02:05 |
is to identify this function over here. | ▶ 02:07 |
And once you identify it | ▶ 02:10 |
you can use it for future Xs | ▶ 02:11 |
that weren't part of the training set | ▶ 02:13 |
to produce a prediction | ▶ 02:16 |
that hopefully is really, really good. | ▶ 02:19 |
So let me ask you a question. | ▶ 02:21 |
And this is a question | ▶ 02:24 |
for which I haven't given you the answer | ▶ 02:27 |
but I'd like to appeal to your intuition. | ▶ 02:28 |
Here's one data set | ▶ 02:31 |
where the X is one dimensionally plotted horizontally | ▶ 02:34 |
and the Y is vertically | ▶ 02:37 |
and suppose the data looks like this. | ▶ 02:39 |
Suppose my machine learning algorithm | ▶ 02:44 |
gives me 2 hypotheses. | ▶ 02:45 |
One is this function over here | ▶ 02:47 |
which is a linear function | ▶ 02:51 |
and one is this function over here. | ▶ 02:52 |
I'd like to know which of the functions | ▶ 02:53 |
you find preferable | ▶ 02:57 |
as an explanation for the data. | ▶ 02:59 |
Is it function A? | ▶ 03:01 |
Or function B? | ▶ 03:02 |
Check here for A | ▶ 03:06 |
here for B | ▶ 03:08 |
and here for neither. | ▶ 03:09 |
And I hope you guessed function A. | ▶ 00:00 |
Even though both perfectly describe the data | ▶ 00:04 |
B is much more complex than A. | ▶ 00:08 |
In fact, outside the data | ▶ 00:10 |
B seems to go to minus infinity much faster | ▶ 00:12 |
than these data points | ▶ 00:16 |
and to plus infinity much faster | ▶ 00:17 |
with these data points over here. | ▶ 00:19 |
And in between | ▶ 00:21 |
we have wide oscillations | ▶ 00:22 |
that don't correspond to any data. | ▶ 00:23 |
So I would argue | ▶ 00:25 |
A is preferable. | ▶ 00:27 |
The reason why I asked this question | ▶ 00:31 |
is because of something called Occam's Razor. | ▶ 00:32 |
Occam can be spelled in many different ways. | ▶ 00:35 |
And what Occam says is that | ▶ 00:38 |
everything else being equal | ▶ 00:41 |
choose the less complex hypothesis. | ▶ 00:43 |
Now in practice | ▶ 00:46 |
there's actually a trade-off | ▶ 00:48 |
between a really good data fit | ▶ 00:50 |
and low complexity. | ▶ 00:53 |
Let me illustrate this to you | ▶ 00:55 |
by a hypothetical example. | ▶ 00:58 |
Consider the following graph | ▶ 00:59 |
where the horizontal axis graphs | ▶ 01:02 |
complexity of the solution. | ▶ 01:04 |
For example, if you use polynomials | ▶ 01:07 |
this might be a high-degree polynomial over here | ▶ 01:10 |
and maybe a linear function over here | ▶ 01:12 |
which is a low-degree polynomial | ▶ 01:14 |
your training data error | ▶ 01:16 |
tends to go like this. | ▶ 01:19 |
The more complex the hypothesis you allow | ▶ 01:22 |
the more you can just fit your data. | ▶ 01:25 |
However, in reality | ▶ 01:29 |
your generalization error on unknown data | ▶ 01:31 |
tends to go like this. | ▶ 01:33 |
It is the sum of the training data error | ▶ 01:37 |
and another function | ▶ 01:40 |
which is called the overfitting error. | ▶ 01:42 |
Not surprisingly | ▶ 01:46 |
the best complexity is obtained | ▶ 01:47 |
where the generalization error is minimum. | ▶ 01:49 |
There are methods | ▶ 01:52 |
to calculate the overfitting error. | ▶ 01:53 |
They go into a statistical field | ▶ 01:55 |
under the name bias-variance methods. | ▶ 01:57 |
However, in practice | ▶ 02:01 |
you're often just given the training data error. | ▶ 02:02 |
You'll find that if you don't pick the model | ▶ 02:04 |
that minimizes the training data error | ▶ 02:08 |
but instead push back the complexity, | ▶ 02:11 |
your algorithm tends to perform better | ▶ 02:14 |
and that is something we will study a little bit | ▶ 02:17 |
in this class. | ▶ 02:20 |
However, this slide is really important | ▶ 02:22 |
for anybody doing machine learning in practice. | ▶ 02:26 |
If you deal with data | ▶ 02:29 |
and you have ways to fit your data | ▶ 02:31 |
be aware that overfitting | ▶ 02:33 |
is a major source of poor performance | ▶ 02:36 |
of a machine learning algorithm. | ▶ 02:39 |
And I give you examples in just one second. | ▶ 02:41 |
So a really important example | ▶ 00:00 |
of machine learning is SPAM detection. | ▶ 00:02 |
We all get way too much email | ▶ 00:04 |
and a good number of those are SPAM. | ▶ 00:06 |
Here are 3 examples of email. | ▶ 00:08 |
Dear Sir: First I must solicit your confidence | ▶ 00:12 |
in this transaction, this is by virtue of its nature | ▶ 00:14 |
being utterly confidential and top secret... | ▶ 00:16 |
This is likely SPAM. | ▶ 00:19 |
Here's another one. | ▶ 00:22 |
In upper caps. | ▶ 00:23 |
99 MILLION EMAIL ADDRESSES FOR ONLY $99 | ▶ 00:25 |
This is very likely SPAM. | ▶ 00:28 |
And here's another one. | ▶ 00:31 |
Oh, I know it's blatantly OT | ▶ 00:33 |
but I'm beginning to go insane. | ▶ 00:35 |
Had an old Dell Dimension XPS sitting in the corner | ▶ 00:37 |
and decided to put it to use. | ▶ 00:40 |
And so on and so on. | ▶ 00:41 |
Now this is likely not SPAM. | ▶ 00:42 |
How can a computer program | ▶ 00:45 |
distinguish between SPAM and not SPAM? | ▶ 00:47 |
Let's use this as an example | ▶ 00:49 |
to talk about machine learning for discrimination | ▶ 00:51 |
using Bayes Networks. | ▶ 00:55 |
In SPAM detection | ▶ 00:59 |
we get an email | ▶ 01:01 |
and we wish to categorize it | ▶ 01:03 |
either as SPAM | ▶ 01:05 |
in which case we don't even show it to the user, | ▶ 01:07 |
or what we call HAM | ▶ 01:10 |
which is the technical word for | ▶ 01:12 |
an email worth passing on to the person being emailed. | ▶ 01:15 |
So the function over here | ▶ 01:19 |
is the function we're trying to learn. | ▶ 01:21 |
Most SPAM filters use human input. | ▶ 01:23 |
When you go through email | ▶ 01:26 |
you have a button called IS SPAM | ▶ 01:28 |
which allows you as a user to flag SPAM | ▶ 01:32 |
and occasionally you will say an email is SPAM. | ▶ 01:34 |
If you look at this | ▶ 01:37 |
you have a typical supervised machine learning situation | ▶ 01:40 |
where the input is an email | ▶ 01:43 |
and the output is whether you flag it as SPAM | ▶ 01:45 |
or if we don't flag it | ▶ 01:47 |
we just think it's HAM. | ▶ 01:49 |
Now to make this amenable to | ▶ 01:52 |
a machine learning algorithm | ▶ 01:54 |
we have to talk about how to represent emails. | ▶ 01:55 |
They're all using different words and different characters | ▶ 01:57 |
and they might have different graphics included. | ▶ 02:00 |
Let's pick a representation that's easy to process. | ▶ 02:02 |
And this representation is often called | ▶ 02:06 |
Bag of Words. | ▶ 02:09 |
Bag of Words is a representation | ▶ 02:10 |
of a document | ▶ 02:14 |
that just counts the frequency | ▶ 02:15 |
of words. | ▶ 02:17 |
If an email were to say | ▶ 02:18 |
"Hello I will say Hello," | ▶ 02:22 |
The Bag of Words representation | ▶ 02:24 |
is the following. | ▶ 02:26 |
2-1-1-1 | ▶ 02:27 |
for the dictionary | ▶ 02:31 |
that contains the 4 words | ▶ 02:33 |
Hello I will say. | ▶ 02:36 |
Now look at the subtlety here. | ▶ 02:38 |
Rather than representing each individual word | ▶ 02:41 |
we have a count of each word | ▶ 02:43 |
and the count is oblivious | ▶ 02:46 |
to the order in which the words were stated. | ▶ 02:49 |
A Bag of Words representation | ▶ 02:52 |
relative to a fixed dictionary | ▶ 02:55 |
represents the counts of each word | ▶ 02:57 |
relative to the words in the dictionary. | ▶ 03:01 |
If you were to use a different dictionary | ▶ 03:03 |
like hello and good-bye | ▶ 03:06 |
our counts would be | ▶ 03:08 |
2 and 0. | ▶ 03:10 |
However, in most cases | ▶ 03:13 |
you make sure that all the words found | ▶ 03:14 |
in messages | ▶ 03:17 |
are actually included in the dictionary. | ▶ 03:18 |
So the dictionary might be very, very large. | ▶ 03:19 |
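Here is a minimal Python sketch of the bag-of-words representation just described, reproducing the counts from the example:

```python
# Bag of words: count each word's frequency relative to a fixed
# dictionary, ignoring the order in which the words appear.
from collections import Counter

def bag_of_words(text, dictionary):
    counts = Counter(text.lower().split())
    return [counts[word] for word in dictionary]

print(bag_of_words("Hello I will say Hello", ["hello", "i", "will", "say"]))
# -> [2, 1, 1, 1]
print(bag_of_words("Hello I will say Hello", ["hello", "good-bye"]))
# -> [2, 0]
```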
Let me make up an unofficial example | ▶ 03:22 |
of a few SPAM and a few HAM messages. | ▶ 03:25 |
Offer is secret. | ▶ 03:30 |
Click secret link. | ▶ 03:32 |
Secret sports link. | ▶ 03:35 |
Obviously those are contrived | ▶ 03:37 |
and I tried to restrict the vocabulary | ▶ 03:40 |
to a small number of words | ▶ 03:42 |
to make this example workable. | ▶ 03:44 |
In practice we need thousands | ▶ 03:46 |
of such messages | ▶ 03:47 |
to get good information. | ▶ 03:48 |
Play sports today. | ▶ 03:50 |
Went play sports. | ▶ 03:52 |
Secret sports event. | ▶ 03:54 |
Sport is today. | ▶ 03:56 |
Sport costs money. | ▶ 03:59 |
My first quiz is | ▶ 04:02 |
What is the size of the vocabulary | ▶ 04:06 |
that contains all words in these messages? | ▶ 04:08 |
Please enter the value in this box over here. | ▶ 04:12 |
Well let's count. | ▶ 00:00 |
Offer, is, secret, click. | ▶ 00:02 |
Secret occurs over here already | ▶ 00:08 |
so we don't have to count it twice. | ▶ 00:10 |
Link, sports, play, today, went, event, | ▶ 00:12 |
costs, money. | ▶ 00:18 |
So the answer is | ▶ 00:20 |
12. | ▶ 00:22 |
There's 12 different words | ▶ 00:24 |
contained in these 8 messages. | ▶ 00:26 |
[Narrator] Another quiz. | ▶ 00:00 |
What is the probability that a random message | ▶ 00:03 |
that arrives will fall into the spam bucket? | ▶ 00:06 |
Assuming that those messages | ▶ 00:09 |
are all drawn at random. | ▶ 00:11 |
[writing on page] | ▶ 00:13 |
[Narrator] And the answer is: | ▶ 00:00 |
there's 8 different messages | ▶ 00:02 |
of which 3 are spam. | ▶ 00:04 |
So the maximum likelihood estimate | ▶ 00:06 |
is 3/8. | ▶ 00:09 |
[writing on paper] | ▶ 00:11 |
So, let's look at this a little bit more formally and talk about maximum likelihood. | ▶ 00:00 |
Obviously, we're observing 8 messages: spam, spam, spam, and 5 times ham. | ▶ 00:03 |
And what we care about is what's our prior probability of spam | ▶ 00:12 |
that maximizes the likelihood of this data? | ▶ 00:17 |
So, let's assume we're going to assign a value of pi to this, | ▶ 00:20 |
and we wish to find the pi that maximizes the likelihood of this data over here, | ▶ 00:24 |
assuming that each email is drawn independently | ▶ 00:29 |
according to an identical distribution. | ▶ 00:33 |
The probability p(yi) of data item i is then pi if yi = spam, | ▶ 00:37 |
and 1 - pi if yi = ham. | ▶ 00:48 |
If we rewrite the data as 1, 1, 1, 0, 0, 0, 0, 0, | ▶ 00:53 |
we can write p(yi) as follows: pi to the yi times (1 - pi) to the 1 - yi. | ▶ 00:59 |
It's not that easy to see that this is equivalent, | ▶ 01:13 |
but say yi = 1. | ▶ 01:16 |
Then this term will fall out. | ▶ 01:19 |
It becomes 1 because the exponent is zero, and we get pi, as over here. | ▶ 01:22 |
If yi = 0, then this term falls out, and this one here becomes 1 - pi as over here. | ▶ 01:28 |
Now assuming independence, we get for the entire data set | ▶ 01:36 |
that the joint probability of all data items is the product | ▶ 01:44 |
of the individual data items over here, | ▶ 01:49 |
which can now be written as follows: | ▶ 01:52 |
pi to the count of instances where yi = 1 times | ▶ 01:56 |
1 - pi to the count of the instances where yi = 0. | ▶ 02:03 |
And we know in our example, this count over here is 3, | ▶ 02:09 |
and this count over here is 5, so we get pi to the 3rd times 1 - pi to the 5th. | ▶ 02:13 |
We now wish to find the pi that maximizes this expression over here. | ▶ 02:22 |
We can also maximize the logarithm of this expression, | ▶ 02:28 |
which is 3 times log pi + 5 times log (1 - pi) | ▶ 02:33 |
Optimizing the log is the same as optimizing p because the log is monotonic in p. | ▶ 02:42 |
The maximum of this function is attained where the derivative is 0, | ▶ 02:50 |
so let's compute the derivative and set it to 0. | ▶ 02:54 |
This is the derivative: 3 over pi minus 5 over (1 - pi). | ▶ 03:00 |
We now bring this expression to the right side, | ▶ 03:05 |
multiply the denominators up, and sort all the expressions containing pi to the left, | ▶ 03:09 |
which gives us pi = 3/8, exactly the number we were at before. | ▶ 03:18 |
We just derived mathematically that the data likelihood maximizing number | ▶ 03:26 |
for the probability is indeed the empirical count, | ▶ 03:33 |
which means when we looked at this quiz before | ▶ 03:37 |
and we said a maximum likelihood for the prior probability of spam is 3/8, | ▶ 03:41 |
by simply counting that 3 out of 8 emails were spam, | ▶ 03:49 |
we actually followed proper mathematical principles | ▶ 03:54 |
to do maximum likelihood estimation. | ▶ 03:57 |
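For reference, the derivation just walked through can be written out compactly, with the labels encoded as y_i = 1 for spam and y_i = 0 for ham:

    $$p(y_i) = \pi^{y_i} (1 - \pi)^{1 - y_i}$$
    $$p(\text{data}) = \prod_{i=1}^{8} p(y_i) = \pi^{3} (1 - \pi)^{5}$$
    $$\log p(\text{data}) = 3 \log \pi + 5 \log (1 - \pi)$$
    $$\frac{3}{\pi} - \frac{5}{1 - \pi} = 0 \quad\Rightarrow\quad 3(1 - \pi) = 5\pi \quad\Rightarrow\quad \pi = \frac{3}{8}$$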
Now, you might not fully have gotten the derivation of this, | ▶ 03:59 |
and I recommend you watch it again, but it's not that important | ▶ 04:03 |
for the progress in this class. | ▶ 04:07 |
So, here's another quiz. | ▶ 04:09 |
I'd like the maximum likelihood, or ML, solutions | ▶ 04:11 |
for the following probabilities: | ▶ 04:17 |
The probability that the word "secret" comes up, | ▶ 04:19 |
assuming that we already know a message is spam, | ▶ 04:21 |
and the probability that the same word "secret" comes up | ▶ 04:25 |
if we happen to know the message is not spam, it's ham. | ▶ 04:28 |
And just as before | ▶ 00:00 |
we count the word secret | ▶ 00:02 |
in SPAM and in HAM | ▶ 00:04 |
as I've underlined here. | ▶ 00:06 |
Three out of 9 words in SPAM | ▶ 00:07 |
are the word secret | ▶ 00:11 |
so we have a third over here | ▶ 00:12 |
or 0.333 | ▶ 00:14 |
and only 1 out of all the 15 words in HAM | ▶ 00:18 |
is the word secret, | ▶ 00:21 |
so you get a fifteenth | ▶ 00:22 |
or 0.0667. | ▶ 00:23 |
By now, you might have recognized what we're really building up is a Bayes network | ▶ 00:00 |
where the parameters of the Bayes networks are estimated using supervised learning | ▶ 00:06 |
by a maximum likelihood estimator based on training data. | ▶ 00:10 |
The Bayes network has at its root an unobservable variable called spam, | ▶ 00:15 |
which is binary, and it has as many children as there are words in a message, | ▶ 00:20 |
where each word has an identical conditional distribution | ▶ 00:28 |
of the word occurrence given the class spam or not spam. | ▶ 00:33 |
If you look at our dictionary over here, | ▶ 00:39 |
you might remember the dictionary had 12 different words, | ▶ 00:42 |
so here are 5 of the 12: offer, is, secret, click, and sports. | ▶ 00:48 |
Then for the spam class, we found the probability of secret given spam is 1/3, | ▶ 00:52 |
and we also found that the probability of secret given ham is 1/15, | ▶ 00:59 |
so here's a quiz. | ▶ 01:05 |
Assuming a vocabulary size of 12, or put differently, | ▶ 01:07 |
the dictionary has 12 words, how many parameters | ▶ 01:12 |
do we need to specify this Bayes network? | ▶ 01:16 |
And the correct answer is 23. | ▶ 00:00 |
We need 1 parameter for the prior p (spam), | ▶ 00:03 |
and then we have 2 dictionary distributions: the probability of any word | ▶ 00:07 |
i given spam, and the same for ham. | ▶ 00:12 |
Now, there are 12 words in the dictionary, | ▶ 00:16 |
but each distribution only needs 11 parameters, | ▶ 00:18 |
since the 12th can be figured out because they have to add up to 1. | ▶ 00:20 |
And the same is true over here, so if you add all these together, | ▶ 00:24 |
we get 23. | ▶ 00:27 |
So, here's a quiz. | ▶ 00:00 |
Let's assume we fit all the 23 parameters of the Bayes network | ▶ 00:02 |
as explained using maximum likelihood. | ▶ 00:06 |
Let's now do classification and see what class a message ends up in. | ▶ 00:09 |
Let me start with a very simple message, and it contains a single word | ▶ 00:14 |
just to make it a little bit simpler. | ▶ 00:18 |
What's the probability that we classify this one-word message, "sports", as spam? | ▶ 00:21 |
And the answer is 0.1667 or 3/18. | ▶ 00:00 |
How do I get there? Well, let's apply Bayes rule. | ▶ 00:07 |
This form is easily transformed into this expression over here, | ▶ 00:13 |
the probability of the message given spam times the prior probability of spam | ▶ 00:19 |
over the normalizer over here. | ▶ 00:25 |
Now, we know that the word "sports" occurs once in our 9 words of spam, | ▶ 00:29 |
and our prior probability for spam is 3/8, | ▶ 00:34 |
which gives us this expression over here. | ▶ 00:38 |
We now have to add the same probabilities for the class ham. | ▶ 00:40 |
"Sports" occurs 5 times out of 15 in the ham class, | ▶ 00:45 |
and the prior probability for ham is 5/8, | ▶ 00:51 |
which gives us 3/72 divided by 18/72, which is 3/18 or 1/6. | ▶ 00:55 |
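Written out as one equation, the computation just described is:

    $$P(\text{spam} \mid \text{"sports"}) = \frac{P(\text{"sports"} \mid \text{spam})\, P(\text{spam})}{P(\text{"sports"})} = \frac{\frac{1}{9} \cdot \frac{3}{8}}{\frac{1}{9} \cdot \frac{3}{8} + \frac{5}{15} \cdot \frac{5}{8}} = \frac{3/72}{18/72} = \frac{1}{6}$$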
This gets to a more complicated quiz. | ▶ 00:00 |
Say the message now contains 3 words. | ▶ 00:03 |
"Secret is secret," not a particularly meaningful email, | ▶ 00:06 |
but the frequent occurrence of "secret" seems to suggest it might be spam. | ▶ 00:10 |
What's the probability you're going to judge this to be spam? | ▶ 00:16 |
And the answer is surprisingly high. It's 25/26, or 0.9615. | ▶ 00:00 |
To see this, we apply Bayes rule, which multiplies the prior for spam-ness | ▶ 00:10 |
with the conditional probability of each word given spam. | ▶ 00:16 |
"Secret" carries 1/3, "is" 1/9, and "secret" 1/3 again. | ▶ 00:19 |
We normalize this by the same expression plus the probability for | ▶ 00:26 |
the non-spam case. | ▶ 00:32 |
5/8 is a prior. | ▶ 00:36 |
"Secret" is 1/15. | ▶ 00:38 |
"Is" is 1/15, | ▶ 00:42 |
and "secret" again. | ▶ 00:45 |
This resolves to 1/216 over this expression plus 1/5400, | ▶ 00:48 |
and when you work it all out is 25/26. | ▶ 00:57 |
The final quiz, let's assume our message is "Today is secret." | ▶ 00:00 |
And again, it might look like spam because the word "secret" occurs. | ▶ 00:08 |
I'd like you to compute for me the probability of spam given this message. | ▶ 00:12 |
And surprisingly, the probability for this message to be spam is 0. | ▶ 00:00 |
It's not 0.001. It's flat 0. | ▶ 00:07 |
In other words, it's impossible, according to our model, | ▶ 00:11 |
that this text could be a spam message. | ▶ 00:14 |
Why is this? | ▶ 00:17 |
When we apply the same rule as before, we get the prior for spam which is 3/8. | ▶ 00:19 |
And we multiply the conditional for each word into this. | ▶ 00:24 |
For "secret," we know it to be 1/3. | ▶ 00:28 |
For "is," to be 1/9, but for today, it's 0. | ▶ 00:31 |
It's 0 because the maximum of the estimate for the probability of "today" in spam is 0. | ▶ 00:39 |
"Today" just never occurred in a spam message so far. | ▶ 00:45 |
Now, this 0 is troublesome because as we compute the outcome-- | ▶ 00:49 |
and I'm plugging in all the numbers as before-- | ▶ 00:55 |
none of the words matter anymore, just the 0 matters. | ▶ 01:00 |
So, we get 0 over something which is plain 0. | ▶ 01:03 |
Are we overfitting? You bet. | ▶ 01:10 |
We are clearly overfitting. | ▶ 01:13 |
It can't be that a single word determines the entire outcome of our analysis. | ▶ 01:15 |
The reason is that for our model to assign a probability of 0 for the word "today" | ▶ 01:21 |
ever occurring in the class of spam is just too aggressive. | ▶ 01:26 |
Let's change this. | ▶ 01:29 |
One technique to deal with the overfitting problem is called Laplace smoothing. | ▶ 01:34 |
In maximum likelihood estimation, we assign to our probability | ▶ 01:39 |
the quotient of the count of this specific event over the count of all events in our data set. | ▶ 01:45 |
For example, for the prior probability, we found that 3/8 messages are spam. | ▶ 01:51 |
Therefore, our maximum likelihood estimate | ▶ 01:57 |
for the prior probability of spam was 3/8. | ▶ 02:00 |
In Laplace Smoothing, we use a different estimate. | ▶ 02:05 |
We add the value k to the count | ▶ 02:10 |
and normalize as if we added k to every single class | ▶ 02:15 |
that we've tried to estimate something over. | ▶ 02:20 |
This is equivalent to assuming we have a couple of fake training examples | ▶ 02:23 |
where we add k to each observation count. | ▶ 02:28 |
Now, if k equals 0, we get our maximum likelihood estimator. | ▶ 02:32 |
But if k is larger than 0 and n is finite, we get different answers. | ▶ 02:36 |
Let's say k equals 1, | ▶ 02:41 |
and let's assume we get one message, | ▶ 02:47 |
and that message was spam, so we're going to write it as: one message, one spam. | ▶ 02:51 |
What is p(spam) under Laplace smoothing with k = 1? | ▶ 02:56 |
Let's do the same with 10 messages, and we get 6 spam. | ▶ 03:03 |
And 100 messages, of which 60 are spam. | ▶ 03:09 |
Please enter your numbers into the boxes over here. | ▶ 03:16 |
The answer here is 2/3 or 0.667 and is computed as follows. | ▶ 00:00 |
We have 1 message, of which 1 is spam, and we're going to add k = 1 to the count. | ▶ 00:10 |
Down here we add k for each of the 2 different classes, | ▶ 00:16 |
so k = 1 times 2 = 2, which gives us 2/3. | ▶ 00:22 |
The answer over here is 7/12. | ▶ 00:28 |
Again, we have 6/10 but we add 2 down here and 1 over here, so you get 7/12. | ▶ 00:32 |
And correspondingly, we get 61/102, which is (60 + 1) over (100 + 2). | ▶ 00:41 |
If we look at the numbers over here, we get 0.5833 | ▶ 00:49 |
and 0.5980. | ▶ 00:56 |
Interestingly, maximum likelihood on the last 2 cases over here | ▶ 00:59 |
would give us .6, but here we get values that are pulled closer to .5, | ▶ 01:03 |
which is the effect of the smoothing prior in Laplacian smoothing. | ▶ 01:09 |
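In symbols, the Laplace-smoothed prior used in this quiz is the following, where N is the number of messages and the factor 2 reflects the two classes, spam and ham:

    $$P(\text{spam}) = \frac{\text{count}(\text{spam}) + k}{N + 2k} \qquad \frac{1+1}{1+2} = \frac{2}{3}, \quad \frac{6+1}{10+2} = \frac{7}{12}, \quad \frac{60+1}{100+2} = \frac{61}{102}$$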
Let's use the Laplacian smoother with K=1 | ▶ 00:00 |
to calculate the few interesting probabilities-- | ▶ 00:05 |
P of SPAM, P of HAM, | ▶ 00:09 |
and then the probability of the words "today", | ▶ 00:12 |
given that it's in the SPAM class or the HAM class. | ▶ 00:15 |
And you may assume that our vocabulary size | ▶ 00:19 |
is 12 different words here. | ▶ 00:22 |
This one is easy to calculate for SPAM and HAM. | ▶ 00:00 |
For SPAM, it's 2/5, | ▶ 00:03 |
and the reason is, we had previously | ▶ 00:05 |
3 out of 8 messages assigned to SPAM. | ▶ 00:08 |
But thanks to the Laplacian smoother, we add 1 over here. | ▶ 00:12 |
And there are 2 classes, so we add 2 times 1 over here, | ▶ 00:15 |
which gives us 4/10, which is 2/5. | ▶ 00:19 |
Similarly, we get 3/5 over here. | ▶ 00:22 |
Now the tricky part comes up over here. | ▶ 00:26 |
Before, we had 0 occurrences of the word "today" in the SPAM class, | ▶ 00:29 |
and we had 9 data points. | ▶ 00:33 |
But now we are going to add 1 for Laplacian smoother, | ▶ 00:35 |
and down here, we are going to add 12. | ▶ 00:38 |
And the reason that we add 12 is because | ▶ 00:40 |
there are 12 different words in our dictionary. | ▶ 00:42 |
Hence, for each word in the dictionary, we are going to add 1. | ▶ 00:44 |
So we have a total of 12, which gives us the 12 over here. | ▶ 00:47 |
That makes 1/21. | ▶ 00:50 |
In the HAM class, we had 2 occurrences | ▶ 00:53 |
of the word "today"--over here and over here. | ▶ 00:56 |
We add 1, normalize by 15, | ▶ 00:59 |
plus 12 for the dictionary size, | ▶ 01:04 |
which is 3/27 or 1/9. | ▶ 01:07 |
This was not an easy question. | ▶ 01:14 |
We come now to the final quiz here, | ▶ 00:00 |
which is--I would like to compute the probability | ▶ 00:03 |
that the message "today is secret" | ▶ 00:05 |
falls into the SPAM box with | ▶ 00:08 |
Laplacian smoother using K=1. | ▶ 00:10 |
Please just enter your number over here. | ▶ 00:13 |
This is a non-trivial question. | ▶ 00:16 |
It might take you a while to calculate this. | ▶ 00:18 |
The approximate probability is 0.4858. | ▶ 00:00 |
How did we get this? | ▶ 00:06 |
Well, the prior probability for SPAM | ▶ 00:08 |
under the Laplacian smoothing is 2/5. | ▶ 00:12 |
"Today" doesn't occur, but we have already calculated this to be 1/21. | ▶ 00:15 |
"Is" occurs once, so we get 2 over here over 21. | ▶ 00:22 |
"Secret" occurs 3 times, so we get a 4 over here over 21, | ▶ 00:26 |
and we normalize this by the same expression over here. | ▶ 00:32 |
Plus the prior for HAM, which is 3/5, | ▶ 00:37 |
we have 2 occurrences of "today", plus 1, equals 3/27. | ▶ 00:42 |
"Is" occurs once--2/27. | ▶ 00:47 |
And "secret" occurs once--again 2/27. | ▶ 00:50 |
When you work this all out, you get this number over here. | ▶ 00:54 |
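To tie the whole calculation together, here is a minimal Python sketch of the Laplace-smoothed Naive Bayes classifier from this unit; the messages and k = 1 match the lecture's example, the function names are mine, and "sport" is again merged into "sports" to match the 12-word dictionary.

    # Laplace-smoothed Naive Bayes, following the lecture's example.
    spam = ["offer is secret", "click secret link", "secret sports link"]
    ham = ["play sports today", "went play sports", "secret sports event",
           "sports is today", "sports costs money"]
    k = 1

    def word_counts(messages):
        counts = {}
        for m in messages:
            for w in m.split():
                counts[w] = counts.get(w, 0) + 1
        return counts

    spam_counts, ham_counts = word_counts(spam), word_counts(ham)
    vocab = set(spam_counts) | set(ham_counts)                # 12 words
    n_spam = sum(spam_counts.values())                        # 9 spam words
    n_ham = sum(ham_counts.values())                          # 15 ham words

    def p_word(word, counts, total):
        # Add k to the count; add k once per dictionary word below.
        return (counts.get(word, 0) + k) / (total + k * len(vocab))

    def p_spam_given(message):
        # Smoothed priors: (3 + 1)/(8 + 2) = 2/5 spam, 3/5 ham.
        score_spam = (len(spam) + k) / (len(spam) + len(ham) + 2 * k)
        score_ham = 1 - score_spam
        for w in message.split():
            score_spam *= p_word(w, spam_counts, n_spam)
            score_ham *= p_word(w, ham_counts, n_ham)
        return score_spam / (score_spam + score_ham)

    print(round(p_spam_given("today is secret"), 4))          # 0.4858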
So we learned quite a bit. | ▶ 00:00 |
We learned about Naive Bayes | ▶ 00:02 |
as our first supervised learning methods. | ▶ 00:04 |
The setup was that we had | ▶ 00:06 |
features of documents, or training examples, and labels. | ▶ 00:08 |
In this case, SPAM or not SPAM. | ▶ 00:14 |
And from those pieces, | ▶ 00:17 |
we made a generative model for the SPAM class | ▶ 00:19 |
and the non-SPAM class | ▶ 00:23 |
that describes the conditional probability | ▶ 00:25 |
of each individual feature. | ▶ 00:28 |
We then used first maximum likelihood | ▶ 00:30 |
and then a Laplacian smoother | ▶ 00:33 |
to fit those parameters over here. | ▶ 00:36 |
And then using Bayes rule, | ▶ 00:38 |
we could take any example over here | ▶ 00:41 |
and figure out what its class probability was over here. | ▶ 00:44 |
This is called a generative model | ▶ 00:48 |
in that the conditional probabilities all aim to maximize | ▶ 00:51 |
the probability of individual features as if those | ▶ 00:55 |
described the physical world. | ▶ 01:00 |
We also used what is called a bag of words model, | ▶ 01:02 |
in which our representation of each email | ▶ 01:06 |
was such that we just counted the occurrences of words, | ▶ 01:09 |
irrespective of their order. | ▶ 01:12 |
Now this is a very powerful method for fighting SPAM. | ▶ 01:15 |
Unfortunately, it is not powerful enough. | ▶ 01:19 |
It turns out spammers know about Naive Bayes, | ▶ 01:21 |
and they've long learned to come up with messages | ▶ 01:24 |
that are fooling your SPAM filter if it uses Naive Bayes. | ▶ 01:27 |
So companies like Google and others | ▶ 01:31 |
have become much more involved | ▶ 01:33 |
in methods for SPAM filtering. | ▶ 01:35 |
Now I can give you some more examples of how to filter SPAM, | ▶ 01:38 |
but all of those fit quite easily into the same Naive Bayes model. | ▶ 01:42 |
[Narrator] So here are features that you might consider when you write | ▶ 00:00 |
an advanced spam filter. | ▶ 00:03 |
For example, | ▶ 00:05 |
does the email come from | ▶ 00:07 |
a known spamming IP or computer? | ▶ 00:09 |
Have you emailed this person before? | ▶ 00:12 |
In which case it is less likely to be spam. | ▶ 00:16 |
Here's a powerful one: | ▶ 00:19 |
have 1000 other people | ▶ 00:22 |
recently received the same message? | ▶ 00:25 |
Is the email header consistent? | ▶ 00:29 |
For example, if the "from" field says it's your bank, | ▶ 00:32 |
is the IP address really your bank's? | ▶ 00:35 |
Surprisingly: is the email in all caps? | ▶ 00:38 |
Strangely, many spammers believe that if they write | ▶ 00:42 |
things in all caps, you'll pay more attention to them. | ▶ 00:44 |
Do the inline URLs point to the pages | ▶ 00:48 |
they say they're pointing to? | ▶ 00:51 |
Are you addressed by your correct name? | ▶ 00:54 |
Now these are some features, | ▶ 00:56 |
I'm sure you can think of more. | ▶ 00:58 |
You can toss them easily into the | ▶ 01:00 |
Naive Bayes model and get better classification. | ▶ 01:02 |
In fact, modern spam filters keep learning | ▶ 01:05 |
as people flag emails as spam, and | ▶ 01:08 |
of course spammers keep learning as well | ▶ 01:10 |
and trying to fool modern spam filters. | ▶ 01:13 |
Who's going to win? | ▶ 01:16 |
Well so far the spam filters are clearly winning. | ▶ 01:18 |
Most of my spam I never see, but who knows | ▶ 01:21 |
what's going to happen in the future? | ▶ 01:23 |
It's a really fascinating machine learning problem. | ▶ 01:25 |
[Narrator] Naive Bayes can also be applied to | ▶ 00:00 |
the problem of handwritten digit recognition. | ▶ 00:02 |
This is a sample of handwritten digits taken | ▶ 00:05 |
from a U.S. postal data set | ▶ 00:09 |
where handwritten zip codes on letters are | ▶ 00:12 |
being scanned and automatically classified. | ▶ 00:17 |
The machine-learning problem here is: | ▶ 00:21 |
taking a symbol just like this, | ▶ 00:23 |
What is the corresponding number? | ▶ 00:28 |
Here it's obviously 0. | ▶ 00:30 |
Here it's obviously 1. | ▶ 00:32 |
Here it's obviously 2, 1. | ▶ 00:34 |
For the one down here, | ▶ 00:36 |
it's a little bit harder to tell. | ▶ 00:38 |
Now when you apply Naive Bayes, | ▶ 00:41 |
the input vector | ▶ 00:44 |
could be the pixel values | ▶ 00:46 |
of each individual pixel so we have | ▶ 00:48 |
a 16 x 16 input resolution. | ▶ 00:50 |
You would get 256 different values | ▶ 00:54 |
corresponding to the brightness of each pixel. | ▶ 00:59 |
Now obviously, given sufficiently many | ▶ 01:02 |
training examples, you might hope | ▶ 01:05 |
to recognize digits, | ▶ 01:07 |
but one of the deficiencies of this approach is | ▶ 01:09 |
that it is not particularly shift-invariant. | ▶ 01:12 |
So for example a pattern like this | ▶ 01:15 |
will look fundamentally different | ▶ 01:19 |
from a pattern like this. | ▶ 01:21 |
Even though the pattern on the right is obtained | ▶ 01:24 |
by shifting the pattern on the left | ▶ 01:27 |
by 1 to the right. | ▶ 01:29 |
There are many different solutions, but a common one could be | ▶ 01:31 |
to use smoothing in a different way from | ▶ 01:34 |
the way we discussed it before. | ▶ 01:36 |
Instead of just counting 1 pixel's value, | ▶ 01:38 |
you could mix it with the counts of the | ▶ 01:40 |
neighboring pixel values, so if | ▶ 01:42 |
all pixels are slightly shifted, | ▶ 01:44 |
we get about the same statistics | ▶ 01:46 |
as for the pixel itself. | ▶ 01:48 |
Such a method is called input smoothing. | ▶ 01:50 |
You can, as it's technically called, convolve | ▶ 01:52 |
the input vector of pixel values with a smoothing kernel, and | ▶ 01:55 |
you might get better results than if you | ▶ 01:57 |
do Naive Bayes on the raw pixels. | ▶ 02:00 |
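As a toy illustration of such input smoothing, one could average each pixel with its immediate neighbors before counting; the 1-D row of pixels and the 3-pixel window below are my own simplifications.

    # Smooth a row of pixel brightness values by averaging each pixel
    # with its neighbors, so slightly shifted patterns look similar.
    def smooth(pixels):
        smoothed = []
        for i in range(len(pixels)):
            window = pixels[max(0, i - 1):i + 2]
            smoothed.append(sum(window) / len(window))
        return smoothed

    print(smooth([0, 0, 255, 0, 0]))  # [0.0, 85.0, 85.0, 85.0, 0.0]
    print(smooth([0, 0, 0, 255, 0]))  # [0.0, 0.0, 85.0, 85.0, 127.5]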
Now to tell you the truth for | ▶ 02:02 |
digit recognition of this type, | ▶ 02:04 |
Naive Bayes is not a good choice. | ▶ 02:06 |
The conditional independence assumption | ▶ 02:08 |
of each pixel, given the class, | ▶ 02:10 |
is too strong an assumption in this case, | ▶ 02:12 |
but it's fun to talk about image recognition | ▶ 02:14 |
in the context of Naive Bayes regardless. | ▶ 02:17 |
So, let me take a step back and talk a bit about | ▶ 00:00 |
overfitting prevention in machine learning | ▶ 00:04 |
because it's such an important topic. | ▶ 00:07 |
We talked about Occam's Razor, | ▶ 00:09 |
which in a generalized way suggests there is | ▶ 00:12 |
a tradeoff between how well we can fit the data | ▶ 00:16 |
and how smooth our learning algorithm is. | ▶ 00:22 |
In our Laplacian smoothing, we already found 1 way | ▶ 00:28 |
to let Occam's Razor play, which is by | ▶ 00:32 |
selecting the value K to make our statistical counts smoother. | ▶ 00:34 |
I alluded to a similar way in the image recognition domain | ▶ 00:40 |
where we smoothed the image so that neighboring pixels count similarly. | ▶ 00:44 |
This all raises the question of how to choose the smoothing parameter. | ▶ 00:49 |
So, in particular, in Laplacian smoothing, how to choose the K. | ▶ 00:53 |
There is a method called cross-validation | ▶ 00:58 |
which can help you find an answer. | ▶ 01:02 |
This method assumes there is plenty of training examples, but | ▶ 01:05 |
to tell you the truth, in spam filtering there is more than you'd ever want. | ▶ 01:09 |
Take your training data | ▶ 01:14 |
and divide it into 3 buckets. | ▶ 01:17 |
Train, cross-validate, and test. | ▶ 01:19 |
Typical ratios will be 80% goes into train, | ▶ 01:24 |
10% into cross-validate, | ▶ 01:27 |
and 10% into test. | ▶ 01:30 |
You use the training set to find all your parameters-- | ▶ 01:33 |
for example, the probabilities of a Bayes network. | ▶ 01:37 |
You use your cross-validation set | ▶ 01:40 |
to find the optimal K, and the way you do this is | ▶ 01:43 |
you train for different values of K, | ▶ 01:46 |
you observe how well the trained model performs on the cross-validation data, | ▶ 01:49 |
not touching the test data, | ▶ 01:55 |
and then you maximize over all the Ks to get the best performance | ▶ 01:58 |
on the cross-validation set. | ▶ 02:01 |
You iterate this many times until you find the best K. | ▶ 02:03 |
When you're done with the best K, | ▶ 02:06 |
you train again, and then finally | ▶ 02:09 |
only once do you touch the test data | ▶ 02:12 |
to verify the performance, | ▶ 02:15 |
and this is the performance you report. | ▶ 02:17 |
It's really important in cross-validation | ▶ 02:20 |
to split apart a cross-validation set that's different from the test set. | ▶ 02:23 |
If you were to use the test set to find the optimal K, | ▶ 02:28 |
then your test set becomes an effective part of your training routine, | ▶ 02:31 |
and you might overfit your test data, | ▶ 02:35 |
and you wouldn't even know. | ▶ 02:38 |
By keeping the test data separate from the beginning, | ▶ 02:40 |
and training on the training data, you use | ▶ 02:43 |
the cross-validation data to see how well your trained model is doing | ▶ 02:46 |
and to fine-tune the unknown smoothing parameter K. | ▶ 02:49 |
Finally, only once you use the test data | ▶ 02:53 |
do you get a fair answer to the question, | ▶ 02:56 |
"How well will your model perform on future data?" | ▶ 02:59 |
So, pretty much everybody in machine learning | ▶ 03:02 |
uses this model. | ▶ 03:05 |
You can redo the split between the training and the cross-validation part; | ▶ 03:08 |
people often use the term 10-fold cross-validation, | ▶ 03:12 |
where they do 10 different foldings | ▶ 03:15 |
and run the model 10 times to find the optimal K | ▶ 03:17 |
or smoothing parameter. | ▶ 03:20 |
No matter which way you do it, find the optimal smoothing parameter | ▶ 03:22 |
and then use the test set exactly once to verify it and report that result. | ▶ 03:25 |
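Here is a sketch of that train/cross-validate/test recipe in Python; the deliberately trivial model just fits a Laplace-smoothed spam prior, but any learner with a fit-and-score interface would slot in the same way.

    from math import log

    def fit(train, k):
        # Laplace-smoothed prior p(spam) from binary labels (1 = spam).
        return (sum(train) + k) / (len(train) + 2 * k)

    def log_likelihood(p, labels):
        return sum(log(p) if y == 1 else log(1 - p) for y in labels)

    def choose_k(labels, candidate_ks):
        n = len(labels)
        train = labels[:int(0.8 * n)]              # 80% train
        cv = labels[int(0.8 * n):int(0.9 * n)]     # 10% cross-validate
        test = labels[int(0.9 * n):]               # 10% test
        # Score each candidate k on the CV set only, never on test.
        best_k = max(candidate_ks,
                     key=lambda k: log_likelihood(fit(train, k), cv))
        # Touch the test set exactly once, and report that performance.
        return best_k, log_likelihood(fit(train, best_k), test)

    labels = [1, 0, 0, 1, 0, 0, 0, 1, 0, 1] * 10   # toy spam/ham labels
    print(choose_k(labels, candidate_ks=[0.1, 1, 10]))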
Let me back up a step further, | ▶ 00:00 |
and let's look at supervised learning more generally. | ▶ 00:03 |
Our example so far was one of classification. | ▶ 00:06 |
The characteristic of classification is | ▶ 00:09 |
that the target labels or the target class is discrete. | ▶ 00:12 |
In our case it was actually binary. | ▶ 00:16 |
In many problems, we try to predict a continuous quantity. | ▶ 00:18 |
For example, in the interval 0 to 1 or perhaps a real number. | ▶ 00:23 |
Those machine learning problems are called regression problems. | ▶ 00:29 |
Regression problems are fundamentally different from classification problems. | ▶ 00:33 |
For example, our Bayes network doesn't afford us an answer | ▶ 00:37 |
to a problem where the target value could be anywhere between 0 and 1. | ▶ 00:42 |
A regression problem, for example, would be one to | ▶ 00:45 |
predict the weather tomorrow. | ▶ 00:48 |
Temperature is a continuous value. Our Bayes network would not be able | ▶ 00:50 |
to predict the temperature; it can only predict discrete classes. | ▶ 00:53 |
A regression algorithm is able to give us a continuous prediction | ▶ 00:58 |
about the temperature tomorrow. | ▶ 01:01 |
So let's look at the regression next. | ▶ 01:04 |
So here's my first quiz for you on regression. | ▶ 01:07 |
This scatter plot shows, for Berkeley, California, for a period of time, | ▶ 01:10 |
the data for each house that was sold. | ▶ 01:18 |
Each dot is a sold house. | ▶ 01:21 |
It graphs the size of the house in square feet | ▶ 01:24 |
to the sales price in thousands of dollars. | ▶ 01:27 |
As you can see, roughly speaking, | ▶ 01:32 |
as the size of the house goes up, | ▶ 01:34 |
so does the sales price. | ▶ 01:37 |
I wonder, for a house of about 2500 square feet, | ▶ 01:40 |
what is the approximate sales price you would assume | ▶ 01:45 |
based just on the scatter plot data? | ▶ 01:49 |
Is it 400k, 600k, 800k, or 1000k? | ▶ 01:52 |
My answer is, there seems to be a roughly linear relationship, | ▶ 00:00 |
maybe not quite linear, between the house size and the price. | ▶ 00:05 |
So we look at a linear graph that best describes the data-- | ▶ 00:11 |
you get this dashed line over here. | ▶ 00:15 |
And for the dashed line, if you walk up the 2500 square feet, | ▶ 00:18 |
you end up with roughly 800K. | ▶ 00:22 |
So this would have been the best answer. | ▶ 00:24 |
Now obviously you can answer this question without understanding anything about regression. | ▶ 00:00 |
But what you find is this is different from classification as before. | ▶ 00:05 |
This is not a binary concept anymore of like expensive and cheap. | ▶ 00:10 |
It really is a relationship between two variables. | ▶ 00:13 |
One you care about--the house price, and one that you can observe, | ▶ 00:17 |
which is the house size in square feet. | ▶ 00:20 |
And your goal is to fit a curve that best explains the data. | ▶ 00:23 |
Once again, we have a case where we can play Occam's razor. | ▶ 00:28 |
There clearly is a data fit that is not linear that might be better, | ▶ 00:31 |
like this one over here. | ▶ 00:35 |
And when you go beyond linear curves, | ▶ 00:37 |
you might even be inclined to draw a curve like this. | ▶ 00:40 |
Now of course the curve I'm drawing right now is likely an overfit. | ▶ 00:44 |
And you don't want to postulate that this is the general relationship | ▶ 00:49 |
between the size of a house and the sales price. | ▶ 00:54 |
So even though my black curve might describe the data better, | ▶ 00:57 |
the blue curve or the dashed linear curve over here might be a better explanation by virtue of Occam's razor. | ▶ 01:01 |
So let's look a little bit deeper into what we call regression. | ▶ 01:08 |
As in all regression problems, our data will be comprised of | ▶ 00:15 |
input vectors of length n that map to another continuous value. | ▶ 00:19 |
And we might be given a total of M data points. | ▶ 01:25 |
This is as in the classification case, except this time the Ys are continuous. | ▶ 00:30 |
Once again, we're looking for function f that maps our vector x into y. | ▶ 01:36 |
In linear regression, the function has a particular form which is W1 times X plus W0. | ▶ 01:44 |
In this case X is one-dimensional, which means N = 1. | ▶ 01:54 |
Or in the high-dimensional space, we might just write W times X plus W0, | ▶ 01:59 |
where W is a vector and X is a vector. | ▶ 02:07 |
And this is the inner product of these 2 vectors over here. | ▶ 02:12 |
Let's for now just consider the one-dimensional case. | ▶ 02:16 |
In this quiz, I've given you a linear regression form with 2 unknown parameters, W1 and W0. | ▶ 02:20 |
I've given you a data set. | ▶ 02:27 |
And this data set happens to be fittable by a linear regression model without any residual error. | ▶ 02:30 |
Without any math, can you look at this and tell me what the 2 parameters, W0 and W1, are? | ▶ 02:36 |
This is a surprisingly challenging question. | ▶ 00:00 |
If you look at these numbers from 3 to 6: | ▶ 00:03 |
When we increase X by 3, Y decreases by 3, | ▶ 00:07 |
which suggests W1 is -1. | ▶ 00:14 |
Now let's see if this holds. | ▶ 00:18 |
If we increase X by 3, it decreases Y by 3. | ▶ 00:20 |
If we increase X by 1, we decrease Y by 1. | ▶ 00:24 |
If we increase X by 2, we decrease Y by 2. | ▶ 00:28 |
So this number seems to be an exact fit. | ▶ 00:32 |
Next we have to get the constant W0 right. | ▶ 00:36 |
For X = 3, we get -3 as an expression over here, | ▶ 00:41 |
because we know W1 = -1. | ▶ 00:48 |
So if this has to equal zero in the end, then W0 has to be 3. | ▶ 00:50 |
Let's do a quick check. | ▶ 00:57 |
-3 plus 3 is 0. | ▶ 00:59 |
-6 plus 3 is -3. | ▶ 01:02 |
And if we plug in any of the numbers, you find those are correct. | ▶ 01:05 |
Now this is the case of an exact data set. | ▶ 01:09 |
It gets much more challenging if the data set cannot be fit with a linear function. | ▶ 01:12 |
To define linear regression, | ▶ 00:00 |
we need to understand what we are trying to minimize. | ▶ 00:02 |
This is called the loss function, | ▶ 00:05 |
and the loss function is the amount of residual error we obtain | ▶ 00:08 |
after fitting the linear function as well as possible. | ▶ 00:12 |
The residual error is the sum over all training examples | ▶ 00:16 |
j of yj, which is the target label, | ▶ 00:20 |
minus our prediction, w1 xj + w0, all squared. | ▶ 00:25 |
This is the quadratic error between our target labels | ▶ 00:34 |
and what our best hypothesis can produce. | ▶ 00:37 |
Minimizing this loss is how we solve | ▶ 00:41 |
the linear regression problem, | ▶ 00:43 |
and you can write it as follows: | ▶ 00:46 |
Our solution to the regression problem W* | ▶ 00:50 |
is the arg min of the loss over all possible vectors W. | ▶ 00:52 |
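In symbols, the loss and the resulting optimization problem read:

    $$\text{Loss}(w) = \sum_j \left( y_j - w_1 x_j - w_0 \right)^2, \qquad w^* = \arg\min_w \text{Loss}(w)$$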
The problem of minimizing quadratic loss for linear functions can be solved in closed form. | ▶ 00:00 |
I will do this for the one-dimensional case on paper. | ▶ 00:07 |
I will also give you the solution for the case where your input space is multidimensional, | ▶ 00:12 |
which is often called "multivariate regression." | ▶ 00:17 |
We seek to minimize a sum of a quadratic expression | ▶ 00:22 |
where the target labels are subtracted with the output of our linear regression model | ▶ 00:26 |
parameterized by w1 and w0. | ▶ 00:33 |
The summation here is overall training examples, | ▶ 00:36 |
and I leave the index of the summation out if not necessary. | ▶ 00:40 |
The minimum of this is obtained where the derivative of this function equals zero. | ▶ 00:45 |
Let's call this function "L." | ▶ 00:50 |
For the partial derivative with respect to w0, we get this expression over here, | ▶ 00:53 |
which we have to set to zero. | ▶ 00:59 |
We can easily get rid of the -2 and transform this as follows: | ▶ 01:02 |
Here M is the number of training examples. | ▶ 01:11 |
This expression over here gives us w0 as a function of w1, | ▶ 01:17 |
but we don't know w1. Let's do the same trick for w1 | ▶ 01:21 |
and set this to zero as well, | ▶ 01:28 |
which gets us the expression over here. | ▶ 01:32 |
We can now plug in the w0 over here into this expression over here | ▶ 01:38 |
and obtain this expression over here, | ▶ 01:44 |
which looks really involved but is relatively straightforward. | ▶ 01:47 |
With a few steps of further calculation, which I'll spare you for now, | ▶ 01:52 |
we get for w1 the following important formula: | ▶ 01:56 |
This is the final quotient for w1, | ▶ 02:02 |
where we take the number of training examples times the sum of all xy | ▶ 02:05 |
minus the sum of x times the sum of y divided by this expression over here. | ▶ 02:10 |
Once we've computed w1, | ▶ 02:16 |
we can go back to our original expression for w0 over here | ▶ 02:19 |
and plug w1 in to obtain w0. | ▶ 02:23 |
These are the two important formulas we can also find in the textbook. | ▶ 02:30 |
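Written out for M training examples, those two formulas are:

    $$w_1 = \frac{M \sum_j x_j y_j - \left(\sum_j x_j\right)\left(\sum_j y_j\right)}{M \sum_j x_j^2 - \left(\sum_j x_j\right)^2}, \qquad w_0 = \frac{1}{M} \sum_j y_j - \frac{w_1}{M} \sum_j x_j$$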
I'd like to go back and use those formulas to calculate these two coefficients over here. | ▶ 02:39 |
You get 4 times the sum of all xy, where that sum is -32, | ▶ 02:45 |
minus the product of the sum of x, which is 18, and the sum of y, which is -6, | ▶ 02:56 |
divided by 4 times the sum of x squared, which is 86, minus the square of the sum of x, | ▶ 03:05 |
which is 18 times 18, or 324. | ▶ 03:16 |
If you work this all out, it becomes -1, which is w1. | ▶ 03:20 |
W0 is now obtained by computing a quarter times the sum of all y, | ▶ 03:25 |
which is -6, minus w1 = -1 times a quarter times the sum of all x, which is 18. | ▶ 03:31 |
If you plug this all in, you get 3, as over here. Our formula is actually correct. | ▶ 03:39 |
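As a quick check of these formulas in code, here is a minimal Python sketch; the four data points (x from 3 to 6, lying exactly on y = 3 - x) are reconstructed from the quiz above.

    # Closed-form 1-D linear regression, following the formulas above.
    def fit_line(xs, ys):
        m = len(xs)
        sx, sy = sum(xs), sum(ys)
        sxy = sum(x * y for x, y in zip(xs, ys))
        sxx = sum(x * x for x in xs)
        w1 = (m * sxy - sx * sy) / (m * sxx - sx * sx)
        w0 = sy / m - w1 * sx / m
        return w1, w0

    print(fit_line([3, 4, 5, 6], [0, -1, -2, -3]))  # (-1.0, 3.0)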
Here is another quiz for linear regression. We have the follow data: | ▶ 03:46 |
Here is the data plotted graphically. | ▶ 03:51 |
I wonder what the best regression is. | ▶ 03:53 |
Give me w0 and w1. Apply the formulas I just gave you. | ▶ 03:56 |
And the answer is W0 = 0.5, and W1 = 0.9. | ▶ 00:00 |
If I were to draw a line, it would go about like this. | ▶ 00:09 |
It doesn't really hit the two points at the end. | ▶ 00:14 |
If you were thinking of something like this, you were wrong. | ▶ 00:19 |
If you draw a curve like this, your quadratic error becomes 2. | ▶ 00:24 |
One over here, and one over here. | ▶ 00:28 |
The quadratic error is smaller for the line that goes in between those points. | ▶ 00:30 |
This is easily seen by computing as shown in the previous slide. | ▶ 00:35 |
W1 equals (4 x 118 - 20 x 20) / (4 x 120 - 400) which is 0.9. | ▶ 00:41 |
This is merely plugging in those numbers into the formulas I gave you. | ▶ 00:55 |
W0 then becomes ¼ x 20, minus W1, which is 0.9, times ¼ x 20. | ▶ 01:00 |
That is 5 minus 4.5, which equals 0.5. | ▶ 01:05 |
This is an example of linear regression, | ▶ 01:12 |
in which case there is a residual error, | ▶ 01:16 |
and the best-fitting curve is the one that minimizes | ▶ 01:18 |
the total of the residual vertical error in this graph over here. | ▶ 01:22 |
So linear regression works well | ▶ 00:00 |
if the data is approximately linear, | ▶ 00:03 |
but there are many examples when linear regression performs poorly. | ▶ 00:05 |
Here's one where we have a | ▶ 00:09 |
curve that is really nonlinear. | ▶ 00:12 |
This is an interesting one where we seem to have a linear relationship | ▶ 00:15 |
that is flatter than the linear regression indicates, | ▶ 00:18 |
but there is one outlier. | ▶ 00:21 |
Because if you are minimizing quadratic error, | ▶ 00:23 |
outliers penalize you over-proportionately. | ▶ 00:26 |
So outliers are particularly bad for linear regression. | ▶ 00:30 |
And here is a case | ▶ 00:34 |
where the data clearly suggests | ▶ 00:35 |
a phenomenon that is very different from linear. | ▶ 00:37 |
We have only two variables being used, | ▶ 00:40 |
and this one has a strong frequency | ▶ 00:42 |
and a strong vertical spread. | ▶ 00:45 |
Clearly a linear regression model | ▶ 00:47 |
is a very poor one to explain | ▶ 00:49 |
this data over here. | ▶ 00:51 |
Another problem with linear regression | ▶ 00:53 |
is that as you go to infinity in the X space, | ▶ 00:55 |
your Ys also become infinite. | ▶ 00:59 |
In some problems that isn't a plausible model. | ▶ 01:02 |
For example, if you wish to predict the weather | ▶ 01:05 |
anytime into the future, | ▶ 01:08 |
it's implausible to assume the further the prediction goes out, | ▶ 01:10 |
the hotter or the cooler it becomes. | ▶ 01:13 |
For such situations there is a | ▶ 01:15 |
model called logistic regression, | ▶ 01:17 |
which uses a slightly more complicated | ▶ 01:20 |
model than linear regression, | ▶ 01:22 |
which goes as follows: | ▶ 01:24 |
Let F of X be our linear function, | ▶ 01:25 |
and the output of logistic regression | ▶ 01:30 |
is obtained by the following function: | ▶ 01:32 |
One over one plus exponential of minus F of X. | ▶ 01:34 |
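In symbols, with f(x) = w1 x + w0, the logistic output is:

    $$z = \frac{1}{1 + e^{-f(x)}}$$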
So here's a quick quiz for you. | ▶ 01:40 |
What is the range in which Z might fall, | ▶ 01:43 |
given this function over here, | ▶ 01:48 |
where F of X is the linear function over here? | ▶ 01:49 |
Is it zero, one? | ▶ 01:53 |
Is it minus one, one? | ▶ 01:56 |
Is it minus one, zero? | ▶ 01:59 |
Minus two, two? | ▶ 02:02 |
Or none of the above? | ▶ 02:04 |
The answer is zero, one. | ▶ 00:00 |
If this expression over here, | ▶ 00:02 |
F of X, | ▶ 00:05 |
grows to positive infinity, | ▶ 00:07 |
then Z becomes one. | ▶ 00:09 |
And the reason is | ▶ 00:14 |
as this term over here becomes very large, | ▶ 00:16 |
E to the minus of that term approaches zero; | ▶ 00:19 |
one over one equals one. | ▶ 00:22 |
If F of X goes to minus infinity, | ▶ 00:25 |
then Z goes to zero. | ▶ 00:30 |
And the reason is, | ▶ 00:33 |
if this expression over here goes to minus infinity, | ▶ 00:34 |
E to the minus of it becomes very large; | ▶ 00:38 |
one over something very large becomes zero. | ▶ 00:41 |
When we plot the logistic function it looks like this: | ▶ 00:44 |
So it's approximately linear | ▶ 00:49 |
around F of X equals zero, | ▶ 00:51 |
but it levels off to zero and one | ▶ 00:54 |
as we go to the extremes. | ▶ 00:58 |
Another problem with linear regression has to do with the regularization | ▶ 00:00 |
or complexity control. | ▶ 00:04 |
Just like before, we sometimes wish to have | ▶ 00:06 |
a less complex model. | ▶ 00:08 |
So in regularization, the loss function is the sum | ▶ 00:10 |
of the loss over the data and a complexity control term, | ▶ 00:15 |
which is often called the loss of the parameters. | ▶ 00:21 |
The loss of the data is simply the quadratic loss, as we discussed before. | ▶ 00:24 |
The loss of the parameters might just be a function that penalizes | ▶ 00:29 |
the parameters for becoming large, | ▶ 00:35 |
raised to some power P, where P is usually either 1 or 2. | ▶ 00:37 |
If you draw this graphically, | ▶ 00:43 |
in a parameter space comprised of 2 parameters, | ▶ 00:46 |
your quadratic term for minimizing the data error | ▶ 00:49 |
might look like this, where the minimum sits over here. | ▶ 00:53 |
Your term for regularization might pull these parameters toward 0. | ▶ 00:57 |
It pulls them toward 0 along circles if you use quadratic (L2) error, | ▶ 01:02 |
and it does it in a diamond-shaped way | ▶ 01:09 |
for L1 regularization--either one works well. | ▶ 01:14 |
L1 has the advantage in that parameters tend to get really sparse. | ▶ 01:20 |
If you look at this diagram, there is a tradeoff between W-0 and W-1. | ▶ 01:24 |
In the L1 case, that allows one of them to be driven to 0. | ▶ 01:30 |
In the L2 case, parameters tend not to be as sparse. | ▶ 01:33 |
So L1 is often preferred. | ▶ 01:37 |
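In symbols, the regularized loss is the sum of the two terms; the trade-off weight lambda below is my own explicit notation for balancing them:

    $$\text{Loss}(w) = \sum_j \left( y_j - w_1 x_j - w_0 \right)^2 + \lambda \sum_i |w_i|^p, \qquad p \in \{1, 2\}$$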
This all raises the question, | ▶ 00:00 |
how to minimize more complicated loss functions | ▶ 00:03 |
than the one we discussed so far. | ▶ 00:06 |
Are there closed-form solutions of the type we found for linear regression? | ▶ 00:09 |
Or do we have to resort to iterative methods? | ▶ 00:14 |
The general answer is, unfortunately, we have to resort to iterative methods. | ▶ 00:17 |
Even though there are special cases in which closed-form solutions may exist, | ▶ 00:23 |
in general, our loss functions now become complicated enough | ▶ 00:28 |
that all we can do is iterate. | ▶ 00:32 |
Here is a prototypical loss function, | ▶ 00:35 |
and the method for iteration is called gradient descent. | ▶ 00:40 |
In gradient descent, you start with an initial guess, | ▶ 00:44 |
W-0, where 0 is your iteration number, | ▶ 00:48 |
and then you update it iteratively. | ▶ 00:53 |
Your i plus 1st parameter guess will be obtained by taking your i-th guess | ▶ 00:55 |
and subtracting from it the gradient of your loss function | ▶ 01:04 |
at that guess, multiplied by a small learning rate alpha, | ▶ 01:10 |
where alpha is often as small as 0.01. | ▶ 01:15 |
I have a couple of questions for you. | ▶ 01:19 |
Consider the following 3 points. | ▶ 01:21 |
We call them A, B, C. | ▶ 01:25 |
I wish to know, for points A, B, and C, | ▶ 01:27 |
Is the gradient at this point positive, about zero, or negative? | ▶ 01:34 |
For each of those, check exactly one of those cases. | ▶ 01:40 |
In case A, the gradient is negative. | ▶ 00:00 |
If you move to the right in the X space, | ▶ 00:03 |
then your loss decreases. | ▶ 00:06 |
In B, it's about zero. | ▶ 00:09 |
In C, it's pointing up; it's positive. | ▶ 00:12 |
So if you apply the rule over here, | ▶ 00:15 |
if you were to start at A as your W-zero, | ▶ 00:18 |
then your gradient is negative. | ▶ 00:21 |
Therefore, you would add something to the value of W. | ▶ 00:23 |
You move to the right, and your loss has decreased. | ▶ 00:26 |
You do this until you find yourself | ▶ 00:29 |
with what's called a local minimum, where B resides. | ▶ 00:31 |
In this instance over here, gradient descent starting at A | ▶ 00:34 |
would not get you to the global minimum, | ▶ 00:37 |
which sits over here because there's a bump in between. | ▶ 00:39 |
Gradient methods are known to be subject to local minimum. | ▶ 00:42 |
I have another gradient quiz. | ▶ 00:00 |
Consider the following quadratic error function. | ▶ 00:03 |
We are considering the gradient in 3 different places. | ▶ 00:06 |
a. b. and c. | ▶ 00:09 |
And I ask you which gradient is the largest: | ▶ 00:13 |
a, b, or c or are they all equal? | ▶ 00:17 |
In which case, you would want to check the last box over here | ▶ 00:23 |
And the answer is C. | ▶ 00:00 |
The derivative of a quadratic function is a linear function. | ▶ 00:04 |
Which would look about like this. | ▶ 00:08 |
And as we go outside, our gradient becomes larger and larger. | ▶ 00:11 |
This over here is much steeper than this curve over here. | ▶ 00:15 |
[Thrun] Here is a final gradient descent quiz. | ▶ 00:00 |
Suppose we have a loss function like this | ▶ 00:04 |
and our gradient descent starts over here. | ▶ 00:08 |
Will it likely reach the global minimum? | ▶ 00:12 |
Yes or no. | ▶ 00:15 |
Please check one of those boxes. | ▶ 00:17 |
[Thrun] And the answer is yes, | ▶ 00:00 |
although, technically speaking, to reach the absolute global minimum | ▶ 00:02 |
we need the learning rates to become smaller and smaller over time. | ▶ 00:06 |
If they stay constant, there is a chance this thing might bounce around | ▶ 00:11 |
between 2 points in the end and never reach the global minimum. | ▶ 00:15 |
But assuming that we implement gradient descent correctly, | ▶ 00:18 |
we will finally reach the global minimum. | ▶ 00:22 |
That's not the case if you start over here, where we can get stuck over here | ▶ 00:24 |
and settle for the minimum over here, which is a local minimum | ▶ 00:29 |
and not the best solution to our optimization problem. | ▶ 00:32 |
So one of the important points to take away from this is | ▶ 00:35 |
gradient descent is universally applicable to more complicated problems-- | ▶ 00:38 |
problems that don't have a closed-form solution. | ▶ 00:43 |
But you have to check whether there are many local minima, | ▶ 00:46 |
and if so, you have to worry about this. | ▶ 00:49 |
Any optimization book can tell you tricks for overcoming this. | ▶ 00:51 |
I won't go into any more depth here in this class. | ▶ 00:55 |
[Thrun] It's interesting to see how to minimize a loss function using gradient descent. | ▶ 00:00 |
In our linear case, we have L equals sum over the correct labels | ▶ 00:05 |
minus our linear function to the square, | ▶ 00:12 |
which we seek to minimize. | ▶ 00:16 |
We already know that this has a closed form solution, | ▶ 00:18 |
but just for the fun of it, let's look at gradient descent. | ▶ 00:21 |
The gradient of L with respect to W1 is minus 2 times the sum over all J | ▶ 00:25 |
of the difference as before, but without the square, times Xj. | ▶ 00:33 |
The gradient with respect to W0 is very similar. | ▶ 00:39 |
So in gradient descent we start with W1 0 and W0 0 | ▶ 00:43 |
where the upper cap 0 corresponds to the iteration index of gradient descent. | ▶ 00:49 |
And then we iterate. | ▶ 00:55 |
In the m-th iteration we get our new estimate by using the old estimate | ▶ 00:57 |
minus a learning rate times this gradient over here, | ▶ 01:06 |
taken at the position of the old estimate W1, m minus 1. | ▶ 01:10 |
Similarly, for W0 we get this expression over here. | ▶ 01:15 |
And these expressions look nasty, | ▶ 01:20 |
but what it really means is we subtract an expression like this | ▶ 01:24 |
every time we do gradient descent from W1 | ▶ 01:28 |
and an expression like this every time we do gradient descent from W0, | ▶ 01:31 |
which is easy to implement, and that implements gradient descent. | ▶ 01:36 |
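Here is a minimal Python sketch of exactly this update loop; the data set reuses the exact-fit quiz points from earlier, and the learning rate alpha = 0.01 and the iteration count are illustrative choices.

    # Gradient descent for 1-D linear regression with quadratic loss.
    def gradient_descent(xs, ys, alpha=0.01, iterations=5000):
        w1, w0 = 0.0, 0.0                  # initial guess
        for _ in range(iterations):
            # Gradients of L = sum (y - w1*x - w0)^2, as derived above.
            g1 = -2 * sum((y - w1 * x - w0) * x for x, y in zip(xs, ys))
            g0 = -2 * sum((y - w1 * x - w0) for x, y in zip(xs, ys))
            w1, w0 = w1 - alpha * g1, w0 - alpha * g0
        return w1, w0

    print(gradient_descent([3, 4, 5, 6], [0, -1, -2, -3]))  # near (-1.0, 3.0)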
Now, there are many different ways to apply linear functions in machine learning. | ▶ 00:00 |
We so far have studied linear functions for regression, | ▶ 00:08 |
but linear functions are also used for classification, | ▶ 00:12 |
and specifically for an algorithm called the perceptron algorithm. | ▶ 00:16 |
This algorithm happens to be a very early model of a neuron, | ▶ 00:21 |
as in the neurons we have in our brains, | ▶ 00:27 |
and was invented in the 1940s. | ▶ 00:30 |
Suppose we give a data set of positive samples and negative samples. | ▶ 00:33 |
A linear separator is a linear equation that separates positive from negative examples. | ▶ 00:41 |
Obviously, not all sets possess a linear separator, but some do. | ▶ 00:49 |
For those we can define the algorithm of the perceptron and it actually converges. | ▶ 00:55 |
To define a linear separator, let's start with our linear equation as before-- | ▶ 01:02 |
w1x + w0; in cases where x is higher dimensional, w1 might actually be a vector--never mind. | ▶ 01:07 |
If this is larger or equal to zero, then we call our classification 1. | ▶ 01:18 |
Otherwise, we call it zero. | ▶ 01:26 |
Here's our linear separation classification function | ▶ 01:30 |
where this is our common linear function. | ▶ 01:35 |
Now, as I said, perceptron only converges if the data is linearly separable, | ▶ 01:39 |
and then it converges to a linear separation of the data, | ▶ 01:45 |
which is quite amazing. | ▶ 01:49 |
Perceptron is an iterative algorithm that is not dissimilar from gradient descent. | ▶ 01:52 |
In fact, the update rule echoes that of gradient descent, and here's how it goes. | ▶ 01:56 |
We start with a random guess for w1 and w0, | ▶ 02:03 |
which may correspond to a random separation line, | ▶ 02:09 |
but usually is inaccurate. | ▶ 02:13 |
Then the m-th estimate of weight i is obtained by using the old weight plus some learning rate alpha | ▶ 02:17 |
times the difference between the desired target label | ▶ 02:29 |
and the target label produced by our function at the point m-1. | ▶ 02:33 |
Now, this is an online learning rule, which means we don't process all the data in batch. | ▶ 02:39 |
We process one data point at a time, and we might go through the data many, many times-- | ▶ 02:45 |
hence the j over here-- | ▶ 02:50 |
but every time we do this, we apply this rule over here. | ▶ 02:52 |
What this rule gives us is a method to adapt our weights in proportion to the error. | ▶ 02:55 |
If the prediction of our function f equals our target label, | ▶ 03:03 |
and the error is zero, then no update occurs. | ▶ 03:07 |
If there is a difference, however, we update in a way so as to minimize the error. | ▶ 03:11 |
Alpha is a small learning rate. | ▶ 03:18 |
Once again, perceptron converges to a correct linear separator | ▶ 03:22 |
if such a linear separator exists. | ▶ 03:28 |
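A minimal sketch of this online rule in Python follows; the toy 1-D data and the learning rate are made up, and, as in the standard perceptron rule, the error term for the weight w1 is multiplied by the input x.

    # Perceptron learning for a 1-D linearly separable toy problem.
    def classify(x, w1, w0):
        return 1 if w1 * x + w0 >= 0 else 0   # linear separation function

    def perceptron(data, alpha=0.1, epochs=100):
        w1, w0 = 0.0, 0.0                     # start from a poor guess
        for _ in range(epochs):               # many passes over the data
            for x, y in data:                 # one example at a time
                error = y - classify(x, w1, w0)   # zero error, no update
                w1 += alpha * error * x
                w0 += alpha * error
        return w1, w0

    data = [(1, 0), (2, 0), (4, 1), (5, 1)]   # separable around x = 3
    w1, w0 = perceptron(data)
    print([classify(x, w1, w0) for x, _ in data])  # [0, 0, 1, 1]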
Now, the case of linear separation has recently received a lot of attention in machine learning. | ▶ 03:31 |
If you look at the picture over here, you'll find there are many different linear separators. | ▶ 03:36 |
There is one over here. There is one over here. There is one over here. | ▶ 03:42 |
One of the questions that has recently been researched extensively is which one to prefer. | ▶ 03:47 |
Is it a, b, or c? | ▶ 03:53 |
Even though you probably have never seen this literature, | ▶ 03:57 |
I will just ask your intuition in this following quiz. | ▶ 04:01 |
Which linear separator would you prefer if you look at these three different linear separators-- | ▶ 04:05 |
a, b, c, or none of them? | ▶ 04:10 |
[Narrator] And intuitively I would argue it's B, | ▶ 00:00 |
and the reason why is | ▶ 00:04 |
C comes really close to examples. | ▶ 00:06 |
So if these examples are noisy, | ▶ 00:09 |
it's quite likely that, | ▶ 00:12 |
because the line is so close to these examples, | ▶ 00:14 |
future examples will cross the line. | ▶ 00:17 |
Similarly A comes close to examples. | ▶ 00:20 |
B is the one that stays really far away | ▶ 00:23 |
from any example. | ▶ 00:26 |
So there's this entire region over here | ▶ 00:28 |
where there's no example anywhere near B. | ▶ 00:31 |
This region is often called the margin. | ▶ 00:34 |
The margin of the linear separator | ▶ 00:37 |
is the distance of the separator | ▶ 00:40 |
to the closest training example. | ▶ 00:43 |
The margin is a really important concept | ▶ 00:45 |
in machine learning. | ▶ 00:47 |
There is an entire class of maximum margin | ▶ 00:49 |
learning algorithms, | ▶ 00:51 |
and the 2 most popular are | ▶ 00:53 |
support vector machines and boosting. | ▶ 00:56 |
If you are familiar with machine learning, | ▶ 01:00 |
you've come across these terms. | ▶ 01:02 |
These are very frequently used these days | ▶ 01:04 |
in actual discrimination learning tasks. | ▶ 01:07 |
I will not go into any details because it would go | ▶ 01:10 |
way beyond the scope of this introduction | ▶ 01:12 |
to artificial intelligence class, but let's see | ▶ 01:16 |
a few abstract words specifically about | ▶ 01:18 |
support vector machines or SVMs. | ▶ 01:21 |
As I said before a support vector machine | ▶ 01:25 |
derives a linear separator, and it takes | ▶ 01:30 |
the one that actually maximizes the margin | ▶ 01:34 |
as shown over here. | ▶ 01:39 |
By doing so it attains additional robustness | ▶ 01:42 |
over perceptron which only picks | ▶ 01:44 |
a linear separator without | ▶ 01:46 |
consideration of the margin. | ▶ 01:48 |
Now the problem of finding the | ▶ 01:51 |
margin maximizing linear separator | ▶ 01:53 |
can be solved by a quadratic program | ▶ 01:55 |
which is a numerical method for finding the best | ▶ 01:59 |
linear separator that maximizes the margin. | ▶ 02:03 |
One of the nice things that support | ▶ 02:06 |
vector machines do in practice is | ▶ 02:08 |
they use linear techniques to solve | ▶ 02:12 |
nonlinear separation problems, | ▶ 02:16 |
and I'm just going to give you a glimpse of | ▶ 02:19 |
what's happening without going into any detail. | ▶ 02:22 |
Suppose the data looks as follows: | ▶ 02:25 |
we have a positive class | ▶ 02:28 |
which is near the origin of a coordinate system | ▶ 02:31 |
and a negative class that surrounds the positive class. | ▶ 02:33 |
Clearly these 2 classes | ▶ 02:37 |
are not linearly separable | ▶ 02:39 |
because there's no line I can draw that | ▶ 02:41 |
separates the negative examples from the positive examples. | ▶ 02:43 |
An idea that underlies SVMs, | ▶ 02:47 |
that will ultimately be known as | ▶ 02:49 |
the kernel trick, | ▶ 02:51 |
is to augment the feature set by new features. | ▶ 02:53 |
Suppose this is X1, and this is X2, | ▶ 02:56 |
and normally X1 and X2 | ▶ 02:58 |
will be the input features. | ▶ 03:00 |
In this example, you might derive | ▶ 03:03 |
a 3rd one. | ▶ 03:05 |
Let me pick a 3rd one: | ▶ 03:07 |
suppose X3 equals the square root of | ▶ 03:09 |
X1 squared plus X2 squared. | ▶ 03:13 |
In other words X3 is the distance | ▶ 03:18 |
of any data point from the center | ▶ 03:22 |
of the coordinate system. | ▶ 03:25 |
Then things do become linearly separable | ▶ 03:27 |
so that just along the 3rd dimension | ▶ 03:31 |
all the positive examples end up | ▶ 03:33 |
being close to the origin, | ▶ 03:36 |
and all the negative examples | ▶ 03:39 |
are further away, and a separator that is | ▶ 03:41 |
orthogonal to the 3rd input feature | ▶ 03:43 |
solves the separation problem. | ▶ 03:46 |
Mapped back into the space over here, | ▶ 03:49 |
this separator is actually a circle, which is the set of all | ▶ 03:52 |
points that are equidistant | ▶ 03:55 |
to the center of the origin. | ▶ 04:00 |
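A tiny Python sketch of this feature augmentation follows; the data points and the threshold on x3 are made up for illustration.

    from math import sqrt

    # Augment 2-D points with x3 = distance from the origin; the inner
    # (positive) and surrounding (negative) classes then separate by a
    # simple threshold on x3 alone.
    def augment(x1, x2):
        return (x1, x2, sqrt(x1 ** 2 + x2 ** 2))

    inner = [(0.1, 0.2), (-0.3, 0.1), (0.2, -0.2)]   # positive class
    outer = [(2.0, 0.1), (-1.5, 1.5), (0.3, -2.2)]   # negative class

    threshold = 1.0                                  # separator on x3
    for point in inner + outer:
        x1, x2, x3 = augment(*point)
        print(point, "positive" if x3 < threshold else "negative")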
Now this trick could be done in any linear learning algorithm, | ▶ 04:02 |
and it's really an amazing trick. | ▶ 04:06 |
You can take any nonlinear problem, add | ▶ 04:08 |
features of this type or any other type, | ▶ 04:10 |
and use linear techniques | ▶ 04:13 |
and get better solutions. | ▶ 04:15 |
This is a very deep machine learning insight | ▶ 04:17 |
that you can extend your feature space | ▶ 04:19 |
in this way, and there's numerous | ▶ 04:21 |
papers written about this. | ▶ 04:23 |
In SVMs, the extension of the feature space is mathematically done by | ▶ 04:25 |
what's called a kernel. | ▶ 04:31 |
I can't really tell you about this in this class, | ▶ 04:33 |
but it makes it possible to write | ▶ 04:36 |
very large new feature spaces including | ▶ 04:38 |
infinitely dimensional new feature spaces. | ▶ 04:41 |
These methods are very powerful. | ▶ 04:44 |
It turns out you never | ▶ 04:46 |
really compute all those features. | ▶ 04:48 |
They are implicitly represented by | ▶ 04:50 |
so called kernels, and if you care about this, | ▶ 04:52 |
I recommend you dive | ▶ 04:55 |
deeper into the literature | ▶ 04:57 |
of support vector machines. | ▶ 04:59 |
This is meant to just give you | ▶ 05:01 |
an overview of the essence of | ▶ 05:03 |
what support vector machines are all about. | ▶ 05:05 |
So in summary, | ▶ 05:08 |
we learned about linear methods, | ▶ 05:10 |
using them for regression | ▶ 05:12 |
and also for classification. | ▶ 05:15 |
We learned about exact solutions | ▶ 05:17 |
versus iterative solutions. | ▶ 05:19 |
We talked about smoothing, | ▶ 05:23 |
and we even talked about | ▶ 05:25 |
using linear methods for nonlinear problems. | ▶ 05:27 |
So we covered quite a bit of ground. | ▶ 05:30 |
This is a really significant cross section | ▶ 05:33 |
of machine learning. | ▶ 05:35 |
As the final method in this unit, I'd like now to talk about k-nearest neighbors. | ▶ 00:00 |
And the distinguishing factor of k-nearest neighbors | ▶ 00:06 |
is that it is a nonparametric machine learning method. | ▶ 00:09 |
So far we've talked about parametric methods. | ▶ 00:13 |
Parametric methods have parameters, like probabilities or weights, | ▶ 00:16 |
and the number of parameters is constant. | ▶ 00:21 |
Or to put it differently, the number of parameters is independent of the training set size. | ▶ 00:25 |
So for example, in Naive Bayes, if we bring in more data, | ▶ 00:29 |
the number of conditional probabilities will stay the same. | ▶ 00:34 |
Well, that wasn't technically always the case. | ▶ 00:37 |
Our vocabulary might increase, and with it the number of parameters. | ▶ 00:41 |
But for any fixed dictionary, the number of parameters is truly independent of the training set size. | ▶ 00:46 |
The same was true, for example, in our regression cases | ▶ 00:53 |