<?xml version="1.0"?>
<videos date="Friday 16th of December 2011"><group title="Unit 0w" count="1"><video title="1 Introduction" id="BnIJ7Ba5Sr4" length="75"><transcript><text start="0" dur="4">Welcome to the online introduction to artificial intelligence.</text><text start="4" dur="3">My name is Sebastian Thrun. &amp;gt;&amp;gt;I&amp;#39;m Peter Norvig.</text><text start="7" dur="2">We are teaching this class at Stanford, </text><text start="9" dur="2">and now we are teaching it online for the entire world.</text><text start="11" dur="2">We are really excited about this.</text><text start="13" dur="1">It&amp;#39;s great to have you all here.</text><text start="14" dur="4">It&amp;#39;s exciting to have such a record-breaking number of people.</text><text start="18" dur="4">We think we can deliver a good introduction to artificial intelligence.</text><text start="22" dur="2">We hope you&amp;#39;ll stick with it.</text><text start="24" dur="1">It&amp;#39;s going to be a lot of work,</text><text start="25" dur="2">but we think it&amp;#39;s going to be very rewarding.</text><text start="27" dur="2">The way that it is going to be organized is that  </text><text start="29" dur="3">every week there is going to be new videos and with these videos, quizzes.</text><text start="32" dur="3">With these quizzes, you can test your knowledge about AI.</text><text start="35" dur="3">We also post for the advanced version of this class, homework assignments and exams </text><text start="38" dur="2">on which you&amp;#39;ll be quizzed.</text><text start="40" dur="4">We&amp;#39;re going to grade those to give you a final score to see</text><text start="44" dur="3">if you can actually master artificial intelligence the same way </text><text start="47" dur="2">any good student at Stanford would do it.</text><text start="49" dur="5">If you do that, then at the end of the class, we&amp;#39;ll sign a letter of accomplishment, </text><text start="54" dur="4">and let you know that you&amp;#39;ve achieved this and 
what your rank in the class was.</text><text start="58" dur="4">So I hope you have fun.  Watch us on videotape.</text><text start="62" dur="2">We will teach you AI.</text><text start="64" dur="2">Participate in the discussion forum. </text><text start="66" dur="3">Ask your questions, and help others answer questions.</text><text start="69" dur="3">I hope we have a fantastic time ahead of us in the next 10 weeks.</text><text start="72" dur="3">Welcome to the class. We&amp;#39;ll see you online.</text></transcript></video></group><group title="Unit 1w" count="15"><video title="1 Introduction" id="Q7_GQq7cDyM" length="120"><transcript><text start="0" dur="5">Welcome to the first unit of Online Introduction to Artificial Intelligence.</text><text start="5" dur="4">I will be teaching you the very, very basics today.</text><text start="9" dur="5">This is Unit 1 of Artificial Intelligence.</text><text start="14" dur="2">Welcome.</text><text start="16" dur="4">The purpose of this class is twofold:</text><text start="20" dur="5">Number 1, to teach you the very basics of artificial intelligence</text><text start="25" dur="4">so you&amp;#39;ll be able to talk to people in the field</text><text start="29" dur="3">and understand the basic tools of the trade;</text><text start="32" dur="5">and also, very importantly, to excite you about the field.</text><text start="37" dur="5">I have been in the field of artificial intelligence for about 20 years,</text><text start="42" dur="2">and it&amp;#39;s been truly rewarding.</text><text start="44" dur="4">So I want you to participate in the beauty and the excitement of AI</text><text start="48" dur="4">so you can become a professional who gets the same reward </text><text start="52" dur="3">and excitement out of this field as I do.</text><text start="55" dur="5">The basic structure of this class involves videos </text><text start="60" dur="3">in which Peter or I will teach you something new,</text><text start="63" dur="8">then also 
quizzes, which we will ask you about your ability to answer AI questions,</text><text start="71" dur="6">and finally, answer videos in which we tell you what the right answer would have been</text><text start="77" dur="5">for the quiz that you might have falsely or incorrectly answered before.</text><text start="82" dur="6">This will all be reiterated, and every so often you get a homework assignment,</text><text start="88" dur="6">also in the form of quizzes but without the answers.</text><text start="94" dur="3">And then we also have video exams.</text><text start="97" dur="2">If you check our website, there&amp;#39;s requirements</text><text start="99" dur="4">on how you have to do assignments and exams.</text><text start="103" dur="5">Please go to ai-class.org in this class.</text><text start="108" dur="10">An AI program is called wetware, a formula, or an intelligent agent.</text><text start="118" dur="2">Pick the one that fits best.</text></transcript></video><video title="2 Intelligent Agents" id="cx3lV07w-XE" length="94"><transcript><text start="0" dur="4">[Thrun] The correct answer is intelligent agent.</text><text start="4" dur="3">Let&amp;#39;s talk about intelligent agents.</text><text start="7" dur="4">Here is my intelligent agent,</text><text start="11" dur="6">and it gets to interact with an environment.</text><text start="17" dur="5">The agent can perceive the state of the environment</text><text start="22" dur="3">through its sensors,</text><text start="25" dur="4">and it can affect its state through its actuators.</text><text start="29" dur="8">The big question of artificial intelligence is the function that maps sensors to actuators.</text><text start="37" dur="4">That is called the control policy for the agent.</text><text start="41" dur="7">So all of this class will deal with how does an agent make decisions</text><text start="48" dur="6">that it can carry out with its actuators based on past sensor data.</text><text start="54" dur="4">Those 
decisions take place many, many times,</text><text start="58" dur="5">and the loop of environment feedback to sensors, agent decision,</text><text start="63" dur="9">actuator interaction with the environment and so on is called the perception-action cycle.</text><text start="72" dur="3">So here is my very first quiz for you.</text><text start="75" dur="6">Artificial intelligence, AI, has successfully been used in finance,</text><text start="81" dur="5">robotics, games, medicine, and the Web.</text><text start="86" dur="2">Check any or all of those that apply.</text><text start="88" dur="6">And if none of them applies, check the box down here that says none of them.</text></transcript></video><video title="3 Applications of AI" id="N6JW8TQzbX8" length="388"><transcript><text start="0" dur="3">So the correct answer is all of those-- </text><text start="3" dur="5">finance, robotics, games, medicine, the Web, and many more applications.</text><text start="8" dur="2">So let me talk about them in some detail. </text><text start="10" dur="5">There is a huge number of applications of artificial intelligence in finance,</text><text start="15" dur="3">very often in the shape of making trading decisions--</text><text start="18" dur="3">in which case, the agent is called a trading agent. </text><text start="21" dur="6">And the environment might be things like the stock market or the bond market </text><text start="27" dur="2">or the commodities market.</text><text start="29" dur="4">And our trading agent can sense the course of certain things, </text><text start="33" dur="2">like the stock or bonds or commodities.</text><text start="35" dur="5">It can also read the news online and follow certain events.</text><text start="40" dur="8">And its decisions are usually things like buy or sell decisions--trades. 
</text><text start="48" dur="7">There&amp;#39;s a huge history of artificial intelligence finding methods to look at data over time</text><text start="55" dur="3">and make predictions as to how courses develop over time--</text><text start="58" dur="3">and then put in trades behind those. </text><text start="61" dur="5">And very frequently, people using artificial intelligence trading agents</text><text start="66" dur="4">have made a good amount of money with superior trading decisions. </text><text start="70" dur="4">There&amp;#39;s also a long history of AI in Robotics.  </text><text start="74" dur="3">Here is my depiction of a robot. </text><text start="77" dur="3">Of course, there are many different types of robots</text><text start="80" dur="4">and they all interact with their environments through their sensors,</text><text start="84" dur="9">which include things like cameras, microphones, tactile sensor or touch.</text><text start="93" dur="5">And the way they impact their environments is to move motors around,  </text><text start="98" dur="5">in particular, their wheels, their legs, their arms, their grippers.</text><text start="103" dur="3">They can also say things to people using voice. </text><text start="106" dur="4">Now there&amp;#39;s a huge history of using artificial intelligence in robotics.</text><text start="110" dur="4">Pretty much, every robot that does something interesting today uses AI. </text><text start="114" dur="4">In fact, often AI has been studied together with robotics, as one discipline.</text><text start="118" dur="5">But because robots are somewhat special in that they use physical actuators</text><text start="123" dur="3">and deal with physical environments, they are a little bit different from </text><text start="126" dur="2">just artificial intelligence, as a whole. 
</text><text start="128" dur="7">When the Web came out, the early Web crawlers were called robots </text><text start="135" dur="5">and to block a robot from accessing your website, to the present day, </text><text start="140" dur="4">there&amp;#39;s a file called robots.txt, that allows you to deny any Web crawler  </text><text start="144" dur="4">to access and retrieve that information from your website. </text><text start="148" dur="4">So historically, robotics played a huge role in artificial intelligence</text><text start="152" dur="4">and a good chunk of this class will be focusing on robotics. </text><text start="156" dur="3">AI has a huge history in games--</text><text start="159" dur="4">to make games smarter or feel more natural.</text><text start="163" dur="4">There are 2 ways in which AI has been used in games, as a game agent. </text><text start="167" dur="3">One is to play against you, as a human user.</text><text start="170" dur="4">So for example, if you play the game of Chess,</text><text start="174" dur="3">then you are the environment to the game agent.</text><text start="177" dur="6">The game agent gets to observe your moves, and it generates its own moves</text><text start="183" dur="4">with the purpose of defeating you in Chess. </text><text start="187" dur="3">So most adversarial games, where you play against an opponent</text><text start="190" dur="3">and the opponent is a computer program,</text><text start="193" dur="7">the game agent is built to play against you--against your own interests--and make you lose. </text><text start="200" dur="2">And of course, your objective is to win. </text><text start="202" dur="3">That&amp;#39;s an AI games-type situation. 
</text><text start="205" dur="4">The second thing is that games agents in AI </text><text start="209" dur="3">also are used to make games feel more natural.</text><text start="212" dur="4">So very often games have characters inside, and these characters act in some way.</text><text start="216" dur="6">And it&amp;#39;s important for you, as the player, to feel that these characters are believable. </text><text start="222" dur="3">There&amp;#39;s an entire sub-field of artificial intelligence to use AI</text><text start="225" dur="6">to make characters in a game more believable--look smarter, so to speak--</text><text start="231" dur="4">so that you, as a player, think you&amp;#39;re playing a better game. </text><text start="235" dur="5">Artificial intelligence has a long history in medicine as well. </text><text start="240" dur="4">The classic example is that of a diagnostic agent. </text><text start="244" dur="5">So here you are--and you might be sick, and you go to your doctor. </text><text start="249" dur="2">And your doctor wishes to understand </text><text start="251" dur="6">what the reason for your symptoms and your sickness is.</text><text start="257" dur="4">The diagnostic agent will observe you through various measurements-- </text><text start="261" dur="4">for example, blood pressure and heart signals, and so on--</text><text start="265" dur="4">and it&amp;#39;ll come up with the hypothesis as to what you might be suffering from.</text><text start="269" dur="5">But rather than intervene directly, in most cases the diagnostic of your disease </text><text start="274" dur="4">is communicated to the doctor, who then takes on the intervention.</text><text start="278" dur="2">This is called a diagnostic agent. 
</text><text start="280" dur="3">There are many other versions of AI in medicine.</text><text start="283" dur="5">AI is used in intensive care to understand whether there are situations</text><text start="288" dur="2">that need immediate attention.</text><text start="290" dur="4">It&amp;#39;s been used for life-long medicine to monitor signs over long periods of time.</text><text start="294" dur="4">And as medicine becomes more personal, the role of artificial intelligence</text><text start="298" dur="3">will definitely increase. </text><text start="301" dur="4">We already mentioned AI on the Web.</text><text start="305" dur="4">The most generic version of AI is to crawl the Web and understand the Web,</text><text start="309" dur="3">and assist you in answering questions.</text><text start="312" dur="3">So when you have this search box over here</text><text start="315" dur="3">and it says &amp;quot;Search&amp;quot; on the left,</text><text start="318" dur="2">and &amp;quot;I&amp;#39;m Feeling Lucky&amp;quot; on the right,</text><text start="320" dur="1">and you type in the words,</text><text start="321" dur="7">what AI does for you is it understands what words you typed in</text><text start="328" dur="2">and finds the most relevant pages. </text><text start="330" dur="2">That is really core artificial intelligence. </text><text start="332" dur="4">It&amp;#39;s used by a number of companies, such as Microsoft and Google</text><text start="336" dur="3">and Amazon, Yahoo, and many others.</text><text start="339" dur="4">And the way this works is that there&amp;#39;s a crawling agent that can go</text><text start="343" dur="8">to the World Wide Web and retrieve pages, through just a computer program.</text><text start="351" dur="5">It then sorts these pages into a big database inside the crawler</text><text start="356" dur="5">and also analyzes the relevance of each page to any possible query. 
</text><text start="361" dur="3">When you then come and issue a query, </text><text start="364" dur="4">the AI system is able to give you a response--  </text><text start="368" dur="4">for example, a collection of 10 best Web links. </text><text start="372" dur="3">In short, every time you try to write a piece of software,</text><text start="375" dur="3">that makes your computer software smart</text><text start="378" dur="2">likely you will need artificial intelligence.</text><text start="380" dur="3">And in this class, Peter and I will teach you </text><text start="383" dur="2">many of the basic tricks of the trade</text><text start="385" dur="3">to make your software really smart.</text></transcript></video><video title="4 Terminology" id="5lcLmhsmBnQ" length="317"><transcript><text start="0" dur="4">It will be good to introduce some basic terminology</text><text start="4" dur="5">that is commonly used in artificial intelligence to distinguish different types of problems.</text><text start="9" dur="7">The very first word I will teach you is  fully versus partially observable.</text><text start="16" dur="3">An environment is called fully observable if what your agent can sense</text><text start="19" dur="7">at any point in time is completely sufficient to make the optimal decision. 
</text><text start="26" dur="3">So, for example, in many card games, </text><text start="29" dur="7">when all the cards are on the table, the momentary sight of all those cards</text><text start="36" dur="4">is really sufficient to make the optimal choice.</text><text start="40" dur="6">That is in contrast to some other environments where you need memory</text><text start="46" dur="4">on the side of the agent to make the best possible decision.</text><text start="50" dur="5">For example, in the game of poker, the cards aren&amp;#39;t openly on the table,</text><text start="55" dur="5">and memorizing past moves will help you make a better decision.</text><text start="60" dur="4">To fully understand the difference, consider the interaction of an agent</text><text start="64" dur="4"> with the environment through its sensors and its actuators, </text><text start="68" dur="3">and this interaction takes place over many cycles, </text><text start="71" dur="5">often called the perception-action cycle.</text><text start="76" dur="3">For many environments, it&amp;#39;s convenient to assume </text><text start="79" dur="3">that the environment has some sort of internal state.</text><text start="82" dur="6">For example, in a card game where the cards are not openly on the table, </text><text start="88" dur="5">the state might pertain to the cards in your hand.</text><text start="93" dur="4">An environment is fully observable if the sensors can always see </text><text start="97" dur="4">the entire state of the environment. 
</text><text start="101" dur="5">It&amp;#39;s partially observable if the sensors can only see a fraction of the state,</text><text start="106" dur="6">yet memorizing past measurements gives us additional information of the state</text><text start="112" dur="3">that is not readily observable right now.</text><text start="115" dur="6">So any game, for example, where past moves have information about </text><text start="121" dur="5">what might be in a person&amp;#39;s hand, those games are partially observable,</text><text start="126" dur="2">and they require different treatment. </text><text start="128" dur="4">Very often agents that deal with partially observable environments</text><text start="132" dur="3">need to acquire internal memory to understand what</text><text start="135" dur="3"> the state of the environment is, and we&amp;#39;ll talk extensively </text><text start="138" dur="3">when we talk about hidden Markov models about how this structure</text><text start="141" dur="2">has such internal memory. </text><text start="143" dur="3">A second terminology for environments pertains to whether the environment</text><text start="146" dur="3">is deterministic or stochastic. </text><text start="149" dur="6">Deterministic environment is one where your agent&amp;#39;s actions</text><text start="155" dur="2">uniquely determine the outcome. </text><text start="157" dur="5">So, for example, in chess, there&amp;#39;s really no randomness when you move a piece.</text><text start="162" dur="4">The effect of moving a piece is completely predetermined,</text><text start="166" dur="4">and no matter where I&amp;#39;m going to move the same piece, the outcome is the same. </text><text start="170" dur="2">That we call deterministic. </text><text start="172" dur="4">Games with dice, for example, like backgammon, are stochastic. 
</text><text start="176" dur="4">While you can still deterministically move your pieces, </text><text start="180" dur="3">the outcome of an action also involves throwing of the dice, </text><text start="183" dur="2">and you can&amp;#39;t predict those. </text><text start="185" dur="3">There&amp;#39;s a certain amount of randomness involved for the outcome of dice,</text><text start="188" dur="2">and therefore, we call this stochastic. </text><text start="190" dur="4">Let me talk about discrete versus continuous.</text><text start="194" dur="4">A discrete environment is one where you have finitely many action choices, </text><text start="198" dur="3">and finitely many things you can sense. </text><text start="201" dur="4">So, for example, in chess, again, there&amp;#39;s finitely many board positions,</text><text start="205" dur="3">and finitely many things you can do.</text><text start="208" dur="2">That is different from a continuous environment </text><text start="210" dur="5">where the space of possible actions or things you could sense may be infinite. </text><text start="215" dur="6">So, for example, if you throw darts, there&amp;#39;s infinitely many ways to angle the darts</text><text start="221" dur="2">and to accelerate them. </text><text start="223" dur="6">Finally, we distinguish benign versus adversarial environments. </text><text start="229" dur="4">In benign environments, the environment might be random.</text><text start="233" dur="4">It might be stochastic, but it has no objective on its own </text><text start="237" dur="2">that would contradict your own objective. </text><text start="239" dur="3">So, for example, weather is benign.</text><text start="242" dur="4">It might be random. It might affect the outcome of your actions.</text><text start="246" dur="2">But it isn&amp;#39;t really out there to get you. 
</text><text start="248" dur="6">Contrast this with adversarial environments, such as many games, like chess,</text><text start="254" dur="2">where your opponent is really out there to get you.</text><text start="256" dur="5">It turns out it&amp;#39;s much harder to find good actions in adversarial environments</text><text start="261" dur="5">where the opponent actively observes you and counteracts what you&amp;#39;re trying to achieve</text><text start="266" dur="4">relative to benign environment, where the environment might merely be stochastic</text><text start="270" dur="5">but isn&amp;#39;t really interested in making your life worse.</text><text start="275" dur="3">So, let&amp;#39;s see to what extent these expressions make sense to you</text><text start="278" dur="2">by going to our next quiz. </text><text start="280" dur="5">So here are the 4 concepts again: partially observable versus fully,</text><text start="285" dur="5">stochastic versus deterministic, continuous versus discrete, </text><text start="290" dur="2">adversarial versus benign.</text><text start="292" dur="4">And let me ask you about the game of checkers.</text><text start="296" dur="4">Check one or all of those attributes that apply.</text><text start="300" dur="3">So, if you think checkers is partially observable, check this one.</text><text start="303" dur="2">Otherwise, just don&amp;#39;t check it.</text><text start="305" dur="2">If you think it&amp;#39;s stochastic, check this one,</text><text start="307" dur="4">continuous, check this one, adversarial, check this one.</text><text start="311" dur="4">If you don&amp;#39;t know about checkers, you can check the Web and Google it</text><text start="315" dur="2">to find a little more information about checkers.</text></transcript></video><video title="5 Checkers Answer" id="qVppDRbx2kM" length="52"><transcript><text start="0" dur="4">So, checkers is an interesting game.</text><text start="4" dur="4">Here&amp;#39;s the typical board of the 
game of checkers.</text><text start="8" dur="3">Your pieces might look like this, </text><text start="11" dur="5">and your opponent&amp;#39;s pieces might look like this.</text><text start="16" dur="3">And apart from some very cryptic rules in checkers,</text><text start="19" dur="4">which I won&amp;#39;t really discuss here, the board basically tells you</text><text start="23" dur="5">everything there is to know about checkers, so it&amp;#39;s clearly fully observable.</text><text start="28" dur="5">It is deterministic because your move and your opponent&amp;#39;s move</text><text start="33" dur="3">very clearly affect the state of the board in ways that have </text><text start="36" dur="3">absolutely no stochasticity. </text><text start="39" dur="6">It is also discrete because there&amp;#39;s finitely many action choices</text><text start="45" dur="2">and finitely many board positions,</text><text start="47" dur="5">and obviously, it is adversarial, since your opponent is out to get you.</text></transcript></video><video title="6 Poker" id="M_AdFAazf4k" length="12"><transcript><text start="0" dur="6">[Male narrator] The game of poker--is this partially observable, stochastic, </text><text start="6" dur="3">continuous, or adversarial?</text><text start="9" dur="3">Please check any or all of those that apply.</text></transcript></video><video title="7 Poker Answer" id="DjILhASM3A8" length="30"><transcript><text start="0" dur="3">[Male narrator] I would argue poker is partially observable</text><text start="3" dur="5">because it can&amp;#39;t be seen what is in your opponent&amp;#39;s hands.</text><text start="8" dur="5">It is stochastic because you&amp;#39;re being dealt cards that are kind of coming at random.</text><text start="13" dur="3">It is not continuous; there&amp;#39;s just finitely many cards</text><text start="16" dur="4">and finitely many actions you can do, even though you might argue</text><text start="20" dur="4">that there&amp;#39;s a huge number of 
different monies you can bet. </text><text start="24" dur="3">It&amp;#39;s still finite, and it is clearly adversarial.</text><text start="27" dur="3">If you&amp;#39;ve ever played poker before, you know how brutal it can be.</text></transcript></video><video title="8 Robotic Car" id="vz-ERydsKLU" length="22"><transcript><text start="0" dur="4">[Male narrator] --a favorite, a robotic car.</text><text start="4" dur="2">I wish to know whether it is partially observable, </text><text start="6" dur="5">stochastic, continuous, or adversarial. </text><text start="11" dur="5">That is, is the problem of driving robotically--</text><text start="16" dur="4">say, in a city--subject to any of those 4 categories?</text><text start="20" dur="2">Please check any or all that might apply.</text></transcript></video><video title="9 Robotic Car Answer" id="nOWCfVG0xNQ" length="33"><transcript><text start="0" dur="4">Well, the robotic car clearly deals with a partially observable environment</text><text start="4" dur="6">if you just look at momentary sensing input, you can&amp;#39;t even tell how fast other cars are going.</text><text start="10" dur="2">So, you need to memorize something.</text><text start="12" dur="3">It is stochastic because it&amp;#39;s inherently unpredictable </text><text start="15" dur="2">what&amp;#39;s going to happen next with other cars. </text><text start="17" dur="3">It is continuous. </text><text start="20" dur="3">There&amp;#39;s the infinitely many ways to set your steering</text><text start="23" dur="3">or push your gas pedal or your brake,</text><text start="26" dur="3">and, well, you can argue with adversarial or not.</text><text start="29" dur="2">Depending on where you live, it might be highly adversarial. 
</text><text start="31" dur="2">Where I live, it isn&amp;#39;t.</text></transcript></video><video title="10 AI and Uncertainty" id="ytw6_8a5Wls" length="88"><transcript><text start="0" dur="3">I&amp;#39;m going to briefly talk of AI as something else, </text><text start="3" dur="7">which is AI is the technique of uncertainty management in computer software.</text><text start="10" dur="7">Put differently, AI is the discipline that you apply when you want to know what to do</text><text start="17" dur="5">when you don&amp;#39;t know what to do.</text><text start="22" dur="5">Now, there&amp;#39;s many reasons why there might be uncertainty in a computer program.</text><text start="27" dur="2">There could be a sensor limit.</text><text start="29" dur="4">That is, your sensors are unable to tell me </text><text start="33" dur="4">what exactly is the case outside the AI system.</text><text start="37" dur="4">There could be adversaries who act in a way that makes it hard for  you</text><text start="41" dur="3">to understand what is the case.</text><text start="44" dur="4">There could be stochastic environments.</text><text start="48" dur="3">Every time you roll the dice in a dice game,</text><text start="51" dur="4">the stochasticity of the dice will make it impossible for you</text><text start="55" dur="2">to be absolutely certain of what&amp;#39;s the situation. </text><text start="57" dur="3">There could be laziness.</text><text start="60" dur="4">So perhaps you can actually compute what the situation is,</text><text start="64" dur="3">but your computer program is just too lazy to do it.</text><text start="67" dur="4">And here&amp;#39;s my favorite: ignorance, plain ignorance. 
</text><text start="71" dur="3">Many people are just ignorant of what&amp;#39;s going on.</text><text start="74" dur="3">They could know it, but they just don&amp;#39;t care.</text><text start="77" dur="4">All of these things are cause for uncertainty.</text><text start="81" dur="7">AI is the discipline that deals with uncertainty and manages it in decision making. </text></transcript></video><video title="11 Examples of AI in Practice" id="sPSN0aI0PgE" length="240"><transcript><text start="0" dur="3">Now we&amp;#39;ve had an introduction to AI.</text><text start="3" dur="3">We&amp;#39;ve heard about some of the properties of environments, </text><text start="6" dur="4">and we&amp;#39;ve seen some possible architecture for agents. </text><text start="10" dur="3">I&amp;#39;d like next to show you some examples of AI in practice.</text><text start="13" dur="5">And Sebastian and I have some experience personally in things we have done</text><text start="18" dur="3">at Google, at NASA, and at Stanford.</text><text start="21" dur="4">And I want to tell you a little bit about some of those.</text><text start="25" dur="3">One of the best successes of AI technology at Google </text><text start="28" dur="3">has been the machine translation system. 
</text><text start="31" dur="6">Here we see an example of an article in Italian automatically translated into English.</text><text start="37" dur="4">Now, these systems are built for 50 different languages,</text><text start="41" dur="5">and we can translate from any of the languages into any of the other languages.</text><text start="46" dur="5">So, that&amp;#39;s over 2,500 different systems, and we&amp;#39;ve done this all </text><text start="51" dur="4">using machine learning techniques, using AI techniques,</text><text start="55" dur="3">rather than trying to build them by hand.</text><text start="58" dur="5">And the way it works is that we go out and collect examples of text</text><text start="63" dur="3">that&amp;#39;s aligned between the 2 languages.</text><text start="66" dur="5">So we find, say, a newspaper that publishes 2 editions, </text><text start="71" dur="5">an Italian edition and an English edition, and now we have examples of translations. </text><text start="76" dur="6">And if anybody ever asked us for exactly the translation of this one particular article,</text><text start="82" dur="3">then we could just look it up and say &amp;quot;We already know that.&amp;quot;</text><text start="85" dur="2">But of course, we aren&amp;#39;t often going to be asked that.</text><text start="87" dur="3">Rather, we&amp;#39;re going to be asked parts of this.</text><text start="90" dur="4">Here are some words that we&amp;#39;ve seen before, and we have to figure out </text><text start="94" dur="6">which words in this article correspond to which words in the translation article.</text><text start="100" dur="5">And when we do that by examining many, many millions of words of text</text><text start="105" dur="4"> in the 2 languages and making the correspondence, </text><text start="109" dur="2">and then we can put that all together.</text><text start="111" dur="3">And then when we see a new example of text that we haven&amp;#39;t seen before,</text><text start="114" 
dur="4">we can just look up what we&amp;#39;ve seen in the past for that correspondence.</text><text start="118" dur="3">So, the task is really two parts.</text><text start="121" dur="4">Offline, before we see an example of text we want to translate,</text><text start="125" dur="2">we first build our translation model.</text><text start="127" dur="3">We do that by examining all of the different examples</text><text start="130" dur="4">and figuring out which part aligns to which.</text><text start="134" dur="4">Now, when we&amp;#39;re given a text to translate, we use that model, </text><text start="138" dur="4">and we go through and find the most probable translation.</text><text start="142" dur="2">So, what does it look like?</text><text start="144" dur="2">Well, let&amp;#39;s look at it in some example text.</text><text start="146" dur="3">And rather than look at news articles, I&amp;#39;m going to look at something simpler. </text><text start="149" dur="6">I&amp;#39;m going to switch from Italian to Chinese.</text><text start="155" dur="2">Here&amp;#39;s a bilingual text.</text><text start="157" dur="4">Now, for large-scale machine translation, examples are found on the Web.</text><text start="161" dur="5">This example was found in a Chinese restaurant by Adam Lopez.</text><text start="166" dur="3">Now, it&amp;#39;s given, for a text of this form, </text><text start="169" dur="6">that a line in Chinese corresponds to a line in English,</text><text start="175" dur="4">and that&amp;#39;s true for each of the individual lines.</text><text start="179" dur="3">But to learn from this text, what we really want to discover</text><text start="182" dur="5">is what individual words in Chinese correspond to individual words</text><text start="187" dur="2">or small phrases in English.</text><text start="189" dur="7">I&amp;#39;ve started that process by highlighting the word &amp;quot;wonton&amp;quot; in English.</text><text start="196" dur="2">It appears 3 times 
throughout the text.</text><text start="198" dur="5">Now, in each of those lines, there&amp;#39;s a character that appears,</text><text start="203" dur="4">and that&amp;#39;s the only place in the Chinese text where that character appears.</text><text start="207" dur="6">So, that seems like it&amp;#39;s a high probability that this character in Chinese</text><text start="213" dur="3">corresponds to the word &amp;quot;wonton&amp;quot; in English.</text><text start="216" dur="2">Let&amp;#39;s see if we can go farther.</text><text start="218" dur="6">My question for you is what word or what character or characters in Chinese</text><text start="224" dur="3">correspond to the word &amp;quot;chicken&amp;quot; in English?</text><text start="227" dur="7">And here we see &amp;quot;chicken&amp;quot; appears in these locations.</text><text start="234" dur="6">Click on the character or characters in Chinese that corresponds to &amp;quot;chicken.&amp;quot;</text></transcript></video><video title="12 Chinese Translation Answer" id="RWhwKudtixY" length="44"><transcript><text start="1" dur="3">The answer is that chicken appears here,</text><text start="4" dur="6">here, here, and here.</text><text start="10" dur="4">Now, I don&amp;#39;t know for sure, 100%, that that is the character for chicken in Chinese,</text><text start="14" dur="3">but I do know that there is a good correspondence.</text><text start="17" dur="3">Every place the word chicken appears in English,</text><text start="20" dur="4">this character appears in Chinese and no other place.</text><text start="24" dur="3">Let&amp;#39;s go 1 step farther.</text><text start="27" dur="3">Let&amp;#39;s see if we can work out a phrase in Chinese</text><text start="30" dur="3">and see if it corresponds to a phrase in English.</text><text start="33" dur="4">Here&amp;#39;s the phrase corn cream.</text><text start="38" dur="6">Click on the characters in Chinese that correspond to corn cream.</text></transcript></video><video 
title="13 Chinese Translation Answer 2" id="vvyaXxjsxBU" length="29"><transcript><text start="0" dur="4">The answer is: these 2 characters here</text><text start="4" dur="3">appear only in these 2 locations</text><text start="7" dur="3">corresponding to the words corn cream </text><text start="10" dur="3">which appear only in these locations in the English text.</text><text start="13" dur="4">Again, we&amp;#39;re not 100% sure that&amp;#39;s the right answer,</text><text start="17" dur="3">but it looks like a strong correlation.</text><text start="20" dur="2">Now, 1 more question.</text><text start="22" dur="4">Tell me what character or characters in Chinese</text><text start="26" dur="3">correspond to the English word soup.</text></transcript></video><video title="14 Chinese Translation Answer 3" id="lFJey0tOvBg" length="48"><transcript><text start="0" dur="5">The answer is that soup occurs in most of these phrases</text><text start="9" dur="2">but not 100% of them.</text><text start="11" dur="2">It&amp;#39;s missing in this phrase.</text><text start="14" dur="3">Equivalently, on the Chinese side </text><text start="17" dur="3">we see this character occurs</text><text start="20" dur="3">in most of the phrases,</text><text start="23" dur="4">but it&amp;#39;s missing here.</text><text start="27" dur="4">So we see that the correspondence doesn&amp;#39;t have to be 100%</text><text start="31" dur="3">to tell us that there is still a good chance of a correlation.</text><text start="34" dur="3">When we&amp;#39;re learning to do machine translation</text><text start="37" dur="4">we use these kinds of alignments to learn probability tables</text><text start="41" dur="4">of what is the probability of one phrase in one language </text><text start="45" dur="3">corresponding to the phrase in another language.</text></transcript></video><video title="15 Congratulations" id="mXM38kjzK-M" length="56"><transcript><text start="0" dur="3">So congratulations, you just finished unit 
1.</text><text start="3" dur="4">You just finished unit 1 of this class, </text><text start="7" dur="3">where I told you about key applications</text><text start="10" dur="3">of artificial intelligence,</text><text start="13" dur="5">I told you about the definition of an intelligent agent,</text><text start="18" dur="5">I gave you 4 key attributes of intelligent agents</text><text start="24" dur="6">(partial observability, stochasticity, continuous spaces, and adversarial natures),</text><text start="31" dur="3">I discussed sources and management of uncertainty,</text><text start="34" dur="6">and I briefly mentioned the mathematical concept of rationality.</text><text start="40" dur="5">Obviously, I only touched on each of these issues superficially,</text><text start="45" dur="4">but as this class goes on you&amp;#39;re going to dive into each of those</text><text start="49" dur="2">and learn much more about </text><text start="51" dur="4">what it takes to make a truly intelligent AI system.</text><text start="55" dur="1">Thank you.</text></transcript></video></group><group title="Unit 2" count="42"><video title="Topic 1, Introduction" id="ZQmJuHtpGfs" length="94"><transcript><text start="0" dur="1">[PROBLEM SOLVING]</text><text start="1" dur="3">In this unit we&amp;#39;re going to talk about problem solving.</text><text start="4" dur="2">The theory and technology of building agents</text><text start="6" dur="4">that can plan ahead to solve problems.</text><text start="10" dur="3">In particular, we&amp;#39;re talking about problem solving </text><text start="13" dur="4">where the complexity of the problem comes from the idea that there are many states.</text><text start="17" dur="2">As in this problem here.</text><text start="19" dur="5">A navigation problem where there are many choices to start with.</text><text start="24" dur="5">And the complexity comes from picking the right choice now and picking the right choice at the</text><text start="29" dur="3">next 
intersection and the intersection after that.</text><text start="32" dur="3">Stringing together a sequence of actions.</text><text start="35" dur="4">This is in contrast to the type of complexity shown in this picture,</text><text start="39" dur="4">where the complexity comes from the partial observability</text><text start="43" dur="3">that we can&amp;#39;t see through the fog where the possible paths are.</text><text start="46" dur="2">We can&amp;#39;t see the results of our actions</text><text start="48" dur="3">and even the actions themselves are not known.</text><text start="51" dur="5">This type of complexity will be covered in a later unit. </text><text start="56" dur="2">Here&amp;#39;s an example of a problem.</text><text start="58" dur="5">This is a route-finding problem where we&amp;#39;re given a start city,</text><text start="63" dur="6">in this case, Arad, and a destination, Bucharest, the capital of Romania,</text><text start="69" dur="2">of which this is a corner of the map.</text><text start="71" dur="5">And the problem then is to find a route from Arad to Bucharest.</text><text start="76" dur="4">The actions that the agent can execute are driving </text><text start="80" dur="3">from one city to the next along one of the roads shown on the map.</text><text start="83" dur="5">The question is, is there a solution that the agent can come up with </text><text start="88" dur="6">given the knowledge shown here to the problem of driving from Arad to Bucharest?</text></transcript></video><video title="Topic 2, Route Finding Question" id="SIHc9LgMeaU" length="269"><transcript><text start="0" dur="3">And the answer is no.</text><text start="3" dur="3">There is no solution that the agent can come up with </text><text start="6" dur="2">because Bucharest doesn&amp;#39;t appear on the map,</text><text start="8" dur="4">and so the agent doesn&amp;#39;t know any actions that can arrive there. 
</text><text start="12" dur="7">So let&amp;#39;s give the agent a better chance. </text><text start="19" dur="4">Now we&amp;#39;ve given the agent the full map of Romania.</text><text start="23" dur="7">To start, he&amp;#39;s in Arad, and the destination--or goal--is in Bucharest.</text><text start="30" dur="5">And the agent is given the problem of coming up with a sequence of actions</text><text start="35" dur="2">that will arrive at the destination.</text><text start="37" dur="6">Now, is it possible for the agent to solve this problem?</text><text start="43" dur="2">And the answer is yes. </text><text start="45" dur="5">There are many routes or steps or sequences of actions that will arrive at the destination. </text><text start="50" dur="3">Here is one of them:</text><text start="53" dur="7">Starting out in Arad, taking this step first, then this one, then this one, </text><text start="60" dur="5">then this one, and then this one to arrive at the destination. </text><text start="65" dur="3">So that would count as a solution to the problem.</text><text start="68" dur="4">A solution is a sequence of actions, chained together, that is guaranteed to get us to the goal. </text><text start="72" dur="2">[DEFINITION OF A PROBLEM]</text><text start="74" dur="3">Now let&amp;#39;s formally define what a problem looks like. </text><text start="77" dur="4">A problem can be broken down into a number of components. </text><text start="81" dur="4">First, the initial state that the agent starts out with.                         </text><text start="85" dur="7">In our route finding problem, the initial state was the agent being in the city of Arad. </text><text start="92" dur="9">Next, a function--Actions--that takes a state as input and returns    </text><text start="101" dur="6">a set of possible actions that the agent can execute when the agent is in this state. 
</text><text start="107" dur="3">[ACTIONS (s) -&amp;gt; {a1,a2,a3...}]</text><text start="110" dur="4">In some problems, the agent will have the same actions available in all states</text><text start="114" dur="4">and in other problems, he&amp;#39;ll have different actions dependent on the state. </text><text start="118" dur="4">In the route finding problem, the actions are dependent on the state.</text><text start="122" dur="4">When we&amp;#39;re in one city, we can take the routes to the neighboring cities--</text><text start="126" dur="3">but we can&amp;#39;t go to any other cities. </text><text start="129" dur="11">Next we have a function called Result, which takes, as input, a state and an action</text><text start="140" dur="4">and delivers, as its output, a new state. </text><text start="144" dur="9">So, for example, if the agent is in the city of Arad--that would be the state--</text><text start="153" dur="7">and takes the action of driving along Route E-671 towards Timisoara,</text><text start="160" dur="5">then the result of applying that action in that state would be the new state--</text><text start="165" dur="6">where the agent is in the city of Timisoara.</text><text start="171" dur="7">Next, we need a function called Goal Test,</text><text start="178" dur="6">which takes a state and returns a Boolean value--    </text><text start="184" dur="5">true or false--telling us if this state is a goal or not. </text><text start="189" dur="5">In a route-finding problem, the only goal would be being in the destination city--</text><text start="194" dur="5">the city of Bucharest--and all the other states would return false for the Goal Test. 
</text><text start="199" dur="9">And finally, we need one more thing which is a Path Cost function--</text><text start="208" dur="12">which takes a path, a sequence of state/action transitions,</text><text start="220" dur="4">and returns a number, which is the cost of that path.</text><text start="224" dur="6">Now, for most of the problems we&amp;#39;ll deal with, we&amp;#39;ll make the Path Cost function be additive</text><text start="230" dur="6">so that the cost of the path is just the sum of the costs of the individual steps. </text><text start="236" dur="8">And so we&amp;#39;ll implement this Path Cost function, in terms of a Step Cost function. </text><text start="244" dur="10">The Step Cost function takes a state, an action, and the resulting state from that action </text><text start="254" dur="4">and returns a number--n--which is the cost of that action. </text><text start="258" dur="6">In the route finding example, the cost might be the number of miles traveled </text><text start="264" dur="5">or maybe the number of minutes it takes to get to that destination. </text></transcript></video><video title="Topic 3, Route Finding" id="bEi73QXP7PA" length="175"><transcript><text start="0" dur="6">Now let&#x2019;s see how the definition of a problem</text><text start="6" dur="4">maps onto the route-finding domain.</text><text start="10" dur="2">First, the initial state was given. 
</text><text start="12" dur="3">Let&#x2019;s say we start off in Arad,</text><text start="15" dur="2">and the goal test,</text><text start="17" dur="5">let&#x2019;s say that the state of being in Bucharest</text><text start="22" dur="2">is the only state that counts as a goal,</text><text start="24" dur="2">and all the other states are not goals.</text><text start="26" dur="3">Now the set of all of the states here</text><text start="29" dur="2">is known as the state space,</text><text start="31" dur="4">and we navigate the state space by applying actions.</text><text start="35" dur="4">The actions are specific to each city,</text><text start="39" dur="3">so when we are in Arad, there are three possible actions,</text><text start="42" dur="4">to follow this road, this one, or this one.</text><text start="46" dur="3">And as we follow them, we build paths</text><text start="49" dur="2">or sequences of actions.</text><text start="51" dur="4">So just being in Arad is the path of length zero,</text><text start="55" dur="3">and now we could start exploring the space</text><text start="58" dur="3">and add in this path of length one,</text><text start="61" dur="2">this path of length one,</text><text start="63" dur="3">and this path of length one.</text><text start="66" dur="5">We could add in another path here of length two</text><text start="71" dur="3"> and another path here of length two.</text><text start="74" dur="3">Here is another path of length two.</text><text start="77" dur="4">Here is a path of length three.</text><text start="81" dur="5">Another path of length two, and so on.</text><text start="86" dur="2">Now at every point, </text><text start="88" dur="6">we want to separate the state space out into three parts.</text><text start="94" dur="3">First, the ends of the paths&#x2014;</text><text start="97" dur="3">The farthest paths that have been explored,</text><text start="100" dur="2">we call the frontier.</text><text start="102" dur="4">And so the frontier in this 
case</text><text start="106" dur="5">consists of these states</text><text start="111" dur="4">that are the farthest out we have explored.</text><text start="115" dur="4">And then to the left of that in this diagram,</text><text start="119" dur="3">we have the explored part of the state space.</text><text start="122" dur="2">And then off to the right,</text><text start="124" dur="2">we have the unexplored.</text><text start="126" dur="3">So let&#x2019;s write down those three components.</text><text start="129" dur="6">We have the frontier.</text><text start="135" dur="5">We have the unexplored region,</text><text start="140" dur="5">and we have the explored region.</text><text start="145" dur="2">One more thing,</text><text start="147" dur="3">in this diagram we have labeled the step cost </text><text start="150" dur="3">of each action along the route.</text><text start="153" dur="4">So the step cost of going from Neamt to Iasi</text><text start="157" dur="5">would be 87 corresponding to a distance of 87 kilometers,</text><text start="162" dur="4">and the path cost is just the sum of the step costs.</text><text start="166" dur="2">So the cost of the path</text><text start="168" dur="2">of going from Arad to Oradea</text><text start="170" dur="5">would be 71 plus 75.</text></transcript></video><video title="Topic 4, Tree Search" id="c0PfWsqtfdo" length="199"><transcript><text start="0" dur="4">[Narrator] Now let&amp;#39;s define a function for solving problems.</text><text start="4" dur="3">It&amp;#39;s called Tree Search because it superimposes  </text><text start="7" dur="3">a search tree over the state space.</text><text start="10" dur="2">Here&amp;#39;s how it works: It starts off by</text><text start="12" dur="2">initializing the frontier to be the path</text><text start="14" dur="2">consisting of only the initial state,</text><text start="16" dur="2">and then it goes into a loop </text><text start="18" dur="3">in which it first checks to see </text><text 
start="21" dur="2">do we still have anything left in the frontier?</text><text start="23" dur="2">If not we fail, there can be no solution.</text><text start="25" dur="3">If we do have something, then we make a choice.</text><text start="28" dur="3">Tree Search is really a family of functions</text><text start="31" dur="2">not a single algorithm which  </text><text start="33" dur="2">depends on how we make that choice,</text><text start="35" dur="3">and we&amp;#39;ll see some of the options later.</text><text start="38" dur="3">If we go ahead and make a choice of one of </text><text start="41" dur="2">the paths on the frontier and remove that </text><text start="43" dur="2">path from the frontier, we find the state</text><text start="45" dur="2">which is at the end of the path, and if that</text><text start="47" dur="2">state&amp;#39;s a goal then we&amp;#39;re done.</text><text start="49" dur="2">We found a path to the goal; otherwise,  </text><text start="51" dur="3">we do what&amp;#39;s called expanding that path.</text><text start="54" dur="3">We look at all the actions from that state,</text><text start="57" dur="3">and we add to the path the actions </text><text start="60" dur="3">and the result of that state; so we get </text><text start="63" dur="3">a new path that has the old path, the action </text><text start="66" dur="3">and the result of that action, and we</text><text start="69" dur="4">stick all of those paths back onto the frontier.</text><text start="77" dur="2">Now Tree Search represents a whole family </text><text start="79" dur="3">of algorithms, and where you get the family</text><text start="82" dur="2">resemblance is that they&amp;#39;re all looking </text><text start="84" dur="2">at the frontier, popping items off  </text><text start="86" dur="3">and looking to see if they pass the goal test,</text><text start="89" dur="2">but where you get the difference is right here,</text><text start="91" dur="3">in the choice of how you&amp;#39;re going to 
expand </text><text start="94" dur="2">the next item on the frontier, which </text><text start="96" dur="3">path do we look at first, and we&amp;#39;ll go through</text><text start="99" dur="3">different sets of algorithms that make  </text><text start="102" dur="3">different choices for which path to look at first.</text><text start="107" dur="2">The first algorithm I want to consider </text><text start="109" dur="2">is called Breadth-First Search.</text><text start="111" dur="3">Now it could be called shortest-first search</text><text start="114" dur="2">because what it does is always choose</text><text start="116" dur="3">from the frontier one of the paths that hadn&amp;#39;t been</text><text start="119" dur="3">considered yet that&amp;#39;s the shortest possible. </text><text start="122" dur="2">So how does it work?</text><text start="124" dur="2">Well we start off with the path of </text><text start="126" dur="4">length 0, starting in the start state, and  </text><text start="130" dur="3">that&amp;#39;s the only path in the frontier so</text><text start="133" dur="2">it&amp;#39;s the shortest one so we pick it,</text><text start="135" dur="2">and then we expand it, and we add in</text><text start="137" dur="3">all the paths that result from </text><text start="140" dur="2">applying all the possible actions.</text><text start="142" dur="3">So now we&amp;#39;ve removed </text><text start="145" dur="3">this path from the frontier,</text><text start="148" dur="3">but we&amp;#39;ve added in 3 new paths. 
</text><text start="151" dur="2">This one, </text><text start="153" dur="4">this one, and this one.</text><text start="157" dur="2">Now we&amp;#39;re in a position where</text><text start="159" dur="3">we have 3 paths on the frontier, and</text><text start="162" dur="3">we have to pick the shortest one.</text><text start="165" dur="2">Now in this case all 3 paths </text><text start="167" dur="3">have the same length, length 1, so we </text><text start="170" dur="2">break the tie at random or using some</text><text start="172" dur="4">other technique, and let&amp;#39;s suppose that </text><text start="176" dur="2">in this case we choose this path</text><text start="178" dur="2">from Arad to Sibiu.</text><text start="180" dur="3">Now the question I want you to answer </text><text start="183" dur="6">is once we remove that from the frontier,</text><text start="189" dur="2">what paths are we going to add next?</text><text start="191" dur="3">So show me by checking off the cities</text><text start="194" dur="2">that end the paths, which paths  </text><text start="196" dur="3">are going to be added to the frontier?</text></transcript></video><video title="Topic 5, Tree Search Answer" id="GKKQyJLee84" length="174"><transcript><text start="0" dur="6">[Male narrator] The answer is that in Sibiu, the action function gives us 4 actions</text><text start="6" dur="3">corresponding to traveling along these 4 roads, </text><text start="9" dur="6">so we have to add in paths for each of those actions. </text><text start="15" dur="2">One of those paths goes here, </text><text start="17" dur="4">the other path continues from Arad and goes out here. 
</text><text start="21" dur="4">The third path continues out here</text><text start="25" dur="6">and then the fourth path goes from here--from Arad to Sibiu</text><text start="31" dur="5">and then backtracks back to Arad.</text><text start="36" dur="5">Now, it may seem silly and redundant to have a path that starts in Arad,</text><text start="41" dur="3">goes to Sibiu and returns to Arad.</text><text start="44" dur="5">How can that help us get to our destination in Bucharest?</text><text start="49" dur="3">But we can see if we&amp;#39;re dealing with a tree search, </text><text start="52" dur="4">why it&amp;#39;s natural to have this type of formulation </text><text start="56" dur="4">and why the tree search doesn&amp;#39;t even notice that it&amp;#39;s backtracked. </text><text start="60" dur="5">What the tree search does is superimpose on top of the state space</text><text start="65" dur="4">a tree of searches, and the tree looks like this. </text><text start="69" dur="6">We start off in state A, and in state A, there were 3 actions,</text><text start="75" dur="6">so that gave us paths going to Z, S, and T.</text><text start="81" dur="13">And from S, there were 4 actions, so that gave us paths going to O, F, R, and A,</text><text start="94" dur="3">and then the tree would continue on from here. </text><text start="97" dur="3">We&amp;#39;d take one of the next items</text><text start="100" dur="8">and we&amp;#39;d move it and continue on, but notice that we returned to the A state</text><text start="108" dur="3">in the state space, but in the tree, </text><text start="111" dur="4">it&amp;#39;s just another item in the tree. 
</text><text start="115" dur="2">Now, here&amp;#39;s another representation of the search space</text><text start="117" dur="4">and what&amp;#39;s happening is as we start to explore the state space, </text><text start="121" dur="8">we keep track of the frontier, which is the set of states that are at the end of the paths</text><text start="129" dur="4">that we haven&amp;#39;t explored yet, and behind that frontier</text><text start="133" dur="6">is the set of explored states, and ahead of the frontier is the unexplored states. </text><text start="139" dur="3">Now the reason we keep track of the explored states</text><text start="142" dur="5">is that when we want to expand and we find a duplicate--</text><text start="147" dur="6">so say when we expand from here, if we pointed back to state T,</text><text start="153" dur="9">if we hadn&amp;#39;t kept track of that, we would have to add in a new state for T down here.</text><text start="162" dur="5">But because we&amp;#39;ve already seen it, we know that this is actually a regressive step</text><text start="167" dur="4">into the already explored states, and because we kept track of that,</text><text start="171" dur="3">we don&amp;#39;t need to add it anymore. 
</text></transcript></video><video title="Topic 6, Graph Search" id="mtbfvJuOV_U" length="65"><transcript><text start="0" dur="4">Now we see how to modify the Tree Search Function</text><text start="4" dur="2">to make it be a Graph Search Function</text><text start="6" dur="3">to avoid those repeated paths.</text><text start="9" dur="4">What we do is we start off and initialize a set</text><text start="13" dur="3">called the explored set of states that we have already explored.</text><text start="16" dur="3">Then, when we consider a new path,</text><text start="19" dur="4">we add the new state to the set of already explored states,</text><text start="23" dur="3">and then when we are expanding the path</text><text start="26" dur="3"> and adding in new states to the end of it,</text><text start="29" dur="4">we don&#x2019;t add that in if we have already seen that new state</text><text start="33" dur="4">in either the frontier or the explored set.</text><text start="37" dur="2">Now back to Breadth-First Search.</text><text start="39" dur="2">Let&#x2019;s assume we are using the Graph Search</text><text start="41" dur="3">so that we have eliminated the duplicate paths.</text><text start="44" dur="3">Arad is crossed off the list.</text><text start="47" dur="2">The path that goes from Arad to Sibiu</text><text start="49" dur="2">and back to Arad is removed,</text><text start="51" dur="2">and we are left with these one, two, three,</text><text start="53" dur="4">four, five possible paths.</text><text start="57" dur="2">Given these 5 paths,</text><text start="59" dur="3">show me which ones are candidates to be expanded next</text><text start="62" dur="3">by the Breadth-First Search Algorithm.</text></transcript></video><video title="Topic 7, Graph Search Answer" id="HfYXaw56-0w" length="42"><transcript><text start="0" dur="3">[Male narrator] And the answer is that Breadth-First Search always considers </text><text start="3" dur="5">the shortest paths first, and in this 
case, there&amp;#39;s 2 paths of length 1,</text><text start="8" dur="4">namely the paths from Arad to Zerind and Arad to Timisoara,</text><text start="12" dur="3">so those would be the 2 paths that would be considered. </text><text start="15" dur="3">Now, let&amp;#39;s suppose that the tie is broken in some way </text><text start="18" dur="4">and we chose this path from Arad to Zerind.</text><text start="22" dur="3">Now, we want to expand that node. </text><text start="25" dur="6">We remove it from the frontier and put it in the explored list</text><text start="31" dur="4">and now we say, &amp;quot;What paths are we going to add?&amp;quot;</text><text start="35" dur="7">So check off the ends of the paths, the cities, that we&amp;#39;re going to add.</text></transcript></video><video title="Topic 8, Graph Search Answer" id="CUfmOLQi3RM" length="13"><transcript><text start="0" dur="3">[Male narrator] In this case, there&amp;#39;s nothing to add</text><text start="3" dur="6">because of the 2 neighbors, 1 is in the explored list and 1 is in the frontier,</text><text start="9" dur="4">and if we&amp;#39;re using graph search, then we won&amp;#39;t add either of those. </text></transcript></video><video title="Topic 9, More Graph Search" id="I3lrnzdgwmI" length="38"><transcript><text start="0" dur="4">[Male narrator] So we move on, we look for another shortest path.</text><text start="4" dur="7">There&amp;#39;s one path left of length 1, so we look at that path, we expand it,</text><text start="11" dur="5">add in this path, put that one on the explored list, </text><text start="16" dur="4">and now we&amp;#39;ve got 3 paths of length 2. </text><text start="20" dur="3">We choose 1 of them, and let&amp;#39;s say we choose this one. 
</text><text start="23" dur="7">Now, my question is show me which states we add to the path</text><text start="30" dur="5">and tell me whether we&amp;#39;re going to terminate the algorithm at this point </text><text start="35" dur="3">because we&amp;#39;ve reached the goal or whether we&amp;#39;re going to continue.</text></transcript></video><video title="Topic 10, Graph Search Answer" id="cr1Ck1Fr60M" length="29"><transcript><text start="0" dur="8">[Male narrator] The answer is that we add 1 more path, the path to Bucharest. </text><text start="8" dur="3">We don&amp;#39;t add the path going back because it&amp;#39;s in the explored list,</text><text start="11" dur="2">but we don&amp;#39;t terminate it yet.</text><text start="13" dur="3">True, we have added a path that ends in Bucharest,</text><text start="16" dur="6">but the goal test isn&amp;#39;t applied when we add a path to the frontier. </text><text start="22" dur="4">Rather, it&amp;#39;s applied when we remove that path from the frontier,</text><text start="26" dur="3">and we haven&amp;#39;t done that yet. </text></transcript></video><video title="Topic 11, Graph Search Termination" id="mueRduwpg-U" length="90"><transcript><text start="0" dur="6">[Male narrator] Now, why doesn&amp;#39;t the general tree search or graph search algorithm stop</text><text start="6" dur="3">when it adds a goal node to the frontier? 
</text><text start="9" dur="4">The reason is that it might not be the best path to the goal.</text><text start="13" dur="3">Now, here we found a path of length 2</text><text start="16" dur="5">and we added a path of length 3 that reached the goal.</text><text start="21" dur="3">The general graph search or tree search doesn&amp;#39;t know </text><text start="24" dur="3">that there might be some other path that we could expand</text><text start="27" dur="3">that would have a distance of, say, 2-1/2,</text><text start="30" dur="3">but there&amp;#39;s an optimization that could be made.</text><text start="33" dur="2">If we know we&amp;#39;re doing breadth-first search</text><text start="35" dur="5">and we know there&amp;#39;s no possibility of a path of length 2-1/2,</text><text start="40" dur="4">then we can change the algorithm so that it checks states</text><text start="44" dur="2">as soon as they&amp;#39;re added to the frontier</text><text start="46" dur="3">rather than waiting until they&amp;#39;re expanded,</text><text start="49" dur="4">and in that case, we can write a specific breadth-first search routine </text><text start="53" dur="8">that terminates early and gives us a result as soon as we add a goal state to the frontier.</text><text start="61" dur="3">Breadth-first search will find this path </text><text start="64" dur="4">that ends up in Bucharest, and if we&amp;#39;re looking for the shortest path </text><text start="68" dur="2">in terms of number of steps, </text><text start="70" dur="2">breadth-first search is guaranteed to find it. </text><text start="72" dur="5">But if we&amp;#39;re looking for the shortest path in terms of total cost</text><text start="77" dur="4">by adding up the step costs, then it turns out </text><text start="81" dur="5">that this path is shorter than the path found by breadth-first search. </text><text start="86" dur="4">So let&amp;#39;s look at how we could find that path. 
</text></transcript></video><video title="Topic 12, Uniform Cost Search" id="Qrig0mznzG4" length="51"><transcript><text start="0" dur="5">An algorithm that has traditionally been called uniform-cost search</text><text start="5" dur="3">but could be called cheapest-first search,</text><text start="8" dur="3">is guaranteed to find the path with the cheapest total cost. </text><text start="11" dur="3">Let&amp;#39;s see how it works.</text><text start="14" dur="5">We start out as before in the start state.</text><text start="19" dur="5">And we pop that empty path off.</text><text start="24" dur="4">Move it from the frontier to explored,</text><text start="28" dur="5">and then add in the paths out of that state.</text><text start="33" dur="6">As before, there will be 3 of those paths.</text><text start="39" dur="4">And now, which path are we going to pick next</text><text start="43" dur="8">in order to expand according to the rules of cheapest first?</text></transcript></video><video title="Topic 13, Uniform Cost Search" id="7MbW6kZ_vb8" length="44"><transcript><text start="0" dur="3">Cheapest first says that we pick the path with</text><text start="4" dur="2">the lowest total cost.</text><text start="6" dur="1">And that would be this path.</text><text start="7" dur="6">It has a cost of 75 compared to the cost of 118 and 140</text><text start="13" dur="1">for the other paths.</text><text start="14" dur="5">So we get here. 
We take that path off the frontier,</text><text start="19" dur="4">put it on the explored list, add in its neighbors.</text><text start="23" dur="3">Not going back to Arad,</text><text start="26" dur="4">but adding in this new path.</text><text start="30" dur="3">Summing up the total cost of that path, </text><text start="33" dur="7">71 + 75 is 146 for this path.</text><text start="40" dur="1">And now the question is, </text><text start="41" dur="3">which path gets expanded next?</text></transcript></video><video title="Topic 14, Uniform Cost Search" id="9vNvrRP0ymw" length="56"><transcript><text start="0" dur="5">Of the 3 paths on the frontier, we have ones </text><text start="5" dur="5">with a cost of 146, 140, and 118.</text><text start="10" dur="3">And that&amp;#39;s the cheapest, so this one gets expanded.</text><text start="13" dur="3">We take it off the frontier, move it to explored,</text><text start="16" dur="5">add in its successors. In this case it&amp;#39;s only 1.</text><text start="21" dur="8">And that has a path total of 229.</text><text start="29" dur="1">Which path do we expand next?</text><text start="30" dur="3">Well, we&amp;#39;ve got 146, 140, and 229</text><text start="33" dur="5">So 140 is the lowest.</text><text start="38" dur="3">Take it off the frontier. 
Put it on explored.</text><text start="41" dur="3">Add in this path</text><text start="44" dur="4">for a total cost of 220.</text><text start="48" dur="5">And this path for a total cost of 239.</text><text start="53" dur="3">And now the question is, which path do we expand next?</text></transcript></video><video title="Topic 15, Uniform Cost Search" id="LVCMMPXaQlE" length="15"><transcript><text start="0" dur="4">The answer is this one, 146.</text><text start="4" dur="3">Put it on explored.</text><text start="7" dur="5">But there&amp;#39;s nothing to add because</text><text start="12" dur="1">both of its neighbors have already been explored.</text><text start="13" dur="2">Which path do we look at next?</text></transcript></video><video title="Topic 16, Uniform Cost Termination" id="G-H1AnA8uBI" length="73"><transcript><text start="0" dur="5">The answer is this one. Two-twenty is less than 229 or 239.</text><text start="5" dur="4">Take it off the frontier. Put it on explored.</text><text start="9" dur="6">Add in 2 more paths and sum them up.</text><text start="15" dur="6">So, 220 plus 146 is 366.</text><text start="21" dur="8">And 220 plus 97 is 317.</text><text start="29" dur="3">Okay, and now, notice that we&amp;#39;re closing in on Bucharest.</text><text start="32" dur="6">We&amp;#39;ve got 2 neighbors almost there, but neither of them is their turn yet.</text><text start="38" dur="5">Instead, the cheapest path is this one over here,</text><text start="43" dur="2">so move it to the explored list.</text><text start="45" dur="5">Add 70 to the path cost so far,</text><text start="50" dur="7">and we get 299.</text><text start="57" dur="4">Now the cheapest node is 239 here,</text><text start="61" dur="8">so we expand, finally, into Bucharest at a cost of 460.</text><text start="69" dur="4">And now the question is are we done? 
Can we terminate the algorithm?</text></transcript></video><video title="Topic 17, Uniform Cost Termination Answer" id="NxCUVltVoZ8" length="106"><transcript><text start="0" dur="3">[Male] And the answer is no, we&amp;#39;re not done yet.</text><text start="3" dur="4">We&amp;#39;ve put Bucharest, the goal state, onto the frontier,</text><text start="7" dur="2">but we haven&amp;#39;t popped it off the frontier yet.</text><text start="9" dur="4">And the reason is that we&amp;#39;ve got to look around and see if there&amp;#39;s a better path</text><text start="13" dur="2">that can reach Bucharest.</text><text start="15" dur="3">And so, let&amp;#39;s continue.</text><text start="18" dur="2">Look at everything on the frontier.</text><text start="20" dur="3">Here&amp;#39;s the cheapest one over here.</text><text start="23" dur="3">Expand that.</text><text start="26" dur="4">Now, what&amp;#39;s the next cheapest one?</text><text start="30" dur="3">Well, over here.</text><text start="33" dur="3">Oops, forgot to take this one off the list.</text><text start="36" dur="8">So now, 317 plus 101 gives us another path into Bucharest,</text><text start="44" dur="2">and this is a better path. 
</text><text start="46" dur="8">This one, at 418, gives us another route in.</text><text start="54" dur="5">But we have to keep going.</text><text start="59" dur="7">The best path on the frontier is 366,</text><text start="66" dur="8">so pop that off, and that would give us 2 more routes into here,</text><text start="74" dur="4">and eventually we pop off all of these.</text><text start="78" dur="6">And then we get to the point where 418 is the best path on the frontier.</text><text start="84" dur="5">We pop that off, and then we recognize that we&amp;#39;ve reached the goal,</text><text start="89" dur="6">and the reason that uniform cost finds the optimal path, the cheapest cost, </text><text start="95" dur="5">is that it&amp;#39;s guaranteed that it will first pop off this cheapest path, </text><text start="100" dur="6">the 418, before it gets to the more expensive path, like the 460.</text></transcript></video><video title="Topic 18, Depth First Search" id="Ve-mmCM8TI0" length="110"><transcript><text start="0" dur="3">So, we&amp;#39;ve looked at 2 search algorithms.</text><text start="3" dur="5">One, breadth-first search, in which we always expand first </text><text start="8" dur="4">the shallowest paths, the shortest paths.</text><text start="12" dur="5">Second, cheapest-first search, in which we always expand first the path</text><text start="17" dur="3">with the lowest total cost.</text><text start="20" dur="5">And I&amp;#39;m going to take this opportunity to introduce a third algorithm, depth-first search,</text><text start="25" dur="3">which is in a way the opposite of breadth-first search.</text><text start="28" dur="5">In depth-first search, we always expand first the longest path, </text><text start="33" dur="3">the path with the most links in it.</text><text start="36" dur="6">Now, what I want to ask you to do is for each of these nodes in each of the trees,</text><text start="42" dur="2">tell us in what order they&amp;#39;re expanded,</text><text 
start="44" dur="5">first, second, third, fourth, fifth and so on by putting a number into the box.</text><text start="49" dur="9">And if there are ties, put that number in and resolve the ties in left to right order.</text><text start="58" dur="5">Then I want you to answer one more question,</text><text start="63" dur="3">which is, are these searches optimal?</text><text start="66" dur="5">That is, are they guaranteed to find the best solution?</text><text start="71" dur="5">And for breadth-first search, optimal would mean finding the shortest path.</text><text start="76" dur="5">If you think it&amp;#39;s guaranteed to find the shortest path, check here.</text><text start="81" dur="5">For cheapest first, it would mean finding the path with the lowest total path cost.</text><text start="86" dur="4">Check here if you think it&amp;#39;s guaranteed to do that.</text><text start="90" dur="4">And we&amp;#39;ll allow the assumption that all costs have to be positive.</text><text start="94" dur="7">And for depth first, optimal would mean, again,</text><text start="101" dur="5">as in breadth first, finding the shortest possible path in terms of number of links.</text><text start="106" dur="4">Check here if you think depth first will always find that.</text></transcript></video><video title="Topic 19, Search Optimality Answer" id="slLRsFFiiRc" length="109"><transcript><text start="0" dur="4">Here are the answers.</text><text start="4" dur="6">Breadth-first search, as the name implies, expands nodes in this order.</text><text start="10" dur="7">One, 2, 3, 4, 5, 6, 7. 
</text><text start="17" dur="6">So, it&amp;#39;s going across a stripe at a time, breadth first.</text><text start="23" dur="2">Is it optimal?</text><text start="25" dur="3">Well, it&amp;#39;s always expanding in the shortest paths first,</text><text start="28" dur="6">and so wherever the goal is hiding, it&amp;#39;s going to find it by examining</text><text start="34" dur="4">no longer paths, so in fact, it is optimal.</text><text start="38" dur="7">Cheapest first, first we expand the path of length zero, </text><text start="45" dur="2">then the path of length 2.</text><text start="47" dur="6">Now there&amp;#39;s a path of length 4, path of length 5, </text><text start="53" dur="9">path of length 6, a path of length 7, and finally, a path of length 8.</text><text start="62" dur="6">And as we&amp;#39;ve seen, it&amp;#39;s guaranteed to find the cheapest path of all,</text><text start="68" dur="6">assuming that all the individual step costs are not negative.</text><text start="74" dur="3">Depth-first search tries to go as deep as it can first,</text><text start="77" dur="7">so it goes 1, 2, 3, then backs up, 4, </text><text start="84" dur="5">then backs up, 5, 6, 7.</text><text start="89" dur="5">And you can see that it doesn&amp;#39;t necessarily find the shortest path of all.</text><text start="94" dur="5">Let&amp;#39;s say that there were goals in position 5 and in position 3.</text><text start="99" dur="4">It would find the longer path to position 3 and find the goal there</text><text start="103" dur="3">and would not find the goal in position 5.</text><text start="106" dur="3">So, it is not optimal.</text></transcript></video><video title="Topic 20, Storage Requirements, Completeness" id="RntnUP9QRiU" length="122"><transcript><text start="0" dur="4">Given the non-optimality of depth-first search,</text><text start="4" dur="3">why would anybody choose to use it?</text><text start="7" dur="3">Well, the answer has to do with the storage requirements.</text><text 
start="10" dur="3">Here I&amp;#39;ve illustrated a state space </text><text start="13" dur="5">consisting of a very large or even infinite binary tree.</text><text start="18" dur="4">As we go to levels 1, 2, 3, down to level n, </text><text start="22" dur="2">the tree gets larger and larger.</text><text start="24" dur="5">Now, let&amp;#39;s consider the frontier for each of these search algorithms. </text><text start="29" dur="6">For breadth-first search, we know the frontier looks like that, </text><text start="35" dur="5">and so when we get down to level n, we&amp;#39;ll require a storage space of </text><text start="40" dur="5">2 to the n paths in breadth-first search.</text><text start="45" dur="4">For cheapest first, the frontier is going to be more complicated.</text><text start="49" dur="4">It&amp;#39;s going to sort of work out this contour of cost,</text><text start="53" dur="4">but it&amp;#39;s going to have a similar total number of nodes.</text><text start="57" dur="6">But for depth-first search, as we go down the tree, we start going down this branch,</text><text start="63" dur="5">and then we back up, but at any point, our frontier is only going to have n nodes</text><text start="68" dur="6">rather than 2 to the n nodes, so that&amp;#39;s a substantial savings for depth-first search.</text><text start="74" dur="5">Now, of course, if we&amp;#39;re also keeping track of the explored set, </text><text start="79" dur="2">then we don&amp;#39;t get that much savings.</text><text start="81" dur="4">But without the explored set, depth-first search has a huge advantage </text><text start="85" dur="2">in terms of space saved.</text><text start="87" dur="3">One more property of the algorithms to consider</text><text start="90" dur="5">is the property of completeness, meaning if there is a goal somewhere,</text><text start="95" dur="2">will the algorithm find it?</text><text start="97" dur="4">So, let&amp;#39;s move from very large trees to infinite trees, 
</text><text start="101" dur="6">and let&amp;#39;s say that there&amp;#39;s some goal hidden somewhere deep down in that tree.</text><text start="107" dur="4">And the question is, are each of these algorithms complete?</text><text start="111" dur="4">That is, are they guaranteed to find a path to the goal?</text><text start="115" dur="7">Mark off the check boxes for the algorithms that you believe are complete in this sense.</text></transcript></video><video title="Topic 21, Completeness Answer" id="aEZOJ-KazvU" length="49"><transcript><text start="0" dur="4">The answer is that breadth-first search is complete,</text><text start="4" dur="6">so even if the tree is infinite, if the goal is placed at any finite level,</text><text start="10" dur="6">eventually, we&amp;#39;re going to march down and find that goal.</text><text start="16" dur="2">Same with cheapest first. </text><text start="18" dur="3">No matter where the goal is, if it has a finite cost,</text><text start="21" dur="4">eventually, we&amp;#39;re going to go down and find it.</text><text start="25" dur="3">But not so for depth-first search.</text><text start="28" dur="5">If there&amp;#39;s an infinite path, depth-first search will keep following that,</text><text start="33" dur="4">so it will keep going down and down and down along this path</text><text start="37" dur="5">and never get to the path that the goal consists of</text><text start="42" dur="4">and never get to the path on which the goal sits. 
</text><text start="46" dur="3">So, depth-first search is not complete.</text></transcript></video><video title="Topic 22, More on Uniform Cost Search" id="IBAuWgq0ews" length="268"><transcript><text start="0" dur="5">Let&amp;#39;s try to understand a little better how uniform cost search works.</text><text start="5" dur="3">We start at a start state, </text><text start="8" dur="5">and then we start expanding out from there looking at different paths, </text><text start="13" dur="8">and what we end up doing is expanding in terms of contours like on a topographical map,</text><text start="21" dur="7">where first we expand out to a certain distance, then to a farther distance,</text><text start="28" dur="3">and then to a farther distance.</text><text start="31" dur="4">Now at some point we meet up with a goal. Let&amp;#39;s say the goal is here.</text><text start="35" dur="7">Now we&amp;#39;ve found a path from the start to the goal.</text><text start="42" dur="4">But notice that the search really wasn&amp;#39;t directed in any way towards the goal.</text><text start="46" dur="6">It was expanding out everywhere in the space and depending on where the goal is, </text><text start="52" dur="5">we should expect to have to explore half the space, on average, before we find the goal.</text><text start="57" dur="3">If the space is small, that can be fine,</text><text start="60" dur="5">but when spaces are large, that won&amp;#39;t get us to the goal fast enough.</text><text start="65" dur="5">Unfortunately, there is really nothing we can do, with what we know, to do better than that,</text><text start="70" dur="5">and so if we want to improve, if we want to be able to find the goal faster, </text><text start="75" dur="6">we&amp;#39;re going to have to add more knowledge.</text><text start="81" dur="6">The type of knowledge that has proven most useful in search is an estimate of the distance </text><text start="87" dur="5">from a state to the goal.</text><text start="92" dur="4">So 
let&amp;#39;s say we&amp;#39;re dealing with a route-finding problem, </text><text start="96" dur="7">and we can move in any direction--up or down, right or left--</text><text start="103" dur="7">and we&amp;#39;ll take as our estimate, the straight line distance between a state and a goal,</text><text start="110" dur="5">and we&amp;#39;ll try to use that estimate to find our way to the goal fastest.</text><text start="115" dur="9">Now an algorithm called greedy best-first search does exactly that.</text><text start="124" dur="5">It expands first the path that&amp;#39;s closest to the goal according to the estimate.</text><text start="129" dur="4">So what do the contours look like in this approach?</text><text start="133" dur="4">Well, we start here, and then we look at all the neighboring states,</text><text start="137" dur="4">and the ones that appear to be closest to the goal we would expand first.</text><text start="141" dur="9">So we&amp;#39;d start expanding like this and like this and like this and like this</text><text start="150" dur="3">and that would lead us directly to the goal.</text><text start="153" dur="5">So now instead of exploring whole circles that go out everywhere with a certain space,</text><text start="158" dur="3">our search is directed towards the goal.</text><text start="161" dur="5">In this case it gets us immediately towards the goal, but that won&amp;#39;t always be the case </text><text start="166" dur="4">if there are obstacles along the way.</text><text start="170" dur="4">Consider this search space.  
We have a start state and a goal,</text><text start="174" dur="3">and there&amp;#39;s an impassable barrier.</text><text start="177" dur="5">Now greedy best-first search will start expanding out as before,</text><text start="182" dur="6">trying to get towards the goal,</text><text start="188" dur="3">and when it reaches the barrier, what will it do next?</text><text start="191" dur="4">Well, it will try to advance along a path that&amp;#39;s getting closer and closer to the goal.</text><text start="195" dur="5">So it won&amp;#39;t consider going back this way, which is farther from the goal.</text><text start="200" dur="4">Rather it will continue expanding out along these lines</text><text start="204" dur="4">which always get closer and closer to the goal,</text><text start="208" dur="3">and eventually it will find its way to the goal.</text><text start="211" dur="5">So it does find a path, and it does it by expanding a small number of nodes,</text><text start="216" dur="6">but it&amp;#39;s willing to accept a path which is longer than other paths.</text><text start="222" dur="5">Now if we had explored in the other direction, we could have found a much simpler path,</text><text start="227" dur="7">a much shorter path, by just popping over the barrier, and then going directly to the goal.</text><text start="234" dur="2">But greedy best-first search wouldn&amp;#39;t have done that because </text><text start="236" dur="5">that would have involved getting to this point, which is this distance to the goal, </text><text start="241" dur="7">and then considering states which were farther from the goal.</text><text start="248" dur="3">What we would really like is an algorithm that combines the best parts </text><text start="251" dur="6">of greedy search, which explores a small number of nodes in many cases, </text><text start="257" dur="5">and uniform cost search, which is guaranteed to find a shortest path.</text><text start="262" dur="6">We&amp;#39;ll show how to do that 
next using an algorithm called the A-star algorithm.</text></transcript></video><video title="Topic 23, A-Star Search" id="_CBhTubi-CU" length="194"><transcript><text start="0" dur="3">[Male narrator] A* Search works by always expanding the path </text><text start="3" dur="4">that has the minimum value of the function f,</text><text start="7" dur="5">which is defined as the sum of the g and h components.</text><text start="12" dur="4">Now, the function g of a path</text><text start="16" dur="3">is just the path cost,</text><text start="19" dur="4">and the function h of a path </text><text start="23" dur="4">is equal to the h value of the state </text><text start="27" dur="3">that is the final state of the path,</text><text start="30" dur="6">which is the estimated distance to the goal.</text><text start="36" dur="3">Here&amp;#39;s an example of how A* works.</text><text start="39" dur="5">Suppose we found this path through the state space to a state x</text><text start="44" dur="4">and we&amp;#39;re trying to give a measure to the value of this path.</text><text start="48" dur="7">The measure f is the sum of g, the path cost so far, </text><text start="55" dur="7">and h, which is the estimated distance that remains </text><text start="62" dur="2">to complete the path to the goal. </text><text start="64" dur="4">Now, minimizing g helps us keep the path short</text><text start="68" dur="5">and minimizing h helps us keep focused on finding the goal,</text><text start="73" dur="4">and the result is a search strategy that is the best possible</text><text start="77" dur="3">in the sense that it finds the shortest-length path</text><text start="80" dur="4">while expanding the minimum number of paths possible.</text><text start="84" dur="4">It could be called &amp;quot;best estimated total path cost first,&amp;quot;</text><text start="88" dur="4">but the name A* is traditional. 
</text><text start="92" dur="4">Now let&amp;#39;s go back to Romania and apply the A* algorithm</text><text start="96" dur="4">and we&amp;#39;re going to use a heuristic, which is a straight line distance</text><text start="100" dur="2">between a state and the goal. </text><text start="102" dur="2">The goal, again, is Bucharest,</text><text start="104" dur="3">and so the distance from Bucharest to Bucharest is, of course, 0.</text><text start="107" dur="4">And for all the other states, I&amp;#39;ve written in red</text><text start="111" dur="2">the straight line distance.</text><text start="113" dur="2">For example, straight across like that. </text><text start="115" dur="4">Now, I should say that all the roads here I&amp;#39;ve drawn as straight lines,</text><text start="119" dur="4">but actually, roads are going to be curved to some degree, </text><text start="123" dur="3">so the actual distance along the roads is going to be longer</text><text start="126" dur="3">than the straight line distance. </text><text start="129" dur="4">Now, we start out as usual--we&amp;#39;ll start in Arad as a start state--</text><text start="133" dur="8">and we&amp;#39;ll expand out Arad and so we&amp;#39;ll add 3 paths</text><text start="141" dur="5">and the evaluation function, f, will be the sum of the path length,</text><text start="146" dur="3">which is given in black, and the estimated distance, </text><text start="149" dur="3">which is given in red. </text><text start="152" dur="5">And so the path length from this path </text><text start="157" dur="8">will be 140+253 or 393;</text><text start="165" dur="10">for this path, 75+374, or 449;</text><text start="175" dur="10">and for this path, 118+329, or 447.</text><text start="185" dur="4">And now, the question is out of all the paths that are on the frontier,</text><text start="189" dur="5">which path would we expand next under the A* algorithm? 
</text></transcript></video><video title="Topic 23, A-Star Search ANSWER" id="yO5Cx5zw8h4" length="14"><transcript><text start="0" dur="5">The answer is that we select this path first--the one from Arad to Sibiu--</text><text start="5" dur="9">because it has the smallest value--393--of the sum f=g+h.</text></transcript></video><video title="Topic 24, A-Star Second Question" id="KP8JiOrl5As" length="39"><transcript><text start="0" dur="3">Let&amp;#39;s go ahead and expand this node now. </text><text start="3" dur="3">So we&amp;#39;re going to add 3 paths.</text><text start="6" dur="4">This one has a path cost of 291</text><text start="10" dur="4">and an estimated distance to the goal of 380, </text><text start="14" dur="4">for a total of 671.</text><text start="18" dur="3">This one has a path cost of 239</text><text start="21" dur="6">and an estimated distance of 176, for a total of 415. </text><text start="27" dur="6">And the final one is 220+193=413.</text><text start="33" dur="6">And now the question is which path do we expand next? </text></transcript></video><video title="Topic 24, A-Star Second Question ANSWER" id="YOjVW4NKgDQ" length="12"><transcript><text start="0" dur="3">The answer is we expand this path next</text><text start="3" dur="3">because its total, 413, </text><text start="6" dur="3">is less than all the other ones on the frontier--</text><text start="9" dur="3">although only slightly less than the 415 for this path. 
</text></transcript></video><video title="Topic 25, A-Star Third Question" id="u6_Xjgz7MCg" length="20"><transcript><text start="0" dur="3">So we expand this node,</text><text start="3" dur="3">giving us 2 more paths--</text><text start="6" dur="4">this one with an f-value of 417,</text><text start="10" dur="6">and this one with an f-value of 526.</text><text start="16" dur="4">The question again--which path are we going to expand next?</text></transcript></video><video title="Topic 25, A-Star Third Question ANSWER" id="BG5V3_MQP54" length="11"><transcript><text start="0" dur="5">And the answer is that we expand the path to Fagaras next,</text><text start="5" dur="3">because its f-total, 415, </text><text start="8" dur="3">is less than that of all the other paths on the frontier. </text></transcript></video><video title="Topic 26, A-Star Fourth Question" id="i0ExF1xivqc" length="26"><transcript><text start="1" dur="3">Now we expand Fagaras </text><text start="4" dur="3">and we get a path that reaches the goal</text><text start="7" dur="4">and it has a path length of 450 and an estimated distance of 0</text><text start="11" dur="3">for a total f value of 450,</text><text start="14" dur="3">and now the question is: What do we do next?</text><text start="17" dur="5">Click here if you think we&amp;#39;re at the end of the algorithm </text><text start="22" dur="2">and we don&amp;#39;t need to expand anything more,</text><text start="24" dur="2">or click on the node that you think we will expand next.</text></transcript></video><video title="Topic 26, A-Star Fourth Question ANSWER" id="qLfsDlLP2SY" length="23"><transcript><text start="0" dur="3">The answer is that we&amp;#39;re not done yet,</text><text start="3" dur="3">because the algorithm works by doing the goal test </text><text start="6" dur="2">when we take a path off the frontier, </text><text start="8" dur="3">not when we put a path on the frontier. 
</text><text start="11" dur="4">Instead, we just continue in the normal way and choose the node</text><text start="15" dur="3">on the frontier which has the lowest value.</text><text start="18" dur="5">That would be this one--the path through Pitesti, with a total of 417.</text></transcript></video><video title="Topic 27, A-Star Fifth Question" id="pFPqrufkL48" length="84"><transcript><text start="1" dur="3">So let&amp;#39;s expand the node at Pitesti. </text><text start="4" dur="4">We have to go down this direction, up,</text><text start="8" dur="3">then we reach a path we&amp;#39;ve seen before,</text><text start="11" dur="2">and we go in this direction.</text><text start="13" dur="3">Now we reach Bucharest, which is the goal,</text><text start="16" dur="3">and the h value is going to be 0 </text><text start="19" dur="5">because we&amp;#39;re at the goal, and the g value works out to 418.</text><text start="24" dur="7">Again, we don&amp;#39;t stop here just because we put a path onto the frontier.</text><text start="31" dur="4">We put it there, and we don&amp;#39;t apply the goal test yet,</text><text start="35" dur="3">but now we go back to the frontier,</text><text start="38" dur="5">and it turns out that this 418 is the lowest-cost path on the frontier.</text><text start="43" dur="2">So now we pull it off, do the goal test,</text><text start="45" dur="4">and now we&amp;#39;ve found our path to the goal,</text><text start="49" dur="3">and it is, in fact, the shortest possible path.</text><text start="55" dur="4">In this case, A-star was able to find the lowest-cost path.</text><text start="59" dur="3">Now the question that you&amp;#39;ll have to think about,</text><text start="62" dur="2">because we haven&amp;#39;t explained it yet,</text><text start="64" dur="2">is whether A-star will always do this.</text><text start="66" dur="6">Answer yes if you think A-star will always find the lowest-cost path,</text><text start="72" dur="5">or answer no if you think it 
depends on the particular problem given,</text><text start="77" dur="7">or answer no if you think it depends on the particular heuristic estimate function, h.</text></transcript></video><video title="Topic 27, A-Star Fifth Question ANSWER Mandatory" id="z86_jYE6CDA" length="49"><transcript><text start="2" dur="4">The answer is that it depends on the h function.</text><text start="6" dur="3">A-star will find the lowest-cost path </text><text start="9" dur="7">if the h function for a state is no greater than the true cost</text><text start="16" dur="4">of the path to the goal through that state.</text><text start="20" dur="6">In other words, we want the h to never overestimate the distance to the goal.</text><text start="26" dur="5">We also say that h is optimistic.</text><text start="31" dur="3">Another way of stating that</text><text start="34" dur="3">is that h is admissible, </text><text start="37" dur="4">meaning it is admissible to use it to find the lowest-cost path.</text><text start="41" dur="4">Think of all of these as equivalent ways</text><text start="45" dur="4">of stating the conditions under which A-star finds the lowest-cost path.</text></transcript></video><video title="Topic 28, Optimistic Heuristics" id="3Vmn9Rn-lDM" length="82"><transcript><text start="1" dur="2">Here we give you an intuition as to why</text><text start="3" dur="4">an optimistic heuristic function, h, finds the lowest-cost path.</text><text start="8" dur="7">When A-star ends, it returns a path, p, with estimated cost, c.</text><text start="15" dur="5">It turns out that c is also the actual cost,</text><text start="20" dur="3">because at the goal the h component is 0,</text><text start="23" dur="4">and so the path cost is the total cost as estimated by the function.</text><text start="28" dur="3">Now, all the paths on the frontier</text><text start="31" dur="4">have an estimated cost that&amp;#39;s greater than c,</text><text start="35" dur="5">and we know that because the 
frontier is explored in cheapest-first order.</text><text start="40" dur="4">If h is optimistic, then the estimated cost </text><text start="44" dur="3">is less than the true cost,</text><text start="47" dur="4">so the path p must have a cost that&amp;#39;s less than the true cost</text><text start="51" dur="3">of any of the paths on the frontier.</text><text start="54" dur="3">Any paths that go beyond the frontier</text><text start="57" dur="2">must have a cost that&amp;#39;s greater than that</text><text start="59" dur="5">because we agreed that the step cost is always 0 or more.</text><text start="64" dur="5">So that means that this path, p, must be the minimal cost path.</text><text start="69" dur="4">Now, this argument, I should say, only goes through</text><text start="73" dur="3">as is for tree search.</text><text start="76" dur="3">For graph search the argument is slightly more complicated, </text><text start="79" dur="3">but the general intuition is the same.</text></transcript></video><video title="Topic 29, State Spaces" id="8dXgwOvQYVE" length="59"><transcript><text start="1" dur="4">So far we&amp;#39;ve looked at the state space of cities in Romania--</text><text start="5" dur="2">a 2-dimensional, physical space.</text><text start="7" dur="3">But the technology for problem solving through search</text><text start="10" dur="2">can deal with many types of state spaces,</text><text start="12" dur="5">dealing with abstract properties, not just x-y position in a plane.</text><text start="17" dur="4">Here I introduce another state space--the vacuum world.</text><text start="21" dur="4">It&amp;#39;s a very simple world in which there are only 2 positions</text><text start="25" dur="5">as opposed to the many positions in the Romania state space.</text><text start="30" dur="3">But there are additional properties to deal with as well.</text><text start="33" dur="3">The robot vacuum cleaner can be in either of the 2 positions,</text><text start="36" 
dur="4">but as well as that each of the positions </text><text start="40" dur="3">can either have dirt in it or not have dirt in it.</text><text start="43" dur="4">Now the question is to represent this as a state space</text><text start="47" dur="4">how many states do we need?</text><text start="51" dur="8">The number of states can fill in this box here.</text></transcript></video><video title="Topic 29, State Spaces ANSWER" id="6KTjn8LpbZM" length="35"><transcript><text start="1" dur="3">And the answer is there are 8 states.</text><text start="4" dur="6">There are 2 physical states that the robot vacuum cleaner can be in--</text><text start="10" dur="2">either in state A or in state B.</text><text start="12" dur="5">But in addition to that, there are states about how the world is</text><text start="17" dur="2">as well as where the robot is in the world.</text><text start="19" dur="5">So state A can be dirty or not.</text><text start="24" dur="2">That&amp;#39;s 2 possibilities.</text><text start="26" dur="2">And B can be dirty or not.</text><text start="28" dur="3">That&amp;#39;s 2 more possibilities.</text><text start="31" dur="4">We multiply those together.                                 We get 8 possible states.</text></transcript></video><video title="Topic 30, State Space Diagram and More Complexity" id="NCfWMf9lL5I" length="104"><transcript><text start="1" dur="4">Here is a diagram of the state space for the vacuum world. 
</text><text start="5" dur="4">Note that there are 8 states, and we have the actions connecting the states</text><text start="9" dur="3">just as we did in the Romania problem.</text><text start="12" dur="3">Now let&amp;#39;s look at a path through this state space.</text><text start="15" dur="4">Let&amp;#39;s say we start out in this position,</text><text start="19" dur="4">and then we apply the action of moving right.</text><text start="23" dur="4">Then we end up in a position where the state of the world looks the same,</text><text start="27" dur="5">except the robot has moved from position &amp;#39;A&amp;#39; to position &amp;#39;B&amp;#39;.</text><text start="32" dur="5">Now if we turn on the sucking action,</text><text start="37" dur="5">then we end up in a state where the robot is in the same position</text><text start="42" dur="4">but that position is no longer dirty.</text><text start="47" dur="3">Let&amp;#39;s take this very simple vacuum world</text><text start="50" dur="3">and make a slightly more complicated one.</text><text start="53" dur="3">First, we&amp;#39;ll say that the robot has a power switch,</text><text start="56" dur="8">which can be in one of three conditions: on, off, or sleep.</text><text start="64" dur="5">Next, we&amp;#39;ll say that the robot has a dirt-sensing camera,</text><text start="69" dur="4">and that camera can either be on or off.</text><text start="73" dur="3">Third, this is the deluxe model of robot</text><text start="76" dur="3">in which the brushes that clean up the dust</text><text start="79" dur="3">can be set at 1 of 5 different heights </text><text start="82" dur="5">to be appropriate for whatever level of carpeting you have.</text><text start="87" dur="3">Finally, rather than just having the 2 positions,</text><text start="90" dur="7">we&amp;#39;ll extend that out and have 10 positions.</text><text start="97" dur="7">Now the question is how many states are in this state 
space?</text></transcript></video><video title="Topic 30, State Space Diagram and More Complexity ANSWER" id="ATEXTIBgH4o" length="57"><transcript><text start="1" dur="4">The answer is that the number of states is the cross product </text><text start="5" dur="3">of the numbers of all the variables, since they&amp;#39;re each independent,</text><text start="8" dur="2">and any combination can occur.</text><text start="10" dur="4">For the power we have 3 possible positions.</text><text start="14" dur="4">The camera has 2.</text><text start="18" dur="5">The brush height has 5.</text><text start="23" dur="5">The dirt has 2 for each of the 10 positions.</text><text start="28" dur="5">That&amp;#39;s 2^10 or 1024.</text><text start="33" dur="6">Then the robot&amp;#39;s position can be any of those 10 positions as well. </text><text start="39" dur="5">That works out to 307,200 states in the state space.</text><text start="44" dur="2">Notice how a fairly trivial problem--</text><text start="46" dur="4">we&amp;#39;re only modeling a few variables and only 10 positions--</text><text start="50" dur="2">works out to a large number of states.</text><text start="52" dur="5">That&amp;#39;s why we need efficient algorithms for searching through state spaces.</text></transcript></video><video title="Topic 31, Sliding Blocks Puzzle" id="-HvDwJAM2y4" length="109"><transcript><text start="1" dur="4">I want to introduce one more problem that can be solved with search techniques.</text><text start="5" dur="3">This is a sliding blocks puzzle, called a 15 puzzle.</text><text start="8" dur="2">You may have seen something like this.</text><text start="10" dur="4">So there are a bunch of little squares or blocks or tiles</text><text start="14" dur="2">and you can slide them around.</text><text start="19" dur="2">and the goal is to get into a certain configuration. 
</text><text start="21" dur="6">So we&amp;#39;ll say that this is the goal state, where the numbers 1-15 are in order</text><text start="27" dur="2">left to right, top to bottom.</text><text start="29" dur="5">The starting state would be some state where all the positions are messed up.</text><text start="34" dur="4">Now the question is: Can we come up with a good heuristic for this?</text><text start="38" dur="4">Let&amp;#39;s examine that as a way of thinking about where heuristics come from.</text><text start="42" dur="4">The first heuristic we&amp;#39;re going to consider</text><text start="46" dur="8">we&amp;#39;ll call h1, and that is equal to the number of misplaced blocks.</text><text start="54" dur="5">So here 10 and 11 are misplaced because they should be there and there, respectively,</text><text start="59" dur="3">12 is in the right place, 13 is in the right place,</text><text start="62" dur="2">and 14 and 15 are misplaced. </text><text start="64" dur="3">That&amp;#39;s a total of 4 misplaced blocks.</text><text start="67" dur="6">The 2nd heuristic, h2, is equal to</text><text start="73" dur="6">the sum of the distances that each block would have to move to get to the right position.</text><text start="79" dur="7">For this position, 10 would have to move 1 space to get to the right position,</text><text start="86" dur="4">11 would have to move 1, so that&amp;#39;s a total of 2 so far, </text><text start="90" dur="1">13 is in the right place,</text><text start="91" dur="2">14 is 1 displaced, </text><text start="93" dur="2">and 15 is 1 displaced,</text><text start="95" dur="3">so that would also be a total of 4.</text><text start="98" dur="6">Now, the question is: Which, if any, of these heuristics are admissible?</text><text start="104" dur="3">Check the boxes next to the heuristics that you think </text><text start="107" dur="2">are admissible.</text></transcript></video><video title="Topic 31, Sliding Blocks Puzzle ANSWER" id="lviKMjofhZ0" 
length="42"><transcript><text start="2" dur="5">H1 is admissible, because every tile that&amp;#39;s in the wrong position </text><text start="7" dur="3">must be moved at least once to get into the right position.</text><text start="10" dur="3">So h1 never overestimates.</text><text start="13" dur="2">How about h2?</text><text start="15" dur="5">H2 is also admissible, because every tile in the wrong position</text><text start="20" dur="6">can be moved closer to the correct position no faster than 1 space per move.</text><text start="26" dur="2">Therefore, both are admissible.</text><text start="28" dur="5">But notice that h2 is always greater than or equal to h1.</text><text start="33" dur="2">That means that, with the exception of breaking ties,</text><text start="35" dur="4">an A* search using h2 will always expand</text><text start="39" dur="3">fewer paths than one using h1.</text></transcript></video><video title="Topic 32, Where is the Intelligence" id="lL-8KGXehNY" length="196"><transcript><text start="1" dur="3">Now, we&amp;#39;re trying to build an artificial intelligence</text><text start="4" dur="3">that can solve problems like this all on its own.</text><text start="8" dur="4">You can see that the search algorithms do a great job</text><text start="12" dur="3">of finding solutions to problems like this.</text><text start="15" dur="4">But, you might complain that in order for the search algorithms to work,</text><text start="19" dur="3">we had to provide it with a heuristic function.</text><text start="22" dur="3">A heuristic function came from the outside.</text><text start="25" dur="5">You might think that coming up with a good heuristic function is really where all the intelligence is.</text><text start="30" dur="4">So, a problem solver that uses a heuristic function given to it</text><text start="34" dur="2">really isn&amp;#39;t intelligent at all.</text><text start="36" dur="3">So let&amp;#39;s think about where the intelligence could come 
from</text><text start="39" dur="4">and can we automatically come up with good heuristic functions.</text><text start="45" dur="2">I&amp;#39;m going to sketch a description of</text><text start="47" dur="3">a program that can automatically come up with good heuristics</text><text start="50" dur="2">given a description of a problem.</text><text start="52" dur="5">Suppose this program is given a description of the sliding blocks puzzle</text><text start="57" dur="5">where we say that a block can move from square A to square B</text><text start="62" dur="4">if A is adjacent to B and B is blank.</text><text start="66" dur="4">Now, imagine that we try to loosen this restriction.</text><text start="70" dur="4">We cross out &amp;quot;B is blank,&amp;quot;</text><text start="74" dur="2">and then we get the rule</text><text start="76" dur="4">&amp;quot;a block can move from A to B if A is adjacent to B,&amp;quot;</text><text start="80" dur="3">and that&amp;#39;s equal to our heuristic h2</text><text start="83" dur="4">because a block can move anywhere to an adjacent state.</text><text start="87" dur="4">Now, we could also cross out the other part of the rule,</text><text start="91" dur="5">and we now get &amp;quot;a block can move from any square A</text><text start="96" dur="4">to any square B regardless of any condition.&amp;quot;</text><text start="100" dur="3">That gives us heuristic h1.</text><text start="103" dur="5">So we see that both of our heuristics can be derived </text><text start="108" dur="2">from a simple mechanical manipulation</text><text start="110" dur="3">of the formal description of the problem.</text><text start="113" dur="5">Once we&amp;#39;ve generated automatically these candidate heuristics,</text><text start="118" dur="4">another way to come up with a good heuristic is to say </text><text start="122" dur="2">that a new heuristic, h, </text><text start="124" dur="6">is equal to the maximum of h1 and h2,</text><text start="130" dur="3">and that&amp;#39;s guaranteed 
to be admissible as long as </text><text start="133" dur="3">h1 and h2 are admissible</text><text start="136" dur="2">because it still never overestimates, </text><text start="138" dur="4">and it&amp;#39;s guaranteed to be better because it&amp;#39;s getting closer to the true value.</text><text start="142" dur="5">The only problem with combining multiple heuristics like this</text><text start="147" dur="2">is that there is some cost to compute the heuristic</text><text start="149" dur="2">and it could take longer to compute </text><text start="151" dur="4">even if we end up expanding fewer paths.</text><text start="155" dur="3">Crossing out parts of the rules like this</text><text start="158" dur="3">is called &amp;quot;generating a relaxed problem.&amp;quot;</text><text start="161" dur="3">What we&amp;#39;ve done is we&amp;#39;ve taken the original problem,</text><text start="164" dur="2">where it&amp;#39;s hard to move squares around,</text><text start="166" dur="3">and made it easier by relaxing one of the constraints.</text><text start="169" dur="5">You can see that as adding new links in the state space,</text><text start="174" dur="5">so if we have a state space in which there are only particular links,</text><text start="179" dur="6">by relaxing the problem it&amp;#39;s as if we are adding new operators</text><text start="185" dur="2">that traverse the state in new ways.</text><text start="187" dur="4">So adding new operators only makes the problem easier,</text><text start="191" dur="5">and thus never overestimates, and thus is admissible.</text></transcript></video><video title="Topic 33, What Can't Search Do" id="UbqrrN4wbqQ" length="112"><transcript><text start="0" dur="3">We&amp;#39;ve seen what search can do for problem solving.</text><text start="3" dur="3">It can find the lowest-cost path to a goal,</text><text start="6" dur="6">and it can do that in a way in which we never generate more paths than we have to.</text><text start="12" dur="3">We can find the 
optimal number of paths to generate,</text><text start="15" dur="4">and we can do that with a heuristic function that we generate on our own</text><text start="19" dur="3">by relaxing the existing problem definition.</text><text start="22" dur="3">But let&amp;#39;s be clear on what search can&amp;#39;t do.</text><text start="25" dur="6">All the solutions that we have found consist of a fixed sequence of actions.</text><text start="31" dur="7">In other words, the agent here in Arad thinks, comes up with a plan that it wants to execute</text><text start="38" dur="4">and then essentially closes his eyes and starts driving,</text><text start="42" dur="4">never considering along the way if something has gone wrong.</text><text start="46" dur="3">That works fine for this type of problem,</text><text start="49" dur="4">but it only works when we satisfy the following conditions.</text><text start="53" dur="2">[Problem solving works when:]</text><text start="55" dur="4">Problem-solving technology works when the following set of conditions is true:</text><text start="59" dur="4">First, the domain must be fully observable.</text><text start="63" dur="5">In other words, we must be able to see what initial state we start out with.</text><text start="68" dur="4">Second, the domain must be known.</text><text start="72" dur="4">That is, we have to know the set of available actions to us.</text><text start="76" dur="4">Third, the domain must be discrete.</text><text start="80" dur="4">There must be a finite number of actions to choose from.</text><text start="84" dur="4">Fourth, the domain must be deterministic.</text><text start="88" dur="4">We have to know the result of taking an action.</text><text start="92" dur="4">Finally, the domain must be static.</text><text start="96" dur="5">There must be nothing else in the world that can change the world except our own actions.</text><text start="101" dur="3">If all these conditions are true, then we can search for a plan</text><text 
start="104" dur="3">which solves the problem and is guaranteed to work.</text><text start="107" dur="5">In later units, we will see what to do if any of these conditions fail to hold.</text></transcript></video><video title="Topic 34, Note on Implementation" id="3muiVUU0sys" length="155"><transcript><text start="1" dur="7">Our description of the algorithm has talked about paths in the state space.</text><text start="8" dur="7">I want to say a little bit now about how to implement that in terms of a computer algorithm.</text><text start="15" dur="4">We talk about paths, but we want to implement that in some ways.</text><text start="19" dur="3">In the implementation we talk about nodes.</text><text start="22" dur="5">A node is a data structure, and it has four fields.</text><text start="27" dur="8">The state field indicates the state at the end of the path.</text><text start="35" dur="5">The action was the action it took to get there.</text><text start="40" dur="5">The cost is the total cost,</text><text start="45" dur="5">and the parent is a pointer to another node.</text><text start="50" dur="6">In this case, the node that has state &amp;quot;S&amp;quot;, </text><text start="56" dur="10">and it will have a parent which points to the node that has state &amp;quot;A&amp;quot;,</text><text start="66" dur="4">and that will have a parent pointer that&amp;#39;s null.</text><text start="70" dur="5">So we have a linked list of nodes representing the path.</text><text start="75" dur="3">We&amp;#39;ll use the word &amp;quot;path&amp;quot; for the abstract idea, </text><text start="78" dur="4">and the word &amp;quot;node&amp;quot; for the representation in the computer memory.</text><text start="82" dur="4">But otherwise, you can think of those two terms as being synonyms,</text><text start="86" dur="5">because they&amp;#39;re in a one-to-one correspondence.</text><text start="91" dur="4">Now there are two main data structures that deal with nodes.</text><text start="95" 
dur="6">We have the &amp;quot;frontier&amp;quot; and we have the &amp;quot;explored&amp;quot; list.</text><text start="101" dur="3">Let&amp;#39;s talk about how to implement them.</text><text start="104" dur="4">In the frontier the operations we have to deal with</text><text start="108" dur="4">are removing the best item from the frontier and adding in new ones.</text><text start="112" dur="3">And that suggests we should implement it as a priority queue,</text><text start="115" dur="4">which knows how to keep track of the best items in proper order.</text><text start="119" dur="4">But we also need to have an additional operation </text><text start="123" dur="4">of a membership test: is a new item in the frontier?</text><text start="127" dur="3">And that suggests representing it as a set,</text><text start="130" dur="4">which can be built from a hash table or a tree.</text><text start="134" dur="6">So the most efficient implementations of search actually have both representations.</text><text start="140" dur="3">The explored set, on the other hand, is easier.</text><text start="143" dur="5">All we have to do there is be able to add new members and check for membership.</text><text start="148" dur="3">So we represent that as a single set, </text><text start="151" dur="4">which again can be done with either a hash table or tree.</text></transcript></video></group><group title="Homework 1" count="16"><video title="Congratulations!" 
id="IXVOQEFTvb4" length="5"><transcript><text start="0" dur="2">Congratulations.</text><text start="2" dur="3">You just made assignment 1.</text></transcript></video><video title="Introduction" id="dnnGEYjD9wo" length="5"><transcript><text start="0" dur="5">This is homework assignment #1.</text></transcript></video><video title="Question 1, Peg Solitaire" id="CxjV8H50xfU" length="60"><transcript><text start="1" dur="3">This is a question about peg solitaire.</text><text start="4" dur="4">In peg solitaire, a single player faces</text><text start="8" dur="2">the following kind of board.</text><text start="13" dur="6">Initially, all pieces are occupied except for the center piece.</text><text start="22" dur="4">You can find more information on peg solitaire at the following URL.</text><text start="26" dur="9">[http://en.wikipedia.org/wiki/peg_solitaire]</text><text start="36" dur="4">I wish to know whether this game is partially observable.</text><text start="40" dur="3">Please say yes or no.</text><text start="43" dur="3">I wish to know whether it is stochastic.</text><text start="46" dur="4">Please say yes if it is and no if it&amp;#39;s deterministic.</text><text start="50" dur="5">Let me know if it&amp;#39;s continuous, yes or no,</text><text start="55" dur="5">and let me know if it&amp;#39;s adversarial, yes or no.</text></transcript></video><video title="Question 1, Peg Solitaire ANSWER" id="YOfAe4Xo_P4" length="22"><transcript><text start="0" dur="6">&amp;gt;&amp;gt;Peg Solitaire is not partially observable because you can see the board at all times.</text><text start="6" dur="3">It is not stochastic because you just make all the moves, </text><text start="9" dur="2">and they have very deterministic effects.</text><text start="11" dur="4">It is not continuous.  
It&amp;#39;s just finitely many choices of actions </text><text start="15" dur="3">and finitely many board positions, so therefore, it is not continuous.</text><text start="18" dur="4">And it&amp;#39;s not adversarial because there are no adversaries--just you playing.</text></transcript></video><video title="Question 2, Loaded Coin" id="ZmVLMZ5Fwcg" length="54"><transcript><text start="1" dur="4">I am going to ask you about the problem to learn about a loaded coin.</text><text start="5" dur="2">A loaded coin is a coin,</text><text start="7" dur="2">that if you flip it, </text><text start="9" dur="4">might have a non 0.5 chance</text><text start="13" dur="2">of coming up heads or tails.</text><text start="16" dur="4">Fair coins always come up 50% heads or tails.</text><text start="20" dur="3">Loaded coins might come up, for example, </text><text start="23" dur="4">0.9 chance heads and 0.1 chance tails.</text><text start="27" dur="3">Your task will be to understand,</text><text start="30" dur="1">from coin flips,</text><text start="31" dur="2">whether a coin is loaded,</text><text start="33" dur="2">and if so, at what probability.</text><text start="35" dur="2">I don&amp;#39;t want you to solve the problem,</text><text start="37" dur="3">but I want you to answer the following questions:</text><text start="40" dur="2">Is it partially observable?</text><text start="42" dur="2">Yes or no.</text><text start="44" dur="2">Is it stochastic?</text><text start="46" dur="2">Yes or no.</text><text start="48" dur="3">Is it continuous?  
[Yes or no.]</text><text start="51" dur="2">And finally, is it adversarial?</text><text start="53" dur="1">Yes or no.</text></transcript></video><video title="Question 2, Loaded Coin ANSWER" id="GsKZT-aAZFI" length="38"><transcript><text start="0" dur="6">[Thrun] So the loaded coin example is clearly partially observable,</text><text start="6" dur="3">and the reason is it is actually used for the memory</text><text start="9" dur="5">if you flip it more than 1 time so you can learn more about what the actual probability is.</text><text start="14" dur="6">Therefore, looking at the most recent coin flip is insufficient to make your choice.</text><text start="20" dur="5">It is stochastic because you flip a coin.</text><text start="25" dur="6">It is not continuous because there&amp;#39;s only 1 action--a flip--and 2 outcomes.</text><text start="31" dur="5">And it isn&amp;#39;t really adversarial because while you do your learning task</text><text start="36" dur="2">no adversary interferes.</text></transcript></video><video title="Question 3, Path Through Maze" id="dj6jEEU-jZc" length="32"><transcript><text start="0" dur="5">Let&amp;#39;s talk about the problem of finding a path through a maze.</text><text start="5" dur="5">Let me draw you a maze. 
</text><text start="10" dur="5">Suppose you wish to find the path from the start to your goal.</text><text start="15" dur="4">I don&amp;#39;t want you to solve this problem.</text><text start="19" dur="4">Rather I want you to tell me whether it&amp;#39;s partially observable.</text><text start="23" dur="2">Yes or no.</text><text start="25" dur="2">Is it stochastic?</text><text start="27" dur="2">Yes or no.</text><text start="29" dur="2">Is it continuous?</text><text start="31" dur="1">Yes or no.</text></transcript></video><video title="Question 3, Path Through Maze ANSWER" id="TskS2qHzi90" length="18"><transcript><text start="0" dur="3">[Thrun] The path through the maze is clearly not partially observable</text><text start="3" dur="3">because you can see the maze entirely at all times.</text><text start="6" dur="4">It is not stochastic. There is no randomness involved.</text><text start="10" dur="2">It isn&amp;#39;t really continuous. </text><text start="12" dur="3">There&amp;#39;s typically just finitely many choices--go left or right.</text><text start="15" dur="3">And it isn&amp;#39;t adversarial because there&amp;#39;s no real adversary involved.</text></transcript></video><video title="Question 4, Search Tree" id="qsxMRW2SOqI" length="43"><transcript><text start="0" dur="2">This is a search question. </text><text start="2" dur="3">Suppose we are given the following search tree. </text><text start="5" dur="3">We are searching from the top, the start node, </text><text start="8" dur="4">to the goal, which is over here. </text><text start="12" dur="5">Assume we expand from left to right. </text><text start="17" dur="3">Tell me how many nodes are expanded </text><text start="20" dur="3">if we expand from left to right, </text><text start="23" dur="4">counting the start node and the goal node in your answer. </text><text start="27" dur="5">And give me the same answer for Depth First Search. 
</text><text start="32" dur="3">Now, let&amp;#39;s assume you&amp;#39;re going to search from right to left. </text><text start="35" dur="4">How many nodes would we now expand in Breadth First Search,</text><text start="39" dur="4">and how many do we expand in Depth First Search?</text></transcript></video><video title="Question 4, Search Tree ANSWER" id="FDTlQfGb9SY" length="38"><transcript><text start="0" dur="3">[Thrun] Breadth first from left to right is 6--</text><text start="3" dur="4">1, 2, 3, 4, 5, 6.</text><text start="7" dur="8">Depth first from left to right is 4--1, 2, 3, 4.</text><text start="15" dur="4">Breadth first search from right to left is 9--</text><text start="19" dur="6">1, 2, 3, 4, 5, 6, 7, 8, 9.</text><text start="25" dur="3">And depth first from right to left is 9--</text><text start="28" dur="10">1, 2, 3, 4, 5, 6, 7, 8, 9.</text></transcript></video><video title="Question 5, Another Search Tree" id="vWNEaVcK2gU" length="31"><transcript><text start="0" dur="3">Another search problem--</text><text start="3" dur="5">Consider the following search tree,</text><text start="8" dur="4">where this is the start node.</text><text start="12" dur="3">Now, assume we search from left to right.</text><text start="15" dur="4">I would like you to tell me the number of nodes expanded from Breadth-First Search</text><text start="19" dur="3">and Depth-First Search.</text><text start="22" dur="3">Please do count the start and the goal node,</text><text start="25" dur="3">and please give me the same numbers for Right-to-Left Search,</text><text start="28" dur="3">for Breadth-First, and Depth-First. 
</text></transcript></video><video title="Question 5, Another Search Tree ANSWER" id="V_eXNj-LA9E" length="48"><transcript><text start="0" dur="5">[Thrun] The correct answer for breadth first left to right is 13--</text><text start="5" dur="8">1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13.</text><text start="13" dur="4">And for depth first it is 10--</text><text start="17" dur="11">1, 2, 3, 4, 5, 6, 7, 8, 9, and 10.</text><text start="28" dur="4">For right to left search, the right answer for breadth first is 11--</text><text start="32" dur="6">1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11.</text><text start="38" dur="4">And for depth first the right answer is 7--</text><text start="42" dur="6">1, 2, 3, 4, 5, 6, 7.</text></transcript></video><video title="Question 6, Search Network" id="IQhUlwJaBqc" length="60"><transcript><text start="0" dur="4">This is another search problem.</text><text start="4" dur="3">Let&amp;#39;s assume we have a search graph.</text><text start="7" dur="6">It isn&amp;#39;t quite a tree but looks like this.</text><text start="13" dur="5">Obviously in the structure we can reach nodes through multiple paths.</text><text start="18" dur="4">So let&amp;#39;s assume that our search never expands the same node twice.</text><text start="22" dur="5">Let&amp;#39;s also assume this start node is on top. 
We search down.</text><text start="27" dur="3">And this over here is our goal node.</text><text start="30" dur="5">So left-to-right search, tell me how many nodes</text><text start="35" dur="8">breadth first would expand--do count the start and goal node in the final answer.</text><text start="43" dur="5">Give me the same result for a depth-first search.</text><text start="48" dur="3">Again counting the start and the goal node in your answer.</text><text start="51" dur="3">And again give me your answer for breadth-first</text><text start="54" dur="6">and for depth-first in the right-to-left search paradigm.</text></transcript></video><video title="Question 6, Search Network ANSWER" id="mXT-9-K5OtU" length="49"><transcript><text start="0" dur="5">[Thrun] The right answer over here is 10 for breadth first from left to right--</text><text start="5" dur="6">1, 2, 3, 4, 5, 6, 7, 8, 9, 10.</text><text start="11" dur="4">Depth first is 16, or all nodes--</text><text start="15" dur="15">1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16.</text><text start="30" dur="4">And notice how I never expanded a node twice.</text><text start="34" dur="4">Correct answer for breadth first right to left is 7--</text><text start="38" dur="5">1, 2, 3, 4, 5, 6, 7.</text><text start="43" dur="6">And the correct answer for depth first from right to left is 4--1, 2, 3, and 4.</text></transcript></video><video title="Question 7, A* Search" id="V4h2H0jpGsg" length="76"><transcript><text start="0" dur="3">Let&amp;#39;s talk about A* search.</text><text start="3" dur="5">Let&amp;#39;s assume we have the following grid.</text><text start="8" dur="5">The start state is right here.</text><text start="13" dur="3">And the goal state is right here.</text><text start="16" dur="6">And just for convenience, I will give each here a little number.</text><text start="22" dur="4">A. B. C. 
D.</text><text start="26" dur="4">Let me draw a heuristic function.</text><text start="30" dur="2">Please take a look for a moment</text><text start="32" dur="6">and tell me whether this heuristic function is admissible.</text><text start="38" dur="3">Check here if yes and here if no.</text><text start="41" dur="5">Which one is the first node A* would expand?</text><text start="46" dur="5">B1 or A2?</text><text start="51" dur="5">What&amp;#39;s the second node to expand?</text><text start="56" dur="10">B1, C1, A2, A3, or B2?</text><text start="66" dur="4">And finally, what is the third node to expand?</text><text start="70" dur="6">D1, C2, B3, or A4? </text></transcript></video><video title="Question 7, A* Search ANSWER" id="forv6djwNWM" length="104"><transcript><text start="0" dur="5">[Thrun] Clearly this is an admissible heuristic because the distance to the goal</text><text start="5" dur="2">is strictly underestimated.</text><text start="7" dur="2">From here it would take 1 step,</text><text start="9" dur="6">from here it will take 1, 2 steps, so the answer is yes.</text><text start="15" dur="7">Now, to understand A*, let me also draw the g function </text><text start="22" dur="2">for the relevant part of this table.</text><text start="24" dur="3">Clearly g is 0 over here.</text><text start="27" dur="4">To understand which node to expand, this one or this one,</text><text start="31" dur="3">let&amp;#39;s project the g function, which is 1,</text><text start="34" dur="6">and we will see that 3 plus 1 is smaller than 4 plus 1;</text><text start="40" dur="7">therefore, this is the second node to expand, which is b1.</text><text start="47" dur="8">Now let me for the next step explain the g function from this guy here, 2 and 2.</text><text start="55" dur="13">So 2 plus 2 is 4 versus 3 plus 2 is 5, so we expand this node next, which is c1.</text><text start="68" dur="6">And finally, the g function from here would go 3 and 3.</text><text start="74" dur="10">3 plus 1 
is better than 3 plus 2, so we would expand d1 next.</text><text start="84" dur="5">And notice how in the sum of g and h,</text><text start="89" dur="6">this node over here, which has a total of 4, is better than any other node that is unexpanded.</text><text start="95" dur="5">So in particular, 4 plus 1 is 5, and 3 plus 2 is 5 as well, </text><text start="100" dur="4">and 2 plus 3 is 5 as well, so this is the next one to expand.</text></transcript></video></group><group title="Unit 3" count="64"><video title="1 Introduction" id="-8DyY8_IuA0" length="386"><transcript><text start="0" dur="3">So the next units will be concerned with probabilities</text><text start="3" dur="5">and particularly with structured probabilities using Bayes networks.</text><text start="8" dur="4">This is some of the most involved material in this class.</text><text start="12" dur="2">And since this is a Stanford level class, </text><text start="14" dur="4">you will find out that some of the quizzes are actually really hard.</text><text start="18" dur="5">So as you go through the material, I hope the hardness of the quizzes won&amp;#39;t discourage you;</text><text start="23" dur="7">it&amp;#39;ll really entice you to take a piece of paper and a pen and work them out.</text><text start="30" dur="5">Let me give you a flavor of a Bayes network using an example.</text><text start="35" dur="4">Suppose you find in the morning that your car won&amp;#39;t start.</text><text start="39" dur="4">Well, there&amp;#39;s many causes why your car might not start.</text><text start="43" dur="3">One is that your battery is flat.</text><text start="46" dur="4">Even for a flat battery there is multiple causes.</text><text start="50" dur="2">One, it&amp;#39;s just plain dead,</text><text start="52" dur="3">and one is that the battery is okay but it&amp;#39;s not charging.</text><text start="55" dur="6">The reason why a battery might not charge is that the alternator might be broken</text><text start="61" 
dur="2">or the fan belt might be broken.</text><text start="63" dur="4">If you look at this influence diagram, also called a Bayes network,</text><text start="67" dur="5">you&amp;#39;ll find there&amp;#39;s many different ways to explain that the car won&amp;#39;t start.</text><text start="72" dur="5">And a natural question you might have is, &amp;quot;Can we diagnose the problem?&amp;quot;</text><text start="77" dur="3">One diagnostic tool is a battery meter,</text><text start="80" dur="6">which may increase or decrease your belief that the battery may cause your car failure.</text><text start="86" dur="3">You might also know your battery age.</text><text start="89" dur="2">Older batteries tend to go dead more often.</text><text start="91" dur="6">And there&amp;#39;s many other ways to look at reasons why the car might not start.</text><text start="97" dur="6">You might inspect the lights, the oil light, the gas gauge.</text><text start="103" dur="5">You might even dip into the engine to see what the oil level is with a dipstick.</text><text start="108" dur="4">All of those relate to alternative reasons why the car might not be starting, </text><text start="112" dur="7">like no oil, no gas, the fuel line might be blocked, or the starter may be broken.</text><text start="119" dur="5">And all of these can influence your measurements,</text><text start="124" dur="3">like the oil light or the gas gauge, in different ways.</text><text start="127" dur="5">For example, the battery flat would have an effect on the lights.</text><text start="132" dur="4">It might have an effect on the oil light and on the gas gauge,</text><text start="136" dur="4">but it won&amp;#39;t really affect the oil you measure with the dipstick.</text><text start="140" dur="6">That is affected by the actual oil level, which also affects the oil light.</text><text start="146" dur="6">Gas will affect the gas gauge, and of course without gas the car doesn&amp;#39;t start.</text><text start="152" 
dur="7">So this is a complicated structure that really describes one way to understand</text><text start="159" dur="2">how a car doesn&amp;#39;t start.</text><text start="161" dur="2">A car is a complex system.</text><text start="163" dur="3">It has lots of variables you can&amp;#39;t really measure immediately,</text><text start="166" dur="6">and it has sensors which allow you to understand a little bit about the state of the car.</text><text start="172" dur="2">What the Bayes network does,</text><text start="174" dur="7">it really assists you in reasoning from observable variables, like the car won&amp;#39;t start</text><text start="181" dur="5">and the value of the dipstick, to hidden causes, like is the fan belt broken</text><text start="186" dur="3">or is the battery dead.</text><text start="189" dur="4">What you have here is a Bayes network.</text><text start="193" dur="2">A Bayes network is composed of nodes.</text><text start="195" dur="6">These nodes correspond to events that you might or might not know</text><text start="201" dur="3">that are typically called random variables.</text><text start="204" dur="7">These nodes are linked by arcs, and the arcs suggest that a child of an arc</text><text start="211" dur="4">is influenced by its parent but not in a deterministic way.</text><text start="215" dur="6">It might be influenced in a probabilistic way, which means an older battery, for example,</text><text start="221" dur="4">has a higher chance of causing the battery to be dead,</text><text start="225" dur="3">but it&amp;#39;s not clear that every old battery is dead.</text><text start="228" dur="5">There is a total of 16 variables in this Bayes network.</text><text start="233" dur="6">What the graph structure and associated probabilities specify</text><text start="239" dur="7">is a huge probability distribution in the space of all of these 16 variables.</text><text start="246" dur="4">If they are all binary, which we&amp;#39;ll assume throughout this 
unit,</text><text start="250" dur="5">they can take 2 to the 16th different values, which is a lot.</text><text start="255" dur="3">The Bayes network, as we find out, is a compact representation</text><text start="258" dur="8">of a distribution over this very, very large joint probability distribution of all of these variables.</text><text start="266" dur="3">Further, once we specify the Bayes network,</text><text start="269" dur="4">we can observe, for example, the car won&amp;#39;t start.</text><text start="273" dur="4">We can observe things like the oil light and the lights and the battery meter</text><text start="277" dur="4">and then compute probabilities of the hypotheses, like the alternator is broken</text><text start="281" dur="4">or the fan belt is broken or the battery is dead.</text><text start="285" dur="5">So in this class we&amp;#39;re going to talk about how to construct this Bayes network,</text><text start="290" dur="6">what the semantics are, and how to reason in this Bayes network </text><text start="296" dur="6">to find out about variables we can&amp;#39;t observe, like whether the fan belt is broken or not.</text><text start="302" dur="2">That&amp;#39;s an overview.</text><text start="304" dur="4">Throughout this unit I am going to assume that every event is discrete--</text><text start="308" dur="2">in fact, it&amp;#39;s binary.</text><text start="310" dur="4">We&amp;#39;ll start with some consideration of basic probability,</text><text start="314" dur="5">we&amp;#39;ll work our way into some simple Bayes networks,</text><text start="319" dur="4">we&amp;#39;ll talk about concepts like conditional independence</text><text start="323" dur="3">and then define Bayes networks more generally,</text><text start="326" dur="6">move into concepts like D-separation and start doing parameter counts.</text><text start="332" dur="4">Later on, Peter will tell you about inference in Bayes networks.</text><text start="336" dur="2">So we won&amp;#39;t do this 
in this class.</text><text start="338" dur="5">I can&amp;#39;t overemphasize how important this class is.</text><text start="343" dur="6">Bayes networks are used extensively in almost all fields of smart computer systems,</text><text start="349" dur="8">in diagnostics, for prediction, for machine learning, and in fields like finance,</text><text start="357" dur="3">inside Google, in robotics.</text><text start="360" dur="5">Bayes networks are also the building blocks of more advanced AI techniques</text><text start="365" dur="7">such as particle filters, hidden Markov models, MDPs and POMDPs, </text><text start="372" dur="2">Kalman filters, and many others.</text><text start="374" dur="4">These are words that don&amp;#39;t sound familiar quite yet,</text><text start="378" dur="4">but as you go through the class, I can promise you that you will get to know what they mean.</text><text start="382" dur="4">So let&amp;#39;s start now at the very, very basics.</text></transcript></video><video title="2 Probabilities" id="EdONkI3RNKg" length="40"><transcript><text start="0" dur="2">[Thrun] So let&amp;#39;s talk about probabilities.</text><text start="2" dur="3">Probabilities are the cornerstone of artificial intelligence.</text><text start="5" dur="3">They are used to express uncertainty,</text><text start="8" dur="4">and the management of uncertainty is really key to many, many things in AI</text><text start="12" dur="4">such as machine learning and Bayes network inference </text><text start="16" dur="5">and filtering and robotics and computer vision and so on.</text><text start="21" dur="3">So I&amp;#39;m going to start with some very basic questions,</text><text start="24" dur="2">and we&amp;#39;re going to work our way up from there.</text><text start="26" dur="2">Here is a coin.</text><text start="28" dur="4">The coin can come up heads or tails, and my question is the following:</text><text start="32" dur="6">Suppose the probability for heads is 0.5.</text><text start="38" 
dur="2">What&amp;#39;s the probability for it coming up tails?</text></transcript></video><video title="2a Answer" id="orhhEZGH_Es" length="19"><transcript><text start="0" dur="3">[Thrun] So the right answer is a half, or 0.5,</text><text start="3" dur="4">and the reason is the coin can only come up heads or tails.</text><text start="7" dur="3">We know that it has to be either one.</text><text start="10" dur="4">Therefore, the total probability of both coming up is 1.</text><text start="14" dur="5">So if half of the probability is assigned to heads, then the other half is assigned to tail.</text></transcript></video><video title="2b Question" id="Ee9g6dhDL9A" length="8"><transcript><text start="0" dur="2">[Thrun] Let me ask my next quiz.</text><text start="2" dur="4">Suppose the probability of heads is a quarter, 0.25.</text><text start="6" dur="2">What&amp;#39;s the probability of tail?</text></transcript></video><video title="2c Answer" id="84KcxfggKRg" length="17"><transcript><text start="0" dur="2">[Thrun] And the answer is 3/4. </text><text start="2" dur="3">It&amp;#39;s a loaded coin, and the reason is, well, </text><text start="5" dur="3">each of them come up with a certain probability.</text><text start="8" dur="4">The total of those is 1. 
The quarter is claimed by heads.</text><text start="12" dur="5">Therefore, 3/4 remain for tail, which is the answer over here.</text></transcript></video><video title="2d Question" id="koOpSPz-voY" length="14"><transcript><text start="0" dur="2">[Thrun] Here&amp;#39;s another quiz.</text><text start="2" dur="6">What&amp;#39;s the probability that the coin comes up heads, heads, heads, three times in a row,</text><text start="8" dur="4">assuming that each one of those has a probability of a half</text><text start="12" dur="2">and that these coin flips are independent?</text></transcript></video><video title="2e Answer" id="7pZQS5inJXs" length="14"><transcript><text start="0" dur="4">[Thrun] And the answer is 0.125.</text><text start="4" dur="2">Each head has a probability of a half.</text><text start="6" dur="4">We can multiply those probabilities because they are independent events,</text><text start="10" dur="4">and that gives us 1 over 8 or 0.125.</text></transcript></video><video title="2f Question" id="KatS5xl7vn8" length="32"><transcript><text start="0" dur="11">[Thrun] Now let&amp;#39;s flip the coin 4 times, and let&amp;#39;s call Xi the result of the i-th coin flip.</text><text start="11" dur="5">So each Xi is going to be drawn from heads or tail.</text><text start="16" dur="6">What&amp;#39;s the probability that all 4 of those flips give us the same result,</text><text start="22" dur="4">no matter what it is, assuming that each one of those has identically</text><text start="26" dur="6">an equally distributed probability of coming up heads of the half?</text></transcript></video><video title="2g Answer" id="g_M3o3QXBjo" length="23"><transcript><text start="0" dur="4">[Thrun] And the answer is, well, there&amp;#39;s 2 ways that we can achieve this.</text><text start="4" dur="2">One is the all heads and one is all tails.</text><text start="6" dur="4">You already know that 4 times heads is 1/16,</text><text start="10" dur="3">and we know that 4 times tail is 
also 1/16.</text><text start="13" dur="2">These are completely independent events.</text><text start="15" dur="8">The probability of either one occurring is 1/16 plus 1/16, which is 1/8, which is 0.125.</text></transcript></video><video title="2h Question" id="hdQER9u46yU" length="10"><transcript><text start="0" dur="2">[Thrun] So here&amp;#39;s another one.</text><text start="2" dur="5">What&amp;#39;s the probability that within the set of X1, X2, X3, and X4</text><text start="7" dur="3">there are at least three heads?</text></transcript></video><video title="2i Answer" id="FEqiaraw3GE" length="28"><transcript><text start="0" dur="3">[Thrun] And the solution is let&amp;#39;s look at different sequences </text><text start="3" dur="3">in which head occurs at least 3 times.</text><text start="6" dur="4">It could be head, head, head, head, in which it comes 4 times.</text><text start="10" dur="6">It could be head, head, head, tail and so on, all the way to tail, head, head, head.</text><text start="16" dur="3">There&amp;#39;s 1, 2, 3, 4, 5 of those outcomes.</text><text start="19" dur="9">Each of them has a 16th for probability, so it&amp;#39;s 5 times a 16th, which is 0.3125.</text></transcript></video><video title="2j Summary" id="Xblzy61pBDQ" length="45"><transcript><text start="0" dur="2">[Thrun] So we just learned a number of things.</text><text start="2" dur="3">One is about complementary probability.</text><text start="5" dur="3">If an event has a certain probability, p,</text><text start="8" dur="5">the complementary event has the probability 1-p.</text><text start="13" dur="2">We also learned about independence.</text><text start="15" dur="4">If 2 random variables, X and Y, are independent, </text><text start="19" dur="2">which you&amp;#39;re going to write like this,</text><text start="21" dur="5">that means the probability of the joint that any 2 variables can assume </text><text start="26" dur="4">is the product of the marginals.</text><text start="30" 
dur="4">So rather than asking the question, &amp;quot;What is the probability</text><text start="34" dur="6">&amp;quot;for any combination that these 2 coins or maybe 5 coins could have taken?&amp;quot;</text><text start="40" dur="2">we can now look at the probability of each coin individually,</text><text start="42" dur="3">look at its probability and just multiply them up.</text></transcript></video><video title="3 Dependence" id="uy0sL0DGV7o" length="64"><transcript><text start="0" dur="3">[Thrun] So let me ask you about dependence. </text><text start="3" dur="2">Suppose we flip 2 coins.</text><text start="5" dur="7">Our first coin is a fair coin, and we&amp;#39;re going to denote the outcome by X1.</text><text start="12" dur="3">So the chance of X1 coming up heads is half.</text><text start="15" dur="5">But now we branch into picking a coin based on the first outcome.</text><text start="20" dur="3">So if the first outcome was heads, </text><text start="23" dur="5">you pick a coin whose probability of coming up heads is going to be 0.9.</text><text start="28" dur="4">The way I word this is by conditional probability,</text><text start="32" dur="3">probability of the second coin flip coming up heads</text><text start="35" dur="6">provided that or given that X1, the first coin flip, was heads, is 0.9.</text><text start="41" dur="3">The first coin flip might also come up tails,</text><text start="44" dur="3">in which case I pick a very different coin.</text><text start="47" dur="7">In this case I pick a coin which with 0.8 probability will once again give me tails,</text><text start="54" dur="3">conditioned on the first coin flip coming up tails.</text><text start="57" dur="2">So my question for you is,</text><text start="59" dur="5">what&amp;#39;s the probability of the second coin flip coming up heads?</text></transcript></video><video title="3a Answer" id="kpdV5I5WHW8" length="66"><transcript><text start="0" dur="4">[Thrun] The answer is 0.55.</text><text 
start="4" dur="4">The way to compute this is by the theorem of total probability.</text><text start="8" dur="4">Probability of X2 equals heads.</text><text start="12" dur="3">There&amp;#39;s 2 ways I can get to this outcome.</text><text start="15" dur="3">One is via this path over here, and one is via this path over here.</text><text start="18" dur="2">Let me just write both of them down.</text><text start="20" dur="6">So first of all, it could be the probability of X2 equals heads</text><text start="26" dur="4">given that, and I will assume, X1 was heads already.</text><text start="30" dur="2">Now I have to add the complementary event.</text><text start="32" dur="3">Suppose X1 came up tails.</text><text start="35" dur="5">Then I can ask the question, what is the probability that X2 comes up heads regardless,</text><text start="40" dur="2">even though X1 was tails?</text><text start="42" dur="2">Plugging in the numbers gives us the following.</text><text start="44" dur="5">This one over here is 0.9 times a half.</text><text start="49" dur="2">The probability of tails is 0.8,</text><text start="51" dur="7">thereby my head probability becomes 1 minus 0.8, which is 0.2.</text><text start="58" dur="5">Adding all of this together gives me 0.45 plus 0.1,</text><text start="63" dur="3">which is exactly 0.55.</text></transcript></video><video title="4 What We Learned" id="9fxuibvkZ9g" length="67"><transcript><text start="0" dur="2">So, we actually just learned some interesting lessons.</text><text start="2" dur="6">The probability of any random variable Y can be written as</text><text start="8" dur="5">probability of Y given that some other random variable X assumes value i</text><text start="13" dur="4">times probability of X equals i,</text><text start="17" dur="5">summed over all possible outcomes i for the random variable X.</text><text start="22" dur="2">This is called total probability.</text><text start="24" dur="3">The second thing we learned has to do with 
negation of probabilities.</text><text start="27" dur="10">We found that probability of not X given Y is 1 minus probability of X given Y.</text><text start="37" dur="6">Now, you might be tempted to say &amp;quot;What about the probability of X given not Y?&amp;quot;</text><text start="43" dur="8">&amp;quot;Is this the same as 1 minus probability of X given Y?&amp;quot;</text><text start="51" dur="3">And the answer is absolutely no.</text><text start="54" dur="2">That&amp;#39;s not the case.</text><text start="56" dur="4">If you condition on something that has a certain probability value,</text><text start="60" dur="3">you can take the event you&amp;#39;re looking at and negate this,</text><text start="63" dur="2">but you can never negate your conditional variable </text><text start="65" dur="2">and assume these values add up to 1.</text></transcript></video><video title="5 Weather Quiz" id="RRYo6jVL6ao" length="25"><transcript><text start="0" dur="6">We assume there is sometimes sunny days and sometimes rainy days,</text><text start="6" dur="3">and on day 1, which we&amp;#39;re going to call D1, </text><text start="9" dur="4">the probability of sunny is 0.9.</text><text start="13" dur="7">And then let&amp;#39;s assume that a sunny day follows a sunny day with 0.8 chance,</text><text start="20" dur="5">and a rainy day follows a sunny day with--well--</text></transcript></video><video title="5a Answer" id="GqCNDJhZQnc" length="5"><transcript><text start="0" dur="5">Well, the correct answer is 0.2, which is a negation of this event over here.</text></transcript></video><video title="5b Question" id="ASgU5Ekoz-A" length="13"><transcript><text start="0" dur="6">A sunny day follows a rainy day with 0.6 chance, </text><text start="6" dur="5">and a rainy day follows a rainy day--</text><text start="11" dur="2">please give me your number.</text></transcript></video><video title="5c Answer" id="KgEX10LtY8Y" length="3"><transcript><text start="0" 
dur="3">0.4</text></transcript></video><video title="5d Question" id="aEVUaEK84UQ" length="18"><transcript><text start="0" dur="3">So, what are the chances that D2 is sunny?</text><text start="3" dur="3">Suppose the same dynamics apply from D2 to D3, </text><text start="6" dur="4">so just replace D3 over here with D2s over there.</text><text start="10" dur="4">That means the transition probabilities from one day to the next remain the same.</text><text start="14" dur="4">Tell me, what&amp;#39;s the probability that D3 is sunny?</text></transcript></video><video title="5e Answer" id="tn9chzKS9sM" length="85"><transcript><text start="0" dur="4">So, the correct answer over here is 0.78,</text><text start="4" dur="6">and over here it&amp;#39;s 0.756.</text><text start="10" dur="3">To get there, let&amp;#39;s complete this one first.</text><text start="13" dur="3">The probability of D2 = sunny.</text><text start="16" dur="5">Well, we know there&amp;#39;s a 0.9 chance it&amp;#39;s sunny on D1,</text><text start="21" dur="4">and then if it is sunny, we know it stays sunny with a 0.8 chance. 
</text><text start="25" dur="4">So, we multiply these 2 things together, and we get 0.72.</text><text start="29" dur="4">We know there&amp;#39;s a 0.1 chance of it being rainy on day 1, which is the complement,</text><text start="33" dur="4">but if it&amp;#39;s rainy, we know it switches to sunny with 0.6 chance, </text><text start="37" dur="4">so you multiply these 2 things, and you get 0.06.</text><text start="41" dur="5">Adding those two up equals 0.78.</text><text start="46" dur="5">Now, for the next day, we know our prior for sunny is 0.78.</text><text start="51" dur="4">If it is sunny, it stays sunny with 0.8 probability.</text><text start="55" dur="6">Multiplying these 2 things gives us 0.624.</text><text start="61" dur="6">We know it&amp;#39;s rainy with 0.22 chance, which is the complement of 0.78,</text><text start="67" dur="3">and a 0.6 chance that it then turns sunny.</text><text start="70" dur="4">If you multiply those, you get 0.132.</text><text start="74" dur="5">Adding those 2 things up gives us 0.756.</text><text start="79" dur="4">So, to some extent, it&amp;#39;s tedious to compute these values,</text><text start="83" dur="2">but they can be perfectly computed, as shown here.</text></transcript></video><video title="6 Cancer Quiz" id="nhIDr-yogzg" length="19"><transcript><text start="0" dur="5">Next example is a cancer example.</text><text start="5" dur="6">Suppose there&amp;#39;s a specific type of cancer which exists for 1% of the population.</text><text start="11" dur="2">I&amp;#39;m going to write this as follows.</text><text start="13" dur="6">You can probably tell me now what the probability of not having this cancer is.</text></transcript></video><video title="6a Answer and Cancer Test" id="_NRpTjkvWv0" length="28"><transcript><text start="0" dur="4">And yes, the answer is 0.99.</text><text start="4" dur="3">Let&amp;#39;s assume there&amp;#39;s a test for this cancer, </text><text start="7" dur="5">which gives us probabilistically an answer 
whether we have this cancer or not.</text><text start="12" dur="6">So, let&amp;#39;s say the probability of a test being positive, as indicated by this + sign,</text><text start="18" dur="4">given that we have cancer, is 0.9.</text><text start="22" dur="6">The probability of the test coming out negative if we have the cancer is--you name it.</text></transcript></video><video title="6b Answer" id="sAnyHLFbiXg" length="61"><transcript><text start="0" dur="6">0.1, which is the difference between 1 and 0.9.</text><text start="6" dur="5">Let&amp;#39;s assume the probability of the test coming out positive</text><text start="11" dur="4">given that we don&amp;#39;t have this cancer is 0.2.</text><text start="15" dur="4">In other words, the probability of the test correctly saying</text><text start="19" dur="5">we don&amp;#39;t have the cancer if we&amp;#39;re cancer free is 0.8.</text><text start="24" dur="4">Now, ultimately, I&amp;#39;d like to know what&amp;#39;s the probability </text><text start="28" dur="7">they have this cancer given they just received a single, positive test?</text><text start="35" dur="4">Before I do this, please help me filling out some other probabilities</text><text start="39" dur="2">that are actually important.</text><text start="41" dur="4">Specifically, the joint probabilities.</text><text start="45" dur="6">The probability of a positive test and having cancer.</text><text start="51" dur="2">The probability of a negative test and having cancer, </text><text start="53" dur="2">and this is not conditional anymore.</text><text start="55" dur="2">It&amp;#39;s now a joint probability.</text><text start="57" dur="4">So, please give me those 4 values over here.</text></transcript></video><video title="6c Answer" id="PCKlid_iMNo" length="40"><transcript><text start="0" dur="5">And here the correct answer is 0.009,</text><text start="5" dur="7">which is the product of your prior, 0.01, times the conditional, 0.9.</text><text start="12" dur="9">Over 
here we get 0.001, the probability of our prior cancer times 0.1.</text><text start="21" dur="5">Over here we get 0.198,</text><text start="26" dur="3">the probability of not having cancer is 0.99</text><text start="29" dur="3">times still getting a positive reading, which is 0.2.</text><text start="32" dur="5">And finally, we get 0.792, </text><text start="37" dur="3">which is the probability of this guy over here, and this guy over here.</text></transcript></video><video title="6d Question" id="BX_uy8rCS5k" length="7"><transcript><text start="0" dur="4">Now, our next quiz, I want you to fill in the probability of </text><text start="4" dur="3">the cancer given that we just received a positive test.</text></transcript></video><video title="6e Answer" id="JgYH7UEcA6c" length="112"><transcript><text start="0" dur="6">And the correct answer is 0.043.</text><text start="6" dur="3">So, even though I received a positive test,</text><text start="9" dur="5">my probability of having cancer is just 4.3%,</text><text start="14" dur="4">which is not very much given that the test itself is quite sensitive.</text><text start="18" dur="8">It really gives me a 0.8 chance of getting a negative result if I don&amp;#39;t have cancer.</text><text start="26" dur="6">It gives me a 0.9 chance of detecting cancer given that I have cancer. </text><text start="32" dur="3">Now, what comes (inaudible) small?</text><text start="35" dur="3">Well, let&amp;#39;s just put all the cases together.</text><text start="38" dur="3">You already know that we received a positive test. 
</text><text start="41" dur="6">Therefore, this entry over here, and this entry over here are relevant.</text><text start="47" dur="9">Now, the chance of having a positive test and having cancer is 0.009.</text><text start="56" dur="5">Well, I might--when I receive a positive test--have cancer or not have cancer,</text><text start="61" dur="5">so we will just normalize by these 2 possible causes for the positive test,</text><text start="66" dur="5">which is 0.009 + 0.198.</text><text start="71" dur="9">Putting these 2 things together gives 0.009 over 0.207, </text><text start="80" dur="3">which is approximately 0.043.</text><text start="83" dur="5">Now, the interesting thing in this equation is that the chances </text><text start="88" dur="4">of having seen a positive test result in the absence of cancer</text><text start="92" dur="3">are still much, much higher than the chance of seeing a positive result </text><text start="95" dur="4">in the presence of cancer, and that&amp;#39;s because our prior for cancer</text><text start="99" dur="5">is so small in the population that it&amp;#39;s just very unlikely to have cancer.</text><text start="104" dur="3">So, the additional information of a positive test </text><text start="107" dur="5">only raised my posterior probability to 0.043.</text></transcript></video><video title="7 Bayes Rule" id="OWCRop639TA" length="214"><transcript><text start="0" dur="3">So, we&amp;#39;ve just learned about what&amp;#39;s probably the most important</text><text start="3" dur="6">piece of math for this class in statistics called Bayes Rule.</text><text start="9" dur="6">It was invented by Reverend Thomas Bayes, who was a British mathematician</text><text start="15" dur="3">and a Presbyterian minister in the 18th century.</text><text start="18" dur="9">Bayes Rule is usually stated as follows: P of A given B where B is the evidence</text><text start="27" dur="9">and A is the variable we care about is P of B given A times P of A over P of 
B.</text><text start="36" dur="4">This expression is called the likelihood.</text><text start="40" dur="6">This is called the prior, and this is called marginal likelihood.</text><text start="46" dur="4">The expression over here is called the posterior.</text><text start="50" dur="5">The interesting thing here is the way the probabilities are reworded.</text><text start="55" dur="2">Say we have evidence B.</text><text start="57" dur="4">We know about B, but we really care about the variable A.</text><text start="61" dur="2">So, for example, B is a test result.</text><text start="63" dur="3">We don&amp;#39;t care about the test result as much as we care about the fact</text><text start="66" dur="2">whether we have cancer or not.</text><text start="68" dur="8">This diagnostic reasoning--which is from evidence to its causes--</text><text start="76" dur="6">is turned upside down by Bayes Rule into a causal reasoning, </text><text start="82" dur="5">which is given--hypothetically, if we knew the cause,</text><text start="87" dur="4">what would be the probability of the evidence we just observed.</text><text start="91" dur="5">But to correct for this inversion, we have to multiply</text><text start="96" dur="4">by the prior of the cause to be the case in the first place,</text><text start="100" dur="2">in this case, having cancer or not,</text><text start="102" dur="5">and divide it by the probability of the evidence, P(B),</text><text start="107" dur="5">which often is expanded using the theorem of total probability as follows.</text><text start="112" dur="6">The probability of B is a sum over all probabilities of B </text><text start="118" dur="6">conditional on A, lower caps a, times the probability of A equals lower caps a.</text><text start="124" dur="4">This is total probability as we already encountered it.</text><text start="128" dur="2">So, let&amp;#39;s apply this to the cancer case </text><text start="130" dur="3">and say we really care about whether you have 
cancer,</text><text start="133" dur="4">which is our cause, conditioned on the evidence </text><text start="137" dur="6">that is the result of this hidden cause, in this case, a positive test result.</text><text start="143" dur="2">Let&amp;#39;s just plug in the numbers.</text><text start="145" dur="5">Our likelihood is the probability of seeing a positive test result</text><text start="150" dur="3">given that you have cancer multiplied by the prior probability</text><text start="153" dur="5">of having cancer over the probability of the positive test result,</text><text start="158" dur="5">and that is--according to the tables we looked at before--</text><text start="163" dur="7">0.9 times a prior of 0.01 over--</text><text start="170" dur="5">now we&amp;#39;re going to expand this right over here according to total probability </text><text start="175" dur="6">which gives us 0.9 times 0.01.</text><text start="181" dur="5">That&amp;#39;s the probability of + given that we do have cancer.</text><text start="186" dur="5">So, the probability of + given that we don&amp;#39;t have cancer is 0.2,</text><text start="191" dur="4">but the prior here is 0.99.</text><text start="195" dur="5">So, if we plug in the numbers we know about, we get 0.009 </text><text start="200" dur="7">over 0.009 + 0.198.</text><text start="207" dur="7">That is approximately 0.0434, which is the number we saw before.</text></transcript></video><video title="7a Bayes Rule Graphically" id="1DhY4Cs_qEs" length="112"><transcript><text start="0" dur="3">So, if you want to draw Bayes rule graphically, </text><text start="3" dur="5">we have a situation where we have an internal variable A, </text><text start="8" dur="5">like whether I&amp;#39;m going to die of cancer, but we can&amp;#39;t sense A.</text><text start="13" dur="3">Instead, we have a second variable, called B, </text><text start="16" dur="5">which is our test, and B is observable, but A isn&amp;#39;t.</text><text start="21" dur="5">This is a 
classical example of a Bayes network. </text><text start="26" dur="4">The Bayes network is composed of 2 variables, A and B.</text><text start="30" dur="3">We know the prior probability for A, </text><text start="33" dur="2">and we know the conditional. </text><text start="35" dur="3">A causes B--whether or not we have cancer, </text><text start="38" dur="3">causes the test result to be positive or not, </text><text start="41" dur="3">although there is some randomness involved.</text><text start="44" dur="5">So, we know the probability of B given the different values for A,</text><text start="49" dur="5">and what we care about in this specific instance is called diagnostic reasoning, </text><text start="54" dur="4">which is the inverse of the causal reasoning,</text><text start="58" dur="8">the probability of A given B or similarly, probability of A given not B.</text><text start="66" dur="5">This is our very first Bayes network, and the graphical representation</text><text start="71" dur="4">of drawing 2 variables, A and B, connected with an arc</text><text start="75" dur="7">that goes from A to B is the graphical representation of a distribution</text><text start="82" dur="4">of 2 variables that are specified in the structure over here,</text><text start="86" dur="5">which has a prior probability and has a conditional probability as shown over here.</text><text start="91" dur="3">Now, I do have a quick quiz for you.</text><text start="94" dur="3">How many parameters does it take to specify</text><text start="97" dur="6">the entire joint probability of A and B, or put differently, the entire Bayes network?</text><text start="103" dur="5">I&amp;#39;m not looking for structural parameters that relate to the graph over here.</text><text start="108" dur="4">I&amp;#39;m just looking for the numerical parameters of the underlying probabilities.</text></transcript></video><video title="7b Answer" id="Q5luTxpgFaU" length="24"><transcript><text start="0" dur="2">And 
the answer is 3.</text><text start="2" dur="7">It takes 1 parameter to specify P of A from which we can derive P of not A.</text><text start="9" dur="6">It takes 2 parameters to specify P of B given A and P of B given not A, </text><text start="15" dur="6">from which we can derive P of not B given A and P of not B given not A.</text><text start="21" dur="3">So, it&amp;#39;s a total of 3 parameters for this Bayes network.</text></transcript></video><video title="8 More Complex Bayes Networks" id="h59XtnoILcQ" length="152"><transcript><text start="0" dur="3">So, we just encountered our very first Bayes network</text><text start="3" dur="3">and did a number of interesting calculations.</text><text start="6" dur="4">Let&amp;#39;s now talk about Bayes Rule and look into more complex Bayes networks.</text><text start="10" dur="3">I will look at Bayes Rule again and make an observation</text><text start="13" dur="2">that is really non-trivial.</text><text start="15" dur="5">Here is Bayes Rule, and in practice, what we find is</text><text start="20" dur="3"> this term here is relatively easy to compute.</text><text start="23" dur="5">It&amp;#39;s just a product, whereas this term is really hard to compute.</text><text start="28" dur="5">However, this term over here does not depend on what we assume for variable A.</text><text start="33" dur="2">It&amp;#39;s just a function of B.</text><text start="35" dur="5">So, suppose for a moment we also care about the complementary event of not A</text><text start="40" dur="3">given B, for which Bayes Rule unfolds as follows.</text><text start="43" dur="4">Then we find that the normalizer, P(B), is identical, </text><text start="47" dur="4">whether we assume A on the left side or not A on the left side.</text><text start="51" dur="6">We also know from prior work that P of A given B plus</text><text start="57" dur="6">P of not A given B must be one because these are 2 complementary events.</text><text start="63" dur="3">That allows us to 
compute Bayes Rule very differently</text><text start="66" dur="5">by basically ignoring the normalizer, so here&amp;#39;s how it goes.</text><text start="71" dur="5">We compute P of A given B--and I want to call this prime, </text><text start="76" dur="7">because it&amp;#39;s not a real probability--to be just P of B given A times P of A,</text><text start="83" dur="5">which leaves out the normalizer, the denominator of the expression over here.</text><text start="88" dur="3">We do the same thing with not A.</text><text start="91" dur="5">So, in both cases, we compute the posterior probability non-normalized</text><text start="96" dur="2">by omitting the normalizer P(B).</text><text start="98" dur="5">And then we can recover the original probabilities by normalizing</text><text start="103" dur="5">based on those values over here, so the probability of A given B, </text><text start="108" dur="4">the actual probability, is a normalizer, eta, </text><text start="112" dur="3">times this non-normalized form over here.</text><text start="115" dur="4">The same is true for the negation of A over here.</text><text start="119" dur="7">And eta is just the normalizer that results by adding these 2 values over here together</text><text start="126" dur="4">as shown over here, and dividing 1 by that sum.</text><text start="130" dur="3">So, take a look at this for a moment.</text><text start="133" dur="5">What we&amp;#39;ve done is we deferred the calculation of the normalizer over here</text><text start="138" dur="4">by computing pseudo probabilities that are non-normalized.</text><text start="142" dur="4">This made the calculation much easier, and when we were done with everything,</text><text start="146" dur="3">we just folded it back into the normalizer based on the resulting </text><text start="149" dur="3">pseudo probabilities and got the correct answer.</text></transcript></video><video title="8a Two Test Cancer Example" id="_AJQSBYRAR4" length="68"><transcript><text start="0" 
dur="3">The reason why I gave you all this is because I want you to apply it now</text><text start="3" dur="5">to a slightly more complicated problem, which is the 2-test cancer example.</text><text start="8" dur="6">In this example, we again might have our unobservable cancer C, </text><text start="14" dur="4">but now we&amp;#39;re running 2 tests, test 1 and test 2.</text><text start="18" dur="6">As before, the prior probability of cancer is 0.01.</text><text start="24" dur="6">The probability of receiving a positive test result for either test, given cancer, is 0.9.</text><text start="30" dur="6">The probability of getting a negative result given they&amp;#39;re cancer free is 0.8.</text><text start="36" dur="4">And from those, we were able to compute all the other probabilities,</text><text start="40" dur="3">and we&amp;#39;re just going to write them down over here.</text><text start="43" dur="3">So, take a moment to just verify those.</text><text start="46" dur="4">Now, let&amp;#39;s assume both of my tests come back positive,</text><text start="50" dur="6">so T1 = + and T2 = +.</text><text start="56" dur="4">What&amp;#39;s the probability of cancer now, written in short form as probability of </text><text start="60" dur="3">C given ++?</text><text start="63" dur="5">I want you to tell me what that is, and this is a non-trivial question.</text></transcript></video><video title="8b Answer" id="sjdPqdZQQCI" length="120"><transcript><text start="0" dur="10">So, the correct answer is 0.1698 approximately,</text><text start="10" dur="5">and to compute this, I used the trick I&amp;#39;ve shown you before.</text><text start="15" dur="9">Let me write down the running count for cancer and for not cancer</text><text start="24" dur="4">as I integrate the various multiplications in Bayes Rule.</text><text start="28" dur="9">My prior for cancer was 0.01 and for non-cancer was 0.99.</text><text start="37" dur="6">Then I get my first +, and the probability of a + given they have cancer is 
0.9,</text><text start="43" dur="5">and the same for non-cancer is 0.2.</text><text start="48" dur="4">So, according to the non-normalized Bayes Rule, </text><text start="52" dur="6">I now multiply these 2 things together to get my non-normalized probability</text><text start="58" dur="2">of having cancer given the plus.</text><text start="60" dur="3">Since multiplication is commutative, </text><text start="63" dur="6">I can do the same thing again with my 2nd test result, 0.9 and 0.2, </text><text start="69" dur="5">and I multiply all of these 3 things together to get my non-normalized probability </text><text start="74" dur="7">P prime to be the following: 0.0081, if you multiply those things together,</text><text start="81" dur="7">and 0.0396 if you multiply these factors together.</text><text start="88" dur="2">And these are not probabilities.</text><text start="90" dur="4">If we add those for the 2 complementary cases of cancer/non-cancer,</text><text start="94" dur="4">I get 0.0477.</text><text start="98" dur="4">However, if I now divide, that is, I normalize</text><text start="102" dur="5">those non-normalized probabilities over here by this factor over here,</text><text start="107" dur="5">I actually get the correct posterior probability P of cancer given ++.</text><text start="112" dur="2">And they look as follows:</text><text start="114" dur="6">approximately 0.1698 and approximately 0.8302.</text></transcript></video><video title="8c Question" id="Ah8mhlLsimM" length="10"><transcript><text start="0" dur="3">Calculate for me the probability of cancer </text><text start="3" dur="5">given that I received one positive and one negative test result.</text><text start="8" dur="2">Please write your number into this box.</text></transcript></video><video title="8d Answer" id="gM1DfM6CGqw" length="63"><transcript><text start="0" dur="3">We apply the same trick as before</text><text start="3" dur="4">where we use the exact same prior of 0.01.</text><text start="7" 
dur="6">Our first + gives us the following factors: 0.9 and 0.2.</text><text start="13" dur="7">And our minus gives us the probability 0.1 for a negative test result given that we have cancer,</text><text start="20" dur="6">and a 0.8 for a negative result given that we don&amp;#39;t have cancer. </text><text start="26" dur="2">We multiply those together.</text><text start="28" dur="2">We get our non-normalized probability.</text><text start="30" dur="5">And if we now normalize by the sum of those two things</text><text start="35" dur="6">to turn this back into a probability, we get 0.0009</text><text start="41" dur="9">over the sum of those two things over here, and this is 0.0056</text><text start="50" dur="9">for the chance of having cancer and 0.9943 for the chance of being cancer free.</text><text start="59" dur="4">And this adds up approximately to 1, and therefore, is a probability distribution.</text></transcript></video><video title="9 Conditional Independence" id="KY3ecsJDnO4" length="165"><transcript><text start="0" dur="3">I want to use a few words of terminology.</text><text start="3" dur="5">This, again, is a Bayes network, in which the hidden variable C</text><text start="8" dur="8">causes the still stochastic test outcomes T1 and T2.</text><text start="16" dur="3">And what is really important is that we assume not just </text><text start="19" dur="3">that T1 and T2 are identically distributed.</text><text start="22" dur="5">We use the same 0.9 for test 1 as we use for test 2,</text><text start="27" dur="4">but we also assume that they are conditionally independent.</text><text start="31" dur="6">We assumed that if God told us whether we actually had cancer or not,</text><text start="37" dur="4">if we knew with absolute certainty the value of the variable C,</text><text start="41" dur="7">that knowing anything about T1 would not help us make a statement about T2.</text><text start="48" dur="7">Put differently, we assumed that the probability of T2 
given C and T1</text><text start="55" dur="5">is the same as the probability of T2 given C.</text><text start="60" dur="8">This is called conditional independence, which is given the value of the cancer variable C.</text><text start="68" dur="9">If you knew this for a fact, then T2 would be independent of T1.</text><text start="77" dur="4">It&amp;#39;s conditionally independent because the independence only holds true</text><text start="81" dur="5">if we actually know C, and it comes out of this diagram over here.</text><text start="86" dur="6">If we look at this diagram, if you knew the variable C over here,</text><text start="92" dur="7">then C separately causes T1 and T2.</text><text start="99" dur="4">So, as a result, if you know C, whatever counted over here </text><text start="106" dur="2">is kind of cut off causally from what happens over here.</text><text start="108" dur="4">That causes these 2 variables to be conditionally independent.</text><text start="112" dur="6">So, conditional independence is a really big thing in Bayes networks.</text><text start="118" dur="4">Here&amp;#39;s a Bayes network where A causes B and C,</text><text start="122" dur="6">and for a Bayes network of this structure, we know that given A, </text><text start="128" dur="3">B and C are independent.</text><text start="131" dur="5">It&amp;#39;s written as B conditionally independent of C given A.</text><text start="136" dur="2">So, here&amp;#39;s a question.</text><text start="138" dur="3">Suppose we have conditional independence between B and C given A.</text><text start="141" dur="7">Would that imply--and there&amp;#39;s my question--that B and C are independent?</text><text start="148" dur="2">So, suppose we don&amp;#39;t know A.</text><text start="150" dur="3">We don&amp;#39;t know whether we have cancer, for example.</text><text start="153" dur="5">What that means is that the test results individually are still independent of each other</text><text start="158" dur="4">even if we 
don&amp;#39;t know about the cancer situation.</text><text start="162" dur="3">Please answer yes or no.</text></transcript></video><video title="9a Answer" id="lb6A1Ov-mlQ" length="41"><transcript><text start="0" dur="3">And the correct answer is No.</text><text start="3" dur="5">Intuitively, getting a positive test result about cancer</text><text start="8" dur="5">gives us information about whether you have cancer or not.</text><text start="13" dur="2">So if you get a positive test result</text><text start="15" dur="3">you&amp;#39;re going to raise the probability of having cancer</text><text start="18" dur="2">relative to the prior probability.</text><text start="20" dur="4">With that increased probability we will predict</text><text start="24" dur="3">that another test will with a higher likelihood</text><text start="27" dur="6">give us a positive response than if we hadn&amp;#39;t taken the previous test.</text><text start="33" dur="3">That&amp;#39;s really important to understand.</text><text start="36" dur="5">So that we understand it, let me have you calculate those probabilities.</text></transcript></video><video title="9b Question" id="EmLvORqH-Dg" length="35"><transcript><text start="0" dur="5">Let me draw the cancer example again with two tests.</text><text start="5" dur="2">Here&amp;#39;s my cancer variable</text><text start="7" dur="6">and then there&amp;#39;s two conditionally independent tests T1 and T2.</text><text start="13" dur="6">And as before let me assume that the prior probability of cancer is 0.01.</text><text start="19" dur="7">What I want you to compute for me is the probability of the second test</text><text start="26" dur="7">to be positive if we know that the first test was positive.</text><text start="33" dur="2">So write this into the following box.</text></transcript></video><video title="9c Answer" id="6d2lH9JP6kw" length="172"><transcript><text start="0" dur="4">So, for this one, we want to apply total probability.</text><text start="4" 
dur="6">This thing over here is the same as probability of test 2 to be positive, </text><text start="10" dur="4">which I&amp;#39;m going to abbreviate with a +2 over here, </text><text start="14" dur="5">conditioned on test 1 being positive and me having cancer</text><text start="19" dur="6">times the probability of me having cancer given test 1 was positive plus</text><text start="25" dur="6">the probability of test 2 being positive conditioned on test 1 being positive</text><text start="31" dur="5">and me not having cancer times the probability of me not having cancer</text><text start="36" dur="2">given that test 1 is positive.</text><text start="38" dur="4">That&amp;#39;s the same as the theorem of total probability,</text><text start="42" dur="4">but now everything is conditioned on +1. </text><text start="46" dur="2">Take a moment to verify this.</text><text start="48" dur="2">Now, here I can plug in the numbers.</text><text start="50" dur="7">You already calculated this one before, which is approximately 0.043,</text><text start="57" dur="8">and this one over here is 1 minus that, which is 0.957 approximately.</text><text start="65" dur="4">And this term over here now exploits conditional independence, </text><text start="69" dur="5">which is given that I know C, knowledge of the first test </text><text start="74" dur="3">gives me no more information about the second test.</text><text start="77" dur="4">It only gives me information if C was unknown, as was the case over here.</text><text start="81" dur="3">So, I can rewrite this thing over here as follows:</text><text start="84" dur="3">P of +2 given that I have cancer.</text><text start="87" dur="4">I can drop the +1, and the same is true over here.</text><text start="91" dur="3">This is exploiting my conditional independence. 
</text><text start="94" dur="7">I knew that P of +2 conditioned on C</text><text start="101" dur="6">is the same as P of +2 conditioned on C and +1.</text><text start="107" dur="3">I can now read those off my table over here, </text><text start="110" dur="8">which is 0.9 times 0.043 plus 0.2,</text><text start="118" dur="5">which is 1 minus 0.8 over here times 0.957, </text><text start="123" dur="6">which gives me approximately 0.2301.</text><text start="129" dur="5">So, that says if my first test comes in positive,</text><text start="134" dur="7">I expect my second test to be positive with probability 0.2301.</text><text start="141" dur="3">That&amp;#39;s an increased probability relative to the default probability, </text><text start="144" dur="5">which we calculated before, which is the probability of any test, </text><text start="149" dur="9">such as test 2, coming in positive, which was the normalizer of Bayes rule, 0.207.</text><text start="158" dur="5">So, my first test has about a 21% chance of coming in positive.</text><text start="163" dur="4">My second test, after seeing a positive test,</text><text start="167" dur="5">has now an increased probability of about 23% of coming in positive.</text></transcript></video><video title="9d Absolute vs Conditional Independence" id="fYp0lf1P09k" length="27"><transcript><text start="0" dur="2">So, now we&amp;#39;ve learned about independence,</text><text start="2" dur="2">and the corresponding Bayes network has 2 nodes. 
</text><text start="4" dur="3">They&amp;#39;re just not connected at all.</text><text start="7" dur="2">And we learned about conditional independence, </text><text start="9" dur="3">in which case we have a Bayes network that looks like this.</text><text start="12" dur="4">Now I would like to know whether absolute independence</text><text start="16" dur="2"> implies conditional independence.</text><text start="18" dur="2">True or false?</text><text start="20" dur="5">And I&amp;#39;d also like to know whether conditional independence implies absolute independence.</text><text start="25" dur="2">Again, true or false?</text></transcript></video><video title="9e Answer" id="Em-ahIrk550" length="45"><transcript><text start="0" dur="3">And the answer is both of them are false.</text><text start="3" dur="4">We already saw that conditional independence, as shown over here,</text><text start="7" dur="2">doesn&amp;#39;t give us absolute independence.</text><text start="9" dur="4">So, for example, this is test #1 and test #2.</text><text start="13" dur="2">You might or might not have cancer. 
</text><text start="15" dur="3">Our first test gives us information about whether you have cancer or not.</text><text start="18" dur="3">As a result, we&amp;#39;ve changed our prior probability </text><text start="21" dur="3">for the second test to come in positive.</text><text start="24" dur="6">That means that conditional independence does not imply absolute independence,</text><text start="30" dur="2">which means this assumption here fails, </text><text start="32" dur="5">and it also turns out that if you have absolute independence, </text><text start="37" dur="6">things might not be conditionally independent for reasons that I can&amp;#39;t quite explain so far,</text><text start="43" dur="2">but that we will learn about next.</text></transcript></video><video title="10 Different Type of Bayes Network" id="MaAInzCTS1E" length="119"><transcript><text start="0" dur="4">[Thrun] For my next example, I will study a different type of a Bayes network.</text><text start="4" dur="4">Before, we&amp;#39;ve seen networks of the following type,</text><text start="8" dur="5">where a single hidden cause caused 2 different measurements.</text><text start="13" dur="4">I now want to study a network that looks just like the opposite.</text><text start="17" dur="3">We have 2 independent hidden causes,</text><text start="20" dur="6">but they get confounded within a single observational variable.</text><text start="26" dur="3">I would like to use the example of happiness.</text><text start="29" dur="4">Suppose I can be happy or unhappy.</text><text start="33" dur="8">What makes me happy is when the weather is sunny or if I get a raise in my job,</text><text start="41" dur="2">which means I make more money.</text><text start="43" dur="4">So let&amp;#39;s call this sunny, let&amp;#39;s call this a raise, and call this happiness.</text><text start="47" dur="6">Perhaps the probability of it being sunny is 0.7,</text><text start="53" dur="5">probability of a raise is 0.01.</text><text 
start="58" dur="7">And I will tell you that the probability of being happy is governed as follows.</text><text start="65" dur="4">The probability of being happy given that both of these things occur--</text><text start="69" dur="4">I got a raise and it is sunny--is 1.</text><text start="73" dur="7">The probability of being happy given that it is not sunny and I still got a raise is 0.9.</text><text start="80" dur="7">The probability of being happy given that it&amp;#39;s sunny but I didn&amp;#39;t get a raise is 0.7.</text><text start="87" dur="8">And the probability of being happy given that it is neither sunny nor did I get a raise is 0.1.</text><text start="95" dur="4">This is a perfectly fine specification of a probability distribution</text><text start="99" dur="7">where 2 causes affect the variable down here, the happiness.</text><text start="106" dur="4">So I&amp;#39;d like you to calculate for me the following questions.</text><text start="110" dur="7">Probability of a raise given that it is sunny, according to this model.</text><text start="117" dur="2">Please enter your answer over here.</text></transcript></video><video title="10a Answer" id="VsesDjAIMmU" length="55"><transcript><text start="0" dur="3">[Thrun] The answer is surprisingly simple.</text><text start="3" dur="2">It is 0.01.</text><text start="5" dur="3">How do I know this so fast?</text><text start="8" dur="4">Well, if you look at this Bayes network,</text><text start="12" dur="9">both the sunniness and the question whether I got a raise impact my happiness.</text><text start="21" dur="3">But since I don&amp;#39;t know anything about the happiness,</text><text start="24" dur="8">there is no way that just the weather might implicate or impact whether I get a raise or not.</text><text start="32" dur="7">In fact, it might be independently sunny, and I might independently get a raise at work.</text><text start="39" dur="7">There is no mechanism of which these 2 things would co-occur.</text><text 
start="46" dur="3">Therefore, the probability of a raise given that it&amp;#39;s sunny </text><text start="49" dur="6">is just the same as the probability of a raise given any weather, which is 0.01.</text></transcript></video><video title="11 Explaining Away" id="pyxyYWNo8Qw" length="111"><transcript><text start="0" dur="7">[Thrun] Let me talk about a really interesting special instance of Bayes net reasoning</text><text start="7" dur="3">which is called explaining away.</text><text start="10" dur="4">And I&amp;#39;ll first give you the intuitive answer,</text><text start="14" dur="5">then I&amp;#39;ll ask you to compute probabilities for me that manifest the explaining away effect</text><text start="19" dur="3">in a Bayes network of this type.</text><text start="22" dur="5">Explaining away means that if we know that we are happy,</text><text start="27" dur="7">then sunny weather can explain away the cause of happiness.</text><text start="34" dur="7">If I then also know that it&amp;#39;s sunny, it becomes less likely that I received a raise.</text><text start="41" dur="2">Let me put this differently.</text><text start="43" dur="2">Suppose I&amp;#39;m a happy guy on a specific day</text><text start="45" dur="4">and my wife asks me, &amp;quot;Sebastian, why are you so happy?&amp;quot;</text><text start="49" dur="3">&amp;quot;Is it sunny, or did you get a raise?&amp;quot;</text><text start="52" dur="3">If she then looks outside and sees it is sunny,</text><text start="55" dur="2">then she might explain to herself,</text><text start="57" dur="3">&amp;quot;Well, Sebastian is happy because it is sunny.&amp;quot;</text><text start="60" dur="5">&amp;quot;That makes it effectively less likely that he got a raise</text><text start="65" dur="5">&amp;quot;because I could already explain his happiness by it being sunny.&amp;quot;</text><text start="70" dur="3">If she looks outside and it is rainy,</text><text start="73" dur="3">that makes it more likely I got a 
raise,</text><text start="76" dur="4">because the weather can&amp;#39;t really explain my happiness.</text><text start="80" dur="7">In other words, if we see a certain effect that could be caused by multiple causes,</text><text start="87" dur="6">seeing one of those causes can explain away any other potential cause</text><text start="93" dur="3">of this effect over here.</text><text start="96" dur="7">So let me put this in numbers and ask you the challenging question of</text><text start="103" dur="8">what&amp;#39;s the probability of a raise given that I&amp;#39;m happy and it&amp;#39;s sunny?</text></transcript></video><video title="11a Answer" id="EZpzEZPy0Wk" length="92"><transcript><text start="0" dur="7">[Thrun] The answer is approximately 0.0142,</text><text start="7" dur="4">and it is an exercise in expanding this term using Bayes&amp;#39; rule,</text><text start="11" dur="5">using total probability, which I&amp;#39;ll just do for you.</text><text start="16" dur="8">Using Bayes&amp;#39; rule, you can transform this into P of H given R comma S </text><text start="24" dur="10">times P of R given S over P of H given S.</text><text start="34" dur="3">We use the independence of R and S</text><text start="37" dur="3">to simplify P of R given S to just P of R,</text><text start="40" dur="6">and the denominator is expanded by folding in R and not R,</text><text start="46" dur="3">P of H given R comma S</text><text start="49" dur="5">times P of R plus P of H given not R and S</text><text start="54" dur="4">times P of not R, which is total probability.</text><text start="58" dur="3">We can now read off the numbers from the tables over here,</text><text start="61" dur="9">which gives us 1 times 0.01 divided by this expression</text><text start="70" dur="7">that is the same as the expression over here, so 0.01 plus this thing over here,</text><text start="77" dur="6">which you can find over here to be 0.7, times this guy over here, </text><text start="83" 
dur="4">which is 1 minus the value over here, 0.99,</text><text start="87" dur="5">which gives us approximately 0.0142.</text></transcript></video><video title="11b Question" id="1shSAdfZiJw" length="31"><transcript><text start="0" dur="4">[Thrun] Now, to understand the explain away effect,</text><text start="4" dur="7">you have to compare this to the probability of a raise given that we&amp;#39;re just happy</text><text start="11" dur="3">and we don&amp;#39;t know anything about the weather.</text><text start="14" dur="2">So let&amp;#39;s do that exercise next.</text><text start="16" dur="8">So my next quiz is, what&amp;#39;s the probability of a raise given that all I know is that I&amp;#39;m happy</text><text start="24" dur="2">and I don&amp;#39;t know about the weather?</text><text start="26" dur="5">This happens to be once again a pretty complicated question, so take your time.</text></transcript></video><video title="11c Answer" id="YE-2ycPWWpQ" length="173"><transcript><text start="0" dur="2">[Thrun] So this is a difficult question.</text><text start="2" dur="10">Let me compute an auxiliary variable, which is P of happiness.</text><text start="12" dur="7">That one is expanded by looking at the different conditions that can make us happy.</text><text start="19" dur="5">P of happiness given S and R </text><text start="24" dur="5">times P of S and R, which is of course the product of those 2</text><text start="29" dur="2">because they are independent,</text><text start="31" dur="8">plus P of happiness given not S and R times the probability of not S and R</text><text start="39" dur="4">plus P of H given S and not R </text><text start="43" dur="5">times the probability of S and not R plus the last case,</text><text start="48" dur="4">P of H given not S and not R times the probability of not S and not R.</text><text start="52" dur="4">So this just looks at the happiness under all 4 combinations of the variables</text><text start="56" dur="2">that can lead to happiness.</text><text start="58" dur="2">And you can 
plug those straight in.</text><text start="60" dur="5">This one over here is 1, and this one over here is the product of S and R, </text><text start="65" dur="5">which is 0.7 times 0.01.</text><text start="70" dur="4">And as you plug all of those in, </text><text start="74" dur="7">you get as a result 0.5245.</text><text start="81" dur="3">That&amp;#39;s P of H.</text><text start="84" dur="4">Just take some time and do the math by going through these different cases</text><text start="88" dur="4">using total probability, and you get this result.</text><text start="92" dur="6">Armed with this number, the rest now becomes easy,</text><text start="98" dur="5">which is we can use Bayes&amp;#39; rule to turn this around.</text><text start="103" dur="6">P of H given R times P of R over P of H.</text><text start="109" dur="5">P of R we know from over here, the probability of a raise is 0.01.</text><text start="114" dur="3">So the only thing we need to compute now is P of H given R.</text><text start="117" dur="2">And again, we apply total probability.</text><text start="119" dur="3">Let me just do this over here.</text><text start="122" dur="7">We can factor P of H given R as P of H given R and S, sunny,</text><text start="129" dur="5">times probability of sunny plus P of H given R and not sunny</text><text start="134" dur="2">times the probability of not sunny.</text><text start="136" dur="5">And if you plug in the numbers with this, you get 1 times 0.7</text><text start="141" dur="4">plus 0.9 times 0.3.</text><text start="145" dur="5">That happens to be 0.97.</text><text start="150" dur="3">So if we now plug this all back into this equation over here,</text><text start="153" dur="12">we get 0.97 times 0.01 over 0.5245.</text><text start="165" dur="8">This gives us approximately as the correct answer 0.0185.</text></transcript></video><video title="11d Question" id="klqEUPy8jZU" length="102"><transcript><text start="0" dur="4">[Thrun] And if you got this right, I will be 
deeply impressed</text><text start="4" dur="3">about the fact you got this right.</text><text start="7" dur="6">But the interesting thing now to observe is if we happen to know it&amp;#39;s sunny</text><text start="13" dur="8">and I&amp;#39;m happy, then the probability of a raise is 1.4%, or 0.014.</text><text start="21" dur="5">If I don&amp;#39;t know about the weather and I&amp;#39;m happy,</text><text start="26" dur="4">then the probability of a raise goes up to about 1.85%.</text><text start="30" dur="2">Why is that? </text><text start="32" dur="3">Well, it&amp;#39;s the explaining away effect.</text><text start="35" dur="5">My happiness is well explained by the fact that it&amp;#39;s sunny.</text><text start="40" dur="3">So if someone observes me to be happy and asks the question,</text><text start="43" dur="3">&amp;quot;Is this because Sebastian got a raise at work?&amp;quot;</text><text start="46" dur="7">well, if you know it&amp;#39;s sunny and this is a fairly good explanation for me being happy,</text><text start="53" dur="2">you don&amp;#39;t have to assume I got a raise.</text><text start="55" dur="6">If you don&amp;#39;t know about the weather, then obviously the chances are higher</text><text start="61" dur="2">that the raise caused my happiness,</text><text start="63" dur="7">and therefore this number goes up from 0.014 to 0.0185.</text><text start="70" dur="4">Let me ask you one final question in this next quiz,</text><text start="74" dur="9">which is the probability of the raise given that I look happy and it&amp;#39;s not sunny.</text><text start="83" dur="4">This is the most extreme case for making a raise likely</text><text start="87" dur="6">because I am a happy guy, and it&amp;#39;s definitely not caused by the weather.</text><text start="93" dur="4">So it could be just random, or it could be caused by the raise.</text><text start="97" dur="5">So please calculate this number for me and enter it into this box.</text></transcript></video><video 
title="11e Answer" id="4YzL05_see8" length="78"><transcript><text start="0" dur="4">[Thrun] Well, the answer follows the exact same scheme as before,</text><text start="4" dur="4">with S being replaced by not S.</text><text start="8" dur="3">So this should be an easier question for you to answer.</text><text start="11" dur="9">P of R given H and not S can be inverted by Bayes&amp;#39; rule to be as follows.</text><text start="20" dur="4">Once we apply Bayes&amp;#39; rule, as indicated over here where we swapped H to the left side</text><text start="24" dur="5">and R to the right side, you can observe that this value over here</text><text start="29" dur="3">can be readily found in the table.</text><text start="32" dur="3">It&amp;#39;s actually the 0.9 over there.</text><text start="35" dur="6">This value over here, the raise is independent of the weather</text><text start="41" dur="4">by virtue of our Bayes network, so it&amp;#39;s just 0.01.</text><text start="45" dur="7">And as before, we apply total probability to the expression over here,</text><text start="52" dur="6">and we obtain off this quotient over here that these 2 expressions are the same.</text><text start="58" dur="5">P of H given not S, not R is the value over here,</text><text start="63" dur="5">and the 0.99 is the complement of probability of R taken from over here,</text><text start="68" dur="8">and that ends up to be 0.0833.</text><text start="76" dur="2">This would have been the right answer.</text></transcript></video><video title="11f Conclusion" id="8SY5T6TFg6c" length="193"><transcript><text start="0" dur="4">[Thrun] It&amp;#39;s really interesting to compare this to the situation over here.</text><text start="4" dur="4">In both cases I&amp;#39;m happy, as shown over here,</text><text start="8" dur="7">and I ask the same question, which is whether I got a raise at work, as R over here.</text><text start="15" dur="6">But in one case I observe that the weather is sunny; in the other one it 
isn&amp;#39;t.</text><text start="21" dur="4">And look what it does to my probability of having received a raise.</text><text start="25" dur="5">The sunniness perfectly well explains my happiness,</text><text start="30" dur="11">and my probability of having received a raise ends up to be a mere 1.4%, or 0.014.</text><text start="41" dur="6">However, if my wife observes it to be non-sunny, then it is much more likely</text><text start="47" dur="4">that the cause of my happiness is related to a raise at work,</text><text start="51" dur="7">and now the probability is 8.3%, which is significantly higher than the 1.4% before.</text><text start="58" dur="6">This is a Bayes network of which S and R are independent</text><text start="64" dur="6">but H adds a dependence between S and R.</text><text start="70" dur="5">Let me talk about this in a little bit more detail on the next paper.</text><text start="76" dur="2">So here is our Bayes network again.</text><text start="78" dur="4">In our previous exercises, we computed for this network</text><text start="82" dur="7">that the probability of a raise of R given any of these variables shown here was as follows.</text><text start="89" dur="5">The really interesting thing is that in the absence of information about H,</text><text start="94" dur="3">which is the middle case over here,</text><text start="97" dur="4">the probability of R is unaffected by knowledge of S--</text><text start="101" dur="5">that is, R and S are independent.</text><text start="106" dur="3">This is the same as probability of R,</text><text start="109" dur="7">and R and S are independent.</text><text start="116" dur="6">However, if I know something about the variable H,</text><text start="122" dur="4">then S and R become dependent--</text><text start="126" dur="9">that is, knowing about my happiness over here renders S and R dependent.</text><text start="135" dur="8">This is not the same as probability of just R given H.</text><text start="143" 
dur="5">Obviously, it isn&amp;#39;t because if I now vary S from S to not S,</text><text start="148" dur="5">it affects my probability for the variable R.</text><text start="153" dur="3">That is a really unusual situation</text><text start="156" dur="4">where R and S are independent</text><text start="160" dur="10">but given the variable H, R and S are not independent anymore.</text><text start="170" dur="8">So knowledge of H makes 2 variables that previously were independent non-independent.</text><text start="178" dur="8">Put differently, 2 variables that are independent may not be in certain cases</text><text start="186" dur="2">conditionally independent.</text><text start="188" dur="5">Independence does not imply conditional independence.</text></transcript></video><video title="12 General Bayes Networks" id="kmSMS3CBLd8" length="173"><transcript><text start="0" dur="5">[Thrun] So we&amp;#39;re now ready to define Bayes networks in a more general way.</text><text start="5" dur="5">Bayes networks define probability distributions over graphs of random variables.</text><text start="10" dur="4">Here is an example graph of 5 variables,</text><text start="14" dur="5">and this Bayes network defines the distribution over those 5 random variables.</text><text start="19" dur="5">Instead of enumerating all possible combinations of these 5 random variables,</text><text start="24" dur="4">the Bayes network is defined by probability distributions </text><text start="28" dur="4">that are inherent to each individual node.</text><text start="32" dur="6">For node A and B, we just have a distribution P of A and P of B</text><text start="38" dur="4">because A and B have no incoming arcs.</text><text start="42" dur="5">C is a conditional distribution conditioned on A and B.</text><text start="47" dur="5">D and E are conditioned on C.</text><text start="52" dur="4">The joint probability represented by a Bayes network</text><text start="56" dur="4">is the product 
of various Bayes network probabilities</text><text start="60" dur="3">that are defined over individual nodes</text><text start="63" dur="5">where each node&amp;#39;s probability is only conditioned on the incoming arcs.</text><text start="68" dur="4">So A has no incoming arc; therefore, we just have P of A.</text><text start="72" dur="6">C has 2 incoming arcs, so we define the probability of C conditioned on A and B.</text><text start="78" dur="4">And D and E have 1 incoming arc that&amp;#39;s shown over here.</text><text start="82" dur="5">The definition of this joint distribution by using the following factors</text><text start="87" dur="3">has one really big advantage.</text><text start="90" dur="10">Whereas the joint distribution over any 5 variables requires 2 to the 5 minus 1,</text><text start="100" dur="3">which is 31 probability values,</text><text start="103" dur="5">the Bayes network over here only requires 10 such values.</text><text start="108" dur="5">P of A is one value, from which we can derive P of not A.</text><text start="113" dur="2">Same for P of B.</text><text start="115" dur="7">P of C given A B is derived by a distribution over C</text><text start="122" dur="5">conditioned on any combination of A and B, of which there are 4 if A and B are binary.</text><text start="127" dur="8">P of D given C is 2 parameters for P of D given C and P of D given not C.</text><text start="135" dur="3">And the same is true for P of E given C.</text><text start="138" dur="3">So if you add those up, you get 10 parameters in total.</text><text start="141" dur="4">So the compactness of the Bayes network</text><text start="145" dur="6">leads to a representation that scales significantly better to large networks</text><text start="151" dur="5">than the combinatorial approach which goes through all combinations of variable values.</text><text start="156" dur="3">That is a key advantage of Bayes networks, </text><text start="159" dur="4">and that is the reason why 
Bayes networks are being used so extensively</text><text start="163" dur="2">for all kinds of problems.</text><text start="165" dur="2">So here is a quiz.</text><text start="167" dur="4">How many probability values are required to specify this Bayes network?</text><text start="171" dur="2">Please put your answer in the following box.</text></transcript></video><video title="12a Answer" id="cvRNI5fULP8" length="19"><transcript><text start="0" dur="3">[Thrun] And the answer is 13.</text><text start="3" dur="3">One over here, 2 over here, and 4 over here.</text><text start="6" dur="9">Simply speaking, any variable that has K inputs requires 2 to the K such values.</text><text start="15" dur="4">So in total we have 1, 9, 13.</text></transcript></video><video title="12b Question" id="Fy5wP_9obQU" length="17"><transcript><text start="0" dur="2">[Thrun] Here&amp;#39;s another quiz.</text><text start="2" dur="4">How many parameters do we need to specify the joint distribution </text><text start="6" dur="3">for this Bayes network over here</text><text start="9" dur="4">where A, B, and C point into D, D points into E, F, and G</text><text start="13" dur="2">and C also points into G?</text><text start="15" dur="2">Please write your answer into this box.</text></transcript></video><video title="12c Answer" id="j0p9VHy-Tu0" length="16"><transcript><text start="0" dur="2">[Thrun] And the answer is 19.</text><text start="2" dur="7">So 1 here, 1 here, 1 here, 2 here, 2 here, 2 arcs point into G, which makes for 4,</text><text start="9" dur="4">and 3 arcs point into D. Two to the 3 is 8.</text><text start="13" dur="3">So we get 1, 2, 3, 8, 2, 2, 4. 
If you add those up, it&amp;#39;s 19.</text></transcript></video><video title="12d Question" id="wJsCAF5cAK8" length="28"><transcript><text start="0" dur="6">[Thrun] And here is our car network which we discussed at the very beginning of this unit.</text><text start="6" dur="5">How many parameters do we need to specify this network?</text><text start="11" dur="4">Remember, there are 16 total variables,</text><text start="15" dur="10">and the naive joint over the 16 will be 2 to the 16th minus 1, which is 65,535.</text><text start="25" dur="3">Please write your answer into this box over here.</text></transcript></video><video title="12e Answer" id="A2ugTxgEJRA" length="24"><transcript><text start="0" dur="4">[Thrun] To answer this question, let us add up these numbers.</text><text start="4" dur="4">Battery age is 1, 1, 1.</text><text start="8" dur="2">This has 1 incoming arc, so it&amp;#39;s 2.</text><text start="10" dur="3">Two incoming arcs makes 4.</text><text start="13" dur="4">One incoming arc is 2, 2 equals 4.</text><text start="17" dur="4">Four incoming arcs makes 16.</text><text start="21" dur="3">If we add all the right numbers, we get 47.</text></transcript></video><video title="12f Value of the Network" id="9PXrxfOb3p0" length="20"><transcript><text start="0" dur="5">[Thrun] So it takes 47 numerical probabilities to specify the joint</text><text start="5" dur="6">compared to 65,000 if you didn&amp;#39;t have the graph-like structure.</text><text start="11" dur="3">I think this example really illustrates the advantage </text><text start="14" dur="6">of compact Bayes network representations over unstructured joint representations.</text></transcript></video><video title="13 D-Separation" id="iuad4fQ6UPc" length="35"><transcript><text start="0" dur="4">[Thrun] The next concept I&amp;#39;d like to teach you is called D-separation.</text><text start="4" dur="5">And let me start the discussion of this concept by a quiz.</text><text start="9" dur="2">We have 
here a Bayes network,</text><text start="11" dur="5">and I&amp;#39;m going to ask you a conditional independence question.</text><text start="16" dur="4">Is C independent of A?</text><text start="20" dur="2">Please tell me yes or no.</text><text start="22" dur="5">Is C independent of A given B?</text><text start="27" dur="3">Is C independent of D?</text><text start="30" dur="2">Is C independent of D given A?</text><text start="32" dur="3">And is E independent of C given D?</text></transcript></video><video title="13a Answer" id="dL6p3YQDgGM" length="52"><transcript><text start="0" dur="4">[Thrun] So C is not independent of A.</text><text start="4" dur="5">In fact, A influences C by virtue of B.</text><text start="9" dur="4">But if you know B, then A becomes independent of C,</text><text start="13" dur="4">which means the only determinant of C is B.</text><text start="17" dur="5">If you know B for sure, then knowledge of A won&amp;#39;t really tell you anything about C.</text><text start="22" dur="5">C is also not independent of D, just the same way C is not independent of A.</text><text start="27" dur="4">If I learn something about D, I can infer more about C.</text><text start="31" dur="7">But if I do know A, then it&amp;#39;s hard to imagine how knowledge of D would help me with C</text><text start="38" dur="4">because I can&amp;#39;t learn anything more about A than knowing A already.</text><text start="42" dur="3">Therefore, given A, C and D are independent.</text><text start="45" dur="3">The same is true for E and C.</text><text start="48" dur="4">If we know D, then E and C become independent.</text></transcript></video><video title="13b D-Separation Example" id="DmbahBp7buc" length="45"><transcript><text start="0" dur="4">[Thrun] In this specific example, the rule that we could apply is very, very simple.</text><text start="4" dur="6">Any 2 variables are independent if they&amp;#39;re not linked by just unknown variables.</text><text start="10" dur="4">So 
for example, if we know B, then everything downstream of B</text><text start="14" dur="4">becomes independent of anything upstream of B.</text><text start="18" dur="4">E is now independent of C, conditioned on B.</text><text start="22" dur="4">However, knowledge of B does not render A and E independent.</text><text start="26" dur="7">In this graph over here, A and B connect to C and C connects to D and to E.</text><text start="33" dur="4">So let me ask you, is A independent of E,</text><text start="37" dur="2">A independent of E given B,</text><text start="39" dur="2">A independent of E given C,</text><text start="41" dur="2">A independent of B,</text><text start="43" dur="2">and A independent of B given C?</text></transcript></video><video title="13c Answer" id="zQ_xDaok-G0" length="86"><transcript><text start="0" dur="3">[Thrun] And the answer for this one is really interesting.</text><text start="3" dur="5">A is clearly not independent of E because through C we can see an influence of A to E.</text><text start="8" dur="3">Given B, that doesn&amp;#39;t change.</text><text start="11" dur="4">A still influences C, despite the fact we know B.</text><text start="15" dur="3">However, if we know C, the influence is cut off.</text><text start="18" dur="4">There is no way A can influence E if we know C.</text><text start="22" dur="3">A is clearly independent of B.</text><text start="25" dur="4">They are different entry variables. 
They have no incoming arcs.</text><text start="29" dur="3">But here is the caveat.</text><text start="32" dur="3">Given C, A and B become dependent.</text><text start="35" dur="3">So whereas initially A and B were independent,</text><text start="38" dur="3">if you are given C, they become dependent.</text><text start="41" dur="3">And the reason why they become dependent we&amp;#39;ve studied before.</text><text start="44" dur="4">This is the explain away effect.</text><text start="48" dur="3">If you know, for example, C to be true, </text><text start="51" dur="6">then knowledge of A will substantially affect what we believe about B.</text><text start="57" dur="5">If there are 2 joint causes for C and we happen to know A is true,</text><text start="62" dur="2">we will discredit cause B.</text><text start="64" dur="5">If we happen to know A is false, we will increase our belief for the cause B.</text><text start="69" dur="6">That was an effect we studied extensively in the happiness example I gave you before.</text><text start="75" dur="4">The interesting thing here is we are facing a situation</text><text start="79" dur="7">where knowledge of variable C renders previously independent variables dependent.</text></transcript></video><video title="13d D-Separation General Definition" id="BBQTF6zbWME" length="174"><transcript><text start="0" dur="6">[Thrun] This leads me to the general study of conditional independence in Bayes networks,</text><text start="6" dur="4">often called D-separation or reachability.</text><text start="10" dur="7">D-separation is best studied by so-called active triplets and inactive triplets</text><text start="17" dur="3">where active triplets render variables dependent</text><text start="20" dur="3">and inactive triplets render them independent.</text><text start="23" dur="7">Any chain of 3 variables like this makes the initial and final variable dependent</text><text start="30" dur="2">if all variables are unknown.</text><text start="32" 
dur="3">However, if the center variable is known--</text><text start="35" dur="3">that is, it&amp;#39;s behind the conditioning bar--</text><text start="38" dur="4">then this variable and this variable become independent.</text><text start="42" dur="5">So if we have a structure like this and it&amp;#39;s quote-unquote cut off </text><text start="47" dur="6">by a known variable in the middle, that separates, or D-separates, </text><text start="53" dur="4">the left variable from the right variable, and they become independent.</text><text start="57" dur="7">Similarly, any structure like this renders the left variable and the right variable dependent</text><text start="64" dur="4">unless the center variable is known,</text><text start="68" dur="4">in which case the left and right variable become independent.</text><text start="72" dur="4">Another active triplet now requires knowledge of a variable.</text><text start="76" dur="3">This is the explain away case.</text><text start="79" dur="6">If this variable is known for a Bayes network that converges into a single variable,</text><text start="85" dur="4">then this variable and this variable over here become dependent.</text><text start="89" dur="4">Contrast this with a case where all variables are unknown.</text><text start="93" dur="7">A situation like this means that the variables on the left and on the right are actually independent.</text><text start="100" dur="8">In a single final example, we also get dependence if we have the following situation:</text><text start="108" dur="4">a direct successor of a convergence variable is known.</text><text start="112" dur="5">So it is sufficient if a successor of this variable is known.</text><text start="117" dur="2">The variable itself does not have to be known,</text><text start="119" dur="3">and the reason is if you know this guy over here,</text><text start="122" dur="3">we get knowledge about this guy over here.</text><text start="125" dur="4">And by virtue of that, the 
case over here essentially applies.</text><text start="129" dur="2">If you look at those rules, </text><text start="131" dur="4">those rules allow you to determine for any Bayes network</text><text start="135" dur="5">whether variables are dependent or not dependent given the evidence you have.</text><text start="140" dur="5">If you color the nodes dark for which you do have evidence,</text><text start="145" dur="4">then you can use these rules to understand whether any 2 variables </text><text start="149" dur="2">are conditionally independent or not.</text><text start="151" dur="6">So let me ask you for this relatively complicated Bayes network the following questions.</text><text start="157" dur="4">Is F independent of A?</text><text start="161" dur="4">Is F independent of A given D?</text><text start="165" dur="4">Is F independent of A given G?</text><text start="169" dur="2">And is F independent of A given H?</text><text start="171" dur="3">Please mark your answers as you see fit.</text></transcript></video><video title="13e Answer" id="LKDtJM8SQmw" length="63"><transcript><text start="0" dur="4">[Thrun] And the answer is yes, F is independent of A.</text><text start="4" dur="4">What we find for our rules of D-separation is that F is dependent on D</text><text start="8" dur="3">and A is dependent on D.</text><text start="11" dur="5">But if you don&amp;#39;t know D, you can&amp;#39;t govern any dependence between A and F at all.</text><text start="16" dur="4">If you do know D, then F and A become dependent.</text><text start="20" dur="5">And the reason is B and E are dependent given D,</text><text start="25" dur="4">and we can transform this back into dependence of A and F</text><text start="29" dur="4">because B and A are dependent and E and F are dependent.</text><text start="33" dur="5">There is an active path between A and F which goes across here and here</text><text start="38" dur="2">because D is known.</text><text start="40" dur="4">If we know G, the 
same thing is true because G gives us knowledge about D,</text><text start="44" dur="3">and D can be applied back to this path over here.</text><text start="47" dur="2">However, if you know H, that&amp;#39;s not the case.</text><text start="49" dur="2">So H might tell us something about G,</text><text start="51" dur="2">but it doesn&amp;#39;t tell us anything about D,</text><text start="53" dur="6">and therefore, we have no reason to close the path between A and F.</text><text start="59" dur="4">The path between A and F is still passive, even though we have knowledge of H.</text></transcript></video><video title="14 Congratulations" id="4OPv8ACeuaU" length="50"><transcript><text start="0" dur="3">[Thrun] So congratulations. You learned a lot about Bayes networks.</text><text start="3" dur="3">You learned about the graph structure of Bayes networks,</text><text start="6" dur="4">you understood how this is a compact representation,</text><text start="10" dur="2">you learned about conditional independence,</text><text start="12" dur="3">and we talked a little bit about applications of Bayes networks</text><text start="15" dur="3">to interesting reasoning problems.</text><text start="18" dur="5">But by all means this was a mostly theoretical unit of this class,</text><text start="23" dur="4">and in future classes we will talk more about applications.</text><text start="27" dur="4">The instrument of Bayes networks is really essential to a number of problems.</text><text start="31" dur="5">It really characterizes the sparse dependence that exists in many real-world problems</text><text start="36" dur="5">like in robotics and computer vision and filtering and diagnostics and so on.</text><text start="41" dur="2">I really hope you enjoyed this class,</text><text start="43" dur="7">and I really hope you understood in depth how Bayes networks work.</text></transcript></video></group><group title="Unit 4" count="34"><video title="1 Probabilistic Inference" id="1fVWQ-iZqsw" 
length="278"><transcript><text start="0" dur="2">[Probabilistic Inference]</text><text start="2" dur="3">[Male] Welcome back. In the previous unit, we went over the basics</text><text start="5" dur="7">of probability theory and saw how</text><text start="12" dur="5">a Bayes network could concisely represent a joint probability distribution,</text><text start="17" dur="7">including the representation of independence between the variables.</text><text start="24" dur="7">In this unit, we will see how to do probabilistic inference.</text><text start="31" dur="5">That is, how to answer probability questions using Bayes nets.</text><text start="36" dur="4">Let&amp;#39;s put up a simple Bayes net.</text><text start="40" dur="5">We&amp;#39;ll use the familiar example of the earthquake</text><text start="45" dur="5">where we can have a burglary or an earthquake </text><text start="50" dur="3">setting off an alarm, and if the alarm goes off, </text><text start="53" dur="5">either John or Mary might call.</text><text start="58" dur="4">Now, what kinds of questions can we ask to do inference about?</text><text start="62" dur="3">The simplest type of question is the same question we ask</text><text start="65" dur="3">with an ordinary subroutine or function in a programming language.</text><text start="68" dur="4">Namely, given some inputs, what are the outputs?</text><text start="72" dur="6">So, in this case, we could say given the inputs of B and E, </text><text start="78" dur="4">what are the outputs, J and M?</text><text start="82" dur="4">Rather than call them input and output variables,</text><text start="86" dur="10">in probabilistic inference, we&amp;#39;ll call them evidence and query variables.</text><text start="96" dur="3">That is, the variables that we know the values of are the evidence,</text><text start="99" dur="5">and the ones that we want to find out the values of are the query variables.</text><text start="104" dur="8">Anything that is neither evidence nor 
query is known as a hidden variable.</text><text start="112" dur="3">That is, we won&amp;#39;t tell you what its value is.</text><text start="115" dur="3">We won&amp;#39;t figure out what its value is and report it,</text><text start="118" dur="3">but we&amp;#39;ll have to compute with it internally.</text><text start="121" dur="4">And now furthermore, in probabilistic inference, </text><text start="125" dur="5">the output is not a single number for each of the query variables,</text><text start="130" dur="3">but rather, it&amp;#39;s a probability distribution.</text><text start="133" dur="4">So, the answer is going to be a complete, joint probability distribution </text><text start="137" dur="2">over the query variables.</text><text start="139" dur="4">We call this the posterior distribution, given the evidence, </text><text start="143" dur="3">and we can write it like this.</text><text start="146" dur="8">It&amp;#39;s the probability distribution of one or more query variables</text><text start="154" dur="5">given the values of the evidence variables.</text><text start="159" dur="3">And there can be zero or more evidence variables, </text><text start="162" dur="5">and each of them is given an exact value.</text><text start="167" dur="6">And that&amp;#39;s the computation we want to come up with.</text><text start="173" dur="3">There&amp;#39;s another question we can ask.</text><text start="176" dur="2">Which is the most likely explanation?</text><text start="178" dur="5">That is, out of all the possible values for all the query variables,</text><text start="183" dur="5">which combination of values has the highest probability?</text><text start="188" dur="4">We write the formula like this, asking which Q values</text><text start="192" dur="4">are maximal given the evidence values.</text><text start="196" dur="6">Now, in an ordinary programming language, each function goes only one way.</text><text start="202" dur="4">It has input variables, does some 
computation,</text><text start="206" dur="5">and comes up with a result variable or result variables.</text><text start="211" dur="3">One great thing about Bayes nets is that we&amp;#39;re not restricted </text><text start="214" dur="2">to going only in one direction.</text><text start="216" dur="5">We could go in the causal direction, giving as evidence</text><text start="221" dur="6">the root nodes of the tree and asking as query values the nodes at the bottom.</text><text start="227" dur="3">Or, we could reverse that causal flow. </text><text start="230" dur="5">For example, we could have J and M be the evidence variables</text><text start="235" dur="3">and B and E be the query variables,</text><text start="238" dur="3">or we could have any other combination.</text><text start="241" dur="4">For example, we could have M be the evidence variable</text><text start="245" dur="6">and J and B be the query variables.</text><text start="251" dur="2">Here&amp;#39;s a question for you.</text><text start="253" dur="5">Imagine the situation where Mary has called to report that the alarm is going off,</text><text start="258" dur="4">and we want to know whether or not there has been a burglary.</text><text start="262" dur="5">For each of the nodes, click on the circle to tell us</text><text start="267" dur="5">if the node is an evidence node, a hidden node, </text><text start="272" dur="6">or a query node. 
</text></transcript></video><video title="1a Answer" id="VYsys0If8bw" length="11"><transcript><text start="0" dur="4">The answer is that Mary calling is the evidence node.</text><text start="4" dur="3">The burglary is the query node,</text><text start="7" dur="4">and all the others are hidden variables in this case.</text></transcript></video><video title="2 Enumeration" id="q5DHnmHtVmc" length="264"><transcript><text start="0" dur="4">Now we&amp;#39;re going to talk about how to do inference on Bayes nets.</text><text start="4" dur="4">We&amp;#39;ll start with our familiar network, and we&amp;#39;ll talk about a method</text><text start="8" dur="4">called enumeration,</text><text start="12" dur="3">which goes through all the possibilities, adds them up,</text><text start="15" dur="2">and comes up with an answer.</text><text start="17" dur="7">So, what we do is start by stating the problem.</text><text start="24" dur="3">We&amp;#39;re going to ask the question of what is the probability </text><text start="27" dur="7">that the burglary occurred given that John called and Mary called?</text><text start="34" dur="5">We&amp;#39;ll use the definition of conditional probability to answer this.</text><text start="39" dur="8">So, this query is equal to the joint probability distribution</text><text start="47" dur="8">of all 3 variables divided by the probability of the conditionalized variables. 
</text><text start="55" dur="6">Now, note I&amp;#39;m using a notation here where instead of writing out the probability</text><text start="61" dur="4">of some variable equals true, I&amp;#39;m just using the notation plus </text><text start="65" dur="3">and then the variable name in lower case, </text><text start="68" dur="5">and if I wanted the negation, I would use negation sign.</text><text start="73" dur="4">Notice there&amp;#39;s a different notation where instead of writing out </text><text start="77" dur="5">the plus and negation signs, we just use the variable name itself, P(e),</text><text start="82" dur="3">to indicate E is true.</text><text start="85" dur="4">That notation works well, but it can get confusing between </text><text start="89" dur="5">does P(e) mean E is true, or does it mean E is a variable?</text><text start="94" dur="3">And so we&amp;#39;re going to stick to the notation where we explicitly have </text><text start="97" dur="4">the pluses and negation signs.</text><text start="101" dur="4">To do inference by enumeration, we first take a conditional probability</text><text start="105" dur="4">and rewrite it as unconditional probabilities.</text><text start="109" dur="7">Now we enumerate all the atomic probabilities and calculate the sum of products.</text><text start="116" dur="4">Let&amp;#39;s look at just the complex term on the numerator first.</text><text start="120" dur="5">The procedure for figuring out the denominator would be similar, and we&amp;#39;ll skip that.</text><text start="125" dur="7">So, the probability of these 3 terms together</text><text start="132" dur="5">can be determined by enumerating all possible values of the hidden variables.</text><text start="137" dur="5">In this case, there are 2, E and A, </text><text start="142" dur="7">so we&amp;#39;ll sum over those variables for all values of E and for all values of A.</text><text start="149" dur="5">In this case, they&amp;#39;re boolean, so there&amp;#39;s only 2 
values of each.</text><text start="154" dur="7">We ask what&amp;#39;s the probability of this unconditional term?</text><text start="161" dur="3">And that we get by summing out over all possibilities, </text><text start="164" dur="5">E and A being true or false.</text><text start="169" dur="3">Now, to get the values of these atomic events, </text><text start="172" dur="3">we&amp;#39;ll have to rewrite this equation in a form that corresponds</text><text start="175" dur="5">to the conditional probability tables that we have associated with the Bayes net.</text><text start="180" dur="4">So, we&amp;#39;ll take this whole expression and rewrite it.</text><text start="184" dur="4">It&amp;#39;s still a sum over the hidden variables E and A, </text><text start="188" dur="4">but now I&amp;#39;ll rewrite this expression in terms of the parents</text><text start="192" dur="3">of each of the nodes in the network.</text><text start="195" dur="6">So, that gives us the product of these 5 terms,</text><text start="201" dur="3">which we then have to sum over all values of E and A.</text><text start="204" dur="7">If we call this product f(e,a), </text><text start="211" dur="12">then the whole answer is the sum of F for all values of E and A,</text><text start="223" dur="8">so as the sum of 4 terms where each of the terms is a product of 5 numbers.</text><text start="231" dur="3">Where do we get the numbers to fill in this equation?</text><text start="234" dur="4">From the conditional probability tables from our model, </text><text start="238" dur="5">so let&amp;#39;s put the equation back up, and we&amp;#39;ll ask you for the case</text><text start="243" dur="6">where both E and A are positive </text><text start="249" dur="5">to look up in the conditional probability tables and fill in the numbers</text><text start="254" dur="10">for each of these 5 terms, and then multiply them together and fill in the product.</text></transcript></video><video title="2a Answer" id="fxYL4PIBXiY" 
length="119"><transcript><text start="0" dur="4">We get the answer by reading numbers off the conditional probability tables,</text><text start="4" dur="7">so probability of B being positive is 0.001.</text><text start="11" dur="5">Of E being positive, because we&amp;#39;re dealing with the positive case now</text><text start="16" dur="6">for the variable E, is 0.002.</text><text start="22" dur="4">The probability of A being positive, because we&amp;#39;re dealing with that case, </text><text start="26" dur="4">given that B is positive and E is positive, </text><text start="30" dur="7">that we can read off here as 0.95.</text><text start="37" dur="7">The probability that J is positive given that A is positive is 0.9.</text><text start="44" dur="6">And finally, the probability that M is positive given that A is positive</text><text start="50" dur="4">we read off here as 0.7.</text><text start="54" dur="3">We multiply all those together, it&amp;#39;s going to be a small number</text><text start="57" dur="3">because we&amp;#39;ve got the .001 and the .002 here.</text><text start="60" dur="12">Can&amp;#39;t quite fit it in the box, but it works out to .000001197.</text><text start="72" dur="2">That seems like a really small number, but remember, </text><text start="74" dur="5">we have to normalize by the P(+j,+m) term, </text><text start="79" dur="3">and this is only 1 of the 4 possibilities.</text><text start="82" dur="4">We have to enumerate over all 4 possibilities for E and A,</text><text start="86" dur="6">and in the end, it works out that the probability of the burglary being true</text><text start="92" dur="6">given that John and Mary call, is 0.284.</text><text start="98" dur="4">And we get that number because intuitively, </text><text start="102" dur="2">it seems that the alarm is fairly reliable.</text><text start="104" dur="3">John and Mary calling are very reliable,</text><text start="107" dur="2">but the prior probability of burglary 
is low.</text><text start="109" dur="5">And those 2 terms combine together to give us the 0.284 value</text><text start="114" dur="5">when we sum up each of the 4 terms of these products.</text></transcript></video><video title="3 Speeding up Enumeration" id="DWO-XKo2iS8" length="207"><transcript><text start="0" dur="4">[Norvig] We&amp;#39;ve seen how to do enumeration to solve the inference problem</text><text start="4" dur="2">on belief networks.</text><text start="6" dur="4">For a simple network like the alarm network, that&amp;#39;s all we need to know.</text><text start="10" dur="4">There&amp;#39;s only 5 variables, so even if all 5 of them were hidden,</text><text start="14" dur="6">there would only be 32 rows in the table to sum up.</text><text start="20" dur="2">From a theoretical point of view, we&amp;#39;re done.</text><text start="22" dur="4">But from a practical point of view, other networks could give us trouble.</text><text start="26" dur="9">Consider this network, which is one for determining insurance for car owners.</text><text start="35" dur="3">There are 27 different variables.</text><text start="38" dur="6">If each of the variables were boolean, that would give us over 100 million rows to sum out.</text><text start="44" dur="2">But in fact, some of the variables are non-boolean,</text><text start="46" dur="6">they have multiple values, and it turns out that representing this entire network</text><text start="52" dur="5">and doing enumeration we&amp;#39;d have to sum over a quadrillion rows.</text><text start="57" dur="4">That&amp;#39;s just not practical, so we&amp;#39;re going to have to come up with methods</text><text start="61" dur="3">that are faster than enumerating everything.</text><text start="64" dur="5">The first technique we can use to get a speed-up in doing inference on Bayes nets</text><text start="69" dur="4">is to pull out terms from the enumeration.</text><text start="73" dur="7">For example, here the probability of b is going 
to be the same for all values of E and a.</text><text start="80" dur="6">So we can take that term and move it out of the summation,</text><text start="86" dur="2">and now we have a little bit less work to do.</text><text start="88" dur="5">We can multiply by that term once rather than having it in each row of the table.</text><text start="93" dur="7">We can also move this term, the P of e, to the left of the summation over a,</text><text start="100" dur="3">because it doesn&amp;#39;t depend on a.</text><text start="103" dur="2">By doing this, we&amp;#39;re doing less work.</text><text start="105" dur="5">The inner loop of the summation now has only 3 terms rather than 5 terms.</text><text start="110" dur="3">So we&amp;#39;ve reduced the cost of doing each row of the table.</text><text start="113" dur="4">But we still have the same number of rows in the table,</text><text start="117" dur="3">so we&amp;#39;re going to have to do better than that.</text><text start="120" dur="8">The next technique for efficient inference is to maximize independence of variables.</text><text start="128" dur="4">The structure of a Bayes net determines how efficient it is to do inference on it.</text><text start="132" dur="5">For example, a network that&amp;#39;s a linear string of variables,</text><text start="137" dur="10">X1 through Xn, can have inference done in time proportional to the number n,</text><text start="147" dur="4">whereas a network that&amp;#39;s a complete network</text><text start="151" dur="9">where every node points to every other node and so on could take time 2 to the n</text><text start="160" dur="5">if all n variables are boolean variables.</text><text start="165" dur="5">In the alarm network we saw previously, we took care</text><text start="170" dur="4">to make sure that we had all the independence relations represented </text><text start="174" dur="3">in the structure of the network.</text><text start="177" dur="3">But if we put the nodes together in a 
different order,</text><text start="180" dur="3">we would end up with a different structure.</text><text start="183" dur="6">Let&amp;#39;s start by ordering the node John calls first</text><text start="189" dur="4">and then adding in the node Mary calls.</text><text start="193" dur="6">The question is, given just these 2 nodes and looking at the node for Mary calls,</text><text start="199" dur="8">is that node dependent or independent of the node for John calls?</text></transcript></video><video title="3a Answer" id="r3mOvkvHbts" length="24"><transcript><text start="1" dur="4">[Norvig] The answer is that the node for Mary calls in this network</text><text start="5" dur="3">is dependent on John calls.</text><text start="8" dur="5">In the previous network, they were independent given that we knew that the alarm had occurred.</text><text start="13" dur="3">But here we don&amp;#39;t know that the alarm had occurred,</text><text start="16" dur="2">and so the nodes are dependent</text><text start="18" dur="6">because having information about one will affect the information about the other.</text></transcript></video><video title="3b Second Question" id="uZfGhIFH92g" length="13"><transcript><text start="0" dur="5">[Norvig] Now we&amp;#39;ll continue and we&amp;#39;ll add the node A for alarm to the network.</text><text start="5" dur="4">And what I want you to do is click on all the other variables</text><text start="9" dur="4">that A is dependent on in this network.</text></transcript></video><video title="3c Second Answer" id="X1WygrN9ens" length="33"><transcript><text start="1" dur="4">[Norvig] The answer is that alarm is dependent on both John and Mary.</text><text start="5" dur="4">And so we can draw both nodes in, both arrows in.</text><text start="9" dur="5">Intuitively that makes sense because if John calls, </text><text start="14" dur="2">then it&amp;#39;s more likely that the alarm has occurred,</text><text start="16" dur="4">likewise if Mary calls, and if both 
called, it&amp;#39;s really likely.</text><text start="20" dur="3">So you can figure out the answer by intuitive reasoning,</text><text start="23" dur="4">or you can figure it out by going to the conditional probability tables</text><text start="27" dur="4">and seeing according to the definition of conditional probability</text><text start="31" dur="2">whether the numbers work out.</text></transcript></video><video title="3d Third Question" id="rTeQXHTu2_A" length="11"><transcript><text start="1" dur="4">[Norvig] Now we&amp;#39;ll continue and we&amp;#39;ll add the node B for burglary</text><text start="5" dur="6">and ask again, click on all the variables that B is dependent on.</text></transcript></video><video title="3e Third Answer" id="_l7rPalYjmU" length="10"><transcript><text start="0" dur="4">[Norvig] The answer is that B is dependent only on A.</text><text start="4" dur="6">In other words, B is independent of J and M given A.</text></transcript></video><video title="3f Fourth Question" id="DX1YTIQsjtU" length="7"><transcript><text start="0" dur="4">[Norvig] And finally, we&amp;#39;ll add the last node, E,</text><text start="4" dur="3">and ask you to click on all the nodes that E is dependent on.</text></transcript></video><video title="3g Fourth Answer" id="T609y-a8bZc" length="26"><transcript><text start="0" dur="4">[Norvig] And the answer is that E is dependent on A.</text><text start="4" dur="2">That much is fairly obvious.</text><text start="6" dur="2">But it&amp;#39;s also dependent on B. 
</text><text start="8" dur="2">Now, why is that?</text><text start="10" dur="3">E is dependent on A because if the earthquake did occur,</text><text start="13" dur="3">then it&amp;#39;s more likely that the alarm would go off.</text><text start="16" dur="3">On the other hand, E is also dependent on B</text><text start="19" dur="4">because if a burglary occurred, then that would explain why the alarm is going off,</text><text start="23" dur="3">and it would mean that the earthquake is less likely.</text></transcript></video><video title="3h Causal Direction" id="YPmGGwlRqY0" length="18"><transcript><text start="0" dur="4">[Norvig] The moral is that Bayes nets tend to be the most compact</text><text start="4" dur="8">and thus the easiest to do inference on when they&amp;#39;re written in the causal direction--</text><text start="12" dur="6">that is, when the networks flow from causes to effects.</text></transcript></video><video title="4 Variable Elimination" id="qyXspkUOhGc" length="280"><transcript><text start="0" dur="6">Let&amp;#39;s return to this equation, which we use to show how to do inference by enumeration.</text><text start="6" dur="4">In this equation, we join up the whole joint distribution</text><text start="10" dur="5">before we sum out over the hidden variables.</text><text start="15" dur="3">That&amp;#39;s slow, because we end up repeating a lot of work.</text><text start="18" dur="7">Now we&amp;#39;re going to show a new technique called variable elimination,</text><text start="25" dur="2">which in many networks operates much faster.</text><text start="27" dur="3">It&amp;#39;s still a difficult computation, an NP-hard computation, </text><text start="30" dur="4">to do inference over Bayes nets in general.</text><text start="34" dur="4">Variable elimination works faster than inference by enumeration </text><text start="38" dur="3">in most practical cases.</text><text start="41" dur="4">It requires an algebra for manipulating factors,</text><text 
start="45" dur="3">which are just names for multidimensional arrays </text><text start="48" dur="5">that come out of these probabilistic terms.</text><text start="53" dur="4">We&amp;#39;ll use another example to show how variable elimination works.</text><text start="57" dur="3">We&amp;#39;ll start off with a network that has 3 boolean variables.</text><text start="60" dur="4">R indicates whether or not it&amp;#39;s raining.</text><text start="64" dur="8">T indicates whether or not there&amp;#39;s traffic, </text><text start="72" dur="3">and T is dependent on whether it&amp;#39;s raining.</text><text start="75" dur="4">And finally, L indicates whether or not I&amp;#39;ll be late for my next appointment,</text><text start="79" dur="3">and that depends on whether or not there&amp;#39;s traffic.</text><text start="82" dur="7">Now we&amp;#39;ll put up the conditional probability tables for each of these 3 variables.</text><text start="89" dur="6">And then we can use inference to figure out the answer to questions like</text><text start="95" dur="3">am I going to be late?</text><text start="98" dur="4">And we know by definition that we could do that through enumeration </text><text start="102" dur="5">by going through all the possible values for R and T</text><text start="107" dur="7"> and summing up the product of these 3 nodes. 
</text><text start="114" dur="5">Now, in a simple network like this, straight enumeration would work fine,</text><text start="119" dur="4">but in a more complex network, what variable elimination does is give us a way</text><text start="123" dur="6">to combine together parts of the network into smaller parts</text><text start="129" dur="4">and then enumerate over those smaller parts and then continue combining.</text><text start="133" dur="2">So, we start with a big network.</text><text start="135" dur="2">We eliminate some of the variables.</text><text start="137" dur="7">We compute by marginalizing out, and then we have a smaller network to deal with,</text><text start="144" dur="4">and we&amp;#39;ll show you how those 2 steps work.</text><text start="148" dur="7">The first operation in variable elimination is called joining factors.</text><text start="155" dur="4">A factor, again, is one of these tables.</text><text start="159" dur="4">It&amp;#39;s a multidimensional matrix, and what we do is choose 2 of the factors, </text><text start="163" dur="2">2 or more of the factors.</text><text start="165" dur="4">In this case, we&amp;#39;ll choose these 2, and we&amp;#39;ll combine them together</text><text start="169" dur="3">to form a new factor which represents</text><text start="172" dur="4"> the joint probability of all the variables in that factor.</text><text start="176" dur="4">In this case, R and T.</text><text start="180" dur="3">Now we&amp;#39;ll draw out that table.</text><text start="183" dur="3">In each case, we just look up in the corresponding table, </text><text start="186" dur="2">figure out the numbers, and multiply them together. </text><text start="188" dur="5">For example, in this row we have a +r and a +t,</text><text start="193" dur="6">so the +r is 0.1, and the entry for +r and +t  is 0.8,</text><text start="199" dur="3">so multiply them together and you get 0.08.</text><text start="202" dur="6">Go all the way down. 
For example, in the last row we have a -r and a -t.</text><text start="208" dur="6">-r is 0.9. The entry for -r and -t is also 0.9.</text><text start="214" dur="6">Multiply those together and you get 0.81.</text><text start="220" dur="2">So, what have we done?</text><text start="222" dur="3">We used the operation of joining factors on these 2 factors, </text><text start="225" dur="5">getting us a new factor which is part of the existing network.</text><text start="230" dur="6">Now we want to apply a second operation called elimination,</text><text start="236" dur="6">also called summing out or marginalization, to take this table and reduce it. </text><text start="242" dur="4">Right now, the tables we have look like this.</text><text start="246" dur="4">We could sum out or marginalize over the variable R</text><text start="250" dur="4">to give us a table that just operates on T.</text><text start="254" dur="6">So, the question is to fill in this table for P(T)--</text><text start="260" dur="3">there will be 2 entries in this table, the +t entry, formed by summing out </text><text start="263" dur="5">all the entries here for all values of r for which t is positive, </text><text start="268" dur="4">and the -t entry, formed the same way, by looking in this table</text><text start="272" dur="5">and summing up all the rows over all values of r where t is negative.</text><text start="277" dur="3">Put your answers in these boxes.</text></transcript></video><video title="4a Answer" id="4lm-TI7APX0" length="27"><transcript><text start="0" dur="5">The answer is that for +t we look up the 2 possible values for r,</text><text start="5" dur="4">and we get 0.08 and 0.09.</text><text start="9" dur="4">Sum those up, get 0.17,</text><text start="13" dur="5">and then we look at the 2 possible values of R for -t,</text><text start="18" dur="4">and we get 0.02 and 0.81.</text><text start="22" dur="5">Add those up, and we get 0.83.</text></transcript></video><video title="4b More 
Variable Elimination" id="Bk2S3ffdtsc" length="28"><transcript><text start="0" dur="4">So, we took our network with R, T, and L. We summed out over R.</text><text start="4" dur="5">That gives us a new network with T and L</text><text start="9" dur="4">with these conditional probability tables.</text><text start="13" dur="4">And now we want to do a join over T and L</text><text start="17" dur="8">and give us a new table with the joint probability of P(T, L).</text><text start="25" dur="3">And that table is going to look like this.</text></transcript></video><video title="4c Answer" id="LU9gMODL04Y" length="38"><transcript><text start="0" dur="5">The answer, again, for joining variables is determined by pointwise multiplication,</text><text start="5" dur="7">so we have 0.17 times 0.3 is 0.051,</text><text start="12" dur="9">+t and +l, 0.17 times 0.7 is 0.119.</text><text start="21" dur="2">Then we go to the minuses.</text><text start="23" dur="8">For -t, 0.83 times 0.1 is 0.083.</text><text start="31" dur="7">And finally, 0.83 times 0.9 is 0.747.</text></transcript></video><video title="4d Even More Variable Elimination" id="5lImmoAK49A" length="30"><transcript><text start="0" dur="6">Now we&amp;#39;re down to a network with a single node, T, L, </text><text start="6" dur="6">with this joint probability table, and the only operation we have left to do</text><text start="12" dur="5">is to sum out to give us a node with just L in it.</text><text start="17" dur="9">So, the question is to compute P(L) for both values of L,</text><text start="26" dur="4">+l and -l.</text></transcript></video><video title="4e Answer" id="3lqdPCE-sg8" length="20"><transcript><text start="0" dur="3">The answer is that the +l values, </text><text start="3" dur="8">0.051 plus 0.083 equals 0.134.</text><text start="11" dur="4">And the negative values, 0.119 plus 0.747 </text><text start="15" dur="5">equals 0.866.</text></transcript></video><video title="4f Summary" id="hDdAZG4w5kA" 
length="?"><transcript><text start="0" dur="3">No subtitles...</text></transcript></video><video title="4f Summary" id="-sFOKd_ZEJ8" length="21"><transcript><text start="0" dur="3">So, that&amp;#39;s how variable elimination works.</text><text start="3" dur="3">It&amp;#39;s a continued process of joining together factors </text><text start="6" dur="5">to form a larger factor and then eliminating variables by summing out.</text><text start="11" dur="4">If we make a good choice of the order in which we apply these operations,</text><text start="15" dur="3">then variable elimination can be much more efficient </text><text start="18" dur="3">than just doing the whole enumeration.</text></transcript></video><video title="5 Approximate Inference Sampling" id="W5g-4a2PIcI" length="128"><transcript><text start="0" dur="7">Now I want to talk about approximate inference</text><text start="7" dur="5">by means of sampling.</text><text start="12" dur="2">What do I mean by that?</text><text start="14" dur="3">Say we want to deal with a joint probability distribution,</text><text start="17" dur="7">say the distribution of heads and tails over these 2 coins.</text><text start="24" dur="6">We can build a table and then start counting by sampling.</text><text start="30" dur="2">Here we have our first sample. 
</text><text start="32" dur="3">We flip the coins and the one-cent piece came up heads,</text><text start="35" dur="4">and the five-cent piece came up tails,</text><text start="39" dur="3">so we would mark down one count.</text><text start="42" dur="3">Then we&amp;#39;d toss them again.</text><text start="45" dur="5">This time the five cents is heads, and the one cent is tails,</text><text start="50" dur="10">so we put down a count there, and we&amp;#39;d repeat that process</text><text start="60" dur="6">and keep repeating it until we got enough counts that we could estimate</text><text start="66" dur="5">the joint probability distribution by looking at the counts.</text><text start="71" dur="4">Now, if we do a small number of samples, the counts might not be very accurate.</text><text start="75" dur="4">There may be some random variation that causes them not to converge</text><text start="79" dur="4"> to their true values, but as we add more counts,</text><text start="83" dur="2">the counts--as we add more samples,</text><text start="85" dur="4"> the counts we get will come closer to the true distribution.</text><text start="89" dur="6">Thus, sampling has an advantage over inference in that we know a procedure</text><text start="95" dur="7">for coming up with at least an approximate value for the joint probability distribution,</text><text start="102" dur="8">as opposed to exact inference, where the computation may be very complex.</text><text start="110" dur="3">There&amp;#39;s another advantage to sampling, which is if we don&amp;#39;t know  </text><text start="113" dur="6">what the conditional probability tables are, as we did in our other models,</text><text start="119" dur="5">if we don&amp;#39;t know these numeric values, but we can simulate the process,</text><text start="124" dur="4">we can still proceed with sampling, whereas we couldn&amp;#39;t with exact inference.</text></transcript></video><video title="6 Sampling Example" id="mXgfRvRmDFI" 
length="135"><transcript><text start="0" dur="5">Here&amp;#39;s a new network that we&amp;#39;ll use to investigate</text><text start="5" dur="5">how sampling can be used to do inference.</text><text start="10" dur="4">In this network, we have 4 variables. They&amp;#39;re all boolean.</text><text start="14" dur="3">Cloudy tells us if it&amp;#39;s cloudy or not outside, </text><text start="17" dur="4">and that can have an effect on whether the sprinklers are turned on, </text><text start="21" dur="2">and whether it&amp;#39;s raining.</text><text start="23" dur="5">And those 2 variables in turn have an effect on whether the grass gets wet. </text><text start="28" dur="6">Now, to do inference over this network using sampling, </text><text start="34" dur="4">we start off with a variable where all the parents are defined.</text><text start="38" dur="4">In this case, there&amp;#39;s only one such variable, Cloudy.</text><text start="42" dur="6">And its conditional probability table tells us that the probability is 50% for Cloudy,</text><text start="48" dur="4">50% for not Cloudy, and so we sample from that.</text><text start="52" dur="7">We generate a random number, and let&amp;#39;s say it comes up with positive for Cloudy.</text><text start="59" dur="3">Now that variable is defined, we can choose another variable.</text><text start="62" dur="6">In this case, let&amp;#39;s choose Sprinkler, and we look at the rows in the table</text><text start="68" dur="5">for which Cloudy, the parent, is positive, and we see we should sample</text><text start="73" dur="6">with probability 10% for +s and 90% for -s.</text><text start="79" dur="4">And so let&amp;#39;s say we do that sampling with a random number generator,</text><text start="83" dur="3">and it comes up negative for Sprinkler.</text><text start="86" dur="3">Now let&amp;#39;s jump over here. 
Look at the Rain variable.</text><text start="89" dur="5">Again, the parent, Cloudy, is positive,</text><text start="94" dur="4">so we&amp;#39;re looking at this part of the table.</text><text start="98" dur="3">We get a 0.8 probability for Rain being positive,</text><text start="101" dur="3">and a 0.2 probability for Rain being negative.</text><text start="104" dur="7">Let&amp;#39;s say we sample that randomly, and it comes up Rain is positive.</text><text start="111" dur="3">And now we&amp;#39;re ready to sample the final variable,</text><text start="114" dur="7">and what I want  you to do is tell me which of the rows</text><text start="121" dur="6">of this table should we be considering and tell me what&amp;#39;s more likely.</text><text start="127" dur="8">Is it more likely that we have a +w or a -w?</text></transcript></video><video title="6a Sampling Example" id="K1ZyqpTJPK0" length="65"><transcript><text start="0" dur="3">The answer to the question is that we look at the parents.</text><text start="3" dur="3">We find that the Sprinkler variable is negative,</text><text start="6" dur="3">so we&amp;#39;re looking at this part of the table.</text><text start="9" dur="5">And the Rain variable is positive, so we&amp;#39;re looking at this part.</text><text start="14" dur="4">So, it would be these 2 rows that we would consider,</text><text start="18" dur="7">and thus, we&amp;#39;d find there&amp;#39;s a 0.9 probability for w, the grass being wet, </text><text start="25" dur="3">and only 0.1 for it being negative,</text><text start="28" dur="3">so the positive is more likely.</text><text start="31" dur="3">And once we&amp;#39;ve done that, then we generated a complete sample,</text><text start="34" dur="3">and we can write down the sample here.</text><text start="37" dur="6">We had +c, -s, +r.</text><text start="43" dur="8">And assuming we got a probability of 0.9 came out in favor of the +w, </text><text start="51" dur="3">that would be the end of the 
sample.</text><text start="54" dur="5">Then we could throw all this information out and start over again</text><text start="59" dur="6">by having another 50/50 choice for cloudy and then working our way through the network.</text></transcript></video><video title="6b More Sampling" id="fChe7bVEdHQ" length="111"><transcript><text start="0" dur="4">Now, the probability of sampling a particular variable, </text><text start="4" dur="6">choosing a +w or a -w, depends on the values of the parents.</text><text start="10" dur="4">But those are chosen according to the conditional probability tables,</text><text start="14" dur="4">so in the limit, the count of each sampled variable</text><text start="18" dur="2">will approach the true probability.</text><text start="20" dur="4">That is, with an infinite number of samples, this procedure computes the true</text><text start="24" dur="3">joint probability distribution.</text><text start="27" dur="6">We say that the sampling method is consistent.</text><text start="33" dur="5">We can use this kind of sampling to compute the complete joint probability distribution,</text><text start="38" dur="5">or we can use it to compute a value for an individual variable.</text><text start="43" dur="4">But what if we wanted to compute a conditional probability?</text><text start="47" dur="6">Say we wanted to compute the probability of wet grass</text><text start="53" dur="5">given that it&amp;#39;s not cloudy.</text><text start="58" dur="5">To do that, the sample that we generated here wouldn&amp;#39;t be helpful at all</text><text start="63" dur="5">because it has to do with being cloudy, not with being not cloudy.</text><text start="68" dur="3">So, we would cross this sample off the list.</text><text start="71" dur="6">We would say that we reject the sample, and this technique is called rejection sampling.</text><text start="77" dur="4">We go through ignoring any samples that don&amp;#39;t match </text><text start="81" dur="3">the 
conditional probabilities that we&amp;#39;re interested in</text><text start="84" dur="10">and keeping samples that do, say the sample -c, +s, +r, -w.</text><text start="94" dur="3">We would just continue going through generating samples, </text><text start="97" dur="4">crossing off the ones that don&amp;#39;t match, keeping the ones that do.</text><text start="101" dur="5">And this procedure would also be consistent.</text><text start="106" dur="5">We call this procedure rejection sampling.</text></transcript></video><video title="7 Rejection Sampling" id="9IdjpH4xkGM" length="119"><transcript><text start="0" dur="3">But there&amp;#39;s a problem with rejection sampling.</text><text start="3" dur="5">If the evidence is unlikely, you end up rejecting a lot of the samples.</text><text start="8" dur="8">Let&amp;#39;s go back to the alarm network where we had variables for burglary and for an alarm</text><text start="16" dur="6">and say we&amp;#39;re interested in computing the probability of a burglary,</text><text start="22" dur="3">given that the alarm goes off.</text><text start="25" dur="3">The problem is that burglaries are very infrequent,</text><text start="28" dur="4">so most of the samples we would get would end up being--</text><text start="32" dur="7">we start with generating a B, and we get a -b and then a -a.</text><text start="39" dur="4">We go back and say does this match?</text><text start="43" dur="2">No, we have to reject this sample,</text><text start="45" dur="5">so we generate another sample, and we get another -b, -a.</text><text start="50" dur="4">We reject that. We get another -b, -a.</text><text start="54" dur="6">And we keep rejecting, and eventually we get a +b,</text><text start="60" dur="4">but we&amp;#39;d end up spending a lot of time rejecting samples.</text><text start="64" dur="9">So, we&amp;#39;re going to introduce a new method called likelihood weighting</text><text start="73" dur="4">that generates samples so that we can keep every one. 
</text><text start="77" dur="3">With likelihood weighting, we fix the evidence variables.</text><text start="80" dur="5">That is, we say that A will always be positive,</text><text start="85" dur="3">and then we sample the rest of the variables,</text><text start="88" dur="3">so then we get samples that we want.</text><text start="91" dur="6">We would get a list like -b, +a, </text><text start="97" dur="3">-b, +a,</text><text start="100" dur="2">+b, +a.</text><text start="102" dur="4">We get to keep every sample, but we have a problem.</text><text start="106" dur="6">The resulting set of samples is inconsistent.</text><text start="112" dur="4">We can fix that, however, by assigning a probability </text><text start="116" dur="3">to each sample and weighing them correctly.</text></transcript></video><video title="8 Likelihood Weighting" id="GYcIruSqT_k" length="115"><transcript><text start="0" dur="5">In likelihood weighting, we&amp;#39;re going to be collecting samples just like before,</text><text start="5" dur="6">but we&amp;#39;re going to add a probabilistic weight to each sample.</text><text start="11" dur="6">Now, let&amp;#39;s say we want to compute the probability of rain</text><text start="17" dur="5">given that the sprinklers are on, and the grass is wet.</text><text start="22" dur="2">We start as before.</text><text start="24" dur="6">We make a choice for Cloudy, and let&amp;#39;s say that, again, </text><text start="30" dur="3">we choose Cloudy being positive.</text><text start="33" dur="4">Now we want to make a choice for Sprinkler, </text><text start="37" dur="4">but we&amp;#39;re constrained to always choose Sprinkler being positive, </text><text start="41" dur="3">so we&amp;#39;ll make that choice.</text><text start="44" dur="6">And we know we were dealing with Cloudy being positive,</text><text start="50" dur="6">so we&amp;#39;re in this row, and we&amp;#39;re forced to make the choice of Sprinkler being positive,</text><text start="56" dur="9">and 
that has a probability of only 0.1, so we&amp;#39;ll put that 0.1 into the weight.</text><text start="65" dur="4">Next, we&amp;#39;ll look at the Rain variable,</text><text start="69" dur="4">and here we&amp;#39;re not constrained in any way, so we make a choice</text><text start="73" dur="6">according to the probability tables with Cloudy being positive.</text><text start="79" dur="8">And let&amp;#39;s say that we choose the more popular choice, and Rain gets the positive value.</text><text start="87" dur="3">Now, we look at Wet Grass.</text><text start="90" dur="5">We&amp;#39;re constrained to choose positive, and we know that the parents</text><text start="95" dur="6">are also positive, so we&amp;#39;re dealing with this row here.</text><text start="101" dur="6">Since it&amp;#39;s a constrained choice, we&amp;#39;re going to add in or multiply in an additional weight,</text><text start="107" dur="8">and I want you to tell me what that weight should be.</text></transcript></video><video title="8a Answer" id="hvIL_fFvUGM" length="37"><transcript><text start="0" dur="4">The answer is we&amp;#39;re looking for the probability </text><text start="4" dur="5">of having a +w given a +s and a +r, </text><text start="9" dur="7">so that&amp;#39;s in this row, so it&amp;#39;s 0.99.</text><text start="16" dur="6">So, we take our old weight and multiply it by 0.99, </text><text start="22" dur="6">gives us a final weight of 0.099 </text><text start="28" dur="9">for a sample of +c, +s, +r and +w.</text></transcript></video><video title="8b Likelihood Weighting is Consistent" id="jKcp0uQ_rUo" length="20"><transcript><text start="0" dur="3">When we include the weights, </text><text start="3" dur="5">counting this sample that was forced to have a +s and a +w</text><text start="8" dur="6">with a weight of 0.099, instead of counting it as a full one sample,</text><text start="14" dur="6">we find that likelihood weighting is also consistent.</text></transcript></video><video 
title="8c Likelihood Weighting Problems" id="ngGCGaIEvBU" length="56"><transcript><text start="0" dur="3">Likelihood weighting is a great technique,</text><text start="3" dur="2">but it doesn&amp;#39;t solve all our problems. </text><text start="5" dur="9">Suppose we wanted to compute the probability of C given +s and +r.</text><text start="14" dur="7">In other words, we&amp;#39;re constraining Sprinkler and Rain to always be positive.</text><text start="21" dur="6">Since we use the evidence when we generate a node that has that evidence as parents,</text><text start="27" dur="4">the Wet Grass node will always get good values based on that evidence.</text><text start="31" dur="8">But the Cloudy node won&amp;#39;t, and so it will be generating values at random</text><text start="39" dur="5">without looking at these values, and most of the time, or some of the time,</text><text start="44" dur="4">it will be generating values that don&amp;#39;t go well with the evidence.</text><text start="48" dur="3">Now, we won&amp;#39;t have to reject them like we do in rejection sampling, </text><text start="51" dur="5">but they&amp;#39;ll have a low probability associated with them.</text></transcript></video><video title="9 Gibbs Sampling" id="QaojSzk7Hpw" length="110"><transcript><text start="0" dur="7">A technique called Gibbs sampling,</text><text start="7" dur="3">named after the physicist Josiah Gibbs, </text><text start="10" dur="4">takes all the evidence into account and not just the upstream evidence.</text><text start="14" dur="12">It uses a method called Markov Chain Monte Carlo, or MCMC.</text><text start="26" dur="5">The idea is that we resample just one variable at a time</text><text start="31" dur="2">conditioned on all the others.</text><text start="33" dur="4">That is, we have a set of variables, </text><text start="37" dur="7">and we initialize them to random values, keeping the evidence values fixed.</text><text start="44" dur="4">Maybe we have values like 
this, </text><text start="48" dur="6">and that constitutes one sample, and now, at each iteration through the loop,</text><text start="54" dur="7">we select just one non-evidence variable and resample it</text><text start="61" dur="3">based on all the other variables. </text><text start="64" dur="7">And that will give us another sample, and repeat that again.</text><text start="71" dur="4">Choose another variable.</text><text start="75" dur="6">Resample that variable and repeat.</text><text start="81" dur="6">We end up walking around in this space of assignments of variables randomly.</text><text start="87" dur="3">Now, in rejection and likelihood sampling, </text><text start="90" dur="4">each sample was independent of the other samples.</text><text start="94" dur="3">In MCMC, that&amp;#39;s not true.</text><text start="97" dur="3">The samples are dependent on each other, and in fact, </text><text start="100" dur="2">adjacent samples are very similar.</text><text start="102" dur="4">They only vary or differ in one place.</text><text start="106" dur="4">However, the technique is still consistent. 
We won&amp;#39;t show the proof for that.</text></transcript></video><video title="10 Monty Hall Problem" id="6uF6Fh0qpV0" length="79"><transcript><text start="0" dur="2">Now, just one more thing.</text><text start="2" dur="5">I can&amp;#39;t help but describe what is probably the most famous probability problem of all.</text><text start="7" dur="4">It&amp;#39;s called the Monty Hall Problem after the game show host.</text><text start="11" dur="4">And the idea is that  you&amp;#39;re on a game show, and there&amp;#39;s 3 doors:</text><text start="15" dur="5">door #1, door #2, and door #3.</text><text start="20" dur="6">And behind each door is a prize, and you know that one of the doors</text><text start="26" dur="3">contains an expensive sports car, which  you would find desirable,</text><text start="29" dur="6">and the other 2 doors contain a goat, which you would find less desirable.</text><text start="35" dur="7">Now, say you&amp;#39;re given a choice, and let&amp;#39;s say you choose door #1.</text><text start="42" dur="5">But according to the conventions of the game, the host, Monty Hall,</text><text start="47" dur="5">will now open one of the doors, knowing that the door that he opens</text><text start="52" dur="5">contains a goat, and he shows you door #3.</text><text start="57" dur="5">And he now gives you the opportunity to stick with your choice</text><text start="62" dur="3">or to switch to the other door.</text><text start="65" dur="5">What I want you to tell me is, what is your probability of winning</text><text start="70" dur="5">if you stick to door #1, and what is the probability of winning</text><text start="75" dur="4">if you switched to door #2?</text></transcript></video><video title="10a Answer" id="x7x6nHvQEQ4" length="105"><transcript><text start="0" dur="8">The answer is that you have a 1/3 chance of winning if you stick with door #1</text><text start="8" dur="4">and a 2/3 chance if  you switch to door #2.</text><text start="12" dur="4">How 
do we explain that, and why isn&amp;#39;t it 50/50?</text><text start="16" dur="2">Well, it&amp;#39;s true that there&amp;#39;s 2 possibilities, </text><text start="18" dur="4">but we&amp;#39;ve learned from probability that just because there are 2 options</text><text start="22" dur="4">doesn&amp;#39;t mean that both options are equally likely.</text><text start="26" dur="4">It&amp;#39;s easier to explain why the first door has a 1/3 probability</text><text start="30" dur="4">because when you started, the car could be in any one of 3 places. </text><text start="34" dur="3">You chose one of them. That probability was 1/3.</text><text start="37" dur="6">And that probability hasn&amp;#39;t been changed by the revealing of one of the other doors.</text><text start="43" dur="2">Why is door #2 two-thirds?</text><text start="45" dur="4">Well, one way to explain it is that the probability has to sum to 1, </text><text start="49" dur="4">and if 1/3 is here, the 2/3 has to be here.</text><text start="53" dur="5">But why doesn&amp;#39;t the same argument that you use for 1 hold for 2?</text><text start="58" dur="5">Why can&amp;#39;t we say the probability of 2 holding the car </text><text start="63" dur="4">was 1/3 before this door was revealed?</text><text start="67" dur="4">Why has that changed 2 and has not changed 1?</text><text start="71" dur="3">And the reason is because we&amp;#39;ve learned something about door #2.</text><text start="74" dur="4">We&amp;#39;ve learned that it wasn&amp;#39;t the door that was flipped over by the host,</text><text start="78" dur="4">and so that additional information has updated the probability,</text><text start="82" dur="4">whereas we haven&amp;#39;t learned anything additional about door #1</text><text start="86" dur="4">because it was never an option that the host might switch door #1.</text><text start="90" dur="7">And in fact, in this case, if we reveal the door, </text><text start="97" dur="3">we find that&amp;#39;s where the car 
actually is.</text><text start="100" dur="5">So you see, learning probability may end up winning you something.</text></transcript></video><video title="10b Monty Hall Letter" id="CIrfGiP65UI" length="44"><transcript><text start="0" dur="7">Now, as a final epilogue, I have here a copy of a letter written by Monty Hall himself</text><text start="7" dur="3">in 1990 to Professor Lawrence Denenberg of Harvard</text><text start="10" dur="4">who, with Harry Lewis, wrote a statistics book </text><text start="14" dur="4">in which they used the Monty Hall Problem as an example,</text><text start="18" dur="5">and they wrote to Monty asking him for permission to use his name.</text><text start="23" dur="3">Monty kindly granted the permission, but in his letter,</text><text start="26" dur="5">he writes, &amp;quot;As I see it, it wouldn&amp;#39;t make any difference after the player</text><text start="31" dur="3">has selected Door A, and having been shown Door C--</text><text start="34" dur="4">why should he then attempt to switch to Door B?&amp;quot;</text><text start="38" dur="6">So, we see Monty Hall himself did not understand the Monty Hall Problem.</text></transcript></video></group><group title="Homework 2" count="12"><video title="1 Bayes Rule" id="_fJTJNK9ejY" length="16"><transcript><text start="0" dur="6">[Thrun] Given the following Bayes network with P of A equal to 0.5,</text><text start="6" dur="2">P of B given A equal to 0.2,</text><text start="8" dur="4">and P of B given not A equal to 0.8,</text><text start="12" dur="4">calculate the following probability. 
</text></transcript></video><video title="2 Simple Bayes Net" id="f6mq9rTj-Po" length="42"><transcript><text start="0" dur="3">[Thrun] Consider a network of the following type:</text><text start="3" dur="7">a binary variable, A, connects to three variables, X1, X2, and X3,</text><text start="10" dur="2">that are also binary.</text><text start="12" dur="12">The probability of A is 0.5, and for all variables Xi we have the probability of Xi given A is 0.2,</text><text start="24" dur="5">and the probability of Xi given not A equals 0.6.</text><text start="29" dur="2">I would like to know from you the probability of A </text><text start="31" dur="6">given that we observed X1, X2, and not X3.</text><text start="37" dur="5">Notice that these variables over here are conditionally independent given A.</text></transcript></video><video title="3 Simple Bayes Net 2" id="P6WEObhmL_o" length="10"><transcript><text start="0" dur="3">[Thrun] Let us consider the same network again.</text><text start="3" dur="7">I would like to know the probability of X3 given that I observed X1.</text></transcript></video><video title="4 Conditional Independence" id="pP7U6KIO9yE" length="29"><transcript><text start="0" dur="4">[Thrun] In this next homework assignment I will be drawing you a Bayes network</text><text start="4" dur="5">and will ask you some conditional independence questions.</text><text start="9" dur="5">Is B conditionally independent of C? And say yes or no.</text><text start="14" dur="5">Is B conditionally independent of C given D? And say yes or no.</text><text start="19" dur="5">Is B conditionally independent of C given A? And say yes or no.</text><text start="24" dur="5">And is B conditionally independent of C given A and D? 
And say yes or no.</text></transcript></video><video title="5 Conditional Indepedence 2" id="LMKW60DmJtc" length="28"><transcript><text start="0" dur="2">[Thrun] Consider the following network.</text><text start="2" dur="6">I would like to know whether the following statements are true or false.</text><text start="8" dur="4">C is conditionally independent of E given A.</text><text start="12" dur="6">B is conditionally independent of D given C and E.</text><text start="18" dur="3">A is conditionally independent of C given E.</text><text start="21" dur="4">And A is conditionally independent of C given B.</text><text start="25" dur="3">Please check yes or no for each of these questions.</text></transcript></video><video title="6 Parameter Count" id="8npZMwT0Sac" length="17"><transcript><text start="0" dur="4">[Thrun] In my final question I&amp;#39;ll look at the exact same network as before,</text><text start="4" dur="4">but I would like to know the minimum number of numerical parameters</text><text start="8" dur="5">such as the values to define probabilities and conditional probabilities</text><text start="13" dur="4">that are necessary to specify the joint distribution of all 5 variables.</text></transcript></video><video title="1 ANSWER" id="RvxL71wd2Zg" length="36"><transcript><text start="0" dur="3">[Thrun] The answer is 0.2,</text><text start="3" dur="4">and this follows directly from Bayes&amp;#39; rule.</text><text start="7" dur="4">In this formula, we can read off the first 2 values straight from the table over here,</text><text start="11" dur="4">and we expand the denominator by total probability.</text><text start="15" dur="4">Observing that this is exactly the same expression as up here,</text><text start="19" dur="8">we get 0.1 divided by 0.1 plus this expression over here can be copied from over here,</text><text start="27" dur="3">and P of not A is directly obtained up here.</text><text start="30" dur="6">Hence we get 0.5 over here, and as a result we 
get 0.2.</text></transcript></video><video title="2 ANSWER" id="nQxYA7vBbJc" length="196"><transcript><text start="0" dur="3">[Thrun] For this question we will be exploring a little trick</text><text start="3" dur="2">about non-normalized probability.</text><text start="5" dur="6">We will observe that P of A given X1, X2 and not X3,</text><text start="11" dur="5">the expression on the left can be resolved by Bayes&amp;#39; rule into this expression over here.</text><text start="16" dur="4">We will take X3 to the left and replace it by A,</text><text start="20" dur="3">both conditioned on the variables X1 and X2.</text><text start="23" dur="6">Then we have P of A given X1, X2 divided by P of not X3 given X1, X2.</text><text start="29" dur="2">Next we employ 2 things.</text><text start="31" dur="3">One is the denominator does not depend on A,</text><text start="34" dur="5">so whether I put an A or not A has no bearing on any calculation here,</text><text start="39" dur="5">which means I can defer its calculation until later, and it will turn out to be important.</text><text start="44" dur="5">So I&amp;#39;m going to be proportional to just the stuff over here.</text><text start="49" dur="3">And second, I exploit my conditional independence</text><text start="52" dur="6">whereby I can omit X1 and X2 from the probability of not X3 conditioned on A.</text><text start="58" dur="4">These variables are conditionally independent.</text><text start="62" dur="3">This gives me the following recursion</text><text start="65" dur="5">where I now removed the third variable from the estimation problem</text><text start="70" dur="4">and just retained the first 2 relative to my initial expression.</text><text start="74" dur="5">If I keep expanding this, I get the following solution.</text><text start="79" dur="8">P of not X3 given A, P X2 given A, P X1 given A times P of A.</text><text start="87" dur="3">You might take a minute to just verify this,</text><text start="90" dur="2">but this is 
exploiting the conditional independence </text><text start="92" dur="3">very much as in the first step I showed you over here.</text><text start="95" dur="3">This step lacks the normalizer,</text><text start="98" dur="6">so let me work on the normalizer by expressing the opposite probability, </text><text start="104" dur="6">P of not A given the same events, X1, X2, and not X3,</text><text start="110" dur="4">which resolves to P of not X3 given not A,</text><text start="114" dur="6">P of X2 given not A, P of X1 given not A,</text><text start="120" dur="2">and P of not A.</text><text start="122" dur="2">I can now plug in the values from above.</text><text start="124" dur="11">So the first term gives me 0.8 times 0.2 times 0.2 times 0.5.</text><text start="135" dur="9">In the second term I get 0.4 times 0.6 times 0.6 times 0.5,</text><text start="144" dur="7">which resolves to 0.016 and 0.072.</text><text start="151" dur="5">This is clearly not a probability because we left out the normalizer.</text><text start="156" dur="4">But as we know, the normalizer does not depend on whether I put A or not A in here.</text><text start="160" dur="4">As a result, it will be the same for both of these expressions,</text><text start="164" dur="3">and I can obtain it by just adding these non-normalized probabilities</text><text start="167" dur="5">and then subsequently divide these non-normalized probabilities accordingly.</text><text start="172" dur="3">So let me just do this.</text><text start="175" dur="6">We get for the desired probability over here 0.1818</text><text start="181" dur="7">and for the inverse probability over here 0.8182.</text><text start="188" dur="6">Our desired answer therefore is 0.1818.</text><text start="194" dur="2">This was not an easy question.</text></transcript></video><video title="3 ANSWER" id="O4UT5ozSRGI" length="101"><transcript><text start="0" dur="3">[Thrun] The answer is a little bit involved.</text><text start="3" dur="5">We use total 
probability to re-express this by bringing in A.</text><text start="8" dur="7">P of X3 given X1 is the sum of P of X3 given X1 and A</text><text start="15" dur="7">times P of A given X1 plus the complementary term, P of X3 given X1 and not A</text><text start="22" dur="2">times P of not A given X1.</text><text start="24" dur="2">That is just total probability.</text><text start="26" dur="4">Next we utilize conditional independence by which we can simplify this expression</text><text start="30" dur="3">to drop X1 in the conditional variables</text><text start="33" dur="3">and we transform this expression by Bayes&amp;#39; rule again.</text><text start="36" dur="5">The same applies to the right side with not A replacing A.</text><text start="41" dur="4">All of those expressions over here can be found</text><text start="45" dur="4">either in the table up there or just by their complements,</text><text start="49" dur="3">with the exception of P of X1.</text><text start="52" dur="6">But P of X1 can again be just obtained by total probability,</text><text start="58" dur="13">which resolves to 0.2 times 0.5 plus 0.6 times 0.5, </text><text start="71" dur="2">which gives me 0.4.</text><text start="73" dur="6">We are now in a position to calculate the last term over here, which goes as follows.</text><text start="79" dur="17">This expression is 0.2 times 0.2 times 0.5 over 0.4 plus 0.6 times 0.6 times 0.5 over 0.4,</text><text start="96" dur="5">which gives us as a final result 0.5.</text></transcript></video><video title="4 ANSWER" id="fksN-k4n_OM" length="46"><transcript><text start="0" dur="2">[Thrun] And the answer is as follows.</text><text start="2" dur="4">No, no, yes, and no.</text><text start="6" dur="5">B and C in the absence of any other information are dependent through A,</text><text start="11" dur="6">which is if you learn something about B, you can infer something about A,</text><text start="17" dur="3">and then we&amp;#39;ll know more about 
C.</text><text start="20" dur="2">If you know D, that doesn&amp;#39;t change a thing.</text><text start="22" dur="2">You can just take D out of the pool.</text><text start="24" dur="5">If you know A, B and C become conditionally independent.</text><text start="29" dur="7">This dependence goes away, and ignorance of D doesn&amp;#39;t render B and C dependent.</text><text start="36" dur="3">However, if we add D back to the mix,</text><text start="39" dur="7">then knowledge of D will render B and C dependent by way of the explaining away effect.</text></transcript></video><video title="5 ANSWER" id="jvOJ-6tF5y8" length="54"><transcript><text start="0" dur="3">[Thrun] So the correct answer is tricky in this case.</text><text start="3" dur="4">It is no, no, no, and yes.</text><text start="7" dur="2">The first one is straightforward.</text><text start="9" dur="4">C and E are conditionally independent based on D,</text><text start="13" dur="2">and knowledge of A doesn&amp;#39;t change anything.</text><text start="15" dur="5">B and D are conditionally independent through A,</text><text start="20" dur="3">and knowledge of C or E doesn&amp;#39;t change that.</text><text start="23" dur="2">A and C is interesting.</text><text start="25" dur="4">A and C are independent. 
But if you know D, they become dependent.</text><text start="29" dur="3">It turns out if you know E, you can know something about D,</text><text start="32" dur="5">and as a result, A and C become dependent through the explain away effect.</text><text start="37" dur="2">That doesn&amp;#39;t apply if you know B.</text><text start="39" dur="3">Even though B tells you something about E, </text><text start="42" dur="4">it tells you nothing about D because B and D are independent.</text><text start="46" dur="3">Therefore, knowing B tells you nothing about D,</text><text start="49" dur="3">and the explain away effect does not occur between A and C.</text><text start="52" dur="2">The answer here is yes.</text></transcript></video><video title="6 ANSWER" id="PEK4_jQnW10" length="37"><transcript><text start="0" dur="3">[Thrun] The correct answer is 16.</text><text start="3" dur="3">The probability of A and C require 1 parameter each.</text><text start="6" dur="6">The complement of not A and not C follows by 1 minus that parameter.</text><text start="12" dur="3">This guy over here requires 2 parameters.</text><text start="15" dur="3">You need to know the probability of B given A and B given not A.</text><text start="18" dur="2">The complements can be obtained easily.</text><text start="20" dur="4">The probability of D is conditioned on 2 variables which can take 4 possible values.</text><text start="24" dur="2">Hence the number is 4.</text><text start="26" dur="4">And E is conditioned on 3 variables, so it can take a total of 8 different values,</text><text start="30" dur="2">2 to the 3rd, which is 8.</text><text start="32" dur="5">If you add 8 plus 4 plus 2 plus 1 plus 1, you get 16.</text></transcript></video></group><group title="Unit 5" count="55"><video title="1 Introduction" id="8o1fAcyhap4" length="71"><transcript><text start="0" dur="3">Welcome to the machine learning unit.</text><text start="3" dur="3">Machine learning is a fascinating area.</text><text start="6" 
dur="3">The world has become immeasurably data-rich.</text><text start="9" dur="3">The world wide web has come up over the last decade.</text><text start="12" dur="3">The human genome is being sequenced.</text><text start="15" dur="4">Vast chemical databases, pharmaceutical databases,</text><text start="19" dur="3">and financial databases are now available</text><text start="22" dur="4">on a scale unthinkable even 5 years ago.</text><text start="26" dur="2">To make sense out of the data,</text><text start="28" dur="2">to extract information from the data, </text><text start="30" dur="3">machine learning is the discipline to go to.</text><text start="33" dur="4">Machine learning is an important subfield of artificial intelligence,</text><text start="37" dur="3">it&amp;#39;s my personal favorite next to robotics</text><text start="40" dur="3">because I believe it has a huge impact on society</text><text start="43" dur="4">and is absolutely necessary as we move forward.</text><text start="47" dur="3">So in this class, I teach you some of the very basics of</text><text start="50" dur="2">machine learning, and in our next unit</text><text start="52" dur="4">Peter will tell you some more about machine learning.</text><text start="56" dur="4">We&amp;#39;ll talk about supervised learning, which is one side of machine learning,</text><text start="60" dur="2">and Peter will tell you about unsupervised learning,</text><text start="62" dur="3">which is a different style.</text><text start="65" dur="2">Later in this class we will also encounter reinforcement learning,</text><text start="67" dur="3">which is yet another set of machine learning.</text><text start="70" dur="1">Anyhow, let&amp;#39;s just dive in.</text></transcript></video><video title="2 What is Machine Learning" id="tEzGdI9nQt4" length="113"><transcript><text start="0" dur="3.999">Welcome to the first class on machine learning.</text><text start="3.999" dur="3.575">So far we talked a lot about Bayes 
Networks.</text><text start="7.574" dur="2.836">And the way we talked about them</text><text start="10.41" dur="3.69">is all about reasoning within Bayes Networks</text><text start="14.1" dur="0.982">that are known.</text><text start="15.082" dur="2.035">Machine learning addresses the problem</text><text start="17.117" dur="2.169">of how to find those networks</text><text start="19.286" dur="0.867">or other models</text><text start="20.153" dur="2.369">based on data.</text><text start="22.522" dur="3.275">Learning models from data</text><text start="25.797" dur="3.265">is a major, major area of artificial intelligence</text><text start="29.062" dur="2.006">and it&amp;#39;s perhaps the one</text><text start="31.068" dur="2.632">that had the most commercial success.</text><text start="33.7" dur="3.304">In many commercial applications </text><text start="37.004" dur="2.068">the models themselves are fitted</text><text start="39.072" dur="1.402">based on data.</text><text start="40.474" dur="1.702">For example, Google</text><text start="42.176" dur="2.135">uses data to understand</text><text start="44.311" dur="2.593">how to respond to each search query.</text><text start="46.904" dur="2.379">Amazon uses data</text><text start="49.283" dur="2.769">to understand how to place products on their website.</text><text start="52.052" dur="1.635">And these machine learning techniques</text><text start="53.687" dur="2.503">are the enabling techniques that make that possible.</text><text start="56.19" dur="1.334">So this class</text><text start="57.524" dur="1.635">which is about supervised learning</text><text start="59.159" dur="3.37">will go through some very basic methods</text><text start="62.529" dur="1.902">for learning models from data</text><text start="64.431" dur="2.369">in particular, specific types of Bayes Networks.</text><text start="66.8" dur="1.635">We will complement this</text><text start="68.435" dur="2.435">with a class on unsupervised learning</text><text 
start="70.87" dur="3.204">that will be taught next </text><text start="74.074" dur="1.502">after this class.</text><text start="75.576" dur="3.224">Let me start off with a quiz.</text><text start="78.8" dur="1.814">The quiz is: What companies are famous</text><text start="80.614" dur="3.503">for machine learning using data?</text><text start="84.117" dur="5.533">Google for mining the web.</text><text start="89.65" dur="1.953">Netflix for mining what people</text><text start="91.603" dur="4.426">would like to rent on DVDs.</text><text start="96.029" dur="4.605">Which is DVD recommendations.</text><text start="100.634" dur="5.205">Amazon.com for product placement.</text><text start="105.839" dur="2.102">Check any or all</text><text start="107.941" dur="1.135">and if none of those apply</text><text start="109.076" dur="4">check down here.</text></transcript></video><video title="3 Answer" id="SnbvK3_ayWI" length="47"><transcript><text start="0" dur="3">And, not surprisingly, the answer is</text><text start="3" dur="3">all of those companies and many, many, many more</text><text start="6" dur="3">use massive machine learning for making decisions</text><text start="9" dur="3">that are really essential to the businesses.</text><text start="12" dur="3">Google mines the web and uses machine learning for translation,</text><text start="15" dur="3">as we&amp;#39;ve seen in the introductory level. 
Netflix has used</text><text start="18" dur="4">machine learning extensively for understanding what type of DVD to recommend to you next.</text><text start="22" dur="3">Amazon composes its entire product pages using </text><text start="25" dur="3">machine learning by understanding how customers </text><text start="28" dur="3">respond to different compositions and placements of their products,</text><text start="31" dur="4">and many, many other examples exist.</text><text start="35" dur="2">I would argue that in Silicon Valley,</text><text start="37" dur="4">at least half the companies dealing with customers and online products</text><text start="41" dur="2">do extensively use machine learning,</text><text start="43" dur="4">so it makes machine learning a really exciting discipline.</text></transcript></video><video title="4 Stanley DARPA Grand Challenge" id="Q1xFdQfq5Fk" length="93"><transcript><text start="0" dur="5">In my own research, I&amp;#39;ve extensively used machine learning for robotics.</text><text start="5" dur="3">What you see here is a robot my students and I built at Stanford</text><text start="8" dur="4">called Stanley, and it won the DARPA Grand Challenge.</text><text start="12" dur="4">It&amp;#39;s a self-driving car that drives without any human assistance whatsoever,</text><text start="16" dur="5">and this vehicle extensively uses machine learning.</text><text start="22" dur="3">The robot is equipped with a laser system</text><text start="25" dur="3">I will talk more about lasers in my robotics class,</text><text start="28" dur="3">but here you can see how the robot is able to build</text><text start="31" dur="3">3-D models of the terrain ahead.</text><text start="34" dur="3">These are almost like video game models that allow it to make</text><text start="37" dur="2">assessments where to drive and where not to drive.</text><text start="39" dur="4">Essentially, it&amp;#39;s trying to drive on flat ground.</text><text start="43" dur="3">The 
problem with these lasers is that they don&amp;#39;t see very far.</text><text start="46" dur="4">They see about 25 meters out, so to drive really fast</text><text start="50" dur="3">the robot has to see further.</text><text start="53" dur="3">This is where machine learning comes into play.</text><text start="56" dur="2">What you see here are camera images delivered by the robot</text><text start="58" dur="3">superimposed with laser data that doesn&amp;#39;t see very far,</text><text start="61" dur="3">but the laser is good enough to extract samples</text><text start="64" dur="4">of driveable road surface that can then be machine learned</text><text start="68" dur="2">and extrapolated into the entire camera image.</text><text start="70" dur="3">That enables the robot to use the camera</text><text start="73" dur="3">to see driveable terrain all the way to the horizon</text><text start="76" dur="6">up to like 200 meters out, enough to drive really, really fast.</text><text start="82" dur="5">This ability to adapt its vision by deriving its own training examples using lasers</text><text start="87" dur="3">but seeing out 200 meters or more</text><text start="90" dur="3">was a key factor in winning the race.</text></transcript></video><video title="5 Taxonomy" id="m-hcAePIkWY" length="226"><transcript><text start="0" dur="3.483">Machine learning is a very large field</text><text start="3.483" dur="1.138">with many different methods</text><text start="4.621" dur="1.652">and many different applications.</text><text start="6.873" dur="3.576">I will now define some of the very basic terminology</text><text start="10.449" dur="1.563">that is being used to distinguish </text><text start="12.012" dur="1.301">different machine learning methods.</text><text start="13.313" dur="4.471">Let&amp;#39;s start with the what. 
</text><text start="17.784" dur="2.116">What is being learned?</text><text start="19.9" dur="3.59">You can learn parameters</text><text start="23.49" dur="2.603">like the probabilities of a Bayes Network.</text><text start="26.093" dur="1.468">You can learn structure</text><text start="27.561" dur="4.171">like the arc structure of a Bayes Network.</text><text start="31.732" dur="2.369">And you might even discover hidden concepts.</text><text start="34.401" dur="1.404">For example</text><text start="35.805" dur="1.933">you might find that certain training examples</text><text start="37.738" dur="1.272">form a hidden group.</text><text start="39.01" dur="2.064">For example, at Netflix</text><text start="41.074" dur="2.58">you might find that there are different types of customers,</text><text start="43.654" dur="1.927">some that care about classic movies,</text><text start="45.581" dur="1.566">some that care about modern movies,</text><text start="47.147" dur="2.312">and those might form hidden concepts</text><text start="49.459" dur="1.692">whose discovery can really help you</text><text start="51.151" dur="2.169">make better sense of the data.</text><text start="53.92" dur="3.571">Next is what from?</text><text start="57.891" dur="2.225">Every machine learning method</text><text start="60.116" dur="2.647">is driven by some sort of target information</text><text start="62.763" dur="1.115">that you care about.</text><text start="63.878" dur="2.288">In supervised learning</text><text start="66.166" dur="2.236">which is the subject of today&amp;#39;s class</text><text start="68.402" dur="2.169">we&amp;#39;re given specific target labels</text><text start="70.571" dur="2.469">and I&amp;#39;ll give you examples in just a second. 
</text><text start="73.04" dur="2.836">We also talk about unsupervised learning</text><text start="75.876" dur="3.37">where target labels are missing</text><text start="79.246" dur="2.102">and we use replacement principles</text><text start="81.348" dur="1.301">to find, for example</text><text start="82.649" dur="1.869">hidden concepts. </text><text start="84.518" dur="2.975">Later there will be a class on reinforcement learning</text><text start="87.493" dur="5.2">where an agent learns from feedback from the physical environment</text><text start="92.693" dur="2.068">by interacting and trying actions</text><text start="94.761" dur="2.275">and receiving some sort of evaluation </text><text start="97.036" dur="0.862">from the environment</text><text start="97.898" dur="3.37">like &amp;quot;Well done&amp;quot; or &amp;quot;That works.&amp;quot;</text><text start="101.268" dur="2.595">Again, we will talk about those in detail later.</text><text start="103.863" dur="2.31">There are different things you could try to do</text><text start="106.173" dur="1.939">with machine learning techniques.</text><text start="108.112" dur="1.864">You might care about prediction.</text><text start="109.976" dur="3.337">For example you might care about what&amp;#39;s going to happen in the future</text><text start="113.313" dur="2.336">in the stock market for example.</text><text start="115.649" dur="2.135">You might care to diagnose something</text><text start="117.784" dur="2.008">which is you get data and you wish to explain it</text><text start="119.792" dur="2.029">and you use machine learning for that.</text><text start="121.821" dur="3.137">Sometimes your objective is to summarize something.</text><text start="124.958" dur="2.403">For example if you read a long article</text><text start="127.361" dur="1.903">your machine learning method might aim to</text><text start="129.264" dur="2.801">produce a short article that summarizes the long article.</text><text 
start="132.065" dur="2.536">And there&amp;#39;s many, many, many more different things.</text><text start="134.601" dur="2.303">You can talk about the how to learn.</text><text start="136.904" dur="2.546">We use the word passive</text><text start="139.45" dur="3.55">if your learning agent is just an observer</text><text start="143" dur="1.745">and has no impact on the data itself.</text><text start="144.745" dur="2.105">Otherwise, you call it active.</text><text start="146.85" dur="3.708">Sometimes learning occurs online</text><text start="150.558" dur="2.244">which means while the data is being generated</text><text start="152.802" dur="2.82">and some of it is offline</text><text start="155.622" dur="2.036">which means learning occurs </text><text start="157.658" dur="2.168">after the data has been generated.</text><text start="159.826" dur="2.644">There&amp;#39;s different types of outputs </text><text start="162.47" dur="2.195">of a machine learning algorithm.</text><text start="164.665" dur="3.136">Today we&amp;#39;ll talk about classification</text><text start="167.801" dur="2.703">versus regression.</text><text start="170.504" dur="2.879">In classification the output is binary</text><text start="173.383" dur="2.159">or a fixed number of classes</text><text start="175.542" dur="1.635">for example something is either a chair or not.</text><text start="177.177" dur="1.902">Regression is continuous.</text><text start="179.079" dur="2.87">The temperature might be 66.5 degrees</text><text start="181.949" dur="1.701">in our prediction.</text><text start="183.65" dur="1.835">And there&amp;#39;s tons of internal details</text><text start="185.485" dur="1.401">we will talk about.</text><text start="187.886" dur="1.57">Just to name one.</text><text start="189.456" dur="3.437">We will distinguish generative</text><text start="192.893" dur="1.435">from discriminative.</text><text start="194.328" dur="2.469">Generative seeks to model the data </text><text start="196.797" 
dur="2.002">as generally as possible</text><text start="198.799" dur="1.868">whereas discriminative methods</text><text start="200.667" dur="1.268">seek to distinguish data</text><text start="201.935" dur="2.77">and this might sound like a superficial distinction</text><text start="204.705" dur="1.802">but it has enormous ramifications</text><text start="206.507" dur="1.134">on the learning algorithm.</text><text start="207.641" dur="1.548">Now to tell you the truth</text><text start="209.189" dur="1.588">it took me many years </text><text start="210.777" dur="3.204">to fully learn all these words here</text><text start="213.981" dur="2.235">and I don&amp;#39;t expect you to pick them all up</text><text start="216.216" dur="1.007">in one class</text><text start="217.223" dur="2.438">but you should at least know that they exist.</text><text start="219.661" dur="1.393">And as they come up</text><text start="221.054" dur="1.068">I&amp;#39;ll emphasize them</text><text start="222.122" dur="2.503">so you can sort any learning method</text><text start="224.625" dur="2.769">I tell you back into the specific taxonomy over here.</text></transcript></video><video title="6 Supervised Learning" id="nxX9Ihi4HZQ" length="192"><transcript><text start="0" dur="2.8">The vast amount of work in the field</text><text start="2.8" dur="3.514">falls into the area of supervised learning.</text><text start="6.314" dur="2.228">In supervised learning</text><text start="8.542" dur="2.418">you&amp;#39;re given for each training example</text><text start="10.96" dur="2.654">a feature vector </text><text start="13.614" dur="3.303">and a target label named Y.</text><text start="16.917" dur="3.27">For example, for a credit rating agency</text><text start="20.187" dur="3.003">X1, X2, X3 might be features</text><text start="23.19" dur="1.902">such as is the person employed?</text><text start="25.092" dur="2.569">What is the salary of the person?</text><text start="27.661" dur="2.97">Has the person 
previously defaulted on a credit card?</text><text start="30.631" dur="1.575">And so on.</text><text start="32.206" dur="2.028">And Y is a predictor</text><text start="34.234" dur="2.336">whether the person is to default </text><text start="36.57" dur="1.743">on the credit or not.</text><text start="38.313" dur="2.067">Now machine learning</text><text start="40.38" dur="2.563">is to be carried out on past data</text><text start="42.943" dur="1.435">where the credit rating agency </text><text start="44.378" dur="2.269">might have collected features just like these</text><text start="46.647" dur="3.322">and actual occurrences of default or not. </text><text start="49.969" dur="1.816">What it wishes to produce</text><text start="51.785" dur="1.702">is a function that allows us</text><text start="53.487" dur="1.902">to predict future customers.</text><text start="55.389" dur="1.134">So the new person comes in</text><text start="56.523" dur="2.202">with a different feature vector.</text><text start="58.725" dur="1.702">Can we predict as well as possible</text><text start="60.427" dur="1.602">the functional relationship</text><text start="62.029" dur="3.683">between these features X1 to Xn all the way to Y?</text><text start="65.712" dur="2.423">You can apply the exact same example</text><text start="68.135" dur="1.153">in image recognition</text><text start="69.288" dur="2.017">where X might be pixels of images </text><text start="71.305" dur="2.969">or it might be features of things found in images</text><text start="74.274" dur="2.036">and Y might be a label that says</text><text start="76.31" dur="1.535">whether a certain object is contained</text><text start="77.845" dur="1.267">in an image or not. 
</text><text start="79.112" dur="1.183">Now in supervised learning</text><text start="80.295" dur="2.021">you&amp;#39;re given many such examples.</text><text start="85.352" dur="3.437">X21 to X2n</text><text start="88.789" dur="3.737">leads to Y2</text><text start="92.526" dur="3.28">all the way to the index m.</text><text start="95.806" dur="2.292">This is called your data.</text><text start="98.098" dur="4.905">We call each input vector Xm</text><text start="103.003" dur="1.568">and we wish to find the function that,</text><text start="104.571" dur="5.639">given any Xm or any future vector X, </text><text start="110.21" dur="2.836">produces as closely as possible </text><text start="113.046" dur="2.536">my target signal Y.</text><text start="115.582" dur="2.102">Now this isn&amp;#39;t always possible</text><text start="117.684" dur="2.015">and sometimes it&amp;#39;s acceptable</text><text start="119.699" dur="1.088">in fact preferable</text><text start="120.787" dur="3.045">to tolerate a certain amount of error</text><text start="123.832" dur="1.231">in your training data.</text><text start="125.063" dur="2.331">But the subject of machine learning</text><text start="127.394" dur="2.87">is to identify this function over here.</text><text start="130.264" dur="1.534">And once you identify it</text><text start="131.798" dur="1.802">you can use it for future Xs</text><text start="133.6" dur="2.736">that weren&amp;#39;t part of the training set</text><text start="136.336" dur="3.103">to produce a prediction</text><text start="139.439" dur="2.169">that hopefully is really, really good.</text><text start="141.608" dur="3.3">So let me ask you a question.</text><text start="144.908" dur="2.239">And this is a question </text><text start="147.147" dur="1.763">for which I haven&amp;#39;t given you the answer</text><text start="148.91" dur="2.475">but I&amp;#39;d like to appeal to your intuition.</text><text start="151.385" dur="2.736">Here&amp;#39;s one data set</text><text 
start="154.121" dur="3.87">where the X is one-dimensional, plotted horizontally </text><text start="157.991" dur="1.936">and the Y is vertically</text><text start="159.927" dur="4.471">and suppose the data looks like this.</text><text start="164.398" dur="1.578">Suppose my machine learning algorithm</text><text start="165.976" dur="1.592">gives me 2 hypotheses.</text><text start="167.568" dur="3.47">One is this function over here</text><text start="171.038" dur="0.967">which is a linear function</text><text start="172.005" dur="1.68">and one is this function over here.</text><text start="173.685" dur="3.826">I&amp;#39;d like to know which of the functions</text><text start="177.511" dur="1.802">you find preferable</text><text start="179.313" dur="2.002">as an explanation for the data.</text><text start="181.315" dur="1.368">Is it function A?</text><text start="182.683" dur="4.004">Or function B?</text><text start="186.687" dur="1.434">Check here for A</text><text start="188.121" dur="0.935">here for B</text><text start="189.056" dur="3.544">and here for neither.</text></transcript></video><video title="7 Occam's Razor" id="FHJx9RVVKFg" length="163"><transcript><text start="0" dur="4.571">And I hope you guessed function A.</text><text start="4.571" dur="3.771">Even though both perfectly describe the data</text><text start="8.342" dur="2.569">B is much more complex than A.</text><text start="10.911" dur="2.046">In fact, outside the data</text><text start="12.957" dur="3.358">B seems to go to minus infinity much faster</text><text start="16.315" dur="1.203">than these data points</text><text start="17.518" dur="2.135">and to plus infinity much faster</text><text start="19.653" dur="1.709">than these data points over here.</text><text start="21.362" dur="0.929">And in between</text><text start="22.291" dur="1.672">we have wide oscillations </text><text start="23.963" dur="1.996">that don&amp;#39;t correspond to any data.</text><text start="25.959" dur="1.402">So I 
would argue</text><text start="27.361" dur="1.468">A is preferable. </text><text start="31.3" dur="1.633">The reason why I asked this question</text><text start="32.933" dur="2.135">is because of something called Occam&amp;#39;s Razor.</text><text start="35.068" dur="3.871">Occam can be spelled in many different ways. </text><text start="38.939" dur="2.836">And what Occam says is that </text><text start="41.775" dur="1.935">everything else being equal</text><text start="43.71" dur="2.939">choose the less complex hypothesis.</text><text start="46.649" dur="2.2">Now in practice </text><text start="48.849" dur="2.003">there&amp;#39;s actually a trade-off</text><text start="50.852" dur="2.301">between a really good data fit</text><text start="53.153" dur="2.503">and low complexity.</text><text start="55.656" dur="2.51">Let me illustrate this to you</text><text start="58.166" dur="1.463">by a hypothetical example.</text><text start="59.629" dur="2.466">Consider the following graph</text><text start="62.095" dur="2.403">where the horizontal axis graphs </text><text start="64.498" dur="3.236">complexity of the solution.</text><text start="67.734" dur="2.369">For example, if you use polynomials</text><text start="70.103" dur="2.169">this might be a high-degree polynomial over here </text><text start="72.272" dur="2.171">and maybe a linear function over here</text><text start="74.443" dur="2.067">which is a low-degree polynomial.</text><text start="76.51" dur="3.17">Your training data error</text><text start="79.68" dur="3.069">tends to go like this.</text><text start="82.749" dur="2.97">The more complex the hypothesis you allow</text><text start="85.719" dur="3.422">the more you can just fit your data.</text><text start="89.141" dur="2.818">However, in reality</text><text start="91.959" dur="2.035">your generalization error on unknown data</text><text start="93.994" dur="3.37">tends to go like this.</text><text start="97.364" dur="2.803">It is the sum of the training data 
error</text><text start="100.167" dur="2.669">and another function</text><text start="102.836" dur="3.671">which is called the overfitting error.</text><text start="106.507" dur="1.001">Not surprisingly</text><text start="107.508" dur="2.335">the best complexity is obtained</text><text start="109.843" dur="2.403">where the generalization error is minimum.</text><text start="112.246" dur="1.401">There are methods</text><text start="113.647" dur="1.902">to calculate the overfitting error.</text><text start="115.549" dur="2.269">They go into a statistical field</text><text start="117.818" dur="3.321">under the name bias-variance methods.</text><text start="121.139" dur="1.016">However, in practice</text><text start="122.155" dur="2.336">you&amp;#39;re often just given the training data error.</text><text start="124.491" dur="4.071">You find that if you don&amp;#39;t find the model</text><text start="128.562" dur="2.602">that minimizes the training data error</text><text start="131.164" dur="3.14">but instead pushes back the complexity</text><text start="134.304" dur="3.334">your algorithm tends to perform better</text><text start="137.638" dur="3.336">and that is something we will study a little bit</text><text start="140.974" dur="1.835">in this class.</text><text start="142.809" dur="3.27">However, this slide is really important</text><text start="146.079" dur="3.037">for anybody doing machine learning in practice.</text><text start="149.116" dur="2.035">If you deal with data</text><text start="151.151" dur="2.036">and you have ways to fit your data</text><text start="153.187" dur="3.103">be aware that overfitting</text><text start="156.29" dur="2.929">is a major source of poor performance</text><text start="159.219" dur="1.927">of a machine learning algorithm.</text><text start="161.146" dur="2.785">And I&amp;#39;ll give you examples in just one second.</text></transcript></video><video title="8 SPAM Detection" id="wMMGexgmES4" length="255"><transcript><text start="0" 
dur="2.001">So a really important example</text><text start="2.001" dur="2.37">of machine learning is SPAM detection.</text><text start="4.371" dur="2.136">We all get way too much email</text><text start="6.507" dur="1.935">and a good number of those are SPAM.</text><text start="8.442" dur="3.858">Here are 3 examples of email.</text><text start="12.3" dur="1.981">Dear Sir: First I must solicit your confidence</text><text start="14.281" dur="2.437">in this transaction, this is by virtue of its nature</text><text start="16.718" dur="2.438">being utterly confidential and top secret...</text><text start="19.156" dur="3.135">This is likely SPAM.</text><text start="22.291" dur="1.633">Here&amp;#39;s another one.</text><text start="23.924" dur="1.201">In upper caps.</text><text start="25.125" dur="3.604">99 MILLION EMAIL ADDRESSES FOR ONLY $99</text><text start="28.729" dur="2.302">This is very likely SPAM.</text><text start="31.031" dur="2.669">And here&amp;#39;s another one.</text><text start="33.7" dur="1.602">Oh, I know it&amp;#39;s blatantly OT</text><text start="35.302" dur="2.035">but I&amp;#39;m beginning to go insane.</text><text start="37.337" dur="2.77">Had an old Dell Dimension XPS sitting in the corner</text><text start="40.107" dur="1.301">and decided to put it to use.</text><text start="41.408" dur="1.572">And so on and so on.</text><text start="42.98" dur="2.508">Now this is likely not SPAM.</text><text start="45.488" dur="1.726">How can a computer program</text><text start="47.214" dur="2.602">distinguish between SPAM and not SPAM?</text><text start="49.816" dur="2.169">Let&amp;#39;s use this as an example</text><text start="51.985" dur="3.637">to talk about machine learning for discrimination</text><text start="55.622" dur="3.437">using Bayes Networks.</text><text start="59.059" dur="2.035">In SPAM detection</text><text start="61.094" dur="2.403">we get an email</text><text start="63.497" dur="1.702">and we wish to categorize it</text><text 
start="65.199" dur="1.935">either as SPAM</text><text start="67.134" dur="2.969">in which case we don&amp;#39;t even show it to the user</text><text start="70.103" dur="2.575">or what we call HAM</text><text start="72.678" dur="2.731">which is the technical word for</text><text start="75.409" dur="4.304">an email worth passing on to the person being emailed.</text><text start="79.713" dur="1.668">So the function over here </text><text start="81.381" dur="1.802">is the function we&amp;#39;re trying to learn.</text><text start="83.183" dur="3.137">Most SPAM filters use human input.</text><text start="86.32" dur="2.239">When you go through email</text><text start="88.559" dur="3.566">you have a button called IS SPAM</text><text start="92.125" dur="2.473">which allows you as a user to flag SPAM</text><text start="94.598" dur="3.2">and occasionally you will say an email is SPAM.</text><text start="97.798" dur="2.436">If you look at this</text><text start="100.234" dur="2.979">you have a typical supervised machine learning situation</text><text start="103.213" dur="1.825">where the input is an email</text><text start="105.038" dur="2.369">and the output is whether you flag it as SPAM</text><text start="107.407" dur="1.869">or if we don&amp;#39;t flag it</text><text start="109.276" dur="2.97">we just think it&amp;#39;s HAM.</text><text start="112.246" dur="2.002">Now to make this amenable to</text><text start="114.248" dur="0.9">a machine learning algorithm</text><text start="115.148" dur="2.043">we have to talk about how to represent emails.</text><text start="117.191" dur="3.196">They&amp;#39;re all using different words and different characters</text><text start="120.387" dur="2.336">and they might have different graphics included.</text><text start="122.723" dur="3.537">Let&amp;#39;s pick a representation that&amp;#39;s easy to process.</text><text start="126.26" dur="3.069">And this representation is often called</text><text start="129.329" dur="1.569">Bag of 
Words.</text><text start="130.898" dur="3.803">Bag of Words is a representation</text><text start="134.701" dur="1.168">of a document</text><text start="135.869" dur="1.802">that just counts the frequency</text><text start="137.671" dur="1.168">of words.</text><text start="138.839" dur="3.47">If an email were to say Hello</text><text start="142.309" dur="2.073">I will say Hello.</text><text start="144.382" dur="2.165">The Bag of Words representation</text><text start="146.547" dur="1.368">is the following.</text><text start="147.915" dur="3.77">2-1-1-1</text><text start="151.685" dur="2.054">for the dictionary</text><text start="153.739" dur="2.551">that contains the 4 words</text><text start="156.29" dur="2.669">Hello I will say.</text><text start="158.959" dur="2.436">Now look at the subtlety here.</text><text start="161.395" dur="2.335">Rather than representing each individual word</text><text start="163.73" dur="2.336">we have a count of each word</text><text start="166.066" dur="3.036">and the count is oblivious</text><text start="169.102" dur="3.12">to the order in which the words were stated.</text><text start="172.222" dur="2.953">A Bag of Words representation</text><text start="175.175" dur="2.236">relative to a fixed dictionary</text><text start="177.411" dur="3.603">represents the counts of each word</text><text start="181.014" dur="2.803">relative to the words in the dictionary.</text><text start="183.817" dur="3.003">If you were to use a different dictionary</text><text start="186.82" dur="1.668">like hello and good-bye</text><text start="188.488" dur="1.802">our counts would be </text><text start="190.29" dur="2.903">2 and 0.</text><text start="193.193" dur="1.602">However, in most cases</text><text start="194.795" dur="2.369">you make sure that all the words found</text><text start="197.164" dur="0.925">in messages</text><text start="198.089" dur="1.811">are actually included in the dictionary.</text><text start="199.9" dur="2.536">So the dictionary 
might be very, very large.</text><text start="202.436" dur="3.409">Let me make up an unofficial example</text><text start="205.845" dur="4.298">of a few SPAM and a few HAM messages.</text><text start="210.143" dur="2.636">Offer is secret.</text><text start="212.779" dur="2.837">Click secret link.</text><text start="215.616" dur="2.21">Secret sports link.</text><text start="217.826" dur="2.728">Obviously those are contrived</text><text start="220.554" dur="2.369">and I tried to restrict the vocabulary</text><text start="222.923" dur="1.15">to a small number of words</text><text start="224.073" dur="1.953">to make this example workable.</text><text start="226.026" dur="1.902">In practice we need thousands</text><text start="227.928" dur="0.968">of such messages</text><text start="228.896" dur="1.601">to get good information.</text><text start="230.497" dur="1.902">Play sports today.</text><text start="232.399" dur="1.969">Went play sports.</text><text start="234.368" dur="2.435">Secret sports event.</text><text start="236.803" dur="3.106">Sports is today.</text><text start="239.909" dur="2.934">Sports costs money.</text><text start="242.843" dur="3.403">My first quiz is</text><text start="246.246" dur="2.336">What is the size of the vocabulary</text><text start="248.582" dur="3.749">that contains all words in these messages?</text><text start="252.331" dur="3">Please enter the value in this box over here.</text></transcript></video><video title="9 Answer" id="fPkxtmxRt5k" length="28"><transcript><text start="0" dur="2">Well let&amp;#39;s count. </text><text start="2" dur="6">Offer is secret click. 
</text><text start="8" dur="2">Secret occurs over here already</text><text start="10" dur="2">so we don&amp;#39;t have to count it twice.</text><text start="12" dur="6">Link, sports, play, today, went, event</text><text start="18" dur="2">costs money.</text><text start="20" dur="2">So the answer is </text><text start="22" dur="2">12.</text><text start="24" dur="2">There are 12 different words </text><text start="26" dur="2">contained in these 8 messages. </text></transcript></video><video title="10 Question" id="Diqx3Z20YWc" length="16"><transcript><text start="0" dur="3">[Narrator] Another quiz. </text><text start="3" dur="3">What is the probability that a random message </text><text start="6" dur="3">that arrives will fall into the spam bucket? </text><text start="9" dur="2">Assuming that those messages </text><text start="11" dur="2">are all drawn at random. </text><text start="13" dur="3">[writing on page]</text></transcript></video><video title="11 Answer" id="WFE-dmEJZF8" length="16"><transcript><text start="0" dur="2">[Narrator] And the answer is:</text><text start="2" dur="2">there are 8 different messages </text><text start="4" dur="2">of which 3 are spam.</text><text start="6" dur="3">So the maximum likelihood estimate</text><text start="9" dur="2">is 3/8.</text><text start="11" dur="5">[writing on paper]</text></transcript></video><video title="12 Maximum Likelihood" id="QBlERVSlFx4" length="271"><transcript><text start="0" dur="3">So, let&amp;#39;s look at this a little bit more formally and talk about maximum likelihood.</text><text start="3" dur="9">Obviously, we&amp;#39;re observing 8 messages: spam, spam, spam, and 5 times ham.</text><text start="12" dur="5">And what we care about is what&amp;#39;s our prior probability of spam</text><text start="17" dur="3">that maximizes the likelihood of this data?</text><text start="20" dur="4">So, let&amp;#39;s assume we&amp;#39;re going to assign a value of pi to this, </text><text start="24" 
dur="5">and we wish to find the pi that maximizes the likelihood of this data over here,</text><text start="29" dur="4">assuming that each email is drawn independently </text><text start="33" dur="4">according to an identical distribution.</text><text start="37" dur="11">The probability p(yi) of the i-th data item is then pi if yi = spam, </text><text start="48" dur="5">and 1 - pi if yi = ham.</text><text start="53" dur="6">If we rewrite the data as 1, 1, 1, 0, 0, 0, 0, 0,</text><text start="59" dur="14">we can write p(yi) as follows: pi to the yi times (1 - pi) to the 1 - yi.</text><text start="73" dur="3">It&amp;#39;s not that easy to see that this is equivalent, </text><text start="76" dur="3">but say yi = 1.</text><text start="79" dur="3">Then this term will fall out. </text><text start="82" dur="6">It&amp;#39;s a multiplication by 1 because the exponent is zero, and we get pi as over here.</text><text start="88" dur="8">If yi = 0, then this term falls out, and this one here becomes 1 - pi as over here.</text><text start="96" dur="8">Now assuming independence, we get for the entire data set </text><text start="104" dur="5">that the joint probability of all data items is the product </text><text start="109" dur="3">of the individual data items over here, </text><text start="112" dur="4">which can now be written as follows:</text><text start="116" dur="7">pi to the count of instances where yi = 1 times </text><text start="123" dur="6">(1 - pi) to the count of the instances where yi = 0.</text><text start="129" dur="4">And we know in our example, this count over here is 3,</text><text start="133" dur="9">and this count over here is 5, so we get pi to the 3rd times (1 - pi) to the 5th.</text><text start="142" dur="6">We now wish to find the pi that maximizes this expression over here.</text><text start="148" dur="5">We can also maximize the logarithm of this expression,</text><text start="153" dur="9">which is 3 times log pi + 5 times log (1 - pi).</text><text start="162" 
dur="8">Maximizing the log is the same as maximizing p because the log is monotonic in p.</text><text start="170" dur="4">The maximum of this function is attained where the derivative is 0,</text><text start="174" dur="6">so let&amp;#39;s compute the derivative and set it to 0.</text><text start="180" dur="5">This is the derivative, 3 over pi - 5 over (1 - pi).</text><text start="185" dur="4">We now bring this expression to the right side,</text><text start="189" dur="9">multiply the denominators up, and sort all the expressions containing pi to the left,</text><text start="198" dur="8">which gives us pi = 3/8, exactly the number we arrived at before.</text><text start="206" dur="7">We just derived mathematically that the data likelihood maximizing number</text><text start="213" dur="4">for the probability is indeed the empirical count, </text><text start="217" dur="4">which means when we looked at this quiz before</text><text start="221" dur="8">and we said the maximum likelihood estimate for the prior probability of spam is 3/8,</text><text start="229" dur="5">by simply counting that 3 out of 8 emails were spam, </text><text start="234" dur="3">we actually followed proper mathematical principles </text><text start="237" dur="2">to do maximum likelihood estimation.</text><text start="239" dur="4">Now, you might not fully have gotten the derivation of this, </text><text start="243" dur="4">and I recommend you watch it again, but it&amp;#39;s not that important</text><text start="247" dur="2">for the progress in this class. 
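For anyone who wants to check the derivation on paper, here is a compact restatement in LaTeX. The symbol \pi for the prior probability of spam and the counts 3 and 5 come from the lecture; the name L for the likelihood is only a label chosen for this sketch.

```latex
\[ L(\pi) = \prod_{i=1}^{8} \pi^{y_i} (1 - \pi)^{1 - y_i} = \pi^{3} (1 - \pi)^{5} \]

\[ \log L(\pi) = 3 \log \pi + 5 \log (1 - \pi) \]

\[ \frac{d}{d\pi} \log L(\pi) = \frac{3}{\pi} - \frac{5}{1 - \pi} = 0
   \;\Longrightarrow\; 3 (1 - \pi) = 5 \pi
   \;\Longrightarrow\; \pi = \frac{3}{8} \]
```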
</text><text start="249" dur="2">So, here&amp;#39;s another quiz.</text><text start="251" dur="6">I&amp;#39;d like the maximum likelihood, or ML solutions, </text><text start="257" dur="2">for the following probabilities.</text><text start="259" dur="2">The probability that the word &amp;quot;secret&amp;quot; comes up,</text><text start="261" dur="4">assuming that we already know a message is spam,</text><text start="265" dur="3">and the probability that the same word &amp;quot;secret&amp;quot; comes up </text><text start="268" dur="3">if we happen to know the message is not spam, it&amp;#39;s ham.</text></transcript></video><video title="13 Answer" id="4q4Tk-4Long" length="25"><transcript><text start="0" dur="2.436">And just as before</text><text start="2.436" dur="2.102">we count the word secret</text><text start="4.538" dur="1.601">in SPAM and in HAM</text><text start="6.139" dur="1.669">as I&amp;#39;ve underlined here.</text><text start="7.808" dur="3.603">Three out of 9 words in SPAM</text><text start="11.411" dur="1.569"> are the word secret</text><text start="12.98" dur="1.801">so we have a third over here</text><text start="14.781" dur="3.307">or 0.333</text><text start="18.088" dur="2.966">and only 1 out of all the 15 words in HAM</text><text start="21.054" dur="0.968">are secret</text><text start="22.022" dur="1.878">so you get a fifteenth</text><text start="23.9" dur="2.79">or 0.0667.</text></transcript></video><video title="14 Relationship to Bayes Networks" id="MvwZNmJQIJw" length="79"><transcript><text start="0" dur="6">By now, you might have recognized what we&amp;#39;re really building up is a Bayes network</text><text start="6" dur="4">where the parameters of the Bayes networks are estimated using supervised learning</text><text start="10" dur="5">by a maximum likelihood estimator based on training data.</text><text start="15" dur="5">The Bayes network has at its root an unobservable variable called spam, </text><text start="20" dur="8">which is 
binary, and it has as many children as there are words in a message,</text><text start="28" dur="5">where each word has an identical conditional distribution</text><text start="33" dur="6">of the word occurrence given the class spam or not spam.</text><text start="39" dur="3">If you look at our dictionary over here, </text><text start="42" dur="6">you might remember the dictionary had 12 different words, </text><text start="48" dur="4">so here are 5 of the 12: offer, is, secret, click and sports.</text><text start="52" dur="7">Then for the spam class, we found the probability of secret given spam is 1/3,</text><text start="59" dur="6">and we also found that the probability of secret given ham is 1/15, </text><text start="65" dur="2">so here&amp;#39;s a quiz.</text><text start="67" dur="5">Assuming a vocabulary size of 12, or put differently,</text><text start="72" dur="4">the dictionary has 12 words, how many parameters </text><text start="76" dur="3">do we need to specify this Bayes network?</text></transcript></video><video title="15 Answer" id="-Pms2FiJQIA" length="29"><transcript><text start="0" dur="3">And the correct answer is 23.</text><text start="3" dur="4">We need 1 parameter for the prior p (spam),</text><text start="7" dur="5">and then we have 2 distributions over the dictionary: the probability of any word</text><text start="12" dur="4">i given spam, and the same for ham.</text><text start="16" dur="2">Now, there&amp;#39;s 12 words in a dictionary, </text><text start="18" dur="2">but this distribution only needs 11 parameters, </text><text start="20" dur="4">because the 12th can be figured out since they have to add up to 1.</text><text start="24" dur="3">And the same is true over here, so if you add all these together,</text><text start="27" dur="2">we get 23.</text></transcript></video><video title="16 Question" id="2BFCqec6n04" length="26"><transcript><text start="0" dur="2">So, here&amp;#39;s a quiz.</text><text start="2" dur="4">Let&amp;#39;s assume we fit all the 23 parameters 
of the Bayes network</text><text start="6" dur="3">as explained using maximum likelihood.</text><text start="9" dur="5">Let&amp;#39;s now do classification and see what class a message ends up with.</text><text start="14" dur="4">Let me start with a very simple message, and it contains the single word &amp;quot;sports&amp;quot;</text><text start="18" dur="3">just to make it a little bit simpler.</text><text start="21" dur="5">What&amp;#39;s the probability that we classify this one word message as spam?</text></transcript></video><video title="17 Answer" id="lQe4iNP6HDA" length="62"><transcript><text start="0" dur="7">And the answer is 0.1667 or 3/18.</text><text start="7" dur="6">How do I get there? Well, let&amp;#39;s apply Bayes rule.</text><text start="13" dur="6">This form is easily transformed into this expression over here,</text><text start="19" dur="6">the probability of the message given spam times the prior probability of spam</text><text start="25" dur="4">over the normalizer over here.</text><text start="29" dur="5">Now, we know that the word &amp;quot;sports&amp;quot; occurs once in our 9 words of spam,</text><text start="34" dur="4">and our prior probability for spam is 3/8,</text><text start="38" dur="2">which gives us this expression over here.</text><text start="40" dur="5">We now have to add the same probabilities for the class ham.</text><text start="45" dur="6">&amp;quot;Sports&amp;quot; occurs 5 times out of 15 in the ham class,</text><text start="51" dur="4">and the prior probability for ham is 5/8,</text><text start="55" dur="7">which gives us 3/72 divided by 18/72, which is 3/18 or 1/6.</text></transcript></video><video title="18 Question" id="qVdxj8XOB00" length="21"><transcript><text start="0" dur="3">This gets us to a more complicated quiz.</text><text start="3" dur="3">Say the message now contains 3 words.</text><text start="6" dur="4">&amp;quot;Secret is secret,&amp;quot; not a particularly meaningful email,</text><text start="10" dur="6">but the frequent 
occurrence of &amp;quot;secret&amp;quot; seems to suggest it might be spam.</text><text start="16" dur="5">What&amp;#39;s the probability you&amp;#39;re going to judge this to be spam?</text></transcript></video><video title="19 Answer" id="eSbURIQ6pSQ" length="63"><transcript><text start="0" dur="10">And the answer is surprisingly high. It&amp;#39;s 25/26, or 0.9615.</text><text start="10" dur="6">To see this, we apply Bayes rule, which multiplies the prior for spam-ness</text><text start="16" dur="3">with the conditional probability of each word given spam.</text><text start="19" dur="7">&amp;quot;Secret&amp;quot; carries 1/3, &amp;quot;is&amp;quot; 1/9, and &amp;quot;secret&amp;quot; 1/3 again.</text><text start="26" dur="6">We normalize this by the same expression plus the probability for </text><text start="32" dur="4">the non-spam case.</text><text start="36" dur="2">5/8 is the prior.</text><text start="38" dur="4">&amp;quot;Secret&amp;quot; is 1/15.</text><text start="42" dur="3">&amp;quot;Is&amp;quot; is 1/15, </text><text start="45" dur="3">and &amp;quot;secret&amp;quot; again.</text><text start="48" dur="9">This resolves to 1/216 over this expression plus 1/5400,</text><text start="57" dur="6">and when you work it all out, it is 25/26.</text></transcript></video><video title="20 Question" id="dfVAnFFxFP4" length="21"><transcript><text start="0" dur="8">The final quiz, let&amp;#39;s assume our message is &amp;quot;Today is secret.&amp;quot;</text><text start="8" dur="4">And again, it might look like spam because the word &amp;quot;secret&amp;quot; occurs.</text><text start="12" dur="9">I&amp;#39;d like you to compute for me the probability of spam given this message.</text></transcript></video><video title="21 Answer and Laplace Smoothing" id="0BpC-cLDCIE" length="199"><transcript><text start="0" dur="7">And surprisingly, the probability for this message to be spam is 0.</text><text start="7" dur="4">It&amp;#39;s not 0.001. 
It&amp;#39;s flat 0.</text><text start="11" dur="3">In other words, it&amp;#39;s impossible, according to our model,</text><text start="14" dur="3">that this text could be a spam message.</text><text start="17" dur="2">Why is this?</text><text start="19" dur="5">When we apply the same rule as before, we get the prior for spam which is 3/8.</text><text start="24" dur="4">And we multiply the conditional for each word into this.</text><text start="28" dur="3">For &amp;quot;secret,&amp;quot; we know it to be 1/3.</text><text start="31" dur="8">For &amp;quot;is,&amp;quot; to be 1/9, but for &amp;quot;today,&amp;quot; it&amp;#39;s 0.</text><text start="39" dur="6">It&amp;#39;s 0 because the maximum likelihood estimate for the probability of &amp;quot;today&amp;quot; in spam is 0.</text><text start="45" dur="4">&amp;quot;Today&amp;quot; just never occurred in a spam message so far.</text><text start="49" dur="6">Now, this 0 is troublesome because as we compute the outcome--</text><text start="55" dur="5">and I&amp;#39;m plugging in all the numbers as before--</text><text start="60" dur="3">none of the words matter anymore, just the 0 matters.</text><text start="63" dur="7">So, we get 0 over something which is plain 0.</text><text start="70" dur="3">Are we overfitting? 
You bet.</text><text start="73" dur="2">We are clearly overfitting.</text><text start="75" dur="6">It can&amp;#39;t be that a single word determines the entire outcome of our analysis.</text><text start="81" dur="5">The reason is that for our model to assign a probability of 0 for the word &amp;quot;today&amp;quot; </text><text start="86" dur="3">to be in the class of spam is just too aggressive.</text><text start="89" dur="5">Let&amp;#39;s change this.</text><text start="94" dur="5">One technique to deal with the overfitting problem is called Laplace smoothing.</text><text start="99" dur="6">In maximum likelihood estimation, we assign to our probability</text><text start="105" dur="6">the quotient of the count of this specific event over all events in our data set.</text><text start="111" dur="6">For example, for the prior probability, we found that 3 out of 8 messages are spam.</text><text start="117" dur="3">Therefore, our maximum likelihood estimate </text><text start="120" dur="5">for the prior probability of spam was 3/8.</text><text start="125" dur="5">In Laplace Smoothing, we use a different estimate.</text><text start="130" dur="5">We add the value k to the count</text><text start="135" dur="5">and normalize as if we added k to every single class </text><text start="140" dur="3">that we&amp;#39;ve tried to estimate something over.</text><text start="143" dur="5">This is equivalent to assuming we have a couple of fake training examples</text><text start="148" dur="4">where we add k to each observation count.</text><text start="152" dur="4">Now, if k equals 0, we get our maximum likelihood estimator.</text><text start="156" dur="5">But if k is larger than 0 and n is finite, we get different answers.</text><text start="161" dur="6">Let&amp;#39;s say k equals 1,</text><text start="167" dur="4">and let&amp;#39;s assume we get one message, </text><text start="171" dur="5">and that message was spam, so we&amp;#39;re going to write it as one message, one spam.</text><text 
start="176" dur="7">What is p (spam) for Laplace smoothing with k = 1?</text><text start="183" dur="6">Let&amp;#39;s do the same with 10 messages, and we get 6 spam.</text><text start="189" dur="7">And 100 messages, of which 60 are spam.</text><text start="196" dur="3">Please enter your numbers into the boxes over here.</text></transcript></video><video title="22 Answer" id="2sKSZHkQPrc" length="74"><transcript><text start="0" dur="10">The answer here is 2/3 or 0.667 and is computed as follows.</text><text start="10" dur="6">We have 1 message with 1 as spam, but we&amp;#39;re going to add k = 1.</text><text start="16" dur="6">We&amp;#39;re going to add 2 over here because there&amp;#39;s 2 different classes.</text><text start="22" dur="6">K = 1 times 2 = 2, which gives us 2/3.</text><text start="28" dur="4">The answer over here is 7/12.</text><text start="32" dur="9">Again, we have 6/10 but we add 2 down here and 1 over here, so you get 7/12.</text><text start="41" dur="8">And correspondingly, we get 61/102, which is 60 + 1 over 100 + 2.</text><text start="49" dur="7">If we look at the numbers over here, we get 0.5833 </text><text start="56" dur="3">and 0.5980.</text><text start="59" dur="4">Interestingly, the maximum likelihood on the last 2 cases over here</text><text start="63" dur="6">would give us 0.6, but we only get a value that&amp;#39;s closer to 0.5,</text><text start="69" dur="5">which is the effect of our smoothing prior for the Laplacian smoothing.</text></transcript></video><video title="23 Question" id="2Ar6jFKZhUM" length="25"><transcript><text start="0" dur="5">Let&amp;#39;s use the Laplacian smoother with K=1</text><text start="5" dur="4">to calculate a few interesting probabilities-- </text><text start="9" dur="3">P of SPAM, P of HAM, </text><text start="12" dur="3">and then the probability of the word &amp;quot;today&amp;quot;, </text><text start="15" dur="4">given that it&amp;#39;s in the SPAM class or the HAM class.</text><text start="19" 
dur="3">And you might assume that our vocabulary size</text><text start="22" dur="3">is 12 different words here.</text></transcript></video><video title="24 Answer" id="DjvGl1qRVdE" length="77"><transcript><text start="0" dur="3">This one is easy to calculate for SPAM and HAM.</text><text start="3" dur="2">For SPAM, it&amp;#39;s 2/5,</text><text start="5" dur="3">and the reason is, we had previously</text><text start="8" dur="4">3 out of 8 messages assigned to SPAM.</text><text start="12" dur="3">But thanks to the Laplacian smoother, we add 1 over here.</text><text start="15" dur="4">And there are 2 classes, so we add 2 times 1 over here,</text><text start="19" dur="3">which gives us 4/10, which is 2/5.</text><text start="22" dur="4">Similarly, we get 3/5 over here.</text><text start="26" dur="3">Now the tricky part comes up over here. </text><text start="29" dur="4">Before, we had 0 occurrences of the word &amp;quot;today&amp;quot; in the SPAM class,</text><text start="33" dur="2">and we had 9 data points.</text><text start="35" dur="3">But now we are going to add 1 for the Laplacian smoother,</text><text start="38" dur="2">and down here, we are going to add 12.</text><text start="40" dur="2">And the reason that we add 12 is because</text><text start="42" dur="2">there&amp;#39;s 12 different words in our dictionary.</text><text start="44" dur="3">Hence, for each word in the dictionary, we are going to add 1.</text><text start="47" dur="3">So we have a total of 12, which gives us the 12 over here.</text><text start="50" dur="3">That makes 1/21.</text><text start="53" dur="3">In the HAM class, we had 2 occurrences</text><text start="56" dur="3">of the word &amp;quot;today&amp;quot;--over here and over here. 
</text><text start="59" dur="5">We add 1, normalize by 15,</text><text start="64" dur="3">plus 12 for the dictionary size,</text><text start="67" dur="7">which is 3/27 or 1/9.</text><text start="74" dur="3">This was not an easy question.</text></transcript></video><video title="25 Question" id="RJAFdBfGOrY" length="21"><transcript><text start="0" dur="3">We come now to the final quiz here,</text><text start="3" dur="2">which is--I would like you to compute the probability</text><text start="5" dur="3">that the message &amp;quot;today is secret&amp;quot;</text><text start="8" dur="2">falls into the SPAM box with </text><text start="10" dur="3">a Laplacian smoother using K=1.</text><text start="13" dur="3">Please just enter your number over here.</text><text start="16" dur="2">This is a non-trivial question.</text><text start="18" dur="3">It might take you a while to calculate this.</text></transcript></video><video title="26 Answer" id="oh4uc-8O6Pc" length="58"><transcript><text start="0" dur="6">And the approximate probability is 0.4858.</text><text start="6" dur="2">How did we get this?</text><text start="8" dur="4">Well, the prior probability for SPAM</text><text start="12" dur="3">under the Laplacian smoothing is 2/5.</text><text start="15" dur="7">&amp;quot;Today&amp;quot; doesn&amp;#39;t occur, but we have already calculated this to be 1/21.</text><text start="22" dur="4">&amp;quot;Is&amp;quot; occurs once, so we get 2 over here over 21.</text><text start="26" dur="6">&amp;quot;Secret&amp;quot; occurs 3 times, so we get a 4 over here over 21,</text><text start="32" dur="5">and we normalize this by the same expression over here.</text><text start="37" dur="5">Plus the prior for HAM, which is 3/5,</text><text start="42" dur="5">we have 2 occurrences of &amp;quot;today&amp;quot;, plus 1, equals 3/27.</text><text start="47" dur="3">&amp;quot;Is&amp;quot; occurs once--2/27.</text><text start="50" dur="4">And &amp;quot;secret&amp;quot; occurs once--again 2/27.</text><text 
start="54" dur="4">When you work this all out, you get this number over here.</text></transcript></video><video title="27 Summary Naive Bayes" id="c2yFCp6BrEA" length="107"><transcript><text start="0" dur="2">So we learned quite a bit.</text><text start="2" dur="2">We learned about Naive Bayes</text><text start="4" dur="2">as our first supervised learning method.</text><text start="6" dur="2">The setup was that we had </text><text start="8" dur="6">features of documents or training examples and labels.</text><text start="14" dur="3">In this case, SPAM or not SPAM.</text><text start="17" dur="2">And from those pieces,</text><text start="19" dur="4">we made a generative model for the SPAM class</text><text start="23" dur="2">and the non-SPAM class</text><text start="25" dur="3">that described the conditional probability</text><text start="28" dur="2">of each individual feature.</text><text start="30" dur="3">We then used first maximum likelihood</text><text start="33" dur="3">and then a Laplacian smoother</text><text start="36" dur="2">to fit those parameters over here.</text><text start="38" dur="3">And then using Bayes rule,</text><text start="41" dur="3">we could take any training examples over here</text><text start="44" dur="4">and figure out what the class probability was over here.</text><text start="48" dur="3">This is called a generative model</text><text start="51" dur="4">in that the conditional probabilities all aim to maximize</text><text start="55" dur="5">the probability of individual features as if those</text><text start="60" dur="2">describe the physical world.</text><text start="62" dur="4">We also used what is called a bag of words model,</text><text start="66" dur="3">in which our representation of each email</text><text start="69" dur="3">was such that we just counted the occurrences of words,</text><text start="72" dur="3">irrespective of their order. 
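To make the summary concrete, here is a minimal Python sketch of the bag of words Naive Bayes classifier, with maximum likelihood (k = 0) and Laplace smoothing (k = 1). The word counts, class totals, and dictionary size are the ones quoted in the quiz answers; only the four words used in the quizzes are listed, and the names smoothed and p_spam are hypothetical, chosen just for this illustration.

```python
from fractions import Fraction

# Counts quoted in the lecture (only the words that appear in the quizzes).
SPAM_COUNTS = {"secret": 3, "is": 1, "sports": 1, "today": 0}
HAM_COUNTS = {"secret": 1, "is": 1, "sports": 5, "today": 2}
SPAM_WORDS, HAM_WORDS = 9, 15   # total number of words seen in each class
SPAM_MSGS, HAM_MSGS = 3, 5      # number of training messages in each class
VOCAB = 12                      # dictionary size


def smoothed(count, total, k, n_outcomes):
    """Laplace estimate (count + k) / (total + k * n_outcomes); k = 0 is plain ML."""
    return Fraction(count + k, total + k * n_outcomes)


def p_spam(message, k=0):
    """Posterior P(spam | message) under the bag of words Naive Bayes model."""
    n_msgs = SPAM_MSGS + HAM_MSGS
    spam = smoothed(SPAM_MSGS, n_msgs, k, 2)   # prior, 2 classes
    ham = smoothed(HAM_MSGS, n_msgs, k, 2)
    for word in message.lower().split():
        spam *= smoothed(SPAM_COUNTS[word], SPAM_WORDS, k, VOCAB)
        ham *= smoothed(HAM_COUNTS[word], HAM_WORDS, k, VOCAB)
    return spam / (spam + ham)   # Bayes rule with the normalizer


print(p_spam("sports"))                       # 1/6
print(p_spam("secret is secret"))             # 25/26
print(p_spam("today is secret"))              # 0
print(float(p_spam("today is secret", k=1)))  # roughly 0.4858
```

Exact fractions are used so the printed values can be compared directly with the quiz answers; a practical filter would more likely sum log probabilities to avoid underflow on long messages.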
</text><text start="75" dur="4">Now this is a very powerful method for fighting SPAM.</text><text start="79" dur="2">Unfortunately, it is not powerful enough.</text><text start="81" dur="3">It turns out spammers know about Naive Bayes,</text><text start="84" dur="3">and they&amp;#39;ve long learned to come up with messages</text><text start="87" dur="4">that are fooling your SPAM filter if it uses Naive Bayes.</text><text start="91" dur="2">So companies like Google and others</text><text start="93" dur="2">have become much more involved</text><text start="95" dur="3">in methods for SPAM filtering.</text><text start="98" dur="4">Now I can give you some more examples how to filter SPAM,</text><text start="102" dur="5">but all of those quite easily fit with the same Naive Bayes model.</text></transcript></video><video title="28 Advanced SPAM Filtering" id="GSHJspQH15c" length="87"><transcript><text start="0" dur="3">[Narrator] So here are features that you might consider when you write</text><text start="3" dur="2">an advanced spam filter.</text><text start="5" dur="2">For example, </text><text start="7" dur="2">does the email come from </text><text start="9" dur="3">a known spamming IP or computer?</text><text start="12" dur="4">Have you emailed this person before?</text><text start="16" dur="3">In which case it is less likely to be spam.</text><text start="19" dur="3">Here&amp;#39;s a powerful one:</text><text start="22" dur="3">have 1000 other people</text><text start="25" dur="4">recently received the same message?</text><text start="29" dur="3">Is the email header consistent? </text><text start="32" dur="3">So, for example, if the from field says your bank,</text><text start="35" dur="3">is the IP address really your bank? </text><text start="38" dur="4">Surprisingly, is the email all caps?</text><text start="42" dur="2">Strangely, many spammers believe if you write </text><text start="44" dur="4">things in all caps you&amp;#39;ll pay more attention to it. 
</text><text start="48" dur="3">Do the inline URLs point to the pages </text><text start="51" dur="3">they say they&amp;#39;re pointing to?</text><text start="54" dur="2">Are you addressed by your correct name?</text><text start="56" dur="2">Now these are some features,</text><text start="58" dur="2">I&amp;#39;m sure you can think of more.</text><text start="60" dur="2">You can toss them easily into the </text><text start="62" dur="3">Naive Bayes model and get better classification.</text><text start="65" dur="3">In fact, modern spam filters keep learning</text><text start="68" dur="2">as people flag emails as spam, and </text><text start="70" dur="3">of course spammers keep learning as well</text><text start="73" dur="3">and trying to fool modern spam filters.</text><text start="76" dur="2">Who&amp;#39;s going to win?</text><text start="78" dur="3">Well, so far the spam filters are clearly winning.</text><text start="81" dur="2">Most of my spam I never see, but who knows</text><text start="83" dur="2">what&amp;#39;s going to happen in the future?</text><text start="85" dur="2">It&amp;#39;s a really fascinating machine learning problem. </text></transcript></video><video title="29 Digit Recognition" id="kD2wD_MDVk4" length="141"><transcript><text start="0" dur="2">[Narrator] Naive Bayes can also be applied to </text><text start="2" dur="3">the problem of handwritten digit recognition.</text><text start="5" dur="4">This is a sample of hand-written digits taken</text><text start="9" dur="3">from a U.S. 
postal data set </text><text start="12" dur="5">where handwritten zip codes on letters are </text><text start="17" dur="4">being scanned and automatically classified.</text><text start="21" dur="2">The machine-learning problem here is</text><text start="23" dur="5">taking a symbol just like this,</text><text start="28" dur="2">what is the corresponding number?</text><text start="30" dur="2">Here it&amp;#39;s obviously 0.</text><text start="32" dur="2">Here it&amp;#39;s obviously 1. </text><text start="34" dur="2">Here it&amp;#39;s obviously 2, 1.</text><text start="36" dur="2">For the one down here, </text><text start="38" dur="3">it&amp;#39;s a little bit harder to tell.</text><text start="41" dur="3">Now when you apply Naive Bayes,</text><text start="44" dur="2">the input vector  </text><text start="46" dur="2">could be the pixel values</text><text start="48" dur="2">of each individual pixel, so if we have</text><text start="50" dur="4">a 16 x 16 input resolution,</text><text start="54" dur="5">you would get 256 different values </text><text start="59" dur="3">corresponding to the brightness of each pixel.</text><text start="62" dur="3">Now obviously given sufficiently many </text><text start="65" dur="2">training examples, you might hope </text><text start="67" dur="2">to recognize digits,</text><text start="69" dur="3">but one of the deficiencies of this approach is</text><text start="72" dur="3">it is not particularly shift-invariant. </text><text start="75" dur="3">So for example a pattern like this</text><text start="79" dur="2">will look fundamentally different </text><text start="81" dur="3">from a pattern like this.</text><text start="84" dur="3">Even though the pattern on the right is obtained</text><text start="87" dur="2">by shifting the pattern on the left </text><text start="89" dur="2">by 1 to the right. 
</text><text start="91" dur="3">There&amp;#39;s many different solutions, but a common one could be </text><text start="94" dur="2">to use smoothing in a different way from </text><text start="96" dur="2">the way we discussed it before.</text><text start="98" dur="2">Instead of just counting 1 pixel&amp;#39;s value, </text><text start="100" dur="2">you could mix it with counts of the </text><text start="102" dur="2">neighboring pixel values so if </text><text start="104" dur="2">all pixels are slightly shifted,</text><text start="106" dur="2">we get about the same statistics </text><text start="108" dur="2">as the pixel itself.</text><text start="110" dur="2">Such a method is called input smoothing. </text><text start="112" dur="3">You can what&amp;#39;s technically called convolve </text><text start="115" dur="2">the input vector of pixel values, and </text><text start="117" dur="3">you might get better results than if you </text><text start="120" dur="2">do Naive Bayes on the raw pixels.</text><text start="122" dur="2">Now, to tell you the truth, for </text><text start="124" dur="2">digit recognition of this type,</text><text start="126" dur="2">Naive Bayes is not a good choice.</text><text start="128" dur="2">The conditional independence assumption </text><text start="130" dur="2">of each pixel, given the class,</text><text start="132" dur="2">is too strong an assumption in this case,</text><text start="134" dur="3">but it&amp;#39;s fun to talk about image recognition </text><text start="137" dur="4">in the context of Naive Bayes regardless.</text></transcript></video><video title="30 Overfitting Prevention" id="-jswWk8YLro" length="210"><transcript><text start="0" dur="4">So, let me step back a step and talk a bit about</text><text start="4" dur="3">overfitting prevention in machine learning</text><text start="7" dur="2">because it&amp;#39;s such an important topic.</text><text start="9" dur="3">We talked about Occam&amp;#39;s Razor,</text><text 
start="12" dur="4">which in a generalized way suggests there is </text><text start="16" dur="6">a tradeoff between how well we can fit the data</text><text start="22" dur="6">and how smooth our learning algorithm is.</text><text start="28" dur="4">In our discussion of smoothing, we already found 1 way</text><text start="32" dur="2">to let Occam&amp;#39;s Razor play, which is by </text><text start="34" dur="6">selecting the value K to make our statistical counts smoother.</text><text start="40" dur="4">I alluded to a similar way in the image recognition domain</text><text start="44" dur="5">where we smoothed the image so the neighboring pixels count similarly.</text><text start="49" dur="4">This all raises the question of how to choose the smoothing parameter.</text><text start="53" dur="5">So, in particular, in Laplacian smoothing, how to choose the K.</text><text start="58" dur="4">There is a method called cross-validation</text><text start="62" dur="3">which can help you find an answer.</text><text start="65" dur="4">This method assumes there are plenty of training examples, but</text><text start="69" dur="5">to tell you the truth, in spam filtering there is more than you&amp;#39;d ever want.</text><text start="74" dur="3">Take your training data</text><text start="77" dur="2">and divide it into 3 buckets.</text><text start="79" dur="5">Train, cross-validate, and test.</text><text start="84" dur="3">Typical ratios will be 80% goes into train, </text><text start="87" dur="3">10% into cross-validate,</text><text start="90" dur="3">and 10% into test.</text><text start="93" dur="4">You use the train to find all your parameters.</text><text start="97" dur="3">For example, the probabilities of a Bayes network.</text><text start="100" dur="3">You use your cross-validation set</text><text start="103" dur="3">to find the optimal K, and the way you do this is</text><text start="106" dur="3">you train for different values of K,</text><text start="109" dur="6">you observe how well 
the trained model performs on the CV data,</text><text start="115" dur="3">not touching the test data,</text><text start="118" dur="3">and then you maximize over all the Ks to get the best performance</text><text start="121" dur="2">on the cross-validation set.</text><text start="123" dur="3">You iterate this many times until you find the best K.</text><text start="126" dur="3">When you&amp;#39;re done with the best K, </text><text start="129" dur="3">you train again, and then finally</text><text start="132" dur="3">only once do you touch the test data</text><text start="135" dur="2">to verify the performance,</text><text start="137" dur="3">and this is the performance you report.</text><text start="140" dur="3">It&amp;#39;s really important in cross-validation</text><text start="143" dur="5">to split apart a cross-validation set that&amp;#39;s different from the test set.</text><text start="148" dur="3">If you were to use the test set to find the optimal K,</text><text start="151" dur="4">then your test set becomes an effective part of your training routine,</text><text start="155" dur="3">and you might overfit your test data,</text><text start="158" dur="2">and you wouldn&amp;#39;t even know.</text><text start="160" dur="3">By keeping the test data separate from the beginning,</text><text start="163" dur="3">and training on the training data, you use</text><text start="166" dur="3">the cross-validation data to find how well your trained model is doing,</text><text start="169" dur="4">and to fine-tune the unknown parameter K.</text><text start="173" dur="3">Finally, you use the test data only once</text><text start="176" dur="3">to get a fair answer to the question,</text><text start="179" dur="3">&amp;quot;How well will your model perform on future data?&amp;quot;</text><text start="182" dur="3">So, pretty much everybody in machine learning</text><text start="185" dur="3">uses this method. 
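The procedure just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names split_80_10_10 and choose_k are hypothetical, and fit and accuracy stand in for whatever training and scoring routines you actually use (for example, fitting a Naive Bayes model with smoothing parameter k and measuring how often it classifies correctly).

```python
import random


def split_80_10_10(examples, seed=0):
    """Shuffle the data and split it into train / cross-validation / test buckets."""
    items = list(examples)
    random.Random(seed).shuffle(items)
    n_train = len(items) * 8 // 10   # 80% for training
    n_cv = len(items) // 10          # 10% for cross-validation
    train = items[:n_train]
    cv = items[n_train:n_train + n_cv]
    test = items[n_train + n_cv:]    # the remainder (about 10%) for testing
    return train, cv, test


def choose_k(train, cv, test, candidate_ks, fit, accuracy):
    """Pick K on the cross-validation set, then touch the test set exactly once.

    fit(train, k) returns a trained model; accuracy(model, data) returns a score
    where higher is better. Both are supplied by the caller (assumptions here).
    """
    # Train once per candidate K and keep the K that scores best on the CV data.
    best_k = max(candidate_ks, key=lambda k: accuracy(fit(train, k), cv))
    # Train again with the chosen K and report performance on the held-out test set.
    final_model = fit(train, best_k)
    return best_k, accuracy(final_model, test)
```

The test bucket is only read in the last line of choose_k, which mirrors the rule from the lecture that the reported number must come from data that played no part in choosing K.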
</text><text start="188" dur="4">You can redo the split between training and the cross-validation part;</text><text start="192" dur="3">people often use the term 10-fold cross-validation</text><text start="195" dur="2">where they do 10 different foldings</text><text start="197" dur="3">and run the model 10 times to find the optimal K</text><text start="200" dur="2">or smoothing parameter.</text><text start="202" dur="3">No matter which way you do it, find the optimal smoothing parameter</text><text start="205" dur="5">and then use a test set exactly once to verify and report.</text></transcript></video><video title="31 Classification vs Regression" id="5RLRKkzYWuQ" length="120"><transcript><text start="0" dur="3">Let me back up a step further, </text><text start="3" dur="3">and let&amp;#39;s look at supervised learning more generally.</text><text start="6" dur="3">Our example so far was one of classification.</text><text start="9" dur="3">The characteristic of classification is </text><text start="12" dur="4">that the target labels or the target class is discrete.</text><text start="16" dur="2">In our case it was actually binary.</text><text start="18" dur="5">In many problems, we try to predict a continuous quantity.</text><text start="23" dur="6">For example, in the interval 0 to 1 or perhaps a real number.</text><text start="29" dur="4">Those machine learning problems are called regression problems.</text><text start="33" dur="4">Regression problems are fundamentally different from classification problems.</text><text start="37" dur="5">For example, our Bayes network doesn&amp;#39;t afford us an answer</text><text start="42" dur="3">to a problem where the target value could be anywhere in the interval 0 to 1.</text><text start="45" dur="3">A regression problem, for example, would be one to</text><text start="48" dur="2">predict the weather tomorrow.</text><text start="50" dur="3">Temperature is a continuous value. 
Our Bayes network would not be able</text><text start="53" dur="5">to predict the temperature; it can only predict discrete classes.</text><text start="58" dur="3">A regression algorithm is able to give us a continuous prediction</text><text start="61" dur="3">about the temperature tomorrow.</text><text start="64" dur="3">So let&amp;#39;s look at regression next.</text><text start="67" dur="3">So here&amp;#39;s my first quiz for you on regression.</text><text start="70" dur="8">This scatter plot shows for Berkeley, California, for a period of time</text><text start="78" dur="3">the data for each house that was sold.</text><text start="81" dur="3">Each dot is a sold house.</text><text start="84" dur="3">It graphs the size of the house in square feet</text><text start="87" dur="5">against the sales price in thousands of dollars.</text><text start="92" dur="2">As you can see, roughly speaking,</text><text start="94" dur="3">as the size of the house goes up, </text><text start="97" dur="3">so does the sales price.</text><text start="100" dur="5">I wonder, for a house of about 2500 square feet,</text><text start="105" dur="4">what is the approximate sales price you would assume</text><text start="109" dur="3">based just on the scatter plot data?</text><text start="112" dur="8">Is it 400k, 600k, 800k, or 1000k?</text></transcript></video><video title="32 Answer" id="4kXyi3KWcSw" length="26"><transcript><text start="0" dur="5">My answer is, there seems to be a roughly linear relationship,</text><text start="5" dur="6">maybe not quite linear, between the house size and the price.</text><text start="11" dur="4">So if we look at a linear graph that best describes the data--</text><text start="15" dur="3">you get this dashed line over here.</text><text start="18" dur="4">And for the dashed line, if you walk up from 2500 square feet, </text><text start="22" dur="2">you end up with roughly 800K.</text><text start="24" dur="2">So this would have been the best 
answer.</text></transcript></video><video title="33 Linear Regression" id="4bGWN67R9G0" length="166"><transcript><text start="0" dur="5">Now obviously you can answer this question without understanding anything about regression.</text><text start="5" dur="5">But what you find is this is different from classification as before. </text><text start="10" dur="3">This is not a binary concept anymore of like expensive and cheap.</text><text start="13" dur="4">It really is a relationship between two variables.</text><text start="17" dur="3">One you care about--the house price, and one that you can observe,</text><text start="20" dur="3">which is the house size in square feet.</text><text start="23" dur="5">And your goal is to fit a curve that best explains the data.</text><text start="28" dur="3">Once again, we have a case where we can play Occam&amp;#39;s razor.</text><text start="31" dur="4">There clearly is a data fit that is not linear that might be better,</text><text start="35" dur="2">like this one over here.</text><text start="37" dur="3">And when you go beyond linear curves, </text><text start="40" dur="4">you might even be inclined to draw a curve like this.</text><text start="44" dur="5">Now of course the curve I&amp;#39;m drawing right now is likely an overfit.</text><text start="49" dur="5">And you don&amp;#39;t want to postulate that this is the general relationship </text><text start="54" dur="3">between the size of a house and the sales price.</text><text start="57" dur="4">So even though my black curve might describe the data better,</text><text start="61" dur="7">the blue curve or the dashed linear curve over here might be a better explanation by virtue of Occam&amp;#39;s razor.</text><text start="68" dur="7">So let&amp;#39;s look a little bit deeper into what we call regression.</text><text start="75" dur="4">As in all regression problems, our data will be comprised of </text><text start="79" dur="6">input vectors of length N that map to another 
continuous value.</text><text start="85" dur="5">And we might be given a total of M data points.</text><text start="90" dur="6">This is just like the classification case, except this time the Ys are continuous.</text><text start="96" dur="8">Once again, we&amp;#39;re looking for a function f that maps our vector x into y.</text><text start="104" dur="10">In linear regression, the function has a particular form, which is W1 times X plus W0.</text><text start="114" dur="5">In this case X is one-dimensional, that is, N = 1.</text><text start="119" dur="8">Or in the high-dimensional space, we might just write W times X plus W0,</text><text start="127" dur="5">where W is a vector and X is a vector.</text><text start="132" dur="4">And this is the inner product of these 2 vectors over here. </text><text start="136" dur="4">Let&amp;#39;s for now just consider the one-dimensional case. </text><text start="140" dur="7">In this quiz, I&amp;#39;ve given you a linear regression form with 2 unknown parameters, W1 and W0.</text><text start="147" dur="3">I&amp;#39;ve given you a data set.</text><text start="150" dur="6">And this data set happens to be fittable by a linear regression model without any residual error.</text><text start="156" dur="10">Without any math, can you look at this and tell me what the 2 parameters, W0 and W1, are? </text></transcript></video><video title="34 Answer" id="pLwMXAPKdas" length="77"><transcript><text start="0" dur="3">This is a surprisingly challenging question.</text><text start="3" dur="4">If you look at these numbers from 3 to 6,</text><text start="7" dur="7">when we increase X by 3, Y decreases by 3,</text><text start="14" dur="4">which suggests W1 is -1.</text><text start="18" dur="2">Now let&amp;#39;s see if this holds. 
</text><text start="20" dur="4">If we increase X by 3, it decreases Y by 3.</text><text start="24" dur="4">If we increase X by 1, we decrease Y by 1.</text><text start="28" dur="4">If we increase X by 2, we decrease Y by 2.</text><text start="32" dur="4">So this number seems to be an exact fit.</text><text start="36" dur="5">Next we have to get the constant W0 right. </text><text start="41" dur="7">For X = 3, we get -3 as an expression over here,</text><text start="48" dur="2">because we know W1 = -1.</text><text start="50" dur="7">So if this has to equal zero in the end, then W0 has to be 3. </text><text start="57" dur="2">Let&amp;#39;s do a quick check.</text><text start="59" dur="3">-3 plus 3 is 0.</text><text start="62" dur="3">-6 plus 3 is -3.</text><text start="65" dur="4">And if we plug in any of the numbers, you find those are correct. </text><text start="69" dur="3">Now this is the case of an exact data set.</text><text start="72" dur="5">It gets much more challenging if the data set cannot be fit with a linear function. 
</text></transcript></video><video title="35 More Linear Regression" id="v4XIkABA1N0" length="60"><transcript><text start="0" dur="2">To define linear regression,</text><text start="2" dur="3">we need to understand what we are trying to minimize.</text><text start="5" dur="3">What we minimize here is called a loss function,</text><text start="8" dur="4">and the loss function is the amount of residual error we obtain</text><text start="12" dur="4">after fitting the linear function as well as possible.</text><text start="16" dur="4">The residual error is the sum over all training examples </text><text start="20" dur="5">J of YJ, which is the target label, </text><text start="25" dur="9">minus our prediction, which is W1 XJ plus W0, to the square.</text><text start="34" dur="3">This is the quadratic error between our target labels</text><text start="37" dur="4">and what our best hypothesis can produce.</text><text start="41" dur="2">Minimizing this loss</text><text start="43" dur="3">is how we solve a linear regression problem,</text><text start="46" dur="4">and you can write it as follows:</text><text start="50" dur="2">Our solution to the regression problem W* </text><text start="52" dur="8">is the arg min of the loss over all possible vectors W.</text></transcript></video><video title="36 Quadratic Loss" id="wUFYzzrd6TQ" length="244"><transcript><text start="0" dur="7">The problem of minimizing quadratic loss for linear functions can be solved in closed form.</text><text start="7" dur="5">I will derive this for the one-dimensional case on paper.</text><text start="12" dur="5">I will also give you the solution for the case where your input space is multidimensional,</text><text start="17" dur="4">which is often called &amp;quot;multivariate regression.&amp;quot;</text><text start="22" dur="4">We seek to minimize a sum of a quadratic expression</text><text start="26" dur="7">where from the target labels we subtract the output of our linear 
regression model</text><text start="33" dur="3">parameterized by w1 and w0.</text><text start="36" dur="4">The summation here is over all training examples,</text><text start="40" dur="5">and I leave the index of the summation out if not necessary.</text><text start="45" dur="5">The minimum of this is obtained where the derivative of this function equals zero.</text><text start="50" dur="3">Let&amp;#39;s call this function &amp;quot;L.&amp;quot;</text><text start="53" dur="6">For the partial derivative with respect to w0, we get this expression over here,</text><text start="59" dur="3">which we have to set to zero.</text><text start="62" dur="4">We can easily get rid of the -2 and transform this as follows:</text><text start="71" dur="4">Here M is the number of training examples.</text><text start="77" dur="4">This expression over here gives us w0 as a function of w1,</text><text start="81" dur="5">but we don&amp;#39;t know w1. Let&amp;#39;s do the same trick for w1</text><text start="88" dur="2">and set this to zero as well,</text><text start="92" dur="6">which gets us the expression over here.</text><text start="98" dur="6">We can now plug in the w0 over here into this expression over here</text><text start="104" dur="3">and obtain this expression over here, </text><text start="107" dur="3">which looks really involved but is relatively straightforward.</text><text start="112" dur="4">With a few steps of further calculation, which I&amp;#39;ll spare you for now,</text><text start="116" dur="4">we get for w1 the following important formula:</text><text start="122" dur="3">This is the final quotient for w1,</text><text start="125" dur="5">where we take the number of training examples times the sum of all xy </text><text start="130" dur="6">minus the sum of x times the sum of y divided by this expression over here.</text><text start="136" dur="3">Once we&amp;#39;ve computed w1, </text><text start="139" dur="4">we can go back to our original articulation of w0 over 
here</text><text start="143" dur="7">and plug w1 into w0 and obtain w0.</text><text start="150" dur="7">These are the two important formulas we can also find in the textbook.</text><text start="159" dur="5">I&amp;#39;d like to go back and use those formulas to calculate these two coefficients over here.</text><text start="165" dur="9">You get 4 times the sum of x times y, which is -32,</text><text start="176" dur="9">minus the product of the sum of x, which is 18, and the sum of y, which is -6,</text><text start="185" dur="11">divided by the sum of x squared, which is 86, times 4, minus the square of the sum of x, </text><text start="196" dur="4">which is 18 times 18, which is 324.</text><text start="200" dur="5">If you work this all out, it becomes -1, which is w1.</text><text start="205" dur="6">W0 is now obtained by computing a quarter times the sum of all y, </text><text start="211" dur="8">which is -6, minus -1/4 times the sum of all x.</text><text start="219" dur="7">If you plug this all in, you get 3, as over here. Our formula is actually correct.</text><text start="226" dur="4">Here is another quiz for linear regression. We have the following data:</text><text start="231" dur="2">Here is the data plotted graphically.</text><text start="233" dur="3">I wonder what the best regression is.</text><text start="236" dur="8">Give me w0 and w1. 
Apply the formulas I just gave you.</text></transcript></video><video title="37 Answer" id="ocviSEb04bk" length="87"><transcript><text start="0" dur="9">And the answer is W0 = 0.5, and W1 = 0.9.</text><text start="9" dur="5">If I were to draw a line, it would go about like this.</text><text start="14" dur="5">It doesn&amp;#39;t really hit the two points at the end.</text><text start="19" dur="5">If you were thinking of something like this, you were wrong.</text><text start="24" dur="4">If you draw a curve like this, your quadratic error becomes 2.</text><text start="28" dur="2">One over here, and one over here.</text><text start="30" dur="5">The quadratic error is smaller for the line that goes in between those points.</text><text start="35" dur="6">This is easily seen by computing as shown in the previous slide.</text><text start="41" dur="14">W1 equals (4 x 118 - 20 x 20) / (4 x 120 - 400) which is 0.9.</text><text start="55" dur="5">This is merely plugging in those numbers into the formulas I gave you.</text><text start="60" dur="5">W0 then becomes &#xBC; x 20.                              
</text><text start="65" dur="7">Now we plug in W1 and subtract 0.9 / 4 x 20, which gives 0.5.</text><text start="72" dur="4">This is an example of linear regression</text><text start="76" dur="2">in which there is a residual error,</text><text start="78" dur="4">and the best-fitting curve is the one that minimizes</text><text start="82" dur="5">the total of the residual vertical error in this graph over here.</text></transcript></video><video title="38 Problems with Linear Regression" id="w7Ip8r0EIJQ" length="130"><transcript><text start="0" dur="3">So linear regression works well</text><text start="3" dur="2">if the data is approximately linear,</text><text start="5" dur="4">but there are many examples when linear regression performs poorly.</text><text start="9" dur="3">Here&amp;#39;s one where we have a</text><text start="12" dur="3">curve that is really nonlinear.</text><text start="15" dur="3">This is an interesting one where we seem to have a linear relationship</text><text start="18" dur="3">that is flatter than the linear regression indicates,</text><text start="21" dur="2">but there is one outlier.</text><text start="23" dur="3">Because you are minimizing quadratic error,</text><text start="26" dur="4">outliers penalize you disproportionately.</text><text start="30" dur="4">So outliers are particularly bad for linear regression.</text><text start="34" dur="1">And here is a case</text><text start="35" dur="2">where the data clearly suggests</text><text start="37" dur="3">a very different phenomenon from linear.</text><text start="40" dur="2">We have only two ?? 
variables even being used,</text><text start="42" dur="3">and this one has a strong frequency</text><text start="45" dur="2">and a strong vertical spread.</text><text start="47" dur="2">Clearly a linear regression model</text><text start="49" dur="2">is a very poor one to explain</text><text start="51" dur="2">this data over here.</text><text start="53" dur="2">Another problem with linear regression</text><text start="55" dur="4">is that as you go to infinity in the X space,</text><text start="59" dur="3">your Ys also become infinite.</text><text start="62" dur="3">In some problems that isn&amp;#39;t a plausible model.</text><text start="65" dur="3">For example, if you wish to predict the weather</text><text start="68" dur="2">anytime into the future,</text><text start="70" dur="3">it&amp;#39;s implausible to assume the further the prediction goes out,</text><text start="73" dur="2">the hotter or the cooler it becomes.</text><text start="75" dur="2">For such situations there is a </text><text start="77" dur="3">model called logistic regression, </text><text start="80" dur="2">which uses a slightly more complicated</text><text start="82" dur="2">model than linear regression, </text><text start="84" dur="1">which goes as follows:</text><text start="85" dur="5">Let F of X be our linear function,</text><text start="90" dur="2">then the output of logistic regression</text><text start="92" dur="2">is obtained by the following function:</text><text start="94" dur="6">One over one plus exponential of minus F of X.</text><text start="100" dur="3">So here&amp;#39;s a quick quiz for you.</text><text start="103" dur="5">What is the range in which Z might fall</text><text start="108" dur="1">given this function over here,</text><text start="109" dur="4">and ??? 
the linear function F of X over here.</text><text start="113" dur="3">Is it zero, one?</text><text start="116" dur="3">Is it minus one, one?</text><text start="119" dur="3">Is it minus one, zero?</text><text start="122" dur="2">Minus two, two?</text><text start="124" dur="6">Or none of the above?</text></transcript></video><video title="39 Answer" id="PYZ-7YS5T0k" length="60"><transcript><text start="0" dur="2">The answer is zero, one.</text><text start="2" dur="3">If this expression over here,</text><text start="5" dur="2">F of X,</text><text start="7" dur="2">grows to positive infinity,</text><text start="9" dur="5">then Z becomes one.</text><text start="14" dur="2">And the reason is</text><text start="16" dur="3">as this term over here becomes very large,</text><text start="19" dur="3">E to the minus of that term approaches zero;</text><text start="22" dur="3">one over one equals one.</text><text start="25" dur="5">If F of X goes to minus infinity,</text><text start="30" dur="3">then Z goes to zero.</text><text start="33" dur="1">And the reason is,</text><text start="34" dur="4">if this expression over here goes to minus infinity,</text><text start="38" dur="3">E to the infinity becomes very large;</text><text start="41" dur="3">one over something very large becomes zero.</text><text start="44" dur="5">When we plot the logistic function it looks like this:</text><text start="49" dur="2">So it&amp;#39;s approximately linear</text><text start="51" dur="3">around F of X equals zero,</text><text start="54" dur="4">but it levels off to zero and one</text><text start="58" dur="2">as we go to the extremes.</text></transcript></video><video title="40 Linear Regression and Complexity Control" id="4G5mH4FW-WY" length="99"><transcript><text start="0" dur="4">Another problem with linear regression has to do with regularization </text><text start="4" dur="2">or complexity control.</text><text start="6" dur="2">Just like before, we sometimes wish to have</text><text 
start="8" dur="2">a less complex model.</text><text start="10" dur="5">So in regularization, the loss function is the sum</text><text start="15" dur="6">of a loss on the data and a complexity control term,</text><text start="21" dur="3">which is often called the loss of the parameters.</text><text start="24" dur="5">The loss of the data is simply quadratic loss, as we discussed before.</text><text start="29" dur="6">The loss of parameters might just be a function that penalizes </text><text start="35" dur="2">the parameters for becoming large</text><text start="37" dur="6">under some norm P, where P is usually either 1 or 2.</text><text start="43" dur="3">If you draw this graphically,</text><text start="46" dur="3">in a parameter space comprised of 2 parameters,</text><text start="49" dur="4">your quadratic term for minimizing the data error</text><text start="53" dur="4">might look like this, where the minimum sits over here.</text><text start="57" dur="5">Your term for regularization might pull these parameters toward 0.</text><text start="62" dur="7">It pulls it toward 0 along a circle if you use quadratic error,</text><text start="69" dur="5">and it does it in a diamond-shaped way</text><text start="74" dur="6">for L1 regularization. Either one works well.</text><text start="80" dur="4">L1 has the advantage in that parameters tend to get really sparse.</text><text start="84" dur="6">If you look at this diagram, there is a tradeoff between W-0 and W-1.</text><text start="90" dur="3">In the L1 case, that allows one of them to be driven to 0.</text><text start="93" dur="4">In the L2 case, parameters tend not to be as sparse.</text><text start="97" dur="2">So L1 is often preferred.</text></transcript></video><video title="41 Minimizing Complicated Loss Functions" id="0RmqLOxexh4" length="106"><transcript><text start="0" dur="3">This all raises the question,</text><text start="3" dur="3">how to minimize more complicated loss functions</text><text 
start="6" dur="3">than the one we discussed so far.</text><text start="9" dur="5">Are there closed-form solutions of the type we found for linear regression?</text><text start="14" dur="3">Or do we have to resort to iterative methods?</text><text start="17" dur="6">The general answer is, unfortunately, we have to resort to iterative methods.</text><text start="23" dur="5">Even though there are special cases in which closed-form solutions may exist,</text><text start="28" dur="4">in general, our loss functions now become complicated enough </text><text start="32" dur="3">that all we can do is iterate.</text><text start="35" dur="5">Here is a prototypical loss function</text><text start="40" dur="4">and the method for iteration will be called gradient descent.</text><text start="44" dur="4">In gradient descent, you start with an initial guess,</text><text start="48" dur="5">W-0, where 0 is your iteration number,</text><text start="53" dur="2">and then you update it iteratively.</text><text start="55" dur="9">Your i plus 1st parameter guess will be obtained by taking your i-th guess</text><text start="64" dur="6">and subtracting from it the gradient of your loss function</text><text start="70" dur="5">at that guess, multiplied by a small learning rate alpha,</text><text start="75" dur="4">where alpha is often as small as 0.01.</text><text start="79" dur="2">I have a couple of questions for you.</text><text start="81" dur="4">Consider the following 3 points.</text><text start="85" dur="2">We call them A, B, C.</text><text start="87" dur="7">I wish to know, for points A, B, and C,</text><text start="94" dur="6">Is the gradient at this point positive, about zero, or negative?</text><text start="100" dur="6">For each of those, check exactly one of those cases.</text></transcript></video><video title="42 Answer" id="rAcwpZJqAZA" length="47"><transcript><text start="0" dur="3">In case A, the gradient is negative.</text><text start="3" dur="3">If you move to the right 
in the X space,</text><text start="6" dur="3">then your loss decreases.</text><text start="9" dur="3">In B, it&amp;#39;s about zero.</text><text start="12" dur="3">In C, it&amp;#39;s pointing up; it&amp;#39;s positive.</text><text start="15" dur="3">So if you apply the rule over here,</text><text start="18" dur="3">if you were to start at A as your W-zero,</text><text start="21" dur="2">then your gradient is negative.</text><text start="23" dur="3">Therefore, you would add something to the value of W.</text><text start="26" dur="3">You move to the right, and your loss has decreased.</text><text start="29" dur="2">You do this until you find yourself</text><text start="31" dur="3">with what&amp;#39;s called a local minimum, where B resides.</text><text start="34" dur="3">In this instance over here, gradient descent starting at A</text><text start="37" dur="2">would not get you to the global minimum,</text><text start="39" dur="3">which sits over here because there&amp;#39;s a bump in between.</text><text start="42" dur="5">Gradient methods are known to be subject to local minima.</text></transcript></video><video title="43 Question" id="dKKigX6nhyU" length="28"><transcript><text start="0" dur="3">I have another gradient quiz.</text><text start="3" dur="3">Consider the following quadratic error function.</text><text start="6" dur="3">We are considering the gradient in 3 different places.</text><text start="9" dur="4">a. b. 
and c.</text><text start="13" dur="4">And I ask you which gradient is the largest.</text><text start="17" dur="6">a, b, or c or are they all equal?</text><text start="23" dur="5">In which case, you would want to check the last box over here.</text></transcript></video><video title="44 Answer" id="5Vm8ibmxroE" length="20"><transcript><text start="0" dur="4">And the answer is C.</text><text start="4" dur="4">The derivative of a quadratic function is a linear function.</text><text start="8" dur="3">Which would look about like this.</text><text start="11" dur="4">And as we go outward, our gradient becomes larger and larger.</text><text start="15" dur="5">This over here is much steeper than this curve over here.</text></transcript></video><video title="45 Question" id="2oy1QoXsvGQ" length="21"><transcript><text start="0" dur="4">[Thrun] Here is a final gradient descent quiz.</text><text start="4" dur="4">Suppose we have a loss function like this</text><text start="8" dur="4">and our gradient descent starts over here.</text><text start="12" dur="3">Will it likely reach the global minimum?</text><text start="15" dur="2">Yes or no.</text><text start="17" dur="4">Please check one of those boxes.</text></transcript></video><video title="46 Answer" id="R1o9wbhnv94" length="60"><transcript><text start="0" dur="2">[Thrun] And the answer is yes,</text><text start="2" dur="4">although, technically speaking, to reach the absolute global minimum</text><text start="6" dur="5">we need the learning rates to become smaller and smaller over time.</text><text start="11" dur="4">If they stay constant, there is a chance this thing might bounce around </text><text start="15" dur="3">between 2 points in the end and never reach the global minimum.</text><text start="18" dur="4">But assuming that we implement gradient descent correctly,</text><text start="22" dur="2">we will finally reach the global minimum.</text><text start="24" dur="5">That&amp;#39;s not the case if you start over here, 
where we can get stuck over here</text><text start="29" dur="3">and settle for the minimum over here, which is a local minimum</text><text start="32" dur="3">and not the best solution to our optimization problem.</text><text start="35" dur="3">So one of the important points to take away from this is </text><text start="38" dur="5">gradient descent is universally applicable to more complicated problems--</text><text start="43" dur="3">problems that don&amp;#39;t have a closed-form solution.</text><text start="46" dur="3">But you have to check whether there are many local minima,</text><text start="49" dur="2">and if so, you have to worry about this.</text><text start="51" dur="4">Any optimization book can tell you tricks for overcoming this.</text><text start="55" dur="5">I won&amp;#39;t go into any more depth here in this class.</text></transcript></video><video title="47 Gradient Descent Implementation" id="Iy9gzbQ7_3g" length="101"><transcript><text start="0" dur="5">[Thrun] It&amp;#39;s interesting to see how to minimize a loss function using gradient descent.</text><text start="5" dur="7">In our linear case, we have L equals sum over the correct labels</text><text start="12" dur="4">minus our linear function to the square,</text><text start="16" dur="2">which we seek to minimize.</text><text start="18" dur="3">We already know that this has a closed form solution,</text><text start="21" dur="4">but just for the fun of it, let&amp;#39;s look at gradient descent.</text><text start="25" dur="8">The gradient of L with respect to W1 is minus 2, sum of all J</text><text start="33" dur="6">of the difference as before but without the square times Xj.</text><text start="39" dur="4">The gradient with respect to W0 is very similar.</text><text start="43" dur="6">So in gradient descent we start with W1 0 and W0 0 </text><text start="49" dur="6">where the superscript 0 corresponds to the iteration index of gradient descent.</text><text start="55" dur="2">And then we 
iterate.</text><text start="57" dur="9">In the m-th iteration we get our new estimate by using the old estimate</text><text start="66" dur="4">minus a learning rate times this gradient over here</text><text start="70" dur="5">evaluated at the old estimate W1, M minus 1.</text><text start="75" dur="5">Similarly, for W0 we get this expression over here.</text><text start="80" dur="4">And these expressions look nasty,</text><text start="84" dur="4">but what it really means is we subtract an expression like this</text><text start="88" dur="3">every time we do gradient descent from W1</text><text start="91" dur="5">and an expression like this every time we do gradient descent from W0,</text><text start="96" dur="5">which is easy to implement, and that implements gradient descent.</text></transcript></video><video title="48 Perceptron" id="yOSGC67bOIk" length="255"><transcript><text start="0" dur="8">Now, there are many different ways to apply linear functions in machine learning.</text><text start="8" dur="4">We so far have studied linear functions for regression, </text><text start="12" dur="4">but linear functions are also used for classification,</text><text start="16" dur="5">and specifically for an algorithm called the perceptron algorithm.</text><text start="21" dur="6">This algorithm happens to be a very early model of a neuron,</text><text start="27" dur="3">as in the neurons we have in our brains, </text><text start="30" dur="3">and was invented in the 1940s.</text><text start="33" dur="8">Suppose we are given a data set of positive samples and negative samples.</text><text start="41" dur="8">A linear separator is a linear equation that separates positive from negative examples.</text><text start="49" dur="6">Obviously, not all sets possess a linear separator, but some do.</text><text start="55" dur="7">For those we can define the algorithm of the perceptron and it actually converges.</text><text start="62" dur="5">To define a linear separator, let&amp;#39;s 
start with our linear equation as before--</text><text start="67" dur="11">w1x + w0. In cases where x is higher dimensional, this might actually be a vector--never mind.</text><text start="78" dur="8">If this is larger than or equal to zero, then we call our classification 1.</text><text start="86" dur="4">Otherwise, we call it zero.</text><text start="90" dur="5">Here&amp;#39;s our linear separation classification function</text><text start="95" dur="4">where this is our common linear function.</text><text start="99" dur="6">Now, as I said, perceptron only converges if the data is linearly separable,</text><text start="105" dur="4">and then it converges to a linear separation of the data,</text><text start="109" dur="3">which is quite amazing.</text><text start="112" dur="4">Perceptron is an iterative algorithm that is not dissimilar from gradient descent.</text><text start="116" dur="7">In fact, the update rule echoes that of gradient descent, and here&amp;#39;s how it goes.</text><text start="123" dur="6">We start with a random guess for w1 and w0,</text><text start="129" dur="4">which may correspond to a random separation line,</text><text start="133" dur="4">but usually is inaccurate.</text><text start="137" dur="12">Then the mth weight-i is obtained by using the old weight plus some learning rate alpha</text><text start="149" dur="4">times the difference between the desired target label</text><text start="153" dur="6">and the target label produced by our function at the point m-1.</text><text start="159" dur="6">Now, this is an online learning rule, which means we don&amp;#39;t process all the data in batch.</text><text start="165" dur="5">We process one data point at a time, and we might go through the data many, many times--</text><text start="170" dur="2">hence the j over here--</text><text start="172" dur="3">but every time we do this, we apply this rule over here.</text><text start="175" dur="8">What this rule gives us is a method to adapt our weights in proportion to the 
error.</text><text start="183" dur="4">If the prediction of our function f equals our target label,</text><text start="187" dur="4">and the error is zero, then no update occurs.</text><text start="191" dur="7">If there is a difference, however, we update in a way so as to minimize the error.</text><text start="198" dur="4">Alpha is a small learning rate.</text><text start="202" dur="6">Once again, perceptron converges to a correct linear separator</text><text start="208" dur="3">if such a linear separator exists.</text><text start="211" dur="5">Now, the case of linear separation has recently received a lot of attention in machine learning.</text><text start="216" dur="6">If you look at the picture over here, you&amp;#39;ll find there are many different linear separators.</text><text start="222" dur="5">There is one over here. There is one over here. There is one over here.</text><text start="227" dur="6">One of the questions that has recently been researched extensively is which one to prefer.</text><text start="233" dur="4">Is it a, b, or c?</text><text start="237" dur="4">Even though you probably have never seen this literature,</text><text start="241" dur="4">I will just ask your intuition in the following quiz.</text><text start="245" dur="5">Which linear separator would you prefer if you look at these three different linear separators--</text><text start="250" dur="5">a, b, c, or none of them?</text></transcript></video><video title="49 Answer and SVMs" id="xRf9wAeU1kI" length="338"><transcript><text start="0" dur="4">[Narrator] And intuitively I would argue it&amp;#39;s B,</text><text start="4" dur="2">and the reason why is</text><text start="6" dur="3">C comes really close to examples.</text><text start="9" dur="3">So if these examples are noisy,</text><text start="12" dur="2">it&amp;#39;s quite likely that </text><text start="14" dur="3">by being so close to these examples </text><text start="17" dur="3">that future examples cross the line.</text><text 
start="20" dur="3">Similarly A comes close to examples.</text><text start="23" dur="3">B is the one that stays really far away </text><text start="26" dur="2">from any example.</text><text start="28" dur="3">So there&amp;#39;s this entire region over here</text><text start="31" dur="3">where there&amp;#39;s no example anywhere near B. </text><text start="34" dur="3">This region is often called the margin.</text><text start="37" dur="3">The margin of the linear separator </text><text start="40" dur="3">is the distance of the separator</text><text start="43" dur="2">to the closest training example.</text><text start="45" dur="2">The margin is a really important concept </text><text start="47" dur="2">in machine learning.</text><text start="49" dur="2">There is an entire class of maximum margin </text><text start="51" dur="2">learning algorithms,</text><text start="53" dur="3">and the 2 most popular are </text><text start="56" dur="4">support vector machines and boosting.</text><text start="60" dur="2">If you are familiar with machine learning, </text><text start="62" dur="2">you&amp;#39;ve come across these terms.</text><text start="64" dur="3">These are very frequently used these days </text><text start="67" dur="3">in actual discrimination learning tasks.</text><text start="70" dur="2">I will not go into any details because it would go </text><text start="72" dur="4">way beyond the scope of this introduction </text><text start="76" dur="2">to artificial intelligence class, but let&amp;#39;s see </text><text start="78" dur="3">a few abstract words specifically about</text><text start="81" dur="4">support vector machines or SVMs.</text><text start="85" dur="5">As I said before a support vector machine</text><text start="90" dur="4">derives a linear separator, and it takes </text><text start="94" dur="5">the one that actually maximizes the margin</text><text start="99" dur="3">as shown over here.</text><text start="102" dur="2">By doing so it attains additional 
robustness </text><text start="104" dur="2">over perceptron which only picks </text><text start="106" dur="2">a linear separator without </text><text start="108" dur="3">consideration of the margin.</text><text start="111" dur="2">Now the problem of finding the </text><text start="113" dur="2">margin maximizing linear separator </text><text start="115" dur="4">can be solved by a quadratic program</text><text start="119" dur="4">which is an iterative method for finding the best</text><text start="123" dur="3">linear separator that maximizes the margin.</text><text start="126" dur="2">One of the nice things that support</text><text start="128" dur="4">vector machines do in practice is</text><text start="132" dur="4">they use linear techniques to solve </text><text start="136" dur="3">nonlinear separation problems,</text><text start="139" dur="3">and I&amp;#39;m just going to give you a glimpse of</text><text start="142" dur="3">what&amp;#39;s happening without going into any detail.</text><text start="145" dur="3">Suppose the data looks as follows:</text><text start="148" dur="3">we have a positive class </text><text start="151" dur="2">which is near the origin of a coordinate system </text><text start="153" dur="4">and a negative class that surrounds the positive class.</text><text start="157" dur="2">Clearly these 2 classes </text><text start="159" dur="2">are not linearly separable </text><text start="161" dur="2">because there&amp;#39;s no line I can draw that </text><text start="163" dur="4">separates the negative examples from the positive examples.</text><text start="167" dur="2">An idea that underlies SVMs,</text><text start="169" dur="2">that will ultimately be known as </text><text start="171" dur="2">the kernel trick, </text><text start="173" dur="3">is to augment the feature set by new features.</text><text start="176" dur="2">Suppose this is X1, and this is X2, </text><text start="178" dur="2">and normally X1 and X2 </text><text start="180" dur="3">will 
be the input features.</text><text start="183" dur="2">In this example, you might derive</text><text start="185" dur="2">a 3rd one.</text><text start="187" dur="2">Let me pick a 3rd one </text><text start="189" dur="4">Suppose X3 equals the square root of </text><text start="193" dur="5">X1 square + X2 square.</text><text start="198" dur="4">In other words X3 is the distance</text><text start="202" dur="3">of any data point from the center </text><text start="205" dur="2">of the coordinate system.</text><text start="207" dur="4">Then things do become linearly separable</text><text start="211" dur="2">so that just along the 3rd dimension</text><text start="213" dur="3">all the positive examples end up </text><text start="216" dur="3">to be close to the origin,</text><text start="219" dur="2">and all the negative examples </text><text start="221" dur="2">are further away, and the line is </text><text start="223" dur="3">orthogonal to the 3rd input feature </text><text start="226" dur="3">solves the separation problem.</text><text start="229" dur="3">Map back into the space over here</text><text start="232" dur="3">is actually a circle which is a set of all </text><text start="235" dur="5">values of X3 that are equidistant</text><text start="240" dur="2">to the center of the origin.</text><text start="242" dur="4">Now this trick could be done in any linear learning algorithm, </text><text start="246" dur="2">and it&amp;#39;s really an amazing trick.</text><text start="248" dur="2">You can take any nonlinear problem, add </text><text start="250" dur="3">features of this type or any other type,</text><text start="253" dur="2">and use linear techniques </text><text start="255" dur="2">and get better solutions.</text><text start="257" dur="2">This is a very deep machine learning insight </text><text start="259" dur="2">that you can extend your feature space </text><text start="261" dur="2">in this way, and there&amp;#39;s numerous </text><text start="263" dur="2">papers 
written about this.</text><text start="265" dur="6">In SVMs, the extension of the feature space is mathematically done by </text><text start="271" dur="2">what&amp;#39;s called a kernel.</text><text start="273" dur="3">I can&amp;#39;t really tell you about this in this class,</text><text start="276" dur="2">but it makes it possible to write </text><text start="278" dur="3">very large new feature spaces including</text><text start="281" dur="3">infinitely dimensional new feature spaces.</text><text start="284" dur="2">These messages are very powerful. </text><text start="286" dur="2">It turns out you never </text><text start="288" dur="2">really compute all those features.</text><text start="290" dur="2">They are implicitly represented by </text><text start="292" dur="3">so called kernels, and if you care about this,</text><text start="295" dur="2">I recommend you to dive </text><text start="297" dur="2">deeper into the literature </text><text start="299" dur="2">of support vector machines.</text><text start="301" dur="2">This is meant to just give you </text><text start="303" dur="2">an overview of the essence of </text><text start="305" dur="3">what support vector machines are all about.</text><text start="308" dur="2">So in summary,</text><text start="310" dur="2">linear methods we learned about </text><text start="312" dur="3">using them for regression </text><text start="315" dur="2">and also classification.</text><text start="317" dur="2">We learned about exact solutions</text><text start="319" dur="4">versus iterative solutions.</text><text start="323" dur="2">We talked about smoothing,</text><text start="325" dur="2">and we even talked about </text><text start="327" dur="3">using linear methods for nonlinear problems.</text><text start="330" dur="3">So we covered quite a bit of ground.</text><text start="333" dur="2">This is a really significant cross section</text><text start="335" dur="3">of machine learning.</text></transcript></video><video title="50 k 
Nearest Neighbors" id="ZLEilYyt28c" length="121"><transcript><text start="0" dur="6">As the final method in this unit, I&amp;#39;d like now to talk about k-nearest neighbors.</text><text start="6" dur="3">And the distinguishing factor of k-nearest neighbors </text><text start="9" dur="4">is that it is a nonparametric machine learning method.</text><text start="13" dur="3">So far we&amp;#39;ve talked about parametric methods. </text><text start="16" dur="5">Parametric methods have parameters, like probabilities or weights,</text><text start="21" dur="4">and the number of parameters is constant.</text><text start="25" dur="4">Or to put it differently, the number of parameters is independent of the training set size.</text><text start="29" dur="5">So for example in Naive Bayes, if we bring in more data, </text><text start="34" dur="3">the number of conditional probabilities will stay the same.</text><text start="37" dur="4">Well, that wasn&amp;#39;t technically always the case.</text><text start="41" dur="5">Our vocabulary might increase and as such the number of parameters.</text><text start="46" dur="7">But for any fixed dictionary, the number of parameters is truly independent of the training set size.</text><text start="53" dur="3">The same was true, for example, in our regression cases</text><text start="56" dur="6">where the number of regression weights is independent of the number of data points. </text><text start="62" dur="4">Now this is very different from non-parametric methods</text><text start="66" dur="4">where the number of parameters can grow. </text><text start="70" dur="3">In fact, it can grow a lot over time. </text><text start="73" dur="3">Those techniques are called non-parametric. </text><text start="76" dur="4">Nearest neighbor is so straightforward.</text><text start="80" dur="3">I&amp;#39;d really like to introduce it to you using a quiz. 
</text><text start="83" dur="2">So here&amp;#39;s my quiz.</text><text start="85" dur="4">Suppose we have a number of data points.</text><text start="89" dur="8">I want you for 1-nearest neighbor to check those squared areas</text><text start="97" dur="4">that you believe will carry a positive label.</text><text start="101" dur="4">And I will give you the label of the existing data points.</text><text start="105" dur="5">So please check any of those boxes that you believe are now </text><text start="110" dur="4">1-nearest neighbor that carry a positive label.</text><text start="114" dur="7">And the algorithm, of course, searches for the nearest point in this Euclidean space and just copies its label. </text></transcript></video><video title="51 kNN Definition" id="r1PWSm6xMUk" length="68"><transcript><text start="0" dur="3">And the answer was: This is a positive point,</text><text start="3" dur="2">and this is a positive point.</text><text start="5" dur="3">These 2 points over here are negative. </text><text start="8" dur="4">So let&amp;#39;s define k-nearest neighbors. </text><text start="12" dur="4">The algorithm is really blatantly simple. </text><text start="16" dur="4">In the learning step, you simply memorize all data.</text><text start="20" dur="3">If a new example comes along with the input value you know</text><text start="23" dur="5">but which you wish to classify, you do the following.</text><text start="28" dur="3">You first find the k-nearest neighbors.</text><text start="31" dur="7">And then you return the majority class label as your final class label for the new example.</text><text start="38" dur="3">Simple, isn&amp;#39;t it?</text><text start="41" dur="4">So here&amp;#39;s a somewhat contrived situation of the data point we wish to classify </text><text start="45" dur="8">where the label data lies on the spiral of increasing diameter as it goes outwards. 
</text><text start="53" dur="4">Please answer for me in this quiz what class label you&amp;#39;d assign</text><text start="57" dur="11">for k = 1, k = 3, 5, 7, and all the way to 9.</text></transcript></video><video title="52 Answer" id="PoRpuj4bijU" length="41"><transcript><text start="0" dur="2">And this is an easy answer. </text><text start="2" dur="2">The nearest neighbor is this point over here, </text><text start="4" dur="2">so for 1 we say plus.</text><text start="6" dur="3">For 3 neighbors, we get 2 positive, 1 negative. </text><text start="9" dur="2">It&amp;#39;s still plus.</text><text start="11" dur="3">For 5 neighbors--1, 2, 3, 4, 5--</text><text start="14" dur="2">we get 3 negative, 2 positive. </text><text start="16" dur="2">It&amp;#39;s a minus. </text><text start="18" dur="3">For 7, we get 3 positive but still 4 negative, </text><text start="21" dur="2">so it&amp;#39;s negative.</text><text start="23" dur="3">And for 9, the positives outweigh the negative, </text><text start="26" dur="2">so you get a plus. </text><text start="28" dur="5">Obviously, as you can see, as K increases, </text><text start="33" dur="2">more and more data points are being consulted.</text><text start="35" dur="2">So when K finally becomes 9, </text><text start="37" dur="4">all those data points are in and make a much smoother result. </text></transcript></video><video title="53 k as Smoothing Parameter" id="mhrG7atAn4Q" length="104"><transcript><text start="0" dur="5">Just as in the Laplacian smoothing example before, </text><text start="5" dur="3">the value of k is a smoothing parameter. </text><text start="8" dur="3">It makes the function less scattered. 
</text><text start="11" dur="4">Here is an example of k=1</text><text start="15" dur="3">for a 2-class nearest neighbor problem.</text><text start="18" dur="7">You can see the separation boundary is what&amp;#39;s called a Voronoi diagram</text><text start="25" dur="4">between the positive and negative class, and </text><text start="29" dur="4">in cases where there is noise between these class boundaries, </text><text start="33" dur="5">you&amp;#39;ll find really funny, complex boundaries as indicated over here. </text><text start="38" dur="7">Particularly interesting is this guy over here where the class of this circle over here</text><text start="45" dur="5">protrudes way into the otherwise solid class.</text><text start="50" dur="5">Now, as you go to k=3, you get this graph over here,</text><text start="55" dur="2">which is smoother.</text><text start="57" dur="4">So if you are over here, your two nearest neighbors are of this type over there,</text><text start="61" dur="4">and you get a uniform class over here. </text><text start="65" dur="4">In this region over here, you get uniform classes as solid classes</text><text start="69" dur="2">as shown over here. </text><text start="71" dur="4">The more you drive up k, the more clean this decision boundary becomes, </text><text start="75" dur="4">but the more outliers are actually misclassified as well. </text><text start="79" dur="3">So if I go back to my k-nearest neighbor method,</text><text start="82" dur="4">we just learned that k is a regularizer.</text><text start="86" dur="4">It controls the complexity of the k-nearest neighbor algorithm. </text><text start="90" dur="4">and the larger k is, the smoother the output. </text><text start="94" dur="4">We can, once again, use cross-validation to find the optimal k</text><text start="98" dur="4">because there is an inherent trade off--between the complexity of what we want to fit</text><text start="102" dur="2">and the goodness of the fit. 
</text></transcript></video><video title="54 Problems with kNN" id="tOSoqfK9UNE" length="128"><transcript><text start="0" dur="2">What are the problems of kNN?</text><text start="2" dur="2">Well, I would argue that there are two.</text><text start="4" dur="3">One is very large data sets,</text><text start="7" dur="3">and one is very large feature spaces.</text><text start="10" dur="4">Now the first one results in lengthy searches</text><text start="14" dur="3">when you try to find the k nearest neighbors.</text><text start="17" dur="2">Now, fortunately there are</text><text start="19" dur="3">methods to search efficiently.</text><text start="22" dur="2">Often you represent your data</text><text start="24" dur="3">not by a linear list, in which case the search</text><text start="27" dur="2">would be linear in the number of data points,</text><text start="29" dur="5">but by a tree, where the search becomes logarithmic.</text><text start="34" dur="4">The method of choice is called k-d trees,</text><text start="38" dur="2">but there are many other ways </text><text start="40" dur="3">to represent data points as trees.</text><text start="43" dur="2">Now very large feature spaces </text><text start="45" dur="3">cause more of a problem.</text><text start="48" dur="3">It turns out computing nearest neighbors,</text><text start="51" dur="3">as the feature space for the input vector increases,</text><text start="54" dur="3">becomes increasingly difficult,</text><text start="57" dur="3">and the tree methods become increasingly brittle.</text><text start="60" dur="3">And the reason is shown in the following graph:</text><text start="63" dur="3">If you graph input dimension against </text><text start="66" dur="3">the average edge length of your neighborhood,</text><text start="69" dur="3">you&amp;#39;ll find that for randomly chosen points</text><text start="72" dur="4">very quickly all points are really far away.</text><text start="76" dur="3">The edge length of one 
is obtained</text><text start="79" dur="4">if your query point</text><text start="83" dur="3">is one unit away from the nearest neighbor.</text><text start="86" dur="2">If you have one hundred dimensions,</text><text start="88" dur="1">that is almost certain.</text><text start="89" dur="2">Why is that?</text><text start="91" dur="2">Well, in one hundred dimensions,</text><text start="93" dur="2">there is bound to be one where just by chance</text><text start="95" dur="2">you&amp;#39;re far away.</text><text start="97" dur="2">The number of points you need</text><text start="99" dur="1">to get something close</text><text start="100" dur="5">grows exponentially with the number of dimensions.</text><text start="105" dur="2">So, for any fixed data set size</text><text start="107" dur="2">you will find yourself in a situation</text><text start="109" dur="3">where all your neighbors are far away.</text><text start="112" dur="2">Nearest neighbor works really well</text><text start="114" dur="4">for small input spaces like three or four dimensions.</text><text start="118" dur="1">It works very poorly</text><text start="119" dur="2">if your input space is twenty, twenty-five,</text><text start="121" dur="2">or maybe one hundred dimensions.</text><text start="123" dur="3">So don&amp;#39;t trust nearest neighbor to do a good job</text><text start="126" dur="2">if the dimension of your input space is high.</text></transcript></video><video title="55 Congratulations" id="Ta7tyUB-EqM" length="72"><transcript><text start="0" dur="2.069">So congratulations.</text><text start="2.069" dur="2.702">You&amp;#39;ve just learned a lot about machine learning.</text><text start="4.771" dur="2.822">We focused on supervised machine learning</text><text start="7.593" dur="2.417">which deals with situations</text><text start="10.01" dur="1.835">where you have input vectors</text><text start="11.845" dur="1.935">and given output labels</text><text start="13.78" dur="3.008">and your goal is to predict 
the output label</text><text start="16.788" dur="1.831">from an input vector.</text><text start="18.619" dur="3.136">And we looked into parametric models</text><text start="21.755" dur="1.906">like Naive Bayes</text><text start="23.661" dur="1.798">and non-parametric models.</text><text start="25.459" dur="2.302">We talked about classification</text><text start="27.761" dur="1.769">where the output is discrete</text><text start="29.53" dur="2.568">versus regression where the output is continuous </text><text start="32.098" dur="2.503">and we looked at samples of techniques</text><text start="34.601" dur="2.002">for each of these situations.</text><text start="36.603" dur="1.569">Now obviously</text><text start="38.172" dur="2.402">we just scratched the surface of machine learning.</text><text start="40.574" dur="1.175">There are books written about it</text><text start="41.749" dur="1.561">and courses taught about it.</text><text start="43.31" dur="2.736">Machine learning is a super fascinating topic.</text><text start="46.046" dur="2.603">It&amp;#39;s the one within artificial intelligence</text><text start="48.649" dur="1.468">I love the most. 
</text><text start="50.117" dur="2.836">It&amp;#39;s really great about the real world</text><text start="52.953" dur="1.916">as we gain more data</text><text start="54.869" dur="0.859">like the world wide web</text><text start="55.728" dur="1.162">or medical data sets </text><text start="56.89" dur="1.402">or financial data sets.</text><text start="58.292" dur="1.601">Machine learning is poised </text><text start="59.893" dur="2.244">to become more and more important.</text><text start="62.137" dur="3.361">I hope that the things you learned in this class so far</text><text start="65.498" dur="1.538">really excite you</text><text start="67.036" dur="2.634">and entice you to apply machine learning</text><text start="69.67" dur="1.546">to problems that you face </text><text start="71.216" dur="1.79">in your professional life.</text></transcript></video></group><group title="Unit 6" count="46"><video title="1 Unsupervised Learning" id="s4Ou3NRJc-s" length="71"><transcript><text start="0" dur="2">[Narrator] So welcome to the class </text><text start="2" dur="2">on unsupervised learning.</text><text start="4" dur="2">We talked a lot about supervised learning</text><text start="6" dur="3">in which we are given data and target labels. 
</text><text start="9" dur="3">In unsupervised learning we&amp;#39;re just given data.</text><text start="12" dur="3">So here&amp;#39;s a data matrix of </text><text start="15" dur="2">data items of N features each.</text><text start="17" dur="2">There are M in total.</text><text start="19" dur="2">So the task of unsupervised learning is</text><text start="21" dur="4">to find structure in data of this type.</text><text start="25" dur="3">To illustrate why this is an interesting problem</text><text start="28" dur="3">let me start with a quiz.</text><text start="31" dur="3">Suppose we have 2 feature values.</text><text start="34" dur="3">One over here, and one over here,</text><text start="37" dur="2">and our data looks as follows.</text><text start="39" dur="2">Even though we haven&amp;#39;t been told about</text><text start="41" dur="2">anything in unsupervised learning, I&amp;#39;d like to</text><text start="43" dur="3">quiz your intuition on the following 2 questions:</text><text start="46" dur="2">1. Is there structure? </text><text start="48" dur="3">Or put differently do you think there&amp;#39;s </text><text start="51" dur="2">something to be learned about data like this,</text><text start="53" dur="4">or is it entirely random?</text><text start="57" dur="4">And second, to narrow this down,</text><text start="61" dur="2">it feels that there are clusters </text><text start="63" dur="2">of data the way I drew it. </text><text start="65" dur="3">So how many clusters can you see?</text><text start="68" dur="3">And I give you a couple of choices, 1, 2, 3, 4, or none.</text></transcript></video><video title="1a Answer" id="kFwsW2VtWWA" length="26"><transcript><text start="0" dur="3">[Narrator] The answer to the first question is yes, there is structure.</text><text start="3" dur="3">Obviously these data seem not to be completely randomly determined. 
</text><text start="6" dur="3">There seem to be, for me, 2 clusters.</text><text start="9" dur="3">So the correct answer for the second question is 2.</text><text start="12" dur="3">There&amp;#39;s a cluster over here, and there&amp;#39;s a cluster over here.</text><text start="15" dur="2">So one of the tasks of unsupervised learning </text><text start="17" dur="4">will be to recover the number of clusters, and</text><text start="21" dur="2">the centers of these clusters, and the variances of these clusters in data </text><text start="23" dur="3">of the type I&amp;#39;ve just shown you.</text></transcript></video><video title="1b Question" id="GxeyaI9_P4o" length="43"><transcript><text start="0" dur="3">[Narrator] Let me give you a second quiz.</text><text start="3" dur="3">Again, we haven&amp;#39;t talked about any details.</text><text start="6" dur="2">I would like to get your intuition on the following question.</text><text start="8" dur="2">Suppose in a two dimensional space,</text><text start="10" dur="2">all data lies as follows.</text><text start="12" dur="2">This may be reminiscent of the question I </text><text start="14" dur="3">asked you for housing prices and square footage. 
</text><text start="17" dur="3">Suppose we have 2 axes, X1 and X2.</text><text start="20" dur="2">I&amp;#39;m going to ask you 2 questions here.</text><text start="22" dur="3">One is what is the dimensionality of this space</text><text start="25" dur="2">in which this data falls, and the second one </text><text start="27" dur="2">is an intuitive question which is </text><text start="29" dur="3">how many dimensions do you need </text><text start="32" dur="3">to represent this data to capture the essence,</text><text start="35" dur="3">and, again, this is not a clear                                           crisp 0 or 1 type question,</text><text start="38" dur="2">but give me your best guess.</text><text start="40" dur="3">How many dimensions are intuitively needed?</text></transcript></video><video title="1c Answer" id="4uj36iX1Pkk" length="?"><transcript><text start="0" dur="3">No subtitles...</text></transcript></video><video title="2 Terminology" id="EZEOXNFgu8M" length="87"><transcript><text start="0" dur="3">[Narrator] So to start with some lingo                          about unsupervised learning. 
</text><text start="3" dur="3">If you look at this as a probabilist, you&amp;#39;re given data, and </text><text start="6" dur="3">we typically assume the data is IID,</text><text start="9" dur="2">which means identically distributed and independently drawn </text><text start="11" dur="2">from the same distribution.</text><text start="13" dur="2">So a good chunk of unsupervised learning</text><text start="15" dur="3">seeks to recover the underlying density of the </text><text start="18" dur="3">probability distribution that generated the data.</text><text start="21" dur="2">It&amp;#39;s called density estimation.</text><text start="23" dur="2">As we will find out, our methods for clustering</text><text start="25" dur="2">are versions of density estimation </text><text start="27" dur="2">using what&amp;#39;s called mixture models.</text><text start="29" dur="2">Dimensionality reduction is also a method</text><text start="31" dur="2">for doing density estimation, </text><text start="33" dur="2">and there are many others.</text><text start="35" dur="2">Unsupervised learning can be applied to find </text><text start="37" dur="2">structure in data.</text><text start="39" dur="2">One of the fascinating ones that </text><text start="41" dur="2">I believe exists is called </text><text start="43" dur="2">blind signal separation.</text><text start="45" dur="3">Suppose you are given a microphone, and</text><text start="48" dur="3">two people simultaneously talk, and you </text><text start="51" dur="3">record the joint signal of both of those speakers.</text><text start="54" dur="2">Blind source separation or blind signal separation</text><text start="56" dur="3">addresses the question of can you recover</text><text start="59" dur="2">those two speakers and filter</text><text start="61" dur="2">the data into two separate streams.</text><text start="63" dur="2">One for 
each speaker.</text><text start="65" dur="2">Now this is a really complicated unsupervised</text><text start="67" dur="2">learning task, but is one of the many things</text><text start="69" dur="2">that don&amp;#39;t require target signals as </text><text start="71" dur="2">unsupervised learning yet make for </text><text start="73" dur="2">really interesting learning problems.</text><text start="75" dur="2">This can be construed as an example</text><text start="77" dur="2">of what&amp;#39;s called factor analysis where each </text><text start="79" dur="4">speaker is a factor in the joint signal that your microphone records.</text><text start="83" dur="2">There are many other examples of unsupervised learning.</text><text start="85" dur="2">I will show you a few in a second.</text></transcript></video><video title="3 Google Street View and Clustering" id="W2dkDmHFMWg" length="126"><transcript><text start="0" dur="3">Here is one of my favorite examples of unsupervised learning--</text><text start="3" dur="2">one that is yet unsolved.</text><text start="5" dur="3">At Google, I had the opportunity to participate--</text><text start="8" dur="2">in the building of Street View,</text><text start="10" dur="3">which is a huge photographic database--</text><text start="13" dur="3">of many, many streets in the world.</text><text start="16" dur="2">As you dive into Street View--</text><text start="18" dur="2">you can get ground imagery--</text><text start="20" dur="3">of almost any location in the world--</text><text start="23" dur="3">like this house here, that I chose at random.</text><text start="26" dur="3">In these images, there are vast regularities.</text><text start="29" dur="2">You can go somewhere else--</text><text start="31" dur="2">and you&amp;#39;ll find that the type of objects--</text><text start="33" dur="2">visible in Street View--</text><text start="35" dur="2">is not entirely random.</text><text start="37" dur="2">For example, there are many 
images of homes--</text><text start="39" dur="2">many images of cars--</text><text start="41" dur="3">trees, pavement, lane markers--</text><text start="44" dur="3">stop signs, just to name a few.</text><text start="47" dur="5">So one of the fascinating, unsolved, unsupervised learning tasks is:</text><text start="52" dur="3">Can you take hundreds of billions of images--</text><text start="55" dur="3">as comprised in the Street View data set--</text><text start="58" dur="3">and discover from it that there are concepts such as--</text><text start="61" dur="4">trees, lane markers, stop signs, cars, and pedestrians?</text><text start="65" dur="2">It seems to be tedious to hand label each image--</text><text start="67" dur="2">for the occurrence of such objects.</text><text start="69" dur="2">And attempts to do so--</text><text start="71" dur="3">have resulted in very small image data sets.</text><text start="74" dur="2">Humans can learn from data--</text><text start="76" dur="2">even without explicit target labels.</text><text start="78" dur="2">We often just observe.</text><text start="80" dur="3">In observing, we apply unsupervised learning techniques.</text><text start="83" dur="4">So one of the great, great open questions of artificial intelligence is:</text><text start="87" dur="5">Can you observe many intersections and many streets and many roads--</text><text start="92" dur="3">and learn from it what concepts are contained in the imagery?</text><text start="95" dur="4">Of course, I can&amp;#39;t teach you anything as complex in this class.</text><text start="99" dur="2">I don&amp;#39;t even know the answer myself.</text><text start="101" dur="2">So let me start with something simple.</text><text start="103" dur="4">Clustering. 
Clustering is the most basic form of unsupervised learning.</text><text start="107" dur="3">And I will tell you about two algorithms that are very related.</text><text start="110" dur="2">One is called k-means,</text><text start="112" dur="3">one is called expectation maximization.</text><text start="115" dur="4">K-means is a nice, intuitive algorithm to derive clusterings.</text><text start="119" dur="3">Expectation maximization is a probabilistic-- </text><text start="122" dur="2">generalization of k-means.</text><text start="124" dur="2">They were derived from first principles.</text></transcript></video><video title="4 k-Means Clustering Example" id="zaKjh2N8jN4" length="112"><transcript><text start="0" dur="3">Let me explain k-means by an example.</text><text start="3" dur="4">Suppose we&amp;#39;re given the following data points in a 2-dimensional space.</text><text start="7" dur="5">K-means estimates, for a fixed number k (here k = 2),</text><text start="12" dur="5">the best centers of clusters representing those data points.</text><text start="17" dur="3">Those are found iteratively by the following algorithm.</text><text start="20" dur="5">Step 1: Guess cluster centers at random, as shown over here with the 2 stars.</text><text start="25" dur="5">Step 2: Assign to each cluster center, even though they are randomly chosen, </text><text start="30" dur="3">the most likely corresponding data points.</text><text start="33" dur="3">This is done by minimizing Euclidean distance.</text><text start="36" dur="5">In particular, each cluster center represents half of the space.</text><text start="41" dur="4">And the line that separates the space between the left and right cluster center </text><text start="45" dur="3">is the equidistant line, often called a Voronoi graph.</text><text start="48" dur="5">All the data points on the left correspond to the red cluster,</text><text start="53" dur="2">and the ones on the right to the green cluster. 
</text><text start="55" dur="5">Step 3: Given now we have a correspondence between the data points and cluster centers, </text><text start="60" dur="6">find the optimal cluster center that corresponds to the points associated with the cluster center. </text><text start="66" dur="3">Our red cluster center has only 2 data points attached. </text><text start="69" dur="4">So the optimal cluster center would be the halfway point in the middle.</text><text start="73" dur="3">Our right cluster center has more than 2 points attached;</text><text start="76" dur="5">yet it isn&amp;#39;t placed optimally, as you can see as they move with the animation back and forth.</text><text start="81" dur="4">By minimizing the joint quadratic distance to all of those points, </text><text start="85" dur="4">the new cluster center has attained the center of those data points. </text><text start="89" dur="6">Now the final step is iterate. Go back and reassign cluster centers.</text><text start="95" dur="4">Now the Voronoi diagram has shifted, and the points are associated differently,</text><text start="99" dur="6">and then reevaluate what the optimal cluster center looks like given the associated points.</text><text start="105" dur="2">And in both cases we see significant motion.</text><text start="107" dur="2">Repeat. Now this is the clustering. 
</text><text start="109" dur="3">The point association doesn&amp;#39;t change, and as a result, we just converged.</text></transcript></video><video title="4a k-Means Algorithm" id="myqnyxkdQpc" length="116"><transcript><text start="0" dur="3">You just learned about an exciting clustering algorithm</text><text start="3" dur="4">that&amp;#39;s really easy to implement called k-means.</text><text start="7" dur="2">To give you the algorithm in pseudocode, </text><text start="9" dur="6">initially we select k cluster centers at random and then we repeat.</text><text start="15" dur="6">In the correspondence step, we assign each data point to its nearest cluster center,</text><text start="21" dur="5">and then we calculate the new cluster center by the mean of the corresponding data points.</text><text start="26" dur="4">We repeat this until nothing changes any more.</text><text start="30" dur="7">Now special care has to be taken if a cluster center becomes empty--that means no data point is associated.</text><text start="37" dur="6">In which case, we just restart cluster centers at random that have no corresponding points.</text><text start="43" dur="3">Empty cluster centers restart at random.</text><text start="46" dur="8">This algorithm is known to converge to a locally optimal clustering of data points.</text><text start="54" dur="4">The general clustering problem is known to be NP-hard.</text><text start="58" dur="5">So a locally optimal solution, in a way, is the best we can hope for. </text><text start="63" dur="3">Now let me talk about problems with k-means.</text><text start="66" dur="3">First we need to know k, the number of cluster centers.</text><text start="69" dur="1">Second, as I mentioned, there are local minima. 
</text><text start="70" dur="6">For example, for 4 data points like this and 2 cluster centers that happen to be just over here,</text><text start="76" dur="4">with the separation line like this there would be no motion of k-means.</text><text start="80" dur="5">Even though moving one over here and one over there would give a better solution.</text><text start="85" dur="3">There&amp;#39;s a general problem of high dimensionality of the space</text><text start="88" dur="4">that is not dissimilar from the way k-nearest neighbor suffers from high dimensionality.</text><text start="92" dur="3">And then there&amp;#39;s lack of a mathematical basis. </text><text start="95" dur="3">Now if you&amp;#39;re a practitioner, you might not care about a mathematical basis.</text><text start="98" dur="4">But for the sake of this class, let&amp;#39;s just care about it.</text><text start="102" dur="3">So here&amp;#39;s a first quiz for k-means.</text><text start="105" dur="5">Given the following two cluster centers, C1 and C2, </text><text start="110" dur="6">click on exactly those points that are associated with C1 and not with C2.</text></transcript></video><video title="4b Answer" id="0RMhiWfe73M" length="17"><transcript><text start="0" dur="5">And the answer is these 4 points over here.</text><text start="5" dur="6">And the reason is, if you draw the line of equal distance between C1 and C2,</text><text start="11" dur="4">the separation of these 2 cluster areas falls over here.</text><text start="15" dur="2">C2 is down there. C1 is up here.</text></transcript></video><video title="4c Question" id="oQYC32jKCvg" length="17"><transcript><text start="0" dur="1.05">So here&amp;#39;s my second quiz. 
</text><text start="1.05" dur="4.95">Given the association that we just derived for C1, where do you think the new cluster center, </text><text start="6" dur="7">C1, will be found after a single step of estimating its best location given the associated points?</text><text start="13" dur="1">I&amp;#39;ll give you a couple of choices.</text><text start="14" dur="3">Please click on the one that you find most plausible. </text></transcript></video><video title="4d Answer" id="uxdSfp9n2CY" length="34"><transcript><text start="0" dur="2">And the answer is, over here. </text><text start="2" dur="6">These 4 data points are associated with C1, so we can safely ignore all the other ones.</text><text start="8" dur="4">This one over here would be at the center of the 3 data points over here,</text><text start="12" dur="5">but this one pulls back this data point drastically towards it.</text><text start="17" dur="5">This is about the best trade-off between these 3 points over here that all have a string</text><text start="22" dur="4">attached and pull in this direction, compared to this point over here.</text><text start="26" dur="3">Any of the other ones don&amp;#39;t even lie between those points,</text><text start="29" dur="3">and therefore won&amp;#39;t be good cluster centers.</text><text start="32" dur="2">The one over here is way too far to the right.</text></transcript></video><video title="4e Question" id="3zlXl82LUVI" length="12"><transcript><text start="0" dur="3">In our next quiz let&amp;#39;s assume we&amp;#39;ve done one iteration,</text><text start="3" dur="4.5">and the cluster center of C1 moved over there and C2 moved over here. </text><text start="7.5" dur="5.5">Can you once again click on all those data points that correspond to C1?</text></transcript></video><video title="4f Answer" id="LgxZJ7GAB3o" length="24"><transcript><text start="0" dur="3">And the answer is now simple. 
It&amp;#39;s this one over here.</text><text start="3" dur="2">This one, this one, and this one.</text><text start="5" dur="5">And the reason is, the line separating both clusters runs around here.</text><text start="10" dur="6">That means all the area over here is C2 territory, and the area over here is C1 territory.</text><text start="16" dur="4">Obviously as we now iterate k-means, these clusters that have been moved straight over </text><text start="20" dur="4">here will be able to stay, whereas C2 will end up somewhere over here.</text></transcript></video><video title="5 Expectation Maximization" id="_DhelJs0BFc" length="252"><transcript><text start="0" dur="5">So, let&amp;#39;s now generalize k-means into expectation maximization.</text><text start="5" dur="5">Expectation maximization is an algorithm that uses actual probability distributions</text><text start="10" dur="4">to describe what we&amp;#39;re doing, and it&amp;#39;s in many ways more general,</text><text start="14" dur="3">and it&amp;#39;s also nice in that it really has a probabilistic basis.</text><text start="17" dur="4">To get there, I have to take a detour and tell you all about Gaussians,</text><text start="21" dur="3">or the normal distribution, and the reason is so far, </text><text start="24" dur="2">we&amp;#39;ve just encountered discrete distributions, </text><text start="26" dur="4">and Gaussians will be the first example of a continuous distribution.</text><text start="30" dur="4">Many of you know that a Gaussian is described by a density that looks as follows,</text><text start="34" dur="7">where the mean is called mu, and the variance is called sigma squared.</text><text start="41" dur="6">And for any X along the horizontal axis, the density is given by the following function:</text><text start="47" dur="5">1 over square root of 2 pi times sigma, and then an exponential function</text><text start="52" dur="4">of minus &#xBD; of x - mu squared over sigma 
squared.</text><text start="56" dur="5">This function might look complex, but it&amp;#39;s also very, very beautiful.</text><text start="61" dur="6">It peaks at X = mu where the value in the exponent becomes 0.</text><text start="67" dur="4">And towards plus or minus infinity, it goes to 0 quickly.</text><text start="71" dur="3">In fact, exponentially fast.</text><text start="74" dur="2">The argument inside is a quadratic function.</text><text start="76" dur="4">The exponential function makes it exponential.</text><text start="80" dur="3">And this over here is a normalizer to make sure that the area underneath</text><text start="83" dur="6">sums up to one, which is characteristic of any probability density function.</text><text start="89" dur="3">If you map this back to our discrete random variables, </text><text start="92" dur="5">for each possible X, we can now assign a density value, </text><text start="97" dur="4">which is the function of this, and that&amp;#39;s effectively</text><text start="101" dur="2">the probability that this X might be drawn.</text><text start="103" dur="5">Now, the space itself is infinite, so any individual value will have a probability of 0,</text><text start="108" dur="4">but what you can do is you can make an interval, A and B, </text><text start="112" dur="4">and the area underneath this function is the total probability </text><text start="116" dur="4">that an experiment will come up between A and B.</text><text start="120" dur="3">Clearly, it&amp;#39;s more likely to generate values around mu</text><text start="123" dur="4">than it is to generate values in the periphery somewhere over here.</text><text start="127" dur="2">And just for completeness, I&amp;#39;m going to give you the formula </text><text start="129" dur="3">for what&amp;#39;s called the multi-variate Gaussian</text><text start="132" dur="5">where multi-variate means nothing else but we have more than one input variable.</text><text start="137" dur="4">You might have 
a Gaussian over a 2-dimensional space or a 3-dimensional space.</text><text start="141" dur="3">Often, these Gaussians are drawn by what&amp;#39;s called level sets, </text><text start="144" dur="2">sets of equal probability.</text><text start="146" dur="4">Here&amp;#39;s one in a 2-dimensional space, X1 and X2.</text><text start="150" dur="5">The Gaussian itself can be thought of as coming out of the paper towards me</text><text start="155" dur="4">where the most likely or highest point of probability is the center over here.</text><text start="159" dur="4">And these rings measure areas of equal probability.</text><text start="163" dur="6">The formula for a multi-variate Gaussian looks as follows:</text><text start="169" dur="4">N is the number of dimensions in the input space.</text><text start="173" dur="4">Sigma is a covariance matrix that generalizes the value over here.</text><text start="177" dur="5">And the inner product inside the exponential</text><text start="182" dur="6">is now done using linear algebra where this is the difference between</text><text start="188" dur="4">a probe point and the mean vector mu </text><text start="192" dur="4">transposed sigma to the minus 1 times X - mu.</text><text start="196" dur="5">You can find this formula in any textbook or web page </text><text start="201" dur="4">on Gaussians or multi-variate normal distributions.</text><text start="205" dur="4">It looks cryptic at first, but the key thing to remember is</text><text start="209" dur="4">it&amp;#39;s just a generalization of the 1-dimensional case.</text><text start="213" dur="3">We have a quadratic area over here as manifested by the product </text><text start="216" dur="2">of this guy and this guy.</text><text start="218" dur="4">We have a normalization by a variance or covariance </text><text start="222" dur="6">as shown by this number over here or the inverse matrix over here.</text><text start="228" dur="3">And then this entire thing is an exponential form in 
both cases,</text><text start="231" dur="4">and the normalizer looks a little different in the multi-variate case,</text><text start="235" dur="4">but all it does is make sure that the volume underneath adds up to 1</text><text start="239" dur="3">to make it a legitimate probability density function.</text><text start="242" dur="5">For most of this explanation, I will stick with 1-dimensional Gaussians,</text><text start="247" dur="3">so all you have to do is to worry about this formula over here,</text><text start="250" dur="2">but this is given just for completeness.</text></transcript></video><video title="5a Gaussian Learning" id="S4-694lgERs" length="132"><transcript><text start="0" dur="6">I will now talk about fitting Gaussians to data or Gaussian learning.</text><text start="6" dur="3">You may be given some data points, and you might worry about</text><text start="9" dur="3">what is the best Gaussian fitting the data?</text><text start="12" dur="6">Now, to explain this, let me first tell you what parameters characterize a Gaussian.</text><text start="18" dur="6">In the 1-dimensional case, it is mu and sigma squared.</text><text start="24" dur="4">Mu is the mean. 
Sigma squared is called the variance.</text><text start="28" dur="6">If we look at the formula of a Gaussian, it&amp;#39;s a function over any possible input X,</text><text start="34" dur="4">and it requires knowledge of mu and sigma squared.</text><text start="38" dur="4">And as before, I&amp;#39;m just restating what I said before.</text><text start="42" dur="6">We get this function over here that specifies any probability</text><text start="48" dur="5">for a value X given a specific mu and sigma squared.</text><text start="53" dur="8">Suppose we wish to fit data, and our data is 1-dimensional, and it looks as follows.</text><text start="61" dur="2">Just looking at this diagram makes me believe </text><text start="63" dur="3">that there&amp;#39;s a high density of data points over here</text><text start="66" dur="3">and a fading density of data points over there,</text><text start="69" dur="4">so maybe the most likely Gaussian will look a little bit like this</text><text start="73" dur="4">where this is mu and this is sigma.</text><text start="77" dur="4">There are really easy formulas for fitting data to Gaussians,</text><text start="81" dur="2">and I&amp;#39;ll give you the result right now.</text><text start="83" dur="7">The optimal or most likely mean is just the average of the data points.</text><text start="90" dur="3">There&amp;#39;s M data points, X1 to Xm.</text><text start="93" dur="2">The average will look like this.</text><text start="95" dur="6">The sum of all data points divided by the total number of data points.</text><text start="101" dur="3">That&amp;#39;s called the average, and once you calculate the average,</text><text start="104" dur="4">the sigma squared is obtained by a similar normalization</text><text start="108" dur="3">in a slightly more complex sum.</text><text start="111" dur="3">We sum the deviation from the mean </text><text start="114" dur="4">and compute the average deviation to the square from the mean,</text><text start="118" 
dur="2">and that gives us sigma squared.</text><text start="120" dur="3">So, intuitively speaking, the formulas are really easy.</text><text start="123" dur="3">Mu is the mean, or the average.</text><text start="126" dur="6">Sigma squared is the average quadratic deviation from the mean, as shown over here.</text></transcript></video><video title="5b Maximum Likelihood" id="rMcw3uu4efY" length="250"><transcript><text start="0" dur="3">Now I want to take a second to convince ourselves </text><text start="3" dur="3">this is indeed the maximum likelihood estimate</text><text start="6" dur="3">of the mean and the variance. </text><text start="9" dur="3">Suppose our data looks like this--</text><text start="12" dur="3">There&amp;#39;s &amp;quot;M&amp;quot; data points.</text><text start="15" dur="3">And the probability of those data points </text><text start="18" dur="4">for any Gaussian model--mu and sigma squared</text><text start="22" dur="7">is the product of the individual data likelihoods for each x_i. </text><text start="29" dur="5">And if you plug in our Gaussian formula, you get the following--</text><text start="34" dur="3">This is the normalizer multiplied &amp;quot;M&amp;quot; times </text><text start="37" dur="6">where the square root is now drawn into the half over here,</text><text start="43" dur="2">and here is our joint exponential.</text><text start="45" dur="4">We took the product of the individual exponentials</text><text start="49" dur="4">and moved it up straight in here where it becomes a sum. </text><text start="53" dur="5">So the best estimates for mu and sigma squared</text><text start="58" dur="3">are those that maximize this entire expression over here</text><text start="61" dur="4">for a given data set X1 to Xm. </text><text start="65" dur="3">So we seek to maximize this over the unknown parameters</text><text start="68" dur="2">mu and sigma squared.</text><text start="70" dur="2">And now I will apply a trick. 
</text><text start="72" dur="2">Instead of maximizing this expression,</text><text start="74" dur="3">I will maximize the logarithm of this expression.</text><text start="77" dur="2">The logarithm is a monotonic function.</text><text start="79" dur="4">So let&amp;#39;s maximize instead the logarithm</text><text start="83" dur="4">where this expression over here resolves to this expression over here.</text><text start="87" dur="5">The multiplication becomes a minus sign from over here,</text><text start="92" dur="3">and this is the argument inside the exponent</text><text start="95" dur="2">written slightly differently,</text><text start="97" dur="3">by pulling the 2 sigma squared to the left.</text><text start="100" dur="2">So let&amp;#39;s maximize this one instead. </text><text start="102" dur="3">The maximum is obtained where the first </text><text start="105" dur="3">derivative is zero.</text><text start="108" dur="3">If we do this for our variable mu, </text><text start="111" dur="2">we take the &amp;quot;log f&amp;quot; expression and</text><text start="113" dur="3">compute the derivative with respect to mu,</text><text start="116" dur="2">we get the following--</text><text start="118" dur="3">This expression does not depend on mu at all, so it falls out.</text><text start="121" dur="4">And we can still get this expression over here, which we&amp;#39;ve set to zero.</text><text start="125" dur="6">And now we can multiply everything by sigma squared and still get zero,</text><text start="131" dur="4">and then bring the Xi to the right and the mu to the left.</text><text start="135" dur="9">The sum over all i of mu is m times mu, which equals the sum over i of x_i.</text><text start="144" dur="7">Hence, we proved that the mean is indeed the maximum likelihood estimate</text><text start="151" dur="2">for the Gaussian.</text><text start="153" dur="5">This is now easily repeated for the variance.</text><text start="158" dur="3">If you compute the derivative of this 
expression over here</text><text start="161" dur="2">with respect to the variance,</text><text start="163" dur="5">we get minus &amp;quot;m&amp;quot; over sigma, which happens to be the derivative</text><text start="168" dur="2">of this expression over here.</text><text start="170" dur="3">Keep in mind that the derivative of </text><text start="173" dur="4">a logarithm is one over its internal argument</text><text start="177" dur="4">times, by the chain rule, the derivative of the internal argument,</text><text start="181" dur="4">which if you work out becomes this expression over here.</text><text start="185" dur="3">And this guy over here changes signs </text><text start="188" dur="2">but becomes the following.</text><text start="190" dur="3">And again, you move this guy to the left side,</text><text start="193" dur="5">multiply by sigma cubed, and divide by &amp;quot;m&amp;quot;.</text><text start="198" dur="4">So we get the following result over here.</text><text start="202" dur="3">You might take a moment to verify these steps over here,</text><text start="205" dur="2">I was a little bit fast,</text><text start="207" dur="5">but this is relatively straightforward mathematics.</text><text start="212" dur="2">And if you will verify them,</text><text start="214" dur="2">you will find that the maximum likelihood estimate</text><text start="216" dur="3">for sigma squared is the average squared</text><text start="219" dur="4">deviation of the data points from the mean mu.</text><text start="223" dur="2">This gives us a very nice basis to fit</text><text start="225" dur="3">Gaussians to data points.</text><text start="228" dur="4">So keeping these formulas in mind, here&amp;#39;s a quick quiz,</text><text start="232" dur="6">in which I ask you to actually calculate the mean and variance for a data sequence. </text><text start="238" dur="4">So suppose the data you observe is 3, 4, 5, 6, and 7. </text><text start="242" dur="2">There are 5 data points. 
</text><text start="244" dur="3">Compute for me the mean and the variance</text><text start="247" dur="3">using the maximum likelihood estimator I just gave you.</text></transcript></video><video title="5c Answer" id="n1Ikbbe1g5M" length="33"><transcript><text start="0" dur="2">So the mean is obviously 5,</text><text start="2" dur="2">it&amp;#39;s the middle value over here. </text><text start="4" dur="4">If I add those things together, I get 25 and divide by 5.</text><text start="8" dur="2">The average value over here is 5.</text><text start="10" dur="2">The more interesting case is sigma square,</text><text start="12" dur="2">and I do this in the following steps--</text><text start="14" dur="2">I subtract 5 from each of the data points</text><text start="16" dur="4">for which I get -2, -1, 0, 1, and 2.</text><text start="20" dur="2">I square those differences,</text><text start="22" dur="2">which gives me 4, 1, 0, 1, 4.</text><text start="24" dur="3">And now I compute the mean of those square differences.</text><text start="27" dur="3">To do so, I add them all up, which is 10.</text><text start="30" dur="3">10 divided by 5 is 2, and sigma square equals 2.</text></transcript></video><video title="5d Question" id="R4izy_dyQzA" length="10"><transcript><text start="0" dur="2">Here is another quiz--Suppose my DATA</text><text start="2" dur="4">looks as follows--3,9,9,3.</text><text start="6" dur="2">Compute for me mu and sigma squared</text><text start="8" dur="2">using the maximum likelihood estimator I just gave you.</text></transcript></video><video title="5e Answer" id="Fr33yWlbvLk" length="20"><transcript><text start="0" dur="2.879">And the answer is relatively easy.</text><text start="2.879" dur="2.871">3 + 9 + 9 + 3 = 24 </text><text start="5.75" dur="2.759">divided by m = 4 is 6</text><text start="8.509" dur="1.701">so the mean value is 6</text><text start="10.21" dur="4.705">subtracting the mean from the data gives us -3, 3, 3, and -3</text><text 
start="14.915" dur="3.904">squaring those gives us 9, 9, 9, 9</text><text start="18.819" dur="2.938">and the mean of 4 nines equals 9.</text></transcript></video><video title="5f Question" id="pRGEQy7BgiY" length="50"><transcript><text start="0" dur="3.337">I now have a more challenging quiz for you</text><text start="3.337" dur="3.236">in which I give you multivariate data</text><text start="6.573" dur="1.67">in this case 2-dimensional data.</text><text start="8.243" dur="3.002">So suppose my data goes as follows.</text><text start="11.245" dur="4.638">In the first column I get 3, 4, 5, 6, 7</text><text start="15.883" dur="2.369">these are 5 data points and this is the first feature</text><text start="18.252" dur="6.206">and the second feature will be 8, 7, 5, 3, 2.</text><text start="24.458" dur="2.202">The formulas for calculating Mu </text><text start="26.66" dur="2.853">and the covariance matrix Sigma</text><text start="29.513" dur="2.185">generalize the ones we studied before</text><text start="31.698" dur="1.969">and they are given over here.</text><text start="33.667" dur="2.836">So what I would like you to compute is the vector Mu</text><text start="36.503" dur="1.802">which now has 2 values</text><text start="38.305" dur="3.37">one for the first and one for the second column</text><text start="41.675" dur="2.605">and the covariance Sigma</text><text start="44.28" dur="2.767">which now has 4 different values</text><text start="47.047" dur="3">using the formula shown over here.</text></transcript></video><video title="5g Answer" id="FWOa3qoYOd8" length="36"><transcript><text start="0" dur="3.036">Now the mean is calculated as before</text><text start="3.036" dur="3.103">independently for each of the 2 features here.</text><text start="6.139" dur="2.77">Three, 4, 5, 6, 7, the mean is 5.</text><text start="8.909" dur="3.136">Eight, 7, 5, 3, 2, the mean is 5 again.</text><text start="12.045" dur="2.909">Easy calculation. 
If you subtract the mean</text><text start="14.954" dur="4.11">from the data we get the following matrix</text><text start="19.064" dur="2.09">and now we just have to plug it in.</text><text start="21.154" dur="3.604">For the main diagonal elements you get the same formula as before.</text><text start="24.758" dur="3.07">You can do this separately for each of the 2 columns.</text><text start="27.828" dur="2.736">But for the off-diagonal elements you just have to plug it in.</text><text start="30.564" dur="1.902">So this is the result after plugging it in</text><text start="32.466" dur="4.401">and you might just want to verify it using a computer.</text></transcript></video><video title="5h Gaussian Summary" id="mlz-1yfyeoU" length="17"><transcript><text start="0" dur="2.771">So this finishes the lecture on Gaussians.</text><text start="2.771" dur="2">You learned about what a Gaussian is.</text><text start="4.771" dur="1.87">We talked about the fit from data</text><text start="6.641" dur="2.335">and we even talked about multivariate Gaussians.</text><text start="8.976" dur="2.035">But even though I asked you to fit one of those</text><text start="11.011" dur="2.085">the one we are going to focus on right now</text><text start="13.096" dur="1.704">is the one-dimensional Gaussian.</text><text start="14.8" dur="3.282">So let&amp;#39;s now move back to the expectation maximization algorithm.</text></transcript></video><video title="5i EM as Generalization of k-Means" id="1CWDWmF0i2s" length="73"><transcript><text start="0" dur="4.07">It is now really easy to explain expectation maximization</text><text start="4.07" dur="2.303">as a generalization of K-means.</text><text start="6.373" dur="2.703">Again, we have a couple of data points here</text><text start="9.076" dur="3.203">and 2 randomly chosen cluster centers.</text><text start="12.279" dur="4.104">But in the correspondence step instead of making a hard correspondence </text><text start="16.383" dur="2.002">we make 
a soft correspondence.  </text><text start="18.385" dur="3.767">Each data point is attracted to a cluster center</text><text start="22.152" dur="2.806">in proportion to the posterior likelihood</text><text start="24.958" dur="2.002">which we will define in a minute.</text><text start="26.96" dur="3.203">In the adjustment step or the maximization step</text><text start="30.163" dur="4.372">the cluster centers are being optimized just like before</text><text start="34.535" dur="2.502">but now the correspondence is a soft variable</text><text start="37.037" dur="2.836">and they correspond to all data points in different strengths</text><text start="39.873" dur="1.669">not just the nearest ones.</text><text start="41.542" dur="2.535">As a result, in EM the cluster centers</text><text start="44.077" dur="2.177">tend not to move as far as in K-means.</text><text start="46.254" dur="1.794">Their movement is smoother.</text><text start="48.048" dur="2.002">A new correspondence over here gives us different strength</text><text start="50.05" dur="3.337">as indicated by the different coloring of the links</text><text start="53.387" dur="3.837">and another relaxation step gives us better cluster centers.</text><text start="57.224" dur="2.369">And as you can see over time, gradually</text><text start="59.593" dur="3.77">the EM will then converge to about the same solution as K-means.</text><text start="63.363" dur="2.536">However, all the correspondences are still alive.</text><text start="65.899" dur="2.101">Which means there is not a 0, 1 correspondence.</text><text start="68" dur="1.883">There is a soft correspondence </text><text start="69.883" dur="4.356">which relates to a posterior probability, which I will explain next.</text></transcript></video><video title="5j EM Algorithm" id="tTr7547zVCc" length="194"><transcript><text start="0" dur="2.603">The model of expectation maximization</text><text start="2.603" dur="2.035">is that each data point</text><text 
start="4.638" dur="1.869">is generated from what&amp;#39;s called a mixture.</text><text start="6.507" dur="2.435">The sum of all possible classes</text><text start="8.942" dur="2.036">or clusters, of which there are K</text><text start="10.978" dur="2.335">we draw a class at random</text><text start="13.313" dur="3.971">with a prior probability of p of the class C = i</text><text start="17.284" dur="2.302">and then we draw data point X</text><text start="19.586" dur="2.57">from the distribution corresponding to its class over here.</text><text start="22.156" dur="5.939">The way to think about this is that if there are K different cluster centers shown over here</text><text start="28.095" dur="2.803">each one of those has a generic Gaussian attached.</text><text start="30.898" dur="3.169">In the generative version of expectation maximization</text><text start="34.067" dur="2">you first draw a cluster center</text><text start="36.067" dur="3.139">and then we draw from the Gaussian attached to this cluster center.</text><text start="39.206" dur="4.237">The unknowns here are the prior probabilities for each cluster center</text><text start="43.443" dur="5.94">which we call P-i and the Mu-i and in the general case Sigma-i</text><text start="49.383" dur="2.072">for each of the individual Gaussians.</text><text start="51.455" dur="3.333">Where i = 1 all the way to K.</text><text start="54.788" dur="4.438">Expectation maximization iterates 2 steps just like K-means.</text><text start="59.226" dur="2.736">One is called the E-step or expectation step</text><text start="61.962" dur="6.24">for which we assume that we know the Gaussian parameters and the P-i.</text><text start="68.202" dur="3.61">With those known values calculating the sum over here</text><text start="71.812" dur="2.029">is a fairly trivial exercise.</text><text start="73.841" dur="3.505">This is our known formula for a Gaussian</text><text start="77.346" dur="4.236">we just plug that in and this is a fixed 
probability.</text><text start="81.582" dur="3.31">The sum of all possible classes.</text><text start="84.892" dur="2.438">So you get for e-ij</text><text start="87.33" dur="3.297">the probability that the j-th data point</text><text start="90.627" dur="2.366">corresponds to cluster center number i</text><text start="92.993" dur="3.37">P-i times the normalizer</text><text start="96.363" dur="2.436">times the Gaussian expression.</text><text start="98.799" dur="3.603">Where we have a quadratic of Xj minus Mu-i</text><text start="102.402" dur="5.072">times Sigma-i to the -1 times the same thing again over here.</text><text start="107.474" dur="2.269">These are the probabilities</text><text start="109.743" dur="2.302">that the j-th data point</text><text start="112.045" dur="2.837">corresponds to the i-th cluster center</text><text start="114.882" dur="2.235">under the assumption that we do know</text><text start="117.117" dur="3.837">the parameters P-i, Mu-i, and Sigma-i.</text><text start="120.954" dur="2.971">In the M-step we now figure out where these parameters should have been.</text><text start="123.925" dur="3.002">For the prior probability of each cluster center</text><text start="126.927" dur="4.438">we just take the sum over all the e-ijs, over all data points</text><text start="131.365" dur="3.036">divided by the total number of data points.</text><text start="134.401" dur="6.607">The mean is obtained by the weighted mean of the x-js</text><text start="141.008" dur="4.838">normalized by the sum over e-ijs</text><text start="145.846" dur="4.237">and finally the sigma is obtained as a sum</text><text start="150.083" dur="3.037">over the weighted expression like this</text><text start="153.12" dur="2.135">and this is the same expression as before</text><text start="155.255" dur="5.172">and now again we are normalizing over the sum over all e-ijs.</text><text start="160.427" dur="2.002">And these are exactly the same calculations</text><text start="162.429" 
dur="4.271">as before when we fit a Gaussian but just weighted by </text><text start="166.7" dur="4.571">the soft correspondence of a data point to each Gaussian.</text><text start="171.271" dur="4.505">And this weighting is relatively straightforward to apply in Gaussian fitting.</text><text start="175.776" dur="2.436">Let&amp;#39;s do a very quick quiz for EM.</text><text start="178.212" dur="3.436">Suppose we&amp;#39;re given 3 data points and 2 cluster centers.</text><text start="181.648" dur="3.07">And the question is, does this point over here</text><text start="184.718" dur="4.409">called X1 correspond to C1 or C2 or both of them?</text><text start="189.127" dur="5.597">Please check exactly one of these 3 different check boxes here.</text></transcript></video><video title="5k Answer" id="JW74VRSnZAk" length="16"><transcript><text start="0" dur="2">[Thrun] And the answer is both of them,</text><text start="2" dur="3">and the reason is X1 might be closer to C2 than C1,</text><text start="5" dur="2">but the correspondence in EM is soft,</text><text start="7" dur="4">which means each data point always corresponds to all cluster centers.</text><text start="11" dur="2">It is just that this correspondence over here </text><text start="13" dur="3">is much stronger than the correspondence over here.</text></transcript></video><video title="5l Question" id="TFViJ3P6NwM" length="21"><transcript><text start="0" dur="2">[Thrun] Here is another EM quiz.</text><text start="2" dur="6">For this quiz we will assume a degenerate case of 3 data points and just 1 cluster center.</text><text start="8" dur="3">My question pertains to the shape of the Gaussian after fitting,</text><text start="11" dur="2">specifically M1 sigma.</text><text start="13" dur="5">And the question is, is sigma circular, which would be like this,</text><text start="18" dur="3">or elongated, which would be like this or like this?</text></transcript></video><video title="5m Answer" id="ZcTAhLG4OSs" 
length="8"><transcript><text start="0" dur="2">[Thrun] And the answer is, of course, elongated.</text><text start="2" dur="4">As you look over here, what you find is that this is the best Gaussian describing the data points,</text><text start="6" dur="2">and this is what EM will calculate.</text></transcript></video><video title="5n Question" id="0HENenZeAsE" length="47"><transcript><text start="0" dur="5">[Thrun] This is a quiz in which I compare EM versus K-means.</text><text start="5" dur="3">Suppose we are giving you 4 data points, as indicated by those circles.</text><text start="8" dur="3">Suppose we have 2 initial cluster centers, shown here in red,</text><text start="11" dur="6">and those converge to possible places that are indicated by those 4 squares.</text><text start="17" dur="2">Of course they won&amp;#39;t take all 4 of them; they will just take 2 of them.</text><text start="19" dur="2">But for now I&amp;#39;m going to give you 4 choices.</text><text start="21" dur="5">We call this cluster 1, cluster 2, A, B, C, and D.</text><text start="26" dur="6">In EM will C1 move towards A or will C1 move towards B?</text><text start="32" dur="4">And in contrast, in K-means will C1 move towards A </text><text start="36" dur="2">or will C1 move towards B?</text><text start="38" dur="3">This is just asking about the left side of the diagram.</text><text start="41" dur="3">So the question is will K-means find itself in the more extreme situation,</text><text start="44" dur="3">or will EM find itself in the more extreme situation?</text></transcript></video><video title="5o Answer" id="DODedtJZ3FA" length="41"><transcript><text start="0" dur="5">[Thrun] And the answer is that while K-means will go all the way to the extreme, A,</text><text start="5" dur="3">which is this one over here, EM will not.</text><text start="8" dur="5">And this has to do with the soft versus hard nature of the correspondence.</text><text start="13" dur="4">In K-means the correspondence is 
hard.</text><text start="17" dur="3">So after the first iteration, only these 2 data points over here </text><text start="20" dur="2">correspond to cluster center 1,</text><text start="22" dur="3">and they will find themselves straight in the middle where A is located.</text><text start="25" dur="4">In EM, however, we find that there will still be a soft correspondence</text><text start="29" dur="4">to these further away points which will then lead to a small shift of the cluster center</text><text start="33" dur="3">to the right side, as indicated by B.</text><text start="36" dur="5">That means K-means and EM will converge to different models of the data.</text></transcript></video><video title="5p Choosing k" id="_-Ol1cXIWvQ" length="118"><transcript><text start="0" dur="3">[Thrun] One of the remaining open questions pertains to the number of clusters.</text><text start="3" dur="3">So far I&amp;#39;ve assumed it&amp;#39;s simply constant and you know it.</text><text start="6" dur="2">But in reality, you don&amp;#39;t know it.</text><text start="8" dur="4">Practical implementations often guess the number of clusters along with the parameters.</text><text start="12" dur="5">And the way this works is that you periodically evaluate which data is poorly covered</text><text start="17" dur="4">by the existing mixture, you generate new cluster centers </text><text start="21" dur="4">at random near unexplained points, and then you run the algorithm for a while</text><text start="25" dur="4">to see whether the existence of your clusters is still justified.</text><text start="29" dur="4">And the justification test is based on a minimization of a criterion</text><text start="33" dur="4">that combines the negative log likelihood of your data itself</text><text start="37" dur="3">and a penalty for each cluster.</text><text start="40" dur="3">In particular, you&amp;#39;re going to minimize the negative log likelihood of your data</text><text start="43" dur="3">given the model 
plus a constant penalty per cluster.</text><text start="46" dur="5">If we look at this expression, this is the expression that EM already minimizes.</text><text start="51" dur="2">We maximized the posterior probability of data</text><text start="53" dur="4">the logarithm is a monotonic function, and I put a minus sign over here</text><text start="57" dur="3">so the optimization problem becomes a minimization problem.</text><text start="60" dur="4">This one over here, where we have a constant cost per cluster, is new.</text><text start="64" dur="3">If you increase the number of clusters, you would pay a penalty</text><text start="67" dur="3">that is in the way of your attempted minimization.</text><text start="70" dur="4">Typically, this expression balances out at a certain number of clusters,</text><text start="74" dur="2">and it is generically the best explanation for your data.</text><text start="76" dur="2">So the algorithm looks as follows.</text><text start="78" dur="4">Guess an initial K, run EM, remove unnecessary clusters</text><text start="82" dur="2">that will make this score over here go up,</text><text start="84" dur="3">create some new random clusters, and go back and run EM.</text><text start="87" dur="3">There are all kinds of variants of this algorithm.</text><text start="90" dur="3">One of the nice things here is this algorithm also overcomes local minima problems</text><text start="93" dur="2">to some extent.</text><text start="95" dur="4">If, for example, 2 clusters end up grabbing the same data,</text><text start="99" dur="3">then your tests would show you that 1 of the clusters can be omitted;</text><text start="102" dur="2">thereby the score can be improved.</text><text start="104" dur="3">That cluster can later be restarted somewhere else,</text><text start="107" dur="4">and by randomly restarting clusters, you tend to get a much, much better solution</text><text start="111" dur="3">than if you run EM just once with a fixed number of 
clusters.</text><text start="114" dur="4">So this trick is highly recommended for any implementation of expectation maximization.</text></transcript></video><video title="5q Clustering Summary" id="DH7FWwCgx5M" length="41"><transcript><text start="0" dur="3">[Thrun] This finishes my unit on clustering,</text><text start="3" dur="2">at least so far.</text><text start="5" dur="2">I just want to briefly summarize what we&amp;#39;ve learned.</text><text start="7" dur="3">We talked about K-means, and we talked about expectation maximization.</text><text start="10" dur="4">K-means is a very simple almost binary algorithm</text><text start="14" dur="2">that allows you to find cluster centers.</text><text start="16" dur="3">EM is a probabilistic generalization that also allows you to find clusters</text><text start="19" dur="4">but also modifies the shapes of the clusters by modifying the covariance matrix.</text><text start="23" dur="3">EM is probabilistically sound, and you can prove convergence </text><text start="26" dur="3">in a log likelihood space. 
K-means also converges.</text><text start="29" dur="2">Both are prone to local minima.</text><text start="31" dur="3">In both cases you need to know the number of cluster centers, K.</text><text start="34" dur="5">I showed you a brief trick how to estimate the K as you go,</text><text start="39" dur="2">which also overcomes local minima to some extent.</text></transcript></video><video title="6 Dimensionality Reduction" id="lDyEk72TezE" length="26"><transcript><text start="0" dur="4">Let&amp;#39;s now talk about a 2nd class of unsupervised learning avenues </text><text start="4" dur="2">that are called dimensionality reduction.</text><text start="6" dur="4">We&amp;#39;re going to start with a little quiz, in which I will check your intuition.</text><text start="10" dur="4">Suppose we&amp;#39;re given a 2-dimensional data field, and our data lines up as follows.</text><text start="14" dur="3">My quiz is: How many dimensions do we really need?</text><text start="17" dur="2">The key is the word really,</text><text start="19" dur="3">which means we&amp;#39;re willing to tolerate a certain amount of error in accuracy,</text><text start="22" dur="4">because we&amp;#39;re going to capture the essence of the problem.</text></transcript></video><video title="6a Answer" id="AaSibhWmkQM" length="11"><transcript><text start="0" dur="2">The answer is obviously 1.</text><text start="2" dur="2">This is the key dimension over here.</text><text start="4" dur="3">The orthogonal dimension in this direction carries almost no information,</text><text start="7" dur="4">so it suffices, in most cases, to project the data onto this 1 dimensional space.</text></transcript></video><video title="6b Question" id="0I4p7lyKo4k" length="8"><transcript><text start="0" dur="2">Here is a quiz that is a little bit more tricky.</text><text start="2" dur="2">I&amp;#39;m going to draw data for you like this.</text><text start="4" dur="2">I&amp;#39;m going to ask the same question.</text><text start="6" 
dur="2">How many dimensions do we really need?</text></transcript></video><video title="6c Answer" id="wVkPH0eC4z0" length="30"><transcript><text start="0" dur="5">This answer is not at all trivial, and I don&amp;#39;t blame you if you get it wrong.</text><text start="5" dur="5">The answer is actually 1, but the projection itself is nonlinear.</text><text start="10" dur="5">I can draw, really easily, a nice 1-dimensional space that follows these data points.</text><text start="15" dur="4">If I am able to project all the data points on this 1-dimensional space,</text><text start="19" dur="2">I capture the essence of the data.</text><text start="21" dur="4">The trick, of course, is to find the nonlinear 1-dimensional space and describe it.</text><text start="25" dur="5">This is what&amp;#39;s going on in the state-of-the-art in dimensionality reduction research.</text></transcript></video><video title="6d Linear Dimensionality Reduction" id="5m6TeKw_e1M" length="177"><transcript><text start="0" dur="2">For the remainder of this unit,</text><text start="2" dur="3">I am going to talk about linear dimensionality reduction.</text><text start="5" dur="3">Where the idea is that we are given data points like this, </text><text start="8" dur="5">and we seek to find a linear subspace in which to project the data.</text><text start="13" dur="4">In this case, I would submit this is probably the most suitable linear subspace.</text><text start="17" dur="6">So we remap the data onto the space over here, with x1 over here and x2 over here.</text><text start="23" dur="2">Then we can capture the data in just 1 dimension.</text><text start="25" dur="3">The algorithm is amazingly simple.</text><text start="28" dur="3">Number 1: Fit a gaussian; we now know how this works.</text><text start="31" dur="3">The gaussian will look something like this.</text><text start="34" dur="5">Number 2: Calculate the eigenvalues and eigenvectors of this gaussian.</text><text start="39" dur="3">In this 
gaussian this would be the dominant eigenvector,</text><text start="42" dur="3">and this would be the 2nd eigenvector over here.</text><text start="45" dur="5">Step 3 is take those eigenvectors whose eigenvalues are the largest.</text><text start="50" dur="5">Step 4 is to project the data onto the subspace of eigenvectors you chose.</text><text start="55" dur="4">Now to understand this, you have to be familiar with eigenvectors and eigenvalues.</text><text start="59" dur="3">I&amp;#39;ll give you an intuitive familiarity with those.</text><text start="62" dur="5">This is standard statistics material, and you will find this in many linear algebra classes.</text><text start="67" dur="2">So let me just go through this very quickly </text><text start="69" dur="5">and give you an intuition how to do linear dimensionality reduction.</text><text start="74" dur="2">Suppose you&amp;#39;re given the following data points:</text><text start="76" dur="4">Your axes are 0, 1, 2, 3, and 4, </text><text start="80" dur="8">for x1, and 1.9, 3.1, 4, 5.1, and 5.9.</text><text start="88" dur="5">These are essentially 2, 3, 4, 5, 6,</text><text start="93" dur="5">but slightly modified to define actual variance over this dimension.</text><text start="98" dur="2">So I draw this in here.</text><text start="100" dur="4">What I get is a set of points that doesn&amp;#39;t quite fit a line, but almost.</text><text start="104" dur="3">There is a little error over here, a little error over here, and here and here.</text><text start="107" dur="3">The mean is easily calculated; it&amp;#39;s 2 and 4.</text><text start="110" dur="3">The covariance matrix looks as follows.</text><text start="113" dur="6">Notice the slightly different covariance for the 1st variable, which is exactly 2,</text><text start="119" dur="3">to the 2nd variable, which is 2.008.</text><text start="122" dur="11">The eigenvectors happen to be 0.7064 and 0.7078 with an eigenvalue of 4.004, </text><text start="133" dur="5">and the 2nd one is 
orthogonal with an eigenvalue much smaller.</text><text start="138" dur="4">So obviously this is the eigenvector that dominates the spread of the data points.</text><text start="142" dur="5">If you look at this vector over here, it is centered around the mean, </text><text start="147" dur="4">which sits over here, and is exactly this vector shown over here.</text><text start="151" dur="3">Where this one is the orthogonal vector shown over here.</text><text start="154" dur="5">So this single dimension with a large weight explains the data relative to </text><text start="159" dur="2">any other dimension, which has a very small eigenvalue.</text><text start="161" dur="6">I should mention why these numerical examples might look confusing.</text><text start="167" dur="2">This is very standard linear algebra.</text><text start="169" dur="4">When you estimate covariance from data and try to understand which direction they point,</text><text start="173" dur="4">this kind of eigenvalue analysis gives you the right answer.</text></transcript></video><video title="6e Face Example" id="KuSZmepQA_s" length="131"><transcript><text start="0" dur="4">The dimensionality reduction looks a little bit silly when you go </text><text start="4" dur="1">from 2 dimensions to 1 dimension.</text><text start="5" dur="4">But in truly high-dimensional space it has a very strong utility.</text><text start="9" dur="4">Here&amp;#39;s an example that goes back to MIT several decades ago </text><text start="13" dur="2">on something called eigenfaces.</text><text start="15" dur="2">These are all well-aligned faces.</text><text start="17" dur="4">The objective in eigenface research has been to find </text><text start="21" dur="4">simple ways to describe different people in a parameter space,</text><text start="25" dur="2">in which we can easily identify the same person again.</text><text start="27" dur="4">Images like these are very high-dimensional statistics.</text><text start="31" dur="2">If each 
image is 50 by 50 pixels,</text><text start="33" dur="6">each image itself becomes a data point in a 2500 dimensional feature space.</text><text start="39" dur="4">Now obviously, we don&amp;#39;t have random images.</text><text start="43" dur="5">We don&amp;#39;t fill the space of 2500 dimensions with all face images.</text><text start="48" dur="6">Instead, it is reasonable to assume that all the faces live on a small subspace in that space.</text><text start="54" dur="4">Obviously, you as a human can easily distinguish what is a valid image of a face</text><text start="58" dur="4">and what is a valid image of a non face, like a car or a cloud or the sky.</text><text start="62" dur="2">Therefore, there are many, many images that you can </text><text start="64" dur="4">represent with 2500 pixels that are not faces.</text><text start="68" dur="2">So research on eigenfaces has applied</text><text start="70" dur="5">principal component analysis and eigenvalues to the space of faces.</text><text start="75" dur="4">Here is a database in which faces are aligned.</text><text start="79" dur="4">A researcher, Santiago Serrano, extracted from it </text><text start="83" dur="4">the average face after alignment on the right side.</text><text start="87" dur="4">The truly interesting phenomenon occurs when you look at the eigenvalues.</text><text start="91" dur="3">The face on the top left, over here, is the average face, </text><text start="94" dur="3">and these are the variations,</text><text start="97" dur="4">the eigenvectors that correspond to the largest eigenvalues over here.</text><text start="101" dur="1">This is the strongest variation.</text><text start="102" dur="4">You see a certain amount of different regions in and around the head shape</text><text start="106" dur="2">and the hair that gets excited.</text><text start="108" dur="2">That&amp;#39;s the 2nd strongest one, where the shirt gets more excited.</text><text start="110" dur="1">As you go down, </text><text 
start="111" dur="5">you find more and more interesting variations that can be used to reconstruct faces.</text><text start="116" dur="5">Typically a dozen or so will suffice to make a face completely reconstructable, </text><text start="121" dur="4">which means you&amp;#39;ve just mapped a 2500 dimensional feature space </text><text start="125" dur="3">into a, perhaps, 12 dimensional feature space</text><text start="128" dur="3">on which we can now learn much, much easier.</text></transcript></video><video title="6f Scan Example" id="XNCHCncvDto" length="241"><transcript><text start="0" dur="6">In our own research, we also have applied eigenvector decomposition</text><text start="6" dur="5">to relatively challenging problems that don&amp;#39;t look like a linear problem at the surface.</text><text start="11" dur="4">We scanned a good number of people with different physiques:</text><text start="15" dur="4">Some thin, some not so thin, some tall, some short, some male, some female.</text><text start="19" dur="4">We also scanned them in 3-D in different body postures:</text><text start="23" dur="5">The arms down, the arms up, walking, throwing a ball, and so on.</text><text start="28" dur="5">We applied eigenvector decomposition of the type I&amp;#39;ve just shown you</text><text start="33" dur="4">to understand whether there is a latent low-dimensional space</text><text start="37" dur="4">that is sufficient to represent the different physiques that people have,</text><text start="41" dur="5">like thin or thick, and the different postures people can assume, like standing and so on.</text><text start="46" dur="5">It turns out if you apply eigenvector decomposition </text><text start="51" dur="4">to the space of all the deformations of your body,</text><text start="55" dur="5">you can find relatively low dimensional linear spaces,</text><text start="60" dur="5">in which you can express different physiques and different body postures.</text><text start="65" dur="6">For the 
space of all different physiques it turns out only 3 dimensions sufficed</text><text start="71" dur="4">to explain different heights, different thicknesses or body weights,</text><text start="75" dur="3">and also different genders.</text><text start="78" dur="4">That is, even though our surfaces themselves are representative</text><text start="82" dur="3">of tens of thousands of data points, the underlying dimensionality</text><text start="85" dur="4">when scanning people is really small.</text><text start="89" dur="2">I&amp;#39;ll let you watch the entire movie.</text><text start="91" dur="1">Please enjoy.</text><text start="92" dur="2">[SCAPE: Shape Completion and Animation of People]</text><text start="94" dur="4">We present a method named SCAPE for simultaneously modeling</text><text start="98" dur="3">the space of all human shapes and poses.</text><text start="101" dur="3">Further, we demonstrate the method&amp;#39;s usefulness</text><text start="104" dur="4">for both shape completion and animation.</text><text start="108" dur="3">The model is computed from an example set of surface meshes.</text><text start="111" dur="4">We require only a limited set of training data:</text><text start="115" dur="3">Examples of pose variation from a single subject</text><text start="118" dur="4">and examples of the shape variation between subjects.</text><text start="122" dur="4">The resulting model can represent both articulated motion</text><text start="126" dur="4">and, importantly, the nonrigid muscle deformations</text><text start="130" dur="4">required for natural appearance in a wide variety of poses.</text><text start="134" dur="4">The model can also represent a wide variety of different body shapes, </text><text start="138" dur="2">spanning both men and women.</text><text start="140" dur="3">Because SCAPE incorporates both shape and pose</text><text start="143" dur="5">we can jointly vary both shape and pose to create people who never existed </text><text start="148" 
dur="3">and poses that were never observed.</text><text start="151" dur="5">We demonstrate the use of this model 1st for shape completion of scanned meshes.</text><text start="156" dur="3">Even when a subject has only been partially observed, </text><text start="159" dur="3">we can use the model to estimate a complete surface.</text><text start="162" dur="5">In this case, the entire front half of the subject has been synthesized.</text><text start="167" dur="4">Note that the synthesized data both conforms to the individual subject&amp;#39;s </text><text start="171" dur="3">specific shape and faithfully represents</text><text start="174" dur="5">the nonrigid muscle deformations associated with a specific pose.</text><text start="179" dur="2">Mesh completion is possible even when</text><text start="181" dur="4">neither the person nor the pose exists in the original training set. </text><text start="185" dur="2">None of the women in our example set</text><text start="187" dur="4">look similar to the woman in this sequence.</text><text start="191" dur="4">Shape completion can also be used to synthesize complete</text><text start="195" dur="3">animated surface meshes.</text><text start="198" dur="2">Starting from a single scanned mesh of an actor</text><text start="200" dur="4">and a time series of motion capture markers</text><text start="204" dur="2">we can treat the markers themselves</text><text start="206" dur="3">as a very sparse sampling of surface geometry</text><text start="209" dur="5">and complete the surface which best fits the available data at each point in time.</text><text start="214" dur="2">Using this method, animated surface models</text><text start="216" dur="4">for a wide variety of motions can be created with relative ease.</text><text start="220" dur="5">In addition, the target identity of the surface model can easily be changed</text><text start="225" dur="5">simply by replacing the subject portion of our factorized model with a different 
vector.</text><text start="230" dur="4">The new identity need not be present in our training set </text><text start="234" dur="2">or even correspond to a real person.</text><text start="236" dur="5">An artist is free to alter the identity arbitrarily.</text></transcript></video><video title="6g Piece-Wise Linear Projection" id="bIZrRYKN_RY" length="32"><transcript><text start="0" dur="5">[Thrun] In modern dimensionality reduction, the trick has been to define nonlinear,</text><text start="5" dur="4">sometimes piece-wise linear, subspaces on which data is being projected.</text><text start="9" dur="3">This is not dissimilar from K nearest neighbors,</text><text start="12" dur="4">where local regions are being defined based on local data neighborhoods.</text><text start="16" dur="2">But here we need ways to interpret leveraging neighbors</text><text start="18" dur="4">to make sure that the subspace itself becomes a feasible subspace.</text><text start="22" dur="5">Common methods include local linear embedding, or LLE, or the Isomap method.</text><text start="27" dur="2">If you&amp;#39;re interested in this, check the Web.</text><text start="29" dur="3">There&amp;#39;s tons of information on these methods on the World Wide Web.</text></transcript></video><video title="7 Spectral Clustering" id="VxAMBkDUfeg" length="42"><transcript><text start="0" dur="4">We now talk about spectral clustering.</text><text start="4" dur="3">The fundamental idea of spectral clustering</text><text start="7" dur="2">is to cluster by affinity.</text><text start="9" dur="3">And to understand the  importance of spectral clustering,</text><text start="12" dur="4">let me ask you a simple intuitive quiz.</text><text start="16" dur="2">Suppose you are given data like this,</text><text start="18" dur="4">and you wish to learn that there&amp;#39;s 2 clusters--</text><text start="22" dur="3">a cluster over here and a cluster over here.</text><text start="25" dur="3">So my question is, from what you 
understand,</text><text start="28" dur="2">do you think that EM or K-means</text><text start="30" dur="3">would do a great job finding those clusters</text><text start="33" dur="3">or do you think they will likely fail to find those clusters? </text><text start="36" dur="2">So the question is: do EM or K-means</text><text start="38" dur="2">succeed in finding the 2 clusters?</text><text start="40" dur="2">There is a likely yes and a likely no.</text></transcript></video><video title="7a Answer" id="fuMoXRHxjTg" length="34"><transcript><text start="0" dur="2">And the answer is likely no.</text><text start="2" dur="3">The reason being that these aren&amp;#39;t clusters</text><text start="5" dur="3">defined by a center of data points,</text><text start="8" dur="2">but they&amp;#39;re clusters defined by affinity,</text><text start="10" dur="4">which means they&amp;#39;re defined by the presence of nearby points.</text><text start="14" dur="3">So take for example the area over here, which I&amp;#39;m going to circle,</text><text start="17" dur="3">and ask yourself, what&amp;#39;s the best cluster center?</text><text start="20" dur="3">It&amp;#39;s likely somewhere over here where I drew the red dot.</text><text start="23" dur="2">This is the cluster center for this cluster,</text><text start="25" dur="3">and perhaps this is the cluster center for the other cluster.</text><text start="28" dur="2">And these points over here will likely </text><text start="30" dur="2">be classified as belonging to the cluster center over here.</text><text start="32" dur="2">So, EM will likely do a bad job.</text></transcript></video><video title="7b Spectral Clustering Algorithm" id="P-LEH-AFovE" length="324"><transcript><text start="0" dur="3">So let&amp;#39;s look at this example again--let me redraw the data.</text><text start="3" dur="2">What makes these clusters so 
different</text><text start="5" dur="3">is not the absolute location of each data point,</text><text start="8" dur="3">but the connectedness of these data points.</text><text start="11" dur="2">The fact that these 2 points belong together </text><text start="13" dur="3">is likely because there&amp;#39;s lots of points in-between. </text><text start="16" dur="2">In other words, it&amp;#39;s the affinity</text><text start="18" dur="3">that defines those clusters, not the absolute location.</text><text start="21" dur="4">So spectral clustering uses the notion of affinity</text><text start="25" dur="2">to make clustering happen.</text><text start="27" dur="3">So let me look at a simple example for spectral clustering</text><text start="30" dur="3">that would also work for K-means or EM,</text><text start="33" dur="3">but that will be useful to illustrate spectral clustering.</text><text start="36" dur="3">Let&amp;#39;s assume there&amp;#39;s 9 data points as shown over here,</text><text start="39" dur="4">and I&amp;#39;ve colored them differently in blue, red, and black.</text><text start="43" dur="3">But to clustering algorithms, they all come with the same color.</text><text start="46" dur="2">Now the key element of spectral clustering </text><text start="48" dur="2">is called the affinity matrix,</text><text start="50" dur="3">which is a 9 by 9 matrix in this case,</text><text start="53" dur="3">where each data point gets graphed</text><text start="56" dur="2">relative to each other data point.</text><text start="58" dur="2">So let me write down all the 9 data points</text><text start="60" dur="3">into the different rows of this matrix--</text><text start="63" dur="2">the red ones, the black ones, and the blue ones.</text><text start="65" dur="4">And in the columns, I graph the exact same 9 data points.</text><text start="69" dur="4">I then calculate for each pair of data points their affinity,</text><text start="73" dur="3">where I use for now affinity 
as the </text><text start="76" dur="3">quadratic distance in this diagram over here.</text><text start="79" dur="3">Clearly, the red dots to each other have a high affinity,</text><text start="82" dur="2">which means a small quadratic distance. </text><text start="84" dur="2">Let me indicate this as follows--</text><text start="86" dur="3">But relative to all the other points, the affinity is weak.</text><text start="89" dur="3">So there&amp;#39;s a very small value in these elements over here.</text><text start="92" dur="2">Similarly, the affinity of the black</text><text start="94" dur="2">data points to each other is very high,</text><text start="96" dur="2">which means that the following block diagonal</text><text start="98" dur="3">in this matrix will have a very large value.</text><text start="101" dur="3">Yet the affinity to all the other data points will be low.</text><text start="104" dur="3">And of course, the same is true for the blue data points.</text><text start="107" dur="2">The interesting thing to notice now</text><text start="109" dur="3">is that this is an approximately rank-deficient matrix.</text><text start="112" dur="4">And further, the data points that belong to the same class--</text><text start="116" dur="3">like the 3 red dots or the 3 black dots, </text><text start="119" dur="4">have a similar affinity vector to all the other data points. 
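<!--
Editorial sketch, not part of the spoken lecture. The affinity matrix described
above can be tried out in a few lines of NumPy. Everything named here is an
assumption made for illustration: the coordinates in `points`, the helper
`affinity_matrix`, its `sigma` parameter, and the choice of a Gaussian kernel to
turn a small quadratic distance into a high affinity. The lecture itself only
says that affinity is derived from the quadratic distance.

```python
import numpy as np

# Nine made-up 2-D points in three tight groups (illustrative coordinates only).
points = np.array([
    [0.0, 0.0], [0.1, 0.0], [0.0, 0.1],   # "red" group
    [5.0, 5.0], [5.1, 5.0], [5.0, 5.1],   # "black" group
    [0.0, 5.0], [0.1, 5.0], [0.0, 5.1],   # "blue" group
])


def affinity_matrix(x, sigma=1.0):
    """Return the n-by-n matrix of pairwise affinities for the rows of x."""
    # Squared Euclidean ("quadratic") distance between every pair of points.
    diff = x[:, None, :] - x[None, :, :]
    sq_dist = np.sum(diff ** 2, axis=2)
    # Assumed Gaussian kernel: a small distance maps to an affinity near 1,
    # a large distance maps to an affinity near 0.
    return np.exp(-sq_dist / (2.0 * sigma ** 2))


A = affinity_matrix(points)

# The matrix is 9 by 9, with large values in three diagonal blocks.
print(np.round(A, 2))

# Counting singular values above 0.5 gives the approximate rank,
# which is 3 for these points even though the matrix is 9 by 9.
print(np.linalg.matrix_rank(A, tol=0.5))
```

Rows that belong to the same group come out nearly identical, which is the
"approximately rank-deficient" structure the lecture points to.
-->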
</text><text start="123" dur="3">So this vector over here is similar to this vector over here.</text><text start="126" dur="2">It&amp;#39;s similar to this vector over here,</text><text start="128" dur="2">but it&amp;#39;s very different to this vector over here,</text><text start="130" dur="3">which then itself is similar to the vector over here,</text><text start="133" dur="2">yet different to the previous ones.</text><text start="135" dur="2">Such a situation is easily addressed by what&amp;#39;s called </text><text start="137" dur="4">principal component analysis, or PCA.</text><text start="141" dur="4">PCA is a method to identify vectors that are similar</text><text start="145" dur="3">in an approximately rank-deficient matrix. </text><text start="148" dur="3">Consider once again our affinity matrix.</text><text start="151" dur="2">With principal component analysis,</text><text start="153" dur="3">which is a standard linear trick,</text><text start="156" dur="2">we can re-represent this matrix</text><text start="158" dur="4">by the most dominant eigenvectors you&amp;#39;ll find there.</text><text start="162" dur="2">And the first one might look like this.</text><text start="164" dur="3">The second one, which would be orthogonal, may look like this. 
</text><text start="167" dur="2">The third one, like this.</text><text start="169" dur="2">These are called eigenvectors, and the principle</text><text start="171" dur="2">now is that each eigenvector has an eigenvalue</text><text start="173" dur="4">that states how prevalent this vector is in the original data.</text><text start="177" dur="3">And for these 3 vectors, you&amp;#39;re going to find a large eigenvalue</text><text start="180" dur="3">because there&amp;#39;s a number of data points that represent </text><text start="183" dur="3">these vectors quite prevalently </text><text start="186" dur="3">like the first 3 do for this guy over here.</text><text start="189" dur="3">There might be additional eigenvectors like something like this,</text><text start="192" dur="3">but such eigenvectors will have a small eigenvalue</text><text start="195" dur="2">simply because this vector isn&amp;#39;t really </text><text start="197" dur="2">required to explain the data over here.</text><text start="199" dur="2">It might just be explaining some of the noise</text><text start="201" dur="2">in the affinity matrix</text><text start="203" dur="2">that I didn&amp;#39;t even dare draw in here.</text><text start="205" dur="2">Now if you take the eigenvectors with the largest </text><text start="207" dur="2">eigenvalues--3 in this case,</text><text start="209" dur="3">you first discover the dimensionality </text><text start="212" dur="2">of the underlying data space.</text><text start="214" dur="3">The dimensionality equals the number of large eigenvalues.</text><text start="217" dur="3">Further, if you re-represent each data vector</text><text start="220" dur="2">using those eigenvectors, </text><text start="222" dur="2">you&amp;#39;ll find a 3-dimensional space </text><text start="224" dur="4">where the original data falls into a variety of different places.</text><text start="228" dur="3">And these places are easily told apart by conventional clustering.</text><text 
start="231" dur="2">So in summary, spectral clustering builds </text><text start="233" dur="2">an affinity matrix of the data points.</text><text start="235" dur="3">It extracts the eigenvectors with the largest eigenvalues,</text><text start="238" dur="3">and then re-maps those vectors into a new space </text><text start="241" dur="4">where the data points easily cluster the conventional way. </text><text start="245" dur="4">This is called affinity-based clustering or spectral clustering.</text><text start="249" dur="2">Let me illustrate this once again with a</text><text start="251" dur="2">data set that has a different spectral clustering</text><text start="253" dur="2">than a conventional clustering. </text><text start="255" dur="2">In this data set, the different clusters belong </text><text start="257" dur="2">together because their affinity is similar.</text><text start="259" dur="2">These 2 points belong together</text><text start="261" dur="2">because there is a point in-between.</text><text start="263" dur="3">If we now draw the affinity matrix for those data points, </text><text start="266" dur="3">you find that the first and second data points are close together</text><text start="269" dur="3">and the second and the third, but not the first and the third.</text><text start="272" dur="3">Hence these 2 off-diagonal elements here remain small.</text><text start="275" dur="3">Similarly for the red points as shown here</text><text start="278" dur="2">with these 2 elements over here relatively small.</text><text start="280" dur="2">And also for the black points</text><text start="282" dur="2">where these 2 elements over here are small. 
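<!--
Editorial sketch, not part of the spoken lecture. The three steps summarized
above (build an affinity matrix, keep the eigenvectors with the largest
eigenvalues, cluster the re-mapped points conventionally) can be tried on three
short chains of points, where the two ends of a chain are connected only through
the point in between. The data, the Gaussian kernel, and the helper names
`affinity_matrix`, `spectral_embedding`, and `kmeans` are assumptions made for
illustration. The lecture does not prescribe a kernel or a particular
conventional clustering method.

```python
import numpy as np

# Three made-up chains of three points each; neighbours within a chain are 1 apart.
points = np.array([
    [0.0, 0.0], [1.0, 0.0], [2.0, 0.0],
    [0.0, 10.0], [1.0, 10.0], [2.0, 10.0],
    [10.0, 0.0], [10.0, 1.0], [10.0, 2.0],
])


def affinity_matrix(x, sigma=1.0):
    # Assumed Gaussian kernel on the squared (quadratic) distance.
    diff = x[:, None, :] - x[None, :, :]
    return np.exp(-np.sum(diff ** 2, axis=2) / (2.0 * sigma ** 2))


def spectral_embedding(affinity, k):
    # eigh handles symmetric matrices and returns eigenvalues in ascending order.
    eigenvalues, eigenvectors = np.linalg.eigh(affinity)
    top = np.argsort(eigenvalues)[::-1][:k]
    # Row i of the result is data point i re-represented by the top k eigenvectors.
    return eigenvalues[top], eigenvectors[:, top]


def kmeans(x, k, iterations=20):
    # Deterministic farthest-point initialisation, then plain Lloyd iterations.
    centers = [x[0]]
    for _ in range(1, k):
        dist = np.min([np.sum((x - c) ** 2, axis=1) for c in centers], axis=0)
        centers.append(x[np.argmax(dist)])
    centers = np.array(centers)
    for _ in range(iterations):
        d = np.sum((x[:, None, :] - centers[None, :, :]) ** 2, axis=2)
        labels = np.argmin(d, axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = x[labels == j].mean(axis=0)
    return labels


A = affinity_matrix(points)
top_eigenvalues, embedding = spectral_embedding(A, k=3)
labels = kmeans(embedding, k=3)

print(np.round(top_eigenvalues, 3))  # three large eigenvalues, one per chain
print(labels)                        # points of the same chain share a label
```

Practical implementations usually normalise the affinity matrix first (a graph
Laplacian). That step is left out here to stay close to the lecture.
-->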
</text><text start="284" dur="3">And interestingly enough, even though these aren&amp;#39;t block diagonal, </text><text start="287" dur="3">your first 3 largest eigenvectors</text><text start="290" dur="2">will still look the same as before.</text><text start="292" dur="2">I find this quite remarkable</text><text start="294" dur="2">that even though these aren&amp;#39;t exactly blocks,</text><text start="296" dur="3">those vectors still represent the 3 most </text><text start="299" dur="2">important vectors with which to recover </text><text start="301" dur="3">the data using principal component analysis.</text><text start="304" dur="2">So in this case, spectral clustering would easily</text><text start="306" dur="4">assign those guys and those guys and those guys</text><text start="310" dur="2">to the respective same cluster,</text><text start="312" dur="2">which wouldn&amp;#39;t be quite as easily the case for</text><text start="314" dur="2"> expectation-maximization or k-means.</text><text start="316" dur="2">So let me ask you the following quiz.</text><text start="318" dur="2">Suppose we have 8 data points.</text><text start="320" dur="4">How many elements will the affinity matrix have?</text></transcript></video><video title="7c Answer" id="VG8F24TAwzg" length="5"><transcript><text start="0" dur="2">And the answer is 64.</text><text start="2" dur="3">There&amp;#39;s 8 data points--8 times 8 is 64.</text></transcript></video><video title="7d Question" id="1FZjz9O65ZU" length="17"><transcript><text start="0" dur="4">My second question is, how many large eigenvalues </text><text start="4" dur="2">will PCA find?</text><text start="6" dur="4">Now I understand this doesn&amp;#39;t have a unique answer,</text><text start="10" dur="2">but in the best possible case</text><text start="12" dur="3">where spectral clustering works well,</text><text start="15" dur="2">how many large eigenvalues do you find?</text></transcript></video><video title="7e Answer" 
id="924fCzIWetY" length="13"><transcript><text start="0" dur="2">And the answer is 2.</text><text start="2" dur="4">There&amp;#39;s a cluster over here and a cluster over here.</text><text start="6" dur="2">And while it might happen that it&amp;#39;s as many as 8,</text><text start="8" dur="2">if you adjust your affinity matrix well,</text><text start="10" dur="3">those 2 should correspond with the 2 largest eigenvalues. </text></transcript></video><video title="8 Supervised vs Unsupervised Learning" id="qkcFRr7LqAw" length="115"><transcript><text start="0" dur="2">So, congratulations.</text><text start="2" dur="3">You just made it through the unsupervised learning section of this class.</text><text start="5" dur="2">I think you&amp;#39;ve learned a lot.</text><text start="7" dur="3">You learned about K-means, you learned about expectation maximization,</text><text start="10" dur="4">about dimensionality reduction and even spectral clustering.</text><text start="14" dur="3">The first 3 items--K-means, EM, and dimensionality reduction--</text><text start="17" dur="5">are used very frequently, and spectral clustering is a more rarely used method</text><text start="22" dur="4">that shows some of the most recent research going on in the field.</text><text start="26" dur="4">I hope you have fun applying these methods in practice.</text><text start="30" dur="5">I&amp;#39;d like to say a few final words about supervised versus unsupervised learning.</text><text start="35" dur="4">In both cases you&amp;#39;re given data, but in 1 case you have labeled data,</text><text start="39" dur="2">in another you have unlabeled data.</text><text start="41" dur="4">The supervised learning paradigm is the dominant paradigm in machine learning,</text><text start="45" dur="3">and there is a vast number of papers being written about it.</text><text start="48" dur="3">We talked about classification and regression</text><text start="51" dur="2">and different methods to do supervised 
learning.</text><text start="53" dur="3">The unsupervised paradigm is much less explored,</text><text start="56" dur="4">even though I think it&amp;#39;s at least equally important--possibly even more important.</text><text start="60" dur="5">Many systems can collect vast amounts of data such as web crawlers,</text><text start="65" dur="3">robots, I told you about street view,</text><text start="68" dur="3">and getting the data is cheap, but getting labels is hard.</text><text start="71" dur="3">So to me, unsupervised is the method of the future.</text><text start="74" dur="3">It&amp;#39;s one of the most interesting open research topics</text><text start="77" dur="4">to see whether we can make sense out of large amounts of unlabeled or poorly labeled data.</text><text start="81" dur="5">In between, there are techniques that do both: supervised and unsupervised.</text><text start="86" dur="3">They are called semi-supervised or self-supervised,</text><text start="89" dur="3">and they use elements of unsupervised learning and pair them with supervised learning.</text><text start="92" dur="3">Those are fascinating by their own rights.</text><text start="95" dur="3">Our robot Stanley, for example, that won the DARPA Grand Challenge</text><text start="98" dur="5">used its own sensors to produce labels on the fly to other data.</text><text start="103" dur="3">And I&amp;#39;ll talk about this when I talk about robotics in more detail.</text><text start="106" dur="5">But for the time being, understand that the paradigms supervised and unsupervised</text><text start="111" dur="4">span 2 very large areas of machine learning, and you learn quite a bit about it.</text></transcript></video></group><group title="Homework 3" count="18"><video title="1 Introduction" id="8zwpDAXxCJk" length="5"><transcript><text start="0" dur="5">Welcome to the third homework assignment covering topics of machine learning.</text></transcript></video><video title="2a Naive Bayes Laplacian Smoothing" 
id="Lj9ku_w8JAE" length="68"><transcript><text start="0" dur="6">[Thrun] This question is about naive Bayes and Laplacian smoothing.</text><text start="6" dur="6">Our training data is a set of movie titles: A Perfect World,</text><text start="12" dur="4">My Perfect Woman, and Pretty Woman.</text><text start="16" dur="10">We also have a song class of song titles: A Perfect Day, Electric Storm,</text><text start="26" dur="2">Another Rainy Day.</text><text start="28" dur="5">Suppose we get a new title, the query Perfect Storm,</text><text start="33" dur="7">and we wish to know whether Perfect Storm is more likely a movie or a song.</text><text start="40" dur="4">Compute for me the following model probabilities:</text><text start="44" dur="6">the probability for movie class and song class,</text><text start="50" dur="3">the probability of the word &amp;quot;perfect&amp;quot; conditioned on the movie class,</text><text start="53" dur="5">the probability of the word &amp;quot;perfect&amp;quot; conditioned on the song class,</text><text start="58" dur="3">and the same for the word &amp;quot;storm.&amp;quot;</text><text start="61" dur="5">Please use Laplacian smoothing for this with K equals 1.</text><text start="66" dur="2">Don&amp;#39;t compute the maximum likelihood estimate.</text></transcript></video><video title="2a Naive Bayes Laplacian Smoothing ANSWER" id="evtCdmjcZ4I" length="110"><transcript><text start="0" dur="3">[Thrun] Remember in Laplacian smoothing our best estimate </text><text start="3" dur="6">is the count of the occurrence of the words divided by N,</text><text start="9" dur="4">but we add our Laplacian smoother over here, </text><text start="13" dur="3">and down here we add K times number of classes.</text><text start="16" dur="8">For the movie prior we have 3 examples of movie titles over 6 total titles,</text><text start="24" dur="4">which gives us 3 over 6.</text><text start="28" dur="2">We add our Laplacian prior, 1 over here.</text><text 
start="30" dur="3">There&amp;#39;s 2 classes, movie and song, 2 over here.</text><text start="33" dur="2">We get 4 over 8, which is a half.</text><text start="35" dur="3">The same is the case for song.</text><text start="38" dur="4">It gets more interesting for this probability over here.</text><text start="42" dur="6">In our movie class there&amp;#39;s 2 occurrences of the word &amp;quot;perfect&amp;quot; out of 8 words,</text><text start="48" dur="2">so we get 2 over 8.</text><text start="50" dur="3">But in adding the Laplacian prior, 1 over here</text><text start="53" dur="2">and 1 number to add down here,</text><text start="55" dur="6">the number of classes here is the size of the vocabulary.</text><text start="61" dur="5">In total for this model there is 11 different words.</text><text start="66" dur="4">There are 16 total words in both titles,</text><text start="70" dur="4">but because of repetition there&amp;#39;s only 11 distinct words:</text><text start="74" dur="12">a, perfect, world, my, woman, pretty, day, electric, storm, another, rainy.</text><text start="86" dur="4">So we add the number of classes over here, which is 11.</text><text start="90" dur="3">We obtain 3 over 19.</text><text start="93" dur="4">For the song class there&amp;#39;s 1 occurrence of perfect.</text><text start="97" dur="4">Adding 1 we get 2 over 19.</text><text start="101" dur="2">There&amp;#39;s no occurrence of storm in the movie class.</text><text start="103" dur="3">However, our Laplacian prior gives us 1 over 19.</text><text start="106" dur="4">And there&amp;#39;s 1 occurrence of storm over here, which gives us 2 over 19.</text></transcript></video><video title="2b Naive Bayes Laplacian Smoothing" id="VqJVQlsuGoA" length="11"><transcript><text start="0" dur="4">[Thrun] For the same example I now would like to know the probability </text><text start="4" dur="5">of movie title for my query.</text><text start="9" dur="2">So please write this into the following 
box.</text></transcript></video><video title="2b Naive Bayes Laplacian Smoothing ANSWER" id="LRQKhmXpDLI" length="61"><transcript><text start="0" dur="4">[Thrun] As usual, we can resolve this using Bayes&amp;#39; rule.</text><text start="4" dur="5">Probability of Perfect Storm given movie times P of movie</text><text start="9" dur="6">divided by the same expression plus this expression for the opposite class, song.</text><text start="15" dur="6">Here I simply write 3 dots for the text Perfect Storm.</text><text start="21" dur="3">Plugging in the values over here and assuming conditional independence,</text><text start="24" dur="5">as is the case in naive Bayes, we get the probability of &amp;quot;perfect&amp;quot; given movie,</text><text start="29" dur="6">which is 3/19 and &amp;quot;storm&amp;quot; given movie, 1/19, times the prior of a half,</text><text start="35" dur="8">and we divide this by the same number plus probability of &amp;quot;perfect&amp;quot; given song,</text><text start="43" dur="8">which is 2/19, and the probability of &amp;quot;storm&amp;quot; given song, which is 2/19 times the prior of a half.</text><text start="51" dur="10">Now, all the denominators and the priors cancel, and we get 3 over 3 plus 4, which is 3 over 7.</text></transcript></video><video title="2c Maximum Likelihood" id="9SDMNmgIhBE" length="16"><transcript><text start="0" dur="3">[Thrun] I would now like to ask the exact same question</text><text start="3" dur="2">for the maximum likelihood estimator.</text><text start="5" dur="4">So let&amp;#39;s not assume we have Laplacian smoothing</text><text start="9" dur="3">and instead use the maximum likelihood estimator.</text><text start="12" dur="4">Simply compute for me the probability of movie for the title Perfect Storm.</text></transcript></video><video title="2c Maximum Likelihood ANSWER" id="3lA9jrqw7_4" length="44"><transcript><text start="0" dur="4">[Thrun] And the answer is simply 0, without much math.</text><text start="4" dur="5">The 
word &amp;quot;perfect&amp;quot; occurs in movie, but the word &amp;quot;storm&amp;quot; has never been seen before.</text><text start="9" dur="5">Therefore, the maximum likelihood estimate will assign a 0 probability to the word &amp;quot;storm,&amp;quot;</text><text start="14" dur="7">which will make the total product of the various factors involved in &amp;quot;storm&amp;quot; just 0.</text><text start="21" dur="2">That is not the case for song.</text><text start="23" dur="4">There is a non-zero probability for &amp;quot;perfect&amp;quot; and a non-zero probability for &amp;quot;storm.&amp;quot;</text><text start="27" dur="2">Hence, it will have a non-zero probability.</text><text start="29" dur="5">After normalization this will become 1 and this will become 0.</text><text start="34" dur="4">So without much math I can calculate the correct posterior </text><text start="38" dur="3">under the maximum likelihood model, which of course is disappointing</text><text start="41" dur="3">because Perfect Storm is actually a movie title.</text></transcript></video><video title="3a Linear Regression" id="rIO9zynD__M" length="15"><transcript><text start="0" dur="4">[Thrun] In this question I quiz you about linear regression.</text><text start="4" dur="3">Given the following data, my first question is, </text><text start="7" dur="6">can this data be fit exactly using a linear function that maps from X to Y?</text><text start="13" dur="2">Yes or no.</text></transcript></video><video title="3a Linear Regression ANSWER" id="yTYQg1XiBEQ" length="40"><transcript><text start="0" dur="2">[Thrun] And the answer is no.</text><text start="2" dur="6">To see, let&amp;#39;s look at the slope of the linear function if it existed.</text><text start="8" dur="3">From 0 to 1 we increment Y by 3.</text><text start="11" dur="2">We go from 3 to 6.</text><text start="13" dur="3">Therefore, the slope of it must be 3.</text><text start="16" dur="5">However, from 1 to 2 we only increase the 
function by 1, from 6 to 7.</text><text start="21" dur="3">Therefore, it can&amp;#39;t be fit linearly.</text><text start="24" dur="4">We can see the same if we plot the linear points.</text><text start="28" dur="4">Over here we could fit a linear function, but it&amp;#39;s very shallow,</text><text start="32" dur="4">whereas those points over here have a much steeper situation.</text><text start="36" dur="4">So any linear function would probably miss these points in between.</text></transcript></video><video title="3b Linear Regression" id="5gIXtI82Olk" length="16"><transcript><text start="0" dur="5">[Thrun] I would now like to ask you to perform linear regression on these data points</text><text start="5" dur="4">and calculate for me W0 and W1.</text><text start="9" dur="3">As defined in this class, we might have to go back</text><text start="12" dur="4">and look over the exact formula from the lecture that I taught on linear regression.</text></transcript></video><video title="3b Linear Regression ANSWER" id="ynxLGEE_Bgo" length="99"><transcript><text start="0" dur="5">[Thrun] For answering these questions, let me restate the essential formulas.</text><text start="5" dur="7">W1 is obtained by M times sum of XY minus sum of X times sum of Y</text><text start="12" dur="7">over M times sum Xi square minus sum of Xi in brackets square.</text><text start="19" dur="6">And if you plug in these numbers over here for M equals 5</text><text start="25" dur="9">because there&amp;#39;s 5 training examples, we get 5 times 88 minus 10 times 35</text><text start="34" dur="6">over 5 times 30 minus 100, which is 1.8.</text><text start="40" dur="3">That is the correct answer for W1.</text><text start="43" dur="9">W0 was obtained by 1 over M times sum over Ys minus W1 over M times sum over X.</text><text start="52" dur="10">And plugging in the table over here gives us 1/5 times 35 minus 1.8 over 5 times 10,</text><text start="62" dur="5">and that is 3.4, which would have been the 
correct answer over here.</text><text start="67" dur="3">And again here are the data points with the solution.</text><text start="70" dur="4">So if you take the axis where X equals 0,</text><text start="74" dur="6">the Y value is actually 3.4, and the slope is 1.8.</text><text start="80" dur="3">It&amp;#39;s a little smaller than if you just look at the end points,</text><text start="83" dur="4">which gave us a slope of 2, because there is a residual error over here,</text><text start="87" dur="3">residual error over here, residual error over here, and a residual error over here.</text><text start="90" dur="7">The resulting linear function ends up splitting in a quadratically optimal way</text><text start="97" dur="2">the errors between these different data points.</text></transcript></video><video title="4 k Nearest Neighbors" id="MhDJ47KG_Oc" length="30"><transcript><text start="0" dur="4">In my next question I would like to ask you about K-nearest neighbors.</text><text start="4" dur="3">Consider the following data set</text><text start="7" dur="3">where plus indicates a positive training example</text><text start="10" dur="4">and minus a negative training example in this 2-dimensional space.</text><text start="14" dur="4">I want you, for the following places,</text><text start="18" dur="3">to check for those boxes over here</text><text start="21" dur="5">whether they will be plus for K=5.</text><text start="26" dur="4">Only check those boxes for which the label will be positive.</text></transcript></video><video title="4 k Nearest Neighbors ANSWER" id="01qBi27m3Ss" length="49"><transcript><text start="0" dur="3">And the answer would be this box over here</text><text start="3" dur="3">and this box over here, nothing else.</text><text start="6" dur="4">This guy has clearly 4 positive nearest neighbors,</text><text start="10" dur="4">so no matter what the 5th one is we stay positive.</text><text start="14" dur="2">Similarly, this guy over here,</text><text 
start="16" dur="2">when you draw a circle,</text><text start="18" dur="3">has probably these 4 guys as nearest neighbors,</text><text start="21" dur="3">perhaps this one as well, but it is a little further away.</text><text start="24" dur="3">With those 4 ones, it already has 3 pluses, </text><text start="27" dur="3">so whatever the 5th one is it can&amp;#39;t overturn, it must be positive.</text><text start="30" dur="4">All of these are negative, even this one over here has just 2 pluses as neighbors</text><text start="34" dur="3">that are positive, and those guys over here are all negative.</text><text start="37" dur="4">Similarly, over here there are possibly 2 pluses</text><text start="41" dur="2">in the 5 nearest neighbors, </text><text start="43" dur="3">but these guys over here are all negative, and this guy is</text><text start="46" dur="3">surrounded by negative examples, so they will just be negative.</text></transcript></video><video title="5 k Nearest Neighbors" id="SAG4-uC9BnE" length="31"><transcript><text start="0" dur="5">Here&amp;#39;s another nearest neighbor example, and now I&amp;#39;m going to ask a different question.</text><text start="5" dur="2">Given all the black data points,</text><text start="7" dur="3">I want to make sure that the red ones are classified as indicated</text><text start="10" dur="4">and I am free to choose a different value for K.</text><text start="14" dur="5">Say I can choose K to be 1, 3, 5, 7, or 9.</text><text start="19" dur="4">Check any or all of the K values</text><text start="23" dur="3">for which you believe these 3 data points </text><text start="26" dur="5">are classified correctly relative to the black training data set.</text></transcript></video><video title="5 k Nearest Neighbors ANSWER" id="IjzpuYn7Szc" length="74"><transcript><text start="0" dur="4">And the answer is just 5.</text><text start="4" dur="3">If you look carefully for K=1,</text><text start="7" dur="3">this guy will be 
mis-classified.</text><text start="10" dur="3">It&amp;#39;s closer to a plus than a minus.</text><text start="13" dur="4">Similarly, for K=3, this guy has 2 nearby pluses and 1 minus,</text><text start="17" dur="2">so it would be positive.</text><text start="19" dur="3">For K=5, to get the correct answer,</text><text start="22" dur="3">the 5 nearest neighbors of this guy are </text><text start="25" dur="5">those 3 minuses plus perhaps those 2 pluses over here.</text><text start="30" dur="3">This guy has in his 5 neighborhood </text><text start="33" dur="3">these 3 pluses over here, plus a minus, plus a plus.</text><text start="36" dur="5">This data point over here has 2 pluses with 3 of the surrounding minuses,</text><text start="41" dur="2">and they are all classified correctly.</text><text start="43" dur="5">For K=7, this minus data point will have</text><text start="48" dur="3">4 pluses, 3 minuses over here,</text><text start="51" dur="3">and then everything in the vicinity becomes positive,</text><text start="54" dur="3">so it must be 4 plus and it will be mis-classified.</text><text start="57" dur="3">The same is true for K=9.</text><text start="60" dur="4">The minus over here will have 1, 2, 3 minuses,</text><text start="64" dur="3">5 pluses, and the minus over here,</text><text start="67" dur="3">which makes 4 minuses.</text><text start="70" dur="2">It will be classified as positive.</text><text start="72" dur="2">So K=5 would have been the only correct answer.</text></transcript></video><video title="6 Perceptron" id="-fpVTLGoxZ4" length="27"><transcript><text start="0" dur="5">Let me now ask about the perceptron algorithm. Suppose you have the following </text><text start="5" dur="3">2-dimensional data set where plus indicates</text><text start="8" dur="5">a positive class label and minus a negative class label, and my first question is, </text><text start="13" dur="2">&amp;quot;Are these data linearly separable?&amp;quot; </text><text start="15" 
dur="3">I&amp;#39;d also like to know if we start perceptron </text><text start="18" dur="3">with an initial separating plane like this,</text><text start="21" dur="4">will it actually converge?</text><text start="25" dur="2">Please check the appropriate boxes yes or no, and yes or no.</text></transcript></video><video title="6 Perceptron ANSWER" id="P88qJlIRnwI" length="22"><transcript><text start="0" dur="3">[Narrator] And the answer is yes in both cases.</text><text start="3" dur="2">There is a linear separation that </text><text start="5" dur="4">goes along here that separates the positive class from the negative class,</text><text start="9" dur="3">and it&amp;#39;s been shown in the 60s</text><text start="12" dur="2">that the perceptron algorithm always </text><text start="14" dur="2">converges after finitely many steps</text><text start="16" dur="2">if such a linear separator exists. </text><text start="18" dur="2">I&amp;#39;m not going to prove this, and I didn&amp;#39;t prove this</text><text start="20" dur="2">in class, but I clearly said this in class.</text></transcript></video><video title="7 Congratulations" id="VBw64HT6FlU" length="3"><transcript><text start="0" dur="3">Congratulations! 
You just finished the third homework assignment.</text></transcript></video></group><group title="Unit 7" count="19"><video title="1 Introduction" id="pszEzBql4bw" length="68"><transcript><text start="0" dur="2">Welcome back.</text><text start="2" dur="2">So far we&amp;#39;ve talked about AI </text><text start="4" dur="4">as managing complexity and uncertainty.</text><text start="8" dur="3">We&amp;#39;ve seen how search can discover sequences </text><text start="11" dur="2">of actions to solve problems.</text><text start="13" dur="2">We&amp;#39;ve seen how probability theory</text><text start="15" dur="3">can represent and reason with uncertainty.</text><text start="18" dur="2">And we&amp;#39;ve seen how machine learning</text><text start="20" dur="4">can be used to learn and improve.</text><text start="24" dur="2">AI is a big and dynamic field</text><text start="26" dur="2">because we are pushing against complexity</text><text start="28" dur="2">in at least 3 directions.</text><text start="30" dur="2">First, in terms of agent design,</text><text start="32" dur="3">we start with a simple reflex-based agent</text><text start="35" dur="4">and move into goal-based and utility-based agents.</text><text start="39" dur="3">Secondly, in terms of the complexity of the environment, </text><text start="42" dur="2">we start with simple environments </text><text start="44" dur="3">and then start looking at partial observability,</text><text start="47" dur="4">stochastic actions, multiple agents, and so on.</text><text start="51" dur="3">And finally, in terms of representation, </text><text start="54" dur="2">the agent&amp;#39;s model of the world </text><text start="56" dur="2">becomes increasingly complex.</text><text start="58" dur="2">And this unit will concentrate </text><text start="60" dur="3">on that third aspect of representation,</text><text start="63" dur="2">showing how the tools of logic</text><text start="65" dur="3">can be used by an agent to better model the 
world.</text></transcript></video><video title="2 Propositional Logic" id="_VjyktjNMoM" length="223"><transcript><text start="0" dur="7">The first logic we will consider is called propositional logic.</text><text start="7" dur="5">Let&amp;#39;s jump right into an example, recasting the alarm problem in propositional logic.</text><text start="12" dur="11">We have propositional symbols B, E, A, M, and J</text><text start="23" dur="5">corresponding to the events of a burglary occurring, of the earthquake occurring,</text><text start="28" dur="6">of the alarm going off, of Mary calling, and of John calling.</text><text start="34" dur="3">And just as in the probabilistic models, </text><text start="37" dur="3">these can be either true or false,</text><text start="40" dur="4">but unlike in probability, our degree of belief in propositional logic </text><text start="44" dur="3">is not a number. </text><text start="47" dur="6">Rather, our belief is that each of these is either true or false or unknown.</text><text start="53" dur="4">Now, we can make logical sentences using these symbols</text><text start="57" dur="7">and also using the logical constants true and false</text><text start="64" dur="4">by combining them together using logical operators.</text><text start="68" dur="4">For example, we can say that the alarm is true</text><text start="72" dur="4">whenever the earthquake or burglary is true with this sentence.</text><text start="76" dur="12">(E V B) =&amp;gt; A, that is, E or B implies A.</text><text start="88" dur="7">So that says whenever the earthquake or the burglary is true, </text><text start="95" dur="3">then the alarm will be true.</text><text start="98" dur="2">We use this V symbol to mean or </text><text start="100" dur="3">and a right arrow to mean implies. </text><text start="103" dur="4">We could also say that it would be true that both John and Mary call </text><text start="107" dur="3">when the alarm is true. 
</text><text start="110" dur="11">We write that as A implies (J ^ M)</text><text start="121" dur="4">and we use this symbol ^ to indicate an and,</text><text start="125" dur="4">so that this upward-facing wedge looks kind of like an A </text><text start="129" dur="5">with the crossbar missing, and so you can remember A is for &amp;quot;and&amp;quot;</text><text start="134" dur="5">whereas this downward-facing V symbol is the opposite of and,</text><text start="139" dur="3">so that&amp;#39;s the symbol for or.</text><text start="142" dur="3">Now, there are 2 more connectives we haven&amp;#39;t seen yet.</text><text start="145" dur="4">There&amp;#39;s a double arrow for equivalent, also known as a biconditional,</text><text start="149" dur="3">and a not sign for negation,</text><text start="152" dur="7">so we could say if we wanted to that John calls if and only if Mary calls.</text><text start="159" dur="6">We would write that as J &amp;lt;=&amp;gt; M.</text><text start="165" dur="3">John is equivalent to Mary--when one is true, the other is true;</text><text start="168" dur="3">when one is false, the other is false. </text><text start="171" dur="5">Or we could say that when John calls, Mary doesn&amp;#39;t, and vice versa. 
</text><text start="176" dur="8">We could write that as J &amp;lt;=&amp;gt; not M, John is equivalent to not Mary,</text><text start="184" dur="4">and this is the not sign.</text><text start="188" dur="3">Now, how do we know what the sentences mean?</text><text start="191" dur="3">A propositional logic sentence is either true or false </text><text start="194" dur="3">with respect to a model of the world.</text><text start="197" dur="4">Now, a model is just a set of true/false values for all the propositional symbols,</text><text start="201" dur="13">so a model might be the set B is true, E is false, and so on.</text><text start="214" dur="5">We can define the truth of the sentence in terms of the truth of the symbols </text><text start="219" dur="4">with respect to the models using truth tables. </text></transcript></video><video title="2a Truth Tables" id="eOp4UJl0ZIA" length="162"><transcript><text start="0" dur="5">[Male narrator] Here are the truth tables for all the logical connectives.</text><text start="5" dur="5">What a truth table does is list all the possibilities for the propositional symbols,</text><text start="10" dur="6">so P and Q can be false and false, false and true, true and false, or true and true.</text><text start="16" dur="3">Those are the only 4 possibilities,</text><text start="19" dur="5">and then for each of those possibilities, the truth table lists the truth value </text><text start="24" dur="2">of the compound sentence. </text><text start="26" dur="6">So the sentence not P is true when P is false and false when P is true. </text><text start="32" dur="9">The sentence P and Q is true only when both P and Q are true and false otherwise. </text><text start="41" dur="6">The sentence P or Q is true when either P or Q is true</text><text start="47" dur="3">and false when both are false. 
</text><text start="50" dur="7">Now, so far, those mostly correspond to the English meaning of those sentences</text><text start="57" dur="5">with one exception, which is that in English, the word &amp;quot;or&amp;quot; is somewhat ambiguous</text><text start="62" dur="5">between the inclusive and exclusive or,</text><text start="67" dur="5">and this &amp;quot;or&amp;quot; means either or both.</text><text start="72" dur="7">We translate this mark into English P implies Q; or as if P, then Q,</text><text start="79" dur="5">but the meaning in logic is not quite the same as the meaning in ordinary English.</text><text start="84" dur="5">The meaning in logic is defined explicitly by this truth table</text><text start="89" dur="5">and by nothing else, but let&amp;#39;s look at some examples in ordinary English.</text><text start="94" dur="10">If we have the proposition O and have that mean 5 is an odd number</text><text start="104" dur="6">and P meaning Paris is the capital of France, </text><text start="110" dur="4">then under the ordinary model of the truth in the real world, </text><text start="114" dur="7">what could we say about the sentence O implies P?</text><text start="121" dur="7">That is, 5 is an odd number implies Paris is the capital of France.</text><text start="128" dur="6">Would that be true or false?</text><text start="134" dur="3">And let&amp;#39;s look at one more example.</text><text start="137" dur="4">If E is the proposition that 5 is an even number</text><text start="141" dur="5">and M is the proposition that Moscow is the capital of France,</text><text start="146" dur="5">what about E implies M?</text><text start="151" dur="5">5 is an even number implies Moscow is the capital of France.</text><text start="156" dur="6">Is that true or false?</text></transcript></video><video title="2b Answer" id="4HWzU7RhfZE" length="47"><transcript><text start="0" dur="3">[Male narrator] The answers are first, </text><text start="3" dur="3">the sentence if 5 is 
an odd number, </text><text start="6" dur="4">then Paris is the capital of France, is true</text><text start="10" dur="2">in propositional logic.</text><text start="12" dur="3">It may sound odd in ordinary English, </text><text start="15" dur="6">but in propositional logic, this is the same as true implies true</text><text start="21" dur="4">and if we look on this line--the final line for P and Q,  </text><text start="25" dur="3">P implies Q is true.</text><text start="28" dur="7">The second sentence, 5 is an even number, implies Moscow is the capital of France.</text><text start="35" dur="3">That&amp;#39;s the same as false implies false, </text><text start="38" dur="9">and false implies false according to the definition is also true.</text></transcript></video><video title="2c Question" id="ae_9TnTNPU0" length="31"><transcript><text start="0" dur="2">[Male narrator] Here&amp;#39;s a quiz.</text><text start="2" dur="4">Use truth tables or whatever other method you want</text><text start="6" dur="3">to fill in the values of these tables.</text><text start="9" dur="5">For each of the values of P and Q--false/false, false/true, true/false, or true/true--</text><text start="14" dur="4">look at each of these boxes and click on just the boxes</text><text start="18" dur="4">in which the formula for that column will be true.</text><text start="22" dur="6">So which of these 4 boxes, if any, will this formula be true, </text><text start="28" dur="3">and this formula and this formula?</text></transcript></video><video title="2d Answer" id="vGfOrh8ERXo" length="80"><transcript><text start="0" dur="3">[Male narrator] Here are the answers.</text><text start="3" dur="5">For P and P implies Q, we know that P is true</text><text start="8" dur="6">in these bottom 2 cases, and P implies Q, we saw the truth table for P implies Q</text><text start="14" dur="5">is true in the first, second, and fourth case.</text><text start="19" dur="9">So the only case that&amp;#39;s true for both P 
and P implies Q is the fourth case. </text><text start="28" dur="9">Now, this formula, not the quantity not P or not Q, we can work that out to be the same</text><text start="37" dur="9">as P and Q, and we know that P and Q is true only when both are true, </text><text start="46" dur="5">so that would be true only in the fourth case and none of the other cases.</text><text start="51" dur="6">And now, we&amp;#39;re asking for an equivalent or biconditional between these 2 cases.</text><text start="57" dur="2">Is this one the same as this one?</text><text start="59" dur="4">And we see that it is the same because they match up in all 4 cases.</text><text start="63" dur="4">They&amp;#39;re false for each of the first 3 and true in the fourth one,</text><text start="67" dur="4">so that means that this is going to be true no matter what.</text><text start="71" dur="4">They&amp;#39;re always equivalent, either both false or both true, </text><text start="75" dur="5">and so we should check all 4 boxes. </text></transcript></video><video title="2e Question" id="DDKhgqEWBps" length="56"><transcript><text start="0" dur="4">[Male narrator] Here&amp;#39;s one more example of reasoning in propositional logic. </text><text start="4" dur="4">In a particular model of the world, we know the following 3 sentences are true.</text><text start="8" dur="7">E or B implies A,</text><text start="15" dur="8">A implies J and M,</text><text start="23" dur="3">and B.</text><text start="26" dur="5">We know those 3 sentences to be true, and that&amp;#39;s all we know.</text><text start="31" dur="7">Now, I want you to tell me for each of the 5 propositional symbols, </text><text start="38" dur="7">is that symbol true or false, or unknown in this model,</text><text start="45" dur="11">and tell me for the symbols E, B, A, J, and M.</text></transcript></video><video title="2f Answer" id="pxGxSt58kg0" length="39"><transcript><text start="0" dur="4">The answer is that B is true. 
</text><text start="4" dur="4">And we know that because it was one of the 3 sentences that was given to us.</text><text start="8" dur="7">And now, the first sentence says that if E or B is true then A is true.</text><text start="15" dur="2">So now we know that A is true.</text><text start="17" dur="7">And the second sentence says if A is true then J and M are true. </text><text start="24" dur="2">What about E? That wasn&amp;#39;t mentioned.</text><text start="26" dur="2">Does that mean E is false? No. </text><text start="28" dur="6">It means that it is unknown: a model where E is true and a model where E is false</text><text start="34" dur="5">would both satisfy these 3 sentences. So we mark E as unknown.</text></transcript></video><video title="2g Terminology" id="nGZ2-EZnSh4" length="91"><transcript><text start="0" dur="3">Now for a little more terminology.</text><text start="3" dur="6">We say that a valid sentence is one that is true in every possible model,</text><text start="9" dur="5">for every combination of values of the propositional symbols.</text><text start="14" dur="10">And a satisfiable sentence is one that is true in some models, but not necessarily in all the models. 
</text><text start="24" dur="6">So what I want you to do is tell me for each of these sentences,</text><text start="30" dur="12">whether it is valid, satisfiable but not valid, or unsatisfiable, in other words, false for all models.</text><text start="42" dur="9">And the sentences are P or not P, P and not P, </text><text start="51" dur="19">P or Q or P is equivalent to Q, P implies Q or Q implies P.</text><text start="70" dur="21">And finally, Food implies Party or Drinks implies Party implies Food and Drinks implies Party.</text></transcript></video><video title="2h Answer" id="WOYA5kQ_6gI" length="100"><transcript><text start="0" dur="5">The answers are P or not P is valid.</text><text start="5" dur="8">That is, it&amp;#39;s true when P is true because of this, and it&amp;#39;s true when P is false because of this clause.</text><text start="13" dur="4">P and not P is unsatisfiable.</text><text start="17" dur="5">A symbol can&amp;#39;t be both true and false at the same time.</text><text start="22" dur="6">P or Q or P is equivalent to Q is valid. </text><text start="28" dur="6">So we know that it&amp;#39;s true when either P or Q is true, so that&amp;#39;s 3 out of the 4 cases.</text><text start="34" dur="6">In the fourth case, both P and Q are false, and that means P is equivalent to Q.</text><text start="40" dur="4">And therefore, in all 4 cases, it&amp;#39;s true.</text><text start="44" dur="4">P implies Q or Q implies P, that&amp;#39;s also valid.</text><text start="48" dur="3">Now in ordinary English that wouldn&amp;#39;t be valid. </text><text start="51" dur="7">If the 2 clauses or the 2 symbols P and Q were irrelevant to each other we wouldn&amp;#39;t say that either one of those was true. 
</text><text start="58" dur="6">But in logic, one or the other must be true, according to the definitions of the truth tables.</text><text start="64" dur="4">And finally, this one&amp;#39;s more complicated,</text><text start="68" dur="9">if Food then Party or if Drinks then Party implies if Food and Drinks then Party.</text><text start="77" dur="12">You can work it all out and both sides of the main implication work out to be equivalent to Not Food or Not Drinks or Party.</text><text start="89" dur="6">So that&amp;#39;s the same as saying P implies P, saying one side is equivalent to the other side.</text><text start="95" dur="5">And if they&amp;#39;re equivalent, then the implication relation holds.</text></transcript></video><video title="2i Propositional Logic Limitations" id="WQ7-B4H6-aE" length="76"><transcript><text start="0" dur="4">Propositional logic. It&amp;#39;s a powerful language for what it does.</text><text start="4" dur="3">And there are very efficient inference mechanisms for determining </text><text start="7" dur="5">validity and satisfiability, although we haven&amp;#39;t discussed them.</text><text start="12" dur="3">But propositional logic has a few limitations.</text><text start="15" dur="4">First, it can only handle true and false values.</text><text start="19" dur="8">No capability to handle uncertainty like we did in probability theory.</text><text start="27" dur="4">And second, we can only talk about events that are true or false in the world. 
</text><text start="31" dur="6">We can&amp;#39;t talk about objects that have properties,</text><text start="37" dur="3">such as size, weight, color, and so on.</text><text start="40" dur="4">Nor can we talk about the relations between objects.</text><text start="44" dur="9">And third, there are no shortcuts to succinctly talk about a lot of different things happening.</text><text start="53" dur="6">Say if we had a vacuum world with a thousand  locations, and we wanted to say that every location is free of dirt. </text><text start="59" dur="4">We would need a conjunction of a thousand propositions.</text><text start="63" dur="6">There&amp;#39;s no way to have a single sentence saying that all the locations are clean all at once.</text><text start="69" dur="7">So, we will next cover first-order logic which addresses these two limitations.</text></transcript></video><video title="3 First Order Logic" id="hFzVZzMPy8Q" length="222"><transcript><text start="0" dur="4">[Norvig] I&amp;#39;m going to talk about first order logic</text><text start="4" dur="5">and its relation to the other logics we&amp;#39;ve seen so far--</text><text start="9" dur="9">namely, propositional logic and probability theory.</text><text start="18" dur="5">We&amp;#39;re going to talk about them in terms of what they say about the world,</text><text start="23" dur="6">which we call the ontological commitment of these logics,</text><text start="29" dur="6">and what types of beliefs agents can have using these logics,</text><text start="35" dur="4">which we call the epistemological commitments.</text><text start="39" dur="7">So in first order logic we have relations about things in the world,</text><text start="46" dur="3">objects, and functions on those objects.</text><text start="49" dur="10">And what we can believe about those relations is that they&amp;#39;re true or false or unknown.</text><text start="59" dur="3">So this is an extension of propositional logic </text><text start="62" dur="4">in 
which all we had was facts about the world </text><text start="66" dur="7">and we could believe that those facts were true or false or unknown.</text><text start="73" dur="8">In probability theory we had the same types of facts as in propositional logic--</text><text start="81" dur="9">the symbols or variables--but the beliefs could be a real number in the range 0 to 1.</text><text start="90" dur="4">So logics vary both in what you can say about the world</text><text start="94" dur="4">and what you can believe about what&amp;#39;s been said about the world.</text><text start="98" dur="3">Another way to look at representation</text><text start="101" dur="9">is to break the world up into representations that are atomic,</text><text start="110" dur="4">meaning that a representation of the state is just an individual state</text><text start="114" dur="3">with no pieces inside of it.</text><text start="117" dur="6">And that&amp;#39;s what we used for search and problem solving.</text><text start="123" dur="3">We had a state, like state A,</text><text start="126" dur="5">and then we transitioned to another state, like state B,</text><text start="131" dur="4">and all we could say about those states was are they identical to each other or not</text><text start="135" dur="4">and maybe is one of them a goal state or not.</text><text start="139" dur="5">But there wasn&amp;#39;t any internal structure to those states.</text><text start="144" dur="4">In propositional logic, as well as in probability theory,</text><text start="148" dur="5">we break up the world into a set of facts that are true or false,</text><text start="153" dur="4">so we call this a factored representation--</text><text start="157" dur="4">that is, the representation of an individual state of the world</text><text start="161" dur="6">is factored into several variables--the B and E and A and M and J, for example--</text><text start="167" dur="4">and those could be Boolean variables or in some types of 
representations--</text><text start="171" dur="8">not in propositional logic--they can be other types of variables besides Boolean.</text><text start="179" dur="7">Then the third type--the most complex type of representation--we call structured.</text><text start="186" dur="8">And in a structured representation, an individual state is not just a set of values for variables,</text><text start="194" dur="3">but it can include relationships between objects,</text><text start="197" dur="5">a branching structure, and complex representations and relations</text><text start="202" dur="3">between one object and another.</text><text start="205" dur="3">And that&amp;#39;s what we see in traditional programming languages,</text><text start="208" dur="4">it&amp;#39;s what we see in databases--they&amp;#39;re called structured databases,</text><text start="212" dur="4">and we have structured query languages over those databases--</text><text start="216" dur="3">and that&amp;#39;s a more powerful representation,</text><text start="219" dur="3">and that&amp;#39;s what we get in first order logic.</text></transcript></video><video title="3a Models" id="TZ8iD-Rofk8" length="236"><transcript><text start="0" dur="4">[Norvig] How does first order logic work? 
What does it do?</text><text start="4" dur="4">Like propositional logic, we start with a model.</text><text start="8" dur="5">In propositional logic a model was a value for each propositional symbol.</text><text start="13" dur="5">So we might say that the symbol P was true</text><text start="18" dur="4">and the symbol Q was false,</text><text start="22" dur="8">and that would be a model that corresponds to what&amp;#39;s going on in a possible world.</text><text start="30" dur="2">In first order logic the models are more complex.</text><text start="32" dur="3">We start off with a set of objects.</text><text start="35" dur="4">Here I&amp;#39;ve shown 4 objects, these 4 tiles,</text><text start="39" dur="3">but we could have more objects than that.</text><text start="42" dur="4">We could say, for example, that the numbers 1, 2, and 3</text><text start="46" dur="3">were also objects in our model.</text><text start="49" dur="2">So we have a set of objects.</text><text start="51" dur="7">We can also have a set of constants that refer to those objects.</text><text start="58" dur="10">So I could use the constant names A, B, C, D, 1, 2, 3,</text><text start="68" dur="2">but I don&amp;#39;t have to have a one-to-one correspondence </text><text start="70" dur="3">between constants and objects.</text><text start="73" dur="5">I could have 2 different constant names that refer to the same object.</text><text start="78" dur="6">I could also have, say, the name C that refers to this object,</text><text start="84" dur="4">or I could have some of the objects that don&amp;#39;t have any names at all.</text><text start="88" dur="10">But I&amp;#39;ve got a set of constants, and I also have a set of functions.</text><text start="98" dur="8">A function is defined as a mapping from objects to objects.</text><text start="106" dur="6">And so, for example, I might have the Number Of function</text><text start="112" dur="4">that maps from a tile to the number on that tile,</text><text 
start="116" dur="8">and that function then would be defined by the mapping from A to 1</text><text start="124" dur="9">and B to 3 and C to 3 and D to 2,</text><text start="133" dur="4">and I could have other functions as well.</text><text start="137" dur="6">In addition to functions, I can have relations.</text><text start="143" dur="5">For example, I could have the Above relation,</text><text start="148" dur="8">and I could say in this model of the world the Above relation is a set of tuples.</text><text start="156" dur="5">Say A is above B and C is above D.</text><text start="161" dur="5">So that was a binary relation holding between 2 objects.</text><text start="166" dur="4">Say 1 block is above another block.</text><text start="170" dur="2">We can have other types of relations.</text><text start="172" dur="5">For example, here is a unary relation--vowel--</text><text start="177" dur="7">and if we want to say the relation Vowel is true only of the object that we call A,</text><text start="184" dur="7">then that&amp;#39;s a set of tuples of length 1 that contains just A.</text><text start="191" dur="5">We can even have relations over no objects.</text><text start="196" dur="4">Say we wanted to have the relation Rainy, which doesn&amp;#39;t refer to any objects at all</text><text start="200" dur="4">but just refers to the current situation.</text><text start="204" dur="6">Then since it&amp;#39;s not rainy today, we would represent that as the empty set.</text><text start="210" dur="4">There&amp;#39;s no tuples corresponding to that relation.</text><text start="214" dur="8">Or, if it was rainy, we could say that it&amp;#39;s represented by a singleton set,</text><text start="222" dur="8">and since the arity of Rainy is 0, there would be 0 elements in each one of those tuples.</text><text start="230" dur="6">So that&amp;#39;s what a model in first order logic looks like.</text></transcript></video><video title="3b Syntax" id="Th_wM93aF94" 
length="244"><transcript><text start="0" dur="5">[Man] Now let&amp;#39;s talk about the syntax of first order logic,</text><text start="5" dur="4">and like in propositional logic, </text><text start="9" dur="5">we have sentences which describe facts that are true or false.</text><text start="14" dur="6">But unlike propositional logic, we also have terms</text><text start="20" dur="2">which describe objects. </text><text start="22" dur="7">Now, the atomic sentences are predicates corresponding to relations,</text><text start="29" dur="8">so we can say vowel (A) is an atomic sentence</text><text start="37" dur="6">or above (A, B).</text><text start="43" dur="6">And we also have a distinguished relation--the equality relation.</text><text start="49" dur="9">We can say 2 = 2 and the equality relation is always in every model,</text><text start="58" dur="9">and sentences can be combined with all the operators from propositional logic</text><text start="67" dur="13">so that&amp;#39;s and, or, not, implies, equivalent, and parentheses.</text><text start="80" dur="6">Now, terms, which refer to objects, can be constants, </text><text start="86" dur="4">like A, B, and 2. </text><text start="90" dur="2">They can be variables. 
</text><text start="92" dur="4">We normally use lowercase, like x and y.</text><text start="96" dur="5">And they can be functions, like number of A,</text><text start="101" dur="7">which is just another name or another expression that refers to the same object as 1,</text><text start="108" dur="2">at least in the model that we showed previously.</text><text start="110" dur="3">And then, there&amp;#39;s 1 more type of complex sentence</text><text start="113" dur="4">besides the sentences we get by combining operators,</text><text start="117" dur="6">that makes first order logic unique, and these are the quantifiers.</text><text start="123" dur="6">And there are two quantifiers: for all, which we write with an upside-down A</text><text start="129" dur="4">followed by a variable that it introduces</text><text start="133" dur="5">and there exists, which we write with an upside-down E </text><text start="138" dur="3">followed by the variable that it introduces. </text><text start="141" dur="7">So for example, we could say for all x, if x is a vowel, </text><text start="148" dur="5">then the number of (x) is equal to 1,</text><text start="153" dur="3">and that&amp;#39;s a valid sentence in first order logic. </text><text start="156" dur="9">Or we could say there exists an x such that the number of (x)</text><text start="165" dur="2">is equal to 2,</text><text start="167" dur="4">and this is saying that there&amp;#39;s some object in the domain </text><text start="171" dur="4">to which the number of function applies and has a value of 2,</text><text start="175" dur="3">but we&amp;#39;re not saying what that object is. 
</text><text start="178" dur="3">Now, another note is that sometimes as an abbreviation, </text><text start="181" dur="5">we&amp;#39;ll omit the quantifier, and when we do that, </text><text start="186" dur="7">you can just assume that it means for all; that&amp;#39;s left out just as a shortcut.</text><text start="193" dur="3">And I should say that these forms, or these sentences are typical,</text><text start="196" dur="3">and you&amp;#39;ll see these forms over and over again, </text><text start="199" dur="5">so typically, whenever we have a &amp;quot;for all&amp;quot; quantifier introduced, </text><text start="204" dur="7">it tends to go with a conditional like vowel of (x) implies number of (x) =1, </text><text start="211" dur="4">and the reason is because we usually don&amp;#39;t want to say something about every object </text><text start="215" dur="4">in the domain, since the objects can be so different,</text><text start="219" dur="4">but rather, we want to say something about a particular type of object, </text><text start="223" dur="2">say, in this case, vowels. </text><text start="225" dur="9">And also, typically, when we have an exists an x, or an exists any variable, </text><text start="234" dur="4">that typically goes with just a form like this, </text><text start="238" dur="4">and not with a conditional, because we&amp;#39;re talking about just 1 object</text><text start="242" dur="2"> that we want to describe.</text></transcript></video><video title="3c Vacuum World" id="nkRYz5Omcr0" length="196"><transcript><text start="0" dur="3">[man] Now let&amp;#39;s go back to the 2-location vacuum world </text><text start="3" dur="3">and represent it in first order logic. 
</text><text start="6" dur="3">So first of all, we can have locations.</text><text start="9" dur="6">We can call the left location A and the right location B</text><text start="15" dur="8">and the vacuum V, and the dirt--say, D1 and D2.</text><text start="23" dur="4">Then, we can have relations.</text><text start="27" dur="5">The relation loc, which is true of any location;</text><text start="32" dur="2">vacuum, which is true of the vacuum; </text><text start="34" dur="3">dirt, which is true of dirt; </text><text start="37" dur="7">and at, which is true of an object and a location.</text><text start="44" dur="5">And so if we wanted to say the vacuum is at location A, </text><text start="49" dur="5">we just say at (V, A).</text><text start="54" dur="6">If we want to say there&amp;#39;s no dirt in any location, it&amp;#39;s a little bit more complicated. </text><text start="60" dur="7">We can say for all dirt and for all locations, </text><text start="67" dur="6">if D is a dirt, and L is a location, </text><text start="73" dur="5">then D is not at L.</text><text start="78" dur="3">So that says there&amp;#39;s no dirt in any location. </text><text start="81" dur="5">Now, note if there were thousands of locations instead of just 2, </text><text start="86" dur="6">this sentence would still hold, and that&amp;#39;s really the power of first order logic. </text><text start="92" dur="3">Let&amp;#39;s keep going and try some more examples. </text><text start="95" dur="7">If I want to say the vacuum is in a location with dirt without specifying what location it&amp;#39;s in,</text><text start="102" dur="2">I can do that. </text><text start="104" dur="9">I can say there exists an L and there exists a D</text><text start="113" dur="8">such that D is a dirt and L is a location </text><text start="121" dur="6">and the vacuum is at the location </text><text start="127" dur="4">and the dirt is at that same location. 
</text><text start="131" dur="3">and that&amp;#39;s the power of first order logic. </text><text start="134" dur="2">Now one final thing.</text><text start="136" dur="3">You might ask what &amp;quot;first order&amp;quot; means.</text><text start="139" dur="5">It means that the relations are on objects, but not on relations, </text><text start="144" dur="2">and that would be called &amp;quot;higher order.&amp;quot;</text><text start="146" dur="7">In higher order logic, we could, say, define the notion of a transitive relation </text><text start="153" dur="5">talking about relations themselves, and so we could say </text><text start="158" dur="14">for all R, transitive of R is equivalent to for all A, B, and C; </text><text start="172" dur="14">R of (A, B) and R of (B, C) implies R (A, C).</text><text start="186" dur="4">So that would be a valid statement in higher order logic </text><text start="190" dur="3">that would define the notion of a transitive relation,</text><text start="193" dur="3">but this would be invalid in first order logic. </text></transcript></video><video title="3d Question" id="JcQrAin3_V8" length="74"><transcript><text start="0" dur="3">[Man] Now let&amp;#39;s get some practice in first order logic. </text><text start="3" dur="3">I&amp;#39;m going to give you some sentences, and for each one, </text><text start="6" dur="6">I want you to tell me if it is valid--that is, always true--</text><text start="12" dur="7">satisfiable, but not valid; that is, there&amp;#39;s some models for which it is true; </text><text start="19" dur="6">or unsatisfiable, meaning there are no models for which it is true. </text><text start="25" dur="6">And the first sentence is there exists an x and a y </text><text start="31" dur="4">such that x = y. </text><text start="35" dur="8">Second sentence: there exists an x such that x = x, </text><text start="43" dur="13">implies for all y there exists a z such that y = z. 
</text><text start="56" dur="10">Third sentence: for all x, P of x or not P of x. </text><text start="66" dur="8">And fourth: there exists an x, P of x. </text></transcript></video><video title="3e Answer" id="vlbyPKhayiE" length="80"><transcript><text start="0" dur="4">[Man] The answers are the first sentence is valid. </text><text start="4" dur="2">It&amp;#39;s always true. </text><text start="6" dur="1">Why is that? </text><text start="7" dur="3">Because every model has to have at least 1 object</text><text start="10" dur="4">and we can have both x and y refer to that same object,</text><text start="14" dur="3">and so that object must be equal to itself. </text><text start="17" dur="3">Second, let&amp;#39;s see. </text><text start="20" dur="3">The left-hand side of this implication has to be true.</text><text start="23" dur="2">X is always equal to x, </text><text start="25" dur="6">and the right-hand side says for every y, does there exist a z</text><text start="31" dur="3">such that y equals z? </text><text start="34" dur="1">And we can say yes, there is. </text><text start="35" dur="3">We can always choose y itself for the value of z, </text><text start="38" dur="4">and then y = y, so true implies true.</text><text start="42" dur="3">That&amp;#39;s always true. </text><text start="45" dur="1">Valid. </text><text start="46" dur="4">Third sentence: for all x, P of x or not P of x, </text><text start="50" dur="5">and that&amp;#39;s always true because everything has to be either in the relation for P </text><text start="55" dur="5">or out of the relation for P, so that&amp;#39;s valid. </text><text start="60" dur="5">And the fourth: there exists an x, P of x, and that&amp;#39;s true for the models </text><text start="65" dur="4">in which there is some x that is a member of P, </text><text start="69" dur="2">but it doesn&amp;#39;t necessarily have to be any at all.</text><text start="71" dur="5">P might be an empty relation, so this is satisfiable. 
</text><text start="76" dur="4">True in some models, but not true in all models. </text></transcript></video><video title="3f Question" id="upyRNIh1LyE" length="179"><transcript><text start="0" dur="5">[Man] Now I&amp;#39;m going to give you some sentences or axioms in first order logic,</text><text start="5" dur="6">and I want you to tell me if they correctly or incorrectly represent the English </text><text start="11" dur="2">that I&amp;#39;m asking about.</text><text start="13" dur="6">So tell me yes or no, are these good representations? </text><text start="19" dur="4">And the first, I want to represent the English sentence </text><text start="23" dur="6">&amp;quot;Sam has 2 jobs,&amp;quot; and the first order logic sentence is </text><text start="29" dur="10">there exists an x and y such that job of Sam x </text><text start="39" dur="12">and job of Sam y and not x = y. </text><text start="51" dur="6">And so tell me yes, that correctly represents Sam has 2 jobs, </text><text start="57" dur="2">or no, there&amp;#39;s a problem. </text><text start="59" dur="5">And secondly, I want to represent the idea of set membership. </text><text start="64" dur="6">Now, assume I&amp;#39;ve already defined the notion of adding an element to a set.</text><text start="70" dur="3">Can I define set membership with these 2 axioms?</text><text start="73" dur="13">For all x and s, x is a member of the result of adding x to any set s,</text><text start="86" dur="13">and for all x and s, x is a member of s implies that for all y, </text><text start="99" dur="11">x is a member of the set that you get when you add y to s. 
</text><text start="110" dur="6">And third, I&amp;#39;m going to try to define the notion of adjacent squares</text><text start="116" dur="6">on, say, a checkerboard, where the squares are numbered with x and y coordinates</text><text start="122" dur="6">and we want to just talk about adjacency in the horizontal and vertical direction.</text><text start="128" dur="3">Can I define that as follows? </text><text start="131" dur="15">For all x and y, the square (x, y) is adjacent to the square (+(x, 1), y), </text><text start="146" dur="14">and the square (x, y) is adjacent to the square (x, +(y, 1))</text><text start="160" dur="4">and assume that we&amp;#39;ve defined the notion of + somewhere </text><text start="164" dur="9">and that the character set allows + to occur as the character for a function.</text><text start="173" dur="6">Tell me yes or no, is that a good representation of the notion of adjacency?</text></transcript></video><video title="3g Answer" id="PY9d8qJ4BSY" length="124"><transcript><text start="0" dur="6">[Man] The first answer is yes, this is a good representation </text><text start="6" dur="3">of the sentence &amp;quot;Sam has 2 jobs.&amp;quot;</text><text start="9" dur="4">It says there exists an x and y, and one of them is a job of Sam. </text><text start="13" dur="5">The other one is a job of Sam, and crucially, we have to say that x is not equal to y.</text><text start="18" dur="5">Otherwise, this would be satisfied and we could have the same job </text><text start="23" dur="3">represented by the variables x and y.</text><text start="26" dur="4">Is this a good representation of the member function? </text><text start="30" dur="1">No. 
</text><text start="31" dur="3">It does do a good job of telling you what is a member,</text><text start="34" dur="6">so if x is a member of a set because it&amp;#39;s one member</text><text start="40" dur="4">and then we can always add other members and it&amp;#39;s still a member of that set,</text><text start="44" dur="4">but it doesn&amp;#39;t tell you anything about what x is not a member of.</text><text start="48" dur="5">So for example, we want to know that 3 is not a member of the empty set,</text><text start="53" dur="4">but we can&amp;#39;t prove that with what we have here. </text><text start="57" dur="3">And we have a similar problem down here.</text><text start="60" dur="5">This is not a good representation of adjacent relation.</text><text start="65" dur="11">So it will tell you, for example, that square (1,1) is adjacent to square (2,1)</text><text start="76" dur="4">and also to square (1,2).</text><text start="80" dur="5">So it&amp;#39;s doing something right, but one problem is that it doesn&amp;#39;t tell you in the other direction.</text><text start="85" dur="4">It doesn&amp;#39;t tell you that (2,1) is adjacent to (1,1)</text><text start="89" dur="8">and another problem is that it doesn&amp;#39;t tell you that (1,1) is not adjacent to (8,9)</text><text start="97" dur="3">because again, there&amp;#39;s no way to prove the negative. 
</text><text start="100" dur="3">And the moral is that when you&amp;#39;re trying to do a definition, </text><text start="103" dur="4">like adjacent or member, what you usually want to do </text><text start="107" dur="5">is have a sentence with the equivalent or the biconditional sign</text><text start="112" dur="9">to say this is true if and only if rather than to just have an assertion </text><text start="121" dur="3">or to have an implication in one direction.</text></transcript></video></group><group title="Unit 8" count="25"><video title="1 Introduction" id="dqeEg5V_IPM" length="39"><transcript><text start="0" dur="2">[Narrator] Hi, and welcome back.</text><text start="2" dur="2">This unit is about planning.</text><text start="4" dur="2">We defined AI to be the study </text><text start="6" dur="2">and process of finding appropriate </text><text start="8" dur="2">actions for an agent.</text><text start="10" dur="2">So in some sense planning is really </text><text start="12" dur="2">the core of all of AI.</text><text start="14" dur="2">The technique we looked at so far</text><text start="16" dur="2">was problem solving search </text><text start="18" dur="2">over a state space using techniques </text><text start="20" dur="3">like A star. </text><text start="23" dur="2">Given a state space and a problem description,</text><text start="25" dur="2">we can find a solution, </text><text start="27" dur="2">a path to the goal.</text><text start="29" dur="2">Those approaches are great for a variety of environments, </text><text start="31" dur="2">but they only work when the environment </text><text start="33" dur="3">is deterministic and fully observable.</text><text start="36" dur="3">In this unit, we will see how to relax those constraints. 
</text></transcript></video><video title="2 Problem Solving vs Planning" id="gZza8lZr1Oc" length="146"><transcript><text start="0" dur="3">[Narrator] You remember our problem-solving work?</text><text start="3" dur="3">We have a state space like this, and</text><text start="6" dur="3">we&amp;#39;re given a start state and </text><text start="9" dur="2">a goal to reach, </text><text start="11" dur="2">and then we&amp;#39;d search for a path</text><text start="13" dur="3">to find that goal, and maybe we find</text><text start="16" dur="3">this path.</text><text start="19" dur="2">Now the way a problem-solving agent </text><text start="21" dur="3">would work is first it does all the work</text><text start="24" dur="2">to figure out the path to the goal</text><text start="26" dur="3">just by thinking,</text><text start="29" dur="2">and then it starts to execute that path</text><text start="31" dur="4">to drive or walk, however you want to get there,</text><text start="35" dur="2">from the start state to the end state,</text><text start="37" dur="2">but think about what would happen</text><text start="39" dur="2">if you did that in real life; if you did all </text><text start="41" dur="2">your planning ahead of time, you had the complete goal,</text><text start="43" dur="3">and then without interacting with the world,</text><text start="46" dur="2">without sensing it at all,</text><text start="48" dur="2">you started to execute that path.</text><text start="50" dur="3">Well this has, in fact, been studied.</text><text start="53" dur="3">People have gone out and</text><text start="56" dur="3">blindfolded walkers, put them in a field </text><text start="59" dur="2">and told them to walk in a straight line,</text><text start="61" dur="3">and the results are not pretty.</text><text start="64" dur="3">Here are the GPS tracks to prove 
it.</text><text start="67" dur="2">So we take a hiker, we put him at a </text><text start="69" dur="2">start location, say here,</text><text start="71" dur="2">and we blindfold him so that he can&amp;#39;t </text><text start="73" dur="2">see anything on the horizon,</text><text start="75" dur="3">but just has enough to see his or her feet</text><text start="78" dur="2">so that they won&amp;#39;t stumble over something,</text><text start="80" dur="3">and tell them execute the plan of going forward.</text><text start="83" dur="3">Put one foot in front of the other and walk forward in a straight line,</text><text start="86" dur="2">and these are the typical paths we see.</text><text start="88" dur="2">They start out going straight for a while </text><text start="90" dur="2">but then go in loop de loops  </text><text start="92" dur="3">and end up not at a straight path at all.</text><text start="95" dur="2">These ones over here, starting in this location,</text><text start="97" dur="2">are even more convoluted.</text><text start="99" dur="2">They get going straight for a little bit</text><text start="101" dur="2">and then go in very tight loops.</text><text start="103" dur="2">So people are incapable of walking a straight line</text><text start="105" dur="3">without any feedback from the environment.</text><text start="108" dur="3">Now here on this yellow path, this one did much better, </text><text start="111" dur="2">and why was that?</text><text start="113" dur="3">Well it&amp;#39;s because these paths were on overcast days,</text><text start="116" dur="3">and so there was no input to make sense of.</text><text start="119" dur="3">Whereas this path was on a very sunny day,</text><text start="122" dur="2">and so even though 
the hiker couldn&amp;#39;t </text><text start="124" dur="3">see farther than a few feet in front of him, </text><text start="127" dur="3">he could see shadows and say,</text><text start="130" dur="2">&amp;quot;As long as I keep the shadows pointing in the right direction then  </text><text start="132" dur="3">I can go in a relatively straight line.&amp;quot;</text><text start="135" dur="3">So the moral is we need some feedback from the environment. </text><text start="138" dur="3">We can&amp;#39;t just plan ahead and come up with a whole plan.</text><text start="141" dur="3">We&amp;#39;ve got to interleave planning  </text><text start="144" dur="2">and executing.</text></transcript></video><video title="3 Planning vs Execution" id="BmqedPZZA4A" length="199"><transcript><text start="0" dur="2">[Narrator] Now why do we have to interleave</text><text start="2" dur="2">planning and execution?</text><text start="4" dur="2">Mostly because of properties of the </text><text start="6" dur="2">environment that make it difficult to deal with.</text><text start="8" dur="2">The most important one is</text><text start="10" dur="2">if the environment is </text><text start="12" dur="2">stochastic. 
</text><text start="14" dur="2">That is if we don&amp;#39;t know for sure what </text><text start="16" dur="2">an action is going to do.</text><text start="18" dur="2">If we know what everything is going to do,</text><text start="20" dur="2">we can plan it out right from the start, but if we don&amp;#39;t, we have to</text><text start="22" dur="2">be able to deal with contingencies of  </text><text start="24" dur="2">say I tried to move forward, </text><text start="26" dur="3">and the wheels slipped, and I went someplace else,</text><text start="29" dur="2">or the brakes might skid, or </text><text start="31" dur="3">if we&amp;#39;re walking our feet don&amp;#39;t go 100% straight,</text><text start="34" dur="3">or consider the problem of traffic lights.</text><text start="37" dur="2">If the traffic light is red, </text><text start="39" dur="2">then the result of the action of go</text><text start="41" dur="2">forward through the intersection</text><text start="43" dur="3">is bound to be different than if the traffic light is green.</text><text start="46" dur="2">Another difficulty we have to deal with </text><text start="48" dur="3">is multi-agent environments. 
</text><text start="51" dur="3">If there are other cars and people that can get in our way,</text><text start="54" dur="3">we have to plan about what they&amp;#39;re going to do, </text><text start="57" dur="3">and we have to react when they do something unexpected,</text><text start="60" dur="2">and we can only know that </text><text start="62" dur="3">at execution time, not at planning time.</text><text start="65" dur="2">The other big problem is with </text><text start="67" dur="4">partial observability.</text><text start="71" dur="3">Suppose we&amp;#39;ve come up with a plan</text><text start="74" dur="5">to go from A to S to F to B.</text><text start="79" dur="2">That plan looks like it will work,</text><text start="81" dur="3">but we know that at S,</text><text start="84" dur="3">the road to F is sometimes closed,</text><text start="87" dur="2">and there will be a sign there</text><text start="89" dur="2">telling us whether it&amp;#39;s closed or not,</text><text start="91" dur="2">but when we start off, we can&amp;#39;t read that sign.</text><text start="93" dur="2">So that&amp;#39;s partial observability.</text><text start="95" dur="2">Another way to look at it is when we start off </text><text start="97" dur="2">we don&amp;#39;t know what state we&amp;#39;re in.</text><text start="99" dur="2">We know we&amp;#39;re in A, but we don&amp;#39;t know </text><text start="101" dur="2">if we&amp;#39;re in A in the state where</text><text start="103" dur="3">the road is closed or if we&amp;#39;re in A </text><text start="106" dur="2">in the state where the road is open, </text><text start="108" dur="2">and it&amp;#39;s not until we get to S</text><text start="110" dur="3">that we discover what state we&amp;#39;re actually in, </text><text start="113" dur="2">and then we know if we can continue along </text><text start="115" dur="3">that 
route or if we have to take a detour south. </text><text start="118" dur="2">Now in addition to these properties of</text><text start="120" dur="2">the environment, we can also have </text><text start="122" dur="2">difficulty because of </text><text start="124" dur="2">lack of knowledge on our own part.</text><text start="126" dur="6">So if some model of the world is unknown, </text><text start="132" dur="2">that is, for example, </text><text start="134" dur="2">we have a map or GPS software </text><text start="136" dur="2">that&amp;#39;s inaccurate or incomplete,</text><text start="138" dur="2">then we won&amp;#39;t be able to </text><text start="140" dur="3">execute a straight-line plan,</text><text start="143" dur="3">and, similarly, often we want to deal with</text><text start="146" dur="3">a case where the plans have to be </text><text start="149" dur="2">hierarchical. </text><text start="151" dur="2">And, certainly, a plan like this </text><text start="153" dur="4">is at a very high level.</text><text start="157" dur="2">We can&amp;#39;t really execute the action </text><text start="159" dur="2">of going from A to S </text><text start="161" dur="2">when we&amp;#39;re in a car.</text><text start="163" dur="2">All the actions that we can actually execute</text><text start="165" dur="2">are things like turn the steering wheel a little bit </text><text start="167" dur="3">to the right, press on the pedal a little bit more.</text><text start="170" dur="4">So those are the low-level steps of the plan,</text><text start="174" dur="3">but those aren&amp;#39;t sketched out in detail when we start,</text><text start="177" dur="3">when we only have the high-level parts of the plan, </text><text start="180" dur="3">and then it&amp;#39;s during execution that we schedule </text><text start="183" dur="2">the rest of the low-level parts of the plan.</text><text start="185" 
dur="3">Now most of these difficulties can be</text><text start="188" dur="2">addressed by changing our point of view. </text><text start="190" dur="3">Instead of planning in the space of world states,</text><text start="193" dur="3">we plan in the space of belief states.</text><text start="196" dur="3">To understand that let&amp;#39;s look at a state.</text></transcript></video><video title="4 Vacuum Cleaner Example" id="lb1bEUa9WXg" length="131"><transcript><text start="0" dur="2">[Narrator] Here&amp;#39;s a state space </text><text start="2" dur="2">diagram for a simple problem.</text><text start="4" dur="3">It involves a room with 2 locations.</text><text start="7" dur="4">The left we call A, and the right we call B,</text><text start="11" dur="2">and in that environment</text><text start="13" dur="2">there&amp;#39;s a vacuum cleaner, and there </text><text start="15" dur="3">may or may not be dirt in either of the 2 locations,</text><text start="18" dur="4">and so that gives us 8 total states.</text><text start="22" dur="3">Dirt is here or not, here or not, and</text><text start="25" dur="2">the vacuum cleaner is here or here.</text><text start="27" dur="2">So that&amp;#39;s 2 times 2 times 2 </text><text start="29" dur="2">is 8 possible states, and I&amp;#39;ve drawn </text><text start="31" dur="2">here the state space diagram  </text><text start="33" dur="2">with all the transitions </text><text start="35" dur="3">for the 3 possible actions, and the actions are moving right.</text><text start="38" dur="2">So we&amp;#39;d go from this state to this state.  
</text><text start="40" dur="3">Moving left, we&amp;#39;d go from this state to this state, </text><text start="43" dur="2">and sucking up dirt, we&amp;#39;d go from this state </text><text start="45" dur="3">to this state for example, and </text><text start="48" dur="3">in this state space diagram, </text><text start="51" dur="2">if we have a fully deterministic,</text><text start="53" dur="3">fully observable world, it&amp;#39;s easy to plan.</text><text start="56" dur="3">Say we start in this state, and we want to be--</text><text start="59" dur="3">end up in a goal state where both sides are clean.</text><text start="62" dur="2">We can execute the suck-dirt action </text><text start="64" dur="2">and get here and then move right, </text><text start="66" dur="2">and then suck dirt again,</text><text start="68" dur="3">and now we end up in a goal state</text><text start="71" dur="3">where everything is clean.</text><text start="74" dur="2">Now suppose our robot vacuum cleaner&amp;#39;s</text><text start="76" dur="2">sensors break down, and so the robot </text><text start="78" dur="2">can no longer perceive either </text><text start="80" dur="2">which location it&amp;#39;s in </text><text start="82" dur="2">or whether there&amp;#39;s any dirt.</text><text start="84" dur="2">So we now have an unobservable </text><text start="86" dur="2">or sensor-less world rather </text><text start="88" dur="2">than a fully observable one, </text><text start="90" dur="3">and how does the agent then represent the state of the world?</text><text start="93" dur="3">Well it could be in any one of these 8 states,</text><text start="96" dur="3">and so all we can do to represent </text><text start="99" dur="3">the current state is draw a big circle </text><text start="102" dur="2">or box around everything, and say,</text><text start="104" dur="4">&amp;quot;I know I&amp;#39;m somewhere inside here.&amp;quot;</text><text 
start="108" dur="2">Now that doesn&amp;#39;t seem like it helps very much.</text><text start="110" dur="2">What good is it to know that </text><text start="112" dur="2">we don&amp;#39;t really know anything at all?</text><text start="114" dur="3">But the point is that we can search in the state </text><text start="117" dur="2">space of belief states rather</text><text start="119" dur="3">than in the state space of actual states.</text><text start="122" dur="3">So we believe that we&amp;#39;re in 1 of these 8 states,</text><text start="125" dur="2">and now when we execute an action,</text><text start="127" dur="2">we&amp;#39;re going to get to another belief state.</text><text start="129" dur="2">Let&amp;#39;s take a look at how that works.</text></transcript></video><video title="5 Sensorless Vacumm Cleaner Problem-" id="hyjwEymVfL4" length="142"><transcript><text start="0" dur="3">[Narrator] This is the belief state space</text><text start="3" dur="2">for the sensor-less vacuum problem.</text><text start="5" dur="2">So we started off here.</text><text start="7" dur="3">We drew the circle around this belief state.</text><text start="10" dur="3">So we don&amp;#39;t know anything about where we are,</text><text start="13" dur="2">but the amazing thing is, </text><text start="15" dur="2">if we execute actions, we can gain knowledge</text><text start="17" dur="3">about the world even without sensing.</text><text start="20" dur="2">So let&amp;#39;s say we move right, </text><text start="22" dur="4">then we&amp;#39;ll know we&amp;#39;re in the right-hand location.</text><text start="26" dur="2">Either we were in the left, and we moved right</text><text start="28" dur="2">and arrived there, or we were in the right </text><text start="30" dur="2">to begin with, and we bumped against the wall</text><text start="32" dur="2">and stayed there.</text><text start="34" dur="3">So now we end up in this state.</text><text start="37" dur="3">We now know more about the 
world.</text><text start="40" dur="3">We&amp;#39;re down to 4 possibilities rather than 8,</text><text start="43" dur="3">even though we haven&amp;#39;t observed anything,</text><text start="46" dur="2">and now note something interesting,</text><text start="48" dur="2">that in the real world, the operations </text><text start="50" dur="2">of going left and going right are</text><text start="52" dur="2">inverses of each other, but </text><text start="54" dur="2">in the belief state world </text><text start="56" dur="3">going right and going left are not inverses.</text><text start="59" dur="2">If we go right, and then we go left,</text><text start="61" dur="2">we don&amp;#39;t end up back where we were</text><text start="63" dur="2">in a state of total uncertainty, rather</text><text start="65" dur="3">going left takes us over here</text><text start="68" dur="2">where we still know we&amp;#39;re in 1 of 4 states</text><text start="70" dur="3">rather than in 1 of 8 states.</text><text start="73" dur="2">Note that it&amp;#39;s possible to form a plan that </text><text start="75" dur="3">reaches a goal without ever observing the world. 
</text><text start="78" dur="3">Plans like that are called conformant plans.</text><text start="81" dur="2">For example, if the goal is to be </text><text start="83" dur="2">in a clean location </text><text start="85" dur="3">all we have to do is suck.</text><text start="88" dur="2">So we go from one of these 8 states</text><text start="90" dur="2">to one of these 4 states and, </text><text start="92" dur="2">every one of those 4,</text><text start="94" dur="2">we&amp;#39;re in a clean location.</text><text start="96" dur="2">We don&amp;#39;t know which of the 4 we&amp;#39;re in,</text><text start="98" dur="3">but we know we&amp;#39;ve achieved the goal.</text><text start="101" dur="2">It&amp;#39;s also possible to arrive </text><text start="103" dur="2">at a completely known state.</text><text start="105" dur="2">For example, if we start here,</text><text start="107" dur="3">we go left; we suck up the dirt there.</text><text start="110" dur="3">If we go right and suck up the dirt,</text><text start="113" dur="2">now we&amp;#39;re down to a belief state </text><text start="115" dur="2">consisting of 1 single state that is</text><text start="117" dur="3">we know exactly where we are.</text><text start="120" dur="2">Here&amp;#39;s a question for you:</text><text start="122" dur="2">How do I get from the state where I know </text><text start="124" dur="2">my current square is clean, </text><text start="126" dur="2">but know nothing else, to the belief state</text><text start="128" dur="2">where I know that I&amp;#39;m in the right-hand side</text><text start="130" dur="4">location and that that location is clean?</text><text start="134" dur="2">What I want you to do is click on the</text><text start="136" dur="2">sequence of actions, left, right, or suck</text><text start="138" dur="4">that will take us from that start to that goal.</text></transcript></video><video title="6 Sensorless Vacuum Cleaner Answer" id="Bxd-j9s82Z8" length="23"><transcript><text start="0" 
dur="3">[Narrator] And the answer is that the state</text><text start="3" dur="3">of knowing that your current square is clean</text><text start="6" dur="2">corresponds to this state.</text><text start="8" dur="2">This belief state with 4 possible world states,</text><text start="10" dur="3">and if I then execute the right action, </text><text start="13" dur="2">followed by the suck action,</text><text start="15" dur="2">then I end up in this belief state,</text><text start="17" dur="2">and that satisfies the goal.</text><text start="19" dur="2">I know I&amp;#39;m in the right-hand-side location</text><text start="21" dur="2">and I know that location is clean.</text></transcript></video><video title="7 Partially Observable Vacuum Cleaner Example" id="-6GtASYhMvo" length="151"><transcript><text start="0" dur="5">[Narrator] We&amp;#39;ve been considering sensor-less planning in a deterministic world. </text><text start="5" dur="3">Now I want to turn our attention to partially observable planning </text><text start="8" dur="2">but still in a deterministic world.</text><text start="10" dur="3">Suppose we have what&amp;#39;s called local sensing,</text><text start="13" dur="2">that is our vacuum can see what location </text><text start="15" dur="2">it is in and it can see </text><text start="17" dur="4">what&amp;#39;s going on in the current location, that is </text><text start="21" dur="2">whether there&amp;#39;s dirt in the current location or not,  </text><text start="23" dur="2">but it can&amp;#39;t see anything about </text><text start="25" dur="4">whether there&amp;#39;s dirt in any other location.</text><text start="29" dur="2">So here&amp;#39;s a partial diagram of the-- </text><text start="31" dur="4">part of the belief state from that world,</text><text start="35" dur="2">and I want it to show  </text><text start="37" dur="2">how the belief state 
unfolds </text><text start="39" dur="2">as 2 things happen.</text><text start="41" dur="2">First, as we take action, </text><text start="43" dur="3">so we start in this state, </text><text start="46" dur="3">and we take the action of going right, </text><text start="49" dur="4">and in this case we still go </text><text start="53" dur="3">from 2 world states in our belief state </text><text start="56" dur="2">to 2 new ones, </text><text start="58" dur="2">but then, after we do an action,</text><text start="60" dur="3">we do an observation, and we have the act-</text><text start="63" dur="2">percept cycle, and now, </text><text start="65" dur="2">once we get the observation, </text><text start="67" dur="2">we can split that world, </text><text start="69" dur="2">we can split our belief state to say, </text><text start="71" dur="2">&amp;quot;If we observe that we&amp;#39;re in </text><text start="73" dur="2">location B and it&amp;#39;s dirty, then we know</text><text start="75" dur="3">we&amp;#39;re in this belief state here,</text><text start="78" dur="3">which happens to have exactly 1 world state in it,</text><text start="81" dur="2">and if we observe that we&amp;#39;re clean</text><text start="83" dur="2">then we know that we&amp;#39;re in this state,</text><text start="85" dur="2">which also has exactly 1 in it.</text><text start="87" dur="2">Now what does the act-observe cycle do </text><text start="89" dur="3">to the sizes of the belief states?</text><text start="92" dur="2">Well in a deterministic world, </text><text start="94" dur="2">each of the individual world states within  </text><text start="96" dur="4">a belief state maps into exactly 1 other one.</text><text start="100" dur="2">That&amp;#39;s what we mean by deterministic,</text><text start="102" dur="3">and so that means the size of the belief state </text><text start="105" dur="3">will either stay the same or it might 
decrease</text><text start="108" dur="2">if 2 of the actions sort of accidentally </text><text start="110" dur="3">end up bringing you to the same place.</text><text start="113" dur="2">On the other hand, the observation </text><text start="115" dur="3">works in kind of the opposite way.</text><text start="118" dur="2">When we observe the world, what we&amp;#39;re doing </text><text start="120" dur="2">is we&amp;#39;re taking the current belief state and </text><text start="122" dur="3">partitioning it up into pieces.</text><text start="125" dur="2">Observations alone can&amp;#39;t introduce </text><text start="127" dur="3">a new state--a new world state into the belief state.</text><text start="130" dur="3">All they can do is say, </text><text start="133" dur="3">&amp;quot;Some of them go here and some of them go here.&amp;quot;</text><text start="136" dur="2">Now maybe for some observation</text><text start="138" dur="3">all the world states go into 1 bin,</text><text start="141" dur="2">and so we make an observation</text><text start="143" dur="2">and we don&amp;#39;t learn anything new, but at least </text><text start="145" dur="3">the observation can&amp;#39;t make us more confused </text><text start="148" dur="3">than we were before the observation. 
</text></transcript></video><video title="8 Stocastic Environment Problem" id="Rgn1RW0fcec" length="174"><transcript><text start="0" dur="3">[Norvig] Now let&amp;#39;s move on to stochastic environments.</text><text start="3" dur="3">Let&amp;#39;s consider a robot that has slippery wheels</text><text start="6" dur="4">so that sometimes when you make a movement--a left or a right action--</text><text start="10" dur="3">the wheels slip and you stay in the same location.</text><text start="13" dur="4">And sometimes they work and you arrive where you expected to go.</text><text start="17" dur="4">And let&amp;#39;s assume that the suck action always works perfectly.</text><text start="21" dur="4">We get a belief state space that looks something like this.</text><text start="25" dur="5">Notice that the results of actions will often result in a belief state</text><text start="30" dur="4">that&amp;#39;s larger than it was before--that is, the action will increase uncertainty</text><text start="34" dur="3">because we don&amp;#39;t know what the result of the action is going to be.</text><text start="37" dur="5">And so here for each of the individual world states belonging to a belief state,</text><text start="42" dur="5">we have multiple outcomes for the action, and that&amp;#39;s what stochastic means.</text><text start="47" dur="3">And so we end up with a larger belief state here.</text><text start="50" dur="5">But in terms of the observation, the same thing holds as in the deterministic world.</text><text start="55" dur="6">The observation partitions the belief state into smaller belief states.</text><text start="61" dur="3">So in a stochastic partially observable environment,</text><text start="64" dur="3">the actions tend to increase uncertainty, </text><text start="67" dur="4">and the observations tend to bring that uncertainty back down.</text><text start="71" dur="3">Now, how would we do planning in this type of environment?</text><text start="74" dur="3">I 
haven&amp;#39;t told you yet, so you won&amp;#39;t know the answer for sure,</text><text start="77" dur="4">but I want you to try to figure it out anyways, even if you might get the answer wrong.</text><text start="81" dur="6">Imagine I had the whole belief state from which I&amp;#39;ve diagrammed just a little bit here</text><text start="87" dur="4">and I wanted to know how to get from this belief state </text><text start="91" dur="3">to one in which all squares are clean.</text><text start="94" dur="2">So I&amp;#39;m going to give you some possible plans, </text><text start="96" dur="6">and I want you to tell me whether you think each of these plans will always work</text><text start="102" dur="5">or maybe sometimes work depending on how the stochasticity works out.</text><text start="107" dur="2">Here are the possible plans.</text><text start="109" dur="5">Remember I&amp;#39;m starting here, and I want to know how to get to a belief state</text><text start="114" dur="3">in which all the squares are clean.</text><text start="117" dur="9">One possibility is suck right and suck, one is right suck left suck,</text><text start="126" dur="5">one is suck right right suck,</text><text start="131" dur="7">and the other is suck right suck right suck.</text><text start="138" dur="4">So some of these actions might take you out of this little belief state here,</text><text start="142" dur="5">but just use what you knew from the previous definition of the state space</text><text start="147" dur="2">and the results of each of those actions </text><text start="149" dur="5">and the fact that the right and left actions are nondeterministic </text><text start="154" dur="5">and tell me which of these you think will always achieve the goal</text><text start="159" dur="3">or will maybe achieve the goal.</text><text start="162" dur="6">And then I want you to also answer for the fill-in-the-blank plan--</text><text start="168" dur="6">that is, is there some plan, some ideal plan, which 
always or maybe achieves the goal?</text></transcript></video><video title="9 Stochastic Environment Answer" id="uBBfv13Ky4I" length="50"><transcript><text start="0" dur="3">And the answer is that any plan that would work </text><text start="3" dur="4">in the deterministic world might work in the stochastic world</text><text start="7" dur="3">if everything works out okay,</text><text start="10" dur="3">and all of these plans meet that criterion.</text><text start="13" dur="5">But no finite plan is guaranteed to always work</text><text start="18" dur="5">because a successful plan has to include at least 1 move action.</text><text start="23" dur="4">And if we try a move action a finite number of times,</text><text start="27" dur="3">each of those times, the wheels might slip, and it won&amp;#39;t move,</text><text start="30" dur="3">and so we can never be guaranteed to achieve the goal</text><text start="33" dur="3">with a finite sequence of actions.</text><text start="36" dur="3">Now, what about an infinite sequence of actions?</text><text start="39" dur="3">Well, we can&amp;#39;t represent that in the language we have so far</text><text start="42" dur="3">where a plan is a linear sequence.</text><text start="45" dur="2">But we can introduce a new notion of plans </text><text start="47" dur="3">in which we do have infinite sequences.</text></transcript></video><video title="10 Infinite Sequences" id="h3_8OLOhtG4" length="125"><transcript><text start="0" dur="3">In this new notation, instead of writing plans</text><text start="3" dur="6">as a linear sequence of, say, suck, move right, and suck,</text><text start="9" dur="3">I&amp;#39;m going to write them as a tree structure.</text><text start="12" dur="3">We start off in this belief state here, </text><text start="15" dur="3">which we&amp;#39;ll diagram like this.</text><text start="18" dur="4">And then we do a suck action.</text><text start="22" dur="5">We end up in a new state.</text><text start="27" dur="6">And 
then we do a right action,</text><text start="33" dur="3">and now we have to observe the world,</text><text start="36" dur="5">and if we observe that we&amp;#39;re still in state A,</text><text start="41" dur="5">we loop back to this part of the plan.</text><text start="46" dur="3">And if we observe that we&amp;#39;re in B,</text><text start="49" dur="7">we go on and then execute the suck action.</text><text start="56" dur="3">And now we&amp;#39;re at the end of the plan.</text><text start="59" dur="4">So, we see that there&amp;#39;s a choice point here,</text><text start="63" dur="3">which we indicate with this sort of tie</text><text start="66" dur="3">to say we&amp;#39;re following a straight line, but now we can branch.</text><text start="69" dur="3">There&amp;#39;s a conditional, and we can either loop,</text><text start="72" dur="2">or we can continue on, </text><text start="74" dur="3">so we see that this finite representation</text><text start="77" dur="4">represents an infinite sequence of plans.</text><text start="81" dur="5">We could write it in a more sort of linear notation</text><text start="86" dur="6">as S, while we observe A, </text><text start="92" dur="4">do R, and then do S.</text><text start="96" dur="2">Now, what can we say about this plan?</text><text start="98" dur="2">Does this plan achieve the goal?</text><text start="100" dur="4">Well, what we can say is that if the stochasticity </text><text start="104" dur="3">is independent, that is, if sometimes it works</text><text start="107" dur="2">and sometimes it doesn&amp;#39;t, </text><text start="109" dur="4">then with probability 1 in the limit,</text><text start="113" dur="2">this plan will, in fact, achieve the goal,</text><text start="115" dur="5">but we can&amp;#39;t state any bounded number of steps</text><text start="120" dur="3">under which it&amp;#39;s guaranteed to achieve the goal.</text><text start="123" dur="2">We can only say it&amp;#39;s guaranteed at 
infinity.</text></transcript></video><video title="11 Finding a Successful Plan-" id="ffGFIoOhN3U" length="99"><transcript><text start="0" dur="3">Now, I&amp;#39;ve told you what a successful plan looks like,</text><text start="3" dur="3">but I haven&amp;#39;t told you how to find one.</text><text start="6" dur="2">The process of finding it can be done through search</text><text start="8" dur="2">just as we did in problem solving.</text><text start="10" dur="2">So, remember in problem solving,</text><text start="12" dur="3">we start off in a state, and it&amp;#39;s a single state, not a belief state.</text><text start="15" dur="4">And then we start searching a tree,</text><text start="19" dur="4">and we have a big triangle of possible states</text><text start="23" dur="4">that we search through, and then we find</text><text start="27" dur="4">one path that gets us all the way to a goal state.</text><text start="31" dur="5">And we pick from this big tree a single path.</text><text start="36" dur="4">So, with belief states and with branching</text><text start="40" dur="3">plan structures, we do the same sort of process,</text><text start="43" dur="3">only the tree is just a little bit more complicated.</text><text start="46" dur="2">Here we show one of these trees,</text><text start="48" dur="4">and it has different possibilities.</text><text start="52" dur="3">For example, we start off here, and we have one possibility</text><text start="55" dur="2">that the first action will be going right,</text><text start="57" dur="2">or another possibility that the first action</text><text start="59" dur="3">will be performing a suck.</text><text start="62" dur="4">But then it also has branches that are part of the plan itself.</text><text start="66" dur="3">This branch here is actually part of the plan</text><text start="69" dur="2">as we saw before.</text><text start="71" dur="3">It&amp;#39;s not a branch in the search space.</text><text start="74" dur="3">It&amp;#39;s a 
branch in the plan, so what we do</text><text start="77" dur="2">is we search through this tree.</text><text start="79" dur="2">We try right as a first action.</text><text start="81" dur="2">We try suck as a first action.</text><text start="83" dur="2">We keep expanding nodes</text><text start="85" dur="3">until we find a portion of the tree</text><text start="88" dur="3">like this path is a portion of this search tree.</text><text start="91" dur="4">We find that portion which is a successful plan</text><text start="95" dur="4">according to the criteria of reaching the goal.</text></transcript></video><video title="12 Finding a Successful Plan Question" id="vjOY2OEu_ng" length="130"><transcript><text start="0" dur="3">Let&amp;#39;s say we performed that search.</text><text start="3" dur="3">We had a big search tree, and then we threw out</text><text start="6" dur="3">all the branches except one, and this branch of the search tree</text><text start="9" dur="4">does itself have branches, but this branch of the search tree</text><text start="13" dur="4">through the belief state represents a single plan,</text><text start="17" dur="2">not multiple possible plans.</text><text start="19" dur="3">Now, what I want to know is, for this single plan,</text><text start="22" dur="2">what can we guarantee about it?</text><text start="24" dur="5">So, say we wanted to know is this plan guaranteed to find the goal</text><text start="29" dur="3">in an unbounded number of steps?</text><text start="32" dur="3">And what do we need to guarantee that?</text><text start="35" dur="3">So, it&amp;#39;s an unbounded solution.</text><text start="38" dur="5">Do we need to guarantee that</text><text start="43" dur="4">some leaf node is a goal?</text><text start="47" dur="2">So, for example, here&amp;#39;s a plan to go through,</text><text start="49" dur="4">and at the bottom, there&amp;#39;s a leaf node.</text><text start="53" dur="4">Now, if this were in problem solving,</text><text start="57" 
dur="4">then remember, it would be a sequence of steps</text><text start="61" dur="3">with no branches in it, and we know it&amp;#39;s a solution</text><text start="64" dur="3">if the one leaf node is a goal.</text><text start="67" dur="3">But for these with branches, do we need to guarantee</text><text start="70" dur="3">that some leaf is a goal, </text><text start="73" dur="4">or do we need to guarantee </text><text start="77" dur="5">that every leaf is a goal,</text><text start="82" dur="5">or is there no possible guarantee</text><text start="87" dur="4">that will mean that for sure we&amp;#39;ve got a solution,</text><text start="91" dur="2">although the solution may be of unbounded length?</text><text start="93" dur="3">Then I also want you to answer</text><text start="96" dur="2">what does it take to guarantee</text><text start="98" dur="3">that we have a bounded solution?</text><text start="101" dur="4">That is, a solution that is guaranteed to reach the goal</text><text start="105" dur="4">in a bounded, finite number of steps.</text><text start="109" dur="4">Do we need to have a plan that has</text><text start="113" dur="4">no branches in it, like this branch?</text><text start="117" dur="5">Or a plan that has no loops in it,</text><text start="122" dur="3">like this loop that goes back to a previous state?</text><text start="125" dur="5">Or is there no guarantee that we have a bounded solution?</text></transcript></video><video title="13 Finding a Successful Plan Answer" id="Q4g4E824PRE" length="58"><transcript><text start="0" dur="3">And the answer is we have an unbounded solution</text><text start="3" dur="4">if every leaf in the plan ends up in a goal.</text><text start="7" dur="2">So, if we follow through the plan, no matter what path</text><text start="9" dur="3">we execute based on the observations--</text><text start="12" dur="4">and remember, we don&amp;#39;t get to pick the observations.</text><text start="16" dur="2">The observations come into 
us, and we follow one path or another</text><text start="18" dur="2">based on what we observe.</text><text start="20" dur="3">So, we can&amp;#39;t guide it in one direction or another,</text><text start="23" dur="3">and so we need every possible leaf node.</text><text start="26" dur="4">This one only has one, but if a plan had multiple leaf nodes,</text><text start="30" dur="3">every one of them would have to be a goal.</text><text start="33" dur="2">Now, in terms of a bounded solution,</text><text start="35" dur="4">it&amp;#39;s okay to have branches but not to have loops.</text><text start="39" dur="3">If we had branches and we ended up with one goal here</text><text start="42" dur="3">and one goal here in 1, 2, 3, steps,</text><text start="45" dur="3">1, 2, 3, steps, that would be a bounded solution.</text><text start="48" dur="6">But if we have a loop, we might be 1, 2, 3, 4, 5--</text><text start="54" dur="4">we don&amp;#39;t know how many steps it&amp;#39;s going to take.</text></transcript></video><video title="14 Problem Solving via Mathematical Notation" id="SzeJX57N-_I" length="159"><transcript><text start="0" dur="2">Now, some people like manipulating trees</text><text start="2" dur="4">and some people like a more--sort of formal--mathematical notation.</text><text start="6" dur="3">So if you&amp;#39;re one of those, I&amp;#39;m going to give you another way to think about  </text><text start="9" dur="3">whether or not we have a solution;</text><text start="12" dur="3">and let&amp;#39;s start with a problem-solving</text><text start="15" dur="5">where a plan consists of a straight line sequence.</text><text start="20" dur="5">And we said one way to decide if this is a plan that satisfies the goal</text><text start="25" dur="5">is to say, &amp;quot;Is the end state a goal state?&amp;quot;</text><text start="30" dur="3">If we want to be more formal and write that out mathematically, </text><text start="33" dur="4">what we can say is--what this plan 
represents </text><text start="37" dur="3">is--we started in the start state,</text><text start="40" dur="3">and then we transitioned </text><text start="43" dur="4">to the state that is the result of applying the action </text><text start="47" dur="6">of going from A to S, to that start state; </text><text start="53" dur="8">and then we applied to that, the result of starting in that intermediate state</text><text start="61" dur="7">and applying the action of going from S to F.</text><text start="68" dur="6">And if that resulting state is an element of the set of Goals, </text><text start="74" dur="5">then this plan is valid; this plan gives us a solution. </text><text start="79" dur="5">And so that&amp;#39;s a mathematical formulation of what it means for this plan to be a Goal. </text><text start="84" dur="3">Now, in stochastic partially observable worlds, </text><text start="87" dur="3">the equations are a little bit more complicated. </text><text start="90" dur="10">Instead of just having S Prime is a result of applying some action to the initial state, </text><text start="100" dur="4">we&amp;#39;re dealing with belief states, rather than individual states. </text><text start="104" dur="6">And what we say is our new belief state </text><text start="110" dur="9">is the result of updating what we get from predicting what our action will do; </text><text start="119" dur="7">and then updating it, based on our observation, O, of the world. 
</text><text start="126" dur="4">So the prediction step is when we start off in a belief state;</text><text start="130" dur="5">we look at the action, we look at each possible result of the action--</text><text start="135" dur="3">because they&amp;#39;re stochastic--to each possible member of the belief state,</text><text start="138" dur="3">and so that gives us a larger belief state;</text><text start="141" dur="4">and then we update that belief state by taking account of the observation--</text><text start="145" dur="4">and that will give us a smaller--or same size--belief state. </text><text start="149" dur="3">And now, that gives us the new state. </text><text start="152" dur="3">Now, we can use this predict and update cycle</text><text start="155" dur="4">to keep track of where we are in a belief state. </text></transcript></video><video title="15 Tracking the Predict Update Cycle" id="cfYnEgrVemA" length="175"><transcript><text start="0" dur="4">Here&amp;#39;s an example of tracking the Predict Update Cycle;</text><text start="4" dur="5">and this is in a world in which the actions are guaranteed to work, as advertised--</text><text start="9" dur="3">that is, if you start to clean up the current location,</text><text start="12" dur="5">and if you move right or left, the wheels actually turn; and you do move.</text><text start="17" dur="5">But we can call this the kindergarten world because there are little toddlers </text><text start="22" dur="5">walking around who can deposit Dirt in any location, at any time. 
</text><text start="27" dur="5">So if we start off in this state, and execute the Suck action, </text><text start="32" dur="6">we can predict that we&amp;#39;ll end up in one of these 2 states.</text><text start="38" dur="4">Then, if we have an observation--well, we know what that observation&amp;#39;s going to be </text><text start="42" dur="3">because we know the Suck action always works, and we know we were in A;</text><text start="45" dur="5">so the only observation we can get is that we&amp;#39;re in A--and that it&amp;#39;s Clean--</text><text start="50" dur="4">so we end up in that same belief state. </text><text start="54" dur="4">And then, if we execute the Right action--</text><text start="58" dur="3">well, then lots of things could happen;</text><text start="61" dur="5">because we move Right, and somebody might have dropped Dirt in the Right location, </text><text start="66" dur="4">and somebody might have dropped Dirt in the Left location--or maybe not. </text><text start="70" dur="2">So we end up with 4 possibilities,</text><text start="72" dur="5">and then we can update again when we get the next observation--</text><text start="77" dur="6">say, if we observe that we&amp;#39;re in B and it&amp;#39;s Dirty, then we end up in this belief state. </text><text start="83" dur="4">And we can keep on going--specifying new belief states-- </text><text start="87" dur="6">as a result of successive predicts and updates. </text><text start="93" dur="5">Now, this Predict Update Cycle gives us a kind of calculus of belief states</text><text start="98" dur="3">that can tell us, really, everything we need to know. </text><text start="101" dur="2">But there is one weakness with this approach--</text><text start="103" dur="4">that, as you can see here, some of the belief states start to get large; </text><text start="107" dur="2">and this is a tiny little world.</text><text start="109" dur="4">Already, we have a belief state with 4 world states in it. 
</text><text start="113" dur="5">We could have one with 8, 16, 1024--or whatever. </text><text start="118" dur="5">And it seems that there may be more succinct representations of a belief state,</text><text start="123" dur="3">rather than just listing all the world states. </text><text start="126" dur="2">For example, take this one here:</text><text start="128" dur="5">If we had divided the world up--not into individual world states,</text><text start="133" dur="4">but into variables describing that state, </text><text start="137" dur="6">then this whole belief state could be represented just by: Vacuum is on the Right. </text><text start="143" dur="6">So the whole world could be represented by 3 states--or 3 variables:</text><text start="149" dur="4">One, where is the Vacuum--is it on the Right, or not? </text><text start="153" dur="3">Second, is there Dirt in the Left location?</text><text start="156" dur="3">And third, is there Dirt in the Right location? </text><text start="159" dur="5">And we could have some formula, over those variables, to describe states. </text><text start="164" dur="3">And with that type of formulation, </text><text start="167" dur="4">some very large states--in terms of enumerating the world states--</text><text start="171" dur="4">can be made small, in terms of the description. 
</text></transcript></video><video title="16 Classical Planning 1" id="-o9E15BAL3o" length="335"><transcript><text start="0" dur="6">[Norvig] I want to describe a notation which we call classical planning,</text><text start="6" dur="7">which is a representation language for dealing with states and actions and plans,</text><text start="13" dur="4">and it&amp;#39;s also an approach for dealing with the problem of complexity</text><text start="17" dur="4">by factoring the world into variables.</text><text start="21" dur="7">So under classical planning, a state space consists of all the possible assignments</text><text start="28" dur="4">to k Boolean variables.</text><text start="32" dur="6">So that means there&amp;#39;ll be 2 to the k states in that state space.</text><text start="38" dur="3">And if we think about the 2-location vacuum world,</text><text start="41" dur="3">we would have 3 Boolean variables.</text><text start="44" dur="13">We could have dirt in location A, dirt in location B, and vacuum in location A.</text><text start="57" dur="3">The vacuum has to be in either A or B.</text><text start="60" dur="6">So these 3 variables will do, and there will be 8 possible states in that world,</text><text start="66" dur="5">but they can be succinctly represented through the 3 variables.</text><text start="71" dur="7">And then a world state consists of a complete assignment of true or false</text><text start="78" dur="2">to each of the 3 variables.</text><text start="80" dur="4">And then a belief state.</text><text start="84" dur="4">Just as in problem solving, the belief state depends on </text><text start="88" dur="3">what type of environment you want to deal with.</text><text start="91" dur="7">In core classical planning, the belief state had to be a complete assignment,</text><text start="98" dur="5">and that was useful for dealing with deterministic fully observable domains.</text><text start="103" dur="4">But we can easily extend classical 
planning,</text><text start="107" dur="4">and we can deal with belief states that are partial assignments--</text><text start="111" dur="5">that is, some of the variables have values and others don&amp;#39;t.</text><text start="116" dur="5">So we could have the belief state consisting of vacuum in A is true</text><text start="121" dur="7">and the others are unknown, and that small formula represents 4 possible world states.</text><text start="128" dur="10">We can even have a belief state which is an arbitrary formula in Boolean logic,</text><text start="138" dur="2">and that can represent anything we want.</text><text start="140" dur="2">So that&amp;#39;s what states look like.</text><text start="142" dur="3">Now we have to figure out what actions look like </text><text start="145" dur="3">and what the results of those actions look like.</text><text start="148" dur="6">These are represented in classical planning by something called an action schema.</text><text start="154" dur="6">It&amp;#39;s called a schema because it represents many possible actions that are similar to each other.</text><text start="160" dur="6">So let&amp;#39;s take an example of we want to send cargo around the world,</text><text start="166" dur="4">and we&amp;#39;ve got a bunch of planes in airports, and we have cargo and so on.</text><text start="170" dur="6">I&amp;#39;ll show you the action for having a plane fly from one location to another.</text><text start="176" dur="3">Here&amp;#39;s one possible representation.</text><text start="179" dur="4">We say it&amp;#39;s an action schema, so we write the word Action</text><text start="183" dur="5">and then we write the action operator and its arguments,</text><text start="188" dur="7">so it&amp;#39;s a Fly of P from X to Y.</text><text start="195" dur="4">And then we list the preconditions,</text><text start="199" dur="5">what needs to be true in order to be able to execute this action.</text><text start="204" dur="5">We can say something like 
P better be a plane.</text><text start="209" dur="6">It&amp;#39;s no good trying to fly a truck or a submarine.</text><text start="215" dur="8">And we&amp;#39;ll use the And formula from Boolean propositional logic.</text><text start="223" dur="4">X better be an airport.</text><text start="227" dur="3">We don&amp;#39;t want to try to take off from my backyard.</text><text start="230" dur="5">And similarly, Y better be an airport.</text><text start="235" dur="7">And, most importantly, P better be at airport X in order to take off from there.</text><text start="242" dur="6">And then we represent the effects of the action by saying</text><text start="248" dur="2">what&amp;#39;s going to happen.</text><text start="250" dur="3">Once we fly from X to Y, </text><text start="253" dur="3">the plane is no longer at X,</text><text start="256" dur="7">so we say not at P,X--the plane is no longer at X--</text><text start="263" dur="4">and the plane is now at Y.</text><text start="267" dur="3">This is called an action schema.</text><text start="270" dur="6">It represents a set of actions for all possible planes, for all X and for all Y,</text><text start="276" dur="3">represents all of those actions in one schema</text><text start="279" dur="6">that says what we need to know in order to apply the action and it says what will happen.</text><text start="285" dur="5">In terms of the transition from state spaces, this variable will become false</text><text start="290" dur="3">and this one will become true.</text><text start="293" dur="7">When we look at this formula, this looks like a term in first order logic,</text><text start="300" dur="4">but we&amp;#39;re actually dealing with a completely propositional world.</text><text start="304" dur="4">It just looks like that because this is a schema.</text><text start="308" dur="7">We can apply this schema to specific ground states, specific world states,</text><text start="315" dur="3">and then P and X would have specific 
values,</text><text start="318" dur="3">and you could just think of it as concatenating their names all together,</text><text start="321" dur="3">and that&amp;#39;s just the name of one variable.</text><text start="324" dur="5">The name just happens to have this complex form with parentheses and commas in it</text><text start="329" dur="6">to make it easier to write one schema that covers all the individual fly actions.</text></transcript></video><video title="17 Classical Planning 2" id="HAtsUpBlnf8" length="109"><transcript><text start="0" dur="5">[Norvig] Here we see a more complete representation of a problem solving domain</text><text start="5" dur="4">in the language of classical planning.</text><text start="9" dur="2">Here&amp;#39;s the Fly action schema.</text><text start="11" dur="3">I&amp;#39;ve made it a little bit more explicit with from and to airports</text><text start="14" dur="3">rather than X or Y.</text><text start="17" dur="4">We want to deal with transporting cargo.</text><text start="21" dur="8">So in addition to flying, we have an operator to load cargo, C, onto a plane, P, at airport A--</text><text start="29" dur="3">you can see the preconditions and effects there--</text><text start="32" dur="3">and an action to unload the cargo from the plane</text><text start="35" dur="2">with preconditions and effects.</text><text start="37" dur="3">We have a representation of the initial state.</text><text start="40" dur="5">There&amp;#39;s 2 pieces of cargo, there&amp;#39;s 2 planes and 2 airports.</text><text start="45" dur="5">This representation is rich enough and the algorithms on it are good enough</text><text start="50" dur="7">that we could have hundreds or thousands of cargo planes and so on</text><text start="57" dur="3">representing millions of ground actions.</text><text start="60" dur="12">If we had 10 airports and 100 planes, that would be 100, 1,000, 10,000 different Fly actions.</text><text start="72" dur="4">And if we had thousands of 
pieces of cargo,</text><text start="76" dur="2">there would be even more Load and Unload actions,</text><text start="78" dur="4">but they can all be represented by the succinct schema.</text><text start="82" dur="5">So the initial state tells us what&amp;#39;s what, where everything is,</text><text start="87" dur="3">and then we can represent the goal state:</text><text start="90" dur="4">that we want to have this piece of cargo has to be delivered to this airport,</text><text start="94" dur="4">and another piece of cargo has to be delivered to this airport.</text><text start="98" dur="7">So now we know what actions and problems of initial and goal state looks like</text><text start="105" dur="4">in this representation, but how do we do planning using this?</text></transcript></video><video title="18 Progression Search" id="V6eHKcZkDtg" length="84"><transcript><text start="0" dur="4">[Norvig] The simplest way to do planning is really the exact same way</text><text start="4" dur="2">that we did it in problem solving.</text><text start="6" dur="3">We start off in an initial state.</text><text start="9" dur="11">So P1 was at SFO, say, and cargo, C1, was also at SFO,</text><text start="20" dur="5">and all the other things that were in that initial state.</text><text start="25" dur="5">And then we start branching on the possible actions,</text><text start="30" dur="11">so say one possible action would be to load the cargo, C1, onto the plane, P1, at SFO,</text><text start="41" dur="4">and then that would bring us to another state </text><text start="45" dur="6">which would have a different set of state variables set,</text><text start="51" dur="7">and we&amp;#39;d continue branching out like that until we hit a state which satisfied the goal predicate.</text><text start="58" dur="5">So we call that forward or progression state space search</text><text start="63" dur="6">in that we&amp;#39;re searching through the space of exact states.</text><text start="69" 
dur="3">Each of these is an individual world state,</text><text start="72" dur="5">and if the actions are deterministic, then it&amp;#39;s the same thing as we had before.</text><text start="77" dur="3">But because we have this representation, </text><text start="80" dur="4">there are other possibilities that weren&amp;#39;t available to us before.</text></transcript></video><video title="19 Regression Search" id="Wu3kegFFjgI" length="187"><transcript><text start="0" dur="5">[Norvig] Another way to search is called backwards or regression search</text><text start="5" dur="2">in which we start at the goal.</text><text start="7" dur="3">So we take the description of the goal state.</text><text start="10" dur="11">C1 is at JFK and C2 is at SFO, so that&amp;#39;s the goal state.</text><text start="21" dur="2">And notice that that&amp;#39;s the complete goal state.</text><text start="23" dur="3">It&amp;#39;s not that I left out all the other facts about the state;</text><text start="26" dur="5">it&amp;#39;s that that&amp;#39;s all that&amp;#39;s known about the state is that these 2 propositions are true</text><text start="31" dur="3">and all the others can be anything you want.</text><text start="34" dur="3">And now we can start searching backwards.</text><text start="37" dur="3">We can say what actions would lead to that state.</text><text start="40" dur="5">Remember in problem solving we did have that option of searching backwards.</text><text start="45" dur="6">If there was a single goal state, we could say what other arcs are coming into that goal state.</text><text start="51" dur="3">But here, this goal state doesn&amp;#39;t represent a single state;</text><text start="54" dur="7">it represents a whole family of states with different values for all the other variables.</text><text start="61" dur="2">And so we can&amp;#39;t just look at that,</text><text start="63" dur="7">but what we can do is look at the definition of possible actions that will result in this 
goal.</text><text start="70" dur="2">So let&amp;#39;s look at it one at a time.</text><text start="72" dur="7">Let&amp;#39;s first look at what actions could result in C1 being at JFK.</text><text start="79" dur="7">We look at our action schemas, and there&amp;#39;s only 1 action schema that adds an At,</text><text start="86" dur="4">and that would be the Unload schema.</text><text start="90" dur="7">Unload of C, P, A adds At of C, A.</text><text start="97" dur="3">And so what we would know is if we want to achieve this,</text><text start="100" dur="10">then we would have to do an Unload where the C variable would have to be C1,</text><text start="110" dur="5">the P variable is still unknown--it could be any plane--</text><text start="115" dur="6">and the A variable has to be JFK.</text><text start="121" dur="2">Notice what we&amp;#39;ve done here.</text><text start="123" dur="4">We have this representation in terms of logical formulas</text><text start="127" dur="5">that allows us to specify a goal as a set of many world states,</text><text start="132" dur="6">and we also can use that same representation to represent an arrow here</text><text start="138" dur="3">not as a single action but as a set of possible actions.</text><text start="141" dur="5">So this is representing all possible actions for any plane, P, </text><text start="146" dur="3">of unloading cargo at the destination.</text><text start="149" dur="7">And then we can regress this state over this operator</text><text start="156" dur="4">and now we have another representation of this state here.</text><text start="160" dur="4">But just as this state was uncertain--not all the variables were known--</text><text start="164" dur="2">this state too will be uncertain.</text><text start="166" dur="5">For example, we won&amp;#39;t know anything about what plane, P, is involved,</text><text start="171" dur="5">and now we continue searching backwards until we get to a state</text><text start="176" dur="5">where enough of the 
variables are filled in and where we match against the initial state.</text><text start="181" dur="2">And then we have our solution.</text><text start="183" dur="4">We found it going backwards, but we can apply the solution going forwards.</text></transcript></video><video title="20 Regression vs Progression" id="qUx66S-B528" length="108"><transcript><text start="0" dur="4">[Norvig] Let&amp;#39;s show an example of where a backwards search makes sense.</text><text start="4" dur="3">I&amp;#39;m going to describe a world in which there is one action,</text><text start="7" dur="7">the action of buying a book.</text><text start="14" dur="4">And the precondition is we have to know which book it is,</text><text start="18" dur="3">and let&amp;#39;s identify them by ISBN number.</text><text start="21" dur="9">So we can buy ISBN number B, and the effect is that we own B.</text><text start="30" dur="2">And probably there should be something about money, </text><text start="32" dur="3">but we&amp;#39;re going to leave that out for now to make it simple.</text><text start="35" dur="12">And then the goal would be to own ISBN number 0136042597.</text><text start="47" dur="4">Now, if we try to solve this problem with forward search, we&amp;#39;d start in the initial state.</text><text start="51" dur="4">Let&amp;#39;s say the initial state is we don&amp;#39;t own anything.</text><text start="55" dur="4">And then we&amp;#39;d think about what actions can we apply.</text><text start="59" dur="3">If there are 10 million different books, 10 million ISBN numbers,</text><text start="62" dur="6">then there is a branching factor of 10 million coming out of this node,</text><text start="68" dur="4">and we&amp;#39;d have to try them all in order until we happened to hit upon one that was the right one.</text><text start="72" dur="2">It seems very inefficient.</text><text start="74" dur="5">If we go in the backward direction, then we start at the goal.</text><text start="79" dur="3">The goal 
is to own this number.</text><text start="82" dur="3">Then we look at our available actions, and out of the 10 million actions</text><text start="85" dur="2">there&amp;#39;s only 1 action schema, </text><text start="87" dur="3">and that action schema can match the goal in exactly one way,</text><text start="90" dur="6">when B equals this number, and therefore we know the action is to buy this number,</text><text start="96" dur="6">and we can connect the goal to the initial state in the backwards direction in just 1 step.</text><text start="102" dur="6">So that&amp;#39;s the advantage of doing backwards or regression search rather than forward search.</text></transcript></video><video title="21 Plan Space Search" id="shcybJnXQz0" length="132"><transcript><text start="0" dur="3">[Norvig] There&amp;#39;s one more type of search for plans </text><text start="3" dur="3">that we can do with the classical planning language</text><text start="6" dur="5">that we couldn&amp;#39;t do before, and this is searching through the space of plans</text><text start="11" dur="3">rather than searching through the space of states.</text><text start="14" dur="4">In forward search we were searching through concrete world states.</text><text start="18" dur="4">In backward search we were searching through abstract states</text><text start="22" dur="3">in which some of the variables were unspecified.</text><text start="25" dur="4">But in plan space search we search through the space of plans.</text><text start="29" dur="2">And here&amp;#39;s how it works.</text><text start="31" dur="2">We start off with an empty plan.</text><text start="33" dur="5">We have the start state and the goal state, and that&amp;#39;s all we know about the plan.</text><text start="38" dur="5">So obviously, this plan is flawed. 
It doesn&amp;#39;t lead us from the start to the goal.</text><text start="43" dur="5">And then we say let&amp;#39;s do an operation to edit or modify that plan</text><text start="48" dur="2">by adding something in new.</text><text start="50" dur="3">And here we&amp;#39;re tackling the problem of how to get dressed</text><text start="53" dur="3">and put on all the clothes in the right order,</text><text start="56" dur="5">so we say out of all the operators we have, we could add one of those operators into the plan.</text><text start="61" dur="5">And so here we say what if we added the put on right shoe operator.</text><text start="66" dur="3">Then we end up with this plan. </text><text start="69" dur="4">That still doesn&amp;#39;t solve the problem, so we need to keep refining that plan.</text><text start="73" dur="7">Then we come here and say maybe we could add in the put on left shoe operator.</text><text start="80" dur="4">And here I&amp;#39;ve shown the plan as a parallel branching structure</text><text start="84" dur="3">rather than just as a sequence.</text><text start="87" dur="3">And that&amp;#39;s a useful thing to do because it captures the fact </text><text start="90" dur="2">that these can be done in either order.</text><text start="92" dur="6">And we keep refining like that, adding on new branches or new operators</text><text start="98" dur="4">into the plan until we got a plan that was guaranteed to work.</text><text start="102" dur="5">This approach was popular in the 1980s, but it&amp;#39;s faded from popularity.</text><text start="107" dur="5">Right now the most popular approaches have to do with forward search.</text><text start="112" dur="3">We saw some of the advantages of backward search.</text><text start="115" dur="4">The advantage of forward search seems to be that we can come up with very good heuristics.</text><text start="119" dur="5">So we can do heuristic search, and we saw how important it was to have good heuristics</text><text 
start="124" dur="2">to do heuristic search.</text><text start="126" dur="3">And because the forward search deals with concrete plan states,</text><text start="129" dur="3">it seems to be easier to come up with good heuristics.</text></transcript></video><video title="22 Sliding Puzzle Example" id="mZddP9ytnS4" length="193"><transcript><text start="0" dur="5">[Norvig] To understand the idea of heuristics, let&amp;#39;s talk about another domain.</text><text start="5" dur="2">Here we have the sliding puzzle domain.</text><text start="7" dur="6">Remember we can slide around these little tiles and we try to reach a goal state.</text><text start="13" dur="7">A 16 puzzle is kind of big, so let&amp;#39;s show you the state space for the smaller 8 puzzle.</text><text start="20" dur="2">Here is just a small portion of it.</text><text start="22" dur="5">Let&amp;#39;s figure out what the action schema looks like for this puzzle.</text><text start="27" dur="6">We only need to describe one action, which is to slide a tile, T, </text><text start="33" dur="5">from location A to location B.</text><text start="38" dur="7">The precondition: the tile has to be on location A </text><text start="45" dur="5">and has to be a tile</text><text start="50" dur="12">and B has to be blank and A and B have to be adjacent.</text><text start="62" dur="4">This should be an And sign, not an A.</text><text start="66" dur="2">So that&amp;#39;s the action schema. </text><text start="68" dur="11">Oops. 
I forgot we need an effect, which should be that the tile is now on B</text><text start="79" dur="19">and the blank is now on A and the tile is no longer on A and the blank is no longer on B.</text><text start="98" dur="5">We talked before about how a human analyst could examine a problem</text><text start="103" dur="4">and come up with heuristics and encode those heuristics as a function</text><text start="107" dur="3">that would help search do a better job.</text><text start="110" dur="3">But with this kind of a formal representation</text><text start="113" dur="4">we can automatically come up with good representations of heuristics.</text><text start="117" dur="5">For example, if we came up with a relaxed problem</text><text start="122" dur="4">by automatically going in and throwing out some of the prerequisites--</text><text start="126" dur="4">if you throw out a prerequisite, you make the problem strictly easier--</text><text start="130" dur="2">then you get a new heuristic.</text><text start="132" dur="5">So for example, if we crossed out the requirement that B has to be blank,</text><text start="137" dur="5">then we end up with the Manhattan or city block heuristic.</text><text start="142" dur="6">And if we also throw out the requirement that A and B have to be adjacent,</text><text start="148" dur="3">then we get the number of misplaced tiles heuristic.</text><text start="151" dur="6">So that means we could slide a tile from any A to any B, no matter how far apart they were.</text><text start="157" dur="3">That&amp;#39;s the number of misplaced tiles.</text><text start="160" dur="2">Other heuristics are possible.</text><text start="162" dur="4">For example, one popular thing is to ignore negative effects,</text><text start="166" dur="6">to say let&amp;#39;s not say that this takes away the blank being in B.</text><text start="172" dur="4">So if we ignore that negative effect, we make the whole problem strictly easier.</text><text start="176" 
dur="4">We&amp;#39;d have a relaxed problem, and that might end up being a good heuristic.</text><text start="180" dur="4">So because we have our actions encoded in this logical form,</text><text start="184" dur="3">we can automatically edit that form.</text><text start="187" dur="3">A program can do that, and the program can come up with heuristics</text><text start="190" dur="3">rather than requiring the human to come up with heuristics.</text></transcript></video><video title="23 Situation Calculus 1" id="Or8cA5xHFbM" length="227"><transcript><text start="0" dur="3">[Norvig] Now I want to talk about 1 more representation for planning</text><text start="3" dur="4">called situation calculus.</text><text start="7" dur="5">To motivate this, suppose we wanted to have the goal of moving all the cargo</text><text start="12" dur="5">from airport A to airport B, regardless of how many pieces of cargo there are.</text><text start="17" dur="5">You can&amp;#39;t express the notion of All in propositional languages like classical planning,</text><text start="22" dur="3">but you can in first order logic.</text><text start="25" dur="2">There are several ways to use first order logic for planning.</text><text start="27" dur="3">The best known is situation calculus.</text><text start="30" dur="2">It&amp;#39;s not a new kind of logic;</text><text start="32" dur="4">rather, it&amp;#39;s regular first order logic with a set of conventions</text><text start="36" dur="2">for how to represent states and actions.</text><text start="38" dur="3">I&amp;#39;ll show you what the conventions are.</text><text start="41" dur="8">First, actions are represented as objects in first order logic,</text><text start="49" dur="2">normally by functions.</text><text start="51" dur="5">And so we would have a function like the function Fly</text><text start="56" dur="6">of a plane and a From Airport and a To Airport</text><text start="62" dur="6">which represents an object, which is the 
action.</text><text start="68" dur="8">Then we have situations, and situations are also objects in the logic,</text><text start="76" dur="6">and they correspond not to states but rather to paths--</text><text start="82" dur="5">the paths of actions that we have in state space search.</text><text start="87" dur="6">So if you arrive at what would be the same world state by 2 different sets of actions,</text><text start="93" dur="4">those would be considered 2 different situations in situation calculus.</text><text start="97" dur="6">We describe the situations by objects, so we usually have an initial situation,</text><text start="103" dur="3">often called S0,</text><text start="106" dur="6">and then we have a function on situations called Result.</text><text start="112" dur="10">So the result of a situation object and an action object is equal to another situation.</text><text start="122" dur="5">And now instead of describing the actions that are applicable </text><text start="127" dur="7">in a situation with a predicate Actions of S,</text><text start="134" dur="3">situation calculus for some reason decided not to do that</text><text start="137" dur="6">and instead we&amp;#39;re going to talk about the actions that are possible in the state,</text><text start="143" dur="3">and we&amp;#39;re going to do that with a predicate.</text><text start="148" dur="9">If we have a predicate Possible of A and S, is an action A possible in a state?</text><text start="157" dur="6">There&amp;#39;s a specific form for describing these predicates,</text><text start="163" dur="9">and in general, it has the form of some precondition of state S</text><text start="172" dur="7">implies that it&amp;#39;s possible to do action A in state S.</text><text start="179" dur="5">I&amp;#39;ll show you the possibility axiom for the Fly action.</text><text start="184" dur="6">We would say if there is some P, which is the plane in state S,</text><text start="190" dur="6">and there is some X, which is 
an airport in state S,</text><text start="196" dur="5">and there is some Y, which is also an airport in state S,</text><text start="201" dur="7">and P is at location X in state S,</text><text start="208" dur="13">then that implies that it&amp;#39;s possible to fly P from X to Y in state S.</text><text start="221" dur="6">And that&amp;#39;s known as the possibility axiom for the action Fly.</text></transcript></video><video title="24 Situation Calculus 2" id="GM1sxQQ81lg" length="236"><transcript><text start="1" dur="6">[Norvig] There&amp;#39;s a convention in situation calculus that predicates like At--</text><text start="7" dur="7">we said plane P was at airport X in situation S--</text><text start="14" dur="5">these types of predicates that can vary from 1 situation to another are called fluents,</text><text start="19" dur="6">from the word fluent, having to do with fluidity or change over time.</text><text start="25" dur="4">And the convention is that they refer to a specific situation,</text><text start="29" dur="6">and we always put that situation argument as the last in the predicate.</text><text start="35" dur="6">Now, the trickiest part about situation calculus is describing what changes</text><text start="41" dur="3">and what doesn&amp;#39;t change as a result of an action.</text><text start="44" dur="4">Remember in classical planning we had action schemas </text><text start="48" dur="5">where we described 1 action at a time and said what changed.</text><text start="53" dur="4">For situation calculus it turns out to be easier to do it the other way around.</text><text start="57" dur="6">Instead of writing 1 action or 1 schema or 1 axiom for each action,</text><text start="63" dur="4">we do 1 for each fluent, for each predicate that can change.</text><text start="67" dur="5">We use the convention called successor state axioms.</text><text start="72" dur="3">These are used to describe what happens in the state </text><text start="75" dur="4">that&amp;#39;s a 
successor of executing an action.</text><text start="79" dur="7">And in general, a successor state axiom will have the form of saying</text><text start="86" dur="9">for all actions and states, if it&amp;#39;s possible to execute action A in state S,</text><text start="95" dur="7">then--and I&amp;#39;ll show in general what they look like here--</text><text start="102" dur="12">the fluent is true if and only if action A made it true</text><text start="114" dur="6">or action A didn&amp;#39;t undo it.</text><text start="123" dur="5">So that is, either it wasn&amp;#39;t true before and A made it be true,</text><text start="128" dur="6">or it was true before and A didn&amp;#39;t do something to stop it being true.</text><text start="134" dur="4">For example, I&amp;#39;ll show you the successor state axiom for the In predicate.</text><text start="138" dur="5">And just to make it a little bit simpler, I&amp;#39;ll leave out all the For All quantifiers.</text><text start="143" dur="5">So wherever you see a variable without a quantifier, assume that there&amp;#39;s a For All.</text><text start="148" dur="10">What we&amp;#39;ll say is it&amp;#39;s possible to execute A in situation S.</text><text start="158" dur="10">If that&amp;#39;s true, then the In predicate holds between some cargo C </text><text start="168" dur="11">and some plane in the state, which is the result of executing action A in state S.</text><text start="181" dur="11">So that In predicate will hold if and only if either A was a load action--</text><text start="192" dur="7">so if we load the cargo into the plane, then the result of executing that action A</text><text start="199" dur="4">is that the cargo is in the plane--</text><text start="203" dur="7">or it might be that it was already true that the cargo was in the plane in situation S</text><text start="210" dur="8">and A is not equal to an unload action.</text><text start="218" dur="7">So for all A and S for which it&amp;#39;s possible to execute A in 
situation S,</text><text start="225" dur="5">the In predicate holds if and only if the action was a load</text><text start="230" dur="6">or the In predicate used to hold in the previous state and the action is not an unload.</text></transcript></video><video title="25 Situation Calculus 3" id="5M4q-H1t3Hs" length="186"><transcript><text start="0" dur="4">[Norvig] So I&amp;#39;ve talked about the possibility axioms and the successor state axioms.</text><text start="4" dur="3">That&amp;#39;s most of what&amp;#39;s in situation calculus, </text><text start="7" dur="4">and that&amp;#39;s used to describe an entire domain like the airport cargo domain.</text><text start="11" dur="7">And now we describe a particular problem within that domain by describing the initial state.</text><text start="18" dur="5">Typically we call that S0, the initial situation.</text><text start="23" dur="6">And in S0 we can make various types of assertions</text><text start="29" dur="2">of different types of predicates.</text><text start="31" dur="12">So we could say that plane P1 is at airport JFK in S0, so just a simple predicate.</text><text start="43" dur="9">And we could also make larger sentences, so we could say </text><text start="52" dur="15">for all C, if C is cargo, then that C is at JFK in situation S0.</text><text start="67" dur="4">So we have much more flexibility in situation calculus to say almost anything we want.</text><text start="71" dur="7">Anything that&amp;#39;s a valid sentence in first order logic can be asserted about the initial state.</text><text start="78" dur="2">The goal state is similar.</text><text start="80" dur="5">We could have a goal of saying there exists some goal state S</text><text start="85" dur="16">such that for all C, if C is cargo, then we want that cargo to be at SFO in state S.</text><text start="101" dur="4">So this initial state and this goal says move all the cargo--</text><text start="105" dur="5">I don&amp;#39;t care how much there is--from 
JFK to SFO.</text><text start="110" dur="5">The great thing about situation calculus is that once we&amp;#39;ve described this</text><text start="115" dur="3">in the ordinary language of first order logic,</text><text start="118" dur="5">we don&amp;#39;t need any special programs to manipulate it and come up with the solution</text><text start="123" dur="3">because we already have theorem provers for first order logic</text><text start="126" dur="2">and we can just state this as a problem,</text><text start="128" dur="5">apply the normal theorem prover that we already had for other uses,</text><text start="133" dur="6">and it can come up with an answer of a path that satisfies this goal,</text><text start="139" dur="4">a situation which corresponds to a path which satisfies this</text><text start="143" dur="5">given the initial state and given the descriptions of the actions.</text><text start="148" dur="4">So the advantage of situation calculus is that we have the full power of first order logic.</text><text start="152" dur="2">We can represent anything we want.</text><text start="154" dur="5">Much more flexibility than in problem solving or classical planning.</text><text start="159" dur="3">So all together now, we&amp;#39;ve seen several ways of dealing with planning.</text><text start="162" dur="3">We started in deterministic, fully observable environments</text><text start="165" dur="4">and we moved into stochastic and partially observable environments.</text><text start="169" dur="6">We were able to distinguish between plans that can or cannot solve a problem,</text><text start="175" dur="3">but we had 1 weakness in all these different approaches.</text><text start="178" dur="5">It is that we weren&amp;#39;t able to distinguish between probable and improbable solutions.</text><text start="183" dur="3">And that will be the subject of the next unit.</text></transcript></video></group><group title="Homework 4" count="27"><video title="1 Logic" id="WP_97aspqrc" 
length="110"><transcript><text start="0" dur="4">In this exercise, I&amp;#39;m going to write some logical expressions</text><text start="4" dur="3">in propositional logic and ask you </text><text start="7" dur="6">if these expressions are always true or always false</text><text start="13" dur="6">or if their truth value depends on the values of the propositional variables.</text><text start="19" dur="7">The first sentence is smoke implies fire</text><text start="26" dur="6">is equivalent to smoke or not fire.</text><text start="32" dur="7">Is that true or false, or does it depend on the values of smoke and fire?</text><text start="39" dur="6">The second sentence, again, smoke implies fire</text><text start="45" dur="9">is equivalent to not smoke implies not fire.</text><text start="54" dur="6">The third sentence, smoke implies fire</text><text start="60" dur="9">is equivalent to not fire implies not smoke. </text><text start="69" dur="8">The fourth sentence, big or dumb</text><text start="77" dur="5">or big implies dumb.</text><text start="82" dur="9">The final sentence, big and dumb</text><text start="91" dur="8">is equivalent to not, not big or not dumb.</text><text start="99" dur="6">For each of these, tell me if they&amp;#39;re always true regardless of the values of the variables,</text><text start="105" dur="5">always false or sometimes true and sometimes false.</text></transcript></video><video title="1. 
Logic Answer" id="XFR1231H0M0" length="64"><transcript><text start="0" dur="6">Here are the answers. 
The first sentence is true half the time and false half the time.</text><text start="6" dur="4">It would have been true all the time if we had written fire or not smoke</text><text start="10" dur="5">rather than smoke or not fire on the right-hand side.</text><text start="15" dur="4">The second sentence is false when fire is true and smoke is false</text><text start="19" dur="3">and otherwise true.</text><text start="22" dur="5">The third sentence is always true, and this is called the contrapositive.</text><text start="27" dur="6">Smoke implies fire is the same thing as not fire implies not smoke.</text><text start="33" dur="4">The fourth sentence is always true, and you can figure that out</text><text start="37" dur="5">by writing out the full truth tables or by reasoning about the variable big.</text><text start="42" dur="6">When big is true, the whole sentence is true because big is one of the disjuncts, </text><text start="48" dur="5">and when big is false, it&amp;#39;s true because big implies dumb is true</text><text start="53" dur="4">whenever the antecedent is false.</text><text start="57" dur="3">And the final sentence is also always true,</text><text start="60" dur="4">and this is known as De Morgan&amp;#39;s law.</text></transcript></video><video title="2. 
More Logic" id="P_eu1YFp9Z8" length="199"><transcript><text start="0" dur="4">In this exercise, I&amp;#39;m going to give you some English sentences </text><text start="4" dur="3">and then some first-order logic sentences</text><text start="7" dur="3">and ask you does the first-order logic sentence</text><text start="10" dur="5">correctly encode the English sentence, does it incorrectly encode it,</text><text start="15" dur="6">or is it just an error that is not a legitimate sentence</text><text start="21" dur="4">in first-order logic?</text><text start="25" dur="5">The first English sentence is &amp;quot;Paris and Nice are both in France.&amp;quot;</text><text start="30" dur="3">Here&amp;#39;s one possible translation.</text><text start="33" dur="8">Paris and Nice are in France.</text><text start="41" dur="3">Here&amp;#39;s another. </text><text start="44" dur="5">Paris is in France, and Nice is in France.</text><text start="49" dur="5">Tell us if each of these is a correct encoding of English,</text><text start="54" dur="6">incorrect, or if it&amp;#39;s erroneous first-order logic syntax.</text><text start="60" dur="7">The second sentence in English is &amp;quot;There is a country that borders Iran and Syria.&amp;quot;</text><text start="67" dur="2">Here are the possible translations. </text><text start="69" dur="5">There exists a c, and we&amp;#39;re going to use the predicate capital C </text><text start="74" dur="8">to mean C when the argument is a country.</text><text start="82" dur="4">So, there exists a c such that C of c, </text><text start="86" dur="6">and we&amp;#39;re going to use the predicate B to mean 2 objects border each other. </text><text start="92" dur="8">So, c borders Iran, and c borders Syria.</text><text start="100" dur="4">That&amp;#39;s one translation. 
Here&amp;#39;s the other translation.</text><text start="104" dur="6">There exists a c if C is a country,</text><text start="110" dur="11">then c borders Iran and c borders Syria.</text><text start="121" dur="3">And the final English sentence is no 2 bordering countries </text><text start="124" dur="6">can have the same map color, and we&amp;#39;re going to use the predicate MC for map color.</text><text start="130" dur="4">Here&amp;#39;s one possibility for all x and y.</text><text start="134" dur="7">X is a country, and y is a country.</text><text start="141" dur="5">And x and y border each other.</text><text start="146" dur="6">That implies it&amp;#39;s not the case that the map color </text><text start="152" dur="6">of x equals the map color of y.</text><text start="158" dur="5">And I should say we&amp;#39;re using map color here as a function, not as a predicate.</text><text start="163" dur="3">Here&amp;#39;s another possibility.</text><text start="166" dur="7">For all x and y, it&amp;#39;s not the case that x is a country,</text><text start="173" dur="6">or it&amp;#39;s not the case that y is a country,</text><text start="179" dur="6">or it&amp;#39;s not the case that x and y border,</text><text start="185" dur="6">or it&amp;#39;s not the case that the map color of x</text><text start="191" dur="8">is equal to the map color of y. </text></transcript></video><video title="2. 
More Logic Answer" id="SiZtjEaLiE8" length="101"><transcript><text start="0" dur="6">The answers are the first sentence has erroneous syntax.</text><text start="6" dur="4">We&amp;#39;re using an and here between 2 terms, but you can&amp;#39;t do that.</text><text start="10" dur="6">An and can only be used between sentences and predicates in first-order logic.</text><text start="16" dur="3">The second sentence does correctly encode the English sentence</text><text start="19" dur="4">Paris and Nice are both in France.</text><text start="23" dur="4">Similarly, the third sentence does correctly encode</text><text start="27" dur="3">there&amp;#39;s a country that borders Iran and Syria,</text><text start="30" dur="4">but the fourth one incorrectly encodes it.</text><text start="34" dur="3">Here we have an existential. There exists a C.</text><text start="37" dur="5">And then an implication, and that&amp;#39;s usually the wrong thing.</text><text start="42" dur="4">The problem here is not if C represents a country, </text><text start="46" dur="5">but what if C represents something that&amp;#39;s not a country, say my dog.</text><text start="51" dur="5">My dog is not a country, so there does exist a c, </text><text start="56" dur="4">which is my dog, such that this implication is true</text><text start="60" dur="4">because whenever the antecedent of an implication is false,</text><text start="64" dur="4">my dog is not a country, then the whole thing is true.</text><text start="68" dur="3">For the final sentence in English, no 2 bordering countries </text><text start="71" dur="5">can have the same map color, both of these are correct encodings.</text><text start="76" dur="2">The first one seems more obvious, </text><text start="78" dur="3">and the second one we&amp;#39;ve just manipulated things a little bit.</text><text start="81" dur="6">We know that A implies B is the same thing as saying not A or B,</text><text start="87" dur="4">so here we&amp;#39;ve just taken the 
left-hand side and negated it</text><text start="91" dur="3">and then put those all together with an or,</text><text start="94" dur="4">so these 2 sentences represent the same thing,</text><text start="98" dur="3">and they&amp;#39;re both correct.</text></transcript></video><video title="3. Vacuum World " id="wsfXrIhDhJ0" length="69"><transcript><text start="0" dur="4">This problem is about planning in belief space.</text><text start="4" dur="6">We have the 2 room vacuum world, and we&amp;#39;ve represented various belief states here.</text><text start="10" dur="4">Now, in this version, there are no sensors and so no percepts. </text><text start="14" dur="3">The actions are all deterministic. </text><text start="17" dur="5">A right or left or suck action will always do what it&amp;#39;s supposed to do,</text><text start="22" dur="2">and the environment is static.</text><text start="24" dur="4">That is, dirt stays put until it&amp;#39;s cleaned up.</text><text start="28" dur="4">Now, in the start, you know nothing about the environment.</text><text start="32" dur="2">You have no input. You don&amp;#39;t know what location you&amp;#39;re in.</text><text start="34" dur="6">You don&amp;#39;t know where the dirt is, and your goal is to be in the leftmost of the 2 squares</text><text start="40" dur="4">and have both squares cleaned up, and what I want you to do</text><text start="44" dur="5">is click on the sequence of actions, an action like this one or this one or this one</text><text start="49" dur="7">or this one, that constitute a path from the start state to the goal state.</text><text start="56" dur="7">And then I want you to click on yes if that path is guaranteed</text><text start="63" dur="6">to always reach the goal, and no if the path only sometimes reaches the goal.</text></transcript></video><video title="3. 
Vacuum World Answer" id="FhowsCKPJCE" length="37"><transcript><text start="0" dur="3">The answer is that we start off knowing nothing,</text><text start="3" dur="3">so we&amp;#39;re in this belief state here where any of </text><text start="6" dur="4">the 8 possible states are possibilities.</text><text start="10" dur="6">Then our path to the goal is to move right, and we arrive in this belief state,</text><text start="16" dur="3">and then suck up the dirt there.</text><text start="19" dur="4">Then move left, and then suck up the dirt there,</text><text start="23" dur="5">and we end up in a belief state with only a single world state,</text><text start="28" dur="3">and that&amp;#39;s one that reaches the goal where we&amp;#39;re on the left</text><text start="31" dur="2">and both squares are clean.</text><text start="33" dur="4">And yes, that is guaranteed to reach the goal.</text></transcript></video><video title="4. More Vacuum World" id="2H4NJg8Iiaw" length="99"><transcript><text start="0" dur="6">In this problem, we&amp;#39;re again in the 2-location vacuum world,</text><text start="6" dur="4">but this time around, we have local sensing,</text><text start="10" dur="5">meaning at each turn, we get input of what location we&amp;#39;re at,</text><text start="15" dur="3">the left or the right, and whether there&amp;#39;s dirt in that location.</text><text start="18" dur="3">But we don&amp;#39;t know what&amp;#39;s going on in the other location.</text><text start="21" dur="7">We have a dynamic world where dirt can appear anywhere.</text><text start="28" dur="5">As we move around, dirt can spontaneously appear</text><text start="33" dur="4">in the location we left or in the location we&amp;#39;re going to visit.</text><text start="37" dur="4">However, if we&amp;#39;re sucking, the dirt can&amp;#39;t appear, </text><text start="41" dur="4">because if it did appear there, we would successfully suck it up.</text><text start="45" dur="5">And now in addition, the right 
and left moves</text><text start="50" dur="4">are stochastic in that they don&amp;#39;t always succeed.</text><text start="54" dur="3">Sometimes when you try to go right, you do successfully go right,</text><text start="57" dur="4">and sometimes you stay in the same location, same for left.</text><text start="61" dur="3">The suck action is always successful.</text><text start="64" dur="3">It will always clean up dirt in the current location.</text><text start="67" dur="3">Now, when we start out, we get the percept </text><text start="70" dur="4">saying that we&amp;#39;re in the leftmost location</text><text start="74" dur="5">and that location is clean, and that means our belief state</text><text start="79" dur="4">is that we&amp;#39;re in either 5 or 7.</text><text start="83" dur="6">Now, the first thing I want you to answer is if we decide to move right,</text><text start="89" dur="4">what do we predict the possible belief state will be,</text><text start="93" dur="3">the possible set of states in our belief state will be </text><text start="96" dur="3">after we execute the right movement?</text></transcript></video><video title="4. 
More Vacuum World Answer" id="Z6QNCiMIR1I" length="39"><transcript><text start="0" dur="5">The answer is the right movement is stochastic,</text><text start="5" dur="4">so it may fail, so that means we may stay on the left, and we may move to the right.</text><text start="9" dur="4">And the world is dynamic, which means dirt may appear</text><text start="13" dur="3">in either the left or the right location,</text><text start="16" dur="4">and we didn&amp;#39;t know for sure if there was dirt or not in the right,</text><text start="20" dur="6">so that means any of the 8 possible states belong to the belief state</text><text start="26" dur="5">for the prediction of moving right.</text><text start="31" dur="5">That means any of the 8 states belong to the belief state</text><text start="36" dur="3">for the prediction of moving right.</text></transcript></video><video title="5. More Vacuum World" id="i5XMOLw6CGE" length="19"><transcript><text start="0" dur="3">Now we get a percept from the world, and we&amp;#39;ve observed</text><text start="3" dur="4">that we&amp;#39;re in the rightmost square, so the action worked,</text><text start="7" dur="2">and that square is dirty.</text><text start="9" dur="3">Now we want to update our belief state</text><text start="12" dur="3">and click on all the states that belong to the belief state now</text><text start="15" dur="4">as we update due to this percept.</text></transcript></video><video title="5. More Vacuum World Answer" id="QHPs9m5qE9A" length="15"><transcript><text start="0" dur="4">The answer is state 6 and state 2.</text><text start="4" dur="4">Those are the 2 states in which the vacuum is on the right and that state is dirty.</text><text start="8" dur="4">The other state we can&amp;#39;t observe, and it could have been in any state before,</text><text start="12" dur="3">so now it can be either clean or dirty.</text></transcript></video><video title="6. 
More Vacuum World" id="x93ewPQhIQc" length="19"><transcript><text start="0" dur="4">Now our belief state contains 2 and 6,</text><text start="4" dur="4">and we decide we want to execute the suck action.</text><text start="8" dur="3">Now tell me, by clicking on the appropriate states, </text><text start="11" dur="3">what states belong to the belief state</text><text start="14" dur="5">after we make a prediction for what&amp;#39;s going to happen after the suck action.</text></transcript></video><video title="6. More Vacuum World Answer" id="984YVReF6Do" length="10"><transcript><text start="0" dur="3">The answer is states 4 and 8.</text><text start="3" dur="3">We know that the suck action will make it clean</text><text start="6" dur="4">in our current location, but we don&amp;#39;t know what&amp;#39;s going on in the other location.</text></transcript></video><video title="7. More Vacuum World" id="RW-l7JWDtYQ" length="19"><transcript><text start="0" dur="4">Now, we make the observation</text><text start="4" dur="5">right, clean, and I want you to </text><text start="9" dur="3">update our belief state by clicking on the states</text><text start="12" dur="4">that belong to the new belief state now</text><text start="16" dur="3">after taking that observation into account.</text></transcript></video><video title="7. More Vacuum World Answer" id="mPk9fV8RZ3g" length="16"><transcript><text start="0" dur="2">The answer is nothing has changed.</text><text start="2" dur="3">Our belief state is still 4 and 8.</text><text start="5" dur="2">We didn&amp;#39;t really get any new information from that input</text><text start="7" dur="3">because we knew that the result of the suck action</text><text start="10" dur="3">was going to have to clean up locally, and we still didn&amp;#39;t know anything</text><text start="13" dur="3">about the other non-local state.</text></transcript></video><video title="8. 
Monkey and Bananas" id="rCGAgc9smZg" length="70"><transcript><text start="0" dur="4">This is a famous problem called the monkey and bananas problem,</text><text start="4" dur="4">described in the language of classical planning.</text><text start="8" dur="4">There are six actions. The monkey can go from location x to y.</text><text start="12" dur="7">It can push some object from x to y. It can climb up an object. It can grab something.</text><text start="19" dur="5">It can climb down from an object, and it can un-grab something. </text><text start="24" dur="5">Initially, the monkey is at location A. The bananas are at location B.</text><text start="29" dur="5">The box is at C, and the monkey is at a low height, as is the box,</text><text start="34" dur="3">but the bananas are at a high height,</text><text start="37" dur="3">but the box is pushable and climbable.</text><text start="40" dur="6">Now, assuming that we execute this plan--go from A to C, push the box from C to B,</text><text start="46" dur="5">climb up on the box, grasp the bananas, and climb down from the box.</text><text start="51" dur="6">What I want you to do is look at these definitions of actions,</text><text start="57" dur="6"> tell me how the state unfolds from this initial state here to the final state,</text><text start="63" dur="7">and then click off all of these instances that are going to be true in the final state.</text></transcript></video><video title="8. Monkey and Bananas Answer" id="rZtBR-d0H5Y" length="53"><transcript><text start="0" dur="4">The answers are, yes, the monkey has the bananas.</text><text start="4" dur="4">No, the box is not at C. It has been pushed to B.</text><text start="8" dur="5">Yes, the monkey is at B. 
Yes, the bananas are at B.</text><text start="13" dur="4">No, the height of the monkey is not high, because he climbed down,</text><text start="17" dur="5">which means that the effect was that he is at height low.</text><text start="22" dur="5">But, yes, the height of the bananas is high, according to these definitions.</text><text start="27" dur="5">You would think once the monkey grasped the bananas and climbed down </text><text start="32" dur="3">that the height of the bananas should be low,</text><text start="35" dur="4">but if we look at the operator for climb down, it doesn&amp;#39;t say that.</text><text start="39" dur="5">It refers to the monkey, but it doesn&amp;#39;t refer to anything that the monkey is holding.</text><text start="44" dur="5">That kind of thing is difficult to express in the language of classical planning.</text><text start="49" dur="4">You could say that&amp;#39;s a weakness in the definition of climb down.</text></transcript></video><video title="9. Situation Calculus" id="eeDwEYxWCTA" length="276"><transcript><text start="0" dur="5">[Norvig] The final problem involves situation calculus.</text><text start="5" dur="6">In the domain I want to describe, we have a combination lock with 4 digits,</text><text start="11" dur="6">and the correct combination that will open the lock we&amp;#39;ll call X.</text><text start="17" dur="3">There are 2 actions you can perform.</text><text start="20" dur="4">One is to dial any combination on the dial,</text><text start="24" dur="5">and if you dial the correct one, X, then the lock will open.</text><text start="29" dur="5">And the other action you can perform is to press a lock button,</text><text start="34" dur="3">and if you press that button, then the lock will be locked,</text><text start="37" dur="3">whether it was open before or not.</text><text start="40" dur="3">I&amp;#39;m going to describe some axioms, </text><text start="43" dur="5">and I want you to tell me whether these axioms are 
correct for the domain or not.</text><text start="51" dur="2">First the possibility axioms.</text><text start="53" dur="6">One choice is the possibility axiom that says</text><text start="59" dur="10">if C equals X, then it&amp;#39;s possible to dial C in situation S.</text><text start="69" dur="4">And here I&amp;#39;m assuming that all variables are scoped</text><text start="73" dur="5">so that we say an implicit for all C and for all S here.</text><text start="78" dur="8">And X is not a variable. This is a constant, referring to the correct combination.</text><text start="86" dur="10">The other possible axiom is for all C if C is greater than or equal to 0</text><text start="96" dur="5">and less than or equal to 9999,</text><text start="101" dur="9">then it&amp;#39;s possible to dial C in any situation S.</text><text start="110" dur="10">So tell me which, if any or both, of these axioms you think correctly encode the situation.</text><text start="120" dur="5">Next we&amp;#39;ll look at the possibility axioms for the lock action.</text><text start="125" dur="2">Here&amp;#39;s one.</text><text start="127" dur="5">We can say if the safe is open in situation S,</text><text start="132" dur="6">then it&amp;#39;s possible to execute the lock action in S.</text><text start="138" dur="6">Or maybe we should say if the safe is not open in S,</text><text start="144" dur="6">then it&amp;#39;s possible to execute Lock in S.</text><text start="150" dur="5">Or maybe we should say if true, </text><text start="155" dur="7">then it&amp;#39;s possible to execute the lock action in situation S.</text><text start="162" dur="8">And tell me which, if any, of those represents a correct representation of the problem.</text><text start="170" dur="4">And finally we need successor state axioms for all the fluents,</text><text start="174" dur="5">but there&amp;#39;s really only one fluent, and that&amp;#39;s whether or not the safe is open.</text><text start="179" dur="7">So here&amp;#39;s 
one example of a successor state axiom.</text><text start="186" dur="7">We could say for any situation and action,</text><text start="193" dur="5">if it&amp;#39;s possible to execute that action in the situation,</text><text start="198" dur="8">then the Open fluent is going to be true in the result of executing that action</text><text start="206" dur="10">if and only if the action is dialing the correct combination, X, </text><text start="216" dur="11">or if the safe was already open in S and the action is not equal to Lock.</text><text start="227" dur="2">That&amp;#39;s one option.</text><text start="229" dur="5">And the other option is the same thing on the left-hand side,</text><text start="234" dur="9">and on the right-hand side it&amp;#39;s open if and only if the action is dialing the correct combination</text><text start="243" dur="6">and the action is not equal to Lock.</text><text start="249" dur="7">So tell me which, if any or all, of these are accurate representations of the problem.</text><text start="256" dur="7">In each case I want you to tell me if each of these axioms are good as they stand alone.</text><text start="263" dur="3">I don&amp;#39;t want you to look at any combinations of axioms</text><text start="266" dur="7">but just go through each one and check the box if you think that the axiom on that line alone</text><text start="273" dur="3">is a correct representation of the problem.</text></transcript></video><video title="9. 
Situation Calculus Answer" id="2oZexvl5fVU" length="58"><transcript><text start="0" dur="6">[Norvig] The answers are, in this case, only the second is a correct representation.</text><text start="6" dur="3">Any combination is possible to be dialed.</text><text start="9" dur="6">It&amp;#39;s not the case that it&amp;#39;s only possible to dial the correct combination.</text><text start="15" dur="4">Now, here we said that the lock button works at any point.</text><text start="19" dur="4">Whether it&amp;#39;s open or not, the lock button will always lock it.</text><text start="23" dur="2">And so that&amp;#39;s represented by the third option.</text><text start="25" dur="5">True implies it&amp;#39;s possible to lock.</text><text start="30" dur="4">In this case the first one is a correct representation </text><text start="34" dur="4">of the successor state axiom for Open,</text><text start="38" dur="3">and the second one is not, because note what it says.</text><text start="41" dur="8">If we already have the lock open and then we execute some dialing action</text><text start="49" dur="5">that&amp;#39;s not dialing the correct combination, X, we want it to remain open.</text><text start="54" dur="4">But this second axiom would make it be closed, which is not what we want.</text></transcript></video><video title="2 More Logic" id="P_eu1YFp9Z8" length="199"><transcript><text start="0" dur="4">In this exercise, I&amp;#39;m going to give you some English sentences </text><text start="4" dur="3">and then some first-order logic sentences</text><text start="7" dur="3">and ask you does the first-order logic sentence</text><text start="10" dur="5">correctly encode the English sentence, does it incorrectly encode it,</text><text start="15" dur="6">or is it just an error that is not a legitimate sentence</text><text start="21" dur="4">in first-order logic?</text><text start="25" dur="5">The first English sentence is &amp;quot;Paris and Nice are both in France.&amp;quot;</text><text 
start="30" dur="3">Here&amp;#39;s one possible translation.</text><text start="33" dur="8">Paris and Nice are in France.</text><text start="41" dur="3">Here&amp;#39;s another. </text><text start="44" dur="5">Paris is in France, and Nice is in France.</text><text start="49" dur="5">Tell us if each of these is a correct encoding of English,</text><text start="54" dur="6">incorrect, or if it&amp;#39;s erroneous first-order logic syntax.</text><text start="60" dur="7">The second sentence in English is &amp;quot;There is a country that borders Iran and Syria.&amp;quot;</text><text start="67" dur="2">Here are the possible translations. </text><text start="69" dur="5">There exists a c, and we&amp;#39;re going to use the predicate capital C </text><text start="74" dur="8">to mean C when the argument is a country.</text><text start="82" dur="4">So, there exists a c such that C of c, </text><text start="86" dur="6">and we&amp;#39;re going to use the predicate B to mean 2 objects border each other. </text><text start="92" dur="8">So, c borders Iran, and c borders Syria.</text><text start="100" dur="4">That&amp;#39;s one translation. 
Here&amp;#39;s the other translation.</text><text start="104" dur="6">There exists a c if C is a country,</text><text start="110" dur="11">then c borders Iran and c borders Syria.</text><text start="121" dur="3">And the final English sentence is no 2 bordering countries </text><text start="124" dur="6">can have the same map color, and we&amp;#39;re going to use the predicate MC for map color.</text><text start="130" dur="4">Here&amp;#39;s one possibility for all x and y.</text><text start="134" dur="7">X is a country, and y is a country.</text><text start="141" dur="5">And x and y border each other.</text><text start="146" dur="6">That implies it&amp;#39;s not the case that the map color </text><text start="152" dur="6">of x equals the map color of y.</text><text start="158" dur="5">And I should say we&amp;#39;re using map color here as a function, not as a predicate.</text><text start="163" dur="3">Here&amp;#39;s another possibility.</text><text start="166" dur="7">For all x and y, it&amp;#39;s not the case that x is a country,</text><text start="173" dur="6">or it&amp;#39;s not the case that y is a country,</text><text start="179" dur="6">or it&amp;#39;s not the case that x and y border,</text><text start="185" dur="6">or it&amp;#39;s not the case that the map color of x</text><text start="191" dur="8">is equal to the map color of y. </text></transcript></video><video title="3 Vacuum World" id="wsfXrIhDhJ0" length="69"><transcript><text start="0" dur="4">This problem is about planning in belief space.</text><text start="4" dur="6">We have the 2 room vacuum world, and we&amp;#39;ve represented various belief states here.</text><text start="10" dur="4">Now, in this version, there are no sensors and so no percepts. </text><text start="14" dur="3">The actions are all deterministic. 
</text><text start="17" dur="5">A right or left or suck action will always do what it&amp;#39;s supposed to do,</text><text start="22" dur="2">and the environment is static.</text><text start="24" dur="4">That is, dirt stays put until it&amp;#39;s cleaned up.</text><text start="28" dur="4">Now, in the start, you know nothing about the environment.</text><text start="32" dur="2">You have no input. You don&amp;#39;t know what location you&amp;#39;re in.</text><text start="34" dur="6">You don&amp;#39;t know where the dirt is, and your goal is to be in the leftmost of the 2 squares</text><text start="40" dur="4">and have both squares cleaned up, and what I want you to do</text><text start="44" dur="5">is click on the sequence of actions, an action like this one or this one or this one</text><text start="49" dur="7">or this one, that constitute a path from the start state to the goal state.</text><text start="56" dur="7">And then I want you to click on yes if that path is guaranteed</text><text start="63" dur="6">to always reach the goal, and no if the path only sometimes reaches the goal.</text></transcript></video><video title="4a More Vacuum World" id="2H4NJg8Iiaw" length="99"><transcript><text start="0" dur="6">In this problem, we&amp;#39;re again in the 2-location vacuum world,</text><text start="6" dur="4">but this time around, we have local sensing,</text><text start="10" dur="5">meaning at each turn, we get input of what location we&amp;#39;re at,</text><text start="15" dur="3">the left or the right, and whether there&amp;#39;s dirt in that location.</text><text start="18" dur="3">But we don&amp;#39;t know what&amp;#39;s going on in the other location.</text><text start="21" dur="7">We have a dynamic world where dirt can appear anywhere.</text><text start="28" dur="5">As we move around, dirt can spontaneously appear</text><text start="33" dur="4">in the location we left or in the location we&amp;#39;re going to visit.</text><text start="37" dur="4">However, if 
we&amp;#39;re sucking, the dirt can&amp;#39;t appear, </text><text start="41" dur="4">because if it did appear there, we would successfully suck it up.</text><text start="45" dur="5">And now in addition, the right and left moves</text><text start="50" dur="4">are stochastic in that they don&amp;#39;t always succeed.</text><text start="54" dur="3">Sometimes when you try to go right, you do successfully go right,</text><text start="57" dur="4">and sometimes you stay in the same location, same for left.</text><text start="61" dur="3">The suck action is always successful.</text><text start="64" dur="3">It will always clean up dirt in the current location.</text><text start="67" dur="3">Now, when we start out, we get the percept </text><text start="70" dur="4">saying that we&amp;#39;re in the leftmost location</text><text start="74" dur="5">and that location is clean, and that means our belief state</text><text start="79" dur="4">is that we&amp;#39;re in either 5 or 7.</text><text start="83" dur="6">Now, the first thing I want you to answer is if we decide to move right,</text><text start="89" dur="4">what do we predict the possible belief state will be,</text><text start="93" dur="3">the possible set of states in our belief state will be </text><text start="96" dur="3">after we execute the right movement?</text></transcript></video><video title="4b More Vacuum World" id="i5XMOLw6CGE" length="19"><transcript><text start="0" dur="3">Now we get a percept from the world, and we&amp;#39;ve observed</text><text start="3" dur="4">that we&amp;#39;re in the rightmost square, so the action worked,</text><text start="7" dur="2">and that square is dirty.</text><text start="9" dur="3">Now we want to update our belief state</text><text start="12" dur="3">and click on all the states that belong to the belief state now</text><text start="15" dur="4">as we update due to this percept.</text></transcript></video><video title="4c More Vacuum World" id="x93ewPQhIQc" 
length="19"><transcript><text start="0" dur="4">Now our belief state contains 2 and 6,</text><text start="4" dur="4">and we decide we want to execute the suck action.</text><text start="8" dur="3">Now tell me, by clicking on the appropriate states, </text><text start="11" dur="3">what states belong to the belief state</text><text start="14" dur="5">after we make a prediction for what&amp;#39;s going to happen after the suck action.</text></transcript></video><video title="4d More Vacuum World" id="RW-l7JWDtYQ" length="19"><transcript><text start="0" dur="4">Now, we make the observation</text><text start="4" dur="5">right, clean, and I want you to </text><text start="9" dur="3">update our belief state by clicking on the states</text><text start="12" dur="4">that belong to the new belief state now</text><text start="16" dur="3">after taking that observation into account.</text></transcript></video><video title="5 Monkey and Bananas" id="rCGAgc9smZg" length="70"><transcript><text start="0" dur="4">This is a famous problem called the monkey and bananas problem,</text><text start="4" dur="4">described in the language of classical planning.</text><text start="8" dur="4">There are six actions. The monkey can go from location x to y.</text><text start="12" dur="7">It can push some object from x to y. It can climb up an object. It can grab something.</text><text start="19" dur="5">It can climb down from an object, and it can un-grab something. </text><text start="24" dur="5">Initially, the monkey is at location A. 
The bananas are at location B.</text><text start="29" dur="5">The box is at C, and the monkey is at a low height, as is the box,</text><text start="34" dur="3">but the bananas are at a high height,</text><text start="37" dur="3">but the box is pushable and climbable.</text><text start="40" dur="6">Now, assuming that we execute this plan--go from A to C, push the box from C to B,</text><text start="46" dur="5">climb up on the box, grasp the bananas, and climb down from the box.</text><text start="51" dur="6">What I want you to do is look at these definitions of actions,</text><text start="57" dur="6"> tell me how the state unfolds from this initial state here to the final state,</text><text start="63" dur="7">and then click off all of these instances that are going to be true in the final state.</text></transcript></video><video title="6 Situation Calculus" id="eeDwEYxWCTA" length="276"><transcript><text start="0" dur="5">[Norvig] The final problem involves situation calculus.</text><text start="5" dur="6">In the domain I want to describe, we have a combination lock with 4 digits,</text><text start="11" dur="6">and the correct combination that will open the lock we&amp;#39;ll call X.</text><text start="17" dur="3">There are 2 actions you can perform.</text><text start="20" dur="4">One is to dial any combination on the dial,</text><text start="24" dur="5">and if you dial the correct one, X, then the lock will open.</text><text start="29" dur="5">And the other action you can perform is to press a lock button,</text><text start="34" dur="3">and if you press that button, then the lock will be locked,</text><text start="37" dur="3">whether it was open before or not.</text><text start="40" dur="3">I&amp;#39;m going to describe some axioms, </text><text start="43" dur="5">and I want you to tell me whether these axioms are correct for the domain or not.</text><text start="51" dur="2">First the possibility axioms.</text><text start="53" dur="6">One choice is the possibility 
axiom that says</text><text start="59" dur="10">if C equals X, then it&amp;#39;s possible to dial C in situation S.</text><text start="69" dur="4">And here I&amp;#39;m assuming that all variables are scoped</text><text start="73" dur="5">so that we say an implicit for all C and for all S here.</text><text start="78" dur="8">And X is not a variable. This is a constant, referring to the correct combination.</text><text start="86" dur="10">The other possible axiom is for all C if C is greater than or equal to 0</text><text start="96" dur="5">and less than or equal to 9999,</text><text start="101" dur="9">then it&amp;#39;s possible to dial C in any situation S.</text><text start="110" dur="10">So tell me which, if any or both, of these axioms you think correctly encode the situation.</text><text start="120" dur="5">Next we&amp;#39;ll look at the possibility axioms for the lock action.</text><text start="125" dur="2">Here&amp;#39;s one.</text><text start="127" dur="5">We can say if the safe is open in situation S,</text><text start="132" dur="6">then it&amp;#39;s possible to execute the lock action in S.</text><text start="138" dur="6">Or maybe we should say if the safe is not open in S,</text><text start="144" dur="6">then it&amp;#39;s possible to execute Lock in S.</text><text start="150" dur="5">Or maybe we should say if true, </text><text start="155" dur="7">then it&amp;#39;s possible to execute the lock action in situation S.</text><text start="162" dur="8">And tell me which, if any, of those represents a correct representation of the problem.</text><text start="170" dur="4">And finally we need successor state axioms for all the fluents,</text><text start="174" dur="5">but there&amp;#39;s really only one fluent, and that&amp;#39;s whether or not the safe is open.</text><text start="179" dur="7">So here&amp;#39;s one example of a successor state axiom.</text><text start="186" dur="7">We could say for any situation and action,</text><text start="193" dur="5">if 
it&amp;#39;s possible to execute that action in the situation,</text><text start="198" dur="8">then the Open fluent is going to be true in the result of executing that action</text><text start="206" dur="10">if and only if the action is dialing the correct combination, X, </text><text start="216" dur="11">or if the safe was already open in S and the action is not equal to Lock.</text><text start="227" dur="2">That&amp;#39;s one option.</text><text start="229" dur="5">And the other option is the same thing on the left-hand side,</text><text start="234" dur="9">and on the right-hand side it&amp;#39;s open if and only if the action is dialing the correct combination</text><text start="243" dur="6">and the action is not equal to Lock.</text><text start="249" dur="7">So tell me which, if any or all, of these are accurate representations of the problem.</text><text start="256" dur="7">In each case I want you to tell me if each of these axioms is good as it stands alone.</text><text start="263" dur="3">I don&amp;#39;t want you to look at any combinations of axioms</text><text start="266" dur="7">but just go through each one and check the box if you think that the axiom on that line alone</text><text start="273" dur="3">is a correct representation of the problem.</text></transcript></video></group><group title="Unit 9" count="36"><video title="01 Introduction" id="DgH6NaJHfVQ" length="32"><transcript><text start="0" dur="2">So today is an exciting day.</text><text start="2" dur="3">We&amp;#39;ll talk about planning under uncertainty,</text><text start="5" dur="2">and it really puts together the material</text><text start="7" dur="2">we&amp;#39;ve talked about in past classes.</text><text start="9" dur="2">We talked about planning,</text><text start="11" dur="2">but not under uncertainty, and you&amp;#39;ve had </text><text start="13" dur="2">many, many classes on uncertainty,</text><text start="15" dur="2">and now it gets to the point where we can make 
</text><text start="17" dur="2">decisions under uncertainty.</text><text start="19" dur="2">This is really important for my own research field </text><text start="21" dur="3">like robotics where the world is full of uncertainty, and the </text><text start="24" dur="2">type of techniques I&amp;#39;ll tell you about today</text><text start="26" dur="2">will really make it possible to drive robots</text><text start="28" dur="2">in actual physical worlds and</text><text start="30" dur="2">find good plans for these robots to execute.</text></transcript></video><video title="02 Planning Under Uncertainty MDP" id="9D35JSWSJAg" length="210"><transcript><text start="0" dur="4">[Narrator] Planning under uncertainty.</text><text start="4" dur="2">In this class so far </text><text start="6" dur="2">we talked a good deal about planning.</text><text start="8" dur="4">We talked about uncertainty and probabilities,</text><text start="12" dur="3">and we also talked about learning,</text><text start="15" dur="3">but all 3 items were discussed separately.</text><text start="18" dur="2">We never brought planning and uncertainty together,</text><text start="20" dur="3">uncertainty and learning, or planning and learning.</text><text start="23" dur="3">So in the class today, we&amp;#39;ll fuse planning and uncertainty</text><text start="26" dur="5">using techniques known as Markov decision processes or MDPs,</text><text start="31" dur="5">and partially observable Markov decision processes or POMDPs.</text><text start="36" dur="3">We also have a class coming up on reinforcement learning</text><text start="39" dur="2">which combines all 3 of these aspects, </text><text start="41" dur="3">planning, uncertainty, and machine learning.</text><text start="44" dur="2">You might remember in the very first class</text><text start="46" dur="3">we distinguished very different characteristics of agent tasks, </text><text start="49" dur="2">and here are some of those.</text><text start="51" dur="3">We 
distinguished deterministic versus stochastic environments,</text><text start="54" dur="4">and we also talked about fully versus partially observable.</text><text start="58" dur="3">In the area of planning so far </text><text start="61" dur="3">all of our algorithms fall into this field over here,</text><text start="64" dur="6">like A*, depth first, breadth first and so on.</text><text start="70" dur="2">The MDP algorithms </text><text start="72" dur="2">which I will talk about first</text><text start="74" dur="3">fall into the intersection of fully observable </text><text start="77" dur="2">yet stochastic, and just to remind us </text><text start="79" dur="2">what the difference was,</text><text start="81" dur="3">stochastic is an environment where the outcome of an action is somewhat random.</text><text start="84" dur="2">Whereas an environment that&amp;#39;s deterministic</text><text start="86" dur="3">is one where the outcome of an action is predictable</text><text start="89" dur="2">and always the same.</text><text start="91" dur="2">An environment is fully observable if you can </text><text start="93" dur="2">see the state of the environment which means if you can make all decisions </text><text start="95" dur="2">based on the momentary sensory input.</text><text start="97" dur="2">Whereas if you need memory,</text><text start="99" dur="2">it&amp;#39;s partially observable.</text><text start="101" dur="2">Planning in the partially observable case</text><text start="103" dur="4">is called POMDP, and towards the end of this class,</text><text start="107" dur="3">I&amp;#39;ll briefly talk about POMDPs but not in any depth.</text><text start="110" dur="3">So most of this class focuses on Markov decision processes</text><text start="113" dur="4">as opposed to partially observable Markov decision processes.</text><text start="117" dur="2">So what is a Markov decision process?</text><text start="119" dur="5">One way you can specify a Markov decision process is by a graph.</text><text 
start="124" dur="4">Suppose you have states S1, S2, and S3,</text><text start="128" dur="3">and you have actions A1 and A2.</text><text start="131" dur="3">A state transition graph, like this,</text><text start="134" dur="2">is a finite state machine,</text><text start="136" dur="4">and it becomes Markov if the outcomes of actions are somewhat random.</text><text start="140" dur="5">So for example if A1 over here, with a 50% probability, leads to </text><text start="145" dur="4">state S2 but with another 50% probability </text><text start="149" dur="3">leads to state S3.</text><text start="152" dur="2">So put differently, a Markov decision process consists of </text><text start="154" dur="6">states, actions, a state transition matrix,</text><text start="160" dur="2">often written in the following form</text><text start="162" dur="2">which is just about the same as </text><text start="164" dur="3">a conditional state transition probability</text><text start="167" dur="2">that a state S prime</text><text start="169" dur="2">is the correct posterior state</text><text start="171" dur="4">after executing action A in a state S,</text><text start="175" dur="3">and the missing thing is the objective for the Markov decision process.</text><text start="178" dur="2">What do we want to achieve?</text><text start="180" dur="3">For that we often define a reward function,</text><text start="183" dur="2">and for the sake of this lecture, </text><text start="185" dur="2">I will attach rewards just to states.</text><text start="187" dur="3">So each state will have a function R attached</text><text start="190" dur="2">that tells me how good the state is.</text><text start="192" dur="2">So for example it might be worth $10</text><text start="194" dur="2">to be in the state over here,</text><text start="196" dur="2">$0 to be in the state over here,</text><text start="198" dur="3">and $100 to be in a state over here.</text><text start="201" dur="2">So the planning problem is now 
the problem </text><text start="203" dur="4">which assigns an action to each possible state,</text><text start="207" dur="3">so that we maximize our total reward.</text></transcript></video><video title="03 Robot Tour Guide Examples" id="9QMZQkKuYjo" length="162"><transcript><text start="0" dur="3">[Narrator] Before diving into too much detail,</text><text start="3" dur="4">let me explain to you why MDPs really matter.</text><text start="7" dur="3">What you see here is a robotic tour guide </text><text start="10" dur="3">that the University of Bonn, with my assistance, </text><text start="13" dur="4">deployed in the German museum in Bonn,</text><text start="17" dur="3">and the objective of this robot was to </text><text start="20" dur="3">navigate the museum and guide visitors,</text><text start="23" dur="4">mostly kids, from exhibit to exhibit.</text><text start="27" dur="4">This is a challenging planning problem because </text><text start="31" dur="2">as the robot moves </text><text start="33" dur="2">it can&amp;#39;t really predict its action outcomes</text><text start="35" dur="3">because of the randomness of the environment</text><text start="38" dur="2">and the carpet and the wheels of the robot.</text><text start="40" dur="4">The robot is not able to really follow its own commands very well,</text><text start="44" dur="3">and it has to take this into consideration during the planning process</text><text start="47" dur="3">so when it finds itself in a location it didn&amp;#39;t expect,</text><text start="50" dur="3">it knows what to do.</text><text start="53" dur="3">In the second video here, you see a successor robot</text><text start="56" dur="3">that was deployed in the Smithsonian National </text><text start="59" dur="4">Museum of American History in the late 1990s</text><text start="63" dur="2">where it guided many, many thousands of kids</text><text start="65" dur="3">through the entrance hall of the museum,</text><text start="68" dur="2">and once 
again, this is a challenging planning problem.</text><text start="70" dur="3">As you can see people are often in the way of the robot.</text><text start="73" dur="2">The robot has to take detours.</text><text start="75" dur="2">Now this one is particularly difficult because</text><text start="77" dur="2">there were obstacles that were invisible</text><text start="79" dur="2">like a downward staircase.</text><text start="81" dur="2">So this is a challenging localization problem</text><text start="83" dur="2">trying to find out where you are,</text><text start="85" dur="5">but that&amp;#39;s for a later class.</text><text start="90" dur="3">In the video here, you see a robot being deployed in a nursing home</text><text start="93" dur="3">with the objective to assist elderly people</text><text start="96" dur="3">by guiding them around, bringing them to appointments,</text><text start="99" dur="3">reminding them to take their medication, and</text><text start="102" dur="3">interacting with them, and this robot has been active for many, many years</text><text start="105" dur="3">and been used, and, again, it&amp;#39;s a very challenging planning problem</text><text start="108" dur="4">to navigate through this elderly home.</text><text start="112" dur="2">And the final robot I&amp;#39;m showing you here.</text><text start="114" dur="3">This was built with my colleague Will Whittaker at Carnegie Mellon University.</text><text start="117" dur="4">The objective here was to explore abandoned mines.</text><text start="121" dur="2">Pennsylvania and West Virginia</text><text start="123" dur="3">and other states are heavily mined.</text><text start="126" dur="3">There are many abandoned old coal mines,</text><text start="129" dur="2">and for many of these mines,</text><text start="131" dur="3">it&amp;#39;s unknown what the conditions are and where exactly they are.</text><text start="134" dur="2">They&amp;#39;re not really human accessible. 
</text><text start="136" dur="3">They tend to have roof fall and very low oxygen levels.</text><text start="139" dur="2">So we made a robot that went inside</text><text start="141" dur="3">and built maps of those mines.</text><text start="146" dur="3">All these robots have in common that they </text><text start="149" dur="3">face really challenging planning problems.</text><text start="152" dur="2">The environments are stochastic.</text><text start="154" dur="2">That is, the outcomes of actions are unknown,</text><text start="156" dur="3">and the robot has to be able to react to</text><text start="159" dur="3">all kinds of situations, even the ones that it didn&amp;#39;t plan for.</text></transcript></video><video title="04 MDP Grid World" id="YfSBYf9h7qk" length="160"><transcript><text start="0" dur="3">[Narrator] Let me give a much simpler example</text><text start="3" dur="3">often called grid world for MDPs,</text><text start="6" dur="2">and I&amp;#39;ll be using this insanely simple </text><text start="8" dur="3">example over here throughout this class.</text><text start="11" dur="2">Let&amp;#39;s assume we have a starting state over here,</text><text start="13" dur="3">and there are 2 goal states which</text><text start="16" dur="2">are often called absorbing states</text><text start="18" dur="3">with very different rewards or payouts.</text><text start="21" dur="2">Plus 100 for the state over here,</text><text start="23" dur="2">minus 100 for the state over here,</text><text start="25" dur="3">and our agent is able to move about the environment,</text><text start="28" dur="2">and when it reaches one of those 2 states,</text><text start="30" dur="3">the game is over and the task is done.</text><text start="33" dur="4">Obviously the top state is much more attractive than the bottom state with minus 100.</text><text start="37" dur="4">Now to turn this into an MDP, let&amp;#39;s assume </text><text start="41" dur="2">actions are somewhat 
stochastic.</text><text start="43" dur="3">So suppose we had a grid cell, and we attempt to go north.</text><text start="46" dur="3">The deterministic agent would always succeed</text><text start="49" dur="4">to go to the north square if it&amp;#39;s available,</text><text start="53" dur="2">but let&amp;#39;s assume that we only have an 80% chance</text><text start="55" dur="2">to make it to the cell in the north.</text><text start="57" dur="2">If there&amp;#39;s no cell at all,</text><text start="59" dur="2">there&amp;#39;s a wall like over here,</text><text start="61" dur="3">we assume with 80% chance, we just bounce back to the same cell,</text><text start="64" dur="4">but with 10% chance, we instead go left.</text><text start="68" dur="2">Another 10% chance, we go right.</text><text start="70" dur="3">So if an agent is over here and wishes to go north,</text><text start="73" dur="2">then with 80% chance, it finds itself over here,</text><text start="75" dur="2">10% over here, 10% over here.</text><text start="77" dur="2">If it goes north from here,</text><text start="79" dur="2">because there&amp;#39;s no north cell,</text><text start="81" dur="2">it&amp;#39;ll bounce back with 80% probability,</text><text start="83" dur="2">10% left, 10% right.</text><text start="85" dur="2">In a cell like this one over here, </text><text start="87" dur="3">it&amp;#39;ll bounce back with 90% probability,</text><text start="90" dur="2">80 from the top and 10 from the left, </text><text start="92" dur="3">but it still has a 10% chance of going right.</text><text start="95" dur="2">This is a stochastic state transition which</text><text start="97" dur="2">we can equally define for actions,</text><text start="99" dur="2">south, west and east, and</text><text start="101" dur="2">now we can see that in a situation like this</text><text start="103" dur="2">conventional planning is insufficient.</text><text start="105" dur="3">So for example if you plan a sequence of actions 
starting over here,</text><text start="108" dur="3">you might go north, north, east, east, east</text><text start="111" dur="4">to reach our plus 100 absorbing or final state,</text><text start="115" dur="2">but with this state transition model over here,</text><text start="117" dur="5">even with the first step it might happen with 10% chance that you find yourself over here,</text><text start="122" dur="3">in which case conventional planning would not give us an answer.</text><text start="125" dur="4">So we wish to have a planning method that provides an answer no matter where we are </text><text start="129" dur="3">and that&amp;#39;s called a policy.</text><text start="132" dur="3">A policy assigns actions to any state.</text><text start="135" dur="2">So for example a policy might look as follows:</text><text start="137" dur="7">for this state, we wish to go north, north, east, east, east,</text><text start="144" dur="4">but for this state over here, we wish to go north, maybe east over here, </text><text start="148" dur="3">and maybe west over here.</text><text start="151" dur="2">So for each state, except for the absorbing states, </text><text start="153" dur="3">we have to define an action to define a policy.</text><text start="156" dur="4">The planning problem we have becomes one of finding the optimal policy.</text></transcript></video><video title="05 Problems with Conventional Planning 1" id="Ig0ekhtAfkY" length="97"><transcript><text start="0" dur="2">[Narrator] To understand the beauty</text><text start="2" dur="5">of a policy, let me look into stochastic environments,</text><text start="7" dur="4">and let me try to apply conventional planning.</text><text start="11" dur="4">Consider the same grid I just gave you,</text><text start="15" dur="3">and let&amp;#39;s assume there&amp;#39;s a discrete start state,</text><text start="18" dur="2">the one over here, and we wish to find </text><text start="20" dur="2">an action sequence that leads us to</text><text 
start="22" dur="2"> the goal state over here.</text><text start="24" dur="3">In conventional planning we would create a tree.</text><text start="27" dur="3">In C1, we&amp;#39;re given 4 action choices,</text><text start="30" dur="5">north, south, west, and east.</text><text start="35" dur="2">However, the outcome of those choices </text><text start="37" dur="2">is not deterministic.</text><text start="39" dur="2">So rather than having a single outcome,</text><text start="41" dur="3">nature will choose for us the actual outcome.</text><text start="44" dur="2">In the case of going north, for example, </text><text start="46" dur="3">we may find ourselves in B1,</text><text start="49" dur="3">or back into C1.</text><text start="52" dur="2">Similarly for going south, we might find </text><text start="54" dur="7">ourselves in C1, or back in C2, and so on.</text><text start="61" dur="3">This tree has a number of problems.</text><text start="64" dur="3">The first problem is the branching factor.</text><text start="67" dur="3">While we have 4 different action choices,</text><text start="70" dur="3">nature will give us up to 3 different outcomes</text><text start="73" dur="4">which makes up to 12 different things we have to follow.</text><text start="77" dur="3">Now in conventional planning we might have to follow just 1 of those,</text><text start="80" dur="3">but here we might have to follow up to 3 of those things.</text><text start="83" dur="3">So every time we plan a step ahead, </text><text start="86" dur="2">we might have to increase the breadth of </text><text start="88" dur="4">the search tree by at least a factor of 3.</text><text start="92" dur="2">So one of the problems is that the branching </text><text start="94" dur="3">factor is too large. 
</text></transcript></video><video title="06 Branching Factor Question" id="Y_WmE98BN7c" length="34"><transcript><text start="0" dur="2">[Narrator] To understand the branching factor,</text><text start="2" dur="2">let me quiz you on </text><text start="4" dur="3">how many states you can possibly reach</text><text start="7" dur="2">from any other state, and as an example</text><text start="9" dur="3">from C1, you can reach under </text><text start="12" dur="5">any action choice B1, C1, and C2, which</text><text start="17" dur="2">will give you an effective branching factor of 3. </text><text start="19" dur="4">So now I ask you, what&amp;#39;s the effective branching factor in B3?</text><text start="23" dur="2">What is the maximum number of states </text><text start="25" dur="2">you can reach under </text><text start="27" dur="2">any possible action from B3?</text><text start="29" dur="2">So how many states can we reach </text><text start="31" dur="3">from B3 over here?</text></transcript></video><video title="07 Branching Factor Answer" id="_tpD8n3vpXg" length="17"><transcript><text start="0" dur="2">[Narrator] And the answer is 8.</text><text start="2" dur="3">If you go north, you might reach this state over here,</text><text start="5" dur="3">this one over here, this one over here.</text><text start="8" dur="2">If you go east, you might reach this state over here,</text><text start="10" dur="2">or this one over here, or this one over here, </text><text start="12" dur="2">or this one over here.</text><text start="14" dur="3">When you put it all together, you can reach all of those 8 states over here.</text></transcript></video><video title="08 Problems with Conventional Planning 2" id="V2AvmoBJdU4" length="74"><transcript><text start="0" dur="2">[Narrator] There are other problems with</text><text start="2" dur="2">the search paradigm. 
</text><text start="4" dur="4">The second one is that the tree could be very deep,</text><text start="8" dur="2">and the reason is we might be able to </text><text start="10" dur="3">circle forever in the area over here </text><text start="13" dur="2">without reaching the goal state, and </text><text start="15" dur="2">that makes for a very deep tree, and until</text><text start="17" dur="3">we reach the goal state, we won&amp;#39;t even know </text><text start="20" dur="2">it&amp;#39;s the best possible action.</text><text start="22" dur="2">So conventional planning might have difficulties </text><text start="24" dur="3">with basically infinite loops.</text><text start="27" dur="3">The third problem is that many states </text><text start="30" dur="2">recur in the search.</text><text start="32" dur="2">In A*, we were careful </text><text start="34" dur="3">to visit each state only once,</text><text start="37" dur="3">but here because the actions might </text><text start="40" dur="2">carry you back here to the same state,</text><text start="42" dur="3">C1 is, for example, over here and over here. </text><text start="45" dur="2">You might find that many states in the tree </text><text start="47" dur="3">might be visited many, many different times.</text><text start="50" dur="3">Now if you reach a state it doesn&amp;#39;t really matter how you got there.</text><text start="53" dur="2">Yet, the tree doesn&amp;#39;t understand this, and it </text><text start="55" dur="3">might expand states more than once.</text><text start="58" dur="2">These are the 3 problems </text><text start="60" dur="4">that are overcome by our policy method,</text><text start="64" dur="3">and this motivates in part why calculating policies </text><text start="67" dur="2">is so much better of an idea than using </text><text start="69" dur="3">conventional planning in stochastic environments. 
</text><text start="72" dur="2">So let&amp;#39;s get back to the policy case.</text></transcript></video><video title="09 Policy Question 1" id="zOzKPHA-lDQ" length="37"><transcript><text start="0" dur="3">[Narrator] Let&amp;#39;s look at the grid world, again,</text><text start="3" dur="2">and let me ask you a question. </text><text start="5" dur="2">I wish to find an optimal policy </text><text start="7" dur="2">for all these states that </text><text start="9" dur="2">with maximum probability leads me to </text><text start="11" dur="3">the absorbing state plus 100, </text><text start="14" dur="2">and as I just discussed, I assume </text><text start="16" dur="2">there are 4 different actions, </text><text start="18" dur="2">north, south, west, and east</text><text start="20" dur="3">that succeed with probability 80% provided</text><text start="23" dur="3">that the corresponding grid cell is actually attainable.</text><text start="26" dur="3">I wish to know what is the optimal action</text><text start="29" dur="3">in the corner state over here, A1,</text><text start="32" dur="2">and I give you 4 choices, </text><text start="34" dur="3">north, south, west, and east.</text></transcript></video><video title="10 Policy Answer 1" id="_2thYamxwN4" length="6"><transcript><text start="0" dur="2">[Narrator] And the answer is east.</text><text start="2" dur="2">East in expectation transfers you to the right side, </text><text start="4" dur="2">and you&amp;#39;re one step closer to your goal position. </text></transcript></video><video title="11 Policy Question 2" id="jutx0ekYv28" length="6"><transcript><text start="0" dur="3">[Narrator] Let me ask the same question for the state over here, C1,</text><text start="3" dur="3">which one is the optimal action for C1?</text></transcript></video><video title="12 Policy Answer 2" id="EbWXMMcl1dc" length="8"><transcript><text start="0" dur="3">[Narrator] And the answer is north. 
It gets you one step closer.</text><text start="3" dur="2">There are 2 equally long paths, but over here</text><text start="5" dur="3">you risk falling into the minus 100; therefore, you&amp;#39;d rather go north.</text></transcript></video><video title="13 Policy Question 3" id="Hdm9jCJvesM" length="17"><transcript><text start="0" dur="2">[Narrator] The next question is challenging.</text><text start="2" dur="4">Consider state C4, which one is the optimal action</text><text start="6" dur="3">provided that you can run around as long as you want.</text><text start="9" dur="2">There are no costs associated with steps, but</text><text start="11" dur="4">you wish to maximize the probability of ending up in plus 100 over here.</text><text start="15" dur="2">Think before you answer this question.</text></transcript></video><video title="14 Policy Answer 3 Question 4" id="AAg6KkvTijU" length="50"><transcript><text start="0" dur="2">[Narrator] And the answer is south.</text><text start="2" dur="4">The reason why it&amp;#39;s south is if we attempt to go south,</text><text start="6" dur="2">with an 80% probability we&amp;#39;ll stay in the same cell.</text><text start="8" dur="2">In fact, with a 90% probability because we can&amp;#39;t </text><text start="10" dur="2">go south and we can&amp;#39;t go east. 
</text><text start="12" dur="3">With a 10% probability, we find ourselves over here which is a relatively </text><text start="15" dur="3">safe state because we can actually go to the left side.</text><text start="18" dur="4">If we were to go just west which is the intuitive answer, </text><text start="22" dur="5">then there&amp;#39;s a 10% chance we end up in the minus 100 absorbing state.</text><text start="27" dur="2">You can convince yourself that if you go south,</text><text start="29" dur="3">find yourself eventually in state C3, and then </text><text start="32" dur="5">go west, west, north, north, east, east, east,</text><text start="37" dur="5">you will never ever run the risk of falling into the minus 100, and </text><text start="42" dur="3">that argument is tricky and to convince ourselves </text><text start="45" dur="2">let me ask the other hard question:</text><text start="47" dur="3">so what shall we do in state B3? What&amp;#39;s the optimal action?</text></transcript></video><video title="15 Policy Answer 4" id="PJKRCDi_fIs" length="25"><transcript><text start="0" dur="2">[Narrator] And the answer is west.</text><text start="2" dur="2">If you&amp;#39;re over here, and we go east,</text><text start="4" dur="2">we&amp;#39;d likely end up with minus 100.</text><text start="6" dur="3">If you go north, which seems to be the intuitive answer,</text><text start="9" dur="3">there&amp;#39;s a 10% chance we fall into the minus 100.</text><text start="12" dur="2">However, if we go west, then there&amp;#39;s absolutely </text><text start="14" dur="2">no chance we fall into the minus 100.</text><text start="16" dur="2">We might find ourselves over here.</text><text start="18" dur="2">We might be in the same state. 
We might find ourselves over here, </text><text start="20" dur="2">but from these states over here,</text><text start="22" dur="3">there are safe policies that can safely avoid the minus 100.</text></transcript></video><video title="16 MDP and Costs" id="WyLEhX3oUZU" length="167"><transcript><text start="0" dur="4">[Narrator] So even for the simple grid world, </text><text start="4" dur="4">the optimal control policy assuming stochastic actions</text><text start="8" dur="4">and no costs of moving, except for the final absorbing costs, </text><text start="12" dur="2">is somewhat nontrivial. </text><text start="14" dur="3">Take a second to look at this.</text><text start="17" dur="2">Along here it seems pretty obvious, but </text><text start="19" dur="5">for the state over here, B3, and for the state over here, C4,</text><text start="24" dur="3">we choose an action that just avoids falling into the minus 100, </text><text start="27" dur="5">which is more important than trying to make progress towards the plus 100.</text><text start="32" dur="3">Now obviously this is not the general case of an MDP,</text><text start="35" dur="3">and it&amp;#39;s somewhat frustrating that we&amp;#39;d be willing to run through the wall,</text><text start="38" dur="3">just so as to avoid falling into the minus 100,</text><text start="41" dur="2">and the reason why this seems unintuitive is </text><text start="43" dur="3">because we&amp;#39;re really forgetting the issue of costs.</text><text start="46" dur="3">In normal life, there is a cost associated with moving.</text><text start="49" dur="4">MDPs are general enough to have a cost factor,</text><text start="53" dur="3">and the way we&amp;#39;re going to denote costs </text><text start="56" dur="4">is by defining our reward function over any possible state.</text><text start="60" dur="3">Reaching the state A4 </text><text start="63" dur="4">gives us plus 100, minus 100 for B4,</text><text start="67" dur="3">and perhaps 
minus 3 for every other state,</text><text start="70" dur="3">which reflects the fact that if you take a step somewhere</text><text start="73" dur="2">we will pay minus 3.</text><text start="75" dur="4">So this gives an incentive to shorten the final action sequence.</text><text start="79" dur="4">So we&amp;#39;re now ready to state the actual objective</text><text start="83" dur="2">of an MDP which is to maximize not</text><text start="85" dur="4">just the momentary reward, but the sum</text><text start="89" dur="3">of all future rewards,</text><text start="92" dur="3">and I&amp;#39;m going to write R_T to denote the fact that </text><text start="95" dur="3">this reward is received at time T, and because</text><text start="98" dur="3">our reward itself is stochastic,</text><text start="101" dur="3">we have to compute the expectation over those, </text><text start="104" dur="2">and that we seek to maximize.</text><text start="106" dur="4">So we seek to find the policy that maximizes the expression over here. </text><text start="110" dur="4">Now another interesting caveat is that sometimes people put </text><text start="114" dur="3">a so-called discount factor into this equation </text><text start="117" dur="4">with an exponent of T, where the discount factor might be, say, 0.9,</text><text start="121" dur="3">and what this does is it decays future reward </text><text start="124" dur="3">relative to more immediate rewards, and it&amp;#39;s </text><text start="127" dur="3">kind of an alternative way to specify costs.</text><text start="130" dur="3">So we can make this explicit by a negative reward per state </text><text start="133" dur="3">or we can bring in a discount factor</text><text start="136" dur="3">that discounts the plus 100 by the </text><text start="139" dur="4">number of steps that went by before it reached the plus 100. 
</text><text start="143" dur="4">This also gives an incentive to get to the goal as fast as possible.</text><text start="147" dur="3">The nice mathematical thing about a discount factor is that</text><text start="150" dur="3">it keeps this expectation bounded.</text><text start="153" dur="3">It&amp;#39;s easy to show that this expression over here</text><text start="156" dur="5">will always be smaller than or equal to 1 over 1 minus gamma times the </text><text start="161" dur="3">maximum absolute reward value, </text><text start="164" dur="3">which in this case would be plus 100.</text></transcript></video><video title="17 Value Iteration 1" id="oefOCk3koZo" length="91"><transcript><text start="0" dur="3">The definition of the expected sum of future</text><text start="3" dur="3">possible discounted reward that I have given you</text><text start="6" dur="4">allows me to define a value function.</text><text start="10" dur="3">For each state S, my value of the state</text><text start="13" dur="4">is the expected sum of future discounted reward</text><text start="17" dur="4">provided that I start in state S</text><text start="21" dur="2">and then execute policy pi.</text><text start="23" dur="3">This expression looks really complex,</text><text start="26" dur="2">but it really means something really simple,</text><text start="28" dur="2">which is: suppose we start in the state over here,</text><text start="30" dur="4">and you get +100 over here, -100 over here.</text><text start="34" dur="4">And suppose for now, every other state costs you -3.</text><text start="38" dur="3">For any possible policy that assigns actions to </text><text start="41" dur="3">the non-absorbing states, you can now </text><text start="44" dur="3">simulate the agent for quite a while and compute empirically</text><text start="47" dur="5">what is the average reward that is being received</text><text start="52" dur="2">until you finally hit a goal state.</text><text start="54" dur="3">For example, for the policy 
that you like,</text><text start="57" dur="3">the value would, of course, for any state </text><text start="60" dur="2">depend on how much progress you make towards the goal,</text><text start="62" dur="2">or whether you bounce back and forth.</text><text start="64" dur="2">In fact, in this state over here, you might bounce down</text><text start="66" dur="2">and have to do the loop again.</text><text start="68" dur="3">But there&amp;#39;s a well-defined expectation</text><text start="71" dur="3">over any possible execution of the policy pi</text><text start="74" dur="3">that is specific to each state and each policy pi.</text><text start="77" dur="2">That&amp;#39;s called a value.</text><text start="79" dur="2">And value functions are absolutely essential to MDPs,</text><text start="81" dur="4">so the way we&amp;#39;re going to plan is we&amp;#39;re going to iterate</text><text start="85" dur="3">and compute value functions, and it will turn out</text><text start="88" dur="3">that by doing this, we&amp;#39;re going to find better and better policies as well.</text></transcript></video><video title="18 Value Iteration 2" id="8-pzJXUiXrM" length="56"><transcript><text start="0" dur="3">Before I dive into mathematical detail about</text><text start="3" dur="3">value functions, let me just give you a tutorial.</text><text start="6" dur="2">The value function is a potential function</text><text start="8" dur="5">that leads from the goal location--in this case, the 100 in the upper right--</text><text start="13" dur="3">all the way into the space so that hill climbing</text><text start="16" dur="4">in this potential function leads you on the shortest path to the goal.</text><text start="20" dur="2">The algorithm is a recursive algorithm.</text><text start="22" dur="3">It spreads the value through the space, as you can see in this animation,</text><text start="25" dur="3">and after a number of iterations, it converges,</text><text start="28" dur="2">and you have a 
grayscale value</text><text start="30" dur="4">that really corresponds to the best way of getting to the goal.</text><text start="34" dur="2">Hill climbing in that function gets you to the goal.</text><text start="36" dur="2">You can simplify.</text><text start="38" dur="3">Think about this as pouring a glass of milk</text><text start="41" dur="3">into the +100 state and having the milk </text><text start="44" dur="3">descend through the maze, and later on, </text><text start="47" dur="4">when you follow the gradient of the milk flow,</text><text start="51" dur="5">you will reach the goal in the best possible way.</text></transcript></video><video title="19 Value Iteration 3" id="glHKJ359Cnc" length="276"><transcript><text start="0" dur="5">Let me tell you about a truly magical algorithm called value iteration.</text><text start="5" dur="3">In value iteration, we recursively calculate the value function</text><text start="8" dur="4">so that in the end, we get what&amp;#39;s called the optimal value function.</text><text start="12" dur="2">And from that, we can derive, </text><text start="14" dur="4">or look up, the optimal policy.</text><text start="18" dur="2">Here&amp;#39;s how it goes.</text><text start="20" dur="6">Suppose we start with a value function of 0 everywhere</text><text start="26" dur="4">except for the 2 absorbing states, whose values are +100 and -100.</text><text start="30" dur="3">Then we can ask ourselves, for example,</text><text start="33" dur="4">for the field A3, is 0 a good value?</text><text start="37" dur="3">And the answer is no, it isn&amp;#39;t. 
It is somewhat inconsistent.</text><text start="40" dur="2">We can compute a better value.</text><text start="42" dur="4">In particular, we can understand that</text><text start="46" dur="4">if we&amp;#39;re in A3 and we choose to go east,</text><text start="50" dur="5">then with 0.8 chance we should expect a value of 100.</text><text start="55" dur="3">With 0.1 chance, we&amp;#39;ll stay in the same state,</text><text start="58" dur="3">in which case the value is still 0.</text><text start="61" dur="4">And with 0.1 chance, we&amp;#39;re going to end up down here, also with a value of 0.</text><text start="65" dur="3">With the appropriate definition of value,</text><text start="68" dur="3">and after subtracting the cost of 3, we get the following formula,</text><text start="71" dur="2">which gives 77.</text><text start="73" dur="5">So, 77 is a better estimate of value</text><text start="78" dur="2">for the state over here.</text><text start="80" dur="2">And now that we&amp;#39;ve done it, we can ask ourselves the question:</text><text start="82" dur="3">is this a good value, or is this a good value, or is this a good value?</text><text start="85" dur="2">And we can propagate value backwards</text><text start="87" dur="3">in reverse order of action execution</text><text start="90" dur="4">from the positive absorbing state through this grid world</text><text start="94" dur="4">and fill every single state with a better value estimate</text><text start="98" dur="4">than the one we assumed initially.</text><text start="102" dur="4">If we do this for the grid over here and run value iteration</text><text start="106" dur="4">to convergence, then we get the following value function.</text><text start="110" dur="3">We get 93 over here. 
We&amp;#39;re very close to the goal.</text><text start="113" dur="5">89, 85, 81, 77, 73, 70, over here.</text><text start="118" dur="4">This state will be worth 68, and this state is worth 47, </text><text start="122" dur="2">and the reason why these are not so good is because</text><text start="124" dur="2">we might stay quite a while in those</text><text start="126" dur="3">before we&amp;#39;re able to execute an action</text><text start="129" dur="3">that gets us outside the state.</text><text start="132" dur="3">Let me give you an algorithm that defines value iteration.</text><text start="135" dur="5">We wish to estimate recursively the value of state S.</text><text start="140" dur="3">And we do this based on a possible successor state</text><text start="143" dur="4">S prime that we look up in the existing table.</text><text start="147" dur="3">Now, actions A are non-deterministic.</text><text start="150" dur="4">Therefore, we have to go through all possible S primes</text><text start="154" dur="3">and weight each outcome with the associated probability:</text><text start="157" dur="3">the probability of reaching S prime given that we started in state S</text><text start="160" dur="2">and applied action A.</text><text start="162" dur="4">This expression is usually discounted by gamma,</text><text start="166" dur="5">and we also add the reward or the cost of the state.</text><text start="171" dur="3">And because there are multiple actions and it&amp;#39;s up to us</text><text start="174" dur="6">to choose the right action, we will maximize over all possible actions.</text><text start="180" dur="3">See, we look at this equation, and it looks really complicated,</text><text start="183" dur="3">but it&amp;#39;s actually really simple.</text><text start="186" dur="5">We compute a value recursively based on successor values</text><text start="191" dur="4">plus the reward, or minus the cost, that it takes to get us there.</text><text start="195" 
dur="5">Because Mother Nature picks a successor state for us for any given action,</text><text start="200" dur="5">we compute an expectation over the value of the successor state</text><text start="205" dur="4">weighted by the corresponding probabilities, which is happening over here,</text><text start="209" dur="3">and because we can choose our action,</text><text start="212" dur="3">we maximize over all possible actions.</text><text start="215" dur="4">Therefore, the max as opposed to the expectation on the left side over here.</text><text start="219" dur="4">This is an equation that&amp;#39;s called a backup.</text><text start="223" dur="5">In terminal states, we just assign R(s),</text><text start="228" dur="4">and obviously, in the beginning of value iteration, </text><text start="232" dur="3">these expressions are different, and we have to update.</text><text start="235" dur="3">But as Bellman showed a while ago,</text><text start="238" dur="3">this process of updates converges.</text><text start="241" dur="4">After convergence, this assignment over here </text><text start="245" dur="2">is replaced by the equality sign,</text><text start="247" dur="3">and when this equality holds true,</text><text start="250" dur="6">we have what is called a Bellman equality or Bellman equation.</text><text start="256" dur="3">And that&amp;#39;s all there is to know to compute values.</text><text start="259" dur="5">If you apply this specific equation over and over again to each state,</text><text start="264" dur="3">eventually you get a value function that looks just like this,</text><text start="267" dur="3">where the value really corresponds to the optimal future</text><text start="270" dur="3">cost-reward trade-off that you can achieve </text><text start="273" dur="3">if you act optimally in any given state over here.</text></transcript></video><video title="20 Deterministic Question 1" id="0uBKtlb8QtU" length="49"><transcript><text start="0" 
dur="2">Let me take my example world </text><text start="2" dur="4">and apply value iteration in a quiz.</text><text start="6" dur="3">As before, assume the value is initialized as</text><text start="9" dur="3">+100 and -100 for the absorbing states</text><text start="12" dur="2">and 0 everywhere else.</text><text start="14" dur="4">And let me make the assumption that our transition probability is deterministic.</text><text start="18" dur="4">That is, if I execute the east action in this state over here,</text><text start="22" dur="2">with probability 1 I am over here;</text><text start="24" dur="4">if I execute the north action over here, with probability 1 </text><text start="28" dur="2">I will find myself in the same state as before.</text><text start="30" dur="2">There is no uncertainty anymore.</text><text start="32" dur="3">That&amp;#39;s really important for now, just for this one quiz.</text><text start="35" dur="3">I&amp;#39;ll also assume gamma equals 1, </text><text start="38" dur="2">just to make things a little bit simpler.</text><text start="40" dur="4">And the cost over here is -3 unless you reach an absorbing state.</text><text start="44" dur="3">What I&amp;#39;d like to know is, after a single backup,</text><text start="47" dur="2">what&amp;#39;s the value of A3?</text></transcript></video><video title="21 Deterministic Answer 1" id="cs7eAdklm_4" length="18"><transcript><text start="0" dur="2">And the answer is 97.</text><text start="2" dur="2">It&amp;#39;s easy to see that the action east </text><text start="4" dur="2">is the value-maximizing action.</text><text start="6" dur="3">Let&amp;#39;s plug in east over here. 
Gamma equals 1.</text><text start="9" dur="3">There is a single successor state with a value of 100,</text><text start="12" dur="3">so if we take the 100 over here minus the 3 </text><text start="15" dur="3">that it costs us to get there, we get 97.</text></transcript></video><video title="22 Deterministic Question 2" id="ytCyzDAz2Uo" length="8"><transcript><text start="0" dur="3">Let&amp;#39;s run value iteration again, and let me ask</text><text start="3" dur="3">what&amp;#39;s the value for B3, assuming that we already updated</text><text start="6" dur="2">the value for A3 as shown over here.</text></transcript></video><video title="23 Deterministic Answer 2" id="Mr3H7QpecRA" length="20"><transcript><text start="0" dur="2">And again, making use of the observation that our</text><text start="2" dur="2">state transition function is deterministic, </text><text start="4" dur="3">we get 94, and the logic is the same as before.</text><text start="7" dur="2">The optimal action here is going north, </text><text start="9" dur="3">which will succeed with probability 1.</text><text start="12" dur="4">Therefore, we can use the value recursively from A3</text><text start="16" dur="2">to propagate back to B3.</text><text start="18" dur="2">97 - 3 gives us 94.</text></transcript></video><video title="24 Deterministic Question 3" id="M3AM5huQzlc" length="12"><transcript><text start="0" dur="2">And finally, I would like to know what&amp;#39;s the value of </text><text start="2" dur="4">C1, the field down here, after we run value iteration</text><text start="6" dur="3">over and over again all the way to convergence.</text><text start="9" dur="3">Again, gamma equals 1. 
The state transition function is deterministic.</text></transcript></video><video title="25 Deterministic Answer 3" id="RzO7U1ZPD54" length="40"><transcript><text start="0" dur="2">And the answer is easily obtained if you just </text><text start="2" dur="3">subtract 3 for each step.</text><text start="5" dur="3">We get 88 and 85 over here.</text><text start="8" dur="3">We could also reach the same value going around here.</text><text start="11" dur="2">So, 85 would have been the right answer,</text><text start="13" dur="3">and this will be the value function after convergence.</text><text start="16" dur="5">It&amp;#39;s beautiful to see that the value function is effectively </text><text start="24" dur="2">the distance to the positive absorbing state times 3</text><text start="26" dur="2">subtracted from 100.</text><text start="28" dur="4">So, we have 97, 94, 91, 88, 85 and so on.</text><text start="32" dur="2">This is a degenerate case,</text><text start="34" dur="2">because we have a deterministic state transition function;</text><text start="36" dur="4">it gets trickier to calculate for the stochastic case.</text></transcript></video><video title="26 Stochastic Question 1" id="_LQAOyxbGiQ" length="22"><transcript><text start="0" dur="4">Let me ask the same question for the stochastic case.</text><text start="4" dur="2">We have the same world as before,</text><text start="6" dur="3">and actions have stochastic outcomes.</text><text start="9" dur="3">With probability 0.8, we get the action we commanded;</text><text start="12" dur="3">otherwise we go left or right.</text><text start="15" dur="2">And assuming that the initial values are all 0, </text><text start="17" dur="5">calculate for me for a single backup the value of A3.</text></transcript></video><video title="27 Stochastic Answer 1" id="wWQO0h_WjT4" length="51"><transcript><text start="0" dur="3">This should look familiar from the previous material.</text><text start="3" dur="3">It&amp;#39;s 77, and the reason is 
in A3,</text><text start="6" dur="3">we have an 80% chance</text><text start="9" dur="5">for the action going east to reach 100.</text><text start="14" dur="4">But the remaining 20% of the time, we either stay in place or go to the field down here,</text><text start="18" dur="2">both of which have an initial value of 0.</text><text start="20" dur="3">That gives us 0, but we have to subtract the cost of 3, </text><text start="23" dur="4">and that gives us 80 - 3 = 77.</text><text start="27" dur="3">It&amp;#39;s also easy to verify that all of the other actions have lower values.</text><text start="30" dur="4">For example, the value of going west will be</text><text start="34" dur="4">0 in all possible outcomes given the current value function</text><text start="38" dur="3">minus 3, so the value of going west would right now </text><text start="41" dur="4">be estimated as -3, and 77 is larger than -3.</text><text start="45" dur="4">Therefore, we&amp;#39;ll pick east, with its value of 77, as the action that maximizes</text><text start="49" dur="2">the update equation over here.</text></transcript></video><video title="28 Stochastic Question 2" id="YIR22PaBvCY" length="21"><transcript><text start="0" dur="2">And here&amp;#39;s a somewhat non-trivial quiz.</text><text start="2" dur="4">For the state B3, calculate the value</text><text start="6" dur="4">assuming that we have a value function as shown over here</text><text start="10" dur="3">and all the open states have an assumed value of 0,</text><text start="13" dur="3">because we&amp;#39;re still at the beginning of our value update.</text><text start="16" dur="3">What would be our very first value for B3</text><text start="19" dur="2">that we compute based on the values shown over here?</text></transcript></video><video title="29 Stochastic Answer 2" id="0ImGXKeKqFU" length="90"><transcript><text start="0" dur="5">And the answer is 48.6.</text><text start="5" dur="3">And obviously, it&amp;#39;s not quite as trivial as the calculation 
before</text><text start="8" dur="3">because there are 2 competing actions.</text><text start="11" dur="3">We can try to go north, which gives us the 77</text><text start="14" dur="3">but risks falling into the -100.</text><text start="17" dur="3">Or we can go west, as before, which gives us a much smaller chance</text><text start="20" dur="3">to reach 77, but avoids the -100.</text><text start="23" dur="2">Let&amp;#39;s do both and see which one is better.</text><text start="25" dur="6">If we go north, we have a 0.8 chance of reaching 77.</text><text start="31" dur="5">There&amp;#39;s now a 10% chance of receiving -100</text><text start="36" dur="3">and a 10% chance of staying in the same location,</text><text start="39" dur="3">which at this point still has a value of 0.</text><text start="42" dur="4">We subtract our cost of 3, and we get 61.6</text><text start="46" dur="5">- 10 - 3 = 48.6.</text><text start="51" dur="3">Let&amp;#39;s check the west action value.</text><text start="54" dur="4">We reach the 77 with probability 0.1; </text><text start="58" dur="3">with 0.8 chance we stay in the same cell,</text><text start="61" dur="2">which has the value of 0, </text><text start="63" dur="3">and with 0.1 chance, we end up down here,</text><text start="66" dur="2">which also has a value of 0.</text><text start="68" dur="3">We subtract our cost of 3,</text><text start="71" dur="5">and that gives us 7.7 - 3 = 4.7.</text><text start="76" dur="4">At this point, going west is vastly inferior</text><text start="80" dur="2">to going north, and the reason is we already propagated </text><text start="82" dur="3">a great value of 77 for this cell over here,</text><text start="85" dur="3">whereas this one is still set to 0.</text><text start="88" dur="2">So, we will set the value of B3 to 48.6.</text></transcript></video><video title="30 Value Iterations and Policy 1" id="CDXaY7cdTYE" length="50"><transcript><text start="0" dur="3">So, now that we have a value backup 
function</text><text start="3" dur="2">that we discussed in depth, the question now becomes</text><text start="5" dur="2">what&amp;#39;s the optimal policy?</text><text start="7" dur="3">And it turns out this value backup function defines </text><text start="10" dur="2">the optimal policy: it completely specifies</text><text start="12" dur="2">which action to pick, </text><text start="14" dur="4">which is just the action that maximizes this expression over here.</text><text start="18" dur="4">For any state S, any value function V,</text><text start="22" dur="2">we can define a policy,</text><text start="24" dur="4">and that&amp;#39;s the one that picks the action under argmax</text><text start="28" dur="3">that maximizes the expression over here.</text><text start="31" dur="4">For the maximization, we can safely drop gamma and R(s).</text><text start="35" dur="3">Baked into the value iteration function was already</text><text start="38" dur="3">an action choice that picks the best action.</text><text start="41" dur="2">We just made it explicit.</text><text start="43" dur="2">This is the way of backing up values,</text><text start="45" dur="3">and once values have been backed up, </text><text start="48" dur="2">this is the way to find the optimal thing to do.</text></transcript></video><video title="31 Value Iterations and Policy 2" id="46U-_qzQui0" length="168"><transcript><text start="0" dur="4">I&amp;#39;d like to show you some value functions after convergence</text><text start="4" dur="3">and the corresponding policies.</text><text start="7" dur="4">If we assume gamma = 1 and our cost for the non-absorbing states</text><text start="11" dur="4">equals -3, as before, we get the following approximate value function</text><text start="15" dur="6">after convergence, and the corresponding policy looks as follows.</text><text start="21" dur="4">Up here we go right until we hit the absorbing state.</text><text start="25" dur="2">Over here we prefer to go 
north.</text><text start="27" dur="4">Here we go left, and here we go north again.</text><text start="31" dur="2">I left the policy open for the absorbing states</text><text start="33" dur="3">because there&amp;#39;s no action to be chosen here.</text><text start="36" dur="3">This is a situation where</text><text start="39" dur="3">the risk of falling into the -100 is balanced by</text><text start="42" dur="3">the time spent going around.</text><text start="45" dur="3">We have an action over here in this visible state here </text><text start="48" dur="4">that risks the 10% chance of falling into the -100.</text><text start="52" dur="3">But that&amp;#39;s preferable under the cost model of -3</text><text start="55" dur="3">to the action of going south.</text><text start="58" dur="4">Now, this all changes if we assume a cost of 0 </text><text start="62" dur="3">for all the states over here, in which case,</text><text start="65" dur="4">the value function after convergence looks interesting.</text><text start="69" dur="4">And with some thought, you realize it&amp;#39;s exactly the right one.</text><text start="73" dur="3">Each value is exactly 100,</text><text start="76" dur="2">and the reason is with a cost of 0,</text><text start="78" dur="3">it doesn&amp;#39;t matter how long we move around.</text><text start="81" dur="3">Eventually we can guarantee in this case we reach the 100,</text><text start="84" dur="4">therefore each value after backups will become 100.</text><text start="88" dur="4">The corresponding policy is the one we discussed before.</text><text start="92" dur="3">And the crucial thing here is that for this state,</text><text start="95" dur="3">we go south, if you&amp;#39;re willing to wait the time.</text><text start="98" dur="2">For this state over here, we go west, </text><text start="100" dur="2">willing to wait the time so as to avoid</text><text start="102" dur="2">falling into the -100.</text><text start="104" dur="2">And all the other states 
resolve</text><text start="106" dur="3">exactly as you would expect them to resolve</text><text start="109" dur="3">as shown over here.</text><text start="112" dur="3">If we set the costs to -200,</text><text start="115" dur="3">so each step itself is even more expensive</text><text start="118" dur="4">than falling into this ditch over here,</text><text start="122" dur="3">we get a value function that&amp;#39;s strongly negative everywhere</text><text start="125" dur="3">with this being the most negative state.</text><text start="128" dur="3">But more interesting is the policy.</text><text start="131" dur="3">This is a situation where our agent tries to end the game</text><text start="134" dur="4">as fast as possible so as not to endure the penalty of -200.</text><text start="138" dur="3">And even over here, where it jumps into the -100,</text><text start="141" dur="4">it&amp;#39;s still better than going north and taking the extra 200 as a penalty</text><text start="145" dur="2">and then reaching the +100.</text><text start="147" dur="3">Similarly, over here we go straight north, </text><text start="150" dur="2">and over here we go as fast as possible</text><text start="152" dur="3">to the state over here. 
</text><text start="155" dur="2">Now, this is an extreme case.</text><text start="157" dur="2">I don&amp;#39;t know why it would make sense to set a penalty for life</text><text start="159" dur="6">that is so negative that even negative death is better than living,</text><text start="165" dur="3">but certainly that&amp;#39;s the result of running value iteration in this extreme case.</text></transcript></video><video title="32 MDP Conclusion" id="fCwZN0Ht4Q8" length="106"><transcript><text start="0" dur="2">So, we&amp;#39;ve learned quite a bit so far.</text><text start="2" dur="4">We&amp;#39;ve learned about Markov Decision Processes.</text><text start="6" dur="4">They are fully observable, with a set of states</text><text start="10" dur="4">and corresponding actions that have stochastic effects</text><text start="14" dur="5">characterized by a conditional probability distribution P of S prime</text><text start="19" dur="3">given that we apply action A in state S.</text><text start="22" dur="3">We seek to maximize a reward function</text><text start="25" dur="2">that we define over states.</text><text start="27" dur="3">You can equally define it over state-action pairs.</text><text start="30" dur="3">The objective was to maximize the expected</text><text start="33" dur="3">future cumulative discounted rewards,</text><text start="36" dur="2">as shown by this formula over here.</text><text start="38" dur="4">The key to solving them was called value iteration</text><text start="42" dur="3">where we assigned a value to each state.</text><text start="45" dur="2">There are alternative techniques that assign values</text><text start="47" dur="3">to state-action pairs, often called Q(s, a),</text><text start="50" dur="3">but we haven&amp;#39;t really considered this so far.</text><text start="53" dur="2">We defined a recursive update rule</text><text start="55" dur="3">to update V(s) that was very logical</text><text start="58" dur="2">after we 
understood that we have an action choice,</text><text start="60" dur="3">but nature chooses for us the outcome of the action</text><text start="63" dur="4">via the stochastic transition probability over here.</text><text start="67" dur="3">And then we observed that value iteration converged,</text><text start="70" dur="2">and we were able to define a policy by taking </text><text start="72" dur="4">the argmax under the value iteration expression,</text><text start="76" dur="2">which I don&amp;#39;t spell out over here.</text><text start="78" dur="2">This is a beautiful framework.</text><text start="80" dur="2">It&amp;#39;s really different from planning as before </text><text start="82" dur="4">because of the stochasticity of the action effects.</text><text start="86" dur="3">Rather than making a single sequence of states and actions,</text><text start="89" dur="2">as would be the case in deterministic planning,</text><text start="91" dur="4">now we make an entire field, a so-called policy,</text><text start="95" dur="4">that assigns an action to every possible state.</text><text start="99" dur="3">And we compute this using a technique called value iteration</text><text start="102" dur="4">that spreads value in reverse order through the field of states.</text></transcript></video><video title="33 Partial Observability Introduction" id="y3Rkp3oCzq4" length="44"><transcript><text start="0" dur="4">So far, we talked about the fully observable case,</text><text start="4" dur="2">and I&amp;#39;d like to get back to the more general case</text><text start="6" dur="2">of partial observability.</text><text start="8" dur="3">Now, to warn you, I don&amp;#39;t think it&amp;#39;s worthwhile in this class</text><text start="11" dur="2">to go into full depth about the type of techniques</text><text start="13" dur="4">that are being used for planning under uncertainty</text><text start="17" dur="2">if the world is partially observable.</text><text start="19" 
dur="3">But I&amp;#39;d like to give you a good flavor</text><text start="22" dur="4">of what it really means to plan in information spaces,</text><text start="26" dur="4">which reflect the types of methods that are being brought to bear</text><text start="30" dur="2">in planning under uncertainty.</text><text start="32" dur="4">As in my Stanford class, I don&amp;#39;t go into details here either</text><text start="36" dur="3">because the details are much more the subject of more specialized classes,</text><text start="39" dur="3">but I hope you can enjoy the flavor of the material</text><text start="42" dur="2">that you&amp;#39;re going to get to see in the next couple of minutes.</text></transcript></video><video title="34 POMDP vs MDP" id="ynGnvuh0ELQ" length="46"><transcript><text start="0" dur="4">So we have now learned about fully observable environments,</text><text start="4" dur="4">and planning in stochastic environments with MDPs.</text><text start="8" dur="4">I&amp;#39;d like to say a few words about partially observable environments,</text><text start="12" dur="5">or POMDPs--which I won&amp;#39;t go into in depth; the material is relatively complex. </text><text start="17" dur="4">But I&amp;#39;d like to give you a feeling for why this is important, and what type of problems</text><text start="21" dur="4">you can solve with this, that you could never possibly solve with MDPs. </text><text start="25" dur="7">So, for example, POMDPs address problems of optimal exploration versus exploitation, </text><text start="32" dur="3">where some of the actions might be information-gathering actions;</text><text start="35" dur="4">whereas others might be goal-driven actions. </text><text start="39" dur="4">That&amp;#39;s not really possible in MDPs because the state space is fully observable</text><text start="43" dur="3">and therefore, there is no notion of information gathering.  
</text></transcript></video><video title="35 POMDP" id="bVT7QlYC7JQ" length="345"><transcript><text start="0" dur="5">I&amp;#39;d like to illustrate the problem, using a very simple environment</text><text start="5" dur="2">that looks as follows:</text><text start="7" dur="2">Suppose you live in a world like this;</text><text start="9" dur="2">and your agent starts over here,</text><text start="11" dur="2">and there are 2 possible outcomes. </text><text start="13" dur="3">You can exit the maze over here--</text><text start="16" dur="2">where you get a plus 100--</text><text start="18" dur="2">or you can exit the maze over here, </text><text start="20" dur="2">where you receive a minus 100.</text><text start="22" dur="3">Now, in a fully observable case, </text><text start="25" dur="3">and in a deterministic case, </text><text start="28" dur="4">the optimal plan might look something like this;</text><text start="32" dur="3">and whether it goes straight over here or not depends on the details,  </text><text start="35" dur="3">for example, whether the agent has momentum or not. </text><text start="38" dur="6">But you&amp;#39;ll find a single sequence of actions and states that might cut the corners,</text><text start="44" dur="3">as close as possible, to reach the plus 100 as fast as possible. </text><text start="47" dur="3">That&amp;#39;s conventional planning.</text><text start="50" dur="3">Let&amp;#39;s contrast this with the case we just learned about,</text><text start="53" dur="4">which is the fully observable but stochastic case. </text><text start="57" dur="4">We just learned that the best thing to compute is a policy</text><text start="61" dur="3">that assigns to every possible state an optimal action; </text><text start="64" dur="3">and simply speaking, this might look as follows,</text><text start="67" dur="2">where each of these arrows corresponds </text><text start="69" dur="3">to a sample control policy.  
</text><text start="72" dur="4">And those are defined in parts of the state space that are even far away. </text><text start="76" dur="2">So this would be an example of a control policy </text><text start="78" dur="4">where all the arrows gradually point you over here.  </text><text start="82" dur="3">We just learned about this, using MDPs and value iteration. </text><text start="85" dur="4">The case I really want to get at is the case of partial observability--</text><text start="89" dur="3">which we will eventually solve, using a technique called POMDP.</text><text start="92" dur="5">And in this case, I&amp;#39;m going to keep the location of the agent in the maze observable.</text><text start="97" dur="6">The part I&amp;#39;m going to make unobservable is where, exactly, I receive plus 100</text><text start="103" dur="2">and where I receive minus 100.</text><text start="105" dur="3">Instead, I&amp;#39;m going to put a sign over here </text><text start="108" dur="3">that tells the agent where to expect plus 100,</text><text start="111" dur="2">and where to expect minus 100.</text><text start="113" dur="4">So the optimum policy would be to first move to the sign, </text><text start="117" dur="2">read the sign;</text><text start="119" dur="4">and then return and go to the corresponding exit, </text><text start="123" dur="4">for which the agent now knows where to receive plus 100.</text><text start="127" dur="3">So, for example, if this exit over here gives us plus 100,  </text><text start="130" dur="2">the sign will say Left. </text><text start="132" dur="3">If this exit over here gives us plus 100, the sign will say Right.</text><text start="135" dur="2">What makes this environment interesting is  </text><text start="137" dur="4">that if the agent knew which exit would have plus 100, </text><text start="141" dur="2">it would go north, from a starting position. </text><text start="143" dur="3">It goes south exclusively to gather information. 
</text><text start="146" dur="4">So the question becomes: Can we devise a method for planning</text><text start="150" dur="6">that understands that, even though we&amp;#39;d wish to receive the plus 100 as the best exit, </text><text start="156" dur="4">there&amp;#39;s a detour necessary to gather information.  </text><text start="160" dur="2">So here&amp;#39;s a solution that doesn&amp;#39;t work: </text><text start="162" dur="4">Obviously, the agent might be in 2 different worlds--and it doesn&amp;#39;t know. </text><text start="166" dur="3">It might be in the world where there&amp;#39;s plus 100 on the Left side</text><text start="169" dur="2">or it might be in the world with plus 100 on the Right side,</text><text start="171" dur="2">with minus 100 in the corresponding other exit. </text><text start="173" dur="6">What doesn&amp;#39;t work is you can&amp;#39;t solve the problem for both of these cases</text><text start="179" dur="1">and then put these solutions together--</text><text start="180" dur="2">for example, by averaging. </text><text start="182" dur="2">The reason why this doesn&amp;#39;t work is </text><text start="184" dur="4">this agent, after averaging, would go north. </text><text start="188" dur="3">It would never have the idea that it is worthwhile to go south,</text><text start="191" dur="4">read the sign, and then return to the optimal exit. </text><text start="195" dur="3">When it arrives, finally, at the intersection over here,</text><text start="198" dur="2">it doesn&amp;#39;t really know what to do. </text><text start="200" dur="2">So here is the situation that does work--</text><text start="202" dur="3">and it&amp;#39;s related to information space or belief space.  </text><text start="205" dur="4">In the information space or belief space representation you do planning,</text><text start="209" dur="2">not in the set of physical world states, </text><text start="211" dur="3">but in what you might know about those states. 
</text><text start="214" dur="5">And if you&amp;#39;re really honest, you find out that there&amp;#39;s a multitude of belief states.</text><text start="219" dur="5">Here&amp;#39;s the initial one, where you just don&amp;#39;t know where to receive 100.</text><text start="224" dur="4">Now, if you move around and either reach one of these exits or the sign, </text><text start="228" dur="3">you will suddenly know where to receive 100. </text><text start="231" dur="4">And that makes your belief state change--  </text><text start="235" dur="3">and that makes your belief state change. </text><text start="238" dur="3">So, for example, if you find out that 100 is Left, </text><text start="241" dur="2">then your belief state will look like this--</text><text start="243" dur="2">where the ambiguity is now resolved. </text><text start="245" dur="4">Now, how would you jump from this state space or this state space?</text><text start="249" dur="3">The answer is: when you read the sign, </text><text start="252" dur="4">there&amp;#39;s a 50 percent chance that the location over here </text><text start="256" dur="3">will result in a transition to the location over here--</text><text start="259" dur="4">50 percent because there&amp;#39;s a 50 percent chance that the plus 100 is on the Left.</text><text start="263" dur="5">There&amp;#39;s also a 50 percent chance that the plus 100 is on the Right,</text><text start="268" dur="3">so the transition over here is stochastic;</text><text start="271" dur="4">and with 50 percent chance, it will result in a transition over here. 
</text><text start="275" dur="4">If we now do the MDP trick in this new belief space, </text><text start="279" dur="5">and you pour water in here, it kind of flows through here </text><text start="284" dur="4">and creates all these gradients--as we had before.</text><text start="288" dur="3">We do the same over here and all these gradients that are being created </text><text start="291" dur="2">point to this exit on the Left side.</text><text start="293" dur="5">Then, eventually, this water will flow through here and create gradients like this;</text><text start="298" dur="4">and then flow back through here, where it creates gradients like this. </text><text start="302" dur="4">So the value function is plus 100 over here, plus 100 over here</text><text start="306" dur="2">that gradually decrease down here, down here;</text><text start="308" dur="3">and then gradually further decrease over here-- </text><text start="311" dur="4">and even further decrease over there, so we&amp;#39;ve got arrows like these.</text><text start="315" dur="4">And that shows you that in this new belief space, you can find a solution.</text><text start="320" dur="4">In fact, you can use value iteration--MDP&amp;#39;s value iteration--</text><text start="324" dur="4">in this new space to find a solution to this really complicated </text><text start="328" dur="2">partially observable planning process.</text><text start="330" dur="3">And the solution--just to reiterate--</text><text start="333" dur="2">will suggest: Go south first, </text><text start="335" dur="2">read the sign, </text><text start="337" dur="4">expose yourself to the random transition to the Left or Right world </text><text start="341" dur="4">in which you are now able to reach the plus 100 with absolute confidence. 
</text></transcript></video><video title="36 Planning Under Uncertainty Conclusion" id="PuoTbYxNJnU" length="35"><transcript><text start="0" dur="5">So now we have learned pretty much all there is to know about Planning Under Uncertainty. </text><text start="5" dur="2">We talked about Markov Decision Processes.</text><text start="7" dur="2">We explained the concept of information spaces; </text><text start="9" dur="3">and what&amp;#39;s better, you can actually apply it. </text><text start="12" dur="2">You can apply it to a huge number of problems </text><text start="14" dur="3">where the outcomes of states are uncertain. </text><text start="17" dur="3">There is a huge literature about robot motion planning.</text><text start="20" dur="3">Here are some examples of robots moving through our environments</text><text start="23" dur="3">that use MDP-style planning techniques; </text><text start="26" dur="3">and these methods have become vastly popular in artificial intelligence--</text><text start="29" dur="2">so I&amp;#39;m really glad you now understand the basics </text><text start="31" dur="4">of those and you can apply them yourself.  </text></transcript></video></group><group title="Unit 10" count="26"><video title="01 Introduction.mp4" id="haGrozelOGA" length="35"><transcript><text start="0" dur="2">Hi--welcome back. </text><text start="2" dur="3">You just learned how Markov Decision Processes</text><text start="5" dur="2">can be used to determine an optimal sequence</text><text start="7" dur="4">of actions for an agent in a stochastic environment. </text><text start="11" dur="4">And that is, an agent that knows the correct model of the environment</text><text start="15" dur="3">can navigate, finding its way to the positive </text><text start="18" dur="3">rewards and avoiding the negative penalties.</text><text start="21" dur="4">But it can only do that if it knows where the rewards and penalties are. 
</text><text start="25" dur="3">In this Unit, we&amp;#39;ll see how a technique</text><text start="28" dur="2">called reinforcement learning</text><text start="30" dur="3">can guide the agent to an optimal policy,</text><text start="33" dur="2">even though it doesn&amp;#39;t know anything about the rewards when it starts out. </text></transcript></video><video title="02 Successes.mp4" id="jYxSeyVPOAs" length="?"><transcript><text start="0" dur="3">No subtitles...</text></transcript></video><video title="2 Successes.mp4" id="dqH6tp49uFY" length="190"><transcript><text start="0" dur="3">For example, in the 4 by 3 GridWorld, </text><text start="3" dur="5">what if we don&amp;#39;t know where the plus 1 and minus 1 rewards are when we start out? </text><text start="8" dur="5">A reinforcement learning agent can learn to explore the territory,</text><text start="13" dur="2">find where the rewards are, </text><text start="15" dur="2">and then learn an optimal policy.</text><text start="17" dur="2">Whereas, an MDP solver can only do that </text><text start="19" dur="3">once it knows exactly where the rewards are. </text><text start="22" dur="5">Now, this idea of wandering around and then finding a plus 1 or a minus 1</text><text start="27" dur="5">is analogous to many forms of games, such as backgammon--</text><text start="32" dur="3">and here&amp;#39;s an example: backgammon is a stochastic game;</text><text start="35" dur="3">and at the end, you either win or lose. </text><text start="38" dur="2">And in the 1990s, Gerry Tesauro at IBM</text><text start="40" dur="3">wrote a program to play backgammon. </text><text start="43" dur="6">His first attempt tried to learn the utility of a Game state, U of S, </text><text start="49" dur="4">using examples that were labelled by human expert backgammon players. </text><text start="53" dur="2">But this was tedious work for the experts,</text><text start="55" dur="3">so only a small number of states were labelled. 
</text><text start="58" dur="2">The program tried to generalize from that, </text><text start="60" dur="2">using supervised learning, </text><text start="62" dur="2">and was not able to perform very well. </text><text start="64" dur="7">So Tesauro&amp;#39;s second attempt used no human expertise and no supervision.</text><text start="71" dur="3">Instead, he had 1 copy of his program play against another;</text><text start="74" dur="4">and at the end of the game, the winner got a positive reward,</text><text start="78" dur="2">and the loser, a negative. </text><text start="80" dur="2">So he used reinforcement learning; </text><text start="82" dur="3">he backed up that knowledge throughout the Game states, </text><text start="85" dur="2">and he was able to arrive at a function </text><text start="87" dur="3">that had no input from human expert players,</text><text start="90" dur="2">but, still, was able to perform </text><text start="92" dur="3">at the level of the very best players in the world. </text><text start="95" dur="6">He was able to do this, after learning from examples of about 200,000 games. </text><text start="101" dur="2">Now, that may seem like a lot--</text><text start="103" dur="3">but it really only covers about 1 trillionth </text><text start="106" dur="3">of the total state space of backgammon. </text><text start="109" dur="2">Now, here&amp;#39;s another example: </text><text start="111" dur="3">This is a remote controlled helicopter </text><text start="114" dur="2">that Professor Andrew Ng at Stanford trained, </text><text start="116" dur="2">using reinforcement learning;</text><text start="118" dur="2">and the helicopter--oh--oh, sorry--</text><text start="120" dur="4">I made a mistake--I put this picture upside down</text><text start="124" dur="4">because--really, Ng trained the helicopter </text><text start="128" dur="3">to be able to fly fancy maneuvers--like flying upside down. 
</text><text start="131" dur="4">And he did that by looking at only a few hours </text><text start="135" dur="3">of training data from expert helicopter pilots </text><text start="138" dur="2">who would take over the remote controls, </text><text start="140" dur="3">pilot the helicopter--and those would all be recorded--</text><text start="143" dur="4">and then, you would get rewards for when it did something good,  </text><text start="147" dur="2">or when it did something bad;</text><text start="149" dur="3">and Ng was able to use reinforcement learning </text><text start="152" dur="2">to build an automated helicopter pilot, </text><text start="154" dur="2">just from those training examples. </text><text start="156" dur="3">And that automated pilot, too, can perform tricks </text><text start="159" dur="4">that only a handful of humans are capable of performing. </text><text start="163" dur="6">But enough of this still picture--let&amp;#39;s watch a video of Ng&amp;#39;s helicopters in action. </text><text start="169" dur="3">[Stanford University Autonomous Helicopter]</text><text start="172" dur="13">[sound of helicopter flying] [Chaos]</text><text start="185" dur="5">[Stanford University Autonomous Helicopter]</text></transcript></video><video title="03 Forms of Learning.mp4" id="iHEjWGFuPh4" length="94"><transcript><text start="0" dur="3">Let&amp;#39;s stop and review the 3 main forms of learning. </text><text start="3" dur="2">We have supervised learning, </text><text start="5" dur="2">in which the training set</text><text start="7" dur="3">is a bunch of input/output pairs--</text><text start="10" dur="5">X1,Y1; X2, Y2; et cetera--</text><text start="15" dur="3">in which we try to produce a function: </text><text start="18" dur="3">y equals f of x--</text><text start="21" dur="3">and so the learning is producing this function, f. 
</text><text start="24" dur="3">Then we have unsupervised learning,</text><text start="27" dur="2">in which we&amp;#39;re given just a set of data points--</text><text start="29" dur="4">X1, X2, and so on--</text><text start="33" dur="2">and each of these points, maybe, has many </text><text start="35" dur="2">dimensions, many features. </text><text start="37" dur="2">And what we&amp;#39;re trying to learn is some patterns in that--</text><text start="39" dur="3">some clusters of these data--</text><text start="42" dur="3">or you could just say what we&amp;#39;re trying to learn </text><text start="45" dur="2">is a probability distribution</text><text start="47" dur="2">or what&amp;#39;s the probability that this </text><text start="49" dur="3">random variable will have particular values;</text><text start="52" dur="3">and learn something interesting from that. </text><text start="55" dur="3">In this Unit, we&amp;#39;re introducing the third type of learning--</text><text start="58" dur="3">reinforcement learning--</text><text start="61" dur="4">in which we have a sequence of action and state transitions.</text><text start="65" dur="6">So: state and action, state and action--and so on. </text><text start="71" dur="5">And at some point, we have some rewards associated with these. </text><text start="76" dur="5">So there&amp;#39;s a reward, and maybe not a reward for this state;</text><text start="81" dur="2">and then another reward for this state--</text><text start="83" dur="4">and the rewards are just scalar numbers, positive or negative numbers. 
</text><text start="87" dur="2">What we&amp;#39;re trying to learn here is</text><text start="89" dur="5">an optimal policy: what&amp;#39;s the right thing to do in any of the states?</text></transcript></video><video title="04 Forms of Learning Question.mp4" id="jw86cqeglm8" length="123"><transcript><text start="0" dur="3">Let&amp;#39;s show some examples of machine learning problems</text><text start="3" dur="2">and I want you to tell me, for each one, </text><text start="5" dur="3">whether it&amp;#39;s best addressed with supervised learning, </text><text start="8" dur="2">unsupervised learning, </text><text start="10" dur="2">or reinforcement learning. </text><text start="12" dur="4">And the first example is speech recognition--</text><text start="16" dur="3">where I have examples of voice recordings,</text><text start="19" dur="4">and then the transcripts in written text for each of those recordings;</text><text start="23" dur="3">and from them, I try to learn a model of language.  </text><text start="26" dur="6">Is that supervised, unsupervised or reinforcement? </text><text start="32" dur="5">Next example is analyzing the spectral emissions of stars </text><text start="37" dur="4">and trying to find clusters of stars of similar types</text><text start="41" dur="3">that may be of interest to astronomers.</text><text start="44" dur="5">Would that be supervised, unsupervised or reinforcement?</text><text start="49" dur="2">The data here would just consist of:</text><text start="51" dur="7">for each star, a list of all the different emission frequencies of light coming to earth.</text><text start="58" dur="4">Next example is lever pressing. </text><text start="62" dur="4">So--I have a rat who is trained to press a lever</text><text start="66" dur="2">to get a release of food</text><text start="68" dur="2">when certain conditions are met. </text><text start="70" dur="6">Is that supervised, unsupervised or reinforcement learning? 
</text><text start="76" dur="4">And finally, the problem of an elevator controller.</text><text start="80" dur="2">Say I have a bank of elevators in a building</text><text start="82" dur="3">and they have to have some program--some policy--</text><text start="85" dur="2">to decide which elevator goes up</text><text start="87" dur="2">and which elevator goes down</text><text start="89" dur="2">in response to the percepts,  </text><text start="91" dur="4">which would be the button presses at various floors in the building. </text><text start="95" dur="4">And so, I have a sequence of button presses,  </text><text start="99" dur="5">and I have the wait time that I am trying to minimize--</text><text start="104" dur="4">so after each button press, the elevator moves; </text><text start="108" dur="5">the person waiting is waiting for a certain amount of time, </text><text start="113" dur="2">and then gets picked up,</text><text start="115" dur="4">and the algorithm is given that amount of wait time. </text><text start="119" dur="4">Would that be supervised, unsupervised or reinforcement?</text></transcript></video><video title="05 Forms of Learning Answer.mp4" id="lTkm5LaVDMU" length="78"><transcript><text start="0" dur="2">The answers are that speech recognition</text><text start="2" dur="3">can be handled quite well by supervised learning.</text><text start="5" dur="2">That is, we have input/output pairs;</text><text start="7" dur="2">the input is the speech signal,</text><text start="9" dur="3">and the output is the words that they correspond to.</text><text start="12" dur="3">Analyzing the spectral emissions of stars </text><text start="15" dur="4">is an example of unsupervised clustering</text><text start="19" dur="2">where we&amp;#39;re taking the input data--</text><text start="21" dur="4">we have data for each star, but we don&amp;#39;t have any label associated with it. 
</text><text start="25" dur="2">Rather, we&amp;#39;re trying to make up labels </text><text start="27" dur="2">by clustering them together,</text><text start="29" dur="2">giving them to scientists, </text><text start="31" dur="2">and then letting the scientists see:</text><text start="33" dur="2">Do these clusters make any sense?</text><text start="35" dur="3">Lever pressing is a classic example of reinforcement learning. </text><text start="38" dur="2">In fact, the term &amp;quot;reinforcement learning&amp;quot; </text><text start="40" dur="3">was used for a long time in animal psychology,</text><text start="43" dur="3">before it was used in computer science.</text><text start="46" dur="2">And elevator controllers is another area </text><text start="48" dur="2">that has been investigated, </text><text start="50" dur="2">using reinforcement learning</text><text start="52" dur="2">and, in fact, very good algorithms--</text><text start="54" dur="2">better than the previous state of the art--</text><text start="56" dur="3">have been made, using reinforcement learning techniques. </text><text start="59" dur="5">So the input, again, is a set of state/action transitions;</text><text start="64" dur="3">and then the reinforcement is--</text><text start="67" dur="2">in this case, it&amp;#39;s always a negative number</text><text start="69" dur="2">because there&amp;#39;s always a wait time.</text><text start="71" dur="3">And so that&amp;#39;s the penalty--we&amp;#39;re trying to minimize that penalty,</text><text start="74" dur="4">but all we get is the amount of wait time that we&amp;#39;re trying to minimize. 
</text></transcript></video><video title="06 MDP Review.mp4" id="ngXzkRWosEw" length="107"><transcript><text start="0" dur="3">Now, before we get into the math of reinforcement learning, </text><text start="3" dur="2">let&amp;#39;s review MDPs--</text><text start="5" dur="4">which are, of course, Markov Decision Processes.</text><text start="9" dur="4">An MDP consists of a set of states--</text><text start="13" dur="3">S is an element of the set of states, S;</text><text start="16" dur="2">a set of actions--</text><text start="18" dur="8">A is an element of the actions that are available in each of the states, S. </text><text start="26" dur="2">And we&amp;#39;re going to distinguish a Start state,</text><text start="28" dur="2">which we&amp;#39;ll call S-zero,</text><text start="30" dur="4">and then we need a transition function that says:</text><text start="34" dur="3">How does the world evolve as we take actions in the world? </text><text start="37" dur="2">And we can denote that by </text><text start="39" dur="6">the probability that we get a Result state, S prime--</text><text start="45" dur="4">given that we start in state, S, </text><text start="49" dur="1">and apply action, A. </text><text start="50" dur="2">That&amp;#39;s a probability distribution  </text><text start="52" dur="2">because the world is stochastic.</text><text start="54" dur="2">The same result doesn&amp;#39;t happen every time,</text><text start="56" dur="2">when we do the same action,</text><text start="58" dur="2">so we have this probability distribution. </text><text start="60" dur="3">In some notations, you&amp;#39;ll see:</text><text start="63" dur="5">T of S, A, S prime--for the transition function. 
</text><text start="68" dur="2">And then, in addition to the transition,</text><text start="70" dur="2">we need a reward function--</text><text start="72" dur="2">which we&amp;#39;ll denote R.</text><text start="74" dur="2">Sometimes that&amp;#39;s over the whole triplet--</text><text start="76" dur="3">the reward that you get from starting in one state,</text><text start="79" dur="3">taking an action, and arriving at another state;</text><text start="82" dur="4">sometimes we only need to talk about the result state. </text><text start="86" dur="3">So in the 4 by 3 Grid World, for example, </text><text start="89" dur="2">we don&amp;#39;t care how you got to this state--</text><text start="91" dur="2">it&amp;#39;s just, when you get to one of the states </text><text start="93" dur="2">in the upper right, you get a plus 1 or minus 1 reward. </text><text start="95" dur="3">And similarly, in a  game like backgammon--</text><text start="98" dur="2">when you win or lose, </text><text start="100" dur="2">you get a positive or negative reward. </text><text start="102" dur="2">It doesn&amp;#39;t matter what move you took to win or lose. </text><text start="104" dur="3">And so that&amp;#39;s all there is to MDPs. </text></transcript></video><video title="07 Solving a MDP.mp4" id="k-5e935u9hE" length="103"><transcript><text start="0" dur="2">Now to solve an MDP,</text><text start="2" dur="4">we&amp;#39;re trying to find a policy--pi of S--</text><text start="6" dur="2">that&amp;#39;s going to be our answer. </text><text start="8" dur="2">The pi that we want--the optimal policy--</text><text start="10" dur="3">is the one that&amp;#39;s going to maximize</text><text start="13" dur="2">the discounted, total Reward. 
</text><text start="15" dur="2">So what we mean is:</text><text start="17" dur="3">we want to take the sum over all Times </text><text start="20" dur="3">into the future of the Reward</text><text start="23" dur="2">that you get from starting out </text><text start="25" dur="3">in the state that you&amp;#39;re in, in time T--</text><text start="28" dur="4">and then applying the policy to that state, </text><text start="32" dur="3">and arriving at a new state, at time T plus 1.</text><text start="35" dur="2">And so we want to maximize that sum--</text><text start="37" dur="2">but the sum might be infinite</text><text start="39" dur="2">and so, what we do is </text><text start="41" dur="2">we take this value, Gamma,  </text><text start="43" dur="3">and raise it to the T power, saying</text><text start="46" dur="3">we&amp;#39;re going to count future Rewards less than </text><text start="49" dur="3">current Rewards--and that way, </text><text start="52" dur="3">we&amp;#39;ll make sure that the sum total is bounded. </text><text start="55" dur="3">So we want the policy that maximizes that result. </text><text start="58" dur="2">If we figure out the utility of the state</text><text start="60" dur="3">by solving the Markov Decision Process, </text><text start="63" dur="4">then we have: the utility of any state, S, </text><text start="67" dur="2">is equal to the maximum over all </text><text start="69" dur="3">possible actions that we could take in S</text><text start="72" dur="3">of the expected value of taking that action. </text><text start="75" dur="2">And what&amp;#39;s the expected value? 
</text><text start="77" dur="4">Well, it&amp;#39;s just the sum over all resulting states</text><text start="81" dur="2">of the transition model--</text><text start="83" dur="2">the probability that we get to that state,</text><text start="85" dur="3">given that, from the start state, we take an action</text><text start="88" dur="3">specified by the optimal policy--</text><text start="91" dur="3">times the utility of that resulting state. </text><text start="94" dur="3">So--look at all possible actions;</text><text start="97" dur="2">choose the best one--</text><text start="99" dur="4">according to the expected utility, in terms of probability. </text></transcript></video><video title="08 Agents of Reinforcement Learning.mp4" id="dX8VPYXd3hs" length="139"><transcript><text start="0" dur="3">Now here&amp;#39;s where reinforcement learning comes into play:</text><text start="3" dur="3">What if you don&amp;#39;t know R--the Reward function? </text><text start="6" dur="3">What if you don&amp;#39;t even know P--the transition model of the world? </text><text start="9" dur="3">Then you can&amp;#39;t solve the Markov Decision Process</text><text start="12" dur="2">because you don&amp;#39;t have what you need to solve it. </text><text start="14" dur="2">However, with reinforcement learning, </text><text start="16" dur="3">you can learn R and P by interacting with the world </text><text start="19" dur="3">or you can learn substitutes that will tell you </text><text start="22" dur="4">as much as you need to know, so that you never actually have to compute with R and P.</text><text start="26" dur="4">What you learn, exactly, depends on what you already know and what you want to do. </text><text start="30" dur="2">So we have several choices. 
</text><text start="32" dur="4">One choice is we can build a utility-based agent.</text><text start="36" dur="5">So we&amp;#39;re going to list agent types, based on what we know,</text><text start="41" dur="2">what we want to learn,</text><text start="43" dur="2">and what we then use once we&amp;#39;ve learned. </text><text start="45" dur="2">So for a utility-based agent, </text><text start="47" dur="4">if we already know P, the transition model,</text><text start="51" dur="3">but we don&amp;#39;t know R, the Reward model,</text><text start="54" dur="3">then we can learn R--and use that, </text><text start="57" dur="4">along with P, to learn our utility function;</text><text start="61" dur="3">and then go ahead and use the utility function </text><text start="64" dur="3">just as we did in normal Markov Decision Processes. </text><text start="67" dur="2">So that&amp;#39;s one agent design. </text><text start="69" dur="2">Another design that we&amp;#39;ll see in this Unit</text><text start="71" dur="3">is called a Q-learning agent.</text><text start="74" dur="3">In this one, we don&amp;#39;t have to know P or R;</text><text start="77" dur="5">and we learn a value function, which is usually denoted by Q.</text><text start="82" dur="4">And that&amp;#39;s a type of utility</text><text start="86" dur="2">but, rather than being a utility over states, </text><text start="88" dur="4">it&amp;#39;s a utility of state-action pairs--and that tells us:</text><text start="92" dur="4">For any given state and any given action, </text><text start="96" dur="2">what&amp;#39;s the utility of that result--</text><text start="98" dur="4">without knowing the utilities and rewards, individually? </text><text start="102" dur="3">And then we can just use that Q directly.</text><text start="105" dur="4">So we don&amp;#39;t actually have to ever learn the transition model, P,</text><text start="109" dur="2">with a Q-learning agent. 
</text><text start="111" dur="3">And finally, we can have a reflex agent </text><text start="114" dur="3">where, again, we don&amp;#39;t need to know P and R to begin with;</text><text start="117" dur="5">and we learn directly, the policy, pi of S;</text><text start="122" dur="3">and then we just go ahead and apply pi. </text><text start="125" dur="4">So it&amp;#39;s called a reflex agent because it&amp;#39;s pure stimulus response:</text><text start="129" dur="2">I&amp;#39;m in a certain state, I take a certain action. </text><text start="131" dur="4">I don&amp;#39;t have to think about modeling the world, in terms of: </text><text start="135" dur="2">What are the transitions--where am I going to go next? </text><text start="137" dur="2">I just go ahead and take that action. </text></transcript></video><video title="09 Passive vs Active.mp4" id="BOJUzeVswYA" length="105"><transcript><text start="0" dur="3">Now, the next choice we have in agent design </text><text start="3" dur="3">revolves around how adventurous he wants to be. </text><text start="6" dur="5">One possibility is what&amp;#39;s called the passive reinforcement learning agent--</text><text start="11" dur="3">and that can be any of these agent designs, </text><text start="14" dur="2">but what passive means is that the agent </text><text start="16" dur="3">has a fixed policy and executes that policy. </text><text start="19" dur="3">But it learns about the reward function, R, </text><text start="22" dur="3">and maybe the transition function, P, </text><text start="25" dur="2">if it didn&amp;#39;t already know that.</text><text start="27" dur="3">It learns that while executing the fixed policy.</text><text start="30" dur="2">So let me give you an example.</text><text start="32" dur="3">Imagine that you&amp;#39;re on a ship in uncharted waters </text><text start="35" dur="3">and the captain has a policy for piloting the ship. 
</text><text start="38" dur="3">You can&amp;#39;t change the captain&amp;#39;s policy. </text><text start="41" dur="3">He or she is going to execute that, no matter what. </text><text start="44" dur="3">But it&amp;#39;s your job to learn all you can about the uncharted waters. </text><text start="47" dur="3">In other words, learn the reward function, </text><text start="50" dur="3">given the actions and the state transitions</text><text start="53" dur="2">that the ship is going through. </text><text start="55" dur="2">You learn, and remember what you&amp;#39;ve learned,</text><text start="57" dur="2">but that doesn&amp;#39;t change the captain&amp;#39;s policy--</text><text start="59" dur="2">and that&amp;#39;s passive learning. </text><text start="61" dur="3">Now, the alternative is called</text><text start="64" dur="2">active reinforcement learning--</text><text start="66" dur="3">and that&amp;#39;s where we change the policy as we go. </text><text start="69" dur="3">So let&amp;#39;s say, eventually, you&amp;#39;ve done such a great job</text><text start="72" dur="2">of learning about the uncharted water</text><text start="74" dur="2">that the captain says to you,</text><text start="76" dur="3">&amp;quot;Okay--I&amp;#39;m going to hand over control </text><text start="79" dur="2">and as you learn, I&amp;#39;m going to allow you</text><text start="81" dur="2">to change the policy for this ship. 
</text><text start="83" dur="3">You can make decisions of where we&amp;#39;re going to go next.&amp;quot;</text><text start="86" dur="2">And that&amp;#39;s good, because you can start to  </text><text start="88" dur="2">cash in early on your learning </text><text start="90" dur="2">and it&amp;#39;s also good because it gives you  </text><text start="92" dur="3">a possibility to explore.</text><text start="95" dur="3">Rather than just say: What&amp;#39;s the best action I can do right now?--</text><text start="98" dur="4">you can say: What&amp;#39;s the action that might allow me to learn something--</text><text start="102" dur="3">to allow me to do better in the future? </text></transcript></video><video title="10 Passive Temporal Difference Learning.mp4" id="DZzffdHNqtQ" length="373"><transcript><text start="0" dur="3">Let&amp;#39;s start by looking at passive reinforcement learning. </text><text start="3" dur="2">I&amp;#39;m going to describe an algorithm called </text><text start="5" dur="2">Temporal Difference Learning--or TD. </text><text start="7" dur="2">And what that means--sounds like a fancy name, </text><text start="9" dur="2">but all it really means is we&amp;#39;re going to move </text><text start="11" dur="2">from one state to the next;</text><text start="13" dur="3">and we&amp;#39;re going to look at the difference between the 2 states,</text><text start="16" dur="3">and learn that--and then kind of back up</text><text start="19" dur="3">the values, from one state to the next. </text><text start="22" dur="5">So if we&amp;#39;re going to follow a fixed policy, pi,</text><text start="27" dur="4">and let&amp;#39;s say our policy tells us to go this way, and then go this way. 
</text><text start="31" dur="4">We&amp;#39;ll eventually learn that we get a plus 1 reward there</text><text start="35" dur="3">and we&amp;#39;ll start feeding back that plus 1, saying:</text><text start="38" dur="2">if it was good to get a plus 1 here, </text><text start="40" dur="2">it must be somewhat good to be in this state,</text><text start="42" dur="4">somewhat good to be in this state--and so on, back to the start state. </text><text start="46" dur="2">So, in order to run this algorithm, </text><text start="48" dur="5">we&amp;#39;re going to try to build up a table of utilities for each state </text><text start="53" dur="3">and along the way, we&amp;#39;re going to keep track of </text><text start="56" dur="3">the number of times that we visited each state. </text><text start="59" dur="2">Now, the table of utilities, we&amp;#39;re going to start blank--</text><text start="61" dur="2">we&amp;#39;re not going to start them at zero or anything else;</text><text start="63" dur="2">they&amp;#39;re just going to be undefined. </text><text start="65" dur="2">And the table of numbers, we&amp;#39;re going to start at zero, </text><text start="67" dur="4">saying we visited each state a total of zero times. </text><text start="71" dur="3">What we&amp;#39;re going to do is run the policy,</text><text start="74" dur="2">have a trial that goes through the state; </text><text start="76" dur="2">when it gets to a terminal state, </text><text start="78" dur="3">we start it over again at the start and run it again;</text><text start="81" dur="3">and we keep track of how many times we visited each state, </text><text start="84" dur="2">we update the utilities, and we get a better </text><text start="86" dur="2">and better estimate for the utility.</text><text start="88" dur="2">And this is what the inner loop of the algorithm looks like--</text><text start="90" dur="2">and let&amp;#39;s see if we can trace it out. 
</text><text start="92" dur="2">So we&amp;#39;ll start at a start state,</text><text start="94" dur="5">we&amp;#39;ll apply the policy--and let&amp;#39;s say the policy tells us to move in this direction.</text><text start="99" dur="3">Then we get a reward here, </text><text start="102" dur="2">which is zero; </text><text start="104" dur="2">and then we look at it with the algorithm, </text><text start="106" dur="2">and the algorithm tells us if the state </text><text start="108" dur="3">is new--yes, it is; we&amp;#39;ve never been there before--</text><text start="111" dur="5">then set the utility of that state to the new reward, which is zero. </text><text start="116" dur="2">Okay--so now we have a zero here;</text><text start="118" dur="4">and then let&amp;#39;s say, the next step, we move up here. </text><text start="122" dur="2">So, again, we have a zero;</text><text start="124" dur="3">and let&amp;#39;s say our policy looks like a good one,</text><text start="127" dur="3">so we get: here, we have a zero.</text><text start="130" dur="2">We get: here, we have a zero. </text><text start="132" dur="4">We get: here--now, this state,</text><text start="136" dur="4">we get a reward of 1, so that state gets a utility of 1.</text><text start="140" dur="3">And all along the way, we have to think about </text><text start="143" dur="3">how we&amp;#39;re backing up these values, as well. </text><text start="146" dur="5">So when we get here, we have to look at this formula to say:</text><text start="151" dur="4">How are we going to update the utility of the prior state?</text><text start="155" dur="3">And the difference between this state and this state is zero. </text><text start="158" dur="5">so this difference, here, is going to be zero--</text><text start="163" dur="3">the reward is zero, and so there&amp;#39;s going to be no update to this state. 
</text><text start="166" dur="4">But now, finally--for the first time--we&amp;#39;re going to have an actual update.</text><text start="170" dur="4">So we&amp;#39;re going to update this state to be plus 1, </text><text start="174" dur="3">and now we&amp;#39;re going to think about changing this state. </text><text start="177" dur="3">And what was its old utility?--well, it was zero.</text><text start="180" dur="3">And then there&amp;#39;s a factor called Alpha,</text><text start="183" dur="2">which is the learning rate </text><text start="185" dur="3">that tells us how much we want to move this utility</text><text start="188" dur="3">towards something that&amp;#39;s maybe a better estimate. </text><text start="191" dur="3">And the learning rate should be such that, </text><text start="194" dur="2">if we are brand new, </text><text start="196" dur="2">we want to move a big step;</text><text start="198" dur="2">and if we&amp;#39;ve seen this state a lot of times, </text><text start="200" dur="2">we&amp;#39;re pretty confident of our number</text><text start="202" dur="2">and we want to make a small step.</text><text start="204" dur="5">So let&amp;#39;s say that the Alpha function is 1 over N plus 1. </text><text start="209" dur="2">Well, we&amp;#39;d better not make it 1 over N, when N is zero.</text><text start="211" dur="4">So 1 over N plus 1 would be &#xBD;;</text><text start="215" dur="4">and then the reward in this state was zero;</text><text start="219" dur="2">plus, we had a Gamma--</text><text start="221" dur="3">and let&amp;#39;s just say that Gamma is 1, </text><text start="224" dur="2">so there&amp;#39;s no discounting; and then  </text><text start="226" dur="3">we look at the difference between the utility </text><text start="229" dur="3">of the resulting state--which is 1--</text><text start="232" dur="5">minus the utility of this state, which was zero. 
</text><text start="237" dur="4">So we get &#xBD; times 1 minus zero--which is &#xBD;.</text><text start="241" dur="2">So we update this;</text><text start="243" dur="3">and we change this zero to &#xBD;.</text><text start="246" dur="4">Now let&amp;#39;s say we start all over again</text><text start="250" dur="2">and let&amp;#39;s say our policy is right on track; </text><text start="252" dur="4">and nothing unusual, stochastically, has happened.</text><text start="256" dur="3">So we follow the same path, </text><text start="259" dur="4">we don&amp;#39;t update--because they&amp;#39;re all zeros all along this path.</text><text start="263" dur="3">We go here, here, here; </text><text start="266" dur="2">and now it&amp;#39;s time for an update. </text><text start="268" dur="5">So now, we&amp;#39;ve transitioned from a zero to &#xBD;--</text><text start="273" dur="2">so how are we going to update this state?</text><text start="275" dur="2">Well, the old utility was zero</text><text start="277" dur="4">and now we have a 1 over N plus 1--</text><text start="281" dur="3">so let&amp;#39;s say 1/3.</text><text start="284" dur="2">So we&amp;#39;re getting a little bit more confident--because we&amp;#39;ve been there</text><text start="286" dur="2">twice, rather than just once. </text><text start="288" dur="3">The reward in this state was zero,</text><text start="291" dur="3">and then we have to look at the difference between these 2 states. </text><text start="294" dur="3">That&amp;#39;s where we get the name, Temporal Difference;</text><text start="297" dur="4">and so, we have &#xBD; minus zero--</text><text start="301" dur="2">and so that&amp;#39;s 1/3 times &#xBD;--</text><text start="303" dur="2">so that&amp;#39;s 1/6.</text><text start="305" dur="2">Now we update this state. 
</text><text start="307" dur="4">It was zero; now it becomes 1/6.</text><text start="311" dur="2">And you can see how the results </text><text start="313" dur="3">of the positive 1 starts to propagate </text><text start="316" dur="2">backwards--but it propagates slowly. </text><text start="318" dur="2">We have to have 1 trial at a time </text><text start="320" dur="2">to get that to propagate backwards. </text><text start="322" dur="3">Now, how about the update from this state to this state?</text><text start="325" dur="6">Now, we were &#xBD; here--so our old utility was &#xBD;;</text><text start="331" dur="4">plus Alpha--the learning rate--is 1/3.</text><text start="335" dur="4">The reward in the old state was zero; </text><text start="339" dur="3">plus the difference between these two,</text><text start="342" dur="3">which is 1 minus &#xBD;.</text><text start="345" dur="4">So that&amp;#39;s &#xBD; plus 1/6 is 2/3.</text><text start="349" dur="2">And now the second time through, </text><text start="351" dur="6">we&amp;#39;ve updated the utility of this state from 1/2 to 2/3.</text><text start="357" dur="5">And we keep on going--and you can see the results of the positive, propagating backwards.</text><text start="362" dur="2">And if we did more examples through here, </text><text start="364" dur="4">you would see the results of the negative propagating backwards. </text><text start="368" dur="5">And eventually, it converges to the correct utilities for this policy. </text></transcript></video><video title="11 Passive Agent Results.mp4" id="tdtAZFbvDPc" length="63"><transcript><text start="0" dur="5">Now here are some results from running the passive TD algorithm on the 4 by 3 maze. </text><text start="5" dur="3">On the right, we see a graph of the average </text><text start="8" dur="3">error in the utility function--average across all the states. 
</text><text start="11" dur="4">So it starts off--for the first 5 or so trials,</text><text start="15" dur="3">the error rate is very high--it&amp;#39;s off the charts. </text><text start="18" dur="5">But then it starts to settle down, through 10, 20, 40; </text><text start="23" dur="3">and up to about 60 or so, it&amp;#39;s still improving;</text><text start="26" dur="3">and then it gets to a final steady state</text><text start="29" dur="7">after about 60 trials of about .05 in the average error in utility. </text><text start="36" dur="4">So that&amp;#39;s not too bad, but not really converging all the way down to no rate of error. </text><text start="40" dur="3">And on the left, you see the utility estimates</text><text start="43" dur="2">for various different states; </text><text start="45" dur="4">and, as we see--as we get out to 500 trials, </text><text start="49" dur="2">they&amp;#39;re starting to converge a little bit,</text><text start="51" dur="3">close to their true values. </text><text start="54" dur="2">But we see in the first 100 or so trials-- </text><text start="56" dur="3">they were all over the map, and so it wasn&amp;#39;t doing very well.</text><text start="59" dur="4">It took awhile for it to converge to something close to the true values. </text></transcript></video><video title="12 Weaknesses Question.mp4" id="rbaxXB_Cd9w" length="71"><transcript><text start="0" dur="2">Now I want to do a little quiz,</text><text start="2" dur="2">and ask you: True or False,</text><text start="4" dur="3">which of the following are possible </text><text start="7" dur="2">weaknesses in this TD learning </text><text start="9" dur="3">with a passive approach to reinforcement learning?</text><text start="12" dur="3">One: Is it possible that we would have </text><text start="15" dur="2">a long convergence time--</text><text start="17" dur="4">that it might take a long time to converge to the correct utility values?  
</text><text start="21" dur="5">Secondly, are we limited by the policy that we choose?</text><text start="26" dur="2">So remember: in passive reinforcement learning, </text><text start="28" dur="2">we choose a fixed policy</text><text start="30" dur="2">and execute that policy; </text><text start="32" dur="4">and any deviation from the policy </text><text start="36" dur="2">results from the stochasticity.  </text><text start="38" dur="2">We may visit different squares</text><text start="40" dur="2">because the environment is stochastic,</text><text start="42" dur="2">but not because we made different choices. </text><text start="44" dur="3">So there&amp;#39;s that limitation. </text><text start="47" dur="3">Third, can there be a problem with missing states? </text><text start="50" dur="3">That is, could there be some states that have </text><text start="53" dur="3">a zero count--that we never visited, </text><text start="56" dur="2">and never got a utility estimate?</text><text start="58" dur="5">And fourth, could there be a problem with a poor estimate for certain states?</text><text start="63" dur="5">So could it be that, even though a state didn&amp;#39;t have a count of zero, </text><text start="68" dur="3">it had a low count, and we weren&amp;#39;t able to get a good utility estimate for that state? </text></transcript></video><video title="13 Weaknesses Answers.mp4" id="aWcC5wVEkcs" length="62"><transcript><text start="0" dur="3">An answer is that every one of these </text><text start="3" dur="3">is a potential problem for passive reinforcement learning. </text><text start="6" dur="3">So every problem won&amp;#39;t show up in every possible domain. </text><text start="9" dur="3">It&amp;#39;ll depend on what the environment looks like. </text><text start="12" dur="4">But it is a possibility that you could get bitten by any of these problems. 
</text><text start="16" dur="3">And they all stem from the same cause, </text><text start="19" dur="2">from the fact that passive learning </text><text start="21" dur="4">stubbornly sticks to the same policy throughout. </text><text start="25" dur="2">We have a policy, pi of S, </text><text start="27" dur="2">and we always execute that policy. </text><text start="29" dur="5">So if the policy here was to go up and then go right, </text><text start="34" dur="2">then we would always stick to that;</text><text start="36" dur="5">and the only time we would explore any other state is when those actions failed. </text><text start="41" dur="2">If we tried to go up from this state--</text><text start="43" dur="2">because that&amp;#39;s what the policy said;</text><text start="45" dur="3">but, stochastically, we slipped over to this state--</text><text start="48" dur="3">then we would do something else, according to the policy,</text><text start="51" dur="2">and so we&amp;#39;d get a little bit of exploration, </text><text start="53" dur="3">but we&amp;#39;d only vary from the chosen path </text><text start="56" dur="2">because of that variation</text><text start="58" dur="4">and we wouldn&amp;#39;t intentionally explore enough of the space.  </text></transcript></video><video title="14 Active Reinforcement Learning.mp4" id="7yvclGcujGY" length="55"><transcript><text start="0" dur="3">So let&amp;#39;s move on to Active Reinforcement Learning</text><text start="3" dur="3">and, in particular, let&amp;#39;s examine a simple</text><text start="6" dur="4">approach called a Greedy Reinforcement Learner. 
</text><text start="10" dur="3">And the way that works is it uses the same </text><text start="13" dur="3">passive TD learning algorithm that we talked about,</text><text start="16" dur="4">but, after each time we update the utilities</text><text start="20" dur="3">or maybe after a couple of updates--you can decide how often you want to do it--</text><text start="23" dur="2">after the change to the utilities, </text><text start="25" dur="3">we recompute the new optimal policy, pi.</text><text start="28" dur="4">So we throw away our old pi, pi1,</text><text start="32" dur="3">and replace it with a new pi, pi2--</text><text start="35" dur="6">which is a result of solving the MDP described by our new estimates of the utilities. </text><text start="41" dur="2">Now we have a new policy, </text><text start="43" dur="2">and we continue learning with that new policy. </text><text start="45" dur="4">And so, if the initial policy was flawed, </text><text start="49" dur="3">the Greedy algorithm would tend to move away from the initial policy, </text><text start="52" dur="3">towards a better policy--and we can show how well that works. </text></transcript></video><video title="15 Greedy Agent Results.mp4" id="FvXWpsvu7MM" length="85"><transcript><text start="0" dur="4">Here&amp;#39;s the result of running the Greedy agent over 500 trials. </text><text start="4" dur="2">And I&amp;#39;ve graphed 2 things here: </text><text start="6" dur="3">One is the error; and you see, over the top--  </text><text start="9" dur="3">over the first 40 or so trials--</text><text start="12" dur="2">the error was very high--way up here. </text><text start="14" dur="2">But then, suddenly, it jumped down </text><text start="16" dur="6">to a lower level, and stayed along that level all the way through to 500. </text><text start="22" dur="3">I&amp;#39;ve also graphed, with a dotted line, the policy loss. 
</text><text start="25" dur="3">What does that mean?--so that&amp;#39;s the difference </text><text start="28" dur="4">between the policy that the agent has learned and the optimal policy. </text><text start="32" dur="5">So if it had learned the optimal policy, the policy loss would be zero, down here. </text><text start="37" dur="2">It doesn&amp;#39;t quite get to zero.</text><text start="39" dur="2">It was high, up here,  </text><text start="41" dur="4">and then at around step 40, it learned something important. </text><text start="45" dur="4">What did it learn?--well, here&amp;#39;s the final policy that it came up with. </text><text start="49" dur="4">Maybe it started originally going in this direction and hitting the minus 1;</text><text start="53" dur="4"> and then it flipped and learned a new policy that went in a better direction. </text><text start="57" dur="2">But it still hasn&amp;#39;t learned the optimal policy.</text><text start="59" dur="4">And we can see--for example, this looks like a mistake here. </text><text start="63" dur="5">In state 1-2, its policy is moving down </text><text start="68" dur="4">and then following this path, which it learned, towards the goal.</text><text start="72" dur="5">But really, a better route would be to take the northern route, and go through this path. </text><text start="77" dur="2">But it hasn&amp;#39;t learned that. </text><text start="79" dur="2">Because it was Greedy, it found something </text><text start="81" dur="4">that seemed to be doing good for it, and then it never deviated from that. </text></transcript></video><video title="16 Balancing Policy.mp4" id="PzDgvih6JDA" length="103"><transcript><text start="0" dur="4">So the question, then, is: How do we get this learner out of its rut?</text><text start="4" dur="3">It improved its policy for a while, </text><text start="7" dur="2">but then it got stuck in this policy </text><text start="9" dur="4">where we go here, go up and then go right. 
</text><text start="13" dur="3">Most of the time, that&amp;#39;s a perfectly good policy. </text><text start="16" dur="5">But if a stochastic error makes us slip into the minus 1, then it hurts us. </text><text start="21" dur="4">We&amp;#39;d like to be able to say we&amp;#39;re going to stop doing that </text><text start="25" dur="3">and somehow find this route. </text><text start="28" dur="2">But in order to find that new route, </text><text start="30" dur="2">we&amp;#39;d have to spend some time executing a policy </text><text start="32" dur="3"> which was not the best policy known to us. </text><text start="35" dur="3">In other words, we&amp;#39;d have to stop exploiting </text><text start="38" dur="4">the best policy we&amp;#39;d found so far--which is this one--</text><text start="42" dur="4">and start exploring, to see if maybe there&amp;#39;s a better policy. </text><text start="46" dur="2">And exploring could lead us astray</text><text start="48" dur="3">and cause us to waste a lot of time.</text><text start="51" dur="2">So we have to figure out: what&amp;#39;s the right trade-off?</text><text start="53" dur="4">When is it worth exploring to try to find something better for the long term--</text><text start="57" dur="5">even though we know that exploring is going to hurt us in the short term?</text><text start="62" dur="4">Now, one possibility is, certainly, random exploration. </text><text start="66" dur="3">That is, we can follow our best policy</text><text start="69" dur="2">some percentage of the time,</text><text start="71" dur="3">and then randomly, at some point,</text><text start="74" dur="3">we can decide to take an action which is not the optimal action. </text><text start="77" dur="3">So we&amp;#39;re here, the optimal action would be to go east;</text><text start="80" dur="3">and we say, &amp;quot;Well, this time we&amp;#39;re going to choose something else--</text><text start="83" dur="2">let&amp;#39;s try going north.&amp;quot; 
</text><text start="85" dur="2">And then we explore from there</text><text start="87" dur="2">and see if we&amp;#39;ve learned something. </text><text start="89" dur="2">So that policy does, in fact, work--</text><text start="91" dur="6">randomly making moves with some probability--but it tends to be slow to converge.</text><text start="97" dur="2">In order to get something better, we have to really understand </text><text start="99" dur="4">what&amp;#39;s going on with our exploration, versus exploitation. </text></transcript></video><video title="17 Errors in Utility Questions.mp4" id="Bs2nrLB8LX0" length="99"><transcript><text start="0" dur="3">So let&amp;#39;s really think about what we&amp;#39;re doing when we&amp;#39;re executing </text><text start="3" dur="2">the active TD learning algorithm. </text><text start="5" dur="4">First, we&amp;#39;re keeping track of the optimal policy we&amp;#39;ve found so far;</text><text start="9" dur="2">and that gets updated as we go, </text><text start="11" dur="2">and replaced with new policies. </text><text start="13" dur="4">Secondly, we&amp;#39;re keeping track of the utilities of states--</text><text start="17" dur="3">and those, too, get updated as we go along. </text><text start="20" dur="3">And third, we&amp;#39;re keeping track of the number</text><text start="23" dur="3">of times that we visited each state.</text><text start="26" dur="2">And that gets incremented on each trial. </text><text start="28" dur="2">Now, what could happen? What could go wrong? </text><text start="30" dur="2">There are really 2 reasons </text><text start="32" dur="4">why our utility estimates could be off. </text><text start="36" dur="3">First, we haven&amp;#39;t sampled enough. 
</text><text start="39" dur="3">The N values are too low for that state</text><text start="42" dur="2">and the utilities that we got were just some </text><text start="44" dur="2">random fluctuations and weren&amp;#39;t  </text><text start="46" dur="2">a very good, true estimate. </text><text start="48" dur="3">And secondly, we could get a bad utility </text><text start="51" dur="2">because our policy was off. </text><text start="53" dur="4">The policy was telling us to do something that wasn&amp;#39;t really the best thing,</text><text start="57" dur="3">and so the utility wasn&amp;#39;t as high as it could be. </text><text start="60" dur="2">So let&amp;#39;s do a little quiz. </text><text start="62" dur="4">I want you to tell me, for the 2 sources of possible error--</text><text start="66" dur="3">too little sampling and wrong policy-- </text><text start="69" dur="4">I want you to tell me, is it True or False--each of these statements:</text><text start="73" dur="6">One: Could the error--either the sampling error or the policy error-- </text><text start="79" dur="4">could that make the utility estimates too low? </text><text start="83" dur="6">And secondly, could it make utility too high? </text><text start="89" dur="10">And third, could it be improved with higher N values--that is, more trials? </text></transcript></video><video title="18 Errors in Utility Answers.mp4" id="AlcPU5ZoPwI" length="66"><transcript><text start="0" dur="6">And here are the answers: For the error introduced by a lack of enough sampling, </text><text start="6" dur="2">all these problems are true. </text><text start="8" dur="2">If you don&amp;#39;t have enough samples, </text><text start="10" dur="3">it might make the utility too high; it might make the utility too low--</text><text start="13" dur="3">and it could certainly be improved by taking more trials. 
</text><text start="16" dur="4">But with the differences due to having not quite the right policy, </text><text start="20" dur="2">the answers aren&amp;#39;t the same. </text><text start="22" dur="2">So yes, if you don&amp;#39;t have the right policy, </text><text start="24" dur="4">that could make the utilities too low--if you&amp;#39;re doing something silly, </text><text start="28" dur="3">like starting in this state and the policy says, </text><text start="31" dur="3">&amp;quot;Drive straight into the minus 1&amp;quot;</text><text start="34" dur="3">that could make the utility of this state lower than it really should be. </text><text start="37" dur="3">But it can&amp;#39;t make the utility too high. </text><text start="40" dur="3">So we really have a bound on the utility here.</text><text start="43" dur="4">The bound is: what does the optimal policy do? </text><text start="47" dur="2">And no matter what policy we have, </text><text start="49" dur="2">it&amp;#39;s not going to be better than the optimal policy; </text><text start="51" dur="3">and so we can only be making things worse</text><text start="54" dur="2">with our policy, not making them better. </text><text start="56" dur="4">And finally, having more N won&amp;#39;t necessarily improve things.  </text><text start="60" dur="6">It will decrease the variance, but it won&amp;#39;t decrease or improve the mean. </text></transcript></video><video title="19 Exploration Agents.mp4" id="Mi2HFAn6SgE" length="73"><transcript><text start="0" dur="4">Now what that suggests is the design for an exploration agent  </text><text start="4" dur="5">that will be more proactive about exploring the world when it&amp;#39;s uncertain,</text><text start="9" dur="6">and will fall back to exploiting the optimal policy--or whatever policy it has as close to optimal--</text><text start="15" dur="2">when it becomes more certain about the world. 
</text><text start="17" dur="2">And what we can do is go through this </text><text start="19" dur="2">normal cycle of TD learning--</text><text start="21" dur="2">like we always did.</text><text start="23" dur="2">But when we&amp;#39;re looking for the estimate </text><text start="25" dur="2">of the utility of the state, </text><text start="27" dur="2">what we can do is say:</text><text start="29" dur="4">The utility of the state estimate will be </text><text start="33" dur="3">some large value, plus R--</text><text start="36" dur="4">say, plus 1--in the case of this example--</text><text start="40" dur="3">the largest reward we can expect to get.</text><text start="43" dur="5">In every case, when the number of visits to the state</text><text start="48" dur="4">is less than some threshold, E, the exploration threshold.</text><text start="52" dur="3">And when we&amp;#39;ve visited a state E times, </text><text start="55" dur="3">then we revert to the learned probabilities</text><text start="58" dur="3">or the learned utilities, rather. </text><text start="61" dur="4">So when we start out, we&amp;#39;re going to explore from new states;</text><text start="65" dur="4">and once we have a good estimate of what the true utility of the state actually is, </text><text start="69" dur="4">then we stop exploring and we go with those utilities. </text></transcript></video><video title="20 Exploration Agent Results.mp4" id="Xr7QzEYG3N0" length="52"><transcript><text start="0" dur="4">And here we have the result of some simulations of the exploratory agent.</text><text start="4" dur="5">We see it&amp;#39;s doing much better than the passive agent or than the Greedy agent. </text><text start="9" dur="4">So I&amp;#39;m graphing here; and we only had to go through 100 trials. </text><text start="13" dur="3">We didn&amp;#39;t have to go through 500--so it&amp;#39;s converging much faster. </text><text start="16" dur="3">And it&amp;#39;s converging to much better results. 
</text><text start="19" dur="3">So the policy loss, in the dotted line, </text><text start="22" dur="3">started off high; but after only 20 trials, </text><text start="25" dur="2">it&amp;#39;s come down to perfect.</text><text start="27" dur="4">So it learned the exact, correct policy after 20 trials. </text><text start="31" dur="4">The error in the utilities--so you can have the perfect policy, </text><text start="35" dur="4">while not quite having the right utilities for each state--</text><text start="39" dur="3">and the error in utility comes down, </text><text start="42" dur="5">and that, too, comes down to a level that&amp;#39;s lower than the previous agent&amp;#39;s--</text><text start="47" dur="2">but still, not quite perfect.</text><text start="49" dur="3">And we see here that it, in fact, learns the correct policy. </text></transcript></video><video title="21 Q Learning 1.mp4" id="6OIHXZthbhs" length="84"><transcript><text start="0" dur="3">Now, let&amp;#39;s say we&amp;#39;ve done all this learning, </text><text start="3" dur="3">we&amp;#39;ve applied our agent, and we&amp;#39;ve come up with a utility model; </text><text start="6" dur="4">and we have the estimates for the utility for every state. </text><text start="10" dur="3">Now what do we do when we want to act in the world?</text><text start="13" dur="4">Well, we now have our policy for the state, </text><text start="17" dur="3">which is determined by the expected value </text><text start="20" dur="2">and we compute the expected value </text><text start="22" dur="3">of each state by looking at the utility, </text><text start="25" dur="2">which we just learned. </text><text start="27" dur="4">But then, we have to multiply by the transition probabilities.</text><text start="31" dur="6">What&amp;#39;s the probability of each resulting state that we have to look up the utility of? 
</text><text start="37" dur="2">And so, we need to know that--</text><text start="39" dur="5">and in some cases, we&amp;#39;re given the transition model, and so we know all these probabilities.  </text><text start="44" dur="2">But in other cases, we don&amp;#39;t have it;</text><text start="46" dur="2">and so if we haven&amp;#39;t learned it, we can&amp;#39;t apply </text><text start="48" dur="3">our policy, even though we know the utilities. </text><text start="51" dur="3">I want to talk, briefly, about this alternative method</text><text start="54" dur="3">called Q Learning, that I mentioned before. </text><text start="57" dur="4">Where in Q Learning, we don&amp;#39;t learn U directly,</text><text start="61" dur="2">and we don&amp;#39;t need the transition model.</text><text start="63" dur="5">Instead, what we learned is a direct mapping, </text><text start="68" dur="3">Q, from states and actions</text><text start="71" dur="4">to utilities and so then, once we&amp;#39;ve learned Q, </text><text start="75" dur="3">we can determine the optimal policy of the state, </text><text start="78" dur="6">just by taking the maximum over all possible actions of this Q of S, A values. </text></transcript></video><video title="22 Q Learning 2.mp4" id="4vV4SNEdLt8" length="90"><transcript><text start="0" dur="2">Now, how do we do Q Learning? </text><text start="2" dur="3">Well, we start off with this table of Q values--</text><text start="5" dur="3">and notice that there&amp;#39;s more entries in this</text><text start="8" dur="2">table than there were in the utility table.</text><text start="10" dur="4">So for each state, I&amp;#39;ve divided it up </text><text start="14" dur="6">into different actions--so here&amp;#39;s the action of going north, south, east or west</text><text start="20" dur="2">from this particular state. </text><text start="22" dur="2">They all start out with utility--</text><text start="24" dur="2">or rather Q utility, at zero. 
</text><text start="26" dur="3">But as we go, we start to update,</text><text start="29" dur="2">and we have an update formula that&amp;#39;s very </text><text start="31" dur="3">similar to the formula for TD learning.</text><text start="34" dur="3">It has the same learning rate, Alpha,</text><text start="37" dur="2">and the same discount factor, Gamma;</text><text start="39" dur="2">and we just start applying that.</text><text start="41" dur="4">So we start tracking through the state space, </text><text start="45" dur="3">and when we get a transition--say we go </text><text start="48" dur="2">east from here, </text><text start="50" dur="5">and then east and then north and then north; </text><text start="55" dur="2">and then east-- </text><text start="57" dur="2">and then we would back up this value;</text><text start="59" dur="4">and depending on what the values of Alpha and Gamma were, </text><text start="63" dur="4">we might update this to .6 or something;</text><text start="67" dur="2">and then the next time through, </text><text start="69" dur="7">we might update that to .7, and update this one to .4, and so on. </text><text start="76" dur="2">In each case, we&amp;#39;d be updating </text><text start="78" dur="3">only the action we took,</text><text start="81" dur="3">associated with that state, not the whole state. </text><text start="84" dur="6">We&amp;#39;d keep repeating that process until we had values filled in for all the action state pairs. </text></transcript></video><video title="23 Pacman 1.mp4" id="dtp2ZDIjMsI" length="108"><transcript><text start="0" dur="5">Now, in some sense, you&amp;#39;ve learned all you need to know about reinforcement learning. </text><text start="5" dur="4">Yes, it&amp;#39;s a huge field, and there&amp;#39;s a lot of other details that we haven&amp;#39;t covered</text><text start="9" dur="2">but you&amp;#39;ve seen all the basics.  </text><text start="11" dur="2">The theory is there and it works. 
</text><text start="13" dur="3">But in another sense, we haven&amp;#39;t gone very far</text><text start="16" dur="5">because what we&amp;#39;ve done works for these small 4 by 3 Grid Worlds,</text><text start="21" dur="3">but it won&amp;#39;t work very well for larger problems: </text><text start="24" dur="3">dealing with flying helicopters or playing backgammon--</text><text start="27" dur="2">because there&amp;#39;s just too many states</text><text start="29" dur="4">and we can&amp;#39;t visit every one of the states </text><text start="33" dur="3">and build up the correct utility values, or Q values, </text><text start="36" dur="5">for all the billions or trillions or quadrillions of states we would need to represent. </text><text start="41" dur="3">So let&amp;#39;s go back to a simpler type of example. </text><text start="44" dur="3">Here&amp;#39;s a state in a Pacman game</text><text start="47" dur="2">and we can see that this is a bad state, </text><text start="49" dur="4">where Pacman is surrounded by 2 bad guys,</text><text start="53" dur="3">and there&amp;#39;s no place for him to escape. </text><text start="56" dur="4">And so reinforcement learning could quickly learn that this is bad.</text><text start="60" dur="2">But the problem is that that state has </text><text start="62" dur="4">no relation whatsoever to this state.</text><text start="66" dur="4">Where conceptually, it&amp;#39;s the same problem--</text><text start="70" dur="2">that the Pacman is stuck in a corner</text><text start="72" dur="3">and there are bad guys on either side of him. </text><text start="75" dur="3">But in terms of a concrete state, </text><text start="78" dur="2">the 2 are completely different. 
</text><text start="80" dur="2">So what we want to be able to do is </text><text start="82" dur="4">find some generalization, so that these 2 states look the same, </text><text start="86" dur="3">and what I learn for this state--</text><text start="89" dur="3">that learning can transfer over into this state. </text><text start="92" dur="5">And so, just as we did in supervised machine learning, where we wanted to take</text><text start="97" dur="5">similar points in the state and be able to reason about them, together, </text><text start="102" dur="3">we want to be able to do the same thing for reinforcement training. </text><text start="105" dur="3">And we can use the same type of approach.</text></transcript></video><video title="24 Pacman 2.mp4" id="Y_LP6acepqQ" length="148"><transcript><text start="0" dur="2">So we can represent a state, </text><text start="2" dur="3">not by an exhaustive listing of everything that&amp;#39;s true in the state--</text><text start="5" dur="2">every single dot, and so on. </text><text start="7" dur="4">But rather, by a collection of important features. </text><text start="11" dur="4">So we can say that a state is this collection </text><text start="15" dur="3">of Feature 1, Feature 2, and so on. </text><text start="18" dur="2">And what are the features? </text><text start="20" dur="2">Well, they don&amp;#39;t have to be the exact position </text><text start="22" dur="3">of every piece in the board. 
</text><text start="25" dur="3">They could be things like the distance to the nearest Ghost </text><text start="28" dur="3">or maybe the square of the distance--or the inverse square;</text><text start="31" dur="3">or the distance to a dot or food-- </text><text start="34" dur="2">or the number of Ghosts remaining.</text><text start="36" dur="3">And then we can represent the utility of a state,</text><text start="39" dur="4">or let&amp;#39;s go with a Q value, of a state action pair</text><text start="43" dur="8">and represent that as the sum over some set of weights times the value of each feature. </text><text start="51" dur="4">And then our task, then, is to learn good values of these weights--</text><text start="55" dur="5">how important is each feature, whether they&amp;#39;re positive or negative, and so on. </text><text start="60" dur="5">This formulation will be good to the extent that similar states have the same value. </text><text start="65" dur="3">So if these 2 states have the same value, that would be good</text><text start="68" dur="4">because we could learn that, in both cases, Pacman is trapped. </text><text start="72" dur="6">It would be bad, to the extent that dissimilar states have the same value-- </text><text start="78" dur="2">say, if we&amp;#39;re ignoring something important.  </text><text start="80" dur="5">So, for example, if one of the features was: </text><text start="85" dur="2">Is Pacman in a tunnel? </text><text start="87" dur="4">It would probably be important to know: is that tunnel a dead end or not?  </text><text start="91" dur="4">And if we represented all tunnels the same, we&amp;#39;d probably be making a mistake. </text><text start="95" dur="7">Now, the great thing is that we can make a small modification to our Q learning algorithm</text><text start="102" dur="6">where, when we were updating, the Q of S, A got updated </text><text start="108" dur="5">in terms of a small change to the existing Q of S, A values. 
</text><text start="113" dur="6">We can do the same thing with the w sub-i values.  </text><text start="119" dur="3">We can update them as we make each change to the Q values.  </text><text start="122" dur="3">And they&amp;#39;re both driven by the amount of error.  </text><text start="125" dur="4">If the Q values are off by a lot, we have to make a big change; </text><text start="129" dur="2">if they&amp;#39;re not, we make a small change-- </text><text start="131" dur="2">the same thing with the Wi values. </text><text start="133" dur="4">And that looks just like what we did when we </text><text start="137" dur="3">used supervised machine learning to update our weights.  </text><text start="140" dur="4">So we can apply that same process, even though it&amp;#39;s not supervised. </text><text start="144" dur="4">It&amp;#39;s as if we&amp;#39;re bringing our own supervision to reinforcement learning. </text></transcript></video><video title="25 Conclusion.mp4" id="aFnqXy1wCWU" length="48"><transcript><text start="0" dur="3">In summary, then, we&amp;#39;ve learned how to do a lot with MDPs--</text><text start="3" dur="3">especially using reinforcement learning. </text><text start="6" dur="2">If we don&amp;#39;t know what the MDP is, </text><text start="8" dur="3">we know how to estimate it and then solve it. </text><text start="11" dur="4">We can estimate the utility for some fixed policy, pi;</text><text start="15" dur="3">or we could estimate the Q values for the </text><text start="18" dur="4">optimal policy while executing an exploration policy. 
</text><text start="22" dur="3">And we saw something about how we can make the right trade-offs</text><text start="25" dur="3">between exploration and exploitation.</text><text start="28" dur="3">So reinforcement learning remains one of the most exciting areas of AI.</text><text start="31" dur="4">Some of the biggest surprises have come out of reinforcement learning--</text><text start="35" dur="2">things like Tesauro&amp;#39;s backgammon player </text><text start="37" dur="2">or Andrew Ng&amp;#39;s helicopter; </text><text start="39" dur="4">and we think that there&amp;#39;s a lot more that we can learn. </text><text start="43" dur="5">It&amp;#39;s an exciting field, and one where there&amp;#39;s plenty of room for new innovation. </text></transcript></video></group><group title="Homework 5" count="6"><video title="1 Q Learning ANSWER.mp4" id="9ucPg_e4CL4" length="47"><transcript><text start="0" dur="4">The answer is we&amp;#39;re transitioning from this state to this state.</text><text start="4" dur="5">We get a reward of zero in the old state.</text><text start="9" dur="5">Then we get the Q value of 100 minus the Q value of zero,</text><text start="14" dur="3">and with the discount rate of 0.9, that is 90.</text><text start="17" dur="3">That&amp;#39;s a difference of 90.</text><text start="20" dur="7">Then the alpha, the learning rate, is 1/2. That gives us 45.</text><text start="27" dur="3">We apply that 45 to the state action pair.</text><text start="30" dur="4">We were in this state, and we executed the action north,</text><text start="34" dur="2">so the 45 goes here.</text><text start="36" dur="2">Notice it doesn&amp;#39;t go over here. 
</text><text start="38" dur="5">We did end up going to the east, but we didn&amp;#39;t execute the action of going to the east.</text><text start="43" dur="4">All the other actions remain unchanged.</text></transcript></video><video title="1 Q Learning.mp4" id="Ybifm6j2SP4" length="84"><transcript><text start="0" dur="5">This problem involves the Q-learning agent who is currently situated at this square</text><text start="5" dur="6">called (3,3), and executes the NORTH action trying to go up, </text><text start="11" dur="6">but because the environment is stochastic, it actually ends up arriving at this terminal state</text><text start="17" dur="2">with value 100.</text><text start="19" dur="7">And what I want you to answer is how should the Q-values be updated for this state,</text><text start="26" dur="5">and I want you to enter the Q-values over here because we don&amp;#39;t want you to </text><text start="31" dur="6">mess up the original, and we&amp;#39;ll use the formula below which I should point out is</text><text start="37" dur="4">from the Sarsa version of Q-learning, </text><text start="41" dur="6">and in this formula, the parameter alpha--the learning rate--will take on the value of 1/2, and</text><text start="47" dur="5">gamma--the discount rate--will be 0.9,</text><text start="52" dur="5">and all the rewards for moving from one state to the next are 0 </text><text start="57" dur="4">with the exception of moving into the terminal state,</text><text start="61" dur="5">and this Q of S prime, A prime--that means what goes on in the next state, </text><text start="66" dur="6">so here we were in this S, and we took the action of going NORTH,</text><text start="72" dur="6">and we transferred into this state, and in that state, no matter what action you take</text><text start="78" dur="6">the Q value is always 100, so this value here will always be 100.</text></transcript></video><video title="2 Function Generalization ANSWER.mp4" id="41pY_meCH5w" 
length="88"><transcript><text start="0" dur="5">To work out the answer, let&amp;#39;s look at the individual features for each of the states.</text><text start="5" dur="8">For this state up here, the values for F1, F2, and F3 would be 2, 1, and 1.</text><text start="13" dur="4">That is, the distance from the agent to the goal is 2,</text><text start="17" dur="7">the distance to the closest bad guy is 1, and the distance of the bad guy to the goal is 1.</text><text start="24" dur="6">Now this state here also has values 2, 1, and 1.</text><text start="30" dur="6">That would be indistinguishable under either function, F or G.</text><text start="36" dur="5">This state here has values 2, 1, and 3.</text><text start="41" dur="5">The 2 and the 1 are the same, so that would be indistinguishable under F,</text><text start="46" dur="3">but would be different under G.</text><text start="49" dur="6">And this state has values 2, 3, and 1, and the 2 and 3 are different than 2 and 1,</text><text start="55" dur="4">so that would be different under either F or G.</text><text start="59" dur="3">Now the question which is a more useful function--</text><text start="62" dur="7">the answer is G is more useful, because G can actually distinguish between these 2 states.</text><text start="69" dur="5">In this state the agent is surrounded by bad guys, so that&amp;#39;s a bad situation.</text><text start="74" dur="5">In this state the agent has a clear path to the goal, so that&amp;#39;s a good situation.</text><text start="79" dur="3">You&amp;#39;d want a function that says that those two are different </text><text start="82" dur="2">rather than one that says they&amp;#39;re the same.</text><text start="84" dur="4">G says they are different whereas F says they&amp;#39;re the same.</text></transcript></video><video title="2 Function Generalization.mp4" id="tpH7hp_pLqk" length="88"><transcript><text start="0" dur="5">This question involves function generalization in reinforcement 
learning,</text><text start="5" dur="4">and we&amp;#39;re operating in a 1-dimensional environment of squares,</text><text start="9" dur="3">and we&amp;#39;re going to consider a state generalization function, </text><text start="12" dur="6">that is a function that takes a state such as this and condenses it into some features</text><text start="18" dur="2">to represent that state.</text><text start="20" dur="3">The first function we&amp;#39;re going to consider F has these features--</text><text start="23" dur="6">f1 is the distance from the Agent represented by A to the goal represented by G, </text><text start="29" dur="4">and f2--the distance from the Agent to the closest Bad guy </text><text start="33" dur="3">which is represented by a B.</text><text start="36" dur="3">So that&amp;#39;s the function F, and we also want to consider the function G </text><text start="39" dur="5">which has the same 2 features--f1 and f2--and adds a third feature </text><text start="44" dur="5">which is the distance of the closest Bad guy to the goal.</text><text start="49" dur="5">That is distance from the goal to the Bad guy--the minimum of that over</text><text start="54" dur="1">all possible Bad guys,</text><text start="55" dur="5">and now I want you to say which of the states below--these 3 states--</text><text start="60" dur="6">have the same value as the state above--this state--under the functions F and G.</text><text start="66" dur="5">And click off the ones that have the same, and then I want you to answer for me--</text><text start="71" dur="5">In this world, agents and Bad guys can move one Square at a time, </text><text start="76" dur="3">and the agent tries to get to the goal without encountering Bad guys,</text><text start="79" dur="6">and for the agent to do that, which is a more useful generalization function </text><text start="85" dur="3">to use over these states--F or G?</text></transcript></video><video title="3 Passive RL Agent ANSWER.mp4" id="-UdAgV__5v0" 
length="58"><transcript><text start="0" dur="6">The answer is according to the policy the agent would prefer to follow this straight line,</text><text start="6" dur="3">because it is the most direct, and it is the longer goal.</text><text start="9" dur="5">Now, at any point he might slip off to one of these squares.</text><text start="14" dur="3">Those would all potentially be explored,</text><text start="17" dur="3">but if he did he would go back down onto the road.</text><text start="20" dur="5">Likewise, he might fall off onto any of these squares,</text><text start="25" dur="4">but if he did, he would also go back towards the road.</text><text start="29" dur="4">That&amp;#39;s certainly true under this situation, when he&amp;#39;s off road,</text><text start="33" dur="4">but it also turns out to be true here and here,</text><text start="37" dur="6">because the closest way to get to the goal would be to go in the north direction.</text><text start="43" dur="5">Therefore, these three rows could all potentially be explored,</text><text start="48" dur="5">but the bottom two rows would never be explored under any conditions </text><text start="53" dur="5">no matter what happens stochastically as long as the agent is following this fixed policy.</text></transcript></video><video title="3 Passive RL Agent.mp4" id="212NkM6UCBc" length="63"><transcript><text start="0" dur="7">In this problem, a passive TD-reinforcement learning agent starts at S and moves to G </text><text start="7" dur="6">under a fixed policy which says first, make moves that get closest to G.</text><text start="13" dur="5">So if we started here, we&amp;#39;d want to go in this direction because it&amp;#39;s closest to G,</text><text start="18" dur="4">and (b) stay on these gray squares which represent roads, </text><text start="22" dur="6">and (c) if you do happen to go off the road, then move it back onto the road immediately.</text><text start="28" dur="4">The actions are stochastic, and they 
may go in the intended direction,</text><text start="32" dur="6">or they may go 90 degrees off, so if we were here, we&amp;#39;d plan to start under this policy</text><text start="38" dur="6">going in this direction. We might end up there, but we might end up here or here.</text><text start="44" dur="5">And if we did end up here, then we&amp;#39;d immediately head back towards the road,</text><text start="49" dur="3">so we&amp;#39;d aim back down in this direction.</text><text start="52" dur="5">And what I want you to do is click on all the squares that would never be explored</text><text start="57" dur="6">by this reinforcement learning agent following this passive fixed policy.</text></transcript></video></group><group title="Unit 11" count="28"><video title="01.mp4 Introduction.mp4" id="s5jbwPgheqI" length="74"><transcript><text start="1" dur="3">Today I have the great, great pleasure </text><text start="4" dur="5">to teach you about hidden Markov models and filter algorithms.</text><text start="9" dur="6">The reason why I&amp;#39;m so excited is in pretty much all of my scientific career,</text><text start="15" dur="4">hidden Markov models and filters have played a major role.</text><text start="19" dur="5">There&amp;#39;s no robot that I program today that wouldn&amp;#39;t extensively use hidden Markov models</text><text start="24" dur="3">and things such as particle filters.</text><text start="27" dur="6">In fact, when I applied for a job at Stanford University as a professor many years ago,</text><text start="33" dur="4">my job talk that I used to market myself to Stanford </text><text start="37" dur="5">was extensively about a version of hidden Markov models and particle filters</text><text start="42" dur="3">applied to robotic mapping.</text><text start="45" dur="7">Today I will teach you those algorithms so you can use them in many challenging problems.</text><text start="52" dur="3">I can&amp;#39;t quite promise you that once you have mastered the 
material</text><text start="55" dur="5">you will get a job at Stanford, but you can really, really apply them</text><text start="60" dur="6">to a vast array of problems in places such as finance, medicine, robotics,</text><text start="66" dur="5">weather prediction, time series analysis, and many, many other domains.</text><text start="71" dur="3">This is going to be a really fun class.</text></transcript></video><video title="02.mp4 Hidden Markov Models.mp4" id="Y7WhCOWL9Ec" length="54"><transcript><text start="1" dur="6">[Thrun] Hidden Markov models, or abbreviated HMMs, </text><text start="7" dur="8">are used to analyze or to predict time series.</text><text start="15" dur="5">Applications include robotics, medical, finance,</text><text start="20" dur="5">speech and language technologies, and many, many, many other domains.</text><text start="25" dur="10">In fact, HMMs and filters are at the core of a huge amount of deployed practical systems</text><text start="35" dur="3">from elevators to airplanes.</text><text start="38" dur="5">Every time there is a time series that involves noise or sensors or uncertainty,</text><text start="43" dur="2">this is the method of choice.</text><text start="45" dur="3">So today I&amp;#39;ll teach you all about HMMs and filters</text><text start="48" dur="6">so you can apply some of the basic algorithms in a wide array of practical problems.</text></transcript></video><video title="03.mp4 Bayes Network of HMMs.mp4" id="n4skYKmM27k" length="105"><transcript><text start="0" dur="4">[Thrun] The essence of HMMs are really simply characterized</text><text start="4" dur="3">by the following Bayes network.</text><text start="7" dur="4">There&amp;#39;s a sequence of states that evolve over time,</text><text start="11" dur="6">and each state depends only on the previous state in this Bayes network.</text><text start="17" dur="5">Each state also emits what&amp;#39;s called a measurement.</text><text start="22" dur="5">It is this Bayes 
network that is the core of hidden Markov models</text><text start="27" dur="8">and various probabilistic filters such as Kalman filters, particle filters, and many others.</text><text start="35" dur="5">These are words that might sound cryptic and they might not mean anything to you,</text><text start="40" dur="4">but you might come across them as you study different disciplines of computer science</text><text start="44" dur="2">and control theory.</text><text start="46" dur="3">The real key here is the graphical model.</text><text start="49" dur="3">If you look at the evolution of states, </text><text start="52" dur="5">what you&amp;#39;ll find is that these states evolve as what&amp;#39;s called a Markov chain.</text><text start="57" dur="5">In a Markov chain, each state only depends on its predecessor.</text><text start="62" dur="5">So for example, state S3 is conditioned on S2 but not on S1.</text><text start="67" dur="4">It&amp;#39;s only mediated through S2 that S3 might be influenced by S1.</text><text start="71" dur="4">That&amp;#39;s called a Markov chain, and we&amp;#39;re going to study Markov chains quite a bit</text><text start="75" dur="2">in this class to understand them well.</text><text start="77" dur="5">But what makes it a hidden Markov model or hidden Markov chain, if you wish,</text><text start="82" dur="3">is the fact that there is measurement variables.</text><text start="85" dur="6">So rather than being able to observe the state itself, what you get to see are measurements.</text><text start="91" dur="5">Let me put this to perspective, showing you several of the robots I&amp;#39;ve built</text><text start="96" dur="2">that possess hidden state.</text><text start="98" dur="4">And where I only get to observe certain measurements,</text><text start="102" dur="3">let me infer something about the hidden state.</text></transcript></video><video title="04.mp4 Localization Problem Examples.mp4" id="S_Lm8aN-la0" length="226"><transcript><text 
start="0" dur="3">[Thrun] What&amp;#39;s shown here is the tour guide robot that I showed you earlier,</text><text start="3" dur="4">but now I&amp;#39;ll talk about the what&amp;#39;s called localization problem--</text><text start="7" dur="6">the problem of finding out where in the world this robot is.</text><text start="13" dur="4">This problem is important because to find its way around the museum</text><text start="17" dur="6">and to arrive at exhibits of interest, it must know where it is.</text><text start="23" dur="7">The problem with this problem is that it doesn&amp;#39;t have a sensor that tells us where it is.</text><text start="30" dur="4">Instead, it&amp;#39;s given what&amp;#39;s called range finders.</text><text start="34" dur="4">These are sensors that measure distances to surrounding objects.</text><text start="38" dur="2">It&amp;#39;s also given the map of the environment, </text><text start="40" dur="6">and it can compare these range finders measurements with the map of the environment</text><text start="46" dur="3">and infer from that where it might be.</text><text start="49" dur="7">The process of inferring the hidden state of the robot&amp;#39;s location from the measurements,</text><text start="56" dur="4">the range sensor measurements, that&amp;#39;s the problem of filtering.</text><text start="60" dur="5">And the underlying model is exactly the same I showed you before.</text><text start="65" dur="4">It&amp;#39;s a hidden Markov model where the state is the sequence of locations</text><text start="69" dur="3">that the robot assumes in the museum</text><text start="72" dur="4">and the measurements is the sequence of range measurements it perceives</text><text start="76" dur="3">while it navigates the museum.</text><text start="79" dur="5">A second example is the underground robotic mapping robot</text><text start="84" dur="4">which has pretty much the same problem--finding out where it is--</text><text start="88" dur="4">but now it is not 
given a map; it builds the map from scratch.</text><text start="92" dur="9">What this animation here shows you is a so-called particle filter applied to robotic mapping.</text><text start="101" dur="3">Intuitively--what you see is very simple--</text><text start="104" dur="4">as the robot descends into a mine, it builds a map.</text><text start="108" dur="6">But the many black lines are hypotheses on where the robot might have been</text><text start="114" dur="2">when building this map.</text><text start="116" dur="4">It can&amp;#39;t tell because of the noise in its motors and in its sensors.</text><text start="120" dur="5">As the robot reconnects and closes the loop in this map,</text><text start="125" dur="4">one of these black what we call particles in the trade--</text><text start="129" dur="4">one of these black hypotheses are being selected as the best one,</text><text start="133" dur="6">and by virtue of having maintained many of those, the robot is able to build a coherent map.</text><text start="139" dur="4">In fact, this animation was a key animation in my job talk</text><text start="143" dur="4">when I applied to become a professor at Stanford University.</text><text start="147" dur="5">Here is one final example I&amp;#39;d like to discuss with you which is called speech recognition.</text><text start="152" dur="3">If you have a microphone that records speech</text><text start="155" dur="3">and you want to make your computer recognize the speech,</text><text start="158" dur="3">you will likely come across hidden Markov models.</text><text start="161" dur="3">This is a typical speech signal over here.</text><text start="164" dur="7">It&amp;#39;s an oscillation for the words &amp;quot;speech lab&amp;quot; which I borrowed from Simon Arnfield.</text><text start="171" dur="7">And if you blow up a small region over here, you&amp;#39;ll find that there is an oscillation,</text><text start="178" dur="5">and this oscillation in time is the speech 
signal.</text><text start="183" dur="6">What speech recognizing systems do is they transform this signal over here</text><text start="189" dur="3">back into letters like &amp;quot;speech lab.&amp;quot;</text><text start="192" dur="2">And you can see it&amp;#39;s not an easy task.</text><text start="194" dur="2">There is some signal here. </text><text start="196" dur="2">The E, for example, is a certain shape.</text><text start="198" dur="4">But different speakers speak differently, and there might be background noise,</text><text start="202" dur="3">so decoding this back into speech is challenging.</text><text start="205" dur="3">There&amp;#39;s been enormous progress in the field</text><text start="208" dur="5">mostly due to hidden Markov models that have been researched for more than 20 years.</text><text start="213" dur="5">And today&amp;#39;s best speech recognizers all use variants of hidden Markov models.</text><text start="218" dur="5">So once again, I can&amp;#39;t teach you everything in this class, but I&amp;#39;ll teach you the very basics</text><text start="223" dur="3">that you can apply to things such as speech signals.</text></transcript></video><video title="05.mp4 Markov Chain Question 1.mp4" id="5uGhAQOs6Aw" length="72"><transcript><text start="0" dur="5">[Thrun] So let&amp;#39;s begin by taking the hidden out of the Markov model</text><text start="5" dur="2">and study Markov chains.</text><text start="7" dur="4">We&amp;#39;re going to use an example for which I will quiz you.</text><text start="11" dur="4">Suppose there are 2 types of weather--rainy, which we call R,</text><text start="15" dur="2">and sunny, which we call S--</text><text start="17" dur="4">and suppose we have the following state transition diagram.</text><text start="21" dur="7">If it&amp;#39;s rainy, it stays rainy with a 0.6 chance while with 0.4 it becomes sunny.</text><text start="28" dur="5">Sunny remains sunny with 0.8 chance but moves to rainy with 0.2 chance.</text><text 
start="33" dur="8">This is obviously a temporal sequence so the weather at time 1 will be called R1 or S1,</text><text start="41" dur="3">at time 2, R2 or S2.</text><text start="44" dur="4">Suppose in the beginning we happen to know it is rainy,</text><text start="48" dur="4">which means R at time 0 when we begin.</text><text start="52" dur="7">We have the probability of rain equals 1 and the probability of sun, S at time 0, equals 0.</text><text start="59" dur="9">I&amp;#39;d like to know from you what&amp;#39;s the probability of rain on day 1, the same for day 2,</text><text start="68" dur="4">and the same for day 3.</text></transcript></video><video title="06.mp4 Markov Chain Answer 1.mp4" id="SQ-b40cVia8" length="149"><transcript><text start="0" dur="8">[Thrun] And the answer will be 0.6, 0.44, and 0.376.</text><text start="8" dur="5">It&amp;#39;s really an exercise applying probability theory.</text><text start="13" dur="4">In the very beginning we know we are in state R,</text><text start="17" dur="7">and the probability of remaining there is 0.6, which is directly the value on the arc over here.</text><text start="24" dur="5">On the second state we know that the probability of R is 0.6</text><text start="29" dur="5">and therefore, the probability of sun is 0.4,</text><text start="34" dur="6">and we compute the probability of rain on day 2 using total probability.</text><text start="40" dur="5">The probability of rain on day 2 given rain on day 1 </text><text start="45" dur="4">times the probability of rain on day 1 plus the probability of rain on day 2</text><text start="49" dur="5">given it was sunny on day 1 times the probability of sun on day 1.</text><text start="54" dur="2">And if you plug in all these values,</text><text start="56" dur="9">we get 0.6 times 0.6 plus rain following sun which is this arc over here, 0.2,</text><text start="65" dur="7">times 0.4 as the prior, and this results in 0.44.</text><text start="72" dur="5">We can now do the same with the 
probability of rain on day 3,</text><text start="77" dur="9">which is the same 0.6 over here, but now our prior is different--it&amp;#39;s 0.44--</text><text start="86" dur="7">plus the same 0.2 over here with the prior of 0.56, which is 1 minus 0.44.</text><text start="93" dur="5">And when you work this all out, it is 0.376 as indicated over here.</text><text start="98" dur="4">So what we really learned here is that this is a temporal Bayes network</text><text start="102" dur="6">to which we can apply conventional probability rules such as total probability,</text><text start="108" dur="4">which was also known as variable elimination in the Bayes network lecture.</text><text start="112" dur="4">All these fancy words aside, it&amp;#39;s really easy to evaluate those.</text><text start="116" dur="5">So if you want to do this and you ask yourself, given the probability at a certain time step</text><text start="121" dur="3">like 1, what is it at time step 2,</text><text start="124" dur="5">you ask yourself what are the situations that I encounter in time step 1.</text><text start="129" dur="2">There are usually 2 in this case.</text><text start="131" dur="5">What are the transition probabilities that lead me to the desired state in time step 2</text><text start="136" dur="6">like the 0.6 if you started in R and 0.2 if you started in S,</text><text start="142" dur="3">and you add all these cases up and you just get the right number.</text><text start="145" dur="4">It&amp;#39;s really an easy piece of mathematics if you think about it.</text></transcript></video><video title="07.mp4 Markov Chain Question 2.mp4" id="5I0YPndroIE" length="28"><transcript><text start="0" dur="4">[Thrun] Let&amp;#39;s practice this again with another 2-state Markov chain.</text><text start="4" dur="3">States are A and B.</text><text start="7" dur="4">A has a 50% chance of transitioning to B,</text><text start="11" dur="2">and B always transitions into A.</text><text 
start="13" dur="3">There is no loop from B to itself.</text><text start="16" dur="6">Let&amp;#39;s assume again that at time 0 we know with certainty that we are in state A.</text><text start="22" dur="6">I would like to know the probability of A at time 1, at time 2, and at time 3.</text></transcript></video><video title="08.mp4 Markov Chain Answer 2.mp4" id="-8Is71qnNCw" length="79"><transcript><text start="0" dur="3">[Thrun] And again the solution follows directly from the state diagram over here.</text><text start="3" dur="4">In the beginning we do know we&amp;#39;re in state A</text><text start="7" dur="2">and the chance of remaining in A is 0.5.</text><text start="9" dur="4">This is the 0.5 over here. We can just read this off.</text><text start="13" dur="6">For the next state we find ourselves with 0.5 chance in A</text><text start="19" dur="2">and 0.5 chance in B.</text><text start="21" dur="3">If we&amp;#39;re in B, we transition with certainty to A.</text><text start="24" dur="2">That&amp;#39;s because of the 0.5.</text><text start="26" dur="5">But if we&amp;#39;re in A, we stay in A with a 0.5 chance. So you put this together.</text><text start="31" dur="5">0.5 probability of being in A times 0.5 probability of remaining in A</text><text start="36" dur="5">plus 0.5 probability to be in B times 1 probability to transition to A.</text><text start="41" dur="3">That gives us 0.75.</text><text start="44" dur="8">Following the same logic but now we&amp;#39;re in A with 0.75 times a 0.5 probability</text><text start="52" dur="6">of staying in A plus 0.25 in B, which is 1 minus 0.75,</text><text start="58" dur="8">and the transition with certainty back to A is 1, we get 0.625.</text><text start="66" dur="5">So now you should be able to take a Markov chain and compute by hand</text><text start="71" dur="5">or with a piece of software the probabilities of future states.</text><text start="76" dur="3">You will be able to predict something. 
That&amp;#39;s really exciting.</text></transcript></video><video title="09.mp4 Stationary Distribution.mp4" id="1YvuLgmrpZE" length="162"><transcript><text start="0" dur="4">[Thrun] So one of the questions you might ask for a Markov chain like this is</text><text start="4" dur="3">what happens if time becomes really large?</text><text start="7" dur="4">What happens for the probability of A1000?</text><text start="11" dur="2">Or let&amp;#39;s go extreme. </text><text start="13" dur="6">What about in the limit, A infinity, often written as the limit of time</text><text start="19" dur="3">going to infinity of P of At.</text><text start="22" dur="6">That&amp;#39;s like the fancy math notation, but what it really means is we just wait a long, long time.</text><text start="28" dur="4">What is going to happen to the Markov chain over here? What is that probability?</text><text start="32" dur="4">This probability is called a stationary distribution,</text><text start="36" dur="3">and a Markov chain settles to a stationary distribution</text><text start="39" dur="5">or sometimes a limit cycle if the transitions are deterministic, which we don&amp;#39;t care about.</text><text start="44" dur="7">And the key to calculating this is to realize that the probability for any t</text><text start="51" dur="4">must be the same as the probability one time step earlier.</text><text start="55" dur="2">This can be resolved as follows.</text><text start="57" dur="8">We know that P of At is P of At given At minus 1 times P of At minus 1</text><text start="65" dur="7">plus P of At given Bt minus 1</text><text start="72" dur="5">times probability of Bt minus 1.</text><text start="77" dur="4">This is just the theorem of total probability or forward propagation rule </text><text start="81" dur="5">applied to this case over here, so nothing really new.</text><text start="86" dur="6">But if you call this guy over here X, then we now have X </text><text start="92" dur="7">equals probability of At 
given At minus 1 is 0.5</text><text start="99" dur="2">times--and this is the same X as this one over here </text><text start="101" dur="4">because you&amp;#39;re looking for the stationary distribution, so it&amp;#39;s X again.</text><text start="105" dur="6">This probability over here, A following B, is 1 in this special case,</text><text start="111" dur="7">and the probability of Bt minus 1 is 1 minus P of At minus 1.</text><text start="118" dur="4">And if you plug this in, that&amp;#39;s the same as 1 minus X.</text><text start="122" dur="3">And we can now solve this for X.</text><text start="125" dur="2">Let me just do this.</text><text start="127" dur="8">X equals, if you put these 2 Xs together, minus 0.5X plus 1</text><text start="135" dur="3">or, differently, 1.5X equals 1.</text><text start="138" dur="6">That means X equals 1 over 1.5, which is 2/3.</text><text start="144" dur="7">So the answer here is the stationary distribution will have A occurring with 2/3 chance</text><text start="151" dur="2">and B with 1/3 chance.</text><text start="153" dur="2">It&amp;#39;s still a Markov chain--it flips from A to B--</text><text start="155" dur="3">but this is the frequency at which A occurs</text><text start="158" dur="4">and this is the frequency at which B occurs.</text></transcript></video><video title="10.mp4 Stationary Distribution Question-.mp4" id="7GE-pdRBFy0" length="15"><transcript><text start="0" dur="7">[Thrun] To see if you understood this, let me look at the rain-sun Markov chain again,</text><text start="7" dur="3">and let me ask you for the stationary distribution or the limit distribution</text><text start="10" dur="5">for rain to be the case after infinitely many steps.</text></transcript></video><video title="11.mp4 Stationary Distribution Answer.mp4" id="EtTWOv8ngtI" length="101"><transcript><text start="0" dur="3">[Thrun] And the answer is 1/3,</text><text start="3" dur="6">as you can easily see if you call X the probability of rain in 
time T</text><text start="9" dur="3">and also the probability of rain, T minus 1.</text><text start="12" dur="4">These 2 must be equivalent because we&amp;#39;re looking for the stationary distribution.</text><text start="16" dur="6">Then we get, by virtue of our expansion of the state at time T,</text><text start="22" dur="5">the probability of transitioning from rain to rain is 0.6,</text><text start="27" dur="3">the probability of having it rain is X again, </text><text start="30" dur="6">the probability of transitioning from sun to rain is 0.2, </text><text start="36" dur="4">and the probability of having sun before is 1 minus X,</text><text start="40" dur="6">so we get X equals 0.4X plus 0.2.</text><text start="46" dur="5">Or, differently, we have 0.6X equals 0.2,</text><text start="51" dur="4">and when we work this out, X is 1/3,</text><text start="55" dur="6">which is the probability of rain in the asymptote if you wait forever.</text><text start="61" dur="3">One of the interesting things to observe here</text><text start="64" dur="4">is that the stationary distribution does not depend on the initial distribution.</text><text start="68" dur="4">In fact, I didn&amp;#39;t even tell you what the initial state was.</text><text start="72" dur="4">Markov chains that have that property, which are pretty much all Markov chains,</text><text start="76" dur="3">are called ergodic. 
</text><text start="79" dur="4">You can safely forget that word again, but people in the field use this word</text><text start="83" dur="3">to express Markov chains that mix.</text><text start="86" dur="6">And mix means that the knowledge of the initial distribution fades over time</text><text start="92" dur="4">until it disappears in the end.</text><text start="96" dur="5">The speed at which it gets lost is called the mixing speed.</text></transcript></video><video title="12.mp4 Finding Transition Probabilities.mp4" id="jDRD9eaf9qE" length="116"><transcript><text start="0" dur="7">[Thrun] You can also learn the transition probabilities of a Markov chain like this</text><text start="7" dur="2">from actual data.</text><text start="9" dur="3">Suppose you look out of the window and see sequences of rainy days</text><text start="12" dur="3">followed by sunny days followed by rainy days</text><text start="15" dur="9">and you wonder what numbers to put here, here, here, and here.</text><text start="24" dur="10">Let me assume you see a sequence rain, sun, sun, sun, rain, sun, and rain.</text><text start="34" dur="2">These are, in total, 7 different days,</text><text start="36" dur="2">and we wish to estimate all those probabilities over here,</text><text start="38" dur="8">including the initial distribution for the first day using maximum likelihood.</text><text start="46" dur="3">You might remember all this work with Laplace smoothing,</text><text start="49" dur="3">but for now we keep it simple, just maximum likelihood.</text><text start="52" dur="5">We find for day 0 we had rain, and maximum likelihood would just say</text><text start="57" dur="3">the probability for day 0 is 1.</text><text start="60" dur="2">That&amp;#39;s the most likely estimate.</text><text start="62" dur="5">Then for the transition probability we find we transition from rain </text><text start="67" dur="4">to something else twice here.</text><text start="71" dur="3">We sometimes transition to sun 
and sometimes stay in rain.</text><text start="74" dur="5">In both of the transitions we go from rain to sun. There is no instance of rain to rain.</text><text start="79" dur="4">So maximum likelihood gives us over here a 1 and this over here 0.</text><text start="83" dur="4">And finally, we can also ask the question what happens from a sunny state.</text><text start="87" dur="4">We transition to a new sunny state or a rainy state,</text><text start="91" dur="2">and those distributions are easily calculated.</text><text start="93" dur="4">We have 4 transitioning out of a sunny state to something else--</text><text start="97" dur="2">this one, this one, this one, and this one.</text><text start="99" dur="3">Twice it goes to sunny over here and over here,</text><text start="102" dur="3">twice it goes to rainy over here and over here,</text><text start="105" dur="3">so therefore the probability for either transition is 0.5.</text><text start="108" dur="6">So we have 0.5 over here, 0.5 over here, 1 over here, and 0 over here</text><text start="114" dur="2">for the transition probabilities.</text></transcript></video><video title="13.mp4 Transition Probabilities Question.mp4" id="4xfEvRGRBnU" length="21"><transcript><text start="0" dur="3">[Thrun] So in this quiz please do the same for me.</text><text start="3" dur="2">Here is our sequence. 
</text><text start="5" dur="5">There&amp;#39;s a couple of sunny days--5 in total--a rainy day, 3 sunny days, 2 rainy days.</text><text start="10" dur="4">Calculate using maximum likelihood the prior probability of rain</text><text start="14" dur="4">and then the 4 transition probabilities as before.</text><text start="18" dur="3">Please fill in those numbers over here.</text></transcript></video><video title="14.mp4 Transition Probabilities Answer.mp4" id="6yYOilqrHfU" length="42"><transcript><text start="0" dur="3">[Thrun] The initial probability for rain is 0</text><text start="3" dur="3">because we are just encountering 1 initial day and it&amp;#39;s sunny.</text><text start="6" dur="3">The maximum likelihood estimate is therefore 0.</text><text start="9" dur="6">We transition 8 times out of a sunny state--1, 2, 3, 4, 5, 6, 7, 8--</text><text start="15" dur="5">twice into a rainy state, and therefore 6 times we remain in a sunny state,</text><text start="20" dur="3">so the probability of sun to sun is &#xBE;,</text><text start="23" dur="3">whereas sun to rain is &#xBC;.</text><text start="26" dur="3">From a rainy state we have 2 outbound transitioning,</text><text start="29" dur="3">1 to a sunny state and 1 to a rainy state.</text><text start="32" dur="2">The last R over here has no outbound transition,</text><text start="34" dur="3">so it doesn&amp;#39;t really count in our statistic.</text><text start="37" dur="5">The maximum likelihood therefore is 0.5 or &#xBD; for each of those.</text></transcript></video><video title="15.mp4 Laplacian Smoothing Question.mp4" id="olm7LeWgC7Q" length="92"><transcript><text start="0" dur="4">[Thrun] One of the oddities of the maximum likelihood estimator is overfitting.</text><text start="4" dur="4">So for example, we observed that we always have a single first day,</text><text start="8" dur="3">and this becomes our prior probability.</text><text start="11" dur="5">So in this case the prior probability for rain on day 0 
would be 1,</text><text start="16" dur="2">which kind of doesn&amp;#39;t make sense, really.</text><text start="18" dur="3">It should be more like the stationary distribution or something like that.</text><text start="21" dur="6">Well, you might remember the work on Laplacian smoothing.</text><text start="27" dur="3">This is a great moment where I can test whether you really think</text><text start="30" dur="3">like an artificial intelligence person.</text><text start="33" dur="3">I&amp;#39;m going to make you apply Laplacian smoothing in this new context</text><text start="36" dur="5">of estimating the parameters of this Markov chain</text><text start="41" dur="4">using the smoother of K = 1.</text><text start="45" dur="3">You might remember you add something to the numerator, like 1,</text><text start="48" dur="3">and something to the denominator to make sure things normalize,</text><text start="51" dur="2">and then you get different probabilities </text><text start="53" dur="3">than you would get with the maximum likelihood estimator.</text><text start="56" dur="3">So I&amp;#39;m going to ask you a quiz here, even though I haven&amp;#39;t completely shown you</text><text start="59" dur="3">the application of Laplacian smoothing in this context.</text><text start="62" dur="3">But if you understood Laplacian smoothing, you might want to give it a try.</text><text start="65" dur="8">What&amp;#39;s the probability of rain on day 0, and what are its conditional probabilities?</text><text start="73" dur="10">Sun goes to sun, sun goes to rain, rain goes to sun, and rain stays in rain.</text><text start="83" dur="4">The way probabilities work, as you surely know, these 2 things over here</text><text start="87" dur="5">have to add up to 1, and these 2 things over here have to add up to 1.</text></transcript></video><video title="16.mp4 Laplacian Smoothing Answer.mp4" id="Jv0T6H3bFB8" length="123"><transcript><text start="0" dur="4">[Thrun] So in Laplacian smoothing we 
look at the relative counts.</text><text start="4" dur="3">We know there is 1 instance of rain at time 0.</text><text start="7" dur="3">Normally it would be 1.</text><text start="10" dur="9">But we add 1 to the numerator and 2 to the denominator, and we get 2/3.</text><text start="19" dur="2">Let&amp;#39;s look at these numbers again.</text><text start="21" dur="5">The count that we have is 1 out of 1 is rain and 1 out of 1 would give us 1</text><text start="26" dur="2">under the maximum likelihood estimator.</text><text start="28" dur="3">But because we&amp;#39;re smoothing, we&amp;#39;re adding a pseudocount, </text><text start="31" dur="3">which is 1 rainy day and 1 sunny day,</text><text start="34" dur="4">and we have to compensate for the 2 additional counts with a 2 over here</text><text start="38" dur="2">and therefore we get 2/3.</text><text start="40" dur="6">So our probability under the Laplacian smoother is 2/3 for the rainy day to be the first day,</text><text start="46" dur="2">which is really different from 1.</text><text start="48" dur="5">Applying the same logic over here, we transition 3 times out of a sunny state--</text><text start="53" dur="5">1, 2, 3--and each time it&amp;#39;s a sunny state.</text><text start="58" dur="4">So maximum likelihood would say 3 times out of 3 it&amp;#39;s sunny into sunny.</text><text start="62" dur="5">We add a pseudo observation of 1, and then there&amp;#39;s 2 possible outcomes;</text><text start="67" dur="3">hence, we have to count 2 over here.</text><text start="70" dur="3">So it&amp;#39;s 4/5.</text><text start="73" dur="2">And the missing 1/5 shows up over here.</text><text start="75" dur="3">We can do the same math as before.</text><text start="78" dur="4">Zero of the 3 transitions from a sunny day resulted in a rainy day.</text><text start="82" dur="2">In fact, they were all sunny.</text><text start="84" dur="5">But we add 1 pseudo observation over here and 2 to the normalizer, giving 1/5.</text><text start="89" 
dur="3">These 2 things surely add up to 1.</text><text start="92" dur="2">The last one is analogous. </text><text start="94" dur="4">We have 1 transition of a rainy state and it led to a sunny state, so 1/1,</text><text start="98" dur="4">but we add 1 over here and 2 on the denominator so you get 2/3.</text><text start="102" dur="3">And if you do the math over here, you get 1/3.</text><text start="105" dur="2">I really want you to remember Laplacian smoothing.</text><text start="107" dur="4">It&amp;#39;s applicable to many estimation problems,</text><text start="111" dur="4">and it will be important going forward in this class.</text><text start="115" dur="3">Here we applied it to the estimation of a Markov chain.</text><text start="118" dur="5">Please take a moment and study the logic so you&amp;#39;ll be able to apply those things again.</text></transcript></video><video title="17.mp4 HMM Happy Grumpy Problem.mp4" id="YQ80TAESLqw" length="252"><transcript><text start="0" dur="4">[Thrun] So now let&amp;#39;s return to hidden Markov models.</text><text start="4" dur="4">Those are really the subject of this class.</text><text start="8" dur="4">Let&amp;#39;s again use the rainy and sunny example just to keep it simple.</text><text start="12" dur="3">These are the transition probabilities as before.</text><text start="15" dur="5">Let&amp;#39;s assume for now that the initial probability of rain is 0.5;</text><text start="20" dur="3">hence, the probability of sun at time 0 is 0.5.</text><text start="23" dur="5">The key modification to go to hidden Markov model is that this state is actually hidden.</text><text start="28" dur="4">I cannot see whether it&amp;#39;s raining or it&amp;#39;s sunny.</text><text start="32" dur="3">Instead I get to observe something else.</text><text start="35" dur="3">Suppose I can be happy or grumpy</text><text start="38" dur="5">and happiness or grumpiness is being caused by the weather.</text><text start="43" dur="3">So rain might make me 
happy or grumpy,</text><text start="46" dur="3">and sunshine makes me happy or grumpy</text><text start="49" dur="2">but with vastly different probabilities.</text><text start="51" dur="4">If it&amp;#39;s sunny, I&amp;#39;m just mostly happy, 0.9.</text><text start="55" dur="4">There&amp;#39;s a 0.1 chance I might still be grumpy for some other reason.</text><text start="59" dur="6">If it&amp;#39;s rainy, I&amp;#39;m only happy with 0.4 probability and with 0.6 I&amp;#39;m grumpy.</text><text start="65" dur="6">In fact, living in California I can attest that these are actually not wrong probabilities.</text><text start="71" dur="3">I love the sun over here.</text><text start="74" dur="6">Suppose I observe that I&amp;#39;m happy on day 1.</text><text start="80" dur="7">A question that we can ask now is what is the so-called posterior probability</text><text start="87" dur="8">for it raining on day 1 and what&amp;#39;s the posterior probability for it being sunny on day 1?</text><text start="95" dur="8">What&amp;#39;s the probability of rain on day 1 given that I observed that I was happy on day 1?</text><text start="103" dur="3">This is being answered using Bayes rule,</text><text start="106" dur="4">so this is the probability of being happy given that it rains</text><text start="110" dur="6">times the probability that it rains over the probability of being happy.</text><text start="116" dur="7">We know the probability of rain at day 1 based on our Markov state transition model.</text><text start="123" dur="2">In fact, let&amp;#39;s just calculate it.</text><text start="125" dur="5">The probability of rain on day 1 is the probability it was rainy on day 0</text><text start="130" dur="4">and it led to a self transition from rain to rain from day 0 to day 1</text><text start="134" dur="6">plus the probability it was sunny on day 0 times the probability that sun led to rain over here.</text><text start="140" dur="6">If you can plug in all these numbers to obtain 
0.4,</text><text start="146" dur="3">you can just easily verify this.</text><text start="149" dur="3">So we know this guy over here is 0.4.</text><text start="152" dur="7">This guy over here is 0.4 again, but now it&amp;#39;s this 0.4 over here.</text><text start="159" dur="5">The probability of being happy on a rainy day is 0.4.</text><text start="164" dur="7">This guy over here resolves to 0.4 times 0.4</text><text start="171" dur="4">plus the same situation with sunny in time 1</text><text start="175" dur="6">where the prior is 0.6 and the happiness factor is 0.9.</text><text start="181" dur="5">And that gives us the entire expression is 0.229.</text><text start="186" dur="5">Let&amp;#39;s interpret the 0.229 in the context of the question we asked.</text><text start="191" dur="5">We know that at time 0 it was raining with half a chance.</text><text start="196" dur="4">If you look at the state transition diagram, it&amp;#39;s more likely to be sunny afterwards</text><text start="200" dur="3">because it&amp;#39;s more likely to flip from rain to sun than sun to rain.</text><text start="203" dur="6">In fact, we worked out that the probability of rain at a time step later was only 0.4,</text><text start="209" dur="2">so it was 0.6 sunny.</text><text start="211" dur="5">But now that I saw myself being happy, my probability of rain was further lowered</text><text start="216" dur="3">from 0.4 to 0.229.</text><text start="219" dur="6">And the reason why the probability went down is if you look at happiness,</text><text start="225" dur="5">happiness is much more likely to occur on a sunny day than it is to occur on a rainy day.</text><text start="230" dur="3">And when you work this in using Bayes rule and total probability,</text><text start="233" dur="4">you would find just the fact that it was at happiness at time 1</text><text start="237" dur="8">makes your belief of it being rainy go down from 0.4 to 0.229.</text><text start="245" dur="3">This is a wonderful example 
of applying Bayes rule </text><text start="248" dur="4">in this relatively complicated hidden Markov model.</text></transcript></video><video title="18.mp4 Happy Grumpy Question.mp4" id="oMyPkJMU2_U" length="46"><transcript><text start="0" dur="6">[Thrun] So let me use exactly the same hidden Markov model where we have rain and sun</text><text start="6" dur="5">and happiness and grumpiness with 0.4 and 0.6</text><text start="11" dur="3">and 0.9 and 0.1 probabilities.</text><text start="14" dur="7">The only change I will apply is I will tell you that with probability 1 it&amp;#39;s raining on day 0;</text><text start="21" dur="4">hence, the probability of sun at day 0 is 0.</text><text start="25" dur="5">I now observe another happy face on day 1,</text><text start="30" dur="7">and I&amp;#39;d like to know the probability of it raining on day 1 given this observation.</text><text start="37" dur="3">This is the same as before with the only difference </text><text start="40" dur="3">that we have a different initial probability,</text><text start="43" dur="3">but all the other probabilities should just be the same.</text></transcript></video><video title="19.mp4 Happy Grumpy Answer.mp4" id="_A9a9bJ3Src" length="58"><transcript><text start="0" dur="7">[Thrun] Once again let&amp;#39;s calculate the probability of rain on day 1.</text><text start="7" dur="3">This one is easy because we know it is raining on day 0, </text><text start="10" dur="4">so it&amp;#39;s 0.6, the 0.6 over here.</text><text start="14" dur="6">This expression over here is expanded by Bayes rule as applied over here.</text><text start="20" dur="4">Probability of happiness during rain is 0.4,</text><text start="24" dur="4">the probability of rain was just said to be 0.6,</text><text start="28" dur="9">and we divide by 0.4 times 0.6 plus 0.9 times 0.4, where 0.4 is 1 minus 0.6.</text><text start="37" dur="5">And that resolves simply to 0.4 if you work it all out.</text><text start="42" dur="5">So 
the interesting thing here is if you were just to run the Markov chain,</text><text start="47" dur="4">on day 1 we have a 0.6 chance of rain,</text><text start="51" dur="7">but the fact that I observed myself to be happy reduces the chance of rain to 0.4.</text></transcript></video><video title="20.mp4 Wow You Understand.mp4" id="dnEp5g9p8dw" length="31"><transcript><text start="0" dur="5">[Thrun] So if you got those questions right, I&amp;#39;m in awe with you--wow--</text><text start="5" dur="5">because you understand the very basics of using a hidden Markov model for 2 things now.</text><text start="10" dur="6">One is prediction, and one is called state estimation.</text><text start="16" dur="5">In state estimation that&amp;#39;s a really fancy word for just computing the probability</text><text start="21" dur="5">of the internal or hidden state given measurements.</text><text start="26" dur="5">In prediction we predict the next state, and you might also predict the next measurement.</text></transcript></video><video title="21.mp4 HMMs and Robot Localization.mp4" id="Nc9-iLy_rgY" length="178"><transcript><text start="0" dur="5">[Thrun] I want to show you a little animation of hidden Markov models</text><text start="5" dur="2">used for robot localization.</text><text start="7" dur="5">This is obviously a little toy robot over here that lives in the grid world,</text><text start="12" dur="4">and the grid world is composed of discrete cells where the robot may be located.</text><text start="16" dur="4">This robot happens to know where north is at all times.</text><text start="20" dur="4">It&amp;#39;s given 4 sensors, a wall sensor to the left, to the right, to the top </text><text start="24" dur="6">and the bottom over here, and it can sense whether in the adjacent cell there&amp;#39;s a wall or not.</text><text start="30" dur="7">Initially this robot has no clue where it is. 
It faces what we call a global localization problem.</text><text start="37" dur="6">It now uses its sensors and its actuators to localize itself.</text><text start="43" dur="6">So in the very first episode the robot senses a wall north and south of it</text><text start="49" dur="3">but none west or east.</text><text start="52" dur="4">And look what this does to the probabilities.</text><text start="56" dur="2">The posterior probability is now increased </text><text start="58" dur="3">in places that are consistent with this measurement,</text><text start="61" dur="5">like all of those places that have a wall in the north and south, like these guys over here,</text><text start="66" dur="3">and free space to the left and the right,</text><text start="69" dur="5">yet it has been decreased in places that are inconsistent, like this guy over here.</text><text start="74" dur="4">These states over here are interesting. They are shaded gray and lighter gray.</text><text start="78" dur="3">What this means is they still have a significant probability </text><text start="81" dur="3">but not as much as over here,</text><text start="84" dur="5">the reason being that this measurement over here would be characteristic </text><text start="89" dur="4">for the state over here if there had been exactly 1 measurement error--</text><text start="93" dur="6">if the bottom sensor had erred and erroneously detected a wall.</text><text start="99" dur="4">Errors are less likely than no errors, and as a result, the cell over here</text><text start="103" dur="4">which is completely consistent ends up being more likely than the cell over here,</text><text start="107" dur="6">yet you can see the HMM does a nice job in understanding the posterior probability.</text><text start="113" dur="4">Let&amp;#39;s assume the robot moves right and senses again</text><text start="117" dur="2">and gets the exact same measurement.</text><text start="119" dur="3">Of course it has no clue that it is exactly over 
here.</text><text start="122" dur="2">You can see the probabilities being decayed.</text><text start="124" dur="4">Interestingly enough, this guy over here has a lower probability,</text><text start="128" dur="4">and the reason is by itself it is very consistent with the most recent measurement,</text><text start="132" dur="5">but it&amp;#39;s less consistent with the idea of having moved right and measured before</text><text start="137" dur="2">a wall to the north and the south.</text><text start="139" dur="4">And similarly, these places over here become less consistent.</text><text start="143" dur="3">The only ones that are completely consistent are these 3 states over here</text><text start="146" dur="2">and the 3 states over here.</text><text start="148" dur="2">The robot keeps moving to the right,</text><text start="150" dur="4">and now we get to the point where the sequence of measurements</text><text start="154" dur="2">really makes 2 states equally likely--the ones over here.</text><text start="156" dur="2">They are equally likely by symmetry.</text><text start="158" dur="6">Those are still pretty likely, and those are gradually less likely over here to the left.</text><text start="164" dur="4">As the robot now moves, it moves into a distinguishing state.</text><text start="168" dur="4">It sees a wall in the north but free space in the 3 other directions,</text><text start="172" dur="3">and that renders the state over here relatively unlikely,</text><text start="175" dur="3">and now it has localized itself.</text></transcript></video><video title="22.mp4 HMM Equations.mp4" id="UjfgRO1kySM" length="156"><transcript><text start="0" dur="5">[Thrun] We discussed specific instances of hidden Markov model inference or filtering</text><text start="5" dur="2">in our quizzes.</text><text start="7" dur="2">Let me now give you the basic math.</text><text start="9" dur="4">We all know a hidden Markov model is a chain like this</text><text start="13" dur="4">of hidden 
states that are Markovian </text><text start="17" dur="5">and measurements that only depend on the corresponding state.</text><text start="22" dur="3">We know that this Bayes network entailed certain independencies.</text><text start="25" dur="9">For example, given X2 the past, the future, and the present measurement </text><text start="34" dur="3">are all conditionally independent given X2.</text><text start="37" dur="5">The nice thing about this structure is it makes it possible to efficiently do inference.</text><text start="42" dur="7">I&amp;#39;ll give you the equations we used before here in a more explicit form.</text><text start="49" dur="6">Let&amp;#39;s look at the measurement side, and suppose we wish to know the probability</text><text start="55" dur="4">of an internal state variable given a specific measurement,</text><text start="59" dur="7">and that by Bayes rule becomes P of Z1 given X1 times P of X1 over P of Z1.</text><text start="66" dur="4">When you start doing this, you&amp;#39;ll find that the normalizer </text><text start="70" dur="3">doesn&amp;#39;t depend on the target variable X; </text><text start="73" dur="6">therefore, we often write a proportionality sign and get an equation like this.</text><text start="79" dur="5">This product over here is the basic measurement update of hidden Markov models.</text><text start="84" dur="3">And the thing to remember when you apply it, you have to normalize.</text><text start="87" dur="3">We already practiced all of this, so you know all of this.</text><text start="90" dur="3">The other equation is the prediction equation,</text><text start="93" dur="3">so let&amp;#39;s go from X1 to X2.</text><text start="96" dur="4">This is called prediction even though sometimes it has nothing to do with prediction.</text><text start="100" dur="3">It&amp;#39;s the traditional term, but it comes from the fact that we might want to predict</text><text start="103" dur="6">the distribution of X2 given that we know the 
distribution of X1.</text><text start="109" dur="2">Here we apply total probability.</text><text start="111" dur="8">The probability of X2 is obtained by checking all states we might have come from in X1</text><text start="119" dur="4">and calculating the probability of going from X1 to X2.</text><text start="123" dur="3">We also practiced this before.</text><text start="126" dur="6">Any probability of X2 being in a certain state must have come from another state, X1,</text><text start="132" dur="3">and then transitioned into X2, so we sum over all of those </text><text start="135" dur="3">and we get the posterior probability of X2.</text><text start="138" dur="6">These 2 equations together form the math of a hidden Markov model</text><text start="144" dur="5">where the next state distribution and the measurement distribution</text><text start="149" dur="7">and the initial state distribution are all given as the parameters of a hidden Markov model.</text></transcript></video><video title="23 HMM Localization Example.mp4" id="8mi8z-EnYq8" length="164"><transcript><text start="0" dur="5">[Thrun] Here is the application of HMM to a real robot localization example.</text><text start="5" dur="4">This robot is in a world that&amp;#39;s 1-dimensional and it is lost.</text><text start="9" dur="3">It has initial uncertainty about where it is,</text><text start="12" dur="4">and it is actually located next to a door but it doesn&amp;#39;t know.</text><text start="16" dur="2">It&amp;#39;s also given a map of the world,</text><text start="18" dur="6">and the distribution of all possible states, here noted as s, is given by this histogram.</text><text start="24" dur="7">We bin the world into small bins, and for each bin we assign a single numerical probability</text><text start="31" dur="2">of the robot being there.</text><text start="33" dur="4">The fact they have all the same height means that the robot is maximally uncertain</text><text start="37" dur="2">as to where it 
is.</text><text start="39" dur="2">Let&amp;#39;s assume this robot is going to sense</text><text start="41" dur="2">and it senses to be next to a door.</text><text start="43" dur="4">The red graph over here is the probability of seeing a door </text><text start="47" dur="2">for different locations in the environment.</text><text start="49" dur="4">There are 3 different doors, and seeing a door is more likely here</text><text start="53" dur="2">than it is in between.</text><text start="55" dur="3">It might still see a door here, but it&amp;#39;s just less likely.</text><text start="58" dur="2">We now apply Bayes rule.</text><text start="60" dur="7">We multiply the prior with this measurement probability to obtain the posterior.</text><text start="67" dur="3">That was our measurement update. It&amp;#39;s that simple.</text><text start="70" dur="7">So you can see how all these uniform values over here become nonuniform values over here</text><text start="77" dur="3">multiplied by this curve over here.</text><text start="80" dur="5">The story progresses by the robot taking an action to the right,</text><text start="85" dur="5">and this is now the next state prediction part, the what we call convolution part</text><text start="90" dur="6">or state transition part, where these little bumps over here get shifted along with the robot</text><text start="96" dur="4">and they are flattened out a little bit just because robot motion has some uncertainty.</text><text start="100" dur="3">Again, it&amp;#39;s a really simple operation.</text><text start="103" dur="3">You shift those to the right and you smooth them out a little bit</text><text start="106" dur="5">to account for the control noise in the robot&amp;#39;s actuators.</text><text start="111" dur="2">And now we get to the point that the robot senses again,</text><text start="113" dur="3">and this robot senses a door again.</text><text start="116" dur="3">And see what happens. 
It multiplies.</text><text start="119" dur="7">It&amp;#39;s now a nonuniform prior over here with the same measurement probability as before,</text><text start="126" dur="3">but now we get a distribution that&amp;#39;s peaked over here </text><text start="129" dur="3">and has smaller bumps at various other places,</text><text start="132" dur="5">the reason being the only place where my prior has a higher probability</text><text start="137" dur="4">and my measurement probability is also high probability is the second door,</text><text start="141" dur="3">and as a result of our distribution over here, it assumes a much larger value.</text><text start="144" dur="4">If you look at that picture, that is really easy to implement,</text><text start="148" dur="4">and that&amp;#39;s what we did all along when we talked about rain and sun and so on.</text><text start="152" dur="3">It&amp;#39;s really a very simple algorithm.</text><text start="155" dur="6">Measurements are multiplications, and motion become essentially convolutions</text><text start="161" dur="3">which are shifts with added noise.</text></transcript></video><video title="24.mp4 Particle Filters.mp4" id="H0G1yslM5rc" length="227"><transcript><text start="0" dur="5">[Thrun] This is a great segue to one of the most successful algorithms</text><text start="5" dur="4">in artificial intelligence and robotics called particle filters.</text><text start="9" dur="4">Again, the topic here is robot localization,</text><text start="13" dur="4">and here we&amp;#39;re dealing with a real robot with actual sensor data.</text><text start="17" dur="3">The robot is lost in this building.</text><text start="20" dur="4">You can see different rooms, and you can see corridors,</text><text start="24" dur="2">and the robot is equipped with range sensors.</text><text start="26" dur="5">These are sound sensors that measure the range to nearby obstacles.</text><text start="31" dur="4">Its task is to figure out where it is.</text><text 
start="35" dur="4">The robot will move along the black line over here, but it doesn&amp;#39;t know this.</text><text start="39" dur="2">It has no clue where it is.</text><text start="41" dur="2">It has to figure out where it is.</text><text start="43" dur="5">The key thing in particle filters is the representation of the belief.</text><text start="48" dur="6">Whereas before we had discrete worlds like our sun and rain example</text><text start="54" dur="5">or we had a histogram approach where we cut the space into small bins,</text><text start="59" dur="3">particle filters have a very different representation.</text><text start="62" dur="5">They represent the space by a collection of points or particles.</text><text start="67" dur="5">Each of these small dots over here is a hypothesis where the robot might be.</text><text start="72" dur="7">It&amp;#39;s a concrete value of its X location and its Y location and its heading direction</text><text start="79" dur="2">in this environment.</text><text start="81" dur="2">So it&amp;#39;s a vector of 3 values.</text><text start="83" dur="7">The sum or set of all those vectors together form the belief space.</text><text start="90" dur="3">So particle filters approximate a posterior</text><text start="93" dur="3">by many, many, many guesses,</text><text start="96" dur="5">and the density of those guesses represents the posterior probability</text><text start="101" dur="3">of being at a certain location.</text><text start="104" dur="3">To illustrate this, let me run the video.</text><text start="107" dur="4">You can see in a very short amount of time the range sensors,</text><text start="111" dur="6">even though they&amp;#39;re very noisy, force the particles to collect in the corridor.</text><text start="117" dur="4">There&amp;#39;s 2 symmetrical point dots--this one over here and this one over here--</text><text start="121" dur="3">that come from the fact that the corridor itself is symmetric.</text><text start="124" 
dur="4">But as the robot moves into the office, the symmetry is broken.</text><text start="128" dur="3">This office looks very different from this office over here,</text><text start="131" dur="3">and those particles die out.</text><text start="134" dur="2">What&amp;#39;s happening here?</text><text start="136" dur="5">Intuitively speaking, each particle is a representation of a possible state,</text><text start="141" dur="3">and the more consistent the particle with the measurement,</text><text start="144" dur="5">the more the sonar measurement fits into the place where the particle says the robot is,</text><text start="149" dur="2">the more likely it is to survive.</text><text start="151" dur="3">This is the essence of particle filters.</text><text start="154" dur="4">Particle filters use many particles to represent a belief,</text><text start="158" dur="6">and they will let those particles survive in proportion to the measurement probability.</text><text start="164" dur="5">And the measurement probability here is nothing else but the consistency</text><text start="169" dur="4">of the sonar range measurements with the map of the environment </text><text start="173" dur="2">given the particle place.</text><text start="175" dur="2">Let me play this again.</text><text start="177" dur="3">Here&amp;#39;s the maze. 
The robot is lost in space.</text><text start="180" dur="5">Again, you can see how within very few steps the particles </text><text start="185" dur="5">consistent with the range measurements all accumulate in the corridor.</text><text start="190" dur="4">As the robot hits the end of the corridor, only 2 particle clouds survive</text><text start="194" dur="5">due to the symmetry of the corridor, and the particles finally die out.</text><text start="199" dur="2">This algorithm is beautiful, </text><text start="201" dur="6">and you can implement it in less than 10 lines of program code.</text><text start="207" dur="5">So given all the difficulty of talking of probabilities and Bayes network</text><text start="212" dur="4">and hidden Markov models, you will now find a way </text><text start="216" dur="5">to implement one of the most amazing algorithms for filtering and state estimation</text><text start="221" dur="4">in less than 10 lines of C code. </text><text start="225" dur="2">Isn&amp;#39;t that amazing?</text></transcript></video><video title="25.mp4 Localization and Particle Filters.mp4" id="qQQYkvS5CzU" length="267"><transcript><text start="0" dur="3">[Thrun] Here is our 1-dimensional localization example again,</text><text start="3" dur="3">this time with particle filters.</text><text start="6" dur="3">You can see the particles initially spread out uniformly.</text><text start="9" dur="4">This 1-dimensional space of forward locations you&amp;#39;re going to use as an example</text><text start="13" dur="3">to explain every single step of particle filters.</text><text start="16" dur="4">In the very first step, the robot senses a door.</text><text start="20" dur="4">Here is its initial particles before sensing the door.</text><text start="24" dur="6">It now copies these particles over verbatim but gives them what&amp;#39;s called a weight.</text><text start="30" dur="3">We call this weight the importance weight,</text><text start="33" dur="4">and the importance 
weight is nothing else but the measurement probability.</text><text start="37" dur="4">It&amp;#39;s more likely to see a door over here than over here.</text><text start="41" dur="3">The red curve over here is the measurement probability,</text><text start="44" dur="4">and the particles over here are the same as up here,</text><text start="48" dur="4">but they now attached an importance weight where the height of the particle</text><text start="52" dur="2">illustrates the weight.</text><text start="54" dur="3">So you can see the place over here, the place over here, and the place over here</text><text start="57" dur="3">carry the most weight because they&amp;#39;re the most likely ones.</text><text start="60" dur="7">This robot moves and it moves by using its previous particles</text><text start="67" dur="5">to create a new random particle set that represents the posterior probability</text><text start="72" dur="3">of being at a new location.</text><text start="75" dur="4">The key thing here is called resampling.</text><text start="79" dur="2">The algorithm works as follows.</text><text start="81" dur="7">Pick a particle from the set over here and pick it in proportion to the importance weight.</text><text start="88" dur="4">Once you&amp;#39;ve picked one--and sure enough, you pick those more frequently</text><text start="92" dur="5">than those over here--add the motion to it plus a little bit of noise</text><text start="97" dur="2">to create a new particle.</text><text start="99" dur="3">Repeat this procedure for each particle.</text><text start="102" dur="2">Pick them with replacement.</text><text start="104" dur="3">You&amp;#39;re allowed to pick a particle twice or 3 or 4 times.</text><text start="107" dur="2">Sure enough, you pick these more frequently.</text><text start="109" dur="4">These are being forward moved to over here, these to over here.</text><text start="113" dur="3">You see a higher density of particles over here and over here,</text><text 
start="116" dur="2">than you see, for example, over here.</text><text start="118" dur="3">That&amp;#39;s your forward prediction step in particle filters.</text><text start="121" dur="3">It&amp;#39;s really easy to implement.</text><text start="124" dur="2">The next step is another measurement step,</text><text start="126" dur="4">and here I&amp;#39;m illustrating to you that indeed this nonuniform set of particles</text><text start="130" dur="3">leads to a reasonable posterior in this space.</text><text start="133" dur="3">We now have a particle set that&amp;#39;s nonuniform.</text><text start="136" dur="4">We have increased density over here, over here, and over here.</text><text start="140" dur="5">You can see how multiplying these particles with the importance weight,</text><text start="145" dur="6">which is copying them over verbatim but attaching a vertical importance weight</text><text start="151" dur="2">in proportion to the measurement probability,</text><text start="153" dur="4">yields a lot of particles over here with big weights,</text><text start="157" dur="4">some over here with big weights, lots of particles over here with low weights.</text><text start="161" dur="5">They got copied over, but the measurement probability here is low and so on and so on.</text><text start="166" dur="4">And if you look at this set of particles, you already understand </text><text start="170" dur="6">why the majority of importance weight resides in the correct location</text><text start="176" dur="3">given that we had a measurement of a door and motion to the right</text><text start="179" dur="2">and another measurement of the door.</text><text start="181" dur="5">The nice thing here is that particle filters work in continuous spaces,</text><text start="186" dur="7">and, what&amp;#39;s often underappreciated, they use your computational resources</text><text start="193" dur="3">in proportion to how likely something is.</text><text start="196" dur="3">You can see that almost all 
the computation now resides over here,</text><text start="199" dur="2">almost all the memory resides over here,</text><text start="201" dur="2">and that&amp;#39;s the place that&amp;#39;s likely.</text><text start="203" dur="4">Stuff over here requires less memory, less computation, and guess what? </text><text start="207" dur="2">It&amp;#39;s much less likely.</text><text start="209" dur="5">So particle filters make use of your computational resources in an intelligent way.</text><text start="214" dur="4">They&amp;#39;re really nice to implement on something with low compute power.</text><text start="218" dur="4">Let me move on to explain the next motion.</text><text start="222" dur="3">Here you see our robot moving to the right again,</text><text start="225" dur="4">and now the same what we call resampling takes place.</text><text start="229" dur="3">We pick, with replacement, particles from over here.</text><text start="232" dur="3">Sure enough, these are the ones we pick the most often.</text><text start="235" dur="4">And then we add the motion command plus some random noise.</text><text start="239" dur="4">If you look at this particle set over here, almost all the particles sit over here.</text><text start="243" dur="3">It doesn&amp;#39;t really show it very well on this computer screen,</text><text start="246" dur="4">but the density of particles over here is significantly higher than anywhere else.</text><text start="250" dur="4">There&amp;#39;s occurrences over here and over here that correspond with these guys over here</text><text start="254" dur="4">and these guys over here and over here, correspond to this guy over here,</text><text start="258" dur="4">but the vast majority of probability mass sits over here.</text><text start="262" dur="5">So let&amp;#39;s dive into how complicated this algorithm really is.</text></transcript></video><video title="26.mp4 Particle Filter Algorithm.mp4" id="lwg_KI3UewY" length="178"><transcript><text start="0" 
dur="4">[Thrun] So here is our algorithm particle filter.</text><text start="4" dur="5">It gets as an input a set of particles with associated importance weights,</text><text start="9" dur="3">a control, and a measurement vector,</text><text start="12" dur="4">and it constructs a new particle set S prime </text><text start="16" dur="4">and in doing so it also has an auxiliary variable, eta.</text><text start="20" dur="2">Here is the algorithm.</text><text start="22" dur="5">Initially we go through all new particles of which there are n</text><text start="27" dur="5">and we sample an index j according to the distribution</text><text start="32" dur="6">defined by the importance weights associated with the particle set over here.</text><text start="38" dur="3">Put differently, we have a set of particles over here</text><text start="41" dur="4">and we have associated importance factors which we will construct a little bit later on,</text><text start="45" dur="3">and now we pick one of these particles with replacement</text><text start="48" dur="7">where the probability of picking this particle is exactly the importance weight, w.</text><text start="55" dur="6">For this particle we now sample a possible successor state</text><text start="61" dur="5">according to the state transition probability using our controls</text><text start="66" dur="5">and that specific particle as an input. 
We call it sj over here.</text><text start="71" dur="5">We also compute an importance weight, which is the measurement probability</text><text start="76" dur="4">for that specific particle over here.</text><text start="80" dur="5">This gives us a new particle, and this gives us a new non-normalized importance weight.</text><text start="85" dur="7">For now we just add them into our new particle set S prime and we reiterate.</text><text start="92" dur="5">The only thing missing now is at the very end we have to normalize all the weights.</text><text start="97" dur="3">For this we keep our running counter, eta,</text><text start="100" dur="5">and we have a For loop in which we take all the weights in the set over here</text><text start="105" dur="3">and just normalize them accordingly.</text><text start="108" dur="3">This is the entire algorithm.</text><text start="111" dur="4">We feed in over here particles with associated importance weights </text><text start="115" dur="3">and a control and a measurement,</text><text start="118" dur="7">and then we construct the new set of particles by picking particles from our previous set</text><text start="125" dur="5">at random with replacement but in accordance to the importance weights,</text><text start="130" dur="3">so important particles are picked more frequently.</text><text start="133" dur="3">We guess for this particle this will be a state.</text><text start="136" dur="4">We guess what a new state might be by just sampling it,</text><text start="140" dur="3">and we attach it an importance weight which we later normalize</text><text start="143" dur="5">that is proportional to the measurement probability for this thing over here.</text><text start="148" dur="3">So you&amp;#39;re going to upweigh the particles that look consistent with the measurements</text><text start="151" dur="3">and downweigh the ones that are non-consistent.</text><text start="154" dur="4">We add all of these things back to our particle sets and 
reiterate.</text><text start="158" dur="2">I promised you it would be an easy algorithm.</text><text start="160" dur="5">You can look at this, and you could actually implement this really easily.</text><text start="165" dur="3">Just remember how much difficulty we introduced</text><text start="168" dur="6">with talking about Bayes networks and hidden Markov models and all that stuff.</text><text start="174" dur="4">This is all there is to implement particle filters.</text></transcript></video><video title="27.mp4 Particle Filters Pros and Cons.mp4" id="eWsEyJCVXoo" length="99"><transcript><text start="0" dur="3">[Thrun] Particle filters are really easy to implement.</text><text start="3" dur="2">They have some deficiencies. </text><text start="5" dur="3">They don&amp;#39;t really scale to high-dimensional spaces.</text><text start="8" dur="3">That&amp;#39;s been recognized because the number of particles you need </text><text start="11" dur="3">to fill a high-dimensional space tends to grow exponentially </text><text start="14" dur="3">with the dimensionality of the space.</text><text start="17" dur="3">So for 100 dimensions it&amp;#39;s hard to make work.</text><text start="20" dur="2">But there are extensions.</text><text start="22" dur="5">They go under really fancy names like Rao-Blackwellized particle filters</text><text start="27" dur="4">that can actually do this, but I won&amp;#39;t talk about them in any detail here.</text><text start="31" dur="5">They also have problems with degenerate conditions.</text><text start="36" dur="4">For example, they don&amp;#39;t work well if you only have 1 particle or 2 particles.</text><text start="40" dur="5">They tend not to work well if you have no noise in your measurement model</text><text start="45" dur="2">or no noise in your controls.</text><text start="47" dur="3">You need this kind of to remix things a little bit.</text><text start="50" dur="5">If there is very little noise, you have to deviate from the basic 
paradigm.</text><text start="55" dur="5">But the good news is they work really well in many, many applications.</text><text start="60" dur="5">For example, our self-driving cars use particle filters for localization and for mapping</text><text start="65" dur="2">and for a number of other things.</text><text start="67" dur="4">And the reason why they work so well is they&amp;#39;re really easy to implement,</text><text start="71" dur="5">they&amp;#39;re computationally efficient in the sense that they really put the computational resources</text><text start="76" dur="6">where they are needed the most, and they can deal with highly non-monotonic </text><text start="82" dur="3">and very complex posterior distributions that have many peaks.</text><text start="85" dur="3">And that&amp;#39;s important. Many other filters can&amp;#39;t.</text><text start="88" dur="5">So particle filters are often the method of choice when it comes to building quickly</text><text start="93" dur="6">an estimation method for problems where the posterior is complex.</text></transcript></video><video title="28.mp4 Conclusion.mp4" id="GLk5Ss7PPD0" length="49"><transcript><text start="0" dur="6">[Thrun] Wow! 
You learned a lot about hidden Markov models and particle filters.</text><text start="6" dur="4">Particle filters is the most used algorithm in robotics today </text><text start="10" dur="2">when it comes to interpreting sensor data,</text><text start="12" dur="3">but these algorithms are applicable in a wide array of applications</text><text start="15" dur="7">such as finance, medicine, behavioral studies, time series analysis, speech,</text><text start="22" dur="6">language technologies, anything involving time and sensors or uncertainty.</text><text start="28" dur="2">And now you know how to use them.</text><text start="30" dur="5">You can apply them to all these problems if you listened carefully.</text><text start="35" dur="2">It&amp;#39;s been a pleasure teaching you with this class.</text><text start="37" dur="4">As I told you in the beginning, this is a topic very close to my heart,</text><text start="41" dur="4">and I hope it&amp;#39;s going to empower you to do better stuff </text><text start="45" dur="4">in any domain involving time series and uncertainty.</text></transcript></video></group><group title="mdpreview 1" count="8"><video title="01 Deterministic Question.mp4" id="1wt-zZsQsZU" length="82"><transcript><text start="0" dur="4">Here is a sequence of MDP questions.</text><text start="4" dur="4">We&amp;#39;re given a maze environment with 8 fields </text><text start="8" dur="8">where we receive +100 over here and -100 in the corner over here.</text><text start="16" dur="7">Our agent can go north, south, west, or east, but actions may fail at random.</text><text start="23" dur="7">With probability &amp;quot;P,&amp;quot; and P is a number between 0 and 1, the action succeeds,</text><text start="30" dur="3">and with 1 - P we go into reverse.</text><text start="33" dur="5">For example, if we take the action go east into this state over here,</text><text start="38" dur="3">with P probability we find ourselves over there.</text><text start="41" 
dur="5">With 1 - P we find ourselves right over here in the exact opposite direction.</text><text start="46" dur="7">Here is the east action again. With P we go to the right, and with 1 - P we go to the left.</text><text start="53" dur="3">Of course, if you bounce into a wall we stay where we are.</text><text start="56" dur="4">For my first question, I&amp;#39;ll assume P equals 1.</text><text start="60" dur="3">There is no uncertainty in action outcome, and there is no failure.</text><text start="63" dur="3">The state transition function is deterministic.</text><text start="66" dur="6">I want you to fill in for each state the final value after running value iteration to completion,</text><text start="72" dur="7">and please assume the cost is -4 and we use gamma equals 1 as the discount factor.</text><text start="79" dur="3">Please fill in those missing six values over here.</text></transcript></video><video title="02 Deterministic Answer.mp4" id="XD5zwvcSkPE" length="13"><transcript><text start="0" dur="5">And the answer is obtained by looking at the nearest value -4, </text><text start="5" dur="8">which gives us 96 over here, 92, 88, and 84.</text></transcript></video><video title="03 Single Backup Question.mp4" id="zSsvhAxXDn0" length="33"><transcript><text start="0" dur="8">Let us now assume that P = 0.8, which means actions fail with probability 0.2.</text><text start="8" dur="5">Again, the cost is -4, and gamma equals 1.</text><text start="13" dur="6">I want you to run exactly one value calculation for the state in red up here.</text><text start="19" dur="4">Assuming that the value function is initialized with 0 everywhere,</text><text start="23" dur="6">what will be the value after a single value iteration for the state up here.</text><text start="29" dur="4">This is the state a4.</text></transcript></video><video title="04 Single Backup Answer.mp4" id="koFcuhvouLI" length="23"><transcript><text start="0" dur="3">The answer is 76.</text><text start="3" 
dur="4">The value of over here is maximized for the south action,</text><text start="7" dur="4">which we reach with 0.8 chance with 0.2 chance we&amp;#39;ll stay</text><text start="11" dur="4"> in the same state for which initial value was 0,</text><text start="15" dur="8">and we subtract the action cost of 4, which is 80 + 0 - 4 = 76.</text></transcript></video><video title="05 Convergence Question.mp4" id="v-SQcbXsrNE" length="32"><transcript><text start="0" dur="7">Now using the same premise as before, P equals 0.8, cost equals -4 and gamma equals 1,</text><text start="7" dur="4">I&amp;#39;d like you to run value iteration to completion.</text><text start="11" dur="6">For the one state over here, a4, I&amp;#39;d like to know what is its final value.</text><text start="17" dur="3">Now you might be tempted to write a piece of computer software,</text><text start="20" dur="6">but for this specific state, it&amp;#39;s actually possible to do it with a relatively simple piece of math.</text><text start="26" dur="2">It&amp;#39;s not trivial, but give it a try.</text><text start="28" dur="4">What is the value of a4 after convergence?</text></transcript></video><video title="06 Convergence Answer.mp4" id="heruKSF2Qes" length="43"><transcript><text start="0" dur="3">The answer is 95. </text><text start="3" dur="3">To see, we observe the dominant equation suggests that</text><text start="6" dur="3">each new iteration doesn&amp;#39;t change the value.</text><text start="9" dur="4">We kind of know that the optimal policy is to go south over here, </text><text start="13" dur="3">so just really the value iteration for going south.</text><text start="16" dur="2">Let&amp;#39;s call it the value of x.</text><text start="18" dur="8">We know that x is updated by 0.8 times 100 plus 0.2 of staying in the same state,</text><text start="26" dur="3">whose value is x minus the cost.</text><text start="29" dur="5">This invariance must hold true after convergence. 
We can now resolve it for x.</text><text start="34" dur="9">We get 0.8x equals 76, so 76 divided by 0.8 is 95.</text></transcript></video><video title="07 Optimal Policy Question.mp4" id="N4WjgNuTZfc" length="27"><transcript><text start="0" dur="7">Finally, I&amp;#39;d like to ask you what is the optimal policy for the parameters we just studied.</text><text start="7" dur="7">I&amp;#39;m listing here all states--a1, a2, a3, a4, a5, b2, and b3--</text><text start="14" dur="7">and I&amp;#39;d like you to tell me whether you would like to go north, south, west, or east</text><text start="21" dur="2">in any of those six states over here.</text><text start="23" dur="4">For each of those there is exactly one correct answer.</text></transcript></video><video title="08 Optimal Policy Answer.mp4" id="I5Svdb0fpBs" length="31"><transcript><text start="0" dur="7">It is easy to see that these three states over here, a1 to a3, you want to go east.</text><text start="7" dur="3">In the state a4 up here, you wish to go south.</text><text start="10" dur="8">In b2 you wish to go north so as to not risk the 0.2 probability of falling into -100.</text><text start="18" dur="3">In b3 it&amp;#39;s perfectly fine to go east. 
</text><text start="21" dur="2">In the worst case you find yourself in b2, </text><text start="23" dur="3">in which case you can safely escape the -100 by going north, </text><text start="26" dur="3">turn right again, and go down over here.</text><text start="29" dur="2">This is the correct set of answers over here.</text></transcript></video></group><group title="Midterm 1" count="15"><video title="Question 01.mp4" id="n5L615D9HSw" length="59"><transcript><text start="0" dur="4">In this question I would like you to check all the boxes that are true.</text><text start="4" dur="6">First, there exists at least one environment in which every agent is rational,</text><text start="10" dur="3">by which I mean optimal.</text><text start="13" dur="6">For every agent, there exists (at least) one environment in which the agent is rational.</text><text start="19" dur="7">To solve the sliding-tile 15-puzzle, an optimal agent that searches will usually require</text><text start="26" dur="4">less memory than an optimal table-lookup reflex agent.</text><text start="30" dur="4">By &amp;quot;usually&amp;quot; I mean there are always extreme cases that you can construct</text><text start="34" dur="4">where this isn&amp;#39;t the case. 
I ask about the common case.</text><text start="38" dur="2">Here is the sliding-tile 15-puzzle, </text><text start="40" dur="3">and you can see this is a puzzle where you move these pieces around</text><text start="43" dur="2">until all these numbers are in order.</text><text start="45" dur="3">It&amp;#39;s a somewhat combinatorial search problem.</text><text start="48" dur="6">Finally, to solve the sliding-tile 15-puzzle, an agent that searches will always do better--</text><text start="54" dur="2">that means it will always find shorter paths--</text><text start="56" dur="3">than a table-lookup reflex agent.</text></transcript></video><video title="Question 02.mp4" id="cGeI1EN9GlE" length="35"><transcript><text start="0" dur="5">This question is about A* search for the heuristic function h,</text><text start="5" dur="2">which is indicated in the graph over here.</text><text start="7" dur="3">An action costs 10 per step.</text><text start="10" dur="6"> Enter into each node of this graph the order when the node is expanded.</text><text start="16" dur="3">That is the same as removed from the queue in A*.</text><text start="19" dur="3">Start with a &amp;quot;1&amp;quot; in the start state over here at the top</text><text start="22" dur="4">and enter &amp;quot;0&amp;quot; if the node will never be expanded. 
</text><text start="26" dur="3">This is a graph where we have a whole number of nodes,</text><text start="29" dur="2">and the heuristic function is indicated over here.</text><text start="31" dur="4">I&amp;#39;m also asking you is the heuristic h admissible?</text></transcript></video><video title="Question 03.mp4" id="oedKzzOfyuc" length="8"><transcript><text start="0" dur="2">Here&amp;#39;s an easy question.</text><text start="2" dur="3">For coin X, we know that the probability of heads is 0.3.</text><text start="5" dur="3">What is the probability of tails?</text></transcript></video><video title="Question 04.mp4" id="Ktc8cdjc6VI" length="22"><transcript><text start="0" dur="7">In this probability question, we study a potentially loaded or unfair coin, which we flip twice.</text><text start="7" dur="6">Say the probability for it coming up heads both times is 0.04.</text><text start="13" dur="3">These are independent experiments with the same coin.</text><text start="16" dur="6">I wonder what is the probability it comes up tails twice if we flip the same coin twice.</text></transcript></video><video title="Question 05.mp4" id="TU2Oy4wmqXE" length="37"><transcript><text start="0" dur="7">We now have two coins--one fair coin for which the probability of heads is 0.5,</text><text start="7" dur="4">and a loaded coin for which the probability of heads is 1.</text><text start="11" dur="4">This might be a coin where heads is on both sides.</text><text start="15" dur="4">We now pick a coin at random with 0.5 chance,</text><text start="19" dur="3">and we don&amp;#39;t quite know which coin we&amp;#39;ve picked,</text><text start="22" dur="3">but we do flip this coin, and we see &amp;quot;heads.&amp;quot;</text><text start="25" dur="3">What is the probability that this is the loaded coin?</text><text start="28" dur="6">We now flip this coin again (the same coin), and we see &amp;quot;heads&amp;quot; again for a second time.</text><text start="34" dur="3">What is now the probability 
that this is the loaded coin?</text></transcript></video><video title="Question 06.mp4" id="v_qni8a_X7s" length="30"><transcript><text start="0" dur="2">Here&amp;#39;s a Bayes network question. </text><text start="2" dur="3">Consider the following Bayes network with variables A all the way to I.</text><text start="5" dur="3">A and B connect into E. C and D connect into F.</text><text start="8" dur="6">E connects into G and H, and F also connects into H but also into I.</text><text start="14" dur="4">I&amp;#39;m asking the question whether A is independent of B,</text><text start="18" dur="3">A is conditionally independent of B given E.</text><text start="21" dur="3">A is conditionally independent of B given G.</text><text start="24" dur="3">A is conditionally independent of B given F,</text><text start="27" dur="3">and A is conditionally independent of C given G.</text></transcript></video><video title="Question 07.mp4" id="tN0pUARoKn4" length="29"><transcript><text start="0" dur="3">Check out this Bayes network over here,</text><text start="3" dur="3">which is defined by the following conditional probability table:</text><text start="6" dur="2">P of A equals 0.5. </text><text start="8" dur="2">A connects into B and C.</text><text start="10" dur="2">P of B given A is equal to 0.2. 
</text><text start="12" dur="2">P of B given not A is also 0.2.</text><text start="14" dur="3">P of C given A is equal to 0.8.</text><text start="17" dur="2">P of C given not A is 0.4.</text><text start="19" dur="4">You&amp;#39;ll find an interesting oddity in this table if you look very carefully.</text><text start="23" dur="3">I&amp;#39;d like to ask you what is the probability of B given C,</text><text start="26" dur="3">and what is the probability of C given B?</text></transcript></video><video title="Question 08.mp4" id="wJ_6Ui5ElB4" length="68"><transcript><text start="0" dur="3">In this question, we apply naive Bayes with Laplacian smoothing--</text><text start="3" dur="2">the same as we have learned in class.</text><text start="5" dur="6">We have now 2 classes of movies. One is called &amp;quot;old&amp;quot; and one is called &amp;quot;new.&amp;quot;</text><text start="11" dur="4">There are titles in here. There&amp;#39;s three old movies--Top Gun, Shy People, Top Hat.</text><text start="15" dur="3">Two new movies--Top Gear, Gun Shy.</text><text start="18" dur="6">Use Laplacian smoothing with k=1 to compute the probability of a movie being old--</text><text start="24" dur="4">this is a prior probability, which is just based on class counts--</text><text start="28" dur="6">the probability of the word &amp;quot;top&amp;quot; as a title word in the class of old movies,</text><text start="34" dur="3">and the probability that a new movie that we look at--</text><text start="37" dur="3">by new I mean a movie we&amp;#39;ve never seen before--</text><text start="40" dur="5">that is called &amp;quot;top,&amp;quot; the probability that this movie corresponds</text><text start="45" dur="3"> to the old movie class rather than the new movie class.</text><text start="48" dur="3">I recommend you use a single dictionary for smoothing,</text><text start="51" dur="4">so look at all the words and see how large the dictionary is.</text><text start="55" dur="2">Top occurs here in two 
different ways.</text><text start="57" dur="4">One is a word over here, but one also is a movie title over here.</text><text start="61" dur="4">Don&amp;#39;t pay too much attention to it, just don&amp;#39;t get confused by it.</text><text start="65" dur="3">Again, use Laplacian smoothing with k=1.</text></transcript></video><video title="Question 09.mp4" id="0nB4nvEYcFY" length="25"><transcript><text start="0" dur="6">In this question, I&amp;#39;m giving a set of data points--some positive, some negative--and a query point.</text><text start="6" dur="4">Given the following labeled data set, I&amp;#39;d like to find the minimum of &amp;quot;k&amp;quot;</text><text start="10" dur="4">for which the query point over here becomes negative.</text><text start="14" dur="2">Enter &amp;quot;0&amp;quot; if this is impossible. </text><text start="16" dur="5">Ties are broken at random, and I&amp;#39;d suggest trying to avoid them,</text><text start="21" dur="4">because you might not be able to guarantee that the class is negative.</text></transcript></video><video title="Question 10.mp4" id="KumU1dlAlsU" length="24"><transcript><text start="0" dur="7">In linear regression we are given a data set where the x&amp;#39;s go 1, 3, 4, 5, and 9,</text><text start="7" dur="6">and the y&amp;#39;s 2, 5.2, 6.8, 8.4, and 14.8.</text><text start="13" dur="6">I&amp;#39;d like to find the formula y = w1x + w0 by minimizing the residual quadratic error</text><text start="19" dur="2">as we learned in class in linear regression.</text><text start="21" dur="3">What will be w1, and what will be w0?</text></transcript></video><video title="Question 11.mp4" id="nLFXKsiODAE" length="25"><transcript><text start="0" dur="2">K-Means Clustering.</text><text start="2" dur="4">We&amp;#39;re given a data set indicated by the solid dots over here.</text><text start="6" dur="3">There&amp;#39;s a total of 9 dots if you count carefully.</text><text start="9" dur="5">We have 2 initial cluster centers: C1 and C2 
as indicated by those stars.</text><text start="14" dur="2">I&amp;#39;d like to run K-Means to completion </text><text start="16" dur="6">and wonder what the final location of C1 will be after running K-Means.</text><text start="22" dur="3">Ignore the A, B, C, or D over here.</text></transcript></video><video title="Question 12.mp4" id="p5a8HJhVpl0" length="62"><transcript><text start="0" dur="2">Here is our logic question.</text><text start="2" dur="5">I would like you to mark each sentence as &amp;quot;valid,&amp;quot; which means it is always true--</text><text start="7" dur="5">you can&amp;#39;t make it untrue--&amp;quot;satisfiable,&amp;quot; which means it is sometimes true</text><text start="12" dur="5">but could also be false depending on the variable values, or &amp;quot;unsatisfiable,&amp;quot; </text><text start="17" dur="3">which means you cannot possibly make it true.</text><text start="20" dur="2">The first statement is not A.</text><text start="22" dur="3">The second is A or not A.</text><text start="25" dur="8">The third one is (A and not A) implies (B implies C).</text><text start="33" dur="8">The fourth one is (A implies B) and (B implies C) and (C implies A).</text><text start="41" dur="8">The next one is (A implies B) and not (not A or B).</text><text start="49" dur="9">The final one is ((A implies B) and (B implies C)) equivalent to (A implies C).</text><text start="58" dur="4">Remember, you might use truth tables to find out.</text></transcript></video><video title="Question 13.mp4" id="mJOOFG8K0t8" length="106"><transcript><text start="0" dur="6">The planning question might be a bit hard to read, so let me read the text for you.</text><text start="6" dur="4">In the state space below, shown over here, we can travel between locations</text><text start="10" dur="5">S, A, B, and G along the roads as shown.</text><text start="15" dur="4">For example, SA means we go from S to A.</text><text start="19" dur="4">But the world is partially observable 
and stochastic.</text><text start="23" dur="3">There may be a stoplight somewhere between B and G </text><text start="26" dur="3">that can prevent passing from B to G.</text><text start="29" dur="5">The action might fail, and there might be a flood that sits between A and G,</text><text start="34" dur="4">and the flood also makes the action going from A to G fail.</text><text start="38" dur="4">If the flood occurs, it will always remain flooded.</text><text start="42" dur="7">If the stoplight is red, it will flip green at some point, but we can&amp;#39;t predict when.</text><text start="49" dur="4">The flood is only visible at A and the stoplight only visible at B.</text><text start="53" dur="5">I want you to check all these plans over here and see what the outcome is.</text><text start="58" dur="2">There are 3 potential outcomes.</text><text start="60" dur="6">One is it always reaches the goal state and does so in a bounded number of steps.</text><text start="66" dur="3">By bounded I mean in advance you can tell me a maximum number of steps--</text><text start="69" dur="3">not after the fact.</text><text start="72" dur="4">If you can only do this after the fact, it&amp;#39;s really not bounded.</text><text start="76" dur="3">The second possibility is always reaches the goal state,</text><text start="79" dur="5">but the number of steps cannot be bounded in advance.</text><text start="84" dur="5">The third one is it might actually fail to reach the goal state.</text><text start="89" dur="9">Look at the following plans: SA followed by AG, SB, step 2 if we can&amp;#39;t move go back to 2,</text><text start="98" dur="4">then finally proceed to BG, and so on and so on.</text><text start="102" dur="4">See which of these plans fall into which category over here.</text></transcript></video><video title="Question 14.mp4" id="Kc88fJ3qrmc" length="26"><transcript><text start="0" dur="2">Here is an MDP question. 
</text><text start="2" dur="4">We have a deterministic environment, which means the state transitions are deterministic.</text><text start="6" dur="5">There is no probabilistic or stochastic outcome of actions.</text><text start="11" dur="4">The cost of motion is -5. The terminal state is worth 100 as indicated.</text><text start="15" dur="4">We have four actions: north, south, west, and east.</text><text start="19" dur="2">The shaded state can&amp;#39;t be entered.</text><text start="21" dur="5">Please fill in the final values after value iteration converges.</text></transcript></video><video title="Question 15.mp4" id="AmfqCR-VCTo" length="73"><transcript><text start="0" dur="7">In this final question, I&amp;#39;d like you to learn the position parameters of Markov chains.</text><text start="7" dur="2">There&amp;#39;s a Markov chain over here with two states.</text><text start="9" dur="4">There is an initial state distribution for the time step 0.</text><text start="13" dur="4">Then there is a conditional state distribution from time T to time T + 1.</text><text start="17" dur="2">You might go from A and stay in A.</text><text start="19" dur="3">You might go from A to B.</text><text start="22" dur="3">You might go from B and stay in B and go from B to A.</text><text start="25" dur="5">What we observe is the sequence A, A, A, A, B.</text><text start="30" dur="6">This is our sample for the initial state and all these transitions over here</text><text start="36" dur="4"> are samples for the state transitions in this Markov chain.</text><text start="40" dur="3">I want you to compute all the parameters, which is the initial distribution,</text><text start="43" dur="5">and the transition distribution out of state A and out of state B.</text><text start="48" dur="5">However I&amp;#39;d like to do this with Laplacian smoothing with k=1.</text><text start="53" dur="2">It is not maximum likelihood.</text><text start="55" dur="3">It is Laplacian smoothing, which can be applied 
just exactly</text><text start="58" dur="4"> the same way we saw it in class in various contexts.</text><text start="62" dur="3">I&amp;#39;d like you to learn these parameters of the Markov chain from the observed sequence.</text><text start="65" dur="4">Again, the only sample for the initial state is </text><text start="69" dur="4">the very first measurement observation in this sequence over here.</text></transcript></video></group><group title="Unit 13" count="36"><video title="01 Introduction.mp4" id="25opwF9MylQ" length="34"><transcript><text start="0" dur="3">This unit is about games. Why games?</text><text start="3" dur="3">Well, for one, games are fun. </text><text start="6" dur="3">They&amp;#39;ve captured the imagination of people for thousands of years.</text><text start="9" dur="4">They form a well-defined subset of the real world</text><text start="13" dur="3">in that they have rules, which we understand and write down, and they are self-contained.</text><text start="16" dur="7">They&amp;#39;re not as messy as driving a car or flying an autonomous plane</text><text start="23" dur="2">and having to worry about everything in the world.</text><text start="25" dur="5">In that sense, they form a small-scale model of a specific problem.</text><text start="30" dur="4">Namely, the problem of dealing with adversaries.</text></transcript></video><video title="02 Technologies Question.mp4" id="gpkY0em7CR0" length="81"><transcript><text start="0" dur="4">Along the way we&amp;#39;ve seen a lot of different technologies in this class</text><text start="4" dur="4">and a lot of different techniques, that are focused at different parts of the agent</text><text start="8" dur="4">and environment mix and different difficulties there.</text><text start="12" dur="6">Here we have a quiz, and what I want you to tell me is for each of these technologies</text><text start="18" dur="2">what do they most address? 
</text><text start="20" dur="5">Some of them address more than one, but give the best answer for each line.</text><text start="25" dur="3">Do they address the problem of a stochastic environment--</text><text start="28" dur="5">that is one where the results of actions can vary?</text><text start="33" dur="4">Do they address the problem of a partially observable environment--</text><text start="37" dur="3">one where we can&amp;#39;t see everything?</text><text start="40" dur="2">Do they address the problem of an unknown environment--</text><text start="42" dur="5">one where we don&amp;#39;t even know what the various actions are and what they do?</text><text start="47" dur="2">Do they address computational limitations--</text><text start="49" dur="6">that is problems of dealing with a very large problem rather than a small one</text><text start="55" dur="2">and making approximations to deal with that?</text><text start="57" dur="7">Or do they deal with handling adversaries who are working against our goals?</text><text start="64" dur="4">And I want you to answer that for MDPs, Markov decision processes; </text><text start="68" dur="6">POMDPs, partially observable Markov decision processes and belief space;</text><text start="74" dur="7">for reinforcement learning, and for A* algorithm, heuristic function, and Monte Carlo techniques.</text></transcript></video><video title="03 Technologies Answer.mp4" id="z_NeI-kOETA" length="37"><transcript><text start="0" dur="5">The answer is the MDPs are designed to do stochastic control.</text><text start="5" dur="4">POMDPs are designed to deal with partial observability.</text><text start="9" dur="4">Reinforcement learning deals with an unknown environment,</text><text start="13" dur="6">and the heuristic function and A* search and Monte Carlo techniques</text><text start="19" dur="3">are used to deal with computational limitations.</text><text start="22" dur="3">Monte Carlo techniques give us an approximation.</text><text 
start="25" dur="5">The heuristic function, if we use the right one, still gives us the right answer,</text><text start="30" dur="2">but deals with the computational complexity.</text><text start="32" dur="5">We don&amp;#39;t as yet have any technology that&amp;#39;s specifically designed to deal with adversaries.</text></transcript></video><video title="04 Games Question.mp4" id="6VbzmxAfbAk" length="72"><transcript><text start="0" dur="2">What is a game?</text><text start="2" dur="4">The philosopher Wittgenstein said that there is no single set of necessary </text><text start="6" dur="4">and sufficient conditions that define all games.</text><text start="10" dur="4">Rather games have a set of features, and some games share some of them,</text><text start="14" dur="3"> and other games share others of them.</text><text start="17" dur="5">It&amp;#39;s a complex overlapping set rather than a simple criterion.</text><text start="22" dur="5">Here I&amp;#39;ve listed six different games and in some cases sets of games </text><text start="27" dur="4">like Chess and Go are similar, Robotic Soccer, Poker, </text><text start="31" dur="3">hide-and-go-seek played in the real world,</text><text start="34" dur="6">Cards Solitaire, and Minesweeper, the computer solitaire game.</text><text start="40" dur="4">I want to ask you, for each one, which of these properties they exhibit.</text><text start="44" dur="4">Are they stochastic? Are they partially observable? </text><text start="48" dur="5">Do they have an unknown environment? 
Are they adversarial?</text><text start="53" dur="3">For each game tell me all that apply.</text><text start="56" dur="4">Let me add that your answers may not be the same as mine, </text><text start="60" dur="3">because these very terms are not that precise.</text><text start="63" dur="4">Sometimes you can analyze a problem in two different ways </text><text start="67" dur="5">and flip from one of these attributes to another, depending on how you analyze it.</text></transcript></video><video title="05 Games Answer.mp4" id="eu6-y8RRMLA" length="116"><transcript><text start="0" dur="6">Now, I&amp;#39;ve chosen to say that only robotic soccer and hide-and-go-seek are stochastic.</text><text start="6" dur="4">By that I mean if you have an action like go forward 1 meter,</text><text start="10" dur="5">the result of that action is stochastic. You may not go forward exactly 1 meter.</text><text start="15" dur="6">You could also analyze games like poker and cards and say that they&amp;#39;re stochastic</text><text start="21" dur="7">in that the next card is random, and so the action of flipping over the next card is stochastic.</text><text start="28" dur="4">You don&amp;#39;t know how that action is going to result.</text><text start="32" dur="4">I&amp;#39;ve chosen to model that as partial observability.</text><text start="36" dur="5">What I&amp;#39;ve said is it&amp;#39;s not that you pick randomly from the next card,</text><text start="41" dur="4">it&amp;#39;s that the cards are already arranged in some order.</text><text start="45" dur="2">It&amp;#39;s just that you don&amp;#39;t know what that order is.</text><text start="47" dur="3">There&amp;#39;s partial observability that gives you the next card.</text><text start="50" dur="4">Partial observability also shows up in the real-world sports</text><text start="54" dur="4">of robot soccer and hide-and-go-seek.</text><text start="58" dur="5">Obviously, that&amp;#39;s kind of the point of hide-and-go-seek that it&amp;#39;s 
partially observable.</text><text start="63" dur="4">Now, in terms of unknown, I&amp;#39;ve said that only hide-and-go-seek satisfies that.</text><text start="67" dur="3">In everything else, the world is well-defined.</text><text start="70" dur="4">Even in the real world in an environment like robot soccer,</text><text start="74" dur="3">you only have the known field to deal with.</text><text start="77" dur="3">Whereas in hide-and-go-seek, someone could be hiding anywhere</text><text start="80" dur="5">in a room or location that you don&amp;#39;t know about yet.</text><text start="85" dur="4">Notice that many games are adversarial, but some games are not.</text><text start="89" dur="2">Solitaire games are not adversarial.</text><text start="91" dur="6">You could mark that down as saying, well, I&amp;#39;m playing against the game itself,</text><text start="97" dur="5">but we don&amp;#39;t count that as adversarial, because the game itself is not trying to defeat you.</text><text start="102" dur="2">The game itself is passive.</text><text start="104" dur="5">Whereas in these games and what adversarial has come to mean is that </text><text start="109" dur="3">the opponent is taking into account what you are thinking </text><text start="112" dur="4">when the opponent does their own thinking and tries to defeat you that way.</text></transcript></video><video title="06 Single Player Game.mp4" id="shCDaByrf98" length="125"><transcript><text start="0" dur="3">Here&amp;#39;s a game that we&amp;#39;ve seen before. 
</text><text start="3" dur="4">We call this a single-player deterministic game.</text><text start="7" dur="2">We know how to solve this.</text><text start="9" dur="4">We use the techniques of search through a state space--the problem-solving techniques.</text><text start="13" dur="5">We draw a search tree through the state space,</text><text start="18" dur="6"> and I&amp;#39;m going to draw the nodes like this with triangles rather than with circles.</text><text start="24" dur="4">In any position--in this position here--there are three moves I can make.</text><text start="28" dur="4">I can slide this tile, this tile, or this tile.</text><text start="32" dur="4">So I have 3 moves, and that gives me 3 more states.</text><text start="36" dur="6">I keep on expanding out the states going farther and farther down until I reach one</text><text start="42" dur="7">that&amp;#39;s a goal state, and then I have a path through there that gets me to a solution.</text><text start="49" dur="2">What does it take to describe a game?</text><text start="51" dur="7">Well, we have a set of states S, including a distinguished start state S0.</text><text start="58" dur="5">We have a set of players P that can be our one player, as in this game, or two or more.</text><text start="63" dur="7">We have a function that gives us the allowable actions in a state,</text><text start="70" dur="3">and sometimes we put in a second argument, </text><text start="73" dur="4">which is the player making the action in that state,</text><text start="77" dur="4">and sometimes it&amp;#39;s explicit in the state itself whose turn it is to move.</text><text start="81" dur="4">We have a transition function that tells us the result of,</text><text start="85" dur="4">in some state, applying an action giving us a new state.</text><text start="89" dur="5">And we have a terminal test to say is it the end of the game.</text><text start="94" dur="2">That&amp;#39;s going to be true or false.</text><text start="96" 
dur="6">Finally, we have terminal utilities saying that for a given state and a given player</text><text start="102" dur="4"> there is some number which is the value of the game to that player.</text><text start="106" dur="6">In simple games that number is a win or a loss, a one or a zero.</text><text start="112" dur="4">Sometimes it&amp;#39;s denoted as a +1 and a -1.</text><text start="116" dur="4">In other games there can be more complicated utilities </text><text start="120" dur="5">of you win twice as much or four times as much or whatever.</text></transcript></video><video title="07 Two Player Game.mp4" id="o3Z3oAoKhDA" length="238"><transcript><text start="0" dur="4">Now let&amp;#39;s consider games like chess and checkers,</text><text start="4" dur="5">which we define as deterministic, two-player, zero-sum games.</text><text start="9" dur="2">The deterministic part is clear. </text><text start="11" dur="7">The rules of chess say you make a move, take a piece, and that&amp;#39;s it. There&amp;#39;s no stochasticity.</text><text start="18" dur="2">It&amp;#39;s two players, one against another, </text><text start="20" dur="5">and zero sum means that the sum of the utilities to the two players is zero.</text><text start="25" dur="5">If one player gets a +1 for winning the game, the other player gets a -1 for losing.</text><text start="30" dur="3">How do we deal with these types of games?</text><text start="33" dur="3">Well, we use a similar type of approach.</text><text start="36" dur="3">We have a state-space search. 
We have a starting state.</text><text start="39" dur="4">There are some moves available to player one.</text><text start="43" dur="4">Then in the next state there are moves available to player two.</text><text start="47" dur="4">We&amp;#39;re going to draw them like this, and we&amp;#39;re going to give names to our players.</text><text start="51" dur="4">The first player we&amp;#39;re going to call Max, because it&amp;#39;s a nice name,</text><text start="55" dur="7">and because player one is trying to maximize the utility to player one.</text><text start="62" dur="6">The next player, who operates at this level, we draw with a downward-pointing triangle.</text><text start="68" dur="6">We call that player Min, because Min is trying to minimize the utility to Max,</text><text start="74" dur="5">which is the same thing as trying to maximize the utility to himself or herself.</text><text start="79" dur="7">Then we have a game tree that continues like that, alternating between Max and Min moves.</text><text start="86" dur="5">Now, the search tree keeps going and let&amp;#39;s say we get to a point where one player,</text><text start="91" dur="5">and let&amp;#39;s say it&amp;#39;s Max, has a choice, and there are two states,</text><text start="96" dur="6">and these, rather than being states where it&amp;#39;s Min&amp;#39;s turn, are states that are terminal.</text><text start="102" dur="3">We&amp;#39;ll draw them with a square box.</text><text start="105" dur="5">Let&amp;#39;s say one of them results in +1, a win for Max,</text><text start="110" dur="4">and one of them results in -1, a loss for Max.</text><text start="114" dur="6">Now if Max is rational, of course, Max is going to make this choice to the +1.</text><text start="120" dur="7">What we&amp;#39;re going to do now is show we can determine the value of any state in the tree,</text><text start="127" dur="5"> including the start state up here in terms of the values of the terminal nodes.</text><text start="132" 
dur="3">The tree keeps on going. We assume it&amp;#39;s a finite game.</text><text start="135" dur="6">After a finite number of moves, every path leads to a terminal state.</text><text start="141" dur="7">Then we look at each state and say whose turn is it to make the decision.</text><text start="148" dur="4">In this state Max is making the decision, and Max, being rational, </text><text start="152" dur="5">will choose the maximum value, saying, &amp;quot;I&amp;#39;d rather have a +1 than a -1,</text><text start="157" dur="2">so I&amp;#39;ll get a +1 here.&amp;quot;</text><text start="159" dur="7">We start going back up the tree, and maybe we get up to a point here </text><text start="166" dur="4">where Min has a choice, and we&amp;#39;ve used this type of process to go up the tree,</text><text start="170" dur="5">and Min has a choice between a +1 and a -1.</text><text start="175" dur="4">Min is going to choose the minimum and will have a -1 here.</text><text start="179" dur="6">If we go through all the possibilities, let&amp;#39;s say these all result in -1, </text><text start="185" dur="5">but this move results in a +1. Then Max will take that move.</text><text start="190" dur="6">He&amp;#39;ll say, &amp;quot;Out of my four possibilities, I know this is the best one. 
I&amp;#39;ll take that move.&amp;quot;</text><text start="196" dur="2">Now we&amp;#39;ve done two things.</text><text start="198" dur="4">One, we&amp;#39;ve assigned a value to every state in the search tree,</text><text start="202" dur="3">and secondly, we backed that all the way up to the top.</text><text start="205" dur="4">Now we&amp;#39;ve worked out a path through that state to say,</text><text start="209" dur="3">if all players are rational, here&amp;#39;s the choices they would make.</text><text start="212" dur="5">The important point here is that we&amp;#39;ve taken the utility function, </text><text start="217" dur="3">which is defined only on terminal states.</text><text start="220" dur="3">Here&amp;#39;s a state here. The utility of that state was +1.</text><text start="223" dur="5">Here&amp;#39;s a state here. The utility of that state was -1.</text><text start="228" dur="4">We&amp;#39;ve used those utility values in the definition of available actions </text><text start="232" dur="6">to back those utilities up and tell us the utility of every state, including the start state.</text></transcript></video><video title="08 Two Player Function.mp4" id="sMT8sMOBA2Y" length="106"><transcript><text start="0" dur="3">Now let&amp;#39;s define a function value of S, </text><text start="3" dur="4">which tells us how to compute the value for a given state,</text><text start="7" dur="3">and therefore will allow us to make the best possible move.</text><text start="10" dur="5">If S is a terminal state, then the value is just the utility of the state </text><text start="15" dur="3">given by the definition of the game.</text><text start="18" dur="7">If S is a maximizing state, then we&amp;#39;ll return something called max value of S,</text><text start="25" dur="7">and if S is a minimizing state, then we&amp;#39;ll return min value of S.</text><text start="32" dur="5">Now we can define max value to just iterate over all the successors</text><text start="37" dur="3">and 
figure out the values of each of those.</text><text start="40" dur="6">We&amp;#39;ll initialize a value m equals minus infinity,</text><text start="46" dur="8">and then we&amp;#39;ll say for all pairs of actions and successor states in successors of S,</text><text start="54" dur="5">we&amp;#39;ll say the value is--and let&amp;#39;s call this S-prime so we don&amp;#39;t get confused--</text><text start="59" dur="8">the value of S-prime and M for keeping track of the maximum so far and the new value.</text><text start="67" dur="5">Then when we&amp;#39;re all done we return the M with the maximum value.</text><text start="72" dur="4">This will compute the maximum at a maximum node over all the states </text><text start="76" dur="3">that we have from all the possible moves.</text><text start="79" dur="5">The definition for min value is roughly equivalent but just reversed, </text><text start="84" dur="2">taking the minimum instead.</text><text start="86" dur="4">With these three recursive routines--value, max value, and min value--</text><text start="90" dur="3">we can determine the value of any node in the tree.</text><text start="93" dur="4">Now to do that efficiently, you&amp;#39;d want a little bit of bookkeeping</text><text start="97" dur="3">so you aren&amp;#39;t recomputing the same thing over and over again,</text><text start="100" dur="6">but conceptually, this will answer any two-player, deterministic, finite game.</text></transcript></video><video title="09 Time Complexity Question.mp4" id="8k25hH7DifE" length="57"><transcript><text start="0" dur="3">Now we know we have an algorithm that can solve any game tree, </text><text start="3" dur="4">that can propagate the terminal values back up to the top</text><text start="7" dur="3">and tell us the value for any position.</text><text start="10" dur="3">It&amp;#39;s theoretically complete, but now we need to know</text><text start="13" dur="3"> the complexity of the algorithm to figure out if it&amp;#39;s 
practical.</text><text start="16" dur="4">Let&amp;#39;s look at an analysis of how long it&amp;#39;s going to take.</text><text start="20" dur="2">Let&amp;#39;s say that the average branching factor--</text><text start="22" dur="6">the number of possible moves or actions coming out of a position--is b.</text><text start="28" dur="2">Here b would be 4.</text><text start="30" dur="7">And let&amp;#39;s say that the depth of the tree is m, so b wide and m deep.</text><text start="37" dur="4">Now what I want you to tell me is what would be the computational complexity</text><text start="41" dur="5">of searching through all the paths and backing the values up to the top.</text><text start="46" dur="6">Would it be of the order of b times m or the order of b to the mth power</text><text start="52" dur="5">or the order of m to the b power? Choose one of these.</text></transcript></video><video title="10 Time Complexity Answer.mp4" id="ZI-kPRRdGLE" length="13"><transcript><text start="0" dur="3">The answer is we have b choices at the top level,</text><text start="3" dur="4">and for each of those b we have another b at the next level.</text><text start="7" dur="3">That would be b squared, b cubed and so on, </text><text start="10" dur="3">and all the way to b to the mth power.</text></transcript></video><video title="11 Space Complexity Question.mp4" id="fx2rtFveuaM" length="26"><transcript><text start="0" dur="4">Now the next thing I want you to tell me is the space complexity.</text><text start="4" dur="2">That was the time complexity.</text><text start="6" dur="6">The space complexity is how much storage do we need to be able to search this tree.</text><text start="12" dur="6">Remember that the value and max value and min value routines that we have defined</text><text start="18" dur="2">are doing a depth-first search.</text><text start="20" dur="4">Which of these would correctly represent the amount of storage that we would need--</text><text start="24" dur="2">the space 
complexity?</text></transcript></video><video title="12 Space Complexity Answer.mp4" id="lIC4pifWgJ0" length="31"><transcript><text start="0" dur="5">The answer is that we only need b times m space in order to do the search.</text><text start="5" dur="5">Even though the entire tree is order b to the mth power of nodes,</text><text start="10" dur="4">on any individual path through the tree we only need to look one path at a time</text><text start="14" dur="2">in order to do the depth-first search.</text><text start="16" dur="6">We generate these b nodes, store them away, look at the first one, generate b more,</text><text start="22" dur="5">store those away, and so we&amp;#39;re saving only b nodes at each level for m levels,</text><text start="27" dur="4">for a total of b times m storage space required.</text></transcript></video><video title="13 Chess Question.mp4" id="th2Ua6Cvw3c" length="47"><transcript><text start="0" dur="3">The next question is let&amp;#39;s look at the game of chess</text><text start="3" dur="5"> for which the branching factor is somewhere around 30. 
It varies from move to move.</text><text start="8" dur="3">The length of a game is somewhere around 40.</text><text start="11" dur="4">Certainly some games are much longer, but that&amp;#39;s an average length of a game.</text><text start="15" dur="4">Now let&amp;#39;s imagine that you have a computer system, </text><text start="19" dur="3">and you want to search through this whole tree for chess,</text><text start="22" dur="5">and let&amp;#39;s assume that you can evaluate a billion nodes a second on one computer.</text><text start="27" dur="4">Let&amp;#39;s also say that for the moment somebody lent you every computer in the world.</text><text start="31" dur="5">If you have all the computers and they can each do a billion evaluations a second,</text><text start="36" dur="3">how long would it take you to search through this whole tree?</text><text start="39" dur="5">Would it be on the order of seconds, minutes, days, years,</text><text start="44" dur="3"> or lifetimes of the universe? Tell me which of these.</text></transcript></video><video title="14 Chess Answer.mp4" id="2Gak2N_0cjI" length="13"><transcript><text start="0" dur="4">The answer is that it would take many lifetimes of the universe. 
</text><text start="4" dur="3">Even though you have a lot of computing power at your disposal,</text><text start="7" dur="3">30 to the 40th power is just such a huge number</text><text start="10" dur="3">that there is no chance of searching through the entire tree for chess.</text></transcript></video><video title="15 Complexity Reduction Question.mp4" id="r6h2wOXNRbc" length="33"><transcript><text start="0" dur="5">Now our question is how do we deal with the complexity of having a tree</text><text start="5" dur="3"> with branching factor b and depth m.</text><text start="8" dur="5">Here are some possibilities, and I want you to tell me which of these are good approaches.</text><text start="13" dur="4">We have the problem of dealing with b to the m.</text><text start="17" dur="4">Could we reduce b somehow, that is, reduce the branching factor,</text><text start="21" dur="6">reduce m, the depth of the tree, or convert the tree into a graph in some way?</text><text start="27" dur="3">Tell me which, if any or all, of these would be good approaches </text><text start="30" dur="3">to dealing with the complexity.</text></transcript></video><video title="16 Complexity Reduction Answer.mp4" id="k3rYDYBDl_U" length="4"><transcript><text start="0" dur="4">The answer is that all three are useful approaches, and we&amp;#39;ll look at each of them.</text></transcript></video><video title="17 Review Question.mp4" id="mHpKLffIOvM" length="63"><transcript><text start="0" dur="2">Let&amp;#39;s review just for a second.</text><text start="2" dur="4">This is called the minimax routine for evaluating a game tree.</text><text start="6" dur="4">Given a particular state we look and see is it a terminal state?</text><text start="10" dur="3"> Is it a maximizing state? 
Is it a minimizing state?</text><text start="13" dur="3">In each case we look up the utility from the game.</text><text start="16" dur="5">We do the max value routine, or we do the min value routine, which is similar.</text><text start="21" dur="3">That gives us the value of each state.</text><text start="24" dur="5">Then the action that the agent would take would be just to take the action</text><text start="29" dur="3"> that results in the maximum state--the state with the best value.</text><text start="32" dur="4">Now let&amp;#39;s try to apply the minimax routine to this game tree.</text><text start="36" dur="5">This is a small game in which Max has three options for his moves,</text><text start="41" dur="4">and then Min has three options for its moves, and then the game is over.</text><text start="45" dur="6">Here are the terminal values for these states in terms of Max&amp;#39;s score.</text><text start="51" dur="5">What I want you to do is use minimax to fill in the values of these intermediate states.</text><text start="56" dur="4">What are the values of these three states for Min to move,</text><text start="60" dur="3">and what is the value of this state for Max to move?</text></transcript></video><video title="18 Review Answer.mp4" id="lS7IN-NfwbA" length="26"><transcript><text start="0" dur="4">The answer is that these are minimizing nodes.</text><text start="4" dur="3">The minimum of 3, 12, and 8, is 3.</text><text start="7" dur="3">Here the minimum is 2.</text><text start="10" dur="2">Here it&amp;#39;s 1.</text><text start="12" dur="2">Then this is a maximizing move.</text><text start="14" dur="2">The max is 3.</text><text start="16" dur="6">That means that if both players played rationally, then Max would take this move.</text><text start="22" dur="4">Then Min would take this move, and the value of the game would be 3.</text></transcript></video><video title="19 Reduce B.mp4" id="dNzU_k5b5CY" length="104"><transcript><text start="0" dur="6">Now I want to 
get at the idea of reducing b, the branching factor.</text><text start="6" dur="4">How is it that we can cut down on the number of nodes that we expand</text><text start="10" dur="6">in the horizontal direction while still getting the right answer for the evaluation of the tree?</text><text start="16" dur="5">Let&amp;#39;s go back and consider that during our evaluation, if we get to this point,</text><text start="21" dur="4">we&amp;#39;ve expanded these three nodes, we figured out that the value of this one is 3,</text><text start="25" dur="4">we looked at this one so far and found its value was 2,</text><text start="29" dur="6">and now, without looking at these, what can we say about the value of this node?</text><text start="35" dur="4">Well, it&amp;#39;s a minimizing node, so the most it could be is 2.</text><text start="39" dur="5">If these are less than 2, it&amp;#39;ll be less than that, and if these are more, it&amp;#39;ll end up being 2.</text><text start="44" dur="6">We can say that the value of this node is less than or equal to 2.</text><text start="50" dur="3">Now if we look at it from Max&amp;#39;s point of view, </text><text start="53" dur="5">Max will have this choice here of choosing either this, this, or this,</text><text start="58" dur="4">and if this one is 3 and this one is less than or equal to 2,</text><text start="62" dur="3">then we know Max will always choose this one.</text><text start="65" dur="6">What that tells us is that it doesn&amp;#39;t matter what the value is of this node and this node.</text><text start="71" dur="5">No matter what those values are this is still going to be less than or equal to 2,</text><text start="76" dur="2">and is not going to matter to the total evaluation, </text><text start="78" dur="2">because we&amp;#39;re going to go this way anyway.</text><text start="80" dur="6">We can prune the tree, chop off these nodes here, and never have to evaluate.</text><text start="86" dur="3">Now, with this particular 
case, that doesn&amp;#39;t save us very much,</text><text start="89" dur="4">because these are terminal nodes, but these could have been large branches--</text><text start="93" dur="3">big parts of the tree, and we still wouldn&amp;#39;t have to look at them.</text><text start="96" dur="5">We&amp;#39;ve made a potentially large pruning without affecting the value.</text><text start="101" dur="3">We still get the exact correct value for the value of the tree.</text></transcript></video><video title="20 Reduce B Question.mp4" id="pjNKJFxKnz0" length="8"><transcript><text start="0" dur="4">Now I want you to tell me over here which, if any or all, </text><text start="4" dur="4">of the three nodes can be pruned away by this procedure.</text></transcript></video><video title="21 Reduce B Answer.mp4" id="CIfugiw9_6M" length="27"><transcript><text start="0" dur="4">The answer is when we see the 14 we&amp;#39;re not sure what this value is.</text><text start="4" dur="4">It has to be less than or equal to 14,</text><text start="8" dur="3">which means it might be the right path or it might not.</text><text start="11" dur="6">Once we see the one then we know that the value is less than or equal to one,</text><text start="17" dur="4">and we know that we have a better alternative here, so we can stop at that point.</text><text start="21" dur="2">Then we can prune off the 8.</text><text start="23" dur="4">Out of the three, only this node, the right-most, would be the one pruned away.</text></transcript></video><video title="22 Reduce M.mp4" id="hRIwdYyvq2E" length="153"><transcript><text start="0" dur="5">Now I&amp;#39;m going to look at the issue of reducing m, the depth of the tree.</text><text start="5" dur="3">Here, I&amp;#39;ve drawn a game tree and left out some bits, </text><text start="8" dur="3">but the idea is that it keeps on going and going.</text><text start="11" dur="4">There&amp;#39;ll be too many nodes for us to evaluate at all. 
What can we do?</text><text start="15" dur="5">The simplest approach is to just by fiat cut off the search at a certain depth.</text><text start="20" dur="3">We&amp;#39;ll say we&amp;#39;re only going to search to level three,</text><text start="23" dur="2">and when we get down to level three, </text><text start="25" dur="3">we&amp;#39;re going to pretend that these are all terminal nodes.</text><text start="28" dur="7">We&amp;#39;ll draw them as the square boxes for terminals rather than the max nodes</text><text start="35" dur="3">and cut off the search at that point.</text><text start="38" dur="3">Now, of course, they aren&amp;#39;t terminal, so according to the rules of the game,</text><text start="41" dur="4">we haven&amp;#39;t either won or lost at this particular point.</text><text start="45" dur="4">We can&amp;#39;t say for sure what the value is for each of these nodes, </text><text start="49" dur="5">but we can estimate it using something called an evaluation function,</text><text start="54" dur="6">which is given a state S and returns an estimate of the final value for that state.</text><text start="60" dur="3">What do we want out of our evaluation function and how do we get it?</text><text start="63" dur="4">We want the evaluation function to be stronger for positions that are stronger </text><text start="67" dur="3">and weaker for positions that are weaker.</text><text start="70" dur="3">We can get it one way from experience--</text><text start="73" dur="3">from playing the games before and seeing similar situations</text><text start="76" dur="3"> and figuring out what their values are.</text><text start="79" dur="5">We can try to break that down into components by using experience with the game.</text><text start="84" dur="6">For example, in the game of chess it is traditional to say that a pawn is worth 1 point,</text><text start="90" dur="4">a knight 3 points, a bishop 3 points, a rook 5, and a queen 9.</text><text start="94" dur="2">You could add 
up all those points.</text><text start="96" dur="4">So we could have an evaluation function of S</text><text start="100" dur="8"> which is equal to this weighted sum of the various weights times the various pieces--</text><text start="108" dur="4">positive weights for your pieces and negative weights for the opponent&amp;#39;s pieces.</text><text start="112" dur="3">We&amp;#39;ve seen this idea before when we did machine learning</text><text start="115" dur="4">where we have a set of features, which could be the pieces, </text><text start="119" dur="3">and they could be other features of the game as well.</text><text start="122" dur="3">For example, in chess it&amp;#39;s good to control the center,</text><text start="125" dur="3">it&amp;#39;s good not to have a double pawn, and so on.</text><text start="128" dur="5">We could make up as many features as we can think of to represent each individual state</text><text start="133" dur="5">and then use machine learning from examples to figure out what the weight should be.</text><text start="138" dur="3">Then we have an evaluation function.</text><text start="141" dur="4">We apply the evaluation function to each state at the cutoff point</text><text start="145" dur="3">rather than doing a long search.</text><text start="148" dur="5">Then we have an estimate, and we back those values up just as if they were terminal values.</text></transcript></video><video title="23 Computing State Values.mp4" id="2xsXEpdyDUg" length="106"><transcript><text start="0" dur="3">Now let&amp;#39;s see how we can compute the value of a state using these</text><text start="3" dur="4">two innovations to work on b and m.</text><text start="7" dur="3">I&amp;#39;ve modified our routine for value in two ways--</text><text start="10" dur="6">one, I&amp;#39;ve introduced a new line that says if we decide to cut off the search</text><text start="16" dur="6"> at a particular depth then apply the evaluation function to the state and return 
that.</text><text start="22" dur="3">Then I&amp;#39;ve also added some bookkeeping variables.</text><text start="25" dur="4">One for the current depth, which will get increased as we go along,</text><text start="29" dur="5">and then two values called alpha and beta, which are the traditional names,</text><text start="34" dur="6">where alpha is the best value found so far for Max along the path </text><text start="40" dur="6">that we are currently exploring, and beta is the best value found so far for Min.</text><text start="46" dur="3">Then since we have these extra parameters when we start out,</text><text start="49" dur="8">we would make the call value of our initial state S0 and we&amp;#39;re currently at depth zero </text><text start="57" dur="6">in the search tree, and we haven&amp;#39;t found the best for Max yet so that would be minus infinity,</text><text start="63" dur="7">and the best for Min similarly we haven&amp;#39;t found anything there so that would be plus infinity.</text><text start="70" dur="6">We call that, and then at each node we would choose one of these four cases.</text><text start="76" dur="3">Here&amp;#39;s the new definition of maxValue taking the depth </text><text start="79" dur="3">and the alpha and beta parameters into account.</text><text start="82" dur="2">It&amp;#39;s similar to what we had before.</text><text start="84" dur="2">We go through all the successors.</text><text start="86" dur="5">We take the maximum, and in this case we&amp;#39;re incrementing the depth </text><text start="91" dur="4">as we call recursively for the value of each node.</text><text start="95" dur="3">We get the cutoff here if we exceed beta,</text><text start="98" dur="6">and otherwise we retain alpha as the maximum value to Max so far.</text><text start="104" dur="2">Then we return the final value.</text></transcript></video><video title="24 Complexity Reduction Benefits.mp4" id="Fs1gAjUQGmI" length="176"><transcript><text start="0" dur="4">Now we said we 
have three ways to reduce this exponential b to the m--</text><text start="4" dur="4">reducing the branching factor b, reducing the depth of the tree m,</text><text start="8" dur="3"> and converting the tree to a graph.</text><text start="11" dur="2">Let&amp;#39;s see how each of those fare.</text><text start="13" dur="6">First, for reducing b we came up with this alpha-beta pruning technique.</text><text start="19" dur="2">In fact, that is very effective.</text><text start="21" dur="8">That takes us from a regime where we&amp;#39;re in order b to the m to one where,</text><text start="29" dur="5">if we do a good job, we can get to order b to the m/2.</text><text start="34" dur="2">Now what do I mean by doing a good job?</text><text start="36" dur="3">Well, we get different amounts of pruning depending on the order</text><text start="39" dur="3"> in which we expand each branch from a node.</text><text start="42" dur="3">If we expand the good nodes first, then we get a lot of pruning,</text><text start="45" dur="4">because we do a good job of getting to the cutoff points.</text><text start="49" dur="4">If we expand the poor nodes first, then we don&amp;#39;t do any pruning,</text><text start="53" dur="3"> because we don&amp;#39;t get to that cutoff point until later.</text><text start="56" dur="5">But if we can do well, then we get to the square root of the number of nodes.</text><text start="61" dur="4">In other words, we get to search twice as deep into the search tree.</text><text start="65" dur="7">That&amp;#39;s all 100% perfect in terms of not changing the result.</text><text start="72" dur="3">We&amp;#39;d still get the exact evaluation.</text><text start="75" dur="3">We just stop doing work that we didn&amp;#39;t have to do.</text><text start="78" dur="3">Now for the tree to the graph, we haven&amp;#39;t talked about that yet.</text><text start="81" dur="4">In fact, it depends on the particular game, but in many games it can be very useful.</text><text start="85" 
dur="4">In games like chess, we have opening books.</text><text start="89" dur="3">That is, we look at the past openings</text><text start="92" dur="4"> and we just memorize those positions and what are the good moves.</text><text start="96" dur="2">It doesn&amp;#39;t matter how we get to those positions.</text><text start="98" dur="3">We can get to them in multiple paths through a tree,</text><text start="101" dur="2">and we can just consider it a single graph.</text><text start="103" dur="5">We also have closing books, where we can memorize all the positions</text><text start="108" dur="4"> with five or fewer pieces and know exactly what to do.</text><text start="112" dur="5">In the midgame when there are too many positions to memorize all of them,</text><text start="117" dur="7">we can still search through a graph if we want to or if we want we can just do part of that.</text><text start="124" dur="5">One thing that has proven effective in games like chess is called the killer-move heuristic.</text><text start="129" dur="5">What that says is if there&amp;#39;s one really good move in part of a search tree,</text><text start="134" dur="5">then try that same move in the sister branches for that tree.</text><text start="139" dur="6">In other words, if I try making one move and I find that the opponent takes my queen,</text><text start="145" dur="3">then when I try making another move from that same position,</text><text start="148" dur="4">I should also check if the opponent has that response of taking my queen.</text><text start="152" dur="3">Converting from a tree to a graph also doesn&amp;#39;t lose information.</text><text start="155" dur="3">It can just help us make the search go faster.</text><text start="158" dur="4">The third possibility was reducing m, the depth of the tree,</text><text start="162" dur="4">by just cutting off search and going to an evaluation function.</text><text start="166" dur="5">That is imperfect in that it is an estimate of the true 
value of the tree</text><text start="171" dur="2">but won&amp;#39;t give you the exact value. </text><text start="173" dur="3">We can get into trouble. Let me show you an example of that.</text></transcript></video><video title="25 Pacman Question.mp4" id="ESSdQ4K-a_Q" length="62"><transcript><text start="0" dur="5">Here&amp;#39;s a search tree for a version of Pacman in which there&amp;#39;s only four squares.</text><text start="5" dur="3">There&amp;#39;s a little Pacman guy who can move around,</text><text start="8" dur="6">and there are food dots that the Pacman can eat.</text><text start="14" dur="3">Maybe someplace else in the maze there are opponents,</text><text start="17" dur="3">but we&amp;#39;re not going to worry about them right here.</text><text start="20" dur="3">We&amp;#39;re just going to consider the Pacman&amp;#39;s actions.</text><text start="23" dur="2">He has two actions--to go left or right</text><text start="25" dur="3">If he goes left, he goes over here and eats that food particle</text><text start="28" dur="4">and then moves back right--that&amp;#39;s his only move from that position.</text><text start="32" dur="4">Or if he moves right then he has two other moves.</text><text start="36" dur="4">Now let&amp;#39;s assume that we cut off the search at this depth,</text><text start="40" dur="3">and we want to have an evaluation function,</text><text start="43" dur="4">and the goal is for Pacman to eat all the food.</text><text start="47" dur="5">The evaluation function will be the number of food particles that he&amp;#39;s eaten so far.</text><text start="52" dur="5">What I want you to do is tell me in these boxes </text><text start="57" dur="5">what the evaluation should be for each of these three states.</text></transcript></video><video title="26 Pacman Answer.mp4" id="qdl4cWMPfE4" length="117"><transcript><text start="0" dur="6">The answer is here he&amp;#39;s eaten 1, here he&amp;#39;s eaten 0, and here he&amp;#39;s eaten 1.</text><text 
start="6" dur="5">That&amp;#39;s fine. The problem arises when we start backing up these numbers.</text><text start="11" dur="6">If these are max nodes, we&amp;#39;ve skipped the opponent&amp;#39;s moves, which are the min nodes.</text><text start="17" dur="3">We&amp;#39;re only looking at the maxes.</text><text start="20" dur="5">The max of 1 is 1, so this would also get an evaluation of 1.</text><text start="25" dur="6">The max of 0 and 1 is 1, so this would also get an evaluation of 1.</text><text start="31" dur="4">This final node would be the max of 1 and 1, so that&amp;#39;s also 1.</text><text start="35" dur="4">But now when we go to apply the policy, if we&amp;#39;re in this position, </text><text start="39" dur="5">using these evaluation functions, both of these moves are equally good.</text><text start="44" dur="4">The Pacman might choose this one, </text><text start="48" dur="4">choosing at random or choosing by some predefined ordering.</text><text start="52" dur="3">Then he&amp;#39;d end up in this state. 
So far he hasn&amp;#39;t eaten anything.</text><text start="55" dur="4">But this state is just as good because he knows in two moves he can eat one particle</text><text start="59" dur="5">going this way just as well as in two moves he can eat one particle going this way.</text><text start="64" dur="4">Now he&amp;#39;s in this state, but notice that this state is symmetric to this one.</text><text start="68" dur="4">On his next turn, if we did another depth-two search, </text><text start="72" dur="2">he might just as well go back one position.</text><text start="74" dur="5">He would be stuck going back and forth between these two states,</text><text start="79" dur="5">because either one of those, if you look ahead only two, is equally good.</text><text start="84" dur="4">You have to look ahead one, two, three, four moves to know</text><text start="88" dur="3"> that one of them is better than the other.</text><text start="91" dur="3">This is known as the horizon effect.</text><text start="94" dur="5">The idea is that when we cut off search we&amp;#39;re specifying a horizon</text><text start="99" dur="2"> beyond which the agent can&amp;#39;t see.</text><text start="101" dur="4">If a good thing or a bad thing happens beyond the horizon, we don&amp;#39;t see that.</text><text start="105" dur="5">All we see is whatever is reflected in the evaluation function.</text><text start="110" dur="4">If the evaluation function is imperfect, we don&amp;#39;t see beyond the horizon, </text><text start="114" dur="3">and we can make mistakes.</text></transcript></video><video title="27 Chance.mp4" id="dZm3MSrYno4" length="97"><transcript><text start="0" dur="6">There is one more thing to deal with when we have to talk about games--and that&amp;#39;s chance.</text><text start="6" dur="4">We want to move from purely deterministic games to stochastic games </text><text start="10" dur="6">like backgammon or other games that introduce dice or other parts of random action.</text><text 
start="16" dur="3">That means that the actions that an agent takes </text><text start="19" dur="4">are not specifically specified to have a single result.</text><text start="23" dur="3">Let&amp;#39;s see how we can deal with stochastic games</text><text start="26" dur="4"> by looking at our value function and modifying it to allow for this.</text><text start="30" dur="4">Here we have our valuation function, and we&amp;#39;re dealing with four types of nodes--</text><text start="34" dur="4">one, nodes that we decide to cut off on our own, because we reached a certain depth;</text><text start="38" dur="4">second, nodes that are terminal according to the rules of the game;</text><text start="42" dur="3">and third, max to move and min to move.</text><text start="45" dur="3">Now I&amp;#39;m going to add one more type, which is a chance node.</text><text start="48" dur="9">We say if the state is a chance node, then we want to return the expected value of S </text><text start="57" dur="3">and carry along these bookkeeping variables.</text><text start="60" dur="4">What we&amp;#39;re saying here is if it&amp;#39;s at the point of the game where it&amp;#39;s time to roll the dice,</text><text start="64" dur="4">then we&amp;#39;re going to roll the dice, and we&amp;#39;re going to take the expected value </text><text start="68" dur="3">of all the possible results rather than the max or the min.</text><text start="71" dur="4">Here we have a schematic diagram for a stochastic game--a game with dice.</text><text start="75" dur="4">We start out. The chance node or the dice-rolling node is first. 
</text><text start="79" dur="3">The die is rolled--one of six possibilities.</text><text start="82" dur="2">Then the next player gets his move.</text><text start="84" dur="6">In this case, we&amp;#39;ve let Min move first, and Min has various possible moves to make.</text><text start="90" dur="7">For each one, there is then another roll of the dice, and then Max gets to make his move.</text></transcript></video><video title="28 Chance Question.mp4" id="_1k1oNbNEAs" length="50"><transcript><text start="0" dur="3">Here&amp;#39;s the game tree for another stochastic game.</text><text start="3" dur="2">This game involves flipping a coin.</text><text start="5" dur="4">The chance nodes have two results: heads or tails.</text><text start="9" dur="5">Then the player Max has two possible moves, A and B,</text><text start="14" dur="4">and the player Min has two possible moves, C and D.</text><text start="18" dur="4">This game is too small to have any alpha-beta pruning involved,</text><text start="22" dur="5">but I&amp;#39;ve listed all the terminal values for the terminal states of the game.</text><text start="27" dur="4">What I want you to do is fill in the non-terminal values </text><text start="31" dur="4">for the chance nodes, the max nodes, and the min nodes</text><text start="35" dur="6">according to the rules of minimum value and maximum value and expected value.</text><text start="41" dur="9">I should say that the probability of the coin flip is 50% heads and 50% tails.</text></transcript></video><video title="29 Chance Answer.mp4" id="GRiuI3LHaAQ" length="70"><transcript><text start="0" dur="3">To evaluate the game tree, we work from the bottom up.</text><text start="3" dur="3">Let&amp;#39;s start over here. 
This is a min node.</text><text start="6" dur="3">Min chooses the minimum, which will be 1.</text><text start="9" dur="5">In this position, Min would choose 2, the minimum of 2 and 4.</text><text start="14" dur="5">Over here Min would choose 0, the minimum of 0 and 10.</text><text start="19" dur="4">Now we have some chance nodes, so we have to compute the expected value.</text><text start="23" dur="4">Chance, the flip of the coin, doesn&amp;#39;t get the choice of one direction or the other.</text><text start="27" dur="3">Rather, both of them are possibilities.</text><text start="30" dur="5">So we just average the results, since the probabilities of heads and tails are equal.</text><text start="35" dur="9">So 7 and 1 is 8, divided by 2 is 4, and 8 and 2 is 10, divided by 2 is 5, which is the expected value there.</text><text start="44" dur="8">The expected value of 0 and 6 is 3, and the expected value of 0 and 4 is 2.</text><text start="52" dur="5">Now we have a maximizing node. The max of 5 and 4 would be 5.</text><text start="57" dur="7">The max of 3 and 2 would be 3, and finally, we have another chance node.</text><text start="64" dur="6">The average of 5 and 3 would be 4, and that&amp;#39;s the value of the final state.</text></transcript></video><video title="30 Terminal State Question.mp4" id="PpYekJ-XIv0" length="10"><transcript><text start="0" dur="3">Now one more question for this same game tree.</text><text start="3" dur="4">I want you to click on all the terminal states that are possible outcomes </text><text start="7" dur="3">for this game if both players play rationally.</text></transcript></video><video title="31 Terminal State Answer.mp4" id="pTk06FG9ChM" length="?"><transcript><text start="0" dur="3">No subtitles...</text></transcript></video><video title="32 Game Tree Question 1.mp4" id="Y1GX5hhdBqQ" length="21"><transcript><text start="0" dur="3">One more quick game tree to evaluate.</text><text start="3" dur="1">Here we have terminal values. 
</text><text start="4" dur="5">We have chance nodes where the two options are equiprobable.</text><text start="9" dur="2">We have a max node with two actions, A and B.</text><text start="11" dur="4">I want you to fill in the values for all the nodes</text><text start="15" dur="6">and click on which action, A or B, is the rational action for Max.</text></transcript></video><video title="33 Game Tree Answer 1.mp4" id="nrRDNTXmlno" length="17"><transcript><text start="0" dur="3">The average of these two is 2. That&amp;#39;s a chance node.</text><text start="3" dur="4">The average of these is 2.5.</text><text start="7" dur="5">Max will choose the better of 2 and 2.5, which is 2.5.</text><text start="12" dur="5">Therefore, B will be the rational action for Max.</text></transcript></video><video title="34 Game Tree Question 2.mp4" id="fDwnBA0DJb0" length="49"><transcript><text start="0" dur="3">Now we know that in this game, if these were terminal nodes,</text><text start="3" dur="5">then that would be the right action for the game, and there would be nothing to argue about.</text><text start="8" dur="6">But what if instead of having these be terminal nodes, these were cutoff nodes,</text><text start="14" dur="5">and these were evaluation values for those nodes?</text><text start="19" dur="5">Furthermore, if it&amp;#39;s an evaluation function, then it&amp;#39;s an arbitrary function.</text><text start="24" dur="6">Suppose that instead of coming up with these values, we used a different evaluation function,</text><text start="30" dur="9">which squared these values, and so we came up with evaluations of 0, 16, 4, and 9.</text><text start="39" dur="7">With that function, I want you to repeat the problem of filling in the values for each</text><text start="46" dur="3"> of these nodes and tell me what the rational action is for Max.</text></transcript></video><video title="35 Game Tree Answer 2.mp4" id="00ePzRWuSpo" length="34"><transcript><text start="0" dur="4">The answer is this is a 
chance node, so we take the average of 0 and 16.</text><text start="4" dur="3">That&amp;#39;s no longer 2. It becomes 8.</text><text start="7" dur="7">We take the average of 9 and 4. That&amp;#39;s no longer 2.5. It becomes 6.5.</text><text start="14" dur="8">Notice what&amp;#39;s happened: Max now chooses 8 over 6.5,</text><text start="22" dur="6">and now the rational action has shifted from B to A. What&amp;#39;s gone on here?</text><text start="28" dur="4">We notice that just by making a change to the evaluation function, </text><text start="32" dur="2">we changed the rational action.</text></transcript></video><video title="36 Conclusion.mp4" id="fYQb-uLTmYc" length="112"><transcript><text start="0" dur="2">Let&amp;#39;s summarize what we&amp;#39;ve done so far.</text><text start="2" dur="5">We&amp;#39;ve built up this valuation function that tells us the value of any state,</text><text start="7" dur="3">and therefore we can choose the best action in a state.</text><text start="10" dur="4">We started off just having terminal states and max value states.</text><text start="14" dur="5">That&amp;#39;s good for one-player, deterministic games,</text><text start="19" dur="4">and we realized that that&amp;#39;s just the same thing as searches we&amp;#39;ve seen before</text><text start="23" dur="3">where we had A* search or depth-first search.</text><text start="26" dur="5">Then we added in an opponent player for two-player or multiplayer games,</text><text start="31" dur="4">who is trying to minimize rather than maximize. We saw how to do that.</text><text start="35" dur="7">Then we optimized by saying at some point we may not be able to search the whole tree,</text><text start="42" dur="4">so we&amp;#39;re going to have a cutoff depth and an evaluation function.</text><text start="46" dur="4">We recognized that that means that we&amp;#39;re no longer perfect in terms of </text><text start="50" dur="3">evaluating the tree. 
We now have an estimate.</text><text start="53" dur="3">We also tried to be more computationally efficient</text><text start="56" dur="4"> by throwing in the alpha and beta parameters,</text><text start="60" dur="4">which keep track of the best value so far for Max and Min</text><text start="64" dur="4">and prune off branches of the tree that are outside of that range</text><text start="68" dur="4">and so are provably not part of the answer for the best value.</text><text start="72" dur="3">We kept track of those through these bookkeeping parameters.</text><text start="75" dur="4">Then finally we introduced stochastic games,</text><text start="79" dur="4">in which there is an element of chance or luck or rolling of the dice.</text><text start="83" dur="3">We realized that in order to evaluate those nodes,</text><text start="86" dur="4">we have to take the expected value rather than the minimum or the maximum value.</text><text start="90" dur="4">Now we have a way to deal with all the popular types of games.</text><text start="94" dur="7">The details now go into figuring out when to cut off and what the right evaluation function is.</text><text start="101" dur="3">That is a complex area.</text><text start="104" dur="3">A lot of research in AI is being done there,</text><text start="107" dur="5">but it&amp;#39;s being done for specific games rather than for the theory in general.</text></transcript></video></group><group title="Unit 14" count="29"><video title="01 Introduction.mp4" id="E8TWtwT45tg" length="101"><transcript><text start="0" dur="2">Hey, welcome back.</text><text start="2" dur="2">Hope you enjoyed the last unit. 
You guys have been doing great.</text><text start="4" dur="3">You&amp;#39;ve been doing amazing work, getting a lot done,</text><text start="7" dur="3">doing a really good job of answering the questions.</text><text start="10" dur="2">I&amp;#39;ve been looking at this book here.</text><text start="12" dur="2">This is a book from my father&amp;#39;s collection.</text><text start="14" dur="4">It&amp;#39;s called &amp;quot;Introduction to the Theory of Games&amp;quot; by McKinsey.</text><text start="18" dur="2">It was published in 1952, </text><text start="20" dur="3">4 years before the start of artificial intelligence.</text><text start="23" dur="4">And so game theory and AI have kind of grown up together.</text><text start="27" dur="2">They&amp;#39;ve taken different paths, </text><text start="29" dur="2">and now they&amp;#39;ve begun to merge back together.</text><text start="31" dur="4">We&amp;#39;ve talked about games already in a previous unit.</text><text start="35" dur="2">We talked about mostly turn-taking games</text><text start="37" dur="3">where 1 player moves and then another moves,</text><text start="40" dur="3">and the trick is how to work against an adversary</text><text start="43" dur="3">who&amp;#39;s trying to maximize his own utility </text><text start="46" dur="3">and thus minimize your utility.</text><text start="49" dur="3">Game theory handles those types of games, but it also really focuses</text><text start="52" dur="4">on games where the 2 moves are simultaneous,</text><text start="56" dur="3">or another way to think about them is 1 player moves</text><text start="59" dur="3">and then the other moves, but the second player doesn&amp;#39;t know</text><text start="62" dur="3">what choice the first player made, so it&amp;#39;s partially observable.</text><text start="65" dur="4">And it&amp;#39;s this back and forth of trying to figure out what should I move</text><text start="69" dur="3">given what I think he&amp;#39;s going to move and what 
does he think about</text><text start="72" dur="5">what I&amp;#39;m going to move that gives game theory its special status.</text><text start="77" dur="3">We&amp;#39;re going to talk about how that works for AI,</text><text start="80" dur="2">and 2 problems are studied.</text><text start="82" dur="2">The first is agent design. </text><text start="84" dur="3">That is, given a game, find the optimal policy.</text><text start="87" dur="3">And the second is mechanism design.</text><text start="90" dur="2">That is, given utility functions, </text><text start="92" dur="3">how can we design a mechanism so that</text><text start="95" dur="4">when the agents act rationally the global utility will be maximized in some way?</text><text start="99" dur="2">Let&amp;#39;s take a look.</text></transcript></video><video title="02 Dominant Strategy Question.mp4" id="mY_9srbDMEg" length="176"><transcript><text start="0" dur="3">We&amp;#39;re going to talk about game theory,</text><text start="3" dur="3">which is the study of finding an optimal policy</text><text start="6" dur="4">when that policy can depend on the opponent&amp;#39;s policy and vice versa. 
</text><text start="10" dur="4">And let&amp;#39;s look at 1 of the most famous games of all,</text><text start="14" dur="3">a game called the &amp;quot;Prisoner&amp;#39;s Dilemma.&amp;quot;</text><text start="17" dur="4">And the story is that there are 2 criminals, Alice and Bob,</text><text start="21" dur="3">who have a working relationship, and they&amp;#39;re both caught</text><text start="24" dur="4">at the scene of a crime, but the police don&amp;#39;t quite have enough evidence</text><text start="28" dur="2">to put them away.</text><text start="30" dur="3">They offer each independently a deal saying</text><text start="33" dur="4">&amp;quot;If you testify against your cohort,</text><text start="37" dur="5">we&amp;#39;ll give you a better deal and give you a reduced sentence time.&amp;quot;</text><text start="42" dur="3">And Alice and Bob both understand what&amp;#39;s going on.</text><text start="45" dur="2">They&amp;#39;re both perfectly rational, </text><text start="47" dur="3">and to understand what the situation is,</text><text start="50" dur="5">we draw up a matrix in which we have possible outcomes</text><text start="55" dur="2">and possible strategies for each side.</text><text start="57" dur="4">For Alice, she has 2 strategies.</text><text start="61" dur="3">1 is to testify against Bob,</text><text start="64" dur="3">and the other is to refuse to testify.</text><text start="67" dur="2">And Bob has the same choices, </text><text start="69" dur="4">to testify against Alice or to refuse.</text><text start="73" dur="4">In general, different agents may have different actions available to them.</text><text start="77" dur="3">And now we show the payoff to each agent.</text><text start="80" dur="3">Sometimes those payoffs are opposite,</text><text start="83" dur="4">as in a game like chess where if 1 player gets a +1,</text><text start="87" dur="2">the other gets a -1.</text><text start="89" dur="3">In this game, the payoffs are not opposite,</text><text 
start="92" dur="2">so it&amp;#39;s a non-zero-sum game.</text><text start="94" dur="4">And if they both refuse to testify against each other,</text><text start="98" dur="4">then neither can be convicted of the major crime,</text><text start="102" dur="3">but the police will get them for a lesser crime.</text><text start="105" dur="5">And let&amp;#39;s say they each serve 1 year in jail,</text><text start="110" dur="2">so that&amp;#39;s a -1 for each of them.</text><text start="112" dur="4">If Alice testifies and Bob refuses,</text><text start="116" dur="3">then the police are grateful to Alice,</text><text start="119" dur="4">and she gets off with nothing, and Bob gets </text><text start="123" dur="3">the book thrown at him and gets a -10 score.</text><text start="126" dur="3">Likewise if the roles are reversed.</text><text start="129" dur="4">And if both testify against each other, then they&amp;#39;re both guilty,</text><text start="133" dur="2">and they split the penalty.</text><text start="135" dur="4">Now, the question that both Alice and Bob have to face</text><text start="139" dur="2">is, what is their strategy going to be?</text><text start="141" dur="3">And the first concept we want to talk about </text><text start="144" dur="3">is the concept of a dominant strategy.</text><text start="147" dur="4">A dominant strategy is one for which a player</text><text start="151" dur="3">does better than with any other strategy</text><text start="154" dur="2">no matter what the other player does.</text><text start="156" dur="5">And now the question is, does either Alice or Bob</text><text start="161" dur="3">have a dominant strategy?</text><text start="164" dur="2">If Alice has a dominant strategy, </text><text start="166" dur="5">I want you to check that off, either testify or refuse,</text><text start="171" dur="3">and similarly, if Bob has a dominant strategy,</text><text start="174" dur="2">check that off.</text></transcript></video><video title="03 Dominant Strategy 
Answer.mp4" id="FAKgof5PMV8" length="35"><transcript><text start="0" dur="2">The answer is for Alice,</text><text start="2" dur="2">testify is a dominant strategy.</text><text start="4" dur="4">Let&amp;#39;s see. We have to compare it against all possible strategies for Bob.</text><text start="8" dur="3">If Bob does testify,</text><text start="11" dur="4">then Alice gets -5 here and -10 here,</text><text start="15" dur="2">so testify is better.</text><text start="17" dur="2">And if Bob does refuse,</text><text start="19" dur="5">then Alice gets 0 here and -1 here, so testify is better.</text><text start="24" dur="3">Testify is better for Alice no matter what,</text><text start="27" dur="2">and by similar reasoning, </text><text start="29" dur="2">testify is better for Bob no matter what, </text><text start="31" dur="4">so testify is a dominant strategy for both players.</text></transcript></video><video title="04 Pareto Optimal Question.mp4" id="T57JLskDv7g" length="35"><transcript><text start="0" dur="3">The next concept I want to talk about</text><text start="3" dur="5">is the concept of a Pareto optimal outcome.</text><text start="8" dur="3">So, this is talking about outcomes rather than strategies.</text><text start="11" dur="2">The strategies are in the margins.</text><text start="13" dur="4">The outcomes are in the matrix, and a Pareto optimal outcome</text><text start="17" dur="3">is one where there&amp;#39;s no other outcome</text><text start="20" dur="2">that all players would prefer.</text><text start="22" dur="3">And this is named after the economist Pareto.</text><text start="25" dur="2">What I want you to answer is</text><text start="27" dur="3">is there a Pareto optimal outcome in this game?</text><text start="30" dur="2">Is there an outcome such that</text><text start="32" dur="3">there&amp;#39;s no other outcome that all players would prefer?</text></transcript></video><video title="05 Pareto Optimal Answer.mp4" id="aFIFH6HjcvE" 
length="17"><transcript><text start="0" dur="2">And the answer is that this outcome, </text><text start="2" dur="2">A = -1, B = -1, </text><text start="4" dur="3">is Pareto optimal because there&amp;#39;s no other outcome</text><text start="7" dur="2">that all the players would prefer.</text><text start="9" dur="2">Sure, B would prefer being up here,</text><text start="11" dur="3">and A would prefer being over here,</text><text start="14" dur="3">but there&amp;#39;s none of them that both players can agree on.</text></transcript></video><video title="06 Equilibrium Question.mp4" id="bcMAYqIoe-8" length="33"><transcript><text start="0" dur="5">Now, the third concept is the concept of equilibrium.</text><text start="5" dur="2">An equilibrium is an outcome such that no player</text><text start="7" dur="3">can benefit from switching to a different strategy,</text><text start="10" dur="3">assuming that the other players stay the same.</text><text start="13" dur="4">And there was a famous result from John Nash, economist,</text><text start="17" dur="4">who was shown in the movie and book &amp;quot;A Beautiful Mind&amp;quot;</text><text start="21" dur="4">proving that every game has at least 1 equilibrium point.</text><text start="25" dur="6">The question here is which, if any, of these outcomes</text><text start="31" dur="2">are equilibriums in this game?</text></transcript></video><video title="07 Equilibrium Answer.mp4" id="QXc7izcogSE" length="52"><transcript><text start="0" dur="2">And the answer is that only this outcome, </text><text start="2" dur="4">with A = -5, B = -5, is an equilibrium point</text><text start="6" dur="3">because if A switches, it gets -10.</text><text start="9" dur="2">If B switches, it gets -10.</text><text start="11" dur="5">Neither player wants to switch away from that strategy.</text><text start="16" dur="4">Over here, the Pareto optimal solution is not an equilibrium point</text><text start="20" dur="6">because if B switches, it will do 
better,</text><text start="26" dur="2">and A will do worse.</text><text start="28" dur="3">This is where the game turns out to be a dilemma</text><text start="31" dur="5">because there&amp;#39;s an equilibrium point that it seems like</text><text start="36" dur="3">if both players are rational, they&amp;#39;re bound to end up</text><text start="39" dur="2">in this outcome, </text><text start="41" dur="5">whereas the Pareto optimal solution is over here in the other corner.</text><text start="46" dur="4">And yet, being rational, neither Alice nor Bob can see a way</text><text start="50" dur="2">to get to this preferred outcome.</text></transcript></video><video title="08 Game Console Question 1.mp4" id="V931T1AoVjo" length="92"><transcript><text start="0" dur="2">Let&amp;#39;s try another example.</text><text start="2" dur="3">This one is called the Game Console Game,</text><text start="5" dur="3">and the story is that there is a </text><text start="8" dur="4">game console manufacturer called Acme,</text><text start="12" dur="3">and it has to decide whether its next console</text><text start="15" dur="4">is going to play Blu-ray discs or DVD discs.</text><text start="19" dur="4">And then there&amp;#39;s a game manufacturer called Best,</text><text start="23" dur="3">and they similarly have to decide whether to put out their next game</text><text start="26" dur="3">on Blu-ray discs or DVD discs.</text><text start="29" dur="4">And the payoffs are if they&amp;#39;re both on Blu-ray,</text><text start="33" dur="7">A gets a +9, and B is also a +9.</text><text start="40" dur="4">If they both choose to go with DVD, it&amp;#39;s not quite as lucrative.</text><text start="44" dur="4">A gets a +5. 
B gets a +5.</text><text start="48" dur="4">And if they disagree, then they&amp;#39;ll be in trouble, and they&amp;#39;ll take losses.</text><text start="52" dur="5">A gets a -4, and B gets a -1, </text><text start="57" dur="6">while here A = -3 and B = -1.</text><text start="63" dur="3">The first question is, is there a dominant strategy?</text><text start="66" dur="3">And is there one for A?</text><text start="69" dur="2">Click here if yes.</text><text start="71" dur="2">And is there one for B?</text><text start="73" dur="4">Click here if yes, and if there&amp;#39;s none at all, click here.</text><text start="77" dur="3">There may be one for both A and B. It&amp;#39;s your choice.</text><text start="80" dur="4">And then the next question is, is there an equilibrium?</text><text start="84" dur="5">Click on any of these 4 outcomes</text><text start="89" dur="3">to indicate that it&amp;#39;s an equilibrium.</text></transcript></video><video title="09 Game Console Answer 1.mp4" id="e0iQE11Fh8c" length="27"><transcript><text start="0" dur="3">The answers are that there&amp;#39;s no dominant strategy</text><text start="3" dur="3">because for each player, what&amp;#39;s best depends on what the other player does.</text><text start="6" dur="2">They do best if they match,</text><text start="8" dur="3">and so you can&amp;#39;t figure out what your own best strategy is</text><text start="11" dur="2">unless you know what the other player is going to play.</text><text start="13" dur="3">In terms of equilibrium, there are 2 equilibrium points, </text><text start="16" dur="4">the +9/+9 and the +5/+5.</text><text start="20" dur="4">Both of them are equilibriums because neither of the players</text><text start="24" dur="3">can benefit from switching to the other strategy.</text></transcript></video><video title="10 Game Console Question 2.mp4" id="PEqFuwCv-Mc" length="9"><transcript><text start="0" dur="2">And now the next question is</text><text start="2" dur="3">is there 1 
or more Pareto optimal outcomes?</text><text start="5" dur="4">Click on any of the outcomes that you think are Pareto optimal.</text></transcript></video><video title="11 Game Console Answer 2.mp4" id="B1C1K66wTeI" length="21"><transcript><text start="0" dur="2">And the answer is that there&amp;#39;s just 1.</text><text start="2" dur="2">The +9/+9 is Pareto optimal.</text><text start="4" dur="3">Both players would rather be there than anyplace else.</text><text start="7" dur="3">And so it seems that if both players are rational</text><text start="10" dur="3">they&amp;#39;ll both know that there are 2 equilibrium points,</text><text start="13" dur="3">but only 1 of them is Pareto optimal.</text><text start="16" dur="2">And even though there isn&amp;#39;t a dominant strategy,</text><text start="18" dur="3">they can both arrive at that happy conclusion.</text></transcript></video><video title="12 2 Finger Morra.mp4" id="9tB8y8YQM6A" length="111"><transcript><text start="0" dur="3">So, we&amp;#39;ve seen that it&amp;#39;s easy to figure out the solution to a game</text><text start="3" dur="4">if there&amp;#39;s a dominant strategy or if there&amp;#39;s a Pareto-optimal equilibrium.</text><text start="7" dur="5">Now let&amp;#39;s look at a harder game for which such solutions don&amp;#39;t exist.</text><text start="12" dur="3">This game is called Two Finger Morra,</text><text start="15" dur="4">and it&amp;#39;s a betting game, and we&amp;#39;re going to show a simplified version of it.</text><text start="19" dur="4">Again, we have a simple 4-state outcome matrix,</text><text start="23" dur="3">and there are 2 players called even and odd.</text><text start="26" dur="3">And they both simultaneously show either</text><text start="29" dur="2">1 or 2 fingers.</text><text start="31" dur="6">And then if the result of the total number of fingers is even, </text><text start="37" dur="7">then the even player wins that many dollars from the odd player.</text><text start="44" 
dur="3">And if the total number of fingers is odd,</text><text start="47" dur="5">then the odd player wins that number of dollars from the even player.</text><text start="52" dur="4">So, 1 and 1 is 2, so that&amp;#39;s even, </text><text start="56" dur="8">so even gets +2, and we won&amp;#39;t bother writing odd getting -2</text><text start="64" dur="4">because it&amp;#39;s a zero-sum game, and it will always be the opposite. </text><text start="68" dur="3">Similarly, 2 and 2 is 4, </text><text start="71" dur="4">so even gets +4 and odd gets -4.</text><text start="75" dur="4">2 and 1 is 3, so even loses 3 dollars </text><text start="79" dur="4">and pays them to odd, and similarly up here.</text><text start="83" dur="3">Now, there&amp;#39;s no dominant strategy, and it seems kind of tricky</text><text start="86" dur="3">to figure out what the right strategy is.</text><text start="89" dur="2">We&amp;#39;re going to need more complicated techniques,</text><text start="91" dur="3">and it turns out that there is no single move</text><text start="94" dur="2">that&amp;#39;s the best strategy for either player.</text><text start="96" dur="3">But there is what&amp;#39;s called a mixed strategy.</text><text start="99" dur="5">A single strategy of always playing one or the other</text><text start="104" dur="3">is called a pure strategy, and a mixed strategy</text><text start="107" dur="4">is when you have a probability distribution over the possible moves.</text></transcript></video><video title="13 Tree Question.mp4" id="RFuabag0RBg" length="111"><transcript><text start="0" dur="4">Now, since it seems complicated to solve this game in this form,</text><text start="4" dur="4">one way we can address it is to change from this matrix form</text><text start="8" dur="3">into the familiar tree form.</text><text start="11" dur="2">We&amp;#39;ll move this over here,</text><text start="13" dur="2">and we&amp;#39;ll draw it as a game tree.</text><text start="15" dur="5">Max will be 
the even player, and min will be the odd player,</text><text start="20" dur="5">and for the moment, let&amp;#39;s look at the game</text><text start="25" dur="3">of what would happen if max had to go first</text><text start="28" dur="3">rather than having them move simultaneously.</text><text start="31" dur="5">So, max would make a move, either 1 or 2.</text><text start="36" dur="5">And then min--so max is E and min is O--</text><text start="41" dur="5">would also make a move, 1 or 2, 1 or 2. </text><text start="46" dur="4">And then the outcome in terms of E would be 2 here,</text><text start="50" dur="4">-3 here, -3 here and 4 here.</text><text start="54" dur="3">And now what does min do? Well, it tries to minimize.</text><text start="57" dur="4">So, we choose 2 here, so this node would be -3.</text><text start="61" dur="5">We&amp;#39;d choose 1 here, so this node would be -3,</text><text start="66" dur="2">and then E tries to maximize.</text><text start="68" dur="3">It doesn&amp;#39;t matter what he chooses,</text><text start="71" dur="3">and we get a -3 up here.</text><text start="74" dur="3">So, that&amp;#39;s giving E the disadvantage of having to reveal</text><text start="77" dur="4">his or her strategy first.</text><text start="81" dur="2">What if we did it the other way around?</text><text start="83" dur="2">Let&amp;#39;s take a look at that.</text><text start="85" dur="4">What if O had to go first and reveal a strategy of 1 or 2</text><text start="89" dur="6">and then E as the maximizing player goes second</text><text start="95" dur="2">and does a 1 or 2?</text><text start="97" dur="4">And then we have these 4 terminal states here,</text><text start="101" dur="4">and I want you to fill in the values of the 4 terminal states</text><text start="105" dur="4">taken from the table and the intermediate states</text><text start="109" dur="2">or the higher up states in the tree as well.</text></transcript></video><video title="14 Tree Answer.mp4" id="EZfcbAfYTTo" 
length="72"><transcript><text start="0" dur="2">And the answer is 1 + 1 is 2, </text><text start="2" dur="5">and so that&amp;#39;s even, so it&amp;#39;d be a positive payoff to E.</text><text start="7" dur="4">1 + 2 is 3, that&amp;#39;s odd, so it&amp;#39;d be a -3.</text><text start="11" dur="3">Similarly, 2 + 1 is 3, which is odd.</text><text start="14" dur="3">So, -3. 2 + 2 is 4. </text><text start="17" dur="2">That&amp;#39;s a positive payoff.</text><text start="19" dur="2">Now E is maximizing, </text><text start="21" dur="3">so E would prefer 2 here</text><text start="24" dur="2">and would prefer 4 here.</text><text start="26" dur="3">And now O is minimizing, </text><text start="29" dur="3">so O would prefer 2 here.</text><text start="32" dur="3">And notice what we&amp;#39;ve done here is that</text><text start="35" dur="3">we&amp;#39;re trying to figure out what the utility of the game is</text><text start="38" dur="3">to E, and in the true game, </text><text start="41" dur="3">both players move simultaneously.</text><text start="44" dur="3">Over here, we&amp;#39;ve handicapped E.</text><text start="47" dur="3">And over here, we&amp;#39;ve handicapped O.</text><text start="50" dur="3">The true value of the game must be somewhere in between there,</text><text start="53" dur="4">so we can say that the utility to E must be</text><text start="57" dur="4">less than or equal to 2, which is the value here,</text><text start="61" dur="5">and greater than or equal to -3, which is the value here. 
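<!-- Editorial sketch, not part of the spoken lecture: a minimal Python check of the two sequential versions of Two Finger Morra described above. The names payoff_to_even, value_if_even_first and value_if_odd_first are illustrative assumptions, not anything from the course materials. Under those assumptions it should reproduce the two bounds on the utility to E, namely -3 when E must move first and 2 when O must move first.

```python
MOVES = (1, 2)  # each player shows 1 or 2 fingers


def payoff_to_even(e_move, o_move):
    """Payoff to E: win the total if it is even, lose the total if it is odd."""
    total = e_move + o_move
    return total if total % 2 == 0 else -total


def value_if_even_first():
    # E commits first, so O replies with the move that minimizes E's payoff.
    return max(min(payoff_to_even(e, o) for o in MOVES) for e in MOVES)


def value_if_odd_first():
    # O commits first, so E replies with the move that maximizes E's payoff.
    return min(max(payoff_to_even(e, o) for e in MOVES) for o in MOVES)


lower = value_if_even_first()
upper = value_if_odd_first()
print(lower, upper)
```

The gap between the two numbers is the price of revealing a pure strategy first. -->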
</text><text start="66" dur="2">We&amp;#39;ve narrowed it down to some degree, but we still haven&amp;#39;t nailed down</text><text start="68" dur="4">exactly what the utility of the game is.</text></transcript></video><video title="15 Mixed Strategy.mp4" id="MyicDINS6pg" length="146"><transcript><text start="0" dur="3">Now, 1 reason there&amp;#39;s such a wide discrepancy in the outcomes</text><text start="3" dur="3">of these 2 versions of the game is that</text><text start="6" dur="3">we handicapped E and O so severely</text><text start="9" dur="4">that here E had to reveal his entire strategy,</text><text start="13" dur="3">whether he&amp;#39;s going to play 1 or 2 all the time,</text><text start="16" dur="2">and the same thing for O over here.</text><text start="18" dur="3">What if we could think of a way where we didn&amp;#39;t handicap them quite as much,</text><text start="21" dur="3">where they weren&amp;#39;t giving away quite as much information?</text><text start="24" dur="2">Let&amp;#39;s look at a way to do that.</text><text start="26" dur="4">Let&amp;#39;s look at the situation where E goes first</text><text start="30" dur="2">and has to reveal the strategy, </text><text start="32" dur="3">but instead of having to reveal my strategy is </text><text start="35" dur="2">to play 1 or to play 2, </text><text start="37" dur="3">what if E says &amp;quot;Well, my strategy is</text><text start="40" dur="4">with probability P, I&amp;#39;m going to play 1.&amp;quot;</text><text start="44" dur="4">&amp;quot;And with probability 1 - P, I&amp;#39;m going to play 2.&amp;quot;</text><text start="48" dur="2">And that&amp;#39;s called a mixed strategy.</text><text start="50" dur="5">So, E would announce that strategy for some number P.</text><text start="55" dur="3">And there could be an infinite number of possibilities,</text><text start="58" dur="4">so we should be drawing an infinite number of branches</text><text start="62" dur="3">out of this decision point for 
all the possibilities </text><text start="65" dur="2">for values of P that E would come up with.</text><text start="67" dur="2">But instead, I&amp;#39;m just going to sort of parameterize that </text><text start="69" dur="3">and just draw 1 line coming out.</text><text start="72" dur="3">And now O as the minimizing player</text><text start="75" dur="4">has to make a choice between 1 and 2, and what are the outcomes?</text><text start="79" dur="4">Well, if O chooses 1 and E plays 1, then 1 + 1 is 2, </text><text start="83" dur="5">so with probability P, we get an outcome of 2.</text><text start="88" dur="5">That&amp;#39;s 2P, but if E plays 2,</text><text start="93" dur="3">with probability 1 - P, then 2 + 1 is 3, </text><text start="96" dur="4">so with probability 1 - P, we get a -3.</text><text start="100" dur="4">So, 2P - 3(1 - P)</text><text start="104" dur="4">would be the outcome for this choice.</text><text start="108" dur="3">And then the outcome over here would be</text><text start="111" dur="5">-3P + 4(1 - P).</text><text start="116" dur="4">That&amp;#39;s the parameterized outcome given the parameterized strategy.</text><text start="120" dur="3">And we could do the same thing on the other side.</text><text start="123" dur="2">What if O had to go first? 
</text><text start="125" dur="4">With probability Q, O plays 1, </text><text start="129" dur="3">and with probability 1 - Q plays 2.</text><text start="132" dur="3">Then E is the maximizer </text><text start="135" dur="6">and we get 2Q - 3(1 - Q)</text><text start="141" dur="5">and -3Q + 4(1 - Q).</text></transcript></video><video title="16 Solving the Game.mp4" id="S47BxSU1S3w" length="139"><transcript><text start="0" dur="3">Now, what value should E choose for P?</text><text start="3" dur="2">Remember, you&amp;#39;ve got an infinite number of choices  </text><text start="5" dur="2">for any value for P.</text><text start="7" dur="3">Well, if E chose a value of P </text><text start="10" dur="4">such that this value here </text><text start="14" dur="3">was larger than this value here,</text><text start="17" dur="3">then O would know to always play 1, </text><text start="20" dur="2">and similarly, if this value was larger,</text><text start="22" dur="3">then O would know to always play 2.</text><text start="25" dur="3">So, it seems that what E wants to do</text><text start="28" dur="5">is choose the value of P such that these 2 are equal.</text><text start="33" dur="2">So, how much is that? 
Well, let&amp;#39;s see.</text><text start="35" dur="2">This is 2P - 3(1 - P).</text><text start="37" dur="3">That&amp;#39;s 5P - 3, </text><text start="40" dur="6">and we want to set that equal to -3P + 4(1 - P).</text><text start="46" dur="4">That&amp;#39;s -7P + 4.</text><text start="50" dur="3">And let&amp;#39;s gather the terms together.</text><text start="53" dur="3">So, that would be 12P = 7</text><text start="56" dur="4">or P = 7/12.</text><text start="60" dur="5">So, if E chooses the value of P = 7/12,</text><text start="65" dur="2">so 7/12 of the time play 1,</text><text start="67" dur="2"> 5/12 of the time play 2,</text><text start="69" dur="3">then O doesn&amp;#39;t know what to do.</text><text start="72" dur="3">No matter whether he chooses 1 or 2, he gets the same result.</text><text start="75" dur="3">And you can do the same calculation over here,</text><text start="78" dur="5">and it turns out that Q also equals 7/12.</text><text start="83" dur="5">Now, let&amp;#39;s take this strategy of P = 7/12, 1,</text><text start="88" dur="2">and feed it back into the matrix for the game,</text><text start="90" dur="6">and if E plays this strategy of 7/12, 1, 5/12, 2, </text><text start="96" dur="3">then no matter what strategy O plays, </text><text start="99" dur="7">the value of the game to E, the utility to E is -1/12.</text><text start="106" dur="4">And then we can do the same computation over here.</text><text start="110" dur="5">If O has the strategy 7/12, 1, and 5/12, 2,</text><text start="115" dur="6">then we plug that back into here, and no matter what strategy E chooses,</text><text start="121" dur="4">the value there is also -1/12.</text><text start="125" dur="4">And so now we&amp;#39;ve shown that the utility to E</text><text start="129" dur="3">is greater than or equal to -1/12</text><text start="132" dur="2">and less than or equal to -1/12.</text><text start="134" dur="2">In other words, the utility to E</text><text start="136" dur="3">is exactly 
-1/12, so we&amp;#39;ve solved the game.</text></transcript></video><video title="17 Mixed Strategy Issues.mp4" id="fjUIeZHX5Aw" length="170"><transcript><text start="0" dur="4">Now, the introduction of mixed strategies</text><text start="4" dur="3">brings us some curious philosophical problems</text><text start="7" dur="7">related to the idea of randomness, secrecy, and rationality.</text><text start="14" dur="3">We said that sometimes the rational strategy</text><text start="17" dur="2">can be a mixed strategy.</text><text start="19" dur="2">That is, one with probability in it.</text><text start="21" dur="4">With probability P, I do action A, </text><text start="25" dur="5">and with probability 1 - P I do action B. </text><text start="30" dur="5">And that suggests that we need some secrecy</text><text start="35" dur="5">so that our opponent doesn&amp;#39;t know which of these random choices we&amp;#39;re making.</text><text start="40" dur="3">The curious thing is that that&amp;#39;s only true</text><text start="43" dur="2">to the extent of an individual play, </text><text start="45" dur="3">not to the extent of the strategy itself.</text><text start="48" dur="3">So, if this is the optimal strategy,</text><text start="51" dur="4">a mixed strategy, it&amp;#39;s okay for us to reveal </text><text start="55" dur="3">that strategy to our opponent because our opponent</text><text start="58" dur="3">can also compute that that&amp;#39;s our rational strategy,</text><text start="61" dur="4">and so we won&amp;#39;t do any worse by revealing to the opponent</text><text start="65" dur="2">exactly what our strategy is.</text><text start="67" dur="4">However, the actual implementation of that strategy, </text><text start="71" dur="4">that is, this is the grand strategy, that in this situation,</text><text start="75" dur="4">whenever we&amp;#39;re faced with playing this game, this is what we&amp;#39;ll do,</text><text start="79" dur="3">that part can be revealed, but the actual 
choice</text><text start="82" dur="3">that this time we&amp;#39;re going to choose A or we&amp;#39;re going to choose B,</text><text start="85" dur="2">of course, that has to be kept secret.</text><text start="87" dur="4">If we reveal that, if our opponent can somehow discover</text><text start="91" dur="5">which choice we&amp;#39;re going to make based on this random choice,</text><text start="96" dur="3">then our opponent can get an advantage over us.</text><text start="99" dur="3">Now, with respect to rationality,</text><text start="102" dur="2">we said that a rational agent is one that does the right thing,</text><text start="104" dur="2">and that&amp;#39;s still true.</text><text start="106" dur="2">However, it turns out that there are games </text><text start="108" dur="3">in which you can do better if your opponent believes</text><text start="111" dur="3">you are not rational, and that has been said about various politicians </text><text start="114" dur="5">throughout history, and I won&amp;#39;t pick on one or another.</text><text start="119" dur="3">But sometimes it has been said that they are intentionally </text><text start="122" dur="3">cultivating an image of being crazy</text><text start="125" dur="2">so that they can gain an advantage</text><text start="127" dur="3"> when faced with certain games with opponents.</text><text start="130" dur="5">For example, suppose 1 action available to a leader is to go to war,</text><text start="135" dur="3">but both sides realize that the strategy of going to war</text><text start="138" dur="4">is dominated by other strategies and thus would be irrational.</text><text start="142" dur="4">So, a leader who is perceived to be rational and makes a threat</text><text start="146" dur="3">of &amp;quot;Give me this concession, or I&amp;#39;ll go to war against you,&amp;quot;</text><text start="149" dur="2">that&amp;#39;s not a credible threat.</text><text start="151" dur="4">The leader&amp;#39;s threat would be dismissed, 
and it would have no effect.</text><text start="155" dur="3">However, if the leader can convince the opponent</text><text start="158" dur="5">that he is irrational or crazy, then the threat suddenly becomes credible.</text><text start="163" dur="3">And so note that being irrational doesn&amp;#39;t help,</text><text start="166" dur="4">but appearing irrational can gain you an advantage.</text></transcript></video><video title="18 2x2 Game Question 1.mp4" id="zB_g1uVuzhM" length="46"><transcript><text start="0" dur="2">Now, let&amp;#39;s give you a chance to solve a game.</text><text start="2" dur="2">This will be another 2x2 game,</text><text start="4" dur="5">and let&amp;#39;s just go ahead and call the players max and min,</text><text start="9" dur="3">and they each have 2 moves,</text><text start="12" dur="3">and we&amp;#39;ll make this be a zero-sum game.</text><text start="15" dur="4">We&amp;#39;ll just show the value to max, </text><text start="19" dur="3">and the value to min will be the negation of that.</text><text start="22" dur="3">And what I want you to do to solve the game</text><text start="25" dur="3">is tell me what the strategy should be</text><text start="28" dur="2">by filling in these blanks.</text><text start="30" dur="5">What are the percentages that min should play 1 or 2?</text><text start="35" dur="5">And in these blanks, the percentages for max to play 1 or 2.</text><text start="40" dur="3">And then tell me the final value for the game.</text><text start="43" dur="3">The utility or expected value to max equals what?</text></transcript></video><video title="19 2x2 Game Answer 1.mp4" id="2G4k-9BuuH8" length="42"><transcript><text start="0" dur="3">We see in this game each player has a dominant strategy.</text><text start="3" dur="3">For max, if he plays strategy 1, </text><text start="6" dur="3">then that&amp;#39;s better than playing strategy 2 </text><text start="9" dur="5">if min plays 1, and strategy 1 is also better than strategy 
2 </text><text start="14" dur="5">if min plays 2, so strategy 1 is the dominant strategy,</text><text start="19" dur="2">and that should have probability 1.</text><text start="21" dur="2">Strategy 2 should have probability 0.</text><text start="23" dur="2">And it&amp;#39;s the same thing for min, </text><text start="25" dur="4">that 2 minimizes better than 1 in both cases. </text><text start="29" dur="4">So, strategy 2 should have probability 1, </text><text start="33" dur="3">and strategy 1 should have probability 0,</text><text start="36" dur="3">and that means we&amp;#39;re always going to end up with this outcome,</text><text start="39" dur="3">and the value of the game is 3.</text></transcript></video><video title="20 2x2 Game Question 2.mp4" id="AXDRN_PG794" length="26"><transcript><text start="0" dur="3">Now, that last one was easy. Let&amp;#39;s do one more.</text><text start="3" dur="4">Here we have a game, and the payoff to max is 3, </text><text start="7" dur="4">6, 5 and 4.</text><text start="11" dur="2">And I want you to tell me what the strategy is,</text><text start="13" dur="2">whether it&amp;#39;s pure or mixed.</text><text start="15" dur="3">What are the probabilities that max should play 1 and 2,</text><text start="18" dur="4">and what are the probabilities that min should play 1 and 2?</text><text start="22" dur="4">And what is the value of the game to max?</text></transcript></video><video title="21 2x2 Game Answer 2.mp4" id="m9tkNhbJ-OU" length="97"><transcript><text start="0" dur="2">So, in this case, there&amp;#39;s no dominant strategy,</text><text start="2" dur="2">so we&amp;#39;ll have to go to a mixed strategy.</text><text start="4" dur="3">And we&amp;#39;ll start by looking at max and saying</text><text start="7" dur="3">he has a mixed strategy with a probability P </text><text start="10" dur="5">of playing 1, so then if min chooses 1,</text><text start="15" dur="7">then we&amp;#39;ll have the outcome 3P + 5(1 - P).</text><text 
start="22" dur="3">And we want to set that equal to the outcome</text><text start="25" dur="7">if min plays 2, which is 6P + 4(1 - P).</text><text start="32" dur="5">And we solve that, and that works out to P = 1/4.</text><text start="37" dur="4">So, P, that was a probability of max playing 1.</text><text start="41" dur="5">That should be 1/4, which leaves 3/4 over here.</text><text start="46" dur="3">And now we go at it from min&amp;#39;s direction,</text><text start="49" dur="4">and if min has a probability of Q</text><text start="53" dur="7">of playing 1, then we want to set 3Q + 6 (1 - Q)</text><text start="60" dur="5">equals 5Q + 4 (1 - Q).</text><text start="65" dur="4">And you solve that, and you get Q = 1/2.</text><text start="69" dur="3">So, 1/2 and 1/2.</text><text start="72" dur="3">And then the utility of the game is the expected value,</text><text start="75" dur="4">so we look at all the outcomes and the probability of each outcome.</text><text start="79" dur="5">So, 3 times 1/8, because it&amp;#39;s 1/4 times 1/2 </text><text start="84" dur="3">would be the probability, so 3 times 1/8</text><text start="87" dur="4">+ 6 times 1/8 + 5 times 3/8</text><text start="91" dur="6">+ 4 times 3/8, and that works out to 4.5.</text></transcript></video><video title="22 Geometric Interpretation.mp4" id="Ka4DPvOgNnM" length="111"><transcript><text start="0" dur="5">Here&amp;#39;s a geometric interpretation that may help you understand a little better what&amp;#39;s going on.</text><text start="5" dur="3">Here we&amp;#39;ve gone back to the two-finger Morra game.</text><text start="8" dur="5">Now, remember we looked at the two possibilities of E going first</text><text start="13" dur="7">and revealing a strategy of playing one with probability P and two with probability of 1 - P.</text><text start="20" dur="5">Now O has a choice of what to do, and O wants to minimize.</text><text start="25" dur="5">If O chooses one, he&amp;#39;ll be somewhere along this 
line</text><text start="30" dur="4">corresponding to this strategy for different values of P.</text><text start="34" dur="7">This graph here is showing the utility to E for different values of P </text><text start="41" dur="3">that E chose in E&amp;#39;s strategy.</text><text start="44" dur="2">Now O can achieve the minimum, </text><text start="46" dur="4">and since O gets to choose the strategy of doing one or two,</text><text start="50" dur="4">O can be anywhere on this frontier.</text><text start="54" dur="6">It makes sense for E to push that up as high as possible since E is the maximizer.</text><text start="60" dur="6">E will choose this point here, which turns out to be at P = 7/12.</text><text start="66" dur="2">The same argument goes on this side.</text><text start="68" dur="6">Here O has gone first and chosen the strategy, q:one and (1 - q):two.</text><text start="74" dur="6">Now E has a choice of what to do. E can either choose two and be along this line,</text><text start="80" dur="2">depending on what the value of q is,</text><text start="82" dur="3">or can choose one, which will be along this line.</text><text start="85" dur="3">It would make sense for O, who is trying to minimize,</text><text start="88" dur="4"> to get this frontier down as low as possible.</text><text start="92" dur="4">O would choose the value of q that puts us right here at this point.</text><text start="96" dur="6">It turns out that that also is q = 7/12.</text><text start="102" dur="4">We see that each side is trying to maximize or minimize,</text><text start="106" dur="5">and we end up at a distinguished point that&amp;#39;s the intersection of their two strategies.</text></transcript></video><video title="23 Poker.mp4" id="W5hfpcoCdMc" length="212"><transcript><text start="0" dur="4">Now so far we&amp;#39;ve dealt only with games that take a single turn--</text><text start="4" dur="4">that is, there are two players, they both simultaneously reveal their move,</text><text 
start="8" dur="2"> and the game is over.</text><text start="10" dur="2">But game theory can also deal with more complex games</text><text start="12" dur="4"> that have multiple rounds of turn taking.</text><text start="16" dur="3">Here I&amp;#39;m describing a simple game of poker, </text><text start="19" dur="3">the simplest type of poker you&amp;#39;ve probably ever seen.</text><text start="22" dur="2">The deck only has four cards.</text><text start="24" dur="2">One card is dealt to each player.</text><text start="26" dur="2">There are two rounds.</text><text start="28" dur="6">In the first, player one has a choice to either raise--to bet a dollar--or to check.</text><text start="34" dur="6">Then in the second round, the second player has the chance to call--</text><text start="40" dur="4">to say I want to see what&amp;#39;s up--or to fold.</text><text start="44" dur="4">Now this format begins to look very much like the game tree</text><text start="48" dur="3"> that we talked about in the previous unit.</text><text start="51" dur="3">It starts out and there&amp;#39;s a chance node.</text><text start="54" dur="6">This corresponds to dealing the cards, with a 1/6 chance that the first player gets an Ace</text><text start="60" dur="2">and the second player gets an Ace.</text><text start="62" dur="6">A one-third chance that the first player gets an Ace and the second player gets a King, and so on.</text><text start="68" dur="4">Then there are maximizing nodes and minimizing nodes.</text><text start="72" dur="3">What this format, which is known as the sequential game format,</text><text start="75" dur="7">is especially good at is keeping track of the belief states of the possibilities</text><text start="82" dur="5">of what each agent knows and doesn&amp;#39;t know.</text><text start="87" dur="3">The tree as a whole describes everything that&amp;#39;s going on,</text><text start="90" dur="5">but each agent doesn&amp;#39;t know at which point in the tree they are.</text><text 
start="95" dur="3">So if you&amp;#39;re agent number one, you know that you have an Ace,</text><text start="98" dur="5">so you know you&amp;#39;re in one of these two states denoted by the dotted lines.</text><text start="103" dur="4">You&amp;#39;re either in the state where you have an Ace and the other player has an Ace,</text><text start="107" dur="4">or in the state where you have an Ace and the other player has a King,</text><text start="111" dur="2">but you don&amp;#39;t know which one you&amp;#39;re in.</text><text start="113" dur="4">Similarly, over here there is confusion for the second player as to what state they&amp;#39;re in.</text><text start="117" dur="3">Now, we can solve this game using this game tree approach,</text><text start="120" dur="5">and it&amp;#39;s not quite the same as the max and the min approach,</text><text start="125" dur="6">because where you are in the states, what you know about the partial information,</text><text start="131" dur="4">affects your strategy in a way that we haven&amp;#39;t dealt with before.</text><text start="135" dur="5">One possibility for how you can evaluate a game like this</text><text start="140" dur="2">is just to convert it to the other form.</text><text start="142" dur="3">The form we&amp;#39;ve seen before is called the normal form or matrix form.</text><text start="145" dur="4">This is the sequential game in extensive form.</text><text start="149" dur="4">If we convert the extensive form to normal form, we get something like this.</text><text start="153" dur="3">Here for each player, we&amp;#39;ve denoted by a two-letter strategy </text><text start="156" dur="6">what you should do when you have an Ace and what you should do when you have a King.</text><text start="162" dur="4">So we end up with an exponentially large search space,</text><text start="166" dur="4">but here the game was so simple that it ends up being rather small,</text><text start="170" dur="3">and the game is rather trivial, and you can solve 
it.</text><text start="173" dur="7">It turns out that there are two equilibria corresponding to the strategy for player two,</text><text start="180" dur="5">which is he should call when he has an Ace, and he should fold when he has a King,</text><text start="185" dur="7">and the strategy for player one is it doesn&amp;#39;t matter if he raises or checks when he has an Ace,</text><text start="192" dur="2">but he should check when he has a King.</text><text start="194" dur="4">That would give the game a value of zero.</text><text start="198" dur="3">Now this works fine for the simple version of poker.</text><text start="201" dur="5">For real poker, this table would have about 10^18 states,</text><text start="206" dur="2"> and it would be impossible to deal with.</text><text start="208" dur="4">So we need some strategies for getting back down to a reasonable number of states.</text></transcript></video><video title="24 Game Theory Strategies.mp4" id="T2dRPPp5ffc" length="130"><transcript><text start="0" dur="4">One of the best strategies is to try abstraction.</text><text start="4" dur="3">Instead of dealing with every single possible state of the game,</text><text start="7" dur="3">we can take similar states and deal with them as if they were the same.</text><text start="10" dur="5">For example, in poker one abstraction that works pretty well is to eliminate the suits.</text><text start="15" dur="6">If no player is trying to get a flush, then we can treat all four Aces as if they were identical</text><text start="21" dur="4">rather than treating the four of them as being different</text><text start="25" dur="2">and similarly with all the other face values.</text><text start="27" dur="4">Another thing we can do is lump similar cards together.</text><text start="31" dur="5">Rather than saying that 2, 3, 4, and 5 are all different values,</text><text start="36" dur="6">if I know that I&amp;#39;m holding a pair of 10s then I can think of the other players&amp;#39; 
cards</text><text start="42" dur="5">as being equal to 10, lower than 10, or higher than 10.</text><text start="47" dur="2">Otherwise, lump them as the same.</text><text start="49" dur="3">Similarly, I can lump bets together.</text><text start="52" dur="6">Rather than thinking of every dollar amount of a bet from $1 to the upper limit,</text><text start="58" dur="4">I can lump the bets into small, medium, and large.</text><text start="62" dur="5">Then finally another way to do abstraction is rather than considering every possible deal</text><text start="67" dur="6">of all the cards, I can just consider a small subset of the deals </text><text start="73" dur="4">to do Monte Carlo sampling over the possible deals,</text><text start="77" dur="3">rather than considering them all.</text><text start="80" dur="4">This approach to extensive games can handle quite a lot</text><text start="84" dur="6">in terms of dealing with uncertainty, dealing with partial observability,</text><text start="90" dur="4">dealing with multiple agents, stochastic, sequential, dynamic.</text><text start="94" dur="2">But there are a few things they can&amp;#39;t handle very well.</text><text start="96" dur="2">They aren&amp;#39;t very good at unknown actions.</text><text start="98" dur="6">We need to know what all the actions are for either player before we can define the game.</text><text start="104" dur="2">Game theory doesn&amp;#39;t deal very well with continuous actions,</text><text start="106" dur="3">because we have this matrix-like form.</text><text start="109" dur="2">It doesn&amp;#39;t deal very well with irrational opponents.</text><text start="111" dur="5">We can know that we&amp;#39;re going to do the best we possibly can against a rational opponent,</text><text start="116" dur="4">but it doesn&amp;#39;t tell us how to exploit our opponent&amp;#39;s weakness </text><text start="120" dur="2">if he turns out to be irrational.</text><text start="122" dur="3">Then finally, it doesn&amp;#39;t 
deal with unknown utilities.</text><text start="125" dur="2">If we don&amp;#39;t know what it is we&amp;#39;re trying to optimize, </text><text start="127" dur="3">game theory isn&amp;#39;t going to tell us how to do it.</text></transcript></video><video title="25 Fed vs Politicians Question.mp4" id="1By22Z6C0UY" length="66"><transcript><text start="0" dur="3">This exercise describes a game played between</text><text start="3" dur="4"> the Federal Reserve Board and politicians.</text><text start="7" dur="5">Now the politicians have a choice whether they want to contract fiscal policy,</text><text start="12" dur="6">expand it, or do nothing, and the Fed has the same three choices.</text><text start="18" dur="4">Each party has preferences for what outcome they would like to see.</text><text start="22" dur="4">Here we&amp;#39;ve ranked them for each party from 1 being the worst outcome</text><text start="26" dur="3">to 9 being the best outcome.</text><text start="29" dur="4">What I want you to do is find the equilibrium point for this game.</text><text start="33" dur="4">There will be one equilibrium point. 
I want you to find it.</text><text start="37" dur="5">The equilibrium point defines a pure strategy for each player.</text><text start="42" dur="3">Tell me the pure strategy for the Fed.</text><text start="45" dur="3">Is it contract, do nothing, or expand?</text><text start="48" dur="2">Click on the right box here.</text><text start="50" dur="4">Similarly, for the politicians, click on the right box for their strategy,</text><text start="54" dur="2">which leads to the equilibrium point.</text><text start="56" dur="6">Then tell me the outcome for the game for each player for that equilibrium point.</text><text start="62" dur="4">Then tell me if the equilibrium is Pareto optimal.</text></transcript></video><video title="26 Fed Vs Politicians Answer.mp4" id="YucUFZgMq3A" length="119"><transcript><text start="0" dur="5">Now, we could determine the equilibrium point by examining all 9 of the outcomes </text><text start="5" dur="6">and checking each one to see if both parties do no better by switching.</text><text start="11" dur="4">But instead, I&amp;#39;m going to show an alternative method to analyze games, </text><text start="15" dur="3">which is to look for dominated strategies.</text><text start="18" dur="4">There are no dominant strategies here, but there are dominated strategies.</text><text start="22" dur="6">For example, for the politician, the strategy of contracting is dominated</text><text start="28" dur="2">by the strategy of doing nothing.</text><text start="30" dur="7">To the politician, 2 is greater than 1, 5 is greater than 4, and 9 is greater than 6.</text><text start="37" dur="3">We can say that this strategy is dominated, </text><text start="40" dur="2">and we can take it out of consideration.</text><text start="42" dur="2">Now, how does that help?</text><text start="44" dur="5">Well, now in the other direction, we do have a dominant strategy that we didn&amp;#39;t have before.</text><text start="49" dur="6">Now for the Fed, the option of 
contracting gives them 8, which is better than 5 or 4,</text><text start="55" dur="3">or 3 which is better than 2 and 1.</text><text start="58" dur="5">This is a dominant strategy for the Fed, and we can mark that off.</text><text start="63" dur="4">Now for the politicians, they know they&amp;#39;re going to be in this column,</text><text start="67" dur="4">and they have a choice of getting a 2 or a 3. </text><text start="71" dur="3">The 3 would be the strategy for the politicians. </text><text start="74" dur="4">That leads us to this Nash equilibrium point,</text><text start="78" dur="4">and the values of that outcome are 3 for each party.</text><text start="82" dur="2">Is that Pareto optimal?</text><text start="84" dur="6">Actually, it&amp;#39;s more like Pareto pessimal in that this is the worst total.</text><text start="90" dur="6">Out of all these outcomes, the total here is only 6,</text><text start="96" dur="2">whereas every other one is better.</text><text start="98" dur="4">To answer the question specifically, is it Pareto optimal,</text><text start="102" dur="5">the answer is no, because any of these four would be better for both parties.</text><text start="107" dur="3">That may tell you something about our political system.</text><text start="110" dur="2">Next time you get an outcome that you don&amp;#39;t like,</text><text start="112" dur="3">don&amp;#39;t assume that the players are irrational.</text><text start="115" dur="4">Just assume that that&amp;#39;s the way the game was set up.</text></transcript></video><video title="27 Mechanism Design.mp4" id="yny3Z2-ptpU" length="116"><transcript><text start="0" dur="2">Now let&amp;#39;s switch to the other part of game theory,</text><text start="2" dur="4">which, remember, we called mechanism design.</text><text start="6" dur="2">It could really be called game design.</text><text start="8" dur="4">The idea is that someone is going to be running a game</text><text start="12" dur="2"> that players are going to be 
participating in.</text><text start="14" dur="6">We want to design the rules of the game such that we get a high outcome</text><text start="20" dur="5">or a high expected utility for the people that run the game,</text><text start="25" dur="4">for the players who play the game, and for the public at large.</text><text start="29" dur="2">Here&amp;#39;s an example of a game.</text><text start="31" dur="3">This is the advertising game.</text><text start="34" dur="4">Here I&amp;#39;ve shown it on an Internet search engine, where you do a search,</text><text start="38" dur="4">and then ads show up, sometimes at the top, sometimes at the right,</text><text start="42" dur="4">sometimes at the bottom of the page, depending on the mechanism.</text><text start="46" dur="4">This is also done at sites like eBay that sell items, </text><text start="50" dur="4">and there&amp;#39;s lots of places where auctions are run.</text><text start="54" dur="4">The idea of mechanism design is to come up with the rules of the auction</text><text start="58" dur="8">that will make it attractive to bidders and/or people who want to respond to the ads,</text><text start="66" dur="3">and make a good result for all.</text><text start="69" dur="4">Now, one property that you would like an auction to have is to </text><text start="73" dur="4">attract more bidders to make it a more competitive market,</text><text start="77" dur="4">and you could attract more if it&amp;#39;s less work for them.</text><text start="81" dur="4">It&amp;#39;s easier for the bidders if they have a dominant strategy.</text><text start="85" dur="3">You saw how hard it was to work out the value of a game </text><text start="88" dur="2">when you didn&amp;#39;t have a dominant strategy,</text><text start="90" dur="3">and how easy it is to work it out if you did.</text><text start="93" dur="3">If you want to save everybody a lot of trouble, design the game</text><text start="96" dur="2">so that dominant strategies 
exist.</text><text start="98" dur="3">These strategies have various names in auctions.</text><text start="101" dur="4">Sometimes you call an auction strategy-proof </text><text start="105" dur="2">if you only need to know your own strategy.</text><text start="107" dur="4">You don&amp;#39;t have to think about what all the other people are going to be bidding.</text><text start="111" dur="5">They also call that truth revealing or incentive compatible.</text></transcript></video><video title="28 Auction Question.mp4" id="aCGPv2s7vvY" length="172"><transcript><text start="0" dur="5">Let&amp;#39;s examine a type of auction called the second-price auction.</text><text start="5" dur="5">This is popular in various internet search and auction sites.</text><text start="10" dur="5">The way it works is that we have a line of possible prices--</text><text start="15" dur="4">higher prices at the top--and bids come in.</text><text start="19" dur="4">Different players can bid whatever they want,</text><text start="23" dur="3">and whoever bids the highest is the winner,</text><text start="26" dur="4">but the price that they pay is the price of the second highest bidder.</text><text start="30" dur="3">Now let&amp;#39;s say you&amp;#39;re participating in this auction,</text><text start="33" dur="3">and something is for sale, and you place a value on that.</text><text start="36" dur="6">We&amp;#39;ll call that value &amp;quot;V&amp;quot;, and say V is here.</text><text start="42" dur="6">Your bid we&amp;#39;ll call &amp;quot;b&amp;quot;, and the highest other bid we&amp;#39;ll call &amp;quot;c.&amp;quot;</text><text start="48" dur="6">Now, for the payoff to you, if your bid is higher than all the others,</text><text start="54" dur="3">then the payoff is you get the value of the auction,</text><text start="57" dur="3">because you won the item, and you get V,</text><text start="60" dur="3">but you have to pay the second highest price, which is c.</text><text start="63" dur="5">You get V 
minus c. Otherwise, you lose the auction.</text><text start="68" dur="2">You don&amp;#39;t get anything, and you don&amp;#39;t pay anything.</text><text start="70" dur="2">The value to you of the auction is zero.</text><text start="72" dur="3">What I want you to do is fill in this chart to look at different strategies for different possible bids.</text><text start="75" dur="11">We&amp;#39;ll say that the value to you of the item for sale is V equals 10.</text><text start="86" dur="6">You have the option of bidding, say, 12, 10, or 8,</text><text start="92" dur="9">and we&amp;#39;ll consider the cases where the highest other bid is 7, 9, 11, or 13.</text><text start="101" dur="5">What I want you to do is fill in this chart with the value to you </text><text start="106" dur="5">of this game according to your strategy and the strategies of the other players.</text><text start="111" dur="4">Tell me if one of these strategies is a dominant strategy.</text><text start="115" dur="6">Then tell me is that dominant strategy, if there is one, a truth revealing strategy?</text><text start="121" dur="3">I should add one note about dominance.</text><text start="124" dur="5">When we talked about it before, we glossed over the possibility of ties.</text><text start="129" dur="5">If some policy is better everywhere than any other policy,</text><text start="134" dur="3">then we say that that policy strictly dominates the others.</text><text start="137" dur="5">On the other hand, if there are some ties and some places where it&amp;#39;s better </text><text start="142" dur="4">but none where it&amp;#39;s worse, then we say it weakly dominates.</text><text start="146" dur="2">Either way, it&amp;#39;s a case of dominance.</text><text start="148" dur="2">Now I&amp;#39;ll do the first entry to get you started.</text><text start="150" dur="4">If you bid 12 and the highest other bid is 7,</text><text start="154" dur="2">then you have the high bid, so you win.</text><text start="156" 
dur="3">It&amp;#39;s a second-price auction, so you pay 7.</text><text start="159" dur="10">The value of the goods is 10, so the total value of the outcome is 10 minus the cost of 7 for 3.</text><text start="169" dur="3">I want you to fill in the rest.</text></transcript></video><video title="29 Auction Answer.mp4" id="uJDkOVGGPI8" length="61"><transcript><text start="0" dur="2">Here are the answers.</text><text start="2" dur="4">We can see that the strategy of bidding 10, the true value, is weakly dominant.</text><text start="6" dur="4">It&amp;#39;s the same here, but it&amp;#39;s better in these two cases.</text><text start="10" dur="4">Let&amp;#39;s look at these cases a little bit more carefully and figure out what&amp;#39;s going on.</text><text start="14" dur="8">If you bid 12--so if b was up here--and if c snuck in between the bid and the valuation,</text><text start="22" dur="5">then you&amp;#39;d be paying too much. You&amp;#39;d be paying more than the goods are worth.</text><text start="27" dur="2">You&amp;#39;d end up with a negative utility.</text><text start="29" dur="4">So you don&amp;#39;t want to bid more than what it&amp;#39;s really worth to you.</text><text start="33" dur="5">On the other hand, if you bid down here, and if c snuck in in between</text><text start="38" dur="3">what your bid is and what the valuation is, </text><text start="41" dur="3">then you&amp;#39;ve lost the auction, and you get a zero,</text><text start="44" dur="4">but you should have won--or it would have been worth your while to win--</text><text start="48" dur="2">because the price still would have been a bargain to you.</text><text start="50" dur="5">That says that the rational strategy, the dominant strategy in second-price auction,</text><text start="55" dur="6">is to bid your true value and that makes it a truth-revealing auction mechanism.</text></transcript></video></group><group title="Unit 15" count="16"><video title="01 Introduction.mp4" id="PnfTfhkadoU" 
length="39"><transcript><text start="0" dur="5">This unit we&amp;#39;ll return to the topic of planning,</text><text start="5" dur="4"> and we&amp;#39;ll talk about 4 things that we left out last time we talked about planning.</text><text start="9" dur="2">First is time.</text><text start="11" dur="4">That is, rather than just saying whether an action occurs before an action or after it,</text><text start="15" dur="6">we&amp;#39;ll talk about actions that persist over a length of time.</text><text start="21" dur="4">Second is resources necessary to do a task.</text><text start="25" dur="2">Third is active perception--</text><text start="27" dur="4">that is, taking the action of perceiving something.</text><text start="31" dur="5">Fourth is hierarchical plans--that is, plans that consist of steps which have substeps.</text><text start="36" dur="3">We&amp;#39;ll start with time.</text></transcript></video><video title="02 Scheduling.mp4" id="rwAkYmXkU9A" length="48"><transcript><text start="0" dur="5">We&amp;#39;ll look at the problem of scheduling a series of tasks, each of which has a duration.</text><text start="5" dur="4">We&amp;#39;ll show a task network which has a start and a finish date,</text><text start="9" dur="4">and then has a sequence of tasks, which have to be completed</text><text start="13" dur="5">and arrows to indicate precedence of which ones have to go before other ones.</text><text start="18" dur="3">This task has to occur before this one,</text><text start="21" dur="4">but there&amp;#39;s nothing said about the relationship between this task and this task.</text><text start="25" dur="3">We&amp;#39;ll list for each task its duration.</text><text start="28" dur="6">This one takes 30 minutes, 30, 10, 60, 15, and 10.</text><text start="34" dur="6">Scheduling then is a process of figuring out a schedule in which</text><text start="40" dur="4">we specify the times at which each of these tasks starts</text><text start="44" dur="4">such that we 
can finish as soon as possible.</text></transcript></video><video title="03 Schedule Question.mp4" id="JDaGTuGHpqs" length="112"><transcript><text start="0" dur="5">Now a schedule is defined in terms of specifying for every task in the network</text><text start="5" dur="6">the earliest start time, which we&amp;#39;ll call &amp;quot;ES,&amp;quot; and the latest possible start time,</text><text start="11" dur="5">which we call &amp;quot;LS&amp;quot; for which it&amp;#39;s possible to complete the task network </text><text start="16" dur="3">in the shortest possible total amount of time.</text><text start="19" dur="2">We can define these with a set of recursive formulas </text><text start="21" dur="3">which can be solved by dynamic programming.</text><text start="24" dur="6">The earliest start time of the start state is defined as being zero.</text><text start="30" dur="6">The earliest start time of any state B is defined as being the maximum over all As </text><text start="36" dur="2">which have an arrow leading into B--</text><text start="38" dur="5">that is all As that are defined to be predecessors of B--</text><text start="43" dur="6">of the earliest start time of A plus the duration of A.</text><text start="49" dur="5">For example, the earliest start time of this state here would be the maximum </text><text start="54" dur="3">over all the ones that are coming in, which is only this one.</text><text start="57" dur="6">The maximum of its start time, which will be here, plus its duration, which would be 30.</text><text start="63" dur="6">Then the latest start time is defined by saying the latest start time of the finish,</text><text start="69" dur="3">is the same as the earliest start time of the finish,</text><text start="72" dur="3">because the finish by itself has no duration--</text><text start="75" dur="3">it&amp;#39;s just there to give us a point to end at.</text><text start="78" dur="3">The latest start time in general of any node A</text><text start="81" 
dur="5">is the minimum over all B which come after A,</text><text start="86" dur="4">of the latest start time of B minus the duration of A.</text><text start="90" dur="7">These formulas together define a unique schedule, which is the fastest possible schedule. </text><text start="97" dur="6">What I want you to do is fill in for me in the upper left hand the earliest start time,</text><text start="103" dur="4">in the upper right the latest start time for each of these nodes.</text><text start="107" dur="5">Here I&amp;#39;ve zoomed in a bit just to give you a little bit more room to fill in the blanks.</text></transcript></video><video title="04 Schedule Answer.mp4" id="E1feprQp5yg" length="60"><transcript><text start="0" dur="5">You can see the earliest and latest start times filled in for all the states.</text><text start="5" dur="3">Here&amp;#39;s an alternative method for visualizing this.</text><text start="8" dur="4">Here I&amp;#39;ve given names to the various actions--</text><text start="12" dur="2">the three on the top and the three on the bottom.</text><text start="14" dur="6">They have a duration along a time line, and we can see the time line there.</text><text start="20" dur="3">Notice that these three actions have no slack between them.</text><text start="23" dur="2">One has to start after the other.</text><text start="25" dur="5">We say that these are on the critical path in that if any of these slip in the schedule,</text><text start="30" dur="3">then the whole schedule will slip.</text><text start="33" dur="3">Whereas these three actions have some slack.</text><text start="36" dur="3">They could occur anywhere within this gray window,</text><text start="39" dur="4">and if this action slipped to the right, then the others would slip to the right </text><text start="43" dur="2">without affecting the final schedule.</text><text start="45" dur="3">I should say that over the years the field of scheduling has moved </text><text start="48" dur="3">in and out 
of artificial intelligence.</text><text start="51" dur="3">Some people have worked on it, but most of the work on scheduling</text><text start="54" dur="2">has been done in the field of operations research--</text><text start="56" dur="4">a closely related field to artificial intelligence.</text></transcript></video><video title="05 Resources Question.mp4" id="6bNOJTmSMgA" length="81"><transcript><text start="0" dur="4">The next question I want to address is the one of resources.</text><text start="4" dur="6">Resources are things like this pile of nuts and bolts that are used somewhere in a plan.</text><text start="10" dur="4">Of course, resources could be handled just in the language of classical planning.</text><text start="14" dur="5">Here we have a description of a problem domain in classical planning language.</text><text start="19" dur="3">The goal is to get an assembly inspected,</text><text start="22" dur="3">and in order to do that, we have the action of inspecting,</text><text start="25" dur="4">which looks at an assembly which has five nuts and bolts</text><text start="29" dur="3">which each have to be fastened to each other.</text><text start="32" dur="6">If that precondition is satisfied, then the effect is that the assembly is inspected.</text><text start="38" dur="4">We have an action of fastening a nut and bolt to the assembly,</text><text start="42" dur="5">which requires a nut and a bolt, and the result is that they&amp;#39;re fastened</text><text start="47" dur="5">and that the nut and bolt are no longer available for use.</text><text start="52" dur="4">Initially we have four nuts and five bolts.</text><text start="56" dur="6">Now the question is with this description of this problem can we achieve the goal?</text><text start="62" dur="5">Assuming that we have a depth-first tree search planner,</text><text start="67" dur="3">how many paths would that planner have to consider?</text><text start="70" dur="11">Would it be 1, 4, 5, 4 + 5, 4 * 5, 4! 
+ 5!, or 4! * 5!?</text></transcript></video><video title="06 Resources Answer.mp4" id="4AAVbUOgnkc" length="55"><transcript><text start="0" dur="3">The answer is we can&amp;#39;t achieve the goal.</text><text start="3" dur="3">We&amp;#39;re just missing a nut so we can&amp;#39;t do it.</text><text start="6" dur="4">But we&amp;#39;re going to have to consider 4 factorial times 5 factorial paths</text><text start="10" dur="2">before we discover that.</text><text start="12" dur="4">The reason is because we start out, and we say in order to achieve inspected,</text><text start="16" dur="2">we need the precondition of being fastened.</text><text start="18" dur="6">In order to achieve fastened, we need some nut and some bolt.</text><text start="24" dur="6">We can try N1 and B1, but then we would also, when we end up back tracking, </text><text start="30" dur="8">have to try N2 against B1, N3 against B1, and so on, for all these and all these,</text><text start="38" dur="3">and we&amp;#39;d have to do that at every step in the back track.</text><text start="41" dur="5">So we end up trying all combinations of nuts and all combinations of bolts.</text><text start="46" dur="6">That seems silly, and so the idea of resources is to make the nuts and bolts</text><text start="52" dur="3">rather than make each one be distinct so we can handle them more efficiently.</text></transcript></video><video title="07 Extending Planning.mp4" id="F3I-RXow-oc" length="61"><transcript><text start="0" dur="5">Here I&amp;#39;ve shown how to extend the language of classical planning to handle resources.</text><text start="5" dur="4">We&amp;#39;ve added a new type of statement saying that there are resources</text><text start="9" dur="2">and how many of each there are.</text><text start="11" dur="2">We can say there&amp;#39;s five nuts and four bolts,</text><text start="13" dur="4">and we&amp;#39;re also going to explicitly model inspectors, and we have one of them.</text><text start="17" 
dur="4">Then the actions have two new types of clauses.</text><text start="21" dur="5">The fasten action has a consume clause, saying it consumes resources</text><text start="26" dur="3">and once it uses them, they&amp;#39;re gone forever.</text><text start="29" dur="3">Fastening is going to consume one nut and one bolt.</text><text start="32" dur="5">The inspect action has a used clause, and that says it&amp;#39;s going to use one of the resources,</text><text start="37" dur="3">the inspector, while the action is going on.</text><text start="40" dur="4">But once the action is completed then the inspector has returned to the pool</text><text start="44" dur="3">and is available for use elsewhere.</text><text start="47" dur="8">Keeping track of resources this way gets rid of that computational or exponential explosion </text><text start="55" dur="6">of looking at different combinations by just treating all of the same resource identically.</text></transcript></video><video title="08 Hierarchical Planning.mp4" id="vFlu8XX2q1A" length="85"><transcript><text start="0" dur="6">The final topic in this unit is called hierarchical planning.</text><text start="6" dur="6">The idea here is we want to close the abstraction gap. 
What do I mean by that?</text><text start="12" dur="4">Well, let&amp;#39;s think about what you have to do to plan your own lifetime.</text><text start="16" dur="4">You live about maybe a couple of billion seconds,</text><text start="20" dur="5">and during that time you have a choice of actions to make,</text><text start="25" dur="5">and you have maybe around 1,000 muscles,</text><text start="30" dur="5"> which you can operate maybe around 10 per second.</text><text start="35" dur="6">You end up over a lifetime with somewhere around 10^13 actions</text><text start="41" dur="3"> give or take an order of magnitude or two.</text><text start="44" dur="6">But there&amp;#39;s a big gap between 10^13 and the 10^4 or so actions</text><text start="50" dur="4">that current planning algorithms or programs can deal with.</text><text start="54" dur="7">Part of the problem with such a big gap is that it&amp;#39;s just difficult to deal at the level of an individual muscle movement.</text><text start="61" dur="3">We&amp;#39;d rather deal with more abstract plans.</text><text start="64" dur="4">We&amp;#39;re going to introduce the notion of a hierarchical task network,</text><text start="68" dur="5">and rather than having a plan be a sequence of individual steps,</text><text start="73" dur="5">we can talk about higher order steps where maybe there&amp;#39;s a smaller number,</text><text start="78" dur="4">and the individual steps can correspond to multiple steps.</text><text start="82" dur="3">This idea is called refinement planning.</text></transcript></video><video title="09 Refinement Planning.mp4" id="x0V-LZX4zUo" length="97"><transcript><text start="0" dur="2">Here&amp;#39;s how refinement planning works.</text><text start="2" dur="4">In addition to regular actions, we have abstract actions </text><text start="6" dur="4">like going from my home to the San Francisco airport.</text><text start="10" dur="6">Then we have possible ways to refine these abstract actions into 
concrete actions.</text><text start="16" dur="5">Here one refinement is I can drive from home to long-term parking</text><text start="21" dur="2">and then take the shuttle to the airport.</text><text start="23" dur="3">Another refinement is I can just take a taxi.</text><text start="26" dur="3">Here&amp;#39;s another example of an abstract action,</text><text start="29" dur="6">which is if I&amp;#39;m at one point on a grid, ab, and I want to get to point xy,</text><text start="35" dur="2">and if I know the grid is all connected,</text><text start="37" dur="4">then I have this abstract action of just navigating from ab to xy.</text><text start="41" dur="5">One refinement says if I&amp;#39;m already there I do nothing.</text><text start="46" dur="4">Another refinement says I can start the journey by going left.</text><text start="50" dur="4">Another refinement says I can start the journey by going right and so on.</text><text start="54" dur="5">The idea is I can figure out a complex plan that involves </text><text start="59" dur="3">navigating around picking up an object, doing something else,</text><text start="62" dur="6">and do that planning just at the level of abstract actions like navigate</text><text start="68" dur="4">rather than having to figure out a path from ab to xy.</text><text start="72" dur="2">How do we know when we have a solution?</text><text start="74" dur="6">A hierarchical task network achieves the goal if for every part, every abstract action,</text><text start="80" dur="3">at least one of the refinements achieves the goal.</text><text start="83" dur="3">We only need at least one of them, because we&amp;#39;re the planner.</text><text start="86" dur="2">We get to make the choice.</text><text start="88" dur="5">It&amp;#39;s like an and/or search where we can make the best possible choices,</text><text start="93" dur="4">and if any of the choices work, then the goal can be achieved.</text></transcript></video><video title="10 Reachable 
States.mp4" id="EbLvL12Cgsg" length="110"><transcript><text start="0" dur="3">Now, in addition to doing an and/or search,</text><text start="3" dur="5">sometimes we can solve an abstract hierarchical task network planning problem</text><text start="8" dur="4">without going all the way down to the concrete steps.</text><text start="12" dur="2">Let&amp;#39;s talk about how to do that.</text><text start="14" dur="2">Here we have a description of a state space.</text><text start="16" dur="6">The start state is here, and the goal state is outlined in gray here.</text><text start="22" dur="8">We have one abstract action, and we&amp;#39;re shown a set of possible states</text><text start="30" dur="3">that can be reached by that abstract action,</text><text start="33" dur="4">if we refine the abstract action, using one concrete action or another.</text><text start="37" dur="4">This is like when we were dealing with belief states</text><text start="41" dur="3">where we would move, because we had a stochastic action,</text><text start="44" dur="4">from one state to several possible other states.</text><text start="48" dur="3">Here we have several possible states that we&amp;#39;ll end up with,</text><text start="51" dur="3">not because the actions are stochastic,</text><text start="54" dur="4"> but because we haven&amp;#39;t decided yet which refinement we&amp;#39;re going to use.</text><text start="58" dur="3">This would be a single step that would bring us to this belief state,</text><text start="61" dur="6">and then when we add a second step, we get to this belief state.</text><text start="67" dur="4">Now we can check to see if we can achieve the goal with this two-step plan</text><text start="71" dur="6">just by checking if there is an intersection between the reachable state and the goal state.</text><text start="77" dur="2">In this case, there is.</text><text start="79" dur="2">We know that we&amp;#39;ve achieved the goal,</text><text start="81" dur="4">and now if we 
want to find a refinement that actually works,</text><text start="85" dur="3">the way to do it is to search backwards rather than forward.</text><text start="88" dur="5">If we search forward we&amp;#39;d have a large tree of possibilities,</text><text start="93" dur="3">but if we search backwards, we know the intersection is here.</text><text start="96" dur="4">What could have brought us to here? Only this refinement.</text><text start="100" dur="3">And what could have brought us to this state? Only this refinement.</text><text start="103" dur="7">That&amp;#39;s the plan that is a refinement of this abstract plan that achieves the goal.</text></transcript></video><video title="11 Reachable States Question.mp4" id="p0MpuUkee_8" length="121"><transcript><text start="0" dur="3">Now sometimes it may be very difficult to specify </text><text start="3" dur="5">exactly what states can be reachable by an abstract action,</text><text start="8" dur="3">because the refinements are complicated.</text><text start="11" dur="5">We can go with the notion of an approximate set of reachable states.</text><text start="16" dur="2">That&amp;#39;s what I&amp;#39;ve shown schematically here.</text><text start="18" dur="5">For this abstract action, I&amp;#39;ve shown a lower bound and an upper bound </text><text start="23" dur="2">on the states that are reachable.</text><text start="25" dur="2">What do I mean by that?</text><text start="27" dur="4">Consider the abstract action of going to the airport in San Francisco.</text><text start="31" dur="4">Now, some things I know are going to be true about the resulting state.</text><text start="35" dur="6">I know it&amp;#39;s going to take, say, half an hour to get there no matter what way I go.</text><text start="41" dur="2">That&amp;#39;s always going to be true.</text><text start="43" dur="3">Other things depend on which choice I make.</text><text start="46" dur="4">I may consume some money if I take a taxi.</text><text start="50" dur="3">I may 
consume some gas if I take a car,</text><text start="53" dur="6">but I may not be able to specify exactly which of those combinations hold true.</text><text start="59" dur="4">So we approximate the set of reachable states by this lower bound</text><text start="63" dur="2">of things that we know we can get to</text><text start="65" dur="3">and this upper bound of things that we might be able to get to,</text><text start="68" dur="7">but we&amp;#39;re not quite sure if all combinations of them will check out depending on the refinement.</text><text start="75" dur="6">Here, similarly, there&amp;#39;s another set of lower and upper bounds and here as well.</text><text start="81" dur="2">These are the goals. </text><text start="83" dur="5">What I want you to tell me is for each of these three actions,</text><text start="88" dur="7">is it guaranteed, yes, that I can reach the goal state if I choose the right refinement,</text><text start="95" dur="4">or is it never possible--no, that I&amp;#39;ll never be able to reach the goal state--</text><text start="99" dur="7">or is it uncertain yet because the description of upper and lower bound</text><text start="106" dur="3">doesn&amp;#39;t tell us enough about whether we can reach the goal state.</text><text start="109" dur="5">Answer that for this abstract action here, </text><text start="114" dur="3">and for this abstract action here,</text><text start="117" dur="4">and for this abstract action here.</text></transcript></video><video title="12 Reachable States Answer.mp4" id="JK7hG3Rut88" length="60"><transcript><text start="0" dur="3">In the case of this abstract action, </text><text start="3" dur="4">we know all the possible outcomes are somewhere within here</text><text start="7" dur="2">and none of those intersect with a goal, </text><text start="9" dur="3">so there&amp;#39;s nothing we can do to make this one work.</text><text start="12" dur="4">For this abstract action, we see that there is an intersection,</text><text 
start="16" dur="4">and there&amp;#39;s an intersection even in the under estimate of the state.</text><text start="20" dur="3">We know that we can reach someplace in here,</text><text start="23" dur="3">and since we have the choice of where we want to go,</text><text start="26" dur="2">we know we can reach there,</text><text start="28" dur="5">so we know that we can always refine this abstract action to achieve the goal.</text><text start="33" dur="3">Over here there&amp;#39;s an intersection, </text><text start="36" dur="5">but it&amp;#39;s only in the over estimate--the outside part of the search space.</text><text start="41" dur="2">So we&amp;#39;re not quite sure.</text><text start="43" dur="3">We have to look more carefully at the refinements to see if there is</text><text start="46" dur="3">a combination of refinements that allow us to reach this state,</text><text start="49" dur="5">or if the combination of refinements leave us somewhere over here,</text><text start="54" dur="2">which is not inside that state.</text><text start="56" dur="4">So that would be questionable or unknown yet.</text></transcript></video><video title="13 Conformant Plan Question.mp4" id="E2sGo9KJg48" length="111"><transcript><text start="0" dur="2">Here&amp;#39;s one more topic. 
</text><text start="2" dur="5">We&amp;#39;re going to talk about how to extend classical planning to allow active perception</text><text start="7" dur="2">to deal with partial observability.</text><text start="9" dur="2">Here&amp;#39;s a problem description.</text><text start="11" dur="4">There&amp;#39;s a table and a chair, and there are two cans of paint.</text><text start="15" dur="3">The table is within our field of view, </text><text start="18" dur="5">and our goal is to have the chair and the table have the same color.</text><text start="23" dur="2">Here&amp;#39;s the actions.</text><text start="25" dur="3">We can remove the lid from a can, making it open.</text><text start="28" dur="5">We can paint one thing with that can if the can is open.</text><text start="33" dur="4">We also have an active perception action,</text><text start="37" dur="2">which is we can look at something.</text><text start="39" dur="2">If it&amp;#39;s in view, we can look at it,</text><text start="41" dur="2">and then we&amp;#39;re looking at that one thing, and we&amp;#39;re no longer looking </text><text start="43" dur="2">at whatever we were looking at before.</text><text start="45" dur="5">Now, here&amp;#39;s the big extension: in addition to actions,</text><text start="50" dur="2">we now have percept schemas.</text><text start="52" dur="3">Action schemas and perception schemas,</text><text start="55" dur="4">and we can perceive the color of something if it&amp;#39;s an object.</text><text start="59" dur="4">Here the objects are declared to be the table and the chair,</text><text start="63" dur="2">and if it&amp;#39;s within our field of view.</text><text start="65" dur="3">Notice that here we&amp;#39;re introducing a new variable.</text><text start="68" dur="3">We never did that before in planning.</text><text start="71" dur="3">Before all the actions in planning, all the variables,</text><text start="74" dur="4">were predefined by matching against the 
precondition.</text><text start="78" dur="2">Here we&amp;#39;re introducing a new variable.</text><text start="80" dur="4">We&amp;#39;re saying if these preconditions are true,</text><text start="84" dur="3">then you can perceive something, and you&amp;#39;ll learn something new.</text><text start="87" dur="2">You&amp;#39;ll learn the value of this variable. </text><text start="89" dur="3">Here&amp;#39;s a question. How can we achieve this goal?</text><text start="92" dur="5">The first thing I want to ask is, without even thinking about the percepts,</text><text start="97" dur="5">is there a conformant plan, that is, a plan that doesn&amp;#39;t do sensing,</text><text start="102" dur="3">that will allow us to achieve this goal?</text><text start="105" dur="3">Is there that type of conformant plan?</text><text start="108" dur="3">Tell me yes or no.</text></transcript></video><video title="14 Conformant Plan Answer.mp4" id="fCNcUfM_u5Q" length="23"><transcript><text start="0" dur="2">The answer is, yes, there is.</text><text start="2" dur="3">That is we can remove the lid from the can,</text><text start="5" dur="4">and we can do that to either can, can one or can two,</text><text start="9" dur="3">and then without knowing what color that can is and without knowing </text><text start="12" dur="3">what color any of the furniture is,</text><text start="15" dur="3">we can first paint the table, and then paint the chair.</text><text start="18" dur="2">Then we know they&amp;#39;ll both have the same color, </text><text start="20" dur="3">and we&amp;#39;ll have achieved the goal.</text></transcript></video><video title="15 Sensory Plan Question.mp4" id="nbtPYp0rtAU" length="31"><transcript><text start="0" dur="2">Now one of the problems with this plan is </text><text start="2" dur="3">say the chair and the table were already the same color.</text><text start="5" dur="4">We would&amp;#39;ve wasted our time painting them when we didn&amp;#39;t have to do it.</text><text start="9" 
dur="6">Now the next question is, yes or no, is there a better sensory plan?</text><text start="15" dur="5">That is a plan that uses perception and comes up with,</text><text start="20" dur="5">in at least some cases, a smaller number of painting actions.</text><text start="25" dur="3">We&amp;#39;re going to allow these plans to have conditionals in them </text><text start="28" dur="3">as well as having perception.</text></transcript></video><video title="16 Sensory Plan Answer.mp4" id="6Nk33P1ZzkM" length="53"><transcript><text start="0" dur="2">The answer is, yes, there is.</text><text start="2" dur="2">There&amp;#39;s a variety of possibilities.</text><text start="4" dur="3">I&amp;#39;ll show you a somewhat complex one.</text><text start="7" dur="5">This one says we can look at the table and look at the chair.</text><text start="12" dur="5">Then if the color of the table and the color of the chair is the same, c,</text><text start="17" dur="4">then do nothing. We&amp;#39;re done. We don&amp;#39;t have to do any painting.</text><text start="21" dur="3">Otherwise, we can remove the lids from can one and look at it,</text><text start="24" dur="4">remove the lid from can two and look at it.</text><text start="28" dur="5">If the color of the table is the same as the color of the can,</text><text start="33" dur="3">then we can paint the chair with that can.</text><text start="36" dur="5">This is for any possible can, either can one or can two.</text><text start="41" dur="7">Otherwise, if the color of the chair and the color of the can match,</text><text start="48" dur="5">then we can paint the table, and otherwise we have to paint both of them.</text></transcript></video></group><group title="Homework 6" count="22"><video title="01 Max Likelihood Question.mp4" id="xoiKLkSgDx4" length="66"><transcript><text start="0" dur="5">In this question I&amp;#39;d like you to fit a Markov model.</text><text start="5" dur="5">We have 3 sequences of observation--A, B, C, A, B, 
C.</text><text start="10" dur="4">This is the first state--state0--and these are the other states.</text><text start="14" dur="5">Same for A, A, B, B, C, C and A, A, A, C, C, C.</text><text start="19" dur="3">These are 3 different observation sequences. </text><text start="22" dur="4">The first element here is state0.</text><text start="26" dur="6">All the other ones are examples of transitions from state xt-1 to state xt.</text><text start="32" dur="6">Using maximum likelihood, I&amp;#39;d like to know the initial probabilities for state0</text><text start="38" dur="5">and all the transition probabilities. Those are indicated by the arrow.</text><text start="43" dur="5">This is the probability that A at time t - 1 goes to A at time t.</text><text start="48" dur="2">Here they are.</text><text start="50" dur="5">For any of the symbols in the sequence, we see there are three possible outcomes--</text><text start="55" dur="3">A-B-C, A-B-C, and A-B-C.</text><text start="58" dur="3">Each has a probability. 
Obviously, these 3 over here have to add up to 1.</text><text start="61" dur="5">These 3 over here have to add up to 1, and these 3 over here have to add up to 1.</text></transcript></video><video title="02 Max Likelihood Answer.mp4" id="Ds1rUQtphlE" length="60"><transcript><text start="0" dur="3">Initially there is only A,</text><text start="3" dur="5">and with maximum likelihood we get that the probability of A being in the first place is 1,</text><text start="8" dur="2">and all the other ones are 0.</text><text start="10" dur="6">There are 7 transitions out of A as indicated by the lines under the As over here.</text><text start="16" dur="6">In 3 cases, A is followed by A--over here, over here, and over here.</text><text start="22" dur="3">This gives us 3/7 for the maximum likelihood estimator.</text><text start="25" dur="6">A flows into B in another 3 cases--over here, over here, and over here--again, 3/7.</text><text start="31" dur="5">There is 1 instance where A moves into C--1/7.</text><text start="36" dur="3">There are 4 transitions out of B--</text><text start="39" dur="4">3 go to C, and 1 goes to B over here,</text><text start="43" dur="3">which gives us 0, 1/4, and 3/4.</text><text start="46" dur="5">Then there are 4 transitions out of C--one, two, three, four--</text><text start="51" dur="4">3 of which result in C, but one goes into A over here.</text><text start="55" dur="5">These are the results--1/4, 0, and 3/4.</text></transcript></video><video title="03 Stationary Distribution Question.mp4" id="YYLJEwBxaCo" length="28"><transcript><text start="0" dur="3">Here we&amp;#39;re given a Markov chain between A and B,</text><text start="3" dur="6">where A transitions to itself with probability 0.9 and transitions to B with probability 0.1.</text><text start="9" dur="6">B stays in B with 0.5 chance and transitions back into A with 0.5 chance.</text><text start="15" dur="3">I&amp;#39;d like to know the stationary distribution over here.</text><text start="18" dur="5">So 
what&amp;#39;s the probability of A in the stationary distribution?</text><text start="23" dur="5">Of course correspondingly, what&amp;#39;s the probability of B in the stationary distribution?</text></transcript></video><video title="04 Stationary Distribution Answer.mp4" id="0tm2fyKNpsU" length="38"><transcript><text start="0" dur="7">The answer is 5/6 for A and 1/6 for B. </text><text start="7" dur="6">To see, let&amp;#39;s call this probability over here, X, and we can now solve for the equation</text><text start="13" dur="9">that in the stationary case, the probability of A is 0.9 x the probability of A before + 0.5 </text><text start="22" dur="2">x the probability of B, which is 1 - X.</text><text start="24" dur="5">If you solve this, this gives us X = 0.4X + 0.5 </text><text start="29" dur="1">or put differently,</text><text start="30" dur="8">0.6X = 0.5. That is the same as saying X = 5/6, which is what I wrote over here.</text></transcript></video><video title="05 HMM Question.mp4" id="7M2WSm5o978" length="54"><transcript><text start="0" dur="4">I am now asking a hidden Markov model question.</text><text start="4" dur="4">We are given the following hidden Markov model with 2 internal states,</text><text start="8" dur="3">where the probability of transitioning to the other state is 0.5, </text><text start="11" dur="4">and the probability of staying is therefore, 0.5.</text><text start="15" dur="7">This hidden Markov model has 2 possible measurements or observations, X and Y.</text><text start="22" dur="3">The probability of observing X and Y depends on what state </text><text start="25" dur="2">the hidden Markov model is in.</text><text start="27" dur="5">For A, it&amp;#39;s 0.1 for X and 0.9 for Y. 
</text><text start="32" dur="5">For B, it&amp;#39;s 0.8 for X and 0.2 for Y.</text><text start="37" dur="9">Let&amp;#39;s assume that the initial probability of distribution x 0 is 1/2 for either of the 2 states.</text><text start="46" dur="5">I would like to know what&amp;#39;s the [   ] probability of being A x 0 given that we observed</text><text start="51" dur="3">X x 0 and then Y, what&amp;#39;s the [ ] probability of state A x 1 given the observation of X x 0,</text></transcript></video><video title="06 HMM Answer.mp4" id="_EwavmSM-HQ" length="?"><transcript><text start="0" dur="3">No subtitles...</text></transcript></video><video title="07 Particle Filter Question 1.mp4" id="I9G18f4LdGE" length="49"><transcript><text start="0" dur="3">This is a particle filter question.</text><text start="3" dur="2">Suppose we had a world with 4 states.</text><text start="5" dur="6">In these 2 states over here, we tend to observe A with probability 80%.</text><text start="11" dur="2">The remaining 20%, we&amp;#39;ll observe B.</text><text start="13" dur="5">In these 2 states, we tend to observe B with 80% probability</text><text start="18" dur="3">and with 20% probability, we observe A.</text><text start="21" dur="5">Suppose we have 3 particles--1 over here, 1 over here, and 1 over here,</text><text start="26" dur="1">and we observe A.</text><text start="27" dur="4">Let&amp;#39;s call this particle over here particle A. Lower caps a.</text><text start="31" dur="1">Lower caps b.</text><text start="32" dur="2">Lower caps c.</text><text start="34" dur="5">What is the probability of sampling a, given that we just observed A,</text><text start="39" dur="4">which means it will be more likely to be in 1 of these 2 states over here?</text><text start="43" dur="6">What&amp;#39;s the probability of sampling b? 
What&amp;#39;s the probability of sampling c?</text></transcript></video><video title="08 Particle Filter Answer 1.mp4" id="oj_0FSTozI0" length="29"><transcript><text start="0" dur="6">The particle a will get an importance weight of 0.8, nonnormalized.</text><text start="6" dur="2">You have to normalize in the 2nd.</text><text start="8" dur="7">Particle b will get an importance weight of 0.2. Same for particle c.</text><text start="15" dur="3">If you add those together, we get 1.2.</text><text start="18" dur="7">Then we have to divide everything by 1.2--which is 2/3 for the probability of sampling a,</text><text start="25" dur="4">and 1/6 each for b or c.</text></transcript></video><video title="09 Particle Filter Question 2.mp4" id="9-P7zxOaMAA" length="53"><transcript><text start="0" dur="3">Here&amp;#39;s another particle filter question.</text><text start="3" dur="2">Now we&amp;#39;re looking at the state position,</text><text start="5" dur="4">beginning with the same state position as before and the same 3 particles--</text><text start="9" dur="3">1 over here, 1 over here, 1 over here,</text><text start="12" dur="6">and to give the states names, you&amp;#39;re going to call this one a1, a2, b1, and b2.</text><text start="18" dur="5">Let&amp;#39;s assume we take a single random particle with uniform distribution,</text><text start="23" dur="3">and we simulate a next state.</text><text start="26" dur="7">The state transition works as follows: A particle will move with probability 1 to an adjacent state,</text><text start="33" dur="4">where adjacent is either north, south, east, or west, but not diagonal.</text><text start="37" dur="4">Because every particle has 2 adjacent states, it&amp;#39;ll break ties at random</text><text start="41" dur="3">so you&amp;#39;re going to pick 1 of the 2 with 50% probability.</text><text start="44" dur="2">So with this 1 particle that you&amp;#39;ve drawn at random, </text><text start="46" dur="1">and it&amp;#39;s in the random single 
position,</text><text start="47" dur="6">what&amp;#39;s the probability that it finds itself in a1, a2, b1, and b2?</text></transcript></video><video title="10 Particle Filter Answer 2.mp4" id="8MJvx6GlIt4" length="32"><transcript><text start="0" dur="4">If you look at the chances of the particles that can end up in a1, </text><text start="4" dur="4">we find that only this particle can go up here, so let&amp;#39;s call this a 1.</text><text start="8" dur="3">To a2, there&amp;#39;s 2 particles that can go to this. Call this a 2.</text><text start="11" dur="5">There&amp;#39;s 2 particles that can end up in b1. This one and this guy over here. Again, a 2,</text><text start="16" dur="2">and one that can make it into b2.</text><text start="18" dur="5">Now this is a total of 6. If you normalize, you get the actual probabilities.</text><text start="23" dur="2">a1 is worth 1/6.</text><text start="25" dur="3">a2, 1/3, which is 2/6ths.</text><text start="28" dur="2">b1, again 1/3,</text><text start="30" dur="2">and b2 is 1/6.</text></transcript></video><video title="11 Particle Filter Question 3.mp4" id="4qS60Fis3MQ" length="37"><transcript><text start="0" dur="5">So here&amp;#39;s a multiple choice question for particle filters.</text><text start="5" dur="3">Say we implement a particle filter such as a mobile localization,</text><text start="8" dur="2">and we use exactly 1 particle.</text><text start="10" dur="4">Which of the following statements are true?</text><text start="14" dur="3">Check any or all of the following statements.</text><text start="17" dur="4">Measurements will be ignored? 
Check this box if you believe that&amp;#39;s the case.</text><text start="21" dur="3">The result is generally poor.</text><text start="24" dur="3">It cannot represent multi-modal distributions.</text><text start="27" dur="3">The initial state, if known, is ignored.</text><text start="30" dur="3">The state transitions are ignored.</text><text start="33" dur="4">If none of the above applies, check the final box down here.</text></transcript></video><video title="12 Particle Filter Answer 3.mp4" id="ap5hwlL-3ak" length="59"><transcript><text start="0" dur="6">And the answer is measurements are indeed ignored which has to do with the following:</text><text start="6" dur="5">We do weigh particles by the measurement probability but we normalize them to 1,</text><text start="11" dur="4">and if there&amp;#39;s only 1 particle, it will always normalize itself back to 1,</text><text start="15" dur="3">so the measurement probability has no effect.</text><text start="18" dur="2">The results are generally poor. 
</text><text start="20" dur="5">That is, a single particle is just insufficient to do anything interesting, </text><text start="25" dur="2">so that is absolutely the correct answer over here.</text><text start="27" dur="4">Clearly, a single particle cannot represent multi-modal distributions </text><text start="31" dur="3">because multi-modes look something like this over here </text><text start="34" dur="3">because it&amp;#39;s just 1 particle, so this is actually correct.</text><text start="37" dur="4">The initial state, if known, is not necessarily ignored.</text><text start="41" dur="3">You might actually place the particle at the initial state</text><text start="44" dur="4">and it might consider it in the filtering result.</text><text start="48" dur="5">The state transitions are also not ignored because we will still propagate this particle </text><text start="53" dur="4">forward according to the state transition, and because 3 of them are true, </text><text start="57" dur="2">the final one isn&amp;#39;t.</text></transcript></video><video title="13 Particle Filter Question 4.mp4" id="OQretRV-h8w" length="21"><transcript><text start="0" dur="3">Another particle filter question.</text><text start="3" dur="2">Check the following statements if they&amp;#39;re true.</text><text start="5" dur="3">They are usually easy to implement.</text><text start="8" dur="6">They scale quadratically with the dimensionality of the state space.</text><text start="14" dur="3">They can only be applied to discrete state spaces.</text><text start="17" dur="4">Finally, if none of those applies, check the final check mark over here.</text></transcript></video><video title="14 Particle Filter Answer 4.mp4" id="ZBio9IhSEPQ" length="38"><transcript><text start="0" dur="3">And only the first answer is correct.</text><text start="3" dur="4">They are usually easy to implement compared to any other filter.</text><text start="7" dur="4">They do not scale quadratically, in fact, normally they scale 
exponentially </text><text start="11" dur="2">with the dimensionality of the state space because you need </text><text start="13" dur="3">exponentially many particles to fill up the state space.</text><text start="16" dur="5">The filter that scales quadratically is called a Kalman filter,</text><text start="21" dur="3">but we didn&amp;#39;t really talk about it in this class.</text><text start="24" dur="3">They can only be applied to discrete state spaces is clearly wrong.</text><text start="27" dur="3">We saw how to apply them to robot localization, </text><text start="30" dur="3">which is a continuous or real-valued state space,</text><text start="33" dur="5">and because this first one is true, the last one, none of the above, is clearly false.</text></transcript></video><video title="15 Max Min Question.mp4" id="4Pbln8KFnrs" length="26"><transcript><text start="0" dur="3">For this exercise, I want you to solve this game.</text><text start="3" dur="3">So it&amp;#39;s a 2 x 2 zero-sum game.</text><text start="6" dur="5">We&amp;#39;re showing the utilities to the player Max, and so tell me what Max&amp;#39;s strategy is</text><text start="11" dur="6">by putting in probabilities in these 2 boxes for his 2 plays--1 and 2.</text><text start="17" dur="5">Tell me what Min&amp;#39;s strategy is by putting in probabilities in these boxes,</text><text start="22" dur="4">and then tell me the value of the game, the expected utility to Max.</text></transcript></video><video title="16 Max Min Answer.mp4" id="CCq4rMC67uw" length="20"><transcript><text start="0" dur="4">The answer is both players have a rational mixed strategy.</text><text start="4" dur="6">For Max, he plays 1--8/17ths of the time and 2--9/17ths of the time.</text><text start="10" dur="5">Min would play 1--10/17ths and 2--7/17ths, </text><text start="15" dur="5">and then the utility of the game to Max turns out to be 5/17ths.</text></transcript></video><video title="17 Scheduling Question.mp4" id="ZIkmqzwy43I" 
length="18"><transcript><text start="0" dur="3">In this scheduling problem, we have a network of actions </text><text start="3" dur="5">with the precedence relations between them and the duration of each action</text><text start="8" dur="1">shown in each box.</text><text start="9" dur="6">What I want you to do is, for each action fill in the earliest start time in the upper left</text><text start="15" dur="3">and the latest start time in the upper right.</text></transcript></video><video title="18 Scheduling Answer.mp4" id="CXlCAR0ViO8" length="11"><transcript><text start="0" dur="3">Here we see the start times for each of the actions.</text><text start="3" dur="5">Note that the critical path, along which the earliest and latest start times are the same, </text><text start="8" dur="3">goes straight down the center.</text></transcript></video><video title="19 Game Tree Question.mp4" id="cJOCbF2UeLE" length="65"><transcript><text start="0" dur="3">Here&amp;#39;s a game tree for a stochastic 2-player game.</text><text start="3" dur="4">There are max nodes, min nodes, and chance nodes.</text><text start="7" dur="6">What I want you to do is back up all the values, so fill in a value for the value of </text><text start="13" dur="6">each of these nodes, and then check off all the nodes that could be pruned away </text><text start="19" dur="7">by a procedure that&amp;#39;s similar to alpha beta, but updated to handle chance nodes.</text><text start="26" dur="6">So what I mean by that is, a node can be pruned away if evaluating the nodes</text><text start="32" dur="5">is not necessary to figure out what the best moves are for max and min.</text><text start="37" dur="4">For the chance nodes, all the possibilities are equally probable.</text><text start="41" dur="2">So here, there&amp;#39;s a 1/3 chance of each of these.</text><text start="43" dur="4">Here there&amp;#39;s a 1/2 chance of each of these.</text><text start="47" dur="8">And in this game, the result of every game is either 
+1, -1, or 0,</text><text start="55" dur="5">and all the players know that those are the only possible outcomes for the game.</text><text start="60" dur="4">Therefore, the players can take that into account when trying to figure out</text><text start="64" dur="1">which nodes to prune away.</text></transcript></video><video title="20 Game Tree Answer.mp4" id="Y5-dwkt8oBc" length="66"><transcript><text start="0" dur="5">For backed up values for the min nodes, the backed up value is always the minimum.</text><text start="5" dur="5">For the chance nodes, it&amp;#39;s the expectation, or the average, </text><text start="10" dur="3">and for the max node, it&amp;#39;s the maximum.</text><text start="13" dur="1">Here are the nodes that can be pruned. </text><text start="14" dur="6">This and this can be pruned because in each of these cases min has achieved</text><text start="20" dur="4">the best possible play that min can get, and therefore,</text><text start="24" dur="2">doesn&amp;#39;t need to consider any other possibilities.</text><text start="26" dur="3">Once you know you can win the game or do the best you can, </text><text start="29" dur="3">you don&amp;#39;t need to find another way to do just as well.</text><text start="32" dur="5">This node here can be pruned and thus, all the ones below it because </text><text start="37" dur="5">at this point, when we&amp;#39;re trying to evaluate this node, max knows he can get </text><text start="42" dur="6">at least 1/3, and here once we know that this node is worth -1</text><text start="48" dur="4">then we know that regardless of the value here,</text><text start="52" dur="4">that has to be somewhere between -1 and +1.</text><text start="56" dur="5">So therefore, the expectation has to be between -1 and 0, </text><text start="61" dur="5">and if 0 is the best that this can be, max knows he already has 1/3 over here.</text></transcript></video><video title="21 Strategy Question.mp4" id="sFPhOsgIVw8" length="17"><transcript><text 
start="0" dur="4">Here&amp;#39;s a 2-player game. Each player has 3 possible moves.</text><text start="4" dur="5">What I want you to tell me is, first, does A have a dominant strategy? Yes or no.</text><text start="9" dur="3">Second, does B have a dominant strategy? Yes or no.</text><text start="12" dur="5">Third, click on all the boxes that are equilibrium points.</text></transcript></video><video title="22 Strategy Answer.mp4" id="j_vxzbESRPM" length="34"><transcript><text start="0" dur="3">A does not have a dominant strategy, but B does.</text><text start="3" dur="8">If B plays the middle e, then in this case, B will get 5, which is more than 3 or 2.</text><text start="11" dur="4">In this case, B will get 7, which is more than 2 or 4,</text><text start="15" dur="4">and in this case, B will get 8, which is more than 7 or 5.</text><text start="19" dur="3">So that makes this play the dominant strategy for B.</text><text start="22" dur="4">Now if B is going to do that, then what should A do?</text><text start="26" dur="3">Well, A should try to get the best possible value that A can,</text><text start="29" dur="5">and that would be here, and that makes this square the lone equilibrium point.</text></transcript></video></group><group title="Unit 16" count="47"><video title="01 Introduction.mp4" id="AXx0H_kBCPE" length="135"><transcript><text start="0" dur="3">Hi again. 
It&amp;#39;s great to see you again.</text><text start="3" dur="4">We talked a lot about basic methods of AI,</text><text start="7" dur="4">and from today on we&amp;#39;d like to go into applications.</text><text start="11" dur="4">Specifically, today we&amp;#39;ll talk about computer vision.</text><text start="15" dur="3">Computer vision is a very bright field </text><text start="18" dur="6">that concerns itself with making sense out of camera images or video.</text><text start="24" dur="6">Many devices today are equipped with cameras, such as cell phones or cars,</text><text start="30" dur="4">and making sense out of image data has become a really important subfield</text><text start="34" dur="3">of artificial intelligence.</text><text start="37" dur="2">Today I&amp;#39;ll teach you some of the very basics.</text><text start="39" dur="3">It&amp;#39;s not as deep as my graduate level class on computer vision,</text><text start="42" dur="4">and I hope you get a chance to take that in the future,</text><text start="46" dur="4">but I hope to enable you to apply some of the very basic methods</text><text start="50" dur="6">to, for example, use images and classify them using artificial intelligence technology</text><text start="56" dur="3">through feature extraction and other techniques</text><text start="59" dur="4">and also to start doing some of the more 3D-oriented tasks</text><text start="63" dur="3">such as 3D constructions.</text><text start="66" dur="2">So let&amp;#39;s start with the very, very basics </text><text start="68" dur="4">and ask ourselves what is a camera.</text><text start="73" dur="3">Cameras come in all sizes and shapes.</text><text start="76" dur="4">This is my beautiful Nikon D3 camera [shutter clicks],</text><text start="80" dur="6">but I don&amp;#39;t use it much because it&amp;#39;s very heavy, even though it takes beautiful pictures.</text><text start="86" dur="4">This is the camera I use the most. 
It&amp;#39;s a cell phone camera.</text><text start="90" dur="3">It&amp;#39;s an 8 megapixel camera over here with a flash,</text><text start="93" dur="6">and I can start it, and you get to see whatever is underneath,</text><text start="99" dur="3">like this pen over here.</text><text start="104" dur="6">I can also activate the front camera, and you get to see the way I&amp;#39;ve been recording</text><text start="110" dur="4">all those wonderful online lectures over all those weeks</text><text start="114" dur="3">with this little camera over here.</text><text start="118" dur="5">In all of those cameras there is a lens and there&amp;#39;s a chip,</text><text start="123" dur="5">and the light is captured from the environment and focused through the lens on the chip,</text><text start="128" dur="7">which raises the question, how do a lens and a chip really work?</text></transcript></video><video title="02 Image Formation.mp4" id="uhP3jrxraMk" length="176"><transcript><text start="1" dur="5">[Thrun] The science of how images are created using cameras is called image formation,</text><text start="6" dur="5">where formation just means the way an image is being captured.</text><text start="11" dur="4">Perhaps the easiest model of a camera is called a pinhole camera.</text><text start="15" dur="4">In a pinhole camera, the light from within the world</text><text start="19" dur="6">goes through a very small hole--ideally it&amp;#39;s a really, really small hole--</text><text start="25" dur="4">to project into a camera chip that sits somewhere in the background.</text><text start="29" dur="4">So for example, if you had an object that was a person over here,</text><text start="33" dur="3">then this person would be projected as follows.</text><text start="36" dur="6">The feet would be projected to over here and the head to over here,</text><text start="42" dur="6">which gives us this inverted person on the projection plane or the camera chip.</text><text start="48" 
dur="4">There is some very basic math that governs the geometry of a pinhole camera.</text><text start="52" dur="9">If we call X the physical height of the object and small x the height of the projection,</text><text start="61" dur="5">which I&amp;#39;ll call -x because it points in the opposite direction as the original object,</text><text start="66" dur="3">then we can also talk about other values </text><text start="69" dur="5">such as the distance of the object to the camera plane</text><text start="74" dur="5">and f, which is the focal distance of the camera,</text><text start="79" dur="6">which is the distance between the pinhole and the projection plane over here.</text><text start="85" dur="6">There&amp;#39;s a simple piece of math that relates all of those 4 variables over here,</text><text start="91" dur="3">and it&amp;#39;s easily obtained by what&amp;#39;s called similar triangles.</text><text start="94" dur="7">In particular, it turns out if I map this triangle over here to right over here--</text><text start="101" dur="9">so these are the same triangles, just flipped, where x is over here and f is over here--</text><text start="110" dur="8">we get that the ratio of upper caps X to Z is the same as lower caps x to f.</text><text start="118" dur="2">So I write this as follows.</text><text start="120" dur="2">This is a result of similar triangles.</text><text start="122" dur="3">So as you take a triangle of a certain shape,</text><text start="125" dur="5">when you scale it up to larger triangles, those proportions are retained,</text><text start="130" dur="6">so therefore, upper caps X divided by Z is the same as lower caps x divided by f.</text><text start="136" dur="4">If we now transform this, I find that the projection of lower caps x,</text><text start="140" dur="7">which I might care about, is upper caps X, the physical size of the object itself,</text><text start="147" dur="6">times the quotient of the focal length over the distance.</text><text 
start="153" dur="3">That&amp;#39;s an interesting equation.</text><text start="156" dur="4">The further an object is away, the smaller it appears.</text><text start="160" dur="6">The larger the focal length of the camera, the larger the object in its projection.</text><text start="166" dur="4">And of course the size of the object itself directly influences </text><text start="170" dur="3">how big its image of the object really is.</text><text start="173" dur="3">So let&amp;#39;s see if you can practice that equation using a quiz.</text></transcript></video><video title="03 Projection Length Question.mp4" id="lvmKP3x6Ht0" length="26"><transcript><text start="0" dur="2">[Thrun] So here is our equation again.</text><text start="2" dur="4">Let&amp;#39;s say James is 2 meters tall.</text><text start="6" dur="6">He is 10 meters away from the camera, and the focal length is 10mm.</text><text start="12" dur="4">How large will be James&amp;#39;s projection using a pinhole camera</text><text start="16" dur="5">on the camera chip with a focal length of 10mm?</text><text start="21" dur="5">Please specify your answers in millimeters as a unit.</text></transcript></video><video title="04 Projection Length Answer.mp4" id="N1dW7r4lK_I" length="46"><transcript><text start="0" dur="3">[Thrun] And the answer is 2mm.</text><text start="3" dur="6">James, even though he is 2 meters tall, will look like 2mm tall in the camera.</text><text start="9" dur="5">The picture will be there&amp;#39;s a 2 meter tall person over here who is 10 meters away,</text><text start="14" dur="6">there&amp;#39;s a pinhole, and the focal plane is only 10mm away from the pinhole.</text><text start="20" dur="2">So this projection will be really, really small.</text><text start="22" dur="2">Let&amp;#39;s do this in math.</text><text start="24" dur="6">The upper caps X is 2 meters, the 10 meters over here is the distance, Z,</text><text start="30" dur="4">and the focal length, 10mm, is the thing over 
here,</text><text start="34" dur="8">so 2 meters for X divided by 10 meters for Z times 10mm</text><text start="42" dur="4">becomes 0.2 times 10mm, which is 2mm.</text></transcript></video><video title="05 Focal Length Question.mp4" id="FbYHCpUox7A" length="22"><transcript><text start="0" dur="3">[Thrun] I have another quiz.</text><text start="3" dur="5">We are looking at a building that is 10 meters tall,</text><text start="8" dur="3">and our camera is 100 meters away.</text><text start="11" dur="6">We also know that the projection of the building on the internal chip is 4mm in size.</text><text start="17" dur="5">I want to know what is the focal length in millimeters.</text></transcript></video><video title="06 Focal Length Answer.mp4" id="EbROrXY3lxY" length="25"><transcript><text start="0" dur="3">[Thrun] And the answer is 40mm.</text><text start="3" dur="5">With f being the unknown, we can transform this equation as follows,</text><text start="8" dur="3">and now we can plug in the things we know.</text><text start="11" dur="3">The 10 meters tall is the upper caps X,</text><text start="14" dur="3">the distance of 100 meters is the Z,</text><text start="17" dur="3">and the 4mm projection goes over here.</text><text start="20" dur="5">And if you work this all out, it&amp;#39;s 40mm.</text></transcript></video><video title="07 Range Question.mp4" id="xsYB95m8eh4" length="43"><transcript><text start="0" dur="5">[Thrun] So in this final quiz we&amp;#39;re going to see we can use a camera as a range sensor</text><text start="5" dur="3">or as a distance measuring device,</text><text start="8" dur="5">provided that we know the size of the object we are looking at.</text><text start="13" dur="8">Suppose you&amp;#39;re looking at a car, and we happen to know that this car is 160cm tall.</text><text start="21" dur="4">Imagine we take a picture of this car using a pinhole camera</text><text start="25" dur="4">with a focal length of 40mm,</text><text start="29" dur="6">and 
in our projection the car is 2mm tall.</text><text start="35" dur="6">My question is what is the range? How far is this car away?</text><text start="41" dur="2">Please answer in centimeters.</text></transcript></video><video title="08 Range Answer.mp4" id="Bkxc63WkzeY" length="33"><transcript><text start="0" dur="7">[Thrun] And the answer is 3200cm or 32 meters.</text><text start="7" dur="8">And to see, we transform this equation over here so that the range, Z, is on the left side.</text><text start="15" dur="4">With this new equation we can just plug in the known quantities.</text><text start="19" dur="7">f is 40mm, x is 2mm, and upper caps X is 160cm.</text><text start="26" dur="7">We work this out as 160cm times 20, which gives us 3200cm.</text></transcript></video><video title="09 Perspective Projection.mp4" id="ejZkKeqB8q0" length="72"><transcript><text start="0" dur="3">[Thrun] So we just learned something really important, </text><text start="3" dur="4">which is the central law of perspective projection,</text><text start="7" dur="6">which basically says that in a pinhole camera, or in fact any camera,</text><text start="13" dur="4">the projective size of any object scales inversely with distance.</text><text start="17" dur="4">So you have an object that&amp;#39;s yea tall over here</text><text start="21" dur="6">that looks just about the same as an object yea tall over here.</text><text start="27" dur="4">In math x is proportional to the size of the object</text><text start="31" dur="4">but inversely proportional to the distance to the object,</text><text start="35" dur="5">and the only constant that then governs that relationship is the focal length, f.</text><text start="40" dur="4">So if we take an object and move it further away,</text><text start="44" dur="4">it&amp;#39;ll appear smaller, and we all know this.</text><text start="48" dur="3">Just look at this object over here, how large it is.</text><text start="51" dur="4">And as I move it away from the camera, it 
becomes smaller.</text><text start="55" dur="4">Large...and small.</text><text start="59" dur="2">And that&amp;#39;s a function of distance.</text><text start="61" dur="5">The law that governs the size change of this object in appearance</text><text start="66" dur="6">relative to the camera image is the perspective law we just saw.</text></transcript></video><video title="10 Vanishing Points.mp4" id="1RU-ERrU1pc" length="80"><transcript><text start="0" dur="5">Actually, camera images have 2 dimensions--not just 1. </text><text start="5" dur="3">Here&amp;#39;s an X and a Y,</text><text start="8" dur="3">and the perspective laws apply to both dimensions.</text><text start="11" dur="6">The projection of the X-coordinate into the camera plane</text><text start="17" dur="3">is governed by the perspective law over here. </text><text start="20" dur="3">And the same is true for Y. </text><text start="23" dur="5">In both cases, the appearance of an object, of size X and Y,</text><text start="28" dur="6">is scaled inversely with the distance of the object to the camera plane. </text><text start="34" dur="4">One of the interesting consequences of perspective projection is</text><text start="38" dur="7">that parallel lines in the world seem to result in vanishing points over here</text><text start="45" dur="4">even though these lines are parallel in the physical world.</text><text start="49" dur="8">Because things shrink in inverse proportion to the distance, you&amp;#39;d find that far away, </text><text start="57" dur="2">the perceived distance between the lines is much smaller, </text><text start="59" dur="5">resulting all the way in a vanishing point that sits somewhere over here. </text><text start="64" dur="2">So sometimes there&amp;#39;s more than 1 vanishing point. 
</text><text start="66" dur="4">In this specific instance, there&amp;#39;s a vanishing point over here, </text><text start="70" dur="3">and a vanishing point over here--and those correspond to parallel lines,</text><text start="73" dur="3">like the curb over here or the house facades</text><text start="76" dur="4">that all shrink, with distance, in their visual appearance. </text></transcript></video><video title="11 Vanishing Points Question.mp4" id="Ni1j3JILmFU" length="25"><transcript><text start="0" dur="3">So here&amp;#39;s my quiz on vanishing points. </text><text start="3" dur="7">The question is: How many vanishing points may exist in a single image? </text><text start="10" dur="2">Exactly 1 answer is correct here. </text><text start="12" dur="2">We already encountered it in an example with 2. </text><text start="14" dur="3">So perhaps there&amp;#39;s 3, 4, 6--</text><text start="17" dur="3">or even infinitely many.</text><text start="20" dur="5">So what&amp;#39;s the maximum number of vanishing points you may possibly encounter in an image? </text></transcript></video><video title="12 Vanishing Points Answer.mp4" id="1ULQMlf9exI" length="42"><transcript><text start="0" dur="3">And the answer is infinitely many. </text><text start="3" dur="7">If you thought it&amp;#39;s 3, then yes--cubes might have 3 vanishing points, </text><text start="10" dur="4">like this one over here, which is a cube under perspective projection. </text><text start="14" dur="2">It actually has 3 vanishing points:</text><text start="16" dur="5">No. 1, No. 2, and No. 3 over here. </text><text start="21" dur="4">But that&amp;#39;s because the cube has 3 faces. </text><text start="25" dur="4">You&amp;#39;ve had different faces whose Z-distance to the camera varied. </text><text start="29" dur="4">And those enclosing lines that might be parallel in the physical space</text><text start="33" dur="3">would result in their own vanishing point. 
</text><text start="36" dur="6">So you can theoretically make an object with infinitely many vanishing points. </text></transcript></video><video title="13 Lenses.mp4" id="k36dLpbcRek" length="137"><transcript><text start="0" dur="3">Let me comment on the idea of a lens. </text><text start="3" dur="3">A fundamental limitation of a pinhole camera </text><text start="6" dur="7">is that only very few rays of light hit the plane of the imager.</text><text start="13" dur="3">So suppose we have an object over here  </text><text start="16" dur="3">and the object emits light in all directions. </text><text start="19" dur="4">Then most beams get absorbed by the area outside the pinhole </text><text start="23" dur="3">and only a very small number of beams make it through the pinhole. </text><text start="26" dur="5">Now this is unfortunate because the total amount of light that hits the camera chip</text><text start="31" dur="5">is small, so a pinhole camera is really only applicable to very, very bright scenes.  </text><text start="36" dur="7">And further, as you make this gap smaller and smaller to increase your focus  </text><text start="43" dur="6">on the image plane, you will eventually run into what&amp;#39;s called &amp;quot;light diffraction,&amp;quot;</text><text start="49" dur="5">which puts a limit on how small you can make this pinhole over here. </text><text start="54" dur="6">Now if you use a lens, then all rays will make it to the same point in the image plane. </text><text start="60" dur="4">For example, a ray over here gets projected like this,  </text><text start="64" dur="3">and a ray over here might make it like this. </text><text start="67" dur="7">So any ray in a good lens will eventually meet at the same point over here. </text><text start="74" dur="8">The lens collects all the light that hits it, and projects it back to 1 point. 
</text><text start="82" dur="5">Now this specific situation is characterized by only a small plane over here, </text><text start="87" dur="3">for which everything is in complete focus. </text><text start="90" dur="3">If you move your object back to over here, </text><text start="93" dur="5">then what you find is the resulting projections don&amp;#39;t match up. </text><text start="98" dur="4">Therefore, when you have a camera with a large lens or a large aperture, </text><text start="102" dur="5">you have to focus the camera to make sure that the distances between the image plane,</text><text start="107" dur="4">the lens itself, and the object are in tune. </text><text start="111" dur="3">There is an equation that governs all of this,</text><text start="114" dur="2">and it looks about as follows:</text><text start="116" dur="3">1 over the focal length, f, of the lens </text><text start="119" dur="5">equals 1 over the extrinsic distance, uppercase Z, </text><text start="124" dur="3">plus 1 over the intrinsic distance, lowercase z. </text><text start="127" dur="3">I won&amp;#39;t derive this equation, </text><text start="130" dur="7">but this is the fundamental equation that governs when things are in focus, for a lens. </text></transcript></video><video title="14 Computer Vision.mp4" id="_XSw9bFS3Zc" length="73"><transcript><text start="0" dur="4">So this is great--we now learned a lot about cameras and images.  </text><text start="4" dur="3">We learned the Law of Perspective Projection</text><text start="7" dur="3">and we also know when things are in focus. </text><text start="10" dur="2">Now the first law is really important, </text><text start="12" dur="3">and the second law doesn&amp;#39;t really matter that much--</text><text start="15" dur="6">but I put it in so you understand what the implications of using a lens really are.  
</text><text start="21" dur="3">Let&amp;#39;s talk about what we do in Computer Vision </text><text start="24" dur="2">and what types of things we can do with images. </text><text start="26" dur="4">One of the primary purposes is to extract information from these images, </text><text start="30" dur="2">such as classifying objects. </text><text start="32" dur="3">So here is one of my son&amp;#39;s favorite objects. </text><text start="35" dur="5">And he might be interested in understanding that this is a car.  </text><text start="40" dur="4">A second purpose is 3D reconstruction. </text><text start="44" dur="3">So you might care to take many images of this object, from different perspectives</text><text start="47" dur="4">or with multiple cameras, like a Stereo Camera Rig, </text><text start="51" dur="7">and ask yourself what is the 3D model that we can reconstruct from these 2D projections. </text><text start="58" dur="3">Or you might care about motion analysis. </text><text start="61" dur="4">This is a common problem in Computer Vision </text><text start="65" dur="4">where things might move and are seen in a video of many images</text><text start="69" dur="4">and you might, for example, care how things move over time. </text></transcript></video><video title="15 Invariance Question A.mp4" id="hzAb1iSq4l8" length="85"><transcript><text start="0" dur="4">So I&amp;#39;d like to quiz you a number of times on something that&amp;#39;s really essential to </text><text start="4" dur="3">the problem of Object Recognition. </text><text start="7" dur="3">In Object Recognition, you&amp;#39;re given an image of an object</text><text start="10" dur="5">and you care to understand what the nature of the object really is--</text><text start="15" dur="4">like, for example, you might look at the image of a plane and say it&amp;#39;s a plane. </text><text start="19" dur="3">You might look at the image of a person and say it&amp;#39;s a person. 
</text><text start="22" dur="2">A key concept in Object Recognition is called invariance,  </text><text start="24" dur="3">which means there are natural variations of the image</text><text start="27" dur="3">that don&amp;#39;t affect the nature of the object itself</text><text start="30" dur="4">and you wish to be invariant in your software</text><text start="34" dur="3">to those natural variations. </text><text start="37" dur="2">So what I will do is I&amp;#39;m going to run a couple of </text><text start="39" dur="3">invariances by you and I&amp;#39;d like you to tell me </text><text start="42" dur="3">which invariance I&amp;#39;m referring to. </text><text start="45" dur="2">And here are the possible invariances:</text><text start="47" dur="7">Scale, Illumination, Rotation, Deformation, Occlusion, and View Point. </text><text start="54" dur="3">I understand you may not have seen any of these words before, </text><text start="57" dur="2">so I am really appealing to your intuition here</text><text start="59" dur="3">and your sense of the English language.  </text><text start="62" dur="2">So here is the object,</text><text start="64" dur="3">and we wish to recognize this object. </text><text start="67" dur="7">And, as I&amp;#39;m illustrating, the object might vary in some important dimension. </text><text start="74" dur="5">And I wonder what type of invariance you or I must possess</text><text start="79" dur="2">to be able to recognize this object. </text><text start="81" dur="4">So here is the list. </text></transcript></video><video title="16 Invariance Answer A.mp4" id="ag4dNN7F_fA" length="10"><transcript><text start="0" dur="2">And the answer here is Rotation. </text><text start="2" dur="2">I rotated the object.</text><text start="4" dur="3">So you wish to make sure that any recognition algorithm  </text><text start="7" dur="3">is invariant to a rotation. 
</text></transcript></video><video title="17 Invariance Question B.mp4" id="ocUMkQ2-6fc" length="18"><transcript><text start="0" dur="4">Sometimes the object gets closer to the camera</text><text start="4" dur="2">and increases in size;</text><text start="6" dur="3">and gets further away from the camera and, therefore, becomes smaller. </text><text start="9" dur="3">Just watch how the object becomes larger, </text><text start="12" dur="3">and smaller. </text><text start="15" dur="3">What do you think? What type of invariance is this? </text></transcript></video><video title="18 Invariance Answer B.mp4" id="z7J8_zEXKUY" length="23"><transcript><text start="0" dur="4">And the answer is this is Scale Invariance. </text><text start="4" dur="4">Scale means how large the image is, relative to the camera.</text><text start="8" dur="4">This is governed by the Perspective Projection Law that we discussed before.</text><text start="12" dur="4">The further objects are away, the smaller they appear.</text><text start="16" dur="3">We wish any classifier to be invariant to scale </text><text start="19" dur="4">so it can recognize objects nearby or really far away. </text></transcript></video><video title="19 Invariance Question C.mp4" id="11rkMYv6Jv4" length="17"><transcript><text start="0" dur="4">This object over here is interesting because we can change its shape. </text><text start="4" dur="2">So you can take this thing over here</text><text start="6" dur="5">and move it around, and what you actually see of the object </text><text start="11" dur="3">depends on the angle of the rotors. </text><text start="14" dur="3">So what do you think? What type of invariance is this?  
</text></transcript></video><video title="20 Invariance Answer C.mp4" id="fL4dm4CNA-w" length="24"><transcript><text start="0" dur="3">I would call this Deformation Invariance</text><text start="3" dur="3">because the object is actually deformable </text><text start="6" dur="2">as are many objects that surround us, </text><text start="8" dur="5">like clothes and dollar bills and water glasses. </text><text start="13" dur="4">This kind of deformation is really important in the recognition of objects. </text><text start="17" dur="3">You wish to make sure that a helicopter can be recognized,</text><text start="20" dur="4">no matter what angle its rotor blades currently have. </text></transcript></video><video title="21 Invariance Question D.mp4" id="B7kObkDPXQA" length="19"><transcript><text start="0" dur="2">So here is one of my favorites. </text><text start="2" dur="3">I&amp;#39;m holding in my hand a flashlight, </text><text start="5" dur="2">and as I move the flashlight around, </text><text start="7" dur="4">you can see that the appearance of the object changes a lot,</text><text start="11" dur="3">based on where I hold my flashlight. </text><text start="14" dur="5">So my question now is what type of invariance does this illustrate? </text></transcript></video><video title="22 Invariance Answer D.mp4" id="vpAoJmcsWjA" length="13"><transcript><text start="0" dur="4">And this is clearly an example of Illumination Invariance. </text><text start="4" dur="3">Depending on how the object&amp;#39;s illuminated, </text><text start="7" dur="6">it might appear very different, even though its position relative to the camera might be identical. </text></transcript></video><video title="23 Invariance Question E.mp4" id="F4aZB8JYQU0" length="16"><transcript><text start="0" dur="3">Sometimes objects are behind other objects. </text><text start="3" dur="3">For example, I can partially cover up this object. </text><text start="6" dur="3">You can probably still recognize it. 
</text><text start="9" dur="3">Or I might move a pen in front of it, as shown over here. </text><text start="12" dur="2">Now we&amp;#39;ve almost no choices left.  </text><text start="14" dur="2">Tell me what type of invariance this was. </text></transcript></video><video title="24 Invariance Answer E.mp4" id="gQczmY3GHWY" length="9"><transcript><text start="0" dur="3">This is called Occlusion Invariance. </text><text start="3" dur="3">Sometimes objects are partially occluded,</text><text start="6" dur="3">yet you would wish to be able to recognize them even with a partial occlusion.  </text></transcript></video><video title="25 Final Invariance Type.mp4" id="OcLd2wcS-Fo" length="32"><transcript><text start="0" dur="2">And because there&amp;#39;s only 1 invariance left, </text><text start="2" dur="2">let me talk about View Point Invariance  </text><text start="4" dur="2">as the final invariance. </text><text start="6" dur="3">So the appearance of this object depends on </text><text start="9" dur="3">from what direction you look, what your view point is. </text><text start="12" dur="3">And you can see it&amp;#39;s very different, from different view points. </text><text start="15" dur="4">So this looks fundamentally different from this, </text><text start="19" dur="2">from this. </text><text start="21" dur="3">That&amp;#39;s called View Point or Vantage Point Invariance,</text><text start="24" dur="2">and it&amp;#39;s one of the hardest invariances</text><text start="26" dur="2">because the appearance of the object </text><text start="28" dur="4">might really alter a lot, from different vantage points. 
</text></transcript></video><video title="26 Importance of Invariance.mp4" id="qTvlkiK-YSo" length="42"><transcript><text start="0" dur="2">[Thrun] The reason why I went through these different invariances</text><text start="2" dur="4">is because they are really crucial to computer vision.</text><text start="6" dur="4">These and a number of other invariances really matter.</text><text start="10" dur="5">When you want to recognize objects, you want to write software</text><text start="15" dur="4">that is invariant to scale and illumination and so on</text><text start="19" dur="3">and that it retains the important information in the image</text><text start="22" dur="5">regardless of the present rotation and occlusion and deformation.</text><text start="27" dur="6">If you succeed in eliminating the effects of those changes</text><text start="33" dur="3">and build a truly invariant computer vision algorithm,</text><text start="36" dur="2">I will be very impressed with you.</text><text start="38" dur="4">You will have solved a major computer vision problem.</text></transcript></video><video title="27 Greyscale Images.mp4" id="vTvKPxmvRFI" length="113"><transcript><text start="0" dur="3">[Thrun] So we learned a lot about invariances.</text><text start="3" dur="4">Let&amp;#39;s now take actual images and do something with these images.</text><text start="7" dur="3">This is an image I took a while back in Amsterdam,</text><text start="10" dur="3">and it&amp;#39;s interesting because there&amp;#39;s a lot of interesting features</text><text start="13" dur="5">like these line features over here and possible corner features over there.</text><text start="18" dur="4">In computer vision we don&amp;#39;t use color images very much. 
</text><text start="22" dur="5">We mostly use black and white images like this greyscale image over here,</text><text start="27" dur="3">which misses information from the original image,</text><text start="30" dur="5">but it turns out that greyscale is more robust to lighting variations than color is.</text><text start="35" dur="5">That&amp;#39;s a fairly common representation for images for computer vision.</text><text start="40" dur="6">So a greyscale image is a matrix typically of several hundred rows </text><text start="46" dur="3">and several hundred columns</text><text start="49" dur="6">that has small values imprinted that correspond to the greyscale of a pixel.</text><text start="55" dur="8">These values range between 0 and 255, where 255 is white and 0 is black.</text><text start="63" dur="4">You can see how this matrix is full of values that together compose the image.</text><text start="67" dur="5">Here is a very small image of size 4 by 5,</text><text start="72" dur="6">and based on the numbers I put in, it feels like there&amp;#39;s a transition going on.</text><text start="78" dur="3">At the top the image is relatively bright.</text><text start="81" dur="2">These have values close to 255.</text><text start="83" dur="2">And at the bottom it is relatively dark.</text><text start="85" dur="3">This is way too small an image to recognize anything.</text><text start="88" dur="3">Picture a matrix much, much larger than this,</text><text start="91" dur="6">yet an image that&amp;#39;s still just a 2-dimensional matrix of single brightness values.</text><text start="97" dur="4">At least a greyscale image is a matrix like this.</text><text start="101" dur="4">A color image would have 3 different values per pixel </text><text start="105" dur="5">which correspond to red, green, and blue or some other encoding of the color itself.</text><text start="110" dur="3">But for now we&amp;#39;re going to be content with greyscale images.</text></transcript></video><video 
title="28 Extracting Features.mp4" id="BhCcmTsm5ck" length="167"><transcript><text start="0" dur="3">[Thrun] One of the most basic things we can do with computer vision</text><text start="3" dur="2">is to extract features.</text><text start="5" dur="4">For example, there is a very strong edge feature over here</text><text start="9" dur="5">and a strong corner feature right over here and right over here.</text><text start="14" dur="2">Let me tell you how to do this.</text><text start="16" dur="4">How can you find in an image like this whether there is an edge,</text><text start="20" dur="6">or in an image like this where there is an edge from a bright area on the left</text><text start="26" dur="2">to a dark region on the right?</text><text start="28" dur="5">Let us write a feature extractor that identifies transitions of this type,</text><text start="33" dur="3">and let&amp;#39;s start with horizontal transitions.</text><text start="36" dur="3">The most obvious feature detector looks like this.</text><text start="39" dur="6">We run this little 2-value matrix across the entire image over here,</text><text start="45" dur="6">and we add whatever is on the left side and subtract whatever is on the right side.</text><text start="51" dur="5">So if both sides are approximately in balance, like these points over here,</text><text start="56" dur="3">adding an expression here is approximately 0.</text><text start="59" dur="3">But if the left side is significantly larger than the right side,</text><text start="62" dur="6">then adding and subtracting yields a very large value, like 212 - 7 over here.</text><text start="68" dur="8">So this specific mask gives us edges that run from bright to dark.</text><text start="76" dur="4">So here I&amp;#39;m taking the first value and subtract the second value from it.</text><text start="80" dur="6">255 - 212 gives me 43. 
That&amp;#39;s applying this mask over here.</text><text start="86" dur="5">211 - 237 is -26 and so on.</text><text start="91" dur="4">212 - 7 is 205.</text><text start="95" dur="5">237 - 3 is 234 and so on.</text><text start="100" dur="3">7 - 1 is 6.</text><text start="103" dur="3">3 - 9 is -6 and so on.</text><text start="106" dur="3">If you look at this result of applying the mask over here,</text><text start="109" dur="3">you&amp;#39;ll find that this column stands out.</text><text start="112" dur="5">It is much, much larger in value than any of the adjacent columns,</text><text start="117" dur="6">and that indicates that we have a high likelihood of a vertical edge feature occurring</text><text start="123" dur="4">at the boundary between this column and this column over here.</text><text start="127" dur="5">So here we are applying that same trick to the original image, and this is the result.</text><text start="132" dur="5">You can see that in areas where the original image has a strong transition</text><text start="137" dur="2">you get a strong response over here.</text><text start="139" dur="3">This is actually showing the absolute value of the difference</text><text start="142" dur="4">where we get rid of the minus sign, so you can see any transition from bright to dark</text><text start="146" dur="3">or dark to bright horizontally shows up.</text><text start="149" dur="5">Now, you can see these lines over here that are vertical show up very strongly.</text><text start="154" dur="5">The lines over here don&amp;#39;t, and the reason is the way we defined our kernel,</text><text start="159" dur="4">it actually ran horizontally, so it finds vertical edges and not horizontal edges.</text><text start="163" dur="4">Horizontal edges require a different kernel, so let me get to this in a second.</text></transcript></video><video title="29 Extracting Features Question.mp4" id="OmsQRUiG33Q" length="23"><transcript><text start="0" dur="6">[Thrun] So in this quiz 
I&amp;#39;ve given you a very small image of 3 by 3,</text><text start="6" dur="3">and I&amp;#39;d like you to apply a kernel that&amp;#39;s about like the previous one,</text><text start="9" dur="3">except I flipped the left and right side over here.</text><text start="12" dur="4">And just apply this kernel to this image over here.</text><text start="16" dur="4">We&amp;#39;re going to receive a 3 by 2 image in this case.</text><text start="20" dur="3">So please fill in all these 6 values over here.</text></transcript></video><video title="30 Extracting Features Answer.mp4" id="Mv_8sGBiopQ" length="30"><transcript><text start="0" dur="3">[Thrun] And here are the results.</text><text start="3" dur="3">7 - 255 is -248.</text><text start="6" dur="3">3 - 7 is -4.</text><text start="9" dur="3">4 - 240 is -236.</text><text start="12" dur="3">240 - 212 is 28.</text><text start="15" dur="3">230 - 216 is 14.</text><text start="18" dur="4">And 216 - 218 is -2.</text><text start="22" dur="8">So this would be the image under that specific mask or kernel over here.</text></transcript></video><video title="31 Linear Filter.mp4" id="7KVPsx_oP7A" length="98"><transcript><text start="0" dur="3">Now, we already learned something really interesting,</text><text start="3" dur="2">which is the special case of a linear filter.</text><text start="5" dur="5">We took an image, and we applied a small kernel.</text><text start="10" dur="2">The application of a kernel is often denoted </text><text start="12" dur="2">with a special symbol over here.</text><text start="14" dur="3">And we received a new image</text><text start="17" dur="2">that was slightly smaller, and we don&amp;#39;t really worry</text><text start="19" dur="2">about the fact that it&amp;#39;s smaller.</text><text start="21" dur="2">There&amp;#39;s ways to keep it the same size</text><text start="23" dur="3">by assuming everything around the original image is zero.</text><text start="26" dur="4">But we did receive a new image 
that was the result of applying the kernel over here.</text><text start="30" dur="3">And the general math of the new image,</text><text start="33" dur="4">for any pixel coordinate x and y, is obtained by summing</text><text start="37" dur="3">over all entries in the kernel, u and v,</text><text start="40" dur="3">of the original image shifted by u and v</text><text start="43" dur="2">times the kernel itself.</text><text start="45" dur="2">Now, this will take some time to digest,</text><text start="47" dur="4">but what it really does is exactly what we did before.</text><text start="51" dur="4">We take our kernel, which in this case might be a 2 x 1 kernel.</text><text start="55" dur="2">We go over both of these fields</text><text start="57" dur="2">or any number of fields that exist over here.</text><text start="59" dur="4">We look at the corresponding image field and shift it a little bit.</text><text start="63" dur="4">We did this before. We shifted it by 0 or 1 pixels.</text><text start="67" dur="2">We multiply these 2 things.</text><text start="69" dur="2">There was a +1 here and a -1 here before.</text><text start="71" dur="2">And we add all these things up </text><text start="73" dur="2">to arrive at the resulting image.</text><text start="75" dur="4">Think for a moment to realize that this function over here</text><text start="79" dur="2">implements what we just did.</text><text start="81" dur="3">It&amp;#39;s a nice and elegant function.</text><text start="84" dur="2">It&amp;#39;s called a linear filter.</text><text start="86" dur="4">And the reason is the math inside this sum is linear.</text><text start="90" dur="2">It&amp;#39;s a multiplication.</text><text start="92" dur="3">And so is the sum, and the convolution operation itself</text><text start="95" dur="3">is often called a linear operation.</text></transcript></video><video title="32 Horizontal Edge Question.mp4" id="RUPaE39PJak" length="23"><transcript><text start="0" dur="3">So, let me ask you another 
quiz.</text><text start="3" dur="2">What type of filter do we need</text><text start="5" dur="3">to find horizontal edges?</text><text start="8" dur="2">And here are the choices.</text><text start="10" dur="3">We have a filter like this, 1 and 1,</text><text start="13" dur="2">horizontal filter 1, -1, </text><text start="15" dur="3">and a vertical filter, 1, 1 and 1, -1.</text><text start="18" dur="2">One of those is actually correct,</text><text start="20" dur="3">so pick the one that is best suited to find horizontal edges.</text></transcript></video><video title="33 Horizontal Edge Answer.mp4" id="WNE3j27uy9A" length="27"><transcript><text start="0" dur="3">And the answer is this one over here.</text><text start="3" dur="3">It takes a pixel and subtracts</text><text start="6" dur="3">the vertically next pixel from it.</text><text start="9" dur="2">And if there&amp;#39;s a horizontal edge,</text><text start="11" dur="5">if we have an image where the values over here are large,</text><text start="16" dur="3">the values over here are small,</text><text start="19" dur="2">then the specific filter over here,</text><text start="21" dur="2">when applied to the transition between</text><text start="23" dur="4">the large and small values will give you a large response.</text></transcript></video><video title="34 Vertical Filter Question.mp4" id="sElYQYuS_IA" length="20"><transcript><text start="0" dur="2">Here&amp;#39;s another quiz.</text><text start="2" dur="4">Given an image like this one over here</text><text start="6" dur="2">with pixel areas 12, 18 and 6, </text><text start="8" dur="4">2, 1, 7, 100, 140, 130,</text><text start="12" dur="5">convolve this image with the vertical  -1, +1 filter</text><text start="17" dur="3">to arrive at the 6 missing values on the right side.</text></transcript></video><video title="35 Vertical Filter Answer.mp4" id="kH17V4gAh7k" length="25"><transcript><text start="0" dur="3">And now the answers will be rather 
straightforward.</text><text start="3" dur="3">100 - 2 is 98,  2 - 12 is -10,</text><text start="6" dur="4">140 - 1 is 139, and so on and so on.</text><text start="10" dur="2">This is the convolved image,</text><text start="12" dur="3">and you can see a relatively large response</text><text start="15" dur="3">corresponding to the transition from larger values over here</text><text start="18" dur="4">to smaller values over here, so this filter is well suited</text><text start="22" dur="3">to find horizontal edges.</text></transcript></video><video title="36 Filter Results.mp4" id="OoiwXs6Ebwg" length="29"><transcript><text start="0" dur="2">And you can really see this in the results.</text><text start="2" dur="5">So, this is the vertical mask applied to finding horizontal edges.</text><text start="7" dur="2">These edges stand out really strongly.</text><text start="9" dur="2">This one is nearly invisible.</text><text start="11" dur="4">Compare this to the filter for vertical edges</text><text start="15" dur="2">where these things now light up,</text><text start="17" dur="3">but these have gone missing, and here&amp;#39;s the original image again</text><text start="20" dur="2">where you&amp;#39;ll see all the different edges.</text><text start="22" dur="4">Again, the horizontal filter finds the vertical edges,</text><text start="26" dur="3">and the vertical filter finds horizontal edges.</text></transcript></video><video title="37 Gradient Images.mp4" id="lzJsHUVKvRc" length="91"><transcript><text start="0" dur="4">Now, what I&amp;#39;ve just shown you is called a gradient image.</text><text start="4" dur="3">The gradient image in the horizontal direction</text><text start="7" dur="3">is the image convolved with this kernel over here.</text><text start="10" dur="2">And the gradient image in the vertical direction</text><text start="12" dur="4">is the original image convolved with this kernel over here.</text><text start="16" dur="3">This notation should now 
make sense</text><text start="19" dur="3">since we practiced it a number of times.</text><text start="22" dur="3">This is called, again, the convolution of the image.</text><text start="25" dur="4">Now, if we wish to find edges in any direction,</text><text start="29" dur="5">a really easy way to do this is to combine both of these gradient images</text><text start="34" dur="4">into a single edge image, and here&amp;#39;s how it goes. </text><text start="38" dur="4">We take our gradient image in direction x, and we square it.</text><text start="42" dur="2">We do the same with y, add the two,</text><text start="44" dur="2">and we take the square root.</text><text start="46" dur="3">And this response over here tells us</text><text start="49" dur="5">in any of the 2 directions how strong the gradient response is.</text><text start="54" dur="3">Here is just that gradient image.</text><text start="57" dur="3">Compare this to the original image,</text><text start="60" dur="3">and you can see that wherever there is a strong transition</text><text start="63" dur="3">between a bright and dark color, </text><text start="66" dur="4">the gradient magnitude image I just calculated has an edge.</text><text start="70" dur="4">It has an edge vertically and an edge horizontally.</text><text start="74" dur="3">And again, it&amp;#39;s made of these 2 components, </text><text start="77" dur="3">vertical edges and horizontal edges.</text><text start="80" dur="3">By combining both of them, we get a gradient magnitude 
image, </text><text start="83" dur="3">and we have our very first feature detector,</text><text start="86" dur="5">which is a feature detector of any edge in the image.</text></transcript></video><video title="38 Canny Edge Detector.mp4" id="sMlkLsPWNNM" length="45"><transcript><text start="0" dur="3">Now, state-of-the-art edge detection is a little bit more advanced</text><text start="3" dur="2">than this one over here.</text><text start="5" dur="4">This is called a Canny edge detector.</text><text start="9" dur="2">You see much more crisp edges over here.</text><text start="11" dur="4">What this does, in addition to the gradient magnitude,</text><text start="15" dur="4">it traces areas and finds local maxima.</text><text start="19" dur="4">And it tries to connect them in a way that there&amp;#39;s always just a single edge.</text><text start="23" dur="4">When multiple edges meet, the Canny edge detector has a hole,</text><text start="27" dur="3">like the area over here or the area over here.</text><text start="30" dur="3">But when edges are single edges, </text><text start="33" dur="3">the Canny edge detector traces them very, very nicely.</text><text start="36" dur="3">This is named after John Canny, a professor at UC Berkeley. </text><text start="39" dur="4">And he did one of the most impressive pieces of work</text><text start="43" dur="2">on early edge detection.</text></transcript></video><video title="39 Other Masks.mp4" id="gvZJKhC0CJk" length="38"><transcript><text start="0" dur="5">There are a few other common masks in the literature.</text><text start="5" dur="4">One is the Sobel mask, which is just like the edge detector I showed you,</text><text start="9" dur="4">a little bit larger, and you can see it goes from left to right.</text><text start="13" dur="2">There&amp;#39;s about 8 of them. 
</text><text start="15" dur="3">2 are shown over here,  including some diagonal ones.</text><text start="18" dur="4">Something called the Prewitt mask, which is like the Sobel</text><text start="22" dur="2">but doesn&amp;#39;t emphasize the center line.</text><text start="24" dur="3">And the Kirsch mask, like the one over here.</text><text start="27" dur="2">In fact, you can claim your own kernel.</text><text start="29" dur="3">If you come up with a kernel that finds certain features,</text><text start="32" dur="3">name it after yourself, and who knows?</text><text start="35" dur="3">Maybe you&amp;#39;ll get remembered like Mr. Sobel did or Mr. Prewitt.</text></transcript></video><video title="40 Prewitt Mask Question.mp4" id="EjjqcRlKIy8" length="16"><transcript><text start="0" dur="4">So, here&amp;#39;s a mask that&amp;#39;s a special case of a Prewitt mask,</text><text start="4" dur="3">and I&amp;#39;d like to ask you a quiz.</text><text start="7" dur="3">Will this find horizontal edges, vertical edges, </text><text start="10" dur="3">corners, or none or all of the above?</text><text start="13" dur="3">Please check exactly 1 of those 3 buttons.</text></transcript></video><video title="41 Prewitt Mask Answer.mp4" id="1y95EW2UW3A" length="12"><transcript><text start="0" dur="3">And the answer is horizontal edges</text><text start="3" dur="3">because it shifts negative mass on the left side, </text><text start="6" dur="2">positive mass on the right side.</text><text start="8" dur="2">That gives us a horizontal edge.</text><text start="10" dur="2">It doesn&amp;#39;t find any of the other ones.</text></transcript></video><video title="42 Gaussian Kernel Question.mp4" id="gJIGZyY0SG0" length="46"><transcript><text start="0" dur="3">Now, linear filters can also be applied</text><text start="3" dur="2">to very different matrices.</text><text start="5" dur="3">This is what&amp;#39;s called a Gaussian kernel.</text><text start="8" dur="2">You can see it over 
here.</text><text start="10" dur="4">It&amp;#39;s a matrix whose value is maximum</text><text start="14" dur="3">at the center of the matrix and whose value falls</text><text start="17" dur="4">exponentially to the side of this matrix.</text><text start="21" dur="4">It&amp;#39;s a Gaussian in 2D, as you can see over here.</text><text start="25" dur="3">So, what happens if we convolve an image </text><text start="28" dur="2">with a Gaussian kernel?</text><text start="30" dur="3">Let me ask your intuition on the following quiz,</text><text start="33" dur="3">and it&amp;#39;s completely okay if you get this wrong.</text><text start="36" dur="3">If you convolve an image with a Gaussian kernel, what do we get?</text><text start="39" dur="4">An edge detector, a corner detector,</text><text start="43" dur="3">a blurred image, or none of the above?</text></transcript></video><video title="43 Gaussian Kernel Answer.mp4" id="LXCq0yN3e6s" length="29"><transcript><text start="0" dur="2">And the answer is a blurred image.</text><text start="2" dur="3">A Gaussian kernel gives us a blurred image.</text><text start="5" dur="2">Let me demonstrate this to you.</text><text start="7" dur="3">Here is the original image,</text><text start="10" dur="3">and this is the result of convolving with a Gaussian.</text><text start="13" dur="3">You can see that the features are much blurred,</text><text start="16" dur="2">and the reason is each of these pixels</text><text start="18" dur="4">is the sum of its weighted neighboring pixels.</text><text start="22" dur="4">And the larger the neighborhood, the more the blurring effect.</text><text start="26" dur="3">You can see clearly the difference in sharpness between these 2 images.</text></transcript></video><video title="44 Reasons for Gaussian Kernels.mp4" id="iVG8b89jTOA" length="181"><transcript><text start="0" dur="4">So, why on earth would we ever want to blur an image?</text><text start="4" dur="2">There are generally 2 reasons why you might 
want to do this.</text><text start="6" dur="2">One is for down-sampling.</text><text start="8" dur="3">If you have an image of super high resolution,</text><text start="11" dur="3">maybe 5,000 x 5,000 pixels,</text><text start="14" dur="3">and you&amp;#39;d like to go to a web image of much smaller resolution,</text><text start="17" dur="4">it&amp;#39;s better to blur with a Gaussian before down-sampling</text><text start="21" dur="3">than picking each nth pixel.</text><text start="24" dur="3">And the reason is called aliasing. </text><text start="27" dur="3">If you pick each nth pixel without blurring,</text><text start="30" dur="3">you sometimes get very, very funny effects</text><text start="33" dur="3">because each nth pixel might by chance</text><text start="36" dur="3">correspond to something that&amp;#39;s somewhat irregular.</text><text start="39" dur="4">For example, if you have a checkerboard and you pick each nth pixel,</text><text start="43" dur="2">you might only end up with black pixels.</text><text start="45" dur="2">The second reason is called noise reduction.</text><text start="47" dur="4">In noise reduction, you suppress pixel noise</text><text start="51" dur="4">that might otherwise make it hard to compute things like image gradients.</text><text start="55" dur="2">If you blur the image first, </text><text start="57" dur="5">you get a smoother result that isn&amp;#39;t quite as pronounced</text><text start="62" dur="3">but has much less noise in the image.</text><text start="65" dur="3">Here&amp;#39;s the original gradient magnitude image to find edges,</text><text start="68" dur="3">and here&amp;#39;s the same applied to the blurred image.</text><text start="71" dur="5">And you can see the original one is much more succinct, </text><text start="76" dur="3">but also it&amp;#39;s more subject to noise.</text><text start="79" dur="3">Take the area over here, which has lots of image noise,</text><text start="82" dur="4">and compare this to the area 
over here, which has many fewer edges.</text><text start="86" dur="2">The same is true over here and over here.</text><text start="88" dur="3">I wouldn&amp;#39;t really claim this is a much better result.</text><text start="91" dur="4">In fact, it looks kind of funny and very coarse,</text><text start="95" dur="2">but it does have less noise.</text><text start="97" dur="3">Just to complete the issue on blurring,</text><text start="100" dur="2">what we just did is we took an image, </text><text start="102" dur="2">we blurred it with a Gaussian kernel, </text><text start="104" dur="3">and then we applied a gradient kernel.</text><text start="107" dur="2">If you dive into the math of convolution,</text><text start="109" dur="3">you&amp;#39;ll find that convolution is associative,</text><text start="112" dur="4">so you could apply this one to the image and then this one over here,</text><text start="116" dur="3">or you can combine these 2 guys over here </text><text start="119" dur="4">into a Gaussian gradient kernel</text><text start="123" dur="4">and apply this Gaussian gradient kernel to the image.</text><text start="127" dur="4">So, f convolved with g is this big </text><text start="131" dur="4">maybe 9 x 9 Gaussian matrix convolved with a single</text><text start="135" dur="3">+1, -1 kernel g.</text><text start="138" dur="4">And here&amp;#39;s what this Gaussian gradient kernel looks like.</text><text start="142" dur="2">It&amp;#39;s really interesting.</text><text start="144" dur="2">It is the same gradient kernel we had before</text><text start="146" dur="3">but smoothed now and spread out</text><text start="149" dur="3">by the Gaussian.</text><text start="152" dur="5">And it really responds to an area over here similar to a Sobel operator</text><text start="157" dur="3">that might have a strong negative value.</text><text start="160" dur="2">And the area over here on the right side</text><text start="162" dur="2">has a strong positive value,</text><text start="164" 
dur="3">so you can think of Sobel and many other kernels</text><text start="167" dur="5">as a combination of smoothing and taking a gradient.</text><text start="172" dur="2">I find this really interesting because</text><text start="174" dur="4">we can now devise a single, linear kernel that does both smoothing</text><text start="178" dur="3">and find gradients at the same time.</text></transcript></video><video title="45 Harris Corner Detector.mp4" id="vkWdzWeRfC4" length="170"><transcript><text start="0" dur="3">Sometimes you wish to find corners,</text><text start="3" dur="2">as in this checkerboard over here.</text><text start="5" dur="3">Corners have an advantage over edges.</text><text start="8" dur="2">Edges aren&amp;#39;t localizable.</text><text start="10" dur="2">They could be anywhere on an edge.</text><text start="12" dur="3">But a corner like this or a corner like this</text><text start="15" dur="3">can be localized, which is useful in computer vision.</text><text start="18" dur="4">What you see here is a Harris corner detector</text><text start="22" dur="4">applied to a checkerboard pattern.</text><text start="26" dur="3">And you can see all the points that define the checkerboard</text><text start="29" dur="4">clearly found by a relatively simple algorithm</text><text start="33" dur="3">which I&amp;#39;m just about to explain to you.</text><text start="36" dur="5">The Harris corner detector is really a simple algorithm.</text><text start="41" dur="3">Suppose you wished to find a corner just like this.</text><text start="44" dur="4">Then in the small region over here where the corner resides,</text><text start="48" dur="3">you will find a lot of horizontal gradients</text><text start="51" dur="2">and a lot of vertical gradients.</text><text start="53" dur="2">Now, what&amp;#39;s our trick of finding gradients?</text><text start="55" dur="3">Well, we know about horizontal gradients.</text><text start="58" dur="2">We know about vertical 
gradients.</text><text start="60" dur="3">If those summed up over a small window--</text><text start="63" dur="4">as shown right over here--are large, we have a corner. </text><text start="67" dur="4">If only 1 of them is large and the other 1 is small, we likely have an edge.</text><text start="71" dur="2">We already learned this before.</text><text start="73" dur="2">It should be no surprise so far.</text><text start="75" dur="5">Now, the Harris corner detector generalizes this to rotated images. </text><text start="80" dur="2">We might have a corner like this</text><text start="82" dur="3">that is rotated from the original corner.</text><text start="85" dur="3">An image like this on a horizontal gradient</text><text start="88" dur="3"> isn&amp;#39;t quite as pronounced as it is on the vertical gradient.</text><text start="91" dur="3">But if you were to rotate our coordinate system</text><text start="94" dur="2">back into the correct orientation, </text><text start="96" dur="2">we could reduce it back to the case over here.</text><text start="98" dur="5">The trick that&amp;#39;s being applied is to de-rotate</text><text start="103" dur="4">this image over here using eigenvalue decomposition.</text><text start="107" dur="3">We use a matrix that slightly generalizes these 2 things over here</text><text start="110" dur="3">where again we sum over our small windows.</text><text start="113" dur="3">We plug in the statistic over here up here.</text><text start="116" dur="2">The statistic over here down there.</text><text start="118" dur="3">And here we have mixed terms, where we sum over the product</text><text start="121" dur="5">of Ix and Iy in the off-diagonal terms.</text><text start="126" dur="3">If we apply eigenvalue decomposition to this matrix over here,</text><text start="129" dur="2">we get 2 eigenvalues.</text><text start="131" dur="2">And if both eigenvalues are large,</text><text start="133" dur="3">we again say we have a corner.</text><text start="136" dur="3">So, 
applying this eigenvalue decomposition</text><text start="139" dur="2">to every positive pixel in the original image</text><text start="141" dur="3">and then taking the local maxima of that result</text><text start="144" dur="3">where both eigenvalues are large gives us exactly</text><text start="147" dur="3">the Harris corner detector in a very robust way</text><text start="150" dur="3">to find corners in an image.</text><text start="153" dur="4">This is exactly what&amp;#39;s being done over here,</text><text start="157" dur="4">and you can see it&amp;#39;s very robust even to small rotations of the image,</text><text start="161" dur="2">and of course, to a scale of the image.</text><text start="163" dur="4">It&amp;#39;s a beautiful way to find stable, </text><text start="167" dur="3">localizable features in contrast-rich images.</text></transcript></video><video title="46 Modern Feature Detectors.mp4" id="mJoZ-y19ifw" length="97"><transcript><text start="0" dur="5">Now, modern feature detectors extend Harris corners</text><text start="5" dur="3">into much more advanced features.</text><text start="8" dur="3">They are usually localizable, like corners are.</text><text start="11" dur="2">They also have unique signatures</text><text start="13" dur="3">that summarize the identity of a feature</text><text start="16" dur="3">that&amp;#39;s typically invariant to lighting, orientation,</text><text start="19" dur="3">translation and size variance, </text><text start="22" dur="3">as you might find it in the image space.</text><text start="25" dur="3">So, common methods that people use are called HOG,</text><text start="28" dur="3">for histogram of oriented gradients, </text><text start="31" dur="4">or SIFT, for scale invariant  feature transform.</text><text start="35" dur="3">All of these methods take corners</text><text start="38" dur="5">and reduce the various variants like rotational variants</text><text start="43" dur="3">by extracting statistics that are invariant 
to things like</text><text start="46" dur="5">rotation and scale and certain perspective transformation.</text><text start="51" dur="3">I took the liberty to apply SIFT features</text><text start="54" dur="2">to the bridge image,</text><text start="56" dur="2">and what you find here is a myriad of features</text><text start="58" dur="2">that are all very localizable.</text><text start="60" dur="2">There&amp;#39;s features over here, </text><text start="62" dur="3">very large ones like the square over here,</text><text start="65" dur="3">which is, I guess, very visible, another square over here,</text><text start="68" dur="4">and very small, tiny features like the square over here and the square over here</text><text start="72" dur="5">that all have a unique signature and can easily be matched across images.</text><text start="77" dur="3">This is called a SIFT feature extractor,</text><text start="80" dur="4">and it&amp;#39;s one of the state-of-the-art methods that are very commonly used.</text><text start="84" dur="3">So, if you wish to extract features from an image, </text><text start="87" dur="3">I recommend checking out HOG or SIFT.</text><text start="90" dur="2">You can download software from the web.</text><text start="92" dur="3">They are somewhat involved, and you can learn about them</text><text start="95" dur="2">in advanced computer vision classes.</text></transcript></video><video title="47 Conclusion.mp4" id="gksMOuDtGf0" length="46"><transcript><text start="0" dur="3">So, now you know some of the very basics of computer vision.</text><text start="3" dur="3">We talked about images, how images are being formed.</text><text start="6" dur="4">We talked about perspective projection as a mathematical tool</text><text start="10" dur="4">for understanding how cameras perceive images.</text><text start="14" dur="2">And we talked a whole bunch about features.</text><text start="16" dur="3">We talked about invariances, the type of things that affect</text><text 
start="19" dur="4">the appearance of a feature in the camera image,</text><text start="23" dur="3">and we went through methods for extracting edges,</text><text start="26" dur="2">for extracting corners, </text><text start="28" dur="4">and for extracting fairly sophisticated features like SIFT features.</text><text start="32" dur="4">This is one of the basic processing methods in computer vision.</text><text start="36" dur="3">Almost everyone who does computer vision preprocesses images</text><text start="39" dur="3">by feature extraction, and now you know</text><text start="42" dur="4">quite a bit about how to process images in computer vision.</text></transcript></video></group><group title="Unit 17" count="33"><video title="01 Introduction.mp4" id="dqhTYhyrbwc" length="44"><transcript><text start="0" dur="3">This class is all about 3D vision.</text><text start="3" dur="5">There&amp;#39;s a scene somewhere here, and there&amp;#39;s a camera over here,</text><text start="8" dur="3">and the scene is projected into the camera plane.</text><text start="11" dur="5">Obviously, the camera image is only 2D. The scene is 3D.</text><text start="16" dur="4">3D vision attempts to recover the full 3D information.</text><text start="20" dur="3">The most important missing thing is called the range, </text><text start="23" dur="4">sometimes called depth or distance to the camera plane.</text><text start="27" dur="3">Cameras are deficient in that they can only recover 2D information-</text><text start="30" dur="3">a perspective projection of the image.</text><text start="33" dur="5">The question right now is going to be can we possibly recover the full 3D information</text><text start="38" dur="6">about the scene outside the camera just as single or modular camera images.</text></transcript></video><video title="02 Depth Question.mp4" id="0RtX_2XpyDA" length="26"><transcript><text start="0" dur="2">This is our first quiz. 
</text><text start="2" dur="5">Given a single image and given parameters of the camera like the focal length</text><text start="7" dur="5">and all the other parameters, can we recover the depth or range of a scene?</text><text start="12" dur="3">And I&amp;#39;ll give you a couple of possible answers.</text><text start="15" dur="4">Yes, always; sometimes; or never.</text><text start="19" dur="4">We haven&amp;#39;t even talked about this. This requires some thinking on your end.</text><text start="23" dur="3">But give it your best try.</text></transcript></video><video title="03 Depth Answer.mp4" id="P2NIOCCtO2U" length="50"><transcript><text start="0" dur="4">The correct answer is sometimes. There are actually cases where we can do this.</text><text start="4" dur="4">It&amp;#39;s not always possible because, as we learned before, </text><text start="8" dur="3">the camera doesn&amp;#39;t really record the distance.</text><text start="11" dur="6">If we look into an arbitrary scene, we normally can&amp;#39;t recover the depth from a single image.</text><text start="17" dur="3">But in certain cases it&amp;#39;s possible. Here&amp;#39;s a dollar bill.</text><text start="20" dur="5">A dollar bill is a fixed size, and you can know the size. 
All dollar bills have the same size.</text><text start="25" dur="7">And by understanding the internal projection size of this bill,</text><text start="32" dur="3">you can actually really recover how far it is away, </text><text start="35" dur="3">if you know things like focal length and so on.</text><text start="38" dur="5">The answer is yes in cases we know the size of the object</text><text start="43" dur="3">and no in cases where you don&amp;#39;t know the size.</text><text start="46" dur="4">Therefore, sometimes was the correct answer here.</text></transcript></video><video title="04 Stereo.mp4" id="beb_cF5fcmk" length="93"><transcript><text start="0" dur="7">One easy way to recover  the depth with 3D vision is called Stereo.</text><text start="7" dur="2">Humans use stereo all the time.</text><text start="9" dur="6">We have two eyes--eye 1 and eye 2--</text><text start="15" dur="3">and these eyes have a so-called displacement,</text><text start="18" dur="4">which just means that one eye is further left than the other eye.</text><text start="22" dur="3">We&amp;#39;re looking at the scene from slightly different angles.</text><text start="25" dur="4">Humans can actually recover the depth of the scene </text><text start="29" dur="3">in many situations where objects are nearby.</text><text start="32" dur="2">Let&amp;#39;s look at this in more detail.</text><text start="34" dur="6">In stereo vision, we&amp;#39;re given two cameras--usually both with identical focal length.</text><text start="40" dur="3">Here are the pinholes, and here are the image planes.</text><text start="43" dur="4">An object in the scene is being seen by both cameras.</text><text start="47" dur="2">If I draw the optical axes over there, </text><text start="49" dur="4">which are the axes orthogonal to the image planes that go through the pinholes,</text><text start="53" dur="5">you will see that the projection of this point depends on the displacement, </text><text start="58" dur="4">or the 
baseline, of the so-called stereo rig.</text><text start="62" dur="3">Clearly, these two images see the point at a different angle,</text><text start="65" dur="5">and it reflects itself by different coordinates</text><text start="70" dur="4">where this point is being projected onto the image plane.</text><text start="74" dur="6">The idea of stereo is to screen objects and use the displacement, </text><text start="80" dur="6">often called &amp;quot;parallax,&amp;quot; of those two different projections to estimate</text><text start="86" dur="3">the depth or the range of the object.</text><text start="89" dur="4">Let me just ask a simple quiz about stereo.</text></transcript></video><video title="05 Stereo Question.mp4" id="g_pkv6Ej8aQ" length="23"><transcript><text start="0" dur="2">Given two identical cameras </text><text start="2" dur="5">for which we know things like focal length and all the other intrinsic parameters,</text><text start="7" dur="2">and we also know the baseline,</text><text start="9" dur="3">can we now recover the depth of a scene?</text><text start="12" dur="6">Here are the answers: yes, always, no matter what the scene is.</text><text start="18" dur="5">The second is sometimes, and the third one is never.</text></transcript></video><video title="06 Stereo Answer.mp4" id="uJQL3Bxq58M" length="78"><transcript><text start="0" dur="2">As before, the answer is sometimes, </text><text start="2" dur="5">although more often than with a single camera image, given that we have two now.</text><text start="7" dur="3">To give an intimation of why it&amp;#39;s not always, </text><text start="10" dur="7">let&amp;#39;s look at two images where the object of interest is a vertical object</text><text start="17" dur="5">and another pair of two images where the object of interest is a horizontal feature,</text><text start="22" dur="2">like this one over here.</text><text start="24" dur="3">Now, in the vertical case, there would be displacement.</text><text 
start="27" dur="3">This would be slightly further to the left than this guy over here,</text><text start="30" dur="5">and we can use the displacement to recover depth in a way I&amp;#39;ll tell you in a second.</text><text start="35" dur="2">But for the horizontal, it&amp;#39;s really hard.</text><text start="37" dur="6">If this feature crosses all of the camera image, there is something called the &amp;quot;aperture effect.&amp;quot;</text><text start="43" dur="5">What this really means is we can&amp;#39;t really tell which of the little dots on this line </text><text start="48" dur="3">correspond to which little dots on this line over here.</text><text start="51" dur="3">In cases where the image lacks structure--</text><text start="54" dur="4">or the worst one would be two images of fog. </text><text start="58" dur="3">In fog, there is certainly a depth. </text><text start="61" dur="2">Each water particle has a certain range, </text><text start="63" dur="3">but we can&amp;#39;t really recover how far away fog is, </text><text start="66" dur="3">because, honestly, both images look alike.</text><text start="69" dur="3">There are certain degenerate cases where stereo doesn&amp;#39;t work.</text><text start="72" dur="2">We are going to focus on this case over here right now </text><text start="74" dur="4">where we do get information from the stereo sensor.</text></transcript></video><video title="07 Solving for Depth.mp4" id="_o8qyMcZow8" length="118"><transcript><text start="0" dur="2">Let&amp;#39;s get back to the stereo rig.</text><text start="2" dur="4">We have two pinholes with a known focal length f, </text><text start="6" dur="6">and we wish to recover the depth z of a point p.</text><text start="12" dur="6">We happen to know that the projection of p on the two image planes is somewhat different.</text><text start="18" dur="3">Over here we call it x1 for the first imager.</text><text start="21" dur="4">Over here we call it x2 for the second imager.</text><text start="25" 
dur="7">The question is what is the formula that allows us to look at this rig over here</text><text start="32" dur="6">with two images with a known baseline b to recover the depth z</text><text start="38" dur="3">from the relative displacements x1 and x2.</text><text start="41" dur="3">There happens to be a relatively simple answer.</text><text start="44" dur="5">If you look at this big triangle over here, that triangle has the same proportions</text><text start="49" dur="5">as the triangle put together by this little thing over here and this thing over here.</text><text start="54" dur="5">You move these two triangles over here together into a single triangle.</text><text start="59" dur="3">It looks like this.</text><text start="62" dur="4">The proportions of this triangle over here are the same </text><text start="66" dur="2">as the proportions of this triangle over here.</text><text start="68" dur="5">Specifically, the length back here is x2 minus x1.</text><text start="73" dur="5">This distance over here is f, the length over here is the baseline b,</text><text start="78" dur="3">and this length over here is the unknown depth z.</text><text start="81" dur="3">If we transform this and solve it for z,</text><text start="84" dur="6">we get z equals f times b over x2 minus x1.</text><text start="90" dur="3">If we look at the relative displacement of a point in these two different camera images,</text><text start="93" dur="6">which is x2 minus x1, you&amp;#39;ll find that the actual depth is inversely proportional to it,</text><text start="99" dur="7">and that it scales linearly with the focal length f and the baseline b.</text><text start="106" dur="4">These are all things we know. 
The baseline and the focal length are constants.</text><text start="110" dur="2">They&amp;#39;re called intrinsics.</text><text start="112" dur="3">These are measurements, and from this we can actually recover the real depth.</text><text start="115" dur="3">Let&amp;#39;s just try to practice this.</text></transcript></video><video title="08 Solve Depth Question.mp4" id="iZTLfTD8kKI" length="40"><transcript><text start="0" dur="6">So let me give you another quiz in which we have a stereo rig with baseline B</text><text start="6" dur="7">with two measurements, x1 and x2, of the same point P in the scene.</text><text start="13" dur="7">We know our focal length f, and I care about our depth z.</text><text start="20" dur="4">Here is our formula again to make things a little bit easier.</text><text start="24" dur="6">Here assume that my x2 equals 3 mm, my x1 is -1 mm, </text><text start="30" dur="5">my focal length is 8 mm, and my baseline B is 20 cm.</text><text start="35" dur="5">I&amp;#39;d like to know z in centimeters.</text></transcript></video><video title="09 Solve Depth Answer.mp4" id="WgBtQNrX5N0" length="18"><transcript><text start="0" dur="2">The answer is 40.</text><text start="2" dur="7">f is 8 mm, over 3 minus -1, which is 4 mm,</text><text start="9" dur="5">which gives this guy over here a factor of 2 times B.</text><text start="14" dur="4">With B equal to 20 cm, that makes 40 cm for z.</text></transcript></video><video title="10 Change in X Question.mp4" id="Fw6oGIdcXgk" length="28"><transcript><text start="0" dur="5">Using the same formula, we are going to write x2 minus x1 as delta x </text><text start="5" dur="3">just to make it a little bit simpler.</text><text start="8" dur="2">Let me see if we can recover other things.</text><text start="10" dur="5">Let&amp;#39;s assume the range is actually 10 m. We know about the physical world.</text><text start="15" dur="5">We know our baseline is 1 m. 
We know that our focal length is 30 mm.</text><text start="20" dur="4">Can we possibly recover the delta x?</text><text start="24" dur="4">I&amp;#39;d like you to give your answer in mm.</text></transcript></video><video title="11 Change in X Answer.mp4" id="g_Uywr-HbaY" length="17"><transcript><text start="0" dur="5">The answer is absolutely yes. It&amp;#39;s going to be 3 mm.</text><text start="5" dur="4">To see this, we transform this equation over here to bring delta x to the left.</text><text start="9" dur="8">B over z is 0.1, and 0.1 times 30 mm makes 3 mm over here.</text></transcript></video><video title="12 Focal Length Question.mp4" id="2URvZYi4Od8" length="30"><transcript><text start="0" dur="3">Let&amp;#39;s now go to a more challenging case</text><text start="3" dur="8">where I&amp;#39;d like to recover the focal length f from measurements of the type z, delta x, and B.</text><text start="11" dur="4">Suppose I happen to know that an object is 100 m away, </text><text start="15" dur="5">and my baseline is 0.5 m, which is 50 cm.</text><text start="20" dur="4">And suppose my displacement x2 minus x1 is exactly 1 millimeter.</text><text start="24" dur="6">What do you think f is, expressed in mm?</text></transcript></video><video title="13 Focal Length Answer.mp4" id="z8QzMr0IGAI" length="18"><transcript><text start="0" dur="5">The answer is 200 mm for our focal length.</text><text start="5" dur="4">We can transform this expression over here to bring f to the left side.</text><text start="9" dur="5">And z over B equals 200, and delta x equals 1 mm.</text><text start="14" dur="4">We get 200 mm as an answer for our question.</text></transcript></video><video title="14 Correspondence Question.mp4" id="Df_KrsQRQIk" length="36"><transcript><text start="0" dur="3">I&amp;#39;d like to say a few words on the issue of correspondence, </text><text start="3" dur="3">often also called data association.</text><text start="6" dur="3">Supposing we have two camera images, as shown over 
here,</text><text start="9" dur="4">and we see an interesting point P in the left image.</text><text start="13" dur="3">The question is where do you search in the right image?</text><text start="16" dur="6">Everywhere? Along a line? Or can you already predict the point?</text><text start="22" dur="2">So where do you search in the right image? </text><text start="24" dur="4">Everywhere would be in 2D--the entire image.</text><text start="28" dur="5">Along a line would be 1D, and a fixed point would be 0D.</text><text start="33" dur="3">Please check the appropriate box.</text></transcript></video><video title="15 Correspondence Answer.mp4" id="hbScGNt8iIQ" length="70"><transcript><text start="0" dur="5">The answer is 1D. You can actually search along this line over here.</text><text start="5" dur="3">You can&amp;#39;t really know where along the line the point is,</text><text start="8" dur="7">because where it is is a function of the depth of the scene, which you don&amp;#39;t know,</text><text start="15" dur="3">but it can&amp;#39;t be the full image.</text><text start="18" dur="3">To illustrate this, let me look a little bit from above.</text><text start="21" dur="3">Here we have two image planes from the two cameras.</text><text start="24" dur="4">There is a point over here that finds itself in the image plane over there.</text><text start="28" dur="6">If we don&amp;#39;t know the depth, we know that the point must lie on this ray over here,</text><text start="34" dur="7">and each of the points on this ray gets projected into this imager along a line.</text><text start="41" dur="4">If the point is over here, it might be the projection over there,</text><text start="45" dur="5">and as we go out to infinity, it might be the point over here.</text><text start="50" dur="3">Now this camera array is a little bit more general than we talked about.</text><text start="53" dur="3">The image planes aren&amp;#39;t parallel anymore,</text><text start="56" dur="5">but even if 
they&amp;#39;re not parallel, each point in the left image corresponds to a potential line</text><text start="61" dur="3">of corresponding points in the right image.</text><text start="64" dur="3">It makes the search for correspondences much, much easier.</text><text start="67" dur="3">Let&amp;#39;s talk a little bit more about correspondences.</text></transcript></video><video title="16 Determine Correspondence Question.mp4" id="umA7G1X9XrY" length="122"><transcript><text start="0" dur="3">The general correspondence problem is given</text><text start="3" dur="5"> if there are two identical-looking points in the scene that have different depths.</text><text start="8" dur="4">For example, P1 might reflect into the image over here,</text><text start="12" dur="4">and P2 will reflect into the image as indicated by these red lines.</text><text start="16" dur="4">Now if we understand the correspondence of P1 in both images, </text><text start="20" dur="3">that this point corresponds to this point, we are well off,</text><text start="23" dur="2">and we can estimate the depth of P1.</text><text start="25" dur="6">If we get it wrong, if we correspond this point over here in the image to this guy over here,</text><text start="31" dur="5">then what we will see is this point right over here--P1 prime.</text><text start="36" dur="3">If we correspond this guy over here with this guy over here,</text><text start="39" dur="2">we get P2 prime.</text><text start="41" dur="5">These aren&amp;#39;t really points in the actual scene, but they&amp;#39;ll be phantom points</text><text start="46" dur="2">that occur because we got the correspondence wrong.</text><text start="48" dur="3">It&amp;#39;s really important when we look at two camera images</text><text start="51" dur="4">to understand what is the actual correspondence.</text><text start="55" dur="5">Here are actually two images from a stereo rig of a scene,</text><text start="60" dur="4">and you can see there&amp;#39;s a slight 
displacement. It&amp;#39;s actually really hard to see.</text><text start="64" dur="3">We&amp;#39;re looking at this feature over here for now.</text><text start="67" dur="3">I&amp;#39;d like to correspond it to something in the right image.</text><text start="70" dur="5">We have already learned that the search will have to be along a line.</text><text start="75" dur="3">Here is the green line, which is the corresponding line.</text><text start="78" dur="4">It can&amp;#39;t be that this point over here shows up somewhere in the sky over here,</text><text start="82" dur="4">but even along the line, it&amp;#39;s not completely obvious how to do correspondence--</text><text start="86" dur="3">how to match this image over here to this image over there.</text><text start="89" dur="4">So my question is how can we possibly find </text><text start="93" dur="4">where this feature corresponds to a feature over here?</text><text start="97" dur="3">How can we determine correspondence?</text><text start="100" dur="5">By matching small image patches using some of the linear techniques we talked about in</text><text start="105" dur="6">the last class by just basically comparing how similar-looking small image patches are</text><text start="111" dur="3">or by matching features, and particularly edge features or corner features</text><text start="114" dur="3">that we might extract from the original image.</text><text start="117" dur="5">Or maybe neither of those two. Please check any or all of those that apply.</text></transcript></video><video title="17 Determine Correspondence Answer.mp4" id="UfkoC4Pi6pY" length="12"><transcript><text start="0" dur="6">The answer is both. 
You can use image patches and features, and I&amp;#39;ll talk about both.</text><text start="6" dur="2">They are somewhat similar, and they&amp;#39;re not without problems, </text><text start="8" dur="4">but both are being used to estimate correspondence.</text></transcript></video><video title="18 SSD Minimization.mp4" id="LjNqM9hOkw0" length="122"><transcript><text start="0" dur="4">Here is my pair of images again,</text><text start="4" dur="6">and my scan line, and I&amp;#39;m extracting from it a very small little window</text><text start="10" dur="3">that is the local image of the specific feature over here </text><text start="13" dur="3">which happens to have a strong vertical structure,</text><text start="16" dur="2">which is nice for localization.</text><text start="18" dur="5">Now I&amp;#39;m comparing this little patch with my little patches in the right image,</text><text start="23" dur="4">and I&amp;#39;m drawing a sum of square difference error,</text><text start="27" dur="4">which is minimized when these two patches look alike.</text><text start="31" dur="3">I&amp;#39;ll tell you in a second what this looks like mathematically,</text><text start="34" dur="5">but intuitively we have to pick the place along the random measured search space</text><text start="39" dur="4">that has the smallest sum of square difference error,</text><text start="43" dur="4">which is the one where these two patches just look most alike.</text><text start="47" dur="4">This is the space of the scan line in which I search,</text><text start="51" dur="5">often called disparity, and for one location this is actually being minimized right over here.</text><text start="56" dur="3">Here&amp;#39;s the basic algorithm for SSD minimization. 
</text><text start="59" dur="4">We take two patches--one from the left image, one from the right image.</text><text start="63" dur="3">We normalize, so the average brightness is zero.</text><text start="66" dur="3">We then take the normalized images and take the difference.</text><text start="69" dur="4">Then we square the difference. That gives us a squared-difference image.</text><text start="73" dur="4">Then we can sum up all the pixels to get a single value.</text><text start="77" dur="4">This is our SSD value, our sum-of-square difference value.</text><text start="81" dur="5">All of these operations are easily implemented using the material you already know.</text><text start="86" dur="5">The smaller the SSD value, the more closely these two images correspond.</text><text start="91" dur="5">This is a very common technique for comparing what&amp;#39;s called image templates,</text><text start="96" dur="3">where your left image is a template, </text><text start="99" dur="3">and you&amp;#39;re searching the right image for the best match to the template.</text><text start="102" dur="5">As you vary the location of the right image, you can find different SSDs.</text><text start="107" dur="3">You tend to get graphs for the right image.</text><text start="110" dur="4">With an image template, it gives you certain errors.</text><text start="114" dur="3">Sometimes you get a very small disparate error.</text><text start="117" dur="5">That&amp;#39;s the place you&amp;#39;ll pick for the best, most likely alignment.</text></transcript></video><video title="19 Disparity Maps.mp4" id="YPfs7DbAjfM" length="94"><transcript><text start="0" dur="3">Here is the result of such an operation.</text><text start="3" dur="3">We have yet again a left image over here. 
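<!-- A minimal sketch of the SSD steps from the previous segment (normalize to zero average brightness, take the difference, square, sum), followed by a search along one scan line for the shift with the smallest SSD. The names ssd and best_disparity, the list-of-rows patch layout, and the sample pixel values are assumptions made for illustration; the lecture does not give an implementation.

```python
def ssd(left_patch, right_patch):
    """Sum of squared differences between two equally sized patches,
    after shifting each patch so its average brightness is zero."""
    left = [p for row in left_patch for p in row]
    right = [p for row in right_patch for p in row]
    if len(left) != len(right):
        raise ValueError("patches must have the same size")
    left_mean = sum(left) / len(left)
    right_mean = sum(right) / len(right)
    return sum(((a - left_mean) - (b - right_mean)) ** 2
               for a, b in zip(left, right))


def best_disparity(template, scan_rows, max_shift):
    """Slide the template along the scan rows; return the shift with the smallest SSD."""
    width = len(template[0])
    scores = [ssd(template, [row[s:s + width] for row in scan_rows])
              for s in range(max_shift + 1)]
    return scores.index(min(scores))


# Hypothetical data: a bright vertical stripe, one unit brighter in the right image.
template = [[0, 9, 0], [0, 9, 0]]
scan_rows = [[1, 1, 1, 10, 1, 1], [1, 1, 1, 10, 1, 1]]
print(best_disparity(template, scan_rows, 3))
```

Subtracting the mean first is what lets the stripe match even though the right image is uniformly brighter; without that step the brightness offset would add to every SSD value. -->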
The right one is missing.</text><text start="6" dur="3">Here you can see what&amp;#39;s called a disparity map, </text><text start="9" dur="3">which is the map of the best match.</text><text start="12" dur="5">In the right image, the larger the disparity, the more we have to assume the patch shifted.</text><text start="17" dur="5">We extracted every possible patch from this image, did the search on the right image,</text><text start="22" dur="4">and we find in the foreground, the disparity is much larger than in the background.</text><text start="26" dur="3">Sometimes we get a black spot, like over here,</text><text start="29" dur="4">where the information itself is not good enough to make any decision.</text><text start="33" dur="3">Or in the pathway over here, there are no real features.</text><text start="36" dur="2">Same for the sky over here.</text><text start="38" dur="6">But in most cases, we can see a nicely shaded gray that decreases with distance</text><text start="44" dur="2">where the disparity decreases.</text><text start="46" dur="4">This is a very typical stereo vision result.</text><text start="50" dur="8">Here is a disparity map from driving in the desert with our DARPA Grand Challenge car, Stanley.</text><text start="58" dur="4">We equipped it with two cameras, one on the left and one on the right.</text><text start="62" dur="5">You can see the two camera images, and on the right the disparity map.</text><text start="67" dur="8">It&amp;#39;s not that informative, because there is very little structure in the road surface itself,</text><text start="75" dur="5">but by and large you can see things further away end up being darker.</text><text start="80" dur="4">The big dominant thing here is lack of texture, </text><text start="84" dur="4">which leads to certain areas in the disparity map just being black,</text><text start="88" dur="2">which means we don&amp;#39;t know.</text><text start="90" dur="4">But where it registers it does a pretty fine 
job.</text></transcript></video><video title="20 Context Question.mp4" id="gA8PqLsCDhg" length="92"><transcript><text start="0" dur="4">I&amp;#39;d like to talk a little bit more about correspondence.</text><text start="4" dur="5">Specifically, we&amp;#39;ve learned that searching for correspondence means</text><text start="9" dur="2">we search along a single scan line,</text><text start="11" dur="6">but I&amp;#39;d like to ask the question whether it&amp;#39;s optimal to correspond individual patches </text><text start="17" dur="2">which are independent of each other.</text><text start="19" dur="5">Would it make sense to look at the context of an entire scan line?</text><text start="24" dur="2">Let&amp;#39;s look at the following situation.</text><text start="26" dur="2">We have a background that&amp;#39;s black.</text><text start="28" dur="3">We have a foreground that&amp;#39;s red, </text><text start="31" dur="3">and we have sides of the object that are both blue.</text><text start="34" dur="3">In a left image, we might see black, black,</text><text start="37" dur="7">and then there is this blue element that is only visible from the left camera,</text><text start="44" dur="4">a couple of reds--3 of them--and then we see more blacks.</text><text start="48" dur="4">From the right imager we might see black, black.</text><text start="52" dur="3">We won&amp;#39;t see the blue over here, because it&amp;#39;s occluded,</text><text start="55" dur="5">but we&amp;#39;ll see a couple of reds followed by the blue over here,</text><text start="60" dur="5">which is only visible from the right camera, followed by more blacks.</text><text start="65" dur="3">When we look at the entire situation, </text><text start="68" dur="5">the question is whether we can correspond red pixels to each other</text><text start="73" dur="4">irrespective of context or whether it makes sense to look at context.</text><text start="77" dur="5">Specifically, take the mid red pixel over here--this guy 
over here--</text><text start="82" dur="7">and let me ask you does it correspond to the left red, the center red, or the right red?</text><text start="89" dur="3">Please check the corresponding box.</text></transcript></video><video title="21 Context Answer.mp4" id="XXHdnWsepcs" length="45"><transcript><text start="0" dur="3">The answer is it corresponds to the center red,</text><text start="3" dur="3">which is the guy over here.</text><text start="6" dur="2">Finding this is not easy.</text><text start="8" dur="3">This is the fifth pixel on the left camera image, </text><text start="11" dur="3">and it&amp;#39;s the fourth pixel on the right camera image.</text><text start="14" dur="5">To make that correspondence, we have to understand that the best match matches </text><text start="19" dur="5">these 2 black pixels over here, followed by an occlusion pixel </text><text start="24" dur="3">that&amp;#39;s only visible on the left but not on the right,</text><text start="27" dur="3">followed by the 3 corresponding red pixels,</text><text start="30" dur="2">followed by another occlusion pixel,</text><text start="32" dur="4">followed by 2 black pixels that basically correspond.</text><text start="36" dur="4">I now want to look into algorithms that can take entire scan lines of the left side</text><text start="40" dur="5">and correspond them to entire scan lines on the right side.</text></transcript></video><video title="22 Alignment 1 Question.mp4" id="M4NytzFMGPc" length="79"><transcript><text start="0" dur="3">Let&amp;#39;s look at the same problem again,</text><text start="3" dur="2">and let me just draw the two scan lines--</text><text start="5" dur="5">the left scan line and the right scan line.</text><text start="10" dur="5">As before, we get to see red pixels, black pixels, </text><text start="15" dur="5">and the occlusive blue pixels as indicated over here.</text><text start="20" dur="7">Now we&amp;#39;ll try to match the entire scan line on the top to the entire 
scan line on the bottom</text><text start="27" dur="4">so we can figure out what the exact correspondence is.</text><text start="31" dur="3">We do this by minimizing a cost function.</text><text start="34" dur="3">The cost comes in two different flavors.</text><text start="37" dur="3">There is the cost of bad matches.</text><text start="40" dur="5">Let&amp;#39;s assume if the color matches perfectly, we pay zero,</text><text start="45" dur="3">but if the color matches very poorly, we pay 20.</text><text start="48" dur="3">There is also the cost of occlusion.</text><text start="51" dur="5">If in the process of matching these lines we have to assume a pixel is occluded,</text><text start="56" dur="2">we&amp;#39;re just going to pay 10.</text><text start="58" dur="6">The question now is what is the optimal alignment of the top to the bottom under this cost function?</text><text start="64" dur="4">Let&amp;#39;s just go through this. Let me look at two different possible alignments.</text><text start="68" dur="4">Here is one. We align those black pixels, and we align the red pixels.</text><text start="72" dur="7">If we did this, what is the cost of the total match? Please put the answer over here.</text></transcript></video><video title="23 Alignment 1 Answer.mp4" id="ZOVwW1vWHw4" length="23"><transcript><text start="0" dur="2">The cost is 20. </text><text start="2" dur="6">The reason is that in this match over here, we get a perfect color match.</text><text start="8" dur="3">Black matches to black. Red matches to red. </text><text start="11" dur="4">But we have to assume that this pixel over here and the pixel over here</text><text start="15" dur="3">are both the result of occlusion. 
Each costs us 10.</text><text start="18" dur="5">So the result is we pay 20 as the total cost.</text></transcript></video><video title="24 Alignment 2 Question.mp4" id="B6YblJCA-5o" length="22"><transcript><text start="0" dur="3">Let me now ask the same question again for a different alignment.</text><text start="3" dur="6">Suppose we were to marry the red pixel over here to the blue one over here,</text><text start="9" dur="4">the red one to the red one over here, and so on.</text><text start="13" dur="3">What would now be the cost of the alignment?</text><text start="16" dur="6">We don&amp;#39;t have these diagonals, but match pixel by pixel over here.</text></transcript></video><video title="25 Alignment 2 Answer.mp4" id="l9VlgUsTVsc" length="45"><transcript><text start="0" dur="3">The answer would be in this case 40,</text><text start="3" dur="6">because we pay a penalty of 20 for a bad match over here.</text><text start="9" dur="5">These are good matches. We end up paying another penalty of 20 for a bad match over there.</text><text start="14" dur="2">In total we get a penalty of 40.</text><text start="16" dur="5">What this teaches us is that in matching pixels to pixels,</text><text start="21" dur="4">we match an entire corresponding line in stereo.</text><text start="25" dur="5">We can trade off the bad match cost with the occlusion cost.</text><text start="30" dur="3">Sometimes it is cheaper to assume occlusion,</text><text start="33" dur="4">and sometimes it is cheaper to assume a bad match.</text><text start="37" dur="4">The result of this optimization is that it gives us the best association </text><text start="41" dur="4">of the scan line over here to the scan line over here.</text></transcript></video><video title="26 Dynamic Programming.mp4" id="csWWHnkKktI" length="254"><transcript><text start="0" dur="6">The tricky part is how to compute the best possible alignment.</text><text start="6" dur="3">It&amp;#39;s usually done by dynamic programming.</text><text 
start="9" dur="4">The recognition here is that in principle there are </text><text start="13" dur="4">exponentially many ways to align pixels in the left and right image,</text><text start="17" dur="5">but in practice you can get away with an n-squared algorithm</text><text start="22" dur="3">where n is the number of pixels in the scan line.</text><text start="25" dur="4">Let&amp;#39;s write this as n-squared. It&amp;#39;s a much, much faster algorithm.</text><text start="29" dur="5">Here&amp;#39;s the idea. Let&amp;#39;s write down both scan lines as shown over here.</text><text start="34" dur="4">And let&amp;#39;s write down a matrix of size n-squared.</text><text start="38" dur="5">The neat thing here is that any path from the top left to the bottom right</text><text start="43" dur="7">is a specific correspondence of pixels over here on the left scan line</text><text start="50" dur="2">to pixels over here on the right scan line.</text><text start="52" dur="6">For example, if I take the path that&amp;#39;s diagonal, that aligns the pixels one to one.</text><text start="58" dur="5">But the best possible path would assume that the first two pixels correspond,</text><text start="63" dur="3">and there&amp;#39;s a left occlusion afterwards.</text><text start="66" dur="2">Then all the red guys correspond.</text><text start="68" dur="4">So this red guy over here corresponds to this red guy over here.</text><text start="72" dur="2">There&amp;#39;s an occlusion over here.</text><text start="74" dur="2">Then we go diagonal again.</text><text start="76" dur="5">So any path that picks actions that go diagonal, down, or right</text><text start="81" dur="5">so that the top left is connected to the bottom right</text><text start="86" dur="7"> becomes a valid correspondence of the left scan line to the right scan line.</text><text start="93" dur="2">How do we find the best one?</text><text start="95" dur="5">Well, just like in MDPs we use the same methodology as an 
MDP.</text><text start="100" dur="4">We define the value of any of these points in the grid to be the best </text><text start="104" dur="3">possible value of getting there.</text><text start="107" dur="7">The value of a point ij in the grid is the maximum of the match value </text><text start="114" dur="5">if we chose diagonal, which is expressed over here to be the match of ij</text><text start="119" dur="7">given that we chose the diagonal, which means we add the value of i minus 1 and j minus 1,</text><text start="126" dur="6">and the occlusion penalty plus the value for either way we could have occluded, on the left or the right.</text><text start="132" dur="3">If we look at these three different things we maximize over here,</text><text start="135" dur="6">then each value over here becomes the maximum of assuming we have no occlusion</text><text start="141" dur="5">plus the corresponding match penalty or assuming we did have an occlusion,</text><text start="146" dur="5">either from the top or the bottom, and then we just pay the occlusion penalty,</text><text start="151" dur="3">and we assume the value over there.</text><text start="154" dur="3">Now, that&amp;#39;s not trivial. You have to think about this.</text><text start="157" dur="2">Why does this give us the optimal path?</text><text start="159" dur="3">But if you think about it and look at the optimal path,</text><text start="162" dur="3">we pay no penalty over here because the match is perfect.</text><text start="165" dur="3">We pay no penalty over here because the match is perfect again.</text><text start="168" dur="2">So, again, the first clause in this formula.</text><text start="170" dur="3">Over here we do pay a penalty.</text><text start="173" dur="3">We pay a penalty of 10, which is the occlusion penalty,</text><text start="176" dur="5">because we assume that between the blue pixel over here and the right scan line</text><text start="181" dur="5">there&amp;#39;s just no appropriate match. 
We&amp;#39;re going to pay a penalty of 10 over here.</text><text start="186" dur="3">Over here we pay no penalty, because the red corresponds perfectly to the red,</text><text start="189" dur="3">and we assume it is a perfect match.</text><text start="192" dur="2">The same over here and the same over here.</text><text start="194" dur="4">Down here we pay a penalty of 10, because we assume an occlusion,</text><text start="198" dur="3">and down here we just assume no penalty at all.</text><text start="201" dur="4">Now dynamic programming computes this value for every possible location.</text><text start="205" dur="4">For example, this guy over here would have a best path,</text><text start="209" dur="5">which might assume we had a perfect match over here and two occlusions over there,</text><text start="214" dur="3">but now the penalty is already 20 whereas the penalty over here is 10.</text><text start="217" dur="3">So likely this point won&amp;#39;t survive.</text><text start="220" dur="6">By working out the value function in this really interesting grid over here,</text><text start="226" dur="3">we find the value of the final point, which is 20,</text><text start="229" dur="4">and we also find the best possible path</text><text start="233" dur="4">by tracing the way in which the value propagated through this grid.</text><text start="237" dur="5">This becomes the best possible correspondence of the left and the right image</text><text start="242" dur="4">by aligning the entire left scan line and the entire right scan line</text><text start="246" dur="4"> simultaneously using dynamic programming.</text><text start="250" dur="4">This is the state of the art in stereo computer vision.</text></transcript></video><video title="27 Pixel Correspondence Question 1.mp4" id="zI4Gmf7qPrw" length="42"><transcript><text start="0" dur="5">Let me see if you understand what I just talked about with the following quiz.</text><text start="5" dur="5">Let&amp;#39;s assume we 
have two scan lines of six pixels each.</text><text start="10" dur="3">You get to observe the following that I&amp;#39;ve colorized:</text><text start="13" dur="4">two blacks followed by three reds by one black for the left image</text><text start="17" dur="6">and one black followed by four reds by one black for the right scan line.</text><text start="23" dur="7">Let us also assume the occlusion penalty is 5, and the bad match penalty is 20.</text><text start="30" dur="4">Can you mark the location to which you&amp;#39;d like to correspond </text><text start="34" dur="4">this specific red pixel over here in the right scan line</text><text start="38" dur="4"> by minimizing the total cost of occlusion and bad matches.</text></transcript></video><video title="28 Pixel Correspondence Answer 1.mp4" id="jf8vr7JZa8o" length="44"><transcript><text start="0" dur="3">This is a tricky question, </text><text start="3" dur="6">because it turns out that the occlusion answer is better than the bad match answer.</text><text start="9" dur="3">If you were to correspond each pixel in the left scan line </text><text start="12" dur="3">straight to each pixel in the right scan line,</text><text start="15" dur="3">you find that the total penalty will be 20, </text><text start="18" dur="5">because there&amp;#39;s one bad match between the black pixel over here and the red pixel here.</text><text start="23" dur="3">However, you can also correspond as follows.</text><text start="26" dur="6">You end up paying two occlusion penalties for this guy over here and this guy over here.</text><text start="32" dur="6">Because the occlusion penalty is only 5, you pay only a total of 10 as a penalty,</text><text start="38" dur="3">which is better than a single bad match penalty.</text><text start="41" dur="3">As a result, this would&amp;#39;ve been the right answer over here.</text></transcript></video><video title="29 Pixel Correspondence Question 2.mp4" id="aYQnVbJtSq4" length="30"><transcript><text 
start="0" dur="5">Let me discuss the second quiz here, in which we&amp;#39;re given two scan lines again--</text><text start="5" dur="2">a left and a right scan line.</text><text start="7" dur="6">The pixels are for the left scan line black, red, black, black, black, black</text><text start="13" dur="4">and for the right scan line black, black, black, black, red, and black.</text><text start="17" dur="5">Let&amp;#39;s assume an occlusion now costs us 10, and a bad match costs us 20.</text><text start="22" dur="5">So what pixel should be aligned with the black pixel over here?</text><text start="27" dur="3">Check one of those boxes.</text></transcript></video><video title="30 Pixel Correspondence Answer 2.mp4" id="_T-o2A4b7ys" length="63"><transcript><text start="0" dur="4">The answer is this box over here.</text><text start="4" dur="3">There are two points on the line where one may correspond </text><text start="7" dur="5">each pixel on the left scan line to each pixel on the right scan line with the same index.</text><text start="12" dur="4">So this guy corresponds to this guy and so on--just straight up.</text><text start="16" dur="3">We&amp;#39;re going to pay a penalty of 40, </text><text start="19" dur="5">which is this red pixel over here as a bad match to the pixel over there,</text><text start="24" dur="2">and the same is true over here.</text><text start="26" dur="4">So it&amp;#39;s a penalty of 40, which I conjecture to be the best.</text><text start="30" dur="3">Let me show you the answer that I don&amp;#39;t like,</text><text start="33" dur="2">which corresponds this black pixel to this black pixel here,</text><text start="35" dur="4">the red guy to the guy over here, and then black to black over here,</text><text start="39" dur="7">in which case we pay an occlusion penalty for the following six pixels:</text><text start="46" dur="3">the guys over here and the guys over here.</text><text start="49" dur="4">Each occlusion value is 10, which makes a total 
occlusion penalty of 60,</text><text start="53" dur="2">which is worse than the 40.</text><text start="55" dur="5">So I conjecture that the straight-up match like this is superior,</text><text start="60" dur="3">and we should check the box over here.</text></transcript></video><video title="31 Finding the Best Alignment.mp4" id="3FxoQshSppg" length="32"><transcript><text start="0" dur="4">As you can see, the optimal path for this diagram over here,</text><text start="4" dur="4"> which determines what is occlusion and what is a bad match,</text><text start="8" dur="3">really is a function of those penalties, </text><text start="11" dur="5">the costs that we associate with poor matches or the occlusion assumption.</text><text start="16" dur="3">So running dynamic programming through this grid over here will give you </text><text start="19" dur="5">the best alignment that gives you the best possible total cost </text><text start="24" dur="8">that assumes an optimal trade-off between occlusion costs and the cost of matching 2 pixels.</text></transcript></video><video title="32 Correspondence Issues.mp4" id="c6Xj7zH86A0" length="81"><transcript><text start="0" dur="6">This segment is my explanation of correspondence in stereo vision.</text><text start="6" dur="4">It has come a long way. 
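<!-- A minimal sketch of the scan-line dynamic program from the preceding segments. It is written as a cost minimization (the lecture phrases the same recursion as maximizing a value), and it returns only the total cost, without tracing back the path. The function name align_scan_lines and the one-letter color codes (K = black, B = blue, R = red) are assumptions made for illustration.

```python
def align_scan_lines(left, right, occlusion_cost, mismatch_cost):
    """Return the minimum total cost of aligning two scan lines.

    Each step either matches a left pixel with a right pixel (cost 0 if the
    colors agree, mismatch_cost otherwise) or marks one pixel as occluded.
    """
    n, m = len(left), len(right)
    # cost[i][j]: cheapest alignment of the first i left and first j right pixels.
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i * occlusion_cost
    for j in range(1, m + 1):
        cost[0][j] = j * occlusion_cost
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            match = 0 if left[i - 1] == right[j - 1] else mismatch_cost
            cost[i][j] = min(
                cost[i - 1][j - 1] + match,       # diagonal: the pixels correspond
                cost[i - 1][j] + occlusion_cost,  # left pixel is occluded
                cost[i][j - 1] + occlusion_cost,  # right pixel is occluded
            )
    return cost[n][m]


# Blue/red example from the alignment quizzes: occlusion 10, bad match 20.
print(align_scan_lines("KKBRRRKK", "KKRRRBKK", 10, 20))  # 20
# Six-pixel quiz with occlusion penalty 5 and bad match penalty 20.
print(align_scan_lines("KKRRRK", "KRRRRK", 5, 20))  # 10
```

The table has (n + 1) by (m + 1) entries and each entry looks at three neighbors, which is where the n-squared running time comes from. -->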
There are a few things that don&amp;#39;t work really well.</text><text start="10" dur="4">For example, we have two cameras over here, and we have a big object over here </text><text start="14" dur="3">with a separate foreground object.</text><text start="17" dur="6">Then the order constraint imposed by dynamic programming doesn&amp;#39;t hold.</text><text start="23" dur="6">That is, an object over here might appear left of the object over here in the left imager</text><text start="29" dur="3">but right of the object over here in the right imager.</text><text start="32" dur="2">There are other cases where things go wrong.</text><text start="34" dur="5">For example, suppose you were imaging a circular object with these two imagers here.</text><text start="39" dur="5">Then the occlusion boundary of this object as viewed from the right imager</text><text start="44" dur="5">is different from the occlusion boundary of the same object as viewed from the left imager.</text><text start="49" dur="4">These are not corresponding points. 
They correspond to different points on the object.</text><text start="53" dur="4">As a result, your stereo calculation will give you a poor result.</text><text start="57" dur="4">A final instance where things might go wrong is </text><text start="61" dur="4">reflective objects that have specular reflections.</text><text start="65" dur="3">This ball over here reflects the ceiling lights, </text><text start="68" dur="3">and obviously, where the ceiling lights are being reflected</text><text start="71" dur="4"> is a function of where an imager is positioned.</text><text start="75" dur="2">For these specific features over here, </text><text start="77" dur="4">we get a really lousy depth estimate for the object at hand.</text></transcript></video><video title="33 Improving Stereo Vision.mp4" id="eEhBor_r5eY" length="178"><transcript><text start="0" dur="6">I&amp;#39;d like to say a few words about how to improve the results of stereo vision.</text><text start="6" dur="5">Here is a vision assembly that James David built up of two cameras.</text><text start="11" dur="3">In addition to having these two cameras, he also put a projector into the scene</text><text start="14" dur="3">that emitted a random light pattern.</text><text start="17" dur="6">In fact, it emitted a striped pattern, shown over here on this frog,</text><text start="23" dur="6">and by adding texture to the scene, you can make correspondence easier.</text><text start="29" dur="4">This is a striped pattern of unequal distances.</text><text start="33" dur="5">There&amp;#39;s a coding over here, which makes certain stripes larger than others.</text><text start="38" dur="3">If you run the same algorithm I just told you about,</text><text start="41" dur="3">you&amp;#39;ll find that stereo vision becomes better,</text><text start="44" dur="4"> because we can now better disambiguate the correspondence of points. </text><text start="48" dur="5">Here is the assembly used for imaging myself. 
This is me with a sweater on.</text><text start="53" dur="2">That&amp;#39;s my face.</text><text start="55" dur="5">And you can see by emitting structured light, as it is called,</text><text start="60" dur="3">you can enhance the performance of stereo</text><text start="63" dur="3"> and objects that otherwise have very poor texture.</text><text start="66" dur="5">Another solution is called the Microsoft Kinect. You&amp;#39;re probably familiar with it.</text><text start="71" dur="4">It&amp;#39;s a new gaming platform that&amp;#39;s been sold at record pace.</text><text start="75" dur="3">It uses a camera system, together with a laser.</text><text start="78" dur="3">The laser adds texture to the scene,</text><text start="81" dur="3">and by triangulation using the same method I showed you,</text><text start="84" dur="2">it can recover depth.</text><text start="86" dur="5">Here&amp;#39;s my postdoc Christian using a Kinect-like sensor </text><text start="91" dur="5">to do certain poses in front of a depth sensor.</text><text start="96" dur="5">You can see in the screen how his pose is being perceived,</text><text start="101" dur="5">and you can see Christian trying to do handstands and other acrobatic maneuvers.</text><text start="111" dur="3">He&amp;#39;s actually pretty good.</text><text start="114" dur="3">That&amp;#39;s all using effectively stereo vision.</text><text start="127" dur="3">There is actually a whole bunch of different types of techniques </text><text start="130" dur="3">for sensing range in computer vision.</text><text start="133" dur="2">I&amp;#39;m just going to briefly talk about them.</text><text start="135" dur="2">They&amp;#39;re called laser range finders. 
</text><text start="137" dur="2">They send off beams of light, </text><text start="139" dur="3">and they measure the time until the light comes back into the sensor.</text><text start="142" dur="3">They&amp;#39;re being manufactured by many different companies.</text><text start="145" dur="5">In our experiments using robots to drive through the desert and through traffic,</text><text start="150" dur="5">we quite extensively used laser range finders as an alternative to stereo vision,</text><text start="155" dur="3">because they give us very, very good range estimates.</text><text start="158" dur="7">Here is a 3D model constructed by laser range finders of our neighborhood in Palo Alto,</text><text start="165" dur="7">and it&amp;#39;s easy to see how 3D points can make amazing 3D models,</text><text start="172" dur="6">using techniques like stereo vision or like the laser range finders I just briefly talked about.</text></transcript></video></group><group title="Unit 18" count="9"><video title="01 Structure from Motion Question.mp4" id="xOilgbBGk8w" length="100"><transcript><text start="0" dur="3">[Thrun] In this very final episode of the computer vision classes</text><text start="3" dur="3">I will teach you about structure from motion.</text><text start="6" dur="5">This is a really funny name for something much more intuitive,</text><text start="11" dur="3">and it comes from the early days of computer vision</text><text start="14" dur="3">where the structure referred to the 3D world.</text><text start="17" dur="4">And of course it&amp;#39;s impossible to capture the 3D world with the camera itself</text><text start="21" dur="3">because the camera only gives 2D projections of the 3D scene.</text><text start="24" dur="3">Motion referred to the locations of the camera.</text><text start="27" dur="7">So the idea was to take a handheld camera and move it around a 3D structure</text><text start="34" dur="6">and be able to recover or estimate the 3D coordinates of all the 
features in the world</text><text start="40" dur="3">based on many 2D images.</text><text start="43" dur="4">So suppose you have a scene with 3 features--A, B, and C--</text><text start="47" dur="5">and you&amp;#39;re moving a camera around to different positions--1, 2, and 3.</text><text start="52" dur="5">Then the different features get projected onto different points in the camera planes,</text><text start="57" dur="2">as shown over here.</text><text start="59" dur="4">And from the positions of those projected features</text><text start="63" dur="5">it may be possible to recover not just where the camera was</text><text start="68" dur="5">at the time these images were taken but also where in the world the features are.</text><text start="73" dur="2">That&amp;#39;s called structure from motion.</text><text start="75" dur="2">So here is my first quiz.</text><text start="77" dur="2">Is this possible?</text><text start="79" dur="3">Given that we look at a number of features in the scene--</text><text start="82" dur="2">maybe 1, maybe 2, maybe more--</text><text start="84" dur="2">and given that we have 1 or more camera positions, </text><text start="86" dur="7">can we always, sometimes, or never recover or calculate the 3D position of the features</text><text start="93" dur="3">and the 3D position of the cameras simultaneously?</text><text start="96" dur="4">Please check always, sometimes, or never.</text></transcript></video><video title="02 Structure from Motion Answer.mp4" id="rrL7q_2Y5dQ" length="30"><transcript><text start="0" dur="2">[Thrun] And the answer is sometimes,</text><text start="2" dur="3">and it&amp;#39;s not entirely obvious whether this is the right answer.</text><text start="5" dur="3">Clearly if there is only 1 point feature in the world and 1 image,</text><text start="8" dur="2">then you can&amp;#39;t recover where the feature is.</text><text start="10" dur="5">You already learned this, because the camera can&amp;#39;t estimate depth by 
itself.</text><text start="15" dur="2">So it can&amp;#39;t be always,</text><text start="17" dur="2">but it&amp;#39;s also not never.</text><text start="19" dur="4">There are cases in which we can actually recover the full scene</text><text start="23" dur="4">and all the camera positions, and we will ask ourselves in a minute</text><text start="27" dur="3">under what situation this might be possible.</text></transcript></video><video title="03 Projection Question.mp4" id="OeYHERUjtNc" length="42"><transcript><text start="0" dur="4">[Thrun] Let me first give you a brief quiz to understand how the projection works</text><text start="4" dur="2">in structure from motion.</text><text start="6" dur="3">Suppose we have 3 point features at known locations.</text><text start="9" dur="4">We have a camera over here, camera A,</text><text start="13" dur="3">which can see these 3 point features.</text><text start="16" dur="2">We have a second camera over here, camera B,</text><text start="18" dur="3">a pinhole camera which can see the same features.</text><text start="21" dur="4">Suppose camera A on the left sees feature 1,</text><text start="25" dur="5">at the center sees feature 3, and on the right side of the camera plane sees feature 2.</text><text start="30" dur="5">I would like to know for camera B what will be on the camera plane on the left side,</text><text start="35" dur="2">the center, or the right side.</text><text start="37" dur="5">Which of the features 1, 2, 3 will be seen left, center, or right?</text></transcript></video><video title="04 Projection Answer.mp4" id="s-6lgSO3tdI" length="51"><transcript><text start="0" dur="5">[Thrun] And the answer is 3, 2, 1. 
Let&amp;#39;s just go through this.</text><text start="5" dur="6">Clearly the leftmost feature in camera A is 1, which corresponds to this point over here,</text><text start="11" dur="6">the center will be 3, and the right one will be 2,</text><text start="17" dur="2">as indicated in the table over here.</text><text start="19" dur="3">If we now look into imager B, </text><text start="22" dur="5">we find that the leftmost projection comes from feature number 3,</text><text start="27" dur="4">the center projection from feature number 2, </text><text start="31" dur="3">and the rightmost projection from feature number 1.</text><text start="34" dur="3">This is not the full structure from motion problem,</text><text start="37" dur="6">but it&amp;#39;s a good exercise to understand how feature indices under known position A and B</text><text start="43" dur="5">and under known locations of the target features map to each other,</text><text start="48" dur="3">and it&amp;#39;s good to understand the complexity of the structure from motion problem.</text></transcript></video><video title="05 Structure from Motion Models.mp4" id="oZ3YqZCYJT0" length="205"><transcript><text start="0" dur="5">[Thrun] Here is a very early example of structure from motion by Carlo Tomasi</text><text start="5" dur="2">and Takeo Kanade.</text><text start="7" dur="6">They used Harris corner detectors to find corners in the image of this toy 3D house,</text><text start="13" dur="5">and they were able from a number of images to fully recover the 3D structure</text><text start="18" dur="4">of every single corner point, as shown in this video.</text><text start="22" dur="4">So as they then take this 3D data set and turn it in arbitrary directions,</text><text start="26" dur="3">you can see the full 3D structure was recovered.</text><text start="29" dur="2">This is work in 1992.</text><text start="31" dur="3">It used principal component analysis to solve the problem</text><text start="34" dur="5">and is 
one of the most amazing pieces of early computer vision research.</text><text start="39" dur="5">Carlo, who used to be a Stanford professor for many years,</text><text start="44" dur="3">then scanned his kitchen and with the same Harris corner detector</text><text start="47" dur="5">was able to reconstruct a 3D structure of his kitchen, as shown over here.</text><text start="52" dur="6">Again, this is one of the most impressive early computer vision research results I&amp;#39;ve seen.</text><text start="58" dur="5">Here is a flight video of flying over the hills of Pennsylvania.</text><text start="63" dur="6">As you can see, using the same technique he was able to recover the 3D structure</text><text start="69" dur="7">of the outdoor terrain and build elevation maps, as shown over here.</text><text start="92" dur="3">Marc Pollefeys, who presently teaches at ETH Zurich,</text><text start="95" dur="5">came up with a beautiful solution to the structure from motion problem,</text><text start="100" dur="3">here imaging different buildings in his hometown.</text><text start="103" dur="5">From this video you can see multiple snapshots of a single building</text><text start="108" dur="5">where the different perspective distortion has an effect on the appearance of the building,</text><text start="113" dur="2">quite obviously.</text><text start="115" dur="5">Using those images he was able to reconstruct the 3D shape of the building facade,</text><text start="120" dur="3">as shown in this video.</text><text start="130" dur="4">Again, at the time it was one of the most impressive results ever achieved</text><text start="134" dur="2">in structure from motion.</text><text start="149" dur="7">You can see amazing detail as he zooms in to his building model.</text><text start="158" dur="4">He then moved on to map entire cities,</text><text start="162" dur="6">and here is an example of a map that he produced from an entire city block.</text><text start="168" dur="7">You can see how 
he reconstructs the building facades in unprecedented detail.</text><text start="175" dur="5">There&amp;#39;s also a lot of occlusion gaps where the original imager wasn&amp;#39;t able to see anything.</text><text start="180" dur="4">Those show up in black, and they look a little bit disturbing in this image over here.</text><text start="184" dur="3">But in reality, your camera can&amp;#39;t see everything.</text><text start="187" dur="2">So even if you do a perfect job with structure from motion,</text><text start="189" dur="5">it&amp;#39;s really hard to reconstruct every single inch of the environment.</text><text start="197" dur="3">Still, this stands out as one of the most impressive results ever</text><text start="200" dur="5">in what I would call the Holy Grail of 3D computer vision.</text></transcript></video><video title="06 SFM Math.mp4" id="1GYWcK28So8" length="97"><transcript><text start="0" dur="4">[Thrun] The mathematics of the structure from motion problem are involved,</text><text start="4" dur="3">and I don&amp;#39;t want to go into detail here.</text><text start="7" dur="6">Here is our perspective projection model with our well-known equation on the right.</text><text start="13" dur="4">Under the assumption that the camera itself might be at a random location</text><text start="17" dur="7">and a random orientation, this equation becomes a really complicated composition</text><text start="24" dur="6">of original image points in 3D, 3 rotation matrices as shown over here,</text><text start="30" dur="4">and an offset over here that relates to the camera coordinates.</text><text start="34" dur="6">You do this for X divided by Z over here.</text><text start="40" dur="3">This will be the projected camera input coordinates.</text><text start="43" dur="6">This is the generative math that specifies how cameras work under arbitrary orientations</text><text start="49" dur="2">and arbitrary translations.</text><text start="51" dur="6">If you now want to solve 
it, you can look at the observed measurements</text><text start="57" dur="5">minus the predicted measurements, minimize all this,</text><text start="62" dur="8">and solve for the translations, the point locations, and the orientations simultaneously.</text><text start="70" dur="5">This is entirely nontrivial, and nonlinear optimization techniques</text><text start="75" dur="4">have been used extensively to solve this problem.</text><text start="79" dur="4">They go by names like gradient descent, conjugate gradient,</text><text start="83" dur="7">Gauss-Newton, Levenberg Marquardt, and other things like singular value decomposition.</text><text start="90" dur="5">I won&amp;#39;t go into detail--just to give you a flavor of the problem.</text><text start="95" dur="2">Instead I&amp;#39;d like to ask you a question.</text></transcript></video><video title="07 Recovered Unknowns Question.mp4" id="rH_EbLanRwE" length="107"><transcript><text start="0" dur="3">[Thrun] I should warn you this question is hard.</text><text start="3" dur="4">If you&amp;#39;re new to computer vision, you likely won&amp;#39;t get it.</text><text start="7" dur="5">I&amp;#39;d like to ask it to you anyhow just to see how close you can get</text><text start="12" dur="3">and whether you appreciate the answer I&amp;#39;ll be giving you.</text><text start="15" dur="4">Suppose we have m camera poses.</text><text start="19" dur="2">That means m directions from which we take an image.</text><text start="21" dur="2">That&amp;#39;s called the motion.</text><text start="23" dur="6">And suppose we have n 3D points, which is called the structure.</text><text start="29" dur="6">Then it&amp;#39;s quite obvious that we have 2 times m times n constraints</text><text start="35" dur="4">simply because in each of the images we see all the n points</text><text start="39" dur="3">and we get an x and a y coordinate for each of the points, </text><text start="42" dur="3">which makes 2-m-n constraints.</text><text 
start="45" dur="2">We also have unknowns.</text><text start="47" dur="4">Specifically, each camera position is a 6D unknown </text><text start="51" dur="2">about the rotation and translation of the camera,</text><text start="53" dur="2">and each point itself has a 3D coordinate.</text><text start="55" dur="4">So the total number of unknowns is 6m plus 3n.</text><text start="59" dur="7">At first glance, to solve the structure from motion problem you would want 6m plus 3n</text><text start="66" dur="3">to be smaller or equal to 2mn.</text><text start="69" dur="4">And of course if m and n is big enough, this equation will be satisfied.</text><text start="73" dur="3">But my question is if you run the structure from motion problem,</text><text start="76" dur="3">how many of these unknowns can you actually recover?</text><text start="79" dur="6">Or, put differently, how many of those unknowns can you not recover?</text><text start="85" dur="3">If you think about it, for example, you won&amp;#39;t be able to really recover</text><text start="88" dur="4">the absolute coordinates of our system, because you can move the entire system</text><text start="92" dur="4">1 meter to the right and you&amp;#39;ll still get the same answer.</text><text start="96" dur="5">So there&amp;#39;s going to be a number over here that I want you to enter</text><text start="101" dur="4">that specifies the number of parameters that cannot possibly be recovered</text><text start="105" dur="2">in this structure from motion problem.</text></transcript></video><video title="08 Recovered Unknowns Answer.mp4" id="909mPd9XMIM" length="61"><transcript><text start="0" dur="3">[Thrun] And surprisingly, the answer is 7.</text><text start="3" dur="4">You cannot recover the absolute location and orientation of the coordinate system,</text><text start="7" dur="3">which are 6 of those parameters,</text><text start="10" dur="4">but you can also not recover scale.</text><text start="14" dur="5">For example, take a 
situation like this where you have 3 points over here</text><text start="19" dur="3">and now make this situation twice as large</text><text start="22" dur="3">with the points spread out twice as widely.</text><text start="25" dur="3">Because of perspective math, this over here will be the same answer</text><text start="28" dur="2">as this guy over here.</text><text start="30" dur="3">So this is 1 scale parameter that you can&amp;#39;t recover,</text><text start="33" dur="5">so you can only recover 6m plus 3n minus 7 parameters.</text><text start="38" dur="4">And as long as this is smaller than 2mn, you have a solution</text><text start="42" dur="2">of the structure from motion problem.</text><text start="44" dur="2">This was entirely nontrivial.</text><text start="46" dur="4">If you got this wrong, I would have gotten this wrong if I hadn&amp;#39;t known the solution.</text><text start="50" dur="2">But it&amp;#39;s fun to think about these things.</text><text start="52" dur="3">A lot of computer vision people care about whether the problem is well posed,</text><text start="55" dur="3">and you need a certain number of features and a certain number of images</text><text start="58" dur="3">to make this equation hold true.</text></transcript></video><video title="09 Conclusion.mp4" id="DL3vrst-ac8" length="52"><transcript><text start="0" dur="4">[Thrun] Now, at this point I would love to go deeper into structure from motion</text><text start="4" dur="2">and tell you more about how to solve it.</text><text start="6" dur="4">But, unfortunately, this is the Introduction to Artificial Intelligence class,</text><text start="10" dur="6">so I really want to leave it at a level that covers the typical material I cover at Stanford.</text><text start="16" dur="2">If you&amp;#39;re interested, take a computer vision class.</text><text start="18" dur="2">It&amp;#39;s a fascinating subject area.</text><text start="20" dur="4">This finishes my survey of computer vision.</text><text 
start="24" dur="3">Congratulations. You made it through the computer vision classes.</text><text start="27" dur="3">I think you now understand the very basics of computer vision,</text><text start="30" dur="3">you understand how images are being formed,</text><text start="33" dur="2">how features are being extracted,</text><text start="35" dur="4">and how we can do some very basic 3D inference about the world.</text><text start="39" dur="2">This is just a teaser.</text><text start="41" dur="2">The field of computer vision is of course much richer.</text><text start="43" dur="3">I used to teach the class at Stanford,</text><text start="46" dur="2">and I hope to be able to invite you in the near future </text><text start="48" dur="4">to an actual online 3D computer vision class.</text></transcript></video></group><group title="Homework 7" count="12"><video title="01 Perspective Projection.mp4" id="S13qbo0O-Og" length="100"><transcript><text start="0" dur="3">Welcome to the homework assignment on computer vision.</text><text start="3" dur="4">I will first ask you a few questions about perspective projection</text><text start="7" dur="3">in which you will exercise the math that we explored</text><text start="10" dur="5">that relates the size of an object in the scene, uppercase X,</text><text start="15" dur="6">with the depth for the range to the pinhole camera Z, the focal length f,</text><text start="21" dur="4">and the size of the projection, small x.</text><text start="25" dur="5">Remember from class the following equation.</text><text start="30" dur="4">I&amp;#39;m literally dropping the minus sign. 
I want all numbers to be positive in this example.</text><text start="34" dur="3">So please don&amp;#39;t worry about the minus sign that might not occur </text><text start="37" dur="3">in this specific version of the equation.</text><text start="40" dur="6">I&amp;#39;ll give you three values for X, Z, f,</text><text start="46" dur="3">and would like you to understand what the missing value is.</text><text start="49" dur="4">X is measured in meters and the same with Z.</text><text start="53" dur="5">f is in millimeters and so is lowercase x.</text><text start="58" dur="4">Here&amp;#39;s my first question. Suppose X is 10 m in size.</text><text start="62" dur="5">It&amp;#39;s 100 meters away. Suppose our focal length is 10 mm.</text><text start="67" dur="3">How large is our projection lowercase x in millimeters?</text><text start="70" dur="6">Now we&amp;#39;re asking you what the focal length is if an object of size 20 m that is 400 meters out</text><text start="76" dur="4">is observed to be 1 mm on our projection surface.</text><text start="80" dur="4">Suppose we have a 2 m sized object that with a 40 mm focal length</text><text start="84" dur="6">appears to be 1 mm in size. 
What is the distance of this object to the camera?</text><text start="90" dur="3">Finally, say an object is 300 m away.</text><text start="93" dur="4">Our focal length is now 100 mm, and the object is again 1 mm.</text><text start="97" dur="3">How large is this object in meters?</text></transcript></video><video title="02 Perspective Projection Answer.mp4" id="2VcA2vEGqSo" length="61"><transcript><text start="0" dur="4">The answer can be seen directly from this formula over here.</text><text start="4" dur="2">In the first case we plug in X equals 10.</text><text start="6" dur="4">f over Z is 10 divided by 100.</text><text start="10" dur="4">We multiply 10 by 0.1, and we get 1.</text><text start="14" dur="3">It turns out all the units take care of themselves.</text><text start="17" dur="4">No matter which way we pose the question, we can effectively ignore those units,</text><text start="21" dur="4">because they&amp;#39;re the same outside the camera as they are inside the camera.</text><text start="25" dur="4">For the second question, we transform this equation over here as follows:</text><text start="29" dur="8">We now plug in Z as 400, X as 20, lowercase x as 1,</text><text start="37" dur="3">to give us a focal length of 20 mm.</text><text start="40" dur="3">We can transform this equation further into this quotient over here.</text><text start="43" dur="6">f over x is 40, times uppercase X of 2 m is 80 meters.</text><text start="49" dur="6">Finally we can write this expression over here, and we plug in x over f.</text><text start="55" dur="6">We get one hundredth times 300, which is 3 meters over here.</text></transcript></video><video title="03 Linear or Not.mp4" id="OHl6VaglCP4" length="93"><transcript><text start="0" dur="6">In this question I&amp;#39;m going to ask you about whether certain image functions are linear.</text><text start="6" dur="5">A function is linear if each resulting pixel of the processed image</text><text start="11" dur="3"> is a linear 
combination of input pixels.</text><text start="14" dur="3">They could be weighted by constants like plus 1 or minus 1,</text><text start="17" dur="3">and they could be added up. Addition is linear.</text><text start="20" dur="4">But for example, taking the square of a pixel isn&amp;#39;t a linear operation.</text><text start="24" dur="4">I realize this question goes beyond what we discussed in class,</text><text start="28" dur="3">so please think a little bit about it and understand the difference </text><text start="31" dur="4">between linear and nonlinear in trying to answer these questions.</text><text start="35" dur="4">First is our gradient kernel here: minus 1, 1.</text><text start="39" dur="2">Please check if it&amp;#39;s linear or nonlinear.</text><text start="41" dur="6">Again, the linearity of an output image is given if the output image is a linear function </text><text start="47" dur="2">of the pixels in the input image.</text><text start="49" dur="4">How about our Gaussian kernel that we discussed in class of size 5 by 5?</text><text start="53" dur="3">Is the kernel linear or nonlinear?</text><text start="56" dur="2">How about taking the absolute value of a pixel?</text><text start="58" dur="5">If pixels are negative, we just ignore the negative sign and map back to the absolute value.</text><text start="63" dur="2">Is it linear or nonlinear?</text><text start="65" dur="2">We talked about the gradient magnitude kernel,</text><text start="67" dur="6">which was defined over a square root of the squares of the image gradients.</text><text start="73" dur="2">Is this a linear or nonlinear operation?</text><text start="75" dur="5">Finally, if you were to calculate the absolute brightness of a grey-scale image,</text><text start="80" dur="3">or let me call this the average brightness.</text><text start="83" dur="4">We have an image of a certain size and would like to calculate just the average brightness</text><text start="87" dur="3">of all the individual image 
pixels. They are all in greyscale.</text><text start="90" dur="3">Is this linear or nonlinear?</text></transcript></video><video title="04 Linear or Not Answer.mp4" id="mq3S5rYHecY" length="65"><transcript><text start="0" dur="4">The answer is every kernel convolution is linear.</text><text start="4" dur="5">Each pixel becomes the linear sum of, in this case, 2 pixels</text><text start="9" dur="3">that are weighted by plus 1 and minus 1, but in terms of the original variables,</text><text start="12" dur="6">which is the original image, this resulting sum, minus the left pixel plus the right pixel,</text><text start="18" dur="4">is a linear equation in the original pixel values.</text><text start="22" dur="4">The same is true for the Gaussian kernel of size 5 by 5.</text><text start="26" dur="4">It is a linear kernel because it just adds up all these values</text><text start="30" dur="2"> summed by the Gaussian kernel.</text><text start="32" dur="2">Absolute value is nonlinear.</text><text start="34" dur="6">The function that governs absolute value for input and output looks like this,</text><text start="40" dur="3">and there is a nonlinear kink over here.</text><text start="43" dur="2">The same is true for gradient magnitude.</text><text start="45" dur="2">There are squares in there, which are nonlinear,</text><text start="47" dur="3">and the square root makes it a nonlinear operation.</text><text start="50" dur="2">The absolute brightness is a linear operation.</text><text start="52" dur="5">It&amp;#39;s just like a Gaussian kernel with a uniform mask.</text><text start="57" dur="3">It just adds up all the values and divides them by the number of pixels.</text><text start="60" dur="5">It is a linear function in all the input pixels.</text></transcript></video><video title="05 Gradient Image.mp4" id="jQHHi-me1q4" length="59"><transcript><text start="0" dur="4">In this example, I&amp;#39;d like you to calculate a gradient image. 
</text><text start="4" dur="3">I&amp;#39;m giving you a relatively simple image of size 3 by 3 </text><text start="7" dur="8">with the following greyscale pixel values: 2, 0, 2, 4, 100, 102, 2, 4, 2.</text><text start="15" dur="5">And for the sake of this exercise, I&amp;#39;d like to retain another 3 by 3 image,</text><text start="20" dur="4">so we&amp;#39;ll assume that all the values outside the image are just zero.</text><text start="24" dur="4">What I&amp;#39;m asking you is to compute a 3 by 3 matrix</text><text start="28" dur="4">that is the result of convolving this image with the following kernel:</text><text start="32" dur="3">minus 1 on the left, zero, and 1.</text><text start="35" dur="4">Then take the absolute value of each pixel, so you&amp;#39;re going to ignore the minus sign,</text><text start="39" dur="2">which is clearly a nonlinear operation.</text><text start="41" dur="3">Please apply this kernel to the image over here.</text><text start="44" dur="5">For each pixel, down here you get a linear combination of applying this kernel</text><text start="49" dur="4">from the values over here, assuming these off image values are all zero.</text><text start="53" dur="6">We then take the absolute value--drop the minus sign--and please plug in the number over here.</text></transcript></video><video title="06 Gradient Image Answer.mp4" id="JRwplUQqAYA" length="58"><transcript><text start="0" dur="4">This kernel applied in this location over here will give me</text><text start="4" dur="5"> a minus 1 times zero plus 1 times zero is zero.</text><text start="9" dur="5">Shift to the right and you get minus 1 times 2 plus 1 times 2, which is zero.</text><text start="14" dur="2">The absolute value of this is zero.</text><text start="16" dur="2">On the right side we get zero again. 
</text><text start="18" dur="6">You get 100 over here, which is minus zero plus 100.</text><text start="24" dur="7">We get 98 over here, which is minus 4 plus 102.</text><text start="31" dur="5">And we get 100 over here, which is minus 100 over here plus zero,</text><text start="36" dur="4">which gives minus 100. We take the absolute value, so it&amp;#39;s 100.</text><text start="40" dur="7">You get 4 over here, which is minus zero plus 4 equals 4.</text><text start="47" dur="4">Zero over here. These two balance each other out.</text><text start="51" dur="4">Then another 4 over here. Minus 4 plus zero is minus 4.</text><text start="55" dur="3">Taking the absolute value we get 4 over here.</text></transcript></video><video title="07 Stereo.mp4" id="9EtdFB3Mchw" length="108"><transcript><text start="0" dur="3">I now have a stereo question.</text><text start="3" dur="5">For a valid calibrated stereo rig, we&amp;#39;re given two pinhole cameras </text><text start="8" dur="4">whose displacement is B, and we observe a point out in the scene.</text><text start="12" dur="7">The distance of this point to the image plane is uppercase Z.</text><text start="19" dur="3">Our cameras have a focal length of f.</text><text start="22" dur="3">Of course, this point is being projected into two different locations</text><text start="25" dur="4"> for the two different images--x2 and x1.</text><text start="29" dur="7">For the sake of this question, we&amp;#39;re going to just consider delta x, which is x2 minus x1--</text><text start="36" dur="4">the displacement in the corresponding images.</text><text start="40" dur="5">We measure delta x in millimeters, same with the focal length f.</text><text start="45" dur="4">B is in meters and so is Z, </text><text start="49" dur="4">and I use meters for B and Z so that the units effectively fall out.</text><text start="53" dur="2">You don&amp;#39;t really have to consider them.</text><text start="55" dur="3">Suppose the measured delta x is 4 mm</text><text 
start="58" dur="5">with a focal length of 40, and our displacement is 0.1.</text><text start="63" dur="4">How far away is the object in meters?</text><text start="67" dur="5">Suppose we have  a displacement of 0.05 mm for a focal length of 50.</text><text start="72" dur="8">We now care about the baseline B, if we happen to know the object is 100 meters away.</text><text start="80" dur="5">Next we have a displacement of 0.1 mm. We don&amp;#39;t know our focal length.</text><text start="85" dur="7">We know that the baseline is 0.2 mm, and the object is 50 meters away.</text><text start="92" dur="7">Finally we don&amp;#39;t know the displacement, but we do know that the focal length is 200 mm.</text><text start="99" dur="5">We have a baseline of 1 m and the object is 50 meters away.</text><text start="104" dur="4">Can you fill in the missing numbers?</text></transcript></video><video title="08 Stereo Answer.mp4" id="xGbG4uilBow" length="77"><transcript><text start="0" dur="4">We answer this question by first writing down the fundamental equation here,</text><text start="4" dur="6">which is delta x over f relates to B over Z.</text><text start="10" dur="4">To see this, we find by equal triangles that this triangle over here</text><text start="14" dur="5">described by Z over B, is the same as these 2 things over here </text><text start="19" dur="5">put together into a single triangle, which is delta x over f.</text><text start="24" dur="3">This proportionality must be the case.</text><text start="27" dur="4">This can now be transformed to solve for Z.</text><text start="31" dur="2">Z equals f over delta x B.</text><text start="33" dur="6">If we plug in f with delta x, we get 10 times 0.1 which is 1.</text><text start="39" dur="5">We can resolve B is delta x over f times Z.</text><text start="44" dur="10">We plug in delta x, 0.05, divided by 50 is 0.001 times 100 gives us 0.1.</text><text start="54" dur="6">We can also resolve it for f, which is Z over B times delta 
x.</text><text start="60" dur="5">Z over B is 10 times delta x as 0.1 gives us 1 over here.</text><text start="65" dur="2">Finally, we can resolve it for delta x.</text><text start="67" dur="7">B over Z times f. B over Z is 1/50 times f as 200 gives us 4 over here.</text><text start="74" dur="3">All the units fall out by themselves.</text></transcript></video><video title="09 Correspondence in Stereo.mp4" id="l_JIXI_GpGA" length="64"><transcript><text start="0" dur="3">We will now talk about correspondence in stereo.</text><text start="3" dur="3">You might remember our dynamic programming approach for</text><text start="6" dur="3">resolving correspondence along an entire scan line.</text><text start="9" dur="6">So I&amp;#39;ll give you another scan line. This is the left scan line--red, red, blue, blue, blue, red.</text><text start="15" dur="4">Then the right scan line we get to see the following.</text><text start="19" dur="3">Obviously there is a shift going on.</text><text start="22" dur="5">I&amp;#39;d like to ask you where this little pixel over here will go into the lead association.</text><text start="27" dur="4">It can go into any of those pixels over here, so please check exactly one of those boxes.</text><text start="31" dur="3">Let&amp;#39;s assume the cost for a bad match, </text><text start="34" dur="3">when we match 2 colors that don&amp;#39;t correspond, is 20.</text><text start="37" dur="4">The cost of an assumed occlusion or a disocclusion is 10.</text><text start="41" dur="2">Try to find the optimal alignment, </text><text start="43" dur="6">and then tell me where in the right scan line this 1 pixel corresponds to.</text><text start="49" dur="3">Check the exact box to which it corresponds.</text><text start="52" dur="2">Here is a second question I&amp;#39;d like to ask you.</text><text start="54" dur="4">What if we changed the cost of occlusion to 100?</text><text start="58" dur="3">Please answer the exact same question--where does the B over here 
go--</text><text start="61" dur="3">under this different cost model.</text></transcript></video><video title="10 Correspondence Answer.mp4" id="inAni6ddMN0" length="47"><transcript><text start="0" dur="3">In the case where the occlusion costs are low,</text><text start="3" dur="5">it is best to assume that those Bs over here correspond as indicated.</text><text start="8" dur="2">That means we have an occlusion cost to pay over here </text><text start="10" dur="2">and an occlusion cost to pay over here.</text><text start="12" dur="4">Our total cost is 20 for 2 occlusions, but we have a perfect match.</text><text start="16" dur="3">As a result, this B moves over here.</text><text start="19" dur="3">However, there is another viable solution when the occlusion costs are large,</text><text start="22" dur="3">because you would pay a total of 200 with the occlusion cost.</text><text start="25" dur="4">If we match pixel one to one like this, then you get two mismatches--</text><text start="29" dur="2">one over here and one over here.</text><text start="31" dur="2">The cost of those in total is 40.</text><text start="33" dur="4">That is still smaller than the 200 occlusion cost we had before.</text><text start="37" dur="2">Therefore the B gets matched to this point over here.</text><text start="39" dur="3">Notice how the different occlusion costs give different results </text><text start="42" dur="5">for the correspondence problem in this dynamic programming question.</text></transcript></video><video title="11 Structure from Motion.mp4" id="RQzimXj2NhY" length="80"><transcript><text start="0" dur="3">This final question is motivated by structure from motion,</text><text start="3" dur="2">and it&amp;#39;s not the full-blown structure from motion problem, </text><text start="5" dur="3">which is hard to do on a piece of paper here.</text><text start="8" dur="4">But it&amp;#39;s a variant of it for which I know the motion but not the structure.</text><text start="12" 
dur="2">Suppose we are given two cameras, </text><text start="14" dur="2">and we happen to know there are three features in the scene.</text><text start="16" dur="3">All three features can be seen by both cameras.</text><text start="19" dur="6">This camera will see a feature on the left, in the center, and on the right--L, C, R.</text><text start="25" dur="2">This camera is camera A.</text><text start="27" dur="3">On camera B, we also see a feature on the left, center, and right,</text><text start="30" dur="5">but I don&amp;#39;t know the identity of those features, which I will call 1, 2, and 3.</text><text start="35" dur="7">Suppose in camera A we see from left to the right the following sequence: 1, 2, 3.</text><text start="42" dur="2">These are the feature numbers.</text><text start="44" dur="5">So in the left camera we notice that the left-most visible feature is feature 1,</text><text start="49" dur="4">the center feature is feature number 2, and the right feature is feature number 3.</text><text start="53" dur="4">I&amp;#39;m going to ask what is the order of those pixels in camera B,</text><text start="57" dur="4">assuming that the features are located as shown over here and so are the cameras.</text><text start="61" dur="5">Please give the index of the features over here that clearly have to be 1, 2, and 3</text><text start="66" dur="2">in some order which you have to determine.</text><text start="68" dur="5">For a different configuration let&amp;#39;s now assume we get to see 1, 2, 3 in camera B,</text><text start="73" dur="4">and we care about the feature indices we see in camera A </text><text start="77" dur="3">that correspond to those features over here.</text></transcript></video><video title="12 Motion Answer.mp4" id="xxTZ0jeoOXk" length="56"><transcript><text start="0" dur="2">Let&amp;#39;s study the first case.</text><text start="2" dur="2">We saw left feature number 1. 
</text><text start="4" dur="5">The left one is over here, therefore this one will be feature number 1.</text><text start="9" dur="3">Feature number 1 will be seen at the center of camera B.</text><text start="12" dur="3">In the center of the left camera, we see feature number 2,</text><text start="15" dur="3">which must be this guy over here, because it&amp;#39;s the center feature to be seen.</text><text start="18" dur="3">This will be projected to the right side of camera B.</text><text start="21" dur="4">Even though it&amp;#39;s left in the image plane, the way I drew it you can see that</text><text start="25" dur="3">the projection over here is on the right side of the camera chip.</text><text start="28" dur="4">The same with feature number 3, which is seen in R over here on the right side.</text><text start="32" dur="4">Hence, it&amp;#39;ll project into the leftmost field over here.</text><text start="36" dur="6">Using a different color now, if camera B sees feature number 1 on its L position,</text><text start="42" dur="5">then this must be feature number 1, which will appear on the right position for camera A.</text><text start="47" dur="3">And feature number 2 is in the center position--this guy over here--</text><text start="50" dur="2">which shows up in the left position over here,</text><text start="52" dur="4">and the remaining fits in over here. 
This is the correct answer.</text></transcript></video></group><group title="Unit 19" count="21"><video title="01 Autonomous Vehicle Intro 1.mp4" id="V_BJUBpuvFE" length="290"><transcript><text start="0" dur="7">One of the things I&amp;#39;ve been working on for most of my professional career is self-driving cars.</text><text start="7" dur="4">The vision is that in the future cars will drive themselves,</text><text start="11" dur="3">and in doing so they can be significantly safer.</text><text start="14" dur="7">We lose a little over 1 million people per year in the entire world in traffic accidents.</text><text start="21" dur="4">I believe most of these accidents can be avoided by making cars safer.</text><text start="25" dur="3">If they drive themselves, they can drive disabled people.</text><text start="28" dur="4">They can drive blind people, young children, aging people,</text><text start="32" dur="4">and they could drive all of us while we do better things than staring at the road ahead.</text><text start="36" dur="4">So one of my life passions has been to develop self-driving cars.</text><text start="40" dur="5">Today, I&amp;#39;d like to tell you about those, and also show you some of the basic techniques</text><text start="45" dur="6">so you can in principle program your own self-driving car.</text><text start="51" dur="10">So for me the work on self-driving cars started in 2004 after the first DARPA Grand Challenge.</text><text start="61" dur="3">This was a government-sponsored robot race</text><text start="64" dur="7">in which autonomous robots were asked to drive through the Mojave desert from California to Nevada</text><text start="71" dur="6">along 141 miles of really punishing desert terrain.</text><text start="77" dur="4">Lots of teams competed from various universities, car companies,</text><text start="81" dur="4">and also lots of hobbyists that were new to the field competed,</text><text start="85" dur="3">and built this huge set of 
different cars.</text><text start="88" dur="4">There were over 100 different entries into the first DARPA Grand Challenge.</text><text start="92" dur="4">Despite all this work, most robots failed out of the starting gate, </text><text start="96" dur="6">like this one over here, which flipped over less than 100 meters into the race.</text><text start="102" dur="2">Some were very, very large. </text><text start="104" dur="6">This is a major defense contractor who built this 35,000 pound vehicle,</text><text start="110" dur="3">which on the course was rather timid,</text><text start="113" dur="7">and some of the teams had very small robots, like the next one by UC Berkeley,</text><text start="120" dur="4">which was a motorcycle.</text><text start="124" dur="7">So here we go.</text><text start="131" dur="5">The first DARPA Grand Challenge came with $1 million of prize money,</text><text start="136" dur="5">and despite this prize money, no team made it further than 5% of the total course.</text><text start="141" dur="5">In fact, almost all cars stopped for something very stupid,</text><text start="146" dur="2">some went up in flames,</text><text start="148" dur="5">and the furthest any team made it was this car over here by Carnegie Mellon University,</text><text start="153" dur="4">which made it just below 8 miles of the total distance.</text><text start="157" dur="4">So for many of us, this was a massive failure of robotic technology,</text><text start="161" dur="5">which motivated me to get involved in this race.</text><text start="166" dur="2">My own story is really simple.</text><text start="168" dur="4">I started a class at Stanford, and I got about 20 students to work with me</text><text start="172" dur="5">on what would become the Stanford racing team that would ultimately go and win this race.</text><text start="177" dur="7">We modified a Volkswagen Touareg to put all kinds of sensors onto the roof</text><text start="184" dur="5"> and actuators into the car 
that could actuate the steering wheel, the gas pedal, and the brake.</text><text start="189" dur="2">The sensors came in multiple versions.</text><text start="191" dur="4">Some were related to localization, such as global positioning sensors,</text><text start="195" dur="4">and some were related to understanding where obstacles are, like laser range finders.</text><text start="199" dur="2">We talked about computer vision in this class.</text><text start="201" dur="6">The actuators were basically a motor on the steering wheel and on the brake pedals and on the gas pedal.</text><text start="207" dur="4">Early on, we tested on Stanford&amp;#39;s campus.</text><text start="211" dur="3">This is the roof of the medical parking garage.</text><text start="214" dur="4">Here you can see my students and I performed simple maneuvers.</text><text start="218" dur="3">Now, I&amp;#39;ve got to tell you that this is usually a busy parking garage.</text><text start="221" dur="3">It&amp;#39;s the medical parking garage at Stanford Hospital,</text><text start="224" dur="4">but as we practiced autonomous driving, people would come and pick up their car </text><text start="228" dur="3">and ask us about what we were doing, so we kept telling them, </text><text start="231" dur="2">well, we&amp;#39;re building a self-driving car.</text><text start="233" dur="4">Within less than a week, people just chose not to park there anymore.</text><text start="241" dur="5">Closer to the next version of the Grand Challenge, the second one in 2005,</text><text start="246" dur="6">we had built a car that could drive competently on most desert tracks </text><text start="252" dur="10">at speeds up to about 60 km per hour through dry river beds, through steep inclines and declines,</text><text start="262" dur="5">and would be able to avoid obstacles like this little shrub on the right side over here.</text><text start="267" dur="3">It was never really elegant, but it was insanely effective.</text><text 
start="275" dur="3">Now, not all testing went smoothly.</text><text start="278" dur="5">This is imagery that the New York Times shot of us when we invited them for a test drive.</text><text start="283" dur="4">During this day, we managed to crash into a tree and get stuck in the mud.</text><text start="287" dur="3">It was pretty embarrassing.</text></transcript></video><video title="02 Autonomous Vehicle Intro 2.mp4" id="kqDvbguZsAA" length="333"><transcript><text start="3" dur="4">Here is imagery of our laser system mapping out the terrain ahead.</text><text start="7" dur="4">We talked a little bit about lasers and range finders in this class.</text><text start="11" dur="5">Here you can see all these systems work together on building 3D maps of the environment</text><text start="16" dur="4">that our car, Stanley, uses to assess the driving situation.</text><text start="24" dur="3">This shows work on machine learning for autonomous driving,</text><text start="27" dur="5">where we used the laser to identify driveable terrain at a short range</text><text start="32" dur="4">and then extrapolate this out into the long range using a machine-learning technique</text><text start="36" dur="3">applied to computer vision.</text><text start="39" dur="4">What you see here is a coloring, which is the output of a machine learning algorithm</text><text start="43" dur="4">that identifies driveable terrain in the desert.</text><text start="47" dur="7">So very briefly to tell you about the race, one with a lot of fame and $2 million.</text><text start="54" dur="7">This race started early in the morning. 
The sun was basically still gone and was just rising.</text><text start="61" dur="6">Our car was able to drive itself followed by a human-driven chase vehicle and did quite well.</text><text start="67" dur="6">It did so well that it actually passed the front-seeded and first-running vehicle by Carnegie Mellon University.</text><text start="73" dur="8">It had to navigate complicated and dangerous mountain trails where destruction lurked on both sides of the car.</text><text start="81" dur="3">On the left there was a cliff. On the right side there was a mountain.</text><text start="84" dur="3">It is here followed by a human-driven chase vehicle.</text><text start="87" dur="3">Our car very carefully ascended this route.</text><text start="90" dur="3">You can see it here close before the finishing line,</text><text start="93" dur="5">and after just about 7 hours it managed to do what no robot had ever done before.</text><text start="98" dur="6">It managed to really finish the DARPA Grand Challenge, do this race, and win Stanford $2 million.</text><text start="104" dur="3">We were insanely proud on this day.</text><text start="109" dur="5">From this we moved on to build Junior, which competed in the DARPA Urban Challenge.</text><text start="117" dur="5">Here you can see Junior&amp;#39;s laser perceiving obstacles and being able to detect those,</text><text start="122" dur="3">using basically range vision.</text><text start="128" dur="2">We will talk today of localization.</text><text start="130" dur="5">Junior was able to localize itself using particle filters</text><text start="135" dur="5">relative to a given map of the environment, which is essential for navigating safely in traffic.</text><text start="143" dur="4">It was able to detect other cars using particle filters</text><text start="147" dur="4">and estimate not just where they are and how fast they are moving but also what size they are, how big they are.</text><text start="154" dur="2">You can see on the left the 
detected cars. </text><text start="156" dur="3">On the right side, you see our camera view of the same situation.</text><text start="162" dur="3">Here again, you can see it detect cars.</text><text start="169" dur="3">Here is what it looked like from an external observation point.</text><text start="172" dur="8">You can see Junior, our vehicle, driving in a fairly busy city street with lots of cars passing.</text><text start="180" dur="3">It has to wait for a gap to take a left turn.</text><text start="183" dur="6">When the gap finally occurs, it confidently takes the turn and drives.</text><text start="189" dur="6">In today&amp;#39;s class I will teach you how to basically program a car just like that.</text><text start="198" dur="5">So this is footage from our Google self-driving car, which you might have heard about.</text><text start="203" dur="6">This car was able to drive at speeds as high as a Prius can go.</text><text start="209" dur="3">It drives seamlessly in traffic.</text><text start="212" dur="4">In fact, we drove over 100,000 miles without anybody noticing </text><text start="216" dur="3">that there were self-driving cars in our experiments.</text><text start="219" dur="4">This is near Stanford University on University Street in Palo Alto.</text><text start="223" dur="5">You can see how the vehicle yields by itself for pedestrians.</text><text start="228" dur="2">Of course, there&amp;#39;s also a human driver on board just for safety,</text><text start="230" dur="5">but this car, you can take my word for it, is really driving itself in traffic.</text><text start="235" dur="3">This is image footage from the car itself as it goes onto a highway.</text><text start="238" dur="3">This is sped up, I should say.</text><text start="241" dur="6">Driving through a toll booth, and driving in Los Angeles.</text><text start="247" dur="5">You can see a lot of palm trees here. 
It&amp;#39;s a beautiful environment to drive in.</text><text start="263" dur="3">Here you can see some of the inner workings, </text><text start="266" dur="2">where you can see a corridor that the vehicle attempts to go.</text><text start="268" dur="4">We can see obstacles being flagged using machine-learning techniques,</text><text start="272" dur="4">range vision, laser radar, and so on.</text><text start="276" dur="4">You can see it is colored by its relation to our car and its nature,</text><text start="280" dur="3">and you can see it drives fairly confidently.</text><text start="283" dur="5">This is an attempt to drive down Lombard Street in San Francisco--the famous crooked street.</text><text start="288" dur="5">It&amp;#39;s very curvy, and while this is sped up it gives you a sense of the complexity</text><text start="293" dur="2"> that is involved in building cars like these.</text><text start="295" dur="5">It&amp;#39;s actually quite amazing how far technology has come in such a short amount of time.</text><text start="300" dur="6">Here is an experiment that my Stanford students did on south parking using machine learning,</text><text start="306" dur="2">reinforcement learning for control,</text><text start="308" dur="5">and you can see how agile and how capable these methods are.</text><text start="315" dur="6">So today I really want to enable you to write software like this based on lots of what we learned before.</text><text start="321" dur="4">We talked a little bit about machine learning, a lot about particle filters,</text><text start="325" dur="4">and some about motion planning, which relates to the planning class</text><text start="329" dur="4"> that Peter taught you quite a while back.</text></transcript></video><video title="03 Robotics Introduction.mp4" id="MDvQBnBoB8w" length="27"><transcript><text start="0" dur="3">Welcome to my class on robotics.</text><text start="3" dur="4">In many ways this is applying AI technology to the problem of 
robotics.</text><text start="7" dur="4">You might remember that a robot agent takes in sensor data from its environment.</text><text start="11" dur="3">Here is the environment and here is the robot agent.</text><text start="14" dur="3">It processes it into controls and actions</text><text start="17" dur="4">that it uses to manipulate its actuators.</text><text start="21" dur="6">Robotics is the science of bridging the gap between sensor data and actions.</text></transcript></video><video title="04 Robotics Question.mp4" id="BdRE0WeaDx0" length="34"><transcript><text start="0" dur="5">Just for the fun of it, let me ask you a quiz that links back to the very first class.</text><text start="5" dur="3">Robotics is partially observable, yes or no?</text><text start="8" dur="7">It&amp;#39;s continuous in its state and action spaces and its measurement, yes or no?</text><text start="15" dur="4">The environment may be stochastic, yes or no,</text><text start="19" dur="3">and it may be adversarial, yes or no?</text><text start="22" dur="4">Specifically, something like the DARPA Grand Challenge might be adversarial or not.</text><text start="26" dur="2">Please choose the answer that best fits.</text><text start="28" dur="2">I understand there might be multiple choices possible here, </text><text start="30" dur="4">but please go back to what fits the best.</text></transcript></video><video title="05 Robotics Answer.mp4" id="AHv-CQAgpgA" length="42"><transcript><text start="0" dur="2">And very clearly robotics is partially observable.</text><text start="2" dur="4">We&amp;#39;ll talk about this today a little bit more when I apply particle filters.</text><text start="6" dur="5">It is continuous--that is all measurements are continuous and all actions tend to be continuous.</text><text start="11" dur="2">The environment is clearly stochastic.</text><text start="13" dur="5"> It&amp;#39;s impossible to predict what&amp;#39;s happening next with absolute certainty.</text><text 
start="18" dur="3">Then you can argue back and forth whether it&amp;#39;s adversarial or not.</text><text start="21" dur="3">In most cases we don&amp;#39;t treat it as adversarial. We don&amp;#39;t think about it.</text><text start="24" dur="5">But some things about robotics are indeed adversarial and to some extent driving is as well.</text><text start="29" dur="4">I want to say no for now, but I&amp;#39;m going to accept both answers over here, </text><text start="33" dur="2">so you can write whatever you want,</text><text start="35" dur="3">because I don&amp;#39;t want us to think about robotics as adversarial.</text><text start="38" dur="4">At least not in the case of driving cars.</text></transcript></video><video title="06 Kinematic Question 1.mp4" id="gfRNGs5wVl8" length="59"><transcript><text start="0" dur="4">One of the fundamental things about robotics is called &amp;quot;perception.&amp;quot;</text><text start="4" dur="3">The story here is that you get sensor measurements, and you&amp;#39;re trying to estimate </text><text start="7" dur="5">an internal state such that the internal state is sufficient to determine what to do next.</text><text start="12" dur="4">It&amp;#39;s usually a recursive method. It&amp;#39;s called a &amp;quot;filter.&amp;quot; We talked about this at length.</text><text start="16" dur="5">I&amp;#39;m going to ask you a few questions about the state itself. 
So here&amp;#39;s a quiz.</text><text start="21" dur="8">Suppose we have a mobile robot that is round and lives on a plane,</text><text start="29" dur="8">and it can turn on the spot, but its location is given by a two-dimensional coordinate.</text><text start="37" dur="2">It might face in a certain direction.</text><text start="39" dur="3">We really care about what&amp;#39;s called the kinematic state.</text><text start="42" dur="5">That is, we care about where it is but not how fast it is actually moving.</text><text start="47" dur="5">So what is the dimensionality of the state space for such a robot?</text><text start="52" dur="2">I do realize we haven&amp;#39;t really talked about this much yet.</text><text start="54" dur="5">I&amp;#39;d like you to take a good guess, and I&amp;#39;ll tell you the answer once you have made your guess.</text></transcript></video><video title="07 Kinematic Answer 1.mp4" id="GVIGD7GSnTQ" length="39"><transcript><text start="0" dur="5">The answer for this specific example over here is 3,</text><text start="5" dur="2">although you could argue it could be something else, </text><text start="7" dur="2">but 3 is a convenient and common answer,</text><text start="9" dur="7">which is this robot&amp;#39;s state is determined by its xy location and by its heading direction,</text><text start="16" dur="2">which is often called theta.</text><text start="18" dur="4">Now, you could argue heading doesn&amp;#39;t matter because it can turn itself on the spot,</text><text start="22" dur="2">and in some examples that is actually correct.</text><text start="24" dur="3">You might be able to get away with a two-dimensional state,</text><text start="27" dur="3">but if you&amp;#39;re going to predict what happens next when you, for example, drive forward,</text><text start="30" dur="2">the heading matters greatly.</text><text start="32" dur="5">In that sense, it&amp;#39;s actually a three-dimensional state, so 3 is the best 
answer.</text><text start="37" dur="2">Let me ask you a few more questions about dimensionality of state spaces.</text></transcript></video><video title="08 Kinematic Question 2.mp4" id="BPvJIOIK62Y" length="15"><transcript><text start="0" dur="3">Here is a car,</text><text start="3" dur="2">and again we worry about the kinematic state.</text><text start="5" dur="6">That is, where in the world is this robot irrespective of its current velocity?</text><text start="11" dur="4">What do you think the right answer is for dimensionality?</text></transcript></video><video title="09 Kinematic Answer 2.mp4" id="w7ESAJ3GVxU" length="26"><transcript><text start="0" dur="4">The correct answer is still 3, although you can argue it is more,</text><text start="4" dur="3">because maybe the position of its steering wheel matters,</text><text start="7" dur="3">but to a first approximation it is the same as before.</text><text start="10" dur="2">There might be a center point for the car.</text><text start="12" dur="3"> It has a certain location in the global coordinate system,</text><text start="15" dur="3">and it again has a heading direction in which it can go.</text><text start="18" dur="2">So 3 is a convenient answer, </text><text start="20" dur="6">although technically the steering wheel angle might also be influential for what&amp;#39;s happening next.</text></transcript></video><video title="10 Dynamic Question.mp4" id="UZ-_m2jljbM" length="13"><transcript><text start="0" dur="3">Now let&amp;#39;s talk about the dynamic state.</text><text start="3" dur="3">The dynamic state includes the velocities of the vehicle itself.</text><text start="6" dur="3">I&amp;#39;d like to understand what&amp;#39;s a good number of dimensions </text><text start="9" dur="4">to encode the dynamic state of this car.</text></transcript></video><video title="11 Dynamic Answer.mp4" id="vZPHbsDK_Xc" length="46"><transcript><text start="0" dur="3">The common answer here is 5,</text><text start="3" dur="3">although 
I realize many, many different answers are possible,</text><text start="6" dur="3">and there is clearly no single answer that is really correct.</text><text start="9" dur="2">But 5 is my favorite.</text><text start="11" dur="4">It&amp;#39;s the 3 kinematic state variables, and in addition we care about velocities,</text><text start="15" dur="3">so there is the forward velocity of the vehicle itself</text><text start="18" dur="5">and the faster the vehicle moves, the further it is going to advance in the next timestep.</text><text start="23" dur="3">There is also something called a yaw rate.</text><text start="26" dur="4">Yaw is one way to name the heading of the car,</text><text start="30" dur="3">the orientation of the car, or the bearing as some people call it.</text><text start="33" dur="2">The rate is the change over time.</text><text start="35" dur="3">This car will not just move forward. It will also turn.</text><text start="38" dur="4">That turn has a velocity, and the velocity is often called &amp;quot;yaw rate.&amp;quot;</text><text start="42" dur="4">We&amp;#39;re going to talk about this in a few minutes.</text></transcript></video><video title="12 Helicopter Question 1.mp4" id="3CKcmn-4KhM" length="20"><transcript><text start="0" dur="5">Before we do this, let me look into the state of a helicopter.</text><text start="5" dur="2">Here&amp;#39;s my best depiction of a helicopter.</text><text start="7" dur="4">This helicopter can fly anywhere in the xyz space,</text><text start="11" dur="4">and it can also point in any possible direction.</text><text start="15" dur="5">What&amp;#39;s the dimensionality of the kinematic state for such a vehicle?</text></transcript></video><video title="13 Helicopter Answer 1.mp4" id="PoQRUbE4d90" length="35"><transcript><text start="0" dur="3">The answer is now 6.</text><text start="3" dur="6">You can really see that this helicopter can assume any location in xyz space,</text><text start="9" dur="2">but it also has 3 rotation degrees of 
freedom.</text><text start="11" dur="3">It has a yaw, which is its bearing. </text><text start="14" dur="3">It can tilt forward and backward.</text><text start="17" dur="3">And it can roll left and right.</text><text start="20" dur="3">If we look from above, here is the yaw. </text><text start="23" dur="2">This is the tilt.</text><text start="25" dur="4">If you look from the front, you find there is also a roll variable</text><text start="29" dur="3">where the vehicle can turn around its own axis.</text><text start="32" dur="3">So the total answer is 6.</text></transcript></video><video title="14 Helicopter Question 2.mp4" id="OMpTeuRaJkg" length="7"><transcript><text start="0" dur="3">Here is my most difficult question for the helicopter.</text><text start="3" dur="4">What&amp;#39;s the dynamic state? What&amp;#39;s the right dimensionality here?</text></transcript></video><video title="15 Helicopter Answer 2.mp4" id="bdzA6uViwas" length="34"><transcript><text start="0" dur="2">The answer is commonly 12,</text><text start="2" dur="3">which is simply we can have 6 state variables,</text><text start="5" dur="4">and in each of those the helicopter might have its own velocity.</text><text start="9" dur="3">So for each of these variables, we have the state variable itself</text><text start="12" dur="3">and its velocity, which makes a total of 12.</text><text start="15" dur="4">This is completely nontrivial and something you probably can&amp;#39;t know.</text><text start="19" dur="4">We just have to learn, but when you think about it, you realize this is the most general case</text><text start="23" dur="3">of a vehicle that can move in 3D at every possible location,</text><text start="26" dur="3">every possible orientation at any velocity.</text><text start="29" dur="5">That&amp;#39;s called the dynamic state of a free-flying object.</text></transcript></video><video title="16 Localization.mp4" id="1QlvfUGMIyQ" length="47"><transcript><text start="1" 
dur="6">Let&amp;#39;s talk about localization of a car like our DARPA Urban Challenge car Junior.</text><text start="7" dur="4">This car uses a map of the environment--</text><text start="11" dur="2">it knows in advance where the lane markers are--</text><text start="13" dur="5">and uses probabilistic localization to keep track of where it is.</text><text start="18" dur="4">The reason for that is it could use GPS, the global positioning system,</text><text start="22" dur="5">but that has enormous errors, sometimes on the order of 5 or more meters,</text><text start="27" dur="3">which is very unsafe for driving.</text><text start="30" dur="3">By localizing utilizing particle filters or histogram filters,</text><text start="33" dur="4">our car can do the same with about 10 cm error,</text><text start="37" dur="3">which means it can really understand where to stay in the lane</text><text start="40" dur="4">just by knowing where the lane is in advance and using localization techniques</text><text start="44" dur="3">like the ones we&amp;#39;ll discuss right now.</text></transcript></video><video title="17 Monte Carlo Localization.mp4" id="PjXylEmMSiA" length="180"><transcript><text start="0" dur="3">Let&amp;#39;s talk about particle filters for localization</text><text start="3" dur="3">that is commonly called Monte Carlo localization.</text><text start="6" dur="6">We learned in the particle filter lesson that the state is retained by a set of particles.</text><text start="12" dur="3">Each particle is a three-dimensional vector here,</text><text start="15" dur="3">comprising x, y, and the heading direction theta,</text><text start="18" dur="4">as indicated by these little arrows that I&amp;#39;m going to just draw here.</text><text start="22" dur="6">A set of particles like these would be a representation for the distribution at any point in time.</text><text start="28" dur="3">Now let me look at the 2 main steps in particle filters.</text><text start="31" dur="4">One is the 
prediction step, and one is the measurement step. Let&amp;#39;s start with prediction.</text><text start="35" dur="5">Just to make things simpler, let&amp;#39;s assume our vehicle has only 2 wheels.</text><text start="40" dur="7">It&amp;#39;s called a differential-drive robot, and it can navigate by moving both wheels forward,</text><text start="47" dur="4">but if 1 wheel moves faster than the other one, it&amp;#39;ll turn.</text><text start="51" dur="5">Let&amp;#39;s understand how to apply a particle filter to a robot of that simplicity.</text><text start="56" dur="5">This is simpler than a car, but not much simpler. It&amp;#39;s about the same complexity.</text><text start="61" dur="6">As I said, the state of this vehicle is given by the following 3 values: x, y, and &#x3B8;.</text><text start="67" dur="4">And to predict the outcome of an action, we need to write a function </text><text start="71" dur="8">that predicts those values based on values &#x394;t over here where &#x394;t might be a 10th of a second.</text><text start="79" dur="4">Now the math for this in first approximation is very simple.</text><text start="83" dur="5">It turns out this approximation is good enough to do pretty much anything in robotics </text><text start="88" dur="2">even though it is not very accurate.</text><text start="90" dur="5">Let&amp;#39;s assume the robot just keeps moving forward at a fixed velocity v.</text><text start="95" dur="9">Then the new x is given by the old x plus the progress it makes along axis x with velocity v.</text><text start="104" dur="5">So you get v times &#x394;t, which is the total distance traversed,</text><text start="109" dur="4">but the x portion of it is cos &#x3B8;.</text><text start="113" dur="5">Similarly, for the y coordinates, you get the old y plus the distance traversed--</text><text start="118" dur="5">velocity times &#x394;t times sin &#x3B8;.</text><text start="123" dur="4">This is a robot that doesn&amp;#39;t really change heading 
directions,</text><text start="127" dur="4">and it&amp;#39;ll be sufficient for very small &#x394;t to assume that robot doesn&amp;#39;t change heading directions.</text><text start="131" dur="4">These are actually good equations even if the robot is turning.</text><text start="135" dur="2">However, to understand the change of heading direction, </text><text start="137" dur="3">we also have to assume that there is an angular velocity, </text><text start="140" dur="4">and we call this &#x3C9; [omega], which is a Greek letter.</text><text start="144" dur="7">So the new heading direction is the old one plus &#x3C9; times &#x394;t.</text><text start="151" dur="5">These are really nice equations to model relatively complex smaller robots.</text><text start="156" dur="3">They&amp;#39;re really simple geometry. If you understand cosine and sine,</text><text start="159" dur="7">you realize this is basically a robot that moves on a fixed straight trajectory.</text><text start="166" dur="9">For time &#x394;t it then applies the rotation and it moves again for fixed time &#x394;t on a straight trajectory,</text><text start="175" dur="5">which is an approximation to the actual curve the robot might be taking.</text></transcript></video><video title="18 Localization Question 1.mp4" id="45KdjlEbdb4" length="30"><transcript><text start="0" dur="4">Let&amp;#39;s exercise these equations over here using the following example.</text><text start="4" dur="6">Suppose in the beginning x equals 24, y is 18,</text><text start="10" dur="5"> and the orientation for now is going to be zero, just to make it simple,</text><text start="15" dur="5">and suppose &#x394;t is 1 second, our velocity is 5 units per second,</text><text start="20" dur="3">and our rotation velocity is &#x3C0; over 8 seconds.</text><text start="23" dur="7">Can you use this formula to calculate x&amp;#39;, y&amp;#39; and &#x3B8;&amp;#39; after &#x394;t?</text></transcript></video><video title="19 Localization Answer 1.mp4" 
id="6zaDlrN7Cus" length="46"><transcript><text start="0" dur="4">This is a robot that points in the x direction, because &#x3B8; equals zero.</text><text start="4" dur="4">We have a coordinate system over here where this is 24.</text><text start="8" dur="4">That&amp;#39;s 18 in the y direction, and it moves forward for a while.</text><text start="12" dur="5">In fact we have 1 second, and it moves at 5 units per second, so this is 5.</text><text start="17" dur="5">Then it turns its heading into this direction over here, and this is &#x3C0;/8.</text><text start="22" dur="2">Again, the real robot would take a curve,</text><text start="24" dur="3"> but in our approximation we assume it goes in a straight line </text><text start="27" dur="3">and then finally does a very discrete turn, which is an approximation,</text><text start="30" dur="2">but that&amp;#39;s the question I have asked you here.</text><text start="32" dur="4">When you plug these values in, you&amp;#39;ll find that the x extends by 5,</text><text start="36" dur="3">which is 29. Y doesn&amp;#39;t change.</text><text start="39" dur="7">The final turn is &#x3C0;/8, which is about 0.3927.</text></transcript></video><video title="20 Localization Question 2.mp4" id="3i77T2Av1QE" length="28"><transcript><text start="0" dur="2">Let me ask you a similar quiz,</text><text start="2" dur="7">but this time let&amp;#39;s say that x equals zero, y equals 10, and our heading direction is &#x3C0;/4,</text><text start="9" dur="2">which is the same as 45 degrees.</text><text start="11" dur="5">Again &#x394;t equals 1 second, and we move at 5 units per second.</text><text start="16" dur="3">Then we turn by -&#x3C0;/4 per second.</text><text start="19" dur="5">Don&amp;#39;t worry about the units, seconds. 
It&amp;#39;s just here to make it mathematically consistent.</text><text start="24" dur="4">So please plug your best estimates into the boxes over here.</text></transcript></video><video title="21 Localization Answer 2.mp4" id="UlPfYHG_Yn8" length="63"><transcript><text start="0" dur="6">This robot is located at 0, 10 and it points at 45 degrees.</text><text start="6" dur="3">In doing so, it moves some right and some up.</text><text start="9" dur="6">In fact the same ratio in the x dimension as it will be in the y dimension.</text><text start="15" dur="12">Now cosine of this &#x3B8; here, is about 0.7071 multiplied by 5 added to 0 and you get 3.5355.</text><text start="27" dur="3">It so turns out that this is also the value of sin &#x3B8;,</text><text start="30" dur="8">so you can add the same value over here to the y value of 10, which gives us 13.5355.</text><text start="38" dur="7">Finally, we find that within 1 second the initial heading is canceled out by the change of heading,</text><text start="45" dur="4">so we&amp;#39;ll be facing in this direction over here, and that angle is just zero.</text><text start="49" dur="7">If you got those right, this was a somewhat tedious exercise on simple geometry,</text><text start="56" dur="7">but this is the kind of math you need to implement in particle filter equations like those over here in robotics.</text></transcript></video></group><group title="Unit 20" count="15"><video title="01 Prediction.mp4" id="t7RsueYqPgw" length="86"><transcript><text start="0" dur="3">Let&amp;#39;s get back to Monte Carlo localization.</text><text start="3" dur="3">Let&amp;#39;s look at a single particle that sits over here.</text><text start="6" dur="3">There&amp;#39;s an x, y, and a &#x3B8;.</text><text start="9" dur="4">Let&amp;#39;s assume we happen to know that the robot is moving at velocity v</text><text start="13" dur="3">and at angular velocity &#x3C9;, which is the differential of its wheels.</text><text start="16" dur="7">And 
after it moves so far, it will end up exactly over there with a heading pointing in this direction.</text><text start="23" dur="4">Now that you worked the math, you know exactly how to implement this.</text><text start="27" dur="5">In Monte Carlo localization, we don&amp;#39;t predict the exact outcome. We add noise.</text><text start="32" dur="4">We add noise to velocity v and to the heading direction &#x3C9;.</text><text start="36" dur="4">As we do so, we might find ourselves with lots of particles, </text><text start="40" dur="5">all of which have a slightly different xy coordinate and a slightly different heading outcome.</text><text start="45" dur="5">These particles together comprise our estimation after the motion command over here.</text><text start="50" dur="3">So, a single particle over here, if drawn multiple times, </text><text start="53" dur="3">gives a set of particles like the ones over here.</text><text start="56" dur="2">They&amp;#39;re kind of hard to see at this point,</text><text start="58" dur="5">but you can imagine by varying v and &#x3C9; with a little bit of noise that we</text><text start="63" dur="2">add or subtract from these values,</text><text start="65" dur="3">we will get slightly different predictions where the robot might be</text><text start="68" dur="3">and as a result get a particle cloud like this one over here.</text><text start="71" dur="2">That&amp;#39;s really important.</text><text start="73" dur="5">We just implemented the prediction step of a particle filter in a real robotics example.</text><text start="78" dur="8">This is exactly what&amp;#39;s happening when we drive our Google self-driving car and our Stanford car.</text></transcript></video><video title="02 Measurement Question.mp4" id="vi5SY_6T4Co" length="148"><transcript><text start="0" dur="6">Now we have a set of predictions that might arise from a single particle,</text><text start="6" dur="5">and the other important step in particle filtering is the 
measurement step.</text><text start="11" dur="3">We need to understand at what rate will these particles survive,</text><text start="14" dur="4">and that&amp;#39;s usually done in proportion to the measurement probabilities.</text><text start="18" dur="2">Let&amp;#39;s talk about measurements. </text><text start="20" dur="5">For the sake of this exercise, let&amp;#39;s assume we only have 2 measurements.</text><text start="25" dur="4">We would either see something bright or something dark.</text><text start="29" dur="3">It gives a certain response depending on whether it&amp;#39;s on a lane marking.</text><text start="32" dur="5">Just for simplicity, let&amp;#39;s assume we have certain locations that have lane markings,</text><text start="37" dur="4">like this one over here and these over there.</text><text start="41" dur="4">If a robot center is aligned with a lane marking, it should see a bright spot</text><text start="45" dur="3">because lane markings tend to be bright.</text><text start="48" dur="3">But if it&amp;#39;s off the lane marking on the regular road, it should see a dark spot.</text><text start="51" dur="5">Let&amp;#39;s turn this into a probability that&amp;#39;s called the measurement probability.</text><text start="56" dur="8">The probability of seeing something bright is going to be large when it&amp;#39;s on a lane marker, say 0.8.</text><text start="64" dur="6">From that we can deduce that the probability of seeing something dark on a lane marker is 0.2.</text><text start="70" dur="7">The probability of seeing something dark when off a lane marker is even higher at 0.9,</text><text start="77" dur="3">and from that it follows that the probability of seeing something bright </text><text start="80" dur="7">on the regular road with regular pavement is going to be 1 minus 0.9 equals 0.1.</text><text start="87" dur="5">Here&amp;#39;s my quiz for you. 
This is an entirely nontrivial quiz.</text><text start="92" dur="4">If you get this right, you understand particle filters.</text><text start="96" dur="3">Suppose we measure bright.</text><text start="99" dur="4">The actual sensor told us it saw something bright underneath the robot.</text><text start="103" dur="6">I&amp;#39;d like to know what is the importance weight of the particle over here,</text><text start="109" dur="6">which we&amp;#39;re going to call x1, and the particle over here, which I&amp;#39;ll call x2.</text><text start="115" dur="8">Tell me what&amp;#39;s the weight w of x1 after I apply the measurement probability </text><text start="123" dur="4">and the normalization that&amp;#39;s common in particle filters.</text><text start="127" dur="6">Please do the same for the particle x2 where x1 happens to be on the lane marker,</text><text start="133" dur="3">and x2 happens to be off a lane marker.</text><text start="136" dur="2">So please put in these two numbers over here.</text><text start="138" dur="4">It&amp;#39;ll take a while to calculate those, so please take the time.</text><text start="142" dur="6">I assure you if you get those right, you really understand particle filters.</text></transcript></video><video title="03 Measurement Answer.mp4" id="hkFV_q2XEOw" length="136"><transcript><text start="0" dur="3">As promised, the answer is nontrivial.</text><text start="3" dur="11">The importance weight for x1 will be 8/27, which is the same as 0.2963,</text><text start="14" dur="7">and the one for x2 will be 1/27 or 0.037.</text><text start="21" dur="2">How did we get there?</text><text start="23" dur="4">Let&amp;#39;s look at the non-normalized importance weights before normalization.</text><text start="27" dur="4">The guys on the lane markings will all get a 0.8.</text><text start="31" dur="7">The guys off the lane markings will get a 0.1. 
So the three guys over here.</text><text start="38" dur="4">The reason is the probability of seeing bright, which is what we saw,</text><text start="42" dur="4">off a lane marker is 1 minus 0.9. That&amp;#39;s 0.1.</text><text start="46" dur="5">Now we have 3 guys that are on the lane markings and 3 off the lane markings.</text><text start="51" dur="6">The total weight over here is 2.4, and the total weight over here is 0.3.</text><text start="57" dur="7">Our total weight for all particles, not normalized particles, will be 2.7 or 27 tenths.</text><text start="64" dur="5">We have to really normalize the weight by dividing by 2.7.</text><text start="69" dur="7">0.8 divided by 2.7 is 8/27 or this number over here.</text><text start="76" dur="5">0.1 divided by 2.7 is 1/27, which is this value over here.</text><text start="81" dur="2">That&amp;#39;s how we get to these final weights.</text><text start="83" dur="5">If you got this, you understand that the measurement probability affects </text><text start="88" dur="3">the weight before normalization, which is multiplying in the measurement probability,</text><text start="91" dur="2">and you did this correctly.</text><text start="93" dur="4">Afterwards we have to normalize to make sure all the weights add up to 1.</text><text start="97" dur="3">So we divide by the total weight of all particles,</text><text start="100" dur="2">and we get out those probabilities over here.</text><text start="102" dur="5">Put differently, this particle x1 that sits on a lane marker</text><text start="107" dur="6">is being regenerated in the resampling phase with a probability of 0.2963.</text><text start="113" dur="5">The same is true for the 2 other particles that sit on lane markers.</text><text start="118" dur="4">For the 3 particles that are off lane markers like the one x2 over here,</text><text start="122" dur="6">the resampling probability is as small as 0.037.</text><text start="128" dur="3">That&amp;#39;s true for x2, but 
it&amp;#39;s true for all 3 particles.</text><text start="131" dur="5">In total, these probabilities add up exactly to 1.</text></transcript></video><video title="04 Resampling Question.mp4" id="UdJwHv_BYYw" length="108"><transcript><text start="0" dur="8">Let&amp;#39;s now apply the resampling where the on-lane-marker particles are being resampled with probability 0.2963,</text><text start="8" dur="3">and the ones in the middle with probability 0.037.</text><text start="11" dur="3">Let&amp;#39;s draw with replacement 6 new particles.</text><text start="14" dur="4">A typical outcome will be we draw this one over here twice,</text><text start="18" dur="2">this one down here twice, and this one over here once.</text><text start="20" dur="2">Perhaps we draw once over here.</text><text start="22" dur="4">Clearly we draw the particles that sit on the lane markings much more frequently </text><text start="26" dur="5">than the ones that sit off the lane markings for a total of 6 new particles.</text><text start="31" dur="6">We now apply our resampling method whereby we draw twice from this particle over here.</text><text start="37" dur="6">That might give us something over there and over here, given that we add a small amount of noise.</text><text start="43" dur="3">The guy over here will be resampled to something over there. </text><text start="46" dur="3">Same with this guy over here, and this guy might find itself over here.</text><text start="49" dur="4">The set over here of 6 particles in total, will now be the new posterior.</text><text start="53" dur="5">As you can see, this posterior is more consistent with the lane marking observation</text><text start="58" dur="3">than the one of not being on a lane marking by virtue of the fact that</text><text start="61" dur="2">we saw a bright measurement before.</text><text start="63" dur="5">Now we just repeat. We look at the next measurement. We weight particles accordingly.</text><text start="68" dur="3">We resample. 
We do forward prediction.</text><text start="71" dur="3">That&amp;#39;s the basic particle filter algorithm.</text><text start="74" dur="3">Look at measurement, compute weights, sample, and predict</text><text start="77" dur="3">where the prediction has a certain amount of randomness.</text><text start="80" dur="4">If you get that loop implemented, you&amp;#39;ve implemented an amazing algorithm</text><text start="84" dur="5">that&amp;#39;s exactly what has driven many of my robots in the ability to localize themselves.</text><text start="89" dur="4">Obviously they have more than just 1 pixel sensor that measures bright and dark.</text><text start="93" dur="5">They might take an entire road image and use the road image to compute the measurement probability,</text><text start="98" dur="4">but the basic mechanics is exactly the same as shown over here.</text><text start="102" dur="6">So let me ask you, did you actually understand this? Yes or no?</text></transcript></video><video title="05 Resampling Answer.mp4" id="9EnMIYw7vsw" length="19"><transcript><text start="0" dur="2">I just hope you answered &amp;quot;yes.&amp;quot;</text><text start="2" dur="4">If you answered &amp;quot;no,&amp;quot; please go through the same sequence again,</text><text start="6" dur="3">because the steps end up to be relatively straightforward</text><text start="9" dur="3">even though they&amp;#39;re somewhat uncommon.</text><text start="12" dur="3">But if you understand it, you can now go and implement particle filters</text><text start="15" dur="4">for the great range of robotics applications.</text></transcript></video><video title="06 Planning Question.mp4" id="eBc3mfp5UFQ" length="56"><transcript><text start="0" dur="3">Let&amp;#39;s talk a bit about planning.</text><text start="3" dur="4">One of the key problems is that these robots have to decide what to do next.</text><text start="7" dur="5">I&amp;#39;ll address the planning problem at multiple levels of abstraction.</text><text 
start="12" dur="5">The easiest happens to be to look at a city level of abstraction.</text><text start="17" dur="3">Suppose we have a road like this over here,</text><text start="20" dur="3">and my car is located down here in the beginning.</text><text start="23" dur="2">I wish to get to this point up here.</text><text start="25" dur="4">Let me draw in an abstraction of the state space,</text><text start="29" dur="5"> and we just draw the states as shown with these red lines over here.</text><text start="34" dur="4">So you see a maze with lots of discrete states.</text><text start="38" dur="4">I&amp;#39;m going to ask you given that traversing from red cell to red cell costs you 1,</text><text start="42" dur="5">or -1 using the definition of the MDP lecture before.</text><text start="47" dur="3">Suppose the goal state has a value of 100.</text><text start="50" dur="6">What&amp;#39;s the value of the start state assuming deterministic actions?</text></transcript></video><video title="07 Planning Answer.mp4" id="J18If45qQJg" length="15"><transcript><text start="0" dur="3">You probably got it right. 
It&amp;#39;s 86.</text><text start="3" dur="5">The reason is the value of the goal is 100, 99, 98, 97.</text><text start="8" dur="3">It turns out the start state is 14 steps away from the goal,</text><text start="11" dur="4">so 100 minus 14 is 86.</text></transcript></video><video title="08 Road Graph.mp4" id="8CKxhuVQrus" length="67"><transcript><text start="0" dur="7">The exact same algorithm works beautifully for planning the shortest path</text><text start="7" dur="4"> to a single mission goal from any possible start location,</text><text start="11" dur="6">and the only difference here is in this graph over here of an actual road graph</text><text start="17" dur="4">we are also incorporating the heading direction as measure of distance.</text><text start="21" dur="6">Green corresponds to nearby to large values, red to far away.</text><text start="27" dur="4">The reason why the area below the mission goal is green is because we expect </text><text start="31" dur="4">the car to point up, to point north, at the time it reached the mission.</text><text start="35" dur="3">So if it came from the north, it would point in the wrong direction.</text><text start="38" dur="3">The state space is augmented correspondingly.</text><text start="41" dur="4">Where if it comes from over here, it points in the correct direction.</text><text start="45" dur="2">If you look at the circle over here, it&amp;#39;s interesting.</text><text start="47" dur="3">If we came from the left side over here, it could do a right turn,</text><text start="50" dur="5">but over here it&amp;#39;s forced on this one-way circle to do the entire loop to go around,</text><text start="55" dur="3">and that increases the value as it comes over here.</text><text start="58" dur="5">This is value iteration applied to the road graph where we keep track of heading</text><text start="63" dur="4">and where the circle over here is a one-way circle.</text></transcript></video><video title="09 Cost Question.mp4" 
id="SQWfMJoYIf0" length="89"><transcript><text start="0" dur="3">Let&amp;#39;s look at dynamic programming again.</text><text start="3" dur="3">Specifically let&amp;#39;s look at an environment that has a loop.</text><text start="6" dur="2">Here&amp;#39;s the environment.</text><text start="8" dur="3">The environment possesses 14 states,</text><text start="11" dur="3">and here is the loop as indicated, and there is a big intersection in the middle over here.</text><text start="14" dur="6">Let&amp;#39;s assume this is our start state in the very south, and we&amp;#39;re facing north.</text><text start="20" dur="5">We wish to reach the goal state in the west, and here we will be facing west.</text><text start="25" dur="2">Obviously, there are two ways to get to the goal.</text><text start="27" dur="3">We can go north 3 steps and then turn left to the goal,</text><text start="30" dur="4">or we can take the entire loop over here, avoid left turns,</text><text start="34" dur="3">but eventually find ourselves at the goal as well after more steps.</text><text start="37" dur="2">Let&amp;#39;s assume there are different costs attached.</text><text start="39" dur="4">The cost of motion is -1 per red cell.</text><text start="43" dur="4">The cost of right turns is -2.</text><text start="47" dur="2">Why would we penalize right turns?</text><text start="49" dur="3">Well, as you turn right, you might have to yield for pedestrians.</text><text start="52" dur="2">That might cost you some time.</text><text start="54" dur="5">Let&amp;#39;s assume an expectation that -2 accounts for the time relative to the motion of -1.</text><text start="59" dur="2">What I&amp;#39;d like you to know is a tricky question.</text><text start="61" dur="3">What is the max cost of a left turn?</text><text start="64" dur="6">If we avoid left turns altogether, we&amp;#39;d much rather take the loop over here.</text><text start="70" dur="6">I want the solution where you turn left to be distinctly more expensive, 
or more negative,</text><text start="76" dur="2">than the solution where you turn right.</text><text start="78" dur="3">When I say &amp;quot;max&amp;quot;, we&amp;#39;re dealing with negative numbers.</text><text start="81" dur="4">So if you were to look at positive numbers, what&amp;#39;s the minimum cost of a left turn?</text><text start="85" dur="4">But I want you to enter the negative number over here.</text></transcript></video><video title="10 Cost Answer.mp4" id="M2WXqZdGf-0" length="49"><transcript><text start="0" dur="3">[Thrun] And the answer is -15.</text><text start="3" dur="5">If you look at the pure motion cost for the short route, there are 6 steps.</text><text start="8" dur="2">So we wanted to turn left.</text><text start="10" dur="2">You pay the penalty of -6.</text><text start="12" dur="7">The longer route is 14 steps, and we add a penalty of -6 for 3 right turns.</text><text start="19" dur="4">Each is -2. That gives us -20.</text><text start="23" dur="6">If we penalize left turns with -15, then the total will be -21,</text><text start="29" dur="7">which is smaller or higher cost, so to speak, than the alternative route</text><text start="36" dur="2">that we wish to favor.</text><text start="38" dur="6">For anything larger than -15, we either have a tie or we just go the shortcut,</text><text start="44" dur="2">so that&amp;#39;s the correct number over here.</text><text start="46" dur="3">It was a really nontrivial question. 
I&amp;#39;m really proud if you got this right.</text></transcript></video><video title="11 Dynamic Programming 1.mp4" id="xm9_vmIJeMk" length="164"><transcript><text start="0" dur="3">[Thrun] So let me give you some examples of this method in action.</text><text start="3" dur="5">Here we have an actual planning technique that uses dynamic programming</text><text start="8" dur="2">and understands how far away things are.</text><text start="10" dur="5">And on top of it, it also considers local rollouts to avoid local obstacles.</text><text start="15" dur="3">These local rollouts are continuous trajectories.</text><text start="18" dur="2">They vary by discrete decisions,</text><text start="20" dur="6">like whether to change the lane, and by various small discrete nudges around obstacles</text><text start="26" dur="3">so we can avoid obstacles.</text><text start="29" dur="3">And in rolling out to a certain horizon, like up to here,</text><text start="32" dur="2">and then connecting to the dynamic programming value,</text><text start="34" dur="7">we can calculate in actual traffic situations what the cost of following a certain path is.</text><text start="41" dur="2">Here is an attempt to turn right.</text><text start="43" dur="3">You can see the vehicle approaching a stop sign.</text><text start="46" dur="2">There is an entire maze of streets.</text><text start="48" dur="6">The best way to go right and then into the left lane is to take the right turn</text><text start="54" dur="4">and then initiate a lane shift, which is happening right now,</text><text start="58" dur="4">to reach a target location that is indicated by this big orange circle</text><text start="62" dur="2">that it&amp;#39;s crossing right now.</text><text start="64" dur="5">If we increase the cost of a lane shift to a much larger value,</text><text start="69" dur="4">the answer that emerges is really interesting.</text><text start="73" dur="3">It doesn&amp;#39;t turn right because of the cost of 
the subsequent lane shift.</text><text start="76" dur="3">Instead this vehicle goes straight,</text><text start="79" dur="6">takes a left turn, which happens to be much cheaper than the lane shift.</text><text start="87" dur="5">It then follows this left turn, takes another left turn,</text><text start="98" dur="6">and eventually takes a third left turn just to get to the left lane.</text><text start="104" dur="5">And if you look very carefully, it can now reach the goal location</text><text start="109" dur="8">without a lane change maneuver which would have much higher cost.</text><text start="117" dur="4">So here it is now in the left lane, and without a lane-changing maneuver</text><text start="121" dur="2">it manages to reach the goal.</text><text start="123" dur="4">This illustrates that dynamic programming in the context of controlling actual cars</text><text start="127" dur="3">has a big role to play.</text><text start="130" dur="3">Here is the same idea applied to a passing maneuver in normal driving.</text><text start="133" dur="3">You see our vehicle following another vehicle. 
</text><text start="136" dur="3">This is actual data in preparation for the Urban Challenge.</text><text start="139" dur="4">Now we placed an abandoned vehicle in the left lane,</text><text start="143" dur="4">and you can see how trajectories are being made that incorporate dynamic obstacles</text><text start="147" dur="4">by virtue of those rollouts and a dynamic programming function,</text><text start="151" dur="5">shown by the background from green to red, to find the optimal actions.</text><text start="156" dur="8">This method has really enabled us to navigate complicated situations with self-driving cars.</text></transcript></video><video title="12 Dynamic Programming 2.mp4" id="ch59X0DMEXY" length="132"><transcript><text start="0" dur="4">[Thrun] The last example I want to talk about in this lecture</text><text start="4" dur="2">is related to general purpose path planning </text><text start="6" dur="3">where we don&amp;#39;t have a road network.</text><text start="9" dur="2">Here is an example of where this occurs.</text><text start="11" dur="4">This is an example of where a blockage occurs.</text><text start="15" dur="5">None of the preplanned paths are drivable by our robot,</text><text start="20" dur="3">so it has to, after a certain time out here--20 seconds--</text><text start="23" dur="4">find itself a path anywhere in the environment.</text><text start="27" dur="4">In fact, our Urban Challenge car did just this.</text><text start="31" dur="4">We don&amp;#39;t do this today in traffic. 
It&amp;#39;s a little bit dangerous.</text><text start="35" dur="3">But for the Urban Challenge it was perfectly doable, and it was safe.</text><text start="38" dur="7">So this car found a route that was outside any pre-given corridor.</text><text start="45" dur="2">Here is an even more challenging example</text><text start="47" dur="5">where our robot Junior approaches a complete road blockage,</text><text start="52" dur="4">but its target location is behind the road blockage.</text><text start="56" dur="3">You can see that none of the paths can possibly make it there</text><text start="59" dur="4">and the only correct action is to turn around and pick a different road</text><text start="63" dur="4">to finally approach the goal location from the opposite side.</text><text start="67" dur="5">You can see an attempted lane shift to the opposite lane doesn&amp;#39;t function either,</text><text start="72" dur="3">and there are timeouts associated with all of those.</text><text start="75" dur="3">Eventually, it uses a general purpose planner</text><text start="78" dur="3">of the type that Peter talked about in his class</text><text start="81" dur="8">to find what ends up being a really complicated multi-turn turnaround</text><text start="89" dur="4">where the car backs into a driveway a little bit, as you can see over here,</text><text start="93" dur="5">and it is all planned completely dynamically without any preconception</text><text start="98" dur="3">of what such a multi-point U-turn would look like.</text><text start="101" dur="5">Then it goes forward, then it goes backward, and does so multiple times</text><text start="106" dur="2">until it finally has turned around.</text><text start="108" dur="5">It&amp;#39;s not particularly efficient or elegant, but it&amp;#39;s very, very safe.</text><text start="113" dur="3">This vehicle will eventually be able to drive in a different direction</text><text start="116" dur="3">and reach the goal point behind the 
blockage.</text><text start="119" dur="2">That was one of the tasks DARPA gave us.</text><text start="121" dur="4">So you can see it do its job until it finally breaks free</text><text start="125" dur="7">and is able to navigate around this blockage onto a different street, as shown over here.</text></transcript></video><video title="13 Robotic Path Planning.mp4" id="thepzbTYuJ8" length="162"><transcript><text start="0" dur="4">[Thrun] So let&amp;#39;s talk about robot path planning or robot motion planning,</text><text start="4" dur="4">which is a rich field in itself, and I can&amp;#39;t give you a complete survey</text><text start="8" dur="2">of all the algorithms involved.</text><text start="10" dur="4">But one of the key differences from the planning algorithms we talked about before</text><text start="14" dur="3">is that now the world is continuous.</text><text start="17" dur="4">For example, we learned about A* in which we discretize the world.</text><text start="21" dur="3">We might have a goal location, we might have obstacles,</text><text start="24" dur="4">and then with A*, a valid action sequence might look like this.</text><text start="28" dur="4">And even though this is a valid solution to the planning problem,</text><text start="32" dur="3">a car can&amp;#39;t really follow these discrete choices.</text><text start="35" dur="4">There are a number of very sharp turns over here that are just irreconcilable </text><text start="39" dur="3">with the motion of a car.</text><text start="42" dur="3">So the fundamental problem here is A* is discrete,</text><text start="45" dur="3">whereas the robotic world is continuous.</text><text start="48" dur="4">So the question arises, is there a version of A* that can deal with the continuous nature</text><text start="52" dur="4">and give us provably executable paths?</text><text start="56" dur="3">This is a big, big question in robot motion planning.</text><text start="59" dur="3">Let me just discuss it for this one 
example</text><text start="62" dur="5">and show you what we&amp;#39;ve done to solve this problem in the DARPA Urban Challenge.</text><text start="67" dur="5">The key to solving this with A* has to do with the state transition function.</text><text start="72" dur="5">Suppose we have a cell like this and we apply a sequence of very small step simulations</text><text start="77" dur="3">using our continuous math from before.</text><text start="80" dur="7">Then a state over here might find itself right here in the corner of the next discrete state.</text><text start="87" dur="2">Instead of assigning this just to the grid cell,</text><text start="89" dur="5">the algorithm called hybrid A* memorizes the exact x prime, y prime,</text><text start="94" dur="4">and theta prime and associates it with this grid cell over here</text><text start="98" dur="2">the first time the grid cell is expanded.</text><text start="100" dur="5">Then when expanding from this cell it uses a specific starting point over here</text><text start="105" dur="2">to figure out what the next cell might be.</text><text start="107" dur="4">It might happen that the same cell is reached again in A*, maybe from over here,</text><text start="111" dur="4">leading to a different continuous instantiation of x, y, and theta,</text><text start="115" dur="4">but because in A* we tend to expand cells along the shortest path</text><text start="119" dur="4">before we look at the longer paths, we now just cut this off</text><text start="123" dur="3">and never consider the state over here again.</text><text start="126" dur="3">This leads to a lack of completeness, </text><text start="129" dur="3">which means there might be solutions to the navigation problem</text><text start="132" dur="2">that this algorithm doesn&amp;#39;t capture,</text><text start="134" dur="2">but it does give us correctness.</text><text start="136" dur="5">So as long as our motion equations are correct, the resulting paths can be 
executed.</text><text start="141" dur="2">Now here is a caveat.</text><text start="143" dur="3">This is an approximation and is only correct to the extent </text><text start="146" dur="2">that these motion equations are correct, and they are never exactly correct.</text><text start="148" dur="6">But nevertheless, our paths that come out are nice, smooth, and curved paths,</text><text start="154" dur="2">and every time we expand a grid cell </text><text start="156" dur="4">we memorize explicitly the continuous values of x prime, y prime, </text><text start="160" dur="2">and theta prime with this grid cell.</text></transcript></video><video title="14 Path Planning Examples.mp4" id="zS3st_7og3A" length="183"><transcript><text start="0" dur="3">[Thrun] Now here is an actual result of applying this A* algorithm</text><text start="3" dur="2">for our vehicle that sits over here.</text><text start="5" dur="4">Real obstacles--these are laser scans of parked cars--</text><text start="9" dur="2">and a target location over here.</text><text start="11" dur="3">And while the curve isn&amp;#39;t super smooth,</text><text start="14" dur="4">you can still see it is able to find a continuous and drivable curve</text><text start="18" dur="2">to the parking location over here</text><text start="20" dur="4">by this small but important modification of A*.</text><text start="24" dur="6">There are a few other modifications of A* that I can&amp;#39;t go into in detail,</text><text start="30" dur="5">but here you can see a typical attempt of a robot to navigate a parking lot</text><text start="35" dur="2">here in simulation.</text><text start="37" dur="4">You can see the tree that is being expanded in that search.</text><text start="43" dur="4">And every time it gets stuck, it does a new A* search.</text><text start="47" dur="4">You can see how the map is being acquired as the robot moves.</text><text start="51" dur="5">In its state that&amp;#39;s in front of the robot, it not only considers the x, y and heading 
direction</text><text start="56" dur="3">but also allows the robot to go forward and backwards,</text><text start="59" dur="4">and driving backwards is just a different state than going forwards.</text><text start="63" dur="5">Now you can see how it backs up, finds a new path, and it is an incomplete maze</text><text start="68" dur="5">until it finally is able to reach the goal location through an actual opening.</text><text start="73" dur="4">We made this maze really hard to test our algorithms.</text><text start="77" dur="3">The nice thing is these algorithms work in almost real time.</text><text start="80" dur="5">It takes less than a tenth of a second to build this entire search tree,</text><text start="85" dur="5">and the robot is able to navigate this parking lot really, really efficiently.</text><text start="90" dur="4">This was one of the fastest motion planning algorithms that I saw</text><text start="94" dur="2">in the DARPA Urban Challenge.</text><text start="96" dur="3">In fact, in all of robotics it&amp;#39;s been one of the fastest algorithms </text><text start="99" dur="3">I&amp;#39;ve personally seen in my life.</text><text start="102" dur="7">Here is the same algorithm applied to an actual parking example using our robot Junior.</text><text start="109" dur="4">It&amp;#39;s driving over here, it wishes to get over there,</text><text start="113" dur="4">and you can see it has backed up into a parking gap over here,</text><text start="117" dur="7">which is an amazing precision for a robot, and then moved forward along the line over here.</text><text start="124" dur="4">Our state space is, I guess, 4-dimensional.</text><text start="128" dur="5">It comprises x, y, heading direction, and whether the car is going forward or backwards.</text><text start="133" dur="4">There is a cost to changing directions, so it doesn&amp;#39;t change direction too often.</text><text start="137" dur="3">You can see it navigate to its target location.</text><text start="140" 
dur="5">Details I am not telling you include that the trajectory that the planner generates</text><text start="145" dur="4">is subsequently smoothed using a quadratic smoother</text><text start="149" dur="2">so that we get rid of the kinks,</text><text start="151" dur="3">and the car drives much nicer as a result.</text><text start="154" dur="4">But the workhorse here that does all the work to find the best path</text><text start="158" dur="8">is actually A* modified into hybrid A*, as I told you.</text><text start="166" dur="6">And in this final video we see the car navigating a parking lot with lots of traffic cones.</text><text start="172" dur="5">On the left you see the video imagery, on the right side you can see the internal map</text><text start="177" dur="2">and the path planner,</text><text start="179" dur="4">and it attempts to park itself in the designated spot on the left.</text></transcript></video><video title="15 Conclusion.mp4" id="o7IiWAz3Jes" length="78"><transcript><text start="0" dur="4">[Thrun] So this finishes my short lecture on robotics.</text><text start="4" dur="3">Obviously the field is much, much bigger than what I just showed you,</text><text start="7" dur="2">but I gave you examples of the key elements.</text><text start="9" dur="3">I gave you an example of perception using particle filters.</text><text start="12" dur="6">I gave you an example of planning using MDPs and also A*.</text><text start="18" dur="3">These are some of the key methods we&amp;#39;ve been applying to self-driving cars.</text><text start="21" dur="3">There are many other methods.</text><text start="24" dur="2">Most notably, there&amp;#39;s also reinforcement learning </text><text start="26" dur="2">that has recently received a lot of attention.</text><text start="28" dur="3">But I don&amp;#39;t have time to talk about those.</text><text start="31" dur="5">I hope you are able to apply these methods yourself to pretty much any robotics problems</text><text 
start="36" dur="3">that you might be working on.</text><text start="39" dur="3">Robotics is really a fascinating area.</text><text start="42" dur="2">There&amp;#39;s a lot of things to learn--</text><text start="44" dur="2">far more than I can offer in this single class.</text><text start="46" dur="3">But what you should have noticed is there&amp;#39;s a really strong interplay</text><text start="49" dur="4">between artificial intelligence and the methods I showed you before</text><text start="53" dur="4">and what we are doing, for example, to make cars drive themselves.</text><text start="57" dur="4">Now, in the next class we&amp;#39;ll talk about a topic that&amp;#39;s equally important,</text><text start="61" dur="3">which is natural language processing.</text><text start="64" dur="4">So while you learned today a little bit about how to build self-driving cars,</text><text start="68" dur="3">next lecture you might actually learn how to build the next Google</text><text start="71" dur="4">when Peter Norvig tells you all about natural language processing.</text><text start="75" dur="3">So please come and see us again when the next class comes up.</text></transcript></video></group><group title="Homework 8" count="14"><video title="01 State Space Question.mp4" id="Y_77aT6KS4U" length="90"><transcript><text start="0" dur="3">So, in this question I want to test your knowledge about</text><text start="3" dur="3">the dimension of the state space of a dynamic system.</text><text start="6" dur="4">In all these questions, I&amp;#39;m going to look at a ball, like a soccer ball.</text><text start="10" dur="4">The interesting thing about a soccer ball is its orientation</text><text start="14" dur="4">is an important variable, whether it&amp;#39;s upside down and so on.</text><text start="18" dur="3">So, in all of these questions we&amp;#39;re going to explore this,</text><text start="21" dur="3">and the fact that this object is rotationally invariant.</text><text start="24" 
dur="2">Let me start with a simpler question,</text><text start="26" dur="4">which is the kinematic state, which is the state without any velocities</text><text start="30" dur="4">of the soccer ball, where it is on the ground.</text><text start="34" dur="2">And I&amp;#39;ll follow up with a question with the same kinematic state</text><text start="36" dur="3">of the ball in mid-air,</text><text start="39" dur="3">the difference being between these 2 questions that on the ground</text><text start="42" dur="4">it&amp;#39;s confined to be in a 2-dimensional ground plane,</text><text start="46" dur="4">whereas in mid-air, you might add another dimension or not.</text><text start="50" dur="4">There&amp;#39;s also the dynamic state on the ground and in mid-air.</text><text start="54" dur="4">And for the dynamic state, I wish to ignore things like spin.</text><text start="58" dur="4">I just care about velocities as a person really far away </text><text start="62" dur="3">could observe them, just to make things clear.</text><text start="65" dur="2">And finally, I&amp;#39;d like to include spin, </text><text start="67" dur="4">so let me take the most complicated situation</text><text start="71" dur="3">of dynamic state in mid-air considering spin.</text><text start="74" dur="3">The last one is really a tricky question,</text><text start="77" dur="2">so I don&amp;#39;t mind at all if you get this wrong.</text><text start="79" dur="2">But in all of those, I would like to exploit the fact</text><text start="81" dur="5">that I really don&amp;#39;t care about the absolute orientation of the soccer ball that is here,</text><text start="86" dur="4">so it&amp;#39;s invariant to its orientation, but it  might still have spin.</text></transcript></video><video title="02 State Space Answer.mp4" id="1-2_ISzFOLc" length="70"><transcript><text start="0" dur="6">And the answer is 2, 3, 4, 6, and 9.</text><text start="6" dur="4">On the ground, the static state without velocity is just X 
and Y.</text><text start="10" dur="2">That&amp;#39;s 2. </text><text start="12" dur="5">If we add mid-air, we have height, which adds a third dimension, 3.</text><text start="17" dur="2">If we add the dynamic state to the ground, </text><text start="19" dur="5">which is delta X and delta Y over time, that&amp;#39;s 4 in total</text><text start="24" dur="2">with the original X and Y. </text><text start="26" dur="4">Same for mid-air. Doubling 3 gives 6. </text><text start="30" dur="3">And the last one is the tricky one.</text><text start="33" dur="3">Clearly, a helicopter in mid-air</text><text start="36" dur="4">that looks at rotational velocities would have 12 dimensions.</text><text start="40" dur="3">But again, because I don&amp;#39;t care about the absolute coordinates</text><text start="43" dur="2">of its yaw, roll and pitch,</text><text start="45" dur="2">the ball is invariant.</text><text start="47" dur="4">The spin variables are 3: delta roll, delta pitch and delta yaw.</text><text start="51" dur="3">If we add those to the dynamic state in mid-air,</text><text start="54" dur="2">we get 9 and not 12.</text><text start="56" dur="6">Once again, these are X, Y and Z; delta X, delta Y, delta Z over time;</text><text start="62" dur="5">and the velocities in the different rotational directions,</text><text start="67" dur="3">which make a total of 9.</text></transcript></video><video title="03 Dynamic Programming Question 1.mp4" id="Glmbwxqj0g0" length="33"><transcript><text start="0" dur="3">This will be a dynamic programming question for a robot</text><text start="3" dur="4">with 3 coordinates, X, Y and theta,</text><text start="7" dur="2">even though in this diagram I just show 2.</text><text start="9" dur="4">Suppose our target location is in the top right corner</text><text start="13" dur="3">facing east as shown over here.</text><text start="16" dur="4">Initially, the robot&amp;#39;s location is in the bottom left corner facing north.</text><text start="20" 
dur="2">Suppose our goal is worth 100.</text><text start="22" dur="3">Going straight costs us -1,</text><text start="25" dur="2">and we can turn on the spot, but turning on the spot</text><text start="27" dur="2">costs us -5.</text><text start="29" dur="4">What would be the value of the start state?</text></transcript></video><video title="04 Dynamic Programming Answer 1.mp4" id="87QIGtC_ku8" length="21"><transcript><text start="0" dur="2">And the answer is 88.</text><text start="2" dur="3">It takes 7 steps to go from start to goal</text><text start="5" dur="2">if we just count the go straight steps.</text><text start="7" dur="4">1, 2, 3, 4, 5, 6, 7.</text><text start="11" dur="3">And we have to turn once in this spot right over here,</text><text start="14" dur="4">which costs an additional -5, so we pay a total of -12.</text><text start="18" dur="3">That plus 100 gives us 88.</text></transcript></video><video title="05 Dynamic Programming Question 2.mp4" id="Fssd40hPtU8" length="23"><transcript><text start="0" dur="2">Same situation as before.</text><text start="2" dur="2">We&amp;#39;d like to go from here to here.</text><text start="4" dur="2">We have a 3-dimensional state space.</text><text start="6" dur="2">The goal is worth 100.</text><text start="8" dur="2">A straight motion costs -1.</text><text start="10" dur="3">Turning on the spot clockwise costs -10,</text><text start="13" dur="2">but turning counterclockwise, </text><text start="15" dur="3">which is &amp;quot;C-CW,&amp;quot; costs us 0.</text><text start="18" dur="2">What is now the value of the start state?</text><text start="20" dur="3">Please put your answer in here.</text></transcript></video><video title="06 Dynamic Programming Answer 2.mp4" id="EvpFwr5hTl8" length="42"><transcript><text start="0" dur="3">And the answer is 93,</text><text start="3" dur="3">the reason being there&amp;#39;s 7 straight steps to the goal.</text><text start="6" dur="3">You can go down here or up here.</text><text 
start="9" dur="4">Suppose we go up here and we want to turn clockwise,</text><text start="13" dur="3">which is the turn that gets us oriented towards the goal.</text><text start="16" dur="5">That costs us -10, but we can instead turn 3 times counterclockwise.</text><text start="21" dur="3">We first turn left and down and then right.</text><text start="24" dur="3">And the total cost of this is 0 because each counterclockwise turn</text><text start="27" dur="4">is worth 0, therefore, we just go straight to the goal,</text><text start="31" dur="4">and we only pay the straight motion cost, which is 7 in total.</text><text start="35" dur="4">So it&amp;#39;s -7 for the straight penalty</text><text start="39" dur="3">plus 100, which is 93.</text></transcript></video><video title="07 Particle Question 1.mp4" id="9rNsfluUx10" length="73"><transcript><text start="0" dur="6">This is a particle filter question where we start with a single particle over here facing east.</text><text start="6" dur="5">The particle has an X, a Y, and a heading direction,</text><text start="11" dur="6">and this particle is on a checker board with black squares and white squares.</text><text start="17" dur="5">Let&amp;#39;s assume we draw 5 new particles from this particle for the motion of going right,</text><text start="22" dur="3">and they end up as indicated over here.</text><text start="25" dur="7">Each of these 5 new particles--1 of which falls into a2, 2 of which fall into b2,</text><text start="32" dur="3">1 into c2, and 1 into b3--</text><text start="35" dur="5">each of these particles will obtain an importance weight,</text><text start="40" dur="4">given that what we now measure is a black square.</text><text start="44" dur="1">So the measurement is black.</text><text start="45" dur="5">To calculate the importance weight, let me tell you that the probability of seeing black</text><text start="50" dur="3">on a black square = 0.8,</text><text start="53" dur="4">whereas the 
probability of seeing black on a white square is only 0.1.</text><text start="57" dur="5">So I want you to tell me the total importance weight that falls to a2--</text><text start="62" dur="3">here we just have a single particle--</text><text start="65" dur="4">into b2--please add the importance weight of both particles--</text><text start="69" dur="1">c2, and b3. </text><text start="70" dur="3">Please add your numbers over here.</text></transcript></video><video title="08 Particle Answer 1.mp4" id="MeLWlnHdHhw" length="39"><transcript><text start="0" dur="10">The answer is 1/19th for a2, c2, and b3, and 16/19th or 0.8421 for b2.</text><text start="10" dur="6">To see this, let&amp;#39;s associate the nonnormalized probability value to each of the particles.</text><text start="16" dur="7">Over here, we have 0.1. Here it is 0.8, but with 2 particles, let&amp;#39;s make this 1.6.</text><text start="23" dur="2">0.1 and 0.1 again.</text><text start="25" dur="4">These nonnormalized importance weights add up to 1.9,</text><text start="29" dur="6">so the desired result is the division of the original particle weights by 1.9,</text><text start="35" dur="4">which is the value as shown on the left.</text></transcript></video><video title="09 Particle Question 2.mp4" id="u9pGcjGTxrI" length="41"><transcript><text start="0" dur="5">In resampling for the next motion step, let&amp;#39;s assume the following 3 particles are used</text><text start="5" dur="2">with the other ones being ignored.</text><text start="7" dur="5">2 of them live in b2, 1 in c2. We again move right,</text><text start="12" dur="8">and we get particles distributed as follows--2 fall into b3, 2 into b4, and 1 in c4.</text><text start="20" dur="3">So using the same measurement probability as before,</text><text start="23" dur="2">and now a measurement of a white square,</text><text start="25" dur="5">tell me the cumulative importance weight for the 3 new squares,</text><text start="30" dur="5">where 2 of the 
particles fall into b3, which happens to be a white square,</text><text start="35" dur="2">2 into b4, which happens to be a black square,</text><text start="37" dur="4">and 1 into c4, which happens to be a white square again.</text></transcript></video><video title="10 Particle Answer 2.mp4" id="vIb0A-Qsnhw" length="83"><transcript><text start="0" dur="9">The answer is 18/31, which is approximately 0.5806.</text><text start="9" dur="4">4/31 is 0.1290,</text><text start="13" dur="5">and 9/31, which is half of the thing over here, 0.2903.</text><text start="18" dur="2">And again, we look at the same as before.</text><text start="20" dur="4">We look at the total nonnormalized measurement probabilities for our particles.</text><text start="24" dur="8">In a white square, the probability of seeing white is 1 - 0.1, that is 0.9.</text><text start="32" dur="5">Since we have 2 particles, the nonnormalized cumulative particle weight is 1.8.</text><text start="37" dur="7">Doing the same for the black square, the probability of seeing white is 0.2,</text><text start="44" dur="2">which is 1 - 0.8.</text><text start="46" dur="6">We have 2 particles in the black square to get a nonnormalized total probability </text><text start="52" dur="5">of 0.4, so you get a nonnormalized total importance weight of 0.4.</text><text start="57" dur="7">And finally, the probability of seeing white in the white square is 1 - 0.1 is 0.9,</text><text start="64" dur="5">and here we only have 1 particle, so the nonnormalized importance weight is 0.9.</text><text start="69" dur="3">If you add those up, we get 3.1.</text><text start="72" dur="3">So we have to divide all of those by 3.1.</text><text start="75" dur="8">So 18/31 is the 1st one. 
4 by 31--the 2nd, and 9 by 31--the 3rd, as indicated on the left.</text></transcript></video><video title="11 Stanley Question.mp4" id="BbV6zr2GQXM" length="11"><transcript><text start="0" dur="5">Our robot, Stanley, performed as follows in the DARPA Grand Challenge in 2005.</text><text start="5" dur="6">He came in 1st, 2nd, 3rd, or 4th or below in the ranking.</text></transcript></video><video title="12 Stanley Answer.mp4" id="7lSKQgEs9ks" length="9"><transcript><text start="0" dur="3">And oh, my God! Yes, we came in first!</text><text start="3" dur="3">It was one of the most amazing events in my entire life.</text><text start="6" dur="3">Our robot made it first across the finishing line.</text></transcript></video><video title="13 Motion Model Question.mp4" id="cqV-38u5Yck" length="52"><transcript><text start="0" dur="5">In this final question, I&amp;#39;m going to quiz you about our approximate motion model </text><text start="5" dur="3">for this robot, which I restate.</text><text start="8" dur="5">I&amp;#39;d like you to apply this exact motion model over here even though you might be suspicious</text><text start="13" dur="2">of its accuracy.</text><text start="15" dur="8">Suppose at time t = 0 the coordinates are 0, 0, and 0 for all 3 variables.</text><text start="23" dur="4">Delta t = 4. I&amp;#39;ll omit the units over here.</text><text start="27" dur="7">v = 10 and omega = pi/8, which is 22.5 degrees.</text><text start="34" dur="5">Assuming you run one of these simulations exactly every 4 time steps,</text><text start="39" dur="4">I would like to know what the robot state is after 4 of those updates, </text><text start="43" dur="4">or put differently, total time of 16.</text><text start="47" dur="5">So please tell me, what x will be, y, and theta.</text></transcript></video><video title="14 Motion Model Answer.mp4" id="HiflEyKkEpk" length="58"><transcript><text start="0" dur="4">The answer surprisingly is 0, 0, 0. 
</text><text start="4" dur="2">It ends up just the way it was before.</text><text start="6" dur="4">To see this, [ ] initially faces east.</text><text start="10" dur="7">Next direction, it&amp;#39;ll move forward 40, and then its heading direction changes</text><text start="17" dur="4">by 4 x pi/8, which is pi/2, so it&amp;#39;s going to start pointing up.</text><text start="21" dur="5">It repeats the same action 3 more times, and eventually arrives at the original location</text><text start="26" dur="2">and points right again.</text><text start="28" dur="7">So this is a square motion. The result is the exact same initial state as it started out with.</text><text start="35" dur="6">Now in reality, if we didn&amp;#39;t use these equations over here, it would be on a circle, </text><text start="41" dur="3">but even in a circle, it would arrive back at the original location with those </text><text start="44" dur="2">parameters shown over here. </text><text start="46" dur="3">So the fact that our simulation simulates a square </text><text start="49" dur="2">doesn&amp;#39;t affect the end result of this question,</text><text start="51" dur="2">and I didn&amp;#39;t even ask about the circle. </text><text start="53" dur="5">I just asked about applying those equations over here, so 0, 0, 0 is the correct answer.</text></transcript></video></group><group title="Unit 21" count="40"><video title="01 Introduction.mp4" id="fiIXnclf9nk" length="79"><transcript><text start="0" dur="4">Welcome back. 
We&amp;#39;re down to our final main unit.</text><text start="4" dur="2">This one is on natural language processing--</text><text start="6" dur="5">that is, understanding natural languages like English or German or French</text><text start="11" dur="2">and figuring out what to do with them.</text><text start="13" dur="3">Now, this is a very interesting topic for three reasons.</text><text start="16" dur="5">One is a philosophical one--we as humans have defined ourselves much in terms </text><text start="21" dur="5">of our ability to speak with each other and understand each other.</text><text start="26" dur="4">This ability to use language is something that we feel sets us apart </text><text start="30" dur="4">from all the other animals and from the other machines.</text><text start="34" dur="2">Second is in terms of applications.</text><text start="36" dur="7">We really would like to be able to talk to our computers and use them for various things.</text><text start="43" dur="5">Sure there are occasions where clicking with a mouse is the right thing to do,</text><text start="48" dur="5">but talking is natural, and we want to be able to communicate with our machines.</text><text start="53" dur="2">Then third is in terms of learning.</text><text start="55" dur="3">We want our computers to be smarter,</text><text start="58" dur="6">and much of human knowledge is written down in terms of paragraphs and sentences of text,</text><text start="64" dur="7">and not in terms of formal databases or formal procedures written in code.</text><text start="71" dur="3">It&amp;#39;s all in text, and if we want our computers to be smart,</text><text start="74" dur="2">they&amp;#39;d better be able to read that text and make sense of it.</text><text start="76" dur="3">That&amp;#39;s what this lesson is all about.</text></transcript></video><video title="02 Language Models.mp4" id="K1tBjg503uU" length="175"><transcript><text start="0" dur="3">We&amp;#39;ll start by talking about language 
models.</text><text start="3" dur="4">Historically, there have been two types of models that have been popular</text><text start="7" dur="3">for natural language understanding within AI.</text><text start="10" dur="6">One of the types of models has to do with sequences of letters or words.</text><text start="16" dur="4">These types of models tend to be probabilistic </text><text start="20" dur="4">in that we&amp;#39;re talking about the probability of a sequence,</text><text start="24" dur="6">word based in that mostly what we&amp;#39;re dealing with is the surface words themselves,</text><text start="30" dur="3">and sometimes letters.</text><text start="33" dur="4">But we&amp;#39;re dealing with the actual data of what we see,</text><text start="37" dur="2">rather than some underlying abstractions,</text><text start="39" dur="5">and these models are primarily learned from data.</text><text start="44" dur="6">Now, in contrast to that is another type of model that you might have seen before,</text><text start="50" dur="4">where we&amp;#39;re primarily dealing with trees and with abstract structures.</text><text start="54" dur="7">So we say we can have a sentence, which is composed of a noun phrase and a verb phrase,</text><text start="61" dur="6">and a noun phrase might be a person&amp;#39;s name, and that might be &amp;quot;Sam.&amp;quot;</text><text start="67" dur="7">And the verb phrase might be a verb and we might say &amp;quot;Sam slept&amp;quot;--</text><text start="74" dur="2">a very simple sentence.</text><text start="76" dur="4">Now, these types of models have different properties.</text><text start="80" dur="5">For one, they tend to be logical rather than probabilistic--</text><text start="85" dur="7">that is whereas on this side, we&amp;#39;re talking about the probability of a sequence of words,</text><text start="92" dur="8">on this side we&amp;#39;re talking about a set of sentences and that set defines the language,</text><text start="100" dur="4">and 
a sentence is either in the language or not.</text><text start="104" dur="6">It&amp;#39;s a Boolean logical distinction rather than on this side a probabilistic distinction.</text><text start="110" dur="7">These models are based on abstractions such as trees and categories--</text><text start="117" dur="5">categories like noun phrase and verb phrase and tree structures like this</text><text start="122" dur="6">that don&amp;#39;t actually occur in the surface form, so the words that we can observe.</text><text start="128" dur="4">An agent can observe the words &amp;quot;Sam&amp;quot; and &amp;quot;slept,&amp;quot;</text><text start="132" dur="7">but an agent can&amp;#39;t directly observe the fact that slept is a verb or that it&amp;#39;s part of this tree structure.</text><text start="139" dur="6">Traditionally, these types of approaches have been primarily hand-coded.</text><text start="145" dur="4">That is, rather than learning this type of structure from data,</text><text start="149" dur="6">we learn it by going out and having linguists and other experts write down these rules.</text><text start="155" dur="4">Now, these distinctions are not clear-cut.</text><text start="159" dur="6">You could have trees and have a probabilistic model of them.</text><text start="165" dur="3">You could learn trees.</text><text start="168" dur="7">We can go back and forth, but traditionally the two camps have divided up in this way.</text></transcript></video><video title="03 Bag of Words.mp4" id="XzAaYwH5npk" length="64"><transcript><text start="0" dur="3">Now, we&amp;#39;ve seen probabilistic word models before.</text><text start="3" dur="3">If you remember when we were doing machine learning, </text><text start="6" dur="3">we talked about the bag-of-words model.</text><text start="9" dur="4">What I&amp;#39;m showing you now is a copy of a bumper sticker that my friend</text><text start="13" dur="5">Othar Hansson, who is one of the main engineers on the Google search team came 
up with,</text><text start="18" dur="5">the bumper sticker, of course, says &amp;quot;Honk if you love the bag-of-words model,&amp;quot;</text><text start="23" dur="5">but it says that in a way where the words are a bag rather than a sequence.</text><text start="28" dur="3">This just kind of indicates the power of the model--</text><text start="31" dur="4">that it gets the idea across while losing all notion of sequence,</text><text start="35" dur="4">and thus making the probabilistic model simpler to deal with.</text><text start="39" dur="6">But we can move on from a bag-of-words model, which we can think of as a unigram--</text><text start="45" dur="3">sometimes also called a naive Bayes model,</text><text start="48" dur="6">because every individual word is treated as a separate factor that&amp;#39;s unrelated </text><text start="54" dur="4">or unconditionally independent of all the other words.</text><text start="61" dur="3">We can move beyond those types of models to ones where we do take sequence into account.</text></transcript></video><video title="04 Probabilistic Models.mp4" id="Z2UajUpjde0" length="335"><transcript><text start="0" dur="6">What we want then is a probabilistic model P over a word sequence,</text><text start="6" dur="8">and we can write that sequence word 1, word 2, word 3, all the way up to word n,</text><text start="14" dur="5">and we can use an abbreviation for that and write that the sequence of </text><text start="19" dur="4">words 1 through n, using the colon.</text><text start="23" dur="6">Now the next step is to say we can factor this and take these individual variables</text><text start="29" dur="4">and write that in terms of conditional probabilities.</text><text start="33" dur="10">So, this probability is equal to the product over all i of the probability of word i</text><text start="43" dur="3">given all the preceding words.</text><text start="46" dur="5">So that would be from word 1 up to word i - 1.</text><text 
start="51" dur="4">This is just the definition of conditional probability--</text><text start="55" dur="7">the joint probability of a set of variables can be factored out as the conditional probability </text><text start="62" dur="3">of one variable given all the others,</text><text start="65" dur="4">and then we can recursively do that until we&amp;#39;ve got all the variables accounted for.</text><text start="69" dur="3">We can make the Markov assumption</text><text start="72" dur="5">and that&amp;#39;s the assumption that the effect of one variable on another will be local.</text><text start="77" dur="4">That is, if we&amp;#39;re looking at the nth word, the words that are relevant to that</text><text start="81" dur="4">are the ones that have occurred recently and not the ones that occurred a long time ago.</text><text start="85" dur="7">What the Markov assumption means is that the probability of a word i,</text><text start="92" dur="6">given the words all the way from 1 up to word i minus 1,</text><text start="98" dur="7">we can assume that that&amp;#39;s equal or approximately equal to the probability</text><text start="105" dur="7">of the word given only the words from i minus k up to i minus 1.</text><text start="112" dur="6">Instead of going all the way back to number 1, we only go back k steps.</text><text start="118" dur="6">For an order-1 Markov model, that is, for k equal to one, this would be equal to</text><text start="124" dur="6">the probability of word i given only word i minus 1.</text><text start="130" dur="6">Now, the next thing we want in our model is called the stationarity assumption.</text><text start="136" dur="7">What that says is that the probability of each variable is going to be the same.</text><text start="143" dur="4">So the probability distribution over the first word is going to be the same</text><text start="147" dur="4">as the probability distribution over the nth word.</text><text start="151" dur="4">Another way to look at that 
is if I keep saying sentences,</text><text start="155" dur="3">the words that show up in my sentence depend on what the surrounding words are </text><text start="158" dur="4">in the sentence, but they don&amp;#39;t depend on whether I&amp;#39;m on the first sentence </text><text start="162" dur="3">or the second sentence or the third sentence.</text><text start="165" dur="6">Stationarity assumption we can write as the probability of a word given</text><text start="171" dur="5">the previous word is the same for all variables.</text><text start="176" dur="6">For all values of i and j, the probability of word i given the previous word</text><text start="182" dur="4">is the same as the probability of word j given the previous word.</text><text start="186" dur="5">That gives us all the formalism we need to talk about these word sequence models--</text><text start="191" dur="3">probabilistic word sequence models.</text><text start="194" dur="2">In practice there are many tricks.</text><text start="196" dur="5">One thing we talked about before, when we were doing the spam filtering and so on,</text><text start="201" dur="3">is the necessity of smoothing.</text><text start="204" dur="3">That is, if we&amp;#39;re going to learn these probabilities from counts,</text><text start="207" dur="4">we go out into the world, we observe some data, </text><text start="211" dur="7">we figure out how often word i occurs given word i - 1 was the previous word, </text><text start="218" dur="3">we&amp;#39;re going to find out that a lot of these counts are going to be zero</text><text start="221" dur="3">or going to be some small number, and the estimates are not going to be good.</text><text start="224" dur="2">And therefore we need some type of smoothing, </text><text start="226" dur="2">like the Laplace smoothing that we talked about,</text><text start="228" dur="5">and there are many other techniques for doing smoothing to come up with good estimates.</text><text start="233" dur="4">Another 
thing we can do is augment these models to say </text><text start="237" dur="4">maybe we want to deal not just with words but with other data as well.</text><text start="241" dur="3">We saw that in the spam-filtering model also.</text><text start="244" dur="3">So there you might want to think about who the sender is,</text><text start="247" dur="3">what the time of day is and so on,</text><text start="250" dur="5">these auxiliary fields like in the header fields of the email messages</text><text start="255" dur="5">as well as the words in the message, and that&amp;#39;s true for other applications as well.</text><text start="260" dur="5">You may want to go beyond the words and consider variables that have to do with context of the words.</text><text start="265" dur="4">We may also want to have other properties of words.</text><text start="269" dur="4">The great thing about just dealing with an individual word like &amp;quot;dog&amp;quot;</text><text start="273" dur="3">is that it&amp;#39;s observable in the world.</text><text start="276" dur="5">We see this spoken or written text, and we can figure out what it means, </text><text start="281" dur="4">and we can start making counts about it and start estimating probabilities,</text><text start="285" dur="7">but we also might want to know that, say, &amp;quot;dog&amp;quot; is being used as a noun,</text><text start="292" dur="3">and that&amp;#39;s not immediately observable in the world, but it is inferable. 
</text><text start="295" dur="6">It&amp;#39;s a hidden variable, and we may want to try to recover these hidden variables like parts of speech.</text><text start="301" dur="5">We may also want to go to bigger sequences than just individual words,</text><text start="306" dur="4">so rather than treat &amp;quot;New York City&amp;quot; as three separate words,</text><text start="310" dur="5">we may want a model that allows us to think of it as a single phrase.</text><text start="315" dur="6">Or we may want to go smaller than that and look at a model that deals with individual letters </text><text start="321" dur="2">rather than dealing with words.</text><text start="323" dur="5">So these are all variations, and the type of model you choose depends on the application,</text><text start="328" dur="7">but they all follow from this idea of a probabilistic model over sequences.</text></transcript></video><video title="05 Language and Learning.mp4" id="CGsY9BOrDPo" length="119"><transcript><text start="0" dur="3">Now, we talked about using language for learning,</text><text start="3" dur="4">and this slide is demonstrating the power of language,</text><text start="7" dur="7">how it has such a powerful connection to the real world that allows us to learn things</text><text start="14" dur="2">just by observing language use.</text><text start="16" dur="6">What I&amp;#39;ve done here is I&amp;#39;ve gone to Google Trends and typed in two search terms--</text><text start="22" dur="7">&amp;quot;full moon&amp;quot; and &amp;quot;ice cream&amp;quot; and gotten back a graph of how popular those queries are over time.</text><text start="29" dur="6">We also have a graph of the news volume for those terms over time,</text><text start="35" dur="2">but that&amp;#39;s not so interesting here.</text><text start="37" dur="4">What&amp;#39;s interesting on this side is the regularity in these patterns.</text><text start="41" dur="6">This pattern allows me just by observing language to do 
amateur astronomy. </text><text start="47" dur="2">What do I mean by that?</text><text start="49" dur="2">Well, look at the curve for ice cream.</text><text start="51" dur="5">Popular in the summer. Not so popular in the winter.</text><text start="56" dur="6">What that means is if I wanted to figure out what the orbital period is of the earth around the sun,</text><text start="62" dur="6">all I have to do is measure these peaks, and it would come out to 365 days--</text><text start="68" dur="7">a very regular performance of language speakers using the term &amp;quot;ice cream&amp;quot; in the summer</text><text start="75" dur="3">repeatedly year after year.</text><text start="78" dur="2">Now, there&amp;#39;s a little bit of a blip here.</text><text start="80" dur="2">What happened in this case?</text><text start="82" dur="6">Well, it turns out that a manufacturer of cell phone operating systems </text><text start="88" dur="5">decided to call the latest update to their operating system &amp;quot;Ice Cream Sandwich,&amp;quot;</text><text start="93" dur="4">and so there was a lot of searching for that when it came out.</text><text start="97" dur="3">But that just lasted a few days, and then things went back to normal.</text><text start="100" dur="6">Similarly for the query &amp;quot;full moon&amp;quot; in blue, we see this period here,</text><text start="106" dur="3">and we can measure that period to be 28 days,</text><text start="109" dur="5">so we can do amateur astronomy and figure out how the moon works as well,</text><text start="114" dur="5">just by observing how people in the real world use language.</text></transcript></video><video title="06 Language Models Question.mp4" id="QtrR4gKfvTQ" length="25"><transcript><text start="0" dur="4">What can you do with language in the world besides amateur astronomy?</text><text start="4" dur="3">Well, I haven&amp;#39;t told you yet, but I want to give you a little quiz</text><text start="7" dur="2">and allow you to 
guess.</text><text start="9" dur="5">And so for each of these applications here, I want you to tell me </text><text start="14" dur="5">whether you think that language models, and specifically these types of word models,</text><text start="19" dur="6">would be a major part of an implementation of that task.</text></transcript></video><video title="07 Language Models Answer.mp4" id="7h4FZMO1c64" length="100"><transcript><text start="0" dur="4">Almost all of these are great examples of applications that are used everyday,</text><text start="4" dur="3">and are primarily based on word models.</text><text start="7" dur="5">We&amp;#39;ve seen classification before for spam and other types of categories,</text><text start="12" dur="2">language ID and so on.</text><text start="14" dur="5">There&amp;#39;s also the idea of clustering, where we don&amp;#39;t have categories ahead of time.</text><text start="19" dur="5">Yes, you can take news stories and classify them into, say, sports or weather,</text><text start="24" dur="4">but you can also cluster them to say here&amp;#39;s a cluster of stories </text><text start="28" dur="5">that are all related about the latest topic that has maybe never occurred before.</text><text start="33" dur="5">There&amp;#39;s also input correction of various kinds such as spelling correction,</text><text start="38" dur="4">sentiment analysis--taking reviews of products and trying to decide </text><text start="42" dur="4">if they&amp;#39;re favorable or unfavorable and rate products that way,</text><text start="46" dur="3">information retrieval--web search is a problem that I&amp;#39;ve worked on </text><text start="49" dur="3">and is primarily addressed with word models,</text><text start="52" dur="5">question answering such as IBM&amp;#39;s Watson did in playing the game of Jeopardy.</text><text start="57" dur="3">They use a variety of techniques.</text><text start="60" dur="4">Many of them are based around word models, but there are other 
techniques as well.</text><text start="64" dur="7">Machine translation--we saw the example of Chinese menus and translating to English from examples.</text><text start="71" dur="5">The examples are primarily dealt with by phrases and individual words</text><text start="76" dur="3"> and some augmentation to that as well.</text><text start="79" dur="2">Speech recognition--a similar story.</text><text start="81" dur="5">And then finally I threw in one that is not primarily a question for word models.</text><text start="86" dur="6">Driving a car autonomously is primarily a question in perception and localization.</text><text start="92" dur="5">Yes, you might want to be able to talk to the car to direct it to do something,</text><text start="97" dur="3">but that wouldn&amp;#39;t be part of the autonomous part.</text></transcript></video><video title="08 Unigram Model Samples.mp4" id="ZJjljuk4JD0" length="58"><transcript><text start="0" dur="5">Now, I wanted to show you how powerful n-gram models of language are.</text><text start="5" dur="3">That is, if we&amp;#39;re only looking at word sequences, </text><text start="8" dur="3">what is it that we&amp;#39;re giving up and what are we getting?</text><text start="11" dur="5">So I read in the complete works of Shakespeare into a small computer program,</text><text start="16" dur="4">and then built n-gram models and sampled from that model.</text><text start="20" dur="7">That is, generated random sentences that come from the probability distribution defined by that model.</text><text start="27" dur="3">And here are samples from the unigram model.</text><text start="30" dur="7">That is sampling from words according to frequency in the corpus of Shakespeare text,</text><text start="37" dur="4">but not taking into account any relationship between adjacent words.</text><text start="41" dur="3">And looking at this, it doesn&amp;#39;t make much sense.</text><text start="44" dur="2">It does seem like real sentences.</text><text 
start="46" dur="3">You can tell the vocabulary is somewhat archaic.</text><text start="49" dur="5">You have words like &amp;quot;o&amp;#39;erthrown&amp;quot; and &amp;quot;thou&amp;quot; and &amp;quot;&amp;#39;tis&amp;quot; and so on,</text><text start="54" dur="4">but you aren&amp;#39;t really getting very much from this.</text></transcript></video><video title="09 Bigram Model Samples.mp4" id="4IZXXh-1jzo" length="32"><transcript><text start="0" dur="2">Now we move to a bigram model,</text><text start="2" dur="4">where we&amp;#39;re sampling from the probability of a word given the previous word.</text><text start="6" dur="3">Now we see a little bit of structure emerge.</text><text start="9" dur="6">So you can see at the start of the sentences &amp;quot;I have&amp;quot;, &amp;quot;hear you,&amp;quot; &amp;quot;hark ye.&amp;quot;</text><text start="15" dur="3">The words seem to go together,</text><text start="18" dur="6">but then as the sentences move on they ramble and don&amp;#39;t go in any definitive direction.</text><text start="24" dur="4">So the sentences are locally consistent at the level of one or two words,</text><text start="28" dur="4">but that consistency doesn&amp;#39;t go very far.</text></transcript></video><video title="10 Trigram Model Samples.mp4" id="i9duR1sX9Rk" length="29"><transcript><text start="0" dur="4">Now with the trigram models, we&amp;#39;re starting to get a little bit more structure.</text><text start="4" dur="4">In fact, we get complete sentences that actually make some sense--</text><text start="8" dur="4">&amp;quot;I will never yield,&amp;quot; and the exclamation &amp;quot;little pretty ones!&amp;quot;</text><text start="12" dur="3">And we get sentences that are fairly coherent--</text><text start="15" dur="2">&amp;quot;I would learn of noble Edward&amp;#39;s sons&amp;quot;--</text><text start="17" dur="3">but then break down a little bit--&amp;quot;what thing, avoid!&amp;quot;</text><text start="20" dur="6">So we&amp;#39;re 
getting a model that appears to be a little bit closer to actual Shakespeare</text><text start="26" dur="3">but it&amp;#39;s still incomplete.</text></transcript></video><video title="11 N Gram Model Samples.mp4" id="5KzdAhyz3XI" length="64"><transcript><text start="0" dur="3">And finally this example based on 4-grams--</text><text start="3" dur="3">Now we&amp;#39;re seeing an even longer structure.</text><text start="6" dur="4">Sometimes we have this generate something that makes sense like </text><text start="10" dur="3">&amp;quot;betwixt their two estates,&amp;quot;</text><text start="13" dur="3">and this is not something that appears in Shakespeare,</text><text start="16" dur="2">but it was just generated and it made sense.</text><text start="18" dur="7">Sometimes we get things that are actually quotes like &amp;quot;even to the frozen ridges of the alps.&amp;quot;</text><text start="25" dur="3">The model chose to duplicate something that is actually in Shakespeare.</text><text start="28" dur="3">Certainly there were lots of choices it could have made.</text><text start="31" dur="2">&amp;quot;Alps&amp;quot; appears four or five times.</text><text start="33" dur="2">&amp;quot;Even&amp;quot; appears many, many times. 
</text><text start="35" dur="2">&amp;quot;Frozen&amp;quot; appears multiple times.</text><text start="37" dur="5">But it just happened to duplicate something that was a quotation from the original.</text><text start="42" dur="5">And then there&amp;#39;s lots of examples of sentences and phrases that make a lot of sense--</text><text start="47" dur="5">&amp;quot;I know my duty,&amp;quot; and &amp;quot;give me some little breath,&amp;quot; and so on.</text><text start="52" dur="4">So it looks like even though all we know is a sequence of 4 words,</text><text start="56" dur="8">we&amp;#39;re still capturing quite a bit of what it means to have coherent sentences but not everything.</text></transcript></video><video title="12 N Gram Model Question.mp4" id="AR4tYVZsmuo" length="42"><transcript><text start="0" dur="6">Here&amp;#39;s a little quiz--for each of these pieces of text, which was generated from an n-gram model,</text><text start="6" dur="7">I want you to try to guess if it was generated from a 1-gram model, a 2-gram, or a 3-gram.</text><text start="13" dur="4">Now, I know you won&amp;#39;t necessarily be able to get these all right. Don&amp;#39;t worry about that.</text><text start="17" dur="3">The main point is just for you to get some facility in </text><text start="20" dur="3">kind of looking at these models and trying to understand them.</text><text start="23" dur="5">I&amp;#39;ll give you a hint--three of them were generated by a 1-gram model,</text><text start="28" dur="3">three by a 2-gram model, and three of them by a 3-gram model,</text><text start="31" dur="6">and one of them is an actual excerpt from the corpus of Shakespeare&amp;#39;s work,</text><text start="37" dur="5">And so leave that one blank. Don&amp;#39;t mark it at all.</text></transcript></video><video title="13 N Gram Model Answer.mp4" id="_iwaqJ07LEg" length="16"><transcript><text start="0" dur="4">Here we see the answers. 
You can tell that the 3-grams are more consistent, </text><text start="4" dur="4">more sentence-like than the 1-gram model generated sentences,</text><text start="8" dur="8">and this is an actual sentence or stage direction from the works of Shakespeare.</text></transcript></video><video title="14 Probability Question.mp4" id="1nGUKftPER8" length="108"><transcript><text start="0" dur="5">Here&amp;#39;s a quiz to make sure you understand how to calculate these probabilistic models.</text><text start="5" dur="6">We&amp;#39;re going to calculate the probability of the string &amp;quot;woe is me,&amp;quot;</text><text start="11" dur="3">and we&amp;#39;re going to calculate that beginning at the beginning of the sentence,</text><text start="14" dur="2">which we&amp;#39;ll mark with this dot character,</text><text start="16" dur="4">given that we are starting at the beginning of the sentence.</text><text start="20" dur="4">I want you to figure that out, put the probability in here,</text><text start="24" dur="5">and it&amp;#39;s going to be a small number with a lot of zeros to the right of the decimal place.</text><text start="29" dur="6">So scale that by a factor of 1 billion--the probability times 10 to the 9th.</text><text start="35" dur="3">Now, of course, I&amp;#39;m going to have to give you some data to make this make sense.</text><text start="38" dur="6">I&amp;#39;m going to tell you that the probability that &amp;quot;woe&amp;quot; occurs at position i</text><text start="44" dur="5">given that the start-of-sentence marker occurs at position i minus 1.</text><text start="49" dur="6">I should say what we&amp;#39;re doing here is we&amp;#39;re sort of artificially introducing a token</text><text start="55" dur="3">into our data of the start-of-sentence marker,</text><text start="58" dur="5">which could be either what comes after a period or exclamation point</text><text start="63" dur="6">or at the beginning of the file. 
That all counts as a start-of-sentence marker.</text><text start="69" dur="4">That probability is 0.002.</text><text start="73" dur="10">The probability that &amp;quot;is&amp;quot; occurs at position i given that &amp;quot;woe&amp;quot; occurred at i minus 1 is 0.07,</text><text start="83" dur="14">and the probability that &amp;quot;me&amp;quot; occurs at position i given that &amp;quot;is&amp;quot; occurred at i minus 1 is 0.0005.</text><text start="97" dur="5">Tell me the probability of the whole string &amp;quot;woe is me&amp;quot; at the beginning of a sentence</text><text start="102" dur="6">given that we&amp;#39;re starting at the beginning of a sentence and put your answer in here.</text></transcript></video><video title="15 Probability Answer.mp4" id="kDNdNCNlfIY" length="31"><transcript><text start="0" dur="3">The answer is that we just multiply them together,</text><text start="3" dur="4">and it works out to 7 parts per billion.</text><text start="7" dur="5">I should note that these numbers are small, but that shouldn&amp;#39;t bother you.</text><text start="12" dur="3">So &amp;quot;woe is me&amp;quot; seems like a fairly common phrase.</text><text start="15" dur="6">The reason it&amp;#39;s small is because there are so many common phrases of three words.</text><text start="21" dur="6">And so even though this one&amp;#39;s fairly common, it works out to only a few parts per billion,</text><text start="27" dur="4">because of the many other possibilities.</text></transcript></video><video title="16 Language Question.mp4" id="5h5rNygmHis" length="70"><transcript><text start="0" dur="3">Now let&amp;#39;s take a step back for a second, and I&amp;#39;m going to talk about </text><text start="3" dur="3">probabilistic letter models.</text><text start="6" dur="3">Here we have a sequence of letters,</text><text start="9" dur="6">and it looks like this sequence is rather infrequent in English.</text><text start="15" dur="3">But what can we do with letter models that 
we can&amp;#39;t do</text><text start="18" dur="4"> or that we can do in opposition to word models?</text><text start="22" dur="3">The answer is letter models are very good in cases </text><text start="25" dur="5">where we&amp;#39;re going to be dealing with unique words that maybe we haven&amp;#39;t seen before,</text><text start="30" dur="2">but they still give you properties of the language.</text><text start="32" dur="5">One very interesting task is language identification. Let&amp;#39;s see how that would work.</text><text start="37" dur="8">Let&amp;#39;s take some example phrases--&amp;quot;hello, world,&amp;quot; &amp;quot;guten tag, welt,&amp;quot; &amp;quot;salam dunya,&amp;quot;</text><text start="45" dur="4">and let&amp;#39;s suppose you have the task of classifying these </text><text start="49" dur="2">into the language from which they were sampled,</text><text start="51" dur="5">and we&amp;#39;ll make this into a quiz, and we&amp;#39;ll give you some choices--</text><text start="56" dur="3">English, German, French, Spanish, and Azerbaijani--</text><text start="59" dur="11">and tell me for each of these what your best guess is at the most likely language classification.</text></transcript></video><video title="17 Language Answer.mp4" id="dCIv-r6_xRM" length="90"><transcript><text start="0" dur="2">That didn&amp;#39;t seem too hard.</text><text start="2" dur="3">This looks like English. This looks like German. </text><text start="5" dur="3">I may not be familiar with Azerbaijani, </text><text start="8" dur="3">but it doesn&amp;#39;t look like English, German, French, or Spanish,</text><text start="11" dur="4">so I&amp;#39;ll probably choose that, and that would be the right answer.</text><text start="15" dur="3">Now, how could I do that? 
Well, I could do it by recognizing some of the words.</text><text start="18" dur="5">But it turns out I can also do it just by looking at letter sequences,</text><text start="23" dur="5">the frequency of single letters or pairs of letters or triplets of letters.</text><text start="28" dur="7">In fact, you can get about 99% accuracy for language identification just looking at tables of letters.</text><text start="35" dur="4">And a great thing about dealing with letter models is that</text><text start="39" dur="3"> the probability tables you need are much more compact.</text><text start="42" dur="7">If you think about triples of words, there may be a million words in the vocabulary,</text><text start="49" dur="4">so a table of triples is a million to the 3rd power.</text><text start="53" dur="3">That&amp;#39;s quite a number of entries.</text><text start="56" dur="5">Whereas for letters in the alphabet, most alphabets have about 30 letters or so.</text><text start="61" dur="4">So it&amp;#39;s very easy and compact to store triples of those.</text><text start="65" dur="3">Now, in doing actual language identification,</text><text start="68" dur="5">it&amp;#39;s also common to add other features, to not look only at the letter combinations.</text><text start="73" dur="2">So you might add words as well.</text><text start="75" dur="3">You might add a small number of words--the most common words in a language,</text><text start="78" dur="5">or it may be even better to add the most discriminative words--</text><text start="83" dur="4">words that show up in one language but not in another language</text><text start="87" dur="3">and count the occurrence of those words.</text></transcript></video><video title="18 Letter Bigram Question.mp4" id="aEJxHRY9Jz4" length="39"><transcript><text start="0" dur="4">In this table what I&amp;#39;ve done is I&amp;#39;ve taken samples of text in 3 languages</text><text start="4" dur="4">and just counted the frequency of the letter 
bigrams,</text><text start="8" dur="2">and then ordered them from top to bottom.</text><text start="10" dur="5">And so for language A, TH was the most frequent letter bigram, </text><text start="15" dur="3">TE was the second most frequent, and so on.</text><text start="18" dur="6">In language B, EN was most popular, ER was the second most popular, and so on.</text><text start="24" dur="2">And the same for language C.</text><text start="26" dur="5">What I want you to do is just try to guess which language is which.</text><text start="31" dur="4">Is language A English, German, or Azerbaijani?</text><text start="35" dur="4">And do the same for B and C.</text></transcript></video><video title="19 Letter Bigram Answer.mp4" id="gb4RVgBt6bU" length="19"><transcript><text start="0" dur="4">If you&amp;#39;re familiar with these languages, you probably could have guessed that</text><text start="4" dur="4">TH is the most common two-letter sequence in English.</text><text start="8" dur="2">These look like German.</text><text start="10" dur="5">These are a little bit unfamiliar, and it has a letter that doesn&amp;#39;t show up in English and German,</text><text start="15" dur="4">and so this one is Azerbaijani.</text></transcript></video><video title="20 Trigram Model Question.mp4" id="a2ht6wgtVJA" length="61"><transcript><text start="0" dur="6">Here I just wanted to show that even very short sequences can be identified quite easily</text><text start="6" dur="4">with language ID models based on letter frequencies.</text><text start="10" dur="7">So for the 3-letter sequence T-H-E, I have trigram models for 3 different languages.</text><text start="17" dur="8">And you can see that in language A there&amp;#39;s a 1.1% chance of representing T-H-E,</text><text start="25" dur="5">which is 4 times more than B and quite a bit more than C.</text><text start="30" dur="6">For the 3-letter sequence D-E-R, that&amp;#39;s 10 times more likely to be language B</text><text start="36" 
dur="4">than to be language A and quite a bit more than language C.</text><text start="40" dur="8">For the letter sequence R-B-A, that&amp;#39;s 50 times more likely to be C than it is to be B and even more so than A.</text><text start="48" dur="3">What I want you to tell me is what is each of these languages?</text><text start="51" dur="4">Where did this column of numbers come from?</text><text start="55" dur="6">Is language A English, German or Azerbaijani, and the same for B and C?</text></transcript></video><video title="21 Trigram Model Answer.mp4" id="F_ysaz3K-XA" length="14"><transcript><text start="0" dur="4">You can see that English is a language in which THE is very common.</text><text start="4" dur="3">German is a language in which DER is more common.</text><text start="7" dur="7">And in Azerbaijani, the sequence RBA is more common.</text></transcript></video><video title="22 Classification.mp4" id="kEd6eiP2C4Q" length="94"><transcript><text start="0" dur="5">Enough about letters. Now let&amp;#39;s use all the tools at our disposal and tackle a new task--</text><text start="5" dur="4">the task of classification into semantic classes.</text><text start="9" dur="4">Say we&amp;#39;re given a sequence of phrases and want to classify them </text><text start="13" dur="3">into one of several categories.</text><text start="16" dur="4">Here I&amp;#39;ve chosen just three--people, places, and drugs,</text><text start="20" dur="3">and I have some examples of each.</text><text start="23" dur="2">What would you use to do that?</text><text start="25" dur="2">Well, you have a number of things at your disposal.</text><text start="27" dur="3">One, you could memorize some common parts.</text><text start="30" dur="7">So &amp;quot;Steve&amp;quot; and &amp;quot;Bill&amp;quot; are very common as that first word in a phrase which represents people.</text><text start="37" dur="6">&amp;quot;San&amp;quot; and &amp;quot;New&amp;quot; are common in places. 
</text><text start="43" dur="3">&amp;quot;City&amp;quot; is common at the end of places.</text><text start="46" dur="6">But not all these techniques will be unambiguous or 100% accurate.</text><text start="52" dur="4">So for example, if you have a phrase where the last word is &amp;quot;grove&amp;quot; </text><text start="56" dur="3">and the first word seems like part of a name,</text><text start="59" dur="4">that could be a place, but it could also be a person&amp;#39;s name.</text><text start="63" dur="7">With drugs, it looks like maybe the letter-based approach is better than the word-based approach.</text><text start="70" dur="6">They seem to have a much higher frequency of starting with &amp;quot;z&amp;quot; or ending in &amp;quot;x&amp;quot;, for example,</text><text start="76" dur="5">but you can imagine a classifier using the techniques that we&amp;#39;ve seen in machine learning</text><text start="81" dur="2">that takes all these features.</text><text start="83" dur="3">What&amp;#39;s the first word? What&amp;#39;s the second word? </text><text start="86" dur="4">What&amp;#39;s the first letter? What&amp;#39;s the last letter or the last two letters?</text><text start="90" dur="4">Throw all those features in and build a classifier.</text></transcript></video><video title="23 Classification Question.mp4" id="QyxVe8J7I4w" length="30"><transcript><text start="0" dur="4">Here&amp;#39;s a quick quiz. 
Which of these would be a good algorithm or technique </text><text start="4" dur="5">for doing this classification into things like people, places, and drugs?</text><text start="9" dur="9">Could we use Naive Bayes, k-Nearest Neighbors, Support Vector Machines, Logistic Regression?</text><text start="18" dur="4"> Could we use the Unix sort command or the gzip command?</text><text start="22" dur="5">Check all those that you think would be reasonably good algorithms</text><text start="27" dur="3">for doing classification.</text></transcript></video><video title="24 Classification Answer.mp4" id="DkBMj6ZYwGs" length="24"><transcript><text start="0" dur="4">The answer is that all of these are good, except for the sort command.</text><text start="4" dur="2">That wouldn&amp;#39;t be very good.</text><text start="6" dur="4">It would maybe separate out the drugs that begin with &amp;quot;z&amp;quot; near the end of the list,</text><text start="10" dur="5">so that would help, but it would just do probably about random for everything else.</text><text start="15" dur="4">Now, you may be surprised to learn that the gzip command</text><text start="19" dur="5"> is actually pretty good as a classification algorithm. 
Let&amp;#39;s try to understand that.</text></transcript></video><video title="25 Gzip.mp4" id="dIEStL7riMo" length="143"><transcript><text start="0" dur="7">Here I have 3 files containing a corpus of text in each of the languages that I want to classify into,</text><text start="7" dur="4">and imagine these are much longer, so it gives you a good sample of text in</text><text start="11" dur="3">English, German, and Azerbaijani.</text><text start="14" dur="6">Now I have a new piece of text that I want to classify against each of these possibilities.</text><text start="20" dur="3">Well, I can do that using the gzip command.</text><text start="23" dur="4">So I could issue this Unix command that says </text><text start="27" dur="4">&amp;quot;concatenate together the new file with the English file, </text><text start="31" dur="4">gzip them, compress them, then count the number of characters,</text><text start="35" dur="4">and do the same for the German and Azerbaijani,</text><text start="39" dur="4">and then figure out which one is shortest.&amp;quot;</text><text start="43" dur="5">In fact, when we do that with the files I&amp;#39;ve collected, it gives me the right answer.</text><text start="48" dur="2">Now how does it do that?</text><text start="50" dur="5">Well, you have to understand a little bit about how compression algorithms like gzip work.</text><text start="55" dur="5">What they do is they take a file like this and they look for common subsequences,</text><text start="60" dur="4">and they represent that in less than 1 byte.</text><text start="64" dur="8">For example, I-S-SPACE would be represented by 3 bytes in an ASCII encoding,</text><text start="72" dur="2">but in compressed encoding you could say, </text><text start="74" dur="4">&amp;quot;Hey, I see that sequence here. I see it here again. 
It&amp;#39;s going to show up many times.&amp;quot;</text><text start="78" dur="4">So maybe I can represent those 3 bytes just in terms of one,</text><text start="82" dur="4">saying this is a common subsequence that I&amp;#39;m going to see again and again.</text><text start="86" dur="5">Once we&amp;#39;ve done that for English, we come up with common subsequences in English.</text><text start="91" dur="6">Then if we add in another file that has a lot of the same common sequences,</text><text start="97" dur="3">like here it has I-S-SPACE again, </text><text start="100" dur="3">then that&amp;#39;s going to compress well with respect to this.</text><text start="103" dur="4">It&amp;#39;s not going to compress very well with respect to the Azerbaijani, </text><text start="107" dur="3">because that won&amp;#39;t have built up a code for I-S-SPACE.</text><text start="110" dur="8">That will have built up codes for things like R-B-A rather than for I-S-SPACE.</text><text start="118" dur="6">So it turns out that the ideas of compression and learning are actually very closely related,</text><text start="124" dur="6">and they&amp;#39;re related by information theory and this idea of entropy of an expression</text><text start="130" dur="3">or the information content.</text><text start="133" dur="3">That wasn&amp;#39;t discovered until fairly recently. 
</text><text start="136" dur="4">The two fields had developed independently, but now they&amp;#39;ve come back together,</text><text start="140" dur="3">and we understand how they relate.</text></transcript></video><video title="26 Segmentation.mp4" id="WakjpNgbTNo" length="76"><transcript><text start="0" dur="4">The next topic I want to address is called &amp;quot;Segmentation.&amp;quot;</text><text start="4" dur="3">This is the problem of given a sequence of language,</text><text start="7" dur="3">figure out how to break it up into words.</text><text start="10" dur="3">Now, in Chinese we don&amp;#39;t have spaces between the words,</text><text start="13" dur="4">and so in order to understand if the first word of this message corresponds </text><text start="17" dur="3">to a single character or two characters or what,</text><text start="20" dur="5">we have to be able to do the process of segmentation and figure out where they are.</text><text start="25" dur="6">In English, we don&amp;#39;t have that. 
Words have spaces between them.</text><text start="31" dur="2">So we don&amp;#39;t have the segmentation problem,</text><text start="33" dur="4">but we certainly have it in speech recognition in languages like English,</text><text start="37" dur="5">because the speech sounds are sometimes run together without pauses in between them,</text><text start="42" dur="5">and there are places where we do have a language without segmentation.</text><text start="47" dur="3">For example, in the language of URLs</text><text start="50" dur="6">you could have a URL like &amp;quot;choosespain.com&amp;quot;,</text><text start="56" dur="6">which is the travel site that tries to encourage you to choose Spain as your travel destination,</text><text start="62" dur="5">but if you segment it wrong, you&amp;#39;d come up with &amp;quot;chooses pain,&amp;quot;</text><text start="67" dur="5">which would not be the intended expression for that particular URL.</text><text start="72" dur="4">So segmentation is an important problem. 
Let&amp;#39;s talk about how to do it.</text></transcript></video><video title="27 Segmentation Probabilistic Model.mp4" id="wu8MFlMRH2U" length="91"><transcript><text start="0" dur="3">Let&amp;#39;s build a probabilistic word model of segmentation.</text><text start="3" dur="5">By definition, the best segmentation, which we&amp;#39;ll call S*, </text><text start="8" dur="7">is equal to the one which maximizes the joint probability of the segmentation.</text><text start="15" dur="3">So we&amp;#39;re going to segment the text into a sequence of words--</text><text start="18" dur="2">word 1 through word n--</text><text start="20" dur="6">and find that segmentation into words that maximizes the joint probability.</text><text start="26" dur="7">By the definition of joint probability, that&amp;#39;s the same as maximizing the product over the words</text><text start="33" dur="8">of the probability of each word given all the previous words.</text><text start="41" dur="5">Now this is going to be a little unwieldy to deal with, so we can make an approximation.</text><text start="46" dur="6">We can say that the best segmentation is approximately equal to the one that maximizes,</text><text start="52" dur="4">and what we could do here is we could make the Markov assumption</text><text start="56" dur="4">and say we&amp;#39;re only going to be considering the few previous words.</text><text start="60" dur="4">But I&amp;#39;m going to go all the way and make the naive Bayes assumption</text><text start="64" dur="4">and say we&amp;#39;re going to treat each word independently.</text><text start="68" dur="4">We just want to maximize the probability of each individual word </text><text start="72" dur="3">regardless of the word that comes before or after it.</text><text start="75" dur="4">Now, I know that that assumption is wrong and that the words do depend </text><text start="79" dur="2">on the words to the right or the left of them,</text><text start="81" dur="6">but I&amp;#39;m 
going to hope that this simplification is going to make the process of learning easier</text><text start="87" dur="4">and will turn out to be good enough.</text></transcript></video><video title="28 Probabilistic Model Question.mp4" id="12DCJGH97Zw" length="37"><transcript><text start="0" dur="2">Now for a quick quiz. </text><text start="2" dur="5">For a given string--say we have this string with 12 characters--</text><text start="7" dur="2">how many possible segmentations are there?</text><text start="9" dur="3">How many ways can we break this up into words?</text><text start="12" dur="5">And let&amp;#39;s answer that not just for 12 characters, but for n characters in general.</text><text start="17" dur="5">With n characters, how many ways of segmenting could there be?</text><text start="22" dur="10">Could there be n-1 ways, (n-1) squared, (n-1) factorial, or 2^(n-1)?</text><text start="32" dur="5">Tell me which of those you think is right.</text></transcript></video><video title="29 Probabilistic Model Answer.mp4" id="0oH9NOnMD1M" length="24"><transcript><text start="0" dur="5">The answer is 2^(n-1), and the way you can see that is here.</text><text start="5" dur="4">With 12 characters, there are 11 spaces in between characters,</text><text start="9" dur="8">and we can either place or not place a word segment in between each of the characters.</text><text start="17" dur="7">And so 11 of them either occur or don&amp;#39;t occur, so that&amp;#39;s 2 to the 11th.</text></transcript></video><video title="30 Best Segmentation 1.mp4" id="Gh7NppiwSuc" length="96"><transcript><text start="0" dur="2">Now, 2^n is a lot.</text><text start="2" dur="7">For example, if we have 30 characters in our string, then there&amp;#39;d be a billion possible segmentations to deal with.</text><text start="9" dur="3">We clearly don&amp;#39;t want to have to enumerate them all.</text><text start="12" dur="3">We&amp;#39;d like some way of searching through them efficiently </text><text start="15" 
dur="4">without having to consider the probability of every possible segmentation.</text><text start="19" dur="6">That&amp;#39;s one of the reasons why making this naive Bayes assumption is so helpful.</text><text start="25" dur="4">It means that there&amp;#39;s no interrelations between the various words,</text><text start="29" dur="2">so we can consider them one at a time.</text><text start="31" dur="2">That is, here&amp;#39;s one thing we can say.</text><text start="33" dur="6">We can say that the best segmentation is equal to the argmax </text><text start="39" dur="6">over all possible segmentations of the string into a first word and the rest of the words</text><text start="45" dur="8">of the probability of that first word times the probability of the best segmentation of the rest of the words.</text><text start="53" dur="2">And notice that this is independent.</text><text start="55" dur="5">The best segmentation of the rest of the words doesn&amp;#39;t depend on the first word.</text><text start="60" dur="3">And so that means we don&amp;#39;t have to consider all interactions,</text><text start="63" dur="3">and we don&amp;#39;t need to consider all 2^n possibilities.</text><text start="66" dur="4">So now we have two reasons why the naive Bayes assumption is a good thing.</text><text start="70" dur="3">One is it makes this computation much more efficient,</text><text start="73" dur="3">and secondly, it makes learning easier,</text><text start="76" dur="3"> because it&amp;#39;s easy to come up with a unigram probability.</text><text start="79" dur="4">What&amp;#39;s the probability of an individual word from our corpus of text?</text><text start="83" dur="4">It&amp;#39;s much harder to get combinations of multiple word sequences.</text><text start="87" dur="5">We&amp;#39;re going to have to do more smoothing, more guessing what those probabilities are, </text><text start="92" dur="4">because we just won&amp;#39;t have the counts for 
them.</text></transcript></video><video title="31 Best Segmentation 2.mp4" id="jMKZup6OtFk" length="126"><transcript><text start="0" dur="4">So given this formula and given our input string--</text><text start="4" dur="2">let&amp;#39;s stick with the familiar one--</text><text start="6" dur="4">we can start enumerating the possibilities for splitting up this string S</text><text start="10" dur="5">into a first word and a rest part and figuring out the probabilities.</text><text start="15" dur="11">So the first could be &amp;quot;n,&amp;quot; could be &amp;quot;no,&amp;quot; could be &amp;quot;now,&amp;quot; could be &amp;quot;nowi,&amp;quot; and so on,</text><text start="26" dur="10">and then the rest would be &amp;quot;owis...&amp;quot; or starting with &amp;quot;w&amp;quot; or starting with &amp;quot;is&amp;quot;</text><text start="36" dur="5">or starting with &amp;quot;s,&amp;quot; and then what&amp;#39;s the probability of the first.</text><text start="41" dur="4">Well, that we get from our corpus by counting and then smoothing,</text><text start="45" dur="6">and in our Shakespeare corpus &amp;quot;n&amp;quot; occurs infrequently--</text><text start="51" dur="6">about one in a million times--&amp;quot;no&amp;quot; occurs fairly frequently--about 0.004,</text><text start="57" dur="5">&amp;quot;now&amp;quot; 0.003, and &amp;quot;nowi&amp;quot; doesn&amp;#39;t occur at all, </text><text start="62" dur="4">and so we&amp;#39;d use some factor based on smoothing.</text><text start="66" dur="5">Then if we take the rest and multiply out this whole term,</text><text start="71" dur="5">the best segmentation of the rest times the probability of the first that comes from this column,</text><text start="76" dur="8">then that column will give us about 10 to -19 for the segmentation that starts with &amp;quot;n,&amp;quot;</text><text start="84" dur="3">10 to -13 for the one that starts with &amp;quot;no,&amp;quot; </text><text start="87" dur="4">10 to the -10 for the 
one that starts with &amp;quot;now,&amp;quot;</text><text start="91" dur="4">and 10 to -18 for the one that starts with &amp;quot;nowi.&amp;quot;</text><text start="95" dur="5">Again, that depends on exactly what type of smoothing you choose to do.</text><text start="100" dur="8">But it turns out that this row here is at least 1,000 times better than any of the other segmentations.</text><text start="108" dur="4">That is the segmentation that comes out &amp;quot;now is the time.&amp;quot;</text><text start="112" dur="6">So this model, simplified though it is, coming up with this naive Bayes assumption,</text><text start="118" dur="8">gets this one right, and it does about 99% of the segmentations accurately.</text></transcript></video><video title="32 Segment Code.mp4" id="ArjvPA5q0oQ" length="52"><transcript><text start="0" dur="5">Here we have a demonstration that the implementation of this algorithm into actual code</text><text start="5" dur="5">is not that much more complicated than the mathematical formulas I just described to you.</text><text start="10" dur="6">Here&amp;#39;s the function segment, which takes a text, and it does what we just said.</text><text start="16" dur="5">So it splits the text up into all possible first and rest components,</text><text start="21" dur="6">and then the candidates will be the first word plus the best segmentation of the rest,</text><text start="27" dur="3">and then out of all those candidates we just take the maximum </text><text start="30" dur="2">according to the probability of the words</text><text start="32" dur="5">where the probability of the words is just the product of the probability of each individual word.</text><text start="37" dur="4">So that&amp;#39;s the naive Bayes assumption coming into this definition,</text><text start="41" dur="5">and this is just the definition of how to split something up into a first and rest.</text><text start="46" dur="4">And you can follow the links in the note to see the source code 
for this</text><text start="50" dur="2">and play with it on your own if you like.</text></transcript></video><video title="33 Segment Question 1.mp4" id="c18FImnRfXo" length="79"><transcript><text start="0" dur="4">Now I want to give you an idea of how well the segmentation program performs.</text><text start="4" dur="3">Here I&amp;#39;ve trained it on a corpus of 4 billion words--</text><text start="7" dur="3">not just the Shakespeare corpus but a larger corpus,</text><text start="10" dur="4">and then I give it some test cases to try to find the best segmentation.</text><text start="14" dur="5">So I gave it the test case here. The program came up with &amp;quot;base rate sought to,&amp;quot;</text><text start="19" dur="3">but the correct answer was &amp;quot;base rates ought to.&amp;quot;</text><text start="22" dur="6">In this case, it just seems somewhat like bad luck that that was the right answer,</text><text start="28" dur="4">but both segmentations seem like good segmentations.</text><text start="32" dur="2">Next was this trial.</text><text start="34" dur="4">My program came up with &amp;quot;small and in significant,&amp;quot;</text><text start="38" dur="3">but the correct answer was &amp;quot;small and insignificant.&amp;quot;</text><text start="41" dur="4">Here it seems like it really has erred that &amp;quot;small and insignificant&amp;quot;</text><text start="45" dur="4">seems like a much better segmentation than the one my program came up with.</text><text start="49" dur="6">What I want you to tell me is what do you think could help us do a better job of getting the right answer.</text><text start="55" dur="4">Would it be helpful to gather more data?</text><text start="59" dur="3">Check that box if you think that would be helpful.</text><text start="62" dur="6">Would it be helpful to make a Markov assumption rather than the naive Bayes assumption?</text><text start="68" dur="2">Check here.</text><text start="70" dur="6">Or would it be helpful to do a 
better job with our smoothing algorithm? Check here.</text><text start="76" dur="3">And you can check more than one.</text></transcript></video><video title="34 Segment Answer 1.mp4" id="3II2BIAWPcw" length="47"><transcript><text start="0" dur="6">In this case, the problem really comes down to the naive Bayes assumption is</text><text start="6" dur="3">a weak one, and the Markov assumption would do much better.</text><text start="9" dur="3">It wouldn&amp;#39;t really help to have more data or to do a better job of smoothing,</text><text start="12" dur="4">because I already have good counts for words like &amp;quot;in&amp;quot; and &amp;quot;significant&amp;quot;</text><text start="16" dur="2">as well as words like &amp;quot;small&amp;quot; and &amp;quot;and.&amp;quot;</text><text start="18" dur="4">They&amp;#39;re all common enough that I have a good representation of how often they occur</text><text start="22" dur="3">as a unigram as a single word.</text><text start="25" dur="5">The problem is that we would like to know that the word &amp;quot;small&amp;quot; goes very well</text><text start="30" dur="5">with the word &amp;quot;insignificant&amp;quot; but does not go very well with the word &amp;quot;significant.&amp;quot;</text><text start="35" dur="5">So if we had a Markov model where the probability of &amp;quot;insignificant&amp;quot; depended </text><text start="40" dur="4">on the probability of &amp;quot;small,&amp;quot; then we could catch that,</text><text start="44" dur="3">and we could get this segmentation correct.</text></transcript></video><video title="35 Segment Question 2.mp4" id="fwlkZJLXqZE" length="30"><transcript><text start="0" dur="4">Now let&amp;#39;s move on, and I want to do just one more example.</text><text start="4" dur="6">Here&amp;#39;s this input, and my program came up with &amp;quot;g in or mouse go&amp;quot;--</text><text start="10" dur="5">a sequence of common words, but the correct answer was &amp;quot;ginormous 
ego.&amp;quot;</text><text start="15" dur="5">Again, what do you think could help us get the right answer this time?</text><text start="20" dur="5">More data? Making the Markov assumption rather than naive Bayes assumption?</text><text start="25" dur="5">Or doing a better job with smoothing? Check all the ones that you think might apply.</text></transcript></video><video title="36 Segment Answer 2.mp4" id="L8Ud1XGUL1c" length="59"><transcript><text start="0" dur="8">Here it seems to be a problem of not enough data and not a very good smoothing algorithm.</text><text start="8" dur="5">Now the problem was even though I had 4 billion words from which I trained my probabilistic model,</text><text start="13" dur="5">I had never seen the word &amp;quot;ginormous&amp;quot;--not once in those 4 billion.</text><text start="18" dur="4">Yet, I should be able to deal with it even if I haven&amp;#39;t seen the word before.</text><text start="22" dur="4">So having more data might mean that I would&amp;#39;ve seen &amp;quot;ginormous&amp;quot;</text><text start="26" dur="6">and I could have some probability for it rather than just making the Laplace smoothing assumption.</text><text start="32" dur="3">And having better smoothing could also help--</text><text start="35" dur="2">maybe something more sophisticated than Laplace,</text><text start="37" dur="5">maybe something that looks more carefully at the content of the word.</text><text start="42" dur="5">So it might have a letter model to say these letters look common,</text><text start="47" dur="7">ending in &amp;quot;ous&amp;quot;--that&amp;#39;s a common ending in English--so this looks more like a word,</text><text start="54" dur="5">even if I haven&amp;#39;t seen it before, than some other combination of letters.</text></transcript></video><video title="37 Spelling Correction.mp4" id="itckWrJi92M" length="157"><transcript><text start="0" dur="5">Now let&amp;#39;s do one more example of a probabilistic problem--this time, 
spelling correction.</text><text start="5" dur="3">That is, given a word that is possibly misspelled, </text><text start="8" dur="4">how do we come up with the best correction for that word?</text><text start="12" dur="2">We&amp;#39;re going to do the same type of analysis.</text><text start="14" dur="6">We&amp;#39;re saying we&amp;#39;re looking for the best possible correction, C*,</text><text start="20" dur="6">and that&amp;#39;s going to be the argmax over all possible corrections c to maximize</text><text start="26" dur="4">the probability of that correction given the word.</text><text start="30" dur="3">So that&amp;#39;s the definition of what it means to have the best correction.</text><text start="33" dur="5">Then we can start the analysis, and we can apply Bayes rule to say</text><text start="38" dur="7">that&amp;#39;s going to be equal to the probability of the word given the correction</text><text start="45" dur="3">times the probability of the correction.</text><text start="48" dur="4">Of course, in Bayes rule there&amp;#39;s a factor on the bottom, but that cancels out,</text><text start="52" dur="2">because it&amp;#39;s equal for all possible corrections.</text><text start="54" dur="5">So to choose the maximum, we just have to deal with these two probabilities.</text><text start="59" dur="3">Now, it may seem like we made a backwards step.</text><text start="62" dur="3">Here we had one probability to estimate.</text><text start="65" dur="5">Now we&amp;#39;ve applied Bayes rule and now we have two probabilities we have to estimate,</text><text start="70" dur="5">but the hope is that we can come up with data that can help us with this.</text><text start="75" dur="5">And certainly, these unigram statistics--what&amp;#39;s the probability of a correction?--</text><text start="80" dur="5">those we can get from our document counts, so we look at our corpus.</text><text start="85" dur="5">The probability of a correct word is from the data.</text><text 
start="90" dur="5">We just look at those counts and apply whatever smoothing we decided is best.</text><text start="95" dur="6">Now, the other part--what&amp;#39;s the probability that somebody typed the word w</text><text start="101" dur="4">when they meant to type the word c--that&amp;#39;s harder.</text><text start="105" dur="6">We can&amp;#39;t observe that directly by just looking at documents that are typed,</text><text start="111" dur="3">because there we only have the words as they are.</text><text start="114" dur="2">We don&amp;#39;t have both the intent and the word,</text><text start="116" dur="5">but maybe we can look at lists of spelling corrections.</text><text start="121" dur="3">So this is from spelling correction data.</text><text start="124" dur="4">Now that kind of data is much harder to come by.</text><text start="128" dur="6">It&amp;#39;s easy to go out and collect billions of words of regular text and do those counts,</text><text start="134" dur="3">but to find spelling correction data--that&amp;#39;s harder to do</text><text start="137" dur="4">unless you&amp;#39;re, say, already running a spelling correction service.</text><text start="141" dur="3">If you&amp;#39;re a big company that happens to run that, then it&amp;#39;s easy to collect the data.</text><text start="144" dur="2">But bootstrapping it is hard.</text><text start="146" dur="4">There are, however, some sites that will give you on the order of thousands</text><text start="150" dur="7">or tens of thousands of examples of misspellings, not billions or trillions.</text></transcript></video><video title="38 Spelling Data.mp4" id="AS-GBBRW-Xo" length="102"><transcript><text start="0" dur="6">Now, here I show some data that I&amp;#39;ve gathered from sites that deal with spelling correction,</text><text start="6" dur="6">and these are all examples of the correct spelling followed by misspelled words</text><text start="12" dur="3">and maybe multiple of them.</text><text start="15" 
dur="7">And from that we want to calculate the probability of a word given the correction.</text><text start="22" dur="7">So for example, we would like to know what&amp;#39;s the probability of P-L-U-S-E</text><text start="29" dur="4">being the word that&amp;#39;s spelled when the correct word was &amp;quot;pulse.&amp;quot;</text><text start="33" dur="5">And we do have examples of that here. We have a single example.</text><text start="38" dur="4">But it&amp;#39;s clear that we&amp;#39;re just not going to have enough to cover all </text><text start="42" dur="4">the possible words we want to deal with and all the possible misspellings for those words.</text><text start="46" dur="3">With only tens of thousands of examples,</text><text start="49" dur="4">there are so many words in English that we&amp;#39;re not going to have them all.</text><text start="53" dur="4">Instead of trying to deal with word-to-word spelling errors,</text><text start="57" dur="3">let&amp;#39;s deal with letter-to-letter errors.</text><text start="60" dur="6">And so let&amp;#39;s not say that this is &amp;quot;pulse&amp;quot; misspelled as &amp;quot;pluse,&amp;quot;</text><text start="66" dur="6">but rather let&amp;#39;s say this is U-L misspelled as L-U.</text><text start="72" dur="7">Here, let&amp;#39;s say this is the E in &amp;quot;elegant&amp;quot; misspelled as an A.</text><text start="79" dur="5">And we&amp;#39;ll look at these types of edits from one word to another,</text><text start="84" dur="8">a transposition between 2, a replacement, or an insertion or deletion of a single letter.</text><text start="92" dur="5">We&amp;#39;ll build up probability tables for those rather than probability tables for all the words.</text><text start="97" dur="5">That&amp;#39;s much easier to do with a smaller amount of data.</text></transcript></video><video title="39 Correction Example.mp4" id="RXHfBLULyOs" length="250"><transcript><text start="0" dur="3">Here&amp;#39;s an example of spelling 
correction in action.</text><text start="3" dur="6">Take the word w equals &amp;quot;thew,&amp;quot;</text><text start="9" dur="2">and we want to find the correction c </text><text start="11" dur="7">that maximizes the probability of w given c times the probability of c.</text><text start="18" dur="5">We start searching for the possible corrections c</text><text start="23" dur="5">that are close to our target word &amp;quot;thew&amp;quot; in terms of edit distance.</text><text start="28" dur="6">That is, first we start with all possible c that are one edit away,</text><text start="34" dur="8">replacing one letter, deleting one letter, inserting one letter, or transposing two letters.</text><text start="42" dur="3">And here we have a list of a few of those possible corrections.</text><text start="45" dur="3">So it could be &amp;quot;the&amp;quot; by deleting the &amp;quot;w.&amp;quot;</text><text start="48" dur="5">We could do no correction at all; we have to consider that as one of the possibilities.</text><text start="53" dur="2">We could replace the &amp;quot;e&amp;quot; with an &amp;quot;a.&amp;quot;</text><text start="55" dur="2">We could add an &amp;quot;r.&amp;quot;</text><text start="57" dur="4">We could transpose the &amp;quot;w&amp;quot; and the &amp;quot;e.&amp;quot;</text><text start="61" dur="4">Then we look into our spelling correction tables,</text><text start="65" dur="5">and again we reduce them from a word-based model to a letter- or edit-based one,</text><text start="70" dur="5">and we say what&amp;#39;s the probability of inserting a &amp;quot;w.&amp;quot;</text><text start="75" dur="6">Here we&amp;#39;ve conditioned the insert not just absolutely of inserting a &amp;quot;w&amp;quot; anywhere,</text><text start="81" dur="5">but for insertions and deletions, we condition them on the previous letter.</text><text start="86" dur="7">So what&amp;#39;s the probability of inserting a &amp;quot;w&amp;quot; given that the previous letter was an 
&amp;quot;e?&amp;quot;</text><text start="93" dur="2">It turns out that&amp;#39;s what the probability is,</text><text start="95" dur="2">and then we go through the list.</text><text start="97" dur="3">Here&amp;#39;s replacing an &amp;quot;e&amp;quot; with an &amp;quot;a.&amp;quot;</text><text start="100" dur="3">That&amp;#39;s one of the most common edits made in English,</text><text start="103" dur="3">one of the most common spelling corrections.</text><text start="106" dur="4">A 10th of a percent of all spelling errors are mistaking an &amp;quot;e&amp;quot; for an &amp;quot;a,&amp;quot;</text><text start="110" dur="2">and similarly down the list.</text><text start="112" dur="4">So we get this value for the probability of w given c,</text><text start="116" dur="3">and then the probability of the correction word c, </text><text start="119" dur="5">that we just get by looking up in our corpus how many times we have seen this word</text><text start="124" dur="2">and applying whatever smoothing we&amp;#39;re using.</text><text start="126" dur="5">Then we multiply them all out, and I&amp;#39;ve scaled these by a factor of 1 billion.</text><text start="131" dur="10">It turns out with the model I&amp;#39;ve built that &amp;quot;thew&amp;quot; is most probably corrected to &amp;quot;the.&amp;quot;</text><text start="141" dur="2">And that makes sense.</text><text start="143" dur="3">It&amp;#39;s easy to imagine your finger slipping off the &amp;quot;e&amp;quot; key and going over to</text><text start="146" dur="2">the &amp;quot;w&amp;quot; since they&amp;#39;re next to each other,</text><text start="148" dur="5">and &amp;quot;the&amp;quot; is a very common word in English.</text><text start="153" dur="4">But it&amp;#39;s troubling that the second possibility,</text><text start="157" dur="7">namely leaving &amp;quot;thew&amp;quot; alone and keeping it as is has such a high probability.</text><text start="164" dur="4">Now, it turns out &amp;quot;thew&amp;quot; is a 
word.</text><text start="168" dur="4">It&amp;#39;s rather archaic. It does show up in the Shakespeare corpus.</text><text start="172" dur="4">It has to do with muscle tissue,</text><text start="176" dur="2">but it&amp;#39;s a fairly uncommon word,</text><text start="178" dur="6">and how high it ranks depends in large part on the probability that we assign</text><text start="184" dur="3">to this edit of doing nothing at all.</text><text start="187" dur="4">Here I&amp;#39;ve assigned it a probability of 0.95.</text><text start="191" dur="4">That is, I&amp;#39;ve said for my probabilistic model,</text><text start="195" dur="7">I&amp;#39;ve made this choice to say I think that about 95% of the words are spelled correctly</text><text start="202" dur="2">and 5% are spelled incorrectly.</text><text start="204" dur="3">You have to make that choice in order to have a complete model.</text><text start="207" dur="4">The probability distribution has to be spread out over all possibilities,</text><text start="211" dur="2">and they have to sum up to one, so I&amp;#39;ve got to put it somewhere.</text><text start="213" dur="4">If I had made another choice, then these two could have been swapped around.</text><text start="217" dur="4">So the answer you get depends on the assumptions you make.</text><text start="221" dur="5">Still, we can have spelling correctors that are highly accurate.</text><text start="226" dur="5">This very simple model of just looking at unigram probabilities</text><text start="231" dur="7">and looking at the edits achieves accuracy in the 80% range.</text><text start="238" dur="5">If we go beyond that and start dealing with Markov assumptions </text><text start="243" dur="7">and looking at multiple word sequences, then we can get up into the high 90% range.</text></transcript></video><video title="40 Software Engineering.mp4" id="ZWo6pJmw7bI" length="182"><transcript><text start="0" dur="5">Now, let me back up just for a minute and talk about software engineering in 
general</text><text start="5" dur="3">rather than talking about specific AI techniques.</text><text start="8" dur="5">What I&amp;#39;m showing here is a small excerpt from the spelling correction code</text><text start="13" dur="5">from a project called Htdig, which is an open-source search engine. It&amp;#39;s a great search engine.</text><text start="18" dur="4">If you ever have need of one, you might want to check it out.</text><text start="22" dur="4">All the code is very straightforward and easy to deal with.</text><text start="26" dur="6">It has several thousand lines of code dealing with spelling correction.</text><text start="32" dur="2">Here we see a little bit of code. </text><text start="34" dur="6">It has the good idea of saying one word might be misspelled for another if they sound alike,</text><text start="40" dur="4">and so let&amp;#39;s go through each word and figure out what each letter is sounding like</text><text start="44" dur="3">and see if there are other words that sound similar.</text><text start="47" dur="4">So for example, here it&amp;#39;s saying what does a &amp;quot;c&amp;quot; sound like.</text><text start="51" dur="3">Well, &amp;quot;c&amp;quot; is ambiguous in English.</text><text start="54" dur="5">It has this &amp;quot;x&amp;quot; sound, the &amp;quot;ch&amp;quot; sound, this &amp;quot;s&amp;quot; or &amp;quot;k&amp;quot; sound,</text><text start="59" dur="4">and there&amp;#39;s all these possibilities about how it can have one sound or another.</text><text start="63" dur="3">Now imagine you&amp;#39;re in charge of maintaining this program.</text><text start="66" dur="4">In order for you to make sure that it&amp;#39;s right you have to do several things.</text><text start="70" dur="3">First, you could look at this comment and say, well, does this comment</text><text start="73" dur="4">accurately reflect the rules for English pronunciation?</text><text start="77" dur="7">Here, it&amp;#39;s talking about pronouncing a 
&amp;quot;c&amp;quot; as an &amp;quot;s&amp;quot; in the context of an &amp;quot;i,&amp;quot; &amp;quot;e,&amp;quot; or &amp;quot;y.&amp;quot;</text><text start="84" dur="2">What about the other vowels--&amp;quot;a&amp;quot; and &amp;quot;o?&amp;quot;</text><text start="86" dur="3">Were they left out by accident or is this correct?</text><text start="89" dur="2">So you&amp;#39;d have to do some work to check that out.</text><text start="91" dur="4">Then you&amp;#39;d have to do more work to say if this comment is correct,</text><text start="95" dur="4">is the comment correctly implemented in this code here?</text><text start="99" dur="4">In fact, just this sort of one page of code just dealing with a couple letters</text><text start="103" dur="7">is about the same length as all the code that we use to implement the probabilistic model.</text><text start="110" dur="5">But I think the most important difficulty in maintaining code like this</text><text start="115" dur="4">is that it&amp;#39;s so specific to the English language.</text><text start="119" dur="6">Imagine you&amp;#39;re in charge of maintaining it, and your boss or professor comes to you and says,</text><text start="125" dur="4">&amp;quot;Great job. 
Now I&amp;#39;d like you to make this work for </text><text start="129" dur="4">German and French and Azerbaijani and 50 other languages.&amp;quot;</text><text start="133" dur="5">You&amp;#39;d have to go through and understand the pronunciation rules in each of those languages</text><text start="138" dur="4">and edit a version of this code for each particular language.</text><text start="142" dur="2">That would be quite tedious.</text><text start="144" dur="4">But if you were dealing with a probabilistic model</text><text start="148" dur="2">and you were asked to work in another language,</text><text start="150" dur="5">all you would have to do is go out and collect a large corpus of words in that language.</text><text start="155" dur="3">Then you&amp;#39;d have the probability of the individual words.</text><text start="158" dur="3">And then find a corpus of spelling errors.</text><text start="161" dur="3">Then you&amp;#39;d have the probability of the spelling edits.</text><text start="164" dur="6">And so gathering that data is a much faster, much easier software engineering process</text><text start="170" dur="2">than writing this code by hand.</text><text start="172" dur="6">In a sense, you could say that machine learning over probabilistic models </text><text start="178" dur="4">is the ultimate in agile programming.</text></transcript></video></group><group title="Programming Project" count="1"><video title="01 Optional Problem.mp4" id="KuSg1wcty3s" length="?"><transcript><text start="0" dur="3">No subtitles...</text></transcript></video></group><group title="Unit 22" count="15"><video title="01 Sentence Structure.mp4" id="4JTPYE3N9BQ" length="?"><transcript><text start="0" dur="3">No subtitles...</text></transcript></video><video title="02 Parses Question.mp4" id="b3ApNWn5bOo" length="?"><transcript><text start="0" dur="3">No subtitles...</text></transcript></video><video title="03 Parses Answer.mp4" id="kPh4oSkH9_4" length="?"><transcript><text start="0" 
dur="3">No subtitles...</text></transcript></video><video title="04 Problems and Solutions Question.mp4" id="wO1B2uk6nnI" length="?"><transcript><text start="0" dur="3">No subtitles...</text></transcript></video><video title="05 Problems and Solutions Answer.mp4" id="e8eYBDecmDo" length="?"><transcript><text start="0" dur="3">No subtitles...</text></transcript></video><video title="06 Writing Grammars.mp4" id="z5UtqOw0Jrk" length="?"><transcript><text start="0" dur="3">No subtitles...</text></transcript></video><video title="07 PCFG.mp4" id="OcPN30Td4-Q" length="?"><transcript><text start="0" dur="3">No subtitles...</text></transcript></video><video title="08 PCFG Question.mp4" id="qnp7Cv6ZkVs" length="?"><transcript><text start="0" dur="3">No subtitles...</text></transcript></video><video title="09 PCFG Answer.mp4" id="3XvUVw5j0zA" length="?"><transcript><text start="0" dur="3">No subtitles...</text></transcript></video><video title="10 Probability Origins.mp4" id="AHivKiSFBE0" length="?"><transcript><text start="0" dur="3">No subtitles...</text></transcript></video><video title="11 Resolving Ambiguity.mp4" id="T6akmSxKX9I" length="?"><transcript><text start="0" dur="3">No subtitles...</text></transcript></video><video title="12 LPCFG.mp4" id="L6pkhoezlgc" length="?"><transcript><text start="0" dur="3">No subtitles...</text></transcript></video><video title="13 Parsing into a Tree.mp4" id="DFz_anV-Gps" length="?"><transcript><text start="0" dur="3">No subtitles...</text></transcript></video><video title="14 Machine Translation.mp4" id="Zt94uplW5pQ" length="?"><transcript><text start="0" dur="3">No subtitles...</text></transcript></video><video title="15 Translation Example.mp4" id="UJWUNNwZKS4" length="?"><transcript><text start="0" dur="3">No subtitles...</text></transcript></video></group><group title="Final" count="12"><video title="Question 1.mp4" id="Ph-cQukbUg4" length="80"><transcript><text start="0" dur="4">The very first question, is a search 
question.</text><text start="4" dur="3">You probably know about the Towers of Hanoi,</text><text start="7" dur="4">if you don&amp;#39;t then please go and Google them.</text><text start="11" dur="7">It&amp;#39;s a single player game, in which you try to move the tower of four slices over here,</text><text start="18" dur="3">onto the right peg, over here.</text><text start="21" dur="6">You can use the middle peg, but the rules are you can only move one disk at a time.</text><text start="27" dur="7">And it must never happen, that a small disk sits below a larger disk.</text><text start="34" dur="4">So the way to solve it is to move the disk over here,</text><text start="38" dur="2">the second largest disk to the right side,</text><text start="40" dur="2">the small one over,</text><text start="42" dur="3">and the third largest to the center, and so on.</text><text start="45" dur="4">If you know it, you know what I&amp;#39;m talking about.  If not, just Google it.</text><text start="49" dur="2">So, I would like to know, </text><text start="51" dur="6">what is the size of the state space of valid disk configurations in this puzzle.</text><text start="57" dur="2">Please enter this here.</text><text start="59" dur="5">I&amp;#39;d like to know, whether the number of the disks on the left peg</text><text start="64" dur="5">is an admissible heuristic, if you use A* search.</text><text start="69" dur="4">And I&amp;#39;d like to know, what is the number of steps</text><text start="73" dur="3">that an optimal solution will require</text><text start="76" dur="4">to move all the disks from the left peg, to the right peg.</text></transcript></video><video title="Question 2.mp4" id="YJzIM6YzAv0" length="72"><transcript><text start="0" dur="5">So here&amp;#39;s a Bayes Network, with 6 variables, A, B, C, D, E, and F.</text><text start="5" dur="2">And I&amp;#39;d like you to count parameters.</text><text start="7" dur="2">If this was a binary Bayes network,</text><text start="9" 
dur="3">where each variable can take on two values,</text><text start="12" dur="4">then, A would require one independent parameter,</text><text start="16" dur="2">and B another one.</text><text start="18" dur="4">And C would require four independent parameters,</text><text start="22" dur="5">because there&amp;#39;s four different ways A and B can come together in condition C.</text><text start="27" dur="3">Now in this question I&amp;#39;d like to ask you,</text><text start="30" dur="5">What happens if each node can assume three values, not just two?</text><text start="35" dur="3">So A can be, A1, A2, A3.</text><text start="38" dur="3">And C can be, C1, C2, C3.</text><text start="41" dur="4">For each node, specify the number of independent parameters required</text><text start="45" dur="3">to state the conditional probability of that node.</text><text start="48" dur="2">And I&amp;#39;ll tell you this is a tricky question,</text><text start="50" dur="5">So for A, the correct answer is two.</text><text start="55" dur="2">I won&amp;#39;t give you the other ones.</text><text start="57" dur="3">And, it&amp;#39;s two because A can take three values,</text><text start="60" dur="3">but it takes two independent parameters.</text><text start="63" dur="4">The last one can be inferred from, one minus the first two.</text><text start="67" dur="5">Please fill in the values for all the other variables.</text></transcript></video><video title="Question 3.mp4" id="FJ7Yxg0eG3o" length="50"><transcript><text start="0" dur="4">This is a true or false set of questions for Machine Learning.</text><text start="4" dur="3">Suppose we&amp;#39;ve trained a machine learning model,</text><text start="7" dur="4">and we&amp;#39;ve found really good values for our parameters in our model.</text><text start="11" dur="5">And now, we&amp;#39;re going to increase the noise,</text><text start="16" dur="2">that affects our data.</text><text start="18" dur="4">What should we do to accommodate the 
increase of noise?</text><text start="22" dur="5">Shall we increase k, if we&amp;#39;re using k nearest neighbor?</text><text start="27" dur="2">True or False?</text><text start="29" dur="4">Increase k if we are using the k means algorithm.</text><text start="33" dur="2">True or False?</text><text start="35" dur="4">Increase k if we are using Laplacian smoothing.</text><text start="39" dur="2">True or False?</text><text start="41" dur="3">Use fewer particles if we are using particle filters.</text><text start="44" dur="2">True or False?</text><text start="46" dur="2">And use more data if available.</text><text start="48" dur="2">True or False?</text></transcript></video><video title="Question 4.mp4" id="x-k2aZYPtHE" length="211"><transcript><text start="0" dur="3">So this is a planning question.</text><text start="3" dur="3">And I apologize, it&amp;#39;s a little bit hard to read.</text><text start="6" dur="2">There&amp;#39;s a lot of text here.</text><text start="8" dur="4">And I ask you to consult the pdf document to read the text.</text><text start="12" dur="4">Given the resources on the left, over here,</text><text start="16" dur="3">can we reach those five goals;</text><text start="19" dur="3">A, B, C, D, E, on the right side?</text><text start="22" dur="4">And in looking at those, there&amp;#39;s words like &amp;#39;consume&amp;#39;,</text><text start="26" dur="5">which means, the action eliminates the resource.</text><text start="31" dur="2">Whereas &amp;#39;use&amp;#39; means, </text><text start="33" dur="5">you have to have it, but you retain it after using it.</text><text start="38" dur="3">Now initially, you know there&amp;#39;s a couple of books;</text><text start="41" dur="2">one by Nau, about planning,</text><text start="43" dur="2">one by Zweben, about scheduling,</text><text start="45" dur="3">and one by Melville, about Whales.</text><text start="48" dur="2">And there&amp;#39;s also videos.</text><text start="50" dur="2">Video 8 is about 
Planning.</text><text start="52" dur="2">And Video 15 is about Scheduling.</text><text start="54" dur="2">These might be our in-class videos.</text><text start="56" dur="2">That&amp;#39;s your initial state.</text><text start="58" dur="4">And your goal is that you, as a student,</text><text start="62" dur="2">know about planning,</text><text start="64" dur="2">and you know about scheduling.</text><text start="66" dur="4">So the question is, with certain resources</text><text start="70" dur="3">that are available in the beginning,</text><text start="73" dur="2">and they differ from question to question,</text><text start="75" dur="4">can you attain the state of knowing about planning and scheduling?</text><text start="79" dur="4">Now, there&amp;#39;s two ways to know about a topic.</text><text start="83" dur="3">One is to study it using a book.</text><text start="86" dur="3">And one is to view it using a video.</text><text start="89" dur="5">In both cases, the outcome is to know about the topic over here.</text><text start="94" dur="3">Now either one has a different precondition.</text><text start="97" dur="3">In the &amp;#39;book&amp;#39; case, you have to have the book,</text><text start="100" dur="3">and the book has to be about the topic you care about.</text><text start="103" dur="3">In which case, the action &amp;#39;study&amp;#39; </text><text start="106" dur="3">lets you understand the book and you know about the topic.</text><text start="109" dur="3">So for example, if you have a book about planning,</text><text start="112" dur="4">and study it, then you know about planning.</text><text start="116" dur="5">In the &amp;#39;view&amp;#39; case, you have to have a video that&amp;#39;s about the topic,</text><text start="121" dur="3">and you have to have a certain bandwidth,</text><text start="124" dur="2">which happens to be 2.5.</text><text start="126" dur="4">If you don&amp;#39;t have the bandwidth 2.5, you won&amp;#39;t be able to view the video,</text><text 
start="130" dur="2">and you won&amp;#39;t be able to know about the topic. </text><text start="132" dur="3">That&amp;#39;s the way the problem is set up.</text><text start="135" dur="4">Now, books can be bought or borrowed.</text><text start="139" dur="4">In the buying case, you consume 50 dollars</text><text start="143" dur="2">In the borrowing case, </text><text start="145" dur="5">you have to have a privilege, at the library, that&amp;#39;s at least &amp;#39;1&amp;#39;. </text><text start="150" dur="2">It might be larger but it can&amp;#39;t be lower than &amp;#39;1&amp;#39;.</text><text start="152" dur="4">And in either case after doing this, you have the book,</text><text start="156" dur="3">and you can now plug this into the &amp;#39;study&amp;#39; action,</text><text start="159" dur="2">and you can read about it,</text><text start="161" dur="3">and study it, and know the topic.</text><text start="164" dur="2">So here are the questions.</text><text start="166" dur="3">If your resource is that you have 50 dollars,</text><text start="169" dur="2">and you have library privileges of &amp;#39;1&amp;#39;,</text><text start="171" dur="3">can you then attain the state of</text><text start="174" dur="3">knowing about planning and scheduling?</text><text start="177" dur="4">Secondly, suppose your resources is no dollars,</text><text start="181" dur="2">but you have library privileges of &amp;#39;2&amp;#39;,</text><text start="183" dur="3">can you attain the same state?</text><text start="186" dur="4">Third, what about the same with library privileges of &amp;#39;1&amp;#39;?</text><text start="190" dur="2">Can you get here?</text><text start="192" dur="3">Fourth, what about if you have 40 dollars,</text><text start="195" dur="3">and bandwidth of &amp;#39;3&amp;#39;?  
Can you get here?</text><text start="198" dur="5">And fifth, what about if you have bandwidth of &amp;#39;2&amp;#39;, and 95 dollars?</text><text start="203" dur="2">Can you get here?</text><text start="205" dur="3">Check all, or any, or none</text><text start="208" dur="3">of those five questions that apply.</text></transcript></video><video title="Question 5.mp4" id="O7V5O4wp0PM" length="84"><transcript><text start="0" dur="3">This is a question about logic.</text><text start="3" dur="3">We have four different statements.</text><text start="6" dur="1">Pink is True,</text><text start="7" dur="2">Pink or Green is True,</text><text start="9" dur="2">Pink and Green is True,</text><text start="11" dur="4">and not Pink implies that Green is True.</text><text start="15" dur="4">Now these statements could imply each other,</text><text start="19" dur="2">and in this matrix over here,</text><text start="21" dur="2">I&amp;#39;d like you to select each circle</text><text start="23" dur="5">where an implication is necessarily true. It&amp;#39;s always true.  
For example,</text><text start="28" dur="3">if you believe that Pink implies</text><text start="31" dur="4">that Pink or Green is True,</text><text start="35" dur="5">then mark the A implies B circle, over here.</text><text start="40" dur="3">If you believe that Pink is True implies</text><text start="43" dur="3">Pink and Green must be True,</text><text start="46" dur="4">then mark the circle A implies C, over here.</text><text start="50" dur="3">And so on for the entire matrix.</text><text start="53" dur="3">One hint, D looks complex,</text><text start="56" dur="4">but it happens to be the same as one of the previous cases.</text><text start="60" dur="4">So if you fill out the matrix for A to C first,</text><text start="64" dur="5">and then copy the result over for D,</text><text start="69" dur="4">it will be easier, than if you start thinking about D separately.</text><text start="73" dur="2">And to find the equivalency of D,</text><text start="75" dur="4">just write down the Truth Table of these different things over here,</text><text start="79" dur="5">and observe D is already represented among A to C.</text></transcript></video><video title="Question 6.mp4" id="yLZYp3z9gj4" length="87"><transcript><text start="0" dur="4">In this question we study a particle filter.</text><text start="4" dur="2">Let&amp;#39;s just zoom in for a second.</text><text start="7" dur="4">We have eight particles that land on this checkerboard.</text><text start="11" dur="4">They are labeled, &amp;#39;A&amp;#39; all the way to &amp;#39;H&amp;#39;.</text><text start="15" dur="3">And some of them are on black squares.</text><text start="18" dur="3">And some of them are on white squares.</text><text start="21" dur="2">Given those particles,</text><text start="23" dur="3">we&amp;#39;ll assume that the probability of measuring &amp;#39;black&amp;#39;,</text><text start="26" dur="3">for any particle that falls on a black square,</text><text start="29" dur="2">is 0.7.</text><text start="31" 
dur="3">And the probability of measuring &amp;#39;white&amp;#39;,</text><text start="34" dur="3">for any particle that falls on a white square,</text><text start="37" dur="2">is 0.6.</text><text start="39" dur="4">From that you can easily calculate the probability of measuring &amp;#39;white&amp;#39;,</text><text start="43" dur="2">if a particle falls on a black square.</text><text start="45" dur="2">And the probability of &amp;#39;black&amp;#39;,</text><text start="47" dur="3">if the particle falls on a white square.</text><text start="50" dur="5">Now I&amp;#39;d like to know what&amp;#39;s the importance weight, after normalization,</text><text start="55" dur="2">of the particle, labeled &amp;#39;A&amp;#39;,</text><text start="59" dur="4">if our measurement happens to be &amp;#39;white&amp;#39;?</text><text start="63" dur="3">That&amp;#39;s a number that you put in over here.</text><text start="66" dur="2">And I&amp;#39;m going to ask you the same question</text><text start="68" dur="2">about the normalized importance weight of particle &amp;#39;A&amp;#39;,</text><text start="70" dur="4">if the measurement is &amp;#39;black&amp;#39;.</text><text start="74" dur="2">To calculate this,</text><text start="76" dur="3">you will go through these probabilities.</text><text start="79" dur="3">For each particle, you will assign the measurement probability.</text><text start="82" dur="3">And then you just normalize all of those,</text><text start="85" dur="2">so they add up to one. 
</text></transcript></video><video title="Question 7.mp4" id="3cC6wC2M4ao" length="35"><transcript><text start="0" dur="4">In this question, we assume that a particle, </text><text start="4" dur="3">&amp;#39;A&amp;#39;, has an already normalized importance</text><text start="7" dur="2">weight of 0.2.</text><text start="9" dur="1">So there might be other particles,</text><text start="10" dur="2">we don&amp;#39;t even care how many.</text><text start="12" dur="3">But their importance weights add up to 0.8.</text><text start="15" dur="4">We now sample 3 new particles,</text><text start="19" dur="2">with replacement.</text><text start="21" dur="3">What is the probability that this particle, &amp;#39;A&amp;#39;,</text><text start="24" dur="2">is sampled at least once?</text><text start="26" dur="2">And the way you derive this is by asking</text><text start="28" dur="2">the question, what&amp;#39;s the probability </text><text start="30" dur="3">that particle &amp;#39;A&amp;#39; is never sampled?</text><text start="33" dur="2">And then you take the complement of this.</text></transcript></video><video title="Question 8.mp4" id="jTGOFreUxwc" length="31"><transcript><text start="0" dur="3">This is a question about Alpha-Beta Pruning,</text><text start="3" dur="3">in min-max search, in games.</text><text start="6" dur="2">Consider the following tree,</text><text start="8" dur="2">where this is the max node,</text><text start="10" dur="2">and these are min nodes.</text><text start="12" dur="2">We perform alpha-beta pruning.</text><text start="14" dur="3">I&amp;#39;d like you to check all leaf nodes,</text><text start="17" dur="3">of these 9 leaf nodes over here, in this tree,</text><text start="20" dur="2">that will be expanded, assuming</text><text start="22" dur="3">that we expand from the left to the right,</text><text start="25" dur="4">and we expand in depth first mode, of course,</text><text start="29" dur="2">as always in these game 
trees.</text></transcript></video><video title="Question 9.mp4" id="YbCjkPgEb8E" length="58"><transcript><text start="0" dur="3">These are four True or False questions</text><text start="3" dur="3">for computer vision, and specifically,</text><text start="6" dur="1">perspective projection.</text><text start="7" dur="3">Consider a projective image of an object.</text><text start="10" dur="3">Which of the following statements is true?</text><text start="13" dur="3">If the object moves closer to the camera,</text><text start="16" dur="3">the size of the projected image</text><text start="19" dur="2">of the object will increase.</text><text start="21" dur="3">Is this True or False?  Please just check one.</text><text start="24" dur="4">If we use a camera with a longer focal length,</text><text start="28" dur="3">as a result of using the longer focal length,</text><text start="31" dur="3">the size of the projected image will increase.</text><text start="34" dur="2">Check one.</text><text start="36" dur="2">If we double the distance to the object,</text><text start="38" dur="4">the projected image will be half as large, as before.</text><text start="42" dur="2">Check one.</text><text start="44" dur="3">And finally, the ratio of the focal length</text><text start="47" dur="3">over the distance to the object</text><text start="50" dur="5">is the same as the projected size of the object in the camera plane.</text><text start="55" dur="3">Please check True or False.</text></transcript></video><video title="Question 10.mp4" id="PwIZuqVGdVY" length="23"><transcript><text start="0" dur="3">A question on stereo vision.</text><text start="3" dur="2">An object at range of 100 meters</text><text start="5" dur="3">leads to a 2mm displacement for a stereo rig,</text><text start="8" dur="4">with focal length 40mm.</text><text start="12" dur="2">Now we double the baseline.</text><text start="14" dur="4">What will happen to the new displacement,</text><text start="18" dur="2">that used 
to be 2mm,</text><text start="20" dur="3">what will it be now?</text></transcript></video><video title="Question 11.mp4" id="ld74j6uuNaM" length="69"><transcript><text start="0" dur="3">Here&amp;#39;s a &amp;#39;Structure from Motion&amp;#39; type problem, </text><text start="3" dur="3">that is similar to what I asked you on a homework assignment.</text><text start="6" dur="4">Assume there is a world of 3 point features,</text><text start="10" dur="4">that will be named; 1, 2, 3, but I won&amp;#39;t tell you which one is which.</text><text start="14" dur="3">There are 4 pinhole cameras; A, B, C, and D.</text><text start="17" dur="3">And they all have a left, center, and right side.</text><text start="20" dur="2">Left, center, and right side.</text><text start="22" dur="2">And you should observe,</text><text start="24" dur="4">that the perceived order of features in the scene,</text><text start="28" dur="2">by virtue of using a pinhole,</text><text start="30" dur="3">will be inverted inside the pinhole camera.</text><text start="33" dur="2">So camera &amp;#39;A&amp;#39;, sees in the left position,</text><text start="35" dur="1">feature &amp;#39;1&amp;#39;,</text><text start="36" dur="2">on the center position, feature &amp;#39;2&amp;#39;,</text><text start="38" dur="3">on the right position, feature &amp;#39;3&amp;#39;.</text><text start="41" dur="1">I would like to know,</text><text start="42" dur="2">for which of the other cameras,</text><text start="44" dur="3">is it the case that feature &amp;#39;3&amp;#39;</text><text start="47" dur="4">will appear in the leftmost position?</text><text start="51" dur="2">So the leftmost position is &amp;#39;L&amp;#39; over here,</text><text start="53" dur="2">&amp;#39;L&amp;#39; over here, &amp;#39;L&amp;#39; over here.</text><text start="55" dur="4">Assuming the optical centers, shown over here.</text><text start="59" dur="4">Please check any or all of the following;</text><text start="63" dur="1">Camera 
B,</text><text start="64" dur="1">Camera C,</text><text start="65" dur="1">Camera D,</text><text start="66" dur="3">or None of them.</text></transcript></video><video title="Question 12.mp4" id="eaR3fMJ-jKM" length="79"><transcript><text start="0" dur="4">My final question is a simplified self-driving car question,</text><text start="4" dur="3">that is usually solved using dynamic programming.</text><text start="7" dur="3">But I have to warn you, the state space shown here</text><text start="10" dur="2">isn&amp;#39;t the full state space.</text><text start="12" dur="4">The orientation isn&amp;#39;t really made explicit, in this state space.</text><text start="16" dur="2">But suppose you have a road environment,</text><text start="18" dur="2">that has a straight street over here,</text><text start="20" dur="3">you can turn left, go straight, or turn right over here,</text><text start="23" dur="3">and similarly you can turn left or right over here.</text><text start="26" dur="4">And we assume that moving from one grid cell to the next</text><text start="30" dur="1">has a cost of &amp;#39;1&amp;#39;.</text><text start="31" dur="3">Turning left has a cost of &amp;#39;14&amp;#39;.</text><text start="34" dur="3">And turning right has a cost of &amp;#39;1&amp;#39; as well.</text><text start="37" dur="2">Let&amp;#39;s assume the robot when it turns,</text><text start="39" dur="2">stays in the same grid cell, but it only can turn once.</text><text start="41" dur="2">After it turned, it has to actually move.</text><text start="43" dur="4">So it&amp;#39;s impossible, for example, to turn right three times</text><text start="47" dur="3">just to avoid the cost of a left turn.</text><text start="50" dur="2">I would like to know,</text><text start="52" dur="3">what is the minimum total cost</text><text start="55" dur="3">of going from the start location, over here,</text><text start="58" dur="1">to location &amp;#39;A&amp;#39;.</text><text start="59" dur="3">I realize that there 
are many ways to get there.</text><text start="62" dur="2">I&amp;#39;d like to know the minimum.</text><text start="64" dur="2">So what&amp;#39;s the minimum cost to get to &amp;#39;A&amp;#39;,</text><text start="66" dur="4">irrespective of what orientation you assume at &amp;#39;A&amp;#39;?  I don&amp;#39;t really care.</text><text start="70" dur="3">The same from the start location to &amp;#39;B&amp;#39;.</text><text start="73" dur="3">And from the start location to &amp;#39;C&amp;#39;.</text><text start="76" dur="3">Please enter your best guesses on the right side.</text></transcript></video></group></videos>

