This is so very common. Reinforcement learning really does have an issue with generalization, and I think addressing this will be an important area of reinforcement learning research to come.
It reminds me a little bit of training a simulated racing car to drive (with evolution, TD-family RL, and supervised learning). If it had never trained for a certain situation (such as colliding with a wall), it would simply never learn to deal with that situation.
How can we humans deal with situations we have not trained for? By reasoning about them. Simulating a sequence of actions in our head. System 2 thinking, in Kahneman terminology. Tree search using a forward model, in classic AI terminology. Which brings me to my second point.
It's important to note that the setup here is very unlike e.g. Go, where AlphaGo (and every other agent) can simulate the effects of its actions - it has a forward model. In Dota, that's not the case.
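To make the forward-model point concrete, here is a minimal sketch of what "tree search using a forward model" means (my own illustration, nothing to do with OpenAI's code; step, evaluate, and actions are hypothetical stand-ins for a fast simulator, a state evaluator, and the available moves):

```python
def plan(state, step, actions, evaluate, depth=3):
    """Pick the best immediate action by simulating action sequences ahead.

    step(state, action) -> next_state   # the forward model (a fast simulator)
    evaluate(state) -> float            # heuristic value of a state
    """
    def search(s, d):
        # Value of a state: evaluate it directly at the depth limit,
        # otherwise take the best value reachable by one more simulated action.
        if d == 0:
            return evaluate(s)
        return max(search(step(s, a), d - 1) for a in actions)

    # Choose the action whose simulated future looks best.
    return max(actions, key=lambda a: search(step(state, a), depth - 1))
```

The whole sketch hinges on being able to call step() cheaply. Go agents get that simulator essentially for free from the rules of the game; the Dota bots never get to call anything like it.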
Any "long-term planning" is something that the neural networks have had to learn, because we don't have a fast simulation of Dota available. Thus, it is impressive that the bots can do any long-term planning at all.
It's clear that the bots are much better at micro / (very) short-term planning, executing perfect combat on the second-to-second level. People barely even remark on this, because they are used to video game bots being better on micro than macro.
This applies to basically every video game, from StarCraft to Super Mario to Unreal Tournament.
But the OpenAI bots seem to be doing some kind of long-term planning. Or do they? Have they just learned some behaviors that look like long-term planning, without ever actually playing the long-term plan out in their "heads"?
OpenAI told me they train with a 14-minute time horizon, fwiw
That's interesting in that it puts an upper bound on the length of the plans they could potentially learn. But we still don't know what it is that they learn. Though what they learn looks a bit like planning from the outside :)
Replying to @togelius @tsimonite
That horizon is a 14-minute half-life on rewards, not a hard cutoff!

Aug 24, 2018 · 2:30 AM UTC
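For what it's worth, here is how a reward half-life translates into the usual per-step discount factor. This is my own back-of-the-envelope reading of "half-life on rewards", and the decision interval is an assumption on my part, not a number from OpenAI:

```python
# A reward t seconds in the future is weighted by 0.5 ** (t / half_life).
# With a per-step discount gamma that weight is gamma ** (t / dt),
# so gamma = 0.5 ** (dt / half_life).

half_life = 14 * 60      # the quoted 14 minutes, in seconds
dt = 4 / 30              # assumed decision interval: one action every 4th frame at 30 fps

gamma = 0.5 ** (dt / half_life)
print(gamma)             # ~0.99989: rewards decay gently rather than being cut off
```

So a 14-minute half-life corresponds to a discount factor very close to 1, which is consistent with the "not a hard cutoff" clarification above.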
