We're now on our 3rd generation of systems that learn from human preferences (cf. openai.com/blog/deep-reinfor…, openai.com/blog/fine-tuning-…).
I'm hopeful that this approach will ultimately help align powerful AI systems without needing to explicitly write down "what humans want".
A very rare bit of research that is directly, straight-up relevant to real alignment problems! They trained a reward function on human preferences AND THEN measured how hard you could optimize against the trained function before the results actually got worse.
Sep 4, 2020 · 7:08 PM UTC
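A minimal, self-contained sketch of the two steps described above, not the authors' code: (1) fit a reward model to pairwise preferences, here simulated from a hidden "true" reward standing in for the human, and (2) optimize increasingly hard against the learned reward via best-of-n sampling while tracking the true reward to see where the proxy stops being a good guide. All names, dimensions, and hyperparameters are illustrative assumptions; in this toy setup the divergence may be mild, but the measurement procedure is the point.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
DIM = 16  # feature dimension of a "sample" (stand-in for a summary, answer, etc.)

# Hidden "true" reward: what the simulated human actually prefers. Unknown to the model.
true_w = torch.randn(DIM)
def true_reward(x):          # x: (batch, DIM)
    return x @ true_w

# Learned reward model: a small MLP trained only from preference comparisons.
reward_model = nn.Sequential(nn.Linear(DIM, 64), nn.Tanh(), nn.Linear(64, 1))

# --- Step 1: train on simulated pairwise preferences (Bradley-Terry / logistic loss) ---
opt = torch.optim.Adam(reward_model.parameters(), lr=1e-3)
for step in range(2000):
    a = torch.randn(256, DIM)   # candidate A
    b = torch.randn(256, DIM)   # candidate B
    # Simulated (noisy) human label: prefer whichever candidate has higher true reward.
    prefer_a = (true_reward(a) + 0.5 * torch.randn(256)
                > true_reward(b) + 0.5 * torch.randn(256)).float()
    logits = (reward_model(a) - reward_model(b)).squeeze(-1)
    loss = nn.functional.binary_cross_entropy_with_logits(logits, prefer_a)
    opt.zero_grad()
    loss.backward()
    opt.step()

# --- Step 2: optimize against the learned reward with best-of-n sampling ---
# Larger n means harder optimization against the proxy; a growing gap between
# proxy and true reward is the over-optimization signal being measured.
with torch.no_grad():
    for n in [1, 4, 16, 64, 256, 1024]:
        cands = torch.randn(n, DIM)
        best = cands[reward_model(cands).squeeze(-1).argmax()]
        print(f"best-of-{n:4d}: proxy={reward_model(best).item():+.2f} "
              f"true={true_reward(best.unsqueeze(0)).item():+.2f}")
```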

