Does natural selection favor AIs over humans? Model this!

Dan Hendrycks argues it probably favors the AIs, paper here.  He is a serious person, well known in the area, home page here, and he gives a probability of doom above 80%.

I genuinely do not understand why he sees so much force in his own paper.  I am hardly “Mr. Journal of Economic Theory,” and I have plenty of papers that you could describe as a string of verbal arguments, but here is an instance where I would find an actual model very useful.  Evolutionary biology is full of them, as is economics.  Why not apply them to the AI Darwinian process?  Why leap to such extreme conclusions in the meantime?

Here are two very simple ideas I would like to see incorporated into any model:

1. At least in the early days of AIs, humans will reproduce and recommend those AIs that please them.  Really!  We already see this with people preferring GPT-4 to GPT-3.5, the popularity of Midjourney 5, and so on.  So, at least for a while, AIs will evolve to please us.  What that means over time is perhaps unclear (maybe some of us opt for ruthless?  But do we all seek to hire ruthless employees and RAs?  I for one do not), but surely it should be incorporated into the basic model.  How much ruthlessness do we seek to inject into the agents who do our bidding?  It depends on context, so is it the finance bots who will end the world?  Or perhaps the system will be tolerably decentralized and cooperative to a fair degree.  If you are skeptical there, OK, but isn’t that the main question you need to address?  (For one toy version of what I mean, see the first sketch after this list.)  And please do leave in the comments references to models that deploy these two assumptions.  (With the world at stake, surely you can do better than those bikers did!)

2. Humans can apply principal-agent contracts to the AI (again, at least for some while into the evolutionary process).  Keep in mind that if the AIs are risk-neutral (are they?), perhaps humans can achieve a first-best result from the AIs, just as they can with other humans.  If the AIs are risk-averse, in the final equilibrium they will shirk too much, but they still do a fair amount of work under many parameter values.  If they shirk altogether, we might stop investing in them, bringing us back to the evolutionary point.  (The second sketch after this list works through the standard version of this trade-off.)
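
To be concrete about point 1, here is the kind of toy model I have in mind: a few lines of replicator dynamics in which an AI variant’s “fitness,” early on, is simply how readily humans adopt and recommend it.  The three variants, the two user niches, and every number below are my own illustrative assumptions, nothing from the Hendrycks paper.

```python
# Toy sketch of point 1: early on, an AI variant's "fitness" is just how
# readily humans adopt and recommend it.  All traits, niches, and numbers
# here are made-up illustrative assumptions.
import numpy as np

# Rows are AI variants A, B, C; columns are (pleasingness, ruthlessness).
traits = np.array([[0.9, 0.1],   # A: helpful, mild
                   [0.7, 0.5],   # B: middling
                   [0.5, 0.9]])  # C: capable but ruthless

# How each niche of human users values the two traits: ordinary users
# penalize ruthlessness, finance users mildly reward it.
niche_weights = np.array([[1.0, -0.8],   # ordinary users
                          [1.0,  0.3]])  # finance
niche_sizes = np.array([0.9, 0.1])       # ordinary users dominate demand

utility = traits @ niche_weights.T       # variant-by-niche payoff to users
fitness = np.exp(utility) @ niche_sizes  # adoption propensity, averaged over niches

shares = np.ones(3) / 3                  # start with equal adoption shares
for _ in range(50):
    shares = shares * fitness            # replicator update: grow in
    shares = shares / shares.sum()       # proportion to human adoption

print("Long-run adoption shares (A, B, C):", np.round(shares, 3))
```

With these (made-up) weights the pleasing variant takes over almost entirely; enlarge the finance niche, or its tolerance for ruthlessness, and you get a mix instead.  That is exactly the comparative static I would want a real model to work out.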
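
And for point 2, the textbook linear-contract treatment of the principal-agent problem already delivers the comparative static I am gesturing at: a risk-neutral agent supplies first-best effort, while a risk-averse agent shirks somewhat yet still works under many parameter values.  The functional forms and numbers below are the standard assumptions of that model, again mine rather than anything in the paper.

```python
# Sketch of point 2 using the textbook linear-contract, CARA-normal
# principal-agent model; the functional forms and numbers are standard
# assumptions of that model, not anything from the Hendrycks paper.
#
# Output: x = e + noise, noise ~ N(0, sigma^2).  Wage: w = alpha + beta * x.
# Effort cost: c(e) = e^2 / 2, so first-best effort is e = 1.
# The AI (agent) responds with e = beta; the human (principal) optimally
# sets beta = 1 / (1 + a * sigma^2), where a is the AI's risk aversion.

def optimal_contract(risk_aversion: float, sigma: float = 1.0):
    """Return (incentive slope, effort, total surplus) under the optimal linear contract."""
    beta = 1.0 / (1.0 + risk_aversion * sigma**2)
    effort = beta                                   # agent's best response to the slope
    surplus = effort - effort**2 / 2 - 0.5 * risk_aversion * beta**2 * sigma**2
    return beta, effort, surplus

for a in (0.0, 0.5, 1.0, 2.0, 5.0):                 # 0 = risk-neutral AI
    beta, effort, surplus = optimal_contract(a)
    print(f"risk aversion {a}: slope {beta:.2f}, effort {effort:.2f} "
          f"(first-best 1.00), surplus {surplus:.2f} (first-best 0.50)")
```

At zero risk aversion the contract hits first-best; at moderate risk aversion effort and surplus fall, but hardly to zero.  Shirking, not doom, is the default prediction of this very basic setup.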

Neither of those points is the proverbial “rocket science”; rather, they are super-basic.  Yet neither plays much if any role in the Hendrycks paper.  There are some mentions of related points on, for instance, p. 17, but I don’t see a clear presentation of modeling the human choices in a decentralized process.  Page 21 does consider the decentralized incentives point a bit more, but it consists mostly of two quite anomalous examples, namely a dog pushing kids into the Seine so it could later save them (how often does that happen?), and “the India cobra story,” which is likely outright false.  The paper doesn’t offer sound anecdotal empirics, or much theoretical analysis of which kinds of assistants we will choose to invest in, again set within a decentralized process.

Dan Hendrycks, why are you so pessimistic?  Have you built such models, fleshing out these two assumptions, and simply not shown them to us?  Please show!

If the very future of the world is at stake, why not build such models?  Surely they might help us find some “outs,” but of course the initial problem has to be properly specified.

And more generally, what is your risk communication strategy here?  How secure, robust, and validated does your model have to be before you, a well-known figure in the field and Director of the Center for AI Safety, would feel justified in publicly announcing the above-80% figure?  Which model of risk communication practices (as, say, validated by risk communication professionals) are you following, if I may ask?

In the meantime, may I talk you down to a 79% chance of doom?
