The original version of this story appeared in Quanta Magazine. Two years ago, in a project called the Beyond the Imitation Game benchmark, or BIG-bench, 450 researchers compiled a list of 204 tasks designed to test the capabilities of large language models, which power chatbots like ChatGPT. On most tasks, performance improved predictably and smoothly as the models scaled up: the larger the model, the better it got. But on other tasks, the jump in ability wasn't smooth. The performance re

Read the full article at Wired