The linked pdf lists the deficiencies of the LLM responses. They are varied and it sometimes misses the mark completely or cant grasp vital context.
Still pretty useless comparison, they testet 10 university level humans against Llama2-70B. The model has fallen out of use completely by now and was never really great at summarization. The study didnt fine tune it either, so this isnt really representative of the current situation.
There are far better models out, that were either especially trained for summarization or can be easily fine tuned to excel at it. Not to mention the Llama3 and 3.1 series, with the crazy 405B model.
What is the boob shape of the girl on the right called?