it's a whole branch of mathematics. looking at it from a pure language perspective isn't really useful because language models don't really ~~think~~ work in language. they ~~think~~ work in text. "llms are just language" is misleading because language implies a certain structure while language models use a completely different structure.
i don't have any proper sources but here's a quick overview off the top of my head:
a large language model is a big pile of vectors (a vector here is basically a list of numbers). the "number of parameters" in a machine learning model is the total count of all those numbers added up across every vector in the model (not the length of any single list). these vectors represent coordinates on an n-dimensional "map of words". words that are related are "closer together" on this map. once you have this map, you can use vector math to find word associations. this is important because vector math is all hardware accelerated (because of 3D graphics).
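to make the vector math bit concrete, here's a rough python sketch. the vectors are tiny and completely made up (real models learn hundreds or thousands of dimensions from data), so treat it as an illustration of the idea, not of any actual model:

```python
# a made-up 3-dimensional "map of words"; real embeddings are learned, not hand-written
import numpy as np

vectors = {
    "king":  np.array([0.9, 0.8, 0.1]),
    "queen": np.array([0.9, 0.2, 0.1]),
    "man":   np.array([0.1, 0.8, 0.0]),
    "woman": np.array([0.1, 0.2, 0.0]),
}

def cosine(a, b):
    # "closeness" on the map: 1.0 means the vectors point the same way
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# the classic association trick: king - man + woman lands closest to queen
target = vectors["king"] - vectors["man"] + vectors["woman"]
best = max(vectors, key=lambda w: cosine(vectors[w], target))
print(best)  # queen
```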
the training process builds the map by looking at how words and concepts appear in the input data and adjusting the numbers in the vectors until the map fits that data. the more data, the more general the resulting map. the inference process then uses the input text as its starting point and "walks" the map.
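here's the "walking the map" loop in miniature, using a simple table of which word follows which (a bigram table) instead of a real model. the counting step stands in for training and the sampling loop stands in for inference; it's a sketch of the shape of the process, not of how an actual llm works:

```python
import random
from collections import defaultdict

training_text = "the cat sat on the mat the cat ate the fish".split()

# "training": count which word follows which in the input data
follows = defaultdict(list)
for current, nxt in zip(training_text, training_text[1:]):
    follows[current].append(nxt)

# "inference": start from the input text and walk the map one word at a time
word = "the"
output = [word]
for _ in range(6):
    if not follows[word]:
        break  # dead end: nothing ever followed this word in training
    word = random.choice(follows[word])
    output.append(word)

print(" ".join(output))  # e.g. "the cat sat on the mat the cat"
```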
the emergent behaviour that some people call intelligence stems from the fact that the training process makes "novel" connections. words that are related are close together, but so are words that sound the same, for example. the more parameters a model has, the more connections it can make, and vice versa. this can lead to the "overfitting" problem, where the amount of input data is so small relative to the model that the only associations it learns are the ones in the actual input documents. using the map analogy, there may be starting points from which there is only one possible path. the data is not actually "in" the model, but it can be recreated exactly. the opposite can also happen, where there are so many connections for a given word that the actual topic can't be inferred from the input and the model just goes off on a tangent.
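overfitting in miniature, using the same bigram-table sketch from above: with a tiny, repetition-free training text every word has exactly one possible follower, so the only walk the map allows is the training text itself. the text isn't stored anywhere, but it falls out of the model exactly:

```python
from collections import defaultdict

tiny_text = "we hold these truths to be self evident".split()

# with this little data, every word has exactly one follower
follows = defaultdict(list)
for current, nxt in zip(tiny_text, tiny_text[1:]):
    follows[current].append(nxt)

word = "we"
output = [word]
while follows[word]:
    word = follows[word][0]  # only one possible path to take
    output.append(word)

print(" ".join(output))  # "we hold these truths to be self evident"
```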
why this is classed as intelligence i could not tell you.
Edit: replaced some jargon that muddied the point.
something related: you know how compressed jpegs always have visible little squares in them? jpeg compression works by slicing the image into little squares, turning each square into a set of frequency patterns with a mathematical transform called the discrete cosine transform, and then throwing away the finer patterns in each square. the more you compress, the more detail gets thrown away, so each square is a rougher match for what was originally there and the seams between squares become visible.
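a rough sketch of what happens inside one 8x8 jpeg block, assuming numpy and scipy are available. real jpeg divides the coefficients by a quantization table and rounds them; zeroing out the fine-detail coefficients here is a simplification that shows the same effect:

```python
import numpy as np
from scipy.fft import dctn, idctn

rng = np.random.default_rng(0)
block = rng.integers(0, 256, size=(8, 8)).astype(float)  # one 8x8 block of pixels

coeffs = dctn(block, norm="ortho")   # the block expressed as frequency patterns
mask = np.zeros_like(coeffs)
mask[:3, :3] = 1                     # keep only the 9 coarsest patterns ("compression")
restored = idctn(coeffs * mask, norm="ortho")

# the restored block is a rougher approximation of the original; do this to
# every block independently and the seams between them start to show
print(np.abs(block - restored).mean())
```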
you can do this with text models as well. increasing jpeg compression is like lowering the number of parameters: the fewer parameters, the worse the model. if you compress too much, the model starts to blend concepts together or mistake words for one another.
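a toy version of that blending, again with made-up vectors: keep fewer dimensions of each word vector (standing in for a lower parameter count) and two different senses of a word collapse onto each other:

```python
import numpy as np

# invented 4-dimensional vectors for two senses of "bank"
money_bank = np.array([0.9, 0.1, 0.8, 0.0])
river_bank = np.array([0.9, 0.1, 0.0, 0.9])

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

for dims in (4, 2):  # full "model" vs heavily "compressed" one
    print(dims, "dims:", round(cosine(money_bank[:dims], river_bank[:dims]), 2))

# 4 dims: ~0.53, the two senses are still distinguishable
# 2 dims: 1.0, they look identical, so the model blends them together
```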
what the ai bros are saying now is that if you go the other way, the model may become self-aware. in my mind that's like saying that if you make a jpeg large enough, it will become real.