this post was submitted on 04 Jul 2025

108 points (99.1% liked)

Why shouldn't you use YAML to store eye tracking data? /s (lemmy.world)

submitted 11 hours ago* (last edited 6 hours ago) by qaz@lemmy.world to c/programmer_humor@programming.dev

[–] deegeese@sopuli.xyz 13 points 5 hours ago (1 children)

If you’re using a library to handle deserialization, the ugliness of the serial format doesn’t matter that much.

Just call yaml.load() and forget about it.

[–] BodilessGaze@sh.itjust.works 3 points 3 hours ago (1 children)

That works until you realize your calculations are all wrong due to floating point inaccuracies. YAML doesn't require any level of precision for floats, so different parsers on a document may give you different results.
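
To make the parser-dependence concrete, here is a rough Python sketch. It assumes one parser keeps 64-bit doubles and another keeps 32-bit floats; the single-precision parser is only emulated by rounding through `struct`, and the sample value is made up rather than taken from the file in the post.

```python
import struct


def parse_as_float32(text: str) -> float:
    """Parse decimal text, then round the result to IEEE 754 single precision."""
    return struct.unpack("<f", struct.pack("<f", float(text)))[0]


# Hypothetical sample value, not taken from the original file.
text = "0.1234567890123456789"

as_double = float(text)             # what a double-precision parser would keep
as_single = parse_as_float32(text)  # what a single-precision parser would keep

print(as_double)
print(as_single)
print(as_double == as_single)
```

The two results agree only to about seven significant digits, so anything computed from them can drift apart depending on which parser read the file.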

[–] deegeese@sopuli.xyz 3 points 3 hours ago (1 children)

What text based serialization formats do enforce numeric precision?

AFAIK it’s always left up to the writer (serializer)

[–] BodilessGaze@sh.itjust.works 3 points 2 hours ago* (last edited 2 hours ago)

Cuelang: https://cuelang.org/docs/reference/spec/#numeric-values

Implementation restriction: although numeric values have arbitrary precision in the language, implementations may implement them using an internal representation with limited precision. That said, every implementation must:

Represent integer values with at least 256 bits.

Represent floating-point values with a mantissa of at least 256 bits and a signed binary exponent of at least 16 bits.

Give an error if unable to represent an integer value precisely.

Give an error if unable to represent a floating-point value due to overflow.

Round to the nearest representable value if unable to represent a floating-point value due to limits on precision.

These requirements apply to the result of any expression except for builtin functions, for which an unusual loss of precision must be explicitly documented.

[–] fibojoly@sh.itjust.works 11 points 6 hours ago (3 children)

I'm amazed at developers who don't grasp that you don't need to have absolutely everything under the sun in a human readable file format. This is such a textbook case...

[–] chaospatterns@lemmy.world 2 points 1 hour ago

Yeah this isn't even human readable even when it's in YAML. What am I going to do? Read the floats and understand that the person looked left?

[–] marcos@lemmy.world 3 points 4 hours ago (1 children)

Even if you want it to be human readable, you don't need to include the name in every field and use balanced separators.

Any CSV variant would be an improvement already.

[–] fibojoly@sh.itjust.works 2 points 38 minutes ago

Even using C#'s decimal type (128-bit) would be an improvement! I count 22 characters per number here. So a minimum of 176 bits.

[–] Dultas@lemmy.world 1 points 3 hours ago

That's it everyone, back to copybooks.

[–] nathan@piefed.alphapuggle.dev 53 points 10 hours ago (1 children)

This isn't YAML, this is just sparkling JSON

[–] ZoteTheMighty@lemmy.zip 11 points 5 hours ago (1 children)

All yaml is just sparkling JSON.

[–] olafurp@lemmy.world 5 points 5 hours ago

Always has been

[–] raman_klogius@ani.social 14 points 8 hours ago* (last edited 2 hours ago) (2 children)

Why you shouldn't use YAML

[–] BodilessGaze@sh.itjust.works 3 points 3 hours ago

YAML doesn't require any level of accuracy for floating point numbers, and that doc appears to have numbers large enough to run into problems for single-precision floats (maybe double too). That means different parsers could give you different results.

[–] Damage@feddit.it 18 points 7 hours ago

The best approach would be to never use yaml for anything

[–] MonkderVierte@lemmy.zip 37 points 11 hours ago (1 children)

Maybe use a real database for that? I'm a fan of simple tools (e.g. plaintext) for simple use cases, but please use appropriate tools.

[–] nous@programming.dev 8 points 10 hours ago (5 children)

What is wrong with a file for this? Sounds more like a local log or debug output that a single thread in a single process would be creating. A file is fine for high volume append only data like this. The only big issue is the format of that data.

What benefit would a database bring here?

[–] NeatNit@discuss.tchncs.de 19 points 10 hours ago (2 children)

I think SQLite is a great middle ground. It saves the database as a single .db file, and can do everything an SQL database can do. Querying for data is a lot more flexible and a lot faster. The tools for manipulating the data in any way you want are very good and very robust.

However, I'm not sure how it would affect file size. It might be smaller because JSON/YAML wastes a lot of characters on redundant information (field names) and storing numbers as text, which the database would store as binary data in a defined structure. On the other hand, extra space is used to make common SQL operations happen much faster using fancy data structures. I don't know which effect is greater so file size could be bigger or smaller.
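
For a feel of what that looks like, here is a minimal sketch using Python's built-in `sqlite3` module. The table and column names (`gaze`, `t`, `x`, `y`) and the sample values are assumptions for illustration; the real export's fields may differ.

```python
import sqlite3

# Hypothetical schema: a timestamp plus gaze coordinates.
conn = sqlite3.connect(":memory:")  # pass a path such as "gaze.db" to get a single file
conn.execute("CREATE TABLE gaze (t REAL NOT NULL, x REAL NOT NULL, y REAL NOT NULL)")

# Made-up samples; REAL columns store them as 8-byte floats, not text.
samples = [(0.000, 0.51, 0.48), (0.016, 0.52, 0.47), (0.033, 0.10, 0.49)]
conn.executemany("INSERT INTO gaze (t, x, y) VALUES (?, ?, ?)", samples)
conn.commit()

# Example query: samples where the gaze falls in the left third of the screen.
left = conn.execute(
    "SELECT t, x FROM gaze WHERE x < ? ORDER BY t", (1 / 3,)
).fetchall()
print(left)
```

It says nothing about the file-size question, only about how little code the querying side needs.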

[–] GenderNeutralBro@lemmy.sdf.org 3 points 3 hours ago (1 children)

SQLite would definitely be smaller, faster, and require less memory.

Thing is, it's 2025, roughly 20 years since anybody's given half a shit about storage efficiency, memory efficiency, or even CPU efficiency for anything so small. Presumably this is not something they need to query dynamically.

[–] NeatNit@discuss.tchncs.de 2 points 2 hours ago (1 children)

True (in most contexts, probably including this one), but I think that only makes the case for SQLite stronger. What people do still care about is a good, flexible, usable and reliable interface. I'm not sure how to get that with YAML.

[–] nous@programming.dev 1 points 1 hour ago

YAML is not a good format for this. But any line-based or streamable format would be good enough for log data like this. Really easy to parse with any language or even directly with shell scripts. No need to even know SQL, any text processing would work fine.

[–] Scrath@lemmy.dbzer0.com 6 points 6 hours ago (1 children)

I didn't look too much at the data but I think CSV might actually be an appropriate format for this?

Nice simple plaintext and very easy to parse into a data structure for analysing/using it in Python or similar.
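
As a rough idea of the parsing side, a sketch with Python's standard `csv` module. The column names and values are invented, and `io.StringIO` stands in for an opened file.

```python
import csv
import io

# Hypothetical column names and values; the real export's fields may differ.
raw = "t,x,y\n0.000,0.51,0.48\n0.016,0.52,0.47\n"

# DictReader uses the header row as keys; every value arrives as text,
# so each one is converted to float explicitly.
rows = [
    {name: float(value) for name, value in row.items()}
    for row in csv.DictReader(io.StringIO(raw))
]
print(rows[0])
```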

[–] nous@programming.dev 2 points 1 hour ago

CSV would be fine. The big problem with the data as presented is that it is a YAML list, so the whole file needs to be read into memory and decoded before you get any values out of it. Any line-based encoding would be vastly better and allow line-based processing to be done. CSV, JSON objects each encoded on a single line, or some streaming binary format would all work. Does not make much difference overall as long as it is line based or at least streamable.
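
A minimal sketch of what line-based processing could look like, assuming one JSON object per line. The field names (`t`, `x`) and values are hypothetical, and `io.StringIO` stands in for something like `open("gaze.jsonl")`.

```python
import io
import json
from typing import Iterator, TextIO


def iter_samples(stream: TextIO) -> Iterator[dict]:
    """Yield one decoded record per non-empty line, never holding the whole file."""
    for line in stream:
        line = line.strip()
        if line:
            yield json.loads(line)


# Hypothetical records, one JSON object per line.
data = io.StringIO(
    '{"t": 0.0, "x": 0.51}\n'
    '{"t": 0.016, "x": 0.52}\n'
    '{"t": 0.033, "x": 0.10}\n'
)

count = 0
total_x = 0.0
for sample in iter_samples(data):
    count += 1
    total_x += sample["x"]

mean_x = total_x / count
print(count, mean_x)
```

Memory use stays flat no matter how long the recording is, because only one line is decoded at a time.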

[–] towerful@programming.dev 12 points 10 hours ago (1 children)

Smaller file size, lower data rate, less computational overhead, no conversion loss.

A 64 bit float requires 64 bits to store.
ASCII representation of a 64 bit float (in the example above) is 21 characters or 168 bits.
Also, if every record is the same then there is a huge overhead for storing the name of each value. Plus the extra spaces, commas and braces.
So, you are at least doubling the file size and data throughput. And there is precision loss when converting float-string-float. Plus the computational overhead of doing those conversions.

Something like sqlite is lightweight, fast and will store the native data types.
It is widely supported, and allows for easy querying of the data.
Also makes it easy for 3rd party programs to interact with the data.

If you are ever thinking of implementing some sort of data storage in files, consider sqlite first.
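
The size arithmetic can be checked with a few lines of Python. This is only a sketch with a made-up value, and the text format used (`.17e`) is an assumption about how the numbers were written. Note that whether the float-string-float round trip loses anything depends on how many digits the writer emits: 17 significant digits are enough to recover a 64-bit float exactly, fewer are not.

```python
import struct

value = 0.1234567890123456789  # hypothetical sample, held as a 64-bit float

binary = struct.pack("<d", value)  # raw IEEE 754 double
text = f"{value:.17e}"             # scientific notation, as plain ASCII

print(len(binary), "bytes as binary")
print(len(text), "bytes as ASCII text:", text)

# Round trip through text with different numbers of significant digits.
roundtrip_17 = float(f"{value:.17g}") == value  # enough digits for a double
roundtrip_6 = float(f"{value:.6g}") == value    # too few digits
print(roundtrip_17, roundtrip_6)
```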

[–] nous@programming.dev 1 points 1 hour ago

Never said it had to be a text file. There are many binary serialization formats that could be used. But in a lot of situations the overhead you save is not worth the debugging effort of working with binary data. For something like this, which is likely not going to be more than a GB or so and probably much less, it really does not matter that much whether you use a binary or text format. This is an export format that will likely just have one batch processing layer on top of it. This type of thing is generally easier for more people to work with in a plain text format. If you really need efficient querying of the data then it is trivial and quick to load it into a DB of your choice rather than being stuck with SQLite.

[–] qaz@lemmy.world 5 points 10 hours ago* (last edited 10 hours ago)

It's used to export tracking data to analyze later on. Something like SQLite seems like a much better choice to me.

[–] Azzu@lemmy.dbzer0.com 3 points 9 hours ago (1 children)

Because this is not log or debug data as OP said. In any case, what do you think would happen with this data? It will be analyzed by some sort of tool because no one could manually look at this much text data. In text, this can be like 1MB of data per second. So in a normal eye tracking session, probably hundreds of MB. The problem isn't the storage space, but the time it will take to read that in and analyze it each time, forcing you to wait for processing or use lots of memory while reading it. And anyway, in most languages, it's actually much easier to store the number values directly (in 8 bytes, not the 30something this text representation uses) than to convert them to JSON; all languages have some built-in way to do that. And even if not, SQLite is piss-easy and does everything for you, being as simple as JSON.

There is just no reason to do it like that unless you just don't think about what you're doing or have no clue.
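
A sketch of the "store the values directly" option in Python, using the built-in `struct` module. The record layout (three little-endian doubles for `t`, `x`, `y`) is an assumption made up for this example, and `io.BytesIO` stands in for a binary file.

```python
import io
import struct

# Hypothetical record layout: three little-endian doubles (t, x, y), 24 bytes per sample.
RECORD = struct.Struct("<3d")

samples = [(0.000, 0.51, 0.48), (0.016, 0.52, 0.47)]

buf = io.BytesIO()  # stands in for open("gaze.bin", "wb")
for sample in samples:
    buf.write(RECORD.pack(*sample))

# Read back: every record has the same size, so no text parsing is involved
# and the values come back bit-for-bit identical.
decoded = list(RECORD.iter_unpack(buf.getvalue()))
print(decoded == samples)
```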

[–] nous@programming.dev 1 points 2 hours ago

export tracking data to analyze later on

That is essentially log data or essentially equivalent. Log data does not have to be human readable, it is just a series of events that happen over time. Most log data, even what you would think of as traditional messages from a program, is not parsed by humans manually but analyzed by code later on. It is really not that hard or slow to process log data line by line. I have done this with TBs of data before, which does require a lot more effort. A simple file like this would take seconds to process at most, even if you were not very efficient about it. I also never said it needed to be stored as text, just a simple file is enough - no need for a full database. That file could be binary if you really need it to be but text serialization would also be good enough. Most of the web world is processed via text serialization.

The biggest problem with YAML like in OP is the need to decode the whole file at once since it is a single list. Line by line processing would be a lot easier to work with. But even then, if it is only a few hundred MB, loading it all into memory once and analyzing it there would not take long at all - it just does not scale very well.

[–] MonkderVierte@lemmy.zip 1 points 10 hours ago

Some order in the CSV data, if it weren't a logfile, which I didn't know.

[–] wise_pancake@lemmy.ca 3 points 6 hours ago (1 children)

I’d probably just use line delimited JSON or CSV for this use case. It plays nicely with cat and other standard tools and basically all the yaml is doing is wrapping raw json and adding extra parse time/complexity.

In the end consider converting this to parquet for analysis, you probably won’t get much from compression or row-group clustering, but you will get benefits from the column store format when reading the data.

[–] qaz@lemmy.world 4 points 5 hours ago* (last edited 5 hours ago) (1 children)

Thanks for the advice, but this is just the format of some eye-tracking software I had to use, not something I develop myself.

[–] wise_pancake@lemmy.ca 4 points 5 hours ago

Ah, well, such is software dependencies.

[–] lime@feddit.nu 17 points 10 hours ago* (last edited 10 hours ago)

i mean, json is valid yaml

[–] disco@lemdro.id 6 points 8 hours ago

This is nasty to look at

[–] Supercrunchy@programming.dev 5 points 9 hours ago

Also let's represent all numbers in scientific notation, I'm sure that's going to make it easier to read...

[–] BestBouclettes@jlai.lu 13 points 11 hours ago

I really like YAML but way too many people use it beyond its purpose... I work with GitLab CI and seeing complex bash scripts inline in YAML files makes me want to hurt people.