RoundSparrow

joined 3 years ago
MODERATOR OF
 

There are also issues lurking with accumulation of data. Moving to a batch processing system might want to consider that some instance operators may only wish to retain 60 days of fresh content vs. having every single history of content for a community for search engines and local-searching. The difference in performance is huge, which is why popular Lemmy servers have been crashing constantly - the amount of data in the tables entirely changes the performance characteristics.

Right now, Lemmy has no concept of tiered storage or absent content from replication or purge choices. Looking from the bottom-up, API client before touching PostgreSQL - a smart caching layer could even proxy to the API of a peer instance and offer a virtual copy (cached) of the data for a listing or post. Such a design could intelligently choose to do this for a small number of requests and avoid burdening PostgreSQL with the storage of a post from months or years ago that a few people take a recent interest in (or a search engine wants to pull a copy of old posts).

 

This is taking longer than other aggregate updates, and I think the join can be eliminated:

CREATE FUNCTION public.community_aggregates_comment_count() RETURNS trigger
    LANGUAGE plpgsql
    AS $$
begin
  IF (was_restored_or_created(TG_OP, OLD, NEW)) THEN
update community_aggregates ca
set comments = comments + 1 from comment c, post p
where p.id = c.post_id
  and p.id = NEW.post_id
  and ca.community_id = p.community_id;
ELSIF (was_removed_or_deleted(TG_OP, OLD, NEW)) THEN
update community_aggregates ca
set comments = comments - 1 from comment c, post p
where p.id = c.post_id
  and p.id = OLD.post_id
  and ca.community_id = p.community_id;

END IF;
return null;
end $$;

pg_stat_statements shows it as:

update community_aggregates ca set comments = comments + $15 from comment c, post p where p.id = c.post_id and p.id = NEW.post_id and ca.community_id = p.community_id

TRIGGER:

CREATE TRIGGER community_aggregates_comment_count AFTER INSERT OR DELETE OR UPDATE OF removed, deleted ON public.comment FOR EACH ROW EXECUTE FUNCTION public.community_aggregates_comment_count();

 

Is .moderators supposed to be on GetCommunity() result? I can't seem to find it in lemmy_server api_tests context. All I'm getting is languages and community_view

EDIT: Wait, so there is a "CommunityResponse" and a "GetCommunityResponse", object? What call do I use to Get a GetCommunityResponse object?

https://github.com/LemmyNet/lemmy-js-client/blob/2aa12c04a312ae4ae235f6d97c86a61f58c67494/src/types/GetCommunityResponse.ts#L7

 

in Communities create community/edit community there is a SiteLanguage::read with no site_id, should that call to read have site_id = 1?

For reference, on my production instance my site_language table has 198460 rows and my site table has 1503 rows. Average of 132 languages per site. counts: https://lemmyadmin.bulletintree.com/query/pgcounts?output=table

 

A general description of the proposed change and reasoning behind it is on GitHub: https://github.com/LemmyNet/lemmy/issues/3697

Linear execution of these massive changes to votes/comments/posts with concurrency awareness. Also adds a layer of social awareness, the impact on a community when a bunch of content is black-holed.

An entire site federation delete / dead server - also would fall under this umbrella of mass data change with a potential for new content ownership/etc.

 

person_aggregates is interesting, because it is tracked for all known person accounts on the server. Where site_aggregates does not track all know instances on the server.

Personally I think lemmy-ui needs to be revised to clearly identify that a profile is from another instance and that the count of posts, comments, and the listing of same is incomplete. If a user from another instance is being viewed, you only see what your local instance has comments, posts, and votes for. This will almost always under-represent a user from another instance.

PREMISE: since person_aggregate has a SQL UPDATE performed in real-time on every comment or post creation, I suggest we at least make that more useful. A timestamp of 'last_create', either generic to both post or comment, or individual to each type. I also think a last_login timestamp would be of great use - and the site_aggregates of activity could look at these person_aggregates timestamps instead of having to go analyze comments and posts per user on a scheduled job.

 

Over a short period of time, this is my incoming federation activity for new comments. pg_stat_statements output being show. It is interesting to note these two INSERT statements on comments differ only in the DEFAULT value of language column. Also note the average execution times is way higher (4.3 vs. 1.28) when the language value is set, I assume due to INDEX updates on the column? Or possibly a TRIGGER?

About half of the comments coming in from other servers have default value.

WRITES are heavy, even if it is an INDEX that has to be revised. So INSERT and UPDATE statements are important to scrutinize.

 

Given how frequent these records are created, every vote by a user, I think it is important to study and review how it works.

The current design of lemmy_server 0.18.3 is to issue a SQL DELETE before (almost?) every INSERT of a new vote. The INSERT already has an UPDATE clause on it.

This is one of the few places in Lemmy that a SQL DELETE statement actually takes place. We have to be careful triggers are not firing multiple times, such as decreasing the vote to then immediately have it increase with the INSERT statement that comes later.

For insert of a comment, Lemmy doesn't seem to routinely run a DELETE before the INSERT. So why was this design chosen for votes? Likely the reason is because a user can "undo" a vote and have the record of them ever voting in the database removed. Is that the actual behavior in testing?

pg_stat_statements from an instance doing almost entirely incoming federation activity of post/comments from other instances:

  • DELETE FROM "comment_like" WHERE (("comment_like"."comment_id" = $1) AND ("comment_like"."person_id" = $2)) executed 14736 times, with 607 matching records.

  • INSERT INTO "comment_like" ("person_id", "comment_id", "post_id", "score") VALUES ($1, $2, $3, $4) ON CONFLICT ("comment_id", "person_id") DO UPDATE SET "person_id" = $5, "comment_id" = $6, "post_id" = $7, "score" = $8 RETURNING "comment_like"."id", "comment_like"."person_id", "comment_like"."comment_id", "comment_like"."post_id", "comment_like"."score", "comment_like"."published" executed 15883 times - each time transacting.

  • update comment_aggregates ca set score = score + NEW.score, upvotes = case when NEW.score = 1 then upvotes + 1 else upvotes end, downvotes = case when NEW.score = -1 then downvotes + 1 else downvotes end where ca.comment_id = NEW.comment_id TRIGGER FUNCTION update executing 15692 times.

  • update person_aggregates ua set comment_score = comment_score + NEW.score from comment c where ua.person_id = c.creator_id and c.id = NEW.comment_id TRIGGER FUNCTION update, same executions as previous.

There is some understanding to gain by the count of executions not being equal.

 

a lemmysever .social / LemmyFanatic / post / xxxxx

Same with comments

 

Details here: https://github.com/LemmyNet/lemmy/issues/3165

This will VASTLY decrease the server load of I/O for PostgreSQL, as this mistaken code is doing writes of ~1700 rows (each known Lemmy instance in the database) on every single comment & post creation. This creates record-locking issues given it is writes, which are harsh on the system. Once this is fixed, some site operators will be able to downgrade their hardware! ;)

view more: ‹ prev next ›