feat(nodestore): A file system-based node storage backend, with added support for S3 and GCS storage #76250

klboke · 2024-08-15T03:14:36Z

A file system-based node storage backend, with added support for S3 and GCS storage.

My motivation is as follows:

With the default PostgreSQL backend, nodestore storage can grow rapidly: because of how PostgreSQL's own table cleanup works, the space used by deleted rows is not reclaimed automatically. For example: https://develop.sentry.dev/self-hosted/troubleshooting/#postgres

Legal Boilerplate

Look, I get it. The entity doing business as "Sentry" was incorporated in the State of Delaware in 2015 as Functional Software, Inc. and is gonna need some rights from me in order to utilize my contributions in this here PR. So here's the deal: I retain all rights, title and interest in and to my contributions, and by keeping this boilerplate intact I confirm that Sentry can use, modify, copy, and redistribute my contributions, under Sentry's choice of terms.


PMExtra · 2024-08-16T09:00:45Z

Great job! But I'm a little worried about performance when reading multiple objects as a batch. (I haven't dug into it; this is just an intuitive guess.)

PMExtra · 2024-08-16T09:17:19Z

I'm not sure how frequently the node storage is accessed or how large the stored files are.

I don't think object storage services (such as S3 or GCS) are designed for frequent reads and writes of very small files.

This may lead to performance issues or higher costs.

klboke · 2024-08-16T09:57:51Z

@PMExtra

> I'm not sure how frequently the node storage is accessed or how large the stored files are.

Based on our recorded data, each node is roughly 15 to 76 KB in size, with the majority around 15 KB.

> This may lead to performance issues or higher costs.

I investigated the places where nodestore is written to and read from and found that there are only two scenarios that trigger these actions:

1. Writing to nodestore occurs in the Sentry worker when processing Kafka event messages.
2. Reading from nodestore happens in sentry-web when someone views detailed information.

Writing is essentially offline stream processing, so slower writes are acceptable. The read QPS (queries per second) is very low, so reads are not a significant issue either. Additionally, object storage is typically the cheapest option, which is why many projects in recent years (like Paimon and OpenObserve) support separating storage (S3 and other object stores) from compute.
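To make the two access paths concrete, here is a minimal sketch of a file-backed store with one write path and one read path. The `FileNodeStore` class, its sharding scheme, and the sample IDs are hypothetical illustrations, not the code in this PR and not Sentry's `NodeStorage` API.

```python
import json
import tempfile
from pathlib import Path
from typing import Optional


class FileNodeStore:
    """Hypothetical file-backed node store (illustration only, not Sentry's API)."""

    def __init__(self, root: Path) -> None:
        self.root = Path(root)

    def _path(self, node_id: str) -> Path:
        # Shard by the first two characters of the ID so that a single
        # directory does not end up holding every node.
        return self.root / node_id[:2] / f"{node_id}.json"

    def set(self, node_id: str, data: dict) -> None:
        path = self._path(node_id)
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_text(json.dumps(data), encoding="utf-8")

    def get(self, node_id: str) -> Optional[dict]:
        path = self._path(node_id)
        if not path.exists():
            return None
        return json.loads(path.read_text(encoding="utf-8"))


with tempfile.TemporaryDirectory() as tmp:
    store = FileNodeStore(Path(tmp))
    store.set("ab12cd", {"message": "boom"})  # worker side: written once per event
    loaded = store.get("ab12cd")              # web side: read when someone opens the event
    missing = store.get("ffffff")             # unknown IDs return None
```

Each node maps to exactly one file or object, so a single-event view costs one read; batch reads would need one request per node unless they are parallelized.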

PMExtra · 2024-08-16T10:10:39Z

@klboke Awesome. I was concerned that reading a large number of tiny files from object storage services might cost more than reading from table store services. However, according to your analysis, the files are not that small and the access frequency is not that high, so my concern may be unnecessary.

PMExtra · 2024-08-16T10:27:16Z

@klboke

Another point worth noting is that Sentry stores a large amount of highly repetitive ASCII text, and table store services can achieve high compression ratios for this, thereby reducing storage costs.

However, object storage services usually do not support transparent compression. Each piece of data can be compressed separately, but that incurs additional computational overhead, and the resulting compression ratio may struggle to match that of table store services.
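As a rough illustration of per-object compression, the sketch below compresses one JSON payload on its own with `zlib` and reports the size ratio. The payload is made up for the example (it is not real Sentry data), so the printed ratio says nothing about what a real nodestore would see.

```python
import json
import zlib

# Made-up payload with repetitive ASCII text; not real Sentry event data.
event = {
    "frames": [
        {"filename": "app/views.py", "function": "handle", "lineno": i}
        for i in range(200)
    ]
}

raw = json.dumps(event).encode("utf-8")

# Compress this single object independently, as an object store would require.
compressed = zlib.compress(raw, level=6)

# Reading the object back costs one decompression per access.
restored = json.loads(zlib.decompress(compressed))

ratio = len(compressed) / len(raw)
print(f"raw={len(raw)} bytes, compressed={len(compressed)} bytes, ratio={ratio:.2f}")
```

Because every object is compressed in isolation, redundancy shared across different events cannot be exploited, which is the gap relative to storage engines that compress many rows together.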

Anyway, having one more option is always good. I’m just pointing out some potential drawbacks for discussion, not to undermine your work. Thank you for your contribution.

klboke · 2024-08-16T12:01:21Z

@PMExtra

> However, object storage services usually do not support transparent compression. While each piece of data can be compressed separately, the additional computational overhead will be incurred, and the compression rate may be difficult to match compared to table store services.

Thank you very much for the reminder and suggestions. If we switch to S3, cost won't be an issue at all, even with our 3TB nodestore (which is actually more than most people have). I would prefer storing plain JSON files directly, as that makes it more convenient to observe and debug issues.

aldy505 · 2024-08-29T03:33:16Z

src/sentry/nodestore/filesystem/backend.py

+        if not settings.DEBUG and options_store.get("filestore.backend") == "filesystem":
+            raise ValueError("Local fileSystem should only be used in development!")


This would break existing self-hosted users that don't use (or don't have access to) S3 compatible services. This should just log a warning instead of throwing a ValueError.
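A minimal sketch of the warn-instead-of-raise behavior suggested here. The helper name `check_filestore_backend`, its parameters, and the `strict` switch are hypothetical; in the PR the check reads `settings.DEBUG` and `options_store` directly.

```python
import logging

logger = logging.getLogger("nodestore.example")


def check_filestore_backend(debug: bool, backend: str, strict: bool = False) -> bool:
    """Return True when the configuration looks safe.

    Hypothetical helper: `debug` stands in for settings.DEBUG and `backend`
    for the "filestore.backend" option.
    """
    risky = not debug and backend == "filesystem"
    if not risky:
        return True

    message = "Local file system should only be used in development!"
    if strict:
        # Behavior in the diff above: refuse to start.
        raise ValueError(message)

    # Suggested alternative: keep running, but tell the operator.
    logger.warning(message)
    return False
```

With `strict=False`, a production deployment on the local file system still starts and only logs a warning; with `strict=True`, it fails fast as in the diff.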

I don't think so. The default NodeStore is sentry.nodestore.django.DjangoNodeStorage rather than this one, so it won't break any existing users. If you want to use FileSystemNodeStorage in a non-debug environment, you must also configure FileStore.

Ping to bump this PR.

I would really like to see the built-in support for S3 Node Store (even though performance is not endorsed).

I think the concern here is that FileSystemNodeStorage, by its name, is supposed to use the file system. It is not intuitive that enabling the "S3 File Store" makes FileSystemNodeStorage push things to S3.

I think a better way is to explicitly create a new S3NodeStore class with that logic.
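A rough sketch of what an explicit S3-specific class could look like, assuming a client that exposes `put_object` and `get_object` in the same call shape as boto3. `FakeS3Client`, the bucket name, the key prefix, and the `set`/`get` method names are hypothetical, and this is not the implementation in #82126.

```python
import io
import json


class FakeS3Client:
    """In-memory stand-in that mimics the put_object/get_object call shape of boto3."""

    def __init__(self):
        self._objects = {}

    def put_object(self, Bucket, Key, Body):
        self._objects[(Bucket, Key)] = Body

    def get_object(self, Bucket, Key):
        # boto3 returns the payload as a readable stream under "Body".
        return {"Body": io.BytesIO(self._objects[(Bucket, Key)])}


class S3NodeStore:
    """Hypothetical explicit S3 backend; the class name follows the suggestion above."""

    def __init__(self, client, bucket, prefix="nodestore/"):
        self.client = client
        self.bucket = bucket
        self.prefix = prefix

    def _key(self, node_id):
        return f"{self.prefix}{node_id}.json"

    def set(self, node_id, data):
        body = json.dumps(data).encode("utf-8")
        self.client.put_object(Bucket=self.bucket, Key=self._key(node_id), Body=body)

    def get(self, node_id):
        response = self.client.get_object(Bucket=self.bucket, Key=self._key(node_id))
        return json.loads(response["Body"].read())


s3_store = S3NodeStore(FakeS3Client(), bucket="example-bucket")
s3_store.set("ab12cd", {"message": "boom"})
result = s3_store.get("ab12cd")
```

Injecting the client keeps the S3 dependency visible in the class name and constructor, so nothing about the file system backend's configuration silently changes where data goes.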

See if #82126 makes sense to everyone.

getsantry · 2024-09-19T07:00:32Z

This pull request has gone three weeks without activity. In another week, I will close it.

But! If you comment or otherwise update it, I will reset the clock, and if you add the label WIP, I will leave it alone unless WIP is removed ... forever!