Skip to content

Golang implementation of Graphite/Carbon server with classic architecture: Agent -> Cache -> Persister

License

Notifications You must be signed in to change notification settings

bookingcom/go-carbon

 
 

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Table of Contents

go-carbon Build Status

Golang implementation of Graphite/Carbon server with classic architecture: Agent -> Cache -> Persister

Architecture

Features

  • Receive metrics from TCP and UDP (plaintext protocol)
  • Receive metrics with Pickle protocol (TCP only)
  • Receive metrics from HTTP
  • Receive metrics from Apache Kafka
  • storage-schemas.conf
  • storage-aggregation.conf
  • Carbonlink (requests to cache from graphite-web)
  • Carbonlink-like GRPC api
  • Logging with rotation support (reopen log if it moves)
  • Many persister workers (using many cpu cores)
  • Run as daemon
  • Optional dump/restore restart on USR2 signal (config dump section): stop persister, start write new data to file, dump cache to file, stop all (and restore from files after next start)
  • Reload some config options without restart (HUP signal):
    • whisper section of main config, storage-schemas.conf and storage-aggregation.conf
    • graph-prefix, metric-interval, metric-endpoint, max-cpu from common section
    • dump section

Performance

Faster than default carbon. In all conditions :) How much faster depends on server hardware, storage-schemas, etc.

The result of replacing "carbon" to "go-carbon" on a server with a load up to 900 thousand metric per minute:

Success story

There were some efforts to find out maximum possible performance of go-carbon on a hardware (2xE5-2620v3, 128GB RAM, local SSDs).

The result of that effort (in points per second):

Performance

Stable performance was around 950k points per second with short-term peak performance of 1.2M points/sec.

Installation

Use binary packages from releases page or build manually (requires golang 1.8+):

# build binary
git clone https://github.com/lomik/go-carbon.git
cd go-carbon
make

Configuration

$ go-carbon --help
Usage of go-carbon:
  -check-config=false: Check config and exit
  -config="": Filename of config
  -config-print-default=false: Print default config
  -daemon=false: Run in background
  -pidfile="": Pidfile path (only for daemon)
  -version=false: Print version
[common]
# Run as user. Works only in daemon mode
user = "carbon"
# Prefix for store all internal go-carbon graphs. Supported macroses: {host}
graph-prefix = "carbon.agents.{host}"
# Endpoint fo
# r store internal carbon metrics. Valid values: "" or "local", "tcp://host:port", "udp://host:port"
metric-endpoint = "local"
# Interval of storing internal metrics. Like CARBON_METRIC_INTERVAL
metric-interval = "1m0s"
# Increase for configuration with multi persister workers
max-cpu = 4

[whisper]
data-dir = "/var/lib/graphite/whisper"
# http://graphite.readthedocs.org/en/latest/config-carbon.html#storage-schemas-conf. Required
schemas-file = "/etc/go-carbon/storage-schemas.conf"
# http://graphite.readthedocs.org/en/latest/config-carbon.html#storage-aggregation-conf. Optional
aggregation-file = "/etc/go-carbon/storage-aggregation.conf"
# Worker threads count. Metrics sharded by "crc32(metricName) % workers"
workers = 8
# Limits the number of whisper update_many() calls per second. 0 - no limit
max-updates-per-second = 0
# Softly limits the number of whisper files that get created each second. 0 - no limit
max-creates-per-second = 0
# Make max-creates-per-second a hard limit. Extra new metrics are dropped. A hard throttle of 0 drops all new metrics.
hard-max-creates-per-second = false
# Sparse file creation
sparse-create = false
# use flock on every file call (ensures consistency if there are concurrent read/writes to the same file)
flock = true
enabled = true
# Use hashed filenames for tagged metrics instead of human readable
# https://github.com/lomik/go-carbon/pull/225
hash-filenames = true
# specify to enable/disable compressed format. IMPORTANT: Only one process/thread could write to compressed whisper files at a time, especially when you are rebalancing graphite clusters (with buckytools, for example), flock needs to be enabled both in go-carbon and your tooling.
compressed = false
# automatically delete empty whisper file caused by edge cases like server reboot
remove-empty-file = false

[cache]
# Limit of in-memory stored points (not metrics)
max-size = 1000000
# Capacity of queue between receivers and cache
# Strategy to persist metrics. Values: "max","sorted","noop"
#   "max" - write metrics with most unwritten datapoints first
#   "sorted" - sort by timestamp of first unwritten datapoint.
#   "noop" - pick metrics to write in unspecified order,
#            requires least CPU and improves cache responsiveness
write-strategy = "max"

[udp]
listen = ":2003"
enabled = true
# Optional internal queue between receiver and cache
buffer-size = 0

[tcp]
listen = ":2003"
enabled = true
# Optional internal queue between receiver and cache
buffer-size = 0

[pickle]
listen = ":2004"
# Limit message size for prevent memory overflow
max-message-size = 67108864
enabled = true
# Optional internal queue between receiver and cache
buffer-size = 0

# You can define unlimited count of additional receivers
# Common definition scheme:
# [receiver.<any receiver name>]
# protocol = "<any supported protocol>"
# <protocol specific options>
#
# All available protocols:
#
# [receiver.udp2]
# protocol = "udp"
# listen = ":2003"
# # Enable optional logging of incomplete messages (chunked by max UDP packet size)
# log-incomplete = false
#
# [receiver.tcp2]
# protocol = "tcp"
# listen = ":2003"
#
# [receiver.pickle2]
# protocol = "pickle"
# listen = ":2004"
# # Limit message size for prevent memory overflow
# max-message-size = 67108864
#
# [receiver.protobuf]
# protocol = "protobuf"
# # Same framing protocol as pickle, but message encoded in protobuf format
# # See https://github.com/lomik/go-carbon/blob/master/helper/carbonpb/carbon.proto
# listen = ":2005"
# # Limit message size for prevent memory overflow
# max-message-size = 67108864
#
# [receiver.http]
# protocol = "http"
# # This receiver receives data from POST requests body.
# # Data can be encoded in plain text format (default),
# # protobuf (with Content-Type: application/protobuf header) or
# # pickle (with Content-Type: application/python-pickle header).
# listen = ":2007"
# max-message-size = 67108864
#
# [receiver.kafka]
# protocol = "kafka
# # This receiver receives data from kafka
# # You can use Partitions and Topics to do sharding
# # State is saved in local file to avoid problems with multiple consumers
#
# # Encoding of messages
# # Available options: "plain" (default), "protobuf", "pickle"
# #   Please note that for "plain" you must pass metrics with leading "\n".
# #   e.x.
# #    echo "test.metric $(date +%s) $(date +%s)" | kafkacat -D $'\0' -z snappy -T -b localhost:9092 -t graphite
# parse-protocol = "protobuf"
# # Kafka connection parameters
# brokers = [ "host1:9092", "host2:9092" ]
# topic = "graphite"
# partition = 0
#
# # Specify how often receiver will try to connect to kafka in case of network problems
# reconnect-interval = "5m"
# # How often receiver will ask Kafka for new data (in case there was no messages available to read)
# fetch-interval = "200ms"
#
# # Path to saved kafka state. Used for restarts
# state-file = "/var/lib/graphite/kafka.state"
# # Initial offset, if there is no saved state. Can be relative time or "newest" or "oldest".
# # In case offset is unavailable (in future, etc) fallback is "oldest"
# initial-offset = "-30m"
#
# # Specify kafka feature level (default: 0.11.0.0).
# # Please note that some features (consuming lz4 compressed streams) requires kafka >0.11
# # You must specify version in full. E.x. '0.11.0.0' - ok, but '0.11' is not.
# # Supported version (as of 22 Jan 2018):
# #   0.8.2.0
# #   0.8.2.1
# #   0.8.2.2
# #   0.9.0.0
# #   0.9.0.1
# #   0.10.0.0
# #   0.10.0.1
# #   0.10.1.0
# #   0.10.2.0
# #   0.11.0.0
# #   1.0.0
# kafka-version = "0.11.0.0"
#
# [receiver.pubsub]
# # This receiver receives data from Google PubSub
# # - Authentication is managed through APPLICATION_DEFAULT_CREDENTIALS:
# #   - https://cloud.google.com/docs/authentication/production#providing_credentials_to_your_application
# # - Currently the subscription must exist before running go-carbon.
# # - The "receiver_*" settings are optional and directly map to the google pubsub
# #   libraries ReceiveSettings (https://godoc.org/cloud.google.com/go/pubsub#ReceiveSettings)
# #   - How to think about the "receiver_*" settings: In an attempt to maximize throughput the
# #     pubsub library will spawn 'receiver_go_routines' to fetch messages from the server.
# #     These goroutines simply buffer them into memory until 'receiver_max_messages' or 'receiver_max_bytes'
# #     have been read. This does not affect the actual handling of these messages which are processed by other goroutines.
# protocol = "pubsub"
# project = "project-name"
# subscription = "subscription-name"
# receiver_go_routines = 4
# receiver_max_messages = 1000
# receiver_max_bytes = 500000000 # default 500MB

[carbonlink]
listen = "127.0.0.1:7002"
enabled = true
# Close inactive connections after "read-timeout"
read-timeout = "30s"

# grpc api
# protocol: https://github.com/lomik/go-carbon/blob/master/helper/carbonpb/carbon.proto
# samples: https://github.com/lomik/go-carbon/tree/master/api/sample
[grpc]
listen = "127.0.0.1:7003"
enabled = true

# http://graphite.readthedocs.io/en/latest/tags.html
[tags]
enabled = false
# TagDB url. It should support /tags/tagMultiSeries endpoint
tagdb-url = "http://127.0.0.1:8000"
tagdb-chunk-size = 32
tagdb-update-interval = 100
# Directory for send queue (based on leveldb)
local-dir = "/var/lib/graphite/tagging/"
# POST timeout
tagdb-timeout = "1s"

[carbonserver]
# Please NOTE: carbonserver is not intended to fully replace graphite-web
# It acts as a "REMOTE_STORAGE" for graphite-web or carbonzipper/carbonapi
listen = "127.0.0.1:8080"
# Carbonserver support is still experimental and may contain bugs
# Or be incompatible with github.com/grobian/carbonserver
enabled = false
# Buckets to track response times
buckets = 10
# carbonserver-specific metrics will be sent as counters
# For compatibility with grobian/carbonserver
metrics-as-counters = false
# Read and Write timeouts for HTTP server
read-timeout = "60s"
write-timeout = "60s"
# Enable /render cache, it will cache the result for 1 minute
query-cache-enabled = true
# 0 for unlimited
query-cache-size-mb = 0
# Enable /metrics/find cache, it will cache the result for 5 minutes
find-cache-enabled = true
# Control trigram index
#  This index is used to speed-up /find requests
#  However, it will lead to increased memory consumption
#  Estimated memory consumption is approx. 500 bytes per each metric on disk
#  Another drawback is that it will recreate index every scan-frequency interval
#  All new/deleted metrics will still be searchable until index is recreated
trigram-index = true
# carbonserver keeps track of all available whisper files
# in memory. This determines how often it will check FS
# for new or deleted metrics.
scan-frequency = "5m0s"
# Control trie index
#  This index is built as an alternative to trigram index, with shorter indexing
#  time and less memory usage (around 2 - 5 times). For most of the queries,
#  trie is faster than trigram. For queries with keyword wrap around by widcards
#  (like ns1.ns2.*keywork*.metric), trigram index performs better. Trie index
#  could be speeded up by enabling adding trigrams to trie, at the some costs of
#  memory usage (by setting both trie-index and trigram-index to true).
trie-index = false

# Maximum amount of globs in a single metric in index
# This value is used to speed-up /find requests with
# a lot of globs, but will lead to increased memory consumption
max-globs = 100
# Fail if amount of globs more than max-globs
fail-on-max-globs = false

# Maximum metrics could be returned by glob/wildcard in find request (currently
# works only for trie index)
max-metrics-globbed  = 30000
# Maximum metrics could be returned in render request (works both all types of
# indexes)
max-metrics-rendered = 1000

# graphite-web-10-mode
# Use Graphite-web 1.0 native structs for pickle response
# This mode will break compatibility with graphite-web 0.9.x
# If false, carbonserver won't send graphite-web 1.0 specific structs
# That might degrade performance of the cluster
# But will be compatible with both graphite-web 1.0 and 0.9.x
graphite-web-10-strict-mode = true
# Allows to keep track for "last time readed" between restarts, leave empty to disable
internal-stats-dir = ""
# Calculate /render request time percentiles for the bucket, '95' means calculate 95th Percentile. To disable this feature, leave the list blank
stats-percentiles = [99, 98, 95, 75, 50]

[dump]
# Enable dump/restore function on USR2 signal
enabled = false
# Directory for store dump data. Should be writeable for carbon
path = "/var/lib/graphite/dump/"
# Restore speed. 0 - unlimited
restore-per-second = 0

[pprof]
listen = "localhost:7007"
enabled = false

#[prometheus]
#enabled = true

#[prometheus.labels]
#foo = "test"
#bar = "baz"

# Default logger
[[logging]]
# logger name
# available loggers:
# * "" - default logger for all messages without configured special logger
# @TODO
logger = ""
# Log output: filename, "stderr", "stdout", "none", "" (same as "stderr")
file = "/var/log/go-carbon/go-carbon.log"
# Log level: "debug", "info", "warn", "error", "dpanic", "panic", and "fatal"
level = "info"
# Log format: "json", "console", "mixed"
encoding = "mixed"
# Log time format: "millis", "nanos", "epoch", "iso8601"
encoding-time = "iso8601"
# Log duration format: "seconds", "nanos", "string"
encoding-duration = "seconds"

# You can define multiply loggers:

# Copy errors to stderr for systemd
# [[logging]]
# logger = ""
# file = "stderr"
# level = "error"
# encoding = "mixed"
# encoding-time = "iso8601"
# encoding-duration = "seconds"

OS tuning

It is crucial for performance to ensure that your OS tuned so that go-carbon is never blocked on writes, usually that involves adjusting following sysctl params on Linux systems:

# percentage of your RAM which can be left unwritten to disk. MUST be much more than
# your write rate, which is usually one FS block size (4KB) per metric.
sysctl -w vm.dirty_ratio=80

# percentage of yout RAM when background writer have to kick in and
# start writes to disk. Make it way above the value you see in `/proc/meminfo|grep Dirty`
# so that it doesn't interefere with dirty_expire_centisecs explained below
sysctl -w vm.dirty_background_ratio=50

# allow page to be left dirty no longer than 10 mins
# if unwritten page stays longer than time set here,
# kernel starts writing it out
sysctl -w vm.dirty_expire_centisecs=$(( 10*60*100 ))

Net effect of these 3 params is that with dirty_*_ratio params set high enough multiple updates to a metric don't trigger disk activity. Multiple datapoint writes are coalesced into single disk write which kernel then writes to disk in a background.

With settings above applied, best write-strategy to use is "noop"

Reported stats

metric description
cache.maxSize Maximum number of datapoints stored in cache before overflow
cache.metrics Total number of unique metrics stored in cache
cache.size Total number of datapoints stored in cache
cache.queueWriteoutTime Time in seconds to make a full cycle writing all metrics
carbonserver.cache_partial_hit Requests that was partially served from cache
carbonserver.cache_miss Total cache misses
carbonserver.cache_only_hit Requests fully served from the cache
carbonserver.cache_wait_time_overhead_ns Time spent getting copy of the cache
carbonserver.cache_wait_time_ns Time spent waiting for cache, including overhead
carbonserver.cache_requests Total metrics we've tried to fetch from cache
carbonserver.disk_wait_time_ns Time spent reading data from disk
carbonserver.disk_requests Amount of metrics we've tried to fetch from disk
carbonserver.points_returned Datapoints returned by carbonserver
carbonserver.metrics_returned Metrics returned by carbonserver
persister.maxUpdatesPerSecond
persister.workers
runtime.GOMAXPROCS
runtime.NumGoroutine

Changelog

master
  • Added new options and upgraded go-whisper library to have compressed format (cwhisper) support
version 0.14.0
  • Accept UDP messages in plain protocol without trailing newline
  • Added whisper.hard-max-creates-per-second option #242
  • No longer trying to combine separate UDP messages from one sender into single stream
  • [carbonserver] Added metrics for prometheus
  • [carbonserver] Improved compatibility with graphite-web (#250, #251)
version 0.13.0
  • Added whisper.max-creates-per-second option
  • Support multiple targets in carbonserver
  • Support new carbonapi_v3_pb protocol. This allows recent versions of carbonapi to get metadata alongside with data
version 0.12.0
  • Tags support was added (only with graphite-web)
  • flock support for persister and carbonserver
  • cache.max-size and cache.write-strategy can be changed without restart (HUP signal)
  • Google PubSub protocol was added. It receives data from PubSub Subscriptions and can decode protobuf, plain, or pickle messages.
    • The default format is plain. Specify protobuf or pickle by adding an attribute named 'content-type' to the PubSub messsages:
      • application/protobuf
      • application/python-pickle
    • Sample configuration:
[receiver.pubsub]
protocol = "pubsub"
project = "project-name"
subscription = "subscription-name"
# receiver_go_routines = 4
# receiver_max_messages = 1000
# receiver_max_bytes = 500000000 # default 500MB
version 0.11.0
  • GRPC api for query cache was added
  • Added support for an unlimited number of receivers
  • Protobuf protocol was added. Sample configuration:
[receiver.protobuf]
protocol = "protobuf"
listen = ":2005"
  • HTTP protocol was added. It receives data from POST requests body. Data can be encoded in plain, pickle (Content-Type: application/python-pickle) and protobuf (Content-Type: application/protobuf) formats. Sample configuration:
[receiver.http]
protocol = "http"
listen = ":2006"
  • Kafka protocol was added. It receives data from Kafka and can decode protobuf, plain or pickle messages. You need manually specify message format in the config file. Sample configuration:
[receiver.kafka]
protocol = "kafka"
parse-protocol = "protobuf" # can be also "plain" or "pickle"
brokers = [ "localhost:9092" ]
topic = "graphite"
partition = 0
state-file = "/var/lib/graphite/kafka.state"
initial-offset = "-30m" # In case of absent or damaged state file fetch last 30 mins of messages
version 0.10.0

Breaking changes:

  • common: logfile and log-level in common config section are deprecated
  • changed config defaults:
    • user changed to carbon
    • whisper directory changed to /var/lib/graphite/whisper/
    • schemas config changed to /etc/go-carbon/storage-schemas.conf
  • rpm:
    • binary moved to /usr/bin/go-carbon
    • configs moved to /etc/go-carbon/
  • deb:
    • binary moved to /usr/bin/go-carbon

Other changes:

  • common: Requires Go 1.8 or newer
  • common: Logging refactored. Format changed to structured JSON. Added support of multiple logging handlers with separate output, level and encoding
  • dump/restore: New dump format. Added go-carbon -cat filename command for printing dump to console. New version of go-carbon can read old dump
  • dump/restore: [fix] go-carbon can not stop after dump (with enabled dump and carbonserver)
  • carbonserver: [feature] IdleTimeout is now configurable in carbonserver part
  • carbonserver: [feature] support /render query cache (query-cache-* options in config file)
  • carbonserver: [feature] support /metrics/find cache (find-cache-* option in config file)
  • carbonserver: [feature] support /metrics/details handler, that returns information about metrics (require enabled trigram-index)
  • carbonserver: [feature] Add config option to disable trigram index (before that to disable index you should set scan-interval to 0)
  • carbonserver: [fix] fix #146 (metrics_known was broken if metrics were not sent as counters)
version 0.9.1
  • Always stop on USR2 signal (previously did not stop with disabled dump/restore) #135
version 0.9.0
  • Completely new internal architecture
  • Removed flush to whisper and stop on USR2 signal. Use dump/restore instead
  • Removed global queue (channel) between receivers and cache, added optional per-receiver queues
  • Built-in carbonserver (thanks Vladimir Smirnov)
  • Added runtime tunables to internal metrics #70
version 0.8.1
  • Bug fix: The synchronous config reload (HUP signal) and launch of the internal collecting statistics procedure (every "metric-interval") could cause deadlock (thanks Maxim Ivanov)
version 0.8.0
  • Fully refactored and optimizer cache module (core of go-carbon) (thanks Maxim Ivanov)
  • Added noop cache.write-strategy (thanks Maxim Ivanov)
  • New optional dump/restore functional for minimize data loss on restart
  • Refactored internal stat mechanics. common.graph-prefix and common.metric-interval now can be changed without restart (on HUP signal)
  • Customizable internal metrics endpoint. common.metric-endpoint param. Valid values: "local" and "" (write directly to whisper), "tcp://host:port", "udp://host:port"
version 0.7.3
  • Added cache.write-strategy option (values "max" or "sorted") (thanks Alexander Akulov)
  • commitedPoints metric renamed to committedPoints
version 0.7.2
  • Added sparse file creation (whisper.sparse-create config option)
  • Enable reload in init script (thanks Jose Riguera)
  • Clean up schemas parser code (thanks Dieter Plaetinck)
  • Better go-whisper error handling (thanks Hiroaki Nakamura)
  • Don't try to create whisper file if exists with bad permissions #21
version 0.7.1
  • Fixed problem: Points in queue (channel) between cache and persister subsystems was invisible for carbonlink
version 0.7
  • Grace stop on USR2 signal: close all socket listeners, flush cache to disk and stop carbon
  • Reload persister config (whisper section of main config, storage-schemas.conf and storage-aggregation.conf) on HUP signal
  • Fix bug: Cache may start save points only after first checkpoint
  • Decimal numbers in log files instead of hexademical #22
  • Fix bug: NaN values being saved in Whisper datafiles #17 (thanks Andrew Hayworth)
  • Fix bug: may crash on bad pickle message with big message size in header #30. Added option pickle.max-message-size with 64 MB default value
  • Improved throttling (max-updates-per-second) performance #32
version 0.6
  • metric-interval option
version 0.5.5
  • Cache module optimization
version 0.5.4
  • Fix RPM init script
version 0.5.3
  • Improved validation of bad wsp files
  • RPM init script checks config before restart
  • Debug logging of bad pickle messages
version 0.5.2
  • Fix bug in go-whisper library: UpdateMany saves first point if many points has identical timestamp
version 0.5.1
  • Reduced error level of "bad messages" in tcp and pickle receivers. Now info
  • Configurable logging level. log-level option
  • Fix wrong carbonlink request error in log
version 0.5.0
  • -check-config validates schemas and aggregation configs
  • Fix broken internal metrics tcp.active and pickle.active
  • Optional udp incomplete messages logging: log-incomplete setting
  • Fixes for working on x86-32
  • logging fsnotify: fix ONCE rotation bug
version 0.4.3
  • Optional whisper throttle setting #8: max-updates-per-second
version 0.4.2
  • Fix bug in go-whisper: points in long archives missed if metrics retention count >=3
version 0.4.1
  • Bug fix schemas parser
version 0.4
  • Code refactoring and improved test coverage (thanks Dave Rawks)
  • Bug fixes
version 0.3
  • Log "create wsp" as debug
  • Log UDP checkpoint (calculate stats every minute)
  • Rotate logfile by inotify event (without HUP)
  • Check logfile opened
  • storage-aggregation.conf support
  • Create and chown logfile before daemonize and change user
  • Debian package (thanks Dave Rawks)
version 0.2
  • Git submodule dependencies
  • Init script for CentOS 6
  • Makefile
  • "make rpm" script
  • Daemonize and run-as-user support
  • -check-config option
  • -pidfile option
version 0.1
  • First full-functional public version
  • Logging with HUP rotation support
  • UDP receiver
  • Tcp receiver
  • Pickle receiver
  • TOML-configs
  • Carbonlink
  • Multi-persister support
  • storage-schemas.conf support

About

Golang implementation of Graphite/Carbon server with classic architecture: Agent -> Cache -> Persister

Resources

License

Stars

Watchers

Forks

Packages

No packages published

Languages

  • Go 96.7%
  • Shell 1.4%
  • Other 1.9%