-
Notifications
You must be signed in to change notification settings - Fork 0
trociny/gather
Folders and files
Name | Name | Last commit message | Last commit date | |
---|---|---|---|---|
Repository files navigation
---------------------------------------------------------------------- gather -- collect and display system statistics ---------------------------------------------------------------------- $Id$ 1.Introduction. 2.Installation and configuration. 3.Examples. 4.Timeperiod shortcuts. 1.Introduction. -------------- Many of those who have worked with computer systems faced with situations when something wrong goes with the system that need to be traced in some way. Under Unix there are many nice tools such as top, ps, netstat, vmstat, sysctl and so on that can be used to get useful information about system and trace the problem. But what if the problem happens accidently and usually when you are away from the box, have no access to it or just are sleeping? What you have when you get to the box is some logs and may be some performance statistics in rrd. Very often it is not enough to figure out what was wrong with the system. To have more info you can write some scripts that run system utilities in batch mode to get statistics, run these scripts via cron, then when something has happened you have tons of files with utils' output where you have to find useful for your information. Writing scripts and then digging in thousands of files is time consuming task that it would be nice to automatize a bit. So this is where gather utility goes to help. This script runs system utils to collect statistics and then helps you to analyze collected data. You specify commands you want to run to get statistics in gather.map file, set cron to run gather utility with desired periodicity and then use this utility to output and grep collected statistics for specified period. gather's output contains timestamps thus you can see what things and when happened. 2.Installation and configuration. --------------------------------- gather utility is perl script so you need perl installed on your box to use it. Run make to make installation files. Then copy gather script somewhere you want to have it, preferably in some directory from PATH. gather reads its configuration parameters from gather.cfg and gather.map files. Check in gather script where by default it looks for configs and put gather.cfg there. You can change some defaults running make with additional variable set, e.g: make CONFDIR=$HOME/.gather See Makefile for other parameters you can set. Also you can change defaults using command line parameter. Run gather help to see minihelp and some defaults. gather.cfg contains configuration variables that specify location of gather.map file, directory where statistics is collected, compression used and some other. Take gather.cfg from gather distribution and tune it for your environment and needs. Every configuration variable is commented so you shouldn't have problems with configuration. Please note that gather.cfg is really a Perl script evaluated by gather when it runs. So be careful not to make syntactic errors if you want the program to work. Next thing is to specify commands in gather.map file. You can use gather.map from gather distribution as an example. gather.map looks something like this: %map = ('uptime' => {'desc' => 'system uptime', 'cmd' => '/usr/bin/uptime'}, 'sysctl' => {'desc' => 'sysctl variables', 'cmd' => '/sbin/sysctl -a'}, 'sockstat' => {'desc' => 'sockstat output', 'cmd' => '/usr/bin/sockstat'} ... ); It is rather self explanatory but here is a description. In garher.map you should initialize Perl hash variable `%map'. Keys 'sysctl', 'sockstat' are just names for identifying your statistics commands; you can use any name you like here but you can't use the same name twice. 'desc' is optional description of the command, you can write everything you want here, but try to keep it informative and short enough, as it is used in `gather show utils' output. 'cmd' is the command to run. All output from the command will be redirected to gather database. When you have gather.cfg and the map configured you can run gather to collect data: gather collect gather will run all commands specified in map and store output. You need to set up cron to run this command with desired periodicity. Also if you don't want to run out of free space you need to setup command: gather expire <days> in crontab to run daily and expire old data. Data older then <days> will be deleted. Gather database is actually just a directory where output of each script is saved in separate file in YEAR-MONTH-DAY/HOUR/MINUTE subdirectory, thus you can browse it looking for needed info but also you can use gather to retrieve and grep data. Run gather show help to see minihelp about available subcommands. Next section provides some examples that demonstrate how you can use gather utility. 2.1 Installing with Chef. ------------------------- The gather utility can be installed using Chef cookbook. See further instructions on Chef Supermarket - open-source community platform: https://supermarket.chef.io/cookbooks/gatherit 3.Examples. ----------- When you have set up gather utility as described above and collected some statistics you can use `gather show' command to display and grep data. 3.1.show utils. --------------- Run gather show utils and you will see the list of commands you have installed in map and used to collect data: ------------------------------------------------------------------ name cmd desc ------------------------------------------------------------------ ... sockstat /usr/bin/sockstat sockstat output sysctl /sbin/sysctl -a sysctl variables ... uptime /usr/bin/uptime system uptime ... 3.2.Time periods. ----------------- Asking gather to display data you have to specify time period what data you want for. Time period has the following format: YEAR-MONTH-DAY/HOUR/MINUTE, eg: 2008-09-14/11/10 HOUR and MINUTE are optional so if you want data for the whole hour, you can specify: 2008-09-14/11 and if you want data for the whole day, just specify this day: 2008-09-14 Yoy can use ranges for setting time periods. E.g. specifying: 2008-09-13/11/10--2008-09-14/12 you will get data for period from 11:10 2008-09-13 to 12:00 2008-09-14. 3.3.show grep. -------------- To display data you can use grep subcommand. You should set regexpres that will filter data. If you want all output, set regexp to '.' (point). E.g.: gather show -t '2008-09-14/13' grep '.*' uptime will output something like this: 2008-09-14/13/00: 1:00PM up 1:53, 0 users, load averages: 0.16, 0.04, 0.01 2008-09-14/13/05: 1:05PM up 1:58, 0 users, load averages: 0.16, 0.05, 0.01 2008-09-14/13/10: 1:10PM up 2:03, 0 users, load averages: 0.16, 0.04, 0.01 2008-09-14/13/15: 1:15PM up 2:08, 0 users, load averages: 0.16, 0.04, 0.01 2008-09-14/13/20: 1:20PM up 2:13, 0 users, load averages: 0.16, 0.04, 0.01 2008-09-14/13/25: 1:25PM up 2:18, 0 users, load averages: 0.00, 0.00, 0.00 2008-09-14/13/30: 1:30PM up 2:23, 0 users, load averages: 0.16, 0.03, 0.01 2008-09-14/13/35: 1:35PM up 2:28, 0 users, load averages: 0.08, 0.02, 0.01 2008-09-14/13/40: 1:40PM up 2:33, 0 users, load averages: 0.16, 0.03, 0.01 2008-09-14/13/45: 1:45PM up 2:38, 0 users, load averages: 0.18, 0.05, 0.01 2008-09-14/13/50: 1:50PM up 2:43, 0 users, load averages: 0.23, 0.07, 0.02 2008-09-14/13/55: 1:55PM up 2:48, 0 users, load averages: 0.08, 0.03, 0.01 But usually you will need more complicated regexpres then just '.' to filter needed info. E.g. to see statistics for several hours about open files, you can run: gather show -t '2008-09-14/12--2008-09-14/15' grep '^kern.openfiles:' sysctl That will output something like this: 2008-09-14/12/00: kern.openfiles: 197 2008-09-14/12/05: kern.openfiles: 194 2008-09-14/12/10: kern.openfiles: 194 ... 2008-09-14/15/50: kern.openfiles: 187 2008-09-14/15/55: kern.openfiles: 188 You can use '-c' option if you want to count of matched strings rather then display them. E.g. to see number of sockets used by user www from 12:00 to 13:00 on 2008-09-14 you can run: gather show -t '2008-09-14/12' grep -c '^www\s' sockstat and output like this: 2008-09-14/12/00: 10 2008-09-14/12/05: 10 2008-09-14/12/10: 10 ... 3.4.show filter. ---------------- If you need not just to grep data but perform some actions on them you will want to use filter subcommand. E.g to see amount of loginned users, you can run: gather show -t '2008-09-14/12' filter "perl -pe 's/^.*(\\d+ users),.*\$/\$1/'" uptime That will output something like this: 2008-09-14/12/00: 0 users 2008-09-14/12/05: 0 users 2008-09-14/12/10: 0 users ... Remember about screening properly all control characters in filter command. If filter is rather complicated it is better to write separate script to avoid screening hell and then run: gather show -t '2008-09-14/11' filter ./script uptime Other advantage of this approach is that you can store written filter and use it later. If you use gather for some time soon you will have collection of useful filters. 3.5.show assemble. ------------------ Another show subcommand, `assemble', can be useful when analysing an output of such utilities like `sysctl' or `netstat -s' -- long list of variables with their values. E.g. `systctl -a' output would look something like this: ... vm.stats.misc.zero_page_count: 8130 vm.stats.misc.cnt_prezero: 0 vm.stats.vm.v_kthreadpages: 0 vm.stats.vm.v_rforkpages: 0 vm.stats.vm.v_vforkpages: 170509301 vm.stats.vm.v_forkpages: 1647077180 vm.stats.vm.v_kthreads: 41928 vm.stats.vm.v_rforks: 0 vm.stats.vm.v_vforks: 829962 vm.stats.vm.v_forks: 9605243 vm.stats.vm.v_interrupt_free_min: 2 vm.stats.vm.v_pageout_free_min: 34 vm.stats.vm.v_cache_max: 44618 vm.stats.vm.v_cache_min: 22309 vm.stats.vm.v_cache_count: 12929 vm.stats.vm.v_inactive_count: 445331 vm.stats.vm.v_inactive_target: 33463 vm.stats.vm.v_active_count: 70486 vm.stats.vm.v_wire_count: 67018 ... Using e.g. `show grep vm.stats.vm.v_vforkpages' we could get listing for this particular variable in some interesting timeperiod. But checking all variables in this way would be a long process. With assemble subcommand it is much faster: gather show -t 2010-02-14/08 assemble '^(?k:vm\.stats\..*):\s+(?v:\d+)$' sysctl ... sysctl: vm.stats.object.collapses: 2010-02-14/08/00: 35627077 - 2010-02-14/08/05: 35628981 1904 2010-02-14/08/10: 35634677 5696 2010-02-14/08/15: 35636642 1965 2010-02-14/08/20: 35642462 5820 2010-02-14/08/25: 35644147 1685 2010-02-14/08/30: 35649925 5778 2010-02-14/08/35: 35651872 1947 2010-02-14/08/40: 35657488 5616 2010-02-14/08/45: 35659431 1943 2010-02-14/08/50: 35665174 5743 2010-02-14/08/55: 35666864 1690 sysctl: vm.stats.sys.v_intr: 2010-02-14/08/00: 497713097 - 2010-02-14/08/05: 497751068 37971 2010-02-14/08/10: 497772905 21837 2010-02-14/08/15: 497784808 11903 2010-02-14/08/20: 497793871 9063 2010-02-14/08/25: 497805554 11683 2010-02-14/08/30: 497815321 9767 2010-02-14/08/35: 497837284 21963 2010-02-14/08/40: 497845850 8566 2010-02-14/08/45: 497981716 135866 2010-02-14/08/50: 497990448 8732 2010-02-14/08/55: 498002434 11986 sysctl: vm.stats.sys.v_soft: 2010-02-14/08/00: 476175765 - 2010-02-14/08/05: 476231628 55863 2010-02-14/08/10: 476287825 56197 2010-02-14/08/15: 476353282 65457 2010-02-14/08/20: 476414205 60923 2010-02-14/08/25: 476474890 60685 2010-02-14/08/30: 476541538 66648 2010-02-14/08/35: 476602048 60510 2010-02-14/08/40: 476664288 62240 2010-02-14/08/45: 476729602 65314 2010-02-14/08/50: 476796621 67019 2010-02-14/08/55: 476859315 62694 ... Some explanation. '^(?k:vm\.stats\..*):\s+(?v:\d+)$' -- is a regular expression with two nonstandard (gather specific) extensions: (?k:<key_regexp>) -- the regexp matches key. (?v:<val_regexp>) -- the regexp matches value. So in string like this: vm.stats.sys.v_soft: 476175765 the regular expression above will match vm.stats.sys.v_soft as a key and 476175765 as a value. As a result all lines with vm.stats.sys.v_soft key will be assembled: sysctl: vm.stats.sys.v_soft: 2010-02-14/08/00: 476175765 - 2010-02-14/08/05: 476231628 55863 2010-02-14/08/10: 476287825 56197 2010-02-14/08/15: 476353282 65457 2010-02-14/08/20: 476414205 60923 ... The first column is time, the second is value at this time and the third is difference with the previous value -- this helps much to find anomalies. By default the assembled data are displayed to stdout but with `-d <dir>' option you can specify a directory where assembled data will be stored, in separate file for every key. Still the amount of data you need to review is rather large :-). If you know exact time when the "problem" occurs (e.g. at about 08:20, i.e. "2010-02-14/08/20:" lines) you can use `-t "2010-02-14/08/20:"' option -- this will do some primitive analysis looking for variables that had anomalies at this time and will list them so you could start you analysis from reviewing this variables first. 3.6.show plot. -------------- If you have gnuplot installed you can use `show plot' subcommand to produce data plots. As its arguments it expects a regexp and dataset name. In the regexp you should use grouping to capture a parameter you want to display (as a function of time). For example, let's suppose we want to plot laptop battery life using sysctl output: % sysctl hw.acpi.battery.life hw.acpi.battery.life: 70 Our gather is configured to collect sysctl and produces this output: % gather show -t 1h grep hw.acpi.battery.life sysctl 2012-04-28 08:43: hw.acpi.battery.life: 15 2012-04-28 08:44: hw.acpi.battery.life: 16 2012-04-28 08:45: hw.acpi.battery.life: 17 2012-04-28 08:46: hw.acpi.battery.life: 18 ... To plot this we can use the following command, which captures a figure after "life:" as group \1: gather show -t 1h plot 'hw.acpi.battery.life: (\d+)' sysctl To plot it into a png file: gather show -t 1h plot -t png -o '/tmp/battery.life.png' 'hw.acpi.battery.life: (\d+)' sysctl If you always want to print to a file you may want to change default settings in gather.cfg. Also, note, if you don't have gnuplot installed on the host where you are running gather, you can set 'cat' as gnulplot command in the configuration file and produce gnuplot script with data, which you can ran on a host with gnuplot installed. Or use "ssh host | gnuplot " pipe. 4.Timeperiod shortcuts. ----------------------- The most general form of timeperiod is: YYYY-MM-DD/HH/MM--yyyy-mm-dd/hh/mm where YYYY-MM-DD/HH/MM is start of timeperiod and yyyy-mm-dd/hh/mm is its end. You can skip MM and HH in start or end part of range. E.g: 2008-11-16/14--2008-11-17 This is interpreted as: 2008-11-16/14/00--2008-11-17/23/59 It is also possible to specify only the first part of a timeperiod. E.g: 2008-11-16/14 (interpreted as 2008-11-16/14/00--2008-11-16/14/59) or 2008-11-16 (interpreted as 2008-11-16/00/00--2008-11-16/23/59) If day, hour or minute in end part of timeperiod is the same as in the start one, you can skip it: YYYY-MM-DD/HH/MM--/hh/mm (interpreted as YYYY-MM-DD/HH/MM--YYYY-MM-DD/hh/mm) YYYY-MM-DD/HH/MM--//mm (interpreted as YYYY-MM-DD/HH/MM--YYYY-MM-DD/HH/mm) YYYY-MM-DD/HH/MM--yyyy-mm-dd// (interpreted as YYYY-MM-DD/HH/MM--yyyy-mm-dd/HH/MM) and so on. Here are some other shortcuts you can use to reduce typing: . current day ./. current day/current hour ././. current day/current hour/current minute $ now (the same as ././.) Nd N days ago Nh N hours ago Nm N minutes ago If N{d,h,m} is used alone (there is only start part) then it is replaced by timeperiod "from that time by now". I.e. timeperiod "Nd" is the same as "Nd--$". -- Mikolaj Golub <[email protected]>
About
Automatically exported from code.google.com/p/gatherit
Resources
Stars
Watchers
Forks
Packages 0
No packages published