Commit 43bb028

Add disk rotation idea to WAL todo emails.
1 parent 0684043 commit 43bb028

1 file changed: +139 -0 lines changed

doc/TODO.detail/wal
@@ -2698,3 +2698,142 @@ TIP 4: Don't 'kill -9' the postmaster
+From [email protected] Fri Nov 15 11:25:58 2002
+Return-path: <[email protected]>
+Received: from postgresql.org (postgresql.org [64.49.215.8])
+    by candle.pha.pa.us (8.11.6/8.10.1) with ESMTP id gAFHPvR10276
+    for <[email protected]>; Fri, 15 Nov 2002 12:25:57 -0500 (EST)
+Received: from localhost (postgresql.org [64.49.215.8])
+    by postgresql.org (Postfix) with ESMTP
+    id A2D5A4774A1; Fri, 15 Nov 2002 11:34:54 -0500 (EST)
+Received: from postgresql.org (postgresql.org [64.49.215.8])
+    by postgresql.org (Postfix) with SMTP
+    id 5E898477132; Fri, 15 Nov 2002 11:15:45 -0500 (EST)
+Received: from localhost (postgresql.org [64.49.215.8])
+    by postgresql.org (Postfix) with ESMTP id 90CF1475B85
+    for <[email protected]>; Mon, 11 Nov 2002 15:33:47 -0500 (EST)
+Received: from Curtis-Vaio (unknown [63.164.0.45])
+    by postgresql.org (Postfix) with SMTP id C6CB1475A3F
+    for <[email protected]>; Mon, 11 Nov 2002 15:33:46 -0500 (EST)
+Received: from [127.0.0.1] by Curtis-Vaio
+    (ArGoSoft Mail Server Freeware, Version 1.8 (1.8.1.7)); Mon, 11 Nov 2002 16:33:42 -0400
+From: "Curtis Faith" <[email protected]>
+
+Subject: [HACKERS] 500 tpsQL + WAL log implementation
+Date: Mon, 11 Nov 2002 16:33:41 -0400
+Message-ID: <[email protected]>
+MIME-Version: 1.0
+Content-Type: text/plain;
+    charset="iso-8859-1"
+Content-Transfer-Encoding: 7bit
+X-Priority: 3 (Normal)
+X-MSMail-Priority: Normal
+X-Mailer: Microsoft Outlook IMO, Build 9.0.2416 (9.0.2911.0)
+X-MimeOLE: Produced By Microsoft MimeOLE V5.00.2919.6700
+Importance: Normal
+X-Virus-Scanned: by AMaViS new-20020517
+Precedence: bulk
+
+X-Virus-Scanned: by AMaViS new-20020517
+Status: ORr
+
+I have been experimenting with empirical tests of file system and device
+level writes to determine the actual constraints in order to speed up the WAL
+logging code.
+
+Using a raw file partition and a time-based technique for determining the
+optimal write position, I am able to get 8K writes physically written to disk
+synchronously in the range of 500 to 650 writes per second using FreeBSD raw
+device partitions on IDE disks (with write cache disabled). I will be
+testing it soon under Linux with 10,000 RPM SCSI which should be even better.
+It is my belief that the mechanism used to achieve these speeds could be
+incorporated into the existing WAL logging code as an abstraction that looks
+to the WAL code just like the file level access currently used. The current
+speeds are limited by the speed of a single disk rotation. For a 7,200 RPM
+disk this is 120/second, for a 10,000 RPM disk this is 166.66/second.
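The rotation-bound figures quoted above follow from dividing the spindle speed by 60. A minimal sketch of that arithmetic is below; the function names are illustrative and not part of any PostgreSQL code.

```python
def rotations_per_second(rpm: float) -> float:
    """Full platter rotations per second for a disk spinning at `rpm`."""
    return rpm / 60.0


def rotation_time_ms(rpm: float) -> float:
    """Duration of one full rotation, in milliseconds."""
    return 60_000.0 / rpm


# One synchronous write per rotation is the ceiling the email describes.
for rpm in (7200, 10000):
    print(f"{rpm} RPM: {rotations_per_second(rpm):.2f} rotations/s, "
          f"{rotation_time_ms(rpm):.3f} ms per rotation")
```

The 500 to 650 writes per second reported for the IDE disk would therefore be roughly four to five writes per rotation, if that disk spins at 7,200 RPM (the email does not state its speed).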
+
+The mechanism works by adjusting the seek offset of the write by using
+gettimeofday to determine approximately where the disk head is in its
+rotation. The mechanism does not use any AIO calls.
+
+Assuming the following:
+
+1) Disk rotation time is 8.333ms or 8333us (7200 RPM).
+
+2) A write at offset 1,500K completes at system time 103s 000ms 000us.
+
+3) A new write is requested at system time 103s 004ms 166us.
+
+4) A 390K per rotation alignment of the data on the disk.
+
+5) A write must be sent at least 20K ahead of the current head position to
+ensure that it is written in less than one rotation.
+
+It can be determined from the above that a write for an offset of something
+slightly more than 195K past the last write, or offset 1,695K, will be ahead
+of the current location of the head and will therefore complete in less than
+a single rotation's time.
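The worked example above can be reproduced with a few lines of arithmetic. This is only a sketch under the email's five assumptions, using a simplified single-track model in which the head returns to the same offset after each full rotation; the function and parameter names are hypothetical.

```python
def head_offset_k(last_offset_k, last_done_us, now_us, rotation_us, k_per_rotation):
    """Estimate the head position (in K) from the time since the last write completed."""
    elapsed_us = (now_us - last_done_us) % rotation_us  # position within the current rotation
    fraction = elapsed_us / rotation_us                 # share of a rotation already swept
    return last_offset_k + fraction * k_per_rotation


def target_offset_k(head_k, lead_k):
    """Earliest offset that stays `lead_k` ahead of the estimated head position."""
    return head_k + lead_k


# Values taken from assumptions 1) to 5) in the email.
head = head_offset_k(
    last_offset_k=1500,
    last_done_us=103_000_000,
    now_us=103_004_166,
    rotation_us=8333,
    k_per_rotation=390,
)
target = target_offset_k(head, lead_k=20)
print(f"estimated head position: {head:.1f}K")
print(f"earliest safe target:    {target:.1f}K")
```

The estimated head position comes out at about 1,695K, which matches the figure in the email. Adding the 20K margin from assumption 5 would put the earliest safe target near 1,715K; the email does not say which of the two it intends by "slightly more than 195K".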
+
+The disk specific metrics (rotation speed, bytes per rotation, base write
+time, etc.) can be derived empirically through a tester program that would
+take a few minutes to run and which could be run at log setup time.
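The email does not show the tester program. As a rough illustration of the simplest of the metrics it lists (base write time), the sketch below times synchronous 8K writes to an ordinary file using `os.fsync`. It does not measure rotation speed or bytes per rotation, and a regular file on a modern system with write caching will not behave like the raw IDE partition described above; the function name and parameters are hypothetical.

```python
import os
import tempfile
import time


def measure_sync_writes(path, block_size=8192, count=20):
    """Return the average seconds per block when each write is followed by fsync."""
    block = b"\0" * block_size
    fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o600)
    try:
        start = time.perf_counter()
        for i in range(count):
            os.lseek(fd, i * block_size, os.SEEK_SET)  # sequential placement
            os.write(fd, block)
            os.fsync(fd)  # force the block to stable storage before continuing
        elapsed = time.perf_counter() - start
    finally:
        os.close(fd)
    return elapsed / count


with tempfile.TemporaryDirectory() as tmp:
    avg = measure_sync_writes(os.path.join(tmp, "probe.bin"))
    print(f"average synchronous 8K write: {avg * 1000:.3f} ms "
          f"({1 / avg:.0f} writes/s)")
```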
+
+The obvious problem with the above mechanism is that the WAL log needs to be
+able to read from the log file in transaction order during recovery. This
+could be provided for using an abstraction that prepends the logical order
+for each block written to the disk and makes sure that the log blocks contain
+either a valid logical order number or some other marker indicating that the
+block is not being used.
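One way to picture the logical-order prefix is sketched below. The 8-byte header layout, the use of 0 as the "unused" marker, and the function names are all assumptions made for illustration; the email does not specify a format.

```python
import struct

BLOCK_SIZE = 8192
HEADER = struct.Struct("<Q")  # assumed: 8-byte little-endian logical order number
UNUSED = 0                    # assumed marker for a block that holds no log data


def pack_block(seq, payload):
    """Prefix `payload` with its logical order number and pad to BLOCK_SIZE."""
    if seq == UNUSED:
        raise ValueError("order number 0 is reserved for unused blocks")
    room = BLOCK_SIZE - HEADER.size
    if len(payload) > room:
        raise ValueError("payload does not fit in one block")
    return HEADER.pack(seq) + payload.ljust(room, b"\0")


def blocks_in_log_order(device):
    """Scan every block and return payloads sorted by logical order number."""
    found = []
    for pos in range(0, len(device), BLOCK_SIZE):
        (seq,) = HEADER.unpack_from(device, pos)
        if seq != UNUSED:
            found.append((seq, bytes(device[pos + HEADER.size:pos + BLOCK_SIZE])))
    found.sort(key=lambda item: item[0])
    return [payload for _, payload in found]


# Blocks are written wherever the head happens to be, so physical order
# differs from logical order; block 1 stays unused (all zeros).
device = bytearray(4 * BLOCK_SIZE)
for seq, slot, data in ((1, 2, b"first"), (2, 0, b"second"), (3, 3, b"third")):
    device[slot * BLOCK_SIZE:(slot + 1) * BLOCK_SIZE] = pack_block(seq, data)

recovered = [p.rstrip(b"\0") for p in blocks_in_log_order(device)]
print(recovered)
```

A real implementation would also need a checksum or similar guard so that a torn or stale block is not mistaken for a valid one; that is outside what the email describes.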
+
+A bitmap of blocks that have already been used would be kept in memory for
+quickly determining the next set of possible unused blocks but this bitmap
+would not need to be written to disk except during normal shutdown since in
+the event of a failure the bitmaps would be reconstructed by reading all the
+blocks from the disk.
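Rebuilding the bitmap after a failure could look roughly like the sketch below. It assumes, purely for illustration, that each 8K block starts with an 8-byte order number and that 0 marks an unused block; neither detail comes from the email.

```python
import struct

BLOCK_SIZE = 8192
SEQ = struct.Struct("<Q")  # assumed header: 8-byte logical order number
UNUSED = 0                 # assumed marker for an unused block


def rebuild_bitmap(device):
    """Return one boolean per block: True if the block holds log data."""
    return [
        SEQ.unpack_from(device, pos)[0] != UNUSED
        for pos in range(0, len(device), BLOCK_SIZE)
    ]


def next_free_blocks(bitmap, start, want):
    """Indices of up to `want` unused blocks at or after `start`, wrapping around."""
    free = []
    for step in range(len(bitmap)):
        idx = (start + step) % len(bitmap)
        if not bitmap[idx]:
            free.append(idx)
            if len(free) == want:
                break
    return free


# Four-block device with blocks 0 and 2 in use.
device = bytearray(4 * BLOCK_SIZE)
SEQ.pack_into(device, 0 * BLOCK_SIZE, 7)
SEQ.pack_into(device, 2 * BLOCK_SIZE, 8)

bitmap = rebuild_bitmap(device)
print(bitmap)
print(next_free_blocks(bitmap, start=2, want=2))
```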
+
+Checkpointing and something akin to log rotation could be handled using this
+mechanism as well.
+
+So, MY REAL QUESTION is whether or not this is the sort of speed improvement
+that warrants the work of writing the required abstraction layer and making
+this very robust. The WAL code should remain essentially unchanged, with
+perhaps new calls for the five or six routines used to access the log files,
+and handle the equivalent of log rotation for raw device access. These new
+calls would either use the current file based implementation or the new
+logging mechanism depending on the configuration.
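The shape of such an abstraction layer might be sketched as follows. Everything here is hypothetical: the interface, the class names, the `wal_raw_device` setting, and the in-memory stand-ins are invented for illustration and do not correspond to actual PostgreSQL routines or configuration parameters.

```python
from abc import ABC, abstractmethod


class LogStore(ABC):
    """Hypothetical interface the WAL code would call instead of direct file I/O."""

    @abstractmethod
    def append(self, record):
        """Durably store one log record."""

    @abstractmethod
    def read_in_order(self):
        """Return all records in the order they were appended."""


class FileLogStore(LogStore):
    """Stand-in for the current file-based path: records kept in write order."""

    def __init__(self):
        self._records = []

    def append(self, record):
        self._records.append(record)

    def read_in_order(self):
        return list(self._records)


class RawLogStore(LogStore):
    """Stand-in for the raw-device path: records land in scattered slots,
    each tagged with a logical order number."""

    def __init__(self, slots=8):
        self._slots = [None] * slots
        self._next_seq = 1

    def append(self, record):
        # Toy placement rule; the proposal would choose the slot from head timing.
        slot = (self._next_seq * 3) % len(self._slots)
        if self._slots[slot] is not None:
            raise RuntimeError("slot in use; a real store would pick another")
        self._slots[slot] = (self._next_seq, record)
        self._next_seq += 1

    def read_in_order(self):
        used = [entry for entry in self._slots if entry is not None]
        return [record for _, record in sorted(used, key=lambda e: e[0])]


def open_log_store(config):
    """Pick a backend from a hypothetical `wal_raw_device` setting."""
    return RawLogStore() if config.get("wal_raw_device") else FileLogStore()


for config in ({}, {"wal_raw_device": "/dev/hypothetical"}):
    store = open_log_store(config)
    for record in (b"a", b"b", b"c"):
        store.append(record)
    print(type(store).__name__, store.read_in_order())
```

The point of the sketch is only that callers see the same two operations regardless of backend, which is what would let the WAL code remain essentially unchanged.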
+
+I anticipate that the extra work required for a PostgreSQL administrator to
+use the proposed logging mechanism would be to:
+
+1) Create a raw device partition of the appropriate size
+2) Run the metrics tester for that device partition
+3) Set the appropriate configuration parameters to indicate raw WAL logging
+
+I anticipate that the additional space requirements for this system would be
+on the order of 10% to 15% beyond the current file-based implementation's
+requirements.
+
+So, is this worth doing? Would a robust implementation likely be accepted for
+7.4 assuming it can demonstrate speed improvements in the range of 500tps?
+
+- Curtis
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+---------------------------(end of broadcast)---------------------------
+TIP 1: subscribe and unsubscribe commands go to [email protected]
+
