@node Distributed-memory FFTW with MPI, Calling FFTW from Modern Fortran, Multi-threaded FFTW, Top
@chapter Distributed-memory FFTW with MPI
@cindex MPI
@cindex parallel transform
In this chapter we document the parallel FFTW routines for parallel
systems supporting the MPI message-passing interface. Unlike the
shared-memory threads described in the previous chapter, MPI allows
you to use @emph{distributed-memory} parallelism, where each CPU has
its own separate memory, and which can scale up to clusters of many
thousands of processors. This capability comes at a price, however:
each process only stores a @emph{portion} of the data to be
transformed, which means that the data structures and
programming interface are quite different from the serial or threads
versions of FFTW.
@cindex data distribution
Distributed-memory parallelism is especially useful when you are
transforming arrays so large that they do not fit into the memory of a
single processor. The storage per-process required by FFTW's MPI
routines is proportional to the total array size divided by the number
of processes. Conversely, distributed-memory parallelism can easily
pose an unacceptably high communications overhead for small problems;
the threshold problem size for which parallelism becomes advantageous
will depend on the precise problem you are interested in, your
hardware, and your MPI implementation.
A note on terminology: in MPI, you divide the data among a set of
``processes'' which each run in their own memory address space.
Generally, each process runs on a different physical processor, but
this is not required. A set of processes in MPI is described by an
opaque data structure called a ``communicator,'' the most common of
which is the predefined communicator @code{MPI_COMM_WORLD} which
refers to @emph{all} processes. For more information on these and
other concepts common to all MPI programs, we refer the reader to the
documentation at @uref{http://www.mcs.anl.gov/research/projects/mpi/, the MPI home
page}.
@cindex MPI communicator
@ctindex MPI_COMM_WORLD
We assume in this chapter that the reader is familiar with the usage
of the serial (uniprocessor) FFTW, and focus only on the concepts new
to the MPI interface.
@menu
* FFTW MPI Installation::
* Linking and Initializing MPI FFTW::
* 2d MPI example::
* MPI Data Distribution::
* Multi-dimensional MPI DFTs of Real Data::
* Other Multi-dimensional Real-data MPI Transforms::
* FFTW MPI Transposes::
* FFTW MPI Wisdom::
* Avoiding MPI Deadlocks::
* FFTW MPI Performance Tips::
* Combining MPI and Threads::
* FFTW MPI Reference::
* FFTW MPI Fortran Interface::
@end menu
@c ------------------------------------------------------------
@node FFTW MPI Installation, Linking and Initializing MPI FFTW, Distributed-memory FFTW with MPI, Distributed-memory FFTW with MPI
@section FFTW MPI Installation
All of the FFTW MPI code is located in the @code{mpi} subdirectory of
the FFTW package. On Unix systems, the FFTW MPI libraries and header
files are automatically configured, compiled, and installed along with
the uniprocessor FFTW libraries simply by including
@code{--enable-mpi} in the flags to the @code{configure} script
(@pxref{Installation on Unix}).
@fpindex configure
Any implementation of the MPI standard, version 1 or later, should
work with FFTW. The @code{configure} script will attempt to
automatically detect how to compile and link code using your MPI
implementation. In some cases, especially if you have multiple
different MPI implementations installed or have an unusual MPI
software package, you may need to provide this information explicitly.
Most commonly, one compiles MPI code by invoking a special compiler
command, typically @code{mpicc} for C code. The @code{configure}
script knows the most common names for this command, but you can
specify the MPI compilation command explicitly by setting the
@code{MPICC} variable, as in @samp{./configure MPICC=mpicc ...}.
@fpindex mpicc
If, instead of a special compiler command, you need to link a certain
library, you can specify the link command via the @code{MPILIBS}
variable, as in @samp{./configure MPILIBS=-lmpi ...}. Note that if
your MPI library is installed in a non-standard location (one the
compiler does not know about by default), you may also have to specify
the location of the library and header files via @code{LDFLAGS} and
@code{CPPFLAGS} variables, respectively, as in @samp{./configure
LDFLAGS=-L/path/to/mpi/libs CPPFLAGS=-I/path/to/mpi/include ...}.
@c ------------------------------------------------------------
@node Linking and Initializing MPI FFTW, 2d MPI example, FFTW MPI Installation, Distributed-memory FFTW with MPI
@section Linking and Initializing MPI FFTW
Programs using the MPI FFTW routines should be linked with
@code{-lfftw3_mpi -lfftw3 -lm} on Unix in double precision,
@code{-lfftw3f_mpi -lfftw3f -lm} in single precision, and so on
(@pxref{Precision}). You will also need to link with whatever library
is responsible for MPI on your system; in most MPI implementations,
there is a special compiler alias named @code{mpicc} to compile and
link MPI code.
@fpindex mpicc
@cindex linking on Unix
@cindex precision
@findex fftw_init_threads
Before calling any FFTW routines except possibly
@code{fftw_init_threads} (@pxref{Combining MPI and Threads}), but after calling
@code{MPI_Init}, you should call the function:
@example
void fftw_mpi_init(void);
@end example
@findex fftw_mpi_init
If, at the end of your program, you want to get rid of all memory and
other resources allocated internally by FFTW, for both the serial and
MPI routines, you can call:
@example
void fftw_mpi_cleanup(void);
@end example
@findex fftw_mpi_cleanup
which is much like the @code{fftw_cleanup()} function except that it
also gets rid of FFTW's MPI-related data. You must @emph{not} execute
any previously created plans after calling this function.
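For example, the overall call sequence of a typical program might look
like the following sketch (ours, with error checking omitted and the
actual transforms elided):
@example
#include <fftw3-mpi.h>

int main(int argc, char **argv)
@{
    MPI_Init(&argc, &argv);     /* @r{initialize MPI first} */
    fftw_mpi_init();            /* @r{then the FFTW MPI interface} */

    /* @r{... create, execute, and destroy plans as usual ...} */

    fftw_mpi_cleanup();         /* @r{optional: free FFTW's internal resources} */
    MPI_Finalize();             /* @r{finalize MPI last} */
    return 0;
@}
@end example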
@c ------------------------------------------------------------
@node 2d MPI example, MPI Data Distribution, Linking and Initializing MPI FFTW, Distributed-memory FFTW with MPI
@section 2d MPI example
Before we document the FFTW MPI interface in detail, we begin with a
simple example outlining how one would perform a two-dimensional
@code{N0} by @code{N1} complex DFT.
@example
#include <fftw3-mpi.h>
int main(int argc, char **argv)
@{
    const ptrdiff_t N0 = ..., N1 = ...;
    fftw_plan plan;
    fftw_complex *data;
    ptrdiff_t alloc_local, local_n0, local_0_start, i, j;

    MPI_Init(&argc, &argv);
    fftw_mpi_init();

    /* @r{get local data size and allocate} */
    alloc_local = fftw_mpi_local_size_2d(N0, N1, MPI_COMM_WORLD,
                                         &local_n0, &local_0_start);
    data = fftw_alloc_complex(alloc_local);

    /* @r{create plan for in-place forward DFT} */
    plan = fftw_mpi_plan_dft_2d(N0, N1, data, data, MPI_COMM_WORLD,
                                FFTW_FORWARD, FFTW_ESTIMATE);

    /* @r{initialize data to some function} my_function(x,y) */
    for (i = 0; i < local_n0; ++i) for (j = 0; j < N1; ++j)
        data[i*N1 + j] = my_function(local_0_start + i, j);

    /* @r{compute transforms, in-place, as many times as desired} */
    fftw_execute(plan);

    fftw_destroy_plan(plan);

    MPI_Finalize();
@}
@end example
As can be seen above, the MPI interface follows the same basic style
of allocate/plan/execute/destroy as the serial FFTW routines. All of
the MPI-specific routines are prefixed with @samp{fftw_mpi_} instead
of @samp{fftw_}. There are a few important differences, however:
First, we must call @code{fftw_mpi_init()} after calling
@code{MPI_Init} (required in all MPI programs) and before calling any
other @samp{fftw_mpi_} routine.
@findex MPI_Init
@findex fftw_mpi_init
Second, when we create the plan with @code{fftw_mpi_plan_dft_2d},
analogous to @code{fftw_plan_dft_2d}, we pass an additional argument:
the communicator, indicating which processes will participate in the
transform (here @code{MPI_COMM_WORLD}, indicating all processes).
Whenever you create, execute, or destroy a plan for an MPI transform,
you must call the corresponding FFTW routine on @emph{all} processes
in the communicator for that transform. (That is, these are
@emph{collective} calls.) Note that the plan for the MPI transform
uses the standard @code{fftw_execute} and @code{fftw_destroy_plan} routines
(on the other hand, there are MPI-specific new-array execute functions
documented below).
@cindex collective function
@findex fftw_mpi_plan_dft_2d
@ctindex MPI_COMM_WORLD
Third, all of the FFTW MPI routines take @code{ptrdiff_t} arguments
instead of @code{int} as for the serial FFTW. @code{ptrdiff_t} is a
standard C integer type which is (at least) 32 bits wide on a 32-bit
machine and 64 bits wide on a 64-bit machine. This is to make it easy
to specify very large parallel transforms on a 64-bit machine. (You
can specify 64-bit transform sizes in the serial FFTW, too, but only
by using the @samp{guru64} planner interface. @xref{64-bit Guru
Interface}.)
@tindex ptrdiff_t
@cindex 64-bit architecture
Fourth, and most importantly, you don't allocate the entire
two-dimensional array on each process. Instead, you call
@code{fftw_mpi_local_size_2d} to find out what @emph{portion} of the
array resides on each processor, and how much space to allocate.
Here, the portion of the array on each process is a @code{local_n0} by
@code{N1} slice of the total array, starting at index
@code{local_0_start}. The total number of @code{fftw_complex} numbers
to allocate is given by the @code{alloc_local} return value, which
@emph{may} be greater than @code{local_n0 * N1} (in case some
intermediate calculations require additional storage). The data
distribution in FFTW's MPI interface is described in more detail by
the next section.
@findex fftw_mpi_local_size_2d
@cindex data distribution
Given the portion of the array that resides on the local process, it
is straightforward to initialize the data (here to a function
@code{my_function}) and otherwise manipulate it.  Of course, at the end
of the program you may want to output the data somehow, but
synchronizing this output is up to you and is beyond the scope of this
manual. (One good way to output a large multi-dimensional distributed
array in MPI to a portable binary file is to use the free HDF5
library; see the @uref{http://www.hdfgroup.org/, HDF home page}.)
@cindex HDF5
@cindex MPI I/O
@c ------------------------------------------------------------
@node MPI Data Distribution, Multi-dimensional MPI DFTs of Real Data, 2d MPI example, Distributed-memory FFTW with MPI
@section MPI Data Distribution
@cindex data distribution
The most important concept to understand in using FFTW's MPI interface
is the data distribution. With a serial or multithreaded FFT, all of
the inputs and outputs are stored as a single contiguous chunk of
memory. With a distributed-memory FFT, the inputs and outputs are
broken into disjoint blocks, one per process.
In particular, FFTW uses a @emph{1d block distribution} of the data,
distributed along the @emph{first dimension}. For example, if you
want to perform a @twodims{100,200} complex DFT, distributed over 4
processes, each process will get a @twodims{25,200} slice of the data.
That is, process 0 will get rows 0 through 24, process 1 will get rows
25 through 49, process 2 will get rows 50 through 74, and process 3
will get rows 75 through 99. If you take the same array but
distribute it over 3 processes, then it is not evenly divisible so the
different processes will have unequal chunks. FFTW's default choice
in this case is to assign 34 rows to processes 0 and 1, and 32 rows to
process 2.
@cindex block distribution
FFTW provides several @samp{fftw_mpi_local_size} routines that you can
call to find out what portion of an array is stored on the current
process. In most cases, you should use the default block sizes picked
by FFTW, but it is also possible to specify your own block size. For
example, with a @twodims{100,200} array on three processes, you can
tell FFTW to use a block size of 40, which would assign 40 rows to
processes 0 and 1, and 20 rows to process 2. FFTW's default is to
divide the data equally among the processes if possible, and as best
it can otherwise. The rows are always assigned in ``rank order,''
i.e. process 0 gets the first block of rows, then process 1, and so
on. (You can change this by using @code{MPI_Comm_split} to create a
new communicator with re-ordered processes.) However, you should
always call the @samp{fftw_mpi_local_size} routines, if possible,
rather than trying to predict FFTW's distribution choices.
In particular, it is critical that you allocate the storage size that
is returned by @samp{fftw_mpi_local_size}, which is @emph{not}
necessarily the size of the local slice of the array. The reason is
that intermediate steps of FFTW's algorithms involve transposing the
array and redistributing the data, so at these intermediate steps FFTW
may require more local storage space (albeit always proportional to
the total size divided by the number of processes). The
@samp{fftw_mpi_local_size} functions know how much storage is required
for these intermediate steps and tell you the correct amount to
allocate.
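For instance, the following fragment (an illustration of ours, not a
complete program, assuming @code{<stdio.h>} and the usual MPI/FFTW
headers) queries the distribution of a @twodims{100,200} complex array,
allocates exactly the amount FFTW requests, and reports which rows
landed on each process:
@example
ptrdiff_t alloc_local, local_n0, local_0_start;
fftw_complex *data;
int rank;

alloc_local = fftw_mpi_local_size_2d(100, 200, MPI_COMM_WORLD,
                                     &local_n0, &local_0_start);
data = fftw_alloc_complex(alloc_local);   /* @r{may exceed} local_n0 * 200 */

MPI_Comm_rank(MPI_COMM_WORLD, &rank);
printf("process %d stores rows %td to %td\n",
       rank, local_0_start, local_0_start + local_n0 - 1);
@end example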
@menu
* Basic and advanced distribution interfaces::
* Load balancing::
* Transposed distributions::
* One-dimensional distributions::
@end menu
@node Basic and advanced distribution interfaces, Load balancing, MPI Data Distribution, MPI Data Distribution
@subsection Basic and advanced distribution interfaces
As with the planner interface, the @samp{fftw_mpi_local_size}
distribution interface is broken into basic and advanced
(@samp{_many}) interfaces, where the latter allows you to specify the
block size manually and also to request block sizes when computing
multiple transforms simultaneously. These functions are documented
more exhaustively by the FFTW MPI Reference, but we summarize the
basic ideas here using a couple of two-dimensional examples.
For the @twodims{100,200} complex-DFT example, above, we would find
the distribution by calling the following function in the basic
interface:
@example
ptrdiff_t fftw_mpi_local_size_2d(ptrdiff_t n0, ptrdiff_t n1, MPI_Comm comm,
                                 ptrdiff_t *local_n0, ptrdiff_t *local_0_start);
@end example
@findex fftw_mpi_local_size_2d
Given the total size of the data to be transformed (here, @code{n0 =
100} and @code{n1 = 200}) and an MPI communicator (@code{comm}), this
function provides three numbers.
First, it describes the shape of the local data: the current process
should store a @code{local_n0} by @code{n1} slice of the overall
dataset, in row-major order (@code{n1} dimension contiguous), starting
at index @code{local_0_start}. That is, if the total dataset is
viewed as a @code{n0} by @code{n1} matrix, the current process should
store the rows @code{local_0_start} to
@code{local_0_start+local_n0-1}. Obviously, if you are running with
only a single MPI process, that process will store the entire array:
@code{local_0_start} will be zero and @code{local_n0} will be
@code{n0}. @xref{Row-major Format}.
@cindex row-major
Second, the return value is the total number of data elements (e.g.,
complex numbers for a complex DFT) that should be allocated for the
input and output arrays on the current process (ideally with
@code{fftw_malloc} or an @samp{fftw_alloc} function, to ensure optimal
alignment). It might seem that this should always be equal to
@code{local_n0 * n1}, but this is @emph{not} the case. FFTW's
distributed FFT algorithms require data redistributions at
intermediate stages of the transform, and in some circumstances this
may require slightly larger local storage. This is discussed in more
detail below, under @ref{Load balancing}.
@findex fftw_malloc
@findex fftw_alloc_complex
@cindex advanced interface
The advanced-interface @samp{local_size} function for multidimensional
transforms returns the same three things (@code{local_n0},
@code{local_0_start}, and the total number of elements to allocate),
but takes more inputs:
@example
ptrdiff_t fftw_mpi_local_size_many(int rnk, const ptrdiff_t *n,
                                   ptrdiff_t howmany,
                                   ptrdiff_t block0,
                                   MPI_Comm comm,
                                   ptrdiff_t *local_n0,
                                   ptrdiff_t *local_0_start);
@end example
@findex fftw_mpi_local_size_many
The two-dimensional case above corresponds to @code{rnk = 2} and an
array @code{n} of length 2 with @code{n[0] = n0} and @code{n[1] = n1}.
This routine is for any @code{rnk > 1}; one-dimensional transforms
have their own interface because they work slightly differently, as
discussed below.
First, the advanced interface allows you to perform multiple
transforms at once, of interleaved data, as specified by the
@code{howmany} parameter.  (@code{howmany} is 1 for a single
transform.)
Second, here you can specify your desired block size in the @code{n0}
dimension, @code{block0}. To use FFTW's default block size, pass
@code{FFTW_MPI_DEFAULT_BLOCK} (0) for @code{block0}. Otherwise, on
@code{P} processes, FFTW will return @code{local_n0} equal to
@code{block0} on the first @code{P / block0} processes (rounded down),
return @code{local_n0} equal to @code{n0 - block0 * (P / block0)} on
the next process, and @code{local_n0} equal to zero on any remaining
processes. In general, we recommend using the default block size
(which corresponds to @code{n0 / P}, rounded up).
@ctindex FFTW_MPI_DEFAULT_BLOCK
@cindex block distribution
For example, suppose you have @code{P = 4} processes and @code{n0 =
21}. The default will be a block size of @code{6}, which will give
@code{local_n0 = 6} on the first three processes and @code{local_n0 =
3} on the last process. Instead, however, you could specify
@code{block0 = 5} if you wanted, which would give @code{local_n0 = 5}
on processes 0 to 2, @code{local_n0 = 6} on process 3. (This choice,
while it may look superficially more ``balanced,'' has the same
critical path as FFTW's default but requires more communications.)
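For example, the basic @code{fftw_mpi_local_size_2d} call for the
@twodims{100,200} problem above is equivalent to the following
advanced-interface call with a single transform and the default block
size (an illustrative sketch of ours):
@example
ptrdiff_t n[2], alloc_local, local_n0, local_0_start;
n[0] = 100; n[1] = 200;

alloc_local = fftw_mpi_local_size_many(2, n, 1 /* @r{howmany} */,
                                       FFTW_MPI_DEFAULT_BLOCK,
                                       MPI_COMM_WORLD,
                                       &local_n0, &local_0_start);
@end example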
@node Load balancing, Transposed distributions, Basic and advanced distribution interfaces, MPI Data Distribution
@subsection Load balancing
@cindex load balancing
Ideally, when you parallelize a transform over some @math{P}
processes, each process should end up with work that takes equal time.
Otherwise, all of the processes end up waiting on whichever process is
slowest. This goal is known as ``load balancing.'' In this section,
we describe the circumstances under which FFTW is able to load-balance
well, and in particular how you should choose your transform size in
order to load balance.
Load balancing is especially difficult when you are parallelizing over
heterogeneous machines; for example, if one of your processors is an
old 486 and another is a Pentium IV, obviously you should give the
Pentium more work to do than the 486 since the latter is much slower.
FFTW does not deal with this problem, however---it assumes that your
processes run on hardware of comparable speed, and that the goal is
therefore to divide the problem as equally as possible.
For a multi-dimensional complex DFT, FFTW can divide the problem
equally among the processes if: (i) the @emph{first} dimension
@code{n0} is divisible by @math{P}; and (ii), the @emph{product} of
the subsequent dimensions is divisible by @math{P}. (For the advanced
interface, where you can specify multiple simultaneous transforms via
some ``vector'' length @code{howmany}, a factor of @code{howmany} is
included in the product of the subsequent dimensions.)
For a one-dimensional complex DFT, the length @code{N} of the data
should be divisible by @math{P} @emph{squared} to be able to divide
the problem equally among the processes.
@node Transposed distributions, One-dimensional distributions, Load balancing, MPI Data Distribution
@subsection Transposed distributions
Internally, FFTW's MPI transform algorithms work by first computing
transforms of the data local to each process, then by globally
@emph{transposing} the data in some fashion to redistribute the data
among the processes, transforming the new data local to each process,
and transposing back. For example, a two-dimensional @code{n0} by
@code{n1} array, distributed across the @code{n0} dimension, is
transformed by: (i) transforming the @code{n1} dimension, which is
local to each process; (ii) transposing to an @code{n1} by @code{n0}
array, distributed across the @code{n1} dimension; (iii) transforming
the @code{n0} dimension, which is now local to each process; (iv)
transposing back.
@cindex transpose
However, in many applications it is acceptable to compute a
multidimensional DFT whose results are produced in transposed order
(e.g., @code{n1} by @code{n0} in two dimensions). This provides a
significant performance advantage, because it means that the final
transposition step can be omitted. FFTW supports this optimization,
which you specify by passing the flag @code{FFTW_MPI_TRANSPOSED_OUT}
to the planner routines. To compute the inverse transform of
transposed output, you specify @code{FFTW_MPI_TRANSPOSED_IN} to tell
it that the input is transposed. In this section, we explain how to
interpret the output format of such a transform.
@ctindex FFTW_MPI_TRANSPOSED_OUT
@ctindex FFTW_MPI_TRANSPOSED_IN
Suppose you are transforming multi-dimensional data with (at
least two) dimensions @ndims{}. As always, it is distributed along
the first dimension @dimk{0}. Now, if we compute its DFT with the
@code{FFTW_MPI_TRANSPOSED_OUT} flag, the resulting output data are stored
with the first @emph{two} dimensions transposed: @ndimstrans{},
distributed along the @dimk{1} dimension. Conversely, if we take the
@ndimstrans{} data and transform it with the
@code{FFTW_MPI_TRANSPOSED_IN} flag, then the format goes back to the
original @ndims{} array.
There are two ways to find the portion of the transposed array that
resides on the current process. First, you can simply call the
appropriate @samp{local_size} function, passing @ndimstrans{} (the
transposed dimensions). This would mean calling the @samp{local_size}
function twice, once for the transposed and once for the
non-transposed dimensions. Alternatively, you can call one of the
@samp{local_size_transposed} functions, which returns both the
non-transposed and transposed data distribution from a single call.
For example, for a 3d transform with transposed output (or input), you
might call:
@example
ptrdiff_t fftw_mpi_local_size_3d_transposed(
            ptrdiff_t n0, ptrdiff_t n1, ptrdiff_t n2, MPI_Comm comm,
            ptrdiff_t *local_n0, ptrdiff_t *local_0_start,
            ptrdiff_t *local_n1, ptrdiff_t *local_1_start);
@end example
@findex fftw_mpi_local_size_3d_transposed
Here, @code{local_n0} and @code{local_0_start} give the size and
starting index of the @code{n0} dimension for the
@emph{non}-transposed data, as in the previous sections. For
@emph{transposed} data (e.g. the output for
@code{FFTW_MPI_TRANSPOSED_OUT}), @code{local_n1} and
@code{local_1_start} give the size and starting index of the @code{n1}
dimension, which is the first dimension of the transposed data
(@code{n1} by @code{n0} by @code{n2}).
(Note that @code{FFTW_MPI_TRANSPOSED_IN} is completely equivalent to
performing @code{FFTW_MPI_TRANSPOSED_OUT} and passing the first two
dimensions to the planner in reverse order, or vice versa. If you
pass @emph{both} the @code{FFTW_MPI_TRANSPOSED_IN} and
@code{FFTW_MPI_TRANSPOSED_OUT} flags, it is equivalent to swapping the
first two dimensions passed to the planner and passing @emph{neither}
flag.)
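For example, a 3d in-place complex DFT with transposed output might be
set up as in the following sketch (ours; the size 128 is arbitrary, and
the same flag must be passed to both the local-size query and the
planner):
@example
const ptrdiff_t n0 = 128, n1 = 128, n2 = 128;
ptrdiff_t alloc_local, local_n0, local_0_start, local_n1, local_1_start;
fftw_complex *data;
fftw_plan plan;

alloc_local = fftw_mpi_local_size_3d_transposed(n0, n1, n2, MPI_COMM_WORLD,
                                                &local_n0, &local_0_start,
                                                &local_n1, &local_1_start);
data = fftw_alloc_complex(alloc_local);

/* @r{input: the} local_n0 x n1 x n2 @r{slab of the} n0 x n1 x n2 @r{array;}
   @r{output: the} local_n1 x n0 x n2 @r{slab of the transposed} n1 x n0 x n2 @r{array} */
plan = fftw_mpi_plan_dft_3d(n0, n1, n2, data, data, MPI_COMM_WORLD,
                            FFTW_FORWARD,
                            FFTW_MEASURE | FFTW_MPI_TRANSPOSED_OUT);
fftw_execute(plan);
fftw_destroy_plan(plan);
@end example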
@node One-dimensional distributions, , Transposed distributions, MPI Data Distribution
@subsection One-dimensional distributions
For one-dimensional distributed DFTs using FFTW, matters are slightly
more complicated because the data distribution is more closely tied to
how the algorithm works. In particular, you can no longer pass an
arbitrary block size and must accept FFTW's default; also, the block
sizes may be different for input and output. Also, the data
distribution depends on the flags and transform direction, in order
for forward and backward transforms to work correctly.
@example
ptrdiff_t fftw_mpi_local_size_1d(ptrdiff_t n0, MPI_Comm comm,
                                 int sign, unsigned flags,
                                 ptrdiff_t *local_ni, ptrdiff_t *local_i_start,
                                 ptrdiff_t *local_no, ptrdiff_t *local_o_start);
@end example
@findex fftw_mpi_local_size_1d
This function computes the data distribution for a 1d transform of
size @code{n0} with the given transform @code{sign} and @code{flags}.
Both input and output data use block distributions. The input on the
current process will consist of @code{local_ni} numbers starting at
index @code{local_i_start}; e.g. if only a single process is used,
then @code{local_ni} will be @code{n0} and @code{local_i_start} will
be @code{0}. Similarly for the output, with @code{local_no} numbers
starting at index @code{local_o_start}. The return value of
@code{fftw_mpi_local_size_1d} will be the total number of elements to
allocate on the current process (which might be slightly larger than
the local size due to intermediate steps in the algorithm).
As mentioned above (@pxref{Load balancing}), the data will be divided
equally among the processes if @code{n0} is divisible by the
@emph{square} of the number of processes. In this case,
@code{local_ni} will equal @code{local_no}. Otherwise, they may be
different.
For some applications, such as convolutions, the order of the output
data is irrelevant. In this case, performance can be improved by
specifying that the output data be stored in an FFTW-defined
``scrambled'' format. (In particular, this is the analogue of
transposed output in the multidimensional case: scrambled output saves
a communications step.) If you pass @code{FFTW_MPI_SCRAMBLED_OUT} in
the flags, then the output is stored in this (undocumented) scrambled
order. Conversely, to perform the inverse transform of data in
scrambled order, pass the @code{FFTW_MPI_SCRAMBLED_IN} flag.
@ctindex FFTW_MPI_SCRAMBLED_OUT
@ctindex FFTW_MPI_SCRAMBLED_IN
In MPI FFTW, only composite sizes @code{n0} can be parallelized; we
have not yet implemented a parallel algorithm for large prime sizes.
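For example, a 1d forward transform with scrambled output might be set
up as in this sketch (ours; note that the @code{sign} and @code{flags}
passed to @code{fftw_mpi_local_size_1d} should match those later passed
to the planner):
@example
const ptrdiff_t n0 = 1048576;   /* @r{a composite size} */
ptrdiff_t alloc_local, local_ni, local_i_start, local_no, local_o_start;
fftw_complex *data;
fftw_plan plan;

alloc_local = fftw_mpi_local_size_1d(n0, MPI_COMM_WORLD,
                                     FFTW_FORWARD, FFTW_MPI_SCRAMBLED_OUT,
                                     &local_ni, &local_i_start,
                                     &local_no, &local_o_start);
data = fftw_alloc_complex(alloc_local);

plan = fftw_mpi_plan_dft_1d(n0, data, data, MPI_COMM_WORLD,
                            FFTW_FORWARD,
                            FFTW_ESTIMATE | FFTW_MPI_SCRAMBLED_OUT);
fftw_execute(plan);   /* @r{output is in FFTW's scrambled order} */
fftw_destroy_plan(plan);
@end example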
@c ------------------------------------------------------------
@node Multi-dimensional MPI DFTs of Real Data, Other Multi-dimensional Real-data MPI Transforms, MPI Data Distribution, Distributed-memory FFTW with MPI
@section Multi-dimensional MPI DFTs of Real Data
FFTW's MPI interface also supports multi-dimensional DFTs of real
data, similar to the serial r2c and c2r interfaces. (Parallel
one-dimensional real-data DFTs are not currently supported; you must
use a complex transform and set the imaginary parts of the inputs to
zero.)
The key points to understand for r2c and c2r MPI transforms (compared
to the MPI complex DFTs or the serial r2c/c2r transforms), are:
@itemize @bullet
@item
Just as for serial transforms, r2c/c2r DFTs transform @ndims{} real
data to/from @ndimshalf{} complex data: the last dimension of the
complex data is cut in half (rounded down), plus one. As for the
serial transforms, the sizes you pass to the @samp{plan_dft_r2c} and
@samp{plan_dft_c2r} are the @ndims{} dimensions of the real data.
@item
@cindex padding
Although the real data is @emph{conceptually} @ndims{}, it is
@emph{physically} stored as an @ndimspad{} array, where the last
dimension has been @emph{padded} to make it the same size as the
complex output. This is much like the in-place serial r2c/c2r
interface (@pxref{Multi-Dimensional DFTs of Real Data}), except that
in MPI the padding is required even for out-of-place data. The extra
padding numbers are ignored by FFTW (they are @emph{not} like
zero-padding the transform to a larger size); they are only used to
determine the data layout.
@item
@cindex data distribution
The data distribution in MPI for @emph{both} the real and complex data
is determined by the shape of the @emph{complex} data. That is, you
call the appropriate @samp{local size} function for the @ndimshalf{}
complex data, and then use the @emph{same} distribution for the real
data except that the last complex dimension is replaced by a (padded)
real dimension of twice the length.
@end itemize
For example, suppose we are performing an out-of-place r2c transform of
@threedims{L,M,N} real data [padded to @threedims{L,M,2(N/2+1)}],
resulting in @threedims{L,M,N/2+1} complex data. Similar to the
example in @ref{2d MPI example}, we might do something like:
@example
#include <fftw3-mpi.h>
int main(int argc, char **argv)
@{
    const ptrdiff_t L = ..., M = ..., N = ...;
    fftw_plan plan;
    double *rin;
    fftw_complex *cout;
    ptrdiff_t alloc_local, local_n0, local_0_start, i, j, k;

    MPI_Init(&argc, &argv);
    fftw_mpi_init();

    /* @r{get local data size and allocate} */
    alloc_local = fftw_mpi_local_size_3d(L, M, N/2+1, MPI_COMM_WORLD,
                                         &local_n0, &local_0_start);
    rin = fftw_alloc_real(2 * alloc_local);
    cout = fftw_alloc_complex(alloc_local);

    /* @r{create plan for out-of-place r2c DFT} */
    plan = fftw_mpi_plan_dft_r2c_3d(L, M, N, rin, cout, MPI_COMM_WORLD,
                                    FFTW_MEASURE);

    /* @r{initialize rin to some function} my_func(x,y,z) */
    for (i = 0; i < local_n0; ++i)
        for (j = 0; j < M; ++j)
            for (k = 0; k < N; ++k)
                rin[(i*M + j) * (2*(N/2+1)) + k] = my_func(local_0_start+i, j, k);

    /* @r{compute transforms as many times as desired} */
    fftw_execute(plan);

    fftw_destroy_plan(plan);

    MPI_Finalize();
@}
@end example
@findex fftw_alloc_real
@cindex row-major
Note that we allocated @code{rin} using @code{fftw_alloc_real} with an
argument of @code{2 * alloc_local}: since @code{alloc_local} is the
number of @emph{complex} values to allocate, the number of @emph{real}
values is twice as many. The @code{rin} array is then
@threedims{local_n0,M,2(N/2+1)} in row-major order, so its
@code{(i,j,k)} element is at the index @code{(i*M + j) * (2*(N/2+1)) +
k} (@pxref{Multi-dimensional Array Format}).
@cindex transpose
@ctindex FFTW_MPI_TRANSPOSED_OUT
@ctindex FFTW_MPI_TRANSPOSED_IN
As for the complex transforms, improved performance can be obtained by
specifying that the output is the transpose of the input or vice versa
(@pxref{Transposed distributions}).  In our @threedims{L,M,N} r2c
example, including @code{FFTW_MPI_TRANSPOSED_OUT} in the flags means that
the input would be a padded @threedims{L,M,2(N/2+1)} real array
distributed over the @code{L} dimension, while the output would be an
@threedims{M,L,N/2+1} complex array distributed over the @code{M}
dimension.  To perform the inverse c2r transform with the same data
distributions, you would use the @code{FFTW_MPI_TRANSPOSED_IN} flag.
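For example, the transposed-output variant of the r2c example above
might be set up as follows (a sketch of ours; @code{local_n1} and
@code{local_1_start} are additional @code{ptrdiff_t} variables, and the
remaining variables are as in the earlier listing):
@example
ptrdiff_t local_n1, local_1_start;

alloc_local = fftw_mpi_local_size_3d_transposed(L, M, N/2+1, MPI_COMM_WORLD,
                                                &local_n0, &local_0_start,
                                                &local_n1, &local_1_start);
rin = fftw_alloc_real(2 * alloc_local);
cout = fftw_alloc_complex(alloc_local);

/* @r{input: padded} local_n0 x M x 2(N/2+1) @r{reals;}
   @r{output:} local_n1 x L x (N/2+1) @r{complex values} */
plan = fftw_mpi_plan_dft_r2c_3d(L, M, N, rin, cout, MPI_COMM_WORLD,
                                FFTW_MEASURE | FFTW_MPI_TRANSPOSED_OUT);
@end example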
@c ------------------------------------------------------------
@node Other Multi-dimensional Real-data MPI Transforms, FFTW MPI Transposes, Multi-dimensional MPI DFTs of Real Data, Distributed-memory FFTW with MPI
@section Other Multi-dimensional Real-data MPI Transforms
@cindex r2r
FFTW's MPI interface also supports multi-dimensional @samp{r2r}
transforms of all kinds supported by the serial interface
(e.g. discrete cosine and sine transforms, discrete Hartley
transforms, etc.). Only multi-dimensional @samp{r2r} transforms, not
one-dimensional transforms, are currently parallelized.
@tindex fftw_r2r_kind
These are used much like the multidimensional complex DFTs discussed
above, except that the data is real rather than complex, and one needs
to pass an r2r transform kind (@code{fftw_r2r_kind}) for each
dimension as in the serial FFTW (@pxref{More DFTs of Real Data}).
For example, one might perform a two-dimensional @twodims{L,M} r2r transform that is
an REDFT10 (DCT-II) in the first dimension and an RODFT10 (DST-II) in
the second dimension with code like:
@example
const ptrdiff_t L = ..., M = ...;
fftw_plan plan;
double *data;
ptrdiff_t alloc_local, local_n0, local_0_start, i, j;
/* @r{get local data size and allocate} */
alloc_local = fftw_mpi_local_size_2d(L, M, MPI_COMM_WORLD,
                                     &local_n0, &local_0_start);
data = fftw_alloc_real(alloc_local);

/* @r{create plan for in-place REDFT10 x RODFT10} */
plan = fftw_mpi_plan_r2r_2d(L, M, data, data, MPI_COMM_WORLD,
                            FFTW_REDFT10, FFTW_RODFT10, FFTW_MEASURE);

/* @r{initialize data to some function} my_function(x,y) */
for (i = 0; i < local_n0; ++i) for (j = 0; j < M; ++j)
    data[i*M + j] = my_function(local_0_start + i, j);
/* @r{compute transforms, in-place, as many times as desired} */
fftw_execute(plan);
fftw_destroy_plan(plan);
@end example
@findex fftw_alloc_real
Notice that we use the same @samp{local_size} functions as we did for
complex data, only now we interpret the sizes in terms of real rather
than complex values, and correspondingly use @code{fftw_alloc_real}.
@c ------------------------------------------------------------
@node FFTW MPI Transposes, FFTW MPI Wisdom, Other Multi-dimensional Real-data MPI Transforms, Distributed-memory FFTW with MPI
@section FFTW MPI Transposes
@cindex transpose
FFTW's MPI Fourier transforms rely on one or more @emph{global
transposition} steps for their communications.  For example, the
multidimensional transforms work by transforming along some
dimensions, then transposing to make the first dimension local and
transforming that, then transposing back. Because global
transposition of a block-distributed matrix has many other potential
uses besides FFTs, FFTW's transpose routines can be called directly,
as documented in this section.
@menu
* Basic distributed-transpose interface::
* Advanced distributed-transpose interface::
* An improved replacement for MPI_Alltoall::
@end menu
@node Basic distributed-transpose interface, Advanced distributed-transpose interface, FFTW MPI Transposes, FFTW MPI Transposes
@subsection Basic distributed-transpose interface
In particular, suppose that we have an @code{n0} by @code{n1} array in
row-major order, block-distributed across the @code{n0} dimension. To
transpose this into an @code{n1} by @code{n0} array block-distributed
across the @code{n1} dimension, we would create a plan by calling the
following function:
@example
fftw_plan fftw_mpi_plan_transpose(ptrdiff_t n0, ptrdiff_t n1,
                                  double *in, double *out,
                                  MPI_Comm comm, unsigned flags);
@end example
@findex fftw_mpi_plan_transpose
The input and output arrays (@code{in} and @code{out}) can be the
same. The transpose is actually executed by calling
@code{fftw_execute} on the plan, as usual.
@findex fftw_execute
The @code{flags} are the usual FFTW planner flags, but support
two additional flags: @code{FFTW_MPI_TRANSPOSED_OUT} and/or
@code{FFTW_MPI_TRANSPOSED_IN}. What these flags indicate, for
transpose plans, is that the output and/or input, respectively, are
@emph{locally} transposed. That is, on each process input data is
normally stored as a @code{local_n0} by @code{n1} array in row-major
order, but for an @code{FFTW_MPI_TRANSPOSED_IN} plan the input data is
stored as @code{n1} by @code{local_n0} in row-major order. Similarly,
@code{FFTW_MPI_TRANSPOSED_OUT} means that the output is @code{n0} by
@code{local_n1} instead of @code{local_n1} by @code{n0}.
@ctindex FFTW_MPI_TRANSPOSED_OUT
@ctindex FFTW_MPI_TRANSPOSED_IN
To determine the local size of the array on each process before and
after the transpose, as well as the amount of storage that must be
allocated, one should call @code{fftw_mpi_local_size_2d_transposed},
just as for a 2d DFT as described in the previous section:
@cindex data distribution
@example
ptrdiff_t fftw_mpi_local_size_2d_transposed
                (ptrdiff_t n0, ptrdiff_t n1, MPI_Comm comm,
                 ptrdiff_t *local_n0, ptrdiff_t *local_0_start,
                 ptrdiff_t *local_n1, ptrdiff_t *local_1_start);
@end example
@findex fftw_mpi_local_size_2d_transposed
Again, the return value is the local storage to allocate, which in
this case is the number of @emph{real} (@code{double}) values rather
than complex numbers as in the previous examples.
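Putting the two calls together, transposing an @code{n0} by @code{n1}
array of @code{double} values in place might look like the following
sketch (ours; @code{n0} and @code{n1} are assumed to be defined
already):
@example
ptrdiff_t alloc_local, local_n0, local_0_start, local_n1, local_1_start;
double *data;
fftw_plan plan;

alloc_local = fftw_mpi_local_size_2d_transposed(n0, n1, MPI_COMM_WORLD,
                                                &local_n0, &local_0_start,
                                                &local_n1, &local_1_start);
data = fftw_alloc_real(alloc_local);

/* @r{... fill the} local_n0 x n1 @r{input block of} data @r{...} */

plan = fftw_mpi_plan_transpose(n0, n1, data, data, MPI_COMM_WORLD,
                               FFTW_MEASURE);
fftw_execute(plan);   /* @r{data now holds the} local_n1 x n0 @r{block of the transpose} */
fftw_destroy_plan(plan);
@end example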
@node Advanced distributed-transpose interface, An improved replacement for MPI_Alltoall, Basic distributed-transpose interface, FFTW MPI Transposes
@subsection Advanced distributed-transpose interface
The above routines are for a transpose of a matrix of numbers (of type
@code{double}), using FFTW's default block sizes. More generally, one
can perform transposes of @emph{tuples} of numbers, with
user-specified block sizes for the input and output:
@example
fftw_plan fftw_mpi_plan_many_transpose
                (ptrdiff_t n0, ptrdiff_t n1, ptrdiff_t howmany,
                 ptrdiff_t block0, ptrdiff_t block1,
                 double *in, double *out, MPI_Comm comm, unsigned flags);
@end example
@findex fftw_mpi_plan_many_transpose
In this case, one is transposing an @code{n0} by @code{n1} matrix of
@code{howmany}-tuples (e.g. @code{howmany = 2} for complex numbers).
The input is distributed along the @code{n0} dimension with block size
@code{block0}, and the @code{n1} by @code{n0} output is distributed
along the @code{n1} dimension with block size @code{block1}. If
@code{FFTW_MPI_DEFAULT_BLOCK} (0) is passed for a block size then FFTW
uses its default block size. To get the local size of the data on
each process, you should then call @code{fftw_mpi_local_size_many_transposed}.
@ctindex FFTW_MPI_DEFAULT_BLOCK
@findex fftw_mpi_local_size_many_transposed
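For example, to transpose an @code{n0} by @code{n1} array of complex
numbers viewed as 2-tuples of @code{double}, one might write something
like the following sketch (ours, using the default block sizes and our
reading of the @code{fftw_mpi_local_size_many_transposed} prototype
from the FFTW MPI Reference; @code{n0} and @code{n1} are assumed to be
defined already):
@example
ptrdiff_t n[2], alloc_local;
ptrdiff_t local_n0, local_0_start, local_n1, local_1_start;
double *in, *out;
fftw_plan plan;
n[0] = n0; n[1] = n1;

alloc_local = fftw_mpi_local_size_many_transposed(
                  2, n, 2 /* @r{howmany} */,
                  FFTW_MPI_DEFAULT_BLOCK, FFTW_MPI_DEFAULT_BLOCK,
                  MPI_COMM_WORLD,
                  &local_n0, &local_0_start,
                  &local_n1, &local_1_start);
in = fftw_alloc_real(alloc_local);    /* @r{alloc_local counts double values} */
out = fftw_alloc_real(alloc_local);

plan = fftw_mpi_plan_many_transpose(n0, n1, 2 /* @r{howmany} */,
                                    FFTW_MPI_DEFAULT_BLOCK,
                                    FFTW_MPI_DEFAULT_BLOCK,
                                    in, out, MPI_COMM_WORLD, FFTW_MEASURE);
@end example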
@node An improved replacement for MPI_Alltoall, , Advanced distributed-transpose interface, FFTW MPI Transposes
@subsection An improved replacement for MPI_Alltoall
We close this section by noting that FFTW's MPI transpose routines can
be thought of as a generalization of the @code{MPI_Alltoall} function
(albeit only for floating-point types), and in some circumstances can
function as an improved replacement.
@findex MPI_Alltoall
@code{MPI_Alltoall} is defined by the MPI standard as:
@example
int MPI_Alltoall(void *sendbuf, int sendcount, MPI_Datatype sendtype,
                 void *recvbuf, int recvcnt, MPI_Datatype recvtype,
                 MPI_Comm comm);
@end example
In particular, for @code{double*} arrays @code{in} and @code{out},
consider the call:
@example
MPI_Alltoall(in, howmany, MPI_DOUBLE, out, howmany, MPI_DOUBLE, comm);
@end example
This is completely equivalent to:
@example
MPI_Comm_size(comm, &P);
plan = fftw_mpi_plan_many_transpose(P, P, howmany, 1, 1, in, out, comm, FFTW_ESTIMATE);
fftw_execute(plan);
fftw_destroy_plan(plan);
@end example
That is, computing a @twodims{P,P} transpose on @code{P} processes,
with a block size of 1, is just a standard all-to-all communication.
However, using the FFTW routine instead of @code{MPI_Alltoall} may
have certain advantages. First of all, FFTW's routine can operate
in-place (@code{in == out}) whereas @code{MPI_Alltoall} can only
operate out-of-place.
@cindex in-place
Second, even for out-of-place plans, FFTW's routine may be faster,
especially if you need to perform the all-to-all communication many
times and can afford to use @code{FFTW_MEASURE} or
@code{FFTW_PATIENT}. It should certainly be no slower, not including
the time to create the plan, since one of the possible algorithms that
FFTW uses for an out-of-place transpose @emph{is} simply to call
@code{MPI_Alltoall}. However, FFTW also considers several other
possible algorithms that, depending on your MPI implementation and
your hardware, may be faster.
@ctindex FFTW_MEASURE
@ctindex FFTW_PATIENT
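For example, here is a sketch (ours) of using such a plan as a reusable
in-place all-to-all, where @code{data}, @code{howmany}, @code{comm},
and the iteration count @code{niter} are assumed to be set up
elsewhere:
@example
int P;
ptrdiff_t iter;
fftw_plan plan;

MPI_Comm_size(comm, &P);
plan = fftw_mpi_plan_many_transpose(P, P, howmany, 1, 1,
                                    data, data,   /* @r{in-place} */
                                    comm, FFTW_MEASURE);

for (iter = 0; iter < niter; ++iter) @{
    /* @r{... update the local} howmany @r{doubles destined for each process ...} */
    fftw_execute(plan);   /* @r{performs the all-to-all exchange} */
@}
fftw_destroy_plan(plan);
@end example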
@c ------------------------------------------------------------
@node FFTW MPI Wisdom, Avoiding MPI Deadlocks, FFTW MPI Transposes, Distributed-memory FFTW with MPI
@section FFTW MPI Wisdom
@cindex wisdom
@cindex saving plans to disk
FFTW's ``wisdom'' facility (@pxref{Words of Wisdom-Saving Plans}) can
be used to save MPI plans as well as to save uniprocessor plans.
However, for MPI there are several unavoidable complications.
@cindex MPI I/O
First, the MPI standard does not guarantee that every process can
perform file I/O (at least, not using C stdio routines)---in general,
we may only assume that process 0 is capable of I/O.@footnote{In fact,
even this assumption is not technically guaranteed by the standard,
although it seems to be universal in actual MPI implementations and is
widely assumed by MPI-using software. Technically, you need to query
the @code{MPI_IO} attribute of @code{MPI_COMM_WORLD} with
@code{MPI_Attr_get}. If this attribute is @code{MPI_PROC_NULL}, no
I/O is possible. If it is @code{MPI_ANY_SOURCE}, any process can
perform I/O. Otherwise, it is the rank of a process that can perform
I/O ... but since it is not guaranteed to yield the @emph{same} rank
on all processes, you have to do an @code{MPI_Allreduce} of some kind
if you want all processes to agree about which is going to do I/O.
And even then, the standard only guarantees that this process can
perform output, but not input. See e.g. @cite{Parallel Programming
with MPI} by P. S. Pacheco, section 8.1.3. Needless to say, in our
experience virtually no MPI programmers worry about this.} So, if we
want to export the wisdom from a single process to a file, we must
first export the wisdom to a string, then send it to process 0, then
write it to a file.
Second, in principle we may want to have separate wisdom for every
process, since in general the processes may run on different hardware
even for a single MPI program. However, in practice FFTW's MPI code
is designed for the case of homogeneous hardware (@pxref{Load
balancing}), and in this case it is convenient to use the same wisdom
for every process. Thus, we need a mechanism to synchronize the wisdom.
To address both of these problems, FFTW provides the following two
functions:
@example
void fftw_mpi_broadcast_wisdom(MPI_Comm comm);
void fftw_mpi_gather_wisdom(MPI_Comm comm);
@end example
@findex fftw_mpi_gather_wisdom
@findex fftw_mpi_broadcast_wisdom
Given a communicator @code{comm}, @code{fftw_mpi_broadcast_wisdom}
will broadcast the wisdom from process 0 to all other processes.
Conversely, @code{fftw_mpi_gather_wisdom} will collect wisdom from all
processes onto process 0. (If the plans created for the same problem
by different processes are not the same, @code{fftw_mpi_gather_wisdom}
will arbitrarily choose one of the plans.) Both of these functions
may result in suboptimal plans for different processes if the
processes are running on non-identical hardware. Both of these
functions are @emph{collective} calls, which means that they must be
executed by all processes in the communicator.
@cindex collective function
So, for example, a typical code snippet to import wisdom from a file
and use it on all processes would be:
@example
@{
    int rank;

    fftw_mpi_init();
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0) fftw_import_wisdom_from_filename("mywisdom");
    fftw_mpi_broadcast_wisdom(MPI_COMM_WORLD);
@}
@end example
(Note that we must call @code{fftw_mpi_init} before importing any
wisdom that might contain MPI plans.) Similarly, a typical code
snippet to export wisdom from all processes to a file is:
@findex fftw_mpi_init
@example
@{
    int rank;

    fftw_mpi_gather_wisdom(MPI_COMM_WORLD);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0) fftw_export_wisdom_to_filename("mywisdom");
@}
@end example
@c ------------------------------------------------------------
@node Avoiding MPI Deadlocks, FFTW MPI Performance Tips, FFTW MPI Wisdom, Distributed-memory FFTW with MPI
@section Avoiding MPI Deadlocks
@cindex deadlock
An MPI program can @emph{deadlock} if one process is waiting for a
message from another process that never gets sent. To avoid deadlocks
when using FFTW's MPI routines, it is important to know which
functions are @emph{collective}: that is, which functions must
@emph{always} be called in the @emph{same order} from @emph{every}
process in a given communicator. (For example, @code{MPI_Barrier} is
the canonical example of a collective function in the MPI standard.)
@cindex collective function
@findex MPI_Barrier
The functions in FFTW that are @emph{always} collective are: every
function beginning with @samp{fftw_mpi_plan}, as well as
@code{fftw_mpi_broadcast_wisdom} and @code{fftw_mpi_gather_wisdom}.
Also, the following functions from the ordinary FFTW interface are