Table of Contents
- DiallelX: Seismic network cross-correlation calculator
Ubuntu/Linux Mint
# Install required packages
sudo apt install make gfortran libfftw3-dev gnuplot -y
# Build DiallelX
make
# Generate dummy data in a directory named "test"
./DiallelX -s -d test
# Calculate NCCs among dummy data
./DiallelX -d test
# Plot top 100 similar waveform pairs
./DiallelX -p 100 -d test
AlmaLinux
# Install required packages
sudo dnf install gfortran fftw3-devel gnuplot -y
# Build DiallelX
make
# Generate dummy data in a directory named "test"
./DiallelX -s -d test
# Calculate Network Cross-correlation Coefficients among dummy continuous records and dummy templates
./DiallelX -d test
DiallelX is a CPU-oriented modern fortran program to approximate Network Cross-Correlation coefficients (NCCs) among multiple continuous records and template waveforms observed at multiple seismic stations. The results, relatively less accurate but sufficient to find new seismic events, are obtained several-fold faster than a conventional scheme.
Required steps are as follows:
- Install DiallelX (gfortran and single-precision FFTW3 are required)
- Prepare waveform records in float32 binary formats (w/o record boundaries) or SAC format
- Run
./DiallelX -d [IOdirname]
- Define new events based on results and stats
Apropos, DiallelX is derived from a terminology of genetics as follows[Hayman 1954]:
A diallel cross is the set of all possible matings between several genotypes.
Therefore, the term diallel in this context represents the set of cross-correlation coefficients (or functions) between all possible pairs of the continuous and template waveforms.
Just running make
under the DiallelX
directory compiles the program.
However, other tools must have been installed before make
.
Users have three installation options (the first is recommended, given official deprecation of ifort
as in the third):
- Install
gfortran
andfftw3
- Install Intel oneAPI Base Toolkit and Intel HPC Toolkit
- Use Intel Fortran Compiler Classic (
ifort
) that is deprecated and no longer distributed
Here is the installation guide for Linux Mint 21.x (the development environment of DiallelX), Ubuntu 22.04, and AlmaLinux 9.3 .
For Ubuntu (make
has not been installed by default) or Linux Mint:
sudo apt install make gfortran libfftw3-dev gnuplot -y
For AlmaLinux:
sudo dnf install gfortran fftw3-devel gnuplot -y
2. Install Intel oneAPI Base Toolkit and Intel HPC Toolkit
Warning
(2024-06-09) After upgrading to ifort Version 2021.12.0 Build 20240211_000000 on Ubuntu/Linux Mint, ifort does not work because of missing omp_lib. Tentatively, to avoid this problem with ifort, copy and paste
$(shell ./omp_lib_finder.sh)
into the right hand (blank) side of 13th line in src/Makefile
(INCLUDE = ).
For Linux Mint/Ubuntu, the following procedures are provided in the above two links:
wget -O- https://apt.repos.intel.com/intel-gpg-keys/GPG-PUB-KEY-INTEL-SW-PRODUCTS.PUB \ | gpg --dearmor | sudo tee /usr/share/keyrings/oneapi-archive-keyring.gpg > /dev/null
echo "deb [signed-by=/usr/share/keyrings/oneapi-archive-keyring.gpg] https://apt.repos.intel.com/oneapi all main" | sudo tee /etc/apt/sources.list.d/oneAPI.list
sudo apt update
sudo apt install intel-basekit intel-hpckit gnuplot -y
After these installations, running
echo "source /opt/intel/oneapi/setvars.sh > /dev/null" >> ~/.bashrc
is required to set environment variables.
The same option is unavailable for macOS because of the official announcement:
NOTE: Starting with the 2024.0 release, macOS* is no longer supported in Intel® oneAPI Toolkits and components.
3. Use Intel Fortran Compiler Classic (ifort
) that is deprecated and no longer distributed
Intel Fortran Compiler Classic (ifort
) may provide the fastest executable file of DiallelX.
However, ifort
says:
Intel(R) Fortran Compiler Classic (ifort) is now deprecated and will be discontinued late 2024. Intel recommends that customers transition now to using the LLVM-based Intel(R) Fortran Compiler (ifx) for continued Windows* and Linux* support, new language support, new language features, and optimizations.
At least for Jul. 2024, unfortunately, ifx
provides a significantly slower executable file than ifort
and gfortan
does.
Therefore, Makefile
of DiallelX uses ifort
if available.
During make
, the compiler will be chosen automatically.
However, if you prefer a compiler out of multiple compilers, you can run one of the following three:
make compiler=gfortran
make compiler=ifort
make compiler=ifx
Here is an example of the nomenclature of waveform files.
Input waveforms must be SAC or float32 binary files w/o record boundaries.
In the following, ".sac" can be replaced by ".bin".
Inside of the brackets [ ] can be defined by the user.
All data files are assumed to be in subdirectories under [IOdirname]
, whose name is arbitrary and to be specified by the user when ./DiallelX
is executed.
[IOdirname]
can be a relative or absolute path.
All continuous records must consist of the same length.
For example, a daily record with 100 Hz sampling rate results in 100
Users have to place continuous records in [IOdirname]/continuous_records/
, where the file names are assumed to be [RecordID]_[ChannelID].sac
.
[RecordID]
is IDs of each continuous record; e.g., hourly, daily, weekly, etc. However, monthly-long data are not suitable due to their different lengths.
Note that too long (e.g., 100 Hz yearly) data per file is not acceptable because the number of samples for each file is represented by int32, and its maximum is [ChannelID]
(e.g., station codes and components) must be common among continuous records and template waveforms and can include underscores; for example, the following format is possible for daily records:
ls -1v [IOdirname]/continuous_records/
20231231_N.KYTH_E.sac # 1st continuous record, channel 1
20231231_N.KYTH_N.sac # 1st continuous record, channel 2
20231231_N.KYTH_U.sac # 1st continuous record, channel 3
20231231_N.KMMH_E.sac # 1st continuous record, channel 4
20231231_N.KMMH_N.sac # 1st continuous record, channel 5
20231231_N.KMMH_U.sac # 1st continuous record, channel 6
20240101_N.KYTH_E.sac # 2nd continuous record, channel 1
⋮
20240101_N.KMMH_U.sac # 2nd continuous record, channel 6
20240102_N.KYTH_E.sac # 3rd continuous record, channel 1
⋮
where YYYYMMDD
-format before the first underscores refers to the date of the continuous record, "N.xxxH" are Hi-net station codes, and "E", "N", and "U" represent EW, NS, and UD components, respectively.
Instead, YYYYMMDD
may be replaced by sequential numbers, and [ChannelID]
can be each channel number or channel CD (e.g., N.KYTH_U = 55e3, N.KMMH_E = 5625, etc.)
The length of all template waveforms must be the same and maybe a power of 2 given a property of FFT.
Other lengths of composite number (e.g.,
Users have to locate template waveforms in [IOdirname]/templates/
, where the file names are assumed to be [TemplateID]_[ChannelID].sac
.
[TemplateID]
is IDs of each template event and is free of underscore ( _ ) because only the first underscores are considered separators.
[ChannelID]
(e.g., station codes and components) must be common among continuous records and template waveforms.
The following format is possible for three template events in total:
ls -1v [IOdirname]/templates/
20231231-23583103_N.KYTH_E.sac # 1st template, channel 1
20231231-23583103_N.KYTH_N.sac # 1st template, channel 2
20231231-23583103_N.KYTH_U.sac # 1st template, channel 3
20231231-23583103_N.KMMH_E.sac # 1st template, channel 4
20231231-23583103_N.KMMH_N.sac # 1st template, channel 5
20231231-23583103_N.KMMH_U.sac # 1st template, channel 6
20240101-01392817_N.KYTH_E.sac # 2nd template, channel 1
⋮
20240101-01392817_N.KMMH_U.sac # 2st template, channel 6
20240101-03022439_N.KYTH_E.sac # 3rd template, channel 1
⋮
20240101-03022439_N.KMMH_U.sac # 3rd template, channel 1
where [YYYYMMDD]-[hhmmssxx]
-format before underscores refers to the date and start time of the template waveform.
Of course, other unique IDs (e.g., [SequencialNumber]_[ChannelID].sac
) are also possible.
DiallelX approximates Network Cross-correlation Coefficients (NCCs) among all continuous and template waveform pairs. Note that even the autocorrelation of a template is not necessarily equal to 1.0 due to the approximation. However, users can control the accuracy.
In the calculation, each continuous record with length r
is divided into n
windows with length w
equivalent to the template length.
The division of k
-th record is schematically as follows.:
┌----------k-th record----------┐┌--------(k+1)-th record--------┐
1......s.....w..................r1......s..p
└-1st window-┘ |└-1st window-┘
└-2nd window-┘ | └-2nd window-┘
└stride┘ └-3rd window-┘ |└stride┘
└stride┘ ⋱
└n-th window-┘
└-padding-┘
The windows overlap with length w-s
, where s:=w/a
is the stride length and a
is the specified accuracy (default value is 2).
If a=w
(i.e., s=1
), NCC values are calculated for all possible windows of length w
, which provides the most accurate results but is time-consuming.
Thus, a=2
or a=4
are recommended.
The padding that makes the final (n
-th) window length of w
will automatically be extracted from the next continuous record's initial part or filled by zero if the k
-th record is final.
The following parameters are examples for 1-day continuous and 10.24-second template waveforms sampled at 100 Hz, and a=4
.:
r: length of record : 8640000 samples (=100Hz × 24h)
w: length of template : 1024 samples (=length of window)
s: length of stride : 256 samples (=w/a)
p: length of padding : 768 samples (=s*(n-1)+w-r)
n: number of windows : 33750 (=floor((r-1)/s)+1)
where n=floor((r-1)/s)+1
is derived so that s*(n-1) + 1 ≦ r < s*n + 1
holds, where the left and right-hand side are the initial point of the n
-th and n+1
-th windows, respectively.
Also, Appendix A: Visual example of a diallel may help readers understand the meaning of diallel.
To calculate the diallel, run the following command line:
./DiallelX -d [IOdirname]
For more accurate results, a possible option is as follows:
./DiallelX -a [A] -d [IOdirname]
with -a 4
et cetera, where a positive integer [A]
must be a divisor of the template waveform length, and the default is -a 2
.
The larger value of [A]
raises accuracy and computation time; see Algorithm: an overview section for details.
The standard output like below will be shown when it works:
./DiallelX -d [IOdirname]
DiallelX: IO
IO directory : [IOdirname]
output file : [IOdirname]/results/candidates.csv
DiallelX: fundamental parameters
number of records : 4 records
length of record : 8640000 samples
number of templates: 1000 templates
length of template : 1024 samples (= fft length)
number of channels : 15 channels
DiallelX: optional parameters
accuracy : 2
number of threads : 24 OMP threads
DiallelX: derived parameters
number of windows : 16875 windows
length of stride : 512 samples
length of padding : 512 samples
required memory : 1.620 Gbytes (infimum)
computation cost : 1.0E+12
DiallelX: Main loop started
RecordID=20231224, progress=1/4, time=00:00:04.677(+4.68E+00sec.)
RecordID=20231225, progress=2/4, time=00:00:09.443(+4.77E+00sec.)
RecordID=20231226, progress=3/4, time=00:00:14.224(+4.78E+00sec.)
RecordID=20231227, progress=4/4, time=00:00:18.966(+4.74E+00sec.)
DiallelX: Done
However, running with -l
option:
./DiallelX -l -d [IOdirname]
is recommended before calculation with a huge dataset to show the list of parameters without diallel calculation.
In the list, required memory
is roughly estimated given the size of some large arrays.
Through the same environment, the total processing time is proportional to computation cost
if the number of templates
is sufficiently large (
For other options, run
./DiallelX -h
to show help.
The command in the previous section provides two following files:
[IOdirname]/results/candidates.csv
[IOdirname]/results/histogram.dat
The output file candidates.csv
includes a list of similar waveforms, sorted by their NCC value in descending order.
head [IOdirname]/results/candidates.csv | column -s, -t
0.97231 4 4429856 14
0.96088 1 1948106 12
0.96004 1 5761983 7
0.94624 4 6657165 16
0.93600 3 8161597 24
0.92957 2 1607820 23
0.92449 1 194523 9
0.92246 3 4860318 7
0.91865 2 2760234 9
0.91829 4 2838100 23
Each column represents the NCC value, the continuous record number, the initial sample number, and the most similar template waveform. Hence, the meaning of the first line in the above list is "A waveform starting from the 4429856th sample in the 4th continuous record is similar to the 14th template with their NCC value of 0.97231."
Each line is a result from each window explained in Algorithm: an overview section. However, each NCC value is compared to those of their previous and next window, and only the windows with relatively higher NCC values than their neighbors are in the output. This comparison reduces useless results because the neighboring windows overlap and are similar to the same template.
Even if a window is similar to multiple templates, the most similar one is listed as sufficient for event detection.
Refer to the line number in [IOdirname]/parameters/continuous_records.csv
and [IOdirname]/parameters/templates.csv
to confirm the filename of the continuous record and template corresponding to the number in the second and forth column, respectively.
Even if a window consists only of noises and is similar to nothing, at least one template number is listed. However, users can exclude such a window in the user-defined event detection process.
Sometimes, the waveform starting from the sample number in the third column may protrude from the record.
For example, the following output is possible if the record and template lengths are 8,640,000 (=100Hz
head [IOdirname]/results/candidates.csv | column -s, -t
⋮
0.74035 1 8639986 25
⋮
The record starting from 8639986
lasts until 8639986+1023=8641009
.
This overflow can happen because DiallelX adds the samples of the head of the next record to the tail of the current record.
Therefore, the waveform similar to the 25th template consists of 8639986th to 8640000th samples of the 1st record and 1st to 1009th samples of the 2nd record.
This file includes a histogram of NCC values for the diallel.
Therefore, total number in whole bins is (# of records)
Here is a visual example of a diallel (set of cross-correlation functions among the windows and templates) and output from ten windows and five templates with the length of 64 samples and a=2
(i.e., stride length = 64/2=32).
For simplicity, we ignore the padding. Moreover, we consider only single-channel waveforms in this appendix; see Appendix B: Details of algorithm to treat network cross-correlation coefficients.
The continuous record and windows are as in the following figure:
The template waveforms are the following five:
Then, the diallel is the following figure, where the windows and templates (black) are in the top row and right column, respectively, and "×" accompanied by decimals indicates the maxima of each cross-correlation function:
For each column in the diallel, the (light)blue function has the largest CC value and is a candidate to be output. However, the neighboring windows tend to show higher CC values because they overlap. Only their local peaks along the lateral direction (saturated blue) are output to avoid duplicate detection.
As in Algorithm: an overview section, DiallelX calculates NCC values between windows cut out from the continuous records and templates. Here, we explain why this algorithm reduces computation time.
Let
We have multiple templates, and all templates will be heavily used.
Therefore, we calculate and store spectra of all template waveforms in advance.
Another point is that all waveforms are observed in multiple channels, where
For
where the overline denotes the complex conjugate, and
However, the summation and IFFT are linear operators, which enable us to change the order of operation as:
where the summation w.r.t. all channels
The following table shows computation times for the main loop, which almost consists of calculations of eq.(2) with a certain dataset.
# of channels ( |
time for the main loop ( |
---|---|
1 | 2.9 seconds |
3 | 3.5 seconds |
5 | 4.0 seconds |
15 | 6.8 seconds |
The above relation is well approximated by the linear equation:
which means that 2.634 seconds is required for only one IFFT for each window and template.
IF our algorithm is along eq.(1), IFFT is required
Therefore, eq.(2) gets relatively faster as the number of channels increases.