A spot finder, starting with the simple dispersion algorithm, built to demonstrate the code developed using the `vitis_hls` workflow. This currently implements a streaming version of the spot finder compiled for a detector image the same size as an Eiger 2XE 16M: the image size is a compile-time rather than a run-time option, since the generated hardware depends very significantly on the width of the image.
The following are required to get anywhere at all:
- an install of the Xilinx FPGA software tools (tested with the 2022.2 versions)
- an installation of XRT
- a platform for the device you have, which must support DMA (tested on a U250, which is overkill)
The platform XPFM file must be passed either in the environment or via `make PLATFORM=<platform>`. For me the following set-up was used:

```
. /opt/xilinx/xrt/setup.sh
. /scratch2/xilinx/Vitis_HLS/2022.2/settings64.sh
export PLATFORM=/opt/xilinx/platforms/xilinx_u250_gen3x16_xdma_4_1_202210_1/xilinx_u250_gen3x16_xdma_4_1_202210_1.xpfm
```
Building this is a multi-part process: build the kernel (`cd kernel; make PLATFORM=<platform>`), then link the kernel (`cd link; make PLATFORM=<platform>`), and finally compile the host. Be warned that linking the kernel takes hours, a chunk of CPU and a lot of memory. In each case there is a `Makefile` which defines all the pertinent details, and everything should just work.
The spot finder algorithm itself consists of three separate stages to compute the kernel functions on a stream rather than through local evaluation as performed on GPU. The stream components are:
- construct a summed area table for each of `m`, `i`, `i^2` - outputs a stream of blocks where each block is as wide as the image
- compute the local sum from the SAT for each of the parameters, over a square kernel region
- compute the threshold function based on the comparison of the pixel value with the local mean and variance
Around this there is a pixel doubler stage, which passes the pixel values into the summed area table calculation but also on to the thresholding, with a suitable FIFO depth to allow the rest of the calculation to catch up. Finally, to map to the memory buffers, there are a reader and a writer, though I am not currently happy with these.
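As an illustration of what such a duplicating stage can look like (a minimal sketch: the names and the single-pixel stream are assumptions for clarity - the real kernel works on packs of pixels):

```c++
#include "hls_stream.h"
#include <stdint.h>

typedef uint16_t pixel_t;

// Copy each incoming pixel to two output streams: one feeds the SAT
// accumulation, the other waits in a deep FIFO until the local sums for
// that pixel are available to the threshold stage.
void duplicate_stream(hls::stream<pixel_t> &in,
                      hls::stream<pixel_t> &to_sat,
                      hls::stream<pixel_t> &to_threshold,
                      uint32_t n_pixels) {
  for (uint32_t i = 0; i < n_pixels; i++) {
#pragma HLS PIPELINE II=1
    pixel_t p = in.read();
    to_sat.write(p);
    to_threshold.write(p);
  }
}
```

The depth of the `to_threshold` FIFO then needs to be large enough to hold every pixel in flight while the SAT and local sum stages catch up.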
The summed area tables are typed as `uint8_t` for the mask, and `uint32_t` for the sum of `i` and of `i^2` - the latter of these is strictly incorrect for `uint16_t` data, however:
- by the time the SATs are calculated the masked pixels are set to zero
- the limiting factor is the sum over a square kernel region (i.e. 7x7 pixels)
- most data are generally small valued, i.e. pixels > 1000 are rare
The SAT calculations could be split into three separate kernels, but they are currently implemented as a single kernel: the fact that all three depend on computing the mask made keeping them together seem sensible, and in any case I am sure the conditional assignment on the pixel value costs only a minute amount of area. The input to the SAT accumulation is a stream of pixels and the output a stream of blocks - this made the local area calculation much simpler.
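As an illustration, the streaming SAT update reduces to a running row sum plus the value from the row above - this is a sketch only, with the names, the compile-time width and the single-value stream all simplifying assumptions, and the mask and `i^2` tables following the same pattern:

```c++
#include "hls_stream.h"
#include <stdint.h>

#define WIDTH 4108 // illustrative compile-time image width

typedef uint16_t pixel_t;

// S(x, y) = rowsum(x, y) + S(x, y - 1), where rowsum is the running sum
// along the current row - see the note above about the type widths.
void sat_i(hls::stream<pixel_t> &in, hls::stream<uint32_t> &out,
           uint32_t n_rows) {
  static uint32_t above[WIDTH]; // SAT values from the previous row
  for (uint32_t y = 0; y < n_rows; y++) {
    uint32_t row_sum = 0;
    for (uint32_t x = 0; x < WIDTH; x++) {
#pragma HLS PIPELINE II=1
      row_sum += in.read();
      uint32_t s = row_sum + (y == 0 ? 0u : above[x]);
      above[x] = s;
      out.write(s);
    }
  }
}
```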
The summed area calculation is there to allow a local sum to be evaluated from four lookups:
```
D         C
 +-------+
 |       |
 |       |
 |   X   |
 |       |
 |       |
B+-------+A
```
`A`, `B`, `C` and `D` are the sums from the upper left corner of the array to that element, i.e. the sum of all the values above and to the left of that element. The sum for the kernel region around `X` can then be evaluated as `A + D - B - C`: this is simple and straightforward - and also the most expensive part of the whole thing, since it requires reading from four different positions in the array at the same time, while writing to the same array in the same cycle with values coming in from the SAT calculation.
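Expressed as a sketch in code (the flat array layout, the names and the omitted edge handling are assumptions for illustration):

```c++
#include <stdint.h>

#define WIDTH 4108 // illustrative image width
#define KNL 3      // half-width of the 7x7 kernel region

// sat[y * WIDTH + x] holds the sum of all pixels above and to the left of
// (x, y), inclusive. Edge handling (x or y near the image border) omitted.
static inline uint32_t local_sum(const uint32_t *sat, int x, int y) {
  int x0 = x - KNL - 1, x1 = x + KNL; // columns of D/B and C/A
  int y0 = y - KNL - 1, y1 = y + KNL; // rows of D/C and B/A
  uint32_t A = sat[y1 * WIDTH + x1];  // lower right
  uint32_t B = sat[y1 * WIDTH + x0];  // lower left
  uint32_t C = sat[y0 * WIDTH + x1];  // upper right
  uint32_t D = sat[y0 * WIDTH + x0];  // upper left
  return A + D - B - C;
}
```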
This was where the `hls::stream_of_blocks` became a lifesaver, because the `A` and `B` values can be read from the incoming block rather than this array, and written into this array from the block at the same time, ensuring that `II=1` can be maintained. This does however require a lot of area to deliver, and some partitioning of the SAT arrays, discussed below under "things I learned."
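For reference, the `hls::stream_of_blocks` usage pattern (from `hls_streamofblocks.h`) looks roughly like this - the row type, width and function names are illustrative, and in the real design the producer and consumer sit in a dataflow region:

```c++
#include "hls_streamofblocks.h"
#include <stdint.h>

#define WIDTH 4108 // illustrative

typedef uint32_t row_t[WIDTH];

// Producer: acquire a write lock on the next free block, fill a whole row,
// release it when the lock goes out of scope.
void produce_row(hls::stream_of_blocks<row_t> &rows, const uint32_t *src) {
  hls::write_lock<row_t> row(rows);
  for (int x = 0; x < WIDTH; x++) {
#pragma HLS PIPELINE II=1
    row[x] = src[x];
  }
}

// Consumer: acquire a read lock on the oldest filled block and use it as an
// ordinary array, independently of any other buffers being written.
void consume_row(hls::stream_of_blocks<row_t> &rows, uint32_t *dst) {
  hls::read_lock<row_t> row(rows);
  for (int x = 0; x < WIDTH; x++) {
#pragma HLS PIPELINE II=1
    dst[x] = row[x];
  }
}
```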
Worth noting here that the filter calculation was recast to avoid any use of floating point values: everything is now done with integer arithmetic and no `sqrt` calls. The before and after:
```c++
// before: floating point
if (p > 0 && m_sum >= 2) {
  float bg_lhs = (float)m_sum * i2_sum - (float)i_sum * i_sum -
                 (float)i_sum * (m_sum - 1);
  float bg_rhs = i_sum * sigma_b * sqrtf((float)2.0f * (m_sum - 1));
  uint16_t background = bg_lhs > bg_rhs;
  float fg_lhs = (float)m_sum * p - (float)i_sum;
  float fg_rhs = sigma_s * sqrtf((float)i_sum * m_sum);
  uint16_t foreground = fg_lhs > fg_rhs;
  signal = background && foreground;
}
```

```c++
// after: integer only
if (p > 0 && m_sum >= 2) {
  int64_t bg_lhs = (int64_t)m_sum * i2_sum - (int64_t)i_sum * i_sum -
                   (int64_t)i_sum * (m_sum - 1);
  int64_t bg_rhs2 = i_sum * i_sum * 72 * (m_sum - 1);
  uint8_t background = (bg_lhs > 0) && ((bg_lhs * bg_lhs) > bg_rhs2);
  int32_t fg_lhs = (int32_t)m_sum * p - (int32_t)i_sum;
  int32_t fg_rhs2 = 9 * i_sum * m_sum;
  uint8_t foreground = (fg_lhs > 0) && ((fg_lhs * fg_lhs) > fg_rhs2);
  signal = background && foreground;
}
```
N.B. this compiles `sigma_b = 6` and `sigma_s = 3` into the constant terms, and requires 64-bit working variables for `uint16_t` measurements. The motivation behind this is that the implementation of a floating point `hls::sqrtf` implies a latency of ~30 cycles, so keeping `N` pixels in flight while maintaining `II=1` requires around `30*N` instances of the calculation to keep the pipeline from stalling: a lot of area. As an aside, the term `sqrtf((float)2.0f * (m_sum - 1))` can be tabulated since `m_sum - 1` has only around 50 possible valid values; the `bg_rhs` term can therefore be collapsed down to a lookup and a multiply.
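A sketch of that tabulation (the table size assumes the 7x7 kernel, so `m_sum` is at most 49; the names are illustrative):

```c++
#include <math.h>

#define MAX_M 49     // 7x7 kernel: m_sum is at most 49
#define SIGMA_B 6.0f // compiled-in, as in the code above

// sigma_b * sqrtf(2 * (m_sum - 1)) tabulated over the ~50 possible m_sum
// values; an illustration of the aside above, not the shipped code.
static float bg_rhs_table[MAX_M + 1];

static void init_bg_rhs_table(void) {
  for (int m = 0; m <= MAX_M; m++)
    bg_rhs_table[m] = m > 0 ? SIGMA_B * sqrtf(2.0f * (m - 1)) : 0.0f;
}

// bg_rhs = i_sum * sigma_b * sqrtf(2 * (m_sum - 1)) collapses to a lookup
// and a single multiply:
static inline float bg_rhs(float i_sum, int m_sum) {
  return i_sum * bg_rhs_table[m_sum];
}
```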
Debugging the code turned out to be something of a nightmare, in no small part because the behaviour of the system on the hardware is subtly but significantly different to the behaviour in the simulation flow in `vitis_hls`. To address this there are two additional steps that can be performed between the `vitis_hls` process and the hardware implementation: software and hardware emulation. The first of these is useful to check for "obvious" bugs (it is comparable to the `vitis_hls` process but includes simulation of the streams etc.) but does not really test everything. It is, however, quick.
In both the `kernel` and `link` directories, run `make TARGET=sw_emu PLATFORM=<platform>` - this is relatively quick to execute. Then to run the emulation you need to pull a configuration from your platform with:

```
emconfigutil --platform <platform>
```

(this only needs to be done once to make the file) then run the simulation with:

```
XCL_EMULATION_MODE=sw_emu ../host/spot_finder ./spot_finder.xclbin
```
Yes, you need to have compiled the host for this to work. This runs pretty quickly (well, the kernel calculation is slow in comparison to the real kernel, but fast compared to `hw_emu` below) and should show that everything works, as well as reporting the maximum depth of the simulation.
By analogy with the above, `make TARGET=hw_emu PLATFORM=<platform>` - again this is relatively quick to compile compared with the real `hw` build. You also need the `emconfigutil` step if you didn't run it already, and may need a spot of `make clean` to remove the old `sw_emu` build targets. Then run:

```
XCL_EMULATION_MODE=hw_emu ../host/spot_finder ./spot_finder.xclbin
```
This will take a long time to run - for example around 40 minutes to emulate 20ms of FPGA execution - but does so in a very verbose manner which is very useful for spotting bugs (e.g. deadlocks). There's a note on this under "things I learned."
A lot of things were learned in the process, with multiple days lost at various points trying to find my way through the quagmire. Zeroth amongst these is that the process embodied by `vitis_hls` is necessary but not sufficient for developing algorithms - it gives you some very useful guidance on how to write the code, but guarantees nothing in terms of it actually working at the end of the day on real hardware.
Developing only on huge data centre cards is very slow because compilation takes hours. Embedded systems like the Zynq 7000 are a lot smaller, which makes for far faster compilation times - for testing ideas and learning your way around this is super helpful. The scripts here will work for smaller devices, provided you have a suitable platform file. Building the host will require a little more work, as will building a system image to deploy to a microSD card, which is slightly beyond the scope of this write-up.
Algorithms need a lot more memory than you would think: at times (e.g. when building the local region calculation in stage 1) a lot of reads happen concurrently from a shared buffer, and the compilers make this possible by maintaining multiple copies of that buffer, which all occupy BRAM or area. This means that for something where you would expect to use e.g. 14 x 18 Kb BRAMs you may find it using 4 or 8 times as many to allow the read concurrency, in addition to any specified partitioning. This is at the root of why the calculation for one Eiger 2XE module (512 rows of 1028 pixels) won't fit onto the Zynq 7020 in a Zybo Z7-20 board.
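For illustration, the kind of directive involved looks like the following (the variable name, width and factor are made up, and the real buffers and factors differ):

```c++
#include <stdint.h>

#define WIDTH 4108 // illustrative

// Splitting a row buffer into four banks lets four lookups be serviced in
// the same cycle, at the cost of the tool keeping several physical
// banks/copies of the buffer in BRAM or LUTRAM.
void sum_four(const uint32_t *in, uint32_t *out) {
  uint32_t sat_row[WIDTH];
#pragma HLS ARRAY_PARTITION variable=sat_row cyclic factor=4 dim=1
  for (int x = 0; x < WIDTH; x++)
    sat_row[x] = in[x];
  for (int x = 0; x < WIDTH; x += 4) {
#pragma HLS PIPELINE II=1
    out[x / 4] = sat_row[x] + sat_row[x + 1] + sat_row[x + 2] + sat_row[x + 3];
  }
}
```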
This one was a real doozy and only exposed itself when trying to run on real hardware (or in `hw_emu` mode). Spoilers: HLS may look like C++ and use C++ syntax, but it does not follow all C++ rules. For example, in:
```c++
void function(type_stream_t &in, type_stream_t &out) {
  // implementation
}
```
`in` and `out` are in scope in C++, i.e. they are locally defined and any reference to them is independent of the same names in other contexts. THIS IS NOT TRUE IN HLS. However nothing in the software chain tells you that this is not true, and all the software emulation parts above in `vitis_hls` work ✨just fine✨.
Until you hit a deadlock and you have no idea why. This was where the `make TARGET=hw_emu` step above was überhelpful, because it actually crashed. When you run in the test environment a lot of debug information is output to a text file `output.txt` in a directory reported at the start of the output. Searching for `ERROR` in there showed:
```
//////////////////////////////////////////////////////////////////////////////
// ERROR!!! DEADLOCK DETECTED at 217790000 ns! SIMULATION WILL BE STOPPED! //
//////////////////////////////////////////////////////////////////////////////
/////////////////////////
// Dependence cycle 1:
// (1): Process: spot_finder_spot_finder.spot_finder_kernel_U0.duplicate_pack16_stream_U0
// Blocked by empty input FIFO 'spot_finder_spot_finder.input_U' written by process 'spot_finder_spot_finder.reader_U0'
// (2): Process: spot_finder_spot_finder.reader_U0
// Blocked by empty input FIFO 'spot_finder_spot_finder.input_U' written by process 'spot_finder_spot_finder.spot_finder_kernel_U0.duplicate_pack16_stream_U0'
////////////////////////////////////////////////////////////////////////
// Totally 1 cycles detected!
////////////////////////////////////////////////////////////////////////
// ERROR!!! DEADLOCK DETECTED at 217790000 ns! SIMULATION WILL BE STOPPED! //
//////////////////////////////////////////////////////////////////////////////
```
Wait, says I: the input and output to this were the same?
This is because I was being lazy and making up a pattern where the reader and writer both just called the input and output channels `input` and `output` - as well as other channels for e.g. the pixel splitting path. The actual implementation aliased these, because there is no scoping, and hence broke. Hacking in some ugly names made the emulation actually run, and then the real version actually ran.
Happily I survived the learning process and debugging and got to some working code. The test case shows the calculation being reasonably quick, but not really worth the effort at the moment:
```
cs04r-sc-serv-85 host :) [run-fpga-spot] $ ./spot_finder ../link/spot_finder.xclbin
Generate: 3899034 microseconds
Reference: 231416 microseconds
Done copying...
FPGA: in / kernel / out: 3663 / 28484 / 3097 / microseconds
0 / 17919096 incorrect results
```
28 ms to perform the calculation equates to around 35 frames/second - a very long way short of the target 500 frames/second. The overall II was 4510889 cycles, so an average of `II=1` per 4-pixel block was maintained. In implementation the minimum cycle time was 5.34 ns, which explains the slow evaluation.
The current implementation takes packs of four pixels and performs the corresponding calculations concurrently - so for an image of 4362 x 4108 pixels (an Eiger 2 XE 16M) we would expect a minimum of around 15 ms to execute at 300 MHz (17,919,096 pixels / 4 per cycle ≈ 4.5 M cycles at ~3.3 ns per cycle) - the kernel execution time of 28 ms is therefore already excessive, in addition to being fundamentally slow.
Ingesting more pixels per clock (i.e. `PACK=8, 16, 32`) would immediately improve the performance, however the row data may need to be padded with NULL values to keep the width a multiple of `PACK`. There will also be performance tuning implications of doing this, e.g. on the lengths of the most delayed paths through the calculation.
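As an indication of what a wider ingest might look like (a sketch only: the `PACK` value, the `ap_uint` packing and the names are assumptions, and the existing pack-of-4 code may be organised differently):

```c++
#include "ap_int.h"
#include <stdint.h>

#define PACK 8 // illustrative: 8 x 16-bit pixels per transfer

typedef ap_uint<16 * PACK> pack_t;

// Unpack one wide word into PACK pixels which can then be processed in
// parallel; rows whose width is not a multiple of PACK would need to be
// padded with null (masked) pixels at the end.
static void unpack(pack_t word, uint16_t px[PACK]) {
  for (int j = 0; j < PACK; j++) {
#pragma HLS UNROLL
    px[j] = word.range(16 * j + 15, 16 * j);
  }
}
```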
Running multiple instances of the kernel to analyse images concurrently, or partitioning the image into e.g. 32 modules and computing them concurrently, would also be opportunities to improve the throughput. Performing the image decompression on the card is unlikely to be significant, as the transfer time is around 3-4 ms, but for multiple images pipelining the calculation to overlap the transfer with kernel execution, as performed for the GPU equivalent, would be worthwhile.