
Accelerating an iterative Helmholtz solver with FPGAs

Art Petrenko1, Tristan van Leeuwen2, Diego Oriato3, Simon Tilbury3 and Felix J. Herrmann1

1 Department of Earth, Ocean and Atmospheric Sciences, University of British Columbia

2 Centrum Wiskunde & Informatica, formerly with UBC

3 Maxeler Technologies

January 16, 2014


We implement the Kaczmarz row-projection algorithm (Kaczmarz (1937)) on a CPU host + FPGA accelerator platform using techniques of dataflow programming. This algorithm is then used as the preconditioning step in CGMN, a modified version of the method of conjugate gradients (Björck and Elfving (1979)) that we use to solve the time-harmonic acoustic isotropic constant-density wave equation. Using one accelerator we achieve a speed-up of over 2× compared with one Intel core.

Introduction Using reconfigurable hardware (field-programmable gate arrays, FPGAs) to handle the computationally intensive kernel of an algorithm is attractive. Unless the algorithm is limited by the bandwidth of an interconnect (such as a link to memory), execution time is directly proportional to the amount of data to be processed. As long as the kernel fits onto the FPGA, its complexity is irrelevant, since all of its operations happen concurrently in each clock tick. In contrast, on a CPU core, more complex algorithms require more clock ticks, as instructions are executed sequentially or with only a small amount of parallelism. We demonstrate hardware acceleration with FPGAs applied to seismic wave-propagation simulation, a large computational burden in full-waveform inversion (FWI).

Theory In the frequency domain, and specializing to the acoustic isotropic constant density case, simulating wave propagation corresponds to solving a large linear system of the form H(m, ω)u = q, (1) where H is the N × N Helmholtz operator that depends on the earth model m and the angular frequency ω, u is the (complex) Fourier transform of the pressure wavefield and q is the time-harmonic source.

We take m to be the squared slowness (1/v²), and use a 27-point cube stencil to calculate the entries of H for the 3D case, as in Operto et al. (2007). A perfectly matched layer is used at the boundaries of the domain to eliminate reflection artifacts.
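To make the structure of H concrete, the following sketch assembles a simplified Helmholtz operator on a small 3D grid. It uses the basic second-order 7-point stencil with Dirichlet boundaries, not the 27-point stencil with PML used in our implementation; the function name and sign convention are ours, and dense matrices are used for clarity only.

```python
import numpy as np

def helmholtz_7pt(m, omega, h):
    """Assemble H = -Laplacian - omega^2 * diag(m) on a 3D grid.

    m:     squared slowness on an (nz, ny, nx) grid
    omega: angular frequency
    h:     grid spacing

    Dense for illustration; the real H is sparse (a few non-zero
    diagonals) and far too large to store densely.
    """
    nz, ny, nx = m.shape

    def lap1(k):
        # 1D second-difference operator with Dirichlet boundaries
        return (np.diag(-2.0 * np.ones(k))
                + np.diag(np.ones(k - 1), 1)
                + np.diag(np.ones(k - 1), -1)) / h**2

    I = np.eye
    # Kronecker sums assemble the 3D Laplacian from 1D operators,
    # consistent with C-order raveling of the (nz, ny, nx) grid.
    L = (np.kron(np.kron(lap1(nz), I(ny)), I(nx))
         + np.kron(np.kron(I(nz), lap1(ny)), I(nx))
         + np.kron(np.kron(I(nz), I(ny)), lap1(nx)))
    return -L - omega**2 * np.diag(m.ravel())
```

For a velocity model v the squared slowness is m = 1/v²; each row of H then couples one grid point to its six neighbours, so the matrix has only a few non-zero diagonals.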

Because the matrix H is large (up to 10⁹ × 10⁹) and sparse (only a few non-zero diagonals), an iterative algorithm like the method of conjugate gradients (CG) is well suited to solving Equation 1. However, CG cannot be used directly because H is not symmetric positive semidefinite (it has both positive and negative eigenvalues). Instead we use the Kaczmarz algorithm as a preconditioner to transform Equation 1 into an equivalent system that does have these properties. The Kaczmarz algorithm (Kaczmarz (1937)) is an iterative method for solving generic linear systems Ax = b that projects its iterate x_k onto the hyperplane defined by a row a_i of the matrix A:

x_(k+1) = x_k + w (b_i − ⟨a_i, x_k⟩) / ‖a_i‖² a_i^H. (2)

Here w is a relaxation factor we set equal to 1.5 and bi is the ith element of the right-hand side. We refer to one application of Equation 2 as a Kaczmarz iteration, and to the projections onto each row from first to last as a forward Kaczmarz sweep (FKSWP). Backward sweeps correspond to projecting onto the rows in reverse order, and a double sweep (DKSWP) is a forward sweep followed by a backward sweep (k: 1 → 2N; i: 1 → N, N → 1). Equation 1 is transformed into the equivalent system

(I − Q) u = R q, (3)

where the action of the matrices Q and R is computed using a double Kaczmarz sweep: DKSWP(H, u, q, w) = Qu + Rq. Equation 3 is then solved with CG; the resulting algorithm, known as CGMN, is due to Björck and Elfving (1979).
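The sweeps and the CG outer loop can be sketched in a few lines of NumPy. Since DKSWP(H, u, q, w) = Qu + Rq, CG never needs Q or R explicitly: Qu is obtained as DKSWP(H, u, 0, w) and Rq as DKSWP(H, 0, q, w). This is a dense, host-side sketch with our own function names, using plain CG for a real-valued system; the actual implementation streams the sparse rows of H through the FPGA.

```python
import numpy as np

def dkswp(A, x, b, w=1.5):
    """Double Kaczmarz sweep: forward row order (FKSWP), then backward.

    Each step is the relaxed projection of Equation 2:
    x <- x + w * (b_i - <a_i, x>) / ||a_i||^2 * conj(a_i)."""
    n = A.shape[0]
    for i in list(range(n)) + list(range(n - 1, -1, -1)):
        a = A[i]
        x = x + w * (b[i] - np.dot(a, x)) / np.real(np.vdot(a, a)) * np.conj(a)
    return x

def cgmn(H, q, w=1.5, iters=100):
    """Solve Hu = q by running plain CG on (I - Q)u = Rq (Equation 3)."""
    n = H.shape[0]
    zero = np.zeros(n)
    matvec = lambda v: v - dkswp(H, v, zero, w)   # (I - Q) v via a double sweep
    u = np.zeros(n)
    r = dkswp(H, zero, q, w) - matvec(u)          # residual Rq - (I - Q)u
    p = r.copy()
    rs = np.dot(r, r)
    for _ in range(iters):
        Hp = matvec(p)
        alpha = rs / np.dot(p, Hp)
        u = u + alpha * p
        r = r - alpha * Hp
        rs_new = np.dot(r, r)
        if rs_new < 1e-28:
            break
        p = r + (rs_new / rs) * p
        rs = rs_new
    return u
```

Note that I − Q is symmetric positive definite for 0 < w < 2 when H is nonsingular, which is what makes CG applicable even though H itself is indefinite.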

Our contribution is implementing the double Kaczmarz sweep for a CPU host + FPGA accelerator platform. Related work has been done by Grüll et al. (2012), who implement a related algorithm, the simultaneous algebraic reconstruction technique (SART), on the same platform. Furthermore, Elble et al. (2010) have examined the improvements that Kaczmarz and other algorithms experience on Graphics Processing Units (GPUs).


Figure 1 System of CPU host + FPGA accelerators; adapted from Pell et al. (2013).

Method The target platform for our implementation is described in Pell et al. (2013) and shown in Figure 1. Four dataflow engines (DFEs), consisting of FPGAs and their associated large memory, are connected via a PCIe bus to the CPU. Each FPGA also has 4.6 MB of fast on-chip RAM (FMem). Coarse-grained parallelization by solving problems with different right-hand sides q on each DFE is a straightforward extension but has not yet been attempted by the authors; currently only one accelerator is used. The DFE is operated via a compiled binary that is called from a high-level numerical computing environment using an automatically generated C intermediate layer.

Execution of the CGMN algorithm using the DFEs proceeds in three stages. First the matrix H(m, ω) is copied to the FPGA’s memory. It will be read twice during each CGMN iteration (once for each forward and backward sweep), but will not be modified. The second stage consists of the forward Kaczmarz sweep, during which the current CGMN search direction p is streamed to the FPGA from the CPU host, and the result FKSWP(H, p, 0, w) is stored to memory. The third stage is the backward Kaczmarz sweep, which reads the result of the forward sweep from memory and streams DKSWP(H, p, 0, w) back to the CPU. The CPU carries out the necessary elementwise and reduction operations, and stages 2 and 3 are repeated until CGMN converges.
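The three stages can be made concrete with a toy host-side driver. The DFE interface below is entirely hypothetical (all method names are invented for illustration); the real accelerator is driven through a compiled binary and a generated C layer. Here the mock merely records the staging sequence instead of computing sweeps.

```python
class MockDFE:
    """Toy stand-in for a dataflow engine: records the staging protocol
    instead of performing sweeps. All method names here are invented."""

    def __init__(self):
        self.log = []
        self.mem = {}                     # stand-in for the DFE's large memory

    def load_matrix(self, H):
        """Stage 1: H is copied to DFE memory once and only read thereafter."""
        self.mem["H"] = H
        self.log.append("load_matrix")

    def fkswp(self, p):
        """Stage 2: forward sweep on the streamed search direction p;
        the intermediate result stays in DFE memory."""
        self.mem["fwd"] = p
        self.log.append("fkswp")

    def bkswp(self):
        """Stage 3: backward sweep over the stored intermediate result,
        streamed back to the host."""
        self.log.append("bkswp")
        return self.mem["fwd"]

def run_cgmn(dfe, H, n_iters):
    dfe.load_matrix(H)                    # stage 1 happens exactly once
    for _ in range(n_iters):              # stages 2 and 3 repeat per iteration
        dfe.fkswp("p")                    # stream the current search direction
        _ = dfe.bkswp()                   # receive DKSWP(H, p, 0, w)
        # ... host-side elementwise and reduction operations of CG ...
    return dfe.log
```

The point of the protocol is that the expensive transfer (the matrix H) happens once, while only the much smaller vectors cross the PCIe bus on every iteration.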

If the rows of H are processed in order from 1 to N, each Kaczmarz iteration depends on the result of the previous one. Furthermore, each iteration’s computation takes many ticks of the FPGA clock. This latency L is chiefly due to the pairwise summation of the values in ⟨a_i, x_k⟩ (Equation 2). To avoid having to wait L ticks between iterations, L independent iterations are used to fill the computational pipeline. Two sources of parallelism can be exploited. First, multiple Helmholtz problems sharing the same matrix H but different right-hand sides q can be solved on the same DFE. Second, the order in which DKSWP accesses the rows of H can be changed such that consecutive Kaczmarz iterations update disjoint sets of iterate elements. At present we use the latter method exclusively.
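One way to realize such a reordering is to interleave the rows in L blocks, so that the L iterations in flight in the pipeline are always far apart in the original ordering and, for a banded H, touch disjoint stencils. The construction below is our own sketch, assuming N divisible by L; the disjointness condition it states is specific to a simple banded matrix, not the full 27-point stencil.

```python
def interleaved_row_order(n, L):
    """Reorder rows 0..n-1 so that any L consecutive entries come from
    L distinct blocks of size n // L and are at least n // L - 1 apart.

    For a banded matrix whose row i touches columns i-b..i+b, choosing
    n // L - 1 > 2*b means the L in-flight Kaczmarz iterations update
    disjoint sets of iterate elements, so they can be pipelined."""
    assert n % L == 0
    m = n // L
    # Entry i takes column i // L from block i % L: consecutive entries
    # cycle through the L blocks, which sit m rows apart.
    return [(i % L) * m + i // L for i in range(n)]
```

For n = 20 and L = 4 this yields 0, 5, 10, 15, 1, 6, 11, 16, ..., so a pipeline of depth 4 never has two nearby rows in flight at once.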


Figure 2 Comparison of CGMN execution time on one accelerator (DFE) and one Intel Xeon core.

Results In Figure 2, computational times for 100 CGMN iterations running on the CPU + accelerator platform (Vectis model at 100 MHz) are compared with times on one Intel Xeon E5-2670 core. A speed-up of over 2× is seen when CGMN is run on the accelerator. Figure 3 shows the velocity model used in the experiment, as well as the solution wavefield u that results from a time-harmonic source at one point in the domain.

At the current stage of development there are several factors that limit the performance of the accelerator. Chief among these is the inefficient use of the FPGA’s memory controller. To estimate the speed-up we can expect from a more optimized implementation we slow down the sweeps by having computation progress only once every 90 FPGA clock ticks. At this speed even inefficient use of the memory controller is sufficient to keep the FPGA supplied with data. We measure performance for a range of small systems (from 0.5 to 3 × 10⁵ gridpoints) and then calculate theoretical performance of the full-speed sweeps, taking into account several planned improvements. These include: two passes over the data per CGMN iteration instead of three, getting rid of zero elements currently used to pad the rows of the matrix H to boundaries of addressable memory chunks, and increasing the FPGA operating frequency to 120 MHz. With a 100% efficient memory controller these improvements will translate to a speed-up of 13× over the single Intel core times.

Speed-up of the pipelined accelerated Kaczmarz sweeps will further be limited by the fact that the sweeps (when implemented on an Intel core) take up only about 95% of the total time used by CGMN.

Recall that offloading additional parts of an algorithm from the CPU onto the FPGA accelerator removes work from the CPU without increasing the execution time on the FPGA, since the amount of data to process (the problem size) remains the same. Hence we will implement all of CGMN on the accelerator, from which we expect a speed-up of 36× over a single Intel core. At that point the link between the FPGA and its dedicated memory that is used to stream the matrix H will become the bottleneck preventing further increases in FPGA operating frequency. To avoid this bandwidth limitation, the elements of the Helmholtz matrix will be generated on the FPGA from the earth model m as they are needed, rather than being read in from memory.
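The limit imposed by the un-accelerated portion follows from Amdahl's law, sketched below. The 95% figure is the measurement quoted above; the function itself is the generic formula, not part of our implementation.

```python
def amdahl_speedup(p, s):
    """Overall speedup when a fraction p of the runtime is accelerated
    by a factor s and the remaining 1 - p runs at the original speed."""
    return 1.0 / ((1.0 - p) + p / s)

# With the sweeps at ~95% of CGMN runtime on the CPU, even an
# arbitrarily fast accelerator caps the overall speedup at
# 1 / (1 - 0.95) = 20x, which is why moving all of CGMN onto the
# accelerator matters before pushing the sweeps themselves further.
```

This is also why accelerating only the sweeps by the projected 13× yields well under 13× overall: the serial 5% quickly dominates.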

76th EAGE Conference & Exhibition 2014, Amsterdam RAI, The Netherlands, 16–19 June 2014

Figure 3 On the left is the largest part of the SEG/EAGE Overthrust acoustic velocity model (Aminzadeh et al. (1997)) used in the experiment. The axes are labelled with the number of grid points along each axis. The right plot shows the real part of the Fourier transform of the pressure wavefield u that results from a time-harmonic point source. The solution shown took 2703 CGMN iterations (9826 s) to converge to a relative residual norm of 10⁻⁶.


Conclusions We have demonstrated a time-harmonic iterative wave-equation solver that off-loads its computational kernel onto a reconfigurable hardware accelerator. Work on this project is ongoing, and while further development is needed to better realize the potential for acceleration inherent in the platform, our preliminary results give us reason to be optimistic.


Acknowledgements We would like to thank Henryk Modzelewski for his support of the hardware and software used by the three authors affiliated with UBC, Eddie Hung for stimulating discussions about implementation, Rafael Lago for guidance on Krylov solvers, Brendan Smithyman for editing this manuscript, and Maxeler Technologies for hosting one of the authors for a two-week visit. This work was financially supported in part by the Natural Sciences and Engineering Research Council of Canada Discovery Grant (22R81254) and the Collaborative Research and Development Grant DNOISE II (375142-08).

This research was carried out as part of the SINBAD II project with support from the following organizations:

BG Group, BGP, BP, CGG, Chevron, ConocoPhillips, ION, Petrobras, PGS, Total SA, WesternGeco, and Woodside.


References

Aminzadeh, F., Jean, B. and Kunz, T. [1997] 3-D salt and overthrust models. Society of Exploration Geophysicists.

Björck, Å. and Elfving, T. [1979] Accelerated projection methods for computing pseudoinverse solutions of systems of linear equations. BIT Numerical Mathematics, 19(2), 145–163, ISSN 0006-3835, doi:10.1007/BF01930845.

Elble, J.M., Sahinidis, N.V. and Vouzis, P. [2010] GPU computing with Kaczmarz’s and other iterative algorithms for linear systems. Parallel Computing, 36(5-6), 215–231, ISSN 0167-8191, doi:10.1016/j.parco.2009.12.003.

Grüll, F., Kunz, M., Hausmann, M. and Kebschull, U. [2012] An implementation of 3D electron tomography on FPGAs. Reconfigurable Computing and FPGAs (ReConFig), 2012 International Conference on, 1–5.
Kaczmarz, S. [1937] Angenäherte Auflösung von Systemen linearer Gleichungen. Bulletin International de l’Academie Polonaise des Sciences et des Lettres, 35, 355–357.

Operto, S., Virieux, J., Amestoy, P., L’Excellent, J.Y., Giraud, L. and Ali, H.B.H. [2007] 3D finite-difference frequency-domain modeling of visco-acoustic wave propagation using a massively parallel direct solver: A feasibility study. Geophysics, 72(5), SM195–SM211, doi:10.1190/1.2759835.

Pell, O., Bower, J., Dimond, R., Mencer, O. and Flynn, M.J. [2013] Finite-difference wave propagation modeling on special-purpose dataflow machines. Parallel and Distributed Systems, IEEE Transactions on, 24(5), 906– 915, ISSN 1045-9219, doi:10.1109/TPDS.2012.198.
