Three Dimensional Convolution of Large Data Sets on Modern GPUs
Ahmed Adnan Aqrawi
Specialization Project, Fall 2009
TDT4590 Complex Computer Systems
Supervisor: Dr. Anne C. Elster, IDI
Co-supervisor: Victor Aarre, Schlumberger Stavanger
Trondheim, December 2009
Norwegian University of Science and Technology
Faculty of Information Technology, Mathematics and Electrical Engineering
Department of Computer and Information Science
In Petrel, Schlumberger’s seismic software, one often comes across large seismic cubes that need to be filtered in order to generate clearer images. The seismic cubes are viewed in three dimensions, implying that one must filter in all three dimensions as well. However, this filtering is very computationally demanding and thus uses a lot of computational resources. This project’s goal is to implement a Gaussian filter for a large three-dimensional data set on the GPU using NVIDIA CUDA to off-load the CPU. A CPU version will also be developed for comparison and analysis. Since the size of the data to be transferred to GPU memory is quite large, the calculations need to be performed on sub-cubes. This implies that one must account for border data between sub-cubes to avoid an edge effect. The implementations developed will be benchmarked and compared to evaluate performance gains.
Abstract

In the petroleum and gas industry, one of the main foci is using seismic processing to find new oil and gas reservoirs. Recordings of seismic waves can be used to create images representing the surface of the earth. To do so, one has to filter the collected data. One of the methods for filtering this data is convolution in the spatial domain, which is done in three dimensions (3D) because of the 3D nature of the data collected. The data collected can cover surfaces several kilometers in length and is therefore very large in size.
This project focuses on implementing a 3D convolution algorithm in the spatial domain on modern CPUs and GPUs, with non-separable filters, for large data sets. Our results demonstrate that the filtering mask should be placed in constant memory rather than shared memory, because there is an overhead associated with the use of shared memory per kernel launch. The data in constant memory must be read coalesced for it to be efficient. Shared memory should not be used for the filtered data either, due to the lack of communication between the threads in the convolution kernel; again, the overhead of reading into shared memory only slows down the process. To compare our results, implementations on the CPU were performed in C. The platforms tested are a uni-core CPU and a quad-core CPU, as well as a single GPU and a system with up to 4 GPUs. The CPU used is an AMD Phenom X4, whereas the GPUs used are the NVIDIA Tesla C1060 and NVIDIA Tesla S1070. Our work includes figuring out how to process large amounts of data most efficiently on both the CPU and GPU with the use of different blocking methods when accessing the disk.
Our results also show that the I/O time, which one would expect to be a bottleneck, is only 1-2% of the total execution time on a single CPU. This means that convolution is a computationally demanding task, but fortunately a very parallelizable one.
Our results indicate that, compared to a single core, a speedup of 3.57 is achieved on the Phenom X4, a speedup of 17 on the Tesla C1060 (single GPU) and a speedup of 62 on the Tesla S1070 (4 GPUs). This reduced the computation’s share of the total execution time by 5%, 25% and 90%, respectively, for the three platforms.
Further work regarding optimizations should hence focus on I/O.
This report, together with the prototype, is the result of a project given in the course TDT4590 at the Norwegian University of Science and Technology.
I would like to thank my supervisor Dr. Anne Cathrine Elster for invaluable support and feedback throughout the entire project. She has been an inspiration with her great understanding of and dedication to the field. Thanks to her generosity and encouragement, all the resources needed for this project were made available. I would like to thank Victor Aarre of Schlumberger for his support in providing me with new ideas, example source code and a set of seismic data. I would especially like to thank NVIDIA for sponsoring our group and our HPC-lab, and for making high-end graphics cards such as the Tesla C1060 and Tesla S1070 available. I would also like to thank the entire HPC group for their support, encouragement and enthusiasm for this project, and a special thanks to Jan Christian Meyer, Thorvald Natvig, and Holger Ludvigsen for all their help.
Introduction

In the oil and gas industry, there is always an interest in investigating potential oil and gas reservoirs. There are several ways to test for this, and one of them is seismic data collection. Seismic data is gathered by recording seismic waves (waves of force that travel through the earth). This data is used in the field of petroleum to discover the geological structures of the earth and find natural resources such as oil and gas. To help in this search, seismic data is processed by many filters and filtering methods to get a clearer subsurface image and to bring out relevant information such as faults and reservoirs; see Figure ?? for an example of seismic data. These filters are, like other image filtering processes, very adaptable to the graphics processing unit (GPU), but are as of today run on the central processing unit (CPU).
Figure 1.1: Figure illustrating seismic data from [?], with permission from Schlumberger

In recent years, it has been shown that the performance capabilities of the GPU have, in some cases, exceeded those of the CPU. This in turn motivated the development of the general-purpose graphics processing unit (GPGPU), which has led to the use of the GPU not only in graphics applications, but also in scientific calculations. These trends have created a boom in graphics processing architectures, and manufacturers have started introducing new product lines specifically for scientific calculations. In Figure ?? one can see the trend in computational power, measured in floating point operations per second (FLOPS), over the past 5 years. Another aspect worth noting is that using the GPU frees the CPU for other tasks in parallel, with the GPU functioning as an accelerator.
Given these advancements, one is now often interested in seeing whether it is possible to utilize the GPU for calculations and gain increased performance for certain tasks. Tasks such as image processing, seismic processing, other physical modeling, and linear programming applications have proven to be well parallelizable on the GPU. This is the foundation of this project’s existence, in that we are to perform an image enhancement task on seismic data on the GPU.
Figure 1.2: Figure illustrating CPU and GPU performance trends from [?], with permission from NVIDIA
The aim of this project is to implement convolution for non-separable filters in the spatial domain in CUDA, for large three-dimensional data sets. A large data set is defined as a data set that does not fit into modern system buffers, currently at sizes between 8-12 GB. The Gaussian filter is a filter used in seismic processing, and implementing it on the GPU would introduce new possibilities in the field of pre-processing seismic data. The main challenge is handling large data sets: one must process the data set in intervals of sub-sets and account for border information to compute the filters correctly. For comparison, implementations will be developed for both single- and quad-core CPUs. The goal is to benchmark the convolution implementations on modern CPUs and GPUs with different filter sizes and compare the two to see which is most efficient for large data sets. Possibilities to run on several GPUs to accelerate performance will also be explored, and speedup will be assessed.
1.2 Project Contributions

There are three main contributions in this project. The first is performing convolution in three dimensions with non-separable filters. Convolution performed in three dimensions is rare to find, not to mention on the GPU. This should be useful for anyone aiming to use the GPU for similar tasks.
The second is handling large data sets. This introduces many problems, from disk access to transfer of data to memory. In this project the focus is on the retrieval of data to the GPU by blocking across different dimensions. The combination of large data sets and the GPU is also a rare occurrence, since usually the data used is exactly large enough to fit in the system’s main buffer.
The third significant contribution is the use of the GPU and CUDA in seismic processing, and how it can accelerate that process by experimenting with the different memories in the CUDA hierarchy (for example constant, shared and global memory). There have been some studies on accelerating seismic processing, but in our project the focus is on using convolution as a filtering method and on using CUDA to program the GPU. Our project also considers the use of multiple GPUs to accelerate the process, which is both rare and interesting, in particular seeing how the algorithm scales on several hundred cores.
The rest of this report is structured as follows:
Chapter 2: Relevant background material and related work are presented and explained so that the reader has all the presumed knowledge needed to understand the rest of the work. It also shows how this project builds upon existing work in the same field.
Chapter 3: A short introduction to the hardware and software used in the project, a description of how the implementations were performed, and an explanation of why certain implementation choices were made. Here one will also find the reasoning behind each optimization and the expectations as to how it will perform.
Chapter 4: Results regarding I/O tests are presented and discussed. The main focus is the blocking techniques used to achieve good disk access times and explaining why they are so efficient. Results regarding convolution tests on various platforms are presented and discussed as well, with the main focus on comparing the implementations and presenting speedup and computation percentage. An in-depth analysis of the comparisons and their traits is also given.
Chapter 5: Here one will find the conclusion of the work performed and suggested further work in the field.
Appendix: Tables of results gathered during the benchmarking process are included in the appendix. Some of these results are summarized in graphs in Chapter 4.
Background and Related Work

In this chapter, the focus is on introducing the main sources the reader might need to understand our work. The following sections summarize the main references read. Section 2.1 introduces related work done in similar fields. Section 2.2 concerns spatial filtering. Section 2.3 explains the concept of a filtering mask and the Gaussian filter. Section 2.4 gives a practical example of how the Gaussian filter is used in seismic processing. Section 2.5 is about general parallel programming. Section 2.6 introduces OpenMP and the concepts of multithreading. Section 2.7 explains the main aspects of the CPU and GPU architectures. Finally, Section 2.8 gives a short introduction to the CUDA programming model.
2.1 Related Work
This section introduces the papers and theses chosen for discussion, with the intention of emphasizing work previously done in similar fields and how this project builds upon it. The main fields focused on here are image processing, convolution, GPU acceleration, three-dimensional data and multi-GPU systems.
All these topics are relevant to this project, and have been researched to lay a foundation for the implementations performed.
Image Convolution with CUDA, 2007 [?]
This is a paper written by NVIDIA to show how CUDA can be used to perform convolution in image processing. It is related to this project in that it also concerns convolution in the spatial domain, implemented in CUDA. In contrast to that paper, the image processing performed in this project is on three-dimensional data, and the data to be filtered does not fit in memory, so one must perform several transfers through the memory hierarchy.
Accelerating 3D Convolution using Graphics Hardware, 1999 [?]

This is a paper published at IEEE Visualization in 1999 that approaches the subject of 3D convolution performed on a GPU. The main idea is to use the graphics hardware to accelerate the convolution process. The work was done pre-CUDA, and this is where it differs from this project: before the CUDA architecture, shared memory was not available, and its use can be a good enhancement/optimization. Besides the use of CUDA, this project also differs in that it considers the use of multiple GPUs to accelerate the process and is concerned with larger data sets.
Modeling Communication on Multi-GPU Systems, 2009 [?]
This is a master’s thesis concerning communication and computation on several GPUs simultaneously. Another important subject covered is partitioning data such that calculations can be done on several GPUs. This is relevant to this project because of the large amount of data to be filtered and the advantage of using multiple GPUs. It is also interesting to see how one can partition data such that communication between GPUs is optimal. In contrast to that thesis, the problem solved here is image processing and not a solution to partial differential equations. Another difference is again the consideration of large data sets.