A redundancy solution for fault-tolerance, widely used in:
signals from deep space satellites
reliable multimedia multicasting
reliable storage systems
Cloud Storage Systems: Why Redundancy?
Cloud storage vendors claim to provide highly available and reliable services in their SLA (Service-Level Agreement) with the customers.
Both availability and reliability imply strict fault tolerance requirements for cloud storage systems.
However, as the scale of the storage system grows, the probability of failures becomes significant:
→ SLA violation.
Therefore, redundancy must be introduced into cloud storage systems.
Cloud Storage Systems: Why Redundancy?
The simplest and most straightforward redundancy solution is replicating the data on multiple storage nodes.
Triple replication has been favored in cloud storage systems such as GFS (Google File System) and HDFS (Hadoop Distributed File System).
Cloud Storage Systems: Why Erasure Codes?
Problem of replication: large storage overhead.
Erasure codes can reduce the storage overhead significantly while at the same time maintaining the same level of fault tolerance as replication → a better redundancy solution.
Cloud Storage Systems: Why Erasure Codes?
$(n, k)$ MDS (Maximum Distance Separable) codes, where $n$ and $k$ are integers and $n > k$:
$k$ native chunks are encoded into $n - k$ code chunks, and the native and code chunks are distributed on $n$ different storage nodes.
tolerate the failure of any $(n - k)$ storage nodes.
Example: a Reed-Solomon code with $(n = 4, k = 2)$ vs. triple replication → the same fault tolerance, save 50% space!
Cloud Storage Systems: Why Reed-Solomon Codes?
Reed-Solomon codes $RS(k, n-k)$ are one of the most popular MDS erasure codes.
$RS(10, 4)$ is used in HDFS-RAID at Facebook and $RS(6, 3)$ is used in GFS II at Google.
Shortcomings of Reed-Solomon Codes
much higher computation cost than replication, for both encoding and decoding.
Our contribution: using GPUs to accelerate Reed-Solomon encoding and decoding.
Overview of Our Acceleration Targets
Reed-Solomon Code Overview
Reed-Solomon Encoding
Reed-Solomon Decoding
Acceleration Targets - 1
Computation bottleneck: matrix multiplication.
$$
C = A \cdot B
\quad\equiv\quad
c_{j} = \sum_{i = 1}^{k} a_{i,j} \times b_{i}
$$
Addition and multiplication are defined as arithmetic over Galois Field GF($2^8$).
Acceleration targets for computation:
Arithmetic over Galois Field.
Matrix multiplication.
GPU-Accelerated Reed-Solomon Code Overview
Reed-Solomon Encoding in a GPU
Reed-Solomon Decoding in a GPU
Acceleration Targets - 2
Extra overhead in GPU implementation: data transfers between CPU and GPU.
Another acceleration target: reducing the data transfer overhead.
Overview of Our Acceleration Targets
Accelerating Arithmetic over Galois Field
Brief Introduction of Galois Field
GF($2^w$) contains $2^w$ polynomials; each polynomial has degree at most $w - 1$ and coefficients in {0, 1}.
For GF($2^8$), every element can be mapped one-to-one to a byte, and polynomial operations in GF($2^8$) are isomorphic to operations on bytes:
Addition: isomorphic to bitwise XOR → inexpensive.
Multiplication: isomorphic to a sequence of bitwise operations → still time-consuming.
Therefore, how to accelerate multiplication over GF($2^8$) will be our focus.
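As a quick worked check of the byte mapping (the operands are chosen by us purely for illustration), addition really is a single XOR, which is why only multiplication needs further attention:
$$
(x^3 + x + 1) + (x^2 + x) = x^3 + x^2 + 1
\quad\equiv\quad
\mathtt{0x0B} \oplus \mathtt{0x06} = \mathtt{0x0D}
$$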
GPU Accelerating Options for Multiplication
loop-based method
a set of table-based methods
Loop-based Method
compute directly: costs a loop of at most eight iterations to multiply two elements in GF($2^8$) (sketched below).
computation-bound
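A minimal sketch of the loop-based multiply (the function name and the reduction polynomial $x^8 + x^4 + x^3 + x^2 + 1$, i.e. 0x11D, are our assumptions; the slides do not specify the polynomial):

#include <stdint.h>

/* loop-based GF(2^8) multiply: shift-and-add with polynomial reduction */
__host__ __device__ uint8_t gf256_mul_loop(uint8_t a, uint8_t b)
{
    uint8_t product = 0;
    for (int i = 0; i < 8; ++i) {        /* at most eight iterations */
        if (b & 1)
            product ^= a;                /* addition in GF(2^8) is XOR */
        uint8_t carry = a & 0x80;        /* would shifting overflow degree 7? */
        a <<= 1;
        if (carry)
            a ^= 0x1D;                   /* reduce by the field polynomial */
        b >>= 1;
    }
    return product;
}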
Table-based Methods
precompute and store the results in tables
e.g. Full Multiplication Table
Table-based Methods
| Method | Space Complexity for GF($2^w$) | Computation Complexity | Memory space for GF($2^8$) |
|---|---|---|---|
| Full Multiplication Table | $O(2^{2w})$ | 1 table lookup | 64 KB |
| "Double Table" / "Left-Right Table" | $O(2^{3w/2+1})$ | 2 table lookups, 2 ANDs, 1 XOR, 1 SHIFT | 8 KB |
| Log&Exp Table | $O(2^{w+1})$ | 3 table lookups, 1 MOD, 1 ADD, 2 branches | 512 Bytes |
We use the log&exp table-based method in our GPU implementation.
GPU Implementation: Loop-based or Table-based?
The loop-based method is able to achieve the maximum bandwidth even when the chunk size is small.
The bandwidth of the table-based method can still scale up as the chunk size grows larger.
The maximum bandwidth of the table-based method exceeds that of the loop-based method.
Further Improvement of the Log&exp Table-based Method
baseline:
uint8_t gf256_mul(uint8_t a, uint8_t b)
{
    int result;

    if (a == 0 || b == 0) {    /* 0 has no logarithm */
        return 0;
    }
    /* log(a*b) = log(a) + log(b), reduced modulo the group order NW-1 = 255 */
    result = (gf256_log_table[a] + gf256_log_table[b]) % (NW - 1);
    return gf256_exp_table[result];
}
Further Improvement of the Log&exp Table-based Method
Improvement Approach 1
Replace the slow modular operation with cheaper bitwise operations.
if (a == 0 || b == 0) {
    return 0;
}
/* fold instead of MOD: (x & (NW-1)) + (x >> width) is congruent to x mod (NW-1)
   for x < 2*(NW-1), where width = w = 8; the exp table must also be valid at
   index NW-1 */
result = ((gf256_log_table[a] + gf256_log_table[b]) & (NW - 1))
       + ((gf256_log_table[a] + gf256_log_table[b]) >> width);
Improvement Approach 2
Remove the slow modular operation by augmenting one table: the exp table is extended so that indices up to $2 \times (NW - 2)$ are valid.
if (a == 0 || b == 0) {
    return 0;
}
/* no MOD needed: the augmented exp table covers log(a) + log(b) directly */
result = gf256_log_table[a] + gf256_log_table[b];
Improvement Approach 3
Further eliminate the conditional branches by augmenting both tables (one way to build such tables is sketched after the code).
/* no branches: log(0) maps to a sentinel index whose exp region is zero-filled,
   so a == 0 or b == 0 automatically yields 0 */
result = gf256_log_table[a] + gf256_log_table[b];
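A sketch of how such augmented tables could be built on the host (the table sizes, the generator polynomial 0x11D, and the sentinel value for log(0) are our own choices, not taken from the slides):

#include <stdint.h>

#define NW 256

static uint16_t gf256_log_table[NW];      /* 16-bit so log(0) can hold a sentinel */
static uint8_t  gf256_exp_table[4 * NW];  /* augmented exp table */

static void gf256_init_tables(void)
{
    unsigned x = 1;
    for (int i = 0; i < NW - 1; ++i) {    /* walk the powers of the generator */
        gf256_exp_table[i] = (uint8_t)x;
        gf256_log_table[x] = (uint16_t)i;
        x <<= 1;
        if (x & 0x100)
            x ^= 0x11D;                   /* reduce by the field polynomial */
    }
    /* Approach 2: duplicate the exp values so that log(a) + log(b) < 2*(NW-1)
       needs no modular reduction */
    for (int i = NW - 1; i < 2 * (NW - 1); ++i)
        gf256_exp_table[i] = gf256_exp_table[i - (NW - 1)];
    /* Approach 3: give 0 a sentinel logarithm and zero-fill the exp table
       beyond it, so any product involving 0 comes out as 0 without a branch */
    gf256_log_table[0] = 2 * (NW - 1);
    for (int i = 2 * (NW - 1); i < 4 * NW; ++i)
        gf256_exp_table[i] = 0;
}

With these tables, two nonzero operands index at most $2 \times 254 = 508$, and an operand of 0 pushes the index into the zero-filled region, so the Approach 3 fragment never reads out of bounds.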
Further Improvement of the Log&exp Table-based Method
In GPU implementation, where to store the tables and how to initialize them will affect the performance.
Further Improvement of the Log&exp Table-based Method
Appropriate GPU memory:
constant memory: off-chip memory whose accesses are usually cached in the constant cache.
shared memory: on-chip memory which has the smallest access latency except the register file.
Further Improvement of the Log&exp Table-based Method
What we have implemented:
Store the two tables in constant memory and initialize them at compile time.
Store the two tables in shared memory and initialize them serially at run time at the beginning of each kernel function.
Store the two tables in off-chip global memory and load them into shared memory in parallel at the beginning of each kernel function (options 1 and 3 are sketched below).
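A sketch of options 1 and 3 (the kernel name, table types, and the multiply-by-2 placeholder are our assumptions; the real encode kernel performs the GF($2^8$) matrix multiplication at that point):

#include <stdint.h>

/* option 1: tables in constant memory; the slides initialize them at compile
   time, e.g. via a generated static initializer (cudaMemcpyToSymbol() at
   startup would be an alternative) */
__constant__ uint16_t c_log_table[256];
__constant__ uint8_t  c_exp_table[1024];

/* option 3: tables in global memory, copied into shared memory in parallel
   at the beginning of the kernel */
__global__ void rs_encode_kernel(const uint16_t *g_log, const uint8_t *g_exp,
                                 const uint8_t *in, uint8_t *out, int len)
{
    __shared__ uint16_t s_log[256];
    __shared__ uint8_t  s_exp[1024];

    for (int i = threadIdx.x; i < 256; i += blockDim.x)
        s_log[i] = g_log[i];
    for (int i = threadIdx.x; i < 1024; i += blockDim.x)
        s_exp[i] = g_exp[i];
    __syncthreads();                       /* all table entries ready before any lookup */

    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < len)                         /* placeholder: multiply each byte by 2 */
        out[idx] = s_exp[s_log[2] + s_log[in[idx]]];
}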
Further Improvement of the Log&exp Table-based Method
encoding a 1GB file with $k = 4$ and $n = 6$.
The elimination of conditional branches improves the performance.
Accessing the tables in the constant memory is more time-consuming.
Accelerating Matrix Multiplication
Naive Implementation
Problems of Naive Implementation
a lot of global memory (off-chip) transactions: each takes 400-800 clock cycles.
memory accesses in column-major order → poor temporal locality and a high cache miss rate.
Square-Tiling Algorithm
Problems of Square-Tiling Algorithm
not suitable for the general input case of Reed-Solomon codes: a small matrix multiplied by a huge matrix.
set tileDepth to $A.width$ ($k$) → remove the loop of accessing matrices $A$ and $B$ tile by tile.
How to Determine the Parameter of Tiles
$\textrm{tileWidthRow} \times \textrm{tileWidthCol}$ is equal to the CUDA block size (number of threads in a CUDA block) → find the best CUDA block size by tuning occupancy.
Finally, we need to further determine tileWidthRow and tileWidthCol (a sketch of the resulting kernel is shown below).
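A sketch of the tiled encode kernel under these choices (the tile sizes, names, and the inlined loop-based multiply are our assumptions; the actual kernel uses the log&exp lookups from the previous section):

#include <stdint.h>

#define TILE_ROW 4    /* tileWidthRow: rows of C per block (assumed value) */
#define TILE_COL 64   /* tileWidthCol: TILE_ROW * TILE_COL = CUDA block size */
#define MAX_K    32   /* assumed upper bound on k so A's tile fits in shared memory */

__device__ uint8_t gf256_mul(uint8_t a, uint8_t b)
{
    /* loop-based multiply, inlined only to keep the sketch self-contained */
    uint8_t p = 0;
    for (int i = 0; i < 8; ++i) {
        if (b & 1) p ^= a;
        uint8_t hi = a & 0x80;
        a <<= 1;
        if (hi) a ^= 0x1D;
        b >>= 1;
    }
    return p;
}

/* C (n x len) = A (n x k) * B (k x len) over GF(2^8), with tileDepth = k:
   matrix A is small, so its tile is loaded once and the usual loop over
   tiles of A and B disappears. */
__global__ void rs_matmul_tiled(const uint8_t *A, const uint8_t *B, uint8_t *C,
                                int n, int k, int len)
{
    __shared__ uint8_t sA[TILE_ROW][MAX_K];

    int row = blockIdx.y * TILE_ROW + threadIdx.y;   /* which code chunk    */
    int col = blockIdx.x * TILE_COL + threadIdx.x;   /* which byte in chunk */

    if (threadIdx.x < k && row < n)                  /* assumes TILE_COL >= k */
        sA[threadIdx.y][threadIdx.x] = A[row * k + threadIdx.x];
    __syncthreads();

    if (row < n && col < len) {
        uint8_t acc = 0;
        for (int i = 0; i < k; ++i)                  /* single pass: tileDepth = k */
            acc ^= gf256_mul(sA[threadIdx.y][i], B[i * len + col]);
        C[row * len + col] = acc;
    }
}

The kernel would be launched with dim3 block(TILE_COL, TILE_ROW), i.e. 256 threads here; the three strategies below differ only in how tileWidthRow and tileWidthCol split this fixed block size.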
How to Determine the Parameter of Tiles
Define the aspect ratio $AR$ of the tile in the result matrix as:
$$
AR = \dfrac{ \textrm{tileWidthRow} }{ \textrm{tileWidthCol} }
$$
three strategies:
Min $AR$: minimize tileWidthRow, maximize tileWidthCol.
$AR = 1$: $\textrm{tileWidthRow} = \textrm{tileWidthCol}$.
Max $AR$: maximize tileWidthRow, minimize tileWidthCol.
How to Determine the Parameter of Tiles
use the following testcases for encoding:
| testcase | $k$ | $n$ | chunk size (MB) | file size | code chunk number |
|---|---|---|---|---|---|
| 1 | 2 | 200 | 1 | small | large |
| 2 | 4 | 150 | 1 | small | large |
| 3 | 4 | 200 | 1 | small | large |
| 4 | 2 | 200 | 10 | large | large |
| 5 | 4 | 150 | 10 | large | large |
| 6 | 4 | 200 | 10 | large | large |
| 7 | 4 | 6 | 256 | large | small |
| 8 | 8 | 16 | 128 | large | small |
How to Determine the Parameter of Tiles
The first six testcases, especially the 4th-6th, represent the worst cases for the "Max $AR$" strategy.
The last two testcases represent the worst cases for the "$AR = 1$" strategy.
Overall, the "Min $AR$" strategy is the best.
How to Determine the Parameter of Tiles
Generally, the "Min $AR$" strategy has the fewest conditional branches, while "Max $AR$" has the most. The overhead of branches is significant when encoding a large file into a large number of code chunks.
How to Determine the Parameter of Tiles
The "$AR=1$" strategy has significant performance degradation when the number of code chunks is smaller than tileWidthRow (wasting cache space, extra boundary checking).
Further Improvement of Tiling Algorithm
When each chunk can be word-aligned, we can further improve the tiling algorithm.
Observation 1: Byte-length Matrix Multiplication
Multiplication: table lookups → memory accesses.
Addition: 8-bit XORs → ALU operations.
Each ALU operation: operand1 (8-bit) XOR operand2 (8-bit) = result (8-bit), so most of each 32-bit ALU word is wasted.
pack four 8-bit values into one 32-bit word → four XORs become one 32-bit XOR and four 8-bit memory transactions become one 32-bit transaction → fewer ALU operations and fewer global memory loads and stores (see the sketch below).
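A sketch of the word-aligned inner step (the names are ours; it assumes the branch-free augmented tables of Approach 3): each thread reads one 32-bit word of a native chunk, multiplies its four bytes by the same coefficient through the tables, and returns the packed 32-bit result, which the caller XOR-accumulates.

#include <stdint.h>

__device__ uint32_t gf256_mul_word(uint32_t word, uint8_t coeff,
                                   const uint16_t *log_tab,
                                   const uint8_t *exp_tab)
{
    uint32_t out = 0;
    #pragma unroll
    for (int byte = 0; byte < 4; ++byte) {
        uint8_t b = (word >> (8 * byte)) & 0xFF;
        uint8_t p = exp_tab[log_tab[coeff] + log_tab[b]];  /* branch-free lookup */
        out |= (uint32_t)p << (8 * byte);
    }
    return out;
}

/* inside the kernel: acc ^= gf256_mul_word(...) replaces four 8-bit loads,
   four 8-bit XORs, and four 8-bit stores per word of each native chunk */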
Further Improvement of Tiling Algorithm
e.g. $k = 4$, $n = 200$, chunk size = 200 MB
| | Original | Word-alignment | Improvement |
|---|---|---|---|
| Effective bandwidth (MB/s) | 1119.419693 | 1591.051924 | 42.13% |
| ALU Operations | 12331000000 | 3116531712 | 74.73% |
| Global Memory Load Transactions | 516963328 | 131611648 | 74.54% |
| Global Memory Store Transactions | 64225280 | 16056320 | 75% |
| Number of Branches | 6685685440 | 2703981504 | 59.56% |
Reducing Data Transfer Overhead
Using CUDA Streams
Using CUDA Streams
CUDA streams are used for further overlapping data transfers with computation.
Sequential vs. Concurrent copy and execute
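A minimal sketch of the concurrent pattern (the stream count, names, and the dummy kernel are ours; the host buffers must be pinned, e.g. allocated with cudaMallocHost(), for the copies to actually overlap):

#include <cuda_runtime.h>
#include <stdint.h>

#define NSTREAMS 4

/* placeholder kernel standing in for the Reed-Solomon encode kernel */
__global__ void encode_slice(const uint8_t *in, uint8_t *out, size_t n)
{
    size_t i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i];    /* the real kernel does the GF(2^8) matrix multiply */
}

void encode_with_streams(const uint8_t *h_in, uint8_t *h_out,
                         uint8_t *d_in, uint8_t *d_out, size_t totalBytes)
{
    cudaStream_t streams[NSTREAMS];
    for (int s = 0; s < NSTREAMS; ++s)
        cudaStreamCreate(&streams[s]);

    size_t slice = totalBytes / NSTREAMS;          /* assume an even split */
    for (int s = 0; s < NSTREAMS; ++s) {
        size_t off = (size_t)s * slice;
        /* each slice's H2D copy, kernel, and D2H copy go into its own stream,
           so copies of one slice overlap with the kernel of another */
        cudaMemcpyAsync(d_in + off, h_in + off, slice,
                        cudaMemcpyHostToDevice, streams[s]);
        encode_slice<<<(unsigned)((slice + 255) / 256), 256, 0, streams[s]>>>(
            d_in + off, d_out + off, slice);
        cudaMemcpyAsync(h_out + off, d_out + off, slice,
                        cudaMemcpyDeviceToHost, streams[s]);
    }
    cudaDeviceSynchronize();
    for (int s = 0; s < NSTREAMS; ++s)
        cudaStreamDestroy(streams[s]);
}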
Using CUDA Streams
encoding under $k = 4, n = 6$ settings.
The input file size is scaled from 1000 MB to 1800 MB.
Using CUDA streams can improve the performance by more than 29%.
As the file size increases, the data transfer overhead grows, and the improvement from CUDA streams becomes more significant.
Using CUDA Streams
encoding a 2000 MB file under $k = 4, n = 6$ settings.
The kernel execution time increases with the number of CUDA streams. → At some point, the extra kernel execution overhead will exceed the time saved on data transfers.
Experiment
Experiment Setup
Experiment Setup
CentOS-6 with 2.6.32 Linux kernel.
Intel Xeon Processor E5-2670 v2 x 2
10 cores
2.5 GHz
NVIDIA Tesla K20X GPU x 2
2688 CUDA cores
peak performance: 1.31 Tflops (double precision) and 3.95 Tflops (single precision)
GDDR5 memory: 6 GB
theoretical memory bandwidth: 243 GB/s
two copy engines → supports concurrent data copy and kernel execution
maximum bidirectional bandwidth of the PCI-Express bus: 8 GB/s
Experiment Setup
Input files are randomly generated.
Most of our experimental results reflect the average of 100 runs.
Since the performance results of encoding and decoding are similar in most experiments, the decoding results are omitted.
Overall Performance Evaluation
We evaluate the overall performance by encoding a 1600 MB file with $k = 4, n = 6$.
Overall Performance Evaluation

| Step-by-step Improvement | Step Speedup (Kernel Execution) | Step Speedup (Non-overlapped Data Transfer) | Cumulative Speedup (Kernel Execution) | Cumulative Speedup (Non-overlapped Data Transfer) |
|---|---|---|---|---|
| optimize tiling algorithm | 1.469 x | 0 | 1.469 x | 0 |
| optimize log&exp table-based method | 1.502 x | 0 | 2.206 x | 0 |
| use CUDA streams | 0.862 x | 18.452 x | 1.902 x | 18.452 x |
| parallelize across 2 GPUs | 2.030 x | 2.857 x | 3.861 x | 52.718 x |

Total Cumulative Speedup: 7.145 x
Overall Performance Evaluation
GPU vs. CPU
best CPU implementation (Jerasure library, compiled by clang with the -O3 compiler optimization flag): 4309.08 ms.
optimized GPU implementation: 292.977 ms ($14.71 \times$ speedup).
Conclusion
We have studied several techniques to improve the performance of Reed-Solomon coding based on its coding mechanism, and identified the best choices for the GPU architecture.
We have illustrated methods to reduce the data transfer overhead introduced by the GPU implementation.
We present an optimized GPU implementation of Reed-Solomon codes, which achieves a $14.71\times$ speedup over the current best CPU implementation.
Future Work
Better cooperation between the CPU and the GPU.
Heuristic strategies for choosing the best number of CUDA streams.