→ SLA (Service-Level Agreement) violation
Therefore, redundancy must be introduced into cloud storage systems.
Example:
Reed-Solomon Code (n=4, k=2) | Replication |
Save 50% space!
Reed-Solomon Encoding | Reed-Solomon Decoding |
Given a file $F$: divide it into $k$ equal-size native chunks: $F = [F_{i}]_{i=1,2,\dots,k}$.
Encode them into $(n-k)$ code chunks: $C = [C_{i}]_{i=1,2,\dots,n-k}$.
Use an Encoding Matrix $EM_{(n-k) \times k}$ to produce code chunks: $$ C^{T} = EM \times F^{T} $$ $C_{i}$ is the linear combination of $F_{1}, F_{2}, \dots, F_{k}$.
For example:
\begin{align}
\begin{array}{rcl}
F &=& \begin{pmatrix} A & B \end{pmatrix} \\
EM &=& \begin{pmatrix} 1 & 1 \\ 1 & 2 \end{pmatrix} \\
C^{T} &=& \begin{pmatrix} 1 & 1 \\ 1 & 2 \end{pmatrix} \times \begin{pmatrix} A \\ B \end{pmatrix} \\
&=& \begin{pmatrix} A+B \\ A+2B \end{pmatrix}
\end{array}
\end{align}
Let $P = [P_{i}]_{i=1,2,\dots,n} = [F_{1}, F_{2}, \dots, F_{k}, C_{1}, C_{2}, \dots, C_{n-k}]$ be the $n$ chunks in storage, and $EM' = \begin{pmatrix} I \\ EM \end{pmatrix}$, where
$$ I = \begin{pmatrix} 1 & 0 & \ldots & 0 \\ 0 & 1 & \ldots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \ldots & 1 \end{pmatrix} $$
then
\begin{align}
\begin{array}{rcl}
P^{T} &=& EM' \times F^{T} \\
&=& \begin{pmatrix} F^{T} \\ C^{T} \end{pmatrix}
\end{array}
\end{align}
In the example above,
\begin{align}
\begin{array}{rcl}
EM' &=& \begin{pmatrix} I \\ EM \end{pmatrix} = \begin{pmatrix} 1 & 0 \\ 0 & 1 \\ 1 & 1 \\ 1 & 2 \end{pmatrix} \\
P^{T} &=& EM' \times F^{T} \\
&=& \begin{pmatrix} 1 & 0 \\ 0 & 1 \\ 1 & 1 \\ 1 & 2 \end{pmatrix} \times \begin{pmatrix} A \\ B \end{pmatrix} \\
&=& \begin{pmatrix} A \\ B \\ A+B \\ A+2B \end{pmatrix}
\end{array}
\end{align}
$EM'$ is the key to the MDS property!
Alternative view:
Consider the linear space of $P = [P_{i}]_{i=1,2,\dots,n} = [F_{1}, F_{2}, \dots, F_{k}, C_{1}, C_{2}, \dots, C_{n-k}]$, its dimension is $k$, and any $k$ out of $n$ vectors form a basis of the linear space.
Reed-Solomon codes use the Vandermonde matrix $V$ as $EM$. Since $EM$ has $n-k$ rows, the exponents run from $0$ to $n-k-1$: $$ V = \begin{bmatrix} 1 & 1 & 1 & \ldots & 1 \\ 1 & 2 & 3 & \ldots & k \\ 1^2 & 2^2 & 3^2 & \ldots & k^2 \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1^{(n-k-1)} & 2^{(n-k-1)} & 3^{(n-k-1)} & \ldots & k^{(n-k-1)} \end{bmatrix} $$
$$ EM' = \begin{bmatrix} 1 & 0 & 0 & \ldots & 0 \\ 0 & 1 & 0 & \ldots & 0 \\ 0 & 0 & 1 & \ldots & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & 0 & \ldots & 1 \\ 1 & 1 & 1 & \ldots & 1 \\ 1 & 2 & 3 & \ldots & k \\ 1^2 & 2^2 & 3^2 & \ldots & k^2 \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1^{(n-k-1)} & 2^{(n-k-1)} & 3^{(n-k-1)} & \ldots & k^{(n-k-1)} \end{bmatrix} $$
Remark:
We randomly select $k$ out of the $n$ chunks, say $X = [x_{1}, x_{2}, \cdots, x_{k}]^{T}$. The coefficients of each chunk $x_{i}$ (its row in $EM'$) form a $k \times k$ matrix $V'$. The original data can then be regenerated by multiplying $X$ by the inverse of $V'$: $$ F^{T} = V'^{-1} X $$
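Continuing the $(n, k) = (4, 2)$ toy example (with ordinary arithmetic for illustration, as before): suppose the two surviving chunks are $P_{3} = A+B$ and $P_{4} = A+2B$, so $X = [A+B, A+2B]^{T}$ and
\begin{align}
\begin{array}{rcl}
V' &=& \begin{pmatrix} 1 & 1 \\ 1 & 2 \end{pmatrix}, \quad V'^{-1} = \begin{pmatrix} 2 & -1 \\ -1 & 1 \end{pmatrix} \\
F^{T} &=& V'^{-1} X = \begin{pmatrix} 2 & -1 \\ -1 & 1 \end{pmatrix} \begin{pmatrix} A+B \\ A+2B \end{pmatrix} = \begin{pmatrix} A \\ B \end{pmatrix}
\end{array}
\end{align}
Indeed, $2(A+B) - (A+2B) = A$ and $-(A+B) + (A+2B) = B$.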
A Galois field GF($p^w$), where $p$ is a prime and $w$ is a positive integer, is a set of $p^w$ polynomials of degree at most $w - 1$ with coefficients in GF($p$).
Among the Galois Fields, the binary extension fields GF($2^w$) are of particular interest in erasure codes due to the byte-based nature of memory in many computer architectures.
GF($2^w$) is constructed by finding a primitive polynomial $q(x)$ of degree $w$ over GF$(2)=\{ 0, 1 \}$, and then enumerating the polynomial elements with a generator.
In GF($2^w$), the polynomial $x$ can be considered a generator.
Generation of GF($2^w$):
e.g. Generation of GF($2^2$): $w = 2$, and $q(x) = x^2 + x + 1$.
Initial set: $\{ 0, 1, x \}$
$x^2 \textrm{ mod } q(x) = x + 1$, insert into the set: $\{ 0, 1, x, 1+x \}$.
$(x+1)x \textrm{ mod } q(x) = 1$, end of enumeration.
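The enumeration above can be sketched in C. The generator is the polynomial $x$ (a left shift by one bit), reduced modulo $q(x) = x^2 + x + 1$ (bit pattern 0x7) whenever the degree reaches $w$; `gf4_enumerate` is an illustrative helper name.

```c
#include <stdint.h>

// Enumerate the non-zero elements of GF(2^2) as powers of the
// generator x: out[i] holds x^i reduced modulo q(x) = x^2 + x + 1.
void gf4_enumerate(uint8_t out[3])
{
    const int w = 2;
    const uint8_t q = 0x7;      /* x^2 + x + 1 */
    uint8_t e = 1;              /* x^0 */
    for (int i = 0; i < (1 << w) - 1; i++) {
        out[i] = e;
        e <<= 1;                /* multiply by x */
        if (e & (1 << w))       /* degree reached w: subtract (XOR) q */
            e ^= q;
    }
}
```

Running it reproduces the table above: $x^0 = 1$, $x^1 = x$ (binary 10), $x^2 = x+1$ (binary 11).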
Each polynomial element of GF($2^w$) can be mapped to a binary word $b$ of size $w$ by setting the $i$th bit of $b$ to the coefficient of $x^i$ in the polynomial.
e.g. GF($2^2$)
generated element | polynomial element | binary element $b$ | decimal representation of $b$ |
0 | 0 | 00 | 0 |
$x^0$ | 1 | 01 | 1 |
$x^1$ | $x$ | 10 | 2 |
$x^2$ | $x+1$ | 11 | 3 |
For GF($2^8$), every element can be mapped into a byte.
Therefore, how to accelerate the time-consuming multiplication over GF($2^w$) will be our focus.
To perform the multiplication of two binary numbers in GF($2^w$), the numbers should be converted to their polynomial elements, multiplied by the polynomials modulo $q(x)$, and then converted back to binary.
The Russian peasant multiplication algorithm can be used to turn the polynomial multiplication into bitwise operations.
e.g. Russian peasant multiplication in normal field: \begin{align} 9 \times 12 &\leadsto (9/2) \times (12 \times 2) + 12 \\ &\leadsto (4/2) \times (24 \times 2) + 12 \\ &\leadsto (2/2) \times (48 \times 2) + 12 \\ &\leadsto (48 \times 2) + 12 \\ &\leadsto 108 \end{align}
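The integer version above can be sketched as a loop of shifts and adds (`rp_mul` is an illustrative name):

```c
// Russian peasant multiplication over the integers: halve one operand,
// double the other, and accumulate the doubled operand whenever the
// halved one is odd.
unsigned rp_mul(unsigned a, unsigned b)
{
    unsigned acc = 0;
    while (a) {
        if (a & 1)
            acc += b;   /* a is odd: add the current b */
        a >>= 1;        /* halve a */
        b <<= 1;        /* double b */
    }
    return acc;
}
```

For example, `rp_mul(9, 12)` yields 108, matching the derivation above.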
e.g. Russian peasant multiplication in GF($2^8$): \begin{align} (x^3+1) \times (x^3+x^2) &\leadsto (x^2) \times (x(x^3+x^2)) + (x^3+x^2) \\ &\leadsto (x) \times (x(x^4+x^3)) + (x^3+x^2) \\ &\leadsto 1 \times (x(x^5+x^4)) + (x^3+x^2) \\ &\leadsto x^6+x^5+x^3+x^2 (\equiv 108) \\ \end{align}
Multiplication in GF($2^w$) follows the same procedure as in ordinary arithmetic, except that additions become XORs and a reduction modulo $q(x)$ keeps every intermediate result below $2^w$.
uint8_t gf256_mul(uint8_t a, uint8_t b)
{
    // Primitive polynomial of GF(2^8):
    // x^8 + x^4 + x^3 + x^2 + 1,
    // whose low byte is mapped into 0x1d
    const uint8_t prim_poly = 0x1d;
    uint8_t result = 0;
    while (b)
    {
        // LSB of b is 1
        if (b & 1)
        {
            // Emulate polynomial addition
            result ^= a;
        }
        // MSB of a is 1: the shift below overflows,
        // so a reduction is needed afterwards
        uint8_t carry = a & 0x80;
        // Emulate multiplication by x
        a <<= 1;
        if (carry)
        {
            // Emulate polynomial modulo by prim_poly
            a ^= prim_poly;
        }
        b >>= 1;
    }
    return result;
}
costs at most eight iterations per multiplication → computation-bound
Pre-compute all multiplication results in a table; multiplication then becomes a single table-lookup operation.
example implementation to multiply $a$ and $b$ in GF($2^8$) using the full multiplication table:
uint8_t gf256_mul_table[256][256];

uint8_t gf256_mul(uint8_t a, uint8_t b)
{
    return gf256_mul_table[a][b];
}
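The full table has to be filled once at startup, e.g. with the bitwise multiplication from the previous section; `gf256_mul_bitwise` and `gf256_mul_table_init` are illustrative names.

```c
#include <stdint.h>

uint8_t gf256_mul_table[256][256];

/* Bitwise GF(2^8) multiply used once at startup; primitive
   polynomial x^8 + x^4 + x^3 + x^2 + 1, low byte 0x1d. */
static uint8_t gf256_mul_bitwise(uint8_t a, uint8_t b)
{
    uint8_t result = 0;
    while (b) {
        if (b & 1)
            result ^= a;            /* polynomial addition is XOR */
        uint8_t carry = a & 0x80;   /* remember the bit about to shift out */
        a <<= 1;                    /* multiply a by x */
        if (carry)
            a ^= 0x1d;              /* reduce modulo the primitive polynomial */
        b >>= 1;
    }
    return result;
}

void gf256_mul_table_init(void)
{
    for (int a = 0; a < 256; a++)
        for (int b = 0; b < 256; b++)
            gf256_mul_table[a][b] = gf256_mul_bitwise((uint8_t)a, (uint8_t)b);
}
```

After initialization, `gf256_mul_table[9][12]` holds 108, matching the worked polynomial example.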
Based on the following observation:
\begin{align} a(x) \times b(x) &= (a_{0} + a_{1}x + ... + a_{w-1}x^{w-1}) \times (b_{0} + b_{1}x + ... + b_{w-1}x^{w-1}) \\ &= (a_{0} + a_{1}x + ... + a_{w/2-1}x^{w/2-1}) \times (b_{0} + b_{1}x + ... + b_{w-1}x^{w-1}) \\ &+ (a_{w/2}x^{w/2} + ... + a_{w-1}x^{w-1}) \times (b_{0} + b_{1}x + ... + b_{w-1}x^{w-1}) \end{align}Note: add operations in Galois Field are equivalent to XOR.
Split the full multiplication table into two smaller tables containing the product for the MSB part and the LSB one.
uint8_t gf256_left_table[16][256];
uint8_t gf256_right_table[16][256];

uint8_t gf256_mul(uint8_t a, uint8_t b)
{
    return gf256_left_table[(a >> 4) & 0x0f][b] ^ gf256_right_table[a & 0x0f][b];
}
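A sketch of how the two small tables can be filled: the left table stores the products of the high nibble with the factor $x^4$ already shifted into place, the right table the products of the low nibble. The bitwise multiply serves as the one-off initializer; helper names are illustrative, and the lookup function is restated so the block is self-contained.

```c
#include <stdint.h>

uint8_t gf256_left_table[16][256];   /* products (h << 4) * b */
uint8_t gf256_right_table[16][256];  /* products l * b */

/* bitwise GF(2^8) multiply, primitive polynomial low byte 0x1d */
static uint8_t gf256_mul_bitwise(uint8_t a, uint8_t b)
{
    uint8_t r = 0;
    while (b) {
        if (b & 1) r ^= a;
        uint8_t carry = a & 0x80;
        a <<= 1;
        if (carry) a ^= 0x1d;
        b >>= 1;
    }
    return r;
}

void gf256_split_tables_init(void)
{
    for (int n = 0; n < 16; n++)
        for (int b = 0; b < 256; b++) {
            gf256_left_table[n][b]  = gf256_mul_bitwise((uint8_t)(n << 4), (uint8_t)b);
            gf256_right_table[n][b] = gf256_mul_bitwise((uint8_t)n, (uint8_t)b);
        }
}

uint8_t gf256_mul(uint8_t a, uint8_t b)
{
    /* XOR of the two partial products = polynomial addition */
    return gf256_left_table[(a >> 4) & 0x0f][b] ^ gf256_right_table[a & 0x0f][b];
}
```

The split cuts the memory footprint from 64 KB to 8 KB at the cost of one extra lookup and one XOR.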
Basis: every non-zero element can be represented as a power of a primitive element $\alpha$.
Assume that $a = \alpha^i$ and $b = \alpha^j$ are two non-zero elements in GF($2^w$), then their product $ab = \alpha^{i + j}$.
For any primitive element $\alpha$: $\alpha^{2^w - 1} = 1$.
Thus, we have:
\begin{align}
ab &= \alpha^{(i + j) \,\%\, (2^w - 1)} \\
&= \alpha^{(\log_{\alpha}a + \log_{\alpha}b) \,\%\, (2^w - 1)}
\end{align}
Construct two tables: a logarithm table $log[x]$ mapping each non-zero element to its discrete logarithm, and an exponential table $exp[i]$ mapping each exponent back to $\alpha^{i}$.
Multiplication in GF($2^w$) then consists of: two logarithm lookups, one addition modulo $2^w - 1$, and one exponential lookup.
uint8_t gf256_log_table[256];
uint8_t gf256_exp_table[255];
// width of GF(2^8)
const int width = 8;
// number of elements in GF(2^8)
const int NW = 1 << width;

uint8_t gf256_mul(uint8_t a, uint8_t b)
{
    int result;
    if (a == 0 || b == 0)
    {
        return 0;
    }
    result = (gf256_log_table[a] + gf256_log_table[b]) % (NW - 1);
    return gf256_exp_table[result];
}
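The two tables can be initialized once by stepping through the powers of the generator $x$ (decimal 2), which is primitive here because $q(x) = x^8+x^4+x^3+x^2+1$ is a primitive polynomial; `gf256_log_exp_init` is an illustrative name, and the lookup-based multiply is restated so the block is self-contained.

```c
#include <stdint.h>

uint8_t gf256_log_table[256];
uint8_t gf256_exp_table[255];

/* Walk through alpha^0 .. alpha^254 (alpha = x, decimal 2),
   filling exp[] forward and log[] as its inverse mapping. */
void gf256_log_exp_init(void)
{
    uint8_t e = 1;                  /* x^0 */
    for (int i = 0; i < 255; i++) {
        gf256_exp_table[i] = e;
        gf256_log_table[e] = (uint8_t)i;
        uint8_t carry = e & 0x80;
        e <<= 1;                    /* multiply by x */
        if (carry)
            e ^= 0x1d;              /* reduce modulo the primitive polynomial */
    }
    /* log[0] stays undefined; gf256_mul handles zero operands separately */
}

uint8_t gf256_mul(uint8_t a, uint8_t b)
{
    if (a == 0 || b == 0)
        return 0;
    int result = (gf256_log_table[a] + gf256_log_table[b]) % 255;
    return gf256_exp_table[result];
}
```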
Criteria: to avoid high latency of off-chip memory access, the tables are expected to fit into the cache or on-chip memory (the so-called "shared memory" in GPU).
memory space of tables for multiplication over GF($2^8$):
→ used in our GPU implementation
Replace slow MOD operation with more efficient operations.
Observation: Let $Q = 2^w$. Since $\alpha^{Q-1} = 1$, the logarithms satisfy $log(a), log(b) \in [0, Q-2]$, and therefore $(log(a) + log(b)) \in [0, 2Q-4]$.
\begin{equation*} (log(a) + log(b)) \,\%\, (Q-1) = \left\{ \begin{array}{ll} 0 & \textrm{if } s = Q - 1 \\ s & \textrm{otherwise} \end{array} \right. \quad \textrm{where } s = ((log(a) + log(b)) \,\&\, (Q-1)) + ((log(a) + log(b)) >> w) \end{equation*}
We can extend the exponential table with one extra entry, $exp[Q-1] = exp[0] = 1$ (valid because $\alpha^{Q-1} = 1$), which absorbs the first case. After augmenting the exponential table, the modular operation becomes simply the following equation: \begin{equation*} result = ((log(a) + log(b)) \,\&\, (Q-1)) + ((log(a) + log(b)) >> w) \end{equation*}
uint8_t gf256_log_table[256];
// one extra entry: exp[255] = exp[0] = 1
uint8_t gf256_exp_table[256];
// width of GF(2^8)
const int width = 8;
// number of elements in GF(2^8)
const int NW = 1 << width;

uint8_t gf256_mul(uint8_t a, uint8_t b)
{
    int sum, result;
    if (a == 0 || b == 0)
    {
        return 0;
    }
    sum = gf256_log_table[a] + gf256_log_table[b];
    // fold instead of the slow modulo; may yield NW-1, which exp maps to 1
    result = (sum & (NW - 1)) + (sum >> width);
    return gf256_exp_table[result];
}
Remove the slow modular operation entirely by augmenting the exponential table (adopted by Jerasure as the LOGS policy).
Insert $Q - 2$ more entries into the exponential table: $exp[i] = exp[i \,\%\, (Q-1)]$ for $i \in [Q-1, 2Q-4]$, so that the sum of any two logarithms can index the table directly (hence the table size $2Q - 3 = 509$ for $w = 8$).
example implementation to multiply $a$ and $b$ in GF($2^8$) using the Improvement II approach:
uint8_t gf256_log_table[256];
uint8_t gf256_exp_table[509];

uint8_t gf256_mul(uint8_t a, uint8_t b)
{
    int result;
    if (a == 0 || b == 0)
    {
        return 0;
    }
    result = gf256_log_table[a] + gf256_log_table[b];
    return gf256_exp_table[result];
}
Further eliminate the conditional branch by augmenting both the exponential and the logarithm tables.
Let $Q = 2^w$ and $\alpha$ be a primitive element. Augment the tables so that any index involving $log[0]$ lands in a zero region of the exponential table; the largest possible index is $log[0] + log[0] = 4Q - 4$:
\begin{equation*}
exp[i] = \left\{
\begin{array}{ll}
\alpha^{i} & \textrm{if } i \in [0, Q-2] \\
exp[i - (Q-1)] & \textrm{if } i \in [Q-1, 2Q-4] \\
0 & \textrm{if } i \in [2Q-3, 4Q-4]
\end{array} \right.
\end{equation*}
\begin{equation*}
log[x] = \left\{
\begin{array}{ll}
2Q-2 & \textrm{if } x = 0 \\
i & \textrm{if } x \in [1, Q-1], x = \alpha^{i}
\end{array} \right.
\end{equation*}
Whenever $a = 0$ or $b = 0$, the index $log[a] + log[b]$ is at least $2Q-2$, so the lookup returns 0 without any branch.
// log[0] = 2Q-2 = 510 does not fit in a byte
uint16_t gf256_log_table[256];
// 4Q-3 = 1021 entries, indices 0 .. 4Q-4
uint8_t gf256_exp_table[1021];

uint8_t gf256_mul(uint8_t a, uint8_t b)
{
    int result;
    result = gf256_log_table[a] + gf256_log_table[b];
    return gf256_exp_table[result];
}
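A self-contained sketch of the augmented-table layout derived above, together with its initialization (the generator $x$, decimal 2, again works because the primitive polynomial $x^8+x^4+x^3+x^2+1$ is primitive; `gf256_tables_init` is an illustrative name):

```c
#include <stdint.h>

uint16_t gf256_log_table[256];
uint8_t  gf256_exp_table[1021];

/* For Q = 256:
   exp[i] = alpha^i        for i in [0, 254]
   exp[i] = exp[i - 255]   for i in [255, 508]  (wrapped copy)
   exp[i] = 0              for i in [509, 1020] (zero region)
   log[0] = 510, so any product involving 0 indexes the zero region. */
void gf256_tables_init(void)
{
    uint8_t e = 1;
    for (int i = 0; i < 255; i++) {
        gf256_exp_table[i] = e;
        gf256_exp_table[i + 255] = e;   /* wrapped copy */
        gf256_log_table[e] = (uint16_t)i;
        uint8_t carry = e & 0x80;
        e <<= 1;
        if (carry)
            e ^= 0x1d;                  /* reduce modulo the primitive polynomial */
    }
    for (int i = 509; i < 1021; i++)
        gf256_exp_table[i] = 0;         /* zero region */
    gf256_log_table[0] = 510;           /* 2Q - 2 */
}

uint8_t gf256_mul(uint8_t a, uint8_t b)
{
    /* no branch: zero operands resolve through the zero region */
    return gf256_exp_table[gf256_log_table[a] + gf256_log_table[b]];
}
```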
In GPU implementation, where to store the log and exp tables and how to initialize them will affect the performance.
Appropriate GPU memory:
What we have implemented:
After knowing which $k$ chunks are used for decoding, we know the corresponding rows of coefficients in the encoding matrix $V$. These rows form the $k \times k$ matrix $V'$, and we need to compute its inverse.
We use Gaussian elimination to compute the inverse matrix: augment the square matrix $V'$ with the identity matrix $I$ of the same dimensions to obtain $[V' | I]$, and then apply row operations to transform $[V' | I]$ into its reduced row echelon form: $V'^{-1} [V' | I] = [I | V'^{-1}]$.
Gauss elimination contains the following steps:
Steps 2 and 3 are suitable for parallelization, but since barriers are required between steps, Gaussian elimination is still not a fully parallel algorithm. CPU-GPU cooperation may be an appropriate solution.
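For reference, a serial CPU sketch of the inversion over GF($2^8$): Gauss-Jordan elimination on $[V' | I]$, with subtraction replaced by XOR. `gmul`, `ginv`, and `gf256_matrix_invert` are illustrative names; the multiplicative inverse is found by exhaustive search, which is fine since each inversion is a one-off, tiny $k \times k$ problem.

```c
#include <stdint.h>
#include <string.h>

/* bitwise GF(2^8) multiply, primitive polynomial low byte 0x1d */
static uint8_t gmul(uint8_t a, uint8_t b)
{
    uint8_t r = 0;
    while (b) {
        if (b & 1) r ^= a;
        uint8_t carry = a & 0x80;
        a <<= 1;
        if (carry) a ^= 0x1d;
        b >>= 1;
    }
    return r;
}

/* multiplicative inverse by exhaustive search */
static uint8_t ginv(uint8_t x)
{
    for (int y = 1; y < 256; y++)
        if (gmul(x, (uint8_t)y) == 1)
            return (uint8_t)y;
    return 0; /* x == 0 has no inverse */
}

/* Invert the k x k matrix m (row-major) over GF(2^8) via Gauss-Jordan
   elimination on [m | I]; the result lands in inv, m is reduced to I.
   Returns 0 on success, -1 if m is singular. */
int gf256_matrix_invert(uint8_t *m, uint8_t *inv, int k)
{
    memset(inv, 0, (size_t)k * k);
    for (int i = 0; i < k; i++)
        inv[i * k + i] = 1;

    for (int col = 0; col < k; col++) {
        /* Step 1: find a row with a non-zero pivot and swap it up */
        int p = -1;
        for (int r = col; r < k; r++)
            if (m[r * k + col]) { p = r; break; }
        if (p < 0)
            return -1;
        for (int c = 0; c < k && p != col; c++) {
            uint8_t t;
            t = m[col * k + c];   m[col * k + c] = m[p * k + c];     m[p * k + c] = t;
            t = inv[col * k + c]; inv[col * k + c] = inv[p * k + c]; inv[p * k + c] = t;
        }
        /* Step 2: scale the pivot row so the pivot becomes 1 */
        uint8_t s = ginv(m[col * k + col]);
        for (int c = 0; c < k; c++) {
            m[col * k + c]   = gmul(m[col * k + c], s);
            inv[col * k + c] = gmul(inv[col * k + c], s);
        }
        /* Step 3: eliminate the pivot column from every other row
           (subtraction in GF(2^8) is XOR) */
        for (int r = 0; r < k; r++) {
            uint8_t f = m[r * k + col];
            if (r == col || f == 0)
                continue;
            for (int c = 0; c < k; c++) {
                m[r * k + c]   ^= gmul(f, m[col * k + c]);
                inv[r * k + c] ^= gmul(f, inv[col * k + c]);
            }
        }
    }
    return 0;
}
```

The per-column pivot scaling (Step 2) and row elimination (Step 3) are the loops that a GPU version would parallelize, with a barrier after each column.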
By default, host data is allocated as pageable memory
→ pages can be swapped out to disk, and the virtual-to-physical address mapping can change
→ unavailable for the device to access directly.
simplified procedure of H2D data transfer using pageable memory
pinned memory is page-locked: its physical address is marked as ineligible for eviction and cannot be changed by the operating system.
simplified procedure of H2D data transfer using pinned memory
CUDA streams are used for further overlapping data transfers with computation.
measured by recording two CUDA events at the beginning and the end of the kernel and calculating their elapsed time.
defined as the ratio of the number of bytes read or written by the kernel to the execution time of the kernel.
Assume that $CS$ is the chunk size, and $t$ is the kernel execution time, then:
We evaluate the overall performance by encoding a 1600 MB file with $k = 4, n = 6$.
encoding a 1GB file with $k = 4$ and $n = 6$.
use the following ten test cases:
| | H2D bandwidth (MB/s) | D2H bandwidth (MB/s) |
pageable host memory | 1440.6 | 2339.7 |
pinned host memory | 5888.6 | 6394.5 |
encoding under $k = 4, n = 6$ settings.
The input file size is scaled from 1000 MB to 2000 MB, and the CUDA stream number is increased from one to four.
encoding a 2000 MB file under $k = 4, n = 6$ settings.