Matrix-Vector Based Fast Fourier Transformations on SDR Architectures

Today Discrete Fourier Transforms (DFTs) are applied in various radio standards based on OFDM (Orthogonal Frequency Division Multiplex). It is important to gain a fast computational speed for the DFT, which is usually achieved by using specialized Fast Fourier Transform (FFT) engines. However, in face of the Software Defined Radio (SDR) development, more general (parallel) processor architectures are often desirable, which are not tailored to FFT computations. Therefore, alternative approaches are required to reduce the complexity of the DFT. Starting from a matrixvector based description of the FFT idea, we will present different factorizations of the DFT matrix, which allow a reduction of the complexity that lies between the original DFT and the minimum FFT complexity. The computational complexities of these factorizations and their suitability for implementation on different processor architectures are investigated.


Introduction
Recently multi-carrier modulation techniques, such as OFDM, have received great attention in high-speed data communication systems and have been selected for several communication standards (e.g.WLAN, DVB and WiMax).A central baseband signal processing task required in OFDM transceivers is the Fourier transform of the received data, which is generally realized as FFT (Cooley, 1965).To meet the real-time processing requirements, various specialized FFT processors have been proposed (Son, 2002).On the other hand Software Defined Radio (SDR) approaches for mobile terminals are gaining momentum (Dillinger, 2003).To meet the demand for high computing power and flexibility, specific reconfigurable accelerators (Heyne, 2004) or parallel processor architectures (Berthold, 2004) are used.The DFT can be implemented as FFT, which defines the Correspondence to: K. Hueske (klaus.hueske@uni-dortmund.de) optimum in terms of mathematical complexity, but as the used architectures are not necessarily optimized for butterfly structures, the question arises: Are there any efficient procedures, which reduce the computational complexity of the DFT significantly (as far as possible to the FFT complexity), but at the same time are more tailor-made to specific hardware architectures than the FFT?.In this paper a matrixvector based formulation of the FFT algorithm (Van Loan, 1992) is derived from the respective matrix factorizations of the DFT matrix.Due to the structure of the butterfly matrices and the matrix factorizations, the matrix-vector based FFT allows several separations of the resulting matrix product.These separations have an increased computational complexity compared to the FFT, but because of their mathematical structure they allow more flexible implementations on different architectures (e.g.multi processors, vector processors, accelerated systems).The paper is organized as follows: In Sect. 2 we will clarify the definition and notation of the DFT.A short repetition of the FFT idea and the corresponding factorization is described in Sect.3.Then, in Sect. 4 different separations of the butterfly matrices are presented, which lead to different algorithmic structures and different computational complexities.Implementation issues of these separations and their suitability for different hardware architectures are discussed.Conclusions are given in Sect. 5.

Discrete Fourier Transform (DFT)
The sequence of N (with N=2 m , m∈ N) complex numbers x 1 , • • •, x N is transformed into the sequence of N complex numbers y 1 , • • •, y N by the DFT according to where w N =e −j 2π N is a primitive N-th root of the unity.The DFT can also be formulated as a matrix vector product (Van Loan, 1992).With ∈ C N , the DFT can be described as matrix Published by Copernicus Publications on behalf of the URSI Landesausschuss in der Bundesrepublik Deutschland e.V.

Matrix-vector based FFT
To give a repetition of the known radix-2 FFT algorithms (Cooley, 1965), this section will show the connection between the matrix F N and the matrix F N/2 , which enables us to compute an N-point DFT from a pair of (N/2)-point DFTs.To transfer the FFT idea to the DFT matrix, we introduce a so-called permutation matrix P N of size (N ×N ), which performs an odd-even ordering of the rows (multiplied from the left with P N ) or columns (multiplied from the right with P T N ) of a matrix.Furthermore P T N •P N =I N holds.The following factorizations and separations are shown for the decimation in time (DIT) FFT.They can be easily transferred to the decimation in frequency (DIF) FFT.

FFT-Matrix decomposition
Applying P T N to the right of the DFT-matrix F N yields The resulting product F N has the following structure: The odd indexed columns, forming the left half of F N , are composed of F N/2 , while the even indexed columns, forming the right half of F N , are composed of D N/2 •F N/2 and −D N/2 •F N/2 .This matrix can be decomposed into the product of two matrices (as shown in Eq. 3), where D N/2 is a diagonal matrix with elements d nn =w (n−1) N and n=1, • • •, N/2.The first matrix represents FFT butterfly operations, while the second matrix contains two independent DFTs of length N/2.Note that the second matrix is a block diagonal matrix.

Matrix-Vector based computation of the FFT
To apply the modified DFT F N to an input vector x N , a permutation of x N is required, i.e.
•x e we obtain: , with (• * ) the component-wise vector multiplication and taking into account that the two products are identical, Eq. ( 6) and Eq. ( 7) need altogether N 2 multiplications and 2 • N 2 additions.It is easy to see that a DFT of size N can be executed by two DFTs of size N 2 .For N=2 m this process can be recursively repeated m= log 2 N times.The required number of operations then will be N 2 log 2 N multiplications and N log 2 N additions, which represents the complexity of the FFT.

Factorization of the DFT matrix
To obtain alternative factorizations of the DFT matrix, the number of recursions described in the previous section can be modified.Introducing a new parameter k we can define the number of performed recursions as m−k with N =2 m giving the size of the initial DFT matrix.For arbitrary m and k the resulting block submatrices in F B are of size (2 k ×2 k ) and can be executed as independent DFTs (or FFTs).Because the length of the input vector x N is given by N=2 m the number of block submatrices equals 2 m /2 k =2 m−k .After calculating 2 m−k independent DFTs, m−k butterfly matrices B i have to be applied to finish the calculation.The entire DFT is then given as with x p the (m−k)-times permuted input vector x N .An example for m=4 and k=2 is given below.
Example 1: Increasing the order of recursion m−k will also increase the number of butterfly matrices B i .While the independent DFT submatrices in F B can be processed independently, the butterfly matrices can be treated in different ways: •As sequence of matrix multiplications •As one matrix multiplication with the butterfly product matrix •As partly combined matrix multiplication with the products with arbitrarily chosen integer numbers p fulfilling Example 2: The following example illustrates the construction of the butterfly matrices.For the given parameters m=8 and k=4 already 2 m−k =2 4 =16 small DFTs are executed by F B .The structure of the remaining butterfly matrices B i is depicted in Fig. 1 on the left hand side (non-zero elements are depicted as lines).The product of the butterfly matrices for increasing i is shown on the right hand side.The matrices are all of dimension 2 8 × 2 8 , the variable nz in the figure shows the number of non-zero elements.Focusing on the butterfly product matrix in the bottom right, we can find that only nz=4096 of all 2 8 • 2 8 =65536 elements are nonzero.This means, every 16-th diagonal of the butterfly product matrix is non-zero, since the other multiplications have already been performed by the block diagonal matrix F B .In general, every 2 k -th diagonal is non-zero, which means, that the interspace will be enlarged with increasing k.

Implementation
In this section we will compare different separations of the butterfly matrix in terms of complexity and implementation issues.

Complexity
The application of the DFT matrix requires (N − 1)•(N − 1) multiplications, i.e. the product of all rows and columns, except the first row/column, which is all ones:

Construction of Butterfly Matrices
Increasing the order of recursion m − k will also increa the number of butterfly matrices B i .While the independe DFT submatrices in F B can be processed independently, t butterfly matrices can be treated in different ways: •As sequence of matrix multiplications •As one matrix multiplication with the butterfly product m trix The FFT is based on butterfly operations, which require N/2=2 m−1 multiplications each (see Sect. 3.2).Since we have m butterfly matrices, we obtain: In the following we will examine the complexity of four different separations, which we expect to lie between M DFT and M FFT .butterfly matrices for increasing i is shown on the right hand side.The matrices are all of dimension 2 8 × 2 8 , the variable nz in the Figure shows the number of non-zero elements.Focusing on the butterfly product matrix in the bottom right, we can find that only nz = 4096 of all 2 8 • 2 8 = 65536 elements are non-zero.This means, every 16-th diagonal of the butterfly product matrix is non-zero, since the other multiplications have already been performed by the block diagonal matrix F B .In general, every 2 k -th diagonal is non-zero, which means, that the inter-space will be enlarged with increasing k.

Implementation
In this section we will compare different separations of the butterfly matrix in terms of complexity and implementation issues.

Complexity
The application of the DFT matrix requires (N − 1) • (N − 1) multiplications, i.e. the product of all rows and columns, except the first row/column, which is all ones: The FFT is based on butterfly operations, which require This leads to Separation B. We use B•F B , where the submatrices of F B are still executed as independent DFTs, and B is the butterfly product matrix.Since only every 2 k -th diagonal is occupied, the number of required multiplications for B is . The number of multiplications for F B is identical to separation A , such that Separation C. We again use B•F B , but now the submatrices of F B are executed as independent FFTs.There are k•2 k−1 multiplications required for each FFT, and totally there are 2 m−k FFT blocks, so the number of multiplications is given as Separation D. A special case of separation C is separation D, which describes a partly parallel implementation of the original FFT for the case k=m/2.The factorization is given as with W a twiddle factor diagonal matrix and P l and P r permutation matrices that sort the rows and columns of a matrix, such that P l •F B •W•P r =B is the butterfly product matrix.To Separation B. We use B • F B , where the submatrices of F B are still executed as independent DFTs, and B is the butterfly product matrix.Since only every 2 k -th diagonal is occupied, the number of required multiplications for B is . The number of multiplications for F B is identical to separation A , such that Separation C. We again use B • F B , but now the submatrices of F B are executed as independent FFTs.There are k • 2 k−1 multiplications required for each FFT, and totally there are 2 m−k FFT blocks, so the number of multiplications is given as Separation D. A special case of separation C is separation D, which describes a partly parallel implementation of the original FFT for the case k = m/2.The factorization is given as with W a twiddle factor diagonal matrix and P l and P r permutation matrices that sort the rows and columns of a matrix, such that P l •F B •W•P r = B is the butterfly product matrix.
To clarify this, an example is given in Figure 2   .This leads to a matrix based notation of the Jeon-Reeves FFT algorithms (Jeon (1986)).As the blocks (B 1 , • • •, B 4 ) are not identical, constant twiddle factors are extracted from the matrices, which transfers the block submatrices to DFT matrices, like shown in c).The extracted twiddle factors are collected in W. The computational complexity of this separation is determined by the number of FFTs (2•2 k ) and the additional twiddle factor multiplications (N=2 m ): The number of required multiplications for separations A-D are given in Fig. 3 for m=8 and varying k as multiples of M FFT .The upper bound is defined by the DFT complexity, the lower bound shows the FFT complexity.Note that separation D is only defined for k=m/2=4.

Implementation issues
Here we will discuss possible implementations of the presented separations on different hardware architectures.The discussion is of a qualitative character, i.e. implementation details like memory access, interconnection complexity or inter processor communication time are not considered here.Separation A. Matrix-vector processors as special kind of SIMD (Single Instruction Multiple Data) architectures are discussed as possible platforms for SDR (Schoenes ( 2003)).
As an example we will show an implementation of separation A using a general systolic array structure for matrix multiplications, like depicted in Fig. 4. With m=5 and k=2 we have to perform 8 DFTs of length 4 in F B .To obtain the matrix-matrix structure we separate the input vector x p Adv. Radio Sci., 6, 89-94, 2008 www.adv-radio-sci.net/6/89/2008/into 8 parts (x 1 , x 2 , • • •, x 8 ) of length 4, which means that all F 4 •x j with j =1, ..., 8 can be calculated independently.The multiple matrix-vector products can then be described as matrix-matrix product The remaining butterfly operations are performed according to Eq. ( 6) and Eq. ( 7).The even numbered columns of the product F 4 •X p are multiplied component-wise with d 4 and then added/subtracted to the corresponding odd numbered columns.For example, the second column is multiplied by d 4 and then added/subtracted to the first column, resulting in the intermediate vectors y 11 and y 21 that are stored in the registers of the first and second column of the processor array.In the next step these intermediate results are multiplied by d 8 and added/subtracted to the other intermediate results.
After m − k steps the registers contain the Fourier transform of x.
The computational complexity for this approach is only slightly increased for small values of k (compare Fig. 3), but a high speed partly parallel Fourier transformation can be realized.Separation B. For generic matrix-vector architectures the DFT can be described as simple matrix-vector product using separation B. In this case the number of required multiplications is increased considerably compared to the FFT, but nevertheless, it is interesting to see the significant decrease of the number of multiplications (e.g. for k=3, 4, 5 in Fig. 3) compared to the DFT, just by separating the DFT matrix into a product of the two matrices B and F B .
Separation C. To match the real time requirements in SDR systems, general purpose processors are often combined with hardware accelerators, like e.g. for the Fourier transform.As the flexibility of these accelerators regarding to the DFT size might be limited, separation C can be used to map DFTs of any size onto these engines.For example in a systems with FFT accelerators of size 64 and a required FFT length of 256 (this corresponds to m=8 and k=6), the block DFTs in F B can be realized on the accelerators while the remaining 2 butterfly operations are executed as a simple matrix vector multiplication on a generic host processor.This leads to a major flexibility regarding the DFT size of the used accelerators in context of the required DFT size of the specific application.Furthermore existing architectures with fixed length FFT accelerators can be easily reused.Depending on the difference of m and k, the mathematical complexity is increased (compare Fig. 3).For the example the number of required multiplications is increased by a factor 1.5, but the 4 FFTs of size 64 can be executed in parallel, if more than one accelerator is available in the system.
Separation D. The FFT proposed by Jeon and Reeves is especially suitable for multi processor architectures (Jeon (1986)).However the extraction of the twiddle factors in separation D also enables the use of homogeneous parallel FFT cores in an SDR system, as the resulting block submatrices are all identical.This means an N-point DFT can be separated into 2 Separation B. We use B • F B , where the submatrices of F B are still executed as independent DFTs, and B is the butterfly product matrix.Since only every 2 k -th diagonal is occupied, the number of required multiplications for B is (2 m − 1) • ( 2 m 2 k − 1).The number of multiplications for F B is identical to separation A , such that Separation C. We again use B • F B , but now the submatrices of F B are executed as independent FFTs.There are k • 2 k−1 multiplications required for each FFT, and totally there are 2 m−k FFT blocks, so the number of multiplications is given as Separation D. A special case of separation C is separation D, which describes a partly parallel implementation of the original FFT for the case k = m/2.The factorization is given as with W a twiddle factor diagonal matrix and P l and P r permutation matrices that sort the rows and columns of a matrix, such that P l •F B •W•P r = B is the butterfly product matrix.
To clarify this, an example is given in Figure 2 for m = 4 and k = 2.
Example 3: In a) we can see the graphical representation of example 2, with the butterfly product matrix B on the left and the block FFT matrix F B on the right hand side.By sorting rows and columns of B we can transform the sparse diagonal matrix B to a block diagonal matrix, as shown in b).This leads to a matrix based notation of the Jeon-Reeves FFT algorithms [Jeon (1986)].As the blocks (B ′ 1 , • • • , B ′ 4 ) are not identical, constant twiddle factors are extracted from the matrices, which transfers the block submatrices to DFT matrices, like shown in c).The extracted twiddle factors are collected in W. The computational complexity of this separation is determined by the number of FFTs (2 • 2 k ) and the additional twiddle factor multiplications (N = 2 m ): The number of required multiplications for separations A-D are gi-ven in Figure 3 for m = 8 and varying k as multiples of M F F T .The upper bound is defined by the DFT complexity, the lower bound shows the FFT complexity.Note that separation D is only defined for k = m/2 = 4.

Implementation Issues
Here we will discuss possible implementations of the presented separations on different hardware architectures.The discussion is of a qualitative character, i.e. implementation details like memory access, interconnection complexity or inter processor communication time are not considered here.Separation A. Matrix-vector processors as special kind of SIMD (Single Instruction Multiple Data) architectures are computed using the specialized processor cores, the results are then permuted and multiplied by W. The remaining √ N blocks are again computed using the FFT cores.

Conclusions
The presented separations of the DFT matrix provide different algorithmic structures, which allow a flexibility in terms of the used hardware architectures for the implementation of the DFT.Depending on the architecture, the most suitable separation can be chosen, e.g.separation A for matrix processors or separation C for architectures using FFT hardware accelerators.Depending on the chosen parameters the computational complexity in terms of multiplications is only slightly increased, but some architectures may benefit from the modified algorithmic structure (e.g.addressing, bus communication).The behaviour of the presented separations on specific existing hardware architectures has yet to be examined in future work.
with x o ∈ C N 2 the odd and x e ∈ C N 2 the even indexed part of x N .Introducing y 1 , y 2 ∈ C N 2 we can formulate the DFT as follows:

YuhengFig. 1 .
Fig. 1.The single butterfly matrices B i on the left hand side; th products on the right hand side.

Fig. 1 .
Fig. 1.The single butterfly matrices B i on the left hand side; their products on the right hand side.
where the submatrices of F B are executed as independent DFTs and the butterfly matrices are separately executed one by one.The number of the DFT blocks in F B is 2 m−k and each DFT needs 2 k •2 k −2•2 k +1 multiplications.The number of butterfly matrices is m−k and each butterfly needs 2 m−1 multiplications.www.adv-radio-sci.net/6/89/2008/Adv.Radio Sci., 6
for m = 4 and k = 2. Example 3: In a) we can see the graphical representation of