# FPGA Implementation of a Parallel DDS for Wide-Band Applications

Giorgio De Magistris<sup>a</sup>, Corrado Rametta<sup>a</sup>, Giacomo Capizzi<sup>b</sup> and Christian Napoli<sup>a</sup>

<sup>a</sup>Department of Computer, Control and Management Engineering, Sapienza University of Rome, Via Ariosto 25, 00135, Rome, Italy

<sup>b</sup>Department of Electrical, Electronic and Computer Engineering, University of Catania, Viale A. Doria 6, 95125, Catania, Italy

#### Abstract

In this paper, the authors propose a parallel Direct Digital Synthesis (DDS) architecture suitable for digital ultra-wideband systems. The proposed architecture can generate sine and/or cosine waves with a user-defined level of parallelism without increasing the clock frequency. It has been designed in SIMULINK, coded in VHDL at RTL level, and finally implemented on a Xilinx FPGA. Synthesis and place & route have been performed using the XILINX VIVADO toolchain. Results are provided in terms of hardware resources, speed, and power consumption for a level of parallelism equal to 4. Implementation results show low area complexity and low power consumption which, coupled with the flexibility in the parallelism level, make this DDS useful in ultra-wideband low-power systems. In particular, the DDS is characterized by an energy per operation of about 91 pJ. The reduced hardware complexity allows its implementation on low-cost FPGAs.

#### Keywords

Parallel DSP, TI-ADC, FPGA, ASIC

## 1. Introduction

In the last few years, digital communications have found applications in an increasing number of fields [1], [2], [3]. This has driven the improvement of device performance in different environments, which should be supported by all kinds of vehicles [4]. In this scenario, the introduction of Time-Interleaved ADCs (TI-ADCs) has made it possible to process wide-band signals with Application Specific Integrated Circuits (ASICs) and Field Programmable Gate Arrays (FPGAs) even at reduced clock frequencies. This has been possible thanks to the parallelization of DSP algorithms, passing from a time-based approach to a frame-based one [5], [6].

Splitting data across parallel datapaths makes it possible to process data at a reduced clock frequency while leaving the total sample rate unaltered, since the total is the sum of the sample rates of the individual sub-datapaths.

Although this new paradigm offers advantages in terms of processing rate and consequently bandwidth, it requires the re-engineering of processing architectures, which must be repeated in order to process parallel frames of data [7], [8], [9], [10]. Reducing the clock frequency has numerous advantages, and research is directed towards achieving this goal from various perspectives [11], [12], [13]. The literature proposes several works regarding the parallelization of convolution-based systems such as FIR filters and Farrow filters. However, in order to use these convolution-based systems in communication systems, it is necessary to generate parallel sinusoidal and/or co-sinusoidal signals to modulate data.

SYSTEM 2021 @ Scholar's Yearly Symposium of Technology, Engineering and Mathematics. July 27–29, 2021, Catania, IT. Email: demagistris@diag.uniroma1.it (G. D. Magistris); rametta@diag.uniroma1.it (C. Rametta); gcapizzi@diees.unict.it (G. Capizzi); cnapoli@diag.uniroma1.it (C. Napoli). ORCID: 0000-0002-3076-4509 (G. D. Magistris); 0000-0003-2555-9866 (G. Capizzi); 0000-0002-3336-5853 (C. Napoli). © 2021 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org)

In digital communication systems, sine/cosine signals are usually generated by a digital circuit known as Direct Digital Synthesis (DDS) or Direct Digital Frequency Synthesis (DDFS) [14].

In this paper, we propose a parallel DDS suitable for frame-based communication systems that can be implemented on both ASICs and FPGAs. The proposed DDS has been coded in VHDL and implemented on a XILINX FPGA. It has been characterized in terms of speed, area, and power consumption. The paper is structured as follows: in Sect. 2 the architecture of the proposed DDS is discussed after a brief general discussion of DDS. In Sect. 3 the FPGA implementation of the proposed architecture is discussed. In Sect. 4 the experimental results in terms of hardware complexity, speed and power are shown, and finally in Sect. 5 conclusions are provided.

## 2. The Proposed Parallel DDS

A DDS is usually implemented with a circuit composed of two main blocks:

- A Phase Generator, realized with an N-bit accumulator.
- A ROM (Read Only Memory) based Look-Up Table (LUT), for the phase-to-amplitude conversion. This ROM contains the sine wave samples.

The block diagram of the DDS architecture is shown in Fig. 1.



**Figure 1:** A traditional DDS architecture composed of a phase generator and a Look-Up Table (LUT) for phase-to-amplitude conversion. The frequency of the generated waveform can be selected by the user by changing the value of the tuning word K.

The sine/cosine output frequency is a function of the tuning word K, the clock frequency, and the number of samples used to represent one sine/cosine period inside the ROM. The latter two parameters are fixed at design time. The user can change the output frequency by acting on the tuning word K, which is coded as an unsigned integer. The output frequency is given by the following equation:

$$F = K \cdot \frac{f_{clk}}{2^N} \tag{1}$$

where K is the tuning word, $f_{clk}$ is the system clock frequency, and $2^N$ is the number of samples used to represent one sine/cosine period inside the ROM. More details about the DDS architecture are given in [14]. In addition to the DDS introduced in [14], the literature offers different solutions for sine wave generation that reduce the hardware complexity and, in particular, the memory requirements [15].
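The behaviour described by Eq. (1) can be illustrated with a short Python model. This is only a sketch under stated assumptions: the ROM is addressed directly by the full N-bit accumulator value (no phase truncation), the samples are signed integers scaled to $2^{b-1}-1$ for a b-bit word, and the function names `make_sine_rom`, `dds` and `output_frequency` are hypothetical, not taken from the original design.

```python
import math


def make_sine_rom(addr_bits: int, data_bits: int) -> list[int]:
    """One sine period with 2**addr_bits samples (signed scaling is an assumption)."""
    depth = 1 << addr_bits
    full_scale = (1 << (data_bits - 1)) - 1
    return [round(full_scale * math.sin(2 * math.pi * n / depth))
            for n in range(depth)]


def dds(k: int, n_samples: int, rom: list[int]) -> list[int]:
    """Conventional DDS: add the tuning word k every clock, read the ROM."""
    depth = len(rom)
    phase = 0
    out = []
    for _ in range(n_samples):
        out.append(rom[phase])          # phase-to-amplitude conversion
        phase = (phase + k) % depth     # N-bit accumulator wraps around
    return out


def output_frequency(k: int, f_clk: float, addr_bits: int) -> float:
    """Eq. (1): F = K * f_clk / 2**N."""
    return k * f_clk / (1 << addr_bits)


rom = make_sine_rom(addr_bits=10, data_bits=10)
samples = dds(k=16, n_samples=128, rom=rom)
print(output_frequency(k=16, f_clk=385e6, addr_bits=10))
```

With K = 16 and $2^{10}$ ROM samples, one output period spans 64 clock cycles, so the generated sequence repeats every 64 samples.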

Fig. 2 shows the block diagram of a parallel DDS with a level of parallelism equal to N, able to generate a frame-based sine/cosine wave (here N denotes the number of parallel channels, not the accumulator width used in Eq. (1)).

The main difference with respect to a traditional DDS such as the one shown in Fig. 1 is the presence of N LUTs and a Frame Address Generator (FAG).

The N LUTs are required for the simultaneous generation of N sine/cosine waves because, although all the LUTs contain the same samples, it is not possible to read N data in parallel from a single LUT.

The FAG computes the N addresses necessary to read the N samples in parallel.



**Figure 2:** A multichannel parallel DDS architecture. The system has one LUT for every channel. The LUTs are addressed by a Frame Address Generator (FAG).

If we consider N = 4, which is a reasonable value given that modern TI-ADCs usually have 2 or 4 channels, the DDS architecture becomes the one shown in Fig. 3.

The addresses for the 4 LUTs are generated by the FAG as follows. The output of the DDS phase generator is combined with the tuning word k (the value the user changes to set the wave frequency) to produce four values $A_i$, one for each LUT:

$$\begin{aligned}
A_{0} &= Ph \cdot k \\
A_{1} &= Ph \cdot k + k \\
A_{2} &= Ph \cdot k + 2k \\
A_{3} &= Ph \cdot k + 3k
\end{aligned} \tag{2}$$

where Ph is the output of the phase generator.
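The address generation of Eq. (2) can be checked against a conventional DDS with a small Python model. The sketch below rests on assumptions that the text does not state explicitly: the phase generator output Ph is taken to advance by 4 (the parallelism level) at every clock, the addresses wrap modulo the ROM depth, and the ROM holds $2^{10}$ signed samples. All names (`frame_addresses`, `parallel_dds`, `serial_dds`) are hypothetical.

```python
import math

P = 4                    # parallelism level of the example
ADDR_BITS = 10           # assumed ROM depth of 2**10 samples
DEPTH = 1 << ADDR_BITS

# Every LUT holds the same sine period (signed scaling is an assumption).
ROM = [round(511 * math.sin(2 * math.pi * n / DEPTH)) for n in range(DEPTH)]


def frame_addresses(ph: int, k: int) -> list[int]:
    """Eq. (2): A_i = Ph*k + i*k, wrapped to the ROM depth (assumed)."""
    return [(ph * k + i * k) % DEPTH for i in range(P)]


def parallel_dds(k: int, n_frames: int) -> list[list[int]]:
    """Produce n_frames frames of P samples, one frame per clock cycle."""
    frames = []
    ph = 0
    for _ in range(n_frames):
        frames.append([ROM[a] for a in frame_addresses(ph, k)])
        ph += P          # assumption: Ph advances by P per clock
    return frames


def serial_dds(k: int, n_samples: int) -> list[int]:
    """Reference: conventional DDS producing one sample per clock."""
    return [ROM[(n * k) % DEPTH] for n in range(n_samples)]


frames = parallel_dds(k=37, n_frames=64)
flat = [sample for frame in frames for sample in frame]
# The interleaved frames reproduce the serial sequence sample by sample.
assert flat == serial_dds(k=37, n_samples=256)
```

Under these assumptions the four outputs of one clock cycle are four consecutive samples of the waveform a serial DDS would produce, which is why the parallel version can deliver four times the sample rate at the same clock frequency.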

## 3. FPGA Implementation

The architecture described in the previous section has been implemented on a XILINX FPGA. Thanks to their high computation capabilities and their flexibility, FPGAs are used in a growing number of application fields such as communications, video processing, and machine learning [16, 17, 18, 19, 20, 21].

FPGAs are widely used in systems that require high computing power. In this context, they can also be a powerful and valuable ally for blockchain technology. For instance, [22] aimed to "certify the data" without the need for a centralized organization using blockchain. Indeed, for a blockchain system to be feasible in terms of scalability, interoperability and sustainability, complex cryptographic operations that require speed and power must be performed on a dedicated accelerator system (FPGA).

**Figure 3:** The proposed 4-channel parallel DDS. The dotted red box shows the internal architecture of the Frame Address Generator (FAG). The FAG is realized using adders and multipliers. The 4 LUTs are required for storing the sine/cosine samples.

The proposed system has been simulated and validated through MATLAB/SIMULINK simulations. The generated waveforms have been compared with those generated by a traditional DDS in order to verify the correct operation of the system. Simulations show that the proposed 4-channel parallel DDS does not introduce any decrease in the Spurious-Free Dynamic Range (SFDR). The datapath has been kept at 10 bits without any truncation.

After the MATLAB/SIMULINK simulations, the proposed DDS has been coded in VHDL at behavioral level. The LUTs have been implemented with XILINX IP cores mapped onto Block RAMs. To this end, a sine wave was first sampled in MATLAB with $2^{10}$ samples using a 10-bit fixed-point representation.
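The preparation of the LUT contents can be sketched in Python as follows. The original work used MATLAB, so this is an equivalent illustration rather than the authors' script. The signed two's-complement encoding, the scaling to $2^{9}-1$, and the choice of the COE text layout (commonly used to initialize Xilinx block memories) are assumptions; the function names are hypothetical.

```python
import math

ADDR_BITS = 10   # 2**10 samples per sine period, as stated in the text
DATA_BITS = 10   # 10-bit fixed-point samples, as stated in the text


def sine_table(addr_bits: int = ADDR_BITS, data_bits: int = DATA_BITS) -> list[int]:
    """Quantize one sine period to signed integers (scaling is an assumption)."""
    depth = 1 << addr_bits
    full_scale = (1 << (data_bits - 1)) - 1
    return [round(full_scale * math.sin(2 * math.pi * n / depth))
            for n in range(depth)]


def to_twos_complement(value: int, bits: int = DATA_BITS) -> int:
    """Map a signed integer to its unsigned two's-complement bit pattern."""
    return value & ((1 << bits) - 1)


def to_coe(table: list[int], bits: int = DATA_BITS) -> str:
    """Format the table as memory-initialization text in hexadecimal."""
    digits = (bits + 3) // 4
    words = [format(to_twos_complement(v, bits), f"0{digits}X") for v in table]
    return ("memory_initialization_radix=16;\n"
            "memory_initialization_vector=\n"
            + ",\n".join(words) + ";\n")


table = sine_table()
coe_text = to_coe(table)
print(coe_text[:80])
```

Since all LUTs contain the same samples, a single table of this kind would be enough to initialize every memory of the parallel DDS.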

Test benches have been run with the XILINX ISIM simulator. Synthesis and place & route have been performed using the XILINX VIVADO toolchain. Power estimation has been performed through post-implementation simulation, taking into account the real switching activity of the circuit. This was possible using Switching Activity Interchange Format (SAIF) files in the power estimation phase. These files contain the switching activity of every node of the synthesized circuit and allow an accurate estimation of the dynamic power consumption. In fact, as known from theory, the dynamic power consumption can only be estimated accurately if the switching activity is known [23].

## 4. Experimental Results

After synthesis and place & route, the proposed DDS has been characterized in terms of hardware complexity, speed and power consumption. Nowadays power consumption is a crucial aspect, especially for embedded systems, which are not always connected to a power supply and are increasingly powered by batteries or energy harvesting systems. In these cases, the power consumption (and consequently the energy consumption) must be reduced as much as possible in order to prolong battery life.

The FPGA resources used for the implementation of the proposed DDS are shown in Tab. 1.

As can be seen, the FPGA resources required by the proposed DDS are very limited. This leaves enough resources on the FPGA to implement other systems as well. This aspect is very important because modern FPGA-based communication systems increasingly require complex elements such as machine learning systems or DSP systems [24, 25].

**Table 1:** FPGA resource utilization

| Resource | Utilization | Available | % Utilization |
|----------|-------------|-----------|---------------|
| LUT      | 41          | 41,000    | 0.1           |
| FF       | 72          | 82,000    | 0.1           |
| BRAM     | 3           | 135       | 1.5           |
| IO       | 51          | 300       | 16.2          |
| BUFG     | 1           | 32        | 3.2           |

In order to evaluate the maximum speed of the proposed architecture, we performed several synthesis and place & route runs using different timing constraints. The maximum clock frequency achieved by our architecture is 385 MHz.

Such a frequency allows reaching a total sample rate of $4 \times 385\ \text{MHz} = 1.54$ GHz. According to the sampling theorem, this rate allows the generation of sine waveforms up to 770 MHz. This limit can be raised by increasing the parallelism level of the DDS.
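The relation between clock frequency, parallelism level and maximum output frequency can be written down in a few lines of Python. This is a back-of-the-envelope sketch: it assumes that the 385 MHz clock could be held at other parallelism levels, which the reported results only confirm for a parallelism of 4, and the function names are hypothetical.

```python
def total_sample_rate(f_clk_hz: float, parallelism: int) -> float:
    """Total sample rate: one sample per channel per clock cycle."""
    return parallelism * f_clk_hz


def max_output_frequency(f_clk_hz: float, parallelism: int) -> float:
    """Nyquist limit: half of the total sample rate."""
    return total_sample_rate(f_clk_hz, parallelism) / 2


F_CLK = 385e6  # maximum clock frequency reported for the implementation

for p in (1, 2, 4, 8):
    fs = total_sample_rate(F_CLK, p)
    fmax = max_output_frequency(F_CLK, p)
    print(f"parallelism {p}: {fs / 1e9:.2f} GS/s, sine up to {fmax / 1e6:.0f} MHz")
```

For a parallelism of 4 the two functions return the 1.54 GHz sample rate and the 770 MHz limit quoted above.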

The average dynamic power consumption measured at 385 MHz is 140 mW. The energy per operation is defined by Eq. (3):

$$E_{op} = \int_{0}^{t_{op}} P(t)\,dt \tag{3}$$

where $t_{op}$ is the time required to execute the operation.

In our case, a single sine/cosine sample is computed in $t_{clk}/4$, where $t_{clk}$ is the clock period, equal to 2.6 ns; consequently $t_{op} = t_{clk}/4 = 0.65$ ns.

For discrete-time systems, Eq. (3) can be rewritten as:

$$E_{op} = \sum_{i=1}^{n_{clk}} P_{ave} \cdot t_{clk} \tag{4}$$

where $P_{ave}$ is the average power consumption, which in our case is 140 mW.

According to Eq. (4), the dynamic energy per operation (the computation of one sine sample) is equal to 91 pJ.
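The figure can be verified with a short worked calculation. Assuming that the power is constant and equal to $P_{ave}$ over the whole operation interval, Eq. (3) reduces to a simple product:

$$E_{op} = \int_{0}^{t_{op}} P_{ave}\,dt = P_{ave} \cdot t_{op} = P_{ave} \cdot \frac{t_{clk}}{4} = 140\ \text{mW} \times 0.65\ \text{ns} = 91\ \text{pJ}$$

Using the unrounded clock period $1/385\ \text{MHz} \approx 2.597$ ns instead of 2.6 ns changes the result only marginally, to about 90.9 pJ.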

## 5. Conclusions

In this paper, a parallel DDS suitable for both ASICs and FPGAs has been presented. The proposed DDS can be used in all applications that need to work with wide-band signals. The parallelization makes it possible to manage wide-band signals without increasing the clock frequency. In order to evaluate the proposed architecture, we first simulated it in MATLAB/SIMULINK and then, after a fixed-point analysis, coded it in VHDL and implemented it on a XILINX FPGA.

Experimental results show very low hardware complexity and very low power consumption, an important aspect in embedded systems that lack a connection to the power grid.

## 6. Acknowledgments

This work has been supported by the project "Green-TAGS" funded by the Italian Ministry of University and Research within the PRIN funding scheme 2017 (CUP B88D19003660001).

#### References

- [1] B. Sklar, F. J. Harris, Digital communications: fundamentals and applications, volume 2, Prentice-hall Englewood Cliffs, NJ, 1988.
- [2] G. M. Bianco, R. Giuliano, G. Marrocco, F. Mazzenga, A. Mejia-Aguilar, Lora system for search and rescue: Path-loss models and procedures in mountain scenarios, IEEE Internet of Things Journal 8 (2020) 1985–1999.
- [3] F. Mazzenga, A. Simonetta, R. Giuliano, M. Vari, Applications of smart tagged rfid tapes for localization services in historical and cultural heritage environments, in: 2010 19th IEEE International Workshops on Enabling Technologies: Infrastructures for Collaborative Enterprises, IEEE, 2010, pp. 186–191.
- [4] F. Mazzenga, R. Giuliano, A. Neri, F. Rispoli, Integrated public mobile radio networks/satellite for future railway communications, IEEE Wireless Communications 24 (2016) 90–97.

- [5] S. Lin, S. K. Mitra, Overlapped block digital filtering, IEEE Transactions on Circuits and Systems II: Analog and Digital Signal Processing 43 (1996) 586–596.
- [6] B. A. Nowak, R. K. Nowicki, M. Woźniak, C. Napoli, Multi-class nearest neighbour classifier for incomplete data handling, in: International Conference on Artificial Intelligence and Soft Computing, Springer, 2015, pp. 469–480.
- [7] Z.-J. Mou, P. Duhamel, Fast fir filtering: algorithms and implementations, Signal Processing 13 (1987) 377–384.
- [8] Y.-C. Tsao, K. Choi, Area-efficient parallel fir digital filter structures for symmetric convolutions based on fast fir algorithm, IEEE transactions on very large scale integration (vlsi) systems 20 (2010) 366–371.
- [9] G. Capizzi, C. Napoli, S. Russo, M. Woźniak, Lessening stress and anxiety-related behaviors by means of ai-driven drones for aromatherapy, volume 2594, 2020, pp. 7–12.
- [10] G. C. Cardarilli, L. Di Nunzio, R. Fazzolari, D. Giardino, M. Matta, M. Re, S. Spanò, L. Simone, Efficient fpga implementation of high speed digital delay for wideband beamforming using parallel architectures, Bulletin of Electrical Engineering and Informatics 8 (2019) 422–427.
- [11] A. Simonetta, M. C. Paoletti, Designing digital circuits in multi-valued logic, International Journal on Advanced Science, Engineering and Information Technology 8 (2018) 1166–1172.
- [12] R. Avanzato, F. Beritelli, M. Russo, S. Russo, M. Vaccaro, Yolov3-based mask and face recognition algorithm for individual protection applications, in: CEUR Workshop Proceedings, 2020, pp. 41–45.
- [13] A. Simonetta, M. C. Paoletti, M. Muratore, A new approach for designing of computer architectures using multi-value logic, International Journal on Advanced Science, Engineering and Information Technology 11 (2021) 1440–1446. URL: http://ijaseit.insightsociety.org/index.php?option=com\_content&view=article&id=9&Itemid=1&article\_id=15778. doi:10.18517/ijaseit.11.4.15778.
- [14] J. Tierney, C. Rader, B. Gold, A digital frequency synthesizer, IEEE Transactions on Audio and Electroacoustics 19 (1971) 48–57. doi:10.1109/TAU.1971.1162151.
- [15] P. Symons, Ddfs phase mapping technique, Electronics Letters 38 (2002) 1291–1292.
- [16] G. Cardarilli, L. Nunzio, R. Fazzolari, M. Panella, M. Re, A. Rosato, S. Spano, A parallel hardware implementation for 2-d hierarchical clustering based on fuzzy logic, IEEE Transactions on Circuits and Systems II: Express Briefs 68 (2021) 1428–1432. doi:10.1109/TCSII.2020.3032660.

- [17] F. Bonanno, G. Capizzi, S. Coco, C. Napoli, A. Laudani, G. L. Sciuto, Optimal thicknesses determination in a multilayer structure to improve the spp efficiency for photovoltaic devices by an hybrid fem—cascade neural network based approach, in: 2014 International Symposium on Power Electronics, Electrical Drives, Automation and Motion, IEEE, 2014, pp. 355–362.
- [18] A. Jaber, K. Ali, Artificial neural network based fault diagnosis of a pulley-belt rotating system, International Journal on Advanced Science, Engineering and Information Technology 9 (2019) 544–551. doi:10.18517/ijaseit.9.2.7426.
- [19] F. Bonanno, G. Capizzi, A. Gagliano, C. Napoli, Optimal management of various renewable energy sources by a new forecasting method, in: International Symposium on Power Electronics, Electrical Drives, Automation and Motion, IEEE, 2012, pp. 934–940.
- [20] G. Capizzi, G. Lo Sciuto, C. Napoli, E. Tramontana, M. Woźniak, A novel neural networks-based texture image processing algorithm for orange defects classification., International Journal of Computer Science & Applications 13 (2016).
- [21] G. Capizzi, G. L. Sciuto, M. Woźniak, R. Damaševicius, A clustering based system for automated oil spill detection by satellite remote sensing, in: International Conference on Artificial Intelligence and Soft Computing, Springer, 2016, pp. 613–623.
- [22] F. Fallucchi, M. Gerardi, M. Petito, E. W. D. Luca, Blockchain framework in digital government for the certification of authenticity, timestamping and data property, in: Proceedings of the 54th Hawaii International Conference on System Sciences | 2021, University of Hawai'i at Manoa, Honolulu, HI, 2021, pp. 2307–2316. doi:10.24251/HICSS.2021.282, http://hdl.handle.net/10125/70895.
- [23] N. H. Weste, D. Harris, CMOS VLSI design: a circuits and systems perspective, Pearson Education India, 2015.
- [24] A. Jaber, R. Bicker, Fault diagnosis of industrial robot bearings based on discrete wavelet transform and artificial neural network, International Journal of Prognostics and Health Management 7 (2016).
- [25] A. Jaber, A. Saleh, H. Ali, Prediction of hourly cooling energy consumption of educational buildings using artificial neural network, International Journal on Advanced Science, Engineering and Information Technology 9 (2019) 159–166. doi:10.18517/ijaseit.9.1.7351.