The asynchronous rapid single-flux quantum electronics – a promising alternative for the development of high-performance digital circuits

In this paper, we investigate the application of the asynchronous logic approach to the realization of ultra-high-speed digital electronics of high complexity. We evaluate the possible physical, technological, and schematic origins of the restrictions limiting such an application, and propose solutions for overcoming them. Although our considerations are based on the rapid single-flux quantum technique, the conclusions derived can be generalized to any type of digital information coding.


Introduction
Since the invention of the first transistor in 1947, classical semiconductor electronics has followed an exponential progress in terms of speed and complexity (Moore, 1965). The main reason for this is the rapid expansion of information technologies and data exchange during the past few decades, which imposes exponentially growing requirements on the performance of digital electronics. Up to now, these requirements have been met by a corresponding increase of the clock frequency of the digital chips, which results in a rapid growth of the dissipated power density and in crucial problems with thermal noise (Kish, 2002). Overcoming these problems requires the newest technological solutions and huge manpower, and has already reached its physical and economic limits. Therefore, the newest generations of microprocessors do not rely on extremely high clock speeds but rather involve techniques like reduced instruction sets, multiple cores operating in parallel, longer pipelines, etc., which allow a more efficient usage of both the CPU's clock cycle and its power. In parallel, alternative techniques and design concepts that can provide more efficient digital signal processing in terms of speed and price are being searched for intensively.
Correspondence to: B. Dimov (boyko.dimov@imms.de)
One very promising technique for the development of high-performance digital electronics is the asynchronous logic approach (Brzozowski and Seger, 1995). Here, the circuit's components react to changes on their inputs as these changes arrive, and produce changes on their outputs immediately after the end of the particular computation. Thus, the operation speed is determined by the local latencies, not by the global worst-case latency as in the synchronous case. No clock signal is provided to synchronize the circuit operation, and the data exchange is coordinated locally by handshaking feedbacks. In this way, the global clocking problems typical of synchronous electronics are solved. Other important and quite attractive advantages of asynchronous digital circuits are:
- low power consumption: switchings occur only in the circuit parts involved in the current computation, instead of at all nodes at each clock interval;
- reduced emission of electromagnetic noise: generally, the switchings occur at random places and at arbitrary times;
- reduced sensitivity to environment variations (supply voltage, temperature, technology process, etc.): when designing synchronous circuits, one must assume that the worst possible combination of factors is present and clock the circuit accordingly. This is not the case for asynchronous circuits: if properly designed, they run as fast as the current physical properties allow.
Nevertheless, the popularity of the asynchronous logic approach is quite restricted. The main reason is that, due to the complicated timing relations originating from the lack of global clocking, the timing analysis of complex asynchronous digital circuits cannot be performed separately for each building block, but has to be done globally for the whole circuit. This results in a lack of techniques and CAD tools supporting asynchronous logic-level design, as will be discussed in Sect. 2 of the paper. Here, we present our novel concept for logic-level synthesis and optimization of complex asynchronous digital circuits. It is based on the Rapid Single-Flux Quantum (RSFQ) technique for digital signal processing (Likharev and Semenov, 1991), but most of its principles can be generalized to any kind of digital electronics.

Timing considerations within the asynchronous digital circuits
In general, asynchronous digital circuits can be designed to be Delay-Insensitive (DI) (Brzozowski and Seger, 1995), i.e. to operate properly for any delays of their components. For this, a handshaking feedback should be provided between each pair of communicating blocks, as shown in Fig. 1. In the case of large-scale circuits, this leads to extremely complicated topologies and a significant loss of speed. Much simpler and faster circuits can be realized if the handshaking feedbacks are omitted whenever possible (Mladenov et al., 2006). Being no longer DI, such an asynchronous circuit operates correctly only under certain timing assumptions, whose violation leads to signal competitions, hazards, and erroneous behaviour. In order to avoid the latter, the designer should be able to detect all pairs of competitive signal paths within the circuit topology.
A signal competition occurs if there are different paths through which a given signal can reach one and the same component of the circuit. This is shown in Fig. 2. We have elaborated a universal approach for the detection of all such competitive structures. This approach maps the circuit under analysis onto an equivalent directed graph. In this mapping, each gate is represented as a vertex of the graph, and each interconnect as a directed edge whose direction corresponds to that of the signal propagation. In this way, the problem is transformed into finding all pairs of directed paths between each pair of vertices of the graph. This is a conventional problem of graph theory, whose solution has been presented in detail in Mladenov et al. (2007).
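The graph formulation can be illustrated with a minimal sketch. It is not the algorithm of Mladenov et al. (2007); it simply enumerates simple directed paths by depth-first search and reports every vertex pair connected by more than one path. The function names and the four-gate example circuit are hypothetical.

```python
from itertools import combinations

def simple_paths(graph, src, dst, path=None):
    """Yield every simple directed path from src to dst (depth-first search)."""
    path = (path or []) + [src]
    if src == dst:
        yield path
        return
    for nxt in graph.get(src, ()):
        if nxt not in path:  # never revisit a vertex, so loops cannot trap us
            yield from simple_paths(graph, nxt, dst, path)

def competitive_pairs(graph):
    """Map (src, dst) to all pairs of distinct directed paths between them."""
    vertices = set(graph) | {v for targets in graph.values() for v in targets}
    result = {}
    for src in sorted(vertices):
        for dst in sorted(vertices):
            if src == dst:
                continue
            paths = list(simple_paths(graph, src, dst))
            if len(paths) > 1:  # two or more routes: a competitive structure
                result[(src, dst)] = list(combinations(paths, 2))
    return result

# Hypothetical circuit: gates are vertices, interconnects are directed edges.
# The output of "split" reaches "merge" through gate "a" and through gate "b".
circuit = {"split": ["a", "b"], "a": ["merge"], "b": ["merge"], "merge": []}
pairs = competitive_pairs(circuit)
for (src, dst), path_pairs in pairs.items():
    print(src, "->", dst, ":", path_pairs)
```

Exhaustive path enumeration grows quickly with circuit size, so this form is only meant to make the problem statement concrete for small examples.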
Once all pairs of competitive signal paths are detected, the designer should derive the conditions on their time delays whose violation leads to signal conflicts. Next, it should be checked whether a violation of these conditions is possible, and such a possibility should be eliminated by proper handling of the delay times of the components of the competitive signal paths. For this, efficient techniques for modelling, manipulation, and optimization of the delay times of the circuit components should be available. We have developed such techniques for RSFQ electronics, whose basics will be described in Sect. 3.

Basics of the RSFQ digital electronics
The RSFQ technique was invented in the mid-1980s by a team of Russian scientists (Likharev et al., 1985), who suggested an entirely new approach to digital information coding. Its switching element is the overdamped tunnel Josephson junction, consisting of two superconductors $S_1$ and $S_2$ separated by a thin non-superconductive barrier, as shown in Fig. 3.
Let $\theta_1$ and $\theta_2$ be the phases of the complex pair wave functions of the two superconductors, and $\varphi=\theta_1-\theta_2$ be their difference. Let $I_s$ be the lossless supercurrent flowing through the Josephson junction, $I_c$ its maximum value (also called the critical current of the junction), and $U(t)$ the voltage drop over the junction. The electrical properties of the Josephson junction are described by

$$I_s = I_c \sin\varphi \qquad (1)$$

and

$$U(t) = \frac{\Phi_0}{2\pi}\,\frac{d\varphi}{dt}, \qquad (2)$$

with $\Phi_0=2.07$ mV·ps being a fundamental physical constant named the magnetic flux quantum. These relations are known as the dc (Eq. (1)) and the ac (Eq. (2)) Josephson effect (Josephson, 1962). The balance of the currents over the junction is

$$i(t) = I_s + \frac{U(t)}{R} + C_J\,\frac{dU(t)}{dt}, \qquad (3)$$

with $i(t)$ being the total current through the junction, $C_J$ the capacitance of the junction, and $R$ its normal resistance. Using Eqs. (1) and (2), Eq. (3) can be rewritten as

$$i(t) = I_c \sin\varphi + \frac{\Phi_0}{2\pi R}\,\frac{d\varphi}{dt} + \frac{\Phi_0 C_J}{2\pi}\,\frac{d^2\varphi}{dt^2}. \qquad (4)$$

As seen from Eq. (4), the Josephson junction has equivalent equilibrium states if the corresponding values of the superconductive phase drop $\varphi$ differ by $2\pi$. A transient process during which $\varphi$ changes by $2\pi$ is called a switching of the junction. A voltage pulse of picosecond duration is generated during such a switching, and its properties can be derived by integrating Eq. (2):

$$\int U(t)\,dt = \frac{\Phi_0}{2\pi}\int_{\varphi}^{\varphi+2\pi} d\varphi = \Phi_0, \qquad (5)$$

i.e. such a pulse carries exactly one magnetic flux quantum $\Phi_0$. Due to the quantizing condition expressed by Eq. (5), these pulses are named Single Flux Quantum (SFQ) pulses, and they are used to code the binary data within RSFQ electronics (i.e. RSFQ electronics is pulse-based, in contrast to classical semiconductor electronics, where the binary data are represented by voltage levels). The typical shape of an SFQ pulse is shown in Fig. 4.
Adv. Radio Sci., 6, 165-173, 2008, www.adv-radio-sci.net/6/165/2008/
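The quantization of the pulse area can be checked numerically with a short sketch. It assumes the standard normalized form of the junction equation, $\beta_c\,\ddot\varphi + \dot\varphi + \sin\varphi = i/I_c$, with dimensionless time; the bias level, trigger pulse, damping parameter `beta_c`, and step size are illustrative assumptions and are not taken from the paper.

```python
import math

PHI0 = 2.07  # magnetic flux quantum in mV*ps (value quoted in the text)

def simulate_switching(bias=0.75, pulse_amp=0.6, pulse_start=1.0,
                       pulse_len=5.0, beta_c=0.1, t_end=30.0, dt=1e-3):
    """Integrate beta_c*phi'' + phi' + sin(phi) = i(tau) in normalized units.

    Currents are given in units of I_c and tau is dimensionless time. All
    default values are illustrative assumptions. Returns the total phase
    advance of the junction.
    """
    phi = math.asin(bias)  # start in the equilibrium state set by the dc bias
    v = 0.0                # normalized voltage, d(phi)/d(tau)
    phi_start = phi
    for n in range(int(round(t_end / dt))):
        tau = n * dt
        drive = bias
        if pulse_start <= tau < pulse_start + pulse_len:
            drive += pulse_amp  # incoming trigger current
        # semi-implicit Euler step: update the voltage first, then the phase
        v += dt * (drive - math.sin(phi) - v) / beta_c
        phi += dt * v
    return phi - phi_start

delta_phi = simulate_switching()
# Time integral of U(t): by the ac Josephson relation it equals
# PHI0/(2*pi) times the phase advance, independent of the time scale.
flux = PHI0 / (2.0 * math.pi) * delta_phi
print(f"phase advance = {delta_phi / (2 * math.pi):.3f} x 2*pi, "
      f"pulse area = {flux:.3f} mV*ps")
```

With these settings the trigger pushes the junction over a single potential barrier, so the phase settles one period further on and the integrated voltage equals one flux quantum regardless of the detailed pulse shape.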
Currently, all superconductive digital circuits are based on the RSFQ technique. Within the modern roadmaps for electronics (Semiconductor Industry Association, 2005), this technique is considered a promising alternative to semiconductor logic not only for the development of supercomputers, but also for many advanced applications such as space technologies, telecommunications, medical science, quantum computing, etc. Its unique features are (Likharev and Semenov, 1991):
- extremely low power consumption: the energy dissipated during one switching of a single Josephson junction is of the order of $10^{-19}$ J, while the signals are communicated via superconductive (i.e. lossless) transmission lines. Thus, the problem of the large power dissipation of highly integrated semiconductor digital circuits does not arise;
- extremely high operation speed achieved with relatively large lateral dimensions: only a few years after the invention of the RSFQ technique, digital RSFQ circuits with micrometer feature sizes operating at sub-THz frequencies were demonstrated (see e.g. Kaplunenko et al., 1989; Chen et al., 1999);
- intrinsically digital data representation: due to the nature of the flux quantization (see Eq. (5)), the different binary states are inherently defined.
Nevertheless, RSFQ electronics still lacks a successful large-scale application. Only a few middle-scale RSFQ devices have been reported up to now, operating at clock frequencies of only a few tens of GHz (Bunyk et al., 2003; Tanaka et al., 2007). The big gap between the speed performance of the simple and the middle-scale RSFQ digital devices is tightly connected to the complicated global clock distribution network of complex synchronous RSFQ digital circuits. A significant part of the total circuitry of any synchronous middle-scale RSFQ application belongs to the clock distribution network, which leads to three negative consequences:
- a large share of the total dc bias current is consumed by the clocking, leading to parasitic magnetic fields that disturb the operation of the computational RSFQ electronics. Additionally, the transport of the huge dc bias current imposes severe requirements on the interface between the superconductive RSFQ chip and the normal-conducting supply network. Both effects restrict the integration level of RSFQ digital devices;
- the global clock distribution network significantly complicates the RSFQ layouts, often requiring "critical" structures (like crossings, vias, etc.), which may lead to parasitic interactions and provoke fabrication faults, thus diminishing the fabrication yield;
- as will be discussed later in this paper, the spread of the fabrication technology parameters affects the time-domain characteristics of the produced RSFQ circuits. With increasing circuit complexity, this effect causes significant jitter of the clock signal. The latter restricts the minimization of the global clock interval, i.e. the high-speed performance of the large-scale synchronous RSFQ circuit.
Fig. 5. Inclusion of a synchronous logic block within an asynchronous architecture based on the DR data coding: in.true and in.false are the "true" and "false" lines of the DR input channel, respectively; out.true and out.false are the "true" and "false" lines of the DR output channel, respectively.
All these negative effects can be more or less overcome by quantitative improvements of the existing RSFQ fabrication technologies and by applying new approaches to the design of highly integrated RSFQ digital circuits (Kang and Kaplan, 2003; Johnson et al., 2003). Nevertheless, the design of large synchronous RSFQ circuits meets the speed of light as a fundamental physical limit on global synchronization. Generally, the problem of increasing the global clock frequency is equivalent to that of reducing the geometrical size of the synchronous circuit (Sylvester and Keutzer, 2001), i.e. the product of the feature size and the clock frequency can be regarded as a quantitative estimate of the global synchronization problems within densely packed digital circuits realized with a given fabrication technology. For modern semiconductor CPUs (feature sizes below 100 nm; clock frequencies of several GHz), the global clock synchronization already imposes very hard restrictions on any further increase of speed and circuit complexity (see e.g. Sylvester and Keutzer, 2001). The modern LTS RSFQ fabrication technologies have feature sizes of a few µm and target clock frequencies above 100 GHz for large-scale applications, i.e. their product "feature size" × "clock frequency" is larger by a factor of the order of 1000 than that of modern semiconductor CPUs. Therefore, the realization of a synchronous RSFQ digital circuit having a multigigahertz clock frequency and even approaching the complexity of present-day semiconductor electronics is unimaginable with the present LTS RSFQ fabrication technologies (Dimov, 2005), and the implementation of the asynchronous logic approach is a vital precondition for overcoming this restriction.
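As a rough check of this factor, one can insert representative values from the ranges quoted above. The specific numbers (100 nm at 3 GHz for a semiconductor CPU, 3 µm at 100 GHz for an LTS RSFQ circuit) are illustrative assumptions, not figures from the cited works:

$$\frac{(3\,\mu\mathrm{m})\,(100\,\mathrm{GHz})}{(100\,\mathrm{nm})\,(3\,\mathrm{GHz})} = \frac{3\times10^{5}\ \mathrm{m/s}}{3\times10^{2}\ \mathrm{m/s}} = 10^{3}.$$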
As already stated in this Section, RSFQ digital electronics is pulse-based, and in such a case the most reliable asynchronous communication is provided by the Dual-Rail (DR) data coding shown schematically in Fig. 1. It uses two lines per bit of information to be communicated, usually named the "true" and the "false" line. A pulse only in the "true" line codes a Boolean "1" and a pulse only in the "false" line codes a Boolean "0", while a simultaneous propagation of pulses in both lines is forbidden. If an additional line is provided for the acknowledge signal, this communication is delay-insensitive. An important advantage of this data coding is that synchronous blocks can be easily included in the asynchronous architecture (the so-called globally asynchronous locally synchronous circuits, see Brzozowski and Seger, 1995). The latter is shown in Fig. 5.
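The coding rule can be written down as a small behavioural sketch. It models only the logical convention (which line carries a pulse), not the RSFQ circuitry or the acknowledge line; the names `dr_encode` and `dr_decode` are hypothetical.

```python
from typing import Optional, Tuple

# (pulse on the "true" line, pulse on the "false" line)
Pulse = Tuple[bool, bool]

def dr_encode(bit: Optional[int]) -> Pulse:
    """Map a Boolean value onto the two lines of a dual-rail channel.

    None stands for "no data" (no pulse on either line).
    """
    if bit is None:
        return (False, False)
    return (True, False) if bit else (False, True)

def dr_decode(pulse: Pulse) -> Optional[int]:
    """Recover the Boolean value, rejecting the forbidden double pulse."""
    true_line, false_line = pulse
    if true_line and false_line:
        raise ValueError("forbidden state: pulses on both lines")
    if true_line:
        return 1
    if false_line:
        return 0
    return None  # nothing has arrived yet

for value in (0, 1, None):
    print(value, "->", dr_encode(value), "->", dr_decode(dr_encode(value)))
```

Because every data item produces a pulse on exactly one line, the receiver can tell "data arrived" from "no data yet" without a clock, which is what makes this coding suitable for asynchronous communication.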
Based on the DR data coding, we have developed and tested an entire cell library (Dimov, 2005), containing all gates necessary for the high-level synthesis of complex asynchronous RSFQ digital circuits. A complete description of the schematics, the operation principles, and the electrical and time-domain parameters of its components can be found in Dimov (2004). As already stated, for the successful implementation of these gates within complex asynchronous digital circuits, efficient techniques for the modelling, manipulation, and optimization of their delay times are needed. We have developed such techniques and they are presented in the following Sections of the paper.

Sources of time-domain jitter
Once a digital gate is designed, its nominal delay time $d_n$ can be predicted exactly, e.g. by simulations. Although this constant is a very important time-domain parameter of the gate, it is quite insufficient for the logic-level design of high-speed complex circuits. The reason for this is the delay time jitter, the sources of which can be classified as:
- spread of the fabrication process: during the production of the circuit, the parameters of the fabrication technology deviate stochastically from their nominal values, which leads to a spread of the electrical characteristics of the fabricated circuit. This results in a deviation of its delay time from the nominal value $d_n$. It is an inevitable stochastic process, i.e. the resulting delay time cannot be determined in advance and can be predicted only by statistical means;
- fluctuations of the operation environment: once the circuit is fabricated and starts operating, it is exposed to noise and other parasitic interactions with the environment. They influence the dynamics of the switching elements, which also results in inevitable stochastic variations of the delay times. Again, this effect can be predicted and modelled only by statistical means.
Each of these sources has a global component, which represents the proportional shifting of all circuit parameters of one and the same type (e.g. all resistors, all inductances, all critical currents, etc.), and a local one, which represents the random shifting among circuit parameters of one and the same type. In the case of established RSFQ fabrication processes, the technological parameter spread within the chip is usually negligibly small. Moreover, the gradients of the parameters of the operation environment are also negligibly small across the tiny circuit area. Therefore, the global spread of the delay times dominates over the local one, and to simplify our modelling procedure we neglect the latter.

Modelling of the delay time spread
In order to model the delay time spread of the RSFQ gates, we assume that there is no correlation between the factors determining it. As stated in Ortlepp (2005) and Dimov (2005), this assumption is not absolutely correct, but it gives a good approximation due to the large number of such factors.
In this paper, we consider only the delay time spread due to variations of the fabrication process; however, the same approach can also be applied to the spread caused by fluctuations of the operation environment. Under the assumptions made above, each global electrical parameter of the circuit follows a Gaussian distribution with a mean value $\mu$ equal to its nominal value and a standard deviation $\sigma$ that is a parameter of the fabrication technology. For modelling the impact of the fabrication spread on the delay time of the circuit, we have developed the Windows/DOS compatible software package JSIMSA (Mladenov, 2006), based on the free Josephson junction circuit simulator JSIM (Fang and van Duzer, 1989). Its block diagram is shown in Fig. 6. Within one iteration loop of the program, a set of coefficients is first generated with standard deviations specified by the fabrication foundry. They are used to scale the nominal electrical parameters of the circuit, thus modelling their spread during the fabrication process. The resulting netlist of the circuit is simulated with JSIM, and the obtained time-domain behaviour is classified as working (good) or not operating (bad). A working circuit is counted, and its delay time is automatically calculated and stored in a file. This cycle is repeated many (typically over 100 000) times. Finally, we build the histogram of the delays and analyze it statistically, thus deriving the mathematical model of the delay time distribution of the investigated circuit. Additionally, we calculate the circuit fabrication yield as $j/n_{\max}$, with $j$ the number of working circuits and $n_{\max}$ the total number of simulated circuits. We have applied this technique to the statistical description of all gates from our asynchronous RSFQ cell library (see Dimov, 2004). Below, we discuss the results for the most complex gate of the library, the dual-rail 1×2 demultiplexer.
Its electrical scheme is shown in Fig. 7, while its element values and operation principle can be found in Dimov (2004) and Dimov et al. (2005a).
We have assumed only three independent global circuit parameters, i.e. at each iteration loop of JSIMSA we generate three independent scaling coefficients $k_L$, $k_i$, and $k_g$, following Gaussian distributions with mean values $\mu_{kL}=\mu_{ki}=\mu_{kg}=1$ and standard deviations $\sigma_{kL}$, $\sigma_{ki}$, and $\sigma_{kg}$, respectively. Within the analysed circuit, we scale all inductances by $k_L$, all junction critical currents and all dc bias currents by $k_i$, and all junction parasitic inductances to ground by $k_g$.
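The iteration loop described above can be sketched as follows. This is not the authors' program: the circuit simulation is replaced by a hypothetical stand-in function `toy_delay_ps` with an arbitrary pass/fail window and an arbitrary delay formula, so only the structure of the loop (draw three Gaussian coefficients, simulate, count working circuits, collect delays) mirrors the text.

```python
import random
import statistics

def toy_delay_ps(k_L, k_i, k_g, d_nominal=10.0):
    """Hypothetical stand-in for one circuit simulation.

    Returns a delay in ps, or None if the scaled circuit is treated as not
    operating. The window and the formula are arbitrary, not a physical model.
    """
    if not (0.7 < k_L < 1.3 and 0.7 < k_i < 1.3 and 0.7 < k_g < 1.3):
        return None
    return d_nominal * k_L ** 0.5 / k_i * k_g ** 0.25

def monte_carlo(n_max=20000, sigma=0.10, seed=1):
    """Return (fabrication yield, mean delay, standard deviation of delay)."""
    rng = random.Random(seed)
    delays = []
    for _ in range(n_max):
        # three independent global scaling coefficients with mean value 1
        k_L, k_i, k_g = (rng.gauss(1.0, sigma) for _ in range(3))
        delay = toy_delay_ps(k_L, k_i, k_g)
        if delay is not None:  # working circuit: count it and store its delay
            delays.append(delay)
    fab_yield = len(delays) / n_max  # j / n_max
    return fab_yield, statistics.mean(delays), statistics.stdev(delays)

fab_yield, mean_delay, sigma_delay = monte_carlo()
print(f"yield = {fab_yield:.3f}, d = {mean_delay:.2f} ps, "
      f"sigma_d = {sigma_delay:.2f} ps")
```

In an actual flow, the stand-in function would be replaced by a call that writes the scaled netlist, runs the circuit simulator, and extracts the delay from the simulated waveforms.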
The resulting delay times are shown in Fig. 8 for the cases $\sigma_{kL}=\sigma_{ki}=\sigma_{kg}=10\%$ and $\sigma_{kL}=\sigma_{ki}=\sigma_{kg}=20\%$. As for all other gates from the cell library, the resulting statistical distribution of the delay time can be successfully fitted to a Gaussian one. The large standard deviations obtained clearly demonstrate the strong impact of the technological parameter spread on the jitter of the time-domain behaviour of RSFQ circuits. Therefore, such an analysis should be incorporated into the derivation of the timing conditions that ensure the absence of signal hazards between the competitive signal paths within asynchronous digital circuits (see the discussion in Sect. 2). In this way, the delay time of each signal path is no longer a constant value equal to the sum of the nominal delay times of the path's components; it has to be described by its statistical distribution, which results from combining the statistical distributions of the delay times of these components. Similarly, the timing conditions ensuring hazard-free data exchange are no longer Boolean variables (i.e. either fulfilled or not), but have a statistical nature (i.e. they are fulfilled with a certain probability). We conclude that the exact statistical prediction of the signal conflicts is a key component of the successful design of high-speed complex asynchronous RSFQ circuits, and that the proposed technique is a powerful tool for carrying it out accurately.
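A minimal sketch of such a statistical timing condition is given below. It assumes, for simplicity, that the gate delays are independent Gaussian variables, so that means and variances add along a path; a correlated (global) spread would require additional covariance terms. The function names and the numerical values are hypothetical.

```python
import math

def path_delay(gates):
    """gates: list of (mean_ps, sigma_ps) pairs of the gates along a path.

    Returns (mean, sigma) of the total path delay, assuming independent
    Gaussian gate delays.
    """
    mean = sum(m for m, _ in gates)
    sigma = math.sqrt(sum(s * s for _, s in gates))
    return mean, sigma

def conflict_probability(fast_path, slow_path):
    """Probability that the nominally faster path is in fact the slower one."""
    m_fast, s_fast = path_delay(fast_path)
    m_slow, s_slow = path_delay(slow_path)
    z = (m_slow - m_fast) / math.hypot(s_fast, s_slow)
    return 0.5 * math.erfc(z / math.sqrt(2.0))  # lower Gaussian tail beyond z

# Hypothetical competitive paths of two gates each (delays in ps)
fast = [(10.0, 1.0), (10.0, 1.0)]
slow = [(12.0, 1.0), (12.0, 1.0)]
p_conflict = conflict_probability(fast, slow)
print(f"probability of a signal conflict: {p_conflict:.4f}")
```

The result makes the statistical nature of the timing condition explicit: the condition "the fast path arrives first" holds with probability 1 minus this value rather than being simply true or false.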

Optimization of the delay time spread
A very important and time-consuming step of the small-scale RSFQ design flow is the optimization of the cell library components (see Dimov, 2005). Within the classical synchronous RSFQ design, the goal of this step is to adjust the circuit's parameters so as to maximize the gate's fabrication yield. In this way, one also obtains a maximum fabrication yield for the complex synchronous RSFQ digital circuit composed of these optimized gates.
As already emphasized at the beginning of this paper, the correct operation of complex asynchronous circuits depends not only on the correct operation of their building blocks alone, but also on the timing assumptions that allow handshaking feedbacks to be omitted. Therefore, we optimize the components of our asynchronous RSFQ cell library so as to minimize the standard deviation of their delay time spreads, while keeping the fabrication yields reasonably large. By this, we minimize the probability of signal hazards within complex asynchronous designs based on this library, while keeping their yield acceptable.
This novel optimization strategy will be illustrated for the asynchronous dual-rail 1×2 demultiplexer in Fig. 7. Below, $I_c$ designates the value of the critical currents being varied, while $d$ and $\sigma_d$ designate the nominal value of the gate's delay time and its standard deviation, respectively. The dependence of the gate's fabrication yield and of the ratio $\sigma_d/d$ on $I_c$ is shown in Fig. 9. The maximum fabrication yield is obtained at $I_c=162$ µA, which is far away from the optimum of $\sigma_d/d$. Therefore, we choose a nominal value $I_c=175$ µA, which reduces the fabrication yield by only a few percent but reduces the ratio $\sigma_d/d$ by about 10%. In this way, the time-domain stability of the gate is improved, while its yield is only slightly diminished.

Elimination of signal conflicts within complex asynchronous RSFQ digital circuits
With the novel technique for minimizing the delay time spread of the RSFQ gates, one can significantly reduce the standard deviation of the delay times of the competitive signal paths. In this way, the probability of overlap of their statistical distributions is also diminished, which corresponds to a reduced probability of erroneous digital operation due to signal hazards. Nevertheless, this probability can still remain unacceptably large for high-speed complex applications. The only way to minimize it further is to manipulate the mean values of the delay times of the competitive signal paths, i.e. to tune properly the nominal delay times of their components. This is schematically shown in Fig. 10. In Dimov and Uhlmann (2004), we have derived the possible methods for manipulating the switching speed of the Josephson junctions and compared them with respect to the following criteria:
- straightforward technological realization: the tuning of the delay times should be performed without significant redesign of the layout of the RSFQ circuit;
- minimum deterioration of the gate compatibility: when such a tuning is applied, no parasitic interactions should occur between the circuit components. Otherwise, extensive reoptimization would be necessary;
- efficient delay time tuning: the applied method should be able to manipulate the RSFQ delay times precisely within a large interval around their nominal values;
- minimum degradation of the fabrication yield and the operational margins: in this way, the reliable operation of the circuit after its production is ensured.
Fig. 11. Impact of the scaling of the external shunt resistors of the Josephson junctions on the delay time and the operation margins of the asynchronous dual-rail 1×2 demultiplexer in Fig. 7: rm is the scaling coefficient of the junctions' external shunts; Xb, Xi, and Xl are the global margins of the dc bias currents, the junctions' critical currents, and the superconductive inductances, respectively.
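How far the mean delays of two competitive paths have to be moved apart can be estimated with a short sketch. It assumes independent Gaussian path delays with known standard deviations; the function name and the numbers are hypothetical, and the calculation is only an illustration of the idea of tuning the mean values, not the procedure of Dimov and Uhlmann (2004).

```python
import math
from statistics import NormalDist

def required_separation(sigma_a, sigma_b, p_max):
    """Smallest difference of the mean path delays for which the probability
    of the two paths arriving in the wrong order does not exceed p_max.

    Assumes independent Gaussian path delays (an illustrative simplification).
    """
    z = NormalDist().inv_cdf(1.0 - p_max)  # quantile of the standard normal
    return z * math.hypot(sigma_a, sigma_b)

# Hypothetical example: both paths have a delay spread of 1 ps
for p_max in (1e-2, 1e-6, 1e-12):
    delta = required_separation(1.0, 1.0, p_max)
    print(f"p_max = {p_max:g}: mean delays must differ by {delta:.2f} ps")
```

The required separation grows only slowly as the tolerated error probability is lowered, but it scales linearly with the delay spread, which is why reducing the spread and tuning the mean values complement each other.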
Our evaluation shows that scaling the external shunt resistors of the Josephson junctions is the best method for tuning the RSFQ delay times, because it fulfils all four criteria formulated above. Simulated data for its application to the asynchronous dual-rail 1×2 demultiplexer in Fig. 7 are shown in Fig. 11; such data for the other components of our asynchronous RSFQ cell library can be found in Dimov (2004). As seen from Fig. 11

Conclusions
Signal conflicts are one of the hardest constraints on the realization of complex asynchronous digital circuits. This problem is even more pronounced in the RSFQ digital technique due to the extremely high operation speed of the Josephson junctions. In this paper, we have presented our novel techniques for the detection of competitive signal paths, the estimation of the probability of signal conflicts in them, and the minimization of this probability by optimizing the spread of the RSFQ delay times and properly tuning their mean values. Using these techniques, we have extended the standard logic-level design flow to an advanced design flow for complex asynchronous RSFQ digital circuits, which is shown in Fig. 12. In this way, the synthesis of high-performance asynchronous RSFQ digital circuits has been improved significantly.