PipeRoute: a pipelining-aware router for FPGAs


of 11
All materials on our website are shared by users. If you have any questions about copyright issues, please report us to resolve them. We are always happy to assist you.
PipeRoute: a pipelining-aware router for FPGAs
  PipeRoute: A Pipelining-Aware Router for FPGAs  Akshay Sharma Electrical Engineering University of Washington Seattle, WA akshay@ee.washington.edu Carl Ebeling Computer Science & Engineering University of Washington Seattle, WA ebeling@cs.washington.edu Scott Hauck Electrical Engineering University of Washington Seattle, WA hauck@ee.washington.edu ABSTRACT We present a pipelining-aware router for FPGAs. The problem of routing pipelined signals is different from the conventional FPGA routing problem. For example, the two terminal N-Delay  pipelined routing problem is to find the lowest cost route between a source and sink that goes through at least N (N > 1) distinct  pipelining resources. In the case of a multi-terminal pipelined signal, the problem is to find a Minimum Spanning Tree that contains sufficient pipelining resources such that the delay constraint at each  sink is satisfied. We begin this work by proving that the two terminal N-Delay problem is NP-Complete. We then  propose an optimal algorithm for finding a lowest cost 1-Delay route. Next, the optimal 1-Delay router is used as the building  block for a greedy two terminal N-Delay router. Finally, a multi-terminal routing algorithm (PipeRoute) that effectively leverages the 1-Delay and N-Delay routers is proposed. PipeRoute’s  performance is evaluated by routing a set of retimed benchmarks on the RaPiD [2] architecture. Our results show that the architecture overhead incurred in routing retimed netlists on RaPiD is less than a factor of two. Further, the results indicate a  possible trend between the architecture overhead and the  percentage of pipelined signals in a netlist. Categories and Subject Descriptors  B.7.2 [ Integrated Circuits ]: Design Aids –  placement and routing. General Terms  Algorithms, Design. Keywords  Pipelined circuits, pipelining, routing, BFS, retimed circuits, retiming, Minimum Spanning Tree, PipeRoute. 1.   INTRODUCTION It is well established that FPGAs are a convenient marriage  between the flexibility of software, and performance levels achievable in hardware. Reconfigurable logic units, coupled with a rich programmable interconnect structure, can be used to implement a variety of applications. However, while FPGAs remain extremely attractive for their hardware flexibility, the minimum clock period that is achievable in present-day FPGAs leaves a lot to be desired. In the world of microprocessors and custom design, pipelining is widely used to reduce the critical path delay of a circuit. Powerful sequential retiming heuristics have contributed to reducing the clock period of circuits even further [4,5]. Thus, designers of reconfigurable architectures are now paying serious attention to  providing pipelining resources in the logic units and routing fabric that constitute reconfigurable architectures. A number of research groups have proposed pipelined FPGA architectures. HSRA [13] is an example of an FPGA architecture that has a hierarchical, pipelined interconnect structure. A fraction of the switchboxes is populated with registered switches to meet a target clock period. Also, instead of having a single register on the output of a LUT (which is generally the case in existing FPGA architectures), a bank of registers is connected to each input of the LUT. This helps balance path delays introduced by the pipelined interconnect. User applications are mapped to HSRA by integrating data retiming with a conventional FPGA CAD flow. A second example of a pipelined FPGA architecture is proposed in Singh et al [10]. The routing architecture is hierarchical, and the higher-level routing consists of horizontal and vertical long lines that surround logic blocks. Each long line is pipelined using a bank of registered switch-points, and every switch-point can be used to delay a long line from 0 – 4 clock cycles. DSP designs mapped to this architecture were able to achieve throughputs of up to 600 MHz. RaPiD [2,3] is a coarse-grained one-dimensional (1-D) architecture that has pipelined datapath and interconnect structures. The datapath consists of 16-bit ALUs, multipliers, SRAMs and registers. The registers comprise a significant fraction of the datapath, thus providing pipelining resources. The interconnect is composed of short tracks that are used to achieve local communication between logic units, and long tracks that enable relatively long distance communication along the datapath. The long tracks traverse multiple switch-points, whereas the short tracks do not traverse any switch-points. The outputs of every logic unit, as well as all switch-points, can optionally be registered. Due to the 1-D nature of the interconnect, switchpoints Permission to make digital or hard copies of all or part of this work for  personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.  FPGA ’03 , February 23-25, 2003, Monterey, California, USA Copyright 2000 ACM 1-58113-651-X/03/0002…$5.00.  have 2 terminals, and are bidirectional. Like the architecture  proposed in [10], the RaPiD architecture is targeted at regular, compute intensive applications that are amenable to deep  pipelining. The aforementioned architectural examples indicate that good  progress is being made in the design of pipelined architectures. The challenge now is to develop CAD tools that can map user applications to pipelined FPGA architectures. In [12], the authors investigate the benefits of integrating placement and retiming by  proposing retiming aware placement algorithms. The same authors present a retiming aware router in [11]. This router attempts to place long signals on tracks that have registered switches, so that a subsequent retiming step can take advantage of the assignment to pipeline the long signals. In [11], the goal is to reduce interconnect delay by pipelining long signals. Placement and logic retiming are closely coupled to give the retiming step an estimate of routing delay in [12]. The subject of this paper is the development of an algorithm called PipeRoute that routes retimed   application netlists on  pipelined   FPGA architectures. In retimed   netlists, all pipelining registers are explicitly enumerated, and it is therefore possible to calculate the number of clock cycles that separate the signal’s source from each of its sinks. A  pipelined   FPGA architecture is one that has pipelining resources in the interconnect structure . These pipelining resources supplement the registers that are already provided in FPGA logic blocks. PipeRoute takes a retimed netlist and a pipelined FPGA architecture as inputs, and  produces an assignment of signals to routing resources as the output. To the best of our knowledge, PipeRoute is the first routing algorithm that is capable of routing retimed netlists on  pipelined FPGA architectures. Furthermore, the strength of the PipeRoute algorithm lies in the fact that it is architecture-independent. The algorithm is capable of routing pipelined signals on any FPGA architecture that can be abstractly represented as a graph consisting of routing- and pipelining-nodes. 2.   PROBLEM BACKGROUND The FPGA routing problem is to determine an assignment of signals to limited routing resources while trying to achieve the  best possible delay characteristics. Pathfinder [6] is one of the most widely used FPGA routing algorithms. It is an iterative algorithm, and consists of two parts. The signal router routes individual signals based on Prim’s algorithm, which is used to  build a Minimum Spanning Tree (MST) on an undirected graph. The global router adjusts the cost of each routing resource at the end of an iteration based on the demands placed on that routing resource during the iteration. During the first routing iteration, signals are free to share as many routing resources as they like. However, the cost of using a shared routing resource is gradually increased during later iterations, and this increase in cost is  proportional to the number of signals that share that resource. Thus, this scheme forces signals to negotiate for routing resources. A signal can use a high cost resource if all remaining resource options are in even higher demand. On the other hand, a signal that can take an alternative, lower cost route is forced to do so because of competition for shared resources. Circuits routed using Pathfinder’s congestion resolution scheme converge quickly, and exhibit good delay characteristics. sig  S K1K2K3 sig  S   K1K2K3   Fig. 1: A multi-terminal pipelined signal In the case of retimed netlists, the routing problem is different from the conventional FPGA routing problem. This is because a significant fraction of the signals in a netlist are deeply pipelined, and merely building an MST for a pipelined signal is not enough. For example, consider the pipelined signal sig   in Fig. 1 that has a source S and sinks K1, K2 and K3. The signal is pipelined in such a way that sink K1 must be delayed 3 clock cycles relative to S, sink K2 must be 4 clock cycles away, and sink K3 must be 5 clock cycles away. A route for sig   is valid only if it contains enough pipelining resources to satisfy the delay constraints at every sink. Due to the fact that there are a fixed number of sites in the interconnect where a signal can be delayed by a clock cycle (hereafter referred to as “delay sites”), it can be easily seen that a route that is found for sig   by a conventional, pipelining-unaware FPGA router may not contain enough delay sites to satisfy the delay constraint at every sink. Thus, the routing problem for  pipelined signals is different from that for unpipelined signals. For a two-terminal pipelined signal, the routing problem is stated as: Two-terminal N-Delay Problem:  Let G=(V,E) be an undirected  graph, with the cost of each node v in the graph being w v  >= 1. The graph consists of two types of nodes; D-nodes and R-nodes.  Let S,K  ∈  V be two R-nodes. Find a path P  G (S,K) that connects nodes S and K, and contains at least N (N ≥   1) distinct D-nodes,  such that w(P  G (S,K)) is minimum, where w(P  G (S,K)) =   w v    Further, impose the restriction that the path cannot use the same edge to both enter and exit any D-node.  We call a route that contains at least ‘N’ distinct D-nodes an “N-Delay” route. R-nodes represent wires and IO pins of logic units in a pipelined architecture, whereas D-nodes represent registered switch-points. A registered switch-point can be used to pick up 1 clock cycle delay, or no delay at all. Every node is assigned a cost, and an edge between two nodes represents a physical connection between them in the architecture. The cost of a node is a function of congestion, and is identical to the cost function developed for Pathfinder’s NC algorithm [6]. Under this framework, an abstraction of the routing problem for a simpler two-terminal signal is to find the lowest cost route between source and sink that goes through at least N (N ≥  1) distinct D-nodes (N is the number of clock cycles that separates the source from the sink). Note that a lowest cost route can be self-intersecting i.e. R-nodes can be shared in the lowest cost route. We have shown that the two-terminal N-Delay problem is NP-Complete [9]. In the more complex case of a multi-terminal signal, the problem is to find an MST that contains enough D-nodes such that each  sink is the correct number of clock cycles away from the source.  A simple solution to the pipelined routing problem would be to address pipelining in the placement phase. The pipelining registers in a netlist could be mapped to registered switch-points in the architecture, and a simulated annealing placement algorithm could determine an optimal placement of the pipelining registers. After the placement phase, a conventional FPGA router could be used to route the signals in the netlist. However, a  placement of a netlist that maps pipelining registers to registered switch-points eliminates portions of the routing graph. This is  because a registered switch-point that is occupied by a particular  pipelining register cannot be used by signals other than the signals that connect to that pipelining register. As a consequence, the search space of a conventional FPGA router is severely limited, and this results in solutions of poor quality. It is therefore clear that a pipelining-aware placement phase is not sufficient to successfully route pipelined signals. In sections 3, 4 and 5, we present a greedy heuristic search algorithm for routing signals on pipelined FPGA architectures, and an explanation of how we use Pathfinder’s Negotiated Congestion (NC) algorithm [6] in conjunction with our heuristic to resolve congestion. Section 6 describes the target architecture that we used in our experiments, while Section 7 describes the  placement algorithm we developed to enable our routing approach. We describe our experimental setup and test strategy in Section 8, followed by results in Section 9. Finally, in Section 10, we discuss some of many directions for future efforts, and conclude this work. 3.   ONE-DELAY ROUTER In the previous section, we pointed out that the problem of finding the lowest cost route between a source and sink that goes through at least N distinct D-nodes is NP-Complete. However, we now show that a lowest cost route between a source and sink that goes through at least 1 D-node can be found in polynomial time. In a weighted, undirected graph, the Breadth First Search (BFS) algorithm is widely used to find the lowest cost route between a source and sink. The remainder of this section evaluates several modifications of conventional BFS that can be used to find a lowest cost 1-Delay route. Our first modification is  Redundant- Phased-BFS  . In this algorithm, a phase 0 wavefront is launched at the source. When the phase 0 exploration hits a D-node, it is locally terminated there (i.e. the phase 0 exploration is not allowed to continue through the D-node, although the phase 0 exploration can continue through other R-nodes), and an independent phase 1 wavefront is begun instead. When commencing a phase 1 wavefront at a D-node, we impose a restriction that disallows the phase 1 wavefront from exiting the D-node along the same edge that was used to explore it at phase 0. This is based on the assumption that it is architecturally infeasible for the D-node that srcinates the phase 1 wavefront to explore the very node that is used to discover it at phase 0. When a phase 1 wavefront explores a D-node, the D-node is treated like an R-node, and the phase 1 wavefront propagates through the D-node. If the number of D-nodes that can be explored at phase 0 from the source is ‘F’, up to F independent phase 1 wavefronts can co-exist during  Redundant-Phased-BFS  . The search space of the phase 1 wavefronts can overlap considerably due to the fact that each R-node in the graph can be potentially explored by up to F independent phase 1 wavefronts. Consequently, the worst-case run-time of  Redundant-Phased- BFS is F times that of conventional BFS. Since F could potentially equal the number of registers in the FPGA, the worst-case run-time of  Redundant- Phased-BFS   could get prohibitive. An alternative to  Redundant-Phased-BFS   that can be used to find a lowest cost 1-Delay route between a source and sink is Combined-Phased-BFS  . This algorithm attempts to reduce run-time by combining the search space of all the D-nodes that can be explored at phase 0 from the source. The only difference between  Redundant-Phased-BFS   and Combined-Phased-BFS   is that the latter algorithm allows each R-node to be visited only once by a  phase 1 wavefront. As a consequence, the run-time of Combined- Phased-BFS   is only double that of conventional BFS. In addition, an important effect of the dichotomy that we have created due to  phase 0 and phase 1 wavefronts is that R-nodes that constitute the  phase 0 segment of a 1-Delay route can be reused in the phase 1 segment of the same 1-Delay route. We rely on Pathfinder’s [6] congestion resolution scheme to adjust the history cost of such R-nodes, so that in a later iteration a 1-Delay route with no node reuse between phase 0 and phase 1 segments can be found. A step-by-step illustration of how Combined-Phased-BFS   works is shown in Figs. 2A through 2E. For the sake of simplicity, assume all nodes in the example graph have unit cost. The source S is explored at phase 0 at the start of the phased BFS. The number 0 next to S in Fig. 2A indicates that S has been explored  by a phase 0 wavefront. In Fig. 2B, the neighbors of S are explored by the phase 0 wavefront initiated at S. The 2 nd -level neighbors of S are explored by phase 0 in Fig. 2C, one of which is D-node D1. Note that we make a special note of D1’s phase 0  predecessor here, so that we do not explore this predecessor by means of the phase 1 wavefront that is commenced at D1. In Fig. 2D, the neighbors of D1 (excluding R1) are explored at phase 1. The phase 0 exploration also continues simultaneously, and note how nodes R4 and R7 have been explored by both phase 0 and  phase 1 wavefronts. Finally, in Fig. 2E, the sink K is explored by the phase 1 wavefront initiated at D1. The route found by Combined-Phased-BFS   is shown in boldface in Fig. 2E, and is in fact an optimal route between S and K. S K  R1R2 R4R3 R5R6R7R8D1D20   Fig. 2A: Phase 0 exploration commences at node S.  S K  R1R2 R4R3 R5R6R7R8D1D20000   Fig. 2B: The neighbors of S are explored at phase 0. S K  R1R2 R4R3 R5R6R7R8D1D200000 prevR1 00  Fig. 2C: 2 nd -level neighbors of S are explored at phase 0, and in the process D-node D1 is discovered. S K  R1R2 R4R3 R5R6R7R8D1D200000 prevR1 00000 prevR5 110  Fig. 2D: D1 starts a phase 1 exploration. The phase 0 exploration continues simultaneously, and D2 is discovered. S K  R1R2 R4R3 R5R6R7R8D1D200000 prevR1 00000 prevR5 11011111  Fig. 2E: K is explored by phase 1 wavefront commenced at D1. Unfortunately, Combined-Phased-BFS   fails to find a lowest cost route on some graph topologies. An example of a failure case is shown in Fig. 3. Here the node S is both the source and sink of a signal, and each node is unit cost. Combined-Phased-BFS   will fail to return to S at phase 1 because R-nodes on each possible route  back to S have already been explored by the phase 1 wavefront. In effect, Combined-Phased-BFS   isolates nodes S, R1, R2, D1 and D2 from the rest of the graph, thus precluding the discovery of any route back to S at all. The reason for the failure of Combined-Phased-BFS   is that a node on the phase 1 segment of the lowest cost route is instead explored  by a phase 1 wavefront commenced at another delay site. For example, in Fig. 3 we consider the route S-R1-D1-R3-R5-R4-D2-R2-S to be lowest cost. Node R4 is explored by the phase 1 wavefront commenced at D2, thus precluding node R4 from being explored by the phase 1 wavefront started at D1. However, if we slightly relax Combined-Phased-BFS   to allow each node in the graph to be explored by at most two  phase 1 wavefronts that are independently started at different D-nodes, then the phase 1 wavefronts started at D1 and D2 will now be able to overlap, thus allowing the lowest cost route to be found. SR1R2R3D10D2    R4R50000111  Fig. 3: A case for which phased BFS fails. Observe how the phase 1 exploration has got isolated from the phase 0 exploration An important consequence of the nature of the transition from  phase 0 to phase 1 at a D-node is shown in Fig. 4. In this case, S is the source of the signal, and K is the sink. Observe that a phase 0 exploration explores D1 from R1. Consequently, the phase 0 exploration is precluded from exploring D1 from R4. This  prevents the optimal 1-Delay route to K from being found. To address this problem, we allow any D-node to be explored at most two times at phase 0. In Fig. 4, D1 can be explored at phase 0 from R1 and R4, thus allowing the optimal 1-Delay path S-R2-R3-R4-D1-R1-K to be found. SR2R1 D1R3R4K 0000 00S   R2R1 D1R3R4K 0000 00   Fig. 4: D1 is explored at phase 0 from R1, thus precluding the discovery of the 1-Delay path to the sink K. The following rules summarize 2Combined-Phased-BFS  : •   An R-node can be explored at most once at phase 0. •   A D-node can be explored at most twice at phase 0. •   An R-node can be explored by at most two distinct phase 1 explorations. The cases in which two phase 1 explorations are distinct are: o   The two phase 1 explorations are initiated by two different D-nodes, OR o   The two phase 1 explorations are initiated by the same D-node, but the R-nodes that were used to explore the D-node at phase 0 are different.  •   A D-node can be explored by at most two distinct phase 1 explorations. This rule is identical to the way R-nodes are explored at phase 1. We have proven that 2Combined-Phased-BFS   finds an optimal 1-Delay route between a source and sink on an undirected graph consisting of R-nodes and D-nodes [9]. 4.   N-DELAY ROUTER In this section, we present a heuristic that uses the optimal 1-Delay router to build a route for a two terminal N-Delay signal. This heuristic greedily accumulates delay at the sink by using 1-Delay routes as building blocks. In general, an N-Delay route is recursively built from an (N-1)-Delay route by successively replacing each segment of the (N-1)-Delay route by a 1-Delay route and then selecting the lowest cost N-Delay route. Fig. 5 is an abstract illustration of how a 3-Delay route between S and K is found. In the first step, we find a 1-Delay route between S and K, with D11 being the D-node where we pick up delay. At this point, we increment the sharing cost of all nodes that constitute the route S-D11-K. In the second step, we find two 1-Delay routes,  between S and D11, and D11 and K. The sequence of sub-steps in this operation is as follows: •   Decrement sharing cost of segment S-D11. •   Find 1-Delay route between S and D11 (S-D21-D11). Store cost of route S-D21-D11-K in Cost  S-D21-D11-K  .   •   Restore segment S-D11 by incrementing the sharing cost of segment S-D11. •   Decrement sharing cost of segment D11-K. •   Find 1-Delay route between D11 and K (D11-D22-K). Store cost of route S-D11-D22-K in Cost  S-D11-D22-K  .   •   Restore segment D11-K by incrementing the sharing cost of segment D11-K. •   Select the lowest cost route, either S-D21-D11-K and S-D11-D22-K. Suppose the lowest cost 2-Delay route is S-D11-D22-K. We rip up and decrement sharing due to the segment D11-K in the srcinal route S-D11-K, and replace it with segment D11-D22-K. Finally, we increment sharing of the segment D11-D22-K. The  partial route now is S-D11-D22-K. The sequence of sub-steps in step three is similar. Segments S-D11, D11-D22 and D22-K are successively ripped up, replaced with individual 1-Delay segments, and for each case the cost of the entire 3-Delay route  between S and K is stored. The lowest cost route is then selected. In Fig. 5, the 3-Delay route that is found is shown in dark lines, and is S-D11-D31-D22-K. The number of 1-Delay BFS’ launched for the 3-Delay route that we just discussed is 1 + 2 + 3 = 6. For the general N-Delay case, the number of 1-Delay BFS’ launched is 1 + 2 + ... + N = N(N-1)/2. A bound on the number of 1-Delay BFS’ launched for an N-Delay route is N 2 . S K  D11D21D22D31D32   Fig. 5: Building a 3-Delay route from 1-Delay routes 5.   MULTI-TERMINAL ROUTER The previous section described a heuristic that uses optimal 1-Delay routes to build a two-terminal N-Delay route. The most general type of pipelined signal is a multi-terminal pipelined signal. A multi-terminal pipelined signal has more than one sink, and the number of delays separating the source from each sink could differ across the set of sinks. A simple example of a multi-terminal pipelined signal sig was shown in Fig. 1. The sinks K1, K2 and K3 must be separated from the source S by 3, 4 and 5 delays respectively. We will now demonstrate how a route for a multi-terminal signal can be found by taking advantage of the 1-Delay and N-Delay routers that were discussed in Sections 3 and 4. The routing tree for a multi-terminal pipelined signal is built one sink at a time. The entire list of sinks is stored in a pre-sorted list, and each sink is considered in non-decreasing order of delay separation from the source of the signal .  Hence, the multi-terminal router starts by finding a route to a sink that is the least number of delays away from the source. Since finding a route to the first sink is a two-terminal case, we use the two-terminal N-Delay router to establish a route between the source and first sink. The remainder of this section examines the task of expanding the route between the source and the first sink to include all other sinks. We explain the multi-terminal router via a simple example. Assume a hypothetical signal that has a source S and sinks K3 and K4. K3 must be separated from S by 3 delays, whereas K4 must be separated by 4 delays. Sink K3 is considered first, and the  N-Delay router is used to find a 3-Delay route between S and K3. In Fig. 6A, the route S-D1-D2-D3-K3 represents the 3-Delay route between S and K3, and constitutes the  partial_routing_tree  of the signal. In general, the  partial_routing_tree  of a multi-terminal pipelined signal can be defined as the tree that connects the source to all sinks that have already been routed. After a route to K3 is found, the router considers sink K4. As was the case in the N-Delay router, we accumulate delay at K4 one delay at a time. Thus, we start by finding a 1-Delay route to K4, then a 2-Delay route, a 3-Delay route, and finally a 4-Delay route to K4. It can be seen that a 1-Delay route to K4 can be found either from the 0-Delay segment S-D1 by going through another D-node, or from the 1-Delay segment D1-D2 directly. However, it is not necessary to launch independent wavefronts from
Related Search
We Need Your Support
Thank you for visiting our website and your interest in our free products and services. We are nonprofit website to share and download documents. To the running of this website, we need your help to support us.

Thanks to everyone for your continued support.

No, Thanks