# A Long-wire-connected and Multi-channel 3D Network-on-chip Design for Many-core System

Hai Tan<sup>\*1,2,3</sup>, Talpur Shahnawaz<sup>1,4</sup>, Soomro Amir Mahmood<sup>1,4</sup>, Hongmao Chen<sup>2</sup>

<sup>1</sup>Beijing Institute of Technology, Beijing, China

<sup>2</sup>School of Information Engineering, East China Institute of Technology, Nanchang, China

<sup>3</sup>Engineering Research Center of Nuclear Technology Application, Ministry of Education, East China Institute of Technology, Nanchang, China

<sup>4</sup>Mehran University of Engineering and Technology, Jamshoro, Sindh, Pakistan

\*Corresponding author, e-mail: htan@ecit.edu.cn, talpur@bit.edu.cn

# Abstract

To reduce the congestion caused by different types of data competing for the same channel, we present a low-latency, energy-efficient network-on-chip with three channels for different types of data. Control data exchanged between cores is therefore not blocked by the large volume of data transmitted from caches to cores, which improves both latency and energy. Our strategy is to connect two nodes in the same row or column with a direct long wire, and to distribute these wires across different layers that are connected by 3D stacking technology. In a many-core system using this topology, every core-cache node pair is at most 5 hops apart, while short, real-time control information is transmitted over a 2D mesh network. Experimental results show up to 23% lower network latency and up to 15% lower energy consumption compared with a 3D network-on-chip.

Keywords: multi-channel, many-core system, 3D stack, long-wire-connected

#### Copyright © 2013 Universitas Ahmad Dahlan. All rights reserved.

# 1. Introduction

Integrating hundreds of cores into a single-chip many-core system faces several key challenges: resource management, high power consumption, and the heat dissipation that this power consumption causes. The interconnection between cores in a network-on-chip (NoC) plays a very important role in the performance and power consumption of the chip [1-3]. Data in NoCs is transmitted mainly as packets, and latency rises when packets travelling between different nodes compete for the same channel. To provide low-latency, high-bandwidth communication in NoCs, fast routers are proposed in [4-6] and new network topologies are proposed in [7-9].

Emerging 3D stacking technology has opened a new horizon for NoC design. It stacks multiple active silicon layers and interconnects them via wafer bonding. Latency on interconnection lines drops considerably because inter-layer interconnection lines are short. The interconnection line length in a 3D NoC architecture is reduced in proportion to the square root of the number of layers involved [10]. Its advantages, such as high performance and low interconnection power consumption, make high transistor density achievable. It is currently a hot topic in chip multiprocessor (CMP) research. Current research mainly deals with new network design methods enabled by 3D integration, i.e., researchers optimize NoCs to increase communication speed and reduce latency through router, topology and bandwidth design. An efficient router that reduces the number of vertical hops is proposed in [11]. Another router, which applies multi-layer 3D stacking to reduce power consumption, is put forward in [12]. In [13], a low-radix and low-diameter 3D NoC topology is applied, which reduces network latency by 29% and power consumption by 24% over a 3D mesh. In [14], information transferred on NoCs is divided into two types, control information and non-control information (data), and different transport strategies are used for each. Current NoC topology strategies seldom address the deficiencies caused by carrying different types of data on the same links: the high latency that results when data and control information congest each other, and the high power consumed in competing for paths, have become the performance bottleneck.

The purpose of this research is to propose a long-wire-connected and multi-channel 3D network-on-chip. Our strategy is to connect two nodes in the same row or column with a direct long wire, and to distribute these wires across different layers that are connected by 3D stacking technology. In a many-core system with an arbitrary number of cores built on this topology, every core-cache node pair is at most 5 hops apart, while short, real-time control information is transmitted over a 2D mesh network. Experiments on benchmarks such as SPLASH show that our network topology, which supports a large number of cores, reduces latency by up to 23% and energy consumption by up to 15% compared with the design in [13].

# 2. Related Work

Many researchers have worked on improving NoC performance in terms of network diameter and bandwidth. Adding extra links can reduce latency, but it also increases router complexity. In [13], low-radix routers are applied in a 3D network to satisfy power and area constraints. Integer Linear Programming is used to model the interconnection wires and router radix, and a realization of the topology is also given. In that topology, a clique (full connection) is constructed across layers between any two nodes. The structure of the interconnection lines on one layer is shown in Figure 1. In [14], information transmitted in NoCs is divided into two types: key information and non-key information. In [13], the cMesh scheme is employed to extend NoCs beyond 36 cores, which is not suitable for the era of hundreds of cores. Since the long wires must interconnect nodes in different rows or columns, they have to be long and bent, which complicates topology construction and consumes considerable power. At the same time, the proposed network and routing algorithm cannot identify the types of data and transmit them differently.

The inter-layer vertical channel in 3D NoCs has opened a new horizon for high-performance NoC design: communication between any two layers takes only one vertical hop. With this technology, we can design a NoC that offers large capacity, high bandwidth, and multiple channels with low memory latency. The low on-chip overhead and the multiple channels can support content-oriented NoC design.

# 3. A Real-World Example

Based on the above analysis, this section gives a real-world example that provides two types of networks for transmitting different data in NoCs. In the design, the core layer is at the bottom of the 3D NoC and the cache banks occupy the remaining layers. The topology uses a 2D mesh to interconnect the nodes on the core layer, and long interconnection wires to connect any two nodes in the same row or column on the cache layers.

As shown in Figure 2, full row interconnection is constructed in the plane formed by the rows containing nodes F, E, C, D, etc., and full column interconnection is constructed in the plane formed by the columns containing nodes C', D', B', A', etc. Since inter-layer vertical interconnection is cheap, all wires connecting the nodes of a given row or column are gathered in the same plane, where full interconnection between those nodes must be realized.

For instance, node F in Figure 2 would need seven wires on a single layer to interconnect with the other 7 nodes in its row. In our design, these 7 wires are instead distributed across the layers, and node E connects to the final node directly. Since a routing node can host only a limited number of long links, the advantage of this distribution is that the long links are allocated to different routing nodes, which makes our design feasible.
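One hypothetical way to realize such a distribution is a round-robin (circle-method) schedule: for a row with an even number of nodes n, the n(n-1)/2 node pairs are split into n-1 groups, each group is placed on its own layer, and every node then hosts exactly one long link per layer. The paper does not specify its allocation procedure, so the function name `assign_row_links_to_layers` and the one-link-per-node-per-layer policy below are assumptions made only for illustration.

```python
from typing import Dict, Tuple


def assign_row_links_to_layers(n: int) -> Dict[Tuple[int, int], int]:
    """Map every node pair (i, j), i < j, of an n-node row to a layer index.

    Assumption: n is even. Layer indices are relative to the first cache
    layer. The round-robin circle method is used, so each of the n - 1
    layers carries n / 2 disjoint long links.
    """
    if n < 2 or n % 2:
        raise ValueError("this sketch assumes an even number of nodes >= 2")
    layer_of: Dict[Tuple[int, int], int] = {}
    m = n - 1  # nodes 0..m-1 rotate around the circle, node m stays fixed
    for layer in range(m):
        pairs = [(layer, m)]
        for k in range(1, n // 2):
            pairs.append(((layer + k) % m, (layer - k) % m))
        for a, b in pairs:
            layer_of[(min(a, b), max(a, b))] = layer
    return layer_of
```

For an 8-node row such as the one containing node F, this yields 7 layers with 4 long links each, so the 7 wires of any single node end up on 7 different layers.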

In summary, both the row planes (nodes F, E, C, D, etc.) and the column planes (nodes C', D', B', A', etc.) provide global interconnection among their nodes. Because inter-layer vertical interconnection is cheap, the wires connecting the nodes of a given row or column are gathered in one plane, where global interconnection between those nodes is realized.

In the NoC shown in Figure 2, when data held at node F in the cache bank section is transferred to core A, it first finds, in the plane of the row containing F, a node such as D that lies in the row of the destination and connects to A directly. The data is then transmitted through the long interconnection wire ED to the plane containing node A. Next, within the plane containing node D, it finds the layer that holds the direct interconnection wire between node A and node D, and the remainder of the transfer completes quickly through vertical interconnection. The transfer distance is therefore at most 5 hops. When a node has no direct long interconnection wire on its own layer, it can reach the target node through a vertical path to a long wire on another layer.




Figure 1. Schematic Diagram of Long Curve Lines on Global Interconnection Layer



Figure 2. Example for Three Channels NoC

With multiple transfer channels, a suitable network can be chosen for each type of data according to its content. Control information between cores and memory access requests from core to cache are short and have strict real-time requirements, so they are best transferred through the 2D mesh. When cache data returns to a core, the long wires on the cache layers and the vertical paths between the cache and core layers are used. The L2 cache in an on-chip many-core system is far larger than the private L1 cache in each core, and data moves between the cache levels frequently. It is therefore essential to keep both data request signals and data transfers flowing smoothly.
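The content-based channel choice described above can be sketched as a small dispatch function. The message categories, the enum names, and the function `select_channel` are hypothetical and only mirror the cases named in the text; the paper does not define such an interface.

```python
from enum import Enum, auto


class MessageType(Enum):
    CONTROL = auto()         # control information between cores
    MEMORY_REQUEST = auto()  # memory access request from core to cache
    CACHE_DATA = auto()      # data returning from cache to core


class Channel(Enum):
    MESH_2D = auto()                 # 2D mesh on the core layer
    LONG_WIRE_AND_VERTICAL = auto()  # long wires on cache layers plus vertical paths


def select_channel(msg: MessageType) -> Channel:
    """Pick a channel from the message content (illustrative policy only)."""
    if msg in (MessageType.CONTROL, MessageType.MEMORY_REQUEST):
        return Channel.MESH_2D  # short, latency-critical traffic
    return Channel.LONG_WIRE_AND_VERTICAL  # bulk data returning to a core
```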

# 4. Proposed Topology Design

This paper leverages the main advantages of 3D stacking technology: short vertical interconnection wires and low latency. Any two nodes in the same row or column of a given vertical plane (frontal plane) are interconnected through long interconnection wires, and the wires for different node pairs are allocated to different layers of that plane. Nodes on the core layer are interconnected by a 2D mesh, which carries control information between cores and memory access requests from core to cache. Bulk data returning from the cache is transmitted through vertical interconnection and the channel formed by the long interconnection wires on the cache layers. To avoid a cache coherence bottleneck, there is no hierarchy among the cores on the core layer or among the caches on the cache layers, which together form a peer cache layer, or last-level cache (LLC). Data returning to a core is forwarded by the routers and is not written into cache on the way. This cache organization has two merits: (1) it avoids performance overhead, such as power consumption and latency, caused by frequent movement of data in and out of caches; (2) it avoids writing the same data into caches repeatedly and provides a large-capacity cache.

# 4.1. Basic Principles

To realize the ideal diameter of 2 hops in a 3D mesh network, a direct interconnection wire must exist between any two nodes in every layer, so that communication between core and cache can be completed with one horizontal and one vertical link. However, a curved interconnection line is needed when two nodes are in different rows and columns. Such a line does not shorten the transfer distance, and the routing strategy and routing algorithm for curved lines are complex as well. In many-core chips with hundreds of cores, it is impossible to interconnect every pair of nodes directly. Since curved interconnection lines do not shorten the transfer distance in a NoC, we remove the curved long links: two nodes in the same row or column interconnect with each other directly, and nodes in different rows and columns interconnect through a routing node that forwards the data. The diameter of the NoC thus increases from 3 hops to 5 hops, comprising 2 hops over long horizontal interconnection wires and 3 hops over short vertical interconnection lines. Compared with the 3-hop diameter network, our design adds only the cost of an inter-layer vertical interconnection and keeps the same horizontal wire length. Since inter-layer vertical interconnection is fast and consumes little power, we reduce the layout scale by 1/n at the cost of a small increase in energy consumption and latency. Moreover, the design provides two types of channels that support content-oriented NoC design. Note that in our scheme, long interconnection wires connect any two nodes in the same row or column of a given vertical plane in the 3D NoC, and the wires for different node pairs are distributed across different layers of the same plane.

To realize a true 5-hop diameter network, we need to ensure that the distance between any core-cache pair is at most 5 hops. For any pair of nodes (i, j), there exists a middle node k that connects i and j through long links, i.e., i is connected to k and k is connected to j by long interconnection wires. These two wires may be on different layers; in other words, we realize a semi-interconnection NoC whose interconnection wires are spread over different layers.

# 4.2. Content-Oriented Routing Algorithm

Based on the methodology described above, once the topology is generated, the routing algorithm is determined as well. The 2D mesh on the core layer adopts the XY routing algorithm. When a core and a cache bank are in different horizontal or vertical planes, there are two paths between them, and we choose one of them through the ZXZYZ routing algorithm. The details of the algorithm are shown in Table 3. When a core at coordinate (x, y, 0) generates a request to a cache bank (u, v, w) on layer w, core (x, y, 0) sends a request packet containing the source and destination addresses into the 2D mesh network using the common XY algorithm; the packet travels through the routers to core (u, v, 0) and finally reaches cache bank (u, v, w) via the vertical path. When data from cache bank (u, v, w) is transmitted to core (x, y, 0), the router checks whether a long link between (x, y) and (u, v) exists in any layer l. If so, the routing path is computed as $(u, v, w) \rightarrow (u, v, l) \rightarrow (x, y, l) \rightarrow (x, y, 0)$. If no long link between (x, y) and (u, v) exists, the packet is transmitted along the vertical path. Each router on the way checks its routing table for a long link, in any layer, from its node to the vertical plane of column x. If one exists, the data is forwarded over that long link. If not, the packet continues along the vertical path until it reaches the vertical plane of y.
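The two routing cases above can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' implementation: `link_layer` is a hypothetical lookup that returns the layer holding the long wire between two positions in the same row or column, it is assumed to always succeed, and the per-router table search along the vertical path is collapsed into that single lookup.

```python
from typing import Callable, List, Tuple

Node = Tuple[int, int, int]  # (x, y, layer); layer 0 is the core layer
Planar = Tuple[int, int]     # (x, y) position within a layer


def xy_request_path(core: Node, bank: Node) -> List[Node]:
    """Request path: XY routing on the core-layer mesh, then one vertical hop."""
    x, y, _ = core
    u, v, w = bank
    path = [(x, y, 0)]
    while x != u:                 # X dimension first
        x += 1 if u > x else -1
        path.append((x, y, 0))
    while y != v:                 # then Y dimension
        y += 1 if v > y else -1
        path.append((x, y, 0))
    if w != 0:
        path.append((u, v, w))    # vertical path up to the cache bank
    return path


def zxzyz_reply_path(bank: Node, core: Node,
                     link_layer: Callable[[Planar, Planar], int]) -> List[Node]:
    """Reply path from cache bank to core using at most Z, X, Z, Y, Z hops."""
    u, v, w = bank
    x, y, _ = core
    path = [bank]

    def hop(node: Node) -> None:
        if node != path[-1]:      # skip zero-length hops
            path.append(node)

    cur = (u, v)
    waypoints = []
    if u != x:
        waypoints.append((x, v))  # long wire along X
    if v != y:
        waypoints.append((x, y))  # long wire along Y
    for nxt in waypoints:
        layer = link_layer(cur, nxt)
        hop((cur[0], cur[1], layer))  # Z: move to the layer that holds the wire
        hop((nxt[0], nxt[1], layer))  # X or Y: traverse the long wire
        cur = nxt
    hop((x, y, 0))                # final Z: down to the core layer
    return path
```

When the bank and the core share a row or column, the reply path reduces to the three-hop form $(u, v, w) \rightarrow (u, v, l) \rightarrow (x, y, l) \rightarrow (x, y, 0)$ given in the text; otherwise it uses one intermediate node and at most 5 hops.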

# 4.3. Routing Table

We use routing tables to implement our routing algorithm. Each router has a lookup table, associated with its coordinate, that lists the long wires reachable along its vertical path in the form <row or column, layer, port number>, where the port number identifies the interconnection wire that connects the routing nodes. For example, <x, l, 0> means that the node reaches plane x through port 0 on layer l.
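A minimal sketch of such a table is given below. The class and field names (`RoutingTable`, `LongLink`, `plane`) and the `("row", index)` / `("col", index)` encoding of the target plane are assumptions introduced for illustration; the paper only specifies the three-field entry format.

```python
from typing import Dict, NamedTuple, Optional, Tuple

Plane = Tuple[str, int]  # hypothetical encoding: ("row", index) or ("col", index)


class LongLink(NamedTuple):
    plane: Plane  # row or column plane reached by the long wire
    layer: int    # layer on which the long wire lies
    port: int     # router port attached to that wire


class RoutingTable:
    """Per-router lookup table of long wires reachable along its vertical path."""

    def __init__(self) -> None:
        self._by_plane: Dict[Plane, LongLink] = {}

    def add(self, plane: Plane, layer: int, port: int) -> None:
        self._by_plane[plane] = LongLink(plane, layer, port)

    def lookup(self, plane: Plane) -> Optional[LongLink]:
        """Return the entry for the target plane, or None if no long wire exists."""
        return self._by_plane.get(plane)
```

A router would call `lookup` with the destination plane; a `None` result corresponds to the case in which the packet keeps moving along the vertical path.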

# 5. Experiment and Evaluation

In [13], a fully interconnected 3D NoC (full NoC) is proposed, in which all nodes can interconnect horizontally through vertical interconnection wires. With current technology, it supports at most 36 cores. Since there are numerous interconnection wires, we model each layer through Integer Linear Programming and adopt an X-Y-Z style algorithm to realize routing. In this section, we present a simulation-based performance evaluation of our proposed 3D network topology (interconnection NoC) and compare it with existing proposals.

# 5.1. Performance Evaluation

Consider a 3D NoC that consists of L layers with N x N nodes per layer, whose bottom layer is a 2D mesh of processors. The full interconnection NoC and our interconnection NoC compare as follows.

The total number of interconnection wires needed by the interconnection NoC is:

$$N^{2}(N - 1)$$

The number of interconnection wires of the full interconnection NoC is:

$$C_{N \times N}^{2} = \frac{N^{2}(N^{2} - 1)}{2}$$

Suppose the average energy consumption of a single long wire is 4 pJ. With 400 cores (N = 20), the interconnection NoC needs $20^2(20 - 1) = 7600$ long wires, while the full interconnection NoC needs $\frac{20^2(20^2-1)}{2} = 79800$. At 4 pJ per wire this corresponds to 30400 pJ versus 319200 pJ, so the energy consumption of the latter is about 10 times that of the former.
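The two counts for N = 20 can be checked with a few lines of Python. The function names are hypothetical, and the derivation used for the row/column case (N rows plus N columns, each fully connected among its N nodes) is an assumption that reproduces the value $N^2(N-1)$ used in the text.

```python
from math import comb


def semi_interconnect_wires(n: int) -> int:
    """Long wires when only nodes in the same row or column are linked.

    Assumed derivation: n rows plus n columns, each a clique of n nodes.
    """
    return 2 * n * comb(n, 2)


def full_interconnect_wires(n: int) -> int:
    """Long wires when every pair of the n * n nodes is linked directly."""
    return comb(n * n, 2)


n = 20
semi = semi_interconnect_wires(n)
full = full_interconnect_wires(n)
print(semi, full, full / semi)  # prints: 7600 79800 10.5
```

With a fixed per-wire energy, the ratio of the wire counts carries over directly to the ratio of the energy estimates.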

# 5.2. Experiment Verification

The experiments use the many-core simulator Graphite [15] together with Pin [16], Orion [17] and McPAT [18]. MIT Graphite is an open-source, distributed parallel simulator for many-core architectures that supports up to 1000 processors. Graphite leverages Pin to capture the parts that need to be simulated, such as memory accesses, system calls and network traffic, while the other parts run in direct execution mode. The relevant parameters of the simulation environment are shown in Table 1.

Table 1. Parameters of Simulation System

| Component | Parameter            | Value       |
|-----------|----------------------|-------------|
| L1 Cache  | Cache line size      | 64 B        |
|           | Cache bank size      | 32 KB       |
|           | Associativity        | 4           |
|           | Replacement strategy | LRU         |
|           | Data access time     | 1 cycle     |
|           | Tag access time      | 1 cycle     |
| L2 Cache  | Cache line size      | 64 B        |
|           | Cache bank size      | 512 KB/bank |
|           | Associativity        | 16          |
|           | Replacement strategy | LRU         |
|           | Data access time     | 9 cycles    |
|           | Tag access time      | 3 cycles    |
| Others    | Network size         | 5 x 5 x 2   |
|           | Routing algorithm    | DOR         |
|           | Router delay         | 2 cycles    |
|           | Flit size            | 128 bits    |
|           | Core frequency       | 4 GHz       |
|           | Number of cores      | 25          |
|           | Technology node      | 45 nm       |
|           | Network frequency    | 1 GHz       |
|           | DRAM delay           | 100 ns      |

In this paper, we choose six typical applications from the test programs shipped with the Graphite simulator as benchmarks: barnes, fft, fmm, radix, volrend and ocean contiguous. We assume a size of 1 flit for a data request packet from core to cache and 5 flits for a return packet from cache to core.

Figure 4 and Figure 5 give the running latency and power consumption of each benchmark program on the 36-core semi-interconnection NoC, which transmits data according to content, and on the full interconnection NoC.

The interconnection NoC (semi-interconnection) transmits control information and memory access information through the 2D mesh on the core layer. Data is transferred through the inter-layer vertical paths and the intra-layer long wires on the cache layers, so the performance of all tested programs improves. The results show up to 23% average latency reduction and up to 15% energy reduction compared with mixed transfer in the full interconnection NoC (global interconnection). The design can satisfy the low-latency, low-overhead communication requirements of many-core chip systems. With higher throughput or more cores, the latency and energy reductions are expected to be larger. Hence, the content-based interconnection NoC can provide computing with lower latency, lower energy consumption and higher performance than a global interconnection construction with mixed transfer.



Figure 4. Comparison of Latency

Figure 5. Comparison of Energy

# 6. Conclusion

This paper introduces a low-latency, energy-efficient network-on-chip with three channels for different types of data. The new topology reduces the congestion caused by different types of data competing for the same channel: control data between cores and access requests between cores and caches are not blocked by the large volume of data transmitted from caches to cores, which improves both latency and energy. In a many-core system with an arbitrary number of cores built on this topology, every core-cache node pair is at most 5 hops apart, and short, real-time control information is transmitted over a 2D mesh network. Compared with a 3D network-on-chip, the results show up to 23% lower network latency and up to 15% lower energy consumption.

# Acknowledgement

This research was supported by the science and technology project of the Education Department of Jiangxi Province (Project No: GJJ11493, GJJ10175) and the opening project of the Engineering Research Center of Nuclear Technology Application, Ministry of Education, East China Institute of Technology (Project No: HJSJYB2010-11). It was also supported by the National Natural Science Foundation of China (Fund No: 60973010).

# References

- [1] Shekhar Borkar. *Thousand Core Chips: A Technology Perspective*. DAC '07: Proceedings of the 44<sup>th</sup> Annual Design Automation Conference. 2007; 746-749.
- [2] Tan Hai. Research of object-based implicit parallel program many-core architecture. Computer Engineering and Design. 2013; 34(2): 623-626.
- [3] H Khan, F Shi, W Ji, Y Gao, Y Wang, C Liu, N Deng, J Li. Computationally efficient locality-aware interconnection topology for multi-processor system-on-chip (MP-SoC). *Chinese Science Bulletin*, 2010; 55 (29): 3363-3371
- [4] A Kumar, LS Peh, P Kundu, NK Jha. *Express Virtual Channels: Towards the Ideal Interconnection Fabric.* Proc. of the 34th Int. Sym. on Comp. Arch. 2007; 150-161.
- [5] R Mullins, A West, S Moore. *Low-Latency Virtual-Channel Routers for On-Chip Networks*. Proc. of the 31st Int. Sym. on Comp. Arch., 2004; 188-197.
- [6] Lei Z, Ning W, Fen G. A JointCoding Scheme with Crosstalk Avoidance in Network on Chip. *TELKOMNIKA Indonesian Journal of Electrical Engineering*. 2013; 11(1): 1-8
- [7] UY Ogras, R Marculescu. It's a Small World After All: NoC Performance Optimization via Long-Range Link Insertion. *IEEE Trans. on VLSI Sys.*, 2006; 14(7): 693-706.
- [8] J Kim, J Balfour, WJ Dally. *Flattened Butterfly Topology for On-Chip Networks*. Proc. of the 40th Int. Sym. on Microarchitecture. 2007; 172-182.
- [9] Arash FB, Mohammad H, Jose LN. Configurable Router Design for Dynamically Reconfigurable Systems based on the SoCWire NoC. *International Journal of Reconfigurable and Embedded Systems (IJRES)*. 2013; 2(1)

- [10] JW Joyner, P Zarkesh-Ha, JD Meindl. A Stochastic Global Net-Length Distribution for a Three-Dimensional System-on-Chip (3D-SoC). The 14th IEEE Int. ASIC/SOC Conf., 2001; 147-151.
- [11] J Kim, C Nicopoulos, D Park, R Das, Y Xie, V Narayanan, MS Yousif, C Das. A Novel Dimensionally-Decomposed Router for On-Chip Communication in 3D Architecture. Proc. of the 34th Int. Sym. on Comp. Arch., 2007; 4-15.
- [12] D Park, S Eachempati, R Das, AK Mishra, Y Xie, V Narayanan, C Das. MIRA: A Multi-Layered On-Chip Interconnect Router Architecture. Proc. of the 35th Int. Sym. on Comp. Arch., 2008; 251-261.
- [13] Yi X, D Yu, Z Bo, et al. *A low-radix and low-diameter 3D interconnection network design.* International Symposium on High Performance Computer Architecture (HPCA). 2009; 30-42.
- [14] Flores A, J Aragon, M Acacio. Heterogeneous Interconnects for energy-efficient Message Management in cmps. *IEEE Transactions on Computers*. 2010; 59(1): 16-28.
- [15] Jason EM, Harshad K, et al. *Graphite: A Distributed Parallel Simulator for Multicores.* International Symposium on High Performance Computer Architecture (HPCA). 2010; 1-12
- [16] MM Bach, M Charney, R Cohn, E Demikhovsky, et al. Analyzing parallel programs with Pin. *Computer.* 2010; 43(3): 34-41.
- [17] Hang-Sheng W, Xinping Z, Li-Shiuan P, Sharad M. Orion: A Power-Performance Simulator for Interconnection Networks. Proc. of Int. Sym. on Micro. 2002; 294-305.
- [18] Sheng L, Jung HA, Strong RD, Brockman JB, Tullsen DM, Jouppi NP. McPAT: An integrated power, area, and timing modeling framework for multicore and manycore architectures. International Symposium on Computer Architecture. 2009; 469-480.