Defect- and Variation-Tolerant Logic Mapping in Nanocrossbar Using Bipartite Matching and Memetic Algorithm

High defect density and extreme parameter variation make it very difficult to implement reliable logic functions in crossbar-based nanoarchitectures. Tolerating defects and variations simultaneously is a major design challenge for such architectures. In this paper, a method based on bipartite matching and a memetic algorithm is proposed for the defect- and variation-tolerant logic mapping (D/VTLM) problem in crossbar-based nanoarchitectures. In the proposed method, the search space of the D/VTLM problem is dramatically reduced through the introduction of the min-max weight maximum-bipartite-matching (MMW-MBM) and a related heuristic bipartite matching method. MMW-MBM is defined on a weighted bipartite graph as an MBM in which the maximal weight of the edges in the matching has a minimal value. In addition, a defect- and variation-aware local search (D/VALS) operator is proposed for D/VTLM and embedded in a global search framework. The D/VALS operator utilizes the domain knowledge extracted from problem instances and, thus, has the potential to search the solution space more efficiently. The performance of the proposed method is verified on a 3-bit adder and a large set of random benchmarks of various scales, in comparison with state-of-the-art heuristic and recursive algorithms and a simulated annealing algorithm.

NOMENCLATURE
G_1      Weighted bipartite graph of crossbar architecture.
G_2      Bipartite graph of logic function.
N        Population size.
P        Parents.
B        Offspring.
t        Iteration counter.
f        Fitness value.
λ        Greedy strength factor.
P_cross  Probability of crossover.
P_mut    Probability of mutation.
P_ls     Probability of local search.

I. INTRODUCTION
NANOELECTRONICS [1], [2] has emerged with the hope of extending Moore's law beyond CMOS in the long-term future. It is expected to achieve much higher device density and operation frequency than conventional CMOS technologies. Recently, the world's first programmable nanoprocessor consisting of programmable, nonvolatile nanowire transistor arrays (PNNTAs) [3] has been published. That work represents an important breakthrough for logic circuits built from the bottom-up paradigm [4] and shows tremendous opportunities for future computing systems. However, the nanochips produced by both the bottom-up process [4] and nanoimprint techniques [5] are inherently prone to high defect density and extreme parameter variation. This is because of the extremely small size of the nanodevices and the difficulty of controlling the fabrication process precisely.
The exact level of defect density is still unknown, but it is reasonable to expect that 1%-15% of the resources, e.g., wires, switches, transistors, and so on, on a nanochip will be defective [6]. The Quantum Science Research group at Hewlett-Packard fabricated an 8 × 8 crossbar architecture using molecular switches at the crosspoints by nanoimprint lithography, where 15% of the switches were defective [5]. The researchers at Harvard and MITRE characterized the threshold voltage values of nodes from the fabricated PNNTA structure in both active and inactive states. They found that only 86% of nodes in the active state and 87% of nodes in the inactive state met the voltage requirements [3].
This work is licensed under a Creative Commons Attribution 3.0 License. For more information, see http://creativecommons.org/licenses/by/3.0/
Parameter variation, e.g., fluctuations in length, width, oxide thickness, flat-band conditions, and so on, impacts both conventional and emerging technologies [7]. As devices scale, some individual parts of a device are made up of fewer atoms. If merely a single atom is out of place, the device characteristics change significantly. Variations in the geometry of the devices induce serious performance variations in the circuits. For example, density variations in carbon nanotube growth can compromise the reliability of carbon nanotube field-effect transistors (FETs) and result in increased delay variations [8]. Another example is the fin FET (FinFET) device: it has been shown through practical measurements and theoretical formulations [9] that quantum effects have a great impact on the performance of FinFETs, while the body thickness primarily determines these effects.
As stated above, the design process is significantly complicated by the lack of determinacy, and the situation is expected to worsen as devices scale. To deal with such high defect density and extreme parameter variation simultaneously, one promising design paradigm for logic function implementation on nanochips is defect- and variation-tolerant logic mapping (D/VTLM): given a nanoarchitecture and a logic function to be implemented on it, find a mapping of the logic function to the architecture with consideration of defects and variations.
Without consideration of the variations, the defect-tolerant logic mapping (DTLM) problem is equivalent to the subgraph isomorphism problem (SIP): return a subgraph of the (bipartite) graph G_1 that is isomorphic to the (bipartite) graph G_2. SIP is a well-known NP-complete problem [10]. When variations are considered, the D/VTLM problem is an extended version of SIP, which can be defined as follows: return a subgraph of G_1 that is isomorphic to G_2 and has a minimal cost (e.g., path delay) among all subgraphs of G_1 that are isomorphic to G_2.
A number of methods have been proposed to deal with the DTLM problem, such as the recursive algorithm [11] based on backtracking and pruning, as well as various heuristic algorithms [12]-[15]. However, the runtime of the recursive algorithm is acceptable only for small-scale problems due to its recursive nature, while all the heuristic algorithms rely on fixed heuristics that show a strong bias in favor of only a small set of problems. The D/VTLM problem is more complicated due to the additional consideration of variations: not only must a valid mapping be found, but the path delay must also be optimized. A simulated annealing algorithm (SA) [16] was first suggested due to its capability of exploration. The SA method is effective for variation tolerance, but its efficiency is poor due to the huge search space. Recently, a set of integer linear programming (ILP) formulations was introduced in [17], but the ILP-based method yields good results only on small-scale problems.
The mapping flow proposed in [18] is an efficient attempt at DTLM that uses a divide-and-conquer strategy, in which maximum-bipartite-matching (MBM) is introduced to reduce the search space. Inspired by this, this paper proposes a new matching problem and method.
Specifically, the min-max weight MBM (MMW-MBM) problem is defined in this paper for the first time, and a heuristic MMW-MBM method is also presented. MMW-MBM is defined on a weighted bipartite graph as an MBM in which the maximal weight of the edges in the matching has a minimal value among all MBMs of the given graph. A naive way to find the MMW-MBM is to find all MBMs in a given graph first, and then select the one whose maximal edge weight is minimal. Instead of using such an enumeration method, a heuristic method for the MMW-MBM problem is developed to improve the time efficiency. Based on the MMW-MBM model and the heuristic algorithm, the D/VTLM problem can be transformed into a pin assignment optimization problem.
For real-world optimization problems, it is often effective to incorporate problem-specific knowledge into local search strategies, which are referred to as memes in the case of memetic algorithms (MAs) [19], [20]. This paper proposes one such operator, called defect- and variation-aware local search (D/VALS). The key idea is inherited from the greedy reassignment (GR) local search operators for DTLM [18] and VTLM [21], which reassign the values of parts of the individual by taking advantage of the greedy information extracted from the problems. However, instead of utilizing defect information [18] or variation information [21] individually, the D/VALS operator utilizes the combined information of both defects and variations.
Based on MMW-MBM (model and method) and the D/VALS operator, an MA is constructed to optimize the logic mapping in the reduced search space of D/VTLM. The MA employs a genetic algorithm (GA) as the global search and the D/VALS operator as the local search. With appropriate coordination, the MA can not only exhibit the good explorative ability of a population-based global search algorithm but also deliver the good exploitative performance of a local search algorithm. The performance of the proposed method is verified on a 3-bit adder and a large set of random benchmarks of various scales, in comparison with the state-of-the-art recursive [11] and heuristic [14] algorithms and the SA method [16]. Experimental results show that the proposed method achieves good efficiency and effectiveness.
The novelty of this paper can be summarized as follows. First, MMW-MBM is defined in this paper for the first time to reduce the search space of the D/VTLM problem. Second, instead of the enumeration method, a heuristic method is developed to find an MMW-MBM efficiently. Third, the D/VALS operator is designed to take both defect and variation information into account, and is embedded in an MA framework.
The rest of this paper is organized as follows. Section II introduces the problem background and definition. In Section III, the search space of D/VTLM is reduced by the MMW-MBM model. The details of the D/VALS operator and the MA are presented in Section IV. Experimental studies and comparisons are given in Section V. Section VI concludes this paper.

A. Nanocrossbar Architecture
A nanoelectronic crossbar consists of two layers of orthogonal nanowires. The region where two wires cross is called a junction or crosspoint, which may be configured to implement a logic device. The assembly process is stochastic, so the probability of aligning three-terminal devices is very low, while a two-terminal connection can be established more easily. Therefore, two-terminal devices, such as nanowire FETs, diodes, and molecular switches, are preferred [6].
In this paper, both stuck-at-open defects and stuck-at-close defects are considered. The stuck-at-open defect is the most representative and most common defect in nanocrossbar architectures [22]. A stuck-at-open defect means that there is either a nonprogrammable switch or a missing switch at the crosspoint; thus, the two wires crossing at this crosspoint are always disconnected. A stuck-at-close defect means that the switch at the crosspoint is permanently programmed, so the entire input wire and the entire output wire that cross there cannot be used. It is notable that the defect modeling of nanoelectronics is still an ongoing research problem. Without loss of generality, we assume that the defects are independent and uniformly distributed, as in previous works [23], [24]. This is a commonly employed assumption for theoretical research [25], which allows us to focus on the essence of the proposed method instead of the physical details of the defects. The approach presented in this paper can easily be extended to other defect types (nanowire open defects and nanowire bridging defects [26]) and other defect distributions (e.g., clustered distribution [27]) by slightly modifying the following graph model, as discussed in [28].
An example of a defective 3 × 3 nanoelectronic crossbar is shown in Fig. 1. The crossbar consists of two sets of orthogonal nanowires. The vertical nanowires are the inputs (i's), whereas the horizontal nanowires are the outputs (o's). There is a programmable switch at each crosspoint. The nonprogrammable defective switches at the crosspoints are each represented by an X.
The crossbar can be represented by a weighted bipartite graph G_1(I, O, C, W), as shown in Fig. 2. In this scenario, I represents the set of input nanowires, and O represents the set of output nanowires. C consists of representative edges for all the programmable crosspoints in the crossbar. W is the set of delay variations corresponding to the crosspoints.
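The graph model G_1(I, O, C, W) can be sketched in Python as follows. This is a minimal illustration that assumes a hypothetical dictionary representation in which each programmable crosspoint (i, o) maps to its delay; the function name, the 3 × 3 defect pattern, and the delay values are invented for illustration and are not those of Fig. 1.

```python
from typing import Dict, List, Tuple

Crosspoint = Tuple[int, int]  # (input nanowire index, output nanowire index)


def crossbar_graph(programmable: List[List[bool]],
                   delay: List[List[float]]) -> Dict[Crosspoint, float]:
    """Return C and W together: programmable crosspoints mapped to their delays.

    programmable[i][o] is False for a stuck-at-open crosspoint, which is
    modeled simply by leaving the edge (i, o) out of the graph.
    """
    return {
        (i, o): delay[i][o]
        for i, row in enumerate(programmable)
        for o, ok in enumerate(row)
        if ok
    }


# Hypothetical 3 x 3 crossbar with two stuck-at-open crosspoints.
programmable = [[True, True, False],
                [False, True, True],
                [True, True, True]]
delay = [[48.0, 52.0, 50.0],
         [51.0, 47.0, 55.0],
         [49.0, 50.0, 46.0]]
G1 = crossbar_graph(programmable, delay)
```

A missing key such as (0, 2) then plays the role of a missing edge between an input and an output nanowire.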

B. Problem Definition
A two-level logic function in a sum-of-products form can be represented by a bipartite graph G_2(V, P, E), as shown in Fig. 3. In this scenario, V represents the set of logic variables, and P represents the set of product terms. E consists of representative edges for the corresponding product terms containing the variables.
When using a crossbar structure to implement a two-level logic function, the logical relationships between the variables and the product terms in the logic function can be represented by the connections between vertical and horizontal nanowires in the crossbar. This logic-function-to-crossbar mapping problem can be formulated as an extended SIP: return a subgraph of G_1 that is isomorphic to G_2 and has a minimum cost (e.g., path delay) among all subgraphs of G_1 that are isomorphic to G_2.
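The D/VTLM problem can be formally defined as follows. Given a defective m_1 × m_2 crossbar weighted bipartite graph G_1(I, O, C, W) and an n_1 × n_2 logic function bipartite graph G_2(V, P, E), find a node mapping M (M: V → I, P → O) such that every edge e(v, p) ∈ E is mapped to a programmable crosspoint c(M(v), M(p)) ∈ C and Cost(M) is minimized. Cost(M) is the maximum path delay associated with the outputs of a crossbar after logic mapping. It is calculated in the proposed model as [17]

$$\mathrm{Cost}(M) = \max_{p \in P} \mathrm{Cost}(M_p) \tag{1}$$

where Cost(M_p) represents the path delay associated with the product term p.
1) For an FET-based nanocrossbar, the path delay of an output nanowire is decided by all activated crosspoints

$$\mathrm{Cost}(M_p) = \sum_{v:\, e(v,p) \in E} w\bigl(M(v), M(p)\bigr) \tag{2}$$

2) For a diode-based nanocrossbar, the path delay of an output nanowire is decided only by the activated crosspoint that has the maximum delay

$$\mathrm{Cost}(M_p) = \max_{v:\, e(v,p) \in E} w\bigl(M(v), M(p)\bigr) \tag{3}$$

where w(M(v), M(p)) ∈ W is the delay of the crosspoint c(M(v), M(p)).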
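The cost model can be sketched in Python as follows. The sketch assumes that the FET case sums the delays of the activated crosspoints on an output nanowire and the diode case takes their maximum, following the two cases described above; the data layout (a dictionary of programmable crosspoints, product terms as lists of variable indices) and all numeric values are hypothetical.

```python
from typing import Dict, List, Optional, Tuple


def mapping_cost(G1: Dict[Tuple[int, int], float],
                 product_terms: List[List[int]],
                 imv: List[int],
                 omv: List[int],
                 fet: bool = True) -> Optional[float]:
    """Return Cost(M), or None if M needs a non-programmable crosspoint.

    imv[v] is the input nanowire assigned to variable v;
    omv[p] is the output nanowire assigned to product term p.
    """
    worst = 0.0
    for p, variables in enumerate(product_terms):
        delays = []
        for v in variables:
            crosspoint = (imv[v], omv[p])
            if crosspoint not in G1:      # defect: mapping is invalid
                return None
            delays.append(G1[crosspoint])
        # FET: all activated crosspoints contribute; diode: only the slowest.
        path = sum(delays) if fet else max(delays, default=0.0)
        worst = max(worst, path)
    return worst


# Hypothetical crossbar graph and logic function (p0 = v0 v2, p1 = v1 v2).
G1 = {(0, 0): 48.0, (0, 1): 52.0, (1, 1): 47.0, (1, 2): 55.0,
      (2, 0): 49.0, (2, 1): 50.0, (2, 2): 46.0}
terms = [[0, 2], [1, 2]]
fet_cost = mapping_cost(G1, terms, imv=[0, 1, 2], omv=[0, 2], fet=True)
diode_cost = mapping_cost(G1, terms, imv=[0, 1, 2], omv=[0, 2], fet=False)
invalid = mapping_cost(G1, terms, imv=[0, 1, 2], omv=[2, 0])
```

Returning None for an invalid mapping keeps defect tolerance (validity) separate from variation tolerance (the delay value).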

III. MIN-MAX WEIGHT MAXIMUM-BIPARTITE-MATCHING
In fact, a mapping trial M consists of two mappings: M(v): V → I and M(p): P → O. Therefore, we can employ two decision vectors to represent the mapping trial M: an input mapping vector (IMV) and an output mapping vector (OMV).
In principle, we could search the whole solution space spanned by IMV and OMV, as previous work did [16], but the extremely large size of the search space, P(m_1, n_1) × P(m_2, n_2), makes the problem very hard to solve with limited computational resources, where P(m, n) is the number of n-permutations of m. In order to solve the problem efficiently, the following parts show that the problem can be solved in a divide-and-conquer way by introducing MMW-MBM, where IMV is optimized by a metaheuristic algorithm (Section IV) and OMV is determined by a heuristic algorithm (Section III).
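To give a feeling for the sizes involved, the following Python sketch evaluates P(m_1, n_1) × P(m_2, n_2) and the size of the IMV-only space that remains once OMV is determined by matching. The function names are hypothetical; `math.perm` requires Python 3.8 or later.

```python
import math


def full_search_space(m1: int, n1: int, m2: int, n2: int) -> int:
    """Number of (IMV, OMV) pairs: P(m1, n1) * P(m2, n2)."""
    return math.perm(m1, n1) * math.perm(m2, n2)


def imv_search_space(m1: int, n1: int) -> int:
    """Number of IMVs alone: P(m1, n1)."""
    return math.perm(m1, n1)


# Example sizes: a 16-variable, 31-product-term function on a 17 x 32 crossbar.
full = full_search_space(17, 16, 32, 31)
reduced = imv_search_space(17, 16)
reduction_factor = full // reduced  # equals P(32, 31)
```

Under these assumptions the reduction factor is exactly P(m_2, n_2), since one OMV is derived for each candidate IMV instead of being searched.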

A. Reducing the Search Space by MMW-MBM Model
As suggested in previous works on DTLM [12], [14], once logic variables have been assigned to input nanowires (IMV), the solution space of the other mapping vector (OMV) is severely restricted. For example, consider Figs. 2 and 3: if IMV is set as [1, 2, 3], which means v_1 is assigned to i_1, v_2 is assigned to i_2, and v_3 is assigned to i_3, then p_1 cannot be assigned to o_1, because there is no edge between i_2 and o_1 in the crossbar bipartite graph. Therefore, we can construct a weighted bipartite graph to model which product terms can be assigned to which output nanowires and the corresponding cost (path delay), as shown in Fig. 4. While creating the weighted bipartite graph, we add one node on the left side for each product term p, and one node on the right side for each output nanowire o. An edge between p and o indicates that the product term p is compatible with the defect pattern of the crossbar and can be realized by o. The associated weight is the path delay associated with the output o, which can be calculated according to (2) [or (3)]. For example, the weight between
Then, if we only consider defect tolerance [18], the problem is to find a complete assignment from the product terms to the output nanowires, which is equivalent to the MBM problem [29]. Consider an undirected bipartite graph G = (U, V, E), where U and V are disjoint and all edges in E go between U and V. A matching is a subset of edges Mat ⊆ E such that for all vertices v ∈ U ∪ V, at most one edge of Mat is incident on v. We say that a vertex v ∈ U ∪ V is matched by matching Mat if some edge in Mat is incident on v; otherwise, v is unmatched. A maximum matching is a matching of maximum cardinality, that is, a matching Mat such that for any matching Mat′, we have |Mat| ≥ |Mat′|. The set of dashed lines in Fig. 4 is an MBM in the graph. The MBM problem can be solved by the Hungarian method or the Ford-Fulkerson method [29].
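The construction of the weighted product-term/output graph for a fixed IMV can be sketched as follows. This is an illustrative sketch only: it models stuck-at-open defects as missing crosspoints, assumes that outputs affected by stuck-at-close defects have already been removed, and uses hypothetical names and delay values.

```python
from typing import Dict, List, Tuple


def compatibility_graph(G1: Dict[Tuple[int, int], float],
                        num_outputs: int,
                        product_terms: List[List[int]],
                        imv: List[int],
                        fet: bool = True) -> Dict[Tuple[int, int], float]:
    """Return edges (p, o) -> path delay for each output o that can realize p."""
    edges = {}
    for p, variables in enumerate(product_terms):
        for o in range(num_outputs):
            try:
                delays = [G1[(imv[v], o)] for v in variables]
            except KeyError:
                continue  # some required crosspoint on o is not programmable
            edges[(p, o)] = sum(delays) if fet else max(delays, default=0.0)
    return edges


# Hypothetical crossbar graph and logic function (p0 = v0 v2, p1 = v1 v2).
G1 = {(0, 0): 48.0, (0, 1): 52.0, (1, 1): 47.0, (1, 2): 55.0,
      (2, 0): 49.0, (2, 1): 50.0, (2, 2): 46.0}
edges = compatibility_graph(G1, num_outputs=3,
                            product_terms=[[0, 2], [1, 2]],
                            imv=[0, 1, 2])
```

In this toy instance p0 cannot use output 2 and p1 cannot use output 0, so the OMV choices are already restricted before any matching is computed.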
However, if we consider defect tolerance and variation tolerance simultaneously, the problem is to find a complete assignment from the product terms to the output nanowires with minimum cost (1). The MBM problem therefore needs to be extended to a new bipartite matching problem, the MMW-MBM problem. MMW-MBM is defined on a weighted bipartite graph as an MBM in which the maximal weight of the edges in the matching has a minimal value among all MBMs of the given graph. Note that the MMW-MBM problem is quite different from the maximum (minimum) weighted bipartite matching, which is defined on a complete weighted bipartite graph as a complete matching in which the sum of the weights of the edges in the matching has a maximal (minimal) value. A naive way to find the MMW-MBM is to find all MBMs in the given graph first, and then select the one whose maximal edge weight is minimal. Instead of using such an enumeration method, a heuristic method for the MMW-MBM problem is presented to improve the efficiency.
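The naive enumeration can be sketched as follows to make the definition concrete. This is an exponential-time illustration suitable only for tiny graphs; the function name and the example weights are hypothetical.

```python
from typing import Dict, Tuple


def brute_force_mmw_mbm(num_left: int, edges: Dict[Tuple[int, int], float]):
    """Enumerate all matchings and return (cardinality, max weight, matching).

    Among matchings of maximum cardinality, the one whose heaviest edge is
    lightest is kept.
    """
    best = (0, 0.0, {})

    def extend(p, used, match):
        nonlocal best
        if p == num_left:
            size = len(match)
            heaviest = max((edges[(u, v)] for u, v in match.items()),
                           default=0.0)
            if size > best[0] or (size == best[0] and heaviest < best[1]):
                best = (size, heaviest, dict(match))
            return
        extend(p + 1, used, match)            # leave p unmatched
        for (u, v) in edges:
            if u == p and v not in used:      # try every free neighbor of p
                match[p] = v
                extend(p + 1, used | {v}, match)
                del match[p]

    extend(0, frozenset(), {})
    return best


edges = {(0, 0): 97.0, (0, 1): 102.0, (1, 1): 97.0, (1, 2): 101.0}
size, heaviest, matching = brute_force_mmw_mbm(2, edges)
```

Here three matchings have the maximum cardinality 2, with heaviest edges 97, 101, and 102; the first is the MMW-MBM, although the matching of minimum total weight happens to coincide with it only because the graph is so small.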

B. Heuristic MMW-MBM Method
Consider an undirected weighted bipartite graph G = (U, V, E, W), where U and V are disjoint and all edges in E go between U and V. Our heuristic removes the edges in G step by step in descending order of their weights, while an MBM algorithm (the Ford-Fulkerson method) is used to check the cardinality of the current MBM until the cardinality decreases. The heuristic method is iterative, as shown in Algorithm 1. The algorithm starts with an initial matching Mat obtained by the Ford-Fulkerson method [29] (line 1), and then sorts E in descending order of weight (line 2). At each iteration, the edge e in E with maximal weight is removed (lines 4 and 5), and a new matching Mat′ is obtained by running the Ford-Fulkerson method on the new graph G′ (line 6). If the cardinality of Mat′ is equal to that of Mat, we update Mat to Mat′ (line 8); otherwise, we break the loop (line 10) and return Mat as the MMW-MBM (line 13). This process is repeated until the loop is broken or E is empty.
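The edge-removal loop described above can be sketched in Python as follows. This is not a transcription of Algorithm 1: for brevity, the matching subroutine is a plain augmenting-path search (the unit-capacity special case of the Ford-Fulkerson method), and all names and weights are hypothetical.

```python
from typing import Dict, Tuple


def max_matching(num_left: int, edges) -> Dict[int, int]:
    """Maximum bipartite matching by repeated augmenting-path search."""
    adj = {}
    for (u, v) in edges:
        adj.setdefault(u, []).append(v)
    owner = {}  # right vertex -> left vertex currently matched to it

    def augment(u, seen):
        for v in adj.get(u, []):
            if v in seen:
                continue
            seen.add(v)
            # Take v if it is free, or if its owner can be re-matched.
            if v not in owner or augment(owner[v], seen):
                owner[v] = u
                return True
        return False

    for u in range(num_left):
        augment(u, set())
    return {u: v for v, u in owner.items()}


def heuristic_mmw_mbm(num_left: int,
                      edges: Dict[Tuple[int, int], float]) -> Dict[int, int]:
    """Remove edges from heaviest to lightest until the cardinality drops."""
    mat = max_matching(num_left, edges)
    remaining = dict(edges)
    for e in sorted(edges, key=edges.get, reverse=True):
        del remaining[e]
        new_mat = max_matching(num_left, remaining)
        if len(new_mat) != len(mat):
            break               # e was indispensable; keep the previous Mat
        mat = new_mat
    return mat


edges = {(0, 0): 97.0, (0, 1): 102.0, (1, 1): 97.0, (1, 2): 101.0}
mat = heuristic_mmw_mbm(2, edges)
bottleneck = max(edges[(p, o)] for p, o in mat.items())
```

On this small instance the loop discards the edges of weight 102 and 101 and stops when removing an edge of weight 97 would reduce the cardinality.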
Given a bipartite graph G = (U, V, E), one can use the Ford-Fulkerson method [29] to find an MBM, as shown in Algorithm 2. The trick is to construct a flow network in which flows correspond to matchings. The corresponding flow network G′ = (V′, E′) for G is defined as follows. Let the source s and sink t be new vertices, and let V′ = U ∪ V ∪ {s, t}. The directed edges of G′ are the edges of E, directed from U to V, along with |U ∪ V| new directed edges: E′ = {(s, u): u ∈ U} ∪ {(u, v): (u, v) ∈ E} ∪ {(v, t): v ∈ V}. To complete the construction, unit capacity is assigned to each edge in E′. Thus, given an undirected bipartite graph G, one can find an MBM by creating the flow network G′ (line 1), running the Ford-Fulkerson method (lines 2-5), and directly obtaining a maximum matching Mat from the integer-valued maximum flow f found (line 6). The Ford-Fulkerson algorithm starts with f(u, v) = 0 for all u, v ∈ V′, giving an initial flow of value 0 (line 2). At each iteration, the flow value is increased by finding an augmenting path, which can be thought of simply as a path from the source s to the sink t along which more flow can be sent, and augmenting the flow along it (lines 3-5). This process is repeated until no augmenting path can be found. The max-flow min-cut theorem proves that upon termination, this process yields a maximum flow.
The heuristic method returns an MBM of the input graph, since the resulting matching Mat has the same cardinality as the initial MBM obtained from the input graph (line 1 in Algorithm 1). Besides, the edges in G are removed in descending order of their weights, so the maximal weight of the edges in the resulting matching Mat has a minimal value among all MBMs of the input graph. Therefore, the heuristic method can indeed find an MMW-MBM in the given graph, with much higher efficiency than the enumeration method.
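The flow-network construction can be sketched as follows. This is a minimal, unoptimized sketch of the Ford-Fulkerson method with depth-first augmenting paths on a unit-capacity network; the vertex labels and function name are hypothetical, and it is not a transcription of Algorithm 2.

```python
from collections import defaultdict


def mbm_by_max_flow(U, V, E):
    """Find a maximum bipartite matching via unit-capacity max flow."""
    s, t = "s", "t"
    cap = defaultdict(int)
    adj = defaultdict(set)

    def add_edge(a, b):
        cap[(a, b)] = 1
        adj[a].add(b)
        adj[b].add(a)  # reverse arc, used only through residual capacity

    for u in U:
        add_edge(s, ("U", u))
    for (u, v) in E:
        add_edge(("U", u), ("V", v))
    for v in V:
        add_edge(("V", v), t)

    flow = defaultdict(int)  # skew-symmetric: flow[(b, a)] == -flow[(a, b)]

    def find_path(node, visited):
        if node == t:
            return [t]
        visited.add(node)
        for nxt in adj[node]:
            residual = cap[(node, nxt)] - flow[(node, nxt)]
            if residual > 0 and nxt not in visited:
                rest = find_path(nxt, visited)
                if rest is not None:
                    return [node] + rest
        return None

    while True:
        path = find_path(s, set())
        if path is None:
            break  # no augmenting path: the flow is maximum
        for a, b in zip(path, path[1:]):
            flow[(a, b)] += 1
            flow[(b, a)] -= 1

    # Edges from U to V that carry one unit of flow form the matching.
    return {u: v for (u, v) in E if flow[(("U", u), ("V", v))] == 1}


matching = mbm_by_max_flow([0, 1], [0, 1], [(0, 0), (0, 1), (1, 0)])
```

The example needs a genuine augmentation whenever vertex 0 is first sent to vertex 0, which is the situation the residual (reverse) arcs exist to handle.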
Given a bipartite graph G = (U, V, E), the Ford-Fulkerson method finds an MBM in O(|U ∪ V||E|) time [29]. Since the Ford-Fulkerson method is used in the inner loop of the proposed heuristic method, it may seem that the heuristic method would be very time-consuming. Indeed, one edge is removed from the graph in each iteration, so the worst-case time complexity of the heuristic method is O(|U ∪ V||E|^2). Fortunately, in the scenario of the D/VTLM problem, the graphs to be dealt with by the heuristic method are highly sparse. If we assume that the edge density of the m × m crossbar bipartite graph G_1 is p, and the edge density of the n × n logic function graph G_2 is q, the probability that a product term can be realized by an output nanowire can be calculated as p^{qn} [12], which is the edge density of the input graph G in Algorithm 1. Therefore, the time complexity of the heuristic method is O((m + n)(mn p^{qn})^2).
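The sparsity argument can be checked numerically with a few lines of Python. The helper names are hypothetical, and the second function evaluates only the expression inside the O(·) bound, ignoring constant factors.

```python
def expected_edges(m: int, n: int, p: float, q: float) -> float:
    """Expected edge count of the product-term/output graph: m * n * p**(q*n)."""
    return m * n * p ** (q * n)


def heuristic_bound(m: int, n: int, p: float, q: float) -> float:
    """Value of (m + n) * (m * n * p**(q*n))**2, without constant factors."""
    return (m + n) * expected_edges(m, n, p, q) ** 2


# Example: a 32 x 32 crossbar with 80% programmable crosspoints and a
# 32 x 32 logic function of 30% density (illustrative values only).
edge_density = 0.8 ** (0.3 * 32)
edges_32 = expected_edges(32, 32, 0.8, 0.3)
```

Because the exponent qn grows with the function size, the edge density p^{qn} falls quickly as n increases, which is what keeps the inner matching calls cheap.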

IV. MEMETIC ALGORITHM FOR D/VTLM
Given an IMV, the search space of OMV can be significantly reduced by creating the corresponding weighted bipartite graph, which models which product terms can be assigned to which output nanowires and the corresponding path delay. Furthermore, the proposed heuristic method can be employed to find an exact MMW-MBM between product terms and output nanowires. Therefore, the next problem is how to choose an optimized IMV, so that the resulting MMW-MBM not only satisfies full defect tolerance (every product term corresponds to an output nanowire) but also exhibits good variation tolerance (minimized path delay). Due to the NP-hardness of the optimization of IMV, an MA is proposed. Besides incorporating an evolutionary computation framework to enhance the global optimization, the proposed MA gains good performance by incorporating successful elements of previous effective greedy mapping algorithms.

A. Objective Function
Based on the obtained MMW-MBM, the following objective function can be defined for the given IMV:
where dt represents the capability of defect tolerance of the given IMV, vt represents the capability of variation tolerance of the given IMV, and α is used to tune the impact of dt and vt on the objective function. m_p ∈ {0, 1} indicates whether product term p has a corresponding output nanowire o in the MMW-MBM under the given IMV, while weight w_p represents the impact of product term p on the dt value. delay_M represents the maximal weight of the edges in the MMW-MBM, while delay_C represents the maximal delay of the output nanowires in the crossbar. Based on MMW-MBM (model and algorithm), the D/VTLM problem is transformed into optimizing the pin assignment from logic variables to input nanowires (IMV), evaluated by (4)-(6).
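One way such an objective could be evaluated is sketched below. The specific forms used here are assumptions made for illustration and should not be read as the exact expressions (4)-(6): a convex combination f = α·dt + (1 − α)·vt, a weight-normalized dt, and vt = 1 − delay_M/delay_C. Only the symbols and their meanings are taken from the text.

```python
from typing import List


def fitness(matched: List[bool], weights: List[float],
            delay_m: float, delay_c: float, alpha: float = 0.8) -> float:
    """Hypothetical fitness of an IMV (larger is better).

    matched[p] plays the role of m_p, weights[p] the role of w_p.
    ASSUMPTION: dt is the weighted fraction of matched product terms,
    vt is 1 - delay_M / delay_C, and f = alpha * dt + (1 - alpha) * vt.
    """
    dt = sum(w for ok, w in zip(matched, weights) if ok) / sum(weights)
    vt = 1.0 - delay_m / delay_c
    return alpha * dt + (1.0 - alpha) * vt


# Two product terms with 2 and 3 variables, weighted as w_p = v_p ** 4.
weights = [2 ** 4, 3 ** 4]
full = fitness([True, True], weights, delay_m=50.0, delay_c=100.0)
partial = fitness([True, False], weights, delay_m=50.0, delay_c=100.0)
```

With a large α, any IMV that leaves a heavily weighted product term unmatched scores below an IMV that matches everything, regardless of delay, which reflects the priority of defect tolerance.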

B. Framework of the MA
The outline of the proposed MA is given in Algorithm 3. A GA is used as the evolutionary computation framework of the MA owing to its record of success on many assignment problems [30]-[34]. The detailed design of the elementary steps of the algorithm is introduced as follows.
The algorithm starts with an initial population of N (population size) random individuals (line 2). Each random individual solution is evaluated according to (4)-(6) (line 3). The encoding of IMV solutions used in the implementation is straightforward. We encode the permutation π (a permutation of the set M = {1, 2, ..., m}) as a vector of input nanowires, such that the value j of the ith component in the vector indicates that the input nanowire j is assigned to logic variable i (π(i) = j). It is notable that the logic function size n is smaller than the crossbar architecture size m in some cases; thus, IMV is an incomplete permutation. In order to take advantage of off-the-shelf crossover operators, such as CX recombination [30], the complete permutation π is used instead of the incomplete permutation. However, only the first n components are decoded as IMV for the MMW-MBM-based fitness evaluation.
During every generation t, the population of N individuals generates N children through the crossover operator (line 10), the mutation operator (line 13), and the local search operator (line 14). The offspring are evaluated according to (4)-(6) (line 17) and then added to the current population.
The CX recombination operator [30] has been shown to be an effective operator for assignment problems. It preserves the information contained in both parents in the sense that all alleles of the offspring are taken either from the first or from the second parent. The operator does not perform any implicit mutation, since an input nanowire j that is assigned to variable i in the child is also assigned to variable i in one or both parents. In the first phase, all input nanowires found at the same variable in the two parents are assigned to the corresponding variables in the offspring. Then, starting with a randomly chosen variable with no assignment, a nanowire is randomly chosen from the two parents. After that, additional assignments are made to ensure that no implicit mutation occurs. Then, the next unassigned variable to the right (if we are at the end of the genome, we proceed at its beginning) is processed in the same way until all variables have been considered.
Since the logic function size n may be smaller than or equal to the crossbar architecture size m, the mutation operator is applied in two cases.
1) If n < m, we randomly select a gene within alleles 1 ∼ n to be mutated and exchange its value with another gene from the last m − n genes.
2) If n = m, we randomly select two genes and then exchange their values.
The local search operator will be explained in detail in Section IV-C. Selection occurs twice in the main loop of the proposed MA.
Selection for reproduction (line 9) is performed before the crossover operator is applied; it is purely random, without any bias to filter individuals. Selection for survival (line 19) is performed to reduce the population to its original size, which is achieved by choosing the best N individuals from the pool of parents and children [30].
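The encoding, the cycle-preserving recombination, and the two-case mutation described above can be sketched as follows. This is an illustrative reading of the description rather than the reference implementation of the CX operator in [30]; zero-based indices and all function names are assumptions of the sketch.

```python
import random


def cx(parent1, parent2, rng):
    """Recombine two permutations so that child[i] is parent1[i] or parent2[i]."""
    m = len(parent1)
    child = [None] * m
    # Phase 1: keep every assignment shared by both parents.
    for i in range(m):
        if parent1[i] == parent2[i]:
            child[i] = parent1[i]
    parents = [parent1, parent2]
    where = [{w: i for i, w in enumerate(parent1)},
             {w: i for i, w in enumerate(parent2)}]
    start = rng.randrange(m)
    for k in range(m):                    # scan to the right, wrapping around
        i = (start + k) % m
        if child[i] is not None:
            continue
        a = rng.randrange(2)              # parent that supplies this cycle
        b = 1 - a
        j = i
        while child[j] is None:
            child[j] = parents[a][j]
            # The value the other parent had here must also come from parent a.
            j = where[a][parents[b][j]]
    return child


def mutate(perm, n, rng):
    """Swap a used gene with an unused one (n < m) or two used genes (n == m)."""
    m = len(perm)
    child = list(perm)
    if n < m:
        i, j = rng.randrange(n), rng.randrange(n, m)
    else:
        i, j = rng.sample(range(m), 2)
    child[i], child[j] = child[j], child[i]
    return child


def decode_imv(perm, n):
    """Only the first n components are used as the IMV."""
    return perm[:n]


rng = random.Random(1)
p1, p2 = [0, 1, 2, 3, 4], [1, 0, 2, 4, 3]
child = cx(p1, p2, rng)
mutant = mutate(p1, 3, rng)
```

Copying whole cycles from one parent is what guarantees that the child remains a valid permutation without any repair step.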

C. Defect/Variation-Aware Local Search
The local search operator developed for the MA can be regarded as a type of knowledge-guided mutation. Given a parent chromosome, the operator produces a child chromosome that is expected to outperform the parent. The key idea of the operator is inherited from the previous GR local search operators for DTLM [18] and VTLM [21]. However, instead of utilizing defect information [18] or variation information [21] individually, the operator utilizes the combined information on both defects and variations. Thus, the proposed operator is called D/VALS here. One piece of knowledge that has proven effective on some instances for defect tolerance is that a more frequently used variable needs more functional crosspoints. By assigning the most frequently used variables in the product terms to the input nanowires with the smallest number of defects, the greedy assignment heuristic may find a feasible solution with a higher probability [13], [14]. In addition, for variation tolerance, an intuitive piece of greedy knowledge is that a more frequently used variable should be assigned to an input nanowire with a minimal delay [21].
Since the operator is designed to be complementary to the stochastic search of the GA, the incorporation of the operator should maintain the stochastic search. Strong greediness would weaken the stochastic nature of the global search, resulting in premature convergence. Therefore, a control parameter is introduced to limit the elements (genes) of the given solution (parent chromosome) that are operated on by the local search. Specifically, only n_1 · λ variables and their corresponding n_1 · λ input nanowires are randomly selected, where n_1 is the number of variables, and 0 ≤ λ ≤ 1 is called the greedy strength factor here.
In order to reduce the time overhead added to the iterative process of the GA, the time complexity of the operator should be as low as possible. In fact, the attributes of variables and nanowires can be obtained in advance: the number of product terms that include each variable v, the number of functional crosspoints on each input nanowire i, and the path delay associated with each input nanowire i (under the assumption that all the functional crosspoints are active). They are denoted as Degree(v), Degree(i), and Delay(i), respectively. The property of an input nanowire i is measured as Property(i) = α · Degree(i) − Delay(i), which considers the defect-tolerant capability (more functional crosspoints) and the variation-tolerant capability (less path delay) at the same time. Parameter α is used to make sure that defect tolerance is the key task, and thus, its value is set such that α > max_i Delay(i). To sum up, the greedy information of the problem instance only needs to be extracted once before the optimization.
The outline of the proposed D/VALS is given in Algorithm 4. Given a pin assignment (IMV), n_1 × λ variables and their corresponding n_1 × λ input nanowires are randomly selected and marked as unvisited (line 1). Then, a defect/variation-aware GR heuristic is applied to these selected variables and nanowires to get a new solution (IMV). While there are unvisited variables (line 2), we choose the unvisited variable v whose Degree(v) is maximal (line 3) and the unvisited nanowire i whose property is the best (line 4), assign i to v (line 5), and mark v and i as visited (line 6). When the list of unvisited variables is empty, we get the new pin assignment (line 8).
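The greedy reassignment can be sketched as follows. This is an illustrative sketch, not a transcription of Algorithm 4: zero-based indices are used, the helper names are hypothetical, and rounding n_1 · λ to at least two selected variables is an assumption of the sketch (with fewer than two, no reassignment is possible).

```python
import random


def wire_property(degree_i, delay_i):
    """Property(i) = alpha * Degree(i) - Delay(i), with alpha > max Delay(i)."""
    alpha = max(delay_i) + 1.0  # any value above the largest delay works
    return [alpha * d - t for d, t in zip(degree_i, delay_i)]


def dvals(perm, n, degree_v, prop_i, lam, rng):
    """Greedily reassign a random subset of variables among their own nanowires."""
    k = min(n, max(2, round(n * lam)))    # ASSUMPTION: at least two variables
    chosen = rng.sample(range(n), k)      # selected (unvisited) variables
    wires = [perm[v] for v in chosen]     # their current input nanowires
    # Most frequently used variable first, best nanowire first.
    by_degree = sorted(chosen, key=lambda v: degree_v[v], reverse=True)
    by_property = sorted(wires, key=lambda i: prop_i[i], reverse=True)
    child = list(perm)
    for v, i in zip(by_degree, by_property):
        child[v] = i
    return child


# Hypothetical instance: 3 variables on a crossbar with 4 input nanowires.
degree_v = [1, 3, 2]                      # Degree(v)
prop = wire_property(degree_i=[3, 1, 2, 3],
                     delay_i=[100.0, 90.0, 95.0, 80.0])
child = dvals([0, 1, 2, 3], n=3, degree_v=degree_v, prop_i=prop,
              lam=1.0, rng=random.Random(0))
```

With λ = 1 every variable is reassigned, so the most frequently used variable (index 1) receives the best of the three nanowires currently in use; with small λ only a few genes change and the rest of the parent is preserved.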
The importance of the D/VALS operator is as follows.
1) Compared with previous local search methods (such as the 2-opt [30] and the fast-2-opt [30] heuristics) that can be commonly used for combinatorial optimization
2) The greedy strength factor λ provides flexible control over the randomness or greediness of the operator. The randomness of the operator decreases, and its greediness increases, as λ increases. When λ = 1, the whole IMV is operated on according to the GR heuristic. In order to coordinate with the stochastic search of the GA, the factor λ should be given a very small value (λ = 0.1 in our scenario). Although only a small part of the given solution is operated on by the GR, the quality of the newly generated solution is improved with a high probability. For example, for randomly generated solutions, the fitness (4) is improved with an average probability of 70%∼80% by performing D/VALS in our experiments on large-scale benchmarks.
3) The following experiments (Section V) will show the advantages of introducing the D/VALS operator to the global optimization.

A. Parameter Setting
In objective function (4), parameter α is set a big value, α = 0.8, since defect tolerance is the primary task. As suggested in the previous work [18], the value of weight w p is related to the number of variables v p in product term p, that is, it is harder to map a produce term p, whose v p is larger. Therefore, w p is set as v 4 p experimentally. There is no difference between FET-based nanocrossbar (2) and diodebased nanocrossbar (3) from the perspective of the mapping algorithms, and the comparisons between different algorithms are consistent for both cases, so we record the former (2) in the experiment section to save space. Since the computational complexity of the MMW-MBM-based fitness evaluation does not allow evolving large populations in reasonable time, the population size N is set N = 10 after testing N values from 2 to 20 experimentally. A large greedy strength factor λ will weaken the stochastic nature of evolutionary algorithm, thus we set λ = 0.1 empirically. We set optimal parameters P cross = 80%, P mut = 20%, and P ls = 100% experimentally by cross-validation.
All the experiments in this paper are performed on a platform with two 2.33-GHz quad-core Intel Xeon E5410 processors and 12 GB of memory. However, all tested algorithms are implemented as monolithic processes, and no CPU core parallelism is exploited.

B. Case Study of 3-bit Adder
A 3-bit adder, a widely used benchmark [23], [24], [26], is first used to test the performance of the different algorithms. The adder is implemented by two-level logic in sum-of-products form. It requires 16 input wires and 31 output wires for logic operations, giving a minimum crossbar area of 16 × 31 = 496, and uses 147 crosspoints [24], so its logic density is approximately 30%. We attempted to map the 3-bit adder to 20 randomly generated 17 × 32 crossbar architectures with 10%-30% stuck-at-open defect density ($p_o$) and 0.1% stuck-at-close defect density ($p_c$). Delay variations of the crosspoints are generated from a Gaussian distribution (μ = 50 and 3σ = 30), as in [17].
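The benchmark setup described above can be illustrated with a minimal sketch. It is not the generator used in the experiments: the function names, the per-crosspoint independent sampling, and the tuple representation are assumptions made here for illustration. The only values taken from the text are the 17 × 32 size, the defect densities, and the Gaussian delay parameters (3σ = 30 gives σ = 10).

```python
import random

def random_crossbar(rows, cols, p_open, p_closed, mu=50.0, sigma=10.0, seed=None):
    """Return a rows x cols grid of crosspoints (hypothetical representation).

    Each crosspoint is ("open", None), ("closed", None), or ("ok", delay),
    where delay is drawn from a Gaussian with mean mu and std. dev. sigma.
    Assumption: defects are sampled independently for every crosspoint.
    """
    rng = random.Random(seed)
    grid = []
    for _ in range(rows):
        row = []
        for _ in range(cols):
            u = rng.random()
            if u < p_open:
                row.append(("open", None))        # stuck-at-open defect
            elif u < p_open + p_closed:
                row.append(("closed", None))      # stuck-at-close defect
            else:
                row.append(("ok", rng.gauss(mu, sigma)))  # usable, with a delay
        grid.append(row)
    return grid

def logic_density(used_crosspoints, inputs, outputs):
    """Fraction of the minimum crossbar area occupied by the logic function."""
    return used_crosspoints / (inputs * outputs)

# One 17 x 32 crossbar with 10% stuck-at-open and 0.1% stuck-at-close defects.
xbar = random_crossbar(17, 32, p_open=0.10, p_closed=0.001, seed=0)

# Logic density of the 3-bit adder: 147 / (16 * 31), roughly 0.30.
adder_density = logic_density(147, 16, 31)
```

A seeded `random.Random` instance is used so that each of the 20 crossbars could be regenerated by fixing a different seed.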
The heuristic mapping algorithm (HMA) [14], the recursive mapping algorithm (RMA) [11], and the SA [16] are three representative algorithms for the DTLM and the D/VTLM whose performance has been demonstrated in previous work. Therefore, they are used for comparison here. We use cutoff times of 10, 20, and 30 s for the SA and the proposed MA when the stuck-at-open defect densities are 10%, 20%, and 30%, respectively. Since the RMA is a recursive algorithm, we grant it four times the cutoff time: 40, 80, and 120 s. The HMA uses greedy pin assignment and incomplete graph construction strategies, so it is always the fastest. All the algorithms are run independently 20 times on each test instance. Tables I-III record the success rate of each algorithm, together with the average runtime and the average path delay if the algorithm finds valid mappings in the 20 runs. Note that the HMA is a deterministic algorithm, so repeated runs give the same result. Besides, the HMA and the RMA are proposed only for defect tolerance, so they cannot provide mappings with optimized path delay.
Table I shows the results when the stuck-at-open defect density of the crossbars is 10%. It can be seen that the following holds.
1) The HMA has a success rate of 100% on all test instances.
2) The RMA can find valid mappings on all instances, but the success rate is relatively low (<60%) on several test instances (4 out of 20).
3) The SA fails on several test instances (6 out of 20), but has a high success rate on the other instances.
4) The MA has a success rate of 100% on most test instances (18 out of 20).
5) The runtime of these algorithms is acceptable.
6) Compared with the SA, the path delay is slightly reduced by the MA.
Table II shows the results when the stuck-at-open defect density of the crossbars is 20%. It can be seen that the following holds.
1) The HMA has a success rate of 100% on most test instances (15 out of 20).
2) The RMA can find valid mappings on more than half of the instances (13 out of 20), but the success rate is very low.
3) The SA fails on most test instances (13 out of 20), and has low success rate (<50%) on other instances.
4) The MA has a success rate of 100% on most test instances (16 out of 20).
5) The runtime of these algorithms is still acceptable, although the runtime of the RMA increases a lot.
6) Compared with the SA, the path delay is significantly reduced by the MA.
Table III shows the results when the stuck-at-open defect density of crossbars is 30%. It can be seen that the following holds.
1) The HMA has a success rate of 100% on nearly half of the test instances (11 out of 20).
2) The RMA and the SA can find valid mappings on few instances (3 out of 20), but the success rate is close to zero (5%).
3) The MA has a success rate of 100% on most test instances (19 out of 20).
4) The runtime of these algorithms is still acceptable, except for that of the RMA.
5) Compared with the SA, the path delay is significantly reduced by the MA.
For the 3-bit adder, the above simulation results (Tables I-III) reveal that the HMA is a good choice for defect tolerance when the defect density is relatively low (10% or 20%). The RMA and the SA work well only in the case of low defect density (10%), and the RMA becomes very time-consuming as the defect density increases (30%). The MA is effective and efficient in all cases, and a success rate of nearly 100% can be achieved. Besides, compared with the SA, the MA provides better path-delay optimization.

C. Random Benchmark Instances
As can be seen from the above simulations, given a fixed crossbar size and defect density, the results of the same algorithm differ considerably across crossbar architectures. This is because their defect patterns are different. The same holds for logic functions: even if both the size and the logic density are fixed, the difficulty of mapping different logic blocks varies considerably because their logic patterns differ. To provide a sound and fair evaluation and comparison of the different algorithms, a large set of benchmark graphs for logic functions and crossbar architectures is generated randomly, as in previous works [13], [16]. For the benchmark graphs of crossbar architectures, we set the stuck-at-open defect density to 10% and the stuck-at-close defect density to 0.1%. Delay variations of the crosspoints (weights) are generated from a Gaussian distribution (μ = 50 and 3σ = 30), as in [17]. For the benchmark graphs of logic functions, we set the logic density to 40%, a typical value as suggested in [16].
We use cutoff times of 10, 20, and 60 s for the SA and the proposed MA when the logic function sizes are 16 × 16, 24 × 24, and 48 × 48, respectively. We attempt to map the logic functions to 20 randomly generated 16 × 16, 24 × 24, and 52 × 52 crossbar architectures. Since the RMA is a recursive algorithm, we grant it four times the cutoff time: 40, 80, and 240 s. The HMA uses greedy pin assignment and incomplete graph construction strategies, so it is always the fastest. All the algorithms are run independently 20 times on each test instance. Tables IV-VI record the simulation results of the different algorithms, including the following.
1) Psucc: The success rate of the algorithm, i.e., the fraction of the 20 runs that found a valid mapping.
2) AvgT: The average runtime (in seconds) of the algorithm if it finds valid mappings in 20 runs.
3) AvgD: The average path delay (Delay M ) of the mapping if the algorithm finds valid mappings in 20 runs.
Note that the HMA is a deterministic algorithm, so repeated runs give the same result. Besides, the HMA and the RMA are proposed only for defect tolerance, so they cannot provide mappings with optimized path delay.
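As an illustration of how Psucc, AvgT, and AvgD could be derived from per-run records, consider the following sketch. The record layout and the function name are hypothetical, and the sketch assumes that AvgT and AvgD are averaged over the successful runs only, which the text does not state explicitly.

```python
def summarize_runs(runs):
    """Summarize repeated runs of one algorithm on one test instance.

    runs: list of (success, runtime_s, delay) tuples (hypothetical layout);
    delay is None when no valid mapping was found.
    Returns (psucc, avg_t, avg_d); avg_t and avg_d are None if every run failed.
    Assumption: averages are taken over successful runs only.
    """
    successes = [r for r in runs if r[0]]
    psucc = len(successes) / len(runs)
    if not successes:
        return psucc, None, None
    avg_t = sum(r[1] for r in successes) / len(successes)
    avg_d = sum(r[2] for r in successes) / len(successes)
    return psucc, avg_t, avg_d

# Four illustrative runs (made-up numbers): two succeed, two hit the cutoff.
example_runs = [
    (True, 2.0, 60.0),
    (False, 10.0, None),
    (True, 4.0, 70.0),
    (False, 10.0, None),
]
psucc, avg_t, avg_d = summarize_runs(example_runs)
```

For the four illustrative runs, this gives a success rate of 0.5, an average runtime of 3.0 s, and an average path delay of 65.0.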
We also perform statistical tests on the runtimes and path delays of the paired evolutionary algorithms (EAs), the SA versus the MA, on each single benchmark instance. In particular, a two-tailed t-test is conducted with the null hypothesis that there is no difference between the two algorithms under comparison. The null hypothesis is rejected if the p-value is smaller than the significance level α = 0.05. The runtime (or the path delay) of the algorithm that is statistically shorter than that of the other EA is highlighted in bold in the tables.
20), the success rate is very low (<40%).
3) The runtime of these algorithms is acceptable, although the runtime of the RMA is several orders of magnitude longer than that of the other algorithms. It is evident that the MA is much faster than the SA.
4) There is no obvious difference between the SA and the MA from the viewpoint of path optimization.
As can be seen from Tables IV and V, all the algorithms fail on some test instances. One possible reason is that there is no valid solution at all. Another possible reason is that the runtime granted to the RMA, the SA, and the MA is not long enough to complete the search. We could check this by using the enumeration method or by granting the algorithms a much longer runtime. Since this is not the main concern of this paper, these simulations are omitted to save space.
Table VI shows the results when we map 48 × 48 logic functions to 52 × 52 crossbars. It can be seen that the following holds.
1) The HMA has a success rate of 100% on several test instances (6 out of 20  [35], given fixed defect density, the density of valid mappings increases with the crossbar size, so we consider a slightly larger size here, 52 × 52.
For random benchmarks, the above simulation results (Tables IV-VI) reveal that the HMA is a good choice for defect tolerance when the problem scale is relatively small (16 × 16 or 24 × 24). The RMA works only on small-scale problems (16 × 16) and does not work on large-scale problems (48 × 48) even when granted a very long runtime. The SA works well for defect and variation tolerance when the problem scale is relatively small (16 × 16 or 24 × 24). The MA performs best, and a success rate of nearly 100% can be achieved on most test instances. Besides, compared with the SA, the MA is much faster and provides better path-delay optimization on large-scale problems (48 × 48).
It is very difficult to get a high success rate on some test instances. We think that the reason is twofold. The first is the high computational complexity of the problem, which is NP-complete. In this case, we can run the proposed algorithm more than once; since the algorithm is a stochastic search in nature (rather than a deterministic search), this increases the chance of finding a valid mapping. As shown in Fig. 5, the success rate increases significantly with the number of runs. The second is that the instance may have no valid mapping at all, so that a valid mapping cannot be found even by the enumeration method. In this case, we can use a larger crossbar to implement the function, or partition the function across multiple crossbars.
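The benefit of repeated runs can be made concrete with a short calculation. The sketch below is not taken from the experiments: it assumes that the runs are independent and share the same per-run success probability p, in which case the probability of at least one success in k runs is 1 - (1 - p)^k. The function names and the example value p = 0.05 are illustrative only.

```python
import math

def prob_at_least_one_success(p_single, runs):
    """Probability that at least one of `runs` independent runs succeeds,
    assuming each run succeeds with the same probability p_single."""
    return 1.0 - (1.0 - p_single) ** runs

def runs_needed(p_single, target):
    """Smallest number of independent runs whose combined success probability
    reaches `target`; requires 0 < p_single < 1 and 0 < target < 1."""
    return math.ceil(math.log(1.0 - target) / math.log(1.0 - p_single))

# With an illustrative per-run success probability of 5%, 20 runs find a
# valid mapping with probability 1 - 0.95**20, a little under two thirds.
p20 = prob_at_least_one_success(0.05, 20)

# Number of runs needed to reach a 90% chance under the same assumption.
k90 = runs_needed(0.05, 0.90)
```

Under these assumptions, even a hard instance with a low per-run success rate becomes likely to be mapped after a few dozen runs, provided a valid mapping exists at all.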

D. Effectiveness of the Defect/Variation-Aware Local Search
In the proposed MA, the D/VALS operator is introduced to utilize domain knowledge, so that the MA is expected to search the solution space more efficiently. A natural question is whether the D/VALS operator makes any positive contribution to the performance of the algorithm. To answer this question, we remove it from the algorithm while keeping all the other parts unchanged. Therefore, another evolutionary algorithm is added to the comparison: a GA that follows the flow of the MA, but without the D/VALS operator. The parameters of the GA are set to the same values as those of the MA. Table VII shows the results when we map 48 × 48 logic functions to 52 × 52 crossbars. It can be seen clearly that incorporating the D/VALS operator yields significantly better results on all test instances. This is consistent with other results that demonstrated the advantage of using domain knowledge in evolutionary search [36]. Compared with the GA, the Psucc, AvgT, and AvgD of the MA are all improved substantially.

E. Statistical Comparisons Over Multiple Benchmarks
In Sections V-B to V-D, we have shown the performance of the algorithms (the SA, the GA, and the MA) on each independent benchmark instance. In order to statistically compare these algorithms over multiple benchmark instances, we perform the Friedman test [37], which is based on the ranks of the compared algorithms. The Friedman test is used in conjunction with the Bonferroni-Dunn test [38] as a post-hoc test, in which all estimators are compared with the control estimator. The performance of two algorithms is significantly different if the corresponding average ranks differ by at least the critical difference
$$\mathrm{CD} = q_{\alpha} \sqrt{\frac{j(j+1)}{6T}} \tag{7}$$
where j is the number of algorithms (j = 3), T is the number of benchmark instances (T = 20) for a given problem scale, and the critical values $q_{\alpha}$ can be found in [39]. For example, when j = 3, $q_{0.05}$ = 2.241, where the subscript 0.05 is the significance level. We rank the algorithms on Psucc, and record the ranking of each algorithm as 1, 2, or 3. Average ranks are assigned in the case of ties. The average rank of a single algorithm is obtained by averaging over all data sets. Fig. 6 shows the Friedman test results of the algorithms on large-scale problems. Since we employ the significance level 0.05, the critical difference is CD = 0.71 with j = 3 and T = 20. It can be seen that the differences in average rank between the MA and the SA and between the MA and the GA are greater than the critical difference, so the differences are significant, which means that the MA is significantly better than the SA and the GA in these cases.
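The ranking procedure and the critical difference in (7) can be sketched as follows. This is an illustration only: the function names and the two-instance example are hypothetical, and only the values j = 3, T = 20, and q = 2.241 come from the text.

```python
import math

def average_ranks(scores):
    """Average rank of each algorithm over all instances (rank 1 = best).

    scores: one list per benchmark instance holding one score per algorithm
    (here Psucc, higher is better). Tied algorithms share the average of the
    rank positions they occupy.
    """
    n_alg = len(scores[0])
    totals = [0.0] * n_alg
    for row in scores:
        order = sorted(range(n_alg), key=lambda a: -row[a])  # best first
        start = 0
        while start < n_alg:
            end = start
            while end + 1 < n_alg and row[order[end + 1]] == row[order[start]]:
                end += 1
            shared = (start + end) / 2 + 1  # mean of 1-based ranks start+1..end+1
            for k in range(start, end + 1):
                totals[order[k]] += shared
            start = end + 1
    return [t / len(scores) for t in totals]

def critical_difference(q_alpha, j, T):
    """Bonferroni-Dunn critical difference, as in (7)."""
    return q_alpha * math.sqrt(j * (j + 1) / (6.0 * T))

# Values from the text: three algorithms, 20 instances, q_0.05 = 2.241.
cd = critical_difference(2.241, 3, 20)  # about 0.71

# Tiny made-up example with ties: columns are three algorithms.
ranks = average_ranks([[1.0, 0.5, 0.5], [1.0, 1.0, 0.0]])
```

Two algorithms would then be declared significantly different when their entries in `ranks` differ by at least `cd`.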

VI. CONCLUSION
As pointed out in [6], although the dominant benefit of nanoelectronics is the enormous integration levels they may be able to achieve, one of the challenges for nanoelectronics is whether nanoscale devices can be reliably assembled into architectures. Some small-scale successes have been demonstrated, and the most promising architectures to date are crossbar-based [3], [5]. Reliability is a real challenge for nanoelectronics. It seems evident that manufacturing techniques may never be able to produce perfect chips, so fault tolerance will be a key to the success of nanoelectronics. Another aspect of nanoelectronics that is quite different from current technologies is the electronic design automation (EDA) flow. The challenge is to deploy a circuit on a nanoelectronic chip when each chip is unique.
This paper contributes to EDA methods for the reliability design of nanocrossbar architectures. By introducing MMW-MBM, a new framework for solving the D/VTLM problem is proposed. MMW-MBM is a new matching problem that is defined here for the first time. In order to obtain an MMW-MBM solution efficiently, a heuristic method is presented. Furthermore, a new MA is proposed to implement the framework, in which a novel local search operator, D/VALS, is designed to make good use of the domain knowledge extracted from the problem instances. The performance of the proposed MA was evaluated on a 3-bit adder and a large set of random benchmarks. Our experimental results show that the D/VALS operator helps the algorithm find near-optimal solutions with a higher success rate and low computational cost. Compared with the state-of-the-art algorithms, the proposed MA achieves a good balance between effectiveness and efficiency on various test instances.