I Introduction
With the development of big data analytics and data mining techniques, scalable network measurement and analysis techniques have been used to find hidden information or patterns that help with network management, network monitoring and network security [38, 11, 16, 33]. In one important network management scenario, big network vendors serving various customers own a multitude of databases on network product information, network configurations, geographical information on deployment sites, incident logs, etc. Since the databases serve different departments, they are usually separated from each other and independently managed by different departments and groups. However, network data are inherently designed to host "connections" among different devices, the groups the devices are in, and the functionality the devices serve together; they often need to be shared and combined for many important tasks such as network prediction, semantic query, network analysis and fault detection. For example, with the data connections, one can easily model a graph database to query related product configurations and performance issues around faulty devices or service incidents, which are matched with another product-related table [28, 1]. As many studies [10, 12, 7, 36] point out, the discovery of matching fields is the most crucial and foremost step for the integration of databases. For this reason, in this paper we aim to automatically construct such matchings that lead to efficient network management and analysis.
There is an abundance of research on field matching and integration approaches for different data formats in different contexts such as relational databases, XML and object-oriented data formats [14, 31, 37, 5]. Existing database matching approaches fall into two main categories of techniques. One is schema-level matching, which exploits metadata using schema characteristics such as field names, data types, structural properties and other schema information [4, 15]. However, network management databases from different sources follow different design standards and naming conventions [18, 9]. Even similar fields can have different names (e.g., "product_family" can also be named "product series"). The other category is instance-level matching, which uses the record values of two fields to calculate their similarity and determine matched fields [8, 29, 27, 24]. Most previously proposed schemes rely on syntactic similarities, sampling or machine learning techniques meant to extract common patterns from a matching data corpus. However, it is difficult to directly apply these techniques to network databases, and challenging to reliably construct dictionaries, corpora and training labels for large datasets when the naming convention is inconsistent and diverse.
The challenges for matching these network databases are: (1) The database design is not uniform. The data tables are created in different groups and departments by different people. Therefore, it is not reliable to match data directly and easily using schema information. (2) The data is noisy and irregular. Some table fields contain unexpected records such as nulls, invalid values and typos. Some table records are partly missing, incomplete or incorrect. Some fields have a large number of records, while others have very few. (3) The table contents are complicated and heterogeneous, with both numerical and non-numerical data formats. (4) No thesauri or auxiliary information exist that we can rely on for matching; the only observation available is the database itself.
To solve these challenges, we propose an automatic matching technique for large NEtwork MAnagement databases (NEMA) to construct a graph database for network management and analysis effectively and efficiently. We propose several instance-based algorithms for numerical and non-numerical matching, respectively. Our main contributions are as follows:

We propose an effective range difference similarity metric and a bucket dot product similarity metric to match numerical fields, and a top priority match metric to match non-numerical fields.

To make the algorithm more scalable for large network management databases, we utilize the MinHash locality-sensitive hashing algorithm for faster processing at a small sacrifice in accuracy.

We further propose to use the proposed similarity metric scores as features for classification, improving the reliability of our matching technique.

We experimentally demonstrate the effectiveness and efficiency of our matching algorithms on real Cisco network management databases for constructing a graph network database.
The rest of this paper is organized as follows. We define the problem in Section II. Section III describes the NEMA matching algorithms in detail, including both numerical and non-numerical matching. Section IV presents classification-based matching. Experimental evaluation is presented in Section V, and related work is discussed in Section VI. We conclude the paper in Section VII.
II Problem Description
Given structured network management databases, our goal is to create a graph database of network management by finding the most accurate matched field pairs among different tables in these databases. The matching of two fields is determined by the matching score measured by their record pair similarities. We utilize the matched results to construct a graph database for semantic query, network analysis, network prediction, etc. [26, 21, 22].
To illustrate our problem and algorithms clearly, we use three sample tables below as a toy example throughout the rest of this paper. In Table I, the PRODUCT table contains 2 fields, product_id (primary key) and family, each with 7 records. In Table II, the INCIDENT table contains 2 fields, incident_key (primary key) and prod_key, each with 7 records. In Table III, the ORDER table contains 3 fields, order_key (primary key), incident_id and product_name, each with 7 records. The problem is to find which of these 7 fields match among Tables I, II and III by evaluating the matching of their records, and then to construct a graph database for efficient network analysis and management.
TABLE I: PRODUCT
product_id | family
107        | AIR series
108        | con series
109        | con series
150        | 477000
151        | cisco0500
152        | 807066C
153        | con5100
TABLE II: INCIDENT
incident_key | prod_key
201          | 107
202          | 107
203          | 108
204          | 109
207          | 150
208          | 151
209          | 152
TABLE III: ORDER
order_key | incident_id | product_name
301       | 201         | AIR1212AC
302       | 201         | AIR1002
303       | 203         | con5122
304       | 204         | mem4700m64d=
305       | 207         | 477066C
306       | 208         | cisco0510
307       | 208         | cs6012
Record Matching: Given two records e1 and e2, a record matching function is defined as a 4-tuple m = (e1, e2, s, r), where e1 and e2 are two field records; s is a similarity score (typically in [0, 1]) between e1 and e2; and r is a relation (e.g., equivalence, part-of, overlapping, etc.) between e1 and e2. The matching function asserts that relation r holds between e1 and e2 with score s. For numerical matching, two records are considered matched only if they are equal. For example, the records {107, 108, 109, 150, 151, 152} in field prod_key of the INCIDENT table (Table II) are matched to the equal records in field product_id of the PRODUCT table (Table I), respectively. For a non-numerical pair, however, two records are considered matched if their similarity score under a similarity metric is higher than a threshold. Here we consider the part-of relation in the network management databases for meaningful relations including subgroup versus group, product versus product family, subseries versus series, and so on. For example, the record "con5122" in field product_name and the record "con5100" in field family can have a high similarity score under a part-of relation. A record pair that is matched is called a matched record pair; otherwise it is called a non-matched record pair.
Field Matching: A field here means a database field, indicating the name of a column and the pieces of data stored in it. Given two fields A and B and a threshold θ, we define S(A, B) as the matching/similarity score (e.g., Jaccard similarity) between A and B. If S(A, B) is above θ, we call (A, B) a matched field pair; otherwise it is called a non-matched field pair. In the toy example, (product_id, prod_key) has a high matching score under Jaccard similarity, so the two fields can be correlated and matched. Moreover, the field pair (family, product_name) can also be matched in terms of many matched record pairs such as ("con5100", "con5122"), ("cisco0500", "cisco0510"), and so on.
Graph Database: One effective way to utilize the matched results is to construct a graph database for semantic query, network analysis, network prediction, etc. [26, 21, 22], which is also our goal. We define a graph database as a labeled, attributed and undirected graph G = (V, E, LV, LE), where V is the node set containing all the records appearing in the matched fields; E is the edge set between node pairs of V; LV is the set of label information for the nodes, which are the index attributes for the columns of the tables; and LE is the set of label information for the edges, which are the relations of two records from two tables' column attributes. Specifically, when we construct a graph database from the matching of relational databases, a node v consists of a field and a record value in a row; the label of v comprises the other field information in the same row; an edge is the matching between two records; and its label indicates the field information when two records of the fields match. Figure 1 shows an example of a graph database constructed from part of the matched results in Tables I, II and III. For example, the node "product_id: 107" in the PRODUCT table (Table I) has node labels "PRODUCT" and "Family: AIR series". It matches "prod_key: 107" in the INCIDENT table (Table II), so it is connected to the node "Incident_key: 202" with a Product-incident relation, to the node "Incident_key: 201" with a Product-incident relation, and to the node "Family: AIR series" with a Product-family relation.
III Matching Algorithm
To find whether two fields match, one simple way is field name matching: if the names of two fields are the same or similar, they are matched. However, this is not reliable for many database sources because names are noisy and irregular. Moreover, the network databases comprise numerical and non-numerical fields, which have different attributes and matching requirements. For numerical matching, we consider the equivalence relation between record pairs. For non-numerical matching, however, we do not directly use equivalence as the matching standard, because non-numerical record values can be semantically correlated under different names. For example, the non-numerical fields family and product_name in Tables I and III share very few characters in their names, but they are semantically correlated: product_name has a part-of relation with family. Moreover, in terms of field records, the records "cisco0510" and "cisco0500" in these two fields can be considered as belonging to the same family and matched with a high similarity, whereas the record pair "477066C" and "807066C" is considered non-matched, belonging to different families, even though the two strings share many characters. Details are discussed in Section III-C. Hence, we use record matching to decide whether two fields match, to improve matching accuracy and satisfy semantic matching. Overall, we match numerical and non-numerical fields separately and design different matching algorithms for each.
III-A System Overview
The system overview is shown in Figure 2. We divide the structured data into numerical data with only numerical fields and non-numerical data with only non-numerical fields. For each part, we develop an independent field matching algorithm. The matching algorithms for numerical and non-numerical data are quite different, and are introduced in Sections III-B and III-C. The results of the two parts are combined and loaded into a graph database.
III-B Numerical Matching
Numerical fields are table fields whose records are numerical values. For example, product_id in Table I and prod_key in Table II are numerical fields. Their record values form the basis of the field similarity metrics. We define the record values of each numerical field as a set, which transforms field matching into a set similarity problem.
There are some common methods for measuring set similarity, including the Jaccard index, Dice index, Hamming distance and cosine similarity [6]. However, it is not practical to use any single method to obtain accurate matching decision bounds, because of the noisiness and complexity of the structured databases. We propose a synthetic column-wise numerical matching algorithm to obtain the decision bounds that determine whether two fields are matched. The numerical matching algorithm is shown in Figure 3. Its process is as follows:

We preprocess the numerical data, e.g., removing null values, negative values and exceptional non-numerical values in every field.

We apply the range difference similarity metric to all preprocessed numerical field pairs between every two tables.

Once the range difference similarity score for each field pair is obtained, a threshold (discussed later) is used to filter the results.

Finally, the bucket dot product similarity metric is applied to the pairs that pass the range difference filter. We then sort the similarity scores of all pairs to select the most correlated field pairs: the top-k results with scores above a certain threshold are selected as matched pairs and input to a graph database.
III-B1 Preprocessing
To deal with irregular and noisy data, we preprocess field record values before the formal similarity calculations. This includes removing null values, negative values and any non-numerical values, and keeping unique values only. We do not consider negative values because they are useless, noisy data in the real databases. Almost all the table fields concern identifications, keys or numbers that can potentially be matched. In our databases, on average only a small fraction of field record values is removed (not counting unique-value reduction), which does not impact the evaluation of instance-based algorithms.
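This preprocessing step can be sketched as follows (a minimal illustration; the function name and exact filtering rules are ours, not from the paper):

```python
def preprocess_numeric(records):
    """Keep only clean, unique numeric values from a raw field.

    Drops nulls, non-numeric strings (e.g. typos) and negative
    values, and de-duplicates, mirroring the preprocessing above.
    """
    cleaned = set()
    for r in records:
        if r is None:
            continue  # drop null values
        try:
            v = float(r)
        except (TypeError, ValueError):
            continue  # drop non-numerical values such as typos
        if v < 0:
            continue  # drop negative values (noise in real databases)
        cleaned.add(v)
    return cleaned

print(sorted(preprocess_numeric([107, "108", None, -5, "n/a", 107])))
# [107.0, 108.0]
```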
III-B2 Range Difference Similarity Metric
Considering the noisy and sparse characteristics of the data, the Jaccard similarity coefficient, which measures how many values two sets have in common, is not ideal for differentiating some matched pairs from non-matched pairs. For example, for a non-matched field pair with a limited number of records, the common values may account for a large portion of both fields, so their Jaccard similarity is very high; yet their range distributions can be quite different, in which case the pair should usually not be matched. Therefore, we propose the range difference (RD) similarity metric to measure the distributions of field pairs first. Applying the RD similarity metric first effectively and efficiently prunes a large number of unwanted computations, which also greatly reduces the time needed for further matching. Given a field set, we sort the record values and take percentiles (10th, 20th, 30th, ..., 90th); the i-th percentile value is recorded as P_i. Given two field record sets A and B, the RD value for each percentile is given as

RD_i(A, B) = |P_i(A) − P_i(B)| / max(P_i(A), P_i(B))   (1)
We use the 20th and 30th percentiles as the low range coverage, and the 80th and 90th percentiles as the high range coverage, to cover the distribution of field record values. Hence the RD similarity score for a field pair (A, B) is defined as:

RDS(A, B) = 1 − (RD_20 + RD_30 + RD_80 + RD_90) / 4   (2)

To keep consistent with general similarity metrics and allow convenient comparison, we use 1 minus the averaged RD value as the RD similarity score RDS. The metric based on this similarity score is called the RD similarity metric. Under it, matched pairs exhibit similar distributions. The value of RDS is in [0, 1]: the bigger the similarity score, the more correlated the pair. Several cases of the similarity score are considered here:
(1) If there are no overlaps between the two field ranges, RDS is as low as the minimum value 0.
(2) If the two fields have similar distributions, RDS is higher, up to 1.
(3) If the two field ranges overlap at the head, at the tail or in the middle, RDS falls into a middle value.
In our toy example, the matched pairs (product_id, prod_key) and (incident_key, incident_id) have RDS values as high as 0.996 and 0.999, respectively. In contrast, a pair such as (product_id, incident_key) has no range overlap and an RDS value of 0; it is not a matched pair. The more correlated the field pairs are, the higher RD similarity score they have. Therefore, by using a threshold to filter results, we can almost entirely rule out case (1) and part of case (3), and then mainly consider case (2) for further differentiation. To minimize the error of the RD similarity score in this first step, we use a conservative threshold close to the boundary so that only definite non-matched pairs are filtered out, as discussed in Section V.
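Under the percentile-based definition above, the RD similarity can be sketched as follows (the nearest-rank percentile and the normalization by the larger percentile value are our assumptions):

```python
def percentile(sorted_vals, p):
    # nearest-rank percentile on an already-sorted list
    k = round(p / 100 * (len(sorted_vals) - 1))
    return sorted_vals[max(0, min(len(sorted_vals) - 1, k))]

def rd_similarity(a, b, percentiles=(20, 30, 80, 90)):
    """Range difference similarity: 1 minus the average normalized
    percentile gap over the low (20th, 30th) and high (80th, 90th)
    range coverage of the two fields."""
    sa, sb = sorted(a), sorted(b)
    diffs = []
    for p in percentiles:
        pa, pb = percentile(sa, p), percentile(sb, p)
        denom = max(abs(pa), abs(pb)) or 1.0  # avoid division by zero
        diffs.append(abs(pa - pb) / denom)
    return 1 - sum(diffs) / len(diffs)

# product_id vs prod_key from the toy tables: similar ranges
print(rd_similarity([107, 108, 109, 150, 151, 152, 153],
                    [107, 107, 108, 109, 150, 151, 152]))  # close to 1
```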
III-B3 Bucket Dot Product Similarity Metric
After considering the distribution of field pairs with the range difference similarity metric, we propose the bucket dot product (BDP) similarity metric to further refine the RD-filtered results. The BDP similarity metric divides the whole concatenated range of two fields into buckets/bins and compresses each bucket into one point before calculating a dot product similarity. The intuition is that matched pairs generally share more common values than non-matched pairs. Increasing the bucket size up to a certain value before calculating the dot product makes the similarities of non-matched pairs decrease more while the similarities of matched pairs drop less, thereby effectively widening the similarity gap between matched and non-matched pairs. A good design of BDP therefore helps significantly differentiate matched pairs from non-matched pairs.
We use the bucket number n_b to determine the number of buckets for calculating the dot product. Given two field record sets A and B, we first derive the vectors u and v used as input to the BDP similarity calculation, as follows. We concatenate the value ranges of A and B into a combined set C, and then divide C into buckets according to n_b. If any value of A (resp. B) falls into a bucket, the corresponding entry of u (resp. v) is 1; otherwise it is 0. Then we apply the general dot product similarity to u and v. The (normalized) BDP similarity score is defined as follows:

BDP(A, B) = (u · v) / (‖u‖ ‖v‖)   (4)
where n_b decides the sparsity/density of the range distributions. Since sets A and B usually have different sizes and different ranges, the same n_b is used for both sets. For example, we calculate the BDP similarity for the field pair (incident_key, incident_id) in Tables II and III with n_b = 3. We first concatenate the two field ranges into the set {201, 202, 203, 204, 207, 208, 209}. Then we construct the bucket set C = {{201, 202, 203}, {204, 207, 208}, {209}} according to the bucket number n_b = 3. After that, we get the vectors u = {1, 1, 1} and v = {1, 1, 0}. Finally, the BDP similarity score is 0.816, which is high for matching. If we set n_b to 4, the BDP similarity score for this pair is 1, the highest possible for matching.
The bucket number n_b is also an important factor affecting the quality of this metric. According to our experimental observations, it depends on the data range and distribution. Generally, matched pairs have more similar ranges than non-matched pairs. A well-chosen trade-off value of n_b improves matched pairs' similarities substantially while improving non-matched pairs' similarities little, which widens the gap between matched and non-matched pairs. The selection of n_b is discussed in Section V.
III-B4 Primary Key Constraint Matching
If we considered all numerical fields and matched every pair against each other, the process would be very time-consuming and not scalable for a large number of fields: with N tables from the database sources and an average of M fields per table, up to O((NM)^2) comparisons are required. If we consider only table-to-table matching in a semantic way, the primary key constraint [4, 17] can be utilized to reduce the time for pair comparisons. Moreover, relational databases generally have primary keys and closely related foreign keys in a table, which describe the semantic meaning of the table; it is important to match two tables in this semantic way to construct an effective graph database. Assuming every table has at least one primary key, we can identify the primary key and compare its records with all other keys' records in other tables. We identify a primary key as a field whose record values are all unique, before applying the RD and BDP similarity metrics. For example, product_id and incident_key are identified as primary keys in Tables I and II, so we only need to compare the pairs (product_id, incident_key) and (product_id, prod_key), reducing the number of comparisons by 33%.
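A minimal sketch of the unique-value heuristic for primary key identification (the table layout and function name are illustrative):

```python
def find_primary_keys(table):
    """Identify candidate primary keys: fields whose non-null
    record values are all unique. `table` maps a field name to
    its list of records."""
    keys = []
    for field, records in table.items():
        values = [r for r in records if r is not None]
        if values and len(values) == len(set(values)):
            keys.append(field)
    return keys

# The PRODUCT table from the toy example: family repeats
# "con series", so only product_id qualifies.
product = {"product_id": [107, 108, 109, 150, 151, 152, 153],
           "family": ["AIR series", "con series", "con series",
                      "477000", "cisco0500", "807066C", "con5100"]}
print(find_primary_keys(product))  # ['product_id']
```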
III-B5 Final Top-k Selection
To construct a high-quality graph database, more true-positive field pairs with higher similarity scores are preferable. Also, manual thresholds could lead to unstable selection of field pairs with low similarity scores, so we use top-k selection to further refine the quality of matching for the graph database. We sort all candidate pairs by similarity score in non-ascending order, then verify the final field pair matching results to select the top-k field pairs, which involves only a little human labor.
III-C Non-numerical Matching Based on Top Priority Matching
We propose fast algorithms for non-numerical field matching: the top priority match metric (TPM) for fast filtering, and the match ratio score for the final similarity computation. The diagram is shown in Figure 4. The main process of our non-numerical matching algorithms is as follows:

Preprocess non-numerical data: this important step decides the quality of our non-numerical matching algorithm. After splitting the non-numerical data from the original databases, we apply our natural language processing steps of segmentation, stemming and prefixing to every field record value.

Calculate the record-wise similarities iteratively: it is very time-consuming to apply cosine similarity to every record pair combination in a field pair. Considering the scalability of large-scale data matching, we propose the top priority match metric for record-wise similarity calculation. During the iterative computation, we check a termination condition to end the iterations early, which significantly reduces the time complexity.

Calculate the match ratio score: after the record-wise similarity computations for every field pair are finished, we calculate the defined matching ratio score for each field pair.

Select top-k results: we sort the field pairs by matching ratio score in non-ascending order, and the top-k field pairs are selected as the final results to be input into a graph database.
Here, we discuss each proposed step in detail.
III-C1 Preprocessing
Non-numerical matching considers partial matches between two strings. For example, the products "cisco0510" and "cisco0500" are in the same series, which is considered a partial match. Hence, we propose the following preprocessing method. (1) Parse every record string: remove null values, separate alphabetic and numerical characters into different new substrings, and tokenize the string words. (2) Stem the alphabetic strings of the original record and the new substrings. (3) Reserve prefixes of a certain length for the original numerical strings and the new digital substrings. The prefix length is 2 here, according to the general characteristics of network databases and our experiments.
Each record string is preprocessed in the three steps above. For example, for the original field record 'mem4700m64d=' in Table III, we obtain the string collection {'4700', '64', '4700m', 'd', 'm', 'mem 4700m 64d', '64d', 'mem', '47xx'} after preprocessing.
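A rough sketch of such preprocessing follows; the exact segmentation, stemming and prefixing rules here are our simplification, so the resulting token set differs slightly from the collection above:

```python
import re

def preprocess_record(record, prefix_len=2):
    """Rough tokenizer: lowercase, drop trailing '=', split alphabetic
    and numeric runs into separate substrings, and mask digit runs
    beyond `prefix_len` (e.g. '4700' -> '47xx')."""
    record = record.strip().lower().rstrip("=")
    tokens = set(re.findall(r"[a-z]+|\d+", record))
    masked = {t[:prefix_len] + "x" * (len(t) - prefix_len)
              for t in tokens if t.isdigit() and len(t) > prefix_len}
    return tokens | masked | {record}

print(sorted(preprocess_record("mem4700m64d=")))
# ['4700', '47xx', '64', 'd', 'm', 'mem', 'mem4700m64d']
```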
III-C2 Top Priority Match Metric for Record-wise Similarity
One intuitive way is to enumerate all record pair combinations and calculate the similarity of each record-wise pair. That would be very time-consuming or even infeasible when the data are large. In particular, if two fields are not correlated as a matched pair, much of this computation is wasted. Therefore, we propose a fast record-wise matching algorithm called the top priority match metric for record-wise similarity (TPM). Intuitively, if two fields A and B are correlated, a high percentage of record-wise pairs have high similarities, and the probability of encountering a matched record pair is higher than for non-matched record pairs. Therefore, we first sort all the preprocessed records in the two fields A and B, then count how many records in A are matched with records in B from top to bottom, and vice versa. The comparisons can be terminated as soon as the current record pair similarity falls below the similarity threshold we set for deciding the matching of a record pair, which greatly reduces the number of comparisons relative to the full pairwise combination.
III-C3 Record Pair Similarity
In this fast record-wise comparison, the record pair similarity used is the cosine similarity between the two record collections obtained by preprocessing the two records. It decides how and when to reduce the matching comparisons in a fast and effective way. A threshold δ decides how similar a record pair must be to count as a matched record pair, and can be adjusted by users.
Given a preprocessed string collection c1 and another preprocessed string collection c2, we remove duplicated elements and transfer them into a set U (the union of c1 and c2). Next we generate binary vectors u and v according to the distribution of c1's and c2's values in U, then we calculate the cosine similarity between u and v as their dot product divided by the product of their magnitudes:

cos(u, v) = (u · v) / (‖u‖ ‖v‖)   (5)
For example, given the preprocessed string collections c1 = {cisco, 0510, cisco0510, 05xx} and c2 = {cisco, 05xx, 0500, cisco0500}, we transfer them into the union set {cisco, 0510, cisco0510, 05xx, 0500, cisco0500}. The binary vectors generated from c1 and c2 are then {1, 1, 1, 1, 0, 0} and {1, 0, 0, 1, 1, 1}. Finally, we calculate the cosine similarity of u and v as the similarity of c1 and c2, that is, 2 / (2 × 2) = 0.5.
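The record pair cosine similarity above can be reproduced with a short sketch:

```python
import math

def record_cosine(tokens_a, tokens_b):
    """Cosine similarity between two preprocessed token collections,
    via binary vectors over their union."""
    universe = sorted(tokens_a | tokens_b)
    u = [1 if t in tokens_a else 0 for t in universe]
    v = [1 if t in tokens_b else 0 for t in universe]
    dot = sum(x * y for x, y in zip(u, v))
    return dot / (math.sqrt(sum(u)) * math.sqrt(sum(v)))

c1 = {"cisco", "0510", "cisco0510", "05xx"}
c2 = {"cisco", "05xx", "0500", "cisco0500"}
print(record_cosine(c1, c2))  # 0.5
```

The two collections share "cisco" and "05xx", so the dot product is 2 and both magnitudes are 2, giving 0.5 as in the worked example.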
III-C4 Matching Ratio Score
The matching ratio score is proposed to calculate the final similarity for a field pair. After the reduced record-wise similarity comparisons, we count the record pairs whose similarity scores are above δ. The matching ratio score, used as the final field pair similarity, is the average of the ratios of top matched record pairs, calculated as follows. Given two non-numerical sets A and B, with m items in A and n items in B, the matching ratio score (MR) between A and B is defined as:

MR(A, B) = (R_A + R_B) / 2   (6)

where

R_A = |{a_i ∈ A : s(a_i, b_j) ≥ δ for some b_j ∈ B}| / m   (7)

and

R_B = |{b_j ∈ B : s(a_i, b_j) ≥ δ for some a_i ∈ A}| / n   (8)

where s(a_i, b_j) is the cosine similarity of the record pair (a_i, b_j). The MR value is in [0, 1] and is the final similarity score for deciding the correlation of each field pair.
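Our reading of the matching ratio score can be sketched as follows (treating a record as matched if any counterpart reaches the threshold; this interpretation is an assumption):

```python
def matching_ratio(field_a, field_b, sim, delta=0.5):
    """Matching ratio score (sketch): the fraction of records on each
    side that have at least one counterpart with similarity >= delta,
    averaged over both sides."""
    hit_a = sum(1 for a in field_a
                if any(sim(a, b) >= delta for b in field_b))
    hit_b = sum(1 for b in field_b
                if any(sim(a, b) >= delta for a in field_a))
    return (hit_a / len(field_a) + hit_b / len(field_b)) / 2

# With an exact-match similarity, 2 of 3 records match on each side:
exact = lambda a, b: 1.0 if a == b else 0.0
print(matching_ratio([1, 2, 3], [2, 3, 4], exact))  # ≈ 0.667
```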
III-C5 Final Top-k Results
Similar to numerical field matching, matched non-numerical field pairs in the result list are more meaningful and important than non-matched field pairs, so we select the top-k non-numerical field pairs, sorted by matching ratio score in non-ascending order, as input to a graph database. The value of k can be chosen by users to pick the most effective field pairs and to limit the size of the graph database.
III-D More Efficient Hashing Algorithm for Non-numerical Matching
The proposed TPM metric is effective in distinguishing matched from non-matched non-numerical fields. However, in the worst case it may involve all pairwise record combinations, which is time-consuming for large databases with millions of records. Therefore, we propose applying the more scalable MinHash locality-sensitive hashing algorithm (MH-LSH) [20] to estimate the matching score of non-numerical fields in the databases. It greatly reduces the comparison size and time for non-numerical record-wise pairs at a small cost in matching accuracy.
The diagram of the proposed non-numerical matching based on MH-LSH is shown in Figure 5. The main process of the algorithm is as follows:

Preprocess nonnumerical data: this step is the same as the preprocessing step of TPM algorithm.

Select matched field pairs fast: we apply the locality-sensitive hashing technique to database field matching to quickly select field pairs that are correlated. In this process, minHash similarity decides the filtering results for field pairs.

Field pair similarity calculation: we use the minHash technique to estimate the matching score of the candidate field pairs.

Select top-k results: as for non-numerical matching based on TPM, we sort the field pairs by estimated matching score in non-ascending order, and the top-k field pairs are selected as the final results to be input into a graph database.
Here, we discuss important steps in detail.
III-D1 Matching Score Estimation with MinHash
The similarity score of matched field pairs can be estimated quickly with minHash signatures. Given two fields A and B, we evaluate the similarity between them as follows. We choose k hash functions h_1, ..., h_k. For each hash function h_i, let the signature of set A be min_{a ∈ A} h_i(a), and the signature of set B be min_{b ∈ B} h_i(b). Then the probability that the two sets have the same minHash signature can be used to estimate the similarity between them:

Sim(A, B) ≈ |{i : min_{a ∈ A} h_i(a) = min_{b ∈ B} h_i(b)}| / k   (9)
Here each record in field A or B is also preprocessed with the same preprocessing method as in the TPM algorithm. The preprocessed record strings are combined into a new set, like the word set of a document. We also use shingling (substrings of a fixed length) to create a set of shingle strings and apply the different hash functions to that set.
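A minimal MinHash sketch of this estimation (constructing hash functions via salted MD5 is our choice, not from the paper):

```python
import hashlib

def minhash_signature(tokens, num_hashes=64):
    """One signature entry per salted hash function: the minimum
    hash value over the token set."""
    return [min(int(hashlib.md5(f"{i}:{t}".encode()).hexdigest(), 16)
                for t in tokens)
            for i in range(num_hashes)]

def minhash_similarity(sig_a, sig_b):
    """Fraction of signature positions that agree: an unbiased
    estimate of the Jaccard similarity of the two token sets."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)

a = {"cisco", "0510", "cisco0510", "05xx"}
b = {"cisco", "05xx", "0500", "cisco0500"}
est = minhash_similarity(minhash_signature(a), minhash_signature(b))
# est approximates the true Jaccard similarity |a & b| / |a | b| = 2/6
```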
III-D2 Field Pair Selection with Locality Sensitive Hashing
Performing pairwise similarity measurement can be time-consuming with a large number of field pairs. In order to quickly identify which field pairs are similar, we propose using locality-sensitive hashing (LSH) to select the candidate field pairs.
The values of the minHash signature for one field are grouped into b tuples (referred to as sketches) of r rows each. Similar field pairs have similar minHash signatures and hence a high probability of sharing a sketch, while dissimilar pairs have a low chance of falling into the same sketch. The probability that two fields A and B with similarity s have at least one sketch (of size r) in common out of b is

P = 1 − (1 − s^r)^b   (10)
Therefore, we can find the candidate pairs with suitably designed values of b and r. The selection of b and r is generally decided by a threshold t, as shown in [20], which indicates how similar two fields must be to be considered a candidate pair; it can also be set by users. In this way, pairs with similarity above t are selected as candidate pairs for further examination, and the matching score between them is then estimated with minHash.
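The banding probability and its implicit threshold can be illustrated as follows (b = 16, r = 4 are example values, not from the paper):

```python
def candidate_probability(s, bands, rows):
    """Probability (Eq. 10) that two fields with minHash similarity s
    share at least one of `bands` sketches of `rows` rows each."""
    return 1 - (1 - s ** rows) ** bands

# The S-curve has an implicit threshold t = (1/bands)**(1/rows):
b, r = 16, 4               # example banding parameters
t = (1 / b) ** (1 / r)     # = 0.5 for these values
print(candidate_probability(0.8, b, r))  # well above t -> > 0.99
print(candidate_probability(0.2, b, r))  # well below t -> < 0.05
```

This S-curve behavior is what lets LSH discard most dissimilar field pairs without ever comparing them directly.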
IV Classification-based Matching
Our proposed algorithms for numerical and non-numerical matching involve several manual thresholds to obtain the matching results. Manual threshold selection is tedious and may not generalize to new databases. We therefore propose a classification-based method to decide matching or non-matching: using our proposed similarity scores as features, we construct a classification-based matching system.
For a classification method, four important factors are involved: input data, feature design, model selection, and training/testing. The details of the classification are introduced in the following sections.
IV-A Sampling for Ground Truth
To address the limited availability of ground truth in our Cisco dataset, we use a sampling method [19] to synthesize more ground truth pairs while preserving the “synchronization property” (which preserves the Jaccard similarity of the original sets). Given an original ground truth pair A and B, we sample a new ground truth pair based on A and B. The sampling process ensures that if a particular value is sampled from A and also appears in B, then it is also sampled from B [19]. We have only a limited number of real numerical ground truth pairs that come from the databases and are labeled by humans, so we synthesize 100 synthetic pairs from each field pair. For example, our dataset has 60 real numerical ground truth pairs, so we synthesize 6,000 more ground truth pairs, which are enough to train a classifier. Synthetic non-numerical ground truth pairs are generated in a similar way.
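A minimal sketch of such a synchronized sampler is shown below (the exact procedure of [19] may differ; the function name and the hash-based coin are our own illustrative choices). Each distinct value is kept or dropped by one shared per-value coin, so a value kept in one field is also kept in the other, and the expected Jaccard similarity is preserved; varying the salt yields many synthetic pairs from one labeled pair:

```python
import hashlib

def synced_sample(field_a, field_b, rate=0.5, salt="0"):
    """Sample both fields with a shared per-value coin so that the
    "synchronization property" holds: a value kept in one field is
    kept in the other too, preserving expected Jaccard similarity."""
    def keep(value):
        digest = hashlib.sha1((salt + "|" + str(value)).encode()).digest()
        return digest[0] < int(256 * rate)  # keep with probability `rate`
    return ({v for v in field_a if keep(v)},
            {v for v in field_b if keep(v)})
```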
IV-B Feature Design for Classification
For numerical field matching, we have generated the range difference (RD) similarity and bucket dot product (BDP) similarity metrics, and we propose to use these similarity scores as features. One important manual threshold, the bucket number, is involved in calculating BDP similarity, so we generate 15 different BDP similarity scores with bucket numbers in [100, 200, 500, 1000, 2000, 5000, 10000, 50000, 100000, 500000, 1000000, 5000000, 10000000, 50000000, 100000000], obtained from the data range of our real ground truth. As a consequence, each field pair instance has 16 feature scores in total.
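The per-bucket-number feature computation can be sketched as follows (illustrative; we assume a cosine-normalized dot product of the two fields' value histograms over their shared range, while NEMA's exact BDP definition is given in Section III):

```python
def bdp_similarity(values_a, values_b, num_buckets):
    """Bucket dot product: histogram both fields over their shared
    range, then take the cosine-normalized dot product of the counts."""
    lo = min(min(values_a), min(values_b))
    hi = max(max(values_a), max(values_b))
    width = (hi - lo) / num_buckets or 1  # guard against a zero range
    def histogram(values):
        h = [0] * num_buckets
        for v in values:
            h[min(int((v - lo) / width), num_buckets - 1)] += 1
        return h
    ha, hb = histogram(values_a), histogram(values_b)
    dot = sum(x * y for x, y in zip(ha, hb))
    norm = (sum(x * x for x in ha) * sum(y * y for y in hb)) ** 0.5
    return dot / norm if norm else 0.0

def bdp_features(values_a, values_b, bucket_counts):
    """One BDP score per bucket number, used as classifier features."""
    return [bdp_similarity(values_a, values_b, n) for n in bucket_counts]
```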
For non-numerical field pairs, we have proposed the matching ratio score and the minHash similarity metric to decide the similarity of these pairs. One important record-pair threshold is involved in calculating the matching ratio score, so we use 4 different threshold values in [0.3, 0.4, 0.5, 0.6] to calculate different matching ratios as features. In total, 5 feature scores are obtained for each non-numerical field pair.
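A simplified sketch of the matching-ratio features is shown below. The record similarity here is a stand-in (Jaccard over character trigrams); NEMA's actual record similarity and TPM pairing are defined in Section III:

```python
def record_similarity(r1, r2):
    """Illustrative record similarity: Jaccard over character trigrams."""
    def trigrams(s):
        s = s.lower()
        return {s[i:i + 3] for i in range(max(len(s) - 2, 1))}
    t1, t2 = trigrams(r1), trigrams(r2)
    return len(t1 & t2) / len(t1 | t2) if t1 | t2 else 0.0

def matching_ratio(field_a, field_b, theta):
    """Fraction of records in field_a that find some record in field_b
    with similarity at least theta."""
    matched = sum(1 for a in field_a
                  if any(record_similarity(a, b) >= theta for b in field_b))
    return matched / len(field_a) if field_a else 0.0

def matching_ratio_features(field_a, field_b, thetas=(0.3, 0.4, 0.5, 0.6)):
    """One matching-ratio score per record-pair threshold."""
    return [matching_ratio(field_a, field_b, t) for t in thetas]
```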
IV-C Training and Testing with SVM
We select the Support Vector Machine (SVM) as our classifier because SVM works well on unstructured data such as non-numerical text and scales well to high-dimensional data. We split our data into a training/validation portion and a test portion. In the training stage, 5-fold cross-validation is used for training and validation; the results are shown in the experiments.
V Experimental Evaluation
In this section, we evaluate our technique NEMA for structured network database matching. Specifically, we measure the effectiveness of NEMA using human-annotated ground truths for numerical and non-numerical data. We also conduct experiments on a large dataset and show the top-k matching field pairs. Moreover, we compare NEMA with other baseline methods.
V-A Data
The structured network management databases available in the form of database tables are provided by Cisco Systems, Inc. The dataset includes “install_base” and “service_request” databases. They are generated by different departments and groups of the corporation, which are heterogeneous and diversely distributed.
In these databases, there are 21 tables containing 1,458 columns. Each column has 10 million records on average. Of these columns, 679 are numerical fields and 779 are non-numerical fields. Therefore, a complete match involves at most 1,067,882 field matching decisions. With primary key constraints in numerical matching (there are 5 “primary keys” on average in each table), this reduces to 374,326 field-pair matching decisions.
We have ground truth field pairs, annotated by humans as matched or non-matched, for a subset of the data: 60 field pairs of ground truth in numerical data and 40 pairs in non-numerical data. For reference, table names in the service_request database start with “T_” and those in the install_base database start with “X_”.
V-B Experimental Setup
We implement the NEMA system in Python. To evaluate the effectiveness of NEMA, we evaluate the numerical and non-numerical algorithm parts respectively. For each part, we first evaluate its algorithms on the ground truth data; then all the column pairs in the large dataset are evaluated in the experiments of Sections V-C2 and V-D2, which show the effectiveness of NEMA. Finally, we compare against the common matching system COMA [3] on the ground truth for both schema-level and instance-level matching.
To construct an effective graph database, identifying accurately matched pairs and keeping non-matched pairs away from the top results is much more important than finding more non-matched pairs. Also, our evaluation is based on a balanced ground truth of positive and negative pairs. Therefore, we use an overall metric, accuracy (ACC), to evaluate our field matching algorithms. ACC is calculated from the true positives (TP), true negatives (TN), total positive pairs (P) and total negative pairs (N) as ACC = (TP + TN) / (P + N), where positive means matched and negative means non-matched.
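In code, this metric is simply:

```python
def accuracy(tp, tn, p, n):
    """Overall accuracy: correctly classified pairs over all pairs,
    where P matched (positive) and N non-matched (negative) pairs
    make up the ground truth."""
    return (tp + tn) / (p + n)
```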
V-C Evaluation based on Numerical Data
We evaluate our technique NEMA on the numerical data in two parts. We use the numerical ground truth to evaluate the effectiveness of the NEMA numerical algorithm; then the matching results of all other numerical field pairs are described.
V-C1 Evaluation on Ground Truth
We show the evaluation results of the NEMA numerical algorithms and the compared baseline method, Jaccard similarity, on the numerical ground truth. (We choose Jaccard similarity since it is an exemplar method that uses the common values of two sets for similarity calculation.) In the dataset, there are 30 matched ground truth field pairs, which come from field pairs annotated by humans or from join operations in the databases and are proved to be matched. We randomly sample 300 non-matched field pairs confirmed by humans and select 30 of them as non-matched ground truth.
No.  Table.fieldA  Table.fieldB  Matched class 

1  T_INCI.prod_hw_key  T_HW_PROD.bl_prod_key  1 
2  T_INCI.cur_ct_key  T_CT.bl_ct_key  1 
3  T_INCI.up_tech_key  T_TECH.bl_tech_key  1 
4  T_INCI.inci_id  T_INCI_I2.inci_id  1 
5  T_INCI.bl_cot_key  T_COT.bl_cot_key  1 
6  T_INCI.inci_id  T_OR_HD.inci_id  1 
7  T_INCI.item_id  T_PROD.item_id  1 
8  T_INCI.ins_site_key  T_SITE.bl_site_key  1 
9  T_OR_LN.prod_key  T_PROD.bl_prod_key  1 
10  T_OR_HD.header_id  T_OR_LN.header_id  1 
1  T_OR_HD.order_dur  T_OR_LN.loc_key  0 
2  T_CAL.bl_cal_key  T_PROD.item_id  0 
3  T_INCI_I2  T_DEFT.deft_id  0 
4  T_INCI_I2.res_time  T_TECH.sub_tech_id  0 
5  X_PRO.list_price  T_TECH.sub_tech_id  0 
6  T_INCI_I2.res_time  T_TECH.bl_tech_key  0 
7  T_OR_HD.deliv_dur  T_DEFT.deft_key  0 
8  T_OR_LN.hold_dur  T_TECH.sub_tech_id  0 
9  T_IN.serlevel_key  T_INCI.last_dur  0 
10  T_IN.closect_key  T_INCI.resp_tz  0 
Table IV shows 10 matched and 10 non-matched field pair examples from the numerical ground truth. The columns “Table.fieldA” and “Table.fieldB” show a table name and the field to be compared. The matched class column indicates that a field pair is matched with value “1” or non-matched with value “0”. For example, in the first row, “T_INCI.prod_hw_key” denotes the field “prod_hw_key” in the table “T_INCI”, and “T_HW_PROD.bl_prod_key” denotes the field “bl_prod_key” in the table “T_HW_PROD”. This field pair, both about product keys, is matched, as indicated by “1” in the matched class column. The remaining rows follow the same format.
We model the similarity of a numerical field pair as the similarity of a pair of sets. The baseline method for the similarity of a numerical field pair is the well-known Jaccard similarity, which measures the similarity of two given sets.
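This baseline treats each field as its set of distinct record values; a minimal sketch:

```python
def jaccard_similarity(field_a, field_b):
    """Jaccard similarity of two fields' distinct record values:
    |A intersect B| / |A union B|."""
    a, b = set(field_a), set(field_b)
    return len(a & b) / len(a | b) if a or b else 0.0
```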
Figure 6 shows the Jaccard similarity scores on these ground truth field pairs. Red circles represent matched pairs and blue stars represent non-matched pairs. The x-axis denotes the index of the 30 matched and 30 non-matched field pairs, and the y-axis indicates the Jaccard similarity score. From this figure, we can see that about half of the positive and negative pairs are mixed together and are difficult to differentiate.
Figure 7 shows the RD similarity metric on these ground truth field pairs. A pair is more likely to match if its RD similarity score is larger. Only a small fraction of positive and negative pairs are mixed together, a great improvement over the Jaccard similarity result.
From the RD similarity result, we can set a low threshold and rule out pairs scoring below it as non-matched. To minimize the error of this initial filtering, we select a conservative threshold close to 0.1 (all of the uncertain pairs lie above 0.1) to eliminate only clearly non-matched pairs.
After filtering field pairs with the RD similarity metric, 39 field pairs are left; that is, about 35% of all ground truth pairs (all of them non-matched) are pruned. Among these 39 pairs, there are 30 matched pairs and 9 non-matched pairs to be passed to the BDP similarity metric.
Figure 8 shows the BDP result after the RD similarity filtering. The x-axis denotes the index of the 30 matched and 9 non-matched pairs, and the y-axis indicates the normalized BDP similarity score in non-ascending order. The BDP result has a clear decision boundary between matched and non-matched pairs, with the final threshold chosen around 0.1. The final accuracy reaches 95% when the threshold is 0.1.
The bucket number is the main factor affecting the BDP similarity and the final results. To evaluate its effect, we use all the numerical ground truth to test accuracy with bucket numbers from 100, 200, 300, 400, 500 up to 100,000,000, since some numerical records have values up to 10 billion in almost all fields. Figure 9 shows the accuracy of the BDP similarity metric with different bucket numbers when the threshold is set at 0.1 and 0.2. Both curves peak when the bucket number is roughly between 10,000 and 200,000, and accuracy drops as the bucket number becomes smaller or larger. We use a trade-off value in our experiments to avoid overfitting.
V-C2 Top Similarity Results
Here we show the matching experiment on all 679 numerical fields. Table V shows the top matching results, using the RD similarity threshold and bucket number chosen above for BDP similarity computation. All the rows are accurate matches, which are also confirmed by humans. The accuracy is up to 100% for the top results, which shows great potential for applying our numerical matching algorithm to the large dataset and reduces substantial human labor. Moreover, with our system-aided matching, we can find matched pairs that are difficult to identify through human annotation, such as “changewg_key” and “subregion_key”, which indicates the regions a workgroup mainly serves.
To make a graph database more meaningful and complete, the selection of top results is ultimately decided by users. Users can inspect the top results and rule out unwanted matches before they become input pairs to a graph database, so as to introduce fewer “noisy” connections.
Table.field A  Table.field B  BDP similarity 

T_DEFT.deft_key  T_INCI_DE.bl_def_key  0.844 
T_OR_LN.item_id  X_PRO.item_id  0.698 
T_DEFT.deft_id  T_INCI_DE.defect_id  0.682 
T_INCI.item_id  X_PRO.item_id  0.673 
T_DEFT.deft_key  T_PROD.item_id  0.64 
X_INS.item_id  X_PRO.item_id  0.597 
T_PROD.bl_prod_key  T_SUR.task_key  0.567 
T_INCI.changewg_key  T_WK.subregion_key  0.551 
T_INCI.changewg_key  T_WK.wkgrp_key  0.551 
T_INCI.changewg_key  T_WK.theater_key  0.551 
T_INCI.currentwg_key  T_WK.theater_key  0.545 
T_INCI.currentwg_key  T_WK.subregion_key  0.545 
T_INCI.currentwg_key  T_WK.wkgrp_key  0.545 
T_INCI.hwversion_id  T_PROD.bl_prod_key  0.507 
T_INCI.createwkgrp_key  T_WK.theater_key  0.507 
T_INCI.createwkgrp_key  T_WK.subregion_key  0.507 
T_INCI.createwkgrp_key  T_WK.wkgrp_key  0.507 
T_PROD.item_id  X_PRO.item_id  0.468 
T_INCI.prod_hw_key  T_HW_PROD.bl_prod_key  0.463 
T_SUR.evalwkgrp_key  T_WK.wkgrp_key  0.433 
V-D Evaluation based on Non-numerical Data
We likewise evaluate our NEMA non-numerical algorithms, based on TPM and hashing, on the non-numerical data in two parts. We use the non-numerical ground truth data to evaluate the effectiveness of our algorithms; the matching results of all other non-numerical field pairs are then described.
V-D1 Evaluation on Ground Truth
Here we show the experiment on the non-numerical ground truth. There are 20 matched ground truth pairs annotated by humans. Additionally, 20 non-matched ground truth pairs are randomly selected and verified. Examples are shown in Table VI.
No.  Table.field A  Table.field B  Matching class 

1  T_PROD.item_name  X_INS.item_name  1 
2  T_COT.cpr_country  T_SITE.country  1 
3  T_CT.temp_desc  T_PROD.item_desc  1 
4  T_CT.ctserv_line  X_SAH.servline_name  1 
5  T_INCI.curr_wg_name  T_WK.wkgp_name  1 
6  T_CT.temp_desc  X_SAH.temp_name  1 
7  T_SITE.cust_state  X_SAH.billto_state  1 
8  T_CT.temp_name  X_SAH.temp_desc  1 
9  T_PROD.prod_family  T_HW_PROD.family  1 
10  T_PROD.prod_family  T_HW_PROD.erp_family  1 
1  T_INCI.init_gp_name  T_SITE.address  0 
2  T_DEFT.deft_submitter  T_SITE.email_addr  0 
3  T_COT.cpr_country  T_INCI.summary  0 
4  T_SITE.address1  X_PRO.prod_family  0 
5  T_PROD.prod_family  X_SAH.hdrcust_name  0 
6  T_INCI.tacpica_ct  T_HW_PROD.family  0 
7  T_WK.wkgp_desc  X_PRO.physisn_loc  0 
8  T_SITE.county  T_SUR.batchcot_name  0 
9  T_INCI.customersw_ver  T_SITE.state  0 
10  T_OR_LN.partsloc_code  X_INS.item_name  0 
We first analyze the ground truth record pairs and show the viability of the record-pair similarity threshold. Table VII shows the record-pair similarity scores of 9 different record pairs in the field pair (“T_PROD.prod_subgrp”, “T_HW_PROD.platform”). The first 7 rows are matched record pairs with higher similarity scores. The last 2 rows are non-matched record pairs with a lower score of 0.333, below 0.4. Assigning 0.4 to the threshold is therefore a conservative choice for capturing matches among different records.
T_PROD.prod_subgrp  T_HW_PROD.platform  Record similarity 

c900 series  c900 series  1 
c2950 series  c2916 series  0.8 
1601r series  1601 series  0.775 
css2950  css2916  0.667 
C2960  C2960CX  0.577 
C3560CX  C3560X  0.5 
AIR35CE  AIR35SE  0.4 
ts900  cs900  0.333 
c800  s800  0.333 
We demonstrate the effectiveness of the non-numerical algorithms based on TPM and MH-LSH by calculating the matching ratio score (TPM) and the estimated matching score (MH-LSH) on the non-numerical ground truth. Figure 10 shows the matching ratio scores of this ground truth matching in non-ascending order based on TPM. The matching ratio scores of almost all the matched pairs are above those of the non-matched pairs. With a threshold of 0.1, or by selecting the top results, high accuracy is achieved, which shows the effectiveness of NEMA based on TPM. Figure 11 shows the matching scores of these ground truths in non-ascending order based on MH-LSH. Although the decision boundary is not as clean as the TPM-based result in Figure 10, the accuracy reaches 90% when the threshold is 0.1 or the top 19 results are selected.
V-D2 Top Similarity Results
There are 779 non-numerical fields in the large dataset. Since almost all primary keys in a table are numerical fields, we do not use the primary-key-constraint matching method for non-numerical matching. The record similarity threshold is set to 0.4 based on the analysis of Table VII. The final top list of matching ratio scores is obtained with the TPM algorithm over all non-numerical field pairs.
Table VIII shows the top results of field pair matching based on TPM. All the field pairs are matched pairs, later confirmed by humans, which shows the effectiveness of NEMA based on the TPM algorithm.
Table.field1  Table.field2  Matching ratio 

T_INCI_I2.currentwg_key  T_WK.wkgp_name  0.637 
T_INCI_I2.currentwg_key  T_WK.wkgp_desc  0.632 
T_INCI.initwg_name  T_WK.wkgp_name  0.63 
T_INCI.initwg_name  T_WK.wkgp_desc  0.628 
T_WK.wkgmgr_email  T_SUR.eval_email  0.626 
T_INCI.creatorwg_name  T_WK.wkgp_name  0.624 
T_INCI.creatorwg_name  T_WK.wkgp_desc  0.622 
T_INCI.curr_wg_name  T_WK.wkgp_name  0.611 
T_INCI.curr_wg_name  T_WK.wkgp_desc  0.607 
X_SAH.billto_state  T_SITE.state  0.504 
X_SAH.billto_state  T_SITE.cust_state  0.499 
T_INCI.initwg_name  T_INCI_I2.curr_wg_name  0.462 
T_INCI.initwg_name  T_INCI_I2.wkgrp_name  0.457 
T_COT.cpr_country  T_SITE.country  0.453 
T_SITE.cust_country  T_COT.cpr_country  0.453 
T_INCI_I2.wkgrp_name  T_INCI.curr_wg_name  0.451 
T_COT.cpr_country  T_SITE.cust_country  0.447 
T_INCI.curr_wg_name  T_INCI_I2.curr_wg_name  0.442 
T_SITE.country  T_COT.cpr_country  0.44 
T_HW_PROD.erpplatform  X_SCDC.productsub_grp  0.431 
V-E Evaluation of Classification-based Matching
As described in Section IV, we use an SVM-based classifier to decide whether two fields match. Here we show the numerical and non-numerical testing results using cross-validation. We have generated 6,000 synthetic numerical and non-numerical ground truth pairs based on the 60 numerical field pairs and 40 non-numerical field pairs, respectively. We run the whole experiment 20 times and report the average validation and testing accuracy in Table IX. The classification-based method matches field pairs with high accuracy using our proposed similarity metrics, while avoiding manual thresholds.
Data/Average ACC  Numerical  Nonnumerical 

Validation  0.995  0.953 
Testing  0.972  0.946 
V-F Comparisons with Baseline Methods
We compare our technique NEMA with two baselines: the COMA system [13] and the rule-based method Regex [24]. COMA is a state-of-the-art and popular hybrid matching tool and system supporting both schema-level and instance-level matching. Regex is an instance-level rule-based matching method based on regular expressions. We test numerical and non-numerical ground truth matching with NEMA, with COMA at the schema and instance levels, and with Regex at the instance level, and compare and analyze their accuracies and matching differences quantitatively.
V-F1 Comparison of Accuracy
We measure and compare the accuracy of COMA at the schema and instance levels and of Regex at the instance level, for numerical and non-numerical data matching. At the schema level, COMA uses the best field-matching similarity with a threshold of 0 (i.e., no cut-off line in the COMA system). At the instance level, COMA has a similar instance-level matching that aggregates maximum record-wise similarities to obtain the final field-pair similarities. The record-wise similarity is based on common similarity metrics such as edit distance [32] and trigram [2]. Edit distance measures how dissimilar two strings are by counting the minimum number of operations required to transform one string into the other. Trigram splits a string into triples of characters and compares those to the trigrams of another string. The field matching similarity between two fields A and B in COMA is defined as follows:
sim(A, B) = ( Σ_{a∈A} max_{b∈B} sim(a, b) + Σ_{b∈B} max_{a∈A} sim(a, b) ) / (|A| + |B|)    (11)
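As a sketch of this record-wise scheme, the following illustrative Python implements edit distance, a normalized edit similarity, and an assumed symmetric average-of-best-matches aggregation over the two fields (the exact COMA aggregation may differ; function names are our own):

```python
def edit_distance(s, t):
    """Minimum number of insertions, deletions and substitutions
    required to transform s into t (dynamic programming)."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        curr = [i]
        for j, ct in enumerate(t, 1):
            curr.append(min(prev[j] + 1,                 # delete cs
                            curr[j - 1] + 1,             # insert ct
                            prev[j - 1] + (cs != ct)))   # substitute
        prev = curr
    return prev[-1]

def edit_similarity(s, t):
    """Edit distance normalized into [0, 1]."""
    if not s and not t:
        return 1.0
    return 1.0 - edit_distance(s, t) / max(len(s), len(t))

def field_similarity(field_a, field_b, record_sim=edit_similarity):
    """Aggregate record-wise similarities into a field-pair score by
    averaging each record's best match in the other field, in both
    directions (an assumed reading of COMA's aggregation)."""
    best_a = [max(record_sim(a, b) for b in field_b) for a in field_a]
    best_b = [max(record_sim(a, b) for a in field_a) for b in field_b]
    return (sum(best_a) + sum(best_b)) / (len(field_a) + len(field_b))
```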
Regex is a matching method based on regular expressions: it creates patterns from sampled instances of one field and matches them against instances of another field to decide matching.
Figure 12 shows the accuracy comparisons among COMA, NEMA and Regex. COMA-SCH is COMA with the schema-level matching algorithm. COMA-ED and COMA-TRG mean that COMA uses edit distance and trigram, respectively, to measure record similarity in instance-level matching. Regex is the matching based on regular expressions in [24]. NEMA(RD+BDP)/TPM is NEMA using RD and BDP for numerical data matching and TPM for non-numerical data matching. NEMA-MH_LSH indicates that NEMA uses minHash-locality-sensitive hashing in non-numerical data matching; no MH_LSH bar is shown for numerical matching.
The accuracies of NEMA(RD+BDP)/TPM in numerical and non-numerical data matching are as high as COMA's on non-numerical data, and higher than those of COMA-SCH and Regex. For numerical matching, the accuracies of COMA-ED, COMA-TRG and Regex are lower than NEMA(RD+BDP)/TPM because they fail to identify non-matched pairs in the numerical ground truth. For non-numerical matching, the accuracies of COMA-ED and COMA-TRG are higher than NEMA(RD+BDP)/TPM, which in turn is higher than Regex. However, the field matching score of COMA is based on its general string similarity matching, which does not account for the record-pair similarity requirements of network management databases: a large number of pairs with high record similarities in COMA are not considered matches in the network management databases, which shows the usefulness of the NEMA non-numerical algorithm.
V-F2 Comparisons of Mismatched Examples
We further analyze the differences between COMA and NEMA in matching the ground truth field pairs. Each row of Table X shows a field pair and its similarity scores by COMA and NEMA. These field pairs are found to be matched by NEMA with relatively high similarity scores, but COMA gives them a score of 0: their field names have very few characters in common, even though a semantic commonality exists. NEMA does not rely on inaccurate schema-level properties; it uses the record instances for field matching decisions, which indirectly captures semantic correspondences. If the record instances of some matched pairs are incomplete or missing, however, the similarity scores for these field pairs are also low. Table XI shows two field pairs that have low similarities in NEMA. Although the field names in each pair express the same thing semantically, the record instances in the fields are incomplete and have very little in common. In our databases, however, field pairs with missing or incomplete records are very few compared to the total number of field pairs, so they do not affect the overall matching performance of NEMA.
Table.fieldA  Table.fieldB  COMA-ED  NEMA-TPM 

T_INCI.ins_site_key  T_SITE.partysite_id  0  0.208 
T_OR_HD.creator_id  T_INCI.lastup_by  0  0.643 
T_CT.temp_desc  T_PROD.item_desc  0  0.05 
Table.fieldA  Table.fieldB  COMA-ED  NEMA-TPM 

T_INCI.bl_cot_key  T_CT.bl_cot_key  0.750  0.091 
T_SUR.bl_surv_key  T_SUR_ANS.bl_surv_key  0.76  0.009 
Here we analyze specific record-pair examples of non-numerical instance-level matching. COMA uses standard edit distance and trigram to calculate record similarities, which is not well suited to the matching requirements of network management databases. Table XII shows 9 example record pairs with three kinds of similarity (NEMA-TPM, COMA-ED, COMA-TRG). The first 7 rows form a group of matched record pairs; the last two rows are non-matched record pairs. The COMA similarities of the two groups are both around 0.7 and hard to differentiate, while the NEMA similarities separate well (0.333 for non-matched pairs, 0.4 and above for matched pairs). This further demonstrates that NEMA is more suitable for network database matching.
Record 1  Record 2  NEMA-TPM  COMA-ED  COMA-TRG 

c900 series  c900 series  1  1  1 
c2950 series  c2916 series  0.8  0.833  0.6 
1601r series  1601 series  0.775  0.909  0.738 
css2950  css2916  0.667  0.714  0.6 
C2960  C2960CX  0.577  0.6  0.775 
C3560CX  C3560X  0.5  0.833  0.671 
AIR35CE  AIR35SE  0.4  0.857  0.6 
c800  s800  0.333  0.75  0.5 
ts 900  cs900  0.333  0.8  0.667 
V-F3 Comparison of Efficiency
Considering the expensive time consumption of non-numerical field pair matching, we test efficiency on the whole non-numerical ground truth. We run the experiment 20 times on the same machine with the same data and report the average computation time and standard deviation (SD), excluding data loading time. Figure 13 shows the total computation time of COMA-SCH, COMA-ED, COMA-TRG, Regex, NEMA-TPM and NEMA-MH_LSH. COMA-SCH and Regex, which use schema information only, are the fastest, but their accuracies are the lowest. Aside from COMA-SCH and Regex, COMA-ED is the slowest, taking about 12,000 seconds. NEMA-TPM takes about 2,500 seconds, roughly a 5x speedup over COMA-ED. NEMA-MH_LSH takes 860 seconds, about 14 times faster than COMA-ED. NEMA-TPM achieves the best trade-off between accuracy and efficiency among these algorithms.
VI Related Work
Structured data matching is a long-standing and important research problem that remains unsolved and ever-growing, with a wide range of applications in database integration, migration, semantic query, etc. [5]. The survey papers [30, 34] propose a solution taxonomy differentiating between element- and structure-level, schema- and instance-level, and language- and constraint-based matching techniques. Furthermore, P. Shvaiko et al. [35] review state-of-the-art matching systems based on strings, structure, data instances and semantics, covering schema formats such as database, XML, OWL, RDFS, etc. In database schema matching, common schema-level matching systems include prototypes such as Similarity Flooding (SF) [25] and Coma [3].
SF [25] is a matching algorithm that models the two structured schemas to be compared as two directed labeled graphs. It makes use of field data, key properties and string-based alignment (prefix and suffix tests) to obtain alignments between nodes of the graphs; similarity is propagated from similar nodes to their adjacent neighbors. Our NEMA relies only on data instance values to infer field matchings and does not utilize structural properties or data types. However, SF uses a matching-quality metric based on the intended matching results, which is similar to our accuracy metric based on top results.
Coma [3] is a composite matching system providing an extensible library and a framework for combining the obtained results. It contains 6 elementary matchers using string-based techniques, 5 hybrid matchers using names and structural paths, and one reuse-oriented matcher based on previous matching results. The composite matcher effectively improves match quality over single matchers using the default combination strategy, and its overall average matching quality is better than SF's [3]. The extended version Coma++ [13] utilizes a shared taxonomy and pivot schema to further improve overall matching quality. In our evaluation, we compare with the Coma++ method using the default combination strategy and find that our technique NEMA overall outperforms COMA in schema-level matching.
Apart from the previous approaches that match field and structural information, data instance-based approaches [27, 23, 8, 39] use similarity metrics, machine learning or rule-based methods to determine the similarity of fields. In [23], the authors utilize a corpus that contains schemas and mappings between some schema pairs, and learn constraints from schema statistics to help generate more matching pairs. In [8], the authors use the mutual information of statistics to measure the similarity of schema instances between two columns, an effective instance-based method. The authors in [29] propose a sample-driven approach that enables end-users to easily construct their own data to match source and target schemas. [24] proposes a rule-based method that creates regular-expression patterns to match against columns. [39] uses a matching-learning technique that trains neural networks to obtain candidate pairs and then filters the pairs with a rule-based algorithm. Our NEMA uses the proposed similarity metrics as features to train an SVM that classifies matches effectively and efficiently. COMA [13] proposes two instance-level matching methods, based on constraints of the instance data and on content-based matching, to measure field matching. The constraint-based method relies on general, numerical and pattern constraints that are tied to specific data and are not suitable for network databases. The content-based matching depends on aggregating similarity scores of instance contents and is similar in spirit to NEMA's content-based similarity measurement; it is compared with NEMA in the experimental evaluations.
To sum up, most currently popular matching approaches and systems focus on schema-level information. Instance-level approaches using field record values are mostly based on statistical models or machine learning from corpora. We further explore database instance matching by comparing field records with different metrics, and propose effective overall matching algorithms that account for the characteristics of network database matching, enabling graph database construction for efficient data query, analysis and management.
VII Conclusion
In this paper we propose a systematic technique, NEMA, to match databases for network management. Different from previous database matching approaches, we design a technique to match numerical and non-numerical fields at the instance level, respectively, whose results can effectively be integrated into a graph database for network management and analysis. For numerical matching, we propose the range difference similarity and bucket dot product similarity metrics. For non-numerical matching, we design the top priority match metric and also propose applying the minHash-locality-sensitive-hashing algorithm, which reduces matching time for large databases. To address the drawback of manual thresholds, an effective classification-based method is also proposed on top of the proposed similarity metrics. The results of NEMA are demonstrated to be promising, with better efficiency than and comparable accuracy to the baseline algorithms.
With the explosion of big data and the popularity of distributed graph processing systems, this work has the potential to significantly reduce the human work involved in identifying matching fields for large graph database construction, and it can also be applied to large-scale data matching. Our matching algorithms can find many partially matching pairs that are not easily detected by humans.
References
 [1] (2010) Graph data management and mining: a survey of algorithms and applications. In Managing and mining graph data, pp. 13–68. Cited by: §I.
 [2] (1983) Automatic spelling correction using a trigram similarity measure. Information Processing & Management 19 (4), pp. 255–261. Cited by: §VF1.
 [3] (2005) Schema and ontology matching with COMA++. In Proceedings of the 2005 ACM SIGMOD international conference on Management of data, pp. 906–908. Cited by: §VB, §VI, §VI.
 [4] (2009) OntoMatch: a monotonically improving schema matching system for autonomous data integration. In Information Reuse & Integration, 2009. IRI’09. IEEE International Conference on, pp. 318–323. Cited by: §I, §IIIB4.
 [5] (2018) Matching techniques for data integration and exploration: from databases to big data. In A Comprehensive Guide Through the Italian Database Research Over the Last 25 Years, pp. 61–76. Cited by: §I, §VI.
 [6] (2013) String similarity metrics for ontology alignment. In The Semantic Web–ISWC 2013, pp. 294–309. Cited by: §IIIB.

 [7] (2018) BigGorilla: an open-source ecosystem for data preparation and integration. IEEE Data Eng. Bull. 41 (2), pp. 10–22. Cited by: §I.
 [8] (2012) Mining schema matching between heterogeneous databases. In Consumer Electronics, Communications and Networks (CECNet), 2012 2nd International Conference on, pp. 1128–1131. Cited by: §I, §VI.
 [9] (2016) Database systems: design, implementation, & management. Cengage Learning. Cited by: §I.
 [10] (2016) The advantages of an ontology-based data management approach: openness, interoperability and data quality. Scientometrics, pp. 1–15. Cited by: §I.
 [11] (2015) Green information technology: a sustainable approach. Morgan Kaufmann. Cited by: §I.
 [12] (2015) Big data integration. Synthesis Lectures on Data Management 7 (1), pp. 1–198. Cited by: §I.
 [13] (2007) Instance matching with COMA+. Cited by: §VF, §VI, §VI.
 [14] (2014) Data integration in the era of omics: current and future challenges. BioMed Central. Cited by: §I.
 [15] (2016) The interaction between schema matching and record matching in data integration. IEEE Transactions on Knowledge and Data Engineering 29 (1), pp. 186–199. Cited by: §I.
 [16] (2015) Heterogeneous network edge prediction: a data integration approach to prioritize disease-associated genes. PLoS computational biology 11 (7), pp. e1004259. Cited by: §I.
 [17] (2019) Embench++: data for a thorough benchmarking of matchingrelated methods. Semantic Web (Preprint), pp. 1–16. Cited by: §IIIB4.
 [18] (2017) History management for network information of IoT devices. In International Conference on Security with Intelligent Computing and Big-data Services, pp. 138–145. Cited by: §I.
 [19] (2010) Sampling dirty data for matching attributes. In Proceedings of the 2010 ACM SIGMOD International Conference on Management of data, pp. 63–74. Cited by: §IVA.
 [20] (2014) Mining of massive datasets. Cambridge University Press. Cited by: §IIID2, §IIID.
 [21] (2018) Beyond macro-benchmarks: micro-benchmark-based graph database evaluation. Proceedings of the VLDB Endowment 12 (4), pp. 390–403. Cited by: §II, §II.
 [22] (2019) COMBATtbneodb: fostering tuberculosis research through integrative analysis using graph database technologies. Bioinformatics. Cited by: §II, §II.
 [23] (2005) Corpus-based schema matching. In 21st International Conference on Data Engineering (ICDE’05), pp. 57–68. Cited by: §VI.
 [24] (2012) Instance based matching using regular expression. Procedia Computer Science 10, pp. 688–695. Cited by: §I, §VF1, §VF, §VI.
 [25] (2002) Similarity flooding: a versatile graph matching algorithm and its application to schema matching. In Data Engineering, 2002. Proceedings. 18th International Conference on, pp. 117–128. Cited by: §VI, §VI.
 [26] (2013) Graph database applications and concepts with neo4j. In Proceedings of the Southern Association for Information Systems Conference, Atlanta, GA, USA, Vol. 2324. Cited by: §II, §II.
 [27] (2007) Information retrieval and machine learning for probabilistic schema matching. Information processing & management 43 (3), pp. 552–576. Cited by: §I, §VI.
 [28] (2014) Network management through graphs in software defined networks. In 10th International Conference on Network and Service Management (CNSM) and Workshop, pp. 400–405. Cited by: §I.
 [29] (2012) Sample-driven schema mapping. In Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data, pp. 73–84. Cited by: §I, §VI.
 [30] (2001) A survey of approaches to automatic schema matching. the VLDB Journal 10 (4), pp. 334–350. Cited by: §VI.
 [31] (2014) A survey of indexing techniques for XML database. Compusoft 3 (1), pp. 461. Cited by: §I.
 [32] (1998) Learning string-edit distance. IEEE Transactions on Pattern Analysis and Machine Intelligence 20 (5), pp. 522–532. Cited by: §VF1.
 [33] (2017) The ubiquity of large graphs and surprising challenges of graph processing. Proceedings of the VLDB Endowment 11 (4), pp. 420–431. Cited by: §I.
 [34] (2005) A survey of schema-based matching approaches. In Journal on data semantics IV, pp. 146–171. Cited by: §VI.
 [35] (2013) Ontology matching: state of the art and future challenges. IEEE Transactions on knowledge and data engineering 25 (1), pp. 158–176. Cited by: §VI.
 [36] (2018) Data integration: the current status and the way forward.. IEEE Data Eng. Bull. 41 (2), pp. 3–9. Cited by: §I.
 [37] (2017) Meaningful maps with object-oriented semantic mapping. In Intelligent Robots and Systems (IROS), 2017 IEEE/RSJ International Conference on, pp. 5079–5085. Cited by: §I.
 [38] (2016) Big data analytics for emergency communication networks: a survey. IEEE Communications Surveys & Tutorials 18 (3), pp. 1758–1778. Cited by: §I.
 [39] (2008) An effective content-based schema matching algorithm. In 2008 International Seminar on Future Information Technology and Management Engineering, pp. 7–11. Cited by: §VI.