Please refer to the following article when using the source code: The Anh Pham, Sabine Barrat, Mathieu Delalandre and Jean-Yves Ramel, "An efficient indexing scheme based on linked-node m-ary tree structure". In Proceedings of the 17th International Conference on Image Analysis and Processing (ICIAP 2013), A. Petrosino (Ed.): Part I, LNCS, Volume 8156, pp. 752-762, 2013.
1. Introduction
Fast nearest neighbor search is a crucial need for many recognition systems. Although a large number of indexing algorithms have been proposed in the literature, few of them (e.g., randomized KD-trees, the hierarchical K-means tree, randomized clustering trees, and LSH-based schemes) have been validated by extensive experiments showing satisfactory performance on specific benchmarks. In this work, we propose a linked-node m-ary tree (LM-tree) indexing algorithm, which works well for both exact and approximate nearest neighbor search. The contribution of the LM-tree is threefold. First, a new polar-space-based method of data decomposition is presented to construct the LM-tree. Second, a novel pruning rule is proposed to efficiently narrow down the search space. Finally, a bandwidth search method is introduced to explore the nodes of the LM-tree. Our experiments, applied to one million 128-dimensional SIFT features and 250,000 960-dimensional GIST features, show that the proposed algorithm gives the best search performance compared to the aforementioned indexing algorithms.

2. Construction of the LM-tree
Given a dataset X composed of N feature vectors in a D-dimensional space R^{D}, we present in this section an indexing structure for X that supports efficient proximity searching. For clarity, we use the notation p for a point in the R^{D} feature space, and p_{i} for the i-th component of p (1 <= i <= D). We also denote p = (p_{i1}, p_{i2}) as a point in a 2D space. Before constructing the LM-tree, the dataset X is normalized by aligning it to the principal axes obtained from PCA analysis. Note that no dimension reduction is performed in this step; PCA is used only to align the data.
Next, the LM-tree is constructed by recursively partitioning the dataset X into m roughly equal-sized subsets as follows:
1. Sort the axes in decreasing order of variance, and choose two axes, i_{1} and i_{2}, at random from the L highest-variance axes (L < D).
2. Project every point p in X onto the plane spanned by axes i_{1} and i_{2} centered at c, where c is the centroid of the set X, and then compute the angle: phi = arctan2(p_{i1} - c_{i1}, p_{i2} - c_{i2}).
3. Sort the angles {phi_t} with 1 <= t <= n (n = |X|) in increasing order, and then divide them into m equal sub-partitions: (0, phi_{t1}], (phi_{t1}, phi_{t2}], ..., (phi_{t(m-1)}, 360].
4. Partition the set X into m subsets X_{k} with 1 <= k <= m, corresponding to the m angle sub-partitions obtained in the previous step.
For each subset X_{k}, a new node T_{k} is constructed and attached to its parent node, where we also store the following information: the split axes (i.e., i_{1} and i_{2}), the split centroid (c_{i1}, c_{i2}), the split angle phi_{tk}, and the split projected points {(p^{k}_{i1}, p^{k}_{i2})}, where the point (p^{k}_{i1}, p^{k}_{i2}) corresponds to the split angle phi_{tk}. For efficient access across these child nodes, a direct link is established between each pair of adjacent nodes T_{k} and T_{k+1}, and the last node T_{m} is linked back to the first one, T_{1}. Next, we repeat this partitioning process for each subset X_{k} associated with the child node T_{k} until the number of data points in each node falls below a predefined threshold. It is worth noting that each time a partition is performed, the two highest-variance axes of the corresponding subset are employed. This is in contrast to many existing tree-based indexing algorithms, where only one axis is employed to partition the data. Consequently, in a high-dimensional feature space such as that of 128-dimensional SIFT features, the total number of axes involved in the construction of those trees is rather limited, making any pruning rule inefficient and the tree less discriminative for subsequent searching. Naturally, the more principal axes are involved in partitioning the data, the more benefit we obtain in both search efficiency and precision. Figure 1(a) illustrates the first and second levels of the LM-tree construction with a branching factor m = 6.
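The decomposition and node-linking steps above can be sketched in plain Python as follows. This is an illustrative sketch only, not the released source code: all function and field names are our own, the PCA alignment step is assumed to have been applied beforehand, and details such as the storage of the split projected points are simplified.

```python
import math
import random

def polar_partition(X, m, L=5):
    """Split X (a list of D-dimensional points) into m roughly equal-sized
    subsets by angle in the plane of two randomly chosen high-variance axes.
    The parameter L is the size of the candidate-axis pool (an assumption)."""
    n, D = len(X), len(X[0])
    c = [sum(p[i] for p in X) / n for i in range(D)]                 # centroid
    var = [sum((p[i] - c[i]) ** 2 for p in X) / n for i in range(D)]
    order = sorted(range(D), key=lambda i: -var[i])                  # by variance
    i1, i2 = random.sample(order[:min(L, D)], 2)

    def angle(p):                                                    # in [0, 360)
        return math.degrees(math.atan2(p[i1] - c[i1], p[i2] - c[i2])) % 360.0

    idx = sorted(range(n), key=lambda t: angle(X[t]))
    # Cut the sorted angles into m equal-sized angular sub-partitions.
    size, rem = divmod(n, m)
    subsets, start = [], 0
    for k in range(m):
        end = start + size + (1 if k < rem else 0)
        subsets.append(idx[start:end])
        start = end
    split_angles = [angle(X[s[-1]]) for s in subsets if s]
    return subsets, (i1, i2), (c[i1], c[i2]), split_angles

class LMTreeNode:
    """One LM-tree node: split information plus a direct link to the
    adjacent sibling; the last child links back to the first."""
    def __init__(self):
        self.points = None          # data points kept only at leaves
        self.children = []
        self.next_sibling = None
        self.split_axes = None      # (i1, i2)
        self.split_centroid = None  # (c_i1, c_i2)
        self.split_angles = None    # phi_t1 .. phi_tm

def build_lm_tree(points, m=6, leaf_size=16):
    """Recursively partition `points` until a node holds too few of them."""
    node = LMTreeNode()
    if len(points) < max(leaf_size, m):       # stop: store points at the leaf
        node.points = points
        return node
    subsets, axes, centroid, angles = polar_partition(points, m)
    node.split_axes, node.split_centroid, node.split_angles = axes, centroid, angles
    node.children = [build_lm_tree([points[t] for t in s], m, leaf_size)
                     for s in subsets if s]
    # Link adjacent children, then close the ring: T_m points back to T_1.
    for a, b in zip(node.children, node.children[1:]):
        a.next_sibling = b
    node.children[-1].next_sibling = node.children[0]
    return node
```

The circular sibling links let a search move from one angular sector to its neighbors in constant time, without going back through the parent.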
3. Search performance
We have compared the exact nearest neighbor (ENN) search performance of three systems: the proposed LM-tree, the KD-tree, and the hierarchical K-means tree. Figure 2 (left) shows the speedup over brute-force search of the three systems, applied to SIFT datasets of different sizes. We can notice that the LM-tree outperforms the two other systems on all tests. Figure 2 (right) presents the search performance of the three systems on the GIST features, where the proposed LM-tree again outperforms the others, even more markedly than on the SIFT features. Taking the test with #Points = 150,000 in Figure 2 (right), for example, the LM-tree gives a speedup of 17.2 over brute-force search, while the KD-tree and the K-means tree give speedups of 3.53 and 1.42, respectively. These results confirm the efficiency of the LM-tree for ENN search relative to the two baseline systems.
Fig 2. Exact nearest neighbor search performance of 3 systems
For ANN search performance, four systems take part in this evaluation: the proposed LM-trees, RKD-trees, RC-trees, and the K-means tree. We have used 8 parallel trees in the first three systems, while the last uses a single tree, because it was shown in [Muja09] that the use of multiple K-means trees does not improve search performance. For the LM-trees, the parameters E_{max} and epsilon are determined empirically to obtain a search precision varying in [90%, 99%]. Figure 3 (left) shows the search speedup versus the search precision of all the systems. As we can see, the proposed LM-trees algorithm gives significantly better search performance than the other systems everywhere. Taking a search precision of 95%, for example, the speedups over brute-force search of the LM-trees, RKD-trees, RC-trees, and K-means tree are 167.7, 108.4, 122.4, and 114.5, respectively. To make the comparison with the multi-probe LSH indexing algorithm possible, we converted the real-valued SIFT features to binary vectors and tried several parameter settings (i.e., the number of hash tables, the number of multi-probe levels, and the length of the hash key) to obtain its best search performance. However, the result obtained on one million SIFT vectors is rather limited: at a search precision of 74.7%, for instance, the speedup over brute-force search (using the Hamming distance) is just 1.5.
Figure 3 (right) shows the search performance of all systems on 200,000 GIST features. Again, the LM-trees algorithm clearly outperforms the others, and by an even larger margin than on the SIFT features. The RC-trees algorithm also works reasonably well, while the RKD-trees and the K-means tree perform poorly on this dataset. Taking a search precision of 90%, for example, the speedups over brute-force search of the LM-trees, RKD-trees, RC-trees, and K-means tree are 113.5, 15.0, 45.2, and 21.2, respectively.
Three crucial factors explain these results of the LM-trees. First, the use of the two highest-variance axes for data partitioning gives a more discriminative representation of the data, compared with the common use of the single highest-variance axis in the literature. Second, with the approximate pruning rule, a larger fraction of nodes is inspected, but many of them are eliminated after checking the lower bound. In this way, the number of data points actually searched is kept under the predefined threshold E_{max}, while a larger number of nodes is inspected, increasing the chance of reaching the true nodes closest to the query. Finally, the use of bandwidth search costs much less computation than the priority search used in the baseline indexing systems.
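The interplay between the lower-bound check and the E_{max} budget can be sketched as follows. This is a simplified, generic illustration, not the paper's bandwidth search: the node layout, the function names, and the child-visiting order (here simply sorted by lower bound) are all our own assumptions.

```python
import math

def pruned_nn_search(node, query, e_max, best=(None, math.inf), visited=0):
    """Nearest-neighbor descent with two mechanisms from the text above:
    a subtree is eliminated when its cheap lower bound already exceeds the
    current best distance, and at most e_max data points are actually
    compared against the query. Nodes are dicts with either a 'points'
    list (leaf) or a 'children' list; each child carries a 'lower_bound'
    function -- all names are illustrative."""
    if visited >= e_max:
        return best, visited
    if 'points' in node:                       # leaf: exact comparisons
        for p in node['points']:
            if visited >= e_max:
                break                          # budget exhausted
            visited += 1
            d = math.dist(p, query)
            if d < best[1]:
                best = (p, d)
    else:                                      # internal: prune by bound
        kids = sorted(node['children'],
                      key=lambda c: c['lower_bound'](query))
        for child in kids:
            if child['lower_bound'](query) < best[1]:
                best, visited = pruned_nn_search(child, query, e_max,
                                                 best, visited)
    return best, visited
```

Note how a node can be inspected cheaply (only its lower bound is computed) without its points counting against the E_{max} budget; only the points compared inside visited leaves are counted, which is exactly why a larger set of nodes can be covered at a fixed search cost.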
Fig 3. Approximate nearest neighbor search performance of 4 systems
4. Source code
The source code of the LM-tree and all the experiments is available on the Download page.

