Veleco

Efficient Top-k Retrieval on Massive Data

Abstract:

Top-k query is an important operation to return a set of interesting points in a potentially huge data space. It is analyzed in this paper that the existing algorithms cannot process top-k query on massive data efficiently. This paper proposes a novel table-scan-based T2S algorithm to efficiently compute top-k results on massive data. T2S first constructs the presorted table, whose tuples are arranged in the order of the round-robin retrieval on the sorted lists. T2S maintains only fixed number of tuples to compute results. The early termination checking for T2S is presented in this paper, along with the analysis of scan depth. The selective retrieval is devised to skip the tuples in the presorted table which are not top-k results. The theoretical analysis proves that selective retrieval can reduce the number of the retrieved tuples significantly. The construction and incremental-update/batch-processing methods for the used structures are proposed.

Introduction:

Top-k query is an important operation to return a set of interesting points from a potentially huge data space. In top-k query, a ranking function F is provided to determine the score of each tuple and k tuples with the largest scores are returned. Due to its practical importance, top-k query has attracted extensive attention proposes a novel table-scan-based T2S algorithm (Top-k by Table Scan) to compute top-k results on massive data efficiently.

The analysis of scan depth in T2S is developed also. The result size k is usually small and the vast majority of the tuples retrieved in PT are not top-k results, this paper devises selective retrieval to skip the tuples in PT which are not query results. The theoretical analysis proves that selective retrieval can reduce the number of the retrieved tuples significantly.

The construction and incremental-update/batch-processing methods for the data structures are proposed in this paper. The extensive experiments are conducted on synthetic and real life data sets.

Existing System:

To its practical importance, top-k query has attracted extensive attention. The existing top-k algorithms can be classified into three types: indexbased methods view-based methods and sorted-list-based methods . Index-based methods (or view-based methods) make use of the pre-constructed indexes or views to process top-k query.

A concrete index or view is constructed on a specific subset of attributes, the indexes or views of exponential order with respect to attribute number have to be built to cover the actual queries, which is prohibitively expensive. The used indexes or views can only be built on a small and selective set of attribute combinations.

Sorted-list-based methods retrieve the sorted lists in a round-robin fashion, maintain the retrieved tuples, update their lower-bound and upper-bound scores. When the kth largest lower-bound score is not less than the upper-bound scores of other candidates, the k candidates with the largest lower-bound scores are top-k results.

Sorted-list-based methods compute topk results by retrieving the involved sorted lists and naturally can support the actual queries. However, it is analyzed in this paper that the numbers of tuples retrieved and maintained in these methods increase exponentially with attribute number, increase polynomially with tuple number and result size.

Disadvantages:

Computational Overhead.
Data redundancy is more.
Time consuming process.

Problem Definition:

Ranking is a central part of many information retrieval problems, such as document retrieval, collaborative filtering, sentiment analysis, computational advertising (online ad placement).

Training data consists of queries and documents matching them together with relevance degree of each match. It may be prepared manually by human assessors (or raters, as Google calls them), who check results for some queries and determine relevance of each result. It is not feasible to check relevance of all documents, and so typically a technique called pooling is used only the top few documents, retrieved by some existing ranking models are checked.

Typically, users expect a search query to complete in a short time (such as a few hundred milliseconds for web search), which makes it impossible to evaluate a complex ranking model on each document in the corpus, and so a two-phase scheme is used.

Proposed System:

Our proposed system describe with layered indexing to organize the tuples into multiple consecutive layers. The top-k results can be computed by at most k layers of tuples. Also our propose layer-based Pareto-Based Dominant Graph to express the dominant relationship between records and top-k query is implemented as a graph traversal problem.

Then propose a dual-resolution layer structure. Top k query can be processed efficiently by traversing the dual-resolution layer through the relationships between tuples. propose the Hybrid- Layer Index, which integrates layer level filtering and list-level filtering to significantly reduce the number of tuples retrieved in query processing propose view-based algorithms to pre-construct the specified materialized views according to some ranking functions.

Given a top-k query, one or more optimal materialized views are selected to return the top-k results efficiently. Propose LPTA+ to significantly improve efficiency of the state-of-the-art LPTA algorithm. The materialized views are cached in memory; LPTA+ can reduce the iterative calling of the linear programming sub-procedure, thus greatly improving the efficiency over the LPTA algorithm. In practical applications, a concrete index (or view) is built on a specific subset of attributes. Due to prohibitively expensive overhead to cover all attribute combinations, the indexes (or views) can only be built on a small and selective set of attribute combinations.

If the attribute combinations of top-k query are fixed, index-based or viewbased methods can provide a superior performance. However, on massive data, users often issue ad-hoc queries, it is very likely that the indexes (or views) involved in the ad-hoc queries are not built and the practicability of these methods is limited greatly.

Correspondingly, T2S only builds presorted table, on which top-k query on any attribute combination can be dealt with. This reduces the space overhead significantly compared with index-based (or view-based) methods, and enables actual practicability for T2S.

Advantages:

The evaluation of an information retrieval system is the process of assessing how well a system meets the information needs of its users.
Traditional evaluation metrics, designed for Boolean retrieval or top-k retrieval, include precision and recall.
All common measures described here assume a ground truth notion of relevancy: every document is known to be either relevant or non-relevant to a particular query.

Modules:

Multi-keyword ranked search:

To design search schemes which allow multi-keyword query and provide result similarity ranking for effective data retrieval, instead of returning undifferentiated results.

Privacy-preserving:

To prevent the cloud server from learning additional information from the data set and the index, and to meet privacy requirements. if the cloud server deduces any association between keywords and encrypted documents from index, it may learn the major subject of a document, even the content of a short document. Therefore, the searchable index should be constructed to prevent the cloud server from performing such kind of association attack.

Efficiency:

Above goals on functionality and privacy should be achieved with low communication and computation overhead. Assume the number of query keywords appearing in a document the final similarity score is a linear function of xi, where the coefficient r is set as a positive random number. However, because the random factor “i is introduced as a part of the similarity score, the final search result on the basis of sorting similarity scores may not be as accurate as that in original scheme. For the consideration of search accuracy, we can let follow a normal distribution where the standard deviation functions as a flexible tradeoff parameter among search accuracy and security.

Conclusion:

The proposed novel T2S algorithm successfully implemented and to efficiently return top-k results on massive data by sequentially scanning the presorted table, in which the tuples are arranged in the order of round-robin retrieval on sorted lists. Only fixed number of candidates needs to be maintained in T2S. This paper proposes early termination checking and the analysis of the scan depth. Selective retrieval is devised in T2S and it is analyzed that most of the candidates in the presorted table can be skipped. The experimental results show that T2S significantly outperforms the existing algorithm.

Future Enhancement:

In future development of Multi keyword ranked search scheme should explore checking the integrity of the rank order in the search result from the un trusted network server infrastructure.

Feature Enhancement:

A novel table-scan-based T2S algorithm implemented successfully to compute top-k results on massive data efficiently. Given table T, T2Sfirst presorts T to obtain table PT(Presorted Table), whose tuples are arranged in the order of the round robin retrieval on the sorted lists. During its execution, T2S only maintains fixed and small number of tuples to compute results. It is proved that T2S has the Characteristic of early termination. It does not need to examine all tuples in PT to return results.