Papers on Tensors and Their Use in Information Retrieval

Navigation

Ce wiki

Cette page

Multi-way Clustering on Relation Graphs

Arindam Banerjee, Department of CSE, University of Minnesota
Sugato Basu, AI Center, SRI International
Srujana Merugu, Yahoo! Research
SIAM International Conference on Data Mining (SDM'07)

http://tinyurl.com/2ay3bv

Abstract :

A number of real-world domains such as social networks and e-commerce involve heterogeneous data that describes relations between multiple classes of entities. Understanding the natural structure of this type of heterogeneous relational data is essential both for exploratory analysis and for performing various predictive modeling tasks. In this paper, we propose a principled multi-way clustering framework for relational data, wherein different types of entities are simultaneously clustered based not only on their intrinsic attribute values, but also on the multiple relations between the entities. To achieve this, we introduce a relation graph model that describes all the known relations between the different entity classes, in which each relation between a given set of entity classes is represented in the form of multi-modal tensor over an appropriate domain. Our multi-way clustering formulation is driven by the objective of capturing the maximal \information" in the original relation graph, i.e., accurately approximating the set of tensors corresponding to the various relations. This formulation is applicable to all Bregman divergences (a broad family of loss functions that includes squared Euclidean distance, KL-divergence), and also permits analysis of mixed data types using convex combinations of appropriate Bregman loss functions. Furthermore, we present a large family of structurally different multi-way clustering schemes that preserve various linear summary statistics of the original data. We accomplish the above generalizations by extending a recently proposed key theoretical result, namely the minimum Bregman information principle [1], to the relation graph setting. We also describe an ecient multi-way clustering algorithm based on alternate minimization that generalizes a number of other recently proposed clustering methods. Empirical results on datasets obtained from real-world domains (e.g., movie recommendations, newsgroup articles) demonstrate the generality and efficacy of our framework.

--------------------------

Latent semantic analysis for multiple-type interrelated data objects

Xuanhui Wang, University of Illinois at Urbana-Champaign
Jian-Tao Sun, Microsoft Research Asia, Beijing, P.R.China
Zheng Chen, Microsoft Research Asia, Beijing, P.R.China
ChengXiang Zhai, University of Illinois at Urbana-Champaign

SIGIR '06

http://doi.acm.org/10.1145/1148170.1148214

Abstract :

Co-occurrence data is quite common in many real applications. Latent
Semantic Analysis (LSA) has been successfully used to identify semantic
relations in such data. However, LSA can only handle a single
co-occurrence relationship between two types of objects. In practical
applications, there are many cases where multiple types of objects exist
and any pair of these objects could have a pairwise co-occurrence
relation. All these co-occurrence relations can be exploited to
alleviate data sparseness or to represent objects more meaningfully. In
this paper, we propose a novel algorithm, /M-LSA/, which conducts latent
semantic analysis by incorporating all pairwise co-occurrences among
multiple types of objects. Based on the mutual reinforcement principle,
M-LSA identifies the most salient concepts among the co-occurrence data
and represents all the objects in a unified semantic space. M-LSA is
general and we show that several variants of LSA are special cases of
our algorithm. Experiment results show that M-LSA outperforms LSA on
multiple applications, including collaborative filtering, text
clustering, and text categorization.

--------------------------

Higher-Order Web Link Analysis Using Multilinear Algebra

Tamara G. Kolda, Sandia National Laboratories
Brett W. Bader, Sandia National Laboratories
Joseph P. Kenny, Sandia National Laboratories
International Conference on Data Mining (ICDM'05) pp. 242-249

tinyurl.com/2d29xx

Abstract :

Linear algebra is a powerful and proven tool in web search. Techniques, such as the PageRank algorithm of Brin and Page and the HITS algorithm of Kleinberg, score web pages based on the principal eigenvector (or singular vector) of a particular non-negative matrix that captures the hyperlink structure of the web graph. We propose and test a new methodology that uses multilinear algebra to elicit more information from a higher-order representation of the hyperlink graph. We start by labeling the edges in our graph with the anchor text of the hyperlinks so that the associated linear algebra representation is a sparse, three-way tensor. The first two dimensions of the tensor represent the web pages while the third dimension adds the anchor text. We then use the rank-1 factors of a multilinear PARAFAC tensor decomposition, which are akin to singular vectors of the SVD, to automatically identify topics in the collection along with the associated authoritative web pages.

--------------------------

Beyond Streams and Graphs: Dynamic Tensor Analysis

Jimeng Sun, Carnegie Mellon, University, Pittsburgh, PA
Dacheng Tao, Birkbeck College, University of London, UK
Christos Faloutsos, Carnegie Mellon, University, Pittsburgh, PA
International Conference on Knowledge Discovery and Data Mining (SIGKDD'06) pp. 374 - 383

tinyurl.com/yqpr65

Abstract:

How do we find patterns in author-keyword associations, evolving over time? Or in Data Cubes, with product-branch-customer sales information? Matrix decompositions, like principal component analysis (PCA) and variants, are invaluable tools for mining, dimensionality reduction, feature selection, rule identification in numerous settings like streaming data, text, graphs, social networks and many more. However, they have only two orders, like author and keyword, in the above example.We propose to envision such higher order data as tensors,and tap the vast literature on the topic. However, these methods do not necessarily scale up, let alone operate on semi-infinite streams. Thus, we introduce the dynamic tensor analysis (DTA) method, and its variants. DTA provides a compact summary for high-order and high-dimensional data, and it also reveals the hidden correlations. Algorithmically, we designed DTA very carefully so that it is (a) scalable, (b) space efficient (it does not need to store the past) and (c) fully automatic with no need for user defined parameters. Moreover, we propose STA, a streaming tensor analysis method, which provides a fast, streaming approximation to DTA.We implemented all our methods, and applied them in two real settings, namely, anomaly detection and multi-way latent semantic indexing. We used two real, large datasets, one on network flow data (100GB over 1 month) and one from DBLP (200MB over 25 years). Our experiments show that our methods are fast, accurate and that they find interesting patterns and outliers on the real datasets.

--------------------------

CubeSVD: A Novel Approach to Personalized Web Search

Jian-Tao Sun, TsingHua University, Beijing, China
Hua-Jun Zeng, Microsoft Research Asia, Beijing, China
Huan Liu, Arizona State University, Tempe, AZ
Yuchang Lu, TsingHua University, Beijing, China
Zheng Chen, Microsoft Research Asia, Beijing, China
International Conference on World Wide Web (WWW’05) pp. 382 - 390

tinyurl.com/2cqrmy

Abstract:

As the competition of Web search market increases, there is a high demand for personalized Web search to conduct retrieval incorporating Web users' information needs. This paper focuses on utilizing clickthrough data to improve Web search. Since millions of searches are conducted everyday, a search engine accumulates a large volume of clickthrough data, which records who submits queries and which pages he/she clicks on. The clickthrough data is highly sparse and contains different types of objects (user, query and Web page), and the relationships among these objects are also very complicated. By performing analysis on these data, we attempt to discover Web users' interests and the patterns that users locate information. In this paper, a novel approach CubeSVD is proposed to improve Web search. The clickthrough data is represented by a 3-order tensor, on which we perform 3-mode analysis using the higher-order singular value decomposition technique to automatically capture the latent factors that govern the relations among these multi-type objects: users, queries and Web pages. A tensor reconstructed based on the CubeSVD analysis reflects both the observed interactions among these objects and the implicit associations among them. Therefore, Web search activities can be carried out based on CubeSVD analysis. Experimental evaluations using a real-world data set collected from an MSN search engine show that CubeSVD achieves encouraging search results in comparison with some standard methods.

--------------------------

Tensor Space Model for Document Analysis

Deng Cai, UIUC
Xiaofei He,Yahoo! Research Labs
Jiawei Han, UIUC
International Conference on Research and Development in Information Retrieval (SIGIR’06) pp. 625 - 626

tinyurl.com/ysq57n

Abstract:

Vector Space Model (VSM) has been at the core of information retrieval for the past decades. VSM considers the documents as vectors in high dimensional space.In such a vector space, techniques like Latent Semantic Indexing (LSI), Support Vector Machines (SVM), Naive Bayes, etc., can be then applied for indexing and classification. However, in some cases, the dimensionality of the document space might be extremely large, which makes these techniques infeasible due to the curse of dimensionality. In this paper, we propose a novel Tensor Space Model for document analysis. We represent documents as the second order tensors, or matrices. Correspondingly, a novel indexing algorithm called Tensor Latent Semantic Indexing (TensorLSI) is developed in the tensor space. Our theoretical analysis shows that TensorLSI is much more computationally efficient than the conventional Latent Semantic Indexing, which makes it applicable for extremely large scale data set. Several experimental results on standard document data sets demonstrate the efficiency and effectiveness of our algorithm.

--------------------------

A Multilinear Singular Value Decomposition

Lieven De Lathauwer
Bart De Moor
Joos Vandewalle
SIAM Journal on Matrix Analysis and Applications, Volume 21, Issue 4, 2000, pp. 1253-1278

tinyurl.com/yr5bug

Abstract:

We discuss a multilinear generalization of the singular value decomposition. There is a strong analogy between several properties of the matrix and the higher-order tensor decomposition; uniqueness, link with the matrix eigenvalue decomposition, first-order perturbation effects, etc., are analyzed. We investigate how tensor symmetries affect the decomposition and propose a multilinear generalization of the symmetric eigenvalue decomposition for pair-wise symmetric tensors.

--------------------------

Text Representation: From Vector to Tensor

Ning Liu, Dept. of Math. Sci., Tsinghua Univ., Beijing, China
Benyu Zhang, Dept. of Math. Sci., Tsinghua Univ., Beijing, China
Jun Yan, Dept. of Math. Sci., Tsinghua Univ., Beijing, China
Zheng Chen, Dept. of Math. Sci., Tsinghua Univ., Beijing, China
Wenyin Liu, Dept. of Math. Sci., Tsinghua Univ., Beijing, China
Fengshan Bai, Dept. of Math. Sci., Tsinghua Univ., Beijing, China
Leefeng Chien , Dept. of Math. Sci., Tsinghua Univ., Beijing, China
International Conference on Data Mining (ICDM’05)

tinyurl.com/2zduut

Abstract

In this paper, we propose a text representation model, Tensor Space Model (TSM), which models the text by multilinear algebraic high-order tensor instead of the traditional vector. Supported by techniques of multilinear algebra, TSM offers a potent mathematical framework for analyzing the multifactor structures. TSM is further supported by certain introduced particular operations and presented tools, such as the High-Order Singular Value Decomposition (HOSVD) for dimension reduction and other applications. Experimental results on the 20 Newsgroups dataset show that TSM is constantly better than VSM for text classification.

--------------------------

MMDS 2006. Workshop on Algorithms for Modern Massive Data Sets

Stanford University and Yahoo! Research
June 21–24, 2006
www.stanford.edu/group/mmds/

In particular: Saturday, June 24, 2006. Theme: Tensor-Based Data Applications

Multilinear algebra for analyzing data with multiple linkages (talk)
Tammy Kolda
www.stanford.edu/group/mmds/slides/kolda-mmds.pdf

Tensor based text representation: a new dimension in IR (poster)
Ioannis Antonellis, Efstratios Gallopoulos
www.stanford.edu/~antonell/papers/MMDS2006-poster.pdf

--------------------------

Cross-Language Information Retrieval Using PARAFAC2
Peter A Chew
Brett W Bader
Tamara G Kolda
Ahmed Abdelali

SANDIA REPORT, 2007

tinyurl.com/2ftgn5

Abstract:

A standard approach to cross-language information retrieval (CLIR) uses Latent Semantic Analysis (LSA) in conjunction with a multilingual parallel aligned corpus. This approach has been shown to be successful in identifying similar documents across languages - or more precisely, retrieving the most similar document in one language to a query in another language. However, the approach has severe drawbacks when applied to a related task, that of clustering documents ‘language-independently’, so that documents about similar topics end up closest to one another in the semantic space regardless of their language. The problem is that documents are generally more similar to other documents in the same language than they are to documents in a different language, but on the same topic. As a result, when using multilingual LSA, documents will in practice cluster by language, not by topic.

We propose a novel application of PARAFAC2 (which is a variant of PARAFAC, a multi-way generalization of the singular value decomposition [SVD]) to overcome this problem. Instead of forming a single multilingual term-by-document matrix which, under LSA, is subjected to SVD, we form an irregular three-way array, each slice of which is a separate term-by-document matrix for a single language in the parallel corpus. The goal is to compute an SVD for each language such that V (the matrix of right singular vectors) is the same across all languages. Effectively, PARAFAC2 imposes the constraint, not present in standard LSA, that the ‘concepts’ in all documents in the parallel corpus are the same regardless of language. Intuitively, this constraint makes sense, since the whole purpose of using a parallel corpus is that exactly the same concepts are expressed in the translations.

We tested this approach by comparing the performance of PARAFAC2 with standard LSA in solving a particular CLIR problem. From our results, we conclude that PARAFAC2 offers a very promising alternative to LSA not only for multilingual document clustering, but also for solving other problems in cross-language information retrieval.

--------------------------

Non-negative tensor factorization with applications to statistics and computer vision

Amnon Shashua
Tamir Hazan

International Conference on Machine Learning (ICML'05) pp. 792-799

tinyurl.com/2zn7w8

Abstract:

We derive algorithms for finding a non-negative n-dimensional tensor factorization (n-NTF) which includes the non-negative matrix factorization (NMF) as a particular case when n = 2. We motivate the use of n-NTF in three areas of data analysis: (i) connection to latent class models in statistics, (ii) sparse image coding in computer vision, and (iii) model selection problems. We derive a "direct" positive-preserving gradient descent algorithm and an alternating scheme based on repeated multiple rank-1 problems.

-----------------------------------

On the uniqueness of multilinear decomposition of N-way arrays
Nicholas D. Sidiropoulos
Rasmus Bro
JOURNAL OF CHEMOMETRICS, J. Chemometrics 2000; 14: 229–239
http://tinyurl.com/yrld2u

Abstract:

We generalize Kruskal’s fundamental result on the uniqueness of trilinear decomposition of three-way arrays to
the case of multilinear decomposition of four- and higher-way arrays. The result is surprisingly general and
simple and has several interesting ramifications.

-----------------------------------

The Tucker3 model in Chemometrics

Claus Andersson
http://www.models.kvl.dk/courses/tucker/ttext.htm