We are pleased to announce that the next edition of the Large-scale and distributed systems for information retrieval workshop will be co-located with ACM WSDM 2014, in New York City.
Workshop date: Feb. 28, 2014
The call for papers is already available.
Ismail Sengor Altingovde
B. Barla Cambazoglu
The 10th International Workshop on
Large-Scale and Distributed Systems
for Information Retrieval
Co-located with ACM WSDM 2013,
February 5, 2013, Rome, Italy
09:00-09:10  Welcome and Opening
09:10-10:10  Invited Talk: Analyzing the performance of top-k retrieval algorithms
10:10-10:40  Retrieval of Highly Dynamic Information in an Unstructured Peer-to-Peer Network
             H. Asthana and Ingemar Cox

Session: Digital Libraries & Archives
11:00-11:30  Scalability Bottlenecks of the CiteSeerX Digital Library Search Engine
             Jian Wu, Pradeep Teregowda, Eric Treece, Madian Khabsa, Douglas Jordan, Stephen Carman, Prasenjit Mitra and C. Lee Giles
11:30-12:00  A Supervised Learning Method for Context-Aware Citation Recommendation in a Large Corpus
12:00-12:30  User-Defined Redundancy in Web Archives
             Bibek Paudel, Avishek Anand and Klaus Berberich
12:30-13:00  Metric Suffix Array For Large-Scale Similarity Search
             Hisham Mohamed and Stéphane Marchand-Maillet

14:30-15:30  Invited Talk: Quasi-succinct indices
15:30-16:00  Efficient Weighted Histogramming on GPUs with HASH
             Maohua Zhu, Ningyi Xu, Di Wu, Chunshui Zhao, Yangdong Deng, Yu Wang and Feng-Hsiung Hsu

Session: Large Scale Techniques
16:30-17:00  Analysis of performance evaluation techniques for Large Scale Information Retrieval
             Ana Freire, Fidel Cacheda, Vreixo Formoso and Víctor Carneiro
17:00-17:30  Evaluating inverted files for visual compact codes on a large scale
             Giuseppe Amato, Paolo Bolettieri, Fabrizio Falchi and Claudio Gennaro
The proceedings of the Workshop are available here: LSDS-IR 2013 Proceedings.
The submission deadline is extended to December 7, 2012.
Authors who have already submitted their papers can upload a new version as needed.
You have one more week to submit your paper!
We are glad to announce that Dr. Marcus Fontoura from Google will give an invited talk at LSDS-IR 2013.
Analyzing the performance of top-k retrieval algorithms
Top-k retrieval is at the core of many modern applications: from large-scale web search and advertising platforms to text extenders and content management systems. In these systems, queries are evaluated using two major families of algorithms: document-at-a-time (DAAT) and term-at-a-time (TAAT). DAAT and TAAT algorithms have been studied extensively in the research literature. In this talk, I'll present an analysis and comparison of several DAAT and TAAT algorithms, focusing on their performance characteristics.
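The difference between the two families can be illustrated with a minimal sketch over a toy inverted index (the index, terms, and scores below are illustrative assumptions, not material from the talk):

```python
import heapq
from collections import defaultdict

# Toy inverted index: term -> postings list of (doc_id, score contribution),
# sorted by doc_id.
INDEX = {
    "large": [(1, 0.5), (2, 0.25), (4, 1.0)],
    "scale": [(1, 0.5), (3, 0.75), (4, 0.125)],
}

def taat_topk(terms, k):
    """Term-at-a-time: process one postings list fully before the next,
    accumulating partial scores per document."""
    acc = defaultdict(float)
    for t in terms:
        for doc, w in INDEX.get(t, []):
            acc[doc] += w
    return heapq.nlargest(k, acc.items(), key=lambda x: x[1])

def daat_topk(terms, k):
    """Document-at-a-time: advance a cursor per postings list in doc-id
    order, scoring each document completely before moving on."""
    lists = [INDEX.get(t, []) for t in terms]
    pos = [0] * len(lists)
    heap = []  # min-heap of (score, doc) holding the current top-k
    while True:
        cand = [lst[p][0] for lst, p in zip(lists, pos) if p < len(lst)]
        if not cand:
            break
        doc = min(cand)  # smallest unprocessed doc id across all cursors
        score = 0.0
        for i, lst in enumerate(lists):
            if pos[i] < len(lst) and lst[pos[i]][0] == doc:
                score += lst[pos[i]][1]
                pos[i] += 1
        heapq.heappush(heap, (score, doc))
        if len(heap) > k:
            heapq.heappop(heap)  # evict the current k+1-th best
    return sorted(((d, s) for s, d in heap), key=lambda x: -x[1])
```

Both produce the same top-k results; the practical trade-off is that TAAT needs an accumulator per candidate document, while DAAT keeps only one cursor per query term and a bounded heap.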
Marcus Fontoura finished his Ph.D. in 1999 at the Pontifical Catholic University of Rio de Janeiro, Brazil (PUC-Rio), in a joint program with the Computer Systems Group at the University of Waterloo, Canada. Since then he has held research posts at the Princeton University Computer Science Department, the IBM Almaden Research Center, and Yahoo! Research. He is currently a Research Scientist and Member of Technical Staff at Google. His main research areas in recent years have been web search, computational advertising, enterprise search, and databases. He has more than 40 published papers and 20 issued patents. His complete CV is available at http://fontoura.org.
We are glad to announce that Prof. Sebastiano Vigna from the Università degli Studi di Milano will give an invited talk at LSDS-IR 2013.
Compressed inverted indices in use today are based on the idea of gap compression: document pointers are stored in increasing order, and the gaps between successive document pointers are stored using suitable codes that represent smaller gaps with fewer bits. Additional data, such as counts and positions, is stored using similar techniques. A large body of research has been built over the last 30 years around gap compression, including theoretical modeling of the gap distribution, specialized instantaneous codes suitable for gap encoding, and ad hoc document reorderings that increase the efficiency of instantaneous codes. This talk will present a proposal to represent an index using a different architecture based on a quasi-succinct representation of monotone sequences. We will show that, besides being theoretically elegant and simple, the new index provides expected constant-time operations, space savings, and, in practice, significant performance improvements on conjunctive, phrasal and proximity queries.
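The gap-compression baseline the talk argues against can be sketched in a few lines, here with variable-byte codes as the (assumed, illustrative) instantaneous code; the quasi-succinct index replaces this whole scheme with an Elias-Fano-style representation of the monotone doc-id sequence:

```python
def vbyte_encode(nums):
    """Variable-byte encoding: 7 data bits per byte, least-significant
    group first, high bit set on the final byte of each number."""
    out = bytearray()
    for g in nums:
        while g >= 128:
            out.append(g & 0x7F)
            g >>= 7
        out.append(g | 0x80)
    return bytes(out)

def vbyte_decode(data):
    nums, cur, shift = [], 0, 0
    for b in data:
        if b & 0x80:  # terminator byte: assemble the full number
            nums.append(cur | ((b & 0x7F) << shift))
            cur, shift = 0, 0
        else:
            cur |= (b << shift)
            shift += 7
    return nums

def compress_postings(doc_ids):
    """Gap compression: store the first doc id, then the differences
    between successive (increasing) doc ids."""
    gaps = [doc_ids[0]] + [b - a for a, b in zip(doc_ids, doc_ids[1:])]
    return vbyte_encode(gaps)

def decompress_postings(data):
    ids, total = [], 0
    for g in vbyte_decode(data):
        total += g  # prefix-sum the gaps back into absolute doc ids
        ids.append(total)
    return ids
```

Small gaps dominate in dense postings lists, so most gaps fit in a single byte; the structural downside, which the quasi-succinct approach avoids, is that random access requires decoding and prefix-summing all preceding gaps.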
Sebastiano Vigna obtained his PhD in Computer Science from the Università degli Studi di Milano, where he is currently an Associate Professor. His interests lie in the interaction between theory and
practice. He has worked on highly theoretical topics such as computability on the reals, distributed computability, self-stabilization, minimal perfect hashing, succinct data structures, query recommendation, algorithms for large graphs and theoretical/experimental analysis of spectral rankings such as PageRank, but he is also (co)author of several widely used software tools ranging
from high-performance Java libraries to a model-driven software generator, a search engine, a crawler, a text editor and a graph compression framework. In 2011 he collaborated on the computation of the distance distribution of the whole Facebook graph, which showed that there are just 3.74 degrees of separation on Facebook.
This year's award is given to the paper entitled "Query efficiency prediction for dynamic pruning" by Nicola Tonellotto, Craig Macdonald, and Iadh Ounis. We congratulate the authors for their great work.
The decision was made by taking into account the large amount of discussion this paper generated during the workshop and, more importantly, the positive feedback it received from the reviewers.
Dr. K. Selcuk Candan has kindly agreed to share his talk slides on the Web.
We are glad to announce that Prof. Selcuk Candan from Arizona State University will give an invited talk at LSDS-IR 2011.
RanKloud: Scalable multimedia and social media retrieval and analysis in the cloud
In this talk, I will present an overview of recent developments in the area of scalable multimedia and social media retrieval and analysis in the cloud and our own efforts to build a scalable data processing middleware, called RanKloud, specifically sensitive to the needs and requirements of multimedia and social media analysis applications. RanKloud avoids waste by intelligently partitioning the data and allocating it on available resources to minimize the data replication and indexing overheads and to prune superfluous low-utility processing. It also includes a tensor-based relational data model to support the complete lifecycle (from collection to analysis) of the data, involving various integration and other manipulation steps. RanKloud also addresses the computational cost of various multi-dimensional data analysis operations, including decomposition or structural change detection, by (a) leveraging a priori background knowledge (or metadata) about one or more domain dimensions and (b) by extending compressed sensing (CS) to tensor data to encode the observed tensor streams in the form of compact descriptors. RanKloud will extend the scope of cloud-based systems to the delivery of efficient and large scale analysis over data with variable utility and, thus, will enable new and efficient applications, tools, and systems for multimedia and social media retrieval and analysis.
K. Selcuk Candan is a Professor of Computer Science and Engineering at the School of Computing, Informatics, and Decision Science Engineering at Arizona State University, where he leads the EmitLab research group. He joined the department in August 1997, after receiving his Ph.D. from the Computer Science Department at the University of Maryland at College Park. Prof. Candan's primary research interest is in the area of management of non-traditional, heterogeneous, and imprecise (such as multimedia, web, and scientific) data. His various research projects in this domain are funded by diverse sources, including the National Science Foundation, Department of Defense, Mellon Foundation, and DES/RSA (Rehabilitation Services Administration). He has published over 140 articles and many book chapters, and has authored 9 patents. Recently, he co-authored the book "Data Management for Multimedia Retrieval" for Cambridge University Press and co-edited "New Frontiers in Information and Software as Services: Service and Application Design Challenges in the Cloud" for Springer.
Prof. Candan is an editorial board member of the Very Large Databases (VLDB) journal and the Journal of Multimedia. He has served in the organization and program committees of various conferences. In 2006, he served as an organization committee member for SIGMOD’06, the flagship database conference of the ACM and one of the best conferences in the area of management of data. In 2008, he served as a PC Chair for another leading, flagship conference of the ACM, this time focusing on multimedia research (MM’08). He also served in the review board of the Proceedings of the VLDB Endowment (PVLDB). In 2010, he was a program co-chair for ACM CIVR’10 conference and a program group leader for ACM SIGMOD’10. In 2011, he is serving as a general co-chair for the ACM MM’11 conference and is in the executive committee of ACM SIGMM. In 2012, he will serve as a general co-chair for ACM SIGMOD’12. In 2012, he will also serve in the organizing committees of ACM MM’12 and VLDB’12.
We are glad to announce that Dr. Aris Gionis from Yahoo! Research will give an invited talk at LSDS-IR 2011.
Efficient algorithms for large-scale social dissemination
In this talk we will discuss two problems related to disseminating content in social networks. We first focus on the problem of distributing content from information suppliers to information consumers. We seek to maximize the overall relevance of the matched content from suppliers to consumers while regulating the overall activity, e.g., ensuring that no consumer is overwhelmed with data and that all suppliers have chances to deliver their content. We propose two b-matching algorithms, GreedyMR and StackMR, geared for the MapReduce paradigm. Both algorithms have provable approximation guarantees, and in practice they produce high-quality solutions. In the second problem addressed in the talk, we aim to guarantee that content produced by users in the network is disseminated to all their friends, while minimizing the communication cost. We propose two solutions that leverage the community structure of social graphs. Our first algorithm has an O(logn)-approximation guarantee. The second algorithm is a more scalable heuristic, which in practice performs as well as the approximation algorithm.
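The capacity-constrained matching setting in the first problem can be illustrated with a toy sequential greedy sketch (an illustrative single-machine analogue only; GreedyMR and StackMR are the actual MapReduce algorithms proposed in the talk, and the edge data below is invented):

```python
def greedy_b_matching(edges, cap):
    """Greedy b-matching sketch: scan supplier-consumer edges in
    decreasing relevance, keeping an edge only while both endpoints
    still have capacity left. This caps how much content any one
    consumer receives and how much any one supplier delivers."""
    used = {}      # node -> number of matched edges so far
    chosen = []
    for w, u, v in sorted(edges, reverse=True):
        if used.get(u, 0) < cap[u] and used.get(v, 0) < cap[v]:
            chosen.append((u, v, w))
            used[u] = used.get(u, 0) + 1
            used[v] = used.get(v, 0) + 1
    return chosen

# Hypothetical relevance-weighted edges (weight, supplier, consumer)
# and per-node capacities b.
edges = [(0.9, "s1", "c1"), (0.8, "s1", "c2"),
         (0.7, "s2", "c1"), (0.4, "s2", "c2")]
cap = {"s1": 1, "s2": 2, "c1": 1, "c2": 1}
```

With these capacities, the greedy pass keeps the 0.9 edge, then skips the 0.8 and 0.7 edges because s1 and c1 are already saturated, and finally keeps the 0.4 edge, so every consumer gets content without being overwhelmed.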
Aristides Gionis is a senior research scientist at Yahoo! Research, Barcelona. He received his Ph.D. from the Computer Science Department of Stanford University in 2003, and between 2003 and 2006 he was a senior researcher at the Basic Research Unit of the Helsinki Institute for Information Technology, Finland. His research interests include data mining, web mining, and algorithmic data analysis. Aristides is currently serving as an associate editor for the journal Knowledge and Information Systems (KAIS). He served as PC co-chair of ECML PKDD 2010. His recent program committee memberships include WWW 2011, ICDM 2011, VLDB 2011, SIGMOD 2012, and PODS 2012, and he is a senior PC member for ECML PKDD 2012, KDD 2011, and ICDE 2012.