Density Ratio Estimation in Machine Learning

Machine learning is an interdisciplinary field of science and engineering that studies mathematical theories and practical applications of systems that learn. This book introduces theories, methods, and applications of density ratio estimation, which is a newly emerging paradigm in the machine learning community. Various machine learning problems such as non-stationarity adaptation, outlier detection, dimensionality reduction, independent component analysis, clustering, classification, and conditional density estimation can be systematically solved via the estimation of probability density ratios. The authors offer a comprehensive introduction to various density ratio estimators, including methods via density estimation, moment matching, probabilistic classification, density fitting, and density ratio fitting, and describe how these can be applied to machine learning. The book also provides mathematical theories for density ratio estimation, including parametric and non-parametric convergence analysis and numerical stability analysis, to complete the first and definitive treatment of the entire framework of density ratio estimation in machine learning.

Dr. Masashi Sugiyama is an Associate Professor in the Department of Computer Science at the Tokyo Institute of Technology.

Dr. Taiji Suzuki is an Assistant Professor in the Department of Mathematical Informatics at the University of Tokyo, Japan.

Dr. Takafumi Kanamori is an Associate Professor in the Department of Computer Science and Mathematical Informatics at Nagoya University, Japan.

MASASHI SUGIYAMA
Tokyo Institute of Technology

TAIJI SUZUKI
The University of Tokyo

TAKAFUMI KANAMORI
Nagoya University

CAMBRIDGE UNIVERSITY PRESS
Cambridge, New York, Melbourne, Madrid, Cape Town, Singapore, São Paulo, Delhi, Mexico City

Cambridge University Press
32 Avenue of the Americas, New York, NY 10013-2473, USA

Information on this title:/9780521190176

© Masashi Sugiyama, Taiji Suzuki, and Takafumi Kanamori 2012

This publication is in copyright. Subject to statutory exception and to the provisions of relevant collective licensing agreements, no reproduction of any part may take place without the written permission of Cambridge University Press.

First published 2012
Printed in the United States of America

A catalog record for this publication is available from the British Library.
Library of Congress Cataloging in Publication data is available
ISBN 978-0-521-19017-6 Hardback

Cambridge University Press has no responsibility for the persistence or accuracy of URLs for external or third-party Internet Web sites referred to in this publication and does not guarantee that any content on such Web sites is, or will remain, accurate or appropriate.

Contents

Foreword
Preface

Part I Density-Ratio Approach to Machine Learning

1 Introduction
  1.1 Machine Learning
  1.2 Density-Ratio Approach to Machine Learning
  1.3 Algorithms of Density-Ratio Estimation
  1.4 Theoretical Aspects of Density-Ratio Estimation
  1.5 Organization of this Book at a Glance

Part II Methods of Density-Ratio Estimation

2 Density Estimation
  2.1 Basic Framework
  2.2 Parametric Approach
  2.3 Non-Parametric Approach
  2.4 Numerical Examples
  2.5 Remarks
3 Moment Matching
  3.1 Basic Framework
  3.2 Finite-Order Approach
  3.3 Infinite-Order Approach: KMM
  3.4 Numerical Examples
  3.5 Remarks
4 Probabilistic Classification
  4.1 Basic Framework
  4.2 Logistic Regression
  4.3 Least-Squares Probabilistic Classifier
  4.4 Support Vector Machine
  4.5 Model Selection by Cross-Validation
  4.6 Numerical Examples
  4.7 Remarks
5 Density Fitting
  5.1 Basic Framework
  5.2 Implementations of KLIEP
  5.3 Model Selection by Cross-Validation
  5.4 Numerical Examples
  5.5 Remarks
6 Density-Ratio Fitting
  6.1 Basic Framework
  6.2 Implementation of LSIF
  6.3 Model Selection by Cross-Validation
  6.4 Numerical Examples
  6.5 Remarks
7 Unified Framework
  7.1 Basic Framework
  7.2 Existing Methods as Density-Ratio Fitting
  7.3 Interpretation of Density-Ratio Fitting
  7.4 Power Divergence for Robust Density-Ratio Estimation
  7.5 Remarks
8 Direct Density-Ratio Estimation with Dimensionality Reduction
  8.1 Discriminant Analysis Approach
  8.2 Divergence Maximization Approach
  8.3 Numerical Examples
  8.4 Remarks

Part III Applications of Density Ratios in Machine Learning

9 Importance Sampling
  9.1 Covariate Shift Adaptation
  9.2 Multi-Task Learning
10 Distribution Comparison
  10.1 Inlier-Based Outlier Detection
  10.2 Two-Sample Test
11 Mutual Information Estimation
  11.1 Density-Ratio Methods of Mutual Information Estimation
  11.2 Sufficient Dimension Reduction
  11.3 Independent Component Analysis
12 Conditional Probability Estimation
  12.1 Conditional Density Estimation
  12.2 Probabilistic Classification

Part IV Theoretical Analysis of Density-Ratio Estimation

13 Parametric Convergence Analysis
  13.1 Density-Ratio Fitting under Kullback-Leibler Divergence
  13.2 Density-Ratio Fitting under Squared Distance
  13.3 Optimality of Logistic Regression
  13.4 Accuracy Comparison
  13.5 Remarks
14 Non-Parametric Convergence Analysis
  14.1 Mathematical Preliminaries
  14.2 Non-Parametric Convergence Analysis of KLIEP
  14.3 Convergence Analysis of KuLSIF
  14.4 Remarks
15 Parametric Two-Sample Test
  15.1 Introduction
  15.2 Estimation of Density Ratios
  15.3 Estimation of ASC Divergence
  15.4 Optimal Estimator of ASC Divergence
  15.5 Two-Sample Test Based on ASC Divergence Estimation
  15.6 Numerical Studies
  15.7 Remarks
16 Non-Parametric Numerical Stability Analysis
  16.1 Preliminaries
  16.2 Relation between KuLSIF and KMM
  16.3 Condition Number Analysis
  16.4 Optimality of KuLSIF
  16.5 Numerical Examples
  16.6 Remarks

Part V Conclusions

17 Conclusions and Future Directions

List of Symbols and Abbreviations
References
Index

Foreword

Estimating probability distributions is widely viewed as a central question in machine learning. The whole enterprise of probabilistic modeling using probabilistic graphical models is generally addressed by learning marginal and conditional probability distributions. Classification and regression, starting with Fisher's fundamental contributions, are similarly viewed as problems of estimating conditional densities. The present book introduces an exciting alternative perspective: namely, that virtually all problems in machine learning can be formulated and solved as problems of estimating density ratios, that is, the ratios of two probability densities.

This book provides a comprehensive review of the elegant line of research undertaken by the authors and their collaborators over the last decade. It reviews existing work on density-ratio estimation and derives a variety of algorithms for directly estimating density ratios. It then shows how these novel algorithms can address not only standard machine learning problems such as classification, regression, and feature selection but also a variety of other important problems such as learning under a covariate shift, multi-task learning, outlier detection, sufficient dimensionality reduction, and independent component analysis. At each point this book carefully defines the problems at hand, reviews existing work, derives novel methods, and reports on numerical experiments that validate the effectiveness and superiority of the new methods. A particularly impressive aspect of the work is that implementations of most of the methods are available for download from the authors' web pages.

The last part of the book is devoted to mathematical analyses of the methods. This includes not only an analysis for the case where the assumptions underlying the algorithms hold, but also situations in which the models are misspecified. Careful study of these results will not only provide fundamental insights into the problems and algorithms but will also provide the reader with an introduction to many valuable analytic tools.

In summary, this is a definitive treatment of the topic of density-ratio estimation. It reflects the authors' careful thinking and sustained research efforts. Researchers and students alike will find it an important source of ideas and techniques. There is no doubt that this book will change the way people think about machine learning and stimulate many new directions for research.

Thomas G. Dietterich
School of Electrical Engineering
Oregon State University, Corvallis, OR, USA

Preface

Machine learning is aimed at developing systems that learn. The mathematical foundation of machine learning and its real-world applications have been extensively explored in recent decades. Various tasks of machine learning, such as regression and classification, typically can be solved by estimating the probability distributions behind data. However, estimating probability distributions is one of the most difficult problems in statistical data analysis, and thus solving machine learning tasks without going through distribution estimation is a key challenge in modern machine learning.

So far, various algorithms have been developed that do not involve distribution estimation but solve target machine learning tasks directly. The support vector machine is a successful example that follows this line: it does not estimate data-generating distributions but directly obtains the class-decision boundary that is sufficient for classification. However, developing such an excellent algorithm for each of the machine learning tasks could be highly costly and difficult.

To overcome these limitations of current machine learning research, we introduce and develop a novel paradigm called density-ratio estimation: instead of probability distributions, the ratio of probability densities is estimated for statistical data processing. The density-ratio approach covers various machine learning tasks, for example, non-stationarity adaptation, multi-task learning, outlier detection, two-sample tests, feature selection, dimensionality reduction, independent component analysis, causal inference, conditional density estimation, and probabilistic classification. Thus, density-ratio estimation is a versatile tool for machine learning. This book is aimed at introducing the mathematical foundation, practical algorithms, and applications of density-ratio estimation.

Most of the contents of this book are based on the journal and conference papers we have published in the last couple of years.
We acknowledge our collaborators for their fruitful discussions: Hirotaka Hachiya, Shohei Hido, Yasuyuki Ihara, Hisashi Kashima, Motoaki Kawanabe, Manabu Kimura, Masakazu Matsugu, Shin-ichi Nakajima, Klaus-Robert Müller, Jun Sese, Jaak Simm, Ichiro Takeuchi, Masafumi Takimoto, Yuta Tsuboi, Kazuya Ueki, Paul von Bünau, Gordon Wichern, and Makoto Yamada. Finally, we thank the Ministry of Education, Culture, Sports, Science and Technology; the Alexander von Humboldt Foundation; the Okawa Foundation; Microsoft Institute for Japanese Academic Research Collaboration Collaborative Research Project; IBM Faculty Award; Mathematisches Forschungsinstitut Oberwolfach Research-in-Pairs Program; the Asian Office of Aerospace Research and Development; Support Center for Advanced Telecommunications Technology Research Foundation; and the Japan Science and Technology Agency for their financial support.

[Photograph: Picture taken in Nagano, Japan, in the summer of 2009. From left to right, Taiji Suzuki, Masashi Sugiyama, and Takafumi Kanamori.]

Masashi Sugiyama, Taiji Suzuki, and Takafumi Kanamori