## Summary

The MSR India Summer School series, held in collaboration with the Indian Institute of Science, consists of lectures in a chosen area by leading experts from around the world. The aim is to introduce students and researchers to important new areas and the latest results and to provide a forum for Indian and international researchers to interact. The 2015 Summer School will be held between June 15 – 26 at the Indian Institute of Science, Bangalore in the area of Machine Learning.

Understanding data has become essential for almost all modern applications. This data intensive nature of applications have spurred a great deal of research in Machine Learning and several related areas. The 2015 MSR Summer school focused on Machine Learning and its application to Big Data. In particular, the lectures covered several aspects of supervised/unsupervised learning in high-dimensions and with large number of data points.

The School addressed both theoretical as well as practical aspects of the chosen area and was targeted at research scholars, faculty members, masters and senior undergraduate students.

The lectures were designed to offer self-contained introductions to chosen topics, leading up to some open problems for research.

There was also a day long Azure hackathon as part of the summer school agenda.

For any further information /clarification, please write to indiamrc@microsoft.com.

## Abstracts

#### Large-scale Linear and Kernel Classification

*Chih-Jen Lin*

Linear and kernel methods are important machine learning techniques for data classification. Popular examples include support vector machines (SVM) and logistic regression. We begin with an introduction on this subject by deriving their optimization problems through different aspects. This discussion is useful because many people are confused about the relationships between, for example, SVM and logistic regression. We then move to investigate techniques for solving optimization problems for linear and kernel classification. In particular, we show details of two representative settings: coordinate descent methods and Newton methods. Recently, extending these optimization techniques to handle big data in either multi-core or distributed environments is a very important research direction. We present some promising results and discuss future challenges.

#### Provable Non-convex Projections for High-dimensional Learning Problems

*Prateek Jain*

Typical high-dimensional learning problems such as sparse regression, low-rank matrix completion, robust PCA etc can be solved using projections onto non-convex sets. However, providing theoretical guarantees for such methods is difficult due to the non-convexity in projections. In this talk, we will discuss some of our recent results that show that non-convex projections based methods can be used to solve several important problems in this area such as: a) sparse regression, b) low-rank matrix completion, c) robust PCA.

In this talk, we will give an overview of the state-of-the-art for these problems and also discuss how simple non-convex techniques can significantly outperform state-of-the-art convex relaxation based techniques and provide solid theoretical results as well. For example, for robust PCA, we provide first provable algorithm with time complexity O(n^2 r) which matches the time complexity of normal SVD and is faster than the usual nuclear+L_1-regularization methods that incur O(n^3) time complexity. This talk is based on joint works with Ambuj Tewari, Purushottam Kar, Praneeth Netrapalli, Animashree Anandkumar, U N Niranjan, and Sujay Sanghavi.

#### Submodular optimization and machine learning

*Stefanie Jegelka*

Many problems in machine learning that involve discrete structures or subset selection may be phrased in the language of submodular set functions. The property of submodularity, also referred to as a ‘discrete analog of convexity’, expresses the notion of diminishing marginal returns, and captures combinatorial versions of rank and dependence. Submodular functions occur in a variety of areas including graph theory, information theory, combinatorial optimization, stochastic processes and game theory. In machine learning, they emerge in different forms as the potential functions of graphical models, as the utility functions in active learning and sensing, in models of diversity, in structured sparse estimation or network inference. The lectures will give an introduction to the theory of submodular functions, some applications in machine learning and algorithms for minimizing and maximizing submodular functions that exploit ties to both convexity and concavity.

#### Lecture 1: Nonparametric graphical models

*John Lafferty*

We present some nonparametric methods for graphical modeling. In the discrete case, where the data are binary or drawn from a finite alphabet, Markov random fields are the standard model. The Gaussian graphical model is the usual parametric model for continuous data, but it makes distributional assumptions that are often unrealistic. We discuss several approaches to building more flexible graphical models. One allows arbitrary graphs and a nonparametric extension of the Gaussian;the other uses kernel density estimation and restricts the graphs to trees and forests. Other approaches combine these two techniques.

#### Lecture 2: Computational tradeoffs in statistical estimation

*John Lafferty*

In massive data analysis, statistical estimation needs to be carried out with close attention to computational resources — compute cycles, communication bandwidth and storage capacity. Yet little is presently known about the fundamental tradeoffs between statistical and computational efficiency. We give a brief survey of past and more recent work in this direction. We then present new work that revisits classical linear and nonparametric estimation theory from a computational perspective, formulating an extension to classical results in minimax analysis in the setting of rate distortion theory. We also present algorithms for trading off estimation accuracy for computational speed in linear and nonparametric regression.

#### Lecture 3: High dimensional nonparametric estimation

*John Lafferty*

Estimating high dimensional functions under weak assumptions is a central challenge in statistical machine learning. We give a survey of results on variable selection for high dimensional nonparametric regression. We then present new results in nonparametric estimation under shape constraints. We first consider the problem of estimating a convex function of several variables, and develop a screening procedure to identify irrelevant variables. We then discuss extensions of these ideas and open problems for future research.

#### Scaling up Reinforcement Learning

*B Ravindran*

Reinforcement Learning (RL) is a popular paradigm for trial-and-error learning enjoying renewed popularity due to Google Deepmind’s Atari playing engine. Though there has been much interest in the field for close to 3 decades RL methods have not had many large scale deployments. In this talk I will introduce several approaches adopted by the RL community for scaling up algorithms. I will go over the fundamentals of hierarchical reinforcement learning and value function approximation methods. In the second part of the talk I will briefly cover methods for automatically discovering spatio-temporal abstractions in RL.

#### Distributed Machine Learning Algorithms: Communication-Computation Trade-offs

*Sundararajan Sellamanickam*

Distributed machine learning is an important area that has been receiving considerable attention from academic and industrial communities, as data is growing in unprecedented rate. In the first part of the talk, we review several popular approaches that are proposed/used to learn classifier models in the big data scenario. With commodity clusters priced on system configurations becoming popular, machine learning algorithms have to be aware of the computation and communication costs involved in order to be cost effective and efficient. In the second part of the talk, we focus on methods that address this problem; in particular, considering different data distribution settings (e.g., example and feature partitions), we present efficient distributed learning algorithms that trade-off computation and communication costs.

#### I. Active learning and annotation (60 mins)

*Sanjoy Dasgupta*

The “active learning” model is motivated by scenarios in which it is easy to amass vast quantities of unlabeled data (images and videos off the web, speech signals from microphone recordings, and so on) but costly to obtain their labels. Like supervised learning, the goal is ultimately to learn a classifier. But the labels of training points are hidden, and each of them can be revealed only at a cost. The idea is to query just a few labels that are especially informative about the decision boundary, and thereby to obtain an accurate classifier at significantly lower cost than regular supervised learning.

There are two distinct ways of conceptualizing active learning, which lead to rather different querying strategies. The first treats active learning as an efficient search through a hypothesis space of candidates, while the second has to do with exploiting cluster or neighborhood structure in data. This talk will show how each view leads to active learning algorithms that can be made efficient and practical, and have provable label complexity bounds that are in some cases exponentially lower than for supervised learning.

#### II. Information geometry (90 mins)

*Sanjoy Dasgupta*

This tutorial will focus on entropy, exponential families, and information projection. We’ll start by seeing the sense in which entropy is the only reasonable definition of randomness. We will then use entropy to motivate exponential families of distributions — which include the ubiquitous Gaussian, Poisson, and Binomial distributions, but also very general graphical models. The task of fitting such a distribution to data is a convex optimization problem with a geometric interpretation as an “information projection”: the projection of a prior distribution onto a linear subspace (defined by the data) so as to minimize a particular information-theoretic distance measure. This projection operation, which is more familiar in other guises, is a core optimization task in machine learning and statistics. We’ll study the geometry of this problem and discuss algorithms for it.

#### III. Cluster trees and neighborhood graphs (60 mins)

*Sanjoy Dasgupta*

What information does the clustering of a finite data set reveal about the underlying distribution from which the data were sampled? This basic question has proved elusive even for the most widely-used clustering procedures. One natural criterion is to seek clusters that converge (as the data set grows) to regions of high density. When all possible density levels are considered, this is a hierarchical clustering problem where the sought limit is called the “cluster tree”.

This talk will describe two simple algorithms for estimating this tree that implicitly construct a multiscale hierarchy of near-neighbor graphs on the data points. We’ll show that these procedure are consistent, and give rates of convergence using a percolation argument that also gives insight into how neighborhood graphs should be constructed.

#### From LSI to Probabilistic Topic models: An introduction to Topic models

*Chiranjib Bhattacharya*

Topic models attempt to discover themes, or Topics, from large collection of documents. Discovering themes from a document corpus is an important problem with a variety of applications in Web-search, Corpus Browsing etc.

In this two part tutorial, we will begin by introducing neccessary background in understanding Topic models, mainly focussing on EM algorithm and Variational Inference. In the second part of the talk we will review several models starting with Latent Semantic Indexing(LSI), proposed in 1988, to the more recent and now state of the art Probabilistic Topic models. Towards the end of the talk we will discuss recent theoretical results on \emph{provable} topic models.

#### Subset selection problems

*Amit Deshpande*

Subset selection problem means finding a small subset of given data that maximizes diversity or information content or some other submodular function depending on the context. This definition can be suitably modified in each context, and has a wide range of interesting applications like feature selection, sensor placement, document summarization, diversification of search. I’ll review different theoretical attempts to capture this notion, the algorithmic ideas, practical applications, and their impact in return on basic research in graph theory, linear algebra, probability.

#### Entity Mining at Microsoft Bing Hyderabad

*Manish Gupta*

Entity mining is one of the hot topics in the area of web mining and information retrieval. In this talk, I will discuss in brief three interesting entity linking problems which apply various machine learning techniques: Entity linking, Dominant Entity Identification, and Cricket Linking. Entity linking is the problem of linking a mention phrase from a document to an entity in the knowledge base. Dominant entity identification is the problem of finding whether an entity e is the dominant entity for a page p. Cricket linking is the problem of linking event mentions from cricket match reports to a set of balls in cricket commentaries. In my talk, I will discuss these problems in detail, and will present interesting solutions to them.

#### TALK 1: Online Learning and Bandits

*Aditya Gopalan*

The ability to make continual, accurate decisions based on evolving data is key in many of today’s data-driven intelligent systems. This tutorial-style talk presents an introduction to the modern study of sequential learning and decision making under uncertainty. The broad objective is to cover modeling frameworks for online prediction and learning, explore algorithms for decision making, and gain an understanding of their performance. Specifically, we will look at multi-armed bandits — models of decision making that capture the explore-vs-exploit tradeoff in learning, regret minimization, non-stochastic or adversarial online learning, and online convex optimization. Time permitting, we will discuss new directions and frontiers in the area of sequential decision making.

#### Talk 2: Online Learning in Complex Bandits & Markov Decision Processes

*Aditya Gopalan*

We consider Reinforcement Learning (RL) in parameterized bandits or more generally Markov Decision Processes (MDPs), where the parameterization can induce correlation across transition probabilities and/or rewards. Consequently, observing a particular state transition might yield useful information about other, unobserved, parts of the MDP. In this setting, Posterior sampling a.k.a. Thompson sampling – a randomized, Bayesian-inspired algorithm originally developed for the simpler multiarmed bandit – becomes a natural candidate for a learning algorithm. We develop a version of Thompson sampling for parameterized RL problems, and derive the first known frequentist regret bounds for fairly general parameter spaces and priors. Under mild conditions on the prior used in Thompson sampling, the regret can be shown to scale logarithmically in time and with high probability. The result holds for priors without any additional, specific closed-form structure such as conjugate or product-form priors. Moreover, the constant factor in the logarithmic scaling exposes the “information complexity” of learning the MDP, in terms of the structure of the parameter space. We also report numerical results for the algorithm on a parameterized queueing system, with a large number of states (queue occupancy) but only a small number of uncertain parameters (arrival/service rates).

Joint work with Shie Mannor and Yishay Mansour.

#### Non-convex Robust PCA

*Praneeth Netrapalli*

In this lecture, we will illustrate a novel technique due to Erdos et al. (2011) which can be used to obtain bounds on eigenvector perturbation in the \ell_{\infty} norm. Standard techniques give us optimal bounds only for perturbation in the \ell_2 norm. We will further use this technique to propose and analyze a new non-convex algorithm for robust PCA, where the task is to recover a low-rank matrix from sparse corruptions that are of unknown value and support. In the deterministic error setting, our method achieves exact recovery under the same conditions that are required by existing methods (which are based on convex optimization) but is much faster.

#### An Introduction to Concentration Inequalities and Statistical Learning Theory

*Purushottam Kar*

The aim of this tutorial is to introduce tools and techniques that are used to analyze machine learning algorithms in statistical settings. Our focus will be on learning problems such as classification, regression, and ranking. We will look at concentration inequalities and other commonly used techniques such as uniform convergence and symmetrization, and use them to prove learning theoretic guarantees for algorithms in these settings.

The talk will be largely self-contained. However, it would help if the audience could brush up basic probability and statistics concepts such as random variables, events, probability of events, Boole’s inequality etc. There are several good resources for these online and I do not wish to recommend one over the other. However, a couple of nice resources are given below

1) https://www.khanacademy.org/math/probability

2) http://ocw.mit.edu/courses/mathematics/18-05-introduction-to-probability-and-statistics-spring-2014/

3) https://en.wikipedia.org/wiki/Boole’s_inequality