3rd Nov 2022, 03:00 PM - 04:00 PM
Seminar Room 24, First Floor, Main Building
The stochastic gradient descent (SGD) algorithm is used for parameter estimation, particularly for massive datasets and online learning. Inference in SGD has been a generally neglected problem and has only recently started to get some attention. I will first introduce SGD for relatively simple statistical models and explain the limiting behavior of Averaged SGD. Then, I will present our online batch-means estimator that converges to the true covariance matrix. I will compare the performance of our estimator with other competitors and discuss some key advantages.
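As a rough illustration of the ideas in this abstract, the sketch below runs averaged SGD on a toy problem (estimating a scalar mean under squared loss) and then forms a simple equal-size batch-means variance estimate from the iterates. The function names, the step-size exponent `alpha`, and the fixed number of batches are assumptions made for illustration; this is not the speaker's online estimator.

```python
import random


def averaged_sgd_mean(samples, alpha=0.6):
    """Averaged SGD for the loss 0.5 * (theta - x)^2 (toy model, assumed here)."""
    theta = 0.0
    iterates = []
    for t, x in enumerate(samples, start=1):
        step = t ** (-alpha)          # decaying step size t^(-alpha)
        theta -= step * (theta - x)   # gradient of the per-sample loss is (theta - x)
        iterates.append(theta)
    average = sum(iterates) / len(iterates)  # averaged SGD estimate
    return average, iterates


def batch_means_variance(iterates, n_batches=20):
    """Equal-size batch-means estimate of the asymptotic variance (simplified)."""
    b = len(iterates) // n_batches            # batch length
    used = iterates[: b * n_batches]
    overall = sum(used) / len(used)
    means = [sum(used[k * b:(k + 1) * b]) / b for k in range(n_batches)]
    return b * sum((m - overall) ** 2 for m in means) / (n_batches - 1)


rng = random.Random(0)
data = [rng.gauss(2.0, 1.0) for _ in range(20000)]
estimate, path = averaged_sgd_mean(data)
variance_estimate = batch_means_variance(path)
```

Here `estimate` approximates the true mean of 2.0, and `variance_estimate` can be used to attach a rough standard error to it.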
22nd Oct 2022, 11:00 AM - 12:00 PM
Madhava Hall, 3rd floor, Main Building
Imaging and sequencing technologies have highlighted the importance of single-cell gene expression and tissue organization in health and disease. How are these phenotypes encoded in the genome? We present a theoretical and computational framework that identifies genomic loci causal for such phenotypes, using only phenotypic data and the reference genome as input. Our approach does not utilize genetic variation or biological annotations. Instead, we use neural networks to compare the formal relational structure of a phenotypic measurement with that of the reference genome sequence, producing a "phenotype-sequence alignment". We construct phenotype-sequence alignments of single-cell gene expression and tissue organization of immune-cells, embryogenesis and cancer. These alignments reveal genes, regulatory elements, and protein active-sites causal for therapeutically significant phenotypic alterations.
15th Oct 2022, 03:00 PM - 04:00 PM
Madhava Hall, 3rd floor, Main Building
Recommendation systems are widely used to present users with the items most pertinent to them. A recommendation system selects items from a large pool and ranks them in the order a user would prefer. In this talk, I will introduce how to build a short video recommendation system at scale with deep learning. I will first briefly review recommendation systems and their history. Then, I will present how to formulate the problem of short video recommendation as a ranking problem. I will describe how to use deep learning to learn user preferences from data and how to use the learned preferences to generate recommendations. Finally, I will present how to evaluate the performance of the recommendation system and some of the optimizations used at Glance Roposo.
7th Oct 2022, 11:30 AM - 12:30 PM
Seminar room no 32, 2nd floor, Main Building, IISER Pune
As applications in large organizations evolve, the machine learning (ML) models that power them must adapt the same predictive tasks to newly arising data modalities (e.g., a new video content launch in an application requires existing text or image classifiers to extend to video). To solve this problem, organizations typically create ML pipelines for the new modality from scratch. We demonstrate how organizational resources, in the form of aggregate statistics and knowledge bases, enable teams to construct a common feature space that connects new and existing data modalities. This allows teams to apply methods for data curation (e.g., weak supervision and label propagation) and model training (e.g., forms of multimodal learning) across these different data modalities. We present how this use of organizational resources composes at production scale in classification tasks at Google, and demonstrate how it reduces the time needed to develop models for new modalities from months to weeks or days. This work was done when Dr. Girija was at Google, USA.
16th Aug 2022, 03:00 PM - 04:00 PM
Seminar room no 32, 2nd floor, Main Building, IISER Pune
Data Science isn't primarily about machine learning or statistics. It is about rigorous analysis using data to make decisions or to do science. Some of this rigor is mathematical, but some of it has to do with the right application of domain knowledge within an appropriately causal analysis. We will discuss this in the context of a story.
This story is about a fantastic piece of historical data science from 1850: John Snow's struggles to use data to prove that cholera is waterborne. Perhaps you are curious about what data science looked like in 1850. (Surprisingly, it is pretty similar to what it looks like today.) Or perhaps you are curious about how pandemics unfolded in the 1800s. But mostly, I hope you identify with some of these struggles and are inspired by how John Snow surmounted them.
2nd Aug 2022, 02:30 PM - 03:30 PM
Seminar room no 32, 2nd floor, Main Building, IISER Pune
Predicting cancer from XRays seemed great
Until we discovered the true reason.
The model, in its glory, did fixate
On radiologist markings – treason!
We found the issue with attribution:
By blaming pixels for the prediction (1,2,3,4,5,6).
A complement'ry way to attribute,
is to pay training data, a tribute (1).
If you are int'rested in FTC,
counterfactual theory, SGD
Or Shapley values and fine kernel tricks,
Please come attend, unless you have conflict.
Should you build deep models down the road,
Use attributions. Takes ten lines of code!
1st Aug 2022, 11:00 AM - 12:00 PM
Seminar room 31, 2nd Floor, Main Building
Data sets in which measurements of two (or more) types are obtained from a common set of samples arise in many scientific applications. A common problem in the exploratory analysis of such data is to identify groups of features of different data types that are strongly associated. A bimodule is a pair (A, B) of feature sets from two data types such that the aggregate cross-correlation between the features in A and those in B is large. A bimodule (A, B) is stable if A coincides with the set of features that have a significant aggregate correlation with the features in B, and vice-versa. In this talk, we propose and investigate an iterative testing-based procedure (BSP) to identify stable bimodules in bi-view data. We carry out a thorough simulation study to assess the performance of BSP and present an extended application to the problem of expression quantitative trait loci (eQTL) analysis using recent data from the GTEx project. In addition, we apply BSP to climatology data to identify regions in North America where annual temperature variation affects precipitation. This is joint work with Andrew B. Nobel, John Palowitch, Mark He, and Michael I. Love.
29th July 2022, 02:30 PM - 03:30 PM
Madhava Hall, 3rd Floor, Main Building
Stochastic gradient descent (SGD) is the workhorse of modern machine learning. While SGD has been thoroughly analyzed for independent data and tight finite time guarantees are known, its finite sample performance with dependent data has not been as thoroughly analyzed. In this talk, we will consider SGD-style algorithms for two problems where the data is not independent but rather comes from a Markov chain: learning dynamical systems and Q-learning for reinforcement learning. While vanilla SGD is biased and does not converge to the correct solution for these problems, we show that SGD along with a technique known as "reverse experience replay" can efficiently find the optimal solutions.
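To make the idea of replaying data in reverse concrete, here is a minimal sketch for a scalar linear dynamical system x[t+1] = a * x[t] + noise. Consecutive transitions are collected into buffers, a few samples are skipped between buffers, and the SGD updates within each buffer are applied from the last transition to the first. The buffer size, gap, learning rate, and tail averaging are illustrative assumptions, not the settings or exact algorithm from the talk.

```python
import random


def simulate_ar1(a_true, n, rng):
    """Generate x[0..n] from x[t+1] = a_true * x[t] + standard normal noise."""
    x = [0.0]
    for _ in range(n):
        x.append(a_true * x[-1] + rng.gauss(0.0, 1.0))
    return x


def sgd_reverse_replay(x, buffer_size=50, gap=10, lr=0.01):
    """SGD on squared prediction error, visiting each buffer in reverse time order."""
    a_hat = 0.0
    checkpoints = []
    start = 0
    while start + buffer_size < len(x):
        # Transitions (x[t], x[t+1]) in this buffer, processed newest first.
        for t in reversed(range(start, start + buffer_size)):
            residual = x[t + 1] - a_hat * x[t]
            a_hat += lr * residual * x[t]
        checkpoints.append(a_hat)
        start += buffer_size + gap  # skip a few samples between buffers
    tail = checkpoints[len(checkpoints) // 2:]  # average the later iterates
    return sum(tail) / len(tail)


rng = random.Random(1)
trajectory = simulate_ar1(0.5, 40000, rng)
a_estimate = sgd_reverse_replay(trajectory)
```

With these assumed settings, `a_estimate` should land close to the true coefficient 0.5.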
28th July 2022, 11:00 AM - 12:30 PM
Madhava Hall, 3rd floor, Main Building, IISER Pune
In this talk, we will be concerned with techniques for achieving deep learning in a human-in-the-loop setting. We will be focusing on deep neural networks (DNNs) suitable for real-world scientific problems with the following characteristics: (a) Data are naturally graph-structured (relational); (b) The amount of data available is typically small; and (c) There is significant domain-knowledge available from human experts, usually expressed in some logical form (rules, taxonomies, constraints and the like). Recently, graph neural networks (GNNs) have become the tool of choice for learning from graph-structured data, owing to their tremendous success in this area. However, the machine learning community has focused mainly on (a), and less has been done to deal with (b) and (c). In this talk, we will be interested in graph representation learning in this problem setting. We will explore some recent techniques for the inclusion of relational information into GNNs when learning from graph-structured data. We will see how this allows us to combine deep learning with logical representation and achieve better predictive models. In applications, we will see some recent empirical results obtained for problems arising in drug discovery. Keywords: Graph Representation Learning, Neuro-Symbolic Learning, Inductive Logic Programming, Drug Discovery.
9th June 2022, 04:00 PM - 05:00 PM
Online Mode
Natural Language Processing (NLP) techniques have become imperative for processing the massive amounts of text data produced every day in various forms such as news, social network posts, medical records, legal documents, etc. The first part of this talk will provide an overview of basic NLP modules, challenges in building efficient NLP systems, and some important state-of-the-art applications. Next, two critical problems will be briefly described: (1) Improving translation quality for low-resource languages: Major Indic languages are resource-scarce, i.e., they lack good-quality NLP tools and datasets. So, exploring different aspects of low-resource language translation is a key research area for the Indic NLP community. (2) Adversarial training of translation models: Nowadays, deep learning systems have become widespread for all NLP tasks ranging from language models to machine translation. However, these systems are vulnerable to adversarial attacks. To make the systems robust and usable in practical applications, adversarial training is employed. In particular, we will discuss a specific type of attack on translation models, called an invariance-based attack, and the defense strategies against it.
9th June 2022, 04:00 PM - 05:00 PM
Online Mode
The seminar is mainly focused on robust and efficient machine learning (ML) models based on novel optimization approaches. The primary ML technique in this seminar is the support vector machine (SVM). SVM is a widely used supervised learning algorithm for classification as well as regression problems. It uses a kernel-based approach for efficiently classifying the data. Since SVM-based algorithms have been extensively used for classifying biomedical data, we applied most of the proposed SVM models to biomedical applications. I will also discuss my current research work, i.e., “prediction of brain age using multimodal neuroimaging data”. Chronological age may not necessarily be an accurate marker of brain health. Recently, several studies have employed neuroimaging-based techniques to accurately determine brain health, also known as “brain age”. Brain Age Gap Estimation (BrainAGE) seeks to accurately estimate the difference between chronological age and brain age, with the aim of establishing trajectories of healthy aging. Accurate estimation of the brain age gap can aid in the timely identification of markers of brain-related disorders.
9th June 2022, 04:00 PM - 05:00 PM
Online Mode
Distributed Computing by Swarm of Robots
Dr. Subhash Bhagat
Abstract: This talk is about distributed algorithms for a swarm of robots. A swarm of robots is a distributed system of small autonomous mobile robots capable of cooperatively carrying out different tasks. Efforts have been made to design distributed algorithms to solve a variety of formation problems like Gathering, Convergence, Circle formation, Arbitrary pattern formation, Scattering and Covering, Flocking, Mutual visibility, etc. In this talk, we discuss our results for the gathering and mutual visibility problems. The gathering problem (also known as homing/rendezvous) requires all the robots to coordinate their movements to meet at a point unknown to them a priori. The mutual visibility problem considers opaque robots. If robots are opaque and three of them lie in a straight line, then the middle robot obstructs the view between the other two robots. The mutual visibility problem requires the robots to form a configuration in which no three robots are collinear.
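The target condition of the mutual visibility problem (no three robots collinear) can be checked directly. The sketch below tests a configuration of distinct robot positions given as integer coordinate pairs; the function names are hypothetical, and this brute-force check is only a verifier for a final configuration, not a distributed algorithm for reaching one.

```python
from itertools import combinations


def collinear(p, q, r):
    """True if points p, q, r lie on one straight line (zero cross product)."""
    return (q[0] - p[0]) * (r[1] - p[1]) - (q[1] - p[1]) * (r[0] - p[0]) == 0


def mutually_visible(positions):
    """True if no three of the (distinct) robot positions are collinear."""
    return not any(collinear(p, q, r) for p, q, r in combinations(positions, 3))


square = [(0, 0), (1, 0), (0, 1), (1, 1)]
blocked = [(0, 0), (1, 1), (2, 2), (0, 1)]
square_ok = mutually_visible(square)     # corners of a square see each other
blocked_ok = mutually_visible(blocked)   # (1, 1) sits between (0, 0) and (2, 2)
```

Integer coordinates keep the collinearity test exact; with floating-point positions a tolerance would be needed.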
6th June 2022, 04:00 PM - 05:00 PM
Online Mode
Deep reinforcement learning (DRL) algorithms have recently gained a lot of attention in solving real-world complex control tasks. In this talk, I will show how DRL-based control can be used for efficient energy management of residential buildings. Specifically, I will be focusing on DRL-based control for Heating, Ventilation, and Air Conditioning (HVAC) systems with the objective of reducing energy cost and maintaining the homeowner’s comfort. I will formulate the problem of intelligent HVAC control as a Markov Decision Process. I will then discuss a well-known model-free RL algorithm called Q-learning and its neural network-based implementation, the Deep Q-Network (DQN), for HVAC control. Further, I will present a few results showing DQN’s performance in saving electricity cost and maintaining comfort while controlling HVAC. Towards the end, I will briefly summarize a few extensions of this work as well as potential tasks from other domains that can benefit from DRL-based control.
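For readers unfamiliar with Q-learning, the sketch below shows the standard tabular update that DQN approximates with a neural network. The two states, two actions, reward, learning rate `alpha`, and discount `gamma` are made-up values for a toy thermostat and are not taken from the talk.

```python
def q_learning_update(q, state, action, reward, next_state, alpha=0.5, gamma=0.9):
    """One tabular Q-learning step: move Q(s, a) toward r + gamma * max_a' Q(s', a')."""
    best_next = max(q[next_state].values())
    td_target = reward + gamma * best_next
    q[state][action] += alpha * (td_target - q[state][action])
    return q[state][action]


# Hypothetical Q-table for a toy thermostat with two states and two actions.
q_table = {
    "too_warm": {"cool": 0.0, "off": 0.0},
    "comfortable": {"cool": 1.0, "off": 2.0},
}
new_value = q_learning_update(
    q_table, "too_warm", "cool", reward=1.0, next_state="comfortable"
)
```

With these numbers the updated entry becomes about 1.4, since 0.5 * (1.0 + 0.9 * 2.0 - 0.0) = 1.4.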
2nd June 2022, 04:00 PM - 05:00 PM
Online Mode
High-dimensional data is regularly used in modern data analysis pipelines. Such datasets pose their own challenges, such as overfitting, noise features, interactions, outcome type, and computational complexity, so it is desirable to preprocess the data to reduce the feature space and handle missing values. Among the various feature selection frameworks, ensemble-based and wrapper-based feature selection are popular for their good performance. This seminar will present novel frameworks developed from these to enhance their performance and scope of application. Further, the seminar will also discuss the application of feature selection to address various clinical and biological research problems regarding biomarker and model selection.
1st June 2022, 04:00 PM - 05:00 PM
Online Mode
With ever-increasing computing power, current applications produce data sets that can reach the order of petabytes and beyond. Knowledge extracted from such extreme-scale data promises unprecedented advancements on various scientific fronts, e.g., earth and space sciences, computational fluid dynamics, etc. However, finding meaningful and salient information efficiently and compactly from such vast data and presenting it effectively and interactively is one of the fundamental problems in modern data science research. My talk will focus on addressing the 5 Vs of big data while presenting novel strategies for big data analytics and visualization. I will discuss state-of-the-art data exploration methodologies that encompass the end-to-end exploration pipeline, starting right from the data generation time until when the data is being analyzed and visualized interactively to advance scientific discovery. I will present statistical and machine learning-based compact scientific data representations that are significantly smaller compared to the raw data and can be used as a proxy for the raw data to answer scientific questions efficiently.
31st May 2022, 04:00 PM - 05:00 PM
Online Mode
Recent improvements in remote-sensing capability and advances in machine learning (ML) have created significant opportunities to enhance understanding of precipitation processes from space. While advanced ML techniques improve the accuracy of precipitation retrievals, how these observations contribute to our understanding of precipitation processes remains an underexplored research question. Moreover, to bring any ML-based hydro-meteorological product into an operational environment, it is essential to gain the trust of weather forecasters by explaining the ML model’s decisions. This could be achieved by integrating ML interpretation techniques with the knowledge of hydro-meteorological processes. In this talk, a Random Forest (RF) based precipitation typology model developed for new generation geostationary satellites will be introduced. Part I of this talk will focus on prognostic modeling; that is, the design, training, and assessment of a machine-learning-based model for precipitation type classification and its challenges. Part II of the talk will focus on the interpretability of this model.
30th May 2022, 11:00 AM - 12:00 PM
Seminar Room 41
Exploratory factor analysis (EFA) or principal component analysis (PCA) is routinely used by researchers to reduce the dimensionality of data, and to form meaningful factors. While there are good guidelines on how to report the results, data visualization tools are rarely used in understanding the results of these methods. Good data visualization, especially in a multivariate or multidimensional framework, not only helps clarify the results but also aids in better decision-making for the models. This presentation demonstrates data visualization techniques that can be used in dimension reduction methods. The advantages and disadvantages of each of these techniques are discussed. As these methods are oftentimes used in survey research, exploratory data visualization for ordinal variables is also presented. Data visualization and analysis are performed in R using publicly available survey data.
4th May 2022, 11:00 AM - 12:00 PM
Seminar room 31 (Second Floor, A Wing, Main building)
Data visualization and massive data handling are some of the primary concerns of computer scientists. However, most big data sets are relational, containing a set of objects and relations between these objects. This translates to a natural mathematical model, called graphs. Many important real-life problems can be modeled as combinatorial optimization problems on graphs. In the first part of the talk, I will focus on some classical combinatorial optimization problems. Algorithms are at the core of computer science. They define the underlying computational processes of every complex system running in today's dynamic digital world. These systems are fed enormous amounts of data changing constantly over time. Therefore, a natural challenge for algorithms is to not only do efficient computation at a particular point, but also to maintain a good quality solution throughout. In the second part, I will focus on the algorithmic journey from static to dynamic geometric optimization problems.
16th March 2022, 10:00 AM - 11:00 AM
Online Mode
Bayesian approaches are appealing for constrained inference problems by allowing a probabilistic characterization of uncertainty while providing computational machinery for incorporating complex constraints in hierarchical models. However, the usual Bayesian strategy of placing a prior on the constrained space and conducting posterior computation with Markov chain Monte Carlo algorithms is often intractable. An alternative is to conduct inference for a less constrained posterior and project samples to the constrained space through a minimal distance mapping. We formalize and provide a unifying framework for such posterior projections. For theoretical tractability, we initially focus on constrained parameter spaces corresponding to closed and convex subsets of the original space. We then consider non-convex Stiefel manifolds. We provide a general formulation of projected posteriors in a Bayesian decision-theoretic framework. We show that asymptotic properties of the unconstrained posterior are transferred to the projected posterior, leading to asymptotically correct credible intervals. We demonstrate numerically that projected posteriors can have better performance than competitor approaches in real data examples.
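The projection step described above can be illustrated for two simple closed convex sets whose Euclidean (minimal-distance) projections have closed forms: the non-negative orthant and the unit ball. The choice of these two constraint sets and all function names are assumptions for illustration only; the talk's framework covers more general constraints.

```python
import math


def project_nonnegative(x):
    """Euclidean projection onto the non-negative orthant (clip negatives to zero)."""
    return [max(v, 0.0) for v in x]


def project_unit_ball(x):
    """Euclidean projection onto the closed unit ball (rescale if outside)."""
    norm = math.sqrt(sum(v * v for v in x))
    if norm <= 1.0:
        return list(x)
    return [v / norm for v in x]


def project_samples(samples, projection):
    """Map each unconstrained posterior draw to the constrained space."""
    return [projection(s) for s in samples]


# Hypothetical unconstrained posterior draws for a 2-dimensional parameter.
draws = [[-1.0, 2.0], [0.5, 0.25], [3.0, 4.0]]
orthant_draws = project_samples(draws, project_nonnegative)
ball_draws = project_samples(draws, project_unit_ball)
```

Summaries such as credible intervals would then be computed from the projected draws rather than the original ones.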
8th Feb 2022, 10:00 AM - 11:00 AM
Online Mode
Numerical weather prediction and seasonal-to-interannual hydrometeorological forecasts rely on the accuracy of the initial state of the system and observation data. Data assimilation is used to optimally estimate the state of a physical system using imperfect numerical models and noisy observations. The huge computational expense has been the main challenge in applying the Sigma-Point Kalman filter (SPKF) to a high-dimensional system. In the first part of the talk, I will focus on this issue and present a method to construct a reduced-rank SPKF (RSPKF) employing the truncated singular value decomposition. The RSPKF is then applied to a realistic ENSO prediction model. I also implemented a localization method for the RSPKF. In the second part, I will talk about the assimilation of retrieved satellite products versus raw satellite data, using a radiative transfer forward operator, into a hydrometeorological model.
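One way to picture the rank reduction is sketched below: a truncated singular value decomposition of the state covariance keeps only the leading `rank` directions, and a symmetric set of 2 * rank equally weighted sigma points is built from them, so that their sample covariance equals the rank-truncated covariance. This is a generic construction under assumed scaling and weights, written with NumPy; it is not claimed to match the speaker's RSPKF.

```python
import numpy as np


def reduced_rank_sigma_points(mean, cov, rank):
    """Build 2 * rank symmetric sigma points from a truncated SVD of cov."""
    u, s, _ = np.linalg.svd(cov)                 # cov is symmetric positive semi-definite
    factor = u[:, :rank] * np.sqrt(s[:rank])     # n x rank square-root factor
    offsets = np.sqrt(rank) * factor.T           # one offset per retained direction
    return np.vstack([mean + offsets, mean - offsets])


# Hypothetical 3-dimensional state whose third direction carries little variance.
state_mean = np.array([1.0, 2.0, 3.0])
state_cov = np.diag([4.0, 1.0, 0.01])
points = reduced_rank_sigma_points(state_mean, state_cov, rank=2)

# Equal weights 1 / (2 * rank) recover the mean and the truncated covariance.
deviations = points - state_mean
recovered_cov = deviations.T @ deviations / points.shape[0]
```

Only 2 * rank model runs are then needed per forecast step instead of a number that grows with the full state dimension.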