2023 SEMINARS

Data-assimilation with scientific machine learning

Romit Maulik, Pennsylvania State University

Thursday, 21st Dec 2023, 10:30 AM - 11:30 AM

Madhava Hall, 3rd floor, Main Building, IISER Pune

Data assimilation (DA) in geophysical sciences remains the cornerstone of robust forecasts from numerical models. Indeed, DA plays a crucial role in the quality of numerical weather prediction and is a key building block that has allowed dramatic improvements in weather forecasting over the past few decades. DA is commonly framed in a variational setting, where one solves an optimization problem within a Bayesian formulation using raw model forecasts as a prior and observations as likelihood. This leads to a DA objective function that needs to be minimized, where the decision variables are the initial conditions specified in the model. In traditional DA, the forward model is numerically and computationally expensive. Here we replace the forward model with a differentiable surrogate model. Consequently, gradients of our DA objective function with respect to the decision variables are obtained rapidly via automatic differentiation. We demonstrate our approach by performing emulator-assisted DA forecasts for geopotential height, followed by a more comprehensive numerical weather prediction scenario. Our results indicate that emulator-assisted DA is faster than traditional equation-based DA forecasts by 4 orders of magnitude, allowing computations to be performed on a workstation rather than a dedicated high-performance computer.
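As a rough illustration of the variational objective described above, the following sketch minimizes a prior-plus-observation cost over the initial condition of a toy model. Everything here is an assumption made for illustration: the scalar linear "surrogate" `x_{t+1} = A * x_t`, the constants `A`, `SIGMA_B`, `SIGMA_O`, and the hand-written gradient, which stands in for what automatic differentiation would supply for a real differentiable emulator. It is not the speaker's implementation.

```python
# Toy variational data assimilation with a scalar linear surrogate model.
# All names and constants are hypothetical and chosen only for illustration.

A = 0.9          # assumed surrogate dynamics: x_{t+1} = A * x_t
SIGMA_B = 1.0    # assumed background (prior) standard deviation
SIGMA_O = 0.1    # assumed observation standard deviation


def forecast(x0, steps):
    """Roll the surrogate forward and return the states at t = 1..steps."""
    states, x = [], x0
    for _ in range(steps):
        x = A * x
        states.append(x)
    return states


def cost(x0, x_background, observations):
    """DA objective: prior misfit plus observation misfit."""
    prior = (x0 - x_background) ** 2 / (2 * SIGMA_B ** 2)
    states = forecast(x0, len(observations))
    misfit = sum((x - y) ** 2 for x, y in zip(states, observations))
    return prior + misfit / (2 * SIGMA_O ** 2)


def cost_gradient(x0, x_background, observations):
    """dJ/dx0, derived by hand here; autodiff would provide it for a real surrogate."""
    grad = (x0 - x_background) / SIGMA_B ** 2
    states = forecast(x0, len(observations))
    for t, (x, y) in enumerate(zip(states, observations), start=1):
        grad += (A ** t) * (x - y) / SIGMA_O ** 2
    return grad


def assimilate(x_background, observations, learning_rate=0.001, iterations=2000):
    """Plain gradient descent on the initial condition, starting from the background."""
    x0 = x_background
    for _ in range(iterations):
        x0 -= learning_rate * cost_gradient(x0, x_background, observations)
    return x0
```

Starting from a background guess of 1.0 with noise-free observations generated from a true initial state of 2.0, `assimilate(1.0, forecast(2.0, 5))` moves the estimate close to 2.0, pulled slightly toward the background by the prior term.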

Population dynamics and evolution: from simulating biology to advancing data science

Jaideep Joshi, University of Bern

Friday, 15th Dec 2023, 10:30 AM - 11:30 AM

Seminar room no 33, 2nd floor, Main Building, IISER Pune

Evolution is central to understanding life on earth. However, evolutionary thinking is not limited to biology: the principles of natural selection have motivated powerful algorithms for solving various problems in data science, including numerical optimization, swarm intelligence, and machine learning. In this technical talk, I will first show, using an agent-based model, how simple behavioural rules, together with natural selection, can lead to the self-organization of complex behaviours. Second, I will describe a new evolutionary theory that allows for predicting evolutionary change in complex ecosystem models. In Earth-system science, a major challenge is estimating parameters of process-based models using diverse data. Standard optimization-based or Bayesian approaches fail because each function evaluation can require hours (if not days) of computation time. Consequently, Earth-system models are typically calibrated by manual parameter tuning. I will present open questions and show how evolutionary theory could provide alternative data-assimilation algorithms for solving this problem.

Predicting the climate resilience of complex systems: harnessing emerging planetary data with new eco-evolutionary theory

Jaideep Joshi, University of Bern

Thursday, 14th Dec 2023, 10:30 AM - 11:30 AM

Seminar room no 33, 2nd floor, Main Building, IISER Pune

Earth-system science has entered a data-rich era, with an explosion of ecosystem observations from networks of biogeochemical flux sensors, long-term monitoring plots, remote sensing, global trait measurements, and compilations of manipulation experiments. The availability of such a wealth of data has created unprecedented opportunities for developing new eco-physiological theory and next-generation Earth-system models. In this talk, I will integrate tools and concepts from data science, evolutionary theory, and vegetation modelling into a framework for predicting the resilience of biodiverse ecosystems to climate change. I will (1) showcase how the principles of natural selection can be leveraged to build simpler yet more accurate models of ecosystem functioning, (2) assess the emergent responses of the Amazon Forest to elevated CO2 from a complex-systems perspective, and (3) show how combining eco-evolutionary vegetation modelling with multi-stakeholder analysis is enabling a quantitative multidisciplinary framework for predicting safe operating spaces for human-natural systems.

Advancing Recruitment Practices: Automated Skill Extraction and Fraudulent Job Advertisement Detection through Machine Learning

Rohan Nanda, Maastricht University, Netherlands

Wednesday, 13th Dec 2023, 10:30 AM - 11:30 AM

Seminar room no 32, 2nd floor, Main Building, IISER Pune

The continuous growth in the online recruitment industry has made the candidate screening process costly, labour-intensive, and time-consuming. Addressing this, we propose a context-aware sequence classification model for automated extraction of hard and soft skills from candidates' resumes and job descriptions. The task is less complex for hard skills which in some cases could be named entities but much more challenging for soft skills which may appear in different linguistic forms depending on the context. Leveraging state-of-the-art textual features, our model employs machine learning classifiers and is validated on a publicly available job description dataset. Concurrently, the rise in fraudulent job advertisements jeopardizes job seekers' privacy and well-being. We develop and validate a machine learning system to detect identity theft, corporate identity theft, and multi-level marketing in fraudulent job ads by employing empirical rule set-based features, bag-of-words models, state-of-the-art word embeddings, and transformer models.

Multilingual Legal Information Retrieval System for Automated Compliance Checking of EU Law

Rohan Nanda, Maastricht University, Netherlands

Tuesday, 12th Dec 2023, 03:00 PM - 04:00 PM

Seminar room no 32, 2nd floor, Main Building, IISER Pune

This study focuses on making it more efficient to check the transposition of European Union (EU) laws (i.e., Directives) into Member States' national laws, a crucial aspect for achieving policy objectives outlined in the Treaties. The European Commission (EC) oversees this process to ensure compliance, a task traditionally involving time-consuming and costly manual legal analysis. Our work introduces a legal information retrieval system utilizing semantic textual similarity techniques to automatically detect directive transpositions at a fine-grained national provision level. Leveraging lexical, semantic, and word embeddings-based methods, we evaluated the system across a multilingual corpus of EU and national legislation. Results indicate the system's ability to identify transpositions in diverse national jurisdictions with promising performance. This suggests its potential as a valuable tool for legal practitioners and Commission officials engaged in the legal compliance verification process.

Deep Internal Learning for image restoration and image synthesis

Indra Deep Mastan, LNM Institute of Information Technology

Tuesday, 5th Dec 2023, 10:30 AM - 11:30 AM

Madhava Hall, 3rd floor, Main Building, IISER Pune

We will discuss Deep Internal Learning (DIL) methods for image restoration and synthesis tasks. DIL approaches allow us to perform image restoration (IR) and image synthesis (IS) tasks with No Prior Examples. DIL goes against the narrative that explains the success of deep learning for IR and IS tasks due to the utilization of many prior examples. DIL methods have practical applications where one has limited computational resources, the training samples are difficult to collect, or one needs an output image ensuring no bias from the training samples. We will also discuss an emerging vision and language model for text-based image style transfer.

Bloom Filters - memory efficient data structures

Indra Deep Mastan, LNM Institute of Information Technology

Monday, 4th Dec 2023, 03:00 PM - 04:00 PM

Madhava Hall, 3rd floor, Main Building, IISER Pune

This seminar will discuss how to design memory-efficient data structures using hash functions. Such data structures allow us to store items in less space and query them in less time. The main challenge in constructing a hash table is collisions, where the hash function maps different items to the same slot. We will first discuss the construction of hash functions and the analysis of hash collisions. Then, we will discuss a memory-efficient data structure called the Bloom filter.
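For readers unfamiliar with the structure, a Bloom filter can be sketched in a few lines. This is a minimal illustration, not material from the talk: the class name `BloomFilter`, the default sizes, and the choice of salted SHA-256 digests as the hash family are all assumptions. The sketch also stores one byte per slot for readability, whereas a real implementation would pack the slots into individual bits.

```python
import hashlib


class BloomFilter:
    """Set-membership sketch: no false negatives, but false positives are possible."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        # One byte per slot for readability; a real filter would pack bits.
        self.bits = bytearray(num_bits)

    def _positions(self, item):
        # Derive num_hashes slot indices by salting SHA-256 with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def might_contain(self, item):
        # True means "possibly present"; False means "definitely absent".
        return all(self.bits[pos] for pos in self._positions(item))
```

After `f = BloomFilter(); f.add("apple")`, the call `f.might_contain("apple")` is always true. A query for an item that was never added usually returns false, but may return true when its slots collide with those of stored items.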

A Tour of Clustering Problems

Tanmay Inamdar, University of Bergen

Tuesday, 21st Nov 2023, 10:30 AM - 11:30 AM

Madhava Hall, 3rd floor, Main Building, IISER Pune

Clustering is a widely used unsupervised learning technique that groups "similar" objects together. It has been shown that even a small number of outliers can influence the results of clustering algorithms, obscuring the natural underlying clusters. Thus, the study of clustering in the presence of outliers is vital for practical applications. In a "clustering with outliers" problem, we are given a set of points P and two parameters k and m, and the clustering should partition P into k clusters, excluding up to m outlier points.
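To make the problem statement concrete, the sketch below evaluates a candidate solution under a k-median-style objective with outliers: each point pays its distance to the nearest centre, and the m most expensive points are discarded. The function name `cost_with_outliers` and the Euclidean metric are assumptions for illustration. The sketch only scores a given set of centres and is not the reduction presented in the talk.

```python
import math


def cost_with_outliers(points, centers, m):
    """Sum of nearest-centre distances after discarding the m farthest points."""
    distances = sorted(min(math.dist(p, c) for c in centers) for p in points)
    kept = distances[: max(len(distances) - m, 0)]
    return sum(kept)
```

For example, with points at 0, 1, 10, 11 and 100 on a line and centres at 0 and 10, allowing a single outlier removes the point at 100 and reduces the cost from 92 to 2.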

In this talk, I will present a general framework to reduce a "clustering with outliers" problem to its outlier-free analogue (our recent result [1], AAAI '23). This reduction allows us to obtain optimal approximation algorithms for a number of clustering problems, such as k-median/k-means with outliers. I will also briefly discuss some of my other results on clustering problems, and interesting future directions related to these areas.

[1] Clustering What Matters: Optimal Approximations for Clustering with Outliers.

Akanksha Agrawal, Tanmay Inamdar, Saket Saurabh, Jie Xue. AAAI 2023 (Distinguished Paper) and J. Artif. Intell. Res.

Link: https://jair.org/index.php/jair/article/view/14883

Basic Streaming Algorithms for Big Data

Tanmay Inamdar, University of Bergen

Monday, 20th Nov 2023, 03:00 PM - 04:00 PM

Madhava Hall, 3rd floor, Main Building, IISER Pune

We are all familiar with the standard (offline) algorithms paradigm, where the input data is stored in the memory, and the algorithm has random access to the entire input. However, in many "big data" scenarios, the input is too large to be entirely stored in memory. One model to deal with such scenarios is the Streaming Model. In this model, the data points arrive one-by-one, and we want to estimate certain properties of the data using only a "small" space.

In this talk, I will talk about streaming algorithms for some of the basic problems, such as approximate counting, finding majority/frequent elements, and the number of distinct elements. I will also talk about the perceptron algorithm for learning linearly separable data that works in the Online Model, which is closely related to the streaming model.
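As one example of the kind of algorithm mentioned above, the Misra-Gries summary finds frequent-element candidates in a single pass using at most k - 1 counters. The choice of this particular algorithm is an assumption, since the abstract does not say which ones will be covered. With k = 2 it reduces to the Boyer-Moore majority vote.

```python
def misra_gries(stream, k):
    """One-pass summary: any item occurring more than n/k times survives as a key."""
    counters = {}
    for item in stream:
        if item in counters:
            counters[item] += 1
        elif len(counters) < k - 1:
            counters[item] = 1
        else:
            # No free counter: decrement all, dropping those that reach zero.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters
```

The returned keys are only candidates. An item that occurs more than n/k times is guaranteed to appear among them, but a second pass over the data is needed to confirm the true frequencies.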

Network Code and Index Code Construction for Given Networks

Mohammad Sultan, National Institute of Technology Hamirpur

Tuesday, 07th Nov 2023, 10:30 AM - 11:30 AM

Madhava Hall, 3rd floor, Main Building, IISER Pune

This presentation provides an in-depth exploration of the intricate connection between the entropic region and the network's capacity region, spanning the domains of information theory and network coding. The fundamental challenge of characterizing the almost entropic region is addressed by seeking new outer and inner bounds, improving our understanding of it. This work delves into the complexities of constructing network codes in the capacity region, which remains challenging due to the incomplete characterization of the entropic region. The developed algorithm is demonstrated to construct network and index codes that can efficiently handle a given rate vector in the capacity region. We use a directed acyclic hypergraph model and establish connections between linear programming outer bounds and inner bounds on network coding rate regions. We present a heuristic method for generating network and index codes based on the relationship between functional dependence and conditional independence constraints, making it applicable to various data science scenarios.

Inner and Outer Bounds for the Almost Entropic Region and Information Inequalities

Mohammad Sultan, National Institute of Technology Hamirpur

Monday, 06th Nov 2023, 03:00 PM - 04:00 PM

Madhava Hall, 3rd floor, Main Building, IISER Pune

This talk explores the fundamental questions in information theory and network coding, focusing on characterizing the almost entropic region. A deeper understanding of this region is crucial for improving data science and communication systems. New outer and inner bounds are developed to better approximate this region. We investigate entropy space, introducing alphabet-constrained entropic sets and normalized distance. An algorithm is designed to find entropic vectors and associated distributions close to a given target vector, even when the target is non-entropic. We also optimize functions of joint entropies over alphabet-constrained entropic sets. Inner bounds for the almost entropic region involving four random variables are obtained using both polyhedral outer bounds and a grid-based approach. The obtained bound is the best to date. The work has implications in data science, contributing to improved data processing and communication systems by enhancing our understanding of information theory and coding.

Comparison and evaluation of statistical error models for single-cell RNA-seq data

Saket Choudhary, New York Genome Center

Thursday, 26th Oct 2023, 03:00 PM - 04:00 PM

Madhava Hall, 3rd floor, Main Building, IISER Pune

Single-cell RNA-sequencing (scRNA-seq) has emerged as a powerful technique to characterize cellular diversity at an unprecedented resolution, enabling the characterization of the molecular state of individual cells in any biological system or species. While unsupervised analysis of single-cell data can uncover heterogeneous cell types and states, the results can also be confounded by cell-to-cell variation arising from technical factors such as differences in sequencing depths. I will introduce a computational method based on generalized linear models that tackles the normalization problem using a data-driven approach. By analyzing over 50 scRNA-seq datasets spanning multiple technologies, biological systems, and sequencing depths, I will show that the degree of heterogeneity varies across datasets which necessitates a data-driven parameter learning approach. Using extensive benchmarking, I also demonstrate how the method outperforms other tools at identifying differentially expressed genes.

Statistical analysis for inferring the timing of gene duplications

Snehalata Huzurbazar, West Virginia University

Thursday, 19th Oct 2023, 03:30 PM - 05:00 PM

Seminar Room 31, 2nd floor, Main Building, IISER Pune

Gene duplication is the key mechanism for evolutionary change. To infer the timing and nature of gene duplication, the 'data' used are the end result of various pipelines. In this talk, I will summarize how the 'data' are obtained, explore the shortcomings of analyses in the literature, and end with current work on overcoming these shortcomings. The interesting statistical problems are that the 'data' are maximum likelihood estimates, and that the biological process (saturation effects) presents complications in data modeling.

Mining Groups of Similarly Behaving Enterprise Social Network Users using Temporal Behavioral Clustering and Edge Attributed Node Embedding

Priyanka Sinha, Docyt

Monday, 12th Oct 2023, 03:00 PM - 04:00 PM

Madhava Hall, 3rd floor, Main Building, IISER Pune

In this talk, two data science techniques will be presented to mine behaviorally similar users and groups of users from enterprise social networks. First, we present a method to characterize user behavior from their engagement with enterprise social media. Content analysis often suffers challenges due to noise. Here we study behavior using temporal activity, i.e., the number of posts per month represented as a time series. User posting volume on social media has a long-tailed nature. It causes time series clustering algorithms to result in unbalanced clusters with either very few users or almost all users. Thus we propose a hierarchical time series clustering algorithm to group users according to their behavioral homogeneity and provide interpretable characterizations of the resulting clusters. Users in distinct clusters deviate significantly in their topics of interest while being homophilic (near identical or similar minded) within the cluster. The quality of the clustering is assessed on Enterprise Social Media (ESM), StackExchange, and Linux Kernel Mailing List (LKML) datasets and compared against existing clustering techniques. Second, we present a method to identify groups of similarly behaving users with similar work contexts from their activity on enterprise social media. This would allow organizations to discover redundancies and increase efficiency. To better capture the network structure and communication characteristics, we model user communications with directed attributed edges in a graph. Communication parameters including engagement frequency, emotion words, and post lengths act as edge weights of the multi edge. Upon the resultant adjacency tensor, we develop a node embedding algorithm using higher order singular value tensor decomposition and a convolutional autoencoder. We develop a peer group identification algorithm using the cluster labels obtained from the node embedding and show its results on Enron emails and the StackExchange Workplace community. We observe that people with the same roles in enterprise social media are clustered together by our method. We provide a comparison with existing node embedding algorithms as a reference, indicating that attributed social networks and our formulations are an efficient and scalable way to identify peer groups in an enterprise social network, which aids in professional social matching.

Mining Personality Traits and Behaviorally Similar Groups from Enterprise Social Networks

Priyanka Sinha, Docyt

Monday, 12th Oct 2023, 10:30 AM - 11:30 AM

Madhava Hall, 3rd floor, Main Building, IISER Pune

Towards Energy-Accuracy Trade off in Abuse Classification Based on Optimum Parameter Selection

Swati Agarwal, BITS Pilani, Goa

Tuesday, 10th Oct 2023, 10:30 AM - 11:30 AM

Madhava Hall, 3rd floor, Main Building, IISER Pune

Abuse classification has drawn significant research attention due to the widespread adoption of social networking applications. In spite of the attention and the sensitive nature of the task, privacy preservation in abuse classification has remained under-studied. To exploit the rising computing power of mobile devices and to reduce the demand on centralized infrastructure, Federated Learning (FL) has emerged as a viable solution. FL also addresses privacy concerns inherent in abuse classification. The primary goal of FL is to create a generalized global model that performs equally well on all the participants. However, heterogeneity in data and the frequent communication needed for data exchange between clients and the server become a bottleneck for a client with limited resources. In this talk, I will present a hybrid abuse classification model that combines BERT and BiLSTM to unleash the potential of texts in a diverse data environment (e.g., languages and source platforms). Additionally, I will discuss the proposed GoFed, a personalized and communication-efficient FL framework based on optimal parameter selection. In GoFed, each client learns a personalized model, and only the best-performing model updates are communicated to the server. On the server side, the selector selects the best parameters from the previous copy and the latest copy of parameters received from each client for aggregation. The communication cost can be significantly reduced due to the lower frequency of updates. Experiments on real-world datasets demonstrate that, compared with state-of-the-art approaches, GoFed can achieve a 3.7-fold reduction in communication cost and as much as a 10.41% increase in personalized accuracy.

Federated Learning: Machine Learning on Decentralized Data for NLP Applications

Swati Agarwal, BITS Pilani, Goa

Monday, 10th Oct 2023, 03:00 PM - 04:00 PM

Madhava Hall, 3rd floor, Main Building, IISER Pune

In the realm of Natural Language Processing (NLP), the imperative to leverage vast and diverse datasets has never been more pronounced. However, concerns regarding data privacy, security, and regulatory compliance pose formidable challenges to conventional centralized approaches. Federated Learning emerges as a paradigm-shifting solution that enables model training on decentralized data sources while preserving individual data privacy. This seminar aims to provide a comprehensive exploration of Federated Learning in the context of NLP applications. We will delve into the fundamental principles, methodologies, and challenges associated with this cutting-edge technique. By distributing model training across multiple devices or servers, Federated Learning not only ensures data privacy but also promotes collaboration among institutions and organizations with shared research interests. This presentation will also serve as a foundation for further discussions and research endeavors in the domain of Federated Learning and NLP, creating an opportunity for collaboration and knowledge exchange among esteemed faculty members and experts.

Dynamic Inter-treatment Information Sharing for Individualised Treatment Effects Estimation

Vinod Kumar Chauhan, Institute of Biomedical Engineering, University of Oxford, OX3 7DQ

Wednesday, 4th Oct 2023, 02:00 PM - 03:00 PM

Online mode

Limited dataset sizes can pose challenges in causal inference and machine learning. This is especially problematic in causal inference, where data is split among different treatment groups for model training, potentially leading to bias. While some information sharing among treatment groups can help, current individualised treatment effect (ITE) learners often lack a mechanism for comprehensive inter-treatment information sharing. To tackle this, we introduce a novel deep learning framework for training ITE learners. It leverages dynamic end-to-end information sharing among treatment groups through soft weight sharing of hypernetworks. This framework, referred to as HyperITE, complements existing ITE learners and effectively reduces ITE estimation errors, particularly benefiting smaller datasets in our experiments.

Application of machine learning in healthcare

Rajdeep Banerjee, IHX Pvt. Ltd., Bangalore

Thursday, 7th Oct 2023, 03:30 PM - 04:30 PM

Seminar Room 31 (2nd floor main building).

With the rapid progress in artificial intelligence and machine learning (AI & ML) over the latter half of the last decade, efforts have focused more on resolving clinical issues in healthcare (e.g., diagnosis and treatment) [1] than on non-clinical issues such as improving patient experience through claims management or creating a proactive view of healthcare through early detection of diseases. During my involvement with the healthcare industry, I have tried to address these issues using a combination of data-centric AI and traditional ML approaches [2]. My work shows how methods like weak supervision and confident learning can reduce the dependence on human-in-the-loop review and improve data quality and model accuracy in the standardization of medical bills. I have employed a blend of ML methods in an attempt to increase precision in the extraction and mapping of diseases to standard medical codes, which is still an open problem. I will also discuss how ML models can reliably predict the occurrence of gestational diabetes mellitus using early-trimester data, thereby providing a promising approach to disease prevention.

Identifying abrupt transitions in time series with uncertainties

Bedartha Goswami, University of Tubingen, Germany

Friday, 1st Sep 2023, 11:00 AM - 12:00 PM

Madhava Hall, 3rd floor, Main Building, IISER Pune

Here, I will present a new way of representing time series with uncertainties: as a sequence of probability distributions in lieu of point-like measurements. I will then show how we can use the framework of recurrence plot analysis to build a novel tool that helps detect abrupt transitions in time series. The transition detection method is then extended to incorporate time series of probability distributions. I will then demonstrate the proposed framework on a synthetic example and show how it can be used to detect transitions in paleoclimate data, current era climate data, and stock market index data.
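For context, a standard recurrence plot for an ordinary point-valued series marks every pair of time points whose states lie within a threshold of each other. The sketch below shows only this textbook construction for a scalar series. The function name and the threshold parameter `epsilon` are assumptions, and the extension to series of probability distributions described in the abstract is not attempted here.

```python
def recurrence_matrix(series, epsilon):
    """R[i][j] = 1 when states i and j are within epsilon of each other, else 0."""
    return [[1 if abs(a - b) <= epsilon else 0 for b in series] for a in series]
```

An abrupt transition tends to show up as a block structure in this matrix: states before the transition recur with each other, and states after it recur with each other, but the two groups rarely recur across the break.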

Understanding climate variability with statistical machine learning and artificial intelligence

Bedartha Goswami, University of Tubingen, Germany

Thursday, 31 Aug 2023, 11:00 AM - 12:00 PM

Madhava Hall, 3rd floor, Main Building, IISER Pune

I will present two projects (and a teaser) about ongoing work in my group that use machine learning to understand climate variability. I will show how we can use similarity-based networks of climate time series data to reveal new features of intraseasonal variability of extreme rainfall propagation during the South Asian Summer Monsoon. I will next present how we can use principal component analysis in combination with Gaussian Mixture Models to categorize extreme phases of the El Niño Southern Oscillation (ENSO). Last, I will present a short teaser on how we can leverage deep learning to develop purely data-driven models for subseasonal-to-seasonal forecasting of the ENSO.

Strategic Investments in Blockchain Mining: A Stochastic Game Perspective

Swapnil Dhamal, Telecom SudParis

Thursday, 17th Aug 2023, 10:30 AM - 11:30 AM

Madhava Hall, 3rd floor, Main Building, IISER Pune

Blockchain technology has found applications in various fields such as cryptocurrencies, smart contracts, security services, and Internet of Things. Its functioning relies on a block mining procedure, where miners collect block data consisting of a number of transactions and attempt to solve a computationally-intensive cryptographic hash puzzle in return for a specified reward. In this talk, I will present one of my works on strategic investment of computational power by miners in blockchain. In particular, we shall consider a setting where miners can arrive and depart over time, and hence analyze their investment strategies and the obtained payoffs in the equilibrium of the underlying stochastic game. We shall see that depending on the mining scenario, miners either follow a thresholding policy that is independent of the other miners, or a smooth policy that depends on the other miners. Thereafter, we will look at a Stackelberg game, where the system decides the amount of reward to offer for mining a block and the miners decide how much power to invest based on the offered reward.

An Agent-based Mobility Model of Sweden

Swapnil Dhamal, Telecom SudParis

Wednesday, 16th Aug 2023, 10:30 AM - 11:30 AM

Madhava Hall, 3rd floor, Main Building, IISER Pune

Agent-based models are increasingly being employed for making various policy decisions and analyzing their effectiveness. In particular, agent-based mobility models potentially act as critical inputs for simulation models in the areas of transportation, land use, economics, epidemiology, etc. In this talk, I will present an overview of our methodology for developing an agent-based mobility model of Sweden, the Synthetic Sweden Mobility (SySMo) model, while explaining how we utilized data acquired from a variety of sources. The model comprises (a) a synthetic population of agents that is statistically representative of the real-world population of Sweden with respect to socio-demographic attributes like age, gender, civil status, residential zone, income, car ownership, employment, etc., and (b) its mobility patterns describing the agents' daily activity-travel schedules including activity types, their start-end times and locations, and modes of transport between activities. We shall conclude by looking at our model's performance and how it can answer intricate questions related to Sweden's population and its mobility.

Gene regulatory mechanisms in cancer – beyond genetics

Sridhar Hannenhalli, NIH USA

Monday, 16th Aug 2023, 03:00 PM - 04:00 PM

Madhava Hall, 3rd floor, Main Building, IISER Pune

While mutations, specifically those affecting protein-coding genes, have been a major focus of cancer research, they do not explain oncogenesis, metastasis, and therapy response entirely, and epigenetic plasticity is emerging as a potent complementary mechanism. Stochastic gene expression variability is intimately linked to cellular plasticity, which, while being an integral part of development and stress response, is also linked to cancer and presents a major challenge for cancer therapy. I will present our recent work showing the existence of a transcriptionally distinct subpopulation of healthy pancreatic acinar cells exhibiting features of ductal-acinar progenitor state pancreatic ductal adenocarcinoma. Parallels between development and cancer have long been noted, and recent works have identified activation of developmental programs in cancer. I will briefly summarize our recent works showing (1) a novel developing melanoblast cell state associated with metastasis and therapy response in melanoma and (2) a broad misappropriation of developmental splicing programs by cancer. Time permitting, I will summarize our recent attempts to characterize non-coding mutations during evolution and in cancer.

Advances in single cell data analysis and beyond: New challenges and opportunities

Sumanta Ray, Department of Computer Science and Engineering, Aliah University, Kolkata, India

Tuesday, 18th July 2023, 03:00 PM - 04:00 PM

Madhava Hall, 3rd floor, Main Building, IISER Pune

In this talk, we explore the recent advances in single-cell data analysis and discuss the emerging challenges and opportunities in this exciting field. Single-cell technologies have revolutionized our understanding of biological systems by enabling the characterization of individual cells at unprecedented resolution. However, analyzing the data poses numerous challenges for computational scientists, particularly machine learning researchers. We introduce two novel methods that address some of these challenges. First, we present a gene selection method for downstream analysis of scRNA-seq data, which aids in identifying the most informative genes for further investigation. Second, we showcase a method for generating realistic cell samples from small-sample single-cell datasets. This technique overcomes the constraints of limited sample sizes and allows researchers to augment their data, enabling more robust and comprehensive analyses.

Keywords: Single-cell analysis, Preprocessing, Gene selection, Generative model, Data augmentation

Unveiling the Potential of Copulas: Transforming Data Science and Machine Learning with Advanced Dependency Analysis

Sumanta Ray, Department of Computer Science and Engineering, Aliah University, Kolkata, India

Monday, 17th July 2023, 03:00 PM - 04:00 PM

Madhava Hall, 3rd floor, Main Building, IISER Pune

In this talk, we explore the potential of copulas in the fields of Data Science and Machine Learning through advanced dependency analysis. Copulas offer a flexible framework for modeling complex relationships between variables, overcoming the limitations of traditional statistical methods. By capturing underlying relationships between variables without specifying the form of their individual distributions, copulas enable more accurate modeling and analysis.
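The separation of dependence from marginal distributions can be illustrated with rank-based pseudo-observations and the empirical copula built from them. This is a generic sketch rather than the speaker's method: the function names are hypothetical, the rescaling by n + 1 is one common convention, and ties are assumed absent.

```python
import math


def pseudo_observations(values):
    """Map a sample into (0, 1) via ranks: u_i = rank_i / (n + 1). Assumes no ties."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    ranks = [0] * n
    for rank, index in enumerate(order, start=1):
        ranks[index] = rank
    return [r / (n + 1) for r in ranks]


def empirical_copula(xs, ys, u, v):
    """Fraction of sample pairs whose pseudo-observations fall in [0, u] x [0, v]."""
    us, vs = pseudo_observations(xs), pseudo_observations(ys)
    return sum(1 for a, b in zip(us, vs) if a <= u and b <= v) / len(xs)


# Applying a strictly increasing transform to one variable changes its marginal
# distribution but leaves the ranks, and hence the empirical copula, unchanged.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [10.0, 20.0, 30.0, 40.0]
same = empirical_copula(xs, ys, 0.5, 0.5) == empirical_copula(
    [math.exp(x) for x in xs], ys, 0.5, 0.5
)
```

Because only ranks enter the computation, the dependence estimate is the same whether a variable is measured on its raw scale or after any strictly increasing transformation such as a logarithm or exponential.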

We highlight the key advantages of copulas in various applications within Data Science and Machine Learning. Specifically, we present a case study where copulas are used to model the dependency between gene expression patterns, aiding in the identification of differential coexpression genesets.

Additionally, we discuss the benefits, challenges, and future directions of utilizing copulas to unlock the full potential of advanced dependency analysis within the field of data science.

Keywords: Copulas, Dependency Analysis, Data Science, Machine Learning, Modeling, Gene Expression, Differential Coexpression.
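
As a small illustration of the idea in the abstract above (dependence captured separately from the individual distributions), the following is a minimal sketch and not the speaker's method. It maps each variable to rank-based pseudo-observations on (0, 1), which is the usual empirical route to a copula, and checks that a rank-based dependence measure is unchanged when the marginals are distorted by strictly increasing transformations. The simulated data and all function names are assumptions made for the example.

```python
import math
import random


def ranks(values):
    """Return 1-based ranks of the values (the sample is assumed to have no ties)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        result[idx] = rank
    return result


def pseudo_observations(values):
    """Map a sample into (0, 1) via scaled ranks, rank / (n + 1)."""
    n = len(values)
    return [r / (n + 1) for r in ranks(values)]


def pearson(a, b):
    """Plain Pearson correlation of two equal-length samples."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def spearman(x, y):
    """Spearman's rho: Pearson correlation of the pseudo-observations."""
    return pearson(pseudo_observations(x), pseudo_observations(y))


# Hypothetical data: two positively dependent Gaussian variables.
rng = random.Random(0)
x = [rng.gauss(0, 1) for _ in range(500)]
y = [0.8 * xi + 0.6 * rng.gauss(0, 1) for xi in x]

# Distort the marginals with strictly increasing maps; the ranks do not change.
x_skewed = [math.exp(xi) for xi in x]
y_cubed = [yi ** 3 for yi in y]

rho_original = spearman(x, y)
rho_transformed = spearman(x_skewed, y_cubed)
print(rho_original, rho_transformed)
```

The two printed values agree because the pseudo-observations depend only on the ordering of each sample, which is why copula-based analysis does not require the form of the marginal distributions to be specified.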

How efficient the Indian states are in curbing certain crimes: A Probabilistic frontier regression approach

T. V. Ramanathan, Department of Statistics, Savitribai Phule Pune University

Wednesday, 5th April 2023, 03:30 PM - 04:30 PM

Madhava Hall, 3rd floor, Main Building, IISER Pune

In this talk, we consider the technical efficiency of Indian states in curbing three types of crimes, viz., rape, assault on women, and crimes against children. The analysis based on NCRB data indicates that the states are only about 65 to 75 percent effective. A probabilistic frontier regression model is introduced for this purpose using count-type output data in a production process setup. We treat some of the outcomes as desired outcomes, or the 'interest class', and a change in the probability of the output falling into this class is attributed to a decrease in the technical efficiency of a decision-making unit (state). A measure of technical efficiency is proposed. A simulation study is carried out to assess whether the average estimated technical efficiency is close to its actual value.

Statistical Inference via Conditional Bayesian Posteriors for High-Dimensional Linear Regression

Naveen Narisetty, University of Illinois at Urbana-Champaign

28th March 2023, 11:00 AM - 12:00 PM

Madhava Hall, 3rd floor, Main Building, IISER Pune

Performing inference for high-dimensional models with valid statistical properties is a challenging problem of immense practical importance. We propose a new method under the Bayesian framework to perform valid inference for low-dimensional parameters in high-dimensional sparse linear models. Our approach is to use surrogate Bayesian posteriors based on partial regression models to remove the effect of high-dimensional nuisance variables. We call the final distribution used to conduct inference the "conditional Bayesian posterior", as it is a surrogate posterior constructed conditional on quasi-posterior distributions of other parameters and does not admit a fully Bayesian interpretation. Unlike existing Bayesian methods, our method can be used to quantify the estimation uncertainty for arbitrarily small signals and therefore does not require variable selection consistency to guarantee its validity. Theoretically, we show that the resulting Bayesian credible intervals achieve desired coverage probabilities in the frequentist sense. Methodologically, our proposed Bayesian framework can easily incorporate popular Bayesian regularization procedures, such as those based on spike and slab priors and horseshoe priors, to facilitate high-accuracy estimation and inference. Numerically, our proposed method rectifies the uncertainty underestimation of Bayesian shrinkage approaches and has empirical performance comparable with state-of-the-art frequentist methods, based on simulation studies and real data analysis.

Addressing the Data Bottleneck in Information Extraction

Amit Awekar, IIT Guwahati

18th April 2023, 03:30 PM - 04:30 PM

Madhava Hall, 3rd floor, Main Building, IISER Pune

Supervised Machine Learning tasks require annotated data for model training. Annotating large-scale data is both costly and error-prone. The annotation error issue becomes even more complex when the number of annotation labels is of the order of hundreds or thousands. As a result, absence of high-quality data becomes the real bottleneck in improving the model performance. In this talk, we will consider three scenarios for addressing the data bottleneck.

Topological Data Analysis, Basics, Computation, and Applications

Siddharth Pritam, Shiv Nadar University

25th January 2023, 03:00 PM - 04:00 PM

Madhava Hall, 3rd floor, Main Building, IISER Pune

In this talk, we will discuss the basic theory of Topological Data Analysis (TDA), in particular Persistent Homology (PH). Then we will look into its computational aspects, including the challenges and recent advancements. We will discuss the use of combinatorial collapses in the efficient computation of PH. Given a sequence of simplicial complexes (a filtered simplicial complex), applying a homology functor yields a sequence (chain) of vector spaces with linear maps between consecutive vector spaces. We call such a sequence a persistence module. A persistence module captures the evolution of the topology of the filtered simplicial complex. It is a dynamic variant of classical homology theory. The theory of persistent homology has found many applications and has become an important tool in scientific investigation. Due to the huge size and large dimensions of data, the computation of persistent homology has been a central challenge. Our recent work (SoCG'22) with Marc Glisse is a significant step towards efficient computation of PH. The main tool used in that work is combinatorial collapses.
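
For readers new to the construction described in the abstract above, it can be written out as follows. This is a notational sketch only: the symbols $K_i$, $H_p$, and $f_i$ are assumed for illustration, and homology is taken with coefficients in a field so that each $H_p(K_i)$ is a vector space.

```latex
% A filtered simplicial complex: a nested sequence of simplicial complexes.
K_1 \subseteq K_2 \subseteq \cdots \subseteq K_n

% Applying the p-th homology functor to the inclusions gives a persistence module:
% vector spaces connected by the linear maps f_i induced by K_i \hookrightarrow K_{i+1}.
H_p(K_1) \xrightarrow{\; f_1 \;} H_p(K_2) \xrightarrow{\; f_2 \;} \cdots \xrightarrow{\; f_{n-1} \;} H_p(K_n)
```

A homology class is said to be born at the first index where it appears and to die at the index where it maps to zero or merges with an older class; tracking these births and deaths is what records the evolution of the topology along the filtration.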

Topics in Random Graphs

Mihir Hasabnis

25th January 2023, 03:00 PM - 04:00 PM

Madhava Hall, 3rd floor, Main Building, IISER Pune

The speaker's research at CMU with Prof. Alan Frieze focused mainly on several problems in random graphs. Random graphs form one of the largest areas of probabilistic combinatorics, with applications to various fields in Mathematics and Computer Science. We will briefly look at a variety of problems in this area, such as the Game Chromatic Number for Hypergraphs, Rainbow Matching, Colorful Hamiltonian Cycles, and Random Graph Isomorphism. Extremal combinatorics studies how large or how small a collection of finite objects can be if it has to satisfy certain restrictions. We will look at a problem of saturation in bipartite graphs in that context.

Topological (big) Data Analysis: From cosmology to biology and beyond

Pratyush Pranav, University of Lyon 1 and Ecole Normale Superieure de Lyon at Centre de Recherche Astrophysique de Lyon (CRAL)

11th January 2023, 03:00 PM - 04:00 PM

Madhava Hall, 3rd floor, Main Building, IISER Pune

The increased focus on data across disciplines has simultaneously led to a massive surge in data collection, such that the term Big Data has entered common parlance. The advent of Big Data has engendered two of the central statistical challenges of our times: detection and classification of structure in extremely large, high-dimensional data sets. An intriguing new approach to this challenge is “TDA,” or “Topological Data Analysis.” These developments on the topological side are recent, and add value to the already existing computational geometric tools.

In the first part of the talk, the speaker will present a summary of the theoretical and computational aspects of geometry and topology from the viewpoint of data analysis. Subsequently, the speaker will highlight applications by examining the properties of the Cosmic Microwave Background (CMB), as well as growing bacterial colonies, through topological methods.
