Predictive Modeling

Climate data seriously challenges the state of the art in predictive modeling. The challenge comes from both the nature of the data itself and the nature of the problems that need to be solved. As a result, there are arguably no off-the-shelf predictive modeling methods that can even begin to meaningfully analyze climate data and make predictions based on it. In particular, the long-range spatial and temporal dependencies in climate data cannot be effectively captured by popular approaches such as Markov models with local dependencies, which are often used in domains such as image, speech, video, and signal analysis. The need for predictive insights spans a broad range of climate challenges and phenomena. Below we describe some of our work to extend the state of the art in predictive modeling for climate.


One research effort has pursued both applications and theoretical advances in sparsity regularized regression techniques (Lasso, Elastic Net, Group Lasso, and Sparse Group Lasso) to obtain a predictive understanding of complex, dynamic physical phenomena, such as regional precipitation or hurricane intensity and frequency. With respect to theoretical advances, we have established rates of convergence for sparsity inducing, hierarchically regularized regression, which includes Lasso, Group Lasso, and Sparse Group Lasso as special cases. The rate depends only logarithmically on the number of dimensions and hence the methods are applicable in the high-dimensional low-sample regime, which is common in climate sciences. For settings where linear prediction is unsuitable, e.g., categorizing regional precipitation into low/medium/high regimes, nonlinear and nonparametric approaches have been investigated. For the regional precipitation regime categorization, k-nearest neighbor (kNN) with metric learning achieved an accuracy improvement of 10-15% over ordinary kNN using Euclidean distance, and 20-25% over multi-class SVMs.
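
As a minimal sketch of why sparsity regularization suits the high-dimensional, low-sample regime described above, the following pure-Python example fits a Lasso by cyclic coordinate descent with soft-thresholding. It is not the implementation used in this work: the function names, the synthetic data (20 samples, 50 predictors), and the penalty value are all illustrative assumptions.

```python
import random

def soft_threshold(z, t):
    """Shrink z toward zero by t (the proximal operator of the L1 penalty)."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0

def lasso_cd(X, y, lam, n_iter=200):
    """Minimize (1/2n)*||y - Xw||^2 + lam*||w||_1 by cyclic coordinate descent."""
    n, p = len(X), len(X[0])
    w = [0.0] * p
    resid = list(y)  # residual y - Xw, with w = 0 initially
    col_sq = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    for _ in range(n_iter):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue
            # Correlation of column j with the partial residual (w[j] excluded).
            rho = sum(X[i][j] * (resid[i] + X[i][j] * w[j]) for i in range(n)) / n
            new_wj = soft_threshold(rho, lam) / col_sq[j]
            delta = new_wj - w[j]
            if delta != 0.0:
                for i in range(n):
                    resid[i] -= X[i][j] * delta
                w[j] = new_wj
    return w

# Synthetic data (assumption): only predictors 0 and 1 drive the response.
rng = random.Random(0)
n, p = 20, 50
X = [[rng.gauss(0, 1) for _ in range(p)] for _ in range(n)]
y = [3.0 * row[0] - 2.0 * row[1] + rng.gauss(0, 0.1) for row in X]

w = lasso_cd(X, y, lam=0.5)
support = [j for j, wj in enumerate(w) if abs(wj) > 1e-8]
print("selected predictors:", support)
```

The soft-thresholding step sets many coefficients exactly to zero, which is what allows a model with more predictors than samples to remain interpretable. In practice the penalty would be chosen by cross-validation or a similar procedure rather than fixed by hand.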

In recent work, building on our prior work in sparsity, we have developed a structure learning framework for multi-task learning (MTL). MTL is a powerful framework for learning multiple, potentially related tasks. Much of the existing literature assumes the task relationship graph to be either known or to have a specific form. In contrast, we treat task relationship estimation as part of the learning problem and formulate it as a structure learning problem in a Gaussian graphical model. The joint framework, called Multi-task Sparse Structure Learning (MSSL), has been applied to temperature prediction in South America based on global climate models (GCMs) with promising results.

With respect to applications of sparsity regularized regression techniques, we have improved prediction accuracy for precipitation and hurricanes, and used these techniques to help identify a previously undiscovered source of Atlantic hurricane inter-annual variability in the Somali region of East Africa. We also used these techniques to produce a plausible phenomenological model of the eastern Sahel seasonal rainfall and quantified key climate drivers of rainfall variability. As a part of our sparsity regularized regression research, we have proposed methods for optimizing the regularization penalty for Lasso regression, detecting and ranking prominent temporal phases for climate variables, and assessing predictor statistical significance to ensure the validity of the discovered causal relationships.

Specifically, our approach has been applied to understanding the sources of variability of the Sahelian rainfall climate system, quantifying the influence of East African climatic conditions in modulating Atlantic hurricane variability, and understanding Sahel rainfall as a network of coupled climate factors. Using this methodology, multiple pathways showing the influence of the North Atlantic Oscillation on the Sub-Saharan African seasonal climates have been revealed. We have discovered a new climate index for Atlantic hurricane variability (Greater Horn of Africa Climate Index, GHACI) which, among other desirable properties, shows stronger correlation with Atlantic hurricane seasonal count than any traditional climate index except for the AMO. Additionally, composite analysis shows the influence of near surface wind anomalies along the coast of the Greater Horn of Africa (GHA) region on Atlantic hurricane activity.

For understanding Sahel rainfall, we have developed a methodology named coupled heterogeneous association rule mining (CHARM), which finds a multitude of rules supported by the literature, especially rules associated with sea surface temperature (SST) patterns driving dynamical changes that influence local Sahel rainfall. We also explore higher-order associations by allowing larger itemsets in the generated rules, which gives a better view of the overall implications of longer sequential chains of anomalous coupled relationships. A careful comparison of CHARM, Lasso, and dynamic Bayesian networks (DBNs) has revealed interesting similarities and differences between the performance of CHARM and Lasso.

Another key research effort is to significantly improve the predictive skill of seasonal forecasts, including forecasts of climate extremes, with a focus on North Atlantic hurricane activity, rainfall variability over the African Sahel region, and rainfall variability in the GHA. The work includes both classification and regression based activity forecasting of extremes, statistical estimation of the intensity of extremes, and data-driven discovery of climate indices. Because real-world dynamic systems (e.g., atmosphere-ocean systems) have a hierarchical system-subsystem structure, we have developed predictive approaches that take this structure into account to improve classification and regression based forecasting of rainfall and hurricane activity.

As part of that effort, we developed DETECTOR, a hierarchical method for detecting and correcting prediction errors in extreme event forecasts by employing the whole-part relationships between different systems. We also created FORECASTER, an algorithm that constructs a forecast-oriented, feature elimination-based ensemble of classifiers for robust forecasting of extreme events (e.g., a low, normal, or high hurricane activity season). In further work related to regression rather than classification, we have developed a means to use classification results from the FORECASTER methodology to train individual regression models for the subsets of data that belong to distinct classes. Both approaches have yielded significant improvements in prediction performance. For example, when using DETECTOR to predict landfall of Atlantic hurricanes and North American rainfall, we can correct 25% of prediction errors and increase accuracy by an average of 13% over individual classifiers and state-of-the-art ensembles.
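
The idea of training a separate regression model for each predicted class can be sketched as follows. This is only an illustration under stated assumptions: the class labels are supplied directly rather than produced by FORECASTER, each model is a one-predictor least-squares line, and the helper names and toy numbers are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x; returns (a, b)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    if sxx == 0.0:
        return my, 0.0  # constant predictor: fall back to the class mean
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return my - b * mx, b

def fit_per_class(xs, ys, labels):
    """Train one regression model on the subset of data in each class."""
    models = {}
    for label in set(labels):
        idx = [i for i, lab in enumerate(labels) if lab == label]
        models[label] = fit_line([xs[i] for i in idx], [ys[i] for i in idx])
    return models

def predict(models, x, label):
    """Predict with the regression model that belongs to the given class."""
    a, b = models[label]
    return a + b * x

# Toy data (assumption): the predictor-response relation differs by regime.
xs = [1.0, 2.0, 3.0, 1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0, 13.0, 16.0, 19.0]
labels = ["low", "low", "low", "high", "high", "high"]

models = fit_per_class(xs, ys, labels)
print(predict(models, 4.0, "low"), predict(models, 4.0, "high"))
```

At prediction time, the class assigned by the classifier selects which regression model to apply, so an error in the classification stage propagates to the regression stage.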

In another effort, our climate index discovery methodology was applied to identify climate indices for September-December seasonal rainfall in the GHA. Prediction based on such indices yields improved classification accuracy over the state of the art at 17 out of 18 synoptic stations in Tanzania. In a project concerned with forecasting hurricane intensity, we have developed a new predictive technique for cyclone intensity that uses the historical intensity information of the 10 most similar cyclones in the historical database, as determined by a k-nearest-neighbor algorithm. Overall, a 30% to 55% improvement has been achieved compared to the current state of the art.
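
A nearest-neighbor intensity forecast of this kind might look like the sketch below. The feature vectors, the Euclidean distance, and the simple averaging of neighbor intensities are assumptions made for illustration; the text above specifies only that the 10 most similar historical cyclones are used.

```python
import math

def knn_intensity_forecast(query, history, k=10):
    """Average the later intensity of the k historical cyclones nearest to query.

    query   -- feature vector of the current cyclone (hypothetical features)
    history -- list of (feature_vector, later_intensity) pairs
    """
    if not history:
        raise ValueError("history must not be empty")
    ranked = sorted(history, key=lambda rec: math.dist(query, rec[0]))
    neighbors = ranked[:k]
    return sum(intensity for _, intensity in neighbors) / len(neighbors)

# Hypothetical records: (latitude, longitude, current wind in knots) -> later wind.
history = [
    ((15.0, -45.0, 60.0), 70.0),
    ((16.0, -46.0, 65.0), 80.0),
    ((30.0, -70.0, 40.0), 35.0),
    ((14.5, -44.0, 55.0), 65.0),
]
forecast = knn_intensity_forecast((15.2, -45.5, 62.0), history, k=2)
print(forecast)
```

With real data the features would normally be standardized first, so that no single variable dominates the distance.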

As part of an ongoing effort to develop predictive approaches for data with spatial autocorrelation, we have created a spatial decision tree (SDT) model suitable for spatial data, including a spatial information gain "interestingness" measure. We used this model to develop an SDT learning algorithm, and have conducted a case study on a real-world remote sensing dataset to validate the usefulness of the proposed approach. We have also developed a focal-test-based spatial decision tree (FTSDT), where the tree traversal direction for a location is based not only on local but also on neighborhood properties of the location. In a case study, FTSDT reduced misclassification errors by over 18 percent and improved the autocorrelation level by over 12 percent with respect to basic decision trees. In recent work, building on the idea of spatial decision trees, we have proposed a spatial ensemble (SE) framework, which partitions a given spatial region into a set of sub-regions and learns a base classifier for each sub-region. A case study on a real-world remote sensing dataset has illustrated the advantages of the proposed SE framework.
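
One plausible way to let neighborhood properties influence the traversal direction at a tree node is sketched below. It is an assumed simplification, not the published FTSDT node test: the local threshold test yields a +1/-1 indicator, and the result is flipped when the cell disagrees with the average indicator of its 8-neighborhood. All names and the sample patch are hypothetical.

```python
def local_indicator(value, threshold):
    """+1 if the local feature test passes (value <= threshold), else -1."""
    return 1 if value <= threshold else -1

def focal_test(grid, row, col, threshold):
    """Return True (go left) or False (go right) for one grid cell.

    The local test result is flipped when the cell disagrees with the
    average test result of its 8-neighborhood (an assumed focal statistic).
    """
    n_rows, n_cols = len(grid), len(grid[0])
    here = local_indicator(grid[row][col], threshold)
    neighbors = [
        local_indicator(grid[r][c], threshold)
        for r in range(max(0, row - 1), min(n_rows, row + 2))
        for c in range(max(0, col - 1), min(n_cols, col + 2))
        if (r, c) != (row, col)
    ]
    if not neighbors:
        return here == 1
    gamma = here * sum(neighbors) / len(neighbors)  # < 0 marks a local outlier
    go_left = here == 1
    return (not go_left) if gamma < 0 else go_left

# Hypothetical 3x3 patch of a spectral feature with one noisy center pixel.
patch = [
    [0.2, 0.3, 0.2],
    [0.3, 0.9, 0.1],
    [0.2, 0.2, 0.3],
]
print(focal_test(patch, 1, 1, threshold=0.5))
```

In this patch the center pixel fails the local test but is sent the same way as its neighbors, which is the kind of salt-and-pepper noise suppression that motivates using neighborhood information.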

There is considerable uncertainty regarding the multi-decadal decline of the Eastern African 'Long Rains.' Much of the existing work on Eastern African variability has focused on the 'Short Rains,' which have strong teleconnections with El Niño and therefore exhibit high prospects for robust predictability. The so-called East African Climate Paradox, based on observations and the IPCC AR4/AR5 reports, suggests that the declining trend in rainfall and its numerous hydrological manifestations over Eastern Africa should be expected to reverse in the next few decades, changing to an increasing trend. This has important implications for water resources planning for the region. An EOF analysis of the Long Rains of Eastern Africa using the CRU rainfall data shows two distinct modes of variability. Interestingly, the leading mode does not seem to have a strong association with any of the traditional climate indices. Hypothesis-driven causal dependencies of the declining rainfall will be investigated in future work.

Lake Victoria is one of the largest lakes in the world, and supports an increasing human population and associated agricultural and fishing activities. However, the Lake Victoria basin is known for rampant lightning strikes that cause thousands of human fatalities every year. We are developing software that uses historical lightning data to predict the safest navigation paths for avoiding lightning strikes. Going forward, the software can be integrated with navigation systems and mobile phones to provide warnings and guidance for local residents.
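
To illustrate how a safest path could be computed once lightning risk has been estimated, the sketch below runs Dijkstra's algorithm over a grid of per-cell risk scores. This is an assumed formulation, not a description of the software under development: the grid representation, the additive risk cost, and the function name are hypothetical.

```python
import heapq

def safest_path(risk, start, goal):
    """Dijkstra over a grid; a path's cost is the summed risk of cells entered."""
    n_rows, n_cols = len(risk), len(risk[0])
    best = {start: 0.0}
    parent = {}
    queue = [(0.0, start)]
    while queue:
        cost, cell = heapq.heappop(queue)
        if cell == goal:
            break
        if cost > best[cell]:
            continue  # stale queue entry
        r, c = cell
        for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            if 0 <= nr < n_rows and 0 <= nc < n_cols:
                new_cost = cost + risk[nr][nc]
                if new_cost < best.get((nr, nc), float("inf")):
                    best[(nr, nc)] = new_cost
                    parent[(nr, nc)] = cell
                    heapq.heappush(queue, (new_cost, (nr, nc)))
    if goal not in best:
        return None, float("inf")
    path = [goal]
    while path[-1] != start:
        path.append(parent[path[-1]])
    return path[::-1], best[goal]

# Hypothetical risk grid (e.g., historical strike frequency per cell).
risk = [
    [0.1, 0.9, 0.1],
    [0.1, 0.9, 0.1],
    [0.1, 0.1, 0.1],
]
path, total = safest_path(risk, start=(0, 0), goal=(0, 2))
print(path, total)
```

Here the route detours around the high-risk middle column even though the direct route is shorter. Risk scores must be non-negative for Dijkstra's algorithm to be valid.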

We have investigated approaches to functional analysis of climate data. The focus has been on partial linear models, semiparametric models, and software for curve registration, functional principal components and other techniques, in order to conduct a study on model selection involving infinite dimensional parameters and data. Major advances have been made in the domain of partial linear models and semiparametric modeling of a multivariate, possibly categorical, response. In the context of partial linear models, we are currently studying a new technique involving the usage of basis function expansions followed by parametric model fitting.

We have also investigated mixed effects models for high-dimensional statistical dependencies and prediction. In this context, we have studied a robust mixed effects modeling scheme as well as semiparametric and nonparametric mixed effects modeling, and we are currently working on quantile and other non-standard feature estimation using mixed effects models. We are also working on applications of these methods to statistical downscaling, Bayesian inference techniques for these models, and resampling techniques. We have developed a method of moments based estimation scheme for estimating parameters in mixed effects models. This has been applied to precipitation data from Indian summer monsoons and shows that there is a significant presence of random effects. We have also extended the basic generalized linear mixed model, or hierarchical mixed model, in several ways. With respect to climate data analysis, the features shown by both real climate data and model runs indicate that in several instances traditional fixed effects models are inadequate, and that one or more random effects have a nontrivial and significant variance component. This is a fundamental development in uncertainty quantification and variability reduction in climate data analysis.
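
For readers unfamiliar with moment-based estimation of random effects, the sketch below applies the classical ANOVA moment estimator to a balanced one-way random effects model, y_ij = mu + b_i + e_ij. It is a textbook special case shown for illustration only, not the estimation scheme developed in this work, and the rainfall numbers are made up.

```python
def moment_estimates(groups):
    """Method-of-moments variance components for y_ij = mu + b_i + e_ij.

    groups -- list of equally sized lists, one per group (e.g., one per station)
    Returns (mu_hat, var_random_effect, var_residual).
    """
    k = len(groups)
    n = len(groups[0]) if k else 0
    if k < 2 or n < 2 or any(len(g) != n for g in groups):
        raise ValueError("need 2+ balanced groups with 2+ observations each")
    group_means = [sum(g) / n for g in groups]
    grand_mean = sum(group_means) / k
    # Between-group and within-group mean squares.
    msb = n * sum((m - grand_mean) ** 2 for m in group_means) / (k - 1)
    msw = sum(
        (y - m) ** 2 for g, m in zip(groups, group_means) for y in g
    ) / (k * (n - 1))
    # E[MSW] = var_e and E[MSB] = var_e + n * var_b, so solve for var_b.
    var_b = max(0.0, (msb - msw) / n)
    return grand_mean, var_b, msw

# Made-up seasonal rainfall totals at three stations over four years.
rainfall = [
    [10.0, 12.0, 11.0, 11.0],
    [20.0, 22.0, 21.0, 21.0],
    [30.0, 32.0, 31.0, 31.0],
]
mu, var_b, var_e = moment_estimates(rainfall)
print(mu, var_b, var_e)
```

A random-effect variance that is large relative to the residual variance, as in this toy example, is the kind of evidence that a fixed effects model alone would miss.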

We are also investigating statistical sparse matrix and tensor methods, multivariate depth-based L-estimators, and multistage sampling problems. A major contribution in this context has been the development of a model selection technique for high-dimensional regression based on spectral decomposition, simple linear regressions, and resampling. In a different development, we have worked out the different kinds of networks that are possible when dealing with spatio-temporal multivariate data, where both temporal lags and spatial dependencies are allowed. In another development, we have modeled tensor-valued data using lower dimensional structures suitable for imputation, statistical inference, and prediction. We have coupled resampling as well as Bayesian inference into this mechanism. A coupling of this tensor decomposition with low rank matrix factorization and inversion is turning out to be a potentially valuable tool for reproducing population level analogues of random tensors. Suitable extensions of such analysis will potentially allow one to do change point analysis in high dimensions.

People: Banerjee, Chatterjee, Choudhary, Ganguly, Homaifar, Knight, Kumar, Samatova, Semazzi, Shekhar, Steinhaeuser