Research

Current Projects

Server-scale solutions for large-scale deep learning

This direction explores the design and design-space exploration of the various components of supercomputers that target AI and ML and employ thousands of accelerators. We will investigate the research challenges encountered at scale across the entire system stack. Specifically, we aim to understand the challenges that arise as we deploy domain-specific accelerators in a distributed setting to process very large models, such as those used in natural language processing applications, each comprising billions of parameters.

In this direction of research, we will explore multiple facets of system infrastructure for deep learning: accelerator design for distributed execution, model distribution strategy and device placement, and network design, each individually and as a joint co-optimization.

Despite the plethora of existing accelerator designs, there is a lack of distributed heterogeneous architectures that are globally optimized for training and inference while meeting end-to-end user-level metrics such as accuracy, latency, and throughput. Accelerators need to be aware of parameter storage and stashing, data-transfer patterns, and quantization when processing deep learning models with complex pipelining strategies. To that end, we aim to devise latency- and throughput-optimized multi-accelerator and multi-node communication primitives that achieve high bandwidth and low latency over both novel and existing networking fabrics.
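As a concrete illustration of one facet of this problem, the sketch below shows a greedy heuristic for splitting the layers of a large model into pipeline stages balanced by parameter count. It is a minimal, hypothetical example under simplifying assumptions; the partition_layers function, the balancing criterion, and the layer sizes are illustrative rather than a description of any system we have built.

```python
# Hypothetical sketch: greedily assign consecutive layers of a model to
# accelerators so that each pipeline stage holds roughly the same number of
# parameters. Assumes at least as many layers as devices; all numbers are toys.

def partition_layers(layer_params, num_devices):
    """Split consecutive layers into num_devices stages of similar parameter count."""
    total = sum(layer_params)
    target = total / num_devices                  # ideal parameters per stage
    stages, current, current_sum = [], [], 0
    for i, params in enumerate(layer_params):
        current.append(i)
        current_sum += params
        remaining_layers = len(layer_params) - i - 1
        remaining_devices = num_devices - len(stages) - 1
        # Close the stage once it reaches the target, but always keep at least
        # one layer available for every device that still needs a stage.
        if len(stages) < num_devices - 1 and (
            current_sum >= target or remaining_layers == remaining_devices
        ):
            stages.append(current)
            current, current_sum = [], 0
    stages.append(current)                        # remaining layers form the last stage
    return stages

if __name__ == "__main__":
    # Toy example: 8 layers with varying parameter counts, 4 accelerators.
    layer_params = [120, 80, 200, 60, 150, 90, 110, 70]
    print(partition_layers(layer_params, 4))      # e.g. [[0, 1, 2], [3, 4, 5], [6], [7]]
```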

Efficient heterogeneous pipeline for large deep learning models

Recommender model training encounters unique challenges: it is not compute-intensive but rather bandwidth- and memory-bound. In this direction, we leverage the popularity distribution of training data to devise solutions that improve the overall efficiency of training these models in a distributed GPU setting.

We developed one of the first runtime frameworks that efficiently utilizes the memory hierarchy for distributed recommender model training. The Frequently Accessed Embeddings (FAE) framework dynamically places frequently accessed data on every GPU while retaining infrequently accessed data on CPUs, thus ensuring high data locality on the more expensive GPU resources. By storing frequently observed training data in faster, compute-proximate GPU memory, the framework eliminates most of the CPU-GPU communication. It also amortizes the cost of computing on data residing in CPU main memory through efficient pipelining across all devices (both CPUs and GPUs) while maintaining training fidelity. Additionally, for large-scale training, we offer offline solutions that automatically perform workload distribution for general deep learning models.
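A minimal sketch of the underlying hot/cold split follows, assuming a profiled sample of embedding accesses and a hypothetical split_hot_cold helper; the capacity and trace values are illustrative, and this is not the FAE implementation itself.

```python
from collections import Counter

# Hypothetical sketch of an FAE-style placement decision: profile how often each
# embedding row is accessed in a sample of training data, keep the hottest rows
# in GPU memory, and leave the rest in CPU memory. All values are illustrative.

def split_hot_cold(sampled_indices, gpu_capacity_rows):
    """Return (hot, cold) sets of embedding-row ids based on profiled access counts."""
    counts = Counter(sampled_indices)
    ranked = [row for row, _ in counts.most_common()]   # rows by access frequency
    hot = set(ranked[:gpu_capacity_rows])               # placed in GPU memory
    cold = set(counts) - hot                            # stays in CPU memory
    return hot, cold

if __name__ == "__main__":
    # Toy access trace: a few rows dominate, as is typical for recommender inputs.
    trace = [0, 1, 0, 2, 0, 1, 3, 0, 1, 4, 0, 5, 1, 0]
    hot, cold = split_hot_cold(trace, gpu_capacity_rows=2)
    print("hot rows:", sorted(hot), "cold rows:", sorted(cold))
```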

Domain-specialized solutions for end-to-end data processing pipelines

In this direction of work, we aim to understand the bottlenecks in the end-to-end data processing pipeline for distributed databases and devise hardware/software acceleration solutions to mitigate them. Distributed databases offer scalable data storage, synchronous replication, strong consistency and ordering properties, low-level caching, and persistence. In such a system, we aim to accelerate data retrieval, query processing, data pre-processing, and data analytics by identifying the common operations across both compute and data storage/retrieval processes. For instance, we can leverage dynamic locality or semantic popularity to cache hot data, thus mitigating network and data-movement latency across the system stack.
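As a simplified illustration of popularity-driven caching, the sketch below keeps only the most frequently accessed keys in a fixed-size cache; the PopularityCache class and its eviction policy are hypothetical simplifications, not a description of a deployed system.

```python
# Illustrative sketch: a fixed-size cache that admits and retains keys based on
# a running access count (a simple proxy for "semantic popularity"), evicting
# the least popular cached key only when a hotter key arrives.

class PopularityCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self.counts = {}          # key -> access count
        self.store = {}           # key -> cached value

    def get(self, key, fetch_fn):
        """Return the value for key, fetching from remote storage on a miss."""
        self.counts[key] = self.counts.get(key, 0) + 1
        if key in self.store:
            return self.store[key]            # hit: served locally
        value = fetch_fn(key)                 # miss: go to remote storage
        self._maybe_admit(key, value)
        return value

    def _maybe_admit(self, key, value):
        if len(self.store) < self.capacity:
            self.store[key] = value
            return
        # Evict the least popular cached key only if the new key is hotter.
        coldest = min(self.store, key=lambda k: self.counts[k])
        if self.counts[key] > self.counts[coldest]:
            del self.store[coldest]
            self.store[key] = value

if __name__ == "__main__":
    cache = PopularityCache(capacity=2)
    backend = {k: f"row-{k}" for k in range(5)}   # stand-in for remote storage
    for k in [1, 1, 2, 3, 1, 2, 4, 1, 2]:
        cache.get(k, backend.__getitem__)
    print("cached keys:", sorted(cache.store))    # the hottest keys remain cached
```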

Sustainability-aware compute platforms

The overarching goal of this research thrust is to improve the energy efficiency of distributed systems and to develop domain-specialized architectures that treat sustainability and low-carbon execution as first-order design considerations. To do so, we aim to benchmark and create mathematical models of the power consumption and carbon impact of various types of applications in data centers. These models would take into account the time of day at which applications are deployed, the components of the stack that are invoked (memory, compute fabric, network), and the average time it takes to complete a task. We plan to perform a detailed analysis of the impact of utilizing different types of compute platforms, such as CPUs, GPUs, and accelerators (ASICs and FPGAs), for compute- and data-intensive applications. We will then use these mathematical models, in tandem with the accuracy requirements of the end application, to dynamically tune the training and inference of AI and ML algorithms.
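A deliberately simple sketch of the kind of model we have in mind appears below: the estimated carbon impact of a job is the sum of per-component energy weighted by the grid carbon intensity at the hour the job runs. The estimate_carbon_kg function is an assumption for illustration, and all power draws, durations, and intensity values are placeholders rather than measured numbers.

```python
# Toy carbon model: per-component energy (power x active time) summed and
# multiplied by the grid carbon intensity at the time the job is scheduled.

def estimate_carbon_kg(component_power_w, active_hours, carbon_intensity_kg_per_kwh):
    """Estimate job carbon footprint in kg of CO2-equivalent."""
    energy_kwh = sum(
        power_w * active_hours[name] / 1000.0        # W * h -> kWh
        for name, power_w in component_power_w.items()
    )
    return energy_kwh * carbon_intensity_kg_per_kwh

if __name__ == "__main__":
    # Placeholder per-component average power draws for one training job.
    power_w = {"accelerators": 2400.0, "cpu_and_memory": 450.0, "network": 120.0}
    hours = {"accelerators": 6.0, "cpu_and_memory": 6.0, "network": 6.0}
    # Placeholder grid intensities for a low-carbon vs. a high-carbon hour.
    for label, intensity in [("off-peak", 0.15), ("peak", 0.45)]:
        print(label, round(estimate_carbon_kg(power_w, hours, intensity), 2), "kg CO2e")
```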

Past Projects

A unified template-based framework for accelerating machine learning

The practicality and applicability of modern acceleration platforms hinge on providing designs that are programmable from a high-level interface. Solutions based on domain-specific architectures necessitate rethinking the entire compute stack, as the traditional stack has always been tailored and optimized for CPUs, which until recently were the sole processing platform, with hardware accelerators treated as an ad-hoc addition to it.

Thus, we devised TABLA, which leverages the insight that supervised machine learning can often be modeled as stochastic gradient-based optimization. This work marked the inception of the broader template-based algorithmic acceleration effort in the community. TABLA's target algorithms iterate over the training data, minimizing a loss function and updating the model parameters in a way that captures the patterns in the data.

For TABLA, stochastic gradient descent forms the abstraction between hardware and software, resolving two conflicting objectives: automation and high performance. To ensure automation, the framework exposes a domain-specific language in which the user specifies the algorithm, which TABLA's compiler and design builder then convert into the final accelerator. To obtain high performance, the hardware backend is implemented as a hand-optimized, template-based architecture that embodies the general structure of stochastic gradient descent. TABLA automatically customizes these templates for the {learning algorithm, FPGA} pair and generates synthesizable Verilog code. TABLA was presented at the 22nd IEEE International Symposium on High Performance Computer Architecture (HPCA 2016), where it received the Distinguished Paper Award.
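The Python sketch below conveys the abstraction TABLA builds on: once training is cast as stochastic gradient descent, specifying the per-example gradient of the loss is enough to define the whole training loop. The sgd and linear_regression_gradient functions and the toy data are illustrative assumptions; they are not TABLA's DSL, which targets hardware generation rather than software execution.

```python
# Minimal illustration: a generic SGD loop in which only the gradient function
# changes across learning algorithms; linear regression is one instance.

def sgd(parameters, data, gradient_fn, learning_rate=0.01, epochs=20):
    """Generic SGD loop parameterized by the per-example gradient of the loss."""
    for _ in range(epochs):
        for x, y in data:
            grads = gradient_fn(parameters, x, y)
            parameters = [w - learning_rate * g for w, g in zip(parameters, grads)]
    return parameters

def linear_regression_gradient(w, x, y):
    """Gradient of the squared loss 0.5 * (w.x - y)^2 for one example."""
    error = sum(wi * xi for wi, xi in zip(w, x)) - y
    return [error * xi for xi in x]

if __name__ == "__main__":
    # Toy data generated by y = 2*x0 + 3*x1.
    data = [([1.0, 2.0], 8.0), ([2.0, 1.0], 7.0), ([3.0, 3.0], 15.0), ([1.0, 0.0], 2.0)]
    print(sgd([0.0, 0.0], data, linear_regression_gradient))
```

In TABLA, the analogous specification is written in its domain-specific language and compiled into a customized hardware template rather than executed in software.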

We further went on to develop CoSMIC and DNNWEAVER, which accelerate classical machine learning at scale and the inference phase of deep neural networks, respectively. Moreover, we extended TABLA's language and program representation to build PolyMath, which supports cross-domain acceleration spanning DSP, robotics, ML, and graph analytics. [HPCA 2016, MICRO 2016, MICRO 2017, ISCA 2018, HPCA 2021, IEEE MICRO 2022]

Approximate computing

In this direction, we devised Axilog, a set of language annotations that provide the necessary syntax and semantics for approximate hardware design and reuse in Verilog. This was one of the pioneering works to offer systematic language annotations that allow hardware engineers to relax the accuracy requirements in selected parts of their hardware design while keeping the critical parts strictly precise. Axilog is coupled with a Relaxability Inference Analysis that automatically infers the relaxable gates and connections from the designer's annotations.
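The sketch below conveys the flavor of such an analysis on a toy netlist, under the simplifying assumption that a gate is relaxable only if every primary output it transitively drives has been annotated as relaxable; the infer_relaxable function and the netlist encoding are illustrative, not Axilog's actual implementation, which operates on Verilog designs.

```python
# Conceptual sketch of relaxability inference on a toy netlist: any gate that
# can influence a precise (unannotated) output remains strictly precise.

def infer_relaxable(netlist, relaxable_outputs):
    """netlist maps each gate name to the gates or primary outputs it drives."""

    def reachable_outputs(gate, seen):
        outs = set()
        for succ in netlist.get(gate, []):
            if succ in seen:
                continue
            seen.add(succ)
            if succ in netlist:              # another gate: keep walking forward
                outs |= reachable_outputs(succ, seen)
            else:                            # a primary output: record it
                outs.add(succ)
        return outs

    relaxable = set()
    for gate in netlist:
        outs = reachable_outputs(gate, set())
        if outs and outs <= relaxable_outputs:
            relaxable.add(gate)
    return relaxable

if __name__ == "__main__":
    # Toy netlist: g1 and g2 feed only the relaxable output; g3 also feeds a precise one.
    netlist = {"g1": ["g2"], "g2": ["out_relax"], "g3": ["out_relax", "out_precise"]}
    print(sorted(infer_relaxable(netlist, relaxable_outputs={"out_relax"})))   # ['g1', 'g2']
```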

In addition to developing techniques that enable approximation, our work MITHRA tackles the harder problem of controlling quality tradeoffs with approximate accelerators. Because approximation techniques often induce randomness in the output, we model the tradeoff between the benefits of approximation and the application's quality loss as a statistical optimization problem. This opens the door to controlling error in approximate computing by providing statistical guarantees on unseen data. MITHRA is a tightly co-designed hardware-software system, with components in both the compiler and the microarchitecture, that offers statistical guarantees on the final output quality. [DATE 2015, IEEE MICRO 2016, ISCA 2016, ASPLOS 2017]
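As an illustration of this statistical framing (not MITHRA's exact formulation), the sketch below bounds the rate at which an approximate accelerator's quality loss exceeds a user-specified threshold on unseen inputs, using a one-sided Hoeffding confidence bound over a profiled sample; the violation_rate_upper_bound function, sample losses, and threshold are placeholders.

```python
import math

# Sketch: estimate the fraction of profiled inputs whose quality loss exceeds
# the threshold, then add a one-sided Hoeffding margin to bound the true
# violation rate on unseen inputs with the requested confidence.

def violation_rate_upper_bound(losses, threshold, confidence=0.95):
    """Upper confidence bound on P(loss > threshold) from an i.i.d. sample."""
    n = len(losses)
    observed_rate = sum(1 for loss in losses if loss > threshold) / n
    # With probability >= confidence, the true rate is below observed_rate + margin.
    margin = math.sqrt(math.log(1.0 / (1.0 - confidence)) / (2.0 * n))
    return min(1.0, observed_rate + margin)

if __name__ == "__main__":
    # Placeholder quality-loss measurements for an approximate accelerator.
    losses = [0.01, 0.03, 0.00, 0.08, 0.02, 0.05, 0.01, 0.04] * 50   # 400 samples
    bound = violation_rate_upper_bound(losses, threshold=0.07)
    print(f"With 95% confidence, at most {bound:.1%} of unseen inputs exceed the threshold.")
```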