ECE/CS 8803 - Hardware Software Co-Design for Machine Learning
Spring 2024

Course Instructors: Tushar Krishna and Divya Mahajan

Course Objectives

The rapid advancement of AI can be attributed to synergistic progress in large datasets, machine learning (ML) algorithms, and the hardware and systems used to deploy these models. In particular, deep neural networks (DNNs) have shown highly promising results in tasks across vision, speech, and natural language processing. Unfortunately, DNNs come with significant computational and memory demands -- which can reach Zetta (10^21) FLOPs and Tera (10^12) Bytes, respectively, for Large Language Models such as those driving ChatGPT. Efficient processing of these DNNs necessitates HW-SW co-design. Such co-design efforts have led to the emergence of (i) specialized hardware accelerators designed for DNNs (e.g., Google’s TPU, Meta’s MTIA, Amazon’s Inferentia & Trainium, and so on) and (ii) specialized distributed systems comprising hundreds to thousands of these accelerators connected via specialized fabrics. Furthermore, GPU and FPGA architectures and libraries have also evolved to accelerate DNNs.
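To make these scales concrete, the short sketch below applies a common back-of-envelope rule of thumb (roughly 6 FLOPs per parameter per training token for a dense transformer, forward plus backward pass) to an assumed GPT-3-class model; the parameter count, token count, and numeric precision are illustrative assumptions, not figures taken from the course material.

# Back-of-envelope estimate of training compute and weight memory for a
# large dense transformer. Uses the common ~6 FLOPs per parameter per
# training token rule of thumb (forward + backward pass). All figures
# below are illustrative assumptions (a GPT-3-class model).

params = 175e9            # assumed parameter count
tokens = 300e9            # assumed number of training tokens
bytes_per_param = 2       # fp16/bf16 weight storage

train_flops = 6 * params * tokens        # ~6 FLOPs / parameter / token
weight_bytes = params * bytes_per_param  # memory for weights alone

print(f"Training compute: {train_flops:.2e} FLOPs "
      f"(~{train_flops / 1e21:.0f} zettaFLOPs)")
print(f"Weight memory:    {weight_bytes:.2e} bytes "
      f"(~{weight_bytes / 1e12:.2f} TB)")

Running this yields on the order of hundreds of zettaFLOPs of training compute and a fraction of a terabyte just for the weights, before counting activations, optimizer state, or serving traffic, which is the gap that motivates the HW-SW co-design techniques covered in this course.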

This course aims to present recent advancements that strive to achieve efficient processing of DNNs. Specifically, it will offer an overview of DNNs, delve into techniques to distribute the workload, examine various architectures and systems that support DNNs, and highlight key trends in recent techniques for efficient processing. These techniques aim to reduce the computational and communication costs associated with DNNs through hardware and system optimizations. The course will also provide a summary of various development resources to help researchers and practitioners initiate DNN deployments swiftly. Additionally, it will emphasize crucial benchmarking metrics and design considerations for evaluating the rapidly expanding array of DNN hardware designs and system optimizations proposed in both academia and industry.

Learning Outcomes

As part of this course, students will: understand the key design considerations for efficient DNN processing; understand the tradeoffs between various hardware architectures and platforms; understand the need for, and approaches to, distributed ML; evaluate the utility of various DNN strategies for end-to-end efficient execution; and understand future trends and opportunities spanning ML algorithms and system innovations down to emerging technologies (such as ReRAM).

Course Text

The material for this course will be derived from papers from recent computer architecture conferences (ISCA, MICRO, HPCA, ASPLOS) on hardware acceleration, systems conferences (SOSP, MLSys) for distributing ML, ML conferences (ICML, NeurIPS, ICLR) focusing on future trends, and blog articles from industry (Google, Microsoft, Meta, NVIDIA, Baidu, Intel, Arm).

Course Schedule

Week 1

Introduction and Review of Machine Learning Concepts

Week 2

Hardware-specific optimizations for deep learning 

Week 3

Introduction to Deep Learning Accelerators

Week 4

Designing Deep Learning Accelerators

Week 5

Deep Learning Execution on Accelerators

Week 6

Emerging Trends in Deep Learning - Sparsity

Week 7

Week 8

Distributed Execution for Large Models

Week 9

Modes of Distributed Training and Inference

Week 10

Special Topics on Distributed and Large-Scale Execution

Week 11

Building Systems for Large-Scale Training

Week 12

Project Proposals and Presentations

Weeks 13-14

Emerging Topics