ECE/CS 8803 - Hardware Software Co-Design for Machine Learning
Spring 2024
Course Instructors: Tushar Krishna and Divya Mahajan
Course Objectives
The advancement in AI can be attributed to the synergistic advancements in big data sets, machine learning (ML) algorithms, and the hardware and systems used to deploy these models. Specifically, deep neural networks (DNNs) have showcased highly promising results in tasks across vision, speech, and natural language processing. Unfortunately, DNNs come with significant computational and memory demands -- which can reach zetta (10^21) FLOPs and tera (10^12) bytes, respectively, for Large Language Models such as those driving ChatGPT. Efficient processing of these DNNs necessitates HW-SW co-design. Such co-design efforts have led to the emergence of (i) specialized hardware accelerators designed for DNNs (e.g., Google’s TPU, Meta’s MTIA, Amazon’s Inferentia & Trainium, and so on) and (ii) specialized distributed systems comprising hundreds to thousands of these accelerators connected via specialized fabrics. Furthermore, GPU and FPGA architectures and libraries have also evolved to accelerate DNNs.
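For a rough sense of the scale involved, the short Python sketch below estimates training FLOPs and weight memory for a hypothetical GPT-3-scale model. The parameter count, token count, and the common "6 x parameters x tokens" approximation for training FLOPs are illustrative assumptions, not figures specific to this course.

# Back-of-the-envelope estimate of LLM training cost (illustrative numbers only).
# Assumes the common approximation: training FLOPs ~= 6 * parameters * training tokens.

params = 175e9          # hypothetical parameter count (GPT-3 scale)
tokens = 300e9          # hypothetical number of training tokens
bytes_per_param = 2     # fp16/bf16 storage for the weights alone

train_flops = 6 * params * tokens
weight_bytes = params * bytes_per_param

print(f"Training FLOPs: {train_flops:.2e}")              # ~3e23, i.e., hundreds of zettaFLOPs
print(f"Weight memory:  {weight_bytes / 1e12:.2f} TB")   # ~0.35 TB for the weights alone

Even under these simplifying assumptions, the numbers land in the zetta-FLOP and terabyte regime, which motivates the co-design techniques covered in this course.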
This course aims to present recent advancements that strive to achieve efficient processing of DNNs. Specifically, it will offer an overview of DNNs, delve into techniques to distribute the workload, dive into various architectures and systems that support DNNs, and highlight key trends in recent techniques for efficient processing. These techniques aim to reduce the computational and communication costs associated with DNNs through hardware and system optimizations. The course will also provide a summary of various development resources to help researchers and practitioners initiate DNN deployments swiftly. Additionally, it will emphasize crucial benchmarking metrics and design considerations for evaluating the rapidly expanding array of DNN hardware designs and system optimizations proposed in both academia and industry.
Learning Outcomes
As part of this course, students will: understand the key design considerations for efficient DNN processing; understand tradeoffs between various hardware architectures and platforms; understand the need for, and means of, distributing ML; evaluate the utility of various DNN strategies for end-to-end efficient execution; and understand future trends and opportunities, from ML algorithms and system innovations down to emerging technologies (such as ReRAM).
Course Text
The material for this course will be derived from papers from recent computer architecture conferences (ISCA, MICRO, HPCA, ASPLOS) on hardware acceleration, systems conferences (SOSP, MLSys) for distributing ML, ML conferences (ICML, NeurIPS, ICLR) focusing on future trends, and blog articles from industry (Google, Microsoft, Meta, NVIDIA, Baidu, Intel, Arm).
Course Schedule
Week 1
Introduction and Review of Machine Learning Concepts
Key principles of Machine Learning - MLP, Neural Networks
Overview of Deep Learning – CNNs
Week 2
Hardware-specific optimizations for deep learning
Introduction to DNN operators
Compute and memory behavior of DNNs
Week 3
Introduction to Deep Learning Accelerators
Data Reuse and Dataflows
Interconnect Design
Week 4
Designing Deep Learning Accelerators
Scratchpad and on-chip memory design
PE design - Numerics and Quantization
Week 5
Deep Learning execution on Accelerators
Modelling and Simulation
Compilation for accelerators
Week 6
Emerging Trends in Deep Learning - Sparsity
Week 7
ML accelerators in Industry
Gradient-based optimizations
Week 8
Distributed execution for Large Models:
Basics of Distributed Machine Learning
Communication Collectives
Week 9
Modes of Distributed Training and Inference
Week 10
Special Topics on Distributed and Large Scale Execution:
Device placement and automatic distribution
LLM Inference, training, and fine-tuning
Week 11
Building Systems for Large Scale Training
Week 12
Project Proposals and Presentations
Weeks 13-14
Emerging Topics
Recommender Models
Spiking Neural Networks
Emerging Technologies - Analog/CIM
Emerging Technologies - Wafer Scale or Photonics