Code-Along and Q&A: Run Llama With PyTorch on Arm-Based Infrastructure

Join this 1-hour live code-along and Q&A to set up and run Meta’s Llama language model using PyTorch on Arm-based cloud instances. This session walks through configuring the environment, downloading the model, optimizing performance, running inference, and interacting through a Streamlit frontend — all tailored for Arm CPUs.

Please note we’ll provide access to sandbox environments for attendees.

Date: April 30, 2025
Time: 9 a.m. PT | 5 p.m. BST | 6 p.m. CEST
Length: 45 minutes (code-along) + 15 minutes (Q&A)

What you’ll build:

  • You’ll build a browser-based large language model (LLM) application that runs Llama 3.1 quantized to INT4, using a Streamlit frontend and a torchchat backend, entirely on an Arm-based AWS Graviton CPU.

What you’ll learn:

  • Download the Meta Llama 3.1 model from Meta's Hugging Face repository (see the download sketch after this list).
  • Quantize the model to 4-bit precision using the KleidiAI-optimized INT4 kernels for PyTorch (see the quantization sketch below).
  • Run LLM inference using PyTorch on an Arm-based CPU.
  • Expose LLM inference as a browser application, with Streamlit as the frontend and the torchchat framework in PyTorch as the LLM backend server (see the Streamlit sketch below).
  • Measure performance metrics of LLM inference running on an Arm-based CPU (see the timing sketch below).
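
To give a feel for the first step, the model files can be fetched with the huggingface_hub Python client. This is a minimal sketch, assuming you've accepted Meta's Llama 3.1 license on Hugging Face and set a valid HF_TOKEN; the repo ID shown (the 8B instruct variant) is an assumption about which size the session uses:

```python
# Minimal download sketch using the huggingface_hub client.
# Assumes you have accepted Meta's Llama 3.1 license on Hugging Face
# and exported HF_TOKEN with a token that has access to the gated repo.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(
    repo_id="meta-llama/Llama-3.1-8B-Instruct",  # assumed model size/variant
    allow_patterns=["*.json", "*.safetensors", "tokenizer*"],  # skip optional extras
)
print(f"Model files downloaded to: {local_dir}")
```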
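
The code-along drives quantization through torchchat, but the underlying 4-bit scheme can be sketched with torchao, the PyTorch quantization library torchchat builds on. A rough sketch, assuming a torchao build that exports int8_dynamic_activation_int4_weight (dynamic 8-bit activations over grouped 4-bit weights, the same scheme torchchat applies on Arm CPUs); the model variant and group size are assumptions:

```python
# Rough quantization sketch with torchao. int8_dynamic_activation_int4_weight
# replaces linear layers with 8-bit dynamic activation / 4-bit grouped-weight
# quantized versions -- the a8w4dq scheme used for Arm CPU inference.
import torch
from transformers import AutoModelForCausalLM
from torchao.quantization import quantize_, int8_dynamic_activation_int4_weight

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B-Instruct",  # assumed variant, as above
    torch_dtype=torch.bfloat16,          # keep memory manageable on CPU
)
# group_size=32 is an assumed value; smaller groups trade speed for accuracy.
quantize_(model, int8_dynamic_activation_int4_weight(group_size=32))
```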
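
On the serving side, torchchat can run as a local server with an OpenAI-style chat completions API, and the Streamlit frontend only needs a thin wrapper around it. A minimal sketch; the port, endpoint path, and model name are assumptions about the local torchchat server configuration:

```python
# streamlit_app.py -- minimal chat frontend; run with: streamlit run streamlit_app.py
# Assumes a torchchat server is listening locally with an OpenAI-style
# /v1/chat/completions endpoint (address and model name are assumptions).
import requests
import streamlit as st

st.title("Llama 3.1 on Arm")

if "messages" not in st.session_state:
    st.session_state.messages = []

# Replay the conversation so far.
for msg in st.session_state.messages:
    with st.chat_message(msg["role"]):
        st.write(msg["content"])

if prompt := st.chat_input("Ask the model something"):
    st.session_state.messages.append({"role": "user", "content": prompt})
    with st.chat_message("user"):
        st.write(prompt)

    resp = requests.post(
        "http://localhost:5000/v1/chat/completions",  # assumed torchchat address
        json={"model": "llama3.1", "messages": st.session_state.messages},
        timeout=300,
    )
    answer = resp.json()["choices"][0]["message"]["content"]
    st.session_state.messages.append({"role": "assistant", "content": answer})
    with st.chat_message("assistant"):
        st.write(answer)
```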
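
Finally, the simplest throughput metric is end-to-end tokens per second, computed from the usage block that OpenAI-style responses typically include. A small sketch reusing the assumed endpoint above; the usage field names follow the OpenAI response schema and are an assumption about torchchat's server output:

```python
# Crude latency/throughput measurement around a single chat completion.
# Reuses the assumed local torchchat endpoint from the Streamlit sketch.
import time
import requests

start = time.perf_counter()
resp = requests.post(
    "http://localhost:5000/v1/chat/completions",  # assumed torchchat address
    json={
        "model": "llama3.1",
        "messages": [{"role": "user", "content": "Explain Arm SVE in one paragraph."}],
    },
    timeout=300,
)
elapsed = time.perf_counter() - start

# The 'usage' block follows the OpenAI schema -- an assumption here.
completion_tokens = resp.json()["usage"]["completion_tokens"]
print(f"{completion_tokens} tokens in {elapsed:.1f}s "
      f"-> {completion_tokens / elapsed:.1f} tokens/s")
```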

Who should join:

  • Developers, ML engineers, and researchers working with open-source LLMs
  • Backend engineers building GenAI features for applications
  • Anyone looking to optimize LLM inference for cost and performance on Arm

Connect With the Experts

One week after the code-along, join an open Q&A with Arm engineers and Arm Ambassadors. Bring your implementation questions, share what you have built, and explore advanced use cases, architecture tuning, and tooling options.

Date: May 8, 2025
Time: 9 a.m. PT | 5 p.m. BST | 6 p.m. CEST
Length: 50 minutes