Join this 1-hour live code-along and Q&A to set up and run Meta’s Llama language model using PyTorch on Arm-based cloud instances. This session walks through configuring the environment, downloading the model, optimizing performance, running inference, and interacting through a Streamlit frontend — all tailored for Arm CPUs.
Please note we’ll provide access to sandbox environments for attendees.
Date: April 30, 2025
Time: 9 a.m. PT | 5 p.m. BST | 6 p.m. CEST
Length: 45 minutes (code-along) + 15 minutes (Q&A)
What you’ll build:
- You’ll create a browser-based large language model (LLM) application that serves Llama 3.1 quantized to INT4, with a Streamlit frontend and a torchchat backend, running entirely on an Arm-based AWS Graviton CPU.
What you’ll learn:
- Download the Meta Llama 3.1 model from Meta’s Hugging Face repository.
- Quantize the model to 4-bit using KleidiAI’s optimized INT4 kernels for PyTorch.
- Run LLM inference using PyTorch on an Arm-based CPU.
- Expose LLM inference as a browser application, with Streamlit as the frontend and the PyTorch torchchat framework as the LLM backend server.
- Measure performance metrics of the LLM inference running on an Arm-based CPU.

The short code sketches below illustrate each of these steps.
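First, a minimal sketch of the download step using huggingface_hub directly (torchchat also provides its own download helper). The repo id and local path below are assumptions, and Llama 3.1 is a gated model: you must accept Meta’s license on Hugging Face and supply an access token.

```python
# Minimal sketch: fetch Llama 3.1 weights from the Meta Hugging Face repo.
# Assumptions: you have accepted Meta's license for this gated repo and
# created an access token; the repo id below is one plausible variant.
from huggingface_hub import snapshot_download

model_dir = snapshot_download(
    repo_id="meta-llama/Llama-3.1-8B-Instruct",  # assumed model variant
    local_dir="models/llama-3.1-8b-instruct",    # assumed local path
    token="hf_...",                              # your HF access token
)
print(f"Checkpoint downloaded to {model_dir}")
```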
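For the quantize-and-run steps, here is a generic CPU sketch using Hugging Face Transformers. One hedge up front: the session quantizes to INT4 through torchchat’s KleidiAI-backed kernels, while the INT8 dynamic quantization below is only a stand-in to show post-training quantization in stock PyTorch.

```python
# Minimal sketch: quantize and run a short CPU inference.
# The code-along uses torchchat's INT4 KleidiAI path instead; the
# torch.ao INT8 dynamic quantization here is a generic stand-in.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_dir = "models/llama-3.1-8b-instruct"  # path from the download step

tok = AutoTokenizer.from_pretrained(model_dir)
model = AutoModelForCausalLM.from_pretrained(model_dir)  # fp32; an 8B model needs ~32 GB RAM
model.eval()

# Post-training dynamic quantization of the Linear layers to INT8.
model = torch.ao.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)

inputs = tok("What is an AWS Graviton processor?", return_tensors="pt")
with torch.inference_mode():
    out = model.generate(**inputs, max_new_tokens=64)
print(tok.decode(out[0], skip_special_tokens=True))
```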
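Next, a sketch of the Streamlit frontend. It assumes the torchchat backend is serving an OpenAI-compatible chat completions endpoint on localhost; the URL, port, and model name are assumptions to adjust for your torchchat version.

```python
# streamlit_app.py -- minimal chat UI sketch.
# Assumes an OpenAI-compatible chat endpoint (e.g., torchchat's server
# mode) is listening at BACKEND_URL; adjust URL/port/model name to yours.
import requests
import streamlit as st

BACKEND_URL = "http://127.0.0.1:5000/v1/chat/completions"  # assumed

st.title("Llama 3.1 on Arm")

if "messages" not in st.session_state:
    st.session_state.messages = []

# Replay the conversation so far.
for msg in st.session_state.messages:
    with st.chat_message(msg["role"]):
        st.write(msg["content"])

if prompt := st.chat_input("Ask the model something"):
    st.session_state.messages.append({"role": "user", "content": prompt})
    with st.chat_message("user"):
        st.write(prompt)
    resp = requests.post(
        BACKEND_URL,
        json={"model": "llama3.1", "messages": st.session_state.messages},
        timeout=300,
    )
    reply = resp.json()["choices"][0]["message"]["content"]
    st.session_state.messages.append({"role": "assistant", "content": reply})
    with st.chat_message("assistant"):
        st.write(reply)
```

Run it with `streamlit run streamlit_app.py` and open the printed local URL in a browser.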
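Finally, a rough sketch of the measurement step: timing a generate() call to estimate decode throughput. It reuses the `model` and `tok` objects from the inference sketch above.

```python
# Minimal sketch: estimate generation throughput (tokens/second).
# Assumes `model` and `tok` from the inference sketch above.
import time
import torch

prompt = "Summarize the benefits of running LLM inference on Arm CPUs."
inputs = tok(prompt, return_tensors="pt")

start = time.perf_counter()
with torch.inference_mode():
    out = model.generate(**inputs, max_new_tokens=128)
elapsed = time.perf_counter() - start

new_tokens = out.shape[1] - inputs["input_ids"].shape[1]
print(f"Generated {new_tokens} tokens in {elapsed:.2f} s "
      f"({new_tokens / elapsed:.1f} tok/s)")
```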
Who should join:
- Developers, ML engineers, and researchers working with open-source LLMs
- Backend engineers building GenAI features for applications
- Anyone looking to optimize LLM inference for cost and performance on Arm
Connect With the Experts
One week after the code-along, join an open Q&A with Arm engineers and Arm Ambassadors. Bring your implementation questions, share what you have built, and explore advanced use cases, architecture tuning, and tooling options.
Date: May 8, 2025
Time: 9 a.m. PT | 5 p.m. BST | 6 p.m. CEST
Length: 50 minutes