27. Prompt Engineering#

27.1. Pre-Reading#

27.1.1. Objectives#

  • Explain hardware limits and challenges with LLMs.

  • Employ simple prompt engineering tactics to produce better LLM results.

27.2. Hardware Expenses#

LLMs require large amounts of memory, especially GPU memory (VRAM), just to load their weights. The table below shows approximate requirements for 8-bit quantized LLaMA models.

Table: 8-bit quantized LLaMA requirements

| Model | VRAM Used | Minimum Total VRAM | Card Examples | RAM/Swap to Load* |
|-----------|---------|-------|----------------------|--------|
| LLaMA-7B  | 9.2 GB  | 10 GB | 3060 12GB, 3080 10GB | 24 GB  |
| LLaMA-13B | 16.3 GB | 20 GB | 3090, 3090 Ti, 4090  | 32 GB  |
| LLaMA-30B | 36 GB   | 40 GB | A6000 48GB, A100 40GB | 64 GB |
| LLaMA-65B | 74 GB   | 80 GB | A100 80GB            | 128 GB |
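A useful rule of thumb behind these figures: weight memory is roughly the parameter count times the bytes per parameter. The sketch below (a simplified estimate; the table's "VRAM Used" numbers are higher because they also include activations, KV cache, and framework overhead) shows the arithmetic:

```python
# Rough rule of thumb: weight memory ≈ parameters × bytes per parameter.
# This estimates only the weights; real VRAM usage is somewhat higher.

def weight_memory_gb(params_billions: float, bits_per_param: int) -> float:
    """Estimate memory needed just to hold the weights, in gigabytes."""
    bytes_total = params_billions * 1e9 * (bits_per_param / 8)
    return bytes_total / 1e9

for params in (7, 13, 30, 65):
    fp16 = weight_memory_gb(params, 16)
    int8 = weight_memory_gb(params, 8)
    print(f"LLaMA-{params}B: {fp16:.0f} GB at FP16, {int8:.0f} GB at INT8")
```

Halving the bits per parameter halves the weight memory, which is why quantization (next section) matters so much for fitting models onto consumer GPUs.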

Buy NVIDIA GPUs

27.2.1. Advanced Quantization#

Hugging Face's bitsandbytes library can quantize these models to 8-bit or 4-bit precision, shrinking their memory footprint substantially with only a small loss in quality.

Floating point formats
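To see the core idea, here is a minimal sketch of absmax 8-bit quantization, the basic mechanism underlying libraries like bitsandbytes (which add refinements such as outlier handling and block-wise scaling; this toy version is not their actual implementation):

```python
# Minimal sketch of absmax 8-bit quantization: scale floats so the
# largest magnitude maps to 127, then round each value to an integer.

def quantize_int8(values):
    """Map floats to int8 range [-127, 127] using a per-tensor scale."""
    scale = max(abs(v) for v in values) / 127
    return [round(v / scale) for v in values], scale

def dequantize(q_values, scale):
    """Recover approximate floats from the int8 values."""
    return [q * scale for q in q_values]

weights = [0.5, -1.2, 0.03, 2.54, -0.9]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
# Each weight now fits in 1 byte instead of 4 (FP32), at a small accuracy cost.
```

The reconstruction error is bounded by half the scale per weight, which is why quantization trades a little accuracy for a 4x (FP32 to INT8) memory reduction.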

27.2.2. Cost Containment for Generative AI#

From DeepLearning.AI's *The Batch* newsletter:

Microsoft is looking to control the expense of its reliance on OpenAI’s models.

What’s new: Microsoft seeks to build leaner language models that perform nearly as well as ChatGPT but cost less to run, The Information reported.

How it works: Microsoft offers a line of AI-powered tools that complement the company’s flagship products including Windows, Microsoft 365, and GitHub. Known as Copilot, the line is based on OpenAI models. Serving those models to 1 billion-plus users could amount to an enormous expense, and it occupies processing power that would be useful elsewhere. To manage the cost, Microsoft’s developers are using knowledge distillation, in which a smaller model is trained to mimic the output of a larger one, as well as other techniques.
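The distillation technique mentioned above can be sketched in a few lines: the student is trained to match the teacher's output distribution, softened by a temperature so that relative similarities between classes carry signal. This is a hedged illustration of the general method, not Microsoft's implementation; the logits here are made up.

```python
# Sketch of a knowledge-distillation loss: KL divergence between the
# teacher's and student's temperature-softened output distributions.
import math

def softmax(logits, temperature=1.0):
    """Convert logits to probabilities; higher temperature = flatter."""
    exps = [math.exp(l / temperature) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) over softened distributions; 0 = perfect match."""
    p = softmax(teacher_logits, temperature)  # teacher targets
    q = softmax(student_logits, temperature)  # student predictions
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

teacher = [3.0, 1.0, 0.2]   # hypothetical teacher logits
student = [2.5, 1.2, 0.3]   # hypothetical student logits
loss = distillation_loss(teacher, student)  # smaller = closer match
```

Minimizing this loss over many inputs pushes the small student model to reproduce the large teacher's behavior at a fraction of the serving cost.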

27.3. Prompt Engineering#

Pulled from NVIDIA's *An Introduction to LLM: Prompt Engineering and P-Tuning*.

27.3.1. Zero-shot vs. few-shot#

In zero-shot prompting, the model receives only a task instruction and must respond with no examples. In few-shot prompting, the prompt also includes a handful of worked input-output examples that demonstrate the desired format and behavior, which often improves accuracy on the task.

Few-shot
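As a concrete (hypothetical) illustration, here is how the two prompt styles differ for a sentiment-classification task; the reviews and labels are invented for this sketch:

```python
# Zero-shot: instruction only, no examples.
zero_shot = (
    "Classify the sentiment of this review as Positive or Negative.\n"
    "Review: The battery died after two days.\n"
    "Sentiment:"
)

# Few-shot: the same task, but with worked examples prepended.
examples = [
    ("This phone is amazing!", "Positive"),
    ("Terrible customer service.", "Negative"),
]

def few_shot_prompt(examples, query):
    """Prepend worked examples so the model can infer the task format."""
    demos = "\n".join(f"Review: {r}\nSentiment: {s}" for r, s in examples)
    return (
        "Classify the sentiment of each review as Positive or Negative.\n"
        f"{demos}\n"
        f"Review: {query}\n"
        "Sentiment:"
    )

prompt = few_shot_prompt(examples, "The battery died after two days.")
```

Both prompts end at `Sentiment:` so the model's completion is the label itself; the few-shot version simply gives the model a pattern to imitate.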

Now time for a live demo!