A practical guide to deploying Large Language…

Aug 31, 2023

I try all the things to get Vicuna-13B-v1.5 deployed on an NVIDIA T4 so you don't have to.

4 Comments

Sep 4, 2023

Yeah, TGI took a bit of figuring out and you kind of have to get deep into the code. This vid I made may help a little: https://youtu.be/Ror2xOOA-VE?si=7u09EwZ0xYbShRQb

Curious why Vicuna and not Llama 2?

Joel Kang

Sep 6, 2023

Ooh thanks Ronan, I'm going to try quantising with bitsandbytes. Were you able to figure out what was going on with GPTQ? There seem to be a few issues on GitHub around GPTQ, but nothing concrete that I can either try or follow the progress of.

Vicuna 1.5 (which is fine-tuned from Llama 2) was just much better at following our prompt AND outputting well-formed JSON.

Llama2 was kinda wild, sometimes repeating parts of the prompt, and other times giving answers that didn't make sense. Llama2-chat was much better in terms of quality, but it was *very* chatty, always starting with "Sure I can help!" and ending with "I hope that helps". I didn't want to have to regex out the JSON portions since that seemed really brittle.
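
For anyone stuck with a chatty model anyway, here is a rough sketch of one alternative to regex: let Python's standard `json` decoder find the first parseable object inside the reply. The helper name `extract_first_json` and the sample reply are made up for illustration, and this is not what we ended up deploying. It only assumes the model emits a single JSON object somewhere in its output.

```python
import json


def extract_first_json(text: str):
    """Return the first valid JSON object embedded in text, or None.

    Hypothetical helper: scans for '{' and asks the stdlib decoder to
    parse from that position, ignoring any chatter before or after.
    """
    decoder = json.JSONDecoder()
    start = text.find("{")
    while start != -1:
        try:
            # raw_decode parses one JSON value starting at index `start`
            # and tolerates trailing text after the value.
            obj, _end = decoder.raw_decode(text, start)
            return obj
        except json.JSONDecodeError:
            # Not valid JSON from this brace; try the next one.
            start = text.find("{", start + 1)
    return None


# Made-up reply in the style described above.
reply = 'Sure I can help! {"name": "T4", "memory_gb": 16} I hope that helps.'
result = extract_first_json(reply)
print(result)
```

It is still a workaround: a model that reliably returns bare JSON avoids the problem entirely, which is why Vicuna 1.5 won out for us.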

Ronan McGovern

Sep 7, 2023

GPTQ is always in flux, but it's faster. It's harder to use because you can't easily combine LoRA adapters for training.

Ah, that makes sense on Vicuna 1.5 then! Thanks

Ronan McGovern

Sep 11, 2023

I think GPTQ might work, but you have to pass a flag to TGI to disable exllama because T4s are old and don't support it. GPTQ is finicky, though (but the speed is great).

Lightbulb Moments by Dala
