Yeah, TGI took a bit of figuring out, and you kind of have to get deep into the code. This vid I made may help a little: https://youtu.be/Ror2xOOA-VE?si=7u09EwZ0xYbShRQb

Curious why Vicuna and not Llama 2?

Ooh thanks Ronan, I'm going to try quantising with bitsandbytes. Were you able to figure out what was going on with GPTQ? There seem to be a few issues on GitHub around GPTQ, but nothing concrete that I can either try or follow the progress of.

Vicuna 1.5 (which is fine-tuned from Llama 2) was just much better at following our prompt AND outputting well-formed JSON.

Llama2 was kinda wild, sometimes repeating parts of the prompt, and other times giving answers that didn't make sense. Llama2-chat was much better in terms of quality, but it was *very* chatty, always starting with "Sure I can help!" and ending with "I hope that helps". I didn't want to have to regex out the JSON portions since that seemed really brittle.
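
For anyone who does end up stuck with a chatty model: a slightly less brittle alternative to a regex is to let Python's own JSON parser find the object. This is only a rough sketch, not what we actually ran; `extract_first_json` and the sample reply are made up for illustration.

```python
import json


def extract_first_json(text):
    """Return the first valid JSON object embedded in text, or None.

    Hypothetical helper: tries to parse starting at each '{' in turn,
    so surrounding chatter before and after the object is ignored.
    """
    decoder = json.JSONDecoder()
    pos = text.find("{")
    while pos != -1:
        try:
            # raw_decode parses one JSON value starting at pos and
            # tolerates trailing text after it.
            obj, _end = decoder.raw_decode(text, pos)
            return obj
        except json.JSONDecodeError:
            # Not valid JSON from this brace; try the next one.
            pos = text.find("{", pos + 1)
    return None


# Made-up example of a chatty reply wrapping the JSON we care about.
reply = 'Sure I can help! {"label": "spam", "score": 0.9} I hope that helps'
result = extract_first_json(reply)
```

It still returns `None` when the model emits malformed JSON, so it only works around the chattiness, not the quality problem.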

GPTQ is always in flux, but it's faster. It's harder to use, though, because you can't easily combine LoRA adapters for training.

Ah, makes sense on Vicuna 1.5 then! Thanks

I think GPTQ might work, but you have to pass a flag to TGI to disable exllama because T4s are old and don't support it. GPTQ is finicky though (but great speed).
