ChipNeMo: Domain-Adapted LLMs for Chip Design: Related Works

cover
6 Jun 2024

Authors:

(1) Mingjie Liu, NVIDIA {Equal contribution};

(2) Teodor-Dumitru Ene, NVIDIA {Equal contribution};

(3) Robert Kirby, NVIDIA {Equal contribution};

(4) Chris Cheng, NVIDIA {Equal contribution};

(5) Nathaniel Pinckney, NVIDIA {Equal contribution};

(6) Rongjian Liang, NVIDIA {Equal contribution};

(7) Jonah Alben, NVIDIA;

(8) Himyanshu Anand, NVIDIA;

(9) Sanmitra Banerjee, NVIDIA;

(10) Ismet Bayraktaroglu, NVIDIA;

(11) Bonita Bhaskaran, NVIDIA;

(12) Bryan Catanzaro, NVIDIA;

(13) Arjun Chaudhuri, NVIDIA;

(14) Sharon Clay, NVIDIA;

(15) Bill Dally, NVIDIA;

(16) Laura Dang, NVIDIA;

(17) Parikshit Deshpande, NVIDIA;

(18) Siddhanth Dhodhi, NVIDIA;

(19) Sameer Halepete, NVIDIA;

(20) Eric Hill, NVIDIA;

(21) Jiashang Hu, NVIDIA;

(22) Sumit Jain, NVIDIA;

(23) Brucek Khailany, NVIDIA;

(24) George Kokai, NVIDIA;

(25) Kishor Kunal, NVIDIA;

(26) Xiaowei Li, NVIDIA;

(27) Charley Lind, NVIDIA;

(28) Hao Liu, NVIDIA;

(29) Stuart Oberman, NVIDIA;

(30) Sujeet Omar, NVIDIA;

(31) Sreedhar Pratty, NVIDIA;

(23) Jonathan Raiman, NVIDIA;

(33) Ambar Sarkar, NVIDIA;

(34) Zhengjiang Shao, NVIDIA;

(35) Hanfei Sun, NVIDIA;

(36) Pratik P Suthar, NVIDIA;

(37) Varun Tej, NVIDIA;

(38) Walker Turner, NVIDIA;

(39) Kaizhe Xu, NVIDIA;

(40) Haoxing Ren, NVIDIA.

Many domains have a significant amount of proprietary data which can be used to train a domain-specific LLM. One approach is to train a domain specific foundation model from scratch, e.g., BloombergGPT [10] for finance, BioMedLLM [11] for biomed, and Galactica [38] for science. These models were usually trained on more than 100B tokens of raw domain data. The second approach is domain-adaptive pretraining (DAPT) [14] which continues to train a pretrained foundation model on additional raw domain data. It shows slight performance boost on domain-specific tasks in domains such as biomedical, computer science publications, news, and reviews. In one example, [39] continued-pretrained a foundation model on technical content datasets and achieved state-of-theart performance on many quantitative reasoning tasks.

Retrieval Augmented Generation (RAG) helps ground the LLM to generate accurate information and to extract up-to-date information to improve knowledge-intensive NLP tasks [40]. It is observed that smaller models with RAG can outperform larger models without RAG [41]. Retrieval methods include sparse retrieval methods such as TF-IDF or BM25 [42], which analyze word statistic information and find matching documents with a high dimensional sparse vector. Dense retrieval methods such as [43] [44] find matching documents on an embedding space generated by a retrieval model pretrained on a large corpus with or without fine-tuning on a retrieval dataset. The retrieval model can be trained standalone [43] [44] [45] or jointly with language models [46] [41]. In addition, it has been shown that off-the-shelf general purpose retrievers can improve a baseline language model significantly without further finetuning [47]. RAG is also proposed to perform code generation tasks [48] by retrieving from coding documents.

Foundation models are completion models, which have limited chat and instruction following capabilities. Therefore, a model alignment process is applied to the foundation models to train a corresponding chat model. Instruction fine-tuning [20] and reinforcement learning from human feedback (RLHF) [36] are two common model alignment techniques. Instruction fine-tuning further trains a foundation model using instructions datasets. RLHF leverages human feedback to label a dataset to train a reward model and applies reinforcement learning to further improve models given the trained reward model. RLHF is usually more complex and resource hungry than instruction fine-tuning. Therefore, recent studies also propose to reduce this overhead with simpler methods such as DPO [49] and SteerLM [50].

Researchers have started to apply LLM to chip design problems. Early works such as Dave [51] first explored the possibility of generating Verilog from English with a language model (GPT-2). Following that work, [6] showed that fine-tuned open-source LLMs (CodeGen) on Verilog datasets collected from GitHub and Verilog textbooks outperformed state-of-the-art OpenAI models such as code-davinci-002 on 17 Verilog questions. [12] proposed a benchmark with more than 150 problems and demonstrated that the Verilog code generation capability of pretrained language models could be improved with supervised fine-tuning by bootstrapping with LLM generated synthetic problem-code pairs. Chip-Chat [7] experimented with conversational flows to design and verify a 8-bit accumulator-based microprocessor with GPT-4 and GPT-3.5. Their findings showed that although GPT-4 produced relatively high-quality codes, it still does not perform well enough at understanding and fixing the errors. ChipEDA [8] proposed to use LLMs to generate EDA tools scripts. It also demonstrated that fine-tuned LLaMA2 70B model outperforms GPT-4 model on this task.

This paper is available on arxiv under CC 4.0 license.