The godot-dodo project presents a pipeline to finetune open source language models on human-created, language-specific code retrieved from GitHub. In this case, the targeted language is GDScript, but the same methodology can be applied to other languages.

This repository includes:

- Scripts to assemble the finetuning dataset
- Pre-assembled, raw datasets (up to a size of 60k rows)
- Performance report comparing finetuned models

In summary, godot_dodo models achieve significantly greater consistency than gpt-4/gpt-3.5-turbo when it comes to generating accurate GDScript syntax, but are somewhat less capable of following complex instructions. For comprehensive results explaining the methodology used and a full list of all results, please refer to the full performance report here.

Some existing language models such as gpt-4 are excellent coders. However, a lot of their ability is concentrated in only the most popular languages, such as Python or JavaScript. Less widely used languages are underrepresented in the training data and experience a massive performance drop-off, where models routinely mistake syntax or hallucinate language features that do not exist. This project aims to provide much more robust language-specific models that can be used to reliably generate code that compiles on the first try.

Unlike other, similar approaches to finetuning models such as stanford-alpaca, this approach does not use existing, larger language models to produce the output values of the finetuning dataset. Language models are instead only used to label each code snippet. As such, we can assemble comment:code data pairs in the style of CodeSearchNet, making use of powerful existing models to annotate high-quality human-created code.

To try out the pre-trained models, you can use the inference_demo.ipynb notebook. In order to use that notebook on Google Colab, follow this link.

Dataset Generation

Because this approach relies on human-created data, we scrape GitHub repositories using the GitHub search API. Using the language:gdscript search term, we retrieve a list of repositories containing GDScript code. We also use license:mit to limit the dataset to suitable repositories. Only MIT-licensed code is used for training!

We then clone each repository and apply the following logic:

- Detect whether the project is made for 3.x or 4.x Godot engine versions
- Split each GDScript file into individual functions
- For each function found, ask an existing LLM (gpt-3.5-turbo) for a detailed comment describing the function's purpose
- Add the resulting instruction:response data pair to the dataset

Note that existing, human-written comments located above the code block are not used for the instruction value. We are interested in consistent detail for comments, rather than trying to preserve some potentially higher-quality human-written ones. Human comments within the code block, however, are preserved.
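To make the scraping step concrete, here is a minimal sketch of how such a repository query could look against the public GitHub REST search API. The token handling and pagination are simplified assumptions; the repository's actual script (data/generate_unlabeled_dataset.py) may differ in its exact parameters.

```python
# Sketch only: fetch MIT-licensed GDScript repositories via the GitHub search API.
# Assumes a personal access token in the GITHUB_TOKEN environment variable.
import os
import requests

GITHUB_TOKEN = os.environ["GITHUB_TOKEN"]

def search_gdscript_repos(page: int = 1, per_page: int = 100) -> list[dict]:
    """Return one page of repositories matching language:gdscript license:mit."""
    response = requests.get(
        "https://api.github.com/search/repositories",
        params={
            "q": "language:gdscript license:mit",
            "sort": "stars",
            "order": "desc",
            "per_page": per_page,
            "page": page,
        },
        headers={
            "Accept": "application/vnd.github+json",
            "Authorization": f"Bearer {GITHUB_TOKEN}",
        },
        timeout=30,
    )
    response.raise_for_status()
    return response.json()["items"]

if __name__ == "__main__":
    for repo in search_gdscript_repos():
        print(repo["full_name"], repo["clone_url"])
```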
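The labeling step can be pictured along these lines: split each cloned GDScript file into top-level functions, ask gpt-3.5-turbo for a descriptive comment, and pair that comment with the unmodified code. The splitter heuristic, prompt wording, and instruction/output field names below are illustrative assumptions rather than the repository's actual implementation (the snippet uses the openai Python client, v1+ interface).

```python
# Sketch only: turn a GDScript file into labeled instruction:response pairs.
import re
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def split_functions(source: str) -> list[str]:
    """Very rough split of a GDScript file into top-level function blocks."""
    functions, current = [], []
    for line in source.splitlines():
        if re.match(r"(static\s+)?func\s", line):            # a new top-level function starts
            if current:
                functions.append("\n".join(current).strip())
            current = [line]
        elif current and (line.startswith((" ", "\t")) or not line.strip()):
            current.append(line)                              # indented body or blank line
        elif current:                                         # unindented non-func line ends the block
            functions.append("\n".join(current).strip())
            current = []
    if current:
        functions.append("\n".join(current).strip())
    return functions

def label_function(gdscript_function: str) -> str:
    """Ask gpt-3.5-turbo for a detailed comment describing the function's purpose."""
    completion = client.chat.completions.create(
        model="gpt-3.5-turbo",
        temperature=0.2,
        messages=[
            {"role": "system", "content": "You write concise comments for GDScript code."},
            {"role": "user", "content": "Write a detailed comment describing the purpose "
                                        f"of this GDScript function:\n\n{gdscript_function}"},
        ],
    )
    return completion.choices[0].message.content.strip()

def build_pairs(source: str) -> list[dict]:
    """Assemble instruction:response pairs (label as instruction, human code as response)."""
    return [{"instruction": label_function(fn), "output": fn} for fn in split_functions(source)]
```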
To assemble a dataset yourself, follow these instructions:

- Run python data/generate_unlabeled_dataset.py

Please do note that you'll need GitHub and OpenAI API keys in order to use these scripts.

Pre-assembled datasets included in this repository:

- godot_dodo_4x_60k: assembled using 4.x Godot projects - ~60k rows

Further datasets may be added in the future (particularly regarding 3.x data).

Finetuning

The fine-tuning process closely mirrors the one introduced by stanford_alpaca. To reproduce a fine-tuned version of LLaMA, please follow the steps below.

In order to effectively finetune a llama-7b or llama-13b model, it is highly recommended to use at least two A100 80GB GPUs. You may otherwise encounter out-of-memory errors or experience extremely long training times, and will need to adjust the training parameters. For finetuning godot_dodo_4x_60k_llama_13b, eight A100 80GB GPUs were used. Another important consideration is the protocol used for GPU communication.
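For orientation only, the sketch below shows what a supervised finetuning run in the spirit of the stanford_alpaca recipe can look like with Hugging Face transformers. The model path, dataset path, prompt template, and hyperparameters are placeholders, not the settings used for the released godot_dodo checkpoints, and a real run would be launched across multiple GPUs (e.g. via torchrun with FSDP or DeepSpeed) in line with the hardware notes above.

```python
# Sketch only: supervised finetuning of a LLaMA-style model on instruction:output pairs.
# Paths, prompt format and hyperparameters are placeholders, not the project's actual config.
import json
import torch
from torch.utils.data import Dataset
from transformers import AutoModelForCausalLM, AutoTokenizer, Trainer, TrainingArguments

PROMPT = ("Below is an instruction that describes a task. "
          "Write a response that appropriately completes the request.\n\n"
          "### Instruction:\n{instruction}\n\n### Response:\n")

class PairDataset(Dataset):
    """Tokenizes instruction:output pairs for causal-LM training."""
    def __init__(self, path: str, tokenizer, max_len: int = 1024):
        with open(path) as f:
            self.examples = json.load(f)
        self.tokenizer, self.max_len = tokenizer, max_len

    def __len__(self):
        return len(self.examples)

    def __getitem__(self, idx):
        ex = self.examples[idx]
        text = PROMPT.format(instruction=ex["instruction"]) + ex["output"] + self.tokenizer.eos_token
        ids = self.tokenizer(text, truncation=True, max_length=self.max_len,
                             return_tensors="pt").input_ids[0]
        return {"input_ids": ids, "labels": ids.clone()}

def main():
    base = "path/to/llama-13b"                        # placeholder model path
    tokenizer = AutoTokenizer.from_pretrained(base)
    model = AutoModelForCausalLM.from_pretrained(base, torch_dtype=torch.bfloat16)
    trainer = Trainer(
        model=model,
        args=TrainingArguments(
            output_dir="godot_dodo_checkpoint",
            per_device_train_batch_size=1,            # batch size 1 avoids the need for padding
            gradient_accumulation_steps=16,
            num_train_epochs=3,
            learning_rate=2e-5,
            bf16=True,
            logging_steps=10,
            save_strategy="epoch",
        ),
        train_dataset=PairDataset("data/dataset.json", tokenizer),  # placeholder dataset path
    )
    trainer.train()

if __name__ == "__main__":
    main()
```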