Together, the dataset's developer, claims it is the largest public dataset built specifically for language-model pre-training
RedPajama training progress at 440 billion tokens
[2311.17035] Scalable Extraction of Training Data from (Production) Language Models
RedPajama's Giant 30T Token Dataset Shows that Data is the Next Frontier in LLMs
RedPajama: Reproducing the LLaMA🦙 Dataset with 1.2 Trillion Tokens, by Angelina Yang
RedPajama-Data-v2: an Open Dataset with 30 Trillion Tokens for Training Large Language Models : r/LocalLLaMA
RLHF: Reinforcement Learning from Human Feedback