Together, the dataset's developer, claims it is the largest public dataset built specifically for language-model pre-training
RedPajama training progress at 440 billion tokens
[2311.17035] Scalable Extraction of Training Data from (Production) Language Models
RedPajama's Giant 30T Token Dataset Shows that Data is the Next Frontier in LLMs
RedPajama: Reproducing the LLaMA🦙 Dataset with 1.2 Trillion Tokens, by Angelina Yang
RedPajama-Data-v2: an Open Dataset with 30 Trillion Tokens for Training Large Language Models : r/LocalLLaMA
RLHF: Reinforcement Learning from Human Feedback