The Pile: An 800GB Dataset of Diverse Text for Language Modeling
13 points
10 months ago
by tosh
Comments
10 months ago
tosh
I tried to find out how many "tokens" (I know: it depends on the tokenizer) "The Pile" has, but couldn't find a figure.
As far as I understand, RedPajama has 1.2T tokens (https://github.com/togethercomputer/RedPajama-Data), and its README has a table listing the main parts and how many tokens each contains.
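Since the exact count depends on the tokenizer, one way to get an order-of-magnitude answer is a back-of-envelope estimate from the dataset's raw size. The sketch below is illustrative only: the ~825 GiB figure is the commonly cited raw size of The Pile, and the bytes-per-token ratio is an assumed rough average for GPT-2-style BPE on English text, not a measured value.

```python
# Back-of-envelope token estimate for a text corpus.
# Token counts depend on the tokenizer; English text under GPT-2-style
# BPE tends to average roughly 3-4 bytes per token. Both constants
# below are assumptions for illustration, not measured values.

PILE_SIZE_BYTES = 825 * 1024**3   # ~825 GiB, the commonly cited raw size
BYTES_PER_TOKEN = 4.0             # assumed average for BPE on English text

est_tokens = PILE_SIZE_BYTES / BYTES_PER_TOKEN
print(f"~{est_tokens / 1e9:.0f}B tokens (order-of-magnitude only)")
```

With these assumptions the estimate lands in the low hundreds of billions of tokens; the real number would need to come from actually tokenizing the corpus with a specific tokenizer.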
10 months ago
perone
I wrote an article about recent data pipelines as well, if anyone is interested: https://blog.christianperone.com/2023/06/appreciating-llms-d...
10 months ago
startupsfail
Arguably, the "The Pile" approach is the bane of open-source and small-player LLMs, because models are shaped by their data, and there is little beauty in "The Pile".
— anonymous shaper
10 months ago
Might be useful to those who have not been in the space, but this is old news: The Pile dates from 2020.