The Pile: An 800GB Dataset of Diverse Text for Language Modeling

13 points
10 months ago
by tosh

Comments


tetete

Might be useful to those who have not been in the space, but this is old news: it dates from 2020.

10 months ago

tosh

I tried to find out how many "tokens" (I know: it depends on the tokenizer) "The Pile" has, but couldn't find a figure.

As far as I understand, RedPajama has 1.2T tokens (https://github.com/togethercomputer/RedPajama-Data), and its README has a table listing the main parts and how many tokens each contains.
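
For a rough figure, here is a minimal sketch of how one might count tokens over a sample, assuming the dataset is still hosted on the Hugging Face Hub under the id "EleutherAI/pile" (that id, and the 10,000-document sample size, are assumptions on my part), using the GPT-2 BPE tokenizer as one reasonable choice:

    # Rough sketch: stream a sample of The Pile and count GPT-2 BPE tokens.
    # The dataset id "EleutherAI/pile" is an assumption and may have moved.
    from datasets import load_dataset
    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("gpt2")
    ds = load_dataset("EleutherAI/pile", split="train", streaming=True)

    n_docs = 10_000  # sample only; a full pass over ~800GB takes far longer
    total_tokens = 0
    for i, example in enumerate(ds):
        if i >= n_docs:
            break
        total_tokens += len(tokenizer(example["text"])["input_ids"])

    print(f"GPT-2 tokens in first {n_docs:,} documents: {total_tokens:,}")

Extrapolating from a sample like this only gives a ballpark, since the bytes-per-token ratio shifts across the Pile's components (code, prose, non-English text).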

10 months ago

perone

I wrote an article about recent data pipelines as well, if anyone is interested: https://blog.christianperone.com/2023/06/appreciating-llms-d...

10 months ago

startupsfail

Arguably, the “The Pile” approach is the bane of open-source and small-player LLMs: models are shaped by their data, and there is little beauty in “The Pile”.

— anonymous shaper

10 months ago