Big CSV files

how big can a CSV get before it becomes wildly unwieldy?

Attached: CSV_File.jpg (824x643, 124K)

6 big

if it's over 200gb

Depends on how you plan to utilize and process it.
If you plan to use Excel or similar, then it depends on your computer's specs.
If you process it in a streaming fashion with shell utilities, there isn't really any limit to the size.

Stream eh?

Pardon my ignorance, but how would that work? Can you make a program only target specific chunks of the file at a time?

Spin up an Oracle XE or DB2-C instance and load that bigass CSV file into a table.

A well-formatted file with 10 million entries beats a badly formatted one with 1000.

If you use Excel it's pretty limited: the hard cap is 1,048,576 rows, and it gets unusably slow well before that. Computer specs won't help.

what is swap space

Slow as balls?

I've done 37 million before and it was painful. At that point just make a sqlite loader if you need something light to work with.
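
A "sqlite loader" really is just a few lines. A rough Python sketch, assuming a data.csv with a header row (the file and table names here are made up):

import csv, sqlite3
conn = sqlite3.connect("data.db")              # hypothetical output database
with open("data.csv", newline="") as f:        # hypothetical input file
    reader = csv.reader(f)
    header = next(reader)
    cols = ", ".join(f'"{c}"' for c in header)
    marks = ", ".join("?" for _ in header)
    # SQLite doesn't require column types, so the header alone is enough for a schema
    conn.execute(f"CREATE TABLE IF NOT EXISTS rows ({cols})")
    # executemany consumes the csv reader lazily, so the whole file never sits in memory
    conn.executemany(f"INSERT INTO rows VALUES ({marks})", reader)
conn.commit()

After that it's plain SQL: sqlite3 data.db 'SELECT COUNT(*) FROM rows;'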

Shell utilities typically work line by line, only holding a small chunk of the file in memory at any given time. For CSV, check out utilities like "xsv". For JSON, check out "jq". For XML, check out "xq". There are many more. I handle CSVs hundreds of gigabytes in size at work with these and it's fine. It also makes most people at work think you're a wizard, since they can't see more than 1% of the file with their crappy Excel.
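
The same line-by-line trick in plain Python, for anyone who can't install xsv at work. Just a sketch, assuming a big.csv with a "price" column (all names made up):

import csv
# only one row is ever held in memory, so the size of the file doesn't matter
with open("big.csv", newline="") as src, open("filtered.csv", "w", newline="") as dst:
    reader = csv.DictReader(src)
    writer = csv.DictWriter(dst, fieldnames=reader.fieldnames)
    writer.writeheader()
    for row in reader:
        if float(row["price"]) > 100:   # whatever condition you actually need
            writer.writerow(row)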

In other words, properly-formatted (well-formed) CSV is unwieldy only if it doesn't fit on your disk (like anything else would be). Otherwise it's perfectly manageable.

If you're using a well-written library to parse or save it, you probably won't run into problems with any data set size you're likely to encounter.

Don't try to parse or save CSV yourself though, you'll just fuck it up.

There are XML files in the medical insurance industry that are terabytes in size. It's not about size. It's about how you do it and how much computing power you have.

Good luck swapping 1TB
>inb4 he doesn't have terabytes of data

How big is big? I'm taking an ml course and the training data is a 100MB csv file. Pandas handles it fine.

10 lines.
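
100MB is small for pandas, and even when the file stops fitting in RAM you can keep going in chunks. A sketch (the file and column names are made up):

import pandas as pd
total = 0
# chunksize makes read_csv return an iterator of DataFrames instead of loading everything at once
for chunk in pd.read_csv("train.csv", chunksize=100_000):
    total += chunk["label"].sum()
print(total)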

but you can convert CSV to XLS (with LibreOffice or any other tool, or even a quick Python script), and then Excel can chew through it much more easily (up to its row limit, anyway).
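
The quick Python script version, if you'd rather not click through LibreOffice. A sketch using pandas (filenames are made up, and .xlsx caps out at 1,048,576 rows):

import pandas as pd
df = pd.read_csv("machine_output.csv")
df.to_excel("machine_output.xlsx", index=False)   # to_excel needs openpyxl installed for .xlsx output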

I have a machine that spits a big CSV at me. LibreOffice opens it properly (no auto cell-type-guessing BS), I save it as XLS, and view it in Excel, because LibreOffice Calc is super slow even with a few thousand lines: searching and browsing can literally take seconds, while in Excel everything is instant.

kinda clunky, kinda weird, but it works.

you can always use SQLite or MySQL if you have to store a gazillion lines

File system limits. There are programs made to handle large files, so the software side isn't the issue.

In stats class, one of the main CSV files we used was about 9.5GB; the professor called it a small population set. A lot of the little ones he called test data. Could just be the professor, idk.

yea, "big file handling" is actually a feature; you can find text editors that do it.
basically what you have to do (if you code) is not read the entire file into memory, just the portion you need.

for example if you are looking for something, you only need one line. or a few. not the entire 10GB.

or if you are fucking lazy, you pay $5-10 on Amazon, rent a machine with a gazillion GB of RAM, and process the data there.
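
For the non-lazy route, grabbing just a slice of a huge file without holding the rest in memory looks like this. A sketch; the path and line numbers are arbitrary:

from itertools import islice
with open("huge.csv") as f:
    # islice walks the file iterator lazily: earlier lines are read off disk and thrown
    # away, so only one line is ever held in memory
    for line in islice(f, 1_000_000, 1_000_050):
        print(line, end="")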

You must level your wizardry to 30 and learn the art called "sed".
Once fully mastered, it's a fucking cheat.
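
And if the sed spellbook is out of reach, the same trick (stream in, substitute, stream out) is a few lines of Python. A sketch; the semicolon-to-comma substitution is just an example:

import re, sys
# roughly `sed 's/;/,/g'`: substitute one line at a time, stdin to stdout
for line in sys.stdin:
    sys.stdout.write(re.sub(r";", ",", line))

Run it like: python fixup.py < in.csv > out.csv (fixup.py being whatever you name it).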

And what alternative do you think you have?

SQL? CSV with compression, a la TempleOS? Or a binary file that you compress? The possibilities are endless, really.

How would you display a 10gb text file on a machine with only 2gb of ram?

This. You can fully normalise a good CSV file into a full relational database with a few lines of code.
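
The "few lines of code" part, roughly. A sketch that splits a made-up orders.csv (columns: customer, amount) into two SQLite tables so repeated customer names become foreign keys:

import csv, sqlite3
conn = sqlite3.connect("shop.db")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT UNIQUE);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL);
""")
with open("orders.csv", newline="") as f:
    for row in csv.DictReader(f):
        # dedupe customers on the fly, then point each order at the customer id
        conn.execute("INSERT OR IGNORE INTO customers (name) VALUES (?)", (row["customer"],))
        cust = conn.execute("SELECT id FROM customers WHERE name = ?", (row["customer"],)).fetchone()[0]
        conn.execute("INSERT INTO orders (customer_id, amount) VALUES (?, ?)", (cust, float(row["amount"])))
conn.commit()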

Just some advice: if you happen to work with Indians, use the 1252 code page. I used Unicode since we have fields with Unicode characters, but their "in house system" doesn't have a library for processing Unicode. Also, Unicode takes more space.
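
In Python that's just the encoding argument when you write the file. A sketch; cp1252 will raise on any character it can't represent, which at least fails loudly:

import csv
with open("export.csv", "w", newline="", encoding="cp1252") as f:
    writer = csv.writer(f)
    writer.writerow(["id", "name"])
    writer.writerow([1, "Müller"])   # fine in cp1252; something like "北京" would raise UnicodeEncodeError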

Then again, what's the point? If you can get a fully normalised database from a CSV file and a script, why not just use a database from the start? I'd say SQL SELECT and INSERT statements are much easier and shorter than stream readers and writers.

Over 4GB, because then you can't put it on FAT32.

you can compress CSV as well
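
And you can read it compressed without ever expanding the whole thing on disk. A sketch, the filename is made up:

import csv, gzip
# gzip.open in text mode decompresses on the fly, line by line
with gzip.open("big.csv.gz", "rt", newline="") as f:
    for row in csv.reader(f):
        pass  # process each row here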

I wonder if netstrings would be smaller than CSV with a bunch of escaping. A binary variant of netstrings would be even more compact.
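
For reference, a netstring is just the decimal length, a colon, the raw bytes, and a trailing comma, so the overhead is easy to eyeball against CSV quoting. A tiny sketch:

def netstring(value: bytes) -> bytes:
    # netstring format: <decimal length>:<raw bytes>,
    return str(len(value)).encode() + b":" + value + b","
fields = [b'hello', b'she said "hi, there"']
print(b"".join(netstring(f) for f in fields))
# prints b'5:hello,20:she said "hi, there",'
# the CSV equivalent needs quoting plus doubled inner quotes for the second field:
# hello,"she said ""hi, there"""

Whether that actually comes out smaller depends on how much of your data needs escaping in the first place.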