Data File Formats: CSV, JSON, Parquet & More
Data has to live somewhere. Before it goes into a database, it's often stored in files. Let's explore the most common data file formats and when to use each one.
Why do file formats matter?
Think of file formats like containers for shipping goods.
You wouldn’t ship loose apples in an envelope. You’d use a crate (structured), a bag with labels on each apple (semi-structured), or a sealed box for fragile items (unstructured).
Data file formats work the same way. Some are great for spreadsheets, some for web apps, and some for massive analytics jobs. Picking the right format makes data easier to store, share, and process.
Common data file formats
CSV (Comma-Separated Values)
The simplest structured file format. Each line is a row, values are separated by commas, and the first line is usually the header.
OrderID,Student,Item,Price
1001,Aisha,Chicken wrap,8.50
1002,Liam,Coffee,4.00
Strengths: Human-readable, universally supported, easy to create from spreadsheets. Weaknesses: No data types (everything is text), no hierarchy, quoting and escaping conventions vary between tools, and large files can be slow to process.
When to use: Simple data exchange, spreadsheet exports, small datasets.
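To see the "everything is text" weakness in practice, here is a minimal sketch using Python's standard `csv` module to read the order data above (the variable names are illustrative):

```python
import csv
import io

# The sample CSV from above; DictReader treats the first line as the header.
raw = """OrderID,Student,Item,Price
1001,Aisha,Chicken wrap,8.50
1002,Liam,Coffee,4.00"""

rows = list(csv.DictReader(io.StringIO(raw)))

# Every value comes back as a string -- CSV carries no type information,
# so numbers must be converted explicitly before doing arithmetic.
print(rows[0]["Price"])                        # '8.50' (a string, not a number)
total = sum(float(r["Price"]) for r in rows)
print(total)                                   # 12.5
```

Note that the price arrives as the string `'8.50'`; forgetting the `float()` conversion is a classic source of bugs when processing CSV files.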
JSON (JavaScript Object Notation)
The most popular semi-structured format. Data is organised as key-value pairs, with support for nesting and arrays.
{
"orderId": 1001,
"student": "Aisha",
"items": [
{ "name": "Chicken wrap", "price": 8.50 }
]
}
Strengths: Flexible schema, human-readable, natively understood by web apps and APIs. Weaknesses: Verbose (lots of repeated keys), not great for very large datasets.
When to use: Web APIs, configuration files, document databases like Cosmos DB.
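In contrast to CSV, JSON preserves both types and nesting. A quick sketch with Python's standard `json` module, parsing the order document above:

```python
import json

# The order document from above, as a single JSON string.
raw = '{"orderId": 1001, "student": "Aisha", "items": [{"name": "Chicken wrap", "price": 8.50}]}'

order = json.loads(raw)

# JSON keeps its data types: orderId parses as an integer, price as a float.
print(type(order["orderId"]))          # <class 'int'>
print(order["items"][0]["price"])      # 8.5

# Nesting means related data travels together in one document.
print(order["items"][0]["name"])       # Chicken wrap
```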
XML (Extensible Markup Language)
An older semi-structured format that uses opening and closing tags. More verbose than JSON but supports attributes, namespaces, and formal schemas.
<order id="1001">
<student>Aisha</student>
<item price="8.50">Chicken wrap</item>
</order>
Strengths: Formal schema validation (XSD), widely used in enterprise systems and SOAP APIs. Weaknesses: Very verbose, harder to read than JSON, declining in popularity.
When to use: Legacy enterprise systems, formal data contracts, SOAP web services.
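The split between attributes (like `price="8.50"`) and element text (like `Aisha`) is what makes XML different from JSON. A small sketch using Python's standard `xml.etree.ElementTree` to parse the order above:

```python
import xml.etree.ElementTree as ET

raw = """<order id="1001">
  <student>Aisha</student>
  <item price="8.50">Chicken wrap</item>
</order>"""

order = ET.fromstring(raw)

# Data can live in attributes or in element text -- and, like CSV,
# every value is a string until you convert it yourself.
print(order.get("id"))                  # '1001'
print(order.find("student").text)       # 'Aisha'
print(order.find("item").get("price"))  # '8.50'
```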
Parquet
A columnar binary format designed for analytics. Instead of storing data row by row, Parquet stores it column by column — which makes aggregation queries (SUM, AVG, COUNT) dramatically faster.
Strengths: Excellent compression, fast analytical queries, supports complex types, widely used in big data. Weaknesses: Not human-readable (binary), overkill for small datasets.
When to use: Data lakes, analytics pipelines, Microsoft Fabric lakehouses, Azure Databricks.
Avro
A row-based binary format that embeds the schema with the data. Popular in streaming and data pipeline scenarios.
Strengths: Compact, schema included in every file, great for streaming data. Weaknesses: Not human-readable, less common in analytics than Parquet.
When to use: Event streaming (e.g., Azure Event Hubs), data serialisation, schema evolution scenarios.
ORC (Optimized Row Columnar)
Similar to Parquet — a columnar binary format optimised for Hadoop-based analytics. You may see it mentioned alongside Parquet.
Strengths: Excellent compression, optimised for Hive queries. Weaknesses: Less widely adopted outside the Hadoop ecosystem.
When to use: Hadoop or Hive-based analytics workloads.
| Feature | CSV | JSON | Parquet | Avro |
|---|---|---|---|---|
| Structure | Rows & columns (text) | Key-value pairs (text) | Columnar (binary) | Row-based (binary) |
| Human-readable? | Yes | Yes | No | No |
| Schema | Header row only | Self-describing keys | Embedded | Embedded |
| Best for | Simple data exchange | Web APIs, documents | Analytics & data lakes | Streaming & pipelines |
| Compression | Poor | Moderate | Excellent | Good |
Row-based vs columnar storage
This is a key concept for understanding analytics performance:
- Row-based (CSV, Avro): stores all fields of one record together. Fast for reading individual records.
- Columnar (Parquet, ORC): stores all values of one column together. Fast for aggregating a single column across millions of records.
Example: If Priya wants the total sales across 10 million transactions, a columnar format only reads the “price” column — skipping name, date, and everything else. A row-based format would have to read every field of every record.
That’s why data lakes and analytics pipelines commonly use columnar formats like Parquet (or Delta Lake, which builds on Parquet).
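The difference between the two layouts can be sketched in plain Python. This is a toy illustration of the storage idea only, not how Parquet files actually encode data on disk:

```python
# Row-based layout: one tuple per record, all fields together.
row_store = [
    ("Chicken wrap", "2024-05-01", 8.50),
    ("Coffee",       "2024-05-01", 4.00),
    ("Salad",        "2024-05-02", 6.25),
]

# Columnar layout: one list per column, all values of a field together.
column_store = {
    "item":  ["Chicken wrap", "Coffee", "Salad"],
    "date":  ["2024-05-01", "2024-05-01", "2024-05-02"],
    "price": [8.50, 4.00, 6.25],
}

# Total sales, row-based: every record (and so every field) is touched.
total_rows = sum(record[2] for record in row_store)

# Total sales, columnar: only the "price" column is read;
# item names and dates are never loaded at all.
total_cols = sum(column_store["price"])

print(total_rows, total_cols)  # 18.75 18.75
```

Both layouts give the same answer; the columnar one simply reads far less data, which is exactly the saving Priya gets across 10 million records.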
Exam tip: format recognition
The exam may show you a snippet of data and ask which format it is. Quick recognition guide:
- Commas separating values, header row → CSV
- Curly braces and colons → JSON
- Angle brackets with opening/closing tags → XML
- “Binary columnar format for analytics” → Parquet
- “Row-based binary with embedded schema” → Avro
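The guide above can be turned into a small practice helper. `detect_format` is a hypothetical function for self-testing, using only the first characters of a snippet, so it is deliberately rough rather than a real format sniffer:

```python
def detect_format(snippet: str) -> str:
    """Rough format recognition mirroring the exam guide above."""
    text = snippet.lstrip()
    if text.startswith("{") or text.startswith("["):
        return "JSON"   # curly braces (or an array) and colons
    if text.startswith("<"):
        return "XML"    # angle brackets with tags
    if "," in text.splitlines()[0]:
        return "CSV"    # comma-separated values with a header row
    return "unknown"    # binary formats like Parquet/Avro won't look like text

print(detect_format('{"orderId": 1001}'))            # JSON
print(detect_format('<order id="1001"/>'))           # XML
print(detect_format("OrderID,Student\n1001,Aisha"))  # CSV
```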
Knowledge check
Priya needs to store 50 million sales records in a data lake for analytics. The main queries will be 'total sales by store' and 'average price by category.' Which file format should she choose?
Tom receives shipment data from a partner via an API. The response uses curly braces, colons, and key-value pairs. What format is this?
Next up: Databases: Relational vs Non-Relational — now that you know how data lives in files, let’s explore how it lives in databases.