Data File Formats: CSV, JSON, Parquet & More
Data has to live somewhere. Before it goes into a database, it's often stored in files. Let's explore the most common data file formats and when to use each one.
Why do file formats matter?
Think of file formats like containers for shipping goods.
You wouldn’t ship loose apples in an envelope. You’d use a crate (structured), a bag with labels on each apple (semi-structured), or a sealed box for fragile items (unstructured).
Data file formats work the same way. Some are great for spreadsheets, some for web apps, and some for massive analytics jobs. Picking the right format makes data easier to store, share, and process.
Common data file formats
CSV (Comma-Separated Values)
The simplest structured file format. Each line is a row, values are separated by commas, and the first line is usually the header.
OrderID,Student,Item,Price
1001,Aisha,Chicken wrap,8.50
1002,Liam,Coffee,4.00
Strengths: Human-readable, universally supported, easy to create from spreadsheets. Weaknesses: No data types (everything is text), no hierarchy, quoting and escaping conventions vary between tools, and large files can be slow to process.
When to use: Simple data exchange, spreadsheet exports, small datasets.
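To see the "everything is text" weakness in practice, here is a minimal sketch using Python's standard `csv` module to read the order data above (the variable names are illustrative):

```python
import csv
import io

# The sample CSV from above; DictReader treats the first line as the header.
raw = """OrderID,Student,Item,Price
1001,Aisha,Chicken wrap,8.50
1002,Liam,Coffee,4.00"""

rows = list(csv.DictReader(io.StringIO(raw)))

# Every value comes back as a string -- CSV carries no type information,
# so numbers must be converted explicitly before doing arithmetic.
print(rows[0]["Price"])                        # '8.50' (a string, not a number)
total = sum(float(r["Price"]) for r in rows)
print(total)                                   # 12.5
```

Note that the price arrives as the string `'8.50'`; forgetting the `float()` conversion is a classic source of bugs when processing CSV files.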
JSON (JavaScript Object Notation)
The most popular semi-structured format. Data is organised as key-value pairs, with support for nesting and arrays.
{
"orderId": 1001,
"student": "Aisha",
"items": [
{ "name": "Chicken wrap", "price": 8.50 }
]
}
Strengths: Flexible schema, human-readable, natively understood by web apps and APIs. Weaknesses: Verbose (lots of repeated keys), not great for very large datasets.
When to use: Web APIs, configuration files, document databases like Cosmos DB.
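In contrast to CSV, JSON preserves both types and nesting. A quick sketch with Python's standard `json` module, parsing the order document above:

```python
import json

# The order document from above, as a single JSON string.
raw = '{"orderId": 1001, "student": "Aisha", "items": [{"name": "Chicken wrap", "price": 8.50}]}'

order = json.loads(raw)

# JSON keeps its data types: orderId parses as an integer, price as a float.
print(type(order["orderId"]))          # <class 'int'>
print(order["items"][0]["price"])      # 8.5

# Nesting means related data travels together in one document.
print(order["items"][0]["name"])       # Chicken wrap
```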
XML (Extensible Markup Language)
An older semi-structured format that uses opening and closing tags. More verbose than JSON but supports attributes, namespaces, and formal schemas.
<order id="1001">
<student>Aisha</student>
<item price="8.50">Chicken wrap</item>
</order>
Strengths: Formal schema validation (XSD), widely used in enterprise systems and SOAP APIs. Weaknesses: Very verbose, harder to read than JSON, declining in popularity.
When to use: Legacy enterprise systems, formal data contracts, SOAP web services.
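The split between attributes (like `price="8.50"`) and element text (like `Aisha`) is what makes XML different from JSON. A small sketch using Python's standard `xml.etree.ElementTree` to parse the order above:

```python
import xml.etree.ElementTree as ET

raw = """<order id="1001">
  <student>Aisha</student>
  <item price="8.50">Chicken wrap</item>
</order>"""

order = ET.fromstring(raw)

# Data can live in attributes or in element text -- and, like CSV,
# every value is a string until you convert it yourself.
print(order.get("id"))                  # '1001'
print(order.find("student").text)       # 'Aisha'
print(order.find("item").get("price"))  # '8.50'
```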
Parquet
A columnar binary format designed for analytics. Instead of storing data row by row, Parquet stores it column by column — which makes aggregation queries (SUM, AVG, COUNT) dramatically faster.
Strengths: Excellent compression, fast analytical queries, supports complex types, widely used in big data. Weaknesses: Not human-readable (binary), overkill for small datasets.
When to use: Data lakes, analytics pipelines, Microsoft Fabric lakehouses, Azure Databricks.
Avro
A row-based binary format that embeds the schema with the data. Popular in streaming and data pipeline scenarios.
Strengths: Compact, schema included in every file, great for streaming data. Weaknesses: Not human-readable, less common in analytics than Parquet.
When to use: Event streaming (e.g., Azure Event Hubs), data serialisation, schema evolution scenarios.
ORC (Optimized Row Columnar)
Similar to Parquet — a columnar binary format optimised for Hadoop-based analytics. You may see it mentioned alongside Parquet.
Strengths: Excellent compression, optimised for Hive queries. Weaknesses: Less widely adopted outside the Hadoop ecosystem.
When to use: Hadoop or Hive-based analytics workloads.
| Feature | CSV | JSON | Parquet | Avro |
|---|---|---|---|---|
| Structure | Rows & columns (text) | Key-value pairs (text) | Columnar (binary) | Row-based (binary) |
| Human-readable? | Yes | Yes | No | No |
| Schema | Header row only | Self-describing keys | Embedded | Embedded |
| Best for | Simple data exchange | Web APIs, documents | Analytics & data lakes | Streaming & pipelines |
| Compression | Poor | Moderate | Excellent | Good |
Row-based vs columnar storage
This is a key concept for understanding analytics performance:
- Row-based (CSV, Avro): stores all fields of one record together. Fast for reading individual records.
- Columnar (Parquet, ORC): stores all values of one column together. Fast for aggregating a single column across millions of records.
Example: If Priya wants the total sales across 10 million transactions, a columnar format only reads the “price” column — skipping name, date, and everything else. A row-based format would have to read every field of every record.
That’s why data lakes and analytics pipelines commonly use columnar formats like Parquet (or Delta Lake, which builds on Parquet).
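The difference between the two layouts can be sketched in plain Python. This is a toy illustration of the storage idea only, not how Parquet files actually encode data on disk:

```python
# Row-based layout: one tuple per record, all fields together.
row_store = [
    ("Chicken wrap", "2024-05-01", 8.50),
    ("Coffee",       "2024-05-01", 4.00),
    ("Salad",        "2024-05-02", 6.25),
]

# Columnar layout: one list per column, all values of a field together.
column_store = {
    "item":  ["Chicken wrap", "Coffee", "Salad"],
    "date":  ["2024-05-01", "2024-05-01", "2024-05-02"],
    "price": [8.50, 4.00, 6.25],
}

# Total sales, row-based: every record (and so every field) is touched.
total_rows = sum(record[2] for record in row_store)

# Total sales, columnar: only the "price" column is read;
# item names and dates are never loaded at all.
total_cols = sum(column_store["price"])

print(total_rows, total_cols)  # 18.75 18.75
```

Both layouts give the same answer; the columnar one simply reads far less data, which is exactly the saving Priya gets across 10 million records.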
Exam tip: format recognition
The exam may show you a snippet of data and ask which format it is. Quick recognition guide:
- Commas separating values, header row → CSV
- Curly braces and colons → JSON
- Angle brackets with opening/closing tags → XML
- “Binary columnar format for analytics” → Parquet
- “Row-based binary with embedded schema” → Avro
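The guide above can be turned into a small practice helper. `detect_format` is a hypothetical function for self-testing, using only the first characters of a snippet, so it is deliberately rough rather than a real format sniffer:

```python
def detect_format(snippet: str) -> str:
    """Rough format recognition mirroring the exam guide above."""
    text = snippet.lstrip()
    if text.startswith("{") or text.startswith("["):
        return "JSON"   # curly braces (or an array) and colons
    if text.startswith("<"):
        return "XML"    # angle brackets with tags
    if "," in text.splitlines()[0]:
        return "CSV"    # comma-separated values with a header row
    return "unknown"    # binary formats like Parquet/Avro won't look like text

print(detect_format('{"orderId": 1001}'))            # JSON
print(detect_format('<order id="1001"/>'))           # XML
print(detect_format("OrderID,Student\n1001,Aisha"))  # CSV
```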
Knowledge check
Priya needs to store 50 million sales records in a data lake for analytics. The main queries will be 'total sales by store' and 'average price by category.' Which file format should she choose?
Tom receives shipment data from a partner via an API. The response uses curly braces, colons, and key-value pairs. What format is this?
Next up: Databases: Relational vs Non-Relational — now that you know how data lives in files, let’s explore how it lives in databases.