
DP-900 Study Guide

Domain 1: Core Data Concepts

  • Your First Look at Data Free
  • Data File Formats: CSV, JSON, Parquet & More Free
  • Databases: Relational vs Non-Relational Free
  • Transactional Workloads: Keeping Data Consistent Free
  • Analytical Workloads: Finding the Insights Free
  • Data Roles: DBA, Engineer & Analyst Free
  • The Azure Data Landscape Free

Domain 2: Relational Data on Azure

  • Relational Data: Tables, Keys & Relationships
  • Normalization: Why Duplicate Data is Bad
  • SQL Basics: SELECT, INSERT, UPDATE, DELETE
  • Database Objects: Views, Indexes & More
  • Azure SQL: Your Database in the Cloud
  • Open-Source Databases on Azure
  • Choosing the Right Azure Database

Domain 3: Non-Relational Data on Azure

  • Azure Blob Storage: Files in the Cloud
  • Azure Files & Table Storage
  • Azure Cosmos DB: The Global Database
  • Cosmos DB APIs: SQL, MongoDB & More
  • Choosing Non-Relational Storage

Domain 4: Analytics on Azure

  • Data Ingestion & Processing
  • Analytical Data Stores: Data Lakes, Warehouses & Lakehouses
  • Microsoft Fabric & Azure Databricks
  • Batch vs Streaming: Two Speeds of Data
  • Real-Time Analytics on Azure
  • Power BI: See Your Data
  • Data Models in Power BI
  • Choosing the Right Visualization

Domain 1: Core Data Concepts · Free · ⏱ ~10 min read

Data File Formats: CSV, JSON, Parquet & More

Data has to live somewhere. Before it goes into a database, it's often stored in files. Let's explore the most common data file formats and when to use each one.

Why do file formats matter?

☕ Simple explanation

Think of file formats like containers for shipping goods.

You wouldn’t ship loose apples in an envelope. You’d use a crate (structured), a bag with labels on each apple (semi-structured), or a sealed box for fragile items (unstructured).

Data file formats work the same way. Some are great for spreadsheets, some for web apps, and some for massive analytics jobs. Picking the right format makes data easier to store, share, and process.

Before data enters a database, it’s typically exchanged or archived as files. The file format determines how data is encoded, compressed, and organised on disk. This affects storage cost, read/write speed, compatibility with processing tools, and whether the data preserves its schema.

For the DP-900 exam, you need to recognise the most common file formats and understand which types of data they best represent.

Common data file formats

CSV (Comma-Separated Values)

The simplest structured file format. Each line is a row, values are separated by commas, and the first line is usually the header.

OrderID,Student,Item,Price
1001,Aisha,Chicken wrap,8.50
1002,Liam,Coffee,4.00

Strengths: Human-readable, universally supported, easy to create from spreadsheets. Weaknesses: No data types (everything is text), no hierarchy, large files can be slow to process.

When to use: Simple data exchange, spreadsheet exports, small datasets.
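To see what "everything is text" means in practice, here is a minimal sketch that parses the café-order CSV above with Python's standard csv module (the in-memory string stands in for a real file):

```python
import csv
import io

# The café orders from the example above, as raw CSV text
raw = """OrderID,Student,Item,Price
1001,Aisha,Chicken wrap,8.50
1002,Liam,Coffee,4.00"""

# DictReader uses the header row as the keys for each record
reader = csv.DictReader(io.StringIO(raw))
orders = list(reader)

# Every value comes back as a string -- CSV has no data types,
# so numeric fields must be converted explicitly
total = sum(float(row["Price"]) for row in orders)

print(orders[0]["Student"])  # Aisha
print(total)                 # 12.5
```

Note that `row["Price"]` is the string `"8.50"`, not a number, until the code converts it. That conversion step is exactly the weakness the section describes.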

JSON (JavaScript Object Notation)

The most popular semi-structured format. Data is organised as key-value pairs, with support for nesting and arrays.

{
  "orderId": 1001,
  "student": "Aisha",
  "items": [
    { "name": "Chicken wrap", "price": 8.50 }
  ]
}

Strengths: Flexible schema, human-readable, natively understood by web apps and APIs. Weaknesses: Verbose (lots of repeated keys), not great for very large datasets.

When to use: Web APIs, configuration files, document databases like Cosmos DB.
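Unlike CSV, JSON preserves types and nesting when parsed. A short sketch using Python's standard json module on the order document above:

```python
import json

raw = '''{
  "orderId": 1001,
  "student": "Aisha",
  "items": [
    { "name": "Chicken wrap", "price": 8.50 }
  ]
}'''

# json.loads maps JSON onto native Python types:
# objects -> dicts, arrays -> lists, numbers -> int/float
order = json.loads(raw)

print(order["student"])                 # Aisha
print(order["items"][0]["price"])       # 8.5
print(type(order["orderId"]).__name__)  # int -- types survive, unlike CSV
```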

XML (Extensible Markup Language)

An older semi-structured format that uses opening and closing tags. More verbose than JSON but supports attributes, namespaces, and formal schemas.

<order id="1001">
  <student>Aisha</student>
  <item price="8.50">Chicken wrap</item>
</order>

Strengths: Formal schema validation (XSD), widely used in enterprise systems and SOAP APIs. Weaknesses: Very verbose, harder to read than JSON, declining in popularity.

When to use: Legacy enterprise systems, formal data contracts, SOAP web services.
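XML's attributes and element text are accessed differently, which the snippet above illustrates. A minimal sketch with Python's standard xml.etree module:

```python
import xml.etree.ElementTree as ET

raw = """<order id="1001">
  <student>Aisha</student>
  <item price="8.50">Chicken wrap</item>
</order>"""

root = ET.fromstring(raw)

# Attributes are read with .get(); element content lives in .text
print(root.get("id"))                  # 1001
print(root.find("student").text)       # Aisha
print(root.find("item").get("price"))  # 8.50
```

Note that, as with CSV, everything comes back as a string; schema validation (XSD) is what gives XML its formal typing, and that needs third-party libraries in Python.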

Parquet

A columnar binary format designed for analytics. Instead of storing data row by row, Parquet stores it column by column — which makes aggregation queries (SUM, AVG, COUNT) dramatically faster.

Strengths: Excellent compression, fast analytical queries, supports complex types, widely used in big data. Weaknesses: Not human-readable (binary), overkill for small datasets.

When to use: Data lakes, analytics pipelines, Microsoft Fabric lakehouses, Azure Databricks.

Avro

A row-based binary format that embeds the schema with the data. Popular in streaming and data pipeline scenarios.

Strengths: Compact, schema included in every file, great for streaming data. Weaknesses: Not human-readable, less common in analytics than Parquet.

When to use: Event streaming (e.g., Azure Event Hubs), data serialisation, schema evolution scenarios.

ORC (Optimized Row Columnar)

Similar to Parquet — a columnar binary format optimised for Hadoop-based analytics. You may see it mentioned alongside Parquet.

Strengths: Excellent compression, optimised for Hive queries. Weaknesses: Less widely adopted outside the Hadoop ecosystem.

When to use: Hadoop or Hive-based analytics workloads.

Common data file formats compared

| Feature         | CSV                   | JSON                   | Parquet                | Avro                  |
|-----------------|-----------------------|------------------------|------------------------|-----------------------|
| Structure       | Rows & columns (text) | Key-value pairs (text) | Columnar (binary)      | Row-based (binary)    |
| Human-readable? | Yes                   | Yes                    | No                     | No                    |
| Schema          | Header row only       | Self-describing keys   | Embedded               | Embedded              |
| Best for        | Simple data exchange  | Web APIs, documents    | Analytics & data lakes | Streaming & pipelines |
| Compression     | Poor                  | Moderate               | Excellent              | Good                  |

ℹ️ Row-based vs columnar storage

This is a key concept for understanding analytics performance:

  • Row-based (CSV, Avro): stores all fields of one record together. Fast for reading individual records.
  • Columnar (Parquet, ORC): stores all values of one column together. Fast for aggregating a single column across millions of records.

Example: If Priya wants the total sales across 10 million transactions, a columnar format only reads the “price” column — skipping name, date, and everything else. A row-based format would have to read every field of every record.

That’s why data lakes and analytics pipelines commonly use columnar formats like Parquet (or Delta Lake, which builds on Parquet).
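The two layouts are easy to see in plain Python. This is a toy illustration of the idea (not Parquet itself -- real columnar files add compression and encoding on top):

```python
# Toy illustration of row-based vs columnar layout (not real Parquet)

# Row-based: each record keeps all of its fields together
rows = [
    {"name": "Aisha", "date": "2024-01-05", "price": 8.50},
    {"name": "Liam",  "date": "2024-01-05", "price": 4.00},
    {"name": "Priya", "date": "2024-01-06", "price": 6.25},
]

# Columnar: each column keeps all of its values together
columns = {
    "name":  ["Aisha", "Liam", "Priya"],
    "date":  ["2024-01-05", "2024-01-05", "2024-01-06"],
    "price": [8.50, 4.00, 6.25],
}

# Row-based total: walks every record, dragging along unused fields
row_total = sum(record["price"] for record in rows)

# Columnar total: reads only the one column it needs
col_total = sum(columns["price"])

print(row_total, col_total)  # 18.75 18.75
```

Both layouts give the same answer; the difference is how much data had to be read to get it, which is what matters at 10 million records.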

💡 Exam tip: format recognition

The exam may show you a snippet of data and ask which format it is. Quick recognition guide:

  • Commas separating values, header row → CSV
  • Curly braces and colons → JSON
  • Angle brackets with opening/closing tags → XML
  • “Binary columnar format for analytics” → Parquet
  • “Row-based binary with embedded schema” → Avro
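The recognition guide above can be sketched as a naive detector. This is a heuristic for practising the exam pattern only, not how real tools work -- they inspect binary magic bytes instead (Parquet files, for example, begin and end with the bytes `PAR1`):

```python
def guess_format(snippet: str) -> str:
    """Naive text-format guesser mirroring the recognition guide.
    Heuristic only: real tools check binary signatures, not prose rules."""
    s = snippet.lstrip()
    if s.startswith("{") or s.startswith("["):
        return "JSON"  # curly braces/brackets and colons
    if s.startswith("<"):
        return "XML"   # angle brackets with opening/closing tags
    first_line = s.splitlines()[0] if s else ""
    if "," in first_line:
        return "CSV"   # comma-separated values with a header row
    return "unknown"

print(guess_format('{"orderId": 1001}'))            # JSON
print(guess_format("<order id='1001'/>"))           # XML
print(guess_format("OrderID,Student\n1001,Aisha"))  # CSV
```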

Flashcards

Question

What is the key difference between Parquet and CSV?

Answer

Parquet is a binary columnar format optimised for analytics (fast aggregations, great compression). CSV is a text-based row format for simple data exchange (human-readable but slow for large datasets).

Question

Why is JSON considered semi-structured?

Answer

JSON uses key-value pairs that are self-describing (each value has a label). Different records can have different fields — there's no rigid schema like a database table.

Question

When would you choose Avro over Parquet?

Answer

Choose Avro for streaming and data pipeline scenarios where you need row-based access and embedded schema. Choose Parquet for analytics queries where you need fast column aggregations.

Knowledge check

  • Priya needs to store 50 million sales records in a data lake for analytics. The main queries will be 'total sales by store' and 'average price by category.' Which file format should she choose?
  • Tom receives shipment data from a partner via an API. The response uses curly braces, colons, and key-value pairs. What format is this?

🎬 Video coming soon

Next up: Databases: Relational vs Non-Relational — now that you know how data lives in files, let’s explore how it lives in databases.



© 2026 Sutheesh. All rights reserved.

Guided is an independent study resource and is not affiliated with, endorsed by, or officially connected to Microsoft. Microsoft, Azure, and related trademarks are property of Microsoft Corporation. Always verify information against Microsoft Learn.