Troubleshoot Notebooks & SQL
Identify and resolve Spark notebook errors and T-SQL failures: OOM errors, data skew, schema mismatches, query timeouts, and debugging techniques.
Notebook errors
Think of a Spark notebook as a team of workers processing data.
Errors happen when: a worker runs out of desk space (OOM, out of memory), one worker gets all the heavy files while others sit idle (data skew), the input data doesn't match expectations (schema mismatch), or the instructions themselves are wrong (code error).
Common notebook errors
| Error | Cause | Fix |
|---|---|---|
| java.lang.OutOfMemoryError | Dataset too large for driver/executor memory | Increase pool size, reduce data with filters before collect(), avoid collect() on large DataFrames |
| AnalysisException: cannot resolve column | Column name doesn't exist (typo or schema change) | Check column names with df.printSchema(), verify source data |
| Data skew (one task takes 10x longer) | One partition key has far more data than others | Repartition data, use salting technique, or broadcast smaller table |
| Py4JJavaError with NullPointerException | Null values in a column used for operations | Filter nulls before processing, use coalesce() or fillna() |
| SchemaConflictException on write | DataFrame schema doesn't match existing Delta table | Use mergeSchema option or fix DataFrame to match |
| Cluster startup timeout | No available capacity for Spark nodes | Wait and retry, use starter pool, or request capacity increase |
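The salting fix for skew can be hard to picture. A minimal pure-Python sketch (the key names, row counts, and partition count are illustrative, not from any real workload) shows how appending a random suffix to a hot key spreads its rows across partitions instead of piling them onto one:

```python
import random

# One "hot" key dominates the data, so hash-partitioning sends almost
# every row to a single partition -- the skew problem.
rows = ["hot_key"] * 900 + [f"key_{i}" for i in range(100)]
num_partitions = 8

def partition_of(key: str) -> int:
    # Stand-in for Spark's hash partitioner.
    return hash(key) % num_partitions

skewed = [0] * num_partitions
for key in rows:
    skewed[partition_of(key)] += 1

# Salting: append a random suffix so "hot_key" becomes hot_key_0..hot_key_7.
# (In a real join, the other side must be expanded with every salt value.)
salted = [0] * num_partitions
for key in rows:
    salted_key = f"{key}_{random.randrange(num_partitions)}"
    salted[partition_of(salted_key)] += 1

print("busiest partition before salting:", max(skewed))  # ~900 rows in one place
print("busiest partition after salting:", max(salted))   # load is spread out
```

The trade-off is extra shuffle work and a duplicated small side of the join, which is why broadcasting the smaller table is often tried first.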
Scenario: Carlos debugs an OOM error
Carlos's transformation notebook crashes with OutOfMemoryError on the driver. He investigates:
- The line that failed: result = df_500m_rows.collect(), which collects all 500M rows to the driver!
- Root cause: collect() pulls the entire distributed DataFrame into the single driver node's memory
- Fix: replace collect() with .write.format("delta").save() to write directly to the lakehouse without pulling data to the driver
Rule: Never collect() large DataFrames. Write to Delta tables or use show(20) to preview.
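The memory difference behind that rule can be felt even in plain Python: materializing every row in one process (what collect() does on the driver) costs memory proportional to the data, while streaming (analogous to executors writing partition-by-partition to Delta) holds only a cursor. A rough sketch, with sizes that are illustrative:

```python
import sys

# Materializing all rows at once: memory grows with the data.
# This is how collect() overwhelms the driver node.
rows = list(range(1_000_000))

# Streaming one row at a time: a constant, tiny footprint,
# analogous to letting the distributed write path handle output.
stream = (r for r in range(1_000_000))

print(f"list container for 1M ints: ~{sys.getsizeof(rows) // 1_000_000} MB")
print(f"generator object: {sys.getsizeof(stream)} bytes, regardless of row count")
```

The same intuition explains why show(20) is safe: it only materializes a handful of rows on the driver.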
Common T-SQL errors
| Error | Cause | Fix |
|---|---|---|
| Query timeout | Complex query exceeds time limit | Optimize query (add WHERE filters, simplify joins), check for missing statistics |
| Insufficient permissions | User lacks READ/WRITE on table | Grant appropriate permissions (ReadAll for queries, Contributor role for writes) |
| Invalid object name | Table or view doesn't exist (typo, wrong schema) | Verify object name and schema: use SELECT * FROM INFORMATION_SCHEMA.TABLES |
| Data type conversion failed | INSERT/UPDATE with incompatible types | Cast data explicitly: CAST(column AS DECIMAL(10,2)) |
| Deadlock | Two queries blocking each other | Review query execution plans, reduce transaction scope, retry with backoff |
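The "retry with backoff" fix for deadlocks follows a standard pattern: catch the transient failure, sleep an exponentially growing (and slightly randomized) interval, and re-run. A minimal Python sketch, where run_query and DeadlockError are hypothetical stand-ins for your database client and its deadlock-victim error (e.g. SQL Server error 1205):

```python
import random
import time

class DeadlockError(Exception):
    """Hypothetical stand-in for the driver's deadlock-victim error."""

def with_retries(run_query, max_attempts=4, base_delay=0.1):
    """Re-run a query on deadlock, sleeping base_delay * 2**attempt plus jitter."""
    for attempt in range(max_attempts):
        try:
            return run_query()
        except DeadlockError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error
            # Jitter desynchronizes competing sessions so they stop colliding.
            time.sleep(base_delay * (2 ** attempt) + random.uniform(0, base_delay))

# Simulated query that is chosen as the deadlock victim twice, then succeeds.
calls = {"n": 0}
def flaky_query():
    calls["n"] += 1
    if calls["n"] < 3:
        raise DeadlockError("transaction was chosen as deadlock victim")
    return "42 rows affected"

print(with_retries(flaky_query, base_delay=0.01))
```

Backoff only treats the symptom; if the same two queries deadlock repeatedly, shrinking the transaction scope or touching tables in a consistent order is the durable fix.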
Debugging techniques
| Technique | Tool | When to Use |
|---|---|---|
| Spark UI | Built into notebook | Investigate slow stages, data skew, shuffle metrics |
| df.printSchema() | PySpark | Verify column names and types before operations |
| df.show(5) | PySpark | Preview data at each transformation step |
| EXPLAIN | T-SQL / Spark SQL | View query execution plan |
| Cell-by-cell execution | Notebook | Isolate which transformation step fails |
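"Cannot resolve column" errors are cheapest to catch before the failing transformation runs. A small defensive check, sketched here in plain Python over a list of column names (in PySpark you would pass df.columns), is a hypothetical helper that captures the cell-by-cell debugging habit:

```python
def require_columns(actual_columns, required):
    """Fail fast with a clear message if any required column is missing."""
    missing = sorted(set(required) - set(actual_columns))
    if missing:
        raise ValueError(
            f"cannot resolve column(s) {missing}; "
            f"available: {sorted(actual_columns)}"
        )

# Example: the Delta table was written with 'total_amount', not 'order_total'.
cols = ["order_id", "customer_id", "total_amount"]
require_columns(cols, ["order_id"])          # passes silently
try:
    require_columns(cols, ["order_total"])   # raises, naming the missing column
except ValueError as e:
    print(e)
```

Running a check like this in its own cell, right after loading the table, isolates schema problems from transformation bugs.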
A PySpark notebook fails with 'AnalysisException: cannot resolve column order_total.' The DataFrame was loaded from a Delta table. What should the engineer check first?
A T-SQL query in a Fabric warehouse times out after 10 minutes. The query joins two large tables without WHERE filters. What is the best first step?
🎬 Video coming soon
Next up: Troubleshoot Streaming & Shortcuts, resolving Eventhouse, Eventstream, and OneLake shortcut errors.