Storing Raw Data
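Data lake: central repository that holds raw data in its native format (structured, semi-structured, or unstructured) until it is needed.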
Architecture
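Ingestion: batch or streaming. Storage: object storage (e.g., Amazon S3, Google Cloud Storage). Processing: compute and query engines such as Spark and Presto.
Format: columnar formats such as Parquet and ORC for analytics. Delta Lake adds ACID transactions on top of Parquet files.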
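A minimal sketch of how ingested raw files might be laid out in object storage. It assumes Hive-style date partitioning (`dt=YYYY-MM-DD`), a common convention that the notes above do not prescribe. The `raw/` prefix, the `raw_key` helper, and the source and file names are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def raw_key(source: str, event_time: datetime, filename: str) -> str:
    """Build an object key for a raw file, partitioned by UTC date.

    Assumption: a 'raw/<source>/dt=<date>/<file>' layout. This is only
    one possible convention, not a required one.
    """
    # Normalize to UTC so one event always lands in the same partition.
    day = event_time.astimezone(timezone.utc).strftime("%Y-%m-%d")
    return f"raw/{source}/dt={day}/{filename}"


# Batch file whose timestamp is already in UTC.
utc_key = raw_key(
    "orders",
    datetime(2024, 3, 5, 23, 30, tzinfo=timezone.utc),
    "part-0001.json",
)

# Same wall-clock time at UTC-2 falls on the next UTC day.
offset_key = raw_key(
    "orders",
    datetime(2024, 3, 5, 23, 30, tzinfo=timezone(timedelta(hours=-2))),
    "part-0002.json",
)

print(utc_key)
print(offset_key)
```

Partitioning keys by date lets engines such as Spark or Presto skip partitions that a query does not need, which is one reason this kind of layout is popular.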
Patterns
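Lakehouse: warehouse-style management (ACID transactions, schema enforcement) on top of low-cost data lake storage. Data mesh: domain-oriented data ownership with federated governance. Data fabric: unified, metadata-driven layer that connects data across systems.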
Governance
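Schema on read (structure applied at query time, typical of lakes) vs schema on write (structure enforced at load time, typical of warehouses). A catalog is essential for finding and understanding datasets. A metadata layer that tracks schemas, locations, and ownership keeps the lake usable as it grows.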
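The difference between schema on write and schema on read can be sketched in plain Python. This is an illustration only: `SCHEMA`, `write_validated`, and `read_with_schema` are hypothetical names, and real engines do far more (nested types, nullability rules, type promotion).

```python
import json

# Hypothetical schema: field name -> Python type.
SCHEMA = {"id": int, "amount": float}


def write_validated(record: dict, schema: dict) -> str:
    """Schema on write: reject non-conforming records before storing."""
    for field, typ in schema.items():
        if field not in record:
            raise ValueError(f"missing field: {field}")
        if not isinstance(record[field], typ):
            raise ValueError(f"wrong type for field: {field}")
    return json.dumps(record)


def read_with_schema(raw_line: str, schema: dict) -> dict:
    """Schema on read: raw data is stored as-is and shaped at query time.

    Fields outside the schema are ignored; fields missing from the raw
    record come back as None.
    """
    raw = json.loads(raw_line)
    result = {}
    for field, typ in schema.items():
        value = raw.get(field)
        result[field] = typ(value) if value is not None else None
    return result


# Raw line stored untouched: string-typed values plus an extra field.
stored = '{"id": "7", "amount": "19.5", "note": "extra"}'
shaped = read_with_schema(stored, SCHEMA)

# An older record written before "amount" existed still reads cleanly.
older = read_with_schema('{"id": 1}', SCHEMA)

# The same string-typed record fails under schema on write.
try:
    write_validated({"id": "7", "amount": 19.5}, SCHEMA)
    rejected = False
except ValueError:
    rejected = True

print(shaped, older, rejected)
```

Schema on read keeps ingestion cheap and tolerant of change, but pushes validation cost and surprises onto every reader. That trade-off is why a catalog and metadata layer matter.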
Key Takeaways
- Store raw data in native formats
- Lakehouse combines lake + warehouse
- Schema evolution needs management