Introduction to lakeFS
Table of contents
What is lakeFS?
lakeFS is distributed as a single binary that contains several logical services. lakeFS offers a git-like version control interface for data lakes so that it gives us the ability to manage our data lake in a manner that is similar to how we manage our code.
lakeFS adapts the best practices from the field of software engineering to the field of data engineering.
There is a need for an improved system that can version huge amounts of data. For team members to work collaboratively and concurrently on a data set that is rapidly evolving, they need a version (snapshot) of that data that is reserved exclusively for their use.
Obtaining data lineage has never been an easy operation, and it is now made much more complicated by the fact that each data set has numerous versions over time. If we did not provide the lineage details, it would be pointless.
lakeFS stores data in an underlying object store (GCS, ABS, S3, or any S3-compatible stores like MinIO or Ceph), with some of its metadata stored in PostgreSQL.
lakeFS provides the ability to branch, merge, rollback, etc.