Category data-management

Data Engineering vs. Data Science

Data engineering and data science are distinct disciplines: neither is a sub-discipline of the other. They complement one another, but neither depends on the other.

Data Product vs. Data as a Product

A data product is not the same as data as a product. A data product uses data to help accomplish the product's goal, whereas in data as...

Data Catalog

A data catalog is an inventory of an organization's data assets that enables rapid and efficient access to the most relevant data.

Data Deluge

As the granularity of data increases, so does its complexity. Eventually we will reach a point where we can no longer handle the volume of fresh data being generated.

Category rust

Rust’s Ownership and Borrowing Enforce Memory Safety

Rust's ownership and borrowing rules prevent whole classes of memory-related bugs at compile time. Rust is a great choice when performance matters, and it solves pain points that plague many other languages.

Category design-patterns-and-coding-principles

SOLID Design Principles

SOLID is a mnemonic acronym for a set of design principles developed for object-oriented programming languages. When correctly applied, these principles make our code more readable, extensible, and maintainable...
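
As a minimal, hedged sketch of one of the five principles, here is the Dependency Inversion Principle in Scala; the `Notifier` and `OrderService` names are purely illustrative and not taken from the article.

```scala
// Dependency Inversion Principle: high-level code depends on an abstraction,
// not on a concrete implementation. All names here are illustrative.
trait Notifier {
  def send(message: String): Unit
}

// Low-level details implement the abstraction.
class EmailNotifier extends Notifier {
  def send(message: String): Unit = println(s"Email: $message")
}

class SmsNotifier extends Notifier {
  def send(message: String): Unit = println(s"SMS: $message")
}

// The high-level service knows nothing about email or SMS specifics,
// so either implementation can be injected without changing this class.
class OrderService(notifier: Notifier) {
  def placeOrder(item: String): Unit =
    notifier.send(s"Order placed for $item")
}

object SolidDemo extends App {
  new OrderService(new EmailNotifier).placeOrder("book")
  new OrderService(new SmsNotifier).placeOrder("laptop")
}
```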

Singleton Pattern

The singleton pattern restricts a class to a single instance.
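
In Scala this is a language feature rather than hand-written boilerplate: an `object` is a lazily initialized singleton. A small sketch (the `AppConfig` name is illustrative):

```scala
// An `object` in Scala is a language-level singleton: exactly one instance
// exists, created lazily on first access.
object AppConfig {
  val environment: String = "dev"   // illustrative setting
  def connectionString: String = s"jdbc:postgresql://localhost/$environment"
}

object SingletonDemo extends App {
  // Every reference points to the same instance.
  println(AppConfig.connectionString)
  println(AppConfig eq AppConfig)   // true: same object identity
}
```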

Anti-Pattern

Anti-patterns seem quick and reasonable at first, but they typically have adverse effects down the road. They are design and code smells. They hurt our software and add...

Category data-streaming

Time Concepts in Data Streaming

Timestamps, particularly event times, are a significant aspect of any data streaming application.
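
As an illustration only (the entry does not name a framework), here is a hedged Scala sketch using Spark Structured Streaming, which groups records by their event-time `timestamp` rather than processing time and tolerates late arrivals via a watermark:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.window

object EventTimeWindows extends App {
  val spark = SparkSession.builder()
    .master("local[*]")
    .appName("event-time")
    .getOrCreate()
  import spark.implicits._

  // The built-in "rate" source emits rows with an event-time `timestamp` column.
  val events = spark.readStream.format("rate").option("rowsPerSecond", 5).load()

  // Group by event time (not processing time): tumbling 10-second windows,
  // accepting events that arrive up to 30 seconds late.
  val counts = events
    .withWatermark("timestamp", "30 seconds")
    .groupBy(window($"timestamp", "10 seconds"))
    .count()

  counts.writeStream
    .outputMode("update")
    .format("console")
    .start()
    .awaitTermination()
}
```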

Category scala

Case Class in Scala

A case class represents immutable data. It is a kind of class commonly used to model and store plain data.
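
A short sketch of what the compiler gives us for free with a case class (the `User` type is illustrative):

```scala
// A case class models immutable data: fields are vals, equality and toString
// are structural, and `copy` produces a modified copy without mutation.
case class User(name: String, age: Int)

object CaseClassDemo extends App {
  val alice = User("Alice", 30)
  val older = alice.copy(age = 31)       // new instance; `alice` is unchanged

  println(alice)                         // User(Alice,30)
  println(alice == User("Alice", 30))    // true: value-based equality
  println(older)                         // User(Alice,31)

  // Pattern matching works out of the box via the generated extractor.
  alice match {
    case User(name, age) => println(s"$name is $age")
  }
}
```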

Defining Variables Using the `def` Keyword in Scala

A `lazy val` is evaluated once, on first access, whereas a `def` is re-evaluated on every call.
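
A minimal sketch that makes the difference visible through side effects:

```scala
object LazyValVsDef extends App {
  // `lazy val`: the body runs once, on first access, and the result is cached.
  lazy val cached: Int = { println("computing lazy val"); 42 }

  // `def`: the body runs on every call.
  def recomputed: Int = { println("computing def"); 42 }

  println(cached)      // prints "computing lazy val" then 42
  println(cached)      // prints 42 only; already evaluated
  println(recomputed)  // prints "computing def" then 42
  println(recomputed)  // prints "computing def" again
}
```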

Category data-security-and-compliance

Data Governance

Data governance is the process of defining security guidelines and policies and ensuring they are followed, by exercising authority and control over how data assets are managed.

Envelope Encryption

Envelope encryption is a way of encrypting plaintext data with a key and then encrypting that key with another key. This strategy is intended not just to make things...
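
A minimal Scala sketch of the idea using `javax.crypto` with AES-GCM; in practice the key-encryption key would live in a KMS or HSM, and all names here are illustrative:

```scala
import javax.crypto.{Cipher, KeyGenerator, SecretKey}
import javax.crypto.spec.GCMParameterSpec
import java.security.SecureRandom

object EnvelopeEncryptionSketch extends App {
  val random = new SecureRandom()

  def newAesKey(): SecretKey = {
    val gen = KeyGenerator.getInstance("AES")
    gen.init(256)
    gen.generateKey()
  }

  // Returns (iv, ciphertext) for the given key and plaintext.
  def encrypt(key: SecretKey, plaintext: Array[Byte]): (Array[Byte], Array[Byte]) = {
    val iv = new Array[Byte](12)
    random.nextBytes(iv)
    val cipher = Cipher.getInstance("AES/GCM/NoPadding")
    cipher.init(Cipher.ENCRYPT_MODE, key, new GCMParameterSpec(128, iv))
    (iv, cipher.doFinal(plaintext))
  }

  // 1. The key-encryption key (KEK) would normally be managed by a KMS/HSM.
  val kek = newAesKey()

  // 2. Generate a fresh data-encryption key (DEK) and encrypt the payload with it.
  val dek = newAesKey()
  val (dataIv, ciphertext) = encrypt(dek, "sensitive record".getBytes("UTF-8"))

  // 3. Encrypt ("wrap") the DEK with the KEK and store the wrapped key alongside
  //    the ciphertext; only the KEK needs to remain secret.
  val (keyIv, wrappedDek) = encrypt(kek, dek.getEncoded)

  println(s"ciphertext bytes: ${ciphertext.length}, wrapped DEK bytes: ${wrappedDek.length}")
}
```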

Category data-engineering

Data Mesh

Data mesh is a "sociotechnical" way to build a decentralized data architecture that enables domain teams to do cross-domain data analysis on their own.

Reverse ETL

Reverse ETL is the process of feeding processed data (gold-layer tables) from a data warehouse or data lake back into business applications, primarily for use in operational analytics domains such...

Introduction to Data Engineering

Data engineering is the process of designing and building systems for gathering vast quantities of raw operational data from a variety of sources and formats, then analyzing, converting, and storing it at scale...

Category apache-spark

Apache Spark Fundamentals

Apache Spark is a unified computing engine for distributed data processing and it has become the de facto tool for any developer or data scientist interested in big data.
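
A minimal Scala sketch of the DataFrame API to accompany this entry; the local master and the sample data are assumptions made purely for illustration:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.avg

object SparkFundamentals extends App {
  // A SparkSession is the entry point to the DataFrame/Dataset API.
  val spark = SparkSession.builder()
    .master("local[*]")          // run locally for this sketch
    .appName("fundamentals")
    .getOrCreate()
  import spark.implicits._

  // A tiny in-memory dataset standing in for real source data.
  val sales = Seq(("books", 12.0), ("books", 20.0), ("games", 55.0))
    .toDF("category", "amount")

  // Transformations are lazy; `show()` is the action that triggers execution.
  sales.groupBy("category").agg(avg("amount").as("avg_amount")).show()

  spark.stop()
}
```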

Shuffling in Apache Spark

Shuffling is the act of redistributing data so that it’s grouped differently across partitions. This typically involves copying data across executors and machines, making the shuffle a complex and costly...
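
A hedged sketch of how to see a shuffle in practice: a wide transformation such as `groupBy` forces rows with the same key onto the same partition, which shows up as an `Exchange` node in the physical plan.

```scala
import org.apache.spark.sql.SparkSession

object ShuffleDemo extends App {
  val spark = SparkSession.builder().master("local[*]").appName("shuffle").getOrCreate()
  import spark.implicits._

  val df = spark.range(0, 1000000).withColumn("key", $"id" % 10)

  // A wide transformation: data must be redistributed across partitions by key.
  val counts = df.groupBy("key").count()
  counts.explain()   // look for "Exchange hashpartitioning(key, ...)" in the plan

  spark.stop()
}
```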

Partitions and Bucketing in Spark

Partitioning and bucketing are used to improve the reading of data by reducing the cost of shuffles, the need for serialization, and the amount of network traffic.
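
A brief Scala sketch of both techniques with the DataFrame writer; the paths, table name, and bucket count are illustrative assumptions:

```scala
import org.apache.spark.sql.{SaveMode, SparkSession}

object PartitionBucketDemo extends App {
  val spark = SparkSession.builder().master("local[*]").appName("layout").getOrCreate()
  import spark.implicits._

  val orders = Seq(
    ("2024-01-01", "alice", 10.0),
    ("2024-01-01", "bob", 25.0),
    ("2024-01-02", "alice", 40.0)
  ).toDF("order_date", "customer", "amount")

  // Partitioning: one directory per order_date, so date filters read only
  // the matching directories instead of the whole dataset.
  orders.write.mode(SaveMode.Overwrite)
    .partitionBy("order_date")
    .parquet("/tmp/orders_partitioned")

  // Bucketing: pre-hash rows by customer into a fixed number of files, so a
  // later join or aggregation on customer can avoid a full shuffle.
  // (bucketBy requires saveAsTable, i.e. a catalog-backed table.)
  orders.write.mode(SaveMode.Overwrite)
    .bucketBy(8, "customer")
    .sortBy("customer")
    .saveAsTable("orders_bucketed")

  spark.stop()
}
```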

Need for Caching in Apache Spark

Caching is one of Spark's optimization strategies for reusing computations. It stores interim and partial results so they can be reused in subsequent computation stages.
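
A small sketch showing when caching pays off: an intermediate result reused by more than one action (dataset and column names are illustrative).

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

object CachingDemo extends App {
  val spark = SparkSession.builder().master("local[*]").appName("caching").getOrCreate()
  import spark.implicits._

  // An intermediate result that several downstream queries reuse.
  val enriched = spark.range(0, 1000000)
    .withColumn("bucket", $"id" % 100)
    .groupBy("bucket").count()

  // Without caching, each action below would recompute the whole lineage.
  enriched.persist(StorageLevel.MEMORY_AND_DISK)   // or simply enriched.cache()

  println(enriched.count())                  // first action materializes the cache
  enriched.orderBy($"count".desc).show(5)    // served from the cached data

  enriched.unpersist()                       // free the memory when done
  spark.stop()
}
```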

Category orchestration

How To Set SLA in Apache Airflow

Apache Airflow enables us to schedule tasks as code. In Airflow, an SLA defines the maximum completion time for a task or DAG. Note that SLAs are established based on...

Category data-files-and-formats

Getting to Know the Parquet File Format

Parquet is an open-source file format for the Hadoop ecosystem that provides columnar storage and is built from the ground up with complex nested data structures in mind.
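
As a hedged illustration using Spark's built-in Parquet support (the schema and output path are made up for the example):

```scala
import org.apache.spark.sql.{SaveMode, SparkSession}

// Nested structures map naturally onto Parquet's columnar layout.
case class Address(city: String, country: String)
case class Person(name: String, age: Int, address: Address)

object ParquetDemo extends App {
  val spark = SparkSession.builder().master("local[*]").appName("parquet").getOrCreate()
  import spark.implicits._

  val people = Seq(
    Person("Alice", 30, Address("Amsterdam", "NL")),
    Person("Bob", 41, Address("Berlin", "DE"))
  ).toDS()

  people.write.mode(SaveMode.Overwrite).parquet("/tmp/people.parquet")

  // Columnar storage means only the requested columns are read from disk.
  spark.read.parquet("/tmp/people.parquet")
    .select("name", "address.city")
    .show()

  spark.stop()
}
```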

Category data-lake-and-lakehouse

Introduction to Data Lake

A data lake is a centralized repository that holds large amounts of data from many sources in a flexible, natural, or raw format for analytical use.