Thanks for dropping by! My name is Senthil Nayagan. I am a seasoned IT professional, and I've been a Data Engineer for the
last few years. I enjoy working with OSS, and I really appreciate the freedom it provides. You can learn more about me.
This is my personal blog, where I write about my personal thoughts, design and coding principles, and a variety of topics
related to data engineering and analytics, which is my primary area of interest. If you like my posts and don't want to miss any of
them, sign up for my newsletter.
Apache Pinot is a real-time, distributed OLAP datastore built for low-latency, high-throughput analytics, which makes it well suited to user-facing analytical workloads. Pinot works together with Kafka and Presto to deliver these analytics.
Access modifiers, also known as access specifiers, determine the accessibility and scope of classes, methods, and other members. Scala's access modifiers closely resemble those of Java, although they provide more granular and powerful visibility control than Java.
Amazon EMR is a managed cluster platform that makes it easier to run big data frameworks like Apache Hadoop and Apache Spark on AWS to process and analyze huge amounts of data.
An algorithm is a series of instructions in a particular order for performing a specific task.
Terraform is an open-source infrastructure-as-code tool that allows us to programmatically provision the infrastructure resources required for an application to run.
AWS CLI is an open-source tool that allows us to interact with AWS services using command-line shell commands.
A data product is not the same as data as a product. A data product uses data to help achieve the product's goal, whereas in data as a product, the data itself is the actual product.
Data governance is the process of defining security guidelines and policies and enforcing them through authority and control over how data assets are managed.
A data catalog is an ordered inventory of an organization's data assets that makes it easy to find the most relevant data quickly.
Apache Parquet is an open source file format for Hadoop that provides columnar storage and is built from the ground up with complex nested data structures in mind.
Partitioning and bucketing are used to speed up data reads by reducing the cost of shuffles, the need for serialization, and the amount of network traffic.
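To make the idea concrete, here is a minimal pure-Python sketch of both techniques. It is only an illustration, not how any particular engine implements them: the helper names (`bucket_of`, `bucketize`, `partition_by`), the sample rows, and the choice of CRC32 as the hash are all assumptions made for this example. The point is that rows with the same key always land in the same bucket number, so two datasets bucketed the same way can be joined bucket by bucket without moving rows around, and a partitioned dataset lets a reader skip everything outside the partition it needs.

```python
import zlib
from collections import defaultdict


def bucket_of(key: str, num_buckets: int) -> int:
    """Map a key to a bucket number using a deterministic hash (CRC32 here)."""
    return zlib.crc32(key.encode("utf-8")) % num_buckets


def bucketize(rows, key_field, num_buckets):
    """Group rows into buckets by the hash of one column."""
    buckets = defaultdict(list)
    for row in rows:
        buckets[bucket_of(row[key_field], num_buckets)].append(row)
    return buckets


def partition_by(rows, field):
    """Group rows by the literal value of one column (one group per value)."""
    parts = defaultdict(list)
    for row in rows:
        parts[row[field]].append(row)
    return parts


# Hypothetical sample data.
users = [
    {"user_id": "u1", "country": "IN"},
    {"user_id": "u2", "country": "US"},
    {"user_id": "u3", "country": "IN"},
]
orders = [
    {"user_id": "u1", "amount": 10},
    {"user_id": "u3", "amount": 25},
    {"user_id": "u1", "amount": 5},
]

NUM_BUCKETS = 4
user_buckets = bucketize(users, "user_id", NUM_BUCKETS)
order_buckets = bucketize(orders, "user_id", NUM_BUCKETS)

# Bucket-by-bucket join: each order only looks at users in the same bucket,
# so no row has to be compared against (or shipped to) another bucket.
joined = []
for bucket_no, bucket_orders in order_buckets.items():
    users_by_id = {u["user_id"]: u for u in user_buckets.get(bucket_no, [])}
    for order in bucket_orders:
        joined.append({**order, **users_by_id[order["user_id"]]})

# Partition pruning: a query for one country reads only that partition.
india_only = partition_by(users, "country")["IN"]
```

Partitioning suits columns with few distinct values (such as country or date), while bucketing suits high-cardinality join keys (such as a user ID), which is why the sketch applies them to different columns.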
Apache Airflow enables us to schedule tasks as code. In Airflow, an SLA determines the maximum completion time for a task or DAG. Note that SLAs are established based on the DAG execution date, not the task start time.
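The note about the execution date is easy to misread, so here is a small sketch of the arithmetic in plain Python. It does not use Airflow itself, and the functions `sla_deadline` and `misses_sla` are hypothetical helpers written for this illustration; Airflow's real scheduler logic has more detail than this. It only shows the consequence of anchoring the clock to the DAG run rather than to the task: a task that waits a long time before starting can miss its SLA even though it ran for far less than the SLA duration.

```python
from datetime import datetime, timedelta


def sla_deadline(dag_execution_date: datetime, sla: timedelta) -> datetime:
    """Simplified deadline: measured from the DAG execution date,
    not from the moment the task itself started."""
    return dag_execution_date + sla


def misses_sla(dag_execution_date: datetime, sla: timedelta, task_end: datetime) -> bool:
    """True if the task finished after the (simplified) SLA deadline."""
    return task_end > sla_deadline(dag_execution_date, sla)


# Hypothetical run: the DAG's execution date is midnight, with a one-hour SLA.
execution_date = datetime(2023, 1, 1, 0, 0)
sla = timedelta(hours=1)

# The task waits 50 minutes on upstream work, then runs for only 20 minutes.
task_start = execution_date + timedelta(minutes=50)
task_end = task_start + timedelta(minutes=20)

ran_for = task_end - task_start                      # 20 minutes of actual work
missed = misses_sla(execution_date, sla, task_end)   # finished at 01:10, deadline 01:00
```

Here the task itself took 20 minutes, well under the one-hour SLA, yet it still counts as a miss because it finished 70 minutes after the execution date.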