Tuesday, January 2, 2024

Apache Kafka overview

Apache Kafka is a distributed event streaming platform designed to handle high volume real-time data feeds. It’s highly scalable, durable, and fault-tolerant. Here’s a brief overview of its architecture and components:

Brokers

A Kafka cluster consists of one or more servers known as brokers, which store messages and serve them to producers and consumers. A single broker can hold terabytes of messages, although the throughput and latency you actually see depend on hardware, configuration, and workload.

Topics

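Topics are named streams of messages and the primary way data is organized in Kafka. They’re loosely comparable to tables in a database in that they categorize data, but a topic is an append-only log rather than a set of rows you update. Producers write data to topics and consumers read from them.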

Partitions

Each topic in Kafka is split into one or more partitions. Partitions allow for data to be parallelized across the Kafka cluster, enabling greater scalability.
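
To make the idea of spreading a topic across partitions more concrete, here is a minimal sketch of key-based partitioning. It assumes a CRC32 hash as a stand-in (Kafka's own default partitioner uses a different hash), and the partition count, keys, and function name are made up for illustration.

```python
import zlib

NUM_PARTITIONS = 3  # hypothetical partition count for one topic


def choose_partition(key: str, num_partitions: int) -> int:
    """Map a message key to a partition index.

    CRC32 is only a stand-in hash for this sketch; it is not the hash
    Kafka's default partitioner uses.
    """
    return zlib.crc32(key.encode("utf-8")) % num_partitions


# One in-memory list per partition, standing in for the partition logs.
partitions = {p: [] for p in range(NUM_PARTITIONS)}

messages = [("user-1", "login"), ("user-2", "login"), ("user-1", "logout")]
for key, value in messages:
    partitions[choose_partition(key, NUM_PARTITIONS)].append((key, value))
```

Because the partition is derived from the key, every message for "user-1" lands in the same partition, which is what lets different partitions be handled in parallel without mixing up one key's messages.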

Producers and Consumers

Producers publish data to Kafka topics, while consumers read this data. Kafka ensures that data within each partition is consumed in the order it was produced.
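
The per-partition ordering can be pictured as an append-only log that consumers read by offset. The sketch below is an in-memory model only, not the Kafka client API; the `PartitionLog` class and its methods are hypothetical names.

```python
class PartitionLog:
    """Hypothetical in-memory model of a single partition."""

    def __init__(self):
        self._records = []

    def append(self, value) -> int:
        """Append a record and return the offset it was written at."""
        self._records.append(value)
        return len(self._records) - 1

    def read(self, offset: int, max_records: int = 10) -> list:
        """Return up to max_records records starting at the given offset."""
        return self._records[offset:offset + max_records]


# Producer side: records are appended in the order they are produced.
log = PartitionLog()
for event in ["created", "paid", "shipped"]:
    log.append(event)

# Consumer side: read in small batches and advance the offset after each one.
offset = 0
consumed = []
while True:
    batch = log.read(offset, max_records=2)
    if not batch:
        break
    consumed.extend(batch)
    offset += len(batch)
```

The consumer only ever moves its offset forward, so it sees the records in exactly the order they were appended.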

Scalability

Kafka is highly scalable, both horizontally (adding more machines) and vertically (adding more power to existing machines), accommodating growing data needs without sacrificing performance.

High Availability

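Kafka achieves high availability mainly through replication: each partition can be copied to several brokers, and if the broker leading a partition fails, another replica can take over. With a suitable replication factor, reads and writes stay available through most single-broker failures.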

Security

Kafka supports authentication, authorization, and encryption to keep data secure. Audit trails can be built from the authorizer's logs, and some distributions add dedicated audit logging on top.

Kafka Streams

Kafka Streams is a client library for building applications and microservices where the input and output data are stored in Kafka clusters. It simplifies application development by leveraging Kafka’s native capabilities. It is very much like RxJS's Observables: both take a stream-based, operator-driven approach to handling data that changes over time.
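
The stream-based style can be illustrated with plain Python generators. This is only an analogy for chaining filter and map operators over a stream of records; it is not the Kafka Streams API (which is a Java library), and the record fields and function names are invented for the example.

```python
from typing import Callable, Iterable, Iterator


def filter_stream(records: Iterable[dict],
                  predicate: Callable[[dict], bool]) -> Iterator[dict]:
    """Pass through only the records that match the predicate."""
    for record in records:
        if predicate(record):
            yield record


def map_stream(records: Iterable[dict], fn: Callable[[dict], object]) -> Iterator:
    """Transform each record as it flows through."""
    for record in records:
        yield fn(record)


# Hypothetical input stream of page-view events.
page_views = [
    {"user": "alice", "page": "/home"},
    {"user": "bob", "page": "/checkout"},
    {"user": "alice", "page": "/checkout"},
]

# Chain the operators: keep checkout views, then project the user name.
checkouts = filter_stream(page_views, lambda r: r["page"] == "/checkout")
users = list(map_stream(checkouts, lambda r: r["user"]))
```

Each operator consumes one stream and produces another, which is the same mental model as piping operators on an Observable.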

Kafka Connect and ksqlDB

Kafka Connect is a tool for streaming data between Kafka and other data systems. ksqlDB, on the other hand, is a database purpose-built for stream processing applications. It is developed by Confluent on top of Kafka rather than shipped with Apache Kafka itself, and it lets you build real-time systems through a SQL-like interface.

Caching, Event-Driven Architecture, Event Sourcing, and Sharding

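Kafka leans on the operating system's page cache, so recently written data is usually served from memory rather than disk. It fits naturally into an event-driven architecture, where actions are triggered by events instead of direct calls between services. Event sourcing is a technique where changes to application state are stored as a sequence of events, and a Kafka topic is a common place to keep that sequence. Sharding, a type of database partitioning, has its counterpart in Kafka's partitions, which distribute a topic's data across the brokers in a cluster.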
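
Event sourcing is easier to see with a small example. The sketch below rebuilds an account balance by replaying a list of events; the event kinds, amounts, and function names are hypothetical, and a plain Python list stands in for the event log that would live in a Kafka topic.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    kind: str    # "deposited" or "withdrawn" (hypothetical event kinds)
    amount: int


def apply(balance: int, event: Event) -> int:
    """Return the new balance after applying a single event."""
    if event.kind == "deposited":
        return balance + event.amount
    if event.kind == "withdrawn":
        return balance - event.amount
    raise ValueError(f"unknown event kind: {event.kind}")


def replay(events: list) -> int:
    """Rebuild the current state by applying every event in order."""
    balance = 0
    for event in events:
        balance = apply(balance, event)
    return balance


events = [Event("deposited", 100), Event("withdrawn", 30), Event("deposited", 5)]
balance = replay(events)
```

The state is never stored directly; it is always derived from the events, so replaying a prefix of the log gives the state at an earlier point in time.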

In conclusion, Apache Kafka is a robust and versatile platform that can handle real-time data streaming at a large scale. Its architecture and components work together to ensure high performance, scalability, and reliability.

Monday, December 11, 2023

Load balancing strategies

Load balancing distributes incoming network traffic across multiple servers or resources so that capacity is used efficiently and no single server is overloaded. Several load balancing strategies exist, each suited for specific scenarios:

  • Round Robin: Requests are distributed sequentially among servers in a circular order. It's simple and gives every server the same number of requests, but it doesn't consider each server's current load or capacity.
  • Least Connections: Traffic is directed to the server with the fewest active connections. This strategy ensures that the load is distributed to the least loaded servers, promoting better resource utilization.
  • Weighted Round Robin: Servers are assigned weights, specifying their capacity or processing power. Requests are then distributed based on these weights, allowing more traffic to higher-capacity servers.
  • IP Hashing: The client's IP address determines which server receives the request. This ensures that requests from the same client are consistently sent to the same server, aiding session persistence.
  • Least Response Time: Requests are directed to the server that currently has the shortest response time or the fastest processing capability. This strategy optimizes performance for end users.
  • Resource-based Load Balancing: Takes into account server resource utilization metrics (CPU, memory, etc.) and directs traffic to servers with available resources, preventing overload and maximizing performance.
  • Dynamic Load Balancing Algorithms: These algorithms adapt in real-time to changing server conditions. They can factor in various metrics like server health, latency, and throughput to dynamically adjust traffic distribution.
  • Content-based or Application-aware Load Balancing: Analyzes the content or context of requests to intelligently route traffic. For instance, it can direct video streaming requests to servers optimized for video processing.
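
A few of the simpler strategies above can be sketched in a handful of lines. This is a toy illustration under some assumptions: the server names are hypothetical, CRC32 is just one possible hash for IP hashing, and the weighted variant uses the naive "repeat each server by its weight" approach instead of the smoother interleaving that real load balancers tend to use.

```python
import itertools
import zlib

SERVERS = ["app-1", "app-2", "app-3"]  # hypothetical backend names


def round_robin(servers):
    """Yield servers one after another, wrapping around forever."""
    return itertools.cycle(servers)


def least_connections(active: dict) -> str:
    """Pick the server with the fewest active connections."""
    return min(active, key=active.get)


def weighted_round_robin(weights: dict):
    """Naive weighted variant: repeat each server as many times as its weight."""
    expanded = [server for server, weight in weights.items() for _ in range(weight)]
    return itertools.cycle(expanded)


def ip_hash(client_ip: str, servers) -> str:
    """Map a client IP to a server so the same client keeps hitting the same one."""
    return servers[zlib.crc32(client_ip.encode("utf-8")) % len(servers)]


rr = round_robin(SERVERS)
first_four = [next(rr) for _ in range(4)]

wrr = weighted_round_robin({"app-1": 2, "app-2": 1})
first_six = [next(wrr) for _ in range(6)]

least_loaded = least_connections({"app-1": 5, "app-2": 2, "app-3": 7})
sticky_server = ip_hash("203.0.113.7", SERVERS)
```

Round robin and the weighted variant ignore what the servers are doing right now, while least connections needs live connection counts, which is the trade-off the list above describes.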

GCP services with examples

As in the previous post on AWS, here is a similar list of popular GCP services with an example for each.

  • Compute Engine:
    • Example: Similar to Amazon EC2, Compute Engine allows you to create and run virtual machines. You might use it to deploy and manage instances for various purposes like web hosting, application development, or machine learning tasks.
  • Cloud Storage:
    • Example: Storing and serving multimedia content for a content management system. Cloud Storage offers scalable object storage, ideal for hosting images, videos, backups, and large datasets used by applications.
  • Cloud SQL:
    • Example: Running a managed MySQL or PostgreSQL database for a retail application. Cloud SQL provides a fully managed relational database service, handling backups, replication, and maintenance tasks.
  • Cloud Functions:
    • Example: Implementing event-driven serverless functions for real-time data processing. You might use Cloud Functions to trigger actions in response to events like file uploads, database changes, or HTTP requests.
  • Cloud Firestore / Cloud Bigtable:
    • Example: Building a scalable database for a real-time chat application. Firestore offers a flexible, scalable NoSQL database for storing and syncing data across devices, while Bigtable is suitable for high-throughput, low-latency workloads like time-series data or machine learning.
  • Cloud Pub/Sub:
    • Example: Creating a message queuing system for handling data processing tasks. Pub/Sub provides reliable, scalable messaging between independent applications or microservices.
  • Cloud CDN (Content Delivery Network):
    • Example: Accelerating content delivery for a global news website. Cloud CDN caches content at Google's globally distributed edge points of presence, reducing latency for users accessing articles, images, and videos.
  • Cloud Dataflow:
    • Example: Processing and analyzing large datasets in real-time. Dataflow helps to build and execute data processing pipelines for tasks like ETL (Extract, Transform, Load), analytics, and batch processing.
  • Google Kubernetes Engine (GKE):
    • Example: Managing and orchestrating containerized applications at scale. GKE automates the deployment, scaling, and management of containerized applications using Kubernetes.
  • Virtual Private Cloud (VPC):
    • Example: Creating isolated networks for different projects or departments within a company. VPC allows you to define and control a virtual network, including IP ranges, subnets, and firewall rules.

AWS services with examples

It's always hard for me to remember all the abbreviations for all the AWS services, so I tried to collect the most popular ones in this blog post.

  • Amazon EC2 (Elastic Compute Cloud):
    • Example: Imagine building a scalable web application. You can use EC2 to deploy virtual servers (instances) to run your application. You might use different instance types for web servers, application servers, and databases, scaling them based on demand.
  • Amazon S3 (Simple Storage Service):
    • Example: Storing and serving user-uploaded files for a social media platform. S3 provides durable object storage. You might store user profile pictures, videos, and other media files and serve them directly to users.
  • Amazon RDS (Relational Database Service):
    • Example: Hosting a relational database like MySQL, PostgreSQL, or SQL Server for an e-commerce site. RDS manages the database operations, allowing you to focus on your application without worrying about infrastructure management.
  • AWS Lambda:
    • Example: Building a serverless backend for a mobile app. Lambda enables running code without provisioning or managing servers. You might use it to handle user authentication, process data, or trigger actions based on events.
  • Amazon DynamoDB:
    • Example: Implementing a highly scalable NoSQL database for a gaming application. DynamoDB offers low-latency data access and can handle massive amounts of traffic, making it suitable for gaming leaderboards or storing player data.
  • Amazon SQS (Simple Queue Service) and Amazon SNS (Simple Notification Service):
    • Example: Building a decoupled system for an e-commerce platform. SQS allows asynchronous communication between different components of the system, while SNS can be used to send notifications about orders or updates to interested parties.
  • Amazon CloudFront:
    • Example: Accelerating content delivery for a global video streaming service. CloudFront is a content delivery network (CDN) that caches content in edge locations worldwide, reducing latency for users accessing the video content.
  • Amazon Kinesis:
    • Example: Processing and analyzing streaming data from IoT devices. Kinesis allows you to collect, process, and analyze real-time data streams at scale, making it ideal for IoT applications, log processing, or real-time analytics.
  • Amazon ECS (Elastic Container Service) and Amazon EKS (Elastic Kubernetes Service):
    • Example: Orchestrating containerized applications. ECS and EKS help manage Docker containers at scale. You might use these services to deploy microservices for a distributed application architecture.
  • Amazon VPC (Virtual Private Cloud):
    • Example: Creating a private network within AWS. VPC enables you to launch AWS resources into a virtual network, providing control over the network configuration, including IP address ranges, subnets, and routing.

Message queues vs publish/subscribe

Message queues and publish/subscribe are both messaging patterns used in distributed systems to facilitate communication between different components or services. While they serve similar purposes, they have distinct characteristics.

Message Queue:

A message queue is a communication mechanism where messages are stored in a queue until they are consumed by a receiving component. It follows a point-to-point communication model, where a sender pushes a message into a queue, and a single receiver retrieves and processes it. Once a message is consumed, it's typically removed from the queue. Message queues often prioritize reliable delivery, ensuring that messages are not lost even if the receiver is temporarily unavailable.

Publish/Subscribe (Pub/Sub):

Pub/Sub is a messaging pattern where senders (publishers) distribute messages to multiple receivers (subscribers) without the senders specifically targeting any subscriber. Publishers categorize messages into topics or channels, and subscribers express interest in receiving messages from particular topics. When a publisher sends a message to a topic, all subscribers interested in that topic receive a copy of the message. Pub/Sub allows for scalable and flexible communication between components and enables a one-to-many or many-to-many messaging model.

Key Differences:

  • Communication Model:
    • Message Queue: Point-to-point communication between a single sender and a single receiver.
    • Pub/Sub: Many-to-many or one-to-many communication, where multiple subscribers receive messages from publishers.
  • Message Handling:
    • Message Queue: Messages are stored in a queue until consumed by a single receiver.
    • Pub/Sub: Messages are broadcast to all subscribers interested in a specific topic, rather than being held in a queue for a single receiver.
  • Relationships:
    • Message Queue: Direct relationship between sender and receiver.
    • Pub/Sub: Decoupled relationship; publishers and subscribers are independent of each other.
  • Message Retention:
    • Message Queue: Emphasizes ensuring that messages are not lost even if the receiver is temporarily unavailable.
    • Pub/Sub: Subscribers might miss messages if they are not actively subscribed when the message is published.
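
The differences above can be shown with two tiny in-memory classes. This is a sketch of the two patterns only, not of any particular broker; the class names, the topic name, and the messages are all made up for the example.

```python
from collections import defaultdict, deque


class MessageQueue:
    """Point-to-point: each message is handed to exactly one receiver."""

    def __init__(self):
        self._messages = deque()

    def send(self, message):
        self._messages.append(message)

    def receive(self):
        """Remove and return the oldest message, or None if the queue is empty."""
        return self._messages.popleft() if self._messages else None


class PubSub:
    """One-to-many: every current subscriber of a topic gets a copy."""

    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, topic, callback):
        self._subscribers[topic].append(callback)

    def publish(self, topic, message):
        for callback in self._subscribers[topic]:
            callback(message)


# Queue: the message waits until it is received, then it is gone.
queue = MessageQueue()
queue.send("order-1")
first = queue.receive()
second = queue.receive()

# Pub/sub: "order-0" is published before anyone subscribes, so it is missed.
bus = PubSub()
bus.publish("orders", "order-0")
inbox_a, inbox_b = [], []
bus.subscribe("orders", inbox_a.append)
bus.subscribe("orders", inbox_b.append)
bus.publish("orders", "order-1")
```

After this runs, the queue has delivered "order-1" once and is empty, while both subscriber inboxes hold their own copy of "order-1" and neither ever saw "order-0".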

Wednesday, November 22, 2023

My #1 productivity hack - Google Calendar default email reminders

I'd like to share a super simple trick that might or might not work for you, but which is essential for organizing my personal and work events. All hail the Google Calendar default email reminder.

If you're juggling a busy schedule like I am, Google Calendar's default email reminders are a game-changer. Seriously, this feature saves me so much hassle. You can customize reminders for all your events, ensuring nothing slips through the cracks.

What I love most is how easy it is to set up. Just head to settings, tweak your preferences, and voila! You can get an email nudge whenever you need it, whether it's a day before or just an hour prior to your event.

Trust me, relying on these default reminders has made my life a whole lot easier. No more frantic manual setting of reminders for each event—I just set it and forget it. It's like having a personal assistant keeping track of everything for me.

Honestly, it's not just a notification feature; it's a productivity hack. It frees up mental space, letting me focus on what I need to do without constantly worrying about missing important stuff. Give it a shot; you'll thank yourself later!

Setting default email reminders proves immensely beneficial in managing a busy schedule. It eradicates the need to manually set reminders for each event, saving time and ensuring no event goes unnoticed. The simplicity of configuring these reminders simplifies the organizational process for users, fostering a more efficient workflow.

This feature fosters productivity by reducing the mental load of remembering every event. Users can rely on the system to prompt them at designated times, allowing them to focus on the tasks at hand without worrying about missing appointments or deadlines.

The only downside of this approach is that your email client can get pretty chatty. But that also means you're living a busy life! So all in all, this is my best approach to managing all of these events (e.g. birthdays are there too haha), but if you know a better way to do it, let me know!

Learn more about it here!