
  • 5 Basic Apache Spark Commands for Beginners

    If you've heard about Apache Spark but have no idea what it is or how it works, you're in the right place. In this post, I'll explain in simple terms what Apache Spark is, show how it can be used, and include practical examples of basic commands to help you start your journey into the world of large-scale data processing.

What is Apache Spark?
Apache Spark is a distributed computing platform designed to process large volumes of data quickly and efficiently. It enables you to split large datasets into smaller parts and process them in parallel across multiple computers (or nodes). This makes Spark a popular choice for tasks such as:
- Large-scale data processing.
- Real-time data analytics.
- Training machine learning models.
Built with a focus on speed and ease of use, Spark supports multiple programming languages, including Python, Java, Scala, and R.

Why is Spark so popular?
- Speed: Spark is much faster than other solutions like Hadoop MapReduce because it uses in-memory processing.
- Flexibility: It supports various tools like Spark SQL, MLlib (machine learning), GraphX (graph analysis), and Structured Streaming (real-time processing).
- Scalability: It can handle small local datasets or massive volumes in clusters with thousands of nodes.

Getting Started with Apache Spark
Before running commands in Spark, you need to understand the concept of RDDs (Resilient Distributed Datasets), which are collections of data distributed across different nodes in the cluster. Additionally, Spark works with DataFrames and Datasets, which are more modern and optimized data structures.

How to Install Spark
Apache Spark can run locally on your computer or on cloud clusters. For a quick setup, you can use PySpark, Spark's Python interface:

    pip install pyspark

Basic Commands in Apache Spark
Here are some practical examples to get started:

1. Creating a SparkSession
Before anything else, you need to start a Spark session:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder \
        .appName("SparkExample") \
        .getOrCreate()

2. Reading a File
Let's load a CSV file into a DataFrame:

    df = spark.read.csv("data.csv", header=True, inferSchema=True)
    df.show()

3. Selecting and Filtering Data
You can select specific columns or apply filters:

    df.select("name", "age").show()
    df.filter(df["age"] > 30).show()

4. Transforming Data
Use functions like groupBy and agg to transform data (a fuller agg example appears at the end of this post):

    df.groupBy("city").count().show()

5. Saving Results
Results can be saved to a file:

    df.write.csv("result.csv", header=True)

Conclusion
Apache Spark is a powerful tool that makes large-scale data processing accessible, fast, and efficient. Whether you're starting in data or looking to learn more about distributed computing, Spark is an excellent place to begin. Are you ready to dive deeper into the world of Apache Spark? Check out more posts about Apache Spark by accessing the links below: How to read CSV file with Apache Spark
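
As a complement to the groupBy example above, here is a minimal sketch of a transformation using agg, assuming the same df with the city and age columns used throughout this post:

    from pyspark.sql import functions as F

    # Average and maximum age per city, using the same df as in the examples above.
    df.groupBy("city") \
      .agg(F.avg("age").alias("avg_age"), F.max("age").alias("max_age")) \
      .show()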

  • Data Mesh: Does It Still Make Sense to Adopt?

    Introduction
Data Mesh: does it still make sense to adopt? As companies grow, the volumes of data that need to be processed, stored, and analyzed increase exponentially. Traditional data architectures, centralized in a single repository or team, have started to show inefficiencies. Centralized models, such as the well-known Data Warehouses and Data Lakes, often encounter bottlenecks, limited scalability, and difficulties in meeting the growing demand for data across multiple business areas.

In this context, Data Mesh emerges as an innovative approach, proposing the decentralization of data operations and governance, distributing responsibility to domains oriented around data products. Each domain, or business area, becomes responsible for creating, maintaining, and using its own data as a complete product, meeting both quality and consumption requirements.

With Data Mesh, companies can more efficiently handle data growth, allowing different functional areas to take ownership of the data they generate and consume. Decentralized management offers scalability, autonomy, and faster delivery of valuable insights, addressing many challenges found in traditional centralized architectures. This approach is rapidly gaining relevance in the field of Big Data, especially in organizations that need to adapt to a fast-evolving data ecosystem. Data Mesh is not just a new architecture but also a cultural shift in how data is managed and valued within companies.

But What Is Data Mesh, After All?
Data Mesh is a modern approach to data architecture that seeks to solve the challenges of centralized architectures by proposing a decentralization of both data processing and governance. The central idea of Data Mesh is to treat data as a product, where each domain within the organization is responsible for managing and delivering its own data autonomously, similar to how it manages other products or services. This concept was developed to address the issues that arise in centralized architectures as data volume, complexity, and diversity grow. Instead of relying on a central data team to manage and process all information, Data Mesh distributes responsibility to cross-functional teams. This means that each team, or domain, becomes the "owner" of its data, ensuring it is reliable, accessible, and of high quality.

Data Mesh is supported by several essential pillars that shape its unique approach. First, it decentralizes data management by delegating responsibility to the domains within an organization. Each domain is responsible for its own data, allowing business teams to independently manage the data they produce and use. Additionally, one of the key concepts of Data Mesh is treating data as a product: data is no longer seen merely as a byproduct of business processes but rather as a valuable asset, with teams responsible for ensuring that it is reliable, accessible, and useful to consumers. For this to work, a robust architecture is essential, providing teams with the necessary tools to efficiently manage, access, and share data autonomously, without depending on a centralized team. This infrastructure supports the creation and maintenance of data pipelines and the monitoring of data quality. Finally, federated governance ensures that, despite decentralization, there are rules and standards that all teams follow, ensuring compliance and data interoperability across different domains.
The Lack of Autonomy in Accessing Data
One of the biggest challenges faced by business areas in many organizations is their dependence on centralized data teams to obtain the information needed for strategic decisions. Teams in marketing, sales, operations, and other departments constantly need data to guide campaigns, improve processes, and optimize operations. However, access to this data is often restricted to a central data or IT team, leading to various bottlenecks.

This lack of autonomy directly impacts the agility of business areas. Each new data request must be formally submitted to the data team, which is already overwhelmed with other demands. The result? Long waiting times for analyses, reports, and insights that should be generated quickly. Often, decisions must be made based on outdated or incomplete data, harming the company's competitiveness and ability to adapt to new opportunities.

Another critical issue is the lack of visibility. Business areas often struggle to track what is available in the data catalog, where to find relevant data, and even understand the quality of that information. The alignment between business requirements and data delivery becomes strained, creating a gap between what the business needs and what the data team can provide.

Additionally, centralizing data in an exclusive team hinders the development of tailored solutions for different areas. Each business team has specific needs regarding the data it consumes, and the centralized model generally offers a generic approach that doesn't always meet those needs. This can lead to frustration and the perception that data is not useful or actionable in each area's specific context.

These factors highlight the need for a paradigm shift in how companies manage and access data. Data Mesh proposes a solution to this lack of autonomy by decentralizing data management responsibility and empowering business areas, allowing them to own the data they produce and consume. However, this shift comes with cultural and organizational challenges that must be overcome to ensure the success of this new approach.

Cultural Changes Are Necessary
Adopting Data Mesh is not just about changing the data architecture; it requires a profound cultural transformation within organizations. One of the biggest shifts is decentralizing responsibility for data. In a traditional model, a central IT or data team is typically the sole entity responsible for managing, processing, and providing access to data. With Data Mesh, this responsibility shifts to the business areas themselves, who become the owners of the data they produce and consume.

This cultural change can be challenging, as business teams are often not used to directly handling data governance and processing. They will need to adapt to new tools and technologies, and more importantly, to a new mindset where the use and quality of data become a priority in their daily activities. This shift requires training and the development of new skills, such as understanding data modeling and best governance practices.

Another critical cultural aspect is the collaboration between business and technology teams. In the Data Mesh model, IT is no longer the single point of contact for all data-related needs. Business areas gain autonomy, but this doesn't mean that IT and data engineers become less important. On the contrary, collaboration between both sides becomes even more essential.
IT must provide the tools and infrastructure for domains to operate independently, while business areas must ensure that their data meets the quality and governance standards set by the organization.

This new division of responsibilities can lead to internal resistance, especially in companies accustomed to a hierarchical and centralized structure. Data teams might feel like they are losing control over governance, while business areas may feel overwhelmed by their new responsibilities. Overcoming this resistance requires strong leadership, committed to aligning the entire organization around a common goal: using data as a strategic and distributed asset.

Moreover, the success of Data Mesh depends on the adoption of a culture of shared responsibility. Each domain needs to see data as a product that must be managed with the same care and attention as any other product offered to the market. This requires a clear commitment to data quality, accessibility, and usability, which can be a significant leap for areas that previously did not focus on these aspects.

Not Only Cultural Changes Drive Data Mesh: What Are the Common Tools in This Ecosystem?
Implementing a Data Mesh requires a robust set of tools and technologies that support data decentralization while maintaining governance, quality, and efficiency in data processing and consumption. The tools used in the Data Mesh ecosystem vary, but they generally fall into three main categories: data storage and processing platforms, orchestration and automation tools, and data governance and quality tools.

Data Storage and Processing Platforms
One of the foundations of Data Mesh is ensuring that each domain has control over the data it produces, which requires flexible and scalable platforms for storage and processing. Some of the most common technologies include:
- AWS S3 and Azure Data Lake: These storage platforms provide a flexible infrastructure for both raw and processed data, allowing domains to maintain their own data with individualized access control. They are key in giving domains autonomy over data management while offering scalable storage for vast amounts of information.
- Apache Kafka: Often used to manage data flow between domains, Kafka enables real-time data streaming, which is crucial for companies that need to handle large volumes of information continuously and in a decentralized manner. It facilitates the transfer of data across domains with minimal latency.
- Spark and Databricks: These powerful tools are used for processing large volumes of data and help scale distributed pipelines. Spark, particularly when paired with Databricks, allows domains to efficiently manage their data workflows, ensuring autonomy and high performance across different parts of the organization.
- Kubernetes: As a container orchestration platform, Kubernetes enables the creation of isolated execution environments where different domains can run their own data pipelines independently. It ensures that each domain has the infrastructure needed to manage its data operations without interfering with others, maintaining both autonomy and operational efficiency.

Orchestration and Automation Tools
For domains to manage their own data without relying on a centralized team, it is essential to have orchestration tools that automate ETL (Extract, Transform, Load) processes, data monitoring, and updates.
Some of the most common tools include:
- Apache Airflow: An open-source tool that simplifies the automation of data pipelines, task scheduling, and workflow monitoring. It helps domains maintain their data ingestion and transformation processes without the need for continuous manual intervention.
- dbt (Data Build Tool): Focused on data transformation, dbt allows data analysts to perform transformations directly within the data warehouse, making it easier to implement changes to data models for each domain with greater autonomy.
- Prefect: Another orchestration tool, similar to Airflow, but with a focus on simplicity and flexibility in managing workflows. Prefect facilitates the implementation and maintenance of data pipelines, giving domains more control over their data processes.

Data Governance and Quality Tools
Decentralization brings with it a major challenge: maintaining governance and ensuring data quality across all domains. Some tools are designed to efficiently handle these challenges:
- Great Expectations: One of the leading data validation tools, enabling domains to implement and monitor data quality directly within ETL pipelines. This ensures that the data delivered meets expected standards, regardless of the domain (a short validation sketch appears near the end of this post).
- Monte Carlo: A data monitoring platform that automatically alerts users to quality issues and anomalies. It helps maintain data reliability even in a distributed environment, ensuring that potential problems are identified and resolved quickly.
- Collibra: Used to maintain a data catalog and implement centralized governance, even in a decentralized architecture. It helps ensure that all areas follow common governance standards, maintaining data interoperability and compliance across domains.

Consumption or Self-Service Infrastructure
One of the keys to the success of Data Mesh is providing business teams with a self-service infrastructure, allowing them to create, manage, and consume their own data. This involves everything from building data pipelines to using dashboards for data analysis:
- Tableau and Power BI: These are commonly used as data visualization and exploration tools, enabling end users to quickly and efficiently access and interpret data. Both platforms offer intuitive interfaces that allow non-technical users to create reports and dashboards, helping them derive insights and make data-driven decisions without needing extensive technical expertise.
- Jupyter Notebooks: Frequently used by data science teams for experimentation and analysis, Jupyter Notebooks enable domains to independently analyze data without needing intervention from central teams. This tool allows for interactive data exploration, combining code, visualizations, and narrative explanations in a single environment, making it a powerful resource for data-driven insights and experimentation.

What Are the Risks of Adopting Data Mesh?
Although Data Mesh brings numerous advantages, such as scalability, agility, and decentralization, its adoption also presents considerable challenges, ranging from deep cultural shifts to financial risks. These disadvantages can compromise the successful implementation of the model and, if not addressed properly, can lead to inefficiencies or even project failures. Let's explore these disadvantages in more detail:

Cultural and Organizational Complexity
The transition to a Data Mesh model requires a significant cultural shift in how data is managed and perceived within the company.
This can be an obstacle, especially in organizations with a long-standing tradition of centralized data management.
- Mindset Shift: Traditionally, many companies view data as the sole responsibility of IT or a central data team. In Data Mesh, this responsibility is distributed, and business areas need to adopt a "data as a product" mentality. This shift requires domains to commit to treating their data with the same rigor as any other product they deliver. However, this transition may face resistance, especially from teams that lack technical experience in data governance and management.
- Training and Development: A clear disadvantage lies in the effort required to train business teams to manage and process their own data. This can include everything from using data tools to understanding best practices in governance. Companies need to invest in continuous training to ensure that teams are prepared for their new responsibilities, which can be costly and time-consuming.
- Internal Resistance: Implementing Data Mesh means altering the dynamics of power and responsibility within the organization. Centralized data teams may resist decentralization, fearing a loss of control over data governance. At the same time, business teams may feel overwhelmed by new responsibilities that were not previously part of their duties. Managing this resistance requires strong and well-aligned leadership to ensure a smooth transition and to address concerns from both sides effectively.

Data Fragmentation and Governance
One of the major concerns when adopting a decentralized architecture is the risk of data fragmentation. Without effective and federated governance, different domains may adopt divergent data standards and formats, which can lead to data silos, duplication of information, and integration challenges. Ensuring consistent governance across domains is essential to avoid these issues, as it maintains data interoperability and ensures that data remains accessible and usable across the organization.
- Data Inconsistency: Without clear governance, decentralization can lead to inconsistencies in data across domains. Each business area may have its own definitions and practices for collecting and processing data, creating an environment where it becomes difficult to consolidate or compare information from different parts of the company. This lack of uniformity can undermine decision-making and hinder the ability to generate comprehensive insights.
- Challenges in Federated Governance: Implementing efficient federated governance is one of the biggest challenges of Data Mesh. This requires the creation of data policies and standards that are followed by all domains, ensuring interoperability and quality. However, ensuring that all domains adhere to these rules, especially in large organizations, can be difficult. If governance becomes too relaxed or fragmented, the benefits of Data Mesh can be compromised, leading to inefficiencies and data management issues across the organization.

High Financial Costs
Implementing Data Mesh can also involve significant financial costs, both in the short and long term. This is mainly due to the need for investments in new technologies, training, and processes. Organizations must allocate resources for the acquisition and integration of tools that support decentralization, as well as for continuous training to prepare teams for their new responsibilities.
Additionally, maintaining a decentralized system may require ongoing investments in infrastructure and governance to ensure smooth operations and data quality across domains.
- Infrastructure Investment: To ensure that each domain has the capacity to manage its own data, companies need to invest in a robust self-service infrastructure, which may include storage, processing, and data orchestration platforms. The initial cost of building this infrastructure can be high, especially if the company is currently operating under a centralized model that requires restructuring. These investments are necessary to enable domains to function independently, but they can represent a significant financial outlay in terms of both technology and implementation.
- Ongoing Maintenance: In addition to the initial implementation cost, maintaining a decentralized model can be more expensive than a centralized system. Each domain requires dedicated resources to manage and ensure the quality of its data, which can increase operational costs. Furthermore, tools and services to ensure federated governance and interoperability between domains require continuous updates and monitoring. These ongoing efforts add to the complexity and expense of keeping the system functioning smoothly over time.
- Risk of Financial Inefficiency: If the implementation of Data Mesh is poorly executed, the company may end up spending more than initially planned without reaping the expected benefits. For example, a lack of governance can lead to data duplication and redundant efforts across domains, resulting in a waste of financial and human resources. Inefficiencies like these can offset the potential advantages of Data Mesh, making it crucial to ensure proper planning, governance, and execution from the outset.

Difficulty in Integration and Alignment
Finally, data decentralization can lead to integration challenges between domains, especially if there is no clear alignment between business areas and the data standards established by the organization. Without consistent communication and adherence to common protocols, domains may develop disparate systems and data formats, making it harder to integrate and share data across the organization. This misalignment can hinder collaboration, slow down data-driven decision-making, and reduce the overall efficiency of the Data Mesh approach.
- Coordination Between Domains: With Data Mesh, each domain operates autonomously, which can create coordination challenges between teams. The lack of clear and frequent communication can result in inconsistent or incompatible data, making it difficult to perform integrated analyses across different areas of the company. Ensuring that domains collaborate effectively and align on data standards and governance practices is essential to avoid fragmentation and maintain the overall integrity of the organization's data ecosystem.
- Quality Standards: Maintaining a uniform quality standard across domains can be a challenge. Each business area may have a different perspective on what constitutes quality data, and without clear governance, this can result in fragmented or unreliable data. Inconsistent quality standards between domains can undermine the overall trustworthiness and usability of the data, making it difficult to rely on for decision-making or cross-domain analysis.

Advantages and Disadvantages: What Are the Benefits for Companies That Have Adopted Data Mesh Compared to Those That Haven't?
When comparing a company that has adopted Data Mesh with one that still follows the traditional centralized model, several significant differences emerge, both in terms of advantages and disadvantages. This comparison helps us understand where Data Mesh may be more appropriate, as well as the challenges it can present compared to the conventional model.

Speed and Agility in Delivering Insights
- Company with Data Mesh: By adopting Data Mesh, business areas gain autonomy to manage and access their own data. This means that instead of relying on a central data team, each domain can build and adjust its data pipelines according to its specific needs. This often leads to a significant reduction in the time required to obtain actionable insights, as business areas avoid the bottlenecks commonly found in a centralized approach.
- Company without Data Mesh: In the centralized approach, all data requests must go through a central team, which is often overwhelmed with multiple requests. This results in long wait times for reports, analyses, and insights. Additionally, the backlog of data requests can pile up, delaying critical business decision-making.
- Advantage of Data Mesh: Decentralization speeds up access to insights, making the company more agile and better equipped to respond quickly to market changes.

Data Quality and Consistency
- Company with Data Mesh: In the Data Mesh model, each domain is responsible for the quality of the data it generates. While this can mean that the data is more contextualized to the domain's needs, there is a risk of inconsistencies if federated governance is not well implemented. Each domain may adopt slightly different standards, leading to issues with data interoperability and comparability across domains.
- Company without Data Mesh: In a centralized model, data governance is more rigid and controlled, ensuring greater consistency across the organization. However, this also creates a bottleneck when it comes to implementing new standards or adapting data for the specific needs of different business areas.
- Disadvantage of Data Mesh: Decentralization can lead to data inconsistencies, especially if there is not strong enough governance to standardize practices across domains.

Scalability
- Company with Data Mesh: Data Mesh is designed to scale efficiently in large organizations. As the company grows and new domains emerge, these domains can quickly establish their own data pipelines without overloading a central team. This allows the organization to expand without creating a bottleneck in data operations.
- Company without Data Mesh: In a centralized model, scalability is a major challenge. As the company grows and more areas need access to data, the centralized team becomes a bottleneck. Expanding central infrastructure can also be costly and complex, making it difficult for the company to adapt to new data volumes and types.
- Advantage of Data Mesh: More natural and efficient scalability, as business areas can manage their own data without relying on an overburdened central team.

Operational Costs
- Company with Data Mesh: While Data Mesh offers greater autonomy and scalability, the operational costs can be higher initially. Implementing self-service infrastructure, decentralized governance, and training business teams to manage data can be expensive. Additionally, there are ongoing costs for maintaining quality standards and governance across domains.
- Company without Data Mesh: A centralized model may be cheaper in terms of maintenance and governance, as the central data team has full control over the system. However, hidden costs may arise in the form of inefficiencies and missed opportunities due to slow data delivery.
- Disadvantage of Data Mesh: Higher initial costs and ongoing operational expenses related to governance and maintaining decentralized infrastructure.

Innovation and Experimentation
- Company with Data Mesh: With each domain autonomous in managing its data, there is greater flexibility to experiment with new methods of data collection and processing. Teams can adjust their approaches to meet their specific needs without waiting for approval or availability from a central IT team. This encourages a culture of innovation, where different areas can quickly test hypotheses and adapt to changes.
- Company without Data Mesh: In the centralized model, any experimentation or innovation with data must go through the bureaucratic process of prioritization and execution by the central team. This can delay innovation and limit the business areas' flexibility to adapt their practices quickly.
- Advantage of Data Mesh: Greater flexibility and innovation potential in business areas, allowing them to freely experiment with their own data.

Governance and Compliance
- Company with Data Mesh: Maintaining governance and compliance in a decentralized architecture can be challenging. Without well-implemented federated governance, there is a risk that different domains may adopt divergent practices, which can compromise data quality and even put the company at risk of violating data protection regulations, such as GDPR or LGPD.
- Company without Data Mesh: In the centralized model, governance is much more controlled, and compliance with regulatory standards is managed by a single data team, reducing the risk of violations and inconsistencies. However, this can lead to a more rigid and slower approach to adapting to new regulatory requirements.
- Disadvantage of Data Mesh: Decentralized governance can increase the risk of regulatory non-compliance and data inconsistency.

Is Data Mesh a Silver Bullet?
At first glance, the concept and its ideas can look like a silver bullet for many of the challenges a centralized architecture faces when trying to keep up with the rapid growth of a company and the need for business areas to extract insights quickly. While Data Mesh is a powerful approach to solving scalability and autonomy challenges in data, it is not a universal solution. It offers significant advantages, such as decentralization and greater agility, but it also brings complex challenges, like the need for effective federated governance and high implementation costs.

The primary limitation of Data Mesh is that it requires a deep cultural shift, where business areas become responsible for the quality and governance of their data. Companies that are not ready for this transformation may face data fragmentation and a lack of standardization. Moreover, Data Mesh is not suitable for all organizations. Smaller companies or those with lower data maturity may find Data Mesh overly complex and expensive, opting for simpler solutions like Data Lakes or Data Warehouses.

Therefore, Data Mesh is not a silver bullet. It solves many data-related problems but is not a magical solution for all companies and situations. Its success depends on the organization's maturity and readiness to adopt a decentralized and adaptive architecture.
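
To make the data quality tooling mentioned earlier a bit more concrete, here is a minimal sketch of a validation step a domain might run before publishing a data product. It assumes the classic Pandas-based Great Expectations API (0.x style; newer releases expose a different interface), and the file name and column names are hypothetical:

    import great_expectations as ge

    # Hypothetical extract produced by a domain pipeline.
    df = ge.read_csv("customers.csv")

    # A few basic expectations on the data product's key columns.
    checks = [
        df.expect_column_values_to_not_be_null("customer_id"),
        df.expect_column_values_to_be_unique("customer_id"),
        df.expect_column_values_to_be_between("age", min_value=0, max_value=120),
    ]

    if not all(check.success for check in checks):
        raise ValueError("Data quality checks failed; do not publish this data product.")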
Hope you enjoyed this post, share it, and see you next time!

  • Don't Let Your Dashboards Break: Understanding DistKey and SortKey in Practice

    First, What Is AWS Redshift?
Redshift is a highly scalable cloud-based data warehouse service offered by AWS. It allows companies to quickly analyze large volumes of data using standard SQL and BI tools. Redshift's architecture is optimized for large-scale data analysis, leveraging parallelization and columnar storage for high performance. I recommend reading my post where I dive deeper into Redshift's architecture and its components, available at Understanding AWS Redshift and Its Components.

Why Use DistKey and SortKey?
Understanding DistKey and SortKey in practice can provide several benefits, the most important being improved query performance. DistKey optimizes joins and aggregations by efficiently distributing data across nodes, while SortKey speeds up queries that filter and sort data, allowing Redshift to read only the necessary data blocks. Both help to make queries faster and improve resource efficiency.

DistKey and How It Works
DistKey (or Distribution Key) is the strategy for distributing data across the nodes of a Redshift cluster. When you define a column as a DistKey, the records sharing the same value in that column are stored on the same node, which can reduce the amount of data movement between nodes during queries. One of the main advantages is reducing data movement between nodes, increasing query performance and improving the utilization of Redshift's distributed processing capabilities.

Pay Attention to Cardinality
Choosing a column with low cardinality (few distinct values) as a DistKey can result in uneven data distribution, creating "hot nodes" (nodes overloaded with data) and degrading performance.

What is Cardinality?
Cardinality refers to the number of distinct values in a column. A column with high cardinality has many distinct values, making it a good candidate for a DistKey in Amazon Redshift. High cardinality tends to distribute data more evenly across nodes, avoiding overloaded nodes and ensuring balanced query performance. Although the idea behind DistKey is to distribute distinct values evenly across nodes, keep in mind that if data moves frequently between nodes, it will reduce the performance of complex queries. Therefore, it's important to carefully choose the right column to define as a DistKey.

Benefits of Using DistKey
To make it clearer, here are some benefits of choosing the right DistKey strategy:
- Reduced Data Movement Between Nodes: When data sharing the same DistKey is stored on the same node, join and aggregation operations using that key can be performed locally on a single node. This significantly reduces the need to move data between nodes, which is one of the main factors affecting query performance in distributed systems.
- Better Performance in Joins and Filtered Queries: If queries frequently perform joins between tables sharing the same DistKey, keeping the data on the same node can drastically improve performance. Query response times are faster because operations don't require data redistribution between nodes.

Suppose you have two large tables in your Redshift cluster:
- Table A (transactions): Contains billions of customer transaction records.
- Table B (customers): Stores customer information.
Both tables have the column client_id. If you frequently run queries joining these two tables to get transaction details by customer, defining client_id as the DistKey on both tables ensures that records for the same customer are stored on the same node.
    SELECT A.transaction_id, A.amount, B.customer_name
    FROM transactions A
    JOIN customers B ON A.client_id = B.client_id
    WHERE B.state = 'CA';

By keeping client_id on the same node, joins can be performed locally without needing to redistribute data across different nodes in the cluster. This dramatically reduces query response times. Without a DistKey, Redshift would need to redistribute data from both tables across nodes to execute the join, increasing the query's execution time. With client_id as the DistKey, data is already located on the same node, allowing for much faster execution.
- Storage and Processing Efficiency: Local execution of operations on a single node, without the need for redistribution, leads to more efficient use of CPU and memory resources. This can result in better overall cluster utilization, lower costs, and higher throughput for queries.

Disadvantages of Using DistKey
- Data Skew (Imbalanced Data Distribution): One of the biggest disadvantages is the risk of creating data imbalance across nodes, known as data skew. If the column chosen as the DistKey has low cardinality or if values are not evenly distributed, some nodes may end up storing much more data than others. This can result in overloaded nodes, degrading overall performance.
- Reduced Flexibility for Ad Hoc Queries: When a DistKey is defined, it optimizes specifically for queries that use that key. However, if ad hoc queries or analytical needs change, the DistKey may no longer be suitable. Changing the DistKey requires redesigning the table and possibly redistributing the data, which can be time-consuming and disruptive.
- Poor Performance in Non-Optimized Queries: If queries that don't effectively use the DistKey are executed, performance can suffer. This is particularly relevant in scenarios where queries vary widely or don't follow predictable patterns. While the lack of data movement between nodes is beneficial for some queries, it may also limit performance for others that require access to data distributed across all nodes.

How to Create a DistKey in Practice
After selecting the best strategy based on the discussion above, creating a DistKey is straightforward. Simply add the DISTKEY keyword when creating the table.

    CREATE TABLE sales (
        sale_id INT,
        client_id INT DISTKEY,
        sale_date DATE,
        amount DECIMAL(10, 2)
    );

In the example above, the column client_id has been defined as the DistKey, optimizing queries that retrieve sales data by customer.

SortKey and How It Works
SortKey is the key used to determine the physical order in which data is stored in Redshift tables. Sorting data can significantly speed up queries that use filters based on the columns defined as SortKey.

Benefits of SortKey
- Query Performance with Filters and Groupings: One of the main advantages of using SortKey is improved performance for queries applying filters (WHERE), orderings (ORDER BY), or groupings (GROUP BY) on the columns defined as SortKey. Since data is physically stored on disk in the order specified by the SortKey, Redshift can read only the necessary data blocks, instead of scanning the entire table.
- Reduced I/O and Increased Efficiency: With data ordered by SortKey, Redshift minimizes I/O by accessing only the relevant data blocks for a query. This is especially useful for large tables, where reading all rows would be resource-intensive. Reduced I/O results in faster query response times.
- Easier Management of Temporal Data: SortKeys are particularly useful for date or time columns.
When you use a date column as a SortKey, queries filtering by time ranges (e.g., "last 30 days" or "this year") can be executed much faster. This approach is common in scenarios where data is queried based on dates, such as transaction logs or event records.
- Support for the VACUUM Command: The VACUUM command is used to reorganize data in Redshift, removing free space and applying the order defined by the SortKey. Tables with a well-defined SortKey benefit the most from this process, as VACUUM can efficiently reorganize the data, resulting in a more compact table and even faster queries.

Disadvantages of Using SortKey
- Incorrect Choice of SortKey Column: If an inappropriate column is chosen as the SortKey, there may be no significant improvement in query performance, or worse, performance may actually degrade. For example, if the selected column is not frequently used in filters or sorting, the advantage of accessing data blocks efficiently is lost, meaning Redshift will scan more blocks, resulting in higher query latency. An example would be defining a status column (with few distinct values) as the SortKey in a table where queries typically filter by transaction_date. This would result in little to no improvement in execution time.
- Table Size and Reorganization: In very large tables, reorganizing data to maintain SortKey efficiency can be slow and resource-intensive. This can impact system availability and overall performance. For example, when a table with billions of records needs to be reorganized due to inserts or updates that disrupt the SortKey order, the VACUUM operation can take hours or even days, depending on the table size and cluster workload.
- Difficulty in Changing the SortKey: Changing the SortKey of an existing table can be complex and time-consuming, especially for large tables. This involves creating a new table, copying the data to the new table with the new SortKey, and then dropping the old table. In other words, if you realize that the originally chosen SortKey is no longer optimizing queries as expected, changing the SortKey may require a complete data migration, which can be highly disruptive.

How to Create a SortKey in Practice
Here, sale_date was defined as the SortKey, ideal for queries that filter records based on specific dates or date ranges (a sketch combining DistKey and SortKey in a single table definition appears at the end of this post).

    CREATE TABLE sales (
        sale_id INT,
        client_id INT,
        sale_date DATE SORTKEY,
        amount DECIMAL(10, 2)
    );

Conclusion
SortKey is highly effective for speeding up queries that filter, sort, or group data. By physically ordering the data on disk, SortKeys allow Redshift to read only the relevant data blocks, resulting in faster query response times and lower resource usage. However, choosing the wrong SortKey or failing to manage data reorganization can lead to degraded performance and increased complexity.

On the other hand, DistKey is crucial for optimizing joins and aggregations across large tables. By efficiently distributing data across cluster nodes, a well-chosen DistKey can minimize data movement between nodes, significantly improving query performance. The choice of DistKey should be based on column cardinality and query patterns to avoid issues like data imbalance or "hot nodes."

Both SortKey and DistKey require careful analysis and planning. Using them improperly can result in little or no performance improvement, or even worsen performance. Changing SortKeys or DistKeys can also be complex and disruptive in large tables.
Therefore, the key to effectively using SortKey  and DistKey  in Redshift is a clear understanding of data access patterns and performance needs. With proper planning and monitoring, these tools can transform the way you manage and query data in Redshift, ensuring your dashboards and reports remain fast and efficient as data volumes grow. I hope you enjoyed this overview of Redshift’s powerful features. All points raised here are based on my team's experience in helping various areas within the organization leverage data for value delivery. I aimed to explain the importance of thinking through strategies for DistKey  and SortKey  in a simple and clear manner, with real-world examples to enhance understanding. Until next time!
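
As a small follow-up, here is a hedged sketch of how the two keys can be combined in one table definition and created from Python. Redshift also accepts table-level DISTKEY and COMPOUND SORTKEY clauses in addition to the column-level form shown above; the connection details below are hypothetical placeholders:

    import psycopg2  # Redshift speaks the PostgreSQL wire protocol

    # Hypothetical endpoint and credentials; replace with your own cluster values.
    conn = psycopg2.connect(
        host="my-cluster.abc123.us-east-1.redshift.amazonaws.com",
        port=5439,
        dbname="analytics",
        user="admin",
        password="********",
    )

    ddl = """
    CREATE TABLE IF NOT EXISTS sales (
        sale_id   INT,
        client_id INT,
        sale_date DATE,
        amount    DECIMAL(10, 2)
    )
    DISTKEY (client_id)
    COMPOUND SORTKEY (sale_date, client_id);
    """

    # The connection context manager commits the DDL on success and rolls back on error.
    with conn, conn.cursor() as cur:
        cur.execute(ddl)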

  • Understanding AWS Redshift and its components

    Introduction
In today's data-driven world, the ability to quickly and efficiently analyze massive datasets is more critical than ever. Enter AWS Redshift, Amazon Web Services' answer to the growing need for comprehensive data warehousing solutions. But what is AWS Redshift, and why is it becoming a staple in the arsenal of data analysts and businesses alike?

At its most basic, AWS Redshift is a cloud-based service that allows users to store, query, and analyze large volumes of data. It's designed to handle petabytes of data across a cluster of servers, providing the horsepower needed for complex analytics without the need for the infrastructure management typically associated with such tasks. For those who are new to the concept, you might wonder how it differs from traditional databases. Unlike conventional databases that are optimized for transaction processing, AWS Redshift is built specifically for high-speed analysis and reporting of large datasets. This focus on analytics allows Redshift to deliver insights from data at speeds much faster than traditional database systems.

One of the key benefits of AWS Redshift is its scalability. You can start with just a few hundred gigabytes of data and scale up to a petabyte or more, paying only for the storage and computing power you use. This makes Redshift a cost-effective solution for companies of all sizes, from startups to global enterprises. Furthermore, AWS Redshift integrates seamlessly with other AWS services, such as S3 for data storage, Data Pipeline for data movement, and QuickSight for visualization, creating a robust ecosystem for data warehousing and analytics. This integration simplifies the process of setting up and managing your data workflows, allowing you to focus more on deriving insights and less on the underlying infrastructure.

In essence, AWS Redshift democratizes data warehousing, making it accessible not just to large corporations with deep pockets but to anyone with data to analyze. Whether you're a seasoned data scientist or a business analyst looking to harness the power of your data, AWS Redshift offers a powerful, scalable, and cost-effective platform to bring your data to life. Understanding AWS Redshift and its components can help you decide whether this powerful tool is the right fit for you; in the next sections we dive into Redshift and its components.

Is AWS Redshift a Database?
While AWS Redshift shares some characteristics with traditional databases, it's more accurately described as a data warehousing service. This distinction is crucial for understanding its primary function and capabilities. Traditional databases are designed primarily for online transaction processing (OLTP), focusing on efficiently handling a large number of short, atomic transactions. These databases excel in operations such as insert, update, delete, and query by a single row, making them ideal for applications that require real-time access to data, like e-commerce websites or banking systems.

On the other hand, AWS Redshift is optimized for online analytical processing (OLAP). It's engineered to perform complex queries across large datasets, making it suitable for business intelligence, data analysis, and reporting tasks. Redshift achieves high query performance on large datasets by using columnar storage, data compression, and parallel query execution, among other techniques. So, is AWS Redshift a database? Not in the traditional sense of managing day-to-day transactions.
Instead, it's a specialized data warehousing service designed to aggregate, store, and analyze vast amounts of data from multiple sources. Its strength lies in enabling users to gain insights and make informed decisions based on historical data analysis rather than handling real-time transaction processing. In summary, while Redshift has database-like functionalities, especially in data storage and query execution, its role as a data warehousing service sets it apart from conventional database systems. It's this distinction that empowers businesses to harness the full potential of their data for analytics and decision-making processes.

Advantages of AWS Redshift
- Performance Efficiency: AWS Redshift utilizes columnar storage and data compression techniques, which significantly improve query performance by reducing the amount of I/O needed for data retrieval. This makes it exceptionally efficient for data warehousing operations.
- Scalability: Redshift allows you to scale your data warehouse up or down quickly to meet your computing and storage needs without downtime, ensuring that your data analysis does not get interrupted as your data volume grows.
- Cost-Effectiveness: With its pay-as-you-go pricing model, AWS Redshift provides a cost-effective solution for data warehousing. You only pay for the resources you use, which helps in managing costs more effectively compared to traditional data warehousing solutions.
- Easy to Set Up and Manage: AWS provides a straightforward setup process for Redshift, including provisioning resources and configuring your data warehouse without the need for extensive database administration expertise.
- Security: Redshift offers robust security features, including encryption of data in transit and at rest, network isolation using Amazon VPC, and granular permissions with AWS Identity and Access Management (IAM).
- Integration with AWS Ecosystem: Redshift seamlessly integrates with other AWS services, such as S3, Glue, and QuickSight, enabling a comprehensive cloud solution for data processing, storage, and analysis.
- Massive Parallel Processing (MPP): Redshift's architecture is designed to distribute and parallelize queries across all nodes in a cluster, allowing for rapid execution of complex data analyses over large datasets.
- High Availability: AWS Redshift is designed for high availability and fault tolerance, with data replication across different nodes and automatic replacement of failed nodes, ensuring that your data warehouse remains operational.

Disadvantages of AWS Redshift
- Complexity in Management: Despite AWS's efforts to simplify, managing a Redshift cluster can still be complex, especially when it comes to fine-tuning performance and managing resources efficiently.
- Cost at Scale: While Redshift is cost-effective for many scenarios, costs can escalate quickly with increased data volume and query complexity, especially if not optimized properly.
- Learning Curve: New users may find there's a significant learning curve to effectively utilize Redshift, especially those unfamiliar with data warehousing principles and SQL.
- Limited Concurrency: In some cases, Redshift can struggle with high concurrency scenarios where many queries are executed simultaneously, impacting performance.
- Maintenance Overhead: Regular maintenance tasks, such as vacuuming to reclaim space and analyzing to update statistics, are necessary for optimal performance but can be cumbersome to manage.
- Data Load Performance: Loading large volumes of data into Redshift can be time-consuming, especially without careful management of load operations and optimizations.
- Cold Start Time: Starting up a new Redshift cluster or resizing an existing one can take significant time, leading to delays in data processing and analysis.

AWS Redshift Architecture and Its Components
The architecture of AWS Redshift is a marvel of modern engineering, designed to deliver high performance and reliability. We'll explore its core components and how they interact to process and store data efficiently, from the moment a client connects until the data has been processed and returned. Below, we describe each component and its importance for the functioning of Redshift:

Leader Node
- Function: The leader node is responsible for coordinating query execution. It parses and develops execution plans for SQL queries, distributing the workload among the compute nodes.
- Communication: It also aggregates the results returned by the compute nodes and finalizes the query results to be returned to the client.

Compute Nodes
- Function: These nodes are where the actual data storage and query execution take place. Each compute node contains one or more slices, which are partitions of the total dataset.
- Storage: Compute nodes store data in columnar format, which is optimal for analytical queries as it allows for efficient compression and fast data retrieval.
- Processing: They perform the operations instructed by the leader node, such as filtering, aggregating, and joining data.

Node Slices
- Function: Slices are subdivisions of a compute node's memory and disk space, allowing the node's resources to be used more efficiently.
- Parallel Processing: Each slice processes its portion of the workload in parallel, which significantly speeds up query execution times.

AWS Redshift Architecture and Its Features
Redshift includes several features that help deliver performance for data processing and compression; below are some of them:

Massively Parallel Processing (MPP) Architecture
- Function: Redshift utilizes an MPP architecture, which enables it to distribute data and query execution across all available nodes and slices.
- Benefit: This architecture allows Redshift to handle large volumes of data and complex analytical queries with ease, providing fast query performance.

Columnar Storage
- Function: Data in Redshift is stored in columns rather than rows, which is ideal for data warehousing and analytics because it allows for highly efficient data compression and reduces the amount of data that needs to be read from disk for queries.
- Benefit: This storage format is particularly advantageous for queries that involve a subset of a table's columns, as it minimizes disk I/O requirements and speeds up query execution.

Data Compression
- Function: Redshift automatically applies compression techniques to data stored in its columns, significantly reducing the storage space required and increasing query performance.
- Customization: Users can select from various compression algorithms, depending on the nature of their data, to optimize storage and performance further.

Redshift Spectrum
- Function: An extension of Redshift's capabilities, Spectrum allows users to run queries against exabytes of data stored in Amazon S3, directly from within Redshift, without needing to load or transform the data.
- Benefit: This provides a seamless integration between Redshift and the broader data ecosystem in AWS, enabling complex queries across a data warehouse and data lake.

Integrations with AWS Redshift
Redshift's ability to integrate with various AWS services and third-party applications expands its utility and flexibility. This section highlights key integrations that enhance Redshift's data warehousing capabilities (a short Data API sketch appears at the end of this post).

Amazon S3 (Simple Storage Service)
Amazon S3 is an object storage service offering scalability, data availability, security, and performance. Redshift can directly query and join data stored in S3, using Redshift Spectrum, without needing to load the data into Redshift tables. Users can create external tables that reference data stored in S3, allowing Redshift to access data for querying purposes.

AWS Glue
AWS Glue can automate the ETL process for Redshift, transforming data from various sources and loading it into Redshift tables efficiently. It can also manage the data schema in the Glue Data Catalog, which Redshift can use. As benefits, this integration simplifies data preparation, automates ETL tasks, and maintains a centralized schema catalog, resulting in reduced operational burden and faster time to insights.

AWS Lambda
You can use Lambda to pre-process data before loading it into Redshift or to trigger workflows based on query outputs. This integration automates data transformation and loading processes, enhancing data workflows and reducing the time spent on data preparation.

Amazon DynamoDB
Redshift can directly query DynamoDB tables using the Redshift Spectrum feature, enabling complex queries across your DynamoDB and Redshift data. This provides a powerful combination of real-time transactional data processing in DynamoDB with complex analytics and batch processing in Redshift, offering a more comprehensive data analysis solution.

Amazon Kinesis
Redshift integrates with Kinesis Data Firehose, which can load streaming data directly into Redshift tables. This integration enables real-time data analytics capabilities, allowing businesses to make quicker, informed decisions based on the latest data.

Conclusion
AWS Redshift exemplifies a powerful, scalable solution tailored for efficient data warehousing and complex analytics. Its integration with the broader AWS ecosystem, including S3, AWS Glue, Lambda, DynamoDB, and Amazon Kinesis, underscores its versatility and capability to streamline data workflows from ingestion to insight. Redshift's architecture, leveraging columnar storage and massively parallel processing, ensures high-speed data analysis and storage efficiency. This enables organizations to handle vast amounts of data effectively, facilitating real-time analytics and decision-making. In essence, AWS Redshift stands as a cornerstone for data-driven organizations, offering a comprehensive, future-ready platform that not only meets current analytical demands but is also poised to evolve with the advancing data landscape.
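
To illustrate the AWS-side integration in code, here is a minimal sketch that runs a query through the Redshift Data API with boto3, so no database driver or open connection is needed. The cluster, database, user, and table names are hypothetical placeholders:

    import time
    import boto3

    client = boto3.client("redshift-data", region_name="us-east-1")

    # Submit a statement asynchronously against a (hypothetical) provisioned cluster.
    resp = client.execute_statement(
        ClusterIdentifier="my-redshift-cluster",
        Database="analytics",
        DbUser="analyst",
        Sql="SELECT city, COUNT(*) AS orders FROM sales GROUP BY city LIMIT 10;",
    )

    # Poll until the statement finishes, then fetch the result set.
    while True:
        desc = client.describe_statement(Id=resp["Id"])
        if desc["Status"] in ("FINISHED", "FAILED", "ABORTED"):
            break
        time.sleep(1)

    if desc["Status"] == "FINISHED":
        result = client.get_statement_result(Id=resp["Id"])
        for record in result["Records"]:
            print(record)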

  • What Data Engineers Need to Know in 2024

    The Evolution of Data Engineering
Data engineering has witnessed a transformative journey, evolving from simple data collection and storage to sophisticated processing and analysis. A historical overview reveals its roots in traditional database management, progressing through the advent of big data, to today's focus on real-time analytics and cloud computing. Recent advances have been catalyzed by the integration of artificial intelligence (AI) and machine learning (ML), pushing the boundaries of what's possible in data-driven decision-making.

Core Skills for Data Engineers in 2024
So, what do data engineers need to know in 2024? To thrive, data engineers must master a blend of foundational and cutting-edge skills:
- Programming Languages: Proficiency in languages like Python, Scala, and SQL is non-negotiable, enabling efficient data manipulation and analysis.
- Database Management: Understanding relational and NoSQL databases, alongside data warehousing solutions, forms the backbone of effective data storage strategies.
- Cloud Computing Platforms: Expertise in AWS, Google Cloud Platform, and Azure is crucial, as cloud services become central to data engineering projects.
- Data Modeling & ETL Processes: Developing robust data models and streamlining ETL (Extract, Transform, Load) processes are key to ensuring data quality and accessibility.

Emerging Technologies and Their Impact
Emerging technologies such as AI and ML, big data frameworks, and automation tools are redefining the landscape:
- Artificial Intelligence & Machine Learning: These technologies are vital for predictive modeling and advanced data analysis, offering unprecedented insights.
- Big Data Technologies: Hadoop, Spark, and Flink facilitate the handling of vast datasets, enabling scalable and efficient data processing.
- Automation and Orchestration Tools: Tools like Apache Airflow and Kubernetes enhance efficiency, automating workflows and data pipeline management (a minimal Airflow sketch follows this section).

The Importance of Data Governance and Security
With increasing data breaches and privacy concerns, data governance and security have become paramount:
- Regulatory Compliance: Familiarity with GDPR, CCPA, and other regulations is essential for legal compliance.
- Data Privacy Techniques: Implementing encryption, anonymization, and secure access controls protects sensitive information from unauthorized access.

Data Engineering in the Cloud Era
The shift towards cloud computing necessitates a deep understanding of cloud services and technologies:
- Cloud Service Providers: Navigating the offerings of major providers ensures optimal use of cloud resources.
- Cloud-native Technologies: Knowledge of containerization, microservices, and serverless computing is crucial for modern data engineering practices.

Real-time Data Processing
The ability to process and analyze data in real time is becoming increasingly important:
- Streaming Data Technologies: Tools like Apache Kafka and Amazon Kinesis support high-throughput, low-latency data streams.
- Real-time Analytics: Techniques for real-time data analysis enable immediate insights, enhancing decision-making processes.

Advanced Analytics and Business Intelligence
Advanced analytics and BI tools are essential for converting data into actionable insights:
- Predictive Analytics: Using statistical models and machine learning to predict future trends and behaviors.
- Visualization Tools: Tools like Tableau and Power BI help in making complex data understandable through interactive visualizations.
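
To make the orchestration point above concrete, here is a minimal Airflow DAG sketch, assuming an Airflow 2.x environment; the DAG id, task names, and callables are hypothetical:

    from datetime import datetime

    from airflow import DAG
    from airflow.operators.python import PythonOperator

    def extract():
        # Hypothetical extraction step, e.g., pulling a daily file from object storage.
        print("extracting...")

    def transform():
        # Hypothetical transformation step, e.g., cleaning and loading into a warehouse.
        print("transforming...")

    with DAG(
        dag_id="daily_example_pipeline",
        start_date=datetime(2024, 1, 1),
        schedule="@daily",  # on Airflow versions before 2.4, use schedule_interval instead
        catchup=False,
    ) as dag:
        extract_task = PythonOperator(task_id="extract", python_callable=extract)
        transform_task = PythonOperator(task_id="transform", python_callable=transform)

        extract_task >> transform_task  # run transform only after extract succeeds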
Career Pathways and Growth Opportunities Exploring certifications, training, and staying informed about industry demand prepares data engineers for career advancement: Certification and Training: Pursuing certifications in specific technologies or methodologies can bolster expertise and credibility. Industry Demand: Understanding the evolving market demand ensures data engineers can align their skills with future opportunities. Preparing for the Future Continuous learning and community engagement are key to staying relevant in the fast-paced field of data engineering: Continuous Learning: Embracing a mindset of lifelong learning ensures data engineers can adapt to new technologies and methodologies. Networking and Community Engagement: Participating in forums, attending conferences, and contributing to open-source projects fosters professional growth and innovation. Conclusion As data becomes increasingly central to business strategy, the role of data engineers in shaping the future of technology cannot be overstated. By mastering core skills, staying informed about emerging technologies, and emphasizing data governance and security, data engineers can lead the charge in leveraging data for strategic advantage in 2024 and beyond.

  • Programming Language Trends for 2024: What Developers Need to Know

In the ever-evolving landscape of technology, programming languages stand as the foundational tools empowering innovation, driving progress, and shaping the digital world we inhabit. As we venture into 2024, the significance of understanding and leveraging these languages has never been more pronounced. From powering artificial intelligence to enabling seamless web development, programming languages play a pivotal role in defining the trajectory of tech trends and driving transformative change across industries. In this era of rapid technological advancement, staying abreast of the latest programming languages is not merely advantageous: it's imperative. Developers, engineers, and tech enthusiasts alike must recognize the profound impact that mastering these languages can have on their ability to navigate and thrive in the dynamic tech landscape of 2024. Programming languages serve as the building blocks of innovation, providing developers with the means to translate ideas into tangible solutions. In 2024, familiarity with cutting-edge languages equips individuals with the tools needed to push the boundaries of what's possible, whether through developing AI-driven applications, crafting immersive virtual experiences, or architecting resilient software systems. With every technological advancement comes a myriad of opportunities waiting to be seized. Whether it's capitalizing on the burgeoning fields of data science, blockchain technology, or quantum computing, proficiency in the right programming languages positions individuals to harness these opportunities and carve out their niche in the digital landscape of 2024. In an increasingly competitive job market, proficiency in in-demand programming languages can be a game-changer for career advancement. Employers across industries are seeking skilled professionals capable of leveraging the latest tools and technologies to drive business success. By staying ahead of the curve and mastering emerging languages, individuals can enhance their employability and unlock a wealth of career opportunities. For this post, I decided to write about the programming language trends for 2024, and I hope it helps you make better decisions about which directions to follow this year in this broad field. Python Python continues to maintain its position as one of the most popular and versatile programming languages. With its simplicity, readability, and extensive ecosystem of libraries and frameworks, Python is widely used in fields such as data science, artificial intelligence, web development, and automation. In 2024, Python's relevance is further amplified by its adoption in emerging technologies like machine learning, quantum computing, and the metaverse. Rust Rust has been gaining traction as a systems programming language known for its performance, safety, and concurrency features. In 2024, Rust is increasingly used in critical systems development, including operating systems, game engines, and web browsers. Its emphasis on memory safety and zero-cost abstractions makes it particularly suitable for building secure and reliable software, making it a favored choice for projects demanding high performance and robustness. TypeScript TypeScript, a superset of JavaScript with static typing, continues to see widespread adoption in web development. Its ability to catch errors at compile time, improve code maintainability, and enhance developer productivity has made it a preferred choice for building large-scale web applications.
In 2024, TypeScript's popularity remains strong, driven by its integration with popular frameworks like Angular, React, and Vue.js, as well as its support for modern JavaScript features. Julia Julia, a high-level programming language designed for numerical and scientific computing, is gaining prominence in fields such as data science, computational biology, and finance. Known for its speed and ease of use, Julia combines the flexibility of dynamic languages with the performance of compiled languages, making it well-suited for tasks involving mathematical computations and large-scale data analysis. In 2024, Julia continues to attract researchers, engineers, and data scientists seeking efficient and expressive tools for scientific computing. Kotlin Kotlin, a statically-typed programming language for the Java Virtual Machine (JVM), has emerged as a popular choice for Android app development. Offering modern features, interoperability with Java, and seamless integration with popular development tools, Kotlin enables developers to build robust and efficient Android applications. In 2024, Kotlin's adoption in the Android ecosystem remains strong, driven by its developer-friendly syntax, strong tooling support, and endorsement by Google as a preferred language for Android development. Golang (Go) Go, often referred to as Golang, continues to gain traction as a language for building scalable and efficient software systems. Known for its simplicity, performance, and built-in concurrency support, Go is well-suited for developing cloud-native applications, microservices, and distributed systems. In 2024, Go's popularity is fueled by its role in enabling the development of resilient and high-performance software architectures, particularly in cloud computing, DevOps, and container orchestration. What programming languages do big tech companies use? Below is an overview of the programming languages that the main big tech companies use in their stacks, so if you want to work at a big tech company, get ready to learn these languages. Conclusion In 2024, the programming landscape is characterized by a diverse set of languages, each catering to specific use cases and development requirements. From Python's versatility to Rust's performance, TypeScript's productivity to Julia's scientific computing capabilities, Kotlin's Android development to Go's system-level programming, developers have a rich array of tools at their disposal to tackle the challenges and opportunities presented by emerging technologies and industry trends. Whether building AI-powered applications, crafting scalable web services, or optimizing system performance, the choice of programming language plays a crucial role in shaping the success and impact of software projects in the dynamic tech landscape of 2024.

  • Exploring the Power of Virtual Threads in Java 21

    Introduction to Virtual Threads in Java 21 Concurrency has always been a cornerstone of Java programming, empowering developers to build responsive and scalable applications. However, managing threads efficiently while ensuring high performance and low resource consumption has been a perennial challenge. With the release of Java 21, a groundbreaking feature called Virtual Threads emerges as a game-changer in the world of concurrent programming. Concurrency challenges in Java and the problem with traditional threads Concurrency in Java presents developers with both immense opportunities for performance optimization and formidable challenges in ensuring thread safety and managing shared resources effectively. As applications scale and become more complex, navigating these challenges becomes increasingly crucial. Managing Shared Resources: One of the fundamental challenges in concurrent programming is managing shared resources among multiple threads. Without proper synchronization mechanisms, concurrent access to shared data can lead to data corruption and inconsistencies. Avoiding Deadlocks: Deadlocks occur when two or more threads are blocked indefinitely, waiting for each other to release resources. Identifying and preventing deadlocks is crucial for maintaining the responsiveness and stability of concurrent applications. Performance Bottlenecks: While concurrency can improve performance by leveraging multiple threads, it can also introduce overhead and contention, leading to performance bottlenecks. It's essential to carefully design concurrent algorithms and use appropriate synchronization mechanisms to minimize contention and maximize throughput. High Memory Overhead: Traditional threads in Java are implemented as native threads managed by the operating system. Each native thread consumes a significant amount of memory, typically in the range of several megabytes. This overhead becomes problematic when an application needs to create a large number of threads, as it can quickly deplete system resources. Limited Scalability: The one-to-one mapping between Java threads and native threads imposes a limit on scalability. As the number of threads increases, so does the memory overhead and the scheduling complexity. This limits the number of concurrent tasks an application can handle efficiently, hindering its scalability and responsiveness. Difficulty in Debugging and Profiling: Debugging and profiling concurrent applications built with traditional threads can be challenging due to the non-deterministic nature of thread execution and the potential for subtle timing-related bugs. Identifying and diagnosing issues such as race conditions and thread contention requires specialized tools and expertise. What are Virtual Threads? Virtual Threads represent a paradigm shift in how Java handles concurrency. Traditionally, Java applications rely on OS-level threads, which are heavyweight entities managed by the operating system. Each thread consumes significant memory resources, limiting scalability and imposing overhead on the system. Virtual Threads, on the other hand, are lightweight and managed by the Java Virtual Machine (JVM) itself. They are designed to be highly efficient, allowing thousands or even millions of virtual threads to be created without exhausting system resources. Virtual Threads offer a more scalable and responsive concurrency model compared to traditional threads. 
Benefits of Virtual Threads Virtual Threads come with a host of features and benefits that make them an attractive choice for modern Java applications: Lightweight: Virtual Threads have minimal memory overhead, allowing for the creation of large numbers of threads without exhausting system resources. This lightweight nature makes them ideal for highly concurrent applications. Structured Concurrency: Virtual Threads promote structured concurrency, which helps developers write more reliable and maintainable concurrent code. By enforcing clear boundaries and lifecycles for concurrent tasks, structured concurrency simplifies error handling and resource management. Improved Scalability: With Virtual Threads, developers can achieve higher scalability and throughput compared to traditional threads. The JVM's scheduler efficiently manages virtual threads, ensuring optimal utilization of system resources. Integration with CompletableFuture: Virtual Threads combine naturally with CompletableFuture; by supplying a virtual-thread-per-task executor, each stage of an asynchronous pipeline runs in its own virtual thread. CompletableFuture provides a fluent API for composing and chaining asynchronous tasks, making it easier to write non-blocking, responsive applications. Examples of Virtual Threads Code sketches for each of the examples below are included after the conclusion of this post. Creating and Running a Virtual Thread This example demonstrates the creation and execution of a virtual thread. We use the Thread.startVirtualThread() method to start a new virtual thread with the specified task, which prints a message indicating its execution. We then call join() on the virtual thread to wait for its completion before proceeding. CompletableFuture with Virtual Threads This example showcases the usage of virtual threads with CompletableFuture. We chain asynchronous tasks using the supplyAsync(), thenApplyAsync(), and thenAcceptAsync() methods, passing a virtual-thread-per-task executor so that these tasks execute in virtual threads, allowing for efficient asynchronous processing. Virtual Thread Executor Example In this example, we create a virtual-thread-per-task executor using Executors.newVirtualThreadPerTaskExecutor(). We then submit tasks to it using the submit() method. Each task executes in its own virtual thread, demonstrating efficient concurrency management without a traditional fixed-size thread pool. Using ThreadFactory with Virtual Threads Here, we demonstrate the use of a ThreadFactory with virtual threads. We create a virtual thread factory using Thread.ofVirtual().factory() and then use it to create and start virtual threads; such a factory can also be handed to any API that accepts a ThreadFactory. Tasks run by these threads execute in virtual threads created by the factory. Virtual Thread Group Example In this final example, we inspect the thread group of a virtual thread. In Java 21, virtual threads cannot be assigned to custom thread groups: they all belong to a special placeholder group named "VirtualThreads", which can be observed by calling getThreadGroup() from within the running task. Conclusion In conclusion, Virtual Threads introduced in Java 21 mark a significant milestone in the evolution of Java's concurrency model. By providing lightweight, scalable concurrency within the JVM, Virtual Threads address many of the limitations associated with traditional threads, offering developers a more efficient and flexible approach to concurrent programming. With Virtual Threads, developers can create and manage thousands or even millions of threads with minimal overhead, leading to improved scalability and responsiveness in Java applications.
The structured concurrency model enforced by Virtual Threads simplifies error handling and resource management, making it easier to write reliable and maintainable concurrent code. Furthermore, the integration of Virtual Threads with CompletableFuture and other asynchronous programming constructs enables developers to leverage the full power of Java's concurrency framework while benefiting from the performance advantages of Virtual Threads. Overall, Virtual Threads in Java 21 represent a significant advancement that empowers developers to build highly concurrent and responsive applications with greater efficiency and scalability. As developers continue to explore and adopt Virtual Threads, we can expect to see further optimizations and enhancements that will further elevate Java's capabilities in concurrent programming.
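The original code snippets did not carry over to this page, so below is a minimal, self-contained sketch of the examples described above, written against the final Java 21 API (Thread.startVirtualThread, Executors.newVirtualThreadPerTaskExecutor, Thread.ofVirtual().factory()); the task bodies, names and counts are illustrative choices rather than the post's original code.

import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.ThreadFactory;

public class VirtualThreadExamples {
    public static void main(String[] args) throws Exception {
        // 1. Creating and running a single virtual thread, then waiting for it
        Thread vt = Thread.startVirtualThread(
                () -> System.out.println("Running in " + Thread.currentThread()));
        vt.join();

        // 2. CompletableFuture stages executed on a virtual-thread-per-task executor
        try (ExecutorService vexec = Executors.newVirtualThreadPerTaskExecutor()) {
            CompletableFuture
                    .supplyAsync(() -> "payload", vexec)
                    .thenApplyAsync(String::toUpperCase, vexec)
                    .thenAcceptAsync(System.out::println, vexec)
                    .join();

            // 3. Submitting many tasks; each submit() starts a brand new virtual thread
            for (int i = 0; i < 10_000; i++) {
                int id = i;
                vexec.submit(() -> System.out.println("task " + id));
            }
        } // close() waits for the submitted tasks to finish

        // 4. Building virtual threads through a ThreadFactory
        ThreadFactory factory = Thread.ofVirtual().name("worker-", 0).factory();
        Thread worker = factory.newThread(() -> System.out.println("factory-created virtual thread"));
        worker.start();
        worker.join();

        // 5. Inspecting the placeholder thread group shared by all virtual threads
        Thread.startVirtualThread(() ->
                System.out.println("group: " + Thread.currentThread().getThreadGroup().getName()))
                .join();
    }
}

A design note worth keeping in mind: virtual threads are cheap enough that pooling them brings no benefit, which is why the executor above creates a fresh virtual thread per task instead of reusing a fixed pool.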

  • Listing AWS Glue tables

Using an AWS SDK is always a good option if you need to explore some feature further in search of a solution. In this post, we're going to explore part of AWS Glue using the SDK for Java. Glue is an AWS ETL tool that provides a central repository of metadata, called the Glue Catalog. In short, the Glue Catalog keeps the entire structure of databases and tables and their schemas in a single place. The idea of this post is to programmatically list all the tables of a given database in the Glue Catalog using the SDK and Java (a minimal sketch of this code appears at the end of this post). Maven dependencies In this example, we're using Java 8 to better explore the use of Streams in the iteration. Understanding the code The awsGlue object is responsible for accessing the service through the credentials that must be configured; in this post we will not go into that detail. The getTablesRequest object is responsible for setting the request parameters, in this case the database. The getTablesResult object is responsible for listing the tables based on the parameters set by the getTablesRequest object and also for controlling the result flow. Note that in addition to returning the tables through the getTablesResult.getTableList() method, this same object returns a token, which is explained in the next item. The token is represented by the getTablesResult.getNextToken() method; the idea of the token is to control the flow of results, since all results are paginated, and if a token is returned, it means there is still data to be fetched. In the code, we use a loop that checks for the existence of the token: if there is still a token, it is set on the getTablesRequest object through getTablesRequest.setNextToken(token) to return more results. It's a way to paginate results. Books to study and read If you want to learn more and reach a higher level of knowledge, I strongly recommend reading the following book(s): AWS Cookbook is a practical guide containing 70 familiar recipes about AWS resources and how to solve different challenges. It's a well-written, easy-to-understand book covering key AWS services through practical examples. AWS, or Amazon Web Services, is the most widely used cloud platform in the world today; if you want to understand the subject better and be well positioned in the market, I strongly recommend studying it. Setup recommendations If you're interested in knowing the setup I use to develop my tutorials, here it is: Notebook Dell Inspiron 15 15.6 Monitor LG Ultrawide 29WL500-29 Well that’s it, I hope you enjoyed it!
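Here is a minimal sketch of the listing and pagination loop described above, using the Glue client from the AWS SDK for Java v1 (it assumes the aws-java-sdk-glue dependency is on the classpath and that credentials come from the default provider chain; the database name is a placeholder).

import com.amazonaws.services.glue.AWSGlue;
import com.amazonaws.services.glue.AWSGlueClientBuilder;
import com.amazonaws.services.glue.model.GetTablesRequest;
import com.amazonaws.services.glue.model.GetTablesResult;
import com.amazonaws.services.glue.model.Table;

public class ListGlueTables {
    public static void main(String[] args) {
        // Credentials and region are resolved from the default provider chain
        AWSGlue awsGlue = AWSGlueClientBuilder.defaultClient();

        // Request parameters: only the database name is set here
        GetTablesRequest getTablesRequest = new GetTablesRequest()
                .withDatabaseName("my_database"); // hypothetical database name

        String token = null;
        do {
            getTablesRequest.setNextToken(token);
            GetTablesResult getTablesResult = awsGlue.getTables(getTablesRequest);

            // Print the table names of the current page using Streams
            getTablesResult.getTableList().stream()
                    .map(Table::getName)
                    .forEach(System.out::println);

            // A non-null token means there are still pages to fetch
            token = getTablesResult.getNextToken();
        } while (token != null);
    }
}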

  • Creating AWS CloudWatch alarms

The use of alarms is an essential requirement when working with various resources in the cloud. It is one of the most efficient ways to monitor and understand the behavior of an application when its metrics differ from what is expected. In this post, we're going to create an alarm from scratch using AWS CloudWatch based on a specific scenario. There are several other tools that allow us to set up alarms, but when working with AWS, setting alarms using CloudWatch is very simple and fast. Use Case Scenario For a better understanding, suppose we created a resiliency mechanism in an architecture to prevent data losses. This mechanism acts whenever something goes wrong, such as components that are not working as expected sending failure messages to an SQS queue. CloudWatch allows us to set an alarm so that, when a message is sent to this queue, the alarm is triggered. First of all, we need to create a queue and send messages just to generate some metrics that we're going to use in our alarm. That's a way to simulate a production environment. After the queue and the alarm are created, we'll send more messages to test the alarm. Creating an SQS Queue Let's create a simple SQS queue and choose some metrics that we can use in our alarm. Access the AWS console and, in the search bar, type "sqs" as shown in the image below, then access the service. After accessing the service, click Create queue. Let's create a Standard queue for this example and name it sqs-messages. You don't need to pay attention to the other details; just click the Create queue button to finish. The queue has been created; in the next step we'll send a few messages just to generate metrics. Sending messages Let's send a few messages to the previously created queue; feel free to change the message content if you want to. After sending these messages, some metrics will be generated automatically. In this case, a metric called NumberOfMessagesSent was created in CloudWatch and we can use it to create the alarm. Creating an Alarm For our example, let's choose the metric based on the number of messages sent (NumberOfMessagesSent). Access AWS via the console and search for CloudWatch in the search bar, as shown in the image below. After accessing the service, click on the Alarms/In alarm option in the left corner of the screen and then click the Create alarm button. Select the metric according to the screen below: choose SQS, then click Queue Metrics. Search for the queue name and select the metric name column item labeled NumberOfMessagesSent, then click Select Metric. Setting metrics Metric name: the metric chosen in the previous steps. This metric measures the number of messages sent to the SQS queue (NumberOfMessagesSent). QueueName: name of the SQS queue on which the alarm will be configured. Statistic: in this field we can choose options such as Average, Sum, Minimum and more. This will depend on the context in which you need to configure the alarm and the metric. For this example we choose Sum, because we want the sum of the number of messages sent in a given period. Period: in this field we define the period over which the metric is evaluated against the limit condition, which will be defined in the next steps. Setting conditions Threshold type: for this example we will use Static. Whenever NumberOfMessagesSent is...: let's select the Greater option. Than...: in this field we configure the number of NumberOfMessagesSent that triggers the alarm. Let's put 5.
Additional configuration For the additional configuration, we have the datapoints to alarm field, whose operation I'd like to detail a little more. Datapoints to alarm This additional option makes the alarm configuration more flexible, combined with the previously defined conditions. By default, this setting is: 1 of 1 How does it work? The first field refers to the number of datapoints and the second one refers to the evaluation period. Keeping the previous settings combined with this default means that the alarm will be triggered if the NumberOfMessagesSent metric has a sum greater than 5 within a period of 5 minutes. So far, the default additional configuration does not change the previously defined settings; nothing changes. Now, let's change this setting to understand it better. Let's change it from 1 of 1 to 2 of 2. This means that the alarm condition (the sum of NumberOfMessagesSent greater than 5) must be met for 2 datapoints within a 10-minute window before the alarm is triggered. Note that the period was multiplied because of the second field with the value 2. Summarizing: even if the condition is met, the alarm will only be triggered if there are 2 datapoints above the threshold in a period of 10 minutes. We will understand this even better when we carry out some alarm activation tests. Let's keep the following settings and click Next. Configuring actions On the next screen, we're going to configure the actions responsible for notifying a destination if the alarm is triggered. On this screen, we keep the In alarm setting, then create a new topic and, finally, add the email address where we want to receive notifications. Select the option Create new topic, fill in the desired name and then enter a valid email in the field Email endpoints that will receive notification. Once completed, click Create topic; an email will then be sent to confirm the subscription to the created topic. Make sure you've received the confirmation email and click Next on the alarm screen to proceed with the creation. Now, we need to enter the alarm name on the screen below and then click Next. The next screen is the review screen; click Create alarm to finish. Okay, now we have an alarm created and it's time to test it. Alarm Testing In the beginning we sent a few messages just to generate the NumberOfMessagesSent metric, but at this point we need to send more messages that will trigger the alarm. So let's send more messages and see what happens. After sending the messages, notice that even though the threshold was exceeded, the alarm was not triggered. This is because the threshold was exceeded for only 1 datapoint within the 10-minute window. Now, let's send continuous messages that exceed the threshold in short periods within the 10-minute window. Note that in the image above the alarm was triggered because, in addition to meeting the condition specified in the settings, it also reached the 2 datapoints. Check the email address added in the notification settings; an email with the alarm details should have been sent. The alarm status will return to OK once the messages no longer exceed the threshold. If you prefer to create the same alarm from code, a sketch using the AWS SDK for Java is included at the end of this post. Books to study and read If you want to learn more and reach a higher level of knowledge, I strongly recommend reading the following book(s): AWS Cookbook is a practical guide containing 70 familiar recipes about AWS resources and how to solve different challenges.
It's a well-written, easy-to-understand book covering key AWS services through practical examples. AWS, or Amazon Web Services, is the most widely used cloud platform in the world today; if you want to understand the subject better and be well positioned in the market, I strongly recommend studying it. Well, that’s it, I hope you enjoyed it!
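As a complement to the console walkthrough above, here is a minimal sketch of the same alarm created programmatically with the AWS SDK for Java v1; the alarm name, queue name and SNS topic ARN are placeholders, and the settings mirror the Sum statistic, the threshold of 5, the 5-minute period and the 2 of 2 datapoints configuration described in this post.

import com.amazonaws.services.cloudwatch.AmazonCloudWatch;
import com.amazonaws.services.cloudwatch.AmazonCloudWatchClientBuilder;
import com.amazonaws.services.cloudwatch.model.ComparisonOperator;
import com.amazonaws.services.cloudwatch.model.Dimension;
import com.amazonaws.services.cloudwatch.model.PutMetricAlarmRequest;
import com.amazonaws.services.cloudwatch.model.Statistic;

public class CreateSqsAlarm {
    public static void main(String[] args) {
        AmazonCloudWatch cloudWatch = AmazonCloudWatchClientBuilder.defaultClient();

        PutMetricAlarmRequest request = new PutMetricAlarmRequest()
                .withAlarmName("sqs-messages-alarm")                  // hypothetical alarm name
                .withNamespace("AWS/SQS")
                .withMetricName("NumberOfMessagesSent")
                .withDimensions(new Dimension()
                        .withName("QueueName")
                        .withValue("sqs-messages"))                   // queue created earlier
                .withStatistic(Statistic.Sum)
                .withPeriod(300)                                      // 5-minute period
                .withThreshold(5.0)
                .withComparisonOperator(ComparisonOperator.GreaterThanThreshold)
                .withEvaluationPeriods(2)                             // the "2 of 2" setting:
                .withDatapointsToAlarm(2)                             // 2 datapoints out of 2 periods
                .withAlarmActions("arn:aws:sns:us-east-1:123456789012:alarm-topic"); // hypothetical SNS topic

        cloudWatch.putMetricAlarm(request);
        System.out.println("Alarm created");
    }
}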

  • First steps with Delta Lake

What's Delta Lake? Delta Lake is an open-source project that manages the storage layer in your data lake. In practice, it's an abstraction on top of Apache Spark that reuses the same mechanisms while offering extra capabilities such as support for ACID transactions. Everyone knows that keeping data integrity in a data pipeline is a critical task in the face of high read and write concurrency. Delta Lake provides audit history, data versioning and supports DML operations such as deletes, updates and merges. For this tutorial, we're going to simulate a data pipeline locally, focusing on Delta Lake's advantages. First, we'll load a Spark Dataframe from a JSON file, create a temporary view and then a Delta table on which we'll perform some Delta operations. We'll use Java as the programming language and Maven as the dependency manager, in addition to Spark and Hive to keep our data catalog. Maven The following dependencies are required: spark-core_2.12 (3.0.1), spark-sql_2.12 (3.0.1) and spark-hive_2.12 (3.0.1) from the org.apache.spark group, plus delta-core_2.12 (0.8.0) from io.delta. The code will be developed in short snippets for a better understanding. Setting Spark with Delta and Hive String val_ext="io.delta.sql.DeltaSparkSessionExtension"; String val_ctl="org.apache.spark.sql.delta.catalog.DeltaCatalog"; SparkConf sparkConf = new SparkConf(); sparkConf.setAppName("app"); sparkConf.setMaster("local[1]"); sparkConf.set("spark.sql.extensions",val_ext); sparkConf.set("spark.sql.catalog.spark_catalog",val_ctl); SparkSession sparkSession = SparkSession.builder() .config(sparkConf) .enableHiveSupport() .getOrCreate(); Understanding the code above We define two variables, val_ext and val_ctl, and assign their values to the keys spark.sql.extensions and spark.sql.catalog.spark_catalog. These are necessary for configuring Delta together with Spark. We named the Spark application app. Since we are not running Spark on a cluster, the master is configured to run locally with local[1]. Spark supports Hive; in this case we enable it with enableHiveSupport(). Data Ingest Let's work with a Spark Dataframe as the data source. We load a Dataframe from a JSON file. order.json file {"id":1, "date_order": "2021-01-23", "customer": "Jerry", "product": "BigMac", "unit": 1, "price": 8.00} {"id":2, "date_order": "2021-01-22", "customer": "Olivia", "product": "Cheese Burguer", "unit": 3, "price": 21.60} {"id":3, "date_order": "2021-01-21", "customer": "Monica", "product": "Quarter", "unit": 2, "price": 12.40} {"id":4, "date_order": "2021-01-23", "customer": "Monica", "product": "McDouble", "unit": 2, "price": 13.00} {"id":5, "date_order": "2021-01-23", "customer": "Suzie", "product": "Double Cheese", "unit": 2, "price": 12.00} {"id":6, "date_order": "2021-01-25", "customer": "Liv", "product": "Hamburger", "unit": 1, "price": 2.00} {"id":7, "date_order": "2021-01-25", "customer": "Paul", "product": "McChicken", "unit": 1, "price": 2.40} Creating a Dataframe Dataset<Row> df = sparkSession.read().json("datasource/"); df.createOrReplaceGlobalTempView("order_view"); Understanding the code above In the previous snippet, we create a Dataframe from the JSON file located inside the datasource/ directory; create this directory so that your code structure is easier to follow, and then create the order.json file with the content shown earlier. Finally, we create a global temporary view that will help us in the next steps. Creating a Delta Table Let's create the Delta table from an SQL script. At first glance the creation looks simple, but notice that we use types that differ from those of a table in a relational database.
For example, we use STRING instead of VARCHAR, and so on. We are partitioning the table by the date_order field. This field was chosen as a partition because we expect many different dates. In this way, queries can use this field as a filter, aiming at better performance. And finally, we define the table as a Delta table through the USING DELTA clause. String statement = "CREATE OR REPLACE TABLE orders (" + "id STRING, " + "date_order STRING," + "customer STRING," + "product STRING," + "unit INTEGER," + "price DOUBLE) " + "USING DELTA " + "PARTITIONED BY (date_order) "; sparkSession.sql(statement); Understanding the code above In the previous snippet we create a Delta table called orders and then execute the creation. DML Operations Delta supports Delete, Update and Insert operations, which can also be combined through Merge. Using Merge together with Insert and Update In this step, we are going to execute a Merge, which makes it possible to control the flow of inserting and updating data through a table, Dataframe or view. Merge works on row matches, which will become clearer in the next snippet. String mergeStatement = "Merge into orders " + "using global_temp.order_view as orders_view " + "ON orders.id = orders_view.id " + "WHEN MATCHED THEN " + "UPDATE SET orders.product = orders_view.product," + "orders.price = orders_view.price " + "WHEN NOT MATCHED THEN INSERT * "; sparkSession.sql(mergeStatement); Understanding the code above In the snippet above we execute the Merge operation using the order_view view created in the previous steps. The condition orders.id = orders_view.id drives the matching: if it is true (MATCHED), the data is updated; otherwise (NOT MATCHED), the data is inserted. In the case above, the data is inserted, because until then there was no data in the orders table. Run the command below to view the inserted data. sparkSession.sql("select * from orders").show(); Update the datasource/order.json file by changing the product and price fields and run all the snippets again. You will see that all records are updated. Update operation It is possible to run an Update without using Merge; just run the command below: String updateStatement = "update orders " + "set product = 'Milk-Shake' " + "where id = 2"; sparkSession.sql(updateStatement); Delete operation String deleteStatement = "delete from orders where id = 2"; sparkSession.sql(deleteStatement); In addition to being able to execute the Delete command on its own, it is also possible to use it within a Merge. Understanding the Delta Lake Transaction Log (DeltaLog) In addition to supporting ACID transactions, Delta generates JSON files that serve as a way to audit and maintain the history of each transaction, covering both DDL and DML commands. Through this mechanism it is even possible to go back to a specific state of the table if necessary (see the time-travel sketch at the end of this post). For each executed transaction, a JSON file is created inside the _delta_log folder. The initial file will always be 000000000.json, containing the transaction commits. In our scenario, this first file contains the commits for creating the orders table. For a better view, go to the local folder called spark-warehouse, which was probably created in the root directory of your project. This folder was created by Hive to hold the resources created from the JSON files and parquet files. Inside it you will find a folder structure like the one shown below: Note that the files are created in ascending order, one for each executed transaction.
Access each JSON file and you will see each transaction that was executed through the operation field, in addition to other information. 00000000000000000000.json "operation":"CREATE OR REPLACE TABLE" 00000000000000000001.json "operation":"MERGE" 00000000000000000002.json "operation":"UPDATE" 00000000000000000003.json "operation":"DELETE" Also note that the parquet files were generated partitioned into folders by the date_order field. Hope you enjoyed!
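As mentioned in the DeltaLog section, the transaction log also makes it possible to read the table as it was at an earlier point in time. The snippet below is a minimal sketch of that kind of time travel, meant to follow the code above and reusing the sparkSession created earlier; the spark-warehouse/orders path and the timestamp are assumptions based on this tutorial's local layout and may differ in your environment.

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

// Read the orders table as it was at version 0 (right after CREATE OR REPLACE TABLE)
Dataset<Row> ordersV0 = sparkSession.read()
        .format("delta")
        .option("versionAsOf", 0)
        .load("spark-warehouse/orders"); // local path created by Hive in this tutorial

ordersV0.show();

// Timestamp-based time travel is also supported; adjust the value
// to a timestamp that exists within your table's history
Dataset<Row> ordersBefore = sparkSession.read()
        .format("delta")
        .option("timestampAsOf", "2021-01-24 00:00:00")
        .load("spark-warehouse/orders");

ordersBefore.show();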

  • Using Comparator.comparing to sort Java Stream

Introduction Sorting data is a common task in many software development projects. When working with collections of objects in Java, a powerful and flexible approach to sorting is to use Comparator.comparing in conjunction with Streams. In this post, we are going to show that using Comparator.comparing to sort a Java Stream can make sorting elegant and efficient (minimal sketches of the snippets described below are included at the end of this post). What is Comparator.comparing? Comparator.comparing is a static method introduced in Java 8 on the java.util.Comparator interface. It allows you to specify a key extractor function (sort key) used to compare objects: the function extracts a value from each object, and that value is used for comparison during sorting. Flexibility in sorting with Comparator.comparing One of the main advantages of Comparator.comparing is its flexibility. With it, we can sort by different fields of an object, allowing the creation of complex sorting logic in a simple and concise way. Notice in the code below that we simply pass Comparator.comparing as an argument to the sorted() method and hand it the city field through a method reference (People::getCity), so the sort is performed by that field. Output Monica John Mary Anthony Seth Multi-criteria ordering Often, it is necessary to sort based on multiple criteria. This is easily achieved with Comparator.comparing by simply chaining several comparison keys, each specifying a different criterion. Java will carry out the ordering according to the specified sequence. For example, we can sort the same list by city and then by name: Comparator.comparing(People::getCity).thenComparing(People::getName). Ascending and descending sort Another important advantage of Comparator.comparing is the ability to sort in both ascending and descending order. To do this, just chain the reversed() method as in the code below: Output Seth Mary John Anthony Monica Efficiency and simplicity By using Comparator.comparing in conjunction with Streams, sorting becomes more efficient and elegant. The combination of these features allows you to write clean code that is easy to read and maintain. Furthermore, Java internally optimizes sorting using efficient algorithms, resulting in satisfactory performance even for large datasets. Final conclusion Comparator.comparing is a powerful tool for sorting Streams in Java. Its flexibility, ascending and descending sorting capabilities, support for multiple criteria, and efficient execution make it a valuable choice for any Java developer. By taking advantage of this method, we can obtain more concise, less verbose and more efficient code, facilitating the manipulation of objects in a Stream. Hope you enjoyed!
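Here is a minimal, self-contained sketch of the sorting snippets described above; the People class, its fields and the sample data are assumptions made for illustration, so the printed order depends on the data you use.

import java.util.Comparator;
import java.util.List;

public class ComparatorComparingExample {

    // Hypothetical domain class used only for illustration
    record People(String name, String city) {
        String getName() { return name; }
        String getCity() { return city; }
    }

    public static void main(String[] args) {
        List<People> people = List.of(
                new People("John", "Boston"),
                new People("Mary", "Chicago"),
                new People("Anthony", "Denver"),
                new People("Seth", "Seattle"),
                new People("Monica", "Austin"));

        // Sorting by a single key: the city field
        people.stream()
                .sorted(Comparator.comparing(People::getCity))
                .map(People::getName)
                .forEach(System.out::println);

        // Multi-criteria sorting: first by city, then by name
        people.stream()
                .sorted(Comparator.comparing(People::getCity)
                        .thenComparing(People::getName))
                .map(People::getName)
                .forEach(System.out::println);

        // Descending order: just chain reversed()
        people.stream()
                .sorted(Comparator.comparing(People::getCity).reversed())
                .map(People::getName)
                .forEach(System.out::println);
    }
}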

  • Applying Change Data Feed for auditing on Delta tables

What is the Change Data Feed? Change Data Feed is a Delta Lake feature, available as of version 2.0.0, that allows tracking changes at the row level in Delta tables, such as DML operations (Merge, Delete or Update), data versions and the timestamp of when each change happened. The process maps Merge, Delete and Update operations, maintaining the history of changes at the row level; that is, each event a record goes through is registered by Delta, through the Change Data Feed, as a kind of audit trail. Of course, it can be used for different use cases; the possibilities are extensive. How it works in practice Applying the Change Data Feed to Delta tables is an interesting way to handle row-level records, and in this post we will show how it works. We will perform some operations to explore the power of the Change Data Feed. We will work with the following Dataset: Creating the Spark Session and configuring some Delta parameters From now on, we'll create the code in chunks for easier understanding (a condensed sketch of these chunks is included at the end of this post). In the code below we are creating the method responsible for maintaining the Spark session and configuring some parameters for Delta to work. Loading the Dataset Let's load the Dataset and create a temporary view to be used in our pipeline later. Creating the Delta Table Now we will create the Delta table, already configuring the Change Data Feed in the table properties, and all the metadata will be based on the previously presented Dataset. Note that we're setting the table property delta.enableChangeDataFeed = true to activate the Change Data Feed. Performing a Data Merge Now we'll perform a simple Merge operation so that the Change Data Feed can register it as a change in our table. Note that the Merge uses the previously created global_temp.raw_product view to upsert the data. Auditing the table Now that the Merge has been executed, let's read our table to understand what happened and how the Change Data Feed works. Notice that we're passing the following parameters: 1. readChangeFeed, which is required for using the Change Data Feed. 2. startingVersion, the parameter that defines from which version onwards changes should be displayed. Result after execution: See that in addition to the columns defined when creating the table, we have 3 new columns managed by the Change Data Feed. 1. _change_type: column containing values according to each operation performed, such as insert, update_preimage, update_postimage, delete 2. _commit_version: change version 3. _commit_timestamp: timestamp representing the change date In the result above, the upsert turned out to be a simple insert, since the match conditions for an update were not met. Deleting a record In this step we will do a simple delete of a table record, just to validate how the Change Data Feed behaves. Auditing the table (again) Note below that after deleting the record with id 6, we now have a new entry marked as delete in the table and its version incremented to 2. Another point is that the original record was maintained, but with the old version. Updating a record As a last test, we will update a record to observe the behavior of the Change Data Feed once more. Auditing the table (last time) After running the simple update on a record, notice that 2 new values have been added/updated in the _change_type column.
The update_postimage value represents the record after the update was performed, while the old record is marked as update_preimage, that is, the value before the change; both rows keep the same value in the _commit_version column, since they belong to the same update operation. Conclusion The Change Data Feed is a great resource for understanding the behavior of your data pipeline and also a way to audit records in order to better understand the operations performed on them. According to the Delta team itself, once enabled, the feature does not generate any significant overhead. It's a feature that can be fully adopted in your data strategy, as it has several benefits, as shown in this post. Repository GitHub Hope you enjoyed!
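Since the original code chunks appear here only as descriptions, below is a condensed sketch of the pipeline discussed above: a Spark session configured for Delta, a Delta table created with the Change Data Feed enabled, and a read of the change feed. The table name, columns and master setting are illustrative, and the DML steps are indicated only as a comment.

import org.apache.spark.SparkConf;
import org.apache.spark.sql.SparkSession;

public class ChangeDataFeedExample {
    public static void main(String[] args) {
        // Spark session configured with the Delta extension and catalog
        SparkConf conf = new SparkConf()
                .setAppName("cdf-example")
                .setMaster("local[1]")
                .set("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
                .set("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog");

        SparkSession spark = SparkSession.builder().config(conf).getOrCreate();

        // Delta table with the Change Data Feed enabled through a table property
        spark.sql("CREATE TABLE IF NOT EXISTS product ("
                + "id STRING, name STRING, price DOUBLE) "
                + "USING DELTA "
                + "TBLPROPERTIES (delta.enableChangeDataFeed = true)");

        // ... the Merge, Delete and Update operations described in the post would run here ...

        // Reading the change feed from version 0 onwards, which exposes the
        // _change_type, _commit_version and _commit_timestamp columns
        spark.read()
                .format("delta")
                .option("readChangeFeed", "true")
                .option("startingVersion", 0)
                .table("product")
                .show(false);
    }
}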
