Database Schema Design for Data Engineering: Essential Pitfalls and Best Practices

Yu Ishikawa
May 17, 2024


In today’s data-driven world, the design of database schemas plays a crucial role in data engineering, analytics, and overall business intelligence. As application developers, it’s essential to understand that schema design is not just about meeting application requirements but also about enabling efficient data transfer, analysis, and utilization across the organization. My motivation for writing this article stems from encountering various challenges as a data engineer, where poor schema design often hindered data accessibility and usability. By addressing common pitfalls and adopting best practices, we can create robust and scalable data infrastructures. This collaborative effort involves input from data engineers, data analysts, data scientists, and application developers, ensuring that all aspects of data utilization are considered. Join me in exploring these essential insights to improve schema design and unlock the full potential of your data.

Introduction

Overview of Database Schema Design for Data Engineering

Designing a robust and efficient database schema is a critical task for any application. However, when it comes to data engineering, there are additional considerations that must be taken into account to ensure that data can be effectively transferred to data warehouses and utilized for analytics and machine learning (ML). This involves understanding the various data transfer methods, anticipating potential pitfalls, and adopting best practices that align with both application and data consumption needs.

As AI and ML continue to gain prominence, the ability to easily access and use data stored in databases for data analytics and data science becomes increasingly critical. AI and ML models rely heavily on vast amounts of high-quality data. Effective schema design directly impacts how seamlessly data can be integrated into AI and ML workflows, supporting better decision-making, predictive analytics, and automation.

Importance of Considering Data Transfer in Schema Design

One of the primary challenges in database schema design is ensuring that data can be efficiently transferred to data warehouses. Poorly designed schemas can lead to significant difficulties in data extraction, transformation, and loading (ETL) processes, resulting in performance bottlenecks and data inconsistencies. Therefore, it is crucial to design schemas with data transfer requirements in mind, enabling smooth and efficient data pipelines.

Responsibilities of Application Developers Beyond Applications

Application developers often focus primarily on the immediate needs of their applications. However, it is equally important to consider how the data will be used by data engineers, data analysts, and data scientists. This holistic approach ensures that data is easily accessible and usable for a variety of analytical and scientific purposes. Developers should engage with other stakeholders to understand their data requirements and ensure that the database schema supports these needs.

By taking into account these broader considerations, application developers can design database schemas that not only meet the needs of their applications but also facilitate efficient data transfer and usage across the organization. This leads to more effective data engineering, better analytics, and ultimately, more informed business decisions.

Types of Data Transfer from Databases to Data Warehouses

Understanding the different types of data transfer methods is crucial for choosing the right approach based on the data size, frequency of updates, and specific needs of the organization. Application developers must be aware of these techniques to design schemas that facilitate efficient data transfer. Here, we explore the main types of data transfer, their pros and cons, and considerations for each approach.

Transferring Whole Tables

Description: The entire table is transferred to the data warehouse.

Pros:

  • Simpler to implement compared to other methods.
  • No need for complex logic to track changes.
  • Ensures that the data warehouse has a complete snapshot of the table at the time of transfer.

Cons:

  • Inefficient for large tables due to high data transfer volume.
  • Can lead to performance bottlenecks during transfer, especially if the table is very large.

Use Case: Suitable for small to medium-sized tables or tables that do not change frequently.

Considerations:

  • Ensure the table size is manageable to avoid performance bottlenecks during transfer.
  • If the table isn’t large, a simple query like SELECT * FROM {table} can be used to extract it.
  • For large tables with an auto-incrementing integer unique key, chunking the table by ranges of that key enables distributed processing, such as with Apache Spark, to transfer the whole table efficiently, as sketched below.
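
As a minimal sketch, assume a hypothetical orders table with an auto-incrementing integer primary key id. A distributed job (for example, Spark tasks, whose JDBC reader does essentially this partitioning when given a numeric partition column and bounds) can split the full-table extract into non-overlapping key ranges:

    -- Determine the key boundaries once.
    SELECT MIN(id) AS min_id, MAX(id) AS max_id FROM orders;

    -- Each worker then extracts one non-overlapping range of the integer key.
    -- With min_id = 1, max_id = 1000000, and 4 workers, worker 2 would run:
    SELECT *
    FROM orders
    WHERE id >= 250001
      AND id <= 500000;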

Incrementally Transferring Partial Tables

Description: Only new or updated records are transferred based on a specific criterion, typically a timestamp column.

Pros:

  • Reduces data transfer volume, making it efficient for large, frequently updated tables.
  • Minimizes database load during the transfer process.

Cons:

  • Requires accurate tracking of changes.
  • Dependent on reliable timestamp data and indexing.

Use Case: Ideal for large tables with frequent updates, as it reduces the volume of data transferred.

Considerations:

  • Ensure timestamp columns are indexed to enable efficient incremental extraction.
  • Verify the reliability of the timestamp data to ensure accurate tracking of changes.
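
A minimal sketch of such an extract, assuming a hypothetical orders table with an indexed updated_at column and a high-water mark saved by the previous run:

    -- Index so the incremental filter does not scan the whole table.
    CREATE INDEX idx_orders_updated_at ON orders (updated_at);

    -- Pull only rows changed since the last run; :last_watermark is the
    -- maximum updated_at captured previously, and :current_watermark bounds
    -- the window so the next run starts from a stable point.
    SELECT *
    FROM orders
    WHERE updated_at > :last_watermark
      AND updated_at <= :current_watermark
    ORDER BY updated_at;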

Change Data Capture (CDC)

Description: Captures and transfers only the changes (inserts, updates, deletes) made to the tables.

Pros:

  • Real-time data synchronization.
  • Minimal data transfer volume, making it highly efficient for dynamic data.
  • Ensures that the data warehouse is always up-to-date with the latest changes.

Cons:

  • Preparing day-0 data is hard: the entire table needs to be loaded into the data warehouse before CDC starts, and any gap between the initial snapshot and the start of CDC can result in incomplete data.
  • Deduplication or aligning incoming data with existing data can be computationally and financially expensive.
  • Maintaining CDC can be complex and require significant ongoing effort.

Use Case: Best for systems requiring near real-time data synchronization between the database and data warehouse.

Considerations:

  • Implement CDC mechanisms such as database triggers, log-based capture, or third-party tools to facilitate this process.
  • Ensure that the initial full load of the data is accurately synchronized with the CDC process to avoid data inconsistencies.
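
Log-based capture (for example, reading MySQL’s binlog with a CDC tool such as Debezium) is usually the preferred mechanism, but as a minimal trigger-based sketch in MySQL-flavored SQL, assuming a hypothetical orders table, changes can be appended to a changelog table:

    -- Changelog table that records which row changed, how, and when.
    CREATE TABLE orders_changelog (
      change_id   BIGINT AUTO_INCREMENT PRIMARY KEY,
      order_id    BIGINT      NOT NULL,
      change_type VARCHAR(10) NOT NULL,  -- 'INSERT', 'UPDATE', or 'DELETE'
      changed_at  TIMESTAMP   NOT NULL DEFAULT CURRENT_TIMESTAMP
    );

    -- Append a row whenever an order is updated; similar triggers would
    -- cover INSERT and DELETE.
    CREATE TRIGGER orders_after_update
    AFTER UPDATE ON orders
    FOR EACH ROW
      INSERT INTO orders_changelog (order_id, change_type)
      VALUES (NEW.id, 'UPDATE');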

Zero-ETL: An Emerging Concept

Zero-ETL aims to reduce or eliminate the need for traditional extract, transform, and load (ETL) processes by allowing direct access to data across systems. This approach can simplify data workflows but also presents challenges similar to those encountered with traditional data transfer methods.

Example: BigQuery federated queries allow querying data in Cloud Spanner and Cloud SQL directly from BigQuery, reducing the need for data movement.
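
As a sketch, a federated query against a Cloud SQL instance uses EXTERNAL_QUERY with a pre-created connection resource; the connection name and the orders table below are hypothetical:

    -- BigQuery pushes the inner statement down to Cloud SQL and returns
    -- the result without a separate data transfer pipeline.
    SELECT order_id, status, updated_at
    FROM EXTERNAL_QUERY(
      'projects/my-project/locations/us/connections/my_connection',
      'SELECT order_id, status, updated_at FROM orders WHERE updated_at >= "2024-05-01";'
    );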

Limitations and Considerations:

  • Data Type Compatibility: Not all data types are compatible between systems, complicating direct access.
  • Complex Data Types: Using complex data types in databases can make it difficult to leverage zero-ETL solutions effectively.
  • Shared Pitfalls: Zero-ETL can run into the same pitfalls as traditional methods, such as inefficiencies with large tables and challenges in incremental data transfer.

Zero-ETL represents an exciting advancement in data engineering, offering the potential for more streamlined and efficient data workflows. However, it is still an emerging concept and may not yet fully address all use cases and technical challenges.

Final Thoughts on Data Transfer Methods

Choosing the right type of data transfer method depends on the specific requirements and constraints of your organization. Transferring whole tables can be relatively straightforward but may not scale well with large datasets. Incrementally transferring partial tables offers a balance between performance and efficiency, while CDC provides real-time synchronization at the cost of complexity. Zero-ETL is a promising approach that simplifies data access but requires careful consideration of data type compatibility and system capabilities. By understanding these methods and their implications, application developers can design more effective and adaptable data pipelines that meet the needs of all stakeholders.

Common Pitfalls and Best Practices in Schema Design

Designing a database schema effectively requires understanding both the common pitfalls and the best practices to avoid them. Here, we discuss these pitfalls and corresponding best practices, providing detailed explanations and context to help application developers better understand these concepts. The pitfalls are organized to cover unique issues without overlap.

Lack of Timestamp Columns

Pitfall: Not including timestamp columns (created_at, updated_at) in tables can make it difficult to track changes and transfer data incrementally.

Impact on Data Transfer: This issue is particularly problematic for Incrementally Transferring Partial Tables and Change Data Capture (CDC) methods. Without timestamps, identifying new or updated records becomes challenging, leading to inefficient data extraction processes.

Best Practice: Always include timestamp columns and ensure they are indexed. This allows for efficient incremental data extraction based on time ranges, enabling easier data synchronization and historical analysis.

Example: Consider a sales table without created_at and updated_at columns. If you need to transfer only the records added or updated since the last transfer, you have no way to identify these records without timestamps. Including these columns allows you to query for new or updated records based on their timestamps, facilitating incremental transfers.
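
A minimal MySQL-flavored sketch of such a table, with database-maintained timestamps and an index to support incremental extraction (the column set is hypothetical):

    CREATE TABLE sales (
      id          BIGINT AUTO_INCREMENT PRIMARY KEY,
      customer_id BIGINT         NOT NULL,
      amount      DECIMAL(12, 2) NOT NULL,
      -- Maintained by the database, so application code cannot forget them.
      created_at  TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP,
      updated_at  TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP
                            ON UPDATE CURRENT_TIMESTAMP,
      -- Indexed so incremental extraction can filter on it efficiently.
      INDEX idx_sales_updated_at (updated_at)
    );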

Inadequate Indexing

Pitfall: Poor or missing indexes can lead to slow query performance, making data extraction and transfer inefficient.

Impact on Data Transfer: All types of data transfer (whole table, incremental, and CDC) can suffer from slow performance if indexing is inadequate. Poorly indexed tables require full table scans for queries, leading to longer extraction times and increased resource usage.

Best Practice: Implement appropriate indexing strategies for columns frequently used in queries, especially those involved in filtering and joining operations.

Example: Suppose you have a customer table with frequent queries based on customer IDs and email addresses. Without indexes on these columns, each query requires scanning the entire table, slowing down performance. Adding indexes on customer_id and email significantly speeds up these queries, improving data extraction efficiency.
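
A minimal sketch, assuming a hypothetical customers table where customer_id and email are frequently used in filters:

    -- Let lookups use an index instead of scanning the whole table.
    CREATE INDEX idx_customers_customer_id ON customers (customer_id);
    CREATE UNIQUE INDEX idx_customers_email ON customers (email);

    -- EXPLAIN shows whether a given query actually uses the new indexes.
    EXPLAIN SELECT * FROM customers WHERE email = 'alice@example.com';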

Missing Efficient Unique Identifiers

Pitfall: Tables without efficient unique identifiers (e.g., integer unique keys) can cause issues in data transfer and synchronization.

Impact on Data Transfer: This is a major issue for Incrementally Transferring Partial Tables and CDC methods. Unique identifiers are essential for tracking changes and ensuring data consistency during transfer.

Best Practice: Ensure every table has an efficient unique identifier, such as an auto-incrementing integer key, to facilitate reliable data synchronization and extraction.

Example: In Cloud Spanner, using a string-type unique key instead of an integer key can hinder incremental extraction and range-based parallel processing of large tables. An auto-incrementing integer key simplifies both.
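
A minimal MySQL-flavored sketch of such a key on a hypothetical events table; the same ordered integer key is what enables the range-chunked extraction sketched earlier:

    CREATE TABLE events (
      -- Compact, ordered identifier that range-based extraction can chunk on.
      id         BIGINT AUTO_INCREMENT PRIMARY KEY,
      event_type VARCHAR(64) NOT NULL,
      created_at TIMESTAMP   NOT NULL DEFAULT CURRENT_TIMESTAMP
    );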

Use of Complex Data Types

Pitfall: Using complex data types like JSON, arrays, or custom types can make data extraction and transformation significantly harder.

Impact on Data Transfer: This issue affects all types of data transfer, making it difficult to extract, transform, and load data into data warehouses due to the complexity of handling these types.

Best Practice: Use simple, atomic data types wherever possible. For more complex data, ensure there are tools and processes in place to handle these types efficiently.

Example: Storing customer preferences as a JSON object might be convenient for the application, but it complicates data extraction and analysis. Instead, consider using structured tables to store this information, making it easier to query and transfer.
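
As a sketch, a hypothetical key-value table keeps each preference as its own row, which plain SQL and the warehouse can filter and join on directly:

    CREATE TABLE customer_preferences (
      customer_id      BIGINT       NOT NULL,
      preference_key   VARCHAR(64)  NOT NULL,
      preference_value VARCHAR(255) NOT NULL,
      updated_at       TIMESTAMP    NOT NULL DEFAULT CURRENT_TIMESTAMP
                                    ON UPDATE CURRENT_TIMESTAMP,
      PRIMARY KEY (customer_id, preference_key)
    );

    -- For example, find customers who opted in to the newsletter.
    SELECT customer_id
    FROM customer_preferences
    WHERE preference_key = 'newsletter_opt_in'
      AND preference_value = 'true';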

Non-Normalized Data

Pitfall: Denormalized tables can lead to data redundancy and increase storage requirements, complicating data management and transfer.

Impact on Data Transfer: This is particularly problematic for transferring whole tables and incremental transfers. Denormalized data increases the volume and complexity of data being transferred, leading to inefficiencies.

Best Practice: Apply normalization principles where appropriate to reduce redundancy and improve data integrity. However, balance normalization with performance considerations, as overly normalized schemas can lead to complex joins and slow performance.

Example: A denormalized customer table that includes repeated address information for each order can be split into separate customer and address tables, reducing redundancy and improving data integrity. However, ensure that the normalization does not overly complicate queries and impact performance.
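
A minimal sketch of that split, using hypothetical customers and addresses tables:

    CREATE TABLE customers (
      id   BIGINT AUTO_INCREMENT PRIMARY KEY,
      name VARCHAR(255) NOT NULL
    );

    -- Address details live in one place and are referenced by key instead
    -- of being repeated on every customer or order row.
    CREATE TABLE addresses (
      id          BIGINT AUTO_INCREMENT PRIMARY KEY,
      customer_id BIGINT       NOT NULL,
      line1       VARCHAR(255) NOT NULL,
      city        VARCHAR(128) NOT NULL,
      postal_code VARCHAR(32)  NOT NULL,
      FOREIGN KEY (customer_id) REFERENCES customers (id)
    );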

Ignoring Data Partitioning

Pitfall: Large tables without partitioning can lead to slow query performance and complicate data extraction processes.

Impact on Data Transfer: This affects all data transfer types, especially when dealing with large datasets. Lack of partitioning can lead to performance bottlenecks and increased resource usage.

Best Practice: Implement table partitioning to improve query performance and make data extraction more manageable.

Example: A large sales table can be partitioned by date, allowing for faster queries on recent data and easier management of older data. This improves performance and makes incremental data extraction more efficient.
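
A MySQL-flavored sketch of monthly range partitioning on a hypothetical sales table; note that MySQL requires the partitioning column to appear in every unique key, hence the composite primary key:

    -- Queries and extracts that filter on sold_at only touch the matching
    -- monthly partitions.
    CREATE TABLE sales (
      id      BIGINT         NOT NULL AUTO_INCREMENT,
      sold_at DATE           NOT NULL,
      amount  DECIMAL(12, 2) NOT NULL,
      PRIMARY KEY (id, sold_at)
    )
    PARTITION BY RANGE (TO_DAYS(sold_at)) (
      PARTITION p2024_04 VALUES LESS THAN (TO_DAYS('2024-05-01')),
      PARTITION p2024_05 VALUES LESS THAN (TO_DAYS('2024-06-01')),
      PARTITION p_future VALUES LESS THAN MAXVALUE
    );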

Neglecting Data Consumption Needs

Pitfall: Application developers focus solely on application requirements, ignoring how data will be used in data engineering and analytics.

Impact on Data Transfer: This affects all data transfer types. Ignoring the needs of data engineers and analysts can lead to inefficient data models that complicate data extraction and analysis.

Best Practice: Consider the needs of data engineers and analysts during schema design, ensuring that the data can be efficiently consumed and analyzed.

Example: Designing schemas without considering how data scientists will query the data can lead to inefficient joins and complex queries, slowing down analysis and increasing resource usage. Collaborate with data consumers to understand their requirements and design schemas accordingly.

Uncommunicated Schema Changes

Pitfall: Changing schema elements, such as altering unique key types without consulting data consumers, can break existing data pipelines.

Impact on Data Transfer: This is particularly problematic for Incrementally Transferring Partial Tables and CDC methods, as changes in schema can disrupt data extraction and synchronization processes.

Best Practice: Foster communication between application developers and data consumers to ensure schema changes consider the needs of all stakeholders.

Example: Migrating from MySQL to Google Cloud Spanner and changing an integer unique key to a string type disrupted incremental data extraction processes. Ensure that all schema changes are communicated and agreed upon with all stakeholders.

Ignoring Future Database Migrations

Pitfall: Application developers often design schemas without considering future database migrations, which can impact data pipelines and data consumption.

Impact on Data Transfer: This can affect all types of data transfer, making future migrations and updates more difficult and disruptive.

Best Practice: Consider potential future database migrations and their impact on data pipelines and consumption. Design schemas that are adaptable to changes in technology and infrastructure.

Example: Designing a schema that can easily be migrated from MySQL to another database system ensures continuity and reduces the risk of disruptions to data pipelines. This involves using standardized data types and avoiding database-specific features that complicate migrations.

Not Planning for Data Growth

Pitfall: Not planning for data growth can lead to challenges in data transfer and processing as data volumes increase.

Impact on Data Transfer: This affects all data transfer types, especially as data volumes grow. Without planning, data transfer processes can become inefficient and resource-intensive.

Best Practice: Plan for data growth and adapt transfer strategies as the data size increases. Start with simple full-table transfers and transition to incremental loads or CDC as needed.

Example: Initially transferring a small table with SELECT * FROM {table} might work, but as the table grows, using incremental loads or CDC will become necessary to handle the increased data volume efficiently. Regularly review and update transfer strategies to ensure they scale with data growth.

By addressing these common pitfalls and adopting best practices, application developers can design database schemas that support efficient data transfer and analysis. This holistic approach ensures that the data infrastructure is robust, scalable, and ready to support diverse business needs.

Conclusion

As application developers, it’s crucial to recognize the significant impact that database schema design has on data utilization. Designing schemas isn’t merely about meeting the immediate needs of an application; it plays a pivotal role in how data is transferred from databases to data warehouses and how effectively it can be analyzed and used for business intelligence.

The pitfalls and solutions discussed in this article provide valuable hints and guidelines for improving schema design, making data more accessible and usable for all stakeholders. Addressing these common pitfalls through best practices helps ensure efficient data transfer, synchronization, and analysis, ultimately leading to more robust and scalable data infrastructures.

However, recognizing and addressing these pitfalls is only part of the solution. The importance of cross-functional collaboration cannot be overstated. Designing schemas effectively means designing data and how it will be utilized. This requires input and cooperation from data engineers, data analysts, data scientists, and application developers. A cross-functional effort ensures that all potential issues and requirements are considered during the schema design process. Application developers play a vital role in this collaborative effort by bringing their knowledge of application needs while understanding the data requirements of other functions. Such collaboration can preemptively eliminate many of the issues that would otherwise arise, making data more accessible and usable.

In summary, designing database schemas is a critical part of designing data. It involves understanding how to unlock and utilize data in databases for business purposes. Application developers should not only focus on the application side but also consider the broader implications of their schema designs on data engineering, analytics, and overall business intelligence. By adopting best practices and fostering a collaborative environment, organizations can create data infrastructures that support diverse business needs and drive better decision-making and innovation. This holistic approach to schema design and data management is key to unlocking the full potential of data in today’s data-driven world.
