In the world of software development, rigorous testing and controlled releases have been standard practice for decades. But what if we could apply these same principles to databases and data warehouses? Imagine being able to define a set of criteria with test cases for your data infrastructure, automatically applying them to every new "release" to ensure your customers always see accurate and consistent data.
While this idea seems intuitive, there's a reason why end-to-end testing isn't commonly practiced in data management: it requires a primitive of clone or snapshot for databases or data warehouses, which most data systems don't provide.
Modern data warehouses are essentially organized mutable storage that change over time as we operate on them through data pipelines. Data typically becomes visible to end customers as soon as it's generated, with no concept of a "release." Without this release concept, running end-to-end testing on a data warehouse makes little sense, as there's no way to ensure what your tests see is what your customers will see.
graph TD
A[Data Warehouse] --> B[Continuous Data Pipeline]
B --> C[Data Visible to Customers]
D[End-to-End Testing] --> E{No Control Over Data Visibility}
E --> F[Inconsistent Data View]
F --> G[Customer View]
F --> H[Test View]
I[Lack of Release Concept] --> J[Difficulty in Ensuring Data Consistency]
K[Mutable Storage] --> L[Constant Data Changes]
L --> M[Challenges in Creating Stable Test Environment]
style E fill:#FFA07A,stroke:#FF6347,stroke-width:2px
style J fill:#FFA07A,stroke:#FF6347,stroke-width:2px
style M fill:#FFA07A,stroke:#FF6347,stroke-width:2px
Some teams have developed version control systems on top of their data warehouses. Instead of directly modifying tables queried by end users, they create new versions of tables for changes and use an atomic swap operation to "release" the table. While this approach works to some extent, it comes with significant challenges:
graph TD
A["Data Warehouse"] -->|"Create new version"| B["Table v2"]
A -->|"Current version"| C["Table v1"]
B -->|"Atomic swap"| D["'Release' new version"]
D -->|"Success?"| E{"Version Control"}
E -->|"Yes"| F["Table v2 becomes current"]
E -->|"No"| G["Rollback to Table v1"]
H["Developer"] -->|"Attempts to implement"| I["'Create and Swap' Pattern"]
I -->|"Realizes"| J["It's complex!"]
K["Multi-table Criteria"] -->|"Requires"| L["'Packaging' changes"]
L -->|"Developer thinks"| M["This is challenging!"]
%% XKCD-style comments
%% "It's like playing Jenga with data!"
%% "Who knew 'atomic' could be so complicated?"
%% "Developers: 0, Data Complexity: 1"
style H stroke:#000,stroke-width:2px,fill:#fff
style J stroke:#000,stroke-width:2px,fill:#ffa07a
style M stroke:#000,stroke-width:2px,fill:#ffa07a
We've developed a system to "clone" ClickHouse databases as replicas, leveraging the power of JuiceFS. This approach, detailed in our earlier article "Low-Cost Read/Write Separation: Jerry Builds a Primary-Replica ClickHouse Architecture", offers a promising solution.
Here's how it works:
juicefs snapshot src_dir des_dir
, we can create a clone of src_dir as it exists at that moment.This approach allows us to easily replicate/clone a ClickHouse instance from the running instance, creating a frozen snapshot that can be considered a "release artifact."
With this mechanism in place, we can run end-to-end testing against the ClickHouse replica and control its visibility based on the test results. This replicates the release workflow that has been standard in software development for decades.