Document-Level Index Diff Between Two Elasticsearch Clusters #
INFINI Gateway is able to compare differences between two different indexes in the same or different clusters. In scenarios in which application dual writes, CCR, or other data replication solutions are used, differences can be periodically compared to ensure data consistency.
Function Demonstration #
How Is This Feature Configured? #
Setting a Target Cluster #
Modify the gateway.yml
configuration file by setting two cluster resources source
and target
and adding the following configuration:
elasticsearch:
- name: source
enabled: true
endpoint: http://localhost:9200
basic_auth:
username: test
password: testtest
- name: target
enabled: true
endpoint: http://localhost:9201
basic_auth: #used to discovery full cluster nodes, or check elasticsearch's health and versions
username: test
password: testtest
Configuring a Contrast Task #
Add a service pipeline to handle the index document pulling and contrast of two clusters as follows:
pipeline:
- name: index_diff_service
auto_start: true
keep_running: true
processor:
- dag:
parallel:
- dump_hash: #dump es1's doc
indices: "medcl-test"
scroll_time: "10m"
elasticsearch: "source"
output_queue: "source_docs"
batch_size: 10000
slice_size: 5
- dump_hash: #dump es2's doc
indices: "medcl-test"
scroll_time: "10m"
batch_size: 10000
slice_size: 5
elasticsearch: "target"
output_queue: "target_docs"
end:
- index_diff:
diff_queue: "diff_result"
buffer_size: 1
text_report: true #If data needs to be saved to Elasticsearch, disable the function and start the diff_result_ingest task of the pipeline.
source_queue: 'source_docs'
target_queue: 'target_docs'
In the above configuration, dump_hash
is concurrently used to pull the medcl-a
index of the source
cluster and fetch the medcl-b
index of the target
cluster, and output results to terminals in text form.
Outputting Results to Elasticsearch #
If there are many difference results, you can save them to the Elasticsearch
cluster, set the text_report
parameter of the above index_diff
processing unit to false
, and add the following configuration:
pipeline:
- name: diff_result_ingest
auto_start: true
keep_running: true
processor:
- json_indexing:
index_name: "diff_result"
elasticsearch: "source"
input_queue: "diff_result"
idle_timeout_in_seconds: 1
worker_size: 1
bulk_size_in_mb: 10 #in MB
Finally, import the dashboard to Kibana to achieve the following effect: