Skip to main content

sample_exact_es_percent

Given an array of JSON documents, this processor will return an array containing a shuffled subset of those input documents. The size of the subset will be determined by the percentage value stored in an Elasticsearch record multiplied against the length of the array rounded down. The percentage will be retrieved every window_ms milliseconds by getting the document with document_id found within the index in the Elasticsearch specified by the connection as specified in the config.

Usage

In this example we want to retrieve the percentage from the document with _id abc123 in index my-index at the terafoundation connection named elasticsearch7.

Example of a job using the sample_exact_es_percent processor

{
"name" : "testing",
"workers" : 1,
"slicers" : 1,
"lifecycle" : "once",
"assets" : [
"standard"
],
"operations" : [
{
"_op": "test-reader"
},
{
"_op": "sample_exact_es_percent",
"connection": "elasticsearch7",
"index": "my-index",
"document_id": "abc123",
"window_ms": 300000
}
]
}

Example sample index creation

Create an index with 1 shard, 2 replicas, and a single text property called percent.

curl -X PUT "http://localhost:9200/my-index" -H 'Content-Type: application/json' -d'
{
"settings": {
"number_of_shards": 1,
"number_of_replicas": 2
},
"mappings": {
"properties": {
"percent": {
"type": "text"
}
}
}
}'

Example document creation

curl -X POST 'localhost:9200/my-index/_doc/abc123' -H 'Content-Type:application/json' -d '
{
"percent": 50
}
'

Review the document

curl -X GET 'localhost:9200/my-index/_doc/abc123' | jq
{
"_index": "my-index",
"_id": "abc123",
"_version": 1,
"_seq_no": 0,
"_primary_term": 1,
"found": true,
"_source": {
"percent": 50
}
}

Example of the data and the expected results

const data = [
DataEntity.make({ name: 'lilly', otherField: 1 }),
DataEntity.make({ name: 'willy', otherField: 2 }),
DataEntity.make({ name: 'billy', otherField: 3 }),
DataEntity.make({ name: 'dilly', otherField: 4 }),
]

const results = await processor.run(data);

results === [
{ name: 'dilly', name_again: 'dilly', otherField: 4 },
{ name: 'willy', name_again: 'willy', otherField: 2 },
]

Parameters

ConfigurationDescriptionTypeNotes
_opName of operation, it must reflect the exact name of the fileStringrequired
connectionName of the elasticsearch connection to use to find index sizeStringdefaults to default
indexName of the index that holds the percentage documentStringrequired
document_id_id of the document holding the percentage of docs to keepStringrequired
window_msThe time in milliseconds between queries to elasticsearch. Must be between 100 and 3_600_000 (1 hour)Numberdefaults to 300_000ms (5 minutes)