Skip to content

chore(aggregator): implement telemetry for aggregator#949

Closed
JuArce wants to merge 10 commits intostagingfrom
936-chore-implement-opentelemetry-for-aggregator
Closed

chore(aggregator): implement telemetry for aggregator#949
JuArce wants to merge 10 commits intostagingfrom
936-chore-implement-opentelemetry-for-aggregator

Conversation

@JuArce
Copy link
Copy Markdown
Collaborator

@JuArce JuArce commented Sep 10, 2024

Description

This PR allows the aggregator to export telemetry traces to have a better observability of the system

Tasks

  • Docker compose with Otel Collector and Jaeger UI
  • Init traces on NewTask
  • Add span on Operator responses
  • Add span on Quorum Reached
  • Add span on Tasks errors (ie task expired)
  • Close traces after a period of time the quorum was reached to wait for most operators responses

How to Test

Note

Docker is needed to test the PR.

  1. Run telemetry
make run_telemetry

It will run the Otel-Collector and Jaeger UI.

Note

Jaeger UI is available on http:localhost:16686

image
  1. Run anvil
make anvil_start_with_block_time
  1. Run aggregator
make aggregator_start
  1. Run operator
export OPERATOR_ADDRESS=0x70997970C51812dc3A010C7d01b50e0d17dc79C8
make operator_full_registration CONFIG_FILE=config-files/config-operator-1.yaml
make operator_start CONFIG_FILE=config-files/config-operator-1.yaml
  1. Run batcher
make batcher_start_local
  1. Send tasks

  2. Once you sent a batch and it has been responded by the aggregator, you should be able to check the trace of the task in the Jaeger UI (http:localhost:16686)

In the search, select service: aggregator and press find traces
It should show you the tasks you sent to aligned

Comment thread aggregator/internal/pkg/aggregator.go Outdated
Comment thread telemetry/telemetry.go
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should probably add a TaskExpiredTrace when a task expires instead of reaching threshold, ideally should included quorum % and operators who responded if we have access to that information.

@taturosati
Copy link
Copy Markdown
Contributor

Other than comments, worked perfectly on my machine

@JuArce JuArce self-assigned this Sep 12, 2024
@JuArce JuArce marked this pull request as ready for review September 12, 2024 21:25
Copy link
Copy Markdown
Contributor

@uri-99 uri-99 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Documentation of this PR would be highly valuable, maybe even with some screenshots.
(review still WIP)

@uri-99
Copy link
Copy Markdown
Contributor

uri-99 commented Sep 13, 2024

image

This Warning seems to be appearing when the task was not yet responded, and it disappears afterwards. Can we change the text of the warning?

@uri-99
Copy link
Copy Markdown
Contributor

uri-99 commented Sep 13, 2024

Is there a way to make Jaeger auto-refresh?

@uri-99
Copy link
Copy Markdown
Contributor

uri-99 commented Sep 13, 2024

Have you considered setting up the /monitor ? Seems like it is a real-time visualization of the items found with the search.

image

Comment thread telemetry/telemetry.go
Comment thread telemetry/telemetry.go
attribute.String("merkle_root", fmt.Sprintf("0x%s", hex.EncodeToString(batchMerkleRoot[:]))),
attribute.String("status", "ok"),
),
) // TODO add quorum %
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this TODO be resolved in this PR?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will be added in another pr

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do you think about adding traces around lines 317 of this file? If the aggregator fails to SendAggregatedResponse after reaching quorum

@JuArce
Copy link
Copy Markdown
Collaborator Author

JuArce commented Sep 30, 2024

Superseded by #1077

@JuArce JuArce closed this Sep 30, 2024
@JuArce JuArce deleted the 936-chore-implement-opentelemetry-for-aggregator branch September 30, 2024 15:25
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants