feat(scraper): add regex-based pattern matching for node selection#194

Open
theonlychant wants to merge 4 commits into amd:development from theonlychant:development

@theonlychant theonlychant commented May 2, 2026

Summary

Adds regex-based pattern matching for node selection, allowing more
flexible and powerful scraping patterns beyond exact string matches.
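
To illustrate the difference between exact and regex-based selection, here is a minimal sketch using only Python's standard `re` module. The `select_nodes` helper and the node names are hypothetical and are not taken from this PR's diff; the sketch only shows the general idea of matching names against a pattern instead of comparing strings.

```python
import re

def select_nodes(node_names, pattern):
    """Return the names that match `pattern` in full (hypothetical helper).

    re.compile raises re.error if `pattern` is not a valid regular expression.
    """
    compiled = re.compile(pattern)
    return [name for name in node_names if compiled.fullmatch(name)]

# Illustrative node names, assumed for this example only.
nodes = ["gpu-node-01", "gpu-node-02", "cpu-node-01", "login"]

# Exact string matching selects a single, fully spelled-out name.
exact = [name for name in nodes if name == "gpu-node-01"]

# A regex selects every name that fits the pattern.
by_regex = select_nodes(nodes, r"gpu-node-\d+")
print(exact, by_regex)
```

Using `fullmatch` rather than `search` is a design choice in this sketch: it keeps a pattern such as `node` from accidentally selecting every name that merely contains it.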

Test plan

  • pytest test/unit
  • pytest test/functional (if applicable)
  • pre-commit run --all-files

Checklist

  • Added/updated tests: added unit tests covering regex pattern matching,
    invalid patterns, and edge cases
  • Updated docs/README to document the new regex selector syntax
  • No secrets or credentials committed

@alexandraBara
Collaborator

This functionality can be achieved today without enhancing the code:

from nodescraper.plugins.regex_search.analyzer_args import RegexSearchAnalyzerArgs
from nodescraper.plugins.regex_search.regex_search_analyzer import RegexSearchAnalyzer
from nodescraper.plugins.regex_search.regex_search_data import RegexSearchData
# Inline the same patterns you would have taken from COMMON_PATTERNS
rules = [
    {
        "regex": r"\b(?:25[0-5]|2[0-4]\d|1?\d?\d)(?:\.(?:25[0-5]|2[0-4]\d|1?\d?\d)){3}\b",
        "message": "Found ipv4",
        "event_category": "UNKNOWN",
        "event_priority": "ERROR",
    },
    {
        "regex": r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b",
        "message": "Found email",
        "event_category": "UNKNOWN",
        "event_priority": "ERROR",
    },
]
data = RegexSearchData(content="2026-05-01T12:00:00,000+00:00 connect from 192.0.2.1")
# system_info is assumed to be defined earlier (not shown in this snippet)
analyzer = RegexSearchAnalyzer(system_info=system_info)
result = analyzer.analyze_data(data, RegexSearchAnalyzerArgs(error_regex=rules))
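
As a quick sanity check that does not depend on node-scraper itself, the patterns above can be exercised with Python's standard `re` module. This is a minimal sketch: the `find_matches` helper and the sample email line are illustrative and not part of the project, and the email pattern is written with `[\w-]+` after each dot so that multi-character domain labels such as `com` match.

```python
import re

IPV4 = r"\b(?:25[0-5]|2[0-4]\d|1?\d?\d)(?:\.(?:25[0-5]|2[0-4]\d|1?\d?\d)){3}\b"
EMAIL = r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"

def find_matches(pattern, text):
    """Return every non-overlapping match of `pattern` in `text` (hypothetical helper)."""
    return [m.group(0) for m in re.finditer(pattern, text)]

# Same sample line as in the snippet above.
log_line = "2026-05-01T12:00:00,000+00:00 connect from 192.0.2.1"
ipv4_hits = find_matches(IPV4, log_line)

# Illustrative line, assumed for this example only.
email_hits = find_matches(EMAIL, "contact admin@example.com now")
print(ipv4_hits, email_hits)
```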

The same can be achieved today without code enhancements by using this sample plugin_config.json:

{
  "global_args": {},
  "plugins": {
    "RegexSearchPlugin": {
      "collection": false,
      "analysis": true,
      "data": "/path/to/your.log",
      "analysis_args": {
        "interval_to_collapse_event": 60,
        "num_timestamps": 3,
        "error_regex": [
          {
            "regex": "\\b(?:25[0-5]|2[0-4]\\d|1?\\d?\\d)(?:\\.(?:25[0-5]|2[0-4]\\d|1?\\d?\\d)){3}\\b",
            "message": "IPv4 address matched",
            "event_category": "UNKNOWN",
            "event_priority": "ERROR"
          },
          {
            "regex": "\\b(?:[0-9A-Fa-f]{2}[:-]){5}[0-9A-Fa-f]{2}\\b",
            "message": "MAC address matched",
            "event_category": "NETWORK",
            "event_priority": "WARNING"
          }
        ]
      }
    }
  },
  "result_collators": {},
  "name": "regex search with ipv4 + mac (COMMON_PATTERNS strings inlined)",
  "desc": "Regex strings copied from nodescraper.regex_patterns COMMON_PATTERNS"
}
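
Because JSON requires each backslash in a regex to be doubled, it can be worth confirming that the strings decode to the intended patterns before running the tool. The following is a rough sketch using only the standard `json` and `re` modules; the `check_rules` helper, the trimmed config text, and the sample line are assumptions made for illustration and are not part of node-scraper.

```python
import json
import re

# Trimmed copy of the "error_regex" entries from the config above, kept as raw JSON text.
CONFIG_TEXT = r'''
{
  "error_regex": [
    {
      "regex": "\\b(?:25[0-5]|2[0-4]\\d|1?\\d?\\d)(?:\\.(?:25[0-5]|2[0-4]\\d|1?\\d?\\d)){3}\\b",
      "message": "IPv4 address matched"
    },
    {
      "regex": "\\b(?:[0-9A-Fa-f]{2}[:-]){5}[0-9A-Fa-f]{2}\\b",
      "message": "MAC address matched"
    }
  ]
}
'''

def check_rules(config_text, sample):
    """Compile each rule's regex and return the messages of rules that match `sample`.

    Hypothetical helper: re.compile raises re.error if a pattern is invalid.
    """
    rules = json.loads(config_text)["error_regex"]
    hits = []
    for rule in rules:
        pattern = re.compile(rule["regex"])
        if pattern.search(sample):
            hits.append(rule["message"])
    return hits

# Illustrative log line, assumed for this example only.
SAMPLE = "eth0 aa:bb:cc:dd:ee:ff up, peer 192.0.2.1"
hits = check_rules(CONFIG_TEXT, SAMPLE)
print(hits)
```

After `json.loads`, each `\\b` in the file becomes the single `\b` word-boundary escape that the equivalent Python raw string contains.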

You can run it like this:

node-scraper --plugin-config plugin_config.json


2 participants