Feature DOMAINCLAW-3
closed Epic DOMAINCLAW-1: Mail-Hound Prototype — Domain Probe, Redirect Tracking, Contact Extraction
Export CSV/JSON and Per-Run Logging
100%
Description
Objective
Run a deep crawl only on the domains selected from the fast precheck results, and extract crawl and contact data from them.
Description
Implement the deep crawl flow for selected domains.
The crawler should collect and display:
- Crawled pages
- Redirect events
- Contact information
- Extracted email addresses
The existing redirect rules should be respected, including:
- Recording cross-host redirects.
- Handling origin pages correctly when a soft redirect is detected.
- Continuing to scrape emails from the origin page if the origin still serves valid content.
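A minimal sketch of how a cross-host redirect could be recorded. The function names and the record fields (from_url, to_url, status, cross_host) are hypothetical and not taken from the existing redirect rules. Soft-redirect detection is not shown because its criteria are not described here.

```python
from urllib.parse import urlsplit


def is_cross_host_redirect(origin_url: str, final_url: str) -> bool:
    """Return True when the redirect target is served from a different host."""
    # urlsplit().hostname is already lowercased; "or ''" guards URLs without a host.
    origin_host = urlsplit(origin_url).hostname or ""
    final_host = urlsplit(final_url).hostname or ""
    return origin_host != final_host


def redirect_record(origin_url: str, final_url: str, status: int) -> dict:
    """Build one redirect row (field names are assumptions for illustration)."""
    return {
        "from_url": origin_url,
        "to_url": final_url,
        "status": status,
        "cross_host": is_cross_host_redirect(origin_url, final_url),
    }
```

This sketch compares hosts exactly, so www.example.com and example.com count as different hosts. The existing rules may treat that case differently.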
Acceptance Criteria
- The user can start a deep crawl using the selected domains.
- The crawl result is displayed in the related UI tabs.
- The system shows data for:
  - Per-domain summary
  - Pages
  - Redirects
  - Contacts
- Extracted emails or contact records are shown when available.
- Redirects are recorded according to the existing redirect rules.
- The result reflects the domains selected by the user.
Definition of Done
- The deep crawl flow runs end-to-end without blocking errors.
- Crawl results are visible in the UI.
- Contact/email extraction works for valid pages.
- Redirect data is captured consistently.
- A crawl failure on one domain does not stop the entire run.
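The last item above, isolating failures per domain, could be sketched as follows. Both crawl_selected and the crawl_one callable are hypothetical names. The actual crawler entry point is not specified in this ticket.

```python
from typing import Callable


def crawl_selected(domains: list[str], crawl_one: Callable[[str], dict]) -> dict:
    """Crawl each selected domain; collect failures instead of aborting the run."""
    results: dict[str, dict] = {}
    failures: dict[str, str] = {}
    for domain in domains:
        try:
            results[domain] = crawl_one(domain)
        except Exception as exc:
            # Broad catch on purpose: one bad domain must not stop the others.
            failures[domain] = f"{type(exc).__name__}: {exc}"
    return {"results": results, "failures": failures}
```

Keeping the failures in the returned structure lets the per-domain summary show which domains failed and why.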
Sub-task 3: Export CSV/JSON and Per-Run Logging
Type
Feature / Technical
Estimate
1 SP
Objective
Store every crawl run in a structured and traceable format so that users can review, debug, or share the results after the run is complete.
Description
Each run should generate a unique run ID.
For every run, the system should create an output folder using the following structure:
exports/<run_id>/
The following export files must be generated:
- summary.csv
- pages.csv
- redirects.csv
- contacts.csv
- results.json
A separate log file should also be created for each run under:
logs/
The exported data should match what is shown in the UI.
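A minimal sketch of the run ID and export folder described above. The run ID scheme (UTC timestamp plus a random suffix), the function names, and the shape of the tables argument are assumptions. The ticket only fixes the folder layout and the file names.

```python
import csv
import json
import uuid
from datetime import datetime, timezone
from pathlib import Path

CSV_TABLES = ("summary", "pages", "redirects", "contacts")


def new_run_id() -> str:
    """Sortable timestamp plus random suffix (an assumed scheme)."""
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    return f"{stamp}-{uuid.uuid4().hex[:8]}"


def export_run(run_id: str, tables: dict[str, list[dict]],
               base_dir: Path = Path("exports")) -> Path:
    """Write the four CSV files and results.json under <base_dir>/<run_id>/."""
    out_dir = base_dir / run_id
    out_dir.mkdir(parents=True, exist_ok=False)  # fail rather than overwrite a run
    for name in CSV_TABLES:
        rows = tables.get(name, [])
        # Union of keys in first-seen order, so rows with missing fields still export.
        fieldnames = list(dict.fromkeys(key for row in rows for key in row))
        with open(out_dir / f"{name}.csv", "w", newline="", encoding="utf-8") as fh:
            writer = csv.DictWriter(fh, fieldnames=fieldnames)
            writer.writeheader()
            writer.writerows(rows)
    payload = {"run_id": run_id, **{name: tables.get(name, []) for name in CSV_TABLES}}
    (out_dir / "results.json").write_text(
        json.dumps(payload, indent=2, ensure_ascii=False), encoding="utf-8"
    )
    return out_dir
```

Building the CSV files and results.json from the same in-memory tables that feed the UI is one way to keep the exports consistent with what the user sees.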
Acceptance Criteria
- Each run creates a unique output folder under exports/<run_id>/.
- The following required files are created after each run:
  - summary.csv
  - pages.csv
  - redirects.csv
  - contacts.csv
  - results.json
- A dedicated log file is created for each run.
- Exported CSV/JSON data can be opened and read successfully.
- Exported data matches the data displayed in the UI.
- Logs contain enough information to trace errors or failed domains.
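A dedicated per-run log could be set up with the standard logging module, as sketched below. The file name logs/<run_id>.log, the logger name, and the helper names are assumptions. The ticket only requires one log file per run under logs/.

```python
import logging
from pathlib import Path


def open_run_logger(run_id: str, log_dir: Path = Path("logs")) -> logging.Logger:
    """Return a logger that writes only to this run's own log file."""
    log_dir.mkdir(parents=True, exist_ok=True)
    logger = logging.getLogger(f"crawl.{run_id}")
    logger.setLevel(logging.INFO)
    logger.propagate = False  # keep run messages out of the root logger
    handler = logging.FileHandler(log_dir / f"{run_id}.log", encoding="utf-8")
    handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(message)s"))
    logger.addHandler(handler)
    return logger


def close_run_logger(logger: logging.Logger) -> None:
    """Flush and detach the file handler when the run ends."""
    for handler in list(logger.handlers):
        handler.close()
        logger.removeHandler(handler)
```

Logging the domain name with every error, for example logger.error("domain=%s failed: %s", domain, exc), makes failed domains traceable from the file alone.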
Definition of Done
- Output folder structure is stable and easy to inspect.
- Export files are not lost when the UI session ends.
- Logs are persisted per run.
- A third party can review the exported files without needing access to the running UI.
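A reviewer without access to the UI could confirm that a run folder is complete and readable with a small check such as the one below. The function name check_export is hypothetical, and the check covers only file presence and parseability, not whether the data matches the UI.

```python
import csv
import json
from pathlib import Path

REQUIRED_FILES = ("summary.csv", "pages.csv", "redirects.csv",
                  "contacts.csv", "results.json")


def check_export(run_dir: Path) -> list[str]:
    """Return a list of problems; an empty list means every file opened and parsed."""
    problems: list[str] = []
    for name in REQUIRED_FILES:
        path = run_dir / name
        if not path.is_file():
            problems.append(f"missing: {name}")
            continue
        try:
            with open(path, newline="", encoding="utf-8") as fh:
                if name.endswith(".json"):
                    json.load(fh)
                else:
                    list(csv.reader(fh))  # force a full read of the CSV
        except (OSError, ValueError, csv.Error) as exc:
            problems.append(f"unreadable: {name} ({exc})")
    return problems
```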