Data Processing & Automation

Automated data pipeline and system integration

Core Component: This automated data processing system runs every 2 hours so that fresh data flows from external systems into your beneficiary tracking and AI analysis.

System Architecture

Python Data Collector

External system integration and raw data ingestion

Language: Python | Schedule: Every 2 hours (Cron)

Google Apps Script Processor

Data transformation and duplicate-safe database updates

Language: JavaScript | Schedule: Every 2 hours (Triggers)

Data Flow & Processing Pipeline

Step 1: External Data Collection

The Python script authenticates with the external system and downloads the channel registration reports for both the Sanvi and Saheli channels.

# Key features:
- Multi-channel support (1330: Sanvi, 1331: Saheli)
- Date range flexibility (single date, date range, or default last 7 days)
- Automatic environment detection (Termux/Android, Linux, Windows)
- CSRF token handling for secure authentication
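The date range options above can be sketched as a small helper. This is a minimal illustration, not the collector's actual code: the function name resolve_date_range, the ISO date format, and the reading of "last 7 days" as seven days inclusive of today are all assumptions.

```python
from datetime import date, datetime, timedelta
from typing import Optional, Tuple

# Channel IDs as listed in the feature summary.
CHANNELS = {1330: "Sanvi", 1331: "Saheli"}


def resolve_date_range(
    start: Optional[str] = None,
    end: Optional[str] = None,
    today: Optional[date] = None,
    default_days: int = 7,
) -> Tuple[date, date]:
    """Return an inclusive (start, end) pair of dates.

    - no arguments: the last `default_days` days, ending today
    - start only:   that single date
    - start + end:  the given range
    Dates are ISO strings (YYYY-MM-DD); the format is an assumption.
    """
    today = today or date.today()
    if start is None and end is None:
        return today - timedelta(days=default_days - 1), today
    if start is None:
        raise ValueError("an end date requires a start date")
    start_date = datetime.strptime(start, "%Y-%m-%d").date()
    end_date = datetime.strptime(end, "%Y-%m-%d").date() if end else start_date
    if end_date < start_date:
        raise ValueError("end date precedes start date")
    return start_date, end_date
```

Passing `today` explicitly keeps the default window reproducible in tests; a real run would leave it unset.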

Step 2: Data Processing & Cleaning

Raw HTML tables are parsed and converted to structured CSV files with proper encoding and error handling.

# Data processing includes:
- HTML table extraction using BeautifulSoup
- Pandas DataFrame creation and validation
- UTF-8 encoding for international characters
- Empty data set handling
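The conversion from an HTML table to CSV could look roughly like the sketch below. To stay runnable without third-party packages it swaps in the standard library's html.parser and csv modules in place of BeautifulSoup and pandas, so it illustrates the shape of the step rather than the script's real implementation. The names TableExtractor and html_table_to_csv are hypothetical.

```python
import csv
import io
from html.parser import HTMLParser
from typing import List


class TableExtractor(HTMLParser):
    """Collect the cell text of every <tr> as a list of rows."""

    def __init__(self) -> None:
        super().__init__()
        self.rows: List[List[str]] = []
        self._row: List[str] = []
        self._cell: List[str] = []
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th"):
            self._cell, self._in_cell = [], True

    def handle_data(self, data):
        if self._in_cell:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._row.append("".join(self._cell).strip())
            self._in_cell = False
        elif tag == "tr" and self._row:
            self.rows.append(self._row)
            self._row = []


def html_table_to_csv(html: str) -> str:
    """Return CSV text for the table in `html`, or "" when it has no data rows."""
    parser = TableExtractor()
    parser.feed(html)
    if len(parser.rows) < 2:  # header only, or no rows at all
        return ""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(parser.rows)
    return buffer.getvalue()
```

When saving the result, opening the output file with encoding="utf-8" preserves names written in non-Latin scripts, and an empty return value signals that there is nothing to write for that channel and date range.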

Step 3: Duplicate Prevention & Google Sheets Update

The Python script updates Google Sheets with new registration data and skips records that are already present, so repeated runs do not create duplicates.

# Duplicate prevention strategy:
- Phone number-based deduplication
- Schema evolution handling (new columns)
- Batch processing for performance
- Error recovery and logging
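A rough sketch of phone-number deduplication combined with schema evolution follows. It works on plain dictionaries rather than the Sheets API, and several details are assumptions made for illustration: the column name "Phone", comparing numbers on their last 10 digits, and skipping rows with no phone number.

```python
import re
from typing import Dict, List, Tuple

Row = Dict[str, str]


def normalize_phone(raw: str) -> str:
    """Keep digits only and compare on the last 10 (assumed number length)."""
    digits = re.sub(r"\D", "", raw or "")
    return digits[-10:]


def merge_new_rows(
    header: List[str],
    existing: List[Row],
    incoming: List[Row],
    phone_col: str = "Phone",  # hypothetical column name
) -> Tuple[List[str], List[Row]]:
    """Return (extended header, rows to append).

    Incoming rows whose phone number is already present are skipped;
    columns not yet in the header are added at the end.
    """
    seen = {normalize_phone(r.get(phone_col, "")) for r in existing}
    new_header = list(header)
    to_append: List[Row] = []
    for row in incoming:
        key = normalize_phone(row.get(phone_col, ""))
        if not key or key in seen:
            continue  # blank phone or already recorded
        seen.add(key)  # also catches duplicates within the same batch
        for col in row:
            if col not in new_header:
                new_header.append(col)
        to_append.append(row)
    return new_header, to_append
```

Returning the rows to append as one list lets the caller write them to the sheet in a single batch call instead of one request per row.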

Step 4: Risk Data Processing & Raw Entries Sync

Google Apps Script processes risk assessment data and safely pushes new records to the Raw_Entries database.

# Safe push features:
- Pre-check existing data for duplicates
- Phone number validation
- Batch insertion for performance
- Email notifications on completion/failure
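The production code for this step is Apps Script. To keep one language across the examples, the pre-check, validation, and batching logic is sketched here in Python. The function names, the batch size of 200, and the validation rule (10 digits after removing punctuation and an optional 91 country prefix) are assumptions, not the rules the real script applies.

```python
import re
from typing import Iterable, List, Optional, Sequence, Tuple


def clean_phone(raw) -> Optional[str]:
    """Return a 10-digit number, or None if the value does not look valid."""
    digits = re.sub(r"\D", "", str(raw))
    if len(digits) == 12 and digits.startswith("91"):
        digits = digits[2:]  # drop the assumed country prefix
    return digits if len(digits) == 10 else None


def plan_push(
    rows: Sequence[list],
    existing_phones: Iterable[str],
    phone_index: int = 0,
    batch_size: int = 200,
) -> Tuple[List[List[list]], List[Tuple[list, str]]]:
    """Split rows into insert batches and a list of (row, reason) rejections."""
    seen = set(existing_phones)  # assumed to be cleaned already
    accepted: List[list] = []
    rejected: List[Tuple[list, str]] = []
    for row in rows:
        phone = clean_phone(row[phone_index])
        if phone is None:
            rejected.append((row, "invalid phone"))
        elif phone in seen:
            rejected.append((row, "duplicate"))
        else:
            seen.add(phone)
            accepted.append(row)
    batches = [
        list(accepted[start:start + batch_size])
        for start in range(0, len(accepted), batch_size)
    ]
    return batches, rejected
```

The rejection list carries a reason for each skipped row, which is the kind of detail a completion or failure email could summarize.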

Scheduling & Automation

Python Script (Cron Job)
Schedule: Every 2 hours

Cron Configuration:

# Run every 2 hours
0 */2 * * * /usr/bin/python3 /path/to/channel_report_downloader.py

Platforms Supported:

  • Termux (Android)
  • Linux/Ubuntu
  • Windows (via Task Scheduler)
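
One way a script might tell these platforms apart at startup is sketched below. It assumes Termux can be recognized from the TERMUX_VERSION or PREFIX environment variables; how the actual script performs its detection is not documented here, and detect_environment is a hypothetical name.

```python
import os
import platform
from typing import Mapping, Optional


def detect_environment(
    system: Optional[str] = None,
    env: Optional[Mapping[str, str]] = None,
) -> str:
    """Return 'termux', 'linux', 'windows', or 'unknown'."""
    system = system or platform.system()
    env = os.environ if env is None else env
    # Termux can report itself as Linux, so check its variables first.
    if "TERMUX_VERSION" in env or "com.termux" in env.get("PREFIX", ""):
        return "termux"
    if system == "Linux":
        return "linux"
    if system == "Windows":
        return "windows"
    return "unknown"
```

The result can then select platform-specific settings such as the download directory, while the scheduling itself stays with cron or Task Scheduler.
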

Apps Script (Time Triggers)
Schedule: Every 2 hours

Trigger Configuration:

  • Function: processRiskAndPushToRawEntries()
  • Type: Time-driven
  • Frequency: Every 2 hours

Functions Available:

  • safePushToRawEntries() - Process both channels
  • pushSanviToRawEntries() - Sanvi only
  • pushSaheliToRawEntries() - Saheli only

Performance & Reliability

  • Uptime Target: 99.9% (24/7 operation)
  • Update Frequency: 2 hrs (data freshness)
  • Records/Hour: 1000+ (processing capacity)
  • Duplicate Tolerance: 0 (data integrity)

System Health Check

Use the testRawEntriesConnection() function in Apps Script to verify system connectivity and permissions. Regular monitoring ensures the automation pipeline remains reliable and secure.

Next Steps: Processed data flows directly to the Call Log Interface for real-time beneficiary management and call logging.