by Marco Elumba
This project was initiated as part of an effort to enhance business expansion and increase the visibility of product offerings to potential clients. The objective was to develop an automated system that aggregates essential data from an online analytics platform for a predefined list of potential leads. The system utilizes this data to create customized email communications for targeted outreach. Built on cloud-based technology using Python and advanced AI tools, this automated solution operates at set intervals, ensuring consistent data collection and the delivery of personalized email content tailored to the specific needs and online activities of each lead.
The process is divided into two main workflows, each utilizing several functions and tools within the Google Cloud Platform (GCP) environment.
flowchart TB
%% Colors %%
linkStyle default stroke-width:2px
classDef blue fill:#2374f7,stroke:#000,stroke-width:2px,color:#fff
classDef orange fill:#fc822b,stroke:#000,stroke-width:2px,color:#fff
classDef green fill:#16b552,stroke:#000,stroke-width:2px,color:#fff
classDef red fill:#ed2633,stroke:#000,stroke-width:2px,color:#fff
classDef magenta fill:magenta,stroke:#000,stroke-width:2px,color:#fff
%% Tables %%
%% 0 %%
L[(*leads*)]:::blue
C[(*customers*)]:::blue
EH[(*leads_extract_history*)]:::blue
LS[(*leads_scraped_data*)]:::blue
LE[(leads_email)]:::blue
%% Scrape %%
%% 1,2,3,4 %%
L ---o |Select| ED(Extract Data):::orange
C ---o |Select| ED(Extract Data):::orange
ED(Extract Data):::orange ---> |Append| EH
EH ---o |Select| ED(Extract Data):::orange
ED(Extract Data):::orange ---> |Append| LS
%% Generate Email %%
%% 5,6 %%
LS ---o |Select| GE([Generate Email]):::red
GE([Generate Email]):::red ---> |Append| LE
%% View %%
%% 9 %%
C -..- |Select| FL[future_leads.view] -..- |Select| LE
%% Link Colors %%
linkStyle 0 stroke:blue
linkStyle 2,5 stroke:green
linkStyle 3,6 stroke:red
linkStyle 4 stroke:magenta
%% Clickable Links %%
click ED "<https://www.notion.so/Leads-generation-workflow-110e8405570a80918970ed288cf34488?pvs=4#110e8405570a801097c8f6935a89964a>"
click GE "<https://www.notion.so/Leads-generation-workflow-110e8405570a80918970ed288cf34488?pvs=4#110e8405570a80818696df699740c18e>"
click FL "<https://www.notion.so/Leads-generation-workflow-110e8405570a80918970ed288cf34488?pvs=4#110e8405570a807e869dc6579c685359>"
click C "<https://www.notion.so/Leads-email-generation-workflow-110e8405570a80918970ed288cf34488?pvs=4#111e8405570a8052b51bf804ad891f1b>"
click L "<https://www.notion.so/Leads-email-generation-workflow-110e8405570a80918970ed288cf34488?pvs=4#111e8405570a80c0bddbc7ae3d291faf>"
click EH "<https://www.notion.so/Leads-email-generation-workflow-110e8405570a80918970ed288cf34488?pvs=4#111e8405570a80ffb2a3ecd57e902eba>"
click LS "<https://www.notion.so/Leads-email-generation-workflow-110e8405570a80918970ed288cf34488?pvs=4#111e8405570a80599773c6875e07620a>"
click LE "<https://www.notion.so/Leads-email-generation-workflow-110e8405570a80918970ed288cf34488?pvs=4#111e8405570a8034b06aced72a742245>"
click FL "<https://www.notion.so/Leads-email-generation-workflow-110e8405570a80918970ed288cf34488?pvs=4#111e8405570a80a89f2ec0258c4646d5>"
This process is designed to extract data for a list of leads. The steps involved are:
Step 1: Load Leads into BigQuery (BQ)
The predefined list of potential leads and the existing customer list are loaded into two BigQuery tables (*leads* and *customers*).
Step 2: Create Cloud Functions for Data Extraction
Tools: Google Cloud Functions, Python, Firecrawl package
A Python-based Google Cloud Function is used to extract data using the Firecrawl package. The process includes the following functions:
get_leads: Selects a random domain from the *leads* table using pandas_gbq. The domain is cross-checked against the *customers* table to ensure that no domain belonging to an existing customer is extracted.
get_leads Python function
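The get_leads code itself appears above only as an image. The sketch below is a reconstruction, not the original: the project ID and the `leads_dataset` dataset name are hypothetical, while `pandas_gbq` is confirmed by the text.

```python
import pandas_gbq

PROJECT_ID = "my-gcp-project"  # hypothetical project ID


def get_leads():
    """Pick one random lead domain that does not belong to an existing customer."""
    query = """
        SELECT l.*
        FROM `leads_dataset.leads` AS l
        LEFT JOIN `leads_dataset.customers` AS c
               ON l.domain = c.domain
        WHERE c.domain IS NULL      -- exclude domains of existing customers
        ORDER BY RAND()
        LIMIT 1
    """
    return pandas_gbq.read_gbq(query, project_id=PROJECT_ID)
```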

scrape_firecrawl Python function
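No description of scrape_firecrawl survives beyond the image above and the mention of the Firecrawl package, so the following is only a plausible sketch. It assumes the firecrawl-py SDK's `FirecrawlApp.scrape_url` call, an API key stored in the `FIRECRAWL_API_KEY` environment variable, and a SimilarWeb URL pattern; none of these details are confirmed by the original.

```python
import os

from firecrawl import FirecrawlApp


def scrape_firecrawl(domain: str):
    """Scrape the analytics page for a single domain and return the raw result."""
    app = FirecrawlApp(api_key=os.environ["FIRECRAWL_API_KEY"])
    # URL pattern and scrape options are assumptions, not the original code.
    return app.scrape_url(f"https://www.similarweb.com/website/{domain}/")
```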
insert_leads_data: Inserts the scraped data into a new BigQuery (BQ) table called *leads_scraped_data*.

insert_leads_data Python function
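As above, the actual insert_leads_data code is only shown as an image. A minimal sketch using `pandas_gbq.to_gbq` in append mode, with an assumed dataset name and column layout:

```python
import pandas as pd
import pandas_gbq

PROJECT_ID = "my-gcp-project"  # hypothetical project ID


def insert_leads_data(domain: str, raw_data: str) -> None:
    """Append one scraped record to the leads_scraped_data table."""
    df = pd.DataFrame([{"domain": domain, "raw_data": raw_data}])  # assumed columns
    pandas_gbq.to_gbq(
        df,
        "leads_dataset.leads_scraped_data",  # assumed dataset name
        project_id=PROJECT_ID,
        if_exists="append",  # keep earlier scrapes, never overwrite
    )
```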
The process is scheduled to run every 15 minutes using Google Cloud Scheduler (it could instead run once with multiple threaded jobs, but for ethical crawling it is executed at intervals). Each run checks for new domains to scrape and stores the extracted data in the *leads_scraped_data* table. A further safeguard ensures that the random domain selection never picks the same domain twice: an intermediary table called *leads_extract_history* is cross-checked so that no domain already present in it is selected again.
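A sketch of that safeguard under the same hypothetical dataset name (for reference, a 15-minute Cloud Scheduler interval corresponds to the cron expression `*/15 * * * *`):

```python
import pandas as pd
import pandas_gbq

PROJECT_ID = "my-gcp-project"  # hypothetical project ID

# Pick the next domain to scrape, skipping anything already extracted.
NEXT_DOMAIN_QUERY = """
    SELECT domain
    FROM `leads_dataset.leads`
    WHERE domain NOT IN (SELECT domain FROM `leads_dataset.leads_extract_history`)
    ORDER BY RAND()
    LIMIT 1
"""


def record_extraction(domain: str) -> None:
    """Append the domain to leads_extract_history so it is never selected twice."""
    pandas_gbq.to_gbq(
        pd.DataFrame([{"domain": domain}]),
        "leads_dataset.leads_extract_history",
        project_id=PROJECT_ID,
        if_exists="append",
    )
```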
Final Output of Data Extraction:
The extracted data is saved in a BigQuery table (*leads_scraped_data*) with the following fields:
leads_scraped_data - table.csv
Tools: pandas_gbq, Python, OpenAI API, BigQuery
A Python-based Google Cloud Function is created to generate targeted emails based on the extracted data from SimilarWeb. The process includes the following functions:
get_a_lead: This function queries a random row from the *leads_scraped_data* table where the domain has not yet been processed into the leads_email table. It ensures that only leads without generated emails are selected for the email generation process.

get_a_lead Python function
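Again, the get_a_lead code is embedded above as an image; the sketch below reconstructs the described behavior under the same hypothetical dataset name:

```python
import pandas_gbq

PROJECT_ID = "my-gcp-project"  # hypothetical project ID


def get_a_lead():
    """Return one random scraped lead whose email has not been generated yet."""
    query = """
        SELECT s.*
        FROM `leads_dataset.leads_scraped_data` AS s
        LEFT JOIN `leads_dataset.leads_email` AS e
               ON s.domain = e.domain
        WHERE e.domain IS NULL      -- no email generated for this domain yet
        ORDER BY RAND()
        LIMIT 1
    """
    df = pandas_gbq.read_gbq(query, project_id=PROJECT_ID)
    return None if df.empty else df.iloc[0]
```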
generate_prompt: This function generates a custom prompt using the extracted raw_data from SimilarWeb, combined with the lead's information. The prompt includes details about the company and the lead, and it is sent to OpenAI to generate email content.
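A sketch of generate_prompt together with the OpenAI call it feeds. The column names (`company_name`, `raw_data`), the prompt wording, and the model name are illustrative assumptions; the call uses the openai v1 chat-completions client.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def generate_prompt(lead) -> str:
    """Combine the lead's details with its scraped SimilarWeb data."""
    # Field names are assumptions about the leads_scraped_data schema.
    return (
        f"You are writing a short outreach email to {lead['company_name']} "
        f"({lead['domain']}).\n"
        f"Here is what we know about their online activity:\n{lead['raw_data']}\n"
        "Write a personalized email that references these findings."
    )


def generate_email(lead) -> str:
    """Send the prompt to OpenAI and return the generated email body."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # assumed model; the original does not name one
        messages=[{"role": "user", "content": generate_prompt(lead)}],
    )
    return response.choices[0].message.content
```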