Turn a Web Scraper Into a Content Parser Using AI

AI has proven to be a powerful tool for various tasks like note-taking, email writing, and language learning. Recently, I explored using AI to transform a basic web scraper into an automated information processor or content parser. This post will outline the process, potentially inspiring you to try it or adapt it for your own ideas.

Choosing OpenAI

For this project, I chose OpenAI and its package for AI integration. While other Large Language Models (LLMs) are available, I recommend using the LangChain package if you wish to explore alternatives. LangChain is model-agnostic and provides comprehensive documentation for interacting with different LLMs.

This explanation will not cover building a web scraper. Instead, it will focus on leveraging AI after data has been scraped. My example aims to analyze article listings and identify relevant content.

Creating a Client

Let’s begin after the scraper has collected information from the web, and it’s time to process it. First, we’ll create a client to send the scraped data to OpenAI.

import { FoundArticle, foundArticleResponse } from "app/types";
import OpenAI, { APIConnectionError } from "openai";
import { zodResponseFormat } from "openai/helpers/zod";
import { config } from "app/app.config";
import { getArticleInfo } from "app/utils";

const openAiClient = new OpenAI({ apiKey: config.openaiApiKey });

export const sendToOpenAI = async (
  info: string,
  subject: string
): Promise<FoundArticle[] | APIConnectionError> => {
  try {
    const response = await openAiClient.beta.chat.completions.runTools({
      model: "gpt-4o-2024-08-06",
      messages: [
        {
          role: "system",
          content: config.systemPrompt,
        },
        {
          role: "user",
          content: `The base path is <BASE URL PATH>. The subject is ${subject}: ${info}`,
        },
      ],
      tools: [
        {
          type: "function",
          function: {
            function: getArticleInfo,
            description: "pulls the content of the article for analysis",
            parameters: {
              type: "object",
              properties: { url: { type: "string" } },
            },
            parse: JSON.parse,
          },
        },
      ],
      tool_choice: "auto",
      response_format: zodResponseFormat(foundArticleResponse, "found_article"),
    });

    const finalResponse = await response.finalContent();
    if (!finalResponse) {
      return [];
    }
    const parsedResponse = JSON.parse(finalResponse).found_articles;
    return parsedResponse;
  } catch (error) {
    return new APIConnectionError({ message: `${error}` });
  }
}; 

Here, the `openAiClient` is created and then called within the `sendToOpenAI` function to process articles. Notably, the request uses `runTools`, which lets the model call functions we provide while generating its response. In this context, “tools” are specific functions granted to the LLM to help it complete the task.

Other essential fields include the LLM model, messages, and desired output format. The `messages` field contains two objects: the system prompt and the user input. The system prompt instructs the LLM on its purpose and behavior. The user input provides the data to be analyzed. In this example, the system prompt directs the LLM to scan articles in `info` for relevance to a `subject`. The user input then supplies the subject and article information.
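For reference, here is a minimal sketch of the shape that `foundArticleResponse` might describe. In the real app it is a Zod schema passed to `zodResponseFormat`; only the top-level `found_articles` key is confirmed by the parsing code, so the inner fields below are illustrative assumptions:

```typescript
// Hypothetical shape of the model's structured output. In the actual app this
// is defined as a Zod schema (foundArticleResponse in app/types); the inner
// fields here are assumptions for illustration.
interface FoundArticle {
  title: string;
  url: string;
  relevance: string; // e.g. a short note on why the article matched the subject
}

interface FoundArticleResponse {
  found_articles: FoundArticle[];
}

// The model returns its final message as a JSON string, which the client
// parses back into this shape:
const raw =
  '{"found_articles":[{"title":"Parsing with LLMs","url":"https://example.com/a","relevance":"matches subject"}]}';
const parsed: FoundArticleResponse = JSON.parse(raw);
```

Defining the schema once and handing it to `zodResponseFormat` means the model is constrained to emit JSON in exactly this shape, so the `JSON.parse` in `sendToOpenAI` can trust the structure it gets back.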

Tools

The `tools` field specifies the tool's type (a function), its description, its parameters, and other details. Here the tool is `getArticleInfo`, which takes a single parameter: the URL of a specific article. It fetches the full article from that URL and returns it for the LLM to process. The `getArticleInfo` function is structured as follows:

import axios from "axios";
import * as cheerio from "cheerio";
// findArticleElement is the app's own helper for locating the article
// element within the loaded HTML.

export const getArticleInfo = async (args: {
  url: string;
}): Promise<string> => {
  const { url } = args;
  // Fetch the page, load it into cheerio, and extract the article body.
  const response = await axios.get(url);
  const article = findArticleElement(cheerio.load(response.data));
  return article;
};

By using tools, we can import custom functions and integrate them with the LLM. When this function returns the article, it goes straight back to the LLM for processing without further intervention.
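To make the `parse: JSON.parse` line concrete, here is a simplified sketch of the round trip (the mechanics are an assumption based on the option names): the model emits the tool arguments as a JSON string, `parse` converts them into an object, and that object is what the function receives:

```typescript
// The model produces tool-call arguments as a raw JSON string...
const rawArgs = '{"url":"https://example.com/article"}';

// ...which the configured parse option (JSON.parse here) turns into the
// object that getArticleInfo actually receives:
const toolArgs = JSON.parse(rawArgs) as { url: string };

// getArticleInfo(toolArgs) would then fetch and return the article body,
// and the runTools helper feeds that string back to the model automatically.
```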

Summary

In summary, the process is: the app scrapes article titles and URLs and sends them to OpenAI to determine their relevance to a specified subject. If the LLM finds a related article, it uses the custom `getArticleInfo` function to retrieve the full article from the URL. It then performs further analysis and returns only the relevant articles.
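Because `sendToOpenAI` resolves to either a list of articles or an `APIConnectionError`, a caller needs to branch on the result. Here is a rough sketch of that handling; the `unwrapResults` helper and its names are hypothetical, not part of the original app:

```typescript
// Minimal stand-in for the article type returned by sendToOpenAI.
interface FoundArticle {
  title: string;
  url: string;
}

// sendToOpenAI resolves to either a list of articles or an APIConnectionError
// (an Error subclass), so callers can branch with instanceof. Falling back to
// an empty list keeps downstream code simple.
function unwrapResults(result: FoundArticle[] | Error): FoundArticle[] {
  if (result instanceof Error) {
    console.error(`OpenAI call failed: ${result.message}`);
    return [];
  }
  return result;
}

const ok = unwrapResults([{ title: "AI content parsing", url: "https://example.com/a" }]);
const failed = unwrapResults(new Error("connection refused"));
```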
