With OpenAI's new function calling, we can extract structured data from websites using LangChain and 2markdown.

We will build a document loader with 2markdown that loads a news page and keeps only the story itself. We will then build an LLM chain with OpenAI that extracts structured news event data from that story.

Preparation

You need langchain >= v0.0.202, and you need to be registered with both OpenAI and 2markdown. We also assume you can work with Python and have a working environment.

Extracting News Events With LangChain

First, we need to set up our API keys:

# https://2markdown.com
md_api_key = "YOUR_KEY"
# https://openai.com
openai_api_key = "YOUR_KEY"
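If you would rather not hard-code keys in the script, a minimal sketch like the following reads them from environment variables instead. The helper `read_api_key` and any variable names you pass to it are our own placeholders, not part of LangChain or 2markdown.

```python
import os


def read_api_key(env_var: str) -> str:
    """Return the key stored in env_var, or raise a clear error if it is missing."""
    value = os.environ.get(env_var, "").strip()
    if not value:
        raise RuntimeError(f"Set the {env_var} environment variable first")
    return value
```

You could then write, for example, `openai_api_key = read_api_key("OPENAI_API_KEY")`, and do the same for the 2markdown key under whatever variable name you choose.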

Next, we need to import the modules we're going to use:

from pydantic import BaseModel, Field
from langchain.document_loaders import ToMarkdownLoader
from langchain.chat_models import ChatOpenAI
from langchain.chains import create_extraction_chain, create_extraction_chain_pydantic
from langchain.prompts import ChatPromptTemplate

At the heart of our chain is the LLM, which we set up like this:

llm = ChatOpenAI(temperature=0, model="gpt-3.5-turbo", openai_api_key=openai_api_key)

To extract news, we're going to specify what we expect a news event to look like:

class NewsEvent(BaseModel):
    location: str
    date: str
    description: str

Next, we use the 2markdown loader to load a news story:

loader = ToMarkdownLoader(
    url="https://edition.cnn.com/2023/06/17/europe/ukraine-counteroffensive-explained-hnk-intl/index.html",
    api_key=md_api_key)
docs = loader.load()
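A long article may not fit into the model's context window in one piece. One possible workaround, sketched here under our own assumptions, is to split the markdown on blank lines into chunks below a character budget and run the chain once per chunk. The function name `split_markdown` and the default of 8000 characters are arbitrary choices, not values from LangChain or OpenAI.

```python
from typing import List


def split_markdown(text: str, max_chars: int = 8000) -> List[str]:
    """Group blank-line-separated paragraphs into chunks of at most max_chars.

    A single paragraph longer than max_chars is kept whole as its own chunk.
    """
    chunks: List[str] = []
    current = ""
    for paragraph in text.split("\n\n"):
        candidate = f"{current}\n\n{paragraph}" if current else paragraph
        if len(candidate) <= max_chars:
            current = candidate
        else:
            if current:
                chunks.append(current)
            current = paragraph
    if current:
        chunks.append(current)
    return chunks
```

With the loader above, this would be called as `split_markdown(docs[0].page_content)`.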

Now we can create our extraction chain that converts a news story in markdown format (provided by the 2markdown loader above) to a NewsEvent:

chain = create_extraction_chain_pydantic(pydantic_schema=NewsEvent, llm=llm)
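To give a feel for what OpenAI functions involve, here is a hand-written sketch of a function definition for the same three fields, plus a helper that decodes the JSON string a model returns as function-call arguments. The function name `extract_news_events`, the `events` wrapper key, and the helper `parse_function_arguments` are our own placeholders. The definition LangChain actually generates may look different.

```python
import json

# JSON Schema mirroring the three NewsEvent fields.
news_event_schema = {
    "type": "object",
    "properties": {
        "location": {"type": "string"},
        "date": {"type": "string"},
        "description": {"type": "string"},
    },
    "required": ["location", "date", "description"],
}

# Hypothetical function definition: name, description, and the
# "events" key are illustrative, not what LangChain emits.
extract_news_events = {
    "name": "extract_news_events",
    "description": "Extract news events from the passage.",
    "parameters": {
        "type": "object",
        "properties": {
            "events": {"type": "array", "items": news_event_schema},
        },
        "required": ["events"],
    },
}


def parse_function_arguments(arguments_json: str) -> list:
    """Decode function-call arguments (a JSON string) into a list of event dicts."""
    return json.loads(arguments_json)["events"]
```

The extraction chain takes care of this plumbing for us, which is why the tutorial only needs the pydantic class.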

Finally, we can extract the news events from the news story we loaded with 2markdown above:

# Run the chain on the markdown of the first loaded document.
res = chain.run(docs[0].page_content)

# Each result is a NewsEvent instance.
for news in res:
    print(f"location: {news.location} date: {news.date}")
    print(news.description)
    print("\n")

This will give us something like this:

location: Zaporizhzhia date: June 13
Ukrainian servicemen fire a BM-21 'Grad' multiple rocket launcher toward Russian positions near Bakhmut


location: Bakhmut date: June 1, 2023
Destruction in the city of Bakhmut after hostilities


location: Donetsk date: June 1
Ukrainian soldiers lie on a roadside during training for an operation near Bakhmut


location: Zaporizhzhia Nuclear Power Plant date: June 15
A Russian service member stands guard at a checkpoint near the Zaporizhzhia Nuclear Power Plant in Russian-controlled Ukraine


location: Zaporizhzhia date: June 15
Russian servicemen stand guard near the Russian-controlled Zaporizhzhia nuclear power plant in southern Ukraine
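Because `date` is declared as a plain string, the dates come back in mixed formats ("June 13" next to "June 1, 2023"). If you need real date objects, a small normalizer along these lines could help. The function name `parse_event_date` and the fallback year are assumptions of ours, and the sketch only handles the two formats seen above with English month names.

```python
from datetime import date, datetime
from typing import Optional


def parse_event_date(text: str, default_year: int = 2023) -> Optional[date]:
    """Parse 'June 1, 2023' or 'June 13'; return None for anything else."""
    cleaned = text.strip()
    # Try the full form first, then retry with the assumed year appended.
    for candidate in (cleaned, f"{cleaned}, {default_year}"):
        try:
            return datetime.strptime(candidate, "%B %d, %Y").date()
        except ValueError:
            continue
    return None
```

Anything the helper cannot parse comes back as `None`, so you can decide separately how to treat those entries.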

The extraction chain pulled out the location, date, and description of each news event it found. The results may be more fine-grained than you need, but we hope this gives you an idea of how you can use OpenAI functions and LangChain extractors with 2markdown.

If you build something cool with this, let us know by email.

You can view the full code we used here.