Use Case: Extracting Structured News Events With LangChain And OpenAI Functions

June 17th, 2023

With the new OpenAI functions, we can extract structured data from websites using LangChain and 2markdown.

We're going to set up a document loader with 2markdown that loads a news story and keeps only the story itself from the web page. On top of that, we're going to build an LLM chain with OpenAI that extracts structured news event data.

Preparation

You need langchain >= v0.0.202, plus accounts with OpenAI and 2markdown. We also assume that you are comfortable with Python and have a working setup.

Extracting News Events With LangChain

First, we need to set up our API keys:

# https://2markdown.com
md_api_key = "YOUR_KEY"
# https://openai.com
openai_api_key = "YOUR_KEY"
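If you would rather not hard-code keys in the script, one option is to read them from environment variables. This is a minimal sketch, not part of the original code: the helper read_api_key and the variable name TOMARKDOWN_API_KEY are assumptions made for illustration, so adjust them to your own setup.

```python
import os


def read_api_key(env_var: str, placeholder: str = "YOUR_KEY") -> str:
    """Return the key stored in env_var, or the placeholder if it is unset."""
    return os.environ.get(env_var, placeholder)


# TOMARKDOWN_API_KEY is a hypothetical variable name chosen for this sketch.
md_api_key = read_api_key("TOMARKDOWN_API_KEY")
openai_api_key = read_api_key("OPENAI_API_KEY")
```

The rest of the walkthrough works unchanged, since it only relies on the two variables md_api_key and openai_api_key.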

Next, we need to import the modules we're going to use:

from pydantic import BaseModel
from langchain.document_loaders import ToMarkdownLoader
from langchain.chat_models import ChatOpenAI
from langchain.chains import create_extraction_chain_pydantic

At the heart of our chain is the LLM, which we configure like this:

llm = ChatOpenAI(temperature=0, model="gpt-3.5-turbo-0613", openai_api_key=openai_api_key)

In order to extract news, we're going to specify what we expect a news event to look like:

class NewsEvent(BaseModel):
    location: str
    date: str
    description: str

We're going to use the ToMarkdownLoader to load a news story:

loader = ToMarkdownLoader(
    url="https://edition.cnn.com/2023/06/17/europe/ukraine-counteroffensive-explained-hnk-intl/index.html",
    api_key=md_api_key)
docs = loader.load()

Now we can create our extraction chain that converts a news story in markdown format (provided by the 2markdown loader above) to a NewsEvent:

chain = create_extraction_chain_pydantic(pydantic_schema=NewsEvent, llm=llm)
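Conceptually, OpenAI function calling works by sending the model a function definition whose parameters are described as a JSON Schema, and receiving the arguments back as a JSON string. The following is a rough, hand-written sketch of that idea for the three NewsEvent fields. It is not the payload LangChain actually builds: the function name extract_news_events, the info wrapper key, and the sample reply string are all assumptions made for illustration.

```python
import json

# JSON Schema for a single news event, mirroring the three NewsEvent fields.
news_event_schema = {
    "type": "object",
    "properties": {
        "location": {"type": "string"},
        "date": {"type": "string"},
        "description": {"type": "string"},
    },
    "required": ["location", "date", "description"],
}

# Hypothetical function definition: a name, a description, and JSON Schema
# parameters. Wrapping the events in an "info" list is an assumption.
extract_function = {
    "name": "extract_news_events",
    "description": "Extract news events from a passage.",
    "parameters": {
        "type": "object",
        "properties": {
            "info": {"type": "array", "items": news_event_schema},
        },
        "required": ["info"],
    },
}

# Hand-written stand-in for the JSON arguments a model might return.
raw_arguments = (
    '{"info": [{"location": "Bakhmut", "date": "June 1, 2023", '
    '"description": "Destruction in the city of Bakhmut after hostilities"}]}'
)

# Parse the arguments back into plain Python dictionaries.
events = json.loads(raw_arguments)["info"]
```

The extraction chain hides this round trip and hands you NewsEvent objects instead of raw dictionaries.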

And finally, we can extract our news event for the news story we extracted with 2markdown above:

res = chain.run(docs[0].page_content)

for news in res:
    print(f"location: {news.location} date: {news.date}")
    print(news.description)
    print("\n")

Which will give us something like this:

location: Zaporizhzhia date: June 13
Ukrainian servicemen fire a BM-21 'Grad' multiple rocket launcher toward Russian positions near Bakhmut


location: Bakhmut date: June 1, 2023
Destruction in the city of Bakhmut after hostilities


location: Donetsk date: June 1
Ukrainian soldiers lie on a roadside during training for an operation near Bakhmut


location: Zaporizhzhia Nuclear Power Plant date: June 15
A Russian service member stands guard at a checkpoint near the Zaporizhzhia Nuclear Power Plant in Russian-controlled Ukraine


location: Zaporizhzhia date: June 15
Russian servicemen stand guard near the Russian-controlled Zaporizhzhia nuclear power plant in southern Ukraine

We can see that the extraction chain extracted the location, date and description of each news event. The results might be overly specific, but we hope this gave you an idea of how you can use OpenAI functions and LangChain extractors with 2markdown.
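Because date is declared as a plain string, the values in the sample output come back in mixed forms such as "June 13" and "June 1, 2023". If you need comparable dates, a small post-processing step can help. This is a minimal sketch under stated assumptions: the helper normalize_date is hypothetical, it only handles the two English formats seen above, and it assumes 2023 whenever the year is missing.

```python
from datetime import date, datetime
from typing import Optional


def normalize_date(text: str, default_year: int = 2023) -> Optional[date]:
    """Parse 'June 13' or 'June 1, 2023' into a date, or return None."""
    cleaned = text.strip()
    # Append the assumed year when the string carries none, e.g. "June 13".
    if "," not in cleaned:
        cleaned = f"{cleaned}, {default_year}"
    try:
        return datetime.strptime(cleaned, "%B %d, %Y").date()
    except ValueError:
        # Anything that does not match the expected format is left unresolved.
        return None
```

You could then call normalize_date(news.date) inside the loop above and skip or flag events for which it returns None.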

If you build something cool with this, let us know on Twitter.

You can view the full code we used here.