Use Case: Extracting Structured News Events With LangChain And OpenAI Functions
June 17th, 2023
Using the new OpenAI functions, we can extract structured data from websites with LangChain and 2markdown.
We're going to build a document loader using 2markdown that loads news stories and extracts only the story itself from the website. On top of that, we're going to build an LLM chain using OpenAI that extracts structured news event data.
You need langchain >= v0.0.202 to follow along. Additionally, you need to be registered with OpenAI and 2markdown. We also assume that you can work with Python and have a working setup.
Extracting News Events With LangChain
First, we need to set up our API keys:
# https://2markdown.com
md_api_key = "YOUR_KEY"
# https://openai.com
openai_api_key = "YOUR_KEY"
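Hardcoding keys is fine for a quick experiment, but a common alternative is to read them from environment variables so they never end up in your source files. Here is a minimal sketch of that idea. The `require_env` helper and the variable name `TOMARKDOWN_API_KEY` are made up for this example; use whatever names you exported.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable, or fail with a clear message."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(f"Environment variable {name} is not set")
    return value


# Usage (hypothetical variable names, assumed to be exported in your shell):
# md_api_key = require_env("TOMARKDOWN_API_KEY")
# openai_api_key = require_env("OPENAI_API_KEY")
```

Failing early with a named variable tends to be easier to debug than an authentication error from the API later on.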
Next, we need to import the modules we're going to use:
from pydantic import BaseModel, Field
from langchain.document_loaders import ToMarkdownLoader
from langchain.chat_models import ChatOpenAI
from langchain.chains import create_extraction_chain, create_extraction_chain_pydantic
from langchain.prompts import ChatPromptTemplate
At the heart of our chain is the LLM, which looks like this:
llm = ChatOpenAI(temperature=0, model="gpt-3.5-turbo-0613", openai_api_key=openai_api_key)
In order to extract news, we're going to specify what we expect a news event to look like:
class NewsEvent(BaseModel):
    location: str
    date: str
    description: str
We're going to use the 2markdown loader (ToMarkdownLoader) to load a news story.
loader = ToMarkdownLoader(
    url="https://edition.cnn.com/2023/06/17/europe/ukraine-counteroffensive-explained-hnk-intl/index.html",
    api_key=md_api_key)
docs = loader.load()
Now we can create our extraction chain that converts a news story in markdown format (provided by the 2markdown loader above) to a list of NewsEvent objects:
chain = create_extraction_chain_pydantic(pydantic_schema=NewsEvent, llm=llm)
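It helps to know that OpenAI functions are described to the model as JSON Schema, and the model answers with arguments that match that schema. The hand-written sketch below shows roughly what such a function definition could look like for NewsEvent. Only the three field names come from the model above. The function name, the description and the `events` wrapper are assumptions for illustration, not necessarily what LangChain sends.

```python
import json

# Field names mirror the NewsEvent model; everything else is illustrative.
news_event_schema = {
    "type": "object",
    "properties": {
        "location": {"type": "string"},
        "date": {"type": "string"},
        "description": {"type": "string"},
    },
    "required": ["location", "date", "description"],
}

# Hypothetical function definition: the name, description and "events" key
# are assumptions, not necessarily what LangChain generates internally.
extraction_function = {
    "name": "extract_news_events",
    "description": "Extract the news events mentioned in the passage.",
    "parameters": {
        "type": "object",
        "properties": {
            # A story can contain several events, so we ask for an array.
            "events": {"type": "array", "items": news_event_schema},
        },
        "required": ["events"],
    },
}

print(json.dumps(extraction_function, indent=2))
```

Wrapping the schema in an array is what allows a single story to yield several events.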
And finally, we can extract our news event for the news story we extracted with 2markdown above:
# loader.load() returns a list of documents, so we pass the first one
res = chain.run(docs[0].page_content)
for news in res:
    print("location: " + news.location + " date: " + news.date)
    print(news.description)
    print("\n")
Which will give us something like this:
location: Zaporizhzhia date: June 13
Ukrainian servicemen fire a BM-21 'Grad' multiple rocket launcher toward Russian positions near Bakhmut

location: Bakhmut date: June 1, 2023
Destruction in the city of Bakhmut after hostilities

location: Donetsk date: June 1
Ukrainian soldiers lie on a roadside during training for an operation near Bakhmut

location: Zaporizhzhia Nuclear Power Plant date: June 15
A Russian service member stands guard at a checkpoint near the Zaporizhzhia Nuclear Power Plant in Russian-controlled Ukraine

location: Zaporizhzhia date: June 15
Russian servicemen stand guard near the Russian-controlled Zaporizhzhia nuclear power plant in southern Ukraine
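If you want to reuse this output format elsewhere, the print loop can be wrapped in a small helper. The sketch below uses a plain dataclass as a stand-in for the pydantic model so that it runs without any API calls. The `format_event` name is made up for this example, and the sample values are copied from the output above.

```python
from dataclasses import dataclass


@dataclass
class NewsEvent:
    """Stand-in for the pydantic model, so the sketch runs without API calls."""
    location: str
    date: str
    description: str


def format_event(event: NewsEvent) -> str:
    """Render one event as a header line followed by its description."""
    return f"location: {event.location} date: {event.date}\n{event.description}"


# Sample values copied from the example output in this post.
sample = [
    NewsEvent(
        location="Bakhmut",
        date="June 1, 2023",
        description="Destruction in the city of Bakhmut after hostilities",
    ),
]

for event in sample:
    print(format_event(event))
```

Using an f-string instead of string concatenation also keeps the helper from raising a TypeError if a field ever holds something other than a string.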
We can see that the extraction chain extracted the location, date and description of the news events. The results might be overly specific, but we hope this gave you an idea of how you can use OpenAI functions and LangChain extractors with 2markdown.
If you build something cool with this, let us know on Twitter.
You can view the full code we used here.