Leveraging ChatGPT for Natural Language Processing with Python, pandas, and NLTK

During the past few weeks, I’ve been using ChatGPT to help me solve an information architecture challenge. I’m working on a project for a client that involves creating a taxonomy in a WordPress site that contains more than 1500 pieces of content. The site includes video posts, podcast episodes, and courses.

The lack of structure in the site contributes to a poor user experience, because it is difficult for users to find relevant content and discover new content based on their interests. My job is to figure out how best to organize the content that currently exists, and to create a framework that can be used to organize future content.

As part of the exploration phase of the project, I examined the site structure to understand how things are currently organized. I also built out a Miro board to evaluate the current state of content organization as well as map out a proposed future state.

I initially felt like I had a good sense of how the content should be organized, but as I started digging deeper, I realized I needed to generate a comprehensive list of keywords from the existing content and then build the information architecture from that.

I knew there had to be a way to automate the process of keyword extraction, so I did a little research and discovered that the pandas and NLTK libraries for Python can be used for exactly that purpose.

What is pandas?

Pandas is an open-source data analysis and manipulation library for Python. In simplest terms, pandas allows you to define a dataframe, which is a specific section of data (e.g., a column in a spreadsheet, a text document), and then perform various analyses on that dataframe. There’s much more to it than that, but you can think of pandas as a magnifying glass for your data that helps you see things you never saw before.
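As a minimal sketch of the idea (the rows here are made up and simply stand in for a real content export), you can build a dataframe and start inspecting it in just a few lines:

```python
import pandas as pd

# A tiny dataframe with made-up rows standing in for a content export
df = pd.DataFrame({
    "title": ["Budgeting Basics", "Saving Money Podcast", "Investing 101"],
    "post_type": ["video", "podcast", "course"],
})

# Inspect the first rows, then count posts by type
print(df.head())
print(df["post_type"].value_counts())
```

Even this toy example hints at the "magnifying glass" effect: one line of code summarizes how content breaks down by type.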

Further Reading

To learn more about pandas, check out the official website.

What is NLTK?

NLTK stands for Natural Language Toolkit, and it’s a really extensive library of tools that can be used for all kinds of natural language processing (NLP). Some of the key things that NLTK can do include:

  • Tokenization, which is the process of converting words into tokens, or units of value that can be counted.
  • Lemmatization, which is breaking down words to their root dictionary forms, or lemmas. As an example, let’s say the content contains the words “building,” “builders,” and “build.” The lemmatized form would be “build.” Instead of counting a unique token for each variation of the word, NLTK simply counts the lemmatized tokens derived from all of the words.
  • Stopword filtering. Stopwords are filler words that generally have no semantic value. NLTK comes with a predefined corpus of stopwords, so it will cut out things like “an,” “and,” “the,” etc. You can also provide NLTK with a list of custom stopwords that you want to filter out.

Further Reading

To learn more about NLTK, check out the official documentation.

How Did ChatGPT Fit into This Process?

I am in the process of learning Python, but my skills are very basic. I have more of an understanding of what Python can be used for than exactly how to use it. At the start of this project, I did what I always do: research. I learned about pandas and NLTK, and I followed along with some great blog posts that seemed to align with my particular use case. I spent a full day banging my head against the keyboard, but I was only able to get as far as print(df.head()) to confirm that pandas was actually looking at the correct data frame.

Because my Python skills didn’t match what I wanted to accomplish (even though I knew it was possible), I used ChatGPT to generate the Python script that I used for data analysis. Multiple iterations were needed to get the script to perform properly. In addition, some manual tuning of the script was required to produce the desired output.

Sometimes ChatGPT would get stuck in a rut, especially after I asked for multiple adjustments to the code in a single session. When that happened, I started a fresh query and refined my question to be more specific based on my previous learning.

Overall, it was a very iterative process. The computer couldn’t “just do the work for me.” At all points, I had to have a clear understanding of my goals in order to evaluate the validity of not just the code that ChatGPT generated, but also the actual processed output from that code.

All in all, it took two conversations and around forty queries to get the code to a serviceable point.

My Data Analysis Workflow

Although the actual process involved far more trial and error than this, these are the general steps that I followed to get to the desired output.

Natural Language Processing Workflow for Keyword Generation (created in Miro)
  1. Export a .CSV file of all of the website’s content using the WP Import Export Lite Plugin. I included the following columns:
    1. Post ID
    2. Title
    3. Post Content
    4. Excerpt
  2. Perform an initial cleaning of the data using the TRIM and CLEAN functions in Excel, as well as some good old fashioned search and replace.
  3. Prepare a list of custom stopwords in a .txt file that would be loaded into NLTK.
  4. Prepare a query for ChatGPT to write a Python script to analyze my data and export the results to a .CSV
  5. Run the code generated by ChatGPT.
  6. Analyze the output and make adjustments to the stopwords file or ChatGPT query based on the results.
  7. Repeat steps 3–6 until the code produced a satisfactory result.
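As a sketch of the earliest coding step, here is roughly what loading the export and sanity-checking it looks like. The rows below are made up and written to disk only so the snippet is self-contained; in the real workflow, the CSV came from the WP Import Export Lite plugin:

```python
import pandas as pd

# Stand-in for the plugin export from step 1 (made-up rows;
# the real file held roughly 1500 of them)
pd.DataFrame({
    "post_id": [101, 102],
    "title": ["Budgeting Basics", "Saving Money Podcast"],
    "post_content": ["Sample body text.", "Sample episode notes."],
    "excerpt": ["Sample excerpt.", "Sample excerpt."],
}).to_csv("website-content.csv", index=False)

# Load the export and confirm pandas sees the expected columns
df = pd.read_csv("website-content.csv")
print(df.head())
```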

The query used for this project

ChatGPT Query Take Two

I have a CSV file titled “website-content.csv” with four columns. The two columns I want to analyze are “title” and “content_cleaned.” I want to find the top 1000 most used phrases across both columns. I would like to use pandas and nltk. The script should surface the most used combinations of words. I also have a custom stopwords file titled “custom-stopwords.txt.” Keywords and phrases should be lowercase.

Through an iterative process, this is what I ultimately asked for:

  • Merge “title” and “content_cleaned” into a single pandas dataframe
  • Exclude NLTK English stopwords and custom stopwords from a .txt file
  • Analyze the remaining text for common phrases of length 2–4 words (bigrams, trigrams, and quadgrams)
  • Output the 1000 most prevalent of each (bigrams, trigrams, quadgrams)
  • Export the result to CSV

Results from the Process

At the end of this process, I was able to efficiently analyze post titles and content as a single data frame for nearly 1500 pieces of content and generate 3000 possible keyword combinations. Eliminating single words from the output enables me to see keywords in context, which gives me a better idea of how those words are used across the site.

With this data in hand, I can further refine the list by surfacing only relevant keyword combinations. This will likely be a manual process, but I may employ an additional Python script to help filter the keyword list. From there, I’ll be able to group terms under relevant topics to create an information architecture schema that accurately reflects the content on the site.

Using ChatGPT helped me get much further, much faster on this project than I would have been able to on my own. I had to do the research, learn the technical terminology, and write a thoughtful query, while the machine did the heavy lifting on code output. It gave me first-hand experience with the limitations and potential of this powerful new technology.

Lastly, I have several working Python scripts with useful comments, as well as my conversation history with ChatGPT that I can review to learn more about how the code works. I know the code works, I know what it does, and now I have the documentation that can help me learn how to do what I want on my own. I still have some outstanding questions like, “I wonder if the script could be modified to do XYZ?”, but I’m happy with the output, and I think it would be better to take a deeper dive into Python, pandas, and NLTK before tweaking things further.
