Creating a Data Checking Agent

Ray Poynter, 21 July 2025


Agents are one of the key AI developments re-shaping the way we conduct business and run our processes. An agent is a software tool capable of tackling a task without user supervision. In this post, I will illustrate agents with an example that checks a data file (for example, one from a quantitative survey). I will include the information you need to build this Data Checker yourself.

Gems in Gemini

For this example, I created the agent in Gemini (but I could have created an almost identical agent in ChatGPT, Copilot, or Claude).

In Gemini, agents are called Gems. You will find the Gems option on the panel on the left, as in the image below. To create a Gem, click on Explore Gems.

Gem button in Gemini

Once you are in the Gem Manager, you will see a list of Gems created by Google, your own Gems, and the button to create a New Gem.

The Gem Manager Screen

A Simple Agent / Gem

Agents can be quite complex, but this example is very straightforward: it is based on a set of instructions. The Gem is created by giving it a name and then defining the Instructions. The instructions are what make the agent useful; each time the agent is run, they define what should happen.

In the case of this data checker, the instructions tell the agent to take an uploaded data file and describe its key features (e.g. the number of records and the number of fields), followed by a list of potential issues, such as missing data, duplicates, and data in the wrong format. Later in the blog I will share the current iteration of the instructions.
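The same kinds of checks can be sketched in a few lines of pandas. This is a minimal, hypothetical standalone script (not what the Gem runs internally), using made-up sample data with column names matching the Disney example:

```python
import io
import pandas as pd

# Hypothetical sample standing in for an uploaded survey file
csv_text = """Review_ID,Rating,Year_Month,Review_Text
1,5,2019-04,Great day out
2,4,missing,missing
2,4,missing,missing
3,1,2018-11,Too crowded
"""
df = pd.read_csv(io.StringIO(csv_text))

# Key features: number of records, number of fields, field names
n_rows, n_cols = df.shape
field_names = list(df.columns)

# Potential issues: duplicate rows, duplicate IDs, placeholder strings
dup_rows = int(df.duplicated().sum())
dup_ids = int(df["Review_ID"].duplicated().sum())
placeholders = int((df == "missing").sum().sum())

print(f"Rows: {n_rows}, Fields: {n_cols} ({', '.join(field_names)})")
print(f"Duplicate rows: {dup_rows}, duplicate IDs: {dup_ids}, "
      f"'missing' placeholders: {placeholders}")
```

In the sample above, one row is a complete duplicate, one Review_ID repeats, and four cells hold the 'missing' placeholder, which is exactly the pattern the Gem reports on the real file.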

Examples of Using the Data Checker

To test the Data Checker, I downloaded just over 42,000 reviews of Disney amusement parks from Kaggle.com as a CSV file.

I launched the Gem by clicking on it in my list of Gems. I typed “I want to check a file”. The Gem replied, “Please provide me with the file you would like me to check.”

I uploaded the file and pressed the process arrow.

After a period of processing, the Gem responded with:

“Here is the quality report for the file Disney Data from Kaggle as csv.csv:

  • File Name: Disney Data from Kaggle as csv.csv
  • Number of Rows: 42656
  • Number of Columns/Fields: 6
  • Names of the Fields: Review_ID, Rating, Year_Month, Reviewer_Location, Review_Text, Branch

Here are the data quality issues identified in the file:

  1. Duplicate Rows:
    • There are 12 completely duplicate rows.
    • There are 20 duplicate values in the Review_ID field, which is intended to be a unique identifier.
  2. Missing Data:
    • While there are no standard null values, the string ‘missing’ is used as a placeholder:
      • 2613 cells in the Year_Month column contain the string ‘missing’.
      • 448 cells in the Review_Text column contain the string ‘missing’.
  3. Data Type and Format Conformance:
    • Year_Month: As noted above, 2613 cells do not conform to the expected YYYY-MM format because they contain the string ‘missing’.
    • Rating: All values are within the expected range of 1 to 5.
  4. Personally Identifiable Information (PII):
    • There are 4 potential email addresses found in the Review_Text column, which could be considered PII.
  5. Other Data Quality Issues:
    • The Branch field contains consistent and expected values (Disneyland_HongKong, Disneyland_California, Disneyland_Paris).

I can help fix some of these issues, such as removing duplicate rows and duplicate Review_ID entries, and replacing the ‘missing’ string with actual null values or removing rows with ‘missing’ dates. I can also help you anonymize the PII data, by removing the email addresses, or masking them.

Would you like me to proceed with fixing these issues and then export the updated file for you?”

I would then ask Gemini to remove the duplicates, remove the cases where the date field contained the word ‘missing’, and remove the four email addresses. I would then export the file for future use and reference before proceeding with any analysis or investigation.
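Those three fixes, plus the export step, can be sketched in pandas as well. This is a hypothetical illustration with made-up sample data (the email address and file name are invented for the example):

```python
import io
import pandas as pd

# Hypothetical sample with a duplicate row, 'missing' dates, and an email
csv_text = """Review_ID,Rating,Year_Month,Review_Text
1,5,2019-04,Great day out
2,4,missing,missing
2,4,missing,missing
3,1,2018-11,Ask me at fan@example.com
"""
df = pd.read_csv(io.StringIO(csv_text))

# 1. Remove fully duplicate rows (keeping the first occurrence)
df = df.drop_duplicates()

# 2. Remove rows where the date field holds the 'missing' placeholder
df = df[df["Year_Month"] != "missing"]

# 3. Mask email addresses found in the free-text reviews
email_re = r"[\w.+-]+@[\w-]+\.[\w.-]+"
df["Review_Text"] = df["Review_Text"].str.replace(
    email_re, "[email removed]", regex=True)

# 4. Export the cleaned file for future use and reference
df.to_csv("cleaned_reviews.csv", index=False)
```

Whether you remove rows with missing dates or replace the placeholder with real nulls depends on the analysis you plan to do next.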

The Instructions

Below are the instructions I used for this Agent. The first version I created was simpler, but over time I spotted cases where the instructions missed things I wanted the agent to detect. I iterated on the instructions to improve them, and the version below is the current one.

You are an expert data processor, and you have been chosen to check the quality of data files. The types of files you should expect are those where the first row serves as a header, naming the fields. The remaining rows should contain one record per respondent.

You should check for the following:

1) Are there any duplicate rows? Check that any fields intended to be unique identifiers (e.g., ‘ID’, ‘Review_ID’) do not contain duplicate values.

2) Are there any cells with missing data?

3) Identify any cells whose values don’t conform to the expected data type or format for their column. For example, dates not matching YYYY-MM (or zero-padded), numeric fields containing letters or symbols, text fields containing placeholder strings like “missing,” emails in free-text reviews, etc.

4) Is there any personally identifiable information?

5) Are there any other data quality issues?
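Checks 3 and 4 in the list above can also be expressed as simple regular-expression tests. This is a hypothetical plain-Python sketch, with made-up cell values, of the kind of pattern matching involved:

```python
import re

# Hypothetical cell values from a Year_Month and a free-text column
year_month_values = ["2019-04", "missing", "2018-11", "2020/01"]
review_texts = ["Great day", "Write to me at fan@example.com"]

# Check 3: dates should match the zero-padded YYYY-MM format
ym_format = re.compile(r"^\d{4}-(0[1-9]|1[0-2])$")
bad_dates = [v for v in year_month_values if not ym_format.match(v)]

# Check 4: flag potential PII such as email addresses in free text
email_re = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
pii_hits = [t for t in review_texts if email_re.search(t)]

print("Non-conforming dates:", bad_dates)   # ['missing', '2020/01']
print("Rows with potential PII:", len(pii_hits))   # 1
```

The Gem does this reasoning itself from the natural-language instructions; the sketch just makes the underlying checks concrete.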

Report to the user:

1) The name of the file

2) The number of rows

3) The number of columns/fields

4) The names of the fields

Then report any data quality issues.

Offer to fix the issues if you can, and then offer to export the updated file.

If you want to replicate this Agent in Claude, Gemini, Copilot, or ChatGPT, simply copy these instructions and paste them as the instructions in your chosen LLM.


Want to learn more about Agents?

Agents Unleashed Webinar

I am hosting a webinar on how to start using agents, with examples that you can create using ChatGPT, Copilot, Claude, and Gemini.

Wednesday, 30 July, 2025
Broadcast time 3pm London (10am New York, 9am Chicago)
Click here to register

4 thoughts on “Creating a Data Checking Agent”

  1. If this was a datafile with no personal contact details (unless left in open ended comments) would uploading a file in this way comply with GDPR and MRS guidelines?

  2. Thanks for providing these instructions for checking data using an AI agent. I tried your instructions on a data set I downloaded from Kaggle.com. Everything worked well, except that I noticed this specific output from Gemini, which looks incorrect.
    ***
    Unique Identifier Duplicates:

    Customer_ID: Duplicate values were found in this field. There are 400 instances of duplicate Customer_ID values. This suggests that the same customer may have made multiple purchases, which is expected behavior for a customer transaction dataset.

    Bakery_ID: Duplicate values were found in this field. There are 495 instances of duplicate Bakery_ID values. This is also expected, as multiple transactions will be associated with the same bakery.
    ***

    Do you have any suggestion on how to fix this in the instructions so Gemini doesn’t consider that as a data quality issue?

  3. This is a great question. What I do is talk to Gemini (or whichever LLM I am using). I highlight the error (or the interpretation that I don’t like) and ask it why it happened. I then ask it how I should modify my instructions so that it does not happen again, and I update the instructions in the Gem. Over time, the agent gets better and better.
