10 Essential SQL String Functions for Data Cleaning, Ranked

For the first few years of my career, my data analysis workflow had a very predictable, and inefficient, first step: export the raw data to a CSV.

I would write a basic SELECT * FROM some_table and immediately dump the results into another tool, because the text data I needed was always a complete mess. Product names with trailing spaces, user-entered states in a mix of upper and lower case, phone numbers formatted with dashes, parentheses, or nothing at all — it felt impossible to work with directly in SQL. My mental model was that SQL was for getting data, and other tools were for cleaning it.

That mental model was fundamentally wrong, and it cost me hundreds of hours. The breakthrough was realizing that SQL has a powerful, built-in toolkit for text manipulation that can solve 90% of these common data cleaning problems before the data ever leaves the database.

Here are the 10 string functions I use constantly, ranked from most to least frequently used in my day-to-day data cleaning tasks.

1. TRIM (and LTRIM / RTRIM)

This is my absolute number one, the first line of defense. TRIM removes whitespace from both the beginning and end of a string. Inconsistent spacing, especially from user input forms or copy-paste errors, is the most common data quality issue you will face, and it silently breaks JOIN conditions and GROUP BY clauses.

A GROUP BY country on rows containing 'USA' and 'USA ' (with a trailing space) will treat them as two entirely different groups. TRIM fixes this instantly.

Syntax: TRIM(your_column)
Example: TRIM(' Product A ') returns 'Product A'.
Variations: LTRIM removes spaces only from the left, and RTRIM removes spaces only from the right. I use TRIM almost exclusively because it’s rare to want to fix one side but not the other.

2. UPPER / LOWER

The second most common problem is inconsistent capitalization. Users might enter ‘california’, ‘California’, or ‘CALIFORNIA’. To SQL, these are three distinct strings. UPPER converts a string to all uppercase, while LOWER converts it to all lowercase.

I use this constantly in WHERE clauses, JOIN conditions, and GROUP BY clauses to standardize text and ensure matches are found correctly. A common pattern is WHERE UPPER(state_name) = 'CALIFORNIA'.

Syntax: UPPER(your_column) or LOWER(your_column)
Example: UPPER('Priya Nair') returns 'PRIYA NAIR'.

3. CONCAT

Once data is clean, you often need to combine it. CONCAT joins two or more strings together into a single string. This is incredibly useful for creating full names from first and last name columns, generating descriptive labels, or building custom IDs.

Syntax: CONCAT(string1, string2, ...)
Example: CONCAT(first_name, ' ', last_name) with values ‘Priya’ and ‘Nair’ returns 'Priya Nair'.
Dialect Note: Some SQL versions, like SQL Server, use the + operator for string concatenation (first_name + ' ' + last_name), while others, like PostgreSQL and Oracle, use || (first_name || ' ' || last_name). The CONCAT function is the most portable standard.

4. REPLACE

This is your surgical tool for fixing known, repeating errors inside a string. REPLACE finds every occurrence of a specific substring and replaces it with another. It’s perfect for standardizing phone numbers by removing dashes, or correcting a common misspelling across thousands of rows.

Syntax: REPLACE(your_column, 'find_this', 'replace_with_this')
Example: REPLACE('555-123-4567', '-', '') returns '5551234567'.

5. SUBSTRING (or SUBSTR)

When you need to extract just one part of a string, SUBSTRING is the tool. You specify the string, a starting position, and how many characters you want to extract. This is essential for parsing data that’s been packed into a single column, like pulling the year out of a YYYY-MM-DD date string or getting an area code from a phone number.

Syntax: SUBSTRING(your_column FROM start_position FOR length) (standard SQL) or SUBSTRING(your_column, start_position, length) (common variation).
Example: SUBSTRING('2026-08-15', 1, 4) returns '2026'.

6. COALESCE

While not exclusively a string function, I use COALESCE with strings so often in data cleaning that it has to be on this list. It returns the first non-NULL value from a list of arguments. This is the perfect way to replace NULL values with a default, like an empty string '' or a placeholder like 'Not Provided'.

This prevents CONCAT operations from resulting in NULL if one part is missing (e.g., CONCAT(first_name, ' ', NULL) would be NULL) and makes your final data much cleaner.

Syntax: COALESCE(value_to_check, value_if_null, ...)
Example: CONCAT(first_name, ' ', COALESCE(last_name, '')) ensures you get a result even if last_name is NULL.

7. CHARINDEX / POSITION

These functions are the necessary partner to SUBSTRING for more complex extractions. They don’t extract anything themselves; they just tell you the starting position of a specific character or substring. You often need to find the location of a '@' in an email or a ',' in a full name before you can use SUBSTRING to extract the part you want.

Syntax: POSITION('substring' IN your_column) (standard) or CHARINDEX('substring', your_column) (SQL Server).
Example: POSITION('@' IN '[email protected]') returns 5. You can then use this number as the starting point for another function.

8. LEFT / RIGHT

These are simplified, more readable versions of SUBSTRING when you only need to grab characters from the very beginning or very end of a string. I find them much easier to read and write for simple cases, like getting the first 5 characters of a zip code.

Syntax: LEFT(your_column, number_of_chars) or RIGHT(your_column, number_of_chars)
Example: LEFT('TX-78701', 2) returns 'TX'.

9. LENGTH (or LEN)

This is a simple but surprisingly useful function that returns the number of characters in a string. Its main use in data cleaning is for validation — for example, finding all rows where a supposedly 5-digit zip code does not have a length of 5, or checking for user IDs that are too long or too short.

Syntax: LENGTH(your_column) or LEN(your_column) (SQL Server).
Example: LENGTH('stepbystepsql') returns 12.

10. SPLIT_PART

This is a more advanced, specialized function available in databases like PostgreSQL, Redshift, and Snowflake, but it’s so powerful it deserves the final spot. It splits a string by a delimiter and returns the Nth part. It’s a lifesaver for parsing comma-separated values or extracting a specific part of a URL or email address without complex SUBSTRING and POSITION logic.

Syntax: SPLIT_PART(your_string, delimiter, part_number)
Example: SPLIT_PART('[email protected]', '@', 2) returns 'example.com'. For databases without this, you’d have to combine POSITION and SUBSTRING to achieve the same result.

Putting It All Together: A Real-World Example

Imagine a users table with a messy full_name column containing values like ' nair, priya '. Our goal is to transform this into 'Priya Nair'.

Here’s how we can combine several of the functions above in a single SELECT:

SELECT
  -- Final, clean full name
  CONCAT(
    -- The first name part, with the first letter capitalized
    UPPER(LEFT(SUBSTRING(TRIM(full_name), POSITION(',' IN TRIM(full_name)) + 2), 1)),
    LOWER(SUBSTRING(TRIM(full_name), POSITION(',' IN TRIM(full_name)) + 3)),
    ' ',
    -- The last name part, with the first letter capitalized
    UPPER(LEFT(SUBSTRING(TRIM(full_name), 1, POSITION(',' IN TRIM(full_name)) - 1), 1)),
    LOWER(SUBSTRING(TRIM(full_name), 2, POSITION(',' IN TRIM(full_name)) - 2))
  ) AS cleaned_name
FROM
  users;

This looks complex, but it’s just a sequence of the simple building blocks we’ve covered: TRIM to remove whitespace, POSITION to find the comma, SUBSTRING to extract the first and last names, and CONCAT to put them back together in the right order with proper capitalization.

The Shift From Exporter to Cleaner

Internalizing this toolkit was a genuine turning point. I stopped seeing messy text data as a dead end in SQL or a signal that it was time to export to another tool. Instead, I started seeing it as a puzzle that could almost always be solved with a clever combination of these core functions. It made my workflow faster, kept the logic contained within the database where it belonged, and ultimately made me a far more capable data analyst.

What’s the messiest string data you’ve ever had to clean? Describe the problem, and I can suggest a combination of these functions to tackle it.