Check If Python String Contains Substring
Table of Contents
- Introduction
- Finding Out Whether a Python String Contains a Substring
- Using the in Operator
- Finding Out Where A Substring Begins
- Using the string.index() Method
- Using the string.find() Method
- Finding Substrings Using Regular Expressions
- Summary
- Next Steps
Introduction
Strings are one of the most versatile data types in Python. They can be sliced, concatenated, and even converted to other data types in some instances. In this article, you'll take a quick look at 4 different ways to check whether one Python string contains another.
Finding Out Whether a Python String Contains a Substring
You're designing a simple web scraper that pulls listings from various online job boards to streamline your job hunt. For now, the script only pulls the titles of each job posting and a link to the full description, if the link is available. For example:
>>> job0 = "Principal Engineer - Data Architect https://job.board/job0"
>>> job1 = "Senior Software Engineer, Python https://job.board/job1"
>>> job2 = "Lead Python Developer"
How can you find out which of these job postings is looking for a Python developer in particular?
Using the in Operator
The quickest way to determine whether a Python string contains a substring is to use the in operator. The result is a Boolean value telling you whether the substring is "in" the parent string. You can use this operator to find out whether the word "Python" is in each job string:
>>> "Python" in job0
False
>>> "Python" in job1
True
>>> "Python" in job2
True
You can use a for loop along with the in operator to print out only the job listings that explicitly mention "Python" in the title:
>>> jobs = [job0, job1, job2]
>>> for index, job in enumerate(jobs):
...     if "Python" in job:
...         print(f"Job number {index}: {job}")
...
Job number 1: Senior Software Engineer, Python https://job.board/job1
Job number 2: Lead Python Developer
Let's walk through the code line by line. The first line simply puts all the job titles together into one list. The second line starts a for loop and uses enumerate(jobs) to iterate through the list of jobs. Using the enumerate function lets you keep track not only of the item the loop is currently operating on, but also of the index where that item sits.
The next line sets up a conditional statement, using the in operator to determine whether the current Python string contains the desired word. If so, the following line prints out the job number (the current string's index) and the job title.
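If you would rather keep the matching jobs than print them right away, the same filter can be written as a list comprehension. This is a minimal sketch that reuses the three example strings; the name python_jobs is introduced here purely for illustration.

```python
job0 = "Principal Engineer - Data Architect https://job.board/job0"
job1 = "Senior Software Engineer, Python https://job.board/job1"
job2 = "Lead Python Developer"
jobs = [job0, job1, job2]

# Keep (index, job) pairs whose text contains the substring "Python".
# python_jobs is a hypothetical name, not part of the original script.
python_jobs = [(index, job) for index, job in enumerate(jobs) if "Python" in job]

for index, job in python_jobs:
    print(f"Job number {index}: {job}")
```

Storing the pairs in a list means later steps can reuse the matches without scanning the full list of jobs again.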
Finding Out Where A Substring Begins
Now that you know which jobs explicitly mention "Python," you want to grab the links to those jobs for further consideration. You need to find out whether the string contains a link and - if so - the index where the link starts, so that you can pull it out of the string.
Using the string.index() Method
Every Python string object has a string.index() method that makes this process quite simple. The string.index() method returns the index where a substring first occurs in a string.
Let's use the string.index() method to find where the link in job1 starts:
>>> job1.index("http")
33
To use the index method, you first select the string you want to search and then call the index method on it. Here, that is job1.index(). Inside the parentheses, you specify the substring you want to search for - in this case, "http". Python searches through the job1 string to determine the index where the substring "http" starts, and returns the number 33, which is the index of the letter "h".
You can slice the link out of the string by selecting all characters from index 33 onwards like so:
>>> job1[33:]
'https://job.board/job1'
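The same index can also separate the title from the link. The sketch below assumes the link is always the last thing in the string, as it is in the examples here; the names title and link are illustrative.

```python
job1 = "Senior Software Engineer, Python https://job.board/job1"

# Look up the start of the link once and reuse it for both slices.
link_index = job1.index("http")

title = job1[:link_index].rstrip()  # text before the link, trailing space removed
link = job1[link_index:]            # everything from "http" to the end

print(title)
print(link)
```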
Now, you can update the for loop to only return the links to job descriptions that mention "Python" in the title:
>>> for index, job in enumerate(jobs):
...     if "Python" in job:
...         link_index = job.index("http")
...         print(job[link_index:])
...
https://job.board/job1
Traceback (most recent call last):
  File "<stdin>", line 3, in <module>
ValueError: substring not found
The previous for loop would print out the job title, whereas this one first locates the index where the link starts (link_index) and then prints the string slice that contains the link. However, notice that the last iteration raises a ValueError: substring not found. What's happened here?
It turns out that the string.index() method has one drawback: it raises an exception if the substring isn't found. job2 does not have a link in the string, so there is no index for the substring "http". As a result, job2.index("http") raises a ValueError.
How can this be fixed?
Using the string.find() Method
Thankfully, Python strings have another method that you can use without having to worry about handling exceptions. The string.find() method simply returns -1 if the substring is not found:
>>> job2.find("http")
-1
Here, trying to find the substring "http" in the job2 string returns -1 as expected, since this was the job posting that had no link.
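To make the difference concrete, here is a small side-by-side sketch of the two methods on the same string. The try/except block is one common way to guard string.index(); the variable names are illustrative.

```python
job2 = "Lead Python Developer"

# str.index() raises ValueError when the substring is missing,
# so the call has to be wrapped in try/except.
try:
    index_result = job2.index("http")
except ValueError:
    index_result = None  # illustrative placeholder for "not found"

# str.find() reports the same situation by returning -1 instead.
find_result = job2.find("http")

print(index_result)
print(find_result)
```

Both approaches work; string.find() is simply shorter when all you need is a "found or not found" answer.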
Let's update the for loop to return only the jobs where the title mentions the word "Python" and the string contains a link:
>>> for index, job in enumerate(jobs):
...     if "Python" in job and job.find("http") != -1:
...         print(job)
...
Senior Software Engineer, Python https://job.board/job1
Now, the conditional statement has two conditions. First, it checks whether the string contains the word "Python." Then, it checks that the string.find() method returns a value other than -1. If both of these conditions are met, the job is printed.
This code block returns the one Python string that contains both the word "Python" and a link to a job posting.
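Since string.find() already returns the position of the link, you could go one step further and pull the link out in the same pass. The helper below is a hypothetical sketch (extract_link is not part of the original script) and assumes the link runs to the end of the string.

```python
def extract_link(job):
    """Return the part of job starting at "http", or None if there is no link."""
    link_index = job.find("http")
    if link_index == -1:
        return None
    return job[link_index:]


jobs = [
    "Principal Engineer - Data Architect https://job.board/job0",
    "Senior Software Engineer, Python https://job.board/job1",
    "Lead Python Developer",
]

# Collect links only for jobs that mention "Python" and actually have a link.
python_links = []
for job in jobs:
    link = extract_link(job)
    if "Python" in job and link is not None:
        python_links.append(link)

print(python_links)
```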
Finding Substrings Using Regular Expressions
The strings you've been working with in this article so far are fairly "clean." In other words, you didn't have to worry about case sensitivity, misspelled words, or other "messy" features that could make the string difficult to parse.
In the real world, however, this is often not the case. You'll frequently work with Python strings that contain typos and invalid characters, and you'll need a way for your code to handle them gracefully.
One way to do this is to use regular expressions, a pattern-matching syntax that allows you to work with strings more flexibly. Python ships with a built-in module for handling regular expressions. You'll need to import this module before you can work with them:
>>> import re
Let's take a look at some different job titles now:
>>> job_titles = ["Python developer (Awesome Project)",
...               "Django and python web engineer",
...               "Senior Pyhton Developer"]
Notice that, in the second string, the word "python" is in lower case. In the third string, it's misspelled "Pyhton."
If you tried to use, for instance, the string.find() method on these last two strings, it would return -1:
>>> for job in job_titles:
...     print(job.find("Python"))
...
0
-1
-1
In other words, as far as string.find() is concerned, the last two strings don't contain the word "Python." However, you know that they do - "Python" is just not capitalized in one string, and it's misspelled in the other.
You don't want your script to ignore lowercase words and misspellings, so let's create a quick regular expression to catch these errors:
>>> patterns = "Python|python|Pyhton"
Here, you define a pattern listing all the variants you want to check for in a Python string. The vertical bar (|) is the OR operator. That means this pattern matches the word "Python" as written, its lowercase form ("python"), or the specific misspelling "Pyhton". Any other capitalization or typo would need to be added to the pattern.
Now, you can use re.search() to check if the Python string contains one of the patterns you defined:
>>> re.search(patterns, job_titles[0])
<re.Match object; span=(0, 6), match='Python'>
>>>
>>> re.search(patterns, job_titles[1])
<re.Match object; span=(11, 17), match='python'>
>>>
>>> re.search(patterns, job_titles[2])
<re.Match object; span=(7, 13), match='Pyhton'>
The re.search() function returns a Match object when it finds a match, and None when it doesn't. For each of the strings in job_titles, it successfully finds a match for the word "Python," even when it's in lowercase or misspelled.
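A Match object carries the details shown in the output above, and you can read them with its group() and span() methods. The sketch below also includes a title with none of the three spellings (a made-up example, not from the job list) to show what a failed search looks like.

```python
import re

patterns = "Python|python|Pyhton"

match = re.search(patterns, "Django and python web engineer")
if match:
    print(match.group())  # the text that matched
    print(match.span())   # (start, end) indexes of the match

# Hypothetical title containing none of the three spellings.
no_match = re.search(patterns, "Lead Java Developer")
print(no_match)
```

Because a failed search gives back None, the result of re.search() can be used directly as the condition of an if statement.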
When you update your for loop this time, all the job postings are printed out:
>>> for job in job_titles:
...     if re.search(patterns, job):
...         print(job)
...
Python developer (Awesome Project)
Django and python web engineer
Senior Pyhton Developer
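Listing every capitalization by hand gets tedious quickly. As an alternative sketch, the re module's re.IGNORECASE flag makes the match case-insensitive, so only distinct spellings need to be listed. The fourth job title below is a hypothetical addition to show an all-caps variant.

```python
import re

job_titles = [
    "Python developer (Awesome Project)",
    "Django and python web engineer",
    "Senior Pyhton Developer",
    "PYTHON team lead",  # hypothetical extra title, not from the examples above
]

# With re.IGNORECASE, "python" also covers "Python" and "PYTHON".
# The misspelling still has to be listed explicitly.
pattern = re.compile("python|pyhton", re.IGNORECASE)

matches = [job for job in job_titles if pattern.search(job)]

for job in matches:
    print(job)
```

Note that the flag only addresses capitalization; each misspelling you want to catch must still appear in the pattern.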
Summary
In this article, you saw how to determine whether a Python string contains a substring. First, you used the in operator, which returns True or False depending on whether the substring was found. Then, you used the string.index() method to locate the index where the substring starts, and the string.find() method to avoid having to handle exceptions. Finally, you saw how regular expressions let you cope with differences in capitalization, misspellings, and other quirks of character data.
Next Steps
This article used loops and conditional statements to display output to the interpreter. To learn more, check out our tutorial on writing loops with multiple conditions.
The regular expressions module re isn't the only one that you can access as soon as Python is installed. The Python standard library comes equipped with dozens of modules you can use straight out of the box, like math and datetime.
If you're interested in learning more about the basics of Python, coding, and software development, check out our Coding Essentials Guidebook for Developers, where we cover the essential languages, concepts, and tools that you'll need to become a professional developer.
Thanks and happy coding! We hope you enjoyed this article. If you have any questions or comments, feel free to reach out to jacob@initialcommit.io.