ctfnote
  • /home/ret2basic.eth
  • Game Hacking
    • ✅C++
    • Ghidra
    • Cheat Engine
    • Proxy
    • DLL injection
    • Keygen
    • Aimbot
  • Web3 Security Research
    • 👑Web3 Security Research Trivia
    • ✅Solidity
      • ✅Mastering Ethereum
      • ✅Storage
      • ✅Memory
      • ✅Calldata
      • ✅ABI
    • ✅Foundry
      • ✅Introduction
      • ✅How to Write Basic Tests
      • ✅Set Soliditiy Compiler Version
      • ✅Remappings
      • ✅Auto Format Code
      • ✅Console Log
      • ✅Authentication
      • ✅Error
      • ✅Event
      • ✅Time
      • ✅Send ETH
      • ✅Signature
      • ✅Fork
      • ✅Mint 1 Million DAI on Mainnet Fork
      • ✅FFI
      • ✅Fuzz
      • ✅Invariant Testing - Part 1
      • Invariant Testing - Part 2
      • Invariant Testing - Part 3
      • Differential Test
    • ✅Secureum
      • ✅Epoch 0
        • ✅Slot 1: Ethereum 101
          • ✅Notes
          • ✅Ethereum Whitepaper
          • ✅Extra Study: What happens when you send 1 DAI
          • ✅Quiz
        • ✅Slot 2: Solidity 101
          • ✅Notes
          • ✅OpenZeppelin ERC20
          • ✅OpenZeppelin ERC721
          • ✅OpenZeppelin Ownable
          • ✅OpenZeppelin Pausable
          • ✅OpenZeppelin ReentrancyGuard
          • ✅Quiz
        • ✅Slot 3: Solidity 201
          • ✅Notes
          • ✅OpenZeppelin SafeERC20
          • ✅OpenZeppelin ERC-777
          • ✅OpenZeppelin ERC-1155
          • ✅OpenZeppelin ERC-3156
          • ✅OpenZeppelin - Proxy Upgrade Pattern
          • ✅Quiz
        • ✅Slot 4: Pitfalls and Best Practices 101
          • ✅Notes
          • ✅Intro to Security First Development
          • ✅Quiz
        • ✅Slot 5: Pitfalls and Best Practices 201
          • ✅Notes
          • So you want to use a price oracle
          • The Dangers of Surprising Code
          • ✅Quiz
        • ✅Slot 6: Auditing Techniques & Tools 101
          • ✅Notes
          • ✅Quiz
        • ✅Slot 7: Audit Findings 101
          • Notes
          • ✅Fei Protocol - ConsenSys
          • ✅Uniswap V3 - Trail of Bits
          • ✅Chainlink - Sigma Prime
          • ✅Opyn Gamma - OpenZeppelin
          • ✅Quiz
        • ✅Slot 8: Audit Findings 201
          • Notes
          • 1inch Liquidity - Consensus
          • Original Dollar - Trail of Bits
          • Synthetix EtherCollateral - Sigma Prime
          • Holdefi - OpenZeppelin
          • Quiz
      • ✅Epoch ∞
        • ✅RACE #4 - ERC20 Implementation
        • ✅RACE #5 - ERC1155 Implementation
        • ✅RACE #6 - ERC721 Application
        • ✅RACE #7 - Bored Ape
        • ✅RACE #8 - ERC721 Roles
        • ✅RACE #9 - Proxy
        • ✅RACE #10 - Test Cases
        • ✅RACE #11 - Staking
        • ✅RACE #12 - ERC20 Permit
        • ✅RACE #13 - ERC20 with Callback
        • ✅RACE #14 - Lending
        • ✅RACE #15 - DEX
        • ✅RACE #16 - Flash Loan
        • ✅RACE #17
    • DeFi
      • Glossary
        • TWAP vs. VWAP
        • Tranches
      • DeFi MOOC
        • Lecture 2: Introduction to Blockchain Technologies
        • Lecture 5: DEX
        • Lecture 6: Decentralized Lending
        • Lecture 10: Privacy on the Blockchain
        • Lecture 12: Practical Smart Contract Security
        • Lecture 13: DeFi Security
      • Uniswap V2
      • Compound V3
        • ✅Whitepaper
        • ✅Interacting with Compound
          • ✅Supply and Redeem
          • ✅Borrow and Repay
          • ✅Liquidation
          • ✅Long and Short
        • ✅Interest Model
        • CToken
      • Aave
      • Chainlink
        • ✅Getting Started
        • ✅Data Feeds
        • ✅VRF
      • Optimism
        • Bedrock
      • LayerZero
      • Opensea
        • Seaport
    • EVM
      • ✅Andreas Antonopoulos - The Ethereum Virtual Machine
      • ✅Program The Blockchain - Smart Contract Storage
      • ✅EVM Codes - EVM Playground for Opcodes
      • ✅Fvictorio - EVM Puzzles
      • ✅Daltyboy11 - More EVM Puzzles
      • ✅EVM Through Huff
      • Noxx - EVM Deep Dives
      • ✅Jordan McKinney - EVM Explained
      • Openzepplin - Deconstructing a Solidity Contract
      • Jeancvllr - EVM Assembly
      • Peter Robinson - Solidity to Bytecode, Memory & Storage
      • Marek Kirejczyk - Ethereum Under The Hood
      • ✅Official Solidity Docs
      • Dissecting EVM using go-ethereum Eth client implementation - deliriusz.eth
    • Vulnerabilities
      • Rounding Issues
        • Kyberswap
      • Bridges
      • Governance / Voting Escrows
      • Bizzare Bug Classes
        • TIME - ERC2771Context + Multicall calldata manipulation
    • Fancy Topics
      • Vulnerabilities SoK
        • ✅Demystifying Exploitable Bugs in Smart Contracts
        • Blockchain Hacking Techniques 2022 Top 10 - Todo
      • yAcademy
        • Proxies
          • yAcademy - Proxy Basics
          • yAcademy - Proxies Deep Dive
          • yAcademy - Security Guide to Proxy Vulns
        • defi-fork-bugs
      • Spearbit
        • ✅Community Workshop: Riley Holterhus
        • Economic Security with fmrmf
        • Numerical Analysis for DeFi Audits: A TWAMM Case Study by Kurt Barry
  • Red Teaming
    • ✅Enumeration
      • Service Enumeration
        • SMTP (Port 25)
        • Samba (Port 139, 445)
        • SNMP (Port 161,162,10161,10162)
        • rsync (Port 873)
        • NFS (Port 2049)
        • Apache JServ Protocol (Port 8081)
        • NetBIOS
      • Nmap
      • Gobuster / Feroxbuster / FUFF / Wfuzz
      • Drupal
    • ✅Exploitation
      • Public Exploits
      • PHP Webshells
      • Reverse Shell
      • TTY
      • File Transfer
      • Metasploit
      • Password Spray
    • ✅Buffer Overflow
      • Step 0: Spiking (Optional)
      • Step 1: Fuzzing
      • Step 2: Finding the Offset
      • Step 3: Overwriting the EIP
      • Step 4: Finding Bad Characters
      • Step 5: Finding the Right Module
      • Step 6: Generating Shellcode and Gaining Root
    • ✅Privilege Escalation
      • Linux Privilege Escalation
        • Linux Permissions
        • Manual Enumeration
        • Automated Tools
        • Kernel Exploits
        • Passwords and File Permissions
        • SSH Keys
        • Sudo
        • SUID
        • Capabilities
        • Cron Jobs
        • NFS Root Squashing
        • Docker
        • GNU C Library
        • Exim
        • Linux Privilege Escalation Course Capstone
      • Windows Privilege Escalation
        • Manual Enumeration
        • Automated Tools
        • Kernel Exploits
        • Passwords and Port Forwarding
        • WSL
        • Token Impersonation and Potato Attacks
        • Meterpreter getsystem
        • Runas
        • UAC Bypass
        • Registry
        • Executable Files
        • Startup Applications
        • DLL Hijacking
        • Service Permissions (Paths)
        • CVE-2019-1388
        • HiveNightmare
        • Bypass Space Filter
    • ✅Post Exploitation
      • Linux Post Exploitation
        • Add a User
        • SSH Key
      • Windows Post Exploitation
        • windows-resources
        • Add a User
        • RDP
    • ✅Pivoting
      • Windows: Chisel
      • Linux: sshuttle
    • Active Directory (AD)
      • Initial Compromise
        • HTA Phishing
        • VBA Macro Phishing
        • LLMNR Poisoning
        • SMB Relay
        • GPP / cPassword
      • Domain Enumeration
        • Manual Enumeration
        • PowerView
        • BloodHound
      • Lateral Movement
        • PsExec
        • WMI
        • Runas
        • Pass the Hash
        • Overpass the Hash
        • Pass the Ticket
      • Kerberos
        • Kerberoast
        • AS-REP Roast
      • MS SQL Server
    • Command & Control (C2)
      • Cobalt Strike
        • Bypassing Defences
          • Artifact Kit
          • Resource Kit
          • AMSI Bypass
          • PowerPick
        • Extending Cobalt Strike
          • Elevate Kit
          • Malleable C2 Profile
      • Metasploit
        • Payloads
        • Post Exploitation
        • Automation
      • C2 Development
    • Malware Development
      • "Hot Dropper"
      • PE Format
        • Overview
      • Process Injection
      • Reflective DLL
      • x86 <=> x64
      • Hooking
      • VeraCry
      • Offensive C#
      • AV Evasion
        • AV Evasion with C# and PowerShell
        • AMSI Bypass
  • Cryptography
    • Hash Functions
    • MAC
    • AES
      • Byte at a Time
      • CBC CCA
      • CBC Bit Flipping
      • CBC Padding Oracle
    • Diffie-Hellman
    • RSA
      • Prime Factors
      • Multiple Ciphertexts
      • Low Public Exponent
      • Low Private Exponent
    • ECC
    • Digital Signature
    • JWT
    • PRNG
    • SSL/TLS
    • Research
      • ✅Lattice-based Cryptography (Lattice)
      • Elliptic Curve Cryptography (ECC)
      • Oblivious Transfer (OT)
      • Secure Multi-party Computation (MPC)
      • Learning with Error (LWE)
      • Fully Homomorphic Encryption (FHE)
      • Zero Knowledge Proof (ZKP)
      • Oblivious RAM (ORAM)
  • Computer Science
    • Linux
      • Setup
      • curl
      • Hard Link vs. Symlink
      • Man Page
      • /dev/null
    • Python
      • New Features
      • Operators, Expressions, and Data Manipulation
      • Program Structure and Control Flow
      • Objects, Types, and Protocols
      • Functions 101
      • Generators
      • Classes and Object-Oriented Programming
      • Memory Management
      • Concurrency and Parallelism
        • Multithreading and Thread Safety
        • Asynchronization
        • Multiprocessing
        • Global Interpreter Lock (GIL)
      • Built-in Functions and Standard Library
        • import collections
        • import itertools
        • import sys
        • import re
        • import pickle
        • import json
      • Third-party Library
        • from pwn import *
        • import requests
        • from bs4 import BeautifulSoup
        • from scapy.all import *
        • py2exe
    • HTML, CSS, JavaScript, and React
      • HTML
      • CSS
      • JavaScript
        • var vs. let
        • Objects
        • Arrays
        • Functions
        • Modules
        • Asynchronous JavaScript
      • React
    • Data Structures and Algorithms
      • Binary Search
    • The Linux Programming Interface
      • Processes
        • Memory Allocation
        • The Process API
        • Process Creation
        • Process Termination
        • Monitoring Child Processes
        • Program Execution
      • Signals
      • Threads
        • Thread Synchronization
        • Thread Safety and Pre-Thread Storage
      • IPC
        • Pipes and FIFOs
        • Memory Mappings
        • Virtual Memory Operations
      • Sockets
    • Computer Systems
      • Hexadecimal
      • Signedness
      • Registers
      • Instructions
      • Syscall
      • Process Memory
      • Stack Frame
      • Preemptive Multitasking
      • IPC
      • Threads
    • Databases
      • MySQL
        • Basic Syntax
        • Data Types
        • Modifying Tables
        • Duplicating and Deleting
        • SELECT
        • Transaction
      • GraphQL
    • Distributed Systems
      • Introduction
        • What is a Distributed System?
        • Design Goals
        • Scaling Techniques
        • Types of Distributed Systems
      • Architecture
        • System Architectures
        • Example Architectures
      • Communication
        • Foundations
        • Remote Procedure Call
        • Message-oriented Communication
      • Coordination
        • Clock Synchronization
        • Logical Clock
      • Consistency and Replication
        • Introduction
        • Data-centric Consistency
        • Client-centric Consistency
    • Static Analysis
      • Intermediate Representation
      • Data Flow Analysis
      • Interprocedural Analysis
      • Pointer Analysis
      • Static Analysis for Security
      • Datalog-Based Program Analysis
      • Soundness and Soundiness
      • CFL-Reachability and IFDS
  • Web
    • ✅Prerequisites
      • OWASP Top 10
        • 1. Broken Access Control
        • 2. Cryptographic Failures
        • 3. Injection
        • 4. Insecure Design
        • 5. Security Misconfiguration
        • 6. Vulnerable and Outdated Components
        • 7. Identification and Authentication Failures
        • 8. Software and Data Integrity Failures
        • 9. Security Logging and Monitoring Failures
        • 10. SSRF
      • HTTP
        • HTTP Status Codes
        • HTTP Headers
      • Burp Suite
        • Burp Intruder
        • Burp Extender
        • Burp Collaborator
      • Information Gathering
        • DNS
        • Git
        • Editor
        • Server
      • Bug Bounty Report Writing
    • File Upload
      • Webshell
      • IIS, Nginx, and Apache Vulnerabilities
      • .htaccess (Apache) / web.config (IIS)
      • Alternate Data Stream
      • Code Review: bWAPP Unrestricted File Upload
    • SQL Injection (SQLi)
      • Cheat Sheet
      • UNION Attacks
      • Examining the Database
      • Blind SQL Injection
      • WAF Bypass
      • Out-Of-Band (OOB)
      • Webshell and UDF
      • sqlmap
        • Code Review: Initialization
        • Code Review: tamper
    • Cross-Site Scripting (XSS)
      • Cheat Sheet
      • Reflected XSS
      • Stored XSS
      • DOM-Based XSS
      • XSS Contexts
      • CSP
    • CSRF and SSRF
      • Client-Side Request Forgery (CSRF)
        • XSS vs. CSRF
        • CSRF Tokens and SameSite Cookies
      • Server-Side Request Forgery (SSRF)
        • Attacks
        • Bypassing Restrictions
        • SSRF + Redis
    • XML External Entities (XXE)
    • Insecure Deserialization
      • Python Deserialization
      • PHP Deserialization
      • Java Deserialization
        • Shiro
        • FastJSON
        • WebLogic
    • HTTP Request Smuggling
    • OS Command Injection
      • Whitespace Bypass
      • Blacklist Bypass
      • Blind OS Command Injection
      • Lab 1: HITCON 2015 BabyFirst
      • Lab 2: HITCON 2017 BabyFirst Revenge
      • Lab 3: HITCON 2017 BabyFirst Revenge v2
    • ✅Directory Traversal
    • HTTP Parameter Pollution
    • Server-Side Template Injection (SSTI)
    • LDAP Injection
    • Redis
      • Authentication
      • RCE
      • Mitigations
  • Pwn
    • Linux Exploitation
      • Protections
      • Shellcoding
        • Calling Convention
        • Null-free
        • Reverse Shell
        • ORW
      • ROP
        • Stack Alignment
        • ret2text
        • ret2syscall
        • ret2libc
        • ret2csu
        • BROP
        • SROP
        • Stack Pivot
      • ptmalloc
        • chunks
        • malloc() and free()
        • bins
        • tcache
      • UAF
      • Race Conditions
        • TOCTTOU
        • Dirty Cow
        • Meltdown
        • Spectre
      • Kernel
      • Appendix: Tools
        • socat
        • LibcSearcher-ng
        • OneGadget
    • Windows Exploitation
      • Classic
      • SEH
      • Egghunting
      • Unicode
      • Shellcoding
      • ROP
      • Appendix: Tools
        • ImmunityDbg
        • Mona.py
    • Fuzzing
      • AFL++
        • Quickstart
        • Instrumentation
        • ASAN
        • Code Coverage
        • Dictionary
        • Parallelization
        • Partial Instrumentation
        • QEMU Mode
        • afl-libprotobuf-mutator
      • WinAFL
      • Fuzzilli
  • Reverse
    • Bytecode
      • Python Bytecode
    • 👑Z3 solver
    • angr
      • angr Template
Powered by GitBook
On this page
  • Regular Expressions
  • Basic Patterns
  • Basic Examples
  • Repetition
  • Leftmost & Largest
  • Examples
  • Emails Example
  • Square Brackets
  • Group Extraction
  • findall
  • findall With Files
  • findall and Groups
  • RE Workflow and Debug
  • Options
  • Greedy vs. Non-Greedy (optional)
  • Substitution (optional)
  • Reference

Was this helpful?

  1. Computer Science
  2. Python
  3. Built-in Functions and Standard Library

import re

Regular Expressions

Regular Expressions

Regular expressions are a powerful language for matching text patterns. This page gives a basic introduction to regular expressions themselves sufficient for our Python exercises and shows how regular expressions work in Python. The Python re module provides regular expression support.

In Python a regular expression search is typically written as:

  match = re.search(pat, str)

The re.search() method takes a regular expression pattern and a string and searches for that pattern within the string. If the search is successful, search() returns a match object or None otherwise. Therefore, the search is usually immediately followed by an if-statement to test if the search succeeded, as shown in the following example which searches for the pattern 'word:' followed by a 3 letter word (details below):

str = 'an example word:cat!!'match = re.search(r'word:\w\w\w', str)# If-statement after search() tests if it succeededif match:  print('found', match.group()) ## 'found word:cat'else:  print('did not find')

The code match = re.search(pat, str) stores the search result in a variable named match. Then the if-statement tests the match -- if true the search succeeded and match.group() is the matching text (e.g. 'word:cat'). Otherwise if the match is false (None to be more specific), then the search did not succeed, and there is no matching text.

The 'r' at the start of the pattern string designates a python "raw" string which passes through backslashes without change which is very handy for regular expressions (Java needs this feature badly!). I recommend that you always write pattern strings with the 'r' just as a habit.

Basic Patterns

The power of regular expressions is that they can specify patterns, not just fixed characters. Here are the most basic patterns which match single chars:

  • a, X, 9, < -- ordinary characters just match themselves exactly. The meta-characters which do not match themselves because they have special meanings are: . ^ $ * + ? { [ ] \ | ( ) (details below)

  • . (a period) -- matches any single character except newline '\n'

  • \w -- (lowercase w) matches a "word" character: a letter or digit or underbar [a-zA-Z0-9_]. Note that although "word" is the mnemonic for this, it only matches a single word char, not a whole word. \W (upper case W) matches any non-word character.

  • \b -- boundary between word and non-word

  • \s -- (lowercase s) matches a single whitespace character -- space, newline, return, tab, form [ \n\r\t\f]. \S (upper case S) matches any non-whitespace character.

  • \t, \n, \r -- tab, newline, return

  • \d -- decimal digit [0-9] (some older regex utilities do not support \d, but they all support \w and \s)

  • ^ = start, $ = end -- match the start or end of the string

  • \ -- inhibit the "specialness" of a character. So, for example, use \. to match a period or \\ to match a slash. If you are unsure if a character has special meaning, such as '@', you can put a slash in front of it, \@, to make sure it is treated just as a character.

Basic Examples

Joke: what do you call a pig with three eyes? piiig!

The basic rules of regular expression search for a pattern within a string are:

  • The search proceeds through the string from start to end, stopping at the first match found

  • All of the pattern must be matched, but not all of the string

  • If match = re.search(pat, str) is successful, match is not None and in particular match.group() is the matching text

# Search for pattern 'iii' in string 'piiig'.
# All of the pattern must match, but it may appear anywhere.
# On success, match.group() is matched text.
match = re.search(r'iii', 'piiig') # found, match.group() == "iii"
match = re.search(r'igs', 'piiig') # not found, match == None

# . = any char but \n
match = re.search(r'..g', 'piiig') # found, match.group() == "iig"

# \d = digit char, \w = word char
match = re.search(r'\d\d\d', 'p123g') # found, match.group() == "123"
match = re.search(r'\w\w\w', '@@abcd!!') # found, match.group() == "abc"

Repetition

Things get more interesting when you use + and * to specify repetition in the pattern

  • + -- 1 or more occurrences of the pattern to its left, e.g. 'i+' = one or more i's

  • * -- 0 or more occurrences of the pattern to its left

  • ? -- match 0 or 1 occurrences of the pattern to its left

Leftmost & Largest

First the search finds the leftmost match for the pattern, and second it tries to use up as much of the string as possible -- i.e. + and * go as far as possible (the + and * are said to be "greedy").

Examples

  # i+ = one or more i's, as many as possible.
  match = re.search(r'pi+', 'piiig') # found, match.group() == "piii"

  # Finds the first/leftmost solution, and within it drives the +
  # as far as possible (aka 'leftmost and largest').
  # In this example, note that it does not get to the second set of i's.
  match = re.search(r'i+', 'piigiiii') # found, match.group() == "ii"

  # \s* = zero or more whitespace chars
  # Here look for 3 digits, possibly separated by whitespace.
  match = re.search(r'\d\s*\d\s*\d', 'xx1 2   3xx') # found, match.group() == "1 2   3"
  match = re.search(r'\d\s*\d\s*\d', 'xx12  3xx') # found, match.group() == "12  3"
  match = re.search(r'\d\s*\d\s*\d', 'xx123xx') # found, match.group() == "123"

  # ^ = matches the start of string, so this fails:
  match = re.search(r'^b\w+', 'foobar') # not found, match == None
  # but without the ^ it succeeds:
  match = re.search(r'b\w+', 'foobar') # found, match.group() == "bar"

Emails Example

Suppose you want to find the email address inside the string 'xyz alice-b@google.com purple monkey'. We'll use this as a running example to demonstrate more regular expression features. Here's an attempt using the pattern r'\w+@\w+':

str = 'purple alice-b@google.com monkey dishwasher'
match = re.search(r'\w+@\w+', str)
if match:
    print(match.group())  # 'b@google'

The search does not get the whole email address in this case because the \w does not match the - or . in the address. We'll fix this using the regular expression features below.

Square Brackets

Square brackets can be used to indicate a set of chars, so [abc] matches a or b or c. The codes \w, \s etc. work inside square brackets too with the one exception that dot (.) just means a literal dot. For the emails problem, the square brackets are an easy way to add '.' and '-' to the set of chars which can appear around the @ with the pattern r'[\w.-]+@[\w.-]+' to get the whole email address:

match = re.search(r'[\w.-]+@[\w.-]+', str)
if match:
    print(match.group())  ## 'alice-b@google.com'

(More square-bracket features) You can also use a dash to indicate a range, so [a-z] matches all lowercase letters. To use a dash without indicating a range, put the dash last, e.g. [abc-]. An up-hat (^) at the start of a square-bracket set inverts it, so [^ab] means any char except a or b.

Group Extraction

The "group" feature of a regular expression allows you to pick out parts of the matching text. Suppose for the emails problem that we want to extract the username and host separately. To do this, add parenthesis ( ) around the username and host in the pattern, like this: r'([\w.-]+)@([\w.-]+)'. In this case, the parenthesis do not change what the pattern will match, instead they establish logical "groups" inside of the match text. On a successful search, match.group(1) is the match text corresponding to the 1st left parenthesis, and match.group(2) is the text corresponding to the 2nd left parenthesis. The plain match.group() is still the whole match text as usual.

str = 'purple alice-b@google.com monkey dishwasher'
match = re.search(r'([\w.-]+)@([\w.-]+)', str)
if match:
    print(match.group())   # 'alice-b@google.com' (the whole match)
    print(match.group(1))  # 'alice-b' (the username, group 1)
    print(match.group(2))  # 'google.com' (the host, group 2)

A common workflow with regular expressions is that you write a pattern for the thing you are looking for, adding parenthesis groups to extract the parts you want.

findall

findall() is probably the single most powerful function in the re module. Above we used re.search() to find the first match for a pattern.findall() finds all the matches and returns them as a list of strings, with each string representing one match.

# Suppose we have a text with many email addresses
str = 'purple alice@google.com, blah monkey bob@abc.com blah dishwasher'

# Here re.findall() returns a list of all the found email strings
emails = re.findall(r'[\w\.-]+@[\w\.-]+', str) # ['alice@google.com', 'bob@abc.com']
for email in emails:
    # do something with each found email string
    print(email)

findall With Files

For files, you may be in the habit of writing a loop to iterate over the lines of the file, and you could then call findall() on each line. Instead, let findall() do the iteration for you -- much better! Just feed the whole file text into findall() and let it return a list of all the matches in a single step (recall that f.read() returns the whole text of a file in a single string):

  # Open file  f = open('test.txt', 'r')  # Feed the file text into findall(); it returns a list of all the found strings  strings = re.findall(r'some pattern', f.read())

findall and Groups

The parenthesis ( ) group mechanism can be combined with findall(). If the pattern includes 2 or more parenthesis groups, then instead of returning a list of strings, findall() returns a list of *tuples*. Each tuple represents one match of the pattern, and inside the tuple is the group(1), group(2) .. data. So if 2 parenthesis groups are added to the email pattern, then findall() returns a list of tuples, each length 2 containing the username and host, e.g. ('alice', 'google.com').

  str = 'purple alice@google.com, blah monkey bob@abc.com blah dishwasher'  tuples = re.findall(r'([\w\.-]+)@([\w\.-]+)', str)  print(tuples)  ## [('alice', 'google.com'), ('bob', 'abc.com')]  for tuple in tuples:    print(tuple[0])  ## username    print(tuple[1])  ## host

Once you have the list of tuples, you can loop over it to do some computation for each tuple. If the pattern includes no parenthesis, then findall() returns a list of found strings as in earlier examples. If the pattern includes a single set of parenthesis, then findall() returns a list of strings corresponding to that single group. (Obscure optional feature: Sometimes you have paren ( ) groupings in the pattern, but which you do not want to extract. In that case, write the parens with a ?: at the start, e.g. (?: ) and that left paren will not count as a group result.)

RE Workflow and Debug

Regular expression patterns pack a lot of meaning into just a few characters , but they are so dense, you can spend a lot of time debugging your patterns. Set up your runtime so you can run a pattern and print what it matches easily, for example by running it on a small test text and printing the result of findall(). If the pattern matches nothing, try weakening the pattern, removing parts of it so you get too many matches. When it's matching nothing, you can't make any progress since there's nothing concrete to look at. Once it's matching too much, then you can work on tightening it up incrementally to hit just what you want.

Options

The re functions take options to modify the behavior of the pattern match. The option flag is added as an extra argument to the search() or findall() etc., e.g. re.search(pat, str, re.IGNORECASE).

  • IGNORECASE -- ignore upper/lowercase differences for matching, so 'a' matches both 'a' and 'A'.

  • DOTALL -- allow dot (.) to match newline -- normally it matches anything but newline. This can trip you up -- you think .* matches everything, but by default it does not go past the end of a line. Note that \s (whitespace) includes newlines, so if you want to match a run of whitespace that may include a newline, you can just use \s*

  • MULTILINE -- Within a string made of many lines, allow ^ and $ to match the start and end of each line. Normally ^/$ would just match the start and end of the whole string.

Greedy vs. Non-Greedy (optional)

This is optional section which shows a more advanced regular expression technique not needed for the exercises.

Suppose you have text with tags in it: <b>foo</b> and <i>so on</i>

Suppose you are trying to match each tag with the pattern (<.*>) -- what does it match first?

The result is a little surprising, but the greedy aspect of the .* causes it to match the whole '<b>foo</b> and <i>so on</i>' as one big match. The problem is that the .* goes as far as is it can, instead of stopping at the first > (aka it is "greedy").

There is an extension to regular expression where you add a ? at the end, such as .*? or .+?, changing them to be non-greedy. Now they stop as soon as they can. So the pattern '(<.*?>)' will get just '<b>' as the first match, and '</b>' as the second match, and so on getting each <..> pair in turn. The style is typically that you use a .*?, and then immediately its right look for some concrete marker (> in this case) that forces the end of the .*? run.

The *? extension originated in Perl, and regular expressions that include Perl's extensions are known as Perl Compatible Regular Expressions -- pcre. Python includes pcre support. Many command line utils etc. have a flag where they accept pcre patterns.

An older but widely used technique to code this idea of "all of these chars except stopping at X" uses the square-bracket style. For the above you could write the pattern, but instead of .* to get all the chars, use [^>]* which skips over all characters which are not > (the leading ^ "inverts" the square bracket set, so it matches any char not in the brackets).

Substitution (optional)

The re.sub(pat, replacement, str) function searches for all the instances of pattern in the given string, and replaces them. The replacement string can include '\1', '\2' which refer to the text from group(1), group(2), and so on from the original matching text.

Here's an example which searches for all the email addresses, and changes them to keep the user (\1) but have yo-yo-dyne.com as the host.

str = 'purple alice@google.com, blah monkey bob@abc.com blah dishwasher'
# re.sub(pat, replacement, str) -- returns new string with all replacements,
# \1 is group(1), \2 group(2) in the replacement
print(re.sub(r'([\w\.-]+)@([\w\.-]+)', r'\1@yo-yo-dyne.com', str))
# purple alice@yo-yo-dyne.com, blah monkey bob@yo-yo-dyne.com blah dishwasher

Reference

Previousimport sysNextimport pickle

Last updated 3 years ago

Was this helpful?

Python Regular Expressions  |  Python Education  |  Google DevelopersGoogle Developers
Python Regular Expressions - Google Developers
Logo