[Answered ]-Bulk replace with regular expressions in Python

1👍

Sorry if this is obvious and wrong (that no-one has suggested it in 4 hours is worrying!), but why not search for all matches, do a batch query for everything (easy once you have all matches), and then call sub with the dictionary of results (so the function pulls the data from the dict)?

You have to run the regexp twice, but it seems like the database access is the expensive part anyway.

1👍

You can do it with a single regexp pass, by using finditer which returns match objects.

The match object have:

  • a method returning a dict of the named groups, groupdict()
  • the start and the end positions of the match in the original text, span()
  • the original matching text, group()

So I would suggest that you:

  • Make a list of all the matches in your text using finditer
  • Make a list of all the unique volume, reporter, page triplets in the matches
  • Lookup those triplets
  • Correlate each match object with the result of the triplet lookup if found
  • Process the original text, splitting by the match spans and interpolating lookup results.

I’ve implemented the database lookup by combining a list of Q(volume=foo1,reporter=bar2,page=baz3)|Q(volume=foo1,reporter=bar2,page=baz3).... There maybe be more efficient approaches.

Here’s an untested implementation:

from django.db.models import Q
from collections import namedtuple

Triplet = namedtuple('Triplet',['volume','reporter','page'])

def lookup_references(matches):
  match_to_triplet = {}
  triplet_to_url = {}
  for m in matches:
    group_dict = m.groupdict()
    if any(not(x) for x in group_dict.values()): # Filter out matches we don't want to lookup
      continue
    match_to_triplet[m] = Triplet(**group_dict)
  # Build query
  unique_triplets = set(match_to_triplet.values())
  # List of Q objects
  q_list = [Q(**trip._asdict()) for trip in unique_triplets]
  # Consolidated Q
  single_q = reduce(Q.__or__,q_list)
  for row in Citations.objects.filter(single_q).values('volume','reporter','page','url'):
    url = row.pop('url')
    triplet_to_url[Triplet(**row)] = url
  # Now pair original match objects with URL where found
  lookups = {}
  for match, triplet in match_to_triplet.items():
    if triplet in triplet_to_url:
      lookups[match] = triplet_to_url[triplet]
  return lookups

def interpolate_citation_matches(text,matches,lookups):
  result = []
  prev = m_start = 0
  last = m_end = len(text)
  for m in matches:
    m_start, m_end = m.span()
    if prev != m_start:
      result.append(text[prev:m_start])
    # Now check match
    if m in lookups:
      result.append('<a href="%s">%s</a>' % (lookups[m],m.group()))
    else:
      result.append(m.group())
  if m_end != last:
    result.append(text[m_end:last])
  return ''.join(result)

def process_citations(text):
  citation_regex = r'(?P<volume>[0-9]+[a-zA-Z]{0,3})\s+(?P<reporter>[A-Z][a-zA-Z0-9\.\s]{1,49}?)\s+(?P<page>[0-9]+[a-zA-Z]{0,3}))'
  matches = list(re.finditer(citation_regex,text))
  lookups = lookup_references(matches)
  new_text = interpolate_citation_matches(text,matches,lookups)
  return new_text
👤MattH

Leave a comment