[Django]-Search for item in list with 2MILLION items – Python


Put both collections into frozensets.

little performance test:

import random
from timeit import Timer

def random_strings(size):
    alpha = 'abcdefghijklmnopqrstuvwxyz'
    min = 3
    max = 8
    strings = []
    for count in xrange(1, size):
        current = ''
        for x in random.sample(alpha, random.randint(min,max)):
            current += x  
    return strings

string_list_1 = random_strings(10000)
string_list_2 = random_strings(10000)

def string_test():
    common = filter(lambda x: x in string_list_2, string_list_1)
    return common

def set_test():
    string_set_1 = frozenset(string_list_1)
    string_set_2 = frozenset(string_list_2)
    common = string_set_1 & string_set_2
    return common

string_timer = Timer("__main__.string_test()", "import __main__")
set_timer = Timer("__main__.set_test()", "import __main__")
print string_timer.timeit(10)
# 22.6108954005
print set_timer.timeit(10)
#  0.0226439453

As you can see, set is exponentially faster. Should perform better than dictionary, too.

Of important note is that I included the time necessary to make the sets. This overhead will affect your performance, too, but with the exception of having one set that is much much smaller than the other, you’ll net a large gain.



For a search like this I would go with a binary search. One of the fasted methods for long SORTED lists. If it isn’t sorted, then don’t use binary search.


You have two million strings that you need to match against one million other strings‽

A couple of things to try:

  1. Use a set instead of a list for those 2 million items.
  2. If that doesn’t speed it up, try having the strings as keys in a dictionary.
  3. If that also doesn’t help, get a nice binary tree implementation and use that.


As mentioned in the comments, sets and dicts don’t use binary trees, they use hash tables. This should be faster than a list, and in fact probably even faster than a binary search.


off the top of my head, with so little info on why you are doing this a few million times:

1.) can you import the csv into a table and do the checks in sql?

2.) how about sorting and indexing the list for quick access?


Leave a comment