More efficient methods for finding subsets of Subsets()

Eric Snyder
53 ●1 ●3 ●9

Calling SS = Subsets([1..100000], 4) takes 50 ms to generate a list with about $4 \times 10^{18}$ elements. %%prun suggests the majority of that time is used by misc.py:556(_stable_uniq)

(Note: trying this with $> \sim 10^5$ elements causes an overflow.)

But getting a subset of that list... hoo boy.

NList = [i for i in Subsets([1..100], 4) if sum(i) == 100] takes 45 s to generate a list with 5952 elements, out of $3.9 \times 10^6$ possibilities. About half of that time is used by set.py:465(__init__)

Note that the size of the main set has been decreased by a factor of 1000. That's because this list comprehension seems to run at about $O(n^4)$ , so going from 100 to 1000 multiplies the time by around $10^4$ ; that increases the time from ~45 s to 5... days. (Creating the list, then filtering it, doesn't seem to change the time much.)

So my question is, is there any faster way to filter through such a long list that I don't know of? I'd like to be working in the [1..2000] range, but I don't think my kernel will hold out for 80 days to make that happen. (Probably not even 5 days).

Any suggestions welcome!

Comments

Note that :

sage: binomial(10000,4)
416416712497500

i. e.

sage: binomial(10000,4).n()
4.16416712497500e14

about 10000 times less that your result. R agrees :

sage: r.choose(10000,4).sage()
416416712497500

So does combinatorics :

sage: Subsets(10000,4).cardinality()
416416712497500

Something may be amiss in your computation...

Emmanuel Charpentier ( 2 years ago )

Can you formulate more clearly what problem you are solving?

Max Alekseyev ( 2 years ago )

@Emmanuel Charpentier: Your calculations are correct, but the initial calculation up there uses $100000$ , not $10000$ .

@Max Alekseyev: The context is attempting to make this algorithm work. The initial step requires finding all subsets of [1..n] with a given sum, then finding sets of three distinct subsets whose squares have the same sum (which takes a fraction of the time of the first step).

Eric Snyder ( 2 years ago )

When using your algorithm, you postulate a=100. Do you know (or postulate) b ?

You do not need to build your sets : building iterators in them should suffice to exhibit a solution. Finding all of them is another question... I doubt that it can be done realistically by brute force...

Emmanuel Charpentier ( 2 years ago )

add a comment

2 Answers

Sort by » oldest newest most voted

answered 2 years ago

Max Alekseyev

7126 ●10 ●60 ●175

updated 2 years ago

If you need partitions of an integerk into 4 distinct parts, you can rely on Sage Partitions() generator:

for a,b,c,d in Partitions(k, length=4, max_slope=-1):
    ...

If you want additionally to bound the parts, then add max_part=... parameter. See this documentation for all available options.

link

Comments

I hadn't even thought about just doing it as a partition. Perfect, thanks!

Eric Snyder ( 2 years ago )

add a comment

answered 2 years ago

David Coudert
851 ●13 ●21 http://www-sop.inria.f...

Here is an iterator over 4-tuples (a, b, c, d) such that a + b + c + d = k and a < b < c < d. You can also ensure that d <= nmax, which seems a requirement of your problem.

def get_series(k, nmax=None):
    """
    Iterator over all series (a, b, c, d) such that:
    - a + b + c + d = k
    - a < b < c < d

    When nmax is specified, we furthermore ensure that a,b,c,d <= nmax.
    """
    if nmax is None:
        for a in range(1, (k - 6) // 4 + 1):
            for b in range(a + 1, (k - a - 3) // 3 + 1):
                for c in range(b + 1, (k - a - b - 1) // 2 + 1):
                    yield (a, b, c, k - a - b - c)
    else:
        for a in range(1, min(nmax, (k - 6) // 4) + 1):
            for b in range(a + 1, min(nmax, (k - a - 3) // 3) + 1):
                for c in range(b + 1, min(nmax, (k - a - b - 1) // 2) + 1):
                    d = k - a - b - c
                    if d <= nmax:
                        yield (a, b, c, d)

you can check that

This is way faster than your brute force attempt

sage: %time NList = [i for i in Subsets([1..100], 4) if sum(i) == 100]
CPU times: user 31 s, sys: 83.3 ms, total: 31.1 s
Wall time: 31.1 s
sage: %time L = list(get_series(100))
CPU times: user 7.23 ms, sys: 136 µs, total: 7.36 ms
Wall time: 7.35 ms
sage: len(L) == len(NList)
True
sage: L == [tuple(sorted(i)) for i in NList]
True

This is of course not sufficient for solving efficiently your problem since you have plenty of 4-tuples to consider and then triples of 4-tuples...

sage: %time L = list(get_series(1000))
CPU times: user 2.82 s, sys: 99.1 ms, total: 2.92 s
Wall time: 2.92 s
sage: len(L)
6840777

Another useful idea is to store the tuples in a dictionary with key a^2 + b^2 + c^2 + d^2 and value the list of corresponding tuples (so with same sum and same sum of squares).

link

Comments

This is reinventing of Partitions() functionality. See my answer below.

Max Alekseyev ( 2 years ago )

Right, thanks.

David Coudert ( 2 years ago )

add a comment

Your Answer

Please start posting anonymously - your entry will be published after you log in or create a new account.

Add Answer

More efficient methods for finding subsets of Subsets()

Comments

2 Answers

Comments

Comments

Your Answer

Question Tools

Stats

Related questions

More efficient methods for finding subsets of Subsets() savecancel

Comments

2 Answers

Comments

Comments

Your Answer

Question Tools

Stats

Related questions

More efficient methods for finding subsets of Subsets()