As the title states, I'm interested in computing the $\bmod q$ kernel of a matrix for $q$ composite. This means that the matrix isn't a linear transformation between vector spaces, so linear algebraic techniques don't apply.

From this answer, I have the following code (which is **correct**). The issue I currently have is that it is *slow*. The source of the inefficiency is fortunately obvious, so this question is mostly about suggestions to avoid it.

```
def modq_kernel(matx, q):
# Returns the mod_q (right) kernel of a given matrix, in the case
# that q is not prime
# Uses https://ask.sagemath.org/question/33890/how-to-find-kernel-of-a-matrix-in-mathbbzn/
# Iterates over whole kernel to find a matrix that generates it, likely slow
source_dim = matx.ncols()
target_dim = matx.nrows()
domain = ZZ^source_dim / (q * ZZ^source_dim)
codomain = ZZ^target_dim / (q * ZZ^target_dim)
phi = domain.hom([codomain(matx.columns()[i]) for i in range(source_dim)])
full_kernel = matrix([domain(b) for b in phi.kernel()]).T
pivot_cols = full_kernel.echelon_form().pivots()
reduced_kernel = matrix(full_kernel.columns()[i] for i in pivot_cols).T
return reduced_kernel
```

This code rewrites the matrix as a module homomorphism $\mathbb{Z}_q^n\to\mathbb{Z}_q^m$, enumerates all elements in the kernel of this homomorphism, and then uses an echelon form computation to identify generators of the kernel. As the size of the kernel is exponential in its dimension, this is (understandably) quite slow. It is also correct, as the following test case demonstrates:

```
p = 2
k = 4
q = p**k - 1
zero_vec = matrix((k-1) * [0]).T
matx = block_matrix([[identity_matrix(k-1), zero_vec]]) - p * block_matrix([[zero_vec, identity_matrix(k-1)]])
```

The origin of this matrix is unimportant. What matters is that I know that its mod-q kernel is:

```
[8 1 2 4]
[4 8 1 2]
[2 4 8 1]
[1 2 4 8]
```

which the aforementioned code correctly identifies.
One can verify that `matx * ker % q`

is an all-zero matrix (where `ker`

is the above matrix), and everything is good.

A natural way to try to improve the efficiency of the above code is to iterate not through `phi.kernel()`

, but through `phi.kernel().gens()`

instead. This should additionally let you remove the echelon form computations (while keeping the initial matrix small).
If one makes this change (and returns `full_kernel`

instead), the code now computes the kernel as:

```
[8]
[4]
[2]
[1]
```

This is *contained* in the kernel, but does not actually span the whole kernel (it's missing the other 3 vectors in the first kernel found). So while the code is much faster now, it is incorrect.

I'm wondering if anyone knows how to efficiently compute the entire $\bmod q$ kernel of a map in $\mathbb{Z}_q^{n\times m}$.