I’ve been working on the Collatz conjecture for a while –
the approach below is an attempt to bring various strands together towards the vague
outline of a proof. At the worst I think it works as a statistical approach to
showing why the conjecture probably holds. But I wonder if the loose ends can
be fixed to turn it into something closer to a proper proof? (I'll highlight the problem areas as I go along.)

I’ll start by describing a function, then explain its
purpose.

1.
Any odd number p can be expressed as 2^n – q
where 2^n is the first power of 2 higher than p.

2.
Express q as a series of powers of 2 in
ascending order. For instance, where p = 475 (and 2^n = 512), q = 37 = 1 + 4 + 32

3.
Multiply each term in this series by a power of
3 with the power descending by one for each term (ending with 3^0).

4.
So in this case 1*9 + 4*3 + 32*1 = 53. We’ll
call this number q(3).

5.
Subtract q(3) from 2^n and divide by the next
highest power of 3. So in this case (512 – 53)/27 = 17. We will call this
result r.

6.
Here are two more simple examples of this
function.

15 = 16 – 1

q(3) = 1

(16 – 1)/3 =

**5****So r = 5**

1897 = (2048 – 1 – 2 – 4 – 16 – 128)

q(3) = 81 + 2*27 + 4*9 + 3*16 + 128 = 347

(2048 – 347)/243 =

**7**
The examples above all result in positive integer solutions.
However, most instances of this function result in fractional or negative
solutions.

For instance:

49 = 64 – 1 – 2 – 4 – 8

q(3) = 27 + 9*2 + 4*3 + 8 = 65

(64 – 65)/81 = -1/81

27 = 32 – 1 – 4

q(3) = 3+4 = 7

(32 – 7)/9 = 25/9

- Where this function produces a positive integer, it can be used to identify a Collatz chain from that integer to 1.

For instance take the first example above

475 = 512 – 1 – 4 – 32

q(3) = 9 + 3*4 + 32 = 53

(512 – 53)/27 = 17

What this tells us is that from 17 there are three tripling
steps in the Collatz chain to 1. The pattern of powers of 2 shows us the number
of halving steps at each stage – in this case 2 halving steps, then 3.

This is more easily understood by leaving out the halving
steps:

Instead of

17 -> 13 –> 5 –> 1

17 -> 52 -> 160 -> 512

(The rule at each step is to multiply by 3 then add the
highest power of 2 that is a factor of the current number. So when we multiply
52 x 3 = 156 we add 4 since 156 = 39 * 4). So:

17 * 3 = 51

51+1 = 52 (Ignore the 2 halving steps to 13)

52*3 = 156

156 + 4 = 160 (ignore the 2 + 3 halving steps to 5)

160*3 = 480

480+32 = 512 (Ignore the 2 + 3 + 4 halving steps to 1)

Note that we add 1, then 4 then 32 (the elements of the q
series above).

(I can give more
detail on why this works, but hopefully it is obvious).

8.
There are regular patterns of integer solutions for any given value of q.

For instance for q = 1

**p = 3, r = 1**

p = 7, r = 7/3

**p = 15, r = 5**

p = 31, r = 31/3

**p = 63, r = 21**

For q = 1, 1 in 2 values of p produces a positive integer
value for r (These are the ‘level 1’ Collatz numbers, meaning the numbers that
take one step to reach 1.)

For q = 3

q(3) = 3 + 2 = 5

(We’ll start at 16-3 for reasons I will explain in
section 9)

p = 13, r = 11/9

**p = 29, r = 27/9 = 3**

p = 61, r = 59/9

p = 125, r = 123/9

p = 253, r = 251/9

p = 509, r = 507/9

p = 1021, r = 1019/9

**p = 2045, r = 2043/9 = 227**

For numbers with two terms in q, 1 in 6 values of p
produce a positive integer value of r, and the result is always a level 2
Collatz number (2 steps to reach 1)

Similarly for q = 9, q(3) = 3 + 8 = 11, we find positive
integer values at

**p = 119, r = 13**

**p = 8183, r = 909**

As the number of terms in q increases the frequency is
multiplied by 1/3 each time, while the Collatz level climbs by 1, eg:

For numbers with three terms in q, 1 in 18 values of p
produce a positive integer value of r, and the result is always a level 3
Collatz number (3 steps to reach 1)

For numbers with four terms in q, 1 in 54 values of p
produce a positive integer value of r, and the result is always a level 4
Collatz number (4 steps to reach 1)

And so on.

(These frequencies are easy to explain with a bit of modular
arithmetic – add two consecutive powers of 2 to any integer and you reach the
same value mod 3, add six and you reach the same value mod 9 etc.)

Nb. These rules are only reliable when used with the proviso given in
section 9:

9.
The next step is to explore the pattern of
positive integer results for all odd values of 2^n - q. However we immediately
need to exclude some of these: Where the largest of the powers of 2 in q is 2^(n-2) (in other
words a quarter of 2^n) the path identified is a repetition of a chain that can
be expressed more simply. For instance:

2^n = 256, q = 81 = 1 + 16 + 64

q(3) = 9 + (3*16) + 64 = 121

(256 – 121)/27 = 5

The path described here is 5 -> 16 -> 64 -> 256

Which is equivalent to the Collatz chain 5 -> 1 -> 1 -> 1

Which can be expressed without the repetitions as

(16 – 1)/3 = 5

5 -> 1

We want to rule out these kinds of repetitions, so that each
time we find a positive integer solution, it is the only solution for that
particular positive integer.

**So we are only going to look at values of 2^n – q > 3(2^n)/4 (meaning that the largest term of q expressed as a series of powers of 2 is 2^n/8 or smaller.)**

So the permissible values of 2^n - q will be:

13, 15

25, 27, 29, 31

49, 51, 53, 55, 57, 59,61, 63

etc.

- Next we want to sort these values by the number of steps in the Collatz chain identified.

To clarify this, the values of r for the numbers above are

7/3

11/9, 5

13/27, 25/9, 3, 31/3

Etc

Note that even the fractional results identify a Collatz
chain of sorts (if we amend the rule so that we add one and halve until we
reach an odd numerator).

If we look at 13/27,
this is the ‘Collatz chain’:

13/27 * 3 = 13/9

13/9 + 1 = 22/9

22/9 * ½ = 11/9

11/9 * 3 = 11/3

11/3 + 1 = 14/3

14/3 * ½ = 7/3

7/3 * 3 = 7

7 + 1 = 8

Which halves down to 1.

So 13/27 has a level 3 Collatz chain.

- Thus every value of r can be said to have a Collatz chain, although we are only really interested in the positive integer values of r. The next thing to note is that we can use Pascal’s triangle to describe the pattern of values of q:

Where 2^n – q is:

7

13, 15

25, 27, 29, 31

49, 51, 53, 55, 57, 59,61, 63

q is:

1

3,1

7, 5, 3, 1

15, 13, 11, 9, 7, 5, 3, 1

Etc

And when we express these as a sum of powers of 2, the
number of terms (which is the same as the number of steps in the resulting
Collatz chain) is:

1

2, 1

3, 2, 2, 1

4, 3, 3, 2, 3, 2, 2, 1

And the number of steps needed at each level is

1step 2steps 3steps 4steps

1

1 1

1 2 1

1 3 3 1

(For clarity - this means that for 2^n = 8 there is 1 level 1 colution. For 2^n = 16 there is 1 level 1 solution and 1 level 2 solution. For 2^n = 32 there is 1 level 1, 2 level 2, and 1 level 3 solution. For 2^n = 64 there is 1 level 1, 3 level 2, 3 level 3 and 1 level 4 solution etc).

(For clarity - this means that for 2^n = 8 there is 1 level 1 colution. For 2^n = 16 there is 1 level 1 solution and 1 level 2 solution. For 2^n = 32 there is 1 level 1, 2 level 2, and 1 level 3 solution. For 2^n = 64 there is 1 level 1, 3 level 2, 3 level 3 and 1 level 4 solution etc).

This continues to build into Pascal’s triangle (because at
each stage we add the same pattern ‘raised up by one level’).

- Next we are going to use values of 2^n – q to come up with an estimate of how many integer solutions there will be under a bound. The bound is slightly messily defined, but bear with me as I think we can maybe solve the problems this causes.

First, let’s look at the region between 0 and 4.

Any level 1 integer solutions in this region must be
generated by 1-step numbers for 2^n < 16 (because 8/3 < 4, but 16/3 >
4 – this is a rough rule of thumb which we will return to later).

If we look at Pascal’s triangle, we can regard the first row
as representing integer solutions reached from 2^n – q where 2^n = 8, the
second row represents integer solutions where 2^n = 16, the third row where 2^n
=32 and so on.

For level 1 solutions in the region below 4, we look at row
1 and multiply by ½ (because 1 in 2 level one solutions are integers).

For level 2 solutions, we look at row 3 (because 32/9 <
4) and we total the number of 2 step solutions (row 1 = 0, row 2 = 1, row 3 =
2, total 3). Then we multiply this by 1/6 (because 1 in 6 level 2 solutions are
integers).

We now have ½ + 3/6 = 1 meaning that we can

**very**roughly estimate that there is at least one integer solution in this region and that it is level 1 or level 2. This of course isn’t definite as the numbers could conspire against us so that we haven’t yet reached either a level 1 or a level 2 solution. As it happens the estimate is spot on, because we have one level 2 solution (3). (Bear in mind that the method we are using will never find 1 as a solution, because of the exclusion explained in section 9.).
So, accepting that we have a slightly fuzzy estimation
method, let’s move on to integers under 8.

- We use the same method but need to go through more levels to find an estimate that is sufficiently high to account for the number of integer solutions we are looking for.

Level 1 numbers (up to row 2 < 16/3) = 2/2

Level 2 numbers ( up to row 4 < 64/9) = 6/6

Level 3 numbers (up to row 5 < 128/27) = 10/18

Level 4 numbers (up to row 7 < 512/81) = 35/54

Level 5 numbers (up to row 8 < 1024/243) = 70/162

I’m going to make an ad hoc adjustment here. 2048/243 is
bigger than 8. However for a level 5 number, we will be subtracting at least
81+54+36+24+16= 211 from 2048. And (2048-211)/243 < 8. So we can adjust to:

Level 5 numbers (up to row 9 = 2048/243) = 126/162

(Obviously this ad hoc adjustment reveals the level of
fuzziness in our initial method, but bear with me for now).

So our first rough estimate of positive integers under 8 is

2/2 + 6/6 + 10/18 + 35/54 + 126/162 = 3.98

Thus we would expect there to be at least 3 level 1-5
integer solutions < 8. This is again correct – we have 3 (level 2), 5 (level
1) and 7 (level 5).(Note that we didn't actually have to go as far as level 5 to get an estimate > 3 - I'll return to this issue in point 15 below).

14. Before moving on to look at
problems with the approximation, let’s look at a result for integers < 16

Level 1 numbers (up to row 3 < 32/3) =
3/2

Level 2 numbers ( up to
row 5 < 128/9) = 10/6

Level 3 numbers (up to
row 6 < 256/27) = 20/18

Level 4 numbers (up to
row 8 < 1024/81) = 70/54

Level 5 numbers (up to
row 10 < 4096/243) = 252/162

Level 6 numbers (up to
row 11 < 8192/729) = 462/486

This totals to 8.08. Actually there are 7 integer solutions
in this region: 3, 5, 7, then 9 (level 6), 11, (level 4), 13 (level 2) and 15
(level 5).

Then for integers < 32

Level 1 numbers (up to row 4 < 64/3) = 4/2

Level 2 numbers ( up to row 6 < 256/9) = 15/6

Level 3 numbers (up to row 7 < 512/27) = 35/18

Level 4 numbers (up to row 9 < 2048/81) = 126/54

Level 5 numbers (up to row 11 < 8192/243) = 462/162

Level 6 numbers (up to row 12 < 16384/729) = 924/486

Level 7 numbers (up to row 14 < 65536/2187) = 3432/1458

Total 15.88

What we actually find in this area is 13 Level 1-7 numbers plus 27 and 31 which are Level 41 and 39 respectively (as below).

Level 1 – 5, 21

Level 2 - 3, 13

Level 3 – 17

Level 4 11, 23

Level 5 - 7, 15, 29

Level 6 – 9, 19

Level 7 – 25

Level 39 – 31

Level 41 – 27

The positive thing to point out here is that the estimates for each level are pretty good. We have slight overestimates for Levels 2, 3, 4, and 7 and slight underestimates for levels 5 and 6, but all are in the right ballpark. I'll talk about the issues raised by 27 and 31 below.

15. It has to be admitted at this point that there are some obvious problems with this method. So let's take these in turn (and I'll admit some of my responses from here onwards are a bit hand-wavey).

**Firstly, the choice of how many levels to include**

**in our estimate**is clearly arbitrary.
I'll defend this by pointing out that we are looking for the point at which this estimation method suggests that there are

**at least**a certain number of positive integer solutions. If there are at least [(2^n)/2] - 1 positive integer solutions lower than 2^n, then the Collatz conjecture is true in this region because this method can't identify duplicates. We can use a version of the pigeonhole principle to explain this - if we have**at least**15 odd integer solutions lower than 32 (or 511 less than 1024) then**all**odd numbers in this region have a Collatz chain to 1 as we can't have two solutions for the same integer.
16.

**Second problem - the upper bound on the estimate is ragged**. While we are looking for solutions < 32, 64/3 is only 21.333, 256/9 is only 28.444 and so on. So our method for finding solutions under 32 is actually looking for solutions under a ragged boundary in the 16-32 range. However, when we go from our estimate for 2^n to our estimate for 2^(n+1) we are clearly doubling our range, so I'd argue that**so long as we can keep on doubling the estimate each time**, we are still showing that there will always be sufficient integer solutions to 'fill every pigeonhole'.
17.

**Third problem. For certain values of q, we end up getting negative solutions.**For instance we already saw this instance:
49 = 64 – 1 – 2 – 4 – 8

q(3) = 27 + 9*2 + 4*3 + 8 = 65

(64 – 65)/81 = -1/81

For these, we can simply shift our starting point up by a power of 2 (or as many as it takes to reach a positive solution. For instance (128 - 65)/81 gives us a positive solution lower than 64/81. Slightly ad hoc, but it means that our original estimate is a slight underestimate (since higher values of 2^n yield a wider range of values of q).

18.

**Fourth problem**.**While the estimate of 1 in 2 level 1 numbers is perfectly accurate, the estimation method is less good for other levels, where the range of values of q is more complex.**We saw in point 14 that we had overestimates for some levels and underestimates for others. My suggestion here would be that for any given level, the estimation is a limit towards which the actual quantity of solutions tends. There is a high degree of regularity within the solutions - for instance for level 3 solutions, any value q will produce an integer solution for 1 in 18 values of 2^n. But we also have 'horizontal' sorting, where (for instance) 1 in 18 of this series has to produce a solution for 2^22 - q, 2^23 - q etc
q = 9+6+2^19

q= 9+12+2^19

q = 9+24+2^19...

etc

And in general, since for any value of q we have 1 in 18 values of 2^n producing a solution, we must be tending towards 1 in 18 solutions for all values of q once we have a large number of values to consider.

19.

**Fifth problem - how do we account for outliers, like the level 41 and 39 numbers (27 and 31) below 32.**
I'm not sure this is really a problem. I'm not claiming the original estimation method is perfect, only that it shows we will always have sufficient solutions. Where a high level gives us an 'unexpectedly' low solution, it simply means we don't need to go up as many levels to find 'sufficient solutions' from the lower numbers. Also, outliers will go on to produce a pretty well sorted range of solutions for lower levels - 27 gives us all the 4n+1 numbers which have the same ongoing Collatz chain - 109, 337 etc. And it also gives us the level 40 number (41), level 39 (31), level 38 (47) and so on - meaning that once this chain has worked its way through the system, we still end up with a pretty even spread of levels.

20.

**Sixth problem. I haven't actually shown that we always have 'sufficient solutions'.**This seems self-evident from looking at the way patterns develop within Pascal's triangle - lower levels increase their hit rate at less than 2 for each increase in 2^n, while higher levels increase their hit rate at more than 2. So as long as we can add extra levels to our estimation method we should be able to double the estimate. It looks self-evident that we can always find at least (2^n/2)-1 solutions but I know this would need to be shown to be necessary.
But for now I wonder if anyone has any thoughts on the basic method outlined above...