egg|nomz|egg changed the topic of #kspacademia to: https://gist.github.com/pdn4kd/164b9b85435d87afbec0c3a7e69d3e6d | Dogs are cats. Spiders are cat interferometers. | Cosmism today! | Document well, for tomorrow you may get mauled by a Catbus. | <UmbralRaptor> egg|nomz|egg: generally if your eyes are dewing over, that's not the weather. | <ferram4> I shall beat my problems to death with an engineer.
e_14159 has quit [Ping timeout: 198 seconds]
e_14159 has joined #kspacademia
UmbralRaptor has quit [Remote host closed the connection]
* Qboid
gives UmbralRaptop a contravariant polynomial
<egg|work|egg>
!wpn whitequark
* Qboid
gives whitequark a walrus
* egg|work|egg
wonders whether котя will eat the walrus
<UmbralRaptop>
!wpn egg|work|egg
* Qboid
gives egg|work|egg an isothermal tarrasque
UmbralRaptor has joined #kspacademia
UmbralRaptop has quit [Ping timeout: 182 seconds]
* egg
pets whitequark
<whitequark>
hi
<egg>
how are the cats
<egg>
whitequark: if the cats were sequenced, what sort of things could we tell from their genomes?
<whitequark>
still in hk
<egg>
whitequark: do hk cats still flee?
<whitequark>
i haven't seen any in a long while
<egg>
hmm
<egg>
whitequark: there are apparently cat cafes in hk
<egg>
so you could look there i guess :-p
<whitequark>
too shy
<SnoopJeDi>
oh neat, today's speaker is one of the LIGO Nobel recipients and was also chair of COBE's science working group \o/
<UmbralRaptor>
!
<APlayer>
Are there any powers of two larger than 2⁰ that have an arbitrary first digit and only zeroes after that?
<APlayer>
I don't think there are, are there?
<APlayer>
Larger than 2³, even
<egg>
whitequark: you or the cats
<whitequark>
egg: me
<whitequark>
APlayer: that means a power of two divisible by a power of ten
<whitequark>
which is clearly absurd
<APlayer>
Not just by "a" power of ten, but a specific power of ten
<APlayer>
But alright, point taken, makes sense
<whitequark>
10=2*5, and no power of 2 is divisible by 5
* APlayer
needs to fix his knowledge of primes, division and related subjects
<SnoopJeDi>
it takes some practice before it's natural, imo
<SnoopJeDi>
powers of 2 in particular are something people have dealt with often, though
<whitequark>
i dunno, i don't think i touched that knowledge since primary school
<APlayer>
But what if you have 2⁴ котяs?
<SnoopJeDi>
whitequark, surely you've thought about powers of 2
<SnoopJeDi>
or do you mean explicitly dealing with it in a classroom sense
<SnoopJeDi>
vs adjacent to some unrelated task
<egg>
there's nothing specific to 2 here, any number not divisible by 10 will do
<SnoopJeDi>
yes, that's rather obvious
<SnoopJeDi>
but, like most things mathematicians treat with, it's only obvious in hindsight, heh.
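The divisibility argument above (a power ending in a decimal zero must be divisible by 10, hence by both 2 and 5) can be checked by brute force for every power that fits in 64 bits. This is only a rough sketch; the function name `some_power_ends_in_zero` is made up for illustration and does not come from the channel.

```cpp
#include <cstdint>
#include <limits>

// Hypothetical helper: returns true if some power base^k (k >= 1) that fits
// in 64 bits is divisible by 10, i.e. ends in a decimal zero.
bool some_power_ends_in_zero(std::uint64_t base) {
  const std::uint64_t max = std::numeric_limits<std::uint64_t>::max();
  std::uint64_t power = base;
  while (true) {
    if (power % 10 == 0) return true;
    // Stop once the next power would overflow (or can never change).
    if (base < 2 || power > max / base) return false;
    power *= base;
  }
}
```

Calling it with 2, 5 or 6 should return false, matching egg's remark that nothing here is specific to 2; calling it with 20 returns true immediately, because 20 itself is divisible by 10.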
<egg>
bofh: is there a difference in behaviour between those snippets? my compiler generates the former usually, but the latter if I do some intrinsics trickery to produce the thing that gets movqed https://hastebin.com/vayesamona.rb
<egg>
(or whitequark or anyone who feels like looking at x86-64 nonsense)
<egg>
(aside from the random mulsd at the beginning of the first one aaargh)
<egg>
(and the mov rcx, 0xfffffff000000000 is also irrelevant)
<egg>
mul vs. imul, and correspondingly different constants and shifts? Ꙩ_ꙩ why
<egg>
oh derp _mm_cvtsi128_si64 returns a signed integer of course
* egg
stabs egg
<APlayer>
Congratulations! You now have egg on a stick
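For anyone wondering why a signed operand turns mul into imul "with correspondingly different constants and shifts": compilers replace division by a constant with a multiply-high, and the signed and unsigned recipes differ. The sketch below shows the usual division-by-3 recipes at 32 bits so that it stays portable; the hastebin snippets were 64-bit and their exact constants are not reproduced in the log, so treat the function names and constants here as illustrative.

```cpp
#include <cstdint>

// Unsigned x / 3: multiply by ceil(2^33 / 3) = 0xAAAAAAAB, shift right by 33.
std::uint32_t div3_unsigned(std::uint32_t x) {
  return static_cast<std::uint32_t>((std::uint64_t{x} * 0xAAAAAAABu) >> 33);
}

// Signed x / 3: take the high 32 bits of the signed product with
// ceil(2^32 / 3) = 0x55555556, then add the sign bit of x so that the result
// rounds toward zero like C++ integer division.
// Relies on >> of a negative value being an arithmetic shift, which is
// guaranteed from C++20 and is what mainstream compilers do before that.
std::int32_t div3_signed(std::int32_t x) {
  const std::int64_t product = std::int64_t{x} * 0x55555556;
  const std::int32_t high = static_cast<std::int32_t>(product >> 32);
  return high + static_cast<std::int32_t>(static_cast<std::uint32_t>(x) >> 31);
}
```

So the same source-level `/ 3` yields a different multiplier, a different shift, and an extra sign fix-up as soon as the operand type becomes signed, which is consistent with what egg saw after `_mm_cvtsi128_si64`.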
<UmbralRaptor>
s/котяs/котяс/
<bofh>
egg: there's no reason to use intrinsics for scalar code imho.
<bofh>
also those asm snippets in both cases are essentially ~equivalent perf-wise, wide imul and wide mul have ~identical thruput/latency on p. much any Intel/AMD.
<bofh>
(the only case where the difference matters is in imuls small enough that the constant fits into its immediate field. but that's blatantly obviously not the case here).
<iximeow>
in the last haste there's two mul in the first, only one in the second?
* iximeow
rereads
<egg>
bofh: well msvc refuses to emit sse2 pand etc. (even movq!) without intrinsics so I use that
<iximeow>
ah nvm i see you mentioned that mulsd is unrelated
<egg>
bofh: sadly the whole of C + Y / 3 is in the critical path and that I think I have to do with plain x86-64 stuff afaict
<egg>
bofh: unless you can see a way to do it in sse2 or 3?
<egg>
but without a 64-bit mul it seems infeasible :-/
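The log does not say what function is being computed. If Y is the bit pattern of a positive double, though, C + Y / 3 has the shape of the standard piecewise-linear seed for a cube root: dividing the integer representation by 3 roughly divides the exponent by 3, and C restores the exponent bias. A minimal sketch under that assumption follows; the name `cube_root_seed` and the uncorrected constant (two thirds of the bit pattern of 1.0) are illustrative choices, not egg's actual values.

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical seed, assuming y is positive, finite and normal.
double cube_root_seed(double y) {
  std::uint64_t Y;
  std::memcpy(&Y, &y, sizeof Y);  // reinterpret the double as an integer
  // Assumed constant: (2/3) * 0x3FF0000000000000, which makes the seed exact
  // at y = 1. A tuned implementation would subtract a small correction.
  const std::uint64_t C = 0x2AA0000000000000;
  const std::uint64_t Q = C + Y / 3;  // the integer multiply-and-shift step
  double q;
  std::memcpy(&q, &Q, sizeof q);
  return q;
}
```

With this constant the seed is exact for 1.0 and 8.0 and gives 3.125 for 27.0; the integer division is the part that needs a 64-bit multiply, which is why it sits on the critical path.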
* egg
pokes _mm_mul_epu32 in the 32
<UmbralRaptor>
mul_apin64
<bofh>
egg: so like, even if you do it in sse2 it won't be faster, C + Y / 3 is fast in integer x86_64 >_>
<bofh>
like, integer mul is 3 cycles latency *at most* since Nehalem, I think it might even be 2 now.
<bofh>
and that's for all integer muls, incl. 64x64->128.
<egg>
bofh: yes but you pay one cycle either way for the movqs
<bofh>
yeah but that's all of 1 cycle, if that's actually seriously hurting your overall function perf, then your function is fast enough.
<bofh>
:P
<egg>
bofh: that's certainly noticeable for things like the clobbering to 16 bits
<bofh>
for clobbering to 16 bits, sure, but that's b/c pand/andps makes sense to use there.
<egg>
bofh: and for extracting the sign you need a couple of ands, and doing that in sse2 puts only one of them on the critical path
<bofh>
wait, how on *earth* is extracting the sign on the critical path? you do it at the start of the function in the same path as the imul
<egg>
bofh: wait what? don't I need to extract the sign before I do the linear approximation stuff?
<bofh>
and then at the very end either OR it back in, or just do if (sign) x = -x before the return.
<egg>
yeah I or it back in
<bofh>
egg: uh the start of your f'n should be extract sign, absolute value, linear approx.
<bofh>
and those can *all* be done in x86_64 from the initial movq.
<egg>
bofh: oh the x86-64 stuff is superscalar too?
<bofh>
I mean there are no data dependencies on the signbit once you extract it until the extreme end of the function...
<egg>
bofh: yeah obviously
<egg>
bofh: then again there's no reason not to do it in sse2 intrinsics (because then I get an m128i that I can just or at the end, instead of having to separately movq it back up)
<egg>
bofh: but where to do abs is a good questino
<egg>
s/tino/tion/ even
<egg>
bofh: ah wait I have a dependency on abs y further down the line
<egg>
bofh: so I'm better off doing that bitand in sse2 than movqing abs y back up too
<bofh>
so like after the abs at the start of the code you *only* have a dependency on abs y, not y.
APlayer has quit [Read error: Connection reset by peer]
APlayer has joined #kspacademia
Technicalfool has joined #kspacademia
<egg>
bofh: yes, but y comes to me in some xmm register, and I need abs y in one too
APlayer has quit [Ping timeout: 182 seconds]
<egg>
bofh: so if I compute abs y from y with a pand/pandn I'm fine, otherwise I induce a (mild) dependency on the first movq and I need an additional movq that I wouldn't otherwise
<bofh>
okay, I see your point.
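The approach egg describes (split off the sign with pand, get abs y with pandn, keep both in xmm registers, and or the sign back in at the end) might look roughly like this with SSE2 intrinsics. The function name `apply_to_magnitude` is invented, and the square root is only a stand-in for whatever is really computed on abs y.

```cpp
#include <emmintrin.h>  // SSE2 intrinsics

// Hypothetical sketch: compute sign(y) * f(|y|) without leaving xmm registers.
double apply_to_magnitude(double y) {
  // Mask with only bit 63 (the sign bit of the low lane) set.
  const __m128i sign_bit = _mm_castpd_si128(_mm_set_sd(-0.0));
  const __m128i y_bits = _mm_castpd_si128(_mm_set_sd(y));
  const __m128i sign = _mm_and_si128(y_bits, sign_bit);      // pand
  const __m128i abs_y = _mm_andnot_si128(sign_bit, y_bits);  // pandn: ~mask & y
  // Stand-in for the real work on |y|; it depends on abs_y only, not on sign.
  const __m128d abs_y_pd = _mm_castsi128_pd(abs_y);
  const __m128d result = _mm_sqrt_sd(abs_y_pd, abs_y_pd);
  // Or the saved sign back in at the very end.
  const __m128i signed_result = _mm_or_si128(_mm_castpd_si128(result), sign);
  return _mm_cvtsd_f64(_mm_castsi128_pd(signed_result));
}
```

Only the pandn sits on the dependency chain leading to the result; the pand that extracts the sign is not needed until the final or, which is the point egg makes about having just one of the two ands on the critical path.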
<egg>
bofh: amusingly (with haswell, and with the caveats of iaca occasionally being confused), the version with intrinsics is 120 cycles like the positives-only version (in practice and on skylake it's slightly slower, but noticeably less so than doing every integer operation through movqs, see above)
<SnoopJeDi>
um so apparently LIGO is limited at low frequencies by Brownian motion in the reflective optical coating on the test masses
<whitequark>
uh
<SnoopJeDi>
well, that's the way he phrased it anyhow. I liked it better as he rephrased it: losses in the material correspond directly with thermal noise
<SnoopJeDi>
I didn't realize A+/Voyager were a thing, but apparently they're planning to implement a "squeezed light" (goddammit optics) technique this year to push down the radiative forcing noise
<egg>
bofh: lolwtf, adding branches for rescaling to prevent over/underflow (and correctly handle subnormal values as a side effect) makes the benchmark (main branch) *faster*?!
<SnoopJeDi>
surprising number of orders of magnitude left for a terrestrial GW observatory (if one believes the claims, but I think that's reasonable)
<egg>
bofh: at least you're right that it doesn't slow things down at all :-p