Does rewriting memcpy/memcmp/... with SIMD instructions make sense?

Does rewriting memcpy/memcmp/... with SIMD instructions make sense in large-scale software?

If so, why doesn't GCC generate SIMD instructions for these library functions by default?

Also, are there any other functions that could be improved with SIMD?


Yes, these functions are much faster with SSE instructions. It would be nice if your runtime library or compiler intrinsics included optimized versions, but that doesn't seem to be common.

I have a custom SIMD memchr which is a hell of a lot faster than the library version, especially when I'm looking for the first of 2 or 3 characters (for example, to know whether there's an equation in a line of text, I search for the first of =, \n, \r).
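A rough sketch of what such a "first of three characters" search might look like, assuming an x86 target with SSE2 and GCC or Clang (for `__builtin_ctz`). The name `find_first_of3` is made up for illustration; this is not the code described above, just one plausible way to write it.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>

/* Hypothetical helper: returns a pointer to the first byte in buf[0..len)
   that equals c0, c1 or c2, or NULL if none of them occurs. */
const char *find_first_of3(const char *buf, size_t len,
                           char c0, char c1, char c2)
{
    assert(buf != NULL || len == 0);

    /* Broadcast each target byte into all 16 lanes of a vector. */
    const __m128i v0 = _mm_set1_epi8(c0);
    const __m128i v1 = _mm_set1_epi8(c1);
    const __m128i v2 = _mm_set1_epi8(c2);
    size_t i = 0;

    /* Main loop: test 16 bytes per iteration against all three targets. */
    for (; i + 16 <= len; i += 16) {
        __m128i chunk = _mm_loadu_si128((const __m128i *)(buf + i));
        __m128i hit = _mm_or_si128(
            _mm_cmpeq_epi8(chunk, v0),
            _mm_or_si128(_mm_cmpeq_epi8(chunk, v1),
                         _mm_cmpeq_epi8(chunk, v2)));
        int mask = _mm_movemask_epi8(hit);  /* one bit per matching byte */
        if (mask != 0)
            return buf + i + __builtin_ctz((unsigned)mask);
    }

    /* Scalar tail for the remaining len % 16 bytes. */
    for (; i < len; ++i) {
        if (buf[i] == c0 || buf[i] == c1 || buf[i] == c2)
            return buf + i;
    }
    return NULL;
}
```

Called as `find_first_of3(line, n, '=', '\n', '\r')`, it returns the position of whichever of the three bytes comes first. Unlike strpbrk it takes an explicit length, so it also works on buffers that are not NUL-terminated.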

On the other hand, the library functions are well tested, so it's only worth writing your own if you call them a lot and a profiler shows they're a significant fraction of your CPU time.
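If you do go down that road, measure first. Below is a crude timing harness, a sketch only and no substitute for a real profiler. The names `copy_fn`, `byte_copy` and `seconds_per_copy` are invented for this example, and `clock()` measures CPU time at fairly coarse resolution.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* Signature shared by memcpy and any replacement under test. */
typedef void *(*copy_fn)(void *dst, const void *src, size_t n);

/* Naive baseline: one byte per iteration. */
void *byte_copy(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;
    while (n--)
        *d++ = *s++;
    return dst;
}

/* Returns the average CPU seconds per call of fn on a buffer of `size`
   bytes, or -1.0 if the buffers cannot be allocated. */
double seconds_per_copy(copy_fn fn, size_t size, int reps)
{
    assert(fn != NULL && size > 0 && reps > 0);

    unsigned char *src = malloc(size);
    unsigned char *dst = malloc(size);
    if (!src || !dst) {
        free(src);
        free(dst);
        return -1.0;
    }
    memset(src, 0xAB, size);

    volatile unsigned char sink = 0;
    clock_t start = clock();
    for (int i = 0; i < reps; ++i) {
        fn(dst, src, size);
        sink ^= dst[size - 1];  /* read the result so the copy is not discarded */
    }
    clock_t stop = clock();
    (void)sink;

    free(src);
    free(dst);
    return (double)(stop - start) / CLOCKS_PER_SEC / reps;
}
```

Comparing `seconds_per_copy(memcpy, n, reps)` with the same call for a hand-written routine, at the buffer sizes your program actually uses, gives a first indication of whether a custom version is worth maintaining. Keep in mind that an optimizing compiler may itself vectorize the "naive" loop or turn it into a memcpy call.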

It probably doesn't matter. The CPU is much faster than memory bandwidth, and the implementations of memcpy etc. provided by the compiler's runtime library are probably good enough. In "large scale" software your performance is not going to be dominated by copying memory, anyway (it's probably dominated by I/O).

To get a real step up in memory copying performance, some systems have a specialised implementation of DMA that can be used to copy from memory to memory. If a substantial performance increase is needed, hardware is the way to get it.

It does not make sense. Your compiler ought to be emitting these instructions implicitly for memcpy/memcmp/similar intrinsics, if it is able to emit SIMD at all.

You may need to explicitly instruct GCC to emit SSE opcodes with e.g. -msse -msse2; some GCC builds do not enable them by default. Also, if you do not tell GCC to optimize (i.e., -O2), it won't even try to emit fast code.
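One way to check what your compiler does is to compile a couple of fixed-size copies and read the generated assembly. The sketch below uses made-up names (`load_u32`, `struct packet`, `copy_packet`). With optimization enabled, GCC usually treats memcpy as a builtin and expands small constant-size copies inline rather than calling the library, but the exact output depends on the compiler version, target and flags, so inspect it yourself.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Reads a 32-bit value from a possibly unaligned address.
   A constant-size memcpy like this is typically compiled to a single load. */
uint32_t load_u32(const void *p)
{
    uint32_t v;
    assert(p != NULL);
    memcpy(&v, p, sizeof v);
    return v;
}

/* Illustrative 64-byte record. */
struct packet {
    unsigned char payload[64];
};

/* Fixed-size copy; with SSE enabled the compiler may expand this into a few
   16-byte vector moves instead of a call to the library memcpy. */
void copy_packet(struct packet *dst, const struct packet *src)
{
    assert(dst != NULL && src != NULL);
    memcpy(dst, src, sizeof *dst);
}
```

Compiling with something like `gcc -O2 -msse2 -S file.c` and looking at the resulting `.s` file shows whether the copies were turned into inline moves or left as calls.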

The use of SIMD opcodes for memory work like this can have a massive performance impact, because optimized SIMD copy routines typically also use cache prefetches and non-temporal (streaming) stores, which matter for optimizing bus access. But that doesn't mean that you need to emit them manually; even though most compilers stink at emitting SIMD ops generally, every one I've used at least handles them for the basic CRT memory functions.

Basic math functions can also benefit a lot from setting the compiler to SSE mode. You can easily get an 8x speedup on basic sqrt() just by telling the compiler to use the SSE opcode instead of the terrible old x87 FPU.

On x86 hardware with out-of-order execution, it should not matter much. The processor will extract the necessary ILP and try to issue the maximum number of load/store operations per cycle for memcpy, whether it is written with SIMD or scalar instructions.

Category:performance Time:2011-03-16 Views:3

