How does the HipHop Virtual Machine (HHVM) theoretically improve PHP runtime performance?


At a high level, how do Facebook et al. improve PHP's performance with the HipHop Virtual Machine?

How does this differ from executing code on the traditional Zend engine? Is it because types can be optionally declared with Hack, which enables ahead-of-time optimization techniques?
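For example (an illustrative snippet of mine, not from the article; the annotated form is valid in both Hack and PHP 7+ syntax):

    // Untyped PHP: the engine has to check the runtime types of $a and $b
    // on every call before it can pick an addition strategy.
    function addDynamic($a, $b) {
        return $a + $b;
    }

    // Optionally typed, Hack style. In principle, declared types would let
    // a compiler specialize this to plain integer addition ahead of time
    // instead of dispatching on runtime types.
    function addTyped(int $a, int $b): int {
        return $a + $b;
    }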

My curiosity arose after reading this article on HHVM adoption.

    
asked by chrisjlee 12.06.2015 - 05:34

2 answers


They replaced TranslatorX64's tracelet translators with the new HipHop Intermediate Representation (hhir) and a new layer of indirection in which the logic for generating hhir resides; that layer is in fact referred to by the same name, hhir.

At a high level, it uses 6 instructions to do what previously took 9, as noted in the post quoted below: "It begins with the same typechecks but the body of the translation is 6 instructions, significantly better than the 9 from TranslatorX64."


"We solved this problem by adding a new layer of indirection. This new layer is an SSA form intermediate representation, positioned between the bytecodes in TranslatorX64’s tracelets and the x86 machine code we want to end up with. It’s strongly typed and designed to facilitate a number of optimizations we wanted to port from TranslatorX64 as well as new optimizations in the future. This new IR, named hhir (short for HipHop Intermediate Representation), completely replaced TranslatorX64 as hhvm’s JIT in May of 2013. While hhir specifically refers to the representation itself, we often use the name to refer to all the pieces of code that interact with it. If you’ve looked at our source code recently you might have noticed that a class named TranslatorX64 still exists and contains a nontrivial amount of code. That’s mostly an artifact of how the system is designed and is something we plan to eventually clean up. All of the code left in TranslatorX64 is machinery required to emit code and link translations together; the code that understood how to translate individual bytecodes is gone from TranslatorX64.

When hhir replaced TranslatorX64, it was generating code that was roughly 5% faster and looked significantly better upon manual inspection. We followed up its production debut with another mini-lockdown and got an additional 10% in performance gains on top of that. To see some of these improvements in action, let’s look at a function addPositive and part of its translation.

    function addPositive($arr) {
      $n = count($arr);
      $sum = 0;
      for ($i = 0; $i < $n; $i++) {
        $elem = $arr[$i];
        if ($elem > 0) {
          $sum = $sum + $elem;
        }
      }
      return $sum;
    }

This function looks like a lot of PHP code: it loops over an array and does something with each element. Let’s focus on lines 5 and 6 for now, along with their bytecode:

    $elem = $arr[$i];
    if ($elem > 0) {
  // line 5
   85: CGetM <L:0 EL:3>
   98: SetL 4
  100: PopC
  // line 6
  101: Int 0
  110: CGetL2 4
  112: Gt
  113: JmpZ 13 (126)

These two lines load an element from an array, store it in a local variable, then compare the value of that local with 0 and conditionally jump somewhere based on the result. If you’re interested in more detail about what’s going on in the bytecode, you can skim through bytecode.specification. The JIT, both now and back in the TranslatorX64 days, breaks this code up into two tracelets: one with just the CGetM, then another with the rest of the instructions (a full explanation of why this happens isn’t relevant here, but it’s mostly because we don’t know at compile time what the type of the array element will be). The translation of the CGetM boils down to a call to a C++ helper function and isn’t very interesting, so we’ll be looking at the second tracelet. This commit was TranslatorX64’s official retirement, so let’s use its parent to see how TranslatorX64 translated this code.
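To make the tracelet split concrete, here is a minimal sketch (my example, not from the quoted post) of why the element type cannot be known at compile time: a PHP array is heterogeneous, so nothing in addPositive's source pins down what CGetM will produce.

    // addPositive as defined in the quoted post.
    function addPositive($arr) {
      $n = count($arr);
      $sum = 0;
      for ($i = 0; $i < $n; $i++) {
        $elem = $arr[$i];
        if ($elem > 0) {
          $sum = $sum + $elem;
        }
      }
      return $sum;
    }

    // Both calls are legal, but the element loads produce different types:
    var_dump(addPositive([1, 2, 3]));      // CGetM yields Int on every iteration
    var_dump(addPositive([1.5, 2.5]));     // CGetM yields Dbl instead

Since neither call can be ruled out, the JIT ends the tracelet after the element load and compiles a separately specialized successor tracelet per observed type. With that in mind, back to how TranslatorX64 translated the second tracelet: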

  cmpl  $0xa, 0xc(%rbx)
  jnz 0x276004b2
  cmpl  $0xc, -0x44(%rbp)
  jnle 0x276004b2
101: SetL 4
103: PopC
  movq  (%rbx), %rax
  movq  -0x50(%rbp), %r13
104: Int 0
  xor %ecx, %ecx
113: CGetL2 4
  mov %rax, %rdx
  movl  $0xa, -0x44(%rbp)
  movq  %rax, -0x50(%rbp)
  add $0x10, %rbx    
  cmp %rcx, %rdx    
115: Gt
116: JmpZ 13 (129)
  jle 0x7608200

The first four lines are typechecks verifying that the value in $elem and the value on the top of the stack are the types we expect. If either of them fails, we’ll jump to code that triggers a retranslation of the tracelet, using the new types to generate a differently specialized chunk of machine code. The meat of the translation follows, and the code has plenty of room for improvement. There’s a dead load on line 8, an easily avoidable register to register move on line 12, and an opportunity for constant propagation between lines 10 and 16. These are all consequences of the bytecode-at-a-time approach used by TranslatorX64. No respectable compiler would ever emit code like this, but the simple optimizations required to avoid it just don’t fit into the TranslatorX64 model.
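To see why an IR-level view fixes this, here is a toy sketch of constant propagation over a three-address instruction list (a simplified illustration of the idea, nothing like HHVM's actual data structures):

    // A toy linear "IR": each instruction defines $dst from $op and $args.
    $ir = [
      ['dst' => 't1', 'op' => 'const', 'args' => [0]],        // like the Int 0 bytecode
      ['dst' => 't2', 'op' => 'load',  'args' => ['elem']],
      ['dst' => 't3', 'op' => 'gt',    'args' => ['t2', 't1']],
    ];

    // One pass: remember which temps are known constants, and rewrite any
    // later use of such a temp into the constant itself.
    $constants = [];
    foreach ($ir as &$inst) {
      foreach ($inst['args'] as &$arg) {
        if (is_string($arg) && isset($constants[$arg])) {
          $arg = $constants[$arg];            // fold the constant into the use
        }
      }
      unset($arg);
      if ($inst['op'] === 'const') {
        $constants[$inst['dst']] = $inst['args'][0];
      }
    }
    unset($inst);

    print_r($ir);  // t3 now compares t2 against the literal 0

A translator that compiles one bytecode at a time never sees the definition of t1 and its use in the same unit, so it cannot perform even this one-pass rewrite; and once the 0 has been folded in, t1's defining instruction has no remaining uses and could be deleted as dead code.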

Now let’s see the same tracelet translated using hhir, at the same hhvm revision:

  cmpl  $0xa, 0xc(%rbx)
  jnz 0x276004bf
  cmpl  $0xc, -0x44(%rbp)
  jnle 0x276004bf
101: SetL 4
  movq  (%rbx), %rcx
  movl  $0xa, -0x44(%rbp)
  movq  %rcx, -0x50(%rbp)
115: Gt    
116: JmpZ 13 (129)
  add $0x10, %rbx
  cmp $0x0, %rcx    
  jle 0x76081c0

It begins with the same typechecks but the body of the translation is 6 instructions, significantly better than the 9 from TranslatorX64. Notice that there are no dead loads or register to register moves, and the immediate 0 from the Int 0 bytecode was propagated down to the cmp on line 12. Here’s the hhir that was generated between the tracelet and that translation:

  (00) DefLabel    
  (02) t1:FramePtr = DefFP
  (03) t2:StkPtr = DefSP<6> t1:FramePtr
  (05) t3:StkPtr = GuardStk<Int,0> t2:StkPtr
  (06) GuardLoc<Uncounted,4> t1:FramePtr
  (11) t4:Int = LdStack<Int,0> t3:StkPtr
  (13) StLoc<4> t1:FramePtr, t4:Int
  (27) t10:StkPtr = SpillStack t3:StkPtr, 1
  (35) SyncABIRegs t1:FramePtr, t10:StkPtr
  (36) ReqBindJmpLte<129,121> t4:Int, 0

The bytecode instructions have been broken down into smaller, simpler operations. Many operations hidden in the behavior of certain bytecodes are explicitly represented in hhir, such as the LdStack on line 6 which is part of the SetL. By using unnamed temporaries (t1, t2, etc…) instead of physical registers to represent the flow of values, we can easily track the definition and use(s) of each value. This makes it trivial to see if the destination of a load is actually used, or if one of the inputs to an instruction is really a constant value from 3 bytecodes ago. For a much more thorough explanation of what hhir is and how it works, take a look at ir.specification.
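The def-use tracking described above can be sketched in a few lines (a toy model of mine, not hhir's real representation): because every SSA temporary is defined exactly once, finding a dead load reduces to counting uses.

    // Toy SSA instruction list: each temp has exactly one definition.
    $ssa = [
      ['dst' => 't4', 'op' => 'LdStack', 'srcs' => ['t3']],
      ['dst' => null, 'op' => 'StLoc',   'srcs' => ['t1', 't4']],
      ['dst' => 't9', 'op' => 'LdLoc',   'srcs' => ['t1']],   // hypothetical extra load
    ];

    // Count every use of every temp in one linear pass.
    $uses = [];
    foreach ($ssa as $inst) {
      foreach ($inst['srcs'] as $src) {
        $uses[$src] = ($uses[$src] ?? 0) + 1;
      }
    }

    // A definition that is never used is a dead load: exactly the kind of
    // instruction TranslatorX64 emitted but an SSA IR can trivially delete.
    foreach ($ssa as $inst) {
      if ($inst['dst'] !== null && !isset($uses[$inst['dst']])) {
        echo "dead: {$inst['op']} -> {$inst['dst']}\n";       // reports t9
      }
    }

With physical registers this is much harder: a register is written many times, so relating a read to the write it observes requires full dataflow analysis.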

This example showed just a few of the improvements hhir made over TranslatorX64. Getting hhir deployed to production and retiring TranslatorX64 in May 2013 was a great milestone to hit, but it was just the beginning. Since then, we’ve implemented many more optimizations that would be nearly impossible in TranslatorX64, making hhvm almost twice as efficient in the process. It’s also been crucial in our efforts to get hhvm running on ARM processors by isolating and reducing the amount of architecture-specific code we need to reimplement. Watch for an upcoming post devoted to our ARM port for more details!"

    
answered 16.06.2015 - 20:25

In short: they try to minimize random memory accesses and jumps between pieces of code in memory, so as to play nicely with the CPU cache.

According to "The State of HHVM Performance", HHVM optimized its most frequently used data types, strings and arrays, to minimize random memory access. The idea is to keep pieces of data that are used together (such as the items of an array) as close to each other as possible in memory, ideally laid out linearly. That way, if the data fits into the CPU's L2/L3 cache, it can be processed orders of magnitude faster than if it sat in RAM.
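The effect is observable even from PHP itself. Here is a rough, hypothetical micro-benchmark (absolute timings vary by machine; only the relative gap matters), summing the same values while walking memory linearly versus in a random order:

    $n    = 1000000;
    $data = range(0, $n - 1);

    $order = range(0, $n - 1);
    shuffle($order);                  // same indices, random visiting order

    // Sequential walk: consecutive memory, friendly to prefetch and cache.
    $t = microtime(true);
    $sum = 0;
    foreach ($data as $v) { $sum += $v; }
    printf("sequential: %.3fs\n", microtime(true) - $t);

    // Random walk: same elements, cache-hostile access pattern.
    $t = microtime(true);
    $sum = 0;
    foreach ($order as $i) { $sum += $data[$i]; }
    printf("random:     %.3fs\n", microtime(true) - $t);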

Another technique mentioned is compiling the most frequently executed code paths so that the compiled version is as linear as possible (i.e., has as few "jumps" as possible) and loads data into and out of memory as rarely as possible.

    
answered 16.06.2015 - 22:02
