Monday, March 30, 2015

High Disk Space is being consumed by C:\Program Files\Symantec\Symantec Endpoint Protection Manager\data\outbox\Importpackage

1. Delete all files from the %installlocation%\Symantec\Symantec Endpoint Protection Manager\data\outbox\ImportPackage folder (without stopping any services).
2. Delete everything older than today's date in %installlocation%\Symantec\Symantec Endpoint Protection Manager\Inetpub\content (also without stopping any services)
3. In the Symantec Admin Console go to Admin > Servers > localhost. Right-click localhost and truncate the transaction logs.
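
Step 2 above ("everything older than today's date") can be previewed before anything is deleted. The following is a minimal sketch, assuming Python is available on the server; the function name `files_older_than_today` is my own and not part of any Symantec tool, and the script only lists candidates rather than deleting them.

```python
import datetime
import os
import tempfile
from pathlib import Path


def files_older_than_today(root):
    """Return files under root that were last modified before today (local time)."""
    midnight = datetime.datetime.combine(datetime.date.today(), datetime.time.min)
    cutoff = midnight.timestamp()
    return sorted(
        path
        for path in Path(root).rglob("*")
        if path.is_file() and path.stat().st_mtime < cutoff
    )


if __name__ == "__main__":
    # Demo on a throwaway directory; pass the real content folder yourself.
    with tempfile.TemporaryDirectory() as demo:
        old_file = Path(demo, "old.zip")
        new_file = Path(demo, "new.zip")
        old_file.write_text("x")
        new_file.write_text("x")
        two_days_ago = datetime.datetime.now().timestamp() - 2 * 86400
        os.utime(old_file, (two_days_ago, two_days_ago))
        for path in files_older_than_today(demo):
            print("would delete:", path.name)
```

Listing first and deleting by hand keeps the step reversible until you have checked the list.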


Symantec has released a new version of Symantec Endpoint Protection. The English version of Symantec Endpoint Protection 12.1.5337.5000 (RU5) is now available.
It has a new content storage optimization feature:
As part of the upgrade to SEPM 12.1 RU5, the SEPM converts all of the content from full definitions to delta definitions. This process is resource intensive and may take an extended period of time. After this process is completed, the SEPM will use significantly less disk space.
In a typical enterprise setup where 30 content revisions are stored, the SEPM upgrade process must reduce 55GB of full content to under 2GB of delta content. This process requires significant resources to complete and is impacted by the performance of any available CPUs, CPU cores (physical/logical/hyperthreading), memory, and disks (I/O). On a server that performs multiple roles, stores a larger number of content revisions, or is otherwise resource constrained, this process may take longer to complete.
For more information, refer to this article: The LiveUpdate content optimization and content storage space optimization steps take a long time to complete when upgrading to Symantec Endpoint Protection Manager 12.1 RU5
http://www.symantec.com/docs/TECH224055

Thursday, March 12, 2015

vCenter Server Appliance: Troubleshooting full database partition



Within 6 months, a customer of mine twice had a full database partition on a VMware vCenter Server Appliance. After the first outage, the customer increased the size of the partition that is mounted to /storage/db. Some months later, a few days ago, the vCSA became unresponsive again, once more because of a full database partition. The customer increased the size of the database partition again (~ 200 GB!!), and today I had time to take a look at this nasty vCSA.
The situation
Within 2 days, the storage usage of the database increased from 75% to 77%. First, I checked the size of the database: it was only 2 GB. The pg_log directory was more interesting. It was full of log files, and every log file contained the same single message.
The solution
This led me to VMware KB2092127 (After upgrading to vCenter Server Appliance 5.5 Update 2, pg_log file reports this error: WARNING: there is already a transaction in progress). And yes, this appliance was upgraded to U2 with high probability. The solution is described in KB2092127, and is really easy to implement. Please note that this is only a workaround. There’s currently no solution, as mentioned in the article.
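
The diagnosis above came down to comparing the size of the database with the size of its log directory. A minimal sketch of that comparison, assuming Python on the appliance; the helper names `dir_size_bytes` and `largest_subdirs` are hypothetical, and the directory to scan (for example somewhere under /storage/db) is something you choose yourself.

```python
import os


def dir_size_bytes(path):
    """Sum the sizes of all regular files below path (symlinks are skipped)."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(path):
        for name in filenames:
            full = os.path.join(dirpath, name)
            if not os.path.islink(full):
                total += os.path.getsize(full)
    return total


def largest_subdirs(parent, top=5):
    """Return (name, bytes) for the biggest direct subdirectories of parent."""
    sizes = []
    for entry in os.scandir(parent):
        if entry.is_dir(follow_symlinks=False):
            sizes.append((entry.name, dir_size_bytes(entry.path)))
    return sorted(sizes, key=lambda item: item[1], reverse=True)[:top]
```

If a log directory such as pg_log tops the list while the data directory stays small, the logs rather than the database are filling the partition.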

How to connect/interact with VCVA DB (DB2 and vPostgres)

If you need to connect to or interact with the VC appliance database, for example to remove DB2 locks or run a script, you can do the following after logging in as root via SSH on the appliance:

- On VCVA 5.0 GA with DB2:

1. Switch to the db2inst1 user:

vcenter:/ # su db2inst1

2. Start the db2 client:

db2inst1@vcenter:/> db2

You'll see a prompt like this:

db2 =>

3. connect to the VCDB database:

db2 => connect to VCDB

(type the command literally, exactly as shown)

4. Change to VC schema:

db2 => set schema vc

5. Run any command you need. For example, to remove the rows from VPX_SESSIONLOCK:

db2 => delete from VPX_SESSIONLOCK
DB20000I  The SQL command completed successfully.


You can type "quit" anytime you want to exit from the db2 client, and "exit" when you want to go back to root userspace.

- On VCVA with vPostgres:

1. Connect to the database using psql:

vcenter:/ # /opt/vmware/vpostgres/1.0/bin/psql -U vc -d VCDB

You'll see a prompt like this:

psql (9.0.4)
Type "help" for help.


VCDB=>

2. Perform any command you need (selects, inserts, etc). For example, to list all tables:

VCDB=> \dt

There are a lot of new tables in 5.1, (mainly the vpx_hist_stat* ones).

To quit, just type "\q"

Changing the default VMware vCenter Server Appliance database password (2056968)

Details

You can change the default password for the VMware vCenter Server Appliance database when you want or if the password is compromised.

Solution

To change the default:
  1. Change the embedded database password:

    1. Connect to the vCenter Server Appliance using SSH.
    2. Open the embedded_db.cfg file for editing with this command:

      vi /etc/vmware-vpx/embedded_db.cfg

    3. In the file, locate EMB_DB_PASSWORD and change the password between the single quotation marks.
  2. Change the password for the vc and postgres database users:

    1. Connect to the vPostgres database for SQL execution by running this command:

      /opt/vmware/vpostgres/current/bin/psql -d VCDB -U postgres

    2. Run these SQL statements to change the passwords for the vc and postgres users:

      alter user postgres with password 'new-password';
      alter user vc with password 'new-password';

    3. Exit the database with this command:

      \q

    4. Open the .pgpass file for editing by running this command:

      vi /root/.pgpass
    5. Modify the .pgpass file with the new password as follows:

      localhost:5432:VCDB:postgres:new-password
      localhost:5432:postgres:postgres:new-password
      localhost:5432:VCDB:vc:new-password
  3. Change the postgres database password:

    1. Change the password for the vPostgres database by running this command:

      passwd postgres

    2. Type the new password.
    3. Retype the new password.
  4. To update the encrypted password in the vpxd.cfg file, run this command:

    /usr/sbin/vpxd -p

  5. Enter the password when prompted.
  6. Run this command to restart the vpxd service:

    /etc/init.d/vmware-vpxd restart
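
The .pgpass entries in step 2 follow the PostgreSQL password-file layout `hostname:port:database:username:password`, in which a literal `:` or `\` inside a field has to be escaped with a backslash. A minimal sketch that produces the three lines shown above for a given password; the function names are my own, and the host, port, databases, and users are simply the ones from the steps.

```python
def pgpass_escape(field):
    """Escape backslashes and colons, as the .pgpass format requires."""
    return field.replace("\\", "\\\\").replace(":", "\\:")


def pgpass_lines(password, host="localhost", port=5432):
    """Build the three .pgpass entries used in the procedure above."""
    entries = [("VCDB", "postgres"), ("postgres", "postgres"), ("VCDB", "vc")]
    return [
        ":".join([host, str(port), database, user, pgpass_escape(password)])
        for database, user in entries
    ]


if __name__ == "__main__":
    for line in pgpass_lines("new-password"):
        print(line)
```

This matters mostly when the new password contains a colon, which would otherwise be read as a field separator.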

vCenter Appliance – Call “EventHistoryCollector.SetLatestPageSize” for object “SessionID” on vCenter Server failed.

When using the vSphere Client to connect to the VMware vCenter Server Appliance, the following error was appearing every now and again:

Call “EventHistoryCollector.SetLatestPageSize” for object “SessionID” on vCenter Server “ServerName” failed.

This issue is pretty common. It is caused by events in the database not being purged, and VMware covers it in a KB article for Windows environments.
It is not so commonly covered for the vCenter Appliance, which uses a vPostgres (PostgreSQL) database.

After a bit of digging around, I found the following crude solution on the VMware communities board.

So open up a console to your VCSA and log in. Run the following commands:
/opt/vmware/vpostgres/1.0/bin/psql -d VCDB vc
TRUNCATE TABLE vpx_event CASCADE;
Then, to exit, type “\q”.

Here are the steps:
  1. First of all - stop VPXD
    •  service vmware-vpxd stop
  2. connect to DB:
    /opt/vmware/vpostgres/1.0/bin/psql -d VCDB vc 
     
    You will be prompted for the "vc" password which is not the same as the 
    root password.
     
    Password is in "/etc/vmware-vpx/embedded_db.cfg" file
     
     
  3. Issue these commands:
    • TRUNCATE TABLE vpx_event CASCADE;
    • TRUNCATE TABLE vpx_event_arg CASCADE;
    • TRUNCATE TABLE vpx_task CASCADE;
  4. Quit the DB command line:
    • issue the command "\q"
  5. start the VPXD
    •  service vmware-vpxd start (or restart the vCSA appliance)
  6. Check the size of VCDB. Now the size is only 165 MB:
      VCDB=> SELECT pg_database.datname, pg_size_pretty(pg_database_size(pg_database.datname)) AS size FROM pg_database;
        datname  |  size
      -----------+---------
       template1 | 5289 kB
       template0 | 5281 kB
       postgres  | 5385 kB
       VCDB      | 165 MB
      (4 rows)
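
If you want to compare the database sizes before and after the truncate, the psql output from step 6 can be turned into numbers. This is a rough sketch, assuming the two-column `datname | size` layout shown above and the 1024-based units that `pg_size_pretty` prints; `parse_sizes` is a hypothetical helper name.

```python
UNITS = {"bytes": 1, "kB": 1024, "MB": 1024**2, "GB": 1024**3, "TB": 1024**4}


def parse_sizes(psql_output):
    """Map database name to size in bytes from 'datname | size' rows."""
    sizes = {}
    for line in psql_output.splitlines():
        if "|" not in line:
            continue  # separator line, row count, or prompt
        name, size = (part.strip() for part in line.split("|", 1))
        fields = size.split()
        if len(fields) != 2 or not fields[0].isdigit() or fields[1] not in UNITS:
            continue  # header row or an unexpected format
        sizes[name] = int(fields[0]) * UNITS[fields[1]]
    return sizes


if __name__ == "__main__":
    sample = """
      datname  |  size
    -----------+---------
     template1 | 5289 kB
     VCDB      | 165 MB
    (2 rows)
    """
    for name, size in sorted(parse_sizes(sample).items(), key=lambda kv: -kv[1]):
        print(name, size)
```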


         



    References:
    https://communities.vmware.com/thread/80738

    http://www.educationalcentre.co.uk/vmware-5-1-vcenter-appliance-call-eventhistorycollector-setlatestpagesize-for-object-sessionid-on-vcenter-server-failed/#more-418

    http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=2054085

    http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=2056968

    VMware vCenter Server Appliance Error: VPXD must be stopped to perform this operation.

    http://www.virtualizationteam.com/management-automation/vmware-vcenter-server-appliance-error-vpxd-must-be-stopped-to-perform-this-operation.html


    Error: VPXD must be stopped to perform this operation.
    This error came up while I was trying to change authentication to Active Directory authentication, and I saw much the same error when trying to change the database to an external database. For some reason the Stop button for the Server service was greyed out as well, which meant I could not stop it through the GUI. My lab setup is not ideal, and this might just be due to the way I set it up, but I will still document how I resolved it, since the same mechanism can be used to stop and restart any other service used by the vCenter Server Appliance. Below are the steps I followed:
    1- SSH to your VMware vCenter Server Appliance using the root account.
    2- Execute the following command to list all the services on the vCenter Appliance and whether each is enabled at boot:   chkconfig
    The output of all services will look something like below:
    localhost:~ # chkconfig
    after.local               off
    apache2                  off
    arpd                         off
    atftpd                       off
    auditd                       on
    autoyast                    off
    chargen                      off
    chargen-udp              off
    cron                       on
    daytime                    off
    daytime-udp              off
    dbus                     on
    dcerpcd                  on
    dhcp6r                   off
    dhcp6s                   off
    dhcpd                    off
    discard                  off
    discard-udp              off
    earlysyslog              on
    echo                     off
    echo-udp                 off
    eventlogd                on
    fbset                     on
    gpm                      off
    haldaemon                on
    haveged                  on
    irq_balancer             on
    kbd                      on
    ldap                     on
    lsassd                   off
    lwiod                    on
    mdadmd                   off
    multipathd               off
    netlogond                on
    netstat                  off
    network                  on
    network-remotefs         on
    nfs                      on
    ntp                      off
    pcscd                  off
    powerd               off
    random               on
    raw                      off
    rpasswdd            off
    rpcbind                on
    rpmconfigcheck           off
    sendmail                       on
    servers                           off
    services                         off
    setserial                        off
    skeleton.compat          off
    splash                            on
    splash_early                on
    sshd                               235
    stunnel                         off
    syslog                           on
    syslog-collector         off
    systat                          off
    time                            off
    time-udp                   off
    uuidd                         off
    vami-lighttp             235
    vami-sfcb                 235
    vaos                          235
    vmware-inventoryservice  on
    vmware-logbrowser        off
    vmware-netdumper         off
    vmware-rbd-watchdog      off
    vmware-tools             on
    vmware-vpostgres         on
    vmware-vpxd              on
    vsphere-client           on
    xinetd                   off
    ypbind                   off
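    3- Stop the required service (vmware-vpxd in my case) using the following command: service service-name stop (ex: service vmware-vpxd stop). Note that chkconfig service-name off only prevents the service from starting at boot; it does not stop a service that is already running.
    4- Carry out your changes.
    5- Start the service again using the following command: service service-name start (ex: service vmware-vpxd start). If you disabled it with chkconfig, re-enable it with chkconfig service-name on.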
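
The long chkconfig listing above is easier to scan once it is reduced to the services that are not "off". A minimal sketch, assuming the two-column "name state" layout shown in the listing; the function names are hypothetical, and treating a value such as 235 as "enabled in those runlevels" is my reading of the output rather than something stated in the post.

```python
def parse_chkconfig(output):
    """Map service name to its state column ('on', 'off', or runlevels like '235')."""
    states = {}
    for line in output.splitlines():
        fields = line.split()
        if len(fields) != 2:
            continue  # skip blank lines and the shell prompt line
        states[fields[0]] = fields[1]
    return states


def enabled_at_boot(states):
    """Return the names of services whose state is anything other than 'off'."""
    return sorted(name for name, state in states.items() if state != "off")


if __name__ == "__main__":
    sample = """localhost:~ # chkconfig
    apache2                  off
    sshd                     235
    vmware-vpxd              on
    """
    print(enabled_at_boot(parse_chkconfig(sample)))
```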

    Backing up and restoring the vCenter Server Appliance vPostgres database (2034505)

    Purpose

    This article provides steps to back up and restore the vCenter Server Appliance's (VCSA) vPostgres database.

    Note: This article is only supported for backup and restore of the vPostgres database to the same vCenter Server Appliance. Use of image-based backup and restore is the only solution supported for performing a full, secondary appliance restore.

    Resolution

    Before you proceed, ensure that you have these installed:
    • SSH client for connecting to the vCenter Server Appliance.
    • WinSCP (or any SCP client) for retrieving and replacing the vPostgres database recovery file.

    Backing up the embedded vPostgres database

    To back up the embedded vPostgres database:
    1. Connect to the vCenter Server Appliance via SSH. For more information, see Enable or Disable SSH Administrator Login on the VMware vCenter Server Appliance section in the vCenter Server and Host Management Guide.
    2. When prompted, log in as the root user. The default password is vmware.
    3. Stop the VMware vCenter Server service by running this command:

      service vmware-vpxd stop
    4. On the vCenter Server Appliance virtual machine, navigate to the vPostgres utility directory using this command:

      cd /opt/vmware/vpostgres/1.0/bin

    5. To display the vPostgres database configuration file, run this command:

      cat /etc/vmware-vpx/embedded_db.cfg

    6. To back up the vCenter Server database, run this command:

      ./pg_dump EMB_DB_INSTANCE -U EMB_DB_USER -Fp -c > VCDBBackupFile

      Fill in the EMB_DB_INSTANCE and EMB_DB_USER from the embedded_db.cfg configuration information listed in Step 5. Fill in the VCDBBackupFile with the location and file name to be generated, for example:

      ./pg_dump VCDB -U vc -Fp -c > /tmp/VCDBackUp

      Caution: The /tmp/ directory is reset after rebooting the vCenter Server Appliance. VMware recommends that if this location is used, backup should be moved to a persistent location.

      Note: If prompted, enter the EMB_DB_PASSWORD password.

    7. Using WinSCP, connect to the vCenter Server Appliance and download the VCDBackUp file from /tmp/
    8. Start the VMware VirtualCenter Server service by running this command:

      service vmware-vpxd start
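
Steps 5 and 6 of the backup amount to reading the instance and user out of embedded_db.cfg and substituting them into the pg_dump command. A minimal sketch of that substitution, assuming the file consists of `KEY='value'` lines (the procedure only says the values sit between single quotation marks, so the exact layout is an assumption). The helper names are hypothetical, and the sketch uses pg_dump's `-f` output option in place of the shell redirect shown in step 6.

```python
def parse_embedded_db_cfg(text):
    """Parse KEY='value' lines into a dict, stripping the single quotes."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")  # split on the first '=' only
        value = value.strip()
        if len(value) >= 2 and value[0] == value[-1] == "'":
            value = value[1:-1]
        settings[key.strip()] = value
    return settings


def pg_dump_command(settings, backup_file):
    """Build the pg_dump argument list from the parsed settings."""
    return [
        "./pg_dump",
        settings["EMB_DB_INSTANCE"],
        "-U", settings["EMB_DB_USER"],
        "-Fp", "-c",
        "-f", backup_file,  # same effect as '> backup_file' in the KB step
    ]


if __name__ == "__main__":
    cfg = "EMB_DB_INSTANCE='VCDB'\nEMB_DB_USER='vc'\n"
    print(" ".join(pg_dump_command(parse_embedded_db_cfg(cfg), "/tmp/VCDBackUp")))
```

Building an argument list rather than a shell string avoids quoting problems if a value contains characters such as `$`.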

    Restoring from the backend vPostgres database file

    To restore from the backed-up vPostgres database file:

    Note: Ensure that you take a snapshot of the vCenter Server Appliance virtual machine before proceeding. This allows you to restore the database from the snapshot if this restore procedure fails.

    1. Connect to the vCenter Server Appliance via SSH. For more information, see Enable or Disable SSH Administrator Login on the VMware vCenter Server Appliance section in the vCenter Server and Host Management Guide.
    2. Using WinSCP, connect to the vCenter Server Appliance and upload the backup copy of the VCDBackUp file into the /tmp/ directory
    3. To display the new vPostgres database configuration file, run this command:

      cat /etc/vmware-vpx/embedded_db.cfg
    4. Navigate to the vPostgres utility directory by running this command:

      cd /opt/vmware/vpostgres/1.0/bin

    5. Stop the VMware vCenter Server service by running this command:

      service vmware-vpxd stop
    6. To restore the vCenter Server vPostgres database from backup, run this command:

      PGPASSWORD='EMB_DB_PASSWORD' ./psql -d EMB_DB_INSTANCE -Upostgres -f VCDBBackupFile

      Fill in the EMB_DB_INSTANCE and EMB_DB_PASSWORD from the embedded_db.cfg configuration information listed in Step 3. Fill in the VCDBBackupFile with the location and file name to be used, for example:

      PGPASSWORD='g<T4EuybGsA=kG$G' ./psql -d VCDB -Upostgres -f /tmp/VCDBackUp

      Note: Use single-quotes (') around the password as shown in the embedded_db.cfg configuration file.

    7. To restart the VMware VirtualCenter Server service for the database restore to take effect, run this command:

      service vmware-vpxd start

    Interpreting an ESX/ESXi host purple diagnostic screen (1004250)

    Purpose

    This article provides information to decode ESX/ESXi host purple screen errors.
    An ESX/ESXi purple screen error appears similar to:

    Note: This article uses the information in this purple screen as an example.

    Resolution


    What is the VMkernel?

    The VMkernel is the operating system core of ESX/ESXi. The kernel handles resource scheduling and device IO. Device IO is handled by the VMware network and storage stacks, which serve as a layer between the virtual file system, network devices and the device drivers that control physical devices.

    Interpreting the purple diagnostic screen

    If the VMkernel experiences an error, the error displays in a purple diagnostic screen. The purple diagnostic screen looks similar to:
    VMware ESX Server [Releasebuild-98103]
    PCPU 1 locked up. Failed to ack TLB invalidate.
    frame=0x3a37d98 ip=0x625e94 cr2=0x0 cr3=0x40c66000 cr4=0x16c
    es=0xffffffff ds=0xffffffff fs=0xffffffff gs=0xffffffff
    eax=0xffffffff ebx=0xffffffff ecx=0xffffffff edx=0xffffffff
    ebp=0x3a37ef4 esi=0xffffffff edi=0xffffffff err=-1 eflags=0xffffffff
    *0:1037/helper1-4 1:1107/vmm0:Fagi 2:1121/vmware-vm 3:1122/mks:Franc
    0x3a37ef4:[0x625e94]Panic+0x17 stack: 0x833ab4, 0x3a37f10, 0x3a37f48
    0x3a37f04:[0x625e94]Panic+0x17 stack: 0x833ab4, 0x1, 0x14a03a0
    0x3a37f48:[0x64bfa4]TLBDoInvalidate+0x38f stack: 0x3a37f54, 0x40, 0x2
    0x3a37f70:[0x66da4d]XMapForceFlush+0x64 stack: 0x0, 0x4d3a, 0x0
    0x3a37fac:[0x652b8b]helpFunc+0x2d2 stack: 0x1, 0x14a4580, 0x0
    0x3a37ffc:[0x750902]CpuSched_StartWorld+0x109 stack: 0x0, 0x0, 0x0
    0x3a38000:[0x0]blk_dev+0xfd76461f stack: 0x0, 0x0, 0x0
    VMK uptime: 7:05:43:45.014 TSC: 1751259712918392
    Starting coredump to disk Starting coredump to disk Dumping using slot 1 of 1...using slot 1 of 1... log


    Here is a breakdown of each section of the above purple diagnostic screen:
    • The Product and Build:

      VMware ESX Server [Releasebuild-98103]

      This section of the purple diagnostic screen identifies the product and build that has experienced the error. In this example, the product is VMware ESX Server build 98103.

    • The Error Message:

      PCPU 1 locked up. Failed to ack TLB invalidate
      This section of the purple diagnostic screen identifies the error message that has been reported. There are only a finite number of error messages that can be reported. These error messages are discussed later in this article.

    • The CPU Registers:

      frame=0x3a37d98 ip=0x625e94 cr2=0x0 cr3=0x40c66000 cr4=0x16c
      es=0xffffffff ds=0xffffffff fs=0xffffffff gs=0xffffffff
      eax=0xffffffff ebx=0xffffffff ecx=0xffffffff edx=0xffffffff
      ebp=0x3a37ef4 esi=0xffffffff edi=0xffffffff err=-1 eflags=0xffffffff

      These are the values that were in the physical CPU registers at the time of the error. The information in these registers may vary greatly between VMkernel errors. These registers can only be used internally when debugging a core dump of the VMkernel error. For more information about these registers, see http://www.intel.com/products/processor/manuals/ for Intel and http://support.amd.com/us/psearch/Pages/psearch.aspx for AMD. At the AMD site, search for the Architecture Programmer's manual for your specific processor type.

      Note: The preceding links were correct as of March 28, 2013. If you find the links to be broken, provide feedback on the article and a VMware employee will update the article as necessary.
    • The Physical CPU:

      *0:1037/helper1-4 1:1107/vmm0:Fagi 2:1121/vmware-vm 3:1122/mks:Franc

      This section of the purple diagnostic screen identifies the physical CPU that was running instructions during the VMkernel error. In the example, the * beside the 0 indicates that physical CPU 0 was running an operation at the time of the failure. In newer versions of ESX, instead of including an *, the preceding letters CPU are included. For example, if the same error as the above were to occur in newer versions of VMware ESX, the same line appears as:

      CPU0:1037/helper1-4 cpu1:1107/vmm0:Fagi cpu2:1121/vmware-vm cpu3:1122/mks:Franc.
      This section of the purple diagnostic screen also describes the world (process) that was running on the CPU at the time of the error. In the above example, the userworld running was helper1-4.

      Note: The name of the process may be truncated.

    • The Stack Trace:

      0x3a37ef4:[0x625e94]Panic+0x17 stack: 0x833ab4, 0x3a37f10, 0x3a37f48
      0x3a37f04:[0x625e94]Panic+0x17 stack: 0x833ab4, 0x1, 0x14a03a0
      0x3a37f48:[0x64bfa4]TLBDoInvalidate+0x38f stack: 0x3a37f54, 0x40, 0x2
      0x3a37f70:[0x66da4d]XMapForceFlush+0x64 stack: 0x0, 0x4d3a, 0x0
      0x3a37fac:[0x652b8b]helpFunc+0x2d2 stack: 0x1, 0x14a4580, 0x0
      0x3a37ffc:[0x750902]CpuSched_StartWorld+0x109 stack: 0x0, 0x0, 0x0
      0x3a38000:[0x0]blk_dev+0xfd76461f stack: 0x0, 0x0, 0x0


      The stack represents what the VMkernel was doing at the time of the error. In this example, it was trying to clear memory page tables (TLB). This information is a vital tool in the diagnosis of purple screen errors by evaluating the actions of the kernel at the time of the error.

    • The Uptime:

      VMK uptime: 7:05:43:45.014 TSC: 1751259712918392

      This section indicates how long a server had been running since the last boot. In this example, the ESX host was running for 7 days, 5 hours, 43 minutes and 45.014 seconds. The TSC value is the number of CPU clock cycles that have elapsed since the server was started.

    • The Core Dump:

      Starting coredump to disk Starting coredump to disk Dumping using slot 1 of 1...using slot 1 of 1... log
      This section of the purple diagnostic screen indicates that the contents of the VMkernel memory are being copied to the vmkcore partition.
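
The uptime and stack-trace sections described above have a regular enough layout to be pulled out of a captured screen text. The following is a minimal sketch, assuming the line formats of the example screen in this article; the regular expressions and function names are my own and are not a VMware tool.

```python
import datetime
import re

# "VMK uptime: days:hours:minutes:seconds.milliseconds"
UPTIME_RE = re.compile(r"VMK uptime: (\d+):(\d+):(\d+):(\d+(?:\.\d+)?)")
# "0x3a37f48:[0x64bfa4]TLBDoInvalidate+0x38f stack: ..."
FRAME_RE = re.compile(r"^\s*0x[0-9a-f]+:\[0x[0-9a-f]+\](\w+)\+0x[0-9a-f]+", re.M)


def parse_uptime(screen):
    """Return the VMK uptime as a timedelta, or None if the line is missing."""
    match = UPTIME_RE.search(screen)
    if match is None:
        return None
    days, hours, minutes = (int(group) for group in match.group(1, 2, 3))
    return datetime.timedelta(
        days=days, hours=hours, minutes=minutes, seconds=float(match.group(4))
    )


def stack_functions(screen):
    """Return the function names from the stack trace, top frame first."""
    return FRAME_RE.findall(screen)


if __name__ == "__main__":
    screen = (
        "0x3a37ef4:[0x625e94]Panic+0x17 stack: 0x833ab4, 0x3a37f10, 0x3a37f48\n"
        "0x3a37f48:[0x64bfa4]TLBDoInvalidate+0x38f stack: 0x3a37f54, 0x40, 0x2\n"
        "VMK uptime: 7:05:43:45.014 TSC: 1751259712918392\n"
    )
    print(parse_uptime(screen))
    print(stack_functions(screen))
```

The list of function names is a convenient key when searching the Knowledge Base or comparing several crashes.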

    Using the error message of the purple diagnostic screen to troubleshoot a vmkernel error

    The VMkernel error message generated by the purple screen can be used to identify the cause of the issue. The number of error messages that can be produced is finite. This is a list of known VMkernel error messages.
    • Type: Console Oops
      Example Error: COS Error: Oops
      Description: An ESX host can fail and cause a purple screen when there is a Service Console oops. Unlike most purple screen errors, it is not triggered by the VMkernel. Instead the error is triggered by the Service Console and occurs at the Linux level. These purple screen errors contain additional information from the Linux kernel. For more information about Console Oops, see Understanding an "Oops" purple diagnostic screen (1006802).

    • Type: Lost Heartbeat
      Example Error: Lost Heartbeat
      Description: The ESX VMkernel and the Service Console Linux kernel run at the same time on ESX. The Service Console Linux kernel runs a process called vmnixhbd, which heartbeats the VMkernel as long as it is able to allocate and free a page of memory. If no heartbeats are received before a timeout period of 30 minutes, the VMkernel triggers a COS Panic and a purple diagnostic screen that mentions a Lost Heartbeat. For more information on Lost Heartbeats, see Understanding a "Lost Heartbeat" purple diagnostic screen (1009525).

    • Type: Assert
      Example Error: ASSERT bora/vmkernel/main/pframe_int.h:527
      Description: Assert errors are software errors, because they are related to assumptions on which the program is based. This type of purple screen error is primarily caused by software issues. For more information on the assert error message, see Understanding ASSERT and NOT_IMPLEMENTED purple diagnostic screens (1019956).

    • Type: Not Implemented
      Example Error: NOT_IMPLEMENTED /build/mts/release/bora-84374/bora/vmkernel/main/util.c:83
      Description: A not implemented error message occurs when the code encounters a situation that it was not designed to handle. For more information, see Understanding ASSERT and NOT_IMPLEMENTED purple diagnostic screens (1019956).

    • Type: Spin count exceeded / Possible deadlock
      Example Error: Spin count exceeded (iplLock) - possible deadlock
      Description: A VMware ESX host may report a Spin count exceeded and possible deadlock in a purple diagnostic screen when a thread is attempting to execute in the critical section of code. Since it was trying to enter the critical section, the thread needed to poll a mutex for a lock prior to executing the code by conducting a spinlock operation. The thread continues to poll the mutex during the spinlock operation, but there is a certain limit of how many times it polls the mutex. For more information on Spin count exceeded errors, see Understanding a "Spin count exceeded" purple diagnostic screen (1020105).

    • Type: Failed to ack TLB invalidate
      Example Error: PCPU 1 locked up. Failed to ack TLB invalidate.
      Description: Physical CPUs fail when trying to clear memory page tables. For more information, see Understanding a Failed to ack TLB invalidate purple diagnostic screen (1020214).
    A purple diagnostic screen can also come in the form of an Exception. An Exception Handler is a computer hardware mechanism designed to handle some condition that changes the normal flow of execution (Division by Zero, Page Fault, etc). There is no trace from handlers, so you need logging to determine if handler faulted (or single step debugging). This is a list of common exceptions:
    • Type: Exception 13 (General Protection Fault)
      Example Error: #GP Exception(13) in world 4130:helper13-0 @ 0x41803399e303
      Description: A general protection fault (Exception 13) occurs under one of the following circumstances: the page being requested does not belong to the program requesting it (and not mapped in program memory), or the program does not have rights to perform a read or write operation on the page. For more information on Exception 13 or Page Fault, see Understanding Exception 13 and Exception 14 purple diagnostic screen events (1020181).

    • Type: Exception 14 (Page Fault)
      Example Error: #PF Exception type 14 in world 136:helper0-0 @ 0x4a8e6e
      Description: A page fault (Exception 14) occurs when the page being requested has not been successfully loaded into memory. For more information on Exception 14 or Page Fault, see Understanding Exception 13 and Exception 14 purple diagnostic screen events (1020181).

    • Type: Exception 18 (Machine Check Exception)
      Example Error: Machine Check Exception: Unable to continue
      Example Error: Hardware (Machine) Error
      Description: A Machine Check Exception (MCE) is generated by the hardware and reported by the host. Consult your hardware vendor in the event of an MCE. By evaluating the information presented, it is possible to identify the individual component reporting the error. For more information on MCE, see Decoding Machine Check Exception (MCE) output after a purple screen error (1005184).
    If your VMware ESX or ESXi host experiences an error similar to one of these that does not point you to a general article, search for the error message and stack trace information within the Knowledge Base. If the error has not been documented within the Knowledge Base, collect the diagnostic information from the VMware ESX host and submit a support request. For more information, see Collecting diagnostic information for VMware products (1008524) and How to Submit a Support Request.
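
Because the list of error messages is finite, the first triage step can be written as a lookup from message text to error type. This is only a sketch: the patterns below are derived solely from the example errors listed above, real messages may be worded differently, and `classify` is a hypothetical helper.

```python
import re

# (type name from the list above, pattern taken from its example error)
PATTERNS = [
    ("Console Oops", re.compile(r"COS Error: Oops")),
    ("Lost Heartbeat", re.compile(r"Lost Heartbeat")),
    ("Assert", re.compile(r"^ASSERT\b")),
    ("Not Implemented", re.compile(r"^NOT_IMPLEMENTED\b")),
    ("Spin count exceeded / Possible deadlock", re.compile(r"Spin count exceeded")),
    ("Failed to ack TLB invalidate", re.compile(r"Failed to ack TLB invalidate")),
    ("Exception 13 (General Protection Fault)", re.compile(r"#GP Exception\(13\)")),
    ("Exception 14 (Page Fault)", re.compile(r"#PF Exception type 14")),
    (
        "Exception 18 (Machine Check Exception)",
        re.compile(r"Machine Check Exception|Hardware \(Machine\) Error"),
    ),
]


def classify(message):
    """Return the error type for a purple-screen message, or None if unknown."""
    text = message.strip()
    for name, pattern in PATTERNS:
        if pattern.search(text):
            return name
    return None


if __name__ == "__main__":
    print(classify("PCPU 1 locked up. Failed to ack TLB invalidate."))
```

A result of None corresponds to the case described above: search the Knowledge Base for the message and stack trace, or open a support request.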

    Using the pattern analysis to troubleshoot multiple vmkernel errors on the same ESX host

    In the event that you experience multiple purple diagnostic screens from the same VMware ESX host, you can use the sample of multiple purple diagnostic screens to determine the likeliness of an issue being related to hardware or software. This can be done by identifying patterns in these sections of the purple diagnostic screen:
    • The error message and the stack trace:

      • If the error message and stack vary greatly between vmkernel errors, this indicates that software is not always hitting the same error. Although inconclusive, this may indicate a hardware issue.
      • If the error message and the stack are always identical between vmkernel errors, this indicates that software is always hitting the same error. Although inconclusive, this may indicate a software issue.
      • For more information about the error message you are experiencing, refer to the above section about the specific error message.
    • The physical CPU:

    • The world:

      • If the world value remains the same across multiple VMkernel errors, this indicates that the vmkernel is failing when receiving instructions from the same world. Although inconclusive, this may indicate a world is sending instructions that may be triggering the VMkernel error.
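
The pattern analysis above can be sketched as a comparison over several recorded crashes. This is a simplification under stated assumptions: each crash is a hand-built dict with `error`, `stack`, and `world` keys (a structure invented for this example), and any difference at all counts as "varying", whereas the article speaks of messages that vary greatly. The hints are as inconclusive as the article says they are.

```python
def compare_crashes(crashes):
    """Return inconclusive hints from the error, stack, and world of each crash."""
    if len(crashes) < 2:
        return ["need at least two crashes to compare"]
    signatures = {(crash["error"], tuple(crash["stack"])) for crash in crashes}
    worlds = {crash["world"] for crash in crashes}
    hints = []
    if len(signatures) == 1:
        hints.append("same error and stack every time: may indicate a software issue")
    else:
        hints.append("error or stack varies: may indicate a hardware issue")
    if len(worlds) == 1:
        hints.append("same world every time: that world may be triggering the error")
    return hints


if __name__ == "__main__":
    crashes = [
        {"error": "Failed to ack TLB invalidate", "stack": ["Panic", "TLBDoInvalidate"], "world": "helper1-4"},
        {"error": "Failed to ack TLB invalidate", "stack": ["Panic", "TLBDoInvalidate"], "world": "helper1-4"},
    ]
    for hint in compare_crashes(crashes):
        print(hint)
```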

    Understanding Exception 13 and Exception 14 purple diagnostic screen events in ESX 3.x/4.x and ESXi 3.x/4.x/5.x (1020181)


    Symptoms

    You experience purple diagnostic screens that contain information similar to:
    • VMware ESX [Releasebuild-164009 X86_64]
      #GP Exception(13) in world 4130:helper13-0 @ 0x41803399e303
    • VMware ESX Server [Releasebuild-123630]
      #PF Exception type 14 in world 1024:console @ 0x67f0ae

    Resolution

    Overview

    Operating systems manage the physical memory on a system by employing several methods:
    • Virtual memory or paging is designed to abstract the physical memory into virtual memory. This abstraction allows the operating system to allocate memory specific to programs and allows for other forms of memory management, including Memory Swapping, Shared Memory, and Memory Protection.

    • Memory Swapping occurs when operating systems optimize memory by moving data that is not being used to slower mediums and vice versa.

    • Shared Memory is a method commonly used if multiple programs need to communicate with each other. Shared memory allows multiple programs to access the same page of memory.

    • Memory Protection prevents a malicious or malfunctioning program from accessing memory pages from other programs.
    When a critical application has difficulty accessing memory, it generally manifests with an error involving one of these memory management operations.

    Exception 13: General Protection Fault

    A general protection fault (Exception 13) occurs under one of these circumstances:
    • The page being requested does not belong to the program requesting it (and is not mapped in program memory)
    • The program does not have rights to perform a read or write operation on the page
    Operating systems maintain a page table that include flags to mark pages as protected. If there is a conflict between the operation and the flag, the operating system traps the illegal request.
    Note: Segmentation faults are very similar to general protection faults.
    This is a sample of a General Protection Fault generated by ESX:
    VMware ESX [Releasebuild-164009 X86_64]
    #GP Exception(13) in world 4130:helper13-0 @ 0x41803399e303
    frame=0x4100c0117d78 ip=0x41803399e303 cr2=0x0 cr3=0xcff94000
    err=0 rflags=0x10246 cr4=0x16c
    rax=0x0 rbx=0x417ff492dbe0 rcx=0x417ff386cc80
    rdx=0x4100c0117f00 rbp=0x4100c0117f40 rsi=0x4100c0117e30
    rdi=0x410008c46220 r8=0x4100c0117e30 r9=0x4100c0117d50
    r10=0x3713e1b91ddd3 r11=0x41803399e1fc r12=0x4100c004fde0
    r13=0x410008c46220 r14=0x4100c0117e30 r15=0x417ff3614660
    0:4096/console *1:4130/helper13- 2:4098/idle2 3:4099/idle3
    @BlueScreen: #GP Exception(13) in world 4130:helper13-0 @ 0x41803399e303
    Code starts at 0x418033600000
    0x4100c0117f40:[0x41803399e303]GetDriverInfo+0x106 stack: 0x410002086ba8
    0x4100c0117f80:[0x4180336d9ef3]UplinkProcessAsyncCallsHelperCB+0x126 stack: 0x0
    0x4100c0117ff0:[0x418033663670]helpFunc+0x4f7 stack: 0x0
    0x4100c0117ff8:[0x0]Unknown stack: 0x0
    VMK uptime: 5:13:53:05.627 TSC: 968936502031893
    VMK checksum BAD: 0x3ee854ad7f0856e5 0x7009aad95a9042d9
    FSbase (0x0) GSbase (0x0) kernelGSbase (0x0)

    The Exception 13 General Protection Fault may be caused by either a hardware or a software issue. As the cause may vary significantly for these types of exceptions, a core-dump review may be performed by VMware. This process is usually not possible to perform without access to protected source code and analysis tools or processes. Collect diagnostic information from the VMware ESX host and submit a support request. For more information, see Collecting diagnostic information for VMware products (1008524) and How to Submit a Support Request. You can also contact your hardware vendor if you or VMware Technical Support are able to determine that a particular driver module or device has caused the exception.
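
    Before opening a support request, it can help to pull the exception summary and the backtrace function names out of a saved copy of the screen text. The following is a minimal sketch, not a VMware tool: the function name psod_summary and the file name psod.txt are hypothetical, and it assumes the text follows the layout of the sample above.

```shell
# psod_summary FILE: print the exception summary and the backtrace symbols
# from a saved copy of a purple diagnostic screen (hypothetical helper).
psod_summary() {
    # The @BlueScreen: line repeats the exception summary.
    grep '^ *@BlueScreen:' "$1" | sed 's/^ *@BlueScreen: *//'
    # Backtrace lines look like: 0xADDR:[0xIP]Symbol+0xOFFSET stack: ...
    # Frames without a Symbol+0xOFFSET part (such as "Unknown") are skipped.
    sed -n 's/^ *0x[0-9a-f]*:\[0x[0-9a-f]*]\([A-Za-z_][A-Za-z0-9_]*\)+0x[0-9a-f]* .*/\1/p' "$1"
}
```

    For example, psod_summary psod.txt prints the summary line first and then one function name per stack frame, which is usually the information a hardware or driver vendor asks for first.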

    Exception 14: Page Fault

    A page fault (Exception 14) occurs when the page being requested has not been successfully loaded into memory. There are both healthy and unhealthy page faults:
    • A healthy page fault results in the page being loaded from swapped memory to physical memory. The program is then allowed to proceed after the data has been properly loaded into physical memory.
    • An unhealthy page fault occurs when the page is not loaded in memory, and the operating system is unable to load the page from swapped to physical memory.
    This is a sample of a Page Fault generated by ESX:
    VMware ESX Server [Releasebuild-123630]
    Exception type 14 in world 1024:console @ 0x67f0ae
    frame=0x1402824 ip=0x67f0ae cr2=0x405f6000 cr3=0x13401000 cr4=0x6f0
    es=0x4028 ds=0x40404028 fs=0xffff0000 gs=0x0
    eax=0x409f6000 ebx=0x1000 ecx=0x400 edx=0x409f6000
    ebp=0x14028b4 esi=0x407c8000 edi=0x409f6000 err=11 eflags=0x10206
    *0:1024/console 1:1092/mks:ubunt 2:1089/vmware-vm 3:1027/idle3
    4:1028/idle4 5:1029/idle5 6:1030/idle6 7:1091/vmware-vm
    8:1032/idle8 9:1033/idle9 10:1034/idle10 11:1093/vcpu-0:ub
    12:1036/idle12 13:1037/idle13 14:1038/idle14 15:1039/idle15
    @BlueScreen: Exception type 14 in world 1024:console @ 0x67f0ae
    0x14028b4:[0x67f0ae]genericCopy+0x155 stack: 0xc0bbc60, 0x40081800, 0x0
    0x14028dc:[0x67f3d6]vmk_SgCopy+0x41 stack: 0xc0bbc60, 0x40081800, 0x0
    0x140292c:[0x7cef13]SCSICompleteFragment+0x1ae stack: 0xc005d00, 0x0, 0xc3100
    0x14029c4:[0x7d081c]SCSICompletePathCommand+0x453 stack: 0xc005d00, 0x125, 0x148a4f8
    0x1402a60:[0x7cafff]SCSICompleteAdapterCommand+0x3da stack: 0xc005d00, 0x2, 0x1402de0
    0x1402ac0:[0x88343f]vmk_scsi_dump_active+0x20e stack: 0x0, 0x10a, 0x6a525f0
    0x1402b30:[0x61811e]BHCallHandlersInt+0xf5 stack: 0x2ad0, 0x0, 0x1402b88
    0x1402b88:[0x618614]BH_Check+0x2bb stack: 0x1, 0x1402bac, 0x1752d49
    0x1402bac:[0x61fb8e]IDT_HandleInterrupt+0x85 stack: 0x1402bf8, 0x0, 0xb638000
    0x1402bc0:[0x61fcb5]IDT_IntrHandler+0x4c stack: 0x1402bf8, 0x4028, 0x1454028
    0x1402c70:[0x692c6c]CommonIntr+0xb stack: 0x1489500, 0x0, 0x1402de0
    0x1402e1c:[0x7615e4]CpuSchedDispatch+0x487 stack: 0x2390a60, 0x1489500, 0x0
    0x1402e88:[0x763eaa]CpuSchedDoWaitDirectedYield+0x351 stack: 0x0, 0x1f55e60, 0x0
    0x1402ea4:[0x763fda]CpuSched_WaitIRQ+0x31 stack: 0xfedcba90, 0x6, 0x1f55e60
    0x1402ec4:[0x69197f]VMNIXVMKSyscall_Idle+0xe2 stack: 0x1402f6c, 0x6915cf, 0x0
    0x1402ecc:[0x68669c]VMNIXVMKSyscallUnpackIdle+0x7 stack: 0x0, 0x0, 0x0
    0x1402f6c:[0x6915cf]HostSyscall+0xf6 stack: 0x1402fbc, 0xc03d9f98, 0x1c
    0x1402fe8:[0x6909e3]HostVMKEntry+0xce stack: 0x0, 0x0, 0x0
    VMK uptime: 0:01:58:34.004 TSC: 15137542595232
    Starting coredump to disk Starting coredump to disk Dumping using slot 1 of 1... using slot 1 of 1... log
     
    The Exception 14 Page Fault may be caused by either a hardware or a software issue. As the cause may vary significantly for these types of exceptions, a core-dump review may be performed by VMware. This process is usually not possible to perform without access to protected source code and analysis tools or processes. Collect diagnostic information from the VMware ESX host and submit a support request. For more information, see Collecting diagnostic information for VMware products (1008524) and How to Submit a Support Request. You can also contact your hardware vendor if you or VMware Technical Support are able to determine that a particular driver module or device has caused the exception. To find out more about page-fault exceptions, see the Formats and Encodings of SSE2 Floating-Point Instructions table in the Intel 64 and IA-32 Architectures Software Developer’s Manual.
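
    When sorting through several saved screens, the first line of each is enough to tell the two exception types apart. A small sketch of that classification follows. The function name exception_name is hypothetical; the "Exception(13)" and "Exception type 14" spellings come from the samples in this article, while the other two spellings are assumptions added for symmetry.

```shell
# exception_name LINE: map a purple screen summary line to the fault name
# used in this article (hypothetical helper).
exception_name() {
    case "$1" in
        *"Exception(13)"*|*"Exception type 13"*) echo "General Protection Fault" ;;
        *"Exception(14)"*|*"Exception type 14"*) echo "Page Fault" ;;
        *) echo "other" ;;
    esac
}
```

    For example, exception_name "Exception type 14 in world 1024:console @ 0x67f0ae" prints Page Fault. Any line that matches neither pattern is reported as other rather than guessed at.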

    Extracting the log file after an ESX or ESXi host fails with a purple screen error

    Purpose
    This article provides steps to extract a log from a vmkernel-zdump file after a purple diagnostic screen error. This log contains similar information to that seen on the purple diagnostic screen and can be used in further troubleshooting.

    This article assumes that a vmkernel-zdump file is available. If an ESX or ESXi host has failed with a purple diagnostic screen, but no vmkernel-zdump file is available, see:
    Note: For generating a VMkernel zdump manually from a dump file in ESXi 5.5, see Generating a VMkernel zdump manually from a dump file in ESXi 5.5 (2081902).

    Resolution

    To resolve this issue, extract the log file from a vmkernel-zdump file using a command line utility on the ESX or ESXi host. This utility differs for different versions of ESX or ESXi.
    • For ESX 3.x use the vmkdump utility:

      # vmkdump -l vmkernel-zdump-filename
    • For ESXi 3.5, ESXi/ESX 4.x and ESXi 5.x, use the esxcfg-dumppart utility:

      # esxcfg-dumppart -L vmkernel-zdump-filename
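
    The version rule above can be written as a small lookup. This is only a sketch: the function name dump_log_tool is hypothetical, it takes the product name and major version as arguments instead of detecting them, and it treats "ESXi 3" as ESXi 3.5. Releases not listed in this article are rejected rather than guessed at.

```shell
# dump_log_tool PRODUCT MAJOR: print the log extraction command for a release,
# following the two cases listed in this article (hypothetical helper).
dump_log_tool() {
    case "$1:$2" in
        ESX:3)
            echo "vmkdump -l" ;;
        ESXi:3|ESX:4|ESXi:4|ESXi:5)
            echo "esxcfg-dumppart -L" ;;
        *)
            echo "no rule for $1 $2.x in this article" >&2
            return 1 ;;
    esac
}
```

    For example, dump_log_tool ESXi 5 prints esxcfg-dumppart -L, to which the vmkernel-zdump file name is then appended.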
    To extract the log file from a vmkernel-zdump file:
    1. Find the vmkernel-zdump file in the /root/ or /var/core/ directory:

      # ls /root/vmkernel* /var/core/vmkernel*
      /var/core/vmkernel-zdump-073108.09.16.1


    2. Use the vmkdump or esxcfg-dumppart utility to extract the log. For example:

      # vmkdump -l /var/core/vmkernel-zdump-073108.09.16.1
      created file vmkernel-log.1

      # esxcfg-dumppart -L /var/core/vmkernel-zdump-073108.09.16.1
      created file vmkernel-log.1


    3. The vmkernel-log.1 file is plain text, though it may start with null characters. Focus on the end of the log, which is similar to:

      VMware ESX Server [Releasebuild-98103]
      PCPU 1 locked up. Failed to ack TLB invalidate.
      frame=0x3a37d98 ip=0x625e94 cr2=0x0 cr3=0x40c66000 cr4=0x16c
      es=0xffffffff ds=0xffffffff fs=0xffffffff gs=0xffffffff
      eax=0xffffffff ebx=0xffffffff ecx=0xffffffff edx=0xffffffff
      ...


    4. For troubleshooting the cause of the purple diagnostic screen, see Interpreting an ESX host purple diagnostic screen (1004250).
    Note: The file name created for the log in this example is vmkernel-log.1. If another file with the same name already exists, the new file is created with the number suffix incremented.
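
    Because the extracted log may begin with null characters and only its end is of interest, one way to read it is to strip the null bytes and show the last lines. A minimal sketch, assuming the standard tr and tail utilities are available where the log is read (for example on a workstation the file was copied to); the function name log_tail and the default of 40 lines are arbitrary choices, not part of the VMware procedure.

```shell
# log_tail FILE [LINES]: print the last LINES lines (default 40) of an
# extracted vmkernel log, dropping NUL bytes so the text displays cleanly.
log_tail() {
    tr -d '\000' < "$1" | tail -n "${2:-40}"
}
```

    For example, log_tail vmkernel-log.1 shows the final 40 lines, and log_tail vmkernel-log.1 100 shows the final 100.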