Contents
Cumulative updates and Service Packs
Precautions for making changes in Production environment
Performing tests in Production environment
Steps to ensure before and during investigations of reported problem
Trivial checks that should be made in the beginning of problem solving session
Typical issues in SharePoint environments
404 Not found, 401 Unauthorized
Value does not fall into the expected range
User Profile Synchronization Service
Set of Tools helpful in investigations
Relevant levels of issues in ULS logs to look for
Best practices in analyzing errors from ULS logs
How to isolate a problem between a customization and the platform
Valuable observations to keep in mind
SharePoint in reality
SharePoint is a web application platform developed by Microsoft. Its main engine is ASP.NET backed by SQL Server. The underlying technologies are the same as in other commercial and open source web application platforms such as “SiteCore”, “EpiServer”, “DotNetNuke”, etc.
The real difference is the size of the vendor that develops the platform, which has both positive and negative aspects.
On the positive side, the architecture of SharePoint was designed and developed very carefully by some of the best minds at Microsoft. In general, SharePoint turned out to be a quite robust and reliable beast with an excellent, high-quality set of built-in functions and a high level of predictability.
On the negative side, the platform has grown enormously over time and tends to grow further. As a key market leader, Microsoft tries to release a new version of SharePoint every 3 years. No wonder that in such a race the quality of some parts suffers from insufficient testing.
Also, embedding the newest “cutting edge” technologies, not yet proven in real conditions, into each new version of the platform reduces reliability and often causes frustration after the “first sight” excitement. Obviously, this must never be told to your customer.
Some examples:
• SharePoint 2010
o User Profile Synchronization is based on a reduced version of the quite complex FIM engine. This service is notorious for being terribly unreliable.
o Non-unique “Unique DocumentID”; a DocID link is temporarily broken after a simple file relocation. This compromises the whole idea of a permanent link.
• SharePoint 2013
o Non-obvious initial performance issues that were fixed in a series of CUs.
o “Half baked” Managed Metadata navigation with partially non-working friendly URLs.
o New workflow engine based on Windows Azure Workflow (WF 4.0). Scalability improved at the cost of some previous functionality.
o New architecture of Search Services. Roles can no longer be changed quickly because there is no UI for it. You have to use a specific and not quite obvious set of Powershell commands to change the roles of search components.
Cumulative updates and Service Packs
Microsoft often recommends installing so-called Cumulative Updates and Service Packs in order to eliminate problems in a SharePoint environment. In fact, you should be very careful when deciding to install any of them.
In reality, Cumulative Updates often fix one set of problems but may bring a number of others. Many times Microsoft has had to urgently release a “new CU for fixing problems caused by previous CU” after extensive negative feedback on a released fix.
• A typical example for SharePoint 2010 is the Cumulative Update from August 2012, which reduced performance by automatically generating random (and often hidden!) duplicate nodes in the Top and Quick Launch Navigation of publishing sites.
• A typical example for SharePoint 2013 is the Cumulative Update from June 2015, which ruined hybrid UI parts of search integrations for half a year. Fortunately, a somewhat clumsy workaround that “restored” the functionality was published in August 2015 (removal of some non-important fields via SharePoint Designer).
• Other unpleasant examples include random un-provisioning of SyncDB, which completely ruins the User Profile Synchronization Service; revocation of security certificates for FAST servers, which may make the search crawler hang for no obvious reason; and the inability to complete a CU installation with the standard UI-based Configuration Wizard, which simply fails midway.
Simple rules that usually help here:
1. Avoid installing the most current Cumulative Update until the next one is released. Your SharePoint environment will be 3-4 months behind the most up-to-date version, but you avoid the risk of running into fresh, still unknown issues.
2. Install a new CU or Service Pack in the Test environment first. This gives you an idea of what kind of problems you may experience in Production. Obviously, fixing issues in the Test environment is less painful.
Precautions for making changes in Production environment
As a rule, any actions in the Production environment that affect the availability of web applications for end users are prohibited without approval from the customer.
You should always inform your project manager, or the customer directly if allowed, before taking any steps that affect performance, such as restarting an application pool or the whole IIS; installing hot fixes, cumulative updates, or service packs; rebooting servers; or recreating and reconfiguring service applications.
Also, before you install any Service Pack or Cumulative Update, always read the installation instructions carefully and check the list of prerequisites, which may mention other mandatory updates that have to be installed in advance.
• For example, Service Pack 2 for SharePoint 2010 requires SP1, the CU from June 2011, and the CU from June 2013 to be installed first. You cannot install missing CUs afterwards and may lose some of the important improvements from those CUs.
Performing tests in Production environment
Always prefer working on the problem on your own development machine or in the customer's Test environment, if one exists. Certainly, there are cases when the problem exists or is reproducible only in the customer's Production environment.
If you need to perform tests in the Production environment, follow several simple rules:
1. Your tests must not affect the work of end users; if in doubt, get approval from the customer first.
2. If possible, create your own separate site collection with a clear title like “Testi – Jussi Virta – 15.10.2013”. Make sure your site collection is not visible in the navigation for end users after creation.
3. If rule 2 is not possible in your case, create your own separate hidden list or document library where you can generate your test data, for example, using Powershell.
• You can create a list or library and then mark it as hidden using the Powershell commands $list = …; $list.Hidden = $true; $list.Update();
4. If rule 3 is not possible either and you need to use live data, get approval from the customer for any temporary changes you may make to that data.
5. After completing the tests, do not forget to clean up after yourself.
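The hidden-list approach from rule 3 could look roughly like the sketch below. It assumes the SharePoint snap-in is available on the server; the site URL, the list title and the choice of the GenericList template are hypothetical placeholders, not values from any real environment.

```powershell
# Sketch only: the URL and the list title are hypothetical placeholders.
Add-PSSnapin Microsoft.SharePoint.PowerShell -ErrorAction SilentlyContinue

$web = Get-SPWeb "http://intranet.example.local/sites/team"
try {
    # SPListCollection.Add returns the ID (GUID) of the new list
    $listId = $web.Lists.Add("Test data - temporary", "Temporary test list",
        [Microsoft.SharePoint.SPListTemplateType]::GenericList)
    $list = $web.Lists[$listId]

    # Hide the list from end users and keep it off the Quick Launch
    $list.Hidden = $true
    $list.OnQuickLaunch = $false
    $list.Update()

    # Generate a few test items
    1..5 | ForEach-Object {
        $item = $list.Items.Add()
        $item["Title"] = "Test item $_"
        $item.Update()
    }
}
finally {
    # Release the SPWeb object to avoid leaking SPRequest objects
    $web.Dispose()
}
```

When the tests are finished, the same list can be removed again with $list.Delete(), which covers rule 5.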
Steps to ensure before and during investigations of reported problem
1. Make sure you clearly understand what the customer is complaining about. Information given by the customer is typically incomplete. If this is the case, force yourself to formalize your thoughts on paper, list the missing pieces of information, and ask your project manager or the customer to provide those details.
• Never leave “white spots” in your understanding of the exact set of actions, the exact names of accounts, and the exact URLs where the problem happens.
• Build a clear picture in your mind, or better on paper, of what exactly is wrong. Ignoring this simple rule often makes it impossible to investigate the problem effectively. Ask yourself honestly: do you really understand what you are trying to resolve?
2. Always protect your work against claims of incompetence or unresponsiveness from the customer. While you are working on the problem, contact the customer at least every 3 - 5 days.
• This means that even when there are no results, you force yourself to write a short status report to the customer that briefly describes any minor findings you have made since the last contact.
• Statements in a short status report must be clear to the customer. Avoid vague filler (“water”); it may annoy some people and create a negative impression of your abilities.
• Show either that you still have ideas on how to continue with the problem or that you recommend a workaround instead.
• Never show that you have no interest and no ideas on how to continue; such an admission leaves quite a negative impression of you as a professional.
Trivial checks that should be made in the beginning of problem solving session
1. Verify version of your SharePoint environment and find out which service packs and updates it includes
• (Get-SPFarm).BuildVersion
• http://technet.microsoft.com/en-us/sharepoint/jj891062.aspx
• http://technet.microsoft.com/en-us/sharepoint/ff800847.aspx
• http://technet.microsoft.com/en-us/sharepoint/bb735839.aspx
2. Open Central Administration and review critical problems reported by Health Analyzer.
• Unclear problems are often caused by an incomplete upgrade of one or several servers in the Farm.
• A typical example from practice: the administrator forgot to run the SharePoint Products Configuration Wizard after applying the update. A variation of this: the wizard failed in the middle of execution, but the administrator did not investigate the reasons and left the SharePoint Farm in a state of incomplete update.
• Try not to be such a negligent fellow; always make sure the upgrade is really complete. Check its status in Central Administration or execute a Powershell command like (Get-SPServer $env:COMPUTERNAME).NeedsUpgrade. In general, it should show “False” once the upgrade has been completed, at least beyond the critical level.
3. Check ULS logs on all registered servers of SharePoint Farm.
4. Check Windows Event logs on all registered and related servers of SharePoint Farm.
• Including SQL Servers, FAST servers for SP 2010, Office WebApp servers for 2013, etc.
5. Check free disk space on all registered and related servers of SharePoint Farm.
6. Check for possible memory leaks. In general, memory leaks can be identified on the servers by one of the following symptoms:
• The size of a web application process (usually one of the w3wp.exe processes) grows permanently, exceeds 2.5 - 3 GB, and still continues to grow. Note that the response time of a web application usually becomes poor after the process exceeds 3 GB and keeps worsening.
• Windows Event log contains random errors with messages like “The application pools recycle intermittently”.
• ULS log contains errors with messages like “Potentially excessive number of SPRequest objects (<number>) currently unreleased on thread <number>. Ensure that this object or its parent (such as an SPWeb or SPSite) is being properly disposed. This object will not be automatically disposed”. Such messages usually point to problems in custom components.
• ULS log contains errors with messages like “System.AccessViolationException: Attempted to read or write protected memory. This is often an indication that other memory is corrupt”. Such messages usually point to problems in OTB components. You can try installing a fresh Service Pack or Cumulative Update to resolve this issue. Obviously, resolution is not guaranteed.
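Several of the checks above can be scripted. The following is a minimal sketch that has to be run locally on each SharePoint server; the 2.5 GB threshold is taken from the symptom list above, and everything else is an assumption about a typical setup rather than a complete health check.

```powershell
Add-PSSnapin Microsoft.SharePoint.PowerShell -ErrorAction SilentlyContinue

# Check 1: farm build number (compare it with the published build lists)
"Farm build: " + (Get-SPFarm).BuildVersion

# Check 2: servers that still wait for the Configuration Wizard.
# Role "Invalid" marks servers without SharePoint binaries (for example SQL Server).
Get-SPServer | Where-Object { $_.Role -ne "Invalid" } |
    Select-Object Address, NeedsUpgrade

# Check 5: free disk space on local fixed drives (DriveType 3), in GB
Get-WmiObject Win32_LogicalDisk -Filter "DriveType=3" |
    Select-Object DeviceID, @{ Name = "FreeGB"; Expression = { [math]::Round($_.FreeSpace / 1GB, 1) } }

# Check 6: IIS worker processes above the 2.5 GB mark
Get-Process w3wp -ErrorAction SilentlyContinue |
    Where-Object { $_.WorkingSet64 -gt 2.5GB } |
    Select-Object Id, @{ Name = "WorkingSetGB"; Expression = { [math]::Round($_.WorkingSet64 / 1GB, 2) } }
```

A single snapshot of the working set does not prove a leak; the symptom described above is permanent growth, so the last check is only meaningful when repeated over time.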
Typical issues in SharePoint environments
Performance problems
Usually this type of problem is the most annoying for the customer and the most complex for developers to resolve.
The routine steps given below may help to identify the root cause of performance issues by one or several typical symptoms.
• Check the total amount of memory and the consumed amount of memory on each server.
o Insufficient free server memory always causes severe performance problems in various areas of SharePoint like Search, Caching, Publishing, etc. The effect is more noticeable in SharePoint 2013 than in SharePoint 2010.
• Check whether the Web Front End servers have an Internet connection. If they do not, this can cause unclear performance problems in some network configurations. To suppress them, you can try disabling certificate revocation checks.
o Start > Run > gpedit.msc > Computer Configuration > Windows Settings > Security Settings > Public Key Policies > Certificate Path Validation Settings > Network Retrieval > Toggle “Define these policy settings” > Untoggle “Automatically update certificates in the Microsoft Root Certificate Program (recommended)”.
o http://support.microsoft.com/kb/2625048
o http://joelblogs.co.uk/2011/09/20/certificate-revocation-list-check-and-sharepoint-2010-without-an-internet-connection/
• Check the current load and its distribution across the servers (Performance Monitor > ASP.NET Applications / filter on specific ones) and the memory consumption of processes (Task Manager). The important counters to look at:
o Anonymous requests/sec.
o Requests/sec.
o Requests Total
o Requests Timeout
o Sessions Total
o Sessions Active
o etc.
• Use SPDisposeCheck/MSOCAF/SPCAF to check the codebase of custom DLLs for obvious memory leaks caused by bad development.
o http://archive.msdn.microsoft.com/SPDisposeCheck
o http://www.spcaf.com/blog/v4-0-7-1001-beta-2013-09-04/
o Keep in mind that none of the mentioned tools is a complete panacea against dispose mistakes. https://www.spcaf.com/blog/stop-using-spdisposecheck-or-msocaf-with-sharepoint-2013-now/
• Check for debug configuration settings in web.config of your web application
o <compilation batch="true" debug="false"> is a hidden evil that may cause unidentifiable memory leaks. If you find such a setting, change it to <compilation batch="false" debug="false"> at the first opportunity. Do not change web.config in the Production environment without approval from the customer.
• Check that output caching is enabled in your site collection.
o Root site > Site Actions > Site Settings > Site collection output cache
o In SharePoint 2013, remember that Output Cache and Blob Cache do not use the Distributed Cache service, while Object Cache does. Relate the observed performance symptoms accordingly.
• In SharePoint 2013, check for optimal settings of the Distributed Cache service (DCS).
o Refer to the following Technet article to verify whether you have any potential problems in your DCS configuration (http://technet.microsoft.com/en-us/library/jj219613.aspx)
o The memory allocated to the DCS cache size must not exceed 40% of the server's total memory, with a maximum of 16 GB on each server. Insufficient residual server memory causes severe performance problems; insufficient memory allocated to DCS causes moderate performance problems.
o DCS can run on one or several servers in the Farm. All servers where DCS is running must have exactly the same amount and configuration of memory. The recommended minimum total physical memory on a server that runs the Distributed Cache service is 8 GB (this is not the same as the cache size).
o If the allocation exceeds 16 GB, the server may unexpectedly stop responding for more than 10 seconds. If you are using a cache cluster with more than one cache host, ensure that the memory allocation for the Distributed Cache service's cache size is set to the same value on each cache host (use Update-SPDistributedCacheSize to reconfigure).
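The DCS sizing rule of thumb above (at most 40% of total memory, never more than 16 GB) is simple arithmetic. A minimal sketch of it is shown below; the function name Get-MaxDcsCacheSizeMB is hypothetical and not part of SharePoint.

```powershell
# Upper bound for the DCS cache size according to the rule of thumb above:
# at most 40% of total server memory and never more than 16 GB.
# The function name is hypothetical; it is not a SharePoint cmdlet.
function Get-MaxDcsCacheSizeMB {
    param([Parameter(Mandatory = $true)][long]$TotalMemoryMB)

    $fortyPercentMB = [long][math]::Floor($TotalMemoryMB * 0.4)
    $capMB = [long](16 * 1024)
    return [math]::Min($fortyPercentMB, $capMB)
}

Get-MaxDcsCacheSizeMB -TotalMemoryMB 16384   # server with 16 GB of memory
Get-MaxDcsCacheSizeMB -TotalMemoryMB 65536   # server with 64 GB of memory
```

The resulting value is only an upper bound to compare against the current configuration. If the cache size really has to be changed, follow the procedure from the Technet article mentioned above when calling Update-SPDistributedCacheSize.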
Access denied
Usually this typical issue is one of the easiest to resolve. The obvious cause is insufficient permissions to some area (site collection, content database, hidden lists like SharePoint’s help system, taxonomy, etc.). A more complex cause is an error coming from expired or missing security certificates; this can be detected by checking errors in Windows Event logs.
404 Not found, 401 Unauthorized
Usually the situation looks terrible: you can see that the requested file exists (via Powershell, in the SQL database, etc.), yet attempts to access it report that it is not found. You may even start suspecting that it was damaged, but the most common reasons for this issue are simple:
o The application pool account does not have permissions to access the content database
o The loopback check is enabled (default setting); refer to http://support.microsoft.com/kb/896861 for more details
o Incorrect proxy server configuration (can be changed via Internet Explorer > Internet Options > Connections > LAN settings)
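Whether the loopback check is active on a server can be verified without changing anything. The sketch below only reads the two registry values described in the KB article referenced above; it has to be run locally on the server in question.

```powershell
# Read-only check of the registry values related to the loopback check (see KB 896861)
$lsa = "HKLM:\SYSTEM\CurrentControlSet\Control\Lsa"

# DisableLoopbackCheck = 1 means the check is switched off completely
$loopback = Get-ItemProperty -Path $lsa -Name DisableLoopbackCheck -ErrorAction SilentlyContinue
if ($loopback -and $loopback.DisableLoopbackCheck -eq 1) {
    "Loopback check is disabled"
} else {
    "Loopback check is enabled (default)"
}

# Host names that are explicitly allowed to loop back to the local server
$msv = Get-ItemProperty -Path "$lsa\MSV1_0" -Name BackConnectionHostNames -ErrorAction SilentlyContinue
if ($msv) { $msv.BackConnectionHostNames } else { "No BackConnectionHostNames configured" }
```

Changing either value affects the security configuration of the server, so in the Production environment it requires approval from the customer like any other change.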
Value does not fall into the expected range
This is a scary-looking but relatively simple error that often happens during data migrations. One of the most annoying cases is when this error is thrown from the standard hierarchy of Site Manager. The message often means that an attempt to retrieve some value cannot find the corresponding field in a list, list in a web, web in a site collection, etc. Check the stack trace to see the context (field, list, or web), and then iterate over each object in that context using Powershell to find the offender.
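As an illustration of such an iteration, the sketch below walks through all lists of one web and reports views that reference fields missing from the list. Missing view fields are only one possible cause, assumed here for the example; the URL is a hypothetical placeholder.

```powershell
Add-PSSnapin Microsoft.SharePoint.PowerShell -ErrorAction SilentlyContinue

$web = Get-SPWeb "http://intranet.example.local/sites/migrated"   # hypothetical URL
try {
    foreach ($list in $web.Lists) {
        foreach ($view in $list.Views) {
            foreach ($fieldName in $view.ViewFields) {
                # Report view fields that no longer exist in the list
                if (-not $list.Fields.ContainsField($fieldName)) {
                    "{0} / {1}: missing field '{2}'" -f $list.Title, $view.Title, $fieldName
                }
            }
        }
    }
}
finally {
    $web.Dispose()
}
```

If the stack trace points to a different context (for example, a missing list or web), the same pattern applies with the loops adjusted to that level.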
User Profile Synchronization Service
This is the infamous, endless troublemaker that frequently fails after an otherwise peaceful installation of Service Packs and Cumulative Updates. Always make sure you have upgraded the server completely (i.e. you have successfully run the SharePoint Products Configuration Wizard after the installation).
Most of the symptoms and resolution steps are well documented in various Internet resources; for example, take a look at http://www.harbar.net/articles/sp2010ups.aspx.
In many situations it is enough to restart the User Profile Synchronization Service; do it only through Central Administration > Manage services on server.
Sometimes the restart attempt hangs with the status “Starting”. First of all, try restarting the server itself. If you do this in the Production environment, get approval from the customer first.
If the restart did not help, for fresh installations the quickest way to get rid of this problem is to simply delete and recreate the User Profile Service Application via Central Administration.
In an existing Production environment you can also recreate it, but make sure you have at least backups of SyncDB and ProfileDB in order to keep all existing user profiles untouched. And of course, get approval from the customer first.
Search Service
The architecture of the Search Service evolved significantly in SharePoint 2013 compared with SharePoint 2010. Microsoft states that it became simpler and significantly more stable, but practice shows the truth is partially the opposite.
Elimination of typical problems related to the Search Service is similar to the steps described above for the User Profile Synchronization Service:
• Make sure you have installed fresh Cumulative Updates.
• Make sure you have upgraded your servers completely (i.e. you have run SharePoint Products Configuration Wizard after the installation).
• Restart your service via Central Administration.
• Reboot your server if restart of the service did not help.
• Recreate the Search Service from scratch. There is no need to keep backups because the content can be re-crawled; however, keep in mind this may take time in case of large data volumes.
Set of Tools helpful in investigations
Powershell
This is one of the most underestimated tools, and it often helps to find the reason for a problem quickly and effectively. In practice, you can connect to the internal structure of almost any area of SharePoint, output it to a file on disk, and analyze it further.
Case Studies:
1. An attempt to open a list via the UI fails with the strange exception “One of fields is not installed properly”.
$web = Get-SPWeb <url>
$list = $web.Lists["<Title of your list>"]
$list.SchemaXml > c:\temp\broken-list.xml   # Open in the browser and analyze the fields found.
$list.Fields | foreach {write-host $_.StaticName}   # Compare with the fields present in the schema.
$list.ContentTypes | foreach {write-host ''; write-host $_.Name; write-host ''; $_.Fields | foreach {$_.StaticName}}   # Compare with the output of list fields.
$list.ContentTypes["<Name of suspected CT>"].SchemaXml > c:\temp\list-CT.xml
$web.ContentTypes["<Name of suspected CT>"].SchemaXml > c:\temp\site-CT.xml   # Compare the structure with the CT in the list.
2. An instance of some service application hangs with the status “Starting” on a specific server. It is not possible to stop it through the UI.
$instance = @(Get-SPServiceInstance | where {$_.TypeName -eq "User Profile Synchronization Service"})   # Check the index of the hanging instance by its status; let's assume it is 0.
Stop the hanging instance of the service application.
$instance[0].Stop()
Note: the method .Stop() is different from .Unprovision() (the former is safe, while the latter can be unsafe).
ULS logs
Logs generated by the Unified Logging Service of SharePoint (ULS logs) are specific to each server in the SharePoint Farm. ULS logs are regular text files that can be investigated with any suitable tool like Notepad++, the open source application named “ULS Viewer” (http://ulsviewer.codeplex.com/), Microsoft Excel, etc.
• Default location of ULS logs is C:\Program Files\Common Files\Microsoft Shared\Web Server Extensions\<version number>\LOGS
• Actual location can be checked through Central Administration > Monitoring > Configure diagnostics logging > Trace Log Path
Note that ULS logs usually contain only basic information about errors and events that happened in SharePoint. This information can be supplemented with more details found in the Windows Event log around the same time.
In the default configuration, SharePoint has detailed error messages disabled at the web application level. This is done for security reasons.
• In the development environment you can change the default settings and enable detailed error messages by changing <customErrors mode="On" /> to <customErrors mode="Off" /> and <SafeMode … CallStack="false" .. /> to <SafeMode … CallStack="true" .. />.
As a result, error messages displayed to end users look somewhat cryptic and tell pretty much nothing. Fortunately, the common error message is usually complemented with a so-called Correlation ID, for example:
Sorry, something went wrong
An unexpected error has occurred.
Technical Details
Troubleshoot issues with Microsoft SharePoint Foundation.
Correlation ID: be224d9c-ee50-1012-e247-309443cb026d
Date and Time: 10/14/2013 9:41:06 AM
What is a Correlation ID? Technically, it is the unique ID (GUID) of the SPRequest in which an error or a series of related errors happened.
• The same value is stored in Response.Headers["SPRequestGuid"]
This GUID is attached to each error message in ULS logs and can be used to retrieve the details of a specific error from there.
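Retrieving the entries for one Correlation ID could look like the sketch below. It uses the standard cmdlets Get-SPLogEvent and Merge-SPLogFile with the Correlation ID and the time from the sample error page above; the 5-minute window and the output path are assumptions.

```powershell
Add-PSSnapin Microsoft.SharePoint.PowerShell -ErrorAction SilentlyContinue

$correlationId = "be224d9c-ee50-1012-e247-309443cb026d"   # taken from the error page
$errorTime = Get-Date "2013-10-14 09:41:06"               # time shown on the error page

# Entries from the ULS logs of the local server around the time of the error
Get-SPLogEvent -StartTime ($errorTime.AddMinutes(-5)) -EndTime ($errorTime.AddMinutes(5)) |
    Where-Object { $_.Correlation -eq $correlationId } |
    Select-Object Timestamp, Level, Category, Message |
    Format-List

# Entries from all farm servers merged into one file (hypothetical output path)
Merge-SPLogFile -Path "C:\temp\correlation.log" -Correlation $correlationId -Overwrite
```

Narrowing the time window matters in practice, because scanning complete log files for a single GUID can take a long time on a busy server.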
Relevant levels of issues in ULS logs to look for
A ULS log can contain up to 10 different levels of reported issues. They are configurable in two groups available via Central Administration > Monitoring > Configure diagnostic logging.
Events reported to trace log (i.e. ULS log):
• Unexpected
• High
• Monitorable
• Medium
• Verbose
Events reported to event log (i.e. Windows Event log):
• Critical
• Error
• Warning
• Information
• Verbose
In practice, there is some confusion between the two groups because “Events reported to trace log” may partially appear in the Windows Event log, while “Events reported to event log” may partially appear in the ULS log.
So which level is the most relevant in investigations? The correct practical answer is “Unexpected”, even though many books state otherwise.
Exceptions of the Unexpected level are always collected regardless of log settings (unless someone intentionally switched off logging). Thus they represent the most critical errors that happened in SharePoint, and this is the type of exception most relevant for review.
Details of Unexpected exceptions can be further correlated with details found in Critical, Error (Exception), and High level entries located around the Unexpected ones, but not vice versa. The other levels can be considered simple messages.
Best practices in analyzing errors from ULS logs
There are two different patterns of analysis:
1. Current errors collected and analyzed in real time. This pattern is suitable for analyzing relatively simple errors. Suitable tools for such analysis are “ULS Viewer” (http://ulsviewer.codeplex.com/), which displays real-time errors one by one in order of occurrence, and Notepad/Notepad++ (less convenient, but always available).
2. History of errors that happened in the past. This pattern is better suited for analyzing more complex and occasional errors that happened at a particular time in the past. Suitable tools for such analysis are Powershell and Microsoft Excel.
SharePoint provides several Powershell cmdlets for operations on ULS logs. The most important of them:
• Merge-SPLogFile; combines trace log files from all farm servers into a single file for further analysis.
• Get-SPLogEvent; allows querying ULS logs with filters on specific text data
Any ULS log file, including a merged one, is a tab-separated text file that can be opened directly in Microsoft Excel.
• Copy the file to your machine, rename the extension from .log to .txt, open Microsoft Excel, open that .txt file and click Next > Next > Finish.
You can also analyze statistics on the frequency of certain errors on each server in the SharePoint Farm. To do this, you can write your own PS script that groups errors from the ULS logs of each server over several days by error message (the Message column).
Examples of such scripts are given below.
Part 1.
Script name and purpose:
Those results will be handled by the second script (see Part 2).
The output also makes it easy to review how Unexpected errors are distributed over time for each day.
Script’s logic:
    $currentfolder = (Get-Item .).FullName
    $logfiles = "D:\Logs\Sharepoint\SERVERNAME-201310*.log"
    # Split the pattern into folder and file mask, then list the matching ULS log files
    $matchingfiles = [System.IO.Directory]::GetFiles($logfiles.Substring(0, $logfiles.LastIndexOf("\")), $logfiles.Substring($logfiles.LastIndexOf("\") + 1))
    foreach( $logfile in $matchingfiles )
    {
        # ULS file names look like SERVERNAME-YYYYMMDD-HHMM.log; take the date part
        $datestamp = $logfile.Substring($logfile.IndexOf("-") + 1)
        $datestamp = $datestamp.Substring(0, $datestamp.LastIndexOf("-"))
        # One output folder per day
        $tmp = $currentfolder + "\" + $datestamp
        if( [System.IO.Directory]::Exists($tmp) -eq $false ) { mkdir $tmp | Out-Null }
        $tmp = $tmp + "\" + $logfile.Substring($logfile.LastIndexOf("\") + 1)
        if( [System.IO.File]::Exists($tmp) ) { [System.IO.File]::Delete($tmp) }
        $alllines = [System.IO.File]::ReadAllLines($logfile)
        foreach($line in $alllines)
        {
            # ULS columns are tab-separated; keep only lines of level Unexpected
            if( $line.IndexOf("`tUnexpected`t") -gt -1 ) { $line >> $tmp }
        }
    }

Part 2.
Script name and purpose:
count-errors-by-occurences.ps1
This script groups all errors of level “Unexpected” from the results of Part 1 (see above) by the value in the Message column and outputs the result to a single text file ready to be analyzed in Microsoft Excel.
The result gives a certain understanding of what kinds of errors happened on a particular server and in what amounts.
It is relatively easy to distinguish errors of OTB components from errors of custom components by reviewing error messages that contain stack trace references to specific .NET classes.
Script’s logic:
    $currentfolder = (Get-Item .).FullName
    $logfiles = "*.log"
    # 1 = SearchOption.AllDirectories: include the per-day subfolders created in Part 1
    $matchingfiles = [System.IO.Directory]::GetFiles($currentfolder, $logfiles, 1)
    $errors = new-object System.Collections.Generic.List[string]
    $errorcounts = new-object System.Collections.Generic.List[int]
    $errorlasttimes = new-object System.Collections.Generic.List[string]
    foreach( $logfile in $matchingfiles )
    {
        $alllines = [System.IO.File]::ReadAllLines($logfile)
        foreach($line in $alllines)
        {
            # Message column: text after the Level column, without the trailing Correlation column
            $content = $line.Substring($line.IndexOf("Unexpected`t") + "Unexpected`t".Length)
            $content = $content.Substring(0, $content.LastIndexOf("`t"))
            # Skip continuation lines of long messages
            if( $content.StartsWith("...") -eq $true ) { continue }
            # Timestamp column, truncated to minutes
            $latest = $line.Substring(0, $line.IndexOf("`t"))
            $latest = $latest.Substring(0, $latest.LastIndexOf(":"))
            $date1 = [DateTime]$latest
            if( $errors.IndexOf($content) -eq -1 )
            {
                $errors.Add($content)
                $errorcounts.Add(1)
                $errorlasttimes.Add($latest)
            }
            else
            {
                $index = $errors.IndexOf($content)
                $errorcounts[$index] = $errorcounts[$index] + 1
                $strdate2 = $errorlasttimes[$index]
                $date2 = [DateTime]$strdate2
                if( $date1 -gt $date2 ) { $errorlasttimes[$index] = $latest }
            }
        }
    }
    $ind = 0
    $log = $currentfolder + "\" + $env:COMPUTERNAME + "-201310.txt"
    # $message is used instead of $error, which is an automatic variable in Powershell
    foreach( $message in $errors )
    {
        $count = $errorcounts[$ind]
        $lasttime = ([DateTime]$errorlasttimes[$ind]).ToString("dd.MM.yyyy HH:mm")
        # Tab-separated output: count, time of last occurrence, message
        $count.ToString() + "`t" + $lasttime + "`t" + $message >> $log
        $ind = $ind + 1
    }
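For a quick look at recent errors on a single server, a similar summary can be sketched with Get-SPLogEvent and Group-Object instead of parsing the log files by hand. This is only an alternative illustration, not a replacement for the scripts above; the 24-hour window and the output path are assumptions.

```powershell
Add-PSSnapin Microsoft.SharePoint.PowerShell -ErrorAction SilentlyContinue

# Unexpected-level entries of the last 24 hours on the local server
$events = Get-SPLogEvent -StartTime (Get-Date).AddDays(-1) |
    Where-Object { $_.Level -eq "Unexpected" }

# Count occurrences per message and keep the time of the latest one
$events | Group-Object Message | Sort-Object Count -Descending |
    Select-Object Count,
        @{ Name = "LastSeen"; Expression = { ($_.Group | Sort-Object Timestamp | Select-Object -Last 1).Timestamp } },
        Name |
    Export-Csv "C:\temp\unexpected-summary.csv" -NoTypeInformation
```

The resulting CSV file opens directly in Microsoft Excel. For longer periods or for several servers, the file-based approach is likely to be faster because Get-SPLogEvent reads the log files sequentially.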
Windows Event log
This is the second useful tool; it provides details of errors that are often not recorded in ULS logs. The main branch to look at is Windows Logs > Application.
Case Studies:
1. An attempt to open the “Managed Metadata Service Application” ends with the exception “Managed Metadata Service is not running”. An attempt to restart it succeeds, but the error is still present. The ULS log does not contain more details.
Review the Windows Event Log and search for recent errors related to the Metadata service, for example:
A failure was reported when trying to invoke a service application: EndpointFailure
Process Name: w3wp.exe
Process ID: 14752
AppDomain Name: Central Administration - 80
AppDomain ID: 1
Service Application Uri: urn:schemas-microsoft-com:sharepoint:service:30b7e4680e304d4c90be84963b6a6713#authority=urn:uuid:7f02309218304afcbe623eb339923cf7&authority=https://vvsrv411:32844/Topology/topology.svc
Active Endpoints: 1
Failed Endpoints:1
Affected Endpoint: http://servername:32852/31b7e4680e304d4c90be84963b6a6714/MetadataWebService.svc
This looks much better, doesn’t it? Open IIS Manager and restart the corresponding application pool of the Managed Metadata Web Service on the corresponding server.
2. Assume that this morning Intranet users have difficulties accessing SharePoint. The front page opens terribly slowly although there is no real load on the server. The ULS log contains strange errors that state “Invalid windows identity for <username>”.
To get more details on the actual error, review the Windows Event Log and search for messages located around “Invalid windows identity for <username>”. You may find details like “Connection to <domain controller> has timed out”, “Directory Services: server not operational”, etc., which help to identify the actual reason.
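Searching the Application log on several servers at once can be sketched with the standard Get-WinEvent cmdlet. The server names and the 2-hour window below are hypothetical; remote access to the event logs has to be allowed for the account running the script.

```powershell
# Critical and Error events from the Application log of the last 2 hours.
# Level 1 = Critical, 2 = Error; the server names are hypothetical.
$servers = "SPWFE01", "SPAPP01"
foreach ($server in $servers) {
    Get-WinEvent -ComputerName $server -ErrorAction SilentlyContinue -FilterHashtable @{
        LogName   = "Application"
        Level     = 1, 2
        StartTime = (Get-Date).AddHours(-2)
    } | Select-Object MachineName, TimeCreated, ProviderName, Id, Message
}
```

Adjust StartTime to the moment reported in the ULS log so that the events found are really located around the error under investigation.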
Health Analyzer
This is mainly an informational tool that automatically shows the most obvious problems found in SharePoint. Those problems are not necessarily as critical as the tool reports, but you can get some quick ideas about failures in service applications and processes, existing orphans, servers that were not upgraded or were upgraded incompletely, etc. Health Analyzer is worth looking at, but do not take it too seriously. In some cases it may report “false positives” (for example, missing OOB web parts of reporting services).
IIS logs
IIS logs can sometimes contain interesting details. For example, if some area of SharePoint shows “404 Not found”, the IIS log may contain the actual HTTP code of the error.
Case studies:
1. An attempt to open the root site of a site collection fails with the error “404 Not found”, while the actual error code in the IIS logs may be “403 Forbidden” or “401 Access Denied”.
In many cases this means that the application pool account does not have access to the content database that stores the site collection. This often happens in data migrations, so you should not blindly trust a misleading “404 Not found” message; check the IIS log as well. Alternatively, just make sure the application pool account has “db_owner” level permissions in your content database.
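Finding the real status code for a given URL in an IIS log can be sketched as below. The function name Get-IisStatusForUrl and the sample log lines are hypothetical; the sketch assumes the default W3C log format, in which a #Fields: header names the space-separated columns.

```powershell
# Returns the URL and HTTP status of all requests whose path starts with $UrlStem.
# Hypothetical helper for W3C-formatted IIS logs; not part of IIS or SharePoint.
function Get-IisStatusForUrl {
    param(
        [Parameter(Mandatory = $true)][AllowEmptyString()][string[]]$Lines,
        [Parameter(Mandatory = $true)][string]$UrlStem
    )
    $uriIndex = -1
    $statusIndex = -1
    foreach ($line in $Lines) {
        if ($line.StartsWith("#Fields:")) {
            # The header defines the column order for the lines that follow
            $fields = $line.Substring("#Fields:".Length).Trim().Split(" ")
            $uriIndex = [array]::IndexOf($fields, "cs-uri-stem")
            $statusIndex = [array]::IndexOf($fields, "sc-status")
            continue
        }
        # Skip other comment lines and data lines without a usable header
        if ($line.StartsWith("#") -or $uriIndex -lt 0 -or $statusIndex -lt 0) { continue }
        $values = $line.Split(" ")
        if ($values.Count -le [math]::Max($uriIndex, $statusIndex)) { continue }
        if ($values[$uriIndex].StartsWith($UrlStem, [StringComparison]::OrdinalIgnoreCase)) {
            New-Object PSObject -Property @{ Url = $values[$uriIndex]; Status = $values[$statusIndex] }
        }
    }
}

# Hypothetical sample lines; with a real log use: -Lines (Get-Content <path to log file>)
$sample = @(
    "#Fields: date time cs-method cs-uri-stem sc-status",
    "2013-10-14 09:41:06 GET /sites/team/default.aspx 401",
    "2013-10-14 09:41:07 GET /_layouts/images/logo.png 200"
)
Get-IisStatusForUrl -Lines $sample -UrlStem "/sites/team"
```

Piping the result to Group-Object Status gives a quick overview of which codes IIS actually returned for the problematic area.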
A very useful tool for researching performance issues, but the results can be difficult to analyze without previous experience. Take a look at the chapter above that discusses Performance problems.
Developer Dashboard
This was a very convenient UI-based tool in SharePoint 2010; in SharePoint 2013 it became quite heavily overloaded with functions and looks less relevant.
The tool can be enabled in OTB master pages via Powershell:
$ds = [Microsoft.SharePoint.Administration.SPWebService]::ContentService.DeveloperDashboardSettings
$ds.DisplayLevel = 'OnDemand'
$ds.TraceEnabled = $true
$ds.Update()
Custom master pages can optionally include this tool with the OTB control SharePoint:DeveloperDashboardLauncher.
Developer Dashboard outputs diagnostic information on the page that can help you to troubleshoot problems with page components that would otherwise be quite difficult to isolate.
For example, it clearly shows issues with incorrect disposal of SPSite / SPWeb objects, the relative performance of different web parts on a page, the execution time of SQL queries, problems with expired security certificates, etc.
In many cases Developer Dashboard lets you identify moderately complex performance problems faster than routine reviews of ULS logs and monitoring of SQL queries in the profiler. However, it usually does not give a clear answer about the root cause of the problem.
Case studies from practice:
1. The front page of an Internet site based on SharePoint 2013 loads slowly although it contains only 5 relatively light web parts.
A careful look at the output of Developer Dashboard identified two OTB Content By Search Web Parts with a loading time of around 1 second each. Although Developer Dashboard found the actual offenders, it was not clear why OTB web parts performed so poorly.
I made an absolutely blind assumption that the problem could be fixed in one of the Cumulative Updates that I planned to install anyway. So I installed the CUs from March 2013 and August 2013, and the problem was gone.
2. An intranet site based on SharePoint 2010 suddenly started working terribly slowly although its performance had been quite good a day before. A quick overview of the request structure in Developer Dashboard revealed that the OTB method SecurityValidation randomly executed for more than 15 seconds.
After researching the name of this method in Internet blogs, I found out that the possible cause could be disabled Internet access on the server, expired validation certificates, and the certificate revocation policy enabled by default. Without going into more detail, the output of Developer Dashboard helped to identify the issue relatively quickly.
Fiddler
Fiddler is a nice tiny tool useful for investigating various network-related
issues, the content of HTTP headers, cookies, the type of authentication actually
used, etc. In the default configuration it uses its own built-in proxy server,
which may cause confusion in some situations. Also, Fiddler does not always
display information about internal requests made outside the browser.
Chrome
The Chrome web browser has a quite useful set of tools available via Settings >
Tools > Developer Tools. For example, you can quickly estimate how fast a
certain page loads, identify delays caused by network latency, etc.
SQL Server Management Studio
It allows you to perform relatively simple searches and direct checks of the
content stored in content databases. The tool requires some knowledge of T-SQL
for effective work.
How to isolate a problem between a customization and the platform
For obvious reasons, describing an exact sequence of steps is impossible, so
let’s see what kind of research actions may help you.
First of all, in case of errors you can distinguish the types of offending components by checking stack traces and evaluating the statistics of errors found in ULS logs and in Windows Event logs.
If you find any errors coming from custom components, you should investigate and fix their logic (or ask the developers of the components to do the same).
In case of errors coming from OOB components, you can try to eliminate them by installing a fresh service pack / cumulative update or by performing a case-specific investigation of the configuration of the offending OOB component.
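Error statistics by stack trace can be collected with a short script. The sketch below is a rough heuristic, not a definitive classifier. Get-StackTraceOwner is a hypothetical helper, and it assumes that frames in the Microsoft.* and System.* namespaces belong to the platform while everything else is custom code. It also assumes that the objects returned by Get-SPLogEvent expose the trace text in their Message property.

```powershell
# Hypothetical helper: guesses which code base an error belongs to
# by looking at the .NET stack frames found in its text.
function Get-StackTraceOwner {
    param([string] $Text)
    # .NET stack frames look like "at Namespace.Type.Method(args)"
    $frames = @([regex]::Matches($Text, '\bat\s+([A-Za-z_][^\s(]*)\(') |
        ForEach-Object { $_.Groups[1].Value } |
        Where-Object { $_.Contains('.') })
    if ($frames.Count -eq 0) { return 'NoStackTrace' }
    # Assumption: Microsoft.* and System.* frames belong to the platform
    $custom = @($frames | Where-Object { $_ -notmatch '^(Microsoft|System)\.' })
    if ($custom.Count -eq 0) { return 'Platform' }
    # Report the root namespace of the first custom frame
    return 'Custom: ' + ($custom[0] -split '\.')[0]
}

# Get-SPLogEvent is available only when the SharePoint snap-in is loaded
# (for example, in the SharePoint Management Shell).
if (Get-Command Get-SPLogEvent -ErrorAction SilentlyContinue) {
    Get-SPLogEvent -StartTime (Get-Date).AddHours(-1) |
        Group-Object { Get-StackTraceOwner $_.Message } |
        Where-Object { $_.Name -ne 'NoStackTrace' } |
        Sort-Object Count -Descending |
        Format-Table Count, Name -AutoSize
}
```

A custom frame anywhere in the stack marks the error as custom, even when the exception itself was thrown inside platform code, because the custom component is usually the caller that passed wrong input. Treat the result as a starting point for the investigation and not as a verdict.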
Psychological aspects
Valuable observations to keep in mind
No doubt Albert Einstein was one of the most brilliant minds and researchers of the 20th century. He made a number of memorable statements that are widely quoted. Many of those phrases are generally applicable to problem-solving practices because they force you to think. Let’s just recall some of them:
• You have to learn the rules of the game. And then you have to play better than anyone else.
• Information is not knowledge.
• Imagination is more important than knowledge.
• Logic will get you from A to B. Imagination will take you everywhere.
• The only source of knowledge is experience.
• The only real valuable thing is intuition.
• Make everything as simple as possible, but not simpler.
• Small is the number of people who see with their eyes and think with their minds.
• A person who never made a mistake never tried anything new.
• We cannot solve our problems with the same thinking we used when we created them.
• If you can't explain it simply, you don't understand it well enough.
• Never lose a holy curiosity.
• I have no special talent. I am only passionately curious.
• The important thing is not to stop questioning. Curiosity has its own reason for existing.
• It's not that I'm so smart; it's just that I stay with problems longer.
• No amount of experimentation can ever prove me right; a single experiment can prove me wrong.
• Insanity: doing the same thing over and over again and expecting different results.
• Most people say that it is the intellect which makes a great scientist. They are wrong: it is character.
• Weakness of attitude becomes weakness of character.
Correct setup of your mind
• The quickest way to become proficient in a certain area is to practice in it with passion. You can read tons of books and collect dozens of paper certificates to decorate your walls, but only actual hands-on experience and practical contribution to other people eventually make you a valuable and sought-after professional.
• Imagine the successful end result at the beginning, and never start doubting your abilities in the middle. Thoughts can materialize, but 90% of people ignore this simple fact and treat it as just a pathetic saying. The other 10%, who set themselves up for success, eventually succeed regardless of what exactly happened in the middle.
• Tell yourself that finding a solution to this problem requires exactly this amount of time, no less and no more. Assume internally that everyone who says the problem is easier or more difficult than you estimated is simply wrong. If that person eventually turns out to be right, you will just admit this fact without emotion, but the chances of this happening are slim.
How to write reports
You can write your investigation and resolution reports in any free form that is
convenient for you.
There are several simple rules that you should follow:
1. Clearly state your conclusions and thoughts in normal human language. To help the reader understand, complement your statements with visual parts like tables, screenshots, etc., or vice versa.
• Lengthy Excel tables, pictures drawn in PowerPoint slides and similar “cryptic” content should not be the main source of information in your report. In fact, it is often challenging to understand on first review what kind of information is actually presented there. Certainly, those items can complement your report if the text refers to them as attachments.
2. Separate important blocks of text with easily identifiable headers. For example, many people start missing information when reading a part with boring technical details. If you put this part under the header “Technical details”, the reader may just skip it and concentrate on other parts first.
3. Try to avoid unexplained and unclearly stated parts. Think about what kind of questions your conclusions and explanations may raise and try to clarify them for the reader (well, this is sometimes difficult).
Internal reports
Internal reports are not supposed to be sent to the customer; they are intended
for the internal information of your superior, project manager, architect, etc.
You can discuss any issues and problems found in custom components or incorrect
configurations quite openly in those reports.
Just beware of accidentally sending the history of internal reports to the customer. Also, your email history should not contain any sensitive information like passwords, PINs or other credentials. Just replace it with “***” if it is present in the history.
External reports
External reports are supposed to be sent to the customer.
Ideally, you should not contact the customer directly but send the report to your project manager instead. But as you know, in practice this often does not work: the project manager is busy (or just lazy), the customer needs the results urgently, etc.
So if you have to send your report directly to the customer, never disclose more information than is actually required: “Yes, I reproduced the error; and yes, I think I managed to fix the problem. Could you please try again and confirm the problem is gone? My project manager will write more details later”. Always add at least your project manager in CC.
You should not lie to the customer. If you have promised to send something, you
should send it on time. However, you are not directly responsible for actions
that your project manager should take (like sending details).