3/23/2014

Enterprise Storage Advisory

Digital Edge uses a proprietary methodology when analyzing any type of enterprise storage. We want to make feedback from industry professionals available to the public. This could be a very useful guide for IT experts who are not deeply involved in storage but would like a high-level understanding of storage health, capacity and performance conditions. The proposed methodology is solely Digital Edge's approach to assessing enterprise storage and is not tied to any manufacturer or storage brand.

Before we begin, we want to remind you that storage is not just a capacity technology but a capacity AND performance technology, and the two must be evaluated together. While capacity is very easy to analyze, performance parameters can be confusing and far less obvious.

Areas to be analyzed:

1. Capacity allocation and expected IOPS. 

2. Expected IOPS and load from servers. 

3. Network expected performance and network load from servers

4. System errors and warnings

5. Patch levels and recommendations

6. Conclusion. 

Here is a brief description of the information collected and analyzed for each item. This description also explains why we believe our methodology is both valid and convenient for a high-level assessment. The methodology may not produce completely accurate, troubleshooting-ready statistics; instead, it assesses conditions and provides indicators for further tuning and troubleshooting.

Some fundamental statements to simplify our analysis:

  • Enterprise storage can be a SAN, a NAS or a unified platform playing the role of SAN and NAS at the same time.

  • An enterprise-class NAS is a SAN with servers attached to the SAN infrastructure that expose SAN storage to other servers over NAS protocols. In EMC terminology those servers are called "data movers." They are attached to the SAN through a fiber interface. From the SAN's perspective, data movers are clients just like any other servers connected to it.

  • It is relatively easy for clients to build those servers themselves without purchasing them from hardware manufacturers. However, servers pre-configured by manufacturers with high availability and a management interface may be beneficial.

  • A SAN consists of controllers that are connected to the Storage Area Network through multiple Fibre Channel and/or iSCSI interfaces on the front end and to disk trays on the back end.

  • Capacity is provided by disks. 

  • Performance is the function of performance parameters of disks themselves, controllers and the network. 

  • Each disk has pre-defined performance parameters. The faster the disk, the faster it can perform an I/O operation.

  • The more disks that share the I/O load, the better the performance of the system.

  • Disks are aggregated into RAID groups. Performance of the SAN disks is a function of the configuration of the RAID groups. Performance of a RAID group depends on the number of disks in the group, their speed and type, and the RAID write penalty.

  • RAID groups are divided into LUNs. LUNs are exposed to servers.

  • Because storage performance depends on RAID group configuration, LUNs on the same RAID group will affect each other. LUNs on separate RAID groups will not affect each other. This holds as long as network I/O is not a bottleneck.

  • Network performance is a function of the type and number of links to the Storage Area Network and the processing power of the controllers.

[Figure: Logical View of SAN]

1. Capacity Allocation and Expected IOPS. 

Capacity analysis is easily presented in a capacity report. Capacity is shown by RAID group, along with how each RAID group is divided into LUNs. The total expected I/O performance is displayed per RAID group.

RAID Group 0: RAID 5; Drive Type: FC; Capacity: 286GB; Percent Full: 99%; IOPS: 900
Disks: 0/0, 0/1, 0/2, 0/3, 0/4

  • LUN 61, PROD-ORALCE-Data: Size: 286GB; Host: NYORAN1/2; Type: Oracle ASM; Used: 192GB (51%); Free: 94GB

RAID Group 3: RAID 5; Drive Type: SATA; Capacity: 11005.93GB; Percent Full: 99%; Free: 0.928GB; IOPS: 630
Disks: 3/0, 3/1, 3/2, 3/3, 3/4, 3/5, 5/1

  • LUN 16, PROD-VMStore1: Size: 2048GB; Host: ESXi1/2/3/4/5; Type: VM Datastore; Used: 1.4TB; Free: 614GB

  • LUN 29, PROD-VMStore5: Size: 1TB; Host: ESXi1/2/3/4/5; Type: VM Datastore; Used: 969GB; Free: 55GB

  • LUN 30, PROD-SQLCUSTER_DATA: Size: 500GB; Host: NYSQL1/2; Type: Windows; Used: 299GB; Free: 201GB

  • LUN 0, Place Holder: Size: 1GB; Host: None

RAID Group 9: RAID 5; Drive Type: SATA; Capacity: 5502GB; Percent Full: 83%; Free: 927GB; IOPS: 360
Disks: 5/2, 5/3, 5/4, 5/5

  • LUN 41, PROD-ORACLE-LOGS: Size: 500GB; Host: NYORA1/2; Type: Oracle ASM; Used: 47GB; Free: 453GB

  • LUN 42, PROD-SQLCLUSTER_LOG: Size: 500GB; Host: NYSQL1/2; Type: Windows; Used: 136GB; Free: 453GB

  • LUN 45, PROD-EXCH-DATA: Size: 500GB; Host: EX1/EX2; Type: Windows; Used: 284GB; Free: 216GB

  • LUN 46, PROD-EXCH_LOG: Size: 1.4TB; Host: EX1/2; Type: Windows; Used: 699GB; Free: 1.3TB

  • LUN 49, QA-VMDATASTORE: Size: 325GB; Host: ESXi6/7/8; Type: VM Datastore; Used: 123GB; Free: 202GB

  • LUN 58, PROD-EXCH-DATA-II: Size: 500GB; Host: EX1/2; Type: Windows; Used: 166GB; Free: 334GB

  • LUN 68, QA-SQL-DATA-2: Size: 300GB; Host: QASQL1/2; Type: Windows; Used: Unmounted; Free: Unmounted; IOPS avg: 0; IOPS max: 0

 

The disk entries show each disk and its position in the disk tray. The RAID group information includes the RAID type, disk type, total capacity, free space and expected IOPS. Expected IOPS are calculated from the number of disks in the group, the disk speed and the RAID type.
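The calculation can be sketched as follows. This is a minimal illustration, not Digital Edge's actual formula: the per-disk figures (180 IOPS for a 15K FC disk, 90 IOPS for a SATA disk) are assumptions inferred from the sample report, where 5 FC disks give 900 IOPS and 7 SATA disks give 630 IOPS. The `effective_iops` function applies the commonly cited RAID write penalties to a given read/write mix; the sample report appears to show raw values without this adjustment.

```python
# Per-disk IOPS figures are assumptions inferred from the sample report
# (5 FC disks -> 900 IOPS, 7 SATA disks -> 630 IOPS); real drives vary.
ASSUMED_DISK_IOPS = {"FC_15K": 180, "SATA": 90}

# Commonly cited write penalties: back-end I/Os needed per host write.
WRITE_PENALTY = {"RAID0": 1, "RAID10": 2, "RAID5": 4, "RAID6": 6}


def raw_expected_iops(disk_count: int, disk_type: str) -> int:
    """Raw expected IOPS of a RAID group: number of disks times per-disk IOPS."""
    return disk_count * ASSUMED_DISK_IOPS[disk_type]


def effective_iops(raw_iops: float, raid_type: str, write_fraction: float) -> float:
    """Host-visible IOPS after the write penalty is applied to the write share."""
    if not 0.0 <= write_fraction <= 1.0:
        raise ValueError("write_fraction must be between 0 and 1")
    penalty = WRITE_PENALTY[raid_type]
    return raw_iops / ((1.0 - write_fraction) + write_fraction * penalty)


if __name__ == "__main__":
    print(raw_expected_iops(5, "FC_15K"))      # RAID Group 0 in the sample report
    print(raw_expected_iops(7, "SATA"))        # RAID Group 3 in the sample report
    print(effective_iops(900, "RAID5", 0.5))   # 50% writes on a RAID 5 group
```

With a write-heavy workload the host-visible figure drops well below the raw figure, which is why the RAID type matters as much as the disk count.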

LUN information includes the total capacity, the host(s) that mount the LUN, and the space used by the host(s).
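The capacity figures in such a report are simple arithmetic over the LUN list. The sketch below shows one way to derive them; the `Lun` class, the function names and the sample values are hypothetical and are not taken from the report above.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class Lun:
    name: str        # hypothetical LUN label
    size_gb: float   # capacity allocated to the LUN
    used_gb: float   # space used as reported by the host


def lun_percent_used(lun: Lun) -> float:
    """Share of the LUN that the host has filled, in percent."""
    return 100.0 * lun.used_gb / lun.size_gb


def raid_group_allocation(capacity_gb: float, luns: Iterable[Lun]) -> dict:
    """How much of a RAID group's capacity has been divided into LUNs."""
    allocated = sum(lun.size_gb for lun in luns)
    return {
        "allocated_gb": allocated,
        "unallocated_gb": capacity_gb - allocated,
        "percent_allocated": 100.0 * allocated / capacity_gb,
    }


if __name__ == "__main__":
    # Illustrative values only.
    luns = [Lun("LUN-A", 400, 300), Lun("LUN-B", 500, 125)]
    print(raid_group_allocation(1000, luns))
    for lun in luns:
        print(lun.name, lun_percent_used(lun))
```

Keeping allocation (RAID group level) separate from usage (LUN level) mirrors the two layers of the report: a RAID group can be fully allocated while its LUNs are still mostly empty.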

2. Expected IOPS And Load From Servers. 

In contrast to capacity, performance is something that is difficult to assess. Therefore we offer a method that allows assessing SAN performance and illuminates potential problem spots. IT professionals can then use different techniques to go deeper into actual performance tuning and troubleshooting.

We often see a common misconception about SANs. People tend to think that the more load you put on a SAN, the slower the SAN will work. That is wrong! A SAN works according to the parameters it was built with. If you configure a RAID group to provide 900 IOPS, it will deliver those expected IOPS. The applications on servers that are pushing I/O to the SAN may slow down, however, because the SAN cannot satisfy all of the requests at once. In that case, requests are queued on the server and the end user begins to feel that the SAN is performing more slowly. In actuality, the SAN is working at the same speed; it simply has more requests waiting for each other to finish.
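A toy model makes the point concrete. The sketch below assumes a RAID group that serves a fixed number of requests per second and a host that queues whatever cannot be served; the function name and the load figures are illustrative, and real queueing behavior is more complex.

```python
def simulate_backlog(offered_per_sec, capacity_iops):
    """Serve at most capacity_iops requests each second; queue the rest.

    Returns two lists with one entry per second: requests served and
    requests still waiting in the host queue at the end of that second.
    """
    backlog = 0
    served_log, backlog_log = [], []
    for offered in offered_per_sec:
        pending = backlog + offered
        served = min(pending, capacity_iops)  # the SAN never exceeds its rate
        backlog = pending - served            # the remainder waits on the host
        served_log.append(served)
        backlog_log.append(backlog)
    return served_log, backlog_log


if __name__ == "__main__":
    load = [600, 1200, 1500, 900, 300, 300]   # illustrative requests per second
    served, backlog = simulate_backlog(load, capacity_iops=900)
    for second, (s, b) in enumerate(zip(served, backlog), start=1):
        # Rough wait for a request joining the queue now: backlog / capacity.
        print(f"t={second}s served={s} queued={b} approx_wait={b / 900:.2f}s")
```

The served rate never exceeds 900 per second, yet the queue and the approximate wait grow while the offered load stays above that rate, which is what the end user perceives as a "slow SAN".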

SAN baseline performance can be easily tested with tools like Iometer. After the Storage Area Network connectivity and RAID groups are set up, the performance of the SAN itself should remain constant. Performance might be affected by a degraded RAID group, mismatched hot spare disks or a RAID rebuild. Under normal circumstances, however, the SAN will not slow down.

To assess SAN performance, we evaluate the expected IOPS provided by each RAID group. We then compare this value with the aggregated average and maximum IOPS pushed by servers to all LUNs of the analyzed RAID group. Here’s what it may look like:

RAID Group 1: RAID 10; Drive Type: FC; Capacity: 1,073GB; Percent Full: 99%; IOPS: 720
Disks: 1/0, 1/1, 2/0, 2/1

  • LUN 1, UAT-ORACLE-FILES: Size: 200GB; Host: NYORAUAT1/2; Type: Oracle; Used: 181GB; Free: 19GB; IOPS (avg/max): 45/354

  • LUN 8, PROD-ORACLE-FILES: Size: 200GB; Host: NYORAPROD1/2; Type: Oracle; Used: 150GB; Free: 50GB; IOPS (avg/max): 52/643

  • LUN 43, PROD-SQL-SERVER-DB: Size: 260GB; Host: NYSQLPROD; Type: Windows; Used: 184GB; Free: 76GB; IOPS (avg/max): 198/2077

  • LUN 44, PROD-SQLUAT-SERVER-DB: Size: 260GB; Host: none; Free: 260GB

TOTAL EXPECTED: 720

TOTAL PUSHED (avg/max): 295/3074


In this example, RAID Group 1 is a RAID 10 group built on 15,000 RPM Fibre Channel disks. The expected performance of such a configuration is 720 IOPS. The I/O is measured for LUN 1, LUN 8 and LUN 43 from the server side using built-in host tools such as PerfMon or iostat. Average and maximum values are recorded and the totals are then compared with the expected value.
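The totals in this example are plain sums of the per-LUN figures, as the short sketch below reproduces. The dictionary layout and function name are illustrative; the numbers are the ones from the RAID Group 1 example.

```python
# Per-LUN (average, maximum) IOPS measured on the hosts, from the example.
LUN_IOPS = {
    "LUN 1": (45, 354),
    "LUN 8": (52, 643),
    "LUN 43": (198, 2077),
}
EXPECTED_IOPS = 720  # expected IOPS of RAID Group 1


def aggregate_pushed(lun_iops):
    """Sum the per-LUN averages and the per-LUN maximums of a RAID group."""
    total_avg = sum(avg for avg, _ in lun_iops.values())
    total_max = sum(mx for _, mx in lun_iops.values())
    return total_avg, total_max


if __name__ == "__main__":
    total_avg, total_max = aggregate_pushed(LUN_IOPS)
    print(f"TOTAL EXPECTED: {EXPECTED_IOPS}")
    print(f"TOTAL PUSHED (avg/max): {total_avg}/{total_max}")
    print("average above expected:", total_avg > EXPECTED_IOPS)
    print("summed maximum above expected:", total_max > EXPECTED_IOPS)
```

Note that the summed maximum is an upper bound: it only reflects a real peak if all LUNs hit their maximum at the same moment.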

In the end, a follow-up report is created for the entire SAN:

RAID GROUP # | IOPS EXPECTED | AGGREGATED IOPS PUSHED (avg/max) | AGGREGATED WAITS (avg/max), ms
RAID Group 0 | 900 | 233/376 | 1.25/23
RAID Group 1 | 720 | 295/3074 | 7.21/865.06
RAID Group 2 | 360 | 2/134 | 315.04/29329.85
RAID Group 3 | 630 | 1233/6103 | 4302.15/27239
RAID Group 4 | 270 | 106/3160 | 580.97/14342.94
RAID Group 5 | 180 | 4.26/250 | 2546.45/51500
RAID Group 6 | 720 | 31.38/602 | 6462.16/224913.33
RAID Group 9 | 360 | 3145/29233 | 885.42/23764.33
RAID Group 10 | 720 | 6.6/305 | 4838.33/126350
RAID Group 11 | 720 | 45/2875 | 4958.44/160240.66
RAID Group 14 | 630 | 264/2696 | 1320/4030
RAID Group 15 | 900 | 164/1903 | 837.21/2990
RAID Group 16 | 720 | 23/2262 | 2.258/50
RAID Group 17 | 720 | 2.371/377 | 1.154/49
RAID Group 18 | 360 | 147/1394 | 1510/9170
RAID Group 19 | 360 | 35/2571 | 4.21/69.95
RAID Group 20 | 180 | 6.9/80 | 6.88/135.75
RAID Group 21 | 180 | 150/790 | 3.905/38.09
RAID Group 22 | 180 | 9/224 | 9.91/131.6
RAID Group 23 | 180 | 0 | 0/0
RAID Group 24 | 720 | 1335/15037 | 0.92/26.22
RAID Group 25 | 720 | 2290/10973 | 2.08/29.625
RAID Group 26 | 180 | 55/239 | 1970/6280
META |  |  | 1660.17/12966.29

RAID groups marked in red in the report are oversubscribed on average: their aggregated average IOPS pushed exceed the expected IOPS (by the numbers in the table above, RAID Groups 3, 9, 24 and 25). Hosts are trying to push many more I/O requests than the RAID group can handle. The average and maximum waits in the last column help confirm this. These are assessment indicators that tell the storage admin to take a closer look at the LUNs. The reason for oversubscription could be a constant load, where applications are “frying” disks and desperately need more I/O. In that case, more I/O can be gained by spreading the load across more physical disks (spindles), introducing flash disks, adding caching, or other measures.
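The flagging rule can be sketched in a few lines. The example below uses a subset of rows from the SAN report; the three labels and the `classify` function are illustrative names, not part of the original methodology.

```python
# (raid_group, expected_iops, pushed_avg, pushed_max): subset of the report rows
REPORT = [
    (0, 900, 233, 376),
    (1, 720, 295, 3074),
    (3, 630, 1233, 6103),
    (9, 360, 3145, 29233),
    (17, 720, 2.371, 377),
    (24, 720, 1335, 15037),
    (25, 720, 2290, 10973),
]


def classify(expected, pushed_avg, pushed_max):
    """Label a RAID group by comparing pushed IOPS with expected IOPS."""
    if pushed_avg > expected:
        return "oversubscribed"   # average demand exceeds what the group delivers
    if pushed_max > expected:
        return "spikes"           # only peaks exceed it; review timing of spikes
    return "ok"


flags = {group: classify(exp, avg, mx) for group, exp, avg, mx in REPORT}

if __name__ == "__main__":
    for group, label in flags.items():
        print(f"RAID Group {group}: {label}")
```

Groups labeled "spikes" are candidates for the time-based review described next, rather than for immediate re-provisioning.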

High load indicators can also be the result of spikes. When they appear, the timing, nature and duration of the spikes should be reviewed and analyzed. Suppose RAID Group X, with 900 expected IOPS, hosts LUNs A, B and C, and the report for all LUNs shows 800 IOPS on average and an aggregated maximum of about 1000. Such a group can still be well balanced: the per-LUN maximums may have occurred in different time frames, so at no point did the combined load exceed what the RAID group is provisioned for.
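The difference between summed maximums and the true combined peak is easy to see with per-interval samples. The sketch below uses made-up samples for LUNs A, B and C chosen to match the example (average 800, summed maximum 1000, expected 900); the function names are illustrative.

```python
from statistics import mean

EXPECTED_IOPS = 900

# Illustrative IOPS samples per LUN over the same four time intervals.
SAMPLES = {
    "LUN A": [300, 400, 250, 250],
    "LUN B": [250, 200, 350, 300],
    "LUN C": [250, 200, 200, 250],
}


def sum_of_peaks(samples):
    """Aggregated maximum as reported: each LUN's own peak, added together."""
    return sum(max(series) for series in samples.values())


def combined_series(samples):
    """Actual load on the RAID group in each interval (LUNs summed per interval)."""
    return [sum(values) for values in zip(*samples.values())]


if __name__ == "__main__":
    combined = combined_series(SAMPLES)
    print("sum of per-LUN peaks:", sum_of_peaks(SAMPLES))
    print("combined load per interval:", combined)
    print("true combined peak:", max(combined))
    print("combined average:", mean(combined))
    print("ever above expected:", max(combined) > EXPECTED_IOPS)
```

Here the reported maximum exceeds 900, yet the RAID group never sees more than 800 IOPS in any single interval because the LUNs peak at different times.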

In a deeper analysis, the READ/WRITE, WAITS and DISK QUEUE graphs should correlate, showing spikes at the same time. Sometimes spikes are caused by backups and can be ignored entirely if they do not occur during production hours.
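A first pass at this correlation can be automated before looking at graphs. The sketch below is hypothetical throughout: the hourly samples, the thresholds and the 08:00 to 17:59 production window are assumptions chosen for illustration.

```python
PRODUCTION_HOURS = range(8, 18)  # assumed production window, 08:00 to 17:59

# Illustrative hourly samples: (hour_of_day, iops, wait_ms)
SAMPLES = [
    (2, 2500, 900.0),    # e.g. a nightly backup
    (9, 400, 5.0),
    (11, 1100, 450.0),
    (14, 500, 6.0),
    (23, 2300, 800.0),
]


def correlated_spikes(samples, iops_limit, wait_limit_ms):
    """Hours in which IOPS and waits are both above their thresholds."""
    return [hour for hour, iops, wait in samples
            if iops > iops_limit and wait > wait_limit_ms]


def production_only(samples, hours=PRODUCTION_HOURS):
    """Drop samples taken outside production hours."""
    return [sample for sample in samples if sample[0] in hours]


if __name__ == "__main__":
    print("all correlated spikes:", correlated_spikes(SAMPLES, 900, 100.0))
    print("during production:",
          correlated_spikes(production_only(SAMPLES), 900, 100.0))
```

Only the spike that falls inside the production window remains after filtering, which is the one worth investigating first.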

Storage pools could be analyzed using the same logic.

3. Network Expected Performance and Network Load from Servers

Next, an assessment report is created based on network statistics collected on the SAN, switches and hosts.

 

 

 

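Backend

Controller | NIC | Expected Speed (Mbps) | Actual Avg and Max Load on Switch (Mbps) | Aggregated Load from Servers (Mbps), avg/max | Aggregated Waits from Servers (ms)
SPA | Nic1 | 10,000 | 86/213 | 173/422 | 1342/28938
SPA | Nic2 | 10,000 | 87/209 |  |
SPB | Nic1 | 10,000 | 67/209 | 112/405 |
SPB | Nic2 | 10,000 | 45/196 |  |

The aggregated load from servers is listed once per controller (SPA and SPB), and the aggregated waits are listed once for the whole backend.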

In this example we have a SAN with 4x10G iSCSI connections. Based on the average and maximum load measured on the switches, we are far from saturation. The large waits are a function of IOPS: they accumulate while hosts wait for read/write operations from the SAN. The network does not contribute to the waits in this case.
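How far the links are from saturation can be expressed as a utilization percentage. The sketch below uses the switch-side figures from the table; the data layout and function names are illustrative.

```python
# (controller, nic): (expected_mbps, switch_avg_mbps, switch_max_mbps)
LINKS = {
    ("SPA", "Nic1"): (10_000, 86, 213),
    ("SPA", "Nic2"): (10_000, 87, 209),
    ("SPB", "Nic1"): (10_000, 67, 209),
    ("SPB", "Nic2"): (10_000, 45, 196),
}


def utilization_pct(load_mbps, speed_mbps):
    """Load as a percentage of the link's expected speed."""
    return 100.0 * load_mbps / speed_mbps


def per_controller_load(links):
    """Sum average and maximum load over the NICs of each controller."""
    totals = {}
    for (controller, _nic), (_speed, avg, mx) in links.items():
        total_avg, total_max = totals.get(controller, (0, 0))
        totals[controller] = (total_avg + avg, total_max + mx)
    return totals


if __name__ == "__main__":
    for (controller, nic), (speed, avg, mx) in LINKS.items():
        print(f"{controller} {nic}: avg {utilization_pct(avg, speed):.2f}%, "
              f"max {utilization_pct(mx, speed):.2f}%")
    print(per_controller_load(LINKS))
```

Even the busiest link peaks at only a few percent of its 10 Gbps capacity, which supports the conclusion that the waits originate at the disks rather than in the network. The per-controller sums also match the aggregated load column in the table.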

4. System Errors and Warnings

System errors and warnings are collected on the SPA and SPB controllers. In most cases we find out about errors through our Enterprise Storage Monitoring system. However, for a complete report, we assemble all the logs and load them into our database. Next, we group them by type and determine whether anything should be reported or examined more closely.

Any database engine can be used to semi-automate the analysis of large amounts of data.

Over time, we have accumulated many SQL stored procedures and statements through our analysis of logs. These procedures and statements help us complete analyses faster.
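As a rough illustration of this kind of grouping, the sketch below loads a few log records into an in-memory SQLite database and counts them by severity and type. The table schema, the event names and the sample rows are hypothetical; they do not represent Digital Edge's database or any vendor's log format.

```python
import sqlite3


def summarize_events(rows):
    """Load (controller, severity, event_type) rows and count them by type."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE events (controller TEXT, severity TEXT, event_type TEXT)"
    )
    conn.executemany("INSERT INTO events VALUES (?, ?, ?)", rows)
    summary = conn.execute(
        "SELECT severity, event_type, COUNT(*) AS occurrences "
        "FROM events "
        "GROUP BY severity, event_type "
        "ORDER BY occurrences DESC, severity, event_type"
    ).fetchall()
    conn.close()
    return summary


if __name__ == "__main__":
    # Hypothetical log records collected from both controllers.
    sample_rows = [
        ("SPA", "warning", "fan_speed"),
        ("SPA", "error", "disk_soft_media"),
        ("SPB", "error", "disk_soft_media"),
        ("SPB", "warning", "fan_speed"),
        ("SPB", "error", "disk_soft_media"),
        ("SPA", "info", "login"),
    ]
    for severity, event_type, occurrences in summarize_events(sample_rows):
        print(f"{severity:8} {event_type:18} {occurrences}")
```

The same query works unchanged on most SQL engines, so the choice of database matters less than having the logs in one queryable place.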

5. Patch Levels and Recommendations

We compare the installed level of the management software with the current versions. Then we assign each patch one of the following classifications:

  1. Critical – Data Loss or Downtime

  2. Critical – Security

  3. Non-critical

We also check EOL (End of Life) or EOW (End of Warranty) dates and provide recommendations for our clients.
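The classification and the date checks lend themselves to a small helper. The sketch below is an illustration only: the patch names and dates are hypothetical, and ordering patches by class is an assumed way of turning the classification into a work list.

```python
from datetime import date
from enum import IntEnum


class PatchClass(IntEnum):
    """The three classifications, ordered by urgency."""
    CRITICAL_DATA_LOSS_OR_DOWNTIME = 1
    CRITICAL_SECURITY = 2
    NON_CRITICAL = 3


def prioritize(patches):
    """Sort (name, PatchClass) pairs so the most urgent class comes first."""
    return sorted(patches, key=lambda patch: (patch[1], patch[0]))


def days_until(deadline: date, today: date) -> int:
    """Days left until an EOL or EOW date; negative if it has already passed."""
    return (deadline - today).days


if __name__ == "__main__":
    # Hypothetical patch list and dates.
    patches = [
        ("patch-c", PatchClass.NON_CRITICAL),
        ("patch-a", PatchClass.CRITICAL_SECURITY),
        ("patch-b", PatchClass.CRITICAL_DATA_LOSS_OR_DOWNTIME),
    ]
    for name, patch_class in prioritize(patches):
        print(name, patch_class.name)
    print(days_until(date(2014, 4, 22), today=date(2014, 3, 23)))
```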

6. Conclusion. 

Digital Edge believes that the preceding methodology should be practiced to analyze storage devices at least once a quarter or once every six months. We believe that “even hardware should go for a blood test from time to time.”

This gives enterprise IT groups assurance that everything is working as it is supposed to, that the storage is not oversubscribed, and that applications are not “frying” HDDs.

We understand that enterprise IT groups have their own expert methods of using and configuring storage. Any of them can use our methodology, or the Digital Edge Enterprise Storage team can be engaged to provide an independent audit and assessment.

Michael Petrov
Founder, Chief Executive Officer

Michael brings 30 years of experience as an information architect, optimization specialist and operations advisor. His experience includes extensive high-profile project expertise, such as mainframe and client-server integration for Mellon Bank, extranet systems for Sumitomo Bank, and architecture and processing workflow for the alternative investment division of US Bank. Michael possesses advanced knowledge of security standards such as ISO 27001, NIST, SOC and PCI, which he brings to every solution delivered by Digital Edge. Security solutions and standards are extended into public clouds such as AWS and Azure.
