20.4.16

Golden rules for New SAP HANA Developers

I've been developing code on SAP HANA for nearly 3 years now, and at the end of 2013 there is a huge influx of new developers. This is good, because it means that mass adoption is here. But it also means that we see a lot of the same mistakes being made. I did some analysis of the questions asked in HANA development forums and thought I'd pick out the most common mistakes that newbie HANA developers make and give my advice.

The main thing I've noted is that every developer comes to SAP HANA with a set of misconceptions about HANA and a body of experience from some other technology. HANA is a little different from other application platforms and you need to change your thinking. I hope this helps you in your journey.

Never use row-based tables

If you are an Oracle, IBM, or Microsoft person then you are institutionalized into thinking that you put OLTP data in the row store and OLAP data in the column store. This is not true with SAP HANA! Create all tables in the column store; it serves both transactional and analytic scenarios.

HANA does have a row store and you can create row-oriented tables for some very specific scenarios:

- Transient data like queues, where you insert and delete a lot and the data never persists
- Configuration tables which are never joined, where you select individual entire rows
- When you are advised to by SAP support

But in general, never use row-oriented tables, especially for OLTP/transactional scenarios. HANA is optimized to use the column store for combined transactional and analytical scenarios.
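The distinction above can be sketched in plain SQL. This is a minimal sketch; all table and column names are hypothetical:

```sql
-- Default choice: a column table, used for both transactional
-- and analytic workloads
CREATE COLUMN TABLE sales_orders (
    order_id   INTEGER PRIMARY KEY,
    customer   NVARCHAR(40),
    amount     DECIMAL(15,2)
);

-- Exception: a transient queue table that is constantly inserted
-- into and deleted from, and never queried analytically
CREATE ROW TABLE message_queue (
    msg_id     INTEGER PRIMARY KEY,
    payload    NVARCHAR(2000)
);
```

If in doubt, `CREATE COLUMN TABLE` is the right answer.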

Never create indexes

Again if you come from the traditional RDBMS space, you will see that HANA allows the CREATE INDEX command.

However, when you create a table in HANA (and I'm simplifying), HANA is in fact creating a set of sorted, compressed and linked indexes. As a result, secondary indexes almost never improve performance. In fact, I've never come across a scenario where an index improved the performance of an analytic query, where a large volume of data is aggregated.

There is one scenario when a secondary index can improve performance: when you have a query which selects a very small amount of data from a very large table (or group of joined tables). In this instance, creating a secondary index on all your sort columns can allow HANA to find the data faster. But this is a very specific situation - the simple advice is, never create indexes.
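That one exception might look like this, assuming a hypothetical orders table with hundreds of millions of rows:

```sql
-- Only for highly selective point lookups on a very large table:
-- an index on the filter columns can help HANA locate the few
-- matching rows without scanning the whole column
CREATE INDEX idx_orders_cust_date
    ON sales_orders (customer, order_date);

-- The kind of query this helps: a handful of rows out of millions
SELECT order_id, amount
  FROM sales_orders
 WHERE customer = 'ACME'
   AND order_date = '2013-12-01';
```

If your query aggregates a large share of the table, the index will simply be ignored or even slow things down.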

Don't use the SAP HANA Modeler Perspective

HANA Studio has 3 developer perspectives: SAP HANA Systems, Modeler, and Developer. Take the time to read the developer guide and set up the Developer perspective. It lets you keep all your development artifacts in one place, including tables, information views, stored procedures, plus OData and HTML artifacts if you need them. You get change management, version management, the ability to test inactive objects, code completion and a bunch of other things.

Please don't create models directly in the SAP HANA Content Repository any more.

Don't use SQLScript unless you have to

SAP HANA provides a powerful stored procedure language, but its power is a bad thing for new SAP HANA Developers. It allows you to write very inefficient code which doesn't parallelize.

Most of the scenarios that developers describe on the forums could be done better with Information Views like Attribute Views, Analytic Views and Calculation Views. And as of SAP HANA SP07, Information Views are faster than SQLScript in almost every scenario. Plus, Information Views are easier for others to understand, remodel and change.

There are scenarios where you need SQLScript, but it shouldn't be viewed as a general-purpose solution to modeling problems.

If you have to use SQLScript, don't use Cursors or Dynamic SQL

If you are a PL/SQL or T-SQL developer then you will be familiar with Cursors and Dynamic SQL. I see a lot of questions in the forums related to performance problems with these. Avoid using them at all costs - this is detailed in the SQLScript Reference Guide.

There are a number of things that push you out of multi-threaded mode in SQLScript: Local Scalar Variables, Loops, Cursors and Dynamic SQL. All of these constructs will cause you performance problems if you use them, so you need to avoid them.

Especially don't use Dynamic SQL to generate column names for INSERT or SELECT statements! Instead, create an Information View for SELECT statements, or write a Python-based loader to load tables.

In many cases you can change a loop into a single SELECT or INSERT statement. In other cases, you can nest SQLScript procedures to improve performance. I'm thinking that this area needs a blog of its own.
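Here is a sketch of what turning a loop into a single statement looks like. The procedure and table names are hypothetical, and the syntax is a simplified SQLScript sketch rather than production code:

```sql
-- Cursor version: processes one customer at a time, single-threaded
-- (this is the pattern to avoid)
CREATE PROCEDURE load_totals_cursor LANGUAGE SQLSCRIPT AS
BEGIN
    DECLARE v_total DECIMAL(15,2);
    DECLARE CURSOR c_cust FOR
        SELECT DISTINCT customer FROM sales_orders;
    FOR r AS c_cust DO
        SELECT SUM(amount) INTO v_total
          FROM sales_orders WHERE customer = r.customer;
        INSERT INTO customer_totals VALUES (r.customer, :v_total);
    END FOR;
END;

-- Set-based version: one INSERT ... SELECT, parallelized by the engine
CREATE PROCEDURE load_totals_set LANGUAGE SQLSCRIPT AS
BEGIN
    INSERT INTO customer_totals
    SELECT customer, SUM(amount)
      FROM sales_orders
     GROUP BY customer;
END;
```

The second version says what you want, not how to iterate, and lets the column engine aggregate in parallel.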

Avoid JOINs on Large Data Volumes

Joins on very large tables (>100m rows) can be inefficient. If you have two large fact tables, never join them - performance will always be a problem.

Instead, you can normalize the data in Analytic Views and create a Union with Constant Values in a Calculation View. Werner Steyn describes this nicely in his Advanced Data Modeling guide. You can expect a very large performance increase for complex queries by using this mechanism.
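In plain SQL terms, the union-with-constant-values pattern replaces the join of two fact tables with a stacked union plus one aggregation pass. A minimal sketch, with hypothetical actuals and plan tables:

```sql
-- Instead of joining two large fact tables, union them with
-- constant filler columns so every row has the same shape,
-- then aggregate once
SELECT customer,
       SUM(actual_amt) AS actual_amt,
       SUM(plan_amt)   AS plan_amt
FROM (
    SELECT customer, amount AS actual_amt, 0 AS plan_amt
      FROM sales_actuals            -- fact table 1
    UNION ALL
    SELECT customer, 0 AS actual_amt, amount AS plan_amt
      FROM sales_plan               -- fact table 2
) u
GROUP BY customer;
```

In a Calculation View the graphical Union node with constant mapping generates essentially this shape for you.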

Consider use of a second monitor

Most laptop screens are poorly adjusted for use of Eclipse - this isn't a HANA Studio-specific problem. The best screen is a 27" 1920x1080 screen like this one from HP. Connect your laptop to a second screen at work for the best development experience. If you have a regular laptop screen that is 1366x768 resolution, you will really struggle to develop well.

Update your HANA environment regularly

This is true for all your HANA components - DB, Studio and Client. There are continuous innovations coming every 6 months with major Service Packs, and smaller revisions every 2 weeks. It's well worth patching frequently to get the best developer experience. Take a read of this blog for more details.

Final Words

This isn't designed to be an exhaustive guide, but rather a compilation of the common mistakes that I see developers making when they're starting out with HANA. If you're new to HANA then please take the time to read them and think about what they mean for your development methodology.

Have I missed any obvious ones?

19.4.16

HANA Smart Data Access Tuning

Background

In general, HANA column tables do not need separate statistics to produce an efficient query plan. Through the delta merge, a data dictionary is built per column, so information such as min, max, distinct count and row count is already available.
Virtual tables created through Smart Data Access do not go through a delta merge, so HANA cannot know what kind of data they contain. From the point of view of HANA, which executes the federated query, the remote table is effectively a black box.
Collecting statistics on virtual tables is therefore important. It helps the optimizer generate and choose the most efficient plan, so it plays a critical role in federated query performance.
Extended tables in Dynamic Tiering work on the same underlying technology, so collecting statistics is equally important for them.

Test Environment

Set up a test environment in which a local DB and a remote DB hold the hot and warm data tiers respectively, connected to each other via SDA.
Create a unified view spanning the hot and warm data, so that a single federated query can read current and historical data at the same time.
To verify the effect of statistics collection, capture the query execution plans (Visual Plan, Explain Plan) before and after collecting statistics.
Collect statistics on the columns the query actually uses (selection list and predicates):

- CREATE STATISTICS ON TABLE_NAME (COLUMN_NAME,…) TYPE HISTOGRAM;
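Under the assumptions above, the test setup might look like the following. All object, schema and remote source names are hypothetical:

```sql
-- Virtual table over the warm data in the remote source
CREATE VIRTUAL TABLE vt_sales_warm
    AT "REMOTE_SRC"."<NULL>"."WARM_SCHEMA"."SALES";

-- Unified view spanning hot (local) and warm (remote) data
CREATE VIEW v_sales_all AS
    SELECT order_id, customer, amount, order_date FROM sales_hot
    UNION ALL
    SELECT order_id, customer, amount, order_date FROM vt_sales_warm;

-- Histogram statistics on the columns the federated query
-- filters and joins on
CREATE STATISTICS ON vt_sales_warm (customer, order_date)
    TYPE HISTOGRAM;
```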

1. Federation Query with No Statistics

(Visual Plan / Explain Plan screenshots omitted.)

2. Federation Query with Statistics

(Visual Plan / Explain Plan screenshots omitted.)

Conclusion

Creating statistics on Virtual Tables and Extended Tables allows the optimizer to generate a more efficient execution plan when running federated SQL.
Because query performance differs noticeably depending on whether statistics have been collected, collecting statistics is the most basic tuning step in an SDA environment.
If you have queries that use Virtual Tables or Extended Tables, collecting statistics on the columns used in the projection list, join keys and predicates is essential.