
Commit 96e0b2f

20221004 formatting refresh, APS PDW name correction
1 parent 58ab628 commit 96e0b2f

File tree

1 file changed

docs/analytics-platform-system/polybase-configure-hadoop.md

Lines changed: 88 additions & 93 deletions
@@ -1,25 +1,25 @@
---
title: "Access external data: Hadoop - PolyBase"
description: Explains how to configure PolyBase in Analytics Platform System (PDW) to connect to external Hadoop.
author: charlesfeddersen
ms.author: charlesf
ms.reviewer: martinle
ms.date: 10/04/2022
ms.prod: sql
ms.technology: data-warehouse
ms.topic: conceptual
ms.custom:
- seo-dt-2019
- seo-lt-2019
---
# Configure PolyBase in Analytics Platform System (PDW) to access external data in Hadoop

This article explains how to use PolyBase on an [!INCLUDE[sspdw-md](../includes/sspdw-md.md)] or APS appliance to query external data in Hadoop.

## Prerequisites

PolyBase supports two Hadoop providers, Hortonworks Data Platform (HDP) and Cloudera Distributed Hadoop (CDH). Hadoop follows the "Major.Minor.Version" pattern for its new releases, and all versions within a supported Major and Minor release are supported. The following Hadoop providers are supported:

- Hortonworks HDP 1.3 on Linux/Windows Server
- Hortonworks HDP 2.1 - 2.6 on Linux
- Hortonworks HDP 3.0 - 3.1 on Linux
- Hortonworks HDP 2.1 - 2.3 on Windows Server
@@ -30,42 +30,38 @@ PolyBase supports two Hadoop providers, Hortonworks Data Platform (HDP) and Clou

First, configure APS to use your specific Hadoop provider.

1. Run [sp_configure](../relational-databases/system-stored-procedures/sp-configure-transact-sql.md) with 'hadoop connectivity' and set an appropriate value for your provider. To find the value for your provider, see [PolyBase Connectivity Configuration](../database-engine/configure-windows/polybase-connectivity-configuration-transact-sql.md).

   ```sql
   -- Values map to various external data sources.
   -- Example: value 7 stands for Hortonworks HDP 2.1 to 2.6 and 3.0 - 3.1 on Linux,
   -- 2.1 to 2.3 on Windows Server, and Azure blob storage
   sp_configure @configname = 'hadoop connectivity', @configvalue = 7;
   GO

   RECONFIGURE
   GO
   ```

2. Restart the APS Region using the Service Status page in [Appliance Configuration Manager](launch-the-configuration-manager.md).
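
After the restart, you can confirm that the new value is active. The following is a minimal sketch, assuming `sp_configure` behaves here as it does in SQL Server, where calling it with only a configuration name reports the configured and running values:

```sql
-- Hedged sketch: assumes SQL Server-style sp_configure output
-- (config_value and run_value) for the 'hadoop connectivity' setting.
sp_configure @configname = 'hadoop connectivity';
GO
```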

## <a id="pushdown"></a> Enable pushdown computation

To improve query performance, enable pushdown computation to your Hadoop cluster:

1. Open a remote desktop connection to the APS PDW Control node.

1. Find the file `yarn-site.xml` on the Control node. Typically, the path is: `C:\Program Files\Microsoft SQL Server Parallel Data Warehouse\100\Hadoop\conf\`.

1. On the Hadoop machine, find the analogous file in the Hadoop configuration directory. In the file, find and copy the value of the configuration key `yarn.application.classpath`.

1. On the Control node, in the `yarn-site.xml` file, find the `yarn.application.classpath` property. Paste the value from the Hadoop machine into the value element, as sketched after this list.

1. For all CDH 5.X versions, you need to add the `mapreduce.application.classpath` configuration parameters either to the end of your `yarn-site.xml` file or into the `mapred-site.xml` file. HortonWorks includes these configurations within the `yarn.application.classpath` configurations. For examples, see [PolyBase configuration](../relational-databases/polybase/polybase-configuration.md).
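
For illustration only, the pasted property in the Control node's `yarn-site.xml` might look like the following sketch. The classpath value shown is a shortened, hypothetical placeholder; use the value copied from your own Hadoop cluster.

```xml
<!-- Hypothetical sketch: replace the value with the one copied from your cluster. -->
<property>
  <name>yarn.application.classpath</name>
  <value>$HADOOP_CONF_DIR,/usr/lib/hadoop/*,/usr/lib/hadoop/lib/*</value>
</property>
```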

## Example XML files for CDH 5.X cluster default values

The following `yarn-site.xml` includes both the `yarn.application.classpath` and `mapreduce.application.classpath` configuration:

```xml
<?xml version="1.0" encoding="utf-8"?>
@@ -98,9 +94,9 @@ Yarn-site.xml with yarn.application.classpath and mapreduce.application.classpat
</configuration>
```

If you choose to split the two configuration settings between `mapred-site.xml` and `yarn-site.xml`, the files would be as follows:

For `yarn-site.xml`:

```xml
<?xml version="1.0" encoding="utf-8"?>
@@ -133,9 +129,9 @@ If you choose to break your two configuration settings into the mapred-site.xml
</configuration>
```

For `mapred-site.xml`:

Note the property `mapreduce.application.classpath`. In CDH 5.x, you will find the configuration values under the same naming convention in Ambari.

```xml
<?xml version="1.0"?>
@@ -171,7 +167,7 @@ Note that we added the property mapreduce.application.classpath. In CDH 5.x, you

## Example XML files for HDP 3.X cluster default values

For `yarn-site.xml`:

```xml
<?xml version="1.0" encoding="utf-8"?>
@@ -211,133 +207,132 @@ To query the data in your Hadoop data source, you must define an external table
1. Create a master key on the database. It is required to encrypt the credential secret.

   ```sql
   CREATE MASTER KEY ENCRYPTION BY PASSWORD = 'S0me!nfo';
   ```

1. Create a database scoped credential for Kerberos-secured Hadoop clusters.

   ```sql
   -- IDENTITY: the Kerberos user name.
   -- SECRET: the Kerberos password.
   CREATE DATABASE SCOPED CREDENTIAL HadoopUser1
   WITH IDENTITY = '<hadoop_user_name>', Secret = '<hadoop_password>';
   ```

1. Create an external data source with [CREATE EXTERNAL DATA SOURCE](../t-sql/statements/create-external-data-source-transact-sql.md).

   ```sql
   -- LOCATION (Required): Hadoop Name Node IP address and port.
   -- RESOURCE_MANAGER_LOCATION (Optional): Hadoop Resource Manager location to enable pushdown computation.
   -- CREDENTIAL (Optional): the database scoped credential, created above.
   CREATE EXTERNAL DATA SOURCE MyHadoopCluster WITH (
      TYPE = HADOOP,
      LOCATION ='hdfs://10.xxx.xx.xxx:xxxx',
      RESOURCE_MANAGER_LOCATION = '10.xxx.xx.xxx:xxxx',
      CREDENTIAL = HadoopUser1
   );
   ```

1. Create an external file format with [CREATE EXTERNAL FILE FORMAT](../t-sql/statements/create-external-file-format-transact-sql.md).

   ```sql
   -- FORMAT_TYPE: Type of format in Hadoop (DELIMITEDTEXT, RCFILE, ORC, PARQUET).
   CREATE EXTERNAL FILE FORMAT TextFileFormat WITH (
      FORMAT_TYPE = DELIMITEDTEXT,
      FORMAT_OPTIONS (FIELD_TERMINATOR ='|',
      USE_TYPE_DEFAULT = TRUE)
   );
   ```

1. Create an external table pointing to data stored in Hadoop with [CREATE EXTERNAL TABLE](../t-sql/statements/create-external-table-transact-sql.md). In this example, the external data contains car sensor data.

   ```sql
   -- LOCATION: path to file or directory that contains the data (relative to HDFS root).
   CREATE EXTERNAL TABLE [dbo].[CarSensor_Data] (
      [SensorKey] int NOT NULL,
      [CustomerKey] int NOT NULL,
      [GeographyKey] int NULL,
      [Speed] float NOT NULL,
      [YearMeasured] int NOT NULL
   )
   WITH (LOCATION='/Demo/',
      DATA_SOURCE = MyHadoopCluster,
      FILE_FORMAT = TextFileFormat
   );
   ```

1. Create statistics on an external table.

   ```sql
   CREATE STATISTICS StatsForSensors ON CarSensor_Data(CustomerKey, Speed);
   ```

## PolyBase queries

There are three functions that PolyBase is suited for:

- Ad hoc queries against external tables.
- Importing data.
- Exporting data.

The following queries provide examples using fictional car sensor data.

### Ad hoc queries

The following ad hoc query joins relational data with Hadoop data. It selects customers who drive faster than 35 mph, joining structured customer data stored in APS with car sensor data stored in Hadoop.

```sql
SELECT DISTINCT Insured_Customers.FirstName, Insured_Customers.LastName,
   Insured_Customers.YearlyIncome, CarSensor_Data.Speed
FROM Insured_Customers, CarSensor_Data
WHERE Insured_Customers.CustomerKey = CarSensor_Data.CustomerKey AND CarSensor_Data.Speed > 35
ORDER BY CarSensor_Data.Speed DESC
OPTION (FORCE EXTERNALPUSHDOWN); -- or OPTION (DISABLE EXTERNALPUSHDOWN)
```

### Import data

The following query imports external data into APS. This example imports data for fast drivers into APS to do more in-depth analysis. To improve performance, it leverages columnstore technology in APS.

```sql
CREATE TABLE Fast_Customers
WITH
(CLUSTERED COLUMNSTORE INDEX, DISTRIBUTION = HASH (CustomerKey))
AS
SELECT DISTINCT
   Insured_Customers.CustomerKey, Insured_Customers.FirstName, Insured_Customers.LastName,
   Insured_Customers.YearlyIncome, Insured_Customers.MaritalStatus
FROM Insured_Customers INNER JOIN
(
   SELECT * FROM CarSensor_Data WHERE Speed > 35
) AS SensorD
ON Insured_Customers.CustomerKey = SensorD.CustomerKey;
```
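
Once imported, `Fast_Customers` is an ordinary distributed APS table. As a simple usage sketch (the query itself is illustrative):

```sql
-- Query the imported rows locally; no Hadoop access is involved at this point.
SELECT TOP 10 FirstName, LastName, YearlyIncome
FROM Fast_Customers
ORDER BY YearlyIncome DESC;
```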

### Export data

The following query exports data from APS to Hadoop. It can be used to archive relational data to Hadoop while retaining the ability to query it.

```sql
-- Export data: Move old data to Hadoop while keeping it queryable via an external table.
CREATE EXTERNAL TABLE [dbo].[FastCustomers2009]
WITH (
   LOCATION='/archive/customer/2009',
   DATA_SOURCE = HadoopHDP2,
   FILE_FORMAT = TextFileFormat
)
AS
SELECT T1.* FROM Insured_Customers T1 JOIN CarSensor_Data T2
ON (T1.CustomerKey = T2.CustomerKey)
WHERE T2.YearMeasured = 2009 AND T2.Speed > 40;
```
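
Because the archive is exposed through an external table, it remains queryable from APS after the move. A minimal usage sketch:

```sql
-- Counts archived rows by reading them back from Hadoop through the external table.
SELECT COUNT(*) FROM [dbo].[FastCustomers2009];
```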

## View PolyBase objects in SSDT

In SQL Server Data Tools, external tables are displayed in a separate folder, **External Tables**. External data sources and external file formats are in subfolders under **External Resources**.

:::image type="content" source="media/polybase/external-tables-datasource.png" alt-text="A screenshot of PolyBase objects in SQL Server Data Tools (SSDT).":::

## Next steps

- For Hadoop security settings, see [Configure Hadoop security](polybase-configure-hadoop-security.md).
- For more information about PolyBase, see [What is PolyBase?](../relational-databases/polybase/polybase-guide.md).
