---
title: "Access external data: Hadoop - PolyBase"
description: Explains how to configure PolyBase in Analytics Platform System (PDW) to connect to external Hadoop.
author: charlesfeddersen
ms.author: charlesf
ms.reviewer: martinle
ms.date: 10/04/2022
ms.prod: sql
ms.technology: data-warehouse
ms.topic: conceptual
ms.custom:
- seo-dt-2019
- seo-lt-2019
---
# Configure PolyBase in Analytics Platform System (PDW) to access external data in Hadoop

This article explains how to use PolyBase on an [!INCLUDE[sspdw-md](../includes/sspdw-md.md)] or APS appliance to query external data in Hadoop.
## Prerequisites
PolyBase supports two Hadoop providers, Hortonworks Data Platform (HDP) and Cloudera Distributed Hadoop (CDH). Hadoop follows the "Major.Minor.Version" pattern for its new releases, and all versions within a supported Major and Minor release are supported. The following Hadoop providers are supported:
- Hortonworks HDP 1.3 on Linux/Windows Server
- Hortonworks HDP 2.1 - 2.6 on Linux
- Hortonworks HDP 3.0 - 3.1 on Linux
- Hortonworks HDP 2.1 - 2.3 on Windows Server
First, configure APS to use your specific Hadoop provider.
1. Run [sp_configure](../relational-databases/system-stored-procedures/sp-configure-transact-sql.md) with 'hadoop connectivity' and set an appropriate value for your provider. To find the value for your provider, see [PolyBase Connectivity Configuration](../database-engine/configure-windows/polybase-connectivity-configuration-transact-sql.md).

```sql
-- Values map to various external data sources.
-- Example: value 7 stands for Hortonworks HDP 2.1 to 2.6 and 3.0 - 3.1 on Linux,
-- 2.1 to 2.3 on Windows Server, and Azure blob storage
sp_configure @configname = 'hadoop connectivity', @configvalue = 7;
GO
RECONFIGURE
GO
```
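
To verify the setting, `sp_configure` can be run with just the option name; the `run_value` column in the output shows the active value:

```sql
-- Display the configured and running values for 'hadoop connectivity'.
EXEC sp_configure @configname = 'hadoop connectivity';
```
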
To improve query performance, enable pushdown computation to your Hadoop cluster:

1. Open a remote desktop connection to the APS PDW Control node.
1. Find the file `yarn-site.xml` on the Control node. Typically, the path is: `C:\Program Files\Microsoft SQL Server Parallel Data Warehouse\100\Hadoop\conf\`.
1. On the Hadoop machine, find the analogous file in the Hadoop configuration directory. In the file, find and copy the value of the configuration key `yarn.application.classpath`.

1. On the Control node, in the `yarn-site.xml` file, find the `yarn.application.classpath` property. Paste the value from the Hadoop machine into the value element.

1. For all CDH 5.X versions, you need to add the `mapreduce.application.classpath` configuration parameters either to the end of your `yarn-site.xml` file or into the `mapred-site.xml` file. Hortonworks includes these configurations within the `yarn.application.classpath` configurations. For examples, see [PolyBase configuration](../relational-databases/polybase/polybase-configuration.md).
## Example XML files for CDH 5.X cluster default values

`yarn-site.xml` with `yarn.application.classpath` and `mapreduce.application.classpath` configuration.
```xml
<?xml version="1.0" encoding="utf-8"?>
<configuration>
   <property>
      <name>yarn.application.classpath</name>
      <value>...</value>
   </property>
   <property>
      <name>mapreduce.application.classpath</name>
      <value>...</value>
   </property>
</configuration>
```
If you choose to break your two configuration settings into the `mapred-site.xml` and the `yarn-site.xml`, then the files would be the following:
For `yarn-site.xml`:
```xml
<?xml version="1.0" encoding="utf-8"?>
<configuration>
   <property>
      <name>yarn.application.classpath</name>
      <value>...</value>
   </property>
</configuration>
```
For `mapred-site.xml`:
Note the property `mapreduce.application.classpath`. In CDH 5.x, you will find the configuration values under the same naming convention in Ambari.
```xml
<?xml version="1.0"?>
<configuration>
   <property>
      <name>mapreduce.application.classpath</name>
      <value>...</value>
   </property>
</configuration>
```

## Example XML files for HDP 3.X cluster default values
For `yarn-site.xml`:
```xml
<?xml version="1.0" encoding="utf-8"?>
<configuration>
   <property>
      <name>yarn.application.classpath</name>
      <value>...</value>
   </property>
</configuration>
```

## Configure an external table

To query the data in your Hadoop data source, you must define an external table to use in Transact-SQL queries.

1. Create a master key on the database. It is required to encrypt the credential secret.
```sql
CREATE MASTER KEY ENCRYPTION BY PASSWORD ='S0me!nfo';
```
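
To confirm that the master key was created, you can query the catalog. A minimal sketch, assuming the `sys.symmetric_keys` catalog view is available in your appliance version:

```sql
-- The database master key appears as ##MS_DatabaseMasterKey##.
SELECT name FROM sys.symmetric_keys WHERE name = '##MS_DatabaseMasterKey##';
```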
1. Create a database scoped credential for Kerberos-secured Hadoop clusters.
```sql
-- IDENTITY: the Kerberos user name.
-- SECRET: the Kerberos password.
CREATE DATABASE SCOPED CREDENTIAL HadoopUser1
WITH IDENTITY ='<hadoop_user_name>', Secret ='<hadoop_password>';
```
1. Create an external data source with [CREATE EXTERNAL DATA SOURCE](../t-sql/statements/create-external-data-source-transact-sql.md).
```sql
-- LOCATION (Required): Hadoop Name Node IP address and port.
-- CREDENTIAL (Optional): the database scoped credential, created above.
CREATE EXTERNAL DATA SOURCE MyHadoopCluster WITH (
TYPE = HADOOP,
LOCATION ='hdfs://10.xxx.xx.xxx:xxxx',
RESOURCE_MANAGER_LOCATION ='10.xxx.xx.xxx:xxxx',
CREDENTIAL = HadoopUser1
);
```
1. Create an external file format with [CREATE EXTERNAL FILE FORMAT](../t-sql/statements/create-external-file-format-transact-sql.md).
```sql
-- FORMAT TYPE: Type of format in Hadoop (DELIMITEDTEXT, RCFILE, ORC, PARQUET).
CREATE EXTERNAL FILE FORMAT TextFileFormat WITH (
FORMAT_TYPE = DELIMITEDTEXT,
FORMAT_OPTIONS (FIELD_TERMINATOR ='|',
      USE_TYPE_DEFAULT = TRUE));
```
1. Create an external table pointing to data stored in Hadoop with [CREATE EXTERNAL TABLE](../t-sql/statements/create-external-table-transact-sql.md). In this example, the external data contains car sensor data.
```sql
-- LOCATION: path to file or directory that contains the data (relative to HDFS root).
CREATE EXTERNAL TABLE [dbo].[CarSensor_Data] (
[SensorKey] int NOT NULL,
[CustomerKey] int NOT NULL,
[GeographyKey] int NULL,
[Speed] float NOT NULL,
[YearMeasured] int NOT NULL
)
WITH (LOCATION='/Demo/',
DATA_SOURCE = MyHadoopCluster,
FILE_FORMAT = TextFileFormat
);
```
1. Create statistics on an external table.
```sql
CREATE STATISTICS StatsForSensors ON CarSensor_Data(CustomerKey, Speed)
```
## PolyBase queries

PolyBase is suited for three functions:
- Ad hoc queries against external tables.
- Importing data.
- Exporting data.

The following queries provide examples with fictional car sensor data.
### Ad hoc queries
The following ad hoc query joins relational with Hadoop data. It selects customers who drive faster than 35 mph, joining structured customer data stored in APS with car sensor data stored in Hadoop.

```sql
-- Insured_Customers is an illustrative customer table in APS;
-- adjust the names and columns to your schema.
SELECT DISTINCT Insured_Customers.FirstName, Insured_Customers.LastName, CarSensor_Data.Speed
FROM Insured_Customers
INNER JOIN CarSensor_Data
   ON Insured_Customers.CustomerKey = CarSensor_Data.CustomerKey
WHERE CarSensor_Data.Speed > 35
ORDER BY CarSensor_Data.Speed DESC
OPTION (FORCE EXTERNALPUSHDOWN); -- or OPTION (DISABLE EXTERNALPUSHDOWN)
```
### Import data
The following query imports external data into APS. This example imports data for fast drivers into APS to do more in-depth analysis. To improve performance, it leverages columnstore technology in APS.
```sql
CREATE TABLE Fast_Customers
WITH
(CLUSTERED COLUMNSTORE INDEX, DISTRIBUTION = HASH (CustomerKey))
AS
-- Insured_Customers is an illustrative customer table in APS;
-- adjust the SELECT to your schema.
SELECT DISTINCT Insured_Customers.CustomerKey, Insured_Customers.FirstName,
       Insured_Customers.LastName, CarSensor_Data.Speed
FROM Insured_Customers
INNER JOIN CarSensor_Data
   ON Insured_Customers.CustomerKey = CarSensor_Data.CustomerKey
WHERE CarSensor_Data.Speed > 35;
```
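
### Export data

PolyBase can also export relational data to Hadoop. A minimal sketch, assuming CREATE EXTERNAL TABLE AS SELECT is available in your appliance version and reusing the data source and file format created earlier; the table name and HDFS path are illustrative:

```sql
-- Write the imported rows back to Hadoop as an illustrative external table.
CREATE EXTERNAL TABLE [dbo].[Fast_Customers_Export]
WITH (
   LOCATION = '/Demo/Export/',
   DATA_SOURCE = MyHadoopCluster,
   FILE_FORMAT = TextFileFormat
)
AS SELECT * FROM Fast_Customers;
```
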
In SQL Server Data Tools, external tables are displayed in a separate folder **External Tables**. External data sources and external file formats are in subfolders under **External Resources**.
:::image type="content" source="media/polybase/external-tables-datasource.png" alt-text="A screenshot of PolyBase objects in SQL Server Data Tools (SSDT).":::
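
The same objects can also be listed with Transact-SQL. A minimal sketch, assuming the standard PolyBase catalog views are available in your appliance version:

```sql
-- List the PolyBase objects defined in the current database.
SELECT name, location    FROM sys.external_data_sources;
SELECT name, format_type FROM sys.external_file_formats;
SELECT name              FROM sys.external_tables;
```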
## Next steps

- For Hadoop security settings, see [Configure Hadoop security](polybase-configure-hadoop-security.md).
- For more information about PolyBase, see [What is PolyBase?](../relational-databases/polybase/polybase-guide.md).