starksummer 2019-03-31
Find Data in DynamoDB in Spark
I got a small task recently. In our DynamoDB, one table has a column named extension; at first we put bad data into the table, storing extension as Number, and only later learned it should be String.
That left us with a mix of Number and String values in the same column. We need to find all the Number ones and update them to String.
Here are my steps.
First of all, I do a scan:
> aws dynamodb scan --table-name sillycat_device-stage-devicePairingInfo --query "Items[].[extension.N]" --output text > ./extensionnumber.csv
In the file extensionnumber.csv, I will have data similar to:
extension
None
None
None
None
None
None
243074
I need to filter out all the None rows; what remains are the Number ones (a small Scala sketch for this cleanup follows below).
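In case you do not want to clean the file by hand, a few lines of plain Scala can do the filtering; this is just a sketch, and the file names simply follow the ones used elsewhere in this post.
import java.io.PrintWriter
import scala.io.Source

// Drop the "None" rows from the scan output; file names follow this post
val writer = new PrintWriter("extensionnumber2.csv")
Source.fromFile("extensionnumber.csv")
  .getLines()
  .filter(_.trim != "None")   // keeps the header line and the Number values
  .foreach(writer.println)
writer.close()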
Then I dump the whole table:
> export-dynamodb -t sillycat_device-stage-devicePairingInfo -f csv -o devicepairinginfodb.csv
I put these 2 tables into Spark, do a join and a select, and that finds all the rows whose extension is a Number.
In Spark, we can read DynamoDB directly if the table does not have this kind of Number/String conflict:
%spark.dep
z.load("mysql:mysql-connector-java:5.1.47")
z.load("com.github.traviscrawford:spark-dynamodb:0.0.13")

import com.github.traviscrawford.spark.dynamodb._
val accountDF = sqlContext.read.dynamodb("us-west-1", "sillycat_device-stage-devicePairingInfo")
accountDF.printSchema()
accountDF.registerTempTable("devicepairing")
Or load the CSV file:
val devicePairingDF = sqlContext.read.format("csv")
  .option("header", "true")
  .option("inferSchema", "true")
  .load("hdfs://localhost:9000/convertdevicepairingextension/devicepairinginfodb.csv")
devicePairingDF.printSchema()
devicePairingDF.createOrReplaceTempView("devicepairing2")
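A quick sanity check on the registered view does not hurt; for example, using the view name registered above:
// Peek at a few extension values to confirm the dump loaded correctly
sqlContext.sql("select extension from devicepairing2").show(5)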
Load the second file, the one with the Number values:
val extensionRawDF = sqlContext.read.format("csv")
  .option("header", "true")
  .option("inferSchema", "true")
  .load("hdfs://localhost:9000/convertdevicepairingextension/extensionnumber2.csv")
// Lower-case the column names
val extensionRaw1DF = extensionRawDF.toDF(extensionRawDF.columns.map(_.toLowerCase): _*)
// Replace whitespace in column names with underscores so they are easy to reference in SQL
val extensionDF = extensionRaw1DF.columns.foldLeft(extensionRaw1DF)((curr, n) => curr.withColumnRenamed(n, n.replaceAll("\\s", "_")))
extensionDF.printSchema()
extensionDF.createOrReplaceTempView("extension")
Join the 2 tables:
%sql
select t1.* from devicepairing2 t1, extension t2 where t1.extension = t2.extension
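The same join can also be written with the DataFrame API instead of a %sql paragraph; a rough equivalent using the two DataFrames defined above (assuming both expose an extension column, as the SQL does) would be:
// Inner join on the shared extension column; the surviving rows are the Number ones
val numberRowsDF = devicePairingDF.join(extensionDF, Seq("extension"))
numberRowsDF.show(2)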
Output the JSON file:
val extensionUpdateDF = sqlContext.sql("""
  select t1.* from devicepairing2 t1, extension t2 where t1.extension = t2.extension
""")
extensionUpdateDF.show(2)
// repartition(1) so the result lands in a single JSON part file
extensionUpdateDF.repartition(1).write.json("hdfs://localhost:9000/convertdevicepairingextension/extension_to_update")
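The post stops at producing the JSON file; the remaining step is to rewrite those rows as String in DynamoDB. A minimal sketch with the AWS SDK for Java v1 might look like the following; note that the key attribute name deviceId is only a placeholder, since the table's real key schema is not shown in this post.
import com.amazonaws.services.dynamodbv2.AmazonDynamoDBClientBuilder
import com.amazonaws.services.dynamodbv2.model.{AttributeValue, UpdateItemRequest}
import scala.collection.JavaConverters._

val client = AmazonDynamoDBClientBuilder.standard().withRegion("us-west-1").build()

// "deviceId" is a placeholder key attribute name; replace it with the table's real hash key
def extensionToString(deviceId: String, extension: String): Unit = {
  val request = new UpdateItemRequest()
    .withTableName("sillycat_device-stage-devicePairingInfo")
    .withKey(Map("deviceId" -> new AttributeValue().withS(deviceId)).asJava)
    .withUpdateExpression("SET extension = :ext")
    .withExpressionAttributeValues(Map(":ext" -> new AttributeValue().withS(extension)).asJava)
  client.updateItem(request)
}
Each row of the exported JSON (read back with sqlContext.read.json, for example) could then be fed through a helper like this.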