
A Quick Demo: ArangoDB to Spark to BigQuery

Kundan Kumar

Hi folks! In this blog, we are going to learn how to integrate Spark with ArangoDB and BigQuery to build a simple ETL pipeline.

To start the application, you will need ArangoDB running locally and BigQuery on your Google Cloud account. The minimum requirements for the application are:

Python 3.8.0, Spark 3.4 (PySpark), ArangoDB (latest), BigQuery

Install the gcloud SDK (https://cloud.google.com/sdk/docs/install).

Use the docker-compose.yml below to start an ArangoDB Docker container.

version: '3.7'
services:
  arangodb_db_container:
    image: arangodb:latest
    environment:
      ARANGO_ROOT_PASSWORD: rootpassword
    ports:
      - 8529:8529
    volumes:
      - arangodb_data_container:/var/lib/arangodb3
      - arangodb_apps_data_container:/var/lib/arangodb3-apps

volumes:
  arangodb_data_container:
  arangodb_apps_data_container:

Run docker-compose up -d to start the container. You can access the ArangoDB web UI by browsing to localhost:8529.

Log in with user “root” and password “rootpassword”.

Select the _system database and add your own user database.

Now, log in with your own user database:

Add your collection and upload documents to the database you have created (a programmatic sketch follows the sample documents below):

Sample documents:

{
  "id": 7,
  "name": "John Doe",
  "age": 22,
  "updatedAt": 1695569714963,
  "hobbies": {
    "indoor": [
      "Chess"
    ],
    "outdoor": [
      "Basketball",
      "Stand-up Comedy"
    ]
  }
}

{
  "id": 5,
  "name": "Steve Doe",
  "age": 25,
  "updatedAt": 1695569714963,
  "hobbies": {
    "indoor": [
      "Ludo"
    ],
    "outdoor": [
      "Cricket"
    ]
  }
}
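If you prefer to create the database, collection, and documents programmatically rather than through the web UI, here is a minimal sketch using the python-arango client (an extra dependency I am assuming, installed with pip install python-arango; it is not needed by the Spark job itself):

from arango import ArangoClient

# Connect to the local ArangoDB instance started by docker-compose
client = ArangoClient(hosts="http://localhost:8529")

# Create the user database from the _system database (skipped if it already exists)
sys_db = client.db("_system", username="root", password="rootpassword")
if not sys_db.has_database("nash_arango"):
    sys_db.create_database("nash_arango")

# Create the collection and insert the two sample documents
db = client.db("nash_arango", username="root", password="rootpassword")
if not db.has_collection("hobbies"):
    db.create_collection("hobbies")
db.collection("hobbies").insert_many([
    {"id": 7, "name": "John Doe", "age": 22, "updatedAt": 1695569714963,
     "hobbies": {"indoor": ["Chess"], "outdoor": ["Basketball", "Stand-up Comedy"]}},
    {"id": 5, "name": "Steve Doe", "age": 25, "updatedAt": 1695569714963,
     "hobbies": {"indoor": ["Ludo"], "outdoor": ["Cricket"]}},
])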

Now, read the documents with PySpark:

# Read documents from ArangoDB with the ArangoDB Spark datasource
df: DataFrame = spark.read.format("com.arangodb.spark") \
    .option("query", query) \
    .option("batchSize", 2147483647) \
    .options(**arango_connection) \
    .schema(doc_schema) \
    .load()
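The read snippet above assumes a SparkSession that already has the ArangoDB Spark datasource and the Spark BigQuery connector on its classpath, plus a query string. A minimal sketch of that setup follows; the Maven coordinates and versions are assumptions, so match them to your own Spark 3.4 / Scala build:

from pyspark.sql import SparkSession, DataFrame

# Connector packages are illustrative; pick versions matching your Spark/Scala build
spark = SparkSession.builder \
    .appName("arangodb-to-bigquery") \
    .config("spark.jars.packages",
            "com.arangodb:arangodb-spark-datasource-3.4_2.12:1.5.0,"
            "com.google.cloud.spark:spark-bigquery-with-dependencies_2.12:0.32.2") \
    .getOrCreate()

# AQL query passed through the "query" option (see the examples below)
query = "FOR doc IN hobbies RETURN doc"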

In the above code snippet, we read from ArangoDB by passing a query written in AQL (ArangoDB Query Language), for example:

1. Read all documents from the hobbies collection:
FOR doc IN hobbies RETURN doc

2. Read the documents inserted between the current timestamp and the current timestamp minus 6 hours:
FOR doc IN hobbies
  FILTER DATE_TIMESTAMP(DATE_SUBTRACT(DATE_NOW(), "PT6H")) <= doc.updatedAt
    AND doc.updatedAt <= DATE_TIMESTAMP(DATE_NOW())
  SORT doc.updatedAt ASC
  RETURN doc


The arango_connection option holds the ArangoDB connection configuration:

arango_connection = {
    "endpoints": "localhost:8529",
    "password": "rootpassword",
    "database": "nash_arango",
}

The doc_schema option specifies the schema of the documents before they are written to BigQuery:

from pyspark.sql.types import StructType, StructField, StringType, IntegerType, ArrayType

doc_schema: StructType = StructType([
    StructField("_id", StringType(), nullable=False),
    StructField("_key", StringType(), nullable=False),
    StructField("_rev", StringType(), nullable=False),
    StructField("name", StringType()),
    StructField("id", IntegerType()),
    StructField("age", IntegerType()),
    StructField("hobbies", StructType([
        StructField("indoor", ArrayType(StringType())),
        StructField("outdoor", ArrayType(StringType())),
    ])),
    StructField("updatedAt", StringType())
])

Note: set the "batchSize" option carefully. If you have a huge number of documents, keep the batch size large; otherwise you may get a "Cursor not found" exception. See https://github.com/arangodb/arangodb-spark-datasource/issues/46.
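Before writing anything to BigQuery, it can be worth a quick sanity check of what Spark actually read; a small sketch using the df from the read snippet above:

# Inspect the schema Spark applied and preview the documents
df.printSchema()
df.show(truncate=False)

# Count the documents pulled from ArangoDB (triggers a full read)
print(f"Documents read from ArangoDB: {df.count()}")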

# Append the DataFrame with the ArangoDB documents to the target BigQuery table
df.write.format("bigquery").mode("append") \
    .option("table", bq_table) \
    .option("project", bq_project) \
    .option("dataset", bq_dataset) \
    .option("writeMethod", "direct") \
    .option("credentialsFile", "key.json") \
    .save()

In the above code snippet, we write the Spark DataFrame containing the ArangoDB documents to a target BigQuery table. Here:
bq_table: the target table name.
bq_dataset: the BigQuery dataset.
bq_project: the GCP project ID.
key.json: a service account JSON key with BigQuery write permissions.
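For illustration, these variables could be defined as follows; all three values are placeholders to replace with your own project, dataset, and table:

# Placeholder values - replace with your own GCP project, dataset, and table
bq_project = "my-gcp-project"
bq_dataset = "nash_dataset"
bq_table = "hobbies"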

Result

This writes the ArangoDB documents to the target BigQuery table with the schema you specified.
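To verify the load, you can query the target table with the BigQuery Python client (an extra dependency I am assuming, installed with pip install google-cloud-bigquery; the names below are the same placeholders as above):

from google.cloud import bigquery

# Reuse the same service account key that the Spark job used
client = bigquery.Client.from_service_account_json("key.json", project="my-gcp-project")

# Count the rows that landed in the target table (placeholder <project>.<dataset>.<table>)
result = client.query("SELECT COUNT(*) AS total FROM `my-gcp-project.nash_dataset.hobbies`").result()
for row in result:
    print(f"Rows loaded: {row.total}")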

Kundan Kumar

Kundan is a senior software consultant at NashTech. He enjoys learning and working on new technologies. He is a Big Data enthusiast and has worked on Snowflake, Spark, Flink, Apache Beam, BigQuery, GCP Dataflow, Kafka, etc.
