Running Mobius Spark – The system cannot find the path specified error

Mobius is a project that enables using C# to develop against Apache Spark. The project page on GitHub has the directions to set up and use Mobius. However, when I followed those, I ran into the “The system cannot find the path specified” error. Essentially, the error occurs because the various directory and file paths referred to in the scripts did not exist on my machine.

The entry point for running a Mobius project is the “sparkclr-submit.cmd” script found in the “..\runtime\scripts” folder. This script reads various environment variables and in turn calls the “spark-submit.cmd” command found in the bin folder of the Spark installation directory.

Obviously, you need to have Spark installed. Then, ensure that you have set up all the environment variables referred to in “sparkclr-submit.cmd”; these include the following, among others (see the sketch after this list for a quick way to verify them):

  1. SPARK_HOME
  2. JAVA_HOME
  3. HADOOP_HOME
  4. SPARKCLR_HOME
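
A quick way to sanity-check these is a small console sketch like the one below. This is only illustrative; the variable names are the ones listed above, and you should add any others you find referenced in the script.

using System;
using System.IO;

class EnvCheck
{
    static void Main()
    {
        // The variables sparkclr-submit.cmd relies on; extend this list as needed.
        string[] vars = { "SPARK_HOME", "JAVA_HOME", "HADOOP_HOME", "SPARKCLR_HOME" };

        foreach (string name in vars)
        {
            string value = Environment.GetEnvironmentVariable(name);
            if (string.IsNullOrEmpty(value))
                Console.WriteLine(name + " is NOT set");
            else if (!Directory.Exists(value))
                Console.WriteLine(name + " = " + value + " (directory does not exist!)");
            else
                Console.WriteLine(name + " = " + value + " (OK)");
        }
    }
}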

Next, ensure that the various paths implied in the script are valid on your machine. For example, the sparkclr-submit.cmd file uses the JAVA_HOME variable to locate and invoke the java executable. The command looks something like this:

"%JAVA_HOME%\bin\java" -cp "%LAUNCH_CLASSPATH%"

On my machine though, the path “%JAVA_HOME%\bin\java” did not exist; the actual path was “%JAVA_HOME%\java” (notice that “bin” is not present). This could have been because of something I did when I first installed Java.
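
To confirm which layout your machine has, a quick check like the following can save some guesswork. This is a hedged sketch: it assumes JAVA_HOME is already set and that the executable is java.exe.

using System;
using System.IO;

class JavaPathCheck
{
    static void Main()
    {
        string javaHome = Environment.GetEnvironmentVariable("JAVA_HOME") ?? "";

        // The layout the script expects, and the flat layout I found on my machine.
        string withBin    = Path.Combine(javaHome, "bin", "java.exe");
        string withoutBin = Path.Combine(javaHome, "java.exe");

        Console.WriteLine(withBin + " -> " + (File.Exists(withBin) ? "found" : "missing"));
        Console.WriteLine(withoutBin + " -> " + (File.Exists(withoutBin) ? "found" : "missing"));
    }
}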

Likewise, ensure that all the other paths implied in the sparkclr-submit.cmd script are valid. If you still keep getting the “cannot find the path specified” error, you may have to hunt through the various paths implied in the child scripts that sparkclr-submit.cmd calls.

Bulk Inserts into SQL Server

Loading data in bulk is a common task in the analytics world. Recently, I had to revisit some of the concepts around bulk loading, and the following are some of the notes I have gathered. It is well known that for any high-volume insert, things such as disabling indexes, minimal logging, etc. should be taken care of. Also, some of the differences between BCP, BULK INSERT, etc. are well publicized. But in this post, I want to highlight some of the lesser-publicized differences; specifically, how the process works at a high level with each option.

Disclaimer: Most of the information is gathered from two primary resources. One is this web page on sqlmag.com. The other is the book Guru’s Guide to SQL Server Architecture and Internals by Ken Henderson. Please note that these resources are based on SQL Server 2000, so some of the information and internal details might have changed since then. But my tests seem to indicate that the overall intent of this blog post is still valid.

The regular INSERT statement is perfect for writing small batches of data; however, it has some overhead, and this presents roadblocks when trying to use multiple INSERT statements to write large amounts of data. Essentially, for an insert, SQL Server sets up internal data structures that conform to the various data types involved. For the regular INSERT statement, these structures are set up for each statement. For bulk loading, though, these structures are set up only once and reused for all the data coming in (see here). Thus the various bulk insert methods are recommended for writing huge volumes of data.

There are a few different ways to bulk load data into SQL Server from external text/CSV files.

  1. BULK INSERT (Transact-SQL Statement)
  2. BCP (Bulk Copy Program)
  3. SSIS Destination component with Batch/Fast load option (with a Text File Source Connection upstream)
  4. SqlBulkCopy (from .Net Framework)

There are a couple more bulk loading options (such as SELECT INTO) which are not exclusively used with external text files.

Difference between BULK INSERT (T-SQL statement) and other Bulk Loading Methods

When you use the BULK INSERT T-SQL statement, SQL Server makes use of a COM object to write the data. This COM object parses the source file first (based on the metadata provided in the T-SQL statement), and a “rowset” is produced which is then written into the destination table’s pages. The BULK INSERT T-SQL statement typically ends up being the faster option when compared to the other methods.
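
For reference, here is a minimal sketch of the statement, issued from C# so that the examples in this post stay in one language. The connection string, file path, table name, and options are all illustrative assumptions.

using System.Data.SqlClient;

class BulkInsertExample
{
    static void Main()
    {
        // Hypothetical connection string and staging table.
        using (var conn = new SqlConnection("Server=.;Database=StagingDb;Integrated Security=true"))
        {
            conn.Open();

            // The file path here is resolved by the SQL Server process, not by this client,
            // which is why BULK INSERT requires the server to be able to see the file.
            var cmd = new SqlCommand(@"
                BULK INSERT dbo.SalesStaging
                FROM 'C:\data\sales.csv'
                WITH (FIELDTERMINATOR = ',', ROWTERMINATOR = '\n', BATCHSIZE = 50000)", conn);

            cmd.CommandTimeout = 0; // bulk loads can run long
            cmd.ExecuteNonQuery();
        }
    }
}

Note how nothing in this client code touches the file itself; the parsing happens entirely inside SQL Server, which is the point made above.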

With the other methods, the process has a few more steps to perform. When using the BULK INSERT T-SQL statement, the data from the source file is parsed and read by the SQL Server process itself; i.e., no external process or application needs to deliver the data to SQL Server. In the other methods, data is essentially delivered to SQL Server by external applications, and they typically deliver it using the TDS protocol. So, they first parse/read data from the source files and then generate TDS packets from that parsed data. The data, in the form of TDS packets, is then delivered to SQL Server. Once SQL Server accepts the TDS packets, it converts them back into the “rowset” format and writes the data into the table (see here). Therefore, the additional steps involved (generating TDS packets and then generating rowsets from TDS packets) will typically result in slower insert times than when using the BULK INSERT T-SQL statement directly.

When you use methods 2, 3, and 4 from above and run the profiler, you will see an “event” called “INSERT BULK” being reported. While “INSERT BULK” is not an actual T-SQL statement, it indicates that SQL Server received TDS packets (with packet header type 0x07) specifically meant for bulk loading.

Methods 2, 3, and 4 from above follow the same high-level workflow (source file is parsed -> TDS packets are generated -> TDS packets are delivered to SQL Server -> TDS packets are converted into rowsets by SQL Server -> data is written into tables), but they present different user interfaces and different knobs to turn and buttons to push. Most of the critical options/flags for bulk loading (such as batch size) are exposed in each method, while some options are only available in specific methods. Thus they end up being useful in different scenarios (BCP as a command-line utility, SqlBulkCopy when programming against the .Net Framework, SSIS in typical ETL workloads, etc.).

Difference between SqlBulkCopy and Others

For any bulk loading methodology, an essential task is to parse/read the data from the (large) source text file. While the other methods (1, 2, and 3 from above) come with built-in parsing mechanisms and inherent memory management, using SqlBulkCopy from managed code means that we are responsible for both of those steps, especially when reading “big” files (such as files greater than a GB or so).

When using SqlBulkCopy, we need to provide the code that parses the text files based on the appropriate delimiters. The efficiency of this parsing logic is one place where the overall efficiency of a SqlBulkCopy-based process can be reduced.

In addition, methods 1, 2, and 3 come with inherent memory management; that is, we do not need to provide code to handle memory allocation while reading large text files. SqlBulkCopy, however, accepts a DataTable (which will contain the data parsed from the source file). The size of the DataTable (and thus the number of rows that can fit into a DataTable instance, and the total number of DataTable objects required to contain all the data) is dependent on the memory management/garbage collection of managed code.

The most typical approach is to process the data in chunks (with a specific batch size that stays within the memory limitations), looping through all of the chunks, as in the sketch below.
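
Here is a hedged sketch of that chunked pattern. The file path, table name, column layout, and the naive comma-split parsing are all assumptions; real files with quoted fields or header rows need a proper CSV parser.

using System.Data;
using System.Data.SqlClient;
using System.IO;

class BulkCopyLoader
{
    const int ChunkSize = 100000; // rows per DataTable; tune to your memory budget

    static void Main()
    {
        // Hypothetical connection string, source file, and destination table.
        using (var conn = new SqlConnection("Server=.;Database=StagingDb;Integrated Security=true"))
        {
            conn.Open();

            using (var bulk = new SqlBulkCopy(conn))
            using (var reader = new StreamReader(@"C:\data\sales.csv"))
            {
                bulk.DestinationTableName = "dbo.SalesStaging";
                bulk.BatchSize = 10000; // rows per batch sent to the server

                DataTable table = NewChunk();
                string line;
                while ((line = reader.ReadLine()) != null)
                {
                    // Our parsing responsibility: a naive comma split (no quoted fields, no header row).
                    string[] fields = line.Split(',');
                    table.Rows.Add(fields[0], int.Parse(fields[1]));

                    if (table.Rows.Count == ChunkSize)
                    {
                        bulk.WriteToServer(table);
                        table.Clear(); // release the rows so the GC can reclaim memory
                    }
                }

                if (table.Rows.Count > 0)
                    bulk.WriteToServer(table); // flush the final partial chunk
            }
        }
    }

    // Column layout is hypothetical; match it to your destination table.
    static DataTable NewChunk()
    {
        var table = new DataTable();
        table.Columns.Add("ProductName", typeof(string));
        table.Columns.Add("Quantity", typeof(int));
        return table;
    }
}

Both responsibilities discussed above are visible here: the Split call is our parsing logic, and clearing the DataTable after each WriteToServer is what keeps it from growing beyond the memory budget.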

These extra factors can reduce the overall efficiency. In my test cases, most of the loss in efficiency came from the file-parsing logic rather than from the memory management.

Other Differences

Below is an Excel file containing other differences.

bulk load comparison

What would you choose to bulk load the data?

Here are some rough guidelines to choose the appropriate bulk loading method:

Choose BULK INSERT (T-SQL statement) if [the process running SQL Server can access the source text file] AND [you do not need any transformations of the data]. You can also choose to go with BULK INSERT plus pre-processing or post-processing of the data if you do need to transform it; but, as the size of the source file increases, the cost of that pre-processing or post-processing also increases, and SSIS might become the better option in some of these scenarios, as the transformations can happen in memory during transit.

Choose BCP or SSIS if [the process running SQL Server CANNOT access the source text file] AND [you do not need to transform any data before writing to the table]. The same caveat applies here: you can pair BCP with pre-processing or post-processing if you do need to transform data, but as the source file grows, SSIS might again be the better option since its transformations happen in memory during transit.

Choose SSIS if [the process running SQL Server CANNOT access the source text file] AND [you do need to transform the data before you write it into the table]. You could also choose SqlBulkCopy here, but SSIS would be preferable, as a lot of the functionality that you would otherwise have to code for with SqlBulkCopy comes out of the box with SSIS.

Choose SqlBulkCopy if you are using the .Net Framework (such as in PowerShell or a C# app), or if you do not have SSIS or BCP available, or if the source file can’t be accessed by the process running SQL Server.