# statacpp

**Repository Path**: owen560/statacpp

## Basic Information

- **Project Name**: statacpp
- **Description**: Stata commands for inline C++ code in do-files
- **Primary Language**: Unknown
- **License**: GPL-3.0
- **Default Branch**: master
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 0
- **Forks**: 2
- **Created**: 2023-01-16
- **Last Updated**: 2023-01-16

## Categories & Tags

**Categories**: Uncategorized

**Tags**: None

## README

statacpp
========

A Stata command to combine data with C++ code, compile and run it, and return the output into Stata. Run stuff faster, manipulate memory locations directly, parallelize on multiple CPU cores, tap into existing libraries, etc.

At present, there is no help file (I'll get there), and it is not on SSC for download (I might not), but you can copy the .ado file into your local Stata ado folder. GitHub uses https so you can't `net install` from here.

The general philosophy here is that you need to look out for your own code. statacpp can't check the C++ bits for you (though you will see compiler messages that might be helpful in debugging), so proceed with caution. It won't check that what you send from Stata is compatible with what you expect in C++ either. Remember that you can do bad things to your computer with a lower-level language like this. statacpp contributors and I make this in our spare time, for fun, and we accept no responsibility for anything, ever.

statacpp is built out of StataStan, which works fine in Windows. statacpp should work there too, but I don't test it. Windows is fundamentally different to Linux/Mac in how it interacts with Stata, which complicates things. This is a Windows issue, not a Stata one.

Syntax:
-------
`statacpp [varlist, options]`

Options:
--------
* `codefile(filename)`: either where to find your .cpp file, or where to store the code that is inside the do-file
* `cppargs(string)`: the arguments to pass to your compiled program when it is executed
* `thisfile(filename)`: as for StataStan
* `inline(filename)`: as for StataStan, scans the do-file for a comment block beginning
```
  /*
  C++
```
* `standard(string)`: one of 98, 03, 11, 14, gnu98, gnu11 or gnu14; specifies the C++ standard to pass to the compiler. Default is 11. In fact, as it stands, statacpp needs 11 (sorry; I might free this up, I might not). If you want to do that yourself, it's the use of std::vector that pushed me that way. You could swap to arrays, but it would be a pain. I'm also very fond of to_string().
* `outputfile(filename)`: the do-file that will be generated by your compiled program; this is then done by Stata to get the returned results back into memory
* `winlogfile`: as for StataStan (Windows only)
* `skipmissing`: as for StataStan, a rather scary option that treats each variable as a vector, discarding any empty cells, so that they can have different lengths. Proceed with caution.
* `matrices(string)`: a list of the names of Stata matrices that you want to pass into C++. You can also type `all`. These are treated as arrays of type double.
* `globals(string)`: a list of the names of Stata global macros that you want to pass into C++. You can also type `all`. These are assumed to have type double in C++ so if you try to pass a string it will break. I'll think about that later though, because it could be useful.
* `parallel(integer)`: specifies how many instances of the executable program to run. Your operating system is best placed to decide how to run these, but typically they will make good use of multicore CPUs. Specifying a number greater than the number of available cores will usually mean that some are queued, but the OS decides that. Default 1 (no parallelisation), and the number you give needs to be a positive integer. If `parallel` >1, then each instance is called with consecutive integers as the *first* argument, before anything you provide in `cppargs`, as if you had typed:
```
for i in {1..4}; do ./myprog $i argv2 argv3 argv4 & done
```
So, it is up to you to incorporate that first argument in your C++ code in order to loop over files, RNG seeds, or similar.
* `keepfiles`: normally, we delete the interim files that clutter up your computer, but if you want to look at them afterwards, include this option.

Other notes
-----------
* Support for parallel chains and makefile is work in progress for version 0.2.
* Like StataStan, if you use the `inline` option, you can only read in the first code block that matches the required opening lines. That is on the to-do list for StataStan and will feed across to statacpp when it's done.
* As of version 0.1 (these things will be extended later):
  * we only use g++
  * I have no intention of testing this in, or tweaking it for, Windows. Feel free to contribute on GitHub. In theory it will work because it cannibalises StataStan code... but practice is often rather different.
  * only numeric variables get written out
  * returned data is passed via a do-file, but we could choose other formats too for dumping
  * the user has to include somewhere in their int main() comments like this:
```
// send global <globallist>
// send matrix <matrixlist>
// send var <varlist>
```
   They do not have to be together but there should only be one (or none) of each. There should be no tabs or spaces before the `//`.
  * Any cases with missing data in a Stata variable which is sent to C++ will be removed (unless skipmissing is specified, in which case just that datum is removed, potentially making a ragged array of data, which is OK because each Stata variable is passed as its own vector). If you really want to work with missing data in some way, you will have to code it in the old-fashioned way as 999 or some such, and then process it as you see fit inside C++.
* non-existent globals and matrices, and non-numeric globals, get quietly ignored
* missing values are removed casewise by default
* users need to take care not to leave output file names as defaults if they have anything called output.csv etc. - these will be overwritten!
* `#include<vector>` is always added to the pre-processor directives, and we could add others
* variables (in the Stata sense) get written as vectors, globals as atomic variables (in the C++ sense), matrices get written as arrays. It is up to the user to convert vectors to arrays inside the C++ code if they have a use for that.
* Only numeric data is written at present; string data will follow, and then dates (maybe!).
* C++ types int and double get utilised. Again, it is up to the user to convert in their C++ code if they have reason to do so. globals and matrices are always written as double. If you want to get around this and have ints instead, save them as variables in the data with type int (in Stata) and then use the skipmissing option (but see above).
* g++ is the only compiler supported at present, and C++11 standard is required. That's how I roll, but I hope other fans of Stata and C++ will contribute on GitHub to add more compilers, and I will try to keep the standard as low as possible (as in 0x, 11, 14..., not as in quality of the work).
* we assume the codefile has a 4-character file extension like ".cpp" or ".cxx" or ".hpp" and chop the last 4 chars off to make the execfile name

Example 0: Fuel efficiency boosterizer
=============================

Some car manufacturers have been known to use bespoke software to make their products look better in tests than they really are. "statacpp_test.do" is a silly version of that. We take the venerable auto dataset, send the mpg variable to C++, where it is multiplied by a boosterization factor (2, in this case), and then fed back to Stata as a new variable called mpg2. To complete the demonstration, we also send a matrix `mymat` comprising the first 5 lines of the weight and length variables, and return the first row only as `mymat2`.

Example 1: NYC taxi data
====================

![taxi journey heatmap](https://github.com/robertgrant/statacpp/blob/master/taxis.png)

This is a 'big data' example, using [Chris Whong's taxi data](http://chriswhong.com/open-data/foil_nyc_taxi/), which contains details of every taxi journey in New York City in 2013. Uncompressed, there are 12 trip_data files (one for each month) of about 2.5GB each, and 12 trip_fare files of about 1.6GB each. These are generally too big to open in Stata, and although you could flip through them line-by-line using `file read` commands, it is always going to be faster to do that in as low-level a language as you can stomach. Probably the biggest single selling point, though, is being able to parcel up the 12 months into 12 parallel processes (executing the compiled program 12 times), making full use of multicore CPUs. In this test case ("statacpp-taxi-example.do"), I get all 170 million journeys and MapReduce their start locations (from GPS in the taxi meters) to counts on a latitude-longitude grid, then plot that in Stata. On an "early 2015" MacBook Pro, with quad-core 2.7GHz Intel Core i5 CPU, 8GB of RAM, Stata 14.1/SE, OS X El Capitan, g++ and statacpp 0.2, I can do all this in 641 seconds (about 11 minutes). Booya! If I do it by the comparator do-file "taxi-stata.do" method, it takes an hour to go through 300,000 lines of the January file (which has about 1,400,000 lines), so the whole process should take about 56 hours: 315 times longer!

Example 2: artificial neural networks for classification
========================================

This is an example of utilising pre-existing libraries to do something specialised (you can tell your boss you are leveraging best-in-breed analytics, which sounds much better than plugging your data into someone else's program). Details coming next.