Programming language detector and toolbox to ignore binary or vendored files. _enry_ started as a _Go_ port of the original Linguist _Ruby_ library and offers _2x better performance_.
- `GetLanguageByShebang` reads only the first line of text to identify the shebang.
- `GetLanguageByModeline` handles cases where a Vim/Emacs modeline, e.g. `/* vim: set ft=cpp: */`, may be present at the head or the tail of the text.
- `GetLanguageByClassifier` uses a Bayesian classifier trained on all the `./samples/` from Linguist.
  It is usually a last-resort strategy used to disambiguate the guesses of the previous strategies, and thus it requires a list of "candidate" guesses. If more intelligent hypotheses are not available, one can pass all known languages (the keys of `data.LanguagesLogProbabilities`) as candidates, at the price of possibly suboptimal accuracy.
```go
langs := enry.GetLanguagesByFilename("Gemfile", []byte("<content>"), []string{})
// result: []string{"Ruby"}
```
### Java bindings
Generated Java bindings using a C shared library and JNI are available under `java`.
A library is published on Maven as tech.sourced:enry-java for macOS and Linux platforms. Windows support is planned under src-d/enry#150.
### Python bindings
Generated Python bindings using a C shared library and cffi are WIP under src-d/enry#154.
A library is going to be published on PyPI as enry for macOS and Linux platforms. Windows support is planned under src-d/enry#150.
- Heuristics for ".txt" extension in Vim Help File could not be parsed, due to an unsupported negative lookahead in the RE2 regexp engine.
- Heuristics for ".sol" extension in Solidity could not be parsed, due to an unsupported negative lookahead in the RE2 regexp engine.
- Heuristics for ".es" extension in JavaScript could not be parsed, due to an unsupported backreference in the RE2 regexp engine.
- Heuristics for ".rno" extension in RUNOFF could not be parsed, due to an unsupported lookahead in the RE2 regexp engine.
- Heuristics for ".inc" extension in NASL could not be parsed, due to an unsupported possessive quantifier in the RE2 regexp engine.
- Heuristics for ".as" extension in ActionScript could not be parsed, due to an unsupported positive lookahead in the RE2 regexp engine.
- As of Linguist v5.3.2, Linguist uses a flex-based scanner in C for tokenization. Enry still uses the `extract_token` regex-based algorithm. See #193.
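
The RE2-related divergences above can be reproduced with Go's standard `regexp` package, which follows RE2 syntax. The sketch below is only an illustration: the patterns are made-up stand-ins for each unsupported construct, not the actual Linguist heuristics, and `compiles` is a hypothetical helper.

```go
package main

import (
	"fmt"
	"regexp"
)

// compiles reports whether Go's RE2-based regexp package accepts the pattern.
func compiles(pattern string) bool {
	_, err := regexp.Compile(pattern)
	return err == nil
}

func main() {
	// Illustrative patterns only; they are not taken from Linguist.
	patterns := []struct{ name, pattern string }{
		{"negative lookahead", `foo(?!bar)`},
		{"positive lookahead", `foo(?=bar)`},
		{"backreference", `(a)\1`},
		{"possessive quantifier", `a++`},
		{"plain alternation", `foo(bar|baz)`},
	}
	for _, p := range patterns {
		fmt.Printf("%-22s %-16q compiles: %v\n", p.name, p.pattern, compiles(p.pattern))
	}
}
```

The first four patterns are rejected at compile time, while the plain alternation compiles. Any heuristic that relies on one of these constructs therefore cannot be ported verbatim to RE2.
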
In the movie My Fair Lady, Professor Henry Higgins is a linguist who, at the very beginning of the movie, enjoys guessing where people come from based on their accents.
"Enry Iggins" is how Eliza Doolittle pronounces the name of the Professor.
## Development
To run the tests use:
```
go test ./...
```
Setting `ENRY_TEST_REPO` to the path of an existing checkout of Linguist avoids cloning it and speeds up the tests.
Setting `ENRY_DEBUG=1` provides insight into the Bayesian classifier built by `make code-generate`.
There is no automation for detecting changes in the Linguist project, so the process above has to be done manually from time to time.
When submitting a pull request syncing up to a new release, please make sure it only contains the changes in the generated files (in the `data` subdirectory).
Separating all the necessary "manual" code changes into a different PR that includes some background description and an update to the documentation on ["divergences from linguist"](#divergences-from-linguist) is very much appreciated, as it simplifies maintenance (review, release notes, etc.).
## Misc
<summary>Running a benchmark & faster regexp engine</summary>